Make SQL Queries More Readable
You can use a CTE (Common Table Expression) to make SQL queries easier to read, understand, and reason about. CTEs let you build up, in named steps, a query that would otherwise look very messy.
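As a minimal sketch of the idea (using a made-up orders table, not anything from the example below), a CTE names an intermediate result so the outer query reads top to bottom:

-- Hypothetical orders table, purely to illustrate the shape of a CTE
with big_spenders as (
    select user_id, sum(amount) as total_spent
    from orders
    where order_date >= '2021-01-01'
    group by user_id
)
select user_id, total_spent
from big_spenders
where total_spent > 100;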
Here's an example of something I've been tinkering with recently. I've been learning more about the ActivitySchema modeling technique for data warehouses. This technique requires that you set up the data in a time-series format (Entity, Activity, Timestamp).
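Conceptually, the target is a single wide activity stream table. Here's a rough sketch of its shape in BigQuery DDL (my own approximation with a placeholder dataset name; see the ActivitySchema spec for the exact column list):

-- Approximate shape of an ActivitySchema stream table (a sketch, not the official DDL)
create table my_dataset.activity_stream (
    activity_id string,     -- unique id for the activity record
    ts timestamp,           -- when the activity happened
    customer string,        -- the entity performing the activity
    activity string,        -- what happened, e.g. 'posted', 'edited', 'closed'
    feature_1 string,       -- optional activity-specific metadata
    feature_2 string,
    feature_3 string,
    revenue_impact float64,
    link string
);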
To do this I've been using the Stack Overflow dataset available in BigQuery's public data collection. The schema is pretty straightforward: there's a table called post_history, which has a primary key called id and a foreign key called post_id that joins to the id column on each of the posts tables (posts_questions, posts_answers, etc.).
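For example, here's how post_history joins back to one of the posts tables (a quick exploratory query of my own, not part of the final model):

-- A quick look at how post_history relates to a posts table
select
    q.title,
    ph.post_history_type_id,
    ph.creation_date
from `bigquery-public-data.stackoverflow.posts_questions` q
join `bigquery-public-data.stackoverflow.post_history` ph
    on ph.post_id = q.id
limit 10;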
The post_history table has a post_history_type_id field, an integer corresponding to various post history types (you can see them all here). Now, in order to set up the ActivitySchema table, I need to join each post to the history table to see what happened over time.
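Before building anything, it's worth getting a feel for how the history types are distributed (again, an exploratory query of my own):

-- How often does each history type occur?
select
    post_history_type_id,
    count(*) as events
from `bigquery-public-data.stackoverflow.post_history`
group by post_history_type_id
order by post_history_type_id;

With that in mind, here's the full query: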
with post_history as (
    -- Creation, edit, and rollback events (types 1-9): rows from the same
    -- user action share a revision_guid, so grouping on it collapses
    -- e.g. initial title/body/tags into a single 'posted' activity
    select
        ph.revision_guid,
        ph.user_id,
        u.display_name as user_name,
        case
            when max(ph.post_history_type_id) in (1, 2, 3) then 'posted'
            when max(ph.post_history_type_id) in (4, 5, 6) then 'edited'
            when max(ph.post_history_type_id) in (7, 8, 9) then 'rolledback'
        end as activity,
        cast(null as string) as close_reason,
        max(ph.id) as activity_id,
        max(ph.post_id) as post_id,
        max(ph.creation_date) as creation_date
    from
        `bigquery-public-data.stackoverflow.post_history` ph
        join `bigquery-public-data.stackoverflow.users` u
            on u.id = ph.user_id
    where
        ph.post_history_type_id between 1 and 9
    group by 1, 2, 3
    union all
    -- Moderation and site events (types 10 and up): each type maps
    -- one-to-one to an activity
    select
        ph.revision_guid,
        ph.user_id,
        u.display_name as user_name,
        case
            when max(ph.post_history_type_id) = 10 then 'closed'
            when max(ph.post_history_type_id) = 11 then 'reopened'
            when max(ph.post_history_type_id) = 12 then 'deleted'
            when max(ph.post_history_type_id) = 13 then 'undeleted'
            when max(ph.post_history_type_id) = 14 then 'locked'
            when max(ph.post_history_type_id) = 15 then 'unlocked'
            when max(ph.post_history_type_id) = 16 then 'made_community_wiki'
            when max(ph.post_history_type_id) = 17 then 'migrated'
            when max(ph.post_history_type_id) = 18 then 'merged'
            when max(ph.post_history_type_id) = 19 then 'protected'
            when max(ph.post_history_type_id) = 20 then 'unprotected'
            when max(ph.post_history_type_id) = 22 then 'unmerged'
            when max(ph.post_history_type_id) = 24 then 'applied_edit_to'
            when max(ph.post_history_type_id) = 25 then 'tweeted'
            when max(ph.post_history_type_id) = 31 then 'moved_to_chat'
            when max(ph.post_history_type_id) = 33 then 'added_notice'
            when max(ph.post_history_type_id) = 34 then 'removed_notice'
            when max(ph.post_history_type_id) = 35 then 'migrated_away'
            when max(ph.post_history_type_id) = 36 then 'migrated_here'
            when max(ph.post_history_type_id) = 37 then 'merge_source'
            when max(ph.post_history_type_id) = 38 then 'merge_destination'
            when max(ph.post_history_type_id) = 50 then 'community_bump'
            when max(ph.post_history_type_id) = 52 then 'selected_hot'
            when max(ph.post_history_type_id) = 53 then 'removed_hot'
        end as activity,
        -- For close events (type 10) the comment column holds the close reason code
        case
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '101' then 'duplicate'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '102' then 'off-topic'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '103' then 'unclear'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '104' then 'too_broad'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '105' then 'opinion_based'
        end as close_reason,
        max(ph.id) as activity_id,
        max(ph.post_id) as post_id,
        max(ph.creation_date) as creation_date
    from
        `bigquery-public-data.stackoverflow.post_history` ph
        join `bigquery-public-data.stackoverflow.users` u
            on u.id = ph.user_id
    where
        ph.post_history_type_id > 9
    group by 1, 2, 3
)
-- Build the activity stream: one UNION ALL branch per post type,
-- each joining the same post_history CTE
select
    ph.activity_id as activity_id,
    ph.creation_date as ts,
    ph.user_name as customer,
    ph.activity as activity,
    'internal_db' as source,
    ph.user_id as source_id,
    'question' as feature_1, -- post_type
    ph.close_reason as feature_2, -- close_reason
    cast(null as string) as feature_3,
    0.0 as revenue_impact,
    cast(null as string) as link,
    cast(null as int64) as activity_occurrence,
    cast(null as timestamp) as activity_repeated_at,
    cast(null as string) as _activity_source
from
    `bigquery-public-data.stackoverflow.posts_questions` q
    join post_history ph on q.id = ph.post_id
union all
select
    ph.activity_id as activity_id,
    ph.creation_date as ts,
    ph.user_name as customer,
    ph.activity as activity,
    'internal_db' as source,
    ph.user_id as source_id,
    'answer' as feature_1, -- post_type
    ph.close_reason as feature_2, -- close_reason
    cast(null as string) as feature_3,
    0.0 as revenue_impact,
    cast(null as string) as link,
    cast(null as int64) as activity_occurrence,
    cast(null as timestamp) as activity_repeated_at,
    cast(null as string) as _activity_source
from
    `bigquery-public-data.stackoverflow.posts_answers` q
    join post_history ph on q.id = ph.post_id;
Since the posts are separated into their own tables by type (this is how BigQuery chose to ingest the data, not the original schema), I basically have to join the post history twice. I also needed to modify the history table so I could group some of the activities into a single line. In this case, post_history_type_ids 1, 2 and 3 (initial title, initial body, initial tags) are all grouped under 'posted'.
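You can sanity-check that these creation events really do share a revision_guid, which is what makes the grouping work (an exploratory query of my own):

-- Creation events (types 1-3) from the same action share a revision_guid,
-- so grouping on it collapses them into a single 'posted' activity
select
    revision_guid,
    count(*) as events_in_revision
from `bigquery-public-data.stackoverflow.post_history`
where post_history_type_id between 1 and 3
group by revision_guid
having count(*) > 1
limit 10;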
You can see how I've joined with post_history in two places. If this were a subquery, I'd have to repeat the entire subquery in both joins. Following the DRY principle, the query instead defines the post history once, in the with post_history as (...) block at the top.
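For contrast, here's roughly what the non-CTE version looks like. I've shrunk the inner query down to a few columns to keep the sketch short; imagine the full ~70-line history query pasted into each of the two subqueries instead:

-- Without a CTE, the same subquery body has to be repeated for every join
select q.id, ph.post_history_type_id, ph.creation_date
from `bigquery-public-data.stackoverflow.posts_questions` q
join (
    select post_id, post_history_type_id, creation_date
    from `bigquery-public-data.stackoverflow.post_history`
) ph on q.id = ph.post_id
union all
select q.id, ph.post_history_type_id, ph.creation_date
from `bigquery-public-data.stackoverflow.posts_answers` q
join (
    select post_id, post_history_type_id, creation_date
    from `bigquery-public-data.stackoverflow.post_history`
) ph on q.id = ph.post_id;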
You can see how the query now looks a lot cleaner. If I were to add more post types (I only chose questions and answers), I can keep joining with the CTE. A note on performance: how engines treat CTEs varies. Some materialize a CTE once and reuse the result, while others (including BigQuery, for non-recursive CTEs) inline it and may re-evaluate it at each reference, so the main win here is readability rather than speed.
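For example, adding a third post type is just one more UNION ALL branch against the same CTE. Here's a sketch using the posts_tag_wiki table (one of the other post tables in the dataset); it's a fragment meant to be appended to the final query above, before the closing semicolon:

-- One more branch against the same CTE; append before the final semicolon
union all
select
    ph.activity_id as activity_id,
    ph.creation_date as ts,
    ph.user_name as customer,
    ph.activity as activity,
    'internal_db' as source,
    ph.user_id as source_id,
    'tag_wiki' as feature_1, -- post_type
    ph.close_reason as feature_2, -- close_reason
    cast(null as string) as feature_3,
    0.0 as revenue_impact,
    cast(null as string) as link,
    cast(null as int64) as activity_occurrence,
    cast(null as timestamp) as activity_repeated_at,
    cast(null as string) as _activity_source
from
    `bigquery-public-data.stackoverflow.posts_tag_wiki` t
    join post_history ph on t.id = ph.post_id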
There are a lot more advanced SQL techniques used in the above query that I will explain at a later date. You can copy and paste the query into the BigQuery console and it should run.