Make SQL Queries More Readable
You can use a CTE (Common Table Expression) to make SQL queries easier to read, understand, and reason about. CTEs let you build up, in named steps, a query that would otherwise look very messy.
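As a minimal sketch of the idea (using a made-up orders table, not anything from the example below), a CTE names an intermediate result so the outer query reads top to bottom:

-- Hypothetical orders table, purely to illustrate the shape of a CTE
with big_spenders as (
    select user_id, sum(amount) as total_spent
    from orders
    where order_date >= '2021-01-01'
    group by user_id
)
select user_id, total_spent
from big_spenders
where total_spent > 100;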
Here's an example of something I've been tinkering with recently. I've been learning more about the ActivitySchema modeling technique for data warehouses. This technique requires that you set up the data in a time-series format (Entity, Activity, Timestamp).
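Conceptually, the target is a single wide activity stream table. Here's a rough sketch of its shape in BigQuery DDL (my own approximation with a placeholder dataset name; see the ActivitySchema spec for the exact column list):

-- Approximate shape of an ActivitySchema stream table (a sketch, not the official DDL)
create table my_dataset.activity_stream (
    activity_id string,     -- unique id for the activity record
    ts timestamp,           -- when the activity happened
    customer string,        -- the entity performing the activity
    activity string,        -- what happened, e.g. 'posted', 'edited', 'closed'
    feature_1 string,       -- optional activity-specific metadata
    feature_2 string,
    feature_3 string,
    revenue_impact float64,
    link string
);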
To do this I've been using the Stack Overflow dataset available in BigQuery's public data collection. The schema is pretty straightforward: there's a table called post_history, which has a primary key called id and a foreign key called post_id that joins to the id column on each of the posts tables (posts_questions, posts_answers, etc.).
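For example, here's how post_history joins back to one of the posts tables (a quick exploratory query of my own, not part of the final model):

-- A quick look at how post_history relates to a posts table
select
    q.title,
    ph.post_history_type_id,
    ph.creation_date
from `bigquery-public-data.stackoverflow.posts_questions` q
join `bigquery-public-data.stackoverflow.post_history` ph
    on ph.post_id = q.id
limit 10;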
The post_history table has a post_history_type_id field, an integer corresponding to various post history types (you can see them all here). Now, in order to set up the ActivitySchema table, I need to join each post to the history table to see what happened over time.
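Before building anything, it's worth getting a feel for how the history types are distributed (again, an exploratory query of my own):

-- How often does each history type occur?
select
    post_history_type_id,
    count(*) as events
from `bigquery-public-data.stackoverflow.post_history`
group by post_history_type_id
order by post_history_type_id;

With that in mind, here's the full query: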
with post_history as (
    -- Creation, edit, and rollback events (types 1-9): rows from the same
    -- user action share a revision_guid, so grouping on it collapses
    -- e.g. initial title/body/tags into a single 'posted' activity
    select
        ph.revision_guid,
        ph.user_id,
        u.display_name as user_name,
        case
            when max(ph.post_history_type_id) in (1, 2, 3) then 'posted'
            when max(ph.post_history_type_id) in (4, 5, 6) then 'edited'
            when max(ph.post_history_type_id) in (7, 8, 9) then 'rolledback'
        end as activity,
        cast(null as string) as close_reason,
        max(ph.id) as activity_id,
        max(ph.post_id) as post_id,
        max(ph.creation_date) as creation_date
    from
        `bigquery-public-data.stackoverflow.post_history` ph
        join `bigquery-public-data.stackoverflow.users` u
            on u.id = ph.user_id
    where
        ph.post_history_type_id between 1 and 9
    group by 1, 2, 3
    union all
    -- Moderation and site events (types 10 and up): each type maps
    -- one-to-one to an activity
    select
        ph.revision_guid,
        ph.user_id,
        u.display_name as user_name,
        case
            when max(ph.post_history_type_id) = 10 then 'closed'
            when max(ph.post_history_type_id) = 11 then 'reopened'
            when max(ph.post_history_type_id) = 12 then 'deleted'
            when max(ph.post_history_type_id) = 13 then 'undeleted'
            when max(ph.post_history_type_id) = 14 then 'locked'
            when max(ph.post_history_type_id) = 15 then 'unlocked'
            when max(ph.post_history_type_id) = 16 then 'made_community_wiki'
            when max(ph.post_history_type_id) = 17 then 'migrated'
            when max(ph.post_history_type_id) = 18 then 'merged'
            when max(ph.post_history_type_id) = 19 then 'protected'
            when max(ph.post_history_type_id) = 20 then 'unprotected'
            when max(ph.post_history_type_id) = 22 then 'unmerged'
            when max(ph.post_history_type_id) = 24 then 'applied_edit_to'
            when max(ph.post_history_type_id) = 25 then 'tweeted'
            when max(ph.post_history_type_id) = 31 then 'moved_to_chat'
            when max(ph.post_history_type_id) = 33 then 'added_notice'
            when max(ph.post_history_type_id) = 34 then 'removed_notice'
            when max(ph.post_history_type_id) = 35 then 'migrated_away'
            when max(ph.post_history_type_id) = 36 then 'migrated_here'
            when max(ph.post_history_type_id) = 37 then 'merge_source'
            when max(ph.post_history_type_id) = 38 then 'merge_destination'
            when max(ph.post_history_type_id) = 50 then 'community_bump'
            when max(ph.post_history_type_id) = 52 then 'selected_hot'
            when max(ph.post_history_type_id) = 53 then 'removed_hot'
        end as activity,
        -- For close events (type 10) the comment column holds the close reason code
        case
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '101' then 'duplicate'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '102' then 'off-topic'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '103' then 'unclear'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '104' then 'too_broad'
            when max(ph.post_history_type_id) = 10 and max(ph.comment) = '105' then 'opinion_based'
        end as close_reason,
        max(ph.id) as activity_id,
        max(ph.post_id) as post_id,
        max(ph.creation_date) as creation_date
    from
        `bigquery-public-data.stackoverflow.post_history` ph
        join `bigquery-public-data.stackoverflow.users` u
            on u.id = ph.user_id
    where
        ph.post_history_type_id > 9
    group by 1, 2, 3
)
-- Build the activity stream: one UNION ALL branch per post type,
-- each joining the same post_history CTE
select
    ph.activity_id as activity_id,
    ph.creation_date as ts,
    ph.user_name as customer,
    ph.activity as activity,
    'internal_db' as source,
    ph.user_id as source_id,
    'question' as feature_1, -- post_type
    ph.close_reason as feature_2, -- close_reason
    cast(null as string) as feature_3,
    0.0 as revenue_impact,
    cast(null as string) as link,
    cast(null as int64) as activity_occurrence,
    cast(null as timestamp) as activity_repeated_at,
    cast(null as string) as _activity_source
from
    `bigquery-public-data.stackoverflow.posts_questions` q
    join post_history ph on q.id = ph.post_id
union all
select
    ph.activity_id as activity_id,
    ph.creation_date as ts,
    ph.user_name as customer,
    ph.activity as activity,
    'internal_db' as source,
    ph.user_id as source_id,
    'answer' as feature_1, -- post_type
    ph.close_reason as feature_2, -- close_reason
    cast(null as string) as feature_3,
    0.0 as revenue_impact,
    cast(null as string) as link,
    cast(null as int64) as activity_occurrence,
    cast(null as timestamp) as activity_repeated_at,
    cast(null as string) as _activity_source
from
    `bigquery-public-data.stackoverflow.posts_answers` q
    join post_history ph on q.id = ph.post_id;
Since the posts are separated into their own tables by type (this is how BigQuery chose to ingest the data, not the original schema), I basically have to join the post history twice. I also needed to modify the history table so I could group some of the activities into a single line. In this case, post_history_type_ids 1, 2 and 3 (initial title, initial body, initial tags) are all grouped under 'posted'.
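You can sanity-check that these creation events really do share a revision_guid, which is what makes the grouping work (an exploratory query of my own):

-- Creation events (types 1-3) from the same action share a revision_guid,
-- so grouping on it collapses them into a single 'posted' activity
select
    revision_guid,
    count(*) as events_in_revision
from `bigquery-public-data.stackoverflow.post_history`
where post_history_type_id between 1 and 3
group by revision_guid
having count(*) > 1
limit 10;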
You can see how I've joined with post_history in two places. If this were a subquery, I'd have to repeat the entire subquery in both joins. Following the DRY principle, the query instead defines the post history once, in the with post_history as (...) block at the top.
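For contrast, here's roughly what the non-CTE version looks like. I've shrunk the inner query down to a few columns to keep the sketch short; imagine the full ~70-line history query pasted into each of the two subqueries instead:

-- Without a CTE, the same subquery body has to be repeated for every join
select q.id, ph.post_history_type_id, ph.creation_date
from `bigquery-public-data.stackoverflow.posts_questions` q
join (
    select post_id, post_history_type_id, creation_date
    from `bigquery-public-data.stackoverflow.post_history`
) ph on q.id = ph.post_id
union all
select q.id, ph.post_history_type_id, ph.creation_date
from `bigquery-public-data.stackoverflow.posts_answers` q
join (
    select post_id, post_history_type_id, creation_date
    from `bigquery-public-data.stackoverflow.post_history`
) ph on q.id = ph.post_id;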
You can see how the query now looks a lot cleaner. If I were to add more post types (I only chose questions and answers), I can keep joining with the CTE. A note on performance: how engines treat CTEs varies. Some materialize a CTE once and reuse the result, while others (including BigQuery, for non-recursive CTEs) inline it and may re-evaluate it at each reference, so the main win here is readability rather than speed.
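For example, adding a third post type is just one more UNION ALL branch against the same CTE. Here's a sketch using the posts_tag_wiki table (one of the other post tables in the dataset); it's a fragment meant to be appended to the final query above, before the closing semicolon:

-- One more branch against the same CTE; append before the final semicolon
union all
select
    ph.activity_id as activity_id,
    ph.creation_date as ts,
    ph.user_name as customer,
    ph.activity as activity,
    'internal_db' as source,
    ph.user_id as source_id,
    'tag_wiki' as feature_1, -- post_type
    ph.close_reason as feature_2, -- close_reason
    cast(null as string) as feature_3,
    0.0 as revenue_impact,
    cast(null as string) as link,
    cast(null as int64) as activity_occurrence,
    cast(null as timestamp) as activity_repeated_at,
    cast(null as string) as _activity_source
from
    `bigquery-public-data.stackoverflow.posts_tag_wiki` t
    join post_history ph on t.id = ph.post_id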
There are a lot more advanced SQL techniques used in the above query that I will explain at a later date. You can copy and paste the query into the BigQuery console and it should run.