Introduction to SQL (Advanced)
In our previous article, Introduction to SQL, we covered the basic concepts of SQL that you need to understand in order to begin gathering data for analysis and reporting. In this article, we will take a look at some of the more advanced SQL concepts that you will need to know as you advance in your career as an analyst. We will take a look at the UNION operator, the CASE expression, and the LIKE operator. We will then explore some of the most important techniques you need to understand when writing queries, subqueries. We will also take a look at advanced grouping and how it can help you write reports. We will finish this article by taking a look at the more advanced types of joins that can be used, such as the CROSS JOIN, the SELF JOIN, and INNER JOINs.
Advanced SQL Concepts
The Union Operator
As you may have guessed from the name, the UNION operator is used to combine the resultset of two or more SELECT statements into one output. It will also remove the duplicate rows between the various SELECT statements. There are two different variations on the UNION operator: UNION and UNION ALL. The difference is that the UNION operator returns a distinct output of the resultset. The SELECT statements that are combined using the UNION or UNION ALL operators can include any/all of the options that are regularly available on any SELECT statement. The only restriction is that the resultset in each SELECT statement must have the same structure.
The Case Expression
The CASE expression is referred to as a scalar function in SQL. There are several other useful functions within this category and all of them follow a similar concept: they are applied on a single value in your resultset (or on the data that you are querying) in order to manipulate that value. Here is how a basic CASE expression is structured:
You can use the CASE expression in the SELECT statement in various places, including the list of columns, in the join, or in the WHERE clause. In the example query below, the values of AccountType will be changed from Balances to Balance and leave all of the other values the same:
The Like Operator
The LIKE operator is one of the most versatile functions in SQL. It allows you to perform sophisticated string pattern searching and matching in your query. The LIKE operator follows this syntax:
This query returns all of the records from the Person table that have a first name that matches the name Jerry. You are probably wondering why I didn’t simply use the “=” operator right? Let’s say that I know that Jerry is short for several different names (Jeremiah, Jeremy, Jerome, etc.) and I wanted to return all of the records with these names as well. This is where the LIKE operator flexes its muscle and, when used with wildcard characters (see table below), can return the records that I want to see. This query would look like this example:
Subqueries are one of the most important skills that you must learn as an analyst. Subqueries are exactly what their name implies – a query within a query. Subqueries can be nested anywhere that an SQL expression can be used. While you may be able to solve your problem without using a subquery most of the time, it is inevitable that you will need to write them at some point in order to accomplish your goal. Since subqueries can be nested in multiple places, let’s look at how they are used in different places within our queries:
In a From Clause
Subqueries can be used in the FROM clause of one of your queries. In the example below, the subquery will return all of the records that exist in both the FactFinance and DimScenario tables. The final resultset is left joined to DimAccount and we end up with all of the rows from DimAccount and only those rows that match from the subquery.
A correlated subquery means that the subquery runs once for every row in the main query. Sounds like it might be slow doesn’t it? It is, and it can kill the performance of your SQL Server if you aren’t careful. You can put correlated subqueries in several places inside of a SELECT statement. In the example below, this subquery will run once for every row in the FactFinance table and is restricted by the relationship on ScenarioKey with the outer query:
When you begin writing SQL, the basic joins that we have already covered will work just fine for you. Once you get into writing more advanced SQL and your job responsibilities start to entail more sophisticated analytics requirements, you will find yourself needing to go beyond the basics in order to accomplish your objectives. Depending upon your situation, you may need to use one or more of the advanced types of joins that you have at your disposal – the cross join, the self join, and inner joins. In this section we will take a look at these three types of joins and how you can use them to accomplish your goals.
The Cross Join
A CROSS JOIN allows you to perform a Cartesian product between two tables returning all combinations of rows. This type of join results in every row of one table being joined to every row of another table. The CROSS JOIN doesn’t establish a relationship between tables like an INNER JOIN or OUTER JOIN would by specifying which columns to join on. You may want to use this type of join when you are trying to create a list of data combinations to use in a lookup table or on a specialized report. The query example below results in the unique list of all combinations of AccountTypes and ScenarioNames:
The Self Join
In an SQL statement, you may join the same table to itself (yep) using different aliases to facilitate a SELF JOIN. This is mostly used when you have a parent-child relationship that exists within a table. A common example is a hierarchy of some kind (say an organizational hierarchy). The SELF JOIN in this example unravels one level in the parent-child hierarchy in the DimAccount table. In order to unravel all levels, you must keep self joining once for every level, restricting the first level to the top-most parent in the hierarchy.
An INNER JOIN isn’t just limited to an equality relationship. There are several situations in analytics that require you to go beyond that basic relationship and into more advanced ones. This requires a strong ability to visualize data. A common example occurs in what is referred to as a Type II dimension. This type of dimension shows a point-in-time snapshot of the data. Take the example of an organization hierarchy. In certain views you may want to see the organization metrics based upon how the organization hierarchy looked at the point that the metric was taken. In that situation, you have to use an INNER JOIN in your query.
SQL includes some very powerful grouping functionality and it will prove vital to your work as an analyst. These functions are typically used in analytics and allow you to perform such things as ranking or partitioning your resultset. The following is the basic construct of these functions:
- Over Clause: The OVER clause is used to determine how the dataset should be partitioned and ordered. Inside the OVER clause, use the PARTITION BY or ORDER BY clauses to define the columns that are involved and the output:
- Window Functions: The OVER clause is used with Window Functions such as RANK or ROW_NUMBER, or with an Aggregation function:
- The output from this query returns the sum of OrderQuantity by SalesTerritoryKey and repeats those numbers for all ProductKeys in that SalesTerritoryKey