Can I calculate a median using GROUP BY in BigQuery?

This is a very short and sweet post on how to calculate the median for different subgroups of data that you want to aggregate like when using GROUP BY

Hello! In case you’re just looking for the solution and not a long-winded explanation, here is an example for how to do it:

The Solution

SELECT
  DISTINCT sex,
  median_height_per_sex
FROM (
  SELECT
    name,
    sex,
    height,
    PERCENTILE_CONT(height, 0.5) OVER(PARTITION BY sex) 
    	AS median_height_per_sex
  FROM
    height_data)

Background on the problem

When we want to calculate an aggregate statistic for different subgroups of data, we normally use GROUP BY and an already implemented function to compute the desired function over the field we select. For example, if we wanted to calculate the average height per sex from a table named height_data, we simply do:

SELECT
  sex,
  AVG(height) AS average_height
FROM
  height_data
GROUP BY
  sex

This query will return one row per sex and its corresponding average height. In other words, AVG computes one value for each group defined by the GROUP BY function. Functions that like AVG return one value per group of rows (such as MAX, MIN, SUM, etc) are called aggregate functions.

After a Google search you can find out that to calculate the median in BigQuery you have to use PERCENTILE_CONT(x, 0.5) where x is the field we want to calculate the median over and 0.5 indicates the 50th percentile. If you don’t read any further, what you would try to do is compute the median per sex using SELECT sex, PERCENTILE_CONT(height, 0.5) as median_height FROM height_data GROUP BY sex. However, this doesn’t work!

What differentiates PERCENTILE_CONT from AVG?

Well, PERCENTILE_CONT isn’t an aggregate function but an analytical function. An analytical function (sometimes also called more explicitly non-aggregate function) calculates one value per row rather than per group of rows. Practically, this means PERCENTILE_CONT can’t be applied at the same time as GROUP BY like AVG.

The syntax to apply PERCENTILE_CONT is also a bit different:

SELECT
  name,
  sex,
  height,
  PERCENTILE_CONT(height, 0.5) OVER() AS median_height
FROM
  height_data

The OVER() clause defines the subgroup over which the median will be calculated. If left empty as above, the median will be calculated for all the rows and assigned to all of them. If our height_data table looks something like this:

The result of applying the query above will be:

Remember the median is the value that is the middle position after ordering them in magnitude. In this case, the ordered heights would be [150, 160, 170, 175, 180, 180, 200] and the height in the middle (the median_height) is 175. This is assigned to every row in the table. However, this is not exactly what we want. Two things are missing:

1) We want to calculate one median_height per sex.

As mentioned earlier, the OVER() clause is used to define the different subgroups over which the median will be calculated. When we want to specify the subgroups, we use PARTITION BY inside of the parenthesis followed by the fields we want to use to separate the subgroups. In our case, this means:

SELECT
  name,
  sex,
  height,
  PERCENTILE_CONT(height, 0.5) OVER(PARTITION BY sex) 
  	AS median_height_per_sex
FROM
  height_data

The results of the query will be:

When we have an even number of rows, the median is calculated by linear interpolation between the two middle values. For the case of male, we have [170, 175, 180, 200]; the two middle values are 175 and 180, the middle point between those values is 177.5. Therefore, the median for male rows is 177.5. The median for the female rows ([150, 160, 180]) is 160.

Almost there! We only need to merge all of the male and female rows into a single one.

2) We want to have as a result only one row per sex, not one row per person.

To do so, we just need to see that if we remove the name and height fields, all male rows are the same. The same happens for female rows. Therefore, we can simply select the distinct rows based on sex and median_height_per_sex as follows:

SELECT
  DISTINCT sex,
  median_height_per_sex
FROM (
  SELECT
    name,
    sex,
    height,
    PERCENTILE_CONT(height, 0.5) OVER(PARTITION BY sex) 
    	AS median_height_per_sex
  FROM
    height_data)

This is equivalent to doing:

SELECT
  sex,
  median_height_per_sex
FROM (
  SELECT
    name,
    sex,
    height,
    PERCENTILE_CONT(height, 0.5) OVER(PARTITION BY sex) 
    	AS median_height_per_sex
  FROM
    height_data) GROUP BY sex, median_height_per_sex

The result of these queries is as follows:

which is what we wanted!

Follow me on Twitter @iamhectorotero for more interesting posts and content about data-related topics!

Written on May 17, 2020