How to get the values for multiple years on the same row - Conditional Aggregation MySQL
This is a tricky one to write about as I've really struggled to put down in words what my actual problem was and then turn it into something someone else would be searching for.
Maybe the problem is best described with an explanation of my problem. I want to show the values of something for a range of years in MySQL or Postgres.
I've been working with the US Census American Community Service dataset and I wanted to create a simple table where for the years 2017 to 2021, I wanted to show the race and the overall population. My database structure looks like this:
Race | Population | Year |
---|---|---|
White | 68.2 | 2021 |
Black/Africa American | 12.6 | 2021 |
Native | 0.8 | 2021 |
And I wanted my table to look like this:
Race | Year 2017 | Year 2018 | Year 2019 | Year 2020 | Year 2021 |
---|---|---|---|---|---|
White | 73.2 | 72.7 | 72.5 | 70.4 | 68.2 |
Black / African American | 12.7 | 12.7 | 12.7 | 12.6 | 12.6 |
I didn't know it yet, but what I needed was something called "conditional aggregation".
How to get the values for multiple years on the same row
The query to fetch this data isn't that bad at all, however finding out how to do it was. So hopefully this helps someone out in the future. I had to enlist the help of OpenAI to get me on the right track. It’s pretty handy for situations like this, when you don’t know what to type into your search engine.
SELECT
race,
MAX(CASE WHEN year = 2017 THEN percentage END) AS year_2017,
MAX(CASE WHEN year = 2018 THEN percentage END) AS year_2018,
MAX(CASE WHEN year = 2019 THEN percentage END) AS year_2019,
MAX(CASE WHEN year = 2020 THEN percentage END) AS year_2020,
MAX(CASE WHEN year = 2021 THEN percentage END) AS year_2021
FROM
"race_ethnicities"
GROUP BY
race;
To get the values for multiple years in one resultset, we can use the CASE
statement in MySQL and Postgres. Because we just want the value of the percentage row, we can use the aggregate function MAX()
and this will bring back that value and assign it to each year. Grouping by the race, makes sure we only bring back 1 race per year group.
I could have just done 5 separate queries to bring back the data for each year, however I knew there was a better and more performant way to achieve it.