Metrics Maven: Window Functions in PostgreSQL

Welcome to the Metrics Maven series: Database features, tips, tricks, and code

In our Metrics Maven series, IBM Cloud Database’s data scientist shares database features, tips, tricks, and code that you can use to get the metrics you need from your data. In this first article, we’ll look at how to use window functions in PostgreSQL.

PostgreSQL window functions

If you use PostgreSQL, you’re probably already familiar with many of the common aggregate functions, such as COUNT(), SUM(), MIN(), MAX(), and AVG(). However, you may not be familiar with window functions since they’re touted as an advanced feature. The good news is that window functions aren’t nearly as esoteric as they may seem.

As the name implies, window functions provide a “window” into your data, letting you perform aggregations against a set of data rows according to specified criteria that match the current row. While they are similar to standard aggregations, there are also additional functions that can only be used through window functions (such as the RANK() function we’ll demonstrate below). In some situations, window functions can minimize the complexity of your query or even speed up the performance.

It is important to note that window functions always use the OVER() clause, so if you see OVER(), you’re looking at a window function. Once you get used to how the OVER() clause is formatted, where it fits in your queries, and the kind of results you can get, you’ll soon start to see lots of ways to apply it. Let’s dive in!

OVER( )

Depending on the purpose and complexity of the window function you want to run, you can use OVER() all by itself or with a handful of conditional clauses. Let’s start by looking at using OVER() all by itself.

If the aggregation you want to run is to be performed across all the rows returned by the query and you don’t need to specify any other conditions, you can use the OVER() clause by itself. Here’s an example of a simple window function querying a table in our IBM Cloud Databases for PostgreSQL database containing the United States Census data on estimated population:

SELECT name AS state_name, 
     popestimate2015 AS state_population, 
     SUM(popestimate2015) 
          OVER() AS national_population 
FROM population 
WHERE state > 0 -- only state-level rows ORDER BY name;

Notice that we’re using a window function to sum the state populations over all the result rows (that’s the OVER() you see in our query—yep, just that one little addition to an otherwise standard query). Returned, we get result rows for each state and their populations with also the population sum for the nation—that’s the aggregation we performed with our window function:

Consider how this compares to standard aggregation functions. Without the window function, the simplest thing we could do is return the national population by itself, like this, by summing the state populations:

SELECT SUM(popestimate2015) AS national_population  
FROM population  
WHERE state > 0 -- only state-level rows;

The problem is, we don’t get any of the state level information this way. To get the same results as our window function, we’d have to do a sub-select as a derived table:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       division,
       SUM(popestimate2015)
            OVER(PARTITION BY division) AS divisional_population
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name;

Looks ugly in comparison, doesn’t it? Using window functions, our query is much less complex and easier to understand.

Condition clauses

In the above example, we looked at a simple window function without any additional conditions, but in many cases, you’ll want to apply some conditions in the form of additional clauses to your OVER() clause. One is PARTITION BY, which acts as the grouping mechanism for aggregations. The other one is ORDER BY, which orders the results in the window frame (the set of applicable rows).

So, besides the format of the returned rows as we reviewed above, the other obvious difference with window functions is how the syntax works in your queries. Use the OVER() clause with an aggregate function (like SUM() or AVG()) and/or with a specialized window function (like RANK() or ROW_NUMBER()) in your SELECTlist to indicate you’re creating a window and apply additional conditions as necessary to the OVER() clause, such as using PARTITION BY (instead of the GROUP BY you may be used to for aggregation).

Let’s look at some specific examples.

PARTITION BY

PARTITION BY allows us to group aggregations according to the values of the specified fields.

In our census data for estimated population, each state is categorized according to the division and region it belongs to. Let’s partition first by region:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       SUM(popestimate2015)
            OVER(PARTITION BY region) AS regional_population
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name  
;

Now we can see the population sum by region but still get the state level data:

Let’s add division:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       division,
       SUM(popestimate2015)
            OVER(PARTITION BY division) AS divisional_population
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name;

Now we’re looking at state-level data, broken out by region and division, with a population summary at the division level:

ORDER BY

As you’ve probably noticed in the previous queries, we’re using ORDER BY in the usual way to order the results by the state name, but we can also use ORDER BY in our OVER() clause to impact the window function calculation. For example, we’d want to use ORDER BY as a condition for the RANK() window function since ranking requires an order to be established. Let’s rank the states according to highest population:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       RANK()
            OVER(ORDER BY popestimate2015 desc) AS state_rank
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name;

In this case, we’ve added ORDER BY popestimate2015 desc as a condition of our OVER() clause in order to describe how the ranking should be performed. Because we still have our ORDER BY name clause for our result set, though, our results will continue to be in state name order, but we’ll see the populations ranked accordingly, with California coming in at number one based on its population:

Let’s combine our PARTITION BY and our ORDER BY window function clauses now to see the ranking of the states by population within each region. For this, we’ll change our result-level ORDER BY name clause at the end to order by region instead so that it’ll be clear how our window function works:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       RANK()
           OVER(PARTITION BY region ORDER BY popestimate2015 desc) AS regional_state_rank
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY region;

Our results:

Here we can see that Illinois is the top-ranking state by population in the Midwest region and New York is number one in the Northeast region.

So, we combined some conditions here, but what if we need more than one window function?

Named window functions

In queries where you are using the same window function logic for more than one returned field or where you need to use more than one window function definition, you can name them to make your query more readable.

Here’s an example where we’ve defined two windows functions. One, named “rw,” partitions by region and the other, named “dw,” partitions by division. We’re using each one twice—once to calculate the population sum and again to calculate the population average. Our windows functions are defined and named using the WINDOW clause which comes after the WHERE clause in our query:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       SUM(popestimate2015)
            OVER rw AS regional_population,
       AVG(popestimate2015)
            OVER rw AS avg_regional_state_population,
       division,
       SUM(popestimate2015)
            OVER dw AS divisional_population,
       AVG(popestimate2015)
            OVER dw AS avg_divisional_state_population
FROM population  
WHERE state > 0 -- only state-level rows  
WINDOW rw AS (PARTITION BY region),  
       dw AS (PARTITION BY division)
ORDER BY name;

Since we didn’t do any manipulation on the averages values yet, the numbers look a little crazy; that can be easily cleaned up using ROUND() and CAST(), if need be. Our purpose here is to demonstrate how to use multiple window functions and the results you’ll get. Check it out:

Now that’s an informative report of population metrics, and window functions made it easy!

Wrapping up

This article has given you a glimpse of the power of PostgreSQL window functions. We touched on the benefits of using window functions, looked at how they are different (and similar) to standard aggregation functions, and learned how to use them with various conditional clauses, walking through examples along the way. Now that you can see how window functions work, start trying them out by replacing standard aggregations with window functions in your queries. Once you get the hang of them, you’ll be hooked.

In our next article, we’ll look at window framing options in PostgreSQL to give you even more control over how your window functions behave.

Was this article helpful?

YesNo

Phil Alger

Developer Advocate

Lisa Smith

Business Analyst, IBM Cloud Platform

Welcome to the Metrics Maven series: Database features, tips, tricks, and code

PostgreSQL window functions

OVER( )

Condition clauses

Named window functions

Wrapping up

More from Cloud

How a US bank modernized its mainframe applications with IBM Consulting and Microsoft Azure

The power of the mainframe and cloud-native applications

Modernize your mainframe applications with Azure

IBM Newsletters