May 18, 2020 By Henrik Loeser 3 min read

How advanced SQL helps to dig through the numbers.

All of us have been impacted by COVID-19 in one way or another. Looking at daily case numbers, the basic reproduction number, or mortality rates has become a routine. But what is behind those numbers? How are they computed and how does SQL help? 

In this post, I want to discuss how a few SQL analytics clauses can help to dig through the data. When I teach database systems, I always try to show how they apply to life. Here, SQL may not directly help to cure the disease, but it can help to compute and understand the statistics.

SQL for working with data

SQL, the Structured Query Language, is used as query language by most database systems (e.g., Db2, MySQL, Oracle, PostgreSQL, SQL Server, SQLite, etc.). Moreover, using programming libraries, SQL expressions can be applied to analyze in-memory data in apps or Jupyter notebooks, like in IBM Watson Studio. There are also services like the SQL Query service to analyze data stored in Cloud Object Storage (COS). To conclude, SQL is ubiquitous and, as we will see, versatile.

SQL analytics

Most of us know how to write a simple SELECT … FROM … WHERE statement. But how do you easily put data from different days side by side? How do you compute rolling averages to smooth out reported values (e.g., to account for “administrative weekends”)? How do you make sure that all countries are represented, even if they did not report at the latest available date (“MAX(date)”)?

The answer is to apply SQL window functions and to utilize common table expressions. They allow you to partition and order data, then apply additional functions on top. 

Suppose we have two tables: STATISTICS, which holds the daily case data (confirmed cases, deaths, the reporting date, etc.) from around the world, and DEMOGRAPHICS, that has country-specific data on population, area, population density, age groups, and more. The following SQL statements (among others) can be used to enrich and to analyze the data.

Delta from previous day

The following SQL statement uses LAG to access data from the previous day. The data is PARTITIONed BY country to refer to the same country and is sorted by the reporting date:

SELECT
   confirmed_cases - LAG(confirmed_cases,1) OVER (PARTITION BY country_id ORDER BY dt) AS confirmed_cases_delta,
   confirmed_cases,
   country_id,
   dt
FROM statistics

Smoothed rolling average

The SQL clause AVG() OVER with PRECEDING and FOLLOWING rows can be applied to smooth out reported numbers. This is necessary to account for “administrative weekends” with fewer reports or missing data which may be added later on:

SELECT
   AVG(confirmed_cases) OVER (PARTITION BY country_id ORDER BY dt ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS cases_smoothed5,
   confirmed_cases,
   country_id,
   dt
FROM STATISTICS

Computed rolling average can be used in figures like this for Germany. Zooming in, you see daily numbers zig-zagging over smoothed averages:

Confirmed cases and deaths reported for Germany.

Latest available per country data

Not all countries may have reported on the latest overall date, so if you are using MAX(date), those countries could be missed. By utilizing ROW_NUMBER() to enumerate available data in a DESCending way by date and the selecting only the first row, the most recent available data is returned. 

WITH latest_deaths_data AS
   ( SELECT country_id,
            dt,
            deaths,
            confirmed_cases,
            ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
     FROM STATISTICS)
SELECT country_id,
       dt,
       deaths,
       confirmed_cases,
       rn
FROM latest_deaths_data
WHERE rn=1

A query like the one above can be part of a more complex query to compute the mortality per 100,000 population as discussed by Johns Hopkins University. Taking into account all countries (not just those with the latest numbers from today), here is my generated table:

Conclusions

SQL is ubiquitous and quite versatile, and many of us know how to write simple SQL statements. More advanced SQL helps to analyze the daily COVID-19 data from around the world. I have shown few statements in the blog above. 

In addition, I have created a gist on GitHub with sample SQL statements to dig into COVID-19 data. Feel free to comment on GitHub or to add to it. The two screenshots in this post are based on data processed with such SQL statements.

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn

Was this article helpful?
YesNo

More from Cloud

How a US bank modernized its mainframe applications with IBM Consulting and Microsoft Azure

9 min read - As organizations strive to stay ahead of the curve in today's fast-paced digital landscape, mainframe application modernization has emerged as a critical component of any digital transformation strategy. In this blog, we'll discuss the example of a US bank which embarked on a journey to modernize its mainframe applications. This strategic project has helped it to transform into a more modern, flexible and agile business. In looking at the ways in which it approached the problem, you’ll gain insights into…

The power of the mainframe and cloud-native applications 

4 min read - Mainframe modernization refers to the process of transforming legacy mainframe systems, applications and infrastructure to align with modern technology and business standards. This process unlocks the power of mainframe systems, enabling organizations to use their existing investments in mainframe technology and capitalize on the benefits of modernization. By modernizing mainframe systems, organizations can improve agility, increase efficiency, reduce costs, and enhance customer experience.  Mainframe modernization empowers organizations to harness the latest technologies and tools, such as cloud computing, artificial intelligence,…

Modernize your mainframe applications with Azure

4 min read - Mainframes continue to play a vital role in many businesses' core operations. According to new research from IBM's Institute for Business Value, a significant 7 out of 10 IT executives believe that mainframe-based applications are crucial to their business and technology strategies. However, the rapid pace of digital transformation is forcing companies to modernize across their IT landscape, and as the pace of innovation continuously accelerates, organizations must react and adapt to these changes or risk being left behind. Mainframe…

IBM Newsletters

Get our newsletters and topic updates that deliver the latest thought leadership and insights on emerging trends.
Subscribe now More newsletters