ETL—meaning extract, transform, load—is a data integration process that combines, cleans and organizes data from multiple sources into a single, consistent data set for storage in a data warehouse, data lake or other target system.
ETL data pipelines provide the foundation for data analytics and machine learning workstreams. Through a series of business rules, ETL cleanses and organizes data to address specific business intelligence needs, such as monthly reporting—but it can also tackle more advanced analytics, which can improve back-end processes and end-user experiences. ETL pipelines are often used by organizations to:
Extract data from legacy systems
Cleanse the data to improve data quality and establish consistency
Load data into a target database
Businesses have been generating data since the age of the abacus, but modern analytics only became possible with the arrival of the digital computer and data storage.
A major step forward arrived in the 1970s, with a move to larger centralized databases. ETL was then introduced as a process for integrating and loading data for computation and analysis, eventually becoming the primary method to process data for data warehousing projects.
In the late 1980s, data warehouses grew in popularity, along with the move from transactional databases to relational databases that stored information in relational data formats. Older transactional databases would store information transaction-by-transaction, with duplicate customer information stored with each transaction, so there was no easy way to access customer data in a unified way over time. With relational databases, analytics became the foundation of business intelligence (BI) and a significant tool in decision making.
Until the arrival of more sophisticated ETL software, early attempts were largely manual efforts by the IT team to extract data from various systems and connectors, transform the data into a common format, and then load it into interconnected tables. Still, the early ETL steps were worth the effort, as advanced algorithms, plus the rise of neural networks, produced ever-deeper opportunities for analytical insights.
The era of big data arrived in the 1990s as computing speeds and storage capacity continued to grow rapidly, with large volumes of data being pulled from new sources, such as social media and the Internet of Things (IoT). A limiting factor remained, with data often stored in on-premises data warehouses.
The next major step in both computing and ETL was cloud computing, which became popular in the 2000s. Using cloud platforms and cloud data warehouses such as Amazon Web Services (AWS), Microsoft Azure and Snowflake, data can now be accessed from around the globe and scaled quickly, enabling ETL solutions to deliver remarkably detailed insights and new-found competitive advantage.
The latest evolution is ETL solutions using streaming data to deliver up-to-the-second insights from huge amounts of data.
The most obvious difference between ETL and ELT—extract, load, transform—is the difference in order of operations. ELT copies or exports the data from the source locations, but instead of loading it to a staging area for transformation, it loads the raw data directly into the target data store to be transformed as needed.
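As a rough illustration of the ELT ordering described above, the following minimal sketch loads raw records straight into a target store and defers the transformation to a query inside that store. The in-memory SQLite database, the `raw_orders` table, the `customer_totals` view and the sample rows are all assumptions made for the example, not part of any particular product.

```python
import sqlite3

# Hypothetical raw records, standing in for an export from a source system.
raw_orders = [
    ("1001", " alice@example.com ", "19.99"),
    ("1002", "BOB@example.com", "5.00"),
    ("1003", "alice@example.com", "7.50"),
]

# An in-memory SQLite database stands in for the target data store.
target = sqlite3.connect(":memory:")

# Load: copy the raw values into the target as-is, with no staging area.
target.execute("CREATE TABLE raw_orders (order_id TEXT, email TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: done later, inside the target, when an analysis needs it.
target.execute(
    """
    CREATE VIEW customer_totals AS
    SELECT lower(trim(email)) AS email,
           round(sum(CAST(amount AS REAL)), 2) AS total_spent
    FROM raw_orders
    GROUP BY lower(trim(email))
    """
)

totals = dict(target.execute("SELECT email, total_spent FROM customer_totals"))
print(totals)
```

Because the raw table is kept untouched, a different transformation can be defined later without extracting the data again, which is the main trade-off this ordering offers.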
While both processes use a variety of data repositories, such as databases, data warehouses and data lakes, each has its advantages and disadvantages. ELT is useful for ingesting high-volume, unstructured data sets because loading can occur directly from the source. ELT can be better suited to big data management since it doesn’t need much upfront planning for data extraction and storage.
The ETL process requires more definition at the outset. Specific data points need to be identified for extraction, along with any potential “keys” to integrate across disparate source systems. The source of input data is often tracked by using metadata. Even after that work is completed, the business rules for data transformations need to be constructed. This work usually depends on the data requirements for a given type of data analysis, which determine the level of summarization that the data needs to have.
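To make the idea of integration keys and source-tracking metadata more concrete, here is a small sketch under stated assumptions: two hypothetical systems (`crm_rows` and `billing_rows`) identify the same customers with differently formatted IDs, and the `"C-"` prefix convention, field names and `sources` list are invented for illustration.

```python
# Hypothetical records from two source systems describing the same customers.
crm_rows = [
    {"cust_id": "C-001", "name": "Alice"},
    {"cust_id": "C-002", "name": "Bob"},
]
billing_rows = [
    {"customer": "001", "balance": 40.0},
    {"customer": "003", "balance": 12.5},
]


def crm_key(row):
    # Assumed convention: the CRM prefixes its customer IDs with "C-".
    return row["cust_id"].split("-", 1)[-1]


def integrate(crm, billing):
    """Merge both sources on a shared key and record where each record came from."""
    merged = {}
    for row in crm:
        record = merged.setdefault(crm_key(row), {"sources": []})
        record["name"] = row["name"]
        record["sources"].append("crm")
    for row in billing:
        record = merged.setdefault(row["customer"], {"sources": []})
        record["balance"] = row["balance"]
        record["sources"].append("billing")
    return merged


customers = integrate(crm_rows, billing_rows)
for key, record in sorted(customers.items()):
    print(key, record)
```

The `sources` list is the metadata here: it shows which records were confirmed by both systems and which exist in only one, which is often the first thing a data quality review needs to know.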
While ELT pipelines have become increasingly popular with the adoption of cloud databases, ELT is still a maturing approach, meaning that best practices are still being established.
The easiest way to understand how ETL works is to understand what happens in each step of the process.
During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of different sources, which can be structured or unstructured. Those data types include, but are not limited to:
In the staging area, the raw data undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case. This phase of the transformation process can include:
In this last step, the transformed data is moved from the staging area into a target data warehouse. Typically, this involves an initial loading of all data, followed by periodic loading of incremental data changes and, less often, full refreshes to erase and replace data in the warehouse. For most organizations that use ETL, the process is automated, well-defined, continuous and batch-driven. Typically, the ETL load process takes place during off-hours when traffic on the source systems and the data warehouse is at its lowest.
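The three steps can be sketched end to end as follows. This is a minimal illustration rather than a production pipeline: the CSV strings stand in for source exports, an in-memory SQLite table stands in for the warehouse, and the rule that rows without an amount are rejected is an assumed business rule. The second `load` call shows an incremental load that adds new rows and overwrites changed ones.

```python
import csv
import io
import sqlite3

# Hypothetical source exports: an initial full extract and a later incremental one.
INITIAL_CSV = (
    "order_id,email,amount\n"
    "1, Alice@Example.com ,19.99\n"
    "2,bob@example.com,5.00\n"
)
INCREMENT_CSV = (
    "order_id,email,amount\n"
    "2,bob@example.com,6.00\n"
    "3,carol@example.com,\n"
    "4,dave@example.com,12.00\n"
)


def extract(csv_text):
    """Copy raw rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(staged_rows):
    """Apply simple business rules: drop incomplete rows, normalize values."""
    cleaned = []
    for row in staged_rows:
        if not row["amount"].strip():
            continue  # assumed rule: orders without an amount are rejected
        cleaned.append(
            (int(row["order_id"]), row["email"].strip().lower(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    """Insert new rows and overwrite changed ones, keyed by order_id."""
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, email, amount) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
)

load(warehouse, transform(extract(INITIAL_CSV)))    # initial load of all data
load(warehouse, transform(extract(INCREMENT_CSV)))  # periodic incremental load

final_rows = warehouse.execute(
    "SELECT order_id, email, amount FROM orders ORDER BY order_id"
).fetchall()
print(final_rows)
```

A full refresh, by contrast, would delete the contents of `orders` before loading, which is simpler but reprocesses every row each time.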
ETL and ELT are just two data integration methods, and there are other approaches that are also used to facilitate data integration workflows. Some of these include:
ETL solutions improve quality by performing data cleansing before loading the data to a different repository. Because it is a time-consuming batch operation, ETL is recommended more often for creating smaller target data repositories that require less frequent updating, while other data integration methods, including ELT (extract, load, transform), change data capture (CDC) and data virtualization, are used to integrate increasingly large volumes of changing data or real-time data streams.
In the past, organizations wrote their own ETL code. There are now many open source and commercial ETL tools and cloud-based services to choose from. Typical capabilities of these products include:
In addition, many ETL tools have evolved to include ELT capability and to support integration of real-time and streaming data for artificial intelligence (AI) applications.
Application programming interfaces (APIs) using Enterprise Application Integration (EAI) can be used in place of ETL for a more flexible, scalable solution that includes workflow integration. While ETL is still the primary data integration method, EAI is increasingly used with APIs in web-based settings.