A data warehouse is a system that aggregates data from multiple sources into a single, central and consistent data store. Data warehouses help prepare data for data analytics, business intelligence (BI), data mining, machine learning (ML) and artificial intelligence (AI) initiatives.
An enterprise-grade data warehouse system enables an organization to run powerful analytics on large amounts of data (petabytes and more) in ways that a standard database cannot. The data can be structured, semi-structured or unstructured. Data can feed into a warehouse from multiple databases, including customer relationship management (CRM), inventory, point of sale (POS) and supply chain management systems.
Data warehousing systems—sometimes called enterprise data warehouse (EDW) systems—have been supporting business intelligence efforts for over three decades. Their functions focus on extracting data from other sources, cleansing and preparing the data and loading and maintaining the data, often in a relational database.
Traditionally, a data warehouse was hosted on-premises, often on a mainframe computer. Today, many data warehouses are hosted in the cloud and delivered as cloud services.
Data warehouses were born in the 1980s to optimize data analytics by making integrated transactional data available in a consistent format. As the power of business applications grew and new data sources exploded—including the World Wide Web, social media and Internet of Things (IoT)—the need for larger storage and faster analysis grew.
As the data warehouse evolved to support greater volumes and more granular data, more teams within organizations requested direct access to the data for self-service analytics functions.
Many organizations realized that their legacy data warehouses could not manage these new, huge workloads. And because many data warehouses stored only structured data, the richness of the analysis might be limited. In response, more flexible alternatives evolved, including cloud-native data warehouses and data lakehouses. (See “Data lakehouse vs. data warehouse” for more information.)
To process data quickly and efficiently, data warehouses most often use a three-tier architecture.
This tier is where the data flows from disparate data sources into a data warehouse server, where it is stored. The data is most often moved through a process known as extract, transform, load (ETL) or sometimes a process known as extract, load, transform (ELT). These processes are executed in different ways, but they both use automation to move data into a warehouse and prepare it for use in analytics.
This tier is traditionally built around an analytics engine, most often an online analytical processing (OLAP) system that is designed to deliver fast analytics and query speeds. Three types of OLAP models can be used in this tier:
The type of OLAP model used depends on the type of database system being used.
This tier includes a front-end user interface or reporting tool, which enables users to conduct ad hoc data analysis on their business data. Self-service business intelligence has multiple uses, such as the generation of reports based on historical data, identification of new opportunities or identifying process bottlenecks.
Most data warehouses use OLAP systems to support analytics. OLAP software performs multidimensional analysis at high speeds on large volumes of data from a unified, centralized data store, such as a data warehouse.
A piece of business data often has multiple dimensions. For example, sales figures might include several dimensions related to location (region, country and store), time (year, month, week and day) or product (brand, type).
In a traditional relational database, data is organized in row-and-column tables that can only represent two of these dimensions at a time—one dimension in the row and one dimension in the column. This can make analysis cumbersome.
However, OLAP systems enable users to analyze data along multiple dimensions at once, enabling faster processing and more insightful analysis. Common uses of OLAP include data mining and business intelligence apps, complex analytical calculations, predictive scenarios, budgeting and forecasting.
OLAP is distinct from online transactional processing or OLTP. OLTP systems track large numbers of transactions, by large numbers of users, in real-time. The key difference between OLTP and OLAP is that OLTP systems are designed primarily to capture data, whereas OLAP systems analyze data that has already been captured.
OLTP systems often use relational databases to record transactions such as:
While data warehouses don’t generally involve OLTP systems, the data recorded in databases by OLTP systems is typically fed to the warehouse, where an OLAP system enables analysis.
Database schemas define how data is organized within a database or data warehouse. There are two main types of schema structures used in data warehouses: the star schema and the snowflake schema.
Star and snowflake schema are both dimensional data models designed to optimize data retrieval speeds. Dimensional models increase redundancy to make it easier to locate information for reporting and retrieval. This modeling is typically used across OLAP systems.
This schema consists of one fact table (which contains measurable items such as products sold and monetary sales amounts) joined to dimension tables (that is, reference information showing how the facts can be grouped and organized, such as dates of sale and industry sold to).
In a diagram, the fact table can appear to be in the middle of a star pattern. The star schema is considered the simplest and most common type of schema, and its users benefit from its faster speeds while querying.
While not as widely adopted, the snowflake schema is another organizational structure used in data warehouses. In this case, the fact table is connected to several normalized dimension tables containing descriptive data about the facts in the central fact table. These dimension tables also have child tables.
This more complex, branching pattern can resemble a snowflake. Users of a snowflake schema benefit from its low levels of data redundancy, but this comes at the cost of slowing query performance.
Most data warehouses are built around a relational database system, either on-premises or in the cloud, where data is both stored and processed. A typical data warehouse has components such as:
The data layer or central database, is the heart of the data warehouse and is supported by all other components. The data can be input from business applications, email lists, websites or any other relational databases. The data is physically stored on a server or set of servers.
The data layer can partition segments of data so that users can only access the data they need. For example, the sales team would not normally have access to the HR team’s data and vice versa.
Typically, data warehouses have data governance and security capabilities built in, so organizations don’t need to do much custom data engineering work to include these features. Organizations might need to update data governance principles and security measures over time as new data from different sources are added to the warehouse.
Access tools connect to a data warehouse to provide a business-user-friendly front end. This can include dashboards, reporting and data visualization tools that enable data analysts and business users to interact with the data and extract insights. Examples of these tools include Tableau, Looker and Qlik.
ETL tools help move data from a data source into the data warehouse. The data is first "extracted" from its source, moved to a staging area where it is cleaned and prepared (or "transformed"), and then "loaded" into a warehouse.
ETL tools convert data into a consistent format so that it can be efficiently analyzed and queried when it is inside the warehouse. For example, data might be extracted from multiple customer databases and then transformed into a common format so all customer records have the same fields.
A data processing framework tool, such as Apache Spark, can help manage the transformation of data.
Because a data warehouse primarily stores structured data, the data is typically transformed before it is moved to the warehouse. While some warehouses can use an extract, load, transform (ELT) process instead—which loads data into the warehouse before transforming it—this process is more commonly used for data lakes, which don’t require standardized data formats. (See “Data lake vs. data warehouse” for more information.)
Metadata is data about data. Basically, it describes the data that is stored in a system to make it searchable. Metadata includes characteristics such as authors, dates or locations of articles, dates created and file sizes. Metadata and its management system help to organize data and make it usable for analytics.
Some data warehouses provide a sandbox that is walled off from the live data. It might be used as a testing environment, containing a copy of the production data and relevant analysis and visualization tools. Data analysts and data scientists can experiment with new analytical techniques in the sandbox without impacting the operations of the data warehouse for other users.
A connectivity layer for application programming interfaces (APIs) can help the warehouse pull data from organizational sources and provide access to visualization and analytics tools.
The terms data warehouse, database, data lake, data mart and data lakehouse are sometimes used interchangeably. While these systems are similar, important differences exist.
A database is like a filing cabinet that is built primarily for fast queries and transaction processing, but not analytics. A database typically serves as the focused data store for a specific application, whereas a data warehouse stores data from any number of applications in an organization.
A database focuses on capturing real-time data while a data warehouse has a broader scope, capturing historical data—but sometimes current data—for predictive analytics, machine learning and other advanced analysis.
A data lake is a storage solution for massive amounts of raw data with no predefined schema. Data lakes often contain unstructured data and semi-structured data, such as documents, videos, Internet of Things (IoT) logs and social media posts. They are commonly built on big data platforms such as Apache Hadoop.
Data lakes are primarily designed to offer low-cost storage for large amounts of data. To keep storage costs low, they don’t typically transform data or optimize for analytics, as a warehouse does.
A data mart is a type of data warehouse that contains data specific to a particular business line or department rather than an entire enterprise.
Because data marts contain a smaller subset of data, they enable a department or business line to discover more focused insights more quickly than is possible when working with the broader data warehouse data set.
For example, a marketing team might use a data mart to define ideal target demographics, while a product team might use one to analyze inventory patterns.
A data lakehouse is a data platform that merges aspects of data warehouses and data lakes—namely, the flexibility of a lake and the high performance of a warehouse—into one data management solution. Data lakehouses might also add features such as shared metadata, distributed structured query language (SQL) engines and built-in governance and security controls.
The advent of open source technologies and the desire to reduce data duplication and complex ETL pipelines has led to the development of the data lakehouse. By combining the key features of lakes and warehouses into one data solution, lakehouses can help accelerate data processing and support machine learning, data science and AI workloads.
A cloud-based data warehouse is built to run in the cloud. It is often offered to organizations as a managed data-storage service in which the data warehouse infrastructure is managed by the cloud company. The organization doesn’t need to make an upfront investment in hardware or software, nor does it need to manage its own system. Cloud services often offer flexible pricing as well.
Cloud-based data warehouses have grown more popular as more organizations use cloud computing services and seek to reduce their on-premises data center footprints.
A business can purchase a data warehouse license and then deploy the warehouse as its own on-premises infrastructure. Although this is typically more expensive than a cloud data warehouse service, it might be a good choice for government entities, financial institutions or other organizations that want more control over their data or need to comply with strict security or data privacy standards.
A data warehouse appliance is a preintegrated bundle of hardware and software—typically including central processing units (CPUs), storage, operating system and data warehouse software—that a business can connect to its network and start using as is.
In terms of upfront cost, speed of deployment, ease of scalability and data management control, a data warehouse appliance typically sits between cloud and on-premises implementations.
A data warehouse can make insights and information available to teams across an organization with fast self-service, enabling multiple use cases.
Data warehouses can support various AI and machine learning applications. Data scientists can analyze historical data to develop predictive algorithms. They can teach machine learning applications to pick up on patterns, such as suspicious account activity that might indicate fraud. They can use cleansed and validated warehouse data to build proprietary generative AI models or fine-tune existing models to better serve their unique business needs.
An enterprise-grade data warehouse can enable thousands of users to access and run analytics tools simultaneously. Because the data is stored separately from operational databases, and in a more efficient format, users can run their own self-service business intelligence queries without slowing down other key systems.
Data warehouses can help consolidate siloed data through ETL pipelines that automate cleansing and integration. This helps speed queries and processing and enables more users to access data.
An enterprise-grade data warehouse might also include support for open source formats, such as Apache Iceberg, Parquet and CSV, enabling further data sharing across the enterprise.
Data warehouses can also serve industry-specific uses, such as:
A data warehouse combines data streams from disparate data stores, which makes it easier for organizations to analyze this data. As a result, organizations can uncover valuable insights, boost performance, improve operations and ultimately, gain a competitive advantage.
More specifically, the benefits of a data warehouse can include:
A data warehouse can centralize data from various data sources, such as transactional systems, operational databases and flat files. It then cleanses this operational data, eliminates duplicates and standardizes it to create a single source of truth that gives an organization a comprehensive, reliable view of enterprise data.
When data is locked in disparate sources, it might limit the ability of decision makers to derive insights and set business strategies with confidence. A data warehouse with one central repository enables business users to draw all of an organization’s pertinent data into business decision-making.
By running reports on historical data, a data warehouse can help determine which systems and processes are working and what needs improvement.
Data warehouses make it possible to discover and report on themes, trends and aggregations. Data professionals and business leaders can use these insights to make better-informed decisions based on hard evidence in virtually every area of the organization, from business processes to financial management and inventory management.
When implementing data warehouse solutions, organizations might need to confront certain challenges to achieve high performance. These can include:
With terabyte and petabyte-sized data warehouses now commonplace, high-performance operations require excellent loading, efficient storage and powerful database engines that meet demands for hyperefficiency.
A data warehouse might be asked to manage tremendous volumes of structured and non-structured data from many sources. All this data must be cleansed and validated before it can be used. Robust data governance policies and practices can help ensure an accurate single source of truth for all users.
With enterprise data stored in disparate environments—whether by regulation or business need—data warehouses today often require hybrid and multicloud storage, with data flow, ingestion and analysis moving across different systems. Organizations might need highly experienced IT team members to help implement and maintain these complex systems.
As AI and machine learning become more critical components of business strategy, organizations need data warehouses that can support these workloads.
Ideally, a data warehouse should be able to collect, cleanse, organize and analyze data so that it is AI-ready, as well as facilitate the flow of data to AI and machine learning applications. However, not all data warehouses are built for AI, which can make it difficult to use organizational data for AI workloads.
Data warehouses are configured and optimized for data analytics, which means they are typically not ideal for storing massive amounts of data. As the amount of data in a warehouse grows, the cost and complexity of storage grows with it. Latency issues can also arise.
A data lakehouse can be an option for some organizations, depending on their unique data architectures, because it can provide both cheaper, more flexible storage and analytics capabilities.
Learn how an open data lakehouse approach can provide trustworthy data and faster analytics and AI projects execution.
IBM named a Leader for the 19th year in a row in the 2024 Gartner® Magic Quadrant™ for Data Integration Tools.
Explore the data leader's guide to building a data-driven organization and driving business advantage.
Discover why AI-powered data intelligence and data integration are critical to drive structured and unstructured data preparedness and accelerate AI outcomes.
Simplify data access and automate data governance. Discover the power of integrating a data lakehouse strategy into your data architecture, including cost-optimizing your workloads and scaling AI and analytics, with all your data, anywhere.
Explore how IBM Research is regularly integrated into new features for IBM Cloud Pak® for Data.
Gain unique insights into the evolving landscape of ABI solutions, highlighting key findings, assumptions and recommendations for data and analytics leaders.
Design a data strategy that eliminates data silos, reduces complexity and improves data quality for exceptional customer and employee experiences.
Watsonx.data enables you to scale analytics and AI with all your data, wherever it resides, through an open, hybrid and governed data store.
Unlock the value of enterprise data with IBM Consulting, building an insight-driven organization that delivers business advantage.