For a long time, organizations relied on relational databases (developed in the 1970s) and data warehouses (developed in the 1980s) to manage their data. These solutions are still important parts of many organizations’ IT ecosystems, but they were designed primarily for structured datasets.
With the growth of the internet—and especially the arrival of social media and streaming media—organizations found themselves dealing with far more unstructured data, such as free-form text and images. Because of their strict schemas and comparatively high storage costs, relational databases and warehouses were ill-equipped to handle this influx of data.
In 2011, James Dixon, then the chief technology officer at Pentaho, coined the term “data lake.” Dixon saw the lake as an alternative to the data warehouse. Whereas warehouses deliver preprocessed data for targeted business use cases, Dixon imagined a data lake as a large body of data housed in its natural format. Users could draw the data they needed from this lake and use it as they pleased.
Many of the first data lakes were built on Apache Hadoop, an open-source software framework for distributed processing of large datasets. These early data lakes were hosted on-premises, but this quickly became an issue as the volume of data continued to surge.
Cloud computing offered a solution: moving data lakes to more scalable cloud object storage services.
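To make that shift concrete, the sketch below shows how raw data might land in a cloud object store. It assumes Amazon S3 and the boto3 library purely for illustration—the original text names no specific service—and the bucket name, key layout, and file are hypothetical.

```python
# Minimal sketch: landing a raw file in cloud object storage (Amazon S3 via boto3).
# The point is that the data is stored in its natural format; no schema is
# imposed at write time, unlike loading into a warehouse table.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="clickstream-2011-06-01.json",            # hypothetical local file of raw events
    Bucket="example-data-lake",                        # hypothetical bucket acting as the lake
    Key="raw/clickstream/2011/06/01/events.json",      # hypothetical "raw zone" key layout
)
```

Any S3-compatible or equivalent object storage service could play the same role; the defining trait is cheap, scalable storage of data as-is.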
Data lakes are still evolving today. Many data lake solutions now offer features beyond cheap, scalable storage, such as data security and governance tools, data catalogs and metadata management.
Data lakes are also core components of data lakehouses, a relatively new data management solution that combines the low-cost storage of a lake with the high-performance analytics capabilities of a warehouse. (For more information, see “Data lakes vs. data lakehouses.”)