January 14, 2021 By Holly Vatter 5 min read

The importance of the data lake has continued to grow as organizations contend with the growth of unstructured data and its sources. By one estimate, “80% of worldwide data will be unstructured by 2025” [1], meaning that companies must take a look at their data lakes now to ensure they’re prepared.

To set up a data lake that works well alongside existing databases and data warehouses, companies must ensure that the data lake has sufficient scalability, integration and deployment options to deliver all data, including unstructured data, for analysis when and where it is required. Automated governance is also essential to help ensure the data can be trusted, as are storage options that let companies choose what best fits their current architecture and data lake needs. Each of these topics is covered in depth in the eBook, Building a robust, governed data lake for AI, but a sneak preview is also available below.

Data lake scalability, integration and deployment options

The data lake is undeniably one of the most important data management tools for collecting the newest and most varied sources of data. Data lakes help collect streaming audio, video, call log, sentiment and social media data “as is” to provide more complete, robust insights. This has considerable impact on the ability to perform AI, machine learning and data science; it isn’t too much of a stretch to say that the data lake is, at least in part, the basis for the future of these capabilities. However, to make the most of the data lake, it must be scalable, integrated and widely deployable so that no data is missed and all data can be used easily.

Taken together, the concepts of data lake scalability, integration and deployment options typically fall under the umbrella of enterprise readiness. It’s easy to see why: each of these traits speaks directly to the data lake’s ability to perform its core duties of ingesting data when and where it is required and then providing it for analytics.

The data lake must scale extensively, quickly and at low cost. This is often achieved by using clusters of inexpensive commodity hardware. Having this capacity available reduces the likelihood that real-time data ingestion will be interrupted and also opens the opportunity to economically store cold, historical data from a database or warehouse.

Federation capabilities should always be part of the data lake. Faster than enterprise service bus (ESB) or extract, transform, load (ETL) processes, federation provides an easier way to break down silos across data management. A SQL-on-Hadoop engine like IBM Db2 Big SQL is recommended.
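
To make the idea concrete, here is a minimal sketch of what a federated query can look like from Python, assuming the ibm_db_dbi driver and a SQL-on-Hadoop engine such as Db2 Big SQL; the connection string, schema names and the SALES_HDFS nickname are placeholders, not a real configuration.

```python
# Hypothetical example: one SQL statement joins a warehouse table with data
# landed in the lake, which is what federation/SQL-on-Hadoop makes possible.
# Connection details and object names below are placeholders.
import ibm_db_dbi

conn = ibm_db_dbi.connect(
    "DATABASE=BLUDB;HOSTNAME=bigsql.example.com;PORT=32051;"
    "PROTOCOL=TCPIP;UID=analyst;PWD=secret",
    "", ""
)
cursor = conn.cursor()
cursor.execute(
    """
    SELECT w.customer_id, w.lifetime_value, h.last_click_ts
    FROM   WAREHOUSE.CUSTOMERS w
    JOIN   SALES_HDFS.CLICKSTREAM h
           ON w.customer_id = h.customer_id
    WHERE  h.last_click_ts > CURRENT DATE - 30 DAYS
    """
)
for row in cursor.fetchmany(5):
    print(row)
cursor.close()
conn.close()
```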

And even when federation capabilities exist, multiple deployment options are helpful, particularly since 45% of businesses run at least one big data workload in the cloud. [2] On-premises, multicloud and hybrid solutions should all be offered so companies can address compliance needs by putting the data lake behind an on-premises firewall, or efficiency needs with a pay-as-you-go cloud model. The ability to combine data lake locations as needed in a hybrid environment lets businesses make choices that suit their unique needs as situations arise and stay more flexible.

Automated governance for the data lake

Data lake governance covers multiple areas, ranging from the data integration improvements discussed previously and data cataloging to more traditional data governance and self-service data access. Automation is key across these areas to ensure that DBAs and data scientists spend their time on higher-value activities.

Though the function of integrating data is handled by capabilities such as federation, governance plays a vital role in facilitating the process with in-line data quality and active metadata policy enforcement. Experts expect data integration tasks to be reduced by 45% through the addition of ML and automated service-level management by 2022. [3] In addition, ML capabilities can help synchronize data from databases to cloud data warehouses. Look for AI-powered solutions like IBM DataStage to help deliver the best data lake integration.

Data cataloging, on the other hand, helps companies better understand the data in their data lake. It does so by defining data in business terms, enabling better visual exploration of the data and tracking data lineage. Seek solutions like IBM Watson Knowledge Catalog that take these capabilities to the next level with automated data discovery and metadata generation, ML-extracted business glossaries and automated scanning and risk assessments of unstructured data.
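
Conceptually, a catalog entry ties a physical asset in the lake to a business-term definition, a classification and its lineage. The sketch below only illustrates that data structure; it is not the Watson Knowledge Catalog API, and the asset names and lineage steps are made up.

```python
# Illustrative only: a simplified view of what a catalog entry records.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    asset_name: str          # physical name in the lake, e.g. an object key or HDFS path
    business_term: str       # governed, human-readable definition
    classification: str      # e.g. "PII", "Confidential", "Public"
    lineage: List[str] = field(default_factory=list)  # upstream sources, in order

entry = CatalogEntry(
    asset_name="lake/raw/crm/customer_contacts.parquet",
    business_term="Customer Contact Record",
    classification="PII",
    lineage=["crm_prod.contacts", "etl_job_142", "lake/raw/crm/customer_contacts.parquet"],
)

# An analyst searches by business term, then checks lineage to decide
# whether the asset can be trusted for analysis.
print(entry.business_term, "->", " -> ".join(entry.lineage))
```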

What has traditionally been called data governance aligns most closely with security, including compliance and audit readiness. While these cannot be overlooked, automation is vital so that as little manual effort as possible is expended in these areas. Namely, products like IBM InfoSphere Information Governance Catalog should be used to automate the classification and profiling of data assets and to automatically enforce data protection rules established to anonymize and restrict access to sensitive information. Such a product should also allow quick incident responses by flagging sensitive data, identifying issues and enabling easy audit response.
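
The sketch below illustrates the general idea of automated policy enforcement: fields classified as sensitive are masked unless the requesting user holds the right entitlement. It is a simplified, hypothetical example; governance products enforce such rules centrally through policies rather than in application code.

```python
# Hypothetical policy-enforcement sketch: field names, entitlements and the
# masking strategy are illustrative assumptions, not a product API.
SENSITIVE_FIELDS = {"email", "ssn"}

def enforce_policy(row: dict, user_entitlements: set) -> dict:
    """Return a copy of the row with sensitive fields masked unless allowed."""
    masked = {}
    for column, value in row.items():
        if column in SENSITIVE_FIELDS and "view_pii" not in user_entitlements:
            masked[column] = "***MASKED***"
        else:
            masked[column] = value
    return masked

row = {"customer_id": 42, "email": "jane@example.com", "ssn": "123-45-6789", "region": "EMEA"}
print(enforce_policy(row, user_entitlements={"read_basic"}))
# {'customer_id': 42, 'email': '***MASKED***', 'ssn': '***MASKED***', 'region': 'EMEA'}
```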

The self-service data access aspect of governance most directly helps data scientists. Instead of spending time manually cleansing data, they can rely on solutions like AutoAI, which automates the data preparation and modeling stages of the data science lifecycle. The result is insights that arrive more quickly and are trusted more thoroughly because clean data has been used.
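
To show what automated preparation and modeling mean in practice, here is a minimal sketch using scikit-learn. It illustrates the general pattern of declaring preparation once and reusing it for training and scoring; it is not AutoAI’s actual interface, and the feature names are placeholders.

```python
# Illustration of automated data preparation + modeling as a single pipeline;
# column names and the choice of estimator are assumptions for the example.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "tenure_months"]
categorical_features = ["region"]

prepare = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_features),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
])

model = Pipeline([("prepare", prepare), ("classify", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train); model.predict(X_new) -- the same preparation is
# applied automatically at both training and scoring time.
```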

Selecting the best data lake storage option

Multiple storage options exist for the data lake, allowing companies to choose the one that fits best with the current data management architecture and existing skill sets. Data lake vendors should offer object storage, file storage, and Apache Hadoop.

Object storage is based on units called objects that contain the data, metadata and a unique identifier. It provides the ability to scale computing power and storage independently, delivering cost savings in dynamic environments. Line-of-business applications, websites, mobile apps and long-term archives all benefit from object storage.
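
In practice, writing to object storage means sending the data, its metadata and a unique key in one call. The sketch below uses the widely supported S3-compatible interface (which IBM Cloud Object Storage also exposes); the endpoint, bucket, key and credentials are placeholders.

```python
# Hypothetical example against an S3-compatible object store.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-object-store.com",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",                       # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

s3.put_object(
    Bucket="data-lake-raw",                           # placeholder bucket
    Key="callcenter/2021/01/14/calls-0001.json",      # the object's unique identifier
    Body=b'{"call_id": 1, "sentiment": "positive"}',  # the data itself
    Metadata={"source": "call-center", "ingested-by": "stream-job-7"},  # user metadata
)

# Retrieval uses the same key; the metadata comes back with the object.
obj = s3.get_object(Bucket="data-lake-raw", Key="callcenter/2021/01/14/calls-0001.json")
print(obj["Metadata"])
```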

The best file storage options allow transparent HDFS access and file access to the same storage capacity. Yottabyte scalability, flash acceleration and automated storage lifecycle management also speed performance and data access while providing opportunities for cost savings. Security features such as live notifications, end-to-end encryption and WORM (write once, read many) immutable data are also available.

Apache Hadoop, used by 62% of the market, [4] is open source and relies on community support for improvement. Fault tolerance, reliability and high availability are key characteristics. However, users should be aware that, unlike object storage, processing and storage capacity are scaled together.

Access more in-depth information and data lake case studies

Data lake success is predicated on a wide range of factors; ignoring any of them could be the difference between having well-informed insights delivered in time to best the competition and lagging behind. Go in depth on a range of data lake topics in the eBook, Building a robust, governed data lake for AI, and see how others are using the data lake within three industries. If you’d like to talk to an expert about the data lake directly, you can also schedule a free 30-minute conversation.

  1. 80 Percent of your Data will be Unstructured in Five Years
  2. 25+ Impressive Big Data Statistics for 2020
  3. Gartner Identifies Top 10 Data and Analytics Technology Trends for 2019
  4. New Survey Reveals Businesses are Bullish on Data Lakes