A man sits at his desk in a darkened office late at night looking pensively at his laptop

Seven benefits of optimizing data access & availability

01

Why optimal data access and availability is crucial

The benefits of simplifying data architecture across varying data types, locations and solutions are substantial

3 min read

02

A more flexible data architecture without lock-in

Place data on premises, in a private cloud or in multiple vendors' public clouds

5 min read

03

Improved self-service data access

Give authorized data users quick access to well-governed data

5 min read

04

Support for new data types and sources

Use structured, unstructured, graph, blockchain and audiovisual data for analysis and models

4 min read

05

Better access to open-source capabilities

Embrace the data lake, MongoDB and PostgreSQL at an enterprise level

4 min read

06

Lower total cost of ownership

Drive greater efficiencies through automation and optimization

4 min read

07

Faster reporting and data analytics

Ingest and analyze real-time data and accelerate querying of historical data

5 min read

08

Advanced querying and model building

Access built-in AI tools and functionality like confidence-based querying

6 min read

09

A modern data solution delivering these benefits

IBM Cloud Pak® for Data is the ideal choice for converging your data stores

3 min read

01

3 min read

Why optimal data access and availability is crucial

Optimizing data access and availability delivers significant benefits by streamlining operations

Aerial view of shipping containers on dock and ship

Data strategies are under more intense scrutiny than ever before as organizations look to capture and use the vast amount and variety of data in ways that are scalable, efficient and lead to actionable insights and improved customer experiences. For this reason, data store convergence is rapidly becoming essential across industries, delivering benefits such as a single query engine for multiple workloads, automated governance, minimized data movement, self-service data access, lower costs, AI-based optimization and support for AI model creation.

Underlying data convergence is a story that should sound familiar to most people who have observed the data management space: integration. For enterprises looking to fully exploit their data, connecting to all their data is integral. To look at structured, unstructured and semi-structured data across a variety of internal and external sources requires a well-connected data ecosystem of database, data warehouse, data lake and streaming data solutions. And organizations are realizing the importance of utilizing a single query across the variety of data stores within their ecosystems. Yet, optimizing data availability goes beyond this level of data integration by facilitating a holistic data and AI platform solution that includes data governance and AI tools and services alongside data management capabilities.

As a result, data virtualization, multicloud capabilities and a highly modular toolset can be brought to bear alongside modern data management's built-in AI-based advancements and continued improvements designed to capture the increasing number of data sources, such as graph and blockchain. In the sections that follow, we'll take a deeper look at each of the seven benefits that organizations can expect from an optimal data availability strategy.

02

5 min read

A more flexible data architecture without lock-in

Put data where it fits best: on premises and across multiple vendors' clouds

Illustration of data going directly to cloud or going to a containerized application

The ability to select where data resides is beneficial to businesses for a number of reasons. Foremost, choosing between a private cloud or a public cloud allows a business to decide whether it wants greater control or something easier to manage. Different options for geographical location can also help with regulatory compliance if cloud data must stay within certain borders, or simply speed time to insight by placing data closer to where it will be processed. And even if neither of those needs is a factor, the freedom to choose from multiple cloud platform vendors allows a business to pick the one with the ideal features or simply the most cost-effective option.

Gartner predicts that by 2022, more than 75% of global organizations will be running containerized applications in production, up from less than 30% today. 1

The convergence of data stores allows this level of flexibility through its vendor-agnostic design and data virtualization. The ability to virtualize all data sources enables enterprises to query the data where it resides. Furthermore, the vendor-agnostic design allows the same query engine to execute on any open file format on any cloud.

Additional vendor flexibility is possible through containerization. Containerizing a data management solution, such as a database, allows it to operate with the same codebase no matter where it's located, and to move between locations with minimal or no changes. This is possible because the container sits on an open-source foundation such as Red Hat® OpenShift®. Even if the containerized data management solution is proprietary, it can run anywhere the open-source foundation can. Thus, running on premises, in private clouds or across the public clouds of multiple different, and even competing, vendors won't be a problem, and neither will cloud migration if that ever becomes necessary.

93% of companies now use multiple cloud service providers. 2

03

5 min read

Improved self-service data access

Make accessing clean, usable data easy for those with the proper authorization

Three coworkers looking at computer monitor together

One of the most challenging aspects of uncovering insights and building models is finding the best data to use. Far too often, siloed, duplicate, incomplete, outdated and erroneous data makes the data preparation process take much longer than it should. In the worst cases, it can throw off insights or models and lead to the wrong business decisions being made. The convergence of data stores helps resolve these problems through a combination of data virtualization and the embedded governance contained within the query engine.

Data virtualization is a combination of data federation plus an abstraction layer. As such, it allows users to access multiple data sources simultaneously through a single access point. This method helps eliminate the data silo and data duplication problems by making all data accessible at once without the need for extract, transform and load (ETL) processes. As a result, data users can easily peruse and select the most relevant data, whether it happens to be in the database, data lake or somewhere else.
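To make the single-access-point idea concrete, here is a minimal Python sketch. It uses SQLite's ATTACH mechanism purely as a stand-in for federation; real data virtualization spans heterogeneous stores such as warehouses, data lakes and NoSQL databases, and the file, table and column names below are illustrative only.

```python
# Minimal sketch of the data virtualization idea: one SQL statement spans two
# separate data stores through a single access point, with no ETL copy step.
# SQLite's ATTACH is used here only to illustrate the pattern; real engines
# federate heterogeneous sources (warehouses, data lakes, NoSQL stores).
import sqlite3

# Two independent "data stores"
with sqlite3.connect("crm.db") as crm:
    crm.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, region TEXT)")
    crm.execute("DELETE FROM customers")
    crm.executemany("INSERT INTO customers VALUES (?, ?)",
                    [(1, "EMEA"), (2, "APAC")])

with sqlite3.connect("orders.db") as lake:
    lake.execute("CREATE TABLE IF NOT EXISTS orders (customer_id INTEGER, amount REAL)")
    lake.execute("DELETE FROM orders")
    lake.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 120.0), (1, 80.0), (2, 200.0)])

# Single access point: both sources are queried in place by one statement.
conn = sqlite3.connect("crm.db")
conn.execute("ATTACH DATABASE 'orders.db' AS lake")
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM customers AS c
    JOIN lake.orders AS o ON o.customer_id = c.id
    GROUP BY c.region
""").fetchall()
print(rows)   # e.g. [('APAC', 200.0), ('EMEA', 200.0)]
conn.close()
```

The join happens in the query, in place, with no intermediate ETL copy into a third store.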

A data and AI platform with data virtualization can cut ETL requests by 25% to 65%. 1

Of course, being able to select and use data easily is futile if the data itself is of low quality or if its validity can't be ascertained. For this reason, governance features, such as data cleansing and data cataloging, must be included in data access and availability efforts. Newer data cataloging solutions are capable of using machine learning (ML) to extract business glossaries from the data to form metadata terms and then using those terms to perform risk assessments on the data. Optimizing data access and availability relies on active metadata to automatically discover and catalog newly ingested data. In addition, data lineage helps users understand how old a data set might be and whether it comes from a trusted source. Governance solutions, therefore, not only help correct data where possible, but also give users a good indication of what data should and shouldn't be used.

74% of leaders improve data quality with extensive data cleansing in contrast to just 43% of laggards. 2

An additional element of governance that should be considered is access control. The increasing amount of regulation and heightened consumer interest in who has access to personal data reaffirm the need for data to be available only to those people with a legitimate need. Fortunately, the single access point provided by data virtualization makes establishing data privacy much easier: authorization and even data masking features for sensitive personally identifiable information (PII) can be added there rather than across myriad different sources, enhancing the scalability of governance.
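As a rough illustration of the kind of policy that can live at that single access point, the sketch below masks assumed PII columns for any caller whose role isn't authorized. The column names, roles and masking rule are hypothetical and stand in for whatever a governance layer would actually enforce.

```python
# Conceptual sketch (not a specific product API): applying role-based masking
# at the single access point so sensitive PII never reaches unauthorized users.
MASKED_COLUMNS = {"email", "ssn"}          # assumed sensitive columns
AUTHORIZED_ROLES = {"data_steward"}        # roles allowed to see raw PII

def mask(value: str) -> str:
    """Replace all but the last two characters with asterisks."""
    return "*" * max(len(value) - 2, 0) + value[-2:]

def apply_row_policy(row: dict, role: str) -> dict:
    """Return the row with PII columns masked unless the role is authorized."""
    if role in AUTHORIZED_ROLES:
        return row
    return {col: (mask(str(val)) if col in MASKED_COLUMNS else val)
            for col, val in row.items()}

row = {"name": "A. Kumar", "email": "akumar@example.com", "ssn": "123-45-6789"}
print(apply_row_policy(row, role="analyst"))
# name passes through; email and ssn come back masked except for the last two characters
```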

2. Build Your Trust Advantage: Leadership in the era of data and AI everywhere, IBM Institute for Business Value, 26 February 2021

04

4 min read

Support for new data types and sources

Combine graph, blockchain and audiovisual data for deeper business insights

Man wearing headphones typing on keyboard and looking at computer monitor displaying five separate sound waves

Data store convergence allows businesses to put an increased emphasis on new sources of data that may have been previously overlooked. By utilizing a universal query engine across any source or type of data, without the need for data movement or replication, enterprises can fully exploit new data sources. These new sources can give a more complete, accurate picture of business realities and help drive better decision-making and business outcomes. While audiovisual and other unstructured data will be covered in an upcoming section on open source, graph and blockchain will be explored in depth here.

Traditionally, businesses have maintained separate query engines for business intelligence (BI) and AI workloads. The convergence of data stores is made possible by a single query engine that can serve both BI and AI applications.

Graph database functionality has previously been incompatible with online transaction processing (OLTP) databases. This issue introduced considerable manual effort, as Structured Query Language (SQL) data needed to be pulled from its system and put into the correct format within a separate graph database before graph applications could run on it. Data store convergence simplifies the process by introducing a database with built-in graph functionality. The underlying SQL engine can query the graph data directly while graph applications can query data directly from relational tables. In this way, graph and traditional data can be used together for an even greater effect than either would deliver separately.
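A small, self-contained Python sketch can illustrate why a graph view over relational rows is valuable: rows from a hypothetical follows table become edges, and a bounded breadth-first traversal answers a relationship question that is awkward to express as plain joins. A database with built-in graph functionality does this natively over its own tables; the code below only mirrors the idea.

```python
# Conceptual sketch: relational rows (source, target) viewed as a graph,
# then traversed to answer "who is reachable from alice within two hops?"
# A hypothetical "follows" table stands in for data already in the database.
from collections import defaultdict, deque

follows_rows = [("alice", "bob"), ("bob", "carol"),
                ("carol", "dave"), ("alice", "erin")]

# Build an adjacency list, i.e. a graph view over the relational rows
graph = defaultdict(list)
for source, target in follows_rows:
    graph[source].append(target)

def reachable_within(start: str, max_hops: int) -> set:
    """Breadth-first traversal bounded by a hop count."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen - {start}

print(reachable_within("alice", 2))   # e.g. {'bob', 'carol', 'erin'}
```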

Similarly, the highly compressed and extremely valuable data contained within blockchain ledgers has been difficult to access and analyze. A modernized database, however, is able to connect natively with blockchain to surface the data within a ledger and present it as a relational table within the database. As such, blockchain data can be used alongside other forms of data within the database for more robust insight without building custom back ends or ad hoc reporting. Thus, AI developers can even incorporate blockchain data sets as a primary data source for apps or to provide additional detail.

05

4 min read

Better access to open-source capabilities

Take better advantage of unstructured data with enterprise-grade Hadoop, MongoDB and PostgreSQL

A man looks at his smartwatch against a colorful background

Though most businesses have been taking advantage of unstructured data for a while, simplifying data access and availability can help make the process of embracing enterprise-grade open-source data stores much easier. It does so through the same platform-based approach mentioned previously. Much like a proprietary database, open-source solutions can be containerized and spun up as needed on a data and AI platform.

Take Apache Hadoop as an example. Though some see it as the ideal base for data lake implementations, many businesses have realized that it takes considerable management effort to maintain such an environment properly. Simplifying data access and availability streamlines the data architecture with a single offering that can handle all analytics workloads, from big data to warehousing, and when paired with the virtualization capability, it lets enterprises take full advantage of open-source data stores. Furthermore, a platform approach not only allows Hadoop implementations to be created and scaled easily, but it also introduces enterprise-grade capabilities from sources like Cloudera and IBM, which are specifically designed to add needed functionality and additional security. As part of the platform, these enterprise-grade services can be introduced just as quickly as Hadoop itself.

89% of developers are using at least one open-source database. 1

The same can be said of MongoDB and PostgreSQL. MongoDB is highly useful for JavaScript Object Notation (JSON) document storage and high-volume data storage, while PostgreSQL is an object-relational database used when a transactional, standards-compliant solution with atomicity, consistency, isolation and durability (ACID) guarantees is needed out of the box. While businesses may not have a specific use in mind for these capabilities when they first modernize, their availability on the platform and the ease with which they can be introduced to the architecture help, in part, to future-proof the business against unexpected needs. Moreover, other components on the platform, such as AI-powered search and text analytics that understand a specific industry's unique language and pull insights from documents and other text-based sources, can be introduced to heighten the value of these open-source storage solutions.

1. 5 Great Open-Source Database Solutions, Nordic APIs, 2 July 2020

06

4 min read

Lower total cost of ownership

Achieve greater efficiencies through better integration, automation and optimization

Data virtualization reduces ETL

While many of the advantages of optimizing data access and availability focus on using data to easily deliver more accurate insights and models faster, the potential efficiencies and cost savings can’t be ignored. Even if a business feels content with its current level of insight, a lower total cost of ownership (TCO) is undeniably valuable.

One data and AI platform was shown to reduce infrastructure management effort 65% to 85%. 1

Consider the platform-based approach to simplifying data access and availability: enterprises can reduce their expensive warehouse footprint by minimizing the movement of data for analytics. For example, with warehouse-grade query capability over data lakes, businesses no longer need to migrate data into data warehouses. In addition, with the database, data warehouse, data lake and fast data in one platform, the cost is substantially less than purchasing each one separately, particularly when the costs of connecting piecemeal solutions are considered, and compute and storage can be scaled independently. Moreover, the data virtualization built into such platforms reduces or eliminates the need for ETL processes, which, beyond being an inefficient use of personnel, can accrue data transfer fees that add up to a significant amount. The same can be said for the inclusion of multi-model capabilities within a database: including graph database functionality within a traditional SQL database eliminates the necessity, and therefore the expense, of a separate graph database.

Yet, cost savings from optimized data access and availability go beyond integration. Automation and optimization have been bolstered recently by the inclusion of ML. Workload management and resource optimization, in particular, have benefited from these advancements. Using an ML feedback mechanism, a database can monitor and compare the expected and actual runtimes of various workloads, then adjust its predictions to better allocate workloads the next time. This process is much more efficient than manual tuning and has resulted in overall database performance improvements of up to 30%2 and reduced manual effort.
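A drastically simplified sketch of that feedback loop is shown below; it assumes the predictor is just an exponential moving average of observed runtimes, whereas real ML-based workload managers use far richer features and models.

```python
# Toy feedback loop: predict a workload's runtime, observe the actual runtime,
# then pull the prediction toward what really happened. This only shows the
# feedback idea, not how a production workload manager models runtimes.
class RuntimePredictor:
    def __init__(self, learning_rate: float = 0.3):
        self.learning_rate = learning_rate
        self.estimates = {}            # workload name -> predicted seconds

    def predict(self, workload: str, default: float = 60.0) -> float:
        return self.estimates.get(workload, default)

    def observe(self, workload: str, actual_seconds: float) -> None:
        """Blend the new observation into the running estimate."""
        current = self.predict(workload, default=actual_seconds)
        self.estimates[workload] = (
            (1 - self.learning_rate) * current
            + self.learning_rate * actual_seconds
        )

predictor = RuntimePredictor()
for observed in [40.0, 45.0, 80.0]:       # the nightly report keeps running long
    predictor.observe("nightly_report", observed)
print(round(predictor.predict("nightly_report"), 1))   # estimate drifts upward
```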

2. According to IBM internal testing

07

5 min read

Faster reporting and data analytics

Get the fast data and streaming solutions you need for scalable real-time insight while speeding queries on historical data, too

Yellow-orange lights with cars blurred in motion on highway and circular ramps at night

One thing that hasn’t changed since businesses first began using data in earnest is the goal of getting insights even faster than they have previously. It’s no surprise then that optimizing data availability seeks to deliver increased insight speed, as well. With a platform-based approach, this speed increase applies to more traditional data, as well as the relatively newer streaming data.

The convergence of data stores solves the challenges of maintaining a separate data warehouse and data lake. In the case of data lakes, the universal query engine applies warehouse-grade querying over data lakes, and the embedded governance addresses poor data quality and consistency issues. In the case of data warehouses, enterprises no longer need to store large data sets in expensive warehouse storage, since queries can be performed directly over data lakes. Additionally, the universal query engine addresses the inability of data warehouses to easily handle unstructured or open data formats.

Furthermore, efforts to optimize data access and availability should ensure that in-memory computing, with column-based shadow tables, can be used within the database. This ability enables OLTP and online analytical processing (OLAP) workloads to be run in parallel so that answers to queries can be returned sooner. Data skipping for compressed data is similarly vital. By using metadata to identify which compressed data is irrelevant to a query and bypassing it, unnecessary input/output (I/O) is avoided and query execution is sped up significantly. Newer technologies are also helping to speed the querying process. ML query optimization is gaining traction as an improvement over previous, cost-based models. The reason is that while previous models could propose the best query execution strategies, they couldn't adjust to the actual results. ML makes this adjustment possible: the optimal query execution strategy is automatically revised based on query performance to help ensure the most up-to-date execution strategy. This method has the potential to provide 8 to 10 times faster query execution in some cases.1
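Data skipping, mentioned above, can be sketched in a few lines if we assume each compressed block carries min/max metadata for a column; the synthetic blocks and date range below are illustrative. Blocks whose range cannot contain the predicate are never read.

```python
# Toy data skipping: per-block min/max metadata lets a query touch only the
# blocks that could possibly satisfy the predicate, avoiding unnecessary I/O.
blocks = [
    {"id": 0, "min_order_date": "2021-01-01", "max_order_date": "2021-03-31"},
    {"id": 1, "min_order_date": "2021-04-01", "max_order_date": "2021-06-30"},
    {"id": 2, "min_order_date": "2021-07-01", "max_order_date": "2021-09-30"},
]

def blocks_to_scan(blocks, low: str, high: str):
    """Keep only blocks whose [min, max] range overlaps the queried range."""
    return [b["id"] for b in blocks
            if not (b["max_order_date"] < low or b["min_order_date"] > high)]

# Query: WHERE order_date BETWEEN '2021-05-01' AND '2021-05-31'
print(blocks_to_scan(blocks, "2021-05-01", "2021-05-31"))   # [1]; blocks 0 and 2 are skipped
```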

For the fastest insights, though, businesses must consider how they capture and use streaming data. The Internet of Things (IoT), clickstream, mobile and social data are just a few examples of real-time information that has value to an organization in the moment. New fast data solutions can store and analyze as many as 250 billion events per day with three nodes, something that would have taken previous technologies nearly 100 nodes to accomplish. Open file formats such as Apache Parquet, stored on object storage, can also be used, which can provide a speed boost over Hadoop. However, one of the most interesting capabilities of newer fast data solutions is the ability to combine historical and streaming data for insights in near real time. This process provides better context for quick decisions and can make the difference between a good decision and the ideal one.
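As a minimal illustration of blending the two, the sketch below enriches a single streaming event with a precomputed historical baseline; the customer IDs, amounts and threshold are made up.

```python
# Conceptual sketch: enriching a real-time event with historical context so
# the decision is made on both, as the fast data pattern above describes.
historical_avg_purchase = {"cust-42": 87.50}   # assumed precomputed from the data lake

def score_event(event: dict) -> str:
    """Flag a streaming purchase that is far above the customer's historical average."""
    baseline = historical_avg_purchase.get(event["customer"], 0.0)
    return "review" if baseline and event["amount"] > 3 * baseline else "ok"

print(score_event({"customer": "cust-42", "amount": 400.0}))  # review
```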

1. According to IBM internal testing

08

6 min read

Advanced querying and model building

Build better models and draw insights with confidence using the right data science tools and language support

Six illustrations on black background of types of analytical graphs that represent data

One of the most pressing reasons for optimal data access and availability in an enterprise is to take advantage of a heightened level of insight possible with advanced querying and AI models. The ability to use a single query across big data, data warehouse and streaming data simplifies the data landscape and ensures businesses can accelerate time to value by easily accessing data and rapidly extracting value from it. Also, the combination of an AI-ready database and a data and AI platform with access to such tools makes doing so much easier.

Ideally, databases should have confidence-based querying built in. At its core, confidence-based querying replaces traditional yes-or-no query outputs with probability results. Instead of providing a single answer, it gives the requestor a range of options with a percentage of how well they match the query. So, if an answer falls outside of the parameters on just a single variable, but is otherwise perfect, that answer would be returned with a high confidence score. This method helps mitigate the potential to overlook certain valuable insights. While not a new concept, having it built into the database automatically saves time and heightens the levels of integration. In addition, the ability to perform this operation using SQL expressions makes it more accessible for non-data scientists.
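The underlying SQL syntax isn't spelled out here, so the following Python sketch only illustrates the concept: instead of discarding a row the moment one predicate fails, each candidate is scored by how closely it matches, and near-misses are returned with lower confidence rather than dropped.

```python
# Conceptual sketch of confidence-based matching: rather than a strict
# yes-or-no filter, each row is scored by the fraction of criteria it meets.
candidates = [
    {"name": "A", "price": 95,  "rating": 4.6, "in_stock": True},
    {"name": "B", "price": 105, "rating": 4.8, "in_stock": True},   # just over budget
    {"name": "C", "price": 80,  "rating": 3.1, "in_stock": False},
]

criteria = [
    lambda row: row["price"] <= 100,
    lambda row: row["rating"] >= 4.5,
    lambda row: row["in_stock"],
]

def confidence(row) -> float:
    """Fraction of criteria satisfied, returned as a match score."""
    return sum(check(row) for check in criteria) / len(criteria)

for row in sorted(candidates, key=confidence, reverse=True):
    print(row["name"], round(confidence(row), 2))
# A 1.0, B 0.67, C 0.33  (B is returned with lower confidence, not discarded)
```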

Of course, for data scientists, support for popular programming languages and libraries is vital to drive more rapid app development. Popular programming language support allows developers and data scientists to keep their data in an enterprise database but still use their existing knowledge base and libraries. Important languages and libraries to ensure are available as part of the database include Python, JSON, Go, Ruby, PHP, Java, Node.js, Sequelize and Jupyter Notebooks. In addition, Representational State Transfer (REST) application programming interfaces (APIs) can be particularly useful for recurring tasks, running queries based on predefined criteria and providing notifications when there are changes.
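As an example of the REST pattern described above, the snippet below submits a predefined SQL statement to a hypothetical REST SQL endpoint. The URL, path and payload fields are placeholders rather than a documented API; an actual request would follow whatever the chosen database's REST service expects.

```python
# Hypothetical example of running a recurring, predefined query over REST.
# The endpoint URL, path and payload fields below are placeholders only.
import requests

BASE_URL = "https://db.example.com/api/v1"       # placeholder service URL
HEADERS = {"Authorization": "Bearer <token>",    # placeholder credentials
           "Content-Type": "application/json"}

def run_daily_sales_report() -> dict:
    """Submit a predefined query and return the parsed JSON response."""
    payload = {"sql": "SELECT region, SUM(amount) FROM sales GROUP BY region"}
    response = requests.post(f"{BASE_URL}/sql_jobs", json=payload,
                             headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(run_daily_sales_report())
```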

A data and AI platform approach helps extend the query and model support of the database by making it easier to build, run and trust AI models. For example, capabilities exist that automate AI lifecycle management, allow visual building of models and enable deployment of models with just one click. Once they're up and running, models can be dynamically retrained and monitored for fairness, explainability and drift. They can also be tested to mitigate regulatory, reputational and operational risk. Taken together, this combination means that data scientists and developers have a much easier time crafting and deploying models while also making sure that the models remain accurate and unbiased over time.

By 2025, 50% of data scientist activities will be automated by artificial intelligence, easing the acute talent shortage. 1

Model personalization can also be useful, customizing models to particular industries. Fueled by natural language processing (NLP), AI-powered search and text analytics are available that understand a specific industry's unique language and can pull insights from documents and other text-based sources. And for those organizations wishing to avoid annoying their customers with unresponsive chatbots, AI assistants can learn from customer conversations in an effort to resolve issues more accurately.

09

3 min read

A modern data solution delivering these benefits

Modernize your data with IBM Cloud Pak for Data and IBM’s full suite of data management products

Man with glasses sitting on steps working on laptop

While optimizing data access and availability may at first seem like a daunting task, the inherent flexibility provided by a platform approach means that it might be much easier than you think, and the benefits are considerable. IBM Cloud Pak for Data is able to deliver each of the benefits mentioned previously, and does so in a way that can provide a significant Total Economic Impact: a net present value of USD 5.8 million to USD 9.7 million over three years.1 It provides data virtualization and the option to use a host of containerized solutions to collect, organize and analyze data on top of a solid Red Hat OpenShift foundation, which allows it to run on essentially any cloud, whether IBM's or another vendor's.

IBM Cloud Pak for Data AutoSQL provides the universal query engine capability, so organizations need only one query engine for big data and warehousing, eliminating the need to separate BI and AI workloads. Additionally, the containerized solutions available include the IBM Db2® database, a leading AI database that is both powered by AI to deliver a better database experience and built for AI to encourage model building and insight. The data warehousing, data lake and fast data solutions within the Db2 family are also available, providing a robust data management offering that goes far beyond traditional OLTP.

Learn more by visiting the links in the “Next steps” section that follows or reach out to one of our experts who would be happy to have a conversation with you at no cost.