MapReduce is a programming model that uses parallel processing to speed large-scale data processing. MapReduce enables massive scalability across hundreds or thousands of servers within a Hadoop cluster.
The name "MapReduce" refers to the 2 tasks that the model performs to help “chunk” a large data processing task into many smaller tasks that can run faster in parallel.
The MapReduce programming model is one of the core processing components at the heart of Apache Hadoop, an open source software framework supported by the Apache Software Foundation and built on the Java programming language.
The MapReduce model can be used outside of Hadoop as well.
MapReduce helps chunk data processing projects into smaller pieces so they can run faster.
For example, analyzing a huge file containing terabytes or petabytes of data with a single processor might take 10 hours. A MapReduce job can split that same file into 10 tasks that run in parallel on 10 processors, so the job might finish in an hour or less. The results from the distributed computing nodes can then be aggregated to produce the final answer.
While mapping and reducing are the essential functions of the MapReduce model, the overall process includes a few more steps.
A MapReduce application accepts input data, which can include structured or unstructured data. MapReduce applications usually work with input files stored in the Hadoop distributed file system (HDFS), but the model can work with other data sources as well. (For more information, see “MapReduce within the Hadoop ecosystem.”)
For a specific job, the MapReduce framework helps select distributed servers, manage communications and data transfers and support fault tolerance and redundancy.
The input data is split into smaller blocks. These blocks are distributed to mappers—the functions that perform mapping in the next step—located across various computing nodes. The MapReduce framework aims for a roughly uniform assignment of data across mappers for efficient load balancing.
On each node, the map function processes the data it receives, converting the data into key/value pairs.
The Hadoop framework determines the total number of mappers, based on the total volume of data and how it is split into blocks. Parameters for the mappers, reducers and output formats can be set within the Hadoop cluster.
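To make the map step concrete, here is a minimal sketch of a Hadoop mapper in Java, using the classic word-count example (the class name is illustrative, and the input is assumed to be plain lines of text):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each line of input, emit a <word, 1> key/value pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // key/value pair handed to the shuffle step
            }
        }
    }
}
```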
The Hadoop framework sorts the map outputs and assigns all key/value pairs with the same "key" (topic) to the same reducer. For example, in a data set of cities and their daily high temperatures, any data with the “Tokyo” key would go to the same reducer.
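Behind the scenes, this routing is handled by a partitioner. The sketch below mirrors the behavior of Hadoop's default HashPartitioner: because the partition number is derived from the key's hash, every pair with the same key, such as "Tokyo", lands on the same reducer.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors Hadoop's default HashPartitioner: the same key always maps to the
// same partition number, so all of its values end up at the same reducer.
public class CityPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```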
The reducer, as the name suggests, is the function that performs the reduce step.
Reduce functions process the key/value pairs that the mappers emit. This can involve merging, tabulating or performing other operations on the data, depending on the kind of processing required.
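Continuing the word-count illustration from the map step, a matching reducer might look like the following minimal sketch; it receives each word together with every count emitted for it and tabulates a total:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: sum every count emitted for a given word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total)); // written out by the job's output format
    }
}
```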
Mapping and reducing can run on the same set of servers, but they don't have to.
Each reducer writes the results of its processing to HDFS or another data store.
No matter the amount of data an organization wants to analyze, the key principles remain the same.
For this example, the data set includes cities (the keys) and the corresponding daily temperatures (the values) recorded for each city. A sample key/value pair might look like this: <Toronto, 18>.
The data is spread across multiple files. Each file might include data from a mix of cities, and it might include the same city multiple times.
From this data set, the user wants to identify the "maximum temperature" for each city across the tracked period.
An implementation of MapReduce to handle this job, written against Hadoop's Java API, might look like the following sketch (the class names and the input and output paths are illustrative):
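```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Mapper: each input line is assumed to look like "Toronto,18".
    public static class TemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text city = new Text();
        private final IntWritable temperature = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 2) {
                city.set(fields[0].trim());
                temperature.set(Integer.parseInt(fields[1].trim()));
                // Emit <city, temperature>, for example <Toronto, 18>.
                context.write(city, temperature);
            }
        }
    }

    // Reducer: receives every temperature recorded for one city and keeps the maximum.
    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text city, Iterable<IntWritable> temperatures, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temperatures) {
                max = Math.max(max, t.get());
            }
            context.write(city, new IntWritable(max));
        }
    }

    // Driver: configures and submits the job; input and output paths come from the command line.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature per city");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class); // optional local pre-aggregation on each mapper node
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // for example, an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper emits one <city, temperature> pair per record; the framework groups the pairs by city, and each reducer keeps only the largest value it sees for its assigned cities.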
The Hadoop ecosystem is a suite of open source modules designed to work together to promote efficiency. Four modules constitute the main Hadoop framework: MapReduce plus the three described below.
HDFS is a distributed file system for storing application data on up to thousands of commodity servers.
HDFS is designed to provide fault tolerance for Hadoop and provide fast access to data. By default, data blocks are replicated across multiple nodes at load or write time.
The HDFS architecture features a NameNode to manage the file system namespace and file access, along with multiple DataNodes to manage data storage.
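For example, the replication factor of a file stored in HDFS can be inspected or adjusted through Hadoop's Java FileSystem API, as in this sketch (the file path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read and adjust the replication factor of a single HDFS file.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses the cluster's default HDFS settings
        Path file = new Path("/data/temperatures.csv");      // placeholder path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication factor: " + status.getReplication());

        // Ask HDFS to keep three copies of each block of this file.
        fs.setReplication(file, (short) 3);
    }
}
```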
Apache Hadoop YARN (Yet Another Resource Negotiator) is an open source framework for job scheduling and cluster resource management. It supports multiple workloads, such as SQL queries, advanced modeling and real-time streaming.
Hadoop Common, also known as Hadoop Core, is a collection of utilities and libraries that support the other Hadoop modules. It includes resources for automatic failure recovery, file-system-level and operating-system-level abstraction, Java Archive (JAR) files and scripts, and more.
In addition to the main Hadoop framework, the Hadoop ecosystem includes other open source modules and tools that extend its capabilities.
MapReduce programming offers several benefits that can help organizations gain valuable insights from big data analytics, such as scalability across commodity hardware, built-in fault tolerance and faster processing through parallelism.
MapReduce can efficiently handle many straightforward use cases, although complex data processing tasks might run faster on other frameworks. Jobs that scan large data sets and aggregate results by key, like the temperature example above, are well suited to MapReduce's parallel processing approach.
The MapReduce programming paradigm was created in 2004 by Google computer scientists Jeffrey Dean and Sanjay Ghemawat. The goal of the MapReduce model is to simplify the transformation and analysis of large data sets through massively parallel processing on large clusters of commodity hardware. It also enables programmers to focus on algorithms rather than data management.
While Google introduced the first MapReduce framework, Apache Hadoop MapReduce is perhaps the most popular.
MapReduce played a key role in advancing big data analytics, but it does have its drawbacks. For example, data is usually not retained in memory by MapReduce, and iterative logic is possible only by chaining MapReduce apps together. These factors add greater complexity and can lead to longer processing times.
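To make the second limitation concrete, iterative logic is usually expressed by chaining jobs in a driver, with each job writing its full output to disk for the next job to read back in. A minimal sketch might look like this (the paths and job names are illustrative, and the mapper and reducer classes are omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: chaining two MapReduce jobs. Each job materializes its full output on disk,
// and the next job reads it back in, which is where much of the extra latency comes from.
public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First pass over the raw input (mapper/reducer classes omitted; Hadoop defaults apply).
        Job first = Job.getInstance(conf, "first pass");
        first.setJarByClass(ChainedJobsDriver.class);
        FileInputFormat.addInputPath(first, new Path("/data/input"));          // placeholder
        FileOutputFormat.setOutputPath(first, new Path("/data/intermediate")); // placeholder
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // Second pass consumes the intermediate results written by the first job.
        Job second = Job.getInstance(conf, "second pass");
        second.setJarByClass(ChainedJobsDriver.class);
        FileInputFormat.addInputPath(second, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("/data/output"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```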
While MapReduce remains widely used—especially in legacy systems—many organizations are moving to faster or more specialized frameworks, such as Apache Spark, for big data applications.