MapReduce is a programming model that uses parallel processing to speed large-scale data processing. MapReduce enables massive scalability across hundreds or thousands of servers within a Hadoop cluster.
The name "MapReduce" refers to the 2 tasks that the model performs to help “chunk” a large data processing task into many smaller tasks that can run faster in parallel.
The MapReduce programming model is one of the core processing components at the heart of Apache Hadoop, an open source software framework supported by the Apache Software Foundation and built on the Java programming language.
The MapReduce model can be used outside of Hadoop as well.
MapReduce helps chunk data processing projects into smaller pieces so they can run faster.
For example, analyzing a huge file containing terabytes or petabytes of data with a single processor might take 10 hours. A MapReduce job can split that same file into 10 tasks that run in parallel on 10 processors, so the job might finish in an hour or less. The results from the distributed computing nodes can then be aggregated to produce the final answer.
While mapping and reducing are the essential functions of the MapReduce model, the overall process includes a few more steps.
A MapReduce application accepts input data, which can include structured or unstructured data. MapReduce applications usually work with input files stored in the Hadoop distributed file system (HDFS), but the model can work with other data sources as well. (For more information, see “MapReduce within the Hadoop ecosystem.”)
For a specific job, the MapReduce framework helps select distributed servers, manage communications and data transfers and support fault tolerance and redundancy.
The input data is split into smaller blocks. These blocks are distributed to mappers—the functions that perform mapping in the next step—located across various computing nodes. The MapReduce framework aims for a roughly uniform assignment of data across mappers for efficient load balancing.
On each node, the map function processes the data it receives, converting the data into key/value pairs.
The Hadoop framework determines the total number of mappers, based on the total volume of data and how it is split into blocks. Parameters for the mappers, reducers and output formats can be set within the Hadoop cluster.
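To make the map step concrete, here is a minimal sketch of a Hadoop mapper in Java, using the classic word-count example (the class name is illustrative, and the input is assumed to be plain lines of text):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each line of input, emit a <word, 1> key/value pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // key/value pair handed to the shuffle step
            }
        }
    }
}
```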
The Hadoop framework sorts the map outputs and assigns all key/value pairs with the same "key" (topic) to the same reducer. For example, in a data set of cities and their daily high temperatures, any data with the “Tokyo” key would go to the same reducer.
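Behind the scenes, this routing is handled by a partitioner. The sketch below mirrors the behavior of Hadoop's default HashPartitioner: because the partition number is derived from the key's hash, every pair with the same key, such as "Tokyo", lands on the same reducer.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors Hadoop's default HashPartitioner: the same key always maps to the
// same partition number, so all of its values end up at the same reducer.
public class CityPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```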
The reducer, as the name suggests, is the function that performs the reduce step.
Reduce functions process the key/value pairs that the mappers emit. This can involve merging, tabulating or performing other operations on the data, depending on the kind of processing required.
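Continuing the word-count illustration from the map step, a matching reducer might look like the following minimal sketch; it receives each word together with every count emitted for it and tabulates a total:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: sum every count emitted for a given word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total)); // written out by the job's output format
    }
}
```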
Mapping and reducing can run on the same set of servers, but they don't have to.
Each reducer writes the results of its processing to HDFS or another data store.
No matter the amount of data an organization wants to analyze, the key principles remain the same.
For this example, the data set includes cities (the keys) and the corresponding daily temperatures (the values) recorded for each city. A sample key/value pair might look like this: <Toronto, 18>.
The data is spread across multiple files. Each file might include data from a mix of cities, and it might include the same city multiple times.
From this data set, the user wants to identify the "maximum temperature" for each city across the tracked period.
An implementation of MapReduce to handle this job, written against Hadoop's Java API, might look like the following sketch (the class names and the input and output paths are illustrative):
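```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Mapper: each input line is assumed to look like "Toronto,18".
    public static class TemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text city = new Text();
        private final IntWritable temperature = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 2) {
                city.set(fields[0].trim());
                temperature.set(Integer.parseInt(fields[1].trim()));
                // Emit <city, temperature>, for example <Toronto, 18>.
                context.write(city, temperature);
            }
        }
    }

    // Reducer: receives every temperature recorded for one city and keeps the maximum.
    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text city, Iterable<IntWritable> temperatures, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temperatures) {
                max = Math.max(max, t.get());
            }
            context.write(city, new IntWritable(max));
        }
    }

    // Driver: configures and submits the job; input and output paths come from the command line.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature per city");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class); // optional local pre-aggregation on each mapper node
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // for example, an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper emits one <city, temperature> pair per record; the framework groups the pairs by city, and each reducer keeps only the largest value it sees for its assigned cities.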
The Hadoop ecosystem is a suite of open source modules designed to work together to promote efficiency. Four modules constitute the main Hadoop framework: MapReduce plus the three described below.
HDFS is a distributed file system for storing application data on up to thousands of commodity servers.
HDFS is designed to provide fault tolerance for Hadoop and provide fast access to data. By default, data blocks are replicated across multiple nodes at load or write time.
The HDFS architecture features a NameNode to manage the file system namespace and file access, along with multiple DataNodes to manage data storage.
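For example, the replication factor of a file stored in HDFS can be inspected or adjusted through Hadoop's Java FileSystem API, as in this sketch (the file path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read and adjust the replication factor of a single HDFS file.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses the cluster's default HDFS settings
        Path file = new Path("/data/temperatures.csv");      // placeholder path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication factor: " + status.getReplication());

        // Ask HDFS to keep three copies of each block of this file.
        fs.setReplication(file, (short) 3);
    }
}
```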
Apache Hadoop YARN (Yet Another Resource Negotiator) is an open source framework for job scheduling and cluster resource management. It supports multiple workloads, such as SQL queries, advanced modeling and real-time streaming.
Hadoop Common, also known as Hadoop Core, is a collection of utilities and libraries that support the other Hadoop modules. It includes resources for automatic failure recovery, file-system-level and operating-system-level abstraction, Java Archive (JAR) files and scripts, and more.
In addition to the main Hadoop framework, the Hadoop ecosystem includes other open source modules and tools that extend its capabilities.
MapReduce programming offers several benefits that can help organizations gain valuable insights from big data analytics, such as scalability across commodity hardware, built-in fault tolerance and faster processing through parallelism.
MapReduce can efficiently handle many straightforward use cases, although complex data processing tasks might run faster on other frameworks. Jobs that scan large data sets and aggregate results by key, like the temperature example above, are well suited to MapReduce's parallel processing approach.
The MapReduce programming paradigm was created in 2004 by Google computer scientists Jeffrey Dean and Sanjay Ghemawat. The goal of the MapReduce model is to simplify the transformation and analysis of large data sets through massively parallel processing on large clusters of commodity hardware. It also enables programmers to focus on algorithms rather than data management.
While Google introduced the first MapReduce framework, Apache Hadoop MapReduce is perhaps the most popular.
MapReduce played a key role in advancing big data analytics, but it does have its drawbacks. For example, data is usually not retained in memory by MapReduce, and iterative logic is possible only by chaining MapReduce apps together. These factors add greater complexity and can lead to longer processing times.
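To make the second limitation concrete, iterative logic is usually expressed by chaining jobs in a driver, with each job writing its full output to disk for the next job to read back in. A minimal sketch might look like this (the paths and job names are illustrative, and the mapper and reducer classes are omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: chaining two MapReduce jobs. Each job materializes its full output on disk,
// and the next job reads it back in, which is where much of the extra latency comes from.
public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First pass over the raw input (mapper/reducer classes omitted; Hadoop defaults apply).
        Job first = Job.getInstance(conf, "first pass");
        first.setJarByClass(ChainedJobsDriver.class);
        FileInputFormat.addInputPath(first, new Path("/data/input"));          // placeholder
        FileOutputFormat.setOutputPath(first, new Path("/data/intermediate")); // placeholder
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // Second pass consumes the intermediate results written by the first job.
        Job second = Job.getInstance(conf, "second pass");
        second.setJarByClass(ChainedJobsDriver.class);
        FileInputFormat.addInputPath(second, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("/data/output"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```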
While MapReduce remains widely used—especially in legacy systems—many organizations are moving to faster or more specialized frameworks, such as Apache Spark, for big data applications.