What is MapReduce?

19 November 2024

Authors

Matthew Kosinski

Enterprise Technology Writer

MapReduce is a programming model that uses parallel processing to speed up large-scale data processing. MapReduce enables massive scalability across hundreds or thousands of servers within a Hadoop cluster.

The name "MapReduce" refers to the 2 tasks that the model performs to help “chunk” a large data processing task into many smaller tasks that can run faster in parallel.

  • First is the "map task," which takes one set of data and converts it into another set of data formatted as key/value pairs, such as cities (keys) and daily high temperatures (values).

  • Second is the "reduce task," which takes the outputs from a map task, aggregates all values with the same key and processes the data to produce a final set of key/value pairs.
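As a minimal sketch of the shape of these two tasks, the following Python functions use the cities-and-temperatures example. The names `map_task` and `reduce_task`, the comma-separated record format and the choice of "maximum" as the aggregation are assumptions made for illustration; they are not part of any Hadoop API.

```python
from typing import Iterable, Tuple

def map_task(record: str) -> Tuple[str, int]:
    """Convert one raw record such as 'Tokyo, 31' into a (key, value) pair."""
    city, temp = record.split(",")
    return city.strip(), int(temp)

def reduce_task(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Aggregate every value that shares one key into a single output pair."""
    return key, max(values)

print(map_task("Tokyo, 31"))               # ('Tokyo', 31)
print(reduce_task("Tokyo", [31, 38, 35]))  # ('Tokyo', 38)
```

The map function sees one record at a time and the reduce function sees one key at a time, which is what allows a framework to run many copies of each in parallel.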

The MapReduce programming model is one of the core processing components at the heart of Apache Hadoop, an open source software framework supported by the Apache Software Foundation and built on the Java programming language.

The MapReduce model can be used outside of Hadoop as well.

How MapReduce works

MapReduce helps chunk data processing projects into smaller pieces so they can run faster.

Using one processor to analyze a huge file with terabytes or petabytes of data might, for example, take 10 hours. A MapReduce job can split that same data file into 10 tasks that run in parallel on 10 processors. Under ideal conditions, this job might take about an hour, plus some overhead for coordinating the tasks. The partial results from the distributed computing nodes are then aggregated to produce a final result.
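The split, process and aggregate pattern can be sketched on a single machine. This example is only an illustration of the structure: the data, the 10-way split and the use of a thread pool are assumptions, and Python threads will not deliver the near-linear speedup that separate processors in a cluster can.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))   # stand-in for a very large input file
num_workers = 10

# Split the data into 10 contiguous chunks, one per worker.
chunk_size = len(data) // num_workers
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Each worker processes its own chunk independently of the others.
with ThreadPoolExecutor(max_workers=num_workers) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Aggregate the partial results into one final result.
total = sum(partial_sums)
print(len(chunks), total)
```

With perfectly even chunks and no coordination cost, the ideal running time is the single-processor time divided by the number of workers, which is where the figure of roughly 1 hour instead of 10 comes from.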

While mapping and reducing are the essential functions of the MapReduce model, the overall process includes a few more steps.

Input

A MapReduce application accepts input data, which can include structured or unstructured data. MapReduce applications usually work with input files stored in the Hadoop distributed file system (HDFS), but the model can work with other data sources as well. (For more information, see “MapReduce within the Hadoop ecosystem.”)

For a specific job, the MapReduce framework helps select distributed servers, manage communications and data transfers and support fault tolerance and redundancy.

Splitting

The input data is split into smaller blocks. These blocks are distributed to mappers—the functions that perform mapping in the next step—located across various computing nodes. The MapReduce framework aims for a roughly uniform assignment of data across mappers for efficient load balancing.
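A rough sketch of uniform splitting follows. It divides a list of records into blocks whose sizes differ by at most one. The function name `split_into_blocks` is hypothetical, and splitting by record count is a simplification: a real framework typically splits input by byte ranges rather than by counting records.

```python
def split_into_blocks(records, num_mappers):
    """Divide records into num_mappers blocks whose sizes differ by at most one."""
    base, extra = divmod(len(records), num_mappers)
    blocks, start = [], 0
    for i in range(num_mappers):
        # The first `extra` blocks each take one additional record.
        size = base + (1 if i < extra else 0)
        blocks.append(records[start:start + size])
        start += size
    return blocks

records = [f"record-{n}" for n in range(10)]
blocks = split_into_blocks(records, 3)
print([len(block) for block in blocks])  # [4, 3, 3]
```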

Mapping

On each node, the map function processes the data it receives, converting the data into key/value pairs.

The total number of mappers is determined by the Hadoop framework, typically one mapper per input split, so it scales with the total volume of data and the configured split size (often the HDFS block size). Parameters for the mappers, reducers and output formats can be set within the Hadoop cluster.
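As a back-of-the-envelope illustration, the mapper count can be estimated from the input size. This sketch assumes one map task per input split and a split size of 128 MB; both are assumptions for the example, and the actual values depend on the cluster and job configuration. The helper `estimate_map_tasks` is hypothetical.

```python
def estimate_map_tasks(input_bytes, split_bytes=128 * 1024 * 1024):
    """Estimate map tasks as ceil(input size / split size), with a minimum of 1."""
    # Integer ceiling division avoids floating-point error on very large sizes.
    return max(1, (input_bytes + split_bytes - 1) // split_bytes)

one_tb = 1024 ** 4
print(estimate_map_tasks(one_tb))  # 8192
```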

Shuffling

The Hadoop framework sorts the map outputs and assigns all key/value pairs with the same "key" (topic) to the same reducer. For example, in a data set of cities and their daily high temperatures, any data with the “Tokyo” key would go to the same reducer.

The reducer, as the name suggests, is the function that performs the reduce step.
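One common way to guarantee that equal keys reach the same reducer is to choose the reducer from a hash of the key. The sketch below assumes that approach; the function names, the use of CRC32 as the hash and the choice of 2 reducers are illustrative and not taken from Hadoop itself.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer index from the key alone, so equal keys always match."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route each (key, value) pair to a reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        reducers[partition(key, num_reducers)][key].append(value)
    # Sort the keys within each reducer, mirroring the sort in this step.
    return [dict(sorted(groups.items())) for groups in reducers]

map_outputs = [("Tokyo", 31), ("London", 22), ("Tokyo", 38), ("Toronto", 18)]
reducer_inputs = shuffle(map_outputs, num_reducers=2)
for index, groups in enumerate(reducer_inputs):
    print(index, groups)
```

Because the reducer index depends only on the key, both "Tokyo" readings land in the same reducer regardless of which mapper emitted them.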

Reducing

Reduce functions process the key/value pairs that the mappers emit. This can involve merging, tabulating or performing other operations on the data, depending on the kind of processing required.
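To illustrate how the reduce step varies with the kind of processing required, here are three hypothetical reduce functions applied to the same grouped input. The data and function names are invented for the example.

```python
# Input to the reduce step: each key with the list of values gathered for it.
grouped = {
    "London": [22, 27, 19],
    "Tokyo": [31, 38],
}

def reduce_max(key, values):
    """Keep only the largest value for the key."""
    return key, max(values)

def reduce_count(key, values):
    """Tabulate how many values were recorded for the key."""
    return key, len(values)

def reduce_mean(key, values):
    """Merge the values for the key into their average."""
    return key, sum(values) / len(values)

for reducer in (reduce_max, reduce_count, reduce_mean):
    print(reducer.__name__, [reducer(k, v) for k, v in grouped.items()])
```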

Mapping and reduction can be done on the same set of servers, but this is optional.

Result

Each reducer writes the results of its processing to HDFS or another data store.

An example of MapReduce

No matter the amount of data an organization wants to analyze, the key principles remain the same.

For this example, the data set includes cities (the keys) and the corresponding daily temperatures (the values) recorded for each city. A sample key/value pair might look like this: <Toronto, 18>.

The data is spread across multiple files. Each file might include data from a mix of cities, and it might include the same city multiple times.

From this data set, the user wants to identify the "maximum temperature" for each city across the tracked period.

An implementation of MapReduce to handle this job might look like this:

  1. Data files containing temperature information feed into the MapReduce application as input.

  2. The files are split into map tasks, with each task assigned to one of the mappers.

  3. The mappers convert the data into key/value pairs.

  4. The map outputs are shuffled and sorted so that all values with the same city key end up with the same reducer. For example, all temperature values for Toronto go to one reducer, while another reducer aggregates all the values for London.

  5. Each reducer processes its data to determine the highest temperature value for each city it handles. The data is then reduced to a single key/value pair per city: the city and its highest temperature.

  6. After the reduce phase, the highest values can be collected to produce a result: <Tokyo, 38> <London, 27> <New York, 33> <Toronto, 32>.
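The six steps above can be sketched end to end in a single Python process. The file names and individual readings are hypothetical and were chosen so that the maximums match the result shown in step 6. A real MapReduce job would run the mappers and reducers on separate nodes rather than in one loop.

```python
from collections import defaultdict

# Steps 1-2: hypothetical input files, each treated as one map task.
files = {
    "file_a.txt": ["Toronto,18", "Tokyo,38", "London,27", "New York,30"],
    "file_b.txt": ["Toronto,32", "Tokyo,35", "New York,33", "London,21"],
}

def mapper(lines):
    """Step 3: convert each 'city,temperature' line into a key/value pair."""
    for line in lines:
        city, temp = line.rsplit(",", 1)
        yield city, int(temp)

def shuffle(pairs):
    """Step 4: group all values for the same city and sort by key."""
    grouped = defaultdict(list)
    for city, temp in pairs:
        grouped[city].append(temp)
    return dict(sorted(grouped.items()))

def reducer(city, temps):
    """Step 5: keep only the highest temperature for the city."""
    return city, max(temps)

map_outputs = [pair for lines in files.values() for pair in mapper(lines)]
result = dict(reducer(city, temps) for city, temps in shuffle(map_outputs).items())

# Step 6: the collected result.
print(result)  # {'London': 27, 'New York': 33, 'Tokyo': 38, 'Toronto': 32}
```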

MapReduce within the Hadoop ecosystem

The Hadoop ecosystem is a suite of open source modules designed to work together to promote efficiency. The main Hadoop framework consists of MapReduce plus 3 more modules:

Hadoop distributed file system (HDFS)

HDFS is a distributed file system for storing application data on up to thousands of commodity servers.

HDFS is designed to provide fault tolerance for Hadoop and provide fast access to data. By default, data blocks are replicated across multiple nodes at load or write time.

The HDFS architecture features a NameNode to manage the file system namespace and file access, along with multiple DataNodes to manage data storage.

Yet Another Resource Negotiator (YARN)

YARN is the framework within open source Apache Hadoop for job scheduling and cluster resource management. It supports multiple workloads, such as SQL queries, advanced modeling and real-time streaming.

Hadoop Common

This module is a collection of resource utilities and libraries that support other Hadoop modules. Also known as Hadoop Core, it includes multiple resources for automatic failure recovery, file-system-level and operating-system-level abstraction, Java Archive (JAR) files and scripts, plus more.

Other modules

In addition to the main Hadoop framework, the Hadoop ecosystem also includes:

  • HBase (a non-relational, NoSQL database)
  • Mahout and Spark MLlib (algorithm libraries for machine learning)
  • Oozie (a job scheduler)
  • Apache Hive and Apache Pig (query-based data processing tools)
  • Solr and Lucene (for searching and indexing)
  • Spark (in-memory data processing engine) 
  • ZooKeeper (cluster coordination)

Benefits of MapReduce

MapReduce programming offers several benefits that can help organizations gain valuable insights from big data analytics, such as:

  • Scalability: MapReduce enables organizations to process petabytes of data stored in the HDFS across multiple servers or nodes.

  • Faster processing: With parallel processing and minimal data movement, MapReduce speeds up the processing of massive volumes of data.

  • Simplicity: Developers can write MapReduce applications in their choice of programming languages, including Java, C++ and Python.

  • Cost savings: As an open source program, MapReduce can save an organization some budget on software expenses. That said, there will still be costs associated with infrastructure and data engineering staff.

MapReduce use cases

MapReduce can efficiently handle many straightforward use cases, although complex data processing tasks might run faster on other applications. Here are some applications suited to MapReduce's parallel processing approach:

  • Data integration, where the MapReduce framework is used to run the extract, transform and load (ETL) process to prepare data for analysis and storage.

  • Image processing, where tasks such as image classification can be split into smaller data sets and processed in parallel.

  • Log analysis, such as identifying trends by reviewing log files from web or application servers.

  • Machine learning (ML), where MapReduce can help with ML training tasks, such as collaborative filtering, k-means clustering and linear regression.

  • Sentiment analysis, where MapReduce can help add up customer scores on a website or identify response clusters, for example.

  • Tabulation, such as counting how many customers renewed their accounts, by country, over the past year.

  • Text mining, such as word count jobs that track the number of times a word occurs in a certain input set, such as a comment board.
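Word count, the last use case in the list above, is often used to introduce the model. The following sketch assumes a simple tokenizer that lowercases text and keeps only letters and apostrophes. The comments and function names are hypothetical.

```python
import re
from collections import Counter

def map_words(comment):
    """Map step: emit a (word, 1) pair for every word in one comment."""
    for word in re.findall(r"[a-z']+", comment.lower()):
        yield word, 1

def reduce_counts(pairs):
    """Reduce step: add up the 1s emitted for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

comments = ["Great product, great price", "Price is too high"]
pairs = [pair for comment in comments for pair in map_words(comment)]
counts = reduce_counts(pairs)
print(counts["great"], counts["price"], counts["high"])  # 2 2 1
```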

History of MapReduce

The MapReduce programming paradigm was introduced in 2004 by Google computer scientists Jeffrey Dean and Sanjay Ghemawat. The goal of the MapReduce model is to simplify the transformation and analysis of large data sets through massively parallel processing on large clusters of commodity hardware. It also enables programmers to focus on algorithms rather than data management.

While Google introduced the first MapReduce framework, Apache Hadoop MapReduce is perhaps the most popular.

MapReduce played a key role in advancing big data analytics, but it does have its drawbacks. For example, data is usually not retained in memory by MapReduce, and iterative logic is possible only by chaining MapReduce apps together. These factors add greater complexity and can lead to longer processing times.

While MapReduce remains widely used—especially in legacy systems—many organizations are moving to faster or more specialized frameworks, such as Apache Spark, for big data applications.
