When a dataset grows beyond what a single machine can process — terabytes of web logs, billions of transactions, or petabytes of sensor readings — we need a way to split the work across many machines, process it in parallel, and combine the results. MapReduce was the programming model that made this possible at scale, and Apache Spark is the engine that took the idea further with in-memory processing and a richer API.

In this post, we will walk through both frameworks from the ground up: how MapReduce works, why Spark was created to address its limitations, and how to write distributed programs with each.

The MapReduce Programming Model

MapReduce was introduced by Google in 2004 and later popularized by its open-source implementation in Apache Hadoop. The core idea is simple:

  • Split the data into chunks and distribute them across a cluster
  • Map — apply a function to each chunk independently and in parallel
  • Shuffle & Sort — group the intermediate results by key
  • Reduce — aggregate the grouped values into final results

The programmer only needs to write two functions — map and reduce — and the framework handles distribution, fault tolerance, and coordination automatically.

Map Phase

Each mapper receives a chunk of input data and emits zero or more (key, value) pairs. Mappers run independently and in parallel across the cluster — they have no knowledge of each other.

  • Input: a split of the raw data (e.g., a block of a text file)
  • Output: a list of (key, value) pairs
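A mapper is just a function that turns one input record into zero or more (key, value) pairs. As a sketch — the log format here is a made-up example for illustration:

```python
def map_status_codes(line):
    """Mapper: emit (status_code, 1) for one log line.

    Assumes a simplified log format where the HTTP status code
    is the third whitespace-separated field (hypothetical).
    """
    fields = line.split()
    if len(fields) >= 3:
        yield (fields[2], 1)

# Each mapper processes its own split independently:
split = ["GET /index.html 200", "GET /missing 404", "POST /api 200"]
pairs = [pair for line in split for pair in map_status_codes(line)]
print(pairs)  # [('200', 1), ('404', 1), ('200', 1)]
```

Note that the mapper never aggregates; it only emits. All combining of values for the same key happens downstream.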

Shuffle & Sort Phase

The framework collects all intermediate (key, value) pairs, groups them by key, and sorts them. This is the most network-intensive phase — data is transferred between machines so that all values for the same key end up on the same node.

  • Input: all (key, value) pairs from all mappers
  • Output: (key, [value1, value2, ...]) grouped by key
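Under the hood, the routing is typically done by hashing: each intermediate key is assigned to a reducer as hash(key) % num_reducers, which guarantees every occurrence of a key lands on the same node. A minimal single-process sketch:

```python
from collections import defaultdict

def shuffle(mapper_outputs, num_reducers):
    """Route (key, value) pairs into per-reducer buckets by hashing the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapper_outputs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets

pairs = [("hello", 1), ("world", 1), ("hello", 1)]
buckets = shuffle(pairs, num_reducers=2)

# All values for a given key end up in exactly one bucket:
merged = {k: v for b in buckets for k, v in b.items()}
print(sorted(merged.items()))  # [('hello', [1, 1]), ('world', [1])]
```

In a real cluster each bucket would be sent over the network to its reducer, which is why this phase dominates network traffic.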

Reduce Phase

Each reducer receives a key and the complete list of values associated with that key. It applies an aggregation function and emits the final result.

  • Input: (key, [value1, value2, ...])
  • Output: (key, aggregated_result)
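Putting the three phases together, the model can be simulated in a few lines of plain Python — a toy single-process sketch, using a hypothetical example that finds the maximum reading per sensor:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process simulation of map -> shuffle & sort -> reduce."""
    # Map: apply the mapper to every record, collecting (key, value) pairs
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle & sort: group values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: aggregate each key's value list
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}

records = [("sensor-a", 21.5), ("sensor-b", 19.0), ("sensor-a", 23.1)]
result = run_mapreduce(
    records,
    mapper=lambda rec: [(rec[0], rec[1])],    # emit (sensor_id, reading)
    reducer=lambda key, values: max(values),  # keep the max reading per sensor
)
print(result)  # {'sensor-a': 23.1, 'sensor-b': 19.0}
```

The real framework does exactly this, except each phase runs on many machines and the grouping step moves data over the network.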

MapReduce Example: Word Count

The classic “hello world” of distributed computing is word count. Let us walk through it step by step.

Given an input file with three lines — "hello world", "hello spark", "world hello" — the framework:

  1. Splits the file into chunks (one line per split in this example)
  2. Maps each line by emitting (word, 1) for every word
  3. Shuffles & Sorts to group all counts by word: (hello, [1, 1, 1]), (world, [1, 1]), (spark, [1])
  4. Reduces by summing the lists: (hello, 3), (world, 2), (spark, 1)

Python Implementation

Here is a simple Python implementation using Hadoop Streaming-style mapper and reducer scripts:

mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit (word, 1) as tab-separated key-value pair
        print(f"{word}\t1")

reducer.py

#!/usr/bin/env python3
import sys
from collections import defaultdict

counts = defaultdict(int)

for line in sys.stdin:
    word, count = line.strip().split("\t")
    counts[word] += int(count)

for word, count in sorted(counts.items()):
    print(f"{word}\t{count}")

You would run these with Hadoop Streaming (the -files option ships the scripts to every node; make sure both are executable):

hadoop jar hadoop-streaming.jar \
  -files mapper.py,reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /input/textfiles \
  -output /output/wordcount

Strengths and Limitations of MapReduce

Strengths

  • Fault tolerance — if a node fails, the framework re-runs its tasks on another node
  • Horizontal scalability — add more machines to process more data
  • Simplicity — the programmer only writes map and reduce functions
  • Data locality — computation moves to where the data lives, minimizing network transfer

Limitations

  • Disk I/O overhead — intermediate results are written to disk between every stage, which is slow
  • Inefficient iteration — algorithms like PageRank or k-means require multiple passes over the data, and each pass launches a separate MapReduce job that re-reads its input from disk
  • Limited expressiveness — complex pipelines require chaining multiple MapReduce jobs, which is cumbersome
  • High latency — not suitable for interactive queries or real-time processing

Apache Spark

Apache Spark was created at UC Berkeley’s AMPLab in 2009 specifically to address MapReduce’s limitations. The key innovations:

  • In-memory processing — intermediate results are kept in memory instead of being written to disk, making iterative algorithms up to 100x faster
  • Lazy evaluation — transformations are not executed immediately; Spark builds a DAG (Directed Acyclic Graph) of operations and optimizes the execution plan before running anything
  • Rich API — beyond just map and reduce, Spark provides filter, join, groupBy, flatMap, and many more operations out of the box
  • Unified engine — batch processing, SQL queries, streaming, machine learning, and graph processing all run on the same engine

Spark Architecture

  • Driver Program — the main program that creates a SparkContext, defines transformations and actions, and coordinates execution
  • Cluster Manager — allocates resources across the cluster (YARN, Mesos, Kubernetes, or Spark’s standalone manager)
  • Worker Nodes — machines in the cluster that run tasks
  • Executors — JVM processes on worker nodes that execute tasks and cache data in memory

RDDs: Resilient Distributed Datasets

The fundamental data structure in Spark is the RDD (Resilient Distributed Dataset). An RDD is:

  • Immutable — once created, it cannot be modified (transformations create new RDDs)
  • Partitioned — data is split across nodes in the cluster
  • Fault-tolerant — Spark tracks the lineage (sequence of transformations) that created each RDD, so if a partition is lost, it can be recomputed from the original data

Transformations vs Actions

Spark operations fall into two categories:

Transformations (lazy — build the DAG but do not execute):

  • map(func) — apply a function to each element
  • filter(func) — keep elements that satisfy a condition
  • flatMap(func) — map each element to zero or more elements
  • reduceByKey(func) — aggregate values by key
  • groupByKey() — group values by key
  • join(otherRDD) — join two RDDs by key

Actions (trigger execution and return results):

  • collect() — return all elements as a list
  • count() — return the number of elements
  • take(n) — return the first n elements
  • saveAsTextFile(path) — write results to disk
  • reduce(func) — aggregate all elements using a function
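The lazy/eager split has a close analogue in plain Python: map() builds a lazy iterator that does no work until something consumes it, just as Spark transformations do nothing until an action runs. A small sketch of the distinction:

```python
log = []

def traced(x):
    log.append(x)  # record that this element was actually processed
    return x * 2

pipeline = map(traced, range(5))  # "transformation": nothing runs yet
assert log == []                  # no elements processed so far

result = list(pipeline)           # "action": consuming triggers the work
assert result == [0, 2, 4, 6, 8]
assert log == [0, 1, 2, 3, 4]     # only now has each element been touched
```

Spark goes further than a generator pipeline: because it sees the whole DAG before executing, it can fuse adjacent transformations and plan shuffles before any data moves.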

PySpark Word Count

Here is the same word count example in PySpark — notice how much more concise it is:

from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

# Read input file -> RDD[String]
text_rdd = sc.textFile("input.txt")

# Chain transformations (lazy - nothing executes yet)
word_counts = (
    text_rdd
    .flatMap(lambda line: line.split())      # Split lines into words
    .map(lambda word: (word, 1))             # Create (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)         # Sum counts by word
)

# Action triggers execution
results = word_counts.collect()

for word, count in sorted(results):
    print(f"{word}: {count}")

sc.stop()

The entire pipeline — splitting, mapping, shuffling, and reducing — is expressed in three chained operations, and Spark handles all the distribution and optimization behind the scenes.

Spark vs MapReduce

| Feature | MapReduce | Spark |
| --- | --- | --- |
| Processing Model | Disk-based, batch only | In-memory, batch + streaming |
| Speed | Slower (disk I/O between stages) | Up to 100x faster for iterative jobs |
| Ease of Use | Low-level (map + reduce only) | High-level API (map, filter, join, SQL) |
| Iterative Algorithms | Poor (separate job per iteration) | Excellent (cache data in memory) |
| Fault Tolerance | Re-runs failed tasks | Recomputes lost partitions via lineage |
| Language Support | Java (primarily) | Java, Scala, Python, R |
| Ecosystem | Hadoop ecosystem | Spark SQL, MLlib, GraphX, Streaming |
| Resource Usage | Lower memory requirements | Higher memory requirements |