Caching vs Persistence in Spark (PySpark)

Introduction

Apache Spark is built on lazy evaluation. Transformations such as select, filter, join, and groupBy do not execute immediately. Instead, Spark builds a logical plan (DAG) and executes it only when an action like show(), count(), collect(), or write() is triggered.

When the same DataFrame is reused multiple times, Spark will recompute the entire lineage for every action unless the result is cached or persisted. This is where cache() and persist() become important.

This article explains:

What cache() and persist() do
Storage levels and their behavior
Internal working
When to use each
Common mistakes
Difference between caching and checkpointing

Why Caching or Persistence is Needed

Consider the following example:

df = spark.read.parquet("data")
filtered = df.filter("amount > 1000")

filtered.count()
filtered.write.mode("overwrite").parquet("output")

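Here, two actions (count() and write) use the same filtered DataFrame. Because Spark is lazy and stores nothing between actions, each action runs the full lineage:

The Parquet files are read twice
The filter is applied twice
Every upstream transformation is executed twice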

This recomputation increases execution time and resource usage.

Caching or persistence avoids this repeated work.

cache() in Spark

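For a DataFrame, cache() is shorthand for persist() with the default storage level, so it behaves like the call below. (RDD.cache() defaults to MEMORY_ONLY instead.)

df.persist(StorageLevel.MEMORY_AND_DISK)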

This means:

Spark tries to store the data in memory.
If memory is insufficient, it spills remaining partitions to disk.

Example:

df = spark.read.parquet("data")
filtered = df.filter("amount > 1000").cache()

filtered.count()   # triggers computation and caching
filtered.show()    # reused from cache

Important characteristics:

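cache() is lazy. Data is stored only when the first action runs.
Cached data lives on the executors (in memory, and on their local disks when partitions spill), not on the driver.
DataFrames are cached in Spark's in-memory columnar format.
If an executor is lost, the partitions it held are lost and are recomputed from lineage when next needed.
Cache does not truncate lineage.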

persist() in Spark

persist() allows explicit control over storage level.

Example:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)

Unlike cache(), persist() lets you define:

Memory only
Disk only
Memory and disk
Serialized or deserialized storage
Replication factor

This gives more flexibility for performance and memory management.

Storage Levels Explained

Common storage levels in Spark:

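MEMORY_ONLY
Stores data only in memory (as deserialized objects in the Scala and Java APIs).
Fast access, but partitions that do not fit are not cached and are recomputed from lineage each time they are needed.
MEMORY_AND_DISK
Stores in memory and spills partitions that do not fit to disk.
Default for DataFrame cache().
MEMORY_ONLY_SER
Stores serialized objects in memory.
Uses less memory but costs extra CPU to deserialize on read.
DISK_ONLY
Stores data only on disk.
Useful when memory is limited.
MEMORY_AND_DISK_SER
Serialized storage with disk fallback.

The _SER levels belong to the Scala and Java APIs. PySpark always serializes stored Python objects, and recent PySpark releases do not expose the _SER constants.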

Choosing the right storage level depends on dataset size and memory availability.

Internal Behavior

When cache() or persist() is called:

Spark marks the DataFrame for caching.
No data is stored immediately.
On first action, Spark materializes partitions.
Partitions are stored according to chosen storage level.
Subsequent actions reuse stored partitions.

If a partition is evicted from memory or an executor fails:

Spark recomputes that partition from lineage the next time it is needed.
The cache is therefore a performance optimization, not a fault-tolerance mechanism.

When to Use cache()

Use cache() when:

A DataFrame is reused multiple times within a job.
The workload is iterative.
You are working in a notebook or interactive session and query the same data repeatedly.
The dataset is small or medium-sized and fits comfortably in memory.

Example:

df = spark.read.parquet("sales")
high_value = df.filter("amount > 1000").cache()

high_value.count()
high_value.groupBy("region").sum("amount").show()

The filtered dataset is reused, so caching improves performance.

When to Use persist()

Use persist() when:

The dataset is large and may not fit entirely in memory.
You want serialized storage to reduce memory usage.
Memory resources are limited.
You want disk-only persistence.
You need fine-grained control over the storage level.

Example:

from pyspark import StorageLevel

df.persist(StorageLevel.DISK_ONLY)

This keeps the stored copy out of executor memory for large reusable datasets, at the cost of a disk read on every reuse.

When Not to Cache or Persist

Avoid caching when:

The dataset is used only once.
The dataset is extremely large and causes heavy spilling.
The pipeline is linear with no reuse.
Data is written immediately after transformation.
Memory pressure leads to frequent eviction.

Caching unnecessarily can increase memory usage and garbage collection overhead.

Checkpoint vs Cache

Checkpoint and cache serve different purposes.

Checkpoint:

Truncates lineage.
Writes data to reliable storage (HDFS or cloud storage).
Used for long lineage DAGs.
Useful in streaming or iterative jobs.
Improves fault recovery.

Cache:

Does not truncate lineage.
Stores temporary data for performance.
Lost if the executor holding it fails (Spark then recomputes from lineage).
Used to avoid recomputation.

Checkpoint improves reliability.
Cache improves performance.

Performance Considerations

Before caching:

Identify expensive transformations (joins, wide shuffles).
Confirm dataset reuse.
Estimate dataset size.
Choose appropriate storage level.
Trigger materialization intentionally using an action.
Monitor the Storage tab in the Spark UI to confirm what was actually cached and how much space it uses.

Blind caching can cause more harm than benefit.

Example – Large Dataset with Limited Memory

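from pyspark import StorageLevel

df = spark.read.parquet("large_data")
df.persist(StorageLevel.DISK_ONLY)   # mark for disk-only storage (lazy)

df.count()                           # first action writes the partitions to local disk
df.write.mode("overwrite").parquet("output")   # reads the persisted partitions
df.unpersist()                       # free the disk space once the data is no longer needed

Because the dataset is large, disk-only storage keeps the persisted copy out of executor memory, so the second action reads from local disk instead of recomputing the lineage.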

Example – Iterative Workload

In iterative workloads (for example, repeated transformations on the same dataset), caching lets each iteration read the stored partitions instead of re-running the lineage, which can reduce runtime substantially.

Summary

cache()

Default MEMORY_AND_DISK
Simple performance optimization
Suitable for medium-sized reusable datasets

persist()

Custom storage levels
Greater control
Suitable for memory-sensitive workloads

checkpoint()

Truncates lineage
Provides better recovery
Used in long-running pipelines

Correct use of caching and persistence improves performance and resource efficiency in Spark applications.
