Caching vs Persistence in Spark (PySpark)
Introduction
Apache Spark is built on lazy evaluation. Transformations such as select, filter, join, and groupBy do not execute immediately. Instead, Spark builds a logical plan (DAG) and executes it only when an action like show(), count(), collect(), or write() is triggered.
When the same DataFrame is reused multiple times, Spark will recompute the entire lineage for every action unless the result is cached or persisted. This is where cache() and persist() become important.
This article explains:
What cache() and persist() do
Storage levels and their behavior
Internal working
When to use each
Common mistakes
Difference between caching and checkpointing
Why Caching or Persistence is Needed
Consider the following example:
df = spark.read.parquet("data")
filtered = df.filter("amount > 1000")
filtered.count()  # action 1: reads the files and applies the filter
filtered.write.mode("overwrite").parquet("output")  # action 2: reads and filters again
Here, filtered is used by two actions, count() and write(). Since Spark is lazy and nothing is cached, it will:
Re-read the parquet file
Reapply the filter
Execute the full transformation twice
This recomputation increases execution time and resource usage.
Caching or persistence avoids this repeated work.
cache() in Spark
For a DataFrame, cache() is shorthand for persist() with the default storage level (the RDD API's cache() defaults to MEMORY_ONLY instead):
persist(StorageLevel.MEMORY_AND_DISK)
This means:
Spark tries to store the data in memory.
If memory is insufficient, it spills remaining partitions to disk.
Example:
df = spark.read.parquet("data")
filtered = df.filter("amount > 1000").cache()
filtered.count() # triggers computation and caching
filtered.show() # reused from cache
Important characteristics:
cache() is lazy. Data is stored only after the first action.
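Cached data lives on the executors: in memory first, with overflow on executor-local disk.
DataFrame caches use a compressed in-memory columnar format.
If an executor is lost, the partitions it cached are lost too and must be recomputed.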
Cache does not truncate lineage.
persist() in Spark
persist() allows explicit control over storage level.
Example:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)
Unlike cache(), persist() lets you define:
Memory only
Disk only
Memory and disk
Serialized or deserialized storage
Replication factor
This gives more flexibility for performance and memory management.
Storage Levels Explained
Common storage levels in Spark:
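MEMORY_ONLY
Stores data only in memory, in deserialized form on the JVM.
Fastest access. Partitions that do not fit are not cached and are recomputed from lineage when needed; the job does not fail.
MEMORY_AND_DISK
Stores partitions in memory and spills those that do not fit to disk.
Default for DataFrame cache().
MEMORY_ONLY_SER
Stores serialized objects in memory (Scala/Java API).
Uses less memory, at the cost of extra CPU to deserialize on read.
DISK_ONLY
Stores data only on disk.
Useful when memory is limited.
MEMORY_AND_DISK_SER
Serialized storage with disk fallback (Scala/Java API).
Note for PySpark: Python objects cached through the RDD API are always pickled, so the _SER levels are names from the Scala/Java API, and recent PySpark releases do not define them on StorageLevel.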
Choosing the right storage level depends on dataset size and memory availability.
Internal Behavior
When cache() or persist() is called:
Spark marks the DataFrame for caching.
No data is stored immediately.
On first action, Spark materializes partitions.
Partitions are stored according to chosen storage level.
Subsequent actions reuse stored partitions.
If a cached partition becomes unavailable (dropped from memory under a memory-only level, or lost because its executor fails):
Spark recomputes that partition from lineage.
Cache does not provide full fault tolerance.
When to Use cache()
Use cache() when:
A DataFrame is reused multiple times within a job.
The workload is iterative.
You are working in notebooks or interactive sessions.
The dataset fits comfortably in memory.
The dataset is medium-sized and benefits from reuse.
Example:
df = spark.read.parquet("sales")
high_value = df.filter("amount > 1000").cache()
high_value.count()
high_value.groupBy("region").sum("amount").show()
The filtered dataset is reused, so caching improves performance.
When to Use persist()
Use persist() when:
Dataset is large and may not fit entirely in memory.
You want serialized storage to reduce memory usage.
Memory resources are limited.
You want disk-only persistence.
Fine-grained storage control is required.
Example:
df.persist(StorageLevel.DISK_ONLY)
This keeps the persisted partitions out of executor memory for large reusable datasets, at the cost of disk I/O on every read.
When Not to Cache or Persist
Avoid caching when:
The dataset is used only once.
The dataset is extremely large and causes heavy spilling.
The pipeline is linear with no reuse.
Data is written immediately after transformation.
Memory pressure leads to frequent eviction.
Caching unnecessarily can increase memory usage and garbage collection overhead.
Checkpoint vs Cache
Checkpoint and cache serve different purposes.
Checkpoint:
Truncates lineage.
Writes data to reliable storage (HDFS or cloud storage).
Used for long lineage DAGs.
Useful in streaming or iterative jobs.
Improves fault recovery.
Cache:
Does not truncate lineage.
Stores temporary data for performance.
Lost if executor fails.
Used to avoid recomputation.
Checkpoint improves reliability.
Cache improves performance.
Performance Considerations
Before caching:
Identify expensive transformations (joins, wide shuffles).
Confirm dataset reuse.
Estimate dataset size.
Choose appropriate storage level.
Trigger materialization intentionally using an action.
Monitor Spark UI Storage tab.
Blind caching can cause more harm than benefit.
Example – Large Dataset with Limited Memory
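from pyspark import StorageLevel
df = spark.read.parquet("large_data")
df.persist(StorageLevel.DISK_ONLY)  # mark for disk-only storage; nothing is stored yet
df.count()  # first action materializes the partitions on executor disk
df.write.mode("overwrite").parquet("output")  # reads the persisted partitions
df.unpersist()  # release the disk space once the data is no longer needed
Since the dataset is large, disk-only storage keeps the persisted partitions out of executor memory.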
Example – Iterative Workload
In iterative workloads (for example, repeated transformations on the same dataset), caching lets every iteration after the first read the stored partitions instead of recomputing the shared lineage.
Summary
cache()
Default MEMORY_AND_DISK
Simple performance optimization
Suitable for medium-sized reusable datasets
persist()
Custom storage levels
Greater control
Suitable for memory-sensitive workloads
checkpoint()
Truncates lineage
Provides better recovery
Used in long-running pipelines
Correct use of caching and persistence improves performance and resource efficiency in Spark applications.