Caching vs Persistence in Spark (PySpark)

Introduction

Apache Spark is built on lazy evaluation. Transformations such as select, filter, join, and groupBy do not execute immediately. Instead, Spark builds a logical plan (a DAG of transformations) and executes it only when an action such as show(), count(), collect(), or a write is triggered.

When the same DataFrame is reused multiple times, Spark will recompute the entire lineage for every action unless the result is cached or persisted. This is where cache() and persist() become important.

This article explains:

  • What cache() and persist() do

  • Storage levels and their behavior

  • Internal working

  • When to use each

  • Common mistakes

  • Difference between caching and checkpointing


Why Caching or Persistence is Needed

Consider the following example:

df = spark.read.parquet("data")
filtered = df.filter("amount > 1000")

filtered.count()
filtered.write.mode("overwrite").parquet("output")

Here, the filtered DataFrame feeds two actions: count() and the write. Because Spark is lazy and nothing is cached, each action will:

  • Re-read the parquet file

  • Reapply the filter

  • Run the full lineage from scratch, so the same work is done twice

This recomputation increases execution time and resource usage.

Caching or persistence avoids this repeated work.


cache() in Spark

For DataFrames, cache() is shorthand for persist() with the default storage level (RDD.cache(), by contrast, defaults to MEMORY_ONLY):

persist(StorageLevel.MEMORY_AND_DISK)

This means:

  • Spark tries to store the data in memory.

  • If memory is insufficient, it spills remaining partitions to disk.

Example:

df = spark.read.parquet("data")
filtered = df.filter("amount > 1000").cache()

filtered.count()   # triggers computation and caching
filtered.show()    # reused from cache

Important characteristics:

  • cache() is lazy. Data is stored only after the first action.

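  • Cached data lives on the executors (in memory, or on local disk for spilled partitions), not on the driver.

  • DataFrames are cached in an in-memory columnar format.

  • If an executor is lost, its cached partitions are lost too and are recomputed from lineage when next needed.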

  • Cache does not truncate lineage.


persist() in Spark

persist() allows explicit control over storage level.

Example:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)

Unlike cache(), persist() lets you define:

  • Memory only

  • Disk only

  • Memory and disk

  • Serialized or deserialized storage

  • Replication factor

This gives more flexibility for performance and memory management.


Storage Levels Explained

Common storage levels in Spark:

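  1. MEMORY_ONLY
    Stores data only in memory.
    Fast access, but partitions that do not fit are not cached and are recomputed from lineage whenever they are needed. The job does not fail.

  2. MEMORY_AND_DISK
    Stores partitions in memory and spills those that do not fit to local disk.
    Default for DataFrame cache().

  3. MEMORY_ONLY_SER
    Stores serialized objects in memory.
    Uses less memory but costs extra CPU to deserialize on read.
    This level belongs to the Scala/Java API; recent PySpark releases do not define it, because Python objects are always stored serialized.

  4. DISK_ONLY
    Stores data only on disk.
    Useful when memory is limited.

  5. MEMORY_AND_DISK_SER
    Serialized storage with disk fallback.
    Like MEMORY_ONLY_SER, a Scala/Java level.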

Choosing the right storage level depends on dataset size and memory availability.


Internal Behavior

When cache() or persist() is called:

  • Spark marks the DataFrame for caching.

  • No data is stored immediately.

  • On first action, Spark materializes partitions.

  • Partitions are stored according to chosen storage level.

  • Subsequent actions reuse stored partitions.

If a partition is evicted from memory or an executor fails:

  • Spark recomputes that partition from lineage.

  • Caching therefore speeds up reuse but is not a fault-tolerance mechanism; recovery still depends on the lineage being recomputable.


When to Use cache()

Use cache() when:

  1. A DataFrame is reused multiple times within a job.

  2. The workload is iterative.

  3. You are working in notebooks or interactive sessions.

  4. The dataset fits comfortably in memory.

  5. The dataset is medium-sized and reused often enough to repay the cost of caching it.

Example:

df = spark.read.parquet("sales")
high_value = df.filter("amount > 1000").cache()

high_value.count()
high_value.groupBy("region").sum("amount").show()

The filtered dataset is reused, so caching improves performance.


When to Use persist()

Use persist() when:

  1. Dataset is large and may not fit entirely in memory.

  2. You want serialized storage to reduce memory usage.

  3. Memory resources are limited.

  4. You want disk-only persistence.

  5. Fine-grained storage control is required.

Example:

df.persist(StorageLevel.DISK_ONLY)

This avoids memory pressure for large reusable datasets.


When Not to Cache or Persist

Avoid caching when:

  • The dataset is used only once.

  • The dataset is extremely large and causes heavy spilling.

  • The pipeline is linear with no reuse.

  • Data is written immediately after transformation.

  • Memory pressure leads to frequent eviction.

Caching unnecessarily can increase memory usage and garbage collection overhead.


Checkpoint vs Cache

Checkpoint and cache serve different purposes.

Checkpoint:

  • Truncates lineage.

  • Writes data to reliable storage (HDFS or cloud storage).

  • Used for long lineage DAGs.

  • Useful in streaming or iterative jobs.

  • Improves fault recovery.

Cache:

  • Does not truncate lineage.

  • Stores temporary data for performance.

  • Lost if executor fails.

  • Used to avoid recomputation.

Checkpoint improves reliability.
Cache improves performance.


Performance Considerations

Before caching:

  1. Identify expensive transformations (joins, wide shuffles).

  2. Confirm dataset reuse.

  3. Estimate dataset size.

  4. Choose appropriate storage level.

  5. Trigger materialization intentionally using an action.

  6. Monitor Spark UI Storage tab.

Blind caching can cause more harm than benefit.


Example – Large Dataset with Limited Memory

from pyspark import StorageLevel

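df = spark.read.parquet("large_data")
df.persist(StorageLevel.DISK_ONLY)   # marks the DataFrame; nothing is stored yet

df.count()                           # first action: computes and writes partitions to local disk
df.write.mode("overwrite").parquet("output")   # reads the persisted partitions
df.unpersist()                       # release the disk space when done

Because the dataset is large, DISK_ONLY keeps the persisted copy out of executor memory: the second action reads the stored partitions from local disk instead of recomputing them, without adding memory pressure.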


Example – Iterative Workload

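In iterative workloads (for example, repeated transformations on the same dataset), caching the shared input means each iteration reads stored partitions instead of recomputing the full lineage. The more iterations reuse the data, and the more expensive the lineage, the larger the saving.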


Summary

cache()

  • Default MEMORY_AND_DISK

  • Simple performance optimization

  • Suitable for medium-sized reusable datasets

persist()

  • Custom storage levels

  • Greater control

  • Suitable for memory-sensitive workloads

checkpoint()

  • Truncates lineage

  • Provides better recovery

  • Used in long-running pipelines

Correct use of caching and persistence improves performance and resource efficiency in Spark applications.

