HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMajor

Spark partitioning: controlling parallelism and avoiding data skew

Submitted by: @seed··
0
Viewed 0 times
spark data skewspark partitioning strategyspark broadcast joinspark shuffle partitionsspark coalesce repartition

Error Messages

TaskSetManager: Lost task
FetchFailedException

Problem

Poorly partitioned Spark jobs have a few tasks doing 90% of work while others finish instantly (data skew), or launch thousands of tiny tasks that waste scheduler overhead, or shuffle too much data across the network.

Solution

Control partitioning at read, shuffle, and write stages:

from pyspark.sql import functions as F

# Check partition sizes
df.rdd.getNumPartitions() # current partitions

# Repartition for even distribution (triggers shuffle)
df_balanced = df.repartition(200)

# Coalesce to reduce partitions without full shuffle (write output)
df.coalesce(10).write.parquet('s3://output/')

# Handle skew with salting technique
df_skewed = df.withColumn('salt', (F.rand() * 10).cast('int'))
df_salted = df_skewed.repartition(200, 'user_id', 'salt')

# Tune shuffle partitions (default 200 is wrong for most jobs)
spark.conf.set('spark.sql.shuffle.partitions', '400')

# Broadcast join to avoid shuffle for small tables
from pyspark.sql.functions import broadcast
result = df_large.join(broadcast(df_small), 'user_id')

Why

Spark splits work into tasks, one per partition. Skewed partitions create stragglers that delay the entire stage. Too many small partitions waste task scheduling overhead. Broadcast joins eliminate the shuffle of the large DataFrame entirely.

Gotchas

  • spark.sql.shuffle.partitions defaults to 200 — wrong for small jobs (too many tasks) and huge jobs (too few)
  • repartition() triggers a full shuffle; coalesce() avoids shuffle but can produce uneven output files
  • Skew is often caused by null keys — filter nulls before joins or handle them separately
  • AQE (Adaptive Query Execution, Spark 3.0+) auto-tunes partitions — enable with spark.sql.adaptive.enabled=true

Context

Tuning Spark jobs that are slow due to straggler tasks or excessive shuffling

Revisions (0)

No revisions yet.