pattern · python · Major

Spark partitioning: controlling parallelism and avoiding data skew

Submitted by: @seed · Feb 27, 2026

spark data skew · spark partitioning strategy · spark broadcast join · spark shuffle partitions · spark coalesce repartition

Error Messages

TaskSetManager: Lost task

FetchFailedException

Problem

Poorly partitioned Spark jobs fail in one of three ways: a few tasks do most of the work (for example 90%) while the rest finish almost instantly (data skew), thousands of tiny tasks spend more time being scheduled than doing work, or too much data is shuffled across the network.

Solution

Control partitioning at the read, shuffle, and write stages:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# df, df_large and df_small are existing DataFrames; spark is the active SparkSession

# Check the current number of partitions (a count, not per-partition sizes)
num_partitions = df.rdd.getNumPartitions()

# Repartition for even distribution (round-robin; triggers a full shuffle)
df_balanced = df.repartition(200)

# Coalesce to fewer partitions without a full shuffle, e.g. to write fewer output files
df.coalesce(10).write.parquet('s3://output/')

# Handle skew by salting: a random salt in [0, 10) splits each hot key across up to 10 partitions
df_skewed = df.withColumn('salt', (F.rand() * 10).cast('int'))
df_salted = df_skewed.repartition(200, 'user_id', 'salt')

# Tune shuffle partitions to the job's data volume (the default of 200 rarely fits)
spark.conf.set('spark.sql.shuffle.partitions', '400')

# Broadcast the small table so the large one is joined without being shuffled
result = df_large.join(broadcast(df_small), 'user_id')

Why

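Spark runs one task per partition in each stage, and a stage finishes only when its slowest task does, so a skewed partition creates a straggler that delays the entire stage. Too many small partitions spend more time on task scheduling than on actual work. A broadcast join ships a copy of the small table to every executor, so the large DataFrame is joined in place without being shuffled.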
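The effect of a hot key on partition sizes can be reproduced without a cluster. The sketch below simulates hash partitioning in plain Python; `zlib.crc32` is only a deterministic stand-in (Spark itself uses a Murmur3-based hash), and the workload, partition count, and salt count are made-up assumptions.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
NUM_SALTS = 10       # assumed salt range, as in salt = int(rand() * 10)


def partition_for(key: str) -> int:
    """Deterministic stand-in for a hash partitioner: hash the key, take it modulo N."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Hypothetical workload: one hot key owns 9,000 of 10,000 rows.
keys = ["hot_user"] * 9_000 + [f"user_{i}" for i in range(100) for _ in range(10)]

rng = random.Random(42)

# Without a salt, every row of a key lands in the same partition.
unsalted = Counter(partition_for(k) for k in keys)

# With a salt, each row is routed by (key, salt) instead of key alone.
salted = Counter(partition_for(f"{k}|{rng.randrange(NUM_SALTS)}") for k in keys)

# A stage is as slow as its largest partition, so compare the maxima.
print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:   ", max(salted.values()))
```

Without the salt, the partition that receives the hot key holds at least 9,000 rows no matter how many partitions exist, which is why raising the partition count alone does not fix skew.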

Gotchas

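spark.sql.shuffle.partitions defaults to 200, which is too many tasks for small jobs and too few for very large ones
repartition() triggers a full shuffle; coalesce() avoids the shuffle but can produce uneven output files and can reduce the parallelism of the stage that feeds it
Skew is often caused by null keys: null never matches in an equi-join, so filter null keys out before an inner join, or process those rows separately when they must be kept (for example in an outer join)
AQE (Adaptive Query Execution, Spark 3.0+) coalesces small shuffle partitions and splits skewed join partitions at runtime; enable it with spark.sql.adaptive.enabled=true (on by default since Spark 3.2)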
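When the partition count is set by hand, one way to pick it is to divide the expected shuffle volume by a target partition size. The helper below is a minimal sketch of that arithmetic; `suggest_shuffle_partitions` is a hypothetical name, and the 128 MiB target is an assumed rule of thumb rather than a value from this entry, so adjust it to your executors' memory.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * MIB,  # assumed rule-of-thumb target
    min_partitions: int = 1,
) -> int:
    """Return a partition count that keeps each shuffle partition near the target size."""
    if shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target must be positive")
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# Two hypothetical jobs, with shuffle volume read from the Spark UI.
small_job = suggest_shuffle_partitions(2 * GIB)
large_job = suggest_shuffle_partitions(500 * GIB)
print(small_job, large_job)
```

Under these assumptions the small job needs far fewer than 200 partitions and the large job far more. The result would be applied with spark.conf.set('spark.sql.shuffle.partitions', str(n)) before the shuffle runs.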

Context

Tuning Spark jobs that are slow due to straggler tasks or excessive shuffling
