principle, python, Moderate
Data lake vs data warehouse: lakehouse architecture with Delta Lake
delta lake acid, lakehouse architecture, data lake warehouse, delta merge upsert, iceberg delta hudi
Problem
A data lake (raw files in S3/GCS) has no ACID transactions, no schema enforcement, and poor query performance. A data warehouse (Snowflake/BigQuery) is expensive, ties you to a vendor, and handles unstructured data poorly. Teams either pick one and accept its limits, or run both and maintain an ETL copy between them.
Solution
Use a lakehouse format (Delta Lake, Apache Iceberg, or Apache Hudi) to add warehouse features on top of object storage:
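from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta needs its SQL extension and catalog registered on the session
builder = (
    SparkSession.builder.appName('delta-etl')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# df and updates are assumed to be existing DataFrames with the orders schema
# Write with ACID guarantees
df.write.format('delta').mode('overwrite').save('s3://lake/orders/')

# MERGE (upsert): not possible with raw Parquet
delta_table = DeltaTable.forPath(spark, 's3://lake/orders/')
(
    delta_table.alias('t')
    .merge(updates.alias('s'), 't.order_id = s.order_id')
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)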
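The MERGE in the snippet can be read as a keyed upsert. The following is a minimal plain-Python sketch of that semantics, assuming rows are dicts keyed by `order_id`. The `merge_upsert` helper and the sample rows are hypothetical; they model only the result of the merge, not how Delta Lake executes it.

```python
def merge_upsert(target, updates, key="order_id"):
    """Return the target rows after an upsert keyed on `key`."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        # Key already present: overwrite every column (whenMatchedUpdateAll).
        # Key absent: add the row as-is (whenNotMatchedInsertAll).
        merged[row[key]] = dict(row)
    return list(merged.values())


# Hypothetical sample data
target = [
    {"order_id": 1, "status": "new"},
    {"order_id": 2, "status": "new"},
]
updates = [
    {"order_id": 2, "status": "shipped"},
    {"order_id": 3, "status": "new"},
]
result = merge_upsert(target, updates)
```

One difference worth noting: this sketch silently keeps the last update for a repeated key, whereas a Delta Lake merge can fail when several source rows match the same target row, so it is usually safer to deduplicate `updates` first.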
Why
Delta Lake stores a transaction log alongside Parquet files, enabling ACID semantics, time travel, schema enforcement, and upserts on object storage. Because the tables stay in object storage, the lakehouse pattern removes the ETL copy from lake to warehouse while still providing warehouse-style transactional guarantees.
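To make the role of the transaction log concrete, here is a deliberately simplified sketch of log replay in plain Python. It assumes each commit is a list of `add` and `remove` actions on file names; the `live_files` helper, the file names, and the in-memory `log` are hypothetical and much simpler than the real Delta log format.

```python
def live_files(log, version=None):
    """Replay commits 0..version and return the data files that make up the table."""
    if version is None:
        version = len(log) - 1  # latest committed version
    files = set()
    for commit in log[: version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files


# Hypothetical log: one list of actions per committed version
log = [
    [{"add": "part-000.parquet"}],                                  # version 0: initial write
    [{"add": "part-001.parquet"}],                                  # version 1: append
    [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}],  # version 2: rewrite
]

latest = live_files(log)       # files a reader sees now
as_of_v0 = live_files(log, 0)  # files a time-travel read at version 0 sees
```

In this model a reader only ever sees files from fully committed versions, which is where the atomicity comes from, and time travel is a replay that stops early. Reading an old version works only while the removed files still exist on storage.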
Gotchas
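- Delta Lake data files are standard Parquet; the transaction log (_delta_log/) is what adds ACID, so never delete or hand-edit it
- Run OPTIMIZE and VACUUM periodically: OPTIMIZE compacts the small files that frequent writes accumulate, and VACUUM deletes files the log no longer references
- VACUUM removes old file versions after a default 7-day retention; time travel beyond that requires a longer retention
- S3 has been strongly consistent by default since December 2020, so no consistency setting is needed; concurrent writes from multiple clusters still need extra coordination (open-source Delta Lake provides a DynamoDB-backed LogStore for this)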
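The maintenance steps above could be wrapped in a small helper. This is a sketch, assuming a recent delta-spark release in which `DeltaTable.optimize().executeCompaction()` and `DeltaTable.vacuum(retentionHours)` are available; the `run_maintenance` name and the guard on the retention value are illustrative choices, not part of Delta Lake.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days, the default VACUUM retention


def run_maintenance(delta_table, retention_hours=DEFAULT_RETENTION_HOURS):
    """Compact small files, then delete unreferenced files older than the retention window."""
    if retention_hours < DEFAULT_RETENTION_HOURS:
        # Illustrative guard: a shorter window can break readers and time travel
        raise ValueError("retention_hours below the 7-day default is not allowed here")
    # OPTIMIZE: rewrite many small files into fewer, larger ones
    delta_table.optimize().executeCompaction()
    # VACUUM: remove files no longer referenced by the log and older than the window
    delta_table.vacuum(retention_hours)
```

With the table handle from the Solution snippet, a call such as `run_maintenance(DeltaTable.forPath(spark, 's3://lake/orders/'), retention_hours=720)` would keep roughly 30 days of old files available for time travel, at the cost of extra storage.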
Context
Designing a storage architecture for a modern data platform