Delta Lake and the Delta Architecture


About This Presentation

Slides from the July 2021 St. Louis Big Data IDEA meetup. Yue Fang presented on Delta Lake.


Slide Content

Delta Lake and the Delta Architecture
07/07/2021

Personal Introduction
Yue Fang
- A big data enthusiast who has worked with big data technologies for almost 10 years.
- Builds data pipelines and platforms on Cloudera's platform and Azure's cloud.
- A certified AWS Solutions Architect.
- Deep experience with Spark Structured Streaming, Kafka, Cassandra, Hive, HBase, Solr, Event Hubs, and Cosmos DB.
- Has also worked on the Azure Databricks platform and Delta Lake.

Outline
- Apache Spark problems
- Data lake problems
- What is Databricks?
- Delta Lake key features
- Delta Lake architecture
- Lakehouse architecture

Apache Spark Problems
- Not ACID compliant
- Missing schema enforcement
- Small files, big problems:
  - File listing
  - File opening/closing
  - Reduced compression effectiveness
  - Excessive metadata (external Hive tables)
Two docs for details: Generic Load/Save Functions - Spark 3.1.2 Documentation; Transactional writes to cloud storage with DBIO | Databricks on AWS

Data Lake Problems
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
Reliability issues:
- Failed production jobs leave data in a corrupt state
- Lack of schema enforcement creates inconsistent, low-quality data (schema-on-read)
- Lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming
Performance issues:
- File size inconsistency, with files that are either too small or too big
- Slow read/write performance of cloud storage compared to file system storage
Garbage in, garbage out.

What is Databricks?
Databricks builds on top of Spark and adds:
- Highly reliable and performant data pipelines
- Productive data science at scale
Source: Comparing Databricks to Apache Spark

Delta Lake Introduction

What is Delta Lake?
- An open source project that enables building a Lakehouse architecture on top of data lakes.
- A storage layer that brings scalable, ACID transactions to Apache Spark and other big data engines.
- Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS, and HDFS.
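
A minimal PySpark sketch of what this looks like in practice, assuming the open source delta-spark package is installed; the app name and the /tmp/demo/numbers path are illustrative.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Enable Delta Lake's SQL extension and catalog on a plain Spark session
builder = (SparkSession.builder
           .appName("delta-demo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a DataFrame in the Delta format and read it back
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/demo/numbers")
spark.read.format("delta").load("/tmp/demo/numbers").show()
```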

Delta Lake Key Features
- ACID transactions
- Scalable metadata handling
- Time travel (data versioning)
- Open format
- Change data feed
- Unified batch and streaming source and sink
- Schema enforcement
- Schema evolution
- Audit history
- Updates and deletes
- 100% compatible with the Apache Spark API
- Data clean-up

Delta Lake Key Feature - ACID Transactions
- What the transaction log is
- How the transaction log serves as a single source of truth to support ACID
- How Delta Lake computes the state of each table
- Using optimistic concurrency control
- How Delta Lake uses mutual exclusion to ensure that commits are serialized properly
DEMO
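
To make the transaction log concrete, here is a small sketch reusing the hypothetical /tmp/demo/numbers table from the setup example above: every commit lands as an ordered JSON file in the table's _delta_log directory, and the same commits are visible through DESCRIBE HISTORY.

```python
import os

# Each commit appends an ordered JSON file to _delta_log; replaying those
# files is how Delta Lake computes the table's current state.
spark.range(5, 10).write.format("delta").mode("append").save("/tmp/demo/numbers")
print(sorted(os.listdir("/tmp/demo/numbers/_delta_log")))
# e.g. ['00000000000000000000.json', '00000000000000000001.json', ...]

# The same commits are queryable as the table's history
spark.sql("DESCRIBE HISTORY delta.`/tmp/demo/numbers`").show(truncate=False)
```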

Delta Lake Key Feature - Schema Enforcement
Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema.
Schema validation on write; incoming data:
- Cannot contain any additional columns that are not present in the target table's schema
- Cannot have column data types that differ from the column data types in the target table
- Cannot contain column names that differ only by case
The table's schema is saved in JSON format inside the transaction log.
DEMO
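
A sketch of the rejection in action; the /tmp/demo/people path and the sample rows are illustrative.

```python
# Create a Delta table with columns (id, name)
base = spark.createDataFrame([(1, "alice")], ["id", "name"])
base.write.format("delta").mode("overwrite").save("/tmp/demo/people")

# Appending data with an extra column is rejected on write
extra = spark.createDataFrame([(2, "bob", "STL")], ["id", "name", "city"])
try:
    extra.write.format("delta").mode("append").save("/tmp/demo/people")
except Exception as err:
    print("Write rejected by schema enforcement:", err)
```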

Delta Lake Key Feature - Schema Evolution
Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that is changing over time.
"Read-compatible" schema changes: .option("mergeSchema", "true")
- Adding new columns (the most common scenario)
- Changing data types from NullType to any other type, or upcasts from ByteType -> ShortType -> IntegerType
"Non-read-compatible" schema changes: .option("overwriteSchema", "true")
- Dropping a column
- Changing an existing column's data type (in place)
- Renaming columns that differ only by case (e.g. "Foo" and "foo")
DEMO
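
Continuing the hypothetical /tmp/demo/people table from the schema-enforcement sketch, both options look roughly like this:

```python
# Read-compatible change: mergeSchema lets the new "city" column through
extra.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/demo/people")

# Non-read-compatible change: overwriteSchema replaces the schema entirely
renamed = base.withColumnRenamed("name", "full_name")
renamed.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/tmp/demo/people")
```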

Delta Lake Key Feature - Time Travel
Delta Lake time travel allows you to query an older snapshot of a Delta table:
- Timestamp based
- Version number based
Data retention:
- Transaction log file retention period: delta.logRetentionDuration = 30 days by default
- Data file retention period: delta.deletedFileRetentionDuration = 7 days by default
Use cases:
- Audit data changes
- Reproduce experiments and reports
- Rollbacks
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options
DEMO
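
A sketch of both ways to time travel, again against the hypothetical /tmp/demo/people table; the timestamp is illustrative and must fall within the retention window.

```python
# Query an older snapshot by version number
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo/people")
v0.show()

# ...or by timestamp
as_of = (spark.read.format("delta")
         .option("timestampAsOf", "2021-07-01 00:00:00")
         .load("/tmp/demo/people"))

# SQL equivalent
spark.sql("SELECT * FROM delta.`/tmp/demo/people` VERSION AS OF 0").show()
```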

Delta Lake Key Feature - Table Utility Commands
- Remove files no longer referenced by a Delta table
- Audit history
- Retrieve table details
- Generate a manifest file
- Convert a Parquet table to a Delta table
- Convert a Delta table to a Parquet table
References: DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks; Table deletes, updates, and merges | Databricks on AWS
DEMO
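
A sketch of the utility commands through the DeltaTable API; the paths (including /tmp/demo/raw_parquet) are hypothetical.

```python
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/tmp/demo/people")
dt.history().show(truncate=False)       # audit history of commits
dt.vacuum(168)                          # remove unreferenced files older than 168 hours
dt.generate("symlink_format_manifest")  # manifest for external engines (e.g. Presto/Athena)

# Convert an existing Parquet directory into a Delta table in place
DeltaTable.convertToDelta(spark, "parquet.`/tmp/demo/raw_parquet`")
```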

Delta Lake Key Feature - Insert | Delete | Upsert
SQL: INSERT, DELETE, UPDATE, MERGE
Delta Table API: delete, update, merge
A merge operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the target Delta table.
References: DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks; Table deletes, updates, and merges | Databricks on AWS
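
A sketch of an upsert through the DeltaTable merge API, assuming a Delta table at the hypothetical path /tmp/demo/people with columns (id, name).

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/demo/people")
updates = spark.createDataFrame([(1, "alice-updated"), (3, "carol")], ["id", "name"])

# Upsert: update rows that match on id, insert the rest.
# Note: this fails if several source rows try to update the same target row.
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```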

Delta Lake Key Feature - Clean-Up
Transaction log clean-up (_delta_log):
- Checkpoint and log files
- delta.logRetentionDuration = 30 days by default
Data file clean-up:
- VACUUM command (SQL or API)
- Retention: 7 days by default
- spark.databricks.delta.retentionDurationCheck.enabled = true|false
References: DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks; Table deletes, updates, and merges | Databricks on AWS
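
A sketch of both knobs against the hypothetical /tmp/demo/people table; the 24-hour retention is illustrative.

```python
# Data-file retention defaults to 7 days; shortening it requires disabling the
# safety check (use with care: it can break time travel and concurrent readers).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/tmp/demo/people` RETAIN 24 HOURS")

# Transaction-log retention is a per-table property
spark.sql("""
    ALTER TABLE delta.`/tmp/demo/people`
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')
""")
```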

Delta Lake Key Feature - Streaming Source and Sink
Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream. Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including:
- Maintaining "exactly-once" processing with more than one stream (or concurrent batch jobs)
- Efficiently discovering which files are new when using files as the source for a stream
References: DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks; Table deletes, updates, and merges | Databricks on AWS

Delta Lake Key Feature - Streaming Source and Sink (continued)
As a source:
- Does not handle input that is not an append, and throws an exception if any modifications occur on the table being used as a source
- Either delete the output and checkpoint and restart the stream from the beginning, or set one of these two options: ignoreDeletes, ignoreChanges
- Specify an initial position with startingVersion or startingTimestamp
As a sink:
- Append mode
- Complete mode
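
A sketch that uses one Delta table as a streaming source and another as the sink; the paths, checkpoint location, and starting version are illustrative.

```python
# Read a Delta table as a stream, tolerating partition-level deletes upstream
stream = (spark.readStream.format("delta")
          .option("ignoreDeletes", "true")
          .option("startingVersion", 0)   # optional initial position
          .load("/tmp/demo/people"))

# Write the stream to another Delta table in append mode
query = (stream.writeStream.format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/demo/_checkpoints/people_copy")
         .start("/tmp/demo/people_copy"))
# query.stop() when finished
```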

Delta Lake Key Feature - Change Data Feed
- Supported on Databricks Runtime 8.4 and above
- The Delta change data feed represents row-level changes between versions of a Delta table
- set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;
Change data event schema: in addition to the data columns, change data contains metadata columns that identify the type of change event:
- _change_type >> insert, update_preimage, update_postimage, delete
- _commit_version
- _commit_timestamp
References: DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks; Table deletes, updates, and merges | Databricks on AWS
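
A sketch of enabling the change data feed on the hypothetical /tmp/demo/people table and reading it back; the starting version is illustrative, and changes are only recorded for commits made after the property is set.

```python
# Turn on the change data feed for an existing table
spark.sql("""
    ALTER TABLE delta.`/tmp/demo/people`
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read row-level changes recorded from a given version onward
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)
           .load("/tmp/demo/people"))
changes.select("id", "_change_type", "_commit_version", "_commit_timestamp").show()
```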

Delta Lake Architecture
DEMO

Lakehouse Architecture
- A paradigm for modern data architecture
- Relies on Delta Lake under the hood
- Replaces separate data warehouses and data lakes
- Needs a fast SQL analysis engine
- A future trend

Thank You
Any questions are welcome. Learning together!