2. Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
5. Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows.
Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
With the shift to data lakes that use distributed file storage as the foundation, we have been missing the reliability that relational databases provide. Databricks Delta is a data management system focused on bringing more reliability and performance into our data lakes. It sits on top of existing storage, and the API is very similar to how Spark already reads and writes files (a quick sketch of that follows). This session will present an overview of Delta Lake, why it may be a better option than standard data lake storage, and how you can use it from Azure Databricks. We will work through demos that showcase the key benefits of Delta Lake:
1. ACID transactions
2. Schema enforcement and evolution
3. Time travel (data versioning)
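As a minimal sketch of that API similarity (assuming a notebook where spark is already defined and df is an existing DataFrame; the paths are hypothetical):

    # Writing to the lake as plain Parquet files
    df.write.format("parquet").mode("overwrite").save("/mnt/lake/events_parquet")

    # Writing the same DataFrame as Delta: only the format string changes
    df.write.format("delta").mode("overwrite").save("/mnt/lake/events_delta")

    # Reading back is just as familiar
    events = spark.read.format("delta").load("/mnt/lake/events_delta")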
Let's think about the benefit of parallel processing, often referred to as distributed systems. The idea is actually very easy to understand. If we had a task such as counting all the people at a concert, you could have one person who is really good at counting do it, and if the venue is small enough they will do just fine. But the job will be completed faster if you have many people counting and combining the results at the end. Sure, there is a little more organization needed, but if you need to count the attendees at a Beyoncé concert you could just hire a lot of people to do the job. And if one of them gets distracted by the music, you can send whoever finishes first in to take over counting that section.
We call this capability "horizontal scaling": if our data processing system is not powerful enough to do the work, we add more computers to help out rather than replacing the single server with a more powerful one. Distributed computing and parallel processing are not new concepts, few things in computing are, but what if you had an easy way to tell all the workers what to do without having to micro-manage them to avoid two people counting the same section? That is where new programming models and frameworks have stepped in over the last 10 years, giving us the beloved buzzword "Big Data". Spark is not the only option here, but it has a lot of strengths and is often chosen over traditional single-machine processing options.
A fast and general engine for large-scale data processing; it uses memory to provide much of its performance benefit.
Often replaces MapReduce as the parallel programming API on Hadoop; the way it handles data (RDDs) provides one performance benefit, and using memory when possible provides another large one.
Can run on Hadoop (using YARN) but also as a standalone Spark cluster. Local mode is possible as well, though it loses the performance benefits of distribution; I find it's still a useful API even then.
Runs Java, Scala, Python, or R. If you don't already know one of those languages really well, I recommend trying both Python and Scala and picking whichever feels easiest for you.
Several modules for different use cases with a similar API, so you can swap between modes relatively easily.
For example, we have both streaming and batch sources of some data, and we reuse the rest of the Spark processing transformations across both, as sketched below.
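A hedged sketch of that reuse, with hypothetical paths and a made-up transformation; the same function serves both a batch read and a structured stream:

    # Shared transformation logic used by both batch and streaming jobs
    def add_event_counts(df):
        return df.groupBy("event_type").count()

    # Batch: read existing files
    batch_df = spark.read.format("json").load("/mnt/lake/events")
    batch_counts = add_event_counts(batch_df)

    # Streaming: the same transformation over a stream of new files
    # (file-based streaming sources require an explicit schema)
    stream_df = (spark.readStream
        .format("json")
        .schema(batch_df.schema)
        .load("/mnt/lake/events"))
    stream_counts = add_event_counts(stream_df)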
In the day to day we will talk about writing Spark code and also refer to running that code on a Spark cluster. There are actually quite a few options for how to do either of these things, but let's take a quick look at Spark code that uses Spark DataFrames in Python (see the sketch below). Whatever cluster we run it on will have a concept of a master node and worker nodes, as well as some storage that is often a hybrid of local storage on the workers plus a distributed file system like Hadoop's HDFS, Amazon S3, or Azure Data Lake Storage. If you don't follow all those terms, it's OK. There is plenty of time to build up to those concepts after you start learning to write Spark code and run it in a simple Spark environment. We will cover that in other videos.
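Here is a minimal, self-contained sketch of that kind of DataFrame code; the file path and column names are made up for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # On a managed cluster (e.g., Databricks) this session usually exists already
    spark = SparkSession.builder.appName("quick-look").getOrCreate()

    # Read a CSV file into a DataFrame, letting Spark infer the schema
    sales = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/mnt/lake/sales.csv"))

    # Filter, aggregate, and show a few rows
    (sales
        .filter(F.col("amount") > 0)
        .groupBy("region")
        .agg(F.sum("amount").alias("total_amount"))
        .show())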
So we sort of get what Spark is: we saw a small code sample and discussed how a cluster exists to run the code on. Let's go back to a higher level and talk about Spark's strengths.
Quick overview of important Databricks workspace segments: Clusters, Tables, Notebooks
Open the create_parquet_tables notebook and run the first few commands as examples of working without Delta (a sketch of those commands follows)
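The notebook itself isn't reproduced here, but commands in that spirit (hypothetical path and table name) would look something like:

    # Write a DataFrame out as plain Parquet files (no Delta)
    df.write.format("parquet").mode("overwrite").save("/mnt/lake/customers_parquet")

    # Register the files as a table so it shows up under Tables in the workspace
    spark.sql("""
        CREATE TABLE IF NOT EXISTS customers_parquet
        USING PARQUET
        LOCATION '/mnt/lake/customers_parquet'
    """)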
Atomicity – a typical Spark save does not use locking and is not atomic, so it can leave incomplete changes behind and corrupt data. Overwrite removes existing data before loading new data, so this is typically not an issue there. With append mode the default committer should provide atomicity, but some of the faster committers don't guarantee it. - Learning Journal, Delta Lake for Apache Spark video on YouTube
Consistency – with a typical Spark overwrite there is a window where no files exist, and if a failure happens at that point you are left in an invalid state.
Isolation – an operation that is in progress (not committed) should not impact the results of other reads or writes; we do not want dirty reads. A typical database offers different isolation levels, but Spark doesn't offer specific commit options such as read committed or serializable. Task-level and job-level commits exist, but the lack of atomicity in writes leaves this not fully working.
Durability – typically not an issue, though the lack of a proper commit can lead to problems here as well.
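To make the risk above concrete, a hedged illustration (hypothetical path, assuming df and spark already exist):

    # A plain Parquet overwrite first deletes the existing files, then
    # writes the new ones. If the job dies in between, readers are left
    # with an empty or partially written path and no way to roll back.
    df.write.format("parquet").mode("overwrite").save("/mnt/lake/customers_parquet")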
Quote and image from Databricks blog post by Burak Yavuz, Michael Armbrust and Brenner Heintz -> https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
Demo notebook create_delta_tables
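A minimal sketch of what such a notebook might run (path and table name are hypothetical):

    # Write the same DataFrame as a Delta table instead of plain Parquet
    df.write.format("delta").mode("overwrite").save("/mnt/lake/customers_delta")

    # Register it so it appears under Tables in the workspace
    spark.sql("""
        CREATE TABLE IF NOT EXISTS customers_delta
        USING DELTA
        LOCATION '/mnt/lake/customers_delta'
    """)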
Show bad data when running one set of writes from one source, then running from a second source
Same example with a Delta destination to show the write fail on schema mismatch
Same example but tweaked to allow schema merge
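A hedged sketch of both behaviors; second_source_df and the path are hypothetical:

    # Appending a source with an extra column fails by default because
    # Delta enforces the table schema (raises an AnalysisException):
    # second_source_df.write.format("delta").mode("append").save("/mnt/lake/customers_delta")

    # Opting in to schema evolution merges the new column into the table
    (second_source_df.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("/mnt/lake/customers_delta"))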
Show transaction log files
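One way to do that from a Databricks notebook (hypothetical path; dbutils is available in Databricks):

    # Delta stores its transaction log as JSON commit files under _delta_log
    for f in dbutils.fs.ls("/mnt/lake/customers_delta/_delta_log"):
        print(f.path)

    # DESCRIBE HISTORY summarizes the same commits as a table
    spark.sql("DESCRIBE HISTORY delta.`/mnt/lake/customers_delta`").show(truncate=False)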
Demo of a table where data was streamed in; show querying it by timestamp and by version (sketch below)
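A minimal time travel sketch; the path, version number, and timestamp are hypothetical:

    # Read the table as of a specific commit version
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lake/events_delta")

    # Read the table as it looked at a point in time
    snapshot = (spark.read
        .format("delta")
        .option("timestampAsOf", "2019-09-01 00:00:00")
        .load("/mnt/lake/events_delta"))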