Spark Streaming with Azure Databricks

  1. Spark Streaming with Azure Databricks - Dustin Vannoy, Data Engineer (Cloud + Streaming)
  2. Dustin Vannoy, Data Engineering Consultant; Co-founder, Data Engineering San Diego. /in/dustinvannoy | @dustinvannoy | dustin@dustinvannoy.com. Technologies: Azure & AWS, Spark, Kafka, Python. Modern Data Systems: Data Lakes, Analytics in Cloud, Streaming
  3. Agenda  Shifting to Streaming  Spark Structured Streaming  Apache Kafka  Azure Event Hubs  Get Hands On
  4. Shifting to Streaming - If you haven’t started with streaming, you will soon
  5. Life’s a batch stream
  6. Why Streaming? Data Engineers have decided that the business only updates in batch. Our customers and business leaders know better.
  7. Is streaming ingestion easier?  Dealing with a large set of data at once brings its own challenges  Processing data as it comes in allows cleaner logic  Even if you are not doing real-time analytics yet, prepare for when you will
  8. Spark Structured Streaming - Technology overview
  9. Why Spark? Big data and the cloud changed our mindset. We want tools that scale easily as data size grows. Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
  10. What is Spark?  Fast, general-purpose engine for large-scale data processing  Replaces MapReduce as the Hadoop parallel programming API  Many options:  YARN / Spark cluster / local  Scala / Python / Java / R  Spark Core / SQL / Streaming / ML / Graph
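
      For readers trying the following examples outside Databricks, a minimal
      sketch of creating a local SparkSession (on Databricks, spark already
      exists in every notebook; reading Kafka also needs the
      spark-sql-kafka-0-10 connector on the classpath, which Databricks bundles):

          from pyspark.sql import SparkSession

          spark = (SparkSession.builder
              .appName("structured-streaming-demo")   # arbitrary name
              .master("local[*]")
              .getOrCreate())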
  11. What is Spark Structured Streaming?  Alternative to traditional Spark Streaming, which used DStreams  If you are familiar with Spark, it is best to think of Structured Streaming as the Spark SQL API, but for streaming  In PySpark, the streaming classes live in the pyspark.sql.streaming module
  12. What is Spark Structured Streaming? Tathagata Das (“TD”), lead developer on Spark Streaming:  “Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming”  “The simplest way to perform streaming analytics is not having to reason about streaming at all”  A table that is constantly appended to with each micro-batch. Reference: https://youtu.be/rl8dIzTpxrI
  13. Structured Streaming - Read

      df = (spark.readStream
          .format("kafka")
          .options(**consumer_config)
          .load())
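
      The deck never shows what consumer_config contains. Below is a minimal,
      hypothetical example for reading from Event Hubs through its
      Kafka-compatible endpoint; the namespace, topic, and connection string
      are placeholders, and the "kafkashaded" JAAS class prefix applies when
      running on Databricks (use the plain org.apache.kafka class name on
      vanilla Spark):

          # Event Hubs' Kafka endpoint listens on port 9093 and uses SASL PLAIN
          # with the literal username "$ConnectionString"; the password is the
          # namespace or hub connection string (placeholder here).
          jaas_config = (
              'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
              'required username="$ConnectionString" '
              'password="<event-hubs-connection-string>";'
          )

          consumer_config = {
              "kafka.bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
              "subscribe": "taxi-rides",
              "startingOffsets": "earliest",
              "kafka.security.protocol": "SASL_SSL",
              "kafka.sasl.mechanism": "PLAIN",
              "kafka.sasl.jaas.config": jaas_config,
          }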
  14. Structured Streaming - Write

      (df.writeStream
          .format("kafka")
          .options(**producer_config)
          .option("checkpointLocation", "/tmp/cp001")
          .start())
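
      A matching, equally hypothetical producer_config; the destination topic
      name is made up. Note that the Kafka sink expects the outgoing DataFrame
      to have a string or binary "value" column (and optionally "key"), and
      checkpointLocation is how Spark tracks progress for exactly-once
      behavior across restarts.

          producer_config = {
              "kafka.bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
              "topic": "taxi-rides-out",
              "kafka.security.protocol": "SASL_SSL",
              "kafka.sasl.mechanism": "PLAIN",
              "kafka.sasl.jaas.config": jaas_config,  # same JAAS string as the reader
          }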
  15. Structured Streaming - Window

      (df2.groupBy(
              col("VendorID"),
              window(col("pickup_dt"), "10 minutes"))
          .avg("trip_distance"))
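
      df2 is not defined on the slide. One plausible sketch, assuming each
      Kafka message value is a JSON taxi event with the three fields below,
      parses the stream from slide 13 and adds a watermark so windowed state
      stays bounded (the schema, column names, and 30-minute lateness bound
      are all assumptions):

          from pyspark.sql.functions import col, from_json, window
          from pyspark.sql.types import (DoubleType, StringType, StructField,
                                         StructType, TimestampType)

          # Assumed shape of each event's JSON payload.
          schema = StructType([
              StructField("VendorID", StringType()),
              StructField("pickup_dt", TimestampType()),
              StructField("trip_distance", DoubleType()),
          ])

          # Kafka delivers key/value as binary; cast the value and parse it.
          df2 = (df
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

          # The watermark bounds how long Spark waits for late events before
          # finalizing each 10-minute window.
          avg_by_window = (df2
              .withWatermark("pickup_dt", "30 minutes")
              .groupBy(col("VendorID"), window(col("pickup_dt"), "10 minutes"))
              .avg("trip_distance"))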
  16. Structured Streaming - Output Mode  Output is produced per micro-batch (by default a new micro-batch starts as soon as the previous one completes; a fixed interval can be set with trigger())  Append mode: just keep adding the newest records  Complete mode: output the latest state of the result table; useful for aggregation results
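
      Tying the modes to code: a demo-only sketch that writes the windowed
      aggregate from the previous sketch to the in-memory sink (the query
      name and 10-second trigger are arbitrary choices):

          # Complete mode re-emits the whole aggregate table every trigger,
          # which suits small demo aggregations; append mode would emit only
          # windows already finalized by the watermark.
          query = (avg_by_window.writeStream
              .outputMode("complete")
              .format("memory")                      # in-memory table, demos only
              .queryName("avg_distance")
              .trigger(processingTime="10 seconds")
              .start())

          # After the first trigger fires, the results are queryable with SQL.
          spark.sql("SELECT * FROM avg_distance").show()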
  17. Apache Kafka - Technology overview
  18. Why Kafka? Streaming data directly from one system to another is often problematic. Kafka serves as a scalable broker, keeping up with producers and persisting data for all consumers.
  19. The Log: “It is an append-only, totally-ordered sequence of records ordered by time.” - Jay Kreps. Reference: The Log - Jay Kreps
  20. Kafka Topics  A feed to which records are published  Multiple partitions per topic  Order retained within a partition
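
      To make "multiple partitions per topic" concrete, here is a hedged
      sketch that creates a three-partition topic with the confluent-kafka
      Python package (an assumption, not part of the deck; broker address and
      topic name are placeholders):

          from confluent_kafka.admin import AdminClient, NewTopic

          admin = AdminClient({"bootstrap.servers": "localhost:9092"})
          futures = admin.create_topics(
              [NewTopic("taxi-rides", num_partitions=3, replication_factor=1)])
          for topic, future in futures.items():
              future.result()  # blocks; raises if creation failed
              print(f"created {topic}")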
  21. Consumers and offsets  Offset = record ID  Consumers read in order  Multiple consumers per topic
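
      In Structured Streaming the offset bookkeeping lives in the checkpoint,
      so startingOffsets only matters the first time a query runs. Positions
      can also be supplied per partition as JSON; the topic and offsets below
      are made up (-2 means earliest, -1 means latest):

          # Start the slide-13 reader from explicit per-partition positions.
          offsets = '{"taxi-rides": {"0": 1500, "1": -2}}'

          df = (spark.readStream
              .format("kafka")
              .options(**consumer_config)
              .option("startingOffsets", offsets)
              .load())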
  22. Event Hubs - Technology overview
  23. Why Event Hubs? Same core capability as Kafka, delivered as PaaS instead of IaaS. Choose between the Kafka and Event Hubs APIs, and avoid the operational overhead of managing Kafka.
  24. Event Hubs key concepts  Namespace = container that holds multiple Event Hubs  Event Hub = topic  Partitions and consumer groups: same concepts as Kafka, with minor differences in implementation  Throughput units define the level of scalability
  25. Event Hubs Namespace Setup  Standard pricing tier required to enable Kafka  Each throughput unit: 1 MB/s ingress, 2 MB/s egress  Auto-Inflate allows autoscaling
  26. Event Hub Setup  Partition count = max # of parallel consumers  Message retention: more days = more $  Capture: save events to Azure Storage
  27. Shared Access Key
  28. Shared Access Key
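
      Rather than pasting the shared access key into a notebook, the
      connection string that embeds it can live in a Databricks secret scope.
      A sketch, assuming a scope "demo-scope" and key "eventhubs-conn" were
      created beforehand:

          # dbutils is available inside Databricks notebooks.
          conn_str = dbutils.secrets.get(scope="demo-scope", key="eventhubs-conn")

          jaas_config = (
              'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
              f'required username="$ConnectionString" password="{conn_str}";'
          )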
  29. Demo: Structured Streaming + Event Hubs for Kafka
  30. References  The Log - Jay Kreps  https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html  https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html  https://github.com/Azure/azure-event-hubs-for-kafka  https://github.com/Azure/azure-event-hubs-spark
  31. Please use EventsXD to fill out a session evaluation. Thank you!

Editor's Notes

  • In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough anymore. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size. In this session we will talk about why we need to shift some of our workloads from batch data jobs to streaming in real time. We'll dive into how Spark Structured Streaming in Azure Databricks enables this along with streaming data systems such as Kafka and Event Hubs. We will discuss the concepts, how Azure Databricks enables stream processing, and review code examples on a sample data set.
  • Shifting to Streaming:
    Rather than convincing our stakeholders that they don’t really need streaming, understand the needs, find the right uses for streaming, and make it happen.
    Discuss pros and cons, considerations before going to production, and general use cases in AI/ML


    Spark, Event Hubs, and Kafka
    Define the systems we will be using for this session, including some of the reasons we choose them
    Talk about some of the options for using these together

    Getting Hands On
    Review dependencies that are not covered
    Walk through basic setup of the most important pieces
    Demo of the use case code, highlighting some important Structured Streaming components

    Best Practices
    Cover things to consider when working with Spark Structured Streaming and Kafka or Event Hubs
  • In the world of data science, those of us who develop ETL pipelines have determined that everything can be processed in nightly or hourly batches, but that only makes sense to data engineers. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner.
  • Dealing with a large set of data at once brings its own challenges (a lot of resources needed at once, large table joins, running out of memory, etc.)
    Process as it comes in for cleaner logic (rather than seeing latest state, we see events as they happen and update state downstream)
    Even if not doing real-time analytics yet, prepare for when you will - the times they are a’changin
  • A fast and general engine for large-scale data processing; uses memory to provide a performance benefit
    Often replaces MapReduce as the parallel programming API on Hadoop; the way it handles data (RDDs) provides one performance benefit, and use of memory when possible provides another large one
    Can run on Hadoop (using YARN) but also as a separate Spark cluster. Local mode is possible as well but reduces the performance benefits… I find it’s still a useful API though
    Runs Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying Python and Scala and picking whichever is easiest for you.
    Several modules for different use cases with a similar API, so you can swap between modes relatively easily.
    For example, we have both streaming and batch sources of some data and we reuse the rest of the Spark processing transformations.
  • Window is essentially like grouping.
    Continuously compute the average distance for each vendor over the last 10 minutes
  • Quick overview of the important Databricks workspace sections: Clusters, Tables, Notebooks
    Open the create_parquet_tables notebook and run the first few commands as examples of working without Delta
