Apache Spark in Industry
Dorian Beganovic
About me
• Experience with Spark
  • Q-Park - 20 months
    • “Big data” - Spark, Hadoop, Data Lake
    • Data warehousing - Microsoft SQL Server
  • Personal projects
    • Machine learning on EEG data (3 months)
    • Spark Structured Streaming (1 month)
• Really interested in data systems
  • All types of databases (relational, parallel, columnar…)
  • Big data, cloud, distributed systems
Hadoop
Apache Hadoop
• Open-source framework for distributed storage and processing
• Origins in the “Nutch” project, started in 2002 by Doug Cutting and Mike Cafarella
• In 2006, Yahoo! created Hadoop, based on Google’s GFS and MapReduce papers
• Based on the MapReduce programming model (see the sketch below)
• Fundamental assumption: all modules are built to handle hardware failures automatically
• Clusters are built from commodity hardware
• Pig, Hive, Mahout - higher-level tools that abstract and optimize MapReduce
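To make the programming model concrete, here is a minimal single-machine sketch of MapReduce-style word counting in plain Python; Hadoop runs map and reduce functions like these across a cluster and transparently re-runs tasks on failed nodes.

```python
# A minimal, single-machine sketch of the MapReduce model (word count).
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map phase: emit (key, value) pairs for each input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all values that share a key.
    return (word, sum(counts))

lines = ["hello world", "hello spark"]
pairs = sorted(kv for line in lines for kv in map_fn(line))  # the "shuffle & sort"
result = [reduce_fn(word, [c for _, c in group])
          for word, group in groupby(pairs, key=itemgetter(0))]
print(result)  # [('hello', 2), ('spark', 1), ('world', 1)]
```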
Spark
Apache Spark
• Open-source, fast, and expressive cluster computing framework designed for big data analytics
• Compatible with Apache Hadoop
• Developed at UC Berkeley’s AMPLab in 2009 and donated to the Apache Software Foundation in 2013
• Original author - Matei Zaharia
• Databricks Inc. - the company behind Apache Spark (many other sponsors now)
Who uses Spark?
• In total, over 3,000 companies use Apache Spark
• Microsoft, Uber, Pinterest, Amazon, Oracle, Cisco, Verizon, Visa…
• https://spark.apache.org/powered-by.html
Why use Big Data tools?
• Complex analysis on 10TB+ of data
• Only use Big Data tools like Spark if your data doesn’t fit on a single machine
• The shuffle operation is extremely expensive (network I/O is very slow compared to memory and local disk); the sketch below shows what triggers one
(Figure: AWS EC2 instance types)
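As a rough illustration of the point above, a minimal PySpark sketch (hypothetical input path and column names) contrasting a narrow transformation with a wide one that forces a shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Hypothetical input path and column names, purely for illustration.
df = spark.read.parquet("s3://bucket/events")

# Narrow transformation: each partition is processed independently - no shuffle.
filtered = df.filter(df["amount"] > 0)

# Wide transformation: rows with the same key must meet on the same executor,
# so Spark repartitions data across the network (the shuffle).
totals = filtered.groupBy("user_id").sum("amount")

totals.explain()  # the physical plan contains an Exchange (shuffle) operator
```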
Why Spark - 1/4
1. Speed
Speed - 1/2
Speed - 2/2
2. Ease of use
Why Spark - 2/4
Ease of use
3. Generality
Why Spark - 3/4
Generality
• You can use one framework (Spark) for:
  • Processing batch (big) data - Spark SQL
  • Processing streaming (big) data - Spark Streaming
  • Machine learning at scale - Spark MLlib
  • Graph analysis at scale - GraphX
4. Runs everywhere
Why Spark - 4/4
• Access data from anywhere:
  • S3, HDFS, any JDBC database…
• Runs in (a sketch follows this list):
  • Standalone cluster mode
  • EC2 (AWS Elastic Compute Cloud)
  • Hadoop YARN
  • Apache Mesos
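A minimal sketch of how the deployment target is chosen, with hypothetical host names; in practice the master URL is usually passed to spark-submit rather than hard-coded:

```python
from pyspark.sql import SparkSession

# The master URL decides where the job runs; host names here are hypothetical.
spark = (
    SparkSession.builder
    .appName("deploy-demo")
    .master("spark://master-host:7077")    # standalone cluster mode
    # .master("yarn")                      # Hadoop YARN
    # .master("mesos://master-host:5050")  # Apache Mesos
    # .master("local[*]")                  # single machine, all cores
    .getOrCreate()
)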
Runs everywhere
• https://www.datanami.com/2017/09/29/hadoop-hard-find-strata-week/
• Currently Hadoop (HDFS) is slowly being replaced by object storage (AWS S3…) in the cloud
Spark Architecture
APIs
Spark SQL
• Grew out of “Shark”, which was used to execute Hive queries in-memory on Spark
• As of Spark 2.0 - SQL:2003 standard support
• By far the most popular library (you’ll certainly use it for any task)
• ~90% of the codebase
• A lot faster than raw RDDs and provides higher-level operations (built on top of RDDs)
• API is inspired by Python and R data frames
• Academic paper that introduced Spark SQL
Spark SQL Architecture
API example
• Ability to execute SQL queries is extremely powerful
• The official documentation is a great place to start
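For instance, a minimal sketch (hypothetical people.json with name and age fields) showing the same query in both the SQL and DataFrame styles:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical input file with "name" and "age" fields.
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

# The same query expressed as SQL and as DataFrame operations:
spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name").show()
df.groupBy("name").avg("age").show()
```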
Spark Streaming
• Scalable fault tolerant streaming system
• Very high level of abstraction and powerful APIs
• Receivers receive data and chop it into micro-batches (not a single record at a time)
• Spark processes batches and pushes out the result
• Input: files, Kafka, socket, Kinesis, Flume…
*The RDD-based streaming API (DStreams) is deprecated and being replaced by Structured Streaming
Spark Streaming Demo
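A minimal sketch of the DStream API, assuming a text stream on a local socket (e.g. started with `nc -lk 9999`); host and port are hypothetical:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Hypothetical source: a text stream on a local socket.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```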
Spark Structured Streaming Demo
• High-level streaming API built on DataFrames
• Catalyst optimizer creates incremental execution plan
• Unifies streaming, interactive and batch queries
• Supports multiple sources and sinks
• E.g. aggregate data in a stream, then serve using JDBC
• “The simplest way to perform streaming analytics is not having to reason about streaming.”
• Probably the coolest thing Spark has
Spark Structured Streaming
Spark Structured Streaming Demo
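A minimal Structured Streaming sketch, assuming a hypothetical Kafka broker and topic (requires the spark-sql-kafka connector package):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-demo").getOrCreate()

# Hypothetical Kafka broker and topic names.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Aggregate exactly as on a static DataFrame; Catalyst turns this into an
# incremental plan that updates the counts as new records arrive.
counts = stream.groupBy("key").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```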
Spark MLlib
• Best solution for distributed machine learning
• Not all algorithms are implemented (some can’t be)
• Really slow on single node or small datasets compared to established libraries
• APIs are very similar to those in scikit-learn (but can be painful to use from Scala or Java)
• Two APIs:
  • RDD-based (in “maintenance” mode)
  • DataFrame-based (recommended; a sketch follows)
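A minimal sketch of the DataFrame-based API, with hypothetical column names, showing the scikit-learn-like fit/transform flow:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical training data with numeric columns "f1", "f2" and a "label".
df = spark.read.parquet("training.parquet")

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Like scikit-learn: assemble a pipeline, fit it, then transform (predict).
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
```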
Spark MLlib
Key take-aways
• Don’t use Spark if you don’t need to (i.e. unless you actually have “big data”)
• The components and APIs have started consolidating and maturing (so your knowledge after 6 months won’t be outdated)
• Lots of resources on the internet are outdated, so focus only on Spark 2.0 and above
• Spark is the most popular tool for analysis of big data and likely to remain so in the future
• The future of Hadoop is very “cloudy” as more and more workloads are moving into the cloud (object storage such as S3)
Useful resources
• Spark home page: https://spark.apache.org/
• Apache Zeppelin notebook: https://zeppelin.apache.org
• Spark Core (Internals): https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
• Spark: The Definitive Guide (released in 2018): https://www.amazon.com/Spark-Definitive-Guide-Processing-Simple/dp/1491912219
