Apache Spark in Industry
Dorian Beganovic
About me
• Experience with Spark
  • Q-Park - 20 months
    • “Big data” - Spark, Hadoop, Data Lake
    • Data warehousing - Microsoft SQL Server
  • Personal projects
    • Machine learning on EEG data (3 months)
    • Spark Structured Streaming (1 month)
• Really interested in data systems
  • All types of databases (relational, parallel, columnar…)
  • Big data, cloud, distributed systems
Hadoop
Apache Hadoop
• Open-source framework for distributed storage and processing
• Origins in the “Nutch” project, started in 2002 by Doug Cutting and Mike Cafarella
• In 2006, Yahoo! created Hadoop, based on Google’s GFS and MapReduce papers
• Based on the MapReduce programming model (see the sketch below)
• Fundamental assumption: all modules are built to handle hardware failures automatically
• Clusters are built from commodity hardware
• Pig, Hive, Mahout - higher-level tools that abstract and optimize MapReduce
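To make the programming model concrete, here is a minimal single-machine sketch of MapReduce-style word counting in plain Python; Hadoop runs map and reduce functions like these across a cluster and transparently re-runs tasks on failed nodes.

```python
# A minimal, single-machine sketch of the MapReduce model (word count).
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map phase: emit (key, value) pairs for each input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all values that share a key.
    return (word, sum(counts))

lines = ["hello world", "hello spark"]
pairs = sorted(kv for line in lines for kv in map_fn(line))  # the "shuffle & sort"
result = [reduce_fn(word, [c for _, c in group])
          for word, group in groupby(pairs, key=itemgetter(0))]
print(result)  # [('hello', 2), ('spark', 1), ('world', 1)]
```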
Spark
Apache Spark
• Open-source, fast, and expressive cluster computing framework designed for big data analytics
• Compatible with Apache Hadoop
• Developed at UC Berkeley’s AMPLab in 2009 and donated to the Apache Software Foundation in 2013
• Original author - Matei Zaharia
• Databricks Inc. - the company behind Apache Spark (many other sponsors now)
Who uses Spark?
• In total, over 3,000 companies use Apache Spark
• Microsoft, Uber, Pinterest, Amazon, Oracle, Cisco, Verizon, Visa…
• https://spark.apache.org/powered-by.html
Why use Big Data tools?
• Complex analysis on 10TB+ of data
• Only use Big Data tools like Spark if your data doesn’t fit on a single machine
• The shuffle operation is extremely expensive (network I/O is very slow compared to memory and local disk); the sketch below shows what triggers one
(Figure: AWS EC2 instance types)
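As a rough illustration of the point above, a minimal PySpark sketch (hypothetical input path and column names) contrasting a narrow transformation with a wide one that forces a shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Hypothetical input path and column names, purely for illustration.
df = spark.read.parquet("s3://bucket/events")

# Narrow transformation: each partition is processed independently - no shuffle.
filtered = df.filter(df["amount"] > 0)

# Wide transformation: rows with the same key must meet on the same executor,
# so Spark repartitions data across the network (the shuffle).
totals = filtered.groupBy("user_id").sum("amount")

totals.explain()  # the physical plan contains an Exchange (shuffle) operator
```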
Why Spark - 1/4
1. Speed
Speed - 1/2
Speed - 2/2
2. Ease of use
Why Spark - 2/4
Ease of use
3. Generality
Why Spark - 3/4
Generality
• You can use one framework (Spark) for:
  • Processing batch (big) data - Spark SQL
  • Processing streaming (big) data - Spark Streaming
  • Machine learning at scale - Spark MLlib
  • Graph analysis at scale - GraphX
4. Runs everywhere
Why Spark - 4/4
• Access data from anywhere:
  • S3, HDFS, any JDBC database…
• Runs in (a sketch follows this list):
  • Standalone cluster mode
  • EC2 (AWS Elastic Compute Cloud)
  • Hadoop YARN
  • Apache Mesos
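A minimal sketch of how the deployment target is chosen, with hypothetical host names; in practice the master URL is usually passed to spark-submit rather than hard-coded:

```python
from pyspark.sql import SparkSession

# The master URL decides where the job runs; host names here are hypothetical.
spark = (
    SparkSession.builder
    .appName("deploy-demo")
    .master("spark://master-host:7077")    # standalone cluster mode
    # .master("yarn")                      # Hadoop YARN
    # .master("mesos://master-host:5050")  # Apache Mesos
    # .master("local[*]")                  # single machine, all cores
    .getOrCreate()
)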
Runs everywhere
• https://www.datanami.com/2017/09/29/hadoop-hard-find-strata-week/
• Currently Hadoop (HDFS) is slowly being replaced by object storage (AWS S3…) in the cloud
Spark Architecture
APIs
Spark SQL
• Grew out of “Shark”, which was used to execute Hive queries in-memory on Spark
• As of Spark 2.0 - SQL:2003 standard support
• By far the most popular library (you’ll certainly use it for any task)
• ~90% of the codebase
• A lot faster than raw RDDs and provides higher-level operations (built on top of RDDs)
• API is inspired by Python and R data frames
• Academic paper that introduced Spark SQL
Spark SQL Architecture
API example
• Ability to execute SQL queries is extremely powerful
• The official documentation is a great place to start
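For instance, a minimal sketch (hypothetical people.json with name and age fields) showing the same query in both the SQL and DataFrame styles:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical input file with "name" and "age" fields.
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

# The same query expressed as SQL and as DataFrame operations:
spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name").show()
df.groupBy("name").avg("age").show()
```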
Spark Streaming
• Scalable fault tolerant streaming system
• Very high level of abstraction and powerful APIs
• Receivers receive data and chop it into micro-batches (not a single record at a time)
• Spark processes batches and pushes out the result
• Input: files, Kafka, socket, Kinesis, Flume…
*The RDD-based streaming API (DStreams) is deprecated and being replaced by Structured Streaming
Spark Streaming Demo
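A minimal sketch of the DStream API, assuming a text stream on a local socket (e.g. started with `nc -lk 9999`); host and port are hypothetical:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Hypothetical source: a text stream on a local socket.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```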
Spark Structured Streaming Demo
• High-level streaming API built on DataFrames
• Catalyst optimizer creates incremental execution plan
• Unifies streaming, interactive and batch queries
• Supports multiple sources and sinks
• E.g. aggregate data in a stream, then serve using JDBC
• “The simplest way to perform streaming analytics is not having to reason about streaming.”
• Probably the coolest thing Spark has
Spark Structured Streaming
Spark Structured Streaming Demo
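A minimal Structured Streaming sketch, assuming a hypothetical Kafka broker and topic (requires the spark-sql-kafka connector package):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-demo").getOrCreate()

# Hypothetical Kafka broker and topic names.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Aggregate exactly as on a static DataFrame; Catalyst turns this into an
# incremental plan that updates the counts as new records arrive.
counts = stream.groupBy("key").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```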
Spark MLlib
• Best solution for distributed machine learning
• Not all algorithms are implemented (some can’t be)
• Really slow on single node or small datasets compared to established libraries
• APIs are very similar to those in scikit-learn (but can be painful to use from Scala or Java)
• Two APIs:
  • RDD-based (in “maintenance” mode)
  • DataFrame-based (recommended; a sketch follows)
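A minimal sketch of the DataFrame-based API, with hypothetical column names, showing the scikit-learn-like fit/transform flow:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical training data with numeric columns "f1", "f2" and a "label".
df = spark.read.parquet("training.parquet")

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Like scikit-learn: assemble a pipeline, fit it, then transform (predict).
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
```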
Spark MLlib
Key take-aways
• Don’t use Spark if you don’t need to (i.e. unless you actually have “big data”)
• The components and APIs have started consolidating and maturing (so your knowledge after 6 months won’t be outdated)
• Lots of resources on the internet are outdated, so focus only on Spark 2.0 and above
• Spark is the most popular tool for analysis of big data and likely to remain so in the future
• The future of Hadoop is very “cloudy” as more and more workloads are moving into the cloud (object storage such as S3)
Useful resources
• Spark home page: https://spark.apache.org/
• Apache Zeppelin notebook: https://zeppelin.apache.org
• Spark Core (Internals): https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
• Spark: The Definitive Guide (released in 2018): https://www.amazon.com/Spark-Definitive-Guide-Processing-Simple/dp/1491912219
