An Introduction to Apache Spark with Amazon EMR. Dr. Peter Smith's presentation slides from the Vancouver Amazon Web Services User Group Meetup on November 20, 2018 at ACL, hosted and presented by Onica.
1. An Introduction to Apache Spark
with Amazon EMR
Peter Smith, Principal Software Engineer, ACL
2. Overview
• What is Spark?
• A Brief Timeline of Spark
• Storing data in RDDs and DataFrames
• Distributed Processing of Distributed Data
• Loading data: CSV, Parquet, JDBC
• Queries: SQL, Scala, PySpark
• Setting up Spark with EMR (Demo)
3. What is Spark?
Apache Spark is a unified analytics engine
for large-scale data processing.
• Can process Terabytes of data (billions of rows)
• Click streams from a web application.
• IoT data.
• Financial trades.
• Computation is performed over multiple (potentially 1000s of) compute nodes.
• Held the world record for sorting 100TB of data in 23 minutes (2014).
4. Usage Scenarios
• Batch Processing
– Large amounts of data, read from disk, then processed.
• Streaming Data
– Data is processed in real-time
• Machine Learning
– Predicting outcomes based on past experience
• Graph Processing
– Arbitrary data relationships, not just rows and columns
5. Spark versus Database
[Diagram: a SQL client sends Select / Insert / Update statements to the SQL database, whose SQL interpreter / execution engine manages the disk files.]
• Disk files are in a proprietary format (e.g. B-Trees, WALs).
• Users never look directly at the data files.
• The execution engine has 100% control over file storage.
• Often the database server is a single machine (with lots of CPUs).
7. A Brief Timeline of Spark
2003 - Google File System
2004 - Map Reduce
[Timeline diagram: milestones in 2006, 2010, 2011, and 2013]
Hadoop 2.8.4 and Spark 2.3.2 (as of today)
8. How is Data Stored?
Spark allows data to be read or written from disk in a range of formats:
• CSV – Possibly the simplest and most common: Fred,Jones,1998,10,Stan
• JSON – Often generated by web applications. { "first_name": "Fred", … }
• JDBC – If the source data is in a database.
• Parquet and ORC – Optimized storage for column-oriented queries.
• Others – You’re free to write your own connectors.
Data is read or written as complete files – inserts and updates are not supported
(unlike a transactional database, which completely controls the structure of the data).
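The "complete files" model can be sketched in plain Python: to change one value you read the whole dataset, transform it, and write the whole dataset back out. The function and field layout below are illustrative, not from the slides:

```python
# Spark has no UPDATE statement: an "update" is read-transform-rewrite.
rows = [["1", "Fran", "Brown", "1982"],
        ["2", "Mary", "Jones", "1976"]]

def rewrite_with_update(rows, row_id, new_year):
    """Return a complete new dataset with one birth year changed."""
    return [[r[0], r[1], r[2], new_year] if r[0] == row_id else r
            for r in rows]

updated = rewrite_with_update(rows, "2", "1977")
```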
9. RDDs - How Spark Stores Data
• RDD – Resilient Distributed Dataset
• Data is stored in RAM, partitioned across multiple servers
• Each partition operates in parallel.
Instead of using database replication for resilience, Spark re-performs the lost work on a different worker.
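A toy pure-Python sketch of the RDD idea, assuming nothing beyond the stdlib: data lives in partitions, each partition is processed in parallel, and a failed partition's result is simply recomputed rather than read from a replica.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(100))
partitions = [data[i::4] for i in range(4)]   # 4 partitions

def process(partition):
    # Any per-partition computation; here, a sum of squares.
    return sum(x * x for x in partition)

# Each partition operates in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, partitions))

total = sum(results)

# If a worker dies, Spark just re-runs that partition's work elsewhere:
recomputed = process(partitions[2])
```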
10. Example: Sort people by age.
1. Divide into partitions of 8 people
2. Within group, sort by age.
3. Shuffle people based on decade
of birth.
4. Sort within each group.
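The four steps above can be sketched in plain Python, using birth years in place of people (values are illustrative): after shuffling by decade, concatenating the decade partitions in order yields a fully sorted dataset.

```python
import random

random.seed(1)
ages = [random.randint(1950, 1999) for _ in range(32)]

# 1. Divide into partitions of 8.
partitions = [ages[i:i + 8] for i in range(0, len(ages), 8)]

# 2. Sort within each partition (done in parallel on real Spark).
partitions = [sorted(p) for p in partitions]

# 3. Shuffle: route each value to the partition for its decade.
decades = {}
for p in partitions:
    for year in p:
        decades.setdefault(year // 10, []).append(year)

# 4. Sort within each decade partition; decades are disjoint ranges,
#    so concatenating them in order gives a global sort.
result = []
for d in sorted(decades):
    result.extend(sorted(decades[d]))
```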
11. Data Frames – Rows/Columns (> Spark v1.5)
• RDD – Rows of Java Objects
• DataFrame – Rows of Typed Fields (like a Database table)
Id (Integer) | First_name (String) | Last_name (String) | BirthYear (Integer) | Shoe size (Float) | Dog's name (String)
1            | Fran                | Brown              | 1982                | 10.5              | Stan
2            | Mary                | Jones              | 1976                | 9.0               | Fido
3            | Brad                | Pitt               | 1963                | 11.0              | Barker
4            | Jane                | Simpson            | 1988                | 8.0               | Rex
…            | …                   | …                  | …                   | …                 | …
• DataFrames allow better type-safety and performance optimization.
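The RDD-versus-DataFrame distinction can be sketched with a stdlib `NamedTuple` standing in for the schema (column names follow the table above; this is not Spark's API): named, typed columns are what let an engine type-check queries and pick compact columnar storage, instead of reflecting over opaque objects.

```python
from typing import NamedTuple

class Person(NamedTuple):
    id: int
    first_name: str
    last_name: str
    birth_year: int
    shoe_size: float
    dogs_name: str

# A DataFrame row: every field has a declared name and type.
row = Person(1, "Fran", "Brown", 1982, 10.5, "Stan")

# An RDD row, by contrast, is just an arbitrary object the engine
# cannot inspect ahead of time:
opaque_row = object()
```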
12. Example: Loading data from a CSV File
data.csv:
1,Fran,Brown,1982,10.5,Stan
2,Mary,Jones,1976,9.0,Fido
3,Brad,Pitt,1963,11.0,Barker
4,Jane,Simpson,1988,8.0,Rex
5,James,Thompson,1980,9.5,Bif
6,Paul,Wilson,1967,8.5,Ariel
7,Alice,Carlton,1984,11.5,Hank
8,Mike,Taylor,1981,9.5,Quincy
9,Shona,Smith,1975,9.0,Juneau
10,Phil,Arnold,1978,10.0,Koda
13. Example: Loading data from a CSV File
$ spark-shell
scala> val df = spark.read.csv("data.csv")
Notes:
• Similar methods exist for JSON, JDBC, Parquet, etc.
• You can write your own!
• Scala is a general purpose programming language (not like SQL)
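By default `spark.read.csv` produces autogenerated column names (`_c0`, `_c1`, …) and treats every value as a string unless a schema is supplied. A stdlib sketch of that default behaviour, using the data.csv contents from the previous slide:

```python
import csv
import io

data_csv = """1,Fran,Brown,1982,10.5,Stan
2,Mary,Jones,1976,9.0,Fido
3,Brad,Pitt,1963,11.0,Barker"""

rows = list(csv.reader(io.StringIO(data_csv)))

# No header row in the file, so columns get placeholder names:
header = [f"_c{i}" for i in range(len(rows[0]))]
```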
25. Queries: User Defined Functions
import org.apache.spark.sql.functions.udf

def taxRateFunc(year: Int) = {
  if (year >= 1984) 0.20 else 0.05
}
val taxRate = udf(taxRateFunc _)
df.select('birth_year, taxRate('birth_year)).show(5)
+----------+---------------+
|birth_year|UDF(birth_year)|
+----------+---------------+
| 1982| 0.05|
| 1976| 0.05|
| 1963| 0.05|
| 1988| 0.20|
| 1980| 0.05|
+----------+---------------+
UDAFs – check out "Computing Average Dates in Spark!" at http://build.acl.com
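Because a UDF is ordinary program code (not SQL), its logic can be mirrored and unit-tested outside Spark. The same tax-rate rule in plain Python, applied to the birth years from the output above:

```python
def tax_rate(year: int) -> float:
    # Same rule as the Scala UDF: 20% from 1984 onward, else 5%.
    return 0.20 if year >= 1984 else 0.05

birth_years = [1982, 1976, 1963, 1988, 1980]
rates = [tax_rate(y) for y in birth_years]
```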
26. Why is Spark better than a Database?
It looks a lot like SQL, but:
• Can read/write data in arbitrary formats.
• Can be extended with general purpose program code.
• Can be split across 1000s of compute nodes.
• Can do ML, Streaming, Graph queries.
• Can use cheap storage (such as S3)
But yeah, if you’re happy with your database, that’s OK too.
27. Queries: PySpark
Very similar API, but written in Python:
$ pyspark
>>> spark.read.csv("data.csv").show(5)
+---+-----+--------+----+----+------+
|_c0| _c1| _c2| _c3| _c4| _c5|
+---+-----+--------+----+----+------+
| 1| Fran| Brown|1982|10.5| Stan|
| 2| Mary| Jones|1976| 9.0| Fido|
| 3| Brad| Pitt|1963|11.0|Barker|
| 4| Jane| Simpson|1988| 8.0| Rex|
| 5|James|Thompson|1980| 9.5| Bif|
+---+-----+--------+----+----+------+