An Introduction to Apache Spark
with Amazon EMR
Peter Smith, Principal Software Engineer, ACL
Overview
• What is Spark?
• A Brief Timeline of Spark
• Storing data in RDDs and DataFrames
• Distributed Processing of Distributed Data
• Loading data: CSV, Parquet, JDBC
• Queries: SQL, Scala, PySpark
• Setting up Spark with EMR (Demo)
What is Spark?
Apache Spark is a unified analytics engine
for large-scale data processing.
• Can process Terabytes of data (billions of rows)
• Click streams from a web application.
• IoT data.
• Financial trades.
• Computation is spread over multiple (potentially thousands of) compute nodes.
• Has held the world record for sorting 100TB of data in 23 minutes (2014)
Usage Scenarios
• Batch Processing
– Large amounts of data, read from disk, then processed.
• Streaming Data
– Data is processed in real-time
• Machine Learning
– Predicting outcomes based on past experience
• Graph Processing
– Arbitrary data relationships, not just rows and columns
Spark versus Database
[Diagram: a SQL Client sends SQL Select, Insert, Update statements to a SQL Database containing the SQL Interpreter / Execution Engine]
• Disk files in proprietary format (e.g. B-Trees, WALs)
• Users never look directly at data files.
• Execution engine has 100% control over file storage.
• Often database server is a single machine (lots of CPUs)
Spark versus Database
[Diagram: a Spark Driver (SQL, Java, Scala, Python, R) and four Spark Workers on separate EC2 servers; storage in an S3 Bucket and Amazon EFS]
• Disk formats and locations are 100% controlled by user.
• No Transactional Inserts or Updates!
• Compute is spread over multiple servers to improve scale.
A Brief Timeline of Spark
2003 - Google File System
2004 - Map Reduce
2006, 2010, 2011, 2013 - [milestones shown as logos on the original slide]
Current versions (as of today): Hadoop 2.8.4, Spark 2.3.2
How is Data Stored?
Spark allows data to be read from or written to disk in a range of formats:
• CSV – Possibly the simplest and most common: Fred,Jones,1998,10,Stan
• JSON – Often generated by web applications. { "first_name": "Fred", … }
• JDBC – If the source data is in a database.
• Parquet and ORC – Optimized storage for column-oriented queries.
• Others – You’re free to write your own connectors.
Data is read or written as complete files; inserts and updates are not supported
(unlike a transactional database, which completely controls the structure of its data).
RDDs - How Spark Stores Data
• RDD – Resilient Distributed Dataset
• Data is stored in RAM, partitioned across multiple servers
• Partitions are processed in parallel.
Instead of using database replication (for resilience), Spark will re-perform the work on a different worker.
Example: Sort people by age.
1. Divide into partitions of 8 people.
2. Within each partition, sort by age.
3. Shuffle people based on decade of birth.
4. Sort within each group.
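The four steps above can be sketched in plain Python (no Spark required). This only illustrates the idea and is not how Spark implements sorting. Two assumptions: the partition size is 4 rather than 8, so that the ten sample rows from data.csv span several partitions, and "age" is represented by birth year, so the oldest person comes first.

```python
from collections import defaultdict

# (first_name, birth_year) pairs taken from the data.csv sample in these slides.
people = [
    ("Fran", 1982), ("Mary", 1976), ("Brad", 1963), ("Jane", 1988),
    ("James", 1980), ("Paul", 1967), ("Alice", 1984), ("Mike", 1981),
    ("Shona", 1975), ("Phil", 1978),
]

PARTITION_SIZE = 4  # assumption: the slide uses 8; 4 gives three partitions here


def by_birth_year(person):
    return person[1]


# Steps 1 and 2: split into fixed-size partitions and sort each one independently.
partitions = [
    sorted(people[i:i + PARTITION_SIZE], key=by_birth_year)
    for i in range(0, len(people), PARTITION_SIZE)
]

# Step 3: shuffle, sending every person to the bucket for their decade of birth.
buckets = defaultdict(list)
for partition in partitions:
    for person in partition:
        buckets[person[1] // 10 * 10].append(person)

# Step 4: sort inside each bucket, then read the buckets in decade order.
by_year = [
    person
    for decade in sorted(buckets)
    for person in sorted(buckets[decade], key=by_birth_year)
]

print([name for name, _ in by_year])
```

Because each decade bucket covers a disjoint range of years, sorting the buckets separately and reading them in order gives the same result as one global sort.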
Data Frames – Rows/Columns (Spark v1.3 and later)
• RDD – Rows of Java Objects
• DataFrame – Rows of Typed Fields (like a Database table)
Id (Integer) | First_name (String) | Last_name (String) | BirthYear (Integer) | Shoe size (Float) | Dog’s name (String)
1 | Fran | Brown | 1982 | 10.5 | Stan
2 | Mary | Jones | 1976 | 9.0 | Fido
3 | Brad | Pitt | 1963 | 11.0 | Barker
4 | Jane | Simpson | 1988 | 8.0 | Rex
… | … | … | … | … | …
• DataFrames allow better type-safety and performance optimization.
Example: Loading data from a CSV File
data.csv:
1,Fran,Brown,1982,10.5,Stan
2,Mary,Jones,1976,9.0,Fido
3,Brad,Pitt,1963,11.0,Barker
4,Jane,Simpson,1988,8.0,Rex
5,James,Thompson,1980,9.5,Bif
6,Paul,Wilson,1967,8.5,Ariel
7,Alice,Carlton,1984,11.5,Hank
8,Mike,Taylor,1981,9.5,Quincy
9,Shona,Smith,1975,9.0,Juneau
10,Phil,Arnold,1978,10.0,Koda
Example: Loading data from a CSV File
$ spark-shell
scala> val df = spark.read.csv("data.csv")
Notes:
• Similar methods exist for JSON, JDBC, Parquet, etc.
• You can write your own!
• Scala is a general-purpose programming language (unlike SQL)
Example: Examining a Data Frame
scala> df.show(5)
+---+-----+--------+----+----+------+
|_c0| _c1| _c2| _c3| _c4| _c5|
+---+-----+--------+----+----+------+
| 1| Fran| Brown|1982|10.5| Stan|
| 2| Mary| Jones|1976| 9.0| Fido|
| 3| Brad| Pitt|1963|11.0|Barker|
| 4| Jane| Simpson|1988| 8.0| Rex|
| 5|James|Thompson|1980| 9.5| Bif|
+---+-----+--------+----+----+------+
Example: Defining a Schema
scala> df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
Example: Defining a Schema
scala> import org.apache.spark.sql.types._
scala> val mySchema = StructType(
  Array(
    StructField("id", LongType),
    StructField("first_name", StringType),
    StructField("last_name", StringType),
    StructField("birth_year", IntegerType),
    StructField("shoe_size", FloatType),
    StructField("dog_name", StringType)
  )
)
scala> val df = spark.read.schema(mySchema).csv("data.csv")
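To see what supplying a schema changes, here is a rough plain-Python analogy using the standard csv module. The schema list of (name, converter) pairs is a hypothetical stand-in for mySchema, not a Spark API. It only mimics the effect: fields read without a schema stay strings, while fields read with one are named and typed.

```python
import csv
import io

# First three rows of data.csv from the slides.
raw = """1,Fran,Brown,1982,10.5,Stan
2,Mary,Jones,1976,9.0,Fido
3,Brad,Pitt,1963,11.0,Barker
"""

# Hypothetical stand-in for mySchema: (column name, converter) pairs.
schema = [
    ("id", int), ("first_name", str), ("last_name", str),
    ("birth_year", int), ("shoe_size", float), ("dog_name", str),
]

# Without a schema every field is a string, like the _c0.._c5 columns.
untyped = list(csv.reader(io.StringIO(raw)))

# With a schema each field is named and converted to its declared type.
typed = [
    {name: convert(value) for (name, convert), value in zip(schema, row)}
    for row in untyped
]

print(untyped[0])
print(typed[0])
```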
Example: Defining a Schema
scala> df.show(5)
+---+----------+---------+----------+---------+--------+
| id|first_name|last_name|birth_year|shoe_size|dog_name|
+---+----------+---------+----------+---------+--------+
| 1| Fran| Brown| 1982| 10.5| Stan|
| 2| Mary| Jones| 1976| 9.0| Fido|
| 3| Brad| Pitt| 1963| 11.0| Barker|
| 4| Jane| Simpson| 1988| 8.0| Rex|
| 5| James| Thompson| 1980| 9.5| Bif|
+---+----------+---------+----------+---------+--------+
scala> df.printSchema
root
|-- id: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- birth_year: integer (nullable = true)
|-- shoe_size: float (nullable = true)
|-- dog_name: string (nullable = true)
Example: Counting Records
scala> df.count()
res21: Long = 10
Now imagine 10 billion rows spread over 1,000 servers.
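A count scales to many servers because it can be computed piecewise. The sketch below, in plain Python, shows the general idea under an assumed layout of ten row ids on three workers. It is a simplification and does not reproduce Spark's actual execution plan.

```python
# Hypothetical layout: ten row ids split across three workers.
partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]

# Each worker counts only its own partition (on a cluster this runs in parallel).
partial_counts = [len(partition) for partition in partitions]

# The driver adds up one small number per partition instead of moving any rows.
total = sum(partial_counts)

print(partial_counts, total)
```

Only one integer per partition travels over the network, which is why the same approach still works when the rows no longer fit on one machine.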
Example: Selecting Columns
scala> val df_dog = df.select(
  col("first_name"),
  col("dog_name"))
scala> df_dog.show(5)
+----------+--------+
|first_name|dog_name|
+----------+--------+
| Fran| Stan|
| Mary| Fido|
| Brad| Barker|
| Jane| Rex|
| James| Bif|
+----------+--------+
Example: Aggregations
scala> df.agg(
min(col("birth_year")),
avg(col("birth_year"))
).show
+---------------+---------------+
|min(birth_year)|avg(birth_year)|
+---------------+---------------+
| 1963| 1977.4|
+---------------+---------------+
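The two numbers above can be checked by hand. The plain-Python sketch below also hints at how an average can be computed over partitioned data: each partition is reduced to a small summary, and the summaries are merged. The two-way split is an assumption made for illustration only.

```python
# The birth_year column of data.csv, in file order.
birth_years = [1982, 1976, 1963, 1988, 1980, 1967, 1984, 1981, 1975, 1978]

# Hypothetical split of the column across two workers.
partitions = [birth_years[:6], birth_years[6:]]

# Each worker reduces its partition to a small (min, sum, count) summary.
summaries = [(min(p), sum(p), len(p)) for p in partitions]

# Merging: the average needs the sums and counts, not per-partition averages.
overall_min = min(s[0] for s in summaries)
overall_avg = sum(s[1] for s in summaries) / sum(s[2] for s in summaries)

print(overall_min, overall_avg)
```

Averaging the per-partition averages would only be correct when all partitions have the same size, which is why the sum and count are carried separately.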
Example: Filtering
scala> df.where("birth_year > 1980").show
+---+----------+---------+----------+---------+--------+
| id|first_name|last_name|birth_year|shoe_size|dog_name|
+---+----------+---------+----------+---------+--------+
| 1| Fran| Brown| 1982| 10.5| Stan|
| 4| Jane| Simpson| 1988| 8.0| Rex|
| 7| Alice| Carlton| 1984| 11.5| Hank|
| 8| Mike| Taylor| 1981| 9.5| Quincy|
+---+----------+---------+----------+---------+--------+
Example: Grouping
scala> df.groupBy(
(floor(col("birth_year") / 10) * 10) as "Decade"
).count.show
+------+-----+
|Decade|count|
+------+-----+
| 1960| 2|
| 1970| 3|
| 1980| 5|
+------+-----+
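The grouping key is plain integer arithmetic: dividing by 10, rounding down, and multiplying by 10 maps every year to the start of its decade. A minimal plain-Python check of the counts above, using the birth years from data.csv:

```python
from collections import Counter

# The birth_year column of data.csv, in file order.
birth_years = [1982, 1976, 1963, 1988, 1980, 1967, 1984, 1981, 1975, 1978]

# year // 10 * 10 plays the role of floor(birth_year / 10) * 10.
decade_counts = Counter(year // 10 * 10 for year in birth_years)

for decade in sorted(decade_counts):
    print(decade, decade_counts[decade])
```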
Example: More Advanced
scala> df.select(
col("first_name"),
col("dog_name"),
levenshtein(
col("first_name"),
col("dog_name")
) as "Diff"
).show(5)
+----------+--------+----+
|first_name|dog_name|Diff|
+----------+--------+----+
| Fran| Stan| 2|
| Mary| Fido| 4|
| Brad| Barker| 4|
| Jane| Rex| 4|
| James| Bif| 5|
+----------+--------+----+
Shorter version:
df.select(
'first_name,
'dog_name,
levenshtein(
'first_name,
'dog_name) as "Diff"
).show(5)
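For readers unfamiliar with the Levenshtein distance, it is the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below is a textbook dynamic-programming version in plain Python, written for illustration and not taken from Spark's source. It reproduces the Diff column for the five rows shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        previous = current
    return previous[-1]


pairs = [("Fran", "Stan"), ("Mary", "Fido"), ("Brad", "Barker"),
         ("Jane", "Rex"), ("James", "Bif")]
diffs = [levenshtein(first, dog) for first, dog in pairs]
print(diffs)
```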
Queries: User Defined Functions
def taxRateFunc(year: Int) = {
if (year >= 1984) 0.20 else 0.05
}
val taxRate = udf(taxRateFunc _)
df.select('birth_year, taxRate('birth_year)).show(5)
+----------+---------------+
|birth_year|UDF(birth_year)|
+----------+---------------+
| 1982| 0.05|
| 1976| 0.05|
| 1963| 0.05|
| 1988| 0.20|
| 1980| 0.05|
+----------+---------------+
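Conceptually, a UDF is an ordinary function applied to every value in a column. The plain-Python sketch below mirrors taxRateFunc and applies it to the five birth years shown. In PySpark the function would additionally be wrapped with pyspark.sql.functions.udf before use in a query, and that step is omitted here so the example runs without Spark.

```python
def tax_rate(year: int) -> float:
    """Same rule as taxRateFunc: 20% from 1984 onward, otherwise 5%."""
    return 0.20 if year >= 1984 else 0.05


# The first five birth_year values of data.csv.
birth_years = [1982, 1976, 1963, 1988, 1980]

# Applying the function row by row is what the UDF does for each value.
rates = [tax_rate(year) for year in birth_years]

print(list(zip(birth_years, rates)))
```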
UDAFs - Check out "Computing Average Dates in Spark!" at http://build.acl.com
Why is Spark better than a Database?
It looks a lot like SQL, but:
• Can read/write data in arbitrary formats.
• Can be extended with general purpose program code.
• Can be split across 1000s of compute nodes.
• Can do ML, Streaming, Graph queries.
• Can use cheap storage (such as S3)
But yeah, if you’re happy with your database, that’s OK too.
Queries: PySpark
Very similar API, but written in Python:
$ pyspark
>>> spark.read.csv("data.csv").show(5)
+---+-----+--------+----+----+------+
|_c0| _c1| _c2| _c3| _c4| _c5|
+---+-----+--------+----+----+------+
| 1| Fran| Brown|1982|10.5| Stan|
| 2| Mary| Jones|1976| 9.0| Fido|
| 3| Brad| Pitt|1963|11.0|Barker|
| 4| Jane| Simpson|1988| 8.0| Rex|
| 5|James|Thompson|1980| 9.5| Bif|
+---+-----+--------+----+----+------+
Demo Time…
Using EMR
[Diagram: one Driver and four Workers, each an m4.2xlarge instance, with Query and Data flows; web interfaces: Zeppelin and Spark History]

Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR

  • 1. An Introduction to Apache Spark with Amazon EMR Peter Smith, Principal Software Engineer, ACL
  • 2. Overview • What is Spark? • A Brief Timeline of Spark • Storing data in RDDs and DataFrames • Distributed Processing of Distributed Data • Loading data: CSV, Parquet, JDBC • Queries: SQL, Scala, PySpark • Setting up Spark with EMR (Demo)
  • 3. What is Spark? Apache Spark is a unified analytics engine for large-scale data processing. • Can process Terabytes of data (billions of rows) • Click streams from a web application. • IoT data. • Financial trades. • Computation performed over multiple (potentially 1000s of) compute nodes. • Has held the world record for sorting 100TB of data in 23 minutes (2014)
  • 4. Usage Scenarios • Batch Processing – Large amounts of data, read from disk, then processed. • Streaming Data – Data is processed in real-time • Machine Learning – Predicting outcomes based on past experience • Graph Processing – Arbitrary data relationships, not just rows and columns
  • 5. Spark versus Database [Diagram labels: SQL Client; SQL Select, Insert, Update; SQL Database; SQL Interpreter / Execution Engine] • Disk files in proprietary format (e.g. B-Trees, WALs) • Users never look directly at data files. • Execution engine has 100% control over file storage. • Often database server is a single machine (lots of CPUs)
  • 6. Spark versus Database [Diagram labels: Spark Driver; Spark Worker (x4); SQL, Java, Scala, Python, R; Separate EC2 Servers; S3 Bucket; Amazon EFS] • Disk formats and locations are 100% controlled by user. • No Transactional Inserts or Updates! • Compute is spread over multiple servers to improve scale.
  • 7. A Brief Timeline of Spark 2003 - Google File System 2004 - Map Reduce [Timeline figure; remaining milestones: 2006, 2010, 2011, 2013] Hadoop 2.8.4 and Spark 2.3.2 (as of today)
  • 8. How is Data Stored? Spark allows data to be read or written from disk in a range of formats: • CSV – Possibly the simplest and most common: Fred,Jones,1998,10,Stan • JSON – Often generated by web applications. { "first_name": "Fred", … } • JDBC – If the source data is in a database. • Parquet and ORC – Optimized storage for column-oriented queries. • Others – You’re free to write your own connectors. Data is read or written as complete files; inserts and updates are not supported (unlike a transactional database, which completely controls the structure of the data).
  • 9. RDDs - How Spark Stores Data • RDD – Resilient Distributed Dataset • Data is stored in RAM, partitioned across multiple servers • Each partition operates in parallel. Instead of using database replication (for resilience), Spark will re-perform the work on a different worker.
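The recovery idea on slide 9 (redo the work rather than keep replicas) can be modeled in a few lines of plain Python. This is only a sketch and does not use Spark: the helper names `split` and `square_all` are hypothetical, and a real RDD tracks its lineage automatically.

```python
# Hypothetical model: a result is rebuilt from its input partition plus the
# function that produced it, instead of being restored from a replica.

def split(data, num_partitions):
    """Deal the data out round-robin into num_partitions lists."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def square_all(partition):
    """The transformation applied independently to every partition."""
    return [x * x for x in partition]


source_partitions = split(list(range(10)), 3)
results = [square_all(p) for p in source_partitions]

# Simulate losing the worker that held the result for partition 1.
results[1] = None

# Recovery: re-run the same transformation on the same input partition,
# which in Spark could be scheduled on a different worker.
results[1] = square_all(source_partitions[1])

print(results)
```

Only the lost partition is recomputed; the other two results are left untouched.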
  • 10. Example: Sort people by age. 1. Divide into partitions of 8 people 2. Within group, sort by age. 3. Shuffle people based on decade of birth. 4. Sort within each group.
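The four steps on slide 10 can be walked through in plain Python, using the ten people from data.csv. This is a single-process sketch, not Spark code: the function name `distributed_sort` and the partition size of 4 are assumptions made for illustration (the slide uses 8), and "age" is approximated by sorting on birth year.

```python
from collections import defaultdict

# (first_name, birth_year) pairs taken from the data.csv example in these slides.
PEOPLE = [
    ("Fran", 1982), ("Mary", 1976), ("Brad", 1963), ("Jane", 1988),
    ("James", 1980), ("Paul", 1967), ("Alice", 1984), ("Mike", 1981),
    ("Shona", 1975), ("Phil", 1978),
]


def by_year(person):
    return person[1]


def distributed_sort(people, partition_size=4):
    # 1. Divide into fixed-size partitions.
    partitions = [people[i:i + partition_size]
                  for i in range(0, len(people), partition_size)]
    # 2. Sort within each partition; Spark would do these sorts in parallel.
    partitions = [sorted(p, key=by_year) for p in partitions]
    # 3. Shuffle: route every person to the bucket for their decade of birth.
    buckets = defaultdict(list)
    for partition in partitions:
        for person in partition:
            buckets[person[1] // 10 * 10].append(person)
    # 4. Sort within each bucket, then read the buckets in decade order.
    result = []
    for decade in sorted(buckets):
        result.extend(sorted(buckets[decade], key=by_year))
    return result


if __name__ == "__main__":
    for name, year in distributed_sort(PEOPLE):
        print(year, name)
```

Because every bucket covers a disjoint range of years, concatenating the sorted buckets in decade order gives a fully sorted result without any single machine having to sort everything.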
  • 11. Data Frames – Rows/Columns (> Spark v1.5)
    • RDD – Rows of Java Objects
    • DataFrame – Rows of Typed Fields (like a Database table)

    Id (Integer)  First_name (String)  Last_name (String)  BirthYear (Integer)  Shoe size (Float)  Dog’s name (String)
    1             Fran                 Brown               1982                 10.5               Stan
    2             Mary                 Jones               1976                 9.0                Fido
    3             Brad                 Pitt                1963                 11.0               Barker
    4             Jane                 Simpson             1988                 8.0                Rex
    …             …                    …                   …                    …                  …

    • DataFrames allow better type-safety and performance optimization.
  • 12. Example: Loading data from a CSV File
    data.csv:
    1,Fran,Brown,1982,10.5,Stan
    2,Mary,Jones,1976,9.0,Fido
    3,Brad,Pitt,1963,11.0,Barker
    4,Jane,Simpson,1988,8.0,Rex
    5,James,Thompson,1980,9.5,Bif
    6,Paul,Wilson,1967,8.5,Ariel
    7,Alice,Carlton,1984,11.5,Hank
    8,Mike,Taylor,1981,9.5,Quincy
    9,Shona,Smith,1975,9.0,Juneau
    10,Phil,Arnold,1978,10.0,Koda
  • 13. Example: Loading data from a CSV File
    $ spark-shell
    scala> val df = spark.read.csv("data.csv")
    Notes:
    • Similar methods exist for JSON, JDBC, Parquet, etc.
    • You can write your own!
    • Scala is a general purpose programming language (not like SQL)
  • 14. Example: Examining a Data Frame
    scala> df.show(5)
    +---+-----+--------+----+----+------+
    |_c0|  _c1|     _c2| _c3| _c4|   _c5|
    +---+-----+--------+----+----+------+
    |  1| Fran|   Brown|1982|10.5|  Stan|
    |  2| Mary|   Jones|1976| 9.0|  Fido|
    |  3| Brad|    Pitt|1963|11.0|Barker|
    |  4| Jane| Simpson|1988| 8.0|   Rex|
    |  5|James|Thompson|1980| 9.5|   Bif|
    +---+-----+--------+----+----+------+
  • 15. Example: Defining a Schema
    scala> df.printSchema
    root
     |-- _c0: string (nullable = true)
     |-- _c1: string (nullable = true)
     |-- _c2: string (nullable = true)
     |-- _c3: string (nullable = true)
     |-- _c4: string (nullable = true)
     |-- _c5: string (nullable = true)
  • 17. Example: Defining a Schema
    scala> import org.apache.spark.sql.types._
    scala> val mySchema = StructType(
             Array(
               StructField("id", LongType),
               StructField("first_name", StringType),
               StructField("last_name", StringType),
               StructField("birth_year", IntegerType),
               StructField("shoe_size", FloatType),
               StructField("dog_name", StringType)
             )
           )
    scala> val df = spark.read.schema(mySchema).csv("data.csv")
  • 18. Example: Defining a Schema
    scala> df.show(5)
    +---+----------+---------+----------+---------+--------+
    | id|first_name|last_name|birth_year|shoe_size|dog_name|
    +---+----------+---------+----------+---------+--------+
    |  1|      Fran|    Brown|      1982|     10.5|    Stan|
    |  2|      Mary|    Jones|      1976|      9.0|    Fido|
    |  3|      Brad|     Pitt|      1963|     11.0|  Barker|
    |  4|      Jane|  Simpson|      1988|      8.0|     Rex|
    |  5|     James| Thompson|      1980|      9.5|     Bif|
    +---+----------+---------+----------+---------+--------+
    scala> df.printSchema
    root
     |-- id: long (nullable = true)
     |-- first_name: string (nullable = true)
     |-- last_name: string (nullable = true)
     |-- birth_year: integer (nullable = true)
     |-- shoe_size: float (nullable = true)
     |-- dog_name: string (nullable = true)
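To see what supplying a schema changes, here is a plain-Python sketch (standard library only, no Spark) that reads the same rows and converts each column according to a list of (name, type) pairs. The names `SCHEMA` and `read_with_schema` are hypothetical, and the Python types are only stand-ins for Spark's LongType, StringType, IntegerType and FloatType.

```python
import csv
import io

# First three rows of data.csv from the slides, inlined so the sketch is self-contained.
CSV_TEXT = """1,Fran,Brown,1982,10.5,Stan
2,Mary,Jones,1976,9.0,Fido
3,Brad,Pitt,1963,11.0,Barker
"""

# (column name, converter) pairs mirroring the fields of mySchema.
SCHEMA = [
    ("id", int), ("first_name", str), ("last_name", str),
    ("birth_year", int), ("shoe_size", float), ("dog_name", str),
]


def read_with_schema(text, schema):
    """Parse CSV text and convert every field using the matching schema entry."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, raw)})
    return rows


rows = read_with_schema(CSV_TEXT, SCHEMA)
print(rows[0])
```

Without the schema every field stays a string, which is what the `_c0` ... `_c5` string columns in the earlier `printSchema` output reflect.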
  • 19. Example: Counting Records scala> df.count() res21: Long = 10 Imagine 10 Billion rows over 1000 servers?
  • 20. Example: Selecting Columns
    scala> val df_dog = df.select(col("first_name"), col("dog_name"))
    scala> df_dog.show(5)
    +----------+--------+
    |first_name|dog_name|
    +----------+--------+
    |      Fran|    Stan|
    |      Mary|    Fido|
    |      Brad|  Barker|
    |      Jane|     Rex|
    |     James|     Bif|
    +----------+--------+
  • 22. Example: Filtering
    scala> df.where("birth_year > 1980").show
    +---+----------+---------+----------+---------+--------+
    | id|first_name|last_name|birth_year|shoe_size|dog_name|
    +---+----------+---------+----------+---------+--------+
    |  1|      Fran|    Brown|      1982|     10.5|    Stan|
    |  4|      Jane|  Simpson|      1988|      8.0|     Rex|
    |  7|     Alice|  Carlton|      1984|     11.5|    Hank|
    |  8|      Mike|   Taylor|      1981|      9.5|  Quincy|
    +---+----------+---------+----------+---------+--------+
  • 23. Example: Grouping
    scala> df.groupBy((floor(col("birth_year") / 10) * 10) as "Decade").count.show
    +------+-----+
    |Decade|count|
    +------+-----+
    |  1960|    2|
    |  1970|    3|
    |  1980|    5|
    +------+-----+
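The decade arithmetic in the grouping example can be checked by hand with a short plain-Python sketch (no Spark). The list `BIRTH_YEARS` is copied from the fourth column of data.csv; `decade_counts` is a name chosen for illustration.

```python
from collections import Counter

# Birth years from the ten rows of data.csv in these slides.
BIRTH_YEARS = [1982, 1976, 1963, 1988, 1980, 1967, 1984, 1981, 1975, 1978]

# Same arithmetic as floor(birth_year / 10) * 10, written with integer division.
decade_counts = Counter(year // 10 * 10 for year in BIRTH_YEARS)

for decade in sorted(decade_counts):
    print(decade, decade_counts[decade])
```

The counts agree with the table on the slide: 2 people born in the 1960s, 3 in the 1970s and 5 in the 1980s.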
  • 24. Example: More Advanced
    scala> df.select(
             col("first_name"),
             col("dog_name"),
             levenshtein(col("first_name"), col("dog_name")) as "Diff"
           ).show(5)
    +----------+--------+----+
    |first_name|dog_name|Diff|
    +----------+--------+----+
    |      Fran|    Stan|   2|
    |      Mary|    Fido|   4|
    |      Brad|  Barker|   4|
    |      Jane|     Rex|   4|
    |     James|     Bif|   5|
    +----------+--------+----+
    Shorter version:
    df.select('first_name, 'dog_name, levenshtein('first_name, 'dog_name) as "Diff").show(5)
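For readers unfamiliar with the Diff column: Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a textbook dynamic-programming version in plain Python, offered as a sketch for checking the numbers by hand; it is not Spark's implementation.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (case-sensitive)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


# (first_name, dog_name) pairs from the first five rows of data.csv.
PAIRS = [("Fran", "Stan"), ("Mary", "Fido"), ("Brad", "Barker"),
         ("Jane", "Rex"), ("James", "Bif")]
diffs = [levenshtein(name, dog) for name, dog in PAIRS]
print(diffs)
```

The five distances come out as 2, 4, 4, 4 and 5, matching the Diff column on the slide.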
  • 25. Queries: User Defined Functions
    def taxRateFunc(year: Int) = { if (year >= 1984) 0.20 else 0.05 }
    val taxRate = udf(taxRateFunc _)
    df.select('birth_year, taxRate('birth_year)).show(5)
    +----------+---------------+
    |birth_year|UDF(birth_year)|
    +----------+---------------+
    |      1982|           0.05|
    |      1976|           0.05|
    |      1963|           0.05|
    |      1988|           0.20|
    |      1980|           0.05|
    +----------+---------------+
    UDAFs - Check out http://build.acl.com Computing Average Dates in Spark!
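A UDF is an ordinary function that gets applied to a column one row at a time. The plain-Python sketch below (no Spark; `tax_rate` is a hypothetical translation of `taxRateFunc`) reproduces the five values shown on the slide and makes the 1984 cut-off explicit.

```python
def tax_rate(year):
    """Plain-Python counterpart of taxRateFunc: 0.20 from 1984 onwards, else 0.05."""
    return 0.20 if year >= 1984 else 0.05


# birth_year values from the first five rows of data.csv.
BIRTH_YEARS = [1982, 1976, 1963, 1988, 1980]

# Applying the function row by row is what wrapping it in udf() asks Spark to do.
rates = [tax_rate(year) for year in BIRTH_YEARS]
print(list(zip(BIRTH_YEARS, rates)))
```

Only 1988 falls on or after the cut-off, so it is the one row that gets the higher rate.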
  • 26. Why is Spark better than a Database? It looks a lot like SQL, but: • Can read/write data in arbitrary formats. • Can be extended with general purpose program code. • Can be split across 1000s of compute nodes. • Can do ML, Streaming, Graph queries. • Can use cheap storage (such as S3) But yeah, if you’re happy with your database, that’s OK too.
  • 27. Queries: PySpark
    Very similar API, but written in Python:
    $ pyspark
    >>> spark.read.csv("data.csv").show(5)
    +---+-----+--------+----+----+------+
    |_c0|  _c1|     _c2| _c3| _c4|   _c5|
    +---+-----+--------+----+----+------+
    |  1| Fran|   Brown|1982|10.5|  Stan|
    |  2| Mary|   Jones|1976| 9.0|  Fido|
    |  3| Brad|    Pitt|1963|11.0|Barker|
    |  4| Jane| Simpson|1988| 8.0|   Rex|
    |  5|James|Thompson|1980| 9.5|   Bif|
    +---+-----+--------+----+----+------+