Apache Spark: in and out
Ben Fradet - Tech lead
In and out
1. Intro
2. The different batch APIs
3. Real world examples
4. Addressing the API shortcomings
5. Running and configuring a Spark job on AWS EMR
6. Outro
Intro
The different batch APIs - minimal examples
Input:

id  word
1   Scala
2   Spark
3   API
4   Scala

RDD:

    val counts = rdd
      .map(line => (line.word, 1))
      .reduceByKey(_ + _)

DataFrame:

    val counts = df
      .groupBy("word")
      .count()

SQL:

    val counts = spark
      .sql("select word, count(*) " +
        "from words group by word")

Dataset:

    val counts = ds
      .groupByKey(_.word)
      .count()

Output:

word   count
Scala  2
Spark  1
API    1
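A minimal, self-contained sketch of the four snippets above running against the same four input rows. The `Word` case class, the `WordCounts` object and the local session settings are assumptions made for illustration; they are not part of the original slides.

```scala
import org.apache.spark.sql.SparkSession

// Record type matching the id/word input table (the name Word is an assumption).
final case class Word(id: Int, word: String)

object WordCounts {
  // Runs the same word count through each batch API and returns the four results.
  def all(spark: SparkSession): Seq[Map[String, Long]] = {
    import spark.implicits._

    val words = Seq(Word(1, "Scala"), Word(2, "Spark"), Word(3, "API"), Word(4, "Scala"))
    val rdd = spark.sparkContext.parallelize(words)
    val ds = words.toDS()
    val df = ds.toDF()
    df.createOrReplaceTempView("words") // makes the rows visible to spark.sql

    // RDD: plain JVM objects and lambdas
    val viaRdd = rdd.map(w => (w.word, 1L)).reduceByKey(_ + _).collect().toMap
    // DataFrame: columns referenced by name
    val viaDf = df.groupBy("word").count().as[(String, Long)].collect().toMap
    // SQL: the query is a string
    val viaSql = spark
      .sql("select word, count(*) from words group by word")
      .as[(String, Long)]
      .collect()
      .toMap
    // Dataset: typed lambdas over the case class
    val viaDs = ds.groupByKey(_.word).count().collect().toMap

    Seq(viaRdd, viaDf, viaSql, viaDs)
  }
}
```

Called with a local session, for example `SparkSession.builder().master("local[*]").appName("word-counts").getOrCreate()`, each of the four maps should contain Scala -> 2, Spark -> 1 and API -> 1.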
The different batch APIs - comparison
Feature             RDD                SQL       DataFrame     Dataset
API looks like      Scala collections  SQL       Scala / SQL   Scala collections
In memory           JVM objects        Off heap  Off heap      Off heap
Query optimization  ✗                  ✔         ✔             ✔
Code generation     ✗                  ✔         ✔             ✔
Syntax errors       Compile time       Runtime   Compile time  Compile time
Analysis errors     N/A                Runtime   Runtime       Compile time
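To make the "Analysis errors" row concrete, here is a small sketch, assuming a local SparkSession and a hypothetical `Entry` case class. Selecting an unknown column by name compiles but is rejected by Spark's analyzer at runtime, while the typed lambda is checked by the Scala compiler.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Try}

// Hypothetical record type used only for this illustration.
final case class Entry(id: Int, word: String)

object AnalysisErrors {
  // Returns true when the untyped lookup fails at runtime with an AnalysisException.
  def unknownColumnFailsAtRuntime(spark: SparkSession): Boolean = {
    import spark.implicits._

    val ds = Seq(Entry(1, "Scala"), Entry(2, "Spark")).toDS()
    val df = ds.toDF()

    // DataFrame: the column name is just a string, so this compiles
    // and only fails once Spark analyses the query.
    val untyped = Try(df.select("not_word"))

    // Dataset: ds.map(_.notWord) would not compile at all,
    // because Entry has no such field. The valid access is:
    val typed = ds.map(_.word).collect()

    untyped match {
      case Failure(_: AnalysisException) => typed.nonEmpty
      case _                             => false
    }
  }
}
```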
Real world examples - EnrichJob
    val input: RDD[_] = getInputRDD(inputPath)

    // enrich once and cache, since the result is read twice below
    val all: RDD[(_, List[ValidatedEnrichedEvent])] =
      input.map(e => (e, enrich(e))).cache()

    val bad: RDD[Row] = all.flatMap(e => projectBads(e))
    spark.createDataFrame(bad).write.text(badOutputPath)

    val good: RDD[EnrichedEvent] = all.flatMap(e => projectGoods(e))
    spark.createDataset(good).write.csv(goodOutputPath)
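The shape of this job (enrich every input once, cache, then project failures and successes separately) can be sketched with a toy `enrich`. Everything here is an illustrative assumption: the `EnrichSplit` object, the use of `Either` in place of the validated event types, and the integer-parsing "enrichment" are not the actual Snowplow code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EnrichSplit {
  // Hypothetical stand-in for the real enrichment:
  // Left carries a failure message, Right an "enriched" value.
  def enrich(line: String): Either[String, Int] =
    try Right(line.trim.toInt)
    catch { case _: NumberFormatException => Left(s"not a number: $line") }

  // Returns (bad, good), both computed from the same cached RDD.
  def split(spark: SparkSession, lines: Seq[String]): (Seq[String], Seq[Int]) = {
    val input: RDD[String] = spark.sparkContext.parallelize(lines)

    // Cached because two separate actions read it below;
    // without cache() the enrichment would run twice.
    val all: RDD[(String, Either[String, Int])] =
      input.map(e => (e, enrich(e))).cache()

    val bad: RDD[String] = all.flatMap { case (_, r) => r.left.toOption }
    val good: RDD[Int] = all.flatMap { case (_, r) => r.toOption }

    val result = (bad.collect().toSeq, good.collect().toSeq)
    all.unpersist()
    result
  }
}
```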
Real world examples - ShredJob
    val input: RDD[String] = sc.textFile(inputPath)

    val all: RDD[(String, List[ValidatedShreddedEvent])] =
      input.map(e => shred(e)).cache()

    val bad: RDD[Row] = all.flatMap(e => projectBads(e))

    val deduped: RDD[(ShreddedEvent, ValidatedBoolean)] = all
      .flatMap(e => projectGoods(e))
      // in-batch deduplication: keep one event per (id, fingerprint)
      .groupBy(e => (e.eventId, e.eventFingerprint))
      .flatMap { case (_, values) => values.take(1) }
      // cross-batch deduplication
      .map(e => (e, dedupeCrossBatch(e)))

    // …

    spark.createDataFrame(bad + dedupeF).write.text(badOutputPath)
    spark.createDataFrame(dedupeS.map(_.event)).write.text(goodOutputPath)
    dedupeS.map(_.shreds).write.text(shreddedTypesOutputPath)
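The in-batch deduplication step is the `groupBy` followed by `take(1)`. Because the RDD API mirrors Scala collections, the same chain can be tried on a plain `Seq` without a cluster. This is only a sketch: the `Event` case class and its `payload` field are assumptions, and a real RDD would distribute the grouping across executors.

```scala
// Hypothetical event type carrying the two fields used as the dedupe key.
final case class Event(eventId: String, eventFingerprint: String, payload: String)

object InBatchDedupe {
  // Keeps a single event per (eventId, eventFingerprint) pair.
  def dedupe(events: Seq[Event]): Seq[Event] =
    events
      .groupBy(e => (e.eventId, e.eventFingerprint))
      .flatMap { case (_, values) => values.take(1) }
      .toSeq
}
```

Two events sharing both the id and the fingerprint collapse into one, while events that share only the id are kept as distinct.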
Addressing the API shortcomings - typelevel/frameless
    // untyped -> runtime exception
    ds.select(ds("not-word"))

    val typedDS = TypedDataset.create(ds)
    // typed -> doesn't compile
    typedDS.select(typedDS('not-word))

    val counts = ds.groupByKey(_.word).count()
    case class WordCount(word: String, count: Int)
    // checked at compile time
    counts.as[WordCount]

id  word
1   Scala
2   Spark
3   API
4   Scala
Addressing the API shortcomings - BenFradet/struct-type-encoder
    case class MyCaseClass(a: Int, b: String, c: Double)

    // schema inferred by reading the JSON files
    val inferred = spark
      .read
      .json("/some/dir/*.json")
      .as[MyCaseClass]

    // schema derived from the case class at compile time
    val derived = spark
      .read
      .schema(StructTypeEncoder[MyCaseClass].encode)
      .json("/some/dir/*.json")
      .as[MyCaseClass]

Plus support for metadata and deeply-nested schemas!
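Why pass a schema at all? Without one, `spark.read.json` has to scan the input to infer column types before the job proper starts. As a rough point of comparison, and not the struct-type-encoder API itself, Spark's built-in product encoder can also produce a `StructType` from a case class. The sketch below swaps in `Encoders.product` to show what such a schema looks like; the `MyCaseClassSchema` object name is an assumption.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

final case class MyCaseClass(a: Int, b: String, c: Double)

object MyCaseClassSchema {
  // Schema obtained from Spark's own encoder for the case class.
  // This is a stand-in for StructTypeEncoder[MyCaseClass].encode,
  // whose output may differ, for example in nullability.
  val schema: StructType = Encoders.product[MyCaseClass].schema
}
```

A schema like this can be handed to `spark.read.schema(...)` in the same position as in the `derived` example.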
Running and configuring a Spark job on AWS EMR
--configurations '[{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.nodemanager.vmem-check-enabled": "false",
    "yarn.nodemanager.resource.memory-mb": "117760",
    "yarn.scheduler.maximum-allocation-mb": "117760"
  }
},{
  "Classification": "spark",
  "Properties": { "maximizeResourceAllocation": "false" }
},{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "4",
    "spark.yarn.executor.memoryOverhead": "3072",
    "spark.executor.memory": "20G",
    "spark.executor.cores": "3",
    "spark.yarn.driver.memoryOverhead": "3072",
    "spark.driver.memory": "20G",
    "spark.driver.cores": "3",
    "spark.default.parallelism": "48"
  }
}]'
aws emr create-cluster \
  --name "Snowplow Enrich Job" \
  --release-label emr-5.12.0 \
  --applications Name=Spark \
  --instance-type r4.4xlarge \
  --instance-count 1 \
  --steps '[{
    "Name": "Snowplow Spark Enrich",
    "Args": [...],
    "Jar": "s3://bucket/my-jar.jar",
    "ActionOnFailure": "CONTINUE",
    "MainClass": "EnrichJob",
    "Type": "CUSTOM_JAR",
    "Properties": "string"
  }]'
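The numbers in the configuration add up, which the arithmetic below checks. It is a sketch under two assumptions: that the driver runs in its own YARN container alongside the four executors, and that the parallelism was chosen as a multiple of the executor cores. The slides do not state how the values were derived. For reference, an r4.4xlarge instance has 16 vCPUs and 122 GiB of memory.

```scala
object EmrMemoryBudget {
  // Values copied from the spark-defaults classification.
  val executorInstances = 4
  val executorMemoryMb = 20 * 1024
  val executorOverheadMb = 3072
  val driverMemoryMb = 20 * 1024
  val driverOverheadMb = 3072
  val coresPerContainer = 3
  val defaultParallelism = 48

  // Each container needs its heap plus the off-heap overhead: 23552 MB.
  val perExecutorMb: Int = executorMemoryMb + executorOverheadMb
  val perDriverMb: Int = driverMemoryMb + driverOverheadMb

  // 4 executors + 1 driver = 5 containers of 23552 MB = 117760 MB,
  // which matches yarn.nodemanager.resource.memory-mb.
  val totalMb: Int = executorInstances * perExecutorMb + perDriverMb

  // 5 containers * 3 cores = 15 of the 16 vCPUs.
  val totalCores: Int = (executorInstances + 1) * coresPerContainer

  // 48 partitions over 12 executor cores = 4 tasks per core.
  val tasksPerExecutorCore: Int =
    defaultParallelism / (executorInstances * coresPerContainer)
}
```

The 117760 MB handed to YARN is 115 GiB, which leaves about 7 GiB of the instance's 122 GiB for the operating system and other daemons.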
Spark config cheat sheet
Thanks!
GitHub:
- github.com/BenFradet
- github.com/snowplow/snowplow
Twitter:
- @fradetben
- @snowplowdata
Contact us:
sales@snowplowanalytics.com
snowplowanalytics.com
