SlideShare a Scribd company logo
Using Apache Spark
Bojan Babic
Apache Spark
spark.incubator.apache.org
github.com/apache/incubator-spark
user@spark.incubator.apache.org
The Spark Community
INTRODUCTION TO APACHE SPARK
What is Spark?
Efficient
• General execution
graphs
• In-memory storage
Usable
• Rich APIs in Java,
Scala, Python
• Interactive shell
Fast and Expressive Cluster Computing
System Compatible with Apache Hadoop
Today’s
talk
Key Concepts
Resilient Distributed Datasets
• Collections of objects spread
across a cluster, stored in RAM or
on Disk
• Built through parallel
transformations
• Automatically rebuilt on failure
• in hadoop work RDD corresponds
to partition elsewhere dataframe
Operations
• Transformations
(e.g. map, filter,
groupBy)
• Actions
(e.g. count, collect,
save)
Working With RDDs
RDD
RDD
RDD
RDD
Transfor
mations
Acti
on
Value
linesWithSpark = textFile.filter(l => l.contains("Spark”))
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
textFile = sc.textFile(”/var/log/hadoop/somelog”)
Spark (batch) example
Load error messages from a log into memory, then
interactively search for various patterns
val sc = new SparkContext(config)
val lines = spark.textFile(“hdfs://...”)
val messages = lines.filter(l => l.startswith(“ERROR”))
.map(e => e.split(“t“)(2))
messages.cache()
messages.filter(s => s.contains(“mongo”)).count()
messages.filter(s => s.contains(“500”)).count()
Base RDD
Transformed RDD
Action
Scaling Down
Spark Streaming example
val sc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map[String, String](...)
val messages = KafkaUtils.createStream[String, String, StringDecoder,
StringDecoder]
(
sc,
kafkaParams,
Map(topic -> threadsPerTopic),
storageLevel = StorageLevel.MEMORY_ONLY_SER
)
val revenue = messages.filter(m => m.contains(“[PURCHASE]”))
.map(p => p.split(“t”)(4)).reduceByKey(_ + _)
Mini-batch
Fault Recovery
RDDs track lineage information that can be used
to efficiently recompute lost data
val msgs = textFile.filter(l => l.contains(“ERROR”))
.map(e => e.split(“t”)(2))
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith
(…))
map
(func = split
(...))
JOB EXECUTION
Software Components
• Spark runs as a library in your
program (1 instance per app)
• Runs tasks locally or on cluster
– Mesos, YARN or standalone
mode
• Accesses storage systems via
Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
Your application
SparkContext
Local
threads
Cluster
manager
Worker
Spark
executor
Worker
Spark
executor
HDFS or other storage
Task Scheduler
• General task
graphs
• Automatically
pipelines functions
• Data locality
aware
• Partitioning aware
to avoid shuffles
= cached
partition
=
RDD
joi
n
filte
r
groupB
y
Stage
3
Stage
1
Stage
2
A: B
:
C
:
D
:
E
:
F
:
ma
p
Advanced Features
• Controllable partitioning
– Speed up joins against a dataset
• Controllable storage formats
– Keep data serialized for efficiency, replicate to
multiple nodes, cache on disk
• Shared variables: broadcasts, accumulators
WORKING WITH SPARK
Using the Shell
Launching:
Modes:
MASTER=local ./spark-shell # local, 1 thread
MASTER=local[2] ./spark-shell # local, 2 threads
MASTER=spark://host:port ./spark-shell # cluster
spark-shell
pyspark (IPYTHON=1)
SparkContext
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you’d make your own
(see later for details)
RDD Operators
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
sample
take
first
partitionBy
mapWith
pipe
save ...
Add Spark to Your Project
• Scala: add dependency to build.sbt
libraryDependencies ++= Seq(
("org.apache.spark" %% "spark-core" % "1.2.0").
exclude("org.mortbay.jetty", "servlet-api")
...
)
• make sure you exclude overlapping libs
and add Spark-Streaming
• Scala: add dependency to build.sbt
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.2.0"
PageRank Performance
Other Iterative Algorithms
Time per Iteration (s)
CONCLUSION
Conclusion
• Spark offers a rich API to make data analytics
fast: both fast to write and fast to run
• Achieves 100x speedups in real applications
• Growing community with 25+ companies
contributing
Thanks!!!
disclosure: some parts of this presentation have been inspired by Matei Zaharia ampmeeting in Berlkey

More Related Content

What's hot

Satellite Downlink Budget Analysis
Satellite Downlink Budget AnalysisSatellite Downlink Budget Analysis
Satellite Downlink Budget Analysis
Partho Choudhury
 
Comparative Study and Performance Analysis of different Modulation Techniques...
Comparative Study and Performance Analysis of different Modulation Techniques...Comparative Study and Performance Analysis of different Modulation Techniques...
Comparative Study and Performance Analysis of different Modulation Techniques...
Souvik Das
 
Region filling
Region fillingRegion filling
Region filling
hetvi naik
 
Radix 4 booth
Radix 4 boothRadix 4 booth
Radix 4 booth
Richu Jose Cyriac
 
NLP_KASHK:Markov Models
NLP_KASHK:Markov ModelsNLP_KASHK:Markov Models
NLP_KASHK:Markov Models
Hemantha Kulathilake
 
Huffman coding
Huffman coding Huffman coding
Huffman coding
Nazmul Hyder
 
Lecture6 Signal and Systems
Lecture6 Signal and SystemsLecture6 Signal and Systems
Lecture6 Signal and Systems
babak danyal
 
Frequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital ImagesFrequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital Images
Upendra Pratap Singh
 
Chapter 1 and 2 gonzalez and woods
Chapter 1 and 2 gonzalez and woodsChapter 1 and 2 gonzalez and woods
Chapter 1 and 2 gonzalez and woods
asodariyabhavesh
 
Image restoration and degradation model
Image restoration and degradation modelImage restoration and degradation model
Image restoration and degradation model
AnupriyaDurai
 
Chapter 9 morphological image processing
Chapter 9   morphological image processingChapter 9   morphological image processing
Chapter 9 morphological image processing
Ahmed Daoud
 
Compression Ii
Compression IiCompression Ii
Compression Ii
anithabalaprabhu
 
Heaps
HeapsHeaps
Heaps
pratmash
 
15 predicate
15 predicate15 predicate
15 predicate
Tianlu Wang
 
0 1 knapsack using branch and bound
0 1 knapsack using branch and bound0 1 knapsack using branch and bound
0 1 knapsack using branch and bound
Abhishek Singh
 
Good denoising using wavelets
Good denoising using waveletsGood denoising using wavelets
Good denoising using wavelets
beenamohan
 
Dilation and erosion
Dilation and erosionDilation and erosion
Dilation and erosion
Aswin Pv
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
ABIRAMI M
 
Unit ii
Unit iiUnit ii
IMPLEMENTATION OF UPSAMPLING & DOWNSAMPLING
IMPLEMENTATION OF UPSAMPLING & DOWNSAMPLINGIMPLEMENTATION OF UPSAMPLING & DOWNSAMPLING
IMPLEMENTATION OF UPSAMPLING & DOWNSAMPLING
FAIZAN SHAFI
 

What's hot (20)

Satellite Downlink Budget Analysis
Satellite Downlink Budget AnalysisSatellite Downlink Budget Analysis
Satellite Downlink Budget Analysis
 
Comparative Study and Performance Analysis of different Modulation Techniques...
Comparative Study and Performance Analysis of different Modulation Techniques...Comparative Study and Performance Analysis of different Modulation Techniques...
Comparative Study and Performance Analysis of different Modulation Techniques...
 
Region filling
Region fillingRegion filling
Region filling
 
Radix 4 booth
Radix 4 boothRadix 4 booth
Radix 4 booth
 
NLP_KASHK:Markov Models
NLP_KASHK:Markov ModelsNLP_KASHK:Markov Models
NLP_KASHK:Markov Models
 
Huffman coding
Huffman coding Huffman coding
Huffman coding
 
Lecture6 Signal and Systems
Lecture6 Signal and SystemsLecture6 Signal and Systems
Lecture6 Signal and Systems
 
Frequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital ImagesFrequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital Images
 
Chapter 1 and 2 gonzalez and woods
Chapter 1 and 2 gonzalez and woodsChapter 1 and 2 gonzalez and woods
Chapter 1 and 2 gonzalez and woods
 
Image restoration and degradation model
Image restoration and degradation modelImage restoration and degradation model
Image restoration and degradation model
 
Chapter 9 morphological image processing
Chapter 9   morphological image processingChapter 9   morphological image processing
Chapter 9 morphological image processing
 
Compression Ii
Compression IiCompression Ii
Compression Ii
 
Heaps
HeapsHeaps
Heaps
 
15 predicate
15 predicate15 predicate
15 predicate
 
0 1 knapsack using branch and bound
0 1 knapsack using branch and bound0 1 knapsack using branch and bound
0 1 knapsack using branch and bound
 
Good denoising using wavelets
Good denoising using waveletsGood denoising using wavelets
Good denoising using wavelets
 
Dilation and erosion
Dilation and erosionDilation and erosion
Dilation and erosion
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
Unit ii
Unit iiUnit ii
Unit ii
 
IMPLEMENTATION OF UPSAMPLING & DOWNSAMPLING
IMPLEMENTATION OF UPSAMPLING & DOWNSAMPLINGIMPLEMENTATION OF UPSAMPLING & DOWNSAMPLING
IMPLEMENTATION OF UPSAMPLING & DOWNSAMPLING
 

Similar to Introduction to Apache Spark Ecosystem

Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Olalekan Fuad Elesin
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Marius Soutier
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
Handaru Sakti
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 

Similar to Introduction to Apache Spark Ecosystem (20)

Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark core
Spark coreSpark core
Spark core
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Recently uploaded

Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
wisnuprabawa3
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
abbyasa1014
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
gerogepatton
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
nooriasukmaningtyas
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
sachin chaurasia
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 

Recently uploaded (20)

Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 

Introduction to Apache Spark Ecosystem

  • 5. What is Spark? Efficient • General execution graphs • In-memory storage Usable • Rich APIs in Java, Scala, Python • Interactive shell Fast and Expressive Cluster Computing System Compatible with Apache Hadoop
  • 7. Key Concepts Resilient Distributed Datasets • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations • Automatically rebuilt on failure • in hadoop work RDD corresponds to partition elsewhere dataframe Operations • Transformations (e.g. map, filter, groupBy) • Actions (e.g. count, collect, save)
  • 8. Working With RDDs RDD RDD RDD RDD Transfor mations Acti on Value linesWithSpark = textFile.filter(l => l.contains("Spark”)) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark textFile = sc.textFile(”/var/log/hadoop/somelog”)
  • 9. Spark (batch) example Load error messages from a log into memory, then interactively search for various patterns val sc = new SparkContext(config) val lines = spark.textFile(“hdfs://...”) val messages = lines.filter(l => l.startswith(“ERROR”)) .map(e => e.split(“t“)(2)) messages.cache() messages.filter(s => s.contains(“mongo”)).count() messages.filter(s => s.contains(“500”)).count() Base RDD Transformed RDD Action
  • 11. Spark Streaming example val sc = new StreamingContext(conf, Seconds(10)) val kafkaParams = Map[String, String](...) val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder] ( sc, kafkaParams, Map(topic -> threadsPerTopic), storageLevel = StorageLevel.MEMORY_ONLY_SER ) val revenue = messages.filter(m => m.contains(“[PURCHASE]”)) .map(p => p.split(“t”)(4)).reduceByKey(_ + _) Mini-batch
  • 12. Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data val msgs = textFile.filter(l => l.contains(“ERROR”)) .map(e => e.split(“t”)(2)) HDFS File Filtered RDD Mapped RDD filter (func = startsWith (…)) map (func = split (...))
  • 14. Software Components • Spark runs as a library in your program (1 instance per app) • Runs tasks locally or on cluster – Mesos, YARN or standalone mode • Accesses storage systems via Hadoop InputFormat API – Can use HBase, HDFS, S3, … Your application SparkContext Local threads Cluster manager Worker Spark executor Worker Spark executor HDFS or other storage
  • 15. Task Scheduler • General task graphs • Automatically pipelines functions • Data locality aware • Partitioning aware to avoid shuffles = cached partition = RDD joi n filte r groupB y Stage 3 Stage 1 Stage 2 A: B : C : D : E : F : ma p
  • 16. Advanced Features • Controllable partitioning – Speed up joins against a dataset • Controllable storage formats – Keep data serialized for efficiency, replicate to multiple nodes, cache on disk • Shared variables: broadcasts, accumulators
  • 18. Using the Shell Launching: Modes: MASTER=local ./spark-shell # local, 1 thread MASTER=local[2] ./spark-shell # local, 2 threads MASTER=spark://host:port ./spark-shell # cluster spark-shell pyspark (IPYTHON=1)
  • 19. SparkContext • Main entry point to Spark functionality • Available in shell as variable sc • In standalone programs, you’d make your own (see later for details)
  • 20. RDD Operators • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip sample take first partitionBy mapWith pipe save ...
  • 21. Add Spark to Your Project • Scala: add dependency to build.sbt libraryDependencies ++= Seq( ("org.apache.spark" %% "spark-core" % "1.2.0"). exclude("org.mortbay.jetty", "servlet-api") ... ) • make sure you exclude overlapping libs
  • 22. and add Spark-Streaming • Scala: add dependency to build.sbt libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.2.0"
  • 24. Other Iterative Algorithms Time per Iteration (s)
  • 26. Conclusion • Spark offers a rich API to make data analytics fast: both fast to write and fast to run • Achieves 100x speedups in real applications • Growing community with 25+ companies contributing
  • 27. Thanks!!! disclosure: some parts of this presentation have been inspired by Matei Zaharia ampmeeting in Berlkey