SlideShare a Scribd company logo
1 of 27
Spark 计算模型
wangxing
MapReduce
Spark
“Apache Spark is a fast and general-purpose cluster
computing system. It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs. It also
supports a rich set of higher-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing, and Spark Streaming.”
Unified Platform
Resilient Distributed Dataset
RDD is an immutable and partitioned collection:
● Resilient:
○ automatically recover from node failures
● Distributed:
○ partitioned across the nodes of the cluster that can be operated on in
parallel
● Dataset:
○ RDDs are created with a file in the Hadoop file system, or an existing
Scala collection
Create RDD
“There are two ways to create RDDs: parallelizing an existing collection in your
driver program, or referencing a dataset in an external storage system, such as
a shared filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.”
1. var mydata = sc.parallelize(Array(1, 2, 3, 4, 5))
2. var mydata = sc.makeRDD(1 to 100, 2)
3. var mydata = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg:8020/tmp/1.txt")
Operate RDD
“Two types of operations: transformations, which create a new dataset(RDD)
from an existing one, and actions, which return a value to the driver program
after running a computation on the dataset.”
“All transformations in Spark are lazy, in that they do not compute their results
right away. Instead, they just remember the transformations applied to some
base dataset. The transformations are only computed when an action requires
a result to be returned to the driver program.
Other Control
● persist / cache / unpersist:
○ stores any partitions of the rdd in memory/disk
○ can be stored using a different storage level
○ allows future actions to be much faster,a key tool for iterative
algorithms
● checkpoint:
○ saves the RDD to disk, actually forgets the lineage of the RDD
completely. This is allows long lineages to be truncated.
Example: WordCount
1. var lines = sc.textFile(“hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/1.txt”)
2. var counts = lines.flatMap(line => line.split(“ ”))
3. .map(word => (word, 1))
4. .reduceByKey((a, b) => a + b)
5. counts.collect().foreach(println)
6. counts.saveAsTextFile(“hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/1.txt”)
Example: WordCount
scala> counts.toDebugString
(2) ShuffledRDD[8] at reduceByKey at <console>:23 []
+-(2) MapPartitionsRDD[7] at map at <console>:23 []
| MapPartitionsRDD[6] at flatMap at <console>:23 []
| MapPartitionsRDD[5] at textFile at <console>:21 []
| hdfs: //xxx HadoopRDD[4] at textFile at <console>:21 []
Dependency
Dependency
● Narrow dependencies:
○ allow for pipelined execution on one cluster node
○ easy fault recovery
● Wide dependencies:
○ require data from all parent partitions to be available and to be
shuffled across the nodes
○ a single failed node might cause a complete re-execution
DAG
● Directed:
○ Only in a single direction
● Acyclic:
○ No looping
Shuffle
● redistributes data among partitions
● partition keys into buckets
● write intermediate files to disk fetched by the next stage of tasks
Stage
● A stage is a set of independent tasks of a Spark job;
● DAG of tasks is split up into stages at the boundaries where shuffle occurs;
● DAGScheduler runs the stages in topological order;
● Each Stage can either be a shuffle map stage, in which case its tasks'
results are input for another stage, or a result stage, in which case its
tasks directly compute the action that initiated a job.
Stage
Stage
Job Schedule
Job Schedule
Spark Deploy
Spark Deploy
● Each application gets its own executor processes, isolating applications
from each other;
● The driver program must listen for and accept incoming connections from
its executors on the worker nodes;
● Because the driver schedules tasks on the cluster, it should be run close to
the worker nodes, preferably on the same local area network.
Tips
Use groupByKey/collect carefully
Use mapPartitions if initialization is heavy
Use Kryo serialization instead of java
Repartition if filter causes data skew
Spark Streaming
● Receives live input data streams and divides the data into batches of X
seconds;
● Treats each batch of data as RDDs and processes them using RDD
operations;
● The processed results of the RDD operations are returned in batches;
● Batch sizes as low as ½ sec, latency of about 1 sec.
Spark Streaming
● Receives live input data streams and divides the data into batches of X
seconds;
● Treats each batch of data as RDDs and processes them using RDD
operations;
● The processed results of the RDD operations are returned in batches;
● Batch sizes as low as ½ sec, latency of about 1 sec.
Spark Streaming
Windowed based computations allow you to apply transformations over a sliding window
of data. Every time the window slides over a source DStream, the source RDDs that fall
within the window are combined and operated upon to produce the RDDs of the
windowed DStream.
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
Spark SQL
● Query data via SQL on DataFrame, a distributed collection of data organized into
named columns;
● DataFrame can created from an existing RDD, from a Hive table, or from data
sources;
● DataFrame can be operated on as normal RDDs and can also be registered as a
temporary table. Registering as a table allows you to run SQL queries over its data.
Example:Run sql on a text file.
Thanks

More Related Content

What's hot

Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache SparkGao Yunzhong
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkDavid Smelker
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaAtif Akhtar
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architectureMarkus Klems
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloudAvinash Ramineni
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for SysadminsNathan Milford
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0HBaseCon
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learningSamir Bessalah
 

What's hot (20)

Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Hadoop2
Hadoop2Hadoop2
Hadoop2
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Advancing Scientific Data Support in ArcGIS
Advancing Scientific Data Support in ArcGISAdvancing Scientific Data Support in ArcGIS
Advancing Scientific Data Support in ArcGIS
 
Matlab netcdf guide
Matlab netcdf guideMatlab netcdf guide
Matlab netcdf guide
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
 

Viewers also liked

Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overviewAvi Levi
 
Spark meetup stream processing use cases
Spark meetup   stream processing use casesSpark meetup   stream processing use cases
Spark meetup stream processing use casespunesparkmeetup
 
Scalable vertical search engine with hadoop
Scalable vertical search engine with hadoopScalable vertical search engine with hadoop
Scalable vertical search engine with hadoopdatasalt
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of HadoopAsif Ali
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsAnju Singh
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 

Viewers also liked (15)

Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overview
 
Spark meetup stream processing use cases
Spark meetup   stream processing use casesSpark meetup   stream processing use cases
Spark meetup stream processing use cases
 
Scalable vertical search engine with hadoop
Scalable vertical search engine with hadoopScalable vertical search engine with hadoop
Scalable vertical search engine with hadoop
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of Hadoop
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Similar to Spark 计算模型

Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Sameer Farooqui
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 

Similar to Spark 计算模型 (20)

Spark
SparkSpark
Spark
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 

Recently uploaded

UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Christo Ananth
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 

Recently uploaded (20)

Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 

Spark 计算模型

  • 3. Spark “Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.”
  • 5. Resilient Distributed Dataset RDD is an immutable and partitioned collection: ● Resilient: ○ automatically recover from node failures ● Distributed: ○ partitioned across the nodes of the cluster that can be operated on in parallel ● Dataset: ○ RDDs are created with a file in the Hadoop file system, or an existing Scala collection
  • 6. Create RDD “There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.” 1. var mydata = sc.parallelize(Array(1, 2, 3, 4, 5)) 2. var mydata = sc.makeRDD(1 to 100, 2) 3. var mydata = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg:8020/tmp/1.txt")
  • 7. Operate RDD “Two types of operations: transformations, which create a new dataset(RDD) from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.” “All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
  • 8. Other Control ● persist / cache / unpersist: ○ stores any partitions of the rdd in memory/disk ○ can be stored using a different storage level ○ allows future actions to be much faster,a key tool for iterative algorithms ● checkpoint: ○ saves the RDD to disk, actually forgets the lineage of the RDD completely. This is allows long lineages to be truncated.
  • 9. Example: WordCount 1. var lines = sc.textFile(“hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/1.txt”) 2. var counts = lines.flatMap(line => line.split(“ ”)) 3. .map(word => (word, 1)) 4. .reduceByKey((a, b) => a + b) 5. counts.collect().foreach(println) 6. counts.saveAsTextFile(“hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/1.txt”)
  • 10. Example: WordCount scala> counts.toDebugString (2) ShuffledRDD[8] at reduceByKey at <console>:23 [] +-(2) MapPartitionsRDD[7] at map at <console>:23 [] | MapPartitionsRDD[6] at flatMap at <console>:23 [] | MapPartitionsRDD[5] at textFile at <console>:21 [] | hdfs: //xxx HadoopRDD[4] at textFile at <console>:21 []
  • 12. Dependency ● Narrow dependencies: ○ allow for pipelined execution on one cluster node ○ easy fault recovery ● Wide dependencies: ○ require data from all parent partitions to be available and to be shuffled across the nodes ○ a single failed node might cause a complete re-execution
  • 13. DAG ● Directed: ○ Only in a single direction ● Acyclic: ○ No looping
  • 14. Shuffle ● redistributes data among partitions ● partition keys into buckets ● write intermediate files to disk fetched by the next stage of tasks
  • 15. Stage ● A stage is a set of independent tasks of a Spark job; ● DAG of tasks is split up into stages at the boundaries where shuffle occurs; ● DAGScheduler runs the stages in topological order; ● Each Stage can either be a shuffle map stage, in which case its tasks' results are input for another stage, or a result stage, in which case its tasks directly compute the action that initiated a job.
  • 16. Stage
  • 17. Stage
  • 21. Spark Deploy ● Each application gets its own executor processes, isolating applications from each other; ● The driver program must listen for and accept incoming connections from its executors on the worker nodes; ● Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network.
  • 22. Tips Use groupByKey/collect carefully Use mapPartitions if initialization is heavy Use Kryo serialization instead of java Repartition if filter causes data skew
  • 23. Spark Streaming ● Receives live input data streams and divides the data into batches of X seconds; ● Treats each batch of data as RDDs and processes them using RDD operations; ● The processed results of the RDD operations are returned in batches; ● Batch sizes as low as ½ sec, latency of about 1 sec.
  • 24. Spark Streaming ● Receives live input data streams and divides the data into batches of X seconds; ● Treats each batch of data as RDDs and processes them using RDD operations; ● The processed results of the RDD operations are returned in batches; ● Batch sizes as low as ½ sec, latency of about 1 sec.
  • 25. Spark Streaming Windowed based computations allow you to apply transformations over a sliding window of data. Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. ● window length - The duration of the window. ● sliding interval - The interval at which the window operation is performed.
  • 26. Spark SQL ● Query data via SQL on DataFrame, a distributed collection of data organized into named columns; ● DataFrame can created from an existing RDD, from a Hive table, or from data sources; ● DataFrame can be operated on as normal RDDs and can also be registered as a temporary table. Registering as a table allows you to run SQL queries over its data. Example:Run sql on a text file.

Editor's Notes

  1. 每一轮计算都有Map阶段和Reduce阶段,需要把计算步骤转换成若干轮的MapReduce。 Reduce的输出数据需要存储到磁盘,中间的shuffle有大量网络传输。因此MapReduce具有高延迟,不适合进行迭代计算的特点。 跑SQL需要搭建Hive,跑机器学习算法需要使用Mahout。
  2. Spark将数据切片后放内存中计算,速度是MapReduce的100倍之上。 Spark让开发者可以快速的用Java、Scala或Python编写程序。 极具通用性,使用统一技术栈解决SQL查询,流数据处理,机器学习和基于图的计算。
  3. RDD之前可以相互依赖。 Automatically rebuild on failure Persistence for reuse (RAM and/or disk)
  4. 1. 如果RDD的每个分区最多只能被一个Child RDD的一个分区使用,则称之为narrow dependency;若被多个Child RDD分区都依赖,则为wide dependency。
  5. Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers, which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code to the executors. Finally, SparkContext sends tasks to the executors to run.