Apache Spark
Agenda
• Introduction
• Features
– Supports all major Big Data environments
– General platform for major Big Data tasks
– Access diverse data sources
– Speed
– Ease of use
• Architecture
• Resilient Distributed Datasets
• In-memory Computing
• Performance
Spark Introduction
• Apache Spark is an open source, parallel data processing framework that makes it easy to develop fast, unified Big Data applications
– uses a master-slave model
– complements Hadoop
• Cloudera offers commercial support for Spark with Cloudera
Enterprise.
• Started at UC Berkeley in 2009, it has over 465 contributors, making it the
most active Big Data project in the Apache Software Foundation.
Supports All Major Big Data Environments
Runs Everywhere -
• Standalone Cluster Mode
• Hadoop YARN
• Apache Mesos
• Amazon EC2
• Spark runs on both Windows and UNIX-like systems (e.g. Linux,
Mac OS).
General platform for all major Big Data tasks
• Common ETL (Sqoop)
• SQL and Analytics (Pig and Hive)
• Real Time Streaming (Storm)
• Machine Learning (Mahout)
• Graphs (Data Visualization)
• Both Interactive and Batch mode processing
• Reuse the same code for batch and stream processing, even joining
streaming data to historical data
Access Diverse Data Sources
Read and write anywhere -
• HDFS
• Cassandra
• HBase
• Text files
• RDBMS
• Kafka
Speed
• Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Spark has an advanced DAG (directed acyclic graph) execution engine that
supports acyclic data flow and in-memory computing.
[Figure: Logistic regression in Hadoop and Spark]
Ease of Use
• Write applications in
– Java
– Scala
– Python
• 80 high-level operators that make it easy to build parallel apps
Unified Analytics with Cloudera’s Enterprise Data Hub
• Web Security
– Faster decisions (interactive): Why is my website slow?
– Better decisions (batch): What are the common causes of performance issues?
– Real-time action (streaming and applications): How can I detect and block malicious attacks in real time?
• Retail
– Faster decisions (interactive): What are our top selling items across channels?
– Better decisions (batch): What products and services do customers buy together?
– Real-time action (streaming and applications): How can I deliver relevant promotions to buyers at the point of sale?
• Financial Services
– Faster decisions (interactive): Who opened multiple accounts in the past 6 months?
– Better decisions (batch): What are the leading indicators of fraudulent activity?
– Real-time action (streaming and applications): How can I protect my customers from identity theft in real time?
ARCHITECTURE
Resilient Distributed Datasets
• A read-only collection of objects partitioned across a set of machines
• Immutable. A transformation does not modify an RDD in place; it
returns a new RDD and the original RDD remains the same.
• Can be rebuilt if a partition is lost.
• Fault tolerant, because lost data can be recreated by recomputing it
RDD supports two types of operations:
• Transformation
• Action
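The immutability and rebuild-on-loss properties above can be sketched with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `compute`, and `lose_partition` are hypothetical names invented for illustration, and the base dataset's in-memory list stands in for durable storage. The point is only that a derived dataset keeps a reference to its parent and the function applied, so a lost partition can be recomputed.

```python
# Toy model of RDD lineage in plain Python. NOT Spark's API:
# every class and method name here is hypothetical.

class ToyRDD:
    """An immutable, partitioned dataset that remembers how it was derived."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self._source = source_partitions  # data of a base dataset (stand-in for storage)
        self._parent = parent             # lineage: the dataset this one came from
        self._fn = fn                     # lineage: the function applied per element
        self._cache = {}                  # partition index -> computed tuple

    @property
    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, fn):
        # A transformation returns a NEW dataset; self is left untouched.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        # Compute one partition, walking the lineage if it is not held in memory.
        if index not in self._cache:
            if self._parent is None:
                self._cache[index] = tuple(self._source[index])
            else:
                parent_part = self._parent.compute(index)
                self._cache[index] = tuple(self._fn(x) for x in parent_part)
        return self._cache[index]

    def lose_partition(self, index):
        # Simulate a machine failure that drops one in-memory partition.
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

first_result = doubled.collect()
doubled.lose_partition(1)           # partition 1 is gone...
second_result = doubled.collect()   # ...and is rebuilt from the lineage
```

After the simulated loss, `second_result` equals `first_result`, and `base.collect()` still returns the original values, which is the sense in which the original dataset "remains the same".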
• Transformation: returns a new RDD rather than a single value.
Nothing is evaluated when you call a transformation function; it
just takes an RDD and returns a new RDD.
– Some of the transformation functions are map, filter, flatMap,
groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
• Action: evaluates the RDD and returns a value. When an
action function is called on an RDD object, all the pending data processing
is computed at that time and the result value is returned.
– Some of the action operations are reduce, collect, count, first, take,
countByKey, and foreach.
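The difference between the two kinds of operation can be sketched in plain Python. This is a toy model, not Spark's API: `LazyDataset` is a hypothetical class that borrows the operation names map, filter, collect, and count from the lists above. Transformations only record what to do, and an action is what actually runs the recorded pipeline.

```python
# Toy model of lazy transformations and eager actions. NOT Spark's API:
# LazyDataset is a hypothetical class used only for illustration.

class LazyDataset:
    def __init__(self, make_iter):
        # make_iter: zero-argument function that produces a fresh iterator
        self._make_iter = make_iter

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: build a new dataset, evaluate nothing.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._make_iter()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    # Actions: run the whole recorded pipeline and return a value.
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())


calls = []  # records every element that square() is actually applied to

def square(x):
    calls.append(x)
    return x * x

numbers = LazyDataset.from_list([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # still 0: nothing has been evaluated yet
result = evens_squared.collect()   # the action triggers the computation
calls_after_action = len(calls)    # now 4: square() ran once per element
```

In this sketch each action re-runs the pipeline from the start, which is why the deck's later point about tuning performance by caching matters.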
How In-Memory Computing Improves Performance
• Intermediate data is not spilled to disk
Performance improvement in DAG executions
• New shuffle implementation: sort-based shuffle uses a single buffer,
which substantially reduces memory overhead and can support huge
workloads in a single stage. Earlier, multiple concurrent buffers were used.
• The revamped network module in Spark maintains its own pool of
memory, thus bypassing JVM’s memory allocator, reducing the
impact of garbage collection.
• New external shuffle service ensures that Spark can still serve
shuffle files even when the executors are in GC pauses.
• Timsort, which is derived from merge sort and insertion sort, performs
better than the previously used quicksort on partially ordered datasets.
• Exploiting cache locality reduced sorting time by a factor of 5.
[Figure: record layout for sorting. Before: sort_key (10b) + record (100b). After: sort_key (10b) + location (4b).]
The Spark cluster was able to sustain 3 GB/s/node I/O activity during the
map phase, and 1.1 GB/s/node network activity during the reduce phase,
saturating the 10 Gbps link available on these machines.
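The layout idea behind the cache-locality point can be sketched in plain Python: sort small (sort_key, location) entries instead of moving the full records, then gather the records in sorted order. This is only an illustration of the layout, not Spark's implementation, and Python will not reproduce the reported speed-up. The 10-byte key and 100-byte record sizes are assumed from the figure labels, the random records are made up, and `sort_records` is a hypothetical helper. CPython's built-in `list.sort` is itself a Timsort-derived algorithm.

```python
# Sketch of sorting compact (sort_key, location) entries instead of full
# records. Illustrative only: sizes are assumed from the figure labels and
# the data is randomly generated.
import random

RECORD_SIZE = 100  # bytes per record (assumed)
KEY_SIZE = 10      # leading bytes of each record used as the sort key (assumed)


def sort_records(records):
    # Build one small entry per record: its key plus its position.
    index = [(record[:KEY_SIZE], position) for position, record in enumerate(records)]
    # Sort only the small entries; the 100-byte records are not moved here.
    index.sort()
    # Gather the full records once, in sorted order.
    return [records[position] for _, position in index]


rng = random.Random(0)
records = [
    bytes(rng.randrange(256) for _ in range(RECORD_SIZE))
    for _ in range(1000)
]
sorted_records = sort_records(records)
```

Because ties on the key are broken by position, the result matches a stable sort of the full records by their first 10 bytes.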
• The memory consumed by RDDs can be optimized according to the size of the data
• Performance can be tuned by caching
• MapReduce supports only map and reduce operations, so everything else (join,
groupBy, etc.) has to be fitted into the map and reduce model, which
might not be the most efficient way. Spark supports 80 other
transformations and actions.
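To illustrate what fitting a join into the map and reduce model involves, here is a minimal single-process sketch in plain Python. It is not Hadoop's or Spark's API: `run_map_reduce`, `join_mapper`, and `join_reducer` are hypothetical names, and the customer and order data are made up. Each record is tagged with its source, grouped by key, and the reducer pairs the two sides up again.

```python
# Single-process sketch of a join expressed as map + shuffle + reduce.
# All names and data are hypothetical; this is not a real framework's API.
from collections import defaultdict


def run_map_reduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for record in inputs:
        # Map: each input record becomes zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values with the same key end up in the same group.
            groups[key].append(value)
    # Reduce: each key's values are combined into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


customers = [(1, "Ann"), (2, "Bob")]   # (customer_id, name)
orders = [(1, 30), (2, 15), (1, 12)]   # (customer_id, amount)

# Both inputs must be merged into one stream, tagged with their source.
tagged = [("customer", c) for c in customers] + [("order", o) for o in orders]


def join_mapper(record):
    source, (customer_id, payload) = record
    return [(customer_id, (source, payload))]


def join_reducer(customer_id, values):
    names = [payload for source, payload in values if source == "customer"]
    amounts = [payload for source, payload in values if source == "order"]
    return [(name, amount) for name in names for amount in amounts]


joined = run_map_reduce(tagged, join_mapper, join_reducer)
```

The tagging and untagging is bookkeeping that exists only to squeeze the join through two phases. In Spark, a join is one of the built-in operations on key-value RDDs, so this plumbing does not have to be written by hand.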