Are you a Java developer interested in big data processing who has never had the chance to work with Apache Spark? My presentation aims to help you get familiar with Spark concepts and start developing your own distributed processing application.
11. Different processing model
•More operations available
•Flexible way of composing operations
•Pluggable data sources
•Streaming capabilities built-in
•Pluggable algorithms
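A minimal sketch of what this flexible composition looks like in the Java API, assuming spark-core is on the classpath and a local master; the class name and sample data are illustrative. Several operations chain fluently into one pipeline, where classic MapReduce would force everything into map and reduce phases:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ComposeOps {
    // Chains filter, map, and reduce in one fluent pipeline.
    static int sumOfEvenSquares() {
        SparkConf conf = new SparkConf()
                .setAppName("compose-demo")
                .setMaster("local[*]"); // run locally, one thread per core
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
            return nums.filter(n -> n % 2 == 0) // keep even numbers: 2, 4, 6
                       .map(n -> n * n)         // square them: 4, 16, 36
                       .reduce(Integer::sum);   // aggregate: 4 + 16 + 36
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares()); // 56
    }
}
```

The same fluent style extends to the pluggable pieces listed above: a different data source only changes how the initial RDD is created, not the rest of the pipeline.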
18. Resilient Distributed Dataset (RDD)
•Can be cached in memory or persisted to storage
•Immutable
•Enables parallel operations on collections of elements
•Contains lineage information
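The immutability and lineage points can be seen directly from the API; a small sketch, again assuming spark-core on the classpath and a local master (class and method names are mine, not Spark's):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddLineage {
    static String lineageOf() {
        SparkConf conf = new SparkConf()
                .setAppName("rdd-demo")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> base = sc.parallelize(Arrays.asList(1, 2, 3, 4));
            // Transformations never mutate 'base'; each returns a new, immutable RDD
            JavaRDD<Integer> doubled = base.map(n -> n * 2);
            // Every RDD records how it was derived from its parents; toDebugString
            // prints that lineage, which Spark uses to recompute lost partitions
            return doubled.toDebugString();
        }
    }

    public static void main(String[] args) {
        System.out.println(lineageOf());
    }
}
```

The printed lineage shows the map-derived RDD on top and the original parallelized collection beneath it, which is exactly the information Spark replays for fault recovery.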
22. Spark terminology
•Job – the work required to compute an RDD
•Stage – a wave of work within a job, corresponding to one or more pipelined RDDs
•Task – a unit of work within a stage, corresponding to one RDD partition
•Shuffle – the transfer of data between stages
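A small word-count sketch shows where these terms land in real code, assuming spark-core on the classpath; the class name and data are illustrative. The pair-building step pipelines into one stage, `reduceByKey` introduces the shuffle boundary, and the final action submits the whole job:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class StagesDemo {
    static long distinctWordCount() {
        SparkConf conf = new SparkConf()
                .setAppName("stages-demo")
                .setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Two partitions, so the first stage runs as two tasks
            JavaRDD<String> words =
                sc.parallelize(Arrays.asList("a", "b", "a", "c", "b", "a"), 2);
            // mapToPair runs pipelined with its parent inside the first stage
            JavaPairRDD<String, Integer> pairs =
                words.mapToPair(w -> new Tuple2<>(w, 1));
            // reduceByKey needs all values for a key together, so Spark inserts
            // a shuffle here: the first stage ends, data is transferred across
            // partitions, and a second stage begins
            JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);
            // count() is an action: it submits the job, i.e. both stages
            return counts.count();
        }
    }

    public static void main(String[] args) {
        System.out.println(distinctWordCount()); // 3 distinct words
    }
}
```

Running this with the Spark UI open (http://localhost:4040 by default) makes the job/stage/task breakdown visible exactly as defined above.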
24. Conclusion
•Spark is:
•A complete, standalone solution for distributed processing
•A fluent API for composing operations
•Pluggable with other big data frameworks
•One of the most actively developed Apache projects