2. Big Data & Data Science: Agenda – 18:30 / 20:15
1/ The Apache Spark ecosystem
Johan Picard, Big Data Expert
2/ SQL on Hadoop at scale – Spark SQL 2.1 & Big SQL 4.3 on a 100 TB Hadoop-DS
Victor Hatinguais, Big Data Architect
3/ Social Data: machine learning for a social-impact project
Samed Atouati & Abdellah Lamrani Alaoui, aspiring Data Scientists, students at École Centrale Paris
4/ Data Science Experience
Zied Abidi, Data Scientist
5/ How to make data talk in order to detect anomalies?
Pauline Clavelloux, Data Scientist
Questions & Answers – Closing
3. IBM | Spark 3
Power of data. Simplicity of design. Speed of innovation.
Apache Spark in 15 minutes
4. IBM | Spark 4
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
https://spark.apache.org/
5. IBM | Spark 5
Spark History: one of the most active open-source projects
2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level
2014 – 1.2.0 released in December
2015 – 1.3.0 released in March
2015 – 1.4.0 released in June
2015 – 1.5.0 released in September
2016 – 1.6.0 released in January
2016 – 2.0.0 released in July
2016 – 2.1.0 released in December
Spark is HOT!!!
Most active project in Hadoop ecosystem
One of top 3 most active Apache projects
Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
6. IBM | Spark 6
Spark is the most active open source project in Big Data
Source: Syncsort – Hadoop Perspectives for 2016
[Chart: Spark contributor counts for 2014, 2015 and 2016, reaching ~900]
Now 1,039 contributors…
7. IBM | Spark 7
Why Spark? In-memory performance and code compactness
8. IBM | Spark 8
Spark RDD
In-memory distribution
HDFS
On-disk distribution
Why Spark? A distributed framework
9. IBM | Spark 9
Resilient Distributed Dataset
Create RDDs:
parallelize
textFile
Transformations
Get results:
Actions
10. IBM | Spark 10
Why Spark? A set of convenient APIs
12. IBM | Spark 12
Distributed File System
Data Preparation
SQL Engine
Stream Processing
Graph Engine
Machine Learning
Distributed R
Spark SQL
Spark Streaming
GraphX
MLlib
SparkR
Why Spark? A unified framework
13. IBM | Spark 13
• Reliability
• Resiliency
• Security
• Multiple data sources
• Multiple applications
• Multiple users
• Files
• Semi-structured
• Databases
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
Spark complements Hadoop (1/3): Hadoop Strengths
14. IBM | Spark 14
• Needs deep Java skills
• Few abstractions available for analysts
• No in-memory framework
• Application tasks write to disk with each cycle
• Only suitable for batch workloads
• Rigid processing model
In-Memory Performance
Ease of Development
Combine Workflows
Spark complements Hadoop (2/3): MapReduce Weaknesses
16. IBM | Spark 16
In-Memory Performance
Ease of Development
Combine Workflows
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
The Flexibility of Spark on a Stable Hadoop Platform
17. IBM | Spark 17
Spark Shell: interactive Scala
PySpark: interactive Python
Spark Submit: compiled
Notebooks: Jupyter, Zeppelin
How to develop and run a Spark job?
18. IBM | Spark 18
What Spark Is Not!
Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system
Not a data store – Spark attaches to other data stores but does not provide its own
Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well
Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle real-time complex event processing
Not a language!!!
20. IBM | Spark 20
IBM has the largest investment in Spark of any company in the world
Visit www.spark.tc for more information
IBM Spark Technology Center
https://ibm.biz/hadoop-jira
https://ibm.biz/spark-jira
One of the top committers/contributors
300+ inventors
Commitment to educate 1 million data scientists
Contributed SystemML
Founding member of AMPLab
Partnerships in the ecosystem
21. IBM | Spark 21
Leadership in Spark
The Spark Technology Center has contributed 829 code changes to Spark components since it started around the middle of 2015.
STC contributions break down as 52% to Spark SQL, 16% to PySpark, and 26% to ML and MLlib.
For more details, see this dashboard: https://www.ibm.biz/spark-jira
22. IBM | Spark 22
Data Science Experience (DSX)
ALL YOUR TOOLS IN ONE PLACE
IBM Data Science Experience is an environment that brings together everything a Data Scientist needs. It combines the most popular open-source tools and IBM's unique value-add functionality with community and social features, integrated as first-class citizens to make Data Scientists more successful.
datascience.ibm.com
23. IBM | Spark 23
Power of data. Simplicity of design. Speed of innovation.
IBM PoT (Proof of Technology) on Google
May 9: Manipulating massive data with Spark
May 10: Machine learning training using DSX
Editor's Notes
Open source: committers & contributors
Databricks: the company behind Spark; its policy is to keep the majority of committers in order to steer feature decisions, in line with its business model
Project Management Committees (PMC)
Nearly 20% of all JIRAs were contributed by the Spark Technology Center, placing IBM as the number two contributor to the Apache Spark Project by most accounts.
In Machine Learning, the Spark Technology Center contributed no less than 45% of the new features, and up to 25% of the enhancements. The STC has contributed 60-75% of all lines of code (LOC) worldwide to the PySpark project. Significant code contributions were also made in SparkR, WebUI and many others. In Spark SQL, Spark’s most active component, IBM leveraged its long-standing SQL experience by resolving up to 25% of all bug fixes for the new release.
Spark is the most active open source project in Big Data with over 600 contributors in 2015, up from 315 in the previous 12-24 months. Today (5/26/2016) that number is up to 900! Look here to get the latest count: https://github.com/apache/spark
Considering that Spark was only founded in 2009 and open-sourced in 2010, this is considerable growth.
An interesting survey done by Syncsort - Nearly 70 percent of respondents when asked which compute framework they were most interested – answered Spark, surpassing interest in all other compute frameworks, including the recognized incumbent, MapReduce. MapReduce is an original component of the Hadoop ecosystem, being rapidly subsumed by Spark, which boasts better compute performance and a facility for interactive, streaming and other advanced Big Data analytics. We’ll talk about the advantages of Spark in a later slide.
Notice many of the market leaders leverage Spark. The list above is not inclusive, these are some of the market leaders that presented at the 2015 Spark Summit in San Francisco and many of their presentations can be found online.
The point is, Spark is gaining speed rapidly in the market… and for good reason as you’ll learn from this presentation. Read more about Sparks rapid growth: http://www.techrepublic.com/article/apache-spark-rises-to-become-most-active-open-source-project-in-big-data/
Add another graph?
Hortonworks did not back Spark at first; its Tez project was fairly similar, but it was abandoned with the rise of Spark
Immutable
Two types of operations:
Transformations ~ DDL (CREATE VIEW v2 AS …)
val rddNumbers = sc.parallelize(1 to 10) // numbers from 1 to 10
val rddNumbers2 = rddNumbers.map(x => x + 1) // numbers from 2 to 11
The LINEAGE describing how to obtain rddNumbers2 from rddNumbers is recorded; it is a Directed Acyclic Graph (DAG). No actual data processing takes place: evaluation is lazy.
Actions ~ DML (SELECT * FROM v2 …)
rddNumbers2.collect() // Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
An action performs the recorded transformations and returns a value (or writes to a file).
Fault tolerance: if data in memory is lost, it is recreated from the lineage.
Caching, persistence (memory, spilling, disk) and check-pointing
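The lazy-transformation / eager-action split described in this note can be illustrated without Spark at all. The sketch below is a plain-Python analogy, not the Spark API: `map` plays the role of a transformation and `list` the role of an action such as `collect()`.

```python
# Plain-Python analogy (no Spark): a "transformation" builds a deferred
# pipeline; nothing runs until an "action" consumes it.
numbers = range(1, 11)              # like sc.parallelize(1 to 10)

log = []
def plus_one(x):
    log.append(x)                   # record when work actually happens
    return x + 1

numbers2 = map(plus_one, numbers)   # "transformation": lazy, nothing computed yet
assert log == []                    # no element has been processed so far

result = list(numbers2)             # "action": forces evaluation, like collect()
assert result == [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

As in Spark, the pipeline is only a description of the computation until the action runs; re-running it would replay the recorded steps, which is the idea behind lineage-based recovery.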
A day in the life of a Hadoop developer
Open source innovation is the first leg we’ve just talked about. When it comes to Big Data, Apache Hadoop has been the dominant open source technology (and collection of projects, really) up until very recently, and it continues to be very important.
The reasons are captured here on this slide, which extend the point we talked about a few slides ago, when we mentioned the low cost of storage that Hadoop is able to take advantage of.
First, Hadoop has virtually unlimited scale. If it’s big enough for Yahoo!, Facebook, and LinkedIn, who deal with enormous data volumes, it should be good enough for any customer. And the scale also applies to the heterogeneous nature of the data, the applications running on the data, and the users running Hadoop applications. Hadoop can store virtually any kind of data, and if the hardware is there, it can support many concurrent applications or users.
Second, Hadoop has become an enterprise-class platform. Much of the recent work in the open source community around Hadoop has been hardening its security capabilities. Applications using Hadoop are in place today that are PCI-DSS compliant. Hadoop has always been known for its resiliency with its failover capabilities for both data storage and processing. More recently, the services administering the storage and processing systems in Hadoop have themselves also gained failover capability. Finally, Hadoop is now seen as a reliable data engine – reports of issues like data corruption are exceedingly rare in Apache Hadoop.
Third, Hadoop supports a wide range of the kinds of data you need to store: at the lowest level, it can store any kind of file data – part of Hadoop is, after all, a file system. Hadoop can also host databases for structured data, and you can also use Hadoop to work with what many term “semi-structured” data, such as log files.
Apache Hadoop was once synonymous with MapReduce. As recently as early 2014, there was still considerable hype around MapReduce and its applications. However, as Hadoop has been entering the mainstream, its challenges have become increasingly apparent.
First, from a developer perspective, programming Hadoop-MapReduce applications is quite difficult, and requires specialized skills around parallel programming and a deep understanding of Java. Also, there are very few abstractions available to enable analysts to easily and flexibly work with data. And ones that do exist do not typically perform very quickly.
Second, Hadoop-MapReduce has no in-memory framework. Applications have their individual tasks load data sets, but once the tasks complete, the data sets are no longer in memory – and when they are in memory, they aren’t shared with other applications. Also, during the execution of a MapReduce application, each map task writes its interim results sets to disk – this is highly inefficient, as the reduce tasks then need to read them from disk, instead of from memory.
Third, Hadoop-MapReduce is only suitable for batch workloads. There is no shame in this, as that’s what it was designed for, but for users who want to take advantage of Hadoop’s benefits, they need support for interactive or real-time workloads as well. And coming back to the execution of applications, only one pattern is supported in Hadoop-MapReduce: that is, map, and then reduce. There are many use cases, where different patterns are needed, for example, map, reduce, reduce. You can make these different patterns work in Hadoop-MapReduce, but it comes at a great cost in terms of complexity and performance.
Apache Spark has been an active open source project since 2010, but it has become hugely popular starting around the middle of 2014. It is, in fact, the single most active project in the Apache Software Foundation, with over 500 code updates made per month by a community of over 400 contributors.
The major reason for its popularity is that it addresses the weak points of Hadoop-MapReduce.
While MapReduce has proven to be highly difficult, Spark is much simpler. Raw Spark applications (which can be coded in Java, like MapReduce, but also Python and Scala) are still not for novice programmers, but are far more accessible and require less coding than Hadoop-MapReduce. Spark is actually written in Scala, which is a relatively new language.
One of the major features of Spark is its in-memory capabilities, which are based on the Spark concept of a Resilient Distributed Dataset (RDD). This greatly speeds up workloads, because you can keep data loaded in memory for multiple applications, thus saving them the overhead of loading data from disk. Early benchmarking results have shown speedups between 10x to 100x for the same applications as compared to MapReduce.
Another reason for Spark’s massive appeal is its ability to support different classes of workloads. You can use Spark to build batch applications, just as you would have with Hadoop-MapReduce, but with its in-memory capabilities, interactive workloads (like running SQL queries) and iterative algorithms (running machine learning models against the same data set) are also possible. Finally, Spark-Streaming enables the running of micro-batch workloads (this would be near-realtime workloads, where a micro-batch could, for example, ensure latency as small as half a second for streaming data).
There are some analyst reports that have provocative titles, like “Hadoop vs. Spark,” or “Does Spark Mean the End of Hadoop?”. Many of these articles are heavily sensationalized, and ignore the reality that Spark actually integrates deeply with Hadoop. Yes, Spark can run in a standalone mode, or on other distributed environments like Mesos, AWS, or Cassandra. But the majority of Spark adoption and activity we see is in concert with Hadoop. After all, Spark is just a processing framework – it needs data, resource management, and other enterprise services. Hadoop has all those things, which makes it an ideal complement to Spark.
And as we can see on this slide, Spark fills holes that Hadoop itself has. Spark brings ease of use for developers, high performance from its in-memory capabilities, and much more flexible support for different kinds of workloads to Hadoop.
The key point here is that it’s not “Spark or Hadoop,” but “Spark AND Hadoop.”
To run the application, you will need to first define the dependencies. In Scala, they are defined in the simple.sbt file. In Java, they are defined in the pom.xml file. In Python, you don't need to define any dependencies for this simple application, but if you use third-party libraries, you can use the --py-files argument to handle that. Next, you place your files in the typical directory structure as shown for Scala and Java. Python does not need this.
Finally, you have to create the JAR package using the appropriate tool and then run the spark-submit to execute the application.
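As a concrete sketch of the packaging step (all names, paths and versions below are illustrative, not taken from the slides), a minimal Scala project for spark-submit might look like this:

```shell
# Hypothetical project layout (illustrative names):
#   simple.sbt
#   src/main/scala/SimpleApp.scala
#
# simple.sbt might declare the Spark dependency like so:
#   name := "Simple Project"
#   version := "1.0"
#   scalaVersion := "2.11.8"
#   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
#
# Build the JAR, then submit it to a local master:
sbt package
spark-submit --class SimpleApp --master "local[2]" \
  target/scala-2.11/simple-project_2.11-1.0.jar
```

The `--class` flag names the application's main class and `--master` selects where it runs; `local[2]` runs it in-process with two worker threads, which is convenient for testing before submitting to a real cluster.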
Let’s talk about some of the misconceptions about Spark. Many people get confused on the difference between Hadoop and Spark, for that reason as we talk these points we’ll also discuss how they relate to Hadoop.
Spark does not require Hadoop to run. You can run Spark using its standalone mode or on Hadoop clusters through YARN, or on Apache Mesos.
Spark does not include a storage layer. You must provide a data store for Spark to access. Spark can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
You do not need to have a machine learning project to use Spark. Spark can manage complex analytics such as streaming or graphing data.
Spark does have a library for streaming, which can be useful for many use cases; however, it is not true streaming. Spark Streaming processes data streams in batches, where each batch contains a collection of events that arrived over the batch period (regardless of when the data were actually created). This is fine for some applications, such as simple counts into Hadoop, but be aware that the lack of true record-by-record processing makes certain stream-processing and time-series analytics impossible.
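The micro-batching behaviour this note describes can be simulated in a few lines of plain Python (no Spark; the batch interval and event times are made up for illustration): every record is held until its batch interval closes, so a per-record latency of up to the interval length is built in.

```python
from collections import defaultdict

BATCH_INTERVAL = 0.5  # seconds, standing in for a Spark Streaming batch duration

# (arrival_time_in_seconds, payload) -- illustrative event stream
events = [(0.1, "a"), (0.3, "b"), (0.6, "c"), (0.9, "d"), (1.2, "e")]

batches = defaultdict(list)
for t, payload in events:
    # Each event is assigned to the micro-batch covering its arrival time;
    # the whole batch is processed together once its interval closes.
    batches[int(t // BATCH_INTERVAL)].append(payload)

# Batch 0 covers [0.0, 0.5), batch 1 covers [0.5, 1.0), batch 2 covers [1.0, 1.5)
assert dict(batches) == {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
```

Event "a", arriving at t=0.1, is not visible to processing until the 0.5 s boundary; a true record-at-a-time engine would hand it to the application immediately, which is the distinction the note draws.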