Big Data & Data Science
20 March 2017
Big Data & Data Science: Agenda, 18:30 to 20:15
1/ The Apache Spark ecosystem
Johan Picard, Big Data Expert
2/ SQL on Hadoop at scale – SparkSQL2.1 & BigSQL4.3 on 100TB Hadoop-DS
Victor Hatinguais, Big Data Architect
3/ Social Data: Machine Learning for a socially oriented project
Samed Atouati & Abdellah Lamrani Alaoui, aspiring Data Scientists, students at Ecole Centrale Paris
4/ Data Science Experience
Zied Abidi, Data Scientist
5/ How can we make data talk to detect anomalies?
Pauline Clavelloux, Data Scientist
Questions & Answers - Closing
IBM | Spark 3
Power of data. Simplicity of design. Speed of innovation.
Apache Spark in 15 minutes
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
https://spark.apache.org/
Spark History: one of the most active open-source projects
2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level
2014 – 1.2.0 released in December
2015 – 1.3.0 released in March
2015 – 1.4.0 released in June
2015 – 1.5.0 released in September
2016 – 1.6.0 released in January
2016 – 2.0.0 released in July
2016 – 2.1.0 released in December
Spark is HOT!!!
Most active project in the Hadoop ecosystem
One of the top 3 most active Apache projects
Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
Spark is the most active open source project in Big Data
Source: Syncsort – Hadoop Perspectives for 2016
[Chart residue: year labels 2014, 2015, 2016 and a value of 900]
Now 1039 contributors…
Why Spark? In-memory performance and code compactness
Why Spark? A distributed framework
• Spark RDD: in-memory distribution
• HDFS: on-disk distribution
Resilient Distributed Dataset
Create RDDs:
• parallelize
• textFile
• Transformations
Get results:
• Actions
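The RDD life cycle listed above (create, transform, then trigger an action) can be illustrated with a minimal sketch. It is written in plain Python so that it runs without a cluster: `ToyRDD` is a hypothetical stand-in, not a Spark class, and it only mimics the lazy behaviour of the real `parallelize`, `map`, `filter`, `collect` and `count` calls. Real RDDs are also partitioned across a cluster, which this sketch ignores.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = source   # the original data
        self._steps = steps     # recorded transformations, not yet executed

    @staticmethod
    def parallelize(items):
        # Creation: wrap an in-memory collection.
        return ToyRDD(list(items))

    def map(self, fn):
        # Transformation: only records the step and returns a new ToyRDD.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy, nothing is computed here.
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Replays the recorded steps over the source data.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: triggers the computation and returns a list.
        return list(self._run())

    def count(self):
        # Action: triggers the computation and returns the number of items.
        return sum(1 for _ in self._run())


calls = []

def square(x):
    calls.append(x)  # lets us observe when the work really happens
    return x * x

numbers = ToyRDD.parallelize(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

work_before_action = len(calls)   # transformations alone did no work
result = even_squares.collect()   # the action runs the whole chain
```

In this toy model each action replays the full chain of transformations, which is also why caching an intermediate dataset matters when several actions reuse it.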
Why Spark? A set of convenient APIs
Spark Programming Languages
Why Spark? A unified framework
Capabilities:
• Distributed File System
• Data Preparation
• SQL Engine
• Stream Processing
• Graph Engine
• Machine Learning
• Distributed R
Spark components:
• Spark SQL
• Spark Streaming
• GraphX
• MLlib
• SparkR
Spark complements Hadoop (1/3): Hadoop Strengths
Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users
Enterprise Platform
• Reliability
• Resiliency
• Security
Wide Range of Data Formats
• Files
• Semi-structured
• Databases
Spark complements Hadoop (2/3): MapReduce Weaknesses
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
Spark complements Hadoop (3/3): Spark Advantages
In-Memory Performance
• Resilient Distributed Datasets
• Unify processing
Ease of Development
• Easier APIs
• Python, Scala, Java
Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch
The Flexibility of Spark on a Stable Hadoop Platform
• In-Memory Performance
• Ease of Development
• Combine Workflows
• Unlimited Scale
• Enterprise Platform
• Wide Range of Data Formats
How to develop and run a Spark job?
• Spark Shell: interactive Scala
• PySpark: interactive Python
• Spark Submit: compiled
• Notebooks: Jupyter, Zeppelin
What Spark Is Not!
• Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system
• Not a data store – Spark attaches to other data stores but does not provide its own
• Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well
• Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle real-time complex event processing
• Not a language!!!
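Micro-batching, as opposed to true streaming, means grouping incoming events into small batches per time interval and processing each batch as one unit rather than reacting to every event individually. The following is a minimal plain-Python sketch of that idea. The helper `micro_batches`, the 1-second interval and the sample events are assumptions made for illustration; none of this is Spark Streaming's API.

```python
from collections import defaultdict

def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds.

    Hypothetical helper for illustration; not a Spark Streaming function.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        window = int(timestamp // interval)  # index of the window the event falls in
        batches[window].append(value)
    # Return non-empty batches in time order; each batch is processed as one unit.
    return [batches[w] for w in sorted(batches)]

# Sample events as (arrival time in seconds, payload).
events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]

batches = micro_batches(events, interval=1.0)
batch_sizes = [len(batch) for batch in batches]
```

An event that arrives early in a window is only handled once that window's batch is processed, so latency cannot drop below the batch interval. That is the trade-off behind the "not true streaming" remark.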
Spark and IBM
IBM has the largest investment in Spark of any company in the world
Visit www.spark.tc for more information
IBM Spark Technology Center
https://ibm.biz/hadoop-jira
https://ibm.biz/spark-jira
• One of the top committers/contributors
• 300+ inventors
• Commitment to educate 1 million data scientists
• Contributed SystemML
• Founding member of AMPLab
• Partnerships in the ecosystem
Leadership in Spark
• The Spark Technology Center (STC) has contributed 829 code changes to Spark components since we started around mid-2015
• STC contributions break down as 52% to Spark SQL, 16% to PySpark, and 26% to ML and MLlib
• For more details, see the dashboard at https://www.ibm.biz/spark-jira
Data Science Experience (DSX)
ALL YOUR TOOLS IN ONE PLACE
IBM Data Science Experience is an environment that brings
together everything that a Data Scientist needs. It includes the
most popular open-source tools and IBM's unique value-added
features, along with community and social features, integrated
as first-class citizens to make Data Scientists more successful.
datascience.ibm.com
Power of data. Simplicity of design. Speed of innovation.
IBM PoT on Google
9 May: Handling massive data with Spark
10 May: Machine learning training using DSX

A short introduction to Spark and its benefits

  • 1. Big Data & Data Science 20 mars 2017
  • 2. Big Data & Data Science : Agenda – 18h30 / 20h15 1/ L’écosystème Apache Spark Johan Picard, Expert Big Data 2/ SQL on Hadoop at scale – SparkSQL2.1 & BigSQL4.3 on 100TB Hadoop-DS Victor Hatinguais, Architecte Big Data 3/ Social Data : Machine Learning pour un projet à caractère social Samed Atouati & Abdellah Lamrani Alaoui, aspirants Data Scientist, étudiants à l'Ecole Centrale Paris 4/ Data Science Experience Zied Abidi, Data Scientist 5/ Comment faire parler les données pour détecter des anomalies ? Pauline Clavelloux, Data Scientist Questions & Réponses - Clôture
  • 3. IBM | Spark 3 Power of data. Simplicity of design. Speed of innovation. Apache Spark in 15 minutes
  • 4. IBM | Spark 4 Apache Spark Apache Spark is a fast and general engine for large scale data processing. https://spark.apache.org/
  • 5. IBM | Spark 5 Spark History: one of the most active open-source projects 2002 – MapReduce @ Google 2004 – MapReduce paper 2006 – Hadoop @ Yahoo 2008 – Hadoop Summit 2010 – Spark paper 2013 – Spark 0.7 Apache Incubator 2014 – Apache Spark top-level 2014 – 1.2.0 released in December 2015 – 1.3.0 released in March 2015 – 1.4.0 released in June 2015 – 1.5.0 released in September 2016 – 1.6.0 released in January 2016 – 2.0.0 released in July 2016 – 2.1.0 released in December Spark is HOT!!! Most active project in Hadoop ecosystem One of top 3 most active Apache projects Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
  • 6. IBM | Spark 6 Spark is the most active open source project in Big Data Source: Syncort – Hadoop Perspectives for 2016 2015 2014 2016 900 Now 1039 contributors…
  • 7. IBM | Spark 7 Why Spark? In-memory performances and code compactness
  • 8. IBM | Spark 8 Spark RDD In-memory distribution HDFS On-disk distribution Why Spark? A distributed framework
  • 9. IBM | Spark 9 Resilient Distributed Dataset Create RDDs:  parallelize  textFile  Transformations Get results:  Actions
  • 10. IBM | Spark 10 Why Spark? A bunch of comfortables APIs
  • 11. IBM | Spark 11 Spark Programming Languages
  • 12. IBM | Spark 12  Distributed File System  Data Preparation  SQL Engine  Stream Processing  Graph Engine  Machine Learning  Distributed R Spark SQL Spark Streaming GraphX MLlib Spark R Why Spark? An unified framework
  • 13. IBM | Spark 13 • Reliability • Resiliency • Security • Multiple data sources • Multiple applications • Multiple users • Files • Semi-structured • Databases Unlimited Scale Enterprise Platform Wide Range of Data Formats Spark complements Hadoop (1/3): Hadoop Strengths
  • 14. IBM | Spark 14 • Need deep Java skills • Few abstractions available for analysts • No in-memory framework • Application tasks write to disk with each cycle • Only suitable for batch workloads • Rigid processing model In-Memory Performance Ease of Development Combine Workflows Spark complements Hadoop (2/3): MapReduce Weaknesses
  • 15. IBM | Spark 15 In-Memory Performance Ease of Development • Easier APIs • Python, Scala, Java • Resilient Distributed Datasets • Unify processing • Batch • Interactive • Iterative algorithms • Micro-batch Combine Workflows Spark complements Hadoop (3/3): Spark Advantages
  • 16. IBM | Spark 16 In-Memory Performance Ease of Development Combine Workflows Unlimited Scale Enterprise Platform Wide Range of Data Formats The Flexibility of Spark on a Stable Hadoop Platform
  • 17. IBM | Spark 17  Spark Shell: interactive Scala  PySpark: interactive Python  Spark Submit: compiled  Notebooks: Jupyter, Zeppelin How to develop and run a Spark job?
  • 18. IBM | Spark 18 What Spark Is Not!  Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system  Not a data store – Spark attaches to other data stores but does not provide its own  Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well  Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle the real-time complex event processing  Not a language!!!
  • 19. Spark and IBM
  • 20. IBM has the largest investment in Spark of any company in the world. IBM Spark Technology Center: one of the top committers/contributors; 300+ inventors; commitment to educate 1 million data scientists; contributed SystemML; founding member of AMPLab; partnerships in the ecosystem. Visit www.spark.tc for more information. JIRA dashboards: https://ibm.biz/hadoop-jira and https://ibm.biz/spark-jira
  • 21. Leadership in Spark. The Spark Technology Center (STC) has contributed 829 code changes to Spark components since we started around the middle of 2015. STC contributions: 52% to Spark SQL, 16% to PySpark, 26% to ML and MLlib. For more details, see the dashboard at https://www.ibm.biz/spark-jira
  • 22. Data Science Experience (DSX): all your tools in one place. IBM Data Science Experience is an environment that brings together everything that a Data Scientist needs. It includes the most popular open source tools and IBM’s unique value-added functionality, with community and social features integrated as first-class citizens, to make Data Scientists more successful. datascience.ibm.com
  • 23. Power of data. Simplicity of design. Speed of innovation. PoT IBM on Google. May 9: Handling massive data with Spark. May 10: Machine learning training using DSX.
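
The micro-batching point on slide 18 can be illustrated with a toy model in plain Python. This is only a sketch and does not use Spark Streaming's API: events are grouped by arrival time into fixed intervals, and each group is then handled as one batch. The function name `micro_batches`, the half-second interval, and the sample events are assumptions made for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (arrival_time, payload) pairs into batches of `interval` seconds.

    Toy illustration only: a batch is keyed by arrival time, regardless of
    when the event was actually created.
    """
    batches = defaultdict(list)
    for arrival_time, payload in events:
        # Every event arriving within the same interval shares a batch index.
        batch_index = int(arrival_time // interval)
        batches[batch_index].append(payload)
    # Return the non-empty batches in arrival order.
    return [batches[i] for i in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload)
events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d"), (1.3, "e")]
batches = micro_batches(events, interval=0.5)

# A per-batch computation, here a simple count of events in each batch.
counts = [len(batch) for batch in batches]
```

In this model no event is seen individually by the downstream computation; the smallest unit of work is a whole batch, which is why latency cannot drop below the batch interval.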

Editor's Notes

  1. Open source: committers and contributors. Databricks: the company behind Spark; its policy is to keep a majority of the committers so as to steer feature decisions and its business model. Project Management Committees (PMC). Nearly 20% of all JIRAs were contributed by the Spark Technology Center, placing IBM as the number two contributor to the Apache Spark Project by most accounts. In Machine Learning, the Spark Technology Center contributed no less than 45% of the new features, and up to 25% of the enhancements. The STC has contributed 60-75% of all lines of code (LOC) worldwide to the PySpark project. Significant code contributions were also made in SparkR, WebUI and many others. In Spark SQL, Spark’s most active component, IBM leveraged its long-standing SQL experience by contributing up to 25% of all bug fixes for the new release.
  2. Spark is the most active open source project in Big Data, with over 600 contributors in 2015, up from 315 in the previous 12-24 months. Today (5/26/2016) that number is up to 900! Look here to get the latest count: https://github.com/apache/spark Considering that Spark was only founded in 2009 and open-sourced in 2010, this is considerable growth. In an interesting survey done by Syncsort, nearly 70 percent of respondents, when asked which compute framework they were most interested in, answered Spark, surpassing interest in all other compute frameworks, including the recognized incumbent, MapReduce. MapReduce is an original component of the Hadoop ecosystem that is being rapidly subsumed by Spark, which boasts better compute performance and a facility for interactive, streaming and other advanced Big Data analytics. We’ll talk about the advantages of Spark in a later slide. Notice that many of the market leaders leverage Spark. The list above is not exhaustive; these are some of the market leaders that presented at the 2015 Spark Summit in San Francisco, and many of their presentations can be found online. The point is, Spark is gaining speed rapidly in the market, and for good reason, as you’ll learn from this presentation. Read more about Spark’s rapid growth: http://www.techrepublic.com/article/apache-spark-rises-to-become-most-active-open-source-project-in-big-data/
  3. Add another graph? Hortonworks did not back Spark at first; its Tez project was fairly similar but was abandoned with the advent of Spark.
  4. Immutable. Two types of operations. Transformations ~ DDL (Create View V2 as…): val rddNumbers = sc.parallelize(1 to 10) gives the numbers from 1 to 10; val rddNumbers2 = rddNumbers.map(x => x + 1) gives the numbers from 2 to 11. The lineage describing how to obtain rddNumbers2 from rddNumbers is recorded as a Directed Acyclic Graph (DAG). No actual data processing takes place at this point (lazy evaluation). Actions ~ DML (Select * From V2…): rddNumbers2.collect() returns Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11). An action performs the recorded transformations and then the action itself, and returns a value (or writes to a file). Fault tolerance: if data in memory is lost, it is recreated from the lineage. Caching, persistence (memory, spilling, disk) and checkpointing.
  5. A day in the life of a Hadoop developer
  6. Open source innovation is the first leg we’ve just talked about. When it comes to Big Data, Apache Hadoop has been the dominant open source technology (and collection of projects, really) up until very recently, and it continues to be very important. The reasons are captured here on this slide, which extend the point we talked about a few slides ago, when we mentioned the low cost of storage that Hadoop is able to take advantage of. First, Hadoop has virtually unlimited scale. If it’s big enough for Yahoo!, Facebook, and LinkedIn, who deal with enormous data volumes, it should be good enough for any customer. And the scale also applies to the heterogeneous nature of the data, the applications running on the data, and the users running Hadoop applications. Hadoop can store virtually any kind of data, and if the hardware is there, it can support many concurrent applications or users. Second, Hadoop has become an enterprise-class platform. Much of the recent work in the open source community around Hadoop has been hardening its security capabilities. Applications using Hadoop are in place today that are PCI-DSS compliant. Hadoop has always been known for its resiliency with its failover capabilities for both data storage and processing. More recently, the services administering the storage and processing systems in Hadoop have themselves also gained failover capability. Finally, Hadoop is now seen as a reliable data engine – reports of issues like data corruption are exceedingly rare in Apache Hadoop. Third, Hadoop supports a wide range of the kinds of data you need to store: at the lowest level, it can store any kind of file data – part of Hadoop is, after all, a file system. Hadoop can also host databases for structured data, and you can also use Hadoop to work with what many term “semi-structured” data, such as log files.
  7. Apache Hadoop was once synonymous with MapReduce. As recently as early 2014, there was still considerable hype around MapReduce and its applications. However, as Hadoop has entered the mainstream, its challenges have become increasingly apparent. First, from a developer perspective, programming Hadoop-MapReduce applications is quite difficult, and requires specialized skills in parallel programming and a deep understanding of Java. Also, there are very few abstractions available to enable analysts to work with data easily and flexibly, and the ones that do exist typically do not perform very quickly. Second, Hadoop-MapReduce has no in-memory framework. Applications have their individual tasks load data sets, but once the tasks complete, the data sets are no longer in memory, and while they are in memory, they aren’t shared with other applications. Also, during the execution of a MapReduce application, each map task writes its interim result sets to disk. This is highly inefficient, as the reduce tasks then need to read them from disk instead of from memory. Third, Hadoop-MapReduce is only suitable for batch workloads. There is no shame in this, as that’s what it was designed for, but users who want to take advantage of Hadoop’s benefits need support for interactive or real-time workloads as well. And coming back to the execution of applications, only one pattern is supported in Hadoop-MapReduce: map, and then reduce. There are many use cases where different patterns are needed, for example map, reduce, reduce. You can make these different patterns work in Hadoop-MapReduce, but it comes at a great cost in terms of complexity and performance.
  8. Apache Spark has been an active open source project since 2010, but it became hugely popular starting around the middle of 2014. It is, in fact, the single most active project in the Apache Software Foundation, with over 500 code updates made per month by a community of over 400 contributors. The major reason for its popularity is that it addresses the weak points of Hadoop-MapReduce. While MapReduce has proven to be highly difficult, Spark is much simpler. Raw Spark applications (which can be coded in Java, like MapReduce, but also in Python and Scala) are still not for novice programmers, but they are far more accessible and require less coding than Hadoop-MapReduce. Spark itself is written in Scala, which is a relatively new language. One of the major features of Spark is its in-memory capabilities, which are based on the Spark concept of a Resilient Distributed Dataset (RDD). This greatly speeds up workloads, because you can keep data loaded in memory for multiple applications, thus saving them the overhead of loading data from disk. Early benchmarking results have shown speedups between 10x and 100x for the same applications as compared to MapReduce. Another reason for Spark’s massive appeal is its ability to support different classes of workloads. You can use Spark to build batch applications, just as you would have with Hadoop-MapReduce, but with its in-memory capabilities, interactive workloads (like running SQL queries) and iterative algorithms (running machine learning models against the same data set) are also possible. Finally, Spark Streaming enables the running of micro-batch workloads (these are near-real-time workloads, where a micro-batch could, for example, ensure latency as small as half a second for streaming data).
  9. There are some analyst reports that have provocative titles, like “Hadoop vs. Spark,” or “Does Spark Mean the End of Hadoop?”. Many of these articles are heavily sensationalized, and ignore the reality that Spark actually integrates deeply with Hadoop. Yes, Spark can run in a standalone mode, or on other distributed environments like Mesos, AWS, or Cassandra. But the majority of Spark adoption and activity we see is in concert with Hadoop. After all, Spark is just a processing framework – it needs data, resource management, and other enterprise services. Hadoop has all those things, which makes it an ideal complement to Spark. And as we can see on this slide, Spark fills holes that Hadoop itself has. Spark brings ease of use for developers, high performance from its in-memory capabilities, and much more flexible support for different kinds of workloads to Hadoop. The key point here is that it’s not “Spark or Hadoop,” but “Spark AND Hadoop.”
  10. To run the application, you first need to define the dependencies. In Scala, they are defined in the simple.sbt file. In Java, they are defined in the pom.xml file. In Python, you don’t need to define any dependencies for this simple application, but if you use third-party libraries, you can pass them with the --py-files argument. Next, you place your files in the typical directory structure as shown for Scala and Java. Python does not need this. Finally, you create the JAR package using the appropriate tool and then run spark-submit to execute the application.
  11. Let’s talk about some of the misconceptions about Spark. Many people are confused about the difference between Hadoop and Spark, so as we go through these points we’ll also discuss how they relate to Hadoop. Spark does not require Hadoop to run. You can run Spark using its standalone mode, on Hadoop clusters through YARN, or on Apache Mesos. Spark does not include a storage layer. You must provide a data store for Spark to access. Spark can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. You do not need to have a machine learning project to use Spark. Spark can manage complex analytics such as streaming or graph data. Spark does have a library for streaming, which can be useful for many use cases; however, it is not true streaming. Spark Streaming processes data streams in batches, where each batch contains a collection of events that arrived over the batch period (regardless of when the data were actually created). This is fine for some applications, such as simple counts into Hadoop, but be aware that the lack of true record-by-record processing makes stream processing and time-series analytics impossible.
  12. Nearly 20% of all JIRAs were contributed by the Spark Technology Center, placing IBM as the number two contributor to the Apache Spark Project by most accounts. In Machine Learning, the Spark Technology Center contributed no less than 45% of the new features, and up to 25% of the enhancements. The STC has contributed 60-75% of all lines of code (LOC) worldwide to the PySpark project. Significant code contributions were also made in SparkR, WebUI and many others. In Spark SQL, Spark’s most active component, IBM leveraged its long-standing SQL experience by contributing up to 25% of all bug fixes for the new release.
  13. Hadoop: https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12327116 Spark: https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326761
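
The distinction in note 4 between transformations and actions can be sketched with a toy class in plain Python. It is not Spark's RDD implementation or API: the class name `ToyRDD` and its internals are hypothetical, chosen only to show that `map` records lineage without computing anything and that `collect` replays the lineage over the source data.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, lazy, rebuilt from lineage."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that regenerates the base data
        self._lineage = lineage    # tuple of functions recorded by map()

    def map(self, fn):
        # Transformation: record the step and return a NEW ToyRDD.
        # No data is processed here (lazy evaluation).
        return ToyRDD(self._source, self._lineage + (fn,))

    def collect(self):
        # Action: replay every recorded step over the source data.
        data = list(self._source())
        for fn in self._lineage:
            data = [fn(x) for x in data]
        return data


calls = []  # records each time the mapped function actually runs


def add_one(x):
    calls.append(x)
    return x + 1


rdd_numbers = ToyRDD(lambda: range(1, 11))   # numbers from 1 to 10
rdd_numbers2 = rdd_numbers.map(add_one)      # nothing is computed yet
calls_before_action = len(calls)             # still 0, because map() is lazy
result = rdd_numbers2.collect()              # the action triggers the work
```

Because `collect` rebuilds its result from the source and the recorded functions, calling it again recomputes the same values, which is the idea behind recreating lost in-memory data from lineage.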