2. Big Data & Data Science: Agenda – 18:30 / 20:15
1/ The Apache Spark ecosystem
Johan Picard, Big Data Expert
2/ SQL on Hadoop at scale – Spark SQL 2.1 & Big SQL 4.3 on a 100 TB Hadoop-DS
Victor Hatinguais, Big Data Architect
3/ Social Data: machine learning for a social-impact project
Samed Atouati & Abdellah Lamrani Alaoui, aspiring Data Scientists, students at École Centrale Paris
4/ Data Science Experience
Zied Abidi, Data Scientist
5/ How to make data talk in order to detect anomalies?
Pauline Clavelloux, Data Scientist
Questions & Answers – Closing
3. IBM | Spark 3
Power of data. Simplicity of design. Speed of innovation.
Apache Spark in 15 minutes
4. IBM | Spark 4
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
https://spark.apache.org/
5. IBM | Spark 5
Spark History: one of the most active open-source projects
2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level
2014 – 1.2.0 released in December
2015 – 1.3.0 released in March
2015 – 1.4.0 released in June
2015 – 1.5.0 released in September
2016 – 1.6.0 released in January
2016 – 2.0.0 released in July
2016 – 2.1.0 released in December
Spark is HOT!!!
Most active project in Hadoop ecosystem
One of top 3 most active Apache projects
Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
6. IBM | Spark 6
Spark is the most active open source project in Big Data
Source: Syncsort – Hadoop Perspectives for 2016
[Chart: Spark contributor counts for 2014, 2015 and 2016, reaching ~900]
Now 1,039 contributors…
7. IBM | Spark 7
Why Spark? In-memory performance and code compactness
8. IBM | Spark 8
Spark RDD
In-memory distribution
HDFS
On-disk distribution
Why Spark? A distributed framework
9. IBM | Spark 9
Resilient Distributed Dataset
Create RDDs:
parallelize
textFile
Transformations
Get results:
Actions
10. IBM | Spark 10
Why Spark? A set of convenient APIs
12. IBM | Spark 12
Distributed File System
Data Preparation
SQL Engine
Stream Processing
Graph Engine
Machine Learning
Distributed R
Spark SQL
Spark Streaming
GraphX
MLlib
SparkR
Why Spark? A unified framework
13. IBM | Spark 13
• Reliability
• Resiliency
• Security
• Multiple data sources
• Multiple applications
• Multiple users
• Files
• Semi-structured
• Databases
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
Spark complements Hadoop (1/3): Hadoop Strengths
14. IBM | Spark 14
• Needs deep Java skills
• Few abstractions available for analysts
• No in-memory framework
• Application tasks write to disk with each cycle
• Only suitable for batch workloads
• Rigid processing model
In-Memory Performance
Ease of Development
Combine Workflows
Spark complements Hadoop (2/3): MapReduce Weaknesses
16. IBM | Spark 16
In-Memory Performance
Ease of Development
Combine Workflows
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
The Flexibility of Spark on a Stable Hadoop Platform
17. IBM | Spark 17
Spark Shell: interactive Scala
PySpark: interactive Python
Spark Submit: compiled
Notebooks: Jupyter, Zeppelin
How to develop and run a Spark job?
18. IBM | Spark 18
What Spark Is Not!
Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system
Not a data store – Spark attaches to other data stores but does not provide its own
Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well
Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle real-time complex event processing
Not a language!!!
20. IBM | Spark 20
IBM has the largest investment in Spark of any company in the world
Visit www.spark.tc for more information
IBM Spark Technology Center
https://ibm.biz/hadoop-jira
https://ibm.biz/spark-jira
One of the top committers/contributors
300+ inventors
Commitment to educate 1 million data scientists
Contributed SystemML
Founding member of AMPLab
Partnerships in the ecosystem
21. IBM | Spark 21
Leadership in Spark
The Spark Technology Center has contributed 829 code changes to Spark components since it started around the middle of 2015.
STC contributions break down as 52% to Spark SQL, 16% to PySpark, and 26% to ML and MLlib.
For more details, see this dashboard: https://www.ibm.biz/spark-jira
22. IBM | Spark 22
Data Science Experience (DSX)
ALL YOUR TOOLS IN ONE PLACE
IBM Data Science Experience is an environment that brings together everything a Data Scientist needs. It combines the most popular open-source tools and IBM's unique value-add functionality with community and social features, integrated as first-class citizens to make Data Scientists more successful.
datascience.ibm.com
23. IBM | Spark 23
Power of data. Simplicity of design. Speed of innovation.
IBM PoT (Proof of Technology) on Google
May 9: Manipulating massive data with Spark
May 10: Machine learning training using DSX
Editor's Notes
Open source: committers & contributors
Databricks: the company behind Spark; its policy is to keep the majority of committers in order to steer feature decisions, in line with its business model
Project Management Committees (PMC)
Nearly 20% of all JIRAs were contributed by the Spark Technology Center, placing IBM as the number two contributor to the Apache Spark Project by most accounts.
In Machine Learning, the Spark Technology Center contributed no less than 45% of the new features, and up to 25% of the enhancements. The STC has contributed 60-75% of all lines of code (LOC) worldwide to the PySpark project. Significant code contributions were also made in SparkR, WebUI and many others. In Spark SQL, Spark’s most active component, IBM leveraged its long-standing SQL experience by resolving up to 25% of all bug fixes for the new release.
Spark is the most active open source project in Big Data with over 600 contributors in 2015, up from 315 in the previous 12-24 months. Today (5/26/2016) that number is up to 900! Look here to get the latest count: https://github.com/apache/spark
Considering that Spark was only founded in 2009 and open-sourced in 2010, this is considerable growth.
An interesting survey done by Syncsort - Nearly 70 percent of respondents when asked which compute framework they were most interested – answered Spark, surpassing interest in all other compute frameworks, including the recognized incumbent, MapReduce. MapReduce is an original component of the Hadoop ecosystem, being rapidly subsumed by Spark, which boasts better compute performance and a facility for interactive, streaming and other advanced Big Data analytics. We’ll talk about the advantages of Spark in a later slide.
Notice many of the market leaders leverage Spark. The list above is not inclusive, these are some of the market leaders that presented at the 2015 Spark Summit in San Francisco and many of their presentations can be found online.
The point is, Spark is gaining speed rapidly in the market… and for good reason as you’ll learn from this presentation. Read more about Sparks rapid growth: http://www.techrepublic.com/article/apache-spark-rises-to-become-most-active-open-source-project-in-big-data/
Add another graph?
Hortonworks did not back Spark at first; its Tez project was fairly similar, but it was abandoned with the rise of Spark
Immutable
Two types of operations:
Transformations ~ DDL (CREATE VIEW v2 AS …)
val rddNumbers = sc.parallelize(1 to 10) // numbers from 1 to 10
val rddNumbers2 = rddNumbers.map(x => x + 1) // numbers from 2 to 11
The LINEAGE describing how to obtain rddNumbers2 from rddNumbers is recorded; it is a Directed Acyclic Graph (DAG). No actual data processing takes place: evaluation is lazy.
Actions ~ DML (SELECT * FROM v2 …)
rddNumbers2.collect() // Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
An action performs the recorded transformations and returns a value (or writes to a file).
Fault tolerance: if data in memory is lost, it is recreated from the lineage.
Caching, persistence (memory, spilling, disk) and check-pointing
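The lazy-transformation / eager-action split described in this note can be illustrated without Spark at all. The sketch below is a plain-Python analogy, not the Spark API: `map` plays the role of a transformation and `list` the role of an action such as `collect()`.

```python
# Plain-Python analogy (no Spark): a "transformation" builds a deferred
# pipeline; nothing runs until an "action" consumes it.
numbers = range(1, 11)              # like sc.parallelize(1 to 10)

log = []
def plus_one(x):
    log.append(x)                   # record when work actually happens
    return x + 1

numbers2 = map(plus_one, numbers)   # "transformation": lazy, nothing computed yet
assert log == []                    # no element has been processed so far

result = list(numbers2)             # "action": forces evaluation, like collect()
assert result == [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

As in Spark, the pipeline is only a description of the computation until the action runs; re-running it would replay the recorded steps, which is the idea behind lineage-based recovery.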
A day in the life of a Hadoop developer
Open source innovation is the first leg we’ve just talked about. When it comes to Big Data, Apache Hadoop has been the dominant open source technology (and collection of projects, really) up until very recently, and it continues to be very important.
The reasons are captured here on this slide, which extend the point we talked about a few slides ago, when we mentioned the low cost of storage that Hadoop is able to take advantage of.
First, Hadoop has virtually unlimited scale. If it’s big enough for Yahoo!, Facebook, and LinkedIn, who deal with enormous data volumes, it should be good enough for any customer. And the scale also applies to the heterogeneous nature of the data, the applications running on the data, and the users running Hadoop applications. Hadoop can store virtually any kind of data, and if the hardware is there, it can support many concurrent applications or users.
Second, Hadoop has become an enterprise-class platform. Much of the recent work in the open source community around Hadoop has been hardening its security capabilities. Applications using Hadoop are in place today that are PCI-DSS compliant. Hadoop has always been known for its resiliency with its failover capabilities for both data storage and processing. More recently, the services administering the storage and processing systems in Hadoop have themselves also gained failover capability. Finally, Hadoop is now seen as a reliable data engine – reports of issues like data corruption are exceedingly rare in Apache Hadoop.
Third, Hadoop supports a wide range of the kinds of data you need to store: at the lowest level, it can store any kind of file data – part of Hadoop is, after all, a file system. Hadoop can also host databases for structured data, and you can also use Hadoop to work with what many term “semi-structured” data, such as log files.
Apache Hadoop was once synonymous with MapReduce. As recently as early 2014, there was still considerable hype around MapReduce and its applications. However, as Hadoop has been entering the mainstream, its challenges have become increasingly apparent.
First, from a developer perspective, programming Hadoop-MapReduce applications is quite difficult, and requires specialized skills around parallel programming and a deep understanding of Java. Also, there are very few abstractions available to enable analysts to easily and flexibly work with data. And ones that do exist do not typically perform very quickly.
Second, Hadoop-MapReduce has no in-memory framework. Applications have their individual tasks load data sets, but once the tasks complete, the data sets are no longer in memory – and when they are in memory, they aren’t shared with other applications. Also, during the execution of a MapReduce application, each map task writes its interim results sets to disk – this is highly inefficient, as the reduce tasks then need to read them from disk, instead of from memory.
Third, Hadoop-MapReduce is only suitable for batch workloads. There is no shame in this, as that’s what it was designed for, but for users who want to take advantage of Hadoop’s benefits, they need support for interactive or real-time workloads as well. And coming back to the execution of applications, only one pattern is supported in Hadoop-MapReduce: that is, map, and then reduce. There are many use cases, where different patterns are needed, for example, map, reduce, reduce. You can make these different patterns work in Hadoop-MapReduce, but it comes at a great cost in terms of complexity and performance.
Apache Spark has been an active open source project since 2010, but it has become hugely popular starting around the middle of 2014. It is, in fact, the single most active project in the Apache Software Foundation, with over 500 code updates made per month by a community of over 400 contributors.
The major reason for its popularity is that it addresses the weak points of Hadoop-MapReduce.
While MapReduce has proven to be highly difficult, Spark is much simpler. Raw Spark applications (which can be coded in Java, like MapReduce, but also Python and Scala) are still not for novice programmers, but are far more accessible and require less coding than Hadoop-MapReduce. Spark is actually written in Scala, which is a relatively new language.
One of the major features of Spark is its in-memory capabilities, which are based on the Spark concept of a Resilient Distributed Dataset (RDD). This greatly speeds up workloads, because you can keep data loaded in memory for multiple applications, thus saving them the overhead of loading data from disk. Early benchmarking results have shown speedups between 10x to 100x for the same applications as compared to MapReduce.
Another reason for Spark’s massive appeal is its ability to support different classes of workloads. You can use Spark to build batch applications, just as you would have with Hadoop-MapReduce, but with its in-memory capabilities, interactive workloads (like running SQL queries) and iterative algorithms (running machine learning models against the same data set) are also possible. Finally, Spark-Streaming enables the running of micro-batch workloads (this would be near-realtime workloads, where a micro-batch could, for example, ensure latency as small as half a second for streaming data).
There are some analyst reports that have provocative titles, like “Hadoop vs. Spark,” or “Does Spark Mean the End of Hadoop?”. Many of these articles are heavily sensationalized, and ignore the reality that Spark actually integrates deeply with Hadoop. Yes, Spark can run in a standalone mode, or on other distributed environments like Mesos, AWS, or Cassandra. But the majority of Spark adoption and activity we see is in concert with Hadoop. After all, Spark is just a processing framework – it needs data, resource management, and other enterprise services. Hadoop has all those things, which makes it an ideal complement to Spark.
And as we can see on this slide, Spark fills holes that Hadoop itself has. Spark brings ease of use for developers, high performance from its in-memory capabilities, and much more flexible support for different kinds of workloads to Hadoop.
The key point here is that it’s not “Spark or Hadoop,” but “Spark AND Hadoop.”
To run the application, you will need to first define the dependencies. In Scala, they are defined in the simple.sbt file. In Java, they are defined in the pom.xml file. In Python, you don't need to define any dependencies for this simple application, but if you use third-party libraries, you can use the --py-files argument to handle that. Next, you place your files in the typical directory structure as shown for Scala and Java. Python does not need this.
Finally, you have to create the JAR package using the appropriate tool and then run the spark-submit to execute the application.
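As a concrete sketch of the packaging step (all names, paths and versions below are illustrative, not taken from the slides), a minimal Scala project for spark-submit might look like this:

```shell
# Hypothetical project layout (illustrative names):
#   simple.sbt
#   src/main/scala/SimpleApp.scala
#
# simple.sbt might declare the Spark dependency like so:
#   name := "Simple Project"
#   version := "1.0"
#   scalaVersion := "2.11.8"
#   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
#
# Build the JAR, then submit it to a local master:
sbt package
spark-submit --class SimpleApp --master "local[2]" \
  target/scala-2.11/simple-project_2.11-1.0.jar
```

The `--class` flag names the application's main class and `--master` selects where it runs; `local[2]` runs it in-process with two worker threads, which is convenient for testing before submitting to a real cluster.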
Let’s talk about some of the misconceptions about Spark. Many people get confused on the difference between Hadoop and Spark, for that reason as we talk these points we’ll also discuss how they relate to Hadoop.
Spark does not require Hadoop to run. You can run Spark using its standalone mode or on Hadoop clusters through YARN, or on Apache Mesos.
Spark does not include a storage layer. You must provide a data store for Spark to access. Spark can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
You do not need to have a machine learning project to use Spark. Spark can manage complex analytics such as streaming or graphing data.
Spark does have a library for streaming, which can be useful for many use cases; however, it is not true streaming. Spark Streaming processes data streams in batches, where each batch contains a collection of events that arrived over the batch period (regardless of when the data were actually created). This is fine for some applications, such as simple counts into Hadoop, but be aware that the lack of true record-by-record processing makes certain stream-processing and time-series analytics impossible.
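The micro-batching behaviour this note describes can be simulated in a few lines of plain Python (no Spark; the batch interval and event times are made up for illustration): every record is held until its batch interval closes, so a per-record latency of up to the interval length is built in.

```python
from collections import defaultdict

BATCH_INTERVAL = 0.5  # seconds, standing in for a Spark Streaming batch duration

# (arrival_time_in_seconds, payload) -- illustrative event stream
events = [(0.1, "a"), (0.3, "b"), (0.6, "c"), (0.9, "d"), (1.2, "e")]

batches = defaultdict(list)
for t, payload in events:
    # Each event is assigned to the micro-batch covering its arrival time;
    # the whole batch is processed together once its interval closes.
    batches[int(t // BATCH_INTERVAL)].append(payload)

# Batch 0 covers [0.0, 0.5), batch 1 covers [0.5, 1.0), batch 2 covers [1.0, 1.5)
assert dict(batches) == {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
```

Event "a", arriving at t=0.1, is not visible to processing until the 0.5 s boundary; a true record-at-a-time engine would hand it to the application immediately, which is the distinction the note draws.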