For this March session, we bring you a Big Data theme built around Spark and Machine Learning!
We will start with a presentation of Apache Spark 1.5: its distributed architecture and its capabilities will no longer hold any secrets for you.
We will then move on to the fundamentals of Machine Learning: vocabulary (so you can finally understand what data scientists and data miners are talking about!), use cases, and an explanation of the most popular algorithms... We promise, the presentation contains no barbaric math formulas ;)
Finally, we will put these two presentations into practice by building your first predictive application together, with Apache Spark and Apache Zeppelin!
6. What is it about?
● A cluster computing framework
● Open source
● Written in Scala
7. History
2009 : Project started at the UC Berkeley AMPLab
2010 : Project open-sourced
2013 : Became an Apache project; the Databricks company was founded
2014 : Became a top-level Apache project and the most active project in the Apache Foundation (500+ contributors)
2014 : Release of Spark 1.0, 1.1 and 1.2
2015-2016 : Release of Spark 1.3, 1.4, 1.5 and 1.6
2015 : IBM, SAP, … invest in Spark
2015 : 2,000 registrations at Spark Summit SF, 1,000 at Spark Summit Amsterdam
2016 : a new Spark Summit in San Francisco in June 2016
13. Used as a collection
● DSL
● Monadic type
● Several operators (see the sketch below)
○ map, filter, count, distinct, flatMap, ...
○ join, groupBy, union, ...
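A minimal sketch of these collection-style operators on an RDD; the input values and app name are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

// Two small RDDs built from in-memory collections
val words = sc.parallelize(Seq("spark", "rdd", "spark", "scala"))
val sizes = sc.parallelize(Seq(("spark", 5), ("scala", 5)))

// Chained like on a regular Scala collection
val distinctWords = words.distinct().count()                 // 3
val letters = words.flatMap(w => w.toSeq).distinct().count() // distinct characters
val joined = words.map(w => (w, 1)).join(sizes)              // pair-RDD join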
14. Created from
● A collection (List, Set)
● Various file formats
○ JSON, text, Hadoop SequenceFile, ...
● Various databases
○ JDBC, Cassandra, ...
● Other RDDs
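A short sketch of three of these creation paths, reusing the SparkContext from above (the file name is illustrative):

// From an in-memory collection
val fromList = sc.parallelize(List(1, 2, 3, 4))

// From a text file, one element per line
val fromFile = sc.textFile("data.txt")

// From another RDD, by transformation
val derived = fromList.map(_ * 2)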
15. Sample
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("sample")
  .setMaster("local")
val sc = new SparkContext(conf)

// Count the lines of data.csv longer than 10 characters
val rdd = sc.textFile("data.csv")
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
16. Lazy evaluation
● Intermediate operators (lazy)
○ map, filter, distinct, flatMap, …
● Final operators (trigger the computation)
○ count, mean, fold, first, ...
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
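The same chain split into steps to make the laziness visible; rdd is the RDD from the sample above:

val lengths = rdd.map(s => s.length)       // intermediate: nothing runs yet
val longOnes = lengths.filter(i => i > 10) // intermediate: still nothing
val nb = longOnes.count()                  // final: the whole chain executes here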
17. Caching
● Reuse an intermediate result
● cache() operator
● Avoids re-computing

// Without cache(), the map would be re-computed for each of the two actions
val r = rdd.map(s => s.length).cache()
val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()
19. Run locally
val master = "local"    // a single thread
val master = "local[*]" // one thread per core
val master = "local[4]" // 4 threads

val conf = new SparkConf().setAppName("sample")
  .setMaster(master)
val sc = new SparkContext(conf)
20. Run on cluster
val master = "spark://..." // URL of the cluster master

val conf = new SparkConf().setAppName("sample")
  .setMaster(master)
val sc = new SparkContext(conf)
26. DataFrame 1/3
● A distributed collection of rows organized into named columns
● An abstraction for selecting, filtering, aggregating and plotting structured data
● Provides a schema
● Not an RDD replacement
What?
27. DataFrame 2/3
● RDDs are more efficient than what came before (Hadoop MapReduce)
● But RDDs are still too complicated for common tasks
● DataFrames are simpler and faster
Why?
29. DataFrame 3/3
● Available since Spark 1.3
● The DataFrame API is just an interface
○ The implementation is done once, in the Spark engine
○ All languages benefit from the optimizations without rewriting anything
How?
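A minimal sketch of that interface as of Spark 1.5; the JSON file and its name and age columns are assumptions:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// The schema is inferred from the JSON file
val df = sqlContext.read.json("people.json")

// Select, filter and aggregate through named columns
df.select("name", "age")
  .filter(df("age") > 21)
  .groupBy("age")
  .count()
  .show()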
30. Spark Streaming
● Framework on top of the RDD and DataFrame APIs
● Real-time data processing
● The counterpart of the RDD here is the DStream
● Same model as before, but the dataset is not static
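A classic sketch of the DStream model, assuming a text source on a local socket (host and port are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches of 1 second on top of the existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))

// Each batch arrives as a DStream and is manipulated like an RDD
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()            // start receiving and processing
ssc.awaitTermination() // block until stopped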
36. Spark MLlib
spark.mllib

import org.apache.spark.mllib.tree.RandomForest

val sc = ... // init SparkContext
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform each line into a LabeledPoint */
  .randomSplit(Array(0.98, 0.02))
val model = RandomForest.trainClassifier(
  trainingData,
  10,              // numClasses
  Map[Int, Int](), // categoricalFeaturesInfo
  30,              // numTrees
  "auto",          // featureSubsetStrategy
  "gini",          // impurity
  7,               // maxDepth
  100,             // maxBins
  0)               // seed
val prediction = model.predict(...)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// init SparkContext as above
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform each line into a LabeledPoint */
  .randomSplit(Array(0.98, 0.02))
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)
val prediction = model.predict(...)

Each model exposes its own interface
37. Spark MLlib
spark.ml
● Provides a uniform set of high-level APIs
● Built on top of DataFrames
● Pipeline concepts
○ Transformer
○ Estimator
○ Pipeline
38. Spark MLlib
spark.ml
● Transformer : transform(DF)
○ maps a DataFrame by adding new columns
○ e.g. predicts the label and adds the result as a new column
● Estimator : fit(DF)
○ a learning algorithm
○ produces a model from a DataFrame
40. Spark MLlib
spark.ml
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

val training: DataFrame = ???
val test: DataFrame = ???
val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
// train the model
val model1 = lr.fit(training)
// predict on the test data
model1.transform(test)
41. Spark MLlib
spark.ml
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val training: DataFrame = ???
val test: DataFrame = ???
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  /* add parameters */
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, rf))
// Fit the pipeline to the training documents
val model = pipeline.fit(training)
model.transform(test)

import org.apache.spark.ml.classification.LogisticRegression

val training: DataFrame = ???
val test: DataFrame = ???
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  /* add parameters */
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to the training documents
val model = pipeline.fit(training)
model.transform(test)

Different models, same way to build the pipeline