Scalable Machine Learning
me: 
Sam Bessalah 
Software Engineer, Freelance 
Big Data, Distributed Computing, Machine Learning 
Paris Data Geek Co-organizer 
@samklr @DataParis
Machine Learning Land 
Vowpal Wabbit
Some Observations in Big Data Land 
● New use cases push towards faster execution platforms and real-time prediction engines.
● Traditional MapReduce on Hadoop is fading away, especially for Machine Learning.
● Apache Spark has become the darling of the Big Data world, thanks to its high-level API and performance.
● Rise of public Machine Learning APIs that make it easy to integrate models into applications and other data processing workflows.
Apache Mahout
● Used to be the only machine learning framework on Hadoop MapReduce.
● Moved from MapReduce towards modern and faster backends, namely Spark and H2O.
● Now provides a fluent DSL that integrates with Scala and Spark.
Mahout Example
Simple co-occurrence analysis in Mahout's Samsara DSL:

val A = drmFromHDFS("hdfs://nivdul/babygirl.txt")

// Compute the co-occurrence matrix
val cooccurrenceMatrix = A.t %*% A

// Broadcast the number of interactions per item
val numInteractions = drmBroadcast(A.colSums)

// Turn raw co-occurrence counts into LLR-based indicators
// (computeLLR is a user-defined scoring helper)
val I = cooccurrenceMatrix.mapBlock() {
  case (keys, block) =>
    val indicatorBlock = new SparseMatrix(block.nrow, block.ncol)
    for (r <- 0 until block.nrow)
      indicatorBlock(r, ::) = computeLLR(block(r, ::), numInteractions)
    (keys, indicatorBlock)
}
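The log-likelihood ratio (LLR) test used here downweights co-occurrences that are likely due to chance, so only statistically significant item-to-item associations are kept as recommendation indicators.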
Apache Spark
A dataflow system built on immutable, lazy, in-memory distributed collections, well suited for the iterative and complex transformations found in most Machine Learning algorithms.
Those in-memory collections are called Resilient Distributed Datasets (RDDs).
They provide:
● Partitioned data
● High-level operations (map, filter, collect, reduce, zip, join, sample, etc.)
● No side effects
● Fault recovery via lineage
Some operations on RDDs
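A minimal sketch of a few of these operations in the Spark shell, assuming a SparkContext sc (the file path and log format are illustrative):

// Transformations are lazy: nothing runs until an action is called
val lines   = sc.textFile("hdfs://bbgrl/events.log")
val errors  = lines.filter(_.contains("ERROR"))
val lengths = errors.map(_.length)

// Actions trigger the actual computation
val count   = errors.count()
val total   = lengths.reduce(_ + _)
val sampled = errors.sample(withReplacement = false, fraction = 0.1).collect()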
Spark Ecosystem
MLlib
Machine Learning library within Spark:
● Provides an integrated predictive and data analysis workflow
● A broad collection of algorithms and applications
● Integrates with the whole Spark ecosystem
Three APIs: Scala, Java, and Python.
Algorithms in MLlib
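(At the time of this deck, the library includes, among others: linear and logistic regression, linear SVMs, naive Bayes, decision trees, ALS for collaborative filtering, K-means clustering, and SVD/PCA for dimensionality reduction.)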
Example: Clustering via K-means

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse data
val data = sc.textFile("hdfs://bbgrl/dataset.txt")
val parsedData = data.map { x =>
  Vectors.dense(x.split(" ").map(_.toDouble))
}.cache()

// Cluster the data into 5 classes using K-means
val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

// Evaluate model error (within-cluster sum of squared errors)
val cost = clusters.computeCost(parsedData)
Coming to Spark 1.2
● Ensembles of decision trees: Random Forests
● Boosting
● Topic modeling
● Streaming K-means
● A pipeline interface for machine learning workflows
A lot of contributions from the community.
Machine Learning Pipeline
Typical machine learning workflows are complex!
Coming in the next iterations of MLlib; a sketch of the API follows below.
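A minimal sketch of the spark.ml pipeline API; the column names and the training/test DataFrames are assumptions for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Three stages: tokenize text, hash terms into feature vectors, fit a classifier
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// training and test are assumed DataFrames with "text" and "label" columns
val model = pipeline.fit(training)
val predictions = model.transform(test)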
H2O
● H2O is a fast (really fast) statistics, Machine Learning and maths engine on the JVM.
● Developed by 0xdata (a commercial entity), with a focus on bringing robust, high-performance machine learning algorithms to popular Big Data workloads.
● Has APIs in R, Java, Scala and Python, and integrates with third-party tools like Tableau and Excel.
Example in R 
library(h2o) 
localH2O = h2o.init(ip = 'localhost', port = 54321) 
irisPath = system.file("extdata", "iris.csv", package="h2o") 
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex") 
iris.data.frame <- as.data.frame(iris.hex) 
> colnames(iris.hex) 
[1] "C1" "C2" "C3" "C4" "C5" 
>
Simple logistic regression to predict prostate cancer outcomes:
> prostate.hex = h2o.importFile(localH2O, 
path="https://raw.github.com/0xdata/h2o/../prostate.csv", 
key = "prostate.hex") 
> prostate.glm = h2o.glm(y = "CAPSULE", x =c("AGE","RACE","PSA","DCAPS"), 
data = prostate.hex,family = "binomial", nfolds = 10, alpha = 0.5) 
> prostate.fit = h2o.predict(object=prostate.glm, newdata = prostate.hex)
> (prostate.fit) 
IP Address: 127.0.0.1 
Port : 54321 
Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9 
predict X0 X1 
1 0 0.7452267 0.2547732 
2 1 0.3969807 0.6030193 
3 1 0.4120950 0.5879050 
4 1 0.3726134 0.6273866 
5 1 0.6465137 0.3534863 
6 1 0.4331880 0.5668120
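In this output, predict is the predicted class, and X0/X1 are the model's estimated probabilities for each outcome.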
Sparkling Water 
Transparent use of H2O data and algorithms with the Spark API. 
Provides a custom RDD: H2ORDD
val sqlContext = new SQLContext(sc)
import sqlContext._

// airlinesTable was loaded through H2O methods
airlinesTable.registerTempTable("airlinesTable")

val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
val result = sql(query)
result.count
Same, but with the Spark API

// The H2O context provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create an RDD wrapper around the DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter(f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK")
)
flightsOnlyToSF.count
Build a model

import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

val dlParams = new DeepLearningParameters()
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
                                  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
                                  'FlightNum, 'TailNum, 'CRSElapsedTime,
                                  'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams._response_column = 'IsDepDelayed.name

// Create a new model builder and train
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get
Predict

// Use the model to score data
val prediction = dlModel.score(result)('predict)

// Collect predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse("NaN"))
Slides: http://speakerdeck.com/samklr/