COURBOSPARK:
DECISION TREE FOR
TIME-SERIES ON SPARK
Christophe Salperwyck – EDF R&D
Simon Maby – OCTO Technology -
@simonmaby
Xdata project: www.xdata.fr, grants from
"Investissement d'Avenir" program, 'Big Data' call
AGENDA
1. PROBLEM DESCRIPTION
2. IMPLEMENTATION
• Courbotree: presentation of the algorithm
• From MLlib to CourboSpark
3. PERFORMANCE
• Configuration (cluster description, Spark config…)
4. FEEDBACK ON SPARK/MLLIB
FRENCH METER DATA
• 1 measurement every 10 min
• 35 million customers
• Time-series: 144 points x 365 days
• Annual data volume: 1800 billion records, 120 TB
of raw data
BIG DATA!
LOAD CURVE CLASSIFICATION

Contract type  Region  …  Equipment type  Load curve
9 kVA          75      …  Elec            (curve)
6 kVA          22      …  Gas             (curve)
…              …       …  …               …
12 kVA         34      …  Elec            (curve)
WHY A DECISION TREE?
• Easy to understand
• Ability to explore the model
• Ability to choose the
expressivity of the model
SPLIT CRITERION: INERTIA
Goal: find the split on an explanatory feature that separates the most different curves.
How to split? We can either:
• minimize the dispersion of the curves within groups (intra-class inertia), or
• maximize the differences between the average curves of the groups (inter-class inertia)
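The two inertia notions can be sketched in a few lines of plain Scala (an illustrative sketch, not the CourboSpark code; `Inertia` and its method names are made up for this example). Note that, by Huygens' theorem, total inertia = intra + inter, so minimizing one is equivalent to maximizing the other.

```scala
// Illustrative sketch: inertia of groups of curves, each curve a vector of points.
object Inertia {
  type Curve = Array[Double]

  // Pointwise mean curve of a group.
  def mean(curves: Seq[Curve]): Curve = {
    val m = new Array[Double](curves.head.length)
    for (c <- curves; i <- m.indices) m(i) += c(i) / curves.size
    m
  }

  // Squared Euclidean distance between two curves.
  def sqDist(a: Curve, b: Curve): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Intra-class inertia: dispersion of the curves around their group means.
  def intra(groups: Seq[Seq[Curve]]): Double =
    groups.map { g => val m = mean(g); g.map(sqDist(_, m)).sum }.sum

  // Inter-class inertia: weighted spread of the group means around the global mean.
  def inter(groups: Seq[Seq[Curve]]): Double = {
    val global = mean(groups.flatten)
    groups.map(g => g.size * sqDist(mean(g), global)).sum
  }
}
```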
MAXIMIZE DIFFERENCES BETWEEN AVERAGE CURVES (feature: Equipment Type)
[Chart: mean load curves (P in W vs. hour) for Electrical and Gas customers; the annotation ArgMax(d) marks where the distance d between the two mean curves is largest]
EXISTING DISTRIBUTED DECISION TREES
Scalable Distributed Decision Trees in Spark MLlib. Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014. http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
A MapReduce Implementation of C4.5 Decision Tree Algorithm. Wei Dai, Wei Ji. International Journal of Database Theory and Application, Vol. 7, No. 1, 2014, pages 49-60. http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
Distributed Decision Tree Learning for Mining Big Data Streams. Arinto Murdopo. Master thesis, Yahoo! Labs Barcelona, July 2013. http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
MLLIB DECISION TREE PARALLELIZATION
HORIZONTAL STRATEGY
• Step 1: each host (Host 1, Host 2, Host 3) computes average-curve statistics per bin ([0:10[, [10:20[, …) on its own partition of the data
• Step 2: collect the partial statistics on one host and find the best split
FROM MLLIB TO COURBOSPARK
What MLlib provides to build the tree:
• Criteria: entropy, Gini, variance
• Data structure: LabeledPoint
FROM MLLIB TO COURBOSPARK
What CourboSpark adds to build the tree:
• Criteria: entropy, Gini, variance, inertia (to compare time-series)
• Data structures: LabeledPoint, TimeSeries
• Finding the split point for nominal features
For data visualization of the tree:
• Quantiles on the nodes and leaves
• Loss of inertia
• Number of curves per node and leaf
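The extended data structure can be sketched roughly as follows (hypothetical names, not the actual CourboSpark classes): where MLlib's LabeledPoint carries a single Double label, here the whole curve is the target.

```scala
// Illustrative sketch: a time-series label instead of a scalar one.
case class TimeSeries(values: Array[Double]) {
  require(values.nonEmpty, "a time-series needs at least one point")
  def length: Int = values.length
  def apply(i: Int): Double = values(i)
}

// One training example: the load curve to explain, plus the explanatory
// features (contract type, region, equipment type, ...) encoded as doubles.
case class TimeSeriesPoint(curve: TimeSeries, features: Array[Double])
```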
DEALING WITH NOMINAL FEATURES
Current MLlib implementation for regression:
➔ order the categories by their mean on the target
Ordered categories: A, C, B, D
Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{D}
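The regression trick above reduces the search to n-1 "prefix" partitions along the mean-ordered categories. A sketch in plain Scala (illustrative, not the library code):

```scala
// Order categories by their mean target value, then enumerate only the
// n-1 prefix cuts along that order as candidate binary splits.
object OrderedSplits {
  def candidateSplits(byCategory: Map[String, Seq[Double]]): Seq[(Set[String], Set[String])] = {
    val ordered = byCategory.toSeq
      .sortBy { case (_, ys) => ys.sum / ys.size } // mean of the target
      .map(_._1)
    (1 until ordered.size).map { k =>
      (ordered.take(k).toSet, ordered.drop(k).toSet)
    }
  }
}
```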
NOMINAL VALUES: TYPE OF CONTRACT
4 categories {A, B, C, D}
[Chart: one average load curve per category A, B, C, D; where does D fit?]
DEALING WITH NOMINAL FEATURES
Hard to order curves…
Solution 1:
Compare the groupings pair by pair ➔ {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}…
Problem:
Combinatorial explosion in n, the number of distinct categories: complexity is O(2^n).
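To make the blow-up concrete: n categories admit 2^(n-1) - 1 distinct binary partitions into two non-empty groups (each category goes left or right, halved for symmetry, minus the empty split). A one-liner illustration in plain Scala:

```scala
// Count of candidate binary splits for a nominal feature with n categories.
object SplitCount {
  def binarySplits(n: Int): Long = (1L << (n - 1)) - 1
}
```

With n = 4 this is 7 splits; with n = 20 it is already over half a million.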
DEALING WITH NOMINAL FEATURES
Solution 2:
Agglomerative hierarchical clustering (bottom-up approach).
Complexity is O(n^3); we don't expect n > 100.
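A minimal sketch of this bottom-up approach in plain Scala (illustrative, not the CourboSpark implementation): start with one cluster per category's mean curve and repeatedly merge the two closest clusters; the merge order yields the groupings to test as splits.

```scala
// Agglomerative hierarchical clustering of per-category mean curves.
object Agglomerative {
  type Curve = Array[Double]

  def sqDist(a: Curve, b: Curve): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def weightedMean(a: (Curve, Int), b: (Curve, Int)): Curve =
    a._1.zip(b._1).map { case (x, y) => (x * a._2 + y * b._2) / (a._2 + b._2) }

  // Returns the successive merges as sets of category names.
  def cluster(meanCurves: Map[String, Curve]): List[Set[String]] = {
    var clusters: List[(Set[String], Curve, Int)] =
      meanCurves.toList.map { case (k, c) => (Set(k), c, 1) }
    var merges = List.empty[Set[String]]
    while (clusters.size > 1) {
      // O(n^2) pair scan per step -> O(n^3) overall; fine for n <= 100.
      val pairs = for {
        i <- clusters.indices
        j <- (i + 1) until clusters.size
      } yield (i, j, sqDist(clusters(i)._2, clusters(j)._2))
      val (i, j, _) = pairs.minBy(_._3)
      val (si, ci, ni) = clusters(i)
      val (sj, cj, nj) = clusters(j)
      val merged = (si ++ sj, weightedMean((ci, ni), (cj, nj)), ni + nj)
      clusters = merged :: clusters.zipWithIndex
        .collect { case (c, k) if k != i && k != j => c }
      merges = merges :+ (si ++ sj)
    }
    merges
  }
}
```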
HOW TO
• Set the algorithm parameters
• Configure the Spark context
• Load the data file
• Learn the model
LOOKING FOR THE TEST CONFIGURATION
For a constant global capacity on 12 nodes: 120 cores + 120 GB RAM

#Executors  RAM per exec.  Cores per exec.  Time on 100 GB of data
12          10 GB          10               22 minutes
24          5 GB           5                17 minutes
60          2 GB           2                12 minutes
120         1 GB           1                15 minutes
SCALABILITY TO #CONTAINERS
[Charts over three slides: runtime vs. number of containers]
SCALABILITY TO #LINES
[Chart: runtime vs. number of input lines]
FRAMEWORK STABILITY
Tested on:
• 10 GB, 100 GB, 200 GB, 300 GB, 400 GB, 500 GB, 1 TB
• Categorical and continuous variables
• Bin sizes from 100 to 1000
SCALABILITY TO #COLUMNS
[Chart: runtime vs. number of columns]
SCALABILITY TO #CATEGORIES
[Chart: runtime vs. number of categories]
REAL LIFE DATASET
• 9 executors with 20 GB RAM and 8 cores each
• 10 to 1,000 million load curves (10 numerical and 10 categorical features)
TUNING
• spark.default.parallelism
• spark.executor.memory
• spark.storage.memoryFraction
• spark.akka.frameSize
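These Spark 1.x properties would typically go in spark-defaults.conf (or be passed via --conf to spark-submit). The values below are only an illustrative sketch matching the cluster shape from the test-configuration slide, not the presenters' actual settings:

```
# Illustrative spark-defaults.conf (Spark 1.x); example values only.
# ~2-3 tasks per core on 120 cores
spark.default.parallelism    240
# 60 executors x 2 GB was the sweet spot in the configuration table
spark.executor.memory        2g
# fraction of heap for cached data; leave room for shuffle/aggregation
spark.storage.memoryFraction 0.4
# in MB; the per-split statistics collected to the driver can be large
spark.akka.frameSize         128
```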
FEEDBACK
Developers' view
• Flawless transition from local to cluster mode
• Debug mode works with an IDE
• Getting good performance requires knowledge
HEY SCALA <3
FEEDBACK
Data scientists' view
• The API is not very data-oriented…
• …but now we have Spark SQL and DataFrames!
• IPython + PySpark
• Feature engineering vs. model engineering
FEEDBACK
Ops' view
• Better than MapReduce
• Performance is predictable for tested code
• Runs on YARN
• Lots of releases; MLlib code is evolving quickly
FUTURE WORK
• Unbalanced trees
• Improve performance
• Other criteria for time-series comparison
• Missing values in explanatory features

MERCI (Thank you)
EXISTING DISTRIBUTED DECISION TREES

          Partitioning  Engine    Target     Ensemble  Pruning
MLlib     Horizontal    Spark     Num + Nom  Yes       No
MR C4.5   Vertical      MR        Nom        No        No
PLANET    Horizontal    MR        Num + Nom  Yes       No
SAMOA     Vertical      Storm/S4  Num + Nom  Yes       Yes
EXAMPLE OF TREE ON LOAD CURVES
[Figure: decision tree learned on load curves]
[Annotated code screenshot]
• A class storing a curve + explanatory features
• The main class to learn the model
• The inertia split criterion
• A class storing statistics about the model: quantiles, average curve, loss of inertia
