Big learning 1.2

Practical Aspects of
Machine Learning on Big-Data platforms
(Big Learning)
Mohit Garg
Research Engineer
TXII, Airbus Group Innovations
25-05-2015

Motivation & Agenda
Motivation: Answer the questions
– Why multiple ecosystems for scalable
ML?
– Is there a unified approach for a big-
data platform for ML?
– How much to catch up on?
– How industry leaders are doing it?
– Put things into perspective !
Agenda: To present
– Quick brief on practical ML process
– Current landscape of Open source tools
– Evolutionary drivers with examples
– Case studies
– The Twitter Experience
– * Share journey and observations
(ongoing process)
ML (Optimization,
LA, Stats)
scalabilityBig Data (Schema,
workflow, architecture)

Quick brief - Process
α
β
Train
Tune (Cross Validate Data)
Measure (Test Data)
• Not applicable to all ML modeling techniques. Biologically-inspired algorithms are more of a paradigm
(guidelines) rather than algorithm, and requires algorithm definition under those guidelines (GA, ACO, ANN).
• Graph Source: http://www.astroml.org/sklearn_tutorial/practical.html
Bias Vs Variance
Learning Curve
Data Sampling Algorithm Model EvaluationModel/Hypothesis

Quick brief – Workflow Example
Graph Source:

Quick brief – WF breakdown k-means

Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates

Only after iteration
is over
Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates
new cluster centres

Quick brief – Pieces
• An ML-algorithm is finally a computer algorithm
• Complex design or blocks of
– Blocks feeding each other. Output becoming Input
– Iterations over entire data (Gradient descent for Linear, Logistic
Regression, K-means etc) – Memory limitations
– Algorithms as non-linear workflows.
• Principles when operating on Large datasets
– Minimize I/O – Don’t read/write disk again and again
– Minimize Network transfer – Localize logic (non-additive?) to
data
– Use online-trainable algorithms- Optimized Parallel Algorithms
– Ease of use –Abstraction - Package well for end-user

Quick brief – then and now
• Small Data
– Static data
• Big Data
– Static Data
– But cant run on single machine
• Online Big Data
– Integrated with ETL
– Prediction and Learning together
– Twitter case study
α
β
Train
Measure (Test Data)
α
β
Train
Measure (Test Data)
Velocity
α
β

Current Landscape – Open Source tools
Stream
Sybil

Current Landscape – Open Source tools
Data
Complexity&completeness
Stream
Sibyl

Sibyl
Evolutionary Drivers
Data
Stream
2. Scientific
1. Loose Integration
3. Architectural
4. Use Case*

Open Source tools – Landscape & Drivers
Data
Stream
SybilIs there a wholesome solution?

Quick Review: MapReduce + Hadoop
• Bigger focus on
– Large scaling on data
– Scheduling and Concurrency control
• Load balancing
– Fault tolerance
– Basically, to save tonnes of user’s efforts
required in older frameworks like MPI.
– The map and reduce can be ‘executables ’ in
virtually any language (Streaming)
– *Maps (& reducers) don’t interact !
• MapReduce exploits massively
parallelizable problems, what about rest of
them?
– Simple case: Try finding median of integers(1-40
say) using MR?
• Can we expect any algorithm to execute in
with MR implementation with same time-
bounds?

Loose Integration
• Set of components/APIs
– exposing existing tools with Map-Reduce frameworks
– to be compiled, optimized and deployed in
– streaming or pipe mode with frameworks.
• Hadoop/MapReduce bindings available for
– R
– Python (numpy, sci-kit)
• Focus on
– Accommodating existing user-base to leverage hadoop data storage
– Easy & neat APIs for native users.
– No real effort on ‘bridging the gap’

Loose Integration – Pydoop Example
• Uses Hadoop Pipes as underlying framework
• Based on Cpython, so provide inclusion of sci-kit, num-py etc
• Lets you define map and reduce logics
• But, does not provide better representations of ML Algorithms

Scientific - Efforts
• Efforts comes in waves with breakthroughs
• Efforts on
– Accuracy bounds & Convergence
– Execution time bounds
• Recent efforts in tandem with Big Data
– Distributable algorithms – Central Limit theorem (local
logic with simpler aggregations)
– Batch-to-Online model – ‘One pass mode’ (avoid
iterations)
• Example
– Distributable algorithms - Ensemble Classification (eg
Random forest), K-means++||
– Batch-to-online - (SGD)
• Note – Power ‘inherently’ lies in Big Data
– Simple algorithm with larger dataset outperforms complex
algorithms with smaller dataset
Image-2 Source: Andrew Ng – Coursera ML-08
O1 O2 ON
Ǝ

Logistic Classification
• Sample: Yes/No kind of answers
– Is this tweet a spam?
– Will this tweeter log back in 48 hours?
X1 X2 …… XN Y
X11 . . X1N 0
. . . . 1
. . . . 1
. . . . 1
. . . . 0
XM1 . . XMN 0
X Y
x1
x2
xN
hӨ (x)
Ө1
Ө2
ӨN
Hypothesis hӨ (x) = 1 / ( 1 + e – ӨTx)
Cost(x) hӨ (x) - y
J =Cost(X)
• Ө is unknown variable
• Lets start with random value of Ө
• Aim is to change Ө to minimize J

Gradient Descent
Iterations
J =Cost(X, Ө) minimize using Gradient DescentVisualization in 2D

Gradient Descent
• Cost function requires all records.
While (W does not change)
{
// Load data
// find local losses
// Aggregate local losses
// Find gradient
// Update W
// Save W to disk
}
/* Multiple Passes */
J =Cost(X, Ө)
M1 M2 MN
Map – loads data
Reduce – Calculates
gradient and updates W
R
Saves W (intermediate)
User code

Stochastic Gradient Descent (SGD)
• No need to get cost function for gradient calculation
• Each iteration on a data point - xi
• Gradient calculated using only xi
• As good as performing SGD on single machine. Reducer – a serious
bottleneck
M1 M2 MN
Map – loads data
gradient and updates W
R
Saves W (final)
// Load data
While (no samples left)
{
// Find gradient using xi
// Update W
}
// Save W
/* Single Pass */
User code
Ref: Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks

SGD - Distributed
• Similar to SGD, but have multiple reducers
• Data is thoroughly randomized
• Multiple classifiers are learned together – ensemble classifiers
• Bottleneck of single reducer (Network Data) resolved
• Testing using standard aggregation over predictors’ results
M1 M2 MN
Map –load data
gradient and updates WjR1
W1
// Pre-process – Randomize
// Load data
While (no samples left)
{
// Find gradient using xi
// Update Wj
}
/* Single Pass and
distributed */
User code
R2
W2
Ref: L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.

Now1971 2020
Moore’s Law vs Kryder’s Law
Source: Collective information from Wiki & its references
“if hard drives were to continue to progress
at their then-current pace of about 40%
per year, then in 2020 a two-platter, 2.5-
inch disk drive would store approximately
40 TB and cost about $40” - Kryder
Moore’s II law : “As the cost of computer
power to the consumer falls, the cost for
producers to fulfill Moore's law follows an
opposite trend: R&D, manufacturing, and
test costs have increased steadily with each
new generation of chips”
GAP
- Individual processors’ power growing at slower rate
- Data Storage becomes easier & cheaper.
- MORE data, LESS processors – and the gap is
widening !
- Computer h/w architecture working at its pace to
provide faster buses, RAM & augmented GPUs.
Architectural – Forces

2012
VolumeinExabytes
15000
2017
Percentage of uncertain data
Percentofuncertaindata
We are here
Sensors
& Devices
VoIP
Enterprise
Data
Social
Media
6000
9000
100
0
50
VeracityVolume
Variety
Architectural – Forces
Source: IBM - Challenges and Opportunities with Big Data- Dr Hammou Messatfa

Mahout with MapReduce
• Key feature: Somewhat loose & somewhat tight integration
– Among the earliest library to exploit batch-like scalable components online
learning algorithms.
– Some algorithms re-engineered for MapReduce, some not.
– Performance hit for iterative algorithms. Huge I/O overhead
– Each (global) iteration means Map-Reduce job :O
– Integration of new scalable learners less active.
• Industry acceptance
– Accepted for scalable Recommender systems
• Future
– Mahout Samsara for scalable low-level Linear Algebra as scala & spark bindings
Sybil

Cascading
• Key feature: Abstraction & Packaging
– Let you think of workflows as chain of MR
– Pre-existing methods for reading and storage methods
– Provide checkpoints in the workflow to save the state.
– J-Unit for test-case driven s/w development.
• Industry acceptance
– Scalding is scala bindings for cascading, from twitter
– Used by Bixo for anti-spam classification
– Used to load data by elasticsearch & Cassandra
– Ebay leverages scalding design for distributable computing.
Sybil

Pig – Quick Summary
 High level dataflow language (Pig Latin)
 Much simpler than Java
 Simplify the data processing
 Put the operations at the apropriate phases
 Chains multiple MR jobs
 Appropriate for ML workflows
 No need to take care of intermediate outputs
 Provide user defined functions (UDFs) in java,
integrable with Pig
Sybil

Pig – Quick Summary
A=LOAD 'file1' AS (x, y, z);
B=LOAD 'file2' AS (t, u, v);
C=FILTER A by y > 0;
D=JOIN C BY x, B BY u;
E=GROUP D BY z;
F=FOREACH E GENERATE
group, COUNT(D);
STORE F INTO 'output';
LOAD
FILTER
LOAD
JOIN
GROUP
FOREACH
STORE
Sybil
FILTER
LOCAL REARRANGE
Map
Reduce
Map
Reduce
PACKAGE
FOREACH
LOCAL REARRANGE
PACKAGE
FOREACH

ML-lib
Sybil
 Part of Apache Spark framework
 Data can be from hdfs, S3, local files
 Encapsulates run-time data as Resilient Distributed Data store (RDD)
 RDD are in-memory data pieces
 Failt tolerent – knows how to recreate itself if resident node goes down
 No distinction between map and reduce, just a task.
 Vigilant
 Bindings for R too – SparkR
 Real ingenuity in implementing new-generation algorithms (online &
distributed)
 Example, has three versions of K-means – Lloyd, k-means++, k-means++ ||
 Key feature
 Shared objects – means tasks (belonging to one node) can share objects.

Spark (ML-lib)
Sybil
Shared
variable

Tez
Sybil
 Apache Ìncubated project
 Fundamentally similar design principles as of Spark
 Encapsulates run-time data as nodes just like RDDs
 Key features
 In-memory data
 Shared objects – means tasks (belonging to one node) can
share objects.
 Very few comaprative studies available
 Not much contributions from open community

Distributed R
Sybil
 Opensource Project lead by HP
 Similar to R-hadoop but with some new
genuine features like
 User-defined array partitioning
 Local transformation/functions
 Master-Worker synchronization
 Not the same ingenuity yet, as seen in ML-lib.
 Only fundamentally scalable algorithms (online
& distributable) scales linearly.
 Tall claims of 50-100x time efficiency when
used with HP-Vertica database

Sibyl
Sybil
 Not opensource yet, but some rumours !
 Claims to provide a GFS based highly scalable
flexible infrastucture for embedding ML
process in ETL
 Designed for supervised learning
 Focussed on learning user behaviours
 Youtube video recommendations
 Spam filters
 Major design principle– Columnar Data
 Suitable for sparse datasets (new columns?)
 Comrpression techniques for columnar data
much efficient (structral similarity)

Columnar data- LZO Compression
• Idea 1
– Compression should be ‘splittable’
– A large file can be compressed and
split in size equal to hdfs block.
– Each block should hold its
‘decompression key’
Compression Size
(GB)
Compression
Time (s)
Decompression
Time (s)
None 8.0 - -
Gzip 1.3 241 72
LZO 2.0 55 35
• Idea 2
– Compress data on hadoop (save 3/4th
space)
– Save 75% I/O time !!
– Achieve Decompression < 75% I/O
time
| | | | | | | |

Conclusion
• Big Data has resurrected interest in ML algorithms.
• A two-way push is leading the confluence – Online & Distributed
Learning (scientific) & flexible workflows (architectural) to
accommodate them.
• Facilitated by compression, serialization, in-memory, DAG-
representations, columnar databases etc.
• Majority of man-hours goes into engineering the pipelines.
• Industry aiming to provide high level abstraction on standard ML
algorithms hiding gory details.

Learning Resources
• MOOCs (Coursera)
– Machine Learning (Stanford)
– Design & Analysis of Algorithms (Stanford)
– R Programming Language (John Hopkins)
– Exploratory Data Analysis (John Hopkins)
• Online Competitions
– Kaggle Data Science platform
• Software Resources
– Matlab
– R
– Scikit – Python ?
– APIs – ANN, JGAP
• 2009 - "Subspace extracting Adaptive Cellular Network for layered Architectures
with circular boundaries.“ Paper on IEEE.
• 2006-07. 1st prize – IBM’s Great mind challenge – “Transport Management System”
Multi –TSP implementation using Genetic Algorithm .

Big learning 1.2

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Big learning 1.2

Similar to Big learning 1.2 (20)

Recently uploaded

Recently uploaded (20)

Big learning 1.2