Practical Aspects of
Machine Learning on Big-Data platforms
(Big Learning)
Mohit Garg
Research Engineer
TXII, Airbus Group Innovations
25-05-2015
Motivation & Agenda
Motivation: Answer the questions
– Why multiple ecosystems for scalable ML?
– Is there a unified approach for a big-data platform for ML?
– How much to catch up on?
– How are industry leaders doing it?
– Put things into perspective!
Agenda: To present
– Quick brief on practical ML process
– Current landscape of Open source tools
– Evolutionary drivers with examples
– Case studies
– The Twitter Experience
– * Share journey and observations
(ongoing process)
ML (Optimization, LA, Stats) + Big Data (Schema, workflow, architecture) → scalability
Quick brief on ML process
Quick brief - Process
α
β
Train
Tune (Cross Validate Data)
Measure (Test Data)
• Not applicable to all ML modeling techniques. Biologically-inspired algorithms are more of a paradigm
(guidelines) than an algorithm, and require an algorithm definition under those guidelines (GA, ACO, ANN).
• Graph Source: http://www.astroml.org/sklearn_tutorial/practical.html
Bias Vs Variance
Learning Curve
Data Sampling – Algorithm – Model Evaluation – Model/Hypothesis
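A minimal sketch (not from the slides) of the train / tune / measure split described above, using scikit-learn; the synthetic dataset and the parameter grid are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set that is touched only once, for the final measurement.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune hyper-parameters (here the regularisation strength C) by cross-validation
# on the training portion only.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Measure generalisation on the untouched test set.
print("best C:", search.best_params_["C"])
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```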
Quick brief – Workflow Example
Graph Source:
Quick brief – WF breakdown k-means
Quick brief – WF breakdown k-means
Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates
Quick brief – WF breakdown k-means
Only after iteration
is over
Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates
new cluster centres
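A minimal NumPy sketch of the loop broken down above: every pass scans the whole dataset, and the new cluster centres (the "meta input") only exist after the pass is over, which is what makes naive k-means expensive when the data does not fit in memory. Assumes, for brevity, that no cluster ends up empty.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Statement block: assign every point to its nearest centre.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Meta input for the next pass: recompute the centres (assumes no empty cluster).
        new_centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Termination condition: stop if nothing changed.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```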
Quick brief – Pieces
• An ML algorithm is ultimately a computer algorithm
• A complex design of blocks
– Blocks feeding each other; output becoming input
– Iterations over the entire data (gradient descent for linear/logistic
regression, k-means, etc.) – memory limitations
– Algorithms as non-linear workflows
• Principles when operating on large datasets
– Minimize I/O – don't read/write disk again and again
– Minimize network transfer – localize logic (non-additive?) to data
– Use online-trainable algorithms – optimized parallel algorithms
– Ease of use – abstraction – package well for the end-user
Quick brief – then and now
• Small Data
– Static data
• Big Data
– Static Data
– But can't run on a single machine
• Online Big Data
– Integrated with ETL
– Prediction and Learning together
– Twitter case study
α
β
Train
Tune (Cross Validate Data)
Measure (Test Data)
α
β
Train
Tune (Cross Validate Data)
Measure (Test Data)
Velocity
α
β
Current Landscape
Current Landscape – Open Source tools
Stream
Sibyl
Current Landscape – Open Source tools
Data
Complexity & completeness
Stream
Sibyl
Evolutionary Drivers
Current Landscape – Open Source tools
Data
Complexity & completeness
Stream
Sibyl
Sibyl
Evolutionary Drivers
Data
Complexity & completeness
Stream
2. Scientific
1. Loose Integration
3. Architectural
4. Use Case*
Open Source tools – Landscape & Drivers
Data
Complexity & completeness
Stream
Sibyl
Is there a wholesome solution?
Loose Integration
Quick Review: MapReduce + Hadoop
• Bigger focus on
– Scaling to large data
– Scheduling and concurrency control
• Load balancing
– Fault tolerance
– Basically, to save the tonnes of user effort
required in older frameworks like MPI.
– The map and reduce can be ‘executables ’ in
virtually any language (Streaming)
– *Maps (& reducers) don’t interact !
• MapReduce exploits massively
parallelizable problems – what about the rest of
them?
– Simple case: try finding the median of integers (say 1-40)
using MR (see the sketch below)
• Can we expect any algorithm to execute with
an MR implementation within the same time
bounds?
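To make the median teaser concrete, a tiny local illustration in plain Python (nothing Hadoop-specific): additive statistics compose across mapper splits, order statistics such as the median do not, which is why MR needs a single reducer or extra passes for them. The uneven split sizes are chosen deliberately to make the mismatch visible.

```python
import statistics

data = list(range(1, 41))                                  # the slide's integers 1..40
parts = [data[:25], data[25:30], data[30:35], data[35:]]   # pretend these are 4 mapper splits

# Additive statistic: per-split partial sums combine exactly.
assert sum(sum(p) for p in parts) == sum(data)

# Order statistic: the "median of split medians" is not the true median.
median_of_medians = statistics.median(statistics.median(p) for p in parts)
true_median = statistics.median(data)
print(median_of_medians, true_median)   # 30.5 vs 20.5 here
```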
Loose Integration
• Set of components/APIs
– exposing existing tools to Map-Reduce frameworks
– to be compiled, optimized and deployed
in streaming or pipe mode with the frameworks (a streaming-style sketch follows this slide)
• Hadoop/MapReduce bindings available for
– R
– Python (NumPy, scikit-learn)
• Focus on
– Accommodating the existing user base to leverage Hadoop data storage
– Easy & neat APIs for native users
– No real effort on 'bridging the gap'
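A sketch of the "streaming or pipe mode" integration: mapper and reducer are plain executables that read stdin and write tab-separated key/value lines to stdout. The script name and the way it would be wired into `hadoop jar hadoop-streaming*.jar -mapper ... -reducer ...` are illustrative, not prescribed by the slides.

```python
import sys
from itertools import groupby

def run_mapper(stream):
    # Emit "word \t 1" for every word seen.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")

def run_reducer(stream):
    # Hadoop Streaming delivers mapper output sorted by key, so consecutive
    # lines with the same word can be grouped and summed.
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    run_mapper(sys.stdin) if sys.argv[1] == "map" else run_reducer(sys.stdin)
```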
R-Hadoop
Loose Integration – Pydoop Example
• Uses Hadoop Pipes as the underlying framework
• Based on CPython, so allows inclusion of scikit-learn, NumPy, etc.
• Lets you define map and reduce logic (see the sketch below)
• But does not provide better representations of ML algorithms
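For contrast with the streaming version above, the same word count expressed through Pydoop's Mapper/Reducer classes, roughly following the word-count example in Pydoop's documentation; treat the module paths and the entry-point convention as version-dependent assumptions.

```python
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class WordCountMapper(api.Mapper):
    def map(self, context):
        # context.value is one line of input text
        for word in context.value.split():
            context.emit(word, 1)

class WordCountReducer(api.Reducer):
    def reduce(self, context):
        # context.values are the counts emitted for context.key
        context.emit(context.key, sum(context.values))

def __main__():
    pipes.run_task(pipes.Factory(WordCountMapper, WordCountReducer))
```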
Scientific
Scientific - Interest
Scientific - Efforts
• Efforts come in waves with breakthroughs
• Efforts on
– Accuracy bounds & Convergence
– Execution time bounds
• Recent efforts in tandem with Big Data
– Distributable algorithms – Central Limit theorem (local
logic with simpler aggregations)
– Batch-to-Online model – ‘One pass mode’ (avoid
iterations)
• Example
– Distributable algorithms - Ensemble Classification (eg
Random forest), K-means++||
– Batch-to-online - (SGD)
• Note – the power 'inherently' lies in Big Data
– A simple algorithm with a larger dataset often outperforms a complex
algorithm with a smaller dataset
Image-2 Source: Andrew Ng – Coursera ML-08
O1 O2 … ON → aggregate
Logistic Classification
• Sample: Yes/No kind of answers
– Is this tweet spam?
– Will this Twitter user log back in within 48 hours?
X1 X2 …… XN Y
X11 . . X1N 0
. . . . 1
. . . . 1
. . . . 1
. . . . 0
XM1 . . XMN 0
(Diagram: inputs x1, x2, …, xN weighted by Ө1, Ө2, …, ӨN feed the output hӨ(x); X maps to Y)
Hypothesis: hӨ(x) = 1 / (1 + e^(−Өᵀx))
Cost(x) = hӨ(x) − y
J = Cost(X)
• Ө is the unknown variable
• Let's start with a random value of Ө
• The aim is to change Ө to minimize J (see the sketch below)
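A NumPy rendering of the hypothesis above. One assumption is made explicit: the cost J shown here is the standard logistic log-loss, which the slide only hints at through hӨ(x) − y (its gradient residual). The random data is purely illustrative.

```python
import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = 1 / (1 + exp(-theta^T x)), applied row-wise to X
    return 1.0 / (1.0 + np.exp(-X @ theta))

def cost(theta, X, y):
    # Standard logistic log-loss (assumed formulation, not spelled out on the slide)
    h = hypothesis(theta, X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.integers(0, 2, size=6)
theta = rng.normal(size=3)      # start with a random theta...
print(cost(theta, X, y))        # ...then adjust theta to minimise J
```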
Gradient Descent
Iterations
J = Cost(X, Ө) – minimize using Gradient Descent. Visualization in 2D
Gradient Descent
• Cost function requires all records.
While (W keeps changing)
{
// Load data
// find local losses
// Aggregate local losses
// Find gradient
// Update W
// Save W to disk
}
/* Multiple Passes */
J = Cost(X, Ө)
M1 M2 MN
Map – loads data
Reduce – Calculates
gradient and updates W
R
Saves W (intermediate)
User code
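A single-machine NumPy rendering of the multi-pass loop above: every update of W needs a full scan of X (the "cost function requires all records" point), which under MapReduce translates into one full job per iteration. Learning rate and tolerance are illustrative values.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, max_iter=1000, tol=1e-6):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        h = 1.0 / (1.0 + np.exp(-X @ w))        # needs all records
        grad = X.T @ (h - y) / len(y)           # aggregate the local losses
        new_w = w - lr * grad                   # update W
        if np.linalg.norm(new_w - w) < tol:     # W stops changing: converged
            return new_w
        w = new_w                               # "save W" for the next pass
    return w
```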
Stochastic Gradient Descent (SGD)
• No need to get cost function for gradient calculation
• Each iteration on a data point - xi
• Gradient calculated using only xi
• As good as performing SGD on a single machine; the reducer is a serious
bottleneck
M1 M2 MN
Map – loads data
Reduce – Calculates
gradient and updates W
R
Saves W (final)
// Load data
While (samples left)
{
// Find gradient using xi
// Update W
}
// Save W
/* Single Pass */
User code
Ref: Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks
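The corresponding single-pass SGD sketch: each update uses only one example x_i, so one (shuffled) sweep over the data already trains the model; but, as the slide notes, pushing every update through a single reducer keeps it effectively sequential.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.permutation(len(y)):            # single pass, shuffled order
        h = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (h - y[i]) * X[i]              # gradient from x_i only
    return w
```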
SGD - Distributed
• Similar to SGD, but have multiple reducers
• Data is thoroughly randomized
• Multiple classifiers are learned together – ensemble classifiers
• Bottleneck of single reducer (Network Data) resolved
• Testing using standard aggregation over predictors’ results
M1 M2 MN
Map –load data
Reduce – Calculates
gradient and updates Wj
R1
W1
// Pre-process – Randomize
// Load data
While (samples left)
{
// Find gradient using xi
// Update Wj
}
/* Single Pass and
distributed */
User code
R2
W2
Ref: L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.
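A sketch of the distributed variant: shuffle the data, split it into shards (one per reducer), run single-pass SGD on each shard independently, and aggregate the resulting predictors at test time. Averaging the predicted probabilities is one simple aggregation choice among several discussed in the cited papers.

```python
import numpy as np

def _sgd_shard(X, y, lr, rng):
    # Same single-pass update rule as the SGD sketch above, applied to one shard.
    w = np.zeros(X.shape[1])
    for i in rng.permutation(len(y)):
        h = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (h - y[i]) * X[i]
    return w

def distributed_sgd(X, y, n_shards=4, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))                        # thorough randomisation
    return [_sgd_shard(X[s], y[s], lr, rng)
            for s in np.array_split(order, n_shards)]      # one Wj per reducer

def predict(models, X):
    probs = [1.0 / (1.0 + np.exp(-X @ w)) for w in models]
    return (np.mean(probs, axis=0) > 0.5).astype(int)      # aggregate the predictors
```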
Architectural
1971 – Now – 2020
Moore’s Law vs Kryder’s Law
Source: Collective information from Wiki & its references
“if hard drives were to continue to progress
at their then-current pace of about 40%
per year, then in 2020 a two-platter, 2.5-
inch disk drive would store approximately
40 TB and cost about $40” - Kryder
Moore’s second law: “As the cost of computer
power to the consumer falls, the cost for
producers to fulfill Moore's law follows an
opposite trend: R&D, manufacturing, and
test costs have increased steadily with each
new generation of chips”
GAP
- Individual processors’ power growing at slower rate
- Data Storage becomes easier & cheaper.
- MORE data, FEWER processors – and the gap is
widening!
- Computer h/w architecture working at its pace to
provide faster buses, RAM & augmented GPUs.
Architectural – Forces
(Chart: data volume growing from roughly 6,000 through 9,000 to 15,000 exabytes between 2012 and 2017, with the percentage of uncertain data (0-100% axis) also rising; a "We are here" marker on the curve. Sources: sensors & devices, VoIP, enterprise data, social media. Dimensions: Volume, Variety, Veracity.)
Architectural – Forces
Source: IBM - Challenges and Opportunities with Big Data- Dr Hammou Messatfa
Mahout with MapReduce
• Key feature: somewhat loose & somewhat tight integration
– Among the earliest libraries to offer scalable batch-style components alongside
online learning algorithms.
– Some algorithms re-engineered for MapReduce, some not.
– Performance hit for iterative algorithms; huge I/O overhead
– Each (global) iteration means a full MapReduce job
– Integration of new scalable learners is less active.
• Industry acceptance
– Accepted for scalable recommender systems
• Future
– Mahout Samsara for scalable low-level linear algebra as Scala & Spark bindings
Cascading
• Key feature: Abstraction & Packaging
– Lets you think of workflows as a chain of MR jobs
– Pre-existing methods for reading and storing data
– Provides checkpoints in the workflow to save state
– JUnit support for test-case-driven s/w development
• Industry acceptance
– Scalding is the Scala binding for Cascading, from Twitter
– Used by Bixo for anti-spam classification
– Used for data loading by Elasticsearch & Cassandra
– eBay leverages the Scalding design for distributed computing
Cascading
Pig – Quick Summary
• High-level dataflow language (Pig Latin)
• Much simpler than Java
• Simplifies the data processing
• Puts the operations at the appropriate phases
• Chains multiple MR jobs
• Appropriate for ML workflows
• No need to take care of intermediate outputs
• Provides user-defined functions (UDFs) in Java,
integrable with Pig
Pig – Quick Summary
A=LOAD 'file1' AS (x, y, z);
B=LOAD 'file2' AS (t, u, v);
C=FILTER A by y > 0;
D=JOIN C BY x, B BY u;
E=GROUP D BY z;
F=FOREACH E GENERATE
group, COUNT(D);
STORE F INTO 'output';
(Diagram: logical plan – two LOADs, FILTER, JOIN, GROUP, FOREACH, STORE – compiled into two chained MapReduce jobs; the map stages run FILTER and LOCAL REARRANGE, the reduce stages run PACKAGE and FOREACH.)
ML-lib
• Part of the Apache Spark framework
• Data can come from HDFS, S3 or local files
• Encapsulates run-time data as Resilient Distributed Datasets (RDDs)
• RDDs are in-memory data pieces
• Fault tolerant – an RDD knows how to recreate itself if its resident node goes down
• No distinction between map and reduce, just tasks
• Vigilant
• Bindings for R too – SparkR
• Real ingenuity in implementing new-generation algorithms (online &
distributed)
• Example: three versions of k-means – Lloyd, k-means++, k-means|| (a PySpark sketch
follows below)
• Key feature
• Shared objects – tasks (belonging to one node) can share objects
Spark (ML-lib)
Shared variable
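A minimal PySpark sketch of the k-means family mentioned above, using the RDD-based MLlib API of that era; the HDFS path, the comma-separated file format and the SparkContext setup are illustrative assumptions.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-sketch")

# RDD of feature vectors; the data could equally come from S3 or local files.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: [float(x) for x in line.split(",")])

# "k-means||" is MLlib's parallel initialisation mode.
model = KMeans.train(points, k=3, maxIterations=20,
                     initializationMode="k-means||")
print(model.clusterCenters)
sc.stop()
```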
Tez
• Apache Incubator project
• Fundamentally similar design principles to Spark
• Encapsulates run-time data as nodes, just like RDDs
• Key features
• In-memory data
• Shared objects – tasks (belonging to one node) can
share objects
• Very few comparative studies available
• Not many contributions from the open community
Tez
Distributed R
• Open-source project led by HP
• Similar to R-Hadoop but with some genuinely new
features like
• User-defined array partitioning
• Local transformations/functions
• Master-worker synchronization
• Not the same ingenuity yet as seen in ML-lib
• Only fundamentally scalable algorithms (online
& distributable) scale linearly
• Tall claims of 50-100x time efficiency when
used with the HP Vertica database
Sibyl
• Not open source yet, but some rumours!
• Claims to provide a GFS-based, highly scalable,
flexible infrastructure for embedding the ML
process in ETL
• Designed for supervised learning
• Focused on learning user behaviours
• YouTube video recommendations
• Spam filters
• Major design principle – columnar data
• Suitable for sparse datasets (new columns?)
• Compression techniques for columnar data are
much more efficient (structural similarity)
Columnar data- LZO Compression
• Idea 1
– Compression should be 'splittable'
– A large file can be compressed and
split into pieces the size of an HDFS block
– Each block should hold its
'decompression key'
Compression    Size (GB)    Compression Time (s)    Decompression Time (s)
None           8.0          -                       -
Gzip           1.3          241                     72
LZO            2.0          55                      35
• Idea 2
– Compress data on Hadoop (save ~3/4 of the
space)
– Save ~75% of I/O time!
– Worthwhile as long as decompression takes less than the
I/O time saved (see the estimate below)
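A back-of-the-envelope check of Idea 2 using the LZO row of the table above; the 100 MB/s sequential disk throughput is an assumed figure, not from the slides.

```python
# Reading the 2.0 GB LZO file instead of the raw 8.0 GB saves ~75% of the read I/O.
raw_gb, lzo_gb, lzo_decompression_s = 8.0, 2.0, 35.0
disk_mb_per_s = 100.0                           # assumed sequential read speed

io_saved_s = (raw_gb - lzo_gb) * 1024 / disk_mb_per_s
print(f"I/O saved ~{io_saved_s:.0f}s vs decompression cost {lzo_decompression_s:.0f}s")
# ~61s of I/O saved against 35s of decompression: a net win under these assumptions.
```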
Conclusion
• Big Data has resurrected interest in ML algorithms.
• A two-way push is leading the confluence – online & distributed
learning (scientific) and flexible workflows (architectural) to
accommodate them.
• Facilitated by compression, serialization, in-memory processing, DAG
representations, columnar databases, etc.
• The majority of man-hours goes into engineering the pipelines.
• Industry is aiming to provide high-level abstractions over standard ML
algorithms, hiding the gory details.
Learning Resources
• MOOCs (Coursera)
– Machine Learning (Stanford)
– Design & Analysis of Algorithms (Stanford)
– R Programming Language (Johns Hopkins)
– Exploratory Data Analysis (Johns Hopkins)
• Online Competitions
– Kaggle Data Science platform
• Software Resources
– Matlab
– R
– scikit-learn (Python)
– APIs – ANN, JGAP
• 2009 – "Subspace extracting Adaptive Cellular Network for layered Architectures
with circular boundaries." IEEE paper.
• 2006-07 – 1st prize, IBM's Great Mind Challenge – "Transport Management System",
a multi-TSP implementation using a Genetic Algorithm.
** tHanK YoU **