SlideShare a Scribd company logo
1 of 30
Machine Learning at Scale
Madhukara Phatak
• Madhukara Phatak
• Consult in Bigdata and
FP
• Work with Spark,
Hadoop and ecosystem
• Training on Bigdata
• @madhukaraphatak
• http://www.madhukara
phatak.com
How many of You?
• Own a Smart phone?
• Want to know when next phone coming into
market?
• Next version of existing phone coming into
market?
• Specs and prices of new phone?
– Months before phone releases
• Data from multiple sources aggregated in one
place
Rumor Engine
• A practical implementation of machine
learning to solve phone rumor problem.
• Built in 3 months
– Learning machine learning
– Learning Spark
– Idea
– Implementation
– Release
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumour Engine
Big data at work
• Worked for a BSS/OSS product company
• Big data is normal in Telecom
• CDR (call data record ) around 3TB for
companies like Airtel
• Need a solution for processing over 6 months
of data.
• Started to work around 4 years ago
Hadoop Saga
• Hadoop was default choice
• Challenge in the ecosystem in India
• Hype vs Reality
• Work
– Building ML library Nectar
– Working with companies to build hadoop
expertise and solutions
– POC’s
My Journey
• Hadoop
• Mahout and Nectar
• Javascript
• Scala and Spark
• Courseera
• MLLib
• Rumour Engine
Machine Learning in Hadoop
• Apache Mahout was the choice but its too
hard to map it to any new requirements
• Map/Reduce implementation suffered from
speed and complexity
• Accuracy of the results are often poor
• We set out to build our own and realized it
was too much of overhead even to build
simplest things
ML and Map Reduce
• M/R forgets everything once one operation is
done
• Everything has to go through HDFS , slower
because of disk over heads
• Mahout long tried to make as fast possible ,
but they kind of gave up.
• In Zinnia , we moved on with aggregation and
KPI based solutions rather than pure ML.
My Journey
• Hadoop
• Mahout and Nectar
• Javascript
• Scala and Spark
• Courseera
• MLLib
• Rumour Engine
JavaScript
• Functional programming
• Closures
• Loose typing /type inference
• Prototype inheritance
• REPL (node.js) or webtools
Search for New Language
• Statically typed (Enterprise stack)
• Runs on JVM
• Ability to use Java libraries
• Functional programming
• Type inference
• Repl
My Journey
• Hadoop
• Mahout and Nectar
• Javascript
• Scala and Spark
• Coursera
• MLLib
• Rumour Engine
Scala
• Statically typed
• Type inference
• Functional programming and OO built in
• Parallelism built in
• REPL
• Scalable language
Search for Functional Bigdata
• Pig attempted on Hadoop
• Tuple Map/Reduce
• Javascript API for Hadoop
• Why functional bigdata?
Big data platform requirement
• Immutability support
• Transformation not CRUD
• Built in laziness
• Concise API
• Type inference
Java and Hadoop (Productivity)
• No Laziness
– Every Map/Reduce operation needs to write
output to HDFS
• Java allows crud like variable assignments but
fails in distributed mode
• Type of each key/value pair has to be declared
no way to skip it
• Lots of boiler plate code for closures
Apache Spark
• Apache Spark is a framework for lightening
fast cluster computing .
• Build by AmpLabs and now Databricks.
• Competitor to M/R of Hadoop
• Runs on Hadoop 1.0 and Hadoop 2.0 yarn
• Written in scala
Spark and ML
• Built for Iterative programs Aka ML
• Support for intermediate result caching
• Support for in memory processing
• Remembers across jobs not just within job
• There is suddenly interest in Bigdata ML again
with spark as its finally possible to run fast and
accurate with spark
• Mahout is moving on to Spark
My Journey
• Hadoop
• Mahout and Nectar
• Javascript
• Scala and Spark
• Coursera
• MLLib
• Rumour Engine
Learning Machine learning
• Coursera
• Example in octave
• Porting examples from octave to Spark
• https://github.com/zinniasystems/spark-ml-
class
• Uses
– MLLib
– JBlas
– Breeze
My Journey
• Hadoop
• Mahout and Nectar
• Javascript
• Scala and Spark
• Courseera
• MLLib
• Rumour Engine
MLLib
• Standard Spark library for Machine learning
• Built into spark
• Very small code base – 1200 line of scala code
• 40x – 100x faster than Mahout
• Supports
– Linear and Logistic regression
– SVM
– Recommender systems
Mahout vs MLLib
• Mahout has more algorithms than MLLib
• MLLib has less code than MLLib (1200 lines
scala vs >20,000 lines of java code
• Much improved performance and accuracy
• Mahout recognizes it , moving to spark
backend for next release
My Journey
• Hadoop
• Mahout and Nectar
• Javascript
• Scala and Spark
• Courseera
• MLLib
• Rumour Engine
Rumor Engine
• Crawls blog data
• As of 12 blogs everyday, more to add in future
• Naïve Bayes to classify
• Uses single node spark for prediction
• MLLib
• Has <200 lines of actual application scala
code.
ML-Scale challenges
• Choosing an algorithm
• Accuracy of algorithm implementation
• Modeling when data is noisy and big
• Faster sampling
• Real time processing
• Accuracy vs Performance
ML-People challenges
• Hard to find Data scientists
• Unique combination of skills – Programming
at scale and Maths.
• Mathematical reasoning and practicality of
implementation.
Thank you

More Related Content

What's hot

Rolling With Riak
Rolling With RiakRolling With Riak
Rolling With RiakJohn Lynch
 
Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Michelle Casbon
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Databricks
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Miguel Pastor
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014gmalouf678
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatternsAnurag S
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Scaling with Riak at Showyou
Scaling with Riak at ShowyouScaling with Riak at Showyou
Scaling with Riak at ShowyouJohn Muellerleile
 
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Databricks
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
How Apache Spark Changed the Way We Hire People with Tomasz Magdanski
How Apache Spark Changed the Way We Hire People with Tomasz MagdanskiHow Apache Spark Changed the Way We Hire People with Tomasz Magdanski
How Apache Spark Changed the Way We Hire People with Tomasz MagdanskiDatabricks
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitDatabricks
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsXiao Li
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaSpark Summit
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
Liferay and Big Data
Liferay and Big DataLiferay and Big Data
Liferay and Big DataMiguel Pastor
 

What's hot (20)

Rolling With Riak
Rolling With RiakRolling With Riak
Rolling With Riak
 
Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
MLeap: Release Spark ML Pipelines
MLeap: Release Spark ML PipelinesMLeap: Release Spark ML Pipelines
MLeap: Release Spark ML Pipelines
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatterns
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Scaling with Riak at Showyou
Scaling with Riak at ShowyouScaling with Riak at Showyou
Scaling with Riak at Showyou
 
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries...
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
How Apache Spark Changed the Way We Hire People with Tomasz Magdanski
How Apache Spark Changed the Way We Hire People with Tomasz MagdanskiHow Apache Spark Changed the Way We Hire People with Tomasz Magdanski
How Apache Spark Changed the Way We Hire People with Tomasz Magdanski
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal Malohlava
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
Liferay and Big Data
Liferay and Big DataLiferay and Big Data
Liferay and Big Data
 

Viewers also liked

First-passage percolation on random planar maps
First-passage percolation on random planar mapsFirst-passage percolation on random planar maps
First-passage percolation on random planar mapsTimothy Budd
 
mtc All Hands 8/15 Werte
mtc All Hands 8/15 Wertemtc All Hands 8/15 Werte
mtc All Hands 8/15 WerteArne Krueger
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
Paper Review: An exact mapping between the Variational Renormalization Group ...
Paper Review: An exact mapping between the Variational Renormalization Group ...Paper Review: An exact mapping between the Variational Renormalization Group ...
Paper Review: An exact mapping between the Variational Renormalization Group ...Kai-Wen Zhao
 
Artificial intelligence 2015: Quo Vadis?
Artificial intelligence 2015: Quo Vadis?Artificial intelligence 2015: Quo Vadis?
Artificial intelligence 2015: Quo Vadis?Sergey Shelpuk
 
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...Shu Tanaka
 
Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Daniel Berman
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestKrishna Gade
 
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...Shu Tanaka
 
Percolation
PercolationPercolation
PercolationESUG
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Saurabh Kaushik
 
Predictive analytics in mobility
Predictive analytics in mobilityPredictive analytics in mobility
Predictive analytics in mobilityEktimo
 
BigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" IntroductionBigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" IntroductionIvan Gruer
 
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekLogging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekvivekrajan
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyItamar
 

Viewers also liked (20)

Logging in moodle
Logging in moodleLogging in moodle
Logging in moodle
 
Percolation Model and Controllability
Percolation Model and ControllabilityPercolation Model and Controllability
Percolation Model and Controllability
 
First-passage percolation on random planar maps
First-passage percolation on random planar mapsFirst-passage percolation on random planar maps
First-passage percolation on random planar maps
 
mtc All Hands 8/15 Werte
mtc All Hands 8/15 Wertemtc All Hands 8/15 Werte
mtc All Hands 8/15 Werte
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 
Percolation
PercolationPercolation
Percolation
 
Paper Review: An exact mapping between the Variational Renormalization Group ...
Paper Review: An exact mapping between the Variational Renormalization Group ...Paper Review: An exact mapping between the Variational Renormalization Group ...
Paper Review: An exact mapping between the Variational Renormalization Group ...
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Artificial intelligence 2015: Quo Vadis?
Artificial intelligence 2015: Quo Vadis?Artificial intelligence 2015: Quo Vadis?
Artificial intelligence 2015: Quo Vadis?
 
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...
 
Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...
 
Percolation
PercolationPercolation
Percolation
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Predictive analytics in mobility
Predictive analytics in mobilityPredictive analytics in mobility
Predictive analytics in mobility
 
BigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" IntroductionBigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" Introduction
 
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekLogging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
 
Deep Learning
Deep Learning Deep Learning
Deep Learning
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easy
 

Similar to Machine Learning at Scale

Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Gravy Analytics
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoBitly
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataJohn Nestor
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topicsValentin Kropov
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDatabricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
 

Similar to Machine Learning at Scale (20)

Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Machine learninginspark
Machine learninginsparkMachine learninginspark
Machine learninginspark
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Apache drill
Apache drillApache drill
Apache drill
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 

Recently uploaded

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 

Machine Learning at Scale

  • 1. Machine Learning at Scale Madhukara Phatak
  • 2. • Madhukara Phatak • Consult in Bigdata and FP • Work with Spark, Hadoop and ecosystem • Training on Bigdata • @madhukaraphatak • http://www.madhukara phatak.com
  • 3. How many of You? • Own a Smart phone? • Want to know when next phone coming into market? • Next version of existing phone coming into market? • Specs and prices of new phone? – Months before phone releases • Data from multiple sources aggregated in one place
  • 4. Rumor Engine • A practical implementation of machine learning to solve phone rumor problem. • Built in 3 months – Learning machine learning – Learning Spark – Idea – Implementation – Release
  • 5. My Journey • Hadoop • Mahout and Nectar • JavaScript • Scala and Spark • Coursera • MLLib • Rumour Engine
  • 6. Big data at work • Worked for a BSS/OSS product company • Big data is normal in Telecom • CDR (call data record ) around 3TB for companies like Airtel • Need a solution for processing over 6 months of data. • Started to work around 4 years ago
  • 7. Hadoop Saga • Hadoop was default choice • Challenge in the ecosystem in India • Hype vs Reality • Work – Building ML library Nectar – Working with companies to build hadoop expertise and solutions – POC’s
  • 8. My Journey • Hadoop • Mahout and Nectar • Javascript • Scala and Spark • Courseera • MLLib • Rumour Engine
  • 9. Machine Learning in Hadoop • Apache Mahout was the choice but its too hard to map it to any new requirements • Map/Reduce implementation suffered from speed and complexity • Accuracy of the results are often poor • We set out to build our own and realized it was too much of overhead even to build simplest things
  • 10. ML and Map Reduce • M/R forgets everything once one operation is done • Everything has to go through HDFS , slower because of disk over heads • Mahout long tried to make as fast possible , but they kind of gave up. • In Zinnia , we moved on with aggregation and KPI based solutions rather than pure ML.
  • 11. My Journey • Hadoop • Mahout and Nectar • Javascript • Scala and Spark • Courseera • MLLib • Rumour Engine
  • 12. JavaScript • Functional programming • Closures • Loose typing /type inference • Prototype inheritance • REPL (node.js) or webtools
  • 13. Search for New Language • Statically typed (Enterprise stack) • Runs on JVM • Ability to use Java libraries • Functional programming • Type inference • Repl
  • 14. My Journey • Hadoop • Mahout and Nectar • Javascript • Scala and Spark • Coursera • MLLib • Rumour Engine
  • 15. Scala • Statically typed • Type inference • Functional programming and OO built in • Parallelism built in • REPL • Scalable language
  • 16. Search for Functional Bigdata • Pig attempted on Hadoop • Tuple Map/Reduce • Javascript API for Hadoop • Why functional bigdata?
  • 17. Big data platform requirement • Immutability support • Transformation not CRUD • Built in laziness • Concise API • Type inference
  • 18. Java and Hadoop (Productivity) • No Laziness – Every Map/Reduce operation needs to write output to HDFS • Java allows crud like variable assignments but fails in distributed mode • Type of each key/value pair has to be declared no way to skip it • Lots of boiler plate code for closures
  • 19. Apache Spark • Apache Spark is a framework for lightening fast cluster computing . • Build by AmpLabs and now Databricks. • Competitor to M/R of Hadoop • Runs on Hadoop 1.0 and Hadoop 2.0 yarn • Written in scala
  • 20. Spark and ML • Built for Iterative programs Aka ML • Support for intermediate result caching • Support for in memory processing • Remembers across jobs not just within job • There is suddenly interest in Bigdata ML again with spark as its finally possible to run fast and accurate with spark • Mahout is moving on to Spark
  • 21. My Journey • Hadoop • Mahout and Nectar • Javascript • Scala and Spark • Coursera • MLLib • Rumour Engine
  • 22. Learning Machine learning • Coursera • Example in octave • Porting examples from octave to Spark • https://github.com/zinniasystems/spark-ml- class • Uses – MLLib – JBlas – Breeze
  • 23. My Journey • Hadoop • Mahout and Nectar • Javascript • Scala and Spark • Courseera • MLLib • Rumour Engine
  • 24. MLLib • Standard Spark library for Machine learning • Built into spark • Very small code base – 1200 line of scala code • 40x – 100x faster than Mahout • Supports – Linear and Logistic regression – SVM – Recommender systems
  • 25. Mahout vs MLLib • Mahout has more algorithms than MLLib • MLLib has less code than MLLib (1200 lines scala vs >20,000 lines of java code • Much improved performance and accuracy • Mahout recognizes it , moving to spark backend for next release
  • 26. My Journey • Hadoop • Mahout and Nectar • Javascript • Scala and Spark • Courseera • MLLib • Rumour Engine
  • 27. Rumor Engine • Crawls blog data • As of 12 blogs everyday, more to add in future • Naïve Bayes to classify • Uses single node spark for prediction • MLLib • Has <200 lines of actual application scala code.
  • 28. ML-Scale challenges • Choosing an algorithm • Accuracy of algorithm implementation • Modeling when data is noisy and big • Faster sampling • Real time processing • Accuracy vs Performance
  • 29. ML-People challenges • Hard to find Data scientists • Unique combination of skills – Programming at scale and Maths. • Mathematical reasoning and practicality of implementation.