Machine Learning at Scale
Madhukara Phatak
• Madhukara Phatak
• Consultant in big data and functional programming (FP)
• Works with Spark, Hadoop and their ecosystem
• Training on big data
• @madhukaraphatak
• http://www.madhukaraphatak.com
How many of you?
• Own a smartphone?
• Want to know when the next phone is coming to market?
• Or when the next version of an existing phone is coming to market?
• Want the specs and prices of a new phone?
– Months before the phone is released
• All with data from multiple sources aggregated in one place
Rumor Engine
• A practical implementation of machine learning to solve the phone rumor problem.
• Built in 3 months
– Learning machine learning
– Learning Spark
– Idea
– Implementation
– Release
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
Big data at work
• Worked for a BSS/OSS product company
• Big data is normal in Telecom
• CDRs (call data records) run to around 3 TB for companies like Airtel
• Needed a solution for processing more than 6 months of data
• Started this work around 4 years ago
Hadoop Saga
• Hadoop was the default choice
• Challenges in the ecosystem in India
• Hype vs reality
• Work
– Building the ML library Nectar
– Working with companies to build Hadoop expertise and solutions
– POCs
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
Machine Learning in Hadoop
• Apache Mahout was the choice, but it was too hard to adapt to any new requirement
• The Map/Reduce implementation suffered from low speed and high complexity
• Accuracy of the results was often poor
• We set out to build our own and realized it was too much overhead to build even the simplest things
ML and Map Reduce
• M/R forgets everything once an operation is done
• Everything has to go through HDFS, which is slower because of disk overheads
• Mahout tried for a long time to make this as fast as possible, but more or less gave up
• At Zinnia, we moved on to aggregation and KPI-based solutions rather than pure ML
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
JavaScript
• Functional programming
• Closures
• Loose typing / type inference
• Prototype inheritance
• REPL (Node.js) or web tools
Search for New Language
• Statically typed (Enterprise stack)
• Runs on JVM
• Ability to use Java libraries
• Functional programming
• Type inference
• REPL
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
Scala
• Statically typed
• Type inference
• Functional programming and OO built in
• Parallelism built in
• REPL
• Scalable language
Search for Functional Big Data
• Pig attempted on Hadoop
• Tuple Map/Reduce
• JavaScript API for Hadoop
• Why functional big data?
Big data platform requirement
• Immutability support
• Transformation not CRUD
• Built-in laziness
• Concise API
• Type inference
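A minimal sketch of the first three requirements (immutability, transformation instead of in-place update, laziness), written in plain Python rather than any big data framework. The `Call` record, its fields and the helper names are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Call(NamedTuple):
    """Hypothetical immutable call record; the fields are illustrative."""
    caller: str
    seconds: int


evaluated = []  # records which rows were actually touched, to make laziness visible


def long_calls(calls: Iterable[Call], min_seconds: int) -> Iterator[Call]:
    # A transformation: yields a filtered view instead of updating rows in place.
    for call in calls:
        evaluated.append(call.caller)
        if call.seconds >= min_seconds:
            yield call


def to_minutes(calls: Iterable[Call]) -> Iterator[tuple]:
    # Another transformation: builds new values and leaves its input untouched.
    for call in calls:
        yield (call.caller, call.seconds / 60)


records = (Call("a", 30), Call("b", 600), Call("c", 1200))

pipeline = to_minutes(long_calls(records, min_seconds=300))  # nothing runs yet
rows_before = len(evaluated)  # still 0: the pipeline is only a description
first = next(pipeline)        # pulls just enough input to produce one result
rows_after = len(evaluated)   # 2: row "c" has not been read
```

Chaining the two generators only describes the work. Rows are read when a result is requested, and the original `records` tuple is never modified.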
Java and Hadoop (Productivity)
• No laziness
– Every Map/Reduce operation needs to write its output to HDFS
• Java allows CRUD-like variable assignments, but they fail in distributed mode
• The type of each key/value pair has to be declared; there is no way to skip it
• Lots of boilerplate code for closures
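One way to read the point about variable assignments failing in distributed mode is sketched below in plain Python. It assumes that each worker receives its own copy of any state shipped with a task, which is simulated here with `copy.deepcopy`; the `Tally` class and `run_on_worker` function are hypothetical names, not part of Hadoop.

```python
import copy


class Tally:
    """Hypothetical mutable counter that a driver program might try to update."""

    def __init__(self):
        self.value = 0


def run_on_worker(shipped_state, partition):
    # Assumption: a worker operates on its own copy of the shipped state.
    local = copy.deepcopy(shipped_state)
    for _ in partition:
        local.value += 1  # an in-place update, visible only on this worker
    return local.value


driver_tally = Tally()
partitions = [[1, 2, 3], [4, 5]]

worker_totals = [run_on_worker(driver_tally, p) for p in partitions]

# The driver's object was never updated; only returned values carry results back.
total = sum(worker_totals)
```

The in-place updates are lost on the driver side, which is why a transformation that returns new values fits distributed execution better than assignment to shared variables.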
Apache Spark
• Apache Spark is a framework for lightning-fast cluster computing
• Built at AMPLab, now backed by Databricks
• Competitor to Hadoop's M/R
• Runs on Hadoop 1.0 and on Hadoop 2.0 (YARN)
• Written in Scala
Spark and ML
• Built for iterative programs, a.k.a. ML
• Support for caching intermediate results
• Support for in-memory processing
• Remembers data across jobs, not just within a job
• There is suddenly interest in Bigdata ML again, as Spark finally makes it possible to run fast and accurately
• Mahout is moving to Spark
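Why caching matters for iterative programs can be sketched without Spark. The plain-Python example below is an illustration under stated assumptions, not Spark's implementation: the `Dataset` class is a hypothetical stand-in for data on disk, and `reads` counts simulated full scans. Both loops run the same gradient descent for the model y = w * x.

```python
class Dataset:
    """Stand-in for data on disk; `reads` counts simulated full scans."""

    def __init__(self, points):
        self._points = list(points)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._points)


def gradient_step(points, w, lr):
    # One gradient-descent step for the model y = w * x with squared error.
    grad = sum((w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad


def fit_reloading(dataset, steps, lr=0.1):
    # Map/Reduce style: each iteration is a separate job that re-reads its input.
    w = 0.0
    for _ in range(steps):
        w = gradient_step(dataset.load(), w, lr)
    return w


def fit_cached(dataset, steps, lr=0.1):
    # Cached style: read once, keep the data in memory, reuse it every iteration.
    points = dataset.load()
    w = 0.0
    for _ in range(steps):
        w = gradient_step(points, w, lr)
    return w


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated from y = 2x
reloading, cached = Dataset(points), Dataset(points)
w_reloading = fit_reloading(reloading, steps=50)
w_cached = fit_cached(cached, steps=50)
```

Both versions arrive at the same weight, but the reloading loop scans the data once per iteration while the cached loop scans it once in total.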
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
Learning Machine learning
• Coursera
• Examples in Octave
• Porting the examples from Octave to Spark
• https://github.com/zinniasystems/spark-ml-class
• Uses
– MLLib
– JBlas
– Breeze
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
MLLib
• Standard Spark library for machine learning
• Built into Spark
• Very small code base: 1,200 lines of Scala code
• 40x to 100x faster than Mahout
• Supports
– Linear and Logistic regression
– SVM
– Recommender systems
Mahout vs MLLib
• Mahout has more algorithms than MLLib
• MLLib has less code than Mahout (1,200 lines of Scala vs >20,000 lines of Java)
• Much improved performance and accuracy
• Mahout recognizes this and is moving to a Spark backend for its next release
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
Rumor Engine
• Crawls blog data
• Currently crawls 12 blogs every day, with more to be added in future
• Naïve Bayes for classification
• Uses single-node Spark for prediction
• MLLib
• Has <200 lines of actual application code in Scala
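The classification step can be illustrated with a small multinomial Naïve Bayes written in plain Python. This is a sketch only and not the Rumor Engine's code, which the slides say uses MLLib. The training snippets, the labels `rumor` and `other`, and the function names are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: iterable of (text, label). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Log prior plus summed log likelihoods, with add-one (Laplace) smoothing.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training snippets; the labels are hypothetical.
training = [
    ("leaked specs show new phone launching next month", "rumor"),
    ("sources say next phone price leaked", "rumor"),
    ("how to change your wallpaper", "other"),
    ("best apps for editing photos", "other"),
]
model = train(training)
prediction = classify(model, "new phone specs leaked")
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out a class.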
ML-Scale challenges
• Choosing an algorithm
• Accuracy of algorithm implementation
• Modeling when data is noisy and big
• Faster sampling
• Real time processing
• Accuracy vs Performance
ML-People challenges
• Hard to find data scientists
• Unique combination of skills: programming at scale and maths
• Mathematical reasoning and practicality of implementation
Thank you
