
Machine learning in Spark



Published in: Technology, Education


  1. 1. Machine Learning at Scale Madhukara Phatak Zinnia Systems @madhukaraphatak
  2. 2. Agenda • Zinnia and Big data • Hadoop Saga • Machine learning – State of the Art • Scale challenges • People challenges • Machine learning at Zinnia • Case studies • Demo
  3. 3. Zinnia and Big data • BSS/OSS product company • Big data is normal in telecom • CDRs (call data records) run to around 3TB for companies like Airtel • Needed a solution for processing over 6 months of data • Started work around 3 years ago
  4. 4. Hadoop Saga • Hadoop was the default choice • Challenges in the ecosystem in India • Hype vs Reality • Work – Building the ML library Nectar – Working with companies to build Hadoop expertise and solutions – POCs
  5. 5. Machine Learning in Hadoop • Apache Mahout was the choice, but it is too hard to adapt to any new requirement • The Map/Reduce implementation suffered from speed and complexity problems • Accuracy of the results was often poor • We set out to build our own and realized it was too much overhead to build even the simplest things
  6. 6. ML and Map Reduce • M/R forgets everything once an operation is done • Everything has to go through HDFS, which is slower because of disk overheads • Mahout long tried to make it as fast as possible, but has more or less given up • At Zinnia, we moved on to aggregation and KPI-based solutions rather than pure ML.
  7. 7. Apache Spark • Apache Spark is a framework for lightning-fast cluster computing. • Built at AMPLab and now backed by Databricks. • Runs on Hadoop 2.0 • Built for iterative algorithms, i.e. ML • Interest in big data ML has suddenly revived with Spark, as it is finally possible to get results that are both fast and accurate • Mahout is moving to Spark
  8. 8. MLlib • Standard Spark library for machine learning • Built into Spark • Very small code base – 1200 lines of Scala code • 40x – 100x faster than Mahout • Supports – Linear and logistic regression – SVM – Recommender systems
  9. 9. ML-Scale challenges • Choosing an algorithm • Accuracy of algorithm implementation • Modeling when data is noisy and big • Faster sampling • Real time processing • Accuracy vs Performance
  10. 10. ML-People challenges • Hard to find data scientists • Unique combination of skills – Programming at scale and maths. • Mathematical reasoning and practicality of implementation.
  11. 11. Machine learning at Zinnia Systems • 4-person team • We work on public data and use ML algorithms to extract interesting insights. • We work on the following – Predictive modeling – Text analysis – Recommender systems – Classification systems
  12. 12. Case study – Movie Twitter sentiment analysis • Everyone likes movies and wants to catch a good movie every week. • There are too many critic reviews, so it is difficult to say whom to trust. • Can we find out what the real audience thinks about a movie, so that we can make the right choice?
  13. 13. Movie Twitter sentiment analysis • We build a model with Naïve Bayes using labeled public tweets. • We collect tweets about movies every day and run them through the model to get predictions. • We aggregate these predictions to give our Twitter score. • On par with the IMDb score. • Demo
  14. 14. Movie Recommendation System • Want to explore older movies based on your current likes? • We pull data from FB on the movies you and your friends like, and recommend movies from our 17000-movie collection. • Model built using public Netflix data • Demo
  15. 15. Kick start in ML • ml-class • view
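
The pipeline on slide 13 (train Naïve Bayes on labeled tweets, predict on new tweets, aggregate into a score) can be sketched roughly as below. This is a minimal, single-machine illustration in plain Python, not the Zinnia implementation and not MLlib code. The whitespace tokenizer, the "pos"/"neg" labels, the 0-10 scale, and the sample tweets are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumed; real tweets need more cleaning)."""
    return text.lower().split()


def train_naive_bayes(labeled_tweets):
    """Collect per-label word counts for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training tweets
    vocabulary = set()
    for text, label in labeled_tweets:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior, using Laplace smoothing."""
    word_counts, label_counts, vocabulary = model
    total_tweets = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: share of training tweets carrying this label.
        score = math.log(label_counts[label] / total_tweets)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def twitter_score(model, tweets):
    """Share of tweets predicted positive, scaled to 0-10 (assumed scale)."""
    if not tweets:
        return None
    positive = sum(1 for tweet in tweets if predict(model, tweet) == "pos")
    return 10.0 * positive / len(tweets)


# Hypothetical labeled tweets, invented for this sketch.
training = [
    ("loved this movie great acting", "pos"),
    ("great fun brilliant story", "pos"),
    ("boring movie terrible plot", "neg"),
    ("terrible acting waste of time", "neg"),
]
model = train_naive_bayes(training)

# Hypothetical tweets collected for one movie on one day.
new_tweets = ["great story loved it", "boring and terrible", "brilliant acting"]
score = twitter_score(model, new_tweets)
print(f"Twitter score: {score:.1f} / 10")
```

Averaging per-tweet predictions is only one possible aggregation; the slides do not say how the scores are combined, so treat the scoring function as a placeholder.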