Machine learning in Spark

  1. Machine Learning at Scale
     Madhukara Phatak, Zinnia Systems, @madhukaraphatak
  2. Agenda
     • Zinnia and Big data
     • Hadoop Saga
     • Machine learning – State of the Art
     • Scale Challenges
     • People challenges
     • Machine learning at Zinnia
     • Case studies
     • Demo
  3. Zinnia and Big data
     • BSS/OSS product company
     • Big data is normal in telecom
     • CDRs (call data records) run to around 3 TB for companies like Airtel
     • Needed a solution for processing over 6 months of data
     • Started work around 3 years ago
  4. Hadoop Saga
     • Hadoop was the default choice
     • Challenges in the ecosystem in India
     • Hype vs reality
     • Work
       – Building the ML library Nectar
       – Working with companies to build Hadoop expertise and solutions
       – POCs
  5. Machine Learning in Hadoop
     • Apache Mahout was the obvious choice, but it is too hard to adapt to any new requirement
     • The Map/Reduce implementation suffered from poor speed and high complexity
     • Accuracy of the results was often poor
     • We set out to build our own and realized it was too much overhead to build even the simplest things
  6. ML and Map Reduce
     • M/R forgets everything once an operation is done
     • Everything has to go through HDFS, which is slower because of disk overhead
     • Mahout long tried to make this as fast as possible, but has more or less given up
     • At Zinnia, we moved on to aggregation and KPI-based solutions rather than pure ML
  7. Apache Spark
     • Apache Spark is a framework for lightning-fast cluster computing
     • Built by AMPLab, now backed by Databricks
     • Runs on Hadoop 2.0
     • Built for iterative algorithms such as ML
     • Interest in big data ML has suddenly returned, because with Spark it is finally possible to get fast and accurate results
     • Mahout is moving to Spark
  8. MLlib
     • Standard Spark library for machine learning
     • Built into Spark
     • Very small code base – 1200 lines of Scala code
     • 40x – 100x faster than Mahout
     • Supports
       – Linear and logistic regression
       – SVM
       – Recommender systems
  9. ML – Scale challenges
     • Choosing an algorithm
     • Accuracy of algorithm implementation
     • Modeling when data is noisy and big
     • Faster sampling
     • Real time processing
     • Accuracy vs Performance
  10. ML – People challenges
      • Hard to find data scientists
      • Unique combination of skills – programming at scale and maths
      • Mathematical reasoning and practicality of implementation
  11. Machine learning at Zinnia Systems
      • 4-person team
      • We work on public data and use ML algorithms to extract interesting insights
      • We work on the following
        – Predictive modeling
        – Text analysis
        – Recommender systems
        – Classification systems
  12. Case study – Movie Twitter sentiment analysis
      • Everyone likes movies and wants to catch a good movie every week
      • There are too many critic reviews, so it is difficult to say whom to trust
      • Can we find out what the real audience thinks about a movie, so that we can make the right choice?
  13. Movie Twitter sentiment analysis
      • We build a Naïve Bayes model from labeled public tweets
      • We collect tweets about movies every day and run them through the model to get predictions
      • We aggregate these predictions to give our Twitter score
      • On par with the IMDb score
      • Demo
  14. Movie Recommendation System
      • Want to explore older movies based on your current likes?
      • We pull movie likes for you and your friends from Facebook, and recommend movies from our collection of 17000 movies
      • Model built using public Netflix data
      • Demo
  15. Kick start in ML
      • https://www.coursera.org/course/ml
      • https://github.com/zinniasystems/spark-ml-class
      • https://class.coursera.org/nlp/lecture/preview
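
The pipeline on slide 13 (train Naïve Bayes on labeled tweets, predict on new tweets, aggregate into a score) can be sketched roughly as below. This is a minimal plain-Python illustration, not the Zinnia implementation and not Spark MLlib code. The whitespace tokenizer, the add-one smoothing, the toy tweets, and the "fraction of positive tweets" aggregation are all assumptions made for the example; the slides do not say how the scores are combined.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumed; real tweets need more cleaning)."""
    return text.lower().split()


def train_naive_bayes(labeled_tweets):
    """Count tweets per label and words per label; also collect the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_tweets:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_tweets = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: share of training tweets carrying this label.
        score = math.log(label_counts[label] / total_tweets)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def twitter_score(model, tweets):
    """Assumed aggregation: fraction of tweets predicted as positive."""
    if not tweets:
        raise ValueError("need at least one tweet to compute a score")
    positive = sum(1 for tweet in tweets if predict(model, tweet) == "pos")
    return positive / len(tweets)


# Hypothetical toy training set; a real model would use many labeled public tweets.
TRAINING = [
    ("loved the movie great acting", "pos"),
    ("brilliant story and great music", "pos"),
    ("boring plot terrible acting", "neg"),
    ("waste of time terrible movie", "neg"),
]
MODEL = train_naive_bayes(TRAINING)
```

Calling `twitter_score(MODEL, tweets)` on one day's tweets about a single movie gives a value between 0 and 1. At the scale the deck describes, the same counting steps would be distributed over a cluster rather than run in one process.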
