Agenda
• Zinnia and Big data
• Hadoop Saga
• Machine learning – State of the Art
• Scale Challenges
• People challenges
• Machine learning at Zinnia
• Case studies
• Demo
Zinnia and Big data
• BSS/OSS product company
• Big data is normal in Telecom
• CDRs (call data records): around 3 TB for
companies like Airtel
• Needed a solution for processing over 6
months of data
• Started work around 3 years ago
Hadoop Saga
• Hadoop was default choice
• Challenges in the ecosystem in India
• Hype vs Reality
• Work
– Building ML library Nectar
– Working with companies to build Hadoop
expertise and solutions
– POCs
Machine Learning in Hadoop
• Apache Mahout was the choice, but it is
too hard to map it to any new requirements
• Map/Reduce implementations suffered
from slow speed and complexity
• Accuracy of the results was often poor
• We set out to build our own and realized
it was too much overhead to build even
the simplest things
ML and Map Reduce
• M/R forgets everything once an
operation is done
• Everything has to go through HDFS,
which is slower because of disk overhead
• Mahout long tried to make it as fast as
possible, but they have more or less given up
• At Zinnia, we moved on to
aggregation and KPI-based solutions
rather than pure ML.
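The cost of "forgetting everything" between operations can be sketched in a few lines. This is a toy illustration, not Hadoop or Mahout code: a local temp file stands in for HDFS, and the names `CountingStore`, `gradient_step`, `run_mapreduce_style` and `run_cached_style` are hypothetical. It only counts how often an iterative job has to go back to disk.

```python
import json
import os
import tempfile


class CountingStore:
    """Stand-in for HDFS (assumption): a temp file that counts disk reads."""

    def __init__(self, values):
        fd, self.path = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(values, f)
        self.reads = 0

    def load(self):
        # Every call is a full read of the dataset from disk.
        self.reads += 1
        with open(self.path) as f:
            return json.load(f)

    def close(self):
        os.remove(self.path)


def gradient_step(data, w, lr=0.1):
    # One full pass over the data: gradient of 0.5 * mean((w - x)^2).
    grad = sum(w - x for x in data) / len(data)
    return w - lr * grad


def run_mapreduce_style(store, iterations):
    # Each iteration is an independent job: nothing stays in memory,
    # so the input is read back from disk every time.
    w = 0.0
    for _ in range(iterations):
        data = store.load()
        w = gradient_step(data, w)
    return w


def run_cached_style(store, iterations):
    # The dataset is loaded once and reused across all iterations.
    data = store.load()
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(data, w)
    return w
```

Both functions compute the same result; the difference is that the first performs one disk read per iteration while the second performs a single read in total, which is the property iterative ML algorithms benefit from.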
Apache Spark
• Apache Spark is a framework for
lightning-fast cluster computing.
• Built by AMPLab, now backed by Databricks.
• Runs on Hadoop 2.0
• Built for iterative algorithms, i.e. ML
• There is suddenly interest in big data ML
again, as with Spark it is finally possible to
run fast and accurately
• Mahout is moving on to Spark
MLlib
• Standard Spark library for Machine
learning
• Built into Spark
• Very small code base: 1200 lines of Scala
code
• 40x – 100x faster than Mahout
• Supports
– Linear and Logistic regression
– SVM
– Recommender systems
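To make the "iterative algorithm" point concrete, here is a minimal sketch of logistic regression trained with batch gradient descent. It is plain Python for illustration only and is not MLlib's implementation or API; the function names, learning rate and iteration count are assumptions.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(points, labels, iterations=500, lr=0.5):
    """Batch gradient descent on the mean log-loss.

    points: list of feature lists; labels: list of 0/1 values.
    """
    n = len(points)
    n_features = len(points[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # Every iteration needs a full pass over the training data.
        for x, y in zip(points, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0
```

For example, `train_logistic_regression([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])` returns a positive weight, so `predict` labels negative inputs 0 and positive inputs 1. The inner loop over all points is the part a cluster framework would distribute.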
ML-Scale challenges
• Choosing an algorithm
• Accuracy of algorithm implementation
• Modeling when data is noisy and big
• Faster sampling
• Real time processing
• Accuracy vs Performance
ML-People challenges
• Hard to find Data scientists
• Unique combination of skills:
programming at scale and maths.
• Mathematical reasoning and
practicality of implementation.
Machine learning at Zinnia Systems
• 4-person team
• We work on public data and use ML
algorithms to extract interesting insights.
• We work on the following:
– Predictive modeling
– Text analysis
– Recommender systems
– Classification systems
Case study: Movie Twitter sentiment
analysis
• Everyone likes movies and wants to catch
a good movie every week.
• There are too many critic reviews, so it is
difficult to know whom to trust.
• Can we find out what the real audience thinks
about a movie, so that we can make
the right choice?
Movie Twitter sentiment analysis
• We build a model with Naïve Bayes using
labeled public tweets.
• We collect tweets about movies every day
and run them through the model to make
predictions.
• We aggregate these predictions to give our
Twitter score.
• On par with the IMDb score.
• Demo
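The pipeline on this slide can be sketched as follows. This is a minimal illustration, not the production model: it assumes a multinomial Naïve Bayes with add-one smoothing, the labels "pos"/"neg", and a 0-10 score defined as the share of positive tweets. All of these, along with the names `tokenize`, `NaiveBayes` and `twitter_score`, are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep alphabetic tokens only.
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def twitter_score(model, tweets):
    """Share of tweets predicted positive, scaled to 0-10 (assumed scale)."""
    if not tweets:
        return None
    positive = sum(1 for t in tweets if model.predict(t) == "pos")
    return 10.0 * positive / len(tweets)
```

A model is trained once with `NaiveBayes().fit(texts, labels)` on labeled tweets; each day's tweets for a movie are then passed to `twitter_score` to get a single aggregate number.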
Movie Recommendation System
• Want to explore older movies based on
your current likes?
• We pull movie likes for you and
your friends from FB, and
recommend movies out of our 17,000-movie
collection.
• Model built using public Netflix data
• Demo
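The slide does not say which algorithm the recommender uses. As one possible approach, the sketch below uses item-based collaborative filtering with cosine similarity: movies are compared by the ratings they received, and unseen movies are ranked by similarity to the ones a user likes. The function names and the data layout are hypothetical.

```python
import math
from collections import defaultdict


def item_vectors(ratings):
    """Turn {user: {movie: rating}} into {movie: {user: rating}}."""
    vectors = defaultdict(dict)
    for user, movies in ratings.items():
        for movie, rating in movies.items():
            vectors[movie][user] = rating
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse rating vectors.
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, liked, top_n=3):
    """Rank movies not in `liked` by summed similarity to the liked ones."""
    vectors = item_vectors(ratings)
    scores = defaultdict(float)
    for movie in liked:
        if movie not in vectors:
            continue  # no rating data for this liked movie
        for candidate, vec in vectors.items():
            if candidate in liked:
                continue
            scores[candidate] += cosine(vectors[movie], vec)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [movie for movie, score in ranked[:top_n] if score > 0]
```

Here `ratings` would come from a public ratings dataset and `liked` from the movies a user and their friends like; both are passed in as plain Python collections in this sketch.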
Kick start in ML
• https://www.coursera.org/course/ml
• https://github.com/zinniasystems/spark-ml-class
• https://class.coursera.org/nlp/lecture/preview