© All Rights Reserved

- 1. Apache Mahout: Scalable Machine Learning Library Anastasiia Kornilova
- 2. What is Machine Learning? “Machine learning - branch of artificial intelligence, concerns the construction and study of systems that can learn from data”
- 3. Typical Use Cases ● Recommend products/friends … ● Classify content into predefined groups ● Computer vision ● Sentiment analysis/opinion mining ● Find patterns in user behavior/actions ● Identify key topics/summarize text ● Detect anomalies/fraud ● Rank search results ● Speech and handwriting recognition ● Natural language processing
- 4. ML Algorithms (subset): ● Supervised learning – Linear regression – Logistic regression – Support Vector Machines – Random Forests ● Unsupervised learning – Clustering – Blind signal separation – Hidden Markov models ● Semi-supervised learning
- 5. Many ML libraries, frameworks and tools: ● Weka ● Python Scikit ● Pylearn/Pylearn2 ● Theano ● Orange ● SSBrain :) ● More can be found here: http://mloss.org/software/
- 6. Typical Workflow ● Get data ● Prepare data ● Choose algorithm(s) ● Run your algorithm(s) ● Validate results
- 7. Every ML algorithm deals with: 1. Data 2. Computation over this data
- 8. Scalability strategies: ● “Bigger” computer ● More cores ● GPU computing ● Parallel computing, MapReduce
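
The MapReduce strategy on slide 8 can be illustrated in a single process. This is only a minimal sketch of the programming model, not Hadoop or Mahout code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (item, 1) pair for a (user, item, rating) record."""
    _user, item, _rating = record
    yield item, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def count_ratings_per_item(records):
    """Run map, shuffle and reduce in sequence over an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample data: (user, item, rating)
records = [("u1", "a", 5.0), ("u2", "a", 3.0), ("u1", "b", 4.0)]
counts = count_ratings_per_item(records)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both phases could in principle be spread over many machines, which is the property the slide refers to.
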
- 9. What is Mahout? ● Scalable ML library built on Hadoop, written in Java ● Driven by Ng et al.'s paper “MapReduce for Machine Learning on Multicore” ● Started as a Lucene sub-project; became an Apache TLP in April 2010 ● Taste recommender framework by Sean Owen was added in 2008 ● 25 July 2013: Apache Mahout 0.8 released
- 10. Who uses Mahout?
- 11. When do you need Mahout? (table from Varad Meru)

  | Data size | Task | Tools |
  |---|---|---|
  | Lines, sample data | Analysis and visualization | Whiteboard, bash, ... |
  | KBs – low MBs, prototype data | Analysis and visualization | Octave, R, bash, ... |
  | MBs – low GBs, online data | Storage | Databases (MySQL, PostgreSQL), ... |
  | | Analysis | NumPy, SciPy, BLAS, Weka |
  | | Visualization | Protovis, D3, ... |
  | GBs – TBs – PBs, big data | Storage | HDFS, HBase, Cassandra, ... |
  | | Analysis | Mahout, Hive, Pig, ... |

- 12. Advantages ● Community ● Documentation and examples ● Scalability ● Apache license ● Well tested ● Built over existing production quality libraries
- 13. Requirements ● Java 1.6.x or greater ● Maven 3.x to build the source code ● Hadoop 0.20.0 or greater
- 14. Core themes ● Recommender engines (collaborative filtering) ● Clustering ● Classification
- 16. Algorithms ● User and Item based recommenders ● Matrix factorization based recommenders ● K-Means, Fuzzy K-Means clustering ● Latent Dirichlet Allocation ● Singular value decomposition ● Logistic regression based classifier ● Complementary Naive Bayes classifier ● Random forest decision tree based classifier
- 17. Recommender engine
- 18. Personalization level ● Generic / Non-Personalized: everyone receives same recommendations ● Demographic: matches a target group ● Ephemeral: matches current activity ● Persistent: matches long-term interests
- 19. Content based ● User Ratings x Item Attributes => Model ● Model applied to new items via attributes ● Alternative: knowledge-based (item attributes form model of item space) ● Example: personalized news feeds
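
The "User Ratings x Item Attributes => Model" idea on slide 19 can be sketched as follows. The weighting scheme (a rating-weighted sum of attribute vectors) is one simple choice assumed here for illustration, and all names and data are hypothetical; this is not Mahout code.

```python
def build_profile(ratings, item_attributes):
    """Build a user model as the rating-weighted sum of item attribute vectors."""
    profile = {}
    for item, rating in ratings.items():
        for attribute, weight in item_attributes[item].items():
            profile[attribute] = profile.get(attribute, 0.0) + rating * weight
    return profile


def score(profile, attributes):
    """Score an unseen item by the dot product of its attributes with the profile."""
    return sum(profile.get(attribute, 0.0) * weight
               for attribute, weight in attributes.items())


# Hypothetical news-feed data: item -> attribute weights
item_attributes = {
    "article1": {"sports": 1.0},
    "article2": {"politics": 1.0},
}
ratings = {"article1": 5.0, "article2": 1.0}

profile = build_profile(ratings, item_attributes)

# New items are described only by their attributes, so no ratings are needed.
new_items = {
    "new_politics": {"politics": 1.0},
    "new_sports": {"sports": 1.0},
}
ranked = sorted(new_items, key=lambda name: score(profile, new_items[name]),
                reverse=True)
```
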
- 20. Table of ratings
- 21. Ratings ● Explicit (Rating, Review, Vote, Like) ● Implicit (Click, Purchase, Follow)
- 22. Item-item ● For every item I ● Select the N most similar items ● Recommend these N items to users who interacted with item I
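
The item-item steps on slide 22 can be sketched in a few lines. This is an illustrative sketch, not Mahout's implementation: it assumes implicit data (a set of users per item) and uses the Tanimoto coefficient as the similarity; the function names and sample data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_items(item_users, n):
    """For every item, select the n most similar other items."""
    result = {}
    for item, users in item_users.items():
        scored = [(tanimoto(users, other_users), other)
                  for other, other_users in item_users.items() if other != item]
        # Highest similarity first; ties broken by item name for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[item] = [other for sim, other in scored[:n] if sim > 0]
    return result


def recommend(user, item_users, n):
    """Recommend the neighbours of each item the user already interacted with."""
    seen = {item for item, users in item_users.items() if user in users}
    neighbours = similar_items(item_users, n)
    recommendations = []
    for item in sorted(seen):
        for other in neighbours[item]:
            if other not in seen and other not in recommendations:
                recommendations.append(other)
    return recommendations


# Hypothetical data: item -> set of users who interacted with it
item_users = {
    "a": {"u1", "u2"},
    "b": {"u1", "u2", "u3"},
    "c": {"u3"},
    "d": {"u2"},
}
```

With this data, `recommend("u3", item_users, 2)` suggests item `a`, because `a` shares the most users with item `b`, which `u3` already has.
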
- 23. User-user ● For every user ● Find the n most similar users ● Aggregate the preferences of these similar users ● Generate recommended items
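
The user-user steps on slide 23 can be sketched like this. The sketch assumes explicit ratings, cosine similarity, and a similarity-weighted average as the aggregation step; these are illustrative choices, and the names and data are hypothetical rather than taken from Mahout.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse rating dicts (item -> rating)."""
    dot = sum(a[item] * b[item] for item in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(user, ratings, n_neighbours, top_k):
    """Recommend unseen items using the n most similar users."""
    target = ratings[user]
    # Step 1: find the n most similar users.
    neighbours = sorted(
        ((cosine(target, prefs), other)
         for other, prefs in ratings.items() if other != user),
        key=lambda pair: (-pair[0], pair[1]),
    )[:n_neighbours]
    # Step 2: aggregate the neighbours' preferences, weighted by similarity.
    totals, weights = {}, {}
    for sim, other in neighbours:
        if sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item in target:
                continue
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    # Step 3: rank the candidate items by estimated preference.
    scored = sorted(((totals[item] / weights[item], item) for item in totals),
                    key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:top_k]]


# Hypothetical data: user -> {item: rating}
ratings = {
    "alice": {"a": 5.0, "b": 4.0},
    "bob": {"a": 5.0, "b": 4.0, "c": 5.0, "d": 1.0},
    "carol": {"a": 1.0, "b": 1.0, "c": 1.0, "d": 5.0},
}
```
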
- 24. Similarity metrics ● Pearson Correlation ● Tanimoto ● Cosine similarity ● Euclidean distance
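
For reference, the four metrics on slide 24 can be written out directly from their textbook definitions. This is a plain sketch over equal-length rating vectors (and sets, for Tanimoto); it does not reproduce how any particular library normalizes or handles missing values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


def tanimoto(a, b):
    """Tanimoto coefficient: size of intersection over size of union of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note that Euclidean distance is a distance, not a similarity, so it has to be converted (for example, inverted) before it can be ranked alongside the other three.
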
- 25. Sparse matrix
- 26. Parameters ● DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel ● UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Euclidean Distance, Cosine Similarity ● UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood
- 27. Code example
- 28. Evaluation ● Average absolute difference ● RMSE ● Precision and recall ● Precision is the proportion of top results that are relevant, for some definition of relevant. ● Recall is the proportion of all relevant results included in the top results.
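
The evaluation measures on slide 28 can be computed as follows. This is a minimal sketch with hypothetical function names; it assumes predicted and actual ratings are given as aligned lists, and that "relevant" items are supplied as a set.

```python
import math


def average_absolute_difference(actual, predicted):
    """Mean of |actual - predicted| over all rated pairs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def precision_recall(top_results, relevant):
    """Precision and recall of a list of top results against a relevant set."""
    hits = len(set(top_results) & set(relevant))
    precision = hits / len(top_results) if top_results else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 2 of the 4 top results are relevant,
# and 2 of the 3 relevant items were found.
precision, recall = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})
```
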
- 29. Clustering
- 30. Mahout Clustering Algorithms ● K-Means - runs on Hadoop ● Fuzzy K-Means - runs on Hadoop ● Latent Dirichlet Allocation - runs on Hadoop ● Canopy clustering - runs on Hadoop ● Minhash clustering - runs on Hadoop ● kMeans++ streaming clustering - documentation missing
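
To make the first entry on slide 30 concrete, here is a single-machine sketch of K-Means (Lloyd's iteration). It only illustrates the algorithm and is not Mahout's Hadoop implementation; the initial centroids are passed in explicitly so the run is deterministic, and the sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update until nothing changes."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for centroid, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroid)  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The assignment step is independent per point and the update step is a per-cluster aggregate, which is why the algorithm fits the map and reduce phases of a Hadoop job.
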
- 31. Classification
- 32. Mahout Classification Algorithms ● Logistic regression (SGD) - model parameter selection can be done in Hadoop ● Naive Bayes - training runs on Hadoop ● Random Forests - training is done in Hadoop ● Hidden Markov Models - training is done in Map-Reduce
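
As an illustration of the first entry on slide 32, logistic regression trained with stochastic gradient descent (SGD) can be sketched in a few lines. This is a toy, single-machine version with hypothetical names, data and hyperparameters; it does not reflect Mahout's classes or defaults.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(samples, labels, learning_rate=0.1, epochs=200):
    """Update the weights after every single example (stochastic gradient descent)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            predicted = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            error = y - predicted
            weights = [w + learning_rate * error * v for w, v in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0


# Hypothetical one-feature data set: small values are class 0, large values class 1.
samples = [[0.0], [1.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]
weights, bias = train_sgd(samples, labels)
```

Because each update touches only one example, the model can be trained in a single pass over a stream of data without holding the whole data set in memory.
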
- 33. Resources ● Mahout in Action ● Apache Mahout Cookbook ● Introduction to Apache Mahout ● http://mahout.apache.org/
- 34. Q&A
