Mahout Introduction BarCampDC

An introduction to Apache Mahout presented at Apache BarCamp DC, May 19, 2012

A brief introduction to the examples and links to more resources for further exploration.

Published in: Technology, Education
Notes
  • We encounter recommendations everywhere today, for everything from books to music to people.
  • Clustering combines related items into groups, like text documents organized by topic.
  • Classification is assigning classes or categories to new data based on what we know about existing data.
  • Association mining identifies items that frequently appear together, whether shopping cart contents or frequently co-occurring terms.
  • Mahout's linear algebra library is not the fastest, but it is high performance and has a reasonably small memory footprint. It is based upon COLT from CERN. The collections library is not the fastest either, but it implements collections of primitive types that use open addressing. This is fundamental stuff that's missing from java.util and that wasn't previously available under a commercially friendly license.
  • Modify sequence file labels

    1. Learning with
    2. Drew Farris • Committer to Apache Mahout since 2/2010 • ..not as active in the past year • Author: Taming Text • My Company: (and BarCamp DC Sponsor)
    3. Mahout (as in hoot) or Mahout (as in trout)? • A scalable machine learning library
    4. A scalable machine learning library • ‘large’ data sets • Often Hadoop • ..but sometimes not
    5. A scalable machine learning library • Recommendation Mining
    6. A scalable machine learning library • Recommendation Mining • Clustering
    7. A scalable machine learning library • Recommendation Mining • Clustering • Classification
    8. A scalable machine learning library • Recommendation Mining • Clustering • Classification • Association Mining
    9. A scalable machine learning library • Recommendation Mining • Clustering • Classification • Association Mining • A reasonable linear algebra library • A reasonable library of collections
    10. A scalable machine learning library • Recommendation Mining • Clustering • Classification • Association Mining • A reasonable linear algebra library • A reasonable library of collections • Other Stuff
    11. Getting Started • Check out & build the code ▪ git clone git://git.apache.org/mahout.git ▪ mvn install -DskipTests=true ▪ The tests take a looong time to run and are not needed for the initial build • Or use the Cloudera Virtual Machine (http://bit.ly/MyBnFi)
    12. Getting Started • Check out & build the code • Examples in examples/bin
    13. Getting Started • Check out & build the code • Examples in examples/bin • Wiki (http://mahout.apache.org/)
    14. Getting Started • Check out & build the code • Examples in examples/bin • Wiki (http://mahout.apache.org/) • Articles & Presentations ▪ Grant’s IBM developerWorks Article ▪ http://ibm.co/LUbptg (Nov 2011) ▪ Others @ http://bit.ly/IZ6PqE (wiki)
    15. Getting Started • Check out & build the code • Examples in examples/bin • Wiki (http://mahout.apache.org/) • Articles & Publications (http://bit.ly/IZ6PqE) • Mailing Lists ▪ user-subscribe@mahout.apache.org ▪ (http://bit.ly/L1GSHB) ▪ dev-subscribe@mahout.apache.org ▪ (http://bit.ly/JPeNoE)
    16. Getting Started • Check out & build the code • Examples in examples/bin • Wiki (http://mahout.apache.org/) • Articles & Presentations • Mailing Lists • Books! ▪ Mahout in Action: http://bit.ly/IWMvaz ▪ Taming Text: http://bit.ly/KkODZV
    17. Kicking the Tires in examples/bin • classify-20newsgroups.sh • cluster-reuters.sh • cluster-syntheticcontrol.sh • asf-email-examples.sh
    18. Kicking the Tires in examples/bin • classify-20newsgroups.sh • Premise: Classify News Stories • Algorithm: sgd • Data: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
    19. Kicking the Tires in examples/bin • cluster-reuters.sh • Premise: Group Related News Stories • Data: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
    20. Kicking the Tires in examples/bin • cluster-syntheticcontrol.sh ▪ Premise: Cluster time series data ▪ normal, cyclic, increasing trend, decreasing trend, upward shift, downward shift ▪ Algorithms: ▪ canopy, kmeans, fuzzykmeans, dirichlet, meanshift • See: https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html • Data: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html
    21. Kicking the Tires in examples/bin • asf-email-examples.sh ▪ Recommendation (user based) ▪ Clustering (kmeans, dirichlet, minhash) ▪ Classification (naïve bayes, sgd)
    22. General Outline: • Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors
    23. General Outline: • Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors • Model Training
    24. General Outline: • Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors • Model Training • Model Evaluation
    25. General Outline: • Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors • Model Training • Model Evaluation • Lather, Rinse, Repeat
    26. General Outline: • Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors • Model Training • Model Evaluation • Lather, Rinse, Repeat • Production
    27. General Outline: • Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors • Model Training • Model Evaluation • Lather, Rinse, Repeat • Production • Lather, Rinse, Repeat
    28. mahout seq2sparse • Tokenize Documents • Count Words • Make Partial/Merge Vectors • TFIDF • Make Partial/Merge TFIDF Vectors
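
The tokenize, count, and TFIDF stages named on the seq2sparse slide can be illustrated with a small in-memory sketch. This is not Mahout's implementation: the whitespace tokenizer, the function names, and the textbook weighting tf × log(N / df) are all assumptions made for illustration, and Mahout's exact weighting and its partial/merge vector steps may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        # Assumed weighting: term count times log(N / document frequency).
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


# Hypothetical three-document corpus.
docs = [
    "mahout runs on hadoop",
    "mahout clusters documents",
    "hadoop stores documents",
]
vectors = tfidf_vectors(docs)
```

Terms that occur in only one document (such as "runs") receive a higher weight than terms shared across documents (such as "mahout"), which is the point of the TFIDF step.
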
    29. View Sequence Files with: mahout seqdumper -i /path/to/sequence/file • Check out shortcuts in: src/conf/driver.classes.props • Run classes with: mahout org.apache.mahout.SomeCoolNewFeature … • Standalone vs. Distributed ▪ Standalone mode is default ▪ Set HADOOP_CONF_DIR to use Hadoop ▪ MAHOUT_LOCAL will force standalone
    30. asf-email-examples.sh (recommendation) • Premise: Recommend Interesting Threads • User based recommendation • Boolean preferences based on thread contribution ▪ Implies boolean similarity measure – tanimoto, log-likelihood • See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
    31. Recommendation Steps • Convert Mail to Sequence Files • Convert Sequence Files to Preferences • Prepare Preference Matrix • Row Similarity Job • Recommender Job • See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
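
To make the idea of boolean preferences and a Tanimoto similarity concrete, here is a minimal in-memory sketch of user-based recommendation. It is not Mahout's row similarity or recommender job: the function names, the scoring rule (summing the similarity of every user who contributed to a thread), and the sample users and thread ids are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets: |A intersect B| / |A union B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend(user, prefs, top_n=3):
    """Rank threads the user has not contributed to by similar users' activity."""
    mine = prefs[user]
    scores = {}
    for other, theirs in prefs.items():
        if other == user:
            continue
        sim = tanimoto(mine, theirs)
        # Each unseen thread is credited with the similarity of its contributor.
        for thread in theirs - mine:
            scores[thread] = scores.get(thread, 0.0) + sim
    # Highest score first; ties broken by thread id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [thread for thread, score in ranked[:top_n] if score > 0]


# Boolean preferences: a user either contributed to a thread or did not.
prefs = {
    "alice": {"t1", "t2"},
    "bob": {"t1", "t2", "t3"},
    "carol": {"t4"},
}
suggestions = recommend("alice", prefs)
```

Because the preferences carry no ratings, set overlap is the only signal available, which is why the slide pairs boolean data with measures such as Tanimoto.
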
    32. asf-email-examples.sh (classification) • Premise: Predict project mailing lists for incoming messages • Data labeled based on the mailing list it arrived on • Hold back a random 20% of data for testing, the rest for training. • Algorithms: Naïve Bayes (Standard, Complementary), SGD • See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
    33. Classification Steps • Convert Mail to Sequence Files • Sequence Files to Sparse Vectors • Modify Sequence File Labels • Split into Training and Test Sets • Train the Model • Test the Model • See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
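
The "hold back a random 20%" and "test the model" steps can be sketched as follows. This is a generic illustration, not the code the example script runs: the function names, the fixed seed, and the sample messages and labels are hypothetical, and training itself is left out.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold back a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_set):
    """Fraction of held-back examples whose predicted label matches the true one."""
    if not test_set:
        return 0.0
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)


# Hypothetical labeled data: each message is labeled with the list it arrived on.
examples = [
    ("message %d" % i, "user" if i % 2 == 0 else "dev") for i in range(10)
]
train, test = split_train_test(examples)
```

Any classifier that exposes a function from message to label can be scored with `accuracy(predict, test)`; keeping the test set out of training is what makes that score meaningful.
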
    34. asf-email-examples.sh (clustering) • Premise: Grouping Messages by Subject • Same Prep as Classification • Different Algorithms: (kmeans, dirichlet, minhash) • 12/05/16 05:16:02 INFO driver.MahoutDriver: Program took 20577398 ms (Minutes: 342.95663333333334) • See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
    35. Clustering Steps • Convert Mail to Sequence Files • Sequence Files to Sparse Vectors • Run Clustering (iterate) • Dump Results
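
The "Run Clustering (iterate)" step can be pictured with a plain k-means loop. The sketch below is a single-machine illustration, not Mahout's distributed kmeans driver: the function names, the exact-equality convergence check, and the sample points and starting centroids are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: nothing moved in this iteration
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D points standing in for document vectors.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each pass over the data is one iteration; on a large corpus every pass is a full job, which helps explain run times like the one logged on the clustering slide.
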
    36. Insert Bar Camp Style Discussion Here
    37. Mahout in Action • Owen, Anil, Dunning and Friedman • http://bit.ly/IWMvaz • Taming Text • Ingersoll, Morton and Farris • http://bit.ly/KkODZV