Oscon miller 2011



The scheduled Redis speaker was sick, so I whipped this up in about an hour and filled in on a different subject. It's a bit crude, but you get a big-picture view of how to build a simple AI application using BigCouch. The accompanying video is up at http://www.youtube.com/watch?v=QEBDNxbSRuk

Published in: Technology


  1. Bayes on your (Big)Couch. Mike Miller, _milleratmit, July 25, 2011
  2. I want my app to do _this_ (Mike Miller, Oscon 2011)
  3. CouchDB in a slide
     • Schema-free document database management system: documents are JSON objects; able to store binary attachments
     • RESTful API: http://wiki.apache.org/couchdb/reference
     • Views: custom, persistent representations of your data; incremental MapReduce with results persisted to disk; fast querying by primary key (views stored in a B-tree)
     • Bi-directional replication: master-slave and multi-master topologies supported; optional 'filters' to replicate a subset of the data; edge devices (mobile phones, sensors, etc.)
  4. BigCouch = Couch+Scaling
     • Open source, Apache License
     • Horizontal scalability: easily add storage capacity by adding more servers; computing power (views, compaction, etc.) scales with more servers
     • No SPOF: any node can handle any request; individual nodes can come and go
     • Transparent to the application: all clustering operations take place "behind the curtain"; looks (mostly) like a single server instance of CouchDB
  5. ...back to making my app smart
  6. Sample Data [Figure: scatter plot of Height [in] vs. Weight [lbs] for Girls and Boys]
  7. Naive Bayes Classifier [Figure: Gaussian distribution of male height, annotated with the male height mean and the male variance]
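
For reference, the Gaussian naive Bayes model that the plot on this slide illustrates can be written out as below. This is the standard textbook form rather than something taken from the slides, and the symbols are introduced here for illustration only: $c$ is a class (e.g., boy or girl), $x_f$ is the value of numerical field $f$, and $\mu_{c,f}$ and $\sigma_{c,f}^{2}$ are the mean and variance of that field within class $c$.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{f=1}^{n}
  \frac{1}{\sqrt{2\pi\sigma_{c,f}^{2}}}
  \exp\!\left( -\frac{(x_f - \mu_{c,f})^{2}}{2\sigma_{c,f}^{2}} \right)
```

The predicted class is the $c$ with the largest value of the right-hand side.
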
  8. Implementation Plan
     • Model people as documents in CouchDB
     • Calculate means/variances with MapReduce
     • Run classifier in CouchDB as a post-MapReduce hook ("_list")
     • Note: do not need to specify fields to use in classification; multi-class implementation; continuous, incremental training! Results improve as training data trickles in.
     [Figure: Height [in] vs. Weight [lbs] scatter plot for Girls and Boys]
  9. 3 ways to follow along
     • couchapp: python tool to push/pull from other CouchDBs
       > sudo easy_install -U couchapp
       > couchapp clone 'http://millertime.cloudant.com/bitb'
       Create an account at cloudant.com, then:
       > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
       > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
     • github
       > git clone git@github.com:mlmiller/bayes.git
     • CouchDB replication to your cloudant account: bonus, brings along the data, too!
  10. The Code [Figure: annotated source tree]
      • post-MapReduce hook ("_list" method)
      • Classifier (probability calculator)
      • client-side test via node.js
      • view code to calculate means and variances
      • you can ignore everything else
  11. Data Model: arbitrary number of numerical fields allowed; 'class' => training data
  12. Training via MapReduce
      • 'class' => training data
      • views/training/map.js: calculate mean/variance for all numerical fields in a document
      • emit: ([<class>, <field>], <value>)
      • Reduce: _stats (Erlang builtin)
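
A minimal sketch of what a map function with this emit pattern might look like. It is not the actual views/training/map.js from the repository; the rule for skipping fields (names starting with an underscore, plus 'class' itself) is an assumption. In CouchDB, `emit` is supplied by the view server, so it is stubbed here to let the sketch run under plain node.js.

```javascript
// Stub for CouchDB's emit(); the real one is provided by the view server.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical map function: emit ([class, field], value) for every
// numerical field of a document that carries a 'class' label.
function map(doc) {
  if (typeof doc['class'] !== 'string') return; // unlabeled docs do not train
  Object.keys(doc).forEach(function (field) {
    // Assumption: skip CouchDB metadata (_id, _rev, ...) and the label itself.
    if (field.charAt(0) === '_' || field === 'class') return;
    const value = doc[field];
    if (typeof value === 'number' && isFinite(value)) {
      emit([doc['class'], field], value);
    }
  });
}

// Illustrative document, not taken from the talk's data set.
map({ _id: 'a', 'class': 'boy', height: 70, weight: 160 });
console.log(JSON.stringify(rows));
```

Because no field names are hard-coded, adding a new numerical field to the documents would extend the trained state without any change to the view.
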
  13. Bayes: Trained State (pre-reduce output)
  14. Bayes: Trained State. Count, min, max, mean, variance. Automatically updated as new training data arrives.
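
CouchDB's built-in _stats reduce returns sum, count, min, max, and sumsqr for each key, so the mean and variance have to be derived from those values. A minimal sketch of that step follows; the helper name `meanAndVariance` is hypothetical, and the choice of the population variance (dividing by count rather than count - 1) is an assumption, not something the slides state.

```javascript
// Derive mean and (population) variance from one row of _stats output.
// stats: { sum, count, min, max, sumsqr }
function meanAndVariance(stats) {
  const mean = stats.sum / stats.count;
  // E[x^2] - (E[x])^2
  const variance = stats.sumsqr / stats.count - mean * mean;
  return { mean: mean, variance: variance };
}

// Illustrative values for two heights, 60 and 70 (not from the talk's data).
const derived = meanAndVariance({ sum: 130, count: 2, min: 60, max: 70, sumsqr: 8500 });
console.log(derived);
```

Since sum, count, and sumsqr are all simple running totals, the trained state stays current as documents are added, with no retraining pass.
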
  15. Bayes Classifier (lib/bayes_classifier.js): load state from DB; no assumptions on field names; calculate prob. for all possible hypotheses
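
The following is a rough sketch of a Gaussian naive Bayes calculation of this kind, not the contents of lib/bayes_classifier.js. The layout of the `state` object, the function names, and the use of equal priors for every class are all assumptions made for illustration, and the numbers are invented rather than taken from the talk's data.

```javascript
// Gaussian probability density of x for a given mean and variance.
function gaussian(x, mean, variance) {
  return Math.exp(-((x - mean) ** 2) / (2 * variance)) /
    Math.sqrt(2 * Math.PI * variance);
}

// state: { <class>: { <field>: { mean, variance } } }  (hypothetical layout)
// observation: { <field>: <number> }
// Returns a normalized probability for every class (hypothesis) in state.
function classify(state, observation) {
  const scores = {};
  let total = 0;
  Object.keys(state).forEach(function (cls) {
    let likelihood = 1; // assumption: equal priors for all classes
    Object.keys(observation).forEach(function (field) {
      const s = state[cls][field];
      // Fields the class has never seen are simply ignored.
      if (s) likelihood *= gaussian(observation[field], s.mean, s.variance);
    });
    scores[cls] = likelihood;
    total += likelihood;
  });
  Object.keys(scores).forEach(function (cls) {
    scores[cls] /= total;
  });
  return scores;
}

// Invented trained state, for illustration only.
const state = {
  boy: { height: { mean: 70, variance: 9 }, weight: { mean: 170, variance: 400 } },
  girl: { height: { mean: 64, variance: 9 }, weight: { mean: 130, variance: 400 } }
};
const result = classify(state, { height: 69, weight: 165 });
console.log(result);
```

Field names are read from the state and the observation at run time, which matches the "no assumptions on field names" point on the slide. A production version would likely sum log-probabilities instead of multiplying densities to avoid underflow with many fields.
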
  16. A brief aside...
      • Let's test our classifier: select 2000 documents for the test; randomly choose 1000 documents for the training sample; the remaining documents are used for validation
      • Simulate continuous training: add documents one at a time; after each document addition, test on all 1000 documents of our validation sample; record and plot the fraction of the validation sample properly classified
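
The test procedure on this slide could be sketched roughly as follows. The helper names `splitSample` and `fractionCorrect`, and the `predict` callback, are hypothetical; the documents generated at the end are placeholders that only exercise the split, not the data used in the talk.

```javascript
// Shuffle a copy of docs (Fisher-Yates) and split it into a training
// sample of nTraining documents and a validation sample of the rest.
function splitSample(docs, nTraining) {
  const shuffled = docs.slice();
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    const tmp = shuffled[i];
    shuffled[i] = shuffled[j];
    shuffled[j] = tmp;
  }
  return {
    training: shuffled.slice(0, nTraining),
    validation: shuffled.slice(nTraining)
  };
}

// Fraction of validation docs whose predicted class equals doc['class'].
// predict is a hypothetical function: (doc) -> class name.
function fractionCorrect(validation, predict) {
  let correct = 0;
  validation.forEach(function (doc) {
    if (predict(doc) === doc['class']) correct += 1;
  });
  return validation.length ? correct / validation.length : 0;
}

// Placeholder documents, for illustration only.
const docs = [];
for (let i = 0; i < 2000; i++) {
  docs.push({ _id: String(i), 'class': i % 2 ? 'boy' : 'girl' });
}
const sample = splitSample(docs, 1000);
console.log(sample.training.length, sample.validation.length);
```

To simulate continuous training, one would add the training documents to the database one at a time and call `fractionCorrect` on the validation sample after each addition.
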
  17. A brief aside... Dramatic improvement with additional training data [Figure: x-axis is the number of documents in the training set]
  18. ... and back to the code
  19. test it yourself
      • Client-side test via node.js
        > ./test.js height=<some number> weight=<some number>
        The classifier runs server side, configured in line 6 of test.js. You can point this to your DB.
  20. Running as CouchApp
      • Create a database (e.g., 'bitb') at cloudant.com
      • Add data, then push your code
        > couchapp push 'http://<user>:<pwd>@<usr>.cloudant.com/bitb'
      • HTML & CSS served directly from BigCouch to the browser
      • Heavy lifting of classification done server side
      • http://millertime.cloudant.com/bitb/_design/bayes/index.html
  21. Running as API (_list)
      > curl 'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true'
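
For calling the same endpoint from node.js, the request URL can be assembled with the standard `URL` class instead of by hand. This is only a sketch: the helper name `classifierUrl` is hypothetical, and it builds the URL without sending a request.

```javascript
// Build the _list URL for a set of field values.
// base is the database URL, e.g. 'http://millertime.cloudant.com/bitb'.
function classifierUrl(base, fields) {
  const url = new URL(base + '/_design/bayes/_list/index/training');
  Object.keys(fields).forEach(function (name) {
    url.searchParams.set(name, String(fields[name]));
  });
  url.searchParams.set('format', 'json');
  url.searchParams.set('group', 'true');
  return url.toString();
}

const requestUrl = classifierUrl('http://millertime.cloudant.com/bitb', {
  height: 65.65,
  weight: 168.61
});
console.log(requestUrl);
```

Pointing `base` at your own database URL targets your own copy of the app.
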
  22. Wrapping Up: Bayes on BigCouch
      • Simple code, powerful results: light requirements on the data model, which can be relaxed with more complex view code
      • Continuous learning is very powerful, e.g., time-based learning (automatically adapt to changing conditions)
      • Classification can be performed client- or server-side: push documents into the DB and they are auto-tagged!
      • More sophisticated classifiers are easily implemented, e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
      • View Engine allows simple deployment of sophisticated domain libraries in mass parallel, e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
  23. Give it a spin: hosting, management, and support for CouchDB and BigCouch. http://cloudant.com http://github.com/cloudant/bigcouch