Oscon miller 2011
The scheduled Redis speaker was sick, so I whipped this up in about an hour and filled in on a different subject. It's a bit crude, but you get a big-picture view of how to build a simple AI application using BigCouch. The accompanying video is up at

Published in Technology


  • 1. Bayes on your (Big)Couch Mike Miller _milleratmit July 25, 2011
  • 2. I want my app to do _this_ Mike Miller, Oscon 2011 2
  • 3. CouchDB in a slide
      • Schema-free document database management system
          Documents are JSON objects
          Able to store binary attachments
      • RESTful API
      • Views: custom, persistent representations of your data
          Incremental MapReduce with results persisted to disk
          Fast querying by primary key (views stored in a B-tree)
      • Bi-directional replication
          Master-slave and multi-master topologies supported
          Optional ‘filters’ to replicate a subset of the data
          Edge devices (mobile phones, sensors, etc.)
  • 4. BigCouch = Couch + Scaling
      • Open source, Apache License
      • Horizontal scalability
          Easily add storage capacity by adding more servers
          Computing power (views, compaction, etc.) scales with more servers
      • No SPOF
          Any node can handle any request
          Individual nodes can come and go
      • Transparent to the application
          All clustering operations take place “behind the curtain”
          Looks (mostly) like a single server instance of CouchDB
  • 5. ...back to making my app smart
  • 6. Sample Data
      [Figure: scatter plot "Height vs. Weight" for Girls and Boys; x-axis: Weight [lbs], y-axis: Height [in]]
  • 7. Naive Bayes Classifier
      [Figure: Gaussian curve for male height, annotated with its mean and variance]
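For reference, the Gaussian form of naive Bayes that the plot on this slide alludes to can be written as below. This is the standard textbook statement, not a formula taken from the slides; the symbols $c$ (class), $x_f$ (value of field $f$), $\mu_{c,f}$ and $\sigma_{c,f}^{2}$ (per-class, per-field mean and variance) are notation introduced here.

```latex
p(x_f \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,f}^{2}}}
  \exp\!\left(-\frac{(x_f-\mu_{c,f})^{2}}{2\sigma_{c,f}^{2}}\right),
\qquad
P(c \mid x) = \frac{P(c)\,\prod_{f} p(x_f \mid c)}
                   {\sum_{c'} P(c')\,\prod_{f} p(x_f \mid c')}
```

Only a mean and a variance per class and field are needed, which is why the training step in the following slides reduces to computing those two numbers.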
  • 8. Implementation Plan
      Model people as documents in CouchDB
      Calculate means/variances with MapReduce
      Run the classifier in CouchDB as a post-MapReduce hook (“_list”)
      • Note:
          do not need to specify fields to use in classification
          multi-class implementation
          continuous, incremental training! Results improve as training data trickles in.
      [Figure: the "Height vs. Weight" scatter plot from slide 6]
  • 9. 3 ways to follow along
      couchapp: Python tool to push/pull from other CouchDBs
          > sudo easy_install -U couchapp
          > couchapp clone ‘
          create an account at
          > curl -X PUT ‘http://<username>:<pwd>@<username>’
          > couchapp push ‘http://<username>:<pwd>@<username>’
      github
          > git clone
      CouchDB replication to your Cloudant account
          bonus, brings along the data, too!
  • 10. The Code
      post-MapReduce hook (“_list” method)
      Classifier (probability calculator)
      client-side test via node.js
      view code to calculate means and variances
      you can ignore everything else
  • 11. Data Model
      Arbitrary number of numerical fields allowed
      ‘class’ => training data
  • 12. Training via MapReduce
      ‘class’ => training data
      views/training/map.js: calculate mean/variance for all numerical fields in a document
          emit: ([<class>, <field>], <value>)
      Reduce: _stats (Erlang builtin)
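The map/reduce step above might look roughly like the following. This is a minimal sketch, not the contents of views/training/map.js: `mapTraining`, `stats` and the sample `docs` are hypothetical names and data introduced here. In a real CouchDB view the map function takes only `doc` and calls the global `emit`, and the reduce is simply the string `_stats`; both are simulated locally so the sketch runs under plain node.js.

```javascript
// Sketch of a CouchDB-style map function. In CouchDB, `emit` is a global and
// the function takes only `doc`; here it is passed in so the sketch runs anywhere.
function mapTraining(doc, emit) {
  if (typeof doc.class !== "string") return; // only labelled docs are training data
  for (const field of Object.keys(doc)) {
    const value = doc[field];
    // Skip CouchDB metadata (_id, _rev) and anything non-numeric.
    if (field.startsWith("_") || typeof value !== "number") continue;
    emit([doc.class, field], value);
  }
}

// Local stand-in for the builtin `_stats` reduce, which returns
// sum, count, min, max and sumsqr for the values under each key.
function stats(values) {
  return {
    sum: values.reduce((a, v) => a + v, 0),
    count: values.length,
    min: Math.min(...values),
    max: Math.max(...values),
    sumsqr: values.reduce((a, v) => a + v * v, 0),
  };
}

// Hypothetical example documents (field names follow the slides).
const docs = [
  { _id: "a", class: "boy", height: 70, weight: 160 },
  { _id: "b", class: "boy", height: 72, weight: 180 },
  { _id: "c", class: "girl", height: 64, weight: 120 },
  { _id: "d", height: 66, weight: 130 }, // no class: not training data
];

// Group emitted values by key, as a `group=true` query would.
const grouped = {};
for (const doc of docs) {
  mapTraining(doc, (key, value) => {
    const k = JSON.stringify(key);
    (grouped[k] = grouped[k] || []).push(value);
  });
}
const trained = Object.fromEntries(
  Object.entries(grouped).map(([k, values]) => [k, stats(values)])
);
```

Because the map function iterates over whatever numeric fields a document has, no field names need to be configured, which matches the "no need to specify fields" note in the implementation plan.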
  • 13. Bayes: Trained State (pre-reduce output)
  • 14. Bayes: Trained State
      Count, Min, Max, Mean, Variance
      Automatically updated as new training data arrives
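The builtin `_stats` reduce returns `sum`, `count`, `min`, `max` and `sumsqr`, so the mean and variance listed on this slide would have to be derived from those. How the original code does that is not shown; the sketch below is one way, with `meanVariance` as a hypothetical helper name.

```javascript
// Derive the mean and (population) variance from a _stats-style summary.
function meanVariance(s) {
  const mean = s.sum / s.count;
  // E[x^2] - (E[x])^2
  const variance = s.sumsqr / s.count - mean * mean;
  return { count: s.count, min: s.min, max: s.max, mean, variance };
}

// Summary for two hypothetical heights, 70 and 72.
const summary = meanVariance({ sum: 142, count: 2, min: 70, max: 72, sumsqr: 10084 });
```

Since `sum`, `count` and `sumsqr` are all additive, a new document only adds to the running totals, which is what lets the trained state update incrementally.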
  • 15. Bayes Classifier
      lib/bayes_classifier.js
      Load state from DB
      No assumptions on field names
      Calculate prob. for all possible hypotheses
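A classifier along these lines could be sketched as follows. This is not the code in lib/bayes_classifier.js: the `state` shape (class, then field, then `{ count, mean, variance }`), the function names and the numbers are all assumptions made for illustration, and the "load state from DB" step is replaced by a hard-coded object.

```javascript
// Gaussian likelihood in log space, to avoid underflow with many fields.
function logGaussian(x, mean, variance) {
  return -0.5 * Math.log(2 * Math.PI * variance) - ((x - mean) ** 2) / (2 * variance);
}

// `state` maps class -> field -> { count, mean, variance } (hypothetical shape).
// `sample` maps field -> number. Returns class -> posterior probability.
function classify(state, sample) {
  const classes = Object.keys(state);
  const fields = Object.keys(sample); // no field names are hard-coded
  // Use the per-class count of the first field as the class size for the prior.
  const sizes = classes.map((c) => state[c][fields[0]].count);
  const total = sizes.reduce((a, n) => a + n, 0);
  const logScores = classes.map((c, i) => {
    let score = Math.log(sizes[i] / total);
    for (const f of fields) {
      const { mean, variance } = state[c][f];
      score += logGaussian(sample[f], mean, variance);
    }
    return score;
  });
  // Normalise with the log-sum-exp trick so the probabilities sum to 1.
  const max = Math.max(...logScores);
  const weights = logScores.map((s) => Math.exp(s - max));
  const norm = weights.reduce((a, w) => a + w, 0);
  return Object.fromEntries(classes.map((c, i) => [c, weights[i] / norm]));
}

// Hypothetical trained state; the numbers are made up for illustration.
const state = {
  boy: {
    height: { count: 500, mean: 70, variance: 9 },
    weight: { count: 500, mean: 170, variance: 400 },
  },
  girl: {
    height: { count: 500, mean: 64, variance: 9 },
    weight: { count: 500, mean: 130, variance: 400 },
  },
};
const result = classify(state, { height: 65.65, weight: 168.61 });
```

The sketch assumes every class has a nonzero variance for every field in the sample; a class trained on a single document would need a variance floor or similar guard.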
  • 16. A brief aside...
      • Let's test our classifier
          Select 2000 documents for the test
          Randomly choose 1000 documents as the training sample
          Remaining documents used for validation
      • Simulate continuous training
          Add documents one at a time
          After each document addition, test on all 1000 documents of our validation sample
          Record and plot the fraction of the validation sample properly classified
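The procedure above can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the author's test harness: the data is a tiny hand-written set already split into training and validation (the slides use a random 1000/1000 split), class priors are taken as uniform, and `addDoc`, `predict` and `learningCurve` are hypothetical names.

```javascript
// Running sums per class and field; the same quantities a _stats reduce keeps.
function addDoc(state, doc) {
  const cls = (state[doc.class] = state[doc.class] || {});
  for (const [field, value] of Object.entries(doc)) {
    if (typeof value !== "number") continue;
    const s = (cls[field] = cls[field] || { count: 0, sum: 0, sumsqr: 0 });
    s.count += 1;
    s.sum += value;
    s.sumsqr += value * value;
  }
}

// Pick the class with the highest Gaussian naive Bayes log score (uniform priors).
function predict(state, sample) {
  let best = null;
  let bestScore = -Infinity;
  for (const [cls, fields] of Object.entries(state)) {
    let score = 0;
    for (const [field, x] of Object.entries(sample)) {
      const s = fields[field];
      const mean = s.sum / s.count;
      // Floor the variance so a class with a single document still scores.
      const variance = Math.max(s.sumsqr / s.count - mean * mean, 1e-6);
      score += -0.5 * Math.log(2 * Math.PI * variance) - (x - mean) ** 2 / (2 * variance);
    }
    if (score > bestScore) {
      bestScore = score;
      best = cls;
    }
  }
  return best;
}

// Add training docs one at a time; after each, score the whole validation set.
function learningCurve(training, validation) {
  const state = {};
  const curve = [];
  for (const doc of training) {
    addDoc(state, doc);
    let correct = 0;
    for (const { class: truth, ...sample } of validation) {
      if (predict(state, sample) === truth) correct += 1;
    }
    curve.push(correct / validation.length);
  }
  return curve;
}

// Hypothetical data, far smaller than the 2000 documents used in the talk.
const training = [
  { class: "boy", height: 70, weight: 165 },
  { class: "girl", height: 63, weight: 120 },
  { class: "boy", height: 73, weight: 190 },
  { class: "girl", height: 65, weight: 135 },
];
const validation = [
  { class: "boy", height: 71, weight: 175 },
  { class: "girl", height: 64, weight: 125 },
];
const curve = learningCurve(training, validation);
```

Each entry of `curve` is the fraction of the validation sample classified correctly after that many training documents, which is the quantity plotted on the next slide. With so few documents the variance floor dominates the early points, so the toy curve is noisy.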
  • 17. A brief aside...
      Dramatic improvement with additional training data
      [Figure: classification performance; x-axis: Number of documents in the training set]
  • 18. ... and back to the code
  • 19. Test it yourself
      • Client-side test via node.js
          > ./test.js height=<some number> weight=<some number>
          Classifier runs server side, configured in line 6 of test.js
          Can point this to your DB
  • 20. Running as CouchApp
      create a database (e.g., ‘bitb’) at
      add data
      then push your code
          > couchapp push ‘http://<user>:<pwd>@<usr>’
      HTML & CSS served directly from BigCouch to the browser
      Heavy lifting of classification done server side
  • 21. Running as API (_list)
      > curl bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true
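The query string in the request above can be assembled from an arbitrary set of fields. The sketch below only builds the path shown on the slide; the host is elided in the source and is left out here, and `classifyPath` is a hypothetical helper name. It uses the standard `URLSearchParams` global available in node.js and browsers.

```javascript
// Build the classifier request path from a sample's fields.
function classifyPath(sample) {
  const params = new URLSearchParams();
  for (const [field, value] of Object.entries(sample)) {
    params.append(field, String(value));
  }
  // format=json and group=true are the extra parameters shown on the slide.
  params.append("format", "json");
  params.append("group", "true");
  return `bayes/_list/index/training?${params.toString()}`;
}

const path = classifyPath({ height: 65.65, weight: 168.61 });
```

When calling this from a shell with curl, the full URL should be quoted so that `&` is not interpreted by the shell.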
  • 22. Wrapping Up: Bayes on BigCouch
      • Simple code, powerful results
          light requirements on data model
          can be relaxed with more complex view code
      • Continuous learning is very powerful
          e.g., time-based learning (automatically adapt to changing conditions)
      • Classification can be performed client- or server-side
          push documents into DB and they are auto-tagged!
      • More sophisticated classifiers easily implemented
          e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
      • View Engine allows simple deployment of sophisticated domain libraries in mass parallel
          e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
  • 23. Give it a spin
      Hosting, Management, Support for CouchDB and BigCouch