The scheduled Redis speaker was sick, so I whipped this up in about an hour and filled in on a different subject. It's a bit crude, but you get a big-picture view of how to build a simple AI application using BigCouch. The accompanying video is up at http://www.youtube.com/watch?v=QEBDNxbSRuk

Transcript

  • 1. Bayes on your (Big)Couch Mike Miller _milleratmit July 25, 2011
  • 2. I want my app to do _this_ Mike Miller, Oscon 2011 2
  • 3. CouchDB in a slide
      • Schema-free document database management system
          Documents are JSON objects
          Able to store binary attachments
      • RESTful API
          http://wiki.apache.org/couchdb/reference
      • Views: custom, persistent representations of your data
          Incremental MapReduce with results persisted to disk
          Fast querying by primary key (views stored in a B-tree)
      • Bi-directional replication
          Master-slave and multi-master topologies supported
          Optional ‘filters’ to replicate a subset of the data
          Edge devices (mobile phones, sensors, etc.)
  • 4. BigCouch = Couch+Scaling
      • Open Source, Apache License
      • Horizontal Scalability
          Easily add storage capacity by adding more servers
          Computing power (views, compaction, etc.) scales with more servers
      • No SPOF
          Any node can handle any request
          Individual nodes can come and go
      • Transparent to the Application
          All clustering operations take place “behind the curtain”
          Looks (mostly) like a single server instance of CouchDB
  • 5. ...back to making my app smart
  • 6. Sample Data
      [Figure: “Height vs. Weight” plot; x-axis: Weight [lbs], y-axis: Height [in]; series: Girls, Boys]
  • 7. Naive Bayes Classifier
      [Figure: Gaussian curve for male height; annotations: mean male height, male variance]
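The Gaussian on this slide suggests the usual Gaussian naive Bayes model. As a sketch, in notation that is not from the slides ($x_f$ is the value of field $f$, $c$ is a class such as "boy" or "girl", and $\mu_{c,f}$, $\sigma^2_{c,f}$ are the mean and variance of field $f$ within class $c$):

```latex
% Per-field likelihood: a Gaussian with the class's mean and variance
p(x_f \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,f}^{2}}}
                \exp\!\left(-\frac{(x_f - \mu_{c,f})^{2}}{2\sigma_{c,f}^{2}}\right)

% "Naive" step: fields are treated as independent given the class
P(c \mid x) \propto P(c) \prod_{f} p(x_f \mid c)
```

The predicted class is the $c$ with the largest $P(c \mid x)$. Whether the talk's code includes the prior $P(c)$ is not stated on the slides.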
  • 8. Implementation Plan
      Model people as documents in CouchDB
      Calculate Means/Variances with MapReduce
      Run classifier in the CouchDB as post-MapReduce hook (“_list”)
      [Figure: “Height vs. Weight” plot; x-axis: Weight [lbs], y-axis: Height [in]; series: Girls, Boys]
      • Note:
          do not need to specify fields to use in classification
          multi-class implementation
          continuous, incremental training! Results improve as training data trickles in.
  • 9. 3 ways to follow along
      • couchapp: Python tool to push/pull from other CouchDBs
          > sudo easy_install -U couchapp
          > couchapp clone 'http://millertime.cloudant.com/bitb
          create an account at cloudant.com
          > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
          > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
      • github
          > git clone git@github.com:mlmiller/bayes.git
      • CouchDB replication to your cloudant account
          bonus, brings along the data, too!
  • 10. The Code
      post-MapReduce hook (“_list” method)
      Classifier (Probability Calculator)
      client side test via node.js
      view code to calculate means and variances
      you can ignore everything else
  • 11. Data Model
      Arbitrary number of numerical fields allowed
      ‘class’ => training data
  • 12. Training via MapReduce
      ‘class’ => training data
      views/training/map.js
          Calculate mean/variance for all numerical fields in a document
          emit: ([<class>, <field>], <value>)
      Reduce: _stats (Erlang builtin)
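A minimal sketch of what such a map function might look like. This is an illustration of the emit pattern on the slide, not the actual contents of views/training/map.js. The `emit` stub and the example documents (with made-up values) are assumptions added so the sketch runs under node; inside CouchDB, `emit` is supplied by the view server.

```javascript
// Stub for CouchDB's emit(), so the sketch can run outside the view server.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Hypothetical training map function: one row per numerical field.
function map(doc) {
  // Only documents that carry a 'class' label count as training data.
  if (typeof doc['class'] === 'undefined') return;
  for (var field in doc) {
    // Skip CouchDB metadata (_id, _rev, ...) and the label itself.
    if (field.charAt(0) === '_' || field === 'class') continue;
    var value = doc[field];
    if (typeof value === 'number' && isFinite(value)) {
      emit([doc['class'], field], value);
    }
  }
}

// Made-up example documents, for illustration only.
map({ _id: 'a', 'class': 'boy', height: 70, weight: 160 });
map({ _id: 'b', height: 64, weight: 120 }); // unlabeled: emits nothing

console.log(JSON.stringify(emitted));
```

Because the function loops over whatever numerical fields a document has, no field names need to be configured, which matches the "arbitrary number of numerical fields" data model.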
  • 13. Bayes: Trained State
      pre-reduce output
  • 14. Bayes: Trained State
      Count, Min, Max, Mean, Variance
      Automatically updated as new training data arrives
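CouchDB's built-in `_stats` reducer returns `sum`, `count`, `min`, `max` and `sumsqr` for each key, so mean and variance are one step away. A minimal sketch of that step follows; the helper name `meanAndVariance` and the example row (made-up heights 68, 70, 72) are assumptions for illustration.

```javascript
// Derive mean and (population) variance from a _stats reduce value.
function meanAndVariance(stats) {
  const mean = stats.sum / stats.count;
  // Var[x] = E[x^2] - (E[x])^2
  const variance = stats.sumsqr / stats.count - mean * mean;
  return { mean: mean, variance: variance };
}

// Shape of one row from the view queried with group=true (values made up).
const row = {
  key: ['boy', 'height'],
  value: { sum: 210, count: 3, min: 68, max: 72, sumsqr: 14708 }
};

const derived = meanAndVariance(row.value);
console.log(row.key.join('/'), derived);
```

Since `sum`, `count` and `sumsqr` can all be updated incrementally, the trained state keeps up with new documents without a full recomputation.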
  • 15. Bayes Classifier
      lib/bayes_classifier.js
      Load state from DB
      No assumptions on field names
      Calculate prob. for all possible hypotheses
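A sketch of a Gaussian naive Bayes classifier along these lines. It is not the code in lib/bayes_classifier.js. The shape of `state`, the use of class counts as priors, and all numbers are assumptions made for illustration.

```javascript
// Gaussian probability density for value x.
function gaussian(x, mean, variance) {
  return Math.exp(-((x - mean) ** 2) / (2 * variance)) /
         Math.sqrt(2 * Math.PI * variance);
}

// state: { <class>: { count, fields: { <field>: { mean, variance } } } }
// sample: { <field>: <number>, ... } with any field names.
function classify(state, sample) {
  const classes = Object.keys(state);
  const total = classes.reduce((n, c) => n + state[c].count, 0);
  const logScores = {};
  for (const c of classes) {
    let logp = Math.log(state[c].count / total); // prior from class counts (assumption)
    for (const field of Object.keys(sample)) {
      const s = state[c].fields[field];
      if (!s) continue; // field never seen for this class: skipped (a simplification)
      logp += Math.log(gaussian(sample[field], s.mean, s.variance));
    }
    logScores[c] = logp;
  }
  // Normalize in log space so the probabilities sum to 1.
  const max = Math.max(...Object.values(logScores));
  let norm = 0;
  for (const c of classes) norm += Math.exp(logScores[c] - max);
  const probs = {};
  for (const c of classes) probs[c] = Math.exp(logScores[c] - max) / norm;
  return probs;
}

// Made-up trained state, for illustration only.
const state = {
  boy:  { count: 100, fields: { height: { mean: 70, variance: 9 }, weight: { mean: 170, variance: 400 } } },
  girl: { count: 100, fields: { height: { mean: 64, variance: 9 }, weight: { mean: 130, variance: 400 } } }
};
const probs = classify(state, { height: 70, weight: 170 });
console.log(probs);
```

Working with log-probabilities avoids underflow when many fields are multiplied together; the result is one probability per hypothesis (class).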
  • 16. A brief aside...
      • Let's test our classifier
          Select 2000 documents for test
          Randomly choose 1000 documents for training sample
          Remaining documents used for validation
      • Simulate continuous training
          Add documents one at a time
          After each document addition, test on all 1000 of our validation sample
          Record and plot fraction of validation sample properly classified
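The procedure above can be sketched offline. Everything about the data here is an assumption: the documents are synthetic, drawn from made-up Gaussians with a seeded generator, and are not the talk's dataset. The classifier uses equal priors and running sums in the style of `_stats`.

```javascript
// Small seeded LCG so the run is repeatable.
let seed = 42;
function rand() {
  seed = (seed * 1664525 + 1013904223) % 4294967296;
  return seed / 4294967296;
}
// Box-Muller transform; 1 - rand() keeps the log argument above zero.
function normal(mean, sd) {
  const u1 = 1 - rand(), u2 = rand();
  return mean + sd * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
// Synthetic "person" document (distribution parameters are made up).
function makeDoc() {
  return rand() < 0.5
    ? { class: 'boy', height: normal(70, 3), weight: normal(170, 20) }
    : { class: 'girl', height: normal(64, 3), weight: normal(130, 20) };
}

const FIELDS = ['height', 'weight'];
const docs = Array.from({ length: 2000 }, makeDoc);
// Documents are i.i.d., so taking the first 1000 is a random choice.
const training = docs.slice(0, 1000);
const validation = docs.slice(1000);

const stats = {}; // stats[class][field] = { count, sum, sumsqr }
function train(doc) {
  const s = stats[doc.class] || (stats[doc.class] = {});
  for (const f of FIELDS) {
    const t = s[f] || (s[f] = { count: 0, sum: 0, sumsqr: 0 });
    t.count += 1; t.sum += doc[f]; t.sumsqr += doc[f] * doc[f];
  }
}
function classify(doc) {
  let best = null, bestLog = -Infinity;
  for (const c of Object.keys(stats)) {
    let logp = 0;
    for (const f of FIELDS) {
      const t = stats[c][f];
      const mean = t.sum / t.count;
      const variance = t.sumsqr / t.count - mean * mean;
      if (!(variance > 0)) { logp = -Infinity; break; } // too few samples so far
      logp += -0.5 * Math.log(2 * Math.PI * variance) - ((doc[f] - mean) ** 2) / (2 * variance);
    }
    if (logp > bestLog) { bestLog = logp; best = c; }
  }
  return best; // null until some class has a usable variance
}

// Add one training document at a time, then score the validation sample.
const accuracy = [];
for (const doc of training) {
  train(doc);
  let correct = 0;
  for (const v of validation) if (classify(v) === v.class) correct += 1;
  accuracy.push(correct / validation.length);
}
console.log('after 1 doc:', accuracy[0], 'after 1000 docs:', accuracy[999]);
```

Plotting `accuracy` against its index gives the kind of learning curve this aside describes.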
  • 17. A brief aside...
      Dramatic improvement with additional training data
      [Figure: plot; x-axis: Number of documents in the training set]
  • 18. ... and back to the code
  • 19. test it yourself
      • Client side test via node.js
          > ./test.js height=<some number> weight=<some number>
          Classifier runs server side, configured in line 6 of test.js
          Can point this to your DB
  • 20. Running as CouchApp
      create a database (e.g., ‘bitb’) at cloudant.com
      add data
      then push your code
          > couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
      HTML & CSS served directly from BigCouch to the browser
      Heavy lifting of classification done server side
      http://millertime.cloudant.com/bitb/_design/bayes/index.html
  • 21. Running as API (_list)
      > curl 'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true'
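The same request could be built from node. A small sketch follows; the helper name `classifierUrl` is hypothetical, and the response format is not shown on the slides, so the commented-out fetch call just prints whatever comes back.

```javascript
// Hypothetical helper: builds a request URL for the _list endpoint on the slide.
function classifierUrl(base, fields) {
  const params = new URLSearchParams();
  for (const [name, value] of Object.entries(fields)) {
    params.set(name, String(value)); // any numerical field name is passed through
  }
  params.set('format', 'json');
  params.set('group', 'true');
  return base + '?' + params.toString();
}

const url = classifierUrl(
  'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training',
  { height: 65.65, weight: 168.61 }
);
console.log(url);

// With Node 18+ (global fetch), the request itself could look like:
//   fetch(url).then((res) => res.text()).then(console.log);
```

Pointing `base` at your own database runs the classifier against your own training data.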
  • 22. Wrapping Up: Bayes on BigCouch
      • Simple code, powerful results
          Light requirements on data model
              can be relaxed with more complex view code
          Continuous learning is very powerful
              e.g., time-based learning (automatically adapt to changing conditions)
          Classification can be performed client- or server-side
              push documents into DB and they are auto-tagged!
          More sophisticated classifiers easily implemented
              e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
          View Engine allows simple deployment of sophisticated domain libraries in mass parallel
              e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
  • 23. Give it a spin
      Hosting, Management, Support for CouchDB and BigCouch
      http://cloudant.com
      http://github.com/cloudant/bigcouch