Bayes on your (Big)Couch




                      Mike Miller
                      _milleratmit
                      July 25, 2011
I want my app to do _this_




             Mike Miller, Oscon 2011   2
CouchDB in a slide
• Schema-free document database management system
 Documents are JSON objects
 Able to store binary attachments

• RESTful API
 http://wiki.apache.org/couchdb/reference

• Views: Custom, persistent representations of your data
 Incremental MapReduce with results persisted to disk
 Fast querying by primary key (views stored in a B-tree)

• Bi-Directional Replication
 Master-slave and multi-master topologies supported
 Optional ‘filters’ to replicate a subset of the data
 Edge devices (mobile phones, sensors, etc.)
BigCouch = Couch+Scaling
• Open Source, Apache License
• Horizontal Scalability
 Easily add storage capacity by adding more servers
 Computing power (views, compaction, etc.) scales with more servers

• No SPOF
 Any node can handle any request
 Individual nodes can come and go

• Transparent to the Application
 All clustering operations take place “behind the curtain”
 looks (mostly) like a single server instance of CouchDB


...back to making my app smart




Sample Data
 [Figure: scatter plot "Height vs. Weight", height [in] (35 to 80) against weight [lbs] (80 to 220), with separate markers for Girls and Boys]
Naive Bayes Classifier
 [Figure: Gaussian curve (x-axis -3 to 3, y-axis 0 to 0.4) for male height, annotated with "mean male height" and "male height variance"]
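The Gaussian pictured on this slide is the building block of a Gaussian naive Bayes classifier. As a sketch in its standard textbook form (the symbols are our own notation, not the slide's): for a class c and measured fields x_1 ... x_n, each field is modelled by a Gaussian with that class's mean and variance, and the fields are treated as independent.

```latex
\[
P(c \mid x_1,\dots,x_n) \;\propto\; P(c)\,\prod_{f=1}^{n}
\frac{1}{\sqrt{2\pi\sigma_{c,f}^{2}}}
\exp\!\left(-\frac{(x_f-\mu_{c,f})^{2}}{2\sigma_{c,f}^{2}}\right)
\]
```

Here \mu_{c,f} and \sigma_{c,f}^{2} are the mean and variance of field f within class c (for example, the mean and variance of male height), and P(c) is the prior for the class. The class with the largest value is the prediction.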
Implementation Plan
 Model people as documents in CouchDB
 Calculate Means/Variances with MapReduce
 Run classifier in CouchDB as post-MapReduce hook (“_list”)

 [Figure: the "Height vs. Weight" scatter plot from the Sample Data slide, repeated]

 • Note:
  do not need to specify fields to use in classification
  multi-class implementation
  continuous, incremental training! Results improve as training data trickles in.
3 ways to follow along

 couchapp: Python tool to push/pull from other CouchDBs
 > sudo easy_install -U couchapp
 > couchapp clone 'http://millertime.cloudant.com/bitb'
 create an account at cloudant.com
 > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
 > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
 github
 > git clone git@github.com:mlmiller/bayes.git
 CouchDB replication to your cloudant account
 bonus, brings along the data, too!
The Code
 [Figure: screenshot of the code, with these callouts:]
 post-MapReduce hook (“_list” method)
 Classifier (probability calculator)
 client side test via node.js
 view code to calculate means and variances
 you can ignore everything else
Data Model
 [Figure: example document, with these callouts:]
 Arbitrary number of numerical fields allowed
 ‘class’ => training data
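As a rough illustration of this data model, a training document and an unlabeled document might look like the following. The `height` and `weight` fields match the sample data; the `_id` values, the numbers, the class label `'boy'` and the helpers `isTrainingDoc` and `numericFields` are invented for this sketch.

```javascript
// Hypothetical documents. Only the 'class' convention and the use of
// numerical fields come from the slide; everything else is made up.
const trainingDoc = { _id: 'person-001', class: 'boy', height: 70.1, weight: 182.4 };
const unlabeledDoc = { _id: 'person-002', height: 65.65, weight: 168.61 };

// A document counts as training data when it carries a 'class' field.
function isTrainingDoc(doc) {
  return typeof doc.class === 'string';
}

// Collect every numerical field, skipping CouchDB's underscore-prefixed metadata.
function numericFields(doc) {
  return Object.keys(doc).filter(
    (key) => key[0] !== '_' && typeof doc[key] === 'number'
  );
}

console.log(isTrainingDoc(trainingDoc), numericFields(trainingDoc));
console.log(isTrainingDoc(unlabeledDoc), numericFields(unlabeledDoc));
```

Because the fields are discovered rather than listed, adding a third measurement to the documents would need no code change.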
Training via MapReduce
 views/training/map.js
 [Figure: source of map.js; callout: ‘class’ => training data]
 Calculate mean/variance for all numerical fields in a document
 emit: ([<class>, <field>], <value>)
 Reduce: _stats (Erlang builtin)
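The emit pattern on this slide can be sketched as follows. This is not the talk's actual map.js, only a minimal map function in the same spirit; the `emit` stub and the sample documents are stand-ins so the function can be run under node outside CouchDB.

```javascript
// Stand-in for CouchDB's emit(), so the map function can be run under node.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// Sketch of a training map function: for every document that has a
// 'class', emit one row per numerical field, keyed by [class, field].
function map(doc) {
  if (typeof doc.class !== 'string') return;     // not training data
  Object.keys(doc).forEach(function (field) {
    if (field[0] === '_') return;                // skip _id, _rev, ...
    if (typeof doc[field] !== 'number') return;  // numerical fields only
    emit([doc.class, field], doc[field]);
  });
}

// Made-up documents for illustration.
map({ _id: 'a', class: 'girl', height: 63.2, weight: 121.5 });
map({ _id: 'b', height: 70.0, weight: 180.0 }); // no class, so nothing is emitted
console.log(JSON.stringify(emitted));
```

In a design document the matching reduce would simply be the string `_stats`, which CouchDB evaluates natively instead of calling JavaScript.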
Bayes: Trained State
 [Figure: pre-reduce output of the training view]
Bayes: Trained State
 [Figure: view output; callout: Count, Min, Max, Mean, Variance]
 Automatically updated as new training data arrives
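CouchDB's built-in `_stats` reduce returns `sum`, `count`, `min`, `max` and `sumsqr` for each key, so the mean and variance named on the slide can be derived from those values. A minimal sketch, with made-up numbers (three heights of 60, 62 and 64 inches) and a hypothetical helper name:

```javascript
// Shape of a _stats reduce value: sum, count, min, max, sumsqr.
// The numbers are invented: three heights of 60, 62 and 64 inches.
const stats = { sum: 186, count: 3, min: 60, max: 64, sumsqr: 11540 };

// Derive mean and (population) variance from the running totals.
function meanAndVariance(s) {
  const mean = s.sum / s.count;
  // variance = E[x^2] - (E[x])^2
  const variance = s.sumsqr / s.count - mean * mean;
  return { mean, variance };
}

const result = meanAndVariance(stats);
console.log(result.mean, result.variance.toFixed(3));
```

Since `sum`, `count` and `sumsqr` are plain running totals, they can be updated incrementally as documents arrive, which is what makes the continuous training possible.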
Bayes Classifier
 lib/bayes_classifier.js
 [Figure: source of bayes_classifier.js, with these callouts:]
 Load state from DB
 No assumptions on field names
 Calculate prob. for all possible hypotheses
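The three callouts can be sketched as a small Gaussian naive Bayes routine. This is not the contents of lib/bayes_classifier.js, only an illustration under stated assumptions: the trained state is hard-coded with invented numbers instead of being loaded from the database, the class names are placeholders, and equal priors are assumed for all classes.

```javascript
// Trained state: per class and field, the mean and variance from the view.
// All numbers here are invented for illustration.
const state = {
  boy:  { height: { mean: 70, variance: 9 }, weight: { mean: 180, variance: 400 } },
  girl: { height: { mean: 64, variance: 9 }, weight: { mean: 130, variance: 400 } },
};

// Log of the Gaussian density N(x; mean, variance).
function logGaussian(x, mean, variance) {
  return -0.5 * Math.log(2 * Math.PI * variance)
       - ((x - mean) * (x - mean)) / (2 * variance);
}

// Score every class (hypothesis) using whichever numerical fields the sample
// and the trained state share, then normalise to probabilities.
function classify(sample, trained) {
  const logScores = {};
  for (const cls of Object.keys(trained)) {
    let total = 0;
    for (const field of Object.keys(trained[cls])) {
      if (typeof sample[field] !== 'number') continue; // field not supplied
      const { mean, variance } = trained[cls][field];
      total += logGaussian(sample[field], mean, variance);
    }
    logScores[cls] = total;
  }
  // Subtract the maximum before exponentiating to avoid underflow.
  const max = Math.max(...Object.values(logScores));
  const probs = {};
  let norm = 0;
  for (const cls of Object.keys(logScores)) {
    probs[cls] = Math.exp(logScores[cls] - max);
    norm += probs[cls];
  }
  for (const cls of Object.keys(probs)) probs[cls] /= norm;
  return probs;
}

const probabilities = classify({ height: 65.65, weight: 168.61 }, state);
console.log(probabilities);
```

Field names are never spelled out in `classify`; it iterates over whatever the trained state contains, which mirrors the "no assumptions on field names" callout.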
A brief aside...

 • Let's test our classifier
  Select 2000 documents for test
  Randomly choose 1000 documents for training sample
  Remaining documents used for validation

 • Simulate continuous training
  Add documents one at a time
  After each document addition, test on all 1000 of our validation sample
  Record and plot fraction of validation sample properly classified




A brief aside...
 [Figure: fraction of validation sample properly classified vs. number of documents in the training set]
 Dramatic improvement with additional training data
... and back to the code




test it yourself
• Client side test via node.js
 > ./test.js height=<some number> weight=<some number>
 Classifier runs server side, configured in line 6 of test.js
 [Figure: source of test.js; callout: Can point this to your DB]
Running as CouchApp
 create a database (e.g., ‘bitb’) at cloudant.com
 add data
 then push your code
 > couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
 HTML & CSS served directly from BigCouch to the browser
 Heavy lifting of classification done server side

 http://millertime.cloudant.com/bitb/_design/bayes/index.html
Running as API (_list)
 > curl 'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true'
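The same request URL can be assembled programmatically, which is roughly what a node.js client would need to do before calling the endpoint. This sketch only builds the URL with the standard `URL` and `URLSearchParams` classes; it does not send the request, and the variable names are our own.

```javascript
// Base URL of the _list endpoint, taken from the curl command on this slide.
const endpoint = new URL(
  'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training'
);

// Field values to classify, plus the two options the curl command passes.
const params = new URLSearchParams({
  height: '65.65',
  weight: '168.61',
  format: 'json',
  group: 'true',
});
endpoint.search = params.toString();

console.log(endpoint.href);
```

`group=true` is a standard CouchDB view option that asks for one reduced row per distinct key, here one `_stats` row per [class, field] pair, which is the trained state the classifier needs.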
Wrapping Up: Bayes on BigCouch
• Simple code, powerful results
 light requirements on data model
  can be relaxed with more complex view code
 Continuous learning is very powerful
  e.g., time-based learning (automatically adapt to changing conditions)
 Classification can be performed client- or server-side
  push documents into DB and they are auto-tagged!
 More sophisticated classifiers easily implemented
  e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
 View Engine allows simple deployment of sophisticated domain libraries, massively in parallel
  e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
Give it a spin




 Hosting, Management, Support for CouchDB and BigCouch
                  http://cloudant.com
        http://github.com/cloudant/bigcouch
