The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

Data comes in all forms and shapes. Data also evolves as life and people adapt to new situations, and so should your database.

When working with data, traditional relational database systems come to mind because that is how most of us have been trained. However, data is rarely homogeneous, and your database should not force you into a certain schema if your data is not relational.

During this talk we analyse the composition of "documents" in the context of a document-based database, and cover the basic principles of Map-Reduce and its potential use in the context of computational statistics.

What happens when your data no longer fits on one server? How easily can your favourite database expand and adapt to your growing requirements? What is your contingency plan if your server goes down?

We then go over some of the features that CouchDB, Riak and MongoDB provide you with, alongside some of David's personal opinions.

This is an intermediate-level talk. Listeners should have a working knowledge of Bayesian statistics and of standard internet protocols such as HTTP, and a minimal understanding of programming languages such as JavaScript and Erlang, since some of the examples for these databases use those languages.

    Presentation Transcript

    • The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases (Wednesday 27 March 13)
    • David Coallier, @davidcoallier
    • Data Scientist at Engine Yard (.com)
    • RDBMSs
    • Structure, Restrictions, Safety
    • id   name   age   address
      1    david  1     315
      2    divad  3     51
      3    foo    41    31
      4    bar    42    98
      5    john   3315  85
      6    jack   4     11
      7    jill   8     66
      ...  ...    ...   ...
    • What If?
    • id   name   age  address  phone
      1    david  26   IE       353
      2    divad  27   US       1
      3    foo    42   IE       353
      4    bar    31   CA       1
      5    john   17   NZ       131
      6    jack   128  DK       311
      7    jill   21   IE       353
      ...  ...    ...  ...      ...
    • Before Moving On
    • JSON
    • What is JSON?
    • {
        "firstName": "David",
        "lastName": "Coallier",
        "age": 26,
        "address": {
          "streetAddress": "Mansfield House",
          "city": "Crosshaven"
        },
        "phoneNumbers": [
          { "type": "mobile", "number": "0863299999" }
        ]
      }
    • What is HTTP?
    • What is a Schema?
    • Alternative
    • Schema-less
    • Does NOT Mean Structure-less
    • Documents and K-V Buckets
    • CouchDB: Cluster Of Unreliable Commodity Hardware
    • Replication, Attachments, Generated “random” ids, Dictionary, Revisions?, JSON Objects, HTTP, CRUD
    • Documents
    • {
        "_id": "131dafsd1vasd",
        "_rev": "12-fva32asdf",
        "firstName": "David",
        "lastName": "Coallier",
        "age": 26,
        "address": {
          "streetAddress": "Mansfield House",
          "city": "Crosshaven"
        },
        "phoneNumbers": [
          { "type": "mobile", "number": "0863299999" }
        ]
      }
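The `_rev` field in the document above is what CouchDB uses to detect conflicting updates: an update has to carry the revision it was based on, otherwise it is rejected as a conflict. The sketch below is a toy in-memory store, not CouchDB's API. `createToyStore`, the `"-toy"` revision suffix and the shape of the return values are all made up for illustration.

```javascript
// Toy in-memory store, NOT CouchDB's API: it only mimics the _rev check.
function createToyStore() {
  var docs = {};
  return {
    // Save a document; an update must carry the currently stored _rev.
    put: function (doc) {
      var current = docs[doc._id];
      if (current && current._rev !== doc._rev) {
        return { ok: false, error: "conflict" };
      }
      // Revisions here look like "<generation>-toy" (hypothetical format).
      var generation = current ? parseInt(current._rev, 10) + 1 : 1;
      var saved = Object.assign({}, doc, { _rev: generation + "-toy" });
      docs[doc._id] = saved;
      return { ok: true, id: saved._id, rev: saved._rev };
    },
    get: function (id) {
      return docs[id];
    }
  };
}

var store = createToyStore();
var created = store.put({ _id: "131dafsd1vasd", firstName: "David", age: 26 });
// An update based on the latest revision is accepted.
var updated = store.put({ _id: "131dafsd1vasd", _rev: created.rev, firstName: "David", age: 27 });
// A second update based on the now stale revision is rejected.
var stale = store.put({ _id: "131dafsd1vasd", _rev: created.rev, firstName: "David", age: 28 });
```

In this toy, `stale.ok` is `false` and the stored document keeps the age from the accepted update.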
    • How do you find Anything?
    • Map/Reduce
    • ...
    • Riak
    • Dynamo Paper
    • CAP Theorem
    • Key-Value Buckets
    • Differences?
    •                  CouchDB               Riak
      Storage model    append-only           bitcask
      Access           HTTP                  HTTP, PB
      Retrieval        Views (M/R)           M/R, Indexes, Search
      Versioning       Eventual Consistency  Vector Clocks
      Concurrency      No Locking            Client Resolution
      Replication      master/master/slave   replication, clustering
      Scaling in/out   BigCouch              Built-in
      Management       Futon/Fauxton         Riak Control
      http://guide.couchdb.org
      http://downloads.basho.com/papers/bitcask-intro.pdf
    • Map/Reduce
    • Mapper: executed on each document. Reducer: receives the output from the mappers.
    • { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
      { "_id": "...", "_rev": "...", "age": "26" }
      { "_id": "...", "_rev": "...", "age": "42" }
      { "_id": "...", "_rev": "...", "age": "17" }
    • { "age": "32", "heads": "3" }
    • Map: find-ages (run against each of the four documents above)
    • Map: find-ages
      function find_ages(doc) {
        // typeof yields a string, so compare against "undefined"
        if (typeof doc.age !== "undefined") {
          // ages are stored as strings; emit them as numbers
          emit(doc._id, Number(doc.age));
        }
      }
    • Map: find-ages output: 26 32 42 17
    • Map: find-ages → 26 32 42 17 → Reduce: sum
    • Reduce: sum
      function reduce_sum(keys, values, rereduce) {
        // sum() is the helper built into CouchDB's JavaScript view server;
        // naming this function "sum" as well would make it call itself forever
        return sum(values);
      }
    • Map: find-ages → 26 32 42 17 → Reduce: sum → 117
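The find-ages and sum steps above can be tried outside a database with a few lines of plain JavaScript. This is only a sketch: `emit` and `sum` are local stand-ins for the helpers a view server would provide, and the `_id` and `_rev` values are placeholders because the slides elide them.

```javascript
// The four documents from the slides; _id and _rev are placeholder values.
var docs = [
  { _id: "a", _rev: "1-x", age: "32", heads: "3" },
  { _id: "b", _rev: "1-x", age: "26" },
  { _id: "c", _rev: "1-x", age: "42" },
  { _id: "d", _rev: "1-x", age: "17" }
];

// Stand-in for the emit() that a view server hands to map functions.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Stand-in for the built-in sum() helper.
function sum(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Mapper: runs once per document and emits the age as a number.
function findAges(doc) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, Number(doc.age));
  }
}

// Reducer: receives everything the mappers emitted.
function reduceSum(keys, values, rereduce) {
  return sum(values);
}

docs.forEach(findAges);
var total = reduceSum(
  emitted.map(function (row) { return row.key; }),
  emitted.map(function (row) { return row.value; }),
  false
);
console.log(total); // 117
```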
    • So What?
    • The Machines They Lurn.
    • The Problem
    • Statistics Example
    • Mean, Std. Deviation of Age
    • $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
    • $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
    • Mapper: retrieve values, pre-process. Reducer: receive, process further.
    • [ [26, 676], [32, 1024], [42, 1764], [17, 289] ]
    • /**
       * Our mapper function.
       */
      map: function(doc) {
        var age = Number(doc.age); // ages are stored as strings
        emit(null, [age, age * age]);
      }

      /**
       * Our reducer...
       * Note: it returns [mean, standard_deviation], so it is only correct
       * when rereduce is false (all mapped values arrive in a single call).
       */
      reduce: function(keys, values, rereduce) {
        var N = 0;
        var summed = 0;
        var summedSquare = 0;
        for (var i = 0; i < values.length; i++) {
          N += 1;
          summed += values[i][0];
          summedSquare += values[i][1];
        }
        var mean = summed / N;
        var standard_deviation = Math.sqrt((summedSquare / N) - (mean * mean));
        return [mean, standard_deviation];
      }
    • /**
       * Our mapper function.
       */
      map: function(doc) {
        var age = Number(doc.age); // ages are stored as strings
        emit(null, [age, age * age]);
      }

      /**
       * Our reducer... (same result, using the built-in sum() helper;
       * the rereduce caveat is the same)
       */
      reduce: function(keys, values, rereduce) {
        var N = values.length;
        var summed = sum(values.map(function(v) { return v[0]; }));
        var summedSquares = sum(values.map(function(v) { return v[1]; }));
        var mean = summed / N;
        var standard_deviation = Math.sqrt((summedSquares / N) - (mean * mean));
        return [mean, standard_deviation];
      }
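The reducers above ignore the `rereduce` argument, so their output cannot be fed back into another reduce pass. One possible variant, sketched here under the assumption that the final division is done by whoever reads the view, keeps the partial results mergeable by reducing to `[count, sum, sum of squares]`. `reduceStats` and `finalize` are hypothetical names, not part of any library.

```javascript
// Reduce to [count, sum, sumSquares]; partial results of this shape can be merged.
function reduceStats(keys, values, rereduce) {
  var acc = [0, 0, 0];
  values.forEach(function (v) {
    if (rereduce) {
      // v is a partial result [count, sum, sumSquares] from an earlier reduce
      acc[0] += v[0];
      acc[1] += v[1];
      acc[2] += v[2];
    } else {
      // v is a mapped value [age, age * age]
      acc[0] += 1;
      acc[1] += v[0];
      acc[2] += v[1];
    }
  });
  return acc;
}

// Final step, done by the reader of the reduced result.
function finalize(stats) {
  var mean = stats[1] / stats[0];
  var variance = stats[2] / stats[0] - mean * mean;
  return { mean: mean, standardDeviation: Math.sqrt(variance) };
}

// Mapped values from the slides, reduced in two halves and then merged.
var mapped = [[26, 676], [32, 1024], [42, 1764], [17, 289]];
var left = reduceStats(null, mapped.slice(0, 2), false);
var right = reduceStats(null, mapped.slice(2), false);
var merged = reduceStats(null, [left, right], true);
var result = finalize(merged);
console.log(merged, result.mean); // [ 4, 117, 3753 ] 29.25
```

Splitting the input into two halves here only imitates what a database might do when it reduces chunks separately; the merged totals are the same as reducing everything in one call.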
    • Naive Bayes
    • Real Life Fraud
    • $P(x_j = k \mid y = \text{fraudulent})$, $P(x_j = k \mid y = \text{normal})$, $P(y)$
    • We need to: sum the occurrences of $x_j = k$ for each $y$ to calculate $P(x \mid y)$
    • We need: more than 1 mapper.
    • We need 4 mappers
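    • Mapper #1: $\sum_i 1\{x_j = k \mid y = \text{fraudulent}\}$ (counts used for $P(x_j = k \mid y = \text{fraudulent})$)
    • Mapper #2: $\sum_i 1\{x_j = k \mid y = \text{normal}\}$ (counts used for $P(x_j = k \mid y = \text{normal})$)
    • Mapper #3: $\sum_i 1\{y = \text{fraudulent}\}$ (counts used for $P(y = \text{fraudulent})$)
    • Mapper #4: $\sum_i 1\{y = \text{normal}\}$ (counts used for $P(y = \text{normal})$)
    • Reducer: sums up the results for the parameters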
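The four mappers and the summing reducer can be sketched in plain JavaScript. Everything specific here is an assumption made for illustration: the `transactions` records, the `country` feature, the key format and the helper names are not from the talk.

```javascript
// Hypothetical labelled transactions; field names are made up for illustration.
var transactions = [
  { country: "IE", label: "normal" },
  { country: "IE", label: "normal" },
  { country: "US", label: "fraudulent" },
  { country: "IE", label: "fraudulent" },
  { country: "US", label: "normal" }
];

// Mappers #1 and #2: emit 1 for each feature value seen with a given label.
function featureGivenLabelMapper(label) {
  return function (doc, emit) {
    if (doc.label === label) {
      emit("country=" + doc.country + "|" + label, 1);
    }
  };
}

// Mappers #3 and #4: emit 1 for each document carrying a given label.
function labelMapper(label) {
  return function (doc, emit) {
    if (doc.label === label) {
      emit(label, 1);
    }
  };
}

var mappers = [
  featureGivenLabelMapper("fraudulent"),
  featureGivenLabelMapper("normal"),
  labelMapper("fraudulent"),
  labelMapper("normal")
];

// Reducer: sum the emitted 1s per key.
function runMapReduce(docs, mapperList) {
  var counts = {};
  function emit(key, value) {
    counts[key] = (counts[key] || 0) + value;
  }
  mapperList.forEach(function (mapper) {
    docs.forEach(function (doc) { mapper(doc, emit); });
  });
  return counts;
}

var counts = runMapReduce(transactions, mappers);
// Estimated parameters from the counts.
var pIEGivenFraud = counts["country=IE|fraudulent"] / counts["fraudulent"];
var pFraud = counts["fraudulent"] / transactions.length;
```

With this toy data, `pIEGivenFraud` is 0.5 and `pFraud` is 0.4. A real classifier would usually also smooth the counts, which is left out here.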
    • Cluster Analysis
    • k-means
    • Mapper: divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up. Reducer: sum up the sums, get new centroids.
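A single k-means iteration split along those lines might look like the sketch below. It assumes Euclidean distance for d(p,q), two-dimensional points and fixed starting centroids; `mapSubgroup` and `reduceCentroids` are hypothetical helper names.

```javascript
// Euclidean distance d(p, q) between two vectors of equal length.
function distance(p, q) {
  return Math.sqrt(p.reduce(function (acc, pi, i) {
    var d = pi - q[i];
    return acc + d * d;
  }, 0));
}

// Mapper: for one subgroup, assign each point to its nearest centroid
// and return per-centroid partial sums and counts.
function mapSubgroup(points, centroids) {
  var partial = centroids.map(function (c) {
    return { sum: c.map(function () { return 0; }), count: 0 };
  });
  points.forEach(function (p) {
    var nearest = 0;
    centroids.forEach(function (c, j) {
      if (distance(p, c) < distance(p, centroids[nearest])) {
        nearest = j;
      }
    });
    partial[nearest].count += 1;
    p.forEach(function (x, d) { partial[nearest].sum[d] += x; });
  });
  return partial;
}

// Reducer: sum up the partial sums and divide by the counts to get new centroids.
function reduceCentroids(partials, oldCentroids) {
  return oldCentroids.map(function (old, j) {
    var count = 0;
    var sum = old.map(function () { return 0; });
    partials.forEach(function (partial) {
      count += partial[j].count;
      partial[j].sum.forEach(function (x, d) { sum[d] += x; });
    });
    // Keep the old centroid if no point was assigned to it.
    return count === 0 ? old : sum.map(function (x) { return x / count; });
  });
}

var centroids = [[0, 0], [10, 10]];
var subgroups = [
  [[1, 1], [9, 9]],
  [[1, 3], [11, 11]]
];
var partials = subgroups.map(function (group) {
  return mapSubgroup(group, centroids);
});
var newCentroids = reduceCentroids(partials, centroids);
console.log(newCentroids); // [ [ 1, 2 ], [ 10, 10 ] ]
```

In practice the iteration would be repeated with `newCentroids` as input until the centroids stop moving.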