The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases


Data comes in all forms and shapes. Data also evolves as life and people adapt to new situations, and so should your database.

When working with data, traditional relational database systems come to mind because that is how most of us have been trained. However, data is rarely homogeneous, and your database should not force you into a certain schema if your data is not relational.

During this talk we analyse the composition of "documents" in the context of a document-based database, and cover the basic principles of Map-Reduce and its potential use in the context of computational statistics.

What happens when your data no longer fits on one server? How easily can your favourite database expand and adapt to your growing requirements? What is your contingency plan if your server goes down?

We then go over some of the features that CouchDB, Riak and MongoDB provide you with, alongside some of David's personal opinions.

This is an intermediate talk. Listeners should have a working knowledge of Bayesian statistics and of standard internet protocols such as HTTP, and a minimum understanding of programming languages such as JavaScript and Erlang, as some of the examples for those databases use those languages.

Published in: Technology

The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

  1. The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases (Wednesday 27 March 13)
  2. David Coallier, @davidcoallier
  3. Data Scientist at Engine Yard (.com)
  4. RDBMSs
  5. Structure, Restrictions, Safety
  6-10. Table (repeated on slides 6 to 10):
        id    name     age     address
        1     david    1       315
        2     divad    3       51
        3     foo      41      31
        4     bar      42      98
        5     john     3315    85
        6     jack     4       11
        7     jill     8       66
        ...   ...      ...     ...
  11. What If?
  12. Table:
        id    name     age     address    phone
        1     david    26      IE         353
        2     divad    27      US         1
        3     foo      42      IE         353
        4     bar      31      CA         1
        5     john     17      NZ         131
        6     jack     128     DK         311
        7     jill     21      IE         353
        ...   ...      ...     ...        ...
  13. Before Moving on
  14. JSON
  15. What is JSON?
  16. Example:
        {
          "firstName": "David",
          "lastName": "Coallier",
          "age": 26,
          "address": {
            "streetAddress": "Mansfield House",
            "city": "Crosshaven"
          },
          "phoneNumbers": [
            { "type": "mobile", "number": "0863299999" }
          ]
        }
  17. What is HTTP?
  18. What is a Schema?
  19. Alternative
  20. Schema-less
  21. Does NOT Mean Structure-less
  22. Documents and K-V Buckets
  23. CouchDB: Cluster of unreliable commodity hardware
  24. Replication, Attachments, Generated "random" ids, Dictionary, Revisions?, JSON Objects, HTTP CRUD
  25. Documents
  26. (no text on this slide)
  27. Example document:
        {
          "_id": "131dafsd1vasd",
          "_rev": "12-fva32asdf",
          "firstName": "David",
          "lastName": "Coallier",
          "age": 26,
          "address": {
            "streetAddress": "Mansfield House",
            "city": "Crosshaven"
          },
          "phoneNumbers": [
            { "type": "mobile", "number": "0863299999" }
          ]
        }
  28. How do you find Anything?
  29. Map/Reduce
  30. ...
  31. Riak
  32. Dynamo Paper
  33. CAP Theorem
  34. Key-Value Buckets
  35. Differences?
  36. Comparison:
                          CouchDB                 Riak
        Storage Model     append-only             bitcask
        Access            HTTP                    HTTP, PB
        Retrieval         Views (M/R)             M/R, Indexes, Search
        Versioning        Eventual Consistency    Vector Clocks
        Concurrency       No Locking              Client Resolution
        Replication       master/master/slave     replication, clustering
        Scaling In/Out    Big Couch               Built-in
        Management        Futon/Fauxton           Riak Control
        http://guide.couchdb.org
        http://downloads.basho.com/papers/bitcask-intro.pdf
  37. Map/Reduce
  38. Mapper: Executed on document. Reducer: Receives output from mappers.
  39-40. Four documents (repeated on slides 39 and 40):
        { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
        { "_id": "...", "_rev": "...", "age": "26" }
        { "_id": "...", "_rev": "...", "age": "42" }
        { "_id": "...", "_rev": "...", "age": "17" }
  41. { "age": "32", "heads": "3" }
  42. Map: find-ages (over the same four documents)
  43. Map: find-ages
        function find_ages(doc) {
          // typeof returns a string, so compare against 'undefined'.
          if (typeof doc.age !== 'undefined') {
            // Ages are stored as strings; convert so they can be summed.
            emit(doc._id, Number(doc.age));
          }
        }
  44. Map: find-ages (over the same four documents)
  45. Map: find-ages (over the same four documents), output: 26 32 42 17
  46. Map: find-ages: 26 32 42 17. Reduce: sum
  47. Reduce: sum
        function (keys, values, rereduce) {
          // sum() is the helper provided by CouchDB's JavaScript view server.
          return sum(values);
        }
  48. Map: find-ages: 26 32 42 17. Reduce: sum: 117
  49. Mapper: Executed on document. Reducer: Receives output from mappers.
  50. So What?
  51. The Machines They Lurn.
  52. The Problem
  53. Statistics Example
  54. Mean, Std. Deviation, Age
  55. $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
  56. $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
  57. Mapper: Executed on document. Reducer: Receives output from mappers.
  58. Mapper: Retrieve values, pre-process. Reducer: Receive, process further.
  59. The same four documents as on slides 39 and 40.
  60. [ [26, 676], [32, 1024], [42, 1764], [17, 289] ]
  61. Mapper and reducer:
        /**
         * Our mapper function.
         */
        map: function (doc) {
          // Ages are stored as strings in the sample documents; convert first.
          var age = Number(doc.age);
          emit(null, [age, age * age]);
        }
        /**
         * Our reducer...
         * rereduce is ignored here, so the result is only correct when all
         * mapped values reach a single reduce call.
         */
        reduce: function (keys, values, rereduce) {
          var N = 0;
          var summed = 0;
          var summedSquare = 0;
          for (var i = 0; i < values.length; i++) {
            N += 1;
            summed += values[i][0];
            summedSquare += values[i][1];
          }
          var mean = summed / N;
          var standard_deviation = Math.sqrt((summedSquare / N) - (mean * mean));
          return [mean, standard_deviation];
        }
  62. The same mapper as on slide 61, with a more compact reducer:
        /**
         * Our reducer...
         */
        reduce: function (keys, values, rereduce) {
          var N = values.length;
          var summed = sum(values.map(function (v) { return v[0]; }));
          var summedSquares = sum(values.map(function (v) { return v[1]; }));
          var mean = summed / N;
          var standard_deviation = Math.sqrt((summedSquares / N) - (mean * mean));
          return [mean, standard_deviation];
        }
  63. Naive Bayes
  64. Real Life Fraud
  65. P(x_j = k | y = fraudulent), P(x_j = k | y = normal), P(y)
  66. We need to: Sum x_j = k, for each y, to calculate P(x|y)
  67. We need: More than 1 mapper.
  68. We need 4 mappers
  69. Mapper #1: ∑1i P(x_j = k | y = fraudulent)
  70. Mapper #2: ∑1i P(x_j = k | y = normal)
  71. Mapper #3: ∑1i P(y = fraudulent)
  72. Mapper #4: ∑1i P(y = normal)
  73. Reducer: Sums up results for parameters
  74. Cluster Analysis
  75. k-means
  76. Mapper: Divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up. Reducer: Sum up the sums, get new centroids.
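
Slides 38 to 48 describe a mapper that runs once per document and a reducer that receives the mappers' output. A minimal sketch of that flow in plain JavaScript, outside any database, might look like the following. `runView` and its in-memory `emit` callback are hypothetical helpers standing in for what a view engine does, the `_id` values are invented, and the ages are written as numbers rather than the strings shown on the slides.

```javascript
// Sample documents modelled on slides 39-40 (numeric ages and ids are assumptions).
const docs = [
  { _id: "a", age: 32, heads: 3 },
  { _id: "b", age: 26 },
  { _id: "c", age: 42 },
  { _id: "d", age: 17 },
];

// Mapper: called once per document; emits (key, value) pairs.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, doc.age);
  }
}

// Reducer: receives every value emitted by the mappers and adds them up.
function sumReduce(keys, values) {
  return values.reduce((total, v) => total + v, 0);
}

// Hypothetical stand-in for the view engine: run map over all documents,
// collect the emitted rows, then hand keys and values to reduce.
function runView(documents, map, reduce) {
  const rows = [];
  for (const doc of documents) {
    map(doc, (key, value) => rows.push({ key, value }));
  }
  return reduce(rows.map((r) => r.key), rows.map((r) => r.value));
}

const totalAge = runView(docs, findAges, sumReduce);
console.log(totalAge); // 117, the total shown on slide 48
```

Documents without an `age` field emit nothing, so they do not affect the sum.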
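
The reducer on slides 61 and 62 returns [mean, standard deviation] directly, and that output cannot be combined again if the database reduces in several passes (the `rereduce` argument). One possible alternative, offered here as an assumption rather than the talk's method, carries the partial sums [n, Σx, Σx²] through every pass and computes the mean and σ only at the end, using σ² = (Σx²)/n − µ². The names `mapAge`, `reduceStats` and `finalize` are illustrative.

```javascript
// Mapper output for one document: [count, x, x^2].
function mapAge(doc) {
  const age = Number(doc.age); // ages are strings in the sample documents
  return [1, age, age * age];
}

// Reducer: element-wise sum of triples. Its output has the same shape as
// its input, so it can be applied again to its own partial results
// (the rereduce case) without changing the final answer.
function reduceStats(values) {
  return values.reduce(
    (acc, v) => [acc[0] + v[0], acc[1] + v[1], acc[2] + v[2]],
    [0, 0, 0]
  );
}

// Final step: turn [n, sum, sumOfSquares] into mean and population std. deviation.
function finalize([n, sum, sumSq]) {
  const mean = sum / n;
  const variance = sumSq / n - mean * mean;
  return { mean, sd: Math.sqrt(Math.max(variance, 0)) };
}

const ageDocs = [{ age: "32" }, { age: "26" }, { age: "42" }, { age: "17" }];
const mapped = ageDocs.map(mapAge);

// Reduce two partitions separately, then rereduce the partial results.
const partialStats = [
  reduceStats(mapped.slice(0, 2)),
  reduceStats(mapped.slice(2)),
];
const stats = finalize(reduceStats(partialStats));
console.log(stats.mean, stats.sd);
```

Because only sums travel between passes, the two-partition result equals the single-pass result.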
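
Slides 65 to 73 outline Naive Bayes for fraud detection with four mappers and a summing reducer. The sketch below is one reading of that outline: it assumes each mapper emits a count of 1 per matching record and the reducer adds the counts per key. For brevity a single `mapRecord` function plays the role of all four mappers. The record layout (`label`, `features`), the sample data and every function name are invented for illustration, and no smoothing is applied.

```javascript
// Toy training records (invented): one categorical feature per record.
const records = [
  { label: "fraudulent", features: { country: "XX" } },
  { label: "fraudulent", features: { country: "XX" } },
  { label: "fraudulent", features: { country: "IE" } },
  { label: "normal", features: { country: "IE" } },
  { label: "normal", features: { country: "IE" } },
  { label: "normal", features: { country: "XX" } },
  { label: "normal", features: { country: "IE" } },
];

// Map step: emit 1 under "y=<label>" (the role of mappers #3 and #4) and
// 1 under "<feature>=<value>|y=<label>" (the role of mappers #1 and #2).
function mapRecord(record, emit) {
  emit("y=" + record.label, 1);
  for (const [name, value] of Object.entries(record.features)) {
    emit(name + "=" + value + "|y=" + record.label, 1);
  }
}

// Reduce step: sum the emitted 1s per key.
function countByKey(recs) {
  const totals = {};
  for (const r of recs) {
    mapRecord(r, (key, one) => {
      totals[key] = (totals[key] || 0) + one;
    });
  }
  return totals;
}

const counts = countByKey(records);

// P(x_j = k | y) estimated as count(x_j = k, y) / count(y).
function conditional(name, value, label) {
  return (counts[name + "=" + value + "|y=" + label] || 0) / counts["y=" + label];
}

// P(y) estimated as count(y) / N.
function prior(label) {
  return counts["y=" + label] / records.length;
}

// Unnormalised score P(y) * P(x | y) for a new record with country "XX".
const fraudScore = prior("fraudulent") * conditional("country", "XX", "fraudulent");
const normalScore = prior("normal") * conditional("country", "XX", "normal");
console.log(fraudScore > normalScore ? "fraudulent" : "normal");
```

With more than one feature, the per-feature conditionals would be multiplied together, which is the "naive" independence assumption.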
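
Slide 76 describes k-means as map/reduce: mappers work on subgroups of vectors and sum them per centroid, and the reducer sums those sums to get new centroids. A rough sketch of one iteration follows, assuming 2-D points, Euclidean distance for d(p,q), and data already split into two partitions. The sample points, the starting centroids and all function names are hypothetical.

```javascript
// Euclidean distance d(p, q) between two 2-D points.
function distance(p, q) {
  return Math.hypot(p[0] - q[0], p[1] - q[1]);
}

// Mapper: for one partition, assign each point to its nearest centroid
// and keep a running [sumX, sumY, count] per centroid.
function mapPartition(points, centroids) {
  const sums = centroids.map(() => [0, 0, 0]);
  for (const p of points) {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (distance(p, centroids[c]) < distance(p, centroids[best])) best = c;
    }
    sums[best][0] += p[0];
    sums[best][1] += p[1];
    sums[best][2] += 1;
  }
  return sums;
}

// Reducer: add up the partial sums from every mapper, then divide by the
// count to get new centroids. A centroid with no points keeps its position.
function reduceCentroids(partialSums, centroids) {
  return centroids.map((old, c) => {
    let x = 0;
    let y = 0;
    let n = 0;
    for (const sums of partialSums) {
      x += sums[c][0];
      y += sums[c][1];
      n += sums[c][2];
    }
    return n === 0 ? old : [x / n, y / n];
  });
}

const partitions = [
  [[1, 1], [2, 1], [9, 9]],
  [[1, 2], [8, 9], [9, 8]],
];
const initialCentroids = [[0, 0], [10, 10]];
const partialSums = partitions.map((part) => mapPartition(part, initialCentroids));
const updatedCentroids = reduceCentroids(partialSums, initialCentroids);
console.log(updatedCentroids);
```

A full run would repeat this step, feeding `updatedCentroids` back in, until the centroids stop moving.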