Modern Database Systems (for Genealogy)

Discover and identify the ideal storage solution for our needs by examining the history of data storage and modern database systems, including key-value, relational, graph and document databases.

This presentation was given at RootsTech 2013 in March.

Published in: Technology

  1. Modern Database Systems
  2. @spf13, AKA Steve Francia. Chief Evangelist @ [logo], responsible for drivers, integrations, web & docs
  3. What’s the Point?
      • Goal: Discover & identify ideal storage solution for our needs
      • History is important
      • Many options today
      • Document databases are good for Genealogy
  4. History of the World
  5. Over 5500 years ago: 2 People
  6. 1804: 1 Billion People
  7. 1927: 2 Billion People
  8. World Population Growth
  9. World Population Growth (last ~200 years, in billions) [chart: 1804: 1, 1927: 2, 1960: 3, 1974: 4, 1987: 5, 1999: 6, 2012: 7]
  10. Really Big Data: In the last 50 years... over 4 % of the world's people were born... in less than 1 % of the time
  11. History of Databases
  12. 1970
      • Oracle creates the relational database
      • Everyone happily uses it for the next 43 years
  13. What really happened
  14. Let’s start at the beginning
  15. It’s a story about... Storing & Retrieving Information
  16. Even today we still use the same mediums for data storage
  17. With the advent of the computer things really took off
  18. 1960: DBMS Emerges
      • Ordered set of fixed length fields
      • Low level pointer operations (flat files)
      • Most popular was IMS (created at IBM)
      • Shockingly still in use today at IBM & American Airlines
  19. Lots of Problems
      • Complex and inflexible
      • User had to know physical structure of the DB in order to query for information
      • Adding a field to the DB required rewriting the underlying access/modification scheme
      • Records isolated (no relations)
      • Emphasis on records to be processed, not overall structure
  20. 1970: Relational DB
      • Edgar Frank “Ted” Codd
      • Relational Database theory
      • Codd’s 13 rules (aka 12 rules)
  21. 3 HUGE Advantages
      • Data independence from hardware and storage implementation
      • Ability to process more than one record at a time with a single operation
      • Establishing a relationship between records
  22. IBM vs Codd
      • IBM bets on IMS
      • Codd bets on relational DB
      • Eventually 2 relational prototypes emerge
  23. Ingres
      • Built at UC Berkeley
      • Uses QUEL
      • Inspires Sybase & MSSQL
  24. System R
      • Built at IBM
      • Leads to SEQUEL... later SQL
      • Evolved into SQL/DS which evolved into DB2
      • Project concludes that relational model is viable
  25. Oracle
      • Larry Ellison watches IBM
      • Starts Relational Software Inc.
      • Oracle, the 1st commercial RDBMS, released in 1979
      • Beats IBM by 2 years to market
  26. Entity Relationship
      • Proposed by Peter Chen in 1976
      • Focuses on data use and not logical table structure
  27. 1980s
      • RDBMS dominates
      • Some fields (medicine, physics, multimedia) need more than RDBMS offers
      • Object Databases emerge
  28. Object Databases
      • Inspired by Entity Relationship
      • More flexible than relational permits
      • Tightly coupled with OO programming language (C++, later Java)
      • Full object: data & methods stored
  29. 1990s
      • Internet emerges
      • Data demand spikes
      • Databases used for archiving historical data
  30. Early 2000s
      • Internet booms
      • RDBMS fails to scale
      • In desperation we take a step backwards
  31. MemcacheD
      • 1 dimensional
      • No persistence
      • No ACI or D
      • but...
  32. ... FAST
  33. 2005-ish
      • Relational + MemcacheD broken (and we didn’t know it)
      • Scale redefined with high volume & social
      • Infrastructure reinvented with cloud computing & SSDs
  34. Alternatives Emerge
      • Dynamo / Key Value
      • Document
      • Graph
  35. Modern Data Storage
  36. A lot going on. Easiest to define databases in broad terms
      • What is a record? (data model)
      • CAP: CA, AP, CP? (infrastructure model)
  37. Data Storage Structure [diagram: three columns labeled 1D, 2D and nD, built from Key, Value and Value(s) cells]
  38. Database structure [diagram: columns 1D, 2D, nD; entries Key Value, Relational, Document, Dynamo, Graph]
  39. CAP Theorem: Availability, Partitioning, Consistency
  40. CAP Theorem [diagram: App, Node, Node; two links marked x]
  41. CAP Theorem [diagram: triangle with corners Availability, Partition Tolerant, Consistency; sides labeled Inconsistent (Dynamo, Key Value NoSQLs), Intolerant (RDBMS), Unavailable (MongoDB, BigTable)]
  42. Key Value
      • 1 Dimensional storage (tuple)
      • Query key only
      • Bucket index (range) on keys
      • Records cannot be updated, only replaced
      • Often MultiMaster... meaning availability over consistency
      • Partitioning easy thanks to single value
      Cassandra, Redis, MemcacheD, Riak, DynamoDB
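
The key-value properties on this slide (lookup by key only, values replaced rather than updated, partitioning driven by the key alone) can be sketched in plain JavaScript. This is a minimal in-memory illustration, not the API of any product listed; `KeyValueStore` and `partitionFor` are hypothetical names, and the hash is an arbitrary choice.

```javascript
// Hypothetical in-memory key-value store, for illustration only.
class KeyValueStore {
  constructor(partitions = 3) {
    // One Map per partition; a key lives in exactly one of them.
    this.partitions = Array.from({ length: partitions }, () => new Map());
  }

  // Pick a partition from the key alone (simple character-code hash).
  partitionFor(key) {
    let hash = 0;
    for (const ch of String(key)) {
      hash = (hash * 31 + ch.codePointAt(0)) % 1000003;
    }
    return hash % this.partitions.length;
  }

  // "Records cannot be updated, only replaced": put overwrites the whole value.
  put(key, value) {
    this.partitions[this.partitionFor(key)].set(key, value);
  }

  // "Query key only": the only lookup is by exact key.
  get(key) {
    return this.partitions[this.partitionFor(key)].get(key);
  }
}

const store = new KeyValueStore();
store.put("1XYK-KQJ", { first: "john", last: "smith" });
// Changing one field means writing a whole new value under the same key.
store.put("1XYK-KQJ", { first: "johannes", last: "smith" });
console.log(store.get("1XYK-KQJ").first); // johannes
```

Because the partition is derived from the key and nothing else, no query ever needs to consult more than one partition, which is why partitioning is easy in this model.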
  43. Relational
      • 2 Dimensional storage (map)
      • Query any field
      • BTree Indexes
      • Single master meaning consistency > availability
      • Partitioning hard due to transactions & joins
      Oracle, MSSQL, MySQL, PostgreSQL, DB2
  44. Document
      • n Dimensional storage (hash w/ nesting)
      • Query any field at any level
      • BTree Indexes
      • Single master meaning consistency > availability
      • Partitioning easy thanks to richer data model
      MongoDB, CouchDB, RethinkDB
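
"Query any field at any level" can be illustrated without a database. The sketch below walks a dotted path such as `name.middle` through a nested document in plain JavaScript. `getPath` and `findByPath` are hypothetical helpers, the sample documents (including the `EXAMPLE-2` identifier) are made up, and the matching rules are a simplification rather than MongoDB's actual query semantics.

```javascript
// Hypothetical helper: read a nested field by dotted path, e.g. "name.middle".
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// Return documents whose field at `path` equals `expected`,
// or, when the field is an array, contains `expected`.
function findByPath(docs, path, expected) {
  return docs.filter((doc) => {
    const value = getPath(doc, path);
    return Array.isArray(value) ? value.includes(expected) : value === expected;
  });
}

// Made-up sample data for illustration.
const individuals = [
  { AFN: "1XYK-KQJ", name: { first: ["john", "johannes"], middle: "peter", last: ["smith", "sandvik"] } },
  { AFN: "EXAMPLE-2", name: { first: ["mary"], middle: "ann", last: ["smith"] } },
];

console.log(findByPath(individuals, "name.first", "johannes").map((d) => d.AFN)); // [ '1XYK-KQJ' ]
console.log(findByPath(individuals, "name.last", "smith").length); // 2
```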
  45. Graph
      • 1 Dimensional storage... but grouped to appear 2D
      • Differentiated by indexes
      • Large indexes cover many relationships
      • Query time depends on # records returned, not distance to get them
      • Doesn’t require traversing to determine relationship
      Neo4j, about 20 more... nobody talks much about
  46. MongoDB for Genealogy
  47. Right Data Model
  48. Types of genealogy data
      • Events (birth, death, etc)
      • Official records
      • Census
      • Names
      • Relationships
      • Photographs
      • Diaries & letters
      • Ship passenger list
      • Occupation
      • and more
  49. Challenges of genealogy data
      • Lots of possible data points... need flexible schema
      • Multiple versions of same data point (3 different dates for death date, 4 variations on name)
      • Lots of data associated with physical records
      • Multiple versions of same nodes (intelligent nondestructive merge needed)
      • Need to have metadata associated
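
One way to read "multiple versions of same data point" together with "nondestructive merge" is to keep every distinct version and record who contributed it. The following is a minimal sketch under that assumption; `mergeVersions`, the `value`/`contributor` field names, the contributor ids and the second date are all hypothetical and not taken from the slides.

```javascript
// Hypothetical non-destructive merge: keep every distinct version of a data
// point instead of overwriting, and remember who contributed each one.
function mergeVersions(existing, incoming) {
  // Copy first so neither input array (nor its entries) is modified.
  const merged = existing.map((v) => ({ ...v }));
  for (const candidate of incoming) {
    const duplicate = merged.some((v) => v.value === candidate.value);
    if (!duplicate) merged.push({ ...candidate });
  }
  return merged;
}

// Made-up versions of a death date from two contributors.
const deathDatesA = [{ value: "1989-07-14", contributor: "user-a" }];
const deathDatesB = [
  { value: "1989-07-14", contributor: "user-b" },
  { value: "1989-07-16", contributor: "user-b" },
];

const merged = mergeVersions(deathDatesA, deathDatesB);
console.log(merged.length); // 2
console.log(deathDatesA.length); // 1
```

The originals stay untouched, so a bad merge can be undone by discarding the merged copy. A real merge would also need a rule for deciding when two differently formatted values count as the same version.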
  50. [schema diagram]
      • Individual: AFN, Modification Date
      • User: Name, Email Address, Password, Individual_id
      • Events[]: type, date, contributor[], record[]
      • Name: First[], Middle[], Last[]
      • Location: city, state, county, country, coordinates[]
      • Record: contributor, type, thumbnail, content, description, tags[]
  51. Individual

          individual = {
            _id: ObjectId("4f2978dfaa999d9db02618ce"),
            AFN: "1XYK-KQJ",
            name: {
              first: ["john", "johannes"],
              middle: "peter",
              last: ["smith", "sandvik"]
            }
          }

          // dotted field paths must be quoted
          db.individual.find({ "name.first": "john", "name.middle": "peter" })

  52. Individual.Events

          events: [
            // each array element is a document keyed by event type
            { death: {
                date: ISODate("1989-07-14"),
                location: {
                  city: "pensacola", state: "fl", county: "escambia", country: "usa",
                  coordinates: [30.26, 87.12]
                },
                contributor: ObjectId("4eeac...691")
            } }
          ]

          db.individual.find({ "events.death.date": ISODate("1989-07-14") })
          // $near needs a geospatial (2d) index on the queried coordinates field
          db.individual.find({ "events.death.location.coordinates": { $near: [30, 90] } })

  53. Event Versions

          events: [
            { birth: [
                { date: ISODate("1928-04-06"),
                  location: {
                    city: "brattleboro", state: "vt", county: "windham", country: "usa",
                    coordinates: [42.51, 72.34]
                  },
                  contributor: ObjectId("4ee...00000"),
                  records: ObjectId("4ed8a...7b000000") },
                { date: ISODate("1928-04-16"),
                  location: {
                    city: "brattleboro", state: "vt", county: "windham", country: "usa",
                    coordinates: [42.51, 72.34]
                  },
                  contributor: ObjectId("4ee...37bb"),
                  records: ObjectId("4eea...0000c8") }
            ] }
          ]

  54. Query with Versioned Events

          events: [
            { birth: [
                { date: ISODate("1928-04-06") },
                { date: ISODate("1928-04-16") }
            ] }
          ]

          // matches if any version of the birth event has this date
          db.individual.find({ "events.birth.date": ISODate("1928-04-16") })
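
The effect of the versioned-event query (match an individual if any stored version of the event carries the date) can be reproduced in plain JavaScript. This sketch assumes `events` is an array of documents keyed by event type and uses ISO date strings instead of `ISODate` values so that it runs outside the MongoDB shell. `hasEventDate` is a hypothetical helper, not a MongoDB function.

```javascript
// Sample document shaped like the slide: two versions of the birth date.
const individual = {
  AFN: "1XYK-KQJ",
  events: [
    { birth: [{ date: "1928-04-06" }, { date: "1928-04-16" }] },
  ],
};

// Hypothetical helper: true when any version of the named event has the date.
function hasEventDate(doc, eventType, date) {
  return doc.events.some((event) =>
    (event[eventType] || []).some((version) => version.date === date)
  );
}

console.log(hasEventDate(individual, "birth", "1928-04-16")); // true
console.log(hasEventDate(individual, "birth", "1928-05-01")); // false
console.log(hasEventDate(individual, "death", "1928-04-16")); // false
```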
  55. Records

          record1 = {
            _id: ObjectId("4ed8aea7d8562f7d7b"),
            contributor: ObjectId("4eeab...1537bb"),
            type: "birth certificate",
            thumbnail: BinData(0, "/9j/4AAQSkZJ...."),
            content: BinData(0, "j6b/Id11lWqs..."),
            tags: ["NY", "certified"],
            description: "John's birth certificate"
          }

  56. Right Scale
  57. MongoDB: Scale built in
      • Intelligent replication
      • Automatic partitioning of data (user configurable)
      • Horizontal Scale
      • Targeted Queries
      • Parallel Processing
  58. Intelligent Replication [diagram: Node 1 (Secondary), Node 2 (Secondary), Node 3 (Primary); links labeled Heartbeat and Replication]
  59. Scalable Architecture [diagram: three App Servers, three Mongos, three Config Servers, three Shards (each labeled Node 1, Secondary)]
  60. High Availability in Shards [diagram: two Shards; labels Primary, Mongod or Secondary, Secondary; one element marked x]
  61. Targeted Requests [diagram: steps 1 to 4 between Mongos and three Shards]
  62. Parallel processing [diagram: steps 1 to 6; Mongos sends steps 2, 3 and 4 to each of three Shards]
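
The difference between a targeted request and parallel processing can be sketched as a routing decision. In the toy model below, a router sends a query that includes the shard key to a single shard, and sends everything else to all shards and merges the results. The shard layout, the choice of last-name initial as shard key, and the function names are assumptions made for illustration. The sketch also visits shards one after another rather than truly in parallel.

```javascript
// Hypothetical shards, each owning a range of the shard key
// (here: the first letter of a lowercase last name).
const shards = [
  { name: "shard0", min: "a", max: "i", docs: [{ last: "franklin" }, { last: "adams" }] },
  { name: "shard1", min: "j", max: "r", docs: [{ last: "jones" }] },
  { name: "shard2", min: "s", max: "z", docs: [{ last: "smith" }, { last: "sandvik" }] },
];

// Targeted request: the query includes the shard key, so one shard is asked.
function targetedFind(last) {
  const initial = last[0];
  const shard = shards.find((s) => initial >= s.min && initial <= s.max);
  return { asked: [shard.name], docs: shard.docs.filter((d) => d.last === last) };
}

// Scatter-gather request: no shard key, so every shard is asked
// and the router merges the partial results.
function scatterGather(predicate) {
  const partials = shards.map((s) => s.docs.filter(predicate));
  return { asked: shards.map((s) => s.name), docs: partials.flat() };
}

console.log(targetedFind("smith").asked); // [ 'shard2' ]
console.log(scatterGather((d) => d.last.length > 5).docs.length); // 2
```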
  63. Right Feature Set
  64. Broad Feature Set
      • Rich query language
      • Native support for over 12 languages
      • GeoSpatial
      • Text search
      • Aggregation & MapReduce
      • GridFS (distributed & replicated file storage)
      • Integration with Hadoop, Solr & more
  65. Last year I presented on Graph in MongoDB: http://j.mp/XvJ3dl
  66. FamilySearch presented in December 2012: http://j.mp/X03TXp
  67. http://j.mp/X03TXp
  68. http://j.mp/X03TXp
  69. http://j.mp/X03TXp
  70. Questions? http://spf13.com, http://github.com/spf13, @spf13. Download at mongodb.org
