
Sharing a Startup’s Big Data Lessons


  1. 1. Sharing a Startup’s Big Data Lessons: Experiences with non-RDBMS solutions at
  2. 2. Who we are • A search engine • A people search engine • An influencer search engine • Subscription-based
  3. 3. George Stathis, VP Engineering. 14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr’s growth goals.
  4. 4. What’s this talk about? • Share what we know about Big Data/NoSQL: what’s behind the buzz words? • Our reasons and method for picking a NoSQL database • Share the lessons we learned going through the process
  5. 5. Big Data/NoSQL: behind the buzz words
  6. 6. What is Big Data? • 3 Vs: – Volume – Velocity – Variety
  7. 7. What is Big Data? Volume + Velocity • Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools. E.g. big science, astronomy, web logs, atmospheric science, RFID, genomics, sensor networks, biogeochemical, social networks, military surveillance, social data, medical records, internet text and documents, photography archives, internet search indexing, video archives, call detail records, large-scale e-commerce
  8. 8. What is Big Data? Variety • Big Data is varied and unstructured • Traditional static reports • Analytics, exploration & experimentation
  9. 9. What is Big Data? • Scaling data processing cost-effectively
  10. 10. What is NoSQL? • NoSQL ≠ No SQL • NoSQL ≈ Not Only SQL • NoSQL addresses RDBMS limitations; it’s not about the SQL language • RDBMS = static schema • NoSQL = schema flexibility; don’t have to know exact structure before storing
  11. 11. What is Distributed Computing? • Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers • Allows computations to be accomplished in acceptable timeframes • Distributed computation approaches were developed to leverage multiple machines: MapReduce • With MapReduce, the program goes to the data since the data is too big to move
  12. 12. What is MapReduce? (Source: developer.yahoo.com)
  13. 13. What is MapReduce? • MapReduce = batch processing = analytical • MapReduce ≠ interactive • Therefore many NoSQL solutions don’t outright replace warehouse solutions, they complement them • RDBMS is still safe
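The map, shuffle and reduce steps behind the MapReduce slides can be sketched in one process. This is a plain JavaScript illustration, not Hadoop code: the helper names `mapPhase`, `shuffle` and `reducePhase` and the sample sentences are assumptions made for the example.

```javascript
// Single-process sketch of the MapReduce idea (hypothetical helper names,
// not a Hadoop API). A real cluster runs the map step where the data lives.

// Map: turn one input record into a list of [key, value] pairs.
function mapPhase(line) {
  return line
    .toLowerCase()
    .split(/\W+/)
    .filter((word) => word.length > 0)
    .map((word) => [word, 1]);
}

// Shuffle: group all values emitted for the same key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce: collapse the values of one key into a single result.
function reducePhase(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

function wordCount(lines) {
  const pairs = lines.flatMap(mapPhase);
  const result = {};
  for (const [key, values] of shuffle(pairs)) {
    result[key] = reducePhase(key, values);
  }
  return result;
}

const counts = wordCount(["big data is big", "data is varied"]);
console.log(counts);
```

Because each call to `mapPhase` only sees its own line, the map step can be spread over many machines; only the shuffle needs to move data between them.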
  14. 14. What is Big Data? Velocity • In some instances, being able to process large amounts of data in real-time can yield a competitive advantage. E.g. – Online retailers leveraging buying history and click-through data for real-time recommendations • No time to wait for MapReduce jobs to finish • Solutions: streaming processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
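As a rough illustration of the "pre-computing" option, the sketch below updates counters as each event arrives so that a read is a simple lookup instead of a batch job. The `RunningCounts` class and the event shape (`productId`, `type`) are assumptions for the example, not part of any product named in the talk.

```javascript
// Sketch of pre-computing: counters are updated on every incoming event,
// so answering a query is a key lookup rather than a MapReduce run.
// The event shape ({ productId, type }) is invented for this example.
class RunningCounts {
  constructor() {
    this.counts = new Map(); // "productId:type" -> running total
  }

  record(event) {
    const key = `${event.productId}:${event.type}`;
    this.counts.set(key, (this.counts.get(key) || 0) + 1);
  }

  get(productId, type) {
    return this.counts.get(`${productId}:${type}`) || 0;
  }
}

const stats = new RunningCounts();
[
  { productId: "p1", type: "click" },
  { productId: "p1", type: "click" },
  { productId: "p1", type: "purchase" },
  { productId: "p2", type: "click" },
].forEach((event) => stats.record(event));

console.log(stats.get("p1", "click"));
```

The trade-off is that only the aggregates chosen up front are available; a question nobody anticipated still needs a batch pass over the raw events.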
  15. 15. What is Big Data? Data Science • Emergence of Data Science • Data Scientist ≈ Statistician • Possess scientific discipline & expertise • Formulate and test hypotheses • Understand the math behind the algorithms so they can tweak when they don’t work • Can distill the results into an easy-to-understand story • Help businesses gain actionable insights
  16. 16. Big Data Landscape (Source: capgemini.com)
  17. 17. Big Data Landscape (Source: capgemini.com)
  18. 18. Big Data Landscape (Source: capgemini.com)
  19. 19. So what’s Traackr and why did we need a NoSQL DB?
  20. 20. Traackr: context • A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
  21. 21. Traackr: a people search engine Up to 50 keywords per search!
  22. 22. Traackr: a people search engine. Proprietary 3-scale ranking. People as search results. Content aggregated by author
  23. 23. Traackr: 30,000 feet. Acquisition, Processing, Storage & Indexing, Services, Applications
  24. 24. NoSQL is usually associated with “Web Scale” (Volume & Velocity)
  25. 25. Do we fit the “Web scale” profile? • In terms of users/traffic?
  26. 26. Source: compete.com
  27. 27. Source: compete.com
  28. 28. Source: compete.com
  29. 29. Source: compete.com
  30. 30. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
  31. 31. PRIMARY> use traackr
      switched to db traackr
      PRIMARY> db.stats()
      {
        "db" : "traackr",
        "collections" : 12,
        "objects" : 68226121,
        "avgObjSize" : 2972.0800625760330,
        "dataSize" : 202773493971,
        "storageSize" : 221491429671,
        "numExtents" : 199,
        "indexes" : 33,
        "indexSize" : 27472394891,
        "fileSize" : 266623699968,
        "nsSizeMB" : 16,
        "ok" : 1
      }
      Callout: That’s a quarter of a terabyte …
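The "quarter of a terabyte" callout can be checked with a little arithmetic on the figures in the `db.stats()` output. The sketch below assumes the sizes are reported in bytes and uses decimal terabytes (10^12 bytes); the slide does not say which of the size fields the callout refers to.

```javascript
// Figures copied from the db.stats() output on the slide (assumed to be bytes).
const dbStats = {
  objects: 68226121,
  avgObjSize: 2972.080062576033,
  dataSize: 202773493971,
  storageSize: 221491429671,
  indexSize: 27472394891,
  fileSize: 266623699968,
};

const TB = 1e12; // decimal terabyte
const toTB = (bytes) => bytes / TB;

// Sanity check: dataSize should be the object count times the average size.
const estimatedDataSize = dbStats.objects * dbStats.avgObjSize;

console.log("data          ", toTB(dbStats.dataSize).toFixed(3), "TB");
console.log("storage+index ", toTB(dbStats.storageSize + dbStats.indexSize).toFixed(3), "TB");
console.log("files on disk ", toTB(dbStats.fileSize).toFixed(3), "TB");
console.log("objects x avg ", Math.round(estimatedDataSize), "bytes");
```

Raw data comes to roughly 0.2 TB and the files on disk to a little over 0.26 TB, so the callout is a fair round number whichever field it was based on.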
  32. 32. Wait! What? My Synology NAS at home can hold 2TB!
  33. 33. No need for us to track the entire web. Diagram: Influencer Content vs. Web Content (Not at scale :-)
  34. 34. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
  35. 35. Variety view of “Web Scale” Web data is: Heterogeneous Unstructured (text)
  36. 36. Visualization of the Internet, Nov. 23rd 2003 Source: http://www.opte.org/
  37. 37. Data sources are isolated islands of rich data with loose links to one another
  38. 38. How do we build a database that models all possible entities found on the web?
  39. 39. Modeling the web: the RDBMS way
  40. 40. Source: socialbutterflyclt.com
  41. 41. or
  42. 42. Influencer data as JSON:
      {
        "realName": "David Chancogne",
        "title": "CTO",
        "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
        "primaryAffiliation": "Traackr",
        "email": "dchancogne@traackr.com",
        "location": "Cambridge, MA, United States",
        "siteReferences": [
          {
            "siteUrl": "http://twitter.com/dchancogne",
            "metrics": [
              { "value": 216, "name": "twitter_followers_count" },
              { "value": 2107, "name": "twitter_statuses_count" }
            ]
          },
          {
            "siteUrl": "http://traackr.com/blog/author/david",
            "metrics": [
              { "value": 21, "name": "google_inbound_links" }
            ]
          }
        ]
      }
  43. 43. NoSQL = schema flexibility
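A small sketch of what schema flexibility means in practice: two records of different shape sit in the same collection, and the reading code has to tolerate missing fields. The array stands in for a document collection; the second record, its `languages` field and the `metric` helper are invented for illustration.

```javascript
// A "collection" here is just an array: no schema is declared up front.
// Field names follow the influencer JSON on the slide; the second record
// and its `languages` field are invented for illustration.
const influencers = [];

influencers.push({
  realName: "David Chancogne",
  title: "CTO",
  siteReferences: [
    {
      siteUrl: "http://twitter.com/dchancogne",
      metrics: [{ name: "twitter_followers_count", value: 216 }],
    },
  ],
});

// A later record can carry fields the first one never had, and omit others.
influencers.push({
  realName: "Example Blogger",
  siteReferences: [
    {
      siteUrl: "http://example.com/blog",
      metrics: [{ name: "google_inbound_links", value: 21 }],
    },
  ],
  languages: ["en", "fr"], // new attribute, no migration step
});

// Readers must cope with absent fields, since nothing enforces them.
function metric(influencer, name) {
  for (const ref of influencer.siteReferences || []) {
    for (const m of ref.metrics || []) {
      if (m.name === name) return m.value;
    }
  }
  return undefined;
}

console.log(influencers.map((i) => metric(i, "twitter_followers_count")));
```

The flexibility moves the burden from the database to the application: the store accepts anything, so defensive reads like `metric` become the de facto schema.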
  44. 44. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
  45. 45. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data? • In terms of the variety of the data ✓
  46. 46. Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text • Batch processing options
  47. 47. Requirement: text storage. Variable text length: 140-character tweets < big variance < multi-page blog posts
  48. 48. Requirement: text storage. RDBMS’ answer to variable text length: plan ahead for largest value; CLOB/BLOB
  49. 49. Requirement: text storage. Issues with CLOB/BLOB for us: no clue what largest value is; CLOB/BLOB for tweets = wasted space
  50. 50. Requirement: text storage. NoSQL solutions are great for text: no length requirements (automated chunking); limited space overhead
  51. 51. Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text ✓ • Batch processing options
  52. 52. Requirement: batch processing. Some NoSQL solutions come with MapReduce. Source: http://code.google.com/
  53. 53. Requirement: batch processing. MapReduce + RDBMS: possible but proprietary solutions. Usually involves exporting data from RDBMS into a NoSQL system anyway. Defeats data locality benefit of MR
  54. 54. Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text ✓ • Batch processing options ✓ A NoSQL option is the right fit
  55. 55. How did we pick a NoSQL DB?
  56. 56. Bewildering number of options (early 2010). Key/Value Databases: • Distributed hashtables • Designed for high load • In-memory or on-disk • Eventually consistent. Column Databases: • Spreadsheet-like • Key is a row id • Attributes are columns • Columns can be grouped into families. Document Databases: • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema. Graph Databases: • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-based query algorithms
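To make the four families more concrete, here is one fact ("influencer 1234 writes for site abc") stored in the style of each. Everything in the sketch is a simplified in-memory stand-in with made-up identifiers and key formats; none of it is the API of a real datastore.

```javascript
// The same fact in the four families. All identifiers and key formats are
// invented; each structure is an in-memory stand-in, not a real client API.

// Key/Value: an opaque value looked up by key.
const keyValue = new Map([["influencer:1234", '{"name":"Ann","site":"abc"}']]);

// Column: row id -> column family -> column -> value.
const columnStore = {
  "1234": { attributes: { name: "Ann" }, sites: { abc: "1" } },
};

// Document: a nested structure whose fields can be queried.
const documentStore = [{ _id: "1234", name: "Ann", sites: [{ siteId: "abc" }] }];

// Graph: vertices plus labeled edges.
const graph = {
  vertices: ["influencer:1234", "site:abc"],
  edges: [{ from: "influencer:1234", to: "site:abc", label: "writes_for" }],
};

// "Which sites does 1234 write for?" answered against each model.
const fromKeyValue = [JSON.parse(keyValue.get("influencer:1234")).site];
const fromColumns = Object.keys(columnStore["1234"].sites);
const fromDocument = documentStore
  .find((d) => d._id === "1234")
  .sites.map((s) => s.siteId);
const fromGraph = graph.edges
  .filter((e) => e.from === "influencer:1234" && e.label === "writes_for")
  .map((e) => e.to.replace("site:", ""));

console.log(fromKeyValue, fromColumns, fromDocument, fromGraph);
```

Note how the key/value version has to parse the whole value in application code, while the other three expose some structure the store itself can use.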
  58. 58. Trimming options. Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
  59. 59. Trimming options. Memcache: memory-based, we need true persistence
  60. 60. Trimming options. Amazon SimpleDB: not willing to store our data in a proprietary datastore.
  61. 61. Trimming options. Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches
  62. 62. Trimming options. CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes.
  63. 63. Trimming options. Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
  64. 64. Trimming options. MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
  65. 65. Trimming options. Riak: very close but in early 2010, we had adoption questions.
  66. 66. Trimming options. HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the box" secondary indexes through a contrib and support for batch processing using Hadoop/MR.
  67. 67. Lessons Learned. Challenges: Complexity, Missing Features, Problem solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  68. 68. Rewards: Choices. Key/Value Databases: • Distributed hashtables • Designed for high load • In-memory or on-disk • Eventually consistent. Column Databases: • Spreadsheet-like • Key is a row id • Attributes are columns • Columns can be grouped into families. Document Databases: • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema. Graph Databases: • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-based query algorithms
  69. 69. Rewards: Choices Source: capgemini.com
  70. 70. Lessons Learned. Challenges: Complexity, Missing Features, Problem solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  71. 71. When Big-Data = Big Architectures. • Must have an odd number of Zookeeper quorum nodes. • Master/slave architecture means a single point of failure, so you need to protect your master. • Then you can run your Hbase nodes but it’s recommended to co-locate regionservers with hadoop datanodes so you have to manage resources. • Must have a Hadoop HDFS cluster of at least 2x replication factor nodes. • And then we also have to manage the MapReduce processes and resources in the Hadoop layer. Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
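The "odd number of Zookeeper quorum nodes" rule follows from majority voting: an ensemble stays available only while more than half of its nodes are up, so adding a node to reach an even count buys no extra fault tolerance. The sketch below works through that arithmetic and the slide's HDFS sizing rule; the function names are made up, and a replication factor of 3 is assumed purely for the example.

```javascript
// Majority-quorum arithmetic for an ensemble of n voting nodes.
const majority = (n) => Math.floor(n / 2) + 1;
const toleratedFailures = (n) => n - majority(n);

for (const n of [3, 4, 5, 6]) {
  console.log(
    `${n} nodes: quorum of ${majority(n)}, survives ${toleratedFailures(n)} failure(s)`
  );
}

// The slide's HDFS sizing rule: at least 2x the replication factor.
// A replication factor of 3 is an assumption for this example.
const minDataNodes = (replicationFactor) => 2 * replicationFactor;
console.log(`replication 3 -> at least ${minDataNodes(3)} datanodes`);
```

Three and four nodes both survive a single failure, which is why the fourth node is usually considered wasted; for a small team, the node count before any application data is served is the real cost on this slide.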
  72. 72. Source: socialbutterflyclt.com
  73. 73. Jokes aside, no one said open source was easy to use
  74. 74. To be expected • Hadoop/Hbase are designed to move mountains • If you want to move big stuff, be prepared to sometimes use big equipment
  75. 75. What it means to a startup: Development capacity before. Congrats, you are now a sysadmin… Development capacity after.
  76. 76. Lessons Learned. Challenges: Complexity, Missing Features, Problem solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  77. 77. Mapping a saved search to a column store: Name; Ranks; References to influencer records
  78. 78. Mapping a saved search to a column store: Unique key; “attributes” column family for general attributes; “influencerId” column family for influencer ranks and foreign keys
  79. 79. Mapping a saved search to a column store: attribute “name”; influencer ranks can be attribute names as well
  80. 80. Mapping a saved search to a column store: can get pretty long so needs indexing and pagination
  81. 81. Problem: no out-of-the-box row-based indexing and pagination
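A rough picture of the saved-search row and of the pagination that had to be written by hand. The two column family names come from the slides; the row key, influencer ids, ranks and the choice of "influencer id as column name, rank as value" are assumptions for the sketch, and the object is an in-memory stand-in rather than HBase client code.

```javascript
// One saved search as a wide row: row key -> column family -> column -> value.
// Family names come from the slides; the key, ids and ranks are invented.
const savedSearch = {
  key: "search:cloud-computing",
  attributes: { name: "Cloud computing bloggers" },
  influencerId: {
    inf_0001: 3,
    inf_0002: 1,
    inf_0003: 2,
    inf_0004: 5,
    inf_0005: 4,
  },
};

// Application-level pagination: order the columns by rank, then slice.
function page(row, pageNumber, pageSize) {
  const ranked = Object.entries(row.influencerId)
    .sort(([, rankA], [, rankB]) => rankA - rankB)
    .map(([id, rank]) => ({ id, rank }));
  const start = pageNumber * pageSize;
  return ranked.slice(start, start + pageSize);
}

console.log(page(savedSearch, 0, 2));
console.log(page(savedSearch, 2, 2));
```

Sorting the whole row on every request is only workable for short rows, which is why a long result list pushes this logic toward a maintained index.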
  82. 82. Jumping right into the code
  83. 83. Lessons Learned. Challenges: Complexity, Missing Features, Problem solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  84. 84. a few months later…
  85. 85. Need to upgrade to Hbase 0.90 • Making sure to remain on recent code base • Performance improvements • Mostly to get the latest bug fixes. No thanks!
  86. 86. Looks like something is missing
  87. 87. Our DB indexes depend on this!
  88. 88. Let’s get this straight • Hbase no longer comes with secondary indexing out-of-the-box • It’s been moved out of the trunk to GitHub • Where only one other company besides us seems to care about it
  89. 89. Only one other maintainer besides us
  90. 90. What it means to a startup Congrats, you are now an hbase contrib maintainer… Development capacity
  91. 91. Source: socialbutterflyclt.com
  92. 92. Lessons Learned. Challenges: Complexity, Missing Features, Problem solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  93. 93. Homegrown Hbase Indexes Row ids for Posts Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters
  94. 94. Homegrown Hbase Indexes Row ids for Posts Find posts for influencer_id_1234
  95. 95. Homegrown Hbase Indexes Row ids for Posts Find posts for influencer_id_5678
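The prefix trick relies on HBase keeping rows sorted by row key, so all posts of one influencer are contiguous and can be read with a start row (inclusive) and a stop row (exclusive). The sketch below imitates that with a sorted array; the key layout `<influencer id>_<post id>` and the helper names are assumptions, not the actual Traackr schema or the HBase client API.

```javascript
// A sorted array stands in for an HBase table, whose rows are ordered by key.
// The key layout "<influencer id>_<post id>" is an assumption for the sketch.
const rowKeys = [
  "influencer_id_1234_post_001",
  "influencer_id_1234_post_002",
  "influencer_id_5678_post_001",
  "influencer_id_5678_post_002",
  "influencer_id_9999_post_001",
].sort();

// Smallest string that sorts after every key starting with `prefix`:
// bump the last character of the prefix by one.
function stopRowFor(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  return prefix.slice(0, -1) + String.fromCharCode(last + 1);
}

// Scan [startRow, stopRow): start is inclusive, stop is exclusive.
// A real scan seeks to the start row instead of testing every key.
function scan(keys, startRow, stopRow) {
  return keys.filter((k) => k >= startRow && k < stopRow);
}

function postsFor(influencerId) {
  const prefix = `${influencerId}_`;
  return scan(rowKeys, prefix, stopRowFor(prefix));
}

console.log(postsFor("influencer_id_1234"));
console.log(postsFor("influencer_id_5678"));
```

The design choice is that the row key itself is the index: it costs nothing extra to maintain, but it only serves the one access path baked into the key order.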
  96. 96. Homegrown Hbase Indexes • No longer depending on unmaintained code • Work with out-of-the-box Hbase installation
  97. 97. What it means to a startup You are back but you still need to maintain indexing logic Development capacity
  98. 98. a few months later…
  99. 99. Cracks in the data model. Diagram: huffingtonpost.com appears as two separate site records, each with its own “writes for”, “published under” and “authored by” links. First copy: http://www.huffingtonpost.com/arianna-huffington/post_1.html ; http://www.huffingtonpost.com/arianna-huffington/post_2.html ; http://www.huffingtonpost.com/arianna-huffington/post_3.html . Second copy: http://www.huffingtonpost.com/shaun-donovan/post1.html ; http://www.huffingtonpost.com/shaun-donovan/post2.html ; http://www.huffingtonpost.com/shaun-donovan/post3.html
  100. 100. Cracks in the data model. Same diagram, with the note: site records are denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties.
  101. 101. Cracks in the data model. Same diagram, with http://www.huffingtonpost.com/arianna-huffington/post_3.html now grouped under the second copy alongside the shaun-donovan posts. Note: content attribution logic could sometimes mis-attribute posts because of the duplicated data.
  102. 102. Cracks in the data model. Same diagram, with both http://www.huffingtonpost.com/arianna-huffington/post_2.html and http://www.huffingtonpost.com/arianna-huffington/post_3.html now grouped under the second copy. Note: exacerbated when we started tracking people’s content on a daily basis in mid-2011.
  103. 103. Fixing the cracks in the data model: normalize the sites. Diagram: a single huffingtonpost.com record, linked by “writes for”, “published under” and “authored by” to both sets of posts: http://www.huffingtonpost.com/arianna-huffington/post_1.html ; http://www.huffingtonpost.com/arianna-huffington/post_2.html ; http://www.huffingtonpost.com/arianna-huffington/post_3.html ; http://www.huffingtonpost.com/shaun-donovan/post1.html ; http://www.huffingtonpost.com/shaun-donovan/post2.html ; http://www.huffingtonpost.com/shaun-donovan/post3.html
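The sketch below illustrates the general hazard of duplicated records that normalization removes: a change applied to one copy leaves the others stale. It is not a reconstruction of Traackr's attribution logic; the record shapes, the `name` field, the `s1` id and the `contribution` values are all invented for the example.

```javascript
// Denormalized: each influencer record embeds its own copy of the site.
// All field names and values here are illustrative.
const denormalized = {
  "arianna-huffington": { site: { host: "huffingtonpost.com", name: "HuffPost" } },
  "shaun-donovan": { site: { host: "huffingtonpost.com", name: "HuffPost" } },
};

// An update applied to one copy silently leaves the other one stale.
denormalized["arianna-huffington"].site.name = "The Huffington Post";
const namesSeen = new Set(Object.values(denormalized).map((r) => r.site.name));

// Normalized: one site record referenced by id; properties of the
// influencer-to-site relationship stay on the reference itself.
const sites = { s1: { host: "huffingtonpost.com", name: "HuffPost" } };
const normalized = {
  "arianna-huffington": { siteRefs: [{ siteId: "s1", contribution: 0.6 }] },
  "shaun-donovan": { siteRefs: [{ siteId: "s1", contribution: 0.1 }] },
};

sites.s1.name = "The Huffington Post"; // one write, seen by every referrer
const resolvedNames = new Set(
  Object.values(normalized).map((r) => sites[r.siteRefs[0].siteId].name)
);

console.log(namesSeen.size, resolvedNames.size);
```

The normalized form trades the duplicate-data problem for a lookup by `siteId`, which is why the next slide ties normalization to stronger secondary indexing.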
  104. 104. Fixing the cracks in the data model • Normalization requires stronger secondary indexing • Our application layer indexing would need revisiting…again!
  105. 105. What it means to a startup Psych! You are back to writing indexing code. Development capacity
  106. 106. Source: socialbutterflyclt.com
  107. 107. Lessons Learned. Challenges: Complexity, Missing Features, Problem solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  108. 108. Traackr’s Datastore Requirements (Revisited) • Schema flexibility • Good at storing lots of variable length text • Out-of-the-box SECONDARY INDEX support! • Simple to use and administer
  109. 109. NoSQL picking – Round 2 (mid 2011). Key/Value Databases: • Distributed hashtables • Designed for high load • In-memory or on-disk • Eventually consistent. Column Databases: • Spreadsheet-like • Key is a row id • Attributes are columns • Columns can be grouped into families. Document Databases: • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema. Graph Databases: • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-based query algorithms
  110. 110. NoSQL picking – Round 2 (mid 2011). Nope!
  111. 111. NoSQL picking – Round 2 (mid 2011). Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
  112. 112. NoSQL picking – Round 2 (mid 2011). Memcache: still no
  113. 113. NoSQL picking – Round 2 (mid 2011). Amazon SimpleDB: still no.
  114. 114. NoSQL picking – Round 2 (mid 2011). Redis and LinkedIn’s Project Voldemort: still no
  115. 115. NoSQL picking – Round 2 (mid 2011). CouchDB: more mature but still no ad-hoc queries.
  116. 116. NoSQL picking – Round 2 (mid 2011). Cassandra: matured quite a bit, added secondary indexes and batch processing options but more restrictive in its use than other solutions. After the Hbase lesson, simplicity of use was now more important.
  117. 117. NoSQL picking – Round 2 (mid 2011). Riak: strong contender still but adoption questions remained.
  118. 118. NoSQL picking – Round 2 (mid 2011). MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, breeze to use, well documented and fit into our existing code base very nicely.
  119. 119. Lessons Learned. Challenges: Complexity, Missing Features, Problem solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  120. 120. Immediate Benefits • No more maintaining custom application-layer secondary indexing code
  121. 121. What it means to a startup Yay! I’m back! Development capacity
  122. 122. Immediate Benefits • No more maintaining custom application-layer secondary indexing code • Single binary installation greatly simplifies administration
  123. 123. What it means to a startup Honestly, I thought I’d never see you guys again! Development capacity
  124. 124. Immediate Benefits • No more maintaining custom application-layer secondary indexing code • Single binary installation greatly simplifies administration • Our NoSQL could now support our domain model
  125. 125. many-to-many relationship
  126. 126. Modeling an influencer:
      {
        "_id": "770cf5c54492344ad5e45fb791ae5d52",
        "realName": "David Chancogne",
        "title": "CTO",
        "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
        "primaryAffiliation": "Traackr",
        "email": "dchancogne@traackr.com",
        "location": "Cambridge, MA, United States",
        "siteReferences": [
          {
            "siteId": "b31236da306270dc2b5db34e943af88d",
            "contribution": 0.25
          },
          {
            "siteId": "602dc370945d3b3480fff4f2a541227c",
            "contribution": 1.0
          }
        ]
      }
      Callout on "siteReferences": embedded list of references to sites, augmented with influencer-specific site attributes (e.g. percent contribution to content)
  127. 127. Modeling an influencer:
      {
        "_id": "770cf5c54492344ad5e45fb791ae5d52",
        "realName": "David Chancogne",
        "title": "CTO",
        "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
        "primaryAffiliation": "Traackr",
        "email": "dchancogne@traackr.com",
        "location": "Cambridge, MA, United States",
        "siteReferences": [
          {
            "siteId": "b31236da306270dc2b5db34e943af88d",
            "contribution": 0.25
          },
          {
            "siteId": "602dc370945d3b3480fff4f2a541227c",
            "contribution": 1.0
          }
        ]
      }
      Callout on "siteId": indexed for “find influencers connected to site X”
      > db.influencers.ensureIndex({"siteReferences.siteId": 1});
      > db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
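What an index on an embedded array field buys can be imitated in memory: one index entry per array element, so "find influencers connected to site X" becomes a lookup rather than a scan of every document. This sketch only mimics the idea and says nothing about how MongoDB stores its indexes; `buildIndex`, `findBySite` and the second influencer record are hypothetical.

```javascript
// In-memory imitation of an index on "siteReferences.siteId": one entry per
// element of the embedded array. Helper names and the second record are
// invented; this is not MongoDB's internal structure.
const influencers = [
  {
    _id: "770cf5c54492344ad5e45fb791ae5d52",
    realName: "David Chancogne",
    siteReferences: [
      { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
    ],
  },
  {
    _id: "hypothetical-second-influencer",
    realName: "Example Person",
    siteReferences: [
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 0.5 },
    ],
  },
];

// siteId -> list of influencer _ids that reference it.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const ref of doc.siteReferences) {
      if (!index.has(ref.siteId)) index.set(ref.siteId, []);
      index.get(ref.siteId).push(doc._id);
    }
  }
  return index;
}

function findBySite(docs, index, siteId) {
  const ids = new Set(index.get(siteId) || []);
  return docs.filter((doc) => ids.has(doc._id));
}

const siteIndex = buildIndex(influencers);
const connected = findBySite(influencers, siteIndex, "602dc370945d3b3480fff4f2a541227c");
console.log(connected.map((doc) => doc.realName));
```

With the database maintaining this structure on every write, the application no longer carries its own indexing code, which is the benefit the preceding slides emphasize.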
  128. 128. Other Benefits • Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with Hbase. • Simpler backups: Hbase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up. • Great documentation • Great adoption and community
  129. 129. looks like we found the right fit!
  130. 130. We have more of this Development capacity
  131. 131. And less of this Source: socialbutterflyclt.com
  132. 132. Recap & Final Thoughts • 3 Vs of Big Data: – Volume – Velocity – Variety (Traackr) • Big Data technologies are complementary to SQL and RDBMS • Until machines can think for themselves, Data Science will be increasingly important
  133. 133. Recap & Final Thoughts • Be prepared to deal with less mature tech • Be as flexible as the data => fearless refactoring • Importance of ease of use and administration cannot be overstated for a small startup
  134. 134. Q&A
