Sharing a Startup’s Big Data Lessons: Experiences with non-RDBMS solutions at Traackr
Who we are • A search engine • A people search engine • An influencer search engine • Subscription-based
George Stathis, VP Engineering. 14+ years of experience building full-stack web software systems, with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr’s growth goals.
What’s this talk about? • Share what we know about Big Data/NoSQL: what’s behind the buzz words? • Our reasons and method for picking a NoSQL database • Share the lessons we learned going through the process
Big Data/NoSQL: behind the buzz words
What is Big Data?• 3 Vs:  – Volume  – Velocity  – Variety
What is Big Data? Volume + Velocity • Data sets too large, or coming in at too high a velocity, to process using traditional databases or desktop tools. E.g. big science (astronomy, genomics, atmospheric science), web logs, RFID and sensor networks, social networks and social data, medical records, internet text and documents, internet search indexing, photography and video archives, call detail records, military surveillance, large-scale e-commerce.
What is Big Data? Variety • Big Data is varied and unstructured: from traditional static reports to analytics, exploration & experimentation.
What is Big Data? • Scaling data processing cost effectively.
What is NoSQL? • NoSQL ≠ No SQL • NoSQL ≈ Not Only SQL • NoSQL addresses RDBMS limitations; it’s not about the SQL language • RDBMS = static schema • NoSQL = schema flexibility; you don’t have to know the exact structure before storing
What is Distributed Computing? • Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers • Allows computations to be accomplished in acceptable timeframes • Distributed computation approaches were developed to leverage multiple machines: MapReduce • With MapReduce, the program goes to the data, since the data is too big to move
What is MapReduce?Source: developer.yahoo.com
What is MapReduce? • MapReduce = batch processing = analytical • MapReduce ≠ interactive • Therefore many NoSQL solutions don’t outright replace warehouse solutions, they complement them • RDBMS is still safe
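The programming model behind the buzz word can be sketched in a few lines of plain Python: a toy word count, not Hadoop, with illustrative input strings. The map phase emits key/value pairs; the reduce phase groups by key and aggregates.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key, then sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big ideas", "big data tools"]
result = reduce_phase(map_phase(docs))  # {"big": 3, "data": 2, ...}
```

In a real framework the map and reduce tasks run on many machines, with the framework handling the shuffle between them; the batch, non-interactive nature of that pipeline is why MapReduce complements rather than replaces an interactive database.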
What is Big Data? Velocity • In some instances, being able to process large amounts of data in real-time can yield a competitive advantage. E.g. online retailers leveraging buying history and click-through data for real-time recommendations • No time to wait for MapReduce jobs to finish • Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
What is Big Data? Data Science • Emergence of Data Science • Data Scientist ≈ Statistician • Possess scientific discipline & expertise • Formulate and test hypotheses • Understand the math behind the algorithms so they can tweak them when they don’t work • Can distill the results into an easy-to-understand story • Help businesses gain actionable insights
Big Data LandscapeSource: capgemini.com
So what’s Traackr and why did we      need a NoSQL DB?
Traackr: context • A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
Traackr: a people search engine     Up to 50 keywords per search!
Traackr: a people search engine • Proprietary 3-scale ranking • People as search results • Content aggregated by author
Traackr: 30,000 feetAcquisition   Processing   Storage & Indexing   Services   Applications
NoSQL is usually associated with“Web Scale” (Volume & Velocity)
Do we fit the “Web scale” profile?       • In terms of users/traffic?
Source: compete.com
Do we fit the “Web scale” profile?       • In terms of users/traffic?    • In terms of the amount of data?
PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
    "db" : "traackr",
    "collections" : 12,
    "objects" : 68226121,
    "avgObjSize" : 2972.0800625760330,
    "dataSize" : 202773493971,        ← that’s a quarter of a terabyte…
    "storageSize" : 221491429671,
    "numExtents" : 199,
    "indexes" : 33,
    "indexSize" : 27472394891,
    "fileSize" : 266623699968,
    "nsSizeMB" : 16,
    "ok" : 1
}
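As a quick sanity check on the “quarter of a terabyte” callout, the figures in the `db.stats()` output are internally consistent: object count times average object size lands on the reported data size, which is about 0.2 decimal terabytes.

```python
# Figures copied from the db.stats() output above
objects = 68226121
avg_obj_size = 2972.0800625760330   # bytes per document
data_size = 202773493971            # "dataSize" in bytes

terabytes = data_size / 1e12        # ≈ 0.20 TB: a quarter of a terabyte

# objects * avgObjSize should reproduce dataSize to within rounding
consistent = abs(objects * avg_obj_size - data_size) < 0.01 * data_size
```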
Wait! What? My Synology NAS at home can hold 2TB!
No need for us to track the entire web: influencer content is a small subset of web content. Not at scale :-)
Variety view of “Web Scale” • Web data is: heterogeneous, unstructured (text)
Visualization of the Internet, Nov. 23rd 2003          Source: http://www.opte.org/
Data sources are isolated islands of rich data with loose links to one another
How do we build a database that models all possible entities found on the web?
Modeling the web: the RDBMS way
Source: socialbutterflyclt.com
or
{
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "dchancogne@traackr.com",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
        {
            "siteUrl": "http://twitter.com/dchancogne",
            "metrics": [
                { "value": 216, "name": "twitter_followers_count" },
                { "value": 2107, "name": "twitter_statuses_count" }
            ]
        },
        {
            "siteUrl": "http://traackr.com/blog/author/david",
            "metrics": [
                { "value": 21, "name": "google_inbound_links" }
            ]
        }
    ]
}
Influencer data as JSON
NoSQL = schema flexibility
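A minimal sketch of what schema flexibility buys: documents in the same collection need not share fields, so new attributes can be captured as they are discovered, with no table migration. A plain Python list stands in for a document store here; the second record is hypothetical.

```python
# A stand-in for a document collection: no fixed schema to migrate
collection = []

# Two influencer documents with different attributes coexist happily;
# a new field ("twitter_followers_count") appears only where discovered.
collection.append({"realName": "David Chancogne", "title": "CTO"})
collection.append({"realName": "Jane Doe",                # hypothetical record
                   "twitter_followers_count": 216})

# Queries simply ignore documents that lack a field
ctos = [d for d in collection if d.get("title") == "CTO"]
```

In an RDBMS, the second record would have forced either a schema change or a NULL-padded column for every row; in a document store it is just another document.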
Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data? • In terms of the variety of the data ✓
Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text • Batch processing options
Requirement: text storage. Variable text length: 140-character tweets < big variance < multi-page blog posts
Requirement: text storageRDBMS’ answer to variable text length:    Plan ahead for largest value            CLOB/BLOB
Requirement: text storage. Issues with CLOB/BLOB for us: no clue what the largest value is; CLOB/BLOB for tweets = wasted space
Requirement: text storage  NoSQL solutions are great for text:No length requirements (automated             chunking)     ...
Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text ✓ • Batch processing options
Requirement: batch processing Some NoSQLsolutions comewith MapReduce                 Source: http://code.google.com/
Requirement: batch processing. MapReduce + RDBMS: possible, but proprietary solutions. Usually involves exporting data from the RDBMS into a NoSQL system anyway, which defeats the data-locality benefit of MR.
Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text ✓ • Batch processing options ✓ A NoSQL option is the right fit.
How did we pick a NoSQL DB?
Bewildering number of options (early 2010)
Key/Value Databases: distributed hashtables; designed for high load; in-memory or on-disk; eventually consistent.
Column Databases: spreadsheet-like; key is a row id; attributes are columns; columns can be grouped into families.
Document Databases: like key/value; value = document; document = JSON/BSON; JSON = flexible schema.
Graph Databases: graph theory G=(E,V); great for modeling networks; great for graph-based query algorithms.
Trimming options. Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
Trimming options. Memcache: memory-based; we need true persistence.
Trimming options. Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Trimming options. Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
Trimming options. CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
Trimming options. Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
Trimming options. MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
Trimming options. Riak: very close, but in early 2010 we had adoption questions.
Trimming options. HBase: came across as the most mature at the time, with several deployments, a healthy community, “out-of-the-box” secondary indexes through a contrib, and support for batch processing using Hadoop/MR.
Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
Rewards: Choices.
Key/Value Databases: distributed hashtables; designed for high load; in-memory or on-disk; eventually consistent.
Column Databases: spreadsheet-like; key is a row id; attributes are columns; columns can be grouped into families.
Document Databases: like key/value; value = document; document = JSON/BSON; JSON = flexible schema.
Graph Databases: graph theory G=(E,V); great for modeling networks; great for graph-based query algorithms.
Rewards: Choices  Source: capgemini.com
Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
When Big Data = Big Architectures: a minimum number of data nodes to support replication; must have an odd number of ZooKeeper nodes to avoid voting deadlocks; co-locating region servers and job trackers means paying close attention to JVM resources; master/slave architecture, so the master is a SPOF.
Source: socialbutterflyclt.com
Jokes aside, no one said open source           was easy to use
To be expected• Hadoop/Hbase are  designed to move  mountains• If you want to move big  stuff, be prepared to  sometimes u...
What it means to a startup: congrats, you are now a sysadmin… Development capacity before vs. after.
Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
Mapping a saved search to a column store: consider the list name, the ranks of the influencers, and the influencer references.
Mapping a saved search to a column store: each row has a unique key (the list id); general attributes are grouped under an “attributes” column family, so list information can be read without loading all the influencers; influencer references are grouped under an “influencerIds” column family.
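The row layout described above can be sketched as nested dicts, row → column family → column → value (the list name and influencer ids are illustrative values, not real data):

```python
# Sketch of one saved-search row in the column-store model:
# {column family: {column: value}}
alist_row = {
    "attributes": {                 # general list info, readable on its own
        "name": "Cloud Computing Influencers",   # illustrative name
    },
    "influencerIds": {              # one column per influencer reference
        "influencer_id_1234": "rank=1",          # illustrative ids/ranks
        "influencer_id_5678": "rank=2",
    },
}

# Reading only the "attributes" family avoids loading all the influencers
name = alist_row["attributes"]["name"]
```

The column-family split is the point: a request for list metadata touches one family, while the potentially long influencer list lives in another.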
Mapping a saved search to a column store: influencer ranks can be ...
Mapping a saved search to a column store: the influencer list can get pretty long, so it needs indexing and pagination.
Problem: no out-of-the-box row-based      indexing and pagination
Jumping right into the code
Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
a few months later…
Need to upgrade to HBase 0.90 • Making sure to remain on a recent code base • Performance improvements • Mostly to get the latest ...
Looks like something is missing
Our DB indexes depend on this!
Let’s get this straight • HBase no longer comes with secondary indexing out-of-the-box • It’s been moved out of the trunk to a separately maintained contrib project
Only one other maintainer  besides us
What it means to a startup: congrats, you are now an HBase contrib maintainer… Development capacity
Source: socialbutterflyclt.com
Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
Homegrown HBase Indexes. Row ids for posts: rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW.
Homegrown Hbase Indexes       Row ids for Posts                    Find posts for                influencer_id_1234
Homegrown Hbase Indexes       Row ids for Posts                    Find posts for                influencer_id_5678
Homegrown Hbase Indexes• No longer depending on unmaintained code• Work with out-of-the-box Hbase installation
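The homegrown index is just the fact that HBase keeps rows sorted by key: prefix post row ids with the influencer id and a range scan becomes a secondary index. A sketch of the idea, emulating the sorted row keys with a sorted Python list (the ids are illustrative):

```python
import bisect

# HBase stores rows sorted by key; prefixing post row ids with the
# influencer id groups each influencer's posts into a contiguous key range.
rows = sorted([
    "influencer_id_1234|post_001",   # illustrative row ids
    "influencer_id_1234|post_002",
    "influencer_id_5678|post_003",
])

def prefix_scan(sorted_keys, prefix):
    # Emulates an HBase scan with STARTROW = prefix: seek to the first
    # matching key, then read forward until the prefix no longer matches.
    start = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break   # past the end of this influencer's range (STOPROW)
        out.append(key)
    return out

posts = prefix_scan(rows, "influencer_id_1234|")
```

Pagination falls out for free: resume the scan from the last key seen. The trade-off, as the following slides show, is that this indexing logic now lives in the application.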
What it means to a startup: you are back, but you still need to maintain indexing logic. Development capacity
a few months later…
Cracks in the data model (simplified example for posts): a post is “published under” a site like huffingtonpost.com, while an influencer “writes for” the site.
Influencer-to-site relationships were denormalized/duplicated for fast runtime access and for storing relationship properties.
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
The problem was exacerbated when we started tracking people’s content on a daily basis in mid-2011.
Fixing the cracks in the data model: normalize the sites into records of their own.
Fixing the cracks in the data model • Normalization requires stronger secondary indexing • Our application-layer indexing would have to be extended to support it
What it means to a startup Psych! You are back to writing indexing        code.                       Development capacity
Source: socialbutterflyclt.com
Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
Traackr’s Datastore Requirements (Revisited) • Schema flexibility • Good at storing lots of variable length text • Batch processing options • Stronger out-of-the-box secondary indexing
NoSQL picking – Round 2 (mid 2011)
• Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
• Memcache: still memory-based; we still need true persistence.
• CouchDB: more mature, but still no ad-hoc queries.
• Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions; after the HBase lesson, simplicity of use was now more important.
• Riak: still a strong contender, but adoption questions.
• MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
Immediate Benefits• No more maintaining custom application-layer secondary indexing code
What it means to a startup  Yay! I’m back!                   Development capacity
Immediate Benefits • No more maintaining custom application-layer secondary indexing code • Single binary installation greatly simplified operations
What it means to a startup Honestly, I thought  I’d never see you      guys again!                       Development capac...
Immediate Benefits • No more maintaining custom application-layer secondary indexing code • Single binary installation greatly simplified operations • Embedded documents handle our influencer-to-site many-to-many relationship
many-to-many relationship
{
    "_id": "770cf5c54492344ad5e45fb791ae5d52",
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "dchancogne@traackr.com",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
        {
            "siteUrl": "http://twitter.com/dchancogne",
            "metrics": [
                { "value": 216, "name": "twitter_followers_count" },
                { "value": 2107, "name": "twitter_statuses_count" }
            ]
        },
        {
            "siteUrl": "http://traackr.com/blog/author/david",
            "metrics": [
                { "value": 21, "name": "google_inbound_links" }
            ]
        }
    ]
}
Embedded list of site references, augmented with influencer-specific site attributes; the site reference is indexed to answer “find influencers connected to site X”.
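The “find influencers connected to site X” query above can be sketched in plain Python over documents shaped like the one shown (in MongoDB the filter would be a find on the indexed embedded field rather than a scan; the second record below is hypothetical):

```python
# Influencer documents embed a list of site references, per the model above
influencers = [
    {"realName": "David Chancogne",
     "siteReferences": [{"siteUrl": "http://twitter.com/dchancogne"}]},
    {"realName": "Jane Doe",   # hypothetical example record
     "siteReferences": [{"siteUrl": "http://example.com/blog"}]},
]

def find_by_site(docs, site_url):
    # Same spirit as a MongoDB query on the embedded array field,
    # e.g. find({"siteReferences.siteUrl": site_url}) with an index on it
    return [d for d in docs
            if any(r["siteUrl"] == site_url for r in d["siteReferences"])]

matches = find_by_site(influencers, "http://twitter.com/dchancogne")
```

With a secondary index on the embedded site field maintained by the database itself, this lookup no longer requires any homegrown indexing code.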
Other Benefits • Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write them.
looks like we found the right fit!
We have more of this     Development capacity
And less of this Source: socialbutterflyclt.com
Recap & Final Thoughts • 3 Vs of Big Data: Volume, Velocity, Variety (Variety is where Traackr fits) • Big Data technologies are complementary to traditional RDBMS solutions
Recap & Final Thoughts• Be prepared to deal with less mature tech• Be as flexible as the data => fearless  refactoring• Im...
Q&A
Speaker notes
• Big science: Large Hadron Collider (LHC). Sensor networks: forest fire detection. Call detail record: a record of a (billing) event produced by a telecommunication network element.
• Scaling here means maintaining throughput of computation and analysis while data sizes increase: divide up the work across multiple machines.
• Taking a look at the amount of storage we were using as of a month ago in Mongo; this includes indexes.
• The point is that we don’t need to track the entire web: just the subset belonging to influencers!
• There is a different perspective on “Web Scale” that has to do with the nature of the data on the web.
• Take the approach of using a simplified entity model…
• …with semi-structured data storage formats like JSON: facilitate capturing related attribute structures; enable the flexibility of defining new attributes as they are discovered.
• CLOB pre-allocated space.
• Sparse maps.
• This is something we thought we needed back in early 2010: Traackr needs to score its entire DB of influencers on a weekly basis to adjust the weighted averages and stats that drive the scores. This means processing north of 750K sites, over 650K influencers and, soon, millions of posts (post-level attributes).
• Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
• Memcache: memory-based; we need true persistence.
• Amazon SimpleDB: not willing to store our data in a proprietary datastore.
• Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
• CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
• Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
• MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
• Riak: very close, but in early 2010 we had adoption questions.
• HBase: came across as the most mature at the time, with several deployments, a healthy community, “out-of-the-box” secondary indexes through a contrib, and support for batch processing using Hadoop/MR. Hadoop and its maturity were a big reason we picked HBase.
• Had to deal with complexity right from the start: minimum number of data nodes to support replication; odd number of ZooKeeper nodes to avoid voting deadlocks; co-locating region servers means paying close attention to JVM resources; the master is a SPOF; co-locating job trackers means paying close attention to JVM resources.
• Quick overview of how we modeled a list in HBase: saved searches. This is what our customers see. Let’s consider the name, the ranks of the influencers and the influencer references.
• Each row has a unique key: the list id. We would group general attributes under one family of columns appropriately named “attributes”. Benefit: can get list information without loading all the influencers. We would group the influencer references under another family of columns named “influencerIds”.
• Now we can see where the attributes we see on the screen are stored.
• We coded the pagination and indexing features ourselves and contributed them back. Felt really good about it!
• It wasn’t bad enough that we had to write our own code to support our indexing needs; we now had to maintain a third-party code base that was quickly becoming outdated!
• Simplified example for posts.
• Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties.
• Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
• Exacerbated when we started tracking people’s content on a daily basis in mid-2011.
• Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
• CouchDB: more mature, but still no ad-hoc queries.
• Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
• Riak: still a strong contender, but adoption questions.
• MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
• Embedded list of references to sites, augmented with influencer-specific site attributes (e.g. percent contribution to content).
• siteId indexed for “find influencers connected to site X”.

    1. 1. Sharing a Startup’s Big Data LessonsExperiences with non-RDBMS solutions at
    2. 2. Who we are• A search engine• A people search engine• An influencer search engine• Subscription- based
    3. 3. George StathisVP Engineering14+ years of experiencebuilding full-stack websoftware systems with a pastfocus on e-commerce andpublishing. Currentlyresponsible for buildingengineering capability toenable Traackrs growth goals.
    4. 4. What’s this talk about?• Share what we know about Big Data/NoSQL: what’s behind the buzz words?• Our reasons and method for picking a NoSQL database• Share the lessons we learned going through the process
    5. 5. Big Data/NoSQL: behind the buzz words
    6. 6. What is Big Data?• 3 Vs: – Volume – Velocity – Variety
    7. 7. What is Big Data? Volume + Velocity• Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools. E.g. big science Astronomy web logs atmospheric science rfid genomics sensor networks biogeochemical social networks military surveillance social data medical records internet text and documents photography archives internet search indexing video archives call detail records large-scale e-commerce
    8. 8. What is Big Data? Variety• Big Data is varied and unstructuredTraditional static reports Analytics, exploration & experimentation
    9. 9. What is Big Data?• Scaling data processing cost effectively $$$$$ $$$$$$$$ $$$
    10. 10. What is NoSQL?• NoSQL ≠ No SQL• NoSQL ≈ Not Only SQL• NoSQL addresses RDBMS limitations, it’s not about the SQL language• RDBMS = static schema• NoSQL = schema flexibility; don’t have to know exact structure before storing
    11. 11. What is Distributed Computing?• Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers• Allows computations to be accomplished in acceptable timeframes• Distributed computation approaches were developed to leverage multiple machines: MapReduce• With MapReduce, the program goes to the data since the data is too big to move
    12. 12. What is MapReduce?Source: developer.yahoo.com
    13. 13. What is MapReduce?• MapReduce = batch processing = analytical• MapReduce ≠ interactive• Therefore many NoSQL solutions don’t outright replace warehouse solutions, they complement them• RDBMS is still safe 
    14. 14. What is Big Data? Velocity• In some instances, being able to process large amounts of data in real-time can yield a competitive advantage. E.g. – Online retailers leveraging buying history and click- though data for real-time recommendations• No time to wait for MapReduce jobs to finish• Solutions: streaming processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick to read key/value stores (e.g. distributed hashes)
    15. 15. What is Big Data? Data Science• Emergence of Data Science• Data Scientist ≈ Statistician• Possess scientific discipline & expertise• Formulate and test hypotheses• Understand the math behind the algorithms so they can tweak when they don’t work• Can distill the results into an easy to understand story• Help businesses gain actionable insights
    16. 16. Big Data LandscapeSource: capgemini.com
    17. 17. Big Data LandscapeSource: capgemini.com
    18. 18. Big Data LandscapeSource: capgemini.com
    19. 19. So what’s Traackr and why did we need a NoSQL DB?
    20. 20. Traackr: context• A cloud computing company as about to launch a new platform; how does it find the most influential IT bloggers on the web that can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
    21. 21. Traackr: a people search engine Up to 50 keywords per search!
    22. 22. Traackr: a people search engine Proprietary 3-scale rankingPeopleas Contentsearch aggregatedresults by author
    23. 23. Traackr: 30,000 feetAcquisition Processing Storage & Indexing Services Applications
    24. 24. NoSQL is usually associated with“Web Scale” (Volume & Velocity)
    25. 25. Do we fit the “Web scale” profile? • In terms of users/traffic?
    26. 26. Source: compete.com
    27. 27. Source: compete.com
    28. 28. Source: compete.com
    29. 29. Source: compete.com
    30. 30. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
    31. 31. PRIMARY> use traackrswitched to db traackrPRIMARY> db.stats(){ "db" : "traackr", "collections" : 12, "objects" : 68226121, "avgObjSize" : 2972.0800625760330, That’s a quarter of a "dataSize" : 202773493971, terabyte … "storageSize" : 221491429671, "numExtents" : 199, "indexes" : 33, "indexSize" : 27472394891, "fileSize" : 266623699968, "nsSizeMB" : 16, "ok" : 1}
    32. 32. Wait! What? MySynology NAS at homecan hold 2TB!
    33. 33. No need for us to track the entire web Influencer Content Web Content Not at scale :-)
    34. 34. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
    35. 35. Variety view of “Web Scale” Web data is: Heterogeneous Unstructured (text)
    36. 36. Visualization of the Internet, Nov. 23rd 2003 Source: http://www.opte.org/
    37. 37. Data sources areisolated islands of richdata with lose links toone another
    38. 38. How do we build a database thatmodels all possible entities found on the web?
    39. 39. Modeling the web: the RDBMS way
    40. 40. Source: socialbutterflyclt.com
    41. 41. or
    42. 42. { "realName": "David Chancogne", "title": "CTO", "description": "Web. Geek.rnTraackr: http://traackr.comrnPropz: http://propz.me", "primaryAffiliation": "Traackr", "email": "dchancogne@traackr.com", "location": "Cambridge, MA, United States", "siteReferences": [ { "siteUrl": "http://twitter.com/dchancogne", "metrics": [ { "value": 216, "name": "twitter_followers_count" }, { "value": 2107, "name": "twitter_statuses_count" } ] }, { "siteUrl": "http://traackr.com/blog/author/david", "metrics": [ { "value": 21, "name": "google_inbound_links" } ] } ]} Influencer data as JSON
    43. 43. NoSQL = schema flexibility
    44. 44. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
    45. 45. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data? • In terms of the variety of the data ✓
    46. 46. Traackr’s Datastore Requirements• Schema flexibility ✓• Good at storing lots of variable length text• Batch processing options
    47. 47. Requirement: text storage Variable text length: 140 multi-pagecharacter < big variance < tweets blog posts
    48. 48. Requirement: text storageRDBMS’ answer to variable text length: Plan ahead for largest value CLOB/BLOB
    49. 49. Requirement: text storage Issues with CLOB/BLOG for us: No clue what largest value isCLOB/BLOB for tweets = wasted space
    50. 50. Requirement: text storage NoSQL solutions are great for text:No length requirements (automated chunking) Limited space overhead
    51. 51. Traackr’s Datastore Requirements• Schema flexibility ✓• Good at storing lots of variable length text ✓• Batch processing options
    52. 52. Requirement: batch processing Some NoSQLsolutions comewith MapReduce Source: http://code.google.com/
    53. 53. Requirement: batch processing MapReduce + RDBMS: Possible but proprietary solutionsUsually involves exporting data fromRDBMS into a NoSQL system anyway. Defeats data locality benefit of MR
    54. 54. Traackr’s Datastore Requirements• Schema flexibility ✓• Good at storing lots of variable length text ✓• Batch processing options ✓ A NoSQL option is the right fit
    55. 55. How did we pick a NoSQL DB?
    56. 56. Bewildering number of options (early 2010) Key/Value Databases Column Databases • Distributed hashtables • Spread sheet like • Designed for high load • Key is a row id • In-memory or on-disk • Attributes are columns • Eventually consistent • Columns can be grouped into families Document Databases Graph Databases • Like Key/Value • Graph Theory G=(E,V) • Value = Document • Great for modeling • Document = JSON/BSON networks • JSON = Flexible Schema • Great for graph-based query algorithms
    57. 57. Bewildering number of options (early 2010) Key/Value Databases Column Databases • Distributed hashtables • Spread sheet like • Designed for high load • Key is a row id • In-memory or on-disk • Attributes are columns • Eventually consistent • Columns can be grouped into families Document Databases Graph Databases • Like Key/Value • Graph Theory G=(E,V) • Value = Document • Great for modeling • Document = JSON/BSON networks • JSON = Flexible Schema • Great for graph-based query algorithms
    58. 58. Trimming optionsKey/Value Databases Column Databases• Distributed hashtables while•weSpread sheet like Graph Databases: can model• • Key is a row Designed for high as a graph we don’t want to id our domain load• pigeonhole ourselves into this structure. columns In-memory or on-disk • Attributes are• Eventually consistent use these tools for can be grouped We’d rather • Columns specialized data analysis but not as the into families main data store.Document Databases Graph Databases• Like Key/Value • Graph Theory G=(E,V)• Value = Document • Great for modeling• Document = JSON/BSON networks• JSON = Flexible Schema • Great for graph-based query algorithms
    59. 59. Trimming optionsKey/Value Databases Column Databases Memcache: memory-based,• Distributed hashtables • Spread sheet like we need true persistence• Designed for high load • Key is a row id• In-memory or on-disk • Attributes are columns• Eventually consistent • Columns can be grouped into familiesDocument Databases Graph Databases• Like Key/Value • Graph Theory G=(E,V)• Value = Document • Great for modeling• Document = JSON/BSON networks• JSON = Flexible Schema • Great for graph-based query algorithms
    60. 60. Trimming optionsKey/Value Databases Column Databases• Distributed hashtables • Spread sheet like• Designed for high load • Key is a row id• In-memory or on-disk • Attributes are columns• Eventually consistent • Columns can be grouped Amazon SimpleDB: not willing to store our data in into families a proprietary datastore.Document Databases Graph Databases• Like Key/Value • Graph Theory G=(E,V)• Value = Document • Great for modeling• Document = JSON/BSON networks• JSON = Flexible Schema • Great for graph-based query algorithms
61. Trimming options

Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent

Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families

Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema

Graph Databases
• Graph Theory G=(V,E)
• Great for modeling networks
• Great for graph-based query algorithms

Redis and LinkedIn's Project Voldemort: not willing to store our data in a proprietary datastore; no query filters; better used as queues or distributed caches.

62. Trimming options (same database-category overview)
CouchDB: no ad-hoc queries; maturity questions in early 2010 made us shy away, although we did try early prototypes.

63. Trimming options (same database-category overview)
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (those came later on).

64. Trimming options (same database-category overview)
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.

65. Trimming options (same database-category overview)
Riak: very close, but in early 2010 we had adoption questions.

66. Trimming options (same database-category overview)
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib, and support for batch processing using Hadoop/MR.
67. Lessons Learned
Challenges: Complexity; Missing Features; Problem/solution fit; Resources
Rewards: Choices; Empowering; Community; Cost
68. Rewards: Choices
(The database-category overview again: Key/Value, Column, Document and Graph databases.)

69. Rewards: Choices
Big Data landscape. Source: capgemini.com
70. Lessons Learned
Challenges: Complexity; Missing Features; Problem/solution fit; Resources
Rewards: Choices; Empowering; Community; Cost
71. When Big Data = Big Architectures
• A master/slave architecture means a single point of failure, so you need to protect your master.
• You must have an odd number of Zookeeper quorum nodes.
• Then you can run your HBase nodes, but it's recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
• You must have a Hadoop HDFS cluster of at least 2x the replication factor in nodes.
• And then you also have to manage the MapReduce processes and resources in the Hadoop layer.
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

72. Source: socialbutterflyclt.com

73. Jokes aside, no one said open source was easy to use

74. To be expected
• Hadoop/HBase are designed to move mountains
• If you want to move big stuff, be prepared to sometimes use big equipment

75. What it means to a startup
Development capacity before vs. development capacity after: congrats, you are now a sysadmin…
76. Lessons Learned
Challenges: Complexity; Missing Features; Problem/solution fit; Resources
Rewards: Choices; Empowering; Community; Cost
77. Mapping a saved search to a column store
Name; ranks; references to influencer records.

78. Mapping a saved search to a column store
Unique key; an "attributes" column family for general attributes; an "influencerId" column family for influencer ranks and foreign keys.

79. Mapping a saved search to a column store
Diagram labels: attribute "name"; influencer ranks can be attribute names as well.

80. Mapping a saved search to a column store
The list of references can get pretty long, so it needs indexing and pagination.
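The wide-row layout described above can be sketched in Python (all key, family and column names here are hypothetical illustrations, not Traackr's actual schema): one row key per saved search, an "attributes" family for general fields, and an "influencerId" family with one column per ranked influencer, which is exactly why a long result list needs pagination.

```python
# A saved search modeled as a wide column-store row (names hypothetical).
saved_search = {
    "row_key": "search_0001",
    "attributes": {"name": "cloud computing", "keywords": "iaas,paas,saas"},
    "influencerId": {            # column qualifier -> rank, one per influencer
        "influencer_id_1234": 1,
        "influencer_id_5678": 2,
        "influencer_id_9012": 3,
    },
}

def page(family, page_size, page_num):
    """Paginate a wide row's columns: sort the qualifiers (column stores
    like HBase keep columns sorted within a row) and slice out one page."""
    cols = sorted(family.items())
    start = page_num * page_size
    return cols[start:start + page_size]

# First page of two influencer columns
print(page(saved_search["influencerId"], page_size=2, page_num=0))
```

This is only a dictionary simulation of the idea; a real column store would do the slicing server-side with a column-range filter.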
81. Problem: no out-of-the-box row-based indexing and pagination

82. Jumping right into the code
83. Lessons Learned
Challenges: Complexity; Missing Features; Problem/solution fit; Resources
Rewards: Choices; Empowering; Community; Cost
84. a few months later…

85. Need to upgrade to HBase 0.90
• Making sure to remain on a recent code base
• Performance improvements
• Mostly to get the latest bug fixes
No thanks!

86. Looks like something is missing

87. Our DB indexes depend on this!

88. Let's get this straight
• HBase no longer comes with secondary indexing out-of-the-box
• It's been moved out of the trunk to GitHub
• Where only one other company besides us seems to care about it

89. Only one other maintainer besides us

90. What it means to a startup
Congrats, you are now an HBase contrib maintainer…
Development capacity

91. Source: socialbutterflyclt.com
92. Lessons Learned
Challenges: Complexity; Missing Features; Problem/solution fit; Resources
Rewards: Choices; Empowering; Community; Cost
93. Homegrown HBase Indexes
Row ids for posts: rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters.

94. Homegrown HBase Indexes
Row ids for posts: find posts for influencer_id_1234.

95. Homegrown HBase Indexes
Row ids for posts: find posts for influencer_id_5678.

96. Homegrown HBase Indexes
• No longer depending on unmaintained code
• Works with an out-of-the-box HBase installation
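The row-key scheme above can be sketched in Python (the key format "influencer_id|post_id" is a hypothetical illustration of the prefix idea, not the deck's exact scheme): because the store keeps rows sorted by key, a range scan from the prefix to the prefix plus a maximal byte returns exactly one influencer's posts, which is the effect of HBase's STARTROW/STOPROW.

```python
import bisect

# Simulated sorted row-key space, as HBase keeps rows sorted by key.
# Hypothetical keys: "<influencer_id>|<post_id>", so all of one
# influencer's posts are contiguous in the key space.
rows = sorted([
    "influencer_id_1234|post_001",
    "influencer_id_1234|post_002",
    "influencer_id_5678|post_001",
    "influencer_id_9999|post_042",
])

def prefix_scan(rows, prefix):
    """Equivalent of an HBase scan with STARTROW=prefix and a STOPROW
    just past the prefix: return all keys starting with the prefix."""
    start = bisect.bisect_left(rows, prefix)
    stop = bisect.bisect_left(rows, prefix + "\xff")
    return rows[start:stop]

# Both posts for influencer 1234, none of the others
print(prefix_scan(rows, "influencer_id_1234|"))
```

The same binary-search-over-sorted-keys behavior is what makes prefix scans cheap in HBase: no secondary index table is needed when the lookup key can be embedded as a row-id prefix.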
97. What it means to a startup
You are back, but you still need to maintain indexing logic.
Development capacity

98. a few months later…
99. Cracks in the data model
Diagram: huffingtonpost.com, with "writes for", "published under" and "authored by" relationships linking Arianna Huffington to
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
and Shaun Donovan to
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html

100. Cracks in the data model
Same diagram: the site data is denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties.

101. Cracks in the data model
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.

102. Cracks in the data model
Exacerbated when we started tracking people's content on a daily basis in mid-2011.

103. Fixing the cracks in the data model
Normalize the sites: a single huffingtonpost.com record that both influencers "write for", with each post "published under" the site and "authored by" the right person.

104. Fixing the cracks in the data model
• Normalization requires stronger secondary indexing
• Our application-layer indexing would need revisiting… again!
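A minimal sketch of the shift the slides describe (field names here are hypothetical, not Traackr's schema): instead of copying site attributes into every influencer record, each influencer keeps only a site id, and a secondary index over those ids answers "who writes for site X" without duplicated data to drift out of sync.

```python
# Normalized model: one site record, referenced by id (names hypothetical).
sites = {"site_1": {"url": "huffingtonpost.com"}}

influencers = [
    {"id": "arianna", "site_refs": [{"site_id": "site_1", "contribution": 1.0}]},
    {"id": "shaun",   "site_refs": [{"site_id": "site_1", "contribution": 0.5}]},
]

# The "stronger secondary indexing" normalization requires:
# an inverted index from site_id to the influencers referencing it.
site_index = {}
for inf in influencers:
    for ref in inf["site_refs"]:
        site_index.setdefault(ref["site_id"], []).append(inf["id"])

# Every influencer connected to huffingtonpost.com, with the site's
# attributes stored exactly once in `sites`.
print(site_index["site_1"])
```

In the application-layer HBase version this index had to be built and maintained by hand; the later MongoDB move pushed it down into the database as a multikey secondary index.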
105. What it means to a startup
Psych! You are back to writing indexing code.
Development capacity

106. Source: socialbutterflyclt.com
107. Lessons Learned
Challenges: Complexity; Missing Features; Problem/solution fit; Resources
Rewards: Choices; Empowering; Community; Cost
108. Traackr's Datastore Requirements (Revisited)
• Schema flexibility
• Good at storing lots of variable-length text
• Out-of-the-box SECONDARY INDEX support!
• Simple to use and administer
109. NoSQL picking – Round 2 (mid 2011)
(The database-category overview again: Key/Value, Column, Document and Graph databases.)

110. NoSQL picking – Round 2 (mid 2011)
Nope! (callout over the Column Databases quadrant)
111. NoSQL picking – Round 2 (mid 2011)
Graph databases: we looked at Neo4j a bit closer but passed again for the same reasons as before.

112. NoSQL picking – Round 2 (mid 2011)
Memcache: still no.

113. NoSQL picking – Round 2 (mid 2011)
Amazon SimpleDB: still no.

114. NoSQL picking – Round 2 (mid 2011)
Redis and LinkedIn's Project Voldemort: still no; not willing to store our data in a proprietary datastore.

115. NoSQL picking – Round 2 (mid 2011)
CouchDB: more mature, but still no ad-hoc queries.

116. NoSQL picking – Round 2 (mid 2011)
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.

117. NoSQL picking – Round 2 (mid 2011)
Riak: still a strong contender, but adoption questions remained.

118. NoSQL picking – Round 2 (mid 2011)
MongoDB: matured by leaps and bounds; increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and it fit into our existing code base very nicely.
119. Lessons Learned
Challenges: Complexity; Missing Features; Problem/solution fit; Resources
Rewards: Choices; Empowering; Community; Cost
120. Immediate Benefits
• No more maintaining custom application-layer secondary indexing code

121. What it means to a startup
Yay! I'm back!
Development capacity

122. Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
• Single binary installation greatly simplifies administration

123. What it means to a startup
Honestly, I thought I'd never see you guys again!
Development capacity

124. Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
• Single binary installation greatly simplifies administration
• Our NoSQL could now support our domain model

125. many-to-many relationship
126. Modeling an influencer

{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "dchancogne@traackr.com",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    {
      "siteId": "b31236da306270dc2b5db34e943af88d",
      "contribution": 0.25
    },
    {
      "siteId": "602dc370945d3b3480fff4f2a541227c",
      "contribution": 1.0
    }
  ]
}

Embedded list of references to sites, augmented with influencer-specific site attributes (e.g. percent contribution to content).

127. Modeling an influencer
The same document; siteReferences.siteId is indexed for "find influencers connected to site X":

> db.influencers.ensureIndex({"siteReferences.siteId": 1});
> db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
128. Other Benefits
• Ad-hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
• Great documentation
• Great adoption and community

129. Looks like we found the right fit!

130. We have more of this
Development capacity

131. And less of this
Source: socialbutterflyclt.com
132. Recap & Final Thoughts
• 3 Vs of Big Data:
  – Volume
  – Velocity
  – Variety ← Traackr
• Big Data technologies are complementary to SQL and RDBMS
• Until machines can think for themselves, Data Science will be increasingly important

133. Recap & Final Thoughts
• Be prepared to deal with less mature tech
• Be as flexible as the data => fearless refactoring
• The importance of ease of use and administration cannot be overstated for a small startup

134. Q&A
