SQL or NoSQL, that is the question!


Origins of NoSQL movement, description of approaches and different trade-offs new databases make

Published in: Technology
  • What even is Silicon Valley? Why are we even asking this? Do politicians mean it literally? I will talk about the positive sides. Actually, I wanted to tell a different story.

    1. 1. SQL or NoSQL, that is the question! October 2011 Andraž Tori, CTO at Zemanta @andraz andraz@zemanta.com
    2. 2. Answering <ul><li>- Why NoSQL? </li></ul><ul><li>- What is NoSQL? </li></ul><ul><li>- How does it work? </li></ul>
    3. 3. SQL is awesome! <ul><li>- Structured Query Language </li></ul><ul><li>- ACID </li></ul><ul><ul><li>Atomicity, Consistency, Isolation, Durability </li></ul></ul><ul><li>- Predictable </li></ul><ul><li>- Schema </li></ul><ul><li>- Based on relational algebra </li></ul><ul><li>- Standardized </li></ul>
    4. 4. No, really, it's awesome! <ul><li>- Hardened </li></ul><ul><li>- Free and commercial choices </li></ul><ul><ul><li>- MySQL, PostgreSQL, Oracle, DB2, MS SQL... </li></ul></ul><ul><li>- Commercial support </li></ul><ul><li>- Tooling </li></ul><ul><li>- Everyone knows it </li></ul><ul><li>- It's mature! </li></ul>
    5. 6. So this is the end, right?
    6. 7. Why the heck would someone not want SQL?
    7. 8. Why not to use SQL? <ul><li>- Clueless self-taught programmers who use text files </li></ul><ul><li>- NIH - Not Invented Here syndrome. And I want to design my own CPU! </li></ul><ul><li>- Because it's hard! </li></ul><ul><li>- I can't afford it </li></ul><ul><li>- “This app was first ported from Clipper to DBase” </li></ul>
    8. 9. Some other perspectives...
    9. 10. Let's say ...
    10. 11. You are a big tech company, located on the West Coast of the USA
    11. 13. You are... <ul><li>- a big international web company based in San Francisco </li></ul><ul><li>- 5 data centers around the world </li></ul><ul><li>- Petabytes of data behind the service </li></ul><ul><li>- A day of downtime costs you at least millions </li></ul><ul><li>- And it's not a question of if, but when </li></ul>
    12. 14. You want to <ul><li>- keep the service up no matter what </li></ul><ul><li>- have it fast </li></ul><ul><li>- deal with humongous amounts of data </li></ul><ul><li>- enable your engineers to make great stuff </li></ul>
    13. 15. You are...
    14. 16. Some interesting constraints <ul><li>Amazon claims that just an extra one tenth of a second in response time costs them 1% in sales. </li></ul>
    15. 17. So... <ul><li>- Some pretty big and important problems </li></ul><ul><li>- And the brightest engineers in the world </li></ul><ul><li>- Who loooove to build stuff </li></ul><ul><li>- Sooner or later even an Oracle RAC cluster is not enough </li></ul>
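    16. 18. Numbers everybody should know! (Jeff Dean, at his famous Stanford talk) <ul><li>- L1 cache reference: 0.5 ns </li></ul><ul><li>- Branch mispredict: 5 ns </li></ul><ul><li>- L2 cache reference: 7 ns </li></ul><ul><li>- Mutex lock/unlock: 25 ns </li></ul><ul><li>- Main memory reference: 100 ns </li></ul><ul><li>- Compress 1K bytes w/ cheap algorithm: 3,000 ns </li></ul><ul><li>- Send 2K bytes over 1 Gbps network: 20,000 ns </li></ul><ul><li>- Read 1 MB sequentially from memory: 250,000 ns </li></ul><ul><li>- Round trip within same datacenter: 500,000 ns </li></ul><ul><li>- Disk seek: 10,000,000 ns </li></ul><ul><li>- Read 1 MB sequentially from disk: 20,000,000 ns </li></ul><ul><li>- Send packet CA->Netherlands->CA: 150,000,000 ns </li></ul>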
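To get a feel for these numbers, a small sketch that turns a few of them into ratios. The figures are copied from the slide; the dictionary keys and helper names (`ratio`, `max_sequential_per_second`) are made up for this illustration.

```python
# Latency figures copied from the slide, in nanoseconds.
# The key names are hypothetical labels chosen for this sketch.
LATENCY_NS = {
    "main_memory_reference": 100,
    "read_1mb_from_memory": 250_000,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
    "read_1mb_from_disk": 20_000_000,
    "ca_netherlands_ca_round_trip": 150_000_000,
}


def ratio(slow: str, fast: str) -> float:
    """How many times longer `slow` takes compared to `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


def max_sequential_per_second(operation: str) -> float:
    """Upper bound on back-to-back operations per second (no parallelism)."""
    return 1_000_000_000 / LATENCY_NS[operation]


if __name__ == "__main__":
    print("disk seek vs. memory reference:", ratio("disk_seek", "main_memory_reference"))
    print("1 MB from disk vs. from memory:", ratio("read_1mb_from_disk", "read_1mb_from_memory"))
    print("disk seeks per second:", max_sequential_per_second("disk_seek"))
```

With these figures a single disk can do roughly a hundred seeks per second, which is one reason so many of the systems later in the deck avoid seeks entirely.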
    17. 19. Facebook circa 2009 <ul><li>- from 200GB (March 2008) to 4 TB of compressed new data added per day </li></ul><ul><li>- 135TB of compressed data scanned per day </li></ul><ul><li>- 7500+ Database jobs on production cluster per day </li></ul><ul><li>- 80K compute hours per day </li></ul><ul><li>- And that's just for data warehousing/analysis </li></ul><ul><ul><li>- plus thousands of MySQL machines acting as Key/Value stores </li></ul></ul>
    18. 20. Big Data <ul><li>- Internet generates huge amounts of data </li></ul><ul><li>- First encountered by the big guys: AltaVista, Google, Amazon … </li></ul><ul><li>- Needs to be handled </li></ul><ul><li>- Classical storage solutions just don't fit/behave/scale anymore </li></ul>
    19. 21. So smart guys create solutions to these internal challenges
    20. 22. And then? <ul><li>- Papers: </li></ul><ul><li>The Google File System (Google, 2003) </li></ul><ul><li>MapReduce: Simplified Data Processing on Large Clusters (Google, 2004) </li></ul><ul><li>Bigtable: A Distributed Storage System for Structured Data (Google, 2006) </li></ul><ul><ul><li>Amazon Dynamo (Amazon, 2007) </li></ul></ul><ul><li>- Projects (all open source): </li></ul><ul><ul><li>Hadoop (coming out of Nutch, Yahoo, 2008) </li></ul></ul><ul><ul><li>Memcached (LiveJournal, 2003) </li></ul></ul><ul><ul><li>Voldemort (Linkedin, 2008) </li></ul></ul><ul><ul><li>Hive (Facebook, 2008) </li></ul></ul><ul><ul><li>Cassandra (Facebook, 2008) </li></ul></ul><ul><ul><li>MongoDB (2007) </li></ul></ul><ul><ul><li>Redis, Tokyo Cabinet , CouchDB, Riak... </li></ul></ul>
    21. 23. Four papers to rule them all <ul><li>Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “ The Google File System ”, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003. </li></ul><ul><li>Jeffrey Dean and Sanjay Ghemawat, “ MapReduce: Simplified Data Processing on Large Clusters ”, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. </li></ul><ul><li>Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, “ Bigtable: A Distributed Storage System for Structured Data ”, OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006. </li></ul><ul><li>Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “ Dynamo: Amazon's Highly Available Key-Value Store ”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007. </li></ul>
    22. 24. Solving problems of big guys?
    23. 25. Total Sites Across All Domains August 1995 - October 2011, NetCraft
    24. 26. Yesterday's problem of the biggest guys is today's problem of a garden-variety startup
    25. 28. And so we end up with a Cambrian explosion
    26. 31. These solutions don't have much in common, Except...
    27. 32. They definitely aren't SQL
    28. 33. Not Only SQL
    29. 34. So what are these beasts?
    30. 35. That's a hard question... <ul><li>- There is no standard </li></ul><ul><li>- This is a new technology </li></ul><ul><ul><li>- new research </li></ul></ul><ul><ul><li>- survival of the fittest </li></ul></ul><ul><ul><li>- experimenting </li></ul></ul><ul><li>- They obviously fulfill some new needs </li></ul><ul><ul><li>- but we don't yet know which are real and which superficial </li></ul></ul><ul><li>- Most are extremely use-case specific </li></ul>
    31. 36. Example use-cases <ul><li>- Shopping cart on Amazon </li></ul><ul><li>- PageRank calculation at Google </li></ul><ul><li>- Streams stuff at Twitter </li></ul><ul><li>- Extreme K/V store at bit.ly </li></ul><ul><li>- Analytics at Facebook </li></ul>
    32. 37. At the core, it's a different set of trade-offs and operational constraints
    33. 38. Trade-offs and operational constraints <ul><li>- Consistent? </li></ul><ul><ul><li>Eventually consistent? </li></ul></ul><ul><li>- Highly available? </li></ul><ul><ul><li>Distributed across continents? </li></ul></ul><ul><li>- Fault tolerant? </li></ul><ul><ul><li>Partition tolerant? </li></ul></ul><ul><ul><li>Tolerant to consumer grade hardware? </li></ul></ul><ul><li>- Distributed? </li></ul><ul><ul><li>Across 10, 100, 1000, 10000 machines? </li></ul></ul>
    34. 39. More possibilities <ul><li>- All in memory? (disk is the new tape) </li></ul><ul><li>- Batch processing? </li></ul><ul><ul><li>- tolerant to node failures? </li></ul></ul><ul><li>- Graph oriented? </li></ul><ul><li>- No transactions? </li></ul><ul><ul><li>Programmer deals with inconsistencies? </li></ul></ul><ul><li>- No schemas? </li></ul><ul><li>- BASE? (Basically Available, Soft state, Eventually Consistent) </li></ul><ul><li>- Horizontal scaling, with no downtime? </li></ul><ul><li>- Self healing? </li></ul>
    35. 40. A consistent topic: CAP Theorem
    36. 41. CAP theorem (Eric Brewer, 2000, Symposium on Principles of Distributed Computing) <ul><li>- CAP = Consistency, Availability, Partition tolerance </li></ul><ul><li>- Pick any two! </li></ul><ul><li>- Distributed systems have to sacrifice something to be fast </li></ul><ul><li>- Usually you drop one of: </li></ul><ul><ul><li>- consistency – all clients see the same data </li></ul></ul><ul><ul><li>- availability – the service returns something </li></ul></ul><ul><li>- Sometimes you can even tune the trade-offs! </li></ul>
    37. 42. “There is no free lunch with distributed data” – HP
    38. 43. Eventual Consistency <ul><li>- Different clients can read and write the data concurrently: there is no locking, and nodes may even be partitioned from each other </li></ul><ul><li>- What we know is that, given enough time, data is synchronized to the same state across all replicas </li></ul>
    39. 44. But this is horrible!
    40. 45. … you already are eventually consistent! :) If your database stores how many vases you have in your shop...
    41. 46. Eventual consistency <ul><li>- Conflict resolution: </li></ul><ul><ul><li>- Read time </li></ul></ul><ul><ul><li>- Write time </li></ul></ul><ul><ul><li>- Asynchronous </li></ul></ul><ul><li>- Possibilities: </li></ul><ul><ul><li>- client timestamps </li></ul></ul><ul><ul><li>- vector clocks, when writing say what your original data version was </li></ul></ul><ul><li>- Conflict resolution can be server or client based </li></ul>
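The vector clock idea from the slide above can be sketched as follows. This is a minimal illustration, not the implementation of any particular database; the names `increment`, `descends` and `compare` and the node ids are assumptions made for this example.

```python
from typing import Dict

# A vector clock maps a node id to the number of updates that node has made.
Clock = Dict[str, int]


def increment(clock: Clock, node: str) -> Clock:
    """Return a copy of `clock` with one more update recorded for `node`."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: Clock, b: Clock) -> bool:
    """True if version `a` has seen every update that version `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: Clock, b: Clock) -> str:
    """Classify two versions as 'equal', 'after', 'before' or 'conflict'."""
    a_has_b = descends(a, b)
    b_has_a = descends(b, a)
    if a_has_b and b_has_a:
        return "equal"
    if a_has_b:
        return "after"
    if b_has_a:
        return "before"
    # Neither version contains the other: concurrent writes that the
    # server or the client has to reconcile.
    return "conflict"


if __name__ == "__main__":
    base = increment({}, "n1")          # first write, coordinated by n1
    version_a = increment(base, "n1")   # client A updates via n1
    version_b = increment(base, "n2")   # client B updates the same base via n2
    print(compare(version_a, base))
    print(compare(version_a, version_b))
```

Sending the original version along with a write is what lets the store tell a plain overwrite ("after") from two clients that both edited the same base version ("conflict").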
    42. 47. There are different kinds of consistency <ul><li>- Read-your-writes consistency </li></ul><ul><li>- Monotonic write / monotonic read consistency </li></ul><ul><li>- Session consistency </li></ul><ul><li>- Causal consistency </li></ul>
    43. 48. There's not even a proper taxonomy of features different NoSQL solutions offer
    44. 49. And this presentation is too short to present the whole breadth of possibilities
    45. 50. Usual taxonomy of NoSQL <ul><li>- Key/Value stores </li></ul><ul><li>- Column stores </li></ul><ul><li>- Document stores </li></ul><ul><li>- Graph stores </li></ul>
    46. 51. Other attributes <ul><li>- In-memory / on-disk </li></ul><ul><li>- Latency / throughput (batch processing) </li></ul><ul><li>- Consistency / Availability </li></ul>
    47. 52. Key/Value stores <ul><li>- a.k.a. Distributed hashtables! </li></ul><ul><li>- Amazon Dynamo </li></ul><ul><li>- Redis, Voldemort, Cassandra, Tokyo Cabinet, Riak </li></ul>
    48. 53. Document databases <ul><li>- Similar to Key/Value, but value is a document </li></ul><ul><li>- JSON or something similar, flexible schema </li></ul><ul><li>- CouchDB, MongoDB, SimpleDB... </li></ul><ul><li>- May support indexing or not </li></ul><ul><li>- Usually support more complex queries </li></ul>
    49. 54. Column stores <ul><li>- one key, multiple attributes </li></ul><ul><li>- hybrid row/column </li></ul><ul><li>- BigTable, HBase, Cassandra, Hypertable </li></ul>
    50. 55. Graph Databases <ul><li>- Neo4J, Maestro OpenLink, InfiniteGraph, HyperGraphDB, AllegroGraph </li></ul><ul><li>- Whole semantic web shebang! </li></ul>
    51. 56. To make the situation even more confusing... <ul><li>- Fast pace of development </li></ul><ul><li>- In-memory stores gain on-disk support overnight </li></ul><ul><li>- Indexing capabilities are added </li></ul>
    52. 57. Two examples <ul><li>- Cassandra </li></ul><ul><li>- Hadoop </li></ul><ul><ul><li>- Hive </li></ul></ul><ul><ul><li>- Mahout </li></ul></ul>
    53. 59. Cassandra <ul><li>- BigTable + Dynamo </li></ul><ul><li>- P2P, horizontally scalable </li></ul><ul><li>- No SPOF </li></ul><ul><li>- Eventually consistent </li></ul><ul><li>- Tunable tradeoffs between consistency and availability </li></ul><ul><ul><li>- number of replicas, writes, reads </li></ul></ul>
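The "number of replicas, writes, reads" knob can be illustrated with the overlap rule used in Dynamo-style systems: if the number of replicas that must acknowledge a read plus the number that must acknowledge a write exceeds the replica count, every read touches at least one replica that holds the latest write. The function names below are made up for this sketch and are not Cassandra API calls.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of `replicas`."""
    return replicas // 2 + 1


def reads_see_latest_write(replicas: int, write_acks: int, read_acks: int) -> bool:
    """Overlap rule: R + W > N means read and write replica sets intersect."""
    return read_acks + write_acks > replicas


def tolerated_failures(replicas: int, required_acks: int) -> int:
    """How many replicas may be down while the operation still succeeds."""
    return replicas - required_acks


if __name__ == "__main__":
    n = 3
    for w, r in [(1, 1), (quorum(n), quorum(n)), (n, 1)]:
        print(
            f"N={n} W={w} R={r}:",
            "consistent reads" if reads_see_latest_write(n, w, r) else "eventually consistent",
            f"| writes survive {tolerated_failures(n, w)} down,",
            f"reads survive {tolerated_failures(n, r)} down",
        )
```

Low values favour availability and latency, high values favour consistency; that is the tunable trade-off the slide refers to.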
    54. 60. Cassandra – writes <ul><li>- No reads </li></ul><ul><li>- No seeks </li></ul><ul><li>- Log oriented writes </li></ul><ul><li>- Fast, atomic inside ColumnFamily </li></ul><ul><li>- Always available for writing </li></ul>
    55. 61. Cassandra <ul><li>- Billions of rows </li></ul><ul><li>- MySQL: </li></ul><ul><ul><li>~ 300ms write </li></ul></ul><ul><ul><li>~ 350ms read </li></ul></ul><ul><li>- Cassandra: </li></ul><ul><ul><li>~ 0.12ms write </li></ul></ul><ul><ul><li>~ 15ms read </li></ul></ul>
    56. 62. Not enough time to go into data model...
    57. 63. Cassandra <ul><li>In production at: Facebook, Digg, Rackspace, Reddit, Cloudkick, Twitter </li></ul><ul><li>- largest production cluster over 150TB and over 150 machines </li></ul><ul><li>Other stuff: </li></ul><ul><ul><li>- pluggable partitioner (Random/OrderPreserving) </li></ul></ul><ul><ul><li>- rack aware, datacenter aware </li></ul></ul>
    58. 64. Experiences? <ul><li>- Works pretty well at Zemanta </li></ul><ul><ul><li>- user preferences store </li></ul></ul><ul><ul><li>- extending to new use-cases </li></ul></ul><ul><li>- Digg had some problems </li></ul><ul><li>- Don't necessarily use it as a primary store </li></ul><ul><li>- Not very easy to back up, but the situation is improving </li></ul>
    59. 65. Cassandra - queries <ul><li>- Column by key </li></ul><ul><li>- Slices (of columns/supercolumns) </li></ul><ul><li>- Range queries (efficient when using OrderPreservingPartitioner) </li></ul>
    60. 67. Hadoop <ul><li>- GFS + MapReduce </li></ul><ul><li>- Fault tolerant </li></ul><ul><li>- (massively) distributed </li></ul><ul><li>- massive datasets </li></ul><ul><li>- batch-processing (non real-time responses) </li></ul><ul><li>- Written in Java </li></ul><ul><li>- A whole ecosystem </li></ul>
    61. 68. Hadoop: Why? (Owen O’Malley, Yahoo! Inc., omalley@apache.org) <ul><li>• Need to process 100TB datasets with multi-day jobs </li></ul><ul><li>• On 1 node: </li></ul><ul><ul><li>– scanning @ 50MB/s = 23 days </li></ul></ul><ul><ul><li>– MTBF = 3 years </li></ul></ul><ul><li>• On 1000 node cluster: </li></ul><ul><ul><li>– scanning @ 50MB/s = 33 min </li></ul></ul><ul><ul><li>– MTBF = 1 day </li></ul></ul><ul><li>• Need framework for distribution </li></ul><ul><ul><li>– Efficient, reliable, easy to use </li></ul></ul>
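The arithmetic behind these figures can be checked with a few lines. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), perfectly parallel scanning, and independent node failures, so that the time to the first failure in the cluster is roughly the single-node MTBF divided by the node count. The function names are invented for the example.

```python
def scan_time_seconds(dataset_bytes: int, nodes: int, mb_per_second_per_node: int = 50) -> float:
    """Time to scan the dataset once, split evenly across `nodes`."""
    bytes_per_second = nodes * mb_per_second_per_node * 10**6
    return dataset_bytes / bytes_per_second


def cluster_mtbf_days(node_mtbf_years: float, nodes: int) -> float:
    """Rough mean time until *some* node in the cluster fails."""
    return node_mtbf_years * 365 / nodes


if __name__ == "__main__":
    dataset = 100 * 10**12  # 100 TB

    one_node = scan_time_seconds(dataset, nodes=1)
    cluster = scan_time_seconds(dataset, nodes=1000)

    print(f"1 node:     {one_node / 86_400:.1f} days, MTBF {cluster_mtbf_days(3, 1):.0f} days")
    print(f"1000 nodes: {cluster / 60:.1f} minutes, MTBF {cluster_mtbf_days(3, 1000):.1f} days")
```

The point of the slide falls out of the numbers: the cluster finishes in about half an hour, but something in it breaks about once a day, so the framework has to treat node failure as routine.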
    62. 69. Hadoop @ Facebook <ul><li>- Use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning. </li></ul><ul><li>- Currently 2 major clusters: </li></ul><ul><ul><li>A 1100-machine cluster with 8800 cores and about 12 PB raw storage. </li></ul></ul><ul><ul><li>A 300-machine cluster with 2400 cores and about 3 PB raw storage. </li></ul></ul><ul><ul><li>Each (commodity) node has 8 cores and 12 TB of storage. </li></ul></ul><ul><li>- Heavy users of both streaming and the Java APIs. They built a higher-level data warehousing framework on top of these features, called Hive (see http://hadoop.apache.org/hive/). </li></ul>
    63. 70. But also at smaller startups <ul><li>- Zemanta: 2 to 4 node cluster, 7TB </li></ul><ul><ul><li>- log processing </li></ul></ul><ul><li>- Hulu 13 nodes </li></ul><ul><ul><li>- log storage and analysis </li></ul></ul><ul><li>- GumGum 9 nodes </li></ul><ul><ul><li>- image and advertising analytics </li></ul></ul><ul><li>- Universities: Cornell – Generating web graphs (100 nodes) </li></ul><ul><li>- It's almost everywhere </li></ul>
    64. 71. Hadoop Architecture - HDFS <ul><li>- HDFS provides a single distributed filesystem </li></ul><ul><li>- Managed by a NameNode (SPOF) </li></ul><ul><li>- Append-only filesystem </li></ul><ul><ul><li>- distributed by blocks (for example 64MB) </li></ul></ul><ul><li>- It's like one big RAID over all the machines </li></ul><ul><ul><li>- tunable replication </li></ul></ul><ul><li>- Rack aware, datacenter aware </li></ul><ul><li>- It just works, really! </li></ul>
    65. 74. Hadoop Architecture - MapReduce <ul><li>- Based on an old concept from Lisp </li></ul><ul><li>- Generally it's not just map-reduce, it's: </li></ul><ul><ul><li>Map -> shuffle (sort) -> merge -> reduce </li></ul></ul><ul><li>- Jobs can be partitioned </li></ul><ul><li>- Tasks can be run and restarted independently (parallelization, fault tolerance) </li></ul><ul><li>- Aware of data-locality of HDFS </li></ul><ul><li>- Speculative execution (toward the end of a job, tasks on machines that stall are re-run elsewhere) </li></ul>
    66. 75. Infamous word counting example <ul><li>- “One and one is two and one is three” </li></ul><ul><li>- Two mappers: “One and one is”, “two and one is three” </li></ul><ul><li>- Pretty “stupid” mappers, just output word and “1” </li></ul><ul><li>- Output Mapper1: One 1, And 1, One 1, Is 1 </li></ul><ul><li>- Output Mapper2: Two 1, And 1, One 1, Is 1, Three 1 </li></ul><ul><li>- Sorter: And 1, And 1, Is 1, Is 1, One 1, One 1, One 1, Two 1, Three 1 </li></ul><ul><li>- Reducer: And 2, Is 2, One 3, Two 1, Three 1 </li></ul>
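The same word count can be simulated in a single Python process. This is only a sketch of the map -> shuffle/sort -> reduce flow and does not use Hadoop or its API; `mapper`, `shuffle` and `reducer` are names chosen for the illustration, and words are lower-cased so that "One" and "one" land in the same bucket.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(text: str) -> Iterator[Tuple[str, int]]:
    """A "stupid" mapper: emit (word, 1) for every word in its input split."""
    for word in text.split():
        yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key and sort the keys, like the sorter stage."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Each reducer call sees every value emitted for one key."""
    return key, sum(values)


# Two input splits, one per mapper, as on the slide.
splits = ["One and one is", "two and one is three"]
mapped = chain.from_iterable(mapper(split) for split in splits)
counts = dict(reducer(key, values) for key, values in shuffle(mapped).items())

if __name__ == "__main__":
    for word, count in counts.items():
        print(word, count)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network, but the contract between the stages is the same.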
    67. 76. Important to know <ul><li>- Mappers can output more than one output per input (or none) </li></ul><ul><li>- Bucketing for reducers happens immediately after mapping output </li></ul><ul><li>- Every reducer gets all input records for a certain “key” </li></ul><ul><li>- All parts are highly pluggable – readers, mapping, sorting, reducing … it's Java </li></ul>
    68. 77. Hadoop <ul><li>- You can write your jobs in Java </li></ul><ul><li>- You get used to thinking inside the constraints </li></ul><ul><li>- You can use “Hadoop Streaming” to write jobs in any language </li></ul><ul><li>- It's great not to have to think about the machines, but you can “peep” if you want to see how your job is doing. </li></ul>
    69. 78. Now, this is a bit wonky, right? <ul><li>- Word counting is a really bad example </li></ul><ul><li>- However it's like “Hello world”, so get used to it </li></ul><ul><li>- When you get to real problems it gets much more logical </li></ul>
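    70. 79. Benchmarks, 2009 <ul><li>This doesn't help me much, but... </li></ul><ul><li>- Bytes: 500000000000, Nodes: 1406, Maps: 8000, Reduces: 2600, Replication: 1, Time: 59 seconds </li></ul><ul><li>- Bytes: 1000000000000, Nodes: 1460, Maps: 8000, Reduces: 2700, Replication: 1, Time: 62 seconds </li></ul><ul><li>- Bytes: 100000000000000, Nodes: 3452, Maps: 190000, Reduces: 10000, Replication: 2, Time: 173 minutes </li></ul><ul><li>- Bytes: 1000000000000000, Nodes: 3658, Maps: 80000, Reduces: 20000, Replication: 2, Time: 975 minutes </li></ul>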
    71. 80. Hive
    72. 81. Hive <ul><li>- A system built on top of Hadoop that mimics SQL </li></ul><ul><li>- Hive Query Language </li></ul><ul><li>- Built at Facebook, since writing MapReduce jobs in Java is tedious for basic tasks </li></ul><ul><li>- Every operation is one or multiple full index scans </li></ul><ul><li>- Bunch of heuristics, query optimization </li></ul>
    73. 82. Hive – Why we love it at Zemanta <ul><li>- Don't need to transform your data on “load time” </li></ul><ul><li>- Just copy your files to HDFS (preferably compressed and chunked) </li></ul><ul><li>- Write your own deserializer (50 lines in Java) </li></ul><ul><li>- And use your file as a table </li></ul><ul><li>- Plus custom User Defined Functions </li></ul>
    74. 84. Mahout <ul><li>- Bunch of algorithms implemented </li></ul><ul><ul><li>Collaborative Filtering </li></ul></ul><ul><ul><li>User and Item based recommenders </li></ul></ul><ul><ul><li>K-Means, Fuzzy K-Means clustering </li></ul></ul><ul><ul><li>Mean Shift clustering </li></ul></ul><ul><ul><li>Dirichlet process clustering </li></ul></ul><ul><ul><li>Latent Dirichlet Allocation </li></ul></ul><ul><ul><li>Singular value decomposition </li></ul></ul><ul><ul><li>Parallel Frequent Pattern mining </li></ul></ul><ul><ul><li>Complementary Naive Bayes classifier </li></ul></ul><ul><ul><li>Random forest decision tree based classifier </li></ul></ul><ul><ul><li>High performance Java collections (previously Colt collections) </li></ul></ul><ul><ul><li>A vibrant community </li></ul></ul><ul><ul><li>and much more cool stuff to come by this summer thanks to Google Summer of Code </li></ul></ul>
    75. 85. General notes
    76. 86. Some observations <ul><li>- Non-fixed schemas are a blessing when you have to adapt constantly </li></ul><ul><ul><li>- that doesn't mean you should not have documentation and be thoughtful! </li></ul></ul><ul><li>- Denormalization is the way to scale </li></ul><ul><ul><li>- sorry guys </li></ul></ul><ul><li>- Clients get to manage things more precisely, but also have to manage things more precisely </li></ul>
    77. 87. Some internals, “fun” tricks <ul><li>- Bloom filter: Is data on this node? </li></ul><ul><ul><li>Maybe / Definitely not </li></ul></ul><ul><ul><li>Maybe -> let's go to disk to check </li></ul></ul><ul><li>- Vector clocks </li></ul><ul><li>- Consistent hashing </li></ul>
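A Bloom filter answering "maybe / definitely not" can be sketched in a few lines. This is a toy version for illustration: the class name, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions of this sketch, and real stores use more compact bit arrays and faster hashes.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(size_bits)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several hash functions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        """False means "definitely not here"; True means "maybe, go check disk"."""
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    on_this_node = BloomFilter()
    on_this_node.add("user:42")
    for key in ("user:42", "user:43"):
        answer = "maybe" if on_this_node.might_contain(key) else "definitely not"
        print(key, "->", answer)
```

The filter lives in memory, so a "definitely not" answer saves a disk seek, and a "maybe" costs only the lookup that would have happened anyway.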
    78. 88. Consistent hashing <ul><li>- key -> hash -> “coordinator node” </li></ul><ul><li>- depending on the replication factor, the key is then stored on N sequential nodes </li></ul><ul><li>- When a new node gets added to the ring, replication is relatively easy </li></ul>
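The ring described above can be sketched as follows. This is a simplified illustration without virtual nodes; `HashRing`, `ring_position` and the node names are hypothetical, and SHA-256 is used only as a convenient stable hash.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def ring_position(value: str) -> int:
    """Map a key or node name to a position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """key -> hash -> coordinator node, then the next nodes clockwise as replicas."""

    def __init__(self, nodes: Iterable[str], replicas: int = 3) -> None:
        self.replicas = replicas
        self._ring: List[Tuple[int, str]] = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (ring_position(node), node))

    def nodes_for(self, key: str) -> List[str]:
        """Coordinator first, followed by the sequential replica nodes."""
        positions = [position for position, _ in self._ring]
        start = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=3)
    keys = [f"user:{i}" for i in range(1000)]
    before = {key: ring.nodes_for(key)[0] for key in keys}

    ring.add_node("node-e")
    moved = sum(1 for key in keys if ring.nodes_for(key)[0] != before[key])
    print(f"{moved} of {len(keys)} keys changed coordinator after adding node-e")
```

Only keys that fall between the new node and its predecessor on the ring change coordinator, and they all move to the new node; everything else stays put, which is why adding a node is relatively cheap.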
    79. 89. And if you don't take anything else from this presentation...
    80. 93. But there's more to it
    81. 94. This is the edge today <ul><li>- Tons of interesting research waiting to be done </li></ul><ul><li>- Ability to leverage these solutions to process terabytes of data cheaply </li></ul><ul><li>- Ability to seize new opportunities </li></ul><ul><li>- Innovation is the only thing keeping you/us ahead </li></ul><ul><li>- Are you preparing yourself for tomorrow's technologies? Tomorrow's research? </li></ul>
    82. 95. Images <ul><li>http://www.flickr.com/photos/60861613@N00/3526232773/sizes/m/in/photostream/ </li></ul><ul><li>http://www.zazzle.com/sql_awesome_me_tshirt-235011737217980907 </li></ul><ul><li>http://geekandpoke.typepad.com/geekandpoke/2011/01/nosql.html </li></ul><ul><li>http://hadoop.apache.org/common/docs/current/hdfs_design.html </li></ul><ul><li>http://www.flickr.com/photos/unitednationsdevelopmentprogramme/4273890959/ </li></ul>