NoSQL and MapReduce

My presentation for TSSG, Chelmsford, MA

  1. 1. NoSQL databases and MapReduce<br />J Singh<br />Early Stage IT<br />
  2. 2. What’s so fun about databases?<br />Traditional database discussions talked about<br />Employee records<br />Bank records<br />Now we talk about<br />Web search<br />Data mining<br />The collective intelligence of tweets<br />Scientific and medical databases<br />
  3. 3. How much data can a database hold?<br />The biggest OLTP databases<br />2001: 1.1 – 10.3 TB.<br />2003: 9.1 – 29.2 TB.<br />2005: 17.7 – 100.4 TB.<br />2010: ~2.5 PB.<br />The trend will continue<br />Very large databases bring new unique challenges<br />
  4. 4. Historical Context<br />Late 1990s.<br />The web scales out.<br />Suddenly, databases are not adequate for holding the data being accumulated<br />Scale out vs. Scale up<br />
  5. 5. Brewer’s Conjecture (p1)<br />Source: Eric Brewer’s July 2000 PODC Keynote<br />Main points:<br />Classic “Distributed Systems” don’t work<br />They focus on computation, not data<br />Distributing computation is easy, distributing data is hard<br />DBMS research is about ACID (mostly)<br />Atomicity, Consistency, Isolation and Durability<br />But we forfeit “C” and “I” for availability, graceful degradation and performance – this tradeoff is fundamental<br />BASE<br />Basically Available<br />Soft-state<br />Eventual Consistency<br />
  6. 6. Brewer’s Conjecture (p2)<br />BASE<br />Weak consistency<br />stale data OK<br />Availability first<br />Best effort<br />Approximate answers OK<br />Aggressive (optimistic)<br />Simpler!<br />Faster<br />Easier evolution<br />ACID<br />Strong consistency<br />Isolation<br />Focus on “commit”<br />Nested transactions<br />Availability?<br />Conservative (pessimistic)<br />Difficult evolution (e.g. schema)<br />But I think it’s a spectrum<br />Eric Brewer<br />
  7. 7. CAP Theorem<br />Since then,<br />Brewer’s conjecture formally proved: Gilbert & Lynch, 2002<br />Thus Brewer’s conjecture became the CAP theorem…<br />…and contributed to the birth of the NoSQL movement<br />But the theory is not settled<br />While http://nosql-database.org/ lists 122 NoSQL databases<br />
  8. 8. What is NoSQL?<br />Stands for Not Only SQL<br />Class of non-relational data storage systems<br />Usually do not require a fixed table schema nor do they use the concept of joins<br />All NoSQL offerings relax one or more of the ACID properties<br />
  9. 9. Forces at Work<br />Three major papers were the seeds of the NoSQL movement<br />CAP Theorem (discussed above)<br />BigTable (Google)<br />Dynamo (Amazon)<br />Some types of data could not be modeled well in RDBMS<br />Document Storage and Indexing<br />Recursive Data and Graphs<br />Time Series Data<br />Genomics Data<br />
  10. 10. NoSQL Databases<br />Key-Value Stores<br />A storage system that stores values, indexed by a key.<br />Example: Voldemort, Dynomite, Tokyo Cabinet<br />BigTable Clones (aka "ColumnFamily")<br />A tabular model where each row (at least in theory) can have an individual configuration of columns.<br />Example: HBase, Hypertable, Cassandra, Amazon SimpleDB<br />
  11. 11. NoSQL Databases<br />Document Databases<br />Collections of "documents", each of which is itself a collection of key-value pairs<br />Example: CouchDB, MongoDB, Riak<br />Graph Databases<br />Nodes & relationships, both of which can hold key-value pairs<br />Example: AllegroGraph, InfoGrid, Neo4j<br />
  12. 12. Amazon SimpleDB<br />Key-value store<br />Written in Erlang (as is CouchDB)<br />Data is modeled in terms of<br />Domain, a container of entities,<br />Item, an entity, and<br />Attribute and Value, a property of an Item<br />Eventually consistent, except when the ConsistentRead flag is specified<br />Impressive performance numbers,<br />e.g., 0.7 sec to store 1 million records<br />SQL-like SELECT<br />select output_list<br />from domain_name<br />[where expression] [sort_instructions] [limit limit]<br />
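The Domain / Item / Attribute model on this slide can be sketched in a few lines of Python. This is only an in-memory illustration, not the SimpleDB API: the class `Domain`, its methods `put` and `select`, and the sample data are all hypothetical names chosen for the example. It assumes attributes may hold more than one value, and it stands in for the `where` and `limit` clauses with a plain Python predicate and a slice.

```python
class Domain:
    """Hypothetical in-memory stand-in for a domain (a container of items)."""

    def __init__(self, name):
        self.name = name
        # item name -> {attribute name -> set of values}
        self.items = {}

    def put(self, item_name, attribute, value):
        """Add a value to an attribute of an item, creating both as needed."""
        attributes = self.items.setdefault(item_name, {})
        attributes.setdefault(attribute, set()).add(value)

    def select(self, where=None, limit=None):
        """Return names of items whose attributes satisfy `where`.

        `where` is a predicate over the attribute dict (a stand-in for the
        `where expression` clause); `limit` caps the number of results.
        """
        names = [
            name
            for name, attributes in sorted(self.items.items())
            if where is None or where(attributes)
        ]
        return names[:limit]  # a limit of None keeps every match


# Sample data, invented for the example.
talks = Domain("talks")
talks.put("t1", "speaker", "Brewer")
talks.put("t1", "topic", "CAP")
talks.put("t1", "topic", "BASE")
talks.put("t2", "speaker", "Barrett")
talks.put("t2", "topic", "Datastore")

cap_talks = talks.select(where=lambda a: "CAP" in a.get("topic", ()))
```

Note that there is no schema: each item simply carries whatever attributes were put on it, which is the property the slide's model is built around.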
  13. 13. Google Datastore<br />Part of App Engine; also used for internal applications<br />Used for all storage<br />Incorporates a transaction model to ensure high consistency<br />Optimistic locking<br />Transactions can fail<br />CAP implications<br />Datastore isn’t just “eventually consistent”<br />They offer two commercial options (with different prices)<br />Master/Slave <br />Low latency but also lower availability<br />Asynchronous replication<br />High Replication<br />Strong availability at the cost of higher latency<br />
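"Optimistic locking" and "transactions can fail" go together: a writer does not lock the data up front, it checks at commit time whether anyone else changed it, and retries if so. The sketch below illustrates that general idea with a version counter per key. It is not the Datastore API; `VersionedStore`, `ConflictError`, and `run_in_transaction` are hypothetical names for this example, and it runs in a single process.

```python
class ConflictError(Exception):
    """Raised when a commit is based on a stale version."""


class VersionedStore:
    """Hypothetical store that keeps a version number next to each value."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys read as version 0, value None."""
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        """Write only if nobody has committed since `expected_version` was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConflictError(f"{key!r} changed since version {expected_version}")
        self._data[key] = (current_version + 1, new_value)


def run_in_transaction(store, key, update, retries=3):
    """Read, compute, and commit; retry on conflict. Returns True on success."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.commit(key, version, update(value))
            return True
        except ConflictError:
            continue  # someone else won the race; re-read and try again
    return False


store = VersionedStore()
succeeded = run_in_transaction(store, "counter", lambda v: (v or 0) + 1)
```

The caller has to be prepared for `run_in_transaction` to return `False`, which is the practical meaning of "transactions can fail" under this scheme.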
  14. 14. Some production data, circa 2008.<br />
  15. 15. For more info, see video of Ryan Barrett’s talk at Google I/O<br />Datastore Application at Google<br />
  16. 16. Databases and Key-Value Stores<br />http://browsertoolkit.com/fault-tolerance.png<br />
  17. 17. MapReduce Conceptual Underpinnings<br />Programming model from Lisp and other functional languages<br />(map square '(1 2 3 4)) ⇒ (1 4 9 16)<br />(reduce + '(1 4 9 16)) ⇒ 30<br />Easy to distribute<br />Nice failure/retry semantics<br />
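The same two Lisp expressions can be written with Python's built-in `map` and `functools.reduce`. The second half is a small sketch of why the model is "easy to distribute": because addition is associative, the input can be split into chunks, each chunk reduced independently, and the partial results reduced again. The helper `square` and the chunking are assumptions made for the example; the chunks are processed in one process here rather than on separate machines.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


# Direct translation of the two expressions on the slide.
squares = list(map(square, [1, 2, 3, 4]))
total = reduce(add, squares)

# Split the input, map and reduce each chunk on its own, then combine.
chunks = [[1, 2], [3, 4]]
partial_sums = [reduce(add, map(square, chunk)) for chunk in chunks]
total_from_chunks = reduce(add, partial_sums)
```

Each chunk's work depends only on that chunk, so a failed chunk can simply be rerun, which is where the retry semantics mentioned on the slide come from.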
  18. 18. MapReduce Flow<br />
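The slide's diagram did not survive in this transcript. As a stand-in, the usual three stages of a MapReduce job (map, group by key, reduce) can be sketched with the classic word-count example. This is a single-process illustration with hypothetical function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`), not Hadoop's or Google's API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    pairs = (pair for document in documents for pair in map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["the web scales out", "the data scales"])
```

In a real cluster the map calls run on the machines holding the input, the shuffle moves pairs across the network so that each key lands on one reducer, and the reduce calls run in parallel per key.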
  19. 19. Hadoop MapReduce<br />An Open Source project of the Apache Software Foundation<br />Other Hadoop-related projects at Apache include:<br />Cassandra™: A scalable multi-master database with no single points of failure.<br />HBase™: A scalable, distributed database that supports structured data storage for large tables.<br />Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.<br />Pig™: A high-level data-flow language and execution framework for parallel computation.<br />See the Apache Hadoop website for more.<br />
  20. 20. Hadoop Availability<br />Run on your laptop<br />Run on your server<br />Run on Amazon Cloud<br />Introduction at IBM DeveloperWorks<br />Run on Google App Engine<br />It’s not Hadoop, it’s Google’s implementation of MapReduce<br />
  21. 21. MapReduce Statistics @ GOOG<br />Take-away message:<br />MapReduce is not a “new-fangled technology of the future”<br />It is here, it is proven, use it!<br />
  22. 22. End of an Era?<br />The Relational Model is not necessarily the answer<br />It was excellent for data processing<br />Not a natural fit for<br />Data Warehouses<br />Web-oriented search<br />Real-time analytics, and<br />Semi-structured data<br />e.g., Semantic Web<br />SQL is not the answer<br />Coupling between modern programming languages and SQL is “ugly beyond belief”<br />Programming languages have evolved while SQL has remained static<br />Pascal<br />C/C++<br />Java<br />The little languages: Python, Perl, PHP, Ruby<br />The End of an Architectural Era, Stonebraker et al., Proc. VLDB, 2007<br />A critique of the “one size fits all” assumption in DBMS<br />
  23. 23. Takeaways<br />NoSQL databases are a solution to web-scale problems<br />A lot of data lives outside relational databases<br />With SQLnix.org, we are starting a local resource for NoSQL database knowledge<br />Taking on projects to apply the technology, not just read about it.<br />If you want to work on it, please contact us.<br />Thanks<br />
