NoSQL and MapReduce


Published on

My presentation for

Chelmsford, MA

Published in: Technology

  1. NoSQL databases and MapReduce
     J Singh
     Early Stage IT
  2. What’s so fun about databases?
     Traditional database discussions talked about
       Employee records
       Bank records
     Now we talk about
       Web search
       Data mining
       The collective intelligence of tweets
       Scientific and medical databases
  3. How much data can a database hold?
     The biggest OLTP databases
       2001: 1.1 – 10.3 TB
       2003: 9.1 – 29.2 TB
       2005: 17.7 – 100.4 TB
       2010: ~2.5 PB
     The trend will continue
     Very large databases bring unique new challenges
  4. Historical Context
     Late 1990s
       The web scales out
       Suddenly, databases are not adequate for holding the data being accumulated
     Scale out vs. scale up
  5. Brewer’s Conjecture (p1)
     Source: Eric Brewer’s July 2000 PODC Keynote
     Main points:
       Classic “Distributed Systems” don’t work
         They focus on computation, not data
         Distributing computation is easy, distributing data is hard
       DBMS research is about ACID (mostly)
         Atomicity, Consistency, Isolation and Durability
         But we forfeit “C” and “I” for availability, graceful degradation and performance – this tradeoff is fundamental
       BASE
         Basically Available
         Soft-state
         Eventual Consistency
  6. Brewer’s Conjecture (p2)
     BASE
       Weak consistency (stale data OK)
       Availability first
       Best effort
       Approximate answers OK
       Aggressive (optimistic)
       Simpler!
       Faster
       Easier evolution
     ACID
       Strong consistency
       Isolation
       Focus on “commit”
       Nested transactions
       Availability?
       Conservative (pessimistic)
       Difficult evolution (e.g. schema)
     “But I think it’s a spectrum” (Eric Brewer)
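The "weak consistency, stale data OK" side of this comparison can be illustrated with a toy model. The sketch below assumes a single primary whose writes reach read replicas only when `replicate()` runs; the class and method names are invented for illustration and do not correspond to any particular database.

```python
class Replica:
    """One copy of the data."""
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: writes land on the primary at once, replicas catch up later."""

    def __init__(self, n_replicas=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index=0):
        # Reads are served from a replica: always answerable, possibly stale.
        return self.replicas[replica_index].data.get(key)

    def replicate(self):
        # Apply the pending writes to every replica, in order.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read("x")   # replica has not seen the write yet
store.replicate()
fresh = store.read("x")   # replicas have now converged
print(stale, fresh)
```

Between the write and the call to `replicate()`, a reader gets an answer that is out of date but available, which is the tradeoff BASE accepts.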
  7. CAP Theorem
     Since then,
       Brewer’s conjecture formally proved: Gilbert & Lynch, 2002
       Thus Brewer’s conjecture became the CAP theorem…
       …and contributed to the birth of the NoSQL movement
     But the theory is not settled
     While lists 122 NoSQL databases
  8. What is NoSQL?
     Stands for Not Only SQL
     Class of non-relational data storage systems
     Usually do not require a fixed table schema nor do they use the concept of joins
     All NoSQL offerings relax one or more of the ACID properties
  9. Forces at Work
     Three major papers were the seeds of the NoSQL movement
       CAP Theorem (discussed above)
       BigTable (Google)
       Dynamo (Amazon)
     Some types of data could not be modeled well in RDBMS
       Document Storage and Indexing
       Recursive Data and Graphs
       Time Series Data
       Genomics Data
  10. NoSQL Databases
      Key-Value Stores
        A storage system that stores values, indexed by a key.
        Example: Voldemort, Dynomite, Tokyo Cabinet
      BigTable Clones (aka "ColumnFamily")
        A tabular model where each row (at least in theory) can have an individual configuration of columns.
        Example: HBase, Hypertable, Cassandra, Amazon SimpleDB
  11. NoSQL Databases
      Document Databases
        Collections of documents, which contain key-value collections (called "documents")
        Example: CouchDB, MongoDB, Riak
      Graph Databases
        Nodes & relationships, both of which can hold key-value pairs
        Example: AllegroGraph, InfoGrid, Neo4j
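As a rough illustration of how the four families on slides 10 and 11 differ in shape, the Python literals below stand in for each model. All keys, names, and values are invented for illustration, and none of this reflects the API of any product listed above.

```python
# Key-value store: an opaque value looked up by its key.
key_value = {"user:42": b"serialized profile bytes"}

# Column family: each row may carry its own set of columns.
column_family = {
    "row1": {"name": "Ada", "email": "ada@example.com"},
    "row2": {"name": "Bob", "phone": "555-0100"},
}

# Document database: a collection of documents, each a set of
# key-value pairs that may nest.
documents = [
    {"_id": 1, "title": "NoSQL", "tags": ["db", "scale"]},
    {"_id": 2, "title": "MapReduce", "author": {"name": "J"}},
]

# Graph database: nodes and relationships, both holding key-value pairs.
nodes = {"a": {"name": "Alice"}, "b": {"name": "Bob"}}
relationships = [("a", "KNOWS", "b", {"since": 2010})]


def neighbors(node_id):
    """Return the ids of nodes reachable by one outgoing relationship."""
    return [dst for src, _label, dst, _props in relationships if src == node_id]


print(neighbors("a"))
```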
  12. Amazon SimpleDB
      Key-value store
      Written in Erlang (as is CouchDB)
      Data is modeled in terms of
        Domain, a container of entities,
        Item, an entity, and
        Attribute and Value, a property of an Item
      Eventually Consistent, except when the ConsistentRead flag is specified
      Impressive performance numbers,
        e.g., 0.7 sec to store 1 million records
      SQL-like SELECT
        select output_list
        from domain_name
        [where expression] [sort_instructions] [limit limit]
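To make the Domain / Item / Attribute model and the shape of the SELECT statement more concrete, here is a small in-memory sketch. It is not the SimpleDB API: the `select` function, the `books` domain, and its contents are all hypothetical, and the `where` clause is passed as a Python predicate instead of an expression string.

```python
def select(domain, output_list, where=None, limit=None):
    """Rough analogue of: select output_list from domain [where ...] [limit n]."""
    rows = []
    for item_name, attributes in domain.items():
        if limit is not None and len(rows) >= limit:
            break
        if where is not None and not where(attributes):
            continue
        # Keep only the requested attributes that this item actually has.
        projected = {name: attributes[name] for name in output_list if name in attributes}
        rows.append((item_name, projected))
    return rows


# A domain maps item names to attribute/value pairs; items need not
# share the same attributes.
books = {
    "item1": {"title": "Dune", "year": "1965"},
    "item2": {"title": "Neuromancer", "year": "1984"},
    "item3": {"title": "Snow Crash"},  # no "year" attribute
}

# Values are strings here, so the comparison is lexicographic.
result = select(books, ["title"], where=lambda a: a.get("year", "") > "1970")
print(result)
```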
  13. Google Datastore
      Part of App Engine; also used for internal applications
      Used for all storage
      Incorporates a transaction model to ensure high consistency
        Optimistic locking
        Transactions can fail
      CAP implications
        Datastore isn’t just “eventually consistent”
        They offer two commercial options (with different prices)
          Master/Slave
            Low latency but also lower availability
            Asynchronous replication
          High Replication
            Strong availability at the cost of higher latency
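"Optimistic locking" and "transactions can fail" go together: a writer does not take a lock, it checks at commit time that nobody else changed the data, and retries if someone did. The sketch below shows that general pattern with a version counter. It is an assumption-laden illustration, not the Datastore API; `VersionedStore`, `ConflictError`, and `increment` are invented names.

```python
class ConflictError(Exception):
    """Raised when another writer committed first."""


class VersionedStore:
    """Toy in-memory store; each key holds a (version, value) pair."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConflictError(key)  # the data changed since it was read
        self._data[key] = (current_version + 1, new_value)


def increment(store, key, retries=3):
    """Read-modify-write with retry, because the commit is allowed to fail."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.commit(key, version, (value or 0) + 1)
            return True
        except ConflictError:
            continue
    return False


store = VersionedStore()
increment(store, "counter")
version, _ = store.read("counter")  # a slow writer remembers this version
increment(store, "counter")         # meanwhile, another write goes through

try:
    store.commit("counter", version, 99)  # stale version, so this is rejected
    stale_write_succeeded = True
except ConflictError:
    stale_write_succeeded = False

print(store.read("counter"), stale_write_succeeded)
```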
  14. Some production data, circa 2008.
  15. For more info, see video of Ryan Barrett’s talk at Google I/O
      Datastore Application at Google
  16. Databases and Key-Value Stores
  17. MapReduce Conceptual Underpinnings
      Programming model from Lisp and other functional languages
        (map square '(1 2 3 4)) ⇒ (1 4 9 16)
        (reduce + '(1 4 9 16)) ⇒ 30
      Easy to distribute
      Nice failure/retry semantics
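The two Lisp expressions on this slide translate almost directly into Python's built-in `map` and `functools.reduce`. A minimal sketch; the `square` helper is defined here only to mirror the Lisp example:

```python
from functools import reduce
from operator import add


def square(x):
    """Helper matching the `square` used in the Lisp example."""
    return x * x


# map applies the function to each element independently, which is
# what makes the work easy to split across machines.
squares = list(map(square, [1, 2, 3, 4]))

# reduce folds the mapped values into a single result.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each call to `square` depends only on its own input, a failed piece of the map step can simply be rerun, which is where the failure/retry semantics come from.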
  18. MapReduce Flow
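A rough single-process sketch of the map, shuffle, and reduce stages usually meant by a MapReduce flow. Word count is chosen purely for illustration, and the function names are hypothetical; they are not Hadoop's or Google's API, and a real framework would run each stage across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["the web scales out", "the data scales up"])
print(counts)
```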
  19. Hadoop MapReduce
      An Open Source project of the Apache Software Foundation
      Other Hadoop-related projects at Apache include:
        Cassandra™: A scalable multi-master database with no single points of failure.
        HBase™: A scalable, distributed database that supports structured data storage for large tables.
        Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
        Pig™: A high-level data-flow language and execution framework for parallel computation.
      See the Apache Hadoop website for more.
  20. Hadoop Availability
      Run on your laptop
      Run on your server
      Run on Amazon Cloud
      Introduction at IBM DeveloperWorks
      Run on Google App Engine
        It’s not Hadoop, it’s Google’s implementation of MapReduce
  21. MapReduce Statistics @ GOOG
      Take-away message:
        MapReduce is not a “new-fangled technology of the future”
        It is here, it is proven, use it!
  22. End of an Era?
      The Relational Model is not necessarily the answer
        It was excellent for data processing
        Not a natural fit for
          Data Warehouses
          Web-oriented search
          Real-time analytics, and
          Semi-structured data
            i.e., Semantic Web
      SQL is not the answer
        Coupling between modern programming languages and SQL is “ugly beyond belief”
        Programming languages have evolved while SQL has remained static
          Pascal
          C/C++
          Java
          The little languages: Python, Perl, PHP, Ruby
      The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
        A critique of the “one size fits all” assumption in DBMS
  23. Take Aways
      NoSQL databases are a solution to web-scale problems
      A lot of data lives outside relational databases
      With, we are starting a local resource for NoSQL database knowledge
        Taking on projects to apply the technology, not just read about it.
        If you want to work on it, please contact us.
      Thanks