Introduction to NOSQL


Presentation of NOSQL concepts against traditional approaches and the challenges behind the concept

Published in: Technology, Business

  1. 1. Introduction to No SQL ESEO, Angers, Mon 14th Jan 2013 Kasso LEGOUDA
  2. 2. About the Presenter: Kasso LEGOUDA, Solutions architect, INFOSYS. Profile summary: Kasso is a senior technology architect with more than 14 years in IT. He has a broad vision of IT and currently works predominantly on SOA, SaaS, user experience (portal, Web 2.0 and e-commerce) and application modernization. Kasso spent over 10 years with software vendors such as Oracle, Vignette and Doubletrade. Main focus and prime skills: • Several portals, WCMS and e-commerce systems • SOA methodologist • Several ECM systems • Identity management • Application modernization • RDBMS • Big data and NoSQL. Industries worked for: Retail, Public sector, Industry, Health, Energy, Utilities
  3. 3. Agenda • NOSQL Motivations • Traditional solutions approaches • NOSQL Centric solutions • Existing implementations in a Nutshell • Return on experience • Questions/Answers and alternatives
  4. 4. NOSQL motivations What is the problem about?
  5. 5. NO SQL tag cloud
  6. 6. NOSQL motivations Challenging the growth of data Challenging the costs Challenging internet constraints
  7. 7. Challenging the growth of data • The rise of the cloud in multiple domains, such as social networks (Facebook, Google), professional networks (LinkedIn, Viadeo), e-commerce (Amazon), social applications (Talk, Gmail, etc.) and business applications, has impacted the traditional N-tier architecture, in particular the storage tier.
  8. 8. Challenging the growth of data • Historically, we started from a reasonable amount of data exchanged on the web; we now face the challenge of providing huge storage space for the data exchanged. Today is the rise of Big Data • Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage and process within a tolerable elapsed time (Wikipedia)
  9. 9. Challenging the growth of data • More than 1.8 zettabytes of information will be created and stored in 2011, according to IDC Digital Universe Study sponsored by EMC. That’s a mind-boggling figure, equivalent to 1.8 trillion gigabytes.
  10. 10. Challenging the growth of data: Big Data (PAIN-O-Meter)
      Unit       Symbol  Bytes
      Kilobyte   KB      1024
      Megabyte   MB      1048576
      Gigabyte   GB      1073741824
      Terabyte   TB      1099511627776
      Petabyte   PB      1125899906842624
      Exabyte    EB      1152921504606846976
      Zettabyte  ZB      1180591620717411303424
      Yottabyte  YB      1208925819614629174706176
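The byte counts in this table are successive powers of 1024. As a minimal sketch (the function names are illustrative, not from the slides), they can be recomputed as follows. The IDC figure quoted on the previous slide uses decimal units instead, where 1 zettabyte is 10^21 bytes and 1 gigabyte is 10^9 bytes.

```python
# Binary (power-of-1024) unit sizes, matching the table on the slide.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def unit_sizes():
    """Return a dict mapping each unit symbol to its size in bytes (1024**n)."""
    return {symbol: 1024 ** (i + 1) for i, symbol in enumerate(UNITS)}


def zettabytes_to_gigabytes_decimal(zettabytes):
    """Convert zettabytes to gigabytes using decimal units (10**21 / 10**9)."""
    return zettabytes * 10 ** 12


if __name__ == "__main__":
    for symbol, size in unit_sizes().items():
        print(f"{symbol:>2}  {size}")
    # 1.8 ZB expressed in decimal gigabytes (about 1.8 trillion)
    print(zettabytes_to_gigabytes_decimal(1.8))
```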
  11. 11. NOSQL motivations Challenging the growth of data Challenging the costs Challenging internet constraints
  12. 12. Challenging the costs • Managing large amounts of data impacts costs • But the problem is not only about the resulting costs of addressing big data; it is also about availability. • Storage costs are extremely high in traditional implementations as soon as storage grows exponentially • The investment is often made once and not incrementally • Hence it requires a significant initial budget
  13. 13. Challenging the costs
      Server                                Specification                               Cost
      PowerEdge T110 II (basic)             8 GB, 3.1 GHz Quad 4T                       $1,350
      PowerEdge T110 II (basic)             32 GB, 3.4 GHz Quad 8T                      $12,103
      PowerEdge C2100                       192 GB, 2 x 3 GHz                           $19,960
      IBM System x3850 X5                   2048 GB, 8 x 2.4 GHz                        $646,605
      Blue Gene/P                           14 teraflops, 4096 CPUs                     $1,300,000
      K Computer (fastest supercomputer)    10 petaflops, 705,024 cores, 1,377 TB       $10,000,000 annual operating cost
  14. 14. Challenging the costs • The challenge is to minimize the cost per GB • Most Internet companies' business models are based on traffic; storage is often a facility to capture an audience (Google, Facebook, etc.). In other cases, the huge amount of storage is a consequence of the business (e.g. Amazon offers millions of references across the world and collects product ratings and reviews from millions of clients, but the business is to capture online orders) – The lower the storage costs, the higher the profitability.
  15. 15. NOSQL motivations Challenging the growth of data Challenging the costs Challenging internet constraints
  16. 16. Challenging Internet constraints • Availability • Scalability • Elasticity • Latency (data and network) • These constraints are due to the distributed nature of the Internet. Cloud architecture is intended to address those constraints
  17. 17. Traditional solution approaches How does the industry (hardware and software) currently address the challenges?
  18. 18. Traditional solution approaches Solution approach from traditional industry in a nutshell Solution approach from traditional industry Examples
  19. 19. Solution approach from traditional industry in a nutshell Pros • Big data and data growth – The industry proposes packaged, general-purpose solutions. Big data is addressed by sharding mechanisms • Internet constraints – Scaling: vendors provide vertical scaling (monolithic: simple to set up and familiar to developers, but tremendously costly) – Low latency: to minimize network latency and maintain ACIDity, one solution: cache, cache and cache Cons • Costs: – This is what it is all about. • Elasticity: – The approach requires a significant initial investment even for incremental growth. However, hardware-level solutions exist (soft virtualization, partitioning, etc.)
  20. 20. Traditional solution approaches Solution approach from traditional industry in a nutshell Solution approach from traditional industry Examples
  21. 21. Solution approach from traditional industry • Big Data: Oracle (Exadata), IBM (Netezza). Data are still structured and stored in RDBMS systems, and transactions are ACID • Scalability: vertical scalability solutions
  22. 22. Solution approach from traditional industry • Elasticity: pseudo-elasticity (hardware or software solutions) • Latency techniques: cache systems, data replication (latency in the cloud implies distribution)
  23. 23. Example of one solution: Exadata • Oracle Exadata example: a storage-level solution delivered as an appliance (software and hardware)
  24. 24. NO SQL The rising of a new approach?
  25. 25. NOSQL What is NOSQL about Ingredients of NoSql systems Type of NOSQL databases Solutionize the internet constraints with NOSQL Summary
  26. 26. What is NOSQL about • NoSQL is the common term for non-conventional cloud architecture solution patterns addressing – Big data: in NoSQL solutions, data are unstructured or semi-structured (no notion of foreign keys) and transactions are not ACID. – Horizontal scaling and elasticity: NoSQL systems are distributed by nature and data are spread across nodes. – Costs: cheaper in terms of cost per GB and per transaction (beware of ROI) because they are based on the cheapest generic hardware – Low latency • NoSQL implementations are designed around the CAP theorem, which addresses network concerns
  27. 27. NOSQL What is NOSQL about Ingredients of NoSql systems Type of NOSQL databases Solutionize the internet constraints with NOSQL Summary
  28. 28. Distributed file system • File system: NoSQL is distributed by nature. Hence, the system needs a distributed file system. – The most popular existing solution is Hadoop HDFS, which derives from Google research – HDFS runs as a cluster that splits data across nodes and keeps the copies synchronized
  29. 29. Map Reduce • Map Reduce (popularized by Google): the file system would be of no use without a way to look up data in it. The pattern used on top of HDFS is Map Reduce. – A Map-Reduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
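The map, sort and reduce steps described on this slide can be sketched in a few lines of plain Python. This is an in-memory illustration of the pattern using the classic word count, not Hadoop's API; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here for illustration only.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key and sort keys, like the framework's sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce task: sum the counts collected for one word."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then shuffle, then reduce each key."""
    emitted = []
    for chunk in chunks:  # in a real cluster each chunk is mapped on a different node
        emitted.extend(map_phase(chunk))
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data store"]))
```

In a real framework the chunks live in the distributed file system and failed map or reduce tasks are re-executed automatically; none of that is modelled here.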
  30. 30. Map Reduce: different types of Map Reduce. Numerical summarizations • This pattern is like a SQL GROUP BY • Known uses – Word count – Record count – Min/Max/Count – Average/Median/Standard deviation • Example in the Pig framework: – b = GROUP a BY groupcol2; – c = FOREACH b GENERATE group, MIN(a.numericalcol1), MAX(a.numericalcol1), COUNT_STAR(a);
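For readers unfamiliar with Pig, the same numerical summarization (group, then min, max and count per group) can be sketched in plain Python. The record layout and the `summarize` helper are assumptions made for this illustration; the field names simply mirror the Pig example.

```python
def summarize(records, group_field, value_field):
    """Compute min, max and count of value_field for each value of group_field."""
    groups = {}
    for record in records:
        key = record[group_field]
        value = record[value_field]
        if key not in groups:
            # First record seen for this group initializes the summary.
            groups[key] = {"min": value, "max": value, "count": 1}
        else:
            summary = groups[key]
            summary["min"] = min(summary["min"], value)
            summary["max"] = max(summary["max"], value)
            summary["count"] += 1
    return groups


if __name__ == "__main__":
    a = [
        {"groupcol2": "x", "numericalcol1": 4},
        {"groupcol2": "y", "numericalcol1": 7},
        {"groupcol2": "x", "numericalcol1": 9},
    ]
    print(summarize(a, "groupcol2", "numericalcol1"))
```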
  31. 31. Map Reduce Different types of Map Reduce Inverted index summarizations • Application: catalogue faceted search
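An inverted index maps each term to the set of items that contain it, which is what makes faceted catalogue search fast. The sketch below builds one in memory; the product data and the helper names `build_inverted_index` and `search` are hypothetical and only illustrate the idea.

```python
def build_inverted_index(documents):
    """Map each term to the set of document ids whose text contains it."""
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, *terms):
    """Return the ids of documents that contain every one of the given terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    catalogue = {1: "red shoes", 2: "blue shoes", 3: "red hat"}
    index = build_inverted_index(catalogue)
    print(search(index, "red", "shoes"))
```

In a Map Reduce setting, the map step would emit (term, document id) pairs and the reduce step would collect the ids for each term.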
  32. 32. Map Reduce Different types of Map Reduce Counting with counters • Known uses – Count number of records – Count a small number of unique instances
  33. 33. Horizontal sharding • Horizontal sharding: this refers to how data are spread across the different nodes of the cluster. Sharding requires a database. One of the most popular is HBase, derived from Google's Bigtable database
  34. 34. Systems and services synchronization • Synchronizing configuration: the objective is to deploy once and run everywhere. In general, NoSQL clusters are bundled with synchronization agents. One of the most popular is ZooKeeper, which comes along with HBase
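One simple way to spread rows across nodes is to hash the row key and take the result modulo the number of nodes. The sketch below illustrates that idea only; it is an assumption made for illustration and not how HBase assigns data (HBase partitions by row-key ranges). The node names are hypothetical.

```python
import hashlib


def shard_for(key, nodes):
    """Pick the node responsible for a key from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Return a dict mapping each node to the list of keys it stores."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c"]  # hypothetical node names
    print(distribute([f"user:{i}" for i in range(10)], cluster))
```

A drawback of plain modulo hashing is that adding or removing a node moves most keys; production systems typically use range partitioning or consistent hashing to limit that movement.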
  35. 35. NOSQL What is NOSQL about Ingredients of NoSql systems Type of NOSQL databases Solutionize the internet constraints with NOSQL Summary
  36. 36. Type of NoSql databases • Key-Value Store • Document Store • Column Database • Graph Database
  37. 37. Key-value store DB • Based on hashing • Basic get/put/delete operations • Very fast • Easy to scale horizontally • Known implementations – Membase – Redis
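The get/put/delete interface listed on this slide can be sketched with an in-memory hash table. The `KeyValueStore` class below is a toy illustration built on a Python dict; it is not the API of Membase or Redis.

```python
class KeyValueStore:
    """A toy in-memory key-value store backed by a hash table (Python dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store or overwrite the value for a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for a key, or default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed, False otherwise."""
        if key in self._data:
            del self._data[key]
            return True
        return False


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("session:1", {"user": "kasso"})
    print(store.get("session:1"))
    print(store.delete("session:1"))
```

Because every operation addresses a single key, such a store is easy to combine with key-based sharding for horizontal scaling.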
  38. 38. Document store DB • Document = self-contained piece of data • Semi-structured data • Querying • Known implementations – MongoDB – RavenDB – Lily (basis for one of the most significant French experiments)
  39. 39. Column DB • Data stored by column • Semi-structured data • Known implementations – HBase – Cassandra – Bigtable
  40. 40. Graph DB • Nodes, properties, edges • Based on graph theory • Node adjacency instead of indices • Known implementations – Neo4J – VertexDB
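"Node adjacency instead of indices" means each node holds direct references to its neighbours, so a traversal follows those references rather than looking up a global index. The `Graph` class below is a minimal sketch of that model with nodes, properties and labelled edges; it does not reflect the API of Neo4J or VertexDB, and the sample data is hypothetical.

```python
class Graph:
    """A toy property graph: nodes with properties and labelled, directed edges."""

    def __init__(self):
        self.properties = {}  # node id -> dict of properties
        self.adjacent = {}    # node id -> list of (label, target id)

    def add_node(self, node_id, **props):
        """Create a node (or update its properties)."""
        self.properties.setdefault(node_id, {}).update(props)
        self.adjacent.setdefault(node_id, [])

    def add_edge(self, source, target, label):
        """Add a directed, labelled edge; missing nodes are created empty."""
        self.add_node(source)
        self.add_node(target)
        self.adjacent[source].append((label, target))

    def neighbours(self, node_id, label=None):
        """Follow outgoing edges of a node, optionally filtered by edge label."""
        return [
            target
            for edge_label, target in self.adjacent.get(node_id, [])
            if label is None or edge_label == label
        ]


if __name__ == "__main__":
    g = Graph()
    g.add_node("alice", role="architect")
    g.add_edge("alice", "bob", "knows")
    g.add_edge("alice", "acme", "works_for")
    print(g.neighbours("alice", "knows"))
```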
  41. 41. NOSQL What is NOSQL about Ingredients of NoSql systems Type of NOSQL databases Solutionize the internet constraints with NOSQL Summary
  42. 42. The CAP Theorem • The CAP theorem addresses network concerns • CAP comprises 3 dimensions – Availability: each client can always read and write data – Consistency: all clients have the same view of the data – Partition tolerance: the system works despite network partitions • The CAP theorem states that a NoSQL database cannot cover all three dimensions at once. Hence the next picture shows a NoSQL categorization according to usage
  43. 43. Visual guide to NoSql systems
  44. 44. The eventual consistency • Eventual consistency is one of the consistency models used in the domain of parallel programming, for example in distributed shared memory, distributed transactions, and optimistic replication. Ingredients • Conflict resolution (to reach eventual consistency) – Read repair – Write repair – Asynchronous repair • Levels of consistency – ONE: the write succeeds once at least one node has recorded the new data in its commit log and memory table, and that node's response has reached the client. – QUORUM: the data has to be written to (replication factor / 2) + 1 nodes, with the division rounded down, before responding to the client. – ALL: all replica nodes have to read (or write) the data
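The three consistency levels translate into a number of replica acknowledgements that depends on the replication factor. The sketch below computes that number and checks the commonly used overlap rule: a read is guaranteed to see the latest write when read acknowledgements plus write acknowledgements exceed the replication factor. The function names are illustrative and not taken from any particular database client.

```python
def quorum(replication_factor):
    """Quorum size: replication factor / 2 (rounded down) + 1."""
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets are guaranteed to overlap."""
    reads = required_acks(read_level, replication_factor)
    writes = required_acks(write_level, replication_factor)
    return reads + writes > replication_factor


if __name__ == "__main__":
    for rf in (3, 5):
        print(rf, quorum(rf), reads_see_latest_write("QUORUM", "QUORUM", rf))
```

With a replication factor of 3, QUORUM reads and writes each need 2 replicas and therefore always overlap, whereas ONE/ONE can return stale data until a repair mechanism catches up.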
  45. 45. NOSQL What is NOSQL about Ingredients of NoSql systems Type of NOSQL databases Solutionize the internet constraints with NOSQL Summary
  46. 46. Summary • NoSQL meets new Internet requirements • NoSQL is not the only solution for Big Data (there are pure old-fashioned models as well as hybrid models) • There are many implementations of NoSQL and each corresponds to a specific usage • NoSQL requires more development effort than the classical model • NoSQL is not suited to data models that rely on foreign keys • Software vendors are proposing composite NoSQL/SQL models
  47. 47. Questions?
  48. 48. Existing implementations in a nutshell How did the main Internet actors address the big data challenge?
  49. 49. How did Google implement the Big Data model • Storage is based on a file system (GFS) • A master server stores metadata and allows access to data
  50. 50. How did Facebook implement the Big Data model • Database sharding + cache • Key-value model • MySQL is customized and extended • No relations stored in the DB • The model is composite
  51. 51. How did Wikipedia implement the Big Data model • Database sharding + cache • MySQL-based model • Segregation between reads and writes
  52. 52. How did Amazon implement the Big Data model • Pure NoSQL • Based on eventual consistency
  53. 53. Return on experience: implementing a Big Data model in e-commerce?
  54. 54. Challenges • Everything should be content: – Don't distinguish between product content management (product catalogue) and content management (articles, reviews, etc.). • Guided navigation must be smooth, back and forth • The target is a very flexible, scalable and cheap CMS
  55. 55. Solution approach based on Lily • Lily is • A document DB • Implementing availability and partition tolerance • Based on HBase/HDFS, Solr and ZooKeeper. Lily core concepts • Storage – HBase – repository model – versioning, variants, mixins • Indexing – Mapping • Search – Solr
  56. 56. Solution approach: Lily and Hadoop • Final store for HBase files • BlobStore for large items • Map-Reduce for distributing batched indexing jobs. Lily and HBase • adds a high-level content model • data types • versioning • blob storage on HDFS • focus on sparse (efficient) storage • RowLog for synchronous cross-table updates • async message queues • HBase is not SQL: no query model, no transactions – Row operations are atomic
  57. 57. Solution architecture: Lily and ZooKeeper • Server discovery • Master election for centralized jobs (indexer, rowlog-shard) • Locks • Runtime configuration needs – e.g. indexer configs • Based on the ZooKeeper leader election mechanism. Lily and Solr • provides flexible mapping between the HBase content model and Solr index fields • interactive and batch (M/R) index maintenance • sharding • uses Solr as-is: loose, flexible, extensible coupling • search access via the Solr (HTTP) API
  58. 58. Solution architecture: Lily deployment architecture
  59. 59. Solution architecture: Lily repository model
  60. 60. Solution architecture : Lily mixins
  61. 61. Solution architecture: Lily versioning
  62. 62. Application: content publishing
  63. 63. Application: content publishing
  64. 64. Lessons learned Pros • Leverage with NoSQL • The solution is now implemented as a product • MapReduce was helpful in solving navigation through reverse indexes. Cons • The project could have been implemented with a traditional solution • The complexity of NoSQL impacted ROI
  65. 65. Conclusion: SQL or NoSQL? • Depends on the project • 3 approaches for cloud databases – Traditional RDBMS – Pure NoSQL – Hybrid approach (mixing RDBMS and NoSQL; to be explored before switching to pure NoSQL)