Finding the Right Data Solution for your Application in the Data Storage Haystack



The NoSQL movement has rekindled interest in data storage solutions. A few years ago, when systems were of limited scale, storage choices for programmers and architects were simple: relational databases were almost always the answer. However, the advent of cloud computing and the ever-increasing user bases of applications have given rise to larger-scale systems. Relational databases cannot always scale to meet the needs of those systems, and the NoSQL movement has proposed many alternatives.

A programmer who wants to select a data model now has to choose from a wide variety of options: local memory, relational databases, files, distributed caches, column family storage, document storage, name-value pairs, graph databases, service registries, queues, tuple spaces, and so on. Furthermore, there are different access layers to choose from, such as accessing the data directly, using an object-relational mapping layer like Hibernate/JPA, or using data services. Moreover, users also need to worry about how to scale the storage in multiple dimensions: the number of databases, the number of tables, the amount of data in a table, the frequency of requests, and the types of requests (read/write ratio).

Consequently, choosing the right data model for a given problem is no longer trivial, and the choice requires a clear understanding of the different storage offerings, their similarities and differences, and the associated tradeoffs. We faced this problem while designing the data interfaces for the Stratos Platform as a Service (PaaS) offering, and in this talk we would like to share our findings and experiences from that work. We will present a survey of different data models, their differences and similarities, their tradeoffs, and the killer apps for each model. We believe participants will walk away with a broader understanding of data models and with guidelines on which model to use when.

Published in: Technology

  1. Finding the Right Data Solution for Your Application in the Data Storage Haystack
     Srinath Perera, Ph.D.
     Senior Software Architect, WSO2 Inc.
     Visiting Faculty, University of Moratuwa
     Research Scientist, Lanka Software Foundation
  2. Data Models
     §  Many data models have been proposed (read Stonebraker’s “What Goes Around Comes Around” for more details):
        o  Hierarchical (IMS): late 1960s and 1970s
        o  Directed graph (CODASYL): 1970s
        o  Relational: 1970s and early 1980s
        o  Entity-Relationship: 1970s
        o  Extended Relational: 1980s
        o  Semantic: late 1970s and 1980s
     §  For the last 20-30 years, relational database systems (SQL) together with transactions have been the de facto data solution.
     Image credit: Greg Morss (CC license)
  3. For many years, the choice of data storage was an easy one (use an RDBMS).
     Image credit: Alan Murray Walsh (CC license)
  4. Scale of Systems
     §  However, the scale of systems is changing due to
        o  Increasing user bases of systems
        o  Mobile devices, online presence
        o  Cloud computing and multicore systems
     §  Scaling up an RDBMS
        o  Put it on a bigger machine.
        o  Replicate (cluster) the database to 2-3 more nodes. This approach does not scale much further.
        o  Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard and sometimes too slow, and they often need custom code and configuration. Transactions do not scale well either.
     Image credit: digitalART2 (CC license)
  5. CAP Theorem, Transactions, and Storage
     §  The RDBMS model provides two things
        o  The relational model with SQL
        o  ACID transactions (Atomicity, Consistency, Isolation, Durability)
     §  It was a classic one-size-fits-all solution, and it worked for quite some time.
     §  However, the CAP theorem says that you cannot have it all.
        o  Consistency, Availability, and Partition Tolerance: pick two!
     §  But many use cases do not need all RDBMS features, and when those features are dropped, systems can scale (e.g. Google Bigtable).
     §  However, to use such systems, one has to understand and exploit application-specific behavior.
     Image credit: stephcarter (CC license)
  6. NoSQL and Other Storage Systems
     §  Large internet companies hit the problem first. They built systems specific to their problems, and those systems did scale.
        o  Google Bigtable
        o  Amazon Dynamo
     §  Soon many others followed, and most of them are free and open source.
     §  Now there are a couple of dozen.
     §  Among the advantages of NoSQL are
        o  Scalability
        o  Flexible schema
        o  Designed to scale and support fault tolerance out of the box
     Image credit: ind{yeah} (CC license)
  7. However, with NoSQL solutions, choosing a data store is no longer simple.
     Image credit: Philipp Salzgeber (CC license)
  8. Selecting the Right Data Solution
     §  What are the right questions to ask?
     §  Categorize the answers for each question.
     §  Take different cases based on the different answers and make recommendations!
     Image credit: Krzysztof Poltorak (CC license)
  9. What are the right Questions?
     o  Types of data
        -  Structured, Semi-Structured, Unstructured
     o  Need for Scalability
        -  Number of users
        -  Number of data items
        -  Size of files
        -  Read/Write ratio
     o  Types of Queries
        -  Retrieve by Key
        -  WHERE clauses
        -  JOIN queries
        -  Offline Queries
     o  Consistency
        -  Loose Consistency
        -  Single Operation Consistency
        -  Transactions
     Image credit: romainguy (CC license), photos/romainguy/249370084
  10. Unstructured Data
     §  The data does not have a particular structure and is often retrieved through a key (name).
        o  E.g. file systems
     §  Humans are good at processing unstructured data, but computers are not.
     §  This data is often held in storage but consumed by humans at the end of the pipeline (e.g. a document repository).
     §  One common use case is building structured data from unstructured data.
     §  Metadata is often associated with the data to help searching.
     Image credit: Martyn Gorman (CC license)
  11. Structured Data
     §  The data has a structure, often described through a schema.
     §  Often a table-like 2D structure is used, but other structures are also possible.
     §  The main advantage of the structure is search.
     §  The schema can be provided at deployment time or at runtime (dynamic schema).
     §  The schema can be used to
        o  Validate data
        o  Support user-friendly search
        o  Optimize storage and queries
     Image credit: Marion Doss (CC license), photos/ooocha/2611398859/
  12. Semi-structured Data
     §  The structure is not fully defined, but there is some inherent structure.
     §  For example
        o  XML documents, where data is stored in a tree-like structure
        o  Graph data
        o  Data structures like lists and arrays
     §  Supports queries based on the structure.
     §  But processing the data often needs custom code.
     Image credit: Walter Baxter
  13. Search
     §  Unstructured Data: no structure to support search.
        o  Search based on an inverted (reverse) index
        o  Search through properties
     §  Semi-Structured Data
        o  XML (or any tree-like structure) can be searched with XPath or XQuery.
        o  Tuple spaces can be queried through tuple space templates.
        o  Data registries can be searched for entries that match given metadata descriptions (search by properties).
        o  Graphs can be queried based on connectivity.
     §  Structured Data
        o  Retrieve by Key
        o  WHERE clauses
        o  Queries with JOINs
        o  Offline Queries
     Image credit: digitalART2 (CC license)
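The reverse (inverted) index mentioned for unstructured data can be sketched in a few lines. This is only an illustration under simple assumptions: the tokenizer, the function names `build_index` and `search`, and the sample documents are all hypothetical, and real search engines add stemming, ranking, and persistence on top of this idea.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map every term to the set of document keys that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the keys of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, keyed by name as in a file system.
docs = {
    "report.txt": "Quarterly sales report for the cloud platform",
    "notes.txt": "Meeting notes about cloud storage options",
    "todo.txt": "Buy milk",
}
index = build_index(docs)
```

With this index, `search(index, "cloud")` finds both cloud-related documents without scanning the file contents at query time, which is the point of building the index up front.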
  14. Consistency and Scalability
     §  Scalability: the ability to handle more users, more data, or larger files by adding more nodes. We use three categories.
        o  Small systems (can be handled with 1-3 nodes)
        o  Scalable systems (can be handled with about 10 nodes)
        o  Highly scalable systems (anything larger; can be 100s or 1000s of nodes)
     §  Consistency: how replicas of the same data on many nodes are kept in sync, and how they can be updated without data corruption. We use three categories.
        o  Transactional: a series of operations applied in an ACID manner
        o  Atomic operation: a single operation, updated in all replicas
        o  Eventual consistency: the data will eventually become consistent
     Image credit: NNSANews (CC license), 5347287260/
  15. Data Storage Alternatives
  16. Data Storage Implementations
     §  Expectations of data stores
        o  Reliably store the data
        o  Efficient search and retrieval of data whenever needed
        o  Data management: delete and update data
     Image credit: John Atherton (CC license)
  17. Challenges of Data Storage
     §  Reliability
        o  Replicating data
        o  Creating backups and recovering from backups
     §  Security
     §  Scaling and parallel access
        o  Distribution or replication
        o  ACID transactions
     §  Availability
        o  Data replication
     §  Vendor lock-in
        o  Interoperability, standard query languages
     §  Simple user experience
        o  Hide the physical location of data
        o  Provide simple APIs and security models
        o  Expressive query languages
  18. Data Storage Choices
     Data type: Structured
     Storage type | Advantages | Disadvantages | Query: Key | Query: Where | Query: Joins | Transactions | Scale | Flexible schema
     Local memory | Very fast | Not durable | Yes | No | No | No, unless STMs are used | No | Yes
     Relational/SQL | Standardized | Rigid schema, good for read-oriented use cases | Yes | Yes | Yes | Yes | Moderate | No
     Column families (NoSQL) | High write performance, replicated | Not transactional, no online joins | Yes | Yes, secondary index | No | No | High | Yes
     Document DBs | High write performance, replicated | Not transactional, no online joins | Yes | Yes, views | No | No | Yes | Yes
     Object Databases | Easy to integrate with programming languages | | Yes | Yes | Yes | Yes | No | No
  19. Data Storage Choices (continued)
     Storage type | Data type | Advantages | Disadvantages | Query: Key | Query: Search | Transactions | Scale | Flexible schema
     Files | | Save big files whose format is not understood | No structured search on content | Yes | Indexing | No | Moderate | Yes
     Data Registries/Metadata Catalogs | Unstructured | Metadata search | | Yes | Property-based search (Where) | No | Moderate | Yes
     Queues | | Representation of flow of messages over time / tasks | | Yes | N/A | No | Yes | Yes
     Triple Stores | | Used for inference, very fast relationship processing | | Yes | Relationship search | No | No | Yes
     XML database | | XML native | | | XPath/XQuery | | |
     Distributed Cache | | Fast, replicated | No search | Yes | No | No | Yes | Yes
     Key-value pairs | | High write performance, replicated | Model is too simple in some cases, not transactional | Yes | No | No | Yes | Yes
     Graph DBs | Semi-structured | Very fast joins, natural to represent relationships | Not very scalable | Yes | Graph search | Yes | Low | N/A
  20. Choosing the Right Data Solution
  21. How do we do this?
     §  Consider structured, semi-structured, and unstructured data separately.
        o  Then drill down based on the other three properties: scale, consistency, and search.
     §  The structured case is more complicated; the other two are a bit simpler.
     §  Start by giving a de facto choice for each case.
     Image credit: 8664 (CC license), photos/80464769@N00/186598462/
  22. Handling Structured Data
     §  There are three main considerations: scale, consistency, and queries.
     Small (1-3 nodes)
     Query | Loose Consistency | Operation Consistency | ACID Transactions
     Primary Key | DB/KV/CF | DB/KV/CF | DB
     Where | DB/CF/Doc | DB/CF/Doc | DB
     JOIN | DB | DB | DB
     Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc
     Scalable (10 nodes)
     Query | Loose Consistency | Operation Consistency | ACID Transactions
     Primary Key | KV/CF | KV/CF | Partitioned DB?
     Where | CF/Doc (?) | CF/Doc (?) | Partitioned DB?
     JOIN | ?? | ?? | ??
     Offline | CF/Doc | CF/Doc | No
     Highly Scalable (1000s of nodes)
     Query | Loose Consistency | Operation Consistency | ACID Transactions
     Primary Key | KV/CF | KV/CF | No
     Where | CF/Doc | CF/Doc | No
     JOIN | No | No | No
     Offline | CF/Doc | CF/Doc | No
     *KV: Key-Value systems, CF: Column Families, Doc: Document-based systems
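One way to make the structured-data matrix above usable in a design review is to encode it as a lookup table. The sketch below is illustrative only: the cell values are transcribed from the slide as best they can be read from the flattened text (the "?" marks are the slide's own), while the dictionary layout, the key names, and the `recommend` function are assumptions introduced here.

```python
# Column order used in every row of the matrix (names are hypothetical).
CONSISTENCY_LEVELS = ("loose", "operation", "acid")

# scale -> query type -> recommendation per consistency level.
# KV: key-value systems, CF: column families, Doc: document-based systems.
MATRIX = {
    "small": {
        "primary_key": ("DB/KV/CF", "DB/KV/CF", "DB"),
        "where": ("DB/CF/Doc", "DB/CF/Doc", "DB"),
        "join": ("DB", "DB", "DB"),
        "offline": ("DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc"),
    },
    "scalable": {
        "primary_key": ("KV/CF", "KV/CF", "Partitioned DB?"),
        "where": ("CF/Doc (?)", "CF/Doc (?)", "Partitioned DB?"),
        "join": ("??", "??", "??"),
        "offline": ("CF/Doc", "CF/Doc", "No"),
    },
    "highly_scalable": {
        "primary_key": ("KV/CF", "KV/CF", "No"),
        "where": ("CF/Doc", "CF/Doc", "No"),
        "join": ("No", "No", "No"),
        "offline": ("CF/Doc", "CF/Doc", "No"),
    },
}


def recommend(scale, query, consistency):
    """Look up the slide's recommendation for one combination of answers."""
    column = CONSISTENCY_LEVELS.index(consistency)
    return MATRIX[scale][query][column]


# Example: a small system that needs WHERE queries inside ACID transactions.
choice = recommend("small", "where", "acid")
```

Keeping the recommendations as data rather than nested conditionals makes it easy to revise a single cell after a pilot run changes the picture.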
  23. Handling Small Scale Systems (1-3 nodes)
     §  In general, using a DB for every case here might work.
     §  Reasons for using options other than a DB
        o  There is a potential need to scale later.
        o  High write throughput is required.
     §  KV is one-dimensional, whereas the other two are two-dimensional.
     Small (1-3 nodes)
     Query | Loose Consistency | Operation Consistency | ACID Transactions
     Primary Key | DB/KV/CF | DB/KV/CF | DB
     Where | DB/CF/Doc | DB/CF/Doc | DB
     JOIN | DB | DB | DB
     Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc
     *KV: Key-Value systems, CF: Column Families, Doc: Document-based systems
  24. Handling Scalable Systems
     §  KV, CF, and Doc can easily handle this case.
     §  If DBs are used with the data sharded across many nodes
        o  Transactions might work, provided that not too many participants are involved in one transaction.
        o  JOINs might need to transfer too much data between nodes.
        o  In-memory DBs like VoltDB should also be considered.
     §  Offline mode will work.
     §  Most systems let users choose the consistency level, and loose consistency can scale further (e.g. Cassandra).
     Scalable (10 nodes)
     Query | Loose Consistency | Operation Consistency | ACID Transactions
     Primary Key | KV/CF | KV/CF | Partitioned DB?
     Where | CF/Doc | CF/Doc | Partitioned DB?
     JOIN | ?? | ?? | Partitioned DB??
     Offline | CF/Doc | CF/Doc | No
     *KV: Key-Value systems, CF: Column Families, Doc: Document-based systems
  25. Highly Scalable Systems
     §  Transactions do not work at this scale (CAP theorem).
     §  The same goes for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.
     §  The offline case is handled through Map-Reduce. Even the JOIN case is OK offline, since there is time.
     Highly Scalable (1000s of nodes)
     Query | Loose Consistency | Operation Consistency | ACID Transactions
     Primary Key | KV/CF | KV/CF | No
     Where | CF/Doc | CF/Doc | No
     JOIN | No | No | No
     Offline | CF/Doc | CF/Doc | No
     *KV: Key-Value systems, CF: Column Families, Doc: Document-based systems
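To illustrate why an offline JOIN is feasible with Map-Reduce, here is a minimal single-process sketch of a reduce-side join. It is not tied to Hadoop or any other framework: the three phase functions, the `users` and `orders` record sets, and their field names are all hypothetical, and a real job would run the map and reduce phases in parallel across nodes.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join key, tagged record) pairs from both inputs."""
    for user in users:
        yield user["id"], ("user", user)
    for order in orders:
        yield order["user_id"], ("order", order)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every matching order record."""
    for _key, tagged in groups.items():
        users = [record for tag, record in tagged if tag == "user"]
        orders = [record for tag, record in tagged if tag == "order"]
        for user in users:
            for order in orders:
                yield {"name": user["name"], "item": order["item"]}


# Hypothetical inputs that would normally live on different nodes.
users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [
    {"user_id": 1, "item": "disk"},
    {"user_id": 1, "item": "cable"},
    {"user_id": 3, "item": "fan"},
]

joined = list(reduce_phase(shuffle(map_phase(users, orders))))
```

The shuffle step is where the data transfer between nodes happens, which is the cost that makes the same JOIN impractical for online queries at this scale.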
  26. Highly Scalable Systems + Primary Key Retrieval
     §  This is (comparatively) the easy one.
     §  It can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.
     §  Both Key-Value storage (KV) and Column Families (CF) can be used, but the Key-Value model is preferred as it is more scalable.
     Highly Scalable (1000s of nodes)
     Query | Loose Consistency | Operation Consistency | ACID Transactions
     Primary Key | KV/CF | KV/CF | No
     Where | CF/Doc (?) | CF/Doc (?) | No
     JOIN | No | No | No
     Offline | CF/Doc | CF/Doc | No
     *KV: Key-Value systems, CF: Column Families, Doc: Document-based systems
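Primary-key retrieval scales well because the key alone determines which node holds the data. A common building block of DHT-style systems is consistent hashing, sketched below. This is a simplified illustration, not the routing scheme of any particular product: the `HashRing` class, the choice of SHA-1, the number of virtual nodes, and the node names are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring that maps keys to nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical cluster: adding a node only moves keys onto the new node.
ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(200)]
moved = [k for k in keys if ring.node_for(k) != bigger.node_for(k)]
```

Any client can compute `node_for(key)` locally, so a lookup needs no central directory, and growing the cluster relocates only the keys that land on the new node.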
  27. Highly Scalable Systems + WHERE
     §  This is generally OK, but tricky.
     §  CF works through a secondary index that does scatter-gather (e.g. Cassandra).
     §  Doc works through Map-Reduce views (e.g. CouchDB).
     §  There is Bissa, which builds an index for all possible queries (no range queries).
     §  If you are doing this, you should do pilot runs and make sure things work.
     Highly Scalable (1000s of nodes)
     Query | Loose Consistency | Operation Consistency | Transactions
     Primary Key | KV/CF | KV/CF | No
     Where | CF/Doc (?) | CF/Doc (?) | No
     JOIN | No | No | No
     Offline | CF/Doc | CF/Doc | No
     *KV: Key-Value systems, CF: Column Families, Doc: Document-based systems
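The scatter-gather pattern behind a WHERE query on partitioned data can be sketched as follows. This is a generic in-process illustration and does not reproduce how Cassandra or any other store implements secondary indexes: the partitions are plain Python lists standing in for nodes, and `query_partition`, `scatter_gather`, and the sample rows are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def query_partition(rows, predicate):
    """Run the WHERE predicate against the rows held by one node."""
    return [row for row in rows if predicate(row)]


def scatter_gather(partitions, predicate):
    """Send the query to every partition in parallel, then merge the answers."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = list(
            pool.map(lambda rows: query_partition(rows, predicate), partitions)
        )
    merged = []
    for partial in partial_results:
        merged.extend(partial)
    return merged


# Hypothetical data set split across three nodes by primary key.
partitions = [
    [{"id": 1, "city": "Colombo"}, {"id": 2, "city": "Kandy"}],
    [{"id": 3, "city": "Colombo"}],
    [{"id": 4, "city": "Galle"}],
]

matches = scatter_gather(partitions, lambda row: row["city"] == "Colombo")
```

Because every node has to answer every such query, latency is bounded by the slowest node, which is one reason the slide recommends pilot runs before relying on this approach.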
  28. Handling Unstructured Data
     §  Storage Options
        o  Distributed file systems: generally scalable (e.g. NFS), but HDFS (Hadoop) and Lustre are highly scalable versions.
        o  Metadata registries (e.g. Nirvana, SDSC Storage Resource Broker)
  29. Handling Semi-Structured Data
     Type | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable
     XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ??
     Graphs | Graph DBs | Graph DBs if the graph can be partitioned | ??
     Data Structures | Data Structure Servers, Object Databases | |
     Queues | Distributed Queues | Distributed Queues | Distributed Queues
     §  Storage Options
        o  The answer depends on the type of structure. If there is a server optimized for a given type, it is often much more efficient than using a DB (e.g. graph databases can support fast relationship search).
     §  Search
        o  Very much custom. E.g. XML or any tree can be searched with XPath, and graphs can support very fast relationship search.
  30. Hybrid Approaches
     §  Some solutions have many types of data and hence need more than one data solution (hybrid architectures).
     §  For example
        o  Using a DB for transactional data and CF for other data.
        o  Keeping metadata and actual data separate for large data archives.
        o  Using a graph DB to store relationship data while the other data is in Column Family storage.
     §  However, if transactions are needed, they have to be handled outside the storage (e.g. using Atomikos or ZooKeeper).
     Image credit: Matthew Oliphant (CC license), photos/fajalar/3174131216/
  31. Other parameters
     §  The list above is not exhaustive, and there are other parameters
        o  Read/Write ratio: when it is high (read-heavy), the system is easy to scale
        o  High write throughput
        o  Very large data products: you will need a file system. Maybe keep the metadata in a data registry and store the data in a file system.
        o  Flexible schema
        o  Archival use cases
        o  Analytical use cases
        o  Others …
     §  So there is no silver bullet …
  32. Conclusion
     §  For the last 20 years or so, DBMSs were the de facto storage solution.
     §  However, DBMSs could not scale well, and many NoSQL solutions have been proposed instead.
     §  As a result, it is no longer easy to find the best data solution for your problem.
     §  We discussed many dimensions (types of data, scalability, queries, and consistency) and provided guidelines on when to use which data solution.
     §  Your feedback and thoughts are most welcome. Contact me through