Finding the Right Data Solution
for Your Application in the Data
       Storage Haystack
               Srinath Perera Ph.D.
      Senior Software Architect, WSO2 Inc.
     Visiting Faculty, University of Moratuwa
  Research Scientist, Lanka Software Foundation
In Search for right Data Models
 §  Many data models have been
     proposed (read Stonebraker’s
     “What Goes Around Comes
     Around” for more details)
      o  Hierarchical (IMS): late 1960’s
         and 1970’s
      o  Directed graph (CODASYL):
         1970’s
      o  Relational: 1970’s and early
         1980’s
      o  Entity-Relationship: 1970’s
      o  Extended Relational: 1980’s
      o  Semantic: late 1970’s and 1980’s
   §  Database systems (SQL) together with transactions have been
       the de facto data solution.
Copyright Greg Morss and licensed for reuse under CC License , http://www.geograph.org.uk/photo/990700
For many years, the choice of data storage was
            an easy one (use an RDBMS)
Copyright by Alan Murray Walsh and licensed for reuse under CC License , http://www.geograph.org.uk/photo/1652880
Increasing Scale of Systems
  §  However, the scale of systems
      is changing due to
         o  Increasing user bases of
            systems.
         o  Mobile devices, online presence
         o  Cloud computing and multicore
            systems
   §  Scaling up RDBMS
          o  Put it in a bigger machine
          o  Replicate (cluster) the database to 2-3 more nodes, but this
             approach does not scale much further.
          o  Partition the data across many nodes (distribute, a.k.a. sharding).
             However, JOIN queries across many nodes are hard and
             sometimes too slow, and they often need custom code and
             configuration. Transactions do not scale well either.
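As a rough illustration of the partitioning option above, the sketch below routes rows to shards by hashing the key. `ShardedStore`, `shard_for`, and the in-memory dictionaries standing in for database nodes are assumptions made for this example, not part of any particular RDBMS. It shows why key lookups stay cheap while other queries have to touch every shard.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used instead.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Hypothetical store: each dict stands in for one database node."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, row: dict) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = row

    def get(self, key: str):
        # A lookup by key touches exactly one shard.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def scan(self, predicate):
        # A query that is not keyed on the shard key must visit every shard.
        return [row for shard in self.shards
                for row in shard.values() if predicate(row)]
```

A JOIN across two sharded tables has the same problem as `scan`, only worse: matching rows may live on different shards, so data has to move between nodes before it can be combined.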

Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/
CAP Theorem, Transactions, and Storage
  §  The RDBMS model provides two things
         o  Relational model with SQL
         o  ACID transactions (Atomicity,
            Consistency, Isolation, Durability)
  §  It was a classic one-size-fits-all
      solution, and it worked for quite
      some time.
  §  However, the CAP theorem says that
      you cannot have it all.
         o  Consistency, Availability, and Partition
            Tolerance: pick two!

 §  But many use cases do not need all RDBMS features, and
     when those features are dropped, systems can scale (e.g.
     Google Bigtable).
 §  However, to use such systems, one has to understand and
     exploit application-specific behavior.
Copyright stephcarter and licensed for reuse under CC License , http://www.flickr.com/photos/stephcarter/541464462
NoSQL and other Storage Systems
§  Large internet companies hit the problem first. They built
    systems specific to their problems, and those systems did
    scale.
    o  Google Bigtable
    o  Amazon Dynamo
§  Soon many others followed, and most of them are free and
    open source. Now there are a couple of dozen.
§  Among the advantages of NoSQL are
    o  Scalability
    o  Flexible schema
    o  Designed to scale and support fault tolerance out of the box
Copyright O hai :3 and licensed for reuse under CC License , http://www.flickr.com/photos/christigain/5636887941
However, with NoSQL solutions, choosing a
       data storage is no longer simple.
Copyright Philipp Salzgeber on and licensed for reuse under CC License http://www.salzgeber.at/astro/pics/20081126_heart/index.html
Selecting the Right Data Solution




§  What are the right Questions to ask?
§  Categorize Answers for each question
§  Take different cases based on different answers and make
    recommendations!

 Copyright by Krzysztof Poltorak, and licensed for reuse under CC License.
          http://www.fotocommunity.com/pc/pc/display/22077920
What are the right Questions?
                                                                      o  Types of data
                                                                             -  Structured, Semi-Structured,
                                                                                Unstructured
                                                                      o  Need for Scalability
                                                                             -    Number of users
                                                                             -    Number of data items
                                                                             -    Size of files
                                                                             -    Read/Write ratio
                                                                      o  Types of Queries
                                                                             -    Retrieve by Key
                                                                             -    WHERE clauses
                                                                             -    JOIN queries
                                                                             -    Offline Queries
                                                                      o  Consistency
                                                                             -  Loose Consistency
                                                                             -  Single Operation Consistency
                                                                             -  Transactions



  Copyright by romainguy, and licensed for reuse under CC License http://www.flickr.com/photos/romainguy/249370084
4Q > Types of Data > Unstructured Data
§  Data does not have a particular structure and is often
    retrieved through a key (name).
    o  E.g. file systems.
§  Humans are good at processing unstructured data, but
    computers are not.



§  Such data is often kept in storage but consumed by humans
    at the end of the pipeline (e.g. a document repository).
§  One common use case is building structured data from
    unstructured data.
§  Metadata is often associated with the data to help searching.

Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134
4Q > Types of Data > Structured Data
 §  Has a structure, often described through a schema.
 §  Often a table-like 2D structure is used, but other structures
     are also possible.
 §  The main advantage of the structure is search.

§  Schema can be provided at
    deployment time or at
    runtime (dynamic schema)
§  Schema can be used to
    o  Validate data
    o  Support user friendly search
    o  Optimize storage and queries




 Copyright Marion Doss by and licensed for reuse under CC License , http://www.flickr.com/photos/ooocha/2611398859/
4Q > Types of Data > Semi-structured Data
  §  Structure is not fully defined.
      But there is some inherent
      structure.
  §  For example
       o  XML documents, where data is
          stored in a tree-like structure
       o  Graph data
       o  Data structures like lists and
          arrays
  §  Support queries based on
      structure
  §  But processing data often
      needs custom code.


Copyright Walter Baxter http://www.geograph.org.uk/photo/1069339
4Q > Search
§  Unstructured Data – no structure to support search.
     o  Search based on an inverted index
     o  Search through properties
§  Semi-Structured Data
     o  XML (or any tree-like structure) can be searched with XPath or XQuery.
     o  Tuple spaces can be queried through tuple space templates
     o  Data registries can be searched for entries that match given
        metadata descriptions (search by properties)
     o  Graphs can be queried based on connectivity
§  Structured Data
     o    Retrieve by Key
     o    WHERE clauses
     o    Queries with JOINs
     o    Offline Queries
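Index-based search over unstructured data, as listed above, is commonly built on an inverted index that maps each word to the documents containing it. The following is a minimal sketch under simple assumptions (whitespace tokenization, lowercase matching, all query words must match); the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "a.txt": "NoSQL storage systems",
    "b.txt": "Relational storage",
}
index = build_inverted_index(docs)
matches = search(index, "nosql storage")  # only a.txt contains both words
```

Structured stores can answer the same kind of question with a WHERE clause because the schema tells them which field to index; for unstructured content the index has to be built from the raw text.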



Copyright bydigitalART2 and licensed for reuse under CC License ,
        http://www.flickr.com/photos/digitalart/2101765353/
4Q > Consistency and Scalability
§  Scalability: the ability to handle more users, more data, or
    larger files by adding more nodes. We use 3 categories.
   o  Small systems (can be handled with 1-3 nodes)
   o  Scalable systems (can be handled with about 10 nodes)
   o  Highly scalable systems (anything larger, can be 100s or 1000s of
      nodes)
 §  Consistency: how replicas of the same data on many nodes are
     kept in sync, and how they can be updated without data
     corruption. We use 3 categories.
    o  Transactional: a series of operations applied in an ACID manner
    o  Atomic operation: a single operation, applied to all replicas
    o  Eventual consistency: replicas will eventually become consistent
Copyright NNSANews and licensed for reuse under CC License , http://www.flickr.com/photos/nnsanews/5347287260/
Data Storage
 Alternatives
Data Storage Implementations




§  Expectations from data storages
   o  Reliably store the data
   o  Efficient search and retrieval of data whenever needed
   o  Data management – delete, update data
Copyright Stephen Eckert and licensed for reuse under CC License , http://www.flickr.com/photos/s_eckert/5378588233
Challenges of Data Storage
§  Reliability
   o  Replicating data
   o  Creating backup or recovering using backups
§  Security
§  Scaling and Parallel access
   o  Distribution or replications
   o  ACID transactions
§  Availability
   o  Data replications
§  Vendor lock-in
   o  Interoperability, standard query languages
§  Simple use experience
   o  Hide the physical location of data,
   o  Provide simple API and security models
   o  Expressive query languages.
Data Storage Choices
Storage                  Type        Advantages                                    Disadvantages                                   Query: Key  Query: Where          Query: Joins  Transactions     Scale     Flexible schema
Local memory                         Very fast                                     Not durable                                     Yes         No                    No            No, unless STMs  No        Yes
Relational/SQL           Structured  Standardized                                  Rigid schema, good for read-oriented use cases  Yes         Yes                   Yes           Yes              Moderate  No
Column families (NoSQL)  Structured  High write performance, replicated            Not transactional, no online joins              Yes         Yes, secondary index  No            No               High      Yes
Document DBs             Structured  High write performance, replicated            Not transactional, no online joins              Yes         Yes, views            No            No               Yes       Yes
Object databases         Structured  Easy to integrate with programming languages                                                  Yes         Yes                   Yes           Yes              No        No

Storage                            Type             Advantages                                             Disadvantages                                         Query: Key  Query: Search                  Transactions  Scale     Flexible schema
Files                              Unstructured     Save big files whose format is not understood          No structured search on content                       Yes         Indexing                       No            Moderate  Yes
Data registries/metadata catalogs  Unstructured     Metadata search                                                                                              Yes         Property-based search (Where)  No            Moderate  Yes
Queues                             Semi-structured  Representation of flow of messages over time/tasks                                                           Yes         N/A                            No            Yes       Yes
Triple stores                                       Used for inference, very fast relationship processing                                                        Yes         Relationship search            No            No        Yes
XML database                       Semi-structured  XML native                                                                                                               XPath/XQuery
Distributed cache                                   Fast, replicated                                       No search                                             Yes         No                             No            Yes       Yes
Key-value pairs                                     High write performance, replicated                     Model is too simple in some cases, not transactional  Yes         No                             No            Yes       Yes
Graph DBs                          Semi-structured  Very fast joins, natural to represent relationships    Not very scalable                                     Yes         Graph search                   Yes           Low       N/A
Choosing the Right
  Data Solution
How Should We Do This?

§  Consider structured, semi-structured, and unstructured data
    separately.
   o  Then drill down based on the other 3 properties: scale,
      consistency, and search.
§  The structured case is more complicated; the other two are a
    bit simpler.
§  Start by giving a de facto choice for each case.

Copyright Brian Robert Marshall and licensed for reuse under CC License , http://www.geograph.org.uk/photo/938546
Handling Structured Data
  §  There are three main considerations: scale, consistency,
      and queries.

              Small (1-3 nodes)                       Scalable (10 nodes)                        Highly Scalable (1000s of nodes)
              Loose        Operation    ACID          Loose        Operation    ACID             Loose        Operation    ACID
              consistency  consistency  transactions  consistency  consistency  transactions     consistency  consistency  transactions
Primary Key   DB/KV/CF     DB/KV/CF     DB            KV/CF        KV/CF        Partitioned DB?  KV/CF        KV/CF        No
Where         DB/CF/Doc    DB/CF/Doc    DB            CF/Doc(?)    CF/Doc(?)    Partitioned DB?  CF/Doc       CF/Doc       No
JOIN          DB           DB           DB            ??           ??           ??               No           No           No
Offline       DB/CF/Doc    DB/CF/Doc    DB/CF/Doc     CF/Doc       CF/Doc       No               CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
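One way to read the table above is as a three-way lookup on scale, query type, and consistency. The sketch below simply transcribes the cells into a dictionary; the key names (`small`, `primary_key`, `loose`, and so on) and the `recommend` helper are hypothetical labels chosen for this example, and the "?" entries are kept exactly as they appear in the table.

```python
# Column order inside each tuple: loose consistency, operation consistency, ACID.
CONSISTENCY = ("loose", "operation", "acid")

RECOMMENDATIONS = {
    "small": {  # 1-3 nodes
        "primary_key": ("DB/KV/CF", "DB/KV/CF", "DB"),
        "where": ("DB/CF/Doc", "DB/CF/Doc", "DB"),
        "join": ("DB", "DB", "DB"),
        "offline": ("DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc"),
    },
    "scalable": {  # about 10 nodes
        "primary_key": ("KV/CF", "KV/CF", "Partitioned DB?"),
        "where": ("CF/Doc(?)", "CF/Doc(?)", "Partitioned DB?"),
        "join": ("??", "??", "??"),
        "offline": ("CF/Doc", "CF/Doc", "No"),
    },
    "highly_scalable": {  # 1000s of nodes
        "primary_key": ("KV/CF", "KV/CF", "No"),
        "where": ("CF/Doc", "CF/Doc", "No"),
        "join": ("No", "No", "No"),
        "offline": ("CF/Doc", "CF/Doc", "No"),
    },
}


def recommend(scale: str, query: str, consistency: str) -> str:
    """Return the table cell for the given scale, query type, and consistency."""
    return RECOMMENDATIONS[scale][query][CONSISTENCY.index(consistency)]
```

For example, `recommend("highly_scalable", "primary_key", "loose")` returns the "KV/CF" cell, while any ACID request at that scale returns "No".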
Handling Small Scale Systems (1-3 nodes)
§  In general, using a DB for every case here might work.
§  Reasons for using options other than a DB
   o  When there is a potential need to scale later.
   o  High write throughput
§  KV is 1-D, whereas the other two (CF and Doc) are 2-D

              Small (1-3 nodes)
              Loose        Operation    ACID
              consistency  consistency  transactions
Primary Key   DB/KV/CF     DB/KV/CF     DB
Where         DB/CF/Doc    DB/CF/Doc    DB
JOIN          DB           DB           DB
Offline       DB/CF/Doc    DB/CF/Doc    DB/CF/Doc

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
Handling Scalable Systems
§  KV, CF, and Doc can easily handle this case.
§  If DBs are used with data sharded across many nodes
   o  Transactions might work, provided that not too many nodes
      participate in one transaction.
   o  JOINs might need to transfer too much data between nodes.
   o  In-memory DBs like VoltDB should also be considered.
§  Offline mode will work.
§  Most systems let users choose the consistency level, and loose
    consistency scales better (e.g. Cassandra).

              Scalable (10 nodes)
              Loose        Operation    ACID
              consistency  consistency  transactions
Primary Key   KV/CF        KV/CF        Partitioned DB?
Where         CF/Doc       CF/Doc       Partitioned DB?
JOIN          ??           ??           Partitioned DB??
Offline       CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
Highly Scalable Systems

§  Transactions do not work at this scale (CAP theorem).
§  The same goes for JOINs. The problem is that sometimes too much
    data needs to be transferred between nodes to perform the JOIN.
§  The offline case is handled through MapReduce. Even the JOIN
    case is OK since there is time.

              Highly Scalable (1000s of nodes)
              Loose        Operation    ACID
              consistency  consistency  transactions
Primary Key   KV/CF        KV/CF        No
Where         CF/Doc       CF/Doc       No
JOIN          No           No           No
Offline       CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
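To make the offline JOIN case more concrete, here is a minimal in-process sketch of a reduce-side join in the MapReduce style. It is not Hadoop code: the `map_phase`, `shuffle`, and `reduce_phase` functions, the `user_id` join key, and the sample records are all assumptions for illustration. In a real cluster the shuffle step is where data moves between nodes, which is acceptable offline because there is time.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join key, tagged record) pairs for both inputs."""
    for user in users:
        yield user["user_id"], ("user", user)
    for order in orders:
        yield order["user_id"], ("order", order)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, combine every user record with every order record."""
    for values in groups.values():
        users = [record for tag, record in values if tag == "user"]
        orders = [record for tag, record in values if tag == "order"]
        for user in users:
            for order in orders:
                yield {**user, **order}


users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "cup"},
]
joined = list(reduce_phase(shuffle(map_phase(users, orders))))
```

Users without orders and orders without a matching user drop out, so this behaves like an inner join.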
Highly Scalable Systems + Primary Key Retrieval

§  This is (comparatively) the easy one.
§  Can be solved through DHT (Distributed Hash Table) based
    solutions or architectures like OceanStore.
§  Both Key-Value storage (KV) and Column Families (CF) can be
    used, but the Key-Value model is preferred as it is more scalable.

              Highly Scalable (1000s of nodes)
              Loose        Operation    ACID
              consistency  consistency  transactions
Primary Key   KV/CF        KV/CF        No
Where         CF/Doc (?)   CF/Doc (?)   No
JOIN          No           No           No
Offline       CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
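Many DHT-style systems locate the node for a key with consistent hashing, so the sketch below uses that as one possible building block. It is a simplified, single-process illustration: the `HashRing` class, the node names, and the choice of 64 virtual nodes per server are assumptions, and real DHTs add routing, replication, and failure handling on top.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash taken from a SHA-1 digest."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes: int = 64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["n1", "n2", "n3"])
owner = ring.node_for("user:42")  # always one of n1, n2, n3
```

The property that makes this attractive at large scale is that adding a node only moves the keys that the new node takes over; keys owned by the other nodes stay where they are.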
Highly Scalable systems + WHERE

§  This is generally OK, but tricky.
§  CF works through a secondary index that does scatter-gather
    (e.g. Cassandra).
§  Doc works through MapReduce views (e.g. CouchDB).
§  There is Bissa, which builds an index for all possible queries
    (no range queries).
§  If you are doing this, you should do pilot runs and make sure
    things work.

              Highly Scalable (1000s of nodes)
              Loose        Operation    ACID
              consistency  consistency  transactions
Primary Key   KV/CF        KV/CF        No
Where         CF/Doc (?)   CF/Doc (?)   No
JOIN          No           No           No
Offline       CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
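The scatter-gather idea behind such secondary indexes can be sketched as follows: every node indexes only its own rows, and a coordinator sends the WHERE condition to all nodes and merges the partial answers. This is an illustration, not Cassandra's implementation; the `Node` class, `scatter_gather` function, and sample rows are hypothetical, and updates or deletes of existing rows are deliberately not handled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One storage node holding its rows plus a local secondary index."""

    def __init__(self):
        self.rows = {}
        self.index = defaultdict(set)  # (column, value) -> row keys on this node

    def put(self, key, row):
        # Assumes each key is written once; stale index entries are not cleaned up.
        self.rows[key] = row
        for column, value in row.items():
            self.index[(column, value)].add(key)

    def find(self, column, value):
        keys = self.index.get((column, value), ())
        return [self.rows[key] for key in sorted(keys)]


def scatter_gather(nodes, column, value):
    """Ask every node in parallel (scatter) and concatenate the results (gather)."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda node: node.find(column, value), nodes)
        return [row for partial in partials for row in partial]


nodes = [Node(), Node()]
nodes[0].put("u1", {"name": "Ann", "city": "Colombo"})
nodes[1].put("u2", {"name": "Bob", "city": "Kandy"})
nodes[1].put("u3", {"name": "Cara", "city": "Colombo"})
result = scatter_gather(nodes, "city", "Colombo")
```

Because every query fans out to all nodes, latency is set by the slowest node, which is one reason pilot runs are worth doing before relying on this at scale.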
Handling Unstructured Data




§  Storage Options
   o  Distributed file systems - generally scalable (e.g. NFS), but HDFS
      (Hadoop) and Lustre are highly scalable versions.
   o  Metadata registries (e.g. Nirvana, SDSC Resource Broker)
Handling Semi-Structured Data
                              Small Scale (1-3 nodes)                    Scalable (10 nodes)                        Highly Scalable
XML (queried through XPath)   XML DB or convert to a structured model    XML DB or convert to a structured model    ??
Graphs                        Graph DBs                                  Graph DBs if graph can be partitioned      ??
Data structures               Data structure servers, object databases
Queues                        Distributed queues                         Distributed queues                         Distributed queues
§  Storage Options
   o  The answer depends on the type of structure. If there is a server
      optimized for a given type, it is often much more efficient than
      using a DB (e.g. graph databases can support fast relationship
      search).
§  Search
   o  Very much custom. E.g. XML or any tree is searched with XPath, and
      graphs can support very fast relationship search.
Hybrid Approaches
§  Some solutions have many types of data and hence need more
    than one data solution (hybrid architectures).
§  For example
   o  Using a DB for transactional data and CF for other data.
   o  Keeping metadata and actual data separate for large data
      archives.
   o  Using a graph DB to store relationship data while other data is
      in column family storage.
§  However, if transactions are needed, they have to be handled
    outside the storage (e.g. using Atomikos or ZooKeeper).

Copyright Matthew Oliphant by and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/
Other parameters
§  The above list is not exhaustive, and there are other
    parameters
   o  Read/write ratio: when it is high, the system is easy to scale
   o  High write throughput
   o  Very large data products: you will need a file system. Maybe
      keep metadata in a data registry and store data in a file system.
   o  Flexible schema
   o  Archival use cases
   o  Analytical use cases
   o  Others …
Take Home Message is ..


   There is no silver bullet

   You have to use the right tool for the job
Copyright eschipul, Siomuzzz and licensed for reuse under CC License , http://www.flickr.com/photos/eschipul/4160817135 and http://www.flickr.com/photos/siomuzzz/2577041081
Sample Polyglot Architectures
PaaS           Structured (Relational)        Structured (NoSQL)        Unstructured
WSO2 Stratos   MySQL-based RDB as a Service   Cassandra as a Service    HDFS as a Service
Azure          MSSQL as a Service             MS NoSQL storage
AppEngine      Hosted MySQL                   BigTable


     Our work on Data Solutions for WSO2 Stratos
                   motivated this work.
   You can try out WSO2 Stratos Data offerings from
   https://data.stratoslive.wso2.com/home/index.html
Conclusion
§  For the last 20 years or so, DBMSs were the de facto storage
    solution.
§  However, DBMSs could not scale well, and many NoSQL
    solutions have been proposed instead.
§  As a result, it is no longer easy to find the best data
    solution for your problem.
§  We discussed many dimensions (types of data, scalability,
    queries, and consistency) and provided guidelines on when
    to use which data solution.
§  Your feedback and thoughts are most welcome. Contact
    me through srinath@wso2.com

More Related Content

Similar to Finding the Right Data Solution for Your Application in the Data Storage Haystack

Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
Radhika R
 
History and Introduction to NoSQL over Traditional Rdbms
History and Introduction to NoSQL over Traditional RdbmsHistory and Introduction to NoSQL over Traditional Rdbms
History and Introduction to NoSQL over Traditional Rdbms
vinayh902
 
The google file system
The google file systemThe google file system
The google file system
Daniel Checchia
 
Storage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems PresentationStorage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems Presentation
andyman3000
 
Oasis: Standards & the Cloud June2011
Oasis: Standards & the Cloud June2011Oasis: Standards & the Cloud June2011
Oasis: Standards & the Cloud June2011
Jamie Clark
 
Scaling Databases On The Cloud
Scaling Databases On The CloudScaling Databases On The Cloud
Scaling Databases On The Cloud
Imaginea
 
Scaing databases on the cloud
Scaing databases on the cloudScaing databases on the cloud
Scaing databases on the cloud
Imaginea
 
Creating an RAD Authoratative Data Environment
Creating an RAD Authoratative Data EnvironmentCreating an RAD Authoratative Data Environment
Creating an RAD Authoratative Data Environment
anicewick
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False Dichotomy
Dan Sullivan, Ph.D.
 
Big iron 2 (published)
Big iron 2 (published)Big iron 2 (published)
Big iron 2 (published)Ben Stopford
 
Understanding Metadata: Why it's essential to your big data solution and how ...
Understanding Metadata: Why it's essential to your big data solution and how ...Understanding Metadata: Why it's essential to your big data solution and how ...
Understanding Metadata: Why it's essential to your big data solution and how ...
Zaloni
 
UNIT 5- Other Databases.pdf
UNIT 5- Other Databases.pdfUNIT 5- Other Databases.pdf
UNIT 5- Other Databases.pdf
ShitalGhotekar
 
Scaling data on public clouds
Scaling data on public cloudsScaling data on public clouds
Scaling data on public clouds
Liran Zelkha
 
Sharded By Business Line: Migrating to a Core Database using MongoDB and Solr
Sharded By Business Line: Migrating to a Core Database using MongoDB and SolrSharded By Business Line: Migrating to a Core Database using MongoDB and Solr
Sharded By Business Line: Migrating to a Core Database using MongoDB and SolrMongoDB
 
Mongo la search platform - january 2013
Mongo la   search platform - january 2013Mongo la   search platform - january 2013
Mongo la search platform - january 2013MongoDB
 
NoSQL and Couchbase
NoSQL and CouchbaseNoSQL and Couchbase
NoSQL and Couchbase
Sangharsh agarwal
 
Deduplication and single instance storage
Deduplication and single instance storageDeduplication and single instance storage
Deduplication and single instance storageInterop
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabasesAdi Challa
 

Finding the Right Data Solution for Your Application in the Data Storage Haystack

  • 1. Finding the Right Data Solution for Your Application in the Data Storage Haystack
     Srinath Perera Ph.D.
     Senior Software Architect, WSO2 Inc.
     Visiting Faculty, University of Moratuwa
     Research Scientist, Lanka Software Foundation
  • 2. In Search of the Right Data Models
     §  Many data models have been proposed (read Stonebraker's "What Goes Around Comes Around" for more details)
         o  Hierarchical (IMS): late 1960s and 1970s
         o  Directed graph (CODASYL): 1970s
         o  Relational: 1970s and early 1980s
         o  Entity-Relationship: 1970s
         o  Extended Relational: 1980s
         o  Semantic: late 1970s and 1980s
     §  Database systems (SQL) together with transactions have been the de facto data solution.
     Copyright Greg Morss and licensed for reuse under CC License, http://www.geograph.org.uk/photo/990700
  • 3. For many years, the choice of data storage was an easy one (use an RDBMS)
     Copyright by Alan Murray Walsh and licensed for reuse under CC License, http://www.geograph.org.uk/photo/1652880
  • 4. Increasing Scale of Systems
     §  However, the scale of systems is changing due to
         o  Increasing user bases of systems
         o  Mobile devices, online presence
         o  Cloud computing and multicore systems
     §  Scaling up an RDBMS
         o  Put it on a bigger machine.
         o  Replicate (cluster) the database to 2-3 more nodes. But the approach does not scale up.
         o  Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard, and sometimes too slow. This often needs custom code and configurations. Also, transactions do not scale as well.
     Copyright digitalART2 and licensed for reuse under CC License, http://www.flickr.com/photos/digitalart/2101765353/
  • 5. CAP Theorem, Transactions, and Storage
     §  The RDBMS model provides two things
         o  Relational model with SQL
         o  ACID transactions (Atomicity, Consistency, Isolation, Durability)
     §  It was a classic one-size-fits-all solution, and it worked for quite some time.
     §  However, the CAP theorem says that you cannot have it all.
         o  Consistency, Availability, and Partition Tolerance: pick two!
     §  But there are many use cases that do not need all RDBMS features; when those are dropped, systems could scale (e.g. Google Bigtable).
     §  However, to use them, one has to understand and utilize the application-specific behavior.
     Copyright stephcarter and licensed for reuse under CC License, http://www.flickr.com/photos/stephcarter/541464462
  • 6. NoSQL and other Storage Systems
     §  Large internet companies hit the problem first; they built systems specific to their problems, and those systems did scale.
         o  Google Bigtable
         o  Amazon Dynamo
     §  Soon many others followed, and most of them are free and open source. Now there are a couple of dozen.
     §  Among the advantages of NoSQL are
         o  Scalability
         o  Flexible schema
         o  Designed to scale and support fault tolerance out of the box
     Copyright O hai :3 and licensed for reuse under CC License, http://www.flickr.com/photos/christigain/5636887941
  • 7. However, with NoSQL solutions, choosing a data storage is no longer simple.
     Copyright Philipp Salzgeber and licensed for reuse under CC License, http://www.salzgeber.at/astro/pics/20081126_heart/index.html
  • 8. Selecting the Right Data Solution
     §  What are the right questions to ask?
     §  Categorize the answers for each question.
     §  Take different cases based on different answers and make recommendations!
     Copyright by Krzysztof Poltorak, and licensed for reuse under CC License, http://www.fotocommunity.com/pc/pc/display/22077920
  • 9. What are the right Questions?
     o  Types of data
         -  Structured, Semi-Structured, Unstructured
     o  Need for Scalability
         -  Number of users
         -  Number of data items
         -  Size of files
         -  Read/Write ratio
     o  Types of Queries
         -  Retrieve by Key
         -  WHERE clauses
         -  JOIN queries
         -  Offline Queries
     o  Consistency
         -  Loose Consistency
         -  Single Operation Consistency
         -  Transactions
     Copyright by romainguy, and licensed for reuse under CC License, http://www.flickr.com/photos/romainguy/249370084
  • 10. 4Q > Types of Data > Unstructured Data
     §  The data does not have a particular structure and is often retrieved through a key (name).
         o  E.g. file systems.
     §  Humans are good at processing unstructured data, but computers are not.
     §  This data is often kept in storage but consumed by humans at the end of the pipeline (e.g. a document repository).
     §  One common use case is building structured data from unstructured data.
     §  Metadata is often associated with the data to help searching.
     Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134
  • 11. 4Q > Types of Data > Structured Data
     §  The data has a structure, often described through a schema.
     §  Often a table-like 2D structure is used, but other structures are also possible.
     §  The main advantage of the structure is search.
     §  The schema can be provided at deployment time or at runtime (dynamic schema).
     §  The schema can be used to
         o  Validate data
         o  Support user-friendly search
         o  Optimize storage and queries
     Copyright by Marion Doss and licensed for reuse under CC License, http://www.flickr.com/photos/ooocha/2611398859/
  • 12. 4Q > Types of Data > Semi-structured Data
     §  The structure is not fully defined, but there is some inherent structure.
     §  For example
         o  XML documents, where data is stored in a tree-like structure
         o  Graph data
         o  Data structures like lists and arrays
     §  Supports queries based on structure.
     §  But processing the data often needs custom code.
     Copyright Walter Baxter, http://www.geograph.org.uk/photo/1069339
  • 13. 4Q > Search
     §  Unstructured Data: no structure to support search.
         o  Search based on an inverted index
         o  Search through properties
     §  Semi-Structured Data
         o  To search XML (or any tree-like structure), use XPath or XQuery.
         o  Tuple spaces can be queried through tuple space templates.
         o  Data registries can be searched for entries that match given metadata descriptions (search by properties).
         o  Graphs can be queried based on connectivity.
     §  Structured Data
         o  Retrieve by Key
         o  WHERE clauses
         o  Queries with JOINs
         o  Offline Queries
     Copyright by digitalART2 and licensed for reuse under CC License, http://www.flickr.com/photos/digitalart/2101765353/
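     Index-based search over unstructured data can be illustrated with a minimal sketch. It assumes documents are plain strings split on whitespace; the names `build_inverted_index`, `search`, and the sample file names are hypothetical and not from the slides.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by file name.
documents = {
    "report.txt": "quarterly sales report for the storage team",
    "notes.txt": "meeting notes about storage scaling",
    "todo.txt": "buy coffee",
}
index = build_inverted_index(documents)
matches = search(index, "storage scaling")
```

     The files themselves stay opaque to the store; only the index gives the content any searchable structure.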
  • 14. 4Q > Consistency and Scalability
     §  Scalability: the ability to handle more users, more data, or larger files by adding more nodes. We will use 3 categories.
         o  Small systems (can be handled with 1-3 nodes)
         o  Scalable systems (can be handled with about 10 nodes)
         o  Highly scalable systems (anything larger, can be 100s or 1000s of nodes)
     §  Consistency: how replicas of the same data on many nodes are kept in sync, and how they can be updated without data corruption. We will use 3 categories.
         o  Transactional: a series of operations updated in an ACID manner
         o  Atomic operation: a single operation, updated in all replicas
         o  Eventual consistency: data will eventually be consistent
     Copyright NNSANews and licensed for reuse under CC License, http://www.flickr.com/photos/nnsanews/5347287260/
  • 16. Data Storage Implementations
     §  Expectations from data storages
         o  Reliably store the data
         o  Efficient search and retrieval of data whenever needed
         o  Data management: delete and update data
     Copyright Stephen Eckert and licensed for reuse under CC License, http://www.flickr.com/photos/s_eckert/5378588233
  • 17. Challenges of Data Storage
     §  Reliability
         o  Replicating data
         o  Creating backups or recovering using backups
     §  Security
     §  Scaling and parallel access
         o  Distribution or replication
         o  ACID transactions
     §  Availability
         o  Data replication
     §  Vendor lock-in
         o  Interoperability, standard query languages
     §  Simple user experience
         o  Hide the physical location of data
         o  Provide simple API and security models
         o  Expressive query languages
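  • 18. Data Storage Choices (Structured)
     §  Local memory
         o  Advantages: very fast
         o  Disadvantages: not durable
         o  Queries: Key yes, Where no, Joins no
         o  Transactions: no, unless STMs are used. Scale: no. Flexible schema: yes.
     §  Relational / SQL
         o  Advantages: standardized
         o  Disadvantages: rigid schema, good for read-oriented use cases
         o  Queries: Key yes, Where yes, Joins yes
         o  Transactions: yes. Scale: moderate. Flexible schema: no.
     §  Column families (NoSQL)
         o  Advantages: high write performance, replicated
         o  Disadvantages: not transactional, no online joins
         o  Queries: Key yes, Where yes (secondary index), Joins no
         o  Transactions: no. Scale: high. Flexible schema: yes.
     §  Document DBs
         o  Advantages: high write performance, replicated
         o  Disadvantages: not transactional, no online joins
         o  Queries: Key yes, Where yes (views), Joins no
         o  Transactions: no. Scale: yes. Flexible schema: yes.
     §  Object databases
         o  Advantages: easy to integrate with programming languages
         o  Queries: Key yes, Where yes, Joins yes
         o  Transactions: yes. Scale: no. Flexible schema: no.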
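  • 19. Data Storage Choices (Unstructured and Semi-structured)
     §  Unstructured: Files
         o  Advantages: save big files whose format is not understood
         o  Disadvantages: no structured search on content
         o  Queries: Key yes, Search through indexing
         o  Transactions: no. Scale: moderate. Flexible schema: yes.
     §  Unstructured: Data registries / metadata catalogs
         o  Advantages: metadata search
         o  Queries: Key yes, Search is property-based (Where)
         o  Transactions: no. Scale: moderate. Flexible schema: yes.
     §  Semi-structured: Queues
         o  Advantages: representation of a flow of messages or tasks over time
         o  Queries: Key yes, Search N/A
         o  Transactions: no. Scale: yes. Flexible schema: yes.
     §  Semi-structured: Triple stores
         o  Advantages: used for inference, very fast relationship processing
         o  Queries: Key yes, Search by relationship
         o  Transactions: no. Scale: no. Flexible schema: yes.
     §  Semi-structured: XML databases
         o  Advantages: XML native
         o  Queries: Search through XPath/XQuery
     §  Semi-structured: Distributed cache
         o  Advantages: fast, replicated
         o  Disadvantages: no search
         o  Queries: Key yes, Search no
         o  Transactions: no. Scale: yes. Flexible schema: yes.
     §  Semi-structured: Key-value pairs
         o  Advantages: high write performance, replicated
         o  Disadvantages: model is too simple in some cases, not transactional
         o  Queries: Key yes, Search no
         o  Transactions: no. Scale: yes. Flexible schema: yes.
     §  Semi-structured: Graph DBs
         o  Advantages: very fast joins, natural way to represent relationships
         o  Disadvantages: not very scalable
         o  Queries: Key yes, Search is graph search
         o  Transactions: yes. Scale: low. Flexible schema: N/A.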
  • 20. Choosing the Right Data Solution
  • 21. How should we do this?
     §  Consider structured, semi-structured, and unstructured data separately.
         o  Then drill down based on the other 3 properties: scale, consistency, and search.
     §  The structured case is more complicated; the other two are a bit simpler.
     §  Start by giving a de facto choice for each case.
     Copyright Brian Robert Marshall and licensed for reuse under CC License, http://www.geograph.org.uk/photo/938546
  • 22. Handling Structured Data
     §  There are three main considerations: scale, consistency, and queries.

         Small (1-3 nodes)
                        Loose consistency   Operation consistency   ACID transactions
         Primary key    DB/KV/CF            DB/KV/CF                DB
         Where          DB/CF/Doc           DB/CF/Doc               DB
         JOIN           DB                  DB                      DB
         Offline        DB/CF/Doc           DB/CF/Doc               DB/CF/Doc

         Scalable (10 nodes)
                        Loose consistency   Operation consistency   ACID transactions
         Primary key    KV/CF               KV/CF                   Partitioned DB?
         Where          CF/Doc(?)           CF/Doc(?)               Partitioned DB?
         JOIN           ??                  ??                      ??
         Offline        CF/Doc              CF/Doc                  No

         Highly scalable (1000s of nodes)
                        Loose consistency   Operation consistency   ACID transactions
         Primary key    KV/CF               KV/CF                   No
         Where          CF/Doc              CF/Doc                  No
         JOIN           No                  No                      No
         Offline        CF/Doc              CF/Doc                  No

     *KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
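     The recommendation matrix on this slide can be read as a lookup keyed by query type, scale, and consistency. The sketch below encodes the slide's cells as given, including the "?" entries; the function name `recommend` and the key spellings are hypothetical.

```python
# Column order follows the slide: for each scale, the three consistency levels.
SCALES = ("small", "scalable", "highly_scalable")
CONSISTENCY = ("loose", "operation", "acid")

# Cell values are copied from the slide; "?" marks the author's own uncertainty.
MATRIX = {
    "primary_key": ["DB/KV/CF", "DB/KV/CF", "DB",
                    "KV/CF", "KV/CF", "Partitioned DB?",
                    "KV/CF", "KV/CF", "No"],
    "where": ["DB/CF/Doc", "DB/CF/Doc", "DB",
              "CF/Doc(?)", "CF/Doc(?)", "Partitioned DB?",
              "CF/Doc", "CF/Doc", "No"],
    "join": ["DB", "DB", "DB",
             "??", "??", "??",
             "No", "No", "No"],
    "offline": ["DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc",
                "CF/Doc", "CF/Doc", "No",
                "CF/Doc", "CF/Doc", "No"],
}


def recommend(query, scale, consistency):
    """Return the slide's suggestion for one (query, scale, consistency) case."""
    column = SCALES.index(scale) * len(CONSISTENCY) + CONSISTENCY.index(consistency)
    return MATRIX[query][column]


suggestion = recommend("where", "scalable", "acid")
```

     A "No" result means the slide considers that combination unsupported at that scale, so the requirement itself has to be relaxed.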
  • 23. Handling Small Scale Systems (1-3 nodes)
     §  In general, using a DB here for every case might work.
     §  Reasons for using options other than a DB
         o  When there is a potential need to scale later
         o  High write throughput
     §  KV is 1-D whereas the other two are 2-D.

         Small (1-3 nodes)
                        Loose consistency   Operation consistency   ACID transactions
         Primary key    DB/KV/CF            DB/KV/CF                DB
         Where          DB/CF/Doc           DB/CF/Doc               DB
         JOIN           DB                  DB                      DB
         Offline        DB/CF/Doc           DB/CF/Doc               DB/CF/Doc

     *KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
  • 24. Handling Scalable Systems
     §  KV, CF, and Doc can easily handle this case.
     §  If DBs are used with data sharded across many nodes
         o  Transactions might work, given that the participants in one transaction are not too many.
         o  JOINs might need to transfer too much data between nodes.
         o  Also consider in-memory DBs like VoltDB.
     §  Offline mode will work.
     §  Most systems let users choose the consistency level, and loose consistency can scale more (e.g. Cassandra).

         Scalable (10 nodes)
                        Loose consistency   Operation consistency   ACID transactions
         Primary key    KV/CF               KV/CF                   Partitioned DB?
         Where          CF/Doc              CF/Doc                  Partitioned DB?
         JOIN           ??                  ??                      Partitioned DB??
         Offline        CF/Doc              CF/Doc                  No

     *KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
  • 25. Highly Scalable Systems
     §  Transactions do not work at this scale (CAP theorem).
     §  The same holds for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.
     §  The offline case is handled through Map-Reduce. Even the JOIN case is OK since there is time.

         Highly scalable (1000s of nodes)
                        Loose consistency   Operation consistency   ACID transactions
         Primary key    KV/CF               KV/CF                   No
         Where          CF/Doc              CF/Doc                  No
         JOIN           No                  No                      No
         Offline        CF/Doc              CF/Doc                  No

     *KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
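     An offline JOIN in the Map-Reduce style can be sketched in a single process as follows. This only illustrates a reduce-side join and is not the API of any particular framework; the record fields and the names `map_users`, `map_orders`, `shuffle`, and `reduce_join` are hypothetical.

```python
from collections import defaultdict


def map_users(user):
    """Emit each user under its join key, tagged with its source table."""
    yield user["id"], ("user", user)


def map_orders(order):
    """Emit each order under the same join key (the user id)."""
    yield order["user_id"], ("order", order)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Pair every user with every order that shares the join key."""
    users = [record for tag, record in values if tag == "user"]
    orders = [record for tag, record in values if tag == "order"]
    for user in users:
        for order in orders:
            yield {"user": user["name"], "item": order["item"]}


# Hypothetical sample data.
users = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}, {"id": 3, "name": "Nimal"}]
orders = [
    {"user_id": 1, "item": "disk"},
    {"user_id": 2, "item": "router"},
    {"user_id": 1, "item": "cable"},
]

pairs = [p for u in users for p in map_users(u)]
pairs += [p for o in orders for p in map_orders(o)]
joined = [
    row
    for key, values in sorted(shuffle(pairs).items())
    for row in reduce_join(key, values)
]
```

     The shuffle step is where the data transfer between nodes happens, which is why the slide treats this JOIN as acceptable only when latency is not a concern.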
  • 26. Highly Scalable Systems + Primary Key Retrieval
     §  This is (comparatively) the easy one.
     §  It can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.
     §  Both Key-Value storage (KV) and Column Families (CF) can be used. But the Key-Value model is preferred as it is more scalable.

         Highly scalable (1000s of nodes)
                        Loose consistency   Operation consistency   ACID transactions
         Primary key    KV/CF               KV/CF                   No
         Where          CF/Doc (?)          CF/Doc (?)              No
         JOIN           No                  No                      No
         Offline        CF/Doc              CF/Doc                  No

     *KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
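     One common building block of DHT-style key placement is consistent hashing, sketched below. It is a simplified single-process illustration, not how OceanStore or any specific store implements routing; the class name `HashRing`, the `replicas` parameter, and the node names are hypothetical.

```python
import bisect
import hashlib


def _hash(value):
    """Stable hash of a string (Python's built-in hash() varies per process)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Places each node at several points on a ring; a key belongs to the
    first node point found clockwise from the key's own hash."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # Wrap around to the first point when the key hashes past the last one.
        position = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[position][1]


nodes = ["node-a", "node-b", "node-c"]
ring = HashRing(nodes)
bigger_ring = HashRing(nodes + ["node-d"])

keys = [f"user:{i}" for i in range(1000)]
moved = [k for k in keys if ring.node_for(k) != bigger_ring.node_for(k)]
```

     Adding a node only relocates the keys that now fall to the new node, which is what lets primary-key retrieval keep working as the cluster grows.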
  • 27. Highly Scalable Systems + WHERE
     §  This is generally OK, but tricky.
     §  CF works through a secondary index that does scatter-gather
         (e.g. Cassandra).
     §  Doc works through Map-Reduce views (e.g. CouchDB).
     §  There is Bissa, which builds an index for all possible queries
         (no range queries).
     §  If you are doing this, you should do pilot runs and make sure
         things work.

     Highly Scalable (1000s nodes)
                     Loose           Operation
                     Consistency     Consistency     Transactions
     Primary Key     KV/CF           KV/CF           No
     Where           CF/Doc (?)      CF/Doc (?)      No
     JOIN            No              No              No
     Offline         CF/Doc          CF/Doc          No

     *KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
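The scatter-gather pattern behind a distributed secondary index can be sketched as follows: each node indexes only its own rows, and a WHERE query is sent to every node and the partial results are merged. This is a single-process simulation under simplifying assumptions (equality predicates only, CRC32 for key placement, threads standing in for network calls). The `Node` and `Cluster` classes are hypothetical and do not reproduce Cassandra's implementation.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One storage node holding rows plus a local secondary index."""

    def __init__(self):
        self.rows = {}
        self.index = defaultdict(set)  # (column, value) -> row keys

    def put(self, key, row):
        # Drop stale index entries if the row is being overwritten.
        for column, value in self.rows.get(key, {}).items():
            self.index[(column, value)].discard(key)
        self.rows[key] = row
        for column, value in row.items():
            self.index[(column, value)].add(key)

    def query(self, column, value):
        """Answer an equality predicate using only this node's index."""
        keys = self.index.get((column, value), ())
        return [(key, self.rows[key]) for key in keys]


class Cluster:
    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]

    def put(self, key, row):
        # Rows are placed by primary key; the index stays local to the node.
        owner = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        self.nodes[owner].put(key, row)

    def where(self, column, value):
        """Scatter the predicate to all nodes, then gather and merge."""
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(pool.map(lambda n: n.query(column, value), self.nodes))
        merged = [pair for part in partials for pair in part]
        return sorted(merged, key=lambda pair: pair[0])
```

Every query touches every node, so the cost grows with cluster size even when few rows match. That is one reason pilot runs are worth doing before relying on this query type.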
  • 28. Handling Unstructured Data
     §  Storage Options
         o  Distributed file systems: generally scalable (e.g. NFS), but
            HDFS (Hadoop) and Lustre are highly scalable versions.
         o  Metadata registries (e.g. Nirvana, SDSC Storage Resource
            Broker)
  • 29. Handling Semi-Structured Data

                        Small Scale           Scalable                 Highly
                        (1-3 nodes)           (10 nodes)               Scalable
     XML (queried       XML DB or convert     XML DB or convert to     ??
     through XPath)     to a structured       a structured model
                        model
     Graphs             Graph DBs             Graph DBs if graph       ??
                                              can be partitioned
     Data Structures    Data Structure Servers, Object Databases
     Queues             Distributed Queues    Distributed Queues       Distributed Queues

     §  Storage Options
         o  The answer depends on the type of structure. If there is a
            server optimized for a given type, it is often much more
            efficient than using a DB (e.g. graph databases can support
            fast relationship search).
     §  Search
         o  Very much custom. E.g. XML or any tree = XPath; graphs can
            support very fast relationship search.
  • 30. Hybrid Approaches
     §  Some solutions have many types of data and hence need more than
         one data solution (hybrid architectures).
     §  For example
         o  Using a DB for transactional data and CF for other data.
         o  Keeping metadata and actual data separate for large data
            archives.
         o  Using a Graph DB to store relationship data while other data
            is in Column Family storage.
     §  However, if transactions are needed, they have to be handled
         outside the storage (e.g. using Atomikos or ZooKeeper).
Copyright by Matthew Oliphant and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/
  • 31. Other parameters
     §  The above list is not exhaustive, and there are other parameters
         o  Read/Write ratio: when it is high, it is easy to scale
         o  High write throughput
         o  Very large data products: you will need a file system. Maybe
            keep the metadata in a data registry and store the data in a
            file system.
         o  Flexible schema
         o  Archival use cases
         o  Analytical use cases
         o  Others …
  • 32. Take Home Message is ..
     There is no silver bullet
     You have to use the right tool for the job
Copyright eschipul, Siomuzzz and licensed for reuse under CC License , http://www.flickr.com/photos/eschipul/4160817135 and http://www.flickr.com/photos/siomuzzz/2577041081
  • 33. Sample Polyglot Architectures

     PaaS            Structured               Structured        Unstructured
                     (Relational)             (NOSQL)
     WSO2 Stratos    MySQL based RDB          Cassandra as      HDFS as a
                     as a Service             a Service         Service
     Azure           MSSQL as a Service       MS NoSQL storage
     AppEngine       Hosted MySQL             BigTable

     Our work on Data Solutions for WSO2 Stratos motivated this work.
     You can try out WSO2 Stratos Data offerings from
     https://data.stratoslive.wso2.com/home/index.html
  • 34. Conclusion
     §  For the last 20 years or so, DBMSs were the de facto storage
         solution.
     §  However, DBMSs could not scale well, and many NoSQL solutions
         have been proposed instead.
     §  As a result, it is no longer easy to find the best data solution
         for your problem.
     §  We discussed many dimensions (types of data, scalability,
         queries, and consistency) and provided guidelines on when to
         use which data solution.
     §  Your feedback and thoughts are most welcome. Contact me through
         srinath@wso2.com