Big Data Platforms:
      An Overview
                          C. Scyphers
                   Chief Technical Architect

Daemon Consulting, LLC
What Is “Big Data”?
• Big Data is not simply a huge pile of information

• A good starting place is the following thought:


 “Big Data describes datasets so large they become very
  difficult to manage with traditional database tools.”
What Is A Big Data Platform?




 Putting it simply, it is any platform which supports
             those kinds of large datasets.
It doesn’t have to be cutting edge technology.
Lots of legacy technologies can address the problem.
If only at a sizeable cost.
Of the new technologies, the most promising are from
                the “NoSQL” family.
What Is “NoSQL”?

A family of non-relational data storage technologies
Horizontal scalability
Distributed processing
Faster throughput
Usually with less cost than more traditional approaches.
Some of these technologies are new and innovative
Others have been around for decades.
NoSQL Does Not Mean “SQL Is Bad”




When the trend was just starting, “NoSQL” was coined. It’s
unfortunate, because it implies antagonism towards SQL.
NoSQL Means “Not Only SQL”
[Diagram: RELATIONAL and NON-RELATIONAL]
NoSQL is a complement to a traditional RDBMS, not
      necessarily a replacement for it.
Why Won’t SQL Do?
Scale is very hard without ridiculous expense
SQL can get very complex, very quickly
Changing a schema for a large production system is
             both risky and expensive
Throughput can be a challenge
How Does NoSQL Do It?
Scale is achieved through a shared-nothing architecture,
                  removing bottlenecks
Schemaless design means change becomes much less
          risky and significantly cheaper
Most solutions use simple RESTful interfaces
NoSQL is based upon a better understanding of data storage,
usually referred to as the “CAP Theorem”
The CAP Theorem
        Grossly simplified (with apologies to Brewer):

A database can be
• Consistent (All clients see the same data)
• Available (All clients can find some available node)
• Partition-Tolerant (the database will continue to function
   even if split into disconnected sets – e.g. a network disruption)



               Pick Any Two.
CAP In Practice
• Consistent & Available (no Partition Tolerance)
  • Either single machines or single site clusters.
  • Typically uses two-phase commit
CAP In Practice
• Consistent & Partition Tolerant (no Availability)
  • Some data may be inaccessible, but the remainder
    is available and consistent
  • Sharding is an example of this implementation




    Customers        Customers        Customers
       A-F              G-R              S-Z
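The sharding picture above can be sketched in a few lines. This is only an illustration: the shard names and the `shard_for` and `lookup` helpers are hypothetical, and routing on the first letter of the customer name is an assumption taken from the A-F / G-R / S-Z labels.

```python
# Hypothetical shard map mirroring the slide: Customers A-F, G-R, S-Z.
SHARDS = [
    ("A", "F", "shard-1"),
    ("G", "R", "shard-2"),
    ("S", "Z", "shard-3"),
]

def shard_for(customer_name: str) -> str:
    """Return the shard responsible for a customer, based on the first letter."""
    initial = customer_name.strip()[:1].upper()
    for low, high, shard in SHARDS:
        if low <= initial <= high:
            return shard
    raise ValueError(f"no shard covers {customer_name!r}")

def lookup(customer_name: str, unavailable: set) -> str:
    """Simulate a read: a downed shard makes only its own range inaccessible."""
    shard = shard_for(customer_name)
    if shard in unavailable:
        raise ConnectionError(f"{shard} is unreachable")
    return f"{customer_name} served by {shard}"
```

With `shard-3` down, customers S-Z cannot be read, while A-R remain available and consistent, which is the C-P trade-off described on this slide.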
CAP In Practice
• Available & Partition Tolerant (no Consistency)
  • Some data may be inaccurate; a conflict resolution
     strategy is required.
  • DNS is an example of this, as well as standard
     master-slave replication
CAP From A Vendor POV
• C-A (no P) – this is generally how most RDBMS vendors operate

• C-P (no A) – this is how many RDBMSs attempt to address scale
               without incurring large costs

• A-P (no C) – this is how most NoSQL approaches solve the problem
ACID vs BASE
Traditional Databases Are ACID Compliant
• Atomicity – either the entire transaction completes or none of it does
• Consistency – any transaction will take the database from one
  consistent state to another, with no broken constraints
• Isolation – changes do not affect other users until committed
• Durability – committed transactions can be recovered in case of
  system failure

NoSQL Databases Tend To Be BASE Compliant
• Basically Available
• Soft state
• Eventually consistent
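The atomicity and consistency guarantees on the ACID side can be seen with a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the names and the `transfer` helper are made up for illustration; only the rollback behavior is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # the constraint to protect
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(amount: int) -> bool:
    """Move money from alice to bob; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling `transfer(500)` credits bob first, then fails the CHECK constraint on alice; the rollback undoes the credit as well, so no half-finished transfer is ever visible.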



      Eventually consistent is the key phrase here
SQL Strengths




Very well known technology
Very mature technology
Interoperability across vendors
Large talent pool from which to choose
Ad hoc operations common, if not encouraged
NoSQL Strengths




 Built to address massive scale
Through horizontal scalability
While remaining highly available
And handling unstructured data
NoSQL Pros/Cons
Pros
• Schema Evolution
• Horizontal Scalability
• Simple Protocols

Cons
• Querying the data is much harder
• Paradigm Shift
• Security is a big issue
• May or may not support data types (BLOBs, spatial)
• Generally, uniqueness cannot be enforced
A Disclaimer Before We Continue
• I am not an expert on every possible Big Data
  Platform
• There are hundreds of them; these are
  the ones I consider the leaders in the field
  and recommend
• If you have a favorite, please let me know
  and I’ll update the deck for next time
• The internal details on how these systems
  work are rather complex; I would prefer to
  take those questions offline
Flavors Of NoSQL

The four major divisions of NoSQL are:

•   Key-Value
•   Document Store
•   Columnar
•   Other
Key-Value
• At a very high level, key-value works essentially
  by pairing an index token (a key) with a data
  element (a value).
• Both index token and the data value can be of
  any structure.
• Such a pairing is arbitrary and up to the
  developer of the system to determine.
A Key-Value Example

“John Smith”, “100 Century Dr. Alexandria VA 22304”


“John Doe”, “16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria”




  In both examples, the key is a name and the value is an address.
   However, the structure of the address differs between the two.
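A toy version of this example, using a plain Python dict as a stand-in for a key-value store. The `put` and `get` names are illustrative, not any product's API.

```python
# A toy in-memory key-value store. The store never inspects the value:
# it is an opaque blob whose structure is up to the application.
store = {}

def put(key, value):
    """Associate a value with a key, replacing any earlier value."""
    store[key] = value

def get(key):
    """Return the entire value stored under a key."""
    return store[key]

put("John Smith", "100 Century Dr. Alexandria VA 22304")
put("John Doe", "16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria")
```

Note that `get` hands back the whole address string; pulling out just the city is the application's job, not the store's.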
Document Store
• Document stores extend the key-value paradigm
  into values with multiple attributes.
• The document values tend to be semi-structured
  data (XML, JSON, et al) but can also be Word or
  PDF documents.
A Document Store Example
“John Smith”, “<address><street>100 Century Dr.</street>
                        <city>Alexandria</city>
                        <state>VA</state>
                        <postalCode>22304</postalCode>
               </address>”

“John Doe”, “{
              "address": {
                             "street": "16 Kozyak Street",
                             "district": "Lozenets, 1408",
                             "city": "Sofia",
                             "country": "Bulgaria"
                         }
              }”
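The same two documents can be read attribute by attribute with the Python standard library. This is a sketch only; the `documents` dict and the `city_of` helper are hypothetical and stand in for whatever a real document store would provide.

```python
import json
import xml.etree.ElementTree as ET

# Two documents under different keys: one XML, one JSON.
documents = {
    "John Smith": (
        "<address><street>100 Century Dr.</street>"
        "<city>Alexandria</city><state>VA</state>"
        "<postalCode>22304</postalCode></address>"
    ),
    "John Doe": json.dumps({
        "address": {
            "street": "16 Kozyak Street",
            "district": "Lozenets, 1408",
            "city": "Sofia",
            "country": "Bulgaria",
        }
    }),
}

def city_of(key: str) -> str:
    """Read the city attribute from either an XML or a JSON document."""
    raw = documents[key]
    if raw.lstrip().startswith("<"):
        return ET.fromstring(raw).findtext("city")
    return json.loads(raw)["address"]["city"]
```

Unlike the plain key-value case, the value now has named attributes, so a single field such as the city can be addressed even though the two documents do not share a schema.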
Columnar Family

• Usually has “rows” and “columns”
  • Or, at least, their logical equivalents
• Not a traditional, “pure” column store
  • More of a hybridized approach leveraging
     key-value pairs
• A key with many values attached
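"A key with many values attached" can be pictured as a dict of dicts. The sketch below is a simplification with made-up names (`table`, `put`, `get_column`); real column-family stores add persistence, sorting and versioning on top.

```python
from collections import defaultdict

# row key -> {column name -> value}; rows need not share the same columns.
table = defaultdict(dict)

def put(row_key, column, value):
    """Attach one more named value to a row key."""
    table[row_key][column] = value

def get_column(column):
    """Scan one column across all rows, skipping rows that lack it."""
    return {row: cols[column] for row, cols in table.items() if column in cols}

put("John Smith", "city", "Alexandria")
put("John Smith", "state", "VA")
put("John Doe", "city", "Sofia")
put("John Doe", "country", "Bulgaria")
```

Here `get_column("city")` touches only the city values, which hints at why reading a few columns over many rows suits this layout.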
The Others
• Hierarchical Databases
  • LDAP, Active Directory
• Graph Databases
  • Neo4j, FlockDB, InfiniteGraph
• XML
  • MarkLogic
• Object Oriented Databases
  • Versant
• Lotus Notes
• HPCC (LexisNexis)
Key-Value Recap

 Pairing an index token (a key)
 with a data element (a value)
Key-Value Pro/Con
Pros
•   Schema Evolution
•   Horizontal Scalability
•   Simple Protocols
•   Works well for volatile data
•   High throughput, typically optimized for reads or writes
•   Keys become meaningful rather than arbitrary
•   Application logic defines object model

Cons
•   Packing & unpacking each key
•   Keys typically are not related to each other
•   The entire value must be returned, not just a part of it
•   Security tends to be an issue
•   Hard to support reporting, analytics, aggregation or ordered values
•   Generally does not support updates in place
•   Application logic defines object model
Where Did Key-Value Come From?

The concept is quite old, but most people trace the
lineage back to Amazon and the Dynamo paper.
Dynamo
Amazon devised the Dynamo engine as a way to
address their scalability issues in a reliable way.

• Communication between nodes is peer to peer
  (P2P)
• Replication occurs with the end client
  addressing conflict resolution
• Quorum Reads/Writes
• Always writable (Hinted Handoff)
• Eventually Consistent
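Quorum reads and writes can be sketched as follows. The values N=3, W=2, R=2 and the integer version numbers are assumptions for illustration; they are not on the slide. The one rule relied on is that choosing R + W > N forces every read set to overlap every successful write set.

```python
N, W, R = 3, 2, 2  # assumed values; R + W > N makes read and write sets overlap

replicas = [dict() for _ in range(N)]  # each replica maps key -> (version, value)

def write(key, value, version, reachable):
    """Write to the reachable replicas; succeed only if at least W acknowledge."""
    acks = 0
    for i in reachable:
        replicas[i][key] = (version, value)
        acks += 1
    return acks >= W

def read(key, reachable):
    """Read from R reachable replicas and return the newest version seen."""
    if len(reachable) < R:
        raise RuntimeError("not enough replicas for a quorum read")
    answers = [replicas[i][key] for i in reachable[:R] if key in replicas[i]]
    return max(answers)[1] if answers else None

write("cart", "book", 1, [0, 1, 2])   # all three replicas take version 1
write("cart", "book+pen", 2, [0, 1])  # replica 2 is unreachable and stays stale
```

A later `read("cart", [2, 0])` consults the stale replica 2 and the fresh replica 0, and the higher version wins. This toy uses a simple version counter; it does not attempt the client-side conflict resolution mentioned above.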
Eventually Consistent
• Rather than expending the runtime resources
  to ensure that all nodes are aware of a change
  before continuing, Dynamo uses an eventually
  consistent model.

• In this model, a subset of nodes is changed first
• Those nodes then inform their neighbors until
  all nodes are changed (grossly simplifying).
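A small simulation of "inform your neighbors until everyone knows". The ring layout and the `gossip_rounds` function are assumptions made for this sketch; this is not Dynamo's actual protocol, only the shape of the idea.

```python
def gossip_rounds(num_nodes: int, initially_updated: set) -> int:
    """Count rounds until every node in a ring has seen the change."""
    if not initially_updated:
        raise ValueError("at least one node must hold the change")
    updated = set(initially_updated)
    rounds = 0
    while len(updated) < num_nodes:
        # Each updated node informs its left and right neighbor.
        newly_informed = set()
        for node in updated:
            newly_informed.add((node - 1) % num_nodes)
            newly_informed.add((node + 1) % num_nodes)
        updated |= newly_informed
        rounds += 1
    return rounds
```

Between the first write and the last round, different nodes can return different answers; that window is what "eventually" refers to. Starting the change on more nodes shortens it.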
Can I Use Dynamo?

  No. It’s an Amazon only internal product.
  However, AWS S3 is largely based upon it.


Amazon did announce a DynamoDB offering for
 their AWS customers. While it’s probably the
       same, I cannot guarantee that it is.
Riak
• Riak is a key-value database largely
  modeled after the Dynamo model.
• Open source (free) with paid support
  from Basho.

• Main claims to fame:
  • Extreme reliability
  • Performance speed
Riak Pro/Con
Pros
•   All nodes are equal – no single point of failure
•   Horizontal Scalability
•   Full Text Search
•   RESTful interface (and HTTP)
•   Consistency level tunable on each operation
•   Secondary indexes available
•   Map/Reduce (JavaScript & Erlang only)

Cons
•   Not meant for small, discrete and numerous datapoints.
•   Getting data in is great; getting it out, not so much
•   Security is non-existent:
       “Riak assumes the internal environment is trusted”
•   Conflict resolution can bubble up to the client if not careful.
•   Erlang is fast, but it’s got a serious learning curve.
Riak Users
Redis
• Redis is a key-value in-memory datastore.
• Open source (free) with support from the
  community.

• Main claims to fame:
  • Fast. So very, very fast.
  • Transactional support
  • Best for rapidly changing data
Redis Pro/Con
Pros
•   Transactional support
•   Blob storage
•   Support for sets, lists and sorted sets
•   Support for Publish-Subscribe (Pub-Sub) messaging
•   Robust set of operators

Cons
•   Entirely in memory
•   Master-slave replication (instead of master-master)
•   Security is non-existent: designed to be used in trusted environments
•   Does not support encryption
•   Support can be hard to find
Redis Users
Voldemort
• Voldemort is a key-value in-memory database
  built by LinkedIn.
• Open source (free) with support from the
  community

• Main claims to fame:
  • Low latency
  • Highly Available
  • Very fast reads
Voldemort Pro/Con
Pros
•   Highly customizable – each layer of the stack can be replaced as needed
•   Data elements are versioned during changes
•   All nodes are independent – no single point of failure
•   Very, very fast reads

Cons
•   Versioning means lots of disk space being used.
•   Does not support range queries
•   No complex query filters
•   All joins must be done in code
•   No foreign key constraints
•   No triggers
•   Support can be hard to find
Voldemort Users
Key/Value “Big Vendors”
• Microsoft Azure Table Storage
• Oracle NoSQL
• Berkeley DB (Oracle)
Document Store Recap

Document stores store an index token
with a grouping of attributes in a semi-
         structured document
Document Store Pro/Con
Pros
•   Tends to support a more complex data model than key/value
•   Good at content management
•   Usually supports multiple indexes
•   Schemaless (can be nested)
•   Typically low latency reads
•   Application logic defines object model

Cons
•   The entire value must be returned, not just a part of it
•   Security tends to be an issue
•   Joins are not available within the database
•   No foreign keys
•   Application logic defines object model
CouchDB
• CouchDB is a document store database.
• Open source (free), part of the Apache
  foundation with paid support available from
  several vendors.

• Main claims to fame:
  • Simple and easy to use
  • Good read consistency
  • Master-master replication
CouchDB Pro/Con
Pros
•   Very simple API for development
•   MVCC support for read consistency
•   Full Map/Reduce support
•   Data is versioned
•   Secondary indexes supported
•   Some security support
•   RESTful API, JSON support
•   Materialized views with incremental update support

Cons
•   The simple API for development is somewhat limited
•   No foreign keys
•   Conflict resolution devolves to the application
•   Versioning requires extensive disk space
•   Versioning places large load on I/O channels
•   Replication for performance, not availability
CouchDB Users
MongoDB
• MongoDB is a document store database.
• Open source (free) with paid support available
  from 10gen.

• Main claims to fame:
  • Index anything
  • Ad hoc query support
  • SQL like operations
    (not SQL syntax)
MongoDB Pro/Con
Pros
•   Auto-sharding
•   Auto-failover
•   Update in place
•   Spatial index support
•   Ad hoc query support
•   Any field in Mongo can be indexed
•   Very, very popular (lots of production deployments)
•   Very easy transition from SQL

Cons
•   Does not support JSON: BSON instead
•   Master-slave replication
•   Has had some growing pains (e.g. Foursquare outage)
•   Not RESTful by default
•   Failures require a manual database repair operation (similar to MySQL)
•   Replication for availability, not performance
MongoDB Users
Document Store “Big Vendors”
• Lotus Domino
Columnar Family Recap

• A key with many values attached
• Usually presenting as “rows” and “columns”
  • Or, at least, their logical equivalents
Columnar Pro/Con
Pros
•   Tend to have some level of rudimentary security support
•   Usually include a degree of versioning
•   Can be more efficient than row databases when processing a
    limited number of columns over a large number of rows

Cons
•   Is much less efficient when processing many columns simultaneously
•   Joins tend to not be supported
•   Referential integrity not available
Where Did Columnar Come From?

The concept has been around for a while, but most
 people trace the NoSQL lineage back to Google.
BigTable
Google devised the BigTable engine as a way to
address their search related scalability issues in a
reliable way.
• Data is organized through a set of keys:
  •   Row             • Column          • Timestamp
• A hybrid row/column store with a single master
• Versioning is handled through the time key
• Tablets are a dynamic partition of a sequence of
  rows – supports very efficient range scans
• Columns can be grouped into column families
• Column families can have access control
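The row / column / timestamp keying can be sketched with a dict keyed on (row, column) holding timestamped versions. Everything here is illustrative: the `cells` structure, the `put` and `get` helpers and the sample row and column names are assumptions, not Google's API.

```python
# cells[(row, column)] is a list of (timestamp, value) pairs.
cells = {}

def put(row, column, timestamp, value):
    """Store a new version of a cell; older versions are kept."""
    cells.setdefault((row, column), []).append((timestamp, value))

def get(row, column, as_of=None):
    """Return the newest value, or the newest value at or before as_of."""
    versions = [
        (ts, value)
        for ts, value in cells.get((row, column), [])
        if as_of is None or ts <= as_of
    ]
    return max(versions, key=lambda tv: tv[0])[1] if versions else None

# Hypothetical sample data: column names use a family:qualifier style.
put("com.example/index", "contents:html", 1, "<html>v1</html>")
put("com.example/index", "contents:html", 5, "<html>v2</html>")
```

Reading with `as_of=3` returns the first version, which is how the time key doubles as a versioning mechanism.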
Can I Use BigTable?

No. It’s a Google only internal product. However,
 quite a few open source products are built upon
                  the concepts.
Cassandra
• Cassandra is a hybrid: a BigTable data model built on
  Dynamo-style infrastructure
• Open source (free), built by Facebook with paid
  support available from several vendors.

• Main claims to fame:
  • An Apache project
  • Very, very fast writes
  • Spans multiple datacenters
Cassandra Pro/Con
Pros
•   Designed to span multiple datacenters
•   Peer to peer communication between nodes
•   No single point of failure
•   Always writeable
•   Consistency level is tunable at run time
•   Supports secondary indexes
•   Supports Map/Reduce
•   Supports range queries

Cons
•   No joins
•   No referential integrity
•   Written in Java – quite complex to administer and configure
•   Last update wins
Cassandra Users
HBase
• HBase is a columnar database built on top of the
  Hadoop environment.
• Open source (free) with paid support from
  numerous vendors

• Main claims to fame:
  • Ad hoc type abilities
  • Easy integration with
    Map/Reduce
HBase Pro/Con
Pros
•   Map/Reduce support
•   More of a CA approach than an AP one
•   Supports predicate push down for performance gains
•   Automatic partitioning and rebalancing of regions
•   Data is stored in a sorted order (not indexed)
•   RESTful API
•   Strong and vibrant ecosystem

Cons
•   Secondary indexes generally not supported
•   Security is non-existent
•   Requires a Hadoop infrastructure to function
HBase Users
Hadoop
• Hadoop is not a columnar store as such.
• Rather, Hadoop is a massively parallel data
  processing engine

• Main claims to fame:
  • Specializes in unstructured data
  • Very flexible and popular
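The Map/Reduce style of processing that Hadoop runs at scale can be imitated on one machine. The sketch below is plain Python, not the Hadoop API; the function names and the word-count task are chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each `map_phase` call sees only its own line and each `reduce_phase` call sees only its own key, both phases can be spread across many machines, which is where the parallelism comes from.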
Hadoop Pro/Con
Pros
•   While written in Java, almost any language can leverage Hadoop
•   Runs on commodity servers
•   Horizontally scalable
•   Very fast and powerful
•   Brought Map/Reduce (first described by Google) to open source
•   Ample support from vendors
•   “Helper” languages like Hive and Pig
•   Strong and vibrant ecosystem

Cons
•   Large amounts of disk space and bandwidth required
•   Paradigm shift for IT staff
•   Quality talent is highly in demand and expensive
•   Security is non-existent
•   Name node is a single point of failure
•   More or less only supporting batch processing
•   Not user friendly to anyone other than developers
Hadoop Users




         Plus lots, lots more
Columnar “Big Vendor”

• EMC Greenplum
• Teradata Aster


 Insofar as both of these solutions graft
Map/Reduce onto a (more or less) SQL environment
Which One Do I Use Where?
•   Key-Value for (relatively) simple, volatile data
•   Document store for more complex data
•   Columnar for analytical processing
•   RDBMS for traditional processing – particularly
    where a lazy consistency is not acceptable
    • Point Of Sale, for example
Questions?
@scyphers


            Additional Information At
http://www.daemonconsulting.net/BDC-FOSE-2012


Daemon Consulting, LLC
     http://www.daemonconsulting.net/

      Specializing In The Hard Stuff

More Related Content

What's hot

Big data analytics
Big data analyticsBig data analytics
Big data analytics
Vikram Nandini
 
Snowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data WarehousingSnowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data Warehousing
Amazon Web Services
 
CRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologyCRISP-DM: a data science project methodology
CRISP-DM: a data science project methodology
Sergey Shelpuk
 
OLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEOLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSE
Zalpa Rathod
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Vivek Aanand Ganesan
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Deepika ParthaSarathy
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Yash Raj
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
Navjot Kaur
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
Manish Jain
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Edureka!
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
Gregory Piatetsky-Shapiro
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
Mithlesh Sadh
 
Business Intelligence (BI) and Data Management Basics
Business Intelligence (BI) and Data Management  Basics Business Intelligence (BI) and Data Management  Basics
Business Intelligence (BI) and Data Management Basics
amorshed
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
Aswadmehar
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data Analytics
Vijay Rao
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
Thang Bui (Bob)
 
Big data
Big dataBig data
Big data
Ami Redwan Haq
 

What's hot (20)

Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Snowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data WarehousingSnowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data Warehousing
 
CRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologyCRISP-DM: a data science project methodology
CRISP-DM: a data science project methodology
 
OLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEOLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSE
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
 
Business Intelligence (BI) and Data Management Basics
Business Intelligence (BI) and Data Management  Basics Business Intelligence (BI) and Data Management  Basics
Business Intelligence (BI) and Data Management Basics
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data Analytics
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Big data
Big dataBig data
Big data
 

Viewers also liked

Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
DataStax
 
Science of culture? Computational analysis and visualization of cultural imag...
Science of culture? Computational analysis and visualization of cultural imag...Science of culture? Computational analysis and visualization of cultural imag...
Science of culture? Computational analysis and visualization of cultural imag...
Lev Manovich
 
Big Data
Big DataBig Data
Big Data
NGDATA
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
Bernard Marr
 
Big data ppt
Big data pptBig data ppt
Big data ppt
IDBI Bank Ltd.
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
Nasrin Hussain
 

Viewers also liked (6)

Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
Science of culture? Computational analysis and visualization of cultural imag...
Science of culture? Computational analysis and visualization of cultural imag...Science of culture? Computational analysis and visualization of cultural imag...
Science of culture? Computational analysis and visualization of cultural imag...
 
Big Data
Big DataBig Data
Big Data
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similar to Big Data Platforms: An Overview

UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
Rahul Borate
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
Rahul Borate
 
NoSQL.pptx
NoSQL.pptxNoSQL.pptx
NoSQL.pptx
RithikRaj25
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)
Don Demcsak
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7
abdulrahmanhelan
 
No SQL
No SQLNo SQL
6269441.ppt
6269441.ppt6269441.ppt
6269441.ppt
Swapna Jk
 
Oracle Week 2016 - Modern Data Architecture
Oracle Week 2016 - Modern Data ArchitectureOracle Week 2016 - Modern Data Architecture
Oracle Week 2016 - Modern Data Architecture
Arthur Gimpel
 
Big iron 2 (published)
Big iron 2 (published)Big iron 2 (published)
Big iron 2 (published)
Ben Stopford
 
NOsql Presentation.pdf
NOsql Presentation.pdfNOsql Presentation.pdf
NOsql Presentation.pdf
AkshayDwivedi31
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
Adi Challa
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
James Serra
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
LinkedIn
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
Venu Anuganti
 
Solr cloud the 'search first' nosql database extended deep dive
Solr cloud the 'search first' nosql database   extended deep diveSolr cloud the 'search first' nosql database   extended deep dive
Solr cloud the 'search first' nosql database extended deep dive
lucenerevolution
 
Database Technologies
Database TechnologiesDatabase Technologies
Database Technologies
Michel de Goede
 
BigData, NoSQL & ElasticSearch
BigData, NoSQL & ElasticSearchBigData, NoSQL & ElasticSearch
BigData, NoSQL & ElasticSearch
Sanura Hettiarachchi
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Lucas Jellema
 
No sql
No sqlNo sql
No sql
Prateek Jain
 
Dropping ACID: Wrapping Your Mind Around NoSQL Databases
Dropping ACID: Wrapping Your Mind Around NoSQL DatabasesDropping ACID: Wrapping Your Mind Around NoSQL Databases
Dropping ACID: Wrapping Your Mind Around NoSQL Databases
Kyle Banerjee
 

Big Data Platforms: An Overview

  • 1. Big Data Platforms: An Overview. C. Scyphers, Chief Technical Architect, Daemon Consulting, LLC
  • 2. What Is “Big Data”? • Big Data is not simply a huge pile of information • A good starting place is the following thought: “Big Data describes datasets so large they become very difficult to manage with traditional database tools.”
  • 3. What Is A Big Data Platform? Putting it simply, it is any platform which supports those kinds of large datasets.
  • 4. It doesn’t have to be cutting edge technology.
  • 5. Lots of legacy technologies can address the problem.
  • 6. If only at a sizeable cost.
  • 7. Of the new technologies, the most promising are from the “NoSQL” family.
  • 8. What Is “NoSQL”? A family of non-relational data storage technologies
  • 12. Usually with less cost than more traditional approaches.
  • 13. Some of these technologies are new and innovative
  • 14. Others have been around for decades.
  • 15. NoSQL Does Not Mean “SQL Is Bad”. The term “NoSQL” was coined when the trend was just starting. It’s unfortunate, because it implies antagonism towards SQL.
  • 16. NoSQL Means “Not Only SQL” [Figure labels: RELATIONAL, NON-RELATIONAL] NoSQL is a complement to a traditional RDBMS, not necessarily a replacement for it.
  • 18. Scale is very hard without ridiculous expense
  • 19. SQL can get very complex, very quickly
  • 20. Changing a schema for a large production system is both risky and expensive
  • 21. Throughput can be a challenge
  • 23. Scale is achieved through a shared-nothing architecture, removing bottlenecks
  • 24. Schemaless design means change becomes much less risky and significantly cheaper
  • 25. Most solutions use simple RESTful interfaces
  • 26. NoSQL is based upon a better understanding of data storage, usually referred to as the “CAP Theorem”
  • 27. The CAP Theorem Grossly simplified (with apologies to Brewer): A database can be • Consistent (All clients see the same data) • Available (All clients can find some available node) • Partition-Tolerant (the database will continue to function even if split into disconnected sets – e.g. a network disruption) Pick Any Two.
  • 28. CAP In Practice • Consistent & Available (no Partition Tolerance) • Either single machines or single-site clusters. • Typically uses two-phase commits
  • 29. CAP In Practice • Consistent & Partition Tolerant (no Availability) • Some data may be inaccessible, but the remainder is available and consistent • Sharding is an example of this implementation (e.g. Customers A-F, Customers G-R, Customers S-Z)
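The A-F / G-R / S-Z split on this slide can be sketched as a small range router. This is a minimal illustration, not how any particular database implements sharding; the shard names and the `shard_for` and `lookup` helpers are hypothetical.

```python
import bisect

# Hypothetical shard layout mirroring the slide: Customers A-F, G-R, S-Z.
# Each entry is the first letter of a shard's range, in sorted order.
SHARD_STARTS = ["A", "G", "S"]
SHARD_NAMES = ["customers_a_f", "customers_g_r", "customers_s_z"]


def shard_for(customer_name: str) -> str:
    """Return the shard responsible for a customer, by first letter."""
    first = customer_name.strip()[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"cannot route name: {customer_name!r}")
    # bisect_right counts how many range starts are <= the letter;
    # subtracting one gives the index of the owning shard.
    index = bisect.bisect_right(SHARD_STARTS, first) - 1
    return SHARD_NAMES[index]


def lookup(customer_name: str, unavailable: set) -> str:
    """Route a read; a shard outage only affects names in that range."""
    shard = shard_for(customer_name)
    if shard in unavailable:
        raise ConnectionError(f"{shard} is unreachable")
    return shard
```

If the S-Z shard is unreachable, `lookup("Doe", {"customers_s_z"})` still succeeds while `lookup("Smith", {"customers_s_z"})` fails, which is the "some data may be inaccessible, but the remainder is available and consistent" trade-off.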
  • 30. CAP In Practice • Available & Partition Tolerant (no Consistency) • Some data may be inaccurate; a conflict resolution strategy is required. • DNS is an example of this, as well as standard master-slave replication
  • 31. CAP From A Vendor POV • C-A (no P) – this is generally how most RDBMS vendors operate • C-P (no A) – this is how many RDBMSs attempt to address scale without incurring large costs • A-P (no C) – this is how most NoSQL approaches solve the problem
  • 32. ACID vs BASE. Traditional Databases Are ACID Compliant: • Atomicity – either the entire transaction completes or none of it does • Consistent – any transaction will take the database from one consistent state to another, with no broken constraints • Isolation – changes do not affect other users until committed • Durability – committed transactions can be recovered in case of system failure. NoSQL Databases Tend To Be BASE Compliant: • Basically Available • Scalable • Eventually consistent. Eventually consistent is the key phrase here
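As a rough illustration of the Atomicity and Consistent bullets, the sketch below uses Python's built-in `sqlite3` module. The `accounts` table, the names in it, and the `transfer` and `balances` helpers are made up for this example; the point is only that a failed constraint undoes the whole transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the "no broken constraints" rule for this toy schema.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(amount: int) -> bool:
    """Move money from alice to bob; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def balances() -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer larger than alice's balance credits bob first, then fails the CHECK on alice, and the rollback removes bob's credit as well.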
  • 33. SQL Strengths Very well known technology
  • 36. Large talent pool from which to choose
  • 37. Ad hoc operations common, if not encouraged
  • 38. NoSQL Strengths Built to address massive scale
  • 42. NoSQL Pros/Cons. Pros: • Schema Evolution • Horizontal Scalability • Simple Protocols. Cons: • Querying the data is much harder • Paradigm Shift • Security is a big issue • May or may not support data types (BLOBs, spatial) • Generally, uniqueness cannot be enforced
  • 43. A Disclaimer Before We Continue • I am not an expert on every possible Big Data Platform • There are hundreds of them; these are the ones I consider the leaders in the field and recommend • If you have a favorite, please let me know and I’ll update the deck for next time • The internal details on how these systems work are rather complex; I would prefer to take those questions offline
  • 44. Flavors Of NoSQL. The four major divisions of NoSQL are: • Key-Value • Document Store • Columnar • Other
  • 45. Key-Value • At a very high level, key-value works essentially by pairing an index token (a key) with a data element (a value). • Both the index token and the data value can be of any structure. • Such a pairing is arbitrary and up to the developer of the system to determine.
  • 46. A Key-Value Example “John Smith”, “100 Century Dr. Alexandria VA 22304” “John Doe”, “16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria” In both examples, the key is a name and the value is an address. However, the structure of the address differs between the two.
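The pairing above can be sketched with an ordinary in-memory map. The `KeyValueStore` class is a hypothetical toy, not the API of any product covered in this deck; it only shows that the store treats each value as an opaque blob.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # The whole value comes back; the store cannot return part of it.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("John Smith", "100 Century Dr. Alexandria VA 22304")
store.put("John Doe", "16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria")
```

Because the store never looks inside a value, the two addresses can use different layouts, and extracting a city or postal code is left to application code.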
  • 47. Document Store • Document stores extend the key-value paradigm into values with multiple attributes. • The document values tend to be semi-structured data (XML, JSON, et al) but can also be Word or PDF documents.
  • 48. A Document Store Example “John Smith”, “<address><street>100 Century Dr.</street> <city>Alexandria</city> <state>VA</state> <postalCode>22304</postalCode> </address>” “John Doe”, “{ "address": { "street": "16 Kozyak Street", "district": "Lozenets, 1408", "city": "Sofia", "country": "Bulgaria" } }”
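A short sketch of why semi-structured values matter: unlike an opaque key-value blob, individual attributes can be read out of the document. The example uses only the Python standard library; the `documents` map and the `city_of` helper are illustrative names, and the format check (leading `<` means XML) is a simplifying assumption.

```python
import json
import xml.etree.ElementTree as ET

# The two documents from the slide, keyed by name.
documents = {
    "John Smith": (
        "<address><street>100 Century Dr.</street>"
        "<city>Alexandria</city><state>VA</state>"
        "<postalCode>22304</postalCode></address>"
    ),
    "John Doe": json.dumps({
        "address": {
            "street": "16 Kozyak Street",
            "district": "Lozenets, 1408",
            "city": "Sofia",
            "country": "Bulgaria",
        }
    }),
}


def city_of(name: str) -> str:
    """Read one attribute out of a stored document, whatever its format."""
    raw = documents[name]
    if raw.lstrip().startswith("<"):
        # XML document: <city> is a direct child of the <address> root.
        return ET.fromstring(raw).findtext("city")
    return json.loads(raw)["address"]["city"]
```

A real document store would typically index such attributes on the server side rather than parsing each document in the client.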
  • 49. Columnar Family • Usually has “rows” and “columns” • Or, at least, their logical equivalents • Not a traditional, “pure” column store • More of a hybridized approach leveraging key-value pairs • A key with many values attached
  • 50. The Others • Hierarchical Databases (LDAP, Active Directory) • Graph Databases (Neo4j, FlockDB, InfiniteGraph) • XML (MarkLogic) • Object Oriented Databases (Versant) • Lotus Notes • HPCC (LexisNexis)
  • 51. Key-Value Recap: Pairing an index token (a key) with a data element (a value)
  • 52. Key-Value Pro/Con. Pros: • Schema Evolution • Horizontal Scalability • Simple Protocols • Works well for volatile data • High throughput, typically optimized for reads or writes • Keys become meaningful rather than arbitrary • Application logic defines object model. Cons: • Packing & unpacking each key • Keys typically are not related to each other • The entire value must be returned, not just a part of it • Security tends to be an issue • Hard to support reporting, analytics, aggregation or ordered values • Generally does not support updates in place • Application logic defines object model
  • 53. Where Did Key-Value Come From? The concept is quite old, but most people trace the lineage back to Amazon and the Dynamo paper.
  • 54. Dynamo Amazon devised the Dynamo engine as a way to address their scalability issues in a reliable way. • Communication between nodes is peer to peer (P2P) • Replication occurs with the end client addressing conflict resolution • Quorum Reads/Writes • Always writable (Hinted Handoff) • Eventually Consistent
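The quorum bullet can be sketched with N replicas, a write quorum W and a read quorum R. This is a toy under stated assumptions, not Dynamo's actual protocol: versions are plain integers supplied by the caller (Dynamo itself uses vector clocks), hinted handoff is omitted, and the `Replica` and `QuorumStore` names are hypothetical. Requiring R + W > N is the usual way to make every read set overlap the latest successful write set.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    up: bool = True
    data: dict = field(default_factory=dict)  # key -> (version, value)


class QuorumStore:
    def __init__(self, n: int = 3, w: int = 2, r: int = 2):
        if r + w <= n:
            raise ValueError("need R + W > N so read and write sets overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r

    def write(self, key, value, version: int) -> int:
        """Write to every reachable replica; fail if fewer than W are up."""
        reachable = [rep for rep in self.replicas if rep.up]
        if len(reachable) < self.w:
            raise RuntimeError("write quorum not reached")
        for rep in reachable:
            rep.data[key] = (version, value)
        return len(reachable)

    def read(self, key):
        """Ask every reachable replica; fail if fewer than R answer."""
        answers = [rep.data.get(key) for rep in self.replicas if rep.up]
        if len(answers) < self.r:
            raise RuntimeError("read quorum not reached")
        found = [a for a in answers if a is not None]
        # Newest version wins among the replicas that answered.
        return max(found, key=lambda pair: pair[0])[1] if found else None
```

With N=3, W=2, R=2, a replica that missed a write while it was down can still take part in a later read, because at least one other responder holds the newer version.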
  • 55. Eventually Consistent • Rather than expending the runtime resources to ensure that all nodes are aware of a change before continuing, Dynamo uses an eventually consistent model. • In this model, a subset of nodes is changed. • Those nodes then inform their neighbors until all nodes are changed (grossly simplifying).
  • 56. Can I Use Dynamo? No. It’s an Amazon-only internal product. However, AWS S3 is largely based upon it. Amazon did announce a DynamoDB offering for their AWS customers. While it’s probably the same, I cannot guarantee that it is.
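The "inform their neighbors" step can be simulated in a few lines. The sketch assumes nodes arranged in a ring where each node only talks to its two neighbors; that topology and the `gossip_rounds` function are illustrative choices, not a description of Dynamo's real membership protocol.

```python
def gossip_rounds(node_count: int, initially_updated: set) -> int:
    """Count rounds until every node in a ring has seen the change."""
    if not initially_updated:
        raise ValueError("at least one node must start with the change")
    if not all(0 <= node < node_count for node in initially_updated):
        raise ValueError("node ids must be in range(node_count)")
    updated = set(initially_updated)
    rounds = 0
    while len(updated) < node_count:
        # Every node that has the change tells its two ring neighbors.
        newly = set()
        for node in updated:
            newly.add((node - 1) % node_count)
            newly.add((node + 1) % node_count)
        updated |= newly
        rounds += 1
    return rounds
```

Between the first write and the final round, different nodes can return different answers for the same key; that window is what "eventually" refers to. Starting the change on more nodes shortens the window.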
  • 57. Riak • Riak is a key-value database largely modeled after the Dynamo model. • Open source (free) with paid support from Basho. • Main claims to fame: • Extreme reliability • Performance speed
  • 58. Riak Pro/Con. Pros: • All nodes are equal – no single point of failure • Horizontal Scalability • Full Text Search • RESTful interface (and HTTP) • Consistency level tunable on each operation • Secondary indexes available • Map/Reduce (JavaScript & Erlang only). Cons: • Not meant for small, discrete and numerous datapoints. • Getting data in is great; getting it out, not so much • Security is non-existent: “Riak assumes the internal environment is trusted” • Conflict resolution can bubble up to the client if not careful. • Erlang is fast, but it’s got a serious learning curve.
  • 60. Redis • Redis is a key-value in-memory datastore. • Open source (free) with support from the community. • Main claims to fame: • Fast. So very, very fast. • Transactional support • Best for rapidly changing data
  • 61. Redis Pro/Con. Pros: • Transactional support • Blob storage • Support for sets, lists and sorted sets • Support for Publish-Subscribe (Pub-Sub) messaging • Robust set of operators. Cons: • Entirely in memory • Master-slave replication (instead of master-master) • Security is non-existent: designed to be used in trusted environments • Does not support encryption • Support can be hard to find
  • 63. Voldemort • Voldemort is a key-value in-memory database built by LinkedIn. • Open source (free) with support from the community • Main claims to fame: • Low latency • Highly Available • Very fast reads
  • 64. Voldemort Pro/Con. Pros: • Highly customizable – each layer of the stack can be replaced as needed • Data elements are versioned during changes • All nodes are independent – no single point of failure • Very, very fast reads. Cons: • Versioning means lots of disk space being used. • Does not support range queries • No complex query filters • All joins must be done in code • No foreign key constraints • No triggers • Support can be hard to find
  • 66. Key/Value “Big Vendors” • Microsoft Azure Table Storage • Oracle NoSQL • Berkeley DB (Oracle)
  • 67. Document Store Recap: Document stores store an index token with a grouping of attributes in a semi-structured document
  • 68. Document Store Pro/Con. Pros: • Tends to support a more complex data model than key/value • Good at content management • Usually supports multiple indexes • Schemaless (can be nested) • Typically low latency reads • Application logic defines object model. Cons: • The entire value must be returned, not just a part of it • Security tends to be an issue • Joins are not available within the database • No foreign keys • Application logic defines object model
  • 69. CouchDB • CouchDB is a document store database. • Open source (free), part of the Apache foundation with paid support available from several vendors. • Main claims to fame: • Simple and easy to use • Good read consistency • Master-master replication
  • 70. CouchDB Pro/Con. Pros: • Very simple API for development • MVCC support for read consistency • Full Map/Reduce support • Data is versioned • Secondary indexes supported • Some security support • RESTful API, JSON support • Materialized views with incremental update support. Cons: • The simple API for development is somewhat limited • No foreign keys • Conflict resolution devolves to the application • Versioning requires extensive disk space • Versioning places large load on I/O channels • Replication for performance, not availability
  • 72. MongoDB • MongoDB is a document store database. • Open source (free) with paid support available from 10Gen. • Main claims to fame: • Index anything • Ad hoc query support • SQL like operations (not SQL syntax)
  • 73. MongoDB Pro/Con. Pros: • Auto-sharding • Auto-failover • Update in place • Spatial index support • Ad hoc query support • Any field in Mongo can be indexed • Very, very popular (lots of production deployments) • Very easy transition from SQL. Cons: • Does not support JSON: BSON instead • Master-slave replication • Has had some growing pains (e.g. Foursquare outage) • Not RESTful by default • Failures require a manual database repair operation (similar to MySQL) • Replication for availability, not performance
  • 75. Document Store “Big Vendors” • Lotus Domino
  • 76. Columnar Family Recap • A key with many values attached • Usually presenting as “rows” and “columns” • Or, at least, their logical equivalents
  • 77. Columnar Pro/Con. Pros: • Tend to have some level of rudimentary security support • Usually include a degree of versioning • Can be more efficient than row databases when processing a limited number of columns over a large number of rows. Cons: • Is much less efficient when processing many columns simultaneously • Joins tend to not be supported • Referential integrity not available
  • 78. Where Did Columnar Come From? The concept has been around for a while, but most people trace the NoSQL lineage back to Google.
  • 79. BigTable Google devised the BigTable engine as a way to address their search related scalability issues in a reliable way. • Data is organized through a set of keys: • Row • Column • Timestamp • A hybrid row/column store with a single master • Versioning is handled through the time key • Tablets are a dynamic partition of a sequence of rows – supports very efficient range scans • Columns can be grouped into column families • Column families can have access control
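The row / column / timestamp keying can be sketched as a nested map. This is a toy model of the data layout only, with a hypothetical `VersionedTable` class; it says nothing about tablets, the master, or on-disk storage. The `family:qualifier` column names used in the usage note are an illustrative convention.

```python
from collections import defaultdict


class VersionedTable:
    """Toy map of (row, column, timestamp) -> value."""

    def __init__(self):
        # row -> column -> {timestamp: value}
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._cells[row][column][timestamp] = value

    def get(self, row, column, at=None):
        """Return the newest value, or the newest at or before `at`."""
        versions = self._cells.get(row, {}).get(column, {})
        stamps = [t for t in versions if at is None or t <= at]
        return versions[max(stamps)] if stamps else None

    def scan(self, start_row, end_row):
        """Yield row keys with start_row <= key < end_row, in sorted order."""
        for row in sorted(self._cells):
            if start_row <= row < end_row:
                yield row
```

For example, after `put("smith", "address:city", 1, "Alexandria")` and `put("smith", "address:city", 5, "Sofia")`, a plain `get` returns the newer value while `get(..., at=3)` returns the older one, which is how versioning through the time key works in this model. Keeping rows in sorted key order is what makes range scans cheap.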
  • 80. Can I Use BigTable? No. It’s a Google-only internal product. However, quite a few open source products are built upon the concepts.
  • 81. Cassandra • Cassandra is a hybrid of Big Table built on Dynamo infrastructure • Open source (free), built by Facebook with paid support available from several vendors. • Main claims to fame: • An Apache project • Very, very fast writes • Spans multiple datacenters
  • 82. Cassandra Pro/Con. Pros: • Designed to span multiple datacenters • Peer to peer communication between nodes • No single point of failure • Always writeable • Consistency level is tunable at run time • Supports secondary indexes • Supports Map/Reduce • Supports range queries. Cons: • No joins • No referential integrity • Written in Java – quite complex to administer and configure • Last update wins
  • 84. HBase • HBase is a columnar database built on top of the Hadoop environment. • Open source (free) with paid support from numerous vendors • Main claims to fame: • Ad hoc type abilities • Easy integration with Map/Reduce
  • 85. HBase Pro/Con. Pros: • Map/Reduce support • More of a CA approach than an AP • Supports predicate push down for performance gains • Automatic partitioning and rebalancing of regions • Data is stored in a sorted order (not indexed) • RESTful API • Strong and vibrant ecosystem. Cons: • Secondary indexes generally not supported • Security is non-existent • Requires a Hadoop infrastructure to function
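"Last update wins" is a conflict resolution strategy that can be sketched in a few lines. The function below is a generic illustration with a hypothetical name and data shape (each replica is a map of key to a timestamp and value pair); it is not Cassandra's implementation.

```python
def last_write_wins(replica_a: dict, replica_b: dict) -> dict:
    """Merge two replicas' {key: (timestamp, value)} maps, newest wins."""
    merged = dict(replica_a)
    for key, (stamp, value) in replica_b.items():
        # On a timestamp tie this sketch keeps replica_a's entry.
        if key not in merged or stamp > merged[key][0]:
            merged[key] = (stamp, value)
    return merged
```

The strategy is simple and needs no client involvement, but it is listed as a con for a reason: when two clients update the same key at nearly the same time, the older update is discarded without warning.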
  • 87. Hadoop • Hadoop is not a columnar store as such. • Rather, Hadoop is a massively parallel data processing engine • Main claims to fame: • Specializes in unstructured data • Very flexible and popular
  • 88. Hadoop Pro/Con. Pros: • While written in Java, almost any language can leverage Hadoop • Runs on commodity servers • Horizontally scalable • Very fast and powerful • Where Map/Reduce originated • Ample support from vendors • “Helper” languages like Hive and Pig • Strong and vibrant ecosystem. Cons: • Large amounts of disk space and bandwidth required • Paradigm shift for IT staff • Quality talent is highly in demand and expensive • Security is non-existent • Name node is a single point of failure • More or less only supporting batch processing • Not user friendly to anyone other than developers
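For readers new to Map/Reduce, the classic word count can be sketched in plain Python. This runs in a single process and only mimics the three stages (map, shuffle, reduce); it does not use Hadoop, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document: str):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Each map call sees one record and each reduce call sees one key, so a framework can spread both phases across many commodity servers; the batch-only nature noted in the cons follows from having to finish the map phase before reducing.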
  • 89. Hadoop Users Plus lots, lots more
  • 90. Columnar “Big Vendor” • EMC Greenplum • Teradata Aster (insofar as both of these solutions are grafting Map/Reduce into a (more or less) SQL environment)
  • 91. Which One Do I Use Where? • Key-Value for (relatively) simple, volatile data • Document store for more complex data • Columnar for analytical processing • RDBMS for traditional processing, particularly where lazy (eventual) consistency is not acceptable (Point Of Sale, for example)
  • 93. @scyphers Additional Information At http://www.daemonconsulting.net/BDC-FOSE-2012 Daemon Consulting, LLC http://www.daemonconsulting.net/ Specializing In The Hard Stuff