Topic 10: Taxonomy of Data and Storage
Cloud Computing Workshop 2013, ITU

Presentation Transcript

  • 10: Taxonomy of Data and Storage Zubair Nabi zubair.nabi@itu.edu.pk April 20, 2013 Zubair Nabi 10: Taxonomy of Data and Storage April 20, 2013 1 / 27
  • Outline
      1. Datasets
      2. Storage
      3. Beyond RDBMS
      4. NoSQL Taxonomy
      5. NewSQL
  • Section 1: Datasets
  • Introduction
      • Data is everywhere and is the driving force behind our lives
      • The address book on your phone is data
      • So is the newspaper that you read every morning
      • Everything you see around you is a potential source of data that might be useful for some application
      • We use this data to share information and to make more informed decisions about events
      • Datasets can be classified on the basis of their structure:
          1. Structured
          2. Unstructured
          3. Semi-structured
  • Structured Data
      • Formatted in a universally understandable and identifiable way
      • In most cases, structured data is formally specified by a schema
      • Your phone's address book is structured because it has a schema consisting of name, phone number, address, email address, etc.
      • Most traditional databases contain structured data, laid out across columns and rows
      • Each field also has an associated type
      • It is possible to search for items based on their data types
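The address book example above can be sketched in a few lines of Python. This is only an illustration: the `Contact` class, its field names, and the sample entries are assumptions made up for the sketch, not part of the slides.

```python
from dataclasses import dataclass, fields


@dataclass
class Contact:
    # Hypothetical schema for the address book: every field has a declared type.
    name: str
    phone_number: str
    address: str
    email: str


contacts = [
    Contact("Ayesha", "555-0100", "12 Canal Road", "ayesha@example.com"),
    Contact("Bilal", "555-0101", "7 Mall Road", "bilal@example.com"),
]

# The schema can be inspected programmatically: field names and their types.
schema = {f.name: f.type for f in fields(Contact)}

# Because every record follows the same schema, a lookup on a named field is well defined.
matches = [c for c in contacts if c.phone_number == "555-0101"]
```

Every record has the same fields with the same types, which is what makes the lookup on `phone_number` possible without inspecting each record first.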
  • Unstructured Data
      • Data without any conceptual definition or type
      • Can vary from raw text to binary data
      • Processing unstructured data requires parsing and tagging on the fly
      • In most cases, consists of simple log files
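As a rough sketch of "parsing and tagging on the fly", the snippet below applies a regular expression to raw log lines at read time. The log format, the pattern, and the `tag_line` helper are all assumptions chosen for illustration.

```python
import re

# Hypothetical log format, assumed only for this sketch: "<date> <time> <LEVEL> <message>".
LOG_LINE = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def tag_line(line):
    """Parse one raw line and tag its parts; return None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


raw_log = [
    "2013-04-20 10:15:02 INFO server started",
    "2013-04-20 10:15:07 ERROR disk quota exceeded",
    "free-form text that matches no pattern",
]

# Structure is imposed while reading; nothing in the file itself declares it.
tagged = [tag_line(line) for line in raw_log]
```

The last line shows the catch: anything that does not fit the pattern the reader happened to choose comes back untagged.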
  • Semi-structured Data
      • Occupies the middle of the spectrum between structured and unstructured data
      • For instance, while binary data has no structure, audio and video files have metadata which has structure, such as author, time of creation, etc.
      • Can also be described as having a self-describing structure
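A small sketch of what "self-describing" can look like in practice, using JSON metadata for two media files. The file names, field names, and values are invented for the example.

```python
import json

# Hypothetical metadata records for two media files. The payload (the audio or
# video bytes) stays opaque; only the metadata carries structure.
records = json.loads("""
[
  {"file": "talk.mp3", "author": "A. Speaker", "created": "2013-04-20"},
  {"file": "demo.mp4", "author": "B. Maker", "created": "2013-04-21", "duration_s": 95}
]
""")

# Each record names its own fields, so a reader discovers the structure from the data itself.
field_sets = [sorted(record) for record in records]
authors = [record["author"] for record in records]
```

The two records share some fields but not all of them, and no external schema is needed to read either one.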
  • Section 2: Storage
  • Database Management Systems (DBMS)
      • Used to store and manage data
      • Support large amounts of data
      • Ensure concurrency, sharing, and locking
      • Provide security, enabling fine-grained access control
      • Keep working in the face of failure
  • Relational Database Management Systems (RDBMS)
      • The most popular and predominant storage system in use
      • Data in different files is connected by using a key field
      • Data is laid out in different tables, with a key field that identifies each row
      • The same key field is used to connect one table to another
      • For instance, a relation might have customer ID as key and her details as data; another table might have the same key but different data, say her purchases; yet another table with the same key might have a breakdown of her preferences
      • Examples include Oracle Database, MS SQL Server, MySQL, IBM DB2, and Teradata
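The customer example above can be tried out with Python's built-in `sqlite3` module. The table names, column names, and rows below are assumptions made for the sketch; SQLite merely stands in for any relational system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three hypothetical tables that all share the customer_id key field.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, item TEXT);
    CREATE TABLE preferences (customer_id INTEGER, category TEXT);
    INSERT INTO customers VALUES (1, 'Sana'), (2, 'Omar');
    INSERT INTO purchases VALUES (1, 'laptop'), (1, 'mouse'), (2, 'desk');
    INSERT INTO preferences VALUES (1, 'electronics'), (2, 'furniture');
""")

# The shared key connects one table to another.
rows = conn.execute("""
    SELECT c.name, p.item, f.category
    FROM customers AS c
    JOIN purchases AS p ON p.customer_id = c.customer_id
    JOIN preferences AS f ON f.customer_id = c.customer_id
    ORDER BY c.customer_id, p.item
""").fetchall()
```

Each table holds one kind of fact about a customer, and the join on `customer_id` reassembles them.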
  • Structured Query Language (SQL)
      • Non-procedural language used for data retrieval and manipulation in RDBMS
      • Adds a layer of abstraction over relational algebra, which enables set operations, selections, etc.
      • Due to its declarative nature, users operate in terms of their expected output while the underlying system decides the actual query execution plan
      • Instructions consist of a specific SQL statement and additional parameters and operands
      • For instance, the SELECT statement retrieves certain records, INSERT adds a record, and so on
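A minimal sketch of the declarative style, again using `sqlite3` as a stand-in. The `contacts` table and its rows are hypothetical. The query states what is wanted; `EXPLAIN QUERY PLAN` (a SQLite-specific statement) then asks the engine how it intends to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, city TEXT)")

# INSERT adds records; the values are passed as parameters.
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Sana", "Lahore"), ("Omar", "Karachi"), ("Hina", "Lahore")],
)

# SELECT describes the desired result, not the steps needed to compute it.
query = "SELECT name FROM contacts WHERE city = ? ORDER BY name"
names = [row[0] for row in conn.execute(query, ("Lahore",))]

# The engine, not the user, chooses the execution plan.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("Lahore",)).fetchall()
```

The exact contents of `plan` depend on the SQLite version, so the sketch only retrieves it rather than relying on its wording.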
  • RDBMS and Structured Data
      • As structured data follows a predefined schema, it naturally maps onto a relational database system
      • The schema defines the type and structure of the data and its relations
      • Schema design is an arduous process and needs to be done before the database can be populated
      • Another consequence of a strict schema is that it is non-trivial to extend it
      • For instance, adding a new attribute to an existing row necessitates adding a new column to the entire table
      • This is extremely suboptimal in tables with millions of rows
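The point about extending a strict schema can be seen in a small `sqlite3` sketch. The table, the `twitter_handle` attribute, and the rows are invented for illustration; how expensive the change is in practice depends on the database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Sana",), ("Omar",), ("Hina",)])

# Only one customer has the new attribute, but the column is added to the whole table.
conn.execute("ALTER TABLE customers ADD COLUMN twitter_handle TEXT")
conn.execute("UPDATE customers SET twitter_handle = '@sana' WHERE name = 'Sana'")

rows = conn.execute(
    "SELECT name, twitter_handle FROM customers ORDER BY customer_id"
).fetchall()
```

Every row now carries the new column, with `None` (SQL `NULL`) for the customers that never had the attribute.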
  • RDBMS and Semi-structured and Unstructured Data
      • Unstructured data has no notion of schema while semi-structured data only has a weak one
      • Data within such datasets also has an associated type
      • In fact, types are application-centric: it might be possible to interpret a field as a float in one application and as a string in another
      • While it is possible, with human intervention, to glean structure from unstructured data, it is an extremely expensive task
      • Structureless data generated by real-time sources can change the number of attributes and their types on the fly
      • An RDBMS would require the creation of a new table each time such a change takes place
      • Therefore, unstructured and semi-structured data do not fit the relational model
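To make "types are application-centric" concrete, here is a tiny sketch with a made-up record. The record layout and both interpretations are assumptions chosen for the example.

```python
# One raw record with no schema attached; the layout is hypothetical.
raw_record = "sensor-7,3.10"
source, raw_field = raw_record.split(",")

# Application A treats the field as a measurement and parses it as a float.
as_measurement = float(raw_field)

# Application B treats the same bytes as a version label and keeps the string.
as_version_label = raw_field

# The two readings disagree on what the value "is": the trailing zero only
# matters to the second application.
same_after_round_trip = str(as_measurement) == as_version_label
```

Neither interpretation is wrong; the data simply does not say which one applies, which is why a fixed column type is hard to choose up front.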
  • Section 3: Beyond RDBMS
  • Motivation
      • Different semantics: RDBMS provide ACID semantics:
          1. Atomic: The entire transaction either succeeds or fails as a whole
          2. Consistent: Data within the database remains consistent after each transaction
          3. Isolated: Transactions are sandboxed from each other
          4. Durable: Transactions are persistent across failures and restarts
      • This is overkill for most user-facing applications
      • Most applications are more interested in availability and are willing to sacrifice consistency, leading to eventual consistency
      • This basically available, soft state, eventually consistent (BASE) model enables applications to function even in the face of partial failure
      • High Throughput: Most NoSQL databases sacrifice consistency for availability, leading to higher throughput (in some cases by an order of magnitude)
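Atomicity, the first of the ACID properties listed above, can be sketched with `sqlite3`. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the `CHECK` constraint is simply a convenient way to make the second transfer fail halfway through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
```

In the failed transfer the credit to `bob` has already been executed when the debit violates the constraint, yet the rollback leaves no trace of it.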
  • Motivation (2)
      • Horizontal Scalability: To cater for more data, NoSQL stores can be scaled out by just adding more machines; the underlying system automatically redistributes the data
      • Commodity Hardware: A large number of RDBMS require specialized and proprietary hardware for operation. In contrast, NoSQL databases function over commodity off-the-shelf hardware
      • Programming Language Support: Over the years, programming languages have started providing abstractions for database support (LINQ, etc.) while bypassing SQL. NoSQL databases provide abstractions that map directly onto the language abstractions, leading to tighter coupling
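The slides do not say how data is redistributed when a machine is added. One common approach is consistent hashing, sketched below purely as an illustration; the `HashRing` class, the node names, and the number of virtual points per node are all assumptions.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual points per node (illustrative only)."""

    def __init__(self, nodes, replicas=64):
        self._ring = []  # sorted list of (position, node)
        self._replicas = replicas
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self._replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # The first ring point at or after the key's position owns the key (wrapping around).
        index = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[index][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")  # scale out by adding one machine
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Only the keys that now belong to `node-d` change owner; everything else stays where it was, which is what keeps redistribution cheap when the cluster grows.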
  • Motivation (3)
      • The Rise of Cloud Computing: Cloud Computing applications require horizontal scalability and low administration overhead. Both requirements are naturally satisfied by NoSQL stores
  • Section 4: NoSQL Taxonomy
  • Introduction
      • NoSQL databases can be classified on the basis of:
          1. Data Model: How data is represented
          2. Scalability: How scalable the system is
          3. Query Model: What type of API it exposes
          4. Persistence: How persistent the data is
  • Classification by Data Model
      • Based on the data model, NoSQL databases can roughly be divided into three categories:
          1. Key/value Stores: A map/dictionary allowing put/get semantics per key
          2. Document Stores: Complex data structures to encapsulate document key/value pairs
          3. Column-Oriented Stores: Data laid out by column
  • Key/value Stores
      • Data is stored within a large hash map
      • Simple get/put API
      • Favour scalability over consistency
      • Limit on the size of the key
      • Examples include Amazon’s Dynamo, LinkedIn’s Voldemort, Redis, and Memcached
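The get/put interface can be sketched as a thin wrapper around a Python dictionary. This is a single-process toy, not how any of the listed systems is implemented; the `KeyValueStore` class and the 256-byte key limit are assumptions for the example.

```python
class KeyValueStore:
    """Toy in-memory key/value store exposing only put and get."""

    MAX_KEY_BYTES = 256  # hypothetical limit, chosen only for illustration

    def __init__(self):
        self._data = {}  # the "large hash map"

    def put(self, key, value):
        if len(key.encode("utf-8")) > self.MAX_KEY_BYTES:
            raise ValueError("key exceeds the maximum key size")
        self._data[key] = value  # the value is opaque to the store

    def get(self, key, default=None):
        return self._data.get(key, default)


store = KeyValueStore()
store.put("session:42", b"\x00\x01 opaque bytes")
store.put("user:7:name", "Sana")

hit = store.get("user:7:name")
miss = store.get("user:8:name")
```

There is no query language and no way to search by value: the key is the only access path.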
  • Document Stores
      • Key/value semantics but based on documents
      • A document encapsulates data in a standard format, such as JSON, XML, PDF, etc.
      • Documents themselves can be heterogeneous
      • Documents can also be retrieved based on their content
      • Examples include Apache CouchDB and MongoDB
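A minimal sketch of the document model, assuming JSON as the document format. The `DocumentStore` class, its `find` method, and the sample documents are hypothetical and do not mirror the API of CouchDB or MongoDB.

```python
import json


class DocumentStore:
    """Toy document store: key/value access plus lookup by document content."""

    def __init__(self):
        self._docs = {}  # document id -> JSON text

    def put(self, doc_id, document):
        self._docs[doc_id] = json.dumps(document)

    def get(self, doc_id):
        return json.loads(self._docs[doc_id])

    def find(self, field, value):
        """Return the ids of all documents whose top-level field equals value."""
        return sorted(
            doc_id
            for doc_id, raw in self._docs.items()
            if json.loads(raw).get(field) == value
        )


store = DocumentStore()
# Heterogeneous documents: they do not share the same set of fields.
store.put("p1", {"type": "post", "author": "sana", "title": "Hello"})
store.put("c1", {"type": "comment", "author": "omar", "post": "p1", "likes": 3})
store.put("p2", {"type": "post", "author": "omar", "tags": ["nosql"]})

posts = store.find("type", "post")
by_omar = store.find("author", "omar")
```

Unlike a plain key/value store, the store understands the document format well enough to answer questions about what is inside the values.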
  • Column-Oriented Stores
      • Data is stored and processed by column
      • Useful for read-mostly and read-intensive data
      • Data within the same column is of the same type, enabling efficient compression
      • Columns are stored separately so they can be loaded in parallel
      • Examples include Google’s BigTable (Apache HBase is its open source clone) and Apache Cassandra (originally developed at Facebook)
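The difference between row and column layout, and why same-typed columns compress well, can be sketched as follows. The sample rows and column names are made up, and run-length encoding is just one simple compression scheme picked for the illustration.

```python
from itertools import groupby

# Row layout: each tuple holds all attributes of one record (hypothetical click log).
rows = [
    ("2013-04-20", "PK", 3),
    ("2013-04-20", "PK", 5),
    ("2013-04-20", "US", 2),
    ("2013-04-21", "US", 7),
]
names = ("day", "country", "clicks")

# Column layout: each attribute is stored contiguously and separately.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# An aggregate over one attribute touches a single column instead of every row.
total_clicks = sum(columns["clicks"])

# Values in a column share a type and often repeat, so they compress well.
encoded_day = run_length_encode(columns["day"])
encoded_country = run_length_encode(columns["country"])
```

The `day` column of four values shrinks to two pairs, and the sum over `clicks` never reads the other two columns.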
  • Section 5: NewSQL
  • Introduction
      • A hybrid of traditional RDBMS and NoSQL
      • Combines the scalability and performance of NoSQL with the ACID guarantees of RDBMS
      • Uses SQL as the primary language
      • Able to scale out and run over commodity hardware
      • Classified into:
          1. New Databases: Designed from scratch
          2. New MySQL Storage Engines: Keep MySQL as the interface but replace the storage engine
          3. Transparent Clustering: Add pluggable features to existing databases to ensure scalability
  • New Databases
      1. Query Distribution: Each node holds a subset of the data
          • Queries are split and shipped to nodes that own the data
          • Examples include Google’s Spanner and NuoDB
      2. Pull Data: A central node (possibly replicated) holds all data
          • A set of processing nodes receives queries and pulls in required data from the central node
          • Examples include VMware’s SQLFire
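The query distribution approach can be sketched as a scatter/gather over shards. Everything here is an assumption made for illustration: the shards are plain Python lists, threads stand in for remote nodes, and the query is a simple filtered sum. It does not describe how Spanner or NuoDB work internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" holds a subset of the data (hypothetical purchase records).
shards = [
    [{"customer": "Sana", "amount": 120}, {"customer": "Omar", "amount": 40}],
    [{"customer": "Hina", "amount": 300}],
    [{"customer": "Sana", "amount": 15}, {"customer": "Ali", "amount": 75}],
]


def local_sum(shard, min_amount):
    """The part of the query that runs on the node owning the data."""
    return sum(row["amount"] for row in shard if row["amount"] >= min_amount)


def distributed_sum(shards, min_amount):
    """Ship the sub-query to every shard, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: local_sum(shard, min_amount), shards))
    return sum(partials)


# Roughly: SELECT SUM(amount) FROM purchases WHERE amount >= 50
total = distributed_sum(shards, 50)
```

Only the small partial sums travel back to the coordinator; the rows themselves never leave the node that owns them. In the pull data approach the direction is reversed, with processing nodes fetching the rows they need from the central node.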