Big Data: NoSQL & the DBA – Aswani Vonteddu
  1. Big Data: NoSQL & the DBA – Aswani Vonteddu
  2. The evolution of data stores
     • Data modeling
     • Data from the Developer’s standpoint
     • Data from the DBA’s standpoint
     • Impedance mismatch and the rise of ORMs
  3. Hierarchical object graph model
     Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
  4. Normalized for tables in RDBMS
     Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
  5. Data – Summary
     • In order to use an RDBMS,
       – the designer has to model data into tables
       – the developer must normalize/de-normalize
       – the DBA has to speed up queries
  6. Impedance mismatch and the rise of ORMs (like Hibernate)

         [Table(name="Products")]
         class Product
         {
           [Column(PrimaryKey=true)]int ID;
           [Column]string Title;
           [Column]string Author;
           [Column]int Year;
           [Column]int Pages;
           private EntitySet<Rating> _Ratings;
           [Association(
             Storage="_Ratings",
             ThisKey="ID",
             OtherKey="ProductID",
             DeleteRule="ONDELETECASCADE"
           )]
           ICollection<Rating> Ratings{ ... }
           private EntitySet<Keyword> _Keywords;
           […]
           ICollection<Keyword> Keywords{ ... }
         }

         [Table(name="Keywords")]
         class Keyword
         {
           [Column(PrimaryKey=true)]int ID;
           [Column]string Keyword;
           [Column(IsForeignKey=true)]int ProductID;
         }

         [Table(name="Ratings")]
         class Rating
         {
           [Column(PrimaryKey=true)]int ID;
           [Column]string Rating;
           [Column(IsForeignKey=true)]int ProductID;
         }

     Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
  7. o So what is Big Data?
     o Sources
     o Applications
     o Technologies
  8. What is Big Data?
     • It is not a technology in itself.
     • It is information about everything that is happening around us, everywhere and every minute.
     • Almost all of us have already contributed to Big Data, with or without our knowledge, and we will continue to do so.
     • Unstructured
  9. The four characteristics
     • Volume
     • Velocity
     • Variety
     • Veracity
  10. Sources
      • Clickstream
      • Tweets
      • Facebook: pictures and comments
      • Sensors
        A Boeing 737 generates 240 TB of data during a single cross-country flight.
  11. Applications
      • Classification/Ontologies
      • Crowdsourcing – CAPTCHA
      • Natural language processing (NLP) – Google Translate
      • Visualization – Facebook map
  12. [Image-only slide; no text in the transcript]
  13. Setting up a Big Data platform
      • A Big Data platform must be equipped with technologies for the following stages of data processing:
      • Acquisition
      • Organization
      • Analysis
  14. Technologies
      • Acquisition
        – NoSQL databases (DynamoDB, Cassandra)
          • Very high speed writes
      • Organization & Analysis
        – Map Reduce (Apache Hadoop)
          • Code moves to the data, not the other way around
          • Map function and Reduce function together perform the desired analysis
  15. NoSQL and why now?
      • RDBMSs must ensure ACID properties
      • The CAP theorem says that no distributed system can guarantee all three of Consistency, Availability and Partition tolerance at the same time
      • NoSQL databases are distributed, and are better options than an RDBMS for applications that can tolerate giving up one of those properties.
  16. Relational Databases
      • Random disk access
      • Data model is totally structured, and predefined
      • Shared Everything architecture
        – Single point of failure
  17. NoSQL categories
      • Graph DB
      • Column families
      • Document
  18. Simple Key-Value stores
      • Distributed Hash Tables
      • Eventual consistency
      • Replication and Data partitioning
      • Example: Amazon Dynamo
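As a rough illustration of how a distributed hash table can combine data partitioning with replication, here is a minimal consistent-hashing sketch. The class name `HashRing`, the node names, and the use of MD5 for token placement are assumptions made for this example; they are not taken from the slides or from Dynamo's actual implementation.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: each key maps to `replicas` distinct nodes."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Place every node on the ring at the position given by its hash.
        self._ring = sorted((self._hash(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread values evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, key):
        # Partitioning: the first node clockwise from the key's hash owns it.
        start = bisect.bisect_right(self._tokens, self._hash(key)) % len(self._ring)
        # Replication: the following nodes on the ring hold the copies.
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["N1", "N2", "N3", "N4"], replicas=2)
owners = ring.nodes_for("user:42")  # owning node first, then one replica
```

Because a key's position depends only on its hash, adding or removing a node moves only the keys adjacent to that node on the ring.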
  19. Column families
      • Distributed Key-Value stores
      • Supports nested columns
      • Example: Cassandra
  20. Apache Cassandra
      • Indexed by a Key
      • Supports columns and super-columns
      • Allows structured/unstructured data
  21. Cassandra
      [Figure: ring of four nodes, N1 to N4]
  22. Cassandra
      [Figure: write at ConsistencyLevel.ONE on the four-node ring, with N1 as coordinator plus a responsible node and a replica node. Labels: 1. ConsistencyLevel.ONE; 2. Write request (to each of the two nodes); 3. Success]
  23. Cassandra
      [Figure: same write at ConsistencyLevel.ONE, with one more label: 4. Success]
  24. Cassandra
      [Figure: write at ConsistencyLevel.TWO on the four-node ring, with N1 as coordinator plus a responsible node and a replica node. Labels: 1. ConsistencyLevel.TWO; 2. Write request (to each of the two nodes); 3 or 4. Success (shown twice)]
  25. Cassandra
      [Figure: same write at ConsistencyLevel.TWO, with one more label: 5. Success]
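One reading of the write diagrams is that the coordinator forwards the write to the responsible node and its replica, and reports success once the number of acknowledgements reaches the requested consistency level. The sketch below simulates that idea in plain Python. `Replica`, `coordinate_write`, and the `ONE`/`TWO` constants are hypothetical names for this illustration, not the Cassandra client API, and the writes are issued sequentially here rather than in parallel.

```python
ONE, TWO = 1, 2  # number of acknowledgements the client asks for


class Replica:
    """Stand-in for a storage node; `up=False` simulates an unreachable node."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False  # no acknowledgement from a node that is down
        self.data[key] = value
        return True


def coordinate_write(replicas, key, value, consistency_level):
    # Forward the write to every target node and count acknowledgements.
    acks = sum(1 for replica in replicas if replica.write(key, value))
    # The client sees success only if enough nodes acknowledged.
    return acks >= consistency_level


n2 = Replica("N2")
n3 = Replica("N3", up=False)
ok_one = coordinate_write([n2, n3], "k", "v", ONE)  # one ack is enough
ok_two = coordinate_write([n2, n3], "k", "v", TWO)  # needs both nodes
```

In this sketch a write that misses its consistency level is still stored on the nodes that accepted it; nothing is rolled back.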
  26. Cassandra
      • Write operation:
        – Commit log
        – Memtable – in-memory storage structure (kind of a hash table)
        – SSTable on disk
        – Compaction
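The four stages of the write path can be sketched with in-memory stand-ins. This is a simplified model under stated assumptions: `TinyStore`, its `memtable_limit` parameter, and the logical clock are invented for the example, and the "SSTables" are immutable tuples rather than files on disk.

```python
class TinyStore:
    """Toy write path: commit log -> memtable -> SSTable -> compaction."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record, replayable after a crash
        self.memtable = {}     # key -> (timestamp, value), held in memory
        self.sstables = []     # immutable, key-sorted tuples of (key, timestamp, value)
        self.memtable_limit = memtable_limit
        self._clock = 0        # logical timestamp, assumed for this sketch

    def write(self, key, value):
        self._clock += 1
        self.commit_log.append((key, self._clock, value))  # 1. durability first
        self.memtable[key] = (self._clock, value)          # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. Persist the memtable as a sorted, immutable SSTable.
        rows = sorted((k, ts, v) for k, (ts, v) in self.memtable.items())
        self.sstables.append(tuple(rows))
        self.memtable = {}
        self.commit_log = []   # flushed entries no longer need replaying

    def compact(self):
        # 4. Merge all SSTables, keeping only the newest version of each key.
        latest = {}
        for table in self.sstables:
            for key, ts, value in table:
                if key not in latest or ts > latest[key][0]:
                    latest[key] = (ts, value)
        merged = sorted((k, ts, v) for k, (ts, v) in latest.items())
        self.sstables = [tuple(merged)]


store = TinyStore(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(key, value)
store.compact()
```

Writes stay fast in this model because each one is an append plus a hash-table update; sorting and merging are deferred to flush and compaction.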
  27. Cassandra
      • Read operation:
        – Coordinator node forwards the request
          • to the node responsible
          • and to replica nodes, based on the consistency level requested
        – Each node
          • looks up the key in the Memtable and all existing SSTables
          • takes the value with the latest timestamp
        – Bloom filters help speed up this operation
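The per-node lookup can be sketched as follows: check the memtable, consult each SSTable only if its Bloom filter says the key might be present, and return the value with the latest timestamp. The class names, filter size, and hash count are assumptions for this illustration, not Cassandra's actual parameters.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable table of key -> (timestamp, value) with its own Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

    def get(self, key):
        return self.rows.get(key)


def read(key, memtable, sstables):
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for table in sstables:
        # Skip tables whose Bloom filter rules the key out.
        if table.bloom.might_contain(key):
            hit = table.get(key)
            if hit is not None:
                candidates.append(hit)
    if not candidates:
        return None
    # The version with the latest timestamp wins.
    return max(candidates, key=lambda versioned: versioned[0])[1]


old = SSTable({"a": (1, "old")})
new = SSTable({"a": (5, "new"), "b": (2, "stale")})
memtable = {"b": (7, "fresh")}
```

With this data, `read("a", memtable, [old, new])` picks the newer SSTable's version of `a`, and `read("b", memtable, [old, new])` prefers the memtable entry because its timestamp is higher.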
  28. Cassandra indexes:
      • Primary index (on the key) is supported by default by the Cassandra engine
      • Secondary indexes have to be built as a new column family, with the column of interest as the key
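A hand-built secondary index of this kind amounts to a second table keyed by the column value, kept in step with the primary one on every write. In the sketch below both "column families" are plain dictionaries; `users`, `users_by_city`, and the `city` column are hypothetical names chosen for the example.

```python
from collections import defaultdict

users = {}                         # primary column family: row key -> columns
users_by_city = defaultdict(set)   # index column family: city -> row keys


def put_user(key, columns):
    # Remove the stale index entry if the indexed column is being overwritten.
    old = users.get(key)
    if old is not None and "city" in old:
        users_by_city[old["city"]].discard(key)
    users[key] = dict(columns)
    # Maintain the index alongside the primary write.
    if "city" in columns:
        users_by_city[columns["city"]].add(key)


def find_by_city(city):
    # A lookup by column value becomes a key lookup in the index family.
    return sorted(users_by_city.get(city, ()))


put_user("u1", {"name": "Ann", "city": "Austin"})
put_user("u2", {"city": "Boston"})
put_user("u3", {"city": "Austin"})
```

The cost of this design is that every write touches two structures, and the application has to keep them consistent itself.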
  29. Document DBs
      • Similar to Key-Value stores, but Values are often documents (JSON, ION, …)
      • Documents are versioned
      • Example: DynamoDB
  30. Map Reduce
      • Introduced by Google
      • List processing system
      • Scales to clusters with thousands of nodes
      • and to petabytes or exabytes of data
      • Code is taken to the data, not the other way around
      • Data must be disjoint
      • Maps the functions to the nodes where the data resides
      • and Reduces the results from all nodes to build the final result
      • Example: Hadoop
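The map and reduce steps can be illustrated with a single-process word count. This is only a sketch of the programming model: each string in `chunks` stands in for a disjoint block of data on a different node, and the function names are made up for the example rather than taken from Hadoop.

```python
from collections import defaultdict


def map_words(chunk):
    # Map: runs where the chunk lives and emits (key, value) pairs.
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    # Group all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    # Reduce: combine the values for one key into the final result.
    return word, sum(counts)


def map_reduce(chunks):
    mapped = [pair for chunk in chunks for pair in map_words(chunk)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(mapped).items())


result = map_reduce(["big data big", "data stores"])
```

Because the chunks are disjoint, the map calls are independent of each other, which is what lets a real framework run them in parallel across nodes.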
  31. Techniques & algorithms
      • Vector Clocks
      • Hinted handoff
      • Read repair
      • Anti-entropy repair
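Of the techniques listed, vector clocks are the easiest to show in a few lines. The sketch below represents a clock as a dictionary from node name to counter and classifies two versions as ordered or concurrent. The function names and node names are assumptions for illustration, not the API of any particular store.

```python
def increment(clock, node):
    # A node bumps its own counter each time it writes a new version.
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    # True if clock `a` has seen every event recorded in clock `b`.
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "after"
    if b_ge:
        return "before"
    return "concurrent"  # neither version supersedes the other: a conflict


def merge(a, b):
    # After reconciling a conflict, take the per-node maximum.
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


v1 = increment({}, "N1")
v2 = increment(v1, "N2")   # written via N2, based on v1
v3 = increment(v1, "N3")   # written via N3, also based on v1
```

Here `v2` and `v3` both descend from `v1` but not from each other, so `compare(v2, v3)` reports them as concurrent and the application or store has to reconcile them.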
  32. Big Data talent
      • Deep analytical
        – Mathematicians, operations research analysts, statisticians, ...
      • Big data savvy
        – Business and functional managers; budget, credit and financial analysts
      • Supporting Technology
        – DBAs, system & network administrators, and programmers
  33. The DBA’s role here?
      • Tremendous opportunity for DBAs
      • Like in the early ’90s, when businesses migrated from mainframes to Oracle/SQL Server/DB2
      • Where?
        – Data modeling: with vast amounts of data, re-designing a DHT is many times harder than re-designing an RDBMS, since data migration is painful
