Big Data: NoSQL & the DBA
  • Industries: Healthcare, Telecommunications, Retail, Manufacturing, Public sector

Transcript

  • 1. Big Data: NoSQL & the DBA - Aswani Vonteddu
  • 2. The evolution of data stores
      • Data modeling
      • Data from the Developer’s standpoint
      • Data from the DBA’s standpoint
      • Impedance mismatch and the rise of ORM
  • 3. Hierarchical object graph model
      Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
  • 4. Normalized for tables in RDBMS
      Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
  • 5. Data - Summary
      • In order to use an RDBMS:
          - The designer must model the data into tables
          - The developer must normalize/de-normalize
          - The DBA has to speed up queries
  • 6. Impedance mismatch and the rise of ORMs (like Hibernate)

        [Table(name="Products")]
        class Product
        {
          [Column(PrimaryKey=true)]int ID;
          [Column]string Title;
          [Column]string Author;
          [Column]int Year;
          [Column]int Pages;

          private EntitySet<Rating> _Ratings;
          [Association(Storage="_Ratings", ThisKey="ID", OtherKey="ProductID", DeleteRule="ONDELETECASCADE")]
          ICollection<Rating> Ratings{ ... }

          private EntitySet<Keyword> _Keywords;
          […]
          ICollection<Keyword> Keywords{ ... }
        }

        [Table(name="Keywords")]
        class Keyword
        {
          [Column(PrimaryKey=true)]int ID;
          [Column]string Keyword;
          [Column(IsForeignKey=true)]int ProductID;
        }

        [Table(name="Ratings")]
        class Rating
        {
          [Column(PrimaryKey=true)]int ID;
          [Column]string Rating;
          [Column(IsForeignKey=true)]int ProductID;
        }

      Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
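The mismatch on this slide can be made concrete with a small sketch. It assumes a nested product record (the natural shape for a developer) and flattens it into three flat "tables" linked by a `ProductID` foreign key, mirroring the Products, Keywords and Ratings classes above. The function name `normalize` and the sample data are illustrative assumptions, not part of the original deck.

```python
from itertools import count


def normalize(products):
    """Flatten nested product dicts into three flat row lists joined by ProductID."""
    product_rows, keyword_rows, rating_rows = [], [], []
    keyword_ids, rating_ids = count(1), count(1)
    for product_id, p in enumerate(products, start=1):
        product_rows.append({
            "ID": product_id,
            "Title": p["Title"],
            "Author": p["Author"],
            "Year": p["Year"],
            "Pages": p["Pages"],
        })
        # Child collections become separate rows that point back at the parent.
        for kw in p["Keywords"]:
            keyword_rows.append(
                {"ID": next(keyword_ids), "Keyword": kw, "ProductID": product_id}
            )
        for rating in p["Ratings"]:
            rating_rows.append(
                {"ID": next(rating_ids), "Rating": rating, "ProductID": product_id}
            )
    return product_rows, keyword_rows, rating_rows


# Hypothetical sample data, for illustration only.
catalog = [{
    "Title": "Example Book", "Author": "A. Author", "Year": 2012, "Pages": 100,
    "Keywords": ["data", "nosql"], "Ratings": ["****"],
}]
products, keywords, ratings = normalize(catalog)
```

Reading the product back requires joining the three row lists on `ProductID`, which is the work an ORM hides from the developer.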
  • 7. Agenda
      • So what is Big Data?
      • Sources
      • Applications
      • Technologies
  • 8. What is Big Data?
      • It is not a technology in itself.
      • It is information about everything that is happening around us, everywhere and every minute.
      • Almost all of us have already contributed to Big Data, with or without our knowledge, and we will continue to do so.
      • Unstructured
  • 9. The four characteristics
      • Volume
      • Velocity
      • Variety
      • Veracity
  • 10. Sources
      • Clickstream
      • Tweets
      • Facebook: pictures and comments
      • Sensors
      A Boeing 737 generates 240 TB of data during a single cross-country flight.
  • 11. Applications
      • Classification/Ontologies
      • Crowdsourcing - CAPTCHA
      • Natural language processing (NLP) - Google Translate
      • Visualization - Facebook map
  • 12. (No text on this slide)
  • 13. Setting up a Big Data platform
      • A Big Data platform must be equipped with technologies for the following stages of data processing:
          - Acquisition
          - Organization
          - Analysis
  • 14. Technologies
      • Acquisition
          - NoSQL databases (DynamoDB, Cassandra)
              • Very high speed writes
      • Organization & Analysis
          - Map Reduce (Apache Hadoop)
              • Code moves to the data, not the other way around
              • The Map function and the Reduce function together perform the desired analysis
  • 15. NoSQL and why now?
      • RDBMSs must ensure ACID properties
      • The CAP theorem says that no distributed system can guarantee all three of Consistency, Availability and Partition tolerance at the same time
      • NoSQL databases are distributed, and are better options than an RDBMS for applications that can tolerate the lack of one of those properties.
  • 16. Relational Databases
      • Random disk access
      • Data model is totally structured, and predefined
      • Shared Everything architecture
          - Single point of failure
  • 17. NoSQL categories
      • Graph DB
      • Column families
      • Document
  • 18. Simple Key-Value stores
      • Distributed Hash Tables
      • Eventual consistency
      • Replication and Data partitioning
      • Example: Amazon Dynamo
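A common way to combine partitioning and replication in a distributed hash table is consistent hashing, sketched below. This is a minimal illustration and not a description of how Dynamo is implemented: the `HashRing` class, the use of SHA-256, and the node names are assumptions, and real systems typically add virtual nodes for better balance.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: each key lives on a node and its successors."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Place every node on the ring at the position given by its hash.
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, key):
        """Return the responsible node first, then its clockwise successors."""
        start = bisect.bisect_right(self._positions, self._hash(key)) % len(self._ring)
        wanted = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(wanted)]


ring = HashRing(["N1", "N2", "N3", "N4"], replicas=2)
owners = ring.nodes_for("user:42")  # two distinct nodes, stable across calls
```

Because a key's position depends only on its hash, adding or removing one node moves only the keys adjacent to it on the ring rather than reshuffling everything.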
  • 19. Column families
      • Distributed Key-Value stores
      • Support nested columns
      • Example: Cassandra
  • 20. Apache Cassandra
      • Indexed by a Key
      • Supports columns and super-columns
      • Allows structured/unstructured data
  • 21. Cassandra [diagram: a ring of four nodes, N1 to N4]
  • 22. Cassandra [diagram: the ring with N1 as Coordinator, a Responsible node and a Replica node]
      1. ConsistencyLevel.ONE
      2. Write request (shown twice in the diagram)
      3. Success
  • 23. Cassandra [same diagram, one step further]
      1. ConsistencyLevel.ONE
      2. Write request (shown twice in the diagram)
      3. Success
      4. Success
  • 24. Cassandra [same diagram, with ConsistencyLevel.TWO]
      1. ConsistencyLevel.TWO
      2. Write request (shown twice in the diagram)
      3 or 4. Success (shown twice in the diagram)
  • 25. Cassandra [same diagram, one step further]
      1. ConsistencyLevel.TWO
      2. Write request (shown twice in the diagram)
      3 or 4. Success (shown twice in the diagram)
      5. Success
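One reading of the diagrams in slides 22 to 25 is that the coordinator forwards the write to the nodes holding the key and reports success once as many of them have acknowledged as the consistency level demands. The sketch below simulates that reading only. `Node`, `coordinate_write`, the constants `ONE` and `TWO`, and the choice of N2 and N4 as the two target nodes are illustrative assumptions, and the calls are synchronous, unlike a real cluster.

```python
class Node:
    """A toy storage node that can be marked as down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        """Store the value and acknowledge, or fail if the node is down."""
        if not self.up:
            return False
        self.data[key] = value
        return True


def coordinate_write(replicas, key, value, required_acks):
    """Forward the write to every replica; succeed if enough of them acknowledge."""
    acks = sum(1 for node in replicas if node.write(key, value))
    return acks >= required_acks


ONE, TWO = 1, 2  # number of acknowledgements each level waits for

# Hypothetical scenario: one of the two target nodes is unavailable.
replicas = [Node("N2"), Node("N4", up=False)]
ok_one = coordinate_write(replicas, "k", "v", ONE)  # one ack is enough
ok_two = coordinate_write(replicas, "k", "v", TWO)  # needs both nodes
```

With one node down, the write at level ONE succeeds while the same write at level TWO fails, which is the trade-off between availability and consistency that the level lets the client choose.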
  • 26. Cassandra
      • Write operation:
          - Commit log
          - Memtable: in-memory storage structure (kind of a hash table)
          - SSTable on disk
          - Compaction
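The write path on this slide can be sketched as a tiny log-structured store. This is a simplified model under stated assumptions, not Cassandra's actual code: `TinyStore`, its `memtable_limit` parameter, and the integer logical clock are invented for illustration, and "disk" is just a list of immutable tuples.

```python
class TinyStore:
    """Toy write path: commit log, then memtable, then flushed SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # key -> (timestamp, value), held in memory
        self.sstables = []     # immutable, sorted tables standing in for disk files
        self.memtable_limit = memtable_limit
        self._clock = 0        # logical timestamp, assumed for the sketch

    def write(self, key, value):
        self._clock += 1
        self.commit_log.append((self._clock, key, value))  # durability first
        self.memtable[key] = (self._clock, value)          # then the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a sorted, immutable SSTable and reset it."""
        table = sorted((k, ts, v) for k, (ts, v) in self.memtable.items())
        self.sstables.append(tuple(table))
        self.memtable = {}

    def compact(self):
        """Merge all SSTables, keeping only the newest version of each key."""
        latest = {}
        for table in self.sstables:
            for key, ts, value in table:
                if key not in latest or ts > latest[key][0]:
                    latest[key] = (ts, value)
        merged = sorted((k, ts, v) for k, (ts, v) in latest.items())
        self.sstables = [tuple(merged)]


store = TinyStore(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(k, v)
store.compact()
```

Writes never modify an existing SSTable, so every write is sequential; compaction later removes the superseded version of key `a`.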
  • 27. Cassandra
      • Read operation:
          - The coordinator node forwards the request
              • to the responsible node
              • and to replica nodes, based on the consistency level requested
          - Each node
              • looks up the key in the Memtable and all existing SSTables
              • takes the value with the latest timestamp
          - Bloom filters help speed up this operation
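The per-node part of the read can be sketched as follows: consult the memtable and every SSTable, skip tables whose Bloom filter says the key is definitely absent, and return the value with the latest timestamp. The `BloomFilter` and `SSTable` classes, the filter size, and the sample data are assumptions made for this illustration.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable table with a Bloom filter built over its keys."""

    def __init__(self, rows):
        self.rows = dict(rows)  # key -> (timestamp, value)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(key, memtable, sstables):
    """Collect every version of the key and return the newest value, or None."""
    versions = []
    if key in memtable:
        versions.append(memtable[key])
    for table in sstables:
        # The filter lets us skip tables that cannot contain the key.
        if table.bloom.might_contain(key) and key in table.rows:
            versions.append(table.rows[key])
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


# Hypothetical data: three versions of "a" spread over memtable and SSTables.
memtable = {"a": (5, "newest")}
sstables = [SSTable({"a": (1, "old"), "b": (2, "bee")}), SSTable({"a": (3, "newer")})]
value_a = read("a", memtable, sstables)
value_b = read("b", memtable, sstables)
value_missing = read("zzz", memtable, sstables)
```

A false positive from the filter only costs an unnecessary lookup; the result stays correct because the table itself is still checked.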
  • 28. Cassandra
      • Indexes:
          - The primary index (on the key) is supported by default by the Cassandra engine
          - Secondary indexes have to be built as a new column family, with the column of interest as the key
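A hand-built secondary index of the kind described here can be sketched with two dictionaries standing in for column families. The names `users`, `users_by_city`, `put_user` and `find_by_city`, and the sample rows, are hypothetical. The key point is that the application must update the index on every write.

```python
from collections import defaultdict

users = {}                        # main column family: user id -> columns
users_by_city = defaultdict(set)  # index column family: city -> set of user ids


def put_user(user_id, columns):
    """Write a row and keep the city index in step with it."""
    previous = users.get(user_id)
    if previous is not None:
        # Remove the stale index entry before adding the new one.
        users_by_city[previous["city"]].discard(user_id)
    users[user_id] = columns
    users_by_city[columns["city"]].add(user_id)


def find_by_city(city):
    """Look up row keys through the index, then fetch the rows by primary key."""
    return [users[user_id] for user_id in sorted(users_by_city.get(city, ()))]


# Hypothetical rows; the third call moves u1 from Austin to Boston.
put_user("u1", {"name": "Ann", "city": "Austin"})
put_user("u2", {"name": "Bob", "city": "Boston"})
put_user("u1", {"name": "Ann", "city": "Boston"})
```

In a distributed store the two writes are not atomic, so the index can briefly disagree with the data; that is part of the cost of maintaining it by hand.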
  • 29. Document DBs
      • Similar to Key-Value stores, but Values are often documents (JSON, ION, …)
      • Documents are versioned
      • Example: DynamoDB
  • 30. Map Reduce
      • Introduced by Google
      • List processing system
      • Scales to clusters with thousands of nodes
      • and to petabytes or exabytes of data
      • Code is taken to the data, not the other way around
      • Data must be disjoint
      • Maps the functions to the nodes where the data resides
      • and Reduces the results from all nodes to build the final result
      • Example: Hadoop
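The map and reduce steps can be illustrated with the classic word count, run here in a single process. This is only a sketch of the programming model: the disjoint `partitions` stand in for data blocks on different nodes, and the function names are assumptions rather than Hadoop's API.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one data partition."""
    for line in partition:
        for word in line.lower().split():
            yield word, 1


def shuffle(mapped_pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(partitions):
    mapped = (pair for part in partitions for pair in map_phase(part))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Two disjoint partitions, as if stored on two different nodes.
partitions = [["big data big"], ["data stores"]]
counts = word_count(partitions)
```

Each call to `map_phase` only needs its own partition, which is why the map function can be shipped to wherever that partition is stored.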
  • 31. Techniques & algorithms
      • Vector Clocks
      • Hinted handoff
      • Read repair
      • Anti-entropy repair
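Of the techniques listed, vector clocks are the easiest to show in a few lines. The sketch below represents a clock as a dictionary from node name to counter and decides whether one version happened before another or whether the two are concurrent and need reconciliation. The function names and node names are illustrative assumptions.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Combine two clocks by taking the larger counter for every node."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def compare(a, b):
    """Classify a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = increment({}, "N1")   # first write, handled by N1
v2 = increment(v1, "N2")   # update of v1 handled by N2
v3 = increment(v1, "N3")   # independent update of v1 handled by N3
relation_12 = compare(v1, v2)
relation_23 = compare(v2, v3)
reconciled = merge(v2, v3)
```

Here `v2` and `v3` both descend from `v1` but neither descends from the other, so they are concurrent and the store has to keep both versions or reconcile them.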
  • 32. Big Data talent
      • Deep analytical
          - Mathematicians, operations research analysts, statisticians, ..
      • Big data savvy
          - Business and functional managers; budget, credit and financial analysts
      • Supporting Technology
          - DBAs, System & Network administrators, and Programmers
  • 33. The DBA’s role here?
      • Tremendous opportunity for DBAs
      • Like in the early 90’s, when businesses migrated from mainframes to Oracle/SQL Server/DB2
      • Where?
          - Data modeling: with vast amounts of data, re-designing DHTs is many times harder than re-designing an RDBMS, since data migration is painful
  • 34. References
      [1] McKinsey, Big data: The next frontier for innovation, competition and productivity
      [2] IDC, The rise of Big Data: Managing, Storing and gaining value from endless information
      • Others
          - http://slidesha.re/LF8umk
          - http://slidesha.re/LF8vGY