Introduction to Big Data and NoSQL
NJ SQL Server User Group, May 15, 2012
Melissa Demsak, SQL Architect, Realogy – www.sqldiva.com
Don Demsak, Advisory Solutions Architect, EMC Consulting – www.donxml.com
Meet Melissa
• SQL Architect – Realogy
• SqlDiva, Twitter: sqldiva
• Email – melissa@sqldiva.com
Meet Don
• Advisory Solutions Architect – EMC Consulting
    o Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email – don@donxml.com
• SlideShare – http://www.slideshare.net/dondemsak
The era of Big Data
How did we get here?
• Expensive
    o Processors
    o Disk space
    o Memory
    o Operating systems
    o Software
    o Programmers
• Culture of Limitations
    o Limit CPU cycles
    o Limit disk space
    o Limit memory
    o Limited OS development
    o Limited software
    o Programmers: one language, one persistence store
Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
    o Atomicity
    o Consistency
    o Isolation
    o Durability
How we scale RDBMS implementations
1st Step – Build a relational database
(diagram: a single relational database)
2nd Step – Table Partitioning
(diagram: one relational database split into partitions p1, p2, p3)
3rd Step – Database Partitioning
(diagram: each customer gets its own stack of Browser, Web Tier, B/L Tier, and Relational Database)
4th Step – Move to the cloud?
(diagram: each customer's Browser, Web Tier, and B/L Tier backed by a SQL Azure Federation)
Problems created by too much data
• Where to store
• How to store
• How to process
• Organization, searching, and metadata
• How to manage access
• How to copy, move, and backup
• Lifecycle
Polyglot Programmer
Polyglot Persistence (how to store)
• Atlanta 2009 - No:sql(east) conference
    select fun, profit from real_world where relational=false
• Billed as "conference of no-rel datastores"
(loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) does not guarantee ACID
Types Of NoSQL Data Stores
5 Groups of Data Models
• Relational
• Document
• Key Value
• Graph
• Column Family
Document?
• Think of a web page...
    o Relational model requires column/tag
    o Lots of empty columns
    o Wasted space and processing time
• Document model just stores the pages as is
    o Saves on space
    o Very flexible
• Document Databases
    o Apache Jackrabbit
    o CouchDB
    o MongoDB
    o SimpleDB
    o XML Databases
        • MarkLogic Server
        • eXist
Key/Value Stores
• Simple index on key
• Value can be any serialized form of data
• Lots of different implementations
    o Eventually consistent: "If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent"
    o Cached in RAM
    o Cached on disk
    o Distributed hash tables
• Examples
    o Azure AppFabric Cache
    o memcached
    o VMware vFabric GemFire
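The key/value idea is small enough to sketch in a few lines of JavaScript. This is an illustrative in-process toy (the class and method names are made up, not any product's API): the only index is the key, and the value is stored as an opaque serialized blob the store never inspects.

```javascript
// Minimal in-process key/value store sketch (illustrative, not a real product).
class KeyValueStore {
  constructor() {
    this.data = new Map();                      // key -> serialized value
  }
  put(key, value) {
    this.data.set(key, JSON.stringify(value));  // serialize on the way in
  }
  get(key) {
    const raw = this.data.get(key);
    return raw === undefined ? undefined : JSON.parse(raw);
  }
}

const store = new KeyValueStore();
// The value can be any shape -- the store only indexes the key.
store.put("user:42", { name: "Melissa", roles: ["SQL Architect"] });
store.put("page:/home", 1897);

console.log(store.get("user:42").name);   // Melissa
```

Real implementations differ mainly in where this map lives (RAM, disk, a distributed hash table) and how replicas converge, not in this basic contract.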
Graph?
• Graph consists of
    o Nodes ('stations' of the graph)
    o Edges (lines between them)
• Graph Stores
    o AllegroGraph
    o Core Data
    o Neo4j
    o DEX
    o FlockDB
        • Created by the Twitter folks
        • Nodes = users
        • Edges = nature of the relationship between nodes
    o Microsoft Trinity (research project)
        • http://research.microsoft.com/en-us/projects/trinity/
Column Family?
• Lots of variants
    o Object Stores
        • Db4o
        • GemStone/S
        • InterSystems Caché
        • Objectivity/DB
        • ZODB
    o Tabular
        • BigTable
        • Mnesia
        • HBase
        • Hypertable
        • Azure Table Storage
    o Column-oriented
        • Greenplum
        • Microsoft SQL Server 2012
Okay, Got It. Now Let's Compare Some Real World Scenarios
You Need Constant Consistency
• You're dealing with financial transactions
• You're dealing with medical records
• You're dealing with bonded goods
• Best you use a RDBMS
You Need Horizontal Scalability
• You're working across defined timezones
• You're aggregating large quantities of data
• Maintaining a chat server (Facebook chat)
• Use Column Family Storage
Frequently Written, Rarely Read
• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it's only read when the report is run
• Use Key-Value Storage
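The counter scenario can be sketched as follows, with a plain in-memory map standing in for the key-value store; the `incr` and `report` helpers are hypothetical names for illustration, though `incr` mirrors the atomic increment operation stores like memcached expose.

```javascript
// Write-heavy counter sketch: every page view is one cheap key/value write;
// the values are only read when a report is run.
const counters = new Map();   // stands in for the key/value store

function incr(key) {          // ctr++ on every page view
  counters.set(key, (counters.get(key) || 0) + 1);
}

function report() {           // the rare read path
  return Object.fromEntries(counters);
}

incr("page:/home");
incr("page:/home");
incr("page:/about");
console.log(report());        // { 'page:/home': 2, 'page:/about': 1 }
```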
Here Today, Gone Tomorrow
• Transient data like...
    o Web sessions
    o Locks
    o Short-term stats
    o Shopping cart contents
• Use Key-Value Storage
Where to store
• RAM
    o Fast
    o Expensive
    o Volatile
• Local Disk
    o SSD – super fast
    o Fast spinning disks (7200+)
    o High bandwidth possible
    o Persistent
• SAN
    o Storage Area Network
    o Fully managed
    o Expensive
• Parallel File System
    o HDFS (Hadoop)
    o Auto-replicated for parallel decentralized I/O
• Cloud
    o Amazon
    o Box.Net
    o DropBox
Big Data
Big Data Definition
• Volume – beyond what traditional environments can handle
• Velocity – need decisions fast
• Variety – many formats
Additional Big Data Concepts
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for relational databases
• Often utilizes MapReduce frameworks
Big Data Examples
• Cassandra
• Hadoop
• Greenplum
• Azure Storage
• EMC Atmos
• Amazon S3
• SQL Azure (with Federations support)?
Real World Example
• Twitter
    o The challenges
        • Needs to store many graphs
            - Who you are following
            - Who's following you
            - Who you receive phone notifications from, etc.
        • To deliver a tweet requires rapid paging of followers
        • Heavy write load as followers are added and removed
        • Set arithmetic for @mentions (intersection of users)
What did they try?
• Started with relational databases
• Tried key-value storage of denormalized lists
• Did it work?
    o Nope – either good at handling the write load, or at paging large amounts of data, but not both
What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to arrive out of order, or be processed more than once
• Failures should result in redundant work, not lost work!
The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
    o A list of all the edges in a graph, keyed by the edge value: a set of the node end points
• Optimized for fast read and write
• Optimized for page-able set arithmetic
How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
    o All queries can be answered by a single partition
• Write operations are idempotent
    o Can be applied multiple times without changing the result
• And commutative
    o Changing the order of operands doesn't change the result
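The idempotent-and-commutative property of these edge writes can be demonstrated with a toy adjacency list (illustrative code, not FlockDB's actual implementation). Modeling "add edge" as a set insertion makes duplicate delivery and reordering harmless, which is exactly what tolerates the out-of-order, at-least-once writes described above.

```javascript
// FlockDB-style edge writes modeled as set operations (toy sketch).
// A Set makes "add edge" idempotent (applying it twice changes nothing)
// and commutative (the order of distinct adds doesn't matter).
function addEdge(adjacency, from, to) {
  if (!adjacency.has(from)) adjacency.set(from, new Set());
  adjacency.get(from).add(to);        // re-adding an existing edge is a no-op
}

const a = new Map();
addEdge(a, "alice", "bob");
addEdge(a, "alice", "bob");           // duplicate delivery: same result
addEdge(a, "alice", "carol");

const b = new Map();
addEdge(b, "alice", "carol");         // same writes, different order
addEdge(b, "alice", "bob");

// Both partitions converge to the same follower set.
console.log([...a.get("alice")].sort());  // [ 'bob', 'carol' ]
console.log([...b.get("alice")].sort());  // [ 'bob', 'carol' ]
```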
How to Process Big Data
ACID
• Atomicity
    o All or nothing
• Consistency
    o Valid according to all defined rules
• Isolation
    o No transaction should be able to interfere with another transaction
• Durability
    o Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
BASE
• Basically Available
    o High availability but not always consistent
• Soft state
    o Background cleanup mechanism
• Eventual consistency
    o Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent
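A minimal sketch of eventual consistency, assuming a simple last-writer-wins merge between two replicas. The replica and merge helpers are invented for illustration (real systems use vector clocks or similar); the point is that once updates stop and replicas exchange state, they converge.

```javascript
// Eventual consistency sketch: each replica keeps (value, timestamp) per key
// and merges by last-writer-wins (illustrative only).
function makeReplica() { return new Map(); }       // key -> { value, ts }

function write(replica, key, value, ts) {
  replica.set(key, { value, ts });
}

function merge(target, source) {                   // anti-entropy exchange
  for (const [key, rec] of source) {
    const mine = target.get(key);
    if (!mine || rec.ts > mine.ts) target.set(key, rec);  // newer write wins
  }
}

const r1 = makeReplica(), r2 = makeReplica();
write(r1, "profile:don", "v1", 100);
write(r2, "profile:don", "v2", 200);   // a later, conflicting write elsewhere

// Gossip in both directions; with no further writes, replicas converge.
merge(r1, r2);
merge(r2, r1);
console.log(r1.get("profile:don").value, r2.get("profile:don").value); // v2 v2
```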
Traditional (relational) Approach
(diagram: Transactional Data Store feeding a Data Warehouse via Extract, Transform, Load)
Big Data Approach
• MapReduce Pattern/Framework
    o An input reader
    o Map function – to transform to a common shape (format)
    o A partition function
    o A compare function
    o Reduce function
    o An output writer
MongoDB Example

    // map function
    m = function(){
        this.tags.forEach(
            function(z){
                emit( z , { count : 1 } );
            }
        );
    };

    // reduce function
    r = function( key , values ){
        var total = 0;
        for ( var i=0; i<values.length; i++ )
            total += values[i].count;
        return { count : total };
    };

    // execute
    res = db.things.mapReduce( m, r, { out : "myoutput" } );
What is Hadoop?
• A scalable fault-tolerant grid operating system for data storage and processing
• Its scalability comes from the marriage of:
    o HDFS: self-healing high-bandwidth clustered storage
    o MapReduce: fault-tolerant distributed processing
• Operates on unstructured and structured data
• A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
• Open source under the friendly Apache License
• http://wiki.apache.org/hadoop/
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
Hadoop Core Components
• Store: HDFS – self-healing, high-bandwidth clustered storage
• Process: Map/Reduce – fault-tolerant distributed processing
HDFS: Hadoop Distributed File System
• Block size = 64 MB
• Replication factor = 3
• Cost/GB is a few ¢/month vs $/month
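A quick back-of-the-envelope sketch of what those two numbers imply for a single file (illustrative arithmetic only, ignoring metadata overhead): the file is split into 64 MB blocks, and each block is stored on three different machines, so raw capacity used is three times the file size.

```javascript
// What "block size = 64 MB, replication factor = 3" means for one file.
const BLOCK_SIZE_MB = 64;
const REPLICATION = 3;

function hdfsFootprint(fileSizeMB) {
  const blocks = Math.ceil(fileSizeMB / BLOCK_SIZE_MB);  // last block may be partial
  return {
    blocks,
    replicas: blocks * REPLICATION,    // block copies spread across the cluster
    rawMB: fileSizeMB * REPLICATION,   // total disk consumed
  };
}

const f = hdfsFootprint(1000);  // a ~1 GB file
console.log(f);                 // { blocks: 16, replicas: 48, rawMB: 3000 }
```

Losing any single machine still leaves two copies of every block, which is what lets the cluster re-replicate and "heal itself".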
Hadoop Map/Reduce
Hadoop Job Architecture
(diagram: Clients submit jobs to the Resource Manager; Node Managers host an App Master and Containers; arrows show MapReduce status, job submission, node status, and resource requests)
Microsoft embraces Hadoop
• Good for enterprises & developers
• Great for end users!
HADOOP [Azure and Enterprise]
(diagram: programming layers – Java OM, Streaming OM, HiveQL, PigLatin, .NET/C#/F#, (T)SQL – over "an ocean of data": NoSQL [unstructured, semi-structured, structured] with ETL into HDFS, "a seamless ocean of information processing and analytics", fed by EIS/RDBMS, file system [RSS], OData, and Azure Storage)
Hive Plug-in for Excel
THANK YOU