SQL or NoSQL: How to choose
Venu Anuganti, Jan 2011
http://venublog.com/
Who am I
  • Data Architect, Database Kernel / Internals Engineer
  • Implement and scale SQL, NoSQL, analytics and data warehouse solutions
  • Large-scale data handling for games, social networking, SaaS, click tracking, recommendation, advertising, mobile and SEM marketing
  • Blog: http://venublog.com/
Agenda
  • Buzz around SQL and NoSQL
  • How to design and implement a data flow architecture
  • How to choose a data store solution
  • Performance, scalability and availability
  • SQL vs NoSQL
  • Where SQL and NoSQL fit
  • Types of SQL and NoSQL data stores
  • Evaluation & decision making
Buzzzzzzz
  • Why is everyone talking about NoSQL?
  • What is happening to SQL?
  • Does that mean the end of SQL? Is the NoSQL era beginning?
  • Why does nobody talk about large SQL implementations?
Evolution of Data Architecture
Data Architecture
  • No standard solution fits all
  • The business and its data define the architecture
  • It’s all about solving problems
  • You need to find the right tool for the job
Traditional Architecture
  • The relational database is everything
  • SQL
  • Embedded
  • Client-server based
  • Data stack: web, CDN, load balancers, application, database and storage
Traditional Scalability …
  • Scale-up
    • Memory and hardware have limits
  • Scale-out
    • Scaling reads
      • Cache is king: query cache, Memcache, OLAP
      • Pre-fetching
      • Replication
    • Scaling writes
      • Redundant disk arrays, RAID
      • Sharding
Common Problems…
  • The relational model is heavy: parsing, locking, logging, buffer pool and threads
  • Not every workload fits within a single SMP node
  • Sharding does not solve every problem
    • Cross-shard queries and joins between shards
    • Updates across multiple shards within one transaction
    • Shard failure
    • Online schema changes without taking the shard offline
    • Adding or replacing shards in-line
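The sharding problems listed above can be illustrated with a small sketch. It assumes plain modulo routing on a numeric user id; the shard names and helper functions are hypothetical and not taken from any particular product. It shows a multi-key query fanning out across shards, and how many keys change owner when one shard is added under modulo routing.

```python
# Hypothetical shard names; a real deployment would map these to servers.
SHARDS_4 = ["shard0", "shard1", "shard2", "shard3"]
SHARDS_5 = SHARDS_4 + ["shard4"]

def shard_for(user_id, shards):
    """Route a key to a shard with plain modulo hashing (an assumption)."""
    return shards[user_id % len(shards)]

def shards_for_query(user_ids, shards):
    """A query over several keys must visit every shard that owns one of them."""
    return sorted({shard_for(u, shards) for u in user_ids})

def keys_moved(keys, old_shards, new_shards):
    """Count keys whose owning shard changes when the shard list changes."""
    return sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )

# Keys 1 and 5 share a shard, key 2 lives elsewhere: two shards are touched.
touched = shards_for_query([1, 2, 5], SHARDS_4)
# Cost of growing from four to five shards with modulo routing.
moved = keys_moved(range(1000), SHARDS_4, SHARDS_5)
```

Under these assumptions 800 of the 1000 keys change owner when the fifth shard is added, which is one reason adding or replacing shards in-line is hard without a smarter placement scheme.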
Evolution
  • Data is growing rapidly, day by day
  • Growth is driven by the needs of large web applications
  • Hardware is not advancing as fast as data is growing
  • Things are moving to the cloud and becoming API-driven
  • Social networking and the cloud make it hard to scale the traditional way
Data is the Business
  • Many new business models are data-centric
    • Real-time and interactive
    • Big data
      • Millions of users, clients, customers, applications, …
      • Terabytes to petabytes of data, day to day
  • A business can only grow if it makes proper use of its data
    • Statistics, reporting
    • Real-time
    • Re-targeting
    • Recommendation
  • Examples of data-driven companies
    • Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, Apple App Store, Foursquare, anything API-driven, almost all newly emerging companies
Solution that works
Data architecture is not just choosing the right data store; it should be a complete solution with:
  • Low cost (preferably open source, no hidden costs)
  • Simple to implement
  • High performance
  • Highly available
  • Highly scalable
  • Highly reliable
  • Highly recoverable
  • Rapid development
  • Zero learning curve
  • Ability to make online changes (schema, node, or automatic)
  • Less operational maintenance
  • No day-to-day firefighting
NoSQL Solution Emerges
  • Many companies emerged to solve data-centric problems
  • BigTable: Google began building a massively distributed, scalable system, and many followed; it was the first step into the world of massive data scale
  • Many companies followed by building scale-out architectures on commodity hardware
  • ACID was deemed bad for scaling, so relaxed consistency models came into the picture
  • Google BigTable and Amazon Dynamo are the notable examples
Relaxed Consistency
  • Consistency is a major bottleneck for scalability
  • People started implementing eventual consistency
  • CAP theorem (Consistency, Availability and Partition tolerance)
    • Consistency: “Is the data I’m looking at now the same if I look at it somewhere else?”
    • Availability: “What happens if my database goes down?”
    • Partition tolerance: “What if my data is on a different node?”
  • SQL – CA
  • NoSQL – AP
  • http://venublog.com/2010/04/07/cap-theorem-eventual-consistency-nosql/
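To make eventual consistency concrete, here is a minimal sketch using Dynamo-style quorum parameters: N replicas, W replicas acknowledging a write, and R replicas consulted on a read. The Replica class and the write/read helpers are hypothetical, and which replicas respond is fixed by position (an assumption) so that the stale-read case is reproducible. When R + W > N, the read set always overlaps the write set and the newest version is seen.

```python
class Replica:
    """One copy of a single value, tagged with a version number."""
    def __init__(self):
        self.version = 0
        self.value = None

def write(replicas, value, version, w):
    """Apply the write to the first w replicas only; the others lag behind."""
    for rep in replicas[:w]:
        rep.version, rep.value = version, value

def read(replicas, r):
    """Ask the last r replicas and return the value with the newest version."""
    newest = max(replicas[-r:], key=lambda rep: rep.version)
    return newest.value

replicas = [Replica() for _ in range(3)]   # N = 3
write(replicas, "v1", version=1, w=2)      # W = 2

stale = read(replicas, r=1)   # R + W = 3, not greater than N: may miss the write
fresh = read(replicas, r=2)   # R + W = 4 > N: overlaps a replica that has it
```

In this sketch the read with R = 1 returns nothing because it only reaches the lagging replica, while the read with R = 2 returns the written value. The lagging replica still has to be repaired later, which is the "eventual" part.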
Data Stores
Data Stores
3 major data store solutions:
  • SQL, OLTP
    • Relational, transactional processing
  • Analytics, OLAP
    • Data warehousing, analytics and reporting
  • NoSQL
    • Non-relational, distributed, high performance and highly scalable
SQL Stores
  • Disk-based storage
  • Data is stored as tables, row by row (row store)
  • B-tree is the main indexing mechanism
SQL Stores …
  • Dynamic locking or lock-free concurrency control
  • Write-ahead log (WAL) / transaction log for crash recovery
  • SQL as the access language
SQL Stores
  • Proven and widely adopted
    • MySQL, PostgreSQL, VoltDB, Clustrix, MySQL Cluster, ScaleDB, ScaleBase, dbShards
    • Oracle, SQL Server, DB2, Sybase & …
  • Supports
    • ACID
    • Crash recovery
    • DDL, DML, DCL
Analytic Stores
  • Data warehousing, mainly for large data sets
  • Data marts; dimension, fact and aggregate tables
  • ETL, BI, reporting, analytics
  • Columnar storage and compression are key
  • OLAP cubes, built in or in a middle tier
  • Mostly SQL-driven, also MDX
Analytic Stores
Columnar data warehouse solutions:
  • Greenplum (+ DCA appliance)
  • Vertica (a breakthrough, I love it)
  • Aster
  • ParAccel
  • Infobright (MySQL based)
  • InfiniDB (open source, Calpont appliance)
  • Netezza (appliance)
  • XtremeData dbX (appliance)
  • Teradata
NoSQL Stores
  • Does not mean No to SQL
  • Actually means Not only SQL
  • A data store that may not require fixed table schemas
  • Mainly derived from Google BigTable and Amazon Dynamo
NoSQL Stores …
  • Non-relational, schema-free
  • Distributed, able to scale horizontally
  • Simple CLI or API protocol
  • Eventually consistent, depending on the store …
  • Motivated by the limitations of SQL in scaling large data
  • Ability to define new attributes dynamically
NoSQL Stores …
Multiple types, based on storage architecture:
  • Key-value (KV)
  • Document
  • Graph
  • Column family
NoSQL Stores
  • Key-value stores (Dynamo clones)
    • Membase, Riak, Redis, Tokyo Cabinet, Voldemort
  • Document stores
    • MongoDB, CouchDB
  • Column family (BigTable clones)
    • Cassandra, HBase, Hypertable
  • Graph databases
    • Neo4j, InfoGrid, AllegroGraph, FlockDB
What they are good at  &  How to choose
Basic Decision Principles
  • Do not over-architect from day one; it’s overkill
    • Startups can’t afford to spend the time
  • Understand the business and start with simple, well-known solutions
  • Do not copy others’ models; take inspiration from how they solved their problems
  • Engineering talent is crucial; make sure you have the right people
  • Evaluate and implement new solutions as the business grows
Basic Decision Principles …
  • High availability and disaster recovery are a must
  • Understand the pros and cons of every design model, and weigh them against the best interest of the company
  • Remember some of the big outage stories
    • Tumblr, Foursquare & Twitter
  • Lean towards the community winner, the widely adopted solution
  • Do not optimize for performance alone unless you can recreate the state of the data
SQL – Good
  • High-performance OLTP, transactions, ACID
  • Structured data, SQL access, portability and tools
  • Small amounts of data, typically < 500 GB per server
  • Supports in-place UPDATE and DELETE, with multiple conditions and rows
  • Relational model lives in the data store, independent of the application
  • Many tables with different types of data
SQL – Good
  • Simple or complex aggregation
  • Statistics and reports at the data store level
  • Access to more than one tuple of information
  • Results based on multiple search conditions
    • SELECT foo FROM bar WHERE X=1 AND Y=2
  • Fetching ordered data or arrays of data
  • Compatible with many tools
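The multi-condition query and the aggregation case above can be tried with Python's built-in sqlite3 module. This is only a sketch: the table bar and the columns foo, x and y follow the slide's query, while the rows are made-up sample data.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (foo TEXT, x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO bar VALUES (?, ?, ?)",
    [("a", 1, 2), ("b", 1, 3), ("c", 1, 2), ("d", 2, 2)],  # sample rows
)

# Results based on multiple search conditions, returned in order.
rows = conn.execute(
    "SELECT foo FROM bar WHERE x = 1 AND y = 2 ORDER BY foo"
).fetchall()

# Aggregation done at the data store level rather than in the application.
counts = conn.execute(
    "SELECT x, COUNT(*) FROM bar GROUP BY x ORDER BY x"
).fetchall()

conn.close()
```

Both the filtering and the grouping happen inside the data store; the application only receives the final rows.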
SQL – Bad
  • SQL complexity, parsing cost
  • Learning curve and relational model design
  • Performance and scalability
    • Strictly single node
    • Sharding causes more operational trouble
  • Operational maintenance, firefighting
  • Puts a brake on rapid development cycles
NoSQL - Good
  • Fits volatile data very well
  • High read or write throughput
  • Automatic horizontal scalability (consistent hashing)
  • Simple to implement; developers need not invest in designing and implementing a relational model
  • Application logic defines the object model
  • Support for MVCC in some form
  • Compaction and un-compaction happen at the top tier
  • In-memory, disk-based, or a combination
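Consistent hashing, mentioned above as the basis for automatic horizontal scaling, can be sketched as a hash ring. The HashRing class, the node names, the use of MD5 and the choice of 100 virtual points per node are all assumptions made for illustration, not the scheme of any specific store.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Each node owns many points on a ring; a key belongs to the next point."""
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key, wrapping around at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys that change owner after node-d joins the ring.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

The useful property is that every key in moved now belongs to node-d: existing nodes never trade keys among themselves, so adding a node only pulls over that node's share of the data.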
NoSQL - Good
  • Rapid development cycles, programmer friendly
  • Reduces the footprint at the data store level
  • NoSQL is in general faster than SQL
  • Supports INSERT, DELETE, SELECT
  • Data is distributed by key over nodes
  • Some NoSQL stores, such as Redis, also support lists, sets, queues and pub-sub
  • S3 can handle large blobs; not every NoSQL store can
NoSQL - Bad
  • Packing and unpacking of each key
  • No relation from one key to another
  • The whole value must be fetched by key to read or write any part of it
  • No security or authentication
  • The data store is merely a storage layer; it can’t be used for:
    • Analytics
    • Reporting
    • Aggregation
    • Ordered values
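The packing and unpacking cost listed above shows up as soon as one field of a stored object changes. In this sketch a plain dict stands in for a key-value store that only sees opaque bytes, and JSON is the packing format; both are assumptions, as are the helper names.

```python
import json

store = {}  # stands in for a key-value store: one opaque byte string per key

def put(key, obj):
    """Pack the whole object and write it under the key."""
    store[key] = json.dumps(obj).encode()

def get(key):
    """Fetch the whole value and unpack it."""
    return json.loads(store[key].decode())

def update_field(key, field, value):
    """Changing one field still means read-all, unpack, modify, pack, write-all."""
    obj = get(key)
    obj[field] = value
    put(key, obj)

put("user:1", {"name": "alice", "score": 10})
update_field("user:1", "score", 11)
profile = get("user:1")
```

The store cannot filter, sort or aggregate on score, because it never looks inside the value; that work moves into the application.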
SQL/NoSQL – Good and Bad
  • Performance depends mainly on the amount of memory
  • When disk bound, both take a hit
    • SQL has an advantage due to sequential access and read-ahead
  • Optimization towards frequently accessed data
    • SQL engines maintain an LRU
  • SQL engines are proven and widely in use
  • NoSQL is still quite new, but marching on …
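The LRU idea above keeps frequently accessed pages in memory and evicts the page that has gone unused the longest. The sketch below is a generic LRU built on collections.OrderedDict; the LRUPages class and the load callback are hypothetical, and real engines use more elaborate variants.

```python
from collections import OrderedDict

class LRUPages:
    """Keep at most `capacity` pages; evict the least recently used one."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # oldest access first, newest last

    def access(self, page_id, load):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # hit: mark as most recent
            return self.pages[page_id]
        data = load(page_id)                  # miss: go to disk
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # drop the oldest entry
        return data

disk_reads = []

def load_from_disk(page_id):
    """Stand-in for a disk read; records which pages were actually loaded."""
    disk_reads.append(page_id)
    return f"page-{page_id}"

pool = LRUPages(capacity=2)
for page in (1, 2, 1, 3):
    pool.access(page, load_from_disk)

cached = list(pool.pages)   # pages still held in memory, oldest first
```

Page 1 is touched again before page 3 arrives, so page 2 is the one evicted, and only three of the four accesses reach the disk.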
Cache
  • MySQL HandlerSocket
    • The InnoDB buffer pool acts as the cache
    • No explicit cache needed, no write invalidation
  • A write-through cache (WTC) is a good candidate for high read or write loads
    • The gaming world really needs this
    • Membase, or periodic flush to a persistent storage layer
  • Flash cache can also help scale I/O-bound workloads
    • We might slowly see it appear in private clouds
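A write-through cache, as described above, updates the persistent layer and the cache in the same operation, so reads never need a separate invalidation step. This is a minimal sketch: the WriteThroughCache class is hypothetical, and a dict stands in for the persistent storage layer.

```python
class WriteThroughCache:
    """Every write goes to the backing store and the cache together."""
    def __init__(self, backing):
        self.backing = backing    # stands in for the persistent store
        self.cache = {}
        self.backing_reads = 0    # how often a read had to go to the store

    def put(self, key, value):
        self.backing[key] = value   # persist first
        self.cache[key] = value     # then keep the cache in step

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.backing_reads += 1
        value = self.backing[key]
        self.cache[key] = value     # populate on first miss
        return value

backing = {"player:2:score": 40}    # a value written before the cache existed
wtc = WriteThroughCache(backing)
wtc.put("player:1:score", 100)

hot = wtc.get("player:1:score")     # served from cache, no store read
cold = wtc.get("player:2:score")    # first read falls through to the store
```

The periodic-flush alternative mentioned on the slide would instead buffer writes in the cache and persist them later, trading durability for write throughput.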
Document Store
  • Supports a more complex data model than KV
  • Good at handling content management, session and profile data
  • Multi-index support
  • Dynamic schemas, nested schemas
  • Auto-distributed, eventual consistency
  • MVCC (CouchDB) or atomic (MongoDB)
  • MongoDB, SimpleDB: widely adopted in this space
  • Use case: search by complex patterns & CRUD apps
Column Family Store
  • HBase (Apache), Cassandra (Facebook) and Hypertable (Baidu)
    • HBase – CA
    • Cassandra – AP
  • The model consists of rows and columns
  • Scalability: both rows and columns can be split
    • Rows are split across nodes by primary key or range
    • Columns are distributed using groups
    • Horizontal and vertical partitioning can be used simultaneously
  • An extension of the document store
  • HBase uses HDFS; Pig, Hive and Cascading can help
  • Use case: grouping frequently used and unused data across data centers / streams of writes
Graph Store
  • Social graph
    • Relationships between entities
    • Data modeling for social networks
  • Common use cases
    • List of friends
    • Recommendation system
    • Following
    • Followers
    • Common connections
    • Fan-in / fan-out
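The use cases above map onto simple set operations over a follow graph. The sketch below keeps the graph as an in-memory adjacency map; the user names and helper functions are made up for illustration, and a real graph store would index both directions rather than scan.

```python
# Who each user follows (outgoing edges); sample data only.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": {"alice"},
    "dave": {"bob", "carol"},
}

def following(user):
    """Outgoing edges: accounts this user follows."""
    return follows.get(user, set())

def followers(user):
    """Incoming edges: accounts that follow this user (full scan here)."""
    return {other for other, out in follows.items() if user in out}

def common_connections(a, b):
    """Accounts that both users follow."""
    return following(a) & following(b)

fan_out = len(following("dave"))     # how many accounts dave follows
fan_in = len(followers("carol"))     # how many accounts follow carol
shared = common_connections("alice", "dave")
```

Common connections between two users are also a simple starting point for recommendations, since accounts followed by a user's connections are natural candidates.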
Cloud Data Stores
Database cloud services:
  • Xeround (MySQL)
  • Microsoft SQL Azure Database (SQL Server)
  • SimpleDB (NoSQL)
  • Google App Engine (NoSQL)
  • Salesforce Database.com (Oracle)
  • ClearDB (MySQL)
SUMMARY
  • Pick the right data model for the right problem
  • Understand the data storage and the use cases in terms of read, write and growth patterns, then come up with a plan to implement. The use case dictates everything.
  • Compare pros and cons, and lean towards the option that helps the business
  • Public cloud, private cloud or data center also dictates which model to choose
  • You need the right people
Finally …
  • SQL: works great, can’t scale for large data
  • NoSQL: works great, can’t fit everything
  • SQL + NoSQL
Questions?
  • http://venublog.com/
  • [email_address]
  • Twitter: @vanuganti


Editor's Notes

  • #3 MySQL employee 2000-2004. Database companies: MySQL, SOLID, ANTs Data Server, ScaleDB. Part of Yahoo’s cloud initiatives such as Sherpa, Mobstor and a platform. MySQL geek; still contributes occasionally to the MySQL source
  • #5 To answer all of these, we need to understand what the traditional data architecture is, how it is used today, and the future of data
  • #6 When the web was read-only, things scaled with one or more systems and caching or a load balancer in front. As things became real-time and interactive, the same architecture could not keep up. Talk about how Facebook, Twitter and LinkedIn are evolving. Public cloud performs poorly but offers elasticity to grow; you need to design systems that balance hardware, performance and scalability
  • #7 Just because Facebook, Twitter or someone else uses NoSQL does not mean everyone has to use it. Just because someone scales using MySQL does not mean everyone can use the same approach
  • #8 Caching was used for scaling reads
  • #13 Nobody wants to hear about systems going down for hours and hours, or even days, as happened to Foursquare and Tumblr
  • #15 A typical OLTP system needs C & A. Replication is also eventually consistent
  • #16 Now let’s understand the different types of data stores
  • #18 Widely adopted for years
  • #22 DCA: Data Computing Appliance. Talk about analytics and how crucial they are now
  • #24 Bunch of cloud based solutions, which are bit surprising
  • #28 Before getting into how to design and implement, let’s understand some basics of design and what to achieve
  • #29 Twitter: MySQL crash with no proper backups in place; rolling restore. Tumblr: down for close to 16 hours. Foursquare: down for about 12 hours. You don’t want to be in a situation where you don’t know where the problem is
  • #30 An employee or user can update their profile fields. Guaranteed durability
  • #33 Gaming is a classic example for volatile data