SQL/NoSQL: How to Choose?

Venu Anuganti, Jan 2011, http://venublog.com/

Who am I
- Data Architect, Database Kernel / Internals Engineer
- Implement and scale SQL, NoSQL, Analytics and Data Warehouse solutions
- Large-scale data handling for Games, Social Networking, SaaS, Click Tracking, Recommendation, Advertisement, Mobile and SEM marketing
- Blog: http://venublog.com/

Agenda
- Buzz around SQL and NoSQL
- How to design and implement Data Flow Architecture
- How to choose a Data Store Solution
- Performance, Scalability and Availability
- SQL vs NoSQL
- Where SQL and NoSQL fit
- Types of SQL and NoSQL data stores
- Evaluation & Decision Making

Buzzzzzzz
- Why is everyone talking about NoSQL?
- What is happening to SQL?
- Does that mean the end of SQL? Has the NoSQL era begun?
- Why does nobody talk about large SQL implementations?

Evolution of Data Architecture

Data Architecture
- No standard solution fits all
- The business and its data define the architecture
- It's all about solving problems
- You need to find the right tool for the job

Traditional Architecture
- Relational database is everything
- SQL
- Embedded
- Client-server based
- Data stack: Web, CDN, Load Balancers, Application, Database and Storage

Traditional Scalability …
- Scale-up
  - Memory and hardware have limitations
- Scale-out
  - Scaling reads
    - Cache is king: query cache, Memcache, OLAP
    - Pre-fetching
    - Replication
  - Scaling writes
    - Redundant disk arrays, RAID
    - Sharding

Common Problems …
- The relational model is heavy: parsing, locking, logging, buffer pool and threads
- Not every case can work within a single SMP node
- Sharding does not solve all problems:
  - Cross-shard queries or joins between shards
  - Updates across multiple shards within one transaction
  - Shard failure
  - Online schema changes without taking the shard offline
  - Adding or replacing shards online
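A minimal sketch of why cross-shard work is awkward, assuming hash-modulo routing and in-memory dicts standing in for database shards. The names `SHARDS`, `shard_for` and `total_score` are hypothetical and chosen only for illustration; they are not from the slides.

```python
import zlib

# Three in-memory dicts stand in for three database shards (an assumption
# for illustration; real shards would be separate database servers).
NUM_SHARDS = 3
SHARDS = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id: str) -> int:
    """Route a key to a shard with a stable hash (crc32) modulo shard count."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_SHARDS


def put(user_id: str, row: dict) -> None:
    SHARDS[shard_for(user_id)][user_id] = row


def get(user_id: str):
    # A single-key lookup touches exactly one shard.
    return SHARDS[shard_for(user_id)].get(user_id)


def total_score() -> int:
    # An aggregate has no shard key, so the application must scatter the
    # request to every shard and gather the partial results itself.
    return sum(row["score"] for shard in SHARDS for row in shard.values())


for i in range(10):
    put(f"user{i}", {"score": i})
```

Key lookups stay cheap, but anything that spans keys (joins, aggregates, multi-row transactions) turns into application-level scatter-gather. With modulo routing, changing `NUM_SHARDS` also remaps most keys, which is one reason adding shards online is hard.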

Evolution
- Data is growing rapidly, day by day
- Motivated by the needs of large web applications
- Hardware is not advancing as fast as data is growing
- Things are moving to the cloud and becoming API driven
- Social networking and the cloud make it hard to scale the traditional way

Data is the Business
- Many new business models are data centric
- Real-time and interactive
- Big Data
  - Millions of users, clients, customers, applications, …
  - Terabytes to petabytes of data, day to day
- A business can only grow if it makes proper use of its data
  - Statistics, reporting
  - Real-time
  - Re-targeting
  - Recommendation
- Examples of data-driven companies
  - Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, Apple App Store, Foursquare, anything API driven, almost all newly emerging companies

Solution that works
Data architecture is not just choosing the right data store; it should be a complete solution that is:
- Low in cost (preferably open source, no hidden cost)
- Simple to implement
- High performance
- Highly available
- Highly scalable
- Highly reliable
- Highly recoverable
- Rapid to develop on
- Zero learning curve
- Able to do online changes (schema or node or automatic)
- Low in operational maintenance
- Free of day-to-day firefighting

NoSQL Solution Emerges
- Many companies emerged to solve data-centric problems
- BigTable: Google started to implement a massively distributed, scalable system, the first step into the world of massive data scale
- Many companies followed, building scale-out architectures on commodity hardware
- ACID was deemed bad for scaling, so relaxed consistency models came into the picture
- Google BigTable and Amazon Dynamo are the notable examples

Relaxed Consistency
- Consistency is a major bottleneck for scalability
- People started implementing eventual consistency
- CAP theorem (Consistency, Availability and Partition tolerance)
  - Consistency: “Is the data I’m looking at now the same if I look at it somewhere else?”
  - Availability: “What happens if my database goes down?”
  - Partition tolerance: “What if my data is on a different node?”
- SQL: typically CA; NoSQL: typically AP
- http://venublog.com/2010/04/07/cap-theorem-eventual-consistency-nosql/

Data Stores
3 major data store solutions:
- SQL, OLTP
  - Relational, transactional processing
- Analytics, OLAP
  - Data warehousing, analytics and reporting
- NoSQL
  - Non-relational, distributed, high performance and highly scalable

SQL Stores
- Disk-based storage
- Data is stored as tables of rows and columns, laid out row by row (row store)
- Mainly B-tree as the indexing mechanism

SQL Stores …
- Dynamic locking or lock-free techniques for concurrency control
- Write-ahead log (WAL) / transaction log for crash recovery
- SQL as the access language
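A toy sketch of the write-ahead idea behind crash recovery: append the change to a durable log first, apply it to the working state second, and rebuild the state by replaying the log after a crash. `TinyStore` and its JSON-lines log format are assumptions made for illustration; real engines log at the page or record level and also handle checkpoints and torn writes.

```python
import json
import os
import tempfile


class TinyStore:
    """Key-value store that logs every change before applying it."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self) -> None:
        # Crash recovery: rebuild the in-memory state from the log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key: str, value) -> None:
        # 1. Append the record to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the change to the in-memory state.
        self.data[key] = value


log_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(log_file)
store.put("a", 1)
store.put("a", 2)

# Simulate a crash: drop the in-memory object and reopen from the log.
recovered = TinyStore(log_file)
```

Replaying the log in order reproduces the last written value for each key, which is why the log, not the in-memory state, is the source of truth after a crash.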

SQL Stores
- Proven and widely adopted
  - MySQL, PostgreSQL, VoltDB, Clustrix, MySQL Cluster, ScaleDB, ScaleBase, dbShards, Oracle, SQL Server, DB2, Sybase & …
- Supports ACID
- Crash recovery
- DDL, DML, DCL

Analytic Stores
- Data warehousing, mainly for large sets of data
- Data marts; dimension, fact and aggregate tables
- ETL, BI, reporting, analytics
- Columnar storage and compression are the key
- OLAP cubes, built in or in a middle tier
- Mostly SQL driven, also MDX
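A small sketch of why columnar layout and compression go together, assuming run-length encoding as the compression scheme. Real columnar engines use several encodings; the sample rows and the `rle` helper are made up for illustration.

```python
from itertools import groupby

# Row-oriented input: (date, country, clicks). Sample data, not from the slides.
rows = [
    ("2011-01-01", "US", 10),
    ("2011-01-01", "US", 12),
    ("2011-01-01", "UK", 7),
    ("2011-01-02", "UK", 9),
]

# Column layout: one list per column instead of one tuple per row.
columns = [list(col) for col in zip(*rows)]


def rle(values):
    """Run-length encode a list into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Values within one column repeat often, so runs compress well.
date_col = rle(columns[0])
country_col = rle(columns[1])

# An aggregate over one column reads only that column, not whole rows.
total = sum(columns[2])
```

Here the four dates collapse to two runs, and the sum over clicks never touches the date or country data.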

Analytic Stores
Columnar data warehouse solutions:
- Greenplum (+ DCA appliance)
- Vertica (breakthrough, I love it)
- Aster
- ParAccel
- Infobright (MySQL based)
- InfiniDB (open source, Calpont appliance)
- Netezza (appliance)
- XtremeData dbX (appliance)
- Teradata

NoSQL Stores
- Does not mean "No to SQL"; actually "Not only SQL"
- Data stores that may not require fixed table schemas
- Mainly derived from Google BigTable and Amazon Dynamo

NoSQL Stores …
- Non-relational, schema free
- Distributed, able to scale horizontally
- Simple CLI or API protocol
- Eventually consistent (it depends …)
- Motivated by the limitations of SQL when scaling large data
- Ability to define new attributes dynamically

NoSQL Stores …
Multiple types, based on storage architecture:
- Key-Value (KV)
- Document
- Graph
- Column Family

NoSQL Stores
- Key-Value stores
  - Dynamo clones
  - Membase, Riak, Redis, Tokyo Cabinet, Voldemort
- Document stores
  - MongoDB, CouchDB
- Column Family
  - BigTable clones
  - Cassandra, HBase, Hypertable
- Graph databases
  - Neo4j, InfoGrid, AllegroGraph, FlockDB

What they are good at & How to choose

Basic Decision Principles
- Do not over-architect from day one; it's overkill
- Startups can't afford to spend the time
- Understand the business and begin with simple, well-known solutions
- Do not copy other companies' models; take inspiration from how they solved their problems
- Engineering talent is crucial; make sure you have the right people
- Evaluate and implement new solutions as the business grows

Basic Decision Principles …
- High availability and disaster recovery are a must
- Understand the pros and cons of every design model, and weigh them in the best interest of the company
- Remember some of the big outage stories: Tumblr, Foursquare and Twitter
- Lean towards the community winner and widely adopted solutions
- Do not optimize for performance alone, unless you can recreate the state of the data

SQL – Good
- High-performance OLTP, transactions, ACID
- Structured data; SQL access, portability and tools
- Small amounts of data, typically < 500 GB per server
- Supports in-place UPDATE and DELETE, with multiple conditions and multiple rows
- Relational model lives in the data store, independent of the application
- Many tables with different types of data

SQL – Good
- Simple or complex aggregation
- Statistics and reports at the data store level
- Need access to more than one tuple of information
- Results based on multiple search conditions
  - SELECT foo FROM bar WHERE X=1 AND Y=2
- Fetching ordered data or arrays of data
- Compatible with many tools
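The multi-condition query and the aggregation points can be tried out with Python's built-in `sqlite3` module. The table `bar` and column `foo` follow the slide's query; the sample rows, the index name and the `GROUP BY` query are assumptions added for illustration.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (foo TEXT, x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO bar (foo, x, y) VALUES (?, ?, ?)",
    [("a", 1, 2), ("b", 1, 3), ("c", 2, 2), ("d", 1, 2)],
)
# A composite index lets the engine answer the two-condition lookup directly.
conn.execute("CREATE INDEX idx_bar_x_y ON bar (x, y)")

# Multiple search conditions plus ordering, all evaluated in the data store.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT foo FROM bar WHERE x = ? AND y = ? ORDER BY foo", (1, 2)
    )
]

# Aggregation at the data store level.
count_by_x = dict(conn.execute("SELECT x, COUNT(*) FROM bar GROUP BY x"))
```

Filtering, ordering and counting all happen inside the engine; the application only receives the result rows.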

SQL – Bad
- SQL complexity, parsing cost
- Learning curve and relational model design
- Performance and scalability
  - Strictly single node
  - Sharding causes more operational trouble
- Operational maintenance, firefighting
- Puts a brake on rapid development cycles

NoSQL - Good
- Fits volatile data very well
- High read or write throughput
- Automatic horizontal scalability (consistent hashing)
- Simple to implement; developers need not invest in designing and implementing a relational model
- Application logic defines the object model
- Support for MVCC in some form
- Compaction and un-compaction happen at the top tier
- In-memory, disk based, or a combination
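A compact sketch of consistent hashing, the technique named in the scalability bullet. `HashRing`, the virtual-node count and the node names are hypothetical; production stores add replication and rebalancing on top of this idea.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (SHA-256 chosen for this sketch)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns many points ("virtual nodes") to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

When `node-d` joins, only the keys that now land on `node-d` change owner; every other key stays where it was. That property is what makes adding nodes cheap compared with hash-modulo placement.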

NoSQL - Good
- Rapid development cycles, programmer friendly
- Reduces the footprint at the data store level
- NoSQL is in general faster than SQL
- Supports INSERT, DELETE, SELECT
- Data is distributed by key over nodes
- Lists, sets, queues and pub-sub are also supported by some NoSQL stores (Redis)
- S3 can handle large blobs; not all NoSQL stores can

NoSQL - Bad
- Packing and unpacking of the value for each key
- No relation from one key to another
- The whole value must be fetched by key to read or write any partial information
- No security or authentication
- The data store is merely a storage layer; it can't be used for:
  - Analytics
  - Reporting
  - Aggregation
  - Ordered values
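The packing cost and the whole-value access pattern can be seen in a few lines, assuming a plain dict as the key-value store and JSON as the serialization format. Both are stand-ins for illustration, and `save_profile`, `load_profile` and `update_city` are hypothetical helpers.

```python
import json

kv = {}  # stands in for a key-value store that holds opaque byte values


def save_profile(user_id: str, profile: dict) -> None:
    kv[user_id] = json.dumps(profile).encode("utf-8")  # pack


def load_profile(user_id: str) -> dict:
    return json.loads(kv[user_id].decode("utf-8"))  # unpack


def update_city(user_id: str, city: str) -> None:
    # Changing one field still means: read the whole value, unpack it,
    # modify it, repack it and write the whole value back.
    profile = load_profile(user_id)
    profile["city"] = city
    save_profile(user_id, profile)


save_profile("u1", {"name": "Ann", "city": "Austin", "friends": ["u2", "u3"]})
update_city("u1", "Boston")
```

The store sees only bytes, so it cannot filter, aggregate or order by `city`; that logic has to live in the application. Stores with richer value types (the slides mention Redis lists and sets) soften this for some cases.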

SQL/NoSQL – Good and Bad
- Performance mainly depends on the amount of memory
- When disk bound, both take a hit
  - SQL has an advantage due to sequential reads and read-ahead
- Optimization towards frequently accessed data
  - SQL engines maintain an LRU list
- SQL engines are proven and widely in use
- NoSQL is still fairly new, but marching on …
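A simplified sketch of the LRU idea, built on `collections.OrderedDict`. Real database buffer pools use more elaborate variants; `LRUCache` and the page names are assumptions for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps the most recently used entries, evicting the oldest when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._pages = OrderedDict()

    def get(self, key):
        if key not in self._pages:
            return None
        self._pages.move_to_end(key)  # mark as most recently used
        return self._pages[key]

    def put(self, key, value) -> None:
        self._pages[key] = value
        self._pages.move_to_end(key)
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.put("page1", "A")
cache.put("page2", "B")
cache.get("page1")       # page1 becomes the most recently used entry
cache.put("page3", "C")  # cache is full, so page2 is evicted
```

Frequently touched pages stay in memory while cold ones fall out, which is the behaviour the "frequently accessed data" bullet relies on.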

Cache
- MySQL HandlerSocket
  - InnoDB buffer pool acts as the cache
  - No explicit cache needed, no write invalidation
- A write-through cache (WTC) is a good candidate for high read or write rates
  - The gaming world really needs this
  - Membase, or periodic flush to a persistent storage layer
- Flash cache can also help scale IO-bound workloads
  - We might slowly see it appearing in private clouds
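A minimal sketch of a strict write-through cache, where every write goes to the backing store and the cache in the same operation. It does not model the periodic-flush variant also mentioned on the slide. `WriteThroughCache` and the dict used as the persistent layer are assumptions for illustration.

```python
class WriteThroughCache:
    def __init__(self, backing_store: dict):
        self.store = backing_store  # stands in for the persistent layer
        self.cache = {}
        self.store_reads = 0        # counts reads that missed the cache

    def write(self, key, value) -> None:
        # Update the store and the cache together, so the cache never
        # holds a stale entry that would need invalidation.
        self.store[key] = value
        self.cache[key] = value

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


db = {"score:u1": 10}
wtc = WriteThroughCache(db)
wtc.read("score:u1")       # first read misses and loads from the store
wtc.write("score:u1", 25)  # write lands in both store and cache
latest = wtc.read("score:u1")
```

After the write, reads are served from the cache with the new value and the store already holds it, so no separate invalidation step is needed.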

Document Store
- Supports a more complex data model than KV
- Good at handling content management, session and profile data
- Multi-index support
- Dynamic schemas, nested schemas
- Auto-distributed, eventual consistency
- MVCC (CouchDB) or atomic (MongoDB)
- MongoDB, SimpleDB: widely adopted in this space
- Use case: search by complex patterns and CRUD apps

Column Family Store
- HBase (Apache), Cassandra (Facebook) and Hypertable (Baidu)
  - HBase: CA
  - Cassandra: AP
- Model consists of rows and columns
- Scalability: splitting of both rows and columns
  - Rows are split across nodes by primary key range
  - Columns are distributed using groups
  - Horizontal and vertical partitioning can be used simultaneously
- Extension of the document store
- HBase uses HDFS; Pig, Hive and Cascading can help
- Use case: grouping frequently used and unused data across data centers; streams of writes

Graph Store
- Social graph
- Relationships between entities
- Data modeling for social networks
- Common use cases
  - List of friends
  - Recommendation system
  - Following / followers
  - Common connections
  - Fan-in / fan-out
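The use cases listed for graph stores map onto simple set operations over adjacency lists. This is a sketch with in-memory sets; the user names and the helpers `follow`, `common_connections` and `recommend` are hypothetical, and a real graph database would index and persist these edges.

```python
from collections import defaultdict

follows = defaultdict(set)    # user -> users they follow (fan-out)
followers = defaultdict(set)  # user -> users following them (fan-in)


def follow(src: str, dst: str) -> None:
    follows[src].add(dst)
    followers[dst].add(src)


def common_connections(a: str, b: str) -> set:
    # Users that both a and b follow.
    return follows[a] & follows[b]


def recommend(user: str) -> set:
    # Friends-of-friends that the user does not follow yet.
    candidates = set()
    for friend in follows[user]:
        candidates |= follows[friend]
    return candidates - follows[user] - {user}


for src, dst in [("ann", "bob"), ("ann", "cat"), ("dan", "bob"),
                 ("dan", "cat"), ("bob", "eve"), ("cat", "ann")]:
    follow(src, dst)
```

`follows` answers "who do I follow" and `followers` answers "who follows me"; keeping both directions makes fan-in and fan-out lookups a single dictionary access each.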

Cloud Data Stores
Database cloud services:
- Xeround (MySQL)
- Microsoft SQL Azure Database (SQL Server)
- SimpleDB (NoSQL)
- Google App Engine (NoSQL)
- Salesforce Database.com (Oracle)
- ClearDB (MySQL)

SUMMARY
- Pick the right data model for the right problem
- Understand the data storage and the use cases in terms of read, write and growth patterns, then come up with an implementation plan. The use case dictates everything.
- Compare pros and cons, and weigh towards the option that helps the business
- Public cloud, private cloud or data center also dictates which model to choose
- You need the right people

Finally …
- SQL works great, but can't scale to large data
- NoSQL works great, but doesn't fit every case
- SQL + NoSQL

Questions?
http://venublog.com/
[email_address]
Twitter: @vanuganti
