SQL/NoSQL How to choose ?
A technical talk on overall data architecture and how to evaluate and implement a data store solution.

  • This is a great presentation, if you're looking to refresh it, feel free to reach out to chat or reference some of the content that Clustrix has since produced:
    http://www.clustrix.com/database-landscape/ and technical documentation on the Clustrix architecture: http://docs.clustrix.com/display/CLXDOC/Home
  • Why is download disabled?
  • Thanks for sharing the info Mike; can you reach me at venu at venublog dot com; I also wanted to connect to Clustrix to understand more about the architecture
  • Hi Venu - it's great to see some analysis on this SQL/NoSQL question. I'm glad you did this.

    I would like to point out that Clustrix (where I work) is designed to fix the bad parts of SQL while keeping the good parts. We've completely solved the performance and scalability issues that you rightly pointed out. Our system is a distributed cluster that scales out. It is fault-tolerant and linearly scalable. You can add nodes on the fly with no downtime or other administrative action. This means that we can prevent folks from even having to shard in the first place.
  • MySQL employee, 2000-2004. Database companies: MySQL, SOLID, ANTs Data Server, ScaleDB. Part of Yahoo’s cloud initiatives like Sherpa and Mobstor and a platform. MySQL geek; still contribute randomly to MySQL source.
  • To answer all of these, we need to understand what the traditional data architecture is, how it is currently used, and the future of data.
  • When the web was read-only, things used to scale with one or more systems with caching or a load balancer in front. But as things change to real-time and interactive, the same architecture can’t keep up. Talk about how Facebook, Twitter and LinkedIn are evolving. Public cloud sucks in performance, but offers elasticity to grow; you need to design systems to balance hardware, performance and scalability.
  • If Facebook, Twitter or someone else uses NoSQL, it does not mean everyone has to use it. If someone scales using MySQL, it does not mean everyone can use the same concept.
  • Caching was used for scaling reads
  • Not everyone wants to hear about systems going down for hours and hours or even days, like Foursquare and Tumblr.
  • A typical OLTP system needs C & A. Replication is also eventual consistency.
  • Now let’s understand the different types of data stores.
  • Widely adopted for years
  • DCA: Data Computing Appliance. Talk about analytics and how crucial they are now.
  • A bunch of cloud-based solutions, which is a bit surprising.
  • Before getting into how to design and implement, let’s understand some basics of design and what to achieve.
  • Twitter: MySQL crash and no proper backups in place; rolling restore. Tumblr: it was down for close to 16 hours or so. Foursquare was down for 12 hours or so. You don’t want to be in a situation where you don’t know where the problem is.
  • An employee or user can update his profile fields. Guaranteed durability.
  • Gaming is a classic example for volatile data

SQL/NoSQL How to choose ? Presentation Transcript

  • SQL or NoSQL: How to choose. Venu Anuganti, Jan 2011, http://venublog.com/
  • Who am I
    • Data Architect, Database Kernel / Internals Engineer
    • Implement and Scale SQL, NoSQL, Analytics and Data Warehouse solutions
    • Large scale data handling for Games, Social Networking, SaaS, Click Tracking, Recommendation, Advertisement, Mobile and SEM marketing
    • Blog: http://venublog.com/
  • Agenda
    • Buzz around SQL and NoSQL
    • How to design and implement Data Flow Architecture
    • How to choose Data Store Solution
      • Performance, Scalability and Availability
      • SQL vs NoSQL
      • Where SQL and NoSQL fit
      • Types of SQL and NoSQL data stores
    • Evaluation & Decision Making
  • Buzzzzzzz
    • Why everyone is talking about NoSQL
    • What is happening to SQL
    • Does that mean end of SQL ? NoSQL era begins ?
    • Why nobody talks about large SQL implementations ?
    • Evolution of Data Architecture
  • Data Architecture
    • No standard solution that fits all
    • Business and its data define the architecture
    • It’s all about solving problems
    • You need to find the right tool that does the job
  • Traditional Architecture
    • Relational database is everything
      • SQL
      • Embedded
      • Client-Server based
    • Data Stack
      • Web, CDN, Load Balancers, Application, Database and Storage
  • Traditional Scalability …
    • Scale-up
      • Memory and hardware have limitations
    • Scale-out
      • Scaling reads
        • Cache is the king
          • Query cache
          • Memcache
          • OLAP
        • Pre-fetching
        • Replication
      • Scaling writes
        • Redundant disk arrays, RAID
        • Sharding
  • Common Problems…
      • Relational model is heavy : Parsing, Locking, Logging, Buffer pool and threads
      • Not every case can work within a single-node SMP
      • Sharding does not solve all problems
        • Cross-shard queries or joins between shards
        • Need to update across multiple shards within a transaction
        • Shard failure
        • Online schema changes without taking the shard offline
        • Add or replace shards in-line
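The sharding problems listed above can be illustrated with a minimal sketch. It assumes simple hash-modulo routing over in-memory dictionaries; the names `NUM_SHARDS`, `shard_for`, `put`, `get` and `scan_all` are hypothetical and do not come from the talk or from any particular product.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a key to a shard by hashing it (stable across processes)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each dict stands in for one independent database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key: str, value: dict) -> None:
    shards[shard_for(key)][key] = value


def get(key: str):
    # A single-key lookup touches exactly one shard.
    return shards[shard_for(key)].get(key)


def scan_all(predicate):
    # A query that is not keyed by the shard key must visit every shard
    # and merge the results in the application.
    results = []
    for shard in shards:
        results.extend(v for v in shard.values() if predicate(v))
    return results


put("user:1", {"name": "a", "age": 30})
put("user:2", {"name": "b", "age": 20})
adults = scan_all(lambda v: v["age"] >= 20)
```

Key lookups stay cheap, but anything that spans keys (joins, multi-shard transactions, schema changes) has to be coordinated outside the data store, which is the operational cost the slide points at.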
  • Evolution
    • Data is growing rapidly day by day
    • Motivated by the needs of large web applications
    • Hardware is not advancing as fast as data is growing
    • Things are moving to Cloud and API driven
    • Social networking and the Cloud make it hard to scale the traditional way
  • Data is the Business
    • A lot of new business models are DATA centric
      • Real-time and Interactive
      • Big Data
        • Millions of user base, clients, customers, applications, …
        • Terabytes to petabytes of data day to day
      • Business can only grow if they can properly make use of data
        • Statistics, Reporting
        • Real-time
        • Re-targeting
        • Recommendation
      • Examples of data driven companies
        • Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, Apple AppStore, Foursquare, any API-driven company, almost all newly emerging companies
  • Solution that works
    • Data architecture is not just choosing the right data store, but should be a solution, with:
      • Low In Cost (preferably open source, no hidden cost..)
      • Simple To Implement
      • High Performance
      • Highly Available
      • Highly Scalable
      • Highly Reliable
      • Highly Recoverable
      • Rapid Development
      • Zero Learning Curve
      • Ability to do online changes (schema or node or automatic)
      • Less Operational Maintenance
      • No day-to-day firefighting
  • NoSQL Solution Emerges
    • A lot of companies emerged to solve data centric problems
    • BigTable: Google started to implement a massively distributed, scalable system, followed by many; the first step into the world of massive data scale
    • Many companies followed, building scale-out architectures using commodity hardware
    • ACID was termed bad for scaling, so relaxed consistency models came into the picture
    • Google BigTable and Amazon Dynamo are notable
  • Relaxed Consistency
    • Consistency is a major bottleneck for scalability
    • People started implementing eventual consistency
    • CAP Theorem (Consistency, Availability and Partition-tolerance)
      • Consistency: “Is the data I’m looking at now the same if I look at it somewhere else?”
      • Availability: “What happens if my database goes down?”
      • Partitioning: “What if my data is on a different node?”
      • SQL – CA
      • NoSQL – AP
    • http://venublog.com/2010/04/07/cap-theorem-eventual-consistency-nosql/
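The relaxed consistency described above can be sketched as a toy store with one primary and asynchronous replicas. This is an illustration under stated assumptions, not how any specific NoSQL product works: the classes `Replica` and `EventuallyConsistentStore`, the version counter and the explicit `sync()` step are all hypothetical.

```python
class Replica:
    """One copy of the data; keeps the highest-versioned value per key."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class EventuallyConsistentStore:
    """Writes are acknowledged by replica 0 and shipped to the rest later."""

    def __init__(self, n_replicas=2):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # replication queue: (replica, key, value, version)
        self.clock = 0     # simple version counter (assumption: single writer)

    def write(self, key, value):
        self.clock += 1
        self.replicas[0].write(key, value, self.clock)
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value, self.clock))

    def read(self, key, replica=0):
        return self.replicas[replica].read(key)

    def sync(self):
        # Deliver queued updates; after this, all replicas agree.
        for replica, key, value, version in self.pending:
            replica.write(key, value, version)
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("profile:1", "v1")
stale = store.read("profile:1", replica=1)   # replica 1 has not seen the write yet
store.sync()
fresh = store.read("profile:1", replica=1)   # now it has
```

The window between `write` and `sync` is where a reader can see old data; a system that closes that window by waiting for every replica pays for it in write latency and availability.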
  • Data Stores
      • 3 Major Data Store Solutions
        • SQL, OLTP
          • Relational, transactional processing
        • Analytics, OLAP
          • Data Warehousing, Analytics and reporting
        • NoSQL
          • Non relational, distributed, high performance and highly scalable
  • SQL Stores
    • Disk based storage
    • Data is stored as tables (row by row, with columns: a row store)
    • Mainly B-tree as the indexing mechanism
  • SQL Stores …
    • Dynamic locking / lock-free for concurrency control
    • Write-ahead log (WAL) / transactional log for crash recovery
    • SQL as the access language
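The write-ahead log idea can be shown with a minimal sketch: every change is appended and flushed to a log file before it is applied in memory, and a restart rebuilds state by replaying the log. The class `TinyWalStore` and the JSON-lines log format are assumptions made for illustration; real engines log at the page or record level and add checkpoints.

```python
import json
import os
import tempfile


class TinyWalStore:
    """In-memory key-value map made durable by an append-only log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Crash recovery: re-apply every logged change in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def set(self, key, value):
        # 1. Append the change to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the in-memory state.
        self.data[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    store = TinyWalStore(path)
    store.set("a", 1)
    store.set("a", 2)
    recovered = TinyWalStore(path)       # simulates a restart after a crash
    recovered_data = dict(recovered.data)
```

The `fsync` on every write is the durability cost that the later slides weigh against raw throughput.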
  • SQL Stores
    • Proven and widely adopted
      • MySQL
      • PostgreSQL
      • VoltDB
      • Clustrix
      • MySQL Cluster
      • ScaleDB
      • ScaleBase
      • DbShards
      • Oracle
      • SQL Server
      • DB2
      • Sybase & …
    • Supports
      • ACID
      • Crash recovery
      • DDL, DML, DCL
  • Analytic Stores
    • Data warehousing, mainly for large sets of data
    • Data marts, Dimensional, Fact and Aggregate tables
    • ETL, BI, Reporting, Analytics
    • Columnar storage and compression are the key
    • OLAP Cubes built-in or middle-tier
    • Mostly SQL and also MDX driven
  • Analytic Stores
    • Columnar data warehouse solutions
      • Greenplum (+ DCA appliance)
      • Vertica (breakthrough, I love it)
      • Aster
      • ParAccel
      • Infobright (MySQL based)
      • InfiniDB (open source, Calpont appliance)
      • Netezza (appliance)
      • XtremeData dbX (appliance)
      • Teradata
  • NoSQL Stores
    • Does not mean No to SQL
    • Actually Not only SQL
    • Data store that may not require fixed table schemas
    • Mainly derived from Google BigTable and Amazon Dynamo
  • NoSQL Stores …
    • Non relational, schema free
    • Distributed, ability to horizontally scale
    • Simple CLI or API protocol
    • Eventually consistent, depends …
    • Motivated by the limitations of SQL in scaling to large data
    • Ability to dynamically define new attributes
  • NoSQL Stores …
    • Multiple Types based on storage architecture
      • Key Value, KV
      • Document
      • Graph
      • Column Family
  • NoSQL Stores
      • Key-Value Stores
        • Dynamo Clones
        • Membase
        • Riak
        • Redis
        • Tokyo Cabinet
        • Voldemort
      • Document Stores
        • MongoDB
        • CouchDB
      • Column Family
        • BigTable Clones
        • Cassandra
        • HBase
        • Hypertable
      • Graph Databases
        • Neo4J
        • InfoGrid
        • AllegroGraph
        • FlockDB
    • What they are good at & how to choose
  • Basic Decision Principles
    • Do not over architect from day-1, it’s overkill
    • Startups can’t afford to spend time
    • Understand business and implement with simple well known solutions to begin with
    • Do not copy the models; just take inspiration from how the problems were solved
    • Engineering talent is crucial, make sure you have the right resources
    • Evaluate and implement new solutions as the business grows
  • Basic Decision Principles …
    • High availability & disaster recovery is a must
    • Understand pros and cons of each and every design model, and weigh towards the best interest of the company
    • Remember some of the big outage stories
      • Tumblr, Foursquare & Twitter
    • Lean towards the community winner and widely adopted solutions
    • Do not lean towards performance alone, unless you can recreate the state of the data
  • SQL – Good
      • High Performance OLTP, Transactions, ACID
      • Structured, SQL Access , portability and tools
      • Small amounts of data, typically < 500 GB per server
      • Supports in-place UPDATE and DELETE on multiple conditions/rows
      • Relational model at data store, application independent
      • Many tables with different types of data
  • SQL – Good
      • Simple or complex aggregation
      • Statistics, reports at data store level
      • Need access to more than one tuple of information
      • Results based on multiple search conditions
        • SELECT foo FROM bar WHERE X=1 AND Y=2
      • Fetching of ordered or array of data
      • Compatible with many tools
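The multi-condition search and data-store-level aggregation mentioned above can be tried with Python's built-in `sqlite3` module. The table `bar` and columns `foo`, `x`, `y` follow the slide's example query; the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (foo TEXT, x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO bar (foo, x, y) VALUES (?, ?, ?)",
    [("a", 1, 2), ("b", 1, 3), ("c", 1, 2), ("d", 2, 2)],
)

# Results based on multiple search conditions, returned in order.
rows = conn.execute(
    "SELECT foo FROM bar WHERE x = 1 AND y = 2 ORDER BY foo"
).fetchall()

# Aggregation done by the data store rather than by application code.
counts = conn.execute(
    "SELECT x, COUNT(*) FROM bar GROUP BY x ORDER BY x"
).fetchall()

conn.close()
```

Both queries return more than one tuple and are expressed declaratively; the engine decides how to filter, sort and group.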
  • SQL – Bad
      • SQL complexity, parsing cost
      • Learning and relational model design
      • Performance and Scalability
        • Strictly single node
        • Sharding causes more trouble operationally
        • Operational maintenance, fire fighting
      • Puts a brake on rapid development cycles
  • NoSQL - Good
      • Fits very well for volatile data
      • High read or write throughput
      • Automatic horizontal scalability (Consistent hashing)
      • Simple to implement, no investment for developers to design and implement relational model
      • Application logic defines object model
      • Support of MVCC in some form
      • Compaction and un-compaction happens at top tier
      • In-memory or disk based or combination
  • NoSQL - Good
      • Rapid development cycles, programmer friendly
      • Reduces the footprint at data store level
      • NoSQL is in general faster than SQL
      • Supports INSERT, DELETE, SELECT
      • Data is distributed by KEY over nodes
      • Lists, sets, queues, pub-sub are also supported by some NoSQL – Redis
      • S3 can handle large blobs; not all NoSQL can handle it
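Consistent hashing, named above as the mechanism behind automatic horizontal scalability, can be sketched as a hash ring with virtual nodes. The class `HashRing`, the node names and the choice of 100 virtual nodes per server are assumptions for illustration, not the layout of any specific store.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Maps keys to nodes; adding a node only moves keys onto that node."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node owns several points so load spreads evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring is empty")
        # First point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

With plain hash-modulo routing, adding a fourth server would remap most keys; on the ring, only the keys that now fall to `node-d` move.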
  • NoSQL - Bad
      • Packing and Un-packing of each key
      • Lack of relation from one key to another
      • Need the whole value for a key to read or write any part of it
      • No security or authentication
      • Data store is merely a storage layer, can’t be used for:
        • Analytics
        • Reporting
        • Aggregation
        • Ordered values
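The packing cost and the lack of partial reads and writes can be seen in a small sketch of a plain key-value store holding serialized values. The dictionary `kv_store` stands in for the store, and JSON is an assumed serialization format; the helper names are hypothetical.

```python
import json

kv_store = {}  # stand-in for a key-value store: key -> opaque bytes


def put(key, obj):
    kv_store[key] = json.dumps(obj).encode("utf-8")   # pack the whole value


def get(key):
    return json.loads(kv_store[key].decode("utf-8"))  # unpack the whole value


def update_field(key, field, value):
    # No partial write: fetch, unpack, change one field, repack, store.
    obj = get(key)
    obj[field] = value
    put(key, obj)


def total_score():
    # No aggregation in the store: the application reads every value.
    return sum(get(key)["score"] for key in kv_store)


put("player:1", {"name": "ann", "score": 10})
put("player:2", {"name": "bob", "score": 5})
update_field("player:1", "score", 12)
total = total_score()
```

In a SQL store the same two operations would be a single `UPDATE ... SET score = 12` and a `SELECT SUM(score)`, both executed inside the engine.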
  • SQL/NoSQL – Good and Bad
      • Performance mainly depends on amount of memory
      • When disk bound, both take a hit
        • SQL has advantage due to sequential and read-ahead
      • Optimization towards frequently accessed data
        • SQL engines maintain LRU
      • SQL Engines are proven and widely in use
      • NoSQL is pretty much new; but marching …
  • Cache
      • MySQL HandlerSocket
        • InnoDB bufferpool acts as cache
        • No explicit cache needed, no write invalidation
      • Write Through Cache (WTC) is a good candidate for high reads or writes
        • The gaming world really needs this
        • Membase or periodic flush to persistent storage layer
      • Flash cache can also help to scale IO bound workloads
        • We might see them pitching in private cloud slowly
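A write-through cache can be sketched in a few lines: every write goes to the persistent store and the cache together, so reads of recently written keys never touch the store. The class `WriteThroughCache` and the plain dictionary used as the backing store are assumptions for illustration; the periodic-flush variant mentioned on the slide would instead queue writes and persist them later.

```python
class WriteThroughCache:
    """Cache in front of a backing store; writes update both synchronously."""

    def __init__(self, backing_store):
        self.cache = {}
        self.store = backing_store   # any dict-like persistent layer
        self.store_reads = 0         # counts reads that missed the cache

    def write(self, key, value):
        self.store[key] = value      # persist first
        self.cache[key] = value      # then keep the cache in step

    def read(self, key):
        if key in self.cache:
            return self.cache[key]   # hit: no trip to the store
        self.store_reads += 1
        value = self.store[key]      # miss: load and remember
        self.cache[key] = value
        return value


db = {"level:1": "forest"}           # pre-existing persistent data
cache = WriteThroughCache(db)
cache.write("score:ann", 10)
hit = cache.read("score:ann")        # served from the cache
miss = cache.read("level:1")         # first read goes to the store
again = cache.read("level:1")        # second read is a hit
```

Because the cache is never ahead of the store, there is no invalidation step and nothing is lost if the cache process restarts.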
  • Document Store
      • Document Stores
        • Supports a more complex data model than KV
        • Good at handling content management, session, profile data
        • Multi index support
        • Dynamic schemas, Nested schemas
        • Auto distributed, eventual consistency
        • MVCC (CouchDB) or atomic (MongoDB)
      • MongoDB, SimpleDB: widely adopted in this space
      • Use Case: Search by complex patterns & CRUD apps
  • Column Family Store
      • HBase (Apache), Cassandra (Facebook) and Hypertable (Baidu)
        • HBase – CA
        • Cassandra – AP
      • Model consists of rows and columns
      • Scalability: Splitting of both rows and columns
        • Rows are split across nodes using primary key, range
        • Columns are distributed using groups
        • Horizontal and vertical partitioning can be used simultaneously
      • Extension of document store
      • HBase uses HDFS; Pig, Hive, Cascading can help
      • Use case: Grouping of frequently used and unused data across data centers; streams of writes
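The two partitioning directions described above (rows split by key range, columns split by group) can be sketched together. The class `ColumnFamilyTable`, the split point `"m"` and the family names are hypothetical; the sketch keeps everything in memory and only shows where each cell would live.

```python
import bisect


class ColumnFamilyTable:
    """Rows are range-partitioned into regions; columns are grouped into families."""

    def __init__(self, split_points, families):
        # N sorted split points give N + 1 row-key ranges (regions).
        self.split_points = list(split_points)
        # Map each column name to the family that stores it.
        self.family_of = {
            col: fam for fam, cols in families.items() for col in cols
        }
        # regions[i][family][row_key] -> {column: value}
        self.regions = [
            {fam: {} for fam in families}
            for _ in range(len(self.split_points) + 1)
        ]

    def region_for(self, row_key):
        # Horizontal partitioning: pick the region by row-key range.
        return bisect.bisect_right(self.split_points, row_key)

    def put(self, row_key, column, value):
        # Vertical partitioning: pick the family by column name.
        family = self.family_of[column]
        region = self.regions[self.region_for(row_key)]
        region[family].setdefault(row_key, {})[column] = value

    def get_family(self, row_key, family):
        # Reads one family without touching the row's other columns.
        region = self.regions[self.region_for(row_key)]
        return region[family].get(row_key, {})


table = ColumnFamilyTable(
    split_points=["m"],
    families={"profile": ["name", "email"], "activity": ["last_login"]},
)
table.put("alice", "name", "Alice")
table.put("alice", "last_login", "2011-01-01")
table.put("zed", "name", "Zed")
```

Frequently read columns can sit in one family and rarely read ones in another, so a profile lookup does not pay for scanning activity data.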
  • Graph Store
      • Social Graph
      • Relationship between entities
      • Data modeling on social networks
      • Common Use Cases
        • List of friends
        • Recommendation system
        • Following
        • Followers
        • Common Connections
        • FAN IN/OUT
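The use cases listed above map directly onto adjacency sets. The following sketch is an in-memory illustration only; the class `SocialGraph` and its method names are hypothetical and are not the API of Neo4j, FlockDB or any other graph database.

```python
from collections import defaultdict


class SocialGraph:
    """Directed follow graph stored as two adjacency maps."""

    def __init__(self):
        self._following = defaultdict(set)  # user -> users they follow (fan-out)
        self._followers = defaultdict(set)  # user -> users following them (fan-in)

    def follow(self, user, target):
        self._following[user].add(target)
        self._followers[target].add(user)

    def following(self, user):
        return set(self._following[user])

    def followers(self, user):
        return set(self._followers[user])

    def common_connections(self, a, b):
        # People both a and b follow.
        return self._following[a] & self._following[b]

    def recommend(self, user):
        # Friends of friends that the user does not follow yet.
        candidates = set()
        for friend in self._following[user]:
            candidates |= self._following[friend]
        return candidates - self._following[user] - {user}


graph = SocialGraph()
graph.follow("ann", "bob")
graph.follow("ann", "cat")
graph.follow("dan", "bob")
graph.follow("dan", "cat")
graph.follow("bob", "eve")
```

Each query is a set operation over neighbours, which is why these workloads are awkward as multi-way SQL joins and natural in a graph store.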
  • Cloud Data Stores
    • Database Cloud Services
      • Xeround (MySQL)
      • Microsoft SQL Azure Database (SQL Server)
      • SimpleDB (NoSQL)
      • Google App Engine (NoSQL)
      • SalesForce Database.com (Oracle)
      • ClearDB (MySQL)
  • SUMMARY
      • Pick the right data model for the right problem
      • Understand the data storage and use cases on read, write and growth patterns; and then come up with a plan to implement
      • Compare pros and cons, and weigh towards the app and business
      • Public cloud, private cloud or DC also dictates what model to choose
      • You need the right people
  • Finally …
      • SQL
        • Works great, can’t scale for very large data
      • NoSQL
        • Works great, can’t fit for all
      • SQL + NoSQL
  • Questions ?
    • http://venublog.com/
    • [email_address]
    • Twitter: @vanuganti