Presentation for http://strataconf.com/strata2012/public/schedule/detail/22693
Many of the new online and device-oriented application models require a high degree of operational and development agility, such as unlimited elastic scale and flexible data models. The nascent NoSQL market is aiming to address these requirements but is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing NoSQL solutions require specialized, low-level knowledge that does not easily carry over from one vendor’s product to another. The SQL market, on the other hand, has a high level of maturity and at least conceptual standardization, but relational database systems were not originally designed for these requirements.
However, contrary to common belief, the question of big versus small data is orthogonal to the question of SQL versus NoSQL. While the NoSQL model naturally supports extreme sharding, the fact that it does not require strong typing and normalization makes it attractive for “small” data as well. On the other hand, it is possible to scale relational SQL databases.
In this presentation, I will provide a short introduction to some architectural patterns that SQL-based solutions have been using to achieve scale and operational agility, and contrast them with the NoSQL paradigms. Using SQL Azure Federations as an example, I will show how SQL can be augmented with NoSQL paradigms at the platform level. I will also show how NoSQL offerings can benefit from the lessons learned with SQL.
What this all means is that NoSQL, BigData and SQL are not in conflict, like good and evil. Instead, they are sometimes overlapping but often complementary solutions that address different requirements, benefit from common paradigms, and can and will coexist.
2. AGENDA
• Scaling out your business is important!
• NoSQL Paradigms and NoSQL Platforms
• SQL learns from NoSQL (with a demo of SQL Azure Federations)
• NoSQL learns from SQL
• Scalable Data Processing Platform of the Future
3. THE WEB 2.0 BUSINESS ARCHITECTURE
[Diagram: an Online Business Application at the center of three goals]
• Attract Individual Consumers:
o Provide interesting service
o Provide mobility
o Provide social
• Monetize Individual:
o Upsell service
o VIP
o Speed
o Extra Capabilities
• Monetize the Social:
o Improve individual experience
o Re-sell Aggregate Data (e.g., Advertisers)
4. SOCIAL NETWORKING: THE BUSINESS PROBLEM
• 100s of millions of users
• 10s of millions of users concurrently
• Terabytes to petabytes of data
• Structured and unstructured
• Required (eventual) data consistency across users
• E.g. show your updated state in your friends’ profile pages
5. SOLUTION
• Shard/Partition user data across hundreds to thousands of SQL Databases
• Propagate data changes from one DB to other DBs using reliable, async Message Service
• Managing routes from each DB to every other DB would be too complex
• Global Transactions would hinder scale and availability
• Provide a caching layer for performance
• And also used for
o Clean-up state (e.g. on account close)
o Deploy business logic (stored procedures)
6. EXAMPLE ARCHITECTURE
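[Diagram: Web Tier and Data Tier. In the Web Tier, "I change my status" (userId=1024). The Data Tier holds databases sharded by user ID range (1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, 5001-6000). "My DB gets updated", and async messages flow through the Message Service Dispatcher to the other databases. The transactions involved are labeled TX1 through TX5.]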
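The sharding and messaging pattern of this example can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the architecture of any specific customer: the shard size of 1000 users, the `Dispatcher` class, and the `change_status` helper are assumptions made for the sketch, and a plain queue stands in for the reliable async Message Service.

```python
from collections import deque

SHARD_SIZE = 1000  # assumed users per database (ranges 1-1000, 1001-2000, ...)


def shard_for(user_id):
    """Return the index of the database that stores this user."""
    return (user_id - 1) // SHARD_SIZE


class Dispatcher:
    """Toy in-memory stand-in for the reliable async message service."""

    def __init__(self, shards):
        self.shards = shards
        self.queue = deque()

    def send(self, target_shards, message):
        # The sender only talks to the dispatcher, never to other DBs directly.
        for shard_id in target_shards:
            self.queue.append((shard_id, message))

    def deliver_all(self):
        # Each delivery is a separate local update on the target DB,
        # so no global transaction spans several shards.
        while self.queue:
            shard_id, (user_id, status) = self.queue.popleft()
            self.shards[shard_id]["friend_status"][user_id] = status


def change_status(shards, dispatcher, user_id, status, friend_ids):
    # Local transaction on the user's own DB.
    shards[shard_for(user_id)]["status"][user_id] = status
    # Async fan-out to every DB that stores one of the user's friends.
    targets = sorted({shard_for(f) for f in friend_ids})
    dispatcher.send(targets, (user_id, status))


shards = [{"status": {}, "friend_status": {}} for _ in range(6)]
dispatcher = Dispatcher(shards)
change_status(shards, dispatcher, 1024, "at Strata", friend_ids=[17, 2500, 5999])
pending = len(dispatcher.queue)  # messages not yet applied: friends still see old state
dispatcher.deliver_all()         # after delivery the shards converge
```

Between `change_status` and `deliver_all`, the friends' databases have not yet seen the update, which is the eventual consistency window the pattern accepts in exchange for avoiding global transactions.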
7. MANY LARGE SCALE CUSTOMERS USING SIMILAR PATTERNS
• Patterns
• Sharding and reliable messaging
• Sharding and fan-out query layer
• Caching layer
• Customer Examples
• Social Networking: Facebook, MySpace, etc.
• Online electronic stores (cannot give names)
• Travel reservation systems (e.g. Choice International)
• MSN Casual Gaming
• etc.
8. LESSONS LEARNED FROM THESE SCENARIOS
• Require high availability
• Be able to scale out:
• Functional and Data Partitioning Architecture
• Provide scale-out processing:
o Function shipping
o Fanout and Map/Reduce processing
• Be able to deal with failures:
o Quorum
o Retries
o Eventual Consistency (similar to Read-consistent Snapshot Isolation)
• Be able to quickly grow and change:
• Elastic scale
• Flexible, open schema
• Multi-version schema support
Move better support for these patterns into the Data Platform!
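Two of the failure-handling patterns listed above, quorum and retries, can be combined as in the following minimal sketch. Everything here is hypothetical and in-memory: the `Replica` class simulates transient failures, and `quorum_write` treats a write as successful once a majority of replicas acknowledged it. It is not the replication protocol of any particular product.

```python
import time


class ReplicaDown(Exception):
    """Raised when a replica cannot serve a request."""


class Replica:
    """In-memory replica that fails a configurable number of upcoming calls."""

    def __init__(self, failures=0):
        self.failures = failures
        self.data = {}

    def put(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise ReplicaDown()
        self.data[key] = value


def with_retries(operation, attempts=3, base_delay=0.0):
    """Call operation(), retrying with exponential backoff on ReplicaDown."""
    for attempt in range(attempts):
        try:
            return operation()
        except ReplicaDown:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def quorum_write(replicas, key, value, attempts=3):
    """Try every replica; report success if a majority acknowledged."""
    needed = len(replicas) // 2 + 1
    acks = 0
    for replica in replicas:
        try:
            with_retries(lambda r=replica: r.put(key, value), attempts)
            acks += 1
        except ReplicaDown:
            continue  # this replica stays stale until it is repaired
    return acks >= needed


replicas = [Replica(0), Replica(2), Replica(10)]
ok = quorum_write(replicas, "user:1024", "at Strata")
```

In this run the second replica succeeds on its third attempt and the third replica never acknowledges, so the write still succeeds with two of three acknowledgements while one copy remains stale.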
9. WHAT IS NOSQL ABOUT?
• NoSQL = operational and developer agility at low CapEx and OpEx!
• Low Cost
• Free Open Source Stores, Community Support
• Scale CapEx cost below customer growth rate
• Web friendly developer model and tool chain, ease of use
• Processing Paradigms
• High Availability (scalable Replication, Fast Failover, DR/GeoDR, tunable latency)
• Scale-out (Sharding, Map-Reduce, Elasticity)
• Performance (tuned for specific workloads, Caching, co-located compute with partitioned state)
• Tunable/Eventual Consistency
• Data Model Paradigms
• Data first: Flexible Schema
• Low-impedance mismatch between programming and data model:
o Key-Documents and Objects (BLOBS, JSON, XML, POJO)
o Key-Wide Sparse Column Sets
o Graphs (e.g., RDF)
• Range from devices, over OLTP Web 2.0 applications to BigData Analytics
10. DATA MODELS
Data Model: Example Stores (apologies to the ones I did not list)
• Simple Key-Value Pairs: Memcache, Redis, Dynamo, Voldemort, LevelDB, Azure Caching
• Wide Sparse Column Sets: HyperTable, Big Table, Cassandra, HBase, Hyperbase, Amazon DynamoDB, Windows Azure Tables, SQL Server/Azure Sparse columns
• BLOBs: Amazon S3, Oracle Berkeley NoSQL, Windows Azure Blob Store, SQL Server RBS/FileTable
• JSON Documents: MongoDB, CouchBase, Riak, RavenDB
• Graph: Neo4J, GraphDB, HypergraphDB, Stig, Intellidimension
• Objects and XML Documents: Versant, Oracle Berkeley NoSQL, MarkLogic, existDB, EMC HiveDB, SQL Server/Azure, Oracle, IBM DB2
• Extended Relational: Oracle, EMC SQLFire, IBM DB2, MySQL, Postgres, SQL Server/Azure
11. WHAT CAN SQL LEARN FROM NOSQL?
• Low CapEx, Low OpEx
• Built-in tunable High-Availability
• Data scale-out (Sharding)
• Processing scale-out (Map-Reduce, Fan-Out, tunable consistency)
• Flexible Data Models
• JSON (& XML) support
• Sparse columns/Column sets
• Integrate with BigData Analytics (e.g., Hadoop)
Many Relational Database Systems are incorporating these learnings!
12. EXAMPLE: SQL AZURE FEDERATIONS
• Provides Data Partitioning/Sharding at the Data Platform
• Enables applications to build elastic scale-out applications
• Provides non-blocking SPLIT/DROP for shards (MERGE to come later)
• Auto-connect to the right shard based on the sharding key value
• Provides SPLIT resilient query mode
13. SQL AZURE FEDERATION CONCEPTS
• Federation: Represents the data being sharded
• Federation Root: Database that logically houses federations, contains federation meta data
• Federation Key: Value that determines the routing of a piece of data (defines a Federation Distribution)
• Atomic Unit: All rows with the same federation key value: always together!
• Federation Member (aka Shard): A physical container for a set of federated tables for a specific key range and reference tables
• Federated Table: Table that contains only atomic units for the member’s key range
• Reference Table: Non-sharded table
[Diagram: a Sharded Application connects through the Connection Gateway to an Azure DB with Federation Root (Federation Directories, Federation Users, Federation Distributions, …). Federation “Orders_Fed” (Federation Key: CustomerID) has three members, each holding atomic units (AU):]
• Member: PK [min, 100): AU PK=5, AU PK=25, AU PK=35
• Member: PK [100, 488): AU PK=105, AU PK=235, AU PK=365
• Member: PK [488, max): AU PK=555, AU PK=2545, AU PK=3565
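The routing idea behind these concepts, finding the member whose half-open key range contains a federation key and splitting a member at a new boundary, can be modeled with a sorted list of split points. The sketch below is a conceptual toy in Python, not the SQL Azure implementation or its API: the `Federation` class and its methods are hypothetical names, and the boundaries 100 and 488 are taken from the member ranges on this slide.

```python
import bisect


class Federation:
    """Toy model of a range-partitioned federation."""

    def __init__(self, boundaries):
        # Sorted split points; with [100, 488] the members cover
        # [min, 100), [100, 488) and [488, max).
        self.boundaries = sorted(boundaries)
        self.members = [dict() for _ in range(len(self.boundaries) + 1)]

    def member_index(self, key):
        # bisect_right sends a key equal to a boundary to the upper member,
        # which matches half-open ranges such as [100, 488).
        return bisect.bisect_right(self.boundaries, key)

    def insert(self, key, row):
        # All rows with the same key land in the same member (atomic unit).
        self.members[self.member_index(key)].setdefault(key, []).append(row)

    def atomic_unit(self, key):
        return self.members[self.member_index(key)].get(key, [])

    def split(self, at):
        """Split the member containing `at` into [low, at) and [at, high)."""
        if at in self.boundaries:
            raise ValueError("already a boundary")
        i = self.member_index(at)
        old = self.members[i]
        low = {k: v for k, v in old.items() if k < at}
        high = {k: v for k, v in old.items() if k >= at}
        self.boundaries.insert(i, at)
        self.members[i:i + 1] = [low, high]


fed = Federation([100, 488])
for customer_id in (5, 25, 35, 105, 235, 365, 555, 2545, 3565):
    fed.insert(customer_id, {"CustomerID": customer_id})
fed.split(300)  # the middle member becomes [100, 300) and [300, 488)
```

Because an atomic unit is keyed by a single federation key value, a split never separates rows that share a key; it only moves whole atomic units to one side of the new boundary.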
14. DEMO
MAP-REDUCE SCALE-OUT OVER SQL AZURE FEDERATIONS
• Sharded GamesInfo table using SQL Azure Federations
• Uses a C# library that implements a Map/Reduce processor on top of SQL Azure Federations
• Mapper and Reducer are specified using SQL
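The demo's C# library is not reproduced here. As a rough illustration of the same idea, the Python sketch below uses in-memory SQLite databases as stand-ins for federation members: the mapper SQL runs on every shard, the partial results are collected into a scratch table, and the reducer SQL runs over that table. The GamesInfo columns (`player_id`, `game`, `score`), the scratch table name `mapped`, and both queries are assumptions for this sketch.

```python
import sqlite3


def make_shard(rows):
    """Create one in-memory shard holding a slice of the GamesInfo table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE GamesInfo (player_id INTEGER, game TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO GamesInfo VALUES (?, ?, ?)", rows)
    return conn


# Mapper: per-shard partial aggregates. Reducer: combines the partials.
MAPPER_SQL = (
    "SELECT game, COUNT(*) AS plays, SUM(score) AS total "
    "FROM GamesInfo GROUP BY game"
)
REDUCER_SQL = (
    "SELECT game, SUM(plays), SUM(total) FROM mapped GROUP BY game ORDER BY game"
)


def map_reduce(shards, mapper_sql, reducer_sql):
    # Map phase: fan the same query out to every shard.
    partials = []
    for shard in shards:
        partials.extend(shard.execute(mapper_sql).fetchall())
    # Reduce phase: load the partial rows into a scratch table whose shape
    # matches the mapper output, then run the reducer over it.
    scratch = sqlite3.connect(":memory:")
    scratch.execute("CREATE TABLE mapped (game TEXT, plays INTEGER, total INTEGER)")
    scratch.executemany("INSERT INTO mapped VALUES (?, ?, ?)", partials)
    result = scratch.execute(reducer_sql).fetchall()
    scratch.close()
    return result


shards = [
    make_shard([(1, "chess", 10), (2, "poker", 5)]),
    make_shard([(1001, "chess", 30)]),
]
totals = map_reduce(shards, MAPPER_SQL, REDUCER_SQL)
```

Counts and sums are used because they can be re-aggregated from partial results; an average, for example, would have to be carried as a sum and a count through the map phase.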
15. WHAT CAN NOSQL LEARN FROM SQL?
• Flexible data is good, but:
• Provide optional schema in data platform to help with constraints and optimizations
• Procedural Scale-Out processing is good, but:
• Develop a declarative language suited for and across the data models (e.g., coSQL)
• Standardize suitable abstractions and languages
• Eventual Consistency is good, but:
• Provide users the choice
• Simple Queries are good, but:
• Provide me with secondary indexes
• It will be more efficient to join between two collections of JSON documents in the query engine than in the Application layer
Many NoSQL Database Systems are starting to incorporate these learnings!
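To illustrate why secondary indexes and engine-side joins matter, the sketch below joins two collections of parsed JSON documents by first building an index on the join field and then probing it, instead of comparing every pair of documents. It is a simplified model in Python with hypothetical function names and sample documents; a real query engine would also keep the index persistent and up to date.

```python
from collections import defaultdict


def build_index(docs, field):
    """Secondary index: field value -> list of documents carrying that value."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:
            index[doc[field]].append(doc)
    return index


def index_join(left, right, left_field, right_field):
    """Join two document collections by probing an index on the right side."""
    index = build_index(right, right_field)
    return [
        (l, r)
        for l in left
        if left_field in l          # schema is optional: skip docs without the field
        for r in index.get(l[left_field], [])
    ]


users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"name": "NoId"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "cup"},
]
pairs = [(u["name"], o["item"]) for u, o in index_join(users, orders, "id", "user_id")]
```

With the index, the work grows roughly with the size of both collections plus the number of matches, whereas a nested loop in the application layer compares every user with every order.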
16. THE WEB 2.0 BUSINESS ARCHITECTURE
[Diagram: an Online Business Application at the center of three goals]
• Attract Individual Consumers:
o Provide interesting service
o Provide mobility
o Provide social
• Monetize Individual:
o Upsell service
o VIP
o Speed
o Extra Capabilities
• Monetize the Social:
o Improve individual experience
o Re-sell Aggregate Data (e.g., Advertisers)
17. SCALE-OUT DATA PLATFORM ARCHITECTURE
[Diagram: a SQL or NoSQL Store made up of Primary Shards, each with Readable Replicas, serving three kinds of workloads. Arrows are labeled "Copy" and "Query".]
• OLTP Workloads: Highly Available, High Scale, High Flexibility; mostly touching 1 to low number of shards
• Traditional OLAP Workloads: known schema; Data warehouse, “Star joins”
• Dynamic OLAP Workloads: 3Vs (Volume, Velocity, Variety); Exploratory; Scale-out queries, often using eventual consistent scale-out frameworks like Hadoop
19. CALL TO ACTION
• Familiarize yourself with the NoSQL genes in the Microsoft Online Platform
• Free 3-Month Trial for Windows and SQL Azure: http://www.windowsazure.com
• Engage with us throughout Strata
Presentation | Speaker | Date and Time
Do We Have the Tools We Need to Navigate the New World of Data? | Dave Campbell | 2/29 9:00am PST
Onsite Interview * | Tim O’Reilly, Dave Campbell | 2/29 10:15am PST
Unleash Insights on All Data With Microsoft Big Data | Alexander Stojanovic | 2/29 11:30am PST
Office Hours (Q&A session) | Dave Campbell | 2/29 1:30pm PST
Hadoop + Javascript: What We Learned | Asad Khan | 2/29 2:20pm PST
Democratizing BI at Microsoft: 40,000 Users and Counting | Kirkland Barrett | 3/1 10:40am PST
Data Marketplaces For Your Extended Enterprise | Piyush Lumba | 3/1 2:20pm PST
• Download slides with additional information and related resources:
http://www.slideshare.net/MichaelRys/presentations
21. RELATED RESOURCES
• Scale-Out with SQL Databases
• http://gigaom.com/cloud/facebook-shares-some-secrets-on-making-mysql-scale/
• Windows Gaming Experience Case Study:
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000008310
• Scalable SQL: http://cacm.acm.org/magazines/2011/6/108663-scalable-sql
• http://www.slideshare.net/MichaelRys/scaling-with-sql-server-and-sql-azure-federations
• NoSQL and the Windows Azure Platform
• Whitepaper: http://download.microsoft.com/download/9/E/9/9E9F240D-0EB6-472E-B4DE-6D9FCBB505DD/Windows%20Azure%20No%20SQL%20White%20Paper.pdf
• SQL Federation blog: http://blogs.msdn.com/b/cbiyikoglu/archive/2011/03/03/nosql-genes-in-sql-azure-federations.aspx
• Contact me
• @SQLServerMike
• http://sqlblog.com/blogs/michael_rys/default.aspx
Editor's Notes
Example MySpace architecture:
• Service Dispatcher: coordination point between all SQL Servers
o Centralizes route management
o Avoids routes explosion
• Load-balanced across 30 SQL Servers
o Messages are sent randomly to these
• Enables multicast/broadcast functionality
o Supports destination lists and wildcards, e.g. [DB1, DB3, DB4], DB%
• 18,000 ~2k msgs/sec per dispatcher SQL Server
• MyDB sends a message with my status change and a target list specifying the DBs that store my friends’ data.
• The Service Dispatcher forwards the message to these DBs.
• Each DB processes the message, updating my status in a partitioned table.
Example MSN Casual Gaming:
• ~2 Million users at launch
• ~86 Million services requests/day
• 135 Windows Azure Data Services Hosting VMs
• ca. 18K connections in Connection Pools, this could grow with traffic
• Ca. 1200 SQL Azure requests/second spread across all partitions during peak load
• ~90% reads vs 10% writes (this varies per storage type)
• ~200 bytes of storage per user
• ~20% of database storage is currently used, but expect this to grow
• Sharded over 400 SQL Azure Databases
Note: Large companies invest resources in building these platforms instead of using existing relational platforms!
No DB or OS Admin telling me what to do!
Performance and Scale:
• Map/Reduce Patterns
• Eventual consistency (trade-off due to CAP)
• Sharding
• Caching
Automate management Lifecycle:
• Elastic Scale on demand (no need to pay for resources until needed)
• Automatic Fail-over
• Scalable Schema version rollout
• Perf troubleshooting
• Auto alerting
• Auto loadbalancing
• Auto resourcing (e.g., auto splits based on policies)
• Declarative policy-based management
Code First and revise quickly
• Working software over comprehensive documentation
• Responding to change over following a plan
• Application-model first (before database)
o Dictates the data model and queries
Flexible data models
• No a priori modeling: Data first, schema later/Open Schema
• Key/Value stores
• Reduced impedance mismatch: JSON, XML, YAML
You don’t know exactly what you are looking for
• Map/Reduce for adhoc analysis
• Provide Search across all your data instead of just query
Lower Pain of adoption and maintenance
• From code to deployment & “monetization” of data, services, apps and tenants
• Rich Services out of the Box
• Data and services mashup
• Easy troubleshooting of deployed apps
• No DB or OS Admin telling me what to do
• Low CapEx, Low OpEx: SQL Azure and other Platform as a Service offerings
• Built-in High-Availability (tunable): SQL Azure has quorum based built-in replicas
• Data scale-out (Sharding): SQL Azure Federations
• Processing scale-out (Map-Reduce, Fan-Out, tunable consistency)
• Flexible Data Models
o JSON (& XML) support
o Sparse columns/Column sets
• Integrate with BigData Analytics (e.g., Hadoop)
• SharePoint – BI, Enterprise Search, Enterprise Content Management, Collaboration
• Transform – ETL
• Clean – Data Quality, Augmentation
• Discover – Search, Meta-data, Classification, Information Catalog
• Infer – Recommendation Engines, Machine Learning
• Share – Publish, Collaborate
• Govern – Lineage & Impact Analysis, Master Data Management
• Marketplace – Private, Public, Bing Data, 3rd Party Data Sources, Models, Algorithms, APIs