Scaling Databases On The Cloud


Current database scaling techniques for cloud applications, chosen to help your application database scale without a hitch.

Published in: Technology

  1. 1. Scaling databases on the cloud • Deepak Anupalli, Server Architect • Cloud Computing - Coming of Age: A Treatise on Real-Life Use Cases • Copyright (c) 2009, Pramati Technologies Private Limited. Imaginea is a Pramati business. All trade names and trade marks are owned by their respective owners • 11/4/2009
  2. 2. We are • An emerging leader in product development services, offering specialized services in product engineering, interaction design and test engineering • US headquarters in Sunnyvale, CA; India development centers in Hyderabad and Chennai • A 250+ strong and growing team • A business unit of Pramati Technologies • Rich experience in SaaS engineering, performance engineering, cloud computing, Web 2.0, integrations and managing Amazon EC2 deployments • Track record of delivering significant customer satisfaction
  3. 3. Initiatives in Cloud • Dekoh: • SocialTwist: • MyPicks Beijing 2008: • Qontext:
  4. 4. Application requirements • High reliability • Low Latency • Dynamic Scalability – Millions of Users – Volumes of data • Across the tiers – Web – Application – Data
  5. 5. Our biggest challenge • DB performance is bound by disk I/O • Vertical scaling is an option – e.g. 512 GB RAM, 32 CPUs – Expensive – Only possible to an extent on cloud servers
  6. 6. Vertical Scaling: Limitations • Not everything will fit in memory • Lots of reads mean lots of page faults and disk seeks • RAID 6 or RAID 10 disks • 200 MBps to 1 GBps is the maximum speed • Think horizontal!
  7. 7. Replication • Master-slave replication (MySQL or Oracle RAC) • Writes on one master • Reads on many slaves • Application aware • Works in read-mostly scenarios • Adds slave lag • [Diagram: writes go to a single master; reads go to three slaves]
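
The "application aware" routing on the replication slide can be sketched as below. This is a minimal illustration, not the presenter's code: `ReplicationRouter`, the node names, and the "starts with SELECT" test for reads are all assumptions made for the example.

```python
import itertools


class ReplicationRouter:
    """Send writes to the single master and spread reads over the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over slaves; with no slaves, reads fall back to the master.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def node_for(self, sql):
        # Hypothetical rule: anything that starts with SELECT is a read.
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._slaves is not None:
            return next(self._slaves)
        return self.master


router = ReplicationRouter("master", ["slave-1", "slave-2", "slave-3"])
write_node = router.node_for("INSERT INTO profile VALUES (1, 'a')")
read_node = router.node_for("SELECT * FROM profile WHERE id = 1")
```

Because of slave lag, a read routed to a slave immediately after a write may not see that write yet; an application that needs read-your-writes behaviour would have to send such reads to the master.
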
  8. 8. Sharding • Partition data across masters • Writes and reads are distributed • Application is modified accordingly • Also use replication with fewer slaves to minimize slave lag • Choose a partitioning strategy that uniformly distributes data • [Diagram: shard logic routing to three masters, each with its own slave]
  9. 9. Sharding Schemes • Vertical – Profile DB, friend DB – Not uniform • Range based – ID range, location or date based – Not uniform • Key or hash based – ID hash – Fixed masters • Directory – Mapping of ID to shard – Single point of failure • Example lookups: shard_id = getShard("profile"); shard_id = getShard(profileID); Select * from Profile where id = ? • [Diagram labels: Corporate, Corporate, Tweets, Posts]
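
Three of the schemes above (hash based, range based, directory) can be sketched as lookup functions. Everything here is illustrative: the shard names, the range boundaries, and the `ShardDirectory` class are assumptions, and CRC32 is just one stable hash that could be used.

```python
import bisect
import zlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def hash_shard(profile_id, shards=SHARDS):
    """Key or hash based: stable hash of the ID modulo a fixed shard count."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    digest = zlib.crc32(str(profile_id).encode("utf-8"))
    return shards[digest % len(shards)]


# Hypothetical boundaries: IDs below 1000 -> shard0, 1000..1999 -> shard1, rest -> shard2.
RANGE_UPPER_BOUNDS = [1000, 2000]


def range_shard(profile_id, bounds=RANGE_UPPER_BOUNDS, shards=SHARDS):
    """Range based: find the interval the ID falls into."""
    return shards[bisect.bisect_right(bounds, profile_id)]


class ShardDirectory:
    """Directory: explicit ID-to-shard mapping (itself a single point of failure)."""

    def __init__(self):
        self._table = {}

    def assign(self, profile_id, shard):
        self._table[profile_id] = shard

    def lookup(self, profile_id):
        return self._table[profile_id]  # raises KeyError for unknown IDs
```

The hash scheme spreads IDs evenly but ties the mapping to the number of masters; the range scheme is easy to reason about but can leave some shards much busier than others.
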
  10. 10. Sharding Complexities • No joins – De-normalize the data • Data integrity – Application should enforce integrity • Re-shard – Changing the sharding scheme requires re-partitioning the entire data
  11. 11. De-normalization • Use case: the 10 most recent messages to a recipient • Schema – Messages table stores message info – Recipients Table stores • Requires a join on the Messages and Recipients tables • De-normalize – Store the timestamp in the Recipients table as well • [Diagrams: before, only Messages carries a timestamp; after, both Messages and Recipients carry a timestamp]
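
The de-normalization above can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the column names (`body`, `sent_at`, `message_id`, `recipient_id`) and the helper functions are invented for the example; the slide only says that the timestamp is copied into the Recipients table.

```python
import sqlite3


def create_schema(conn):
    # sent_at is duplicated into recipients so the "recent messages" query needs no join.
    conn.executescript("""
        CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT, sent_at INTEGER);
        CREATE TABLE recipients (message_id INTEGER, recipient_id INTEGER, sent_at INTEGER);
    """)


def send(conn, message_id, body, sent_at, recipient_ids):
    """Write the message once and one de-normalized row per recipient."""
    conn.execute("INSERT INTO messages VALUES (?, ?, ?)", (message_id, body, sent_at))
    conn.executemany(
        "INSERT INTO recipients VALUES (?, ?, ?)",
        [(message_id, rid, sent_at) for rid in recipient_ids],
    )


def recent_message_ids(conn, recipient_id, limit=10):
    """Most recent message IDs for a recipient, using the recipients table only."""
    rows = conn.execute(
        "SELECT message_id FROM recipients "
        "WHERE recipient_id = ? ORDER BY sent_at DESC LIMIT ?",
        (recipient_id, limit),
    )
    return [row[0] for row in rows]
```

The cost is that the application must now keep the two copies of the timestamp consistent, which is the integrity burden the sharding slides mention.
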
  12. 12. Relationships • When data is partitioned into shards, foreign keys become obsolete • De-normalization avoids having relationships • If data can't be de-normalized further, use memcached • But this requires changes to the SQL queries • [Diagram: Application on top of MemCached and Shard 1, Shard 2, Shard 3]
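
One way to read the memcached bullet is a cache-aside lookup in the application: check the cache, fall back to the owning shard, then populate the cache. The sketch below assumes this reading. `DictCache` is an in-process stand-in for a memcached client, the shards are plain dictionaries, and the modulo shard choice is arbitrary.

```python
class DictCache:
    """In-process stand-in for a memcached client (get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def load_profile(profile_id, cache, shards):
    """Cache-aside read: try the cache, then the shard that owns the ID."""
    key = "profile:%d" % profile_id
    cached = cache.get(key)
    if cached is not None:
        return cached
    shard = shards[profile_id % len(shards)]  # hypothetical shard choice
    row = shard.get(profile_id)
    if row is not None:
        cache.set(key, row)
    return row
```

Related rows that live on different shards can then be assembled in the application from cached lookups instead of a SQL join, which is why the queries themselves have to change.
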
  13. 13. Cloud Databases/Data stores • Amazon SimpleDB • Google BigTable • Apache HBase • Facebook/Apache Hive • CouchDB • Cassandra • Many more…
  14. 14. Amazon SimpleDB • Schema-less distributed key-value store • Highly reliable and scalable • Automatic indexing of columns • Querying with SQL-like syntax • Supports multiple values for key/attribute • Value for Money
  15. 15. Problems Addressed • High Availability – multiple nodes forming a ring • Partitioning – Consistent hashing • Replication – Replicated to multiple nodes • Eventual Consistency – Asynchronous replication of data using vector clocks
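
The consistent hashing named above can be sketched as a ring of hashed positions. This is a generic illustration, not any particular store's implementation: the `HashRing` class, the use of MD5, and the count of 64 virtual nodes per server are assumptions.

```python
import bisect
import hashlib


def _position(key):
    """Map a string to a point on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_position("%s#%d" % (node, i)), node))

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_position(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Unlike a plain `hash % n` scheme, adding a node to the ring only moves the keys that the new node takes over; every other key stays where it was.
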
  16. 16. SimpleDB adoption • No joins • No transactional support • String is the only data type • No aggregate functions • No full-text searches • Limits enforced on the size of results, predicates, data, etc.
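
Because string is the only data type, values are compared as text, so numbers need an encoding whose text order matches their numeric order. Zero-padding is a common workaround; the sketch below assumes non-negative integers and a hypothetical fixed width of 10 digits.

```python
def encode_int(value, width=10):
    """Zero-pad a non-negative integer so string order equals numeric order."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    text = str(value)
    if len(text) > width:
        raise ValueError("value does not fit in the chosen width")
    return text.zfill(width)


def decode_int(text):
    """Inverse of encode_int."""
    return int(text)
```

Without padding, "9" sorts after "10" as a string; with padding, "0000000009" sorts before "0000000010", so range predicates behave as expected.
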
  17. 17. Google BigTable • Distributed Key-value store • Runs on top of Google File System (GFS) • Timestamp versioned data • Automatic indexing of columns
  18. 18. BigTable adoption • Google Search, Maps, Earth, Orkut, YouTube, Reader, etc. • Google App Engine (GAE) uses BigTable as its datastore • DataNucleus supports JPA for BigTable • Limited transaction support • Eventual consistency
  19. 19. Hive • Hive is a data warehouse • Runs on top of the Hadoop Distributed File System (HDFS) • Supports SQL-like syntax • User-defined types and functions • Extensibility with Map-Reduce
  20. 20. Hive adoption • Facebook uses Hive to analyze historical data of users and content • Doesn’t support indexing of columns • Brute force mechanism to compute analytics
  21. 21. CouchDB • CouchDB is a document-oriented datastore • Schema-free • Accessible through a RESTful JSON API • Distributed with incremental replication • Querying through JavaScript
  22. 22. Is there a solution for all? • Different data stores address different problem spaces • Identify what best suits your app
  23. 23. Thank You
  24. 24. Cloud Computing - Coming of Age: A Treatise on Real-Life Use Cases • Scaling databases on the cloud • Copyright © 2009, Imaginea Inc. Not to be distributed or communicated without permission. • 11/4/2009