Scalable and Elastic Transactional Data Stores for Cloud Computing Platforms
Cloud computing has emerged as a multi-billion dollar industry and as a successful paradigm for web application deployment. Economies of scale, elasticity, and pay-per-use pricing have been the biggest promises of the cloud. Database management systems (DBMSs) serving these web applications form a critical component of the cloud software stack. These DBMSs must be able to scale out to clusters of commodity servers to serve thousands of applications and their huge amounts of data. Moreover, to minimize operating costs, such DBMSs must also be elastic, i.e., possess the ability to increase and decrease the cluster size in a live system. This is in addition to serving a variety of applications (i.e., supporting multitenancy) while being self-managing, fault-tolerant, and highly available.

The overarching goal of my dissertation is to propose abstractions, protocols, and paradigms to design scalable and elastic database management systems that address the unique set of challenges posed by the cloud. My dissertation shows that with careful choice of design and features, it is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost. In this talk, I will outline my work that embodies this principle. In the first part, I will present techniques and system architectures to enable efficient and scalable transaction processing on clusters of commodity servers. In the second part, I will present techniques for on-demand database migration in a live system, a primitive operation critical to support lightweight elasticity as a first class feature in DBMSs. I will conclude the talk with a discussion of possible future directions.

Speaker notes
  • In the last few years, we have witnessed a trend where web applications have been replacing desktop applications, and large numbers of applications are now accessed via the browser.
  • This shift from desktop to web has also driven a paradigm shift in the application deployment infrastructure, resulting in a paradigm popularly known as Cloud Computing.
  • In its simplest form, cloud computing is essentially computing infrastructure and solutions delivered as a service. Analysts predict that this industry will be worth 150 billion dollars by 2014. Even though almost every aspect of computing can be provided as a service, there have been three popular cloud paradigms. Infrastructure as a Service, the lowest level of abstraction, provides raw CPU, storage, and network as a service; popular examples include Amazon Web Services, Rackspace, etc. The next higher level of abstraction is Platform as a Service, which provides a platform or containers to deploy applications, where the platform provider abstracts data management, fault-tolerance, elastic scaling, etc., thus simplifying application deployment; popular examples include Google AppEngine and Windows Azure. The highest level of abstraction is Software as a Service, which exposes a simple interface to customize pre-designed application logic; a popular example is Salesforce.com. Major factors that have contributed to the success of cloud platforms are advances on the technology front, such as virtualization and pervasive broadband internet connectivity, as well as business and economic factors, such as economies of scale and transfer of risks. In this talk, we focus on cloud application platforms, in particular the database systems that serve these cloud application platforms.
  • Data is central to all modern applications, and most modern enterprises manage petabytes of data. Hence DBMSs form a mission-critical component in the cloud software stack and are key to success as well as to generating revenue. Considering the data needs of web applications, there are two broad categories of systems: on one hand are OLTP systems that store and serve data; on the other hand are OLAP systems that provide intelligence and decision support. In this talk, we will focus on OLTP systems. Bring in the concept of the service provider and the service user, and whose problem we are solving (NEC discussion).
  • Therefore, in summary, the major challenges for an OLTP database in the cloud are: (1) supporting transactions and scale-out while minimizing the number of distributed transactions, (2) supporting lightweight elastic scaling in a live system, and (3) providing autonomic control with intelligence similar to a human controller.
  • Stress about the ACID properties of transactions and how the applications benefit from it by simplifying their design.
  • Therefore, if we consider Scale-out as the vertical axis and Functionality (or support for transactions) as the horizontal axis, at one extreme are the RDBMSs that support rich functionality but are hard to scale out, and at the other extreme are Key-Value stores that allow scaling out to thousands of servers but support limited functionality. There exists a big chasm between the two types of systems, and the challenge is to bridge this divide by efficiently supporting transactions while scaling out. Cloud platforms are multitenant and must support a variety of applications with varying needs; therefore, bridging this chasm is important to support a variety of applications. Functionality: whether transactions are a subset.
  • In addition, when such a database is deployed on an elastic pay-per-use cloud infrastructure that allows on-demand provisioning, compared to static provisioning for the peak load, the challenge is to make the database layer as elastic as the underlying cloud infrastructure without introducing a lot of overhead. Scale vs. Elasticity.
  • To this end, my dissertation makes the following contributions to address these challenges. We propose two different solutions to support transactions at scale for two different application scenarios: ElasTraS allows elastically scalable transaction execution in databases where partitions are statically defined, while G-Store allows efficient transaction processing where database partitions are dynamically defined. Supporting lightweight elasticity essentially boils down to lightweight migration of database partitions in a live system. To this end, we propose two different techniques for two common database architectures: Albatross is a technique for live migration in databases that use the storage abstraction, while Zephyr is a technique for live migration in shared-nothing database architectures. Finally, we are currently working on the design of Pythia, an autonomic controller. In the interest of time, in this talk I will only get into the details of G-Store and Zephyr while providing a very high-level overview of ElasTraS.
  • But before we delve into the details, I would like to spend a couple of minutes to give an overview of my research in the broader area of data management. The current talk, and my thesis, focuses on the OLTP aspect. On the data analysis front, I have worked on multiple projects. As an intern at IBM Almaden, I worked on a project called Ricardo that provides the ability for deep statistical analysis and modeling over large amounts of data. This paper was published in SIGMOD 2010, and parts of the framework ship in IBM InfoSphere BigInsights Enterprise Edition. Recently, I worked on a project called MD-HBase that presents the design and implementation of a scalable multi-dimensional indexing mechanism to support efficient high-throughput location updates and multi-dimensional analysis queries on top of a key-value store. Earlier, I also worked on data stream processing systems, providing intra-operator parallelism in common data stream operators, such as frequent elements or top-k elements, to efficiently exploit multicore processors. I have also worked on designing systems to exploit novel hardware architectures.
  • The goal of partitioning the schema is to leverage the application semantics and access patterns to minimize the number of distributed transactions.
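The co-location idea in this note can be sketched concretely: if every table is partitioned on the key of its root entity, rather than on its own primary key, then a transaction scoped to one entity touches a single partition. A minimal illustration, with invented table and key names:

```python
# Illustrative sketch (invented names): partition every table on the key of
# its "root" entity so that rows accessed together land on the same node.
NUM_PARTITIONS = 4

def partition_for(root_key: str) -> int:
    """Hash-partition on the root entity's key, not on each table's own key."""
    return hash(root_key) % NUM_PARTITIONS

# Rows from three different tables that all belong to user "alice":
rows = [("users", "alice"), ("orders", "alice"), ("photos", "alice")]

partitions = {partition_for(root_key) for _table, root_key in rows}
assert len(partitions) == 1  # one partition, hence no distributed transaction
```

A transaction over one user's rows then executes on a single node; only transactions that cross root entities still need distributed coordination.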
  • Now we know how to scale out when the partitions are statically defined. So let's make it a bit more interesting: how to scale out with transactions on dynamically formed partitions? Recall that our concept of a partition is the set of data items that are frequently accessed within the same transaction. For certain applications, that set might change with time. For instance, in online multi-player games, the application needs transactional access to the player profiles that are part of the same game instance, and this set changes with time. Similar behavior is observed in a number of collaboration-based applications (examples?).
  • If the player profiles are part of the same database partition, then transactions on this group of players can be executed efficiently.
  • However, this group of players changes with time, thus resulting in the concept of dynamically defined database partitions.
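To see why statically defined partitions fall short here, consider a quick back-of-the-envelope simulation (illustrative, not from the talk): player profiles are hash-partitioned across servers, and game groups are formed from arbitrary players.

```python
# Illustrative simulation: with statically hash-partitioned player profiles,
# a transaction over a randomly formed game group almost always spans several
# partitions, i.e. becomes a distributed transaction.
import random

random.seed(42)
NUM_PLAYERS, NUM_PARTITIONS, GROUP_SIZE = 100_000, 16, 5

def partition(player_id: int) -> int:
    return player_id % NUM_PARTITIONS

TRIALS = 10_000
distributed = 0
for _ in range(TRIALS):
    group = random.sample(range(NUM_PLAYERS), GROUP_SIZE)  # one game instance
    if len({partition(p) for p in group}) > 1:
        distributed += 1

print(f"{distributed / TRIALS:.1%} of group transactions span >1 partition")
```

With 16 partitions and groups of 5, essentially every group transaction is distributed under static partitioning, which is the motivation for forming partitions dynamically around the group instead.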
  • Scale.
  • Paper has more detailed evaluation
  • So what does elasticity in the database tier mean? Mention the cost-performance trade-off, and repeat the fact that it is the cloud infrastructure that allows us to optimize operating cost, something that was not thought important in classical infrastructures.
  • Define wireframe in this slide. Defer index wireframe definition to the later slide.
  • Freeze → no structural modifications to the indices. Wireframe → the minimal information needed to start executing transactions at the destination: schema information, user authentication, the index wireframes, etc.
  • Just to give a concrete example of a wireframe, if we consider a B+ tree index, then only the internal nodes of the indices are migrated as part of the wireframe.
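A rough sketch of the wireframe idea for a B+ tree, with invented in-memory data structures (Zephyr's actual implementation operates on disk pages): only the internal routing structure is copied, while leaf records stay behind to be pulled later.

```python
# Sketch of the "index wireframe" idea: copy the internal nodes of a B+ tree,
# replace leaves with empty placeholders; leaf data is fetched on demand.
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty list means leaf
    records: list = field(default_factory=list)   # leaf payload

def wireframe(root: Node) -> Node:
    """Copy the routing structure; strip the leaf records."""
    if not root.children:                 # leaf: keep keys, drop records
        return Node(keys=list(root.keys))
    return Node(keys=list(root.keys),
                children=[wireframe(c) for c in root.children])

leaf1 = Node(keys=[1, 2], records=["r1", "r2"])
leaf2 = Node(keys=[5, 7], records=["r5", "r7"])
root = Node(keys=[5], children=[leaf1, leaf2])

wf = wireframe(root)
assert wf.keys == [5]                # internal routing structure is preserved
assert wf.children[0].records == []  # leaf data stays at the source
```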
  • Once the destination is initialized with the minimal information, it can start executing transactions. At this point, migration enters the dual mode, where both the source and destination execute transactions: new transactions arrive at the destination, while the source continues executing transactions that were active at the start of migration.
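The routing rule of the dual mode can be sketched as follows (illustrative names; the actual protocol also handles on-demand page pulls and conflict detection):

```python
# Minimal sketch of dual-mode routing during live migration: transactions
# already active at the source finish there; all new transactions start at
# the destination.
SOURCE, DESTINATION = "source", "destination"

class DualMode:
    def __init__(self, active_at_source: set):
        self.active_at_source = set(active_at_source)

    def route(self, txn_id: str) -> str:
        if txn_id in self.active_at_source:
            return SOURCE        # finish in-flight work where it began
        return DESTINATION       # new work starts at the destination

    def finish(self, txn_id: str):
        self.active_at_source.discard(txn_id)
        if not self.active_at_source:
            print("source drained; migration can complete")

m = DualMode(active_at_source={"t1", "t2"})
assert m.route("t1") == SOURCE        # active before migration started
assert m.route("t9") == DESTINATION   # arrived after migration started
```

Once the source has drained its active transactions, migration can leave the dual mode and hand the partition over entirely.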
  • Make the future more specific.

Transcript

  • 1. PhD Defense: Scalable and Elastic Transactional Data Stores for Cloud Computing Platforms. Sudipto Das, Computer Science, UC Santa Barbara, sudipto@cs.ucsb.edu. Committee: Divy Agrawal (co-chair), Amr El Abbadi (co-chair), Phil Bernstein, Tim Sherwood. Sponsors:
  • 2. Web replacing Desktop Sudipto Das {sudipto@cs.ucsb.edu}
  • 3–4. Paradigm shift in Infrastructure
  • 5. Cloud computing Computing infrastructure and solutions delivered as a service ◦ Industry worth USD 150 billion by 2014* Contributors to success ◦ Economies of scale ◦ Elasticity and pay-per-use pricing Popular paradigms ◦ Infrastructure as a Service (IaaS) ◦ Platform as a Service (PaaS) ◦ Software as a Service (SaaS) *http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
  • 6–7. Databases for cloud platforms Data is central to applications DBMSs are mission critical component in cloud software stack ◦ Manage petabytes of data, drive revenue ◦ Serve a variety of applications (multitenancy) Data needs for cloud applications ◦ OLTP systems: store and serve data ◦ Data analysis systems: decision support, intelligence
  • 8. Application landscape Social gaming Rich content and mash-ups Managed applications Cloud application platforms
  • 9. Challenges for OLTP systems Scalability ◦ While ensuring efficient transaction execution! Lightweight Elasticity ◦ Scale on-demand!
  • 10. Two approaches to scalability Scale-up ◦ Preferred in classical enterprise setting (RDBMS) ◦ Flexible ACID transactions ◦ Transactions access a single node
  • 11. Two approaches to scalability Scale-up ◦ Preferred in classical enterprise setting (RDBMS) ◦ Flexible ACID transactions ◦ Transactions access a single node Scale-out ◦ Cloud friendly (Key value stores) ◦ Execution at a single server → Limited functionality & guarantees ◦ No multi-row or multi-step transactions
  • 12–13. Why care about transactions?
    confirm_friend_request(user1, user2) {
      begin_transaction();
      update_friend_list(user1, user2, status.confirmed);
      update_friend_list(user2, user1, status.confirmed);
      end_transaction();
    }
    Simplicity in application design with ACID transactions
  • 14–15.
    confirm_friend_request_A(user1, user2) {
      try {
        update_friend_list(user1, user2, status.confirmed);
      } catch (exception e) {
        report_error(e);
        return;
      }
      try {
        update_friend_list(user2, user1, status.confirmed);
      } catch (exception e) {
        revert_friend_list(user1, user2);
        report_error(e);
        return;
      }
    }
    confirm_friend_request_B(user1, user2) {
      try {
        update_friend_list(user1, user2, status.confirmed);
      } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
      }
      try {
        update_friend_list(user2, user1, status.confirmed);
      } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
      }
    }
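A runnable analogue of the slide 12 pseudocode, using SQLite as a stand-in DBMS (table and user names invented): the transaction makes both friend-list updates atomic, so none of the hand-written revert/retry logic from the _A/_B variants is needed.

```python
# Both friend-list updates commit together or not at all; on any error inside
# the `with conn:` block, SQLite rolls the transaction back automatically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (owner TEXT, friend TEXT, status TEXT)")
conn.execute(
    "INSERT INTO friends VALUES ('u1','u2','pending'), ('u2','u1','pending')")

def confirm_friend_request(user1, user2):
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE friends SET status='confirmed' WHERE owner=? AND friend=?",
            (user1, user2))
        conn.execute(
            "UPDATE friends SET status='confirmed' WHERE owner=? AND friend=?",
            (user2, user1))

confirm_friend_request("u1", "u2")
rows = conn.execute("SELECT status FROM friends").fetchall()
assert {r[0] for r in rows} == {"confirmed"}
```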
  • 16. Challenge: Transactions at Scale (Key Value Stores: scale-out; RDBMSs: ACID transactions)
  • 17. Challenge: Lightweight Elasticity Provisioning on-demand and not for peak Optimize operating cost! (Chart: resource demand vs. provisioned capacity over time; traditional infrastructures provision for peak and leave unused resources, while deployment in the cloud tracks demand.) Slide credits: Berkeley RAD Lab
  • 18. Contributions for OLTP systems Transactions at Scale ◦ ElasTraS [HotCloud 2009, UCSB TR 2010] ◦ G-Store [SoCC 2010] Lightweight Elasticity ◦ Albatross [VLDB 2011] ◦ Zephyr [SIGMOD 2011] Self-Manageability ◦ Pythia [in progress]
  • 19–22. Contributions for OLTP systems: It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost. Transactions at Scale ◦ ElasTraS [HotCloud 2009, UCSB TR 2010] ◦ G-Store [SoCC 2010] Lightweight Elasticity ◦ Albatross [VLDB 2011] ◦ Zephyr [SIGMOD 2011]
  • 23. Contributions (Data Management: Transaction Processing): static partitioning ◦ ElasTraS [HotCloud ‘09, TR ‘10]; dynamic partitioning ◦ G-Store [SoCC ‘10]; Albatross [VLDB ‘11]; Zephyr [SIGMOD ‘11] (dissertation)
  • 24. Contributions (adding Analytics): Ricardo [SIGMOD ‘10]; MD-HBase [MDM ‘11, Best Paper Runner-up]; Anonimos [ICDE ‘10, TKDE]
  • 25. Contributions (adding Novel Architectures): Hyder [CIDR ‘11, Best Paper]; CoTS [ICDE ‘09, VLDB ‘09]; TCAM [DaMoN ‘08]
  • 26. Transactions at Scale Key Value StoresScale-out RDBMSs ACID transactions Sudipto Das {sudipto@cs.ucsb.edu} 26
  • 27. Scale-out with static partitioning Table level partitioning (range, hash) ◦ Distributed transactions Partitioning the Database schema ◦ Co-locate data items accessed together ◦ Goal: Minimize distributed transactions Sudipto Das {sudipto@cs.ucsb.edu} 27
  • 28. Scale-out with static partitioning Table level partitioning (range, hash) ◦ Distributed transactions Partitioning the Database schema ◦ Co-locate data items accessed together ◦ Goal: Minimize distributed transactions Sudipto Das {sudipto@cs.ucsb.edu} 28
  • 29. Scale-out with static partitioning Table level partitioning (range, hash) ◦ Distributed transactions Partitioning the Database schema ◦ Co-locate data items accessed together ◦ Goal: Minimize distributed transactions Scaling-out with static partitioning ◦ ElasTraS [HotCloud 2009, TR 2010] Sudipto Das {sudipto@cs.ucsb.edu} 29
  • 30. Scale-out with static partitioning Table level partitioning (range, hash) ◦ Distributed transactions Partitioning the Database schema ◦ Co-locate data items accessed together ◦ Goal: Minimize distributed transactions Scaling-out with static partitioning ◦ ElasTraS [HotCloud 2009, TR 2010] ◦ Cloud SQL Server [ICDE 2011] ◦ MegaStore [CIDR 2011] ◦ RelationalCloud [CIDR 2011] Sudipto Das {sudipto@cs.ucsb.edu} 30
  • 32. Dynamically formed partitions — Access patterns change, often rapidly: ◦ online multi-player gaming applications ◦ collaboration-based applications ◦ scientific computing applications. These are not amenable to static partitioning. How to get the benefit of partitioning when accesses do not statically partition? ◦ Ours is the first solution to allow that.
  • 33. Online Multi-player Games — [Diagram: Player Profile table with columns ID, Name, $$$, Score]
  • 35. Online Multi-player Games — Execute transactions on player profiles while the game is in progress.
  • 37. Online Multi-player Games — Partitions/groups are dynamic.
  • 38. Online Multi-player Games — Hundreds of thousands of concurrent groups.
  • 40. Data Fusion for dynamic partitions [G-Store, SoCC 2010] — Transactional access to a group of data items formed on-demand. Challenge: avoid distributed transactions! The Key Group abstraction: ◦ groups are small ◦ groups have non-trivial lifetimes ◦ groups are dynamic and on-demand. Groups are dynamically formed tenant databases.
  • 44. Transactions on Groups — without distributed transactions. The grouping protocol places ownership of all keys in a Key Group at a single node: ◦ one key is selected as the leader ◦ followers transfer ownership of their keys to the leader.
  • 45. Why is group formation hard? Guarantee the contract between leaders and followers in the presence of: ◦ leader and follower failures ◦ lost, duplicated, or re-ordered messages ◦ the dynamics of the underlying system. And how to ensure efficient and ACID execution of transactions?
  • 50. Grouping protocol — [Timeline: on a create request, the leader logs L(Creating) and sends J to the follower(s); a follower logs L(Joining), replies JA, and logs L(Joined) on receiving the leader's JAA, after which the leader logs L(Joined) and group operations execute. On a delete request, the leader logs L(Deleting), sends D, collects DA, and logs L(Deleted); the follower logs L(Free). Log entries are written at every transition.] Conceptually akin to "locking": ◦ locks are held by groups.
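The create/delete handshake above can be sketched as two small state machines, one per role, where every transition is logged before the corresponding message is sent so that either side can recover its state after a failure. This is a simplified sketch of the message flow, not the G-Store implementation:

```python
class Leader:
    def __init__(self):
        self.state, self.log = "Creating", []

    def create_group(self):          # create request -> send J to followers
        self.log.append("L(Creating)")   # logged before J is sent
        return "J"

    def on_join_ack(self):           # JA received -> reply JAA
        self.log.append("L(Joined)")
        self.state = "Joined"
        return "JAA"

    def delete_group(self):          # delete request -> send D
        self.log.append("L(Deleting)")
        self.state = "Deleting"
        return "D"

    def on_delete_ack(self):         # DA received -> group dissolved
        self.log.append("L(Deleted)")
        self.state = "Deleted"


class Follower:
    def __init__(self):
        self.state, self.log = "Free", []

    def on_join(self):               # J received -> promise ownership, reply JA
        self.log.append("L(Joining)")
        self.state = "Joining"
        return "JA"

    def on_join_ack_ack(self):       # JAA received -> key now owned by leader
        self.log.append("L(Joined)")
        self.state = "Joined"

    def on_delete(self):             # D received -> ownership returns, reply DA
        self.log.append("L(Free)")
        self.state = "Free"
        return "DA"


# One complete create/delete cycle between a leader and one follower.
leader, follower = Leader(), Follower()
leader.create_group()                # J
follower.on_join()                   # JA
leader.on_join_ack()                 # JAA
follower.on_join_ack_ack()           # group formed; group operations run
leader.delete_group()                # D
follower.on_delete()                 # DA
leader.on_delete_ack()               # group dissolved
```

Because each state change is durably logged before the next message, a recovering node can replay its log, rediscover which phase the handshake was in, and re-send the last message — which is why the slide notes that lost, duplicated, or re-ordered messages must be tolerated.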
  • 51. Efficient transaction processing — How does the leader execute transactions? ◦ It caches data for group members — the underlying data store is equivalent to a disk ◦ transaction logging for durability ◦ the cache is asynchronously flushed to propagate updates ◦ guaranteed update propagation. [Diagram: the leader runs a Transaction Manager, Log, and Cache Manager; updates propagate asynchronously to the followers.]
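The execution model above — leader-side cache backed by the key-value store, a write-ahead log for durability, and asynchronous flushes for propagation — can be sketched as follows (class and method names are illustrative, not G-Store's API):

```python
class GroupLeader:
    """Executes transactions for a group. The underlying key-value store
    plays the role of a disk: reached only on cache misses and flushes."""

    def __init__(self, store):
        self.store = store      # underlying key-value store (a dict here)
        self.cache = {}         # leader-side copy of group members' data
        self.log = []           # write-ahead log for durability
        self.dirty = set()      # keys with updates not yet propagated

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.store.get(key)   # miss: fetch from "disk"
        return self.cache[key]

    def commit(self, txn_id, writes):
        # Log first (durability), then apply to the cache; no round trip
        # to the store is needed on the commit path.
        self.log.append((txn_id, dict(writes)))
        self.cache.update(writes)
        self.dirty.update(writes)

    def flush(self):
        # Asynchronous, guaranteed propagation of updates to the store.
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()


store = {"player:1": {"score": 10}}
leader = GroupLeader(store)
leader.commit("t1", {"player:1": {"score": 25}})
assert store["player:1"]["score"] == 10   # not yet propagated
leader.flush()
assert store["player:1"]["score"] == 25   # propagated on flush
```

The design choice mirrored here is that commits touch only the log and the cache, so transaction latency is independent of the distributed store; durability comes from the log, and the store is brought up to date lazily.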
  • 53. Prototype: G-Store [SoCC 2010] — an implementation over key-value stores, providing transactional multi-key access to application clients. A grouping middleware layer (grouping layer plus transaction manager) is resident on top of the key-value store logic at each node of the distributed storage.
  • 55. G-Store Evaluation — Implemented using HBase: ◦ added the middleware layer ◦ ~10,000 LOC. Experiments in Amazon EC2. Benchmark: an online multi-player game. Cluster size: 10 nodes. Data size: ~1 billion rows (>1 TB). For groups with 100 keys: ◦ group creation latency: ~10–100 ms ◦ more than 10,000 groups concurrently created.
  • 56. G-Store Evaluation — [Plots: group creation latency; group creation throughput]
  • 57. Lightweight Elasticity — Provision on-demand, not for peak. Optimize operating cost! [Plots: resources vs. time — traditional infrastructures provision capacity above peak demand, leaving unused resources; deployment in the cloud lets capacity track demand. Slide credits: Berkeley RAD Lab]
  • 58. Elasticity in the Database tier — [Animation: a load balancer in front of the application/web/caching tier and the database tier; database-tier nodes are added and removed as demand changes.]
  • 66. Live database migration — Migrate a database partition (or tenant) in a live system: ◦ optimize operating cost ◦ resource orchestration in multitenant systems. Different from: ◦ migration between software versions ◦ migration in case of schema evolution.
  • 68. VM migration for DB elasticity — One DB partition per VM: ◦ pros: allows fine-grained load balancing ◦ cons: performance overhead and poor consolidation ratio [Curino et al., CIDR 2011]. Multiple DB partitions in one VM: ◦ pros: good performance ◦ cons: must migrate all partitions together — coarse-grained load balancing.
  • 69. Live database migration — Multiple partitions share the same database process: ◦ shared-process multitenancy. Migrate individual partitions on-demand in a live system: ◦ virtualization in the database tier. The straightforward solution: ◦ stop serving the partition at the source ◦ copy it to the destination ◦ start serving at the destination ◦ expensive!
  • 70. Migration cost measures — ◦ Service unavailability: time the partition is unavailable ◦ number of failed requests: operations failing / transactions aborting ◦ performance overhead: impact on response times ◦ additional data transferred.
  • 71. Two common DBMS architectures — Decoupled storage architectures: ◦ ElasTraS, G-Store, Deuteronomy, MegaStore ◦ persistent data is not migrated ◦ Albatross [VLDB 2011]. Shared-nothing architectures: ◦ SQL Azure, Relational Cloud, MySQL Cluster ◦ persistent data must be migrated ◦ Zephyr [SIGMOD 2011].
  • 73. Why is live DB migration hard? The persistent DB image must be migrated (GBs): ◦ how to ensure no downtime? Nodes can fail during migration: ◦ how to guarantee correctness during failures? — transaction atomicity and durability; recover migration state after failure. Transactions execute during migration: ◦ how to guarantee serializability? — transaction correctness equivalent to normal operation.
  • 74. Our approach: Zephyr [SIGMOD 2011] — Migration executed in phases: ◦ starts with the transfer of minimal information to the destination (the "wireframe"). Database pages are the granule of migration: ◦ unique page ownership. Source and destination concurrently execute transactions in one migration phase: ◦ minimal transaction synchronization ◦ guaranteed serializability ◦ logging and handshaking protocols.
  • 75. Simplifying assumptions — For this talk: ◦ transactions access a single partition ◦ no replication ◦ no structural changes to indices. Extensions in the paper [SIGMOD 2011] relax these assumptions.
  • 76. Design overview — [Diagram: the source owns pages P1, P2, P3, …, Pn and runs active transactions TS1, …, TSk; the destination owns no pages. Legend: page owned by node vs. page not owned by node.]
  • 77. Init mode — Freeze indices and migrate the wireframe. [Diagram: pages P1…Pn remain owned by the source, which continues serving active transactions TS1, …, TSk; the destination now holds the same pages, un-owned.]
  • 79. What is an index wireframe? [Diagram: the index structure at the source is copied to the destination without the data pages.]
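One way to picture the wireframe: the destination receives the index's routing structure but none of the data pages, so it can locate any key immediately while every leaf starts out un-owned. A toy sketch assuming a two-level tree — the `Page` class and field names are hypothetical, not H2's actual page format:

```python
class Page:
    def __init__(self, pid, keys, children=None, rows=None):
        self.pid, self.keys = pid, keys
        self.children = children or []   # internal node: child pages
        self.rows = rows                 # leaf node: the actual tuples

def migrate_wireframe(root):
    """Copy internal index nodes only; leaves become un-owned placeholders."""
    owned = set()

    def copy(page):
        if page.rows is not None:        # leaf: keep the id, drop the data
            return Page(page.pid, page.keys, rows=None)
        owned.add(page.pid)              # internal node: copied in full
        return Page(page.pid, page.keys,
                    children=[copy(c) for c in page.children])

    return copy(root), owned

# A tiny source index: one root over two leaf pages.
leaf1 = Page("p1", ["a", "b"], rows=[("a", 1), ("b", 2)])
leaf2 = Page("p2", ["m", "n"], rows=[("m", 3), ("n", 4)])
root = Page("root", ["m"], children=[leaf1, leaf2])

dest_root, dest_owned = migrate_wireframe(root)
# The destination can route any key, but owns no data pages yet.
assert dest_owned == {"root"}
assert dest_root.children[0].rows is None
```

This is why the wireframe transfer is cheap relative to the database size: internal nodes are a small fraction of the pages, yet they are exactly what the destination needs to start accepting transactions and pulling leaves on demand.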
  • 83. Dual mode — Old, still-active transactions TSk+1, …, TSl execute at the source; new transactions TD1, …, TDm start at the destination. Requests for un-owned pages can block: when P3 is accessed by some TDi, P3 is pulled from the source. Index wireframes remain frozen.
  • 85. Finish mode — Transactions at the source have completed; new transactions TDm+1, …, TDn run at the destination. Remaining pages P1, P2, … are pushed from the source; pages can also be pulled by the destination, if needed.
  • 86. Normal operation — The index wireframe is un-frozen; the destination owns all pages and serves transactions TDn+1, …, TDp.
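The dual- and finish-mode behavior above reduces to unique page ownership with on-demand transfer: a destination transaction touching an un-owned page pulls it from the source exactly once, and ownership never moves back. A minimal sketch under those rules (illustrative names, not Zephyr's code):

```python
class Node:
    def __init__(self, name, pages=None):
        self.name = name
        self.pages = dict(pages or {})   # pid -> page contents (owned pages)

def pull_page(dest, source, pid):
    """Transfer unique ownership of one page to the destination, on demand."""
    if pid in dest.pages:
        return dest.pages[pid]           # already owned locally
    page = source.pages.pop(pid)         # the source gives up ownership forever
    dest.pages[pid] = page
    return page

source = Node("source", {"P1": "rows-1", "P2": "rows-2", "P3": "rows-3"})
dest = Node("destination")

# Dual mode: a destination transaction touches P3, pulling it over.
pull_page(dest, source, "P3")
assert "P3" in dest.pages and "P3" not in source.pages

# A later source transaction touching P3 must abort: migrated pages
# are never pulled back (the artifact noted on the next slide).

# Finish mode: remaining pages are pushed until the source owns nothing.
for pid in list(source.pages):
    pull_page(dest, source, pid)
assert not source.pages                  # normal operation at the destination
```

Keeping ownership unique is what makes serializability cheap here: at any instant exactly one node can run transactions over a given page, so no cross-node concurrency control is needed on individual pages.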
  • 87. Artifacts of this design — Once migrated, pages are never pulled back by the source: ◦ transactions at the source that access migrated pages are aborted. No structural changes to indices during migration: ◦ transactions (at both nodes) that would make structural changes to indices are aborted. The destination pulls pages on-demand: ◦ transactions at the destination experience higher latency than in normal operation.
  • 88. Implementation — Prototyped using H2, an open-source OLTP database: ◦ standard SQL/JDBC API ◦ serializable isolation level ◦ tree indices ◦ relational data model. Modified the database engine: ◦ added support for freezing indices ◦ page migration status maintained in the index ◦ ~6,000 LOC. The Tungsten SQL Router migrates JDBC connections during migration.
  • 89. Results Overview — Downtime (partition unavailability): ◦ stop-and-copy (S&C): 3–8 seconds (time needed to migrate; unavailable for updates) ◦ Zephyr: no downtime — either the source or the destination is always available. Service interruption (failed operations): ◦ S&C: 100s–1,000s of failed operations; all transactions with updates are aborted ◦ Zephyr: 10s–100s — an order of magnitude less interruption. Minimal operational and data-transfer overhead.
  • 90. Failed Operations — [Plot: an order of magnitude fewer failed operations with Zephyr]
  • 93. Concluding Remarks — Major enabling technologies: ◦ transactions at scale: ElasTraS, G-Store ◦ lightweight elasticity: Albatross, Zephyr.
  • 94. Future Directions — ◦ A self-managing controller for large multitenant database infrastructures ◦ convergence of transactional and analytics systems for real-time intelligence ◦ putting the human in the loop: leveraging crowd-sourcing.
  • 95. Acknowledgements — ◦ My advisors and my committee members ◦ the Computer Science Dept. at UCSB ◦ funding sources: NSF, NEC Labs America, and AWS in Education ◦ colleagues at DSL and at UCSB ◦ my family. November 16, 2011
  • 96. Thank you! Collaborators — UCSB: Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu, Shashank Agarwal, Shyam Antony, Aaron Elmore, Shoji Nishimura (NEC Japan). Microsoft Research Redmond: Phil Bernstein, Colin Reid. IBM Almaden: Yannis Sismanis, Kevin Beyer, Rainer Gemulla, Peter Haas, John McPherson.