Towards Granular Data Placement Strategies for
Cloud Platforms
Presented by Cleverence Kombe
By Johannes Lorey and Felix Naumann
OUTLINE
1. Introduction
2. Classification of Data Management Systems
3. Granular Data Management
4. Granular ILM in a hybrid storage environment
5. Conclusion
1. INTRODUCTION
• Information Lifecycle Management (ILM)
• The decision process of choosing the optimal storage location for data, driven by the growth of digital
information and continuous efforts to reduce capital expenditure on hardware
• ILM strategies normally consider only on-site hardware, such as hard disk drives and
magnetic tapes
• This approach lacks the flexibility to handle varying demand for storage resources
• Cloud computing and cloud storage therefore offer a novel approach to ILM
because of:
• Their rapid elasticity and the associated metering of resource consumption, which require
little to no capital expenditure
• However, long-term storage needs may economically justify purchasing on-site hardware
instead of renting resources in the cloud.
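The trade-off between buying and renting can be illustrated with a small break-even calculation. This is only a sketch: the function name, the cost model (one up-front purchase plus flat monthly costs), and the numbers in the usage line are hypothetical and do not come from the slides.

```python
import math


def break_even_months(hardware_cost, onsite_monthly_cost, cloud_monthly_cost):
    """Return the first month in which buying on-site hardware is no more
    expensive than renting cloud storage, or None if renting stays cheaper.

    Assumed (hypothetical) cost model:
      on-site total = hardware_cost + onsite_monthly_cost * months
      cloud total   = cloud_monthly_cost * months
    """
    monthly_saving = cloud_monthly_cost - onsite_monthly_cost
    if monthly_saving <= 0:
        # The cloud is never more expensive per month, so the purchase never pays off.
        return None
    return math.ceil(hardware_cost / monthly_saving)


# Hypothetical figures: 12000 up front, 100 per month on-site, 400 per month in the cloud.
print(break_even_months(12000, 100, 400))  # 40
```

Under this simplified model, data that must be kept longer than the break-even period favours on-site hardware, while short-lived or fluctuating demand favours the cloud.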
Figure: Information Lifecycle Management (ILM) Pyramid
• The decision to use either on-site or cloud resources has usually been a
binary one for companies.
• Some companies keep every resource on-site, but security and reliability issues raise the
need for cloud resources
• Commercial cloud computing offerings seem to be a good option for business
startups because they satisfy most resource demands and provide the entire
hardware and software stack needed to offer services over the Internet
• However, there has been little research on integrating on-site and cloud
resources
• Data is regarded either in its atomic form or in its entirety
• Most research deals with storing identical copies on-site and in the cloud, or with technical
aspects of combining local and remote storage facilities, but not with actual placement
decisions
• Cloud storage faces the same challenges as distributed databases:
• Concurrency control, reliability, and consistency
• But the ACID (Atomicity, Consistency, Isolation, and Durability) paradigm is not
suitable for cloud environments.
• Two additional requirements arise: predictability and flexibility
• The research proposes a hybrid approach, both in terms of storage capability and
data organization.
2. Classification of Data Management Systems
• The worldwide distribution of data has sparked a rethinking of the requirements
for data stores
• Traditional taxonomies of databases that consider only relational database management
systems (RDBMS) are no longer sufficient
• The study proposes a new classification that incorporates the requirements of
modern distributed applications.
• Three technical dimensions are presented:
• Scalability
• Transaction Policy
• Storage Model
• Scalability
• Refers to both the degree of distribution of a platform and the flexibility to quickly
provide new resources on demand.
• Storage Model
• Row/Column describes the way data is retained in conventional RDBMS
• A Key/Value store associates a datum with a unique ID, but does not support
schema definitions.
• The Hybrid model refers to any system that combines non-traditional approaches with
some of the features found in RDBMS
• Transaction Policy
• Describes how potentially conflicting operations are handled by the system
• ACID comprises a well-known set of transaction properties
• BASE stands for “Basically Available, Soft state, Eventual consistency”
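The three dimensions can be sketched as a small data structure for describing candidate data stores. This is a minimal illustration: the Scalability values, the store names, and the `matching` helper are hypothetical and not taken from the slides, while the Storage Model and Transaction Policy values follow the bullets above.

```python
from dataclasses import dataclass
from enum import Enum


class Scalability(Enum):
    # Hypothetical labels; the slides do not enumerate scalability levels.
    LOCAL = "local"
    DISTRIBUTED = "distributed"
    ELASTIC = "elastic"


class StorageModel(Enum):
    ROW_COLUMN = "row/column"
    KEY_VALUE = "key/value"
    HYBRID = "hybrid"


class TransactionPolicy(Enum):
    ACID = "ACID"
    BASE = "BASE"


@dataclass(frozen=True)
class DataStoreProfile:
    name: str
    scalability: Scalability
    storage_model: StorageModel
    transaction_policy: TransactionPolicy


def matching(profiles, **required):
    """Return the profiles whose attributes equal every required value."""
    return [
        p for p in profiles
        if all(getattr(p, attr) == value for attr, value in required.items())
    ]


# Hypothetical candidates, not real products.
CANDIDATES = [
    DataStoreProfile("on_site_rdbms", Scalability.LOCAL,
                     StorageModel.ROW_COLUMN, TransactionPolicy.ACID),
    DataStoreProfile("cloud_kv_store", Scalability.ELASTIC,
                     StorageModel.KEY_VALUE, TransactionPolicy.BASE),
]

acid_stores = matching(CANDIDATES, transaction_policy=TransactionPolicy.ACID)
```

Here `acid_stores` contains only the hypothetical `on_site_rdbms` profile.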
• Data placement today is infrastructure-centric
• Available systems are evaluated with regard to the three dimensions, and the best-fitting
one is selected as the data store.
• This approach lacks both flexibility and agility
• A new, data-centric view is proposed
• Certain meta-information about the data is used to formalize data placement
strategies
• The strategy then determines the most appropriate storage location not for the
data as a whole, but for subsets of individual pieces of data at different
points in time.
• This hybrid approach integrates multiple data storage systems, both
on-site and in the cloud.
3. Granular Data Management
• Definition 1
• Two kinds of access operations are distinguished: non-conflicting (i.e., reads) and potentially conflicting (i.e., updates)
• Let D denote a set of atomic pieces of data with respect to some application
• (e.g., one value in an n-tuple or a binary file in a file system).
• Let U denote a set of individual users accessing this data.
• For $i, j \in \mathbb{N}$, $0 \le i \le |D|$ and $0 \le j \le |U|$, a non-conflicting access operation on a datum $d_i \in D$
by a user $u_j \in U$ is defined as a relationship $d_i \rightarrow u_j$
• (e.g., a datum is read by a user)
• A potentially conflicting access operation on a datum $d_i \in D$ by a user $u_j \in U$ is defined
as a relationship $d_i \leftarrow u_j$
• (e.g., a datum is written by a user)
• This definition does not include multiple accesses.
• e.g., $d_1 \rightarrow u_1$; $d_2 \rightarrow u_1$; $d_3 \leftarrow u_1$; $d_3 \leftarrow u_2$
• The placement strategy should advise storing the data $d_1$ and $d_2$ on the local hard disk of $u_1$
• Datum $d_3$ should be stored in shared storage (mechanisms for ensuring consistency are then needed)
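The example above can be sketched as a simple placement rule. This is an illustration under assumptions, not the authors' algorithm: the rule "one user means local disk, several users mean shared storage" is inferred from the example, and the function name and the location strings are hypothetical.

```python
from collections import defaultdict


def place(reads, writes):
    """Suggest a location per datum from (datum, user) access pairs.

    Returns {datum: (location, needs_consistency_control)}.
    Assumed rule: a datum accessed by a single user goes to that user's
    local disk; otherwise it goes to shared storage, and consistency
    control is flagged if at least one access is a write.
    """
    readers = defaultdict(set)
    writers = defaultdict(set)
    for datum, user in reads:
        readers[datum].add(user)
    for datum, user in writes:
        writers[datum].add(user)

    placement = {}
    for datum in sorted(set(readers) | set(writers)):
        users = readers[datum] | writers[datum]
        if len(users) == 1:
            placement[datum] = (f"local:{next(iter(users))}", False)
        else:
            placement[datum] = ("shared", bool(writers[datum]))
    return placement


# The accesses from the example: d1 -> u1, d2 -> u1, d3 <- u1, d3 <- u2.
result = place(
    reads=[("d1", "u1"), ("d2", "u1")],
    writes=[("d3", "u1"), ("d3", "u2")],
)
```

With these inputs, `d1` and `d2` map to `("local:u1", False)` and `d3` maps to `("shared", True)`.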
• Definition 2
• Instead of focusing on single pieces of data, this definition deals with granules of data, such as a
subset of values in an n-tuple or a number of files with distinct properties
• For $i, j \in \mathbb{N}$, $0 \le i \le |D| =: \theta$, $0 \le j \le \binom{|D|}{i} = \binom{\theta}{i}$, a granule of data $d_{i,j}$ is defined as
• i is considered the Level of Granularity (LoG)
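To make the bound on j concrete, the following sketch enumerates granules under one possible reading of Definition 2. It is an assumption that a granule at LoG i is a subset of D with exactly i elements: the slide text does not preserve the defining formula, and this reading is chosen only because it yields the $\binom{\theta}{i}$ granules per level that the bound suggests. The function name is hypothetical.

```python
from itertools import combinations
from math import comb


def granules(data, level):
    """Enumerate all granules at a given Level of Granularity (LoG).

    Assumption: a granule at level i is an i-element subset of the data set.
    """
    return [frozenset(subset) for subset in combinations(sorted(data), level)]


D = {"d1", "d2", "d3", "d4"}
theta = len(D)

# The number of granules per level matches the binomial coefficient.
counts = {i: len(granules(D, i)) for i in range(theta + 1)}
assert all(counts[i] == comb(theta, i) for i in counts)
```

For this four-element set, `counts` is `{0: 1, 1: 4, 2: 6, 3: 4, 4: 1}`.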
• Definition 3: Decomposition of granules
• For an arbitrary granule $d_{i,j}$, the decomposition function Ø is defined as:
• Time is added as a third dimension
• Definition 4: Time granules and granular access
• Let T denote a set of atomic time frames t, where the length of each individual t is fixed (e.g., a second).
Similarly to Definition 2, for $0 \le l \le \binom{|T|}{i}$ a granule of time $t_{i,l}$ is defined as
• Based on Definition 1, an access operation within a time granule $t_{i,l}$ is written as:
• $d_{i,j} \xrightarrow{t_{i,l}} u_{i,k}$ (for non-conflicting access)
• $d_{i,j} \xleftarrow{t_{i,l}} u_{i,k}$ (for conflicting access)
4. Granular ILM in a hybrid storage Environment
• Definition 5 : Workloads and granular workloads
• A workload W = (D, U, T, A) is characterized by:
• the data D to store,
• the set of users U accessing this data,
• the overall available time frame T for the workload,
• the set of access operations A.
• Here, the elements of A represent the operations introduced in Definition 1 and
extended in Definition 4
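A workload tuple W = (D, U, T, A) can be sketched as a small data structure. This is a minimal illustration at the atomic level only (no granules); the class and field names, the validation step, and the example values are hypothetical and not part of the original definitions.

```python
from dataclasses import dataclass
from enum import Enum


class AccessKind(Enum):
    READ = "non-conflicting"
    WRITE = "potentially conflicting"


@dataclass(frozen=True)
class Access:
    datum: str
    user: str
    time_frame: int  # index of an atomic time frame in T
    kind: AccessKind


@dataclass(frozen=True)
class Workload:
    data: frozenset         # D
    users: frozenset        # U
    time_frames: frozenset  # T
    accesses: tuple         # A

    def __post_init__(self):
        # Every access must refer to elements of D, U and T.
        for a in self.accesses:
            if (a.datum not in self.data or a.user not in self.users
                    or a.time_frame not in self.time_frames):
                raise ValueError(f"access refers to an element outside D, U or T: {a}")

    def conflicting(self):
        """Return the potentially conflicting (write) accesses."""
        return tuple(a for a in self.accesses if a.kind is AccessKind.WRITE)


example = Workload(
    data=frozenset({"d1", "d2", "d3"}),
    users=frozenset({"u1", "u2"}),
    time_frames=frozenset(range(3)),
    accesses=(
        Access("d1", "u1", 0, AccessKind.READ),
        Access("d3", "u1", 1, AccessKind.WRITE),
        Access("d3", "u2", 2, AccessKind.WRITE),
    ),
)
```

In this example, `example.conflicting()` returns the two write accesses to `d3`.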
• A data placement cuboid is used to relate the different dimensions of storage
systems to the individual features $D_i$, $U_i$, $T_i$, $A_i$, depending on the LoG i.
• Scalability depends on $D_i$, $U_i$ and $T_i$
• Storage Model depends on $D_i$ and $A_i$
• Transaction Policy depends on $A_i$ and $U_i$
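The dependencies listed above can be written down as a lookup table. The sketch below assumes only those three dependency statements; the dictionary, the function name, and the idea of re-evaluating a dimension when a workload feature changes are illustrative additions.

```python
# Which workload features each storage-system dimension depends on,
# as stated in the bullets above.
DEPENDS_ON = {
    "scalability": {"D", "U", "T"},
    "storage_model": {"D", "A"},
    "transaction_policy": {"A", "U"},
}


def affected_dimensions(changed_features):
    """Return the dimensions that depend on at least one changed feature."""
    changed = set(changed_features)
    return sorted(dim for dim, feats in DEPENDS_ON.items() if feats & changed)


# A change in the access operations A touches two of the three dimensions.
print(affected_dimensions({"A"}))  # ['storage_model', 'transaction_policy']
```

A change in T alone affects only scalability, while a change in D affects scalability and the storage model.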
5. Conclusion
• The approach introduced in this work may serve as the foundation for
a sophisticated data placement framework in the context of
Information Lifecycle Management.
• The specific costs associated with a particular infrastructure need to
be considered for real-world systems
• The policies mentioned above can be integrated into a highly flexible
data store incorporating multiple cloud providers as well as commercial and
open-source database software to allow seamless and transparent data
migration.
Thank you!
