www.ci.anl.gov
www.ci.uchicago.edu
Benchmarking Cloud-based Tagging
Services
Tanu Malik, Kyle Chard, Ian Foster
Computation Institute
Argonne National Laboratory and University of Chicago
Quiz
• Blacksmith’s tools
• X-Y axes as a tool for calculus
• Bayes’ theorem as a tool for probability theory
Tagging: A tool for information management
A Tagging Service
• Suppose we want to build a large-scale tagging
service that allows tagging of data files
1. What is the best way to store this large number
of tags and their values in a database?
2. Which cloud-based database offering should we
choose?
Storing Tags: Many Possibilities
• A resource-tag-value is a triple, like an
entity-attribute-value
– noSQL stores
– triple stores
– relational stores
o horizontal schema
o triple schema
o vertically partitioned schema/decomposed storage
model/columnar schema
o attribute schema
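The relational options above can be made concrete with a small sketch. It stores the same resource-tag-value triples under a triple schema, a horizontal schema, and a vertically partitioned schema. The table names, column names, and sample data are illustrative assumptions, not taken from the slides, and SQLite is used only because it ships with Python. The attribute schema is left out because the slides do not define it.

```python
import sqlite3

# Toy data (hypothetical): two resources with different tags, sparse by construction.
TRIPLES = [
    ("file1", "author", "alice"),
    ("file1", "format", "csv"),
    ("file2", "author", "bob"),
    ("file2", "year", "2013"),
]

def build_schemas(conn, triples):
    """Store the same resource-tag-value triples under three layouts."""
    # Tag names are interpolated into SQL identifiers, so they must be trusted.
    tags = sorted({t for _, t, _ in triples})
    cur = conn.cursor()

    # Triple schema: one row per (resource, tag, value).
    cur.execute("CREATE TABLE triple (resource TEXT, tag TEXT, value TEXT)")
    cur.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

    # Horizontal schema: one wide table, one column per tag, NULL where unset.
    cols = ", ".join(f'"{t}" TEXT' for t in tags)
    cur.execute(f"CREATE TABLE horizontal (resource TEXT PRIMARY KEY, {cols})")
    for res, tag, val in triples:
        cur.execute("INSERT OR IGNORE INTO horizontal (resource) VALUES (?)", (res,))
        cur.execute(f'UPDATE horizontal SET "{tag}" = ? WHERE resource = ?', (val, res))

    # Vertically partitioned schema: one two-column table per tag.
    for tag in tags:
        cur.execute(f'CREATE TABLE "tag_{tag}" (resource TEXT, value TEXT)')
    for res, tag, val in triples:
        cur.execute(f'INSERT INTO "tag_{tag}" VALUES (?, ?)', (res, val))
    conn.commit()
    return tags

conn = sqlite3.connect(":memory:")
tags = build_schemas(conn, TRIPLES)
horizontal_rows = conn.execute("SELECT * FROM horizontal ORDER BY resource").fetchall()
author_rows = conn.execute('SELECT * FROM "tag_author" ORDER BY resource').fetchall()
```

In this toy data the horizontal table holds one NULL per row, while the per-tag tables hold no NULLs at all; that difference is what makes the choice of schema sensitive to how sparse the tags are.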
Which cloud-based offering to choose from?
• No clear consensus regarding the performance
of cloud-based database offerings for sparse
datasets
• Motivates a tagging benchmark and an
infrastructure that uses this benchmark to
compare various cloud-based database offerings
– Goal: determine which cloud-based platform provides the most
efficient support for dynamic, sparse data
Outline
• Related Work
• The Tagging Model
• Workload Generation
• Framework for Evaluating Tagging Workloads
• Experiments
Related Work
• Twitter hashtags vs. tags in Del.icio.us and Flickr
– Twitter: messages are tagged, resulting in transient, trending social
groups
– Del.icio.us, Flickr: resources are tagged with semantically meaningful keywords
– Tags with values
• Cloud benchmarks
– The CloudStone benchmark
o Web 2.0 social application in which event resources are tagged, with
comments and ratings serving as tags
o Considers database tuning as a complex process
– OLTP-Bench
o Infrastructure for monitoring the performance and resource consumption
of a cloud-based database using a variety of relational workloads
o Does not support tagging workloads
The Tagging Model
… or by common entities. Similarly, if resources represent files,
then they may be connected, under a containment relationship,
with a virtual directory resource.
Resources | Tag, Values | Users
{t1,v1; t2,v2; t3,v3}
{t1,v1’; t4,v4; t5,v5}
{t5,v5’; t6,v6}
{t1,v1’’; t2,v2’}
{t4,v4’; t7,v7}
{t5,v5’’; t7,v7’}
{t3,v3’; t6,v6’}
Fig. 1. The Tagging Model
The Tagging Model
• 3 modes to associate tag-value pairs with
resources:
– blind: users do not know tags assigned to a resource
– viewable: can see the tags associated with a resource
– suggestive: the service suggests possible tags to users
• Access control over resources, tags, and their
values:
– assign permissions on resources similar to Unix-style
permissions
– policies assigned at individual and group level
– e.g., allowing users to tag only the resources they created, or
allowing use of their tags by any user
– policies on creation or removal of a resource
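The three association modes can be illustrated with a toy in-memory service. This is a minimal sketch under stated assumptions: the class name `TaggingService`, its methods, the reading of "blind" as "a user sees only the tags they assigned themselves", and the popularity-based suggestion rule are all hypothetical and not taken from the slides. Access control is not modelled.

```python
from collections import Counter, defaultdict

class TaggingService:
    """Toy in-memory tagging service with the three association modes."""

    def __init__(self, mode="suggestive"):
        if mode not in {"blind", "viewable", "suggestive"}:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        # resource -> list of (user, tag, value)
        self.assignments = defaultdict(list)

    def tag(self, user, resource, tag, value):
        """Record that a user tagged a resource with a tag-value pair."""
        self.assignments[resource].append((user, tag, value))

    def visible_tags(self, user, resource):
        """Tags a user may see on a resource before tagging it."""
        if self.mode == "blind":
            # Assumption: in blind mode users see only their own tags.
            return sorted({t for u, t, _ in self.assignments[resource] if u == user})
        return sorted({t for _, t, _ in self.assignments[resource]})

    def suggest(self, resource, k=3):
        """Suggest up to k popular tags that are not yet on the resource."""
        if self.mode != "suggestive":
            return []
        present = {t for _, t, _ in self.assignments[resource]}
        counts = Counter(t for rows in self.assignments.values() for _, t, _ in rows)
        return [t for t, _ in counts.most_common() if t not in present][:k]
```

For example, after `svc.tag("u1", "r1", "author", "alice")` on a service created with `TaggingService("blind")`, the call `svc.visible_tags("u2", "r1")` returns an empty list, whereas in the other two modes it returns `["author"]`.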
Key Observations
• Some resources are more frequently tagged than
others.
– empirical studies have shown that a power law holds for
this phenomenon.
• The number of distinct tags owned by a user is
relatively small.
– probability that a Flickr user has more than 750 distinct
tags is roughly 0.1%.
• Distinct tag usage increases with the increase in the
number of users and resources in the system
– higher correlation coefficient with the number of
resources than users.
J. Huang et al., “Conversational tagging in Twitter,” Hypertext and Hypermedia. ACM, 2010.
C. Marlow et al., “HT06, tagging paper, taxonomy, Flickr, academic article, to read,”
Hypertext and Hypermedia. ACM, 2006.
Workload Generation
• No separate data and query generation phases
– database tables and their attributes, i.e., tags, are
decided by the incoming user workload
• Session-based, closed-loop Markov-chain
process in Python
… all relationships and characteristics between resources, users,
and tags that have been described in the previous section. We
assume the most general options for the workload generator: in
particular, no access control policy on resources, a suggestive
mode of tagging, and tag creation as optional to resource
creation. Based on these options, the following user operations
are supported as part of the workload generation phase:
• S1: Add a user.
• S2: A user uploads a resource.
• S3: A user creates a tag definition and uses this definition
to tag a resource with a value.
• S4: A user queries for a resource.
Fig. 2. The Markov Chain (a) States and some state transitions, (b) The composite tagging state.
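A session over the operations S1 to S4 can be sketched as a walk on a small Markov chain. The transition probabilities below are invented for illustration; the slides do not give the actual values, and the composite tagging state is collapsed into the single state S3. In a closed-loop generator the client would execute each operation and wait for it to finish before drawing the next one; this sketch only draws the sequence.

```python
import random

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "S1_add_user": {"S2_upload_resource": 0.7, "S4_query_resource": 0.3},
    "S2_upload_resource": {
        "S2_upload_resource": 0.2,
        "S3_tag_resource": 0.5,
        "S4_query_resource": 0.3,
    },
    "S3_tag_resource": {
        "S2_upload_resource": 0.2,
        "S3_tag_resource": 0.4,
        "S4_query_resource": 0.4,
    },
    "S4_query_resource": {
        "S2_upload_resource": 0.2,
        "S3_tag_resource": 0.6,
        "S4_query_resource": 0.2,
    },
}

def generate_session(length, rng):
    """Walk the chain for one user session, starting from S1 (add a user)."""
    state = "S1_add_user"
    session = [state]
    for _ in range(length - 1):
        choices = TRANSITIONS[state]
        # Draw the next state according to the current row's probabilities.
        state = rng.choices(list(choices), weights=list(choices.values()), k=1)[0]
        session.append(state)
    return session

session = generate_session(10, random.Random(42))
```

Passing a seeded `random.Random` keeps a generated workload reproducible across runs, which matters when the same workload is replayed against several databases.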
Workload Generation
• Transition probabilities in Markov Chain
– Some resources are more frequently tagged than others.
o when a user queries for a resource to be tagged, the chosen
resource is dictated by a power law.
– The number of distinct tags owned by a user is relatively
small.
o the transition probability of using an existing tag is higher
than the probability of creating a new tag definition.
– Distinct tag usage increases with the increase in the
number of users and resources in the system
o internal transition probabilities in the tagging state are a function
of the previous state
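The first two observations translate directly into sampling rules. The sketch below is an assumption-laden illustration: the function names, the power-law exponent `alpha`, and the reuse probability `p_reuse` are hypothetical parameters, not values from the slides. The third observation (dependence on the previous state) is not modelled here.

```python
import random

def pick_resource(num_resources, rng, alpha=1.0):
    """Pick a resource index with probability proportional to 1 / rank**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_resources + 1)]
    return rng.choices(range(num_resources), weights=weights, k=1)[0]

def pick_tag(user_tags, rng, p_reuse=0.9):
    """Reuse one of the user's tags with probability p_reuse, else define a new one.

    Returns (tag, created), where created is True for a new tag definition.
    """
    if user_tags and rng.random() < p_reuse:
        return rng.choice(user_tags), False
    new_tag = f"tag{len(user_tags)}"
    user_tags.append(new_tag)
    return new_tag, True

rng = random.Random(0)
picks = [pick_resource(100, rng) for _ in range(10_000)]
user_tags = []
created = sum(pick_tag(user_tags, rng)[1] for _ in range(1_000))
```

With these settings the lowest-ranked resource index is drawn far more often than one in the middle of the range, and only a small share of the 1,000 tagging operations define a new tag.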
Workload Generation
• Search for resources
• Find all other tags or popular tag-value pairs on a given resource.
• Suggest tags for a selected resource.
• Find resources that are “linked” to a set of
resources
• Find resources where a value like ‘V’ exists.
– This query fetches all tags with which the user is
associated.
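Two of the queries listed for this workload can be sketched against a triple-style table. The table layout, column names, and sample rows are assumptions made for illustration, and SQLite stands in for the benchmarked databases; the actual benchmark queries are not shown on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (owner TEXT, resource TEXT, tag TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?, ?)",
    [
        ("u1", "r1", "species", "mouse"),
        ("u2", "r1", "species", "mouse"),
        ("u2", "r1", "assay", "rna-seq"),
        ("u1", "r2", "species", "human"),
    ],
)

def tags_on_resource(conn, resource):
    """Tag-value pairs on a resource, most frequently assigned first."""
    return conn.execute(
        "SELECT tag, value, COUNT(*) AS n FROM triple WHERE resource = ? "
        "GROUP BY tag, value ORDER BY n DESC, tag",
        (resource,),
    ).fetchall()

def resources_with_value_like(conn, pattern):
    """Resources that carry any tag whose value matches a SQL LIKE pattern."""
    rows = conn.execute(
        "SELECT DISTINCT resource FROM triple WHERE value LIKE ? ORDER BY resource",
        (pattern,),
    )
    return [resource for (resource,) in rows]
```

Under a horizontal or vertically partitioned schema the same questions would need different SQL (a scan over many columns, or a union over per-tag tables), which is one reason the schema choice affects query cost.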
Fig. 2. The Markov Chain (a) States and some state transitions, (b) The composite tagging state.
Framework for Evaluating Tagging Datasets
• Tagging modifies the underlying database
schema, adding attributes or inserting values
– defining new tags increases the sparseness of the
dataset
– reusing existing tags or adding new values decreases
the sparseness of the database
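The effect of the two kinds of tagging operation can be checked with a toy calculation. The measure used here, the fraction of empty cells in a resources-by-tags table, is an assumption chosen for illustration; it is not necessarily the sparseness metric used by the benchmark.

```python
def empty_cell_fraction(assignments, num_resources):
    """Fraction of empty cells in a resources x tags (horizontal) table.

    assignments: set of (resource, tag) pairs that have a value.
    """
    tags = {tag for _, tag in assignments}
    total_cells = num_resources * len(tags)
    if total_cells == 0:
        return 0.0
    return 1.0 - len(assignments) / total_cells

# Two resources, two tags, three filled cells: 1 of 4 cells is empty.
base = {("r1", "t1"), ("r2", "t1"), ("r1", "t2")}
before = empty_cell_fraction(base, num_resources=2)

# Defining a new tag t3 on one resource: 2 of 6 cells are empty.
after_new_tag = empty_cell_fraction(base | {("r1", "t3")}, num_resources=2)

# Reusing the existing tag t2 on r2: 0 of 4 cells are empty.
after_reuse = empty_cell_fraction(base | {("r2", "t2")}, num_resources=2)
```

Here the empty fraction rises from 0.25 to about 0.33 when a new tag is defined and falls to 0 when an existing tag is reused, matching the two sub-points above.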
Framework for Evaluating Tagging Datasets
• Given a tagging (query and tag) workload,
generate a variety of schemas for a given
database system.
• Measure sparseness of the generated tagging
dataset using a sparseness metric s
[defining equation not recovered from the slide]
– Low s => vertically partitioned schema
– High s => horizontal schema
Experimental Setup
• OLTP Bench: client infrastructure that executes web
workloads on relational databases deployed on the cloud
– the tagging workload generator is plugged into the centralized Workload Manager.
– the Workload Manager generates a work queue, which is consumed by a
user-specified number of threads to control parallelism and
concurrency.
– all threads currently run on a single machine.
• We experiment with three databases:
– MySQL with InnoDB version 5.6,
– Postgres version 9.2, and
– SQL Server 2010
– DB_A, DB_B, and DB_C
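The work-queue pattern described for the client can be sketched in a few lines. This is not OLTP-Bench code (OLTP-Bench is a separate tool with its own implementation); the function `run_workload` and its parameters are hypothetical and only show how a fixed number of threads can drain a shared queue of operations.

```python
import queue
import threading

def run_workload(operations, num_threads, execute):
    """Drain a shared work queue with a fixed number of worker threads."""
    work = queue.Queue()
    for op in operations:
        work.put(op)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                op = work.get_nowait()
            except queue.Empty:
                return  # Queue drained: this worker is done.
            outcome = execute(op)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The `num_threads` argument plays the role of the user-specified thread count: raising it increases the number of operations in flight against the database at once. Results arrive in completion order, not submission order.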
EC2 setup
• Server DBMSs were deployed within the same
geographical region
• A single instance dedicated to workers and
collecting statistics and another instance running
the DBMS server
• For each DBMS server, we flushed the system’s
buffers before each experiment.
• As recommended by OLTP-Bench, to mitigate noise
in cloud environments, we ran all of our
experiments without restarting our EC2 instances
(where possible), and executed the benchmarks
multiple times and averaged the results.
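Repeating a run and averaging the results, as described in the last point, can be sketched as follows. The helper `measure` and its `run_once` callback are hypothetical names; the callback is assumed to execute one full benchmark run and return the number of operations it completed (at least one).

```python
import statistics
import time

def measure(run_once, repetitions=5):
    """Run a benchmark several times; return mean throughput and mean latency.

    Throughput is in operations per second, latency in seconds per operation.
    """
    throughputs, latencies = [], []
    for _ in range(repetitions):
        start = time.perf_counter()
        num_ops = run_once()
        elapsed = time.perf_counter() - start
        throughputs.append(num_ops / elapsed)
        latencies.append(elapsed / num_ops)
    return statistics.mean(throughputs), statistics.mean(latencies)
```

Reporting the spread (for example with `statistics.stdev`) alongside the mean would show how much noise the cloud environment adds from run to run.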
Comparing DB SaaS providers
• Given a schema and a tagging workload, which
database and cloud offering gives the best
trade-off between performance (good overall throughput
and low latency) and cost?
s = 0.37
(a) For s = 0.37
s=0.67
(b) For s = 0.55
(c) For s = 0.67
Fig. 3. Comparing the Four Schemas Across Database Offerings
DB_A and different machine sizes
Fig. 4. Performance vs. Cost Tradeoff for DB_A
Conclusion
• Tagging is becoming popular in Web 2.0 applications as a
tool for managing large volumes of information
• It is also being introduced as part of scientific workflows and
datasets
• Running a high-performance tagging service can be tricky
– Right data model
– Right database
• Proposed a benchmark for a tagging service
– it is open source, though not yet released
– tanum@ci.uchicago.edu
• Results show the benefits of using a benchmark

Radiant Call girls in Dubai O56338O268 Dubai Call girlsstephieert
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024APNIC
 

Recently uploaded (20)

Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOG
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
 
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
 
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICECall Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call GirlVIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
Radiant Call girls in Dubai O56338O268 Dubai Call girls
Radiant Call girls in Dubai O56338O268 Dubai Call girlsRadiant Call girls in Dubai O56338O268 Dubai Call girls
Radiant Call girls in Dubai O56338O268 Dubai Call girls
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 

Benchmarking Cloud Databases for Tagging Services

  • 1. www.ci.anl.gov www.ci.uchicago.edu Benchmarking Cloud-based Tagging Services Tanu Malik, Kyle Chard, Ian Foster Computation Institute Argonne National Laboratory and University of Chicago
  • 2. Quiz: Blacksmith’s tools; X-Y axes as a tool for calculus; Bayes’ theorem as a tool for probability theory
  • 4. A Tagging Service • Suppose we want to build a large-scale tagging service that allows tagging of data files 1. What is the best way to store this large number of tags and their values in a database? 2. Which cloud-based database offering should we choose?
  • 5. Storing Tags: Many Possibilities • A resource-tag-value is a triple, like an entity-attribute-value – NoSQL stores – triple stores – relational stores o horizontal schema o triple schema o vertically partitioned schema (also called decomposed storage model or columnar schema) o attribute schema
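The relational options on this slide can be made concrete with a small sketch. It uses Python's built-in sqlite3 module and stores the same tags in a triple schema, a horizontal schema, and a vertically partitioned schema. The sample data, the table names (`triples`, `horizontal`, `tag_<name>`), and the helper functions are assumptions for illustration only; the slides do not give table definitions, and the attribute schema is left out because the slides do not describe it.

```python
import sqlite3

# Hypothetical sample data: (resource, tag, value) triples.
TRIPLES = [
    ("file1", "author", "alice"),
    ("file1", "year", "2012"),
    ("file2", "author", "bob"),
    ("file3", "instrument", "APS"),
]


def build_triple(conn):
    """Triple schema: one narrow table, one row per (resource, tag, value)."""
    conn.execute("CREATE TABLE triples (resource TEXT, tag TEXT, value TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)


def build_horizontal(conn):
    """Horizontal schema: one wide table, one column per tag, NULL where unset."""
    tags = sorted({tag for _, tag, _ in TRIPLES})
    cols = ", ".join(f'"{tag}" TEXT' for tag in tags)
    conn.execute(f"CREATE TABLE horizontal (resource TEXT PRIMARY KEY, {cols})")
    for resource, tag, value in TRIPLES:
        conn.execute(
            "INSERT OR IGNORE INTO horizontal (resource) VALUES (?)", (resource,)
        )
        conn.execute(
            f'UPDATE horizontal SET "{tag}" = ? WHERE resource = ?',
            (value, resource),
        )


def build_vertical(conn):
    """Vertically partitioned schema: one two-column table per tag."""
    for resource, tag, value in TRIPLES:
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "tag_{tag}" (resource TEXT, value TEXT)'
        )
        conn.execute(f'INSERT INTO "tag_{tag}" VALUES (?, ?)', (resource, value))


conn = sqlite3.connect(":memory:")
build_triple(conn)
build_horizontal(conn)
build_vertical(conn)

# The same lookup ("author of file1") expressed against each layout.
author_triple = conn.execute(
    "SELECT value FROM triples WHERE resource = 'file1' AND tag = 'author'"
).fetchone()[0]
author_horizontal = conn.execute(
    "SELECT author FROM horizontal WHERE resource = 'file1'"
).fetchone()[0]
author_vertical = conn.execute(
    "SELECT value FROM tag_author WHERE resource = 'file1'"
).fetchone()[0]

# In the horizontal layout, unset tags show up as NULL cells.
file3_author = conn.execute(
    "SELECT author FROM horizontal WHERE resource = 'file3'"
).fetchone()[0]
```

Note how adding a new tag means a new row in the triple schema, a new column in the horizontal schema, and a new table in the vertically partitioned schema, which is why the choice of layout interacts with how sparse the tagging data is.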
  • 6. Which cloud-based offering to choose? • There is no clear consensus regarding the performance of cloud-based database offerings for sparse datasets • This motivates a tagging benchmark and an infrastructure that uses this benchmark to compare various cloud-based database offerings • Goal: determine which cloud-based platform provides the most efficient support for dynamic, sparse data
  • 7. Outline • Related Work • The Tagging Model • Workload Generation • Framework for Evaluating Tagging Workloads • Experiments
  • 8. Related Work • Twitter hashtags vs. tags in del.icio.us and Flickr – Messages are tagged, resulting in transient, trending social groups – Resources are tagged with semantically meaningful keywords – Tags with values • Cloud benchmarks – The CloudStone benchmark o Web 2.0 social application in which event resources are tagged with comments and ratings as tags o Considers database tuning as a complex process – OLTP-Bench o Infrastructure for monitoring performance and resource consumption of a cloud-based database using a variety of relational workloads o It does not support tagging workloads
  • 9. The Tagging Model • Paper excerpt: "... or by common entities. Similarly, if resources represent files, then they may be connected, under a containment relationship, with a virtual directory resource." • Fig. 1. The Tagging Model: users assign sets of tag-value pairs to resources, e.g. {t1,v1; t2,v2; t3,v3}, {t1,v1’; t4,v4; t5,v5}, {t5,v5’; t6,v6}, {t1,v1’’; t2,v2’}, {t4,v4’; t7,v7}, {t5,v5’’; t7,v7’}, {t3,v3’; t6,v6’}
  • 10. The Tagging Model • 3 modes to associate tag-value pairs with resources: – blind: users do not know which tags are assigned to a resource – viewable: users can see the tags associated with a resource – suggestive: the service suggests possible tags to users • Access control over resources, tags, and their values: – permissions are assigned on resources, similar to Unix-style permissions – policies are assigned at the individual and group level – policies on tag use, e.g., users may tag only the resources they created, or may allow use of their tags by any user – policies on creation or removal of a resource
  • 11. Key Observations • Some resources are tagged more frequently than others. – empirical studies have shown that a power law holds for this phenomenon. • The number of distinct tags owned by a user is relatively small. – the probability that a Flickr user has more than 750 distinct tags is roughly 0.1%. • Distinct tag usage increases with the number of users and resources in the system – the correlation coefficient is higher with the number of resources than with the number of users. J. Huang et al., “Conversational tagging in twitter,” Hypertext and Hypermedia. ACM, 2010. C. Marlow et al., “Ht06, tagging paper, taxonomy, flickr, academic article, to read,” Hypertext and Hypermedia. ACM, 2006.
  • 12. Workload Generation • No separate data and query generation phases – database tables and their attributes, i.e., tags, are decided by the incoming user workload • Session-based, closed-loop Markov-chain process in Python • Paper excerpt: "... all relationships and characteristics between resources, users, and tags that have been described in the previous section. We assume the most general options for the workload generator, in particular, no access control policy on resources, a suggestive mode of tagging, and tag creation as optional to resource creation. Based on these options, the following user operations are supported as part of the workload generation phase:" • S1: Add a user. • S2: A user uploads a resource. • S3: A user creates a tag definition and uses this definition to tag a resource with a value. • S4: A user queries for a resource. • Fig. 2. The Markov Chain: (a) states and some state transitions, (b) the composite tagging state.
  • 13. Workload Generation • Transition probabilities in the Markov chain reflect the key observations – Some resources are tagged more frequently than others. o when a user queries for a resource to be tagged, the chosen resource is dictated by a power law. – The number of distinct tags owned by a user is relatively small. o the transition probability of using an existing tag is higher than the probability of creating a new tag definition. – Distinct tag usage increases with the number of users and resources in the system o the internal transition probabilities in the tagging state are a function of the previous state
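The session-based Markov-chain generator described on slides 12 and 13 can be sketched as follows. This is a minimal illustration, not the authors' generator: the transition matrix, the tag-reuse probability `P_REUSE_TAG`, the power-law exponent `ZIPF_EXPONENT`, and all function names are assumed values chosen only so that the example runs. The sketch covers the first two observations (power-law resource choice, and tag reuse being likelier than tag creation); the state-dependent probabilities inside the composite tagging state are not modeled.

```python
import random

# States follow the user operations S1-S4 listed on the slides.
STATES = ["S1_add_user", "S2_upload_resource", "S3_tag_resource", "S4_query_resource"]

# Hypothetical transition probabilities (each row sums to 1); the slides give no values.
TRANSITIONS = {
    "S1_add_user":        [0.05, 0.60, 0.05, 0.30],
    "S2_upload_resource": [0.05, 0.25, 0.40, 0.30],
    "S3_tag_resource":    [0.05, 0.15, 0.50, 0.30],
    "S4_query_resource":  [0.05, 0.15, 0.60, 0.20],
}
P_REUSE_TAG = 0.8    # assumed: reusing a tag is likelier than defining a new one
ZIPF_EXPONENT = 1.0  # assumed power-law exponent for resource popularity


def pick_resource(rng, n_resources):
    """Choose a resource index with probability proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** ZIPF_EXPONENT) for rank in range(1, n_resources + 1)]
    return rng.choices(range(n_resources), weights=weights, k=1)[0]


def generate_session(length, seed=0):
    """Walk the chain for `length` steps and return the emitted operations."""
    rng = random.Random(seed)
    users = resources = tags = 0
    state = "S1_add_user"
    ops = []
    for _ in range(length):
        if state == "S1_add_user":
            ops.append(("add_user", users))
            users += 1
        elif state == "S2_upload_resource" or resources == 0:
            # Also taken when there is nothing to tag or query yet.
            ops.append(("upload", resources))
            resources += 1
        elif state == "S3_tag_resource":
            target = pick_resource(rng, resources)
            if tags > 0 and rng.random() < P_REUSE_TAG:
                tag = rng.randrange(tags)  # reuse an existing tag definition
            else:
                tag = tags                 # create a new tag definition
                tags += 1
            ops.append(("tag", target, tag))
        else:
            ops.append(("query", pick_resource(rng, resources)))
        state = rng.choices(STATES, weights=TRANSITIONS[state], k=1)[0]
    return ops


session = generate_session(200, seed=1)
```

Because the emitted operations decide which tags exist, the tables and attributes of the target database emerge from the workload itself, which matches the slide's point that there is no separate data generation phase.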
  • 14. Workload Generation: supported queries • Search for resources • Find all other tags or popular tag-value pairs on a given resource. • Suggest tags for a selected resource. • Find resources that are “linked” to a set of resources • Find resources where a value like ‘V’ exists. – This query fetches all tags with which the user is associated. • Fig. 2. The Markov Chain: (a) states and some state transitions, (b) the composite tagging state.
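A few of the queries on this slide can be sketched against a triple schema. The example below uses Python's built-in sqlite3 module; the `triples` table, the sample rows, and the function names are assumptions for illustration, and the tag-suggestion rule (most-used tags that the resource does not carry yet) is one simple stand-in, since the slides do not say how suggestions are computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (resource TEXT, tag TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical sample data.
        ("file1", "author", "alice"),
        ("file1", "year", "2012"),
        ("file2", "author", "alice"),
        ("file2", "format", "hdf5"),
        ("file3", "author", "bob"),
    ],
)


def tags_on(resource):
    """Find all tag-value pairs on a given resource."""
    rows = conn.execute(
        "SELECT tag, value FROM triples WHERE resource = ? ORDER BY tag",
        (resource,),
    )
    return rows.fetchall()


def suggest_tags(resource, limit=3):
    """Suggest the most-used tags that the resource does not carry yet."""
    rows = conn.execute(
        """
        SELECT tag, COUNT(*) AS uses FROM triples
        WHERE tag NOT IN (SELECT tag FROM triples WHERE resource = ?)
        GROUP BY tag ORDER BY uses DESC, tag LIMIT ?
        """,
        (resource, limit),
    )
    return [tag for tag, _ in rows]


def resources_with_value_like(pattern):
    """Find resources where a value matching the LIKE pattern exists."""
    rows = conn.execute(
        "SELECT DISTINCT resource FROM triples WHERE value LIKE ? ORDER BY resource",
        (pattern,),
    )
    return [resource for (resource,) in rows]
```

For example, `resources_with_value_like("ali%")` scans the single `value` column here, whereas under a horizontal schema the same search would have to inspect every tag column, which is one reason query cost depends on the chosen layout.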
  • 15. Framework for Evaluating Tagging Datasets • Tagging modifies the underlying database schema, adding attributes or inserting values – defining new tags increases the sparseness of the dataset – reusing existing tags or adding new values decreases the sparseness of the dataset
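The effect described on this slide can be illustrated with a simple stand-in measure: the fraction of empty cells in the resource-by-tag grid that a horizontal schema would contain. This is an assumption for illustration and is not the sparseness metric s used in the benchmark; the function name `null_fraction` and the sample triples are hypothetical.

```python
def null_fraction(triples):
    """Fraction of (resource, tag) cells with no value in a resource-by-tag grid."""
    resources = {resource for resource, _, _ in triples}
    tags = {tag for _, tag, _ in triples}
    filled = {(resource, tag) for resource, tag, _ in triples}
    total = len(resources) * len(tags)
    return 1.0 - len(filled) / total if total else 0.0


# Hypothetical starting dataset: 3 resources, 3 tags, 4 filled cells.
base = [
    ("file1", "author", "alice"),
    ("file1", "year", "2012"),
    ("file2", "author", "bob"),
    ("file3", "instrument", "APS"),
]

# Defining a new tag adds a mostly empty column.
with_new_tag = base + [("file1", "project", "tagging")]

# Reusing an existing tag fills a previously empty cell.
with_reused_tag = base + [("file2", "year", "2013")]

before = null_fraction(base)
after_new = null_fraction(with_new_tag)
after_reuse = null_fraction(with_reused_tag)
```

On this sample data the empty fraction rises when the new tag is defined and falls when an existing tag is reused, which is the behavior the slide describes.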
  • 16. Framework for Evaluating Tagging Datasets • Given a tagging (query and tag) workload, generate a variety of schemas for a given database system. • Measure the sparseness of the generated tagging dataset using a sparseness metric s (the formula is shown on the slide). • Low s => vertically partitioned schema • High s => horizontal schema
  • 17. Experimental Setup • OLTP-Bench: client infrastructure that executes web workloads on relational databases deployed on the cloud – the tagging workload generator is integrated into the centralized Workload Manager. – the Workload Manager generates a work queue, which is consumed by a user-specified number of threads to control parallelism and concurrency. – all the threads currently run on a single machine. • We experiment with three databases: – MySQL with InnoDB version 5.6, – Postgres version 9.2, and – SQL Server 2010 – referred to as DB_A, DB_B, and DB_C
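The work-queue pattern described on this slide, a queue of operations drained by a user-specified number of threads on one machine, can be sketched in Python as follows. This is not OLTP-Bench code or its API; `run_workload`, its parameters, and the returned statistics are hypothetical names used only to show the pattern.

```python
import queue
import threading
import time


def run_workload(operations, n_threads, execute):
    """Drain a work queue with n_threads workers and report simple statistics.

    `execute` is a caller-supplied function that runs one operation,
    for example by sending a query to the database under test.
    """
    work = queue.Queue()
    for op in operations:
        work.put(op)

    latencies = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                op = work.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            start = time.perf_counter()
            execute(op)
            elapsed = time.perf_counter() - start
            with lock:
                latencies.append(elapsed)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    started = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    wall = time.perf_counter() - started

    completed = len(latencies)
    return {
        "completed": completed,
        "throughput": completed / wall if wall > 0 else float("inf"),
        "mean_latency": sum(latencies) / completed if completed else 0.0,
    }


# Usage with a no-op executor standing in for real database calls.
stats = run_workload(range(100), n_threads=4, execute=lambda op: None)
```

Raising `n_threads` increases the number of operations in flight at once, which is how the thread count acts as the knob for parallelism and concurrency.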
  • 18. EC2 setup • Server DBMSs were deployed within the same geographical region • One instance was dedicated to running the workers and collecting statistics, and another instance ran the DBMS server • We flushed each DBMS server’s buffers before each experiment. • As recommended by OLTP-Bench, to mitigate noise in cloud environments, we ran all of our experiments without restarting our EC2 instances (where possible), executed the benchmarks multiple times, and averaged the results.
  • 19. Comparing DB SaaS providers • Given a schema and a tagging workload, which database and cloud offering provides the best trade-off between performance (good overall throughput and low latency) and cost?
  • 21. s = 0.67 • Fig. 3. Comparing the Four Schemas Across Database Offerings; visible panels: (b) for s = 0.55, (c) for s = 0.67
  • 22. DB_A and different machine sizes • Fig. 4. Performance vs. Cost Trade-off for DB_A
  • 23. Conclusion • Tagging is becoming popular in Web 2.0 applications as a tool for managing large volumes of information • It is also being introduced as part of scientific workflows and datasets • Running a high-performance tagging service can be tricky – it needs the right data model – and the right database • We proposed a benchmark for a tagging service – it is open source, though not yet released – contact: tanum@ci.uchicago.edu • Results show the benefits of using a benchmark