Benchmarking Cloud-based Tagging Services
Tanu Malik, Kyle Chard, Ian Foster
Computation Institute, Argonne National Laboratory and University of Chicago
www.ci.anl.gov | www.ci.uchicago.edu
Published at the Workshop on CloudDB, in conjunction with ICDE, 2014.
Slide 1: Benchmarking Cloud-based Tagging Services
Tanu Malik, Kyle Chard, Ian Foster
Computation Institute, Argonne National Laboratory and University of Chicago
Slide 2: Quiz
• Blacksmith's tools
• X-Y axes as a tool for calculus
• Bayes' theorem as a tool for probability theory
Slide 3: Tagging: A Tool for Information Management
Slide 4: A Tagging Service
• Suppose we want to build a large-scale tagging service that allows tagging of data files.
  1. What is the best way to store this large number of tags and their values in a database?
  2. Which cloud-based database offering should we choose?
Slide 5: Storing Tags: Many Possibilities
• resource-tag-value is a triple, like entity-attribute-value
  – NoSQL stores
  – triple stores
  – relational stores
    o horizontal schema
    o triple schema
    o vertically partitioned schema (decomposed storage model / columnar schema)
    o attribute schema
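The relational alternatives above can be sketched concretely. The following is an illustrative comparison (not the paper's exact DDL) of a triple schema versus a horizontal schema for the same tag data, using SQLite; table and column names are made up for the example. The horizontal layout makes the sparseness problem visible: every resource row carries a column for every distinct tag, so absent tags become NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Triple schema: one row per (resource, tag, value) triple.
cur.execute("CREATE TABLE triples (resource TEXT, tag TEXT, value TEXT)")
cur.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("file1", "author", "alice"), ("file1", "year", "2014"),
     ("file2", "author", "bob")],
)

# Horizontal schema: one column per distinct tag; tags a resource
# does not carry are NULL, which is what makes the table sparse.
cur.execute("CREATE TABLE horizontal (resource TEXT, author TEXT, year TEXT)")
cur.executemany(
    "INSERT INTO horizontal VALUES (?, ?, ?)",
    [("file1", "alice", "2014"), ("file2", "bob", None)],
)

# The same lookup against both layouts:
cur.execute("SELECT value FROM triples WHERE resource='file1' AND tag='author'")
print(cur.fetchone()[0])  # alice
cur.execute("SELECT author FROM horizontal WHERE resource='file1'")
print(cur.fetchone()[0])  # alice
```

Note the trade-off the benchmark probes: the triple schema never changes shape as new tags appear, while the horizontal schema requires an ALTER TABLE per new tag but can answer single-resource lookups from one row.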
Slide 6: Which Cloud-based Offering Should We Choose?
• No clear consensus regarding the performance of cloud-based database offerings for sparse datasets
• This motivates a tagging benchmark, and an infrastructure that uses the benchmark to compare various cloud-based database offerings
• Goal: determine which cloud-based platform provides the most efficient support for dynamic, sparse data
Slide 7: Outline
• Related Work
• The Tagging Model
• Workload Generation
• Framework for Evaluating Tagging Workloads
• Experiments
Slide 8: Related Work
• Twitter hashtags vs. tags in del.icio.us, Flickr
  – Messages are tagged, resulting in transient, trending social groups
  – Resources are tagged with semantically meaningful keywords
  – Tags with values
• Cloud benchmarks
  – The CloudStone benchmark
    o Web 2.0 social application in which event resources are tagged with comments and ratings as tags
    o Treats database tuning as a complex process
  – OLTP-Bench
    o Infrastructure for monitoring performance and resource consumption of a cloud-based database using a variety of relational workloads
    o Does not support tagging workloads
Slide 9: The Tagging Model
• Users associate sets of tag-value pairs (e.g., {t1,v1; t2,v2; t3,v3}) with resources
• Resources may be connected to one another; for example, if resources represent files, they may be connected, under a containment relationship, with a virtual directory resource
[Fig. 1. The Tagging Model]
Slide 10: The Tagging Model
• Three modes of associating tag-value pairs with resources:
  – blind: users do not know the tags assigned to a resource
  – viewable: users can see the tags associated with a resource
  – suggestive: the service suggests possible tags to users
• Access control over resources, tags, and their values:
  – permissions on resources, similar to Unix-style permissions
  – policies assigned at the individual and group level
  – users may tag only the resources they created, or may allow use of their tags by any user
  – policies on the creation or removal of a resource
Slide 11: Key Observations
• Some resources are tagged more frequently than others.
  – Empirical studies have shown that this phenomenon follows a power law.
• The number of distinct tags owned by a user is relatively small.
  – The probability that a Flickr user has more than 750 distinct tags is roughly 0.1%.
• Distinct tag usage increases with the number of users and resources in the system.
  – It is more strongly correlated with the number of resources than with the number of users.

J. Huang et al., "Conversational tagging in Twitter," Hypertext and Hypermedia, ACM, 2010.
C. Marlow et al., "HT06, tagging paper, taxonomy, Flickr, academic article, to read," Hypertext and Hypermedia, ACM, 2006.
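The first observation — resource popularity following a power law — can be sketched with weighted sampling. This is a hedged illustration: the exponent `s = 1.0` and the resource count are placeholders, not parameters from the paper.

```python
import random

def zipf_choice(resources, s=1.0, rng=random):
    """Pick a resource so that the rank-r resource is chosen with
    probability proportional to 1 / r**s (a Zipf-like power law)."""
    weights = [1.0 / (rank ** s) for rank in range(1, len(resources) + 1)]
    return rng.choices(resources, weights=weights, k=1)[0]

resources = [f"res{i}" for i in range(100)]
picks = [zipf_choice(resources) for _ in range(10_000)]

# Head of the distribution is tagged far more often than the tail.
print(picks.count("res0"), picks.count("res50"))
```

Under these placeholder parameters the top-ranked resource receives on the order of fifty times more tagging events than a mid-ranked one, which is the skew a workload generator has to reproduce.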
Slide 12: Workload Generation
• No separate data and query generation phases
  – database tables and their attributes, i.e., tags, are decided by the incoming user workload
• Session-based, closed-loop Markov-chain process, implemented in Python
• We assume the most general options for the workload generator: no access control policy on resources, a suggestive mode of tagging, and tag creation as optional to resource creation. Based on these options, the following user operations are supported as part of the workload generation phase:
  – S1: Add a user.
  – S2: A user uploads a resource.
  – S3: A user creates a tag definition and uses this definition to tag a resource with a value.
  – S4: A user queries for a resource.
[Fig. 2. The Markov Chain: (a) states and some state transitions, (b) the composite tagging state]
Slide 13: Workload Generation
• The transition probabilities in the Markov chain encode the key observations:
  – Some resources are tagged more frequently than others.
    o When a user queries for a resource to be tagged, the chosen resource is dictated by a power law.
  – The number of distinct tags owned by a user is relatively small.
    o The transition probability of reusing an existing tag is higher than the probability of creating a new tag definition.
  – Distinct tag usage increases with the number of users and resources in the system.
    o The internal transition probabilities in the tagging state are a function of the previous state.
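The session-based Markov chain over states S1-S4 can be sketched as follows. The transition probabilities here are illustrative placeholders, not the paper's values; the only property carried over from the slide is that staying in the tagging state (reusing an existing tag) is more likely than creating a new tag definition.

```python
import random

# States: S1 add user, S2 upload resource, S3 tag a resource, S4 query.
# Probabilities are made-up placeholders for illustration.
TRANSITIONS = {
    "S1": {"S2": 0.8, "S4": 0.2},
    "S2": {"S3": 0.6, "S4": 0.4},
    "S3": {"S3": 0.5, "S4": 0.3, "S2": 0.2},  # S3->S3: reuse an existing tag
    "S4": {"S3": 0.5, "S2": 0.3, "S1": 0.2},
}

def session(length=10, start="S1", rng=random):
    """Generate one closed-loop user session as a sequence of operations."""
    state, ops = start, [start]
    for _ in range(length - 1):
        nxt = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in nxt]
        state = rng.choices(nxt, weights=weights, k=1)[0]
        ops.append(state)
    return ops

print(session())
```

Because the generator is closed-loop, each emitted operation would be executed against the database before the next transition is taken, so the schema (the set of tags) evolves with the workload rather than being fixed up front.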
Slide 14: Workload Generation
Query operations:
• Search for resources.
• Find all other tags, or popular tag-value pairs, on a given resource.
• Suggest tags for a selected resource.
• Find resources that are "linked" to a set of resources.
• Find resources where a value like 'V' exists.
  – This query fetches all tags with which the user is associated.
Slide 15: Framework for Evaluating Tagging Datasets
• Tagging modifies the underlying database schema, adding attributes or inserting values
  – defining new tags increases the sparseness of the dataset
  – reusing existing tags or adding new values decreases the sparseness of the database
Slide 16: Framework for Evaluating Tagging Datasets
• Given a tagging (query and tag) workload, generate a variety of schemas for a given database system.
• Measure the sparseness of the generated tagging dataset using a sparseness metric s (formula shown on slide):
  – low s => vertically partitioned schema
  – high s => horizontal schema
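The slide's formula for s did not survive extraction, so as an illustrative stand-in the sketch below measures sparseness as the fraction of empty cells in the resource-by-tag matrix. This definition is an assumption for the example, not the paper's metric.

```python
def sparseness(tagging):
    """Fraction of empty cells in the resource x tag matrix.

    tagging: dict mapping resource -> {tag: value}.
    NOTE: illustrative definition, not the paper's formula.
    """
    tags = {t for pairs in tagging.values() for t in pairs}
    total_cells = len(tagging) * len(tags)
    filled_cells = sum(len(pairs) for pairs in tagging.values())
    return 1.0 - filled_cells / total_cells

data = {
    "r1": {"author": "alice", "year": "2014"},
    "r2": {"author": "bob"},
    "r3": {"format": "csv"},
}
# 3 resources x 3 distinct tags = 9 cells, 4 of them filled.
print(round(sparseness(data), 2))  # 0.56
```

Whatever the exact formula, the role of s in the framework is the same: it summarizes how empty the logical resource-tag matrix is, and thereby predicts which physical schema (vertically partitioned for low s, horizontal for high s) should perform best.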
Slide 17: Experimental Setup
• OLTP-Bench: client infrastructure that executes web workloads on relational databases deployed on the cloud
  – the workload generator feeds into the centralized Workload Manager
  – the manager generates a work queue, which is consumed by a user-specified number of threads to control parallelism and concurrency
  – all threads currently run on a single machine
• We experiment with three databases:
  – MySQL with InnoDB, version 5.6
  – Postgres, version 9.2
  – SQL Server 2010
  – anonymized in the results as DB_A, DB_B, and DB_C
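The manager-queue-threads arrangement above can be sketched with Python's standard library. This is a simplified sketch of the client side only; `execute_op` is a hypothetical stub standing in for issuing a SQL operation to the DBMS under test.

```python
import queue
import threading

def execute_op(op):
    """Stub for sending one workload operation to the database."""
    return f"done:{op}"

def worker(work_queue, results):
    """Consume operations from the shared queue until a sentinel arrives."""
    while True:
        op = work_queue.get()
        if op is None:          # sentinel: no more work for this thread
            work_queue.task_done()
            break
        results.append(execute_op(op))
        work_queue.task_done()

def run(ops, num_threads=4):
    """Central manager: fill the work queue, let num_threads drain it."""
    work_queue, results = queue.Queue(), []
    threads = [threading.Thread(target=worker, args=(work_queue, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for op in ops:
        work_queue.put(op)
    for _ in threads:
        work_queue.put(None)    # one sentinel per thread
    for t in threads:
        t.join()
    return results

print(len(run([f"op{i}" for i in range(20)])))  # 20
```

Varying `num_threads` is how such a harness controls the concurrency offered to the server, which matters when comparing cloud database offerings at different load levels.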
Slide 18: EC2 Setup
• Server DBMSs were deployed within the same geographical region
• One instance was dedicated to the workers and to collecting statistics; another instance ran the DBMS server
• For each DBMS server, we flushed the system's buffers before each experiment
• As recommended by OLTP-Bench, to mitigate noise in cloud environments, we ran all experiments without restarting our EC2 instances (where possible), executed the benchmarks multiple times, and averaged the results
Slide 19: Comparing DB SaaS Providers
• Given a schema and a tagging workload, which database and cloud offering provides the best trade-off between performance (good overall throughput and low latency) and cost?
Slide 20: [Fig. 3(a). Comparing the four schemas across database offerings, for s = 0.37]
Slide 21: [Fig. 3(b), (c). Comparing the four schemas across database offerings, for s = 0.55 and s = 0.67]
Slide 22: DB_A and Different Machine Sizes
[Fig. 4. Performance vs. Cost Trade-off for DB_A]
Slide 23: Conclusion
• Tagging is becoming popular in Web 2.0 applications as a tool for managing large volumes of information
• It is also being introduced as part of scientific workflows and datasets
• Running a high-performance tagging service can be tricky:
  – choosing the right data model
  – choosing the right database
• We proposed a benchmark for a tagging service
  – it is open source, though not yet released (contact tanum@ci.uchicago.edu)
• Results show the benefits of using such a benchmark