Data structures for cloud tag storage

Tagging schema design for high performance

Plan
▪ Tagging basics
▪ Database challenges
▪ Tagging solutions
▪ Pros and cons
▪ Q&A session

Tagging terms
• A tag is a non-hierarchical keyword or term assigned to a piece of information
• Tags are generally chosen informally and personally by the item's creator or by its viewer
• If tags are assigned by the creator and drawn from a limited set, the scheme is a taxonomy
• If tags are assigned by the viewer and are unlimited, the scheme is a folksonomy
• Widely used since 2003, starting with the Flickr and Delicious websites
• Tags are usually shown inline as well as in a tag cloud

Tagging challenges
Advantages (+)
1. The vocabulary used directly reflects the user’s vocabulary
2. Flexibility: the user can add or remove tags
3. Multi-dimensional nature: users can assign any number and combination of tags to express a concept
lead to
Drawbacks (-)
1. Specialized tags or tags that are meaningless to anyone but their author, misspellings, singular/plural forms, compound words
2. Tags that are often ambiguous, overly personalized, or poorly applied
3. Synonyms, acronyms and homonyms, which aren’t handled well

Database challenges
1. Performance
2. Query awkwardness
3. Database size
4. Housekeeping

Full-text-search oriented solutions
Stack Overflow: <php><mysql><guid><encryption>
JSON: {"tags": ["php", "apache2", "openinviter"]}
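
Both formats above keep a document's tags in a single text field. As a rough illustration, here is a minimal Python sketch that turns each format back into a list of tag names. The function names are hypothetical and the parsing rules are assumptions based only on the two sample strings shown.

```python
import json
import re


def parse_angle_tags(raw: str) -> list[str]:
    """Split a string such as '<php><mysql>' into tag names."""
    # Assumption: tag names never contain '<' or '>'.
    return re.findall(r"<([^<>]+)>", raw)


def parse_json_tags(raw: str) -> list[str]:
    """Read the tag list from a JSON object with a top-level 'tags' array."""
    return list(json.loads(raw)["tags"])


angle_tags = parse_angle_tags("<php><mysql><guid><encryption>")
json_tags = parse_json_tags('{"tags": ["php", "apache2", "openinviter"]}')
print(angle_tags)
print(json_tags)
```

Either way the tags travel with the document as one value, which is what lets a full-text index treat each tag as a searchable token.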

Full-text-search approaches
[Diagram: two architectures]
Approach 1: application server + FTS inside the DB (FTS model)
Approach 2: application server + FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc.) + DB (relational/denormalized/FTS model)
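
Wherever the index lives, full-text engines typically answer tag queries from an inverted index that maps each term to the documents containing it. The following is a toy pure-Python sketch of that idea only; it is not how Lucene, Sphinx or any other engine named above is implemented, and all names in it are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map every tag to the set of document ids that carry it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return dict(index)


def search_all(index: dict[str, set[int]], tags: list[str]) -> set[int]:
    """Return ids of documents that carry every requested tag (AND search)."""
    postings = [index.get(tag, set()) for tag in tags]
    if not postings:
        return set()
    return set.intersection(*postings)


docs = {1: ["php", "mysql"], 2: ["php", "apache2"], 3: ["mysql", "guid"]}
index = build_inverted_index(docs)
both = search_all(index, ["php", "mysql"])
print(both)
```

An AND search becomes a set intersection over per-tag posting lists, with no join over a link table.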

Housekeeping
Denormalized/FTS
1. Update the tag in every affected document when a tag name changes
FTS
1. Rebuild the FTS index when it becomes fragmented
2. Refresh the FTS index if it isn’t refreshed on COMMIT
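
To make the rename cost concrete, here is a small sketch using Python's built-in sqlite3 module. The table names (tag, post_denorm) and the '<tag>' storage format are assumptions for illustration, not the schema used in the tests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Relational model: each tag name is stored once.
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    -- Denormalized model: each post carries its own copy of the tag names.
    CREATE TABLE post_denorm (id INTEGER PRIMARY KEY, tags TEXT NOT NULL);
    INSERT INTO tag (name) VALUES ('php'), ('mysql');
    INSERT INTO post_denorm (id, tags) VALUES
        (1, '<php><mysql>'), (2, '<mysql>'), (3, '<php>');
""")

# Relational: renaming a tag updates exactly one row.
cur = conn.execute("UPDATE tag SET name = 'mariadb' WHERE name = 'mysql'")
relational_rows = cur.rowcount

# Denormalized: every post that mentions the tag has to be rewritten.
cur = conn.execute(
    "UPDATE post_denorm SET tags = replace(tags, '<mysql>', '<mariadb>') "
    "WHERE tags LIKE '%<mysql>%'"
)
denormalized_rows = cur.rowcount

final_tags = [row[0] for row in
              conn.execute("SELECT tags FROM post_denorm ORDER BY id")]
print(relational_rows, denormalized_rows, final_tags)
```

With three toy posts the difference is one row against two; on a real data set the denormalized rename touches every tagged document.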

Test example
Stack Overflow posts via http://data.stackexchange.com/
From 31/07/2008 to 21/12/2012
Posts: 2 680 474
Applied tags: 7 791 527
Unique tags used: 30 485
Maximum tags per post: 5
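
Two ratios derived from these figures help when reading the measurements: how many tags an average post carries and how often an average tag is applied. A quick sketch of the arithmetic (variable names are illustrative):

```python
posts = 2_680_474
applied_tags = 7_791_527
unique_tags = 30_485

# Average number of tags attached to one post.
avg_tags_per_post = applied_tags / posts
# Average number of posts one unique tag is applied to.
avg_posts_per_tag = applied_tags / unique_tags

print(f"{avg_tags_per_post:.2f} tags per post")
print(f"{avg_posts_per_tag:.1f} posts per tag")
```

That is roughly 2.9 tags per post, so the post-to-tag link data holds about three times as many rows as there are posts.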

Comparison
Initial population time
Model                 Insert time, seconds
Relational            1048
Denormalized          1205
Complex data type     2086
Full text search      1950
[Bar chart: insert time per model, seconds]

Comparison
DB size
Model                 Size total, MB   Data size, MB   Index size, MB
Relational            1166             338             828
Denormalized          1080             376             704
Complex data type     1134             256             878
Full text search      1055             416             639
[Bar chart: index size, data size and total size per model, MB]

Comparison
Search by document id and all tag retrieval
Model                 Speed with cold cache, seconds   Speed with hot cache, seconds
Relational            0.2                              0.003
Denormalized          0.07                             0.002
Complex data type     0.9                              0.002
Full text search      0.3                              0.001
[Bar charts: speed with cold cache and speed with hot cache per model, seconds]
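
For orientation, this is roughly what the measured operation looks like in the relational model. The three-table schema below (post, tag, post_tag) is an assumed, conventional layout sketched with Python's sqlite3 module; it is not the actual test schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    -- Link table: one row per (post, tag) pair.
    CREATE TABLE post_tag (
        post_id INTEGER NOT NULL REFERENCES post(id),
        tag_id  INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (post_id, tag_id)
    );
    INSERT INTO post VALUES (1, 'Encrypting GUIDs'), (2, 'Apache setup');
    INSERT INTO tag VALUES (1, 'php'), (2, 'mysql'), (3, 'apache2');
    INSERT INTO post_tag VALUES (1, 1), (1, 2), (2, 1), (2, 3);
""")


def tags_for_post(conn: sqlite3.Connection, post_id: int) -> list[str]:
    """Return all tag names of one post, sorted alphabetically."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM post_tag AS pt
        JOIN tag AS t ON t.id = pt.tag_id
        WHERE pt.post_id = ?
        ORDER BY t.name
        """,
        (post_id,),
    )
    return [name for (name,) in rows]


post_1_tags = tags_for_post(conn, 1)
post_2_tags = tags_for_post(conn, 2)
print(post_1_tags, post_2_tags)
```

The relational model needs a join to list a post's tags, while the denormalized and FTS models read them from the post row itself.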

Comparison
Search using 1 tag and all tag retrieval
Model                 Speed with cold cache, seconds   Speed with hot cache, seconds
Relational            1                                0.005
Denormalized          0.7                              0.004
Complex data type     1.7                              0.005
Full text search      0.7                              0.002
[Bar charts: speed with cold cache and speed with hot cache per model, seconds]
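
As an illustration of the same query against a denormalized model, the sketch below stores tags in one delimited text column and searches it with LIKE. The table name post_denorm and the '<tag>' format are assumptions, and a leading-wildcard LIKE like this one generally cannot use an ordinary B-tree index.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_denorm (id INTEGER PRIMARY KEY, tags TEXT NOT NULL);
    INSERT INTO post_denorm VALUES
        (1, '<php><mysql>'),
        (2, '<apache2>'),
        (3, '<php><guid>'),
        (4, '<phpunit>');
""")


def posts_by_tag(conn: sqlite3.Connection, tag: str) -> dict[int, list[str]]:
    """Return {post id: all tags of that post} for posts carrying `tag`."""
    # The delimiters keep 'php' from matching 'phpunit'.
    # Assumption: tag names contain no LIKE wildcards ('%' or '_').
    pattern = f"%<{tag}>%"
    rows = conn.execute(
        "SELECT id, tags FROM post_denorm WHERE tags LIKE ? ORDER BY id",
        (pattern,),
    )
    return {post_id: re.findall(r"<([^<>]+)>", tags) for post_id, tags in rows}


php_posts = posts_by_tag(conn, "php")
print(php_posts)
```

All tags of each hit come back with the row, so no second lookup is needed.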

Comparison
Search by AND using 2 tags and all tag retrieval
Model                 Speed with cold cache, seconds   Speed with hot cache, seconds
Relational            40                               34
Denormalized          34                               20
Complex data type     34                               14
Full text search      20                               2
[Bar chart: search speed with hot and cold cache per model, seconds]
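
One common way to write the AND search in a relational model is to group the link rows per post and keep the posts that matched every requested tag. The sketch below uses Python's sqlite3 module with an assumed tag/post_tag schema; the queries actually measured may have been written differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE post_tag (
        post_id INTEGER NOT NULL,
        tag_id  INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (post_id, tag_id)
    );
    INSERT INTO tag VALUES (1, 'php'), (2, 'mysql'), (3, 'guid');
    INSERT INTO post_tag VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3);
""")


def posts_with_all_tags(conn, names):
    """Return {post id: all tags of that post} for posts carrying every name."""
    names = sorted(set(names))
    if not names:
        return {}
    placeholders = ", ".join("?" for _ in names)
    # Step 1: a post qualifies when it matched as many distinct tags as requested.
    matching = conn.execute(
        f"""
        SELECT pt.post_id
        FROM post_tag AS pt
        JOIN tag AS t ON t.id = pt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY pt.post_id
        HAVING COUNT(DISTINCT t.name) = ?
        """,
        (*names, len(names)),
    ).fetchall()
    # Step 2: retrieve the full tag list of every qualifying post.
    result = {}
    for (post_id,) in matching:
        rows = conn.execute(
            "SELECT t.name FROM post_tag AS pt JOIN tag AS t ON t.id = pt.tag_id "
            "WHERE pt.post_id = ? ORDER BY t.name",
            (post_id,),
        )
        result[post_id] = [name for (name,) in rows]
    return result


hits = posts_with_all_tags(conn, ["php", "mysql"])
print(hits)
```

The grouping step reads every link row for the requested tags, which is one plausible reason the relational model is the slowest in this test.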

Comparison
Cloud tag population
Model                        Speed, seconds
Relational                   20
Relational simplified        18
Relational without FK        202
Denormalized                 18
Complex data type (array)    21
Full text search             40
[Bar chart: tag cloud population speed per model, seconds]
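
Populating a tag cloud comes down to counting how often each tag is applied and keeping the most frequent ones. A minimal in-memory sketch of that aggregation, assuming the tags have already been read out of storage (the function name and sample data are hypothetical):

```python
from collections import Counter


def tag_cloud(posts: list[list[str]], top_n: int) -> list[tuple[str, int]]:
    """Count tag usage across posts and return the top_n most used tags."""
    counts = Counter(tag for tags in posts for tag in tags)
    return counts.most_common(top_n)


posts = [
    ["php", "mysql"],
    ["php", "apache2"],
    ["php", "mysql", "guid"],
]
cloud = tag_cloud(posts, top_n=2)
print(cloud)
```

In a database the same result is a GROUP BY over the tag column or link table, which is the operation timed above.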

Pros & Cons
Criterion                    Relational     Denormalized   Complex data type   Full text search
Space consumption            worst          moderate       moderate            optimal
Search performance           worst          moderate       moderate            optimal
Insert performance           highest        good           worst               moderate
Maintenance                  minimal        required       required            required
Additional housekeeping      not required   required       required            required
Risk of failure              no             no             no                  yes
Search queries development   worst          moderate       moderate            optimal

Conclusion
1. Choose your best model based on:
• Performance (search/insert/update)
• Space consumption
• Engineer experience
• Hardware cost
• Software cost
2. Check each storage model on your own RDBMS - don’t be afraid to try and measure
3. Understanding how complex data types are stored internally is crucial
4. Understanding how FTS works internally is crucial
5. Investigate your DBMS’s unique features
There is no silver bullet for a tag storage model!

Contacts
Feel free to ask any db-related questions: shtock@mail.ru
