IT talk SPb: Найдется все ("Everything Will Be Found")

TAGGING SCHEMA DESIGN
FOR HIGH PERFORMANCE
Alexander Tokarev
Senior Developer, DataArt
atokarev@dataart.com

25 November 2016
Plan
• Tagging basics
• Database challenges
• Tagging solutions
• Pros and cons
• Q&A session

Tagging terms
• A tag is a non-hierarchical keyword or term assigned to a piece of information
• Tags are generally chosen informally and personally by the item's creator or by its viewer
• If tags are assigned by the creator and are limited, it is a taxonomy
• If tags are assigned by the viewer and are unlimited, it is a folksonomy
• Tags came into wide use from 2003 on the Flickr and Delicious web sites
• Tags are usually shown inline or as a tag cloud

Tagging terms
[Figure: inline tags and a tag cloud]

Tagging challenges
Advantages (+)
1. The vocabulary used directly reflects the users' vocabulary
2. Flexibility: users can add or remove tags
3. Multi-dimensional nature: users can assign any number and combination of tags to express a concept
Disadvantages (-)
1. Specialized tags or tags meaningless to anyone but their author, misspellings, singular/plural forms, compound words
2. Tags are often ambiguous, overly personalized, or poorly applied
3. Synonyms, acronyms, and homonyms are not handled well

Database challenges
1. Performance
2. Query awkwardness
3. Database size
4. Housekeeping

Highly normalized approach
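The slide's diagram did not survive extraction, so the following is only a minimal sketch of what a normalized tag schema commonly looks like, using Python's built-in sqlite3 module. The table names (documents, tags, document_tags) and the helper functions are assumptions made for illustration, not the schema used in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical three-table layout: documents, tags, and a link table.
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE document_tags (
        document_id INTEGER NOT NULL REFERENCES documents(id),
        tag_id      INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (document_id, tag_id)
    );
""")


def add_document(doc_id, title, tag_names):
    """Insert a document and link it to its tags, creating tags on demand."""
    conn.execute("INSERT INTO documents (id, title) VALUES (?, ?)", (doc_id, title))
    for name in set(tag_names):
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        (tag_id,) = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)
        ).fetchone()
        conn.execute("INSERT INTO document_tags VALUES (?, ?)", (doc_id, tag_id))


def find_with_all_tags(tag_names):
    """Return ids of documents that carry every tag in tag_names (AND search)."""
    wanted = sorted(set(tag_names))
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    rows = conn.execute(
        f"""SELECT dt.document_id
            FROM document_tags dt
            JOIN tags t ON t.id = dt.tag_id
            WHERE t.name IN ({placeholders})
            GROUP BY dt.document_id
            HAVING COUNT(DISTINCT t.name) = ?
            ORDER BY dt.document_id""",
        (*wanted, len(wanted)),
    )
    return [row[0] for row in rows]


add_document(1, "Post A", ["php", "mysql"])
add_document(2, "Post B", ["php", "guid"])
add_document(3, "Post C", ["php", "mysql", "encryption"])
print(find_with_all_tags(["php", "mysql"]))
```

The join plus GROUP BY/HAVING in the AND search gives a feel for the "query awkwardness" listed among the database challenges.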

Denormalized approach
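The slide's diagram is missing here as well. One common reading of "denormalized" is a single table where all tags of a document live in one delimited text column; the sketch below assumes that reading. The column layout, the comma delimiter, and the function names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: tags are packed into one text column, e.g. ",php,mysql,".
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, tags TEXT NOT NULL)"
)


def pack(tag_names):
    """Wrap every tag in delimiters so that 'sql' cannot match 'mysql'."""
    return "," + ",".join(tag_names) + ","


def add_document(doc_id, title, tag_names):
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?)", (doc_id, title, pack(tag_names))
    )


def find_with_all_tags(tag_names):
    """AND search via one LIKE predicate per tag.

    Assumes tag names contain no ',', '%' or '_' characters.
    """
    if not tag_names:
        return []
    where = " AND ".join("tags LIKE ?" for _ in tag_names)
    params = [f"%,{name},%" for name in tag_names]
    rows = conn.execute(
        f"SELECT id FROM documents WHERE {where} ORDER BY id", params
    )
    return [row[0] for row in rows]


add_document(1, "Post A", ["php", "mysql"])
add_document(2, "Post B", ["php", "guid"])
add_document(3, "Post C", ["php", "mysql", "encryption"])
print(find_with_all_tags(["php", "mysql"]))
```

Reads need no join, but a leading-wildcard LIKE typically cannot use an ordinary index, which is one reason to measure on your own DBMS.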

Complex data type approach
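Some DBMSs offer collection column types such as arrays (PostgreSQL, for instance, has array columns with an `@>` "contains" operator). Since the slide content is lost, the sketch below only models the semantics of such a column in memory with Python sets; the data and the function name are hypothetical.

```python
# In-memory stand-in for a table whose "tags" column holds a collection value.
documents = {
    1: frozenset({"php", "mysql"}),
    2: frozenset({"php", "guid"}),
    3: frozenset({"php", "mysql", "encryption"}),
}


def find_with_all_tags(required):
    """Return ids of documents whose tag collection contains every required tag."""
    wanted = set(required)
    return sorted(doc_id for doc_id, tags in documents.items() if wanted <= tags)


print(find_with_all_tags(["php", "mysql"]))
```

The query itself is a single containment test; how fast it runs in a real database depends on how the DBMS stores and indexes the collection.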

Full-text-search oriented solutions
Stack Overflow: <php><mysql><guid><encryption>
JSON: {"tags":["php", "apache2", "openinviter"]}
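Both storage formats above are plain text that the application has to split into individual tags. A minimal sketch of that parsing, using only the Python standard library; the function names are illustrative.

```python
import json
import re


def parse_angle_tags(raw):
    """Split a Stack Overflow style tag string such as '<php><mysql>' into a list."""
    return re.findall(r"<([^<>]+)>", raw)


def to_angle_tags(tags):
    """Inverse of parse_angle_tags: pack a list of tags back into one string."""
    return "".join(f"<{tag}>" for tag in tags)


so_tags = parse_angle_tags("<php><mysql><guid><encryption>")
json_tags = json.loads('{"tags": ["php", "apache2", "openinviter"]}')["tags"]
print(so_tags)
print(json_tags)
```

Either form can be handed to a full-text index as a single document field, with each tag acting as one token.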

Full-text-search approaches
Approach 1: FTS inside the DB
• Database: built-in FTS + FTS model
• Application server
Approach 2: separate FTS server
• Database: relational/denormalized/FTS model
• FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc.)
• Application server
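Whichever approach is chosen, full-text engines are generally built around an inverted index that maps each token to the documents containing it. The toy class below sketches that idea for tags only; it is a simplified illustration and not how any of the listed products is actually implemented. The class and method names are hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: tag -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, tags):
        for tag in tags:
            self._postings[tag].add(doc_id)

    def search_all(self, tags):
        """AND search: intersect posting sets, smallest first."""
        if not tags:
            return []
        posting_sets = sorted(
            (self._postings.get(tag, set()) for tag in tags), key=len
        )
        result = set(posting_sets[0])
        for postings in posting_sets[1:]:
            result &= postings
        return sorted(result)


index = TagIndex()
index.add(1, ["php", "mysql"])
index.add(2, ["php", "guid"])
index.add(3, ["php", "mysql", "encryption"])
print(index.search_all(["php", "mysql"]))
```

Starting the intersection from the smallest posting set keeps intermediate results small, which is one reason multi-tag AND searches can be cheap in this model.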

Housekeeping
Denormalized/FTS
1. Update the tag in every affected document when a tag is renamed
FTS
1. Rebuild the FTS index when it becomes fragmented
2. Refresh the FTS index if it isn't refreshed on COMMIT
Test example
Stack Overflow posts via
http://data.stackexchange.com/
From 31/07/2008 to 21/12/2012
Posts: 2 680 474
Applied tags: 7 791 527
Used unique tags: 30 485
Max tags count for a post: 5
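A quick back-of-the-envelope check on the figures above shows how dense the tagging is. The variable names are illustrative; the input numbers are taken from the slide.

```python
posts = 2_680_474
applied_tags = 7_791_527
unique_tags = 30_485

# Average number of tags attached to one post (the maximum is 5).
avg_tags_per_post = applied_tags / posts

# Average number of posts that use a given tag.
avg_uses_per_tag = applied_tags / unique_tags

print(f"{avg_tags_per_post:.2f} tags per post")
print(f"{avg_uses_per_tag:.0f} uses per tag")
```

So a typical post carries roughly three tags, and a typical tag is applied to a few hundred posts.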

Comparison

Comparison.
Initial population time
Model               Insert time, sec
Relational          1048
Denormalized        1205
Complex data type   2086
Full text search    1950

Comparison.
Database size
Model               Size total, MB   Data size, MB   Index size, MB
Relational          1166             338             828
Denormalized        1080             376             704
Complex data type   1134             256             878
Full text search    1055             416             639

Comparison.
Search by document id and all tag retrieval
Model               Speed with cold cache, sec   Speed with hot cache, sec
Relational          0.2                          0.003
Denormalized        0.07                         0.002
Complex data type   0.9                          0.002
Full text search    0.3                          0.001

Comparison.
Search using 1 tag and all tag retrieval
Model               Speed with cold cache, sec   Speed with hot cache, sec
Relational          1                            0.005
Denormalized        0.7                          0.004
Complex data type   1.7                          0.005
Full text search    0.7                          0.002

Comparison.
Search by AND using 2 tags & all tag retrieval
Model               Speed with cold cache, sec   Speed with hot cache, sec
Relational          40                           34
Denormalized        34                           20
Complex data type   34                           14
Full text search    20                           2

Comparison.
Tag cloud population
Model                   Speed, seconds
Relational              20
Relational simplified   18
Relational without FK   202
Denormalized            18
Complex data type       21
Full text search        40
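Populating a tag cloud boils down to counting how many documents use each tag and keeping the most frequent ones. A minimal in-memory sketch with the standard library; the sample data and the tag_cloud function are illustrative, and a real system would run the equivalent aggregate inside the database.

```python
from collections import Counter


def tag_cloud(documents, top_n=10):
    """Return the top_n (tag, count) pairs, most frequent first."""
    counts = Counter(tag for tags in documents.values() for tag in tags)
    return counts.most_common(top_n)


documents = {
    1: ["php", "mysql"],
    2: ["php", "guid"],
    3: ["php", "mysql", "encryption"],
}
cloud = tag_cloud(documents, top_n=2)
print(cloud)
```

The counts can then be mapped to font sizes when rendering the cloud.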

Pros & Cons
Model              Space consumption  Search performance  Insert performance  Maintenance   Additional housekeeping  Risk of failure  Search queries development
Relational         worst              worst               highest             minimal       not required             no               worst
Denormalized       moderate           moderate            good                required      required                 no               moderate
Complex data type  moderate           moderate            worst               required      required                 no               moderate
Full text search   optimal            optimal             moderate            required      required                 yes              optimal

Conclusion
There is no silver bullet for the tag storage model!

Conclusion
1. Choose your best model based on:
• Performance (search/insert/update)
• Space consumption
• Engineer experience
• Hardware cost
• Software cost
2. Each storage model should be tested on your own RDBMS: don't be afraid to try and measure
3. Understanding how complex data types are stored inside is crucial
4. Understanding how FTS works inside is crucial
5. Investigate your DBMS unique features

Q&A

THANK YOU!
