The document discusses tagging schema designs for high-performance databases. It defines tagging terms and outlines tagging challenges, including database challenges. It then analyzes the highly normalized, denormalized, complex data type, and full-text search approaches. A comparison of these approaches shows differences in initial population time, database size, search speed, and tag cloud population. The conclusion is that the best model depends on performance needs, space, experience, costs, and database features; no single model is best in all cases.
IT talk SPb: Everything Will Be Found
2. TAGGING SCHEMA DESIGN
FOR HIGH PERFORMANCE
Alexander Tokarev
Senior Developer, DataArt
atokarev@dataart.com
25 November 2016
3. Plan
• Tagging basis
• Database challenges
• Tagging solutions
• Pros and cons
• Q&A session
4. Tagging terms
• A tag is a non-hierarchical keyword or term assigned to a piece of information
• Tags are generally chosen informally and personally by the item's creator or by its viewer
• If tags are assigned by the creator and are limited, it is a taxonomy
• If tags are assigned by the viewer and are unlimited, it is a folksonomy
• Tags came into wide use from 2003 on the Flickr and Delicious web sites
• Tags are usually shown inline or as a tag cloud
5. Tagging terms
[Figure: examples of inline tags and a tag cloud]
6. Tagging challenges
Pros (+)
1. The vocabulary used directly reflects the users' own vocabulary
2. Flexibility: the user can add or remove tags
3. Multi-dimensional nature: users can assign any number and combination of tags to express a concept
Cons (-)
1. Specialized tags, tags that mean nothing to anyone but their author, misspellings, singular/plural forms, compound words
2. Tags are often ambiguous, overly personalized, or poorly applied
3. Synonyms, acronyms, and homonyms are not handled well
7. Database challenges
1. Performance
2. Query awkwardness
3. Database size
4. Housekeeping
8. Highly normalized approach
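The slide itself carries no schema in this transcript, so here is a minimal sketch of what a highly normalized tag model usually looks like: documents, tags, and a link table between them. The table and column names (`posts`, `tags`, `post_tags`) and the sample rows are assumptions for illustration, not the talk's actual Oracle schema, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical three-table layout: posts, tags, and a many-to-many link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE post_tags (
        post_id INTEGER NOT NULL REFERENCES posts(id),
        tag_id  INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (post_id, tag_id)
    );
""")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "Hash passwords"), (2, "Join tables"), (3, "Generate GUIDs")])
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(1, "php"), (2, "mysql"), (3, "guid")])
conn.executemany("INSERT INTO post_tags VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2), (3, 1), (3, 3)])

def posts_with_all_tags(conn, names):
    """Return ids of posts that carry every tag in `names` (AND search)."""
    names = sorted(set(names))
    placeholders = ",".join("?" * len(names))
    sql = f"""
        SELECT p.id
        FROM posts p
        JOIN post_tags pt ON pt.post_id = p.id
        JOIN tags t       ON t.id = pt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY p.id
        HAVING COUNT(DISTINCT t.name) = ?
        ORDER BY p.id
    """
    return [row[0] for row in conn.execute(sql, (*names, len(names)))]

both = posts_with_all_tags(conn, ["php", "mysql"])
print(both)
```

The two joins plus `GROUP BY ... HAVING` are the kind of query awkwardness listed among the database challenges: even a simple AND search over two tags needs aggregation.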
10. Complex data type approach
11. Full-text-search oriented solutions
Stack Overflow: <php><mysql><guid><encryption>
JSON: {"tags":["php", "apache2", "openinviter"]}
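Both formats above keep all tags of a document in a single text value. As a rough sketch, an application could turn either value back into a list of tags like this; the function names are hypothetical, and the parsing rules are an assumption based only on the two sample strings.

```python
import json
import re

def parse_delimited(tag_string):
    """Split a '<tag1><tag2>' style string into a list of tag names."""
    return re.findall(r"<([^<>]+)>", tag_string)

def parse_json(document):
    """Read the 'tags' array from a JSON document."""
    return json.loads(document)["tags"]

so_tags = parse_delimited("<php><mysql><guid><encryption>")
json_tags = parse_json('{"tags": ["php", "apache2", "openinviter"]}')
print(so_tags)
print(json_tags)
```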
12. Full-text-search approaches
Approach 1: FTS inside the DB. The application server talks to a database that holds the FTS model and its index.
Approach 2: dedicated FTS server. The application server talks to a database (relational/denormalized/FTS model) and to a separate FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc.).
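To give a feel for what a full-text index does in either approach, here is a toy inverted index in plain Python. It is only a sketch of the general idea: real engines such as the ones named above add tokenizing, ranking, and on-disk structures. The document ids and tags are made up.

```python
from collections import defaultdict

def build_index(docs):
    """Map every tag to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index

def search_all(index, tags):
    """Return sorted ids of documents that contain every requested tag."""
    postings = [index.get(tag, set()) for tag in tags]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample documents.
docs = {
    1: ["php", "mysql"],
    2: ["php", "apache2"],
    3: ["php", "mysql", "guid"],
}
index = build_index(docs)
print(search_all(index, ["php", "mysql"]))
```

An AND search becomes a set intersection of two posting lists, with no join and no aggregation.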
13. Housekeeping
Denormalized/FTS
1. If a tag name changes, change the affected tag in every document that uses it
FTS
1. FTS index rebuild due to fragmentation
2. FTS index refresh if it isn't refreshed on COMMIT
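The first housekeeping item can be sketched as follows, assuming a denormalized table where each post keeps its tags in one `<tag1><tag2>` string. The table layout, the `rename_tag` helper, and the sample data are assumptions for illustration; in a normalized model the same rename would be a single-row update of the tag table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tags TEXT NOT NULL)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [
    (1, "<php><mysql>"),
    (2, "<mysql><guid>"),
    (3, "<php><encryption>"),
])

def rename_tag(conn, old, new):
    """Rewrite the tag in every document that uses it; return rows touched."""
    old_token, new_token = f"<{old}>", f"<{new}>"
    cur = conn.execute(
        "UPDATE posts SET tags = REPLACE(tags, ?, ?) WHERE INSTR(tags, ?) > 0",
        (old_token, new_token, old_token),
    )
    conn.commit()
    return cur.rowcount

changed = rename_tag(conn, "mysql", "mariadb")
rows = conn.execute("SELECT id, tags FROM posts ORDER BY id").fetchall()
print(changed, rows)
```

Every affected document is rewritten, and with an FTS index on the column each rewritten row also has to be re-indexed.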
14. Test example
Stack Overflow posts via http://data.stackexchange.com/
From 31/07/2008 to 21/12/2012
Posts: 2 680 474
Applied tags: 7 791 527
Used unique tags: 30 485
Max tags count for a post: 5
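A quick back-of-the-envelope check of the dataset shape, using only the three counts above. The variable names are illustrative.

```python
# Counts taken from the test dataset description.
posts = 2_680_474
applied_tags = 7_791_527
unique_tags = 30_485

avg_tags_per_post = applied_tags / posts
avg_uses_per_tag = applied_tags / unique_tags

print(f"{avg_tags_per_post:.2f} tags per post on average")
print(f"{avg_uses_per_tag:.0f} uses per unique tag on average")
```

So a typical post carries about three of the five allowed tags, while the vocabulary is small compared with the number of tag applications.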
16. Comparison: initial population time
Model               Insert time, sec
Relational          1048
Denormalized        1205
Complex data type   2086
Full text search    1950
17. Comparison: database size
Model               Size total, MB   Data size, MB   Index size, MB
Relational          1166             338             828
Denormalized        1080             376             704
Complex data type   1134             256             878
Full text search    1055             416             639
18. Comparison: search by document id and all tag retrieval
Model               Speed with cold cache, sec   Speed with hot cache, sec
Relational          0.2                          0.003
Denormalized        0.07                         0.002
Complex data type   0.9                          0.002
Full text search    0.3                          0.001
19. Comparison: search using 1 tag and all tag retrieval
Model               Speed with cold cache, sec   Speed with hot cache, sec
Relational          1                            0.005
Denormalized        0.7                          0.004
Complex data type   1.7                          0.005
Full text search    0.7                          0.002
20. Comparison: search by AND using 2 tags and all tag retrieval
Model               Speed with cold cache, sec   Speed with hot cache, sec
Relational          40                           34
Denormalized        34                           20
Complex data type   34                           14
Full text search    20                           2
21. Comparison: tag cloud population
Model                   Speed, seconds
Relational              20
Relational simplified   18
Relational without FK   202
Denormalized            18
Complex data type       21
Full text search        40
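Populating a tag cloud boils down to counting how often each tag is used. A minimal sketch for the relational model, with assumed table names (`tags`, `post_tags`) and made-up rows; the queries actually timed in the talk are not part of this transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE post_tags (
        post_id INTEGER NOT NULL,
        tag_id  INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (post_id, tag_id)
    );
""")
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(1, "php"), (2, "mysql"), (3, "guid")])
conn.executemany("INSERT INTO post_tags VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2), (3, 1), (3, 3), (4, 2)])

def tag_cloud(conn, limit):
    """Return the `limit` most used tags as (name, use count) pairs."""
    sql = """
        SELECT t.name, COUNT(*) AS uses
        FROM post_tags pt
        JOIN tags t ON t.id = pt.tag_id
        GROUP BY t.name
        ORDER BY uses DESC, t.name
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()

cloud = tag_cloud(conn, 2)
print(cloud)
```

The query has to touch every row of the link table, which is why this operation is measured in seconds rather than milliseconds on the full dataset.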
22. Pros & Cons
Model               Space consumption   Search performance   Insert performance   Maintenance   Additional housekeeping   Risk of failure   Search queries development
Relational          worst               worst                highest              minimal       not required              no                worst
Denormalized        moderate            moderate             good                 required      required                  no                moderate
Complex data type   moderate            moderate             worst                required      required                  no                moderate
Full text search    optimal             optimal              moderate             required      required                  yes               optimal
23. Conclusion
There is no silver bullet for the tag storage model!
24. Conclusion
1. Choose your best model based on:
• Performance (search/insert/update)
• Space consumption
• Engineer experience
• Hardware cost
• Software cost
2. Check each storage model on your own RDBMS: don't be afraid to try and measure
3. Understanding how complex data types are stored internally is crucial
4. Understanding how FTS works internally is crucial
5. Investigate your DBMS's unique features
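The advice to try and measure can start as small as the sketch below: run the same query several times and compare the first run with the best one. The helper name `time_query`, the sample table, and the query are assumptions, and an in-memory SQLite database is not a real cold-cache test; the point is only the shape of the measurement.

```python
import sqlite3
import time

def time_query(conn, sql, params=(), repeats=5):
    """Run a query several times; return (first run, best run) in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        timings.append(time.perf_counter() - start)
    return timings[0], min(timings)

# Hypothetical denormalized table with 10 000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tags TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(i, "<php><mysql>" if i % 2 else "<guid>") for i in range(1, 10001)],
)

first, best = time_query(
    conn, "SELECT id FROM posts WHERE tags LIKE ?", ("%<mysql>%",)
)
print(f"first run: {first:.4f}s, best run: {best:.4f}s")
```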
25. Q&A
To respond to these challenges, an appropriate database design should be applied. Housekeeping (HK): indexing, reindexing, tag renaming, the real-time trade-off, and so on.
Talk about clusters or index-organized tables (IOT).
Talk about clusters.
Explain how they are set up in Oracle and the index tricks. It is important to understand how complex data types are implemented in your database and where the complex data is actually stored.
Tags are stored in a structured format.
Full-text search improves searching by tags through natural-language queries.
The previously mentioned data models are dead simple to deal with, but FTS is worth covering in detail.
The SQL search approach is rather straightforward, so let's consider the FTS approach. The full-text search index is maintained either in the DB or on a dedicated server. The app server uses the FTS dialect of either the DB or the server. We will look at Approach 1. Pros and cons are out of scope for the IT talk. Stack Overflow, for instance, uses MS SQL and Elastic, that is, Approach 2 with the FTS model.
The index becomes fragmented because a delete/insert usually adds new records and invalidates old ones.
We took real-world data through the SQL-like interface to Stack Overflow. Note the maximum tag count per post; I presume it is limited intentionally. I presume they use the 4th data model and a VARCHAR field rather than CLOB/BLOB. The interface permits exporting in batches of 50 000 rows, and a CAPTCHA is required.
Let's have a look at how we created the tables.
For some models the difference is more than 2 times. The reason is clear: FTS maintenance and parsing.
Note that this applies to Oracle DB only. These results are completely DB-dependent. 5 years of data is about 1 GB, so it is worth thinking about in-memory solutions.
Let's have a look at the queries and see the results in the tables.
The difference in time between cold and hot cache is huge, so I put them in 2 diagrams.
Sophisticated plan
2. starts from the tag, whereas the complex data type starts from the document.
4. could be faster using VARCHAR2 and the USE CACHE option, which is switched off by default.
1, 2 and 3 could be faster and consume less space using Oracle tricks like IOT/clusters (joined values are located closer together), but these aren't used so as not to make the test too Oracle-specific.
There is an opinion that arrays are extremely fast in PostgreSQL because they work completely differently than in Oracle. Note that with FTS the first attempt is slightly different from the second; the second is the same as cold cache. It seems Oracle initializes some structures on the first attempt, so it is 2-3 times slower than the second, and the second is the one reported here. Searching by a complex data type triggers a similar kind of initialization as FTS, so it is slower.
Note that the extra table is omitted, so the performance is nearly equal to denormalized. If we drop the PK we use an index, so it takes extra time.
By maintenance I mean the additional actions needed when a tag changes.
1. Because results can be very different across databases.
I would be happy if someone could repeat these cases in other DBMSs, plus some additional features like full document list fetch as well as paging, IOT/clusters/in-memory. I'm ready to share the table structure and the dataset, or you could speak with DataArt PR and I'll do it myself.