2. Plan
▪ Tagging basics
▪ Database challenges
▪ Tagging solutions
▪ Pros and cons
▪ Q&A session
3. Tagging terms
• A tag is a non-hierarchical keyword or term assigned to a piece of information
• Tags are generally chosen informally and personally by the item's creator or by its viewer
• If tags are assigned by the creator and drawn from a limited set, it is a taxonomy
• If tags are assigned by viewers and are unlimited, it is a folksonomy
• Widely used since 2003, starting with the Flickr and Delicious websites
• Tags are usually shown inline as well as in a tag cloud
4. Tagging challenges
+
1. the vocabulary used directly reflects the users’ own vocabulary
2. flexibility: users can add or remove tags
3. multi-dimensional nature: users can assign any number and combination of tags to express a concept
lead to
-
1. specialized tags or tags meaningless to anyone but their author, misspellings, singular/plural forms, compound words
2. tags that are ambiguous, overly personalized or poorly applied
3. synonyms, acronyms and homonyms, which aren’t handled well
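Several of the problems listed above (letter case, compound words, singular/plural forms, synonyms) can be reduced by normalizing tags before they are stored. The sketch below is only an illustration: the `SYNONYMS` map and the `normalize_tag` helper are hypothetical names, and the plural rule is deliberately crude.

```python
import re

# Hypothetical synonym map; a real system would curate or learn this.
SYNONYMS = {"js": "javascript", "postgres": "postgresql"}


def normalize_tag(raw: str) -> str:
    """Return a canonical form of a user-entered tag (illustrative rules only)."""
    tag = raw.strip().lower()
    # Join compound words with a single hyphen: "full text_search" -> "full-text-search".
    tag = re.sub(r"[\s_]+", "-", tag)
    # Known synonyms and acronyms win over any further rewriting.
    if tag in SYNONYMS:
        return SYNONYMS[tag]
    # Crude plural folding: "databases" -> "database"; leaves "class" alone.
    if len(tag) > 3 and tag.endswith("s") and not tag.endswith("ss"):
        tag = tag[:-1]
    return tag


tags = [normalize_tag(t) for t in ["Databases", " full text_search", "JS", "class"]]
```

Homonyms are not addressed here; they need context that a per-tag rule cannot provide.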
10. Full-text-search approaches
Approach 1: application server + DB with FTS inside (relational/denormalized/FTS model)
Approach 2: application server + DB + dedicated FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc.) holding the FTS model
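Whichever approach is chosen, full text search engines are generally built around an inverted index that maps each term to the documents containing it. The following is a minimal in-memory sketch of that idea applied to tags; `build_index` and `search_all` are hypothetical names, not the API of any of the products listed.

```python
from collections import defaultdict


def build_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each tag to the set of document ids that carry it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def search_all(index: dict[str, set[int]], *tags: str) -> set[int]:
    """Return ids of documents carrying every requested tag (AND search)."""
    if not tags:
        return set()
    postings = [index.get(tag, set()) for tag in tags]
    return set.intersection(*postings)


docs = {1: ["sql", "oracle"], 2: ["sql", "lucene"], 3: ["oracle"]}
index = build_index(docs)
hits = search_all(index, "sql", "oracle")
```

An AND search becomes a set intersection over the posting lists, which is one reason multi-tag queries tend to suit this model.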
11. Housekeeping
Denormalized/FTS
1. Update the tag in all affected documents when a tag is renamed
FTS
1. FTS index rebuild due to fragmentation
2. FTS index refresh if it isn’t refreshed on COMMIT
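The cost of renaming a tag can be sketched with SQLite from the Python standard library. The table names and the comma-delimited `tags` column are assumptions made for this illustration, not the structures used in the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Relational model: the tag name is stored once.
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE doc_tag (doc_id INTEGER, tag_id INTEGER REFERENCES tag(id));
    INSERT INTO tag VALUES (1, 'js'), (2, 'sql');
    INSERT INTO doc_tag VALUES (10, 1), (11, 1), (11, 2);

    -- Denormalized model (assumed layout): names repeated in every document row.
    CREATE TABLE doc (id INTEGER PRIMARY KEY, tags TEXT);
    INSERT INTO doc VALUES (10, ',js,'), (11, ',js,sql,'), (12, ',sql,');
""")

# Relational rename: one row changes, whatever the number of documents.
rel = conn.execute("UPDATE tag SET name = 'javascript' WHERE name = 'js'")
relational_rows = rel.rowcount

# Denormalized rename: every document carrying the tag must be rewritten.
den = conn.execute(
    "UPDATE doc SET tags = replace(tags, ',js,', ',javascript,') "
    "WHERE tags LIKE '%,js,%'"
)
denormalized_rows = den.rowcount
conn.commit()
```

With this toy data the relational rename touches one row and the denormalized rename touches two; on a real dataset the second number grows with the popularity of the tag.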
12. Test example
StackOverflow posts via http://data.stackexchange.com/
From 31/07/2008 to 21/12/2012
Posts: 2 680 474
Applied tags: 7 791 527
Used unique tags: 30 485
Max tags count for a post: 5
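Two derived figures help to read these numbers: the average number of tags per post and the average number of uses per unique tag. A small sketch of the arithmetic, using only the counts above:

```python
posts = 2_680_474
applied_tags = 7_791_527
unique_tags = 30_485

# Average number of tags attached to one post (about 2.9).
avg_tags_per_post = applied_tags / posts
# Average number of posts that reuse the same tag (about 256).
avg_uses_per_tag = applied_tags / unique_tags
```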
13. Comparison
Initial population time
(Bar chart: insert time in seconds per model)
Model              Insert time, seconds
Relational         1048
Denormalized       1205
Complex data type  2086
Full text search   1950
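To compare the models at a glance, the insert times can be expressed relative to the relational model. This is plain arithmetic on the figures above:

```python
insert_seconds = {
    "Relational": 1048,
    "Denormalized": 1205,
    "Complex data type": 2086,
    "Full text search": 1950,
}

baseline = insert_seconds["Relational"]
# Ratio of each model's insert time to the relational baseline.
relative = {model: round(sec / baseline, 2) for model, sec in insert_seconds.items()}
```

The complex data type and full text search models come out at roughly twice the relational insert time.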
14. Comparison
DB size
Model              Size total, MB  Data size, MB  Index size, MB
Relational         1166            338            828
Denormalized       1080            376            704
Complex data type  1134            256            878
Full text search   1055            416            639
(Bar chart: DB size per model, showing index size, data size and total size in MB)
15. Comparison
Search by document id and all tag retrieval
Model              Speed with cold cache, seconds  Speed with hot cache, seconds
Relational         0.2                             0.003
Denormalized       0.07                            0.002
Complex data type  0.9                             0.002
Full text search   0.3                             0.001
(Bar charts: speed with cold cache and speed with hot cache, in seconds, per model)
16. Comparison
Search using 1 tag and all tag retrieval
Model              Speed with cold cache, seconds  Speed with hot cache, seconds
Relational         1                               0.005
Denormalized       0.7                             0.004
Complex data type  1.7                             0.005
Full text search   0.7                             0.002
(Bar charts: speed with cold cache and speed with hot cache, in seconds, per model)
17. Comparison
Search by AND using 2 tags and all tag retrieval
Model              Speed with cold cache, seconds  Speed with hot cache, seconds
Relational         40                              34
Denormalized       34                              20
Complex data type  34                              14
Full text search   20                              2
(Bar chart: search speed per model, with hot cache and with cold cache, in seconds)
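For the relational model, an AND search over several tags is commonly written as a join with GROUP BY and HAVING. The sketch below uses SQLite and hypothetical `tag` and `doc_tag` tables; it is not the query measured above, which ran on Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE doc_tag (doc_id INTEGER, tag_id INTEGER REFERENCES tag(id),
                          PRIMARY KEY (doc_id, tag_id));
    INSERT INTO tag VALUES (1, 'oracle'), (2, 'sql'), (3, 'lucene');
    INSERT INTO doc_tag VALUES (10, 1), (10, 2), (11, 2), (11, 3), (12, 3);
""")

wanted = ["sql", "lucene"]
placeholders = ", ".join("?" for _ in wanted)

# Keep only the documents that match every requested tag.
doc_ids = [row[0] for row in conn.execute(f"""
    SELECT dt.doc_id
    FROM doc_tag AS dt
    JOIN tag AS t ON t.id = dt.tag_id
    WHERE t.name IN ({placeholders})
    GROUP BY dt.doc_id
    HAVING COUNT(DISTINCT t.name) = ?
    ORDER BY dt.doc_id
""", (*wanted, len(wanted)))]
```

Retrieving all tags of the matching documents needs a second join back to `doc_tag`, which adds to the work the relational model has to do.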
18. Comparison
Tag cloud population
Model                  Speed, seconds
Relational             20
Relational simplified  18
Relational without FK  202
Denormalized           18
Complex data type      21
Full text search       40
(Bar chart: tag cloud population speed in seconds per model)
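Populating a tag cloud comes down to counting how many documents use each tag. A minimal sketch for the relational model, using SQLite and hypothetical table names rather than the Oracle structures from the test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE doc_tag (doc_id INTEGER, tag_id INTEGER REFERENCES tag(id));
    INSERT INTO tag VALUES (1, 'oracle'), (2, 'sql'), (3, 'lucene');
    INSERT INTO doc_tag VALUES (10, 1), (10, 2), (11, 2), (11, 3), (12, 2);
""")

# Count the documents per tag; the counts drive the font sizes in the cloud.
cloud = conn.execute("""
    SELECT t.name, COUNT(*) AS uses
    FROM doc_tag AS dt
    JOIN tag AS t ON t.id = dt.tag_id
    GROUP BY t.name
    ORDER BY uses DESC, t.name
""").fetchall()
```

The query scans the whole link table, so its cost depends mostly on how the DBMS can aggregate over it.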
19. Pros & Cons
Model              Space consumption  Search performance  Insert performance  Maintenance  Additional housekeeping  Risk of failure  Search queries development
Relational         worst              worst               highest             minimal      not required             no               worst
Denormalized       moderate           moderate            good                required     required                 no               moderate
Complex data type  moderate           moderate            worst               required     required                 no               moderate
Full text search   optimal            optimal             moderate            required     required                 yes              optimal
20. Conclusion
1. Choose your best model based on:
• Performance (search/insert/update)
• Space consumption
• Engineer experience
• Hardware cost
• Software cost
2. Check each storage model on your own RDBMS: don’t be afraid to try and measure
3. Understanding how complex data types are stored internally is crucial
4. Understanding how FTS works internally is crucial
5. Investigate your DBMS unique features
There is no silver bullet for tag storage model!
To respond to these challenges, an appropriate database design should be applied. Housekeeping (HK): indexing, reindexing, tag renaming, the real-time trade-off and so on
Talk about clusters or IOT (index-organized tables)
Talk about clusters
Explain how they are set up in Oracle, and the index tricks. It is important to understand how complex data types are implemented in your database and where the complex data is actually stored.
Tags are stored in a structured format
Using full text search improves searching by tags in natural language
The previously mentioned data models are dead simple to work with, but FTS is worth a closer look
The SQL search approach is rather straightforward, so let’s consider the FTS approach. The full text search index is maintained either in the DB or on a dedicated server. The app server uses the FTS dialect of either the DB or that server. We will look at Approach 1. Pros and cons are out of scope for this IT talk. Stack Overflow, for instance, uses MS SQL and Elastic, i.e. Approach 2 with the FTS model.
The index becomes fragmented because a delete/insert usually adds new records and invalidates old ones
We took real-world data via the SQL-like interface to Stack Overflow. Note the maximum tag count for a post; I presume the limit is intentional. I presume they use the 4th data model and a VARCHAR field rather than CLOB/BLOB. The interface permits export in batches of 50,000 rows, and a CAPTCHA is required.
Let’s have a look at how we created the tables.
For some models the difference is about twofold. The reason is clear: FTS maintenance and parsing.
Note that this applies to Oracle DB only; these figures are completely DB-dependent. Five years of data take about 1 GB, so in-memory solutions are worth considering.
Let’s look at the queries and see the results in the tables.
The difference in time between cold and hot cache is huge, so I put them in 2 diagrams
Sophisticated plan
2. starts from the tag, whereas the complex data type starts from the document
4. Could be faster using varchar2 and the USE CACHE option, which is switched off by default
1, 2 and 3 could be faster and consume less space with Oracle tricks such as IOT/clusters (joined values are stored closer together), but these aren’t used, to keep the test from being too Oracle-specific.
There is an opinion that arrays are extremely fast in PostgreSQL because they work completely differently than in Oracle. Note that the first FTS attempt differs slightly from the second; the second is the same as cold cache. It seems Oracle initializes some structures on the first attempt, so it is 2-3 times slower than the second, which is why the second is reported here. The complex data type performs a similar FTS-like initialization when we search by it, so it is slower.
Note that the extra table is omitted, so the performance is nearly equal to denormalized. If we drop the PK, we use an index, so it takes extra time.
By maintenance I mean the additional actions needed when a tag changes
1. Because results can differ greatly between databases
I would be happy if someone could repeat the cases on other DBMSs, plus some additional features such as full document list fetch, paging and IOT/clusters/in-memory. I’m ready to share the table structure as well as the dataset, or you could speak with DataArt PR and I’ll do it myself.