Tagging schema design for high performance
Plan
▪ Tagging basics
▪ Database challenges
▪ Tagging solutions
▪ Pros and cons
▪ Q&A session
Tagging terms
• A tag is a non-hierarchical keyword or term assigned to a piece of information
• Tags are generally chosen informally and personally by the item's creator or by its viewer
• If tags are assigned by the creator and are limited, it is a taxonomy
• If tags are assigned by the viewer and are unlimited, it is a folksonomy
• Came into wide use from 2003 on the Flickr and Delicious web sites
• Tags are usually shown inline as well as in a tag cloud
Tagging challenges
Advantages (+)
1. The vocabulary used directly reflects the users' own vocabulary
2. Flexibility: the user can add or remove tags
3. Multi-dimensional nature: users can assign any number and combination of tags to express a concept
lead to drawbacks (-)
1. Specialized tags or tags that mean nothing to anyone but their author, misspellings, singular/plural forms, compound words
2. Tags that are often ambiguous, overly personalized, or poorly applied
3. Synonyms, acronyms and homonyms, which aren't handled well
Database challenges
1. Performance
2. Query awkwardness
3. Database size
4. Housekeeping
Highly normalized approach
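The schema diagram for this slide did not survive extraction. As a minimal sketch only, assuming the usual three-table layout (documents, tags, and a link table between them), the approach could look like the following. All table, column, and function names are hypothetical, and SQLite stands in for whatever RDBMS the talk measured.

```python
import sqlite3

# Hypothetical three-table layout; names are illustrative, not from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE document_tag (
        document_id INTEGER NOT NULL REFERENCES document(id),
        tag_id      INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (document_id, tag_id)
    );
    CREATE INDEX document_tag_by_tag ON document_tag (tag_id, document_id);
""")

def add_document(doc_id, title, tags):
    """Insert a document, create any missing tags, and link them."""
    conn.execute("INSERT INTO document (id, title) VALUES (?, ?)", (doc_id, title))
    for name in tags:
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT INTO document_tag (document_id, tag_id) "
            "SELECT ?, id FROM tag WHERE name = ?",
            (doc_id, name),
        )

def find_with_all_tags(tags):
    """Return ids of documents that carry every tag in `tags` (AND search)."""
    placeholders = ",".join("?" * len(tags))
    rows = conn.execute(
        f"""SELECT dt.document_id
            FROM document_tag dt JOIN tag t ON t.id = dt.tag_id
            WHERE t.name IN ({placeholders})
            GROUP BY dt.document_id
            HAVING COUNT(DISTINCT t.name) = ?
            ORDER BY dt.document_id""",
        (*tags, len(tags)),
    )
    return [r[0] for r in rows]

add_document(1, "Hashing passwords", ["php", "mysql", "encryption"])
add_document(2, "GUID primary keys", ["mysql", "guid"])
add_document(3, "Apache rewrite rules", ["php", "apache2"])
print(find_with_all_tags(["php", "mysql"]))
```

The AND search needs a join plus GROUP BY/HAVING, which is one plausible reading of the "query awkwardness" listed among the database challenges.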
Denormalized approach
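The diagram for this slide is also missing. One common reading of "denormalized", assumed here, is that each document row carries its own tags in a delimited text column. The sketch below uses hypothetical names and SQLite; it is not the schema from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT, tags TEXT NOT NULL)"
)

def pack(tags):
    # Leading and trailing commas let LIKE match whole tags only,
    # so searching for "php" does not also hit "php5".
    return "," + ",".join(sorted(set(tags))) + ","

def add_document(doc_id, title, tags):
    conn.execute("INSERT INTO document VALUES (?, ?, ?)", (doc_id, title, pack(tags)))

def find_with_all_tags(tags):
    """AND search: one LIKE predicate per requested tag.

    Assumes tag names contain no LIKE wildcards (% or _) and no commas.
    """
    if not tags:
        return []
    where = " AND ".join("tags LIKE ?" for _ in tags)
    params = [f"%,{t},%" for t in tags]
    rows = conn.execute(f"SELECT id FROM document WHERE {where} ORDER BY id", params)
    return [r[0] for r in rows]

add_document(1, "Hashing passwords", ["php", "mysql"])
add_document(2, "GUID primary keys", ["mysql", "guid"])
add_document(3, "Upgrade notes", ["php5"])
print(find_with_all_tags(["php"]))
```

A leading-wildcard LIKE generally cannot use an ordinary B-tree index, so how well this layout searches depends heavily on the DBMS.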
Complex data type approach
Full-text-search oriented solutions
Stack Overflow: <php><mysql><guid><encryption>
JSON: {"tags":["php", "apache2", "openinviter"]}
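Both encodings above pack all of a document's tags into a single text value. A small sketch of how an application might unpack them, with hypothetical helper names:

```python
import json
import re

def parse_angle_tags(text):
    """Split a '<a><b>' style tag string into a list of tag names."""
    return re.findall(r"<([^<>]+)>", text)

def parse_json_tags(text):
    """Read the tag list from a JSON object with a 'tags' array."""
    return json.loads(text)["tags"]

print(parse_angle_tags("<php><mysql><guid><encryption>"))
print(parse_json_tags('{"tags": ["php", "apache2", "openinviter"]}'))
```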
Full-text-search approaches
Approach 1: application server -> database with FTS inside the DB + FTS model
Approach 2: application server -> database with relational/denormalized/FTS model, plus a separate FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc.)
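Whichever approach is chosen, a full-text index is at its core an inverted index: a map from each term to the documents that contain it. The pure-Python sketch below illustrates that idea for tags only. It is not the API of any engine named above, and the class and method names are made up for illustration.

```python
from collections import defaultdict

class TagIndex:
    """Toy inverted index: tag -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, tags):
        for tag in tags:
            self.postings[tag.lower()].add(doc_id)

    def search_all(self, tags):
        """AND search: intersect the posting sets, smallest first."""
        sets = [self.postings.get(t.lower(), set()) for t in tags]
        if not sets:
            return []
        sets.sort(key=len)
        result = set(sets[0])
        for s in sets[1:]:
            result &= s
        return sorted(result)

index = TagIndex()
index.add(1, ["php", "mysql", "encryption"])
index.add(2, ["mysql", "guid"])
index.add(3, ["php", "apache2"])
print(index.search_all(["php", "mysql"]))
```

An AND query becomes a set intersection, with no join or GROUP BY involved.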
Housekeeping
Denormalized/FTS
1. Update the tag in every affected document when a tag name changes
FTS
1. Rebuild the FTS index when it becomes fragmented
2. Refresh the FTS index if it isn't refreshed on COMMIT
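The rename cost can be illustrated with a small sketch. It assumes a normalized `tag` table and a denormalized table that stores tags as a comma-delimited string; both tables and all names are hypothetical, and SQLite is used only for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE doc_denorm (id INTEGER PRIMARY KEY, tags TEXT);
    INSERT INTO tag (name) VALUES ('js'), ('php');
    INSERT INTO doc_denorm (id, tags) VALUES
        (1, ',js,php,'), (2, ',js,'), (3, ',php,');
""")

def rename_normalized(old, new):
    """Normalized model: the name lives in exactly one row."""
    cur = conn.execute("UPDATE tag SET name = ? WHERE name = ?", (new, old))
    return cur.rowcount

def rename_denormalized(old, new):
    """Denormalized model: every document carrying the tag is rewritten."""
    cur = conn.execute(
        "UPDATE doc_denorm SET tags = replace(tags, ?, ?) WHERE tags LIKE ?",
        (f",{old},", f",{new},", f"%,{old},%"),
    )
    return cur.rowcount

touched_normalized = rename_normalized("js", "javascript")
touched_denormalized = rename_denormalized("js", "javascript")
print(touched_normalized, touched_denormalized)
```

The number of rows touched in the denormalized case grows with the number of documents carrying the tag, and with an FTS index each rewritten row also has to be re-indexed.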
Test example
Stack Overflow posts via http://data.stackexchange.com/
From 31/07/2008 to 21/12/2012
Posts: 2 680 474
Applied tags: 7 791 527
Used unique tags: 30 485
Max tag count per post: 5
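Two derived figures help put the data set in perspective. They are simple ratios of the numbers above, not values reported in the slides.

```python
posts = 2_680_474
applied_tags = 7_791_527
unique_tags = 30_485

# Average number of tags attached to one post.
avg_tags_per_post = applied_tags / posts
# Average number of posts each distinct tag is applied to.
avg_uses_per_tag = applied_tags / unique_tags

print(f"{avg_tags_per_post:.2f} tags per post on average")
print(f"{avg_uses_per_tag:.1f} applications per unique tag on average")
```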
Comparison
Initial population time
Model               Insert time, seconds
Relational          1048
Denormalized        1205
Complex data type   2086
Full text search    1950
Chart: insert time, seconds, by model
Comparison
DB size
Model               Size total, MB   Data size, MB   Index size, MB
Relational          1166             338             828
Denormalized        1080             376             704
Complex data type   1134             256             878
Full text search    1055             416             639
Chart: DB size by model (index size, data size, total size, MB)
Comparison
Search by document id and all tag retrieval
Model               Speed with cold cache, seconds   Speed with hot cache, seconds
Relational          0.2                              0.003
Denormalized        0.07                             0.002
Complex data type   0.9                              0.002
Full text search    0.3                              0.001
Charts: speed with cold cache and speed with hot cache, seconds, by model
Comparison
Search using 1 tag and all tag retrieval
Model               Speed with cold cache, seconds   Speed with hot cache, seconds
Relational          1                                0.005
Denormalized        0.7                              0.004
Complex data type   1.7                              0.005
Full text search    0.7                              0.002
Charts: speed with cold cache and speed with hot cache, seconds, by model
Comparison
Search by AND using 2 tags and all tag retrieval
Model               Speed with cold cache, seconds   Speed with hot cache, seconds
Relational          40                               34
Denormalized        34                               20
Complex data type   34                               14
Full text search    20                               2
Chart: search speed with hot and cold cache, seconds, by model
Comparison
Tag cloud population
Model                       Speed, seconds
Relational                  20
Relational simplified       18
Relational without FK       202
Denormalized                18
Complex data type (array)   21
Full text search            40
Chart: speed, seconds, by model
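Populating a tag cloud comes down to counting how many documents carry each tag and mapping those counts to display weights. The sketch below shows the computation in plain Python; the function name, the font-size range, and the linear scaling are assumptions, not details from the talk.

```python
from collections import Counter

def tag_cloud(documents, min_size=10, max_size=32):
    """Map each tag to a font size scaled linearly by document count."""
    # set(tags) ensures a tag repeated within one document counts once.
    counts = Counter(tag for tags in documents for tag in set(tags))
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {
        tag: round(min_size + (n - lo) * (max_size - min_size) / span)
        for tag, n in counts.items()
    }

docs = [["php", "mysql"], ["php", "apache2"], ["php", "mysql", "guid"]]
cloud = tag_cloud(docs)
print(cloud)
```

In a database the counting step is an aggregate over every applied tag, which is why the measured models differ so much on this operation.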
Pros & Cons
Model               Space consumption   Search performance   Insert performance   Maintenance   Additional housekeeping   Risk of failure   Search queries development
Relational          worst               worst                highest              minimal       not required              no                worst
Denormalized        moderate            moderate             good                 required      required                  no                moderate
Complex data type   moderate            moderate             worst                required      required                  no                moderate
Full text search    optimal             optimal              moderate             required      required                  yes               optimal
Conclusion
1. Choose the best model for you based on:
• Performance (search/insert/update)
• Space consumption
• Engineers' experience
• Hardware cost
• Software cost
2. Check each storage model on your own RDBMS - don't be afraid to try and measure
3. Understanding how complex data types are stored internally is crucial
4. Understanding how FTS works internally is crucial
5. Investigate your DBMS's unique features
There is no silver bullet for a tag storage model!
Q&A
Contacts
Feel free to ask any db-related questions: shtock@mail.ru

Data structures for cloud tag storage

  • 1. Tagging schema design for high performance
  • 2. Plan ▪ Tagging basis ▪ Database challenges ▪ Tagging solutions ▪ Pros and cons ▪ Q&A session
  • 3. Tagging terms • Tag is a non-hierarchical keyword or term assigned to a piece of information • Tags are generally chosen informally and personally by the item's creator or by its viewer • If tags are assigned by the creator and are limited it is taxonomy • If tags are assigned by the viewer and are unlimited it is folksonomy • Started to be widely used from 2003 by Flikr and Delicious web sites • Tags are showed usually inline as well as tag cloud
  • 4. Tagging challenges + 1. used vocabulary reflects the user’s vocabulary directly 2. flexibility - the user can add or remove tags 3. multi-dimensional nature - users can assign any number and combination of tags to express a concept lead to - 1. specialized tags or tags without meaning to others than themselves, misspellings, singular/plural form, compound words 2. tags are often ambiguous, overly personalized, poorly applied tag 3. Using synonyms, acronyms and homonyms which aren’t handled well
  • 5. Database challenges 1. Performance 2. Queries awkwardness 3. Database size 4. Housekeeping
  • 8. Complex data type approach
  • 9. Full-text-search oriented solutions Stackoverflow: <php><mysql><guid><encryption> JSON: {“tags”:[“php”, “apache2”, “openinviter”]}
  • 10. Full-text-search approaches FTS inside DB + FTS model Relational/denormalized/FTS model Approach 1 Approach 2 FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc) Application server Application server
  • 11. Housekeeping Denormalized/FTS 1. Change all affected tags in all documents if a tag name changed FTS 1. FTS index rebuild due fragmentation 2. FTS index refresh if it isn’t refreshed on COMMIT
  • 12. Test example StackOverflow posts via http://data.stackexchange.com/ From 31/07/2008 to 21-12-2012 Posts: 2 680 474 Applied tags: 7 791 527 Used unique tags: 30 485 Max tags count for a post: 5
  • 13. Comparison: initial population time (chart: insert time by model)
        Model | Insert time, seconds
        Relational | 1048
        Denormalized | 1205
        Complex data type | 2086
        Full text search | 1950
  • 14. Comparison: DB size (chart: index, data and total size by model)
        Model | Size total, MB | Data size, MB | Index size, MB
        Relational | 1166 | 338 | 828
        Denormalized | 1080 | 376 | 704
        Complex data type | 1134 | 256 | 878
        Full text search | 1055 | 416 | 639
  • 15. Comparison: search by document id and retrieval of all tags (charts: speed with cold cache and speed with hot cache, by model)
        Model | Speed with cold cache, seconds | Speed with hot cache, seconds
        Relational | 0.2 | 0.003
        Denormalized | 0.07 | 0.002
        Complex data type | 0.9 | 0.002
        Full text search | 0.3 | 0.001
  • 16. Comparison: search using 1 tag and retrieval of all tags (charts: speed with cold cache and speed with hot cache, by model)
        Model | Speed with cold cache, seconds | Speed with hot cache, seconds
        Relational | 1 | 0.005
        Denormalized | 0.7 | 0.004
        Complex data type | 1.7 | 0.005
        Full text search | 0.7 | 0.002
  • 17. Comparison: search by AND using 2 tags and retrieval of all tags (chart: search speed with cold and hot cache, by model)
        Model | Speed with cold cache, seconds | Speed with hot cache, seconds
        Relational | 40 | 34
        Denormalized | 34 | 20
        Complex data type | 34 | 14
        Full text search | 20 | 2
  • 18. Comparison: tag cloud population (chart: speed by model)
        Model | Speed, seconds
        Relational | 20
        Relational simplified | 18
        Relational without FK | 202
        Denormalized | 18
        Complex data type | 21
        Full text search | 40
  • 19. Pros & Cons
        Model | Space consumption | Search performance | Insert performance | Maintenance | Additional housekeeping | Risk of failure | Search queries development
        Relational | worst | worst | highest | minimal | not required | no | worst
        Denormalized | moderate | moderate | good | required | required | no | moderate
        Complex data type | moderate | moderate | worst | required | required | no | moderate
        Full text search | optimal | optimal | moderate | required | required | yes | optimal
  • 20. Conclusion 1. Choose the best model for you based on: • Performance (search/insert/update) • Space consumption • Engineer experience • Hardware cost • Software cost 2. Check each storage model on your own RDBMS; don’t be afraid to try and measure 3. Understanding how complex data types are stored internally is crucial 4. Understanding how FTS works internally is crucial 5. Investigate your DBMS's unique features. There is no silver bullet for the tag storage model!
  • 21. Q&A
  • 22. Contacts Feel free to ask any db-related questions: shtock@mail.ru
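
The relational model and the two query shapes measured above (search by AND over several tags, tag cloud population) can be sketched as follows. This is a minimal illustration only: it uses SQLite through Python's standard `sqlite3` module rather than the Oracle setup from the talk, and the table names, column names, and sample posts are assumptions, not the schema used in the test.

```python
import sqlite3

# Hypothetical three-table relational schema (names are illustrative only).
SCHEMA = """
CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE document_tag (
    document_id INTEGER NOT NULL REFERENCES document(id),
    tag_id      INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (document_id, tag_id)
);
CREATE INDEX document_tag_by_tag ON document_tag (tag_id, document_id);
"""

# Made-up sample posts: (title, tags).
POSTS = [
    ("Encrypt a GUID", ["php", "mysql", "guid", "encryption"]),
    ("Slow join", ["mysql", "performance"]),
    ("Session storage", ["php", "mysql"]),
]


def build_db(posts):
    """Create an in-memory database and load the posts with their tags."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for doc_id, (title, tags) in enumerate(posts, start=1):
        conn.execute("INSERT INTO document (id, title) VALUES (?, ?)",
                     (doc_id, title))
        for name in tags:
            # Each distinct tag name is stored once; links go via its id.
            conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
            conn.execute(
                "INSERT OR IGNORE INTO document_tag (document_id, tag_id) "
                "SELECT ?, id FROM tag WHERE name = ?",
                (doc_id, name),
            )
    return conn


def search_all_tags(conn, names):
    """Return ids of documents that carry every tag in `names` (AND search)."""
    names = sorted(set(names))
    if not names:
        return []
    placeholders = ", ".join("?" for _ in names)
    sql = f"""
        SELECT dt.document_id
        FROM document_tag dt
        JOIN tag t ON t.id = dt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY dt.document_id
        HAVING COUNT(*) = ?
        ORDER BY dt.document_id
    """
    return [row[0] for row in conn.execute(sql, (*names, len(names)))]


def tag_cloud(conn):
    """Return (tag name, usage count) pairs, most used first."""
    sql = """
        SELECT t.name, COUNT(*) AS uses
        FROM document_tag dt
        JOIN tag t ON t.id = dt.tag_id
        GROUP BY t.name
        ORDER BY uses DESC, t.name
    """
    return conn.execute(sql).fetchall()


conn = build_db(POSTS)
print(search_all_tags(conn, ["php", "mysql"]))
print(tag_cloud(conn))
```

The AND search relies on grouping the link table and counting matches per document, which is one reason the relational search queries are rated hardest to develop in the Pros & Cons table.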

Editor's Notes

  1. Maybe show an uber-optimized version
  2. Synonyms are bad for reporting
  3. To address these challenges, an appropriate database design should be applied. Housekeeping (HK): indexing, reindexing, tag name changes, the real-time trade-off, and so on
  4. Talk about clusters or IOTs (index-organized tables)
  5. Talk about clusters
  6. Explain how they are set up in Oracle, plus index tricks. It is important to understand how complex data types are implemented in your database and where the complex data is actually stored.
  7. Tags are stored in a structured format. Using full-text search improves searching by tags via native language. The previously mentioned data models are dead simple to deal with, but it is worth looking at FTS in detail.
  8. The SQL search approach is rather straightforward, so let’s consider the FTS approach. The full-text search index is maintained either in the DB or on a dedicated server. The app server uses the FTS dialect of either the DB or the server. We will look at Approach 1. Pros and cons are outside the scope of the ItTalk. Stack Overflow, for instance, uses MSSQL and Elastic, i.e. Approach 2 with the FTS model.
  9. The index becomes fragmented because a delete/insert usually adds new records and invalidates old ones
  10. We took real-world data via the SQL-like interface to StackOverflow. Please note the maximum tag count per post; I presume the limit is intentional. I presume they use the 4th data model and a VARCHAR field rather than CLOB/BLOB. The interface permits export in batches of 50000 rows, and a CAPTCHA is required. Let’s have a look at how we created the tables.
  11. For some models the difference is more than 2 times. The reason is clear: FTS maintenance and parsing.
  12. Please note that this is only for Oracle DB; these figures are completely DB-dependent. 5 years of data take about 1 GB, so it is worth thinking about in-memory solutions. Let’s have a look at the queries and the tables.
  13. The difference in time between cold and hot cache is huge, so I put it in 2 diagrams. Sophisticated plan. Model 2 starts from the tag, whereas the complex data type starts from the document. Model 4 could be faster using VARCHAR2 and the USE CACHE option, which is switched off by default. Models 1, 2 and 3 could be faster and consume less space using Oracle tricks like IOT/clusters (joined values are located closer together), but these aren’t used so as not to make the test too Oracle-tailored.
  14. There is an opinion that arrays are extremely fast in PostgreSQL because they work completely differently than in Oracle. Please note that the first FTS attempt is slightly different from the second; the second is the same as with a cold cache. It seems Oracle initializes some structures on the first attempt, so it is 2-3 times slower than the second, which is why the second is reported here. The complex data type performs a similar FTS-like initialization when we search by it, so it is slower.
  15. Please note that the extra table is omitted, so the performance is nearly equal to the denormalized model. If we drop the PK we use an index, so it takes extra time.
  16. By maintenance I mean the additional actions needed when a tag changes
  17. 1. Because results can be very different across databases
  18. I would be happy if someone could repeat the cases on other DBMSs, plus some additional features such as fetching the full document list, paging, and IOT/clusters/in-memory. I’m ready to share the table structure as well as the dataset, or you could speak with DataArt PR and I’ll do it myself.
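
The maintenance cost mentioned in the notes above (renaming a tag in a denormalized model touches every affected document) can be illustrated with a small sketch. It assumes a single hypothetical `document` table whose `tags` column holds a Stack Overflow-style string such as `<php><mysql>`. SQLite via Python's `sqlite3` stands in for the Oracle database used in the talk, and all names and sample rows are invented for illustration.

```python
import sqlite3

# Made-up sample posts: (title, tags).
POSTS = [
    ("Encrypt a GUID", ["php", "mysql", "guid", "encryption"]),
    ("Slow join", ["mysql", "performance"]),
    ("Session storage", ["php", "mysql"]),
]


def wrap(tags):
    """Serialize tags as '<php><mysql>'; the brackets prevent 'php' from
    matching inside a longer tag such as 'php5'."""
    return "".join(f"<{t}>" for t in tags)


def build_db(posts):
    """Create an in-memory database with one denormalized table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE document ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, tags TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO document (title, tags) VALUES (?, ?)",
        [(title, wrap(tags)) for title, tags in posts],
    )
    return conn


def search_all_tags(conn, names):
    """AND search: one substring test per tag, evaluated on every row."""
    if not names:
        return []
    where = " AND ".join("instr(tags, ?) > 0" for _ in names)
    sql = f"SELECT id FROM document WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, [wrap([n]) for n in names])]


def rename_tag(conn, old, new):
    """Housekeeping: rewrite the tag string of every document using `old`.
    Returns the number of documents that had to be updated."""
    cur = conn.execute(
        "UPDATE document SET tags = replace(tags, ?, ?) "
        "WHERE instr(tags, ?) > 0",
        (wrap([old]), wrap([new]), wrap([old])),
    )
    return cur.rowcount


conn = build_db(POSTS)
print(search_all_tags(conn, ["php", "mysql"]))
print(rename_tag(conn, "mysql", "mariadb"))
print(search_all_tags(conn, ["php", "mariadb"]))
```

In a relational model the same rename would be a single-row update of the tag's name, which matches the "minimal" maintenance rating given to that model in the Pros & Cons slide.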