SlideShare a Scribd company logo
OLAP Querying of Document Stores
in the Presence of Schema Variety
Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi
{m.francia, enrico.gallinucci, matteo.golfarelli, stefano.rizzi} @ unibo.it
Introduction
Schemaless databases (e.g., document stores) are preferred for storing
heterogeneous data with variable structural forms
• Missing or additional fields
• Different names or types for a field
• Different structuresfor instances
Absence of a unique schema
• More flexibility to operational applications, more complexity to analytical applications
Adopting a classical DW design approach requires notable effort
• Understand the rules that drove the use of alternative schemas
• Integrate to identify a common schema to be adopted for analysis
Enrico Gallinucci – University of Bologna SEBD 2020 2
Our proposal
An approach to OLAP querying on document stores
• Embrace and exploit the inherent variety of documents
• Inclusive solution: the user can query a concept
even if it is present in a subset of documents only
Enrico Gallinucci – University of Bologna SEBD 2020 3
Document store
Schema extraction
Schema integration
FD enrichment
Querying
Dependency graph
Global schema, Mappings
Local schemas
Formulation
Validation
Reformulation
Translation
Execution
Merge
Case study
Real-world collection of workout sessions
• MongoDB 3.4
• 6 local schemas
• 5M workout sessions
• 35M exercises
• 85M sets
Enrico Gallinucci – University of Bologna SEBD 2020 4
[ { "_id" : ObjectId("54a4332f44"),
"User" :
{ "FullName" : ”John Smith",
"Age" : 42 },
"StartedOn" : ISODate("2017-06-15"),
"Facility" :
{ "Name" : "PureGym Piccadilly",
"Chain" : "PureGym" },
"SessionType" : "RunningProgram",
"DurationMins": 90,
"Exercises" :
[ { "Type" : "Leg press",
"ExCalories" : 28,
"Sets" :
[ { "Reps" : 14,
"Weight" : 60 },
. . .
] },
{ "Type" : "Tapis roulant" },
. . .
]
} ,
. . .
]
Schema extraction
Goal: identify the set of distinct local schemas that occur inside a
collection of documents
• Completely automatic stage
Enrico Gallinucci – University of Bologna SEBD 2020 5
Document store
Schema extraction
Schema integration
FD enrichment
Querying
Dependency graph
Global schema, Mappings
Local schemas
Schema integration
Goal: integrate the local schemas to obtain a single, global schema
• A preliminary global schema is defined as the name-based union of all local
schemas (automatic)
• The user refines the preliminary global schema by merging matching (sets of)
fields (pay-as-you-go)
• Existing tools (e.g., Coma 3.0) can be used to
automatically find candidate matches
Enrico Gallinucci – University of Bologna SEBD 2020 6
Document store
Schema extraction
Schema integration
FD enrichment
Querying
Dependency graph
Global schema, Mappings
Local schemas
Goal: propose a MD view of the global schema to enable OLAP analyses
• Identification of hierarchies → Identification of functional dependencies (FDs)
• Some exact FDs can be derived from the global schema
• Additional FDs can only be found by querying the data
• DSs may contain incomplete and faulty data
• We look for approximate FDs (AFDs)
• Efficient strategy for AFD detection [1]
Schema enrichment
Enrico Gallinucci – University of Bologna SEBD 2020 7
Document store
Schema extraction
Schema integration
FD enrichment
Querying
Dependency graph
Global schema, Mappings
Local schemas
[1] Golfarelli, Graziani, Rizzi: Starry vault: Automating multidimensional modeling from data vaults. In: Proc. ADBIS (2016)
Schema enrichment (example)
Enrico Gallinucci – University of Bologna SEBD 2020 8
Exercises.Type
Exercises.Sets.Weight
Exercises.Sets.Reps
_id
User.FullName
SessionType
Facility.Name
Facility.City
Facility.ChainUser.Age
User.FirstName
User.LastName DurationMins
Exercises._id
Exercises.Sets. _id
Exercises.ExCalories
Exercises.Sets.SetCalories
level (support=1)
level (support<1)
FD
AFD
Facility.Country
The result is a dependency graph, providing a MD view in terms of FDs
Querying (1)
Goal: enable the formulation of OLAP queries on the dependency graph
and their execution on the collection
• Validate against well-formedness requirements in the literature
• There must exist a field that functionally determines all the others (the fact)
• Functional independence in group-by set
• Disjointess:the granularity of measures
should be finer than the one of the
group-by fields (or: double-counting)
• Completeness:the group-by fields
should have full support
(balancing strategies are used)
Enrico Gallinucci – University of Bologna SEBD 2020 9
Document store
Schema extraction
Schema integration
FD enrichment
Querying
Formulation
Validation
Reformulation
Translation
Execution
Merge
Querying (example)
q = < {Facility.City; Exercises.Type}; (User.Age≥ 60); Exercises.Sets.Weight;avg >
Enrico Gallinucci – University of Bologna SEBD 2020 10
db.WS.aggregate({
{ $unwind: "$Exercises" },
{ $unwind: "$Exercises.Sets" },
{ $match: { "User.Age": { $gte: 60 } } },
{ $project: {
"Facility.City": { $ifNull: ["$FacilityCity","$FacilityName"] } },
"Exercises.Type": 1,
"Exercises.Sets.Weight": 1,
"balanced": { $cond: ["$FacilityCity",false,true] } } },
{ $group: {
" id": {
"FacilityCity", "$FacilityCity",
"ExercisesType", "$Exercises.Type",
"balanced", "$balanced" },
"Exercises.Sets.Weight": { $avg: "$Exercises.Sets.Weight" },
"count": { $sum: 1 },
"count-m": { $sum: f $cond: ["$Exercises.Sets.Weight",1,0] } } } } } )
Exercises.Type
Exercises.Sets.Weight
Exercises.Sets.Reps
_id
User.FullName
SessionType
Facility.Name
Facility.City
Facility.ChainUser.Age
User.FirstName
User.LastName DurationMins
Exercises._id
Exercises.Sets. _id
Exercises.ExCalories
Exercises.Sets.SetCalories
level (support=1)
level (support<1)
FD
AFD
Facility.Country
Goal: enable the formulation of multidimensional queries on the
dependency graph and their execution on the collection
• Reformulate into multiple queries, one for each local schema
• Rely on a previous query reformulation approach for federated DWs
• Translate into the query language of the DS
• Multi-stage pipelines on MongoDB
• Execute the queries and
merge the results
• Performed in-memory
• OLAP queries usually produce limited results
• Transcoding functions homogenize values Document store
Schema extraction
Schema integration
FD enrichment
Querying
Formulation
Validation
Reformulation
Translation
Execution
Merge
Querying (2)
Enrico Gallinucci – University of Bologna SEBD 2020 11
Future work
• Increase the efficiency of the querying phase
• Optimize query reformulation to issue a single but more complex query
• Broaden the scope of the approach
• Multiple collections and multiple store with different data models
• Evaluate alternative execution engines (e.g., big data tools)
• Build a fully working prototype
• Introduction of online repairing of AFDs
• Automatically repair the errors in AFDs at query time
• Directly receive correct results without need to modify the original data
Enrico Gallinucci – University of Bologna SEBD 2020 12
Thank you
Questions?
Query evaluation
• Level density
• Necessary when the group-by fails the completeness constraint
• Evaluate the impact of the balancing strategy
• Measure density
• Necessary when the measure of the query does not have full support and the
results offer only a partial summary
• Evaluate the precision of result depending on how many documentscontain
the measure
• Accuracy
• Necessary when a new query is formulated during an OLAP session
• Quantify the accuracy of results w.r.t. those obtained from the previous query
Enrico Gallinucci – University of Bologna SEBD 2020 14
Query execution times
Execution times of three queries (qa, qb, and qc) split into the execution
times of the required local queries
Enrico Gallinucci – University of Bologna SEBD 2020 15
Schema extraction (example)
Enrico Gallinucci – University of Bologna SEBD 2020 16
_id
User.FullName
User.Age
StartedOn
Facility.Name
Facility.Chain
SessionType
DurationMins
Sets_id
Reps
Weight
Exercises_id
Type
ExCalories
WS
Sets
Exercises
{ "_id" : ObjectId("54a4332f44"),
"User" :
{ "FullName" : ”John Smith",
"Age" : 42 },
"StartedOn" : ISODate("2017-06-15"),
"Facility" :
{ "Name" : "PureGym Piccadilly",
"Chain" : "PureGym" },
"SessionType" : "RunningProgram",
"DurationMins": 90,
"Exercises" :
[ { "Type" : "Leg press",
"ExCalories" : 28,
"Sets" :
[ { "Reps" : 14,
"Weight" : 60 },
. . .
] },
{ "Type" : "Tapis roulant" },
. . .
]
}
s1
Schema integration (example)
Enrico Gallinucci – University of Bologna SEBD 2020 17
Exercises_id
Type
ExCalories
_id
User.FullName
User.Age
StartedOn
Facility.Name
Facility.Chain
SessionType
DurationMins
Sets_id
Reps
Weight
Exercises_id
Type
ExCalories
_id
FirstName
LastName
Date
Gym.Name
Gym.City
Gym.Country
SessionType
DurationSecs
Series_id
ExType
Reps
Weight
SeriesCalories
_id
User.FullName
User.FirstName
User.LastName
User.Age
StartedOn
Facility.Name
Facility.Chain
Facility.City
Facility.Country
SessionType
DurationMins
Sets_id
Reps
Weight
SetCalories
WS
Sets
Exercises
WS
Series
WS
Sets
Exercises
s1 s2g
Schema integration
Goal: integrate the local schemas to obtain a single, global schema
• Schema matching: identify the single matches between the attributes
• Schema mapping: define mappings between each local schema and the global one
• Mappings include transcoding functions to enable query reformulation
• E.g., in the presence of selection predicates, or to integrate the query results
Schema integration is done in two steps
• Define a preliminary global schema as the
name-based union of all local schemas (automatic)
• Refine the preliminary global schema by merging
matching (sets of) fields (pay-as-you-go)
• Existing tools (e.g., Coma 3.0)can be used to
automatically find candidate matches
Enrico Gallinucci – University of Bologna SEBD 2020 18
Document store
Schema extraction
Schema integration
FD enrichment
Querying
Dependency graph
Global schema, Mappings
Local schemas

More Related Content

Similar to [SEBD2020] OLAP Querying of Document Stores in the Presence of Schema Variety

Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
MongoDB
 
Collective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the massesCollective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the masses
Grigori Fursin
 
1 Project 2 Introduction - the SeaPort Project seri.docx
1  Project 2 Introduction - the SeaPort Project seri.docx1  Project 2 Introduction - the SeaPort Project seri.docx
1 Project 2 Introduction - the SeaPort Project seri.docx
honey725342
 

Similar to [SEBD2020] OLAP Querying of Document Stores in the Presence of Schema Variety (20)

Be cse
Be cseBe cse
Be cse
 
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
 
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
 
Searching Repositories of Web Application Models
Searching Repositories of Web Application ModelsSearching Repositories of Web Application Models
Searching Repositories of Web Application Models
 
Collective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the massesCollective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the masses
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Nodejs Chapter 3 - Design Pattern
Nodejs Chapter 3 - Design PatternNodejs Chapter 3 - Design Pattern
Nodejs Chapter 3 - Design Pattern
 
Software Requirement Patterns (SRP)
Software Requirement Patterns (SRP)Software Requirement Patterns (SRP)
Software Requirement Patterns (SRP)
 
Software variability management - 2017
Software variability management - 2017Software variability management - 2017
Software variability management - 2017
 
Reduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective SearchReduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective Search
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)
 
1 Project 2 Introduction - the SeaPort Project seri.docx
1  Project 2 Introduction - the SeaPort Project seri.docx1  Project 2 Introduction - the SeaPort Project seri.docx
1 Project 2 Introduction - the SeaPort Project seri.docx
 
Discovering the New SuccessFactors LMS Admin Features
Discovering the New SuccessFactors LMS Admin FeaturesDiscovering the New SuccessFactors LMS Admin Features
Discovering the New SuccessFactors LMS Admin Features
 
Using Application Skeletons to Improve eScience Infrastructure
Using Application Skeletons to Improve eScience InfrastructureUsing Application Skeletons to Improve eScience Infrastructure
Using Application Skeletons to Improve eScience Infrastructure
 
Eclipse plug in development
Eclipse plug in developmentEclipse plug in development
Eclipse plug in development
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
Proof of Concept for Learning Analytics Interoperability
Proof of Concept for Learning Analytics InteroperabilityProof of Concept for Learning Analytics Interoperability
Proof of Concept for Learning Analytics Interoperability
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
 
2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul
 

More from University of Bologna

Data models in precision agriculture: from IoT to big data analytics
Data models in precision agriculture: from IoT to big data analyticsData models in precision agriculture: from IoT to big data analytics
Data models in precision agriculture: from IoT to big data analytics
University of Bologna
 
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
University of Bologna
 

More from University of Bologna (12)

Data models in precision agriculture: from IoT to big data analytics
Data models in precision agriculture: from IoT to big data analyticsData models in precision agriculture: from IoT to big data analytics
Data models in precision agriculture: from IoT to big data analytics
 
[EDBT2023] Describing and Assessing Cubes Through Intentional Analytics (demo...
[EDBT2023] Describing and Assessing Cubes Through Intentional Analytics (demo...[EDBT2023] Describing and Assessing Cubes Through Intentional Analytics (demo...
[EDBT2023] Describing and Assessing Cubes Through Intentional Analytics (demo...
 
[DataPlat2023] Opening
[DataPlat2023] Opening[DataPlat2023] Opening
[DataPlat2023] Opening
 
[DOLAP2023] The Whys and Wherefores of Cubes
[DOLAP2023] The Whys and Wherefores of Cubes[DOLAP2023] The Whys and Wherefores of Cubes
[DOLAP2023] The Whys and Wherefores of Cubes
 
[ADBIS2022] Insight-based Vocalization of OLAP Sessions
[ADBIS2022] Insight-based Vocalization of OLAP Sessions[ADBIS2022] Insight-based Vocalization of OLAP Sessions
[ADBIS2022] Insight-based Vocalization of OLAP Sessions
 
[SEBD2021] Conversational OLAP
[SEBD2021] Conversational OLAP[SEBD2021] Conversational OLAP
[SEBD2021] Conversational OLAP
 
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
 
[EDBT2021] Conversational OLAP in Action (Best Demo Award EDBT2021)
[EDBT2021] Conversational OLAP in Action (Best Demo Award EDBT2021)[EDBT2021] Conversational OLAP in Action (Best Demo Award EDBT2021)
[EDBT2021] Conversational OLAP in Action (Best Demo Award EDBT2021)
 
[EDBT2021] Assess Queries for Interactive Analysis of Data Cubes
[EDBT2021] Assess Queries for Interactive Analysis of Data Cubes[EDBT2021] Assess Queries for Interactive Analysis of Data Cubes
[EDBT2021] Assess Queries for Interactive Analysis of Data Cubes
 
[DOLAP2020] Towards Conversational OLAP
[DOLAP2020] Towards Conversational OLAP[DOLAP2020] Towards Conversational OLAP
[DOLAP2020] Towards Conversational OLAP
 
[MIPRO2019] Map-Matching on Big Data: a Distributed and Efficient Algorithm w...
[MIPRO2019] Map-Matching on Big Data: a Distributed and Efficient Algorithm w...[MIPRO2019] Map-Matching on Big Data: a Distributed and Efficient Algorithm w...
[MIPRO2019] Map-Matching on Big Data: a Distributed and Efficient Algorithm w...
 
[DOLAP2019] Augmented Business Intelligence
[DOLAP2019] Augmented Business Intelligence[DOLAP2019] Augmented Business Intelligence
[DOLAP2019] Augmented Business Intelligence
 

Recently uploaded

Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Sérgio Sacani
 
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynypptAerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
sreddyrahul
 
Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a Technosignature
Sérgio Sacani
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
AADYARAJPANDEY1
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
muralinath2
 
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdfPests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
PirithiRaju
 
Seminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxSeminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptx
RUDYLUMAPINET2
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...
Sérgio Sacani
 

Recently uploaded (20)

Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
 
National Biodiversity protection initiatives and Convention on Biological Di...
National Biodiversity protection initiatives and  Convention on Biological Di...National Biodiversity protection initiatives and  Convention on Biological Di...
National Biodiversity protection initiatives and Convention on Biological Di...
 
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynypptAerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
 
Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a Technosignature
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Topography and sediments of the floor of the Bay of Bengal
Topography and sediments of the floor of the Bay of BengalTopography and sediments of the floor of the Bay of Bengal
Topography and sediments of the floor of the Bay of Bengal
 
SAMPLING.pptx for analystical chemistry sample techniques
SAMPLING.pptx for analystical chemistry sample techniquesSAMPLING.pptx for analystical chemistry sample techniques
SAMPLING.pptx for analystical chemistry sample techniques
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Constraints on Neutrino Natal Kicks from Black-Hole Binary VFTS 243
Constraints on Neutrino Natal Kicks from Black-Hole Binary VFTS 243Constraints on Neutrino Natal Kicks from Black-Hole Binary VFTS 243
Constraints on Neutrino Natal Kicks from Black-Hole Binary VFTS 243
 
Shuaib Y-basedComprehensive mahmudj.pptx
Shuaib Y-basedComprehensive mahmudj.pptxShuaib Y-basedComprehensive mahmudj.pptx
Shuaib Y-basedComprehensive mahmudj.pptx
 
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptxGLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
Hemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. MuralinathHemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. Muralinath
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdfPests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Seminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxSeminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptx
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...
 

[SEBD2020] OLAP Querying of Document Stores in the Presence of Schema Variety

  • 1. OLAP Querying of Document Stores in the Presence of Schema Variety Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi {m.francia, enrico.gallinucci, matteo.golfarelli, stefano.rizzi} @ unibo.it
  • 2. Introduction Schemaless databases (e.g., document stores) are preferred for storing heterogeneous data with variable structural forms • Missing or additional fields • Different names or types for a field • Different structuresfor instances Absence of a unique schema • More flexibility to operational applications, more complexity to analytical applications Adopting a classical DW design approach requires notable effort • Understand the rules that drove the use of alternative schemas • Integrate to identify a common schema to be adopted for analysis Enrico Gallinucci – University of Bologna SEBD 2020 2
  • 3. Our proposal An approach to OLAP querying on document stores • Embrace and exploit the inherent variety of documents • Inclusive solution: the user can query a concept even if it is present in a subset of documents only Enrico Gallinucci – University of Bologna SEBD 2020 3 Document store Schema extraction Schema integration FD enrichment Querying Dependency graph Global schema, Mappings Local schemas Formulation Validation Reformulation Translation Execution Merge
  • 4. Case study Real-world collection of workout sessions • MongoDB 3.4 • 6 local schemas • 5M workout sessions • 35M exercises • 85M sets Enrico Gallinucci – University of Bologna SEBD 2020 4 [ { "_id" : ObjectId("54a4332f44"), "User" : { "FullName" : ”John Smith", "Age" : 42 }, "StartedOn" : ISODate("2017-06-15"), "Facility" : { "Name" : "PureGym Piccadilly", "Chain" : "PureGym" }, "SessionType" : "RunningProgram", "DurationMins": 90, "Exercises" : [ { "Type" : "Leg press", "ExCalories" : 28, "Sets" : [ { "Reps" : 14, "Weight" : 60 }, . . . ] }, { "Type" : "Tapis roulant" }, . . . ] } , . . . ]
  • 5. Schema extraction Goal: identify the set of distinct local schemas that occur inside a collection of documents • Completely automatic stage Enrico Gallinucci – University of Bologna SEBD 2020 5 Document store Schema extraction Schema integration FD enrichment Querying Dependency graph Global schema, Mappings Local schemas
  • 6. Schema integration Goal: integrate the local schemas to obtain a single, global schema • A preliminary global schema is defined as the name-based union of all local schemas (automatic) • The user refines the preliminary global schema by merging matching (sets of) fields (pay-as-you-go) • Existing tools (e.g., Coma 3.0) can be used to automatically find candidate matches Enrico Gallinucci – University of Bologna SEBD 2020 6 Document store Schema extraction Schema integration FD enrichment Querying Dependency graph Global schema, Mappings Local schemas
  • 7. Goal: propose a MD view of the global schema to enable OLAP analyses • Identification of hierarchies → Identification of functional dependencies (FDs) • Some exact FDs can be derived from the global schema • Additional FDs can only be found by querying the data • DSs may contain incomplete and faulty data • We look for approximate FDs (AFDs) • Efficient strategy for AFD detection [1] Schema enrichment Enrico Gallinucci – University of Bologna SEBD 2020 7 Document store Schema extraction Schema integration FD enrichment Querying Dependency graph Global schema, Mappings Local schemas [1] Golfarelli, Graziani, Rizzi: Starry vault: Automating multidimensional modeling from data vaults. In: Proc. ADBIS (2016)
  • 8. Schema enrichment (example) Enrico Gallinucci – University of Bologna SEBD 2020 8 Exercises.Type Exercises.Sets.Weight Exercises.Sets.Reps _id User.FullName SessionType Facility.Name Facility.City Facility.ChainUser.Age User.FirstName User.LastName DurationMins Exercises._id Exercises.Sets. _id Exercises.ExCalories Exercises.Sets.SetCalories level (support=1) level (support<1) FD AFD Facility.Country The result is a dependency graph, providing a MD view in terms of FDs
  • 9. Querying (1) Goal: enable the formulation of OLAP queries on the dependency graph and their execution on the collection • Validate against well-formedness requirements in the literature • There must exist a field that functionally determines all the others (the fact) • Functional independence in group-by set • Disjointess:the granularity of measures should be finer than the one of the group-by fields (or: double-counting) • Completeness:the group-by fields should have full support (balancing strategies are used) Enrico Gallinucci – University of Bologna SEBD 2020 9 Document store Schema extraction Schema integration FD enrichment Querying Formulation Validation Reformulation Translation Execution Merge
  • 10. Querying (example) q = < {Facility.City; Exercises.Type}; (User.Age≥ 60); Exercises.Sets.Weight;avg > Enrico Gallinucci – University of Bologna SEBD 2020 10 db.WS.aggregate({ { $unwind: "$Exercises" }, { $unwind: "$Exercises.Sets" }, { $match: { "User.Age": { $gte: 60 } } }, { $project: { "Facility.City": { $ifNull: ["$FacilityCity","$FacilityName"] } }, "Exercises.Type": 1, "Exercises.Sets.Weight": 1, "balanced": { $cond: ["$FacilityCity",false,true] } } }, { $group: { " id": { "FacilityCity", "$FacilityCity", "ExercisesType", "$Exercises.Type", "balanced", "$balanced" }, "Exercises.Sets.Weight": { $avg: "$Exercises.Sets.Weight" }, "count": { $sum: 1 }, "count-m": { $sum: f $cond: ["$Exercises.Sets.Weight",1,0] } } } } } ) Exercises.Type Exercises.Sets.Weight Exercises.Sets.Reps _id User.FullName SessionType Facility.Name Facility.City Facility.ChainUser.Age User.FirstName User.LastName DurationMins Exercises._id Exercises.Sets. _id Exercises.ExCalories Exercises.Sets.SetCalories level (support=1) level (support<1) FD AFD Facility.Country
  • 11. Goal: enable the formulation of multidimensional queries on the dependency graph and their execution on the collection • Reformulate into multiple queries, one for each local schema • Rely on a previous query reformulation approach for federated DWs • Translate into the query language of the DS • Multi-stage pipelines on MongoDB • Execute the queries and merge the results • Performed in-memory • OLAP queries usually produce limited results • Transcoding functions homogenize values Document store Schema extraction Schema integration FD enrichment Querying Formulation Validation Reformulation Translation Execution Merge Querying (2) Enrico Gallinucci – University of Bologna SEBD 2020 11
  • 12. Future work • Increase the efficiency of the querying phase • Optimize query reformulation to issue a single but more complex query • Broaden the scope of the approach • Multiple collections and multiple store with different data models • Evaluate alternative execution engines (e.g., big data tools) • Build a fully working prototype • Introduction of online repairing of AFDs • Automatically repair the errors in AFDs at query time • Directly receive correct results without need to modify the original data Enrico Gallinucci – University of Bologna SEBD 2020 12
  • 14. Query evaluation • Level density • Necessary when the group-by fails the completeness constraint • Evaluate the impact of the balancing strategy • Measure density • Necessary when the measure of the query does not have full support and the results offer only a partial summary • Evaluate the precision of result depending on how many documentscontain the measure • Accuracy • Necessary when a new query is formulated during an OLAP session • Quantify the accuracy of results w.r.t. those obtained from the previous query Enrico Gallinucci – University of Bologna SEBD 2020 14
  • 15. Query execution times Execution times of three queries (qa, qb, and qc) split into the execution times of the required local queries Enrico Gallinucci – University of Bologna SEBD 2020 15
  • 16. Schema extraction (example) Enrico Gallinucci – University of Bologna SEBD 2020 16 _id User.FullName User.Age StartedOn Facility.Name Facility.Chain SessionType DurationMins Sets_id Reps Weight Exercises_id Type ExCalories WS Sets Exercises { "_id" : ObjectId("54a4332f44"), "User" : { "FullName" : ”John Smith", "Age" : 42 }, "StartedOn" : ISODate("2017-06-15"), "Facility" : { "Name" : "PureGym Piccadilly", "Chain" : "PureGym" }, "SessionType" : "RunningProgram", "DurationMins": 90, "Exercises" : [ { "Type" : "Leg press", "ExCalories" : 28, "Sets" : [ { "Reps" : 14, "Weight" : 60 }, . . . ] }, { "Type" : "Tapis roulant" }, . . . ] } s1
  • 17. Schema integration (example) Enrico Gallinucci – University of Bologna SEBD 2020 17 Exercises_id Type ExCalories _id User.FullName User.Age StartedOn Facility.Name Facility.Chain SessionType DurationMins Sets_id Reps Weight Exercises_id Type ExCalories _id FirstName LastName Date Gym.Name Gym.City Gym.Country SessionType DurationSecs Series_id ExType Reps Weight SeriesCalories _id User.FullName User.FirstName User.LastName User.Age StartedOn Facility.Name Facility.Chain Facility.City Facility.Country SessionType DurationMins Sets_id Reps Weight SetCalories WS Sets Exercises WS Series WS Sets Exercises s1 s2g
  • 18. Schema integration Goal: integrate the local schemas to obtain a single, global schema • Schema matching: identify the single matches between the attributes • Schema mapping: define mappings between each local schema and the global one • Mappings include transcoding functions to enable query reformulation • E.g., in the presence of selection predicates, or to integrate the query results Schema integration is done in two steps • Define a preliminary global schema as the name-based union of all local schemas (automatic) • Refine the preliminary global schema by merging matching (sets of) fields (pay-as-you-go) • Existing tools (e.g., Coma 3.0)can be used to automatically find candidate matches Enrico Gallinucci – University of Bologna SEBD 2020 18 Document store Schema extraction Schema integration FD enrichment Querying Dependency graph Global schema, Mappings Local schemas