[DOLAP2020] Towards Conversational OLAP

Towards Conversational OLAP
Matteo Francia, Enrico Gallinucci, Matteo Golfarelli
m.francia@unibo.it
University of Bologna
DOLAP2020

Data access democratization
Smart assistants are in companies’ agendas [1, 2]
Goal: perform conversational OLAP sessions
Existing OLAP interfaces: point-and-click metaphor to avoid SQL
Translate natural language (NL) into a Generalized Projection, Selection and Join (GPSJ) query [3]
Differences from state-of-the-art approaches [4, 5, 6, 7]
1. End-to-end dialog-driven framework for OLAP sessions
2. Plug-and-play: no impact on DW
3. No mandatory external knowledge

Functional architecture
[Figure: functional architecture]
Offline: Automatic KB feeding (metadata & values from the DW); KB enrichment (synonyms, ontology)
Online: Speech-to-Text (raw text) → Interpretation, as Full query or OLAP operator (annotated parse forest) → Disambiguation (parse tree) → SQL generation (SQL) → Execution & Visualization (results)
Other labels: KB, DW, Log, Synonyms
Example: "Sales by Customer and Month"

Full query: Tokenization and mapping
Match raw text n-grams with known DW entities in KB
[Figure: DFM schema of the Sales fact. Attributes: Product (Name, Type, Category, Family); Customer (C.City, C.Region, Gender); Store (S.City, S.Region); Date (Quarter, Month, Year). Measures: StoreSales, StoreCost, UnitSales.]
NL = “medium sales for product New York by the region”
T = ⟨medium, sales, for, product, New, York, by, region⟩
Matched entities: average, UnitSales, where, Product, New York, group by, Region or Regin
KB entities (excerpt): …, avg, group by, New York, Product, Regin, Region, UnitSales, where, …
Build mappings (i.e., combinations of entities)
M1 = ⟨avg, UnitSales, where, Product, New York, group by, Region⟩
M2 = ⟨avg, UnitSales, where, Product, New York, group by, Regin⟩
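The matching and mapping steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `KB`, the `difflib`-based similarity, the greedy longest-match segmentation and the threshold `alpha=0.85` are all assumptions made for this example (the slides' own Sim function and thresholds differ).

```python
from difflib import SequenceMatcher
from itertools import product

# Toy KB (hypothetical): surface form -> DW entity
KB = {
    "medium": "avg", "average": "avg",
    "sales": "UnitSales", "unit sales": "UnitSales",
    "for": "where", "by": "group by",
    "product": "Product", "new york": "New York",
    "region": "Region", "regin": "Regin",
}

def sim(a, b):
    # Stand-in string similarity in [0, 1]; an assumption, not the slides' Sim
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidates(ngram, top_n=2, alpha=0.85):
    # Top-N distinct entities whose surface form is similar enough to the n-gram
    scored = sorted(((sim(ngram, form), ent) for form, ent in KB.items()), reverse=True)
    out = []
    for s, ent in scored:
        if s >= alpha and ent not in out:
            out.append(ent)
    return out[:top_n]

def segment(tokens, max_n=4, **kw):
    # Greedy left-to-right segmentation: try the longest n-gram first
    i, slots = 0, []
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            cands = candidates(" ".join(tokens[i:i + n]), **kw)
            if cands:
                slots.append(cands)
                i += n
                break
        else:
            i += 1  # no KB match for this token: skip it
    return slots

def mappings(text, **kw):
    # A mapping is one combination of candidate entities, one per matched n-gram
    return [list(m) for m in product(*segment(text.split(), **kw))]

for m in mappings("medium sales for product New York by the region"):
    print(m)
```

On the running example, the token "region" matches both Region and Regin, which yields the two mappings M1 and M2.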

Full query: Parsing I
Grammar-based translation (effective for narrow lexicon [8, 9])
Capture complex syntax structures (i.e., query clauses)
Unambiguous LL(1) grammar [10]: one parse tree PTM per mapping M
GPSJ ::= MC GC SC | MC SC | MC GC | MC |...
MC ::= ( Agg Mea | Mea | Cnt Fct | ...)+
GC ::= Gby Attr+
SC ::= Whr SCO
SCO ::= SCA “or” SCO | SCA
SCA ::= SCN “and” SCA | SCN
SCN ::= “not” SSC | SSC
SSC ::= Attr Cop Val | Val Attr | Val | ...
Category   Entity                   Synonym samples
Int        select                   return, show, get
Whr        where                    in, such that
Gby        group by                 by, for each, per
Cop        =, <>, >, <, ≥, ≤        equal to, greater than
Agg        sum, avg                 total, medium
Cnt        count, count distinct    number, amount
Fct        Facts                    Domain specific
Mea        Measures                 Domain specific
Attr       Attributes               Domain specific
Val        Categorical values       Domain specific
           Dates and numbers        -
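A grammar of this shape can be parsed with one token of lookahead. The sketch below is a hand-written recursive-descent parser for a subset of the rules above, not the authors' parser. Assumptions: a mapping is given as a list of (category, entity) pairs, the SC and GC clauses may follow MC in either order, and `Attr Val` (without Cop) is accepted as one of the elided SSC alternatives.

```python
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, mapping):
        self.toks = list(mapping)  # [(category, entity), ...]
        self.pos = 0

    def peek(self):
        # Category of the next token, or None at end of input
        return self.toks[self.pos][0] if self.pos < len(self.toks) else None

    def eat(self, cat):
        if self.peek() != cat:
            raise ParseError(f"expected {cat}, found {self.peek()} at position {self.pos}")
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def gpsj(self):  # GPSJ ::= MC followed by optional SC and GC (assumed: any order)
        kids, seen = [self.mc()], set()
        while self.peek() in ("Whr", "Gby") and self.peek() not in seen:
            seen.add(self.peek())
            kids.append(self.sc() if self.peek() == "Whr" else self.gc())
        if self.peek() is not None:
            raise ParseError(f"unparsed input from position {self.pos}")
        return ("GPSJ", kids)

    def mc(self):  # MC ::= ( Agg Mea | Mea | Cnt Fct )+
        kids = []
        while self.peek() in ("Agg", "Mea", "Cnt"):
            if self.peek() == "Agg":
                kids += [self.eat("Agg"), self.eat("Mea")]
            elif self.peek() == "Cnt":
                kids += [self.eat("Cnt"), self.eat("Fct")]
            else:
                kids.append(self.eat("Mea"))
        if not kids:
            raise ParseError("measure clause expected")
        return ("MC", kids)

    def gc(self):  # GC ::= Gby Attr+
        kids = [self.eat("Gby"), self.eat("Attr")]
        while self.peek() == "Attr":
            kids.append(self.eat("Attr"))
        return ("GC", kids)

    def sc(self):  # SC ::= Whr SCO
        return ("SC", [self.eat("Whr"), self.sco()])

    def sco(self):  # SCO ::= SCA "or" SCO | SCA
        kids = [self.sca()]
        if self.peek() == "or":
            kids += [self.eat("or"), self.sco()]
        return ("SCO", kids)

    def sca(self):  # SCA ::= SCN "and" SCA | SCN
        kids = [self.scn()]
        if self.peek() == "and":
            kids += [self.eat("and"), self.sca()]
        return ("SCA", kids)

    def scn(self):  # SCN ::= "not" SSC | SSC
        kids = [self.eat("not")] if self.peek() == "not" else []
        return ("SCN", kids + [self.ssc()])

    def ssc(self):  # SSC ::= Attr Cop Val | Attr Val (assumed) | Val Attr | Val
        if self.peek() == "Attr":
            kids = [self.eat("Attr")]
            if self.peek() == "Cop":
                kids.append(self.eat("Cop"))
            kids.append(self.eat("Val"))
        else:
            kids = [self.eat("Val")]
            if self.peek() == "Attr":
                kids.append(self.eat("Attr"))
        return ("SSC", kids)

m1 = [("Agg", "avg"), ("Mea", "UnitSales"), ("Whr", "where"), ("Attr", "Product"),
      ("Val", "New York"), ("Gby", "group by"), ("Attr", "Region")]
tree = Parser(m1).gpsj()
print([child[0] for child in tree[1]])
```

This sketch raises `ParseError` on the first entity it cannot place, whereas the approach in the slides keeps the partially parsed result as a parse forest.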

Full query: Parsing II
[Figure: parse tree of a fully parsed mapping. The GPSJ root has an MC subtree (Agg, Mea), an SC subtree (Whr, then SCO, SCA, SCN, SSC over Attr and Val) and a GC subtree (Gby, Attr).]
Fully parsed: PTM includes all entities as leaves

Full query: Parsing III
[Figure: parse forest of a partially parsed mapping. The GPSJ tree (PTM) covers the MC and SC clauses; the remaining entities form separate subtrees outside PTM.]
Partially parsed: some entities are not included in PTM (parse forest PFM)
⟨..., group by, Regin⟩: cannot group by a value
If fully parsed, PFM = PTM

Full query: Checking & Enhancement
Parsing only verifies syntactic adherence to the grammar; further issues can arise
Tag problematic subtrees in PFM with annotations
[Figure: annotated parse forest for M = ⟨avg, UnitSales, where, Product, =, New York, group by, Regin⟩. The SSC subtree (Attr, Cop, Val) is tagged AVM (Attribute-Value Mismatch) and the trailing clause is tagged unparsed. Brackets mark the entities covered by Score(M) and Score(PFM).]
Annotation type             Gen. derivation sample
Ambiguous Attribute         SSC ::= Val
Ambiguous Agg. Operator     MC ::= Mea
Attribute-Value Mismatch    SSC ::= Attr Cop Val
MD-Meas Violation           MC ::= Agg Mea
MD-GBY Violation            GC ::= Gby Attr+
Unparsed clause             –
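Four of the annotation types in the table can be pictured as simple checks against metadata. The sketch below is only an illustration under stated assumptions: `DOMAINS` and `ALLOWED_AGGS` are made-up toy metadata (in the architecture they would come from the KB), and `check_ssc` and `check_mc` are hypothetical helper names.

```python
# Made-up toy metadata; values such as "Beer" or "Boston" are placeholders
DOMAINS = {
    "Product": {"Beer", "Milk"},
    "C.City": {"New York", "Boston"},
    "S.City": {"New York", "Seattle"},
}
ALLOWED_AGGS = {"UnitSales": ["sum", "avg"], "StoreCost": ["sum"]}

def check_ssc(val, attr=None):
    """Annotations for a selection predicate (SSC ::= Val or SSC ::= Attr Cop Val)."""
    if attr is None:
        # A bare value may belong to several attributes
        owners = sorted(a for a, vals in DOMAINS.items() if val in vals)
        return [("Ambiguous Attribute", owners)] if len(owners) > 1 else []
    domain = DOMAINS.get(attr, set())
    if val not in domain:
        return [("Attribute-Value Mismatch", sorted(domain))]
    return []

def check_mc(mea, agg=None):
    """Annotations for a measure clause (MC ::= Mea or MC ::= Agg Mea)."""
    allowed = ALLOWED_AGGS.get(mea, [])
    if agg is None:
        # No operator given although several are allowed
        return [("Ambiguous Agg. Operator", allowed)] if len(allowed) > 1 else []
    if agg not in allowed:
        return [("MD-Meas Violation", allowed)]
    return []

print(check_ssc("New York", "Product"))
print(check_mc("StoreCost", "avg"))
```

Each annotation carries the alternatives needed to phrase a question to the user later on.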

Disambiguation
Handle annotations
Add implicit information to PFM
Ask one question per annotation to reduce PFM to PTM
SQL generation: translate PTM into SQL code
[Figure: annotated parse forest with the questions posed to the user. For the AVM annotation: “New York is not a valid Product, possible Products are…”. For the unparsed clause: “Dangling clause, do you want to add it or drop it?”]
Annotation type             Description
Ambiguous Attribute         Val is member of these attributes [...]
Ambiguous Agg. Operator     Mea allows these operators [...]
Attribute-Value Mismatch    Attr and Val domains mismatch, values are [...]
MD-Meas Violation           Mea does not allow Agg, operators are [...]
MD-GBY Violation            It is not allowed to group by on Attr without Attr
Unparsed GC clause          There is a dangling grouping clause GC
Unparsed MC clause          There is a dangling measure clause MC
Unparsed SC clause          There is a dangling predicate clause SC
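Once every annotation has been resolved, the SQL generation step can be pictured roughly as follows. This is a sketch under strong assumptions, not the authors' generator: the disambiguated query is passed as plain lists rather than as a parse tree, the DW is exposed as a single flat view with the hypothetical name `sales_view` (so no joins are generated), and the value "Beer" is a placeholder for whatever valid Product the user picked.

```python
def to_sql(measures, predicates=(), group_by=(), table="sales_view"):
    """Build a GPSJ-style SQL string.

    measures:   list of (agg, measure) pairs, e.g. [("avg", "UnitSales")]
    predicates: list of (attribute, operator, value) triples, joined with AND
    group_by:   list of attribute names
    table:      hypothetical flat view over the star schema
    """
    select = list(group_by) + [f"{agg.upper()}({mea}) AS {agg}_{mea}" for agg, mea in measures]
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if predicates:
        # Naive quoting for illustration only; real code should use bound parameters
        conds = []
        for attr, op, val in predicates:
            escaped = str(val).replace("'", "''")
            conds.append(f"{attr} {op} '{escaped}'")
        sql += " WHERE " + " AND ".join(conds)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql

print(to_sql([("avg", "UnitSales")], [("Product", "=", "Beer")], ["Region"]))
```

Keeping the group-by attributes in the SELECT list mirrors the generalized projection of a GPSJ query: the result has one row per group, with the aggregated measures alongside.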

Robustness vs Complexity I
Multiple mappings ensure robustness...
NL = “medium sales for product New York by the regin”
T = ⟨medium, sales, for, product, New, York, by, regin⟩
Matched entities: average, UnitSales, where, Product, New York, group by, Regin or Region
More mappings, more interpretations
User experience: return only most promising query

Robustness vs Complexity II
Optimistic-pessimistic score function
$$\mathrm{Score}(M) = \sum_{i=1}^{|M|} \mathrm{Sim}(T_i, E_i)$$
where T_i is the n-gram of T matched to entity E_i
Score(PFM) = Score(M′), where M′ is the sub-sequence of M belonging to PTM
[Figure: parse tree for M = ⟨avg, UnitSales, where, Product, =, New York, group by, Region⟩ with MC, SC and GC subtrees; the SSC subtree (Attr, Cop, Val) is tagged AVM. A bracket marks the entities covered by Score(PFM).]

Robustness vs Complexity III
Optimistic: annotations in PTM are likely to be solved
[Figure: parse tree with an SSC subtree tagged AVM; the entities under the annotated subtree are still covered by Score(PFM).]


Robustness vs Complexity IV
Pessimistic: unparsed clauses are likely to be dropped
[Figure: annotated parse forest with AVM and unparsed tags, shown next to a GC subtree (Gby, Attr) over ⟨group by, Region⟩; a bracket marks the entities covered by Score(PFM).]
Ranking by Score(PFM) allows pruning of parsed mappings
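The two scores and the pruning they enable can be sketched as follows. This is one possible reading, stated as an assumption: since Score(PFM) can never exceed Score(M) when similarities are non-negative, mappings can be visited by decreasing Score(M) and the search can stop as soon as Score(M) falls to the best Score(PFM) found so far. The function names and `toy_parse` (a stand-in for the real parser) are hypothetical.

```python
def score(sims):
    """Score(M): sum of the similarities between matched n-grams and entities."""
    return sum(sims)

def score_pf(sims, in_tree):
    """Score(PF_M): only entities that are leaves of the parse tree contribute."""
    return sum(s for s, keep in zip(sims, in_tree) if keep)

def best_interpretation(candidates, parse):
    """candidates: list of (mapping, sims); parse(mapping) -> one bool per entity."""
    best, best_score = None, float("-inf")
    for mapping, sims in sorted(candidates, key=lambda c: score(c[1]), reverse=True):
        if score(sims) <= best_score:
            break  # Score(PF_M) <= Score(M): no remaining mapping can do better
        s = score_pf(sims, parse(mapping))
        if s > best_score:
            best, best_score = mapping, s
    return best, best_score

def toy_parse(mapping):
    # Stand-in for the real parser: "group by" followed by the value Regin stays unparsed
    if mapping[-1] == "Regin":
        return [True] * (len(mapping) - 2) + [False, False]
    return [True] * len(mapping)

prefix = ["avg", "UnitSales", "where", "Product", "New York", "group by"]
candidates = [
    (prefix + ["Regin"], [1.0] * 7),            # exact match for the typed "regin"
    (prefix + ["Region"], [1.0] * 6 + [0.9]),   # near match, assumed similarity 0.9
]
best, s = best_interpretation(candidates, toy_parse)
print(best[-1], round(s, 2))
```

In this toy run the mapping ending in Regin has the higher Score(M), but its grouping clause is left unparsed, so the mapping ending in Region wins on Score(PFM).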

Evaluation I
Dataset
Real-world analytics queries [11] mapped to the Foodmart schema
75% of queries are valid GPSJ queries
110 manually annotated queries
Automatic feeding: 1 fact, 39 attributes, 12 500 entities
Manual feeding: only 50 synonyms ("for each" synonym of group by)
Parameters
n-grams, n ∈ [1..4]
... mapped to top N entities with similarity ≥ α
Consider mappings covering at least 70% of T

Evaluation II
Accuracy: tree similarity TSim(PTM, PT*) [12] between the produced PTM and the correct PT*
[Figure: TSim (y-axis, 0.0 to 1.0) against the number k of queries returned (x-axis, 1 to 5). (a) Varying N ∈ {2, 4, 6}, with α = 0.4. (b) Varying α ∈ {0.4, 0.5, 0.6}, with N = 6.]
Accuracy depends on vocabulary (i.e., matched entities)
Proposing only one query (k = 1) slightly impacts accuracy
Accuracy in [0.85, 0.9]

Evaluation III
[Figure: TSim (y-axis, 0.0 to 1.0) against the number of disambiguation steps (x-axis, 0 to 3), with k = 1, N = 6 and α = 0.4.]
Disambiguation increases accuracy up to 0.94
State-of-the-art accuracy [4, 5, 6]

Conclusion & further enhancements
So far
Provided an architecture for conversational OLAP that meets the desiderata
Automated and portable: no impact on DW
Plug and play: no heavy manual lexicon definition
Robustness: adapts to speech and syntax inaccuracies
Translated NL into a well-formed GPSJ query
What’s next?
1. Support a conversational OLAP session
Extend grammar with dialog primitives (i.e., OLAP operators)
Manage and refine previous parse trees
2. Learn frequent disambiguations to minimize user interaction
3. Design metaphor to support interaction
Visual metaphor based on DFM
4. Test with real users to verify perceived effectiveness
Usability, immediacy, memorability

End

References I
[1] Ramanathan V. Guha, Vineet Gupta, Vivek Raghunathan, and Ramakrishnan Srikant.
User modeling for a personal assistant.
In WSDM, pages 275–284. ACM, 2015.
[2] Hype cycle for artificial intelligence, 2018.
http://www.gartner.com/en/documents/3883863/hype-cycle-for-artificial-intelligence-2018.
Accessed: 2019-06-21.
[3] Ashish Gupta, Venky Harinarayan, and Dallan Quass.
Aggregate-query processing in data warehousing environments.
In VLDB, pages 358–369. Morgan Kaufmann, 1995.
[4] Fei Li and H. V. Jagadish.
Understanding natural language queries over relational databases.
SIGMOD Record, 45(1):6–13, 2016.
[5] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig.
Sqlizer: query synthesis from natural language.
PACMPL, 1(OOPSLA):63:1–63:26, 2017.
[6] Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R. Mittal, and Fatma Özcan.
ATHENA: an ontology-driven system for natural language querying over relational data stores.
PVLDB, 9(12):1209–1220, 2016.
[7] Nicolas Kuchmann-Beauger, Falk Brauer, and Marie-Aude Aufaure.
QUASL: A framework for question answering and its application to business intelligence.
In RCIS, pages 1–12. IEEE, 2013.

References II
[8] Kedar Dhamdhere, Kevin S. McCurley, Ralfi Nahmias, Mukund Sundararajan, and Qiqi Yan.
Analyza: Exploring data with conversation.
In IUI, pages 493–504. ACM, 2017.
[9] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein.
A comparative survey of recent natural language interfaces for databases.
The VLDB Journal, 28(5):793–819, 2019.
[10] John C. Beatty.
On the relationship between LL(1) and LR(1) grammars.
J. ACM, 29(4):1007–1022, 1982.
[11] Krista Drushku, Julien Aligon, Nicolas Labroche, Patrick Marcel, and Verónika Peralta.
Interest-based recommendations for business intelligence users.
Inf. Syst., 86:79–93, 2019.
[12] Kaizhong Zhang and Dennis E. Shasha.
Simple fast algorithms for the editing distance between trees and related problems.
SIAM J. Comput., 18(6):1245–1262, 1989.
