This document outlines Ritu Khare's dissertation presentation on mapping user-designed forms to relational databases. The presentation covers the motivation for the research, problems in existing approaches, and proposed solutions. Specifically, the presentation discusses understanding form semantics, discovering correspondences between forms and databases, and integrating forms into databases while maintaining properties like completeness, correctness, and normalization. The goal is to evolve databases from user forms with minimal user intervention.
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab... (Editor IJCATR)
The introduction of object-oriented concepts into databases has caused relational databases to be gradually replaced by object-oriented databases in various fields. Meanwhile, several methods have been proposed to handle the uncertain data of the real world. One such method for database modeling is an approach that couples object-oriented database modeling with fuzzy logic. Many queries that users pose are expressed in terms of linguistic variables; because classical databases cannot support these variables, fuzzy approaches are considered. This study investigates database queries in both simple and complex forms; for the complex form, it uses conjunctive and disjunctive queries. XML labels are then used to express the queries in fuzzy form, and entering the XML world provides a reliable way to communicate with other parts of the software. A further aim is to refine conjunctive and disjunctive queries over a fuzzy object-oriented database using the concepts of dependency measure and weight, where weights are assigned to the different phrases of a query according to user emphasis. The research also maps fuzzy queries to fuzzy-XML, with the expectation that queries will be simple to implement and that query results will come much closer to users' needs and expectations. The results show that the proposed method expresses the possible conjunctive and disjunctive queries over the database in Fuzzy-XML form.
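The weighted conjunctive/disjunctive evaluation described above can be sketched using the common fuzzy-logic conventions (conjunction as min, disjunction as max), with weights modeling user emphasis on each query phrase. This is a minimal illustration of the general idea, not the paper's own formulation; all names and values here are invented.

```python
def weighted_conjunction(memberships, weights):
    """Fuzzy AND: a low-weight phrase is softened toward 1 before taking min."""
    return min(max(m, 1.0 - w) for m, w in zip(memberships, weights))

def weighted_disjunction(memberships, weights):
    """Fuzzy OR: a low-weight phrase is softened toward 0 before taking max."""
    return max(min(m, w) for m, w in zip(memberships, weights))

# Query: salary is "high" (membership 0.8) AND age is "young" (0.4),
# with the user emphasizing salary (weight 1.0) over age (weight 0.5).
and_score = weighted_conjunction([0.8, 0.4], [1.0, 0.5])
or_score = weighted_disjunction([0.8, 0.4], [1.0, 0.5])
print(and_score, or_score)  # → 0.5 0.8
```

With all weights set to 1.0 this reduces to the plain min/max fuzzy connectives, so the weighting only relaxes the influence of de-emphasized phrases.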
IRJET- Semantic Retrieval of Trademarks based on Text and Images Conceptu... (IRJET Journal)
The document proposes a novel Weakly-supervised Deep Matrix Factorization (WDMF) algorithm for social image tag refinement, assignment and retrieval. WDMF uncovers latent image and tag representations in a latent subspace by exploiting weakly supervised tagging information, visual structure and semantic structure. It can handle noisy, incomplete or subjective tags and noisy or redundant visual features. An optimization problem with a well-defined objective function is formulated and solved using gradient descent with curvilinear search. Extensive experiments on two real-world social image databases demonstrate the effectiveness of the approach.
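The core mechanism behind such factorization models, learning latent image and tag vectors whose dot products reconstruct an observed image-tag matrix, can be sketched with plain gradient descent. This is a generic toy illustration, not the WDMF objective itself, which additionally exploits weak supervision and structure terms.

```python
import random

def factorize(R, k=3, steps=3000, lr=0.05, seed=0):
    """Fit R ≈ U · Vᵀ by per-entry gradient descent on squared error."""
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                err = R[i][j] - sum(U[i][f] * V[j][f] for f in range(k))
                for f in range(k):
                    u, v = U[i][f], V[j][f]
                    U[i][f] += lr * err * v
                    V[j][f] += lr * err * u
    return U, V

# 3 images x 4 tags: 1 means the tag was assigned to the image.
R = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]
U, V = factorize(R)
approx = [[sum(U[i][f] * V[j][f] for f in range(3)) for j in range(4)]
          for i in range(3)]
```

In the full method, the reconstructed scores for unobserved entries are what drive tag refinement and assignment.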
Semantic Conflicts and Solutions in Integration of Fuzzy Relational Databases (ijsrd.com)
This document discusses semantic conflicts that can occur when integrating fuzzy relational databases and proposes a methodology for resolving these conflicts. It identifies five new types of conflicts specific to fuzzy databases: membership degree conflicts, inconsistent attribute values, missing attributes, missing fuzzy attribute values, and attribute domain conflicts. The methodology resolves these conflicts in a specific order to minimize the time needed for integration. It aims to resolve fuzzy database conflicts within the context of resolving other general integration conflicts.
This document proposes a method for annotating faces in images without supervision by mining the web. The method has two steps:
1. It ranks faces retrieved from a text-based search engine based on a local density score, which measures how similar a face is to its neighbors. Faces with higher scores are considered more relevant.
2. It then improves this ranking by modeling it as a classification problem, where faces are classified as the queried person or not. Multiple weak classifiers are trained on different subsets and combined via bagging to reduce noise from the unlabeled data. The faces are then re-ranked based on the classifier probabilities. Repeating this process iteratively improves the ranking.
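The local-density score from step 1 can be sketched simply: each face is scored by its average similarity to its K nearest neighbours, so faces lying in dense regions of feature space (likely the queried person) rank higher. The feature vectors below are invented for illustration.

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def local_density_scores(features, k=2):
    """Score each item by mean similarity to its k nearest neighbours."""
    scores = []
    for i, f in enumerate(features):
        sims = sorted((cosine(f, g) for j, g in enumerate(features) if j != i),
                      reverse=True)
        scores.append(sum(sims[:k]) / k)
    return scores

# Three similar "faces" plus one outlier; the outlier scores lowest.
feats = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0]]
scores = local_density_scores(feats)
ranked = sorted(range(len(feats)), key=lambda i: scores[i], reverse=True)
```

Step 2's bagged classifiers then refine this initial density-based ranking.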
Software Design Patterns - An Overview (Farwa Ansari)
The document summarizes different types of software design patterns. It discusses creational patterns, which deal with object creation mechanisms and increase flexibility. Examples include abstract factory, builder, factory method, prototype and singleton patterns. Structural patterns provide relationships between classes and objects, such as adapter, bridge, composite, and decorator. Behavioral patterns define communication between classes, for example chain of responsibility, command, interpreter, and observer. Design patterns are reusable solutions to common programming problems and increase flexibility and reuse in software design.
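As a concrete instance of the creational patterns listed above, here is a minimal factory-method sketch: the creator defers instantiation to subclasses, so shared client logic depends only on the abstract product interface. The class names are illustrative, not from the document.

```python
from abc import ABC, abstractmethod

class Document(ABC):
    @abstractmethod
    def render(self) -> str: ...

class PdfDocument(Document):
    def render(self) -> str:
        return "pdf"

class HtmlDocument(Document):
    def render(self) -> str:
        return "html"

class Exporter(ABC):
    @abstractmethod
    def create_document(self) -> Document: ...     # the factory method

    def export(self) -> str:
        # Shared client logic: works with any concrete Document.
        return self.create_document().render()

class PdfExporter(Exporter):
    def create_document(self) -> Document:
        return PdfDocument()

class HtmlExporter(Exporter):
    def create_document(self) -> Document:
        return HtmlDocument()
```

Adding a new document format only requires a new product/creator pair; `export` and its callers are untouched, which is the flexibility the pattern buys.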
This document provides an introduction and 18 problems related to linked lists of increasing difficulty. It begins with a review of basic linked list code techniques, such as iterating through a list and adding/removing nodes. The problems cover a wide range of skills with pointers and complex algorithms. Though linked lists are not commonly used today, they are excellent for developing skills with complex pointer-based data structures and algorithms. The document provides solutions to all problems to help readers practice and learn.
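The basic techniques such a review section covers, iterating through a list and adding/removing nodes, look like this in a minimal singly linked list. Python is used here for brevity, though material of this kind is traditionally presented in C.

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def push_front(head, value):
    return Node(value, head)            # new node becomes the head

def to_list(head):
    out = []
    while head is not None:             # the standard iteration idiom
        out.append(head.value)
        head = head.next
    return out

def remove(head, value):
    dummy = Node(None, head)            # dummy node simplifies edge cases
    prev = dummy
    while prev.next is not None:
        if prev.next.value == value:
            prev.next = prev.next.next  # unlink the matching node
            break
        prev = prev.next
    return dummy.next

head = None
for v in (3, 2, 1):
    head = push_front(head, v)          # list is now 1 -> 2 -> 3
head = remove(head, 2)                  # list is now 1 -> 3
```

The dummy-node trick in `remove` is one of the pointer techniques such problem sets exercise: it avoids special-casing deletion of the head.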
The document discusses object-oriented programming and its evolution from structured procedural programming. It describes some of the key disadvantages of structured procedural programming, including a lack of code reusability, extensibility and maintainability. Object-oriented programming aims to address these issues by emphasizing data over procedures and dividing programs into reusable objects that encapsulate both data and functions. The document outlines several fundamental elements of object-oriented programming, including objects, classes, encapsulation, inheritance, polymorphism and dynamic binding.
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION (ijistjournal)
User-generated content on the web grows rapidly in this emergent information age. Evolving technology makes use of such information to capture the essence of user opinion, so that only the useful information is exposed to information seekers. Most existing research on text information processing focuses on the factual domain rather than the opinion domain. In this paper we detect online hotspot forums by computing sentiment analysis over the text data available in each forum. The approach analyses the forum text and computes a value for each word. It combines K-means clustering with a Support Vector Machine optimized by Particle Swarm Optimization (SVM-PSO) to group the forums into two clusters, hotspot and non-hotspot, within the current time span. The accuracy of the proposed system is compared with other classification algorithms such as Naïve Bayes, decision tree and SVM. The experiments show that K-means and SVM-PSO together achieve highly consistent results.
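The clustering step can be sketched in simplified form: K-means with k = 2 splitting forums into hotspot and non-hotspot groups from a single aggregate sentiment score per forum. A real system would cluster richer feature vectors and then apply the SVM-PSO classifier; the scores below are invented for illustration.

```python
def kmeans_1d(values, iters=20):
    """Two-cluster K-means on scalar scores (k = 2: hotspot vs non-hotspot)."""
    centers = [min(values), max(values)]      # simple initialisation
    assign = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each score goes to the nearer center.
        assign = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Update step: each center moves to its members' mean.
        for c in (0, 1):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, assign

# Negative aggregate sentiment ~ heated (hotspot) forums.
scores = [-0.9, -0.7, -0.8, 0.6, 0.5, 0.7]
centers, assign = kmeans_1d(scores)
# Forums 0-2 fall in one cluster (hotspot), 3-5 in the other.
```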
This document discusses fuzzy querying of relational databases. It begins by introducing fuzzy relational database management systems (FRDBMS), which allow imprecise queries using fuzzy logic. It then presents the basic concepts of fuzzy logic and membership functions. The architecture of an FRDBMS is described, including how it translates fuzzy queries into equivalent SQL queries. An example student database is used to demonstrate a fuzzy query for "poor performers" and how it returns more graded results than an exact SQL query. The document concludes that FRDBMS improves the expressiveness of queries over traditional databases.
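The graded behaviour described above can be sketched as follows: a membership function maps each mark to a degree in [0, 1], and rows are returned with their grades instead of being cut off by a hard SQL predicate. The thresholds, table, and names here are hypothetical, not taken from the document.

```python
def poor_membership(mark, full=40.0, none=60.0):
    """Degree of 'poor performer': 1.0 below `full`, 0.0 above `none`,
    linear in between (a simple trapezoid-style shoulder)."""
    if mark <= full:
        return 1.0
    if mark >= none:
        return 0.0
    return (none - mark) / (none - full)

students = [("Asha", 35), ("Ben", 45), ("Carol", 55), ("Dev", 70)]

# Fuzzy query: SELECT name FROM students WHERE marks IS "poor"
result = sorted(((name, round(poor_membership(m), 2))
                 for name, m in students if poor_membership(m) > 0),
                key=lambda r: -r[1])
# A crisp SQL predicate (marks < 40) returns only Asha; the fuzzy
# version also grades Ben (0.75) and Carol (0.25).
```

An FRDBMS layer would translate such a membership condition into an equivalent SQL range query and attach the computed grades to the result set.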
Feature selection, optimization and clustering strategies of text documents (IJECEIAES)
Clustering is one of the most researched areas of data mining in the contemporary literature. The need for efficient clustering is observed across wide-ranging sectors including consumer segmentation, categorization, collaborative filtering, document management, and indexing. Research into the clustering task must be carried out before adapting it to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numeric. Efforts have also been made toward efficient clustering of categorical information, where the selected features can take nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. It also details prominent clustering models along with the pros and cons of each, and covers the latest developments in the clustering task in social networks and associated environments.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION (cscpconf)
Feature clustering is a powerful method for reducing the dimensionality of feature vectors in text classification. In this paper, Fast Fuzzy Feature Clustering for text classification is proposed. It is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011. The words in a document's feature vector are grouped into clusters in fewer iterations. The number of iterations required to obtain cluster centers is reduced by transforming the cluster-center dimension from n dimensions to 2 dimensions; Principal Component Analysis, with a slight modification, is used for this dimension reduction. Experimental results show that the method improves performance by significantly reducing the number of iterations required to obtain the cluster centers, as verified on three benchmark datasets.
This document describes linking a medical vocabulary to a clinical data model using Abstract Syntax Notation 1 (ASN.1). It discusses:
1) Creating a clinical data model in ASN.1 with simple primitive data types that are combined into more complex data types in a layered approach, with the highest level being clinical messages.
2) Incorporating vocabulary into the model using a BaseCoded data type that allows vocabulary concepts and relationships to be referenced using standard ASN.1 notation.
3) Finding ASN.1 to be a flexible and robust notation for representing the clinical data model, with benefits like built-in encoding rules, available tools, and the ability to define and implement an electronic medical record.
Data Structures and Algorithms - Alfred V. Aho, John E. Hopcroft and Jeffrey ... (Chethan Nt)
This document provides an overview of data structures and algorithms, including:
- Problem formulation and modeling problems mathematically before designing algorithms.
- Defining algorithms as finite sequences of instructions that terminate in finite time.
- Using an example of designing a traffic light algorithm to illustrate the problem-solving process. This involves modeling the problem as a graph coloring problem and discussing approaches to solving it optimally or heuristically.
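The traffic-light example reduces to graph coloring: incompatible turns are edges, and colors are signal phases. A simple greedy heuristic of the kind such discussions mention (not optimal in general) can be sketched as:

```python
def greedy_coloring(adj):
    """adj: dict mapping each vertex to the set of its neighbours.
    Assigns each vertex, in order, the smallest colour unused by its
    already-coloured neighbours."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# Three mutually incompatible turns plus one compatible with all.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": set(),
}
colors = greedy_coloring(adj)
```

The greedy result depends on vertex order and may use more colors than the chromatic number, which is why optimal approaches are discussed separately from heuristics.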
Semantic Based Model for Text Document Clustering with Idioms (Waqas Tariq)
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online venues such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. The developed method tags the documents for parsing, replaces idioms with their original meanings, calculates semantic weights for document words, and applies semantic grammar. A similarity measure is obtained between the documents, which are then clustered using a hierarchical clustering algorithm. The method is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters has been demonstrated.
Improving modularity and reusability are two key objectives in object-oriented programming. These
objectives are achieved by applying several key concepts, such as data encapsulation and inheritance. A
class in an object-oriented system is the basic unit of design. Assessing the quality of an object-oriented
class may require flattening the class and representing it as it really is, including all accessible inherited
class members. Thus, class flattening helps in exploring the impact of inheritance on improving code
quality. This paper explains how to flatten Java classes and discusses the relationship between class
flattening and some applications of interest to software practitioners, such as refactoring and indicating
external quality attributes.
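The paper flattens Java classes, but the same idea can be illustrated compactly in Python, where a class's method resolution order lets us list the class "as it really is", including every accessible inherited member. The `Shape`/`Circle` hierarchy below is an invented example, not from the paper.

```python
def flatten(cls):
    """Map each accessible member name to the class that defines it,
    walking the MRO so the most-derived definition wins."""
    members = {}
    for base in cls.__mro__:               # most-derived class first
        for name in vars(base):
            if not name.startswith("__") and name not in members:
                members[name] = base.__name__
    return members

class Shape:
    def area(self):
        raise NotImplementedError
    def describe(self):
        return f"area={self.area()}"

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return 3.14159 * self.r ** 2

flat = flatten(Circle)
# {'area': 'Circle', 'describe': 'Shape'}
```

The flattened view makes the inheritance contribution explicit: `describe` is usable on `Circle` even though `Circle` never declares it, which is exactly the kind of information class flattening surfaces for quality assessment.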
The document discusses handling corporate data and information management. It covers objectives like data resource management, DBMS, data warehousing, and data mining. It describes the sources and types of information an organization uses, both formal sources like internal records and external reports, as well as informal sources like conversations. It also discusses database management systems, data models, data warehousing, and data mining - how organizations use these approaches to collect, process, analyze and extract useful information from their data.
The document describes Phenoflow, a clinical natural language processing tool for defining electronic health record (EHR)-based phenotypes. It discusses the need for clear and reusable phenotype definitions that connect a definition to its computable form. The document proposes Phenoflow's workflow-based phenotype model which separates phenotype logic from implementation. This model structure connects definitions to multiple computational implementations while prioritizing clarity and flexibility.
ICS Part 2 Computer Science Short Notes (Abdul Haseeb)
The document provides an overview of basic data concepts including data, data capturing, data manipulation, information, fields, records, files, databases, data integrity, and database management systems. It defines key terms and provides examples. The three main types of files are described as master files, backup files, and transaction files. Database components are listed as data, hardware, software, and personnel.
Processing Vietnamese News Titles to Answer Relative Questions in VNewsQA/ICT... (ijnlc)
This paper introduces two important elements of our VNewsQA/ICT system: its semantic models of simple Vietnamese sentences and its semantic processing mechanism. VNewsQA/ICT is a Vietnamese question answering system that gathers information from Vietnamese news titles on the ICTnews website (http://www.ictnews.vn), instead of using a database or a knowledge base, to answer related Vietnamese questions in the domain of information and communications technology.
A&D - Object Oriented Analysis using UML (vinay arora)
This document discusses object oriented analysis using UML. It defines key concepts like objects, classes, attributes, behaviors, generalization/specialization, aggregation, and relationships. It also describes UML diagrams including use case diagrams, class diagrams, sequence diagrams, and activity diagrams. Finally, it outlines the process of object modeling including identifying objects and classes, organizing relationships, and constructing class diagrams.
This presentation discusses molecular similarity searching methods for drug discovery. It begins with an introduction to cheminformatics and the principle that structurally similar molecules tend to have similar biological properties. The document then covers molecular representations, methods for calculating similarity coefficients between molecules, and a probabilistic model for similarity searching. It proposes a contribution called the Molecular Dynamic Clustering method that uses molecular dynamics simulations and classification algorithms to better assess molecular similarity.
This document summarizes a study on the remote mentoring program called MAGIC (Get More Active Girls in Computing). MAGIC aims to increase female participation in STEM fields through one-on-one remote mentoring matches between young girls and women professionals in technology careers. The study analyzed data from MAGIC's first 5 years, finding that remote mentoring increased STEM skills, self-confidence, and career awareness for many mentees. However, challenges included maintaining mentor and mentee commitment over time. The study concludes that remote mentoring shows promise for improving gender diversity in STEM, but more data is needed to better understand impacts and how to address challenges.
This study investigated error control practices among gynecologic physicians using electronic medical records (EMRs). The researchers conducted a user study with 20 gynecologic physicians to understand how they detect errors in fabricated patient notes containing intentionally introduced errors. On average, physicians detected 49% of major errors and 36% of minor errors. The study identified common error detection triggers and derived guidelines for developing computational error detection algorithms, including comparing information across sections and identifying discrepancies. The algorithms should incorporate clinical knowledge from guidelines as well as natural language processing and controlled vocabularies. This research provides initial insights into physician error detection abilities to help design more effective automated error control for EMRs.
This document describes efforts to make a suite of text mining tools developed at the National Center for Biotechnology Information (NCBI) compatible with the BioC format. The tools identify various biomedical concepts like diseases, genes, species, and chemicals in text. To enable interoperability between these tools, their data formats were modified to support the BioC XML format for text documents and annotations. This involved creating a common key file and updating input/output. The tools can now take BioC formatted data as input and produce BioC formatted annotations as output, allowing the tools to be more easily combined into text mining applications.
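The interoperability idea can be sketched by emitting a BioC-style XML annotation with the standard library. The element names below follow the general shape of BioC (collection / document / passage / annotation with infon, location, and text); consult the BioC DTD for the exact required fields, and treat the identifiers here as illustrative.

```python
import xml.etree.ElementTree as ET

def bioc_annotation(doc_id, passage_text, span, concept_type):
    """Build a one-document BioC-style collection with one annotation."""
    start, end = span
    collection = ET.Element("collection")
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = passage_text
    ann = ET.SubElement(passage, "annotation", id="A1")
    infon = ET.SubElement(ann, "infon", key="type")  # concept type, e.g. Gene
    infon.text = concept_type
    ET.SubElement(ann, "location", offset=str(start),
                  length=str(end - start))
    ET.SubElement(ann, "text").text = passage_text[start:end]
    return ET.tostring(collection, encoding="unicode")

doc = bioc_annotation("PMID-1", "BRCA1 mutations raise cancer risk.",
                      (0, 5), "Gene")
```

Because each tool reads and writes this one shared shape, the annotations of one tool (say, a gene tagger) can feed directly into the next (say, a disease tagger), which is the pipeline composability the document describes.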
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTIONijistjournal
The user generated content on the web grows rapidly in this emergent information age. The evolutionary changes in technology make use of such information to capture only the user’s essence and finally the useful information are exposed to information seekers. Most of the existing research on text information processing, focuses in the factual domain rather than the opinion domain. In this paper we detect online hotspot forums by computing sentiment analysis for text data available in each forum. This approach analyses the forum text data and computes value for each word of text. The proposed approach combines K-means clustering and Support Vector Machine with PSO (SVM-PSO) classification algorithm that can be used to group the forums into two clusters forming hotspot forums and non-hotspot forums within the current time span. The proposed system accuracy is compared with the other classification algorithms such as Naïve Bayes, Decision tree and SVM. The experiment helps to identify that K-means and SVM-PSO together achieve highly consistent results.
This document discusses fuzzy querying of relational databases. It begins by introducing fuzzy relational database management systems (FRDBMS), which allow imprecise queries using fuzzy logic. It then presents the basic concepts of fuzzy logic and membership functions. The architecture of an FRDBMS is described, including how it translates fuzzy queries into equivalent SQL queries. An example student database is used to demonstrate a fuzzy query for "poor performers" and how it returns more graded results than an exact SQL query. The document concludes that FRDBMS improves the expressiveness of queries over traditional databases.
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The research of clustering task is to be performed prior to its adaptation in the text environment. Conventional approaches typically emphasized on the quantitative information where the selected features are numbers. Efforts also have been put forward for achieving efficient clustering in the context of categorical information where the selected features can assume nominal values. This manuscript presents an in-depth analysis of challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering along with the pros and cons of each model. In addition, it also focuses on various latest developments in the clustering task in the social network and associated environments.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text
classification. In this paper, Fast Fuzzy Feature clustering for text classification is proposed. It
is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011.
The word in the feature vector of the document is grouped into the cluster in less iteration. The
numbers of iterations required to obtain cluster centers are reduced by transforming clusters
center dimension from n-dimension to 2-dimension. Principle Component Analysis with slit
change is used for dimension reduction. Experimental results show that, this method improve
the performance by significantly reducing the number of iterations required to obtain the cluster
center. The same is being verified with three benchmark datasets
This document describes linking a medical vocabulary to a clinical data model using Abstract Syntax Notation 1 (ASN.1). It discusses:
1) Creating a clinical data model in ASN.1 with simple primitive data types that are combined into more complex data types in a layered approach, with the highest level being clinical messages.
2) Incorporating vocabulary into the model using a BaseCoded data type that allows vocabulary concepts and relationships to be referenced using standard ASN.1 notation.
3) Finding ASN.1 to be a flexible and robust notation for representing the clinical data model, with benefits like built-in encoding rules, available tools, and ability to define and implement an electronic medical
Data structures and algorithms alfred v. aho, john e. hopcroft and jeffrey ...Chethan Nt
This document provides an overview of data structures and algorithms, including:
- Problem formulation and modeling problems mathematically before designing algorithms.
- Defining algorithms as finite sequences of instructions that terminate in finite time.
- Using an example of designing a traffic light algorithm to illustrate the problem-solving process. This involves modeling the problem as a graph coloring problem and discussing approaches to solving it optimally or heuristically.
Semantic Based Model for Text Document Clustering with IdiomsWaqas Tariq
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data which is available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique to organize the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of the text documents. A method has been developed that performs the following: tag the documents for parsing, replacement of idioms with their original meaning, semantic weights calculation for document words and apply semantic grammar. The similarity measure is obtained between the documents and then the documents are clustered using Hierarchical clustering algorithm. The method adopted in this work is evaluated on different data sets with standard performance measures and the effectiveness of the method to develop in meaningful clusters has been proved.
Improving modularity and reusability are two key objectives in object-oriented programming. These
objectives are achieved by applying several key concepts, such as data encapsulation and inheritance. A
class in an object-oriented system is the basic unit of design. Assessing the quality of an object-oriented
class may require flattening the class and representing it as it really is, including all accessible inherited
class members. Thus, class flattening helps in exploring the impact of inheritance on improving code
quality. This paper explains how to flatten Java classes and discusses the relationship between class
flattening and some applications of interest to software practitioners, such as refactoring and indicating
external quality attributes.
The document discusses handling corporate data and information management. It covers objectives like data resource management, DBMS, data warehousing, and data mining. It describes the sources and types of information an organization uses, both formal sources like internal records and external reports, as well as informal sources like conversations. It also discusses database management systems, data models, data warehousing, and data mining - how organizations use these approaches to collect, process, analyze and extract useful information from their data.
The document describes Phenoflow, a clinical natural language processing tool for defining electronic health record (EHR)-based phenotypes. It discusses the need for clear and reusable phenotype definitions that connect a definition to its computable form. The document proposes Phenoflow's workflow-based phenotype model which separates phenotype logic from implementation. This model structure connects definitions to multiple computational implementations while prioritizing clarity and flexibility.
ICS Part 2 Computer Science Short Notes (Abdul Haseeb)
The document provides an overview of basic data concepts including data, data capturing, data manipulation, information, fields, records, files, databases, data integrity, and database management systems. It defines key terms and provides examples. The three main types of files are described as master files, backup files, and transaction files. Database components are listed as data, hardware, software, and personnel.
Processing vietnamese news titles to answer relative questions in vnewsqa ict... (ijnlc)
This paper introduces two important elements of our VNewsQA/ICT system: its semantic models
of simple Vietnamese sentences and its semantic processing mechanism. The VNewsQA/ICT is a
Vietnamese based Question Answering system which has the ability to gather information from
some Vietnamese news title forms on the ICTnews websites (http://www.ictnews.vn), instead of
using a database or a knowledge base, to answer the related Vietnamese questions in the domain
of information and communications technology.
A&D - Object Oriented Analysis using UML (vinay arora)
This document discusses object oriented analysis using UML. It defines key concepts like objects, classes, attributes, behaviors, generalization/specialization, aggregation, and relationships. It also describes UML diagrams including use case diagrams, class diagrams, sequence diagrams, and activity diagrams. Finally, it outlines the process of object modeling including identifying objects and classes, organizing relationships, and constructing class diagrams.
This presentation discusses molecular similarity searching methods for drug discovery. It begins with an introduction to cheminformatics and the principle that structurally similar molecules tend to have similar biological properties. The document then covers molecular representations, methods for calculating similarity coefficients between molecules, and a probabilistic model for similarity searching. It proposes a contribution called the Molecular Dynamic Clustering method that uses molecular dynamics simulations and classification algorithms to better assess molecular similarity.
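Similarity coefficients of the kind the presentation surveys are commonly computed over fingerprint bit sets; the Tanimoto coefficient is one standard choice (named here as a representative example, and the bit positions below are invented):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# invented fingerprint bit positions for three hypothetical molecules
molecule_a = {1, 4, 9, 23, 42}
analogue   = {1, 4, 9, 23, 57}   # structurally similar to molecule_a
unrelated  = {2, 8, 31}

print(tanimoto(molecule_a, analogue))   # 4/6, high similarity
print(tanimoto(molecule_a, unrelated))  # 0.0, no shared features
```

A high coefficient between two molecules reflects the similar-property principle the presentation starts from: structurally similar molecules tend to behave similarly.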
This document summarizes a study on the remote mentoring program called MAGIC (Get More Active Girls in Computing). MAGIC aims to increase female participation in STEM fields through one-on-one remote mentoring matches between young girls and women professionals in technology careers. The study analyzed data from MAGIC's first 5 years, finding that remote mentoring increased STEM skills, self-confidence, and career awareness for many mentees. However, challenges included maintaining mentor and mentee commitment over time. The study concludes that remote mentoring shows promise for improving gender diversity in STEM, but more data is needed to better understand impacts and how to address challenges.
This study investigated error control practices among gynecologic physicians using electronic medical records (EMRs). The researchers conducted a user study with 20 gynecologic physicians to understand how they detect errors in fabricated patient notes containing intentionally introduced errors. On average, physicians detected 49% of major errors and 36% of minor errors. The study identified common error detection triggers and derived guidelines for developing computational error detection algorithms, including comparing information across sections and identifying discrepancies. The algorithms should incorporate clinical knowledge from guidelines as well as natural language processing and controlled vocabularies. This research provides initial insights into physician error detection abilities to help design more effective automated error control for EMRs.
This document describes efforts to make a suite of text mining tools developed at the National Center for Biotechnology Information (NCBI) compatible with the BioC format. The tools identify various biomedical concepts like diseases, genes, species, and chemicals in text. To enable interoperability between these tools, their data formats were modified to support the BioC XML format for text documents and annotations. This involved creating a common key file and updating input/output. The tools can now take BioC formatted data as input and produce BioC formatted annotations as output, allowing the tools to be more easily combined into text mining applications.
BwN Concepts & Solutions For Wb Delegation (mindertdevries)
This document discusses building safety against flooding through nature-based solutions, or "Building with Nature" (BwN). It provides examples of practical BwN solutions implemented from 2007-2010, including saltmarsh creation, oyster reefs, forest-dike combinations, and hybrid hard-soft structures. The document emphasizes that BwN solutions are generic, practical, cost-effective, fit within legal constraints, and have been realized through partnerships. It highlights the need to integrate ecosystem functions and dynamics into flood protection through a green adaptation approach.
The document provides information about the OWASP AppSecUSA conference to be held in Austin, Texas from October 23-26, 2012. It discusses that OWASP is an open, volunteer-based organization focused on application security. The conference location of Austin is noted for its music, festivals, nature, creativity and technology. The document lists some of the one-day and two-day training sessions that will be offered, covering topics like cryptanalysis, secure coding, SQL injection, mobile security and web application testing.
Mike Thelwall is a professor known for his research in the field of webometrics. He received his PhD in mathematics and leads the Statistical Cybermetrics Research Group. Webometrics involves the quantitative analysis of web phenomena such as link analysis, search engine evaluation, and web citation analysis. Thelwall's research has explored using webometrics to study the dissemination of scholarly research and evaluate universities. He has emphasized the need for conceptual frameworks and methodologies to interpret webometrics results and address challenges like the size and changing nature of the web.
Clinicians rely on health information technologies (HITs) for clinical data collection, but current HITs are inflexible and inconsistent with clinicians' needs. The researchers propose a flexible electronic health record (fEHR) system to allow clinicians to easily modify the system based on their changing data collection needs. The fEHR uses a form-based interface for clinicians to design forms, generates a corresponding form tree structure, and designs a high-quality database from the tree. A user study with 5 nurses found they could effectively express their needs in the system, and their efficiency and understanding improved over two rounds of tasks of increasing complexity. The researchers conclude the fEHR has potential to reduce HIT problems and that the database designs it generates can match expert-designed standards.
10 Nonprofit Success Stories Using LinkedIn - Stanford Bus 109 Lecture 1/21/14 (Box)
This lecture included 10 stories about nine nonprofits and one social enterprise that are using LinkedIn to meet their important missions: building relationships strategically, lifting their brand, expanding their community, recruiting board members and volunteers, recruiting staff, raising money, and more.
Security Code Reviews. Does Your Code Need an Open Heart Surgery and The 6 Po... (Sherif Koussa)
The document discusses a 6-point strategy for conducting security code reviews to identify and remediate vulnerabilities in applications. It emphasizes focusing code reviews on high priority issues like authentication, authorization, encryption and input validation. It recommends starting with the OWASP Top 10 issues, using automated tools for initial analysis, and conducting manual reviews of sensitive code. Thorough reporting of findings with recommendations is presented as the final step.
Here are the key things to report:
- Vulnerability type
- Location (file, line number)
- Short description
- Impact
- Recommendation
Provide enough context for developers to understand and fix.
Prioritize vulnerabilities by severity and risk.
REPORTING
Example finding:
SQL Injection
- Location: \source\ACMEPortal\updateinfo.aspx.cs
- Description: The code builds a dynamic SQL statement using unvalidated data (i.e., name), which can lead to SQL injection
- Severity: High
- Impact: Data exposure and system access
- Recommendation: Use parameterized queries
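As a sketch, the reporting fields listed above can be captured in a small data structure and sorted so the riskiest findings lead the report. The field names and severity ranking are assumptions following the bullets, not taken from the slides:

```python
from dataclasses import dataclass

# assumed severity ordering; lower rank means report it first
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

@dataclass
class Finding:
    vuln_type: str
    location: str        # file and line number
    description: str
    impact: str
    recommendation: str
    severity: str

def prioritize(findings):
    """Order findings so the highest-severity items appear first."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])

report = prioritize([
    Finding("XSS", "search.aspx.cs:88", "Unencoded output",
            "Session theft", "Encode output", "medium"),
    Finding("SQL Injection", "updateinfo.aspx.cs:52",
            "Dynamic SQL built from unvalidated input",
            "Data exposure and system access",
            "Use parameterized queries", "high"),
])
print([f.vuln_type for f in report])  # ['SQL Injection', 'XSS']
```

Keeping findings structured like this also makes it easy to give developers the context (location, impact, recommendation) they need to fix each issue.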
Program Comprehension - An Evaluation of the Strategies of Sorting, Filtering... (ICSM 2011)
Paper: An Evaluation of the Strategies of Sorting, Filtering, and Grouping API Methods for Code Completion
Authors: Daqing Hou and Dave Pletcher
Session: Research Track Session 8 -Program Comprehension
This document provides an overview of integrated digital marketing and marketing communications at the University of Calgary. It discusses how traditional marketing is being replaced by new approaches that are integrated, measurable, media agnostic, and user-focused. Various digital marketing tactics are mentioned such as web development, email, mobile, SEO, gamification, and social media. The importance of integration and evaluating the full digital marketing strategy is emphasized. Suggested readings on digital marketing strategy, social media, user experience design, and form design are also provided.
This document summarizes a presentation about simplifying secure code reviews. It discusses defining an effective security code review process, including reconnaissance, threat modeling, automation, manual review, confirmation, and reporting. It also discusses using the OWASP Top 10 list to focus code reviews, and defining trust boundaries to identify areas of code to review for specific vulnerabilities. The goal is to introduce a simplified process that can help development teams integrate security code reviews into their workflow.
Student POST Database processing models showcase the logical s.docx (orlandov3)
Student POST:
Database processing models showcase the logical structure of a database. The most commonly used model is the relational model, which organizes data in tables consisting of rows and columns: the columns hold the attributes of an entity, and the rows hold the data of a particular instance of that entity. The major advantage of the relational model is that the tabular form makes it easier for users to understand, manage, and work with the data. With the primary key and foreign key concepts, data can be uniquely identified, stored across different entities, and retrieved effectively through relationships. Another advantage is that SQL, which is simple to understand and the most widely used database language, can be used to work with the data. A disadvantage of the relational model is its comparatively high financial cost, since specific software must be in place and regular maintenance must be performed by highly skilled personnel. The complexity of the database can also increase further as the volume of data keeps growing, and there are limitations on the length of fields stored as different data types (Joseph & Paul, 2009).
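As a concrete illustration of the primary/foreign key relationships the post describes, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per-connection

# each department row is uniquely identified by its primary key
conn.execute("""CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL)""")

# the foreign key links each employee to exactly one department
conn.execute("""CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(dept_id))""")

conn.execute("INSERT INTO department VALUES (1, 'Research')")
conn.execute("INSERT INTO employee VALUES (10, 'Ada', 1)")

# the relationship lets SQL retrieve data stored in separate entities
rows = conn.execute("""SELECT e.name, d.name
                       FROM employee e JOIN department d
                       ON e.dept_id = d.dept_id""").fetchall()
print(rows)  # [('Ada', 'Research')]
```

The join shows the retrieval-through-relationships advantage: the employee and department data live in separate tables yet come back together via the key columns.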
The other processing model is the object-oriented model, which depicts a database as a collection of objects. Its advantage is that it can work with complex data sets through the use of object IDs and object-oriented programming. Its disadvantage is that object databases are not commonly used, and their complexity can hamper database performance. Another database model is the entity-relationship (E-R) model, which is mostly used for the conceptual design of a database. It pictures the entities, the attributes that fall within the domain of each entity, and the cardinality of the relationships between them. Its advantage is that an E-R diagram is easily understandable at first glance, so users can work with the data quickly and point out discrepancies in it. It can also be easily converted to other models if the business requires. The disadvantage of the E-R model is that industry-standard notations for the diagram are not defined, which can confuse users, and the model is only suitable for high-level database design (S.J.D., 2020).
2nd Student POST:
Database models, commonly referred to as schemas, represent the structure of a database and the format in which it is managed by a DBMS. Uses of database models vary depending on user specifications.
Types of database models
1. Network model
This model uses a structure similar to that of the hierarchical model, but it permits a record to have multiple parents, turning the strict tree into a graph. The model emphasizes two basic concepts: records and sets. Records hold the file hierarchy, and sets define many-to-many relationships.
Information residing in relational databases and delimited file systems is inadequate for reuse and sharing over the web, since these file systems do not adhere to commonly agreed principles for maintaining data harmony. For these reasons, web resources have suffered from a lack of uniformity, as well as from heterogeneity and redundancy. Ontologies have been widely used to solve such problems, as they help extract knowledge from any information system. In this article, we focus on extracting concepts and their relations from a set of CSV files. These files serve as individual concepts and are grouped into a particular domain, called the domain ontology. This domain ontology is then used for capturing CSV data, which is represented in RDF format while retaining links among files or concepts. Datatype and object properties are automatically detected from header fields, which reduces the need for user involvement in generating mapping files. A detailed analysis has been performed on baseball tabular data, and the result shows a rich set of semantic information.
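A minimal sketch of the CSV-to-RDF step described above, assuming an invented example.org namespace and emitting naive N-Triples (a real pipeline would also need datatype detection and proper IRI escaping):

```python
import csv
import io

def csv_to_triples(csv_text, concept):
    """Turn each CSV row into simple N-Triples: the header supplies the
    predicate (property) names, and each row becomes one subject."""
    base = "http://example.org/"              # assumed namespace
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for n, row in enumerate(reader):
        subject = f"<{base}{concept}/{n}>"
        for field, value in row.items():
            predicate = f"<{base}{concept}#{field}>"
            triples.append(f'{subject} {predicate} "{value}" .')
    return triples

data = "player,team\nRuth,Yankees\n"
for t in csv_to_triples(data, "baseball"):
    print(t)
```

Each CSV header field becomes a property and each row a resource, mirroring the article's idea of detecting properties from header fields with no hand-written mapping file.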
Metadata and Cooperative Knowledge Management (Ralf Klamma)
This document discusses cooperative knowledge management and its implications for metadata and conceptual modeling. It argues that current approaches like UML and ERP systems do not fully address the "culture facet" of knowledge work practices. Three relevant theories are reviewed: 1) a cultural science theory that views knowledge creation as cultural discourse influenced by changing media, 2) an organizational behavior theory that describes extracting, manipulating, and applying knowledge to practice, and 3) an engineering theory emphasizing refining knowledge from failures through scenario management. The document advocates for additional research on metadata management and conceptual modeling to better support cooperative knowledge work practices.
USING ONTOLOGIES TO OVERCOMING DRAWBACKS OF DATABASES AND VICE VERSA: A SURVEY (cseij)
This document summarizes research on using ontologies to overcome drawbacks of databases and vice versa. It discusses how ontologies can be used to store and manage large numbers of database instances to improve performance. It also explains how databases can help address issues with ontologies, such as a lack of semantics, by providing structured storage. The document reviews drawbacks of both databases and ontologies and how each can help address limitations of the other through integration. This mutual benefit is an active area of research at the intersection of databases and ontologies.
The document discusses normal forms in database design and compares the Boyce-Codd normal form (BCNF) to third and fourth normal forms. It also covers semantic data modeling, object-oriented databases, and the differences between distributed and centralized databases. Specifically, it explains that BCNF extends third normal form by requiring that every determinant be a candidate key. It also notes that distributed databases allow data to be stored across multiple physical locations for improved performance and availability compared to centralized databases which store all data in one place.
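The BCNF condition stated above, that every determinant must be a candidate key, can be checked mechanically with attribute closures. A minimal sketch follows; the relation and its functional dependencies are invented for illustration:

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under functional dependencies `fds`,
    given as a list of (lhs, rhs) attribute-set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def violates_bcnf(relation, fds):
    """Return the FDs whose determinant (lhs) is not a superkey,
    i.e. whose closure does not cover the whole relation."""
    return [(lhs, rhs) for lhs, rhs in fds
            if closure(lhs, fds) != set(relation)]

R = {"student", "course", "instructor"}
fds = [({"student", "course"}, {"instructor"}),
       ({"instructor"}, {"course"})]   # instructor is not a key
print(violates_bcnf(R, fds))           # the second FD violates BCNF
```

Here {student, course} determines everything and is a key, but {instructor} determines only {course}, so the second dependency breaks BCNF even though the relation satisfies third normal form.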
International Journal of Computational Engineering Research (IJCER) (ijceronline)
This document summarizes a research paper on reengineering relational databases to object-oriented databases. It discusses developing an integrated environment that maps a relational schema to an object-oriented schema without modifying the existing relational schema. The proposed system architecture has two major components - one for mapping the relational schema to an object-oriented schema, and another for mapping relational data to objects. The schema mapping process is two-phased - the first phase transforms the relational schema, and the second phase extracts object-oriented structures. The system aims to allow existing applications and data in a relational database to be accessible from object-oriented programs.
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System (IRJET Journal)
This document proposes a knowledge graph and question answering system to extract and analyze information from large volumes of unstructured data like annual reports. It discusses using natural language processing techniques like named entity recognition with spaCy and dependency parsing to extract entity-relation pairs from text and construct a knowledge graph. For question answering, it analyzes user queries with similar NLP approaches and then matches query triplets to the knowledge graph to retrieve answers, combining information retrieval and trained classifiers. The proposed system aims to provide faster understanding and analysis of complex, unstructured data for professionals.
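A toy sketch of matching a query triplet against an extracted knowledge graph; the entities, relations, and wildcard-matching scheme below are invented, and the spaCy-based extraction the paper relies on is out of scope here:

```python
# toy triples of the form (entity, relation, entity), as would be
# extracted from text by named entity recognition + dependency parsing
kg = [
    ("AcmeCorp", "acquired", "BetaSoft"),
    ("AcmeCorp", "headquartered_in", "Austin"),
    ("BetaSoft", "founded_in", "2009"),
]

def answer(subject=None, relation=None, obj=None):
    """Match a query triplet against the graph; None acts as a wildcard,
    so a question supplies whichever parts it knows."""
    return [(s, r, o) for (s, r, o) in kg
            if subject in (None, s)
            and relation in (None, r)
            and obj in (None, o)]

# "What did AcmeCorp acquire?" -> subject and relation known, object sought
print(answer(subject="AcmeCorp", relation="acquired"))
```

Turning a user question into such a partial triplet and retrieving the completions is the core of the triplet-matching retrieval step the paper describes.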
This document provides a listing and brief descriptions of working papers from 2000. It includes 12 papers with titles and short 1-2 paragraph summaries of each paper's topic or focus. The papers cover a range of topics related to text mining, machine learning, data compression, knowledge discovery, and user interfaces for developing classifiers.
This document provides summaries of 12 working papers from 2000. The summaries are:
1. The paper discusses using compression models to identify acronyms in text.
2. The paper examines using compression models for text categorization to assign texts to predefined categories.
3. The paper is reserved for Sally Jo.
4. The paper explores letting users build classifiers through interactive machine learning.
This document discusses using hidden Markov models to automatically discover the structure of clinical forms and annotate them with medical terminology. It presents a two-layer hidden Markov model approach to first assign tags like category and field to form elements, and then group related elements to identify form segments. The method was tested on 52 clinical forms and achieved over 95% accuracy in extracting the underlying structure of the forms in the form of trees. The ability to automatically understand form structure and annotate forms could enable more flexible design of electronic health records.
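The tagging layer of such a model can be sketched with the standard Viterbi algorithm. The states, observations, and probabilities below are invented toy values, and the paper's second (grouping) layer is omitted:

```python
def viterbi(obs, states, start, trans, emit):
    """Most likely hidden state sequence (e.g. tags such as 'category'
    or 'field') for a sequence of observed form elements."""
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prob, prev = max((V[-1][p] * trans[p][s] * emit[s][o], p)
                             for p in states)
            col[s], ptr[s] = prob, prev
        V.append(col)
        back.append(ptr)
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for ptr in reversed(back):      # follow back-pointers to recover path
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["category", "field"]
start  = {"category": 0.6, "field": 0.4}
trans  = {"category": {"category": 0.2, "field": 0.8},
          "field":    {"category": 0.3, "field": 0.7}}
emit   = {"category": {"heading": 0.7, "textbox": 0.1, "label": 0.2},
          "field":    {"heading": 0.1, "textbox": 0.6, "label": 0.3}}
print(viterbi(["heading", "textbox", "textbox"], states, start, trans, emit))
```

With these toy parameters a heading followed by two textboxes decodes to a category followed by two fields, which is the flavor of tag assignment the first HMM layer performs.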
An approach for transforming of relational databases to owl ontology (IJwest)
Rapid growth of documents, web pages, and other types of text content is a huge challenge for modern content management systems. One of the problems in the areas of information storage and retrieval is the lack of semantic data. Ontologies can present knowledge in a sharable and reusable manner and provide an effective way to reduce data volume overhead by encoding the structure of a particular domain. Metadata in relational databases can be used to extract an ontology from a database in a specific domain. To solve the problem of sharing and reusing data, approaches based on transforming relational databases to ontologies have been proposed. In this paper we propose a method for automatic ontology construction based on a relational database. Mining further components from the relational database yields knowledge with greater semantic power and expressiveness. Triggers are one such database component: they can be transformed into the ontology model and increase the power and expressiveness of the knowledge by presenting part of it dynamically.
Towards Ontology Development Based on Relational Database (ijbuiiir1)
Ontology is defined as a formal explicit specification of a shared conceptualization. It has been widely used in almost all fields, especially artificial intelligence, data mining, and the semantic web, and it is constructed from various sets of resources. Improving the efficiency of ontology construction has become an important task, which calls for an automated method of building ontology from database resources. Since manual construction has been found to be error-prone and below expectations, automatic construction of ontology from databases has been innovated. Construction rules for building ontology from relational data sources are then put forward. Finally, an ontology for "automated building of ontology from relational data sources" has been implemented.
This dissertation proposal outlines a system that allows non-technical users to design and evolve databases by modeling their data needs through customizable forms. The key goals are to provide an easy-to-use interface for form design, and mapping algorithms that translate user-designed forms into high-quality databases. A preliminary evaluation with nurses found the form modeling interface effective and efficient. Mapping experiments successfully translated forms into databases that matched expert-designed standards. Future work includes usability studies varying form and database complexity, and exploring enhancements to mapping and merging algorithms.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS (ijseajournal)
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining its context. Typically, clustering is performed on numerical data; however, it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist that can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements such as GloVe and t-SNE to develop a context-aware clustering approach using pre-trained word embeddings. We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points with common clustering algorithms like K-means.
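A minimal sketch of the final clustering step, with tiny invented 2-D vectors standing in for pre-trained GloVe embeddings (real embeddings have tens to hundreds of dimensions, and production code would use a library implementation of K-means):

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(cluster)
    return tuple(sum(v) / n for v in zip(*cluster))

def kmeans(points, k, iters=10):
    """Plain K-means: assign each vector to its nearest centre, then
    recompute each centre as its cluster's mean."""
    centers = points[:k]                  # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[nearest].append(p)
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# invented 2-D "embeddings": animals cluster apart from vehicles
embeddings = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2),
    "car": (0.1, 0.9), "bus": (0.2, 0.8),
}
clusters = kmeans(list(embeddings.values()), k=2)
print(clusters)
```

Because the vectors carry context (similar words sit near each other in embedding space), the animal words and vehicle words fall into separate clusters, which is the effect the paper aims for with GloVe vectors.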
The document provides an overview of database management systems (DBMS). It discusses that a DBMS contains organized data about an enterprise. It offers advantages over file systems like avoiding data redundancy and inconsistencies. The document describes database applications, levels of abstraction in a DBMS, the relational data model using tables and SQL, and components of the database engine like storage management, query processing, and transaction management. It also provides a brief history of database systems from the 1950s to modern times.
This document discusses challenges and opportunities for integrating large, heterogeneous biological data sets. It outlines the types of analysis and discovery that could be enabled, such as comparing data across studies. Technical challenges include incompatible identifiers and schemas between data sources. Common solutions attempt standardization but have limitations. The document examines Amazon's approach as a model, with principles like exposing all data through programmatic interfaces. It argues for a "platform" approach and combining data-driven and model-driven analysis to gain new insights. Developing services with end users in mind could help maximize data reuse.
A semantic framework and software design to enable the transparent integratio... (Patricia Tavares Boralli)
This document proposes a conceptual framework to unify representations of natural systems knowledge. The framework is based on separating the ontological nature of an object of study from the context of its observation. Each object is associated with a concept defined in an ontology and an observation context describing aspects like location and time. Models and data are treated as generic knowledge sources with a semantic type and observation context. This allows flexible integration and calculation of states across heterogeneous sources by composing their observation contexts and resolving semantic compatibility. The framework aims to simplify knowledge representation by abstracting away complexity related to data format and scale.
International Journal of Computational Engineering Research (IJCER) (ijceronline)
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Letter and Document Automation for Bonterra Impact Management (fka Social Sol... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx (SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX models have been a hot topic in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, for example using a person document instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices you can apply immediately
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Digital Marketing Trends in 2024 | Guide for Staying Ahead (Wask)
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin... (Tatiana Kojar)
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfflufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
1. A Framework for Mapping User-designed Forms to Relational Databases
Dissertation Presentation
November 15, 2011
Ritu Khare
COMMITTEE:
Dr. Yuan An (Chair)
Dr. Jiexun Jason Li
Dr. Il-Yeol Song
Dr. Min Song
Dr. Christopher C. Yang
2. Presentation Order
1. Motivation
2. Problems
3. Solutions
4. Evaluation
5. Final Remarks
4. General Motivation: Database Usability (Sawyer, 1995)
- Enable users to SEARCH and QUERY databases: information retrieval techniques (Liu et al., 2006, Hristidis et al., 2003, Catarci, 2000, Jayapandian and Jagadish, 2006)
- Enable users to DESIGN databases (Jagadish et al., 2007): form-based DIY and WYSIWYG paradigms (FormAssembly, ZohoCreator, GoogleForms)
Databases still remain unusable from the integration point of view (Gurses et al., 2009).
5. Precise Motivation: Integration of New Needs
New needs (e.g., related to a patient's social habits) require:
1) Building of new forms
2) Integration of the new form into the back-end database
6. Research Objective
To develop a mechanism to automatically map and integrate a user-designed form into an existing structured database.
- Assume that a user-designed form is already acquired.
- Seek a framework that:
  - merges the semantically matching elements between forms and databases.
  - creates new database elements corresponding to the unmatched form elements.
8. Problem #1: Form Understanding
A form template represents the semantic intentions of the designer.
Existing Work:
- Focus on search forms (Benslimane et al., 2007, Kaljuvee et al., 2001), which are shorter and simpler than data-entry forms (empirical finding).
- Rules and heuristics (Zhang et al., 2004, He et al., 2007), which are not likely to circumvent the ever-broadening varieties in form topologies.
Automatic extraction of the form semantics is hard: a machine can only read the syntactic patterns of form elements, and a certain layout pattern cannot be associated with a semantic intention.
9. Problem #2: Correspondence Discovery
Detect semantically matching elements between a form and an existing database.
Challenges:
- Variety of terms to denote the same concepts.
- Variety of concepts denoted by similar terms.
- Identifying and eliminating the invalid correspondences.
Existing Work: schema and ontology mapping (Madhavan et al., 2001, Euzenat and Shvaiko, 2005, Rahm and Bernstein, 2001, An et al., 2005, An et al., 2006)
- Mostly semi-automatic.
- Not applicable to form-to-database correspondence discovery because of the heterogeneity between forms and databases.
- Correspondences are to be used for evolving the database; the discovery process has to take this requirement into consideration.
10. Problem #3: Form Integration
Problem #3a: Merging
- Merge into an existing database so that the same concept is not duplicated and the database remains compact.
- Merging increases the potential of having NULL values, i.e., a less optimized database. Judicious decisions are required.
Existing Work:
- Form integration (Yang et al., 2008): largely manual; exposes the users to the technical details of the underlying data model.
- Database integration (Yang et al., 2003): provides guidelines.
11. Problem #3: Form Integration
Problem #3b: Birthing
Extend the database for the unmatched form elements.
- How to automatically derive the functional dependencies among the form elements?
- How to translate the complex form patterns?
- How to evaluate multiple design alternatives and pick one?
Existing Work: form-based database design
- Several methods (Choobineh et al., 1988, Pavicevic et al., 2006, Choobineh and Venkatraman, 1992, Deklarit, 2008) and commercial tools (FormAssembly, Google Forms, ZohoCreator, Wufoo).
- No empirical evaluation of the resultant databases.
- Few focus on designing a database with certain desirable properties, e.g., expressiveness (Yang et al., 2008, Choobineh et al., 1988, Lukovic et al., 2007); these properties do not reflect any compliance with the form semantics and are inadequate for evaluating the mapping process.
12. Research Questions and System Goals
Research Questions:
1. Form Understanding: a model to capture the form semantics; extract this model from a given form.
2. Correspondence Discovery: determine semantically equivalent elements b/w form & database; incorporate the DB evolution requirement during the discovery process.
3. Form Integration: resolve merging conflicts while maintaining the original form semantics; given a form pattern, derive a relational database with "desirable" properties.
System Goals:
1. To evolve a DB that is high-quality and optimized as per the form semantics, i.e., compliant to the principles (Wang and Strong, 1996, Ramakrishnan and Gehrke, 2002, Silberschatz et al., 2001, Batini and Scannapieco, 2006):
   - Completeness: all form elements represented in the database.
   - Correctness: form semantics retained.
   - Compactness: equivalent elements merged.
   - Normalization: 3NF w.r.t. the form's functional dependencies.
   - Optimization: minimized NULL values in FKs and descriptive attributes.
2. To ensure minimalism in the required user intervention.
14. Form Representation: Form Tree
The form tree accurately captures the designer's intentions, and hence the semantic associations among the form elements.
Inspired by hierarchical modeling of forms in existing works (Dragut et al., 2009, Wu et al., 2009).
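As an illustration of the form-tree idea above, here is a minimal Python sketch; the node kinds and the sample patient-encounter fragment are hypothetical, not taken from the dissertation:

```python
class FormNode:
    """A node in a semantic form tree: a form, a semantic group, or an input field."""
    def __init__(self, label, kind="group"):
        self.label = label      # term shown on the form, e.g., "Social History"
        self.kind = kind        # "form", "group", "input", or "value" (assumed kinds)
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield root-to-leaf label paths; each path encodes a term's context."""
        here = prefix + (self.label,)
        if not self.children:
            yield here
        for c in self.children:
            yield from c.paths(here)

# Hypothetical fragment of a patient-encounter form
root = FormNode("Patient Encounter", kind="form")
social = root.add(FormNode("Social History"))
smoking = social.add(FormNode("Smoking", kind="input"))
smoking.add(FormNode("Yes", kind="value"))
smoking.add(FormNode("No", kind="value"))

print(list(root.paths()))
```

The parent-child edges are exactly the semantic associations the slide refers to: "Smoking" is interpretable only through its "Social History" ancestor.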
15. Framework Outline
1. Form Understanding and Semantics Extraction -> Form Tree
2. Correspondence Discovery and Validation -> Form Tree with Discovered Correspondences
3. Database Design and Evolution -> Database
17. Method 1a: Form Tree Generation
1. Tag and Segment Phase (T-HMM: Tagging HMM; S-HMM: Segmentation HMM)
2. Derive Tree Phase (5 rules)
The approach leverages the probabilistic nature of form design and develops a 2-layered Hidden Markov Model (HMM) based artificial designer that has the ability to understand the semantics of any arbitrarily designed form.
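To make the tagging layer concrete, a toy Viterbi decoder is sketched below; the two states ("LABEL", "INPUT") and all probabilities are invented for illustration and are not the dissertation's actual T-HMM parameters:

```python
# Toy Viterbi decoder illustrating how a tagging HMM labels form tokens.
# States and probabilities are invented for this example, not from the thesis.
states = ["LABEL", "INPUT"]
start = {"LABEL": 0.8, "INPUT": 0.2}
trans = {"LABEL": {"LABEL": 0.3, "INPUT": 0.7},
         "INPUT": {"LABEL": 0.6, "INPUT": 0.4}}
emit = {"LABEL": {"text": 0.9, "textbox": 0.05, "radio": 0.05},
        "INPUT": {"text": 0.1, "textbox": 0.5, "radio": 0.4}}

def viterbi(obs):
    """Return the most likely state sequence for the observed token types."""
    v = [{s: (start[s] * emit[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            # best predecessor for state s
            p, path = max((v[-1][ps][0] * trans[ps][s], v[-1][ps][1]) for ps in states)
            layer[s] = (p * emit[s][o], path + [s])
        v.append(layer)
    return max(v[-1].values())[1]

print(viterbi(["text", "textbox", "text", "radio"]))
```

With these toy numbers, the decoder alternates LABEL/INPUT over a "text, textbox, text, radio" sequence, mirroring how a label usually precedes its input field.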
18. Method 1b: Form Term Annotation
Refine semantics by annotating terms with the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), comprising 360,000 concepts belonging to various semantic categories.
Challenge: the same form term can be specified in multiple contexts, i.e., semantic categories. The key is to identify the semantic category for a given term. We hypothesize that the term context can be derived from the structure of the form tree.
ConceptID   Description        Semantic Category
0231832     Respiratory Rate   Observable Entity
362508001   Both eyes, entire  Body Structure
19. Method 1b: Form Term Annotation
Form Tree -> Form Structure Analyzer -> Classification Model -> SNOMED CT semantic category -> choose the best-matching SNOMED CT concept from this category (via the SNOMED CT search service).
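A toy sketch of the category-selection step: structural context (here reduced to the parent label in the form tree) picks a semantic category before the concept search runs. The training pairs below are invented; the actual system uses a Naïve Bayes classifier over richer structural features:

```python
from collections import Counter, defaultdict

# Hypothetical (parent_label, semantic_category) training pairs.
training = [
    ("Vital Signs", "Observable Entity"),
    ("Vital Signs", "Observable Entity"),
    ("Physical Exam", "Body Structure"),
    ("Medications", "Pharmaceutical Product"),
]

counts = defaultdict(Counter)
for parent, category in training:
    counts[parent][category] += 1

def predict_category(parent_label):
    """Pick the category most often seen under this structural context."""
    if parent_label not in counts:
        return None
    return counts[parent_label].most_common(1)[0][0]

print(predict_category("Vital Signs"))
```

Once the category is fixed, the SNOMED CT search only needs to disambiguate among concepts within that category, which is the point of the pipeline on this slide.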
20. Method 2: Correspondence Discovery and Validation
1. Linguistic Matching
2. Exact Concept Matching
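A minimal sketch of the two matchers; token-overlap (Jaccard) similarity stands in for Lucene's linguistic scoring, and the threshold and sample terms are my assumptions:

```python
def linguistic_match(form_term, column_name, threshold=0.5):
    """Token-overlap (Jaccard) similarity as a stand-in for Lucene scoring."""
    a = set(form_term.lower().split())
    b = set(column_name.lower().replace("_", " ").split())
    score = len(a & b) / len(a | b)
    return score >= threshold

def concept_match(form_concept_id, column_concept_id):
    """Exact match on annotated SNOMED CT concept IDs, when both exist."""
    return form_concept_id is not None and form_concept_id == column_concept_id

print(linguistic_match("Past Medical History", "medical_history"))  # True
print(concept_match("0231832", "0231832"))                          # True
```

Linguistic matching over-generates (linguistically similar but semantically different terms), which is why the validation step on the next slides is needed.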
21. Method 2: Validation Algorithm (Total Heuristics = 4)
(Figure: invalid correspondences, marked X, between form terms such as Past Medical History, Family Hx, History of Present Illness, and Meds, and database columns Id, HPI, Medications, SocialHistory; and a radiobutton group, Oral Hygiene: good/poor, validated against a look-up table, Appetite: Id, Options = 1 Good, 2 Fair, 3 Poor.)
23. Method 3a: Birthing Algorithm (Total Patterns = 12)
Principles: High Quality (Complete, Correct, Compact, Normalized) and Optimization (minimize NULLs).
Traverses the form tree in depth-first order.
Example patterns: Textbox Pattern, Radiobutton Pattern, Category/Subcategory Pattern, Extended RB Pattern; an M:1 association yields the functional dependency Tj.ID -> Tj.c.
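A rough sketch of the depth-first traversal for two of the twelve patterns (textbox and radiobutton); the dict-based tree and the table/column naming scheme are my assumptions, not the thesis's exact translation rules:

```python
def birth(node, table, schema):
    """Depth-first translation of a form tree into tables (two patterns only)."""
    kind = node.get("kind")
    if kind == "textbox":
        schema[table].append(node["label"])          # textbox pattern: descriptive column
    elif kind == "radiogroup":
        lookup = node["label"] + "_options"          # radiobutton pattern: look-up table
        schema[lookup] = ["Id", "Options"]
        schema[table].append(node["label"] + "_Id")  # FK into the look-up table
    for child in node.get("children", []):
        birth(child, table, schema)

# Hypothetical form fragment
form = {"label": "Encounter", "children": [
    {"label": "Weight", "kind": "textbox"},
    {"label": "Appetite", "kind": "radiogroup",
     "children": [{"label": "Good", "kind": "value"},
                  {"label": "Poor", "kind": "value"}]},
]}

schema = {"Encounter": ["ID"]}   # Tj.ID -> Tj.c: ID determines the descriptive columns
birth(form, "Encounter", schema)
print(schema)
```

The depth-first order matters because a child pattern (a radiobutton's look-up table) must be born with a key before the parent table can reference it.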
26. Method 3b: Merging Algorithm (Total Merging Scenarios = 8)
Each merger involves a trade-off between the compactness and optimization (min. NULL values) principles.
- Compactness Factor (CF): a configurable value in (0,1) that indicates the weightage given to compactness.
- Null Value Ratio (NVR): a calculated value that indicates the potential of having NULL values in a given table.
Example: merging a new DB with an existing DB, where NVR = 2/5 = 0.4.
- Case a: CF = 0.5 (CF > NVR) -> more compact final DB.
- Case b: CF = 0.3 (CF <= NVR) -> more optimized final DB.
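The slide's decision rule can be sketched in a few lines; reading NVR as nullable columns over total columns (matching the 2/5 example) is my interpretation of the slide:

```python
def null_value_ratio(nullable_columns, total_columns):
    """NVR: potential of NULL values in a table; the slide's example is 2/5 = 0.4."""
    return nullable_columns / total_columns

def should_merge(cf, nvr):
    """Merge (favor compactness) only when the configured CF outweighs the NVR."""
    return cf > nvr

nvr = null_value_ratio(2, 5)   # 0.4, as on the slide
print(should_merge(0.5, nvr))  # Case a: CF > NVR, merge (more compact)
print(should_merge(0.3, nvr))  # Case b: CF <= NVR, keep separate (more optimized)
```

A higher CF (the evaluation later uses CF = 0.7) therefore biases the algorithm toward merging even into tables that will carry some NULLs.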
28. System Goals: Principle Compliance & Min. Interventions
Evaluation Goals:
A. How well does the system meet the goals?
B. What is the impact of the framework in accomplishing the goals?
Implementation: Java, Tomcat, MySQL Server, yFiles, JSP.
- HMM-based tree extraction: EM & Viterbi, cross-validation.
- SNOMED CT term annotation: Naïve Bayes classifier, top-4 classes, SnAPI, cross-validation per dataset.
- Correspondence discovery: linguistic similarity = Lucene's default settings.
- Validation algorithm, birthing algorithm, and merging algorithm (CF = 0.7) applied to the form tree with discovered correspondences.
29. Data (52 real-world forms from 6 medical institutions)
Healthcare: forms are prevalent, and information systems are unusable and inflexible.

Dataset                                        Avg. Terms  Avg. Inputs  SNOMED CT Mappability
1 Walk-in clinic encounter forms (3 forms)     32.33       49.33        75.77%
2 Nursing patient admission forms (6 forms)    17.17       33           63.98%
3 Labor & delivery DB data-entry forms (7)     16.14       37.29        58.8%
4 Adult visit encounter forms (18 forms)       47.83       65.22        56.2%
5 Family practice forms (13 forms)             82.61       100.46       59.38%
6 Child visit encounter forms (5 forms)        53          67.4         62.21%

Gold Benchmarks:
- 52 Gold Std Trees (using a DIY interface that captures designers' on-the-fly semantic decisions).
- Gold Std Annotations (4235 form terms were manually studied; 2506 of them (59%) had a corresponding concept in SNOMED CT).
- 3 pairs of Gold DBs (3 datasets were given to 2 experts; each expert manually derived the 3 databases).
30. Experiment 1: Form Tree Extraction
97.85% of parent-child semantic associations were captured correctly.
An average tree with 135 edges gets generated in 0.08 seconds.

             Dataset1  Dataset2  Dataset3  Dataset4  Dataset5  Dataset6
Total Edges  272       362       461       2606      2674      644
Accuracy     95.22%    97.51%    100%      97.58%    98.46%    96.11%

Inaccuracies stem from higher hierarchical complexity, i.e., semantic grouping and sub-grouping.
32. Experiment 2: Form Term Annotation
Avg. time per form (datasets 1-6): 1.28, 1.77, 2.31, 10.29, 8.12, 3.44 s.
All versions were enhanced by adding term processing: special-character removal and clinical acronym expansion. Precision only slightly improved (3-5%); recall majorly improved (25%). Final precision = 0.89, recall = 0.76.
Baseline to Hybrid: avg. precision improved by 26%; recall showed no specific pattern.
Hybrid to Hybrid++: avg. precision improved by 13%; avg. recall improved by 17% (Hybrid++: precision 0.86, recall 0.6).
Structural knowledge can improve the overall performance; linguistic techniques can only impact the recall.
33. Experiment 3: Form to Database Mapping
3a. Linguistic-based Discovery
3b. Concept-based Discovery
3c. Hybrid Discovery
34. Exp 3: Description of the Evolved Databases
(35 to 450 tables; linguistic-based discovery; x: element type, y: # elements)
Mapping duration per form: a few ms to 200 s.
35. Exp 3: Comparison with Gold Databases
74% (avg.) of the system-generated tables "perfectly match" the tables in the gold databases.
Based on the principles of quality and optimization, the mismatches could be divided into negative and positive.
(Figure: a form pattern mapped to both a system-generated DB and a gold DB, compared with Gold 1 and Gold 2, illustrating positive and negative mismatches.)
36. Exp 3: Measuring Principle Compliance
Principles: Correctness, Completeness, Normalization, Optimization, Compactness (over an approximately universal set of merging situations; DB1-DB6).
- 3a: Linguistic Discovery: >=75% compactness in 4 databases; databases 4 and 6: >=20% rejected due to form features.
- 3b: Concept-based Discovery: >=70% compactness in 3 databases; datasets 5 & 6: >=33% undetected.
- 3c: Hybrid Discovery: >=80% compactness in 4 databases.
Datasets 4 and 6 suffer from format diversity (Gender: textbox vs. radiobuttons M, F; DOB: single vs. multiple textboxes) and section scattering.
38. Results Summary & Implications
Exp 1: Form tree generation (52 forms): accuracy = 0.98, 0.08 s/tree. Supervised; intervention cardinality improves disambiguation.
Exp 2: Form term annotation (2500 terms): precision = 0.89, recall = 0.76, 1 to 11 s/form. Sophisticated term techniques improve precision (43%) and recall (29%) over the baseline.
Exp 3: Form to DB mapping (6 DBs: 35 to 450 tables; a few ms to 200 s/form):
- The hybrid approach improves scenario identification (19%) and compactness (13%) over the pure approaches, but performs worse in terms of interventions and screen relevance.
- Interventions reduced by 61%; interventions/form: ling.: 10, con.: 8, hyb.: 13 (10/tree for the hybrid approach); avg. screen relevance = 50%.
- Principle compliance: 84.5% identical or superior to the gold DBs; 74% compact (hybrid).
Implications: tune validation/merging based on form features; the birthing algorithm can be refined as per the gold standard; interventions and screen relevance can be improved by enhancing the validation algorithm; SNOMED CT relationships and unsupervised learning are further options.
40. Thesis Contributions
Mapping a user-designed form to a relational database (NEW problem).
- Form Understanding. New solution: a 2-layered HMM that encodes designers' knowledge; the first work to apply HMMs to form understanding. Highly accurate (98%) and efficient (0.08 s per form).
- Form Term Annotation (NEW problem). Context-based solution leveraging the semantic structure. Promising (0.89 precision, 0.76 recall) and efficient (1-11 s); improves over the baseline by 43% in precision and 29% in recall.
- Correspondence Validation Algorithm. Heuristic-based solution relying on frequent observations. Reduces interventions by 61% on average.
- Birthing Algorithm. Intertwines the quality and optimization principles. Produced 4 medium (<65 tables) and 2 large (<500 tables) scale DBs; 3 medium-scale DBs intersect with (or are superior to) the gold DBs by 84.5%.
- Merging Algorithm. Balances compactness & optimization. Merged >=70% of the semantically matching elements in 11/18 cases.
Key Recommendations:
- For term annotations, design hybrid approaches leveraging both linguistics and structural semantics.
- For improving database quality, design approaches leveraging both linguistic and semantic methods for correspondence discovery.
- The birthing algorithm could be further refined in terms of handling radio-button groups and extended check-boxes to improve database quality.
- Enhance the validation algorithm to further reduce user interventions and improve screen relevance.
41. Limitations – I
Techniques:
- Form Understanding: weak entities, participation/cardinality constraints.
- Form Term Annotations: post-coordinated mapping.
- Correspondence Discovery: concatenated matches.
- Merging Algorithm: classification attributes; detecting/eliminating circular references in the database.
Technique Evaluation:
- Comparison with other learning models: SVM, conditional random fields, Bayesian networks, CAR.
- Completeness and correctness of the heuristics: tree design rules, heuristics for validation and merging, birthing form patterns.
- Assumptions: class-conditional independence, correctness of the most linguistically matching concept.
- Theoretical validity of the birthing algorithm.
42. Limitations – II
Study:
- Thorough user studies: can users understand/select the right correspondences?
- Domain expert annotator.
- Large-scale databases.
- Limited time for implementation and experimentation.
Experimental Design:
- Map and merge forms from different sources.
- Experiments involving both the automatic form tree extraction and the term annotation methods.
- Result evaluation against the gold DBs.
43. Future Directions
Electronic Health Records:
- Can clinicians design forms and understand/identify correspondences?
- Does this framework improve data quality and patient diagnosis?
- Legal perspective: HIPAA regulations, proprietary systems.
- Customize for form categories: encounter, walk-in, regular visit, data-entry.
- Use other UMLS terminologies.
General:
- Turn the framework into an API (Amazon SimpleDB, Google Datastore).
- Leverage more form-related information: past mappings, usage frequency, designer's/user's domain expertise.
- Mapping maintenance and record conflict resolution.
44. Related Publications
- Exploiting Semantic Structure for Mapping User-specified Form Terms to SNOMED CT Concepts. Khare R., An Y., Li J., Song I-Y., Hu X. In the proceedings of the 2nd International Health Informatics Symposium (IHI 2012), Jan 28-30, 2012, Miami, FL, USA.
- Automatically Mapping and Integrating Multiple Data Entry Forms into a Database. An Y., Khare R., Song I-Y., Hu X. In the proceedings of the 30th International Conference on Conceptual Modeling (ER 2011), Oct 31-Nov 3, 2011, Brussels, Belgium.
- Can Clinicians Create High-Quality Databases? A Study on a Flexible Electronic Health Record (fEHR) System. Khare R., An Y., Song I-Y., Hu X. In the proceedings of the 1st International Health Informatics Symposium (IHI 2010), Nov 11-12, 2010, Arlington, VA, USA.
- Understanding Deep Web Search Interfaces. Khare R., An Y., Song I-Y. Special Interest Group in Management of Data (SIGMOD) Record, 39(1):33-40, 2010.
- An Empirical Study on using Hidden Markov Model for Search Interface Segmentation. Khare R., and An Y. In the proceedings of the 18th International Conference on Information and Knowledge Management (CIKM 2009), Nov 3-5, 2009, Hong Kong.
Speaker Notes
Forms are designed for human consumption. Search forms are roughly 10 times shorter (studied on 50 forms from both categories) and simpler: hierarchical representations of database tables (single vs. multiple). Explain what the problem is and why it is challenging: "syntactic" means formatting and sequence; patterns are infinite, and design is so arbitrary that a certain pattern cannot be associated with a certain semantic intention. Prior approaches rely on rendering engines (Gecko, Trident), which makes them browser-dependent and inefficient.
The goal is to link these elements to the corresponding semantically matching elements of the existing hidden database. A form has values, and longer terms.
Whether to merge or not to merge: does the element in question become a new column in a new table corresponding to Diagnosis, linked through a foreign key, or do we duplicate this column into the new table and reduce the number of joins?
Make sure everything, i.e., the rest of the presentation, aligns with this: we seek the answers to these research questions through the development of a system that automatically maps a user-designed form to an existing database.
Prepare obvious answers: How is a DOM tree different from a semantic tree? Why do we generate correspondences from the form tree and then transfer them to the new database? So that users are presented correspondences in terms of the form they had designed. DB-DB integration could be done, but here we leverage the semantic form properties as well.
1. The input form is represented as an equivalent semantic form tree using a form understanding algorithm. We adopt a proactive approach to mapping in that we also standardize the form terms using an annotation technique focusing on the healthcare domain. Our solutions to the form understanding and the term annotation algorithms are described in Chapter 9.
2. The generated semantic form tree is then studied with respect to the existing database, and the semantic correspondences between the form tree and the existing database elements are discovered and validated using user interventions and certain validation rules. This part is described in Chapter 10.
3. The form tree with discovered correspondences to the existing database elements is then mapped and merged with the existing database. In particular, the matching elements are merged into the target database elements, the new form elements are transformed into new database elements, and the existing database is extended using the new database elements. The database design and evolution algorithms are described in Chapter 11.
The approach identifies semantic grouping. SNOMED CT is the widely used medical terminology. The HMMs are tailored for data-entry forms and are aligned with the forms' hierarchical complexity, thereby providing a high extraction accuracy (Khare and An, 2009).
Who designed the forms? Why not other domains, and which other domains? Possible; we have some ideas; an opportunity to study whether systems can be improved.
Why does recall decrease? The number of correct predictions decreases on applying the hybrid method; sometimes the linguistic approach returns the more accurate result.
Screen relevance: the number of screens wherein the user suggested to merge the elements over the total number of screens generated by executing the validation algorithm; it measures the amount of redundancy minimization performed by the algorithm.
Each area indicates the contribution of a form in generating the database elements. The peaks denote the general pattern of forms in a given dataset. Most of the datasets peak at columns, implying the prevalence of textbox fields in the forms. Database 2 peaks at values, implying the prevalence of select and radiobutton fields. Database 5 peaks at foreign keys, indicating the prevalence of categories and subcategories. The broad areas represent the presence of longer forms, and the narrower regions represent shorter, or mergeable, forms. The mapping duration does not include the form tree generation time, the user intervention time, or the execution of database DDL statements. The duration follows no fixed pattern; it depends on multiple factors including the size of the form and the size of the existing database. Lucene indexing helped in controlling the duration, which ranges from a few milliseconds to 200 seconds, even for large-scale databases such as the ones generated from datasets 4 and 5.
We performed a table-level comparison and manually analyzed the mismatched tables.
Interventions were reduced by at least 50% for all datasets: a huge reduction, since many validatable scenarios were found (5 options per screen). Screen relevance was very low for some datasets: most of the correspondences identified using the linguistic matching method adopted by Lucene were not semantically matching, and were hence rejected by the user. The screen relevance was particularly high (94%) for dataset 5, the family practice forms, where linguistically matching yet semantically differing terms were not very prevalent. Approved mergers for dataset 3: out of all the mergeable form elements identified by the validation algorithm, 97.29% were merged to a semantically matching database element.
Did we reach all the system goals? Specify again, clearly.
Our experience of tagging 52 data-entry forms suggests that the training samples can be constructed quickly and easily, compared to the construction of an exhaustive set of rules or heuristics. To further test the performance of the mapping framework in a heterogeneous environment, ...