Dissertation Defense Presentation


  • Forms are designed for human consumption. Search forms are 10 times shorter (studied on 50 forms from both categories) and simpler, both in hierarchy and in the database tables they represent (single vs. multiple). Explain what the problem is and why it is challenging. Syntactic means formatting and sequence. Patterns are infinite, and design is so arbitrary that a certain pattern can't be associated with a certain semantic intention. These approaches rely on rendering engines (Gecko, Trident), which makes them browser dependent and inefficient.
  • To link these elements to the corresponding semantically matching elements of the existing hidden database. A form has values, and longer terms.
  • Whether to merge or not to merge: whether the element in question becomes a new column in a new table corresponding to Diagnosis, linked through a foreign key, or whether we duplicate this column into the new table and reduce the number of joins.
  • Make sure everything, i.e., the rest of the presentation, aligns with this. We seek the answers to these research questions through the development of a system that automatically maps a user-designed form to an existing database.
  • Prepare obvious answers: how is a DOM tree different from a semantic tree? Why do we generate correspondences from the form tree and then transfer them to the new database? So that users are presented correspondences in terms of the form they had designed. DB-to-DB integration could be done, but here we leverage semantic form properties as well.
  • 1. The input form is represented as an equivalent semantic form tree using a form understanding algorithm. We adopt a proactive approach to mapping in that we also standardize the form terms using an annotation technique focusing on the healthcare domain. Our solutions to the form understanding and the term annotation algorithms are described in Chapter 9.
    2. The generated semantic form tree is then studied with respect to the existing database, and the semantic correspondences between the form tree and the existing database elements are discovered and validated using user interventions and certain validation rules. This part is described in Chapter 10.
    3. The form tree with discovered correspondences to the existing database elements is then mapped and merged with the existing database. In particular, the matching elements are merged to the target database elements, the new form elements are transformed into new database elements, and the existing database is extended using the new database elements. The database design and evolution algorithms are described in Chapter 11.
  • Approach identifies semantic grouping
  • the widely used medical terminology.
  • The HMMs are tailored for data-entry forms and are aligned with the forms' hierarchical complexity, thereby providing high extraction accuracy (Khare and An, 2009).
  • Who designed the forms? Why not other domains – which other domains? Possible. Have some idea. – opportunity to study whether systems can be improved.
  • Why does recall decrease? When the number of correct predictions decreases on applying the hybrid method. Sometimes the linguistic approach returns a more accurate result.
  • Total number of screens wherein the user suggested to merge the elements over the total number of screens generated as a result of executing the validation algorithm. Amount of redundancy minimization performed by the algorithm.
  • Each area indicates the contribution of a form in generating the database elements. The peaks denote the general pattern of forms in a given dataset. Most of the datasets peak at columns, implying the prevalence of textbox fields in the forms. Database 2 peaks at values, implying the prevalence of select and radio button fields in the forms. Database 5 peaks at foreign keys, indicating the prevalence of categories and subcategories in the forms. The broad areas represent the presence of longer forms, and the narrower regions represent the presence of shorter, or mergeable, forms. This does not include the form tree generation time, user intervention time, or the execution of database DDL statements. The duration follows no fixed pattern. It depends on multiple factors, including the size of the form and the size of the existing database. Lucene indexing helped in controlling the duration, which ranges from a few milliseconds to 200 seconds, even for the large-scale databases such as the ones generated from datasets 4 and 5.
  • We performed a table-level comparison. We manually analyzed the mismatched tables.
  • At least 50% for all datasets. Huge reduction: many scenarios that could be validated were found. 5 options per screen. Screen relevance: very low. This denotes that most of the correspondences identified using the linguistic matching method adopted by Lucene were not semantically matching, and were hence rejected by the user. The screen relevance was particularly high (94%) for dataset 5, which represents the family practice forms. In these forms, the linguistically matching and yet semantically differing terms were not very prevalent. Approved merger: for dataset 3, out of all the mergeable form elements identified by the validation algorithm, 97.29% were merged to a semantically matching database element.
  • And did we reach all system goals? Specify again. Clearly. Did we reach the system goals?
  • Our experience of tagging 52 data-entry forms suggests that the training samples can be constructed quickly and easily, as compared to the construction of an exhaustive set of rules or heuristics. To further test the performance of the mapping framework in a heterogeneous environment,

    1. 1. A Framework for Mapping User-designed Forms to Relational Databases. Dissertation Presentation, November 15, 2011. Ritu Khare. Committee: Dr. Yuan An (Chair), Dr. Jiexun Jason Li, Dr. Il-Yeol Song, Dr. Min Song, Dr. Christopher C. Yang
    2. 2. Presentation Order: 1. Motivation 2. Problems 3. Solutions 4. Evaluation 5. Final Remarks
    3. 3. 1. Motivation
    4. 4. General Motivation: Database Usability (Sawyer, 1995)
        • Enable users to SEARCH and QUERY databases (Jagadish et al., 2007): information retrieval techniques (Liu et al., 2006; Hristidis et al., 2003; Catarci, 2000; Jayapandian and Jagadish, 2006)
        • Enable users to DESIGN databases: form-based DIY and WYSIWYG paradigms (FormAssembly, ZohoCreator, GoogleForms)
        • Databases still remain unusable from the integration point of view (Gurses et al., 2009)
    5. 5. Precise Motivation: Integration of New Needs. New needs related to patient's social habits: 1) Building of new forms 2) Integration of new form into back-end
    6. 6. Research Objective
        • To develop a mechanism to automatically map and integrate a user-designed form into an existing structured database.
        • Assume that a user-designed form is already acquired.
        • Seek a framework that merges the semantically matching elements between forms and databases, and creates new database elements corresponding to the unmatched form elements.
    7. 7. 2. Research Problems
    8. 8. Problem #1: Form Understanding
        • A form template represents the semantic intentions of the designer.
        • Automatic extraction of the form semantics: a machine can only read the syntactic patterns of form elements. A certain layout pattern cannot be associated with a semantic intention.
        • Existing work: focus on search forms (Benslimane et al., 2007; Kaljuviee et al., 2001), which are shorter and simpler than the data-entry forms (empirical finding).
        • Existing work: rules and heuristics (Zhang et al., 2004; He et al., 2007), which are not likely to circumvent the ever broadening varieties in form topologies.
    9. 9. Problem #2: Correspondence Discovery
        • Detect semantically matching elements between a form and an existing database.
        • Challenges: variety of terms to denote the same concepts; variety of concepts denoted by similar terms; identify and eliminate the invalid correspondences.
        • Existing work: schema and ontology mapping (Madhavan et al., 2001; Euzenat and Shvaiko, 2005; Rahm and Bernstein, 2001; An et al., 2005; An et al., 2006). Mostly semi-automatic.
        • Existing work is not applicable to form-to-database correspondence discovery: heterogeneity between forms and databases; correspondences are to be used for evolving the database, so the discovery process has to take this requirement into consideration.
    10. 10. Problem #3: Form Integration. Problem #3a: Merging
        • Merging into an existing database so that the same concept is not duplicated and the database remains compact.
        • Merging increases the potential of having NULL values, i.e., a less optimized database.
        • Judicious decisions.
        • Existing work: form integration (Yang et al., 2008) is largely manual and exposes the users to the technical details of the underlying data model.
        • Existing work: database integration (Yang et al., 2003) provides guidelines.
    11. 11. Problem #3: Form Integration. Problem #3b: Birthing
        • Extend the database for the unmatched form elements.
        • How to automatically derive the functional dependencies among the form elements?
        • How to translate the complex form patterns?
        • How to evaluate multiple design alternatives and pick one?
        • Existing work: form-based database design. Several methods (Choobineh et al., 1988; Pavicevic et al., 2006; Choobineh and Venkatraman, 1992; Deklarit, 2008) and commercial tools (Form assembly, google forms, zohocreator, wufoo).
        • No empirical evaluation of the resultant databases.
        • Few focus on designing a database with certain desirable properties, e.g., expressiveness (Yang et al., 2008; Choobineh et al., 1988; Lukovic et al., 2007). These properties do not reflect any compliance with the form semantics and are inadequate for evaluating the mapping process.
    12. 12. Research Questions and System Goals
        Research questions:
        1. Form Understanding: a model to capture the form semantics; extract this model from a given form.
        2. Correspondence Discovery: determine semantically equivalent elements b/w form & database; incorporate the DB evolution requirement during the discovery process.
        3. Form Integration: resolve merging conflicts while maintaining the original form semantics; given a form pattern, derive a relational database with "desirable" properties.
        System goals:
        1. To evolve a DB that is high-quality and optimized as per the form semantics, i.e., compliant to the principles (Wang and Strong, 1996; Ramakrishnan and Gehrke, 2002; Silberschatz et al., 2001; Batini and Scannapieco, 2006):
           • Completeness: all form elements represented in database
           • Correctness: form semantics retained
           • Compactness: equivalent elements merged
           • Normalization: 3NF w.r.t. form's functional dependencies
           • Minimize NULL values in FKs and descriptive attributes
        2. To ensure minimalism in the required user intervention.
    13. 13. 3. Solutions
    14. 14. Form Representation: Form Tree
        • The form tree accurately captures the designer's intentions, and hence the semantic associations among the form elements.
        • Inspired by hierarchical modeling of forms in existing works (Dragut et al., 2009; Wu et al., 2009)
    15. 15. Framework Outline: Form Understanding and Semantics Extraction → Form Tree → Correspondence Discovery and Validation → Form Tree with Discovered Correspondences → Database Design and Evolution → Database
    16. 16. Method 1a: Form Tree Generation
    17. 17. Method 1a: Form Tree Generation. 1. Tag and Segment Phase 2. Derive Tree Phase (5 rules)
        • The approach leverages the probabilistic nature of form design and develops a 2-layered Hidden Markov Model (HMM) based artificial designer that has the ability to understand the semantics of any arbitrarily designed form.
        • T-HMM: Tagging HMM
        • S-HMM: Segmentation HMM
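To make the tagging idea concrete, here is a minimal single-layer sketch of Viterbi decoding, the decoding step the evaluation slide lists alongside EM. It is not the 2-layered T-HMM/S-HMM from the thesis: the two states (`label`, `field`), the observation symbols, and every probability below are hypothetical values chosen only for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy model: is each form element a label or an input field?
states = ("label", "field")
start_p = {"label": 0.8, "field": 0.2}
trans_p = {
    "label": {"label": 0.3, "field": 0.7},
    "field": {"label": 0.6, "field": 0.4},
}
emit_p = {
    "label": {"text": 0.9, "textbox": 0.05, "radio": 0.05},
    "field": {"text": 0.1, "textbox": 0.5, "radio": 0.4},
}

decoded = viterbi(["text", "textbox", "text", "radio"], states, start_p, trans_p, emit_p)
print(decoded)  # ['label', 'field', 'label', 'field']
```

Log-probabilities are used so that long forms do not underflow; the toy model avoids zero probabilities, which `math.log` would reject.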
    18. 18. Method 1b: Form Term Annotation
        • Refine semantics by annotating terms.
        • Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), comprising 360,000 concepts belonging to various semantic categories.
        • Challenge: the same form term can be specified in multiple contexts, i.e., semantic categories. The key is to identify the semantic category for a given term.
        • We hypothesize that the term context can be derived from the structure of the form tree.
        ConceptID    Description        Semantic Category
        0231832      Respiratory Rate   Observable Entity
        362508001    Both eyes, entire  Body Structure
    19. 19. Method 1b: Form Term Annotation [Diagram components: Form Tree, Form Term, Structure Analyzer, Classification Model, Semantic category, SNOMED CT search service, "Choose the best match concept from this category", SNOMED CT Concept]
    20. 20. Method 2: Correspondence Discovery and Validation. 1. Linguistic Matching 2. Exact Concept Matching
    21. 21. Method 2: Validation Algorithm (Total Heuristics = 4) [Figure: accepted and rejected example correspondences among terms such as Past Medical History, History, HPI, History of Present Illness, Medications, Meds, SocialHistory, Family Hx, Oral Hygiene; an Appetite radio-button group (good, fair, poor) and a look-up table with columns Id and Options (1 Good, 2 Fair, 3 Poor)]
    22. 22. Method 3: Database Design and Evolution [Diagram with steps 1, 2, 3]
    23. 23. Method 3a: Birthing Algorithm (Total Patterns = 12)
        • Principles: High Quality (Complete, Correct, Compact, Normalized) and Optimization (minimize NULLs)
        • Traverses the form tree in depth-first order
        [Figure labels: M:1, Tj.ID -> Tj.c, Radiobutton Pattern, Textbox Pattern, Category/subcategory Pattern, Extended RB Pattern]
    24. 24. Method 3a: Birthing Algorithm [Figure labels: Sibling categories pattern, Textbox pattern, Category-subcategory pattern, Radiobutton pattern, Checkbox pattern]
    25. 25. Method 3: Database Design and Evolution [Diagram with steps 1, 2, 3]
    26. 26. Method 3b: Merging Algorithm (Tot. merging scenarios = 8)
        • Each merger involves a trade-off between the compactness and optimization (min. NULL values) principles.
        • Compactness Factor (CF): a configurable value (0,1) that indicates the weightage given to compactness.
        • Null Value Ratio (NVR): a calculated value that indicates the potential of having NULL values in a given table.
        [Figure: New DB, Existing DB, Final DB] Example with NVR = 2/5 = 0.4. Case a: CF = 0.5 (CF > NVR), More Compact. Case b: CF = 0.3 (CF <= NVR), More Optimized.
    27. 27. 4. Evaluation
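The CF versus NVR comparison on this slide can be sketched as a small decision rule. This is an illustrative reading, not the thesis implementation: the slide gives only NVR = 2/5, so the meaning of the numerator and denominator (`nullable_count`, `total_count`) is an assumption, as is equating "More Compact" with performing the merge.

```python
def null_value_ratio(nullable_count: int, total_count: int) -> float:
    """NVR as a plain ratio; what is being counted is an assumption here."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return nullable_count / total_count


def should_merge(cf: float, nvr: float) -> bool:
    """Merge (favor compactness) only when CF exceeds NVR, per the slide's two cases."""
    return cf > nvr


nvr = null_value_ratio(2, 5)           # the slide's example: 2/5 = 0.4
print(nvr)                             # 0.4
print(should_merge(0.5, nvr))          # True  (case a: more compact)
print(should_merge(0.3, nvr))          # False (case b: more optimized)
```

A higher CF therefore tolerates more potential NULL values in exchange for fewer tables; the evaluation slide reports CF = 0.7 for the experiments.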
    28. 28. System Goals: Principle Compliance & Min. Interventions
        Evaluation goals: A. How well does the system meet the goals? B. Impact of the framework in accomplishing the goals?
        Implementation: Java, Tomcat, MySQL Server, yFiles, JSP
        • HMM-based tree extraction: EM & Viterbi, cross-validation → Form Tree
        • SNOMED CT Term Annotation: Naïve Bayes Classifier, Top-4 classes, SnAPI, cross-validation per dataset
        • Corr. Discovery: Linguistic Similarity = Lucene's default settings
        • Validation Algorithm → Form Tree with Discovered Correspondences
        • Birthing Algorithm
        • Merging Algorithm: CF = 0.7 → Database
    29. 29. Data (52 real world forms from 6 medical institutions)
        Healthcare: forms are prevalent, and information systems are unusable and inflexible.
        Dataset                                             Avg. Terms  Avg. Inputs  SNOMED CT Mappability
        1 Walk in clinic encounter forms (3 forms)          32.33       49.33        75.77%
        2 Nursing patient admission forms (6 forms)         17.17       33           63.98%
        3 Labor & delivery DB data-entry forms (7 forms)    16.14       37.29        58.8%
        4 Adult visit encounter forms (18 forms)            47.83       65.22        56.2%
        5 Family practice forms (13 forms)                  82.61       100.46       59.38%
        6 Child visit encounter forms (5 forms)             53          67.4         62.21%
        Gold benchmarks:
        • 52 Gold Std Trees (using a DIY interface that captures designers' on-the-fly semantic decisions)
        • Gold Std Annotations (4235 form terms were manually studied & 2506 (59%) had a corr. concept in SNOMED CT)
        • 3 pairs of Gold DBs (3 datasets were given to 2 experts. Each expert manually derived the 3 databases)
    30. 30. Experiment 1: Form Tree Extraction
        • 97.85% of parent-child semantic associations captured correctly
        • An average tree with 135 edges gets generated in 0.08 seconds.
                      Dataset1  Dataset2  Dataset3  Dataset4  Dataset5  Dataset6
        Total Edges   272       362       461       2606      2674      644
        Accuracy      95.22%    97.51%    100%      97.58%    98.46%    96.11%
        Inaccuracies because of more hierarchical complexity, i.e., semantic grouping and sub-grouping.
    31. 31. Experiment 2: Form Term Annotation
        • Precision = (# correct annotations) / (# annotations)
        • Recall = (# correct annotations) / (# relevant (gold) annotations)
        • Methods compared: Baseline (linguistics), Hybrid (linguistics + semantic structure), Hybrid++
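The two ratios defined on this slide can be computed as in the sketch below. The form terms and concept identifiers are hypothetical placeholders, not data from the experiments, and representing annotations as term-to-concept dictionaries is an assumption made for illustration.

```python
def precision_recall(predicted: dict, gold: dict) -> tuple:
    """Precision = correct / predicted annotations; recall = correct / gold annotations."""
    correct = sum(1 for term, concept in predicted.items() if gold.get(term) == concept)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: form term -> concept identifier.
predicted = {"resp rate": "c1", "both eyes": "c2", "appetite": "c9"}
gold = {"resp rate": "c1", "both eyes": "c2", "appetite": "c3", "gender": "c4"}

precision, recall = precision_recall(predicted, gold)
print(round(precision, 2), recall)  # 0.67 0.5
```

Here two of three predictions are correct (precision 2/3), and two of the four gold annotations are recovered (recall 0.5), which shows how a method can gain precision without gaining recall.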
    32. 32. Exp. 2: Form Term Annotation
        Avg. time (s)/form: 1.28, 1.77, 2.31, 10.29, 8.12, 3.44
        • Baseline to Hybrid: avg. precision improved by 26%; recall shows no specific pattern.
        • Hybrid to Hybrid++: avg. precision improved by 13%; avg. recall improved by 17%. Hybrid++: precision 0.86, recall 0.6.
        • Enhanced all versions by adding term processing (special character removal, clinical acronym expansion): precision only slightly improved (3-5%); recall majorly improved (25%). Final precision = 0.89, recall = 0.76.
        Structural knowledge can improve the overall performance. Linguistic techniques can only impact the recall.
    33. 33. Experiment 3: Form to Database Mapping. 3a. Linguistic-based Discovery 3b. Concept-based Discovery 3c. Hybrid Discovery
    34. 34. Exp 3: Description of evolved databases (35 to 450 tables), Linguistic-based Discovery [Figure: area chart; x: element type, y: # elements]. Mapping duration per form: a few ms to 200 s.
    35. 35. Exp 3: Comparison with Gold Datasets [Figure: comparison with Gold 1 and with Gold 2]
        • 74% (avg.) of the system-generated tables "perfectly match" the tables in the gold databases.
        • Based on the principles of quality and optimization, the mismatches could be divided into Negative and Positive.
        [Figure: a form pattern with the system-generated DB and the gold DB, illustrating a positive mismatch and a negative mismatch]
    36. 36. Exp. 3: Measuring Principle Compliance: Correctness, Completeness, Normalization, Optimization, Compactness. An approx. universal set of merging situations. [Figure: charts for DB1 to DB6 under Linguistic Discovery, Concept Discovery, and Hybrid Discovery]
        • 3a: Linguistic Discovery: >= 75% compactness in 4 databases. Databases 4, 6: >= 20% rejected because of form features.
        • Datasets 4 and 6: Format Diversity, e.g., Gender (textbox, radiobuttons - M, F), DOB (single vs. multiple textboxes); Section Scattering.
        • 3b: Concept-based Discovery: >= 70% compactness in 3 databases. Datasets 5 & 6: >= 33% undetected.
        • 3c: Hybrid Discovery: >= 80% compactness in 4 databases.
    37. 37. Exp 3: Measuring User Interventions
        Measures: % reduction in no. of screens; avg. screens/form presented to user; screen relevance (%) = (# of screens to which user responds) / (# tot. screens)
                 Linguistic-based Discovery          Concept-based Discovery             Hybrid Discovery
        Dataset  % Red.  Avg. screens  Screen rel.   % Red.  Avg. screens  Screen rel.   % Red.  Avg. screens  Screen rel.
        1        50      4             15.39         77      1             75            52      4             15.38
        2        77      2             42.86         62      3             68.75         75      3             50
        3        69      2             50.00         18      5             46.87         57      4             29.63
        4        55      10            39.79         54      8             45.45         51      13            43.29
        5        76      21            94.18         65      15            73.57         69      27            86.04
        6        62      5             32.14         65      4             42.86         59      8             45
    38. 38. Results Summary & Implications
        • Exp1: Form tree generation (52 forms): accuracy = 0.98, 0.08s/tree. Implications: supervised; intervention 10/tree for cardinality disambiguation.
        • Exp2: Form term annotation (2500 forms): 1 to 11s/form, precision = 0.89, recall = 0.76. Improves precision (43%) and recall (29%) over baseline. Implications: sophisticated term techniques; SNOMED CT relationships; unsupervised learning.
        • Exp3: Form to DB Mapping (6 DBs: 35 to 450 tables, few ms to 200s), using the Validation, Birthing, and Merging Algorithms.
        • Interventions: intervention red. 61%; interventions/form: ling.: 10, con.: 8, hyb.: 13; avg. screen rel. = 50%.
        • Principle compliance: 84.5% identical, or superior, to gold DBs; 74% compact (hybrid).
        • Hybrid approach improves scenario identification (19%) and compactness (13%) over pure approaches, but performs less well in terms of interventions & screen relevance.
        • Implications: tune validation/merging based on form features; birthing algorithm can be refined as per gold std.; interventions & screen relevance can be improved by enhancing the validation algorithm.
    39. 39. 5. Final Remarks
    40. 40. Thesis Contributions: Mapping user-designed form to relational database (NEW problem)
        • Form Understanding. New solution: 2-layered HMM that encodes designers' knowledge. First work to apply HMMs on form understanding. Highly accurate (98%) and efficient (0.08s per form).
        • Form Term Annotation (NEW problem!). Context-based solution leveraging semantic structure. Promising (0.89 precision, 0.76 recall) and efficient (1-11s); improves over baseline by 43% in precision and 29% in recall.
        • Correspondence Validation Algorithm. Heuristic-based solution relying on frequent observations. Reduces interventions by avg. 61%.
        • Birthing Algorithm. Intertwines quality and optimization principles. 4 medium (<65 tables) & 2 large (<500 tables)-scale DBs. 3 medium-scale DBs intersect (or are superior) with gold by 84.5%.
        • Merging Algorithm. Balance b/w compactness & optimization. Merged >= 70% semantically matching elements in 11/18 cases.
        Key Recommendations
        • For term annotations, design hybrid approaches leveraging both linguistics and structural semantics.
        • For improving database quality, design approaches leveraging both linguistic and semantic methods for correspondence discovery.
        • Birthing algorithm could be further refined in terms of handling radio-button groups and extended check-boxes to improve database quality.
        • Enhance validation algorithm to further reduce user interventions and improve screen relevance.
    41. 41. Limitations - I
        Techniques
        • Form Understanding: weak entities, part./card. constraints
        • Form Term Annotations: post coordinated mapping
        • Correspondence Discovery: concatenated matches
        • Merging Algorithm: detect/eliminate circular references in database
        Technique Evaluation
        • Compare with other learning models: SVM, conditional random fields, Bayesian networks, CAR
        • Completeness and correctness of heuristics: tree design rules, heuristics for validation and merging, birthing form patterns, classification attributes
        • Assumptions: class conditional independence, correctness of most linguistic matching concept
        • Theoretical validity of Birthing Algorithm
    42. 42. Limitations - II
        Study
        • Thorough user studies: can users understand/select the right correspondences?
        • Domain expert annotator
        • Large scale of databases: result evaluation, gold DB
        • Limited time: implementation, experimentation
        Experimental Design
        • Map and merge forms from different sources
        • Experiments involving both automatic form tree extraction and term annotation methods
    43. 43. Future Directions
        Electronic Health Record
        • Can clinicians design forms and understand/identify correspondences?
        • Does this framework improve data quality and patient diagnosis?
        • Legal perspective: HIPAA regulations, proprietary systems
        • Customize for form categories: encounter, walk-in, regular visit, data-entry
        • Use other UMLS terminologies
        General
        • Turn into an API: Amazon SimpleDB, Google Datastore
        • Leveraging more form-related information: past mappings, usage frequency, designer's/user's domain expertise
        • Mapping maintenance and record conflict resolution
    44. 44. Related Publications
        • Exploiting Semantic Structure for Mapping User-specified Form Terms to SNOMED CT Concepts. Khare R., An Y., Li J., Song I-Y., Hu X. In the proceedings of 2nd International Health Informatics Symposium (IHI 2012), Jan 28-30, 2012, Miami, FL, USA.
        • Automatically Mapping and Integrating Multiple Data Entry Forms into a Database. An Y., Khare R., Song I-Y., Hu X. In the proceedings of 30th International Conference on Conceptual Modeling (ER 2011), Oct 31-Nov 3, 2011, Brussels, Belgium.
        • Can Clinicians Create High-Quality Databases? A Study on A Flexible Electronic Health Record (fEHR) System. Khare R., An Y., Song I-Y., Hu X. In the proceedings of 1st International Health Informatics Symposium (IHI 2010), Nov 11-12, 2010, Arlington, VA, USA.
        • Understanding Deep Web Search Interfaces. Khare R., An Y., Song I-Y. Special Interest Group in Management of Data (SIGMOD) Record, 39(1):33-40, 2010.
        • An Empirical Study on using Hidden Markov Model for Search Interface Segmentation. Khare R., and An Y. In the proceedings of 18th International Conference on Information and Knowledge Management (CIKM 2009), Nov 3-5, 2009, Hong Kong.
    45. 45. Thank you