Dissertation Defense Presentation


Published in: Technology, Education

  1. A Framework for Mapping User-designed Forms to Relational Databases
     • Dissertation Presentation, November 15, 2011
     • Ritu Khare
     • Committee: Dr. Yuan An (Chair), Dr. Jiexun Jason Li, Dr. Il-Yeol Song, Dr. Min Song, Dr. Christopher C. Yang
  2. Presentation Order: (1) Motivation, (2) Problems, (3) Solutions, (4) Evaluation, (5) Final Remarks
  3. Section 1: Motivation
  4. General Motivation: Database Usability (Sawyer, 1995)
     • Enable users to SEARCH and QUERY databases
         • Information Retrieval Techniques (Liu et al., 2006; Hristidis et al., 2003; Catarci, 2000; Jayapandian and Jagadish, 2006)
     • Enable users to DESIGN databases (Jagadish et al., 2007)
         • Form-based DIY and WYSIWYG paradigms
         • FormAssembly, ZohoCreator, GoogleForms
     • Databases still remain unusable from the integration point of view (Gurses et al., 2009)
  5. Precise Motivation: Integration of New Needs
     • New needs related to patient's social habits
     • 1) Building of new forms
     • 2) Integration of the new form into the back-end
  6. Research Objective
     • To develop a mechanism that automatically maps and integrates a user-designed form into an existing structured database.
     • Assume that a user-designed form has already been acquired.
     • Seek a framework that
         • merges the semantically matching elements between forms and databases;
         • creates new database elements corresponding to the unmatched form elements.
  7. Section 2: Research Problems
  8. Problem #1: Form Understanding
     • A form template represents the semantic intentions of the designer.
     • Automatic extraction of the form semantics
         • A machine can only read the syntactic patterns of form elements.
         • A certain layout pattern cannot be associated with a semantic intention.
     • Existing Work
         • Focus on search forms (Benslimane et al., 2007; Kaljuviee et al., 2001), which are shorter and simpler than data-entry forms (empirical finding).
         • Rules and heuristics (Zhang et al., 2004; He et al., 2007), which are not likely to circumvent the ever-broadening varieties in form topologies.
  9. Problem #2: Correspondence Discovery
     • Detect semantically matching elements between a form and an existing database.
     • Challenges
         • Variety of terms to denote the same concepts.
         • Variety of concepts denoted by similar terms.
         • Identify and eliminate the invalid correspondences.
     • Existing Work
         • Schema and Ontology Mapping (Madhavan et al., 2001; Euzenat and Shvaiko, 2005; Rahm and Bernstein, 2001; An et al., 2005; An et al., 2006)
         • Mostly semi-automatic
         • Not applicable to form-to-database correspondence discovery
             • Heterogeneity between forms and databases
             • Correspondences are to be used for evolving the database; the discovery process has to take this requirement into consideration.
  10. Problem #3: Form Integration. Problem #3a: Merging
     • Merging into an existing database so that the same concept is not duplicated and the database remains compact.
     • Merging increases the potential of having NULL values, i.e., a less optimized database.
     • Judicious decisions
     • Existing Work
         • Form integration (Yang et al., 2008): largely manual; exposes the users to the technical details of the underlying data model.
         • Database integration (Yang et al., 2003): provides guidelines.
  11. Problem #3: Form Integration. Problem #3b: Birthing
     • Extend the database for the unmatched form elements.
     • How to automatically derive the functional dependencies among the form elements?
     • How to translate the complex form patterns?
     • How to evaluate multiple design alternatives and pick one?
     • Existing Work: Form-based database design
         • Several methods (Choobineh et al., 1988; Pavicevic et al., 2006; Choobineh and Venkatraman, 1992; Deklarit, 2008) and commercial tools (FormAssembly, Google Forms, ZohoCreator, Wufoo)
         • No empirical evaluation of the resultant databases
         • Few focus on designing a database with certain desirable properties, e.g., expressiveness (Yang et al., 2008; Choobineh et al., 1988; Lukovic et al., 2007).
         • These properties do not reflect any compliance with the form semantics and are inadequate for evaluating the mapping process.
  12. Research Questions and System Goals
     • Research Questions
         • 1. Form Understanding
             • A model to capture the form semantics
             • Extract this model from a given form
         • 2. Correspondence Discovery
             • Determine semantically equivalent elements between form and database
             • Incorporate the DB evolution requirement during the discovery process
         • 3. Form Integration
             • Resolve merging conflicts while maintaining the original form semantics
             • Given a form pattern, derive a relational database with "desirable" properties
     • System Goals
         • 1. To evolve a DB that is high-quality and optimized as per the form semantics, i.e., compliant with the principles (Wang and Strong, 1996; Ramakrishnan and Gehrke, 2002; Silberschatz et al., 2001; Batini and Scannapieco, 2006):
             • Completeness: all form elements represented in the database
             • Correctness: form semantics retained
             • Compactness: equivalent elements merged
             • Normalization: 3NF w.r.t. the form's functional dependencies
             • Minimize NULL values in FKs and descriptive attributes
         • 2. To ensure minimalism in the required user intervention
  13. Section 3: Solutions
  14. Form Representation: Form Tree
     • The form tree accurately captures the designer's intentions, and hence the semantic associations among the form elements.
     • Inspired by hierarchical modeling of forms in existing works (Dragut et al., 2009; Wu et al., 2009)
  15. Framework Outline
     • Form Understanding and Semantics Extraction (output: Form Tree)
     • Correspondence Discovery and Validation (output: Form Tree with Discovered Correspondences)
     • Database Design and Evolution (output: Database)
  16. Method 1a: Form Tree Generation
  17. Method 1a: Form Tree Generation
     • Phase 1: Tag and Segment
     • Phase 2: Derive Tree (5 rules)
     • The approach leverages the probabilistic nature of form design and develops a 2-layered Hidden Markov Model (HMM) based artificial designer that has the ability to understand the semantics of any arbitrarily designed form.
     • T-HMM: Tagging HMM
     • S-HMM: Segmentation HMM
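The Tag and Segment phase relies on HMM decoding, and the evaluation setup later in the deck lists EM and Viterbi for the HMM-based tree extraction. As a rough illustration only, the sketch below runs Viterbi decoding over a single toy HMM. The state names (`Label`, `Field`), the observation symbols, and every probability are assumptions invented for this example; they are not the T-HMM or S-HMM parameters from the dissertation.

```python
import math

# Hypothetical toy model: two tags and three kinds of form element.
STATES = ["Label", "Field"]
START_P = {"Label": 0.8, "Field": 0.2}
TRANS_P = {
    "Label": {"Label": 0.3, "Field": 0.7},
    "Field": {"Label": 0.6, "Field": 0.4},
}
EMIT_P = {
    "Label": {"text": 0.9, "textbox": 0.05, "radio": 0.05},
    "Field": {"text": 0.1, "textbox": 0.5, "radio": 0.4},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations.

    Works in log space to avoid underflow; all probabilities must be > 0.
    """
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

With these toy numbers, decoding the element sequence `["text", "textbox", "text", "radio", "radio"]` yields the tags `["Label", "Field", "Label", "Field", "Field"]`. A second HMM layer for segmentation would consume such tags as its observations, but that layering is not shown here.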
  18. Method 1b: Form Term Annotation
     • Refine semantics by annotating terms.
     • Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), comprising 360,000 concepts belonging to various semantic categories.
     • Challenge: the same form term can be specified in multiple contexts, i.e., semantic categories. The key is to identify the semantic category for a given term.
     • We hypothesize that the term context can be derived from the structure of the form tree.
     • Example concepts (ConceptID, Description, Semantic Category):
         • 0231832, Respiratory Rate, Observable Entity
         • 362508001, Both eyes, entire, Body Structure
  19. Method 1b: Form Term Annotation
     • Diagram components: Form Tree, Form Term, Form Structure Analyzer, Classification Model, Semantic category, SNOMED CT search service, "Choose the best match concept from this category", SNOMED CT Concept
  20. Method 2: Correspondence Discovery and Validation
     • 1. Linguistic Matching
     • 2. Exact Concept Matching
  21. Method 2: Validation Algorithm (Total Heuristics = 4)
     • Figure: example correspondences among form terms such as Past Medical History, History, HPI, History of Present Illness, Medications, Meds, Social History, and Family Hx
     • Figure: form fields Oral Hygiene (radio buttons: good, fair, poor) and Appetite, with a look-up table of columns Id and Options (1 Good, 2 Fair, 3 Poor)
  22. Method 3: Database Design and Evolution (diagram with steps 1, 2, 3)
  23. Method 3a: Birthing Algorithm (Total Patterns = 12)
     • Principles: High Quality (Complete, Correct, Compact, Normalized) and Optimization (minimize NULLs)
     • Traverses the form tree in depth-first order
     • Patterns shown: Radiobutton Pattern, Textbox Pattern, Category/subcategory Pattern, Extended RB Pattern
     • Figure annotation: M:1, Tj.ID -> Tj.c
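As a loose illustration of the depth-first traversal mentioned above, the sketch below walks a small form tree and turns a radio-button node into a look-up table of (Id, Option) rows, similar to the Oral Hygiene look-up table shown on the validation slide. The `Node` class, the node kinds, and the `Section` parent are hypothetical names chosen for this example; the 12 patterns of the birthing algorithm itself are not reproduced.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical form-tree node; kind is 'category', 'textbox' or 'radio'."""
    label: str
    kind: str
    children: list[Node] = field(default_factory=list)
    options: list[str] = field(default_factory=list)


def depth_first(node):
    """Yield the node, then each subtree in order (pre-order depth-first)."""
    yield node
    for child in node.children:
        yield from depth_first(child)


def lookup_table(node):
    """Rows (id, option) for a radio-button node, with ids starting at 1."""
    return [(i, option) for i, option in enumerate(node.options, start=1)]


# Assumed example tree: one section holding a radio group and a textbox.
tree = Node(label="Section", kind="category", children=[
    Node(label="Oral Hygiene", kind="radio", options=["Good", "Fair", "Poor"]),
    Node(label="Appetite", kind="textbox"),
])

visit_order = [n.label for n in depth_first(tree)]
radio_tables = {
    n.label: lookup_table(n) for n in depth_first(tree) if n.kind == "radio"
}
```

Here `visit_order` lists the section first and then its children, and `radio_tables["Oral Hygiene"]` holds the three numbered options. How each pattern actually maps to tables, keys, and functional dependencies is decided by the algorithm in the dissertation, not by this sketch.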
  24. Method 3a: Birthing Algorithm
     • Figure labels: Sibling categories pattern, Category-subcategory pattern, Textbox pattern, Radiobutton pattern, Checkbox pattern
  25. Method 3: Database Design and Evolution (diagram with steps 1, 2, 3)
  26. Method 3b: Merging Algorithm (Total merging scenarios = 8)
     • Each merger involves a trade-off between the compactness and optimization (min. NULL values) principles.
     • Compactness Factor (CF): a configurable value in (0,1) that indicates the weightage given to compactness.
     • Null Value Ratio (NVR): a calculated value that indicates the potential of having NULL values in a given table.
     • Example: merging a New DB into an Existing DB with NVR = 2/5 = 0.4
         • Case a: CF = 0.5 (CF > NVR): Final DB is more compact
         • Case b: CF = 0.3 (CF <= NVR): Final DB is more optimized
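The CF versus NVR comparison on this slide can be sketched as a simple threshold test. The function names are hypothetical, and the reading of NVR as "potentially NULL cells divided by total cells" is an assumption; the slide gives only the value 2/5 without saying what the 2 and the 5 count.

```python
def null_value_ratio(potential_nulls: int, total: int) -> float:
    """Assumed reading of NVR: share of cells that could end up NULL."""
    if total <= 0:
        raise ValueError("total must be positive")
    return potential_nulls / total


def should_merge(cf: float, nvr: float) -> bool:
    """Merge (favor compactness) only when CF exceeds NVR.

    CF is the configurable Compactness Factor, restricted to (0, 1)
    as stated on the slide. When CF <= NVR the tables are kept apart,
    which favors fewer NULL values.
    """
    if not 0.0 < cf < 1.0:
        raise ValueError("CF must lie strictly between 0 and 1")
    return cf > nvr


nvr = null_value_ratio(2, 5)       # the slide's example, NVR = 0.4
case_a = should_merge(0.5, nvr)    # CF > NVR: merge, more compact
case_b = should_merge(0.3, nvr)    # CF <= NVR: do not merge, more optimized
```

A higher CF therefore makes merging more likely; the evaluation setup in this deck uses CF = 0.7.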
  27. Section 4: Evaluation
  28. Evaluation Setup
     • System Goals: Principle Compliance and Min. Interventions
     • Evaluation Goals
         • A. How well does the system meet the goals?
         • B. What is the impact of the framework in accomplishing the goals?
     • Implementation: Java, Tomcat, MySQL Server, yFiles, JSP
     • Pipeline and settings
         • HMM-based tree extraction: EM & Viterbi, cross-validation (output: Form Tree)
         • SNOMED CT Term Annotation: Naïve Bayes Classifier, Top-4 classes, SnAPI, cross-validation per dataset
         • Corr. Discovery: Linguistic Similarity = Lucene's default settings (output: Form Tree with Discovered Correspondences)
         • Validation Algorithm
         • Birthing Algorithm
         • Merging Algorithm: CF = 0.7 (output: Database)
  29. Data: 52 real-world forms from 6 medical institutions
     • Healthcare: forms are prevalent, and information systems are unusable and inflexible.
     • Datasets (avg. terms, avg. inputs, SNOMED CT mappability)
         • 1. Walk-in clinic encounter forms (3 forms): 32.33, 49.33, 75.77%
         • 2. Nursing patient admission forms (6 forms): 17.17, 33, 63.98%
         • 3. Labor & delivery DB data-entry forms (7 forms): 16.14, 37.29, 58.8%
         • 4. Adult visit encounter forms (18 forms): 47.83, 65.22, 56.2%
         • 5. Family practice forms (13 forms): 82.61, 100.46, 59.38%
         • 6. Child visit encounter forms (5 forms): 53, 67.4, 62.21%
     • Gold Benchmarks
         • 52 Gold Std Trees (using a DIY interface that captures designers' on-the-fly semantic decisions)
         • Gold Std Annotations (4235 form terms were manually studied and 2506 (59%) had a corresponding concept in SNOMED CT)
         • 3 pairs of Gold DBs (3 datasets were given to 2 experts; each expert manually derived the 3 databases)
  30. Experiment 1: Form Tree Extraction
     • 97.85% of parent-child semantic associations captured correctly.
     • An average tree with 135 edges is generated in 0.08 seconds.
     • Results per dataset (total edges, accuracy)
         • Dataset 1: 272, 95.22%
         • Dataset 2: 362, 97.51%
         • Dataset 3: 461, 100%
         • Dataset 4: 2606, 97.58%
         • Dataset 5: 2674, 98.46%
         • Dataset 6: 644, 96.11%
     • Inaccuracies arise from greater hierarchical complexity, i.e., semantic grouping and sub-grouping.
  31. Experiment 2: Form Term Annotation
     • Precision = (# correct annotations) / (# annotations)
     • Recall = (# correct annotations) / (# relevant (gold) annotations)
     • Versions compared: Baseline (linguistics), Hybrid (linguistics + semantic structure), Hybrid++
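The two metrics defined on this slide can be computed as in the sketch below. The dictionaries map form terms to concept identifiers; the terms and identifiers used in the usage example are made up and do not come from the study's data.

```python
def precision_recall(predicted: dict, gold: dict) -> tuple:
    """Precision and recall of term annotations.

    predicted: term -> concept id proposed by the system
    gold:      term -> concept id in the gold standard (relevant annotations)
    An annotation counts as correct when the term is in the gold standard
    and the proposed concept id equals the gold concept id.
    """
    correct = sum(
        1 for term, concept in predicted.items() if gold.get(term) == concept
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: three system annotations, four gold annotations.
predicted = {"resp rate": "C1", "both eyes": "C2", "appetite": "C3"}
gold = {"resp rate": "C1", "both eyes": "C9", "pulse": "C4", "weight": "C5"}
precision, recall = precision_recall(predicted, gold)  # 1/3 and 1/4
```

Only one of the three proposed annotations matches the gold standard, so precision is 1/3, and only one of the four gold annotations is recovered, so recall is 1/4.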
  32. Exp. 2: Form Term Annotation
     • Avg. time (s) per form, datasets 1 to 6: 1.28, 1.77, 2.31, 10.29, 8.12, 3.44
     • Baseline to Hybrid
         • Avg. precision improved by 26%.
         • Recall: no specific pattern.
     • Hybrid to Hybrid++
         • Avg. precision improved by 13%.
         • Avg. recall improved by 17%.
         • Hybrid++: precision 0.86, recall 0.6
     • Enhanced all versions by adding term processing: special character removal, clinical acronym expansion.
         • Precision only slightly improved (3-5%).
         • Recall majorly improved (25%).
         • Final precision = 0.89, recall = 0.76
     • Structural knowledge can improve the overall performance.
     • Linguistic techniques can only impact the recall.
  33. Experiment 3: Form to Database Mapping
     • 3a. Linguistic-based Discovery
     • 3b. Concept-based Discovery
     • 3c. Hybrid Discovery
  34. Exp 3: Description of evolved databases (35 to 450 tables), Linguistic-based Discovery
     • Chart axes: x = element type, y = # elements
     • Mapping duration per form: a few ms to 200 s
  35. Exp 3: Comparison with Gold Datasets
     • Chart panels: With Gold 1, With Gold 2
     • 74% (avg.) of the system-generated tables "perfectly match" the tables in the gold databases.
     • Based on the principles of quality and optimization, the mismatches could be divided into Negative and Positive.
     • Figure: a Form Pattern with the System Generated DB and a Gold DB, illustrating a Positive Mismatch and a Negative Mismatch
  36. Exp. 3: Measuring Principle Compliance
     • Principles: Correctness, Completeness, Normalization, Optimization, Compactness
     • An approx. universal set of merging situations
     • Chart: DB1 to DB6, for Linguistic Discovery, Concept Discovery, and Hybrid Discovery
     • 3a: Linguistic Discovery
         • >= 75% compactness in 4 databases.
         • Databases 4, 6: >= 20% rejected because of form features.
     • 3b: Concept-based Discovery
         • >= 70% compactness in 3 databases.
         • Datasets 5 & 6: >= 33% undetected.
     • 3c: Hybrid Discovery
         • >= 80% compactness in 4 databases.
     • Datasets 4 and 6
         • Format diversity: Gender (textbox vs. radio buttons M, F); DOB (single vs. multiple textboxes)
         • Section scattering
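  37. Exp 3: Measuring User Interventions
     • Metrics
         • % reduction in no. of screens
         • Avg. screens per form presented to the user
         • Screen relevance (%) = (# of screens to which user responds) / (# total screens)
     • Linguistic-based Discovery (% reduction, avg. screens, screen relevance %)
         • Dataset 1: 50, 4, 15.39
         • Dataset 2: 77, 2, 42.86
         • Dataset 3: 69, 2, 50.00
         • Dataset 4: 55, 10, 39.79
         • Dataset 5: 76, 21, 94.18
         • Dataset 6: 62, 5, 32.14
     • Concept-based Discovery (% reduction, avg. screens, screen relevance %)
         • Dataset 1: 77, 1, 75
         • Dataset 2: 62, 3, 68.75
         • Dataset 3: 18, 5, 46.87
         • Dataset 4: 54, 8, 45.45
         • Dataset 5: 65, 15, 73.57
         • Dataset 6: 65, 4, 42.86
     • Hybrid Discovery (% reduction, avg. screens, screen relevance %)
         • Dataset 1: 52, 4, 15.38
         • Dataset 2: 75, 3, 50
         • Dataset 3: 57, 4, 29.63
         • Dataset 4: 51, 13, 43.29
         • Dataset 5: 69, 27, 86.04
         • Dataset 6: 59, 8, 45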
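The screen relevance metric, and one way the per-dataset averages might roll up into overall figures, can be sketched as follows. The summary slide reports roughly 10, 8, and 13 interventions per form for the three discovery methods; weighting each dataset's average screens by its number of forms (from the data slide) reproduces those numbers, which suggests, but does not confirm, that this is how they were aggregated. The function names are hypothetical.

```python
def screen_relevance(responded: int, total: int) -> float:
    """Screen relevance in percent: responded screens over total screens."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * responded / total


def form_weighted_mean(values, weights):
    """Mean of per-dataset values, weighted by forms per dataset."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Forms per dataset (datasets 1 to 6), taken from the data slide.
FORMS_PER_DATASET = [3, 6, 7, 18, 13, 5]

# Avg. screens per form for datasets 1 to 6, taken from this slide.
AVG_SCREENS = {
    "linguistic": [4, 2, 2, 10, 21, 5],
    "concept": [1, 3, 5, 8, 15, 4],
    "hybrid": [4, 3, 4, 13, 27, 8],
}

overall = {
    method: form_weighted_mean(screens, FORMS_PER_DATASET)
    for method, screens in AVG_SCREENS.items()
}
# Rounded, the three values are 10 (linguistic), 8 (concept), 13 (hybrid).
```

A plain unweighted mean over the six datasets gives lower values (about 7.3 for the linguistic method), so the weighting matters when comparing against the summary.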
  38. Results Summary & Implications
     • Exp1: Form tree generation (52 forms, 0.08s/tree)
         • Accuracy = 0.98
         • Supervised
         • Intervention: 10/tree for cardinality disambiguation
     • Exp2: Form term annotation (about 2500 form terms, 1 to 11s/form)
         • Precision = 0.89, Recall = 0.76
         • Improves precision (43%) and recall (29%) over baseline
     • Exp3: Form to DB Mapping (6 DBs: 35 to 450 tables; a few ms to 200s), covering the Validation, Birthing, and Merging Algorithms
         • Interventions: reduction 61%; interventions per form: ling. 10, con. 8, hyb. 13; avg. screen relevance = 50%
         • Principle Compliance: 84.5% identical or superior to gold DBs; 74% compact (hybrid)
         • The hybrid approach improves scenario identification (19%) and compactness (13%) over pure approaches, but performs less well in terms of interventions and screen relevance.
     • Implications
         • Sophisticated term techniques
         • SNOMED CT relationships
         • Unsupervised learning algorithm
         • Tune validation/merging based on form features.
         • Birthing algorithm can be refined as per the gold standard.
         • Interventions and screen relevance can be improved by enhancing validation.
  39. Section 5: Final Remarks
  40. Thesis Contributions: Mapping a user-designed form to a relational database (NEW problem)
     • Form Understanding
         • New solution: 2-layered HMM that encodes designers' knowledge. First work to apply HMMs to form understanding.
         • Highly accurate (98%) and efficient (0.08s per form)
     • Form Term Annotation (NEW problem!)
         • Context-based solution leveraging semantic structure
         • Promising (0.89 precision, 0.76 recall) and efficient (1-11s); improves over baseline by 43% in precision and 29% in recall
     • Correspondence Validation Algorithm
         • Heuristic-based solution relying on frequent observations
         • Reduces interventions by avg. 61%.
     • Birthing Algorithm
         • Intertwines quality and optimization principles
         • 4 medium-scale (<65 tables) and 2 large-scale (<500 tables) DBs
         • 3 medium-scale DBs intersect with (or are superior to) gold by 84.5%.
     • Merging Algorithm
         • Balance between compactness and optimization
         • Merged >= 70% of semantically matching elements in 11/18 cases.
     • Key Recommendations
         • For term annotations, design hybrid approaches leveraging both linguistics and structural semantics.
         • For improving database quality, design approaches leveraging both linguistic and semantic methods for correspondence discovery.
         • The birthing algorithm could be further refined in terms of handling radio-button groups and extended check-boxes to improve database quality.
         • Enhance the validation algorithm to further reduce user interventions and improve screen relevance.
  41. Limitations - I
     • Techniques
         • Form Understanding: weak entities, part./card. constraints
         • Form Term Annotations: post-coordinated mapping
         • Correspondence Discovery: concatenated matches
         • Merging Algorithm: classification attributes; detect/eliminate circular references in the database
     • Technique Evaluation
         • Compare with other learning models: SVM, conditional random fields, Bayesian networks, CAR
         • Completeness and correctness of heuristics: tree design rules, heuristics for validation and merging, birthing form patterns
         • Assumptions: class conditional independence, correctness of most linguistic matching concept
         • Theoretical validity of the Birthing Algorithm
  42. Limitations - II
     • Study
         • Thorough user studies: can users understand/select the right correspondences?
         • Domain expert annotator
         • Large scale of databases: result evaluation, gold DB
         • Limited time: implementation, experimentation
     • Experimental Design
         • Map and merge forms from different sources
         • Experiments involving both automatic form tree extraction and term annotation methods
  43. Future Directions
     • Electronic Health Record
         • Can clinicians design forms and understand/identify correspondences?
         • Does this framework improve data quality and patient diagnosis?
         • Legal perspective: HIPAA regulations, proprietary systems
         • Customize for form categories: Encounter, Walk-in, Regular Visit, Data-entry
         • Use other UMLS terminologies
     • General
         • Turn into an API: Amazon SimpleDB, Google Datastore
         • Leveraging more form-related information: past mappings, usage frequency, designer's/user's domain expertise
         • Mapping maintenance and record conflict resolution
  44. Related Publications
     • Khare R., An Y., Li J., Song I-Y., Hu X. Exploiting Semantic Structure for Mapping User-specified Form Terms to SNOMED CT Concepts. In the proceedings of the 2nd International Health Informatics Symposium (IHI 2012), Jan 28-30, 2012, Miami, FL, USA.
     • An Y., Khare R., Song I-Y., Hu X. Automatically Mapping and Integrating Multiple Data Entry Forms into a Database. In the proceedings of the 30th International Conference on Conceptual Modeling (ER 2011), Oct 31-Nov 3, 2011, Brussels, Belgium.
     • Khare R., An Y., Song I-Y., Hu X. Can Clinicians Create High-Quality Databases? A Study on A Flexible Electronic Health Record (fEHR) System. In the proceedings of the 1st International Health Informatics Symposium (IHI 2010), Nov 11-12, 2010, Arlington, VA, USA.
     • Khare R., An Y., Song I-Y. Understanding Deep Web Search Interfaces. Special Interest Group in Management of Data (SIGMOD) Record, 39(1):33-40, 2010.
     • Khare R., An Y. An Empirical Study on using Hidden Markov Model for Search Interface Segmentation. In the proceedings of the 18th International Conference on Information and Knowledge Management (CIKM 2009), Nov 3-5, 2009, Hong Kong.
  45. Thank you