Henning agt talk-caise-semnet

Published in: Technology, Education
Transcript

  • 1. Automated Construction of a Large Semantic Network of Related Terms for Domain-Specific Modeling
      Henning Agt and Ralf-Detlef Kutsche, Technische Universität Berlin
      CAiSE 2013, June 21st, Valencia
      Fachgebiet Datenbanksysteme und Informationsmanagement (DIMA), Technische Universität Berlin, http://www.dima.tu-berlin.de/
      Slide footer: 28.06.2013 DIMA – TU Berlin
  • 2. Motivation
      ■ Autocompletion applications
      ■ Predict what the user wants to model next
      Figure: suggested model elements (nurse, treatment, medicine, emergency, ...)
  • 3. Research Goals
      ■ Our Vision: Provide automated suggestions of semantically related model elements for domain modeling [5], [19]
          □ Focus on domain terminology and conceptual design
          □ Query domain and common sense ontologies
          □ Information extraction from text
      ■ Requirements for the intended application
          □ Dictionary of terms
          □ Relations between terms
          □ Query interface and ranking functions
      Figure: architecture overview with the components Ontologies, Text Analysis, Terminology, Knowledge Service and Modeling Tools, connected by the steps Extract, Retrieve/Integrate, Generate, Query, Provide Suggestions and Use. The modeling tool shows the suggestions nurse, treatment, medicine, emergency, ...
  • 4. Agenda
      ■ Input dataset
      ■ Text analysis process
      ■ Application of SemNet
      ■ Evaluation of SemNet
      ■ Conclusions and Future Work
      Figure: processing pipeline with the stages Text Corpus, N-Gram Statistics, N-Gram DB, POS DB, Norm. N-Gram DB, SemNet and Applications, connected by the steps Analyse, Parse, Tag, Normalize, Analyse Co-occurrences, Retrieve and Query
  • 6. Google Books N-Gram Dataset
      ■ Large amounts of text data
      ■ N-Grams
          □ Sequence of n consecutive words/tokens and its frequency
          □ Google provides 1-, 2-, 3-, 4- and 5-grams in several languages
      ■ We work on the English-All dataset V2 (1-grams and 5-grams) [11]
      Figure: 5 million books → corpus of 500 billion words → n-gram analysis → N-Gram Dataset (CSV text files with word frequencies)
      Example 5-grams with their frequencies:
          to go to the hospital                 46,410
          general condition of the patient      28,198
          I was in the hospital                 19,268
          discharge from the hospital .         12,476
          admission to the hospital .           10,558
          the patient to the hospital            6,422
          by placing the patient in              6,026
          between doctor and patient .           5,908
          ...
          able to leave the hospital             4,629
          patient admitted to the hospital       4,303
          a patient in the hospital              3,844
          the symptom of the patient             2,559
          the patient under local anesthesia     2,536
          a patient is suffering from            2,475
          the doctor and the hospital            1,362
          the hospital and the doctor            1,017
          ...
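The slide describes the input only as "CSV text files with word frequencies". As a minimal sketch of how such files could be turned into one total frequency per n-gram, the following assumes tab-separated rows of the form `ngram, year, match_count, volume_count`; the function name `aggregate_ngram_counts` and all sample rows are hypothetical and not taken from the talk.

```python
from collections import Counter

def aggregate_ngram_counts(lines):
    """Sum the per-year match counts of each n-gram into one total frequency."""
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip rows that do not have the assumed layout
        ngram, match_count = parts[0], parts[2]
        totals[ngram] += int(match_count)
    return totals

# Made-up rows: years and counts are illustrative, not values from the dataset.
sample_rows = [
    "to go to the hospital\t1999\t120\t80",
    "to go to the hospital\t2000\t150\t95",
    "general condition of the patient\t2000\t40\t31",
]

totals = aggregate_ngram_counts(sample_rows)
print(totals.most_common())
```

Summing over years is only one possible reading of "frequency"; a per-decade or per-volume count would need a different aggregation key.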
  • 8. Preprocessing
      ■ N-gram database → Make the data manageable
          □ Input: 2.5 terabytes of text
          □ Output: Tables with 10 million 1-grams and 710 million 5-grams (21 gigabytes)
      ■ Part-of-speech tagging [8], [9] → Identify the lexical category of each text token
          □ Output: Table with POS tags for each 5-gram (14 gigabytes)
      ■ Normalization → Reduce the amount of word variations
          □ Plural stemming, lowercasing of adjectives and normal nouns
          □ Proper nouns are not touched
      ■ Result: 710 million normalized and tagged 5-grams
      Tagging examples:
          JJ NN IN DT NN      general condition of the patient
          NN NN NN CC NN      drug store pharmacist or doctor
          (JJ = adjective, NN = normal noun, IN = preposition, DT = determiner, CC = coordinating conjunction)
      Normalization examples:
          doctors → doctor
          Medical practitioner → medical practitioner
          hospitals in Valencia → hospital in Valencia
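The normalization rules on this slide (plural stemming, lowercasing of adjectives and normal nouns, proper nouns untouched) can be sketched as follows. This is an illustration only: `naive_singular` is a hypothetical, deliberately crude suffix stemmer, the input is assumed to be already tagged with Penn Treebank tags, and retagging `NNS` as `NN` after stemming is an assumption rather than something the slides state.

```python
def naive_singular(word):
    """Very rough plural stemmer (illustrative; a real stemmer handles far more cases)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "xes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(tagged_tokens):
    """Normalize (word, tag) pairs following the rules listed on the slide."""
    result = []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            result.append((word, tag))            # proper nouns are not touched
        elif tag == "NNS":
            # plural normal noun: lowercase, stem, and (assumption) retag as NN
            result.append((naive_singular(word.lower()), "NN"))
        elif tag in ("NN", "JJ"):
            result.append((word.lower(), tag))    # lowercase nouns and adjectives
        else:
            result.append((word, tag))            # all other tokens stay as they are
    return result

print(normalize([("hospitals", "NNS"), ("in", "IN"), ("Valencia", "NNP")]))
print(normalize([("Medical", "JJ"), ("practitioner", "NN")]))
```

The two calls mirror the slide's examples "hospitals in Valencia" and "Medical practitioner"; the tags are supplied by hand here instead of coming from a tagger.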
  • 10. Lexical Patterns
      ■ Goal: Detect domain terminology using syntactical patterns [12]
      ■ Analysis of existing dictionaries
          □ 75% of terms: noun, noun-noun, adjective-noun combinations
      ■ Excerpt of the 20 patterns used: (pattern table not included in the transcript)
      ■ No proper nouns: Stanford University / university professor
          □ Our focus is conceptual design on schema level
      ■ Limitation: a 5-gram contains only 5 words
          □ Maximum length of a term: 3 words
      Figure: the 5-gram "doctor or mental health professional" split into term ("doctor"), separation ("or") and term ("mental health professional")
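The 20 patterns themselves are not part of the transcript. As a stand-in, the sketch below uses a single hypothetical rule: a term is a run of adjectives and normal nouns that ends in a noun and is at most 3 words long. It illustrates tag-based term detection in general and should not be read as the authors' pattern set; `find_terms` and the hand-assigned tags are assumptions.

```python
TERM_TAGS = {"JJ", "NN"}   # adjectives and normal nouns may be part of a term
MAX_TERM_LEN = 3           # limit stated on the slide

def find_terms(tagged):
    """Return candidate terms from a list of (word, tag) pairs."""
    terms, run = [], []
    # A sentinel token with a non-term tag flushes the last run.
    for word, tag in tagged + [("", "END")]:
        if tag in TERM_TAGS:
            run.append((word, tag))
            continue
        # The run ended: drop trailing adjectives so the term ends with a noun.
        while run and run[-1][1] != "NN":
            run.pop()
        if 1 <= len(run) <= MAX_TERM_LEN:
            terms.append(" ".join(w for w, _ in run))
        run = []
    return terms

five_gram = [("doctor", "NN"), ("or", "CC"), ("mental", "JJ"),
             ("health", "NN"), ("professional", "NN")]
print(find_terms(five_gram))
```

For the slide's example this yields the two terms on either side of the separating "or". Proper nouns (`NNP`) are not in `TERM_TAGS`, so they break a run, in line with the "No proper nouns" bullet.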
  • 11. Co-Occurring Terms
      ■ Hierarchical pattern matching
      ■ Distributional Semantics [13], [22]
          □ "Words that occur in the same contexts tend to have similar meanings." (Distributional Hypothesis by Z. Harris)
      Figure: the 5-gram "your doctor or pharmacist ." with an absolute frequency of 9271 serves as context; "doctor" and "pharmacist" co-occurred 9271 times (context frequency). Further labels in the figure: Easiest case, Highest level remains, No idiomatic phrases, No consecutive patterns
  • 12. SemNet Construction
      ■ Discard 5-grams that contain 4 or 5 stopwords
      ■ Apply pattern matching on the remaining 5-grams → Result: Large table of binary relations
      ■ Frequency aggregation
          □ Many terms co-occurred in different contexts
      ■ Relative frequency computation
          □ For each term with respect to its related terms
      ■ Graph construction
          □ Directed, weighted edges
          □ Relational database and graph database serialization (SQLite / Neo4J)
      Example 5-grams shown on the slide: "to go to the doctor", "I am what I am", "a ) ( 2 )"
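The construction steps listed on this slide can be sketched in a few lines. The stopword list, the symmetric counting of a co-occurrence for both terms, and the normalization (relative frequency of a → b as the aggregated count of the pair divided by the sum over all terms related to a) are assumptions made for illustration; the slides do not spell them out. The counts 9271 and 5908 appear on earlier slides, 729 is made up.

```python
from collections import defaultdict

# Tiny illustrative stopword list (hypothetical, not the list used for SemNet).
STOPWORDS = {"to", "the", "i", "am", "what", "a", "of", "in"}

def mostly_stopwords(tokens, limit=4):
    """True if the 5-gram contains at least `limit` stopwords and should be discarded."""
    return sum(t.lower() in STOPWORDS for t in tokens) >= limit

def build_relations(pairs):
    """Aggregate (term_a, term_b, frequency) triples and derive relative frequencies."""
    absolute = defaultdict(lambda: defaultdict(int))
    for a, b, freq in pairs:
        absolute[a][b] += freq   # the same pair may occur in several contexts
        absolute[b][a] += freq   # assumption: a co-occurrence counts for both terms
    relative = {}
    for term, related in absolute.items():
        total = sum(related.values())
        # Edge weights are normalized per source term, so a->b and b->a can differ.
        relative[term] = {other: f / total for other, f in related.items()}
    return relative

pairs = [
    ("doctor", "pharmacist", 9271),
    ("doctor", "patient", 5908),
    ("doctor", "pharmacist", 729),   # made-up second context for the same pair
]
relations = build_relations(pairs)
print(relations["doctor"])
print(mostly_stopwords("I am what I am".split()))
```

Because the weights are normalized per source term, the resulting graph is directed even though the raw co-occurrence counts are symmetric in this sketch.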
  • 13. Statistics
      ■ Properties of SemNet
          □ 268,937 distinct single-word terms
          □ 2,115,494 distinct double-word terms
          □ 355,689 distinct triple-word terms
          □ Total: 2.7 million terms and 37.5 million relations
          □ 2.2 GB disk space
      ■ Lessons learned from the analysis process
      Chart "N-Gram Information Content": categories 4 or 5 stopwords, only 1 term, no pattern match, n-grams with a semantic relationship; shares 41.6%, 15.7%, 32.6%, 10.1% (the assignment of shares to categories is not preserved in the transcript)
      Plot "Semantic relatedness: Zipf's law": axes rank and degree of relatedness
  • 15. Querying SemNet
      ■ Query Interfaces
          □ SQL: Query the relational database
          □ Cypher: Query the Neo4J database
          □ Java: Use SemNet in your applications
          □ PHP: Explore the data in a web interface
      ■ Examples of top 10 automatically identified related terms (example table not included in the transcript; f = absolute term frequency in the original text corpus, #r = number of related terms)
      SQL example:
          select * from nouncooccurrences
          where termw1 = 5824331 and termw2 is null and termw3 is null
          order by relfreq desc limit 20;
      Java example:
          public ArrayList<String> getRelatedStringTerms(ArrayList<String> inputTerms) { … }
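To make the SQL query above concrete, here is a small sketch that runs the same filter against an in-memory SQLite table. Only the table name `nouncooccurrences` and the columns `termw1`, `termw2`, `termw3` and `relfreq` come from the slide; the `related` column, the helper names and every row are hypothetical, since the real SemNet schema is not shown in the talk.

```python
import sqlite3

def make_toy_db():
    """Create an in-memory table with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nouncooccurrences ("
        "termw1 INTEGER, termw2 INTEGER, termw3 INTEGER, "
        "related TEXT, relfreq REAL)"
    )
    rows = [  # made-up rows, not SemNet data
        (5824331, None, None, "related term A", 0.02),
        (5824331, None, None, "related term B", 0.05),
        (5824331, 17, None, "related term C", 0.90),   # double-word term, filtered out
        (42, None, None, "related term D", 0.70),      # different source term
    ]
    conn.executemany("INSERT INTO nouncooccurrences VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def top_related(conn, word_id, limit=20):
    """Related terms of a single-word term, strongest relation first."""
    query = (
        "SELECT related, relfreq FROM nouncooccurrences "
        "WHERE termw1 = ? AND termw2 IS NULL AND termw3 IS NULL "
        "ORDER BY relfreq DESC LIMIT ?"
    )
    return conn.execute(query, (word_id, limit)).fetchall()

conn = make_toy_db()
result = top_related(conn, 5824331)
print(result)
```

Binding the word id and the limit as parameters instead of formatting them into the SQL string keeps the helper safe to call with user input, for example from a modeling tool.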
  • 16. Ranking Results of Multiple Input Terms
      ■ Challenge: Methods based on matrices and vectors are too slow
      ■ Strategy: Related term sets intersection + relative frequency multiplication
      Top 20 related terms for "table", for "database", and for the combined input "table + database" (intersection ∩ of the related term sets, multiplication * of the relative frequencies):
          table                  database                      table + database
          chair      0.0441      data              0.0735      data          0.001155
          contents   0.0359      information       0.0569      contents      0.000359
          end        0.0221      record            0.0376      information   0.000190
          front      0.0194      table             0.0334      record        0.000091
          figure     0.0189      access            0.0310      use           0.000077
          head       0.0189      spreadsheet       0.0252      end           0.000060
          side       0.0180      name              0.0201      example       0.000055
          data       0.0157      object            0.0164      name          0.000050
          hand       0.0132      retrieval system  0.0163      figure        0.000047
          column     0.0131      file              0.0158      value         0.000045
          page       0.0118      example           0.0153      result        0.000037
          edge       0.0112      use               0.0150      list          0.000037
          result     0.0100      connection        0.0146      column        0.000034
          value      0.0099      structure         0.0139      row           0.000033
          place      0.0087      field             0.0125      object        0.000024
          row        0.0086      user              0.0124      field         0.000023
          show       0.0082      change            0.0112      book          0.000016
          elbow      0.0072      type              0.0107      order         0.000016
          list       0.0071      size              0.0104      size          0.000014
          bed        0.0071      transaction       0.0102      query         0.000012
          …                      …                             …
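The strategy named on this slide, intersecting the related term sets and multiplying the relative frequencies, can be sketched as follows. The function name `rank_joint` is hypothetical. The weights are excerpts from the slide, except for the two entries marked as illustrative, which are added so that the intersection contains more than one term.

```python
import math

def rank_joint(related_sets):
    """Rank the terms related to all inputs by the product of their relative frequencies."""
    common = set.intersection(*(set(r) for r in related_sets))
    scores = {term: math.prod(r[term] for r in related_sets) for term in common}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Excerpt of the related terms of "table" (values from the slide).
table = {"chair": 0.0441, "contents": 0.0359, "data": 0.0157,
         "column": 0.0131, "row": 0.0086}

# Excerpt of the related terms of "database".
database = {
    "data": 0.0735, "information": 0.0569, "record": 0.0376,   # from the slide
    "column": 0.0026, "row": 0.0038,                            # illustrative values only
}

ranked = rank_joint([table, database])
for term, score in ranked:
    print(f"{term:10s} {score:.6f}")
```

Terms that are strongly related to only one input, such as "chair" for "table", drop out through the intersection. This is the effect the slide shows when "database" is added as a second input term.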
  • 17. Modeling With Semantic Autocompletion
      ■ Prototype: Ecore Diagram Editor with class name suggestions [15]
      ■ Automated adaptation of the suggestions with respect to the content of the model
  • 19. Evaluation Setup
      ■ Challenge
          □ No gold standard available for many information extraction tasks
      ■ Our strategy: Compare SemNet to existing knowledge bases
          □ Provide measurements on how much information of WordNet and ConceptNet is contained in SemNet
      ■ WordNet V3.0: Lexical database for the English language [16]
          □ Synsets: Grouped terms that share the same sense
          □ Relations: Mainly taxonomic, part-whole and synonyms
      ■ ConceptNet V5.1: Semantic graph for general human knowledge [17]
          □ Nodes: Any natural language phrase that expresses a concept
          □ Relations: Taxonomic, part-whole, related-to and several others
      ■ SemNet: Semantic Network of Related Terms
          □ Nodes: Noun terminology
          □ Relations: Probabilistic links
      Figure: Word sense "pregnancy" in WordNet (7 out of 32 relations): maternity, parturiency, morning sickness, physical condition, ectopic pregnancy, entopic pregnancy; relation types shown: synonym, part meronym, hyponym, hypernym
      Figure: Concept "pregnancy" in ConceptNet (7 out of 58 relations): expect, morning sickness, physical condition, go to bed, ectopic pregnancy, stretch, start family; relation types shown: ConceptuallyRelatedTo, PartOf, IsA, RelatedTo, Causes, HasSubevent
      Figure: Term "pregnancy" in SemNet (first 10 out of 4039 relations): mother, termination, birth, woman, trimester, stage, week, childbirth, lactation, month; weights for ranks 1 to 10: 0.036, 0.031, 0.030, 0.030, 0.026, 0.025, 0.020, 0.018, 0.017, 0.016
  • 20. Noun terminology coverage
      ■ WordNet
          □ Iterate through all noun synsets (72,994 synsets evaluated)
          □ Check whether the nouns are contained in SemNet (98,681 nouns evaluated)
          → Results: 77.16% of WordNet's synsets and 62.17% of WordNet's nouns are contained in SemNet
          Example synsets: (doctor, doc, physician, MD, Dr., medico), (ear doctor, ear specialist, otologist), (sleep talking, somniloquy, somniloquism)
      ■ ConceptNet
          □ Problem: Concepts can be expressed using any natural language phrase
          □ First determine noun terminology
          □ Check whether the nouns are contained in SemNet (49,301 concepts evaluated)
          → Result: 82.40% of ConceptNet's nouns are contained in SemNet
          Example concepts: doctor, go to bed, pregnancy, beautiful
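The two WordNet figures on this slide measure different things: the share of synsets and the share of individual nouns found in SemNet. The sketch below shows one way to compute both. It assumes that a synset counts as covered when at least one of its nouns is present, which the slides do not state explicitly; the function name and the toy vocabulary are hypothetical, and the synsets are the examples from the slide.

```python
def coverage(synsets, vocabulary):
    """Return (share of covered synsets, share of covered nouns)."""
    # Assumption: a synset is covered if at least one member is in the vocabulary.
    covered_synsets = sum(any(noun in vocabulary for noun in synset) for synset in synsets)
    nouns = {noun for synset in synsets for noun in synset}
    covered_nouns = sum(noun in vocabulary for noun in nouns)
    return covered_synsets / len(synsets), covered_nouns / len(nouns)

synsets = [
    ("doctor", "doc", "physician", "MD", "Dr.", "medico"),
    ("ear doctor", "ear specialist", "otologist"),
    ("sleep talking", "somniloquy", "somniloquism"),
]

# Hypothetical stand-in for the set of SemNet terms.
toy_vocabulary = {"doctor", "physician", "ear doctor", "ear specialist"}

synset_share, noun_share = coverage(synsets, toy_vocabulary)
print(f"synsets: {synset_share:.2%}, nouns: {noun_share:.2%}")
```

Under this reading, synset coverage is never lower than noun coverage, which matches the direction of the reported numbers (77.16% versus 62.17%).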
  • 21. Relation coverage
      ■ WordNet / ConceptNet
          □ Iterate through all previously found noun synsets (56,321 synsets used) and concepts (40,625 concepts used)
          □ Check whether the relations between synsets are contained in SemNet (61,931 WordNet relations evaluated and 256,213 ConceptNet relations evaluated)
      ■ Relation evaluation results (result table not included in the transcript)
      Figure: synset (doctor, doc, physician, MD, Dr., medico) with hypernym (medical practitioner, medical man) and hyponyms (surgeon) and (allergist)
  • 23. Conclusions and Future Work
      ■ Summary
          □ Input: 710 million 5-grams and 20 part-of-speech patterns
          □ Hierarchical pattern matching, distributional semantics
          □ Output: 2.7M multi-word terms and 37.5M weighted relations
          □ Only a window of 5 words can be analyzed to detect relations
          □ Applications: Domain-specific modeling, keyword expansion, background knowledge for NLP tasks
      ■ Current and future work
          □ Support additional languages
          □ Improve ranking functions (pointwise mutual information)
          □ Relax the 3-word limitation, derive own n-gram datasets
          □ Combine probabilistic information with specific relations
          □ Domain clustering in the semantic network
          □ Additional modeling support: relations/associations, attributes
  • 24. References
      [5] Agt, H.: Supporting Software Language Engineering by Automated Domain Knowledge Acquisition. In: MODELS 2011 Workshops. LNCS 7167, Springer (2012)
      [8] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of the NAACL 2003, pp. 173–180
      [9] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)
      [11] Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Team, T.G.B., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014), 176–182 (2011)
      [12] Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics, COLING 1992, vol. 2 (1992)
      [13] Harris, Z.: Distributional structure. Word 10(23), 146–162 (1954)
      [15] Agt, H.: SemAcom: A System for Modeling with Semantic Autocompletion. In: Model Driven Engineering Languages and Systems - 15th International Conference, MODELS 2012, Demo Track, Innsbruck, Austria (2012)
      [16] Fellbaum, C.: WordNet: An Electronic Lexical Database. The MIT Press, Cambridge (1998)
      [17] Speer, R., Havasi, C.: Representing General Relational Knowledge in ConceptNet 5. In: LREC 2012
      [19] Agt, H., Kutsche, R.D., Wegeler, T.: Guidance for Domain Specific Modeling in Small and Medium Enterprises. In: SPLASH 2011 Workshops. DSM 2011, Portland, OR, USA (2011)
      [22] Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)
      Thank You For Your Attention! MODELS?
      Try out SemNet: http://www.bizware.tu-berlin.de/semnet/
      Contact: henning.agt@tu-berlin.de
