The document describes the automated construction of a large semantic network called SemNet. The approach analyzes the Google Books n-gram corpus to extract terms and relations using n-gram analysis, part-of-speech tagging, and lexical pattern matching. SemNet contains over 2.7 million terms and 37.5 million relations. The document evaluates SemNet by comparing it to WordNet and ConceptNet, finding that it covers over 77% of WordNet's synsets and over 82% of ConceptNet's nouns.
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for several reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Third, more and more data beyond the traditional relational databases are being created and beg to be profiled. The talk proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
Speaker: Felix Naumann studied mathematics, economics, and computer science at the Technische Universität Berlin. After receiving his diploma (MA) in 1997, he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics around data integration. From 2003 to 2006 he was assistant professor for information integration at Humboldt University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany.
An experimental comparison of globally-optimal data de-identification algorithms (arx-deidentifier)
Collaboration and data sharing have become core elements of biomedical research. At the same time, there is a growing understanding of privacy threats related to data sharing, especially when sensitive data from distributed sources become available for linkage. Statistical disclosure control comprises well-known data anonymization techniques that protect data by introducing fuzziness. To protect datasets from different types of threats, different privacy criteria are commonly implemented. Data anonymization is an important measure, but it is computationally complex, and it can significantly reduce the expressiveness of data. To attenuate these problems, a number of algorithms have been proposed which aim at increasing data quality or improving efficiency. Previous evaluations of such algorithms lack a systematic approach, as they focus on specific algorithms, specific privacy criteria, and specific runtime environments. This makes it difficult for decision makers to determine which algorithm is best suited to their requirements. As a first step towards a comprehensive and systematic evaluation of anonymization algorithms, we report on our ongoing efforts to provide an open source benchmark. In this contribution, we focus on optimal algorithms utilizing global recoding with full-domain generalization. We present a systematic evaluation of domain-specific algorithms and generic search methods for a broad set of privacy criteria, including k-anonymity, l-diversity, t-closeness and d-presence, and their use on multiple real-world datasets. Our results show that there is no single solution fitting all needs, and that generic search methods can outperform highly specialized algorithms.
eNanoMapper database, search tools and templates (Nina Jeliazkova)
A webinar given at the NCIP Hub https://nciphub.org/resources/1925
Nanomaterial safety assessment has become an important task following the production growth of engineered nanomaterials (ENMs) and the increased interest in ENMs from various academic, industry and regulatory parties. A number of challenges exist in nanomaterials data representation and integration, mainly due to the complexity of the data and the origination of ENM information from diverse sources. We have recently described the eNanoMapper database [1] as part of the computational infrastructure for toxicological data management of engineered nanomaterials, developed within the eNanoMapper project [2].
The eNanoMapper prototype database is publicly available at http://data.enanomapper.net, demonstrating the integration of data from multiple sources using a common data model and Application Programming Interface (API). The supported import formats are IUCLID5 files (OECD HT), a semantic format (RDF), and custom spreadsheet templates. The latter accommodate the preferred approach to data gathering for the majority of the NanoSafety Cluster projects and are enabled by a configurable parser that maps the custom spreadsheet organization onto the internal eNanoMapper storage components through an external configuration file. Import of spreadsheet data and other data formats generated by a number of NanoSafety Cluster projects is currently ongoing. The export formats have been extended with the new ISA JSON format, following the most recent ISA specification.
Defining templates for data gathering is a common activity for most of the NanoSafety Cluster projects, usually resulting in modified Excel spreadsheets. To help avoid incompatibility issues, we present a tool for template generation, based on templates released under an open license by the JRC within the framework of the NANoREG project [3]. A number of physchem, in-vitro and in-vivo assays are supported, and using feedback from users we have added to and extended the existing information about different aspects of nanosafety, e.g. environmental exposure, cell culture assays, cellular and animal models, nanomaterial production features, and nanomaterial ageing.
Finally, the data can be accessed programmatically via the application programming interface, as well as via a user-friendly search interface at https://search.data.enanomapper.net. The search application is powered by a free-text search engine and the eNanoMapper ontology and was improved over the last year based on user feedback. The search function now allows multiple filters to be stacked, e.g. by nanomaterial type, cell model and assay.
eNanoMapper is supported by European Commission 7th Framework Programme for Research and Technological Development Grant (Grant agreement no: 604134).
Positional Data Organization and Compression in Web Inverted Indexes (Leonidas Akritidis)
The conference presentation of the article:
L. Akritidis, P. Bozanis, "Positional Data Organization and Compression in Web Inverted Indexes", In Proceedings of the 23rd International Conference on Database and Expert Systems Applications (DEXA), Lecture Notes in Computer Science (LNCS), vol. 7446, pp. 422-429, 2012.
which was presented in Vienna, Austria, in September 2012.
ComputableFacts: a Secure System to Store Documents and Graphs (Accumulo Summit)
This 20-minute talk describes an automated data processing system, ComputableFacts, whose goal is to recover information from unstructured data in a variety of formats (such as Microsoft Office or Adobe PDF documents, emails, web pages, etc.) and convert it into a more usable form. Its key features are:
Security:
• Enforce authorizations across multiple access models to the database: batch, interactive and real-time.
Data Engineering:
• Extract data and metadata from a variety of sources and file formats
• Provide a uniform representation of all data, regardless of its initial structure or format
Knowledge Engineering:
• Build facts databases manually and/or automatically
• Automatically derive new facts using rules
• Execute complex queries
Knowledge Dissemination:
• Allow users to create alerts
• Allow users to share and comment on documents
• Allow users to create and export query-focused datasets
• Allow users to rate documents and later recommend documents of interest to them
Alexey Zinoviev presented this paper at the Second Thumbtack Technology Expert Day.
This paper covers the following topics: Data Mining, Machine Learning, Octave, R language
YouTube: http://youtu.be/kGIP6XeWiaA
Redis project: Relational Databases to Key-Value Systems (Lamprini Koutsokera)
Available at: https://github.com/dbsmasters/bdsmasters
This project was implemented in the context of the course "Big Data Management Systems" taught by Prof. Chatziantoniou in the Department of Management Science and Technology (AUEB). The aim of the project is to familiarize students with big data management systems such as Hadoop, Redis, MongoDB and Azure Stream Analytics.
Presentation & workshop at
Norwegian Knowledge Centre for the Health Services, Oslo, January 15th 2007 &
NTNU Library (UBiT), Trondheim, January 17th & 18th 2007
Guus van den Brekel
Coordinator Electronic Services,
Central Medical Library
University Medical Center Groningen
Website: www.rug.nl/umcg/bibliotheek
Blog: Digicmb.blogspot.com
Slidedeck from our seminar about Machine Learning (07/11/2014)
Topics covered:
- What is Machine Learning?
- Techniques (clustering, classification, ...)
- Tools (Mahout, R, Spark MLlib, Weka, ...)
- Practical examples of Machine Learning applications
- How to embed Machine Learning in software development
- Demos
Design and Development of a Provenance Capture Platform for Data Science (Paolo Missier)
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Production-Ready BIG ML Workflows - from zero to hero (Daniel Marcous)
Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment to a production ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero is about the work process you need to follow in order to have a production-ready workflow up and running.
Covering :
* Small to medium experimentation (R)
* Big data implementation (Spark MLlib / pipelines)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
How can I become a data scientist? What are the most valuable skills to learn for a data scientist now? Could I learn how to be a data scientist by going through online tutorials? What does a data scientist do?
These are only some of the questions that are being discussed online, on blogs, on forums and on knowledge-sharing platforms like Quora.
Let me share the Beginner's Guide to Data Science, which should be really helpful to you.
Also check out: http://bit.ly/2Mub6xP
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data (eXascale Infolab)
dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute).
http://diuf.unifr.ch/main/xi/diplodocus/
Reference Domain Ontologies and Large Medical Language Models (Chimezie Ogbuji)
Large Language Models (LLMs) have exploded into the modern research and development consciousness and triggered an artificial intelligence revolution. They are well-positioned to have a major impact on Medical Informatics. However, much of the data used to train these revolutionary models are general-purpose and, in some cases, synthetically generated from LLMs. Ontologies are a shared and agreed-upon conceptualization of a domain and facilitate computational reasoning. They have become important tools in biomedicine, supporting critical aspects of healthcare and biomedical research, and are integral to science. In this talk, we will delve into ontologies, their representational and reasoning power, and how terminology systems such as SNOMED-CT, an international master terminology providing comprehensive coverage of the entire domain of medicine, can be used with Controlled Natural Languages (CNL) to advance how LLMs are used and trained.
Law firms & lawyers: get rid of the manual review of text documents, correspondence, etc. Text analytics of unstructured documents surfaces potential knowledge that brings relevance and helps win cases. Moreover, the use of text analytics offers small firms the same advantage that big firms have. As the information can be used to strengthen solutions and provide advice to attorneys, courtrooms will also benefit from more informed, better prepared legal teams and swift action, keeping long years of litigation away!
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail on the problem, a build/buy/adopt analysis, and Lyft's solution, Amundsen, along with thoughts on the future.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed (Robert Grossman)
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc... (Dr. Haxel Consult)
Word embeddings, deep learning, transformer models and other pre-trained neural language models (sometimes recently referred to as "foundational models") have fundamentally changed the way state-of-the-art systems for natural language processing and information access are built today. The "Data-to-Value" process methodology (Leidner 2013; Leidner 2022a,b) has been devised to embody best practices for the construction of natural language engineering solutions; it can assist practitioners and has also been used to transfer industrial insights into the university classroom. This talk recaps how the methodology supports engineers in building systems more consistently and then outlines the changes in the methodology to adapt it to the deep learning age. The cost and energy implications will also be discussed.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 by 91mobiles (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Kubernetes & AI - Beauty and the Beast!?! @ KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply AI to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working for real.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security as an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for technology and making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply applying machine learning to just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Knowledge engineering: from people to machines and back
Henning Agt: Talk at CAiSE 2013 on SemNet
1. Automated Construction of a Large Semantic Network of Related Terms for Domain-Specific Modeling
CAiSE 2013, June 21st, Valencia
Henning Agt and Ralf-Detlef Kutsche
Database Systems and Information Management Group (DIMA), Technische Universität Berlin
http://www.dima.tu-berlin.de/
2. Motivation
■ Autocompletion applications
■ Predict what the user wants to model next
(Figure: example suggestions such as nurse, treatment, medicine, emergency, ...)
3. Research Goals
■ Our Vision: Provide automated suggestions of semantically related model elements for domain modeling [5],[19]
□ Focus on domain terminology and conceptual design
□ Query domain and common sense ontologies
□ Information extraction from text
■ Requirements for the intended application
□ Dictionary of terms
□ Relations between terms
□ Query interface and ranking functions
(Figure: modeling tools query a knowledge service for suggestions such as nurse, treatment, medicine, emergency, ...; the knowledge service retrieves and integrates ontologies and uses terminology generated by text analysis)
4. Agenda
■ Input dataset
■ Text analysis process
■ Application of SemNet
■ Evaluation of SemNet
■ Conclusions and Future Work
(Figure: processing pipeline – the text corpus is analyzed into n-gram statistics and parsed into an n-gram DB, which is tagged into a POS DB and normalized into a normalized n-gram DB; co-occurrence analysis produces SemNet, which applications query and retrieve from)
5. Agenda (repeated as a section divider)
6. Google Books N-Gram Dataset
■ Large amounts of text data
■ N-Grams
□ Sequence of n consecutive words/tokens and its frequency
□ Google provides 1-, 2-, 3-, 4- and 5-grams in several languages
■ We work on the English-All dataset V2 (1-grams and 5-grams) [11]
(Figure: a corpus of 5 million books with 500 billion words is run through n-gram analysis, yielding an n-gram dataset of CSV text files with word frequencies)
Example 5-grams with frequencies:
to go to the hospital 46,410
general condition of the patient 28,198
I was in the hospital 19,268
discharge from the hospital . 12,476
admission to the hospital . 10,558
the patient to the hospital 6,422
by placing the patient in 6,026
between doctor and patient . 5,908
able to leave the hospital 4,629
patient admitted to the hospital 4,303
a patient in the hospital 3,844
the symptom of the patient 2,559
the patient under local anesthesia 2,536
a patient is suffering from 2,475
the doctor and the hospital 1,362
the hospital and the doctor 1,017
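As a concrete illustration of how such frequency lists can be derived, here is a minimal Java sketch that aggregates per-year match counts from the raw Google Books files into total 5-gram frequencies. The file name and the tab-separated column layout (ngram, year, match_count, volume_count) are assumptions about the public dataset, not details taken from the talk.

import java.io.*;
import java.util.*;

// Minimal sketch: aggregate total frequencies per 5-gram from the Google Books
// dataset. Assumed format: tab-separated lines of ngram, year, match_count,
// volume_count; the file name is a placeholder.
public class NgramAggregator {
    public static void main(String[] args) throws IOException {
        Map<String, Long> totals = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("googlebooks-eng-all-5gram.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (f.length < 3) continue;                  // skip malformed rows
                long count = Long.parseLong(f[2]);           // match_count for one year
                totals.merge(f[0], count, Long::sum);        // sum over all years
            }
        }
        // Print the ten most frequent 5-grams, as in the example list above.
        totals.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}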
7. Agenda (repeated as a section divider)
8. Preprocessing
■ N-gram database: make the data manageable
□ Input: 2.5 terabytes of text
□ Output: tables with 10 million 1-grams and 710 million 5-grams (21 gigabytes)
■ Part-of-speech tagging [8], [9]: identify the lexical category of each text token
□ Output: table with POS tags for each 5-gram (14 gigabytes)
□ Examples: "general condition of the patient" is tagged JJ NN IN DT NN (adjective, normal noun, preposition, determiner, normal noun); "drug store pharmacist or doctor" is tagged NN NN NN CC NN (CC: coordinating conjunction)
■ Normalization: reduce the number of word variations
□ Plural stemming, lowercasing of adjectives and normal nouns
□ Proper nouns are not touched
□ Examples: doctors → doctor; Medical practitioner → medical practitioner; hospitals in Valencia → hospital in Valencia
■ Result: 710 million normalized and tagged 5-grams
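To make the normalization rules concrete, the following Java sketch applies them per token based on the POS tag. The trivial plural stemmer is a stand-in for illustration only; the talk does not name the stemming tool actually used.

// Minimal sketch of the normalization step, assuming Penn Treebank POS tags:
// plural common nouns (NNS) are stemmed to singular, adjectives (JJ) and
// common nouns (NN) are lowercased, and proper nouns (NNP/NNPS) stay untouched.
public class Normalizer {
    static String normalizeToken(String token, String posTag) {
        switch (posTag) {
            case "NNS":                       // plural common noun: stem + lowercase
                return stripPluralS(token).toLowerCase();
            case "NN":                        // singular common noun: lowercase
            case "JJ":                        // adjective: lowercase
                return token.toLowerCase();
            default:                          // NNP, NNPS and all other tags: keep as-is
                return token;
        }
    }

    // Naive plural stemmer for illustration only ("doctors" -> "doctor").
    static String stripPluralS(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "y";
        if (word.endsWith("s") && !word.endsWith("ss")) return word.substring(0, word.length() - 1);
        return word;
    }

    public static void main(String[] args) {
        System.out.println(normalizeToken("doctors", "NNS"));   // doctor
        System.out.println(normalizeToken("Valencia", "NNP"));  // Valencia (proper noun kept)
    }
}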
9. Agenda (repeated as a section divider)
10. Lexical Patterns
■ Goal: Detect domain terminology using syntactic patterns [12]
■ Analysis of existing dictionaries
□ 75% of terms are noun, noun-noun, or adjective-noun combinations
■ Excerpt of the 20 patterns used
□ Example: in "doctor or mental health professional", the pattern separates the terms "doctor" and "mental health professional"
■ No proper nouns: Stanford University / university professor
□ Our focus is conceptual design at the schema level
■ Limitation: a 5-gram contains only 5 words
□ Maximum length of a term: 3 words
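The pattern matching itself can be pictured as matching POS-tag sequences over the tagged 5-grams. Below is a minimal Java sketch using a few of the pattern shapes named above plus an assumed adjective-noun-noun variant; the real system's 20 patterns and its hierarchical matching are not reproduced here.

import java.util.*;

// Minimal sketch of POS-pattern term detection over a tagged 5-gram, assuming
// Penn Treebank tags. Longer patterns are tried first so that multi-word terms
// win over their single-word parts.
public class TermExtractor {
    static final String[][] PATTERNS = {
        {"JJ", "NN", "NN"},   // assumed longer variant, e.g. "mental health professional"
        {"NN", "NN"},         // noun-noun, e.g. "drug store"
        {"JJ", "NN"},         // adjective-noun, e.g. "general condition"
        {"NN"}                // single noun, e.g. "doctor"
    };

    static List<String> extractTerms(String[] tokens, String[] tags) {
        List<String> terms = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            for (String[] pat : PATTERNS) {               // longest pattern first
                if (matchesAt(tags, i, pat)) { matched = pat.length; break; }
            }
            if (matched > 0) {
                terms.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + matched)));
                i += matched;
            } else {
                i++;                                       // skip non-term token
            }
        }
        return terms;
    }

    static boolean matchesAt(String[] tags, int start, String[] pat) {
        if (start + pat.length > tags.length) return false;
        for (int k = 0; k < pat.length; k++)
            if (!tags[start + k].equals(pat[k])) return false;
        return true;
    }

    public static void main(String[] args) {
        String[] tokens = {"doctor", "or", "mental", "health", "professional"};
        String[] tags   = {"NN", "CC", "JJ", "NN", "NN"};
        System.out.println(extractTerms(tokens, tags)); // [doctor, mental health professional]
    }
}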
11. Co-Occurring Terms
■ Hierarchical pattern matching
□ The highest-level match remains; no idiomatic phrases; no consecutive patterns
■ Distributional semantics [13], [22]
□ "Words that occur in the same contexts tend to have similar meanings." (Distributional Hypothesis by Z. Harris)
□ Example (easiest case): the 5-gram "your doctor or pharmacist ." has an absolute frequency of 9,271, so "doctor" and "pharmacist" co-occurred 9,271 times (context frequency)
12. SemNet Construction
■ Discard 5-grams that contain 4 or 5 stopwords (e.g. "to go to the doctor", "I am what I am", "a ) ( 2 )")
■ Apply pattern matching to the remaining 5-grams
□ Result: a large table of binary relations
■ Frequency aggregation
□ Many terms co-occur in different contexts
■ Relative frequency computation
□ For each term with respect to its related terms
■ Graph construction
□ Directed, weighted edges
□ Relational database and graph database serialization (SQLite / Neo4J)
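A compact Java sketch of the aggregation and weighting steps follows. It assumes the weight of the directed edge a → b is the aggregated co-occurrence frequency of (a, b) divided by the total co-occurrence frequency of a, which is one plausible reading of "relative frequency" here; the slides do not spell out the exact formula.

import java.util.*;

// Minimal sketch of frequency aggregation and relative-frequency weighting
// over extracted co-occurrence pairs (assumed edge-weight definition, see above).
public class SemNetBuilder {
    // term -> (related term -> aggregated absolute co-occurrence frequency)
    static Map<String, Map<String, Long>> counts = new HashMap<>();

    static void addCoOccurrence(String a, String b, long freq) {
        counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, freq, Long::sum);
        counts.computeIfAbsent(b, k -> new HashMap<>()).merge(a, freq, Long::sum);
    }

    static Map<String, Double> relatedTerms(String term) {
        Map<String, Long> related = counts.getOrDefault(term, Map.of());
        double total = related.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> weights = new TreeMap<>();
        related.forEach((b, f) -> weights.put(b, f / total)); // directed, weighted edge
        return weights;
    }

    public static void main(String[] args) {
        addCoOccurrence("doctor", "pharmacist", 9271);  // from "your doctor or pharmacist ."
        addCoOccurrence("doctor", "patient", 5908);
        System.out.println(relatedTerms("doctor"));
    }
}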
13. Statistics
■ Properties of SemNet
□ 268,937 distinct single-word terms
□ 2,115,494 distinct double-word terms
□ 355,689 distinct triple-word terms
□ 2.7 million terms and 37.5 million relations in total
□ 2.2 GB disc space
■ Lessons learned from the analysis process
□ (Figure: n-gram information content – 41.6% of 5-grams contain 4 or 5 stopwords, 15.7% contain only 1 term, 32.6% have no pattern match, and 10.1% carry a semantic relationship)
□ (Figure: semantic relatedness follows Zipf's law – degree of relatedness plotted against rank)
14. Agenda (repeated as a section divider)
15. Querying SemNet
■ Query interfaces
□ SQL: query the relational database, e.g. the top 20 related terms for a given single-word term id:
select * from nouncooccurrences
where termw1 = 5824331 and termw2 is null and termw3 is null
order by relfreq desc limit 20;
□ Cypher: query the Neo4J database
□ Java: use SemNet in your applications, e.g.
public ArrayList<String> getRelatedStringTerms(ArrayList<String> inputTerms) { … }
□ PHP: explore the data in a web interface
■ Examples of top 10 automatically identified related terms
(f – absolute term frequency in the original text corpus, #r – number of related terms)
16. Ranking Results of Multiple Input Terms
■ Challenge: methods based on matrices and vectors are too slow
■ Strategy: intersect the related-term sets and multiply the relative frequencies
Top related terms for "table": chair 0.0441, contents 0.0359, end 0.0221, front 0.0194, figure 0.0189, head 0.0189, side 0.0180, data 0.0157, hand 0.0132, column 0.0131, page 0.0118, edge 0.0112, result 0.0100, value 0.0099, place 0.0087, row 0.0086, show 0.0082, elbow 0.0072, list 0.0071, bed 0.0071, ...
Top related terms for "database": data 0.0735, information 0.0569, record 0.0376, table 0.0334, access 0.0310, spreadsheet 0.0252, name 0.0201, object 0.0164, retrieval system 0.0163, file 0.0158, example 0.0153, use 0.0150, connection 0.0146, structure 0.0139, field 0.0125, user 0.0124, change 0.0112, type 0.0107, size 0.0104, transaction 0.0102, ...
Intersection ("table" ∩ "database") with multiplied weights: data 0.001155, contents 0.000359, information 0.000190, record 0.000091, use 0.000077, end 0.000060, example 0.000055, name 0.000050, figure 0.000047, value 0.000045, result 0.000037, list 0.000037, column 0.000034, row 0.000033, object 0.000024, field 0.000023, book 0.000016, order 0.000016, size 0.000014, query 0.000012, ...
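Here is a minimal Java sketch of this strategy, using a few of the weights from the example above: the intersection keeps only terms related to every input term, and the surviving weights are multiplied (0.0157 × 0.0735 ≈ 0.00115 for "data", matching the slide's 0.001155 up to rounding).

import java.util.*;

// Minimal sketch of ranking for multiple input terms: intersect the
// related-term sets and multiply the relative frequencies.
public class MultiTermRanking {
    static Map<String, Double> rankCombined(List<Map<String, Double>> relatedSets) {
        Iterator<Map<String, Double>> it = relatedSets.iterator();
        Map<String, Double> combined = new HashMap<>(it.next());
        while (it.hasNext()) {
            Map<String, Double> next = it.next();
            combined.keySet().retainAll(next.keySet());            // set intersection
            combined.replaceAll((term, w) -> w * next.get(term));  // multiply weights
        }
        return combined;
    }

    public static void main(String[] args) {
        // Small excerpts of the related-term sets from the slide's example.
        Map<String, Double> table = Map.of("chair", 0.0441, "contents", 0.0359, "data", 0.0157);
        Map<String, Double> database = Map.of("data", 0.0735, "information", 0.0569, "record", 0.0376);
        // Only "data" survives the intersection, with weight 0.0157 * 0.0735.
        System.out.println(rankCombined(List.of(table, database)));
    }
}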
17. Modeling with Semantic Autocompletion
■ Prototype: Ecore diagram editor with class name suggestions [15]
■ Automated suggestion adaptation with respect to the content of the model
18. Agenda (repeated as a section divider)
19. Evaluation Setup
■ Challenge
□ No gold standard available for many information extraction tasks
■ Our strategy: compare SemNet to existing knowledge bases
□ Provide measurements of how much information of WordNet and ConceptNet is contained in SemNet
■ WordNet V3.0: lexical database for the English language [16]
□ Synsets: grouped terms that share the same sense
□ Relations: mainly taxonomic, part-whole and synonyms
■ ConceptNet V5.1: semantic graph of general human knowledge [17]
□ Nodes: any natural language phrase that expresses a concept
□ Relations: taxonomic, part-whole, related-to and several others
■ SemNet: semantic network of related terms
□ Nodes: noun terminology
□ Relations: probabilistic links
(Figure: three example graphs for "pregnancy". In WordNet, 7 out of 32 relations of the word sense pregnancy are shown, linking it to maternity, morning sickness, physical condition, ectopic pregnancy, entopic pregnancy and parturiency via synonym, part-meronym, hyponym and hypernym edges. In ConceptNet, 7 out of 58 relations of the concept pregnancy are shown, linking it to expect, morning sickness, physical condition, go to bed, ectopic pregnancy, stretch and start family via IsA, PartOf, RelatedTo, Causes and HasSubevent edges. In SemNet, the first 10 out of 4,039 relations of the term pregnancy are shown: mother 0.036, termination 0.031, birth 0.030, woman 0.030, trimester 0.026, stage 0.025, week 0.020, childbirth 0.018, lactation 0.017, month 0.016. A small Venn sketch labels the three sources S (SemNet), W (WordNet) and C (ConceptNet).)
20. Noun Terminology Coverage
■ WordNet
□ Iterate through all noun synsets (72,994 synsets evaluated), e.g. (doctor, doc, physician, MD, Dr., medico), (ear doctor, ear specialist, otologist), (sleep talking, somniloquy, somniloquism)
□ Check whether the nouns are contained in SemNet (98,681 nouns evaluated)
□ Results: 77.16% of WordNet's synsets and 62.17% of WordNet's nouns are contained in SemNet
■ ConceptNet
□ Problem: concepts can be expressed using any natural language phrase, e.g. doctor, go to bed, pregnancy, beautiful
□ First determine the noun terminology
□ Check whether the nouns are contained in SemNet (49,301 concepts evaluated)
□ Result: 82.40% of ConceptNet's nouns are contained in SemNet
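The coverage numbers can be reproduced with a simple membership check. The Java sketch below assumes a synset counts as covered when at least one of its member nouns is a SemNet term; the talk does not state the exact counting rule, so this is an illustration only.

import java.util.*;

// Minimal sketch of the synset coverage measure (counting rule is assumed,
// see the note above).
public class CoverageEval {
    static double synsetCoverage(List<Set<String>> synsets, Set<String> semNetTerms) {
        long covered = synsets.stream()
                .filter(s -> s.stream().anyMatch(semNetTerms::contains))
                .count();
        return (double) covered / synsets.size();
    }

    public static void main(String[] args) {
        Set<String> semNet = Set.of("doctor", "physician", "otologist");
        List<Set<String>> synsets = List.of(
                Set.of("doctor", "doc", "physician", "MD", "medico"),
                Set.of("sleep talking", "somniloquy", "somniloquism"));
        System.out.println(synsetCoverage(synsets, semNet)); // 0.5
    }
}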
21. Relation Coverage
■ WordNet / ConceptNet
□ Iterate through all previously found noun synsets (56,321 synsets used) and concepts (40,625 concepts used)
□ Check whether the relations between synsets are contained in SemNet (61,931 WordNet relations and 256,213 ConceptNet relations evaluated)
□ Example: the synset (doctor, doc, physician, MD, Dr., medico) has the hypernym (medical practitioner, medical man) and hyponyms such as (surgeon) and (allergist)
■ Relation evaluation results
(Figure: table of relation coverage results, shown on the slide)
22. Agenda (repeated as a section divider)
23. Conclusions and Future Work
■ Summary
□ Input: 710 million 5-grams and 20 part-of-speech patterns
□ Hierarchical pattern matching, distributional semantics
□ Output: 2.7M multi-word terms and 37.5M weighted relations
□ Only a window of 5 words can be analyzed to detect relations
□ Applications: domain-specific modeling, keyword expansion, background knowledge for NLP tasks
■ Current and future work
□ Support additional languages
□ Improve ranking functions (pointwise mutual information)
□ Relax the 3-word limitation, derive own n-gram datasets
□ Combine probabilistic information with specific relations
□ Domain clustering in the semantic network
□ Additional modeling support: relations/associations, attributes
24. References
[5] Agt, H.: Supporting Software Language Engineering by Automated Domain Knowledge Acquisition. In: MODELS 2011 Workshops, LNCS 7167, Springer (2012)
[8] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of NAACL 2003, pp. 173–180 (2003)
[9] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)
[11] Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014), 176–182 (2011)
[12] Hearst, M.A.: Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the 14th Conference on Computational Linguistics, COLING 1992, vol. 2 (1992)
[13] Harris, Z.: Distributional Structure. Word 10(2-3), 146–162 (1954)
[15] Agt, H.: SemAcom: A System for Modeling with Semantic Autocompletion. In: Model Driven Engineering Languages and Systems, 15th International Conference (MODELS 2012), Demo Track, Innsbruck, Austria (2012)
[16] Fellbaum, C.: WordNet: An Electronic Lexical Database. The MIT Press, Cambridge (1998)
[17] Speer, R., Havasi, C.: Representing General Relational Knowledge in ConceptNet 5. In: LREC 2012 (2012)
[19] Agt, H., Kutsche, R.D., Wegeler, T.: Guidance for Domain Specific Modeling in Small and Medium Enterprises. In: SPLASH 2011 Workshops, DSM 2011, Portland, OR, USA (2011)
[22] Turney, P.D., Pantel, P.: From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37(1), 141–188 (2010)
Thank You For Your Attention!
MODELS?
Try out SemNet: http://www.bizware.tu‐berlin.de/semnet/
Contact: henning.agt@tu‐berlin.de