The document describes the automated construction of a large semantic network called SemNet. The approach analyzes the Google Books n-gram corpus to extract terms and relations using n-gram analysis, part-of-speech tagging, and lexical pattern matching. SemNet contains over 2.7 million terms and 37.5 million relations. The document evaluates SemNet by comparing it to WordNet and ConceptNet, finding that it covers over 77% of WordNet's synsets and over 82% of ConceptNet's nouns.
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for several reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Third, more and more data beyond the traditional relational databases are being created and beg to be profiled. The talk proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
Speaker: Felix Naumann studied mathematics, economics, and computer science at the Technische Universität Berlin. After receiving his diploma (MA) in 1997, he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics around data integration. From 2003 to 2006 he was assistant professor for information integration at Humboldt University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany.
An experimental comparison of globally-optimal data de-identification algorithms (arx-deidentifier)
Collaboration and data sharing have become core elements of biomedical research. At the same time, there is a growing understanding of privacy threats related to data sharing, especially when sensitive data from distributed sources become available for linkage. Statistical disclosure control comprises well-known data anonymization techniques that protect data by introducing fuzziness. To protect datasets from different types of threats, different privacy criteria are commonly implemented. Data anonymization is an important measure, but it is computationally complex, and it can significantly reduce the expressiveness of data. To attenuate these problems, a number of algorithms have been proposed which aim at increasing data quality or improving efficiency. Previous evaluations of such algorithms lack a systematic approach, as they focus on specific algorithms, specific privacy criteria, and specific runtime environments. This makes it difficult for decision makers to determine which algorithm is best suited to their requirements. As a first step towards a comprehensive and systematic evaluation of anonymization algorithms, we report on our ongoing efforts to provide an open source benchmark. In this contribution, we focus on optimal algorithms utilizing global recoding with full-domain generalization. We present a systematic evaluation of domain-specific algorithms and generic search methods for a broad set of privacy criteria, including k-anonymity, l-diversity, t-closeness and d-presence, and their use on multiple real-world datasets. Our results show that there is no single solution fitting all needs, and that generic search methods can outperform highly specialized algorithms.
eNanoMapper database, search tools and templates (Nina Jeliazkova)
A webinar given at the NCIP Hub https://nciphub.org/resources/1925
Nanomaterial safety assessment has become an important task following the production growth of engineered nanomaterials (ENMs) and the increased interest in ENMs from various academic, industry and regulatory parties. A number of challenges exist in nanomaterials data representation and integration, mainly due to the complexity of the data and the origination of ENM information from diverse sources. We have recently described the eNanoMapper database [1] as part of the computational infrastructure for toxicological data management of engineered nanomaterials, developed within the eNanoMapper project [2].
The eNanoMapper prototype database is publicly available at http://data.enanomapper.net, demonstrating the integration of data from multiple sources using a common data model and Application Programming Interface (API). The supported import formats are IUCLID5 files (OECD HT), a semantic format (RDF), and custom spreadsheet templates. The latter accommodate the preferred approach to data gathering for the majority of the NanoSafety Cluster projects and are enabled by a configurable parser that maps the custom spreadsheet organization onto the internal eNanoMapper storage components through an external configuration file. Import of spreadsheet data and other data formats generated by a number of NanoSafety Cluster projects is currently ongoing. The export formats have been extended with the new ISA JSON format, following the most recent ISA specification.
Defining templates for data gathering is a common activity for most of the NanoSafety Cluster projects, usually resulting in modified Excel spreadsheets. To help avoid incompatibility issues, we present a tool for template generation, based on templates released under an open license by the JRC within the framework of the NANoREG project [3]. A number of physchem, in-vitro and in-vivo assays are supported, and using feedback from users we have added to and extended the existing information about different aspects of nanosafety, e.g. environmental exposure, cell culture assays, cellular and animal models, nanomaterial production features, and nanomaterial ageing.
Finally, the data can be accessed programmatically via the application programming interface, as well as via a user-friendly search interface at https://search.data.enanomapper.net. The search application is powered by a free-text search engine and the eNanoMapper ontology and was improved over the last year based on user feedback. The search function now allows multiple filters to be stacked, e.g. by nanomaterial type, cell model and assay.
eNanoMapper is supported by European Commission 7th Framework Programme for Research and Technological Development Grant (Grant agreement no: 604134).
Positional Data Organization and Compression in Web Inverted Indexes (Leonidas Akritidis)
The conference presentation of the article:
L. Akritidis, P. Bozanis, "Positional Data Organization and Compression in Web Inverted Indexes", In Proceedings of the 23rd International Conference on Database and Expert Systems Applications (DEXA), Lecture Notes in Computer Science (LNCS), vol. 7446, pp. 422-429, 2012.
which was presented in Vienna, Austria, in September 2012.
ComputableFacts: a Secure System to Store Documents and Graphs (Accumulo Summit)
This 20-minute talk describes an automated data processing system, ComputableFacts, whose goal is to recover information from unstructured data in a variety of formats (such as Microsoft Office or Adobe PDF documents, emails, web pages, etc.) and convert it into a more usable form. Its key features are:
Security:
• Enforce authorizations across multiple access models to the database: batch, interactive and real-time.
Data Engineering:
• Extract data and metadata from a variety of sources and file formats
• Provide a uniform representation of all data, regardless of its initial structure or format
Knowledge Engineering:
• Build facts databases manually and/or automatically
• Automatically derive new facts using rules
• Execute complex queries
Knowledge Dissemination:
• Allow users to create alerts
• Allow users to share and comment on documents
• Allow users to create and export query-focused datasets
• Allow users to rate documents and later recommend documents of interest to them
Alexey Zinoviev presented this paper at the Second Thumbtack Technology Expert Day.
This paper covers the following topics: Data Mining, Machine Learning, Octave, R language
YouTube: http://youtu.be/kGIP6XeWiaA
Redis project: Relational Databases to Key-Value Systems (Lamprini Koutsokera)
Available at: https://github.com/dbsmasters/bdsmasters
This project was implemented in the context of the course "Big Data Management Systems" taught by Prof. Chatziantoniou in the Department of Management Science and Technology (AUEB). The aim of the project is to familiarize students with big data management systems such as Hadoop, Redis, MongoDB and Azure Stream Analytics.
Presentation & workshop at
Norwegian Knowledge Centre for the Health Services, Oslo, January 15th 2007 &
NTNU Library (UBiT), Trondheim, January 17th & 18th 2007
Guus van den Brekel
Coordinator Electronic Services,
Central Medical Library
University Medical Center Groningen
Website: www.rug.nl/umcg/bibliotheek
Blog: Digicmb.blogspot.com
Slidedeck from our seminar about Machine Learning (07/11/2014)
Topics covered:
- What is Machine Learning?
- Techniques (clustering, classification, ...)
- Tools (Mahout, R, Spark MLlib, Weka, ...)
- Practical examples of Machine Learning applications
- How to embed Machine Learning in software development
- Demos
Design and Development of a Provenance Capture Platform for Data Science (Paolo Missier)
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Production-Ready BIG ML Workflows - from zero to hero (Daniel Marcous)
Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment to a production ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero is about the work process you need to follow in order to have a production-ready workflow up and running.
Covering :
* Small to medium experimentation (R)
* Big data implementation (Spark MLlib / pipelines)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
How can I become a data scientist? What are the most valuable skills to learn for a data scientist now? Could I learn how to be a data scientist by going through online tutorials? What does a data scientist do?
These are only some of the questions that are being discussed online, on blogs, on forums and on knowledge-sharing platforms like Quora.
Let me share the Beginner's Guide to Data Science, which should be really helpful to you.
Also check out: http://bit.ly/2Mub6xP
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data (eXascale Infolab)
dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute).
http://diuf.unifr.ch/main/xi/diplodocus/
Reference Domain Ontologies and Large Medical Language Models (Chimezie Ogbuji)
Large Language Models (LLMs) have exploded into the modern research and development consciousness and triggered an artificial intelligence revolution. They are well-positioned to have a major impact on Medical Informatics. However, much of the data used to train these revolutionary models are general-purpose and, in some cases, synthetically generated from LLMs. Ontologies are a shared and agreed-upon conceptualization of a domain and facilitate computational reasoning. They have become important tools in biomedicine, supporting critical aspects of healthcare and biomedical research, and are integral to science. In this talk, we will delve into ontologies, their representational and reasoning power, and how terminology systems such as SNOMED-CT, an international master terminology providing comprehensive coverage of the entire domain of medicine, can be used with Controlled Natural Languages (CNL) to advance how LLMs are used and trained.
Law firms & lawyers: get rid of the manual review of text documents, correspondence, etc. Text analytics of unstructured documents surfaces potential knowledge that brings relevance and helps win cases. Moreover, the use of text analytics offers small firms the same advantage that big firms have. As the information can be used to strengthen solutions and provide advice to attorneys, courtrooms will also benefit from more informed, better prepared legal teams and swift action, keeping long years of litigation away!
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail on the problem, a build/buy/adopt analysis, and Lyft's solution, Amundsen, along with thoughts on the future.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed (Robert Grossman)
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc... (Dr. Haxel Consult)
Word embeddings, deep learning, transformer models and other pre-trained neural language models (sometimes recently referred to as "foundational models") have fundamentally changed the way state-of-the-art systems for natural language processing and information access are built today. The "Data-to-Value" process methodology (Leidner 2013; Leidner 2022a,b) has been devised to embody best practices for the construction of natural language engineering solutions; it can assist practitioners and has also been used to transfer industrial insights into the university classroom. This talk recaps how the methodology supports engineers in building systems more consistently and then outlines the changes in the methodology to adapt it to the deep learning age. The cost and energy implications will also be discussed.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 by 91mobiles (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Kubernetes & AI - Beauty and the Beast!?! @ KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply AI to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working for real.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security as an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for technology and making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply applying machine learning to just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Knowledge engineering: from people to machines and back
Henning Agt: Talk at CAiSE 2013 on SemNet
1. Automated Construction of a Large Semantic Network of Related Terms for Domain-Specific Modeling
CAiSE 2013, June 21st, Valencia
Henning Agt and Ralf-Detlef Kutsche
Database Systems and Information Management Group (DIMA), Technische Universität Berlin
http://www.dima.tu-berlin.de/
2. Motivation
■ Autocompletion applications
■ Predict what the user wants to model next
(Figure: example suggestions such as nurse, treatment, medicine, emergency, ...)
3. Research Goals
■ Our Vision: Provide automated suggestions of semantically related model elements for domain modeling [5],[19]
□ Focus on domain terminology and conceptual design
□ Query domain and common sense ontologies
□ Information extraction from text
■ Requirements for the intended application
□ Dictionary of terms
□ Relations between terms
□ Query interface and ranking functions
(Figure: modeling tools query a knowledge service for suggestions such as nurse, treatment, medicine, emergency, ...; the knowledge service retrieves and integrates ontologies and uses terminology generated by text analysis)
4. Agenda
■ Input dataset
■ Text analysis process
■ Application of SemNet
■ Evaluation of SemNet
■ Conclusions and Future Work
(Figure: processing pipeline – the text corpus is analyzed into n-gram statistics and parsed into an n-gram DB, which is tagged into a POS DB and normalized into a normalized n-gram DB; co-occurrence analysis produces SemNet, which applications query and retrieve from)
5. Agenda (repeated as a section divider)
6. Google Books N-Gram Dataset
■ Large amounts of text data
■ N-Grams
□ Sequence of n consecutive words/tokens and its frequency
□ Google provides 1-, 2-, 3-, 4- and 5-grams in several languages
■ We work on the English-All dataset V2 (1-grams and 5-grams) [11]
(Figure: a corpus of 5 million books with 500 billion words is run through n-gram analysis, yielding an n-gram dataset of CSV text files with word frequencies)
Example 5-grams with frequencies:
to go to the hospital 46,410
general condition of the patient 28,198
I was in the hospital 19,268
discharge from the hospital . 12,476
admission to the hospital . 10,558
the patient to the hospital 6,422
by placing the patient in 6,026
between doctor and patient . 5,908
able to leave the hospital 4,629
patient admitted to the hospital 4,303
a patient in the hospital 3,844
the symptom of the patient 2,559
the patient under local anesthesia 2,536
a patient is suffering from 2,475
the doctor and the hospital 1,362
the hospital and the doctor 1,017
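As a concrete illustration of how such frequency lists can be derived, here is a minimal Java sketch that aggregates per-year match counts from the raw Google Books files into total 5-gram frequencies. The file name and the tab-separated column layout (ngram, year, match_count, volume_count) are assumptions about the public dataset, not details taken from the talk.

import java.io.*;
import java.util.*;

// Minimal sketch: aggregate total frequencies per 5-gram from the Google Books
// dataset. Assumed format: tab-separated lines of ngram, year, match_count,
// volume_count; the file name is a placeholder.
public class NgramAggregator {
    public static void main(String[] args) throws IOException {
        Map<String, Long> totals = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("googlebooks-eng-all-5gram.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (f.length < 3) continue;                  // skip malformed rows
                long count = Long.parseLong(f[2]);           // match_count for one year
                totals.merge(f[0], count, Long::sum);        // sum over all years
            }
        }
        // Print the ten most frequent 5-grams, as in the example list above.
        totals.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}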
7. Agenda (repeated as a section divider)
8. Preprocessing
■ N-gram database: make the data manageable
□ Input: 2.5 terabytes of text
□ Output: tables with 10 million 1-grams and 710 million 5-grams (21 gigabytes)
■ Part-of-speech tagging [8], [9]: identify the lexical category of each text token
□ Output: table with POS tags for each 5-gram (14 gigabytes)
□ Examples: "general condition of the patient" is tagged JJ NN IN DT NN (adjective, normal noun, preposition, determiner, normal noun); "drug store pharmacist or doctor" is tagged NN NN NN CC NN (CC: coordinating conjunction)
■ Normalization: reduce the number of word variations
□ Plural stemming, lowercasing of adjectives and normal nouns
□ Proper nouns are not touched
□ Examples: doctors → doctor; Medical practitioner → medical practitioner; hospitals in Valencia → hospital in Valencia
■ Result: 710 million normalized and tagged 5-grams
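To make the normalization rules concrete, the following Java sketch applies them per token based on the POS tag. The trivial plural stemmer is a stand-in for illustration only; the talk does not name the stemming tool actually used.

// Minimal sketch of the normalization step, assuming Penn Treebank POS tags:
// plural common nouns (NNS) are stemmed to singular, adjectives (JJ) and
// common nouns (NN) are lowercased, and proper nouns (NNP/NNPS) stay untouched.
public class Normalizer {
    static String normalizeToken(String token, String posTag) {
        switch (posTag) {
            case "NNS":                       // plural common noun: stem + lowercase
                return stripPluralS(token).toLowerCase();
            case "NN":                        // singular common noun: lowercase
            case "JJ":                        // adjective: lowercase
                return token.toLowerCase();
            default:                          // NNP, NNPS and all other tags: keep as-is
                return token;
        }
    }

    // Naive plural stemmer for illustration only ("doctors" -> "doctor").
    static String stripPluralS(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "y";
        if (word.endsWith("s") && !word.endsWith("ss")) return word.substring(0, word.length() - 1);
        return word;
    }

    public static void main(String[] args) {
        System.out.println(normalizeToken("doctors", "NNS"));   // doctor
        System.out.println(normalizeToken("Valencia", "NNP"));  // Valencia (proper noun kept)
    }
}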
9. Agenda (repeated as a section divider)
10. Lexical Patterns
■ Goal: Detect domain terminology using syntactic patterns [12]
■ Analysis of existing dictionaries
□ 75% of terms are noun, noun-noun, or adjective-noun combinations
■ Excerpt of the 20 patterns used
□ Example: in "doctor or mental health professional", the pattern separates the terms "doctor" and "mental health professional"
■ No proper nouns: Stanford University / university professor
□ Our focus is conceptual design at the schema level
■ Limitation: a 5-gram contains only 5 words
□ Maximum length of a term: 3 words
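The pattern matching itself can be pictured as matching POS-tag sequences over the tagged 5-grams. Below is a minimal Java sketch using a few of the pattern shapes named above plus an assumed adjective-noun-noun variant; the real system's 20 patterns and its hierarchical matching are not reproduced here.

import java.util.*;

// Minimal sketch of POS-pattern term detection over a tagged 5-gram, assuming
// Penn Treebank tags. Longer patterns are tried first so that multi-word terms
// win over their single-word parts.
public class TermExtractor {
    static final String[][] PATTERNS = {
        {"JJ", "NN", "NN"},   // assumed longer variant, e.g. "mental health professional"
        {"NN", "NN"},         // noun-noun, e.g. "drug store"
        {"JJ", "NN"},         // adjective-noun, e.g. "general condition"
        {"NN"}                // single noun, e.g. "doctor"
    };

    static List<String> extractTerms(String[] tokens, String[] tags) {
        List<String> terms = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            for (String[] pat : PATTERNS) {               // longest pattern first
                if (matchesAt(tags, i, pat)) { matched = pat.length; break; }
            }
            if (matched > 0) {
                terms.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + matched)));
                i += matched;
            } else {
                i++;                                       // skip non-term token
            }
        }
        return terms;
    }

    static boolean matchesAt(String[] tags, int start, String[] pat) {
        if (start + pat.length > tags.length) return false;
        for (int k = 0; k < pat.length; k++)
            if (!tags[start + k].equals(pat[k])) return false;
        return true;
    }

    public static void main(String[] args) {
        String[] tokens = {"doctor", "or", "mental", "health", "professional"};
        String[] tags   = {"NN", "CC", "JJ", "NN", "NN"};
        System.out.println(extractTerms(tokens, tags)); // [doctor, mental health professional]
    }
}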
11. Co-Occurring Terms
■ Hierarchical pattern matching
□ The highest-level match remains; no idiomatic phrases; no consecutive patterns
■ Distributional semantics [13], [22]
□ "Words that occur in the same contexts tend to have similar meanings." (Distributional Hypothesis by Z. Harris)
□ Example (easiest case): the 5-gram "your doctor or pharmacist ." has an absolute frequency of 9,271, so "doctor" and "pharmacist" co-occurred 9,271 times (context frequency)
12. SemNet Construction
■ Discard 5-grams that contain 4 or 5 stopwords (e.g. "to go to the doctor", "I am what I am", "a ) ( 2 )")
■ Apply pattern matching to the remaining 5-grams
□ Result: a large table of binary relations
■ Frequency aggregation
□ Many terms co-occur in different contexts
■ Relative frequency computation
□ For each term with respect to its related terms
■ Graph construction
□ Directed, weighted edges
□ Relational database and graph database serialization (SQLite / Neo4J)
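A compact Java sketch of the aggregation and weighting steps follows. It assumes the weight of the directed edge a → b is the aggregated co-occurrence frequency of (a, b) divided by the total co-occurrence frequency of a, which is one plausible reading of "relative frequency" here; the slides do not spell out the exact formula.

import java.util.*;

// Minimal sketch of frequency aggregation and relative-frequency weighting
// over extracted co-occurrence pairs (assumed edge-weight definition, see above).
public class SemNetBuilder {
    // term -> (related term -> aggregated absolute co-occurrence frequency)
    static Map<String, Map<String, Long>> counts = new HashMap<>();

    static void addCoOccurrence(String a, String b, long freq) {
        counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, freq, Long::sum);
        counts.computeIfAbsent(b, k -> new HashMap<>()).merge(a, freq, Long::sum);
    }

    static Map<String, Double> relatedTerms(String term) {
        Map<String, Long> related = counts.getOrDefault(term, Map.of());
        double total = related.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> weights = new TreeMap<>();
        related.forEach((b, f) -> weights.put(b, f / total)); // directed, weighted edge
        return weights;
    }

    public static void main(String[] args) {
        addCoOccurrence("doctor", "pharmacist", 9271);  // from "your doctor or pharmacist ."
        addCoOccurrence("doctor", "patient", 5908);
        System.out.println(relatedTerms("doctor"));
    }
}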
13. Statistics
■ Properties of SemNet
□ 268,937 distinct single-word terms
□ 2,115,494 distinct double-word terms
□ 355,689 distinct triple-word terms
□ 2.7 million terms and 37.5 million relations in total
□ 2.2 GB disc space
■ Lessons learned from the analysis process
□ (Figure: n-gram information content – 41.6% of 5-grams contain 4 or 5 stopwords, 15.7% contain only 1 term, 32.6% have no pattern match, and 10.1% carry a semantic relationship)
□ (Figure: semantic relatedness follows Zipf's law – degree of relatedness plotted against rank)
14. Agenda (repeated as a section divider)
15. Querying SemNet
■ Query interfaces
□ SQL: query the relational database, e.g. the top 20 related terms for a given single-word term id:
select * from nouncooccurrences
where termw1 = 5824331 and termw2 is null and termw3 is null
order by relfreq desc limit 20;
□ Cypher: query the Neo4J database
□ Java: use SemNet in your applications, e.g.
public ArrayList<String> getRelatedStringTerms(ArrayList<String> inputTerms) { … }
□ PHP: explore the data in a web interface
■ Examples of top 10 automatically identified related terms
(f – absolute term frequency in the original text corpus, #r – number of related terms)
16. Ranking Results of Multiple Input Terms
■ Challenge: methods based on matrices and vectors are too slow
■ Strategy: intersect the related-term sets and multiply the relative frequencies
Top related terms for "table": chair 0.0441, contents 0.0359, end 0.0221, front 0.0194, figure 0.0189, head 0.0189, side 0.0180, data 0.0157, hand 0.0132, column 0.0131, page 0.0118, edge 0.0112, result 0.0100, value 0.0099, place 0.0087, row 0.0086, show 0.0082, elbow 0.0072, list 0.0071, bed 0.0071, ...
Top related terms for "database": data 0.0735, information 0.0569, record 0.0376, table 0.0334, access 0.0310, spreadsheet 0.0252, name 0.0201, object 0.0164, retrieval system 0.0163, file 0.0158, example 0.0153, use 0.0150, connection 0.0146, structure 0.0139, field 0.0125, user 0.0124, change 0.0112, type 0.0107, size 0.0104, transaction 0.0102, ...
Intersection ("table" ∩ "database") with multiplied weights: data 0.001155, contents 0.000359, information 0.000190, record 0.000091, use 0.000077, end 0.000060, example 0.000055, name 0.000050, figure 0.000047, value 0.000045, result 0.000037, list 0.000037, column 0.000034, row 0.000033, object 0.000024, field 0.000023, book 0.000016, order 0.000016, size 0.000014, query 0.000012, ...
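Here is a minimal Java sketch of this strategy, using a few of the weights from the example above: the intersection keeps only terms related to every input term, and the surviving weights are multiplied (0.0157 × 0.0735 ≈ 0.00115 for "data", matching the slide's 0.001155 up to rounding).

import java.util.*;

// Minimal sketch of ranking for multiple input terms: intersect the
// related-term sets and multiply the relative frequencies.
public class MultiTermRanking {
    static Map<String, Double> rankCombined(List<Map<String, Double>> relatedSets) {
        Iterator<Map<String, Double>> it = relatedSets.iterator();
        Map<String, Double> combined = new HashMap<>(it.next());
        while (it.hasNext()) {
            Map<String, Double> next = it.next();
            combined.keySet().retainAll(next.keySet());            // set intersection
            combined.replaceAll((term, w) -> w * next.get(term));  // multiply weights
        }
        return combined;
    }

    public static void main(String[] args) {
        // Small excerpts of the related-term sets from the slide's example.
        Map<String, Double> table = Map.of("chair", 0.0441, "contents", 0.0359, "data", 0.0157);
        Map<String, Double> database = Map.of("data", 0.0735, "information", 0.0569, "record", 0.0376);
        // Only "data" survives the intersection, with weight 0.0157 * 0.0735.
        System.out.println(rankCombined(List.of(table, database)));
    }
}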
17. Modeling with Semantic Autocompletion
■ Prototype: Ecore diagram editor with class name suggestions [15]
■ Automated suggestion adaptation with respect to the content of the model
18. Agenda (repeated as a section divider)
19. Evaluation Setup
■ Challenge
□ No gold standard available for many information extraction tasks
■ Our strategy: compare SemNet to existing knowledge bases
□ Provide measurements of how much information of WordNet and ConceptNet is contained in SemNet
■ WordNet V3.0: lexical database for the English language [16]
□ Synsets: grouped terms that share the same sense
□ Relations: mainly taxonomic, part-whole and synonyms
■ ConceptNet V5.1: semantic graph of general human knowledge [17]
□ Nodes: any natural language phrase that expresses a concept
□ Relations: taxonomic, part-whole, related-to and several others
■ SemNet: semantic network of related terms
□ Nodes: noun terminology
□ Relations: probabilistic links
(Figure: three example graphs for "pregnancy". In WordNet, 7 out of 32 relations of the word sense pregnancy are shown, linking it to maternity, morning sickness, physical condition, ectopic pregnancy, entopic pregnancy and parturiency via synonym, part-meronym, hyponym and hypernym edges. In ConceptNet, 7 out of 58 relations of the concept pregnancy are shown, linking it to expect, morning sickness, physical condition, go to bed, ectopic pregnancy, stretch and start family via IsA, PartOf, RelatedTo, Causes and HasSubevent edges. In SemNet, the first 10 out of 4,039 relations of the term pregnancy are shown: mother 0.036, termination 0.031, birth 0.030, woman 0.030, trimester 0.026, stage 0.025, week 0.020, childbirth 0.018, lactation 0.017, month 0.016. A small Venn sketch labels the three sources S (SemNet), W (WordNet) and C (ConceptNet).)
20. Noun Terminology Coverage
■ WordNet
□ Iterate through all noun synsets (72,994 synsets evaluated), e.g. (doctor, doc, physician, MD, Dr., medico), (ear doctor, ear specialist, otologist), (sleep talking, somniloquy, somniloquism)
□ Check whether the nouns are contained in SemNet (98,681 nouns evaluated)
□ Results: 77.16% of WordNet's synsets and 62.17% of WordNet's nouns are contained in SemNet
■ ConceptNet
□ Problem: concepts can be expressed using any natural language phrase, e.g. doctor, go to bed, pregnancy, beautiful
□ First determine the noun terminology
□ Check whether the nouns are contained in SemNet (49,301 concepts evaluated)
□ Result: 82.40% of ConceptNet's nouns are contained in SemNet
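The coverage numbers can be reproduced with a simple membership check. The Java sketch below assumes a synset counts as covered when at least one of its member nouns is a SemNet term; the talk does not state the exact counting rule, so this is an illustration only.

import java.util.*;

// Minimal sketch of the synset coverage measure (counting rule is assumed,
// see the note above).
public class CoverageEval {
    static double synsetCoverage(List<Set<String>> synsets, Set<String> semNetTerms) {
        long covered = synsets.stream()
                .filter(s -> s.stream().anyMatch(semNetTerms::contains))
                .count();
        return (double) covered / synsets.size();
    }

    public static void main(String[] args) {
        Set<String> semNet = Set.of("doctor", "physician", "otologist");
        List<Set<String>> synsets = List.of(
                Set.of("doctor", "doc", "physician", "MD", "medico"),
                Set.of("sleep talking", "somniloquy", "somniloquism"));
        System.out.println(synsetCoverage(synsets, semNet)); // 0.5
    }
}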
21. Relation Coverage
■ WordNet / ConceptNet
□ Iterate through all previously found noun synsets (56,321 synsets used) and concepts (40,625 concepts used)
□ Check whether the relations between synsets are contained in SemNet (61,931 WordNet relations and 256,213 ConceptNet relations evaluated)
□ Example: the synset (doctor, doc, physician, MD, Dr., medico) has the hypernym (medical practitioner, medical man) and hyponyms such as (surgeon) and (allergist)
■ Relation evaluation results
(Figure: table of relation coverage results, shown on the slide)
22. Agenda (repeated as a section divider)
23. Conclusions and Future Work
■ Summary
□ Input: 710 million 5-grams and 20 part-of-speech patterns
□ Hierarchical pattern matching, distributional semantics
□ Output: 2.7M multi-word terms and 37.5M weighted relations
□ Only a window of 5 words can be analyzed to detect relations
□ Applications: domain-specific modeling, keyword expansion, background knowledge for NLP tasks
■ Current and future work
□ Support additional languages
□ Improve ranking functions (pointwise mutual information)
□ Relax the 3-word limitation, derive own n-gram datasets
□ Combine probabilistic information with specific relations
□ Domain clustering in the semantic network
□ Additional modeling support: relations/associations, attributes
24. References
[5] Agt, H.: Supporting Software Language Engineering by Automated Domain Knowledge Acquisition. In: MODELS 2011 Workshops, LNCS 7167, Springer (2012)
[8] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of NAACL 2003, pp. 173–180 (2003)
[9] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)
[11] Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014), 176–182 (2011)
[12] Hearst, M.A.: Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the 14th Conference on Computational Linguistics, COLING 1992, vol. 2 (1992)
[13] Harris, Z.: Distributional Structure. Word 10(2-3), 146–162 (1954)
[15] Agt, H.: SemAcom: A System for Modeling with Semantic Autocompletion. In: Model Driven Engineering Languages and Systems, 15th International Conference (MODELS 2012), Demo Track, Innsbruck, Austria (2012)
[16] Fellbaum, C.: WordNet: An Electronic Lexical Database. The MIT Press, Cambridge (1998)
[17] Speer, R., Havasi, C.: Representing General Relational Knowledge in ConceptNet 5. In: LREC 2012 (2012)
[19] Agt, H., Kutsche, R.D., Wegeler, T.: Guidance for Domain Specific Modeling in Small and Medium Enterprises. In: SPLASH 2011 Workshops, DSM 2011, Portland, OR, USA (2011)
[22] Turney, P.D., Pantel, P.: From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37(1), 141–188 (2010)
Thank You For Your Attention!
MODELS?
Try out SemNet: http://www.bizware.tu‐berlin.de/semnet/
Contact: henning.agt@tu‐berlin.de