In this study, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without hurting performance. To support these claims, we describe the design, construction and limitations of a highly specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents written in Swedish. The corpus contains documents about chronic diseases. The sublanguage used in each document has been labelled as "lay" or "specialized" by a lay annotator. The corpus is designed as a flexible text resource, to which additional medical documents will be appended over time. Experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when the size of the datasets to be learned is increased from 156 to 801 documents. Experiment 2 shows that lay-specialized labels can be learned despite the many disturbing factors, such as machine-translated documents and low-quality texts, that are numerous in the corpus.
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
1. A Web Corpus for eCare
Collection, Lay Annotation and Learning
Marina Santini
(marina.santini@ri.se)
RISE SICS East
Sweden
3 September 2017, LTA'17 - FedCSIS 2017 - Prague
2. RISE & SICS
• RI.SE = Research Institutes of Sweden
• SICS = Swedish Institute of Computer Science
• SICS East Linköping
• Group: Language Technology and Intelligent Interaction Design
• Group Leader: Professor Arne Jönsson (arne.jonsson@ri.se)
Citation: Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.
3. Outline
• The eCare@home project
• The Language Technology contribution
• The eCare Swedish corpus
• Lay-Specialized Annotation
• The experiments: lay-specialized text classification
• Conclusions
4. eCare@home
• Website: http://ecareathome.se/
• Interdisciplinary project
• Funded by Swedish Knowledge Foundation
1. Measure attributes about people and the environment.
2. Infer beyond that which we cannot measure.
3. Automatically configure devices and "things" to achieve a task.
4. Represent information in a human consumable way.
5. Language Technology and eCare@home
Generally speaking:
• The Internet of Things: sensor data = numbers
• Language Technology is used to present this information in a “human consumable way”
6. How to Build a Medical Corpus for eCare?
Kardiologi vs hjärtspecialist (“cardiologist” vs “heart specialist”)
specialized vs lay
7. Starting point: a Web Corpus for eCare
• Creation of a concept-specific medical web corpus useful for eCare (and eHealth) LT applications, such as:
• the automatic extraction of lay synonyms (or paraphrases) of medical terms,
• text simplification
• etc.
8. Web Corpus
• Which textual sources?
• Using medical journals?
• Crawling user-generated texts?
• Relying on a specific web genre like blogs?
• Downloading medical websites?
• Web corpus!
9. Designing a corpus for eCare
1. having a publicly-available medical corpus annotated with lay-specialized labels that can be easily shared;
2. having a corpus with a design and a structure that allow for expansion with additional documents over time;
3. accounting for very specific medical terms.
10. Potential Issues
1. expanding the corpus over time may cause scalability issues
2. the web is noisy, so the corpus will be noisy: noise disturbs LT applications

We made two assumptions:
1. Scalability: increasing the size of the corpus does not necessarily affect the performance of LT applications negatively
2. Noise: noisy texts do not necessarily affect the performance of LT applications negatively
11. In the next slides...
1. Lay vs Specialized sublanguage
2. The construction and annotation of the eCare Swedish web corpus
3. Experiments 1 and 2:
• Experiment 1: robustness to scalability issues (increasing the size of the corpus does not necessarily affect the performance of LT applications negatively)
• Experiment 2: resilience to noise (noisy texts do not necessarily affect the performance of LT applications negatively)
12. Lay vs Specialized Sublanguage
• Definition of sublanguage:
• A language variety used by a specific user group in certain communicative/situational contexts or domains
• Ex: the jargon used by the police, the military, politicians, etc.
• Medical domain: not only jargon but also medical terminology!
• Two user groups in contact:
• patients → lay sublanguage (ex: heart specialist)
• professional staff → specialized sublanguage (ex: cardiologist)
The Guardian, 2014
13. eCare Web Corpus: Construction
• Seed terms from SNOMED CT
• Chronic diseases
• The corpus has been bootstrapped with 228 seed terms, and 801 documents were bootcatted (BootCat, Baroni & Bernardini, 2004); see the sketch after the table below
            Initial seeds   Retrieved seeds   Downloaded documents   Number of words
Unigrams    13              13                112                    91,118
Bigrams     215             142               689                    618,491
Total       228             155               801                    709,609
Example of a long medical term (source: SNOMED CT)
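As a rough illustration of the bootstrapping step, here is a minimal sketch of the BootCat-style idea of combining seed terms into random search tuples. It is not the authors' pipeline: the file name snomed_seeds_sv.txt, the tuple size and the number of tuples are assumptions, and the search-and-download step that BootCat performs is left as a placeholder.

```python
# Minimal sketch of BootCat-style seed-tuple generation (not the authors' code).
# Assumes a plain-text file of SNOMED CT seed terms, one per line.
import random

def load_seeds(path):
    """Read seed terms (unigrams or bigrams), one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def make_tuples(seeds, tuple_size=3, n_tuples=20, seed=0):
    """Randomly combine seed terms into search tuples, BootCat-style."""
    rng = random.Random(seed)
    tuples = set()
    while len(tuples) < n_tuples:
        tuples.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return [" ".join(t) for t in tuples]

if __name__ == "__main__":
    seeds = load_seeds("snomed_seeds_sv.txt")  # hypothetical file name
    for query in make_tuples(seeds):
        # Placeholder: BootCat would send each tuple to a search engine
        # and download the returned pages.
        print(query)
```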
14. Small Data (vs Big Data)
• Small data: data whose size is small enough for human comprehension.
• Use: for many problems and questions, small data is in itself enough.
• Small data (Wikipedia) = a new buzzword
15. eCare Web Corpus: Lay Annotation
• Annotation by a native speaker who participates in the project and who has little knowledge of medical terminology.
• The lay-specialized text classification experiments described later on are based on this lay annotation.
16. Inter-rater agreement: Lay vs Expert (sample)
• Inter-rater agreement
To be taken with a grain of salt, but normally:
• Poor agreement = 0.20 or less
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00
User group bias:
the annotation may be biased by the annotator’s domain expertise.
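For reference, agreement figures like these can be computed with Cohen's kappa. The snippet below is a minimal sketch (not the authors' scripts) using scikit-learn, with made-up labels for illustration.

```python
# Minimal sketch of computing inter-rater agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical lay/specialized labels assigned to the same sample of documents.
lay_annotator    = ["lay", "lay", "specialized", "lay", "specialized", "specialized"]
expert_annotator = ["lay", "specialized", "specialized", "lay", "specialized", "lay"]

kappa = cohen_kappa_score(lay_annotator, expert_annotator)
print(f"Cohen's kappa: {kappa:.2f}")  # interpret against the scale above
```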
17. Noise
• Apparently the web is full of automatically translated documents! Do we need to sort them out? It depends!
• Out of 801 web documents, 339 have received comments by the lay annotator, e.g. "Machine Translated" or "it is about animals and not about humans".
• Is it important to remove this noise for some LT tasks? Maybe not always.
• Computationally, the presence of noise might be irrelevant, so it might not be worth investing resources in cleaning a corpus (a hypothetical filtering sketch follows below).
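Below is a hypothetical sketch of how a "cleaned" subset could be derived from the full noisy corpus, assuming each document record carries the lay annotator's free-text comment (empty when nothing was flagged); the field names are invented for illustration.

```python
# Hypothetical sketch: keep the full noisy corpus and derive a cleaned subset
# by dropping every document the lay annotator flagged with a comment
# (e.g. "Machine Translated"). Field names are assumptions.
def split_noisy_cleaned(records):
    """records: iterable of dicts with 'text', 'label' and 'comment' keys."""
    noisy = list(records)                                 # all documents
    cleaned = [r for r in noisy if not r.get("comment")]  # drop flagged documents
    return noisy, cleaned

# Hypothetical usage: noisy, cleaned = split_noisy_cleaned(corpus_records)
```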
18. Lay/Specialized Text Classification
Based on the annotation made by the lay annotator
• Experiment 1: focus on scalability
• Experiment 2: focus on the impact of noise on performance
19. Features
No text pre-processing
The texts were defined as “string”
Filter converting strings to word vectors
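The slide presumably refers to Weka-style filtering of the raw document strings into word vectors. The sketch below shows an analogous step with scikit-learn; it is an assumption for illustration, not the authors' setup, and the example sentences are made up.

```python
# Analogous sketch of "strings -> word vectors" with scikit-learn
# (the paper uses a Weka filter; this is not the authors' setup).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Kardiologi är läran om hjärtats sjukdomar.",       # made-up specialized text ("cardiology is the study of heart diseases")
    "En hjärtspecialist hjälper dig med ditt hjärta.",  # made-up lay text ("a heart specialist helps you with your heart")
]

vectorizer = CountVectorizer()      # no other pre-processing of the raw strings
X = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(X.shape)                      # (n_documents, vocabulary_size)
```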
21. Experiment 2: Resilience to Noise
• Noisy vs Cleaned
22. New Experiment: Lay vs Specialized Annotation
• SVM (Weka implementation: SMO)

801 documents labelled by the lay annotator:
       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.49   78.6   0.78     0.78     0.78     0.74     0.78      0.29

778 documents labelled by the expert annotator:
       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.54   79.5   0.79     0.79     0.79     0.77     0.79      0.25
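As a hedged sketch of how comparable figures could be reproduced, the snippet below runs an SMO-like linear SVM with 10-fold cross-validation. The paper uses Weka's SMO; here scikit-learn's LinearSVC stands in, and the data-loading variables are assumptions.

```python
# Hedged sketch (not the authors' Weka setup): raw strings -> word vectors ->
# linear SVM, evaluated with 10-fold cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(texts, labels):
    """texts: list of document strings; labels: 'lay'/'specialized' per document."""
    model = make_pipeline(CountVectorizer(), LinearSVC())
    preds = cross_val_predict(model, texts, labels, cv=10)
    print("Accuracy:", accuracy_score(labels, preds))
    print("Kappa:   ", cohen_kappa_score(labels, preds))
    print(classification_report(labels, preds))

# Hypothetical usage, assuming the corpus has been loaded elsewhere:
# evaluate(lay_texts, lay_labels)        # 801 documents, lay annotation
# evaluate(expert_texts, expert_labels)  # 778 documents, expert annotation
```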
23. Conclusions: Findings
• We presented the construction of the eCare corpus
• We made two claims and supported them with two experiments:
1. We can design an extensible/dynamic corpus without fearing scalability issues
2. We can use a noisy corpus to build (certain) LT applications without fearing bad performance
24. Replicability
• We encourage the replication of the results presented in this paper, and welcome improvements and further discussion.
• Corpus, scripts and the output of the classification models are available from the project website: http://ecareathome.se.
25. Thanks for your attention!
Any Questions?