We introduce a framework for automated semantic document annotation that is composed of four processes, namely concept extraction, concept activation, annotation selection, and evaluation. The framework is used to implement and compare different annotation strategies motivated by the literature. For concept extraction, we apply entity detection with semantic hierarchical knowledge bases, tri-grams, RAKE, and LDA. For concept activation, we compare a set of statistical, hierarchy-based, and graph-based methods. For selecting annotations, we compare top-k as well as kNN. In total, we define 43 different strategies, including novel combinations like using graph-based activation with kNN. We have evaluated the strategies using three datasets of varying size from three scientific disciplines (economics, political science, and computer science) that contain 100,000 manually labeled documents in total. We obtain the best results on all three datasets with our novel combination of entity detection, graph-based activation (e.g., HITS and Degree), and kNN. For the economics and political science datasets, the best F-measures are .39 and .28, respectively. For the computer science dataset, a maximum F-measure of .33 is reached. These are by far the largest experiments on scholarly content annotation, where prior datasets typically contain only up to a few hundred documents.
Gregor Große-Bölting, Chifumi Nishioka, and Ansgar Scherp. 2015. A Comparison of Different Strategies for Automated Semantic Document Annotation. In Proceedings of the 8th International Conference on Knowledge Capture (K-CAP 2015). ACM, New York, NY, USA, Article 8, 8 pages. DOI=http://dx.doi.org/10.1145/2815833.2815838
Mining and Managing Large-scale Linked Open Data - Ansgar Scherp
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not provide a fixed, pre-defined schema. Rather, RDF allows for flexibly modeling the data schema by attaching RDF types and properties to the entities. Our schema-level index called SchemEX allows for searching in large-scale RDF graph data. The index can be efficiently computed with reasonable accuracy over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. SchemEX is highly needed as the size of the LOD cloud quickly increases. Due to the evolution of the LOD cloud, one observes frequent changes of the data. We show that the data schema also changes, in terms of the combinations of RDF types and properties. As changes alone cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud, with about 100 million triples per week for more than three years.
Knowledge Discovery in Social Media and Scientific Digital Libraries - Ansgar Scherp
The talk presents selected results of our research in the area of text and data mining in social media and scientific literature. (1) First, we consider the area of classifying microblogging postings like tweets on Twitter. Typically, the classification results are evaluated against a gold standard, which is either the hashtags of the tweets’ authors or manual annotations. We claim that there are fundamental differences between these two kinds of gold standard classifications and conducted an experiment with 163 participants to manually classify tweets from ten topics. Our results show that the human annotators are more likely to classify tweets like other human annotators than like the tweets’ authors (i.e., the hashtags). This may influence the evaluation of classification methods like LDA, and we argue that researchers should reflect on the kind of gold standard used when interpreting their results. (2) Second, we present a framework for semantic document annotation that aims to compare different existing as well as new annotation strategies. For entity detection, we compare semantic taxonomies, trigrams, RAKE, and LDA. For concept activation, we cover a set of statistical, hierarchy-based, and graph-based methods. The strategies are evaluated over 100,000 manually labeled scientific documents from economics, politics, and computer science. (3) Finally, we present a processing pipeline for extracting text of varying size, rotation, color, and emphases from scholarly figures. The pipeline does not need training nor does it make any assumptions about the characteristics of the scholarly figures. We conducted a preliminary evaluation with 121 figures from a broad range of illustration types.
URL: https://www.ukp.tu-darmstadt.de/ukp-home/news-singleview/artikel/guest-speaker-ansgar-scherp/
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics - Ansgar Scherp
We propose a pipeline for text extraction from infographics that makes use of a novel combination of data mining and computer vision techniques. The pipeline defines a sequence of steps to identify characters, cluster them into text lines, determine their rotation angle, and apply state-of-the-art OCR to recognize the text. In this paper, we formally define the pipeline and present its current implementation. In addition, we have conducted preliminary evaluations over a data corpus of 121 manually annotated infographics from a broad range of illustration types such as bar charts, pie charts, line charts, maps, and others. We assess the results of our text extraction pipeline by comparing it with two baselines. Finally, we sketch an outline for future work and possibilities for improving the pipeline. - http://ceur-ws.org/Vol-1458/
Big Data and the Internet of Things (IoT) have the potential
to fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from the IoT has
been recognized as one of the most exciting and key opportunities for
both academia and industry. Advanced analysis of big data streams from
sensors and devices is bound to become a key area of data mining
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an
overview of data stream mining, and I will introduce
some popular open source tools for data stream mining.
Artificial intelligence and data stream mining - Albert Bifet
Big Data and Artificial Intelligence have the potential to
fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from data streams has been recognized
as one of the most exciting and key opportunities for both academia
and industry. Advanced analysis of big data streams from sensors and
devices is bound to become a key area of artificial intelligence
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an overview
of data stream mining, industrial applications, open source tools, and
current challenges of data stream mining.
Mining Big Data Streams with APACHE SAMOA - Albert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, but also on several other distributed stream processing
engines such as Storm and Samza.
Big Data is a new term used in business analytics to identify datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data.
In this talk, we will focus on advanced techniques in Big Data mining in real time using evolving data stream techniques: using a small amount of time and memory resources, and being able to adapt to changes. We will discuss a social network application of data stream mining to compute user influence probabilities. And finally, we will present the MOA software framework with classification, regression, and frequent pattern methods, and the SAMOA distributed streaming software that runs on top of Storm, Samza and S4.
An overview of streaming algorithms: what they are, what the general principles regarding them are, and how they fit into a big data architecture. Also four specific examples of streaming algorithms and use-cases.
Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. Every two days we create the same quantity of data as we created from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis. We discuss the current and future trends of mining evolving data streams, and the challenges that the field will have to overcome during the next years.
Meetup MLDD: Machine Learning Dresden, 8th May 2018
Signals from outer space
How NASA Benefits from Graph-Powered NLP
Vlasta Kus talked about the advantages of graph-based natural language processing (NLP) using a public NASA dataset as an example. From his abstract: "[...] we are building a platform (in large part open-source) that integrates Neo4j and NLP (such as Named Entity Recognition, sentiment analysis, word embeddings, LDA topic extraction), and we test and develop further related features and tools, lately, for example, integrating Neo4j and Tensorflow for employing deep learning techniques (such as deep auto-encoders for automatic text summarisation)."
Vlasta holds a Ph.D. in Physics from the Charles University in Prague and has worked for SecureOps, as a freelance Data Scientist, and since 2017 as a Data Scientist at GraphAware (https://graphaware.com/), a London-based company that builds solutions around Neo4j.
Efficient Online Evaluation of Big Data Stream Classifiers - Albert Bifet
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
While much of the recent literature in spatial statistics has evolved around addressing the big data issue, practical implementations of these methods on high performance computing systems for truly large data are still rare. We discuss our explorations in this area at the National Center for Atmospheric Research for a range of applications, which can benefit from large scale computing infrastructure. These applications include extreme value analysis, approximate spatial methods, spatial localization methods and statistically-based data compression and are implemented in different programming languages. We will focus on timing results and practical considerations, such as speed vs. memory trade-offs, limits of scaling and ease of use.
Presentation for the Softskills Seminar course @ Telecom ParisTech. The topic is the paper by Domingos and Hulten, "Mining High-Speed Data Streams". Presented by me on 30/11/2017.
Dmitry Kan, Principal AI Scientist at Silo AI and host of the Vector Podcast [1], will give an overview of the landscape of vector search databases and their role in NLP, along with the latest news and his view on the future of vector search. Further, he will share how he and his team participated in the Billion-Scale Approximate Nearest Neighbor Challenge and improved recall by 12% over a FAISS baseline.
Presented at https://www.meetup.com/open-nlp-meetup/events/282678520/
YouTube: https://www.youtube.com/watch?v=RM0uuMiqO8s&t=179s
Follow Vector Podcast to stay up to date on this topic: https://www.youtube.com/@VectorPodcast
Interest is growing in the Apache Spark community in using Deep Learning techniques, and in the Deep Learning community in scaling algorithms with Apache Spark. A few notable examples include:
· Databricks' efforts in scaling Deep Learning with Spark
· Intel announcing BigDL: a Deep Learning library for Spark
· Yahoo's recent efforts to open-source TensorFlowOnSpark
In this lecture we will discuss the key use cases and developments that have emerged in the last year in using Deep Learning techniques with Spark.
Max-kernel search: How to search for just about anything?
Nearest neighbor search is a well studied and widely used task in computer science and is quite pervasive in everyday applications. While search is not synonymous with learning, search is a crucial tool for the most nonparametric form of learning. Nearest neighbor search can directly be used for all kinds of learning tasks — classification, regression, density estimation, outlier detection. Search is also the computational bottleneck in various other learning tasks such as clustering and dimensionality reduction. Key to nearest neighbor search is the notion of “near”-ness or similarity. Mercer kernels form a class of general nonlinear similarity functions and are widely used in machine learning. They can define a notion of similarity between pairs of objects of any arbitrary type and have been successfully applied to a wide variety of object types — fixed-length data, images, text, time series, graphs. I will present a technique to do nearest neighbor search with this class of similarity functions provably efficiently, hence facilitating faster learning for larger data.
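As a point of reference for what max-kernel search computes, here is a brute-force sketch in Python with an RBF (Mercer) kernel; the kernel choice, the data, and the function names are illustrative assumptions, and this naive O(n) scan is exactly the cost that the talk's provably efficient technique avoids:

```python
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """A Mercer kernel: similarity decays with squared Euclidean distance."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def max_kernel_search(query: np.ndarray, data: np.ndarray) -> int:
    """Return the index of the point in `data` most similar to `query`.
    Brute force for illustration only; the talk presents an efficient
    alternative to this linear scan."""
    sims = [rbf_kernel(query, x) for x in data]
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
print(max_kernel_search(data[0], data))  # 0: a point is maximally similar to itself
```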
STRIP: Stream Learning of Influence Probabilities - Albert Bifet
Influence-driven diffusion of information is a fundamental process in social networks. Learning the latent variables of such process, i.e., the influence strength along each link, is a central question towards understanding the structure and function of complex networks, modeling information cascades, and developing applications such as viral marketing.
Motivated by modern microblogging platforms, such as Twitter, in this paper we study the problem of learning influence probabilities in a data-stream scenario, in which the network topology is relatively stable and the challenge of a learning algorithm is to keep up with a continuous stream of tweets using a small amount of time and memory. Our contribution is a number of randomized approximation algorithms, categorized according to the available space (superlinear, linear, and sublinear in the number of nodes n) and according to different models (landmark and sliding window). Among several results, we show that we can learn influence probabilities with one pass over the data, using O(n log n) space, in both the landmark model and the sliding-window model, and we further show that our algorithm is within a logarithmic factor of optimal.
For truly large graphs, when one needs to operate with sublinear space, we show that we can still learn influence probabilities in one pass, assuming that we restrict our attention to the most active users.
Our thorough experimental evaluation on large social graphs demonstrates that the empirical performance of our algorithms agrees with that predicted by the theory.
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Kan
Promoting diversity among items in a search result has been shown to increase user satisfaction, compared to relevancy-only ranking. In this talk, we'll present how we went about implementing search result diversification methods across different vertical search engines: starting from zero with no diversification at all, exploring simple heuristic-based methods, and moving onwards to more complex ones based on entropy and determinantal point processes. We'll also discuss evaluation methods and useful tooling around that.
Presented by Dmitry Kan, Principal AI Scientist at Silo AI and Daniel Wärnå, AI Engineer, Silo AI.
YouTube recording:
https://www.youtube.com/watch?v=bri0C28mfl8
Code demoed: https://github.com/DmitryKey/bert-solr-search/tree/master/src/diversify
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Smart photo selection: interpret gaze as personal interest - Ansgar Scherp
Manually selecting subsets of photos from large collections in order to present them to friends or colleagues or to print them as photo books can be a tedious task. Today, fully automatic approaches are at hand for supporting users. They make use of pixel information extracted from the images, analyze contextual information such as capture time and focal aperture, or use both to determine a proper subset of photos. However, these approaches miss the most important factor in the photo selection process: the user. The goal of our approach is to consider individual interests. By recording and analyzing gaze information from users viewing photo collections, we obtain information on their interests and use this information in the creation of personal photo selections. In a controlled experiment with 33 participants, we show that the selections can be significantly improved over a baseline approach by up to 22% when taking individual viewing behavior into account. We also obtained significantly better results for photos taken at an event participants were involved in compared with photos from another event.
About Multimedia Presentation Generation and Multimedia Metadata: From Synthe... - Ansgar Scherp
ACM SIGMM Rising Stars Symposium
The ACM SIGMM Rising Stars Symposium, inaugurated in 2015, will highlight plenary presentations of six selected rising SIGMM members on their vision and research achievements, and dialogs with senior members about the future of multimedia research.
See: http://www.acmmm.org/2016/?page_id=706
A Framework for Iterative Signing of Graph Data on the Web - Ansgar Scherp
Existing algorithms for signing graph data typically do not cover the whole signing process. In addition, they lack distinctive features such as signing graph data at different levels of granularity, iterative signing of graph data, and signing multiple graphs. In this paper, we introduce a novel framework for signing arbitrary graph data provided, e.g., as RDF(S), Named Graphs, or OWL. We conduct an extensive theoretical and empirical analysis of the runtime and space complexity of different framework configurations. The experiments are performed on synthetic and real-world graph data of different size and different number of blank nodes. We investigate security issues, present a trust model, and discuss practical considerations for using our signing framework.
We released a Java-based open source implementation of our software framework for iterative signing of arbitrary graph data provided, e.g., as RDF(S), Named Graphs, or OWL. The software framework is based on a formalization of different graph signing functions and supports different configurations. It is available in source code as well as pre-compiled as .jar-file.
The graph signing framework exhibits the following unique features:
- Signing graphs on different levels of granularity
- Signing multiple graphs at once
- Iterative signing of graph data for provenance tracking
- Independence of the used language for encoding the graph (i.e., the signature does not break when changing the graph representation)
The documentation of the software framework and its source code is available from: http://icp.it-risk.iwvi.uni-koblenz.de/wiki/Software_Framework_for_Signing_Graph_Data
Events in Multimedia - Theory, Model, Application - Ansgar Scherp
Talk by Ansgar Scherp.
Title: Events in Multimedia - Theory, Model, Application
Event: Workshop on Event-based Media Integration and Processing, ACM Multimedia, 2013
This is an introduction to an algorithm and methodology to extract semantics from one or several documents using Natural Language Processing and Machine Learning techniques. The presentation describes the different components of the semantic analyzer using Wikipedia and DBpedia as data sets.
Concept-Based Information Retrieval using Explicit Semantic Analysis - Ofer Egozi
My master's thesis seminar at the Technion, summarizing my research work which was partly published in an AAAI-08 paper and now submitted to TOIS. Download and read notes for more details. Comments/questions are very welcome!
Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLLib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD).
Bio:
Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.
Machine Learning in Pathology Diagnostics with Simagis Live - khvatkov
The Simagis Live digital pathology platform employs the latest generation of visual recognition technology with deep learning, bringing game-changing applications to pathology cancer diagnostics.
PhD defense presentation of Dominik Kowald: Modeling Activation Processes in Human Memory to Improve Tag Recommendations. Presented at Know-Center / Graz University of Technology (Austria)
Profiling vs. Time vs. Content: What does Matter for Top-k Publication Recomm... - MOVING Project
So far, it is unclear how different factors of a scientific publication recommender system based on users' tweets influence the recommendation performance. We examine three different factors, namely the profiling method, temporal decay, and richness of content. Regarding profiling, we compare CF-IDF, which replaces terms in TF-IDF by semantic concepts, HCF-IDF, a novel hierarchical variant of CF-IDF, and topic modeling. As temporal decay functions, we apply a sliding window and exponential decay. In terms of richness of content, we compare recommendations using both full-texts and titles of publications versus using only titles. Overall, the three factors make twelve recommendation strategies. We have conducted an online experiment with 123 participants and compared the strategies in a within-group design. The best recommendations are achieved by the strategy combining CF-IDF, a sliding window, and full-texts. However, the strategies using the novel HCF-IDF profiling method achieve similar results using just the titles of the publications. Therefore, HCF-IDF can make recommendations when only short and sparse data is available. http://arxiv.org/abs/1603.07016
Large language models in higher educationPeter Trkman
Discussing the possibilities of large language models for the automatic generation of academic content by the students (e.g. master thesis), and the related need for changes in the way in which to educate and evaluate students.
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber - JISC KeepIt project
This presentation, part of an extensive practical tutorial on logical and bit-stream preservation using Plato (a preservation planning tool) and EPrints (software for creating digital repositories), reviews preservation planning workflow, shows how to identify requirements using a mindmap approach and then how to upload the output to Plato to run experiments and produce results. The presentation was given as part of module 4 of a 5-module course on digital preservation tools for repository managers, presented by the JISC KeepIt project. For more on this and other presentations in this course look for the tag ’KeepIt course’ in the project blog http://blogs.ecs.soton.ac.uk/keepit/
This presentation provides an overview of the Systematic Inquiry Cycle and Logic Modeling as tools for designing and developing a research study or project/program initiative.
This presentation has slides from a talk that I gave at the annual Experimental Biology meeting, 2015, on our curriculum for Big Data Analytics in the Inland Empire.
Analysis of GraphSum's Attention Weights to Improve the Explainability of Mul... - Ansgar Scherp
Slides of our presentation @iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence, Linz, Austria, 29 November 2021 - 1 December 2021. ACM 2021, ISBN 978-1-4503-9556-4
STEREO: A Pipeline for Extracting Experiment Statistics, Conditions, and Topi... - Ansgar Scherp
Presentation for our paper @iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence, Linz, Austria, 29 November 2021 - 1 December 2021. ACM 2021, ISBN 978-1-4503-9556-4
Text Localization in Scientific Figures using Fully Convolutional Neural Networks - Ansgar Scherp
Text extraction from scientific figures has been addressed in the past by different unsupervised approaches due to the limited amount of training data. Motivated by the recent advances in Deep Learning, we propose a two-step neural-network-based pipeline to localize and extract text using Fully Convolutional Networks. We improve the localization of the text bounding boxes by applying a novel combination of a Residual Network with the Region Proposal Network based on Faster R-CNN. The predicted bounding boxes are further pre-processed and used as input to the off-the-shelf optical character recognition engine Tesseract 4.0. We evaluate our improved text localization method on five different datasets of scientific figures and compare it with the best unsupervised pipeline. Since only limited training data is available, we further experiment with different data augmentation techniques for increasing the size of the training datasets and demonstrate their positive impact. We use Average Precision and F1 measure to assess the text localization results. In addition, we apply Gestalt Pattern Matching and Levenshtein Distance for evaluating the quality of the recognized text. Our extensive experiments show that our new pipeline based on neural networks outperforms the best unsupervised approach by a large margin of 19-20%.
A Comparison of Approaches for Automated Text Extraction from Scholarly Figures - Ansgar Scherp
So far, there has not been a comparative evaluation of different approaches for text extraction from scholarly figures. In order to fill this gap, we have defined a generic pipeline for text extraction that abstracts from the existing approaches as documented in the literature. In this paper, we use this generic pipeline to systematically evaluate and compare 32 configurations for text extraction over four datasets of scholarly figures of different origin and characteristics. In total, our experiments have been run over more than 400 manually labeled figures. The experimental results show that the approach BS-4OS results in the best F-measure of 0.67 for the Text Location Detection and the best average Levenshtein Distance of 4.71 between the recognized text and the gold standard on all four datasets using the Ocropy OCR engine.
Can you see it? Annotating Image Regions based on Users' Gaze Information - Ansgar Scherp
Presentation on eyetracking-based annotation of image regions that I gave at Vienna on Oct 19, 2012. Download original PowerPoint file to enjoy all animations. For the papers, please refer to: http://www.ansgarscherp.net/publications
Multi-cluster Kubernetes Networking - Patterns, Projects and Guidelines - Sanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with a focus on 4 key topics:
1) Key patterns for multi-cluster architectures
2) Architectural comparison of several OSS/CNCF projects that address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/deploying these solutions
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024 - APNIC
Ellisha Heppner, Grant Management Lead, presented an update on APNIC Foundation to the PNG DNS Forum held from 6 to 10 May, 2024 in Port Moresby, Papua New Guinea.
# Internet Security: Safeguarding Your Digital World
In the contemporary digital age, the internet is a cornerstone of our daily lives. It connects us to vast amounts of information, provides platforms for communication, enables commerce, and offers endless entertainment. However, with these conveniences come significant security challenges. Internet security is essential to protect our digital identities, sensitive data, and overall online experience. This comprehensive guide explores the multifaceted world of internet security, providing insights into its importance, common threats, and effective strategies to safeguard your digital world.
## Understanding Internet Security
Internet security encompasses the measures and protocols used to protect information, devices, and networks from unauthorized access, attacks, and damage. It involves a wide range of practices designed to safeguard data confidentiality, integrity, and availability. Effective internet security is crucial for individuals, businesses, and governments alike, as cyber threats continue to evolve in complexity and scale.
### Key Components of Internet Security
1. **Confidentiality**: Ensuring that information is accessible only to those authorized to access it.
2. **Integrity**: Protecting information from being altered or tampered with by unauthorized parties.
3. **Availability**: Ensuring that authorized users have reliable access to information and resources when needed.
## Common Internet Security Threats
Cyber threats are numerous and constantly evolving. Understanding these threats is the first step in protecting against them. Some of the most common internet security threats include:
### Malware
Malware, or malicious software, is designed to harm, exploit, or otherwise compromise a device, network, or service. Common types of malware include:
- **Viruses**: Programs that attach themselves to legitimate software and replicate, spreading to other programs and files.
- **Worms**: Standalone malware that replicates itself to spread to other computers.
- **Trojan Horses**: Malicious software disguised as legitimate software.
- **Ransomware**: Malware that encrypts a user's files and demands a ransom for the decryption key.
- **Spyware**: Software that secretly monitors and collects user information.
### Phishing
Phishing is a social engineering attack that aims to steal sensitive information such as usernames, passwords, and credit card details. Attackers often masquerade as trusted entities in email or other communication channels, tricking victims into providing their information.
### Man-in-the-Middle (MitM) Attacks
MitM attacks occur when an attacker intercepts and potentially alters communication between two parties without their knowledge. This can lead to the unauthorized acquisition of sensitive information.
### Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) Attacks
1. Wireless Communication System_Wireless communication is a broad term that i... - JeyaPerumal1
Wireless communication involves the transmission of information over a distance without the help of wires, cables or any other forms of electrical conductors.
Wireless communication is a broad term that incorporates all procedures and forms of connecting and communicating between two or more devices using a wireless signal through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements with its effective features.
The transmission distance can be anywhere between a few meters (for example, a television's remote control) and thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
A Comparison of Different Strategies for Automated Semantic Document Annotation
1. A Comparison of Different Strategies for Automated Semantic Document Annotation
Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp
Chifumi Nishioka, chni@informatik.uni-kiel.de, K-CAP 2015
2. Motivation [1/2]
• Document annotation
– Helps users and search engines find documents
– Requires a huge amount of human effort
– e.g., subject indexers at ZBW labeled 1.6 million scientific documents in economics
• Semantic document annotation
– Documents are annotated with semantic entities
– e.g., PubMed and MeSH, ACM DL and ACM CCS
Focus on semantic document annotation
Necessity of automated document annotation
3. Motivation [2/2]
• Small-scale experiments so far
– Comparing a small number of strategies
– Datasets containing a few hundred documents
• Comparison of 43 strategies for document annotation within the developed experiment framework
– The largest number of strategies to date
• Experiments with three datasets from different domains
– Contain the full-texts of 100,000 documents annotated by subject indexers
– The largest dataset of scientific publications
We conducted the largest-scale experiment so far
4. Experiment Framework
Strategies are composed of methods for concept extraction, concept activation, and annotation selection (a minimal sketch of the pipeline follows below).
1. Concept Extraction: detect concepts (candidate annotations) in each document
2. Concept Activation: compute a score for each concept of a document
3. Annotation Selection: select annotations from the concepts of each document
4. Evaluation: measure the performance of the strategies against the ground truth
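A minimal Python sketch of how such a four-stage pipeline could be wired together; the type aliases, function names, and the F-measure helper are illustrative assumptions, not the authors' implementation:

```python
from typing import Callable, Dict, List, Set

# Hypothetical type aliases for the framework's stages.
Extractor = Callable[[str], List[str]]                # document text -> candidate concepts
Activator = Callable[[List[str]], Dict[str, float]]   # candidates -> concept scores
Selector = Callable[[Dict[str, float]], Set[str]]     # scores -> selected annotations

def annotate(doc: str, extract: Extractor, activate: Activator,
             select: Selector) -> Set[str]:
    """Run one strategy: concept extraction -> activation -> selection."""
    candidates = extract(doc)
    scores = activate(candidates)
    return select(scores)

def f_measure(predicted: Set[str], gold: Set[str]) -> float:
    """Stage 4: evaluate the selected annotations against the ground truth."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```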
5. Research Questions
• Research questions addressed with the experiment framework:
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs best?
6. 6Chifumi Nishioka chni@informatik.uni-kiel.de, K-CAP 2015
Concept Extraction [1/2]
• Entity
– Extract entities from documents using a domain-specific
knowledge base
– Domain-specific knowledge base
• Entities (subjects) in a specific domain (e.g., medicine)
• One or more labels for each entity
• Relationships between entities
– Detect entities by string matching with entity labels
• Tri-gram
– Extract contiguous sequences of one, two, and three words in a document
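To make the Tri-gram extractor concrete, here is a minimal sketch assuming simple whitespace tokenization; the exact preprocessing used in the experiments (e.g., stemming or stop-word handling) is not shown on the slide:

```python
def extract_ngrams(text, max_n=3):
    """Extract contiguous sequences of one to max_n words as candidate concepts."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates

# extract_ngrams("the financial crisis") ->
# ['the', 'financial', 'crisis', 'the financial', 'financial crisis',
#  'the financial crisis']
```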
Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword
Extraction) [Rose et al. 10]
– Unsupervised method for extracting keywords
– Incorporates co-occurrence and frequency of words
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
– Unsupervised topic modeling method for inferring latent
topics in a document corpus
– Topic model
• Topic: A probability distribution over words
• Document: A probability distribution over topics
– Treat a topic as a concept
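As an illustration of the LDA-based extraction, the following sketch uses gensim; the tiny corpus, the number of topics, and the preprocessing are illustrative assumptions, not the authors' exact setup:

```python
from gensim import corpora, models

# Illustrative corpus; in the experiments these would be full-texts of publications.
raw_documents = ["the financial crisis hit the central bank",
                 "topic models infer latent topics in a corpus"]
texts = [doc.lower().split() for doc in raw_documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Each latent topic is treated as a concept.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# The topic probabilities of a document serve as freq(c, d) for Concept Activation.
doc_topics = lda.get_document_topics(corpus[0])  # [(topic_id, probability), ...]
```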
Concept Activation [1/6]
• Three types of concept activation
methods
– Statistical Methods
• Baseline
• Use only directly mentioned concepts
– Hierarchy-based Methods
• Reveal concepts that are not mentioned explicitly using a
hierarchical knowledge base
– Graph-based Methods
• Use only directly mentioned concepts
• Represent concept co-occurrences as a graph
[Figure: the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" rendered as a co-occurrence graph over the nodes Tax, Bank, Interest Rate, Financial Crisis, and Central Bank]
Concept Activation [2/6]
• Statistical Methods
– Frequency
• $score_{freq}(c, d) = freq(c, d)$
• $freq(c, d)$ depends on the Concept Extraction method:
– The number of appearances (Entity and Tri-gram)
– The score output by RAKE (RAKE)
– The probability of a topic for a document $d$ (LDA)
– CF-IDF [Goossen et al. 11]
• An extension of TF-IDF replacing words with concepts
• Lower scores for concepts that appear in many documents
• $score_{cfidf}(c, d) = cf(c, d) \cdot \log \frac{|D|}{|\{d \in D : c \in d\}|}$
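A compact sketch of the CF-IDF computation; it assumes each document is given as a list of extracted concepts and that the target document is part of the corpus (so every document frequency is at least 1):

```python
import math
from collections import Counter

def cfidf_scores(doc_concepts, all_docs_concepts):
    """score_cfidf(c, d) = cf(c, d) * log(|D| / |{d in D : c in d}|)."""
    df = Counter()
    for concepts in all_docs_concepts:   # document frequency of each concept
        df.update(set(concepts))
    n_docs = len(all_docs_concepts)
    cf = Counter(doc_concepts)           # concept frequency in this document
    # assumes doc_concepts is one of all_docs_concepts, so df[c] >= 1
    return {c: cf[c] * math.log(n_docs / df[c]) for c in cf}
```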
Concept Activation [3/6]
• Hierarchy-based Methods [1/2]
– Base Activation
• $Cl(c)$: the set of child concepts of a concept $c$
• $\lambda$: decay parameter, set to $\lambda = 0.4$
• $score_{base}(c, d) = freq(c, d) + \lambda \cdot \sum_{c_i \in Cl(c)} score_{base}(c_i, d)$
[Figure: excerpt of a concept hierarchy with "World Wide Web", sub-concepts such as "Web Searching" and "Web Mining", and below them "Social Recommendation", "Social Tagging", "Site Wrapping", and "Web Log Analysis". For a leaf concept $c_1$ mentioned once in $d$, with parent $c_2$ and grandparent $c_3$: $score_{base}(c_1, d) = 1.00$, $score_{base}(c_2, d) = 0.40$, $score_{base}(c_3, d) = 0.16$]
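A recursive sketch of Base Activation that reproduces the example scores, assuming freq maps concepts to their frequencies in d and children encodes the hierarchy:

```python
def score_base(c, freq, children, lam=0.4, _cache=None):
    """Base Activation: a concept receives lambda times the scores of its children."""
    if _cache is None:
        _cache = {}
    if c not in _cache:
        _cache[c] = freq.get(c, 0.0) + lam * sum(
            score_base(ci, freq, children, lam, _cache) for ci in children.get(c, ())
        )
    return _cache[c]

# Example matching the slide: c1 is mentioned once; c2 is its parent, c3 its grandparent.
freq = {"c1": 1.0}
children = {"c3": ["c2"], "c2": ["c1"]}
print(round(score_base("c1", freq, children), 2))  # 1.0
print(round(score_base("c2", freq, children), 2))  # 0.4
print(round(score_base("c3", freq, children), 2))  # 0.16
```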
Concept Activation [4/6]
• Hierarchy-based Methods [2/2]
– Branch Activation
• $BN$: reciprocal of the number of concepts located one level above a concept $c$
• $score_{branch}(c, d) = freq(c, d) + \lambda \cdot BN \cdot \sum_{c_i \in Cl(c)} score_{branch}(c_i, d)$
– OneHop Activation
• $C_d$: set of concepts in a document $d$
• Activates concepts within a maximum distance of one hop
• $score_{onehop}(c, d) = \begin{cases} freq(c, d) & \text{if } |Cl(c) \cap C_d| \geq 2 \\ freq(c, d) + \lambda \cdot \sum_{c_i \in Cl(c)} freq(c_i, d) & \text{otherwise} \end{cases}$
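A direct transcription of the OneHop case distinction, reusing the freq and children structures from the previous sketch (doc_concepts is the set $C_d$):

```python
def score_onehop(c, freq, children, doc_concepts, lam=0.4):
    """OneHop Activation: children's frequencies are added once (no recursion),
    so activation never propagates more than one hop up the hierarchy."""
    if len(set(children.get(c, ())) & doc_concepts) >= 2:
        return freq.get(c, 0.0)
    return freq.get(c, 0.0) + lam * sum(freq.get(ci, 0.0) for ci in children.get(c, ()))
```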
Concept Activation [5/6]
• Graph-based Methods [1/2]
– Degree [Zouaq et al. 12]
• $degree(c, d)$: the number of edges linked with a concept $c$
• $score_{degree}(c, d) = degree(c, d)$
• e.g., $score_{degree}(\text{Bank}, d) = 3$ in the co-occurrence graph above
– HITS [Kleinberg 99; Zouaq et al. 12]
• Link analysis algorithm for search engines [Kleinberg 99]
• $C_n(c)$: set of concepts adjacent to $c$ in the graph
• $hub(c, d) = \sum_{c_i \in C_n(c)} auth(c_i, d)$
• $auth(c, d) = \sum_{c_i \in C_n(c)} hub(c_i, d)$
• $score_{hits}(c, d) = hub(c, d) + auth(c, d)$
[Figure: the co-occurrence graph over Tax, Bank, Interest Rate, Financial Crisis, and Central Bank, as before]
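A sketch of Degree and HITS on the example graph using networkx, assuming (as the figures suggest) that consecutively mentioned concepts are linked:

```python
import networkx as nx

G = nx.Graph()
sequence = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
            "Central Bank", "Tax", "Interest Rate"]
G.add_edges_from(zip(sequence, sequence[1:]))  # link consecutive concepts

score_degree = dict(G.degree())      # e.g., score_degree["Bank"] == 3
hubs, auths = nx.hits(G)             # HITS on the (undirected) concept graph
score_hits = {c: hubs[c] + auths[c] for c in G}
```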
Concept Activation [6/6]
• Graph-based Methods [2/2]
– PageRank [Page et al. 99; Mihalcea & Tarau 04]
• Link analysis algorithm for search engines
• Based on the intuition that a node that is linked from many
important nodes is more important
• $C_{in}(c)$: set of concepts with edges incoming to $c$
• $C_{out}(c)$: set of concepts reached by outgoing edges from $c$
• $\mu$: damping factor, $\mu = 0.85$
• $score_{page}(c, d) = (1 - \mu) + \mu \cdot \sum_{c_i \in C_{in}(c)} \frac{score_{page}(c_i, d)}{|C_{out}(c_i)|}$
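A power-iteration sketch of the unnormalized PageRank variant above; on the undirected co-occurrence graph, $C_{in}(c)$ and $C_{out}(c)$ both reduce to the neighbors of $c$ (G is the networkx graph from the previous sketch):

```python
def pagerank_scores(G, mu=0.85, iterations=50):
    """Jacobi-style power iteration for score_page as defined above."""
    scores = {c: 1.0 for c in G}
    for _ in range(iterations):
        # rebuild the dict so each pass reads the previous iteration's scores
        scores = {c: (1 - mu) + mu * sum(scores[ci] / G.degree(ci)
                                         for ci in G.neighbors(c))
                  for c in G}
    return scores

score_page = pagerank_scores(G)  # reuses the co-occurrence graph from above
```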
Annotation Selection
• Top-5 and Top-10
– Select concepts whose scores are ranked in top-k
• k Nearest Neighbor (kNN) [Huang et al. 11]
– Based on the assumption that documents with similar
concepts share similar annotations
1. Compute similarity scores between a target document and all annotated documents
2. Select the union of the annotations of the k nearest documents
[Figure: kNN example with k = 2. The target document is compared to four annotated documents with cosine similarities 0.60, 0.49, 0.45, and 0.42; their annotations are drawn from the concepts Central bank, Law, Financial crisis, Finance, China, Human resource, Leadership, Marketing, and Competition law. The two nearest documents (0.60 and 0.49) are annotated with Marketing, Competition law, Finance, and China, so these four annotations are selected.]
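Sketches of both selection methods, assuming each document is a sparse dict mapping concepts to activation scores and that labeled documents come with annotation sets:

```python
import heapq
from math import sqrt

def top_k(scores, k=5):
    """Top-k selection: the k concepts with the highest activation scores."""
    return heapq.nlargest(k, scores, key=scores.get)

def cosine(u, v):
    """Cosine similarity of two sparse concept-score vectors (dicts)."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_annotations(target, labeled_docs, k=2):
    """kNN selection: union of the annotations of the k most similar documents.

    labeled_docs: list of (concept_vector, annotation_set) pairs."""
    nearest = heapq.nlargest(k, labeled_docs, key=lambda d: cosine(target, d[0]))
    return set().union(*(ann for _, ann in nearest))
```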
Datasets and Metrics of Experiments
                                 Economics      Political Science     Computer Science
publication source               ZBW            FIV                   SemEval 2010
# of publications                62,924         28,324                244
avg. # of annotations per doc.   5.26 (±1.84)   12.00 (±4.02)         5.05 (±2.41)
knowledge base                   STW            European Thesaurus    ACM CCS
# of entities                    6,335          7,912                 2,299
# of labels                      11,679         8,421                 9,086
• Computer Science: SemEval 2010 dataset [Kim et al. 10]
– Publications are originally annotated with keywords
– We converted the keywords to entities by string matching
• All publications and labels of entities are in English
• We use full-texts of publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure
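A per-document implementation of the three metrics over annotation sets; how the scores are aggregated across the corpus (macro vs. micro averaging) is not stated on the slide:

```python
def prf(predicted, gold):
    """Set-based precision, recall, and F-measure for a single document."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# prf({"Finance", "China", "Tax"}, {"Finance", "China"}) -> (~0.667, 1.0, 0.8)
```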
(I) Best Performing Strategies
• Economics and Political Science datasets
– The best strategy: Entity × HITS × kNN
– F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
– The best strategy: Entity × Degree × kNN
– F-measure: 0.33 (computer science)
• The graph-based methods do not differ much from each other
In general, a document annotation strategy
Entity × Graph-based method × kNN performs best
(II) Influence of Concept Extraction
• Concept Extraction method: Entity
– Use domain-specific knowledge bases
– Knowledge bases: freely available and of high quality
– 32 thesauri listed in W3C SKOS Datasets
For Concept Extraction methods, Entity consistently
outperforms Tri-gram, RAKE, and LDA
(III) Influence of Concept Activation
• Poor performance of hierarchy-based methods
– We use full-texts in the experiments
• Full-texts already contain so many different concepts (203.80 unique entities on average, SD: 24.50) that further concepts do not need to be activated
– However, OneHop can work as well as graph-based methods
• It activates only concepts within one hop distance
For Concept Activation methods,
graph-based methods are better than statistical
methods or hierarchy-based methods
(IV) Influence of Annotation Selection
• kNN
– No learning process
– Confirms the assumption that documents with similar
concepts share similar annotations
For Annotation Selection methods, kNN can enhance
the performance
Conclusion
• Large scale experiment for automated semantic
document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
– Novel combination of methods
• Best concept extraction method: Entity
• Best concept activation method: Graph-based
methods
– OneHop achieves similar performance at lower computational cost
Entity Extraction and Conversion
• Entity extraction
– String matching with entity labels
– Starting with the longest entity labels
• e.g., from the text “financial crisis is …”, only the entity “financial crisis” is detected (not “crisis”)
• Converting to entities
– Tri-gram and RAKE extract words and keywords
– These are converted to entities by string matching with entity labels before annotation selection
– If no matching entity label is found, the word or keyword is discarded
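A sketch of longest-match-first entity detection as described above; the label-to-entity mapping and the entity IDs are hypothetical, and the real preprocessing (case folding, stemming) may differ:

```python
def detect_entities(text, label_to_entity):
    """Greedy longest-match-first string matching of entity labels, so that
    'financial crisis' is preferred over 'crisis' (see the example above)."""
    tokens = text.lower().split()
    found, i = [], 0
    max_len = max((len(l.split()) for l in label_to_entity), default=1)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest label first
            span = " ".join(tokens[i:i + n])
            if span in label_to_entity:
                found.append(label_to_entity[span])
                i += n
                break
        else:
            i += 1
    return found

# Hypothetical STW-style IDs, for illustration only:
# detect_entities("financial crisis is looming",
#                 {"financial crisis": "stw:10230", "crisis": "stw:12345"})
# -> ["stw:10230"]
```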
kNN [1/2]
• Similarity measure
– Each document is represented as a vector where each
element is a score of a concept
– Cosine similarity is used as a similarity measure
[Figure: two documents represented as concept vectors over (GDP, Immigration, Population, Bank, Interest rate, Canada): $d_1$ = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5) and $d_2$ = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2). Their cosine similarity is $sim(d_1, d_2) = 0.52$.]
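The example similarity can be reproduced directly (numpy used here for brevity):

```python
import numpy as np

d1 = np.array([0.3, 0.5, 0.8, 0.1, 0.0, 0.5])
d2 = np.array([0.6, 0.0, 0.4, 0.8, 0.4, 0.2])

sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(sim, 2))  # 0.52, as in the example above
```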
kNN [2/2]
• k = 1: only the most similar document (0.60) is selected; its annotations Marketing and Competition law are assigned to the target document
• k = 2: the two most similar documents (0.60 and 0.49) are selected; the union of their annotations, Marketing, Competition law, Finance, and China, is assigned
[Figure: the same example as before, with neighbor similarities 0.60, 0.49, 0.45, and 0.42]
Result Table: RAKE
Economics (Frequency)
          Recall      Precision   F
top-5     .08 (.14)   .08 (.12)   .08 (.12)
top-10    .15 (.18)   .07 (.08)   .10 (.11)
kNN       .34 (.33)   .34 (.33)   .33 (.32)

Political Science (Frequency)
          Recall      Precision   F
top-5     .04 (.07)   .08 (.13)   .05 (.08)
top-10    .07 (.09)   .08 (.09)   .07 (.08)
kNN       .31 (.23)   .18 (.15)   .22 (.17)

Computer Science (Frequency)
          Recall      Precision   F
top-5     .24 (.24)   .17 (.16)   .19 (.17)
top-10    .42 (.28)   .15 (.10)   .22 (.14)
kNN       .42 (.27)   .20 (.13)   .25 (.15)
Result Table: LDA
Economics (Frequency, kNN): Recall .19 (.30), Precision .19 (.30), F .19 (.30)
Political Science (Frequency, kNN): Recall .15 (.19), Precision .15 (.18), F .14 (.17)
Computer Science (Frequency, kNN): Recall .28 (.27), Precision .24 (.23), F .24 (.22)
Materials
• Code
– https://github.com/ggb/ShortStories
• Datasets
– Economics and political science: not publicly available yet; contact us directly if you are interested
– Computer science: publicly available
Presentation
• K-CAP 2015
– International Conference on Knowledge Capture
– Scope
• Knowledge Acquisition / Capture
• Knowledge Extraction from Text
• Semantic Web
• Knowledge Engineering and Modelling
• …
• Time slot
– Presentation: 25 minutes
– Q & A: 5 minutes
References
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation, JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender, WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Große-Bölting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases, ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles, JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base, ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles, International Workshop on Semantic Evaluation, 2010.
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999.
• [Mihalcea & Tarau 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts, EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, TR of Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents, Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gašević, and M. Hatala. Voting theory for concept detection, ESWC, 2012.