Within content-based recommendation systems, Linked Data has already been proposed as a valuable source of information to enhance the predictive power of recommender systems, not only in terms of accuracy but also in terms of diversity and novelty of results. In this direction, one of the main open issues in using Linked Data to feed a recommendation engine is feature selection: how to select only the most relevant subset of the original Linked Data, thus avoiding both useless processing of data and the so-called "curse of dimensionality" problem. In this paper, we show how ABSTAT, an ontology-based (linked) data summarization framework, can drive the selection of properties/features useful to a recommender system. In particular, we compare a fully automated feature selection method based on ontology-based data summaries with more classical ones, and we evaluate the performance of these methods in terms of accuracy and aggregate diversity of a recommender system exploiting the top-k selected features. We set up an experimental testbed relying on datasets related to different knowledge domains. Results show the feasibility of a feature selection process driven by ontology-based data summaries for Linked Data-enabled recommender systems.
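To make the idea of top-k feature selection concrete, the minimal sketch below ranks candidate Linked Data properties by a relevance score and keeps only the k best before feeding them to a recommender. The matrix, the plain frequency score, and all names are illustrative assumptions; this is not ABSTAT's or the paper's actual ranking method.

```python
import numpy as np

# Toy item x property matrix for a LOD-enabled recommender: rows are items,
# columns are candidate properties extracted from Linked Data (hypothetical data).
rng = np.random.default_rng(42)
X = (rng.random((500, 40)) < 0.15).astype(int)
k = 10

# Score each property; plain frequency stands in here for the summary-based
# or information-gain scores compared in the paper.
scores = X.sum(axis=0)
top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring properties
X_reduced = X[:, top_k]                # reduced feature space fed to the recommender
print(sorted(top_k.tolist()))
```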
This is joint work between the SisInf Lab at the Polytechnic University of Bari and the INSID&S Lab at the University of Milano-Bicocca, presented in the ESWC 2018 Research Track.
The paper can be found at https://link.springer.com/chapter/10.1007%2F978-3-319-93417-4_9
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map... (Victor Giannakouris)
This document proposes CSMR, a scalable algorithm for text clustering that uses cosine similarity and MapReduce. CSMR performs pairwise text similarity by representing text documents as vectors in a vector space model and measuring similarity in parallel using MapReduce. It is a 4-phase algorithm that includes word counting, text vectorization using term frequencies, applying TF-IDF to document vectors, and measuring cosine similarity. The algorithm is designed to cluster large text corpora in a scalable manner on distributed systems like Hadoop. Future work includes implementing and testing CSMR on real data and publishing results.
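As a rough single-machine illustration of the same pipeline (term counting, TF-IDF weighting, pairwise cosine similarity), the snippet below uses scikit-learn on made-up documents; CSMR's contribution is distributing these steps as MapReduce phases, which this toy example does not do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy documents; CSMR would run these same steps as MapReduce jobs.
docs = [
    "mapreduce scales text clustering",
    "cosine similarity compares document vectors",
    "text clustering with cosine similarity",
]
tfidf = TfidfVectorizer().fit_transform(docs)   # documents as TF-IDF vectors
print(cosine_similarity(tfidf).round(2))        # pairwise similarity matrix
```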
Recommender Systems with Apache Spark's ALS Function (Will Johnson)
A quick visual guide to recommender systems (user-based, item-based, and matrix factorization) and the code behind building an Apache Spark MatrixFactorizationModel with the ALS function.
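The toy NumPy sketch below illustrates what alternating least squares (ALS) computes: it alternately solves small ridge-regression problems for the user and item factor matrices so that their product approximates the observed ratings. It is an illustration of the idea only, with made-up data; it is not Spark code and does not use the MatrixFactorizationModel API.

```python
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)   # toy user x item ratings, 0 = unrated
mask = R > 0
k, lam, n_iter = 2, 0.1, 20                 # latent factors, regularization, sweeps
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(n_iter):
    # Fix V, solve a ridge regression for each user's factor vector
    for u in range(R.shape[0]):
        idx = mask[u]
        A = V[idx].T @ V[idx] + lam * np.eye(k)
        U[u] = np.linalg.solve(A, V[idx].T @ R[u, idx])
    # Fix U, solve for each item's factor vector
    for i in range(R.shape[1]):
        idx = mask[:, i]
        A = U[idx].T @ U[idx] + lam * np.eye(k)
        V[i] = np.linalg.solve(A, U[idx].T @ R[idx, i])

print(np.round(U @ V.T, 2))   # predicted ratings, including the unrated cells
```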
This document proposes a peer-to-peer approach for personalized top-k processing in collaborative tagging systems. It describes a two-layer gossip protocol to discover each user's personal network and distributed inverted lists to process queries locally. An evaluation on real tagging data shows the approach converges quickly, provides high recall with limited storage per user and processing time that increases slowly with the network size. The peer-to-peer solution enables scalable personalized search compared to centralized and cluster-based alternatives.
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data have been published in freely accessible datasets connected with each other to form the so-called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is certainly an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data into a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
This document discusses research into automatically discovering strong relationships between entities in Linked Data using genetic programming. The researchers aim to learn a cost function that can guide uninformed searches over Linked Data to find the most promising relationship paths. They experiment with different topological and semantic features as inputs to genetic programming to learn cost functions. The best-performing cost functions incorporate features like namespace variety, conditional node degree, and topics. This suggests specific, well-described paths through entities of different types are indicators of strong relationships in Linked Data.
Maximizing the Diversity of Exposure in a Social Network (Cigdem Aslay)
Social-media platforms have created new ways for citizens to stay informed and participate in public debates. However, to enable a healthy environment for information sharing, social deliberation, and opinion formation, citizens need to be exposed to sufficiently diverse viewpoints that challenge their assumptions, instead of being trapped inside filter bubbles.
In this paper, we take a step in this direction and propose a novel approach to maximize the diversity of exposure in a social network. We formulate the problem in the context of information propagation, as a task of recommending a small number of news articles to selected users.
We propose a realistic setting where we take into account content and user leanings, and the probability of further sharing an article. This setting allows us to capture the balance between maximizing the spread of information and ensuring the exposure of users to diverse viewpoints.
The resulting problem can be cast as maximizing a monotone and submodular function subject to a matroid constraint on the allocation of articles to users. It is a challenging generalization of the influence maximization problem. Yet, we are able to devise scalable approximation algorithms by introducing a novel extension to the notion of random reverse-reachable sets. We experimentally demonstrate the efficiency and scalability of our algorithm on several real-world datasets.
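A standard way to exploit monotonicity and submodularity under a (partition) matroid constraint is the greedy algorithm, sketched below on a toy objective: the number of distinct article leanings each user is exposed to, with a cap on how many articles each user may receive. Both the objective and the data are invented for illustration; the paper's actual algorithm relies on random reverse-reachable sets for scalability, which this sketch does not implement.

```python
from itertools import product

users = ["u1", "u2", "u3"]
articles = {"a1": "left", "a2": "right", "a3": "center"}
cap = 1   # partition-matroid constraint: at most `cap` articles per user

def diversity(assignment):
    # assignment: set of (user, article) pairs; count distinct leanings per user
    seen = {}
    for u, a in assignment:
        seen.setdefault(u, set()).add(articles[a])
    return sum(len(leanings) for leanings in seen.values())

chosen = set()
while True:
    feasible = [(u, a) for u, a in product(users, articles)
                if (u, a) not in chosen
                and sum(1 for x in chosen if x[0] == u) < cap]
    if not feasible:
        break
    gains = {p: diversity(chosen | {p}) - diversity(chosen) for p in feasible}
    best = max(gains, key=gains.get)
    if gains[best] <= 0:
        break
    chosen.add(best)   # greedy: add the feasible pair with the largest marginal gain

print(sorted(chosen))
```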
This document discusses various techniques for data preprocessing, including data integration, transformation, reduction, and discretization. It covers topics such as schema integration, handling redundant data, data normalization, dimensionality reduction, data cube aggregation, sampling, and entropy-based discretization. The goal of these techniques is to prepare raw data for knowledge discovery and data mining tasks by cleaning, transforming, and reducing the data into a suitable structure.
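Two of the preprocessing steps listed above, normalization and discretization, can be illustrated in a few lines of NumPy. The data is made up, and equal-width binning stands in for the entropy-based discretization the document discusses.

```python
import numpy as np

ages = np.array([18, 22, 25, 30, 41, 47, 52, 65], dtype=float)

# Min-max normalization: rescale values to the [0, 1] range
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Equal-width discretization into 3 bins (a simple stand-in for entropy-based binning)
edges = np.linspace(ages.min(), ages.max(), 4)[1:-1]
bins = np.digitize(ages, bins=edges)

print(normalized.round(2))
print(bins)   # discretized labels 0..2
```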
This document summarizes research on discovering spatial co-location patterns from geospatial data. It discusses how spatial data mining differs from classical data mining by considering attribute relationships between neighboring spatial objects. The paper focuses on extracting frequent co-occurrence rules between boolean spatial features from ecological datasets. It presents three approaches for modeling co-location rules problems - reference feature centric, window centric, and event centric. The Co-location Miner algorithm is introduced for mining co-location rules that satisfy minimum prevalence and conditional probability thresholds from the data.
A Visual Exploration of Distance, Documents, and Distributions (Rebecca Bilbro)
The document discusses various distance metrics that can be used to quantify similarity between text documents for machine learning applications. It explains challenges in modeling text data due to its high dimensionality and sparse distributions. It then summarizes distance metrics available in Scikit-Learn and SciPy that can be used, including Euclidean, Manhattan, Chebyshev, Minkowski, Mahalanobis, Cosine, Canberra, Jaccard, and Hamming distances. It provides examples applying t-SNE visualization to embed documents from three text corpora using different distance metrics to understand how the choice of distance metric impacts the resulting visualizations.
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
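The following toy comparison (made-up term-count vectors) shows why the choice matters: Euclidean and Manhattan distances grow with document length, while cosine distance depends only on the direction of the vectors, which is one reason it is often preferred for sparse, high-dimensional text features.

```python
import numpy as np
from scipy.spatial import distance

# Two documents with identical term proportions, one ten times longer.
short_doc = np.array([1, 2, 0, 1], dtype=float)
long_doc = np.array([10, 20, 0, 10], dtype=float)

print("euclidean:", distance.euclidean(short_doc, long_doc))   # inflated by length
print("manhattan:", distance.cityblock(short_doc, long_doc))   # inflated by length
print("cosine:   ", distance.cosine(short_doc, long_doc))      # ~0: same direction
```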
This document provides an overview of an algorithmic techniques for big data analysis course. It discusses various data and computational models for analyzing large datasets, including streaming, external memory, distributed (MapReduce), and crowdsourcing models. Specific techniques covered include algorithms for counting distinct elements in different models, metric embedding for efficient pattern matching, and property testing for approximating properties of large datasets in sub-linear time. The document outlines the course topics, assignments, timeline and provides a high-level syllabus.
This document provides an overview of an algorithmic techniques for big data analysis course. It discusses the challenges of big data including volume, velocity, variety, and veracity. The course will develop algorithms to deal with big data, with an emphasis on different data processing models and common techniques. Students will complete projects, participate in discussions, and are encouraged to apply what they learn. Grading will be based on scribing lecture notes, participation, and a survey and project. The tentative syllabus and course plan cover topics like streaming algorithms, dimensionality reduction, and crowdsourcing.
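As one concrete example of the streaming model mentioned in the syllabus, the sketch below estimates the number of distinct elements in a single pass with O(k) memory using a k-minimum-values estimator. This is a generic textbook technique, not necessarily the course's exact algorithm; production systems typically use HyperLogLog instead.

```python
import hashlib

def kmv_distinct(stream, k=64):
    """K-minimum-values sketch: estimate the number of distinct items in one pass."""
    def h(x):  # hash each item to a pseudo-uniform float in (0, 1)
        return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) / 2**160
    smallest = set()   # the k smallest distinct hash values seen so far
    for item in stream:
        v = h(item)
        if len(smallest) < k:
            smallest.add(v)
        elif v < max(smallest) and v not in smallest:
            smallest.discard(max(smallest))
            smallest.add(v)
    if len(smallest) < k:          # fewer than k distinct items: count is exact
        return len(smallest)
    return int((k - 1) / max(smallest))

print(kmv_distinct((i % 1000 for i in range(100_000))))   # estimate close to 1000
```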
An Answer Set Programming based framework for High-Utility Pattern Mining ext... (Francesco Cauteruccio)
This document summarizes a framework for high-utility pattern mining that extends previous work by introducing facets, a multi-layer database representation, and advanced utility functions. The framework allows mining patterns based on multiple perspectives using facets and considers database structure through layers of containers, objects, and transactions. An Answer Set Programming approach is used to flexibly encode the problem and compute utilities. The framework is demonstrated on a scientific paper review dataset, showing it can provide insights not possible with previous high-utility pattern mining systems.
Machine Learning Comparative Analysis - Part 1 (Kaniska Mandal)
This document provides an overview of machine learning concepts and algorithms. It discusses supervised and unsupervised classification as well as reinforcement learning. Important concepts covered include concepts, instances, target concepts, hypotheses, inductive bias, Occam's razor, and restriction bias. Machine learning algorithms discussed include Bayesian classification, decision trees, linear regression, multi-layer perceptrons, K-nearest neighbors, boosting, and ensemble learning. The document compares the preferences, learning functions, performance, enhancements, and typical usages of these different machine learning approaches.
Course: Intro to Computer Science (Malmö Högskola)
A palette of applications showing abstraction, databases, simulation, artificial intelligence and numerical applications
This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.
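For readers unfamiliar with the mapper and reducer functions mentioned above, here is the canonical word-count example written as plain Python functions. It mimics the programming model only and is not Hadoop or Spark code.

```python
from collections import defaultdict
from itertools import chain

def mapper(doc):
    # Emit a (word, 1) pair for every word in the document
    for word in doc.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Sum the counts for each word
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
mapped = chain.from_iterable(mapper(d) for d in docs)
print(dict(reducer(k, vs) for k, vs in shuffle(mapped)))
```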
Machine learning and linear regression programming (Soumya Mukherjee)
Overview of AI and ML
Terminology awareness
Applications in real world
Use cases within Nokia
Types of Learning
Regression
Classification
Clustering
Linear Regression Single Variable with python
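A single-variable linear regression like the one listed above reduces to two closed-form formulas. The snippet below, with made-up data, fits the slope and intercept by ordinary least squares.

```python
import numpy as np

# Illustrative data: x is the input variable, y the target
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Ordinary least squares for a single variable
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(f"y ~ {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 6:", slope * 6 + intercept)
```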
ChatGPT
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It involves applying various techniques and methods to extract insights from data sets, often with the goal of uncovering patterns, trends, relationships, or making predictions.
Here's an overview of the key steps and techniques involved in data analysis:
Data Collection: The first step in data analysis is gathering relevant data from various sources. This can include structured data from databases, spreadsheets, or surveys, as well as unstructured data such as text documents, social media posts, or sensor readings.
Data Cleaning and Preprocessing: Once the data is collected, it often needs to be cleaned and preprocessed to ensure its quality and suitability for analysis. This involves handling missing values, removing duplicates, addressing inconsistencies, and transforming data into a suitable format for analysis.
Exploratory Data Analysis (EDA): EDA involves examining and understanding the data through summary statistics, visualizations, and statistical techniques. It helps identify patterns, distributions, outliers, and potential relationships between variables. EDA also helps in formulating hypotheses and guiding further analysis.
Data Modeling and Statistical Analysis: In this step, various statistical techniques and models are applied to the data to gain deeper insights. This can include descriptive statistics, inferential statistics, hypothesis testing, regression analysis, time series analysis, clustering, classification, and more. The choice of techniques depends on the nature of the data and the research questions being addressed.
Data Visualization: Data visualization plays a crucial role in data analysis. It involves creating meaningful and visually appealing representations of data through charts, graphs, plots, and interactive dashboards. Visualizations help in communicating insights effectively and spotting trends or patterns that may be difficult to identify in raw data.
Interpretation and Conclusion: Once the analysis is performed, the findings need to be interpreted in the context of the problem or research objectives. Conclusions are drawn based on the results, and recommendations or insights are provided to stakeholders or decision-makers.
Reporting and Communication: The final step is to present the results and findings of the data analysis in a clear and concise manner. This can be in the form of reports, presentations, or interactive visualizations. Effective communication of the analysis results is crucial for stakeholders to understand and make informed decisions based on the insights gained.
Data analysis is widely used in various fields, including business, finance, marketing, healthcare, social sciences, and more. It plays a crucial role in extracting value from data, supporting evidence-based decision-making, and driving actionable insights.
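A few of the steps above (duplicate removal, imputation of missing values, summary statistics, and simple relationship checks) fit into a short pandas sketch. The data frame here is invented purely for illustration.

```python
import pandas as pd

# Toy dataset with a duplicate row and missing values
df = pd.DataFrame({"age": [25, 32, None, 41, 32],
                   "income": [40, 52, 48, None, 52]})

df = df.drop_duplicates()                      # data cleaning: remove duplicates
df = df.fillna(df.median(numeric_only=True))   # impute missing values with medians

print(df.describe())   # exploratory summary statistics
print(df.corr())       # relationships between variables
```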
The document outlines a framework for mining spatial co-location patterns from transaction-type data. It begins by discussing related work on existing approaches, including transaction-free and transaction-based methods. It then defines a new type of transaction-type data called a Spatial Co-location Transaction (SCT) to address limitations in previous approaches. The proposed framework first generates all SCTs from a spatial dataset, then applies association analysis methods to the SCTs to extract spatial co-location patterns. Both binary and quantitative analysis techniques are described to analyze SCTs and identify meaningful co-location patterns.
This course provides a detailed executive-level review of contemporary topics in graph modeling theory with specific focus on Deep Learning theoretical concepts and practical applications. The ideal student is a technology professional with a basic working knowledge of statistical methods.
Upon completion of this review, the student should acquire improved ability to discriminate, differentiate and conceptualize appropriate implementations of application-specific (‘traditional’ or ‘rule-based’) methods versus deep learning methods of statistical analyses and data modeling. Additionally, the student should acquire improved general understanding of graph models as deep learning concepts with specific focus on state-of-the-art awareness of deep learning applications within the fields of character recognition, natural language processing and computer vision. Optionally, the provided code base will inform the interested student regarding basic implementation of these models in Keras using Python (targeting TensorFlow, Theano or Microsoft Cognitive Toolkit).
Link to course:
https://www.experfy.com/training/courses/graph-models-for-deep-learning
University of Manchester Symposium 2012: Extraction and Representation of in ... (geraintduck)
This document describes research extracting and analyzing biological methods mentioned in the scientific literature. It developed bioNerDS, a tool to automatically extract mentions of computational resources from papers. bioNerDS was used to analyze over 1.8 million mentions from 230,000 open access articles, finding patterns in resource usage over time and between journals. Challenges included ambiguity, variability in names, and extracting methods from ordered resource mentions. The goal is to provide a way to extract "best practices" for any resource-based domain by mining the literature.
Foundations of Machine Learning - StampedeCon AI Summit 2017 (StampedeCon)
This presentation will cover all aspects of modeling, from preparing data to training and evaluating the results. There will be descriptions of the mainline ML methods including neural nets, SVM, boosting, bagging, trees, forests, and deep learning. Common problems of overfitting and dimensionality will be covered with a discussion of modeling best practices. Other topics will include field standardization, encoding categorical variables, and feature creation and selection. It will be a soup-to-nuts overview of all the necessary procedures for building state-of-the-art predictive models.
The document discusses decision tree learning, a machine learning approach for classification that builds models in the form of a decision tree. It describes the ID3 algorithm, a popular method for generating a decision tree from a set of training data. The ID3 algorithm uses information gain as the splitting criterion to recursively split the training data into purer subsets based on the values of the attributes. It selects the attribute with the highest information gain to make decisions at each node in the tree. Entropy from information theory is used to measure the information gain, with the goal being to build a tree that best classifies the training instances into target classes. An example applying the ID3 algorithm to a tennis-playing dataset is provided to illustrate the approach.
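The entropy and information-gain computations at the heart of ID3 are easy to state in code. The sketch below scores one attribute on a tiny made-up "play tennis" sample; the values are illustrative and not the classic dataset verbatim.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Entropy of the target minus the weighted entropy after splitting on `attribute`
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

data = [
    {"outlook": "sunny", "play": "no"}, {"outlook": "sunny", "play": "no"},
    {"outlook": "overcast", "play": "yes"}, {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "no"}, {"outlook": "overcast", "play": "yes"},
]
# ID3 splits on the attribute with the highest information gain
print(round(information_gain(data, "outlook", "play"), 3))
```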
Data preprocessing techniques
See my Paris applied psychology conference paper here
https://www.slideshare.net/jasonrodrigues/paris-conference-on-applied-psychology
or
https://prezi.com/view/KBP8JnekVH9LkLOiKY3w/
University Course "Micro and nano systems" for Master Degree in Biomedical Engineering at University of Pisa. Topic: Software for additive manufacturing (part1)
How to Effectively Combine Numerical Features and Categorical Features (Domino Data Lab)
by Liangjie Hong
Head of Data Science, Etsy
Latent factor models and decision tree based models are widely used in tasks of prediction, ranking and recommendation. Latent factor models have the advantage of interpreting categorical features by a low-dimensional representation, while such an interpretation does not naturally fit numerical features. In contrast, decision tree based models enjoy the advantage of capturing the nonlinear interactions of numerical features, while their capability of handling categorical features is limited by the cardinality of those features. Since in real-world applications we usually have both abundant numerical features and categorical features with large cardinality (e.g., geolocations, IDs, tags, etc.), we design a new model, called GB-CENT, which leverages latent factor embedding and tree components to achieve the merits of both while avoiding their demerits. With two real-world data sets, we demonstrate that GB-CENT can effectively (i.e., fast and accurately) achieve better accuracy than state-of-the-art matrix factorization, decision tree based models, and their ensembles.
Data enrichment is vital for leveraging heterogeneous data sources in various business analyses, AI applications, and data-driven services. Knowledge Graphs (KGs) support the enrichment of heterogeneous data sources by making entities first-class citizens: links to entities help interconnect heterogeneous data pieces or even ease access to external data sources to eventually augment the original data. Data annotation algorithms to find and link entities in reference KGs, as well as to identify out-of-KG entities have been proposed and applied to different types of data, such as tables, and texts. However, despite recent progress in annotation algorithms, the output of these algorithms does not always meet the quality requirements that make the enriched data valuable in downstream applications. As a result, semantic data enrichment remains an effort-consuming and error-prone task. In this seminar, we discuss the relationships between annotation algorithms, data enrichment, and KG construction, highlighting challenges and open problems. In addition, we advocate for a native human-in-the-loop perspective that enables users to control the outcome of the enrichment and, eventually, improve the quality of the enriched data. We focus in particular on the annotation and enrichment of tabular data and briefly discuss the application of a similar paradigm to the enrichment of textual data in the legal domain, e.g., on court decisions and criminal investigation documents.
Description of the DaCENA approach to the contextual exploration of knowledge graphs. We use machine learning to learn user preferences using a limited number of user inputs. Through these inputs, we learn a personalized ranking function over semantic associations (semi-paths in a knowledge graph) that best fit users' interests. References for the presentation are:
Bianchi et al.: Actively Learning to Rank Semantic Associations for Personalized Contextual Exploration of Knowledge Graphs. ESWC (1) 2017: 120-135.
Palmonari et al.: DaCENA: Serendipitous News Reading with Data Contexts. ESWC (Satellite Events) 2015: 133-137.
Interoperability challenges & solutions in the EW-Shopp H2020 innovation action: tool-supported interoperability; exchange of event data and custom event ontology for data analytics; reconciliation across systems of spatial identifiers.
5-min presentation of EW-Shopp. EW-Shopp is an industry-driven H2020 project where AI is used to make data enrichment easier and predict the effect of weather and events in different business domains such as eCommerce, Retail, CRM, IoT, Digital Marketing
1) Research challenges in developing human-level AI include tackling the complexity of human intelligence such as faster learning without limited data, connections between cognitive skills, and reasoning about imaginary concepts.
2) Modern machine learning has achieved human-level performance in games and applications like translation but lacks human abilities such as commonsense reasoning and using background knowledge.
3) Promising research trends include techniques that combine symbolic and machine learning approaches, multi-modal learning, and generative models to support creativity.
Presentation of "Facet Annotation Using Reference Knowledge Bases" at the WWW2018 Research Track, i.e., The Web Conference 2018, April 26th, Lyon, France.
ABSTRACT: Faceted interfaces are omnipresent on the web to support data exploration and filtering. A facet is a triple: a domain (e.g., Book), a property (e.g., author, language), and a set of property values (e.g., { Austen, Beauvoir, Coelho, Dostoevsky, Eco, Kerouac, Suskind, ... }, { French, English, German, Italian, Portuguese, Russian, ... }). Given a property (e.g., language), selecting one or more of its values (English and Italian) returns the domain entities (of type Book) that match the given values (the books that are written in English or Italian). To implement faceted interfaces in a way that is scalable to very large datasets, it is necessary to automate facet extraction. Prior work associates a facet domain with a set of homogeneous values, but does not annotate the facet property. In this paper, we annotate the facet property with a predicate from a reference Knowledge Base (KB) so as to maximize the semantic similarity between the property and the predicate. We define semantic similarity in terms of three new metrics: specificity, coverage, and frequency. Our experimental evaluation uses the DBpedia and YAGO KBs and shows that for the facet annotation problem, we obtain better results than a state-of-the-art approach for the annotation of web tables as modified to annotate a set of values.
For more info about our work you can check out the websites of our labs:
INSID&S Lab (UNIMIB): http://inside.disco.unimib.it/
ADVIS Lab (UIC): https://www.cs.uic.edu/bin/view/Advis/WebHome
Using our multi-user model, a community of users provides feedback in a pay-as-you-go fashion to the ontology matching process by validating the mappings found by automatic methods, with the following advantages over having a single user: the effort required from each user is reduced, user errors are corrected, and consensus is reached. We propose strategies that dynamically determine the order in which the candidate mappings are presented to the users for validation. These strategies are based on mapping quality measures that we define. Further, we use a propagation method to leverage the validation of one mapping to other mappings. We use an extension of the AgreementMaker ontology matching system and the Ontology Alignment Evaluation Initiative (OAEI) Benchmarks track to evaluate our approach. Our results show how F-measure and robustness vary as a function of the number of user validations. We consider different user error and revalidation rates (the latter measures the number of times that the same mapping is validated). Our results highlight complex trade-offs and point to the benefits of dynamically adjusting the revalidation rate.
This tutorial was presented at CAiSE 2010. It discusses the state of the art in research addressing the quality of data at the conceptual level (conceptual schemas) and of ontologies.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu... (Scintica Instrumentation)
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
The binding of cosmological structures by massless topological defects (Sérgio Sacani)
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf (Selcen Ozturkcan)
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Immersive Learning That Works: Research Grounding and Paths Forward (Leonel Morgado)
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Anti-Universe And Emergent Gravity and the Dark Universe (Sérgio Sacani)
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
The debris of the ‘last major merger’ is dynamically young (Sérgio Sacani)
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
The cost of acquiring information by natural selection (Carl Bergstrom)
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
Authoring a personal GPT for your research and practice: How we created the Q... (Leonel Morgado)
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub... (Leonel Morgado)
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
Using Ontology-based Data Summarization to Develop Semantics-aware Recommender Systems. ESWC 2018
1. Using Ontology-based Data Summarization to
Develop Semantics-aware Recommender Systems
Tommaso Di Noia*, Corrado Magarelli*, Andrea Maurino°, Matteo Palmonari°, Anisa Rula°**
*Polytechnic University of Bari
°University of Milano-Bicocca
**SDA, University of Bonn
This project has received funding from the European Union’s
Horizon 2020 research and innovation program under grant
agreements n. 732003 and n. 732590
2. Outline
• Feature Selection for Semantics-aware Recommender Systems
• Ontology-based Data Summarization with ABSTAT
• Feature Selection (ABSTAT vs. Information Gain)
• Experiments
• Conclusions and Future Work
2
3. Outline
• Feature Selection for Semantics-aware Recommender Systems
• Ontology-based Data Summarization with ABSTAT
• Feature Selection (ABSTAT vs. Information Gain)
• Experiments
• Conclusions and Future Work
3
4. Recommender Systems
• Help users in dealing with information/choice overload
• Help to match users with items
4
5. Several Recommender Systems work perfectly well without using any content! (e.g., Amazon)
Collaborative Filtering and Matrix Factorization are state-of-the-art techniques for implementing Recommender Systems (ACM RecSys 2009, by the Netflix Challenge winners)
Why do we need content?
Content can tackle some issues of collaborative filtering
5
7. Why do we need content?
Collaborative Filtering issues: the new item problem
7
8. Why do we need content?
Who knows the «customers who bought…»?
Collaborative Filtering issues: poor explanations!
8
9. Content-based Semantic Recommendations
• Basic item-based KNN recommender system
• Given a user u and a non-rated item i, the rating of i is predicted by combining the ratings the user gave to the neighbors of i (a formula is sketched below), where:
• N(i) = neighbors of the non-rated item i
• r(u) = the items rated by the user u
• r(u,j) = the rating value given to the item j by the user u
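A standard similarity-weighted item-based KNN prediction consistent with these definitions is the following; this is a sketch of the usual formulation, and the exact weighting used in the paper may differ.

\hat{r}(u,i) = \frac{\sum_{j \in N(i) \cap r(u)} \mathrm{sim}(i,j)\, r(u,j)}{\sum_{j \in N(i) \cap r(u)} \mathrm{sim}(i,j)}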
Similarity functions:
• Jaccard
• Graph kernels
• Cosine similarity in a vector space
• … several variants
… all based on subgraphs built using certain properties
9
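As a minimal illustration of the simplest of these similarity functions, the sketch below computes Jaccard similarity over the sets of (property, value) pairs extracted from each item's subgraph; the item names and features are illustrative and not taken from the paper.

def jaccard(features_a, features_b):
    # features_*: sets of (property, value) pairs describing an item
    if not features_a and not features_b:
        return 0.0
    return len(features_a & features_b) / len(features_a | features_b)

# Illustrative DBpedia-style features
pulp_fiction = {("dbo:director", "dbr:Quentin_Tarantino"), ("dbo:starring", "dbr:Uma_Thurman")}
kill_bill = {("dbo:director", "dbr:Quentin_Tarantino"), ("dbo:starring", "dbr:Uma_Thurman"),
             ("dbo:starring", "dbr:David_Carradine")}
print(jaccard(pulp_fiction, kill_bill))  # 2 shared pairs out of 3 distinct pairs -> 0.666...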
11. The Feature Selection Problem
• Features = properties for
similarity evaluation
• Ontological properties?
• Categorical properties?
• Frequent properties?
• Feature selection
• Usually performed manually (ex-post)
• With statistical measures [Musto&al.UMAP2016]
• With ontology-based data summaries (this paper)
• Fully automatic feature selection with ABSTAT profiles
• (Manual pre-processing + frequency-based ranking with ABSTAT profiles + graph
kernel similarity [Ragone&al.SAC2017] )
The curse of dimensionality
11
12. Ontology-based Data Summarization
vs. Statistical Techniques
• Statistical measures
• Download the full dataset
• Compute statistical measures over the full dataset
• Keep only the data of interest
• Run the algorithm
• Profiles (efficiently accessible via web)
• Ask for top-k most useful properties
• e.g., via API
• Download only the relevant data
• Run the algorithm
12
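To make the profile-driven workflow concrete, the rough sketch below asks a summary service for the top-k properties of a type before downloading any data; the endpoint URL, parameters, and response shape are hypothetical placeholders, not the actual ABSTAT API.

import requests

# Hypothetical request: top-k properties for a given type, ranked by frequency
resp = requests.get(
    "https://abstat.example.org/api/properties",  # placeholder URL, not the real ABSTAT endpoint
    params={"dataset": "dbpedia-2015-10", "type": "dbo:Film", "rank": "frequency", "k": 5},
)
top_properties = [entry["property"] for entry in resp.json()]  # assumed response shape

# Only the triples that use these properties are then downloaded and fed to the recommender.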
13. Outline
• Feature Selection for Semantics-aware Recommender Systems
• Ontology-based Data Summarization with ABSTAT
• Feature Selection (ABSTAT vs. Information Gain)
• Experiments
• Conclusions and Future Work
13
14. Ontology-driven Knowledge Graph
Summarization Profiling with ABSTAT
Minimal Type Patterns: there exist entities that
have Company as minimal type, which are
linked to literals that have gYear as minimal type
by the property foundingYear
Occurrence of types and properties
Frequency and instances: how many times this
pattern occurs as minimal type pattern and as a
pattern. Instances count considers pattern
inference
Cardinality descriptors: max/avg/min number of
different subjects associated with a same object
(and vice versa)
For more details: abstat.disco.unimib.it and [ESWC2016-demo, SUMPRE2016, ESWC2018-demo]
14
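For instance, a single triple yields one minimal type pattern; the entity and literal below are illustrative.

# Triple: dbr:Microsoft  dbo:foundingYear  "1975"^^xsd:gYear
# Minimal type of the subject: dbo:Company; type of the literal object: xsd:gYear
pattern = ("dbo:Company", "dbo:foundingYear", "xsd:gYear")
# A profile records, for each such pattern, its frequency as a minimal type pattern
# and its number of instances when pattern inference is taken into account.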
16. ABSTAT: Cardinality Descriptors
Example values for a pattern π: MinS(π) = 1, MaxS(π) = 3, AvgS(π) = 1.4 ≈ 1; MinO(π) = 1, MaxO(π) = 2, AvgO(π) = 1.7 ≈ 2
For each pattern π
• MinS, AvgS, MaxS
• Min/Avg/Max number of distinct subjects associated with a unique object in the triples represented by π
• MinO, AvgO, MaxO
• Min/Avg/Max number of distinct objects associated with a unique subject in the triples represented by π
• Local vs. global
• Local: computed for patterns
• Global: computed for properties, i.e., over all triples with a property
16
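The sketch below shows how these descriptors could be computed for one pattern from its (subject, object) pairs; the pairs are made up so that the results roughly match the example values above.

from collections import defaultdict

# Illustrative (subject, object) pairs for one pattern π (not real DBpedia data)
pairs = [("s1", "o1"), ("s2", "o1"), ("s3", "o1"),
         ("s1", "o2"), ("s2", "o3"), ("s3", "o4"), ("s4", "o5")]

subjects_per_object, objects_per_subject = defaultdict(set), defaultdict(set)
for s, o in pairs:
    subjects_per_object[o].add(s)
    objects_per_subject[s].add(o)

subj_counts = [len(v) for v in subjects_per_object.values()]  # distinct subjects per object
obj_counts = [len(v) for v in objects_per_subject.values()]   # distinct objects per subject

print(min(subj_counts), sum(subj_counts) / len(subj_counts), max(subj_counts))  # MinS, AvgS, MaxS -> 1, 1.4, 3
print(min(obj_counts), sum(obj_counts) / len(obj_counts), max(obj_counts))      # MinO, AvgO, MaxO -> 1, 1.75, 2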
23. ABSTAT: Cardinality Descriptors
Local vs. global cardinality descriptors for the property cinematography (subject descriptors written as [minS, avgS, maxS], object descriptors as [minO, avgO, maxO]):
• Global cardinality descriptors, over all triples with the property: Thing -> Thing via cinematography, subjects [1, 5, 249], objects [1, 1, 13]
• Local cardinality descriptors, for the pattern: Film -> Person via cinematography, subjects [1, 14, 249], objects [1, 1, 7]
23
25. Cardinality Descriptors for Feature Selection
The cardinality descriptors defined above (MinS/AvgS/MaxS and MinO/AvgO/MaxO, local or global), combined with pattern frequency, are the signals used to select features.
25
26. Outline
• Feature Selection for Semantics-aware Recommender Systems
• Ontology-based Data Summarization with ABSTAT
• Feature Selection (ABSTAT vs. Information Gain)
• Experiments
• Conclusions and Future Work
26
27. Feature Selection with ABSTAT
Pipeline from PATTERNS to Properties:
• FILTERING (by a local cardinality descriptor), e.g., avgS > 1
• PROJECTION (property, MAX(value*)), e.g., P, MAX(frequency)
• RANKING (by value*), e.g., DESC(frequency)
• SELECTION (k properties), e.g., k = 2
*value = pattern frequency, a local cardinality descriptor, or a combination of the first two.
27
28. Feature Selection with ABSTAT
A second configuration of the same pipeline:
• FILTERING: none
• PROJECTION: P, MAX(frequency*maxS)
• RANKING: DESC(frequency*maxS)
• SELECTION: k = 5
*value = pattern frequency, a local cardinality descriptor, or a combination of the first two.
28
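A compact sketch of the filter / project / rank / select pipeline described above, run over a toy list of patterns; the field names and numbers are illustrative, not taken from an actual ABSTAT profile.

# Each entry summarises one pattern: the property it uses, its frequency, and local cardinality stats
patterns = [
    {"property": "dbo:director", "frequency": 900, "avgS": 3.0, "maxS": 40},
    {"property": "dbo:starring", "frequency": 2500, "avgS": 5.0, "maxS": 200},
    {"property": "dbo:runtime", "frequency": 3000, "avgS": 1.0, "maxS": 1},
]

def select_features(patterns, k, value=lambda p: p["frequency"], keep=lambda p: True):
    filtered = [p for p in patterns if keep(p)]            # FILTERING (e.g., avgS > 1)
    best = {}                                              # PROJECTION: property -> MAX(value)
    for p in filtered:
        best[p["property"]] = max(best.get(p["property"], 0), value(p))
    ranked = sorted(best, key=best.get, reverse=True)      # RANKING, descending by value
    return ranked[:k]                                      # SELECTION: top-k properties

# AbsFreqAvgS-style configuration: filter avgS > 1, rank by frequency
print(select_features(patterns, k=2, keep=lambda p: p["avgS"] > 1))
# AbsFreq*MaxS-style configuration: no filter, rank by frequency * maxS
print(select_features(patterns, k=2, value=lambda p: p["frequency"] * p["maxS"]))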
29. Feature Selection with IG
• Different statistical measures tested: Information Gain, Information Gain Ratio, Chi-squared test
• Information Gain: expected reduction in entropy obtained when the data are split according to the values of a feature.
• For a feature fi, IG is defined as shown in the formula below,
• where E(I) is the entropy of the data, Iv is the set of items in which the feature fi (e.g., director for movies) has a value equal to v (e.g., F. F. Coppola in the movie domain), and E(Iv) is the entropy computed on the data where the feature fi assumes value v.
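The usual Information Gain formula consistent with these definitions (reconstructed here, since the formula itself did not survive the slide-text extraction):

IG(f_i) = E(I) - \sum_{v \in \mathrm{values}(f_i)} \frac{|I_v|}{|I|}\, E(I_v)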
29
30. Feature Selection with IG: preprocessing
• Manual pre-processing is required
• Reduce redundant or irrelevant features that are expected to bring little value
to the recommendation task, but, at the same time, pose scalability issues
Dataset      | # features before pre-processing | # features after pre-processing
MovieLens    | 148                              | 34
LastFM       | 271                              | 25
LibraryThing | 201                              | 22
30
31. Outline
• Feature Selection for Semantics-aware Recommender Systems
• Ontology-based Data Summarization with ABSTAT
• Feature Selection (ABSTAT vs. Information Gain)
• Experiments
• Conclusions and Future Work
31
32. Experimental Settings:
Recommendation Method
• Content-based, using an item-based nearest neighbors algorithm [Di Noia & al. TIST2016]
• Given the set of entities rated by the user (= user profile),
• predict the rating only for the k nearest neighbors of the rated items
• Jaccard item similarity
Rating prediction on the k items most similar to the items rated by the user
32
34. Experimental Setting: Datasets & Measures
Datasets
• One-to-one mapping between RecSys benchmarks and DBpedia [Di Noia &
al.TIST2016]:
• MovieLens DBpedia (3883 Movies)
• Last.fm DBpedia (17632 Artists)
• LibraryThing DBpedia (37231 Books)
• DBpedia-2015-10, including infoboxes (392M triples)
Metrics
• Accuracy:
• Precision@N: fraction of relevant items in the Top-N recommendations
• MRR@N: average reciprocal rank of the first relevant recommended item
• Diversity:
• catalog coverage: percentage of items in the catalog recommended at least once
• aggregate diversity: aggregate entropy
34
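A small sketch of the two accuracy metrics and of catalog coverage, computed per user and then averaged in the usual way; this is illustrative code, not the paper's evaluation scripts.

def precision_at_n(recommended, relevant, n):
    # Fraction of relevant items among the top-N recommendations
    return sum(1 for item in recommended[:n] if item in relevant) / n

def mrr_at_n(recommended, relevant, n):
    # Reciprocal rank of the first relevant item in the top-N list (0 if none appears)
    for rank, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def catalog_coverage(all_recommendation_lists, catalog):
    # Percentage of catalog items recommended at least once across all users
    recommended = set(item for lst in all_recommendation_lists for item in lst)
    return 100.0 * len(recommended & set(catalog)) / len(catalog)

print(precision_at_n(["a", "b", "c", "d"], {"b", "d"}, n=4))  # 0.5
print(mrr_at_n(["a", "b", "c", "d"], {"b", "d"}, n=4))        # 0.5 (first hit at rank 2)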
35. Experimental Setting: Datasets & Measures
• Novelty
• Recommend items in the long tail
• Diversity
• Avoid recommending only items from a small subset of the catalog
• Suggest diverse items in the
recommendation list
• Serendipity
• Suggest unexpected but interesting
items
Is it all about precision?
35
36. Experimental Settings: dbo vs. dbp properties
• Number of features/properties: 5, 20
• Which DBpedia? dbo (DBpedia Ontology) vs. dbp (infobox) properties
• noRep: keep only the best-ranked property between dbo and dbp
• withRep: keep duplicates
• Onlydbp: in case of duplicates, keep only the dbp property
• Onlydbo: in case of duplicates, keep only the dbo property
• IG vs. ABSTAT configurations:
Name: AbsFreqAvgS | Filter by: AvgS > 1 | Ranking: Frequency | Intuition: only properties that map at least two distinct subjects to one object (on average), ranked by frequency
Name: AbsFreq*MaxS | Filter by: no filter | Ranking: Frequency*maxS | Intuition: favors properties that are more frequent and map a higher number of distinct subjects to one object
Name: AbsMaxS | Filter by: no filter | Ranking: MaxS | Intuition: favors properties that map a higher number of distinct subjects to one object
Name: Tf-idf (baseline) | Filter by: no filter | Ranking: Tf-idf over patterns | Intuition: favors properties that are more peculiar to the domain type
36
40. Outline
• Feature Selection for Semantics-aware Recommender Systems
• Ontology-based Data Summarization with ABSTAT
• Feature Selection (ABSTAT vs. Information Gain)
• Experiments
• Conclusions and Future Work
40
41. Conclusions & Future Work
• Conclusions
• Fully automatic feature selection method with ontology-based knowledge
graph summaries (ABSTAT)
• Better or, in some cases, comparable to statistical measures, but without
requiring computation over the full dataset
• Additional evidence of informativeness of ABSTAT-based summaries
• Future work
• Add Tf-idf to the ABSTAT statistics
• Experiments with additional measures (e.g., graph-based measures with paths longer than 1)
• API-based suggestion of most salient properties for an input entity type
41
42. Contacts: palmonari@disco.unimib.it - tommaso.dinoia@poliba.it
This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreements n. 732003 and n. 732590
Supporting Event and Weather-based Data Analytics and Marketing along the Shopper Journey: www.ew-shopp.eu
Enabling the European Business Graph for Innovative Data Products and Services: www.eubusinessgraph.eu/
Experiments & code: https://zenodo.org/record/1205712#.WrRCypPwa3U and http://ow.ly/zAA530d0wu0
ABSTAT (open source) code: https://bitbucket.org/disco_unimib/abstat-core
ABSTAT home: abstat.disco.unimib.it
42
43. Appendix: Explanations for Better/Worse Performance
Domain | Type       | # Minimal Patterns | Avg # Triples | Variance
Movies | dbo:Film   | 57757              | 74.02         | 549.31
Books  | dbo:Book   | 41684              | 44.97         | 169.48
Music  | dbo:Artist | 40491              | 80.50         | 981.51
43
44. Appendix: Explanations for Better/Worse Performance
Top 20 selected features for the MovieLens dataset by using the different configurations of IG and AbsFreqAvgS.
44
Editor's Notes
We did not consider only accuracy because in RecSys it is also important to go beyond the popularity bias and show diverse elements across the catalog, as well as items in the long tail
There does not exist one approach that performs better than
DBO: In the first row. Perhaps there is no difference between the
DBP: