Gianluca Demartini presented on using entities, graphs, and crowdsourcing for better web search. He discussed using crowdsourcing to perform entity linking and disambiguation on web pages through a system called ZenCrowd, which combines algorithmic and manual linking by automating the manual side via crowdsourcing tasks and assessing workers with a probabilistic reasoning framework. He also discussed using entity factor graphs for scientific literature disambiguation, modeling workers, links, clicks, and constraints within a probabilistic framework. The system was experimentally evaluated on news articles from several sources, linked to entities from knowledge bases such as Freebase and DBpedia.
The document discusses various natural language processing and machine learning techniques including sentiment analysis, automated essay scoring, content summarization, chatbots, information retrieval, cluster analysis, language neural networks, and language translation. It provides examples and links to resources on topics like word embeddings, one-hot encoding, the curse of dimensionality, neural networks, and building chatbots. Key points discussed are ensuring applications allow for imperfect accuracy from models and that without data, no machine learning is possible.
The Art of Social Media Analysis with Twitter & Python - Krishna Sankar
The document discusses analyzing social networks and Twitter data using Python. It provides an introduction to analyzing the Twitter network of the user @clouderati, including 2072 followers. The presentation will cover topics like mentions, hashtags, retweets, and constructing a social graph to analyze cliques and networks. It also provides some tips for working with Twitter APIs and building scalable social media analysis pipelines in Python.
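As a rough illustration of the kind of social-graph analysis described above, a small follower network can be loaded into networkx and mined for cliques and influential accounts. A minimal sketch, assuming an invented edge list rather than real @clouderati data:

    # Sketch only: the follower edges are invented, not real @clouderati data.
    import networkx as nx
    from collections import Counter

    G = nx.DiGraph()  # edge (a, b) means "a follows b"
    G.add_edges_from([
        ("alice", "clouderati"), ("bob", "clouderati"),
        ("alice", "bob"), ("bob", "alice"), ("carol", "alice"),
    ])

    # Mutual-follow pairs form the undirected graph used for clique analysis.
    mutual = nx.Graph((a, b) for a, b in G.edges() if G.has_edge(b, a))
    print(list(nx.find_cliques(mutual)))                 # maximal cliques
    print(Counter(dict(G.in_degree())).most_common(3))   # most-followed accounts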
Data Journalism (City Online Journalism wk8) - Paul Bradshaw
The document provides an overview of data journalism including what it is, sources for finding data, and tools for analyzing and visualizing data. It discusses scraping data from websites, using tools like Google searches, spreadsheets, and APIs to extract structured data. Ethical considerations around scraping are also mentioned. The document concludes with assigning students to group blogs and individual strategies focusing on different aspects of online journalism.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
The document discusses data, data science, and finding data sources. It defines data as raw facts about the world and notes that data comes from various sources like government, scientific research, citizens, and private companies. It then discusses the growth of digital data and issues around open data. The document defines data science as using analysis methods to describe facts, detect patterns, and test hypotheses. Finally, it provides tips on finding needed data, such as searching open data sources, APIs, scraping, and joining datasets.
Data science remains a high-touch activity, especially in the life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer from curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction”: understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
1) Entity-centric data management stores information at the entity level and integrates information by interlinking entities. This provides advantages over keyword-based and relational database approaches.
2) The XI Pipeline extracts mentions from text and performs named entity recognition, entity linking, and entity typing to associate entities with text.
3) Approaches like ZenCrowd and TRank leverage both algorithms and human computation through crowdsourcing to improve entity linking and fine-grained entity typing.
Ontology-Based Word Sense Disambiguation for Scientific Literature - eXascale Infolab
This document presents an approach for ontology-based word sense disambiguation for scientific literature. It leverages the structure of community-based ontologies to improve sense identification. The approach represents concepts as context vectors based on their relations in documents and ontologies. It evaluates techniques based on minimum distance between concepts in ontologies, shortest path between concepts in ontologies, and neighboring concepts in ontologies. Combining these graph-based models with context vectors achieves the best precision for word sense disambiguation on two scientific datasets.
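To make the combination concrete, here is a minimal sketch of blending context-vector similarity with ontology graph distance; the mini ontology, vectors, and equal weighting are invented for illustration, not taken from the paper:

    # Toy blend of context-vector similarity and ontology graph distance.
    import networkx as nx
    import numpy as np

    onto = nx.Graph([("cell", "biology"), ("cell", "battery"), ("battery", "energy")])
    concept_vecs = {"biology": np.array([0.9, 0.1]), "energy": np.array([0.1, 0.9])}

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def score(candidate, anchor, ctx_vec, alpha=0.5):
        # Shorter ontology paths to the anchor concept raise the score.
        d = nx.shortest_path_length(onto, candidate, anchor)
        return alpha * cosine(ctx_vec, concept_vecs[candidate]) + (1 - alpha) / (1 + d)

    ctx = np.array([0.8, 0.2])  # context suggests the biological sense
    print(max(["biology", "energy"], key=lambda c: score(c, "cell", ctx)))  # biology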
This document presents SANAPHOR, an ontology-based coreference resolution system that improves upon existing approaches by leveraging semantic information. It first links entities in document clusters to semantic types and ontologies. It then splits or merges clusters based on these semantic relationships. The system was evaluated on the CoNLL-2012 dataset, where it improved coreference resolution performance over the baseline Stanford system, particularly for noun clusters. By utilizing semantic knowledge, SANAPHOR demonstrates the benefits of enhancing syntactic coreference resolution with an additional semantic layer.
Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu... - eXascale Infolab
This document proposes three methods - LEXT, REXT, and LERIXT - for disambiguating the domain and range of properties in linked data by using context information. LEXT uses the type of subject resources, REXT uses the type of object resources, and LERIXT uses both. The methods were evaluated against expert judgments and achieved up to 96.5% precision for LEXT and 91.4% for REXT. LERIXT generated too many new sub-properties.
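The underlying intuition is easy to demonstrate: a property's domain can be guessed from the most frequent type among its subject resources, and its range from its object resources. A toy sketch with invented triples and types, not the paper's actual algorithm:

    # Infer a property's domain/range from subject/object types (toy data).
    from collections import Counter

    triples = [("Rome", "capitalOf", "Italy"),
               ("Paris", "capitalOf", "France"),
               ("Bern", "capitalOf", "Switzerland")]
    types = {"Rome": "City", "Paris": "City", "Bern": "City",
             "Italy": "Country", "France": "Country", "Switzerland": "Country"}

    def infer(prop, position):  # position 0 = subject (domain), 2 = object (range)
        counts = Counter(types[t[position]] for t in triples if t[1] == prop)
        return counts.most_common(1)[0][0]

    print(infer("capitalOf", 0), infer("capitalOf", 2))  # -> City Country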
Over the last few years we have observed the emergence of hybrid human-machine information systems that are able both to scale over large amounts of data and to maintain the high-quality data processing intrinsic to human intelligence.
In this talk I will focus on the use of human intelligence at scale, by means of crowdsourcing, to deal with Big Data problems. We will look specifically at how to deal with variety in data by means of Human Computation while still operating on large data volumes.
First, I will introduce the area of micro-task crowdsourcing and provide an overview of the research challenges that need to be tackled to enable large-scale hybrid human-machine information systems. Next, I will give examples of such hybrid systems for entity linking and disambiguation using crowdsourcing and a graph of linked entities as a background corpus. I will describe how keyword query understanding can be crowdsourced to build search engines that can answer rare, complex queries. Finally, I will present new techniques that improve the quality of crowdsourced information system components by means of push crowdsourcing.
Big Data Analysis: Deciphering the haystack - Srinath Perera
A primary outcome of Big Data is deriving useful and actionable insights from large or challenging data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This ranges from calculating simple analytics like mean, max, and median, to deriving an overall understanding of data by building models, and finally to deriving predictions from data. In some cases we can afford to wait to collect and process the data, while in other cases we need to know the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. Other technologies like Apache Spark and Apache Drill are gaining ground, as are real-time processing technologies like Stream Processing and Complex Event Processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these technologies, comparing and contrasting them.
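The MapReduce model mentioned above boils down to a map, a shuffle, and a reduce step, and the data flow can be mimicked in plain Python. This word-count toy is only meant to show the flow, not distributed execution:

    # Word count as map -> shuffle -> reduce, in plain Python (no cluster).
    from collections import defaultdict

    docs = ["big data is big", "data beats opinions"]

    mapped = [(w, 1) for doc in docs for w in doc.split()]   # map: emit (word, 1)

    groups = defaultdict(list)                               # shuffle: group by key
    for word, count in mapped:
        groups[word].append(count)

    counts = {word: sum(vals) for word, vals in groups.items()}  # reduce
    print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}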
Scientific Software Challenges and Community Responses - Daniel S. Katz
a talk given at RTI International on 7 December 2015, discussing 12 scientific software challenges and how the scientific software community is responding to them
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for... - eXascale Infolab
The document describes ZenCrowd, a system that leverages probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. It combines both algorithmic and manual linking approaches. ZenCrowd automates manual linking via crowdsourcing by breaking it into micro-tasks and dynamically assessing human workers with a probabilistic model. An experimental evaluation on news articles demonstrates that the combined approach of algorithmic matching and crowdsourcing achieves higher precision and recall than using either technique alone.
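The aggregation idea at the heart of such a system, weighting each worker's vote by an estimated reliability, can be sketched in a few lines. This is a much-simplified stand-in for ZenCrowd's factor-graph inference, with invented workers, votes, and reliabilities:

    # Reliability-weighted vote aggregation (simplified; invented data).
    from collections import defaultdict

    reliability = {"w1": 0.9, "w2": 0.6, "w3": 0.55}   # e.g. from gold questions
    votes = {"w1": "dbpedia:Fribourg", "w2": "dbpedia:Fribourg",
             "w3": "dbpedia:Canton_of_Fribourg"}       # candidate links chosen

    scores = defaultdict(float)
    for worker, entity in votes.items():
        scores[entity] += reliability[worker]

    best, score = max(scores.items(), key=lambda kv: kv[1])
    print(best, round(score / sum(scores.values()), 2))  # link + confidence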
This document provides an overview of offensive open-source intelligence (OSINT) techniques. It defines OSINT and discusses the differences between offensive and defensive OSINT approaches. Offensive OSINT focuses on gathering as much public information as possible to facilitate an attack against a target. The document outlines the OSINT process and details specific techniques for harvesting data from public sources, including scraping websites, using APIs, searching social media, analyzing images and metadata, and researching infrastructure components like IP addresses, domains, and software versions. The goal of offensive OSINT is to discover valuable information like employee emails, usernames, relationships, locations and technical vulnerabilities to enable attacks like phishing, social engineering, and infiltration.
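As a small taste of the harvesting step, a public page can be fetched and mined for e-mail addresses with standard libraries. The URL is a placeholder, and such scraping should only be run against targets you are authorized to assess:

    # Minimal harvesting sketch: extract e-mail addresses from a public page.
    import re
    import requests

    resp = requests.get("https://example.com/contact", timeout=10)  # placeholder URL
    emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", resp.text))
    print(sorted(emails))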
(1) Amundsen is a data discovery platform developed by Lyft to help users find, understand, and use data.
(2) The platform addresses challenges around data discovery such as lack of understanding about what data exists and where to find it.
(3) Amundsen provides searchable metadata about data resources, previews of data, and usage statistics to help data scientists and others explore and understand data.
Smarter search drives value to your business. Delivering search that matches users to the right content is what you care about. But organizations often get stuck getting there. It turns out that you need quite a number of very different ingredients to deliver tremendous search. It can make your head spin! To help you think through where your team is on its road to smarter search, Pugh introduces the maturity model used by OpenSource Connections and walks you through a very concrete method to inventory needed skills and translate that into search roles for your team. He shows how to measure your capabilities in key areas of search to drive better ROI from search.
Mark Dehmlow, Head of the Library Web Department at the University of Notre Dame
At the University of Notre Dame, we recently implemented a new website in concert with rolling out a “next generation” OPAC into production for our campus. While much of the pre-launch feedback was positive, once we implemented the new systems, we started receiving a small number of intense criticisms and a small wave of problem reports. This presentation covers how to plan for big technology changes, prepare your organization, effectively manage the barrage of post-implementation technical problems, and mitigate customer concerns and criticisms. Participants are encouraged to bring brief war stories, anecdotes, and suggestions for managing technology implementations.
Slides from my talk on Personalised Access to Linked Data. Presented at the EKAW 2014 conference. The poster to this paper won the best poster award at the conference!
OpenNeuro: a free online platform for sharing and analysis of neuroimaging data - Krzysztof Gorgolewski
This document describes OpenNeuro, a free online platform for sharing and analyzing neuroimaging data. OpenNeuro allows users to upload, manage, and share brain imaging data using the Brain Imaging Data Structure. It also enables running predefined analysis pipelines called Apps on uploaded data using software containers. This facilitates reproducible analysis. The document discusses OpenNeuro's features, available analysis pipelines, open source components, and vision for facilitating data sharing and reproducibility in neuroimaging research.
Wimmics Research Team 2015 Activity Report - Fabien Gandon
Extract of the activity report of the Wimmics joint research team between Inria Sophia Antipolis - Méditerranée and I3S (CNRS and Université Nice Sophia Antipolis). Wimmics stands for web-instrumented man-machine interactions, communities and semantics. The team focuses on bridging social semantics and formal semantics on the web.
This document provides a summary of using Splunk for data science. Splunk can be used for tasks like trend forecasting, anomaly detection, sentiment analysis, and market segmentation. It integrates data from various sources and allows querying and visualizing data. Splunk complements other data science tools by executing scripts from R and Python. Effective data visualizations are also important for communicating insights from data.
LaGatta and de Garrigues - Splunk for Data Science - .conf2014 - Tom LaGatta
Splunk is well-suited for data science tasks like trend forecasting, anomaly detection, and sentiment analysis. It can integrate diverse data sources and contains algorithms for prediction and classification. Data scientists can also leverage R, Python, and other tools through Splunk apps and APIs to perform advanced analysis like predictive modeling and customized visualizations.
Data scientists utilize a variety of tools and techniques to obtain insights from data. In this session, we discuss where and how Splunk fits into the data scientist's tool belt. We highlight Splunk’s built-in statistical capabilities and integrate external statistical and graphical tools to showcase data preparation, predictive modeling and visualization.
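One common integration path is Splunk's REST search API, which external Python code can call directly. A minimal sketch; host, credentials, and the index name are placeholders:

    # Run a Splunk search from Python via the REST export endpoint.
    import requests

    resp = requests.post(
        "https://splunk.example.com:8089/services/search/jobs/export",
        auth=("admin", "changeme"),                     # placeholder credentials
        data={"search": "search index=main | head 5", "output_mode": "json"},
        verify=False,   # self-signed certs are common on the management port
        stream=True,
    )
    for line in resp.iter_lines():                      # streamed JSON results
        if line:
            print(line.decode())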
Designing a synergistic relationship between undergraduate Data Science educa... - Ciera Martinez
Biodiversity data is extremely approachable – the concept of a specimen existing in time and place is easy to grasp and interesting to a wide range of people. I exploited this inherent feature of Biodiversity data to create an educational framework for teaching undergraduate Data Science. The project utilized Discovery Learning theory, based on the belief that it is best for learners to discover facts and relationships for themselves. Students were given a choice of databases and were mentored through an entire data analysis pipeline, including gathering, cleaning, analyzing, and visualizing the data. Their work culminates in a tutorial posted online (curiositydata.org) – instilling proper documentation, open science, and data management techniques. These tutorials can then be used and remixed as documentation for the databases, curricula, and workshops detailing how to access and analyze the databases' data. Increased documentation will overcome the accessibility challenges that plague many Biodiversity databases, with the overarching aim of increasing the usage, and in turn the value, of these vital data resources. Computer Science, Statistics, and Biology undergraduates are increasingly "data literate", and if mentored properly, we can foster a symbiotic relationship between real-world Data Science education and increased usability of Biodiversity databases.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
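Those steps compress into a familiar scikit-learn skeleton; the bundled iris dataset below stands in for real data gathering:

    # Gather -> prepare -> build -> validate, in miniature with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)                               # gather
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=0)       # prepare
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # build
    print(accuracy_score(y_te, model.predict(X_te)))                # validate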
Putting Linked Data to Use in a Large Higher-Education Organisation - Mathieu d'Aquin
The document discusses using linked data in a large higher education organization. It describes building a linked data platform for the Open University containing course, publication, media, and other university data. Several applications were developed using this linked data including a study tool, research evaluation support, and community/media analytics. Key lessons learned include the potential for simple yet useful applications, rapid development, and challenges of dealing with incomplete or heterogeneous data without application-specific assumptions. Overall, the experiences highlight both opportunities and common pitfalls of interacting with linked data at scale in a large organization.
"Big Data" is a term heard more and more in industry – but what does it really mean? There is a vagueness to the term reminiscent of the early days of cloud computing. This has led to a number of implications for various industries and enterprises, ranging from identifying the actual skills needed to recruit talent to articulating the requirements of a "big data" project. Secondary implications include difficulties in finding solutions that are appropriate to the problems at hand – versus solutions looking for problems. This presentation takes a look at Big Data and offers the audience some considerations they may use immediately to assess the use of analytics in solving their problems.
The talk begins with an idea of how big "Big Data" can be. This leads to an appreciation of how important "Management Questions" are to assessing analytic needs. The fields of data and analysis have become extremely important and impact nearly all facets of life and business. During the talk we will look at the two pillars of Big Data – Data Warehousing and Predictive Analytics. Then we will explore the open source tools and datasets available to NATO action officers working in this domain. Use cases relevant to NATO will be explored with the purpose of showing where analytics lies hidden within many of the day-to-day problems of enterprises. The presentation will close with a look at the future. Advances in the area of semantic technologies continue. The much-acclaimed consultants at Gartner listed Big Data and Semantic Technologies as the first- and third-ranked top technology trends to modernize information management in the coming decade. They note there is incredible value "locked inside all this ungoverned and underused information." HQ SACT can leverage this powerful analytic approach to capture requirement trends when establishing acquisition strategies, monitor Priority Shortfall Areas, prepare solicitations, and retrieve meaningful data from archives.
Similar to Entities, Graphs, and Crowdsourcing for better Web Search (20)
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction - eXascale Infolab
1) The document presents HINGE, a new method for embedding hyper-relational knowledge graphs that aims to better capture information from facts containing multiple relations and entities.
2) HINGE uses a CNN to learn representations from base triplets and their associated key-value pairs to characterize the plausibility of facts.
3) An evaluation on link prediction tasks shows HINGE outperforms baselines and demonstrates that the triplet structure encodes essential information, while other representations discard important information.
Representation Learning on Graphs with Complex Structures
Invited talk, Deep Learning for Graphs and Structured Data Embedding Workshop
WWW2019, San Francisco, May 13, 2019
A force directed approach for offline gps trajectory map - eXascale Infolab
SIGSPATIAL 2018 paper
A Force-Directed Approach for Offline GPS Trajectory Map Matching
Efstratios Rappos (University of Applied Sciences of Western Switzerland (HES-SO)),
Stephan Robert (University of Applied Sciences of Western Switzerland (HES-SO)),
Philippe Cudré-Mauroux (University of Fribourg)
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit... - eXascale Infolab
This document proposes HistoSketch, a method for sketching streaming histograms that preserves similarity and adapts to concept drift. It works by:
1) Generating weighted samples from histograms such that the probability two sketches match equals histogram similarity.
2) Incrementally updating sketches using a weight decay factor to forget older data and adapt to drift over time.
3) Evaluating HistoSketch on classification tasks involving synthetic and real-world streaming data, finding it approximates histogram similarity well using small, fixed-size sketches while adapting rapidly to drift.
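A crude illustration of these ideas: draw each sketch slot by weighted sampling (the exponential-variates trick), so that the probability two slots agree tracks histogram similarity, and decay weights to forget old data. This is a simplification, not the paper's consistent weighted sampling scheme:

    # Simplified similarity-preserving histogram sketch with decay.
    import hashlib
    import math

    def h(seed, key):
        # Deterministic hash of (seed, key) mapped into (0, 1].
        x = int(hashlib.sha1(f"{seed}:{key}".encode()).hexdigest(), 16)
        return (x % (2**32) + 1) / 2**32

    def sketch(hist, k=64):
        # Slot i samples an element with probability proportional to weight.
        return tuple(min(hist, key=lambda e: -math.log(h(i, e)) / hist[e])
                     for i in range(k))

    def decay(hist, factor=0.9):        # forget older observations
        return {e: w * factor for e, w in hist.items()}

    def similarity(s1, s2):             # fraction of agreeing slots
        return sum(a == b for a, b in zip(s1, s2)) / len(s1)

    print(similarity(sketch({"a": 5.0, "b": 1.0}), sketch(decay({"a": 4.0, "b": 2.0}))))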
This document presents SwissLink, a high-precision context-free entity linking system. It extracts unambiguous surface forms (labels) from knowledge bases like DBpedia and Wikipedia to link entity mentions without context. It catalogs the surface forms, removes ambiguous ones using ratio and percentile methods, and performs fast string matching to link mentions. Evaluation on 30 Wikipedia articles shows the percentile-ratio method achieves over 95% precision and 45% recall, balancing precision and recall.
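The unambiguous-surface-form idea can be mocked up directly; the anchor counts and the 0.9 ratio threshold below are invented:

    # Context-free linking via unambiguous surface forms (invented counts).
    counts = {
        "Barack Obama": {"dbpedia:Barack_Obama": 980, "dbpedia:Obama_(film)": 3},
        "Paris": {"dbpedia:Paris": 600, "dbpedia:Paris,_Texas": 350},
    }

    catalog = {}
    for form, ents in counts.items():
        entity, n = max(ents.items(), key=lambda kv: kv[1])
        if n / sum(ents.values()) >= 0.9:   # ratio method: drop ambiguous forms
            catalog[form] = entity

    text = "Barack Obama visited Paris yesterday."
    print({f: e for f, e in catalog.items() if f in text})  # only the safe link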
The document proposes a novel crowdsourcing system architecture and scheduling algorithm to address job starvation in multi-tenant crowd-powered systems. The architecture introduces HIT-Bundles to group heterogeneous tasks and control task serving. The Worker Conscious Fair Scheduling algorithm balances fairness and priority while minimizing worker context switching between tasks. Experiments on Amazon Mechanical Turk show the approach increases throughput over baseline schedulers and adapts to varying workforce levels and job priorities.
Efficient, Scalable, and Provenance-Aware Management of Linked Data - eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web requires data management systems to constantly improve their scalability and efficiency. Despite recent advances in distributed Linked Data management, efficiently processing large amounts of Linked Data in a scalable way is still very challenging. In spite of their seemingly simple data models, Linked Data actually encode rich and complex graphs mixing both instance and schema level data. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. The heterogeneity of Linked Data on the Web also poses new challenges to database systems. The capacity to store, track, and query provenance data is becoming a pivotal feature of Linked Data Management Systems. In this thesis, we tackle issues revolving around processing queries on big, unstructured, and heterogeneous Linked Data graphs.
This document summarizes a presentation given at SSSW 2015 on making sense of semantic data. It discusses challenges in understanding semantic web data, including a "language gap" between semantic web languages like SPARQL and natural language. It presents an approach to bridging this gap through automatically verbalizing SPARQL queries in English. Evaluation results show this helps non-experts understand queries better and faster than the SPARQL format. It also discusses the "semantic gap" caused by mismatches between a question's semantics and a knowledge graph, and presents an approach using templates to generate SPARQL queries from natural language questions.
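At its simplest, verbalization pairs a query shape with an English template. The single pattern and wording handled by this toy are invented, far short of the system described:

    # Toy template-based SPARQL verbalizer for one query shape.
    import re

    def verbalize(sparql):
        m = re.search(r"SELECT \?(\w+) WHERE \{ \?\1 :(\w+) :(\w+) \}", sparql)
        if not m:
            return "Query shape not supported by this toy verbalizer."
        var, prop, obj = m.groups()
        return f"Find every {var} whose '{prop}' is {obj}."

    print(verbalize("SELECT ?city WHERE { ?city :capitalOf :Switzerland }"))
    # -> Find every city whose 'capitalOf' is Switzerland.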
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data - eXascale Infolab
Uduvudu exploits the semantic and structured nature of Linked Data to generate the best possible representation for a human, based on a catalog of available Matchers and Templates. Matchers and Templates are designed so that they can be built through an intuitive editor interface.
Executing Provenance-Enabled Queries over Web Data - eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
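In RDF terms, provenance often rides along as the fourth element of a quad; rdflib's Dataset makes the pattern easy to demonstrate. The graph name and triple below are invented examples, not the paper's setup:

    # Tracking provenance as the fourth element of RDF quads with rdflib.
    from rdflib import Dataset, Literal, URIRef

    ds = Dataset()
    crawl = ds.graph(URIRef("urn:crawl:btc"))   # one named graph per source
    crawl.add((URIRef("urn:Fribourg"), URIRef("urn:population"), Literal(38000)))

    # Every query result carries the graph (i.e. the provenance) it came from.
    for s, p, o, g in ds.quads((None, None, None, None)):
        print(s, p, o, "from", g)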
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.
In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, and platform). (B) We leverage the main findings of our five-year log analysis to propose features used in a predictive model aimed at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and how recent the batch is are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
CIKM14: Fixing grammatical errors by preposition ranking - eXascale Infolab
The detection and correction of grammatical errors still represent very hard problems for modern error-correction systems. As an example, the top-performing systems at the preposition correction challenge CoNLL-2013 only achieved an F1 score of 17%.
In this paper, we propose and extensively evaluate a series of approaches for correcting prepositions, analyzing a large body of high-quality textual content to capture language usage. Leveraging n-gram statistics, association measures, and machine learning techniques, our system is able to learn which words or phrases govern the usage of a specific preposition. Our approach makes heavy use of n-gram statistics generated from very large textual corpora. In particular, one of our key features is the use of n-gram association measures (e.g., Pointwise Mutual Information) between words and prepositions to generate better aggregated preposition rankings for the individual n-grams.
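The association-measure feature at the heart of this ranking is easy to state: score each candidate preposition for a slot by its PMI with the governing word. The counts below are invented stand-ins for statistics from a large corpus:

    # Rank prepositions for "interested __ science" by pointwise mutual information.
    import math

    TOTAL = 1_000_000                       # total bigram observations (invented)
    pair = {("interested", "in"): 900, ("interested", "on"): 40,
            ("interested", "at"): 10}
    word = {"interested": 1_000, "in": 50_000, "on": 30_000, "at": 20_000}

    def pmi(w, prep):
        p_pair = pair[(w, prep)] / TOTAL
        return math.log2(p_pair / ((word[w] / TOTAL) * (word[prep] / TOTAL)))

    print(sorted(["in", "on", "at"], key=lambda p: pmi("interested", p),
                 reverse=True))             # -> ['in', 'on', 'at']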
We evaluate the effectiveness of our approach using cross-validation with different feature combinations and on two test collections created from a set of English language exams and StackExchange forums. We also compare against state-of-the-art supervised methods. Experimental results on the CoNLL-2013 test collection show that our approach to preposition correction achieves ~30% F1, a 13% absolute improvement over the best-performing approach at that challenge.
OLTPBenchmark is a multi-threaded load generator. The framework is designed to be able to produce variable rate, variable mixture load against any JDBC-enabled relational database. The framework also provides data collection features, e.g., per-transaction-type latency and throughput logs.
Together with the framework we provide the following OLTP/Web benchmarks:
TPC-C
Wikipedia
Synthetic Resource Stresser
Twitter
Epinions.com
TATP
AuctionMark
SEATS
YCSB
JPAB (Hibernate)
CH-benCHmark
Voter (Japanese "American Idol")
SIBench (Snapshot Isolation)
SmallBank
LinkBench
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series) - eXascale Infolab
Internet Infrastructures for Big Data
Talk given at Verisign's Distinguished Speaker Series, 2014
Prof. Philippe Cudre-Mauroux
eXascale Infolab
http://exascale.info/
This document discusses a project called MEM0R1ES that aims to automatically organize a person's digital information from various devices and online services to generate useful digital memories. The project develops techniques for entity search, typing, clustering, and elicitation to extract, integrate and expose personal information from heterogeneous graphs. It has produced several open-source software components and published results in top conferences. The document outlines current research directions and concludes that the project addresses important societal issues through stimulating collaboration between institutions.
Crowdsourcing is useful for curating information about tail entities, which are less popular entities like local restaurants, niche sports, or emerging music bands. Targeted crowdsourcing platforms like Pick-A-Crowd aim to match tasks to workers who can provide higher quality answers by considering a worker's social profile and task context. Transactive search uses the knowledge of crowds to reconstruct memories and answer questions by targeting the right people to search sources like Twitter photos or event attendees.
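The routing intuition, sending a task to the workers whose profiles best overlap with it, can be shown with simple set overlap. The profiles and task keywords are invented, and Pick-A-Crowd's actual matching is far richer:

    # Route a task to the best-matching workers by profile overlap (toy data).
    profiles = {
        "w1": {"football", "serie-a", "juventus"},
        "w2": {"indie-rock", "concerts", "vinyl"},
        "w3": {"restaurants", "street-food", "rome"},
    }
    task = {"rome", "restaurants", "trattoria"}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    ranked = sorted(profiles, key=lambda w: jaccard(task, profiles[w]), reverse=True)
    print(ranked[0])  # -> w3, the best-matching worker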
This document discusses the evolution of cluster computing and resource management. It describes how:
1) Early clusters were single-purpose and used technologies like MapReduce. General purpose cluster OSes like YARN emerged to allow multiple applications on a cluster.
2) YARN improved on Hadoop by decoupling the programming model from resource management, allowing more flexibility and better performance/availability.
3) REEF aims to further improve frameworks by factoring out common functionalities around communication, configuration, and fault tolerance.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers - akankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf - flufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Skybuffer SAM4U tool for SAP license adoption - Tatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
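Since the agenda above promises an implementation guide, here is a rough, hedged sketch of what an Atlas vector search call can look like from Python. The connection string, database, collection, index name, field names, and the toy 3-dimensional query vector are all placeholder assumptions, not details from the presentation.

    # Sketch of a MongoDB Atlas $vectorSearch aggregation (names are placeholders).
    from pymongo import MongoClient

    client = MongoClient("mongodb+srv://<user>:<pass>@cluster.example.mongodb.net")
    collection = client["shop"]["products"]

    query_vector = [0.12, -0.03, 0.88]  # embedding of the user query (toy length)

    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",   # Atlas Search index with a vector field
                "path": "embedding",       # field storing document embeddings
                "queryVector": query_vector,
                "numCandidates": 100,      # ANN candidates to consider
                "limit": 5,                # results to return
            }
        },
        {"$project": {"name": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
    for doc in collection.aggregate(pipeline):
        print(doc)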
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leverage that data for RAG and other GenAI use cases, and finally chart your course to production.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Entities, Graphs, and Crowdsourcing for better Web Search
1. Entities, Graphs, and Crowdsourcing for better Web Search
Gianluca Demartini
eXascale Infolab, University of Fribourg, Switzerland
2. Gianluca Demartini
• M.Sc. at University of Udine, Italy
• Ph.D. at University of Hannover, Germany – Entity Retrieval
• Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research (Spain), L3S Research Center (Germany)
• Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland
• Lecturer for Social Computing in Fribourg
• Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at ESWC 2013 and ISWC 2013
• Research Interests – Information Retrieval, Semantic Web, Crowdsourcing
demartini@exascale.info
5. Web of Data
• Freebase
  – Acquired by Google in July 2010
  – Knowledge Graph launched in May 2012
• Schema.org
  – Driven by major search engine companies
  – Machine-readable annotations of Web pages
• Linked Open Data
  – 31 billion triples, Sept. 2011
7. I will talk about
• Entity Linking/Disambiguation
  – On the Web using crowdsourcing
  – For scientific literature using graphs
• Ad-hoc Object Retrieval (Entity Ranking)
  – Using IR and graphs
• Crowdsourced Query Understanding
8. Disclaimer
• No efficiency evaluation
  – Approaches not distributed, but designed to scale out
• No user studies
  – Goal: obtain high-quality data
  – Only TREC-like evaluation of effectiveness
10. Example: RDFa enrichment
Entities: http://dbpedia.org/resource/Facebook ; http://dbpedia.org/resource/Instagram (owl:sameAs jase:Instagram); Google; Android

Original HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

After RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
11. Crowdsourcing
• Exploit human intelligence to solve tasks
  – Simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)
• Examples
  – Wikipedia, image tagging
• Incentives
  – Financial, fun, visibility
12. ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework
(Diagram labels: Crowd, Algorithms, Machines.)
13. ZenCrowd Architecture
(Architecture diagram: HTML pages go through entity extractors and algorithmic matchers, which query an LOD index built over the LOD Open Data Cloud; uncertain matches become micro matching tasks that a micro-task manager publishes on a crowdsourcing platform; workers' decisions feed a probabilistic network and decision engine, and the output is HTML+RDFa pages.)
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).
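To make the decision step concrete, here is a minimal Python sketch assuming a simple linear fusion of algorithmic matcher scores with reliability-weighted crowd votes. All names and the fusion weight are illustrative assumptions; the actual system uses a full probabilistic network rather than this simplification.

    # Minimal sketch of ZenCrowd-style decision making (illustrative only):
    # fuse an algorithmic matcher score with crowd votes weighted by each
    # worker's estimated reliability.
    def decide_link(candidates, crowd_votes, worker_reliability, alpha=0.5):
        """candidates: {uri: matcher_score in [0, 1]}
        crowd_votes: list of (worker_id, uri) votes
        worker_reliability: {worker_id: probability of a correct answer}"""
        vote_mass = {uri: 0.0 for uri in candidates}
        total = 0.0
        for worker, uri in crowd_votes:
            w = worker_reliability.get(worker, 0.5)  # unknown workers count as random
            if uri in vote_mass:
                vote_mass[uri] += w
                total += w
        scores = {}
        for uri, matcher_score in candidates.items():
            crowd_score = vote_mass[uri] / total if total > 0 else 0.0
            scores[uri] = alpha * matcher_score + (1 - alpha) * crowd_score
        return max(scores, key=scores.get)

    candidates = {"dbpedia:Instagram": 0.7, "dbpedia:Instagram_(song)": 0.4}
    votes = [("w1", "dbpedia:Instagram"), ("w2", "dbpedia:Instagram"),
             ("w3", "dbpedia:Instagram_(song)")]
    reliability = {"w1": 0.9, "w2": 0.8, "w3": 0.3}
    print(decide_link(candidates, votes, reliability))  # dbpedia:Instagram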
14. Algorithmic Matching
• Inverted index over LOD entities
  – DBPedia, Freebase, Geonames, NYT
• TF-IDF (IR ranking function)
• Top-ranked URIs linked to entities in docs
• Threshold on the ranking function, or top N
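As a rough illustration of this matching step (my own reconstruction with made-up entity labels, not code from the deck), one could build the label index with TF-IDF and keep the top-N candidates above a score threshold:

    # Sketch of TF-IDF candidate matching over entity labels (illustrative).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    entities = {
        "http://dbpedia.org/resource/Facebook": "Facebook social network company",
        "http://dbpedia.org/resource/Instagram": "Instagram photo sharing application",
        "http://dbpedia.org/resource/Instagram_(song)": "Instagram song single",
    }
    uris = list(entities)
    vectorizer = TfidfVectorizer()
    index = vectorizer.fit_transform([entities[u] for u in uris])  # the "inverted index"

    def match(mention, top_n=2, threshold=0.1):
        query = vectorizer.transform([mention])
        scores = linear_kernel(query, index).ravel()  # cosine on TF-IDF vectors
        ranked = sorted(zip(uris, scores), key=lambda pair: -pair[1])
        return [(u, s) for u, s in ranked[:top_n] if s >= threshold]

    print(match("Instagram photo application"))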
16. Entity Factor Graphs
• Training phase
  – Initialize worker priors with k matches on known answers
• Updating worker priors
  – Use link decisions as new observations
  – Compute new worker probabilities
• Identify (and discard) unreliable workers
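The slide only names the idea; as a hedged sketch, one simple way to maintain a worker prior is a Beta distribution initialized from k gold questions and updated as link decisions are resolved. The smoothing and the discard threshold below are my assumptions, and the paper's factor-graph machinery is richer than this:

    # Beta-prior sketch of worker reliability (a simplification of the idea).
    class WorkerPrior:
        def __init__(self, correct_on_gold, k):
            # Initialize from k known-answer ("gold") matches, with Laplace smoothing.
            self.alpha = 1 + correct_on_gold
            self.beta = 1 + (k - correct_on_gold)

        def observe(self, was_correct):
            # Each resolved link decision is a new observation.
            if was_correct:
                self.alpha += 1
            else:
                self.beta += 1

        @property
        def reliability(self):
            return self.alpha / (self.alpha + self.beta)

    w = WorkerPrior(correct_on_gold=4, k=5)
    w.observe(True)
    w.observe(False)
    print(round(w.reliability, 2))   # 0.67
    if w.reliability < 0.55:         # threshold is an assumption
        print("discard unreliable worker")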
19. Lessons Learnt
• Crowdsourcing + probabilistic reasoning works!
• But
  – Different worker communities perform differently
  – Many low-quality workers
  – Completion time may vary (based on reward)
• Need to find the right workers for your task (see WWW13 paper)
20. ZenCrowd Summary
• ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic approaches
• 4% to 35% improvement over standard crowdsourcing
• 14% average improvement over automatic approaches
• On-going work:
  – Also used for instance matching across datasets
  – 3-way blocking with the crowd
http://exascale.info/zencrowd/
21. Entity Disambiguation in Scientific Literature
• Using a background concept graph
Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux, Alexey Boyarsky, and Oleg Ruchayskiy. Ontology-Based Word Sense Disambiguation in the Scientific Domain. In: 35th European Conference on Information Retrieval (ECIR 2013).
http://scienceWISE.info/
23. Ad-hoc Object Retrieval
• Once entities have been identified…
• We want to rank them as answers to a query
• AOR
  – Given the description of an entity, give me back its identifier
  – Input: query q, data graph G
  – Output: ranked list of URIs from G
24. A Hybrid Approach to AOR
Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012).
(Architecture diagram: the user's keyword query is annotated and expanded using WordNet, third-party search engines, and pseudo-relevance feedback; an inverted index over the RDF store (index()/query()) returns intermediate top-k results via entity-search ranking functions; graph traversals (queries on object properties) and neighborhoods (queries on datatype properties) then produce graph-enriched results, merged by a final ranking function.)
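A toy sketch of the hybrid idea, keyword retrieval first and then graph-based enrichment; the data, the additive re-scoring, and the boost parameter are my assumptions rather than the SIGIR paper's actual ranking functions:

    # Toy hybrid AOR: inverted-index top-k, then enrich scores with graph neighbors.
    keyword_scores = {            # pretend output of the inverted index (BM25-like)
        "ex:Forrest_Gump": 3.2,
        "ex:Tom_Hanks": 1.1,
        "ex:Forrest_Gump_(novel)": 2.0,
    }
    graph = {                     # object-property edges in the RDF store
        "ex:Forrest_Gump": ["ex:Tom_Hanks", "ex:Robert_Zemeckis"],
        "ex:Forrest_Gump_(novel)": ["ex:Winston_Groom"],
    }

    def hybrid_rank(scores, graph, k=3, boost=0.5):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        final = dict(scores)
        for uri in top_k:
            for neighbor in graph.get(uri, []):  # graph traversal step
                # Entities connected to strong keyword matches get boosted,
                # and unseen neighbors enter the result list.
                final[neighbor] = final.get(neighbor, 0.0) + boost * scores[uri]
        return sorted(final.items(), key=lambda pair: -pair[1])

    for uri, score in hybrid_rank(keyword_scores, graph):
        print(f"{score:5.2f}  {uri}")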
27. Summary
• AOR = "Given the description of an entity, give me back its identifier"
• Combining classic IR techniques + a structured database storing graph data
• Significantly better results (up to +25% MAP over a BM25 baseline)
• Overhead caused by the graph traversal part is limited
http://exascale.info/AOR/
33. Motivation
• Web search engines can answer simple factual queries directly on the result page
• Users with complex information needs are often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with crowdsourcing!
34. CrowdQ
• CrowdQ is the first system that uses crowdsourcing to
  – Understand the intended meaning of a query
  – Build a structured query template
  – Answer the query over Linked Open Data
Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).
36. CrowdQ Architecture
(Architecture diagram. Off-line complex query decomposition: queries from a query log are POS- and NER-tagged, and template generation, driven by a crowd manager and a crowdsourcing platform, stores query templates and answer types in a query template index. On-line complex query processing: a user keyword query first hits a complex-query classifier; non-complex queries go to vertical selection and unstructured search, while complex queries are matched against existing query templates, answered by structured LOD search over the LOD Open Data Cloud, and assembled by a result joiner and answer composition into the SERP.)
Off-line: query template generation with the help of the crowd.
On-line: query template matching using NLP and search over open data.
37. Hybrid Human-Machine Pipeline
Example query: Q = "birthdate of actors of forrest gump"
• Query annotation: "birthdate" and "actors" are tagged as nouns, "forrest gump" as a named entity
• Verification (crowd): Is "forrest gump" this entity in the query?
• Entity relations (crowd): Which is the relation between "actors" and "forrest gump"? → starring
• Schema element: starring maps to <dbpedia-owl:starring>
• Verification (crowd): Are the relations Indiana Jones – Harrison Ford and Back to the Future – Michael J. Fox of the same type as Forrest Gump – actors?
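To illustrate how such verification micro-tasks might be generated from the annotated query, here is a sketch; the task wording and data structures are mine, not CrowdQ's:

    # Sketch: turning query annotations into crowd verification micro-tasks.
    annotated = {
        "query": "birthdate of actors of forrest gump",
        "nouns": ["birthdate", "actors"],
        "entities": ["forrest gump"],
    }

    def make_tasks(q):
        tasks = []
        for ent in q["entities"]:
            tasks.append(f'Is "{ent}" the entity meant in the query "{q["query"]}"? (yes/no)')
        for noun in q["nouns"]:
            for ent in q["entities"]:
                tasks.append(f'Which relation holds between "{noun}" and "{ent}"?')
        return tasks

    for task in make_tasks(annotated):
        print(task)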
38. Structured query generation
SELECT ?y ?x WHERE {
  ?y <dbpedia-owl:birthdate> ?x .
  ?z <dbpedia-owl:starring> ?y .
  ?z <rdfs:label> 'Forrest Gump'
}
Results from BTC09 for Q = "birthdate of actors of forrest gump" (template entity slots typed MOVIE).
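For readers who want to try the generated query shape, here is a sketch using the SPARQLWrapper Python library against today's public DBpedia endpoint. Note the live vocabulary differs slightly from the slide (dbo:birthDate and dbo:starring rather than the template's birthdate), and endpoint availability is assumed.

    # Sketch: run a query of the same shape as the generated template against
    # the live DBpedia endpoint (vocabulary adapted to dbo:birthDate, dbo:starring).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?actor ?birth WHERE {
            ?film rdfs:label "Forrest Gump"@en .
            ?film dbo:starring ?actor .
            ?actor dbo:birthDate ?birth .
        }
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["actor"]["value"], row["birth"]["value"])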
39. Conclusions
• Structured data make Web search better
• Exploit the best out of structured and unstructured data (hybrid AOR)
• The crowd can help in understanding semantics
• Hybrid human-machine systems (ZenCrowd)
• Exploit human intelligence at scale (CrowdQ)
gianlucademartini.net demartini@exascale.info