Over the last few years we have observed the emergence of hybrid human-machine information systems that can both scale over large amounts of data and maintain the high-quality data processing intrinsic to human intelligence.
In this talk I will focus on the use of human intelligence at scale, by means of crowdsourcing, to deal with Big Data problems. We will look specifically at how to handle variety in data by means of Human Computation while still operating at large data volumes.
First, I will introduce the area of micro-task crowdsourcing and provide an overview of the research challenges that need to be tackled to enable large-scale hybrid human-machine information systems. Next, I will present examples of such hybrid systems for entity linking and disambiguation, which use crowdsourcing and a graph of linked entities as a background corpus. I will describe how keyword query understanding can be crowdsourced to build search engines that can answer rare complex queries. Finally, I will present new techniques that improve the quality of crowdsourced information system components by means of push crowdsourcing.
The document discusses several topics related to search engines and online information, including:
1) The PageRank algorithm and its extensions over time to provide more contextually relevant search results.
2) Concerns about privacy and concentration of power as collective intelligence and user data is concentrated within large tech companies.
3) Differences in search results between engines and regions due to factors like censorship and localized information.
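The PageRank idea mentioned above can be illustrated with a short power-iteration sketch. This is a minimal, self-contained toy (not the production algorithm, which handles dangling nodes, personalization, and web-scale sparsity), assuming a small dense adjacency matrix:

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-9, max_iter=100):
    """Power-iteration PageRank on a small adjacency matrix (toy sketch)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    out_deg[out_deg == 0] = 1                 # avoid division by zero
    M = (adj / out_deg[:, None]).T            # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)                   # start from a uniform rank vector
    for _ in range(max_iter):
        r_next = (1 - d) / n + d * (M @ r)    # random jump + link-following
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r

# Tiny 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
ranks = pagerank(adj)
```

Page 2, which receives links from both other pages, ends up with the highest rank; contextual extensions of PageRank modify the jump vector or the link weights rather than this core iteration.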
Slides of the presentation given at the 22nd International Conference on the World Wide Web.
URL: http://www2013.org/program/561-reactive-crowdsourcing/
More information on the CrowdSearcher project is available at crowdsearcher.search-computing.com
The document discusses the concept of "Broad Data" which refers to the large amount of freely available but widely varied open data on the World Wide Web, including structured and semi-structured data. It provides examples such as the growing linked open data cloud and over 710,000 datasets available from governments around the world. Broad data poses new challenges for data search, modeling, integration and visualization of partially modeled datasets. International open government data search and linking government data to additional contexts are also discussed.
Answering Search Queries with CrowdSearcher: a crowdsourcing and social netwo... (Marco Brambilla)
Web users are increasingly relying on social interaction to complete and validate the results of their search activities. While search systems are superior machines for retrieving world-wide information, the opinions collected from friends and expert/local communities can ultimately determine our decisions: human curiosity and creativity are often capable of going far beyond the capabilities of search systems in scouting “interesting” results, or suggesting new, unexpected search directions. Such personalized interaction typically occurs outside of the search systems and processes, possibly instrumented and mediated by a social network; when such interaction is completed and users return to the search systems, they do so through new queries, loosely related to the previous search or to the social interaction.
In this paper we propose CrowdSearcher, a novel search paradigm that embodies crowds as first-class sources for the information seeking process. CrowdSearcher aims at filling the gap between generalized search systems, which operate upon world-wide information (including facts and recommendations as crawled and indexed by computerized systems), and social systems, capable of interacting with real people, in real time, to capture their opinions, suggestions, and emotions. The technical contribution of this paper is a model and architecture for integrating computerized search with human interaction, showing how search systems can drive and encapsulate social systems. In particular, we show how social platforms such as Facebook, LinkedIn and Twitter can be used for crowdsourcing search-related tasks; we demonstrate our approach with several prototypes and report on experiments with real user communities.
Choosing the right crowd. Expert finding in social networks. EDBT 2013 (Marco Brambilla)
The document discusses using social networks and Q&A websites as platforms for crowd-searching in addition to traditional crowdsourcing platforms. It proposes a model for crowd-searching that utilizes social interactions on these platforms to find experts and get feedback on queries. The model involves initially searching for information, then promoting queries on social platforms to find friends and experts, and aggregating the responses. It provides examples of how this process may work for a job search query. Experimental results showed that questions posted on social networks received more responses than random questions, and that engagement depended on the difficulty and type of task.
The document discusses using open data and linked data on the web. It begins by defining open government data and its benefits like transparency and participation. It then explains how the semantic web uses linked data to connect related data across the web. Examples are given of government and other datasets that are available as linked open data. The presentation concludes by proposing future interdisciplinary collaboration to further develop applications using open and linked data.
Platfora is a big data analytics platform that transforms raw data in Hadoop into interactive business intelligence without requiring a separate data warehouse or ETL process. The document discusses Platfora's solution to challenges users face with existing approaches, such as non-intuitive interfaces for non-database administrators. It also summarizes user research findings that access to data is difficult and preparation requires many tools. The document proposes designing an interface that visualizes the full data model to address these issues.
DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ... (Cataldo Musto)
Opening presentation for DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for Personalized and Intelligent Services, held in Dublin on June 30, 2015.
In this fifth session of the Elements of AI Luxembourg series of webinars, our guest speaker and co-organizer Prof. Martin Theobald talks about Current Topics and Trends in Big Data Analytics. More information, and a recording of the session, can be found on our reddit page:
eofai.lu/reddit
This document discusses two presentations on cognition for the semantic web. The first presentation discusses methods for involving humans in semantic data management, including crowdsourcing, citizen science, and games with a purpose. It provides examples of how these techniques can be used for tasks like data linking and validation. The second presentation discusses building cognitive and semantic systems to support understanding data and phenomena through visual examples. It aims to explain why and how these systems can make sense of data and foster understanding.
This document provides an outline for a thesis proposal on analyzing the spread of information in social networks. The proposal discusses previous work analyzing social media data and developing analysis tools. It also outlines current and planned research projects applying these tools to study specific events and actor types. The overall goal is to better understand how information spreads in social networks and identify influential users.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction”: understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
The document provides an overview of the author's new book "Data Dynamite" which argues that liberating data through greater transparency and accessibility will transform our world. The author discusses how data is currently locked away in warehouses inaccessible to most citizens and employees. However, initiatives like Data.gov are beginning to make more government data available. The book advocates for a strategic approach to transitioning organizations to be more data-centric and ensuring data is structured and tagged so it can be automatically shared and analyzed through visualization tools to generate insights.
Open Government Data, Linked Data, and the Missing Blocks in Korea (Haklae Kim)
This presentation discusses open government data and linked data. It provides examples of how open data initiatives from different governments have increased transparency and civic participation. Linked data practices are presented as a way to interconnect disparate datasets using semantic web standards. While Korea has strong e-government infrastructure, the presentation argues more can be done to implement open data and linked data practices. Participatory approaches are advocated to help design open data policies and solutions.
1) Entity-centric data management stores information at the entity level and integrates information by interlinking entities. This provides advantages over keyword-based and relational database approaches.
2) The XI Pipeline extracts mentions from text and performs named entity recognition, entity linking, and entity typing to associate entities with text.
3) Approaches like ZenCrowd and TRank leverage both algorithms and human computation through crowdsourcing to improve entity linking and fine-grained entity typing.
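The hybrid human-machine idea in point 3 can be sketched in a few lines: an algorithmic matcher scores candidate entities for a mention, and crowd votes are blended in before deciding. This is a hypothetical illustration (the function name, the 50/50 weighting, and the 0.5 acceptance threshold are all assumptions, not the published ZenCrowd model):

```python
from collections import Counter

def link_entity(mention, candidates, crowd_votes, threshold=0.5):
    """Pick a candidate URI for a mention by blending an algorithmic
    matcher score with crowd votes (hypothetical blending scheme)."""
    vote_counts = Counter(crowd_votes)            # URI -> number of crowd votes
    total_votes = sum(vote_counts.values()) or 1
    best_uri, best_score = None, 0.0
    for uri, matcher_score in candidates.items():
        crowd_score = vote_counts[uri] / total_votes
        score = 0.5 * matcher_score + 0.5 * crowd_score
        if score > best_score:
            best_uri, best_score = uri, score
    # Reject the link entirely when no candidate is convincing enough
    return best_uri if best_score >= threshold else None

candidates = {"dbpedia:Paris": 0.6, "dbpedia:Paris_Hilton": 0.4}
votes = ["dbpedia:Paris", "dbpedia:Paris", "dbpedia:Paris_Hilton"]
link = link_entity("Paris", candidates, votes)
```

Here two of three workers agree with the matcher, so the mention resolves to "dbpedia:Paris"; when neither signal is strong, the function abstains, which is where systems like ZenCrowd route the task back to the crowd.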
The document discusses issues with how computer science has directed the development of search systems, focusing on efficiency over user experience. It argues search systems have paid minimal attention to the user experience beyond results relevance and ad-matching. The goal of the plenary is to inspire designing search experiences that do more than just sell products well.
This document summarizes a workshop on crowd-sourcing given by Dr. Gianluca Demartini. The workshop covered an introduction to crowd-sourcing, examples of using crowds to conduct research from Heidelberg Laureate Forum participants, and the challenges and opportunities of crowd-sourcing. Specific topics discussed included defining crowd-sourcing, incentives for crowdsourcing work, examples like citizen science and reCAPTCHA, a case study of Amazon Mechanical Turk, and addressing the ethical issues of crowd-sourcing.
Democratizing Data within your organization - Data Discovery (Mark Grover)
In this talk, we discuss the challenges of operating at scale in an organization like Lyft. We delve into data discovery as a key challenge in democratizing data within your organization, and go into detail on the solution built to address it.
Lyft developed Amundsen, an internal metadata and data discovery platform, to help their data scientists and engineers find data more efficiently. Amundsen provides search-based and lineage-based discovery of Lyft's data resources. It uses a graph database and Elasticsearch to index metadata from various sources. While initially built using a pull model with crawlers, Amundsen is moving toward a push model where systems publish metadata to a message queue. The tool has increased data team productivity by over 30% and will soon be open sourced for other organizations to use.
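The pull-to-push shift described above can be sketched with a toy producer/consumer: systems publish metadata events to a queue as changes happen, and an indexer drains the queue into the search store. This is a minimal illustration using an in-memory queue and a dict; Amundsen itself uses a real message queue and Elasticsearch, and the event fields here are assumptions:

```python
import queue

metadata_queue = queue.Queue()

def publish(table_event):
    """A producing system (e.g. a warehouse job) pushes metadata on change,
    instead of waiting for a crawler to pull it."""
    metadata_queue.put(table_event)

def index_worker(search_index):
    """Consumer drains the queue and updates the search index
    (a plain dict here, standing in for Elasticsearch)."""
    while not metadata_queue.empty():
        event = metadata_queue.get()
        search_index[event["name"]] = event
    return search_index

publish({"name": "rides.daily", "owner": "data-eng", "description": "Daily ride counts"})
publish({"name": "users.core", "owner": "growth", "description": "Core user table"})
index = index_worker({})
```

The design point is decoupling: producers do not need to know how many indexers exist, and freshness is bounded by queue lag rather than by a crawl schedule.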
Crowdsourcing is an online, distributed problem-solving and production model that has revolutionized the present-day internet and mobile markets. It turns customers into designers and marketers. The practice of crowdsourcing is transforming the web and giving rise to a new field. Today, leading enterprises are embracing the next paradigm shift in the distribution of work by outsourcing to the crowd in the cloud. Every day, millions of people make all kinds of voluntary online contributions. With the number of people online approaching 3 billion by 2016 and projected to reach 5 billion by 2020, a new workforce has emerged that is now used for many different purposes. Available on demand, this workforce has abundant capacity and the expert knowledge to perform work from simple to complex and to solve problems and grand challenges. This paper gives an introduction to crowdsourcing, its theoretical grounding, models, and examples with a case study. We show that crowdsourcing can be applied to a wide variety of problems and that it raises numerous interesting technical and social challenges. Finally, the paper proposes an agenda for using crowdsourcing in NLP.
How can we mine, analyse and visualise the Social Web?
In this lecture, you will learn about mining social web data for analysis, including data preparation and gathering basic statistics on your data.
Amundsen is a metadata-driven application developed by Lyft to solve data discovery challenges. It provides a search-based UI and uses a distributed architecture with various microservices to index and serve metadata from multiple sources. Key components include a metadata service using Neo4j, a search service using Elasticsearch, and a frontend. The tool has been hugely successful at Lyft and is now open source. Future work includes expanding metadata coverage and integrating with other tools.
The document discusses emerging technologies that are impacting education, including massive open online courses (MOOCs), mobile learning, bring your own device (BYOD), gamification, badges, learning analytics, and adaptive learning. It also covers internet of things, augmented reality, wearable technologies, 3D printing, quantum computing, and crowd-sourced projects. The trends show that people expect to learn and work anywhere using their own mobile devices and through personalized and adaptive technologies.
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014 (eswcsummerschool)
This document discusses combining the social web and semantic web through crowdsourcing. It defines key concepts like the social web, crowdsourcing, and semantic technologies. It then provides examples of how semantic tasks can be crowdsourced, such as annotating research papers, mapping topics to ontologies, and curating linked data. Challenges with crowdsourcing semantic tasks are also explored, such as how to optimally structure tasks and validate crowd responses.
(1) Amundsen is a data discovery platform developed by Lyft to help users find, understand, and use data.
(2) The platform addresses challenges around data discovery such as lack of understanding about what data exists and where to find it.
(3) Amundsen provides searchable metadata about data resources, previews of data, and usage statistics to help data scientists and others explore and understand data.
Entities, Graphs, and Crowdsourcing for better Web Search (eXascale Infolab)
Gianluca Demartini presented on using entities, graphs, and crowdsourcing for better web search. He discussed using crowdsourcing to perform entity linking and disambiguation on web pages through a system called ZenCrowd. ZenCrowd combines algorithmic and manual linking by automating manual linking via crowdsourcing tasks and using a probabilistic reasoning framework to assess workers. He also discussed using entity factor graphs for scientific literature disambiguation by modeling workers, links, clicks, and constraints within a probabilistic framework. The system was experimentally evaluated on news articles from several sources linked to entities from knowledge bases like Freebase and DBPedia.
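A core ingredient of ZenCrowd's probabilistic framework is weighting workers by an estimated reliability rather than counting raw votes. The sketch below is a heavily simplified stand-in (a reliability-weighted vote with an assumed 0.5 acceptance threshold, not the actual factor-graph inference):

```python
def weighted_link_decision(votes, reliability, accept=0.5):
    """Aggregate crowd votes on one candidate entity link, weighting each
    worker's yes/no answer by an estimated reliability in [0, 1].
    Simplified stand-in for probabilistic worker assessment."""
    weight_yes = sum(reliability[w] for w, v in votes.items() if v)
    weight_all = sum(reliability[w] for w in votes)
    confidence = weight_yes / weight_all if weight_all else 0.0
    return confidence >= accept, confidence

# Two workers say the link is valid, one says it is not; the dissenter
# has a poor track record, so the link is accepted with high confidence.
votes = {"w1": True, "w2": True, "w3": False}
reliability = {"w1": 0.9, "w2": 0.6, "w3": 0.3}
accepted, conf = weighted_link_decision(votes, reliability)
```

In the full system the reliabilities themselves are latent variables, updated jointly with the link decisions as more answers arrive.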
Microtask Crowdsourcing Applications for Linked Data (EUCLID project)
This document discusses using microtask crowdsourcing to enhance linked data applications. It describes how crowdsourcing can be used in various components of the linked data integration process, including data cleansing, vocabulary mapping, and entity interlinking. Specific crowdsourcing applications and systems are discussed that address tasks like assessing the quality of DBpedia triples, entity linking with ZenCrowd, and understanding natural language queries with CrowdQ. The results show that crowdsourcing can often improve the results of automated techniques for various linked data tasks and help integrate and enhance large linked data sources.
This presentation hopes to illuminate how Search, Content Strategy, Information Architecture, User Experience, Interaction Design can break down silos to take back relevance. Because, in the end, we, the people, should be the arbiters of experience, not machines and certainly not math.
1) Intelligent software assistants are agents that can perform tasks with minimal direction from users by interacting through natural conversation.
2) Siri was an early virtual personal assistant that could answer questions, make recommendations, and perform actions through natural language.
3) Developing intelligent assistants that can interact with the world, users' friends and contacts, and the user themselves remains a challenge and area of ongoing research.
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
By Nikolai Avteniev (Sr Software Engineer, LinkedIn)
LinkedIn is the professional profile of record for our 370M+ members globally, but many people don't realize the full potential of their LinkedIn profile – especially on mobile. Adding blogs, photos and other rich content to your profile on a small screen device can get tedious. That's why LinkedIn created Satori, a Hadoop tool that crawls the web and extracts data to discover members' professional content online. Satori uses machine learning techniques and leverages other open source tools like Nutch and Gobblin in order to help match members with relevant content in order to maximize their professional profile. In this talk, Nikolai will share his experience in building the product and discuss the challenges and opportunities encountered along the way.
Lecture 5: Mining, Analysis and VisualisationMarieke van Erp
This is the fourth lecture in the Social Web course at the VU University Amsterdam
Visit the website for more information: <a>Social Web 2012</a>
talk for HK SME center about web3.0 , AI, mobile appsAlex Hung
Mr. Alex Hung gave a presentation on how mobile apps help develop various industries through Web 3.0 technologies like semantic web. He discussed how semantic web allows machines to understand web content through metadata tags, enabling intelligent agents to perform tasks like online shopping. Examples were given of mobile apps developed for cultural heritage sites, language learning, e-books, and healthcare. Future pervasive apps may provide customized information based on location, time, user preferences through social media integration.
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictioneXascale Infolab
1) The document presents HINGE, a new method for embedding hyper-relational knowledge graphs that aims to better capture information from facts containing multiple relations and entities.
2) HINGE uses a CNN to learn representations from base triplets and their associated key-value pairs to characterize the plausibility of facts.
3) An evaluation on link prediction tasks shows HINGE outperforms baselines and demonstrates that the triplet structure encodes essential information, while other representations discard important information.
Representation Learning on Graphs with Complex Structures
Invited talk, Deep Learning for Graphs and Structured Data Embedding Workshop
WWW2019, San Francisco, May 13, 2019
A force directed approach for offline gps trajectory mapeXascale Infolab
SIGSPATIAL 2018 paper
A Force-Directed Approach for Offline GPS Trajectory Map Matching
Efstratios Rappos (University of Applied Sciences of Western Switzerland (HES-SO)),
Stephan Robert (University of Applied Sciences of Western Switzerland (HES-SO)),
Philippe Cudré-Mauroux (University of Fribourg)
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...eXascale Infolab
This document proposes HistoSketch, a method for sketching streaming histograms that preserves similarity and adapts to concept drift. It works by:
1) Generating weighted samples from histograms such that the probability two sketches match equals histogram similarity.
2) Incrementally updating sketches using a weight decay factor to forget older data and adapt to drift over time.
3) Evaluating HistoSketch on classification tasks involving synthetic and real-world streaming data, finding it approximates histogram similarity well using small, fixed-size sketches while adapting rapidly to drift.
This document presents SwissLink, a high-precision context-free entity linking system. It extracts unambiguous surface forms (labels) from knowledge bases like DBpedia and Wikipedia to link entity mentions without context. It catalogs the surface forms, removes ambiguous ones using ratio and percentile methods, and performs fast string matching to link mentions. Evaluation on 30 Wikipedia articles shows the percentile-ratio method achieves over 95% precision and 45% recall, balancing precision and recall.
The document proposes a novel crowdsourcing system architecture and scheduling algorithm to address job starvation in multi-tenant crowd-powered systems. The architecture introduces HIT-Bundles to group heterogeneous tasks and control task serving. The Worker Conscious Fair Scheduling algorithm balances fairness and priority while minimizing worker context switching between tasks. Experiments on Amazon Mechanical Turk show the approach increases throughput over baseline schedulers and adapts to varying workforce levels and job priorities.
This document presents SANAPHOR, an ontology-based coreference resolution system that improves upon existing approaches by leveraging semantic information. It first links entities in document clusters to semantic types and ontologies. It then splits or merges clusters based on these semantic relationships. The system was evaluated on the CoNLL-2012 dataset, where it improved coreference resolution performance over the baseline Stanford system, particularly for noun clusters. By utilizing semantic knowledge, SANAPHOR demonstrates the benefits of enhancing syntactic coreference resolution with an additional semantic layer.
Efficient, Scalable, and Provenance-Aware Management of Linked DataeXascale Infolab
The proliferation of heterogeneous Linked Data on the Web requires data management systems to constantly improve their scalability and efficiency. Despite recent advances in distributed Linked Data management, efficiently processing large amounts of Linked Data in a scalable way is still very challenging. In spite of their seemingly simple data models, Linked Data actually encode rich and complex graphs mixing both instance and schema level data. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. The heterogeneity of Linked Data on the Web also poses new challenges to database systems. The capacity to store, track, and query provenance data is becoming a pivotal feature of Linked Data Management Systems. In this thesis, we tackle issues revolving around processing queries on big, unstructured, and heterogeneous Linked Data graphs.
This document summarizes a presentation given at SSSW 2015 on making sense of semantic data. It discusses challenges in understanding semantic web data, including a "language gap" between semantic web languages like SPARQL and natural language. It presents an approach to bridging this gap through automatically verbalizing SPARQL queries in English. Evaluation results show this helps non-experts understand queries better and faster than the SPARQL format. It also discusses the "semantic gap" caused by mismatches between a question's semantics and a knowledge graph, and presents an approach using templates to generate SPARQL queries from natural language questions.
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataeXascale Infolab
Uduvudu exploits the semantic and structured nature of Linked Data to generate the best possible representation for a human based on a catalog of available Matchers and Templates. Matchers and Templates are designed that they can be build through an intuitive editor interface.
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.
In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, and platform). (B) We leverage the main findings of our five year log analysis to propose features used in a predictive model aiming at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and how recent the batch is are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by the requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...eXascale Infolab
This document proposes three methods - LEXT, REXT, and LERIXT - for disambiguating the domain and range of properties in linked data by using context information. LEXT uses the type of subject resources, REXT uses the type of object resources, and LERIXT uses both. The methods were evaluated against expert judgments and achieved up to 96.5% precision for LEXT and 91.4% for REXT. LERIXT generated too many new sub-properties.
CIKM14: Fixing grammatical errors by preposition rankingeXascale Infolab
The detection and correction of grammatical errors still represent very hard problems for modern error-correction systems. As an example, the top-performing systems at the preposition correction challenge CoNLL-2013 only achieved a F1 score of 17%.
In this paper, we propose and extensively evaluate a series of approaches for correcting prepositions, analyzing a large body of high-quality textual content to capture language usage. Leveraging n-gram statistics, association measures, and machine learning techniques, our system is able to learn which words or phrases govern the usage of a specific preposition. Our approach makes heavy use of n-gram statistics generated from very large textual corpora. In particular, one of our key features is the use of n-gram association measures (e.g., Pointwise Mutual Information) between words and prepositions to generate better aggregated preposition rankings for the individual n-grams.
We evaluate the effectiveness of our approach using cross-validation with different feature combinations and on two test collections created from a set of English language exams and StackExchange forums. We also compare against state-of-the-art supervised methods. Experimental results from the CoNLL-2013 test collection show that our approach to preposition correction achieves ~30% in F1 score which results in 13% absolute improvement over the best performing approach at that challenge.
OLTPBenchmark is a multi-threaded load generator. The framework is designed to be able to produce variable rate, variable mixture load against any JDBC-enabled relational database. The framework also provides data collection features, e.g., per-transaction-type latency and throughput logs.
Together with the framework we provide the following OLTP/Web benchmarks:
TPC-C
Wikipedia
Synthetic Resource Stresser
Twitter
Epinions.com
TATP
AuctionMark
SEATS
YCSB
JPAB (Hibernate)
CH-benCHmark
Voter (Japanese "American Idol")
SIBench (Snapshot Isolation)
SmallBank
LinkBench
CH-benCHmark
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)eXascale Infolab
Internet Infrastructures for Big Data
Talk given at Verisign's Distinguished Speaker Series, 2014
Prof. Philippe Cudre-Mauroux
eXascale Infolab
http://exascale.info/
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
The cost of acquiring information by natural selectionCarl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Human Computation for Big Data
1. Human Computation for Big Data
Gianluca Demartini
eXascale Infolab
University of Fribourg, Switzerland
gianlucademartini.net
exascale.info
CUSO Seminar on Big Data – May 23, 2014 – Fribourg
2. Gianluca Demartini
• M.Sc. at University of Udine, Italy
• Ph.D. at University of Hannover, Germany
– Entity Retrieval
• Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research
(Spain), L3S Research Center (Germany)
• Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland.
• Lecturer for Social Computing in Fribourg
• Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at
ESWC 2013 and ISWC 2013
• Research Interests
– Information Retrieval, Semantic Web, Human Computation
demartini@exascale.info
Gianluca Demartini 2
3. Web of Data
• Freebase
– Acquired by Google in July 2010.
– Knowledge Graph launched in May 2012.
• Schema.org
– Driven by major search engine companies
– Machine-readable annotations of Web pages
• Linked Open Data
– 31 billion triples, Sept. 2011
• Volume and Variety
Gianluca Demartini 3
5. LOD data is an enormous graph
• Subject – Predicate – Object
– Barack Obama – marriedTo – Michelle Obama
• Specific scalable DB systems exist
Gianluca Demartini 5
(Figure: a small RDF graph with entities e1–e4 connected by predicates p1–p3)
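The Subject–Predicate–Object model can be sketched as a tiny in-memory triple store (a hypothetical toy in Python; the scalable DB systems mentioned above use far more sophisticated indexing):

```python
# Minimal in-memory triple store: triples are (subject, predicate, object) tuples.
# A query pattern uses None as a wildcard, mimicking a SPARQL variable.
triples = {
    ("Barack_Obama", "marriedTo", "Michelle_Obama"),
    ("Barack_Obama", "bornIn", "Honolulu"),
    ("Michelle_Obama", "bornIn", "Chicago"),
}

def match(pattern, store):
    """Return all triples matching a (s, p, o) pattern with None wildcards."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Who is Barack Obama married to?
print(match(("Barack_Obama", "marriedTo", None), triples))
```

Real LOD stores answer such patterns over billions of triples by indexing all orderings of (s, p, o) rather than scanning.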
6. I will talk about
• Micro-task Crowdsourcing
• Hybrid Human-Machine systems
• Entity Linking/Disambiguation
– On the Web using crowdsourcing
• Improving Crowdsourcing Platform Quality
– Pushing tasks to workers
• Research directions
– Crowdsourced Query Understanding
– Transactive Search
Gianluca Demartini 6
7. Crowdsourcing
• Exploit human intelligence to solve
– Tasks simple for humans, complex for machines
– With a large number of humans (the Crowd)
– Small problems: micro-tasks (Amazon MTurk)
• Examples
– Wikipedia, Image tagging
• Incentives
– Financial, fun, visibility
Gianluca Demartini 7
8. Case-Study: Amazon MTurk
• Micro-task crowdsourcing marketplace
• On-demand, scalable, real-time workforce
• Different crowd motivation (not just money)
• Online since 2005 (still in “beta”)
• Currently the most popular platform
• Developer’s API as well as GUI
Gianluca Demartini 8
12. Example: Hybrid Image Search
Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image
Search on Mobile Phones, Mobisys 2010.
Gianluca Demartini 12
13. Example: Hybrid Data Integration
Source tables:
paper | conf
Data integration | VLDB-01
Data mining | SIGMOD-02

title | author | email | venue
OLAP | Mike | mike@a | ICDE-02
Social media | Jane | jane@b | PODS-05

Generate plausible matches
– paper = title, paper = author, paper = email, paper = venue
– conf = title, conf = author, conf = email, conf = venue
Ask users to verify
Does attribute paper match attribute author?
Yes / No / Not sure
McCann, Shen, Doan: Matching Schemas in Online Communities. ICDE, 2008
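The generate-then-verify loop of this example can be sketched as follows (a hypothetical simplification: candidate matches are the cross product of the two schemas, and the crowd's answers are mocked as a dictionary):

```python
from itertools import product

schema_a = ["paper", "conf"]
schema_b = ["title", "author", "email", "venue"]

# Step 1: generate all plausible attribute matches (the cross product).
candidates = list(product(schema_a, schema_b))  # 2 x 4 = 8 candidate pairs

# Step 2: ask users to verify each candidate; here the crowd's answers are mocked.
crowd_answers = {("paper", "title"): "yes", ("conf", "venue"): "yes"}  # hypothetical

verified = [pair for pair in candidates if crowd_answers.get(pair) == "yes"]
print(verified)  # only the matches the crowd confirmed
```

In practice each candidate would be shown to several workers, with the Yes/No/Not sure answers aggregated before a pair is accepted.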
14. Hybrid Systems: Key Issues
• The role of machine (i.e., algorithm) and
humans
– use only humans? both? who’s doing what?
• Quality control
• Payment
• Optimization: What to crowdsource
• Scalability: How much to crowdsource
Gianluca Demartini 14
16. RDFa enrichment
Linked entities: http://dbpedia.org/resource/Facebook; http://dbpedia.org/resource/Instagram owl:sameAs fbase:Instagram; Google; Android

HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

After RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>

Gianluca Demartini 16
17. ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a
probabilistic reasoning framework
(Figure: answers from the Crowd combined with Algorithms running on Machines)
Gianluca Demartini 17
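The core idea of weighting crowd answers by estimated worker reliability can be sketched like this (a hypothetical simplification; ZenCrowd itself uses a full probabilistic network rather than plain weighted voting):

```python
# Accuracy-weighted voting: each vote counts proportionally to the worker's
# estimated accuracy (estimated, e.g., from test questions; values hypothetical).
worker_accuracy = {"w1": 0.9, "w2": 0.6, "w3": 0.55}

def aggregate(votes):
    """votes: dict worker -> bool vote on a candidate entity link.
    Returns True if the accuracy-weighted 'yes' mass exceeds the 'no' mass."""
    yes = sum(worker_accuracy[w] for w, v in votes.items() if v)
    no = sum(worker_accuracy[w] for w, v in votes.items() if not v)
    return yes > no

# The reliable worker w1 (0.9) outweighs the unreliable w2 (0.6):
print(aggregate({"w1": True, "w2": False}))
```

In the full framework the worker reliabilities themselves are variables in the probabilistic network, updated as workers agree or disagree with each other and with test questions.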
18. ZenCrowd Architecture
(Architecture diagram: Entity Extractors process the input HTML Pages; candidate entities, retrieved from an LOD Index over the LOD Open Data Cloud, are linked both by Algorithmic Matchers and, as Micro Matching Tasks, by workers on a Crowdsourcing Platform via the Micro-Task Manager; the Workers' Decisions and matcher outputs feed the Probabilistic Network of the Decision Engine, which produces the output HTML+RDFa Pages.)
Gianluca Demartini 18
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).
21. Worker Selection
(Plots: worker precision vs. number of tasks completed, for US and IN workers, with the top US worker highlighted; and overall precision, ranging from about 0.6 to 0.8, when keeping only the top-K workers, K = 1..9.)
Gianluca Demartini 21
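Selecting the top-K workers by measured precision, as in the plots above, can be sketched as follows (hypothetical worker names and precision values):

```python
# Precision of each worker, measured against gold-standard tasks (hypothetical).
precision = {"alice": 0.95, "bob": 0.70, "carol": 0.85, "dave": 0.60}

def top_k_workers(prec, k):
    """Rank workers by measured precision and keep the best k."""
    return sorted(prec, key=prec.get, reverse=True)[:k]

print(top_k_workers(precision, 2))  # ['alice', 'carol']
```

Restricting aggregation to the answers of these top-K workers is what produces the precision gains shown in the plot.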
22. Lessons Learnt
• Crowdsourcing + Prob reasoning works!
• But
– Different worker communities perform differently
– Many low quality workers
– Completion time may vary (based on reward)
• Need to find the right workers for your task
(see WWW13 paper)
Gianluca Demartini 22
23. ZenCrowd Summary
• ZenCrowd: Probabilistic reasoning over automatic
and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic
• 4% - 35% improvement over standard crowdsourcing
• 14% average improvement over automatic
approaches
http://exascale.info/zencrowd/
Gianluca Demartini 23
24. Blocking for Instance Matching
• Find the instances about the same real-world
entity within two datasets
• Avoid Comparison of all possible pairs
– Step 1: cluster similar items using a cheap
similarity measure
– Step 2: n*n comparison within the clusters with
an expensive measure
Gianluca Demartini 24
25. Three-stage blocking with the Crowd
for Data Integration
• 1. Cheap clustering/inverted index selection of
candidates
• 2. Expensive similarity measure
• 3. Crowdsource low confidence matches
Gianluca Demartini 25
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-Scale Linked
Data Integration Using Probabilistic Reasoning and Crowdsourcing. In: VLDB Journal, Volume 22,
Issue 5 (2013), Page 665-687, Special issue on Structured, Social and Crowd-sourced Data on the
Web. October 2013.
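The three stages can be sketched as follows (a hypothetical toy: the cheap blocking key is the last token, difflib's SequenceMatcher stands in for the expensive measure, and mid-confidence pairs are queued for the crowd):

```python
from collections import defaultdict
from difflib import SequenceMatcher

records = ["Barack Obama", "B. Obama", "Michelle Obama", "Google Inc", "Google"]

# Stage 1: cheap clustering -- group records by an inexpensive blocking key.
blocks = defaultdict(list)
for r in records:
    blocks[r.split()[-1].lower()].append(r)

# Stage 2: expensive pairwise similarity, computed only within each block.
def expensive_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

auto_matches, crowd_queue = [], []
for group in blocks.values():
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            sim = expensive_sim(group[i], group[j])
            if sim > 0.6:      # high confidence: accept automatically
                auto_matches.append((group[i], group[j]))
            elif sim > 0.4:    # Stage 3: low confidence -> crowdsource the pair
                crowd_queue.append((group[i], group[j]))

print(auto_matches)
print(len(crowd_queue))  # ambiguous pairs sent to the crowd
```

Note that "Google Inc" and "Google" land in different blocks and are never compared, illustrating the recall risk of a cheap blocking key.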
27. Pull (Traditional) Crowdsourcing
• In MTurk HITs are published on the market
• The first worker willing to do it can take it
• Pro: Fast
• Con: Not necessarily optimal / not the best
worker for the task
Gianluca Demartini 27
28. Push Crowdsourcing
• Pick-A-Crowd: A system architecture that uses
Task-to-Worker matching:
– The worker’s social profile
– The task context
• Workers can provide higher quality answers
on tasks they relate to
28
Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudré-Mauroux. Pick-A-Crowd: Tell Me
What You Like, and I'll Tell You What to Do. In: 22nd International Conference on World Wide
Web (WWW 2013), Rio de Janeiro, Brazil, May 2013.
29. Matching Models – Expert Finding
• Build an inverted index on the pages’ titles and descriptions
• Use the title/description of the task as a keyword query on the inverted index and get a subset of pages
• Rank the workers by the number of liked pages in the subset
Gianluca Demartini 29
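The three steps above can be sketched as follows (hypothetical worker profiles; Pick-A-Crowd uses the pages workers actually liked on their social profiles):

```python
from collections import defaultdict

# Pages each worker has liked (hypothetical social profiles).
likes = {
    "w1": ["Star Wars", "Star Trek"],
    "w2": ["Football", "Star Wars"],
    "w3": ["Cooking"],
}

# Step 1: inverted index from terms in page titles to pages.
index = defaultdict(set)
pages = {p for ps in likes.values() for p in ps}
for page in pages:
    for term in page.lower().split():
        index[term].add(page)

# Step 2: treat the task description as a keyword query over the index.
def relevant_pages(task):
    result = set()
    for term in task.lower().split():
        result |= index.get(term, set())
    return result

# Step 3: rank workers by how many of their liked pages fall in the subset.
def rank_workers(task):
    subset = relevant_pages(task)
    return sorted(likes, key=lambda w: len(set(likes[w]) & subset), reverse=True)

print(rank_workers("label Star Wars characters"))  # w1 relates most to the task
```

The task is then pushed to the highest-ranked workers instead of being pulled from an open market.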
31. Discussion
• Pull vs. Push methodologies in Crowdsourcing
• Pick-A-Crowd system architecture with Task-
to-Worker recommendation
• Experimental comparison with AMT shows a
consistent quality improvement
“Workers Know what they Like”
31
www.openturk.com
32. OpenTurk
• Yet another platform? Built on top of MTurk!
• Chrome Extension for push / notification
• 400+ users
• http://bit.ly/openturk-extension
• Open source:
https://github.com/openturk/extension
Gianluca Demartini 32
38. Motivation
• Web Search Engines can answer simple factual
queries directly on the result page
• Users with complex information needs are
often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with Crowdsourcing!
Gianluca Demartini 38
39. CrowdQ
• CrowdQ is the first system that uses
crowdsourcing to
– Understand the intended meaning
– Build a structured query template
– Answer the query over Linked Open Data
Gianluca Demartini 39
Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ:
Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems
Research (CIDR 2013).
40. Hybrid Human-Machine Pipeline
Q = birthdate of actors of forrest gump
Query annotation: “birthdate” (noun), “actors” (noun), “forrest gump” (named entity)
Crowd verification of entities and relations:
– Is “forrest gump” this entity in the query?
– Which is the relation between “actors” and “forrest gump”? → starring
Schema element: Starring <dbpedia-owl:starring>
Verification: Is the relation between
– Indiana Jones – Harrison Ford
– Back to the Future – Michael J. Fox
of the same type as Forrest Gump – actors?
Gianluca Demartini 40
41. Structured query generation
SELECT ?y ?x
WHERE { ?y <dbpedia-owl:birthdate> ?x .
       ?z <dbpedia-owl:starring> ?y .
       ?z <rdfs:label> 'Forrest Gump' }
Gianluca Demartini 41
Results from BTC09:
Q = birthdate of actors of forrest gump
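Instantiating such a structured query from crowd-verified annotations can be sketched as template filling (hypothetical annotation names; CrowdQ generalizes this over many query shapes):

```python
# Crowd-verified parse of "birthdate of actors of forrest gump" (hypothetical keys).
annotations = {
    "target_property": "dbpedia-owl:birthdate",
    "relation": "dbpedia-owl:starring",
    "entity_label": "Forrest Gump",
}

# A query template whose slots correspond to the verified annotations.
TEMPLATE = """SELECT ?y ?x
WHERE {{ ?y <{target_property}> ?x .
        ?z <{relation}> ?y .
        ?z <rdfs:label> '{entity_label}' }}"""

query = TEMPLATE.format(**annotations)
print(query)
```

Because the template abstracts over the concrete entity, the same crowd-built template can answer every query of the shape "birthdate of actors of X".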
43. Transactive Search
• What if the data to answer your query is not stored on any digital medium?
• What if the data is just in people’s minds?
• Big Data No Data
Gianluca Demartini 43
44. Transactive Search
• Search using Transactive (group) Memories
• “Who attended the WWW 2014 conference?”
• Machines: Harvest the Web + Data Mining
• Crowd: Search twitter, look at event pictures
• Transactive Memories: Remember who I met
Gianluca Demartini 44
Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and
Philippe Cudré-Mauroux. Hippocampus: Answering Memory Queries using Transactive Search.
In: 23rd International Conference on World Wide Web (WWW 2014), Web Science Track. Seoul,
South Korea, April 2014.
47. Discussion
• Sometimes the data is not on the Web
• The right group of people can still answer
– Collaboratively
– Using Transactive Search
– Better than machines or anonymous crowds
• Open challenges
– Incentives
– Repeatability
– SNA
Gianluca Demartini 47
49. State of Micro-task Crowdsourcing
• Platform side
– Pull platforms
– Batch processing
• Worker side
– Work flexibility
– Anonymity
• Requester side
– Web/API
Gianluca Demartini 49
50. The Future for Requesters
• Push Platforms
– RecSys, User Modeling, Trust
• Mobile Access
• Quality and Time guarantees
• Worker API (enable novel worker UI)
Gianluca Demartini 50
51. The Future of the Worker side
• Reputation system for workers
• More than financial incentives
• Recognize worker potential (badges)
– Paid for their expertise
• Train less skilled workers (tutoring system)
Aniket Kittur et al. The Future of Crowd Work. CSCW 2013.
Gianluca Demartini 51
52. Crowdsourcing Ethics
• People work full-time as crowd workers
• Chinese crowdsourcing platform with 5.5M workers
• Pros
– Help developing countries
– Provide cash fast to people == short-term satisfaction
– Job Flexibility
• Cons
– No job security
– No social security
– Long term satisfaction? Career plans?
Gianluca Demartini 52
Dagstuhl Seminar on “Crowdsourcing: From Theory to Practice and Long-Term Perspectives”,
September 2013.
53. Conclusions
• Structured Data makes the Web better
• It’s growing fast
– Large volume
– Large heterogeneity
• Crowds can help understanding data semantics
• Hybrid human-machine systems (ZenCrowd)
• Research opportunities:
– Exploit Human Intelligence at Scale (CrowdQ)
– Pick the right crowd (Pick-A-Crowd, Transactive Search)
gianlucademartini.net
demartini@exascale.info
Gianluca Demartini 53