Slides about "Information and Data Extraction on the Web" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques – Prateek Jain
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can “understand and satisfy the requests of people and machines to use the web content” – i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize many of the benefits promised. If this limitation is left unaddressed, the LOD Cloud will merely be more data that suffers from the same kinds of problems that plague the Web of Documents, and the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to these issues using a bootstrapping-based approach. It showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, provide evidence of the feasibility and applicability of the solution.
The document outlines Pablo Mendes' PhD dissertation defense on adaptive semantic annotation of entities and concepts in text. It discusses Pablo Mendes' conceptual model for knowledge base tagging, the DBpedia knowledge base and DBpedia Spotlight system, core evaluations of the system, and case studies applying the system to tweets, audio transcripts, and educational material. The presentation concludes by thanking the audience.
WWW2013 Tutorial: Linked Data & Education – Stefan Dietze
Linked data provides opportunities for sharing educational data on the web in a standardized way. It allows for the integration of heterogeneous educational resources and datasets from different platforms. This can enable new applications like cross-platform recommender systems and exploratory search. However, there are also challenges to address like annotation overhead, performance, and scalability when dealing with large amounts of distributed data.
The slide set used to conduct an introduction/tutorial on DBpedia use cases, concepts and implementation aspects, held during the DBpedia community meeting in Dublin on the 9th of February 2015. (Slide creators: M. Ackermann, M. Freudenberg; additional presenter: Ali Ismayilov)
Linked Data for Federation of OER Data & Repositories – Stefan Dietze
An overview of different alternatives and opportunities for using Linked Data principles and datasets for federated access to distributed OER repositories. The talk was held at the ARIADNE/GLOBE convening (http://ariadne-eu.org/content/open-federations-2013-open-knowledge-sharing-education) at LAK 2013, Leuven, Belgium, on 8 April 2013.
Retrieval, Crawling and Fusion of Entity-centric Data on the Web – Stefan Dietze
Stefan Dietze gave a keynote presentation covering three main topics:
1) Challenges in entity retrieval from heterogeneous linked datasets and knowledge graphs due to diversity and lack of standardization.
2) Approaches for enabling discovery and search through dataset recommendation, profiling, and entity retrieval methods that cluster entities to address link sparsity.
3) Going beyond linked data to exploit semantics embedded in web markup, with case studies in data fusion for entity reconciliation and retrieval.
Demo: Profiling & Exploration of Linked Open Data – Stefan Dietze
This document discusses profiling and exploring linked datasets on the web. It describes the LinkedUp dataset catalog which classifies datasets by type, topic, quality and accessibility. The catalog allows querying across distributed datasets. Topic profiles of datasets are extracted by entity disambiguation and mapping dataset schemas. Visualizations show the relationships between datasets, topics and categories. Lessons learned are that broad categories from DBpedia introduce noise, and type-specific views of datasets can provide more precise topic profiles, as demonstrated in an explorer of educational datasets.
The document discusses the design and implementation of an XML database called the iDiary database to store personal diary entries. It describes taking an incremental approach to building out the database by first creating the basic XML structure, then adding more detailed tags and templates as needed. Key aspects covered include using XML tags to organize data, using XSL stylesheets to transform the XML into HTML for display, and iteratively improving the tag structure and templates based on the diverse types of diary entries.
This document discusses implementing Linked Data in low-resource conditions. It begins by outlining goals of providing a high-level view of Linked Data, identifying possible bottlenecks due to limited resources, and offering suggestions to overcome bottlenecks based on experience. It then defines what is meant by "low-resource conditions", including limited IT competencies, software, hardware, electricity, and internet access. The document outlines the Linked Data workflow and discusses each step in more detail, including data generation, conversion to RDF, data storage, maintenance, linking, and exposure. It highlights the example of AGRIS, a collaborative Linked Data application, and emphasizes starting small, being strategic, reusing existing resources, and collaborating to maximize resources in low-resource conditions.
Linked Open Data Fundamentals for Libraries, Archives and Museums – trevorthornton
This document provides an overview of linked open data concepts for libraries, archives, and museums. It discusses what linked open data is, potential benefits for cultural institutions, and technical concepts like URIs, HTTP, RDF, ontologies, and SPARQL. The document also covers publishing linked open data by establishing URIs for resources and using content negotiation. Trust and attribution of linked data sources are addressed. Open data licensing, including options from Creative Commons, is also summarized.
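For readers new to SPARQL, a small sketch of what querying a public linked-data endpoint looks like in practice may help. It uses the SPARQLWrapper Python library against the public DBpedia endpoint; neither the library nor the query is part of the summarized document, so treat it purely as an illustration.

```python
# Illustrative sketch (not from the summarized slides): querying a public SPARQL endpoint.
# Requires the SPARQLWrapper package (pip install SPARQLWrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?museum ?label WHERE {
        ?museum a dbo:Museum ;
                rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["museum"]["value"], "-", binding["label"]["value"])
```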
Study Support and Integration of Cultural Information Resources with Linked Data – KAMURA
A museum collection search system called Linked Open Data for Academia (LODAC) Museum has been developed that uses Linked Data. The LODAC Museum identifies and associates artists, artworks, and museum information from several different museums to provide integrated data that are published as Linked Data with a SPARQL endpoint. (These slides were used at Culture and Computing 2011.)
20130805 Activating Linked Open Data in Libraries Archives and Museums – andrea huang
This document summarizes the LODLAM 2013 conference. It discusses how linked open data can activate libraries, archives, and museums by (1) bringing library data outside library walls and linking it to external web data, (2) helping different actors create and aggregate data about the same objects, and (3) adding value to metadata by linking to external knowledge bases. The conference had over 100 participants from 16 countries and included sessions on topics like curation, vocabularies, tools, and case studies. Several projects and tools were presented, including LODLAM patterns, Karma, and Pundit. The document argues that linking library metadata to the web of data presents opportunities but also challenges of metadata and vocabulary interoperability.
Semantic Web, Linked Data and Education: A Perfect Fit? – Mathieu d'Aquin
This document discusses how semantic web technologies like linked data are a perfect fit for education. It provides examples of how the Open University has applied linked data to connect educational resources and data from across the university. Linked data allows for flexibility, accessibility, and the ability to combine and interpret different sources of knowledge. However, challenges remain around representing rich metadata about educational purpose and interpreting resources in an educational context.
A structured catalog of open educational datasets – Stefan Dietze
This document discusses building a structured catalog for educational datasets on the Linked Open Data cloud. It proposes a processing chain to extract metadata from datasets, link entities and resources across datasets, and categorize datasets. This would provide a unified view of the educational data through a dataset catalog and index with links and cross-references. The goals are to classify datasets, link related entities, and provide infrastructure for federated queries over the interconnected educational datasets.
Experiences Evolving a New Analytical Platform: What Works and What's Missing – Cloudera, Inc.
The document summarizes Jeff Hammerbacher's presentation on evolving analytical platforms. It discusses how business intelligence is becoming more like scientific research, requiring tools across the entire research cycle. It provides SQL Server 2008 R2 as an example of an analytical data platform that integrates various components like ETL, reporting, analysis, search, and more into a unified suite. It then outlines the key players in the platform ecosystem, including infrastructure providers, platform providers, application developers, content providers, and end users. Finally, it traces the evolution of Hadoop and MapReduce as new foundations for large-scale data analysis and their growing adoption starting in 2005 through projects at Yahoo and other companies.
Information Extraction from the Web - Algorithms and Tools – Benjamin Habegger
This document provides an overview of algorithms and tools for information extraction from the web. It discusses document representations, approaches like wrappers that can extract semi-structured data from websites, and algorithms such as Wien, Stalker, DIPRE and IERel that learn wrappers. It also presents tools like WetDL for describing workflows and WebSource for executing them to extract and transform web data. Finally, it discusses applications of information extraction like semantic search engines and linking extracted data to schemas for data integration.
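Since DIPRE is mentioned, a rough, self-contained sketch of the pattern-bootstrapping idea behind it may help: a few seed pairs yield context patterns, which in turn extract new pairs. The corpus and seed below are invented, and real DIPRE additionally uses URL prefixes, richer pattern context, and specificity checks.

```python
# Toy sketch of DIPRE-style bootstrapping: seed pairs -> context patterns -> new pairs.
# Corpus and seed are invented for illustration only.
import re

corpus = [
    "The Art of Computer Programming by Donald Knuth is a classic.",
    "Dubliners by James Joyce was published in 1914.",
    "Many readers enjoyed Hamlet by William Shakespeare.",
]
seeds = {("The Art of Computer Programming", "Donald Knuth")}  # (title, author)

def learn_middles(pairs, corpus):
    """Collect the text occurring between title and author in each seed occurrence."""
    middles = set()
    for title, author in pairs:
        for sentence in corpus:
            t, a = sentence.find(title), sentence.find(author)
            if t != -1 and a != -1 and t < a:
                middles.add(sentence[t + len(title):a])
    return middles

def extract(middles, corpus):
    """Use the learned middle contexts to pull out new (title, author) candidates."""
    phrase = r"[A-Z][A-Za-z']*(?: [A-Z][A-Za-z']*)*"   # naive capitalised phrase
    found = set()
    for middle in middles:
        pattern = re.compile(rf"({phrase}){re.escape(middle)}({phrase})")
        for sentence in corpus:
            for title, author in pattern.findall(sentence):
                found.add((title, author))
    return found

middles = learn_middles(seeds, corpus)   # learns {' by '} from the single seed
print(extract(middles, corpus))          # finds the Joyce and Shakespeare pairs (and part of the seed again)
```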
Enterprise information extraction: recent developments and open challenges – Yunyao Li
The document discusses declarative approaches to information extraction that address issues with traditional rule-based and machine learning-based methods. Declarative approaches use a declarative language and programming model to specify extraction tasks, enabling scalable infrastructure and development support. The talk covers how declarative information extraction enables scalable processing and provides development tools, and concludes with questions.
Slides about "Usecases for Information Extraction with UIMA" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University
Information Extraction from Web-Scale N-Gram Data – Gerard de Melo
Search engines are increasingly relying on structured data to provide direct answers to certain types of queries. However, extracting such structured data from text is challenging, especially due to the scarcity of explicitly expressed knowledge. Even when relying on large document collections, pattern-based information extraction approaches typically expose only insufficient amounts of information. This paper evaluates to what extent n-gram statistics, derived from volumes of texts several orders of magnitude larger than typical corpora, can allow us to overcome this bottleneck. An extensive experimental evaluation is provided for three different binary relations, comparing different sources of n-gram data as well as different learning algorithms.
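As a toy illustration of the idea in this abstract, namely scoring candidate relation instances against counts of relation-indicating n-grams, the following sketch uses a tiny invented count table; the actual work relies on web-scale n-gram collections and learning algorithms rather than a handful of fixed patterns.

```python
# Illustrative sketch only: scoring candidate (x, y) pairs for a "capital-of" style
# relation by summing the counts of relation-indicating n-grams that contain them.
# The n-gram counts and patterns below are invented.

ngram_counts = {
    "paris is the capital of france": 120_000,
    "paris the city of lights": 45_000,
    "berlin is the capital of germany": 95_000,
    "london is a large city in england": 30_000,
}

patterns = ["{x} is the capital of {y}", "{x} capital of {y}"]

def score(x, y):
    """Sum the counts of all n-grams matching a relation pattern filled with (x, y)."""
    total = 0
    for pattern in patterns:
        filled = pattern.format(x=x, y=y).lower()
        for ngram, count in ngram_counts.items():
            if filled in ngram:
                total += count
    return total

candidates = [("Paris", "France"), ("Berlin", "Germany"), ("London", "Germany")]
for x, y in candidates:
    print(x, y, score(x, y))   # higher scores suggest the relation holds
```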
This document defines data and different types of data presentation. It discusses quantitative and qualitative data, and different scales for qualitative data. The document also covers different ways to present data scientifically, including through tables, graphs, charts and diagrams. Key types of visual presentation covered are bar charts, histograms, pie charts and line diagrams. Presentations should aim to clearly convey information in a concise and systematic manner.
Here are the class widths, marks and boundaries for the given class intervals:
a. Class interval (ci): 4 – 8
Class Width: 4
Class Mark: 6
Class Boundary: 3.5 – 8.5
b. Class interval (ci): 35 – 44
Class Width: 9
Class Mark: 39.5
Class Boundary: 34.5 – 44.5
c. Class interval (ci): 17 – 21
Class Width: 4
Class Mark: 19
Class Boundary: 16.5 – 20.5
d. Class interval (ci): 53 – 57
Class Width: 4
Class Mark: 55
Class Boundary: 52.5 – 57.5
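The computations above can be reproduced with a few lines of Python, following the same conventions (class mark as the midpoint of the limits, boundaries as the limits extended by 0.5, width as the difference of the limits):

```python
# Illustrative sketch: class width, mark and boundaries for integer class limits,
# following the conventions used above.

def class_summary(lower, upper):
    width = upper - lower                      # e.g. 8 - 4 = 4
    mark = (lower + upper) / 2                 # e.g. (4 + 8) / 2 = 6
    boundaries = (lower - 0.5, upper + 0.5)    # e.g. 3.5 and 8.5
    return width, mark, boundaries

for lower, upper in [(4, 8), (35, 44), (17, 21), (53, 57)]:
    width, mark, (lo, hi) = class_summary(lower, upper)
    print(f"ci {lower}-{upper}: width {width}, mark {mark}, boundary {lo}-{hi}")
```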
SystemT: Declarative Information Extraction – Yunyao Li
Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).
Combining Distributional Semantics and Entity Linking for Context-aware Conte... – Cataldo Musto
This document describes a context-aware content-based recommendation framework called contextual eVSM. It combines distributional semantics and entity linking to address limitations of traditional content-based recommender systems related to poor semantic representation and lack of contextual modeling. The framework includes three main components: a semantic content analyzer, a context-aware profiler, and a recommender. The semantic content analyzer generates semantic representations of items using both entity linking and distributional semantics learned from text. The context-aware profiler builds contextual user profiles based on a strategy that combines standard user ratings with contextual information. The recommender then uses these representations to provide context-aware recommendations.
Start with Small Data: How to Understand your Visitors, Capture Data, and Pro... – JimKellerES
Everyone is looking to "Big Data" to provide the answers: How do we learn more about our visitors? How do we capture data? How do we report on it? How do we use it to provide a personalized experience?
Big data is a big challenge, but this session will teach you to start with "small data" - the analytics and information that you probably already have access to, or could easily have access to, but that you're just not capturing or leveraging effectively. We'll discuss how to track more than just page views so that you can understand your visitors' behavior on a deeper level. You'll also learn about tracking specific visitors across sessions, providing a personalized experience, some tips for measuring and reporting on their behavior, and how to adjust your site to drive more meaningful interactions and conversions.
Compile, Clean, Connect: The mantra of data journalism (Future Everything 2011) – Paul Bradshaw
Compile, clean, connect is the mantra of data journalism according to Paul Bradshaw, a visiting professor and course leader who discusses how data journalists gather raw data from sources like police websites, clean it by organizing and structuring it, and then connect it to stories and analyze trends or draw comparisons. The document provides an example of how Adrian Holovaty gathers crime reports from the Chicago police website on a daily basis and credits Bradshaw as a publisher, blogger, and founder providing resources for data journalism.
This document discusses semi-structured data extraction from web pages. It introduces semantic generators, which are sets of rules that translate HTML documents into XML. It describes the WebMantic architecture, which allows automatic generation of semantic generators and wrappers. A practical example of using WebMantic to extract data from a population website is provided. Experimental results on extracting data from several websites are also presented, along with conclusions and plans for future work.
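To make the notion of a semantic generator, i.e. rules that translate HTML into XML, more concrete, here is a small stand-alone sketch; the rule format, the sample page, and the output tags are invented for illustration and are not WebMantic's actual rule language.

```python
# Stand-alone sketch of the "semantic generator" idea: a few hand-written rules
# that map fragments of an HTML page to XML elements. Rule syntax and sample page
# are invented for illustration.
import re
import xml.etree.ElementTree as ET

html = """
<table>
  <tr><td class="city">Rome</td><td class="pop">2,873,000</td></tr>
  <tr><td class="city">Milan</td><td class="pop">1,352,000</td></tr>
</table>
"""

# Each rule: (regex over the HTML source, name of the XML element to emit).
rules = [
    (re.compile(r'<td class="city">(.*?)</td>'), "city"),
    (re.compile(r'<td class="pop">(.*?)</td>'), "population"),
]

root = ET.Element("records")
for row in re.findall(r"<tr>(.*?)</tr>", html, re.S):
    record = ET.SubElement(root, "record")
    for regex, tag in rules:
        match = regex.search(row)
        if match:
            ET.SubElement(record, tag).text = match.group(1)

print(ET.tostring(root, encoding="unicode"))
# e.g. <records><record><city>Rome</city><population>2,873,000</population></record>...
```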
An introduction to "Entity linking meets Word Sense Disambiguation: a unified approach" (TACL 2014) – Koji Matsuda
My presentation of the paper "Entity Linking meets Word Sense Disambiguation: a Unified Approach" (TACL 2014) by Andrea Moro, Alessandro Raganato, and Roberto Navigli (University of Roma).
The document discusses different methods for collecting job analysis data: observation, interviews, questionnaires, participant diaries, and previous studies. Observation involves directly watching employees perform their jobs but can be time-consuming. Interviews allow for quick data collection but responses may be distorted. Questionnaires efficiently gather data from many employees but require time and costs to develop. Participant diaries provide a complete picture of job duties but rely on accurate employee recall. Previous studies are easy to use but past employee performance may not reflect current roles.
5 tactics to personalize your email message for better results final – MarketingSherpa
Personalization is about striking a delicate balance. Using it as a marketing tool for gaining trust and encouraging further engagement requires strategically adding a human element to email content while still conveying an effective marketing message.
This MarketingSherpa (www.marketingsherpa.com) webinar presentation will show you:
-Creative ways to add a personal touch in copy and subject lines
-How to get just the right amount of information from the consumer
-Why you should extend personalization to the landing page
-Tips on how to quickly personalize a template
The minutes from a production team meeting on February 4th, 2012 discussed issues with their music video for "Rum and Redbull" including that the video was too short without enough connecting scenes, a homeless scene was confusing and out of place, and some shots dragged on too long. The team decided to film additional transition scenes in a recording studio, re-film the homeless scene for clarity, and cut down shots to be sharper and sync with the music.
The meeting discussed production plans for an upcoming documentary. Key topics included developing a synopsis, target audience, interview subjects, production requirements, and potential interview questions. Action items assigned responsible parties to provide potential filming locations and a preliminary budget by June 23rd. The meeting also outlined pre-production documentation needs and production timescales, with the first month focused on footage collection and permissions and the second on editing. The facilitator concluded by summarizing discussions and reiterating next steps and contact information exchanges.
For three filming sessions at Queen's Park City of Westminster College, the group filmed cutaways of different classes on 12/11/14, filmed life skills classes such as cookery and construction workshops on 19/11/14 with Eugenie operating the camera for most of it, and interviewed Melanie Guymer about Maxine Murphy and got a voiceover about campus services for student awards footage on 11/12/14.
"Presentation on Job Analysis. Learn methods of analyzing
and evaluating a job. These PDF's
are available for all VEDA students for free on
www.veda-edu.com"
An introduction to Basho's Riak distributed data store, and the Ripple client in Ruby. Code samples from the demos are here: http://gist.github.com/365791
This document discusses OpenRefine, an open source tool for working with messy and unstructured data. OpenRefine can be used for data cleaning, ETL prototyping, and data extension/reconciliation. It has a graphical user interface and allows users to perform tasks like clustering similar records, faceting data, and reconciling data against external sources. The document provides examples of how OpenRefine can be used to clean data by removing errors, cluster records, and reconcile data against RDF files or SPARQL endpoints.
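The record-clustering step mentioned here can be approximated outside OpenRefine with the key-collision idea behind its fingerprint method: normalise each value to a key and group values that share a key. The sketch below is a simplified re-implementation of that idea, not OpenRefine's code.

```python
# Simplified sketch of key-collision ("fingerprint") clustering as used for data
# cleaning: values that normalise to the same key are grouped as likely duplicates.
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Lowercase, strip accents and punctuation, then sort and deduplicate tokens."""
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    value = re.sub(r"[^\w\s]", "", value.lower()).strip()
    return " ".join(sorted(set(value.split())))

values = ["Roma Tre University", "University  Roma Tre", "roma tre university.", "Univ. of Roma Tre"]

clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

for key, members in clusters.items():
    if len(members) > 1:
        print(key, "->", members)   # the first three values collapse into one cluster
```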
The document discusses how linked open data and semantic web technologies can be applied to educational data and resources on the web. It provides examples of projects that aim to expose, interlink, and enrich educational datasets using these technologies. The goal is to improve data sharing and interoperability, facilitate reuse of open educational resources, and leverage linked data as a knowledge base to support learning and education.
Slides giving an overview of Apache UIMA and how it can be used for Metadata Generation, in the context of the "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University
1) Access Innovations leverages Lucene and XML to semantically enrich and instantly distribute data from their NICEM and Media Sleuth databases containing 670,000 educational media items.
2) They face challenges of changing data formats and linking related items. To address this, they use a taxonomy applied during semantic enrichment and search to increase recall and precision.
3) By building XML records and indexing with Lucene including suggested taxonomy terms, they achieved flexibility to support on-save updates and multiple systems while improving search and directing users to ecommerce.
Everything you always wanted to know about search in typo3 – Olivier Dobberkau
This document provides an overview and agenda for a presentation on search functionality in TYPO3 using Apache Solr. The presentation covers the history of search technology, search terminology and concepts, why people search and search behaviors, and the key components and features of search in TYPO3 including indexing, querying, results, facets, analysis, and additional components. The goal is to answer questions about search capabilities in TYPO3.
Introduction to Open Data Commons, a licensing project by the Open Knowledge Foundation on legal tools for open data, at the 14 April Open Source Show and Tell hosted by The Team and presented by me, Jordan Hatcher.
The document summarizes a lecture on iPhone application development that covered custom classes, object lifecycles, and properties. Specifically, it discussed creating custom classes with properties and methods, allocating and initializing objects, and the two-step process of object creation involving allocation and initialization. It also mentioned implementing init methods and calling superclass methods.
Keynote presentation from UKSG 2010 Edinburgh. Rapid technology change and the emergence of the Semantic Web of Linked Data will change what you do, not just how you do it.
Web Science Synergies: Exploring Web Knowledge through the Semantic Web – Stefan Dietze
The document discusses exploring web data and knowledge through the semantic web. It describes how the semantic web adds meaning to data through shared vocabularies and schemas. It also discusses challenges with the large number and diversity of linked open datasets, including issues with accessibility, heterogeneity of schemas, and data quality. It proposes approaches to address these challenges, such as dataset profiling, metadata catalogs, and infrastructure for federated querying.
An On-line Collaborative Data Management System – Cameron Kiddle
A presentation I prepared that was presented by Rob Simmonds at the Gateway Computing Environments 2010 Workshop in New Orleans on November 14, 2010. It provides an overview of a data management system that was developed for GeoChronos - an on-line collaborative platform for Earth observation scientists.
Mining and Understanding Activities and Resources on the Web – Stefan Dietze
Research Seminar at KMRC Tübingen, Germany, on mining and understanding Web activities and resources through knowledge discovery and machine learning approaches.
Los Angeles R users group - Nov 17 2010 - Part 2 – rusersla
The document provides an outline for a talk on the future of R. It discusses R's current strengths and criticisms, as well as challenges like handling big data. It proposes 5 potential solutions: 1) Using R with other tools; 2) Packages for large data; 3) Improving R's capabilities; 4) Starting from scratch; 5) Adopting aspects of Clojure. Clojure is presented as having libraries for statistics, machine learning, and querying big data, positioning it as a potential model for R's evolution.
This document discusses advanced techniques for working with data view webparts in SharePoint 2010. It describes the default limitations of data view webparts and recommends tools like Fiddler, Stylus Studio, and Firebug for intercepting XML data, HTML-decoding it, and using XSLT to style and manipulate the webparts. The demonstration shows how to create a minimal data view webpart, use Fiddler to access the XML data, save it for editing in Stylus Studio, and write custom XSLT to apply to the webpart.
This presentation provides some thoughts on OER search. It includes a brief background on the metadata recommendation for OCW Consortium members from 2006. It also provides a bit of background on educational metadata and resource repositories. This presentation was prepared for a OER Search meeting hosted by Google on December 1, 2010. (This presentation replaces the draft presentation that had received 120 views.)
This document discusses how structured content from OpenLearn XML documents can be used to generate various data products and secondary resources. Specifically, it describes how automatic outline extraction could create a mindmap view of content. It also outlines how coursewide directories could be generated listing learning outcomes, image locations from units, and a meta-glossary of terms. Finally, it briefly mentions the DiscOU project and potential for search-based pedagogy using structured OU content.
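As a hint of what automatic outline extraction from structured XML content can look like, the sketch below walks a small XML document and prints a nested outline of section titles; the element names are invented and do not reflect the actual OpenLearn schema.

```python
# Sketch of automatic outline extraction from structured XML content.
# The element names (<unit>, <section>, <title>) are invented for illustration.
import xml.etree.ElementTree as ET

xml_doc = """
<unit>
  <title>Linked Data Basics</title>
  <section>
    <title>Identifying things with URIs</title>
    <section><title>Content negotiation</title></section>
  </section>
  <section><title>Describing things with RDF</title></section>
</unit>
"""

def outline(element, depth=0):
    """Print this element's <title>, then recurse into its nested <section>s."""
    title = element.find("title")
    if title is not None:
        print("  " * depth + "- " + title.text)
    for section in element.findall("section"):
        outline(section, depth + 1)

outline(ET.fromstring(xml_doc))
```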
This document discusses linking open data with Drupal. It begins with an introduction to open data and the semantic web. It explains how to transform open data into linked data using ontologies and semantic metadata. Several Drupal modules are presented for importing, publishing, and querying linked data. The document concludes by proposing a hackathon where participants could consume, publish, and build applications with linked open government data and the Drupal framework.
Linked Data at the Open University: From Technical Challenges to Organization... – Mathieu d'Aquin
The document discusses how the Knowledge Media Institute at the Open University in the UK has developed a linked data platform, called data.open.ac.uk, to provide open access to various types of data from across the university, including course information, research publications, podcasts, videos, and more. It describes some of the technical and organizational challenges in developing the platform, and highlights how it has enabled new uses of the university's data and inspired innovation both within the university and more broadly in open education.
SemTechBiz 2012 Panel on Linking Enterprise Data – 3 Round Stones
The document discusses a panel on linked enterprise data patterns featuring Arnaud Le Hors from IBM, Ashok Malhotra from Oracle, and David Wood from 3 Round Stones. It provides details on recent linked data activities from the W3C including a new working group. It also summarizes IBM, Oracle, and 3 Round Stones' involvement with linked data and semantic technologies including products, projects, and standards.
This document discusses the evolution of the web from a web of documents to a web of linked data. It outlines the principles of linked data, which involve using URIs to identify things and linking those URIs to other URIs so that machines can discover more data. RDF is introduced as a standard data model for publishing linked data on the web using triples. Examples of linked data applications and datasets are provided to illustrate how linked data allows the web to function as a global database.
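To make the triple model mentioned here concrete, the short sketch below builds two triples with the rdflib Python library, including an owl:sameAs link from a local URI to a DBpedia URI; the example.org resource is invented and rdflib itself is not part of the summarized document.

```python
# Minimal sketch of the RDF triple model behind linked data, using rdflib.
# The example URIs under example.org are invented; dbpedia.org/resource/Rome is real.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

EX = Namespace("http://example.org/")
g = Graph()

rome = EX["Rome"]
g.add((rome, RDFS.label, Literal("Rome", lang="en")))
# Linking our URI to another dataset's URI is what turns plain RDF into *linked* data.
g.add((rome, OWL.sameAs, URIRef("http://dbpedia.org/resource/Rome")))

print(g.serialize(format="turtle"))
```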
Similar to Data and Information Extraction on the Web
The document presents research on using affect-enriched word embeddings to improve information retrieval from news datasets. Affect refers to feelings, emotions, personality and moods, which are important to capture for natural language understanding. Prior research showed that affect-enriched word embeddings outperformed the state-of-the-art on sentiment analysis, personality detection and frustration detection tasks. The document analyzes the affect scores of different news datasets and experiments with using affect-enriched embeddings for query expansion and document ranking, finding improvements over baselines.
Apache Jackrabbit Oak is a scalable content repository that uses multi-version concurrency control and pluggable components for storage and indexing. It includes the Oak query engine that selects indexes to perform search queries and traverses the repository if no index is available. Indexes in Oak can be configured and customized, including using the Lucene and Solr indexes to enable full-text and property searching capabilities. Native language support allows leveraging the advanced query capabilities of underlying indexes.
The document discusses Sling replication, including:
1) Sling replication was contributed to Sling in November 2013 and aims to be simple, resilient, and fast.
2) Replication is achieved through replication agents that export and import replication packages between Sling instances.
3) Replication packages are serialized and sent between instances, with various queue providers and distribution strategies available.
4) Configurations define the exporter, importer, queue provider, and distribution strategy for each replication agent.
This document discusses the design considerations for building a search engine to index 50 million heterogeneous documents and migrate them from an old commercial solution to Apache Solr. Key requirements include scalability, high performance, handling diverse stakeholder needs, and the ability to dynamically scale the system. Major challenges include architectural constraints, balancing performance with accuracy, and resolving diverging stakeholder concerns around ranking and results. An iterative process of prototyping, testing, and refining the indexing and search algorithms is recommended.
The document discusses integrating Apache Solr with Apache Oak for scalable search capabilities. It provides an overview of IndexEditor and QueryIndex APIs for mapping Oak content changes and queries to Solr. The Oak Solr bundle includes implementations for indexing and searching Oak content on Solr. Additional bundles support embedded or remote Solr deployment. The talk demonstrates populating a Solr index with Oak content and discusses further improvements.
This document discusses using Lucene and Solr for text categorization and classification. It provides an overview of classification algorithms like Naive Bayes and K-Nearest Neighbors and how they can be implemented using Lucene's indexing and querying capabilities. It also describes how classification models built with Lucene can be exposed through Solr for tasks like assigning categories to documents during indexing or performing classification-based more-like-this queries.
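To illustrate the k-nearest-neighbours idea that the summary says can be built on top of Lucene, here is a tiny index-free Python sketch using cosine similarity over term counts; Lucene would answer the similarity search from its inverted index (e.g. with more-like-this style queries) rather than with this brute-force loop, and the documents and labels below are invented.

```python
# Tiny illustration of k-nearest-neighbour text classification: a new document gets
# the majority label of its k most similar training documents (cosine over term counts).
import math
from collections import Counter

train = [
    ("rdf triples sparql endpoint linked data", "semantic-web"),
    ("ontology owl reasoning sparql", "semantic-web"),
    ("inverted index ranking query relevance", "search"),
    ("lucene solr faceting relevance scoring", "search"),
]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(text, k=3):
    query = Counter(text.split())
    ranked = sorted(train, key=lambda item: cosine(query, Counter(item[0].split())), reverse=True)
    labels = [label for _, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]

print(knn_classify("sparql query over rdf data"))   # expected: semantic-web
```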
This document discusses machine learning with Apache Hama, a Bulk Synchronous Parallel computing framework. It provides an overview of Apache Hama and BSP, explains why machine learning algorithms are well-suited for BSP, and gives examples of collaborative filtering, k-means clustering, and gradient descent implemented on Hama. Benchmark results show Hama performs comparably to Apache Mahout for these algorithms.
Adapting a non-OSGi framework to OSGi-based architectures is a common need that has to be managed together with other concerns like backward compatibility, multiple-components packaging, evolution and flexibility.
Handling such needs can be tricky because of possible hurdles related to different class loading models, fine grained dependency management, semantic versioning, etc.
This talk deals with a real-life use case of adapting a non-OSGi-ready framework like Apache UIMA (http://uima.apache.org) to a fully OSGi-based architecture for the Apache Clerezza project (http://incubator.apache.org/clerezza), highlighting how the different class loading mechanisms (non-OSGi vs. OSGi) can be handled and adapted, and how the two frameworks can be integrated by leveraging the OSGi capabilities while still maintaining backward compatibility, flexibility, etc.
A quick tour of available integration hooks in Apache Jackrabbit Oak to plug in Apache Solr in order to provide scalable search (& more) functionalities to the repository
The document describes the Domeo Annotation Toolkit, which allows users to create, visualize, curate, and share text mining results. It provides components to annotate web documents and export annotations in the Annotation Ontology RDF format. The Domeo client interface in a browser allows both manual and semi-automatic annotation of HTML documents. It can also trigger and display results from text mining web services like the NCBO Annotator through custom connectors. The toolkit is moving towards a federated architecture to allow sharing of annotations across multiple Domeo nodes.
This document discusses using natural language processing (NLP) techniques to enable natural language search in Apache Solr. It describes integrating Apache UIMA with Solr to allow NLP algorithms to analyze documents and queries. Custom Lucene analyzers and a QParserPlugin are used to index enriched fields and extract concepts from queries. The approach aims to improve search recall and precision by understanding language.
The document provides an overview and agenda for an Apache Solr crash course. It discusses topics such as information retrieval, inverted indexes, metrics for evaluating IR systems, Apache Lucene, the Lucene and Solr APIs, indexing, searching, querying, filtering, faceting, highlighting, spellchecking, geospatial search, and Solr architectures including single core, multi-core, replication, and sharding. It also provides tips on performance tuning, using plugins, and developing a Solr-based search engine.
The document discusses two use cases (UC1 and UC2) for applying Apache UIMA to automatically extract structured information from unstructured text. UC1 involves using UIMA to analyze real estate listings to extract fields like price, zone, and phone number to track trends in the real estate market. UC2 aims to automatically extract common information like language, funding type, and expiration date from announcements of EU tenders and contracts. The document outlines the different components involved in each use case, including crawlers to extract text, annotators to identify relevant entities, and CAS consumers to store extracted fields in databases or indexes.
This document provides an introduction and overview of Apache UIMA (Unstructured Information Management Architecture).
Apache UIMA is an open source framework for analyzing unstructured information like text, audio, and video. It allows defining type systems and building analysis pipelines using components called annotators that can extract metadata from unstructured data.
The document outlines some key aspects of Apache UIMA including its goals of supporting a community around analyzing unstructured content, how it can bridge different domains, and provides an example scenario of using it to extract metadata from articles about movies.
The document discusses Apache UIMA, an architectural framework for managing unstructured data. It is not inherently a semantic search tool. It allows for pluggable analysis engines and asynchronous scaleout. The document provides examples of how UIMA can be used for semantic search, including generating metadata for content management systems, data enrichment, and linking to external data sources. It also describes how AlchemyAPI services can be wrapped as UIMA components to perform named entity recognition and linking to knowledge bases.
Project Management Semester Long Project - Acuity – jpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Webinar: Designing a schema for a Data Warehouse – Federico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, that is, denormalised databases where each table represents a dimension or the facts (see the sketch after the topic list below).
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
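A minimal, invented example of the star-schema shape described above, expressed as SQLite DDL driven from Python (all table and column names are illustrative only):

```python
# Minimal invented example of a star schema: one fact table referencing two
# denormalised dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_id    INTEGER PRIMARY KEY,
    full_date  TEXT,
    month      TEXT,
    year       INTEGER
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT          -- kept in the dimension (denormalised), not split off
);
-- Fact table: one row per sale at the chosen granularity, measures plus dimension keys.
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    amount     REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, '2024-06-01', 'June', 2024)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Notebook', 'Stationery')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 3, 7.50)")

# A typical analytical query: total amount per category and month.
for row in conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
"""):
    print(row)
```

The query at the end shows the typical join pattern in such a schema: facts joined to dimensions and grouped by dimension attributes.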
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Introduction of Cybersecurity with OSS at Code Europe 2024 – Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
Building Production Ready Search Pipelines with Spark and Milvus – Zilliz
Spark is a widely used ETL tool for processing, indexing and ingesting data into the serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and how to push the vectors to the Milvus vector database for search serving.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar, with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered:
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT stylesheets and schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating, explaining, or refactoring code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI’s advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Data and Information Extraction on the Web
1. Data and Information Extraction on the Web
Gestione delle Informazioni su Web (Information Management on the Web) - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org
Monday, 12 April 2010
2. Agenda
Search
Goals
Problems
Data extraction
Information extraction
Mixing things together
3. Search - Goals
Find what we are looking for
Quickly
Easily
Have suggestions on other interesting related stuff
Turn results into useful knowledge
5. Problems when googling
Where to search what we are looking for
How to write good queries (i.e.: relations between terms?)
How to evaluate when a query is good
6. Search sources
Redundant, heterogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable...
in one word: the Web
7. Focused search sources
Address interesting sources for the desired domain
Where possible, filter out the unclean and fragmented ones
Choose the most standard and well-structured ones
10. Data extraction
Automatically collect data from the Web
Crawl data from domain specific sources
Aggregate homogeneous data (i.e.: using equivalence classes)
Save (portions of downloaded) data to a convenient separate storage (DB, file system, repository, etc.)
11. Data extraction - Crawling
From scratch (good luck!)
Leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.)
Playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.)
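As a rough illustration of the “existing facilities” option above, here is a minimal sketch of fetching a single page with Apache HttpClient 4.x; the seed URL and the class name are hypothetical and not part of the original slides.

```java
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        // Hypothetical seed URL for a "teams index" page
        String seedUrl = "http://www.example.org/teams/index.html";

        HttpClient client = new DefaultHttpClient();
        HttpResponse response = client.execute(new HttpGet(seedUrl));

        // Only keep pages the server actually returned (see the "Problems" slide)
        if (response.getStatusLine().getStatusCode() == 200) {
            String html = EntityUtils.toString(response.getEntity());
            // A real crawler would persist this to a DB or file system; here we just print it
            System.out.println(html);
        }
    }
}
```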
14. Data extraction - Aggregating
Downloaded resources can be assigned to
equivalence classes
Crawling process is inherently defining page
classes to which pages belong automatically
Relations between page classes
RoadRunner, Webpipe, etc.
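One simple way to picture such equivalence classes is grouping crawled URLs by their path pattern. The sketch below is an assumption-laden illustration (the URL patterns and class names are made up), not how RoadRunner or Webpipe actually work.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class PageClassifier {
    // Hypothetical URL patterns defining equivalence classes of pages
    private static final Map<String, Pattern> PAGE_CLASSES = new LinkedHashMap<String, Pattern>();
    static {
        PAGE_CLASSES.put("teams index", Pattern.compile(".*/teams/index\\.html$"));
        PAGE_CLASSES.put("team",        Pattern.compile(".*/teams/[^/]+\\.html$"));
        PAGE_CLASSES.put("player",      Pattern.compile(".*/players/[^/]+\\.html$"));
        PAGE_CLASSES.put("coach",       Pattern.compile(".*/coaches/[^/]+\\.html$"));
    }

    // Return the name of the first page class whose pattern matches the URL
    public static String classify(String url) {
        for (Map.Entry<String, Pattern> entry : PAGE_CLASSES.entrySet()) {
            if (entry.getValue().matcher(url).matches()) {
                return entry.getKey();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(classify("http://www.example.org/players/totti.html")); // -> player
    }
}
```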
16. Data extraction - EC
“teams indexes” class
“teams” class
“players” class
“coaches” class
17. Data extraction - Relevance
What do we really need?
Depending on the specific domain
Not all pages in all classes may be relevant
We may be interested only in a subset of the found page classes
18. Data extraction - Example
We may be interested in retrieving only information regarding players (Player class)
19. Data extraction - Problems
Server unavailability (HTTP 404, 403, 303, etc.)
Security and bandwidth filters (don’t get your crawler machine IP banned!)
Client unavailability (memory and storage space are unlimited only in theory)
Encoding
Legal issues
...
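To make the availability and filtering problems above concrete, here is a minimal, assumption-based sketch of a polite fetch loop: it checks HTTP status codes, identifies the crawler via a User-Agent header, and pauses between requests so the crawler’s IP is less likely to be banned. The URLs, delay value, and header string are illustrative only.

```java
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class PoliteFetcher {
    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();
        String[] urls = {
                "http://www.example.org/players/totti.html",
                "http://www.example.org/players/buffon.html"
        };

        for (String url : urls) {
            HttpGet get = new HttpGet(url);
            // Identify the crawler instead of hiding it
            get.setHeader("User-Agent", "example-crawler/0.1 (student project)");

            HttpResponse response = client.execute(get);
            int status = response.getStatusLine().getStatusCode();
            if (status == 200) {
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(url + " -> " + html.length() + " chars");
            } else {
                // 404/403/303/5xx: log and move on instead of retrying aggressively
                System.err.println(url + " -> HTTP " + status);
                EntityUtils.consume(response.getEntity());
            }

            Thread.sleep(2000); // politeness delay between requests
        }
    }
}
```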
21. Data vs Information
Data                | Information
Rough               | Clean
Semi-structured     | Structured
Mixed content       | Focused
Immutable           | Managed
Navigation oriented | Domain oriented
22. From Data to Information
We have crawled a lot of data
We eventually have some rough structure (page classes and relations)
We want to pick only what we need
23. Information extraction - Pruning
We want to filter out at least:
Banners, advertisements, etc.
Headers/Footers
Navigation bars/Search boxes
Everything else not related to content
We may use XPath
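As a sketch of XPath-based pruning (the element names, id values, and file name below are assumptions about a hypothetical page template, not taken from the slides), one could select only the main content subtree and discard everything else:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ContentPruner {
    public static void main(String[] args) throws Exception {
        // Assumes the crawled page has already been cleaned up into well-formed XHTML
        Document page = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("player-totti.xhtml"));

        XPath xpath = XPathFactory.newInstance().newXPath();

        // Keep only the main content area; headers, footers, nav bars and ads live elsewhere
        NodeList textNodes = (NodeList) xpath.evaluate(
                "//div[@id='content']//text()", page, XPathConstants.NODESET);

        for (int i = 0; i < textNodes.getLength(); i++) {
            String text = textNodes.item(i).getNodeValue().trim();
            if (!text.isEmpty()) {
                System.out.println(text);
            }
        }
    }
}
```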
26. Information extraction
Once we have extracted content
We are now interested in getting useful information from it -> knowledge
Look for matches between extracted data and our domain model
27. Information extraction - Example
Navigate XML (HTML DOM) nodes with XPath
Navigate content and find specific “parts” (nodes or sub-trees)
Tag such “parts” as objects or properties inside a (specific) domain model
We may need to traverse the DOM multiple times
31. Information extraction - Example
A Player (taken from the Player page class) with name, date of birth and the team he belongs to
We now know that “Francesco Totti” is a Player of the “Italy” team and was born on “27/09/1976”
We can apply such XPaths to all PageClass instances and get information about each player
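Tying the example together, the following minimal sketch populates a hypothetical Player domain object with three XPath expressions; the class, the expressions, and the file name are illustrative assumptions, not the actual wrapper behind the slides.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class PlayerWrapper {

    // Hypothetical domain object for the Player page class
    static class Player {
        final String name;
        final String dateOfBirth;
        final String team;
        Player(String name, String dateOfBirth, String team) {
            this.name = name;
            this.dateOfBirth = dateOfBirth;
            this.team = team;
        }
    }

    static Player extract(String xhtmlFile) throws Exception {
        Document page = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(xhtmlFile));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // The XPath expressions are assumptions about the page template
        String name = xpath.evaluate("//div[@id='content']/h1", page);
        String dob  = xpath.evaluate("//table[@class='bio']//td[@class='dob']", page);
        String team = xpath.evaluate("//table[@class='bio']//td[@class='team']", page);

        return new Player(name, dob, team); // e.g. "Francesco Totti", "27/09/1976", "Italy"
    }

    public static void main(String[] args) throws Exception {
        Player totti = extract("player-totti.xhtml");
        System.out.println(totti.name + " - " + totti.team + " - " + totti.dateOfBirth);
    }
}
```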
32. Information extraction - Wrapper
Context navigation: RoadRunner, Webpipe
Statistical analysis: ExAlg
Other...
33. Information extraction - Problems
Not well structured sources
Frequently changing sources
False positives
Corrupted extracted data
35. Information extraction - Relevance
Using wrappers we can get a lot of information
We could rank what is relevant in:
the “page” context
the domain model
For efficiency and “reasoning” purposes
37. Information extraction - Metadata
Stream extracted information into our domain model
Extracted information -> Metadata
Populated domain objects contain interesting semantics and relations
38. Store Metadata
DB (with classic relational schema)
Filesystem (XML)
Key-Value repository
Index
Triple Store
...
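As one concrete (and entirely hypothetical) take on the “Filesystem (XML)” option above, extracted metadata such as the Player object sketched earlier could be serialized to a small XML file; the element names are made up for illustration.

```java
import java.io.FileWriter;
import java.io.Writer;

public class PlayerXmlStore {
    // Write one player's metadata as a small XML document.
    // Note: a real implementation should escape XML special characters in the values.
    public static void store(String name, String dateOfBirth, String team, String file) throws Exception {
        Writer out = new FileWriter(file);
        try {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            out.write("<player>\n");
            out.write("  <name>" + name + "</name>\n");
            out.write("  <dateOfBirth>" + dateOfBirth + "</dateOfBirth>\n");
            out.write("  <team>" + team + "</team>\n");
            out.write("</player>\n");
        } finally {
            out.close();
        }
    }

    public static void main(String[] args) throws Exception {
        store("Francesco Totti", "27/09/1976", "Italy", "totti.xml");
    }
}
```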
39. Query enriched data
Exploit acquired metadata semantics to build SQL-like queries (with attributes and relations of our domain model) on previously unstructured data
Extract hidden knowledge by querying aggregated metadata
40. Sample queries
Get “young players”
SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'
Aggregate queries
Find the average age in each team
Find the average age of World Cup players