Using an English noun phrase grammar defined by Hulth (2004a) as a starting point, we created an English noun phrase chunker to extract complex noun phrases from web-based articles. These phrases served as candidates for anchor texts linking articles within the About.com network of content sites. Prior to full-scale deployment, a group of annotators with domain authority in their respective fields evaluated articles that received these machine-generated anchor texts. Because of the limited time before deployment, we could not create a reference set of documents for comparing anchor texts across annotators, and therefore could not compute inter-annotator agreement directly. However, by treating the anchor text generator as another annotator, we could compute the average Cohen's Kappa coefficient (Landis and Koch, 1977) across all pairings of the anchor text generator with a human annotator. Our approach showed fair agreement on average (as described in Pustejovsky and Stubbs (2013, pp. 131-132)).
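The evaluation described above treats the anchor text generator as one more annotator and averages Cohen's Kappa over every (generator, human) pairing. A minimal sketch of that computation in plain Python; the binary labels below (1 = candidate phrase accepted as an anchor text) and the annotator names are illustrative, not data from the paper:

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n              # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return (po - pe) / (1 - pe)

generator = [1, 0, 1, 1, 0, 1, 0, 0]   # machine-generated decisions (illustrative)
annotators = {
    "ann_1": [1, 0, 1, 0, 0, 1, 0, 0],
    "ann_2": [1, 1, 1, 1, 0, 1, 0, 1],
}

# Average kappa across all generator-annotator pairings.
kappas = {name: cohen_kappa(generator, labels) for name, labels in annotators.items()}
mean_kappa = sum(kappas.values()) / len(kappas)
print(round(mean_kappa, 3))  # 0.625
```

On the Landis and Koch (1977) scale, an average kappa between 0.21 and 0.40 counts as "fair" agreement; the toy data here happens to land higher.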
Evaluation of Anchor Texts for Automated Link Discovery in Semi-structured Web Documents
1. Evaluation of Anchor Texts for Automated Link Discovery in Semi-structured Web Documents
Na’im Tyson, Jon Roberts, Jeff Allen and Matt Lipson
Sciences, About.com
Novel Incentives for Collecting Data & Annotation from People
Tyson, Roberts, Allen, Lipson (About.com) Evaluation of Annotations of Anchor Texts LREC 2016 1 / 19
2.-5. Purpose
Research Questions
Q: What do you do when you have little time and funding to annotate web pages?
A: Create an algorithm to annotate web pages with anchor texts
Q: How do you measure quality and consistency between your algorithm and human annotators?
A: Create an evaluation framework to determine consistency between annotations of algorithm and humans!
6–13. Introduction: What is About.com?
• Composition
  • Intent-driven website of two million articles divided into seven major verticals: food, health¹, home, money, style, tech and travel
  • Over 200 million monthly visits from the U.S., Western Europe and parts of India
• Content Structure
  • Content written by a large number of writers (experts) using a content-management system (CMS), with each expert having their own topic area
  • Nascent content can be linked to other articles with hypertext links ("inline links")
  • Inline links are necessary for user recirculation
  • Inline links have higher clicks per session than article listings (at the bottom of the page), trending articles and navigation units
¹ Top health content migrated to verywell.com.
14. Introduction: What makes Inline Links problematic?
• Experts hardly add links!
Figure 1: Histogram of link density of articles prior to the launch of automated link discovery.
15–18. Introduction: What makes Inline Links problematic? (Continued)
• Experts do not receive extra incentives for annotating anchor text for inline links
• Producing quality inline links takes time
• Experts must know their content and other neighboring content to link to it
• Experts are not compensated for directing traffic outside their site
19–20. Generating Anchor Texts: Making Anchor Texts from Keyword Extraction Algorithms
• Empirically-driven inline linking produces long sequences
Figure 2: Part-of-speech (POS) histogram of expert-generated anchor texts consisting of six words in full-text articles.
• TextRank [Mihalcea and Tarau, 2004], KEA [Witten et al., 1999] and Hulth [2004] produce sequences that are too short
21–26. Generating Anchor Texts: Making Anchor Texts using Chunk Parsing
• Starting point: generic grammar of POS sequences originally derived from Hulth (2004)
• Grammars used to identify candidates for anchor text suggestions
• Augmented with entity and datetime tags
• POS sequences expressed as chunk rules implemented in Python’s NLTK
• Candidates selected based on a weighted sum of document-level features between source and target documents
• Weights based on existing expert links
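The chunk-rule approach above can be sketched with NLTK’s RegexpParser. This is a minimal illustration, not the deck’s actual grammar: the real rules were derived from Hulth (2004) and augmented with entity and datetime tags, which are omitted here, and a pre-tagged sentence stands in for a POS tagger.

```python
import nltk

# Illustrative noun phrase grammar in the spirit of Hulth (2004);
# the deck's actual rules (with entity/datetime tags) are not shown here.
grammar = r"""
  NP: {<JJ.*>*<NN.*>+}    # optional adjectives, then one or more nouns
"""
chunker = nltk.RegexpParser(grammar)

# Pre-tagged sentence (stands in for a POS tagger's output)
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"),
          ("lazy", "JJ"), ("dog", "NN")]

tree = chunker.parse(tagged)
# Each NP chunk becomes a candidate anchor text
candidates = [" ".join(word for word, tag in subtree.leaves())
              for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(candidates)  # → ['quick brown fox', 'lazy dog']
```

In practice each candidate would then be scored by the weighted sum of document-level features mentioned above before a link target is chosen.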
27–33. Evaluating Anchor Texts: Quality Assurance Setup & Workflow
• Annotation of the top 86k documents
• Done across 13 annotators paid hourly
• Annotators modified documents within an annotation environment:
  1. Keep anchor text
  2. Modify anchor text (by expanding/contracting)
  3. Delete anchor text
  4. Modify link target
34. Evaluating Anchor Texts: Quality Assurance Setup & Workflow (Continued)
Figure 3: Example annotation environment used for investopedia.com.
35. Evaluating Anchor Texts: Computing Inter-labeler Agreement
• For each annotator...

             B positive   B negative
A positive       a            b
A negative       c            d

Table 1: Contingency table for the anchor text generator (A) and a single annotator (B).

Tokens:  ^  the  quick  brown  fox  jumps  over  the  lazy  dog  $
Labels:  d   c     a      a     a     b     b     b     b    b   d

Example 1: Phrase alignment between the anchor text generator (which selected "quick brown fox jumps over the lazy dog") and an annotator (who selected "the quick brown fox"). Each token, including the boundary markers ^ and $, is labeled a if both selected it, b if only the algorithm did, c if only the annotator did, and d if neither did.
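The token labeling of Example 1 can be reproduced with a small helper. The span-index representation below is an assumption for illustration, not the deck’s implementation.

```python
def label_tokens(tokens, algo_span, annotator_span):
    """Label each token position:
    'a' = both selected, 'b' = algorithm only,
    'c' = annotator only, 'd' = neither."""
    labels = []
    for i, _ in enumerate(tokens):
        in_algo = algo_span[0] <= i < algo_span[1]
        in_ann = annotator_span[0] <= i < annotator_span[1]
        if in_algo and in_ann:
            labels.append("a")
        elif in_algo:
            labels.append("b")
        elif in_ann:
            labels.append("c")
        else:
            labels.append("d")
    return labels

# Sentence with boundary markers, as in Example 1
tokens = ["^", "the", "quick", "brown", "fox", "jumps",
          "over", "the", "lazy", "dog", "$"]
# Algorithm selected tokens 2..9 ("quick ... dog");
# the annotator selected tokens 1..4 ("the quick brown fox")
labels = label_tokens(tokens, algo_span=(2, 10), annotator_span=(1, 5))
print("".join(labels))  # → dcaaabbbbbd
```

Counting the labels (a = 3, b = 5, c = 1, d = 2 over 11 tokens) yields the cell values shown in Table 2.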
36. Evaluating Anchor Texts: Computing Inter-labeler Agreement (Continued)

Algorithm: ^ quick brown fox jumps over the lazy dog $
Annotator: ^ the quick brown fox $
Labels: d c a a a b b b b b d

             B positive     B negative
A positive   3/11 = 0.27    5/11 = 0.45
A negative   1/11 = 0.09    2/11 = 0.18

Table 2: Contingency table computed from the relative word agreements of Table 1 for the generator (A) and annotator (B).
37–40. Evaluating Anchor Texts: Computing Inter-labeler Agreement (Final)

Cohen’s Kappa²

    K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}    (1)

Average Cohen’s Kappa over the set of annotators A

    \bar{K} = \frac{1}{|A|} \sum_{i=1}^{|A|} K_i    (2)

Annotation Precision of Document d_k

    \mathrm{Precision}(d_k) = \frac{a}{a + b}    (3)

Mean Average Precision over m documents

    \mathrm{MAP} = \frac{1}{|A|} \sum_{i=1}^{|A|} \frac{1}{m} \sum_{k=1}^{m} \mathrm{Precision}(d_k)    (4)

² See Pustejovsky and Stubbs [2013, p. 133–134] for a detailed example of how to compute both Pr(a) and Pr(e).
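Equations (1) through (4) can be checked numerically. The kappa computation below uses the counts from Table 2; the average-kappa and MAP inputs are hypothetical values chosen only to show the arithmetic.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's Kappa (Eq. 1) from the four contingency-table cells."""
    n = a + b + c + d
    pr_a = (a + d) / n                                      # observed agreement
    pr_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (pr_a - pr_e) / (1 - pr_e)

# Counts from Table 2 (the 11 tokens of Example 1)
k = cohens_kappa(a=3, b=5, c=1, d=2)
print(round(k, 4))  # → 0.0294

# Average Kappa (Eq. 2) over hypothetical per-annotator kappas
kappas = [0.31, 0.35, 0.33]
k_bar = sum(kappas) / len(kappas)

# Precision (Eq. 3) and MAP (Eq. 4) over hypothetical documents:
# precisions[i][k] holds Precision(d_k) for annotator i
precisions = [[0.5, 0.3], [0.4, 0.4]]
MAP = sum(sum(p) / len(p) for p in precisions) / len(precisions)
print(round(MAP, 2))  # → 0.4
```

Note that a single toy sentence yields a much lower kappa than the deck’s reported average of 0.33, which is computed over many documents.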
41. Results
• Average K̄: 0.33
  • A fair level of agreement on the Landis and Koch scale:
    slight < fair < moderate < substantial < almost perfect
• MAP: 0.40
  • Roughly 40% agreement, on average, between the linker and an annotator for this dataset
42. Discussion: Two-tier Approach to Web Page Annotation
Testing Labeling Consistency
• Create the same set of documents for annotation
• Documents already linked using the automated linking process
• Measure the mean average relative agreement, MAR
• Using the agreements a and d from Table 1 for an anchor text t, compute the average relative agreement, between one annotator and another, over a document containing n anchor texts:

    \frac{1}{n} \sum_{k=1}^{n} \left( a_{t_k} + d_{t_k} \right)    (5)

• Compute the mean across the set of all documents, D, to get the mean average relative agreement between any two annotators i and j:

    \mathrm{MAR}_{\langle i,j \rangle} = \frac{1}{|D|} \sum_{k=1}^{|D|} \frac{1}{n} \sum_{l=1}^{n} \left( a_{t_l} + d_{t_l} \right)    (6)

• Average all MAR_{⟨i,·⟩} scores for each annotator i
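Equations (5) and (6) can be sketched as follows. The data layout, a list of (a, d) proportion pairs per anchor text, is an assumption for illustration; the values are hypothetical.

```python
def mar(documents):
    """Mean average relative agreement (Eq. 6) between two annotators.

    `documents` is a list of documents, each a list of (a, d) pairs,
    one pair per anchor text, holding the relative agreement
    proportions a and d from the contingency table (Table 1).
    This layout is an assumption for illustration.
    """
    per_doc = [sum(a + d for a, d in doc) / len(doc) for doc in documents]  # Eq. 5 per document
    return sum(per_doc) / len(per_doc)                                      # mean over D

# Two hypothetical documents, with two and one anchor texts respectively
docs = [[(0.25, 0.25), (0.3, 0.2)], [(0.4, 0.1)]]
print(round(mar(docs), 2))  # → 0.5
```

A threshold on these scores, as the next slide suggests, would then flag annotator pairs whose agreement falls below the group norm.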
43. Discussion: Two-tier Approach to Web Page Annotation
Establish Best Practices
• Use a threshold on MAR to remove bad actors within the group [Neuendorf, 2002]
• Hold a general meeting of annotators exposing good and bad practices in annotation
44. Discussion: Improvements to Anchor Text Selection
• Offer more parses of sentences given the noun phrase grammar
  • NLTK returns the first rule that matches the grammar
• Probabilistic noun phrase grammar in NLTK
  • Compute probabilities based on the anchor texts used in the annotations of this study
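One way to estimate such rule probabilities is to count how often each POS sequence appears among annotator-accepted anchor texts. The deck does not specify the estimation procedure, so the sketch below, with made-up POS sequences, is only one plausible reading.

```python
from collections import Counter

# Hypothetical POS sequences of annotator-accepted anchor texts
accepted = [("JJ", "NN"), ("NN", "NN"), ("JJ", "NN"), ("NN",), ("JJ", "NN")]

# Relative frequency of each sequence; these estimates could serve as
# weights for the corresponding rules in a probabilistic NP grammar
counts = Counter(accepted)
total = sum(counts.values())
probs = {seq: n / total for seq, n in counts.items()}
print(probs[("JJ", "NN")])  # → 0.6
```

With such weights, a parser could rank all matching rules by probability instead of stopping at the first match.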
45. Conclusions: Lessons Learned
• Use a reference corpus
• Introduce annotators to one another to share best practices
  • Establish a social media group to foster communication
• Static rules require updating to take experts’ choices into account
  • A probabilistic grammar accommodates parsing flexibility
46. References
References I

R. Mihalcea and P. Tarau. TextRank: Bringing Order into Texts. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.

I. H. Witten et al. KEA: Practical Automatic Keyphrase Extraction. Proceedings of the Fourth ACM Conference on Digital Libraries, p. 254–256, 1999.

A. Hulth. Combining Machine Learning and Natural Language Processing for Automatic Keyword Extraction. PhD thesis, Department of Computer and Systems Sciences, Stockholm University, 2004.

47. References
References II

J. Pustejovsky and A. Stubbs. Natural Language Annotation for Machine Learning. O’Reilly Media, Inc., 2013.

K. A. Neuendorf. The Content Analysis Guidebook. Thousand Oaks, California: Sage Publications, 2002.