This document summarizes key aspects of web page classification, including features and algorithms. It discusses on-page and neighbor features that are useful for classification, such as text, tags, URLs, and visual analysis, as well as features from linked pages. Popular algorithms mentioned include k-NN, SVMs, relaxation labeling, and relational learning approaches. The document also covers hierarchical classification, combining multiple sources of information, and research on blog classification.
Web Page Classification
1. Web Page Classification: Features and Algorithms. Xiaoguang Qi and Brian D. Davison, Department of Computer Science & Engineering, Lehigh University, June 2007. Presented by Mr. Pachara Chutisawaeng, Department of Computer Science, Mahidol University, July 2009
2. Agenda Webpage classification significance Introduction Background Applications of web classification Features Algorithms Blog Classification Conclusion
16. Webpage classification significance. What is different between the web of the past and of the present? Flash animation, JavaScript, video clips, embedded objects, and advertising (e.g., Google AdSense, Yahoo!).
17. Introduction
18. Introduction. Webpage classification, or webpage categorization, is the process of assigning a webpage to one or more category labels, e.g. “News”, “Sport”, “Business”. GOAL: survey existing web classification techniques to find new areas for research, including web-specific features and algorithms that have been found to be useful for webpage classification.
19. Introduction. What will you learn? A detailed review of useful features for web classification, the algorithms used, and future research directions. Webpage classification can help improve the quality of web search. Knowing this can also help you improve your SEO skills, since each search engine keeps its techniques secret.
20. Background
21. Background. The general problem of webpage classification can be divided into: subject classification, the subject or topic of a webpage, e.g. “Adult”, “Sport”, “Business”; and function classification, the role that the webpage plays, e.g. “Personal homepage”, “Course page”, “Admission page”.
22. Background. Based on the number of classes, webpage classification can be divided into binary classification and multi-class classification. Based on the number of classes that can be assigned to an instance, classification can be divided into single-label classification and multi-label classification.
24. Applications of web classification
25. Applications of web classification. Constructing and expanding web directories (web hierarchies): Yahoo!, the ODP or “Open Directory Project” (http://www.dmoz.org). How are they doing?
27. Applications of web classification. How are they doing? By human effort: in July 2006, it was reported that there were 73,354 editors in the dmoz ODP. As the web changes and continues to grow, “automatic creation of classifiers from web corpora based on user-defined hierarchies” was introduced by Huang et al. in 2004. The starting point of this presentation!
28. Applications of web classification Improving quality of search results Categories view Ranking view
30. Applications of web classification. Improving quality of search results: categories view, ranking view. In 1998, Page and Brin developed the link-based ranking algorithm called PageRank, which is calculated from the hyperlink structure without considering the topic of each page.
32. Applications of web classification. Helping question answering systems. Yang and Chua (2004) suggested finding answers to list questions, e.g. “name all the countries in Europe”. How it worked: queries were formulated and sent to search engines, and the results were classified into four categories: collection pages (contain lists of items), topic pages (represent answer instances), relevant pages (support answer instances), and irrelevant pages. After that, topic pages are clustered, from which answers are extracted. Question answering systems could benefit from web classification in both accuracy and efficiency.
33. Applications of web classification Other applications Web content filtering Assisted web browsing Knowledge base construction
34. Features
35. Features. In this section, we review the types of features that are useful in webpage classification research. The most important feature that makes webpage classification different from plain-text classification is the HYPERLINK (<a>…</a>). We classify features into on-page features, directly located on the page, and neighbor features, found on the pages related to the page to be classified.
36. Features: On-page
37. Features: On-page. Textual content and tags. N-gram features: imagine two different documents, one containing the phrase “New York” and the other containing the separate terms “New” and “York”; a 2-gram feature distinguishes them. Yahoo! used 5-gram features. HTML tags or DOM: title, headings, metadata, and main text, each assigned an arbitrary weight. Nowadays most websites use nested lists (<ul><li>), which really help in webpage classification.
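As a rough illustration of the two ideas above, the sketch below counts contiguous n-grams and weights term counts by the HTML section they come from. The section names and weight values here are illustrative assumptions, not values from the survey.

```python
from collections import Counter

def ngram_features(tokens, n=2):
    # Count contiguous n-grams, so a phrase like "new york" is kept
    # as a single feature rather than two independent terms.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_term_counts(sections, weights):
    # Weight each term count by the HTML section it came from
    # (e.g. title > heading > body); the weights are arbitrary,
    # as the slide notes.
    counts = Counter()
    for section, text in sections.items():
        w = weights.get(section, 1)
        for term in text.lower().split():
            counts[term] += w
    return counts

page = {"title": "New York travel guide",
        "h1": "Visiting New York",
        "body": "A guide to travel in New York"}
feats = weighted_term_counts(page, {"title": 3, "h1": 2, "body": 1})
bigrams = ngram_features("new york travel guide".split())
```

With these weights, a term appearing in the title contributes three times as much as the same term in the body.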
38. Features: On-page. Textual content and tags. URL: Kan and Thi (2004) demonstrated that a webpage can be classified based on its URL alone.
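A minimal sketch of URL-based features: split a URL into word-like tokens that a classifier could consume. The separator set and digit filter are simple illustrative choices, not Kan and Thi's actual segmentation method.

```python
import re

def url_tokens(url):
    # Split on common URL separators and drop empty or purely
    # numeric fragments, leaving word-like tokens as features.
    return [t for t in re.split(r"[/.\-_?=&:]+", url.lower())
            if t and not t.isdigit()]

tokens = url_tokens("http://www.example.edu/courses/cs101/syllabus.html")
```

For a course page like the hypothetical URL above, tokens such as "courses" and "syllabus" already hint at the page's function.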
39. Features: On-page. Visual analysis. Each webpage has two representations: the text, represented in HTML, and the visual representation rendered by a web browser. Most approaches focus on the text while ignoring the visual information, which is useful as well. Kovacevic et al. (2004): each webpage is represented as a hierarchical “visual adjacency multigraph”, in which each node represents an HTML object and each edge represents a spatial relation in the visual representation.
41. Features: Neighbors Features
42. Features: Neighbors features. Motivation: on a particular page, the useful on-page features discussed previously may be missing or unrecognizable.
44. Features: Neighbors features. Underlying assumptions: when exploring the features of neighbors, some assumptions are implicitly made in existing work. The presence of many “sports” pages in the neighborhood of a page increases the probability of that page being in “Sport”. Chakrabarti et al. (2002) and Menczer (2005) showed that linked pages were more likely to have terms in common. Neighbor selection: existing research mainly focuses on pages within two steps of the page to be classified, i.e. at a distance no greater than two. There are six types of neighboring pages: parent, child, sibling, spouse, grandparent, and grandchild.
46. Features: Neighbors features. Neighbor selection (cont'd). Fürnkranz (1999): the text on the parent pages surrounding the link is used to train a classifier instead of the text on the target page. A target page will be assigned multiple labels; these labels are then combined by some voting scheme to form the final prediction of the target page’s class. Sun et al. (2002): using the text on the target page together with page titles and anchor text from parent pages can improve classification compared to a pure text classifier.
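The voting idea can be sketched as a simple majority vote over the labels predicted from each incoming link's context. This is a minimal stand-in for the idea, not Fürnkranz's exact combination scheme.

```python
from collections import Counter

def vote(predicted_labels):
    # Each parent page that links to the target contributes one
    # predicted label; the most frequent label wins.
    if not predicted_labels:
        return None
    return Counter(predicted_labels).most_common(1)[0][0]

final = vote(["sports", "sports", "news", "sports"])
```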
47. Features: Neighbors features. Neighbor selection summary: parent, child, sibling, and spouse pages are all useful in classification, with siblings found to be the best source. However, information from neighboring pages may introduce extra noise and should be used carefully.
49. Features: Neighbors features. Feature types: labels (assigned by a human editor or keyworder); partial content (anchor text, the text surrounding the anchor text, titles, headers); full content. Among the three, using the full content of neighboring pages is the most expensive, but it generates better accuracy.
50. Features: Neighbors features. Utilizing artificial links (implicit links): hyperlinks are not the only choice. An implicit link is a connection between two pages that appear in the results of the same query and are both clicked by users. Implicit links can help webpage classification as well as hyperlinks can.
52. Discussion: Features. Since the results of different approaches are based on different implementations and different datasets, it is difficult to compare their performance. Sibling pages are even more useful than parents and children; the explanation may lie in the process of hyperlink creation: a page often acts as a bridge connecting its outgoing links, which are therefore likely to share a common topic.
54. Tip! Tracking incoming links: how to know when someone links to you?
55. Algorithms
58. Dimension reduction: a way of boosting classification by emphasizing the features with the best discriminative power.
60. Dimension Reduction (cont.): Feature Selection. Simple approaches: use only the first fragment of each document; apply the first-fragment idea to web documents in hierarchical classification. Text categorization approaches: information gain, mutual information, etc.
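Information gain, one of the text categorization measures named above, scores a term by how much knowing its presence or absence reduces uncertainty about a document's class. A minimal stdlib-only sketch (the example documents and classes are invented):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Information gain of `term` for class prediction.

    `docs` is a list of token sets, `labels` the parallel class labels.
    IG = H(labels) - sum over term-present/term-absent partitions of
    the weighted entropy of each partition.
    """
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_term, without) if part)
    return entropy(labels) - conditional
```

Feature selection then simply keeps the top-k terms ranked by this score; a term that perfectly separates the classes attains the full entropy of the label distribution.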
62. Feature Selection (Cont'd): Text Categorization Measures. Expected mutual information and mutual information: two well-known metrics, used with variations of the k-Nearest Neighbor algorithm. Weighting terms according to the HTML tags in which they appear: terms within different tags carry different importance. Information gain: another well-known metric. It is still not apparent which measure is superior for web classification.
63. Feature Selection (Cont'd): Text Categorization Measures. Improving the performance of SVM classifiers through aggressive feature selection: a measure was developed that can predict selection effectiveness without training and testing classifiers. Latent Semantic Indexing (LSI): documents are reinterpreted in a smaller, transformed, but less intuitive space. Drawback: its high computational complexity makes it inefficient to scale to web classification, so experiments have been based on small datasets. Some work has adapted LSI to larger datasets, but this still needs further study.
66. Relational Learning (cont'd): Two Main Approaches. Relaxation labeling algorithms: originally proposed for image analysis; now used in image and vision analysis, artificial intelligence, pattern recognition, and web mining. Link-based classification algorithms: utilize two popular link-based algorithms, loopy belief propagation and iterative classification.
68. Relational Learning (cont'd): Link-based Classification Algorithms. The two popular link-based algorithms, loopy belief propagation and iterative classification, showed better performance on a web collection than textual classifiers. In the course of this work a toolkit was implemented that classifies networked data using a relational classifier together with a collective inference procedure; it demonstrated strong performance on several datasets, including web collections.
70. Modifications to traditional algorithms: traditional algorithms adjusted to the context of webpage classification. k-Nearest Neighbors (kNN): quantifies the distance between the test document and each training document using a dissimilarity measure; most existing kNN classifiers use cosine similarity or the inner product. Support Vector Machine (SVM).
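A compact, stdlib-only sketch of the standard kNN setup described above, using cosine similarity over bag-of-words vectors (the training documents below are invented toy data):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(test_doc, training, k=3):
    """Predict the majority class among the k most similar training docs.

    `training` is a list of (Counter, label) pairs; `test_doc` is a
    Counter of term frequencies.
    """
    ranked = sorted(training, key=lambda pair: cosine(test_doc, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The modifications surveyed on the following slides replace `cosine` with other similarity measures (e.g. term co-occurrence) or weight the neighbor votes probabilistically.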
71. Modification Algorithms (Cont'd): k-Nearest Neighbors Algorithm. Varieties of modifications: using term co-occurrence in documents; using probability computation; using "co-training."
72. k-Nearest Neighbors Algorithm (Cont'd): Modification Varieties. Using term co-occurrence in documents: an improved similarity measure in which the more co-occurring terms two documents share, the stronger the relationship between them; it performs better than normal kNN with cosine similarity or inner product measures. Using probability computation: the probability of a document d being in class c is determined by the distance between d and its neighbors and by the neighbors' own probability of being in c.
73. k-Nearest Neighbors Algorithm (Cont'd): Modification Varieties (2). Using "co-training": makes use of both labeled and unlabeled data, aiming for better accuracy. Scenario: binary classification. Two classifiers are trained on different sets of features, and the predictions of each are used to train the other on the unlabeled instances. Compared with training on labeled instances alone, co-training can cut the error rate by half. When generalized to multi-class problems with a large number of categories, co-training alone is not satisfactory; however, combining error-correcting output coding (using more classifiers than strictly necessary) with co-training can boost performance.
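The co-training loop above can be sketched concretely. This is a minimal illustration under invented assumptions, not the exact published algorithm: each example carries two disjoint token-set views, the base learner is a simple nearest-centroid scorer, and each round every view's model labels the unlabeled example it is most confident about.

```python
from collections import Counter

def centroid_model(examples):
    """Train a toy nearest-centroid classifier on (token-set, label) pairs.

    Returns a predict function mapping a token set to (label, score),
    where the score counts how many centroid tokens the document hits.
    """
    centroids = {}
    for tokens, label in examples:
        centroids.setdefault(label, Counter()).update(tokens)
    def predict(tokens):
        scores = {c: sum(cnt[t] for t in tokens) for c, cnt in centroids.items()}
        label = max(scores, key=scores.get)
        return label, scores[label]
    return predict

def co_train(labeled, unlabeled, rounds=2):
    """Minimal co-training sketch over two disjoint feature views.

    Each example is a pair (view0_tokens, view1_tokens).  Per round,
    a model trained on each view confidently labels one unlabeled
    example and adds it to the shared labeled pool.
    """
    pool, remaining = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for v in (0, 1):
            if not remaining:
                break
            predict = centroid_model([(x[v], y) for x, y in pool])
            best = max(remaining, key=lambda x: predict(x[v])[1])
            pool.append((best, predict(best[v])[0]))
            remaining.remove(best)
    return pool
```

The key property is that each view's confident predictions become training data for the other view, which is why co-training requires the feature sets to be (near-)disjoint.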
74. Modification Algorithms (Cont'd): SVM-based Approach. Classification normally requires both positive and negative examples; the SVM-based approach aims to eliminate the need for manual collection of negative examples while retaining similar classification accuracy.
76. Take a Break! The Internet's Ad Marketplace, Besides Google AdWords
78. Hierarchical Classification. Not much research exists, since most web classification work focuses on flat, same-level approaches. Approaches: "divide and conquer"; error minimization; topical hierarchy; hierarchical SVMs; the degree of misclassification; hierarchical text categorization.
79. Hierarchical Classification (Cont'd): Approaches. Divide and conquer: the classification problem is split hierarchically into sub-problems, which is more efficient and accurate than the non-hierarchical approach. Error minimization: when the lower-level category is uncertain, error is minimized by shifting the assignment up to the higher level. Topical hierarchy: classify a web page into a topical hierarchy and update the category information as the hierarchy expands.
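The divide-and-conquer idea can be sketched as routing a document down a topic tree, making one small decision per level instead of a single flat multi-class decision. The tree shape and the toy scorers below are invented for illustration.

```python
def hierarchical_classify(doc, tree):
    """Route a document down a topic tree (divide and conquer).

    `tree` maps a category name to (scorer, subtree): `scorer` rates
    how well a document fits that category, and `subtree` is a dict of
    the same shape ({} at the leaves).  Returns the category path.
    """
    path, node = [], tree
    while node:
        best = max(node, key=lambda name: node[name][0](doc))
        path.append(best)
        node = node[best][1]          # descend into the winning branch
    return path
```

Each node's classifier only has to separate a handful of sibling categories, which is what makes the hierarchical setting cheaper than one flat classifier over the whole taxonomy.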
80. Hierarchical Classification (Cont'd): Approaches (2). Hierarchical SVMs: observed to be more efficient than flat SVMs, although neither is satisfactorily effective on large taxonomies; hierarchical settings do more harm than good to kNN and naive Bayes classifiers. Classification by the degree of misclassification: as opposed to measuring binary correctness, the distance between the classifier-assigned class and the true class is measured. Hierarchical text categorization: a detailed review was provided in 2005.
82. Combining Information from Multiple Sources. Different sources can be utilized; combining link and content information is especially popular. A common approach: treat information from different sources as different (usually disjoint) feature sets, train multiple classifiers on them, and let the classifiers jointly generate the final decision. Such a combination usually has the potential to perform better than any single method.
83. Information Combination (Cont'd): Approaches. Voting and stacking: well-developed methods in machine learning. Co-training: effective for combining multiple sources, since different classifiers are trained on disjoint feature sets.
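The simplest of these combination schemes, majority voting, can be written in a few lines. This sketch assumes each source-specific classifier is just a callable returning a label; the classifiers themselves are placeholders.

```python
from collections import Counter

def majority_vote(doc, classifiers):
    """Combine classifiers trained on different sources (e.g. one on
    page text, one on link features) by simple majority vote.

    Ties go to the prediction produced earliest in the list.
    """
    votes = Counter(clf(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

Stacking replaces this fixed vote with a meta-classifier trained on the base classifiers' outputs, which is why it can learn that one source is more reliable than another.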
84. Information Combination (Cont'd): Cautions. Note that the need for additional resources can itself be a disadvantage, and the combination of two sources does not always perform better than each one separately.
85. Blog classification
88. Blog classification. The word "blog" was originally short for "web log." As blogging has gained popularity in recent years, an increasing amount of research about blogs has been conducted. It falls into three types: blog identification (determining whether a web document is a blog), mood classification, and genre classification.
89. Blog classification. Elgersma and de Rijke 2006: common classification algorithms for blog identification, using a number of human-selected features such as "Comments" and "Archives"; accuracy around 90%. Mihalcea and Liu 2006: classify blogs into two polarities of mood, happiness and sadness (mood classification). Nowson 2006: discussed the distinction among three types of blogs (genre classification): news, commentary, and journal.
90. Blog classification. Qu et al. 2006: automatic classification of blogs into four genres (personal diary, news, political, and sports) using a unigram tf-idf document representation and naive Bayes classification; their approach achieves an accuracy of 84%.
91. Conclusion
92. Conclusion. Webpage classification is a type of supervised learning problem that aims to categorize webpages into a set of predefined categories based on labeled training data. The authors expect that future web classification efforts will certainly combine content and link information in some form.
93. Conclusion. Future work would be well advised to: emphasize text and labels from siblings over other types of neighbors; incorporate anchor text from parents; and utilize other sources of implicit or explicit human knowledge, such as query logs and click-through behavior, in addition to existing labels, to guide classifier creation.
94. Thank you.
95. Questions?