This presentation begins with a specific issue in text mining and connects it with word embeddings. It then highlights the importance of Wikipedia and, finally, discusses lessons to be learned from Wikipedia.
2. 12/01/17 2
Contents
● Introduction
● Text Mining
– Similar words
– Word ambiguity
● Word Embedding
– Related Research
– Toy Example
● Wikipedia
– Structure
– Phrase Chunking
– Case studies
3. 12/01/17 3
Problem
● Motivation
– People find great comfort in expressing their viewpoints in writing because it preserves thoughts for longer than oral communication.
– Text is a very popular means of communication on the World Wide Web, in the form of online news websites, social networks, emails, governmental websites, etc.
● Observation
Text may contain the following complexities
– Lack of contextual and background information
– Ambiguity due to more than one possible interpretation of the meaning of text
– Focus and assertions on multiple topics
4. 12/01/17 4
Text Mining
● Motivation
With so much textual data around us especially on
the World Wide Web, there is a motivation to
understand the meaning of the data
● Definition
It is the process by which textual data is analyzed in
order to derive high quality information on the basis
of patterns
5. 12/01/17 5
Similar Words
● Can similar words be grouped together as one?
– Simple techniques
● Lemmatization (maps inflected forms, e.g. plurals, to their dictionary form; accurate but low coverage)
● Stemming (maps a word to a root form; inaccurate but high coverage)
– Complex technique
● A word is known by the company it keeps → Word Embeddings
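The trade-off between the two simple techniques can be sketched in a few lines. This is a toy illustration, not a real stemmer or lemmatizer: the dictionary `LEMMA_DICT` and the suffix list `SUFFIXES` are hypothetical and chosen only to make the coverage/accuracy contrast visible.

```python
# Toy illustration only: real systems use full stemmers and lemmatizers.

# Hypothetical mini-dictionary: accurate, but covers only the listed words.
LEMMA_DICT = {"mice": "mouse", "apples": "apple", "studies": "study"}

# Hypothetical suffix list: applies to any word, but may produce non-words.
SUFFIXES = ("ies", "es", "s", "ing", "ed")


def lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased word."""
    return LEMMA_DICT.get(word.lower(), word.lower())


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for word in ["mice", "studies", "phones"]:
    print(word, "->", lemmatize(word), "|", stem(word))
```

The lemmatizer handles "mice" correctly but leaves "phones" untouched because it is not in the dictionary, while the stemmer changes "phones" and "studies" but yields truncated forms that are not words.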
6. 12/01/17 6
Word Ambiguity
● Is Apple a company or a fruit?
– “Apple tastes better than blackberry”
– “Apple phones are better than blackberry”
● Context is important
– Tastes → Fruit
– Phones → Apple Inc.
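A minimal sketch of using context to resolve the ambiguity, assuming a hand-written set of cue words per sense. The `SENSE_CUES` table is hypothetical; a real system would learn such associations from data rather than list them by hand.

```python
# Hypothetical cue words for each sense of "apple" (assumption, not from the slides).
SENSE_CUES = {
    "fruit": {"tastes", "eat", "juice", "sweet"},
    "Apple Inc.": {"phones", "iphone", "laptop", "software"},
}


def disambiguate(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the sentence tokens."""
    tokens = set(sentence.lower().split())
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue word present, the context gives no evidence either way.
    return best if scores[best] > 0 else "unknown"


print(disambiguate("Apple tastes better than blackberry"))
print(disambiguate("Apple phones are better than blackberry"))
```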
7. 12/01/17 7
Word Embedding
● Definition
– It is a technique in NLP that quantifies a concept
(word or phrase) as a vector of real numbers.
● Simple application scenario
– How similar are two words?
– Similarity(vector(good), vector(best))
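One common choice for the similarity function is cosine similarity. The sketch below assumes that choice and uses made-up three-dimensional vectors; real embeddings have hundreds of dimensions and are learned from text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
vectors = {
    "good": [0.9, 0.1, 0.3],
    "best": [0.8, 0.2, 0.4],
    "table": [0.1, 0.9, 0.0],
}

print(cosine_similarity(vectors["good"], vectors["best"]))
print(cosine_similarity(vectors["good"], vectors["table"]))
```

With these vectors, "good" scores higher against "best" than against "table", which is the behaviour the similarity scenario relies on.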
8. 12/01/17 8
Related Research
● Word embeddings
– Word2Vec
● It is a predictive model that uses a two-layer neural network
– FastText
● It is an extension of Word2Vec by Facebook
– GloVe
● It is a count-based model that performs dimensionality reduction on the co-occurrence matrix
● Wikipedia based Relatedness
– Semantic Relatedness Framework
● It uses the Wikipedia sub-category hierarchy to measure relatedness
9. 12/01/17 9
Toy Example → Word Embeddings
● Train co-occurrence matrix
● Apply cosine similarity
● Find vectors
● Further concepts
– Dimensionality Reduction
– Window size
– Filter words
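The toy example can be sketched as follows, assuming a tiny made-up corpus, a symmetric context window, and a one-word stopword filter. Each row of the co-occurrence counts serves directly as a word vector; dimensionality reduction is left out to keep the sketch short.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus and stopword list, for illustration only.
CORPUS = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the dog chases the cat",
]
STOPWORDS = {"the"}


def build_cooccurrence(sentences, window=2, stopwords=frozenset()):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Filter words first, so the window spans only the kept tokens.
        tokens = [t for t in sentence.lower().split() if t not in stopwords]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


matrix = build_cooccurrence(CORPUS, window=2, stopwords=STOPWORDS)
print(cosine(matrix["king"], matrix["queen"]))
print(cosine(matrix["king"], matrix["dog"]))
```

"king" and "queen" share the same context words ("rules", "kingdom") in this corpus, so their vectors point in the same direction, while "king" and "dog" share none.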
10. 12/01/17 10
Word Analogies
● Man is to Woman as King is to ____ ?
● London is to England as Islamabad is to ____ ?
● Using vectors, we can say
– King – Man + Woman → Queen
– Islamabad – London + England → Pakistan
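The vector arithmetic can be illustrated with hand-made two-dimensional vectors. The values below are assumptions chosen so that one axis roughly encodes royalty and the other gender; learned embeddings are not this clean, and the nearest-neighbour search by cosine similarity is one common way to read off the answer.

```python
import math

# Hand-made vectors (axis 0: royalty, axis 1: gender); illustration only.
VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [1.0, 0.8],
    "apple": [0.1, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' via vector(c) - vector(a) + vector(b)."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is a new word.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "woman", "king", VECTORS))
```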
11. 12/01/17 11
Why Wikipedia for Text Mining?
● One of the largest encyclopedias
● Free to use
● Collaboratively and actively updated
12. 12/01/17 12
Wikipedia
● Each article has a title that identifies a concept.
● Each article contains content that defines a particular concept textually.
● Each article belongs to one or more categories.
– E.g., the article ‘Espresso’ belongs to ‘Coffee drinks’, ‘Italian cuisine’, etc.
● Each Wikipedia category generally has parent and child categories.
– E.g., ‘Italian cuisine’ has parent categories ‘Italian culture’, ‘Cuisine by nationality’, etc.
– E.g., ‘Italian cuisine’ has child categories ‘Italian desserts’, ‘Pizza’, etc.
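The structure described above can be held in two small mappings. This is a minimal in-memory sketch with a hypothetical `CategoryGraph` class, populated only with the Espresso and Italian cuisine examples from the slide; it does not read real Wikipedia data.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryGraph:
    """Minimal model of articles, categories, and category edges."""
    article_categories: dict = field(default_factory=dict)  # article -> set of categories
    parents: dict = field(default_factory=dict)             # category -> set of parent categories

    def add_article(self, title, categories):
        self.article_categories.setdefault(title, set()).update(categories)

    def add_edge(self, parent, child):
        self.parents.setdefault(child, set()).add(parent)

    def children(self, category):
        """Categories that list `category` as one of their parents."""
        return {c for c, ps in self.parents.items() if category in ps}


graph = CategoryGraph()
graph.add_article("Espresso", ["Coffee drinks", "Italian cuisine"])
graph.add_edge("Italian culture", "Italian cuisine")
graph.add_edge("Cuisine by nationality", "Italian cuisine")
graph.add_edge("Italian cuisine", "Italian desserts")
graph.add_edge("Italian cuisine", "Pizza")

print(sorted(graph.parents["Italian cuisine"]))
print(sorted(graph.children("Italian cuisine")))
```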
13. 12/01/17 13
Wikipedia Graph Structure
[Figure: Wikipedia Category Graph Structure along with Wikipedia Articles. Nodes: categories C1 to C7, C9, C10 and articles A1 to A4. Legend: Category, Article, Category Edge, Article Belonging to Category, Article Link.]
14. 12/01/17 14
Example of Wikipedia Category Structure
[Figure: Truncated Wikipedia Category Graph. Categories shown: academic_disciplines, science, interdisciplinary_fields, scientific_disciplines, behavioural_sciences, society, social_sciences, science_studies, information_technology, information, sociology, information_science.]
15. 12/01/17 15
Phrase Chunking using Wikipedia
● Input
– I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good.
● Conversion into lowercase
– i prefer samsung s5 over htc, apple, nokia because it is economical and good.
● Phrase chunking using phrase boundaries
– i prefer samsung s5 over htc apple nokia because it is economical and good
● Longest phrase that matches with Wikipedia Article Title or Redirect (which is not a stopword)
– Extracted phrases: prefer, samsung s5, htc, apple, nokia, economical, good
– Removed stopwords: i, over, because, it, is, and
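The chunking steps on this slide can be sketched as a greedy longest-match search. The title set `WIKI_TITLES` and the stopword list are small hypothetical stand-ins for the real Wikipedia title/redirect index, and the `max_len` cap on phrase length is an assumption, not something the slides specify.

```python
import re

# Hypothetical stand-ins for the Wikipedia title/redirect index and a stopword list.
WIKI_TITLES = {"samsung", "samsung s5", "htc", "apple", "nokia",
               "prefer", "economical", "good"}
STOPWORDS = {"i", "over", "because", "it", "is", "and"}


def chunk_phrases(text, titles=WIKI_TITLES, stopwords=STOPWORDS, max_len=3):
    """Lowercase, split at punctuation, then greedily take the longest title match."""
    phrases = []
    # Punctuation marks act as phrase boundaries: no phrase may span them.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        tokens = segment.split()
        i = 0
        while i < len(tokens):
            # Try the longest candidate first, then shrink.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in titles and candidate not in stopwords:
                    phrases.append(candidate)
                    i += n
                    break
            else:
                i += 1  # nothing matches here (e.g. a stopword); skip the token
    return phrases


sentence = "I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good."
print(chunk_phrases(sentence))
```

Trying the longest candidate first is what makes "samsung s5" win over the shorter title "samsung".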
16. 12/01/17 16
Word Embedding using Wikipedia
● We can find more complex relationships due to
– Article-Category Graph structure
– Multi-lingual relations
– Infobox fields (birth, age, etc.)
18. 12/01/17 18
Perspective Aware Approach to Search
● Problem: The result set a search engine (Google, Bing, Yahoo) returns for a user's query may carry an inherent perspective, caused either by the search engine itself or by the underlying collection.
● PAS is a system that allows users to specify a perspective together with their query at query time.
● The system allows users to quickly surmise the presence of the perspective in the returned set.
19. 12/01/17 19
Perspective Aware Approach to Search
● Perspective is modelled using the Wikipedia article-category graph structure
– Perspective: activism
– Articles defining activism are fetched from Wikipedia by looking into the category graph structure
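The slides do not say how the category graph is traversed. One plausible reading, sketched below, is a breadth-first walk from the perspective's category down through its sub-categories, collecting articles along the way. The depth limit, the category names, and the article assignments are all hypothetical.

```python
from collections import deque

# Hypothetical toy slice of the category graph; names are made up for illustration.
SUBCATEGORIES = {
    "activism": ["activism_by_issue", "activists"],
    "activism_by_issue": ["environmentalism"],
}
CATEGORY_ARTICLES = {
    "activism": ["Activism"],
    "activism_by_issue": ["Civil disobedience"],
    "environmentalism": ["Environmental movement"],
    "activists": ["Community organizing"],
}


def articles_for_perspective(root, max_depth=2):
    """Breadth-first walk from `root`, collecting articles up to `max_depth` levels down."""
    seen, articles = {root}, []
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        articles.extend(CATEGORY_ARTICLES.get(category, []))
        if depth < max_depth:
            for child in SUBCATEGORIES.get(category, []):
                if child not in seen:  # category graphs can contain cycles
                    seen.add(child)
                    queue.append((child, depth + 1))
    return articles


print(articles_for_perspective("activism"))
```

A depth limit is used here because deep descendants of a category tend to drift away from the original topic; whether PAS applies such a limit is not stated in the slides.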
21. 12/01/17 21
Keyword Extraction via Identification of Domain-Specific Keywords
● Problem: Given a collection of document titles from different school websites, we extract domain-specific keywords for the entire website that represent the domain.
● Example: “Information Retrieval”, “Science”
[Figure: pipeline diagram. Labels, in the order they appear: Title of Web Pages; Wikipedia Articles & Redirects; Intersected Phrases; Community Detection Algorithm; Wikipedia Category Graph; Domain-Specific Phrases; Identifies readable phrases; Domain-Specific Single Terms; Merging both; Domain-Specific Keywords; By exploiting Wikipedia Article-Category Structure.]
22. 12/01/17 22
Innovation in Automotive
Red → Probability 1.0
Green → Probability 0.5
White → Probability 0.0
Size represents how much a category is mentioned inside the dataset