The document discusses natural language processing (NLP) and big data. It is a presentation by Sarath P R on NLP and big data topics, including full text search and indexing, document clustering, representing data, and current trends. The presentation covers various NLP concepts and applications, such as how search engines work, using Apache Lucene for indexing and searching, and the definition of document clustering.
NLP & Bigdata: Motivation and Action
1. NLP & Bigdata
Motivation and Action
Sarath P R
sarath.amrita@gmail.com
IIIT-MK
Thiruvananthapuram
November 09, 2013
Sarath P R
sarath.amrita@gmail.com
NLP & Bigdata
Motivation and Action
2. About me
Working as Technical Lead - Bigdata
Like to develop software applications for good reasons
Independent Data Journalist at DScribe.IN
Holds a Master's in Computer Science
Like to travel and meet people
3. Agenda
Introduction
Full text Search and Index
Document Clustering
Representing Data
Stanford NLP
R and Weka
Social Media and Sentiment Analysis
Introduction to Bigdata
Current Trends
Conclusion
4. Introduction
Sorry! No definitions of NLP are copied here.
If you need a definition, ask me. Otherwise we will now 'see' what NLP is!
6. Introduction - 2-minute Targit Video
Watch the Targit video here: http://youtu.be/32KE0rbGZ9c
7. So What Is He (the Targit CTO) Saying?
“Calling your system, and getting delivered an analysis is right around the corner”
Go to Targit’s website, http://targit.com. You will see a lion standing on the front page.
They say “Targit is a courage Company”
That was all about motivation. There is no hidden agenda to promote Targit!
12. Introduction - Innovation
What we just saw is one aspect of NLP.
What is it?
It is speech recognition and analytics.
And what did they do?
They innovated!
17. Introduction - Search Engines & Information Retrieval
Tell me your opinion. A question follows.
Is Google an NLP company?
Yes, it is. The biggest one!
So how does Google work? I mean the search engine!
Where do the search results come from?
The answer is three things: crawler, index and algorithms.
Now we will look at a few NLP, machine learning and analytics topics in detail.
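To make the crawler, index and algorithms split concrete, here is a minimal sketch in Python. It is an illustration only, not how Google or Apache Lucene actually work: the `pages` dictionary is a hypothetical stand-in for what a crawler would fetch, `build_index` builds a toy inverted index, and `search` is a deliberately simple ranking step that scores by raw term counts. All function and variable names are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build a toy inverted index: term -> {doc_id: term count}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Return ids of documents containing every query term.

    Results are ordered by summed term counts (highest first),
    with ties broken by document id.
    """
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)
    scores = {doc: sum(p[doc] for p in postings) for doc in matches}
    return sorted(matches, key=lambda doc: (-scores[doc], doc))


# Hypothetical pages standing in for the output of a crawler.
pages = {
    "doc1": "Search engines crawl the web and build an index",
    "doc2": "An index maps each term to the documents that contain it",
    "doc3": "Ranking algorithms order the search results",
}

index = build_index(pages)
print(search(index, "index"))           # ['doc1', 'doc2']
print(search(index, "search results"))  # ['doc3']
```

The point of the inverted index is that a query only touches the postings of its own terms instead of scanning every page. Real engines replace the raw-count score with far richer ranking signals.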
24. Full text Search and Inverted Index
In information retrieval, full-text search refers to techniques for
searching a single computer-stored document or a collection in a
full-text database.
When the number of documents to search is potentially large, or
the quantity of search queries to perform is substantial, the
problem of full-text search is often divided into two tasks:
indexing and searching.
The indexing stage scans the text of all the documents and
builds a list of search terms, called an index.
In the search stage, when performing a specific query, only the
index is referenced, rather than the text of the original documents.
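The two stages can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene or any real search engine is implemented: the sample documents, the function names `build_index` and `search`, the whitespace tokenization, and the rule that every query term must match are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Indexing stage: scan every document once and record which documents contain each term."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Search stage: look only at the index and return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample collection
documents = {
    1: "full text search with an index",
    2: "searching the original text is slow",
    3: "an index makes search fast",
}
index = build_index(documents)
print(sorted(search(index, "index search")))
```

Note that `search` never reads `documents`; it only consults the index built in the first stage.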
27. Inverted index
It is the most popular data structure used in document
retrieval systems.
It is similar to the index at the back of a book.
It is used on a large scale, for example in search engines.
29. Index vs Inverted Index
Index
A forward index (or just index) lists the documents and the
words that appear in each of them.
Inverted Index
The inverted index lists the words and the documents in which
they appear.
31. Exercise
Have a look at the table below.
Document   Words
Doc 1      talk, iiitmk, campus, nlp
Doc 2      algorithm, bigdata, nlp
Doc 3      researchers, talk
What kind of index is it?
Create an inverted index from this forward index.
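One possible way to work the exercise in Python, assuming the forward index is held as a dictionary from document name to word list. The names `forward_index` and `invert` are made up for this sketch.

```python
from collections import defaultdict

# Forward index from the exercise table: document -> words
forward_index = {
    "Doc 1": ["talk", "iiitmk", "campus", "nlp"],
    "Doc 2": ["algorithm", "bigdata", "nlp"],
    "Doc 3": ["researchers", "talk"],
}

def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(list)
    for doc, words in forward.items():
        for word in words:
            if doc not in inverted[word]:
                inverted[word].append(doc)
    return dict(inverted)

inverted_index = invert(forward_index)
for word in sorted(inverted_index):
    print(word, "->", ", ".join(inverted_index[word]))
```

Words that occur in more than one document, here "nlp" and "talk", end up with more than one entry in their document list.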
38. Apache Lucene Demo
Which tool should we try for indexing and searching?
Apache Lucene is a full-featured text search engine library
Written entirely in Java
Open source
Scalable, high-performance indexing
Powerful, accurate and efficient search algorithms
Interesting features of Lucene Core
Allows simultaneous update and searching
Powerful query types such as phrase queries, wildcard queries,
and range queries
Fielded searching (e.g. title, author, contents)
40. Document Clustering
Definition
The process of grouping a set of physical or abstract objects into
classes of similar objects is called clustering.
A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects in
other clusters.
Clustering is applicable in many fields, including machine
learning, pattern recognition, image analysis, information
retrieval, and bioinformatics.
Clustering is an example of unsupervised learning in machine
learning.
Cluster analysis can be carried out with various algorithms.
42. The Library Example
Reference
I found this example in the book Mahout in Action by Sean Owen,
Robin Anil, Ted Dunning, and Ellen Friedman.
Inside the Library
A library has thousands of books.
The books in this library are not arranged in any particular
order.
Brainstorm!
Would you enjoy finding a book you want in there?
If not, give me some solutions.
47. Solutions
What about sorting the books alphabetically by title?
Yes, for readers searching for a book by title, that will help.
What if someone is looking for books on a general subject, for
example health?
Grouping books by topic will be more useful in this case.
But how would you even begin this grouping?
You would start reading the books one by one and group them! Good
work :-)
53. Steps in Clustering
Clustering involves the following
An algorithm, the method used to group the books together.
A notion of both similarity and dissimilarity.
In the library example we relied on our assessment of which
books belonged in an existing stack and which should start a
new one.
A stopping condition.
In the library example, this might be the point beyond which
books can’t be stacked anymore, or the point at which the stacks are
already quite dissimilar.
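The three ingredients can be made concrete with a small sketch of the "stacking" idea. This is only an illustration under stated assumptions, not the method from Mahout in Action: the book titles and keyword sets are invented, similarity is taken to be Jaccard overlap of keywords, the threshold of 0.2 is arbitrary, and the stopping condition is simply that every book has been placed.

```python
def jaccard(a, b):
    """Notion of similarity: shared keywords divided by all distinct keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def stack_books(books, threshold=0.2):
    """Algorithm: put each book on the first stack whose first book is similar enough,
    otherwise start a new stack. Stops when every book has been placed."""
    stacks = []
    for title, words in books.items():
        for stack in stacks:
            if jaccard(words, books[stack[0]]) >= threshold:
                stack.append(title)
                break
        else:
            stacks.append([title])
    return stacks

# Hypothetical books described by a few keywords each
books = {
    "Healthy Eating": {"health", "diet", "food"},
    "Learning Java": {"java", "programming", "code"},
    "Diet and Exercise": {"health", "diet", "exercise"},
    "Python Basics": {"python", "programming", "code"},
}
stacks = stack_books(books)
print(stacks)
```

With this data the two health books end up on one stack and the two programming books on another.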
57. K-Means Algorithm
Let’s look at an algorithm first, and then at how to automate the
grouping of books in the library example.
K-Means
k-means clustering aims to partition n observations into k
clusters.
It takes the input parameter k and partitions a set of n objects
into k clusters so that the resulting intracluster similarity is
high but the intercluster similarity is low.
Cluster similarity is measured with respect to the mean value of
the objects in a cluster, which can be viewed as the cluster’s
centroid.
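One common way to write this goal down is as a minimization of the squared distances between each object and the centroid of its cluster. The symbols below are assumptions introduced for this note and do not appear in the slides: S_1, ..., S_k are the clusters and \mu_i is the centroid (mean) of cluster S_i.

```latex
\[
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
\]
```

A small value of the sum means that objects lie close to their own centroid, which is what high intracluster similarity refers to.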
59. K-Means Example
Reference
Teknomo, Kardi. K-Means Clustering Tutorials.
http://people.revoledu.com/kardi/tutorial/kMean
Data
Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
Problem
We have 4 objects, each having 2 attributes.
We also know beforehand that these objects belong to two
groups of medicine (cluster 1 and cluster 2).
The problem now is to determine which medicines belong to
cluster 1 and which medicines belong to the other cluster.
61. Steps in K-means
Iterate until stable (i.e. no object changes group):
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance (find the closest
centroid)
Each medicine represents one point with two features (X, Y). We
can represent it as a coordinate in a feature space.
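The loop above can be sketched in plain Python on the four medicines. This is a minimal sketch, not a production implementation: the function name `kmeans` is made up, Euclidean distance is used via `math.dist` (Python 3.8+), the starting centroids (1, 1) and (2, 1) are medicines A and B, and the choice to leave the centroid of an empty group unchanged is an assumption.

```python
import math

def kmeans(points, centroids):
    """Repeat assign/update until no object changes group; return (groups, centroids)."""
    groups = None
    while True:
        # Steps 2 and 3: distance to every centroid, then pick the closest one.
        new_groups = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_groups == groups:  # stable: no object moved
            return groups, centroids
        groups = new_groups
        # Step 1 of the next pass: each centroid becomes the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, g in zip(points, groups) if g == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # assumption: an empty group keeps its centroid
        centroids = updated

# Medicines A, B, C, D as (weight index, pH)
points = [(1, 1), (2, 1), (4, 3), (5, 4)]
groups, final_centroids = kmeans(points, [(1, 1), (2, 1)])
print(groups)           # group index (0 or 1) for A, B, C, D
print(final_centroids)
```

The function returns one group index per object and the centroids of the final grouping.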
63. Euclidean distance
Every clustering problem is essentially based on a distance
between points.
Euclidean distance is the most commonly used distance measure.
Mathematically, the Euclidean distance between points with
coordinates (x1, y1) and (x2, y2) is
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
65. Iteration 0
Initial Value of Centroids
Take medicine A and medicine B as the first centroids.
Let c1 and c2 denote the coordinates of the centroids; then
c1 = (1, 1) and c2 = (2, 1).
Objects-Centroids Distance
Calculate the distance from each cluster centroid to each object.
The distance matrix using Euclidean distance at iteration 0 is
$D^0 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{bmatrix}$ (columns: A, B, C, D; rows: c1, c2)
67. Iteration 0
Each column in the distance matrix corresponds to an object.
The first row of the distance matrix corresponds to the
distance of each object to the first centroid, and the second
row is the distance of each object to the second centroid.
For example, the distance from medicine C = (4, 3) to the first
centroid c1 = (1, 1) is $\sqrt{(4 - 1)^2 + (3 - 1)^2} = \sqrt{13} \approx 3.61$.
Similarly, its distance to the second centroid c2 = (2, 1) is $\sqrt{(4 - 2)^2 + (3 - 1)^2} = \sqrt{8} \approx 2.83$.
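The whole distance matrix for iteration 0 can be recomputed with a short Python sketch. The variable names are made up for the example, and the values are rounded to two decimals for display only.

```python
import math

# Medicines as (weight index, pH) and the two starting centroids
objects = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]  # c1 and c2 at iteration 0

# One row per centroid, one column per object.
distance_matrix = [
    [round(math.dist(c, p), 2) for p in objects.values()]
    for c in centroids
]
for row in distance_matrix:
    print(row)
```

The third column holds the two distances for medicine C discussed above.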
68. Iteration 0
Objects clustering
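We assign each object to a group based on the minimum distance.
Thus, medicine A is assigned to group 1, and medicines B, C and D
are assigned to group 2.
Group Matrix
An element of the group matrix below is 1 if and only if the
object is assigned to that group.
$G^0 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{bmatrix}$ (columns: A, B, C, D; rows: group 1, group 2)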
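A group matrix can be derived mechanically from a distance matrix by marking, in each column, the row with the smallest distance. The sketch below assumes the iteration 0 distances recomputed from the medicine data and rounded to two decimals; the variable names are hypothetical.

```python
# Rows: distance to c1 and to c2; columns: medicines A, B, C, D
distances = [
    [0.0, 1.0, 3.61, 5.0],
    [1.0, 0.0, 2.83, 4.24],
]
n_groups, n_objects = len(distances), len(distances[0])

# Index of the closest centroid for each object (column).
closest = [
    min(range(n_groups), key=lambda g: distances[g][j])
    for j in range(n_objects)
]

# 1 if and only if the object in that column is assigned to the group in that row.
group_matrix = [
    [1 if closest[j] == g else 0 for j in range(n_objects)]
    for g in range(n_groups)
]
for row in group_matrix:
    print(row)
```

Every column contains exactly one 1, because each object belongs to exactly one group.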
70. Iteration 1
Determine new centroids
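Compute the new centroid of each group based on its new
members.
Group 1 has only one member,
so its centroid remains c1 = (1, 1).
Group 2 now has three members, so its centroid is the
average coordinate of the three members:
$c_2 = \left(\frac{2 + 4 + 5}{3}, \frac{1 + 3 + 4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right) \approx (3.67, 2.67)$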
72. Iteration 1
Objects-Centroids Distance
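Compute the distance of all objects to the new centroids.
The distance matrix at iteration 1 is
$D^1 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 3.14 & 2.36 & 0.47 & 1.89 \end{bmatrix}$ (columns: A, B, C, D; rows: c1 = (1, 1), c2 = (11/3, 8/3))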
73. Iteration 1
Objects clustering
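Again we assign each object based on the minimum distance.
Based on the new distance matrix, we move medicine B
to group 1 while all the other objects remain where they are.
Group Matrix
The group matrix at iteration 1 is
$G^1 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}$ (columns: A, B, C, D; rows: group 1, group 2)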
75. Iteration 2
Determine new centroids
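Compute the new centroid of each group based on its new
members.
Group 1 and group 2 both have two members, thus the
new centroids are
$c_1 = \left(\frac{1 + 2}{2}, \frac{1 + 1}{2}\right) = (1.5, 1)$ and $c_2 = \left(\frac{4 + 5}{2}, \frac{3 + 4}{2}\right) = (4.5, 3.5)$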
78. Iteration 2
Objects clustering
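Again we assign each object based on the minimum distance.
Group Matrix
The group matrix at iteration 2 is
$G^2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}$ (columns: A, B, C, D; rows: group 1, group 2)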
80. Results
We obtain the result that $G^2 = G^1$.
Comparing the grouping of the last iteration with this iteration
reveals that the objects no longer change groups.
Thus, the computation of the k-means clustering has become
stable and no further iteration is needed.
We take the final grouping as the result.
81. Document Representations
X-Y Plane Example
In the previous example, the measure of similarity (or similarity
metric) for the points was the Euclidean distance between two
points,
and that was in the X-Y plane.
Library Example
The library example had no such clear, mathematical measure,
and we relied entirely on our own judgment to assess book similarity.
83. Document Representations
Brainstorm !
We need a metric that can be implemented on a computer.
One possible metric could be based on the number of words
common to two books’ titles.
So “Harry Potter: The Philosopher’s Stone” and “Harry
Potter: The Prisoner of Azkaban” have three words in
common: “Harry”, “Potter” and “The”.
But, even though the book “The Lord of the Rings: The Two
Towers” is similar to the Harry Potter series, this measure of
similarity doesn’t capture that.
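The title-overlap metric is easy to try out. The sketch below is one possible reading of it: titles are lower-cased and split into words, and the function names `title_words` and `common_words` are made up for the example.

```python
import re

def title_words(title):
    """Lower-case a title and return its set of words (letters and apostrophes only)."""
    return set(re.findall(r"[a-z'’]+", title.lower()))

def common_words(title_a, title_b):
    """Words that appear in both titles."""
    return title_words(title_a) & title_words(title_b)

stone = "Harry Potter: The Philosopher’s Stone"
azkaban = "Harry Potter: The Prisoner of Azkaban"
towers = "The Lord of the Rings: The Two Towers"

print(sorted(common_words(stone, azkaban)))
print(sorted(common_words(stone, towers)))
```

The two Harry Potter titles share three words, while the first Harry Potter title and the Lord of the Rings title share only "the", which says nothing about their subject.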
88. Document Representations
Another Solution
We could assemble word counts for each book, and when the
counts are close for many words, judge the books similar.
But words like “a”, “an”, and “the” should not contribute
much to the similarity, because they occur frequently in both
books.
We could use numeric weights in the computation, and apply
low weights to these words to reduce their effect on the
similarity value.
Once we give a weight value to each word in a book, we can
easily find out the similarity of two books.
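A rough sketch of weighted word counts in Python follows. Everything specific here is an assumption chosen for illustration: the hand-picked list of low-weight words, the weight of 0.1, the three sample texts, and the use of summed absolute differences as the comparison.

```python
from collections import Counter

LOW_WEIGHT_WORDS = {"a", "an", "the"}  # assumption: hand-picked list

def weighted_counts(text, low_weight=0.1):
    """Count words, then scale the counts of very common words down."""
    counts = Counter(text.lower().split())
    return {
        word: count * (low_weight if word in LOW_WEIGHT_WORDS else 1.0)
        for word, count in counts.items()
    }

def difference(weights_a, weights_b):
    """Sum of absolute differences of the weighted counts (smaller means more similar)."""
    words = set(weights_a) | set(weights_b)
    return sum(abs(weights_a.get(w, 0.0) - weights_b.get(w, 0.0)) for w in words)

book_a = weighted_counts("the wizard and the dragon")
book_b = weighted_counts("a wizard and a dragon")
book_c = weighted_counts("the history of the empire")

print(difference(book_a, book_b))
print(difference(book_a, book_c))
```

With the low weights in place, the first two texts differ only slightly, even though one uses "the" and the other "a".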
94. Document Representations
What if one book is 300 pages long and the other 1000 pages
long?
We have to ensure that word weights are relative to the
length of the text.
We will see a method called TF-IDF shortly.
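One simple way to make weights relative to text length is to divide each word count by the total number of words. The sketch below assumes whitespace tokenization; the function name and sample texts are hypothetical.

```python
from collections import Counter

def relative_frequencies(text):
    """Word counts divided by the total number of words in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

short_text = "whale ship sea"
# Same mix of words, ten times longer (hypothetical data).
long_text = " ".join(["whale ship sea"] * 10)

short_freqs = relative_frequencies(short_text)
long_freqs = relative_frequencies(long_text)
```

Both texts end up with the same relative frequencies, even though the raw counts differ by a factor of ten.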
97. Document Representations
Task!
Explore the following distance measures:
1. Squared Euclidean distance measure
2. Manhattan distance measure
3. Cosine distance measure
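As a starting point for the task, here is a minimal sketch of the three measures on plain Python lists. The function names and sample vectors are hypothetical. Cosine distance is taken here as 1 minus the cosine similarity, which is a common definition, and it is undefined for all-zero vectors.

```python
import math

def squared_euclidean(u, v):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v (non-zero vectors only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length

d_sq = squared_euclidean(u, v)
d_man = manhattan(u, v)
d_cos = cosine_distance(u, v)
```

Note how the first two measures see u and v as far apart, while the cosine distance is close to zero because the vectors point in the same direction. This matters for documents of different lengths.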
98. Document Representations
Representing Data as Vectors
In mathematics, a vector is simply a point in space.
We saw how books can be clustered together based on their
similarity in words.
In practice, clustering can be applied to any kind of object,
provided we can distinguish similar items from dissimilar ones.
Clustering anything with an algorithm starts with representing
the object in a form that computers can read.
It is quite practical to think of objects in terms of their
measurable features or attributes.
99. Document Representations
Say we want to cluster a bunch of apples.
[Figure taken from Mahout in Action]
100. Document Representations
A small, round, red apple is more similar to a small, round,
green one than to a large, ovoid green one.
The process of vectorization starts with assigning each feature
to a dimension.
Let’s say weight is feature (dimension) 0, color is 1, and size
is 2.
So the vector of a small round red apple looks like
[0: 100 grams, 1: red, 2: small]
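To compute distances, values such as "red" and "small" must become numbers. The sketch below assumes a hypothetical encoding (the color and size codes, the 250 gram weight, and all names are made up for illustration); it is not the encoding used in the slides or the book.

```python
# Hypothetical numeric encodings, chosen only for illustration.
COLOR_CODES = {"red": 0.0, "green": 1.0}
SIZE_CODES = {"small": 1.0, "medium": 2.0, "large": 3.0}

def apple_to_vector(weight_grams, color, size):
    """Dimension 0 is weight, dimension 1 is color, dimension 2 is size."""
    return [float(weight_grams), COLOR_CODES[color], SIZE_CODES[size]]

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

small_red = apple_to_vector(100, "red", "small")
small_green = apple_to_vector(100, "green", "small")
large_green = apple_to_vector(250, "green", "large")  # weight is hypothetical

near = squared_distance(small_red, small_green)
far = squared_distance(small_red, large_green)
```

With this encoding the small red apple lands closer to the small green one than to the large green one. The weight dimension dominates the distance because its numbers are much larger, so in practice the features would usually be scaled first.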
104. Document Representations
A set of apples of different weights, sizes, and colors
converted to vectors.
[Figure taken from Mahout in Action]
105. Document Representations
Improving weighting with TF-IDF
Term frequency-inverse document frequency (TF-IDF)
weighting is a widely used improvement on simple term
frequency weighting.
We saw how books can be clustered together based on their
similarity in words.
Instead of simply using the term frequency as the value in the
vector, each value is multiplied by the inverse of the term’s
document frequency:
IDF = log(N / n)
N = total number of documents
n = number of documents that contain the term
TF-IDF = TF * IDF
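The formula above can be sketched directly in Python. Assumptions in this sketch: TF is the raw count of a term in a document, the logarithm is the natural log, tokens are split on whitespace, and the function name and three sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using TF * log(N / n)."""
    tokenized = [doc.lower().split() for doc in documents]
    total_docs = len(tokenized)          # N
    doc_freq = Counter()                 # n for each term
    for words in tokenized:
        doc_freq.update(set(words))      # count each term once per document
    weights = []
    for words in tokenized:
        term_freq = Counter(words)       # raw counts as TF
        weights.append({
            term: count * math.log(total_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

docs = ["the whale swims", "the ship sails", "the whale and the ship"]
weights = tf_idf(docs)
```

Because "the" appears in every document, its IDF is log(3/3) = 0 and its weight vanishes, while "swims", which appears in only one document, gets the highest weight in the first document.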
106. Stanford NLP
NLP Toolkit
The Stanford NLP group provides NLP toolkits for several major
computational linguistics problems.
Written in Java.
Open source.
107. Stanford NLP
Stanford Named Entity Recognizer
Named-entity recognition (NER) techniques locate and
classify atomic elements in text into predefined categories
such as the names of persons, organizations, and locations.
Consider the following text:
Hello Jona, I am in Indian Institute at Trivandrum
What are the entities in this text?
NER Demo
Stanford NER is also known as CRFClassifier.
Conditional Random Field (CRF) sequence models are used for
structured prediction.
112. Social Media and Sentiment Analysis
Twitter
Twitter Streaming Demo
Sentiment Analysis
Sentiment analysis is one of the hottest research areas in
computer science today.
A basic task in sentiment analysis is to classify the polarity of
a given text at the document, sentence, or aspect level.
That is, to decide whether the opinion expressed in a document,
a sentence, or an entity feature or aspect is positive,
negative, or neutral.
114. Social Media and Sentiment Analysis
Movie Review
Let’s look at a tweet about a recently released movie:
“Wow #Krish3 looks more exciting than Superman n
Spider-Man for sure ! The Roshans have made a truly world
class super hero film, again!”
Such snippets of text are a gold mine for companies and
individuals that want to monitor their reputation and get
timely feedback about their products and actions.
117. Social Media and Sentiment Analysis
Document-Level Sentiment Analysis
The main approach to document-level sentiment analysis is
supervised learning.
The system learns a classification model from the training data.
Common classification algorithms such as SVM, Naive Bayes,
and Logistic Regression can be used.
New documents are then tagged with their sentiment classes.
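The train-then-tag workflow can be illustrated with a small multinomial Naive Bayes classifier written from scratch. This is a sketch only: the four training sentences are hypothetical toy data, the function names are made up, tokens are split on whitespace, and add-one (Laplace) smoothing is one assumed choice. A real system would use far more data and a tested library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs is a list of (text, label) pairs. Returns the model."""
    class_counts = Counter()             # documents per class
    word_counts = defaultdict(Counter)   # word counts per class
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
training_data = [
    ("a truly exciting world class film", "positive"),
    ("wonderful exciting story", "positive"),
    ("boring and dull film", "negative"),
    ("a dull disappointing story", "negative"),
]
model = train_naive_bayes(training_data)
prediction = classify("exciting film", model)
```

Here "exciting film" is tagged positive because "exciting" occurs only in the positive training documents.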
118. Bigdata
Introduction to Bigdata
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
The challenges include capture, curation, storage, search, sharing,
transfer, analysis, and visualization.
119. Bigdata
3 Vs of Bigdata
Volume: ever-growing data of all types.
Velocity: for time-sensitive processes such as fraud and
intrusion detection, the speed at which data arrives is a
defining characteristic of big data.
Variety: any type of data, structured or unstructured,
such as text, sensor data, audio, video, click streams, log files,
and more.
123. References
Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman, Mahout in Action, Manning Publications.
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques.
Kardi Teknomo, K-Means Clustering Tutorials.
A first take at building an inverted index, http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html