Natural Language Processing using JavaScript "Natural" Library. This deck covers Natural Language Understanding with the JavaScript "Natural" library in detail.
1. Basic Natural Language Processing using Natural (JavaScript/Node) Library
Aniruddha Chakrabarti
AVP and Chief Architect, Digital, Mphasis
@anchakra | Linkedin.com/in/aniruddhac | slideshare.net/aniruddha.chakrabarti/
2. Agenda
• Emergence of Artificial Intelligence, AI First
• What is Natural Language Processing (NLP)
• Natural JavaScript/Node NLP Library
• Tokenization - Word Tokenizer
• Stemming and Lemmatization
• String Distance
• Inflectors
• Phonetics
• N-Grams
• Classifier
• tf-idf
• POS Tagger
• Spell Check
3. Third Era of Computing * - AI First/AI Everywhere (Cognitive Systems)
* From “The Computing Universe” by Tony Hey and Gyuri Pápay
• Artificial Intelligence has emerged as the third era of computing, after tabulating machines and programmable systems.
• Eras shown on the timeline: Tabulating Machines (1960-1980), Programmable Systems (1980-2010), AI First/AI Everywhere (Cognitive Systems) (2010-Current), Real AI ? (?)
• Timeline markers: AI Winter, AI Summer
→ Turing Machine
→ Automating manual processes, tabulating data
→ Reducing manual effort and time
→ IBM System/360 (S/360), Mainframes, AS/400
→ Computing Power (Moore’s Law)
→ Systems need to be explicitly programmed using explicit logic and rules. Pre-programmed
→ Personal Computers (PCs), Communication (Networked PCs, Client/Server, Internet, WWW)
→ Automating business processes
→ Mostly structured data
→ Systems that learn from historical data and can make predictions. Not rule-based systems.
→ Uses Machine Learning, NLP to analyze unstructured data (text, image, audio, video)
→ Predictive Analytics, Deep Learning, Neural Nets
→ OCR, Speech recognition, Text to speech, Face recognition, Video analysis, …
→ Cognitive Services (pay-as-you-go model) – IBM Watson, Microsoft Cognitive Services, …
→ Robotics, Internet of Things, Conversational Systems, Wearables, Blur of physical & virtual
→ Still mostly Weak AI / Narrow AI
→ Strong AI / Full AI
→ Artificial General Intelligence (AGI)
4. Gartner Hype Cycle … 2017
• AI technologies like Cognitive Computing, Virtual Assistants/Chatbots, Conversational AI, Machine
Learning, Deep Learning and Autonomous Vehicles appear at the peak of the Gartner Hype Cycle
for Emerging Technologies, 2017.
• Reinforcement Learning and Artificial General Intelligence (AGI) appear at the start of the
hype cycle; they are expected to peak in the coming years.
5. Emergence of “AI Everywhere”
Gartner reckons AI to be one of the three megatrends. AI technologies like Conversational UI,
Machine Learning, Deep Learning and Cognitive Computing constitute “AI Everywhere”.
6. What is Natural Language Processing?
• Field of computer science, artificial intelligence and computational linguistics concerned
with the interactions between computers and human (natural) languages, and, in particular,
concerned with programming computers to fruitfully process large natural language corpora –
Wikipedia
• Broadly categorized into two areas -
▪ Natural Language Understanding (NLU)
▪ Natural Language Generation (NLG)
7. Some applications of NLP
• Spell correction (MS Word/ any other editor)
• Search engines (Google, Bing, Yahoo, Wolfram Alpha)
• Speech engines (Siri, Google Voice, Cortana)
• Personal Voice Assistants (Amazon Alexa, Google Home, …)
• Spam classifiers (All e-mail services)
• News feeds (Google, Yahoo!, and so on)
• Machine translation (Google Translate, and so on)
• Chatbots, Intelligent Virtual Agent/IVA
• IBM Watson, Microsoft LUIS, Amazon Lex/Alexa
8. NLP Tools & Libraries
• GATE
• Mallet (Java)
• Open NLP – Apache (Java)
• UIMA
• CoreNLP - Stanford CoreNLP toolkit (Java)
• Gensim
• Natural Language Toolkit / NLTK (Python) – one of the most widely used NLP libraries and tools
• spaCy (Python) – a standalone library implemented in Python and Cython
• TextBlob
• Natural Library (JavaScript/Node)
9. What is Natural
• "Natural" is a general natural language processing library for nodejs.
• Supports basic NLP tasks like tokenizing, stemming, classification, phonetics, tf-idf, WordNet,
string similarity, inflections
• At the moment, most of the algorithms are English-specific
• Created by Chris Umbel
• Loosely based on NLTK (Python) NLP Library
• https://github.com/NaturalNode/natural
• http://www.chrisumbel.com/article/node_js_natural_language_porter_stemmer_lancaster_bayes_naive_metaphone_soundex
10. Natural library install and setup
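• Install using npm (the package manager for Node); use the -g switch for a global installation.
Note that require() resolves locally installed packages by default, so a project-local install
(npm install natural) is usually what a script needs.
• Include the Natural package through require
npm install -g natural
// include the natural library
let Natural = require('natural');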
11. Tokenization
• A word (Token) is the minimal unit that a machine can understand and process.
• Tokenization is the process of splitting the raw string into meaningful tokens
• Raw text cannot be further processed without going through tokenization.
• Complexity of tokenization varies according to the need of the NLP application, and the
complexity of the language itself.
▪ In English it can be as simple as choosing only words and numbers through a regular
expression. But for Chinese and Japanese, it will be a very complex task.
• Two primary types of tokenizers:
▪ Word Tokenizer: Tokenizes raw text to words
▪ Sentence Tokenizer: Tokenizes raw text to sentences
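To make the two tokenizer types concrete, here is a minimal sketch in plain JavaScript. It does not use Natural; the function names `tokenizeWords` and `tokenizeSentences` are illustrative assumptions, and the regular expressions only cover the simple English case described above.

```javascript
// Word tokenizer: keep runs of letters and digits, drop everything else.
function tokenizeWords(text) {
  return text.match(/[A-Za-z0-9]+/g) || [];
}

// Sentence tokenizer: a sentence is a run of text ending in '.', '!' or '?'
// (or the end of the input). Abbreviations such as "Dr." are not handled.
function tokenizeSentences(text) {
  const matches = text.match(/[^.!?]+[.!?]*/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

const sample = "Hello, how are you? I don't know you!";
console.log(tokenizeWords(sample));
// prints [ 'Hello', 'how', 'are', 'you', 'I', 'don', 't', 'know', 'you' ]
console.log(tokenizeSentences(sample));
// prints [ 'Hello, how are you?', "I don't know you!" ]
```

Even this small example shows why tokenization is application-specific: the apostrophe in "don't" splits the word in two, which may or may not be what a downstream step expects.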
12. Word Tokenizer
• A word (Token) is the minimal unit that a machine can understand & process
• Tokenization is the process of splitting the raw string into meaningful tokens – Tokenizer
tokenizes or splits raw text into words
• Natural comes with multiple tokenizers -
▪ Word Tokenizer: a tokenizer that divides a text into sequences of alphabetic and
numeric characters. (Ignores punctuation)
▪ Word Punct Tokenizer: Word + punctuation tokenizer. A tokenizer that divides a text into
sequences of alphabetic and non-alphabetic characters.
▪ Treebank Word Tokenizer: uses regular expressions to tokenize text as in Penn
Treebank
▪ Regexp Tokenizer: Tokenizes text using regular expression patterns.
▪ Aggressive Tokenizer:
13. Word Tokenizer (Cont’d)
// assumes: let Natural = require('natural');
var sentence = "Hello, how are you? I don't know you!";

// WordTokenizer: keeps alphabetic and numeric runs, drops punctuation
var wordTokenizer = new Natural.WordTokenizer();
console.log(wordTokenizer.tokenize(sentence));
// prints [ 'Hello', 'how', 'are', 'you', 'I', 'don', 't', 'know', 'you' ]

// WordPunctTokenizer: punctuation becomes separate tokens
var punctTokenizer = new Natural.WordPunctTokenizer();
console.log(punctTokenizer.tokenize(sentence));
// prints [ 'Hello', ', ', 'how', 'are', 'you', '? ', 'I', 'don', '\'',
// 't', 'know', 'you', '!' ]

// TreebankWordTokenizer: follows Penn Treebank conventions
var treebankTokenizer = new Natural.TreebankWordTokenizer();
console.log(treebankTokenizer.tokenize(sentence));
// Penn Treebank conventions split contractions, so "don't" is expected to
// come out as 'do', 'n\'t', with punctuation as separate tokens

// AggressiveTokenizer (note the spelling: two g's)
console.log(new Natural.AggressiveTokenizer().tokenize(sentence));
// prints [ 'Hello', 'how', 'are', 'you', 'I', 'don', 't', 'know', 'you' ]
14. Stemming
• Process of reducing inflected or derived words to their word stem, base or root form.
• Similar to cutting down the branches of a tree to its stem
• A crude, rule-based process that clubs together different variations of a token
• Removes suffixes such as -s/-es, -ing or -ed
eating, eats, eaten, eat -> eat
stopping, stopped, stops, stop -> stop
ate -> ate (wrong; should be eat)
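The suffix-stripping idea can be sketched in a few lines of plain JavaScript. This is a deliberately crude illustration, not the Porter or Lancaster algorithm and not Natural's code; the function name `crudeStem`, the suffix list and the minimum stem length of 3 are assumptions made for the example.

```javascript
// Strip one of a few common suffixes, longest first.
function crudeStem(word) {
  const w = word.toLowerCase();
  for (const suffix of ["ing", "ed", "es", "s"]) {
    // Only strip when at least 3 characters would remain.
    if (w.endsWith(suffix) && w.length - suffix.length >= 3) {
      let stem = w.slice(0, w.length - suffix.length);
      // Undo consonant doubling after -ing/-ed: "stopp" -> "stop".
      if ((suffix === "ing" || suffix === "ed") && /([^aeiou])\1$/.test(stem)) {
        stem = stem.slice(0, -1);
      }
      return stem;
    }
  }
  return w; // no rule applied
}

const words = ["eating", "eats", "eat", "stopping", "stopped", "stops", "ate"];
console.log(words.map(crudeStem));
// prints [ 'eat', 'eat', 'eat', 'stop', 'stop', 'stop', 'ate' ]
```

The irregular form "ate" passes through unchanged because no suffix rule matches it, which is exactly the limitation noted above. The rules also over-apply: "falling" would come out as "fal".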
15. Stemming (Cont’d)
• Different stemming algorithms -
▪ Lovins Stemmer - The first published stemmer, written by Julie Beth Lovins in 1968.
It is rarely used today.
▪ Porter Stemmer - Written by Martin Porter and published in July 1980. Very widely used;
it became the de facto standard algorithm for English stemming.
▪ Lancaster Stemmer - Paice/Husk stemmer developed at Lancaster University. The
stemmer, although remaining efficient and easily implemented, is known to be very
strong and aggressive. The stemmer utilizes a single table of rules, each of which may
specify the removal or replacement of an ending.
▪ Snowball Stemmer – Also called Porter2 stemmer, since this is an updated version of
original Porter Stemmer. Natural does not support Snowball Stemmer
• Lemmatization is a more robust and methodical way of combining grammatical variations to
the root of a word.
▪ Natural does not support any Lemmatization algorithm.
▪ NLTK and other mature NLP libraries support Lemmatization
16. Stemming – Porter Stemmer and Lancaster Stemmer
var porterStemmer = Natural.PorterStemmer;
console.log(porterStemmer.stem("ate")); // prints at
console.log(porterStemmer.stem("eating")); // prints eat
console.log(porterStemmer.stem("eats")); // prints eat
console.log(porterStemmer.stem("eat")); // prints eat
console.log(porterStemmer.stem("agreement")); // prints agreement
var lancasterStemmer = Natural.LancasterStemmer;
console.log(lancasterStemmer.stem("ate")); // prints at
console.log(lancasterStemmer.stem("eating")); // prints eat
console.log(lancasterStemmer.stem("eats")); // prints eat
console.log(lancasterStemmer.stem("eat")); // prints eat
console.log(lancasterStemmer.stem("agreement")); // prints agr
• Natural supports Porter Stemmer and Lancaster Stemmer only. It does not support Snowball
Stemmer.
• Both the stemmers provide a stem method
17. Stemming – Porter Stemmer (Non English languages)
• Natural supports Porter Stemmer in Non English languages also
• Following languages are supported -
▪ Farsi - PorterStemmerFa
▪ French - PorterStemmerFr
▪ Russian - PorterStemmerRu
▪ Spanish - PorterStemmerEs
▪ Italian - PorterStemmerIt
▪ Norwegian - PorterStemmerNo
▪ Swedish - PorterStemmerSv
▪ Portuguese - PorterStemmerPt
18. Lemmatization
• A more methodical way of converting all the grammatical/inflected forms of a word to its
root.
• Uses context and part of speech to identify the inflected form of the word, and applies
different normalization rules for each part of speech to get the root word (lemma)
• Natural NLP library does not support Lemmatization.
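Since Natural offers no lemmatizer, the idea can only be illustrated outside the library. The sketch below is a toy dictionary lookup keyed by part of speech; the `LEMMAS` table and the `lemmatize` function are hypothetical and hold only a handful of hand-picked entries, whereas a real lemmatizer relies on a full lexicon such as WordNet.

```javascript
// Hypothetical mini-lexicon: part of speech -> inflected form -> lemma.
const LEMMAS = {
  verb: { ate: "eat", eaten: "eat", eating: "eat", eats: "eat", saw: "see" },
  noun: { men: "man", mice: "mouse", saw: "saw" },
};

// Look the word up under the given part of speech;
// fall back to the word itself when the lexicon has no entry.
function lemmatize(word, pos) {
  const w = word.toLowerCase();
  const table = LEMMAS[pos] || {};
  return Object.prototype.hasOwnProperty.call(table, w) ? table[w] : w;
}

console.log(lemmatize("ate", "verb")); // prints eat
console.log(lemmatize("saw", "verb")); // prints see
console.log(lemmatize("saw", "noun")); // prints saw
```

The "saw" example shows why part of speech matters: the same surface form maps to different lemmas depending on its role in the sentence, something a stemmer cannot express.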
19. Inflector
• Inflectors are used to pluralize or singularize words
• There are different types of Inflectors available in Natural Library
▪ Noun Inflector: pluralize or singularize nouns only
▪ Verb Inflector: Verbs can be pluralized/singularized with a Verb Inflector. Natural
provides an inflector called PresentVerbInflector which works on present tense verbs
only
▪ Both the noun and verb inflectors provide singularize and pluralize methods
▪ Number or Count Inflector: forms ordinal numbers from ordinary numbers
▪ Provides a single method called nth which returns the ordinal form of any number
passed
20. Inflector (Cont’d)
// pluralize or singularize nouns only
var nounInflector = new Natural.NounInflector();
console.log(nounInflector.pluralize("Book")); // prints Books
console.log(nounInflector.pluralize("radius")); // prints radii
console.log(nounInflector.singularize("flies")); // prints fly
console.log(nounInflector.singularize("men")); // prints man
var countInflector = Natural.CountInflector;
console.log(countInflector.nth("1")); // prints 1st
console.log(countInflector.nth("2")); // prints 2nd
console.log(countInflector.nth("3")); // prints 3rd
console.log(countInflector.nth("4")); // prints 4th
console.log(countInflector.nth("10")); // prints 10th
var verbInflector = new Natural.PresentVerbInflector();
console.log(verbInflector.singularize("go")); // prints goes
console.log(verbInflector.singularize("run")); // prints runs
console.log(verbInflector.pluralize("becomes")); // prints become
console.log(verbInflector.pluralize("presents")); // prints present
21. N-Grams
• An n-gram is a contiguous sequence of n items from a given sample of text or speech.
• The items can be phonemes, syllables, letters, words or base pairs according to the
application. The n-grams typically are collected from a text or speech corpus.
• When the items are words, n-grams may also be called shingles
• An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram".
• Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram",
and so on.
Example: "Hello how are you"
▪ unigrams: Hello | how | are | you
▪ bigrams: Hello how | how are | are you
▪ trigrams: Hello how are | how are you
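The sliding window behind word n-grams can be sketched in a few lines of plain JavaScript. This is only an illustration of the idea, not Natural's implementation; the function name ngrams and the whitespace tokenization are assumptions made for the example.

```javascript
// Minimal sketch: split on whitespace and slide a window of n tokens.
// The name "ngrams" is hypothetical and not taken from the Natural library.
function ngrams(text, n) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

// Join each n-gram with a space and separate n-grams with " | " for printing
const show = (grams) => grams.map((g) => g.join(" ")).join(" | ");
console.log(show(ngrams("Hello how are you", 1))); // Hello | how | are | you
console.log(show(ngrams("Hello how are you", 2))); // Hello how | how are | are you
console.log(show(ngrams("Hello how are you", 3))); // Hello how are | how are you
```

A text with fewer than n tokens yields no n-grams, because the window never fits.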
23. Phonetics
• A phonetic algorithm is an algorithm for indexing of words by their pronunciation.
• A phonetic matching algorithm is an algorithm that matches words by their pronunciation rather
than spelling.
• Most phonetic algorithms were developed for use with the English language. Consequently,
applying the rules to words in other languages might not give a meaningful result.
• Some of the well known phonetics algorithms are –
▪ Soundex - Developed to encode surnames for use in censuses. Soundex codes are four-
character strings composed of a single letter followed by three numbers.
▪ Daitch–Mokotoff Soundex - Refinement of Soundex designed to better match surnames of
Slavic & Germanic origin. Daitch–Mokotoff Soundex codes are strings composed of six
numeric digits.
▪ Cologne phonetics - Similar to Soundex, but more suitable for German words.
▪ Metaphone, Double Metaphone, and Metaphone 3 - Suitable for use with most English
words, not just names. Metaphone algorithms are the basis for many popular spell checkers.
▪ New York State Identification and Intelligence System (NYSIIS) - Maps similar phonemes to
the same letter. The result is a string that can be pronounced by the reader without decoding.
▪ Match Rating Approach developed by Western Airlines in 1977 - this algorithm has an
encoding and range comparison technique.
▪ Caverphone, created to assist in data matching between late 19th century and early 20th
century electoral rolls, optimized for accents present in parts of New Zealand.
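To make the "single letter followed by three numbers" format concrete, here is a minimal sketch of the classic American Soundex rules in plain JavaScript. It is an illustration only and is not Natural's SoundEx implementation; the function name soundex is an assumption made for the example.

```javascript
// Minimal sketch of American Soundex (hypothetical helper, not Natural's code).
// Consonants map to digits; vowels and other letters map to no digit (0).
const SOUNDEX_CODES = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6,
};

function soundex(word) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";
  let result = letters[0].toUpperCase(); // the first letter is kept as-is
  let previous = SOUNDEX_CODES[letters[0]] || 0;
  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters[i];
    const code = SOUNDEX_CODES[ch] || 0;
    // Append a digit only when it differs from the previous code
    if (code !== 0 && code !== previous) result += code;
    // h and w are transparent: they do not separate letters with the same code
    if (ch !== "h" && ch !== "w") previous = code;
  }
  return result.padEnd(4, "0"); // pad short codes with zeros
}

console.log(soundex("Robert"), soundex("Rupert")); // R163 R163
console.log(soundex("Nuremberg"), soundex("Nuremburg")); // N651 N651
console.log(soundex("Paris"), soundex("Pari")); // P620 P600
```

Two words match phonetically under this scheme when their codes are equal.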
24. Phonetics Matching (Cont’d)
• Natural supports Phonetic Matching using three algorithms –
▪ SoundEx
▪ Metaphone
▪ DoubleMetaphone
var metaphone = Natural.Metaphone;
var soundex = Natural.SoundEx;
var doubleMetaphone = Natural.DoubleMetaphone;
// using SoundEx for phonetic matching
console.log(soundex.compare("nuremberg", "nuremburg")); // returns true
console.log(soundex.compare("Paris", "Pari")); // returns false
// using Metaphone for phonetic matching
console.log(metaphone.compare("Fool", "Full")); // returns true
console.log(metaphone.compare("Fool", "Failed")); // returns false
// using Double Metaphone for phonetic matching
console.log(doubleMetaphone.compare("Bangalore", "Bengaluru")); // returns true
console.log(doubleMetaphone.compare("Mumbai", "Bombay")); // returns false
25. String Distance
• String Distance measures how closely two strings match.
• Natural provides JaroWinkler Distance and Levenshtein Distance algorithms for String
Distance match
JaroWinkler Distance
• Jaro similarity is computed from the number of matching characters in the two words and
the number of transpositions among those matching characters.
• Jaro-Winkler is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric
(1989, Matthew A. Jaro); it gives a higher score to strings that share a common prefix.
• Returns a number between 0 and 1 which tells how closely the strings match (0 = no match,
1 = exact match)
// Using JaroWinkler Distance algorithm
console.log(Natural.JaroWinklerDistance("Hello", "Hello")); // returns 1: exact match
console.log(Natural.JaroWinklerDistance("Me", "You")); // returns 0: no match
console.log(Natural.JaroWinklerDistance("Bangalore", "Bengaluru")); // returns 0.72: partial match
console.log(Natural.JaroWinklerDistance("Mumbai", "Bombay")); // returns 0.66: partial match
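For readers who want to see how such scores arise, below is a minimal plain JavaScript sketch of the textbook Jaro and Jaro-Winkler formulas. It is not Natural's code and may differ from it in detail; the function names jaro and jaroWinkler are hypothetical, and the prefix scale of 0.1 with a maximum prefix of 4 characters are the standard Winkler parameters, assumed here.

```javascript
// Textbook Jaro similarity (hypothetical helper, not Natural's implementation).
function jaro(s1, s2) {
  if (s1 === s2) return 1;
  if (s1.length === 0 || s2.length === 0) return 0;
  // Characters match only if they are equal and no further apart than this window
  const range = Math.max(0, Math.floor(Math.max(s1.length, s2.length) / 2) - 1);
  const matched1 = new Array(s1.length).fill(false);
  const matched2 = new Array(s2.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s1.length; i++) {
    const lo = Math.max(0, i - range);
    const hi = Math.min(s2.length - 1, i + range);
    for (let j = lo; j <= hi; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched1[i] = true;
        matched2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that appear in a different order in the two strings
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < s1.length; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (s1[i] !== s2[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (matches / s1.length + matches / s2.length + (matches - transpositions) / matches) / 3;
}

// Winkler's adjustment: boost the score for a shared prefix of up to 4 characters
function jaroWinkler(s1, s2, scale = 0.1) {
  const j = jaro(s1, s2);
  const limit = Math.min(4, s1.length, s2.length);
  let prefix = 0;
  while (prefix < limit && s1[prefix] === s2[prefix]) prefix++;
  return j + prefix * scale * (1 - j);
}

console.log(jaroWinkler("Hello", "Hello").toFixed(3)); // 1.000
console.log(jaroWinkler("Me", "You").toFixed(3)); // 0.000
console.log(jaroWinkler("Bangalore", "Bengaluru").toFixed(3)); // 0.725
console.log(jaroWinkler("Mumbai", "Bombay").toFixed(3)); // 0.667
```

The comparison is case-sensitive, which is why "Mumbai" and "Bombay" share only the matches m, b and a.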
26. String Distance - Levenshtein Distance
• Levenshtein Distance between two words is the minimum number of single-character edits
(insertions, deletions or substitutions) required to change one word into the other.
• Named after the Soviet mathematician Vladimir Levenshtein, who considered this distance
in 1965
• Also referred to as edit distance
// Using Levenshtein Distance algorithm
console.log(Natural.LevenshteinDistance("Hello", "Hello")); // 0
console.log(Natural.LevenshteinDistance("Bangalore", "Bengaluru")); // 3
console.log(Natural.LevenshteinDistance("Mumbai", "Bombay")); // 3
console.log(Natural.LevenshteinDistance("Chennai", "Madras")); // 6
console.log(Natural.LevenshteinDistance("Nuremberg", "Nuremburg")); // 1
Bangalore → Bengaluru: 3 character changes (a→e, o→u, e→u)
Nuremberg → Nuremburg: 1 character change (e→u)
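The edit counts above can be reproduced with the standard dynamic-programming formulation of Levenshtein distance. The following plain JavaScript sketch assumes unit cost for every insertion, deletion and substitution; the function name levenshtein is hypothetical, and this is not Natural's implementation.

```javascript
// Minimal sketch of Levenshtein distance with unit costs (hypothetical helper).
// Only two rows of the dynamic-programming table are kept in memory.
function levenshtein(a, b) {
  // previous[j] = distance between the first i-1 characters of a and the first j of b
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i]; // turning i characters into an empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      const deletion = previous[j] + 1;
      const insertion = current[j - 1] + 1;
      current[j] = Math.min(substitution, deletion, insertion);
    }
    previous = current;
  }
  return previous[b.length];
}

console.log(levenshtein("Hello", "Hello")); // 0
console.log(levenshtein("Bangalore", "Bengaluru")); // 3
console.log(levenshtein("Nuremberg", "Nuremburg")); // 1
```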
27. tf-idf
• tf–idf or TFIDF is short for term frequency - inverse document frequency
• tf-idf determines how important a word (or words) is to a document relative to a corpus.
• Often used as weighting factor in searches of information retrieval, text mining & user modeling.
• The tf-idf value increases proportionally to the number of times a word appears in the
document and is offset by the frequency of the word in the corpus, which helps to adjust for
the fact that some words appear more frequently in general.
• tfidf method returns the measure of importance of a word
var tfidf = new Natural.TfIdf();
// Documents could be added to tf-idf. Here only a single doc is added, but more could be added
tfidf.addDocument("this document is about node. Its also about NLP. Node is used for it");
// Find out the tf-idf of different words in the document
console.log(tfidf.tfidf("node", 0)); // prints 0.61 as node appears multiple times in the doc
console.log(tfidf.tfidf("NLP", 0)); // prints 0.30 as NLP appears only single time
console.log(tfidf.tfidf("ruby", 0)); // prints 0 as ruby does not appear in the doc
console.log(tfidf.listTerms(0));
[ { term: 'node', tfidf: 0.6137056388801094 },
{ term: 'document', tfidf: 0.3068528194400547 },
{ term: 'nlp', tfidf: 0.3068528194400547 },
{ term: 'used', tfidf: 0.3068528194400547 } ]
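The numbers printed above are consistent with computing tf-idf as the raw term count multiplied by idf = 1 + ln(N / (1 + df)), where N is the number of documents and df is the number of documents containing the term. That formula is inferred from the printed values and is an assumption, not a statement about Natural's internals. A plain JavaScript sketch under that assumption, with hypothetical helper names tokenize and tfidfOf:

```javascript
// Sketch of tf-idf, assuming idf = 1 + ln(N / (1 + df)) (inferred, not confirmed).
// tokenize and tfidfOf are hypothetical helpers, not part of the Natural library.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function tfidfOf(term, docIndex, docs) {
  const needle = term.toLowerCase();
  const tokenized = docs.map(tokenize);
  // tf: how many times the term occurs in the chosen document
  const tf = tokenized[docIndex].filter((w) => w === needle).length;
  // df: how many documents in the corpus contain the term at least once
  const df = tokenized.filter((words) => words.includes(needle)).length;
  const idf = 1 + Math.log(docs.length / (1 + df));
  return tf * idf;
}

const corpus = ["this document is about node. Its also about NLP. Node is used for it"];
console.log(tfidfOf("node", 0, corpus).toFixed(4)); // 0.6137
console.log(tfidfOf("NLP", 0, corpus).toFixed(4)); // 0.3069
console.log(tfidfOf("ruby", 0, corpus).toFixed(4)); // 0.0000
```

With a single document, every term present has df = 1, so idf = 1 + ln(1/2) ≈ 0.307, and "node" scores twice that because it occurs twice.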
28. tf-idf (cont’d)
• Files on disk can also be added to tf-idf
• Multiple documents can be added to tf-idf
var fileTfidf = new Natural.TfIdf();
// Adding a file from disk to tf-idf
fileTfidf.addFileSync("C:/Data/Profile.txt");
console.log(fileTfidf.listTerms(0));
// Multiple documents added to a fresh tf-idf instance form the entire corpus,
// so that document indexes 0 to 2 below refer to these three documents
var tfidf = new Natural.TfIdf();
tfidf.addDocument('this document is about node. Its also about NLP. Node is used for it');
tfidf.addDocument('this document is about ruby.');
tfidf.addDocument('this document is about ruby and node.');
console.log(tfidf.tfidf("node", 0)); // prints 2
console.log(tfidf.tfidf("NLP", 0)); // prints 1.40
console.log(tfidf.tfidf("ruby", 0)); // prints 0
console.log(tfidf.tfidf("node", 1)); // prints 0 as node does not appear in 2nd doc
console.log(tfidf.tfidf("ruby", 1)); // prints 1 as ruby appears in 2nd doc
console.log(tfidf.tfidf("node", 2)); // prints 1 as node appears in 3rd doc
console.log(tfidf.tfidf("ruby", 2)); // prints 1 as ruby appears in 3rd doc
29. tf-idf (cont’d)
• tfidfs method returns the measure of importance of a word in each of the documents
• tfidfs method accepts the word and a callback
// Multiple documents added to tf-idf which form the entire corpus
var tfidf = new Natural.TfIdf();
tfidf.addDocument('this document is about node. Its also about NLP. Node is used for it');
tfidf.addDocument('this document is about ruby.');
tfidf.addDocument('this document is about ruby and node.');
// tfidfs method is used to find the importance of the word across multiple documents
tfidf.tfidfs('node', function(ctr, measure){
  console.log('tf-idf of node in document #' + ctr + ' is ' + measure);
});
30. POS (Part of Speech) Tagging
• Process of marking up a word in a text (corpus) as corresponding to a particular part of
speech, based on both its definition and its context—i.e., its relationship with adjacent and
related words in a phrase, sentence, or paragraph.
• Also called grammatical tagging or word-category disambiguation.
31. POS (Part of Speech) Tagging
• Current state-of-the-art POS tagging algorithms can predict the POS of a given word with
a high degree of accuracy (approximately 97%), but POS tagging remains an active area
of research.
No Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
32. POS Tagging – Brill POS Tagger
• Natural supports POS tagging through the Brill POS Tagger, which implements Eric Brill's
transformational algorithm (transformation rules are specified in external files).
• Brill's tagger, one of the most widely used English POS taggers, employs rule-based algorithms.
// Node's path module is needed to build the file locations
var path = require("path");
// Path where natural library is located
var baseFolder = path.join(path.dirname(require.resolve("natural")), "brill_pos_tagger");
// Rules file located in /data/<language> sub folder under natural library
var rulesFilename = baseFolder + "/data/English/tr_from_posjs.txt";
// Lexicon file located in /data/<language> sub folder under natural library
var lexiconFilename = baseFolder + "/data/English/lexicon_from_posjs.json";
var defaultCategory = 'N';
var lexicon = new Natural.Lexicon(lexiconFilename, defaultCategory);
var rules = new Natural.RuleSet(rulesFilename);
// Any tagger needs lexicon and rules for successful POS tagging of words
// Brill POS Tagger object is created passing lexicon file and rules file location
var tagger = new Natural.BrillPOSTagger(lexicon, rules);
var sentence = "I see the man with the telescope";
var tokenizer = new Natural.WordTokenizer();
// tokenize the sentence to tokens
var tokens = tokenizer.tokenize(sentence);
console.log(tagger.tag(tokens));
[ [ 'I', 'NN' ],
[ 'see', 'VB' ],
[ 'the', 'DT' ],
[ 'man', 'NN' ],
[ 'with', 'IN' ],
[ 'the', 'DT' ],
[ 'telescope', 'NN' ] ]
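The two ingredients named above, a lexicon and a set of transformation rules, can be illustrated with a toy tagger in plain JavaScript. Everything in this sketch is hypothetical: the three-word lexicon, the single rule and the rule format are invented for illustration and do not reflect Natural's lexicon or rule file formats.

```javascript
// Toy illustration of transformation-based tagging (not Natural's implementation).
// Hypothetical lexicon: word -> possible tags, most likely tag first.
const lexicon = new Map([
  ["the", ["DT"]],
  ["can", ["MD", "NN", "VB"]],
  ["rusted", ["VBD"]],
]);

// Hypothetical rule: retag MD as NN when the previous word is tagged DT.
const rules = [{ from: "MD", to: "NN", previousTag: "DT" }];

function tag(tokens, defaultTag = "NN") {
  // Step 1: give each word its most likely tag, or the default for unknown words
  const tagged = tokens.map((token) => {
    const candidates = lexicon.get(token.toLowerCase());
    return [token, candidates ? candidates[0] : defaultTag];
  });
  // Step 2: apply each transformation rule in order to correct the initial tags
  for (const rule of rules) {
    for (let i = 1; i < tagged.length; i++) {
      if (tagged[i][1] === rule.from && tagged[i - 1][1] === rule.previousTag) {
        tagged[i][1] = rule.to;
      }
    }
  }
  return tagged;
}

console.log(tag(["The", "can", "rusted"]).map(([w, t]) => w + "/" + t).join(" "));
// The/DT can/NN rusted/VBD
```

Here "can" is first tagged as a modal (MD) from the lexicon and then corrected to a noun because it follows a determiner.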