
Natural Language Processing using Java


This presentation talks about Natural Language Processing using Java. At Museaic, a music intelligence platform, we spent time figuring out how to extract central themes from song lyrics. In this talk, I will cover some of the tasks involved in natural language processing such as named entity recognition, word sense disambiguation and concept/theme extraction. I will also cover libraries available in java such as stanford-nlp, dbpedia-spotlight and graph approaches using WordNet and semantic databases. This talk would help people understand text processing beyond simple keyword approaches and provide them with some of the best techniques/libraries for it in the Java world.

Published in: Technology


  1. 1. Natural Language Processing using Java Sang Venkatraman April 21, 2015
  2. 2. Agenda • Text Retrieval and Search • Implementing Search • Evaluating Search Results • NLP - Document Level Analysis • Parsing and Part of Speech Tagging • Entity Extraction • Word Sense Disambiguation • Concept Extraction • Concept Polarity • NLP - Sentence Level Analysis • Document Summarization • Dependency Analysis and Coreference • Example Question Parsing System • Sentiment Analysis • Final Thoughts/Questions 2
  3. 3. Text Retrieval and Search • A collection of text documents exists in a system. This is called the corpus. • The documents are preprocessed and indexed before query time. • The user performs a query - the query defines one or more concepts that the user is interested in, e.g. “Thai restaurant in Atlanta” • The search engine is expected to retrieve the most relevant documents based on a ranking function • The search engine can also apply heuristics based on user feedback (such as always ignoring a specific document) to further prune the results. 3
  4. 4. Search - Vector Space Model • Term: a word or a set of words (n-grams) • Each term defines one dimension • Query Vector: q = (X1,…,Xn) • Document Vector: d = (Y1,…,Yn) • relevance(q,d) ~ similarity(q,d) 4
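The slide leaves the similarity function abstract; a common choice is cosine similarity between the query and document term vectors. A minimal sketch (class name `VectorSpace` is illustrative, not from the talk):

```java
// Toy vector space model: queries and documents share one term space,
// and relevance(q, d) is approximated by the cosine of the angle
// between their term-weight vectors.
public class VectorSpace {
    public static double cosine(double[] q, double[] d) {
        double dot = 0, nq = 0, nd = 0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];   // sum of coordinate-wise products
            nq  += q[i] * q[i];   // squared norm of the query vector
            nd  += d[i] * d[i];   // squared norm of the document vector
        }
        if (nq == 0 || nd == 0) return 0; // empty vector: no similarity
        return dot / (Math.sqrt(nq) * Math.sqrt(nd));
    }
}
```

Identical vectors score 1.0 and vectors with no terms in common score 0.0, which is why cosine works well as a ranking function regardless of document length.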
  5. 5. Preparing Text for Search • Tokenization: each document is split into paragraphs, paragraphs into sentences and sentences into words. • Word Normalization: • Index text and query terms should have the same form, e.g. match U.S.A and USA • Usually lower cased • Stop Word Removal: an optional step where a predefined list of stop words is removed. More important for small corpora • Stemming - reduce terms to their stems • Language dependent - in English, a word has two parts, the stem and the affix • automate(s), automatic, automation => automat; plural forms like cats => cat • The “stem” may not be an actual word, e.g. consolidating => consolid 5
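The tokenization, lower-casing and stop-word steps above can be sketched in a few lines (the class name `Normalize` and the tiny stop list are illustrative; stemming is omitted because it needs a real stemmer such as Porter's):

```java
import java.util.*;
import java.util.stream.*;

// Minimal normalization pipeline: split on non-alphanumerics,
// lower-case, drop empty tokens and stop words.
public class Normalize {
    static final Set<String> STOP = Set.of("the", "is", "a", "an", "in", "of");

    public static List<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty() && !STOP.contains(t))
                .collect(Collectors.toList());
    }
}
```

For example, `Normalize.tokens("The rose is red")` keeps only `rose` and `red`, which is exactly the form the inverted index would store.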
  6. 6. The inverted index part of the image taken from 6
  7. 7. Search Example • For any given term in the query: • Term Frequency (TF) - the number of times a term occurs in a document, normalized by the total number of terms in the document. • Document Frequency (DF) - the number of documents that the term occurs in • Inverse Document Frequency (IDF) - the inverse of the above, so it is high for less frequent terms and low for more frequent terms. • Simple ranking of documents for a query • For all the terms in the query, sum up the product of TF and IDF. This can be used to rank the results, with the documents with the highest tf-idf on top. • Example: • Document 1 = “The rose is red” • Document 2 = “Red shoe” • Query 1 = “Red” => Document 1 and Document 2 rank equally because both documents have the same normalized term frequency for “red” after removing stop words 7
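The tf-idf product described above can be sketched as follows. This is one common variant (idf = log(N/DF)); the slide does not fix an exact formula, and note that with this variant a term occurring in every document gets weight 0, which is one reason smoothed variants and models like BM25 exist:

```java
import java.util.List;

// One simple tf-idf variant: normalized term frequency times log(N / DF).
public class TfIdf {
    public static double score(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();          // occurrences in this doc
        long df = corpus.stream().filter(d -> d.contains(term)).count(); // docs containing the term
        if (df == 0) return 0.0;
        double ntf = (double) tf / doc.size();                        // length-normalized TF
        return ntf * Math.log((double) corpus.size() / df);           // TF * IDF
    }
}
```

Applied to the slide's toy corpus, "red" appears in both documents so its idf collapses to log(2/2) = 0, while a term unique to one document ("rose") gets a positive weight.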
  8. 8. Evaluating Search Results • Search results can be evaluated by two metrics that encourage two kinds of algorithm behavior: • High Precision - very few false positives. Critical for systems that cannot afford a wrong recommendation. • High Recall - very few misses. Critical for systems where every missed opportunity must be minimized and there is a low cost associated with a false positive. • FMeasure - the harmonic mean of precision and recall. It tries to balance the explorative nature of search with the preciseness of the results. 8
  9. 9. • precision = a/(a + c) • recall = a/(a + b) • fMeasure = 2 * precision * recall/(precision + recall) • Example: • retrieved documents = 5, relevant documents = 10 • relevant documents within the 5 retrieved results = 4 • precision = 4/5 = 0.8, recall = 4/10 = 0.4, FMeasure = 0.53 9 Table from
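The worked example above can be checked directly in code (the class name `Eval` is illustrative; `tp` = relevant retrieved, `fp` = irrelevant retrieved, `fn` = relevant missed):

```java
// Precision, recall and F-measure from the retrieval contingency counts.
public class Eval {
    // fraction of retrieved documents that are relevant
    public static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    // fraction of relevant documents that were retrieved
    public static double recall(int tp, int fn) { return (double) tp / (tp + fn); }
    // harmonic mean of precision and recall
    public static double fMeasure(double p, double r) { return 2 * p * r / (p + r); }
}
```

With 4 relevant documents among 5 retrieved, out of 10 relevant overall: precision = 0.8, recall = 0.4, and the F-measure comes out to 0.533, matching the slide's rounded 0.53.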
  10. 10. Section Summary • In this section, we applied NLP techniques across an entire corpus. This is where frameworks like map reduce play an important role. • The NLP techniques by themselves were shallow but were able to implicitly handle compound words and stop words. • Introduced a simple formula for ranking and retrieving search results. Real-world systems involve more complex probabilistic models like BM25 that follow the same principles. • Reviewed some techniques for evaluating search algorithms. These simple approaches can also be used for other NLP and machine learning problems. 10
  11. 11. Big Data is for Losers. I’m into Small Data now. 11
  12. 12. Extracting Concepts From Text • We apply various NLP techniques to analyze the contents of a document. Some examples are: • Mentions of people, places etc. • Central themes or concepts in the document • This is different from search • Search follows a pull model where the users take the initiative in querying the system for relevant documents. • In concept extraction, we can infer abstract concepts from text and push them to interested users. We may also be able to infer the concepts a user is interested in based on the content they consume. 12
  13. 13. Concept Extraction - Motivation 13
  14. 14. Sentence Segmentation • Periods are ambiguous - abbreviations, decimals etc. • !, ? - less ambiguous • Classifier - rules (using case, punctuation etc.), ML etc. • StanfordNLP sentence detection and tokenizer • Trained on the Penn Treebank dataset and hence suited to more formal English. • OpenNLP has a sentence detector and tokenizer as well. • Both libraries perform quite well for English and there is not much to choose between them. Both can also be retrained. 14
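To see why trained detectors are worth using, here is what a purely rule-based splitter looks like (the class name `Sentences` is illustrative). It breaks after runs of `.`, `!` or `?` followed by whitespace, so it mishandles exactly the ambiguous cases the slide mentions, such as abbreviations and decimals:

```java
import java.util.*;
import java.util.regex.*;

// Naive rule-based sentence splitter: break after runs of . ! ?
// followed by whitespace or end of text. Trained models (StanfordNLP,
// OpenNLP) exist precisely because rules like this fail on
// abbreviations ("Dr. Smith") and decimals ("3.14").
public class Sentences {
    public static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[^.!?]+[.!?]+(\\s|$)").matcher(text);
        while (m.find()) out.add(m.group().trim());
        return out;
    }
}
```

`Sentences.split("I am here. Are you? Yes!")` correctly yields three sentences, but a text containing "U.S.A." would be fragmented.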
  15. 15. Part of Speech Tagging using StanfordNLP • StanfordNLP is quite accurate (~90%) and uses a maximum entropy tagger. 15 TAG => POS: DT => Determiner; JJ, JJR, JJS => Adjective; NN, NNS => Noun; NNP, NNPS => Proper Noun; PRP => Pronoun; VB => Verb; IN => Preposition; CC => Conjunction
  16. 16. Named Entity Recognition • Named Entity Recognition is the NLP task of recognizing proper nouns in a document. • Named Entity Recognition consists of three steps: • Spotting: a statistical model pre-trained on well known corpus data helps us “spot” entities in the text. • Disambiguation: once spots are found, we may need to disambiguate them (e.g. there are multiple entities with the same name and the correct URL needs to be retrieved) • Filtering: remove named entities whose types we are not interested in, or entities that have very few links pointing to them. • At the end of NER, we get back a set of URLs of the resources that were referenced in the text. 16
  17. 17. Spotting is the process of identifying and assigning classes to named entities. 17 STANFORDNLP | OPENNLP • I go to school at <ORGANIZATION>Stanford University</ORGANIZATION>, which is located in <LOCATION>California</LOCATION>. | I go to school at <ORGANIZATION>Stanford University</ORGANIZATION> which is located in <LOCATION>California</LOCATION> • Schooled in the <LOCATION>Philippines</LOCATION> | Schooled in the <LOCATION>Philippines</LOCATION> • Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories? | Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories? • What does <ORGANIZATION>GM</ORGANIZATION> produce? | What does <ORGANIZATION>GM</ORGANIZATION> produce? • Is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>. | is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>. • I work at <ORGANIZATION>Chevy</ORGANIZATION>. | I work at Chevy. • I work at <ORGANIZATION>chevy</ORGANIZATION>. | I work at chevy. • I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car | I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car • You told me I was like the <LOCATION>Dead Sea</LOCATION> | You told me I was like the <LOCATION>Dead Sea</LOCATION>
  18. 18. Dbpedia Spotlight • Dbpedia Spotlight is an API that can be used to perform all 3 steps of NER • Spots - it identifies spots using a statistically backed model. • Spots are disambiguated based on other references in the document • URIs are retrieved for each of the identified named entities. These are usually dbpedia URLs with references to freebase and other ontologies. • Provides APIs to perform the steps of NER separately as well • Spotting - identifies only the spots • Disambiguate - performs disambiguation based on the different options provided • Annotate - performs all 3 steps of NER and provides results • Candidates - provides a ranked list of candidates for each spot 18
  19. 19. Dbpedia Spotlight Results 19 ID SONG EXPECTED ACTUAL PRECISION RECALL FMEASURE • 1 Here We Stand (Talking Heads) Dairy_Queen Eleven 1.0 0.33 0.5 • 2 Kodachrome (Paul Simon) Kodachrome Nikon 1.0 0.5 0.66 • 3 Brand New Cadillac (The Clash) Cadillac 1.0 1.0 1.0 • 4 A Certain Romance (Arctic Monkeys) Converse_(shoe_company) Reebok Converse_(shoe_company) 1.0 1.0 1.0 • 5 My Humps (Black Eyed Peas) Dolce_&_Gabbana True_Religion Prada Gucci 1.0 0.33 0.5 • Mean 1.0 0.63 0.73
  20. 20. Querying the Semantic Web • SPARQL is a query language to interact with the semantic web. • SPARQL is the equivalent of SQL for RDF stores. • Ontologies provide knowledge about different entities, usually in the form of subject-predicate-object triples. • The English version of dbpedia contains 4.58 million things with 584 million facts. 20 SELECT ?industry WHERE { <http://dbpedia.org/resource/Fendi> dbprop:industry ?industry }
  21. 21. Named Entity Recognition Demo • 21
  22. 22. Extracting Concepts using Word Senses 22
  23. 23. Word Sense Disambiguation • For many words, multiple senses exist based on the context, e.g. there are multiple senses of the word “bank” (even within the same part of speech). • Extremely difficult for computers; a combination of context and common sense information makes this quite easy for humans. • Word Sense Disambiguation can be useful for • Machine translation between languages (the surface form loses value during translation because the only thing that matters is the sense of the word) • Information Retrieval - correct interpretation of the query. However, this can be overcome by providing enough terms to only retrieve relevant documents. • Automatic annotation of text • Measuring semantic relatedness between documents. 23
  24. 24. • Solving the Word Sense Disambiguation Problem • Need an inventory of knowledge that can be used to disambiguate words, usually a graph structure. Some examples are: • WordNet • Wikipedia • Yago • Freebase • ConceptNet • Algorithms traverse the inventory to retrieve the most likely disambiguation of a word. These are usually graph algorithms that work on a measure of centrality, like degree centrality. • Assumptions: • The document has enough context to disambiguate the word correctly. If not, we default to the most frequent sense of the word. • Single sense per discourse 24
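Degree centrality over a sense graph, as mentioned above, can be sketched on a toy inventory: each candidate sense has a set of neighboring words, and the sense with the most edges into the document's context wins. The class name `DegreeWsd` and the toy sense keys are illustrative, not from any real knowledge base:

```java
import java.util.*;

// Toy degree-centrality WSD: for each candidate sense, count how many
// of its graph neighbors appear in the context, and pick the sense
// with the highest count (its "degree" into the context).
public class DegreeWsd {
    public static String best(Map<String, Set<String>> senseNeighbors, Set<String> context) {
        String bestSense = null;
        long bestDegree = -1;
        for (Map.Entry<String, Set<String>> e : senseNeighbors.entrySet()) {
            long degree = e.getValue().stream().filter(context::contains).count();
            if (degree > bestDegree) {
                bestDegree = degree;
                bestSense = e.getKey();
            }
        }
        return bestSense;
    }
}
```

With "bank#n#1" linked to {river, water} and "bank#n#2" linked to {money, credit_union, loan}, a context containing "credit_union" selects the financial sense, mirroring the WSD example on slide 30.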
  25. 25. WordNet • WordNet is a hierarchically organized lexical database widely used in NLP applications. Started at Princeton in 1985. • Contains nouns, verbs, adjectives and adverbs • Words are separated into senses and are represented as synsets. • The noun “bank” can have multiple senses based on the context (e.g. bank of a river, financial institution etc.) • Synsets are connected by well defined semantic relationships • The majority of WordNet relations connect words from the same part of speech. • Can be accessed in Java using the extJWNL library 25 PART OF SPEECH => UNIQUE STRINGS: Noun 117,798; Verb 11,529; Adjective 22,479; Adverb 4,481
  26. 26. WordNet Synsets 26 Synset format => baseform#pos#index bank#n#1 -> river bank bank#n#2 -> Financial institution bank#v#3 -> bank with a financial institution
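The baseform#pos#index convention above is easy to work with programmatically; a small parser for it might look like this (the class name `SynsetKey` is illustrative):

```java
// Parses a synset key in the baseform#pos#index convention,
// e.g. "bank#n#2" => lemma "bank", part of speech 'n', sense index 2.
public class SynsetKey {
    public final String lemma;
    public final char pos;
    public final int index;

    public SynsetKey(String key) {
        String[] parts = key.split("#");
        if (parts.length != 3) throw new IllegalArgumentException("bad key: " + key);
        lemma = parts[0];
        pos = parts[1].charAt(0);
        index = Integer.parseInt(parts[2]);
    }
}
```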
  27. 27. WordNet Relationships • Hypernym - Defines a superordinate relationship. • Motor vehicle is a hypernym of car • Hyponym - Subordinate relationship • Mango is a hyponym of fruit • The root node of nouns is “entity” • Other relationships: InstanceOf, Synonyms/Antonyms, Meronym (PartOf) etc. 27
  28. 28. 28
  29. 29. Accessing WordNet using extJWNL • Download the WordNet 3.0 dataset • Use the properties file to point to the location of WordNet • on the file system or in a database • Lemmatization - needed to get the base form of a word (different from stemming) using the WordNet dictionary. • cat and cats have the same lemma 29
    val dictionary = Dictionary.getInstance(new FileInputStream("data/file_properties.xml"))
    def getBaseForm(pos: POS, word: String): String =
      dictionary.getMorphologicalProcessor.lookupBaseForm(pos, word.toLowerCase).getLemma
  30. 30. WSD using WordNet • Example 1 - “I am going to the bank” • “bank” by itself usually just defaults to bank#n#1 • Example 2 - “What is the difference between a bank and a credit union?” • Credit Union only has one sense - credit_union#n#1 • Because credit union is present, “bank” is disambiguated to “bank#n#2” 30
  31. 31. Concept Graph • WordNet does not capture any common sense information, e.g. bank (financial institution) and money do not have a close relationship in WordNet. • It is possible to use other resources like ConceptNet that map common sense knowledge to WordNet (and ontologies like dbpedia), e.g. we can download mappings for concepts like Money, Love, Sports, Family etc. • Another option is to deploy a custom concept graph: • Deploy WordNet onto a graph database. That forms the base graph. • Deploy custom concept mappings to the WordNet synsets. • Add mappings for relevant wikipedia (dbpedia) categories 31
  32. 32. Concept Extraction Architecture 32
  33. 33. Concept Analysis of over 500K songs 33
  34. 34. Concept Polarity • SentiWordNet is a lexical resource for opinion mining and sentiment analysis • SentiWordNet provides sentiment values for the different WordNet synsets. For each synset in WordNet, SentiWordNet assigns scores on 3 dimensions - positivity, negativity and objectivity. • Once the central concepts are found, we can extract the polarity of the concepts. • Example: • “They are really happy to be here” => happy#a#1 has a very positive polarity. 34
  35. 35. Section Summary • Went beyond surface forms and analyzed the concepts contained in documents. • The approach was still mostly bag of words, meaning that the structure of the individual sentences did not matter. • These approaches, in tandem with common sense knowledge sources, help in extracting concepts from documents. • They also allow documents to be compared using semantic similarity measures. 35
  36. 36. 36
  37. 37. Document Summarization • Objective - Reduce the document in order to create a summary that retains the most important points of the original document. • Two Approaches: • Extractive: Extract the sentences that are most representative of the content of the document. • Generative: Generate a summary of the text using words that may not be part of the original text. This is a difficult task and is often not attempted. • Evaluating summarization techniques: • Somewhat subjective because humans sometimes cannot agree on the best summary • Extractive Approaches • Based on term frequency • Based on sentence similarity 37
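The term-frequency extractive approach mentioned above can be sketched as follows (the class name `FreqSummary` is illustrative): a sentence scores the sum of the corpus-wide frequencies of its terms, normalized by sentence length, and the top scorer becomes the summary sentence.

```java
import java.util.*;

// Term-frequency extractive summarization: sentences whose words are
// frequent across the whole document score higher.
public class FreqSummary {
    public static int topSentence(List<List<String>> sents) {
        // Count every term's frequency across all sentences.
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> s : sents)
            for (String t : s) freq.merge(t, 1, Integer::sum);

        // Score each sentence by its average term frequency.
        int best = 0;
        double bestScore = -1;
        for (int i = 0; i < sents.size(); i++) {
            double score = 0;
            for (String t : sents.get(i)) score += freq.get(t);
            score /= sents.get(i).size();
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }
}
```

Sentences sharing vocabulary with the rest of the document (e.g. repeated mentions of "Cassandra" in the example on the next slide) dominate sentences full of one-off terms.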
  38. 38. 38 ID SENTENCE EXPECTED SCORE • 1 Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. High • 2 Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. High • 3 Cassandra also places a high value on performance. Low • 4 In 2012, University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments. Low • 5 Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" although "this comes at the price of high write and read latencies." High • 6 Cassandra's data model is a partitioned row store with tunable consistency. Medium • 7 Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Medium • 8 Other columns may be indexed separately from the primary key. Low • 9 Tables may be created, dropped, and altered at runtime without blocking updates and queries. Low • 10 Cassandra does not support joins or subqueries, except for batch analysis via Hadoop. Medium • 11 Rather, Cassandra emphasizes denormalization through features like collections. Medium
  39. 39. TextRank • A graph approach where each vertex is a sentence and each edge has a weight corresponding to the similarity between the two sentences. Every vertex is connected to every other vertex. • For every sentence: • Calculate its similarity to every other sentence. The similarity measure can be simple, e.g. a normalized count of the terms common to the two sentences • Sum its similarity to every other sentence (i.e. sum up its row of the similarity matrix). That sum is the score of the sentence. • Sort the vertices based on the sum of the weights of their edges and return the top k sentences. 39
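The scoring loop described above can be sketched directly (the class name `SentenceScore` is illustrative). Here each sentence is a set of terms, and the similarity between two sentences is their shared-term count normalized by the combined sentence lengths - one simple normalization among several; the original TextRank paper normalizes by the sum of log sentence lengths:

```java
import java.util.*;

// Sentence scoring as in the TextRank-style sketch: every sentence is
// compared to every other, and its score is the sum of those pairwise
// similarities (the row sum of the similarity matrix).
public class SentenceScore {
    public static double[] scores(List<Set<String>> sents) {
        int n = sents.size();
        double[] s = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                long common = sents.get(i).stream()
                        .filter(sents.get(j)::contains).count();
                // shared terms, normalized by combined sentence size
                s[i] += (double) common / (sents.get(i).size() + sents.get(j).size());
            }
        }
        return s;
    }
}
```

Sorting sentences by these scores and keeping the top k yields the extractive summary; a sentence with no vocabulary overlap with the rest of the document scores exactly 0.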
  40. 40. 40 TOP SENTENCES => SCORE • Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. => 1.6 • Cassandra also places a high value on performance. => 1.125 • Other columns may be indexed separately from the primary key. => 0.999 • Can the similarity metric be improved?
  41. 41. Dependency Analysis in Sentences • StanfordNLP can be used to analyze the grammatical structure of sentences and provide a dependency graph between the different elements of the sentence. • LexicalizedParser can provide a graph where the vertices are the words and the edges are the grammatical relationships in a sentence. 41
  42. 42. 42 TAG => MEANING: advmod => Adverbial Modifier; neg => Negation Modifier; nsubj => Nominal Subject; nsubjpass => Passive Nominal Subject; dobj => Direct Object (she, gave); iobj => Indirect Object (gave, me); amod => Adjective Modifier; prep => Preposition
  43. 43. Question Parsing 43
  44. 44. Dependency Analysis • Works well for short sentences; loses accuracy when the scope is increased to a whole document. • May aid in text simplification by using the relationships between the entities. • By analyzing the subject and the object, we can clearly establish a point of view (e.g. direct address vs. first person vs. second person etc.). • Could potentially help in story extrapolation but does not generalize well, so this is a topic of research. 44
  45. 45. Sentiment Analysis • StanfordNLP has a deep learning model for sentiment analysis. • Takes a deep parsing approach to sentiment analysis - the structure of the sentence is constructed prior to the analysis. • Was trained on movie review data and obtained an accuracy about 5% higher than the closest model. • Uses an annotated dataset called the Stanford Sentiment Treebank. Users are encouraged to add labels to improve the model further. 45
  46. 46. Sentiment Analysis Examples • Taxonomy • Very Negative • Negative • Neutral • Positive • Very Positive 46
  47. 47. Sentiment Analysis Demo • rntnDemo.html 47
  48. 48. StanfordNLP Sentiment Analysis • Provides relatively good results for short sentences. • Sentences that are similar to the training data (movie reviews) perform much better than other sentences. • No good way to aggregate sentiments across a document; future work would probably involve document level dependency parsing and sentiment analysis. • Only provides the overall sentiment; does not indicate the object of the sentiment. 48
  49. 49. Final Thoughts • Shallow NLP is employed in text retrieval and search and provides good results for general search use cases. • Deeper NLP involves semantic parsing and common sense interpolation (both local and global knowledge bases) and tends to be harder. • Deeper NLP is more practical after picking a specific domain, e.g. medical records, legal documents etc. • 2 cents on Intelligence - memory based systems • 49
  50. 50. Resources • StanfordNLP Github: • Own repository: • Dbpedia Spotlight: dbpedia-spotlight • Opennlp repo: • ConceptNet • On Intelligence book: On_Intelligence 50
  51. 51. Thank You 51 @sang_v