Big Data and Natural Language Processing


Natural Language Processing (NLP) is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.

Published in: Business

  1. Natural Language Processing. June 2013, Michel Bruley
  2. Natural Language Processing (NLP)
     NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language. NLP is considered a subfield of artificial intelligence and overlaps significantly with computational linguistics. It is concerned with the interactions between computers and human (natural) languages.
     – Natural language generation systems convert information from computer databases into readable human language.
     – Natural language understanding systems convert human language into representations that are easier for computer programs to manipulate.
     NLP encompasses both text and speech, but work on speech processing has evolved into a separate field.
  3. Where does it fit in the CS* taxonomy?
     [Tree diagram, levels listed top to bottom]
     – Computers
     – Artificial Intelligence, Algorithms, Databases, Networking
     – Robotics, Search, Natural Language Processing
     – Information Retrieval, Machine Translation, Language Analysis
     – Semantics, Parsing
     * CS = Computer Science
  4. Why Natural Language Processing?
     Applications that process large amounts of text require NLP expertise.
     – Classify text into categories, index and search large texts: classify documents by topic, language, or author; spam filtering; information retrieval (relevant, not relevant); sentiment classification (positive, negative)
     – Extracting data from text: converting unstructured text into structured data
     – Information extraction: discover names of people and events they participate in, from a document, …
     – Automatic summarization: condense 1 book into 1 page, …
     – Speech processing, artificial voice: get flight information or book a hotel over the phone, …
     – Question answering: find answers to natural language questions in a text collection or database
     – Spelling & grammar correction
     – Plagiarism detection
     – Automatic translation
     – Etc.
  5. The problem
     • When people see text, they understand its meaning (by and large).
       According to research, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by islelf but the wrod as a wlohe.
     • When computers see text, they get only character strings (and perhaps HTML tags).
     • We'd like computer agents to see meanings and be able to intelligently process text.
     • These desires have led to many proposals for structured, semantically marked up formats.
     • But often human beings still resolutely make use of text in human languages.
     • This problem isn’t likely to just go away.
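
The scrambled passage on this slide can be reproduced in a few lines, which also shows the computer's side of the problem: once the interior letters are shuffled, a program that compares character strings no longer recognizes most words. This is a minimal Python sketch, assuming whitespace-separated words without punctuation; the names `scramble_word` and `scramble_text` are illustrative and not part of the deck.

```python
import random


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters, keeping the first and last in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle in very short words
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text: str, seed: int = 0) -> str:
    """Scramble every whitespace-separated word (punctuation is not handled)."""
    rng = random.Random(seed)
    return " ".join(scramble_word(w, rng) for w in text.split())


original = "people read the word as a whole and not every letter"
scrambled = scramble_text(original)

# A naive program only sees strings: count the words it would still match.
still_matching = sum(a == b for a, b in zip(original.split(), scrambled.split()))
print(scrambled)
print(f"{still_matching} of {len(original.split())} words match exactly")
```

A human reader copes with the scrambled line easily, while the exact-match count drops for every word long enough to be shuffled into a different order.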
  6. Example: Natural language understanding
     Raw speech signal
     • Speech recognition
     Sequence of words spoken
     • Syntactic analysis using knowledge of the grammar
     Structure of the sentence
     • Semantic analysis using info. about meaning of words
     Partial representation of meaning of sentence
     • Pragmatic analysis using info. about context
     Final representation of meaning of sentence
     Natural language understanding process – Prof. Carolina Ruiz
  7. Example detail: Syntactic Analysis
     • Syntactic analysis involves organizing the words and phrases of a sentence into a hierarchical structure, allowing the study of its constituents.
     • For example, the sentence “the big cat is drinking milk” can be broken up into the following constituents:
       – Noun Phrase: Determiner (The), Adjective Phrase (big), Noun (cat)
       – Verb Phrase: Auxiliary (is), Verb (drinking), Noun Phrase (milk)
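
The constituent structure of “the big cat is drinking milk” can be written down as a nested data structure and walked recursively. The sketch below is in Python and assumes the grouping read off the slide's diagram (a noun phrase followed by a verb phrase under a root node labeled "Sentence" here for illustration). The tree is written by hand; no parser is involved, and the helper names are hypothetical.

```python
# A node is (label, children); a leaf is (label, word).
# Hand-written tree; the grouping is an assumption based on the slide's labels.
TREE = ("Sentence", [
    ("Noun Phrase", [
        ("Determiner", "The"),
        ("Adjective Phrase", "big"),
        ("Noun", "cat"),
    ]),
    ("Verb Phrase", [
        ("Auxiliary", "is"),
        ("Verb", "drinking"),
        ("Noun Phrase", "milk"),
    ]),
])


def words(node):
    """Return the words covered by a node, in sentence order."""
    _label, content = node
    if isinstance(content, str):
        return [content]
    covered = []
    for child in content:
        covered.extend(words(child))
    return covered


def constituents(node):
    """Return (label, phrase) for the node and all of its descendants."""
    label, content = node
    found = [(label, " ".join(words(node)))]
    if not isinstance(content, str):
        for child in content:
            found.extend(constituents(child))
    return found


for label, phrase in constituents(TREE):
    print(f"{label}: {phrase}")
```

Each constituent is printed with the span of words it covers, so the hierarchy from the diagram becomes something a program can query.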
  8. Why NLP is difficult
     Language is flexible
     – New words, new meanings
     – Different meanings in different contexts
     Language is subtle
     – He arrived at the lecture
     – He chuckled at the lecture
     – He chuckled his way through the lecture
     – **He arrived his way through the lecture
     Language is complex!
  9. Why NLP is difficult
     MANY hidden variables
     – Knowledge about the world
     – Knowledge about the context
     – Knowledge about human communication techniques
       • Can you tell me the time?
     Problem of scale
     – Many (infinite?) possible words, meanings, contexts
     Problem of sparsity
     – Very difficult to do statistical analysis: most things (words, concepts) have never been seen before
     Long-range correlations
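
The sparsity problem can be made concrete by counting how many distinct words occur only once in a text, and how many words of a new sentence were never seen before. The Python sketch below uses a tiny illustrative sample built from the example sentences on the previous slide; the function names and the held-out sentence are assumptions made for this example, and real corpora are far larger.

```python
from collections import Counter


def singleton_share(tokens):
    """Fraction of distinct words that occur exactly once."""
    counts = Counter(tokens)
    singletons = [word for word, count in counts.items() if count == 1]
    return len(singletons) / len(counts)


def unseen_share(seen_tokens, new_tokens):
    """Fraction of tokens in new text that never occurred in the seen text."""
    vocabulary = set(seen_tokens)
    unseen = [word for word in new_tokens if word not in vocabulary]
    return len(unseen) / len(new_tokens)


sample = ("he arrived at the lecture he chuckled at the lecture "
          "he chuckled his way through the lecture").split()
held_out = "she arrived late to the lecture".split()  # hypothetical new input

print(f"words seen only once: {singleton_share(sample):.2f}")
print(f"unseen words in new sentence: {unseen_share(sample, held_out):.2f}")
```

Even in this toy sample a large share of the vocabulary is backed by a single observation, which is what makes statistical estimates for individual words unreliable.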
  10. Why NLP is difficult
      Key problems:
      – Representation of meaning
      – Language presupposes knowledge about the world
      – Language only reflects the surface of meaning
      – Language presupposes communication between people
  11. Patented Natural Language Processing (NLP) “Reads” Every Communication
      • Each data feed is parsed through one or more of the 7 NLP engines
      • …it is then deconstructed to provide context, subject, and other information regarding the customer (gender, name, etc.)
      • Finally, each identified customer is matched back to the Discovery platform data to gain a full view
      Natural language processing (NLP) is the study of the interactions between computers and natural languages (e.g., English, Polish). The crucial challenge that NLP addresses is deriving meaning from human or natural language input and allowing consumers to analyze parsed meanings in large volumes.
  12. For Example…
      I bought an iPad2 for my mom last week. She loves the weight, but doesn’t like the color. She wishes it came in blue. She says if it came in blue, then she’d buy one for all her friends.
      – Entities (brands, people, locations, times, products…)
      – Events and relationships (purchasing event, my mom…)
      – Sentiment (product specifications)
      – Suggestions (feature specifications)
      – Intent (to purchase, to leave)
      – Geo/Temporal
      QUESTION: Why is this a big deal? NLP takes a simple English statement, parses it into the categories above (and more), and voilà… we have STRUCTURED DATA.
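
To give a feel for what "structured data" means for the statement on this slide, the following Python sketch pulls a few fields out of it with hand-written regular expressions. It is a toy illustration only: the patterns, field names, and the `extract` function are assumptions made for this example, they work on this one sentence pattern, and they are in no way the vendor's NLP engines, which rely on linguistic analysis and not fixed strings.

```python
import re

# Hypothetical patterns, written specifically for the example statement.
PATTERNS = {
    "purchase_event": r"\bbought an? (\w+)",
    "positive_sentiment": r"\bloves the (\w+)",
    "negative_sentiment": r"\bdoesn't like the (\w+)",
    "suggestion": r"\bwishes it came in (\w+)",
    "conditional_intent": r"\bif it came in (\w+), then \w+'d buy\b",
}


def extract(text):
    """Return one field per pattern; None when the pattern does not match."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        record[field] = match.group(1) if match else None
    return record


# Straight apostrophes are used here; curly ones would need normalizing first.
statement = (
    "I bought an iPad2 for my mom last week. She loves the weight, "
    "but doesn't like the color. She wishes it came in blue. She says if "
    "it came in blue, then she'd buy one for all her friends"
)

record = extract(statement)
for field, value in record.items():
    print(f"{field}: {value}")
```

The result is a flat record with one value per field, which is the kind of row that can be stored in a table and queried alongside other customer data.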
  13. Architecture
      [Diagram: Aster Discovery Platform. Labeled components:]
      – Attensity Pipeline: real-time annotated social media data feed, 150+ million social and online sources
      – Other unstructured data: emails; surveys; CRM notes…
      – Pipeline Connector
      – ASAS Wrapper SQL MR
      – NLP, ETL
      – “Now-structured” data
      – Customers / Sales / Other data
      – Churn Score SQL MR
      – Visualization (e.g., Tableau, MSTR)
      – Predictive
  14. Aster + Attensity = Competitive Advantage
      • This integration provides types, subtypes, and supertypes (“Savings”, “Checking”, “Investment”)
      • Anaphora resolution: connects a pronoun (“He”, “Him”) to its subject (George Harrison) without the full name being repeated
      • Includes other languages besides English
      • Attensity’s Semantic Annotation Server (ASAS) capabilities
      • Entity Extraction: automatic detection and extraction of more than 35 entities such as Name, Place
      • Uses Attensity Triples to create context on entities and identify verbs, relationships, actions
      • Auto Classification: uses custom classification rules to classify articles by content, sort by relevance, and discover repeated information
      • Exhaustive Extraction: applies linguistic principles to extract context, entities, and relationships, similar to how the human mind would
      • Voice Tags: identify types of statements and auto-classify them (Question, Intent, Conditional)
      • Creates a unique identifier for each entity for cross-reference
  15. Structuring Unstructured Data: Process Flow
      The flight was delayed and flight attendant would not give us any new information.
  16. How Triples are Extracted & Structured
      Database Record from a Customer Survey
        date: 10-02-06
        region: 0006
        rec?: 4
        source: telephone
        Why would you recommend/not recommend?: The flight was delayed and flight attendant would not give us any new information.
        Who/What: flight
        Behavior: delay
        Fact/Triple: flight : delay
      Extract: extract relational facts & triples from the Notes field.
      Then Fuse: populate the new table with attribute values and fuse with the structured data.
      New Table: Customer Reactions (same record with relational facts extracted from the Notes field). The first four columns are the original structured data; the last three are the newly structured data provided by Attensity.
        date     region  source     rec?  who-what     Behavior    Fact/Triple
        10-2-12  0006    telephone  4     flight       delay       flight : delay
        10-2-12  0006    telephone  4     information  give [not]  information : give [not]
        1-1-13   0007    e-mail     8     i            happy [not] i : happy [not]
        1-1-13   0007    e-mail     8     rep          rude        rep : rude
        1-1-13   0007    e-mail     8     flight       cancel      flight : cancel
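
The "fuse" step on this slide amounts to repeating the original structured fields once per extracted fact. The Python sketch below illustrates it using the values shown for the telephone survey record. The extraction of the triples is taken as given (it is the vendor's component), and the `fuse` function and its field names are hypothetical.

```python
def fuse(record, triples):
    """Combine one structured record with each (who_what, behavior) pair.

    Returns one row per extracted fact; the input record is not modified.
    """
    rows = []
    for who_what, behavior in triples:
        row = dict(record)  # copy the original structured fields
        row["who_what"] = who_what
        row["behavior"] = behavior
        row["fact_triple"] = f"{who_what} : {behavior}"
        rows.append(row)
    return rows


# Original structured data from the survey record (values from the slide).
survey_record = {"date": "10-2-12", "region": "0006",
                 "source": "telephone", "rec": 4}

# Facts assumed to have been extracted from the free-text notes field.
extracted = [("flight", "delay"), ("information", "give [not]")]

customer_reactions = fuse(survey_record, extracted)
for row in customer_reactions:
    print(row)
```

One survey answer becomes two rows here, so the free-text comment can be filtered and aggregated with the same tools as the date, region, and recommendation score.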
  17. Team Power