AOL - Rao & Uppuluri - Hadoop World 2010

Intelligent Text Information Processing System

Vaijanath Rao & Rohini Uppuluri
AOL

Learn more @ http://www.cloudera.com/hadoop/

Published in: Technology

  1. Hadoop Based Intelligent Text Processing System. October 12, 2010, Hadoop World, NYC
  2. Who are we?
       • Vaijanath N. Rao, AOL. Contact: vaijanath.rao@teamaol.com
       • Rohini Uppuluri, AOL. Contact: rohini.uppuluri@teamaol.com
  3. Agenda
       1. Introduction
       2. Problem Statement
       3. Our Intelligent Text Processing System
            1. Overview
            2. Detailed
            3. Application(s)
       4. Q and A
  4. Introduction
  5. Introduction (Continued…)
       • Information Extraction: extracting information from text
            • Part of Speech Analysis. Ex: BlackBeauty<noun> is<verb> a<det> pretty<adjective> horse<noun>
            • Named Entity Extraction. Ex: The CEO <Person>Mr. A</Person> of <Location>New York</Location> based Firm <Organization>Foo.Inc</Organization> announced its new Product <date>today</date>
            • Sentiment Analysis. Ex: Watch this film. AVATAR is an achievement in many technical departments. It is a beautiful experience
            • Sentence Detection. Ex: <Start Sentence>BlackBeauty is a pretty horse <End Sentence>
            • Some tools: OpenNLP [5], LingPipe [6], GATE [7], NLTK [8], etc.
       • Categorization/Classification: categorize items into one of the predefined classes. Ex: An article about a baseball match is a “Sports” article.
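As a rough illustration of the Sentence Detection task above, here is a minimal Python sketch. It uses a naive punctuation rule, not a trained model and not the approach of any of the tools listed on the slide; the function names `detect_sentences` and `mark_sentences` are hypothetical and chosen only for this example.

```python
import re


def detect_sentences(text):
    """Naive rule (an assumption for illustration): a sentence ends at '.', '!'
    or '?' when that character is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def mark_sentences(text):
    """Wrap each detected sentence in the start/end markers used on the slide."""
    return " ".join(
        f"<Start Sentence>{s}<End Sentence>" for s in detect_sentences(text)
    )


review = (
    "Watch this film. AVATAR is an achievement in many technical "
    "departments. It is a beautiful experience"
)
for sentence in detect_sentences(review):
    print(sentence)
```

A rule this simple would wrongly split after "Mr." in the Named Entity example, which is one reason sentence detectors are usually trained on data rather than hand-written.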
  6. Introduction (Continued…)
       • Challenges
            • Processing large amounts of data
            • Most approaches use machine learning methods
            • Models need to be trained on large amounts of data
            • Need a way to perform the computations in a scalable manner
            • Domain dependency
  7. Problem Statement
       • What do we want to do?
            • Build large-scale applications (processing text)
       • Why is this useful?
            • Analyze the large volume of content available at AOL
            • Applications: user interest mining, ad targeting, personalization, etc.
       • We need
            • A large-scale NLP system
            • A pipeline-style architecture in which users can plug components in or out
            • Abstraction or transparency of the algorithms used, as requested by the user
  8. Our Intelligent Text Processing System
       • Overview
            • Pipelined architecture
            • Pluggable components
            • Work Flow Manager
            • Recovery Manager
            • Job Manager
       • Applications
            • Large-scale applications built by applying NLP models in a scalable way
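The idea of a pipelined architecture with pluggable components can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Pipeline` class, its `plug_in`/`plug_out` methods, and the two toy components are hypothetical and are not taken from the AOL system, which the slides describe as running on Hadoop.

```python
from typing import Callable, Dict, List

# Assumption: a component takes a document (a dict of annotations) and
# returns a new dict with its own annotations added.
Component = Callable[[Dict], Dict]


class Pipeline:
    """Runs named components in the order they were plugged in."""

    def __init__(self) -> None:
        self._order: List[str] = []
        self._components: Dict[str, Component] = {}

    def plug_in(self, name: str, component: Component) -> None:
        if name in self._components:
            raise ValueError(f"component already plugged in: {name}")
        self._order.append(name)
        self._components[name] = component

    def plug_out(self, name: str) -> None:
        self._order.remove(name)
        del self._components[name]

    def run(self, doc: Dict) -> Dict:
        for name in self._order:
            doc = self._components[name](doc)
        return doc


def tokenize(doc: Dict) -> Dict:
    # Toy whitespace tokenizer standing in for a real NLP component.
    return {**doc, "tokens": doc["text"].split()}


def count_tokens(doc: Dict) -> Dict:
    return {**doc, "n_tokens": len(doc.get("tokens", []))}


pipeline = Pipeline()
pipeline.plug_in("tokenizer", tokenize)
pipeline.plug_in("counter", count_tokens)
result = pipeline.run({"text": "BlackBeauty is a pretty horse"})
print(result["tokens"], result["n_tokens"])
```

Keeping each component behind the same document-in, document-out interface is what lets a user remove or swap one stage without touching the others.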
  9. Overview
  10. Job Manager
       • Creates a series of parallel and sequential dependent jobs (takes a configuration file)
       • Example: Jobs A, B, C, D, E and F; Job B depends on Job A; Job E depends on Job D
       • The Job Manager creates the following tree (shown as a diagram on the slide)
       • Jobs A, D and F are executed in parallel
       • Jobs B and E are executed in parallel once their parent jobs complete
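The scheduling behaviour described for the Job Manager can be approximated by grouping jobs into levels, where every job in a level has all of its parents in earlier levels. The Python sketch below is one possible reading, not the authors' implementation. The function name `execution_levels` is hypothetical, and because the slide text does not say what Job C depends on, the example assumes C depends on B purely for illustration.

```python
from typing import Dict, List


def execution_levels(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group jobs into levels; jobs in the same level can run in parallel."""
    remaining = {job: set(parents) for job, parents in depends_on.items()}
    done: set = set()
    levels: List[List[str]] = []
    while remaining:
        # A job is ready once every one of its parents has completed.
        ready = sorted(job for job, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError(
                "cyclic or unknown dependency among: " + ", ".join(sorted(remaining))
            )
        levels.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return levels


# B -> A and E -> D come from the slide; C -> B is an assumption.
jobs = {"A": [], "B": ["A"], "C": ["B"], "D": [], "E": ["D"], "F": []}
for level in execution_levels(jobs):
    print(level)
```

Level-by-level batching is a simplification: a scheduler that tracks each parent individually could start B as soon as A finishes, without waiting for D and F.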
  11. Recovery Manager
       • Each job writes its configuration, start time and end time (if completed) to the status file
       • The Recovery Manager periodically checks the status file for updates to see whether any job has failed; if so, it restarts the job by calling the Job Manager with the required configuration
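One pass of such a recovery check might look like the Python sketch below. The slides do not say how a failure is detected or how the status file is laid out, so several things here are assumptions: status records are modelled as in-memory dicts with `name`, `config`, `start_time` and `end_time` keys, a job counts as failed when it has no end time and has run longer than a timeout, and `restart` is a hypothetical callback standing in for the call to the Job Manager.

```python
from typing import Callable, Dict, List


def find_stalled_jobs(statuses: List[Dict], now: float, timeout_s: float) -> List[str]:
    """Return names of jobs with no end time that have exceeded the timeout
    (an assumed failure criterion, not stated in the slides)."""
    return [
        s["name"]
        for s in statuses
        if s["end_time"] is None and now - s["start_time"] > timeout_s
    ]


def recover_once(
    statuses: List[Dict],
    restart: Callable[[Dict], None],
    now: float,
    timeout_s: float,
) -> List[str]:
    """One periodic check: restart every stalled job with its saved configuration."""
    by_name = {s["name"]: s for s in statuses}
    stalled = find_stalled_jobs(statuses, now, timeout_s)
    for name in stalled:
        restart(by_name[name]["config"])
    return stalled


statuses = [
    {"name": "postagger", "config": {"jar": "postagger.jar"},
     "start_time": 0.0, "end_time": 40.0},
    {"name": "nounphrase", "config": {"jar": "chunker.jar"},
     "start_time": 50.0, "end_time": None},
]
restarted_configs: List[Dict] = []
restarted = recover_once(statuses, restarted_configs.append, now=500.0, timeout_s=300.0)
print(restarted)
```

In a running system this check would be repeated on a timer, and the records would be re-read from the status file on each pass.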
  12. Sample Configuration

       <job name="keyphrase">
         <mapreduce depends="none" name="postagger">
           <inputargs>input arguments as string</inputargs>
           <output>$hdfsoutputLocation</output>
           <jar>postagger.jar</jar>
           <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
         </mapreduce>
         <mapreduce depends="postagger" name="nounphrase">
           <inputargs>input arguments as string</inputargs>
           <output>$hdfsoutputlocation</output>
           <jar>chunker.jar</jar>
           <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
         </mapreduce>
       </job>
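To show how a configuration like this could be turned into job dependencies, the following Python sketch parses the sample with the standard library's `xml.etree.ElementTree`. It is an illustration only: `parse_job` is a hypothetical helper, and it assumes that `depends` holds either `none` or a single job name, since the slide does not show how several parents would be written.

```python
import xml.etree.ElementTree as ET

# The sample configuration from the slide, reproduced as-is.
CONFIG = """
<job name="keyphrase">
  <mapreduce depends="none" name="postagger">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputLocation</output>
    <jar>postagger.jar</jar>
    <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
  </mapreduce>
  <mapreduce depends="postagger" name="nounphrase">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputlocation</output>
    <jar>chunker.jar</jar>
    <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
  </mapreduce>
</job>
"""


def parse_job(xml_text):
    """Return (job name, {step name: step details}) for one <job> element."""
    root = ET.fromstring(xml_text.strip())
    steps = {}
    for step in root.findall("mapreduce"):
        depends = step.get("depends", "none")
        steps[step.get("name")] = {
            # Assumption: "none" means no parent; otherwise a single parent name.
            "depends": [] if depends == "none" else [depends],
            "jar": step.findtext("jar"),
            "main_class": step.findtext("mainClass"),
            "output": step.findtext("output"),
        }
    return root.get("name"), steps


job_name, steps = parse_job(CONFIG)
for name, info in steps.items():
    print(job_name, name, info["depends"], info["jar"])
```

Here the `nounphrase` step lists `postagger` as its parent, so a job manager would run the tagger first and the chunker after it completes.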
  13. Overview
  14. NLP Modeling Engine
  15. Detailed
  16. Applications
  17. Application 1 - Location Aware Contextual Advertising - Example
  18. Location Aware Contextual Advertising - Overview
  19. Application 2 - User Aware Ad Targeting - Example. This is an illustrative example and does not represent any real user.
  20. User Aware Ad Targeting
  21. Conclusions
       • Pipelined architecture
       • NLP system
       • Large-scale applications
            • Location-aware contextual ad targeting
            • User-aware ad targeting
  22. Future Work
       • Developing distributed algorithms for
            • POS tagger
            • Sentiment analyzer models
       • Exploring whether integrating with an open source distributed ML/TM framework would be useful
  23. References
       1. Part-of-Speech Tagging: en.wikipedia.org/wiki/Part-of-speech_tagging
       2. Coreference Resolution: en.wikipedia.org/wiki/Coreference
       3. Named Entity Recognition: en.wikipedia.org/wiki/Named_entity_recognition
       4. Sentiment Analysis: en.wikipedia.org/wiki/Sentiment_analysis
       5. OpenNLP: http://opennlp.sourceforge.net/
       6. LingPipe: http://alias-i.com/lingpipe/
       7. GATE: http://gate.ac.uk/ie/
       8. NLTK: www.nltk.org
  24. Q & A. Thank You
