Digital Pebble Behemoth
A talk on NLP processing on top of Hadoop using Behemoth, by Julien Nioche of DigitalPebble.

Presentation Transcript

  • Behemoth – Large scale document processing with Hadoop
    Julien Nioche – julien@digitalpebble.com
    Bristol Hadoop Workshop, 10/03/10
  • DigitalPebble
    - Bristol-based consultancy
    - Specialised in Text Engineering
      – Natural Language Processing
      – Web Crawling
      – Information Retrieval
      – Data Mining
    - Strong focus on Open Source & the Apache ecosystem
    - User | Contributor | Committer
      – Lucene, SOLR, Nutch
      – Tika
      – Mahout
      – GATE, UIMA
  • Open Source Frameworks for NLP
    - Apache UIMA – http://incubator.apache.org/uima/
    - GATE – http://gate.ac.uk/
    - Pipeline of annotators
    - Stand-off annotations
    - Collection of resources (tokenisers, POS taggers, ...)
    - GUIs
    - Community
    - Both very popular
  • Demo GATE
  • Web scale document processing
    - GATE
      – http://gatecloud.net/ – closed-source, limited access
      – DIY
    - UIMA AS
      – http://incubator.apache.org/uima/doc-uimaas-what.html
  • UIMA AS
    - Low latency – throughput?
    - Storage & replication – DIY
    - Ease of configuration?
      – Esp. when mixing different types of Service Instances
    - Post-processing scalability
      – e.g. aggregate info across documents
      – DIY
  • Cometh Behemoth... Behemoth as depicted in the 'Dictionnaire Infernal'.
  • Бегемот (Behemoth) – from The Master and Margarita by M. Bulgakov
  • Behemoth
    - Hosted on Google Code (http://code.google.com/p/behemoth-pebble/)
    - Apache License
    - Large scale document analysis based on Apache Hadoop
    - Deploy UIMA- or GATE-based apps on a cluster
    - Provide adapters for common inputs
    - Encourage code reuse (sandbox)
    - Runs on Hadoop 0.18 / 0.19 / 0.20
  • Typical Workflow
    - Load input into HDFS
    - Convert the input format into the Behemoth document format
      – Inputs supported: standard files on the local file system, WARC, Nutch segments
      – Use Apache Tika to identify the MIME type and extract text and metadata
      – Generate a SequenceFile<Text, BehemothDocument>
    - Put GATE/UIMA resources on HDFS
      – Zipped GATE plugins + GAPP file
      – UIMA PEAR package
  • Typical Workflow (cont.)
    - Process Behemoth docs with UIMA / GATE
      – Use the Distributed Cache to send GATE/UIMA resources to the slaves
      – Load the application and do the processing in the map step
      – No reducers
      – Generate another SequenceFile<Text, BehemothDocument>
    - Post-process
      – Do whatever we want with the annotations
      – ... but it can scale thanks to MapReduce
    - Can do things differently
      – e.g. use reducers for post-processing, convert the input inside the map step
      – Illustrated by examples in the sandbox
      – Reuse modules, e.g. GATEProcessor
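The map-side processing above can be sketched without the Hadoop API. In the sketch below the class and method names are hypothetical, and a trivial whitespace tokeniser stands in for the real GATE/UIMA application; what it does match is the stand-off annotation shape (type plus character offsets) used by Behemoth documents.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-off annotation: a type plus character offsets, mirroring the
// Annotation class shown on the "Document implementation" slide.
class Annotation {
    final String type;
    final long start;
    final long end;

    Annotation(String type, long start, long end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical stand-in for the work a GATE/UIMA application performs inside
// the map step: take the document text, emit Token annotations with offsets.
// A real deployment would run the full pipeline loaded from the Distributed
// Cache; here a whitespace tokeniser keeps the example self-contained.
public class MapStepSketch {
    static List<Annotation> annotate(String text) {
        List<Annotation> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) { i++; continue; }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            out.add(new Annotation("Token", start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Annotation a : annotate("Préambule Considérant que"))
            System.out.println(a.type + " " + a.start + " " + a.end);
    }
}
```

Because all the work happens map-side, adding slaves scales the annotation throughput directly, which is the point of the "no reducers" design.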
  • Document implementation

    class Document
        String url;
        String contentType;
        String text;
        byte[] content;
        MapWritable metadata;
        List<Annotation> annotations;

    class Annotation
        String type;
        long start;
        long end;
        Map<String, String> features;
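For this Document to live in a SequenceFile it has to be serialisable in the Hadoop Writable style. The sketch below shows that write-fields / read-fields discipline using only java.io; the class name (DocSketch) and the reduced field set are chosen for illustration, not taken from the Behemoth source.

```java
import java.io.*;

// Hadoop-free sketch of a Writable-style document. The real BehemothDocument
// implements org.apache.hadoop.io.Writable and also serialises the binary
// content, metadata and annotations; here only the string fields from the
// slide are round-tripped, to keep the example self-contained.
public class DocSketch {
    String url;
    String contentType;
    String text;

    // Serialise fields in a fixed order, as Writable.write(DataOutput) would.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeUTF(contentType);
        out.writeUTF(text);
    }

    // Read fields back in the same order, as Writable.readFields(DataInput) would.
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        contentType = in.readUTF();
        text = in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        DocSketch d = new DocSketch();
        d.url = "file:/data/docs/example.txt";
        d.contentType = "text/plain";
        d.text = "Hello Behemoth";

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        d.write(new DataOutputStream(buf));

        DocSketch d2 = new DocSketch();
        d2.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(d2.url + " | " + d2.text);
    }
}
```

The fixed field order on both sides is what makes the format compact and splittable enough for SequenceFile storage.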
  • Example of document

    ./hadoop fs -libjars /data/behemoth-pebble/build/behemoth-0.1-snapshot.job -text textcorpusANNIE/part-*

    url: file:/data/behemoth-pebble/src/test/data/docs/droitshomme.txt
    contentType: text/plain
    metadata: null
    Content: Préambule Considérant que la reconnaissance de la dignité inhérente à tous les membres (…)
    Text: Préambule Considérant que la reconnaissance de la dignité inhérente à tous les membres (…)
    Annotations:
        Token 0 9 string=Préambule
        Token 11 22 string=Considérant
        Token 23 26 string=que
        Token 27 29 string=la
        Token 30 44 string=reconnaissance
        Token 45 47 string=de
  • Advantages
    - Used as a common ground between UIMA and GATE
      – Deliberately simple document representation => fine for most applications
      – Feature names and values as Strings
    - Potentially not restricted to Java annotators
      – Hadoop Pipes for C++ annotators
      – Needs a C++ implementation of BehemothDocument
      – Unless AVRO is used (more on that later)
    - Harnesses multiple cores / CPUs
      – Worth using even on a single machine
    - Easy configuration
      – Custom BehemothConfiguration (behemoth-default & behemoth-site.xml)
      – What annotations to transfer from the GATE / UIMA docs
      – What features to keep
    - Benefits from the Hadoop ecosystem
      – Focus on the use of annotations and custom code
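The configuration point, choosing which annotations to transfer and which features to keep, can be illustrated with a small filter. All names here are invented for the sketch; in Behemoth itself this behaviour is driven by entries in behemoth-default.xml / behemoth-site.xml rather than hard-coded sets.

```java
import java.util.*;

// Annotation stand-in: a type plus string-valued features, as on the
// "Document implementation" slide.
class Ann {
    final String type;
    final Map<String, String> features;

    Ann(String type, Map<String, String> features) {
        this.type = type;
        this.features = features;
    }
}

// Hypothetical sketch of the behemoth-site.xml idea: one whitelist decides
// which annotation types are transferred from the GATE/UIMA document, another
// decides which feature names are kept on the survivors.
public class AnnotationWhitelist {
    private final Set<String> keepTypes;
    private final Set<String> keepFeatures;

    public AnnotationWhitelist(Set<String> keepTypes, Set<String> keepFeatures) {
        this.keepTypes = keepTypes;
        this.keepFeatures = keepFeatures;
    }

    public List<Ann> filter(List<Ann> annotations) {
        List<Ann> kept = new ArrayList<>();
        for (Ann a : annotations) {
            if (!keepTypes.contains(a.type)) continue;   // drop whole annotation
            Map<String, String> feats = new HashMap<>(a.features);
            feats.keySet().retainAll(keepFeatures);      // strip unwanted features
            kept.add(new Ann(a.type, feats));
        }
        return kept;
    }

    public static void main(String[] args) {
        AnnotationWhitelist w =
                new AnnotationWhitelist(Set.of("Token"), Set.of("string"));
        List<Ann> in = List.of(
                new Ann("Token", Map.of("string", "que", "category", "DET")),
                new Ann("SpaceToken", Map.of("string", " ")));
        for (Ann a : w.filter(in))
            System.out.println(a.type + " " + a.features);
    }
}
```

Filtering at transfer time keeps the Behemoth documents small, which matters when they are written back out as SequenceFiles after every processing step.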
  • Sandbox
    - Reuse
      – Basic blocks: conversion / GATE-UIMA wrappers / ...
    - Extend
      – Add custom reducers for specific tasks
    - Share
      – Open to contributions
      – Separate from the core
  • Quick demo
    - Do we have 5 more minutes?
  • Future developments
    - Cascading
      – Tap / Pipe / Sink
    - HBase
      – Avoid multiplying SequenceFiles
    - AVRO
      – Facilitate annotators in languages other than Java
    - Sandbox examples
      – SOLR: use named entities (Person, Location, ...) for faceting
      – MAHOUT: generate vectors for document clustering
    - Better documentation, pretty pictures, etc.
    - Needs to be used at a very large scale
      – Anyone with a good use case?