Behemoth
Large scale document processing with
Hadoop


Julien Nioche
julien@digitalpebble.com

Bristol Hadoop Workshop 10/03/10
DigitalPebble

 Bristol-based consultancy
 Specialised in Text Engineering
   – Natural Language Processing
   – Web Crawling
   – Information Retrieval
   – Data Mining

 Strong focus on Open Source & Apache ecosystem

 User | Contributor | Committer
   – Lucene, SOLR, Nutch
   – Tika
   – Mahout
   – GATE, UIMA
Open Source Frameworks for NLP

 Apache UIMA
  – http://incubator.apache.org/uima/


 GATE
  – http://gate.ac.uk/

  – Pipeline of annotators
  – Stand-off annotations
  – Collection of resources (Tokenisers, POS taggers, ...)
  – GUIs
  – Community
  – Both very popular
Demo GATE
Web scale document processing

 GATE
  – http://gatecloud.net/ - Closed-source, limited access
  – DIY

 UIMA AS
   – http://incubator.apache.org/uima/doc-uimaas-what.html
UIMA AS

 Low latency
   – throughput?


 Storage & replication
   – DIY


 Ease of configuration?
   – Esp. when mixing different types of Service Instances

 Post-processing scalability
   – e.g. aggregate info across documents
   – DIY
Cometh Behemoth...



Behemoth as depicted in the 'Dictionnaire Infernal'.
Бегемот (Behemoth)


The Master and Margarita
M. Boulgakov
Behemoth

 Hosted on Google Code
  (http://code.google.com/p/behemoth-pebble/)
 Apache License

 Large scale document analysis based on Apache Hadoop

 Deploy UIMA or GATE-based apps on cluster

 Provide adapters for common inputs

 Encourage code reuse (sandbox)

 Runs on Hadoop 0.18 / 0.19 / 0.20
Typical Workflow

 Load input into HDFS

 Convert input format into Behemoth Document Format
   – Inputs supported: standard files on the local file system, WARC, Nutch segments
   – Use Apache Tika to identify mime-type, extract text and meta-data
   – Generate SequenceFile<Text,BehemothDocument> (sketched below)

 Put GATE/UIMA resources on HDFS
   – Zipped GATE plugins + GAPP file
   – UIMA PEAR package
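A minimal sketch of the conversion step above, writing a SequenceFile<Text,BehemothDocument> from local files. The setUrl/setContent/getUrl methods are assumptions based on the Document implementation slide further down, and the Tika-based text and metadata extraction performed by the real converter is omitted here.

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch only: args[0] = output path on HDFS, remaining args = local input files.
public class SimpleCorpusWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]), Text.class, BehemothDocument.class);
    try {
      for (int i = 1; i < args.length; i++) {
        BehemothDocument doc = new BehemothDocument();
        doc.setUrl("file:" + args[i]);                            // setter name assumed
        doc.setContent(Files.readAllBytes(Paths.get(args[i])));   // raw bytes; Tika fills text later
        writer.append(new Text(doc.getUrl()), doc);               // key = document URL
      }
    } finally {
      writer.close();
    }
  }
}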
Typical Workflow (cont.)

 Process Behemoth docs with UIMA / GATE
   – Use the Distributed Cache to ship GATE/UIMA resources to the slaves
   – Load the application and do the processing in the map step (sketched below)
   – No reducers
   – Generate another SequenceFile<Text,BehemothDocument>

 Post-process
   – Do whatever we want with the annotations
   – … but can scale thanks to MapReduce

 Can do things differently
   – e.g. use reducers for post-processing, convert input inside the map step
   – Illustrated by examples in the Sandbox
   – Reuse modules, e.g. GATEProcessor
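A rough illustration of the map-only step above, using the old mapred API that matches the Hadoop 0.18-0.20 range mentioned earlier. The GATEProcessor constructor and process() call are assumptions rather than the project's documented API, and the driver would additionally call setNumReduceTasks(0) since there are no reducers.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: input and output are both SequenceFile<Text, BehemothDocument>.
public class AnnotatingMapper extends MapReduceBase
    implements Mapper<Text, BehemothDocument, Text, BehemothDocument> {

  private GATEProcessor processor;   // wrapper named on the slides; API assumed

  public void configure(JobConf job) {
    // The zipped GATE application shipped via the Distributed Cache
    // is unpacked and loaded once per map task.
    processor = new GATEProcessor(job);
  }

  public void map(Text url, BehemothDocument doc,
      OutputCollector<Text, BehemothDocument> output, Reporter reporter)
      throws IOException {
    output.collect(url, processor.process(doc));   // annotations added to the doc
  }
}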
Document implementation
class Document
     String url;
     String contentType;
     String text;
     byte[] content;
     MapWritable metadata;
     List<Annotation> annotations;

class Annotation
     String type;
     long start;
     long end;
     Map<String, String> features;
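For instance, a post-processing step can simply walk the annotation list; the getter names below are assumed to mirror the fields above, and the output matches the dump shown on the next slide.

// Print every Token annotation together with the text span it covers.
public static void dumpTokens(BehemothDocument doc) {
  for (Annotation a : doc.getAnnotations()) {
    if ("Token".equals(a.getType())) {
      String covered = doc.getText().substring((int) a.getStart(), (int) a.getEnd());
      System.out.println(a.getType() + " " + a.getStart() + " " + a.getEnd()
          + " string=" + covered);
    }
  }
}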
Example of document
./hadoop fs -libjars /data/behemoth-pebble/build/behemoth-0.1-snapshot.job -text textcorpusANNIE/part-*

url: file:/data/behemoth-pebble/src/test/data/docs/droitshomme.txt
contentType: text/plain
metadata: null
Content: Préambule Considérant que la reconnaissance de la dignité inhérente à tous les membres (…)
Text: Préambule Considérant que la reconnaissance de la dignité inhérente à tous les membres (…)
Annotations:
Token 0 9 string=Préambule
Token 11 22 string=Considérant
Token 23 26 string=que
Token 27 29 string=la
Token 30 44 string=reconnaissance
Token 45 47 string=de
Advantages

 Used as a common ground between UIMA and GATE
   – Deliberately simple document representation => fine for most applications
   – Feature names and values as Strings

 Potentially not restricted to JAVA Annotators
   – Hadoop Pipes for C++ Annotators
   – Needs a C++ implementation of BehemothDocument
   – Unless using AVRO (more on that later)

 Harness multiple cores / CPUs
   – Worth using even on a single machine

 Easy Configuration (sketched below)
   – Custom BehemothConfiguration (behemoth-default & behemoth-site.xml)
   – What annotations to transfer from GATE / UIMA docs
   – What features to keep

 Benefits from the Hadoop Ecosystem
   – Focus on use of annotations and custom code
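To make the configuration point concrete, a job driver might layer the Behemoth settings on top of the stock Hadoop configuration roughly as follows. The create() factory and the property names are illustrative assumptions, not the project's documented keys.

// behemoth-default.xml and behemoth-site.xml are read on top of the Hadoop config.
Configuration conf = BehemothConfiguration.create();     // factory method assumed

// Hypothetical property names, showing the kind of filtering described above:
String annotationTypes = conf.get("behemoth.annotations.filter");  // which annotations to transfer
String featureNames    = conf.get("behemoth.features.filter");     // which features to keep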
Sandbox

 Reuse
  – Basic blocks : conversion / GATE-UIMA wrappers / ...


 Extend
  – Add custom reducers for specific tasks

 Share
  – Open to contributions
  – Separate from the core
Quick demo

 Do we have 5 more minutes?
Future developments

 Cascading
   – Tap / Pipe / Sink
 HBase
   – Avoid multiplying SequenceFiles
 AVRO
   – Facilitate annotators in languages != JAVA

 Sandbox Examples
   – SOLR
      • Use Named Entities (Person, Location, …) for faceting
   – MAHOUT
      • Generate vectors for document clustering

 Better documentation, pretty pictures, etc.

 Needs to be used on a very large scale
   – Anyone with a good use case?
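As a sketch of the SOLR example above: named-entity annotations could be mapped onto fields of a SolrInputDocument and then used for faceting. Field names and getters are made up for illustration; the actual sandbox module may look quite different.

import org.apache.solr.common.SolrInputDocument;

// Turn Person / Location annotations into SOLR fields (sketch only).
public static SolrInputDocument toSolr(BehemothDocument doc) {
  SolrInputDocument solrDoc = new SolrInputDocument();
  solrDoc.addField("id", doc.getUrl());
  solrDoc.addField("text", doc.getText());
  for (Annotation a : doc.getAnnotations()) {
    if ("Person".equals(a.getType()) || "Location".equals(a.getType())) {
      String mention = doc.getText().substring((int) a.getStart(), (int) a.getEnd());
      solrDoc.addField(a.getType().toLowerCase(), mention);   // e.g. "person", "location"
    }
  }
  return solrDoc;
}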