Your SlideShare is downloading. ×
0
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Digital Pebble Behemoth
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Digital Pebble Behemoth

3,036

Published on

Talk about NLP processing on top of Hadoop using Behemoth by Julien of Digital Pebble

Talk about NLP processing on top of Hadoop using Behemoth by Julien of Digital Pebble

Published in: Technology
1 Comment
9 Likes
Statistics
Notes
  • Hi,

    have you looked at Apache Stanbol ?

    Cheers,

    S.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
3,036
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
74
Comments
1
Likes
9
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Behemoth Large scale document processing with Hadoop Julien Nioche julien@digitalpebble.com Bristol Hadoop Workshop 10/03/10
  • 2. DigitalPebble  Bristol-based consultancy  Specialised in Text Engineering – Natural Language Processing – Web Crawling – Information Retrieval – Data Mining  Strong focus on Open Source & Apache ecosystem  User | Contributor | Committer – Lucene, SOLR, Nutch – Tika – Mahout – GATE, UIMA
  • 3. Open Source Frameworks for NLP  Apache UIMA – http://incubator.apache.org/uima/  GATE – http://gate.ac.uk/ – Pipeline of annotators – Stand-off annotations – Collection of resources (Tokenisers, POS taggers, ...) – GUIs – Community – Both very popular
  • 4. Demo GATE
  • 5. Web scale document processing  GATE – http://gatecloud.net/ - Closed-source, limited access – DIY  UIMA AS – http://incubator.apache.org/uima/doc-uimaas-what.html
  • 6. UIMA AS  Low latency – throughput?  Storage & replication – DIY  Ease of configuration? – Esp. when mixing different types of Service Instances  Post-processing scalability – e.g. aggregate info across documents – DIY
  • 7. Cometh Behemoth... Behemoth as depicted in the 'Dictionnaire Infernal'.
  • 8. Бегемот The Master and Margarita M. Boulgakov
  • 9. Behemoth  Hosted on Google Code (http://code.google.com/p/behemoth-pebble/)  Apache License  Large scale document analysis based on Apache Hadoop  Deploy UIMA or GATE-based apps on cluster  Provide adapters for common inputs  Encourage code reuse (sandbox)  Runs on Hadoop 0.18 / 0.19 / 0.20
  • 10. Typical Workflow  Load input into HDFS  Convert input format into Behemoth Document Format – Input supported : standard files on local file system, WARC, Nutch segments – Use Apache Tika to identify mime-type, extract text and meta-data – Generate SequenceFile<Text,BehemothDocument>  Put GATE/UIMA resources on HDFS – Zipped GATE plugins + GAPP file – UIMA Pear package
  • 11. Typical Workflow (cont.)  Process Behemoth docs with UIMA / GATE – Use Distributed Cache for sending G/U resources to slaves – Load application and do processing in Map – No reducers – Generate another SequenceFile<Text,BehemothDocument>  Post-process – Do whatever we want with annotations – … but can scale thanks to Map Reduce  Can do things differently – e.g. use reducers for postprocessing, convert input inside map step – Illustrated by example in Sandbox – Reuse modules e.g. GATEProcessor
  • 12. Document implementation class Document String url; String contentType; String text; byte[] content; MapWritable metadata; List<Annotation> annotations; class Annotation String type; long start; long end; Map<String, String> features;
  • 13. Example of document ./hadoop fs ­libjars /data/behemoth­pebble/build/behemoth­0.1­snapshot.job ­text textcorpusANNIE/part­* url: file:/data/behemoth­pebble/src/test/data/docs/droitshomme.txt contentType: text/plain metadata: null Content: Préambule Considérant que la reconnaissance de la dignité inhérente à tous les membres  (…) Text: Préambule  Considérant que la reconnaissance de la dignité inhérente à tous les membres  (…) Annotations: Token 0 9 string=Préambule Token 11 22 string=Considérant Token 23 26 string=que Token 27 29 string=la Token 30 44 string=reconnaissance Token 45 47 string=de
  • 14. Advantages  Used as a common ground between UIMA and GATE – Deliberately simple document representation => fine for most applications – Feature names and values as Strings  Potentially not restricted to JAVA Annotators – Hadoop Pipe for C++ Annotators – Needs a C++ Implementation of BehemothDocument – Unless use AVRO (more on that later)  Harness multiple cores / CPU – Worth using even on a single machine  Easy Configuration – Custom BehemothConfiguration (behemoth-default & behemoth-site.xml) – What annotations to transfer from GATE / UIMA docs – What features to keep  Benefits from Hadoop Ecosystem – Focus on use of annotations and custom code
  • 15. Sandbox  Reuse – Basic blocks : conversion / GATE-UIMA wrappers / ...  Extend – Add custom reducers for specific tasks  Share – Open to contributions – Separate from the core
  • 16. Quick demo  Do we have 5 more minutes?
  • 17. Future developments  Cascading – Tap / Pipe / Sink  Hbase – Avoid multiplicating SequenceFiles  AVRO – Facilitate annotators in languages != JAVA  Sandbox Examples – SOLR • Use Named Entities (Person, Location, … ) for faceting – MAHOUT • Generate vectors for document clustering  Better documentation, pretty pictures, etc...  Needs to be used on a very large scale – Anyone with a good use case?

×