1. Behemoth
Large scale document processing with
Hadoop
Julien Nioche
julien@digitalpebble.com
Bristol Hadoop Workshop 10/03/10
2. DigitalPebble
Bristol-based consultancy
Specialised in Text Engineering
– Natural Language Processing
– Web Crawling
– Information Retrieval
– Data Mining
Strong focus on Open Source & Apache ecosystem
User | Contributor | Committer
– Lucene, SOLR, Nutch
– Tika
– Mahout
– GATE, UIMA
3. Open Source Frameworks for NLP
Apache UIMA
– http://incubator.apache.org/uima/
GATE
– http://gate.ac.uk/
Both provide:
– Pipelines of annotators
– Stand-off annotations
– Collections of resources (tokenisers, POS taggers, ...)
– GUIs
– Active communities
Both are very popular
5. Web-scale document processing
GATE
– http://gatecloud.net/ - Closed-source, limited access
– DIY
UIMA AS
– http://incubator.apache.org/uima/doc-uimaas-what.html
6. UIMA AS
Low latency
– throughput?
Storage & replication
– DIY
Ease of configuration?
– Esp. when mixing different types of Service Instances
Post-processing scalability
– e.g. aggregate info across documents
– DIY
9. Behemoth
Hosted on Google Code (http://code.google.com/p/behemoth-pebble/)
Apache License
Large-scale document analysis based on Apache Hadoop
Deploy UIMA- or GATE-based apps on a cluster
Provide adapters for common inputs
Encourage code reuse (sandbox)
Runs on Hadoop 0.18 / 0.19 / 0.20
10. Typical Workflow
Load input into HDFS
Convert input format into Behemoth Document Format
– Inputs supported: standard files on the local file system, WARC, Nutch segments
– Use Apache Tika to identify the MIME type, extract text and metadata
– Generate SequenceFile<Text,BehemothDocument>
Put GATE/UIMA resources on HDFS
– Zipped GATE plugins + GAPP file
– UIMA Pear package
11. Typical Workflow (cont.)
Process Behemoth docs with UIMA / GATE
– Use the Distributed Cache to ship GATE/UIMA resources to the slaves
– Load the application and do the processing in the map step
– No reducers
– Generate another SequenceFile<Text,BehemothDocument>
Post-process
– Do whatever we want with annotations
– … but scales thanks to MapReduce
Can do things differently
– e.g. use reducers for post-processing, convert input inside the map step
– Illustrated by example in Sandbox
– Reuse modules e.g. GATEProcessor
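The per-document work done in the map step can be sketched in plain Java. The Hadoop Mapper plumbing and the GATE/UIMA wrappers are omitted, and a trivial whitespace tokeniser stands in for a real annotator — the class and method names are illustrative, not Behemoth API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the work done per document inside the map step: run an
// annotator over the text and record stand-off (start, end) spans.
// A whitespace tokeniser stands in for a real GATE/UIMA application.
public class MapStepSketch {

    // Returns one {start, end} character span per whitespace-delimited token.
    static List<long[]> tokenSpans(String text) {
        List<long[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            spans.add(new long[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        // In Behemoth this would be one document read from a SequenceFile;
        // the map step emits the same document with annotations added.
        for (long[] s : tokenSpans("large scale processing")) {
            System.out.println("Token " + s[0] + ".." + s[1]);
        }
        // prints: Token 0..5 / Token 6..11 / Token 12..22
    }
}
```

Because each document is processed independently, this step needs no reducers — which is exactly why it parallelises so well.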
12. Document implementation
class Document {
    String url;
    String contentType;
    String text;
    byte[] content;
    MapWritable metadata;
    List<Annotation> annotations;
}

class Annotation {
    String type;
    long start;
    long end;
    Map<String, String> features;
}
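The field lists above can be exercised as a self-contained sketch. Hadoop's MapWritable is replaced here with a plain Map so the example compiles without Hadoop, and the helper method and class names are illustrative additions, not Behemoth API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, self-contained sketch of the Behemoth document model.
// MapWritable is swapped for Map<String,String> for illustration only.
class Annotation {
    String type;                         // e.g. "Person", "Location"
    long start;                          // character offset (inclusive)
    long end;                            // character offset (exclusive)
    Map<String, String> features = new HashMap<>();
}

class BehemothDoc {
    String url;
    String contentType;
    String text;
    byte[] content;
    Map<String, String> metadata = new HashMap<>();
    List<Annotation> annotations = new ArrayList<>();

    // Stand-off model: annotations store offsets into the text,
    // never a copy of it. This helper recovers the covered string.
    String covered(Annotation a) {
        return text.substring((int) a.start, (int) a.end);
    }
}

public class DocumentSketch {
    public static void main(String[] args) {
        BehemothDoc doc = new BehemothDoc();
        doc.url = "http://example.com/page.html";  // illustrative value
        doc.contentType = "text/plain";
        doc.text = "Julien lives in Bristol";

        Annotation loc = new Annotation();
        loc.type = "Location";
        loc.start = 16;
        loc.end = 23;
        loc.features.put("kind", "city");
        doc.annotations.add(loc);

        System.out.println(doc.covered(loc));      // prints "Bristol"
    }
}
```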
14. Advantages
Used as a common ground between UIMA and GATE
– Deliberately simple document representation => fine for most applications
– Feature names and values as Strings
Potentially not restricted to Java annotators
– Hadoop Pipes for C++ annotators
– Needs a C++ implementation of BehemothDocument
– Unless AVRO is used (more on that later)
Harness multiple cores / CPUs
– Worth using even on a single machine
Easy Configuration
– Custom BehemothConfiguration (behemoth-default & behemoth-site.xml)
– What annotations to transfer from GATE / UIMA docs
– What features to keep
Benefits from Hadoop Ecosystem
– Focus on use of annotations and custom code
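A behemoth-site.xml override follows the usual Hadoop configuration pattern. The sketch below shows the shape only — the property names here are illustrative assumptions, not the actual Behemoth keys:

```xml
<?xml version="1.0"?>
<!-- behemoth-site.xml: overrides behemoth-default.xml, Hadoop-style.
     Property names below are illustrative, not the real keys. -->
<configuration>
  <property>
    <name>gate.annotations.filter</name>
    <value>Token,Person,Location</value>
    <description>Annotation types copied from the GATE document
    back into the BehemothDocument.</description>
  </property>
  <property>
    <name>gate.features.filter</name>
    <value>kind,category</value>
    <description>Feature names kept on each transferred annotation.</description>
  </property>
</configuration>
```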
15. Sandbox
Reuse
– Basic blocks : conversion / GATE-UIMA wrappers / ...
Extend
– Add custom reducers for specific tasks
Share
– Open to contributions
– Separate from the core
17. Future developments
Cascading
– Tap / Pipe / Sink
Hbase
– Avoid multiplying SequenceFiles
AVRO
– Facilitate annotators in languages other than Java
Sandbox Examples
– SOLR
• Use Named Entities (Person, Location, … ) for faceting
– MAHOUT
• Generate vectors for document clustering
Better documentation, pretty pictures, etc.
Needs to be used on a very large scale
– Anyone with a good use case?
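One way the document model could be exposed to non-Java annotators is an Avro schema mirroring the fields of slide 12. This is a sketch of what such a schema might look like, not a schema that Behemoth ships:

```json
{
  "type": "record",
  "name": "BehemothDocument",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "contentType", "type": ["null", "string"]},
    {"name": "text", "type": ["null", "string"]},
    {"name": "content", "type": ["null", "bytes"]},
    {"name": "metadata", "type": {"type": "map", "values": "string"}},
    {"name": "annotations", "type": {"type": "array", "items": {
      "type": "record", "name": "Annotation",
      "fields": [
        {"name": "type", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "features", "type": {"type": "map", "values": "string"}}
      ]}}}
  ]
}
```

With a language-neutral schema like this, a C++ or Python annotator could read and write documents without a hand-written port of the Java class.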