1. Behemoth
Large scale document processing with
Hadoop
Julien Nioche
julien@digitalpebble.com
Bristol Hadoop Workshop 10/03/10
2. DigitalPebble
Bristol-based consultancy
Specialised in Text Engineering
– Natural Language Processing
– Web Crawling
– Information Retrieval
– Data Mining
Strong focus on Open Source & Apache ecosystem
User | Contributor | Committer
– Lucene, SOLR, Nutch
– Tika
– Mahout
– GATE, UIMA
3. Open Source Frameworks for NLP
Apache UIMA
– http://incubator.apache.org/uima/
GATE
– http://gate.ac.uk/
Both provide:
– Pipelines of annotators
– Stand-off annotations
– Collections of resources (tokenisers, POS taggers, ...)
– GUIs
– Active communities
Both are very popular
5. Web-scale document processing
GATE
– http://gatecloud.net/ - Closed-source, limited access
– DIY
UIMA AS
– http://incubator.apache.org/uima/doc-uimaas-what.html
6. UIMA AS
Low latency
– throughput?
Storage & replication
– DIY
Ease of configuration?
– Esp. when mixing different types of Service Instances
Post-processing scalability
– e.g. aggregate info across documents
– DIY
9. Behemoth
Hosted on Google Code (http://code.google.com/p/behemoth-pebble/)
Apache License
Large-scale document analysis based on Apache Hadoop
Deploy UIMA- or GATE-based apps on a cluster
Provide adapters for common inputs
Encourage code reuse (sandbox)
Runs on Hadoop 0.18 / 0.19 / 0.20
10. Typical Workflow
Load input into HDFS
Convert input format into Behemoth Document Format
– Inputs supported: standard files on the local file system, WARC, Nutch segments
– Use Apache Tika to identify the MIME type, extract text and metadata
– Generate SequenceFile<Text,BehemothDocument>
Put GATE/UIMA resources on HDFS
– Zipped GATE plugins + GAPP file
– UIMA Pear package
11. Typical Workflow (cont.)
Process Behemoth docs with UIMA / GATE
– Use the Distributed Cache to ship GATE/UIMA resources to the slaves
– Load the application and do the processing in the map step
– No reducers
– Generate another SequenceFile<Text,BehemothDocument>
Post-process
– Do whatever we want with annotations
– … but scales thanks to MapReduce
Can do things differently
– e.g. use reducers for post-processing, convert input inside the map step
– Illustrated by example in Sandbox
– Reuse modules e.g. GATEProcessor
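The per-document work done in the map step can be sketched in plain Java. The Hadoop Mapper plumbing and the GATE/UIMA wrappers are omitted, and a trivial whitespace tokeniser stands in for a real annotator — the class and method names are illustrative, not Behemoth API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the work done per document inside the map step: run an
// annotator over the text and record stand-off (start, end) spans.
// A whitespace tokeniser stands in for a real GATE/UIMA application.
public class MapStepSketch {

    // Returns one {start, end} character span per whitespace-delimited token.
    static List<long[]> tokenSpans(String text) {
        List<long[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            spans.add(new long[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        // In Behemoth this would be one document read from a SequenceFile;
        // the map step emits the same document with annotations added.
        for (long[] s : tokenSpans("large scale processing")) {
            System.out.println("Token " + s[0] + ".." + s[1]);
        }
        // prints: Token 0..5 / Token 6..11 / Token 12..22
    }
}
```

Because each document is processed independently, this step needs no reducers — which is exactly why it parallelises so well.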
12. Document implementation
class Document {
    String url;
    String contentType;
    String text;
    byte[] content;
    MapWritable metadata;
    List<Annotation> annotations;
}

class Annotation {
    String type;
    long start;
    long end;
    Map<String, String> features;
}
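The field lists above can be exercised as a self-contained sketch. Hadoop's MapWritable is replaced here with a plain Map so the example compiles without Hadoop, and the helper method and class names are illustrative additions, not Behemoth API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, self-contained sketch of the Behemoth document model.
// MapWritable is swapped for Map<String,String> for illustration only.
class Annotation {
    String type;                         // e.g. "Person", "Location"
    long start;                          // character offset (inclusive)
    long end;                            // character offset (exclusive)
    Map<String, String> features = new HashMap<>();
}

class BehemothDoc {
    String url;
    String contentType;
    String text;
    byte[] content;
    Map<String, String> metadata = new HashMap<>();
    List<Annotation> annotations = new ArrayList<>();

    // Stand-off model: annotations store offsets into the text,
    // never a copy of it. This helper recovers the covered string.
    String covered(Annotation a) {
        return text.substring((int) a.start, (int) a.end);
    }
}

public class DocumentSketch {
    public static void main(String[] args) {
        BehemothDoc doc = new BehemothDoc();
        doc.url = "http://example.com/page.html";  // illustrative value
        doc.contentType = "text/plain";
        doc.text = "Julien lives in Bristol";

        Annotation loc = new Annotation();
        loc.type = "Location";
        loc.start = 16;
        loc.end = 23;
        loc.features.put("kind", "city");
        doc.annotations.add(loc);

        System.out.println(doc.covered(loc));      // prints "Bristol"
    }
}
```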
14. Advantages
Used as a common ground between UIMA and GATE
– Deliberately simple document representation => fine for most applications
– Feature names and values as Strings
Potentially not restricted to Java annotators
– Hadoop Pipes for C++ annotators
– Needs a C++ implementation of BehemothDocument
– Unless AVRO is used (more on that later)
Harness multiple cores / CPUs
– Worth using even on a single machine
Easy Configuration
– Custom BehemothConfiguration (behemoth-default & behemoth-site.xml)
– What annotations to transfer from GATE / UIMA docs
– What features to keep
Benefits from Hadoop Ecosystem
– Focus on use of annotations and custom code
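A behemoth-site.xml override follows the usual Hadoop configuration pattern. The sketch below shows the shape only — the property names here are illustrative assumptions, not the actual Behemoth keys:

```xml
<?xml version="1.0"?>
<!-- behemoth-site.xml: overrides behemoth-default.xml, Hadoop-style.
     Property names below are illustrative, not the real keys. -->
<configuration>
  <property>
    <name>gate.annotations.filter</name>
    <value>Token,Person,Location</value>
    <description>Annotation types copied from the GATE document
    back into the BehemothDocument.</description>
  </property>
  <property>
    <name>gate.features.filter</name>
    <value>kind,category</value>
    <description>Feature names kept on each transferred annotation.</description>
  </property>
</configuration>
```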
15. Sandbox
Reuse
– Basic blocks : conversion / GATE-UIMA wrappers / ...
Extend
– Add custom reducers for specific tasks
Share
– Open to contributions
– Separate from the core
17. Future developments
Cascading
– Tap / Pipe / Sink
Hbase
– Avoid multiplying SequenceFiles
AVRO
– Facilitate annotators in languages other than Java
Sandbox Examples
– SOLR
• Use Named Entities (Person, Location, … ) for faceting
– MAHOUT
• Generate vectors for document clustering
Better documentation, pretty pictures, etc.
Needs to be used on a very large scale
– Anyone with a good use case?
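One way the document model could be exposed to non-Java annotators is an Avro schema mirroring the fields of slide 12. This is a sketch of what such a schema might look like, not a schema that Behemoth ships:

```json
{
  "type": "record",
  "name": "BehemothDocument",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "contentType", "type": ["null", "string"]},
    {"name": "text", "type": ["null", "string"]},
    {"name": "content", "type": ["null", "bytes"]},
    {"name": "metadata", "type": {"type": "map", "values": "string"}},
    {"name": "annotations", "type": {"type": "array", "items": {
      "type": "record", "name": "Annotation",
      "fields": [
        {"name": "type", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "features", "type": {"type": "map", "values": "string"}}
      ]}}}
  ]
}
```

With a language-neutral schema like this, a C++ or Python annotator could read and write documents without a hand-written port of the Java class.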