SlideShare a Scribd company logo
1 of 18
Download to read offline
Behemoth
Large scale document processing with
Hadoop


Julien Nioche
julien@digitalpebble.com

Bristol Hadoop Workshop 10/03/10
DigitalPebble

 Bristol-based consultancy
 Specialised in Text Engineering
   –   Natural Language Processing
   –   Web Crawling
   –   Information Retrieval
   –   Data Mining
 Strong focus on Open Source & Apache ecosystem
 User | Contributor | Committer
   –   Lucene, SOLR, Nutch
   –   Tika
   –   Mahout
   –   GATE, UIMA
Open Source Frameworks for NLP

 Apache UIMA
  – http://incubator.apache.org/uima/


 GATE
  – http://gate.ac.uk/

  –   Pipeline of annotators
  –   Stand-off annotations
  –   Collection of resources (Tokenisers, POS taggers, ...)
  –   GUIs
  –   Community
  –   Both very popular
Demo GATE
Web scale document processing

 GATE
  – http://gatecloud.net/ - Closed-source, limited access
  – DIY

 UIMA AS
   – http://incubator.apache.org/uima/doc-uimaas-what.html
UIMA AS

 Low latency
   – throughput?


 Storage & replication
   – DIY


 Ease of configuration?
   – Esp. when mixing different types of Service Instances


 Post-processing scalability
   – e.g. aggregate info across documents
   – DIY
Cometh Behemoth...



                     Behemoth as depicted
                     in the 'Dictionnaire
                     Infernal'.
Бегемот


          The Master and
          Margarita

          M. Boulgakov
Behemoth

 Hosted on Google Code
  (http://code.google.com/p/behemoth-pebble/)
 Apache License

 Large scale document analysis based on Apache
  Hadoop
 Deploy UIMA or GATE-based apps on cluster
 Provide adapters for common inputs
 Encourage code reuse (sandbox)

 Runs on Hadoop 0.18 / 0.19 / 0.20
Typical Workflow

 Load input into HDFS

 Convert input format into Behemoth Document Format
   – Input supported : standard files on local file system, WARC, Nutch
     segments
   – Use Apache Tika to identify mime-type, extract text and meta-data
   – Generate SequenceFile<Text,BehemothDocument>


 Put GATE/UIMA resources on HDFS
   – Zipped GATE plugins + GAPP file
   – UIMA Pear package
Typical Workflow (cont.)

 Process Behemoth docs with UIMA / GATE
   –   Use Distributed Cache for sending G/U resources to slaves
   –   Load application and do processing in Map
   –   No reducers
   –   Generate another SequenceFile<Text,BehemothDocument>


 Post-process
   – Do whatever we want with annotations
   – … but can scale thanks to Map Reduce


 Can do things differently
   – e.g. use reducers for postprocessing, convert input inside map step
   – Illustrated by example in Sandbox
   – Reuse modules e.g. GATEProcessor
Document implementation
class Document
     String url;
     String contentType;
     String text;
     byte[] content;
     MapWritable metadata;
     List<Annotation> annotations;

class Annotation
     String type;
     long start;
     long end;
     Map<String, String> features;
Example of document
./hadoop fs ­libjars /data/behemoth­pebble/build/behemoth­0.1­snapshot.job ­text textcorpusANNIE/part­*

url: file:/data/behemoth­pebble/src/test/data/docs/droitshomme.txt
contentType: text/plain
metadata: null

Content:
Préambule
Considérant que la reconnaissance de la dignité inhérente à tous les membres  (…)

Text:
Préambule 
Considérant que la reconnaissance de la dignité inhérente à tous les membres  (…)

Annotations:
          Token        0           9           string=Préambule
          Token        11          22          string=Considérant
          Token        23          26          string=que
          Token        27          29          string=la
          Token        30          44          string=reconnaissance
          Token        45          47          string=de
Advantages

 Used as a common ground between UIMA and GATE
   – Deliberately simple document representation => fine for most applications
   – Feature names and values as Strings

 Potentially not restricted to JAVA Annotators
   – Hadoop Pipe for C++ Annotators
   – Needs a C++ Implementation of BehemothDocument
   – Unless use AVRO (more on that later)

 Harness multiple cores / CPU
   – Worth using even on a single machine

 Easy Configuration
   – Custom BehemothConfiguration (behemoth-default & behemoth-site.xml)
   – What annotations to transfer from GATE / UIMA docs
   – What features to keep

 Benefits from Hadoop Ecosystem
   – Focus on use of annotations and custom code
Sandbox

 Reuse
  – Basic blocks : conversion / GATE-UIMA wrappers / ...


 Extend
  – Add custom reducers for specific tasks


 Share
  – Open to contributions
  – Separate from the core
Quick demo

 Do we have 5 more minutes?
Future developments

 Cascading
   – Tap / Pipe / Sink
 Hbase
   – Avoid multiplicating SequenceFiles
 AVRO
   – Facilitate annotators in languages != JAVA
 Sandbox Examples
   – SOLR
      • Use Named Entities (Person, Location, … ) for faceting
   – MAHOUT
      • Generate vectors for document clustering
 Better documentation, pretty pictures, etc...
 Needs to be used on a very large scale
   – Anyone with a good use case?
Digital Pebble Behemoth

More Related Content

What's hot

IP Address Lookup By Using GPU
IP Address Lookup By Using GPUIP Address Lookup By Using GPU
IP Address Lookup By Using GPUJino Antony
 
UltraESB - Advanced services
UltraESB - Advanced servicesUltraESB - Advanced services
UltraESB - Advanced servicesAdroitLogic
 
Support formobility
Support formobilitySupport formobility
Support formobilityRahul Hada
 
Haskell-related part of speech in ONLab
Haskell-related part of speech in ONLabHaskell-related part of speech in ONLab
Haskell-related part of speech in ONLabDmitry Zuikov
 
UltraESB - an introduction
UltraESB - an introductionUltraESB - an introduction
UltraESB - an introductionAdroitLogic
 

What's hot (8)

IP Address Lookup By Using GPU
IP Address Lookup By Using GPUIP Address Lookup By Using GPU
IP Address Lookup By Using GPU
 
UltraESB - Advanced services
UltraESB - Advanced servicesUltraESB - Advanced services
UltraESB - Advanced services
 
Support formobility
Support formobilitySupport formobility
Support formobility
 
Mahti quick-start guide
Mahti quick-start guide Mahti quick-start guide
Mahti quick-start guide
 
Haskell-related part of speech in ONLab
Haskell-related part of speech in ONLabHaskell-related part of speech in ONLab
Haskell-related part of speech in ONLab
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
PF_DIRECT@TMA12
PF_DIRECT@TMA12PF_DIRECT@TMA12
PF_DIRECT@TMA12
 
UltraESB - an introduction
UltraESB - an introductionUltraESB - an introduction
UltraESB - an introduction
 

Viewers also liked

Hadoop: Beyond MapReduce
Hadoop: Beyond MapReduceHadoop: Beyond MapReduce
Hadoop: Beyond MapReduceSteve Loughran
 
Hadoop And Universities
Hadoop And UniversitiesHadoop And Universities
Hadoop And UniversitiesSteve Loughran
 
2014 01-02-patching-workflow
2014 01-02-patching-workflow2014 01-02-patching-workflow
2014 01-02-patching-workflowSteve Loughran
 
Lessons from building large clusters
Lessons from building large clustersLessons from building large clusters
Lessons from building large clustersSteve Loughran
 
Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)Steve Loughran
 
High availability hadoop november 2010
High availability hadoop   november 2010High availability hadoop   november 2010
High availability hadoop november 2010Steve Loughran
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateSteve Loughran
 
High Availability Hadoop
High Availability HadoopHigh Availability Hadoop
High Availability HadoopSteve Loughran
 
My other computer is a datacentre
My other computer is a datacentreMy other computer is a datacentre
My other computer is a datacentreSteve Loughran
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARNSteve Loughran
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemSteve Loughran
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresSteve Loughran
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 

Viewers also liked (19)

Community Engagement
Community EngagementCommunity Engagement
Community Engagement
 
Hadoop: Beyond MapReduce
Hadoop: Beyond MapReduceHadoop: Beyond MapReduce
Hadoop: Beyond MapReduce
 
Hadoop And Universities
Hadoop And UniversitiesHadoop And Universities
Hadoop And Universities
 
2014 01-02-patching-workflow
2014 01-02-patching-workflow2014 01-02-patching-workflow
2014 01-02-patching-workflow
 
Inside hadoop-dev
Inside hadoop-devInside hadoop-dev
Inside hadoop-dev
 
Datamining Location
Datamining LocationDatamining Location
Datamining Location
 
Lessons from building large clusters
Lessons from building large clustersLessons from building large clusters
Lessons from building large clusters
 
Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)
 
High availability hadoop november 2010
High availability hadoop   november 2010High availability hadoop   november 2010
High availability hadoop november 2010
 
HDFS Issues
HDFS IssuesHDFS Issues
HDFS Issues
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the Gate
 
High Availability Hadoop
High Availability HadoopHigh Availability Hadoop
High Availability Hadoop
 
My other computer is a datacentre
My other computer is a datacentreMy other computer is a datacentre
My other computer is a datacentre
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARN
 
HDFS
HDFSHDFS
HDFS
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
 
YARN Services
YARN ServicesYARN Services
YARN Services
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object Stores
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 

Similar to Digital Pebble Behemoth

A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm CrawlerJulien Nioche
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Containerization is more than the new Virtualization: enabling separation of ...
Containerization is more than the new Virtualization: enabling separation of ...Containerization is more than the new Virtualization: enabling separation of ...
Containerization is more than the new Virtualization: enabling separation of ...Jérôme Petazzoni
 
Developing IT infrastructures with Puppet
Developing IT infrastructures with PuppetDeveloping IT infrastructures with Puppet
Developing IT infrastructures with PuppetAlessandro Franceschi
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operationsgrim_radical
 
Linux advanced concepts - Part 2
Linux advanced concepts - Part 2Linux advanced concepts - Part 2
Linux advanced concepts - Part 2NAILBITER
 
Introduction to node.js GDD
Introduction to node.js GDDIntroduction to node.js GDD
Introduction to node.js GDDSudar Muthu
 
Utosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptUtosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptwebhostingguy
 
Utosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptUtosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptwebhostingguy
 
Utosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptUtosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptwebhostingguy
 
Building Server Applications Using ObjectiveC And GNUstep
Building Server Applications Using ObjectiveC And GNUstepBuilding Server Applications Using ObjectiveC And GNUstep
Building Server Applications Using ObjectiveC And GNUstepguest9efd1a1
 
Building Server Applications Using Objective C And Gn Ustep
Building Server Applications Using Objective C And Gn UstepBuilding Server Applications Using Objective C And Gn Ustep
Building Server Applications Using Objective C And Gn Ustepwangii
 
Hands on with CoAP and Californium
Hands on with CoAP and CaliforniumHands on with CoAP and Californium
Hands on with CoAP and CaliforniumJulien Vermillard
 
iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)Eran Duchan
 
Scaleable PHP Applications in Kubernetes
Scaleable PHP Applications in KubernetesScaleable PHP Applications in Kubernetes
Scaleable PHP Applications in KubernetesRobert Lemke
 

Similar to Digital Pebble Behemoth (20)

A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm Crawler
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Containerization is more than the new Virtualization: enabling separation of ...
Containerization is more than the new Virtualization: enabling separation of ...Containerization is more than the new Virtualization: enabling separation of ...
Containerization is more than the new Virtualization: enabling separation of ...
 
Developing IT infrastructures with Puppet
Developing IT infrastructures with PuppetDeveloping IT infrastructures with Puppet
Developing IT infrastructures with Puppet
 
Let's Go
Let's GoLet's Go
Let's Go
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
 
Linux advanced concepts - Part 2
Linux advanced concepts - Part 2Linux advanced concepts - Part 2
Linux advanced concepts - Part 2
 
Introduction to node.js GDD
Introduction to node.js GDDIntroduction to node.js GDD
Introduction to node.js GDD
 
Utosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptUtosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.ppt
 
Utosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptUtosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.ppt
 
Utosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.pptUtosc2007_Apache_Configuration.ppt
Utosc2007_Apache_Configuration.ppt
 
Building Server Applications Using ObjectiveC And GNUstep
Building Server Applications Using ObjectiveC And GNUstepBuilding Server Applications Using ObjectiveC And GNUstep
Building Server Applications Using ObjectiveC And GNUstep
 
Building Server Applications Using Objective C And Gn Ustep
Building Server Applications Using Objective C And Gn UstepBuilding Server Applications Using Objective C And Gn Ustep
Building Server Applications Using Objective C And Gn Ustep
 
Hadoop
HadoopHadoop
Hadoop
 
Hands on with CoAP and Californium
Hands on with CoAP and CaliforniumHands on with CoAP and Californium
Hands on with CoAP and Californium
 
NS3 Overview
NS3 OverviewNS3 Overview
NS3 Overview
 
iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)
 
Scaleable PHP Applications in Kubernetes
Scaleable PHP Applications in KubernetesScaleable PHP Applications in Kubernetes
Scaleable PHP Applications in Kubernetes
 

More from Steve Loughran

The age of rename() is over
The age of rename() is overThe age of rename() is over
The age of rename() is overSteve Loughran
 
What does Rename Do: (detailed version)
What does Rename Do: (detailed version)What does Rename Do: (detailed version)
What does Rename Do: (detailed version)Steve Loughran
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionSteve Loughran
 
@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!Steve Loughran
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()Steve Loughran
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming DeployedSteve Loughran
 
What does rename() do?
What does rename() do?What does rename() do?
What does rename() do?Steve Loughran
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveSteve Loughran
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupSteve Loughran
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSteve Loughran
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Household INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraHousehold INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraSteve Loughran
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionSteve Loughran
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider projectSteve Loughran
 
Help! My Hadoop doesn't work!
Help! My Hadoop doesn't work!Help! My Hadoop doesn't work!
Help! My Hadoop doesn't work!Steve Loughran
 
2013 11-19-hoya-status
2013 11-19-hoya-status2013 11-19-hoya-status
2013 11-19-hoya-statusSteve Loughran
 

More from Steve Loughran (20)

Hadoop Vectored IO
Hadoop Vectored IOHadoop Vectored IO
Hadoop Vectored IO
 
The age of rename() is over
The age of rename() is overThe age of rename() is over
The age of rename() is over
 
What does Rename Do: (detailed version)
What does Rename Do: (detailed version)What does Rename Do: (detailed version)
What does Rename Do: (detailed version)
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit Edition
 
@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming Deployed
 
Testing
TestingTesting
Testing
 
I hate mocking
I hate mockingI hate mocking
I hate mocking
 
What does rename() do?
What does rename() do?What does rename() do?
What does rename() do?
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User Group
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object stores
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Household INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraHousehold INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony Era
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
 
Datacentre stack
Datacentre stackDatacentre stack
Datacentre stack
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
Help! My Hadoop doesn't work!
Help! My Hadoop doesn't work!Help! My Hadoop doesn't work!
Help! My Hadoop doesn't work!
 
2013 11-19-hoya-status
2013 11-19-hoya-status2013 11-19-hoya-status
2013 11-19-hoya-status
 

Recently uploaded

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Recently uploaded (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Digital Pebble Behemoth

  • 1. Behemoth Large scale document processing with Hadoop Julien Nioche julien@digitalpebble.com Bristol Hadoop Workshop 10/03/10
  • 2. DigitalPebble  Bristol-based consultancy  Specialised in Text Engineering – Natural Language Processing – Web Crawling – Information Retrieval – Data Mining  Strong focus on Open Source & Apache ecosystem  User | Contributor | Committer – Lucene, SOLR, Nutch – Tika – Mahout – GATE, UIMA
  • 3. Open Source Frameworks for NLP  Apache UIMA – http://incubator.apache.org/uima/  GATE – http://gate.ac.uk/ – Pipeline of annotators – Stand-off annotations – Collection of resources (Tokenisers, POS taggers, ...) – GUIs – Community – Both very popular
  • 5. Web scale document processing  GATE – http://gatecloud.net/ - Closed-source, limited access – DIY  UIMA AS – http://incubator.apache.org/uima/doc-uimaas-what.html
  • 6. UIMA AS  Low latency – throughput?  Storage & replication – DIY  Ease of configuration? – Esp. when mixing different types of Service Instances  Post-processing scalability – e.g. aggregate info across documents – DIY
  • 7. Cometh Behemoth... Behemoth as depicted in the 'Dictionnaire Infernal'.
  • 8. Бегемот The Master and Margarita M. Boulgakov
  • 9. Behemoth  Hosted on Google Code (http://code.google.com/p/behemoth-pebble/)  Apache License  Large scale document analysis based on Apache Hadoop  Deploy UIMA or GATE-based apps on cluster  Provide adapters for common inputs  Encourage code reuse (sandbox)  Runs on Hadoop 0.18 / 0.19 / 0.20
  • 10. Typical Workflow  Load input into HDFS  Convert input format into Behemoth Document Format – Input supported : standard files on local file system, WARC, Nutch segments – Use Apache Tika to identify mime-type, extract text and meta-data – Generate SequenceFile<Text,BehemothDocument>  Put GATE/UIMA resources on HDFS – Zipped GATE plugins + GAPP file – UIMA Pear package
  • 11. Typical Workflow (cont.)  Process Behemoth docs with UIMA / GATE – Use Distributed Cache for sending G/U resources to slaves – Load application and do processing in Map – No reducers – Generate another SequenceFile<Text,BehemothDocument>  Post-process – Do whatever we want with annotations – … but can scale thanks to Map Reduce  Can do things differently – e.g. use reducers for postprocessing, convert input inside map step – Illustrated by example in Sandbox – Reuse modules e.g. GATEProcessor
  • 12. Document implementation class Document String url; String contentType; String text; byte[] content; MapWritable metadata; List<Annotation> annotations; class Annotation String type; long start; long end; Map<String, String> features;
  • 14. Advantages  Used as a common ground between UIMA and GATE – Deliberately simple document representation => fine for most applications – Feature names and values as Strings  Potentially not restricted to JAVA Annotators – Hadoop Pipe for C++ Annotators – Needs a C++ Implementation of BehemothDocument – Unless use AVRO (more on that later)  Harness multiple cores / CPU – Worth using even on a single machine  Easy Configuration – Custom BehemothConfiguration (behemoth-default & behemoth-site.xml) – What annotations to transfer from GATE / UIMA docs – What features to keep  Benefits from Hadoop Ecosystem – Focus on use of annotations and custom code
  • 15. Sandbox  Reuse – Basic blocks : conversion / GATE-UIMA wrappers / ...  Extend – Add custom reducers for specific tasks  Share – Open to contributions – Separate from the core
  • 16. Quick demo  Do we have 5 more minutes?
  • 17. Future developments  Cascading – Tap / Pipe / Sink  Hbase – Avoid multiplicating SequenceFiles  AVRO – Facilitate annotators in languages != JAVA  Sandbox Examples – SOLR • Use Named Entities (Person, Location, … ) for faceting – MAHOUT • Generate vectors for document clustering  Better documentation, pretty pictures, etc...  Needs to be used on a very large scale – Anyone with a good use case?