Behemoth
Large scale document processing with
Hadoop


Julien Nioche
julien@digitalpebble.com

Bristol Hadoop Workshop 10/03/10
DigitalPebble

 • Bristol-based consultancy
 • Specialised in Text Engineering
   –   Natural Language Processing
   –   Web Crawling
   –   Information Retrieval
   –   Data Mining
 • Strong focus on Open Source & Apache ecosystem
 • User | Contributor | Committer
   –   Lucene, SOLR, Nutch
   –   Tika
   –   Mahout
   –   GATE, UIMA
Open Source Frameworks for NLP

 • Apache UIMA
  – http://incubator.apache.org/uima/


 • GATE
  – http://gate.ac.uk/

  –   Pipeline of annotators
  –   Stand-off annotations
  –   Collection of resources (Tokenisers, POS taggers, ...)
  –   GUIs
  –   Community
  –   Both very popular
Demo GATE
Web scale document processing

 • GATE
  – http://gatecloud.net/ - Closed-source, limited access
  – DIY

 • UIMA AS
   – http://incubator.apache.org/uima/doc-uimaas-what.html
UIMA AS

 • Low latency
   – throughput?


 • Storage & replication
   – DIY


 • Ease of configuration?
   – Esp. when mixing different types of Service Instances


 • Post-processing scalability
   – e.g. aggregate info across documents
   – DIY
Cometh Behemoth...



                     Behemoth as depicted
                     in the 'Dictionnaire
                     Infernal'.
Бегемот (Behemoth)


          The Master and
          Margarita

          M. Bulgakov
Behemoth

 • Hosted on Google Code
   (http://code.google.com/p/behemoth-pebble/)
 • Apache License

 • Large scale document analysis based on Apache Hadoop
 • Deploy UIMA or GATE-based apps on a cluster
 • Provide adapters for common inputs
 • Encourage code reuse (sandbox)

 • Runs on Hadoop 0.18 / 0.19 / 0.20
Typical Workflow

 • Load input into HDFS

 • Convert input format into Behemoth Document Format
   – Supported inputs: standard files on the local file system, WARC, Nutch segments
   – Use Apache Tika to identify the MIME type and extract text and metadata
   – Generate SequenceFile<Text,BehemothDocument>


 • Put GATE/UIMA resources on HDFS
   – Zipped GATE plugins + GAPP file
   – UIMA PEAR package
Typical Workflow (cont.)

 • Process Behemoth docs with UIMA / GATE
   –   Use the Distributed Cache to send GATE/UIMA resources to the slave nodes
   –   Load the application and do the processing in the Map step
   –   No reducers
   –   Generate another SequenceFile<Text,BehemothDocument>


 • Post-process
   – Do whatever we want with the annotations
   – … and it still scales thanks to MapReduce


 • Can do things differently
   – e.g. use reducers for post-processing, convert input inside the map step
   – Illustrated by an example in the Sandbox
   – Reuse modules, e.g. GATEProcessor
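
The split between a map-only annotation step and a separate post-processing step can be sketched in plain Java. This is only an in-memory illustration, not Hadoop or Behemoth code: the class `MapOnlySketch`, its methods, and the whitespace tokeniser standing in for a GATE/UIMA application are all hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the two workflow steps; no Hadoop API is used.
class MapOnlySketch {

    // "Map" step: every document is annotated on its own, so no reducer is needed.
    // Input and output are keyed by URL, like the SequenceFile pairs in the slides.
    static Map<String, List<String>> mapStep(Map<String, String> docs) {
        Map<String, List<String>> annotated = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            annotated.put(e.getKey(), tokenise(e.getValue()));
        }
        return annotated;
    }

    // Assumption: a naive whitespace tokeniser replaces the real GATE/UIMA pipeline.
    static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Post-processing: aggregate information across documents, which is the kind
    // of work a reducer would do. Here: corpus-wide counts of lower-cased tokens.
    static Map<String, Integer> countAcrossDocs(Map<String, List<String>> annotated) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> tokens : annotated.values()) {
            for (String t : tokens) {
                counts.merge(t.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file:/docs/a.txt", "Hadoop scales document processing");
        docs.put("file:/docs/b.txt", "Document processing with Hadoop");
        System.out.println(countAcrossDocs(mapStep(docs)));
    }
}
```

In a real job the first method would correspond to the body of a mapper and the second to a reducer over the annotated output; the sketch only shows why the annotation step needs no reducer while cross-document aggregation does.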
Document implementation
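import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.MapWritable;

class Document {
     String url;                    // source location of the document
     String contentType;            // MIME type, e.g. text/plain
     String text;                   // extracted plain text
     byte[] content;                // original raw content
     MapWritable metadata;          // document-level metadata
     List<Annotation> annotations;  // stand-off annotations over the text
}

class Annotation {
     String type;                   // e.g. Token
     long start;                    // start offset in the text
     long end;                      // end offset in the text
     Map<String, String> features;  // feature names and values as Strings
}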
Example of document
./hadoop fs -libjars /data/behemoth-pebble/build/behemoth-0.1-snapshot.job -text textcorpusANNIE/part-*

url: file:/data/behemoth-pebble/src/test/data/docs/droitshomme.txt
contentType: text/plain
metadata: null

Content:
Préambule
Considérant que la reconnaissance de la dignité inhérente à tous les membres  (…)

Text:
Préambule 
Considérant que la reconnaissance de la dignité inhérente à tous les membres  (…)

Annotations:
          Token        0           9           string=Préambule
          Token        11          22          string=Considérant
          Token        23          26          string=que
          Token        27          29          string=la
          Token        30          44          string=reconnaissance
          Token        45          47          string=de
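
The listing above shows stand-off annotations: each Token stores only offsets into the text, and the covered string is recovered from those offsets. A minimal plain-Java sketch of that idea follows. The classes are simplified stand-ins without Hadoop types, `coveredText` and `StandOffDemo` are hypothetical names, and the whitespace between the first two tokens (a space followed by a newline) is an assumption inferred from the offsets 9 and 11.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; rebuilds the first six tokens of the example document.
class StandOffDemo {

    static Document sample() {
        Document doc = new Document();
        doc.url = "file:/data/behemoth-pebble/src/test/data/docs/droitshomme.txt";
        doc.contentType = "text/plain";
        // Assumption: "Préambule" is followed by a space and a newline.
        doc.text = "Pr\u00e9ambule \nConsid\u00e9rant que la reconnaissance de";
        // Start/end offsets copied from the annotation listing in the slides.
        long[][] spans = { {0, 9}, {11, 22}, {23, 26}, {27, 29}, {30, 44}, {45, 47} };
        for (long[] s : spans) {
            Annotation token = new Annotation("Token", s[0], s[1]);
            token.features.put("string", doc.coveredText(token));
            doc.annotations.add(token);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = sample();
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + "\t" + a.start + "\t" + a.end + "\t" + a.features);
        }
    }
}

// Stand-off annotation: refers to the text by offsets instead of copying it.
class Annotation {
    final String type;
    final long start; // inclusive offset into Document.text
    final long end;   // exclusive offset into Document.text
    final Map<String, String> features = new LinkedHashMap<>();

    Annotation(String type, long start, long end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

// Simplified document: no byte content, metadata, or Writable serialisation.
class Document {
    String url;
    String contentType;
    String text;
    final List<Annotation> annotations = new ArrayList<>();

    // Returns the span of text an annotation points at.
    String coveredText(Annotation a) {
        return text.substring(Math.toIntExact(a.start), Math.toIntExact(a.end));
    }
}
```

Because annotations never modify the text, several annotators (tokenisers, taggers, entity recognisers) can add layers over the same document without interfering with each other.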
Advantages

 • Used as a common ground between UIMA and GATE
   – Deliberately simple document representation => fine for most applications
   – Feature names and values as Strings

 • Potentially not restricted to Java annotators
   – Hadoop Pipes for C++ annotators
   – Needs a C++ implementation of BehemothDocument
   – Unless Avro is used (more on that later)

 • Harness multiple cores / CPUs
   – Worth using even on a single machine

 • Easy configuration
   – Custom BehemothConfiguration (behemoth-default & behemoth-site.xml)
   – Which annotations to transfer from GATE / UIMA docs
   – Which features to keep

 • Benefits from the Hadoop ecosystem
   – Focus on use of annotations and custom code
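
The configuration points above (which annotation types to transfer and which features to keep) amount to a filter applied when copying annotations out of a GATE or UIMA document. The sketch below illustrates that filter in plain Java. It does not use BehemothConfiguration or any real property names; `AnnotationFilterSketch`, `Ann`, and `filter` are hypothetical, and the two sets stand in for whatever the configuration files would supply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of type and feature filtering; not Behemoth code.
class AnnotationFilterSketch {

    // Minimal annotation with the same fields as the Annotation class in the slides.
    static final class Ann {
        final String type;
        final long start;
        final long end;
        final Map<String, String> features;

        Ann(String type, long start, long end, Map<String, String> features) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.features = features;
        }
    }

    // Keeps only annotations whose type is wanted and, within those,
    // only the wanted features. The input list is left untouched.
    static List<Ann> filter(List<Ann> source, Set<String> keepTypes, Set<String> keepFeatures) {
        List<Ann> kept = new ArrayList<>();
        for (Ann a : source) {
            if (!keepTypes.contains(a.type)) {
                continue;
            }
            Map<String, String> features = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : a.features.entrySet()) {
                if (keepFeatures.contains(e.getKey())) {
                    features.put(e.getKey(), e.getValue());
                }
            }
            kept.add(new Ann(a.type, a.start, a.end, features));
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> tokenFeatures = new LinkedHashMap<>();
        tokenFeatures.put("string", "Hadoop");
        tokenFeatures.put("kind", "word");
        List<Ann> all = Arrays.asList(
                new Ann("Token", 0, 6, tokenFeatures),
                new Ann("SpaceToken", 6, 7, new LinkedHashMap<String, String>()));
        // Assumption: these sets would come from the site configuration.
        List<Ann> kept = filter(all,
                Collections.singleton("Token"),
                Collections.singleton("string"));
        for (Ann a : kept) {
            System.out.println(a.type + " " + a.start + " " + a.end + " " + a.features);
        }
    }
}
```

Filtering at transfer time keeps the stored documents small, which matters when the output of one job is the input of the next.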
Sandbox

 • Reuse
  – Basic blocks: conversion / GATE-UIMA wrappers / ...


 • Extend
  – Add custom reducers for specific tasks


 • Share
  – Open to contributions
  – Separate from the core
Quick demo

 • Do we have 5 more minutes?
Future developments

 • Cascading
   – Tap / Pipe / Sink
 • HBase
   – Avoid multiplying SequenceFiles
 • Avro
   – Facilitate annotators in languages other than Java
 • Sandbox examples
   – Solr
      • Use Named Entities (Person, Location, … ) for faceting
   – Mahout
      • Generate vectors for document clustering
 • Better documentation, pretty pictures, etc.
 • Needs to be used on a very large scale
   – Anyone with a good use case?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 

Digital Pebble Behemoth

  • 1. Behemoth: Large scale document processing with Hadoop
      Julien Nioche, julien@digitalpebble.com
      Bristol Hadoop Workshop 10/03/10
  • 2. DigitalPebble
      • Bristol-based consultancy
      • Specialised in Text Engineering
          – Natural Language Processing
          – Web Crawling
          – Information Retrieval
          – Data Mining
      • Strong focus on Open Source & Apache ecosystem
      • User | Contributor | Committer
          – Lucene, SOLR, Nutch
          – Tika
          – Mahout
          – GATE, UIMA
  • 3. Open Source Frameworks for NLP
      • Apache UIMA
          – http://incubator.apache.org/uima/
      • GATE
          – http://gate.ac.uk/
          – Pipeline of annotators
          – Stand-off annotations
          – Collection of resources (Tokenisers, POS taggers, ...)
          – GUIs
          – Community
          – Both very popular
  • 5. Web scale document processing
      • GATE
          – http://gatecloud.net/ - Closed-source, limited access
          – DIY
      • UIMA AS
          – http://incubator.apache.org/uima/doc-uimaas-what.html
  • 6. UIMA AS
      • Low latency
          – throughput?
      • Storage & replication
          – DIY
      • Ease of configuration?
          – Esp. when mixing different types of Service Instances
      • Post-processing scalability
          – e.g. aggregate info across documents
          – DIY
  • 7. Cometh Behemoth...
      Behemoth as depicted in the 'Dictionnaire Infernal'.
  • 8. Бегемот (Behemoth)
      The Master and Margarita, M. Boulgakov
  • 9. Behemoth
      • Hosted on Google Code (http://code.google.com/p/behemoth-pebble/)
      • Apache License
      • Large scale document analysis based on Apache Hadoop
      • Deploy UIMA or GATE-based apps on cluster
      • Provide adapters for common inputs
      • Encourage code reuse (sandbox)
      • Runs on Hadoop 0.18 / 0.19 / 0.20
  • 10. Typical Workflow
      • Load input into HDFS
      • Convert input format into Behemoth Document Format
          – Input supported: standard files on local file system, WARC, Nutch segments
          – Use Apache Tika to identify mime-type, extract text and meta-data
          – Generate SequenceFile<Text,BehemothDocument>
      • Put GATE/UIMA resources on HDFS
          – Zipped GATE plugins + GAPP file
          – UIMA Pear package
  • 11. Typical Workflow (cont.)
      • Process Behemoth docs with UIMA / GATE
          – Use Distributed Cache for sending G/U resources to slaves
          – Load application and do processing in Map
          – No reducers
          – Generate another SequenceFile<Text,BehemothDocument>
      • Post-process
          – Do whatever we want with annotations
          – … but can scale thanks to Map Reduce
      • Can do things differently
          – e.g. use reducers for postprocessing, convert input inside map step
          – Illustrated by example in Sandbox
          – Reuse modules e.g. GATEProcessor
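
The shape of this workflow (a map-only annotation pass, then a separate post-processing pass over the annotated documents) can be sketched in plain Java. This is a simulation only and uses neither the Hadoop nor the Behemoth API: `SketchDoc`, `SketchAnnotation`, `WorkflowSketch` and the regex-based "Capitalised" annotator are hypothetical stand-ins for a real GATE or UIMA application.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins for illustration; not the Behemoth or Hadoop classes.
final class SketchAnnotation {
    final String type;
    final long start;
    final long end;

    SketchAnnotation(String type, long start, long end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class SketchDoc {
    final String url;
    final String text;
    final List<SketchAnnotation> annotations = new ArrayList<>();

    SketchDoc(String url, String text) {
        this.url = url;
        this.text = text;
    }
}

final class WorkflowSketch {
    // Toy annotator standing in for a GATE/UIMA pipeline: marks capitalised words.
    private static final Pattern CAPITALISED = Pattern.compile("\\b[A-Z][a-z]+\\b");

    // "Map" step: one document in, one annotated document out.
    static SketchDoc map(SketchDoc in) {
        SketchDoc out = new SketchDoc(in.url, in.text);
        out.annotations.addAll(in.annotations);
        Matcher m = CAPITALISED.matcher(in.text);
        while (m.find()) {
            out.annotations.add(new SketchAnnotation("Capitalised", m.start(), m.end()));
        }
        return out;
    }

    // Map-only pass: every (key, document) pair is processed independently, no reducer.
    static Map<String, SketchDoc> runMapOnly(Map<String, SketchDoc> input) {
        Map<String, SketchDoc> output = new LinkedHashMap<>();
        for (Map.Entry<String, SketchDoc> e : input.entrySet()) {
            output.put(e.getKey(), map(e.getValue()));
        }
        return output;
    }

    // Post-processing pass: aggregate annotation counts per type across all documents.
    static Map<String, Integer> countByType(Map<String, SketchDoc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (SketchDoc d : docs.values()) {
            for (SketchAnnotation a : d.annotations) {
                counts.merge(a.type, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In a real job the first pass would read and write SequenceFiles and the aggregation would typically be a reducer; here both are ordinary method calls so that the data flow is easy to follow.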
  • 12. Document implementation
        class Document
            String url;
            String contentType;
            String text;
            byte[] content;
            MapWritable metadata;
            List<Annotation> annotations;
        class Annotation
            String type;
            long start;
            long end;
            Map<String, String> features;
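
A minimal plain-Java sketch of this structure shows how stand-off annotations point into the text by offset instead of embedding markup. The field names follow the slide; everything else is an assumption: the class names `DocumentSketch` and `AnnotationSketch`, the helper methods, the use of a `java.util.Map` in place of Hadoop's `MapWritable`, and the treatment of `start`/`end` as character offsets with an exclusive end.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; field names taken from the slide, the rest is illustrative.
final class AnnotationSketch {
    String type;
    long start; // assumed: character offset into the document text, inclusive
    long end;   // assumed: character offset into the document text, exclusive
    Map<String, String> features = new HashMap<>();

    AnnotationSketch(String type, long start, long end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class DocumentSketch {
    String url;
    String contentType;
    String text;
    byte[] content;
    // The slide uses Hadoop's MapWritable; a plain Map keeps this sketch dependency-free.
    Map<String, String> metadata = new HashMap<>();
    List<AnnotationSketch> annotations = new ArrayList<>();

    // Resolve a stand-off annotation back to the text span it covers.
    String coveredText(AnnotationSketch a) {
        return text.substring((int) a.start, (int) a.end);
    }

    // Select annotations by type, e.g. all "Person" annotations.
    List<AnnotationSketch> annotationsOfType(String type) {
        List<AnnotationSketch> result = new ArrayList<>();
        for (AnnotationSketch a : annotations) {
            if (a.type.equals(type)) {
                result.add(a);
            }
        }
        return result;
    }
}
```

Because features are plain String pairs, a value such as a part-of-speech tag or an entity identifier is stored with `a.features.put("name", "value")` regardless of which framework produced it.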
  • 14. Advantages
      • Used as a common ground between UIMA and GATE
          – Deliberately simple document representation => fine for most applications
          – Feature names and values as Strings
      • Potentially not restricted to JAVA Annotators
          – Hadoop Pipe for C++ Annotators
          – Needs a C++ Implementation of BehemothDocument
          – Unless use AVRO (more on that later)
      • Harness multiple cores / CPU
          – Worth using even on a single machine
      • Easy Configuration
          – Custom BehemothConfiguration (behemoth-default & behemoth-site.xml)
          – What annotations to transfer from GATE / UIMA docs
          – What features to keep
      • Benefits from Hadoop Ecosystem
          – Focus on use of annotations and custom code
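
The two configuration choices listed above (which annotation types to transfer, which features to keep) amount to a filter applied when copying annotations out of a GATE or UIMA document. The sketch below illustrates that idea only. It does not read behemoth-site.xml and does not use Behemoth's actual property names; `FilteredAnnotation` and `AnnotationFilterSketch` are hypothetical, and the two sets stand in for whatever the configuration would supply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical annotation holder: a type plus String-valued features.
final class FilteredAnnotation {
    final String type;
    final Map<String, String> features;

    FilteredAnnotation(String type, Map<String, String> features) {
        this.type = type;
        this.features = features;
    }
}

// Hypothetical filter; the sets stand in for values read from configuration.
final class AnnotationFilterSketch {
    private final Set<String> keepTypes;
    private final Set<String> keepFeatures;

    AnnotationFilterSketch(Set<String> keepTypes, Set<String> keepFeatures) {
        this.keepTypes = keepTypes;
        this.keepFeatures = keepFeatures;
    }

    // Drop annotations whose type is not listed, and prune unlisted features from the rest.
    List<FilteredAnnotation> apply(List<FilteredAnnotation> input) {
        List<FilteredAnnotation> output = new ArrayList<>();
        for (FilteredAnnotation a : input) {
            if (!keepTypes.contains(a.type)) {
                continue;
            }
            Map<String, String> kept = new TreeMap<>();
            for (Map.Entry<String, String> f : a.features.entrySet()) {
                if (keepFeatures.contains(f.getKey())) {
                    kept.put(f.getKey(), f.getValue());
                }
            }
            output.add(new FilteredAnnotation(a.type, kept));
        }
        return output;
    }
}
```

Filtering at transfer time keeps the stored documents small, which matters when every annotated document is written back out to a SequenceFile.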
  • 15. Sandbox
      • Reuse
          – Basic blocks: conversion / GATE-UIMA wrappers / ...
      • Extend
          – Add custom reducers for specific tasks
      • Share
          – Open to contributions
          – Separate from the core
  • 16. Quick demo
      • Do we have 5 more minutes?
  • 17. Future developments
      • Cascading
          – Tap / Pipe / Sink
      • HBase
          – Avoid multiplying SequenceFiles
      • AVRO
          – Facilitate annotators in languages != JAVA
      • Sandbox Examples
          – SOLR
              • Use Named Entities (Person, Location, … ) for faceting
          – MAHOUT
              • Generate vectors for document clustering
      • Better documentation, pretty pictures, etc...
      • Needs to be used on a very large scale
          – Anyone with a good use case?