Hydra

Chris Birchall
2014/2/17
M3 Tech Talk #m3dev

What is it?
https://github.com/addthis/hydra
● Hadoop-style distributed processing
framework, optimised for trees
● The Big Idea:
data processing = building and navigating
tree data structures
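
To make the idea concrete, here is a toy sketch in plain shell and awk. It is not Hydra's API or job config format: the function name build_tree and the two-column (date, status) input are made up for illustration. Each record walks one level down the tree per field and bumps a hit counter at every node it passes.

```sh
#!/bin/sh
# Toy illustration only: build_tree is a hypothetical helper, not part of Hydra.
# Reads tab-separated records from file $1 and prints "<node path><TAB><hits>",
# one line per tree node, sorted by path.
build_tree() {
  awk -F'\t' '
    {
      path = ""
      # Each field is one tree level; count a hit on every node along the way
      for (i = 1; i <= NF; i++) {
        path = path "/" $i
        hits[path]++
      }
    }
    END { for (p in hits) print p "\t" hits[p] }
  ' "$1" | LC_ALL=C sort
}

# Demo with made-up data: (date, HTTP status) pairs
sample=$(mktemp)
printf '2014-02-17\t200\n2014-02-17\t404\n2014-02-18\t200\n2014-02-17\t200\n' > "$sample"
build_tree "$sample"
rm -f "$sample"
```

"Navigating" the tree is then a prefix lookup on the node paths: for example, piping the output through grep '^/2014-02-17/' lists the children of that date node.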

Components
● Spawn: Job control (+ UI)
○ (think JobTracker, in Hadoop-speak)

● Minion: task runner
○ (think TaskTracker)

● QueryMaster + QueryWorker
● Meshy: Distributed filesystem
○ (think read-only HDFS)

● Zookeeper, RabbitMQ

Getting started (OSX)
# Prerequisites
brew install rabbitmq maven coreutils wget
# Check this works without a passphrase
ssh localhost
# Check that the GNU coreutils cmds
# (grm, gcp, gln, gmv) are on your PATH
# Clone & build
git clone https://github.com/addthis/hydra.git
cd hydra
mvn package
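
The two manual checks above can be scripted. This is a minimal sketch, assuming a POSIX shell and OpenSSH; the helper names missing_cmds and can_ssh_without_prompt are made up here and are not part of Hydra.

```sh
#!/bin/sh
# missing_cmds (hypothetical helper): prints each named command that is NOT on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# can_ssh_without_prompt (hypothetical helper): BatchMode=yes makes ssh fail
# instead of asking for a password or passphrase.
can_ssh_without_prompt() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true
}

# Report on the GNU coreutils commands named on the slide
missing=$(missing_cmds grm gcp gln gmv)
if [ -n "$missing" ]; then
  echo "Missing GNU coreutils commands:" $missing
else
  echo "GNU coreutils commands found"
fi
```

The ssh check is defined but not run automatically. Call it as: can_ssh_without_prompt && echo "ssh OK"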

Getting started (2)
# Start local stack
hydra-uber/bin/local-stack.sh start
hydra-uber/bin/local-stack.sh start
# yes, twice!
hydra-uber/bin/local-stack.sh seed
# UI should now be running
open http://localhost:5052

Hello world
# Sample job definition file available at
hydra-uber/local/sample/self-gen-tree.json
# Click ‘Create’, copy-paste the job config,
# save the job and click ‘Kick’ to run it.
# Click the ‘Q’ button to open the query UI
# and see the resulting data.

Analysing text files
# Tips:
## The “files” source is broken. Use “mesh2” instead.
## Docs are out of date. Read the source code!
# Mesh filesystem root is here:
hydra-local/streams/
# Here’s an example job config I used to
# parse some TSV-formatted Apache logs:
https://gist.github.com/cb372/9046464

Conclusions
● If you have Small Data,
use grep, awk, sort, uniq
● If you have Big Data,
use Hadoop
● If you really like trees,
use Hydra ;)
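
For the Small Data case, here is a sketch of the kind of pipeline the first bullet has in mind: counting hits per value of one column in a TSV log. The helper name count_by_column and the sample layout (request path in column 3) are assumptions for illustration, not the format used in the gist.

```sh
#!/bin/sh
# count_by_column (hypothetical helper): for TSV file $1, count how often each
# value appears in column $2. Prints "<count><TAB><value>", highest count first.
count_by_column() {
  tab=$(printf '\t')
  awk -F'\t' -v c="$2" '
    { n[$c]++ }
    END { for (k in n) print n[k] "\t" k }
  ' "$1" | LC_ALL=C sort -t "$tab" -k1,1nr -k2,2
}

# Demo with made-up log lines: host, status, path
log=$(mktemp)
printf 'h1\t200\t/a\nh2\t200\t/b\nh3\t404\t/a\n' > "$log"
count_by_column "$log" 3   # prints 2<TAB>/a, then 1<TAB>/b
rm -f "$log"
```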
