Hydra - A Practical Introduction
Big Data DC - @bigdatadc
Matt Abrams - @abramsm
March 4th 2013
Agenda
• What is Hydra?
• Sample Data and Analysis Questions
• Getting started with a local Hydra dev environment
• Hydra’s Key Concepts
• Creating your first Hydra job
• Putting it all together
Hydra’s Goals
• Support Streaming and Batch Processing
• Massive Scalability
• Fault tolerant by design (bend but do not break)
• Incremental Data Processing
• Full stack operational support
  • Command and Control
  • Alerting
  • Resource Management
  • Data/Task Rebalancing
  • Data replication and Backup
What Exactly is Hydra?
• File System
• Data Processing
• Query System
• Job/Cluster Management
• Operational Alerting
• Open Source
Hydra - Terms
• Job: a process for processing data
• Task: a processing component of a job. A job can have one to n tasks.
• Node: a logical unit of processing capacity available to a cluster
• Minion: a management process that runs on cluster nodes and acts as the gatekeeper for controlling task processes
• Spawn: the cluster management controller and UI
Hydra Cluster
Our Sample Data (Log-Synth)
3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
What do we want to know?
• What are the top IP addresses by request count?
• What are the top IP addresses by unique user count?
• What are the most common search terms?
• What are the most common search terms in the slowest 5% of queries?
• What are the daily numbers of unique searches, unique users, and unique IP addresses, and the distribution of response times (all approximate)?
Setting up Hydra’s Local Stack
Vagrant
• $ vagrant init precise32 http://files.vagrantup.com/precise32.box
• Add config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile
• $ vagrant up
• $ vagrant ssh
Java 7
• $ sudo apt-get update
• $ sudo apt-get install python-software-properties
• $ sudo add-apt-repository ppa:webupd8team/java
• $ sudo apt-get update
• $ sudo apt-get install oracle-java7-installer
RabbitMQ, Maven, Git, Make
• $ sudo apt-get install rabbitmq-server
• $ sudo apt-get install maven
• $ sudo apt-get install git
• $ sudo apt-get install make
Copy on Write
• $ wget http://xmailserver.org/fl-cow-0.10.tar.gz
• $ tar zxvf fl-cow-0.10.tar.gz
• $ cd fl-cow-0.10
• $ ./configure --prefix=/usr
• $ make; make check
• $ sudo make install
• $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD
Hydra
• $ git clone https://github.com/addthis/hydra.git
• $ cd hydra; mvn clean -Pbdbje package
• $ ./hydra-uber/bin/local-stack.sh start
• $ ./hydra-uber/bin/local-stack.sh start (yes, start is run twice; the local stack comes up in stages)
• $ ./hydra-uber/bin/local-stack.sh seed
Stage Sample Data in Stream Directory
• $ mkdir ~/hydra/hydra-local/streams/log-synth
• $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth
Pipes and Filters
BundleFilters
• Return true or false
• Operate on entire rows
• Add/remove/edit column values
• May include a call to a ValueFilter

ValueFilters
• Operate on single column values
• Return a value or null
• No visibility into the full row
• Often take input from a BundleFilter
BundleFilter - Chain

// chain of bundle filters
{"op":"chain", "filter":[
  // LIST OF BUNDLE FILTERS
  ....
]}
BundleFilter - Existence

// false if UID column is null
{"op":"field", "from":"UID"},
BundleFilter - Concatenation

// joins FOO and BAR
// stores output in new column "OUTPUT"
{"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},
BundleFilter - Equality Testing

// FIELD_ONE == FIELD_TWO
{"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
BundleFilter - Math!

// DUR = Math.round((end-start)/1000)
{"op":"num", "columns":["END", "START", "DUR"],
 "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}
Stack Math - Sample Data

C0 (START_TIME)   C1 (END_TIME)
100,234           200,468
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set

Stack: 200,468
       100,234

sub: 200,468 - 100,234 = 100,234
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set

Stack: 1000
       100,234

ddiv: 100,234 / 1000 = 100.234
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set

Stack: 100.234

toint: 100
Stack Math - Sample Result

C0 (START_TIME)   C1 (END_TIME)   C2 (DURATION)
100,234           200,468         100
ValueFilter - Glob

// the enclosing field op is a BundleFilter; the nested glob is the ValueFilter
{from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}
ValueFilter - Chain, Split, Index

// a field BundleFilter whose filter is a chain of ValueFilters
{op:"field", from:"LIST", filter: {op:"chain", filter:[
  {op:"split", split:"="},
  {op:"index", index:0}
]}},
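Assuming split and index behave as their names suggest, applying this chain to a LIST value such as foo=bar splits it on "=" and keeps the first element:

  "foo=bar"  ->  split on "="  ->  ["foo", "bar"]  ->  index 0  ->  "foo"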
Data Attachments
Data Attachments are Hydra’s Secret Weapon
• Top-K Estimator
• Cardinality Estimation (HyperLogLog Plus)
• Quantile Estimation (Q-Digest, T-Digest)
• Bloom Filters
• Multiset streaming summarization (Count-Min Sketch)
Data Attachment Example
A single node that tracks the top 1000 unique search terms and the distinct count of UIDs, and provides quantile estimation for the query time.
Putting it All Together
Job Structure
• Jobs have three sections:
  • Source
  • Map
  • Output
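As a rough orientation (illustrative only; the section bodies are elided and the lower-case keys are my assumption, not a complete Hydra job spec), a job configuration has this shape:

  {
    "source": { ... },   // where the input records come from (mesh, local files, Kafka)
    "map":    { ... },   // field selection plus the bundle/value filters shown earlier
    "output": { ... }    // a tree or file output
  }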
Source
• Defines the properties of the input data set
• Several built-in source types:
  • Mesh
  • Local File System
  • Kafka
Map
• Select fields from the input record to process
• Apply filters to rows and columns
• Drop or expand rows
Output - Tree
• Output(s) can be trees or data files
• Trees represent data aggregations that can be queried
• File output targets:
  • File System
  • Cassandra
  • HDFS
Let’s put it all together
Create Hydra Job
Run Job
Query
What are the top IP addresses by record count?
• Exact
  • path: root/byip/+:+hits
  • ops: gather=ks;sort=1:n:d;limit=100
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
What are the top IPs by unique user count?
• Exact
  • path: root/byip/+/+
  • ops: gather=kk;sort=0;gather=ku;sort=1:n:d
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
What are the search terms for the slowest 5%?
• First, get the 95th percentile query time
  • path: /root$+timeDigest=quantile(.95)
  • ops: num=c0,toint,v0,set;gather=a
• Then find all queries above the 95th percentile
  • path: /root/bytime/+/+:+hits
  • ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
Daily Unique Searches, Users, IPs, and Distribution of Response Times?
• Query Path:
  root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
• Ops:
  gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
• Remote Ops:
  num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;
But yeah, I could do that with the CLI!
Related Open Source Projects
• Meshy - https://github.com/addthis/meshy
• Codec - https://github.com/addthis/codec
• Muxy - https://github.com/addthis/muxy
• Bundle - https://github.com/addthis/bundle
• Basis - https://github.com/addthis/basis
• Column Compressor - https://github.com/addthis/columncompressor
• Cluster Boot Service - https://github.com/stewartoallen/cbs
Helpful Resources
• Hydra - https://github.com/addthis/hydra
• Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/
• Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/
• IRC - #hydra
• Mailing List - https://groups.google.com/forum/#!forum/hydra-oss