Hydra - A Practical Introduction
Big Data DC - @bigdatadc
Matt Abrams - @abramsm
March 4th 2013
Hydra - Getting Started

  1. Hydra - A Practical Introduction
     Big Data DC - @bigdatadc
     Matt Abrams - @abramsm
     March 4th 2013
  2. Agenda
     • What is Hydra?
     • Sample Data and Analysis Questions
     • Getting started with a local Hydra dev environment
     • Hydra’s Key Concepts
     • Creating your first Hydra job
     • Putting it all together
  3. Hydra’s Goals
     • Support Streaming and Batch Processing
     • Massive Scalability
     • Fault tolerant by design (bend but do not break)
     • Incremental Data Processing
     • Full stack operational support
     • Command and Control
     • Alerting
     • Resource Management
     • Data/Task Rebalancing
     • Data replication and Backup
  4. What Exactly is Hydra?
     • File System
     • Data Processing
     • Query System
     • Job/Cluster Management
     • Operational Alerting
     • Open Source
  5. Hydra - Terms
     • Job: a process for processing data
     • Task: a processing component of a job. A job can have one to n tasks
     • Node: a logical unit of processing capacity available to a cluster
     • Minion: management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes
     • Spawn: cluster management controller and UI
  6. Hydra Cluster
  7. Our Sample Data (Log-Synth)
     3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
     3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
     4.206, 5dbd9451948ad895, 88.120.153.226, "to"
     4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
     4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
  8. What do we want to know?
     • What are the top IP addresses by request count?
     • What are the top IP addresses by unique user count?
     • What are the most common search terms?
     • What are the most common search terms in the slowest 5% of queries?
     • What are the daily numbers of unique searches, unique users, and unique IP addresses, and the distribution of response times (all approximate)?
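On the five sample rows from slide 7, the first two questions can be answered exactly in a few lines of Python. This is only an illustrative sketch outside Hydra; the field names (`time`, `uid`, `ip`, `terms`) are assumed labels, since the slides do not name the columns.

```python
import csv
from collections import Counter, defaultdict

# The five sample rows shown on slide 7.
SAMPLE = '''3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
'''

def parse(text):
    """Yield (time, uid, ip, terms) tuples; the column names are assumptions."""
    reader = csv.reader(text.splitlines(), skipinitialspace=True)
    for time, uid, ip, terms in reader:
        yield float(time), uid, ip, terms.split()

rows = list(parse(SAMPLE))

# Top IP addresses by request count.
requests_by_ip = Counter(ip for _, _, ip, _ in rows)

# Top IP addresses by unique user count.
users_by_ip = defaultdict(set)
for _, uid, ip, _ in rows:
    users_by_ip[ip].add(uid)
unique_users_by_ip = {ip: len(uids) for ip, uids in users_by_ip.items()}

print(requests_by_ip.most_common())
print(unique_users_by_ip)
```

Exact counting like this keeps every key in memory, which is the cost the approximate queries later in the deck are meant to avoid.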
  9. Setting up Hydra’s Local Stack
  10. Vagrant
     • $ vagrant init precise32 http://files.vagrantup.com/precise32.box
     • // add: config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile
     • $ vagrant up
     • $ vagrant ssh
  11. Java7
     • $ sudo apt-get update
     • $ sudo apt-get install python-software-properties
     • $ sudo add-apt-repository ppa:webupd8team/java
     • $ sudo apt-get update
     • $ sudo apt-get install oracle-java7-installer
  12. RabbitMQ, Maven, Git, Make
     • $ sudo apt-get install rabbitmq-server
     • $ sudo apt-get install maven
     • $ sudo apt-get install git
     • $ sudo apt-get install make
  13. Copy on Write
     • $ wget http://xmailserver.org/fl-cow-0.10.tar.gz
     • $ tar zxvf fl-cow-0.10.tar.gz
     • $ cd fl-cow-0.10
     • $ ./configure --prefix=/usr
     • $ make; make check
     • $ sudo make install
     • $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD
  14. Hydra
     • $ git clone https://github.com/addthis/hydra.git
     • $ cd hydra; mvn clean -Pbdbje package
     • $ ./hydra-uber/bin/local-stack.sh start
     • $ ./hydra-uber/bin/local-stack.sh start
     • $ ./hydra-uber/bin/local-stack.sh seed
  15. Stage Sample Data in Stream Directory
     • $ mkdir ~/hydra/hydra-local/streams/log-synth
     • $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth
  16. Pipes and Filters
  17. BundleFilters
     • Return true or false
     • Operate on entire rows
     • Add/Remove columns
     • Edit Column Values
     • May include a call to ValueFilter
     ValueFilters
     • Operate on single column values
     • Return a value or null
     • No visibility to full row
     • Often take input from BundleFilter
  18. BundleFilter - Chain
     // chain of bundle filters
     {"op":"chain", "filter":[
       // LIST OF BUNDLE FILTERS
       ...
     ]}
  19. BundleFilter - Existence
     // false if UID column is null
     {"op":"field", "from":"UID"},
  20. BundleFilter - Concatenation
     // joins FOO and BAR
     // stores output in new column "OUTPUT"
     {"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},
  21. BundleFilter - Equality Testing
     // FIELD_ONE == FIELD_TWO
     {"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
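The behaviour described on slides 18 to 21 can be mimicked in plain Python to make the row-level contract concrete. This is a hypothetical model, not Hydra's implementation: rows are dicts, each filter returns true or false, and it is assumed here that a chain stops at the first filter that returns false and that `concat` joins values with no separator.

```python
def field(name):
    """Existence test: false when the column is missing or null."""
    return lambda row: row.get(name) is not None

def concat(inputs, out):
    """Join the input columns (assumed: no separator) into a new column."""
    def apply(row):
        row[out] = "".join(str(row[col]) for col in inputs)
        return True
    return apply

def equals(left, right):
    """True when the two columns hold equal values."""
    return lambda row: row.get(left) == row.get(right)

def chain(*filters):
    """Run filters in order; all() stops at the first false (assumed behaviour)."""
    return lambda row: all(f(row) for f in filters)

pipeline = chain(field("UID"), concat(["FOO", "BAR"], "OUTPUT"))

good = {"UID": "5214d63bab95687d", "FOO": "a", "BAR": "b"}
bad = {"UID": None, "FOO": "a", "BAR": "b"}
print(pipeline(good), good.get("OUTPUT"))
print(pipeline(bad), bad.get("OUTPUT"))
```

In this model a row that fails the existence test never reaches `concat`, so its `OUTPUT` column is never written.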
  22. BundleFilter - Math!
     // DUR = Math.round((end-start)/1000)
     {"op":"num", "columns":["END", "START", "DUR"], "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}
  23. Stack Math - Sample Data
     C0,START_TIME: 100,234
     C1,END_TIME: 200,468
  24. Stack Math: c0,c1,sub,v1000,ddiv,toint,v2,set
     Stack: 200,468 | 100,234
     sub: 200,468 - 100,234 = 100,234
  25. Stack Math: c0,c1,sub,v1000,ddiv,toint,v2,set
     Stack: 1000 | 100,234
     ddiv: 100,234 / 1000 = 100.234
  26. Stack Math: c0,c1,sub,v1000,ddiv,toint,v2,set
     Stack: 100.234
     toint: 100
  27. Stack Math - Sample Result
     C0,START_TIME: 100,234
     C1,END_TIME: 200,468
     C2,DURATION: 100
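The stack trace on slides 24 to 27 can be replayed with a small evaluator. The sketch below is a hypothetical model of the `define` string, not Hydra's code. It assumes `cN` pushes column N, `vN` pushes the constant N, `sub` and `ddiv` apply the second-from-top value to the top value, `toint` truncates, and `set` pops a column index and then the value to store. The column order follows slide 22 (`END`, `START`, `DUR`).

```python
def run_stack(define, row):
    """Evaluate a comma-separated stack program against a list-based row."""
    stack = []
    for token in define.split(","):
        if token == "sub":
            top = stack.pop()
            stack.append(stack.pop() - top)   # second-from-top minus top
        elif token == "ddiv":
            top = stack.pop()
            stack.append(stack.pop() / top)   # floating-point division
        elif token == "toint":
            stack.append(int(stack.pop()))    # truncate to an integer
        elif token == "set":
            index = stack.pop()               # assumed: index is on top
            row[index] = stack.pop()
        elif token.startswith("c"):
            stack.append(row[int(token[1:])])  # push a column value
        elif token.startswith("v"):
            stack.append(int(token[1:]))       # push a constant
        else:
            raise ValueError(f"unknown token: {token}")
    return row

# Columns: END, START, DUR (values from the sample data slide).
row = [200468, 100234, None]
print(run_stack("c0,c1,sub,v1000,ddiv,toint,v2,set", row))
```

The intermediate values match the slides: 100234 after `sub`, 100.234 after `ddiv`, and 100 after `toint`.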
  28. ValueFilter - Glob
     {from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}
     (diagram labels: ValueFilter, BundleFilter)
  29. ValueFilter - Chain, Split, Index
     {op:"field", from:"LIST", filter: {op:"chain", filter:[
       {op:"split", split:"="},
       {op:"index", index:0}
     ]}},
     (diagram labels: ValueFilter, ValueFilter(s))
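The chain, split, and index combination on slide 29 reads as a small pipeline applied to one column. A hypothetical Python model follows; the function names mirror the `op` values but are not Hydra APIs, and it is assumed that the wrapping `field` filter writes the result back to the same column and returns false when the result is null.

```python
def split(sep):
    """Value filter: split a string into a list of parts."""
    return lambda value: None if value is None else value.split(sep)

def index(i):
    """Value filter: pick element i from a list, or null if out of range."""
    def apply(value):
        if value is None or i >= len(value):
            return None
        return value[i]
    return apply

def chain(*filters):
    """Value filter: feed the output of each filter into the next."""
    def apply(value):
        for f in filters:
            value = f(value)
        return value
    return apply

def field(column, value_filter):
    """Bundle-level wrapper (assumed semantics): filter one column in place."""
    def apply(row):
        row[column] = value_filter(row.get(column))
        return row[column] is not None
    return apply

row = {"LIST": "color=red"}
keep_key = field("LIST", chain(split("="), index(0)))
print(keep_key(row), row)
```

Each value filter sees only the single value it is handed, which matches the "no visibility to full row" point on slide 17.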
  30. Data Attachments
  31. Data Attachments are Hydra’s Secret Weapon
     • Top-K Estimator
     • Cardinality Estimation (HyperLogLog Plus)
     • Quantile Estimation (Q, T-Digest)
     • Bloom Filters
     • Multiset streaming summarization (CountMin Sketch)
  32. Data Attachment Example
     A single node that tracks the top 1000 unique search terms, the distinct count of UIDs, and provides quantile estimation for the query time
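To give a feel for how one of these estimators trades accuracy for bounded memory, here is a minimal Count-Min sketch in Python. It is a generic textbook version, not the implementation Hydra uses, and the `width` and `depth` defaults are arbitrary choices for illustration.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency table; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent-looking hash per row, derived by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._buckets(item))

sketch = CountMinSketch()
for term in "the then good know boys to seven did local it the the".split():
    sketch.add(term)
print(sketch.estimate("the"))
```

Memory stays at `width * depth` counters no matter how many distinct terms arrive, which is what makes this kind of structure practical to attach to a single tree node.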
  33. Putting it All Together
  34. Job Structure
     • Jobs have three sections
       • Source
       • Map
       • Output
  35. Source
     • Defines the properties of the input data set
     • Several built-in source types:
       • Mesh
       • Local File System
       • Kafka
  36. Map
     • Select fields from input record to process
     • Apply filters to rows and columns
     • Drop or expand rows
  37. Output - Tree
     • Output(s) can be trees or data files
     • Trees represent data aggregations that can be queried
     • File Output Targets
       • File System
       • Cassandra
       • HDFS
  38. Let's put it all Together
  39. Create Hydra Job
  40. Run Job
  41. Query
  42. What are the top IP Addresses By Record Count?
     • Exact
       • path: root/byip/+:+hits
       • ops: gather=ks;sort=1:n:d;limit=100
     • Approximate
       • path: root/byip/+$+uidcount
       • ops: gather=ks;sort=1:n:d;limit=100
  43. What are the top IPs by unique user count?
     • Exact
       • path: root/byip/+/+
       • ops: gather=kk;sort=0;gather=ku;sort=1:n:d
     • Approximate
       • path: root/byip/+$+uidcount
       • ops: gather=ks;sort=1:n:d;limit=100
  44. What are the search terms for the slowest 5%?
     • First get the 95th percentile query time
       • path: /root$+timeDigest=quantile(.95)
       • ops: num=c0,toint,v0,set;gather=a
     • Now find all queries at or above the 95th percentile
       • path: /root/bytime/+/+:+hits
       • ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
  45. Daily Unique Searches, Users, IPs and distribution of response times?
     • Query Path:
       • root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
     • Ops:
       • gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
     • Remote Ops:
       • num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;
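The two-step approach on the "slowest 5%" slide (find the 95th percentile, then keep rows at or above it) can be illustrated outside Hydra. The sketch below uses an exact nearest-rank percentile as a stand-in for the approximate `timeDigest` attachment, and the rows are made-up illustrative data, not output from the sample job.

```python
import math
from collections import Counter

def nearest_rank(values, q):
    """Exact nearest-rank percentile; a stand-in for an approximate digest."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Made-up rows: (query time in ms, search term).
rows = [(50 * i, f"term{i % 4}") for i in range(1, 21)]

# Step 1: the 95th percentile query time.
threshold = nearest_rank([time for time, _ in rows], 0.95)

# Step 2: count search terms among rows at or above the threshold,
# analogous to a greater-than-or-equal filter followed by a gather.
slow_terms = Counter(term for time, term in rows if time >= threshold)

print(threshold)
print(slow_terms.most_common())
```

With these twenty rows the threshold works out to 950, and only the two slowest queries pass the filter.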
  46. But yeah, I could do that with CLI!
  47. Related Open Source Projects
     • Meshy - https://github.com/addthis/meshy
     • Codec - https://github.com/addthis/codec
     • Muxy - https://github.com/addthis/muxy
     • Bundle - https://github.com/addthis/bundle
     • Basis - https://github.com/addthis/basis
     • Column Compressor - https://github.com/addthis/columncompressor
     • Cluster Boot Service - https://github.com/stewartoallen/cbs
  48. Helpful Resources
     • Hydra - https://github.com/addthis/hydra
     • Hydra User Reference - http://ossdocs.addthiscode.net/hydra/latest/user-reference/
     • Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/
     • IRC - #hydra
     • Mailing List - https://groups.google.com/forum/#!forum/hydra-oss
