Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
1. Enabling Exploratory Analytics of Data in
Shared-service Hadoop Clusters
PRESENTED BY Sagi Zelnick Principal Architect @ Yahoo and Ledion Bitincka Principal Architect @ Splunk
Hadoop Summit June 2014 San Jose, CA
2. Overview
2 Yahoo Proprietary
! Hadoop @ Yahoo: 8+ years of innovation
! Hunk @ Yahoo: organization-wide investment for next 3+ years
! Yahoo providing Hunk as a self-service to explore, analyze & visualize data in HDFS
› Hunk allows for visually browsing very complex tables (250+ fields)
› Rapid prototyping for new jobs with almost instant results for searches, without having
to wait for the entire job/query to finish
› Cuts down on the development cycles by faster interaction with results
› Built-in graphs/charts make for a powerful solution in many situations
3. About your speakers
Sagi Zelnick, Principal Architect, Yahoo
Ledion Bitincka, Principal Architect, Splunk
6. Over 600PB of Hadoop storage (over half an Exabyte)
! Very large clusters used by many groups across the enterprise.
! More than 35,000 individual datanodes.
! Hadoop is provided as a service.
! Multiple cluster types such as research, dev, sandbox and production.
! Services such as HBase, Hive, Oozie, etc…
! Users are free to run jobs, but have resource constraints.
! Maintained by the Grid Operations Group.
7. Improving operational visibility with Hunk
! We pointed Hunk at many operational logs and event data we already
had on the grid.
! This includes system metrics, HDFS ops, JVM stats and YARN metrics.
! Created instrumentation to measure usage per user and job.
! Analyzed terabytes of NameNode audit logs.
! Job history leveraged for visualizing usage/growth and historical views.
! Custom events for HBase statistics.
8. Tracking Hadoop performance and metrics in Hunk
Use Case | Customer | Benefits
System metrics from 35k nodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights of resources | All Grid Customers | Track organic growth
Job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Job logs in near real-time | All Grid Customers / Ops | Search for errors directly from the YARN logs
Namenode operational data | Research, Dev | Improved performance and stability
9. Measuring NameNode performance pre & post upgrades
! Historical visualizations of all operations.
! Search data in Hunk from billions of NameNode events.
! Measure JVM and memory usage.
! Insights into operational performance.
11. Sample troubleshooting in Hunk of 750 million events
Search (beginning cut off in the source): …n=5m avg(number*) as num_*
Time range: Last 2 days
2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)
[Chart: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations and num_blockChecksumOp plotted over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks from 250,000,000 to 1,000,000,000]
_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-20 01:15:00 | 1056047.024000 | 34677652.000000 | 124121.264000 | 26242490.800000 | 0.000000 | 88112292.800000 | 126478486.400000 | 51405.346000
2014-05-20 01:20:00 (values cut off in the source) | 105551… | 30920700.… | 10653… | 22756041.8… | 0.000000 | 87745422.40… | 92323387.2… | 32070.48…
12. Big picture plus granular details
New Search
index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
Time range: Last 2 days
8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)
[Chart: threads_Blocked, threads_New, threads_Runnable, threads_Terminated, threads_TimedWaiting and threads_Waiting plotted over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks at 200 and 400]
_time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
2014-05-20 00:15:00 | 70.300667 | 10.268000 | 5.156667 | 0.000000 | 17.488667 | 70.122000
2014-05-20 00:20:00 | 70.422667 | 10.376000 | 5.188000 | 0.000000 | 15.700000 | 66.611333
2014-05-20 00:25:00 | 70.444000 | 10.288000 | 5.144000 | 0.000000 | 14.089333 | 63.400667
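The search above groups events into 5-minute spans and averages a metric within each span. As a rough illustration of that bucketing idea only (not of how Hunk or Splunk implement it), here is a minimal Python sketch. The function names and the sample events are assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts, span=timedelta(minutes=5)):
    """Floor a naive timestamp to the start of its span-sized bucket."""
    return ts - ((ts - EPOCH) % span)


def average_by_bucket(events, span=timedelta(minutes=5)):
    """Average the values of (timestamp, value) pairs per time bucket."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in events:
        key = bucket_start(ts, span)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical thread-count samples, not taken from the slides.
sample_events = [
    (datetime(2014, 5, 20, 0, 0, 10), 70.0),
    (datetime(2014, 5, 20, 0, 4, 50), 74.0),
    (datetime(2014, 5, 20, 0, 5, 0), 10.0),
]
averages = average_by_bucket(sample_events)
```

Each key of `averages` is the start of a 5-minute bucket, which corresponds to the `_time` column in the result table.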
13. Analyzing NameNode RPC calls (troubleshooting)
! Who is making what RPC call (open, listStatus, create, etc.).
! How often are they making these RPC calls.
! Which IP/host are they coming from.
! Search and visualize historical data from billions of events.
! Prevent NameNode abuse/misuse.
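The questions above (who, which call, how often, from where) amount to counting audit events by user, command and client address. The following Python sketch shows one way to do that outside of Hunk. It assumes audit lines carry `ugi=`, `ip=` and `cmd=` key=value fields; the sample lines, user names and addresses are invented for illustration and do not come from the presentation.

```python
import re
from collections import Counter

# Assumed field layout: ugi=<user> ..., ip=/<address>, cmd=<operation>.
FIELD_RE = re.compile(r"\b(ugi|ip|cmd)=(\S+)")

# Hypothetical audit lines used only to exercise the sketch.
SAMPLE_LINES = [
    "2014-05-20 01:15:00,101 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=alice (auth:KERBEROS)\tip=/10.0.0.5\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 01:15:01,202 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=alice (auth:KERBEROS)\tip=/10.0.0.5\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 01:15:02,303 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=bob (auth:KERBEROS)\tip=/10.0.0.9\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
    "a line without audit fields",
]


def parse_audit_line(line):
    """Return (user, command, client_ip), or None if a field is missing."""
    fields = dict(FIELD_RE.findall(line))
    if not {"ugi", "ip", "cmd"} <= fields.keys():
        return None
    return fields["ugi"], fields["cmd"], fields["ip"].lstrip("/")


def count_rpc_calls(lines):
    """Count audit events per (user, command, client_ip) combination."""
    counts = Counter()
    for line in lines:
        parsed = parse_audit_line(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts


rpc_counts = count_rpc_calls(SAMPLE_LINES)
heaviest = rpc_counts.most_common(1)  # the most active user/command/host
```

Sorting the counter with `most_common` gives a quick view of the heaviest callers, which is the kind of signal needed to spot NameNode misuse.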
16. Queue insights (capacity & provisioning)
! Each Hadoop job runs in a specific queue.
! We track every aspect of the YARN framework.
! Immediate queue performance and configuration profiling via job
history server.
! Historical views and trends that enable better capacity management.
! Improved queue utilization and allocation management.
17. Visualizing queues
New Search
index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
Time range: Last 7 days
1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours by queue plotted over _time, Wed May 21 to Mon May 26, 2014; y-axis ticks from 200,000 to 600,000]
_time | OTHER | apg_dailyhigh_p3 | apg_dailymedium_p5 | apg_hourlyhigh_p1 | apg_hourlylow_p4 | apg_hourlymedium_p2 | apg_p7 | curveball_large | curveball_med | slingshot | slingstone
2014-05-20 18:00 | 4154 | 45512 | 7071 | 25643 | 12111 | 29664 | 3473 | 26547 | 14192 | 60875 | 45376
2014-05-21 00:00 | 19341 | 92661 | 18005 | 41008 | 22944 | 88115 | 10896 | 38648 | 8693 | 48186 | 87670
2014-05-21 06:00 (row cut off in the source; values may be incomplete) | 211 | 108137 | 38398 | 35627 | 14934 | 101925 | 244 | 29269 | 14066 | 2434 | 4783
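The search on this slide turns slot seconds into GB-hours and sums them per queue. The same arithmetic can be sketched in Python. The reading of the 0.5 factor as "0.5 GB per slot" is an assumption, as are the function names and the sample jobs.

```python
from collections import defaultdict

# Factor taken from the search; assumed here to mean 0.5 GB per slot.
GB_PER_SLOT = 0.5
SECONDS_PER_HOUR = 3600


def gb_hours(map_slot_seconds, reduce_slot_seconds, gb_per_slot=GB_PER_SLOT):
    """Convert map and reduce slot seconds into GB-hours."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    return total_slot_seconds * gb_per_slot / SECONDS_PER_HOUR


def gb_hours_by_queue(jobs):
    """Sum per-job GB-hours (rounded per job, as in the search) by queue.

    Python's round() rounds exact .5 values to the nearest even integer,
    which may differ from the rounding used by the search itself.
    """
    totals = defaultdict(int)
    for job in jobs:
        per_job = gb_hours(job["mapSlotSeconds"], job["reduceSlotSeconds"])
        totals[job["queue"]] += round(per_job)
    return dict(totals)


# Hypothetical job summaries; queue names are placeholders.
sample_jobs = [
    {"queue": "queue_a", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"queue": "queue_a", "mapSlotSeconds": 36000, "reduceSlotSeconds": 36000},
    {"queue": "queue_b", "mapSlotSeconds": 1000, "reduceSlotSeconds": 0},
]
queue_totals = gb_hours_by_queue(sample_jobs)
```

Because rounding happens per job before summing, many small jobs can each contribute zero, so a queue dominated by short jobs may be under-reported by this formula.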
18. Self-service job reports
! Each job is unique and so are the map and reduce elements.
! How to start analyzing jobs?
! Historical job performance and profiling enables in-depth
performance tuning.
! Long-term historical views and trending of growth.
19. New Search
index="jobsummary_logs_all_blue" cluster="*" user="gmon" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours,2) | eval runtime=(finishTime-submitTime)/1000 | stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status | eval run_mins=round(run_mins/60,2) | sort -gb-hours
Time range: Yesterday
4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM)
Statistics (4,871)
cluster | user | queue | jobName | jobId | status | gb-hours | run_mins
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00 | 33.07
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_312700 | SUCCEEDED | 104.00 | 37.37
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_309715 | SUCCEEDED | 88.00 | 29.83
cobalt | gmon | gridops | distcp: | job_1398982765383_309921 | SUCCEEDED | 36.00 | 68.49
cobalt | gmon | gridops | SPLK_spbl103n01.blue.ygrid.yahoo.com_1401125953.2076_0 | job_1398982765383_313570 | SUCCEEDED | 25.00 | 14.26
cobalt | gmon | gridops | nnaudit_DR_2014_05_25 | job_1398982765383_308938 | SUCCEEDED | 25.00 | 15.43
cob… | g… | grid… | nnaudit_DB_2014_05_25 | job_1398982765… | SUCCE… | 24.00 | 18.07 (row cut off in the source)
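The report search groups job summary events per job, converts the millisecond timestamps into an average runtime in minutes, and sorts by GB-hours in descending order. A minimal Python sketch of those steps follows. The field names mirror the search, but the helper name, the output layout and the sample events are assumptions for illustration only.

```python
from collections import defaultdict

GROUP_FIELDS = ("cluster", "user", "queue", "jobName", "jobId", "status")


def job_report(events):
    """Build per-job rows with summed GB-hours and average runtime in minutes."""
    groups = defaultdict(lambda: {"gb_hours": 0.0, "runtimes": []})
    for event in events:
        key = tuple(event[name] for name in GROUP_FIELDS)
        slot_seconds = event["mapSlotSeconds"] + event["reduceSlotSeconds"]
        # Same arithmetic as the search: 0.5 factor, seconds to hours.
        groups[key]["gb_hours"] += round(slot_seconds * 0.5 / 3600, 2)
        # finishTime and submitTime are treated as epoch milliseconds.
        runtime_seconds = (event["finishTime"] - event["submitTime"]) / 1000
        groups[key]["runtimes"].append(runtime_seconds)

    rows = []
    for key, agg in groups.items():
        avg_seconds = sum(agg["runtimes"]) / len(agg["runtimes"])
        row = dict(zip(GROUP_FIELDS, key))
        row["gb_hours"] = round(agg["gb_hours"], 2)
        row["run_mins"] = round(avg_seconds / 60, 2)
        rows.append(row)
    # Largest consumers first, like "sort -gb-hours" in the search.
    rows.sort(key=lambda r: r["gb_hours"], reverse=True)
    return rows


# Hypothetical events; none of these values come from the slides.
base = {"cluster": "c1", "user": "u1", "queue": "q1", "status": "SUCCEEDED"}
sample_events = [
    dict(base, jobName="small", jobId="job_1", mapSlotSeconds=3600,
         reduceSlotSeconds=0, submitTime=0, finishTime=120000),
    dict(base, jobName="large", jobId="job_2", mapSlotSeconds=7200,
         reduceSlotSeconds=7200, submitTime=0, finishTime=90000),
]
report = job_report(sample_events)
```

Grouping by job ID means each row normally aggregates a single summary event, so the sum and average mostly act as pass-throughs. They only matter if a job is logged more than once.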
23. More data to tap into with the metastore / Hive sources
! Using the metastore, we can set up virtual indexes to any table(s) in
Hive, without the need to define the schema up front
! Visualize very complex tables (250+ fields)
! Rapid prototyping for new jobs with almost instant results for searches,
without having to wait for the entire job/query to finish
! Built-in aggregates and graphs/charts
! Accelerates development workflow by providing faster interaction with
data
... it’s not just logs we’re looking at