Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

PRESENTED BY Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk
Hadoop Summit, June 2014, San Jose, CA

Overview
2 Yahoo Proprietary
•  Hadoop @ Yahoo: 8+ years of innovation
•  Hunk @ Yahoo: organization-wide investment for the next 3+ years
•  Yahoo provides Hunk as a self-service tool to explore, analyze & visualize data in HDFS
›  Hunk allows visual browsing of very complex tables (250+ fields)
›  Rapid prototyping for new jobs, with almost instant results for searches and no need to wait for the entire job/query to finish
›  Shortens development cycles through faster interaction with results
›  Built-in graphs/charts make for a powerful solution in many situations

About your speakers
Sagi Zelnick, Principal Architect, Yahoo
Ledion Bitincka, Principal Architect, Splunk

Hunk + Hadoop @ Yahoo

History of Hadoop innovation @ Yahoo

Over 600PB of Hadoop storage (over half an Exabyte)
•  Very large clusters used by many groups across the enterprise.
•  More than 35,000 individual datanodes.
•  Hadoop is provided as a service.
•  Multiple cluster types such as research, dev, sandbox and production.
•  Services such as HBase, Hive, Oozie, etc…
•  Users are free to run jobs, but have resource constraints.
•  Maintained by the Grid Operations Group.

Improving operational visibility with Hunk
•  We pointed Hunk at many operational logs and event data we already had on the grid.
•  This includes system metrics, HDFS ops, JVM stats and YARN metrics.
•  Created instrumentation to measure usage per user and job.
•  Analyzed terabytes of NameNode audit logs.
•  Job history leveraged for visualizing usage/growth and historical views.
•  Custom events for HBase statistics.

Tracking Hadoop performance and metrics in Hunk
Use Case | Customer | Benefits
System metrics from 35k nodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights of resources | All Grid Customers | Track organic growth
Job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Job logs in near real-time | All Grid Customers / Ops | Search for errors directly from the YARN logs
Namenode operational data | Research, Dev | Improved performance and stability

Measuring NameNode performance pre & post upgrades
•  Historical visualizations of all operations.
•  Search data in Hunk from billions of NameNode events.
•  Measure JVM and memory usage.
•  Insights into operational performance.

Visualization using Hunk
Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
Time range: Last 7 days. 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
[Chart: the eight num_* series listed below plotted over _time, Fri May 16 to Tue May 20, 2014; y-axis ticks at 200,000,000, 400,000,000 and 600,000,000]
_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-15 01:00 | 1124437.735902 | 46721126.819672 | 514957.384098 | 12930433.077869 | 0.000000 | 94210832.786885 | 63512425.967213 | 13975.306557
2014-05-15 02:00 | 1115496.290492 | 53597000.262295 | 298717.637049 | 10402176.717213 | 0.000000 | 94109944.655738 | 93916552.393443 | 35459.288689
2014-05-15 03:00 | 1110372.4173 | 56566721.704918 | 428494.9449 | 13296385.590164 | 0.000000 | 94141430.295082 | 97353478.229508 | 20307.549344

Sample troubleshooting in Hunk of 750 million events
Search (beginning cut off in the source): …n=5m avg(number*) as num_*
Time range: Last 2 days. 2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)
[Chart: the eight num_* series listed below plotted over _time, 12:00 PM Tue May 20 to 12:00 PM Wed May 21, 2014; y-axis ticks at 250,000,000, 500,000,000, 750,000,000 and 1,000,000,000]
_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-20 01:15:00 | 1056047.024000 | 34677652.000000 | 124121.264000 | 26242490.800000 | 0.000000 | 88112292.800000 | 126478486.400000 | 51405.346000
2014-05-20 01:20:00 | 105551… | 30920700.… | 10653… | 22756041.8… | 0.000000 | 87745422.40… | 92323387.2… | 32070.48… (row cut off in the source)

Big picture plus granular details
Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
Time range: Last 2 days. 8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)
[Chart: the six threads_* series listed below plotted over _time, 12:00 AM Tue May 20 to 12:00 PM Wed May 21, 2014; y-axis ticks at 200 and 400]
_time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
2014-05-20 00:15:00 | 70.300667 | 10.268000 | 5.156667 | 0.000000 | 17.488667 | 70.122000
2014-05-20 00:20:00 | 70.422667 | 10.376000 | 5.188000 | 0.000000 | 15.700000 | 66.611333
2014-05-20 00:25:00 | 70.444000 | 10.288000 | 5.144000 | 0.000000 | 14.089333 | 63.400667

Analyzing NameNode RPC calls (troubleshooting)
•  Who is making which RPC calls (open, listStatus, create, etc.).
•  How often they make these RPC calls.
•  Which IP/host the calls come from.
•  Search and visualize historical data from billions of events.
•  Prevent NameNode abuse/misuse.
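
The who / what / from-where breakdown above can be illustrated outside Hunk with a small Python sketch. It assumes the tab-separated key=value layout commonly seen in HDFS NameNode audit logs (ugi, ip, cmd, src, ...); the sample lines, user names and addresses are made up for illustration, and the helper names are hypothetical rather than anything used in the talk.

```python
import re
from collections import Counter

# Hypothetical sample lines in the usual HDFS audit-log key=value layout.
SAMPLE_AUDIT_LINES = [
    "2014-05-20 00:00:01,101 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.0.0.5\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 00:00:01,250 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.0.0.5\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 00:00:02,007 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:KERBEROS)\tip=/10.0.0.9\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
    "2014-05-20 00:00:03,412 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.0.0.5\tcmd=create\tsrc=/tmp/out\tdst=null\tperm=alice:users:rw-r--r--",
]

# Matches key=value pairs; the value stops at the first whitespace character.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_audit_line(line):
    """Return the key=value fields of one audit line as a dict."""
    fields = dict(FIELD_RE.findall(line))
    # The client address is logged as "/10.0.0.5"; drop the leading slash.
    if "ip" in fields:
        fields["ip"] = fields["ip"].lstrip("/")
    return fields


def count_rpc_calls(lines):
    """Count calls per (user, command, client IP)."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if all(key in fields for key in ("ugi", "cmd", "ip")):
            counts[(fields["ugi"], fields["cmd"], fields["ip"])] += 1
    return counts


if __name__ == "__main__":
    # Heaviest callers first, which is where abuse/misuse tends to show up.
    for (user, cmd, ip), n in count_rpc_calls(SAMPLE_AUDIT_LINES).most_common():
        print(user, cmd, ip, n)
```

At the scale described in the slides (billions of events) this grouping would run as a Hunk search over the audit logs in HDFS; the sketch only shows the shape of the aggregation.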

Visualizing 834 million discrete events …

… continued

Queue insights (capacity & provisioning)
•  Each Hadoop job runs in a specific queue.
•  We track every aspect of the YARN framework.
•  Immediate queue performance and configuration profiling via job history server.
•  Historical views and trends that enable better capacity management.
•  Improved queue utilization and allocation management.

Visualizing queues
Search: index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
Time range: Last 7 days. 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours per queue plotted over _time, Wed May 21 to Mon May 26, 2014; y-axis ticks at 200,000, 400,000 and 600,000]
_time | OTHER | apg_dailyhigh_p3 | apg_dailymedium_p5 | apg_hourlyhigh_p1 | apg_hourlylow_p4 | apg_hourlymedium_p2 | apg_p7 | curveball_large | curveball_med | slingshot | slingstone
2014-05-20 18:00 | 4154 | 45512 | 7071 | 25643 | 12111 | 29664 | 3473 | 26547 | 14192 | 60875 | 45376
2014-05-21 00:00 | 19341 | 92661 | 18005 | 41008 | 22944 | 88115 | 10896 | 38648 | 8693 | 48186 | 87670
2014-05-21 06:00 | 211… | 108137 | 38398 | 35627 | 14934 | 101925 | 244… | 29269 | 14066 | 2434… | 4783… (row cut off in the source)
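
The arithmetic in the queue search can be sketched in Python as follows. The field names mapSlotSeconds, reduceSlotSeconds and queue come from the search itself; reading the 0.5 factor as "GB per slot" is an assumption, as is the epoch_seconds field standing in for Splunk's _time. Buckets here are aligned to the Unix epoch in UTC, which may differ from how timechart aligns span=6h in a given time zone, and Python's round() can differ from SPL's on exact .5 ties.

```python
from collections import defaultdict

GB_PER_SLOT = 0.5          # factor from the search; interpreted as GB per slot (assumption)
BUCKET_SECONDS = 6 * 3600  # mirrors span=6h


def gb_hours(map_slot_seconds, reduce_slot_seconds):
    """Convert slot-seconds to whole GB-hours, as the two eval steps do."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    return round(total_slot_seconds * GB_PER_SLOT / 3600)


def gb_hours_by_queue(jobs):
    """Sum GB-hours per (6-hour bucket start, queue).

    jobs: iterable of dicts with epoch_seconds (hypothetical stand-in for
    _time), queue, mapSlotSeconds and reduceSlotSeconds.
    """
    totals = defaultdict(int)
    for job in jobs:
        bucket_start = job["epoch_seconds"] // BUCKET_SECONDS * BUCKET_SECONDS
        totals[(bucket_start, job["queue"])] += gb_hours(
            job["mapSlotSeconds"], job["reduceSlotSeconds"]
        )
    return dict(totals)


if __name__ == "__main__":
    # Made-up job summaries, for illustration only.
    sample_jobs = [
        {"epoch_seconds": 0, "queue": "slingshot", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
        {"epoch_seconds": 3600, "queue": "slingshot", "mapSlotSeconds": 36000, "reduceSlotSeconds": 36000},
        {"epoch_seconds": 21600, "queue": "slingstone", "mapSlotSeconds": 14400, "reduceSlotSeconds": 0},
    ]
    print(gb_hours_by_queue(sample_jobs))
```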

Self-service job reports
•  Each job is unique, and so are its map and reduce elements.
•  How to start analyzing jobs?
•  Historical job performance and profiling enables in-depth performance tuning.
•  Long-term historical views and trending of growth.

Search: index="jobsummary_logs_all_blue" cluster="*" user="gmon" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours,2) | eval runtime=(finishTime-submitTime)/1000 | stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status | eval run_mins=round(run_mins/60,2) | sort -gb-hours
Time range: Yesterday. 4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM)
Statistics (4,871)
cluster | user | queue | jobName | jobId | status | gb-hours | run_mins
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00 | 33.07
cobalt | gmon | grideng | (not captured) | …383_312700 | SUCCEEDED | 104.00 | 37.37
cobalt | gmon | grideng | (not captured) | …383_309715 | SUCCEEDED | 88.00 | 29.83
cobalt | gmon | gridops | distcp: | job_1398982765383_309921 | SUCCEEDED | 36.00 | 68.49
cobalt | gmon | gridops | SPLK_spbl103n01.blue.ygrid.yahoo.com_1401125953.2076_0 | job_1398982765383_313570 | SUCCEEDED | 25.00 | 14.26
cobalt | gmon | gridops | nnaudit_DR_2014_05_25 | job_1398982765383_308938 | SUCCEEDED | 25.00 | 15.43
cob… | g… | grid… | nnaudit_DB_2014_05_25 | job_1398982765… | SUCCE… | 24.00 | 18.07 (row cut off in the source)
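
For readers less familiar with SPL, the per-job report search can be approximated in Python as below. This is a sketch under stated assumptions: field names follow the search, submitTime and finishTime are taken to be epoch milliseconds (the search divides their difference by 1000), and the sample events and job names are invented. The final rounding of the summed GB-hours is added here only to tidy floating-point noise and is not part of the search.

```python
from collections import defaultdict

GROUP_FIELDS = ("cluster", "user", "queue", "jobName", "jobId", "status")


def job_report(events):
    """Return rows (group fields..., gb_hours, run_mins) sorted by gb_hours descending."""
    groups = defaultdict(lambda: {"gb_hours": 0.0, "runtimes": []})
    for event in events:
        key = tuple(event[field] for field in GROUP_FIELDS)
        slot_seconds = event["mapSlotSeconds"] + event["reduceSlotSeconds"]
        # eval gb_hours=round((total_slot_seconds * 0.5) / 3600, 2)
        groups[key]["gb_hours"] += round(slot_seconds * 0.5 / 3600, 2)
        # eval runtime=(finishTime-submitTime)/1000, i.e. seconds
        groups[key]["runtimes"].append((event["finishTime"] - event["submitTime"]) / 1000)

    rows = []
    for key, group in groups.items():
        avg_runtime_seconds = sum(group["runtimes"]) / len(group["runtimes"])
        # eval run_mins=round(run_mins/60, 2)
        rows.append(key + (round(group["gb_hours"], 2), round(avg_runtime_seconds / 60, 2)))

    # sort -gb-hours: largest consumers first
    rows.sort(key=lambda row: row[len(GROUP_FIELDS)], reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up events, for illustration only.
    sample_events = [
        {"cluster": "c1", "user": "u1", "queue": "q1", "jobName": "small", "jobId": "job_A",
         "status": "SUCCEEDED", "mapSlotSeconds": 7200, "reduceSlotSeconds": 3600,
         "submitTime": 1000, "finishTime": 601000},
        {"cluster": "c1", "user": "u1", "queue": "q1", "jobName": "large", "jobId": "job_B",
         "status": "SUCCEEDED", "mapSlotSeconds": 72000, "reduceSlotSeconds": 0,
         "submitTime": 0, "finishTime": 120000},
    ]
    for row in job_report(sample_events):
        print(row)
```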

More data to tap into with the metastore / Hive sources
•  Using the metastore, we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
•  Visualize very complex tables (250+ fields)
•  Rapid prototyping for new jobs, with almost instant results for searches and no need to wait for the entire job/query to finish
•  Built-in aggregates and graphs/charts
•  Accelerates the development workflow by providing faster interaction with data
... it’s not just logs we’re looking at

Integrated Analytics Platform for Diverse Data Stores
Full-featured, Integrated Product
Fast Insights for Everyone
Works with What You Have Today
Explore | Visualize | Dashboards | Share | Analyze
Hadoop Clusters | NoSQL and Other Data Stores
Hadoop Client Libraries | Streaming Resource Libraries

Fast Deployment and Configuration
Just point at Hadoop
•  Certified integrations to all major Hadoop distributions
•  Choose 1st-gen MapReduce or YARN
•  Create Virtual Indexes across one or more clusters
•  From download to searching data in < 60 minutes
Connect to one or multiple Hadoop clusters
YARN certified

Interactive Search and Results Preview
Rapidly interact with data
•  Powerful Search Processing Language (SPL™)
•  Ad hoc exploratory analytics across massive datasets
•  Preview results
•  No fixed schema
•  No requirement to “understand” data upfront
Search interface | Preview results | Drill down to raw data
Pause or stop MapReduce jobs

Powerful Dashboards for Self-Service Analytics
Interactive Dashboards and Charts
•  Easy-to-use dashboard editor
•  Chart overlay
•  Pan and zoom
•  In-dashboard drill down
•  Embed charts and dashboards in 3rd party apps
•  Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1

Automate Access for Rapid Exploration
Supported File Formats
•  Text files
•  Sequence files
•  RCFile
•  ORC files
•  Parquet

Role-based Security for Shared Clusters
Pass-through Authentication
•  Provide role-based security for Hadoop clusters
•  Access Hadoop resources under security and compliance
•  Integrates with Kerberos for Hadoop security
[Diagram: each user role is passed through to its own queue]
Business Analyst | Queue: Biz Analytics
Marketing Analyst | Queue: Marketing
Sys Admin | Queue: Prod

Build Analytics-Rich Big Data Apps
Powerful Developer Environment
•  Use a standards-based web framework and REST API
•  Customize dashboards and UIs with Simple XML, JavaScript or Django
•  Choose among SDKs
•  One integration for both Splunk Enterprise and Hunk

Full-Featured, Integrated Analytics Platform
FULL FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform
FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately
INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas
RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks

Questions/Comments?
Sagi Zelnick – Principal Architect
Email: zelnicks@yahoo-inc.com
Ledion Bitincka – Principal Architect
Email: lbitincka@splunk.com
