Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Yahoo presentation at Hadoop Summit San Jose, CA in June 2014.

  1. 1. Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters. Presented by Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk. Hadoop Summit, June 2014, San Jose, CA
  2. 2. Overview (Yahoo Proprietary)
      • Hadoop @ Yahoo: 8+ years of innovation
      • Hunk @ Yahoo: organization-wide investment for the next 3+ years
      • Yahoo provides Hunk as a self-service to explore, analyze & visualize data in HDFS
          › Hunk allows visual browsing of very complex tables (250+ fields)
          › Rapid prototyping of new jobs, with almost instant search results and no need to wait for the entire job/query to finish
          › Shorter development cycles through faster interaction with results
          › Built-in graphs/charts make for a powerful solution in many situations
  3. 3. About your speakers: Sagi Zelnick (Principal Architect, Yahoo) and Ledion Bitincka (Principal Architect, Splunk)
  4. 4. Hunk + Hadoop @ Yahoo
  5. 5. History of Hadoop innovation @ Yahoo
  6. 6. Over 600PB of Hadoop storage (over half an Exabyte)
      • Very large clusters used by many groups across the enterprise.
      • More than 35,000 individual datanodes.
      • Hadoop is provided as a service.
      • Multiple cluster types such as research, dev, sandbox and production.
      • Services such as HBase, Hive, Oozie, etc.
      • Users are free to run jobs, but have resource constraints.
      • Maintained by the Grid Operations Group.
  7. 7. Improving operational visibility with Hunk
      • We pointed Hunk at many operational logs and event data we already had on the grid.
      • This includes system metrics, HDFS ops, JVM stats and YARN metrics.
      • Created instrumentation to measure usage per user and job.
      • Analyzed terabytes of NameNode audit logs.
      • Job history leveraged for visualizing usage/growth and historical views.
      • Custom events for HBase statistics.
  8. 8. Tracking Hadoop performance and metrics in Hunk
      Use Case | Customer | Benefits
      System metrics from 35k nodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
      Historical insights of resources | All Grid Customers | Track organic growth
      Job performance | All Grid Customers | Improved job SLAs
      HBase metrics | All Grid Customers | Track region/RS/table metrics…
      Job logs in near real-time | All Grid Customers / Ops | Search for errors directly from the YARN logs
      Namenode operational data | Research, Dev | Improved performance and stability
  9. 9. Measuring NameNode performance pre & post upgrades
      • Historical visualizations of all operations.
      • Search data in Hunk from billions of NameNode events.
      • Measure JVM and memory usage.
      • Insights into operational performance.
  10. 10. Visualization using Hunk
      Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
      Time range: Last 7 days. 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM).
      Chart: one series per num_* column below, Fri May 16 to Tue May 20, 2014; y-axis ticks at 200,000,000, 400,000,000 and 600,000,000.
      _time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
      2014-05-15 01:00 | 1124437.735902 | 46721126.819672 | 514957.384098 | 12930433.077869 | 0.000000 | 94210832.786885 | 63512425.967213 | 13975.306557
      2014-05-15 02:00 | 1115496.290492 | 53597000.262295 | 298717.637049 | 10402176.717213 | 0.000000 | 94109944.655738 | 93916552.393443 | 35459.288689
      2014-05-15 03:00 | 1110372.4173… | 56566721.704918 | 428494.9449… | 13296385.590164 | 0.000000 | 94141430.295082 | 97353478.229508 | 20307.549344
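The `timechart span=1h avg(number*) as num_*` step in the search above groups events into hourly buckets and averages every matching field per bucket. The Python below is a minimal sketch of that bucketing logic, not Hunk's implementation. The input field names (`numberHeartBeats` and so on) and the event layout are assumptions made for illustration; the slide shows only the output columns.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts, span):
    """Floor a naive timestamp to the start of its span-sized bucket."""
    return EPOCH + ((ts - EPOCH) // span) * span


def timechart_avg(events, span=timedelta(hours=1), prefix="number"):
    """Average every field starting with `prefix` per time bucket.

    `events` is an iterable of (timestamp, {field: value}) pairs.
    Output fields are renamed from `<prefix>X` to `num_X`, mirroring
    the `as num_*` rename in the search (input names are hypothetical).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for ts, fields in events:
        bucket = bucket_start(ts, span)
        for name, value in fields.items():
            if name.startswith(prefix):
                out_name = "num_" + name[len(prefix):]
                sums[bucket][out_name] += value
                counts[bucket][out_name] += 1
    return {
        bucket: {n: sums[bucket][n] / counts[bucket][n] for n in sums[bucket]}
        for bucket in sorted(sums)
    }


if __name__ == "__main__":
    sample = [
        (datetime(2014, 5, 15, 1, 5), {"numberHeartBeats": 10.0}),
        (datetime(2014, 5, 15, 1, 50), {"numberHeartBeats": 20.0}),
        (datetime(2014, 5, 15, 2, 10), {"numberHeartBeats": 40.0}),
    ]
    for bucket, row in timechart_avg(sample).items():
        print(bucket, row)
```

Passing `span=timedelta(minutes=5)` gives the finer 5-minute granularity used on the following slides.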
  11. 11. Sample troubleshooting in Hunk of 750 million events
      Search (start truncated in source): …n=5m avg(number*) as num_*
      Time range: Last 2 days. 2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM).
      Chart: one series per num_* column below, 12:00 PM Tue May 20 to 12:00 PM Wed May 21, 2014; y-axis ticks from 250,000,000 to 1,000,000,000.
      _time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
      2014-05-20 01:15:00 | 1056047.024000 | 34677652.000000 | 124121.264000 | 26242490.800000 | 0.000000 | 88112292.800000 | 126478486.400000 | 51405.346000
      2014-05-20 01:20:00 | 105551… | 30920700.… | 10653… | 22756041.8… | 0.000000 | 87745422.40… | 92323387.2… | 32070.48…
  12. 12. Big picture plus granular details
      Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
      Time range: Last 2 days. 8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM).
      Chart: one series per threads_* column below, 12:00 AM Tue May 20 to 12:00 PM Wed May 21, 2014; y-axis ticks at 200 and 400.
      _time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
      2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
      2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
      2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
      2014-05-20 00:15:00 | 70.300667 | 10.268000 | 5.156667 | 0.000000 | 17.488667 | 70.122000
      2014-05-20 00:20:00 | 70.422667 | 10.376000 | 5.188000 | 0.000000 | 15.700000 | 66.611333
      2014-05-20 00:25:00 | 70.444000 | 10.288000 | 5.144000 | 0.000000 | 14.089333 | 63.400667
  13. 13. Analyzing NameNode RPC calls (troubleshooting)
      • Who is making which RPC calls (open, listStatus, create, etc.)?
      • How often are they making these RPC calls?
      • Which IP/host are the calls coming from?
      • Search and visualize historical data from billions of events.
      • Prevent NameNode abuse/misuse.
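The three questions above (who, how often, from where) amount to counting audit events per user, command and source address. The sketch below shows that counting in Python. The slides do not show the raw log layout, so the `ugi=`, `ip=/` and `cmd=` key/value format is an assumption modeled on the usual HDFS audit log line, and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumed layout: "... ugi=<user> (auth:...)<TAB>ip=/<addr><TAB>cmd=<op> ..."
AUDIT_RE = re.compile(
    r"ugi=(?P<user>\S+).*?ip=/(?P<ip>\S+).*?cmd=(?P<cmd>\S+)"
)


def count_rpc_calls(lines):
    """Count audit events per (user, command, source IP)."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match:  # skip lines that are not audit events
            counts[(match["user"], match["cmd"], match["ip"])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the presentation.
    sample = [
        "2014-05-20 00:00:01,123 INFO FSNamesystem.audit: allowed=true\t"
        "ugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
        "2014-05-20 00:00:02,456 INFO FSNamesystem.audit: allowed=true\t"
        "ugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
        "2014-05-20 00:00:03,789 INFO FSNamesystem.audit: allowed=true\t"
        "ugi=bob (auth:SIMPLE)\tip=/10.0.0.2\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
    ]
    for key, n in count_rpc_calls(sample).most_common():
        print(key, n)
```

Sorting the counter with `most_common()` surfaces the heaviest callers first, which is the view needed to spot NameNode abuse.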
  14. 14. Visualizing 834 million discrete events …
  15. 15. … continued
  16. 16. Queue insights (capacity & provisioning)
      • Each Hadoop job runs in a specific queue.
      • We track every aspect of the YARN framework.
      • Immediate queue performance and configuration profiling via job history server.
      • Historical views and trends that enable better capacity management.
      • Improved queue utilization and allocation management.
  17. 17. Visualizing queues
      Search: index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
      Time range: Last 7 days. 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM).
      Chart: one series per queue, Wed May 21 to Mon May 26, 2014; y-axis ticks at 200,000, 400,000 and 600,000.
      _time | OTHER | apg_dailyhigh_p3 | apg_dailymedium_p5 | apg_hourlyhigh_p1 | apg_hourlylow_p4 | apg_hourlymedium_p2 | apg_p7 | curveball_large | curveball_med | slingshot | slingstone
      2014-05-20 18:00 | 4154 | 45512 | 7071 | 25643 | 12111 | 29664 | 3473 | 26547 | 14192 | 60875 | 45376
      2014-05-21 00:00 | 19341 | 92661 | 18005 | 41008 | 22944 | 88115 | 10896 | 38648 | 8693 | 48186 | 87670
      2014-05-21 06:00 | 211… | 108137 | 38398 | 35627 | 14934 | 101925 | 244… | 29269 | 14066 | 2434… | 4783…
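The search above converts slot-seconds into GB-hours with `(mapSlotSeconds + reduceSlotSeconds) * 0.5 / 3600` and sums the result per queue. A minimal Python sketch of that arithmetic follows. Reading the 0.5 factor as gigabytes per slot is an assumption (the slide does not say), the 6-hour time bucketing is left out for brevity, and the sample jobs are invented.

```python
from collections import defaultdict

# Factor taken from the slide's query; treated here as GB per slot (assumption).
GB_PER_SLOT = 0.5
SECONDS_PER_HOUR = 3600


def gb_hours(map_slot_seconds, reduce_slot_seconds):
    """GB-hours for one job: total slot-seconds * 0.5 / 3600."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    return total_slot_seconds * GB_PER_SLOT / SECONDS_PER_HOUR


def gb_hours_by_queue(jobs):
    """Sum rounded per-job GB-hours for each queue."""
    totals = defaultdict(int)
    for job in jobs:
        per_job = gb_hours(job["mapSlotSeconds"], job["reduceSlotSeconds"])
        # Python's round() uses banker's rounding at exact .5 ties,
        # which may differ from the search language's round().
        totals[job["queue"]] += round(per_job)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical job summaries for illustration only.
    jobs = [
        {"queue": "slingshot", "mapSlotSeconds": 7200, "reduceSlotSeconds": 7200},
        {"queue": "slingshot", "mapSlotSeconds": 36000, "reduceSlotSeconds": 0},
        {"queue": "apg_p7", "mapSlotSeconds": 72000, "reduceSlotSeconds": 72000},
    ]
    print(gb_hours_by_queue(jobs))
```

For example, a job with 7,200 map and 7,200 reduce slot-seconds accounts for 14,400 × 0.5 / 3600 = 2 GB-hours.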
  18. 18. Self-service job reports
      • Each job is unique and so are the map and reduce elements.
      • How to start analyzing jobs?
      • Historical job performance and profiling enables in-depth performance tuning.
      • Long-term historical views and growth trends.
  19. 19. Search: index="jobsummary_logs_all_blue" cluster="*" user="gmon" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours,2) | eval runtime=(finishTime-submitTime)/1000 | stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status | eval run_mins=round(run_mins/60,2) | sort -gb-hours
      Time range: Yesterday. 4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM). Statistics (4,871).
      cluster | user | queue | jobName | jobId | status | gb-hours | run_mins
      cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00 | 33.07
      cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_312700 | SUCCEEDED | 104.00 | 37.37
      cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_309715 | SUCCEEDED | 88.00 | 29.83
      cobalt | gmon | gridops | distcp: | job_1398982765383_309921 | SUCCEEDED | 36.00 | 68.49
      cobalt | gmon | gridops | SPLK_spbl103n01.blue.ygrid.yahoo.com_1401125953.2076_0 | job_1398982765383_313570 | SUCCEEDED | 25.00 | 14.26
      cobalt | gmon | gridops | nnaudit_DR_2014_05_25 | job_1398982765383_308938 | SUCCEEDED | 25.00 | 15.43
      cob… | g… | grid… | nnaudit_DB_2014_05_25 | job_1398982765… | SUCCE… | 24.00 | 18.07
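The job report above groups job-summary events by job, sums GB-hours, averages the runtime, converts it to minutes and sorts by cost. The Python below sketches that pipeline under stated assumptions: `finishTime` and `submitTime` are taken to be epoch milliseconds (suggested by the `/1000` in the search), and the sample events are invented rather than taken from the grid.

```python
from collections import defaultdict

GROUP_FIELDS = ("cluster", "user", "queue", "jobName", "jobId", "status")


def job_report(events):
    """Per-job GB-hours and average runtime in minutes, costliest first."""
    gb_hours = defaultdict(float)
    runtimes = defaultdict(list)
    for event in events:
        key = tuple(event[field] for field in GROUP_FIELDS)
        slot_seconds = event["mapSlotSeconds"] + event["reduceSlotSeconds"]
        gb_hours[key] += round(slot_seconds * 0.5 / 3600, 2)
        # Assumed millisecond timestamps, converted to seconds.
        runtimes[key].append((event["finishTime"] - event["submitTime"]) / 1000)
    rows = []
    for key in gb_hours:
        avg_runtime_s = sum(runtimes[key]) / len(runtimes[key])
        row = dict(zip(GROUP_FIELDS, key))
        row["gb_hours"] = gb_hours[key]
        row["run_mins"] = round(avg_runtime_s / 60, 2)
        rows.append(row)
    return sorted(rows, key=lambda r: r["gb_hours"], reverse=True)


if __name__ == "__main__":
    # Hypothetical events; values chosen to resemble the slide's table.
    base = {"cluster": "cobalt", "user": "gmon", "status": "SUCCEEDED"}
    events = [
        dict(base, queue="gridops", jobName="distcp:", jobId="job_2",
             mapSlotSeconds=72000, reduceSlotSeconds=108000,
             submitTime=0, finishTime=855600),
        dict(base, queue="grideng", jobName="audit.pig", jobId="job_1",
             mapSlotSeconds=360000, reduceSlotSeconds=417600,
             submitTime=0, finishTime=1984200),
    ]
    for row in job_report(events):
        print(row)
```

Sorting by GB-hours in descending order puts the most resource-hungry jobs at the top, which is where tuning effort usually pays off first.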
  23. 23. More data to tap into with the metastore / Hive sources ... it’s not just logs we’re looking at
      • Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
      • Visualize very complex tables (250+ fields)
      • Rapid prototyping of new jobs, with almost instant search results and no need to wait for the entire job/query to finish
      • Built-in aggregates and graphs/charts
      • Accelerates development workflow by providing faster interaction with data
  25. 25. Meet Hunk
  26. 26. Integrated Analytics Platform for Diverse Data Stores
      • Full featured, Integrated Product
      • Fast Insights for Everyone
      • Works with What You Have Today
      • Explore, Analyze, Visualize, Dashboards, Share
      • Hadoop Clusters (Hadoop Client Libraries)
      • NoSQL and Other Data Stores (Streaming Resource Libraries)
  27. 27. Fast Deployment and Configuration: just point at Hadoop
      • Certified integrations to all major Hadoop distributions
      • Choose 1st-gen MapReduce or YARN
      • Create Virtual Indexes across one or more clusters
      • From download to searching data in < 60 minutes
      • Connect to one or multiple Hadoop clusters (YARN certified)
  28. 28. Interactive Search and Results Preview: rapidly interact with data
      • Powerful Search Processing Language (SPL™)
      • Ad hoc exploratory analytics across massive datasets
      • Preview results
      • No fixed schema
      • No requirement to “understand” data upfront
      • Screenshot callouts: search interface, preview results, drill down to raw data, pause or stop MapReduce jobs
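Results preview means a partial answer is available while the underlying MapReduce job is still running. The sketch below illustrates the general idea with a generator that emits a running aggregate after each processed input split. It is an illustration of incremental aggregation only, not a description of how Hunk implements preview; the function name and the data layout are hypothetical.

```python
from collections import Counter


def preview_counts(splits, field):
    """Yield a running count of `field` values after each input split.

    Each yielded dict is a preview: correct for the data seen so far,
    and equal to the final answer once the last split is consumed.
    """
    running = Counter()
    for split in splits:
        for event in split:
            running[event[field]] += 1
        yield dict(running)


if __name__ == "__main__":
    # Hypothetical splits standing in for HDFS blocks of events.
    splits = [
        [{"status": "SUCCEEDED"}, {"status": "FAILED"}],
        [{"status": "SUCCEEDED"}],
    ]
    for i, preview in enumerate(preview_counts(splits, "status"), start=1):
        print(f"after split {i}: {preview}")
```

A caller can stop iterating as soon as the preview answers the question, which corresponds to pausing or stopping the job early.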
  29. 29. Powerful Dashboards for Self-Service Analytics: interactive dashboards and charts
      • Easy-to-use dashboard editor
      • Chart overlay
      • Pan and zoom
      • In-dashboard drill down
      • Embed charts and dashboards in 3rd party apps
      • Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1
  30. 30. Automate Access for Rapid Exploration: supported file formats
      • Text files
      • Sequence files
      • RCFile
      • ORC files
      • Parquet
  31. 31. Role-based Security for Shared Clusters: pass-through authentication
      • Provide role-based security for Hadoop clusters
      • Access Hadoop resources under security and compliance
      • Integrates with Kerberos for Hadoop security
      • Diagram: Business Analyst, Marketing Analyst and Sys Admin users map to queues (Business Analyst, Queue: Biz Analytics; Marketing Analyst, Queue: Marketing; Sys Admin2, Queue: Prod)
  32. 32. Build Analytics-Rich Big Data Apps: powerful developer environment
      • Use a standards-based web framework and REST API
      • Customize dashboards and UIs with Simple XML, JavaScript or Django
      • Choose among SDKs
      • One integration for both Splunk Enterprise and Hunk
  33. 33. Full-Featured, Integrated Analytics Platform
      • FULL FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform
      • FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately
      • INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas
      • RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks
  34. 34. Questions/Comments?
      • Sagi Zelnick, Principal Architect. Email: zelnicks@yahoo-inc.com
      • Ledion Bitincka, Principal Architect. Email: lbitincka@splunk.com
