Enabling Exploratory Analytics of Data in
Shared-service Hadoop Clusters

  1. Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters. Presented by Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk. Hadoop Summit, June 2014, San Jose, CA.
  2. About your speakers (Yahoo Proprietary)
     • Sagi Zelnick, Principal Architect, Yahoo
     • Ledion Bitincka, Principal Architect, Splunk
  3. Background
     • Hadoop @ Yahoo: 8+ years of innovation
     • Hunk @ Yahoo: organization-wide investment for the next 3+ years
     • Yahoo provides Hunk as a self-service to explore, analyze & visualize data in HDFS
     • Hunk allows visual browsing of very complex tables (250+ fields)
     • Rapid prototyping for new jobs with almost instant search results, without waiting for the entire job/query to finish
     • Shortens development cycles through faster interaction with results
  4. History of Hadoop innovation @ Yahoo
  5. Over 600PB of Hadoop storage (over half an exabyte)
     • Very large clusters used by many groups across the enterprise
     • More than 40,000 individual datanodes
     • Hadoop is provided as a service
     • Multiple cluster types such as research, dev, sandbox and production
     • Services such as HBase, Hive, Oozie, etc.
     • Users are free to run jobs, but have resource constraints
     • Maintained by Grid Operations Group
  6. Improving visibility & providing operational insights with Hunk
     • We pointed Hunk at many operational logs and event data we already have on the grid
     • This includes system metrics, HDFS ops, JVM stats and YARN metrics
     • Created instrumentation to measure usage per user and job
     • Analyzed terabytes of NameNode audit logs
     • Job history leveraged for visualizing usage/growth and historical views
     • Custom events for HBase statistics
  7. Tracking Hadoop performance and metrics in Hunk
     Use Case | Customer | Benefits
     Namenode metrics, block ops, memory usage | Research, Dev | Improved performance and stability
     System/Hadoop metrics of ~40,000 individual datanodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
     Historical insights into resource consumption | All Grid Customers | Track organic growth
     Generate reports on job performance | All Grid Customers | Improved job SLAs
     HBase metrics | All Grid Customers | Track region/RS/table metrics…
     Track job logs in near real-time | All Grid Customers / Ops | Detect and search for errors directly from the YARN job logs for troubleshooting
  8. Tracking Hadoop performance and metrics, continued
     Use Case | Customer | Benefits
     Find dataset instances/files that have never been accessed after creation | Data Storage Efficiency Team, SE | Savings via reduction of storage costs
     How is each user/team using compute and disk capacity on a cluster? | Management / Grid Customers | Metering / Chargeback
     Replace ad hoc and legacy solutions for analyzing cluster usage | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Improved grid utilization and cost reduction
     Generate reports on cluster performance, utilization of available capacity, etc. | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Data mining for product improvements and best practices
     Determine KPIs of Hadoop stack components (Pig, Oozie, etc.) | SE / Grid Solutions / Hadoop Stack Development Team | Feedback for product improvements
     Find efficacy of various heuristics in Hadoop (data locality of tasks, replication of blocks, etc.) | Hadoop Stack Development Team | Fine-tune heuristics for better efficiency
  9. Sample search in Hunk
  10. Measuring NameNode performance pre & post upgrades
      • Historical visualizations of all operations
      • Search data in Hunk from billions of NameNode events
      • Measure JVM and memory usage
      • Insights into operational performance
  11. Sample visualization in Hunk
      Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
      Time range: Last 7 days. 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
      [Chart: the num_* series plotted over _time, Fri May 16 to Tue May 20, 2014; y-axis ticks at 200,000,000, 400,000,000 and 600,000,000]
      First result row (2014-05-15 01:00): num_BlockReports 1124437.735902, num_CopyBlockOperations 46721126.819672, num_HeartBeats 514957.384098, num_ReadBlockOperations 12930433.077869, num_ReadMetadataOperations 0.000000, num_ReplaceBlockOperations 94210832.786885, num_WriteBlockOperations 63512425.967213, num_blockChecksumOp 13975.306557
  12. Sample troubleshooting in Hunk of 750 million events
      2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)
      [Chart: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations and num_blockChecksumOp plotted over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks from 250,000,000 to 1,000,000,000]
  13. Big picture plus granular details
      Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
      Time range: Last 2 days. 8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)
      [Chart: the threads_* series plotted over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks at 200 and 400]
      _time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
      2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
      2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
      2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
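For readers unfamiliar with SPL, the following is a minimal Python sketch of what a search of the form `timechart span=5m avg(Threads*) as threads_*` computes: events are grouped into fixed time buckets, every field whose name starts with a prefix is averaged per bucket, and the result column is renamed with the wildcard part kept. The event values and the `MemHeapUsedM` field are made up for illustration, and `timechart_avg` is a hypothetical helper, not part of Hunk or Splunk.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SPAN_SECONDS = 5 * 60  # corresponds to span=5m

# Hypothetical JVM metric events; names and values are illustrative only.
EVENTS = [
    {"_time": "2014-05-20 00:00:10", "ThreadsBlocked": 70, "ThreadsWaiting": 80, "MemHeapUsedM": 512},
    {"_time": "2014-05-20 00:03:40", "ThreadsBlocked": 74, "ThreadsWaiting": 76, "MemHeapUsedM": 530},
    {"_time": "2014-05-20 00:06:00", "ThreadsBlocked": 71, "ThreadsWaiting": 77, "MemHeapUsedM": 540},
]


def timechart_avg(events, prefix="Threads", out_prefix="threads_", span=SPAN_SECONDS):
    """Average every field starting with `prefix` per time bucket of `span` seconds."""
    # bucket start -> output column -> [running sum, count]
    totals = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for event in events:
        ts = datetime.strptime(event["_time"], "%Y-%m-%d %H:%M:%S")
        midnight = ts.replace(hour=0, minute=0, second=0)
        offset = int((ts - midnight).total_seconds())
        # Floor the timestamp to the start of its bucket (buckets aligned to midnight here).
        bucket = midnight + timedelta(seconds=offset - offset % span)
        for name, value in event.items():
            if name.startswith(prefix):
                cell = totals[bucket][out_prefix + name[len(prefix):]]
                cell[0] += value
                cell[1] += 1
    return {
        bucket.strftime("%Y-%m-%d %H:%M:%S"): {col: s / n for col, (s, n) in cols.items()}
        for bucket, cols in sorted(totals.items())
    }


if __name__ == "__main__":
    for bucket, row in timechart_avg(EVENTS).items():
        print(bucket, row)
```

Fields that do not match the prefix (here `MemHeapUsedM`) are simply left out of the result, which mirrors how the wildcard in the search selects only the thread counters.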
  14. Analyzing NameNode RPC calls
      • Who is making which RPC call (open, listStatus, create, etc.)
      • How often they make these RPC calls
      • Which IP/host the calls come from
      • Search and visualize historical data from billions of events
      • Prevent NameNode abuse/misuse
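The questions above (who, which call, how often, from where) amount to a grouped count over NameNode audit events. Below is a minimal Python sketch of that aggregation, assuming audit lines carry tab-separated key=value fields such as `ugi`, `ip` and `cmd`; the sample lines are invented, and `rpc_counts` is a hypothetical helper rather than anything shown in the talk.

```python
import re
from collections import Counter

# Hypothetical audit-log lines in a key=value layout; users, IPs and paths are made up.
AUDIT_LINES = [
    "2014-05-20 00:00:01,000 INFO FSNamesystem.audit: allowed=true\tugi=alice\tip=/10.0.0.1\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 00:00:02,000 INFO FSNamesystem.audit: allowed=true\tugi=alice\tip=/10.0.0.1\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 00:00:03,000 INFO FSNamesystem.audit: allowed=true\tugi=bob\tip=/10.0.0.2\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
    "2014-05-20 00:00:04,000 INFO FSNamesystem.audit: allowed=true\tugi=alice\tip=/10.0.0.1\tcmd=create\tsrc=/data/c\tdst=null\tperm=null",
]

FIELD = re.compile(r"(\w+)=(\S+)")


def rpc_counts(lines):
    """Count audit events per (user, RPC command, client IP)."""
    counts = Counter()
    for line in lines:
        fields = dict(FIELD.findall(line))
        if "cmd" not in fields:
            continue  # not an audit event in the assumed layout
        ip = fields.get("ip", "").lstrip("/")
        counts[(fields.get("ugi"), fields["cmd"], ip)] += 1
    return counts


if __name__ == "__main__":
    for (user, cmd, ip), n in rpc_counts(AUDIT_LINES).most_common():
        print(f"{user:8} {cmd:12} {ip:12} {n}")
```

Sorting the counter by frequency puts the heaviest callers first, which is the kind of view that helps spot a client hammering the NameNode.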
  15. Visualizing 834 million discrete events …
  16. … continued
  17. Queue insights
      • Each Hadoop job runs in a specific queue
      • We track every aspect of the YARN framework
      • Immediate queue performance and configuration profiling via job history server
      • Historical views and trends that enable better capacity management
      • Improved queue utilization and allocation management
  18. Visualizing queues
      Search: index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
      Time range: Last 7 days. 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
      [Chart: gb_hours by queue over _time, Wed May 21 to Mon May 26, 2014; y-axis ticks at 200,000, 400,000 and 600,000; the queue legend labels are truncated in the source]
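The arithmetic in the search above can be followed step by step in a short Python sketch: slot seconds are summed per job, scaled by 0.5 and divided by 3600 to get GB-hours, rounded, and then summed per queue in 6-hour buckets. Reading the 0.5 factor as gigabytes per slot is an assumption (the slide does not say), and the queue names and job values are invented.

```python
from collections import defaultdict
from datetime import datetime

GB_PER_SLOT = 0.5   # the 0.5 factor from the search; its meaning is an assumption
SPAN_HOURS = 6      # corresponds to span=6h

# Hypothetical job-summary events; queue names and numbers are illustrative only.
JOBS = [
    {"_time": "2014-05-21 01:00:00", "queue": "queue_a", "mapSlotSeconds": 7200, "reduceSlotSeconds": 7200},
    {"_time": "2014-05-21 04:00:00", "queue": "queue_a", "mapSlotSeconds": 36000, "reduceSlotSeconds": 0},
    {"_time": "2014-05-21 07:30:00", "queue": "queue_b", "mapSlotSeconds": 10800, "reduceSlotSeconds": 3600},
]


def gb_hours(job):
    """GB-hours for one job, mirroring the three eval steps of the search."""
    total_slot_seconds = job["mapSlotSeconds"] + job["reduceSlotSeconds"]
    # Python rounds exact halves to the nearest even integer, which may differ from SPL's round().
    return round(total_slot_seconds * GB_PER_SLOT / 3600)


def gb_hours_by_queue(jobs):
    """Sum GB-hours per queue in 6-hour buckets (aligned to midnight in this sketch)."""
    table = defaultdict(lambda: defaultdict(int))
    for job in jobs:
        ts = datetime.strptime(job["_time"], "%Y-%m-%d %H:%M:%S")
        bucket = ts.replace(hour=ts.hour - ts.hour % SPAN_HOURS, minute=0, second=0)
        table[bucket.strftime("%Y-%m-%d %H:%M")][job["queue"]] += gb_hours(job)
    return {bucket: dict(queues) for bucket, queues in sorted(table.items())}


if __name__ == "__main__":
    for bucket, queues in gb_hours_by_queue(JOBS).items():
        print(bucket, queues)
```

Rounding happens per job before the sum, as in the search, so the bucket totals are sums of whole GB-hours rather than a rounded grand total.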
  19. Creating job reports per user
      • Each job is unique and so are the map and reduce elements
      • How to start analyzing jobs?
      • Historical job performance and profiling enables in-depth performance tuning
      • Long-term historical views and trending of growth
  20. More data to tap into with the metastore/Hive sources
      • We will provide Hunk as a self-service to explore & visualize data in HDFS
      • Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up front
      • Allows for visually browsing very complex tables (250+ fields)
      • Rapid prototyping for new jobs with almost instant search results, without waiting for the entire job/query to finish
      • Shortens development cycles through faster interaction with results
      • Built-in graphs/charts make for a powerful solution in many situations
  21. Hunk + Hadoop Demo
  22. (No slide text.)
  23. (No slide text.)
  24. (No slide text.)
  25. (No slide text.)
  26. (No slide text.)
  27. Meet Hunk 6.1 (© 2014 Splunk Inc.)
  28. Integrated Analytics Platform
      • Full-featured, Integrated Product
      • Insights for Everyone
      • Works with What You Have Today
      [Diagram labels: Explore, Visualize, Dashboards, Share, Analyze; Hadoop Clusters; NoSQL and Other Data Stores; Hadoop Client Libraries; Streaming Resource Libraries for Diverse Data Stores]
  29. Fast Deployment and Configuration
      Just point at Hadoop
      • Certified integrations to all major Hadoop distributions
      • Choose 1st-gen MapReduce or YARN
      • Create Virtual Indexes across one or more clusters
      • From download to searching data in < 60 minutes
      [Figure captions: Connect to one or multiple Hadoop clusters; YARN certified]
  30. Interactive Search and Results Preview
      Rapidly interact with data
      • Powerful Search Processing Language (SPL™)
      • Ad hoc exploratory analytics across massive datasets
      • Preview results
      • No fixed schema
      • No requirement to “understand” data upfront
      [Figure captions: Search interface; Preview results; Drill down to raw data; Pause or stop MapReduce jobs]
  31. Powerful Dashboards for Self-Service Analytics
      Interactive Dashboards and Charts
      • Easy-to-use dashboard editor
      • Chart overlay
      • Pan and zoom
      • In-dashboard drilldown
      • Embed charts and dashboards in 3rd party apps
      • Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1
  32. Hive Data Support
      Supported File Formats
      • Text files
      • Sequence files
      • RCFile
      • ORC files
      • Parquet
  33. Role-based Security for Shared Clusters
      Pass-through Authentication
      • Provide role-based security for Hadoop clusters
      • Access Hadoop resources under security and compliance
      • Integrates with Kerberos for Hadoop security
      [Diagram labels: Business Analyst, Marketing Analyst, Sys Admin; Business Analyst, Queue: Biz Analytics; Marketing Analyst, Queue: Marketing; Sys Admin2, Queue: Prod]
  34. Powerful Developer Environment
      • Use a standards-based web framework and REST API
      • Customize dashboards and UIs with Simple XML, JavaScript or Django
      • Choose among SDKs
      • One integration for both Splunk Enterprise and Hunk
      Build Analytics-Rich Big Data Apps
  35. Hunk: One Integrated Platform
      • FULL-FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform
      • FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately
      • INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas
      • RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks
  36. Questions/Comments?
      • Sagi Zelnick, Principal Architect. Email: zelnicks@yahoo-inc.com
      • Ledion Bitincka, Principal Architect. Email: lbitincka@splunk.com
