Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN

Published in: Technology

  1. Experiences in migration of large analytics platform from MPP database to Hadoop YARN
     Hadoop Summit 2014 – Srinivas Nimmagadda & Roopesh Varier
     Srinivas Nimmagadda, Technical Director, CPE
     Roopesh Varier, Director, CPE
  2. Agenda: (1) Introduction, (2) Big Data Needs, (3) MPP Platform and Challenges, (4) New Platform based on Hadoop/YARN, (5) Lessons learned during transition to Hadoop
  3. Overview
     • Symantec Cloud Platform Engineering (CPE)
       – Build consolidated cloud infrastructure and platform services for next-generation, data-powered Symantec applications
       – Open source components as building blocks
         • Hadoop and OpenStack
         • Bridge capability gaps and contribute back
     • A big data platform for batch and stream analytics, integrated with OpenStack
       – Security, multi-tenancy, and reliability
     • Using large-scale data analytics for security and data management workloads
       – Analytics: reputation-based security, Managed Security Services, fraud detection, dial-home application logs
  4. Big Data Challenge
     • Hundreds of millions of users
     • Billions of files
       – File good or not?
     • Millions of URLs
       – URL safe or not?
     • Hundreds of thousands of applications
       – Stable or crashed?
     • Constant feed of information
       – Real time
       – Across the globe
       – From our applications and appliances
  5. Value from Volume
     • Volume of data
       – Multi-petabyte historical datasets
       – Multi-terabyte daily incremental datasets
       – Wide variety of input data formats
       – How do we manage?
     • Variety of workloads
       – ETL jobs
       – Batch applications
       – Interactive ad-hoc analysis
     • How to extract value from volume in near real time?
  6. Agenda: (1) Introduction, (2) Big Data Needs, (3) MPP Platform and Challenges, (4) New Platform based on Hadoop/YARN, (5) Lessons learned during transition to Hadoop
  7. MPP Platform (diagram). Components shown: Data Sources, ETL Cluster, Raw Data Store, Platform Services, MPP DB Engine, Applications (Batch, Interactive)
  8. Legacy MPP Analytics Solution
     • Custom platform services
       – Task/job management (DAG-based, fault-tolerant)
       – Functional and performance monitoring
       – Automatic data lifecycle management
       – Inter-cluster data transfers
       – Cluster tenancy management
     • ETL cluster
     • RDS (raw data store) on NAS
     • MPP (Massively Parallel Processing) DB engine at the core
  9. Key Challenges
     • Scalability
       – Supporting rapid data growth
       – No support for heterogeneous hardware
     • Operational costs
       – OpEx and software licenses
     • Supporting new use models
       – Not Only SQL patterns in analytics (columnar storage, search, streaming)
     • Cluster operational challenges
       – Limited resource management (limits/quotas, utilization throttling)
       – Load balancing across multi-mode and multi-tenant workloads
       – Integrated secure tenancy services
       – HA and DR
  10. Agenda: (1) Introduction, (2) Big Data Needs, (3) MPP Platform and Challenges, (4) New Platform based on Hadoop/YARN, (5) Lessons learned during transition to Hadoop
  11. MPP Platform (diagram). Components shown: Data Sources, Raw Data Store, ETL Cluster, Platform Services, MPP DB Engine, Applications (Batch, Interactive)
  12. MPP to Big Data Platform (diagram)
      • Numbered labels: 1: Commodity Hardware; 2: Hadoop Cluster; 3: HDFS; 4: YARN; 5: DAG: Oozie; 6: DistCP, Falcon; 7: YARN/HDFS
      • Other labels: Data Sources, Raw Data Store, ETL Cluster, Platform Services, MPP DB Engine, Applications (Batch, Interactive), ETL Job Management, State Transfer, Tenancy Guard
  13. Big Data Platform (diagram)
      • Numbered labels: 1: Cluster Infrastructure; 2: Hadoop 2.x Stack; 3: HDFS; 4: YARN; 5: Oozie
      • Other labels: Multi-tenant, Data Sources, Raw Data Store, Applications (Batch, Interactive), ETL jobs, Batch, Interactive, Ad-hoc workloads, Role-based provisioning, Unified Logging API
  14. Agenda: (1) Introduction, (2) Big Data Needs, (3) MPP Platform and Challenges, (4) New Platform based on Hadoop/YARN, (5) Lessons learned during transition to Hadoop
  15. Cluster Build Experiences
      • Node selection
        – Single node SKU; use commodity hardware components
        – Memory will be cheap; keep expansion options open
        – Spindle : core : LAN network ratios (1 : 2.5 : 1.5 Gb/s)
      • Balance mixed workloads using YARN
        – Large clusters are better for effective resource utilization
        – Balance between ETL, batch, and interactive jobs with YARN
      • Platform features and best practices
        – Central monitoring, log aggregation, and alerting metrics (ELK stack)
        – Role-based automated deployment of OS and Hadoop configuration
  16. Journey to Hadoop
      • Goals
        – Open source platform
        – Scalable distributed processing
      • Existing app base built around SQL
      • Many technology choices in the Hadoop ecosystem
        – Technology choices: distributed query engines vs. fast MR
        – Evaluation with multi-PB data sets using 15 of our representative workloads
          • e.g., complex joins (data shuffle), queries with a variety of data
        – Criteria: scale, functionality, stability, performance, integration with the rest of the open source ecosystem
        – Hive was the only technology able to scale and provide easy migration from our SQL workloads
        – With Tez we had an acceptable performance trade-off
  17. RDS and ETL Process
      • Platform features for ETL
        – File ingestion and job management APIs
        – Secure tenancy, replication
      • Conversion of a 5 GB log file (.gz to .bz2)
        1. Single node outside Hadoop: ~28 mins
        2. In Hadoop, single mapper, parallel read and write approach: ~5 mins
      • A parallel RDS and ETL using YARN
        – Source file ingested from remote location
        – Converted to bz2 and stored in HDFS Raw Data Store (passive data)
        – Data is transformed and loaded into Hive (active data in ORC format)
        – Mix "active" and "passive" datasets in HDFS
      • Takeaway: use YARN for managing ETL
      • Diagram labels: API, NN, DN (×6); 1: local .gz->bz2; 2: MR-based .gz->bz2
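The single-node baseline on slide 17 (approach 1, .gz to .bz2 outside Hadoop) can be sketched roughly as below. This is only an illustration using the Python standard library, not the authors' tooling: the function name `gz_to_bz2` and the buffer size `CHUNK_BYTES` are assumptions. It does not reproduce the MR-based approach 2, which the slides describe only as a single mapper with parallel read and write.

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Assumed buffer size for the streaming copy; not taken from the slides.
CHUNK_BYTES = 1024 * 1024


def gz_to_bz2(src: Path, dst: Path) -> None:
    """Stream-decompress a .gz file and recompress it as .bz2.

    The data is copied in fixed-size chunks, so memory use stays flat
    even for multi-gigabyte log files.
    """
    with gzip.open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, CHUNK_BYTES)
```

A call such as `gz_to_bz2(Path("app.log.gz"), Path("app.log.bz2"))` (hypothetical file names) performs the conversion sequentially: read and write are interleaved on one thread, which is the kind of bottleneck the parallel read-and-write variant on the slide is meant to avoid.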
  18. Large Cluster YARN Performance Modeling
      • Multi-mode
        – ETL jobs: guaranteed throughput, window computing
        – Ad-hoc queries: low latency, fast execution
        – Batch analytics applications: throughput
      • Multi-level
        – Departments/projects, users
      • How do we model and use YARN for the above workloads?
  19. Performance Modeling (diagram)
      • Workload classes: ETL, Batch, Ad-hoc
      • Dimensions: map tasks, reduce tasks, HDFS storage
      • Step 1: Compile your workload model
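One way to picture "compile your workload model" is a small table with one row per workload class and the three dimensions named on slide 19. The sketch below is an assumption about how such a model might be recorded; the class `WorkloadProfile`, the helper `task_share`, and every number are placeholders, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """One workload class, described by the dimensions on the slide."""
    name: str
    map_tasks: int
    reduce_tasks: int
    hdfs_storage_tb: float


def task_share(profiles):
    """Return each workload's fraction of all map + reduce tasks."""
    total = sum(p.map_tasks + p.reduce_tasks for p in profiles)
    return {p.name: (p.map_tasks + p.reduce_tasks) / total for p in profiles}


# Placeholder figures for illustration only.
WORKLOADS = [
    WorkloadProfile("etl", map_tasks=4000, reduce_tasks=500, hdfs_storage_tb=120.0),
    WorkloadProfile("batch", map_tasks=3000, reduce_tasks=1500, hdfs_storage_tb=300.0),
    WorkloadProfile("adhoc", map_tasks=800, reduce_tasks=200, hdfs_storage_tb=40.0),
]

TASK_SHARE = task_share(WORKLOADS)
```

Shares like these give a first guess at how to split queue capacity between the workload classes before any jobs are run.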
  20. YARN Queue Model - 1 (diagram)
      • Hierarchy labels: Root Queue; ETL Queue, Ad-hoc Queue, Batch Queue; Projects Queues; Jobs
      • Metrics tracked: cluster utilization, average latency, throughput (jobs)
      • Step 2: Develop your YARN queue resource allocation hierarchy
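In the YARN CapacityScheduler, a queue's configured capacity is a percentage of its parent queue, and the children of any parent are expected to add up to 100. The sketch below turns such a hierarchy into absolute cluster shares. The tree shape (project queues under the batch queue), the queue names, and the percentages are assumptions for illustration; the slide does not give the actual allocations.

```python
def absolute_capacities(tree, parent_path="root", parent_share=100.0):
    """Flatten a queue tree into absolute cluster percentages.

    tree maps queue name -> (percent_of_parent, children), where children
    is another dict of the same shape (empty for leaf queues).
    """
    total = sum(pct for pct, _ in tree.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"children of {parent_path} sum to {total}, expected 100")
    result = {}
    for name, (pct, children) in tree.items():
        path = f"{parent_path}.{name}"
        share = parent_share * pct / 100.0
        result[path] = share
        if children:
            result.update(absolute_capacities(children, path, share))
    return result


# Hypothetical hierarchy: percentages are placeholders, not from the talk.
QUEUES = {
    "etl": (40.0, {}),
    "adhoc": (20.0, {}),
    "batch": (40.0, {"project_a": (50.0, {}), "project_b": (50.0, {})}),
}

CAPACITIES = absolute_capacities(QUEUES)
```

Checking the absolute share of each leaf queue this way makes it easier to see how little of the cluster a deeply nested project queue is actually guaranteed.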
  21. YARN Queue Model - 2 (diagram)
      • Hierarchy labels: Root Queue; ETL, Ad-hoc, Batch; Project Queues; Jobs
      • Metrics tracked: cluster utilization, average latency, throughput (jobs)
  22. YARN Queue Model - 3 (diagram)
      • Hierarchy labels: Root Queue; ETL Queue, Batch Queue, Ad-hoc; Project Queues (shown twice); Jobs
      • Metrics tracked: cluster utilization, average wait time, throughput (jobs)
      • Step 3: Run jobs, iterate through the models, and pick the optimal one
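Step 3 amounts to comparing the candidate queue models on the three metrics the slides track. The slides do not say how the metrics were weighed against each other, so the scoring rule below (each metric normalized against the best candidate, then summed with equal weights) is purely an assumption, as are the names `ModelResult` and `pick_best` and all of the numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Measured outcome of running the workload under one queue model."""
    name: str
    utilization: float      # fraction of cluster in use, 0..1 (must be > 0 for some model)
    avg_wait_s: float       # average job wait time in seconds (must be > 0)
    throughput_jobs: int    # completed jobs in the measurement window


def pick_best(results, weights=(1.0, 1.0, 1.0)):
    """Return the result with the highest weighted, normalized score.

    Utilization and throughput are "higher is better"; wait time is
    "lower is better", so it is scored as best_wait / wait.
    """
    w_util, w_wait, w_tp = weights
    max_util = max(r.utilization for r in results)
    min_wait = min(r.avg_wait_s for r in results)
    max_tp = max(r.throughput_jobs for r in results)

    def score(r):
        return (w_util * r.utilization / max_util
                + w_wait * min_wait / r.avg_wait_s
                + w_tp * r.throughput_jobs / max_tp)

    return max(results, key=score)


# Placeholder measurements; not results reported in the talk.
RESULTS = [
    ModelResult("candidate-a", utilization=0.60, avg_wait_s=300.0, throughput_jobs=800),
    ModelResult("candidate-b", utilization=0.85, avg_wait_s=120.0, throughput_jobs=1000),
    ModelResult("candidate-c", utilization=0.75, avg_wait_s=240.0, throughput_jobs=950),
]

BEST = pick_best(RESULTS)
```

Changing `weights` lets a latency-sensitive deployment favor wait time over raw utilization, which matches the deck's advice to experiment and let the data drive the decision.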
  23. Right Balance
      • The optimal solution is about the right balance
        – Cluster infrastructure
        – Use the right software stack from the Hadoop ecosystem
        – Data management
        – Application design and workload balancing with YARN
        – Good tools for monitoring and management
      • Approach
        – Start small and iterate faster
        – When in doubt, experiment and get data to make decisions
        – Keep customer use cases in perspective
  24. Summary
      • Incremental transition from MPP to Big Data
      • A journey towards open source distributed computing
      • Uniform computing!
        – Infrastructure building blocks
        – Single large YARN cluster for a variety of compute and storage loads
      • Open source: use and contribute
        – Work with the community to address gaps
      • Share your ideas
  25. Q & A
