Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN
Presentation Transcript

  • Hadoop Summit 2014 – Srinivas Nimmagadda & Roopesh Varier. Experiences in migration of a large analytics platform from an MPP database to Hadoop YARN. Srinivas Nimmagadda (Technical Director, CPE) and Roopesh Varier (Director, CPE)
  • Agenda
    1. Introduction
    2. Big Data Needs
    3. MPP Platform and Challenges
    4. New Platform based on Hadoop/YARN
    5. Lessons learned during transition to Hadoop
  • Overview
    • Symantec Cloud Platform Engineering (CPE)
      – Build consolidated cloud infrastructure and platform services for next-generation, data-powered Symantec applications
      – Open source components as building blocks
        • Hadoop and OpenStack
        • Bridge capability gaps and contribute back
    • A big data platform for batch and stream analytics, integrated with OpenStack
      – Security, multi-tenancy, and reliability
    • Using large-scale data analytics for security and data management workloads
      – Analytics: reputation-based security, Managed Security Services, fraud detection, dial-home application logs
  • Big Data Challenge
    • Hundreds of millions of users
    • Billions of files
      – File good or not?
    • Millions of URLs
      – URL safe or not?
    • Hundreds of thousands of applications
      – Stable or crashed?
    • Constant feed of information
      – Real time
      – Across the globe
      – From our applications and appliances
  • Value from Volume
    • Volume of data
      – Multi-petabyte historical datasets
      – Multi-terabyte daily incremental datasets
      – Wide variety of input data formats
      – How do we manage?
    • Variety of workloads
      – ETL jobs
      – Batch applications
      – Interactive ad-hoc analysis
    • How to extract value from volume in near real time?
  • MPP Platform (architecture diagram): Data Sources, Raw Data Store, ETL Cluster, Platform Services, MPP DB Engine, Applications (Batch, Interactive)
  • Legacy MPP Analytics Solution
    • Custom platform services
      – Task/job management (DAG-based, fault-tolerant)
      – Functional and performance monitoring
      – Automatic data lifecycle management
      – Inter-cluster data transfers
      – Cluster tenancy management
    • ETL cluster
    • RDS (raw data store) on NAS
    • MPP (Massively Parallel Processing) DB engine at the core
  • Key Challenges
    • Scalability
      – Supporting rapid data growth
      – No support for heterogeneous hardware
    • Operational costs
      – OpEx and software licenses
    • Supporting new use models
      – Not Only SQL patterns in analytics (columnar storage, search, streaming)
    • Cluster operational challenges
      – Limited resource management (limits/quotas, utilization throttling)
      – Load balancing across multi-mode and multi-tenant workloads
      – Integrated secure tenancy services
      – HA and DR
  • MPP Platform (architecture diagram): Data Sources, Raw Data Store, ETL Cluster, Platform Services, MPP DB Engine, Applications (Batch, Interactive)
  • MPP to Big Data Platform (migration diagram)
    – Numbered steps: 1: Commodity Hardware; 2: Hadoop Cluster; 3: HDFS; 4: YARN; 5: DAG: Oozie; 6: DistCp, Falcon; 7: YARN/HDFS
    – Other diagram labels: Data Sources; Raw Data Store; ETL Cluster; Platform Services; ETL Job Management; State Transfer; Tenancy Guard; MPP DB Engine; YARN; Applications (Batch, Interactive)
  • Big Data Platform (architecture diagram)
    – Numbered layers: 1: Cluster Infrastructure; 2: Hadoop 2.x Stack; 3: HDFS; 4: YARN; 5: Oozie
    – Other diagram labels: Multi-tenant; Data Sources; Raw Data Store; ETL Jobs; Batch; Interactive; Ad-hoc workloads; Applications (Batch, Interactive); Role-based provisioning; Unified Logging API
  • Cluster Build Experiences
    • Node selection
      – Single node SKU; use commodity hardware components
      – Memory will be cheap; keep expansion options open
      – Spindle : core : LAN network ratios (1 : 2.5 : 1.5 Gb/s)
    • Balance mixed workloads using YARN
      – Large clusters are better for effective resource utilization
      – Balance between ETL, batch, and interactive jobs with YARN
    • Platform features and best practices
      – Central monitoring, log aggregation, and alerting metrics (ELK stack)
      – Role-based automated deployment of OS and Hadoop configuration
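The slide gives the node-sizing ratio without saying how to read it. The sketch below assumes it means 1 disk spindle to 2.5 CPU cores to 1.5 Gb/s of LAN bandwidth, and scales that to a node with a given number of spindles. The function name `node_profile` and the 12-spindle example are illustrative assumptions, not figures from the talk.

```python
# Assumed reading of the slide's ratio: spindles : cores : LAN Gb/s.
RATIO = (1.0, 2.5, 1.5)


def node_profile(spindles, ratio=RATIO):
    """Scale the spindle:core:network ratio to a node with `spindles` disks."""
    per_spindle, cores, lan_gbps = ratio
    scale = spindles / per_spindle
    return {
        "spindles": spindles,
        "cores": cores * scale,
        "lan_gbps": lan_gbps * scale,
    }


# Hypothetical 12-disk node, used only to show the arithmetic.
EXAMPLE_NODE = node_profile(12)
```

Under this reading, a hypothetical 12-spindle node would want about 30 cores and 18 Gb/s of network bandwidth.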
  • Journey to Hadoop
    • Goals
      – Open source platform
      – Scalable distributed processing
    • Existing app base built around SQL
    • Many technology choices in the Hadoop ecosystem
      – Technology choices: distributed query engines vs. fast MR
      – Evaluation with multi-PB data sets using 15 of our representative workloads
        • e.g., complex joins (data shuffle), queries with a variety of data
      – Criteria: scale, functionality, stability, performance, integration with the rest of the open source ecosystem
      – Hive was the only technology able to scale and provide easy migration from our SQL workloads
      – With Tez we had an acceptable performance trade-off
  • RDS and ETL Process
    • Platform features for ETL
      – File ingestion and job management APIs
      – Secure tenancy, replication
    • Conversion of a 5 GB log file (.gz to .bz2)
      1. Single node outside Hadoop: ~28 mins
      2. In Hadoop, single mapper, parallel read and write approach: ~5 mins
    • A parallel RDS and ETL using YARN
      – Source file ingested from remote location
      – Converted to bz2 and stored in HDFS Raw Data Store (passive data)
      – Data is transformed and loaded into Hive (active data in ORC format)
      – Mix “active” and “passive” datasets in HDFS
    • Use YARN for managing ETL
    • (Diagram: API, NameNode (NN), DataNodes (DN); 1: local .gz->bz2, 2: MR-based .gz->bz2)
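The slide does not say how the "parallel read and write" conversion was implemented. As a minimal sketch of the general idea, the Python below recompresses a .gz file to .bz2 by running decompression in one thread and compression in another, connected by a bounded queue. The function name `gz_to_bz2`, the chunk size, and the queue depth are illustrative assumptions, and this runs on a local file system rather than inside a Hadoop mapper.

```python
import bz2
import gzip
import queue
import threading

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an illustrative value, not from the talk


def gz_to_bz2(src_path, dst_path, chunk_size=CHUNK_SIZE, max_pending=8):
    """Recompress a .gz file to .bz2, overlapping reading and writing."""
    chunks = queue.Queue(maxsize=max_pending)  # bounded queue caps memory use
    errors = []

    def reader():
        # Producer: decompress the gzip stream chunk by chunk.
        try:
            with gzip.open(src_path, "rb") as src:
                while True:
                    block = src.read(chunk_size)
                    if not block:
                        break
                    chunks.put(block)
        except Exception as exc:  # report read errors to the caller
            errors.append(exc)
        finally:
            chunks.put(None)  # sentinel: no more data

    worker = threading.Thread(target=reader, daemon=True)
    worker.start()

    # Consumer: compress each chunk to bz2 as soon as it arrives.
    with bz2.open(dst_path, "wb") as dst:
        while True:
            block = chunks.get()
            if block is None:
                break
            dst.write(block)

    worker.join()
    if errors:
        raise errors[0]
```

Streaming in chunks keeps memory use bounded regardless of file size, which matters for multi-gigabyte logs. The timings on the slide (~28 mins versus ~5 mins) come from the authors' environment; this sketch makes no claim to reproduce them.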
  • Large Cluster YARN Performance Modeling
    • Multi-mode
      – ETL jobs: guaranteed throughput, window computing
      – Ad-hoc queries: low latency, fast execution
      – Batch analytics applications: throughput
    • Multi-level
      – Departments/projects, users
    • How do we model and use YARN for the above workloads?
  • Performance Modeling. Step 1: Compile your workload model (diagram: ETL, Batch, and Ad-hoc workloads against Map Tasks, Reduce Tasks, and HDFS Storage)
  • YARN Queue Model - 1. Step 2: Develop your YARN queue resource allocation hierarchy (diagram: Root Queue; ETL Queue, Ad-hoc Queue, Batch Queue; Project Queues; Jobs. Metrics: cluster utilization, average latency, throughput in jobs)
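A queue hierarchy like the one in Step 2 can be written down as nested data and flattened into scheduler properties. The sketch below assumes the YARN Capacity Scheduler (the talk does not name the scheduler) and uses its `yarn.scheduler.capacity.<queue-path>.queues` and `.capacity` property names. The queue names, the percentages, and the placement of project queues under `batch` are hypothetical.

```python
PREFIX = "yarn.scheduler.capacity"


def capacity_properties(tree, parent="root"):
    """Flatten {queue: (capacity_percent, children)} into scheduler properties.

    Sibling capacities are checked to sum to 100 percent, which is what the
    Capacity Scheduler expects when capacities are given as percentages.
    """
    total = sum(capacity for capacity, _ in tree.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"children of {parent} sum to {total}, expected 100")

    props = {f"{PREFIX}.{parent}.queues": ",".join(tree)}
    for name, (capacity, children) in tree.items():
        path = f"{parent}.{name}"
        props[f"{PREFIX}.{path}.capacity"] = str(capacity)
        if children:
            props.update(capacity_properties(children, path))
    return props


# Hypothetical hierarchy and shares, for illustration only.
QUEUE_MODEL = {
    "etl": (40, {}),
    "adhoc": (20, {}),
    "batch": (40, {"project_a": (50, {}), "project_b": (50, {})}),
}

PROPERTIES = capacity_properties(QUEUE_MODEL)
```

Keeping the hierarchy as data makes it easy to generate several candidate models and compare them, which is the iteration the later slides describe.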
  • YARN Queue Model - 2 (diagram: Root Queue; ETL, Ad-hoc, Batch; Project Queues; Jobs. Metrics: cluster utilization, average latency, throughput in jobs)
  • YARN Queue Model - 3. Step 3: Run jobs, iterate through the models, and pick the optimal one (diagram: Root Queue; ETL Queue, Batch Queue, Ad-hoc; Project Queues; Jobs. Metrics: cluster utilization, average wait time, throughput in jobs)
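Step 3 amounts to measuring each candidate queue model and choosing among them. The talk does not state its selection rule, so the sketch below assumes one possible rule: discard models whose average wait time exceeds a limit, then prefer higher throughput and, on ties, higher utilization. All names and numbers are made-up placeholders, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    name: str
    utilization: float    # cluster utilization as a fraction, 0..1
    avg_wait_s: float     # average job wait time in seconds
    throughput_jobs: int  # jobs completed in the test window


def pick_model(runs, max_wait_s):
    """Pick the run with the best throughput among those meeting the wait limit."""
    eligible = [run for run in runs if run.avg_wait_s <= max_wait_s]
    if not eligible:
        raise ValueError("no queue model meets the wait-time limit")
    # Assumed rule: throughput first, utilization as the tie-breaker.
    return max(eligible, key=lambda run: (run.throughput_jobs, run.utilization))


# Placeholder values for illustration only.
PLACEHOLDER_RUNS = [
    ModelRun("model-1", utilization=0.70, avg_wait_s=90.0, throughput_jobs=100),
    ModelRun("model-2", utilization=0.85, avg_wait_s=45.0, throughput_jobs=120),
    ModelRun("model-3", utilization=0.80, avg_wait_s=30.0, throughput_jobs=120),
]

BEST = pick_model(PLACEHOLDER_RUNS, max_wait_s=60.0)
```

A different priority order (for example, latency first for ad-hoc heavy clusters) only requires changing the sort key.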
  • Right Balance
    • The optimal solution is about the right balance
      – Cluster infrastructure
      – Use the right software stack from the Hadoop ecosystem
      – Data management
      – Application design and workload balancing with YARN
      – Good tools for monitoring and management
    • Approach
      – Start small and iterate faster
      – When in doubt, experiment and get data to make decisions
      – Keep customer use cases in perspective
  • Summary
    – Incremental transition from MPP to Big Data
    – A journey towards open source distributed computing
    – Uniform computing!
      • Infrastructure building blocks
      • Single large YARN cluster for a variety of compute and storage loads
    – Open source: use and contribute
      • Work with the community to address gaps
    – Share your ideas
  • Q & A