Hadoop Summit 2014 – Srinivas Nimmagadda & Roopesh Varier 1
Experiences in migration of large analytics platform from MPP database to Hadoop YARN
Srinivas Nimmagadda, Technical Director, CPE
Roopesh Varier, Director, CPE
Agenda
1. Introduction
2. Big Data Needs
3. MPP Platform and Challenges
4. New Platform based on Hadoop/YARN
5. Lessons learned during transition to Hadoop
Overview
• Symantec Cloud Platform Engineering (CPE)
– Build consolidated cloud infrastructure and platform services for next-generation, data-powered Symantec applications
– Open source components as building blocks
• Hadoop and OpenStack
• Bridge capability gaps and contribute back
• A big data platform for batch and stream analytics, integrated with OpenStack
– Security, multi-tenancy, and reliability
• Using large-scale data analytics for security and data management workloads
– Analytics: reputation-based security, Managed Security Services, fraud detection, dial-home application logs
Big Data Challenge
• Hundreds of millions of users
• Billions of files
– File good or not?
• Millions of URLs
– URL safe or not?
• Hundreds of thousands of applications
– Stable or crashed?
• Constant feed of information
– Real time
– Across the globe
– From our applications and appliances
Value from Volume
• Volume of data
– Multi-petabyte historical datasets
– Multi-terabyte daily incremental datasets
– Wide variety of input data formats
– How do we manage?
• Variety of workloads
– ETL jobs
– Batch applications
– Interactive ad-hoc analysis
• How to extract value from volume in near real time?
MPP Platform
[Figure: legacy MPP platform architecture. Labels: Data Sources; ETL Cluster; Raw Data Store; Platform Services; MPP DB Engine; Applications (Batch, Interactive).]
Legacy MPP Analytics Solution
• Custom Platform Services
– Task/Job management (DAG based, Fault-tolerant)
– Functional and performance monitoring
– Automatic data lifecycle management
– Inter-cluster data transfers
– Cluster tenancy management
• ETL cluster
• RDS (raw data store) on NAS
• MPP (Massively Parallel Processing) DB engine at the core
Key Challenges
• Scalability
– Supporting rapid data growth
– No support for heterogeneous hardware.
• Operational costs
– OpEx and Software licenses
• Supporting new use models
– Not Only SQL patterns in analytics (columnar storage, search, streaming)
• Cluster operational challenges
– Limited resource management (limits/quotas, utilization throttling)
– Load balancing across multi-mode and multi-tenant workloads
– Integrated secure tenancy services
– HA and DR
MPP to Big Data Platform
[Figure: migration from the legacy MPP platform (Data Sources, ETL Cluster, Raw Data Store, Platform Services, MPP DB Engine, Batch and Interactive Applications) to Hadoop. Numbered labels:]
1: Commodity Hardware
2: Hadoop Cluster
3: HDFS
4: YARN
5: DAG: Oozie
6: DistCP, Falcon
7: YARN/HDFS
[Other labels: ETL; Job Management; State Transfer; Tenancy Guard; Batch; Interactive.]
Big Data Platform
[Figure: multi-tenant big data platform. Labels: Data Sources; Applications (Batch, Interactive); Raw Data Store; ETL Jobs; Batch; Interactive; Ad-hoc workloads; Role-based provisioning; Unified Logging; API. Numbered labels:]
1: Cluster Infrastructure
2: Hadoop 2.x Stack
3: HDFS
4: YARN
5: Oozie
Cluster Build Experiences
• Node selection
– Single Node SKU, use commodity hardware components
– Memory will be cheap, keep expansion options open
– Spindle : Core : LAN network ratio (1 : 2.5 : 1.5 Gb/s)
• Balance mixed workloads using YARN
– Large clusters are better for effective resource utilization
– Balance between ETL, Batch, Interactive jobs with YARN
• Platform features and best practices
– Central monitoring, log aggregation, and alerting metrics (ELK stack)
– Role-based automated deployment of OS and Hadoop configuration
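The spindle : core : network ratio in the node selection bullet can be turned into a quick sizing check. The sketch below assumes one possible reading of that ratio (per disk spindle, about 2.5 cores and 1.5 Gb/s of network bandwidth); the talk does not spell out the units, and the names `NodeSizing` and `size_node` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSizing:
    spindles: int
    cores: float
    network_gbps: float


def size_node(spindles: int,
              cores_per_spindle: float = 2.5,
              gbps_per_spindle: float = 1.5) -> NodeSizing:
    """Scale cores and network bandwidth from the spindle count.

    The defaults encode the 1 : 2.5 : 1.5 ratio under the ASSUMPTION
    that it is expressed per spindle; adjust them if your reading differs.
    """
    if spindles <= 0:
        raise ValueError("spindles must be positive")
    return NodeSizing(
        spindles=spindles,
        cores=spindles * cores_per_spindle,
        network_gbps=spindles * gbps_per_spindle,
    )


if __name__ == "__main__":
    # Hypothetical 12-disk node, used only to show the call.
    print(size_node(12))
```

Keeping the ratio as parameters rather than constants makes it easy to re-run the check when a different hardware SKU is considered.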
Journey to Hadoop
• Goals
– Open Source platform
– Scalable Distributed Processing
• Existing app base built around SQL
• Many technology choices in Hadoop ecosystem
– Technology choices: distributed query engines vs. fast MR
– Evaluation with multi-PB data sets using 15 of our representative workloads
• e.g., complex joins (data shuffle), queries over a variety of data
– Criteria: scale, functionality, stability, performance, integration with other open source ecosystem components
– Hive was the only technology able to scale and provide easy migration from our SQL workloads
– With Tez we had an acceptable performance trade-off
RDS and ETL Process
• Platform features for ETL
– File ingestion and Job management APIs
– Secure tenancy, Replication
• Conversion of a 5 GB log file (.gz to .bz2)
1. Single node outside Hadoop: ~28 mins
2. In Hadoop, single mapper with a parallel read and write approach: ~5 mins
• A parallel RDS and ETL using YARN
– Source file ingested from remote location
– Converted to bz2 and stored in HDFS Raw Data Store (Passive data)
– Data is transformed and loaded into Hive (Active data in ORC format)
– Mix “active” and “passive” datasets in HDFS
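The .gz to .bz2 conversion step can be illustrated with a small single-process sketch. It is not the MR-based implementation from the talk, only the streaming recompression that each conversion performs; the function name `gz_to_bz2` and the chunk size are assumptions. One common reason for such a conversion is that bzip2 files can be split across mappers by Hadoop while gzip files cannot, although the slides do not state the motivation.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def gz_to_bz2(src: Path, dst: Path, chunk_size: int = 1 << 20) -> None:
    """Stream-decompress a .gz file and recompress it as .bz2.

    Data is copied in chunks so the whole file never has to fit in memory.
    """
    with gzip.open(src, "rb") as reader, bz2.open(dst, "wb") as writer:
        shutil.copyfileobj(reader, writer, chunk_size)
```

A call such as `gz_to_bz2(Path("app.log.gz"), Path("app.log.bz2"))` (hypothetical file names) corresponds to the single-node path; the in-Hadoop variant described above runs the same kind of read and write inside a mapper.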
• Use YARN for managing ETL
[Figure: an API in front of a Hadoop cluster with one NameNode (NN) and six DataNodes (DN), showing two conversion paths: 1. Local .gz->bz2; 2. MR based .gz->bz2.]
Large Cluster YARN Performance Modeling
• Multi-mode:
– ETL jobs: Guaranteed throughput – window computing
– Ad-hoc queries – Low latency, fast execution
– Batch analytics applications – Throughput
• Multi-level
– Departments/Projects, Users
• How do we model and use YARN for above workloads?
Performance Modeling
[Figure: workload classes (ETL, Batch, Ad-hoc) described in terms of Map Tasks, Reduce Tasks, and HDFS Storage.]
Step 1: Compile your workload model
YARN Queue Model - 1
[Figure: queue hierarchy. Labels: Root Queue; ETL Queue; Ad-hoc Queue; Batch Queue; Projects Queues; Jobs. Metrics shown: Cluster Utilization, Avg Latency, Throughput (jobs); values not recoverable.]
Step 2: Develop your YARN queue resource allocation hierarchy
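A queue resource allocation hierarchy like the one in Step 2 can be sketched as nested data plus a few checks. The example below is illustrative only: the queue names and percentages are hypothetical, and it assumes the YARN Capacity Scheduler (the slides do not say which scheduler was used), whose `yarn.scheduler.capacity.<queue-path>.queues` and `.capacity` properties it renders.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical hierarchy: names and percentages are NOT from the talk.
QUEUES = {
    "etl": {"capacity": 40, "children": {}},
    "batch": {"capacity": 35, "children": {
        "project_a": {"capacity": 60, "children": {}},
        "project_b": {"capacity": 40, "children": {}},
    }},
    "adhoc": {"capacity": 25, "children": {}},
}


def validate(children: dict, path: str = "root") -> None:
    """Check that sibling capacities under each parent sum to 100 percent."""
    if not children:
        return
    total = sum(q["capacity"] for q in children.values())
    if abs(total - 100) > 1e-6:
        raise ValueError(f"children of {path} sum to {total}, expected 100")
    for name, q in children.items():
        validate(q["children"], f"{path}.{name}")


def absolute_capacities(children: dict, path: str = "root",
                        parent_share: float = 100.0
                        ) -> Iterator[Tuple[str, float]]:
    """Yield (queue path, share of the whole cluster in percent)."""
    for name, q in children.items():
        share = parent_share * q["capacity"] / 100.0
        qpath = f"{path}.{name}"
        yield qpath, share
        yield from absolute_capacities(q["children"], qpath, share)


def capacity_scheduler_properties(children: dict,
                                  path: str = "root") -> Dict[str, str]:
    """Render the hierarchy as Capacity Scheduler style property pairs."""
    props: Dict[str, str] = {}
    if children:
        props[f"yarn.scheduler.capacity.{path}.queues"] = ",".join(children)
    for name, q in children.items():
        qpath = f"{path}.{name}"
        props[f"yarn.scheduler.capacity.{qpath}.capacity"] = str(q["capacity"])
        props.update(capacity_scheduler_properties(q["children"], qpath))
    return props


if __name__ == "__main__":
    validate(QUEUES)
    for queue_path, share in absolute_capacities(QUEUES):
        print(f"{queue_path}: {share:.1f}% of cluster")
```

Holding each candidate hierarchy as data makes it straightforward to generate and compare several queue models before committing one to the cluster configuration.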
YARN Queue Model - 2
[Figure: queue hierarchy. Labels: Root Queue; ETL; Ad-Hoc; Batch; Project Queues; Jobs. Metrics shown: Cluster Utilization, Avg Latency, Throughput (jobs); values not recoverable.]
YARN Queue Model - 3
[Figure: queue hierarchy. Labels: Root Queue; ETL Queue; Batch Queue; Ad-hoc; Project Queues (shown twice); Jobs. Metrics shown: Cluster Utilization, Avg Wait Time, Throughput (jobs); values not recoverable.]
Step 3: Run jobs, iterate thru’ models and pick optimal
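Step 3 compares queue models on cluster utilization, average wait time, and job throughput. One simple way to make "pick optimal" concrete is a weighted score over normalized metrics, sketched below. The scoring rule, the weights, and the names `ModelRun` and `pick_model` are assumptions for illustration; the talk does not describe how the models were ranked.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class ModelRun:
    name: str
    utilization: float   # fraction of cluster resources used, 0..1
    avg_wait_s: float    # average job wait time in seconds
    throughput: float    # completed jobs per unit time


def _normalize(values: Sequence[float], higher_is_better: bool) -> List[float]:
    """Min-max scale to 0..1, flipping metrics where lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def pick_model(runs: Iterable[ModelRun],
               weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)
               ) -> ModelRun:
    """Return the run with the best weighted score.

    weights = (utilization, wait time, throughput); equal by default.
    """
    runs = list(runs)
    if not runs:
        raise ValueError("at least one model run is required")
    util = _normalize([r.utilization for r in runs], higher_is_better=True)
    wait = _normalize([r.avg_wait_s for r in runs], higher_is_better=False)
    thr = _normalize([r.throughput for r in runs], higher_is_better=True)
    w_u, w_w, w_t = weights
    scores = [w_u * u + w_w * w + w_t * t for u, w, t in zip(util, wait, thr)]
    best = max(range(len(runs)), key=scores.__getitem__)
    return runs[best]
```

In practice each `ModelRun` would be filled from measurements taken after running the same job mix under one queue model, and the weights would reflect which workload class (ETL windows, ad-hoc latency, batch throughput) matters most.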
Right Balance
• The optimal solution is about the right balance
– Cluster infrastructure
– Use the right software stack from Hadoop ecosystem
– Data management
– Application design and workload balancing with YARN
– Good tools for monitoring and management
• Approach
– Start small and iterate faster
– When in doubt, experiment and get data to make decisions.
– Keep customer use cases in perspective.
Summary
– Incremental transition from MPP to Big Data
– A journey towards open source distributed computing
– Uniform Computing!
• Infrastructure building blocks
• Single large YARN cluster for variety of compute and storage loads
– Open source – use and contribute
• Work with community to address gaps
– Share your ideas
Q & A

More Related Content

What's hot

Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceDataWorks Summit
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...DataWorks Summit
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive DataWorks Summit
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQpivotalny
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop DataWorks Summit/Hadoop Summit
 
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMapR Technologies
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsDataWorks Summit
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobileDataWorks Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnhdhappy001
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Rich placement constraints: Who said YARN cannot schedule services?
Rich placement constraints: Who said YARN cannot schedule services?Rich placement constraints: Who said YARN cannot schedule services?
Rich placement constraints: Who said YARN cannot schedule services?DataWorks Summit
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014hadooparchbook
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 

What's hot (20)

Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQ
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Hive Now Sparks
Hive Now SparksHive Now Sparks
Hive Now Sparks
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Rich placement constraints: Who said YARN cannot schedule services?
Rich placement constraints: Who said YARN cannot schedule services?Rich placement constraints: Who said YARN cannot schedule services?
Rich placement constraints: Who said YARN cannot schedule services?
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 

Viewers also liked

Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 
Cisco Network Functions Virtualization Infrastructure (NFVI)
Cisco Network Functions Virtualization Infrastructure (NFVI)Cisco Network Functions Virtualization Infrastructure (NFVI)
Cisco Network Functions Virtualization Infrastructure (NFVI)Cisco Russia
 
Microservices mit Java EE - am Beispiel von IBM Liberty
Microservices mit Java EE - am Beispiel von IBM LibertyMicroservices mit Java EE - am Beispiel von IBM Liberty
Microservices mit Java EE - am Beispiel von IBM LibertyMichael Hofmann
 
IoT and Big Data
IoT and Big DataIoT and Big Data
IoT and Big Datasabnees
 
Fluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellFluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellN Masahiro
 
Introduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraIntroduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraJim Hatcher
 
Better Insights from Your Master Data - Graph Database LA Meetup
Better Insights from Your Master Data - Graph Database LA MeetupBetter Insights from Your Master Data - Graph Database LA Meetup
Better Insights from Your Master Data - Graph Database LA MeetupBenjamin Nussbaum
 
Docker in Production, Look No Hands! by Scott Coulton
Docker in Production, Look No Hands! by Scott CoultonDocker in Production, Look No Hands! by Scott Coulton
Docker in Production, Look No Hands! by Scott CoultonDocker, Inc.
 
Setting up a Digital Business on Cloud
Setting up a Digital Business on CloudSetting up a Digital Business on Cloud
Setting up a Digital Business on CloudAmazon Web Services
 
Multi-container Applications on OpenShift with Ansible Service Broker
Multi-container Applications on OpenShift with Ansible Service BrokerMulti-container Applications on OpenShift with Ansible Service Broker
Multi-container Applications on OpenShift with Ansible Service BrokerAmazon Web Services
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...Márton Kodok
 
Migrate Oracle WebLogic Applications onto a Containerized Cloud Data Center
Migrate Oracle WebLogic Applications onto a Containerized Cloud Data CenterMigrate Oracle WebLogic Applications onto a Containerized Cloud Data Center
Migrate Oracle WebLogic Applications onto a Containerized Cloud Data CenterJingnan Zhou
 
From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...
From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...
From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...DevOpsDays Tel Aviv
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesBhakti Mehta
 
150430 regiosessie corv_almelo
150430 regiosessie corv_almelo150430 regiosessie corv_almelo
150430 regiosessie corv_almeloKING
 

Viewers also liked (20)

Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 
Security Realism in Education
Security Realism in EducationSecurity Realism in Education
Security Realism in Education
 
Cisco Network Functions Virtualization Infrastructure (NFVI)
Cisco Network Functions Virtualization Infrastructure (NFVI)Cisco Network Functions Virtualization Infrastructure (NFVI)
Cisco Network Functions Virtualization Infrastructure (NFVI)
 
Microservices mit Java EE - am Beispiel von IBM Liberty
Microservices mit Java EE - am Beispiel von IBM LibertyMicroservices mit Java EE - am Beispiel von IBM Liberty
Microservices mit Java EE - am Beispiel von IBM Liberty
 
IoT and Big Data
IoT and Big DataIoT and Big Data
IoT and Big Data
 
Fluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellFluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshell
 
What is DevOps?
What is DevOps?What is DevOps?
What is DevOps?
 
Question 7
Question 7Question 7
Question 7
 
Introduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraIntroduction to Data Modeling in Cassandra
Introduction to Data Modeling in Cassandra
 
Better Insights from Your Master Data - Graph Database LA Meetup
Better Insights from Your Master Data - Graph Database LA MeetupBetter Insights from Your Master Data - Graph Database LA Meetup
Better Insights from Your Master Data - Graph Database LA Meetup
 
Veselík 1
Veselík 1Veselík 1
Veselík 1
 
Docker in Production, Look No Hands! by Scott Coulton
Docker in Production, Look No Hands! by Scott CoultonDocker in Production, Look No Hands! by Scott Coulton
Docker in Production, Look No Hands! by Scott Coulton
 
Setting up a Digital Business on Cloud
Setting up a Digital Business on CloudSetting up a Digital Business on Cloud
Setting up a Digital Business on Cloud
 
Multi-container Applications on OpenShift with Ansible Service Broker
Multi-container Applications on OpenShift with Ansible Service BrokerMulti-container Applications on OpenShift with Ansible Service Broker
Multi-container Applications on OpenShift with Ansible Service Broker
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
 
Migrate Oracle WebLogic Applications onto a Containerized Cloud Data Center
Migrate Oracle WebLogic Applications onto a Containerized Cloud Data CenterMigrate Oracle WebLogic Applications onto a Containerized Cloud Data Center
Migrate Oracle WebLogic Applications onto a Containerized Cloud Data Center
 
From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...
From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...
From 10 Users to 10 Milion in 10 Days - Adam Lev, Tamar Labs - DevOpsDays Tel...
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservices
 
Analyze, Influence and Engage Your Customer - v1.7
Analyze, Influence and Engage Your Customer - v1.7Analyze, Influence and Engage Your Customer - v1.7
Analyze, Influence and Engage Your Customer - v1.7
 
150430 regiosessie corv_almelo
150430 regiosessie corv_almelo150430 regiosessie corv_almelo
150430 regiosessie corv_almelo
 

Similar to Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN

Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRData Con LA
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Uwe Printz
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013StampedeCon
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014John Berns
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionYARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionDataWorks Summit
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReducehuguk
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxDr.Florence Dayana
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Global Business Events
 

Similar to Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN (20)

Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapR
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionYARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
 


Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN

  • 1. Experiences in migration of large analytics platform from MPP database to Hadoop YARN
      Hadoop Summit 2014 – Srinivas Nimmagadda (Technical Director, CPE) & Roopesh Varier (Director, CPE)
  • 2. Agenda
      1. Introduction
      2. Big Data Needs
      3. MPP Platform and Challenges
      4. New Platform based on Hadoop/YARN
      5. Lessons learned during transition to Hadoop
  • 3. Overview
      • Symantec Cloud Platform Engineering (CPE)
          – Build consolidated cloud infrastructure and platform services for next-generation, data-powered Symantec applications
          – Open source components as building blocks
              • Hadoop and OpenStack
              • Bridge capability gaps and contribute back
      • A big data platform for batch and stream analytics, integrated with OpenStack
          – Security, multi-tenancy, and reliability
      • Using large-scale data analytics for security and data management workloads
          – Analytics: reputation-based security, Managed Security Services, fraud detection, dial-home application logs
  • 4. Big Data Challenge
      • Hundreds of millions of users
      • Billions of files
          – File good or not?
      • Millions of URLs
          – URL safe or not?
      • Hundreds of thousands of applications
          – Stable or crashed?
      • Constant feed of information
          – Real time
          – Across the globe
          – From our applications and appliances
  • 5. Value from Volume
      • Volume of data
          – Multi-petabyte historical datasets
          – Multi-terabyte daily incremental datasets
          – Wide variety of input data formats
          – How do we manage?
      • Variety of workloads
          – ETL jobs
          – Batch applications
          – Interactive ad-hoc analysis
      • How to extract value from volume in near real time?
  • 6. Agenda (same as slide 2)
  • 7. MPP Platform
      [Diagram: Data Sources, ETL Cluster, Raw Data Store, Platform Services, MPP DB Engine, Applications (Batch, Interactive)]
  • 8. Legacy MPP Analytics Solution
      • Custom platform services
          – Task/job management (DAG-based, fault-tolerant)
          – Functional and performance monitoring
          – Automatic data lifecycle management
          – Inter-cluster data transfers
          – Cluster tenancy management
      • ETL cluster
      • RDS (raw data store) on NAS
      • MPP (Massively Parallel Processing) DB engine at the core
  • 9. Key Challenges
      • Scalability
          – Supporting rapid data growth
          – No support for heterogeneous hardware
      • Operational costs
          – OpEx and software licenses
      • Supporting new use models
          – Not Only SQL patterns in analytics (columnar storage, search, streaming)
      • Cluster operational challenges
          – Limited resource management (limits/quotas, utilization throttling)
          – Load balancing across multi-mode and multi-tenant workloads
          – Integrated secure tenancy services
          – HA and DR
  • 10. Agenda (same as slide 2)
  • 11. MPP Platform
      [Diagram, repeated from slide 7: Data Sources, ETL Cluster, Raw Data Store, Platform Services, MPP DB Engine, Applications (Batch, Interactive)]
  • 12. MPP to Big Data Platform
      [Diagram of the transition. Numbered labels: 1: Commodity Hardware; 2: Hadoop Cluster; 3: HDFS; 4: YARN; 5: DAG: Oozie; 6: DistCp, Falcon; 7: YARN/HDFS. Other labels: Data Sources, Raw Data Store, ETL Cluster, Platform Services (ETL Job Management, State Transfer, Tenancy Guard), MPP DB Engine, Applications (Batch, Interactive)]
  • 13. Big Data Platform
      [Diagram. Numbered labels: 1: Cluster Infrastructure; 2: Hadoop 2.x Stack; 3: HDFS; 4: YARN; 5: Oozie. Other labels: Multi-tenant, Data Sources, Raw Data Store, ETL Jobs, Batch, Interactive, Ad-hoc workloads, Applications (Batch, Interactive), Role-based provisioning, Unified Logging API]
  • 14. Agenda (same as slide 2)
  • 15. Cluster Build Experiences
      • Node selection
          – Single node SKU; use commodity hardware components
          – Memory will be cheap; keep expansion options open
          – Spindle : core : LAN network ratios (1 : 2.5 : 1.5 Gb/s)
      • Balance mixed workloads using YARN
          – Large clusters are better for effective resource utilization
          – Balance between ETL, batch, and interactive jobs with YARN
      • Platform features and best practices
          – Central monitoring, log aggregation, and alerting metrics (ELK stack)
          – Role-based automated deployment of OS and Hadoop configuration
  • 16. Journey to Hadoop
      • Goals
          – Open source platform
          – Scalable distributed processing
      • Existing app base built around SQL
      • Many technology choices in the Hadoop ecosystem
          – Technology choices: distributed query engines vs. fast MR
          – Evaluation with multi-PB data sets using 15 of our representative workloads
              • e.g., complex joins (data shuffle), queries over a variety of data
          – Criteria: scale, functionality, stability, performance, integration with the rest of the open source ecosystem
          – Hive was the only technology able to scale and provide easy migration from our SQL workloads
          – With Tez we had an acceptable performance trade-off
  • 17. RDS and ETL Process
      • Platform features for ETL
          – File ingestion and job management APIs
          – Secure tenancy, replication
      • Conversion of a 5 GB log file (.gz to .bz2)
          1. Single node outside Hadoop: ~28 mins
          2. In Hadoop, single mapper, parallel read and write approach: ~5 mins
      • A parallel RDS and ETL using YARN
          – Source file ingested from remote location
          – Converted to bz2 and stored in the HDFS Raw Data Store (passive data)
          – Data is transformed and loaded into Hive (active data in ORC format)
          – Mix "active" and "passive" datasets in HDFS
      • Use YARN for managing ETL
      [Diagram: API, NameNode (NN), DataNodes (DN); path 1: local .gz -> .bz2; path 2: MR-based .gz -> .bz2]
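
The single-node conversion on slide 17 (path 1) can be sketched as a streaming recompression. This is a minimal illustration using only the Python standard library, not the authors' tool: the function name `gz_to_bz2`, the chunk size, and the sample file names are assumptions made for the example. The slides do not state why bz2 was chosen; one plausible reason is that bzip2 files are splittable in Hadoop while gzip files are not.

```python
import bz2
import gzip
import shutil
import tempfile
from pathlib import Path


def gz_to_bz2(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> None:
    """Recompress a .gz file as .bz2, streaming in chunks.

    Hypothetical helper for illustration; the file is never held
    fully in memory, so it also works for multi-GB logs.
    """
    with gzip.open(src, "rb") as reader, bz2.open(dst, "wb") as writer:
        shutil.copyfileobj(reader, writer, chunk_size)


if __name__ == "__main__":
    # Build a small sample input, convert it, and verify the round trip.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.log.gz"   # hypothetical file name
        dst = Path(tmp) / "sample.log.bz2"  # hypothetical file name
        payload = b"2014-06-03 INFO example log line\n" * 1000
        with gzip.open(src, "wb") as f:
            f.write(payload)
        gz_to_bz2(src, dst)
        with bz2.open(dst, "rb") as f:
            assert f.read() == payload
        print("converted", src.name, "to", dst.name)
```

A single process like this is bound by one core and one disk, which is consistent with the gap the slide reports between the local run and the in-Hadoop run with parallel read and write.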
  • 18. Large Cluster YARN Performance Modeling
      • Multi-mode
          – ETL jobs: guaranteed throughput, window computing
          – Ad-hoc queries: low latency, fast execution
          – Batch analytics applications: throughput
      • Multi-level
          – Departments/projects, users
      • How do we model and use YARN for the above workloads?
  • 19. Performance Modeling
      [Diagram: workload classes (ETL, Batch, Ad-hoc) against Map Tasks, Reduce Tasks, HDFS Storage]
      Step 1: Compile your workload model
  • 20. YARN Queue Model - 1
      [Diagram labels: Root Queue; ETL Queue, Ad-hoc Queue, Batch Queue; Project Queues; Jobs. Metrics shown: Cluster Utilization, Avg Latency, Throughput (jobs); values not preserved in the transcript]
      Step 2: Develop your YARN queue resource allocation hierarchy
  • 21. YARN Queue Model - 2
      [Diagram labels: Root Queue; ETL, Ad-hoc, Batch; Project Queues; Jobs. Metrics shown: Cluster Utilization, Avg Latency, Throughput (jobs); values not preserved in the transcript]
  • 22. YARN Queue Model - 3
      [Diagram labels: Root Queue; ETL Queue, Batch Queue, Ad-hoc; Project Queues; Jobs. Metrics shown: Cluster Utilization, Avg Wait Time, Throughput (jobs); values not preserved in the transcript]
      Step 3: Run jobs, iterate through the models, and pick the optimal one
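
Step 2 above (developing a queue resource allocation hierarchy) can be explored on paper before touching a cluster. The sketch below models a queue tree with percentage capacities and computes each queue's absolute share of the cluster. It assumes percentage-based capacities in the style of the YARN CapacityScheduler, where sibling queues must sum to 100 at each level; the slides do not say which scheduler was used, and the queue names and percentages here are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Queue:
    name: str
    capacity: float  # percent of the parent queue's capacity
    children: List["Queue"] = field(default_factory=list)


def absolute_capacities(
    queue: Queue, parent_share: float = 100.0, path: str = ""
) -> Dict[str, float]:
    """Return {queue path: percent of the whole cluster} for a queue tree.

    Raises ValueError when the children of a queue do not sum to 100,
    mirroring the rule assumed for percentage-based capacities.
    """
    full_path = f"{path}.{queue.name}" if path else queue.name
    share = parent_share * queue.capacity / 100.0
    result = {full_path: share}
    if queue.children:
        total = sum(child.capacity for child in queue.children)
        if abs(total - 100.0) > 1e-6:
            raise ValueError(f"children of {full_path} sum to {total}, not 100")
        for child in queue.children:
            result.update(absolute_capacities(child, share, full_path))
    return result


# Hypothetical hierarchy: workload-type queues first, project queues below.
root = Queue("root", 100.0, [
    Queue("etl", 40.0, [Queue("proj_a", 50.0), Queue("proj_b", 50.0)]),
    Queue("adhoc", 20.0),
    Queue("batch", 40.0),
])

if __name__ == "__main__":
    for queue_path, pct in absolute_capacities(root).items():
        print(f"{queue_path}: {pct:.1f}% of cluster")
```

Rearranging the tree (for example, project queues above workload-type queues) and re-running the calculation is a cheap way to compare candidate models before measuring utilization, wait time, and throughput with real jobs as in Step 3.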
  • 23. Right Balance
      • The optimal solution is about the right balance of
          – Cluster infrastructure
          – The right software stack from the Hadoop ecosystem
          – Data management
          – Application design and workload balancing with YARN
          – Good tools for monitoring and management
      • Approach
          – Start small and iterate fast
          – When in doubt, experiment and get data to make decisions
          – Keep customer use cases in perspective
  • 24. Summary
      – Incremental transition from MPP to Big Data
      – A journey towards open source distributed computing
      – Uniform computing!
          • Infrastructure building blocks
          • Single large YARN cluster for a variety of compute and storage loads
      – Open source: use and contribute
          • Work with the community to address gaps
      – Share your ideas
  • 25. Q & A