SlideShare a Scribd company logo
1 of 36
Download to read offline
Apache Pig를 위한 
Tez 연산 엔진 개발하기 
박철수 엔지니어 / 넷플릭스 빅데이터플랫폼팀 
Netflix Big Data Platform
CONTENTS 
1. Background 
2. What is Pig on Tez? 
3. Why Apache Tez? 
4. Shortcomings and What’s Next
1. Background
1.1 Netflix Data Pipeline 
Cloud 
apps 
Events Data Pipeline 
Suro Ursula 
Cassandra 
Stateful Data Pipeline 
SS 
Tables 
Aegisthus 
S3 
DW 
15 min 
Daily
1.2 Netflix Big Data Platform 
S3 
DW 
Hadoop clusters 
Federated 
execution 
engine 
Federated 
metadata 
service 
Data Lineage 
Data Visualization 
Data Movement 
Data Quality 
Pig Workflow 
Visualization 
Job/Cluster 
Performance 
Visualization
1.3 Data Volume 
~200 billions events/day 
~40 TB incoming data/day (compressed) 
~1.2 PB data read/day 
~100 TB data wrote/day 
10+ PB DW on S3
1.4 Netflix Big Data Platform 
S3 
DW 
Hadoop clusters 
Federated 
execution 
engine 
Federated 
metadata 
service 
Data Lineage 
Data Visualization 
Data Movement 
Data Quality 
Pig Workflow 
Visualization 
Job/Cluster 
Performance 
Visualization 
With ever growing 
data, ETL runs 
slower and slower.
1.5 ETL Completion Trend
1.6 Common Problems 
Common problems across organizations 
1. Similar data platform architecture 
1. Pig for ETL jobs 
2. Hive/Presto for ad-hoc queries
1.7 Pig on Tez Team 
• Alex Bain (LinkedIn: 2013/08~2014/01, Dev) 
• Mark Wagner (LinkedIn: 2013/08~2014/01, Dev) 
• Cheolsoo Park (Netflix: 2013/08~2014/08, Dev) 
• Olga Natkovich (Yahoo: 2013/08~present, PM) 
• Rohini Palaniswamy (Yahoo: 2013/08~present, Dev) 
• Daniel Dai (Hortonworks: 2013/08~present, Dev)
2. What is Pig on Tez?
2.1 Pig Concepts 
Non-blocking operators 
1. LOAD / STORE 
2. FOREACH __ GENERATE __ 
3. FILTER __ BY __ 
Blocking operators 
1. GROUP __ BY __ 
2. ORDER __ BY __ 
3. JOIN __ BY __ 
Translated to a MapReduce shuffle
2.2 MapReduce Plan 
LOAD 
FOREACH 
GROUP BY 
FOREACH 
STORE 
LOAD 
FOREACH 
LOCAL 
REARRANGE 
GLOBAL 
REARRANGE 
PACKAGE 
FOREACH 
STORE 
LOAD 
FOREACH 
LOCAL 
REARRANGE 
Shuffle 
PACKAGE 
FOREACH 
STORE 
Logical Plan 
Physical Plan MR Plan
2.3 What’s Problem? 
Restrictions by MapReduce 
1. Extra intermediate output on HDFS 
2. Artificial synchronization barriers 
3. Inefficient use of resources 
4. Multi-query optimization
2.4 Tez Concepts 
Low-level DAG Framework 
1. Build DAG by defining vertices and edges. 
2. Customize scheduling of DAG and movement of data. 
• Sequential and concurrent 
• 1-1, broadcasting, scatter and gather 
Flexible Input-Processor-Output Model 
1. Thin API layer to wrap around arbitrary application code. 
2. Compose inputs, processor, and outputs to execute arbitrary processing. 
Input Processor Output 
initialize 
initialize 
getReader 
run 
handleEvents 
handleEvents 
close 
close 
initialize 
getWriter 
handleEvents 
close
2.5 Pig on Tez 
Logical Plan 
LogToPhyTranslationVisitor 
Physical Plan 
TezCompiler MRCompiler 
Tez Plan 
Tez Execution Engine 
MR Plan 
MR Execution Engine
2.6 Tez DAG: Split + Group By + Join 
Load ‘foo’ 
Split multiplex De-multiplex 
Group by y, Group by z 
HDFS HDFS 
Load g1, Load g2 
Join g1, g2 
Load ‘foo’ 
Multiple outputs 
Group 
by y 
Group 
by z 
Reducer follows 
reducer 
Join g1, g2 
a = LOAD ‘foo’ AS (x, y, z); 
b = GROUP a BY y; 
c = GROUP a BY z; 
d = JOIN b BY group; 
c BY group;
2.7 Tez DAG: Order By 
Sample 
Aggregate 
Load, Partition 
Sort 
HDFS 
Load, Sample 
Partition 
Sort 
Aggregate a = LOAD ‘foo’ AS (x, y); 
b = FILTER a BY y is not null; 
c = ORDER b BY x; 
Stage sample map 
on distributed cache 
Broadcast sample map 
1-1 Unsorted 
edge 
Cache sample map
3. Why Apache Tez?
3.1 DAG Execution 
DAG Execution 
1. Eliminate HDFS writes between workflow jobs. 
2. Eliminate job launch overhead of workflow jobs. 
3. Eliminate identity mappers in every workflow jobs. 
Benefits 
1. Faster execution and higher predictability.
3.2 MR vs. Tez
3.3 AM / Container Reuse 
AM Reuse 
1. Grunt shell uses one AM for all commands till timeout. 
2. More than one DAGs submitted for merge join, collected group, and exec. 
Container Reuse 
1. Rerun new tasks on already warmed-up JVM. 
Benefits 
1. Reduce container launch overhead. 
2. Reduce networks IO. 
• 1-1 edge tasks are launched on same node.
3.4 Broadcast Edge / Object Cache 
Broadcast Edge 
1. Broadcast same data to all tasks in successor vertex. 
Object Cache 
1. Shared in memory objects for scope of vertex and DAG. 
Benefits 
1. Replace use of distributed cache. 
2. Avoid input fetching if cache is available on container reuse. 
• Replicated join runs faster on small cluster.
3.5 Vertex Group 
Vertex Group 
1. Group multiple vertices into a vertex group and produce a combiner output. 
Benefits 
1. Better performance due to elimination of an additional vertex. 
Load b Load a 
Group 
Load b Load a 
Union 
Group 
a = LOAD ‘a’; 
b = LOAD ‘b’; 
c = UNION a, b; 
d = GROUP c BY $0;
3.6 Slow Start/Pre-launch 
Slow Start/Pre-launch 
1. Pluggable vertex manager pre-launches the reducers before all maps have co 
mpleted so that shuffle can start (e.g. LIMIT not following ORDER BY). 
Benefits 
1. Better performance due to parallel execution of multiple vertices.
3.7 Performance Numbers 
250 
200 
150 
100 
50 
0 
1h22m vs 28m 
3h57m vs 3h54m 
Job 1 (2x) Job 2 (3x) Job 3 (1.7x) Job 4 (1.2x) Job 5 (1.0x) 
MR 
Tez 
20m vs 10m 
2h17m vs 1h15m 
33m vs 28m
3.8 Performance Deep Dive 
This MR job blocks DAG.
3.9 Performance Deep Dive 
Huge amount of intermediate 
files are written to HDFS.
4. Shortcomings 
And What’s Next
4.1 Shortcomings 
Auto Parallelism 
1. Eliminating mappers without adjusting parallelisms can make jobs run slower. 
In MR, combiners 
run with 1600 tasks. 
In Tez, combiners 
Run With 500 tasks.
4.2 Shortcomings 
Current Status 
1. User-specified parallelism always takes precedence. 
2. If no parallelism is specified, Pig estimates using static rules. For eg, if vertex 
contains filter-by, reduce its parallelism by 50%. 
3. At execution time, parallelism is adjusted again based on per-vertex sampling. 
Problems 
1. In legacy Pig jobs, parallelism is optimized for MR. So honoring user-specified 
parallelism can hurt performance in Tez. 
2. Static-rule-based estimation cannot be always accurate. 
3. Sample-based estimation cannot be always accurate.
4.3 Shortcomings 
Web UI and Tools Integration 
1. Tez AM has no UI (i.e. no job page). 
2. Tez hasn’t integrated with YARN ATS (i.e. no job history page). 
3. Tez hasn’t integrated with Netflix internal tools such as Inviso and Lipstick.
4.4 What’s Next? 
Tez 
1. Resolve TEZ-8: Tez UI for progress tracking and history. 
• Tez 0.5.x release (latest) doesn’t include TEZ-8. 
Pig on Tez 
1. Improve auto parallelism and usability. 
• Pig on Tez will be included in Pig 0.14 release, but these issues might be 
still there.
Q&A
THANK YOU

More Related Content

What's hot

Async and Non-blocking IO w/ JRuby
Async and Non-blocking IO w/ JRubyAsync and Non-blocking IO w/ JRuby
Async and Non-blocking IO w/ JRubyJoe Kutner
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Shirshanka Das
 
[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리NAVER D2
 
Rapid Application Design in Financial Services
Rapid Application Design in Financial ServicesRapid Application Design in Financial Services
Rapid Application Design in Financial ServicesAerospike
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Elasticsearch - Dynamic Nodes
Elasticsearch - Dynamic NodesElasticsearch - Dynamic Nodes
Elasticsearch - Dynamic NodesScott Davis
 
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016Gulcin Yildirim Jelinek
 
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)mahesh madushanka
 
Distributed Applications with Apache Zookeeper
Distributed Applications with Apache ZookeeperDistributed Applications with Apache Zookeeper
Distributed Applications with Apache ZookeeperAlex Ehrnschwender
 
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsySearch-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsyLucidworks
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache MesosJoe Stein
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Meetup on Apache Zookeeper
Meetup on Apache ZookeeperMeetup on Apache Zookeeper
Meetup on Apache ZookeeperAnshul Patel
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 

What's hot (19)

Async and Non-blocking IO w/ JRuby
Async and Non-blocking IO w/ JRubyAsync and Non-blocking IO w/ JRuby
Async and Non-blocking IO w/ JRuby
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리
 
Rapid Application Design in Financial Services
Rapid Application Design in Financial ServicesRapid Application Design in Financial Services
Rapid Application Design in Financial Services
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Elasticsearch - Dynamic Nodes
Elasticsearch - Dynamic NodesElasticsearch - Dynamic Nodes
Elasticsearch - Dynamic Nodes
 
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
 
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
 
Distributed Applications with Apache Zookeeper
Distributed Applications with Apache ZookeeperDistributed Applications with Apache Zookeeper
Distributed Applications with Apache Zookeeper
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsySearch-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
 
Tuning Solr for Logs
Tuning Solr for LogsTuning Solr for Logs
Tuning Solr for Logs
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Meetup on Apache Zookeeper
Meetup on Apache ZookeeperMeetup on Apache Zookeeper
Meetup on Apache Zookeeper
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 

Viewers also liked

[E4]triple s deview
[E4]triple s deview[E4]triple s deview
[E4]triple s deviewNAVER D2
 
[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요
[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요
[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요NAVER D2
 
오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유knight1128
 
Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4
Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4
Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4Sangcheol Hwang
 
[221] docker orchestration
[221] docker orchestration[221] docker orchestration
[221] docker orchestrationNAVER D2
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기NAVER D2
 
Windows Server Containers- How we hot here and architecture deep dive
Windows Server Containers- How we hot here and architecture deep diveWindows Server Containers- How we hot here and architecture deep dive
Windows Server Containers- How we hot here and architecture deep diveDocker, Inc.
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBDocker, Inc.
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Amy W. Tang
 

Viewers also liked (11)

[E4]triple s deview
[E4]triple s deview[E4]triple s deview
[E4]triple s deview
 
GenMyModel
GenMyModel GenMyModel
GenMyModel
 
[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요
[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요
[1B3]모바일 앱 크래시 네이버에서는 어떻게 수집하고 보여줄까요
 
오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유
 
Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4
Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4
Tech planet 2015 Docker 클라우드 구축 프로젝트 - d4
 
[221] docker orchestration
[221] docker orchestration[221] docker orchestration
[221] docker orchestration
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기
 
Windows Server Containers- How we hot here and architecture deep dive
Windows Server Containers- How we hot here and architecture deep diveWindows Server Containers- How we hot here and architecture deep dive
Windows Server Containers- How we hot here and architecture deep dive
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FB
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 

Similar to [2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종

Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataDataWorks Summit
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aSchubert Zhang
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentationVu Thi Trang
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Sector Sphere 2009
Sector Sphere 2009Sector Sphere 2009
Sector Sphere 2009lilyco
 
sector-sphere
sector-spheresector-sphere
sector-spherexlight
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Inc.
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Toronto-Oracle-Users-Group
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼NAVER D2
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 

Similar to [2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종 (20)

Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221a
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentation
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Sector Sphere 2009
Sector Sphere 2009Sector Sphere 2009
Sector Sphere 2009
 
sector-sphere
sector-spheresector-sphere
sector-sphere
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduce
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 

More from NAVER D2

[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다NAVER D2
 
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...NAVER D2
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기NAVER D2
 
[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발NAVER D2
 
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈NAVER D2
 
[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&A[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&ANAVER D2
 
[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기NAVER D2
 
[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep Learning[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep LearningNAVER D2
 
[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applications[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applicationsNAVER D2
 
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load BalancingOld version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load BalancingNAVER D2
 
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지NAVER D2
 
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기NAVER D2
 
[224]네이버 검색과 개인화
[224]네이버 검색과 개인화[224]네이버 검색과 개인화
[224]네이버 검색과 개인화NAVER D2
 
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)NAVER D2
 
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기NAVER D2
 
[213] Fashion Visual Search
[213] Fashion Visual Search[213] Fashion Visual Search
[213] Fashion Visual SearchNAVER D2
 
[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화NAVER D2
 
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지NAVER D2
 
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터NAVER D2
 
[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?NAVER D2
 

More from NAVER D2 (20)

[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다
 
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기
 
[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발
 
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
 
[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&A[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&A
 
[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기
 
[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep Learning[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep Learning
 
[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applications[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applications
 
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load BalancingOld version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
 
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
 
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
 
[224]네이버 검색과 개인화
[224]네이버 검색과 개인화[224]네이버 검색과 개인화
[224]네이버 검색과 개인화
 
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
 
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
 
[213] Fashion Visual Search
[213] Fashion Visual Search[213] Fashion Visual Search
[213] Fashion Visual Search
 
[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화
 
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
 
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
 
[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종

  • 1.
  • 2. Apache Pig를 위한 Tez 연산 엔진 개발하기 박철수 엔지니어 / 넷플릭스 빅데이터플랫폼팀 Netflix Big Data Platform
  • 3. CONTENTS 1. Background 2. What is Pig on Tez? 3. Why Apache Tez? 4. Shortcomings and What’s Next
  • 5. 1.1 Netflix Data Pipeline Cloud apps Events Data Pipeline Suro Ursula Cassandra Stateful Data Pipeline SS Tables Aegisthus S3 DW 15 min Daily
  • 6. 1.2 Netflix Big Data Platform S3 DW Hadoop clusters Federated execution engine Federated metadata service Data Lineage Data Visualization Data Movement Data Quality Pig Workflow Visualization Job/Cluster Performance Visualization
  • 7. 1.3 Data Volume ~200 billions events/day ~40 TB incoming data/day (compressed) ~1.2 PB data read/day ~100 TB data wrote/day 10+ PB DW on S3
  • 8. 1.4 Netflix Big Data Platform S3 DW Hadoop clusters Federated execution engine Federated metadata service Data Lineage Data Visualization Data Movement Data Quality Pig Workflow Visualization Job/Cluster Performance Visualization With ever growing data, ETL runs slower and slower.
  • 10. 1.6 Common Problems Common problems across organizations 1. Similar data platform architecture 1. Pig for ETL jobs 2. Hive/Presto for ad-hoc queries
  • 11. 1.7 Pig on Tez Team • Alex Bain (LinkedIn: 2013/08~2014/01, Dev) • Mark Wagner (LinkedIn: 2013/08~2014/01, Dev) • Cheolsoo Park (Netflix: 2013/08~2014/08, Dev) • Olga Natkovich (Yahoo: 2013/08~present, PM) • Rohini Palaniswamy (Yahoo: 2013/08~present, Dev) • Daniel Dai (Hortonworks: 2013/08~present, Dev)
  • 12. 2. What is Pig on Tez?
  • 13. 2.1 Pig Concepts Non-blocking operators 1. LOAD / STORE 2. FOREACH __ GENERATE __ 3. FILTER __ BY __ Blocking operators 1. GROUP __ BY __ 2. ORDER __ BY __ 3. JOIN __ BY __ Translated to a MapReduce shuffle
  • 14. 2.2 MapReduce Plan LOAD FOREACH GROUP BY FOREACH STORE LOAD FOREACH LOCAL REARRANGE GLOBAL REARRANGE PACKAGE FOREACH STORE LOAD FOREACH LOCAL REARRANGE Shuffle PACKAGE FOREACH STORE Logical Plan Physical Plan MR Plan
  • 15. 2.3 What’s Problem? Restrictions by MapReduce 1. Extra intermediate output on HDFS 2. Artificial synchronization barriers 3. Inefficient use of resources 4. Multi-query optimization
  • 16. 2.4 Tez Concepts Low-level DAG Framework 1. Build DAG by defining vertices and edges. 2. Customize scheduling of DAG and movement of data. • Sequential and concurrent • 1-1, broadcasting, scatter and gather Flexible Input-Processor-Output Model 1. Thin API layer to wrap around arbitrary application code. 2. Compose inputs, processor, and outputs to execute arbitrary processing. Input Processor Output initialize initialize getReader run handleEvents handleEvents close close initialize getWriter handleEvents close
  • 17. 2.5 Pig on Tez Logical Plan LogToPhyTranslationVisitor Physical Plan TezCompiler MRCompiler Tez Plan Tez Execution Engine MR Plan MR Execution Engine
  • 18. 2.6 Tez DAG: Split + Group By + Join Load ‘foo’ Split multiplex De-multiplex Group by y, Group by z HDFS HDFS Load g1, Load g2 Join g1, g2 Load ‘foo’ Multiple outputs Group by y Group by z Reducer follows reducer Join g1, g2 a = LOAD ‘foo’ AS (x, y, z); b = GROUP a BY y; c = GROUP a BY z; d = JOIN b BY group; c BY group;
  • 19. 2.7 Tez DAG: Order By Sample Aggregate Load, Partition Sort HDFS Load, Sample Partition Sort Aggregate a = LOAD ‘foo’ AS (x, y); b = FILTER a BY y is not null; c = ORDER b BY x; Stage sample map on distributed cache Broadcast sample map 1-1 Unsorted edge Cache sample map
  • 21. 3.1 DAG Execution DAG Execution 1. Eliminate HDFS writes between workflow jobs. 2. Eliminate job launch overhead of workflow jobs. 3. Eliminate identity mappers in every workflow jobs. Benefits 1. Faster execution and higher predictability.
  • 22. 3.2 MR vs. Tez
  • 23. 3.3 AM / Container Reuse AM Reuse 1. Grunt shell uses one AM for all commands till timeout. 2. More than one DAGs submitted for merge join, collected group, and exec. Container Reuse 1. Rerun new tasks on already warmed-up JVM. Benefits 1. Reduce container launch overhead. 2. Reduce networks IO. • 1-1 edge tasks are launched on same node.
  • 24. 3.4 Broadcast Edge / Object Cache Broadcast Edge 1. Broadcast same data to all tasks in successor vertex. Object Cache 1. Shared in memory objects for scope of vertex and DAG. Benefits 1. Replace use of distributed cache. 2. Avoid input fetching if cache is available on container reuse. • Replicated join runs faster on small cluster.
  • 25. 3.5 Vertex Group Vertex Group 1. Group multiple vertices into a vertex group and produce a combiner output. Benefits 1. Better performance due to elimination of an additional vertex. Load b Load a Group Load b Load a Union Group a = LOAD ‘a’; b = LOAD ‘b’; c = UNION a, b; d = GROUP c BY $0;
  • 26. 3.6 Slow Start/Pre-launch Slow Start/Pre-launch 1. Pluggable vertex manager pre-launches the reducers before all maps have co mpleted so that shuffle can start (e.g. LIMIT not following ORDER BY). Benefits 1. Better performance due to parallel execution of multiple vertices.
  • 27. 3.7 Performance Numbers 250 200 150 100 50 0 1h22m vs 28m 3h57m vs 3h54m Job 1 (2x) Job 2 (3x) Job 3 (1.7x) Job 4 (1.2x) Job 5 (1.0x) MR Tez 20m vs 10m 2h17m vs 1h15m 33m vs 28m
  • 28. 3.8 Performance Deep Dive This MR job blocks DAG.
  • 29. 3.9 Performance Deep Dive Huge amount of intermediate files are written to HDFS.
  • 30. 4. Shortcomings And What’s Next
  • 31. 4.1 Shortcomings Auto Parallelism 1. Eliminating mappers without adjusting parallelisms can make jobs run slower. In MR, combiners run with 1600 tasks. In Tez, combiners Run With 500 tasks.
  • 32. 4.2 Shortcomings Current Status 1. User-specified parallelism always takes precedence. 2. If no parallelism is specified, Pig estimates using static rules. For eg, if vertex contains filter-by, reduce its parallelism by 50%. 3. At execution time, parallelism is adjusted again based on per-vertex sampling. Problems 1. In legacy Pig jobs, parallelism is optimized for MR. So honoring user-specified parallelism can hurt performance in Tez. 2. Static-rule-based estimation cannot be always accurate. 3. Sample-based estimation cannot be always accurate.
  • 33. 4.3 Shortcomings Web UI and Tools Integration 1. Tez AM has no UI (i.e. no job page). 2. Tez hasn’t integrated with YARN ATS (i.e. no job history page). 3. Tez hasn’t integrated with Netflix internal tools such as Inviso and Lipstick.
  • 34. 4.4 What’s Next? Tez 1. Resolve TEZ-8: Tez UI for progress tracking and history. • Tez 0.5.x release (latest) doesn’t include TEZ-8. Pig on Tez 1. Improve auto parallelism and usability. • Pig on Tez will be included in Pig 0.14 release, but these issues might be still there.
  • 35. Q&A