LET THE TRANSFORMATION BEGIN
The Active/Active Data Lake with Elastic Cloud Storage
© Copyright 2016 Dell Inc.
Multiple Hadoop Clusters - Challenges
[Diagram: three separate Hadoop clusters, each with its own stack of MR/Hive/Pig on YARN on HDFS]
ACTIVE/ACTIVE GLOBAL HADOOP STORAGE
But we use DistCp…
• Active/Passive access
• Very high disk usage (3x replication on each site)
• Periodic transfers (once every few hours)
• Must carefully update Hive Metastore manually at each site
• Hard to build for 3 or more clusters
• Consumes YARN CPU/memory
• No Hive concurrency controls on target cluster
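To put the "3x replication on each site" point in numbers, here is a minimal sketch of the raw disk consumed when every site holds a full HDFS copy of the same data. The function name and the 100 TB figure are illustrative assumptions, not taken from the slides.

```python
def raw_bytes_needed(logical_bytes: int, sites: int, hdfs_replication: int = 3) -> int:
    """Raw disk bytes used when each site keeps a full, HDFS-replicated copy.

    Assumes DistCp copies the whole data set to every site and that each
    site uses the same HDFS replication factor (3 by default).
    """
    if sites < 1 or hdfs_replication < 1:
        raise ValueError("sites and hdfs_replication must be positive")
    return logical_bytes * hdfs_replication * sites


TB = 10**12  # decimal terabyte, used only for readable output

if __name__ == "__main__":
    data = 100 * TB  # hypothetical data set size
    for sites in (2, 3, 4):
        raw = raw_bytes_needed(data, sites)
        print(f"{sites} sites: {raw // TB} TB raw disk for {data // TB} TB of data")
```

Under these assumptions the raw footprint grows linearly with the number of sites, which is one reason the approach becomes hard to justify at 3 or more clusters.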
Active/Active Hadoop with ECS
Global Namespace For Active/Active Hadoop
[Diagram: sites in Denver, Beijing and Paris sharing one namespace. Labels: STORAGE EFFICIENCY, GEO CACHING, STRONG CONSISTENCY, ACTIVE/ACTIVE w/ FAILOVER]
Active/Active Hive with ECS: Solution Overview
• 2 to 8 sites
• A shared, common:
– Hadoop-Compatible File System – global namespace, readable and writable from all sites
– Hive Metastore DB
• Strong consistency
• Asynchronous replication (low latency updates)
• Fully recoverable from the failure of a single site and 4 drives in each site
• Very high storage efficiency (4.5 times better than HDFS with 3 sites)
• Hive concurrency and ACID transactions (insert, update, delete), even across sites
Active/Active Hive with ECS: Architecture
[Diagram: Site 1, Site 2 and Site 3 each contain 4 ECS Nodes, 3 Node Managers, a Hadoop Master and a Metastore DB]
Links between sites: ECS Replication (async), Metastore DB Repl (sync)
Hadoop installations at different sites are independent except for the Hive Metastore DB and the common file system provided by ECS.
Active/Active Hive Demonstration
hive> create table demotab1…
hive> insert into demotab1 partition (site=1)
values (11, 'ant'), (12, 'bear');
Site 1 Site 2
hive> select * from demotab1;
11 ant 1
12 bear 1
hive> insert into demotab1 partition (site=2)
values (21, 'cat'), (22,'dog');
hive> select * from demotab1;
11 ant 1
12 bear 1
21 cat 2
22 dog 2
hive> delete from demotab1 where site=2 and id=21;
hive> select * from demotab1;
11 ant 1
12 bear 1
22 dog 2
Full definition of demotab1 (abbreviated above as "create table demotab1…"):
hive> create table demotab1
(
id int,
s string
)
partitioned by (site int)
clustered by (id) into 4 buckets
stored as orc tblproperties ('transactional'='true');
Use Case: Enterprise Data Warehouse Offload
• Tables can be exported from an EDW to Hive on ECS with Apache Sqoop or similar tools
• Data will be efficiently distributed and protected across multiple sites
• If desired, delete exported records from the EDW
• Data can be queried in place using Hive SQL from any site
EDW SQL
Storage Efficiency Comparison
[Chart: Storage Efficiency (0% to 100%) vs. Number of Sites]
Number of Sites    2     3     4     5     6     7     8
ECS Efficiency     38%   50%   56%   60%   63%   64%   66%
HDFS Efficiency    17%   11%   8%    7%    6%    5%    4%
Storage efficiency is the effective % of raw disk bytes that are usable by your data
4.5x Better Efficiency!
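The chart values can be reproduced with a short calculation. This is a sketch under stated assumptions, not the vendor's published method. The HDFS side uses 3x replication at every site, as the editor's notes state. The ECS side assumes 12+4 erasure coding within a site and a geo-protection overhead of N/(N-1) across N sites. The slides do not spell out either ECS figure; they were chosen here because they match the plotted percentages. All function names are illustrative.

```python
from fractions import Fraction
from math import floor


def hdfs_efficiency(sites: int, replication: int = 3) -> Fraction:
    """Usable fraction of raw disk when every site holds a full 3x-replicated copy."""
    return Fraction(1, replication * sites)


def ecs_efficiency(sites: int, data_fragments: int = 12, coding_fragments: int = 4) -> Fraction:
    """Usable fraction of raw disk under the ASSUMED ECS scheme.

    Assumption 1: local erasure coding of 12 data + 4 coding fragments.
    Assumption 2: geo protection costs a factor of N/(N-1) for N sites
    (a full mirror at 2 sites, less overhead as sites are added).
    """
    if sites < 2:
        raise ValueError("geo protection needs at least 2 sites")
    local = Fraction(data_fragments, data_fragments + coding_fragments)
    return local * Fraction(sites - 1, sites)


def percent(fraction: Fraction) -> int:
    """Round half up to a whole percent, as a chart label would."""
    return floor(fraction * 100 + Fraction(1, 2))


if __name__ == "__main__":
    print("sites  ECS  HDFS  ratio")
    for n in range(2, 9):
        ecs, hdfs = ecs_efficiency(n), hdfs_efficiency(n)
        print(f"{n:>5}  {percent(ecs):>2}%  {percent(hdfs):>3}%  {float(ecs / hdfs):.2f}x")
```

With these assumptions the 3-site row gives 50% against 11%, a ratio of 4.5, which matches the "4.5x Better Efficiency" callout. Exact fractions are used so that 62.5% rounds up to 63% rather than being affected by floating-point or banker's rounding.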

Editor's Notes

  1. Hadoop customers who use multiple clusters across geographies face a lot of challenges today. There are a lot of isolated “DAS clusters” in the environment, so the utilization/efficiency rate is very low, which drives up the overall TCO. It is hard and time-consuming to move data from one cluster to another. Customers use DistCp, but it is not ideal: DistCp provides active-passive access and consumes a lot of resources (CPU and memory). It is also very difficult to keep the data strongly consistent between the clusters, which is a big problem because different users in different locations will have different versions of the same data.
  2. List as many cons of DistCp as possible.
  3. EMC Elastic Cloud Storage (ECS) solves those problems with its unique and industry-leading active-active Hadoop solution. With ECS, data, metadata and index are replicated across multiple geographic sites, the same bucket can be accessed from both sites simultaneously, and Hadoop compute running on both sides can access the same data. Moreover, ECS presents a single global namespace, which means that any data can be accessed from anywhere.
  4. Updates also work. If queries overlap with inserts, etc., Hive concurrency controls (locks) work across the sites to ensure that consistent views are maintained. Locks are maintained in the shared Metastore DB.
  5. Assumes that the default HDFS replication count of 3 is used at all sites.