1 © Hortonworks Inc. Confidential 2011 – 2017. All Rights Reserved
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
DataWorks Summit – Sydney
September 2017
Presenter
Sankar Hariappan
Senior Software Engineer, Hortonworks
Agenda
 About Apache Hive
 Disaster Recovery
 Replication Modes
 Fail Over
 Fail Back
 Replication at Hive-Scale
 Event Based Replication
 Change Management
 Bootstrapping
 REPL Commands
 Demonstration
 Cloud Migration Challenges
 Future Work
About Apache Hive
 Data warehouse tool built on top of Apache Hadoop.
 Handles data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
 Manages large datasets residing in distributed storage.
 SQL with Hive-specific extensions.
 Query optimization powered by Apache Calcite, with execution via Apache Tez, Apache Spark, or MapReduce.
 Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
 Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
Apache Hive Architecture
[Diagram. Components shown: JDBC/ODBC; HiveServer2 (Hive Thrift Server, Driver, Compiler, Optimizer, Executor, MS Client); Hive Metastore; RDBMS; HDFS; YARN.]
Disaster Recovery
 Deployment of clusters in more than one data center for business continuity or geo-localization.
 Hybrid cloud deployment for off-premises processing.
 Robust replication solution to achieve seamless disaster recovery.
– Prevent severe data loss.
– Eliminate single point of failure.
– Fault-tolerant.
Replication Modes
 Master-Slave: unidirectional replication. The Master serves reads and writes; the Slave serves reads only.
 Master-Master: bidirectional replication between Masters, each serving reads and writes. The diagram also shows a read-only Slave.
Replication Modes
 Hub and Spoke pattern: one Master (read and write) replicates to several Slaves (read only).
 Relay pattern: the Master (read and write) replicates to a Slave (read only), which in turn replicates to another Slave (read only).
Fail Over
 The Slave takes over the Master's responsibilities instantaneously.
 Ensures business continuity with minimal data loss, bounded by the Recovery Point Objective (RPO).
 Almost zero downtime, or at least enough to meet the Recovery Time Objective (RTO).
[Diagram: the Master (read and write) replicates unidirectionally to the Slave; after fail over, the Slave serves reads and writes.]
Fail Back
 The Slave cluster usually has minimal processing capabilities, which makes Fail Back an important requirement.
 The original Master comes back alive and is brought up to date with the latest data.
 Ensure removal of stale data on the Master that was not replicated to the Slave.
 Reverse-replicate the delta of data loaded into the Slave after Fail Over.
[Diagram: after fail back, replication is again unidirectional from the Master (read and write) to the Slave (read only).]
Replication at Hive-Scale
 Event-based replication.
 The first version of Hive Replication (Replv1) uses EXPORT-IMPORT semantics to replicate data.
– Inefficient mechanism.
– 4X copy problem.
– Rubber-banding issue.
– Depends on external tools such as Falcon/Oozie to manage replication state.
 The second version of Hive Replication (Replv2) uses REPL commands (HIVE-14841).
– Point-in-time replication.
– Reduces the number of copies.
– Hive maintains the replicated state.
– Additional support for replicating functions and constraints.
– Available in Apache Hive trunk.
Event Logging
[Diagram: JDBC/ODBC clients run queries on HiveServer2; the Hive Metastore manages metadata and records events in the Events Table of the Metastore RDBMS.]
Event Logging
 Captured events: Create/Alter/Drop on DB/Table/Partition/Function/Constraint objects.
 Events are stored in the Metastore RDBMS.
 Each event is self-contained: it carries what is needed to recover the state of the object (metadata + data).
 Events are ordered by a sequence number (event id).
Event Based Replication
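[Diagram: On the Master Cluster, HiveServer2 runs REPL DUMP, which reads the new batch of events from the Events Table in the Metastore RDBMS and serializes them as a dump (metadata + data) on HDFS. On the Slave Cluster, HiveServer2 runs REPL LOAD, which reads the repl dump directory, copies the data files with DistCp, and uses the Metastore API to write objects into the Metastore RDBMS.]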
Event Based Replication
 Read a batch of events from the Metastore RDBMS in the sequence in which they were generated.
 "repl dump <db name> from <event id>"
– Gets events newer than <event id>.
– Includes information about the data files.
– <event id> is the last replicated event id for the DB, obtained from the destination cluster.
 "repl load <db name> from <hdfs URI>"
– Applies the events on the destination.
 State is currently replicated in batches; this can be optimized in the future.
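The dump/load cycle described above can be sketched as a small in-memory simulation. This is only an illustration of the idea, not Hive's implementation: the class names (SourceCluster, TargetCluster), the event fields, and the dictionary standing in for the warehouse are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Event:
    event_id: int                # sequence number assigned by the source
    op: str                      # "create", "alter" or "drop"
    obj: str                     # name of the affected object
    state: Optional[Any] = None  # self-contained object state (None for drop)


@dataclass
class SourceCluster:
    """Hypothetical stand-in for the source metastore's events table."""
    events: List[Event] = field(default_factory=list)

    def log(self, op: str, obj: str, state: Any = None) -> int:
        event = Event(len(self.events) + 1, op, obj, state)
        self.events.append(event)
        return event.event_id

    def repl_dump(self, from_event_id: int) -> Tuple[List[Event], int]:
        # Return the events newer than from_event_id, plus the id of the
        # last event in the batch (the new replication state).
        batch = [e for e in self.events if e.event_id > from_event_id]
        last_id = batch[-1].event_id if batch else from_event_id
        return batch, last_id


@dataclass
class TargetCluster:
    """Hypothetical stand-in for the destination warehouse."""
    objects: Dict[str, Any] = field(default_factory=dict)
    last_replicated_id: int = 0

    def repl_status(self) -> int:
        return self.last_replicated_id

    def repl_load(self, batch: List[Event]) -> None:
        # Apply events in sequence order; skip anything already applied so
        # that re-loading the same dump is harmless.
        for event in sorted(batch, key=lambda e: e.event_id):
            if event.event_id <= self.last_replicated_id:
                continue
            if event.op == "drop":
                self.objects.pop(event.obj, None)
            else:
                self.objects[event.obj] = event.state
            self.last_replicated_id = event.event_id


source, target = SourceCluster(), TargetCluster()
source.log("create", "sales", {"rows": 10})
source.log("alter", "sales", {"rows": 25})
batch, _ = source.repl_dump(target.repl_status())
target.repl_load(batch)
print(target.objects, target.repl_status())
```

The point of the sketch is that the destination, not the source, remembers how far replication has progressed, so each cycle starts from the id returned by the status call.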
Change Management
 Consider replicating the following batch:
– Insert to table
– Drop table
 The inserted files are still needed for replication after the table is dropped.
 A trash-like directory (the CM directory) captures deleted files.
 Use the checksum to verify the file at its original location; if it is missing or does not match, look it up in the CM directory by checksum.
 Necessary for ordered replication: the state of the destination DB corresponds to the state of the source some duration X back.
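The checksum-based lookup can be sketched on a local file system as follows. This is a minimal illustration under stated assumptions: SHA-256 and a CM directory whose file names are the checksums are choices made for the example, and the function names are hypothetical; Hive's actual checksum and CM layout may differ.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    # Assumption: SHA-256 of the file content stands in for the file checksum.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_to_cm(path: Path, cm_root: Path) -> Path:
    """Move a file that is about to be deleted into the CM directory,
    naming it by its checksum so it can be found again later."""
    cm_root.mkdir(parents=True, exist_ok=True)
    dest = cm_root / checksum(path)
    shutil.move(str(path), str(dest))
    return dest


def resolve_for_replication(path: Path, expected: str, cm_root: Path) -> Path:
    """Return a file whose content matches the checksum recorded in the event:
    the original path if it is still intact, otherwise the CM copy."""
    if path.exists() and checksum(path) == expected:
        return path
    archived = cm_root / expected
    if archived.exists():
        return archived
    raise FileNotFoundError(f"no copy of {path} with checksum {expected}")


# Usage: insert a file, record its checksum, "drop" it, then resolve it.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "part-00000"
    data_file.write_bytes(b"inserted rows\n")
    recorded = checksum(data_file)           # captured with the insert event
    archive_to_cm(data_file, Path(tmp) / "cmroot")  # the drop archives the file
    found = resolve_for_replication(data_file, recorded, Path(tmp) / "cmroot")
    print(found.name == recorded)
```

Keying the archive by checksum means a later event can still find the exact file version it refers to, even if the original path has been deleted or overwritten in the meantime.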
Bootstrapping
 What about data generated before event capturing was enabled?
 Bootstrapping uses the same repl dump/load commands, but is not event based.
 Incremental replication then catches up with the events generated during bootstrap, making the target consistent with the state of the source at some time X in the past.
 Optimized for large databases.
 Parallel dump of a large number of partitions.
REPL Commands
 REPL DUMP <db-name> [FROM <start-evid> [TO <end-evid>] [LIMIT <num-evids>] ];
– Execute this command on the source cluster.
– Returns the dump directory and the last replication state for the dump.
– REPL DUMP <db-name>; bootstraps the whole database.
– REPL DUMP <db-name> FROM <start-evid>; replicates all events after start-evid.
– REPL DUMP <db-name> FROM <start-evid> TO <end-evid>; replicates a range of events.
– REPL DUMP <db-name> FROM <start-evid> LIMIT <num-evids>; replicates a limited set of events.
 REPL LOAD <db-name> FROM <dump-dir>;
– dump-dir is the HDFS URI returned by the REPL DUMP command.
– Execute this command on the target cluster.
 REPL STATUS <db-name>;
– Execute this command on the target cluster.
– Gets the last replicated state of the database on the destination, which should be the input for REPL DUMP as start-evid.
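As a rough aid for scripting these commands, the grammar above can be turned into small statement builders. The helper names are hypothetical, and quoting the dump directory as a string literal in REPL LOAD is an assumption of this sketch; the strings would still have to be submitted through a Hive client such as Beeline or a JDBC connection.

```python
from typing import Optional


def repl_dump(db: str,
              from_evid: Optional[int] = None,
              to_evid: Optional[int] = None,
              limit: Optional[int] = None) -> str:
    """Build a REPL DUMP statement following the grammar on the slide.

    Without from_evid this is a bootstrap dump; TO and LIMIT are only
    valid together with FROM.
    """
    if from_evid is None:
        if to_evid is not None or limit is not None:
            raise ValueError("TO and LIMIT require FROM <start-evid>")
        return f"REPL DUMP {db};"
    stmt = f"REPL DUMP {db} FROM {from_evid}"
    if to_evid is not None:
        stmt += f" TO {to_evid}"
    if limit is not None:
        stmt += f" LIMIT {limit}"
    return stmt + ";"


def repl_load(db: str, dump_dir: str) -> str:
    # Assumption: the dump directory is passed as a quoted string literal.
    return f"REPL LOAD {db} FROM '{dump_dir}';"


def repl_status(db: str) -> str:
    return f"REPL STATUS {db};"


# One incremental cycle, with made-up values for the event id and directory:
print(repl_status("sales"))              # run on the target, yields start-evid
print(repl_dump("sales", from_evid=1200))  # run on the source
print(repl_load("sales", "/hypothetical/repl/dump-dir"))  # run on the target
```

The order of the three printed statements mirrors the cycle: ask the target where it is, dump from that point on the source, then load the returned directory on the target.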
Demonstration
Cloud Migration Challenges
 Move is expensive
– Cloud file systems implement “move” as “copy”.
– Atomic move/rename of data files from the temp directory to the warehouse location on the target.
– Move/rename of data files to the CM directory.
– ACID and micro-managed tables can get rid of CM archival of data files.
 Distcp to copy data files to/from the cloud
– Run distcp from the on-prem cluster for hybrid deployment.
– Optimize distcp to use vendor-specific tools to copy between cloud file systems.
 Data integrity when copying data to/from the cloud
– Data files copied across clusters are verified using the file checksum provided by HDFS.
– Checksums are not consistent across all file systems.
Current Replication Scope
 Supported
– Managed tables with partitions.
– Views.
– UDFs/UDAFs.
– Constraints.
– Ranger authorization policies.
 Limitations
– External tables.
– Manually copied data files to data location.
– SQL Standard-based Authorization.
Future Work
 Replicate ACID/Micro-managed tables.
 Replication to/from cloud storage such as S3 or WASB.
 Hot Data Replication.
 Faster Bootstrapping.
 Optimize Fail Back.
 Replicate column statistics, indexes, etc.
 Table-level replication.
References: Hive Configurations for Replication
Hive Configuration | Recommendation | Description
hive.metastore.transactional.event.listeners | org.apache.hive.hcatalog.listener.DbNotificationListener | Enable event logging
hive.metastore.event.db.listener.timetolive | 86400s/RPO | Expiry time for the events logged in metastore
hive.repl.rootdir | Any valid HDFS directory | Root directory used by repl dump
hive.metastore.dml.events | true | Enable event generation for DML operations
hive.repl.cm.enabled | true | Enable change management to archive deleted data files
hive.repl.cm.retain | 24hr/RPO | Expiry time for CM backed-up data files
hive.repl.cm.interval | 3600s | Time interval to look up expired data files in CM
hive.repl.cmrootdir | Any valid HDFS directory | Root directory for Change Manager
hive.repl.replica.functions.root.dir | Any valid HDFS directory | Root directory to store UDFs/UDAFs jars. Config needed in Target cluster.
hive.repl.approx.max.load.tasks | 1000 / depending on memory capacity of target warehouse | Limit the number of execution tasks to control the memory consumption. Config needed in Target cluster.
hive.repl.partitions.dump.parallelism | 8 / depends on CPU usage | Number of threads to concurrently dump partitions
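One way to apply a subset of these settings is to render them as hive-site.xml properties. The sketch below does this with Python's standard library; the two directory values are hypothetical placeholders (the slide only asks for "any valid HDFS directory"), and the remaining values are copied from the table above.

```python
import xml.etree.ElementTree as ET

# Source-cluster settings taken from the table; the paths are placeholders.
SETTINGS = {
    "hive.metastore.transactional.event.listeners":
        "org.apache.hive.hcatalog.listener.DbNotificationListener",
    "hive.metastore.event.db.listener.timetolive": "86400s",
    "hive.metastore.dml.events": "true",
    "hive.repl.cm.enabled": "true",
    "hive.repl.rootdir": "/hypothetical/hive/repl",        # assumption
    "hive.repl.cmrootdir": "/hypothetical/hive/cmroot",    # assumption
}


def to_hive_site(settings: dict) -> str:
    """Render name/value pairs in the Hadoop-style <configuration> format."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(to_hive_site(SETTINGS))
```

In practice the retention values would be tuned to the RPO, as the table suggests, rather than left at the defaults shown here.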
References: Apache Hive Documentation
https://cwiki.apache.org/confluence/display/Hive/Home
https://cwiki.apache.org/confluence/display/Hive/HiveReplicationv2Development
https://cwiki.apache.org/confluence/display/Hive/HiveReplicationDevelopment
https://cwiki.apache.org/confluence/display/Hive/Replication
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
https://issues.apache.org/jira/browse/HIVE-14841
Learn More
Apache Hive, Apache HBase and Apache Phoenix
Birds of a Feather
Thursday, September 21 @ 6:00 pm
C 2.3
https://dataworkssummit.com/sydney-2017/birds-of-a-feather/apache-hive-apache-hbase-apache-phoenix/
THANK YOU!

 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 

Disaster Recovery and Cloud Migration for your Apache Hive Warehouse

  • 1. Disaster Recovery and Cloud Migration for your Apache Hive Warehouse. DataWorks Summit – Sydney, September 2017. © Hortonworks Inc. Confidential 2011 – 2017. All Rights Reserved.
  • 2. Presenter: Sankar Hariappan, Senior Software Engineer, Hortonworks
  • 3. Agenda
      - About Apache Hive
      - Disaster Recovery
      - Replication Modes
      - Fail Over
      - Fail Back
      - Replication at Hive-Scale
      - Event Based Replication
      - Change Management
      - Bootstrapping
      - REPL Commands
      - Demonstration
      - Cloud Migration Challenges
      - Future Work
  • 4. About Apache Hive
      - Data warehouse tool built on top of Apache Hadoop.
      - Handles data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
      - Manages large datasets residing in distributed storage.
      - SQL with Hive-specific extensions.
      - Query optimization powered by Apache Calcite and execution via Apache Tez, Apache Spark, or MapReduce.
      - Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
      - Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
  • 5. Apache Hive Architecture
      [Diagram components: JDBC/ODBC, Hive Thrift Server, HiveServer2 (Driver, Compiler, Optimizer, Executor), MS Client, Hive Metastore, RDBMS, HDFS, YARN.]
  • 6. Disaster Recovery
      - Deployment of clusters in more than one data center for business continuity or geo-localization.
      - Hybrid cloud deployment for off-premise processing.
      - Robust replication solution to achieve seamless disaster recovery.
          - Prevent severe data loss.
          - Eliminate single point of failure.
          - Fault-tolerant.
  • 7. Replication Modes
      - Master-Slave: unidirectional replication from a Master (read/write) to a Slave (read).
      - Master-Master: bidirectional replication between Masters (each read/write).
      [Diagram: the Master-Master figure also shows a read-only Slave alongside the read/write Masters.]
  • 8. Replication Modes
      - Hub and Spoke pattern: one Master (read/write) replicating to several Slaves (read).
      - Relay pattern: a Master (read/write) replicating to a Slave (read), which in turn replicates to another Slave (read).
  • 9. Fail Over
      - The Slave takes over the Master's responsibilities instantaneously.
      - Ensure business continuity with minimal data loss, based on the Recovery Point Objective (RPO).
      - Almost zero downtime, or at least meet the Recovery Time Objective (RTO).
      [Diagram: unidirectional replication from Master (read/write) to Slave; on fail over the Slave serves reads and writes.]
  • 10. Fail Back
      - The Slave cluster usually has minimal processing capability, which makes Fail Back an important requirement.
      - The original Master comes alive with the latest data.
      - Ensure removal of stale data that was not replicated to the Slave.
      - Reverse replicate the delta of data loaded into the Slave after Fail Over.
      [Diagram: Master (read/write), Slave (read), unidirectional replication, fail back.]
  • 11. Replication at Hive-Scale
      - Event-based replication.
      - The first version of Hive Replication (Replv1) uses EXPORT-IMPORT semantics to replicate data.
          - Inefficient mechanism.
          - 4X copy problem.
          - Rubber-banding issue.
          - Depends on external tools such as Falcon/Oozie to manage replication state.
      - The second version of Hive Replication (Replv2) uses REPL commands (HIVE-14841).
          - Point-in-time replication.
          - Reduced number of copies.
          - Hive maintains the replicated state.
          - Additional support for replicating functions and constraints.
          - Available in Apache Hive trunk.
  • 12. Event Logging
      [Diagram: a JDBC/ODBC client runs a query on HiveServer2; the Hive Metastore manages metadata and the Events table lives in the Metastore RDBMS.]
  • 13. Event Logging
      - Captured events: Create/Alter/Drop on DB/Table/Partition/Function/Constraint objects.
      - Events are stored in the Metastore RDBMS.
      - Each event is self-contained: it carries what is needed to recover the state of the object (metadata + data).
      - Events are ordered by a sequence number (event id).
  • 14. Event Based Replication
      [Diagram: on the Master cluster, HiveServer2 runs REPL DUMP, serializes the new batch of events from the Events table in the Metastore RDBMS, and writes the dump (metadata + data) to HDFS. On the Slave cluster, HiveServer2 runs REPL LOAD, reads the repl dump dir, writes objects through the Metastore API into its Metastore RDBMS, and copies data files to its HDFS with DistCp.]
  • 15. Event Based Replication
      - Read a batch of events from the Metastore RDBMS in the sequence in which they were generated.
      - "repl dump <db name> from <event id>"
          - Gets events newer than <event id>.
          - Includes data file information.
          - <event id> is the last replicated event id for the DB, obtained from the destination cluster.
      - "repl load <db name> from <hdfs URI>"
          - Applies the events on the destination.
      - State is currently replicated in batches; this can be optimized in the future.
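The dump/load cycle described on this slide can be modelled in a few lines. The following is a minimal in-memory sketch in Python, not Hive's implementation: the `Event` and `Destination` classes, the event kinds, and the function names are all hypothetical and exist only to show "events newer than the last replicated id, applied in order".

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Event:
    event_id: int   # sequence number assigned when the event is logged
    db: str
    kind: str       # hypothetical kinds: "CREATE_TABLE", "DROP_TABLE"
    table: str


def repl_dump(log: List[Event], db: str, from_id: int,
              limit: Optional[int] = None) -> List[Event]:
    """Return the events for `db` that are newer than `from_id`, in id order."""
    batch = sorted(
        (e for e in log if e.db == db and e.event_id > from_id),
        key=lambda e: e.event_id,
    )
    return batch if limit is None else batch[:limit]


class Destination:
    """Toy destination that tracks table names and the last applied event id."""

    def __init__(self) -> None:
        self.tables: Set[str] = set()
        self.last_event_id = 0

    def repl_load(self, batch: List[Event]) -> None:
        for e in batch:
            if e.event_id <= self.last_event_id:
                continue  # already applied, so replaying a batch is harmless
            if e.kind == "CREATE_TABLE":
                self.tables.add(e.table)
            elif e.kind == "DROP_TABLE":
                self.tables.discard(e.table)
            self.last_event_id = e.event_id
```

Each cycle passes the destination's `last_event_id` back to `repl_dump`, which is the role the slide assigns to the last replicated event id.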
  • 16. Change Management
      - Consider replicating the following batch:
          - Insert to table
          - Drop table
      - The inserted files are still needed for replication after the drop.
      - A trash-like directory captures deleted files (the CM directory).
      - Use the checksum to verify a file at its original location; if it does not match, look the file up in the CM directory by checksum.
      - Necessary for ordered replication: the state of the destination DB corresponds to the state of the source some duration X back.
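The checksum-then-CM lookup above can be sketched as follows. This is an illustrative Python model under stated assumptions: the `Warehouse` class and its methods are hypothetical, files are held in dictionaries rather than HDFS, and SHA-256 stands in for whatever file checksum the real system uses.

```python
import hashlib
from typing import Dict


def checksum(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the file system's checksum.
    return hashlib.sha256(data).hexdigest()


class Warehouse:
    """Toy source warehouse with a trash-like CM store keyed by checksum."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}  # path -> current content
        self.cm: Dict[str, bytes] = {}     # checksum -> archived content

    def write(self, path: str, data: bytes) -> str:
        old = self.files.get(path)
        if old is not None:
            self.cm[checksum(old)] = old   # archive content being overwritten
        self.files[path] = data
        return checksum(data)              # recorded alongside the event

    def delete(self, path: str) -> None:
        data = self.files.pop(path)
        self.cm[checksum(data)] = data     # archive instead of discarding

    def read_for_replication(self, path: str, expected: str) -> bytes:
        data = self.files.get(path)
        if data is not None and checksum(data) == expected:
            return data                    # file is unchanged at its path
        if expected in self.cm:
            return self.cm[expected]       # fall back to the CM store
        raise FileNotFoundError(f"{path} with checksum {expected}")
```

An insert followed by a drop then still replicates in order: the insert event carries the checksum, and the dropped file is served from the CM store.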
  • 17. Bootstrapping
      - What about data generated before event capturing was enabled?
      - Bootstrapping uses the same repl dump/load commands, but is not event based.
      - Incremental replication then catches up with the events generated during the bootstrap, making the destination consistent with the state of the source at a time X in the past.
      - Optimized for large databases.
      - Parallel dump of a large number of partitions.
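The parallel partition dump mentioned above can be pictured as a bounded thread pool. This Python sketch is an assumption about the general shape only, not Hive's code: `dump_partitions` and the `dump_one` callback are hypothetical names, and the default of 8 simply mirrors the value recommended for hive.repl.partitions.dump.parallelism in the configuration slide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def dump_partitions(partitions: List[str],
                    dump_one: Callable[[str], str],
                    parallelism: int = 8) -> List[str]:
    """Dump every partition with at most `parallelism` worker threads.

    `dump_one` is a caller-supplied function that dumps one partition and
    returns its dump location. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(dump_one, partitions))
```

A caller would pass the partition names of a table and a function that writes one partition's metadata and file list to the dump directory.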
  • 18. REPL Commands
      - REPL DUMP <db-name> [FROM <start-evid> [TO <end-evid>] [LIMIT <num-evids>] ];
          - Execute this command in the source cluster.
          - Returns the dump directory and the last replication state for the dump.
          - REPL DUMP <db-name>; bootstraps the whole database.
          - REPL DUMP <db-name> FROM <start-evid>; replicates all events after start-evid.
          - REPL DUMP <db-name> FROM <start-evid> TO <end-evid>; replicates a range of events.
          - REPL DUMP <db-name> FROM <start-evid> LIMIT <num-evids>; replicates a limited set of events.
      - REPL LOAD <db-name> FROM <dump-dir>;
          - dump-dir is the HDFS URI returned by the REPL DUMP command.
          - Execute this command in the target cluster.
      - REPL STATUS <db-name>;
          - Execute this command in the target cluster.
          - Gets the last replicated state of the database in the destination, which should be the input for REPL DUMP as start-evid.
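To make the grammar above concrete, here is a small Python sketch that only builds the statement strings; it does not talk to Hive. The helper names are hypothetical, and quoting the dump directory as a string literal is an assumption of this sketch, since the slide shows the directory unquoted.

```python
from typing import Optional


def repl_dump_stmt(db: str, start: Optional[int] = None,
                   end: Optional[int] = None,
                   limit: Optional[int] = None) -> str:
    """Build a REPL DUMP statement following the slide's grammar."""
    if start is None:
        if end is not None or limit is not None:
            raise ValueError("TO and LIMIT are only valid after FROM")
        return f"REPL DUMP {db};"          # bootstrap the whole database
    stmt = f"REPL DUMP {db} FROM {start}"
    if end is not None:
        stmt += f" TO {end}"
    if limit is not None:
        stmt += f" LIMIT {limit}"
    return stmt + ";"


def repl_load_stmt(db: str, dump_dir: str) -> str:
    # Assumption: the dump directory is passed as a quoted string literal.
    return f"REPL LOAD {db} FROM '{dump_dir}';"


def repl_status_stmt(db: str) -> str:
    return f"REPL STATUS {db};"
```

One incremental cycle would then be: run the REPL STATUS statement on the target, feed the returned id into `repl_dump_stmt` as `start` on the source, and run the REPL LOAD statement on the target with the returned dump directory.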
  • 19. Demonstration
  • 20. Cloud Migration Challenges
      - Move is expensive
          - Cloud file systems have implemented "move" as "copy".
          - Atomic move/rename of data files from the temp directory to the warehouse location in the target.
          - Move/rename of data files to the CM directory.
          - ACID and micro-managed tables can get rid of CM archival of data files.
      - Distcp to copy data files to/from cloud
          - Run distcp from the on-prem cluster for hybrid deployment.
          - Optimize distcp to use a vendor-specific tool to copy between cloud file systems.
      - Data integrity when copying data to/from cloud
          - Verification of data files copied across clusters using the HDFS-provided file checksum.
          - The checksum is not consistent across all file systems.
  • 21. Current Replication Scope
      - Supported
          - Managed tables with partitions.
          - Views.
          - UDFs/UDAFs.
          - Constraints.
          - Ranger authorization policies.
      - Limitations
          - External tables.
          - Manually copied data files to data location.
          - SQL Standard-based Authorization.
  • 22. Future Work
      - Replicate ACID/Micro-managed tables.
      - Replication to/from cloud storage such as S3 or WASB.
      - Hot Data Replication.
      - Faster Bootstrapping.
      - Optimize Fail Back.
      - Replicate Column Statistics, Index etc.
      - Table level replication.
  • 23. References: Hive Configurations for Replication
      Hive Configuration | Recommendation | Description
      hive.metastore.transactional.event.listeners | org.apache.hive.hcatalog.listener.DbNotificationListener | Enable event logging.
      hive.metastore.event.db.listener.timetolive | 86400s/RPO | Expiry time for the events logged in metastore.
      hive.repl.rootdir | Any valid HDFS directory | Root directory used by repl dump.
      hive.metastore.dml.events | true | Enable event generation for DML operations.
      hive.repl.cm.enabled | true | Enable change management to archive deleted data files.
      hive.repl.cm.retain | 24hr/RPO | Expiry time for CM backed-up data files.
      hive.repl.cm.interval | 3600s | Time interval to look up expired data files in CM.
      hive.repl.cmrootdir | Any valid HDFS directory | Root directory for Change Manager.
      hive.repl.replica.functions.root.dir | Any valid HDFS directory | Root directory to store UDFs/UDAFs jars. Config needed in Target cluster.
      hive.repl.approx.max.load.tasks | 1000 / depending on memory capacity of target warehouse | Limit the number of execution tasks to control the memory consumption. Config needed in Target cluster.
      hive.repl.partitions.dump.parallelism | 8 / depends on CPU usage | Number of threads to concurrently dump partitions.
  • 24. References: Apache Hive Documentation
      - https://cwiki.apache.org/confluence/display/Hive/Home
      - https://cwiki.apache.org/confluence/display/Hive/HiveReplicationv2Development
      - https://cwiki.apache.org/confluence/display/Hive/HiveReplicationDevelopment
      - https://cwiki.apache.org/confluence/display/Hive/Replication
      - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
      - https://issues.apache.org/jira/browse/HIVE-14841
  • 25. Learn More
      - Apache Hive, Apache HBase and Apache Phoenix Birds of a Feather: Thursday, September 21 @ 6:00 pm, C 2.3
      - https://dataworkssummit.com/sydney-2017/birds-of-a-feather/apache-hive-apache-hbase-apache-phoenix/
  • 26. THANK YOU!