1. Document Title GCP Data Ingestion – History data migration (One time load)
Version 1.0
Document Summary This document describes the one-time migration of Hive history data to BigQuery (BQ).
Team GDIA – ENOP
2. Copy Hive data to BQ
Description:
Irrespective of the source format, each Hive table is first converted to an external Hive table in ORC format. The ORC files are then copied from HDFS to GCS using DistCp, and the BQ load picks up the files from GCS and loads the data into the BQ tables.
Approach:
Step 1: Convert the source Hive table (any format) into an interim external Hive table stored as ORC.
Step 2: Copy the ORC files from HDFS to the GCS bucket using DistCp.
Step 3: Load the files from GCS into the target BQ table.
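The approach above (ORC conversion, DistCp copy, BQ load) can be sketched as the following shell sequence. All database, table, path, and bucket names are placeholders, not values from this document; `DRY_RUN=1` prints each command instead of executing it, so the flow can be inspected without a Hadoop/GCP environment.

```shell
#!/bin/sh
# Sketch of the three-stage history load. Names are hypothetical.
DRY_RUN=1
run() {
  # Print the command when dry-running; otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

SRC_DB="source_db"; INT_DB="interim_db"; TBL="my_table"
HDFS_PATH="/data/interim/my_table"
GCS_PATH="gs://my-bucket/history/my_table"
DATASET="my_dataset"; BQ_TABLE="my_table"

# Step 1: materialise the source table as an external ORC table.
# (External CTAS needs a recent Hive; older versions require a
# separate CREATE EXTERNAL TABLE followed by INSERT OVERWRITE.)
run hive -e "CREATE EXTERNAL TABLE ${INT_DB}.${TBL} STORED AS ORC LOCATION '${HDFS_PATH}' AS SELECT * FROM ${SRC_DB}.${TBL}"

# Step 2: copy the ORC files from HDFS to the GCS bucket.
run hadoop distcp "${HDFS_PATH}" "${GCS_PATH}"

# Step 3: load the ORC files from GCS into the BQ table.
run bq load --source_format=ORC "${DATASET}.${BQ_TABLE}" "${GCS_PATH}/*"
```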
Additional Artifacts:
Log file generation, and a mail alert for success/failure scenarios.
Pre-requisite:
1. The gcloud utility should be installed on the HPC node (via PuTTY). Please follow the instructions below:
export no_proxy="localhost, 127.0.0.1, .ford.com"
export https_proxy=http://internet.ford.com:83
export http_proxy=http://internet.ford.com:83
export HTTPS_PROXY=http://internet.ford.com:83
export HTTP_PROXY=http://internet.ford.com:83
curl https://sdk.cloud.google.com | bash
2. The Hive database should be created beforehand in the appropriate path.
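After installing the SDK, the service-account key (the same JSON key file the script later takes as its third argument) can be activated with gcloud. The key-file path and project id below are placeholders, not values from this document:

```shell
#!/bin/sh
# Hypothetical key path and project id, for illustration only.
KEY_FILE="path/gcp_key.json"
PROJECT_ID="my-gcp-project"

# Skip silently when gcloud is not on PATH (e.g. before the SDK install).
if command -v gcloud >/dev/null 2>&1; then
  gcloud auth activate-service-account --key-file="$KEY_FILE"
  gcloud config set project "$PROJECT_ID"
fi
```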
Input:
Config file with 7 comma-separated values (Hive target DB, Hive source DB, Hive table name, HDFS path, GCS path, dataset name, BQ table name)
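One way the script might consume such a line (the field order follows the list above; every value below is a placeholder, and the variable names are not taken from the actual script):

```shell
#!/bin/sh
# A hypothetical config line with the 7 comma-separated fields.
CONFIG_LINE="interim_db,source_db,my_table,/data/hdfs/my_table,gs://my-bucket/my_table,my_dataset,my_table"

# Split the line into named variables in the documented order.
IFS=',' read -r HIVE_TGT_DB HIVE_SRC_DB HIVE_TBL HDFS_PATH GCS_PATH DATASET BQ_TBL <<EOF
$CONFIG_LINE
EOF

echo "Loading ${HIVE_SRC_DB}.${HIVE_TBL} -> ${DATASET}.${BQ_TBL}"
```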
Flow: Source Hive Table (any format) → Interim Hive Table (external, ORC) → [DistCp] → GCS Bucket → [BQ Load] → BigQuery
3. Sample config file:
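A hypothetical line following the 7-field layout described under Input (every value is a placeholder):

```
interim_db,source_db,customer_txn,/user/hive/warehouse/customer_txn,gs://my-bucket/history/customer_txn,my_dataset,customer_txn
```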
Script:
history load.txt
Script Execution:
Three parameters are passed as command-line arguments to the shell script.
1. Input config file -> Generated manually by the user; its contents are described in the previous steps.
2. Email id -> The address to which the success/failure alert should be sent.
3. JSON key file -> The RM has access to generate the vault key for each environment.
Modification to be done in script:
In line #4, the log path should be updated: replace <path> with the Unix directory path where the log file should be saved.
Command to execute the script using putty:
sh onprem_hist.sh path/config.txt email_id path/gcp_key.json
4. Output:
Log file:
Mail Alert:
History Load Status.msg
Note:
Ensure that the interim database and tables created as part of the history load are dropped.
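A cleanup sketch, with a hypothetical interim database name and HDFS path. Note that dropping an external table does not delete its underlying HDFS files, so the interim HDFS directory should be removed as well:

```shell
#!/bin/sh
INT_DB="interim_db"          # hypothetical interim database
HDFS_PATH="/data/interim"    # hypothetical interim HDFS directory
CLEANUP_HQL="DROP DATABASE IF EXISTS ${INT_DB} CASCADE;"

# Guard the real commands so the sketch is safe to run anywhere.
if command -v hive >/dev/null 2>&1; then
  hive -e "$CLEANUP_HQL"            # CASCADE drops the interim tables too
  hdfs dfs -rm -r -f "$HDFS_PATH"   # external table data stays until removed
fi
```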