This document provides information about Dataproc, Google Cloud's fully managed Spark and Hadoop service. It discusses how Dataproc allows users to create clusters on-demand to process large datasets in a flexible and cost-effective manner. It also covers how Dataproc integrates with other Google Cloud services and provides open-source tools like Spark, Hadoop, Hive and Pig. Additionally, it summarizes best practices for using Dataproc such as leveraging initialization actions, specifying cluster versions, and using the Jobs API for submissions.
4. Dataproc
Google Cloud Platform’s fully managed data analytics service
● Rapid cluster creation
● Familiar open source tools
● Customizable hardware and software
● Ephemeral clusters on-demand
● Integrated with other GCP services
5. Cloud Dataproc: Open source solutions with GCP
[Diagram: Cloud Dataproc alongside WebHCat and other GCP services: BigQuery, Cloud Datastore, Cloud Bigtable, Compute Engine, Kubernetes Engine, Cloud Dataflow, Cloud Functions, Cloud Vision API, Cloud Storage, Key Management Service, Cloud Machine Learning Engine, Cloud Pub/Sub, Cloud Spanner, Cloud SQL, Cloud Translation API, BigQuery Transfer Service]
6. Dataproc
Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service to easily process big data sets at low cost
8. Dataproc brings cloud economics to Spark and Hadoop
Spark and Hadoop often have poor economics and scalability:
● Idle clusters: most traditional clusters are utilized only a portion of the time they’re online
● Scaling inflexibility: job demand can be hard to predict, and scaling can take considerable time
Dataproc addresses both:
● Anytime clusters: clusters aren’t idle; run clusters only when you need them
● Flexible scaling: by scaling clusters at any time, your jobs can get exactly the resources they need when required
9. Decouple storage from compute
[Diagram: Development and Test Cloud Dataproc clusters and a Production Cloud Dataproc cluster read from shared data sources (Cloud Storage, BigQuery, Cloud Bigtable) and write to data sinks (Cloud Storage, BigQuery, Cloud Bigtable); application logs land in Cloud Storage, and external applications access storage directly]
Credits : Google Cloud Documentation
10. Cloud Dataproc - Node Types
[Diagram: clients call the Cloud Dataproc API (Clusters, Operations, Jobs, Workflow Templates). A Cloud Dataproc cluster consists of master node(s) and primary workers on Compute Engine (compute + storage, in a managed instance group) plus secondary (PVM) workers (compute only), with a cluster bucket in Google Cloud Storage, all running on a Cloud Network and governed by Cloud IAM]
Credits : Google Cloud Documentation
11. The Pros and Cons of Cloud Storage versus HDFS
PROS
● Lower costs
● Separation of compute and storage
● Interoperability
● HDFS compatibility with equivalent (or better) performance
● High data availability
● No storage management overhead
● Quick startup
● Cloud IAM security
● Global consistency
CONS
● Cloud Storage may increase I/O variance
● Cloud Storage doesn’t support file appends or truncates
● Cloud Storage isn’t POSIX-compliant
● Cloud Storage may not expose all file system information
● Cloud Storage may have greater request latency
Credits :
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
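In practice, much of an HDFS-to-Cloud Storage migration amounts to swapping the hdfs:// scheme for gs:// in job paths. A minimal sketch of that rewrite (the bucket name my-bucket and the helper itself are hypothetical, not a Dataproc API):

```python
def hdfs_to_gcs(path: str, bucket: str) -> str:
    """Rewrite an HDFS URI to the equivalent Cloud Storage URI.

    hdfs://namenode:8020/data/events -> gs://<bucket>/data/events
    """
    if not path.startswith("hdfs://"):
        return path  # already a gs:// or local path; leave untouched
    # drop the scheme and the namenode authority, keep the file path
    _, _, rest = path[len("hdfs://"):].partition("/")
    return f"gs://{bucket}/{rest}"
```

With the Cloud Storage connector installed (it is preinstalled on Dataproc), Spark and Hadoop jobs can read gs:// paths directly, no other code change needed.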
16. Scaling clusters - PVMs
● Dataproc has "secondary workers", which are preemptible VMs by default
● Processing only: they do not store data
● Same instance type as on-demand worker nodes
Recommendation:
Start with 0 PVMs and slowly tune upwards. Going above 50% PVMs is not recommended.
PROS
● Cheaper
CONS
● Jobs can lose progress, and eventually fail, if there are too many preemptions
● May not be appropriate for all workloads
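The 50% guidance above can be enforced with a simple check when sizing a cluster (a hypothetical helper, not part of any Dataproc API):

```python
def pvm_fraction(primary: int, secondary: int) -> float:
    """Fraction of workers that are preemptible (secondary) VMs."""
    total = primary + secondary
    return secondary / total if total else 0.0

def check_pvm_ratio(primary: int, secondary: int, limit: float = 0.5) -> bool:
    """True if the preemptible share stays at or below the limit."""
    return pvm_fraction(primary, secondary) <= limit
```

For example, 10 primary and 5 secondary workers (a third preemptible) passes the check, while 4 primary and 6 secondary does not.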
18. Scaling clusters - Autoscaling
[Flowchart: Is there too much or too little YARN memory? No → do nothing. Yes → is the cluster at the maximum # of nodes? Yes → do not autoscale. No → determine the type and scale of nodes to modify, then autoscale the cluster]
● Optimize resource usage
● Decommission workers when not in use - savings
● No manual intervention required
● Strike the right balance of primary and preemptible workers
19. Scaling clusters - Autoscaling policy
An autoscaling policy is a reusable configuration that describes how clusters using it should scale.
● An autoscaling policy is an independent entity; it can be attached to one or more clusters (recommended only if they share similar workloads)
● It is not mandatory to define an autoscaling policy during cluster creation
● Policies can be modified on the fly
● You can scale to any number of worker nodes as long as there are no quota restrictions
● Worker nodes come up in 2-3 minutes and are then ready to be used
20. Scaling clusters - Autoscaling Algorithm
● Number of workers required:
exact Δworkers = avg(pending memory - available memory) / memory per worker
● Aggressiveness: scaleUpFactor: 0.5, scaleDownFactor: 0.5
if exact Δworkers > 0:
    actual Δworkers = ROUND_UP(exact Δworkers * scaleUpFactor)
    # example:
    # ROUND_UP(exact Δworkers=5 * scaleUpFactor=0.5) = 3
else:
    actual Δworkers = ROUND_DOWN(exact Δworkers * scaleDownFactor)
    # example:
    # ROUND_DOWN(exact Δworkers=-5 * scaleDownFactor=0.5) = -2
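The pseudocode above can be sketched in runnable Python. ROUND_UP is taken as ceiling and ROUND_DOWN as truncation toward zero, which is the interpretation that matches both worked examples:

```python
import math

def actual_delta(exact_delta: float,
                 scale_up_factor: float = 0.5,
                 scale_down_factor: float = 0.5) -> int:
    """Dampen the exact worker delta by the configured aggressiveness."""
    if exact_delta > 0:
        # scale up: round away from zero so we never under-provision
        return math.ceil(exact_delta * scale_up_factor)
    # scale down: round toward zero so workers are removed conservatively
    return math.trunc(exact_delta * scale_down_factor)
```

With the defaults, actual_delta(5) returns 3 and actual_delta(-5) returns -2, matching the examples above.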
21. Scaling clusters - Autoscaling Algorithm
● Boundaries:
workerConfig:
  minInstances: 2
  maxInstances: 100
  weight: 2
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
  weight: 1
Weight signifies the proportion of new workers to be added for each node type. For example, if the autoscaler recommends 6 nodes, then according to the above policy:
6 * (2 / 3) = 4 nodes = primary workers
6 * (1 / 3) = 2 nodes = secondary workers
● Frequency:
○ cooldownPeriod: time window after which the autoscaling check/update happens
○ gracefulDecommissionTimeout: finish work in progress on a worker before it is removed from the Cloud Dataproc cluster, to avoid losing work in progress
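The weighted split above can be computed as follows (a small illustrative helper, using the 2:1 weights from the policy):

```python
def split_by_weight(total_nodes: int, primary_weight: int, secondary_weight: int):
    """Split a recommended node count between primary and secondary
    workers in proportion to their configured weights."""
    weight_sum = primary_weight + secondary_weight
    primary = round(total_nodes * primary_weight / weight_sum)
    secondary = total_nodes - primary  # remainder goes to secondary workers
    return primary, secondary
```

split_by_weight(6, 2, 1) yields (4, 2), matching the slide's example.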
22. Scaling clusters - Recommendations for autoscaling policy
● Cooldown period: default = 2m
○ Should not be less than 2m
○ We used a value of 4 minutes
● gracefulDecommissionTimeout: finishes work in progress on a worker before it is removed
○ Default = disabled
○ It should be larger than the longest-running job
● scaleUpFactor for Spark jobs:
○ With dynamic allocation enabled (the default), Spark continues to double the number of executors
○ Recommended to set scaleUpFactor to 1.0 (100%)
○ Smooths out pending memory (fewer pending memory spikes)
● For MapReduce jobs, a good starting point for scaleUpFactor is 0.05 or 0.1, then gradually increase it
○ MapReduce jobs are generally short-lived, so unless a job lasts for several minutes keep scaleUpFactor low
● To scale down only when a cluster is idle, set scaleDownFactor and scaleDownMinWorkerFraction to 1.0
○ MapReduce and Spark write intermediate shuffle data to local disk; removing workers with shuffle data will set job progress back
23. Scaling clusters - Enhanced flexibility mode (Flex)
● What is it?
○ When a Dataproc node is removed, Flex mode preserves stateful node data, such as MapReduce shuffle data, in HDFS
● Particularly useful when using preemptible VMs heavily as your worker nodes, or when you only autoscale the preemptible worker group
● Nodes can be removed from the cluster as soon as all containers running on them have finished; shuffle data will have been written to HDFS
Observations:
● Started with 2 primary workers - HDFS became full
● We can increase the number of primary worker nodes OR add --num-local-ssds to primary workers
● We can also write shuffle data to Cloud Storage
25. Dataproc Security - Networking
● Create a custom VPC or use a private subnet for Dataproc clusters
○ Add a strict firewall rule allowing 10.128.0.0/16
○ This allows master and worker nodes to communicate
● Internal IP only
○ Include the --no-address flag in the cluster creation command
○ The subnet should have ‘Private Google Access’ enabled so that it can reach GCP APIs internally
○ In order to download files from the internet (for example, when installing extra software via an initialization script), Cloud NAT is required
● Cloud Identity-Aware Proxy (IAP):
○ Cloud IAP works by verifying the user identity and context (e.g. device status, location) of a request to determine whether the user should be allowed to access an application or a VM
○ Access VMs from untrusted networks without the use of a VPN
○ You can use gcloud to SSH into the master node: gcloud compute ssh … --tunnel-through-iap
26. Dataproc Security - Cloud DNS
● Problem statement:
○ The workflow contains a variable ${NameNodeIP} which should point to the master of the Dataproc cluster
○ The value is passed via a properties file when submitting jobs
○ You spin up a new Dataproc cluster with a different name - will you go back and change the master hostname in the properties file every time before submitting the job?
● Using a private hosted zone in Cloud DNS:
○ Map the master node’s internal IP to a private DNS endpoint (internal.abc.com → 10.0.0.1)
○ You can use a single DNS endpoint in your properties without worrying which Dataproc cluster it points to
○ The DNS entries can be modified as you create new master nodes
Common use cases:
● Useful when moving from a staging environment to a production environment - no code change required
● Using a Cloud DNS private hosted zone with VPC peering, you can access Dataproc resources across projects too
27. Dataproc Security - Access Management
● Individual service account for each Dataproc cluster
○ Do not use the default Dataproc service account for every cluster
● Separate roles for various access levels for each GCP service
○ For example, roles for access to GCS can cover list, read, read-write, read-write-modify
○ Different roles for read and read-write access to VM disks, if you have hosted any open-source database on Compute Engine
● A challenge - a compliance necessity:
○ Clusters placed in the US should have read-write access only to buckets starting with a prefix, say, ‘gcs-us-*’, and clusters placed in the EU should only have access to buckets with prefix ‘gcs-eu-*’
● Solution = bucket ACLs. Is it scalable?
○ There can be numerous buckets with prefix gcs-us-*
○ New buckets will be added in the future
28. Dataproc Security - Access Management
Solution: IAM Conditions (Beta)
● Read-write role attached along with the following condition at project level:
resource.service == "storage.googleapis.com" &&
resource.name.startsWith("projects/<PROJECT-NAME>/buckets/gcs-eu")
● Read-only role attached along with the following condition at project level:
resource.service == "storage.googleapis.com" &&
resource.name.startsWith("projects/<PROJECT-NAME>/buckets/gcs-read-only")
Note that startsWith matches a literal prefix, so no trailing wildcard is needed.
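The prefix logic these CEL conditions encode can be sketched in Python for clarity (an illustrative model of the condition, not how IAM evaluates it internally):

```python
def allowed(resource_service: str, resource_name: str,
            project: str, prefix: str) -> bool:
    """Mimic the CEL condition: the service must be Cloud Storage and the
    bucket resource name must start with the given prefix."""
    return (resource_service == "storage.googleapis.com"
            and resource_name.startswith(f"projects/{project}/buckets/{prefix}"))
```

A cluster with the gcs-eu condition can touch projects/p/buckets/gcs-eu-data but not gcs-us-data, and the condition never matches requests to other services.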
31. Extensibility - Apache Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs:
● Directed Acyclic Graphs (DAGs) of
○ control flow
○ action nodes
● The scheduler triggers jobs by time (frequency) and data availability
● Supports Spark applications, MapReduce, Pig, Hive, Sqoop and DistCp, as well as system-specific jobs such as Java programs and shell scripts
● Workflows can be parameterized
○ Run the same workflow for different clients by passing different values for variables
Common use cases:
● Shell scripts to clean GCS paths if already present (useful for rerunning workflows), or to get the input path of the latest available data
● Running Spark/Hive actions (the output of shell actions can be the input of following actions)
● Launching child workflows (re-using already built workflows)
● Sending workflow status emails, with error logs in the subject if required
32. Extensibility - Apache Hue
Hue is a web interface for analyzing data with Apache Hadoop.
● HDFS file browser
● Hive query editor
● Pig query editor
● Oozie interface for workflows
● Hadoop shell access
● User admin interface
● Submit Spark and PySpark jobs
36. Setting up Stackdriver alerts
● Some example alerts:
○ HDFS storage utilization for the cluster above a threshold of 60 or 80%
○ Number of unhealthy nodes in the cluster above 0
■ Workers become unhealthy when the local disk is full or above the threshold defined in yarn-site.xml
■ Consider increasing the disk size of your workers
○ Uptime check of the master node
○ Whether all HDFS DataNodes are functional
■ Total HDFS size is determined by the total boot disk size of all primary worker nodes
● Automating creation of Stackdriver alerts as part of bootstrapping:
○ Create a YAML file defining the Stackdriver alerts, the conditions that trigger them, notification channels, etc.
○ Use placeholders such as {PROJECT-ID} and {CLUSTER-NAME}
○ Fetch metadata using GCP internal APIs, replace the placeholders with actual values, and import the alert using the final YAML file
○ Can be added as part of bootstrap, or a Cloud Function can be used
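The placeholder-substitution step above can be sketched as plain string replacement over the alert template (the placeholder names follow the slide; the template text and names below are illustrative):

```python
def render_alert_template(template: str, project_id: str, cluster_name: str) -> str:
    """Fill {PROJECT-ID} and {CLUSTER-NAME} placeholders in an alert YAML template."""
    return (template
            .replace("{PROJECT-ID}", project_id)
            .replace("{CLUSTER-NAME}", cluster_name))

# example template line from a hypothetical alert YAML
template = "displayName: HDFS utilization - {CLUSTER-NAME} ({PROJECT-ID})"
rendered = render_alert_template(template, "my-project", "etl-cluster")
```

The rendered YAML can then be imported as an alerting policy from the bootstrap script or a Cloud Function.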
38. Troubleshooting - Dataproc Cluster properties
● We can use the --properties flag to modify a set of commonly used configuration files
● Recommended properties to set during cluster creation to help avoid job failures due to PVMs:
○ yarn:yarn.resourcemanager.am.max-attempts=10
○ mapred:mapreduce.map.maxattempts=10
○ mapred:mapreduce.reduce.maxattempts=10
○ spark:spark.task.maxFailures=10
○ spark:spark.stage.maxConsecutiveAttempts=10
● Sending the job driver’s and executors’ logs to Stackdriver:
○ dataproc:dataproc.logging.stackdriver.job.driver.enable=true
○ dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true
● To aggregate YARN logs in HDFS:
○ yarn:yarn.log-aggregation-enable=true
○ After setting this property, logs for YARN applications can be seen in Apache Hue
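The --properties value is a single comma-separated list of prefix:key=value entries; composing it programmatically avoids the line-wrapping mistakes that creep in when the list is pasted by hand (a hypothetical helper, shown with two of the properties above):

```python
def build_properties(props: dict) -> str:
    """Render {'prefix:key': 'value'} pairs as a gcloud --properties string."""
    return ",".join(f"{key}={value}" for key, value in props.items())

flags = build_properties({
    "yarn:yarn.resourcemanager.am.max-attempts": "10",
    "spark:spark.task.maxFailures": "10",
})
```

The resulting string can be passed directly as the value of --properties when creating the cluster.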
40. Best Practices
● Use the Jobs API for submissions
○ jobs.submit call over HTTP, using the gcloud command-line tool, or the GCP Console itself
○ Makes it easy to separate permissions
○ No need to set up gateway nodes or use something like Apache Livy
● Control the location of your initialization actions
○ Installation scripts for the most commonly installed OSS components are available in the dataproc-initialization-actions GitHub repository
○ Run these initialization actions from a GCS location that you control
● Specify cluster image versions
○ --image-version 1.4-debian9
○ If you don't specify one, Cloud Dataproc defaults to the most recent stable image version, which can cause compatibility issues in production
○ You can also specify a sub-minor version, e.g. 1.4.xx-debian9
○ Sub-minor versions are updated periodically with patches and fixes, which is more secure
○ Prefer 1.4 over pinning a specific 1.4.xx, so that new clusters pick up the latest patched image