Dataproc At Scale
Who am I?
Searce – Mumbai
Linkedin.com/rohitayare
Rohit Ayare
Senior DevOps Engineer
Searce
Dataproc

Google Cloud Platform's fully-managed data analytics service:
● Rapid cluster creation
● Familiar open source tools
● Customizable hardware and software
● Ephemeral clusters on-demand
● Integrated with other GCP services
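For illustration, a minimal sketch of spinning up and tearing down an ephemeral cluster (cluster name, region, and machine types are placeholders, not from the original deck):

# Create a small ephemeral cluster
gcloud dataproc clusters create demo-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-4

# Tear the cluster down once the work finishes - ephemeral, on-demand
gcloud dataproc clusters delete demo-cluster --region=us-central1 --quiet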
Cloud Dataproc: Open source solutions with GCP

[Diagram: Cloud Dataproc alongside other GCP services - BigQuery, Cloud Datastore, Cloud Bigtable, Compute Engine, Kubernetes Engine, Cloud Dataflow, Cloud Functions, Cloud Vision API, Cloud Storage, Key Management Service, Cloud Machine Learning Engine, Cloud Pub/Sub, Cloud Spanner, Cloud SQL, Cloud Translation API, BigQuery Transfer Service - plus open source components such as WebHCat]
Dataproc

Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service for easily processing big data sets at low cost.
Dataproc

Spark and Hadoop often have poor economics and scalability:
● Idle clusters: most traditional clusters are utilized only a portion of the time they're online
● Scaling inflexibility: job demand can be hard to predict, and scaling can take considerable time

Dataproc brings cloud economics to Spark and Hadoop:
● Anytime clusters: clusters aren't idle - run clusters only when you need them
● Flexible scaling: by scaling clusters at any time, your jobs get exactly the resources they need when required
Dataproc: Decouple storage from compute

[Diagram: development, test, and production Cloud Dataproc clusters share common data sources and data sinks - Cloud Storage, BigQuery, and Cloud Bigtable - so clusters stay stateless while data, application logs, and external applications live outside the cluster]

Credits: Google Cloud Documentation
Cloud Dataproc - Node Types

[Diagram: clients call the Cloud Dataproc API (clusters, operations, jobs, workflow templates). A cluster consists of master node(s) plus primary workers (compute + storage, in a managed instance group) and optional secondary (PVM) workers (compute only), all on Compute Engine, backed by a cluster bucket in Google Cloud Storage and governed by Cloud Network and Cloud IAM]

Credits: Google Cloud Documentation
The Pros and Cons of Cloud Storage versus HDFS

PROS
● Lower costs
● Separation of compute and storage
● Interoperability
● HDFS compatibility with equivalent (or better) performance
● High data availability
● No storage management overhead
● Quick startup
● Cloud IAM security
● Global consistency

CONS
● Cloud Storage may increase I/O variance
● Cloud Storage doesn't support file appends or truncates
● Cloud Storage isn't POSIX-compliant
● Cloud Storage may not expose all file system information
● Cloud Storage may have greater request latency

Credits: https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
Dataproc Features
Who am I?
Searce – Bangalore
Linkedin.com/manan-kshatriya
Twitter.com/@MananKsh
Medium.com/@Mannykshatriya
https://MananK.in
Manan Kshatriya
Data Engineer
Searce
Scaling clusters: Dataproc at scale using on-demand and preemptible workers
Scaling clusters - PVMs

● Dataproc has "secondary workers", which are preemptible VMs by default
● Processing only - they do not store data
● Same instance type as the on-demand worker nodes

Recommendation: start with 0 PVMs and slowly tune upwards; going above 50% PVMs is generally a bad idea. A creation sketch follows below.

PROS
● Cheaper

CONS
● Jobs can lose progress and eventually fail if there are too many preemptions
● Not appropriate for all workloads
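A minimal sketch of creating a cluster with preemptible secondary workers (names and counts are placeholders; on older gcloud releases the flag was --num-preemptible-workers):

# 4 on-demand workers, 2 PVMs - keep the PVM share well under 50%
gcloud dataproc clusters create etl-cluster \
  --region=us-central1 \
  --num-workers=4 \
  --num-secondary-workers=2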
Scaling clusters - Manually

In the Cloud Console: click Edit on the cluster and enter the desired number of worker nodes.
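The same resize can be scripted with gcloud (cluster name and counts are placeholders):

# Grow the primary worker pool to 6 nodes
gcloud dataproc clusters update my-cluster \
  --region=us-central1 \
  --num-workers=6

# When shrinking, give running containers time to finish
gcloud dataproc clusters update my-cluster \
  --region=us-central1 \
  --num-workers=4 \
  --graceful-decommission-timeout=30m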
Scaling clusters - Autoscaling

The autoscaler's decision flow:
● Is there too much or too little YARN memory?
○ No → do nothing
○ Yes → is the cluster at the maximum # of nodes?
■ Yes → do not autoscale
■ No → determine the type and scale of nodes to modify, and autoscale the cluster

● Optimize resource usage
● Decommission workers when not in use - savings
● No manual intervention required
● Strike the right balance of primary and preemptible workers
Scaling clusters - Autoscaling policy

An autoscaling policy is a reusable configuration that describes how clusters using it should scale.

● An autoscaling policy is an independent entity; it can be attached to one or more clusters (recommended only if they share similar workloads)
● It is not mandatory to define an autoscaling policy during cluster creation
● The policy can be attached or modified on the fly (see the sketch below)
● You can scale to any number of worker nodes as long as there are no quota restrictions
● Worker nodes come up in 2-3 minutes and are ready to be used
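A sketch of attaching and detaching a policy on a running cluster (policy and cluster names are placeholders; --disable-autoscaling as the detach flag is our assumption):

gcloud dataproc clusters update my-cluster \
  --region=us-central1 \
  --autoscaling-policy=my-policy

# Detach the policy again
gcloud dataproc clusters update my-cluster \
  --region=us-central1 \
  --disable-autoscaling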
Scaling clusters - Autoscaling Algorithm
● Number of workers required:
exact Δworkers = avg(pending memory - available memory) / memory per worker
● Aggressiveness: scaleUpFactor: 0.5, scaleDownFactor: 0.5
if exact Δworkers > 0:
actual Δworkers = ROUND_UP(exact Δworkers * scaleUpFactor)
# examples:
# ROUND_UP(exact Δworkers=5 * scaleUpFactor=0.5) = 3
else:
actual Δworkers = ROUND_DOWN(exact Δworkers * scaleDownFactor)
# examples:
# ROUND_DOWN(exact Δworkers=-5 * scaleDownFactor=0.5) = -2
Scaling clusters - Autoscaling Algorithm

● Boundaries:

workerConfig:
  minInstances: 2
  maxInstances: 100
  weight: 2

secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
  weight: 1

Weight signifies the proportion of new workers to be added for each node type. For example, if the autoscaler recommends 6 nodes, then according to the above policy:
6 * (2 / 3) = 4 nodes = primary workers
6 * (1 / 3) = 2 nodes = secondary workers

● Frequency:
○ cooldownPeriod: time window after which an autoscaling check/update happens
○ gracefulDecommissionTimeout: finish work in progress on a worker before it is removed from the Cloud Dataproc cluster, to avoid losing work in progress

A complete policy sketch follows.
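Putting the boundary and frequency settings together - a sketch of a full policy file and its import (file name, policy id, and the specific values are illustrative):

cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 100
  weight: 2
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
  weight: 1
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    scaleDownMinWorkerFraction: 1.0
    gracefulDecommissionTimeout: 1h
EOF

gcloud dataproc autoscaling-policies import my-policy \
  --source=autoscaling-policy.yaml \
  --region=us-central1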
Scaling clusters - Recommendations for autoscaling policy

● cooldownPeriod: default = 2m
○ Should not be less than 2m
○ We used a value of 4 minutes
● gracefulDecommissionTimeout: finishes work in progress on a worker before it is removed
○ Default = disabled
○ It should be longer than your longest-running job
● scaleUpFactor for Spark jobs:
○ With dynamic allocation set to true (the default), Spark keeps doubling the number of executors
○ Recommended to set scaleUpFactor to 1.0 (100%)
○ Smooths out pending memory (fewer pending-memory spikes)
● For MapReduce jobs, a good starting point for scaleUpFactor is 0.05 or 0.1, gradually tuned upward
○ MapReduce jobs are generally short-lived, so unless a job lasts several minutes, keep scaleUpFactor low
● To scale down only when a cluster is idle, set scaleDownFactor and scaleDownMinWorkerFraction to 1.0
○ MapReduce and Spark write intermediate shuffle data to local disk; removing workers that hold shuffle data will set job progress back
Scaling clusters - Enhanced Flexibility Mode (EFM)

● What is it?
○ When a Dataproc node is removed, EFM preserves stateful node data, such as MapReduce shuffle data, in HDFS
● Particularly useful when you use preemptible VMs heavily as worker nodes, or autoscale only the preemptible worker group
● Nodes can be removed from the cluster as soon as all containers running on them have finished - shuffle data is written to HDFS

Observations:
● Started with 2 primary workers - HDFS became full
● We can increase the number of primary worker nodes OR add local SSDs to primary workers
● We can also write shuffle data to Cloud Storage
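A sketch of enabling EFM at cluster creation (the exact property names are taken from the beta-era documentation and may differ by image version - treat them as assumptions):

gcloud dataproc clusters create efm-cluster \
  --region=us-central1 \
  --properties='dataproc:efm.spark.shuffle=primary-worker,dataproc:efm.mapreduce.shuffle=hcfs' \
  --num-worker-local-ssds=1   # extra local disk for shuffle on primary workers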
Dataproc Security
Dataproc Security - Networking

● Create a custom VPC or use a private subnet for Dataproc clusters
○ Add a strict firewall rule allowing the subnet range, e.g. 10.128.0.0/16
○ It allows master and worker nodes to communicate
● Internal IP only
○ Include the --no-address flag in the cluster creation command
○ The subnet should have 'Private Google Access' enabled so that nodes can talk to GCP APIs internally
○ To download files from the internet (for example, when installing extra software via an initialization script), Cloud NAT is required
● Cloud Identity-Aware Proxy (IAP):
○ Cloud IAP works by verifying the user identity and context of the request (e.g. device status, location) to determine whether a user should be allowed to access an application or a VM
○ Access VMs from untrusted networks without the use of a VPN
○ You can use gcloud to SSH into the master node: gcloud compute ssh … --tunnel-through-iap
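A sketch tying these together - an internal-IP-only cluster plus IAP SSH (names are placeholders; the master node follows the <cluster>-m naming convention):

# Cluster with internal IPs only, on a subnet with Private Google Access
gcloud dataproc clusters create secure-cluster \
  --region=us-central1 \
  --subnet=private-subnet \
  --no-address

# SSH to the master through IAP - no external IP or VPN needed
gcloud compute ssh secure-cluster-m \
  --zone=us-central1-a \
  --tunnel-through-iap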
Dataproc Security - Cloud DNS

● Problem statement:
○ A workflow contains the variable ${NameNodeIP}, which should point to the master of the Dataproc cluster
○ The value is passed via a properties file when submitting jobs
○ You spin up a new Dataproc cluster with a different name - will you go back and change the master hostname in the properties file every time before submitting the job?
● Using a private hosted zone in Cloud DNS:
○ Map the master node's internal IP to a private DNS endpoint (internal.abc.com → 10.0.0.1)
○ You can use a single DNS endpoint in your properties without worrying which Dataproc cluster it points to
○ The DNS entries can be modified as you create new master nodes (see the sketch below)

Common use cases:
● Useful when moving from a staging environment to a production environment - no code change required
● Using a Cloud DNS private hosted zone, you can have peering between VPCs and access Dataproc resources across projects too
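A sketch of repointing the record after a new master comes up (zone name, hostname, and IPs are placeholders):

gcloud dns record-sets transaction start --zone=internal-zone
gcloud dns record-sets transaction remove "10.0.0.1" \
  --name=internal.abc.com. --ttl=300 --type=A --zone=internal-zone
gcloud dns record-sets transaction add "10.0.0.2" \
  --name=internal.abc.com. --ttl=300 --type=A --zone=internal-zone
gcloud dns record-sets transaction execute --zone=internal-zone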
Dataproc Security - Access Management

● An individual service account for each Dataproc cluster
○ Do not use the default Dataproc service account for every cluster
● Separate roles for the various access levels of each GCP service
○ For example, roles for access to GCS can cover list, read, read-write, and read-write-modify
○ Different roles for read and read-write access to VM disks - if you have hosted any open-source database on Compute Engine
● A challenge - compliance necessity:
○ Clusters placed in the US should have read-write access only to buckets starting with a prefix, say 'gcs-us-*', and clusters placed in the EU should only have access to buckets with the prefix 'gcs-eu-*'
● Solution = bucket ACLs?
Is it scalable? There can be numerous buckets with the prefix gcs-us-*, and new buckets will be added in the future.
Dataproc Security - Access Management

Solution: IAM Conditions (Beta)

● Read-write role attached along with the following condition at the project level:

resource.service == "storage.googleapis.com" &&
resource.name.startsWith("projects/_/buckets/gcs-eu")

● Read-only role attached along with the following condition at the project level:

resource.service == "storage.googleapis.com" &&
resource.name.startsWith("projects/_/buckets/gcs-read-only-")

Note that startsWith matches a literal prefix, so no trailing wildcard is needed, and bucket resource names in IAM conditions use the projects/_/buckets/ form.
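A sketch of binding such a conditional role to a cluster's service account (the service account name and role are illustrative, not from the original deck):

gcloud projects add-iam-policy-binding <PROJECT-NAME> \
  --member='serviceAccount:dataproc-eu@<PROJECT-NAME>.iam.gserviceaccount.com' \
  --role='roles/storage.objectAdmin' \
  --condition='title=eu-buckets-only,expression=resource.service == "storage.googleapis.com" && resource.name.startsWith("projects/_/buckets/gcs-eu")'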
Extensibility - Integrating third-party tools: Apache Oozie and Apache Hue
Extensibility - Apache Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs:
● Directed Acyclic Graphs (DAGs) of
○ control flow
○ action nodes
● The scheduler triggers jobs by time (frequency) and data availability
● Supports Spark, MapReduce, Pig, Hive, Sqoop, and DistCp actions, as well as system-specific jobs such as Java programs and shell scripts
● Workflows can be parameterized (see the sketch below)
○ Run the same workflow for different clients by passing different values for variables

Common use cases:
● Shell scripts to clean GCS paths if already present (useful for rerunning workflows), or to get the input path of the latest available data
● Running Spark/Hive actions (the output of shell actions can be the input of following actions)
● Launching child workflows (re-using already built workflows)
● Sending workflow status emails, with error logs in the subject if required
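A sketch of the parameterization in practice - per-environment properties plus a CLI submission (hostnames, paths, and the client variable are hypothetical):

# job.properties - one file per environment; the workflow XML stays unchanged
cat > job.properties <<'EOF'
nameNode=hdfs://internal.abc.com:8020
jobTracker=internal.abc.com:8032
oozie.wf.application.path=${nameNode}/user/etl/workflows/daily-load
client=acme
EOF

# Submit against the Oozie server running on the Dataproc master
oozie job -oozie http://localhost:11000/oozie -config job.properties -run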
Extensibility - Apache Hue

Hue is a web interface for analyzing data with Apache Hadoop:
● HDFS file browser
● Hive query editor
● Pig query editor
● Oozie workflow interface
● Hadoop shell access
● User admin interface
● Submit Spark and PySpark jobs
Monitoring and Alerts: Stackdriver
Setting up Stackdriver alerts

● Some example alerts:
○ HDFS storage utilization for the cluster above a threshold of 60 or 80%
○ Number of unhealthy nodes in the cluster above 0
■ Workers become unhealthy when the local disk is full or above the threshold defined in yarn-site.xml
■ Consider increasing the disk size of your workers
○ Uptime check on the master node
○ Whether all HDFS DataNodes are functional
■ Total HDFS size is determined by the total boot disk size of all primary worker nodes
● Automating creation of Stackdriver alerts as part of bootstrapping (see the sketch below):
○ Create a YAML file defining the Stackdriver alerts, the conditions that trigger them, notification channels, etc.
○ Place placeholders such as {PROJECT-ID} and {CLUSTER-NAME}
○ Fetch metadata using GCP internal APIs, replace the placeholders with actual values, and import the alert using the final YAML file
○ Can run as part of bootstrap, or a Cloud Function can be used
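A sketch of that bootstrap step (the template file and its contents are assumptions; the policies command was alpha at the time):

CLUSTER_NAME=etl-cluster
PROJECT_ID=$(gcloud config get-value project)

# Replace the placeholders in the alert template, then import it
sed -e "s/{PROJECT-ID}/${PROJECT_ID}/g" \
    -e "s/{CLUSTER-NAME}/${CLUSTER_NAME}/g" \
    alert-template.yaml > alert.yaml

gcloud alpha monitoring policies create --policy-from-file=alert.yaml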
Troubleshooting
Troubleshooting - Dataproc Cluster properties

● We can use the --properties flag to modify a set of commonly used configuration files
● Recommended properties to set during cluster creation to help avoid job failures due to PVMs:
○ yarn:yarn.resourcemanager.am.max-attempts=10
○ mapred:mapreduce.map.maxattempts=10
○ mapred:mapreduce.reduce.maxattempts=10
○ spark:spark.task.maxFailures=10
○ spark:spark.stage.maxConsecutiveAttempts=10
● Sending the job driver's and executors' logs to Stackdriver:
○ dataproc:dataproc.logging.stackdriver.job.driver.enable=true
○ dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true
● To aggregate YARN logs in HDFS:
○ yarn:yarn.log-aggregation-enable=true
○ After setting this property, logs for YARN applications can be seen in Apache Hue

A combined example is sketched below.
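A sketch applying the properties above at creation time (cluster name and region are placeholders):

PROPS='yarn:yarn.resourcemanager.am.max-attempts=10'
PROPS+=',mapred:mapreduce.map.maxattempts=10'
PROPS+=',mapred:mapreduce.reduce.maxattempts=10'
PROPS+=',spark:spark.task.maxFailures=10'
PROPS+=',spark:spark.stage.maxConsecutiveAttempts=10'
PROPS+=',dataproc:dataproc.logging.stackdriver.job.driver.enable=true'
PROPS+=',yarn:yarn.log-aggregation-enable=true'

gcloud dataproc clusters create pvm-tolerant-cluster \
  --region=us-central1 \
  --properties="${PROPS}"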
Best practices
Best Practices

● Use the Jobs API for submissions
○ A jobs.submit call over HTTP, using the gcloud command-line tool, or the GCP Console itself
○ Makes it easy to separate permissions
○ No need to set up gateway nodes or use something like Apache Livy
● Control the location of your initialization actions
○ Installation scripts for the most commonly installed OSS components are available in the dataproc-initialization-actions GitHub repository
○ Run these initialization actions from a GCS location that you control
● Specify cluster image versions
○ e.g. --image-version 1.4-debian9
○ If you don't specify one, Cloud Dataproc defaults to the most recent stable image version, which can cause compatibility issues in production
○ You can also specify a sub-minor version, i.e. 1.4.xx-debian9
○ Sub-minor versions are updated periodically with patches and fixes, so prefer 1.4 over pinning 1.4.xx to keep receiving them - more secure
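A sketch of these practices together - pinned image, init actions from a controlled bucket, and a Jobs API submission (bucket, class, and jar names are hypothetical):

gcloud dataproc clusters create prod-cluster \
  --region=us-central1 \
  --image-version=1.4-debian9 \
  --initialization-actions=gs://my-org-init-actions/hue/hue.sh

# Submit through the Jobs API instead of SSHing to the master
gcloud dataproc jobs submit spark \
  --cluster=prod-cluster \
  --region=us-central1 \
  --class=com.example.DailyLoad \
  --jars=gs://my-org-artifacts/daily-load.jar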
Thank you