Google Cloud
Certified Professional
Data Engineer Exam
Strategy, Tips and
Overview 1
About me
• I have 20 years of experience in the IT industry, with
a focus on Cloud, Data, ML & DevOps
• I hold more than 50 certifications across all of the
above fields
• I am a 2-time Google Professional Data Engineer
Certified, and also a 1-time Google DevOps and
Machine Learning Engineer Certified
• LinkedIn Profile
2
3
4
Exam Cost
• 200 USD list price
• 141.60 USD is what I paid after taxes, etc.
• The price is the same whether taken at home or at an exam center
5
Certification Exam Overview - 1
•50 Questions to be answered in 2 hours
•No negative marks
•All Multiple Choice Questions
•Select 1 right answer out of 4
•Select 2 or 3 right answers out of 5 or 6 choices
•144 seconds for every question
6
Certification Exam Overview - 2
• Questions can be answered and also marked for
review if you want to revisit them later
• No pen and paper allowed or provided
• Cert valid for 2 years
• Can be taken onsite at an exam center or remotely
proctored
• Result is Pass or Fail; marks will not be shared
• The exam is hard
7
Exam Strategy
• Process of elimination
• Eliminate usually 2 wrong choices
• Then make a decision/guess between the last 2
strong options
• On my first go I made a choice and marked all
questions for review later
• On 2nd run I read through all my answers and
either confirmed or changed my choice
8
Top topics (in order)
• BigQuery
• Dataflow
• Bigtable
• Dataproc
• Pub/Sub
• Cloud SQL
• Cloud Spanner
9
• Cloud Composer
• Data Prep
• Data Fusion
• Cloud DLP
• Pre-Trained ML/AI
APIs
• Fundamentals of
Machine Learning
• Feature
Engineering
• Overfitting
• IAM, Access
Control, Service
Accounts, Users,
Groups, Roles
• Kafka, Hive, Pig,
Hadoop
Tips (Guesswork)
• Choose Ideal Recommended Solution (Reference Architectures)
• Choose Google Product over other products
• Avoid complex, convoluted, lengthy, manual, error-prone,
cumbersome solutions
• If you are a little unsure or a question is too time consuming, mark it for
review later.
• After reading through all 50 questions you will figure out some of the
unsure questions
• If all else fails, go with your first guess
10
Do Review
• Sample Questions
• Exam Guide
11
Some Motivation
12
Thanks
13
For
Watching
Roles
& Responsibilities
of a Data Engineer
1
2
•Data Engineers are responsible for
designing, building, and maintaining
the infrastructure and systems that
are used to store, process, and
analyze large amounts of data.
3
•Design, build and maintain the data
pipeline: Data Engineers are
responsible for designing and building
the data pipeline, which includes the
collection, storage, processing, and
transportation of data.
•They also make sure that the pipeline is
efficient, reliable, and scalable.
4
•Data storage and management: Data
Engineers are responsible for designing
and maintaining the data storage
systems, such as relational databases,
NoSQL databases, and data
warehouses.
•They also ensure that the data is properly
indexed, partitioned, and backed up.
5
•Data quality and integrity: Data
Engineers are responsible for ensuring
the quality and integrity of the data as
it flows through the pipeline.
•This includes cleaning, normalizing,
and validating the data before it is
stored.
6
•Data security: Data Engineers are
responsible for implementing security
measures to protect sensitive data
from unauthorized access.
•This includes implementing
encryption, access controls, and
monitoring for security breaches.
7
•Performance tuning and optimization:
Data Engineers are responsible for
monitoring the performance of the data
pipeline and making adjustments to
optimize its performance.
•This includes identifying and resolving
bottlenecks, and scaling resources as
needed.
8
•Collaboration with other teams: Data
Engineers often work closely with
Data Scientists, Data Analysts and
Business Intelligence teams to
understand their data needs and
ensure that the data pipeline is able
to support their requirements.
9
•Keeping up with the latest
technologies: Data Engineers need to
keep up-to-date with the latest
technologies and trends in the field,
such as new data storage and
processing systems, big data
platforms, and data governance best
practices.
Thanks
for
Watching
10
Types of Data
Storage Systems
1
Relational Databases
•Relational databases: These are the most common
type of data storage systems, and include popular
options such as MySQL, Oracle, and Microsoft SQL
Server.
•Relational databases store data in tables with rows
and columns, and are based on the relational
model.
•They are well suited for structured data and support
the use of SQL for querying and manipulating data.
2
NoSQL Databases
•NoSQL databases: These are non-relational
databases that are designed to handle large
amounts of unstructured or semi-structured data.
•Examples include MongoDB, Cassandra, and
Hbase.
•NoSQL databases are often used for big data and
real-time web applications.
•They are horizontally scalable and provide high
performance and availability.
3
Data Warehouses
•Data warehouses: These are specialized relational
databases that are optimized for reporting and
analytics.
•They are designed to handle large amounts of
historical data and support complex queries and
aggregations.
•Examples include Amazon Redshift, Google
BigQuery, and Microsoft Azure SQL Data
Warehouse
4
Data Lakes
•Data lakes: Data lake is a data storage
architecture that allows storing raw,
unstructured and structured data at any
scale.
•Data lake technologies, such as Amazon
S3, Azure Data Lake Storage, and Google
Cloud Storage, provide a centralized
repository that can store all types of data.
5
Columnar Databases
•Columnar databases: Columnar
databases, such as Apache Parquet,
Apache ORC, and Google Bigtable, are
used for storing and querying large
amounts of data in a columnar format.
•This format is optimized for read-intensive
workloads and analytical querying.
6
Key-value Databases
•Key-value databases: Key-value databases,
such as Redis and memcached, are
designed to store large amounts of data in
a simple key-value format, and are
optimized for read-heavy workloads.
•They are particularly well-suited for use
cases such as caching, session
management, and real-time analytics.
7
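To make the key-value model concrete, here is a minimal sketch in Python using the redis-py client; it assumes the redis package is installed, a Redis server is reachable on localhost:6379, and the key and value are made up for illustration.

import redis

# Connect to a local Redis server (assumed to be running on the default port).
r = redis.Redis(host="localhost", port=6379, db=0)

# Store a session token under a key with a 30-minute time-to-live,
# a typical caching / session-management pattern.
r.set("session:user:42", "opaque-session-token", ex=1800)

# Look the value up again by its key.
token = r.get("session:user:42")
print(token)  # b'opaque-session-token' (bytes unless decode_responses=True)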
Object Storage
•Object storage: These systems are designed
to store and retrieve unstructured data, such
as images, videos, and audio files.
•They are often used for cloud storage and
archiving.
•Examples include Amazon S3, Microsoft
Azure Blob Storage, and OpenStack Swift.
8
File Storage
•File storage: These systems store data as
files and directories, which can be organized
in a hierarchical file system.
•They are often used for storing large files
and streaming media, and are commonly
used in distributed systems like Hadoop
HDFS.
•Other Examples: NTFS, ext4
9
Time-series Databases
•Time-series databases: These are
specialized databases that are
optimized for storing and querying
time-series data, such as sensor data
and financial data.
•Examples include InfluxDB,
OpenTSDB, and TimescaleDB.
10
Graph Databases
•Graph databases: These databases are designed
for storing and querying graph data, which
consists of nodes and edges representing entities
and relationships.
•Examples include Neo4j and JanusGraph.
•They are well-suited for applications that require
querying complex relationships and patterns in
the data, such as social networks and
recommendation systems.
11
Thanks
for
Watching
12
ACID vs BASE
ACID
• ACID is an acronym that stands for Atomicity, Consistency, Isolation, and
Durability.
• These properties are a set of guarantees that a database system makes about
the behavior of transactions.
• ACID properties are important for maintaining data integrity, consistency, and
availability in a database.
• It ensures that the data stored in the database is accurate, consistent, and
can be relied upon.
Atomicity
• This property ensures that a transaction is treated as a single, indivisible unit
of work.
• Either all of the changes made in a transaction are committed to the
database, or none of them are.
• This means that if a transaction is interrupted or fails, any changes made in
that transaction will be rolled back, so that the database remains in a
consistent state.
Consistency
• This property ensures that a transaction brings the database from one valid
state to another valid state.
• A database starts in a consistent state, and any transaction that is executed
on the database should also leave the database in a consistent state.
Isolation
• This property ensures that the concurrent execution of transactions does not
affect the correctness of the overall system.
• Each transaction should execute as if it is the only transaction being
executed, even though other transactions may be executing at the same time.
Durability
• This property ensures that once a transaction is committed, its effects will
persist, even in the event of a failure (such as a power outage or a crash).
• This is typically achieved by writing the changes to non-volatile storage, such
as disk.
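To illustrate atomicity and durability in practice, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and amounts are hypothetical, and any ACID-compliant database would behave the same way.

import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts (id, balance) VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # Both updates form one transaction: either both are committed or neither is.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()      # Durability: the committed changes are written to disk.
except sqlite3.Error:
    conn.rollback()    # Atomicity: a failure undoes every change in the transaction.
finally:
    conn.close()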
BASE
• BASE stands for Basically Available, Soft state, Eventually consistent.
• It is a set of properties that describe the behavior of a distributed database
system or a distributed data store.
Basically Available
• This property ensures that the data store is available for read and write
operations, although there may be some limitations on availability due to
network partitions or other failures.
Soft state
• This property acknowledges that the state of the data store may change over
time, even without input.
• This is due to the distributed nature of the data store and the inherent
uncertainty in network communication.
Eventually consistent
• This property ensures that all nodes in the distributed data store will
eventually converge to the same state, even if they do not have immediate
access to the same information.
• This means that it may take some time for all nodes to have the same data,
but eventually, they will.
Difference
• ACID guarantees consistency and
isolation for transactions, but it
comes with a cost of overhead and
less scalability
• BASE prioritizes availability and
scalability over consistency, which
can make it more difficult to reason
about and predict the behavior of the
system.
Thanks
for
Watching
12
OLTP vs OLAP
OLTP - 1
• OLTP stands for Online Transaction Processing.
• It is a type of database system that is optimized for handling a large number
of short, transactional requests, such as inserting, updating, or retrieving data
from a database.
• In an OLTP system, the database is designed to handle a high number of
concurrent connections and transactions, with a focus on fast, consistent
response times.
• The data in an OLTP system is typically stored in normalized, relational tables,
which allows for efficient querying and indexing.
• OLTP systems are used in a variety of applications, such as e-commerce
systems, financial systems, and inventory management systems.
OLTP - 2
• The goal of OLTP is to enable the processing of business transactions as fast
as possible, with a high degree of consistency and data integrity.
• OLTP systems are typically characterized by a high number of read and write
operations, a large number of concurrent users, and a high volume of data.
• They are also characterized by a high degree of normalization and data
integrity, with strict constraints and triggers to ensure data consistency and
prevent data corruption.
• Overall, OLTP is designed to handle a large number of concurrent
transactions and to provide fast, consistent response times.
• It is an essential component of many business systems, and it is used to
support a wide variety of transactions and business processes.
OLAP - 1
• OLAP, or Online Analytical Processing is a powerful technology that allows users
to easily analyze large, complex data sets and make informed business decisions.
• It is commonly used in business intelligence and decision support systems to
support complex queries and analysis on large datasets.
• OLAP databases are typically built on top of a relational database and use a
multidimensional data model, which organizes data into a cube structure.
• Each dimension in the cube represents a different aspect of the data, such as
time, location, or product, and each cell in the cube contains a measure, such as
sales or profit.
• Users can interact with the OLAP cube using a client tool, such as Microsoft
Excel, to drill down, roll up, and slice and dice the data to gain insights.
OLAP - 2
• For example, a user could start by looking at total sales for a given time period, then
drill down to see sales by region, and then by individual store.
• There are three main types of OLAP systems: relational OLAP (ROLAP),
multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP).
• ROLAP uses a relational database as the underlying data store, while MOLAP uses a
specialized multidimensional data store.
• HOLAP combines the benefits of both ROLAP and MOLAP by using a relational data
store for the detailed data and a multidimensional data store for the summarized data.
• OLAP also provides several advanced analytical capabilities such as time series
analysis, forecasting, budgeting, and data mining.
• In addition, many OLAP tools provide a graphical interface that makes it easy for
users to interact with the data and perform advanced analysis.
Difference
• OLTP is designed to handle a high volume
of short online transactions, such as
inserting, updating, and retrieving data.
• It is optimized for transactional
consistency, and data is stored in a
normalized form to minimize data
redundancy.
• The main goal of OLTP is to ensure data
accuracy and integrity, and to make sure
that the system can handle a large
number of concurrent users and
transactions.
• OLTP is used for operational systems that
handle day-to-day transactions
• OLAP is designed to handle complex,
multi-dimensional analysis of data.
• It is optimized for fast query performance
and efficient data aggregation, and data
is stored in a denormalized form to
enable faster data retrieval.
• The main goal of OLAP is to support
business intelligence and decision-
making by providing users with the ability
to analyze large amounts of data from
multiple dimensions and levels of detail.
• OLAP is used for analytical systems that
support business intelligence and
decision-making
Thanks
for
Watching
7
4 V's of Big Data
1
4 V's of Big Data
•Volume
•Velocity
•Variety
•Veracity
2
Volume
•Big data is characterized by its large volume,
which can range from terabytes to petabytes
and even exabytes.
•This large volume of data is generated from
various sources, such as social media, IoT
devices, and transactional systems.
3
Velocity
•Big data is characterized by its high velocity,
which refers to the speed at which data is
generated and collected.
•This high velocity of data requires real-time
processing and analysis to extract insights
and make decisions.
4
Variety
•Big data is characterized by its wide variety of
data types, such as structured, unstructured,
and semi-structured data.
•This variety of data types requires specialized
tools and techniques for processing and
analysis.
5
Veracity
•Big data is characterized by its uncertainty
and lack of trustworthiness, which makes it
difficult to validate and verify the accuracy of
the data.
•This requires data quality and data
governance processes to ensure that the data
is accurate and reliable.
6
Thanks
for
Watching
Vertical vs Horizontal Scaling
1
Vertical Scaling
•Vertical scaling is the process of increasing the
capacity of a single server or machine by adding
more resources such as CPU, memory, or storage.
•Vertical scaling is often used to improve the
performance of a single server or to add capacity to
a machine that has reached its limits.
•The main disadvantage of vertical scaling is that it
can reach a physical limit of how much resources
can be added to a single machine.
2
Horizontal Scaling
•Horizontal scaling is the process of adding more machines to a network to
distribute the load and increase capacity.
•Horizontal scaling is often used in cloud computing and other distributed
systems to handle large amounts of traffic or data.
•It allows for the system to handle more requests by adding more machines
to the network, rather than upgrading the resources of a single machine.
•Horizontal scaling is often considered more flexible and cost-effective than
vertical scaling, as it allows for the easy addition or removal of machines
as needed.
•However, it may also require a load balancer and a way to share data
between the machines.
3
Difference
•Adding more resources
to a single server or
machine, such as CPU,
memory, or storage
•There is a physical limit
to how much resources
can be added to a
single machine
4
•Adding more
machines to a network
to distribute the load
and increase capacity
•It may also require a
load balancer and a
way to share data
between the machines
Thanks
for
Watching
Batch
& Streaming Data
1
Batch Data
2
• Batch data refers to data that is collected and processed in
fixed, non-overlapping intervals, also known as batches.
• Batch processing is commonly used when working with large
amounts of historical data, such as data from a data warehouse.
• The data is collected over a certain period of time and then
processed all at once.
• Batch processing is well suited for tasks that do not require real-
time processing, such as generating reports, running analytics,
or training machine learning models.
Streaming Data
3
• Streaming data, on the other hand, refers to data that is
generated and processed in real-time, as it is being generated.
• Streaming data is typically generated by various sources, such
as IoT devices, social media, or financial transactions.
• The data is processed as it is received, with minimal latency,
and is often used to support real-time decision-making and
event detection.
• Examples of streaming data processing include monitoring
sensor data, analyzing social media feeds, and detecting fraud
in financial transactions.
Main Difference
4
• The main difference between batch data and streaming data
is the way they are processed.
• Batch data is processed in fixed intervals, while streaming
data is processed as it is generated.
• Batch data is well suited for tasks that do not require real-
time processing, while streaming data is well suited for real-
time tasks such as monitoring and event detection.
Note
5
• It's worth noting that, these days, many systems combine
both batch and streaming data processing, this is known as
Lambda architecture.
• This is a way to handle both real-time and historical data in a
single system, which can be useful in cases where real-time
decisions need to be made based on historical data.
Thanks
for
Watching
6
Data Processing Pipeline
1
2
Data Processing Pipeline
3
• A data processing pipeline is a series of stages or phases that
data goes through from the time it is collected to the time it is
used for analysis or reporting.
Data collection
4
• The first stage of the data processing pipeline is data
collection.
• This includes acquiring data from various sources, such as
sensors, log files, social media, and transactional systems.
• Data collection may also include pre-processing, such as
filtering, sampling, and transforming the data to make it
suitable for further processing.
Data Storage
5
• After the data is collected, it needs to be stored in a reliable
and efficient manner.
• This includes storing the data in a data warehouse, data lake,
or other data storage systems, as well as indexing,
partitioning, and backing up the data.
Data Processing
6
• The next stage of the data processing pipeline is data
processing, which includes cleaning, normalizing, validating,
and transforming the data.
• This step is critical for ensuring the quality and integrity of the
data.
Data Modeling
7
• After the data is cleaned and processed, it can be used for
data modeling, which includes building and training machine
learning models, and creating data visualizations.
Data Analysis
8
• Data analysis: The final stage of the data processing pipeline
is data analysis, which includes querying, reporting, and
visualizing the data to gain insights and make data-driven
decisions.
Data Governance
9
• Data governance is an ongoing process that covers the data
life cycle, and it starts at the data collection phase and
continues throughout the entire pipeline.
• It includes data quality, data lineage, data privacy, data
security, data archiving, and data cataloging.
Note
10
• It's worth noting that these stages may not be strictly
sequential and can be executed in parallel, and the specific
stages may vary depending on the specific application and
the requirements of the organization.
• Additionally, the pipeline may include different tools,
technologies and frameworks at each stage and the pipeline
can be iterated to improve the quality of the data and the
accuracy of the models.
Thanks
for
Watching
11
Google's Data
Processing
Pipeline Products
1. Ingest
2. Store
3. Process and Analyze
4. Explore and Visualize
Thanks
for
Watching
Google Cloud
Data Product
Decision Tree
1
2
Cloud Storage
1
Cloud Storage
Cloud Storage is a managed service for storing unstructured
data.
Store any amount of data and retrieve it as often as you like.
2
Features
Automatic storage class transitions
Continental-scale and SLA-backed replication
Fast and flexible transfer services
Default and configurable data security
Leading analytics and ML/AI tools
Object lifecycle management
Object Versioning
Retention policies
Object holds
3
Features
Customer-managed encryption keys
Customer-supplied encryption keys
Uniform bucket-level access
Requester pays
Bucket Lock
Pub/Sub notifications for Cloud Storage
Cloud Audit Logs with Cloud Storage
Object- and bucket-level permissions
4
Storage Options
Storage Class | Use Cases | Minimum Duration
Standard Storage | Storage for data that is frequently accessed ("hot" data) and/or stored for only brief periods of time, including websites, streaming videos, and mobile apps. | None
Nearline Storage | Low-cost, highly durable storage service for storing infrequently accessed data. | 30 days
Coldline Storage | A very low-cost, highly durable storage service for storing infrequently accessed data. | 90 days
Archive Storage | The lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. | 365 days
5
Common Use Cases
Backup and Archives
Use Cloud Storage for backup, archives, and recovery
Cloud Storage's nearline storage provides fast, low-cost, highly durable storage for data accessed less than once a month,
reducing the cost of backups and archives while still retaining immediate access.
Backup data in Cloud Storage can be used for more than just recovery because all storage classes have ms latency and are
accessed through a single API.
Media content storage and delivery
Store data to stream audio or video
Stream audio or video directly to apps or websites with Cloud Storage's geo-redundant capabilities.
Geo-redundant storage with the highest level of availability and performance is ideal for low latency, high-QPS content
serving to users distributed across geographic regions.
Data lakes and big data analytics
Create an integrated repository for analytics
Develop and deploy an app or service in a space that provides collaboration and version control for your code.
Cloud Storage offers high availability and performance while being strongly consistent, giving you confidence and accuracy
in analytics workloads.
6
Common Use Cases
Machine learning and AI
Plug into world class machine learning and AI tools
Once your data is stored in Cloud Storage, take advantage of our options for training deep learning and machine learning
models cost-effectively.
Host a website
Hosting a static website with Cloud Storage
If you have a web app that needs to serve static content or user-uploaded static media, using Cloud Storage can be a cost-
effective and efficient way to host and serve this content, while reducing the amount of dynamic requests to your web
app.
7
Automatic storage class transitions
With features like Object Lifecycle Management (OLM) and
Autoclass you can easily optimize costs with object placement
across storage classes.
You can enable, at the bucket level, policy-based automatic
object movement to colder storage classes based on the last
access time.
There are no early deletion or retrieval fees, nor class transition
charges for object access in colder storage classes.
8
Continental-scale and SLA backed replication
Industry leading dual-region buckets support an expansive
number of regions.
A single, continental-scale bucket offers nine regions across
three continents, providing a Recovery Time Objective (RTO) of
zero.
In the event of an outage, applications seamlessly access the
data in the alternate region.
There is no failover and failback process.
For organizations requiring ultra availability, turbo replication
with dual-region buckets offers a 15 minute Recovery Point
Objective (RPO) SLA.
9
Fast and flexible transfer services
Storage Transfer Service offers a highly performant, online
pathway to Cloud Storage—both with the scalability and speed
you need to simplify the data transfer process.
For offline data transfer our Transfer Appliance is a shippable
storage server that sits in your datacenter and then ships to an
ingest location where the data is uploaded to Cloud Storage.
10
Default and configurable data security
Cloud Storage offers secure-by-design features to protect your
data and advanced controls and capabilities to keep your data
private and secure against leaks or compromises.
Security features include access control policies, data
encryption, retention policies, retention policy locks, and signed
URLs.
11
Leading analytics and ML/AI tools
Once your data is stored in Cloud Storage, easily plug into
Google Cloud’s powerful tools to create your data warehouse
with BigQuery, run open-source analytics with Dataproc, or build
and deploy machine learning (ML) models with Vertex AI.
12
Object lifecycle management
Define conditions that trigger data deletion or transition to a
cheaper storage class.
13
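As an illustration of defining such lifecycle conditions programmatically, here is a minimal sketch using the google-cloud-storage Python client; the bucket name is hypothetical, and it assumes the library is installed and credentials are configured.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")  # hypothetical bucket name

# Move objects older than 30 days to Coldline, and delete them after 365 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # push the updated lifecycle configuration to Cloud Storage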
Object Versioning
Continue to store old copies of objects when they are deleted or
overwritten
14
Retention policies
Define minimum retention periods that objects must be stored
for before they’re deletable.
15
Object holds
Place a hold on an object to prevent its deletion.
16
Customer-managed encryption keys
Encrypt object data with encryption keys stored by the Cloud
Key Management Service and managed by you.
17
Customer-supplied encryption keys
Encrypt object data with encryption keys created and managed
by you.
18
Uniform bucket-level access
Uniformly control access to your Cloud Storage resources by
disabling object ACLs.
19
Requester pays
Require accessors of your data to include a project ID to bill for
network charges, operation charges, and retrieval fees.
20
Bucket Lock
Bucket Lock allows you to configure a data retention policy for a
Cloud Storage bucket that governs how long objects in the
bucket must be retained.
21
Pub/Sub notifications for Cloud Storage
Send notifications to Pub/Sub when objects are created,
updated, or deleted.
22
Object- and bucket-level permissions
Cloud Identity and Access Management (IAM) allows you to
control who has access to your buckets and objects.
23
Thanks
for
Watching 24
Migration to Google Cloud:
Transferring your large datasets
1
2
Where you're moving data from | Scenario | Suggested products
Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud | Any | Storage Transfer Service
Cloud Storage to Cloud Storage (two different buckets) | Any | Storage Transfer Service
Your private data center to Google Cloud | Enough bandwidth to meet your project deadline, for less than 1 TB of data | gsutil
Your private data center to Google Cloud | Enough bandwidth to meet your project deadline, for more than 1 TB of data | Storage Transfer Service for on-premises data
Your private data center to Google Cloud | Not enough bandwidth to meet your project deadline | Transfer Appliance
Products
•Storage Transfer Service
•gsutil
•Transfer Appliance
3
Storage Transfer Service
•Move or back up data to a Cloud Storage bucket either from
other cloud storage providers or from a local or cloud POSIX
file system.
•Move data from one Cloud Storage bucket to another, so that
it is available to different groups of users or applications.
•Move data from Cloud Storage to a local or cloud file system.
•Move data between file systems.
•Periodically move data as part of a data processing pipeline
or analytical workflow.
4
Storage Transfer Service - Options
•Schedule one-time transfer operations or recurring
transfer operations.
•Delete existing objects in the destination bucket if they
don't have a corresponding object in the source.
•Delete data source objects after transferring them.
•Schedule periodic synchronization from a data source
to a data sink with advanced filters based on file
creation dates, filenames, and the times of day you
prefer to import data.
5
gsutil - 1
• The gsutil tool is the standard tool for small- to medium-sized transfers (less
than 1 TB) over a typical enterprise-scale network, from a private data center
or from another cloud provider to Google Cloud.
• It's also available by default when you install the Google Cloud CLI.
• It's a reliable tool that provides all the basic features you need to manage
your Cloud Storage instances, including copying your data to and from the
local file system and Cloud Storage.
• It can also move and rename objects and perform real-time incremental
syncs, like rsync, to a Cloud Storage bucket.
6
gsutil is especially useful
• Your transfers need to be executed on an as-needed basis, or during
command-line sessions by your users.
• You're transferring only a few files or very large files, or both.
• You're consuming the output of a program (streaming output to Cloud
Storage).
• You need to watch a directory with a moderate number of files and sync any
updates with very low latencies.
7
Transfer Appliance
•Transfer Appliance is a high-
capacity storage device that
enables you to transfer and
securely ship your data to a
Google upload facility, where
we upload your data to Cloud
Storage
8
Transfer Appliance - How it works
1. Request an appliance
2. Upload your data
3. Ship the appliance back
4. Google uploads the data
5. Transfer is complete
9
10
Transfer Appliance weights and capacities
11
Transfer Calculator
Thanks
for
Watching
Google Cloud SQL
● Google Cloud SQL is a fully-managed database service
that makes it easy to set up, maintain, manage, and
administer your relational databases on Google Cloud
Platform.
● It is based on the MySQL and PostgreSQL database
engines and provides a number of features to help you
manage your databases with ease, including:
● Easy setup: You can set up a new Cloud SQL instance in
just a few clicks using the Google Cloud Console, the
gcloud command-line tool, or the Cloud SQL API.
● Automatic patches and updates: Cloud SQL automatically
applies patches and updates to your database, so you
don't have to worry about maintenance or downtime.
● High availability: Cloud SQL provides built-in high
availability, with automatic failover and replication to
ensure that your database is always available.
● Scalability: You can easily scale your Cloud SQL instances
up or down to meet the changing needs of your
application.
● Security: Cloud SQL provides a number of security
features to help protect your data, including encryption at
rest, network isolation, and integration with Google Cloud's
identity and access management (IAM) system.
● Monitoring and diagnostics: Cloud SQL provides detailed
monitoring and diagnostics information to help you
troubleshoot issues with your database.
● Integration with other Google Cloud services: Cloud SQL
integrates seamlessly with other Google Cloud services,
such as Google Kubernetes Engine, Cloud Functions, and
Cloud Run, making it easy to build and deploy applications
on Google Cloud Platform.
● Cloud SQL supports MySQL, PostgreSQL and SQL Server
databases. You can choose the database engine that best
fits your needs and get all the features and benefits of that
engine, along with the added benefits of being fully
managed on Google Cloud Platform.
● Cloud SQL provides multiple pricing options to fit your
needs and budget. You can choose between on-demand
pricing, which charges you based on the resources you
use, or committed use pricing, which provides discounted
rates in exchange for a commitment to use a certain
amount of resources over a one or three year period.
● Cloud SQL provides a number of tools and features to help
you manage your databases and optimize performance.
These include a web-based SQL client, the ability to import
and export data, support for connection pooling and load
balancing, and the ability to scale your instances up or
down as needed.
● Cloud SQL integrates with other Google Cloud services,
such as Cloud Functions and Cloud Run, making it easy to
build and deploy cloud-native applications. You can also
use Cloud SQL with popular open-source tools such as
MySQL Workbench and PostgreSQL clients, or connect to
it using standard MySQL and PostgreSQL drivers.
● Cloud SQL provides a number of security features to help
protect your data, including encryption at rest, network
isolation, and integration with Google Cloud's identity and
access management (IAM) system. You can also use
Cloud SQL with Cloud Security Command Center to
monitor and manage your database security.
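A minimal sketch of connecting with the standard MySQL drivers mentioned above, using the PyMySQL driver from Python; it assumes the Cloud SQL Auth proxy is listening locally on 127.0.0.1:3306 (or that the instance's IP is otherwise reachable), and the user, password, database, and table names are hypothetical.

import pymysql

# Connect through the Cloud SQL Auth proxy listening on localhost
# (the host would be the instance's IP address if connecting directly).
conn = pymysql.connect(
    host="127.0.0.1",
    port=3306,
    user="app_user",          # hypothetical user
    password="app_password",  # use Secret Manager or env vars in real code
    database="inventory",     # hypothetical database
)

try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM products LIMIT 5")  # hypothetical table
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()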
Key Terms
● Instance: A Cloud SQL instance is a container for your
databases. It has a specific configuration and can host one
or more databases.
● Database: A database is a collection of data that is
organized in a specific way, making it easy to access,
update, and query. Cloud SQL supports several database
engines, including MySQL and PostgreSQL.
Key Terms
● Region: A region is a geographic area where Google Cloud
Platform resources are located. When you create a Cloud
SQL instance, you can choose which region it should be
located in.
● High availability: Cloud SQL instances can be configured
for high availability, which means that they are designed to
remain available even if there is a hardware failure or other
issue.
Key Terms
● Backup and recovery: Cloud SQL provides automatic and
on-demand backups of your database, as well as tools for
recovering from a disaster or data loss.
● Security: Cloud SQL takes security seriously, with features
such as encryption at rest, network isolation, and user
authentication.
Key Terms
● Monitoring and debugging: Cloud SQL provides monitoring
and debugging tools to help you track the performance of
your database and troubleshoot any issues that may arise.
● Scalability: Cloud SQL allows you to scale your database
up or down as needed, so you can handle changes in
demand without having to worry about capacity planning.
Pricing
● Google Cloud SQL charges for usage based on the type
and number of resources you consume, such as the
number of instances, the size of the instances, and the
amount of data stored.
● Here are some of the factors that can affect the cost of
Cloud SQL:
Pricing
● Instance type: Cloud SQL offers several instance types,
each with a different combination of CPU, memory, and
storage. The type of instance you choose will affect the
price.
● Instance size: The size of a Cloud SQL instance is
determined by the amount of CPU, memory, and storage it
has. You can choose from a range of sizes, and the cost
will depend on the size you choose.
Pricing
● Data storage: Cloud SQL charges for the amount of data
stored in your database, as well as for any additional
storage you may need.
● Network egress: Cloud SQL charges for the data that is
transferred out of a region. If you have a lot of data
transfer, it could increase your costs.
Pricing
● High availability: If you configure your Cloud SQL instance
for high availability, it will incur additional costs.
● To get an estimate of the cost of using Cloud SQL, you can
use the Google Cloud Pricing Calculator. This tool allows
you to specify your usage patterns and get an estimate of
the cost based on your specific needs.
Use Cases
● Web and mobile applications: Cloud SQL is well-suited for
powering the back-end of web and mobile applications. It
can handle high levels of concurrency and offers fast
response times, making it ideal for applications with a lot of
users.
● Microservices: Cloud SQL can be used to store data for
microservices-based architectures. It offers fast response
times and can be easily integrated with other Google Cloud
Platform services.
Use Cases
● E-commerce: Cloud SQL can be used to store and
manage data for e-commerce applications, including
customer information, order history, and inventory data.
● Internet of Things (IoT): Cloud SQL can be used to store
and process data from IoT devices, allowing you to
analyze and gain insights from the data.
Use Cases
● Gaming: Cloud SQL can be used to store and manage
data for online gaming applications, including player
profiles, game progress, and leaderboards.
Cloud SQL for MySQL
● Fully managed MySQL Community Edition databases in
the cloud.
● Custom machine types with up to 624 GB of RAM and 96
CPUs.
● Up to 64 TB of storage available, with the ability to
automatically increase storage size as needed.
● Create and manage instances in the Google Cloud
console.
Cloud SQL for MySQL
● Instances available in the Americas, EU, Asia, and
Australia.
● Supports migration from source databases to Cloud SQL
destination databases using Database Migration Service
(DMS).
● Customer data encrypted on Google's internal networks
and in database tables, temporary files, and backups.
● Support for secure external connections with the Cloud
SQL Auth proxy or with the SSL/TLS protocol.
Cloud SQL for MySQL
● Support for private IP (private services access).
● Data replication between multiple zones with automatic
failover.
● Import and export databases using mysqldump, or import
and export CSV files.
● Support for MySQL wire protocol and standard MySQL
connectors.
● Automated and on-demand backups and point-in-time
recovery.
Cloud SQL for MySQL
● Instance cloning.
● Integration with Google Cloud's operations suite logging
and monitoring.
● ISO/IEC 27001 compliant.
Unsupported MySQL features
● Federated Engine
● Memory Storage Engine
● The following feature is unsupported for MySQL for Cloud
SQL 5.6 and 5.7:
● The SUPER privilege
● Because Cloud SQL is a managed service, it restricts
access to certain system procedures and tables that
require advanced privileges.
Unsupported MySQL features
● The following features are unsupported for MySQL for
Cloud SQL 8.0:
● FIPS mode
● Resource groups
Unsupported plugins
● InnoDB memcached plugin
● X plugin
● Clone plugin
● InnoDB data-at-rest encryption
● validate_password component
Unsupported statements
● LOAD DATA INFILE
● SELECT ... INTO OUTFILE
● SELECT ... INTO DUMPFILE
● INSTALL PLUGIN …
● UNINSTALL PLUGIN
● CREATE FUNCTION ... SONAME …
Thanks
For
Watching
Google Cloud Spanner
About
● Google Cloud Spanner is a fully managed, horizontally scalable, cloud-native
database service that offers globally consistent, high-performance
transactions, and strong consistency across all rows, tables, and indexes. It is
designed to handle the most demanding workloads and provides the ability to
scale up or down as needed.
● Cloud Spanner is well-suited for applications that require high availability,
strong consistency, and high performance, such as financial systems,
e-commerce platforms, and real-time analytics.
Key Features
● Global distribution: Cloud Spanner allows you to replicate your data across
multiple regions, ensuring low latency and high availability for your
applications.
● Strong consistency: Cloud Spanner provides strong consistency across all
rows, tables, and indexes, allowing you to always read the latest data.
● High performance: Cloud Spanner is designed to handle the most demanding
workloads, with the ability to scale up or down as needed.
Key Features
● Fully managed: Cloud Spanner is fully managed by Google, meaning you
don't have to worry about hardware, software, or infrastructure.
● SQL support: Cloud Spanner supports a standard SQL API, making it easy to
integrate with existing applications and tools.
● Integration with other Google Cloud services: Cloud Spanner integrates with
other Google Cloud services, such as BigQuery and Cloud Functions,
allowing you to build scalable and powerful applications.
Additional Details
● Data modeling: Cloud Spanner uses a traditional relational database model,
with tables, rows, and columns. It supports the standard SQL data types, such
as INT64, FLOAT64, BOOL, and STRING. You can also use Cloud Spanner's
data definition language (DDL) to create and modify tables, indexes, and
other database objects.
● Indexing: Cloud Spanner supports both primary keys and secondary indexes,
allowing you to query and filter your data efficiently. You can create unique
and non-unique indexes, as well as composite indexes that cover multiple
columns.
Additional Details
● Transactions: Cloud Spanner supports transactions, allowing you to execute
multiple SQL statements as a single unit of work. Transactions provide ACID
(atomicity, consistency, isolation, and durability) guarantees, ensuring that
your data is always consistent and accurate.
● Replication: Cloud Spanner uses a distributed architecture to replicate your
data across multiple regions, providing high availability and low latency for
your applications. You can choose how many replicas you want for each
region, based on your performance and availability requirements.
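For illustration, a minimal sketch of the transaction support described above using the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical and credentials are assumed to be configured.

from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")    # hypothetical instance ID
database = instance.database("my-database")  # hypothetical database ID

def transfer(transaction):
    # Both DML statements commit atomically as a single transaction.
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - 30 WHERE AccountId = 1"
    )
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 30 WHERE AccountId = 2"
    )

database.run_in_transaction(transfer)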
Additional Details
● Security: Cloud Spanner follows best practices for data security and privacy,
including encryption of data at rest and in transit, access controls, and
auditing. It also integrates with Google Cloud's Identity and Access
Management (IAM) service, allowing you to set fine-grained permissions for
your users and applications.
How does google cloud spanner work ?
● You create a Cloud Spanner database and define your schema, including
tables, columns, and indexes.
● You can then load data into your Cloud Spanner database using SQL
INSERT, UPDATE, and DELETE statements, or using one of the available
import tools, such as Cloud Data Fusion or Cloud Dataproc.
● Cloud Spanner stores your data in a distributed data storage system called
Colossus, which is designed to scale horizontally across multiple servers and
regions. Colossus uses a combination of hard disks and solid-state drives
(SSDs) to store your data, with data replicated across multiple nodes for high
availability and low latency.
How does google cloud spanner work ?
● When you execute a SQL query or a transaction on your Cloud Spanner
database, the query or transaction is routed to the appropriate node based on
the data being accessed. Cloud Spanner uses a distributed lock manager to
ensure that transactions are executed in the correct order and to prevent
conflicts between concurrent transactions.
● Cloud Spanner automatically manages the underlying infrastructure and
software, including hardware provisioning, data replication, backup and
recovery, and security. You don't have to worry about these tasks, and you
can focus on building your applications..
Pricing
● Nodes: The number of nodes that you use determines the amount of
read/write throughput and storage capacity that your database has. You can
choose from two types of nodes:
● - Standard nodes: These nodes provide a good balance between cost and
performance, and are suitable for most workloads.
● - Memory nodes: These nodes offer higher read/write throughput and storage
capacity, but are more expensive than standard nodes.
Pricing
● Storage: The amount of storage that you use is based on the size of your
data, including indexes and backups. You can choose from two types of
storage:
● - SSD storage: This type of storage is suitable for most workloads and offers
good performance at a lower cost.
● - HDD storage: This type of storage is less expensive than SSD storage, but
offers slower performance.
Pricing
● Read/write operations: The number of read/write operations that you perform
is based on the number of queries and updates that you make to your
database. Read/write operations are charged per million operations.
● In addition to these components, Google Cloud Spanner also charges for
additional services such as data replication and backup storage. You can use
the Google Cloud Pricing Calculator to estimate the cost of using Google
Cloud Spanner for your specific workload.
● It's worth noting that Google Cloud Spanner offers a number of pricing
discounts and commitments, such as sustained use discounts and custom
usage commitments, which can help you save money on your Cloud Spanner
usage.
Use Cases
● Online transaction processing (OLTP) applications: Cloud Spanner is
well-suited for applications that require low-latency read/write access to a
large number of records, such as e-commerce platforms, financial systems,
and customer relationship management (CRM) systems.
● Analytics and reporting: Cloud Spanner can be used to store and analyze
large amounts of data in real-time, making it suitable for applications such as
business intelligence, data warehousing, and data lakes.
Use Cases
● Internet of Things (IoT) applications: Cloud Spanner can handle the large
volume of data generated by IoT devices, making it suitable for applications
such as smart cities, connected cars, and industrial IoT.
● Mobile and web applications: Cloud Spanner can support the high read/write
throughput and availability requirements of mobile and web applications,
making it suitable for applications such as social networks, gaming, and
content management systems.
Use Cases
● Hybrid and multi-cloud applications: Cloud Spanner can support hybrid and
multi-cloud architectures, making it suitable for applications that require data
to be accessed and modified from multiple locations.
● Microservices and distributed systems: Cloud Spanner can support the high
availability and consistency requirements of microservices and distributed
systems, making it suitable for applications such as distributed databases,
distributed caches, and event-driven architectures.
How does google cloud spanner provide high availability &
scalability ?
● High availability: Spanner is designed to provide 99.999% uptime, which
means that it is able to operate with minimal downtime. It achieves this
through a combination of techniques such as distributed data storage,
replication, and failover.
● Scalability: Spanner is able to scale horizontally, which means that you can
easily add more capacity to your database by adding more machines. It also
has automatic sharding, which means that it can automatically distribute your
data across multiple machines as your data grows.
How does google cloud spanner provide high availability &
scalability ?
● Consistency: Spanner uses a technology called "TrueTime" to provide strong
consistency guarantees across all of its replicas, which means that you can
be confident that all replicas of your data will be consistent with each other at
all times.
How does google cloud spanner provide global
consistency ?
● Google Cloud Spanner provides global consistency through the use of a
technology called "TrueTime." TrueTime is a distributed global clock that
provides a consistent view of time across all of the machines in a Spanner
cluster.
● TrueTime works by using a combination of atomic clocks, GPS receivers, and
network time protocol (NTP) servers to provide a highly accurate and
consistent view of time. It allows Spanner to provide strong consistency
guarantees across all of its replicas, which means that you can be confident
that all replicas of your data will be consistent with each other at all times.
How does google cloud spanner provide global
consistency ?
● TrueTime is used by Spanner to provide a consistent view of time for
operations such as transactions and reads. For example, if you execute a
transaction that involves multiple reads and writes, Spanner will use TrueTime
to ensure that the reads and writes are all executed in the correct order, even
if they are distributed across different machines. This helps to ensure that
your data remains consistent and correct, even in the face of network delays
and other potential issues.
Thanks
For
Watching
Dataflow
1
Dataflow
Serverless, fast, and cost-effective data-
processing service
Stream and batch data
Automatic Infrastructure provisioning
Automatic Scaling as your data grows
2
Dataflow
Real-time data comes from many different sources, but
capturing, processing, and analyzing it is
not easy because it's usually not in the
desired format for your downstream
systems
3
Dataflow
Read the data from the source ->
transform -> write it back into a sink
4
Dataflow
Portable
Processing pipeline created using open
source Apache Beam libraries in the
language of your choice
Dataflow job
Processing on worker virtual machines
5
Dataflow
Run Dataflow jobs using the Cloud Console
UI, the gcloud CLI, or the APIs
Prebuilt or Custom templates
Write SQL statements to develop pipelines
right from BigQuery UI or use AI Platform
Notebooks
6
Dataflow
Data encrypted at rest
In Transit with an option to use customer-
managed encryption keys
Use private IPs and VPC service controls to
secure the environment
7
Dataflow
Dataflow is a great choice for use cases
such as real-time AI, data warehousing or
stream analytics
8
Thanks
for
Watching
9
Google Dataflow
1
Definition
2
● A fully managed service for executing Apache Beam pipelines within the
Google Cloud ecosystem
● Google Cloud Dataflow was announced in June 2014 and released to the
general public as an open beta in April 2015
Features
● NoOps and Serverless
● Handles infrastructure setup
● Handles maintenance
● Built on Google infrastructure
● Reliable auto scaling
● Meet data pipeline demands
3
Dataflow vs. Dataproc
4
Dataflow, Dataproc comparison
5
| Dataflow | Dataproc
Recommended for | New data processing pipelines, unified batch and streaming | Existing Hadoop/Spark applications, machine learning/data science ecosystem, large-batch jobs, preemptible VMs
Fully-managed | Yes | No
Auto-scaling | Yes, transform-by-transform (adaptive) | Yes, based on cluster utilization (reactive)
Expertise | Apache Beam | Hadoop, Hive, Pig, Apache Big Data ecosystem, Spark, Flink, Presto, Druid
Apache Beam = Batch + strEAM
6
Dataflow pipeline = Directed Acyclic Graph
7
What is a PCollection ?
8
● In Apache Beam, a PCollection (short for "Parallel Collection") is an immutable data
set that is distributed across a set of workers for parallel processing.
● It represents a distributed dataset that can be processed in parallel using the
Apache Beam programming model.
● A PCollection can be created from an external source, such as a text file or a
database, or it can be created as the output of a Beam transform, such as a map or
filter operation.
● PCollections can be transformed and combined with other PCollections using
operations like map, filter, and join.
● Once a pipeline has been defined, the data in a PCollection can be processed by
executing the pipeline using a runner, such as the Google Cloud Dataflow, Apache
Flink, or Apache Spark runners.
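A minimal sketch of building PCollections with the Apache Beam Python SDK; it runs on the local DirectRunner by default, and the sample words are made up.

import apache_beam as beam

# Runs locally on the DirectRunner by default; pass pipeline options to target Dataflow.
with beam.Pipeline() as pipeline:
    words = pipeline | "CreateWords" >> beam.Create(["storage", "bigquery", "dataflow"])
    lengths = (
        words
        | "ToLength" >> beam.Map(lambda w: (w, len(w)))    # element-wise transform
        | "KeepLong" >> beam.Filter(lambda kv: kv[1] > 7)  # keep only the longer words
    )
    lengths | "Print" >> beam.Map(print)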
What is a PTransform ?
● In Apache Beam, a PTransform (short for "Parallel Transform") is a
fundamental building block for constructing data processing pipelines.
● It represents a computation that takes one or more PCollections as input,
performs a set of operations on the data, and produces one or more output
PCollections.
● PTransforms can be either pre-defined (e.g., Map, Filter, GroupByKey) or
user-defined.
● Pre-defined PTransforms are provided by the Apache Beam SDK and can be
used to perform common data processing tasks, such as mapping, filtering,
and grouping data. User-defined PTransforms allow you to implement custom
logic for your data processing needs.
9
What is a PTransform ?
● PTransforms are applied to PCollections using the apply() method, which takes
one or more PCollections as input and returns one or more output
PCollections.
● For example, the following code applies a Map PTransform to a PCollection
words to produce a new PCollection lengths:
● lengths = words | beam.Map(lambda x: len(x))
● In this example, the Map PTransform takes as input a PCollection words and
applies the lambda function lambda x: len(x) to each element in the collection,
producing a new PCollection lengths that contains the lengths of the words in
words.
10
ParDo = Parallel Do = Parallel Execution [Transform]
11
GroupByKey Transform
12
Takes a keyed collection of elements and produces a collection where each element consists of a
key and all values associated with that key.
GroupByKey Transform Output
13
GroupByKey explicitly shuffles key-value pairs
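A small sketch of GroupByKey in the Beam Python SDK, using made-up key-value pairs.

import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | beam.Create([("cat", 1), ("dog", 5), ("cat", 9), ("dog", 2)])
    # Groups all values under each key: ("cat", [1, 9]) and ("dog", [5, 2]).
    grouped = pairs | beam.GroupByKey()
    grouped | beam.Map(print)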
CoGroupByKey Transform
14
● Aggregates all input elements by their key and allows downstream processing to consume all
values associated with the key.
● While GroupByKey performs this operation over a single input collection and thus a single type
of input values, CoGroupByKey operates over multiple input collections.
CoGroupByKey Transform Output
15
CoGroupByKey joins two or more key-value pairs
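A small sketch of CoGroupByKey in the Beam Python SDK joining two keyed collections; the email and phone values are made up.

import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("amy", "amy@example.com"), ("carl", "carl@example.com")])
    phones = p | "Phones" >> beam.Create([("amy", "111-222"), ("james", "333-444")])

    # Each output element is (key, {'emails': [...], 'phones': [...]}).
    joined = {"emails": emails, "phones": phones} | beam.CoGroupByKey()
    joined | beam.Map(print)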
CombinePerKey Transform
16
Combines all elements for each key in a collection.
CombineGlobally Transform
17
Combines all elements in a collection.
CombineGlobally Transform Output
18
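A small sketch contrasting CombinePerKey and CombineGlobally in the Beam Python SDK, on made-up sales figures.

import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([("us", 10), ("eu", 4), ("us", 6), ("eu", 1)])

    per_key = sales | beam.CombinePerKey(sum)                   # ("us", 16), ("eu", 5)
    total = sales | beam.Values() | beam.CombineGlobally(sum)   # 21

    per_key | "PrintPerKey" >> beam.Map(print)
    total | "PrintTotal" >> beam.Map(print)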
Flatten Transform
19
● Merges multiple PCollection objects into a single logical PCollection.
● A transform for PCollection objects that store the same data type.
Partition Transform
20
● Separates elements in a collection into multiple output collections.
● The partitioning function contains the logic that determines how to separate
the elements of the input collection into each resulting partition output
collection.
● The number of partitions must be determined at graph construction time. You
cannot determine the number of partitions in mid-pipeline
Partition Transform
21
Partition Transform Output
22
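A small sketch of Partition in the Beam Python SDK; the two-way split and the threshold of 100 are arbitrary choices for illustration.

import apache_beam as beam

def by_size(order_value, num_partitions):
    # Partition 0 = small orders, partition 1 = large orders.
    return 1 if order_value >= 100 else 0

with beam.Pipeline() as p:
    orders = p | beam.Create([20, 150, 75, 300])
    # The number of partitions (2) is fixed at graph construction time.
    small, large = orders | beam.Partition(by_size, 2)
    small | "PrintSmall" >> beam.Map(lambda x: print("small:", x))
    large | "PrintLarge" >> beam.Map(lambda x: print("large:", x))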
DoFn
● The DoFn object that you pass to ParDo contains the processing logic that
gets applied to the elements in the input collection.
● executed with ParDo
● exposed to the context (timestamp, window pane, etc)
● can consume side inputs
● can produce multiple outputs or no outputs at all
● can produce side outputs
● can use Beam's persistent state APIs
● dynamically typed
23
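A sketch of a user-defined DoFn executed with ParDo in the Beam Python SDK, showing the multiple-outputs capability listed above via tagged outputs; the input line and output tag names are made up.

import apache_beam as beam

class SplitWords(beam.DoFn):
    def process(self, element):
        # A DoFn's process() can yield zero, one, or many output elements.
        for word in element.split():
            if word.isupper():
                # Emit acronyms to a separate, tagged output.
                yield beam.pvalue.TaggedOutput("acronyms", word)
            else:
                yield word

with beam.Pipeline() as p:
    lines = p | beam.Create(["GCP dataflow runs BEAM pipelines"])
    results = lines | beam.ParDo(SplitWords()).with_outputs("acronyms", main="words")
    results.words | "PrintWords" >> beam.Map(print)
    results.acronyms | "PrintAcronyms" >> beam.Map(print)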
Dataflow templates
● Dataflow templates allow you to package a Dataflow pipeline for deployment.
● Anyone with the correct permissions can then use the template to deploy the
packaged pipeline.
● You can create your own custom Dataflow templates, and Google provides
pre-built templates for common scenarios.
Flex templates, which are newer and recommended
Classic templates
24
Google Provided pre-built templates
Streaming
● Pub/Sub to BigQuery
● Pub/Sub to Cloud Storage
● Datastream to BigQuery
● Pub/Sub to MongoDB
Batch
● BigQuery to Cloud Storage
● Bigtable to Cloud Storage
● Cloud Storage to BigQuery
● Cloud Spanner to Cloud Storage
Utility
● Bulk compression of Cloud Storage files
● Firestore bulk delete
● File format conversion
25
Windows and Windowing Function
● Tumbling windows (called fixed windows in Apache Beam)
● Hopping windows (called sliding windows in Apache Beam)
● Sessions
26
Tumbling windows
27
Hopping windows
A hopping window represents a consistent time interval in the data stream.
Hopping windows can overlap, whereas tumbling windows are disjoint.
28
Sessions Windows
A session window contains elements within a gap duration of another element.
The gap duration is an interval between new data in a data stream. If data arrives
after the gap duration, the data is assigned to a new window.
29
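A sketch of applying these window types with the Beam Python SDK's WindowInto transform; the timestamps, keys, and window sizes are made up, and the commented alternatives show the hopping and session variants.

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    # Attach event-time timestamps (in seconds) to made-up (user, click) pairs.
    events = (
        p
        | beam.Create([("user1", 1, 10), ("user1", 1, 70), ("user2", 1, 12)])
        | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
    )

    # Tumbling (fixed) 1-minute windows; swap in window.SlidingWindows(60, 15)
    # for hopping windows, or window.Sessions(600) for a 10-minute gap duration.
    counts = (
        events
        | beam.WindowInto(window.FixedWindows(60))
        | beam.CombinePerKey(sum)
    )
    counts | beam.Map(print)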
Watermarks
● A watermark is a threshold that indicates when Dataflow expects all of the
data in a window to have arrived.
● If new data arrives with a timestamp that's in the window but older than the
watermark, the data is considered late data.
● Dataflow tracks watermarks because of the following:
● Data is not guaranteed to arrive in time order or at predictable intervals.
● Data events are not guaranteed to appear in pipelines in the same order that
they were generated.
● The data source determines the watermark.
● You can allow late data with the Apache Beam SDK.
● Dataflow SQL does not process late data.
30
Triggers
● Triggers determine when to emit aggregated results as data arrives. By
default, results are emitted when the watermark passes the end of the
window.
● You can use the Apache Beam SDK to create or modify triggers for each
collection in a streaming pipeline. You cannot set triggers with Dataflow SQL.
● Types of triggers
○ Event time: as indicated by the timestamp on each data element.
○ Processing time: which is the time that the data element is processed at any given stage in the
pipeline.
○ The number of data elements in a collection.
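A hedged Beam Python sketch of a composite trigger; the durations and the allowed lateness are assumptions chosen for illustration:

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (AccumulationMode, AfterCount,
                                             AfterProcessingTime, AfterWatermark)

# Emit an early (speculative) result every 30 seconds of processing time,
# the main result when the watermark passes the end of each 60-second window,
# and one late result per late element, allowing 10 minutes of lateness.
windowed_and_triggered = beam.WindowInto(
    window.FixedWindows(60),
    trigger=AfterWatermark(early=AfterProcessingTime(30), late=AfterCount(1)),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=600)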
31
Side Inputs
A side input is an additional input that your DoFn can access each time it processes an
element in the input PCollection
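For example (an illustrative Beam Python sketch), the mean word length below is computed once and passed to every call of the filter function as a side input:

import apache_beam as beam

with beam.Pipeline() as p:
    words = p | beam.Create(["a", "bb", "ccc", "dddd"])
    mean_len = (words
                | beam.Map(len)
                | beam.combiners.Mean.Globally())
    # AsSingleton turns the one-element collection into a value the function can read.
    longer_than_avg = words | beam.Filter(
        lambda word, avg: len(word) > avg,
        avg=beam.pvalue.AsSingleton(mean_len))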
32
Run Cloud Dataflow Pipelines
● Locally - lets you test and debug your Apache Beam pipeline
● Dataflow - a managed data processing service for running Apache Beam
pipelines
33
Cloud Dataflow Managed Service
34
Security and permissions for pipelines
● The Dataflow service account - The Dataflow service uses the Dataflow
service account as part of the job creation request, such as to check project
quota and to create worker instances on your behalf, and during job execution
to manage the job. This account is also known as the Dataflow service agent.
● The worker service account - Worker instances use the worker service
account to access input and output resources after you submit your job. By
default, workers use your project's Compute Engine default service account
as the worker service account.
35
Worker Service Account Roles
● For the worker service account to be able to create, run, and examine a job, it
must have the following roles:
○ roles/dataflow.admin
○ roles/dataflow.worker
36
Additional Roles for accessed service
● You need to grant the required roles to your Dataflow project's worker service
account so that it can access the resources while running the Dataflow job
● If your job writes to BigQuery, your service account must also have at least
the roles/bigquery.dataEditor role.
● Other Services
○ Cloud Storage buckets
○ BigQuery datasets
○ Pub/Sub topics and subscriptions
○ Firestore datasets
37
Cloud Dataflow Service Account
● Automatically created
● Manages job resources
● Assumes the Cloud Dataflow Service Agent role
● Has read/write access to project resources
38
Required permissions that the caller must have
39
Dataflow Built-in Roles
● Dataflow Admin - (roles/dataflow.admin)
● Dataflow Developer - (roles/dataflow.developer)
● Dataflow Viewer - (roles/dataflow.viewer)
● Dataflow Worker - (roles/dataflow.worker)
40
Integrations
41
Typical Workflow
42
High availability and geographic redundancy
43
Example CI/CD Pipeline
44
Thanks
For
Watching
45
Cloud Dataproc
1
Cloud Dataproc
Dataproc is a managed service for running open source software (OSS)
jobs for big data processing, including ETL and machine
learning
Out-of-the-box support for the most popular open-source
software
You can use Dataproc to migrate your on-premises OSS clusters to
the cloud
Maximizing efficiency and enabling scale
Use it with Cloud AI Notebook or BigQuery to build an end-to-end
data science environment
You can launch an IT governed, auto-scaling cluster in just 90
seconds 2
Cloud Dataproc
It manages the cluster creation, monitoring, and job
orchestration for you
Web UI, Cloud SDK, REST APIs, or with SSH access
You can submit jobs in your opensource framework of choice
Scale your cluster up or down at any time
Even when jobs are running
Pay for what you use down to the second
3
Thanks
for
Watching 4
Google Dataproc
1
Dataproc
2
● Dataproc is a fully managed and highly scalable service for running Apache
Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and
frameworks.
● Use Dataproc for data lake modernization, ETL, and secure data science, at
scale, integrated with Google Cloud, at a fraction of the cost.
Where does it stand in Data Pipeline
3
Benefits
● Open: Run open source data analytics at scale, with enterprise grade security
● Flexible: Use serverless, or manage clusters on Google Compute and
Kubernetes
● Intelligent: Enable data users through integrations with Vertex AI, BigQuery,
and Dataplex
● Secure: Configure advanced security such as Kerberos, Apache Ranger and
Personal Authentication
● Cost-effective: Realize 54% lower TCO compared to on-prem data lakes with
per-second pricing
4
Key features - 1
● Fully managed and automated big data open source software
● Containerize Apache Spark jobs with Kubernetes
● Enterprise security integrated with Google Cloud
● The best of open source with the best of Google Cloud
● Serverless Spark
● Resizable clusters
● Autoscaling clusters
● Cloud integrated
● Versioning
● Cluster scheduled deletion
● Automatic or manual configuration
5
Key features - 2
● Developer tools
● Initialization actions
● Optional components
● Custom containers and images
● Flexible virtual machines
● Component Gateway and notebook access
● Workflow templates
● Automated policy management
● Smart alerts
● Dataproc metastore
6
Fully managed and automated big data open source
software
● Serverless deployment, logging, and monitoring let you focus on your data
and analytics, not on your infrastructure.
● Reduce TCO of Apache Spark management by up to 54%.
● Enable data scientists and engineers to build and train models 5X faster,
compared to traditional notebooks, through integration with Vertex AI
Workbench.
● The Dataproc Jobs API makes it easy to incorporate big data processing into
custom applications, while Dataproc Metastore eliminates the need to run
your own Hive metastore or catalog service.
7
Containerize Apache Spark jobs with Kubernetes
● Build your Apache Spark jobs using Dataproc on Kubernetes so you can use
Dataproc with Google Kubernetes Engine (GKE) to provide job portability and
isolation.
How Dataproc on GKE works
● Dataproc on GKE deploys Dataproc virtual clusters on a GKE cluster.
● Unlike Dataproc on Compute Engine clusters, Dataproc on GKE virtual
clusters do not include separate master and worker VMs.
● Instead, when you create a Dataproc on GKE virtual cluster, Dataproc on
GKE creates node pools within a GKE cluster.
● The node pools and scheduling of pods on the node pools are managed by
GKE.
8
Enterprise security integrated with Google Cloud
● When you create a Dataproc cluster, you can enable Hadoop Secure Mode
via Kerberos by adding a Security Configuration.
● Additionally, some of the most commonly used Google Cloud-specific security
features used with Dataproc include default at-rest encryption, OS Login, VPC
Service Controls, and customer-managed encryption keys (CMEK).
9
Best Open source with the best of Google Cloud
● Dataproc lets you take the open source tools, algorithms, and programming
languages that you use today, but makes it easy to apply them on cloud-scale
datasets.
● At the same time, Dataproc has out-of-the-box integration with the rest of the
Google Cloud analytics, database, and AI ecosystem.
● Data scientists and engineers can quickly access data and build data
applications connecting Dataproc to BigQuery, Vertex AI, Cloud Spanner,
Pub/Sub, or Data Fusion.
10
Serverless Spark
● Deploy Spark applications and pipelines that autoscale without any manual
infrastructure provisioning or tuning.
● Spark is integrated with BigQuery, Vertex AI, and Dataplex, so you can write
and run it from these interfaces in two clicks, without custom integrations, for
ETL, data exploration, analysis, and ML.
11
Resizable clusters
● Create and scale clusters quickly with various virtual machine types, disk
sizes, number of nodes, and networking options.
● After creating a Dataproc cluster, you can adjust ("scale") the cluster by
increasing or decreasing the number of primary or secondary worker nodes
(horizontal scaling) in the cluster.
● You can scale a Dataproc cluster at any time, even when jobs are running on
the cluster.
● You cannot change the machine type of an existing cluster (vertical scaling).
● To vertically scale, create a cluster using a supported machine type, then
migrate jobs to the new cluster.
12
Autoscaling clusters
● Dataproc autoscaling provides a mechanism for automating cluster resource
management and enables automatic addition and subtraction of cluster
workers (nodes).
● The Dataproc AutoscalingPolicies API provides a mechanism for automating
cluster resource management and enables cluster worker VM autoscaling.
● An Autoscaling Policy is a reusable configuration that describes how cluster
workers using the autoscaling policy should scale.
● It defines scaling boundaries, frequency, and aggressiveness to provide
fine-grained control over cluster resources throughout cluster lifetime.
13
When to use autoscaling
● on clusters that store data in external services, such as Cloud Storage or
BigQuery
● on clusters that process many jobs
● to scale up single-job clusters
● with Enhanced Flexibility Mode for Spark batch jobs
14
When NOT to use autoscaling - 1
● HDFS
○ HDFS utilization is not a signal for autoscaling.
○ HDFS data is only hosted on primary workers.
○ The number of primary workers must be sufficient to host all HDFS data.
○ Decommissioning HDFS DataNodes can delay the removal of workers.
15
When NOT to use autoscaling - 2
● Autoscaling does not support YARN Node Labels, nor the property
dataproc:am.primary_only
● Autoscaling does not support Spark Structured Streaming
● Autoscaling is not recommended for the purpose of scaling a cluster down to
minimum size when the cluster is idle.
● When small and large jobs run on a cluster, graceful decommissioning
scale-down will wait for large jobs to finish.
16
Single node clusters
● This single node acts as the master and worker for your Dataproc cluster.
● Trying out new versions of Spark and Hadoop or other open source
components
● Building proof-of-concept (PoC) demonstrations
● Lightweight data science
● Small-scale non-critical data processing
● Education related to the Spark and Hadoop ecosystem
17
Limitations
● Single node clusters are not recommended for large-scale parallel data
processing.
● Single node clusters are not available with high-availability since there is only
one node in the cluster.
● Single node clusters cannot use preemptible VMs.
18
High Availability Mode
● When creating a Dataproc cluster, you can put the cluster into Hadoop High
Availability (HA) mode by specifying the number of master instances in the
cluster.
● The number of masters can only be specified at cluster creation time.
○ 1 master (default, non HA)
○ 3 masters (Hadoop HA)
19
Cloud integrated
● Built-in integration with Cloud Storage, BigQuery, Dataplex, Vertex AI,
Composer, Cloud Bigtable, Cloud Logging, and Cloud Monitoring, giving you
a more complete and robust data platform.
● Dataproc has built-in integration with other Google Cloud Platform services,
such as BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud
Monitoring, so you have more than just a Spark or Hadoop cluster—you have
a complete data platform.
● For example, you can use Dataproc to effortlessly ETL terabytes of raw log
data directly into BigQuery for business reporting.
20
Versioning
● Image versioning allows you to switch between different versions of Apache
Spark, Apache Hadoop, and other tools.
● Dataproc uses images to tie together useful Google Cloud Platform
connectors and Apache Spark & Apache Hadoop components into one
package that can be deployed on a Dataproc cluster.
● These images contain the base operating system (Debian or Ubuntu) for the
cluster, along with core and optional components needed to run jobs, such as
Spark, Hadoop, and Hive.
● These images will be upgraded periodically to include new improvements and
features.
● Dataproc versioning allows you to select sets of software versions when you
create clusters.
21
Cluster scheduled deletion
● To help avoid incurring Google Cloud charges for an inactive cluster, use
Dataproc's Cluster Scheduled Deletion feature when you create a cluster.
● This feature provides options to delete a cluster:
○ after a specified cluster idle period
○ at a specified future time
○ after a specified period that starts from the time of submission of the cluster creation request
22
Automatic or manual configuration
● Dataproc automatically configures hardware and software but also gives you
manual control.
● The open source components installed on Dataproc clusters contain many
configuration files.
● For example, Apache Spark and Apache Hadoop have several XML and plain
text configuration files.
● You can use the --properties flag of the gcloud dataproc clusters create
command to modify many common configuration files when creating a cluster.
23
Developer tools
● Multiple ways to manage a cluster, including an easy-to-use web UI, the
Cloud SDK, RESTful APIs, and SSH access.
● Integrate with APIs using Client Libraries for Java, Python, Node.js, Ruby, Go,
.NET, and PHP
● Script or interact with cloud resources at scale using the Google Cloud CLI
● Accelerate local development with emulators for Pub/Sub, Spanner, Bigtable,
and Datastore
24
Initialization actions - 1
● Run initialization actions to install or customize the settings and libraries you
need when your cluster is created.
25
Initialization actions - 2
● Initialization actions run as the root user; you do not need to use sudo
● Initialization actions are executed on each node during cluster creation
● Use absolute paths in initialization actions
● Use a shebang line in initialization actions to indicate how the script should be
interpreted (such as #!/bin/bash or #!/usr/bin/python); a minimal sketch follows below
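A minimal sketch of what a Python initialization action could look like; the installed package is only an example and the script assumes pip is available on the cluster image:

#!/usr/bin/python
# Hypothetical Dataproc initialization action: runs as root on every node
# while the cluster is being created.
import subprocess

# Install an extra Python package on each node (the package is illustrative).
subprocess.check_call(["pip", "install", "requests==2.31.0"])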
26
Optional components
● Use optional components to install and configure additional components on
the cluster.
● Optional components are integrated with Dataproc components and offer fully
configured environments for Zeppelin, Presto, and other open source
software components related to the Apache Hadoop and Apache Spark
ecosystem.
27
Optional Components - Example
28
Custom containers and images
● Dataproc serverless Spark can be provisioned with custom Docker containers.
● Dataproc clusters can be provisioned with a custom image that includes your
pre-installed Linux operating system packages.
29
Flexible virtual machines
● Clusters can use custom machine types and preemptible virtual machines to make
them the perfect size for your needs.
● Dataproc clusters are built on Compute Engine instances.
● Machine types define the virtualized hardware resources available to an instance.
● Compute Engine offers both predefined machine types and custom machine types.
● Dataproc clusters can use both predefined and custom types for both master and/or
worker nodes.
● In addition to using standard Compute Engine VMs as Dataproc workers (called
"primary" workers), Dataproc clusters can use secondary workers.
● There are three types of secondary workers: spot VMs, standard preemptible VMs,
and non-preemptible VMs.
● If you specify secondary workers for your cluster, they must be the same type.
● The default Dataproc secondary worker type is the standard preemptible VM.
30
Component Gateway and notebook access
● Dataproc Component Gateway enables secure, one-click access to Dataproc
default and optional component web interfaces running on the cluster.
● Open source components included with Google Dataproc clusters, such as
Apache Hadoop and Apache Spark, provide web interfaces.
● These interfaces can be used to manage and monitor cluster resources and
facilities, such as the YARN resource manager, the Hadoop Distributed File
System (HDFS), MapReduce, and Spark.
● Component Gateway provides secure access to web endpoints for Dataproc
default and optional components.
● Clusters created with Dataproc image version 1.3.29 and later can enable
access to component web interfaces without relying on SSH tunnels or
modifying firewall rules to allow inbound traffic.
31
Workflow templates
● Dataproc workflow templates provide a flexible and easy-to-use mechanism
for managing and executing workflows.
● A workflow template is a reusable workflow configuration that defines a graph
of jobs with information on where to run those jobs.
● A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs
on a cluster.
● Workflows are ideal for complex job flows. You can create job dependencies
so that a job starts only after its dependencies complete successfully.
32
Automated policy management
● Standardize security, cost, and infrastructure policies across a fleet of
clusters.
● You can create policies for resource management, security, or network at a
project level.
● You can also make it easy for users to use the correct images, components,
metastore, and other peripheral services, enabling you to manage your fleet
of clusters and serverless Spark policies in the future.
33
Smart alerts
● Dataproc recommended alerts allow customers to adjust the thresholds of the
pre-configured alerts to get alerted on idle clusters, runaway clusters and jobs,
overutilized clusters, and more.
● Customers can further customize these alerts and even create advanced
cluster and job management capabilities.
● These capabilities allow customers to manage their fleet at scale.
34
Dataproc metastore
● Fully managed, highly available Hive Metastore (HMS) with fine-grained
access control and integration with BigQuery metastore, Dataplex, and Data
Catalog.
● Dataproc Metastore provides you with a fully compatible Hive Metastore
(HMS), which is the established standard in the open source big data
ecosystem for managing technical metadata.
● This service helps you manage the metadata of your data lakes and provides
interoperability between the various data processing tools you're using.
35
Connectors
● BigQuery connector - enables programmatic read/write access to BigQuery
● Bigtable connector
○ Bigtable is an excellent option for Apache Spark or Hadoop workloads that require Apache
HBase.
○ Bigtable supports the Apache HBase 1.0+ APIs and offers a Bigtable HBase client in Maven,
so it is easy to use Bigtable with Dataproc.
● Cloud Storage connector - The Cloud Storage connector is an open source
Java library that lets you run Apache Hadoop or Apache Spark jobs directly on
data in Cloud Storage, and offers a number of benefits over choosing the
Hadoop Distributed File System (HDFS).
● Pub/Sub Lite - The Pub/Sub Lite Spark Connector supports Pub/Sub Lite as
an input source to Apache Spark Structured Streaming in the default
micro-batch processing and experimental continuous processing modes.
36
Dataproc on Compute Engine pricing
● Dataproc is billed by the second, and all Dataproc clusters are billed in
one-second clock-time increments, subject to a 1-minute minimum billing.
● Dataproc on Compute Engine pricing is based on the size of Dataproc
clusters and the duration of time that they run.
● The size of a cluster is based on the aggregate number of virtual CPUs
(vCPUs) across the entire cluster, including the master and worker nodes.
● The duration of a cluster is the length of time between cluster creation and
cluster stopping or deletion.
● Total Price = $0.010 * # of vCPUs * hourly duration
37
Pricing example
For a cluster with 24 vCPUs in total that runs for 2 hours:
Dataproc charge = # of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 =
$0.48
38
Dataproc on GKE pricing
● The Dataproc on GKE pricing formula, $0.010 * # of vCPUs * hourly duration,
is the same as the Dataproc on Compute Engine pricing formula, and is
applied to the aggregate number of virtual CPUs running in VMs instances in
Dataproc-created node pools in the cluster.
39
Project roles
40
Dataproc Roles
● Dataproc Admin
● Dataproc Editor
● Dataproc Viewer
● Dataproc Worker (for service accounts only)
41
IAM roles and Dataproc operations summary
42
Dataproc service accounts
● Dataproc VM service account: VMs in a Dataproc cluster use this service
account for Dataproc data plane operations, such as reading and writing data
from and to Cloud Storage and BigQuery
● Dataproc Service Agent service account: Dataproc creates this service
account with the Dataproc Service Agent role in a Dataproc user's Google
Cloud project.
43
Thanks
For
Watching
44
Cloud Pub/Sub
1
Cloud Pub/Sub
Cloud Pub/Sub is an asynchronous messaging service
Send, Receive, and Filter events or data streams
Durable Message Storage
Scalable in-order message delivery
Consistently high availability
Performance at any scale
Runs in all Google Cloud regions of the world
Serverless
Scales global data delivery auto-magically
Millions of messages per second
Data producers don't need to change anything when the
consumers of their data change 2
Cloud Pub/Sub
Services can be entirely stateless
Set up Pub/Sub between services or applications by defining
topics and then subscriptions
Services to receive the messages published on those topics
one-to-many communications
Spread your workload over multiple workers
E.g. send logs from your security system to archiving, processing,
and analytic services
Stream your data into BigQuery or Dataflow for intelligent
processing
Ideal for notifications
3
Thanks
for
Watching 4
Google Pub/Sub
1
Core Concepts
2
Core Concepts - 1
● Topic. A named resource to which messages are sent by publishers.
● Subscription. A named resource representing the stream of messages from a
single, specific topic, to be delivered to the subscribing application. For more
details about subscriptions and message delivery semantics, see the
Subscriber Guide.
● Message. The combination of data and (optional) attributes that a publisher
sends to a topic and is eventually delivered to subscribers.
● Message attribute. A key-value pair that a publisher can define for a message.
For example, key iana.org/language_tag and value en could be added to
messages to mark them as readable by an English-speaking subscriber.
3
Core Concepts - 2
● Publisher. An application that creates and sends messages to a single or
multiple topics.
● Subscriber. An application with a subscription to a single or multiple topics to
receive messages from it.
● Acknowledgment (or "ack"). A signal sent by a subscriber to Pub/Sub after it
has received a message successfully. Acknowledged messages are removed
from the subscription message queue.
● Push and pull. The two message delivery methods. A subscriber receives
messages either by Pub/Sub pushing them to the subscriber chosen
endpoint, or by the subscriber pulling them from the service.
4
Many-to-one (fan-in) and One-to-many (fan-out)
5
Many-to-many
6
Common use cases
● Ingesting user interaction and server events.
● Real-time event distribution.
● Replicating data among databases.
● Parallel processing and workflows.
● Enterprise event bus.
● Data streaming from applications, services, or IoT devices.
● Refreshing distributed caches.
● Load balancing for reliability.
7
Integrations - 1
Stream processing and data integration.
● Dataflow: Dataflow templates and SQL, which allow processing and data
integration into BigQuery and data lakes on Cloud Storage.
● Dataflow templates for moving data from Pub/Sub to Cloud Storage,
BigQuery, and other products are available in the Pub/Sub and Dataflow UIs
in the Google Cloud console.
● Integration with Apache Spark, particularly when managed with Dataproc, is
also available.
● Visual composition of integration and processing pipelines running on Spark +
Dataproc can be accomplished with Data Fusion.
8
Integrations - 2
Monitoring, Alerting and Logging.
● Supported by Monitoring and Logging products.
Authentication and IAM.
● Pub/Sub relies on a standard OAuth authentication used by other Google
Cloud products and supports granular IAM, enabling access control for
individual resources.
9
Integrations - 3
APIs.
● Pub/Sub uses standard gRPC and REST service API technologies along with
client libraries for several languages.
Triggers, notifications, and webhooks.
● Pub/Sub offers push-based delivery of messages as HTTP POST requests to
webhooks.
● You can implement workflow automation using Cloud Functions or other
serverless products.
10
Integrations - 4
Orchestration.
● Pub/Sub can be integrated into multistep serverless Workflows declaratively.
● Big data and analytics orchestration is often done with Cloud Composer, which
supports Pub/Sub triggers.
● Application Integration provides a Pub/Sub trigger to trigger or start
integrations.
11
You can filter messages by their attributes from a
subscription
● When you receive messages from a subscription with a filter, you only receive
the messages that match the filter.
● The Pub/Sub service automatically acknowledges the messages that don't
match the filter.
● You can filter messages by their attributes, but not by the data in the
message.
● You can have multiple subscriptions attached to a topic and each subscription
can have a different filter.
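A hedged sketch using the google-cloud-pubsub Python client; the project, topic, subscription, and attribute names are assumptions:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "orders")
subscription_path = subscriber.subscription_path("my-project", "orders-eu-only")

# Only messages whose 'region' attribute equals 'eu' are delivered to this
# subscription; non-matching messages are acknowledged by the service.
subscriber.create_subscription(request={
    "name": subscription_path,
    "topic": topic_path,
    "filter": 'attributes.region = "eu"',
})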
12
Types of subscriptions
● Pull subscription
● Push subscription
● BigQuery subscription
13
Pull subscription
14
Push subscription
15
BigQuery subscription
16
Delivery Types - at-least-once delivery
● By default, Pub/Sub offers at-least-once delivery with no ordering guarantees
on all subscription types.
● Alternatively, if messages have the same ordering key and are in the same
region, you can enable message ordering.
● After you set the message ordering property, the Pub/Sub service delivers
messages with the same ordering key and in the order that the Pub/Sub
service receives the messages.
17
Delivery Types - exactly-once delivery
● Pub/Sub also supports exactly-once delivery.
● In general, Pub/Sub delivers each message once and in the order in which it
was published.
● However, messages may sometimes be delivered out of order or more than
once.
● Pub/Sub might redeliver a message even after an acknowledgement request
for the message returns successfully.
● This redelivery can be caused by issues such as server-side restarts or
client-side issues.
● Thus, although rare, any message can be redelivered at any time.
● Accommodating more-than-once delivery requires your subscriber to be
idempotent when processing messages.
18
Message Retention
● Unacknowledged messages are retained for a default of 7 days (configurable by the
subscription's message_retention_duration property).
● A topic can retain published messages for a maximum of 31 days (configurable by the
topic's message_retention_duration property) even after they have been acknowledged by
all attached subscriptions.
● In cases where the topic's message_retention_duration is greater than the subscription's
message_retention_duration, Pub/Sub discards a message only when its age exceeds the
topic's message_retention_duration.
● By default, subscriptions expire after 31 days of subscriber inactivity or if there are no
updates made to the subscription.
● When you modify either the message retention duration or subscription expiration policy,
the expiration period must be set to a value greater than the message retention duration.
The default message retention duration is 7 days and the default expiration period is 31
days.
19
Exponential backoff
● Exponential backoff lets you add progressively longer delays between retry
attempts.
● After the first delivery failure, Pub/Sub waits for a minimum backoff time
before retrying.
● For each consecutive message failure, more time is added to the delay, up to
a configurable maximum delay (the minimum and maximum backoff can each be
set between 0 and 600 seconds).
● The minimum and maximum delay intervals are not fixed; configure them
based on factors local to your application (a conceptual sketch follows below).
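A conceptual Python sketch of exponential backoff with jitter; it illustrates the idea only and is not Pub/Sub's internal implementation:

import random

def backoff_delays(minimum=1.0, maximum=600.0):
    # Yields progressively longer, jittered delays between retry attempts.
    delay = minimum
    while True:
        yield min(maximum, delay * random.uniform(0.5, 1.5))
        delay *= 2

delays = backoff_delays()
for attempt in range(5):
    print(f"retry {attempt}: wait {next(delays):.1f} s")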
20
Publishing with Pub/Sub Code
21
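The code on the slide itself is not reproduced here; as a hedged sketch, publishing with the google-cloud-pubsub Python client typically looks like this (project and topic names are placeholders):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "orders")

# Message data must be bytes; attributes are optional key-value strings.
future = publisher.publish(topic_path, b"order created",
                           order_id="12345", region="eu")
print(future.result())  # blocks until the server-assigned message ID is returned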
Thanks
For
Watching
22
BigQuery
1
BigQuery
BigQuery is Google Cloud's enterprise data
warehouse
Ingest
Store
Analyze
Visualize
2
Supports Ingesting Data via Batch or
Streaming Directly
Fully-managed data warehouse
Petabyte scale
BigQuery
3
BigQuery supports a standard SQL dialect
that is ANSI compliant
Interacting with BigQuery is easy
BigQuery
4
You can use the Cloud Console UI,
BigQuery command-line tool bq, or use
the API with client libraries of your choice
BigQuery integrates with several business
intelligence tools
BigQuery
5
Simple pricing model
You pay for data storage, streaming
inserts, and querying data
Loading and exporting data are free of
charge
BigQuery
6
Storage costs are based on the amount of
data stored
For queries, you can choose to pay per
query or a flat rate for dedicated
resources
BigQuery
7
Thanks
for
Watching
8
BigQuery
Views
1
BigQuery Views
A view is a virtual table defined by a SQL query.
Query it in the same way you query a table.
When a user queries the view, the query results contain data only from
the tables and fields specified in the query that defines the view.
How to use
Query editor box in the Google Cloud console
bq command-line tool's bq query command
BigQuery REST API to programmatically call the jobs.query or
query-type jobs.insert methods
BigQuery client libraries
You can also use a view as a data source for a visualization tool such as
Google Data Studio.
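As an illustrative sketch with the BigQuery Python client library (project, dataset, and view names are placeholders), a view is a table object whose view_query is set:

from google.cloud import bigquery

client = bigquery.Client()

view = bigquery.Table("my-project.reporting.usa_names_view")
view.view_query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
"""
view = client.create_table(view)  # afterwards the view is queried like a table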
2
BigQuery View Limitation
Read-only.
No DML (insert, update, delete) queries against a view.
The dataset that contains your view and the dataset that
contains the tables referenced by the view must be in the same
location.
No exporting of data from a view.
You cannot mix standard SQL and legacy SQL queries when using
views.
You cannot reference query parameters in views. 3
BigQuery View Limitation
You cannot include a temporary user-defined function or a
temporary table in the SQL query that defines a view.
You cannot reference a view in a wildcard table query.
4
Thanks
for
Watching 5
Google BigQuery
1
Typical Workflow
2
BigQuery is Google's data warehouse solution
3
BigQuery - An Ideal Data Warehouse
● Interactive SQL queries over large datasets (petabytes) in seconds
● Serverless and no-ops, including ad hoc queries
● Ecosystem of visualization and reporting tools
● Ecosystem of ETL and data processing tools
● Up-to-the-minute data
● Machine learning
● Security and collaboration
4
Why BigQuery is different from Traditional Databases ?
5
BigQuery Data Organization
6
Row Level Security
7
Filter row data based on sensitive data
8
Authorized Views
9
Authorized View - 1 Save result to another dest table
10
Authorized View - 2 Create another Dataset
11
Authorized View - 3 Save view
12
Authorized View - 4 Add necessary permissions
13
Authorized View - 5 Assign permissions
14
Authorized View - 6 Give View access to Dataset
15
Authorized View Procedure
● Step 1: Start with your source dataset. This is the dataset with the sensitive data you don't want to share.
● Step 2: Create a separate dataset to store the view. Authorized views require the source data to sit in a separate
dataset from the view; the reason becomes clear in step 6.
● Step 3: Create a view in the new dataset. In the new dataset you create the view you intend to share with your
data analysts. This view is created using a SQL query that includes only the data the analysts need to see.
● Step 4: Assign access controls to the project. In order to query the view, your analysts need permission to run
queries; assigning them the BigQuery User role gives them this ability. This access does not give them the
ability to view or query any datasets within the project.
● Step 5: Assign access controls to the dataset containing the view. In order for your analysts to query the view, they
need to be granted the BigQuery Data Viewer role on the specific dataset that contains the view. And finally,
● Step 6: Authorize the view to access the source dataset. This gives the view itself access to the source data. This is
needed because the view takes on the permissions of the person using it, and since the analysts don't
have access to the source table, they would otherwise get an error if they tried to query this view.
16
Materialized Views
● Materialized views are precomputed views that periodically cache the results
of a query for increased performance and efficiency.
● BigQuery leverages precomputed results from materialized views and
whenever possible reads only delta changes from the base tables to compute
up-to-date results.
● Materialized views can be queried directly or can be used by the BigQuery
optimizer to process queries to the base tables.
● Queries that use materialized views are generally faster and consume fewer
resources than queries that retrieve the same data only from the base tables.
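A hedged sketch of creating a materialized view with DDL through the Python client; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW `my-project.reporting.daily_totals_mv` AS
SELECT order_date, SUM(amount) AS total_amount
FROM `my-project.sales.orders`
GROUP BY order_date
"""
client.query(ddl).result()  # BigQuery then keeps the view's results fresh incrementally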
17
Materialized Views
18
Binder1.pdf

  • 1. Google Cloud Certified Professional Data Engineer Exam Strategy, Tips and Overview 1
  • 2. About me • I have 20 years experience in IT industry with focus on Cloud, Data, ML & DevOps • I hold more than 50+ certi fi cation in all the above fi elds • I am a 2 time Google Professional Data Engineer Certi fi ed and also 1 time Google DevOps and Machine Learning Engineer Certi fi ed • LinkedIn Pro fi le 2
  • 3. 3
  • 4. 4
  • 5. Exam Cost • 200 USD listed price • 141.60 USD I paid after taxes etc. • Price same whether at home or at exam center 5
  • 6. Certification Exam Overview - 1 •50 Questions to be answered in 2 hours •No negative marks •All Multiple Choice Questions •Select 1 right answer out of 4 •Select 2 or 3 right answers out of 5 or 6 choices •144 seconds for every question 6
  • 7. Certification Exam Overview - 2 • Questions can be answered and also marked for review if you want to review later • No pen and paper allowed or provided • Cert Valid for 2 years • Can be taken onsite at exam center or remotely proctored • Pass or Fail, marks will not be shared • Exam is hard 7
  • 8. Exam Strategy • Process of elimination • Eliminate usually 2 wrong choices • Then make a decision/guess between the last 2 strong options • On my fi rst go I made a choice and marked all questions for review later • On 2nd run I read through all my answers and either con fi rmed or changed my choice 8
  • 9. Top topics (in order) • BigQuery • Data fl ow • Bigtable • Dataproc • Pub/Sub • Cloud SQL • Cloud Spanner 9 • Cloud Composer • Data Prep • Data Fusion • Cloud DLP • Pre-Trained ML AI APIS • Fundamentals of Machine Learning • Feature Engineering • Over fi tting • IAM, Access Control, Service Accounts, Users, Groups, Roles • Kafka, Hive, Pig, Hadoop
  • 10. Tips (Guesswork) • Choose Ideal Recommended Solution (Reference Architectures) • Choose Google Product over other products • Avoid Complex Convoluted Lengthy Manual Error prone Cumbersome solution • If little bit unsure or question is too time consuming, mark for review later. • After reading through all 50 questions you will fi gure out some unsure questions • If all else fail go with your fi rst guess 10
  • 11. Do Review • Sample Questions • Exam Guide 11
  • 14. Roles & Responsibilities of a Data Engineer 1
  • 15. 2 •Data Engineers are responsible for designing, building, and maintaining the infrastructure and systems that are used to store, process, and analyze large amounts of data.
  • 16. 3 •Design, build and maintain the data pipeline: Data Engineers are responsible for designing and building the data pipeline, which includes the collection, storage, processing, and transportation of data. •They also make sure that the pipeline is e ffi cient, reliable, and scalable.
  • 17. 4 •Data storage and management: Data Engineers are responsible for designing and maintaining the data storage systems, such as relational databases, NoSQL databases, and data warehouses. •They also ensure that the data is properly indexed, partitioned, and backed up.
  • 18. 5 •Data quality and integrity: Data Engineers are responsible for ensuring the quality and integrity of the data as it fl ows through the pipeline. •This includes cleaning, normalizing, and validating the data before it is stored.
  • 19. 6 •Data security: Data Engineers are responsible for implementing security measures to protect sensitive data from unauthorized access. •This includes implementing encryption, access controls, and monitoring for security breaches.
  • 20. 7 •Performance tuning and optimization: Data Engineers are responsible for monitoring the performance of the data pipeline and making adjustments to optimize its performance. •This includes identifying and resolving bottlenecks, and scaling resources as needed.
  • 21. 8 •Collaboration with other teams: Data Engineers often work closely with Data Scientists, Data Analysts and Business Intelligence teams to understand their data needs and ensure that the data pipeline is able to support their requirements.
  • 22. 9 •Keeping up with the latest technologies: Data Engineers need to keep up-to-date with the latest technologies and trends in the fi eld, such as new data storage and processing systems, big data platforms, and data governance best practices.
  • 25. Relational Databases •Relational databases: These are the most common type of data storage systems, and include popular options such as MySQL, Oracle, and Microsoft SQL Server. •Relational databases store data in tables with rows and columns, and are based on the relational model. •They are well suited for structured data and support the use of SQL for querying and manipulating data. 2
  • 26. NoSQL Databases •NoSQL databases: These are non-relational databases that are designed to handle large amounts of unstructured or semi-structured data. •Examples include MongoDB, Cassandra, and HBase. •NoSQL databases are often used for big data and real-time web applications. •They are horizontally scalable and provide high performance and availability. 3
  • 27. Data Warehouses •Data warehouses: These are specialized relational databases that are optimized for reporting and analytics. •They are designed to handle large amounts of historical data and support complex queries and aggregations. •Examples include Amazon Redshift, Google BigQuery, and Microsoft Azure SQL Data Warehouse 4
  • 28. Data Lakes •Data lakes: A data lake is a data storage architecture that allows storing raw, unstructured, and structured data at any scale. •Data lake technologies, such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, provide a centralized repository that can store all types of data. 5
  • 29. Columnar Databases •Columnar storage: Columnar file formats such as Apache Parquet and Apache ORC, and wide-column databases such as Google Bigtable, store and query large amounts of data in a columnar layout. •This format is optimized for read-intensive workloads and analytical querying. 6
  • 30. Key-value Databases •Key-value databases: Key-value databases, such as Redis and memcached, are designed to store large amounts of data in a simple key-value format, and are optimized for read-heavy workloads. •They are particularly well-suited for use cases such as caching, session management, and real-time analytics. 7
  • 31. Object Storage •Object storage: These systems are designed to store and retrieve unstructured data, such as images, videos, and audio files. •They are often used for cloud storage and archiving. •Examples include Amazon S3, Microsoft Azure Blob Storage, and OpenStack Swift. 8
  • 32. File Storage •File storage: These systems store data as files and directories, which can be organized in a hierarchical file system. •They are often used for storing large files and streaming media, and are commonly used in distributed systems like Hadoop HDFS. •Other Examples: NTFS, ext4 9
  • 33. Time-series Databases •Time-series databases: These are specialized databases that are optimized for storing and querying time-series data, such as sensor data and financial data. •Examples include InfluxDB, OpenTSDB, and TimescaleDB. 10
  • 34. Graph Databases •Graph databases: These databases are designed for storing and querying graph data, which consists of nodes and edges representing entities and relationships. •Examples include Neo4j and JanusGraph. •They are well-suited for applications that require querying complex relationships and patterns in the data, such as social networks and recommendation systems. 11
  • 37. ACID • ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. • These properties are a set of guarantees that a database system makes about the behavior of transactions. • ACID properties are important for maintaining data integrity, consistency, and availability in a database. • It ensures that the data stored in the database is accurate, consistent, and can be relied upon.
  • 38. Atomicity • This property ensures that a transaction is treated as a single, indivisible unit of work. • Either all of the changes made in a transaction are committed to the database, or none of them are. • This means that if a transaction is interrupted or fails, any changes made in that transaction will be rolled back, so that the database remains in a consistent state.
  • 39. Consistency • This property ensures that a transaction brings the database from one valid state to another valid state. • A database starts in a consistent state, and any transaction that is executed on the database should also leave the database in a consistent state.
  • 40. Isolation • This property ensures that the concurrent execution of transactions does not affect the correctness of the overall system. • Each transaction should execute as if it is the only transaction being executed, even though other transactions may be executing at the same time.
  • 41. Durability • This property ensures that once a transaction is committed, its effects will persist, even in the event of a failure (such as a power outage or a crash). • This is typically achieved by writing the changes to non-volatile storage, such as disk.
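To make these properties concrete, here is a minimal, hypothetical sketch of atomicity and rollback using Python's built-in sqlite3 module; the accounts table and the transfer amount are invented for illustration, not taken from the slides.

```python
# Atomicity sketch: either both UPDATEs commit, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            # Raising inside the block aborts the whole transaction.
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass

# Neither UPDATE is visible: the transaction was rolled back as a single unit.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```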
  • 42. BASE • BASE stands for Basically Available, Soft state, Eventually consistent. • It is a set of properties that describe the behavior of a distributed database system or a distributed data store.
  • 43. Basically Available • This property ensures that the data store is available for read and write operations, although there may be some limitations on availability due to network partitions or other failures.
  • 44. Soft state • This property acknowledges that the state of the data store may change over time, even without input. • This is due to the distributed nature of the data store and the inherent uncertainty in network communication.
  • 45. Eventually consistent • This property ensures that all nodes in the distributed data store will eventually converge to the same state, even if they do not have immediate access to the same information. • This means that it may take some time for all nodes to have the same data, but eventually, they will.
  • 46. Difference • ACID guarantees consistency and isolation for transactions, but at the cost of additional overhead and reduced scalability. • BASE prioritizes availability and scalability over consistency, which can make the behavior of the system more difficult to reason about and predict.
  • 49. OLTP - 1 • OLTP stands for Online Transaction Processing. • It is a type of database system that is optimized for handling a large number of short, transactional requests, such as inserting, updating, or retrieving data from a database. • In an OLTP system, the database is designed to handle a high number of concurrent connections and transactions, with a focus on fast, consistent response times. • The data in an OLTP system is typically stored in normalized, relational tables, which allows for efficient querying and indexing. • OLTP systems are used in a variety of applications, such as e-commerce systems, financial systems, and inventory management systems.
  • 50. OLTP - 2 • The goal of OLTP is to enable the processing of business transactions as fast as possible, with a high degree of consistency and data integrity. • OLTP systems are typically characterized by a high number of read and write operations, a large number of concurrent users, and a high volume of data. • They are also characterized by a high degree of normalization and data integrity, with strict constraints and triggers to ensure data consistency and prevent data corruption. • Overall, OLTP is designed to handle a large number of concurrent transactions and to provide fast, consistent response times. • It is an essential component of many business systems, and it is used to support a wide variety of transactions and business processes.
  • 51. OLAP - 1 • OLAP, or Online Analytical Processing, is a powerful technology that allows users to easily analyze large, complex data sets and make informed business decisions. • It is commonly used in business intelligence and decision support systems to support complex queries and analysis on large datasets. • OLAP databases are typically built on top of a relational database and use a multidimensional data model, which organizes data into a cube structure. • Each dimension in the cube represents a different aspect of the data, such as time, location, or product, and each cell in the cube contains a measure, such as sales or profit. • Users can interact with the OLAP cube using a client tool, such as Microsoft Excel, to drill down, roll up, and slice and dice the data to gain insights.
  • 52. OLAP - 2 • For example, a user could start by looking at total sales for a given time period, then drill down to see sales by region, and then by individual store. • There are three main types of OLAP systems: relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP). • ROLAP uses a relational database as the underlying data store, while MOLAP uses a specialized multidimensional data store. • HOLAP combines the benefits of both ROLAP and MOLAP by using a relational data store for the detailed data and a multidimensional data store for the summarized data. • OLAP also provides several advanced analytical capabilities such as time series analysis, forecasting, budgeting, and data mining. • In addition, many OLAP tools provide a graphical interface that makes it easy for users to interact with the data and perform advanced analysis.
  • 53. Difference • OLTP is designed to handle a high volume of short online transactions, such as inserting, updating, and retrieving data. • It is optimized for transactional consistency, and data is stored in a normalized form to minimize data redundancy. • The main goal of OLTP is to ensure data accuracy and integrity, and to make sure that the system can handle a large number of concurrent users and transactions. • OLTP is used for operational systems that handle day-to-day transactions. • OLAP is designed to handle complex, multi-dimensional analysis of data. • It is optimized for fast query performance and efficient data aggregation, and data is stored in a denormalized form to enable faster data retrieval. • The main goal of OLAP is to support business intelligence and decision-making by providing users with the ability to analyze large amounts of data from multiple dimensions and levels of detail. • OLAP is used for analytical systems that support business intelligence and decision-making.
  • 55. 4 V's of Big Data 1
  • 56. 4 V's of Big Data •Volume •Velocity •Variety •Veracity 2
  • 57. Volume •Big data is characterized by its large volume, which can range from terabytes to petabytes and even exabytes. •This large volume of data is generated from various sources, such as social media, IoT devices, and transactional systems. 3
  • 58. Velocity •Big data is characterized by its high velocity, which refers to the speed at which data is generated and collected. •This high velocity of data requires real-time processing and analysis to extract insights and make decisions. 4
  • 59. Variety •Big data is characterized by its wide variety of data types, such as structured, unstructured, and semi-structured data. •This variety of data types requires specialized tools and techniques for processing and analysis. 5
  • 60. Veracity •Big data is characterized by its uncertainty and lack of trustworthiness, which makes it difficult to validate and verify the accuracy of the data. •This requires data quality and data governance processes to ensure that the data is accurate and reliable. 6
  • 63. Vertical Scaling •Vertical scaling is the process of increasing the capacity of a single server or machine by adding more resources such as CPU, memory, or storage. •Vertical scaling is often used to improve the performance of a single server or to add capacity to a machine that has reached its limits. •The main disadvantage of vertical scaling is that it can reach a physical limit of how much resources can be added to a single machine. 2
  • 64. Horizontal Scaling •Horizontal scaling is the process of adding more machines to a network to distribute the load and increase capacity. •Horizontal scaling is often used in cloud computing and other distributed systems to handle large amounts of traffic or data. •It allows the system to handle more requests by adding more machines to the network, rather than upgrading the resources of a single machine. •Horizontal scaling is often considered more flexible and cost-effective than vertical scaling, as it allows for the easy addition or removal of machines as needed. •However, it may also require a load balancer and a way to share data between the machines. 3
  • 65. Difference •Vertical scaling: adding more resources to a single server or machine, such as CPU, memory, or storage; there is a physical limit to how much can be added to a single machine. 4 •Horizontal scaling: adding more machines to a network to distribute the load and increase capacity; it may also require a load balancer and a way to share data between the machines.
  • 68. Batch Data 2 • Batch data refers to data that is collected and processed in fixed, non-overlapping intervals, also known as batches. • Batch processing is commonly used when working with large amounts of historical data, such as data from a data warehouse. • The data is collected over a certain period of time and then processed all at once. • Batch processing is well suited for tasks that do not require real-time processing, such as generating reports, running analytics, or training machine learning models.
  • 69. Streaming Data 3 • Streaming data, on the other hand, refers to data that is generated and processed in real time, as it is being generated. • Streaming data is typically generated by various sources, such as IoT devices, social media, or financial transactions. • The data is processed as it is received, with minimal latency, and is often used to support real-time decision-making and event detection. • Examples of streaming data processing include monitoring sensor data, analyzing social media feeds, and detecting fraud in financial transactions.
  • 70. Main Difference 4 • The main difference between batch data and streaming data is the way they are processed. • Batch data is processed in fixed intervals, while streaming data is processed as it is generated. • Batch data is well suited for tasks that do not require real-time processing, while streaming data is well suited for real-time tasks such as monitoring and event detection.
  • 71. Note 5 • It's worth noting that many systems these days combine batch and streaming data processing; this is known as the Lambda architecture. • It is a way to handle both real-time and historical data in a single system, which can be useful in cases where real-time decisions need to be made based on historical data.
  • 75. Data Processing Pipeline 3 • A data processing pipeline is a series of stages or phases that data goes through from the time it is collected to the time it is used for analysis or reporting.
  • 76. Data collection 4 • The first stage of the data processing pipeline is data collection. • This includes acquiring data from various sources, such as sensors, log files, social media, and transactional systems. • Data collection may also include pre-processing, such as filtering, sampling, and transforming the data to make it suitable for further processing.
  • 77. Data Storage 5 • After the data is collected, it needs to be stored in a reliable and efficient manner. • This includes storing the data in a data warehouse, data lake, or other data storage systems, as well as indexing, partitioning, and backing up the data.
  • 78. Data Processing 6 • The next stage of the data processing pipeline is data processing, which includes cleaning, normalizing, validating, and transforming the data. • This step is critical for ensuring the quality and integrity of the data.
  • 79. Data Modeling 7 • After the data is cleaned and processed, it can be used for data modeling, which includes building and training machine learning models, and creating data visualizations.
  • 80. Data Analysis 8 • Data analysis: The final stage of the data processing pipeline is data analysis, which includes querying, reporting, and visualizing the data to gain insights and make data-driven decisions.
  • 81. Data Governance 9 • Data governance is an ongoing process that covers the data life cycle, and it starts at the data collection phase and continues throughout the entire pipeline. • It includes data quality, data lineage, data privacy, data security, data archiving, and data cataloging.
  • 82. Note 10 • It's worth noting that these stages may not be strictly sequential and can be executed in parallel, and the specific stages may vary depending on the specific application and the requirements of the organization. • Additionally, the pipeline may include different tools, technologies, and frameworks at each stage, and the pipeline can be iterated to improve the quality of the data and the accuracy of the models.
  • 85. 1. Ingest 2. Store 3. Process and Analyze 4. Explore and Visualize
  • 88. 3. Process and Analyze
  • 89. 4. Explore and visualize
  • 94. Cloud Storage Cloud Storage is a managed service for storing unstructured data. Store any amount of data and retrieve it as often as you like. 2
  • 95. Features Automatic storage class transitions Continental-scale and SLA-backed replication Fast and flexible transfer services Default and configurable data security Leading analytics and ML/AI tools Object lifecycle management Object Versioning Retention policies Object holds 3
  • 96. Features Customer-managed encryption keys Customer-supplied encryption keys Uniform bucket-level access Requester pays Bucket Lock Pub/Sub notifications for Cloud Storage Cloud Audit Logs with Cloud Storage Object- and bucket-level permissions 4
  • 97. Storage Options (Storage Class / Use Cases / Minimum Duration)
  Standard storage: storage for data that is frequently accessed ("hot" data) and/or stored for only brief periods of time, including websites, streaming videos, and mobile apps.
  Nearline Storage: low-cost, highly durable storage service for storing infrequently accessed data. 30 days.
  Coldline Storage: a very low-cost, highly durable storage service for storing infrequently accessed data. 90 days.
  Archival Storage: the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. 365 days. 5
  • 98. Common Use Cases Backup and Archives Use Cloud Storage for backup, archives, and recovery Cloud Storage's Nearline storage provides fast, low-cost, highly durable storage for data accessed less than once a month, reducing the cost of backups and archives while still retaining immediate access. Backup data in Cloud Storage can be used for more than just recovery because all storage classes have millisecond latency and are accessed through a single API. Media content storage and delivery Store data to stream audio or video Stream audio or video directly to apps or websites with Cloud Storage's geo-redundant capabilities. Geo-redundant storage with the highest level of availability and performance is ideal for low-latency, high-QPS content serving to users distributed across geographic regions. Data lakes and big data analytics Create an integrated repository for analytics Consolidate data from many sources into a single data lake on Cloud Storage. Cloud Storage offers high availability and performance while being strongly consistent, giving you confidence and accuracy in analytics workloads. 6
  • 99. Common Use Cases Machine learning and Al Plug into world class machine learning and Al tools Once your data is stored in Cloud Storage, take advantage of our options for training deep learning and machine learning models cost-effectively. Host a website Hosting a static website with Cloud Storage If you have a web app that needs to serve static content or user-uploaded static media, using Cloud Storage can be a cost- effective and efficient way to host and serve this content, while reducing the amount of dynamic requests to your web app. 7
  • 100. Automatic storage class transitions With features like Object Lifecycle Management (OLM) and Autoclass you can easily optimize costs with object placement across storage classes. You can enable, at the bucket level, policy-based automatic object movement to colder storage classes based on the last access time. There are no early deletion or retrieval fees, nor class transition charges for object access in colder storage classes. 8
  • 101. Continental-scale and SLA backed replication Industry leading dual-region buckets support an expansive number of regions. A single, continental-scale bucket offers nine regions across three continents, providing a Recovery Time Objective (RTO) of zero. In the event of an outage, applications seamlessly access the data in the alternate region. There is no failover and failback process. For organizations requiring ultra availability, turbo replication with dual-region buckets offers a 15 minute Recovery Point Objective (RPO) SLA. 9
  • 102. Fast and flexible transfer services Storage Transfer Service offers a highly performant, online pathway to Cloud Storage—both with the scalability and speed you need to simplify the data transfer process. For offline data transfer our Transfer Appliance is a shippable storage server that sits in your datacenter and then ships to an ingest location where the data is uploaded to Cloud Storage. 10
  • 103. Default and configurable data security Cloud Storage offers secure-by-design features to protect your data and advanced controls and capabilities to keep your data private and secure against leaks or compromises. Security features include access control policies, data encryption, retention policies, retention policy locks, and signed URLs. 11
  • 104. Leading analytics and ML/AI tools Once your data is stored in Cloud Storage, easily plug into Google Cloud’s powerful tools to create your data warehouse with BigQuery, run open-source analytics with Dataproc, or build and deploy machine learning (ML) models with Vertex AI. 12
  • 105. Object lifecycle management Define conditions that trigger data deletion or transition to a cheaper storage class. 13
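As a rough illustration, lifecycle rules can also be set programmatically with the google-cloud-storage Python client; the bucket name below is a placeholder and the rule ages are examples only, not values from the slides.

```python
# Sketch: configure Object Lifecycle Management rules on a bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")  # hypothetical bucket name

# Transition objects to Coldline after 90 days, then delete them after 365 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persists the updated lifecycle configuration on the bucket

for rule in bucket.lifecycle_rules:
    print(rule)
```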
  • 106. Object Versioning Continue to store old copies of objects when they are deleted or overwritten 14
  • 107. Retention policies Define minimum retention periods that objects must be stored for before they’re deletable. 15
  • 108. Object holds Place a hold on an object to prevent its deletion. 16
  • 109. Customer-managed encryption keys Encrypt object data with encryption keys stored by the Cloud Key Management Service and managed by you. 17
  • 110. Customer-supplied encryption keys Encrypt object data with encryption keys created and managed by you. 18
  • 111. Uniform bucket-level access Uniformly control access to your Cloud Storage resources by disabling object ACLs. 19
  • 112. Requester pays Require accessors of your data to include a project ID to bill for network charges, operation charges, and retrieval fees. 20
  • 113. Bucket Lock Bucket Lock allows you to configure a data retention policy for a Cloud Storage bucket that governs how long objects in the bucket must be retained. 21
  • 114. Pub/Sub notifications for Cloud Storage Send notifications to Pub/Sub when objects are created, updated, or deleted. 22
  • 115. Object- and bucket-level permissions Cloud Identity and Access Management (IAM) allows you to control who has access to your buckets and objects. 23
  • 117. Migration to Google Cloud: Transferring your large datasets 1
  • 118. 2 Where you're moving data from | Scenario | Suggested products
  Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud | | Storage Transfer Service
  Cloud Storage to Cloud Storage (two different buckets) | | Storage Transfer Service
  Your private data center to Google Cloud | Enough bandwidth to meet your project deadline for less than 1 TB of data | gsutil
  Your private data center to Google Cloud | Enough bandwidth to meet your project deadline for more than 1 TB of data | Storage Transfer Service for on-premises data
  Your private data center to Google Cloud | Not enough bandwidth to meet your project deadline | Transfer Appliance
  • 120. Storage Transfer Service •Move or backup data to a Cloud Storage bucket either from other cloud storage providers or from a local or cloud POSIX file system. •Move data from one Cloud Storage bucket to another, so that it is available to different groups of users or applications. •Move data from Cloud Storage to a local or cloud file system. •Move data between file systems. •Periodically move data as part of a data processing pipeline or analytical workflow. 4
  • 121. Storage Transfer Service - Options •Schedule one-time transfer operations or recurring transfer operations. •Delete existing objects in the destination bucket if they don't have a corresponding object in the source. •Delete data source objects after transferring them. •Schedule periodic synchronization from a data source to a data sink with advanced filters based on file creation dates, filenames, and the times of day you prefer to import data. 5
  • 122. gsutil - 1 • The gsutil tool is the standard tool for small- to medium-sized transfers (less than 1 TB) over a typical enterprise-scale network, from a private data center or from another cloud provider to Google Cloud. • It's also available by default when you install the Google Cloud CLI. • It's a reliable tool that provides all the basic features you need to manage your Cloud Storage instances, including copying your data to and from the local file system and Cloud Storage. • It can also move and rename objects and perform real-time incremental syncs, like rsync, to a Cloud Storage bucket. 6
  • 123. gsutil is especially useful • Your transfers need to be executed on an as-needed basis, or during command-line sessions by your users. • You're transferring only a few files or very large files, or both. • You're consuming the output of a program (streaming output to Cloud Storage). • You need to watch a directory with a moderate number of files and sync any updates with very low latencies. 7
  • 124. Transfer Appliance •Transfer Appliance is a high- capacity storage device that enables you to transfer and securely ship your data to a Google upload facility, where we upload your data to Cloud Storage 8
  • 125. Transfer Appliance - How it works 1. Request an appliance 2. Upload your data 3. Ship the appliance back 4. Google uploads the data 5. Transfer is complete 9
  • 126. 10 Transfer Appliance weights and capacities
  • 130. ● Google Cloud SQL is a fully-managed database service that makes it easy to set up, maintain, manage, and administer your relational databases on Google Cloud Platform. ● It is based on the MySQL and PostgreSQL database engines and provides a number of features to help you manage your databases with ease, including: ● Easy setup: You can set up a new Cloud SQL instance in just a few clicks using the Google Cloud Console, the gcloud command-line tool, or the Cloud SQL API.
  • 131. ● Automatic patches and updates: Cloud SQL automatically applies patches and updates to your database, so you don't have to worry about maintenance or downtime. ● High availability: Cloud SQL provides built-in high availability, with automatic failover and replication to ensure that your database is always available. ● Scalability: You can easily scale your Cloud SQL instances up or down to meet the changing needs of your application.
  • 132. ● Security: Cloud SQL provides a number of security features to help protect your data, including encryption at rest, network isolation, and integration with Google Cloud's identity and access management (IAM) system. ● Monitoring and diagnostics: Cloud SQL provides detailed monitoring and diagnostics information to help you troubleshoot issues with your database. ● Integration with other Google Cloud services: Cloud SQL integrates seamlessly with other Google Cloud services, such as Google Kubernetes Engine, Cloud Functions, and Cloud Run, making it easy to build and deploy applications on Google Cloud Platform.
  • 133. ● Cloud SQL supports MySQL, PostgreSQL and SQL Server databases. You can choose the database engine that best fits your needs and get all the features and benefits of that engine, along with the added benefits of being fully managed on Google Cloud Platform. ● Cloud SQL provides multiple pricing options to fit your needs and budget. You can choose between on-demand pricing, which charges you based on the resources you use, or committed use pricing, which provides discounted rates in exchange for a commitment to use a certain amount of resources over a one or three year period.
  • 134. ● Cloud SQL provides a number of tools and features to help you manage your databases and optimize performance. These include a web-based SQL client, the ability to import and export data, support for connection pooling and load balancing, and the ability to scale your instances up or down as needed. ● Cloud SQL integrates with other Google Cloud services, such as Cloud Functions and Cloud Run, making it easy to build and deploy cloud-native applications. You can also use Cloud SQL with popular open-source tools such as MySQL Workbench and PostgreSQL clients, or connect to it using standard MySQL and PostgreSQL drivers.
  • 135. ● Cloud SQL provides a number of security features to help protect your data, including encryption at rest, network isolation, and integration with Google Cloud's identity and access management (IAM) system. You can also use Cloud SQL with Cloud Security Command Center to monitor and manage your database security.
  • 136. Key Terms ● Instance: A Cloud SQL instance is a container for your databases. It has a specific configuration and can host one or more databases. ● Database: A database is a collection of data that is organized in a specific way, making it easy to access, update, and query. Cloud SQL supports several database engines, including MySQL and PostgreSQL.
  • 137. Key Terms ● Region: A region is a geographic area where Google Cloud Platform resources are located. When you create a Cloud SQL instance, you can choose which region it should be located in. ● High availability: Cloud SQL instances can be configured for high availability, which means that they are designed to remain available even if there is a hardware failure or other issue.
  • 138. Key Terms ● Backup and recovery: Cloud SQL provides automatic and on-demand backups of your database, as well as tools for recovering from a disaster or data loss. ● Security: Cloud SQL takes security seriously, with features such as encryption at rest, network isolation, and user authentication.
  • 139. Key Terms ● Monitoring and debugging: Cloud SQL provides monitoring and debugging tools to help you track the performance of your database and troubleshoot any issues that may arise. ● Scalability: Cloud SQL allows you to scale your database up or down as needed, so you can handle changes in demand without having to worry about capacity planning.
  • 140. Pricing ● Google Cloud SQL charges for usage based on the type and number of resources you consume, such as the number of instances, the size of the instances, and the amount of data stored. ● Here are some of the factors that can affect the cost of Cloud SQL:
  • 141. Pricing ● Instance type: Cloud SQL offers several instance types, each with a different combination of CPU, memory, and storage. The type of instance you choose will affect the price. ● Instance size: The size of a Cloud SQL instance is determined by the amount of CPU, memory, and storage it has. You can choose from a range of sizes, and the cost will depend on the size you choose.
  • 142. Pricing ● Data storage: Cloud SQL charges for the amount of data stored in your database, as well as for any additional storage you may need. ● Network egress: Cloud SQL charges for the data that is transferred out of a region. If you have a lot of data transfer, it could increase your costs.
  • 143. Pricing ● High availability: If you configure your Cloud SQL instance for high availability, it will incur additional costs. ● To get an estimate of the cost of using Cloud SQL, you can use the Google Cloud Pricing Calculator. This tool allows you to specify your usage patterns and get an estimate of the cost based on your specific needs.
  • 144. Use Cases ● Web and mobile applications: Cloud SQL is well-suited for powering the back-end of web and mobile applications. It can handle high levels of concurrency and offers fast response times, making it ideal for applications with a lot of users. ● Microservices: Cloud SQL can be used to store data for microservices-based architectures. It offers fast response times and can be easily integrated with other Google Cloud Platform services.
  • 145. Use Cases ● E-commerce: Cloud SQL can be used to store and manage data for e-commerce applications, including customer information, order history, and inventory data. ● Internet of Things (IoT): Cloud SQL can be used to store and process data from IoT devices, allowing you to analyze and gain insights from the data.
  • 146. Use Cases ● Gaming: Cloud SQL can be used to store and manage data for online gaming applications, including player profiles, game progress, and leaderboards.
  • 147. Cloud SQL for MySQL ● Fully managed MySQL Community Edition databases in the cloud. ● Custom machine types with up to 624 GB of RAM and 96 CPUs. ● Up to 64 TB of storage available, with the ability to automatically increase storage size as needed. ● Create and manage instances in the Google Cloud console.
  • 148. Cloud SQL for MySQL ● Instances available in the Americas, EU, Asia, and Australia. ● Supports migration from source databases to Cloud SQL destination databases using Database Migration Service (DMS). ● Customer data encrypted on Google's internal networks and in database tables, temporary files, and backups. ● Support for secure external connections with the Cloud SQL Auth proxy or with the SSL/TLS protocol.
  • 149. Cloud SQL for MySQL ● Support for private IP (private services access). ● Data replication between multiple zones with automatic failover. ● Import and export databases using mysqldump, or import and export CSV files. ● Support for MySQL wire protocol and standard MySQL connectors. ● Automated and on-demand backups and point-in-time recovery.
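Because Cloud SQL for MySQL speaks the standard MySQL wire protocol, any standard driver can connect. Below is a minimal sketch using PyMySQL with a hypothetical host, user, password, and database; in practice you would typically connect through the Cloud SQL Auth proxy or a private IP rather than hard-coding credentials.

```python
# Sketch: connect to a Cloud SQL for MySQL instance with a standard driver.
import pymysql

conn = pymysql.connect(
    host="127.0.0.1",        # e.g. localhost when tunneling through the Cloud SQL Auth proxy
    user="app_user",         # hypothetical user
    password="change-me",    # hypothetical password
    database="inventory",    # hypothetical database
)

with conn.cursor() as cur:
    cur.execute("SELECT NOW()")   # simple round-trip to verify the connection
    print(cur.fetchone())

conn.close()
```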
  • 150. Cloud SQL for MySQL ● Instance cloning. ● Integration with Google Cloud's operations suite logging and monitoring. ● ISO/IEC 27001 compliant.
  • 151. Unsupported MySQL features ● Federated Engine ● Memory Storage Engine ● The following feature is unsupported for MySQL for Cloud SQL 5.6 and 5.7: ● The SUPER privilege ● Because Cloud SQL is a managed service, it restricts access to certain system procedures and tables that require advanced privileges.
  • 152. Unsupported MySQL features ● The following features are unsupported for MySQL for Cloud SQL 8.0: ● FIPS mode ● Resource groups
  • 153. Unsupported plugins ● InnoDB memcached plugin ● X plugin ● Clone plugin ● InnoDB data-at-rest encryption ● validate_password component
  • 154. Unsupported statements ● LOAD DATA INFILE ● SELECT ... INTO OUTFILE ● SELECT ... INTO DUMPFILE ● INSTALL PLUGIN … ● UNINSTALL PLUGIN ● CREATE FUNCTION ... SONAME …
  • 157. About ● Google Cloud Spanner is a fully managed, horizontally scalable, cloud-native database service that offers globally consistent, high-performance transactions, and strong consistency across all rows, tables, and indexes. It is designed to handle the most demanding workloads and provides the ability to scale up or down as needed. ● Cloud Spanner is well-suited for applications that require high availability, strong consistency, and high performance, such as financial systems, e-commerce platforms, and real-time analytics.
  • 158. Key Features ● Global distribution: Cloud Spanner allows you to replicate your data across multiple regions, ensuring low latency and high availability for your applications. ● Strong consistency: Cloud Spanner provides strong consistency across all rows, tables, and indexes, allowing you to always read the latest data. ● High performance: Cloud Spanner is designed to handle the most demanding workloads, with the ability to scale up or down as needed.
  • 159. Key Features ● Fully managed: Cloud Spanner is fully managed by Google, meaning you don't have to worry about hardware, software, or infrastructure. ● SQL support: Cloud Spanner supports a standard SQL API, making it easy to integrate with existing applications and tools. ● Integration with other Google Cloud services: Cloud Spanner integrates with other Google Cloud services, such as BigQuery and Cloud Functions, allowing you to build scalable and powerful applications.
  • 160. Additional Details ● Data modeling: Cloud Spanner uses a traditional relational database model, with tables, rows, and columns. It supports the standard SQL data types, such as INT64, FLOAT64, BOOL, and STRING. You can also use Cloud Spanner's data definition language (DDL) to create and modify tables, indexes, and other database objects. ● Indexing: Cloud Spanner supports both primary keys and secondary indexes, allowing you to query and filter your data efficiently. You can create unique and non-unique indexes, as well as composite indexes that cover multiple columns.
  • 161. Additional Details ● Transactions: Cloud Spanner supports transactions, allowing you to execute multiple SQL statements as a single unit of work. Transactions provide ACID (atomicity, consistency, isolation, and durability) guarantees, ensuring that your data is always consistent and accurate. ● Replication: Cloud Spanner uses a distributed architecture to replicate your data across multiple regions, providing high availability and low latency for your applications. You can choose how many replicas you want for each region, based on your performance and availability requirements.
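A hedged sketch of such a read-write transaction with the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical, and everything inside the function either commits or rolls back as a single unit.

```python
# Sketch: an ACID read-write transaction in Cloud Spanner.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("demo-instance")   # hypothetical instance ID
database = instance.database("demo-db")       # hypothetical database ID

def transfer(transaction, from_id, to_id, amount):
    # All reads and writes in this function commit or roll back together.
    rows = list(transaction.execute_sql(
        "SELECT Balance FROM Accounts WHERE AccountId = @id",
        params={"id": from_id},
        param_types={"id": spanner.param_types.INT64},
    ))
    if rows[0][0] < amount:
        raise ValueError("insufficient funds")   # aborts the whole transaction
    for account_id, delta in [(from_id, -amount), (to_id, amount)]:
        transaction.execute_update(
            "UPDATE Accounts SET Balance = Balance + @delta WHERE AccountId = @id",
            params={"delta": delta, "id": account_id},
            param_types={"delta": spanner.param_types.INT64,
                         "id": spanner.param_types.INT64},
        )

database.run_in_transaction(transfer, 1, 2, 100)
```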
  • 162. Additional Details ● Security: Cloud Spanner follows best practices for data security and privacy, including encryption of data at rest and in transit, access controls, and auditing. It also integrates with Google Cloud's Identity and Access Management (IAM) service, allowing you to set fine-grained permissions for your users and applications.
  • 163. How does google cloud spanner work ? ● You create a Cloud Spanner database and define your schema, including tables, columns, and indexes. ● You can then load data into your Cloud Spanner database using SQL INSERT, UPDATE, and DELETE statements, or using one of the available import tools, such as Cloud Data Fusion or Cloud Dataproc. ● Cloud Spanner stores your data in a distributed data storage system called Colossus, which is designed to scale horizontally across multiple servers and regions. Colossus uses a combination of hard disks and solid-state drives (SSDs) to store your data, with data replicated across multiple nodes for high availability and low latency.
  • 164. How does google cloud spanner work ? ● When you execute a SQL query or a transaction on your Cloud Spanner database, the query or transaction is routed to the appropriate node based on the data being accessed. Cloud Spanner uses a distributed lock manager to ensure that transactions are executed in the correct order and to prevent conflicts between concurrent transactions. ● Cloud Spanner automatically manages the underlying infrastructure and software, including hardware provisioning, data replication, backup and recovery, and security. You don't have to worry about these tasks, and you can focus on building your applications.
  • 165. Pricing ● Compute capacity: The compute capacity you provision, as nodes or processing units (1,000 processing units equal 1 node), determines the read/write throughput and the amount of storage your instance can support, and is billed per hour.
  • 166. Pricing ● Storage: The amount of database storage that you use is billed per GB per month, based on the size of your data, including indexes. Backup storage is billed separately from database storage.
  • 167. Pricing ● Network: Cloud Spanner also charges for network egress out of a region. ● You can use the Google Cloud Pricing Calculator to estimate the cost of using Google Cloud Spanner for your specific workload. ● It's worth noting that Google Cloud Spanner offers pricing discounts, such as committed use discounts, which can help you save money on your Cloud Spanner usage.
  • 168. Use Cases ● Online transaction processing (OLTP) applications: Cloud Spanner is well-suited for applications that require low-latency read/write access to a large number of records, such as e-commerce platforms, financial systems, and customer relationship management (CRM) systems. ● Analytics and reporting: Cloud Spanner can be used to store and analyze large amounts of data in real-time, making it suitable for applications such as business intelligence, data warehousing, and data lakes.
  • 169. Use Cases ● Internet of Things (IoT) applications: Cloud Spanner can handle the large volume of data generated by IoT devices, making it suitable for applications such as smart cities, connected cars, and industrial IoT. ● Mobile and web applications: Cloud Spanner can support the high read/write throughput and availability requirements of mobile and web applications, making it suitable for applications such as social networks, gaming, and content management systems.
  • 170. Use Cases ● Hybrid and multi-cloud applications: Cloud Spanner can support hybrid and multi-cloud architectures, making it suitable for applications that require data to be accessed and modified from multiple locations. ● Microservices and distributed systems: Cloud Spanner can support the high availability and consistency requirements of microservices and distributed systems, making it suitable for applications such as distributed databases, distributed caches, and event-driven architectures.
  • 171. How does google cloud spanner provide high availability & scalability ? ● High availability: Spanner is designed to provide 99.999% uptime, which means that it is able to operate with minimal downtime. It achieves this through a combination of techniques such as distributed data storage, replication, and failover. ● Scalability: Spanner is able to scale horizontally, which means that you can easily add more capacity to your database by adding more machines. It also has automatic sharding, which means that it can automatically distribute your data across multiple machines as your data grows.
  • 172. How does google cloud spanner provide high availability & scalability ? ● Consistency: Spanner uses a technology called "TrueTime" to provide strong consistency guarantees across all of its replicas, which means that you can be confident that all replicas of your data will be consistent with each other at all times.
  • 173. How does google cloud spanner provide global consistency ? ● Google Cloud Spanner provides global consistency through the use of a technology called "TrueTime." TrueTime is a distributed global clock that provides a consistent view of time across all of the machines in a Spanner cluster. ● TrueTime works by using a combination of atomic clocks, GPS receivers, and network time protocol (NTP) servers to provide a highly accurate and consistent view of time. It allows Spanner to provide strong consistency guarantees across all of its replicas, which means that you can be confident that all replicas of your data will be consistent with each other at all times.
  • 174. How does google cloud spanner provide global consistency ? ● TrueTime is used by Spanner to provide a consistent view of time for operations such as transactions and reads. For example, if you execute a transaction that involves multiple reads and writes, Spanner will use TrueTime to ensure that the reads and writes are all executed in the correct order, even if they are distributed across different machines. This helps to ensure that your data remains consistent and correct, even in the face of network delays and other potential issues.
  • 177. Dataflow Serverless, fast, and cost-effective data- processing service Stream and batch data Automatic Infrastructure provisioning Automatic Scaling as your data grows 2
  • 178. Dataflow Real-time data comes from many different sources, but capturing, processing, and analyzing it is not easy because it's usually not in the desired format for your downstream systems 3
  • 179. Dataflow Read the data from the source -> transform -> write it back into a sink 4
  • 180. Dataflow Portable Processing pipeline created using open source Apache Beam libraries in the language of your choice Dataflow job Processing on worker virtual machines 5
  • 181. Dataflow Run Dataflow jobs using the Cloud Console UI, the gcloud CLI, or the APIs Prebuilt or Custom templates Write SQL statements to develop pipelines right from the BigQuery UI or use AI Platform Notebooks 6
  • 182. Dataflow Data encrypted at rest In Transit with an option to use customer- managed encryption keys Use private IPs and VPC service controls to secure the environment 7
  • 183. Dataflow Dataflow is a great choice for use cases such as real-time AI, data warehousing or stream analytics 8
  • 186. Definition 2 ● A fully managed service for executing Apache Beam pipelines within the Google Cloud ecosystem ● Google Cloud Dataflow was announced in June 2014 and released to the general public as an open beta in April 2015
  • 187. Features ● NoOps and Serverless ● Handles infrastructure setup ● Handles maintenance ● Built on Google infrastructure ● Reliable auto scaling ● Meet data pipeline demands 3
  • 189. Dataflow, Dataproc comparison 5
  Recommended for: Dataflow - new data processing pipelines, unified batch and streaming; Dataproc - existing Hadoop/Spark applications, machine learning/data science ecosystem, large-batch jobs, preemptible VMs
  Fully managed: Dataflow - yes; Dataproc - no
  Auto-scaling: Dataflow - yes, transform-by-transform (adaptive); Dataproc - yes, based on cluster utilization (reactive)
  Expertise: Dataflow - Apache Beam; Dataproc - Hadoop, Hive, Pig, the Apache big data ecosystem, Spark, Flink, Presto, Druid
  • 190. Apache Beam = Batch + strEAM 6
  • 191. Dataflow pipeline = Directed Acyclic Graph 7
  • 192. What is a PCollection ? 8 ● In Apache Beam, a PCollection (short for "Parallel Collection") is an immutable data set that is distributed across a set of workers for parallel processing. ● It represents a distributed dataset that can be processed in parallel using the Apache Beam programming model. ● A PCollection can be created from an external source, such as a text file or a database, or it can be created as the output of a Beam transform, such as a map or filter operation. ● PCollections can be transformed and combined with other PCollections using operations like map, filter, and join. ● Once a pipeline has been defined, the data in a PCollection can be processed by executing the pipeline using a runner, such as the Dataflow, Apache Flink, or Apache Spark runners.
  • 193. What is a PTransform ? ● In Apache Beam, a PTransform (short for "Parallel Transform") is a fundamental building block for constructing data processing pipelines. ● It represents a computation that takes one or more PCollections as input, performs a set of operations on the data, and produces one or more output PCollections. ● PTransforms can be either pre-defined (e.g., Map, Filter, GroupByKey) or user-defined. ● Pre-defined PTransforms are provided by the Apache Beam SDK and can be used to perform common data processing tasks, such as mapping, filtering, and grouping data. User-defined PTransforms allow you to implement custom logic for your data processing needs. 9
  • 194. What is a PTransform ? ● PTransforms are applied to PCollections using the apply() method, which takes one or more PCollections as input and returns one or more output PCollections. ● For example, the following code applies a Map PTransform to a PCollection words to produce a new PCollection lengths: ● lengths = words | beam.Map(lambda x: len(x)) ● In this example, the Map PTransform takes as input a PCollection words and applies the lambda function lambda x: len(x) to each element in the collection, producing a new PCollection lengths that contains the lengths of the words in words. 10
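Expanding the slide's one-liner into a self-contained pipeline that runs locally on the DirectRunner; the sample words are invented.

```python
# Sketch: a complete mini pipeline around the Map example above.
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Create" >> beam.Create(["spanner", "bigquery", "dataflow"])
    lengths = words | "Lengths" >> beam.Map(lambda w: (w, len(w)))
    lengths | "Print" >> beam.Map(print)
```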
  • 195. ParDo = Parallel Do = Parallel Execution [Transform] 11
  • 196. GroupByKey Transform 12 Takes a keyed collection of elements and produces a collection where each element consists of a key and all values associated with that key.
  • 197. GroupByKey Transform Output 13 GroupByKey explicitly shuffles key-value pairs
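A minimal GroupByKey sketch with invented sample data, showing each key emitted once with all of its values.

```python
# Sketch: group (key, value) pairs so each key appears once with its values.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("fruit", "apple"), ("veg", "carrot"), ("fruit", "pear")])
        | beam.GroupByKey()
        | beam.MapTuple(lambda k, vs: (k, sorted(vs)))  # e.g. ("fruit", ["apple", "pear"])
        | beam.Map(print)
    )
```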
  • 198. CoGroupByKey Transform 14 ● Aggregates all input elements by their key and allows downstream processing to consume all values associated with the key. ● While GroupByKey performs this operation over a single input collection and thus a single type of input values, CoGroupByKey operates over multiple input collections.
  • 199. CoGroupByKey Transform Output 15 CoGroupByKey joins two or more key-value pairs
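A minimal CoGroupByKey sketch joining two keyed PCollections; the email and phone records are invented sample data.

```python
# Sketch: join two keyed collections by key with CoGroupByKey.
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("ann", "ann@example.com")])
    phones = p | "Phones" >> beam.Create([("ann", "555-0100"), ("bob", "555-0101")])

    joined = (
        {"emails": emails, "phones": phones}
        | beam.CoGroupByKey()
        # -> ("ann", {"emails": [...], "phones": [...]}), ("bob", {...})
    )
    joined | beam.Map(print)
```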
  • 200. CombinePerKey Transform 16 Combines all elements for each key in a collection.
  • 201. CombineGlobally Transform 17 Combines all elements in a collection.
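The two combine transforms side by side in a small sketch, using sum as the combine function over invented sales figures.

```python
# Sketch: CombinePerKey aggregates per key, CombineGlobally over everything.
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([("eu", 10), ("us", 20), ("eu", 5)])

    per_key = sales | beam.CombinePerKey(sum)                   # ("eu", 15), ("us", 20)
    total = sales | beam.Values() | beam.CombineGlobally(sum)   # 35

    per_key | "PrintPerKey" >> beam.Map(print)
    total | "PrintTotal" >> beam.Map(print)
```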
  • 203. Flatten Transform 19 ● Merges multiple Collection objects into a single logical Collection. ● A transform for Collection objects that store the same data type.
  • 204. Partition Transform 20 ● Separates elements in a collection into multiple output collections. ● The partitioning function contains the logic that determines how to separate the elements of the input collection into each resulting partition output collection. ● The number of partitions must be determined at graph construction time. You cannot determine the number of partitions in mid-pipeline
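A small sketch of Partition (with Flatten merging the results back together), where the number of partitions is fixed at graph construction time as the slide notes; the parity split is an invented example.

```python
# Sketch: split a PCollection into a fixed number of partitions, then re-merge.
import apache_beam as beam

def by_parity(n, num_partitions):
    # The partitioning function returns the index of the output partition.
    return n % num_partitions   # 0 -> evens, 1 -> odds

with beam.Pipeline() as p:
    numbers = p | beam.Create(range(10))
    parts = numbers | beam.Partition(by_parity, 2)
    evens, odds = parts[0], parts[1]

    # Flatten merges PCollections of the same type back into one collection.
    merged = (evens, odds) | beam.Flatten()
    merged | beam.Map(print)
```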
  • 207. DoFn ● The DoFn object that you pass to ParDo contains the processing logic that gets applied to the elements in the input collection. ● executed with ParDo ● exposed to the context (timestamp, window pane, etc) ● can consume side inputs ● can produce multiple outputs or no outputs at all ● can produce side outputs ● can use Beam's persistent state APIs ● dynamically typed 23
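A minimal ParDo/DoFn sketch: the DoFn below holds the per-element processing logic and may emit zero, one, or many outputs per input element.

```python
# Sketch: a user-defined DoFn executed with ParDo.
import apache_beam as beam

class SplitWordsFn(beam.DoFn):
    def process(self, element):
        # One input line can yield many output words (or none at all).
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["the quick brown fox", "jumps over"])
        | beam.ParDo(SplitWordsFn())
        | beam.Map(print)
    )
```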
  • 208. Dataflow templates ● Dataflow templates allow you to package a Dataflow pipeline for deployment. ● Anyone with the correct permissions can then use the template to deploy the packaged pipeline. ● You can create your own custom Dataflow templates, and Google provides pre-built templates for common scenarios. ● There are two types: Flex templates, which are newer and recommended, and Classic templates. 24
  • 209. Google Provided pre-built templates Streaming ● Pub/Sub to BigQuery ● Pub/Sub to Cloud Storage ● Datastream to BigQuery ● Pub/Sub to MongoDB Batch ● BigQuery to Cloud Storage ● Bigtable to Cloud Storage ● Cloud Storage to BigQuery ● Cloud Spanner to Cloud Storage Utility ● Bulk compression of Cloud Storage files ● Firestore bulk delete ● File format conversion 25
  • 210. Windows and Windowing Function ● Tumbling windows (called fixed windows in Apache Beam) ● Hopping windows (called sliding windows in Apache Beam) ● Sessions 26
  • 212. Hopping windows A hopping window represents a consistent time interval in the data stream. Hopping windows can overlap, whereas tumbling windows are disjoint. 28
  • 213. Sessions Windows A session window contains elements within a gap duration of another element. The gap duration is an interval between new data in a data stream. If data arrives after the gap duration, the data is assigned to a new window. 29
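A hedged windowing sketch with invented event times: timestamps are assigned, elements are grouped into 60-second tumbling (fixed) windows, and the commented alternatives show hopping (sliding) and session windows.

```python
# Sketch: assign event timestamps, window the data, then count per key per window.
import apache_beam as beam
from apache_beam.transforms import window

events = [("login", 1), ("login", 10), ("login", 70)]  # (key, event time in seconds)

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
        | beam.WindowInto(window.FixedWindows(60))      # tumbling (fixed) windows
        # window.SlidingWindows(60, 30) -> hopping (sliding) windows
        # window.Sessions(600)          -> session windows with a 10-minute gap
        | beam.combiners.Count.PerKey()
        | beam.Map(print)   # ("login", 2) for [0, 60), ("login", 1) for [60, 120)
    )
```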
  • 214. Watermarks ● A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. ● If new data arrives with a timestamp that's in the window but older than the watermark, the data is considered late data. ● Dataflow tracks watermarks because of the following: ● Data is not guaranteed to arrive in time order or at predictable intervals. ● Data events are not guaranteed to appear in pipelines in the same order that they were generated. ● The data source determines the watermark. ● You can allow late data with the Apache Beam SDK. ● Dataflow SQL does not process late data. 30
  • 215. Triggers ● Triggers determine when to emit aggregated results as data arrives. By default, results are emitted when the watermark passes the end of the window. ● You can use the Apache Beam SDK to create or modify triggers for each collection in a streaming pipeline. You cannot set triggers with Dataflow SQL. ● Types of triggers ○ Event time: as indicated by the timestamp on each data element. ○ Processing time: which is the time that the data element is processed at any given stage in the pipeline. ○ The number of data elements in a collection. 31
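A hedged sketch of configuring a trigger with the Apache Beam Python SDK: early results on processing time, an on-time result when the watermark passes the end of the window, and a two-minute allowed lateness; the window size, delays, and sample scores are arbitrary examples.

```python
# Sketch: a windowing strategy with an early-firing trigger and allowed lateness.
import apache_beam as beam
from apache_beam.transforms import trigger, window

def with_early_firings(pcoll):
    return pcoll | beam.WindowInto(
        window.FixedWindows(60),                 # 60-second fixed windows
        trigger=trigger.AfterWatermark(          # final result at end of window...
            early=trigger.AfterProcessingTime(30)),  # ...plus early results every 30 s
        accumulation_mode=trigger.AccumulationMode.DISCARDING,
        allowed_lateness=120,                    # accept data up to 2 minutes late
    )

with beam.Pipeline() as p:
    scores = (
        p
        | beam.Create([("player1", 5), ("player1", 3)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, 0))
    )
    with_early_firings(scores) | beam.CombinePerKey(sum) | beam.Map(print)
```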
  • 216. Side Inputs A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection 32
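A minimal side-input sketch: a small lookup PCollection is passed to every element-wise call via beam.pvalue.AsDict; the currency rates and orders are invented sample data.

```python
# Sketch: broadcast a lookup table to each element-wise call as a side input.
import apache_beam as beam

with beam.Pipeline() as p:
    rates = p | "Rates" >> beam.Create([("usd", 1.0), ("eur", 1.1)])
    orders = p | "Orders" >> beam.Create([("eur", 100), ("usd", 50)])

    converted = orders | beam.Map(
        lambda order, rates: order[1] * rates[order[0]],
        rates=beam.pvalue.AsDict(rates),   # side input, materialized per window
    )
    converted | beam.Map(print)
```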
  • 217. Run Cloud Dataflow Pipelines ● Locally - which lets you test and debug your Apache Beam pipeline ● Dataflow - data processing system available for running Apache Beam pipelines 33
  • 218. Cloud Dataflow Managed Service 34
  • 219. Security and permissions for pipelines ● The Dataflow service account - The Dataflow service uses the Dataflow service account as part of the job creation request, such as to check project quota and to create worker instances on your behalf, and during job execution to manage the job. This account is also known as the Dataflow service agent. ● The worker service account - Worker instances use the worker service account to access input and output resources after you submit your job. By default, workers use your project's Compute Engine default service account as the worker service account. 35
  • 220. Worker Service Account Roles ● For the worker service account to be able to create, run, and examine a job, it must have the following roles: ○ roles/dataflow.admin ○ roles/dataflow.worker 36
  • 221. Additional Roles for accessed service ● You need to grant the required roles to your Dataflow project's worker service account so that it can access the resources while running the Dataflow job ● If your job writes to BigQuery, your service account must also have at least the roles/bigquery.dataEditor role. ● Other Services ○ Cloud Storage buckets ○ BigQuery datasets ○ Pub/Sub topics and subscriptions ○ Firestore datasets 37
  • 222. Cloud Dataflow Service Account ● It is Automatically created ● Manages job resources ● Assumes cloud dataflow service agent role ● Can read/write access to project resources 38
  • 223. Required permissions that the caller must have 39
  • 224. Dataflow Built-in Roles ● Dataflow Admin - (roles/dataflow.admin) ● Dataflow Developer - (roles/dataflow.developer) ● Dataflow Viewer - (roles/dataflow.viewer) ● Dataflow Worker - (roles/dataflow.worker) 40
  • 227. High availability and geographic redundancy 43
  • 231. Cloud Dataproc Dataproc is a managed service for running open source software (OSS) big data processing jobs, including ETL and machine learning Out-of-the-box support for the most popular open-source software You can use Dataproc to migrate your on-premises OSS clusters to the cloud Maximizing efficiency and enabling scale Use it with Cloud AI Notebook or BigQuery to build an end-to-end data science environment You can launch an IT-governed, auto-scaling cluster in just 90 seconds 2
  • 232. Cloud Dataproc It manages the cluster creation, monitoring, and job orchestration for you Web UI, Cloud SDK, REST APIs, or with SSH access You can submit jobs in your open source framework of choice Scale your cluster up or down at any time Even when jobs are running Pay for what you use down to the second 3
  • 235. Dataproc 2 ● Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. ● Use Dataproc for data lake modernization, ETL, and secure data science, at scale, integrated with Google Cloud, at a fraction of the cost.
  • 236. Where does it stand in Data Pipeline 3
• 237. Benefits ● Open: Run open source data analytics at scale, with enterprise-grade security ● Flexible: Use serverless, or manage clusters on Google Compute Engine and Kubernetes ● Intelligent: Enable data users through integrations with Vertex AI, BigQuery, and Dataplex ● Secure: Configure advanced security such as Kerberos, Apache Ranger, and Personal Cluster Authentication ● Cost-effective: Realize 54% lower TCO compared to on-prem data lakes, with per-second pricing 4
  • 238. Key features - 1 ● Fully managed and automated big data open source software ● Containerize Apache Spark jobs with Kubernetes ● Enterprise security integrated with Google Cloud ● The best of open source with the best of Google Cloud ● Serverless Spark ● Resizable clusters ● Autoscaling clusters ● Cloud integrated ● Versioning ● Cluster scheduled deletion ● Automatic or manual configuration 5
  • 239. Key features - 2 ● Developer tools ● Initialization actions ● Optional components ● Custom containers and images ● Flexible virtual machines ● Component Gateway and notebook access ● Workflow templates ● Automated policy management ● Smart alerts ● Dataproc metastore 6
  • 240. Fully managed and automated big data open source software ● Serverless deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure. ● Reduce TCO of Apache Spark management by up to 54%. ● Enable data scientists and engineers to build and train models 5X faster, compared to traditional notebooks, through integration with Vertex AI Workbench. ● The Dataproc Jobs API makes it easy to incorporate big data processing into custom applications, while Dataproc Metastore eliminates the need to run your own Hive metastore or catalog service. 7
  • 241. Containerize Apache Spark jobs with Kubernetes ● Build your Apache Spark jobs using Dataproc on Kubernetes so you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation. How Dataproc on GKE works ● Dataproc on GKE deploys Dataproc virtual clusters on a GKE cluster. ● Unlike Dataproc on Compute Engine clusters, Dataproc on GKE virtual clusters do not include separate master and worker VMs. ● Instead, when you create a Dataproc on GKE virtual cluster, Dataproc on GKE creates node pools within a GKE cluster. ● The node pools and scheduling of pods on the node pools are managed by GKE. 8
  • 242. Enterprise security integrated with Google Cloud ● When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. ● Additionally, some of the most commonly used Google Cloud-specific security features used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK). 9
  • 243. Best Open source with the best of Google Cloud ● Dataproc lets you take the open source tools, algorithms, and programming languages that you use today, but makes it easy to apply them on cloud-scale datasets. ● At the same time, Dataproc has out-of-the-box integration with the rest of the Google Cloud analytics, database, and AI ecosystem. ● Data scientists and engineers can quickly access data and build data applications connecting Dataproc to BigQuery, Vertex AI, Cloud Spanner, Pub/Sub, or Data Fusion. 10
  • 244. Serverless Spark ● Deploy Spark applications and pipelines that autoscale without any manual infrastructure provisioning or tuning. ● Spark is integrated with BigQuery, Vertex AI, and Dataplex, so you can write and run it from these interfaces in two clicks, without custom integrations, for ETL, data exploration, analysis, and ML. 11
  • 245. Resizable clusters ● Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options. ● After creating a Dataproc cluster, you can adjust ("scale") the cluster by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling) in the cluster. ● You can scale a Dataproc cluster at any time, even when jobs are running on the cluster. ● You cannot change the machine type of an existing cluster (vertical scaling). ● To vertically scale, create a cluster using a supported machine type, then migrate jobs to the new cluster. 12
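A hedged sketch of horizontal scaling with the google-cloud-dataproc Python client: the primary worker count is changed on a running cluster. The project, region, cluster name, and worker count are placeholders, and this assumes the v1 `update_cluster` API with an update mask.

```python
# Hedged sketch: scale a Dataproc cluster's primary workers.
# Project, region, cluster name, and worker count are placeholders.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = client.update_cluster(
    request={
        "project_id": "my-project",
        "region": region,
        "cluster_name": "my-cluster",
        # Only the field named in the update mask is changed.
        "cluster": {
            "cluster_name": "my-cluster",
            "config": {"worker_config": {"num_instances": 5}},
        },
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()  # block until the resize completes
```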
  • 246. Autoscaling clusters ● Dataproc autoscaling provides a mechanism for automating cluster resource management and enables automatic addition and subtraction of cluster workers (nodes). ● The Dataproc AutoscalingPolicies API provides a mechanism for automating cluster resource management and enables cluster worker VM autoscaling. ● An Autoscaling Policy is a reusable configuration that describes how cluster workers using the autoscaling policy should scale. ● It defines scaling boundaries, frequency, and aggressiveness to provide fine-grained control over cluster resources throughout cluster lifetime. 13
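A hedged sketch of creating a reusable autoscaling policy with the Dataproc Python client. Field names follow the v1 AutoscalingPolicy resource; the policy ID, project, region, and all numeric bounds are illustrative assumptions, not recommendations.

```python
# Hedged sketch: create a reusable Dataproc autoscaling policy.
# Policy ID, project, region, and numeric values are placeholders.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = dataproc_v1.AutoscalingPolicy(
    id="my-scaling-policy",
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            scale_up_factor=0.5,     # aggressiveness when scaling up
            scale_down_factor=0.5,   # aggressiveness when scaling down
            graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
        )
    ),
    # Scaling boundaries for primary workers.
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2, max_instances=20
    ),
)

client.create_autoscaling_policy(
    parent=f"projects/my-project/regions/{region}", policy=policy
)
```

The policy can then be referenced when creating clusters, so the same scaling boundaries apply across a fleet.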
  • 247. When to use autoscaling ● on clusters that store data in external services, such as Cloud Storage or BigQuery ● on clusters that process many jobs ● to scale up single-job clusters ● with Enhanced Flexibility Mode for Spark batch jobs 14
  • 248. When NOT to use autoscaling - 1 ● HDFS ○ HDFS utilization is not a signal for autoscaling. ○ HDFS data is only hosted on primary workers. ○ The number of primary workers must be sufficient to host all HDFS data. ○ Decommissioning HDFS DataNodes can delay the removal of workers. 15
  • 249. When NOT to use autoscaling - 2 ● Autoscaling does not support YARN Node Labels, nor the property dataproc:am.primary_only ● Autoscaling does not support Spark Structured Streaming ● Autoscaling is not recommended for the purpose of scaling a cluster down to minimum size when the cluster is idle. ● When small and large jobs run on a cluster, graceful decommissioning scale-down will wait for large jobs to finish. 16
• 250. Single node clusters ● A single node cluster has one node that acts as both the master and a worker for your Dataproc cluster. ● Typical use cases: ○ Trying out new versions of Spark and Hadoop or other open source components ○ Building proof-of-concept (PoC) demonstrations ○ Lightweight data science ○ Small-scale non-critical data processing ○ Education related to the Spark and Hadoop ecosystem 17
  • 251. Limitations ● Single node clusters are not recommended for large-scale parallel data processing. ● Single node clusters are not available with high-availability since there is only one node in the cluster. ● Single node clusters cannot use preemptible VMs. 18
  • 252. High Availability Mode ● When creating a Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. ● The number of masters can only be specified at cluster creation time. ○ 1 master (default, non HA) ○ 3 masters (Hadoop HA) 19
• 253. Cloud integrated ● Built-in integration with Cloud Storage, BigQuery, Dataplex, Vertex AI, Composer, Cloud Bigtable, Cloud Logging, and Cloud Monitoring gives you more than just a Spark or Hadoop cluster: you get a complete and robust data platform. ● For example, you can use Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting. 20
  • 254. Versioning ● Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools. ● Dataproc uses images to tie together useful Google Cloud Platform connectors and Apache Spark & Apache Hadoop components into one package that can be deployed on a Dataproc cluster. ● These images contain the base operating system (Debian or Ubuntu) for the cluster, along with core and optional components needed to run jobs, such as Spark, Hadoop, and Hive. ● These images will be upgraded periodically to include new improvements and features. ● Dataproc versioning allows you to select sets of software versions when you create clusters. 21
  • 255. Cluster scheduled deletion ● To help avoid incurring Google Cloud charges for an inactive cluster, use Dataproc's Cluster Scheduled Deletion feature when you create a cluster. ● This feature provides options to delete a cluster: ○ after a specified cluster idle period ○ at a specified future time ○ after a specified period that starts from the time of submission of the cluster creation request 22
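A hedged sketch of scheduled deletion with the Dataproc Python client: the cluster's lifecycle config sets an idle TTL so the cluster deletes itself after a period of inactivity. Cluster name, machine types, and the TTL value are placeholders.

```python
# Hedged sketch: create a cluster that deletes itself after 2 hours idle.
# Names, machine types, and TTL values are placeholders.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "my-ephemeral-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Scheduled deletion: remove the cluster after the idle period.
        "lifecycle_config": {
            "idle_delete_ttl": duration_pb2.Duration(seconds=7200)
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # wait for cluster creation to finish
```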
• 256. Automatic or manual configuration ● Dataproc automatically configures hardware and software but also gives you manual control. ● The open source components installed on Dataproc clusters contain many configuration files. ● For example, Apache Spark and Apache Hadoop have several XML and plain text configuration files. ● You can use the --properties flag of the gcloud dataproc clusters create command to modify many common configuration files when creating a cluster. 23
  • 257. Developer tools ● Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access. ● Integrate with APIs using Client Libraries for Java, Python, Node.js, Ruby, Go, .NET, and PHP ● Script or interact with cloud resources at scale using the Google Cloud CLI ● Accelerate local development with emulators for Pub/Sub, Spanner, Bigtable, and Datastore 24
  • 258. Initialization actions - 1 ● Run initialization actions to install or customize the settings and libraries you need when your cluster is created. 25
• 259. Initialization actions - 2 ● Initialization actions run as the root user, so you do not need to use sudo ● Initialization actions are executed on each node during cluster creation ● Use absolute paths in initialization actions ● Use a shebang line in initialization actions to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python) 26
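A hedged sketch of attaching an initialization action when creating a cluster with the Python client. The script itself is a shell script (with a shebang line) stored in Cloud Storage; the bucket path, script name, and timeout here are placeholders.

```python
# Hedged sketch: reference an initialization-action script at cluster
# creation time. Bucket path, script name, and timeout are placeholders.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "my-init-cluster",
    "config": {
        "worker_config": {"num_instances": 2},
        # Each node runs this script as root during cluster creation.
        "initialization_actions": [
            {
                "executable_file": "gs://my-bucket/scripts/install-libs.sh",
                "execution_timeout": duration_pb2.Duration(seconds=600),
            }
        ],
    },
}

client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
).result()
```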
  • 260. Optional components ● Use optional components to install and configure additional components on the cluster. ● Optional components are integrated with Dataproc components and offer fully configured environments for Zeppelin, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem. 27
  • 261. Optional Components - Example 28
• 262. Custom containers and images ● Dataproc Serverless for Spark can be provisioned with custom Docker containers. ● Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages. 29
• 263. Flexible virtual machines ● Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs. ● Dataproc clusters are built on Compute Engine instances. ● Machine types define the virtualized hardware resources available to an instance. ● Compute Engine offers both predefined machine types and custom machine types. ● Dataproc clusters can use both predefined and custom types for master and worker nodes. ● In addition to using standard Compute Engine VMs as Dataproc workers (called "primary" workers), Dataproc clusters can use secondary workers. ● There are three types of secondary workers: spot VMs, standard preemptible VMs, and non-preemptible VMs. ● If you specify secondary workers for your cluster, they must all be the same type. ● The default Dataproc secondary worker type is the standard preemptible VM. 30
  • 264. Component Gateway and notebook access ● Dataproc Component Gateway enables secure, one-click access to Dataproc default and optional component web interfaces running on the cluster. ● Open source components included with Google Dataproc clusters, such as Apache Hadoop and Apache Spark, provide web interfaces. ● These interfaces can be used to manage and monitor cluster resources and facilities, such as the YARN resource manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark. ● Component Gateway provides secure access to web endpoints for Dataproc default and optional components. ● Clusters created with Dataproc image version 1.3.29 and later can enable access to component web interfaces without relying on SSH tunnels or modifying firewall rules to allow inbound traffic. 31
  • 265. Workflow templates ● Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. ● A workflow template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs. ● A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster. ● Workflows are ideal for complex job flows. You can create job dependencies so that a job starts only after its dependencies complete successfully. 32
  • 266. Automated policy management ● Standardize security, cost, and infrastructure policies across a fleet of clusters. ● You can create policies for resource management, security, or network at a project level. ● You can also make it easy for users to use the correct images, components, metastore, and other peripheral services, enabling you to manage your fleet of clusters and serverless Spark policies in the future. 33
• 267. Smart alerts ● Dataproc recommended alerts let customers adjust the thresholds of pre-configured alerts to be notified about idle clusters, runaway clusters or jobs, overutilized clusters, and more. ● Customers can further customize these alerts and even build advanced cluster and job management capabilities on top of them. ● These capabilities allow customers to manage their fleet at scale. 34
  • 268. Dataproc metastore ● Fully managed, highly available Hive Metastore (HMS) with fine-grained access control and integration with BigQuery metastore, Dataplex, and Data Catalog. ● Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata. ● This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing tools you're using. 35
• 269. Connectors ● BigQuery connector - enables programmatic read/write access to BigQuery ● Bigtable connector ○ Bigtable is an excellent option for any Apache Spark or Hadoop use case that requires Apache HBase. ○ Bigtable supports the Apache HBase 1.0+ APIs and offers a Bigtable HBase client in Maven, so it is easy to use Bigtable with Dataproc. ● Cloud Storage connector - The Cloud Storage connector is an open source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage, and offers a number of benefits over choosing the Hadoop Distributed File System (HDFS). ● Pub/Sub Lite - The Pub/Sub Lite Spark Connector supports Pub/Sub Lite as an input source to Apache Spark Structured Streaming in the default micro-batch processing and experimental continuous processing modes. 36
  • 270. Dataproc on Compute Engine pricing ● Dataproc is billed by the second, and all Dataproc clusters are billed in one-second clock-time increments, subject to a 1-minute minimum billing. ● Dataproc on Compute Engine pricing is based on the size of Dataproc clusters and the duration of time that they run. ● The size of a cluster is based on the aggregate number of virtual CPUs (vCPUs) across the entire cluster, including the master and worker nodes. ● The duration of a cluster is the length of time between cluster creation and cluster stopping or deletion. ● Total Price = $0.010 * # of vCPUs * hourly duration 37
• 271. Pricing example For a cluster with 24 vCPUs in total that runs for 2 hours: Dataproc charge = # of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 = $0.48 38
  • 272. Dataproc on GKE pricing ● The Dataproc on GKE pricing formula, $0.010 * # of vCPUs * hourly duration, is the same as the Dataproc on Compute Engine pricing formula, and is applied to the aggregate number of virtual CPUs running in VMs instances in Dataproc-created node pools in the cluster. 39
  • 274. Dataproc Roles ● Dataproc Admin ● Dataproc Editor ● Dataproc Viewer ● Dataproc Worker (for service accounts only) 41
  • 275. IAM roles and Dataproc operations summary 42
• 276. Dataproc service accounts ● Dataproc VM service account: VMs in a Dataproc cluster use this service account for Dataproc data plane operations, such as reading and writing data from and to Cloud Storage and BigQuery ● Dataproc Service Agent service account: Dataproc creates this service account with the Dataproc Service Agent role in a Dataproc user's Google Cloud project. 43
• 279. Cloud Pub/Sub Cloud Pub/Sub is an asynchronous messaging service Send, Receive, and Filter events or data streams Durable Message Storage Scalable in-order message delivery Consistently high availability Performance at any scale Runs in all Google Cloud regions of the world Serverless Scales global data delivery automatically Millions of messages per second Data producers don't need to change anything when the consumers of their data change 2
• 280. Cloud Pub/Sub Services can be entirely stateless Set up Pub/Sub between services or applications by defining topics and then subscriptions Subscriber services receive the messages published on those topics One-to-many communications Spread your workload over multiple workers E.g. send logs from your security system to archiving, processing, and analytic services Stream your data into BigQuery or Dataflow for intelligent processing Ideal for notifications 3
  • 284. Core Concepts - 1 ● Topic. A named resource to which messages are sent by publishers. ● Subscription. A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application. For more details about subscriptions and message delivery semantics, see the Subscriber Guide. ● Message. The combination of data and (optional) attributes that a publisher sends to a topic and is eventually delivered to subscribers. ● Message attribute. A key-value pair that a publisher can define for a message. For example, key iana.org/language_tag and value en could be added to messages to mark them as readable by an English-speaking subscriber. 3
  • 285. Core Concepts - 2 ● Publisher. An application that creates and sends messages to a single or multiple topics. ● Subscriber. An application with a subscription to a single or multiple topics to receive messages from it. ● Acknowledgment (or "ack"). A signal sent by a subscriber to Pub/Sub after it has received a message successfully. Acknowledged messages are removed from the subscription message queue. ● Push and pull. The two message delivery methods. A subscriber receives messages either by Pub/Sub pushing them to the subscriber chosen endpoint, or by the subscriber pulling them from the service. 4
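A minimal sketch that ties these concepts together with the google-cloud-pubsub Python client: publish a message with an attribute, then pull and acknowledge it. The project, topic, and subscription IDs are placeholders and assume the resources already exist.

```python
# Hedged sketch: publish a message with an attribute, then pull and ack it.
# Project, topic, and subscription IDs are placeholders.
from google.cloud import pubsub_v1

project = "my-project"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project, "orders-topic")
sub_path = subscriber.subscription_path(project, "orders-sub")

# Publish: data must be bytes; keyword arguments become message attributes.
future = publisher.publish(topic_path, b"order 42 created", origin="web")
print("published message id:", future.result())

# Pull: synchronously fetch up to 10 messages, then acknowledge them so
# they are removed from the subscription's message queue.
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
ack_ids = [msg.ack_id for msg in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})
```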
  • 286. Many-to-one (fan-in) and One-to-many (fan-out) 5
• 288. Common use cases ● Ingesting user interaction and server events. ● Real-time event distribution. ● Replicating data among databases. ● Parallel processing and workflows. ● Enterprise event bus. ● Data streaming from applications, services, or IoT devices. ● Refreshing distributed caches. ● Load balancing for reliability. 7
  • 289. Integrations - 1 Stream processing and data integration. ● Dataflow: Dataflow templates and SQL, which allow processing and data integration into BigQuery and data lakes on Cloud Storage. ● Dataflow templates for moving data from Pub/Sub to Cloud Storage, BigQuery, and other products are available in the Pub/Sub and Dataflow UIs in the Google Cloud console. ● Integration with Apache Spark, particularly when managed with Dataproc is also available. ● Visual composition of integration and processing pipelines running on Spark + Dataproc can be accomplished with Data Fusion. 8
  • 290. Integrations - 2 Monitoring, Alerting and Logging. ● Supported by Monitoring and Logging products. Authentication and IAM. ● Pub/Sub relies on a standard OAuth authentication used by other Google Cloud products and supports granular IAM, enabling access control for individual resources. 9
  • 291. Integrations - 3 APIs. ● Pub/Sub uses standard gRPC and REST service API technologies along with client libraries for several languages. Triggers, notifications, and webhooks. ● Pub/Sub offers push-based delivery of messages as HTTP POST requests to webhooks. ● You can implement workflow automation using Cloud Functions or other serverless products. 10
• 292. Integrations - 4 Orchestration. ● Pub/Sub can be integrated into multistep serverless Workflows declaratively. ● Big data and analytics orchestration is often done with Cloud Composer, which supports Pub/Sub triggers. ● Application Integration provides a Pub/Sub trigger to trigger or start integrations. 11
  • 293. You can filter messages by their attributes from a subscription ● When you receive messages from a subscription with a filter, you only receive the messages that match the filter. ● The Pub/Sub service automatically acknowledges the messages that don't match the filter. ● You can filter messages by their attributes, but not by the data in the message. ● You can have multiple subscriptions attached to a topic and each subscription can have a different filter. 12
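A hedged sketch of an attribute filter set at subscription creation time (filters cannot be changed afterwards). The project, topic, subscription names, and the `region` attribute are placeholders.

```python
# Hedged sketch: create a subscription that only delivers messages whose
# "region" attribute equals "emea". Resource names are placeholders.
from google.cloud import pubsub_v1

project = "my-project"
subscriber = pubsub_v1.SubscriberClient()
topic_path = f"projects/{project}/topics/orders-topic"
sub_path = subscriber.subscription_path(project, "orders-emea-sub")

subscriber.create_subscription(
    request={
        "name": sub_path,
        "topic": topic_path,
        # Messages that do not match are acknowledged automatically by the service.
        "filter": 'attributes.region = "emea"',
    }
)
```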
  • 294. Types of subscriptions ● Pull subscription ● Push subscription ● BigQuery subscription 13
  • 298. Delivery Types - at-least-once delivery ● By default, Pub/Sub offers at-least-once delivery with no ordering guarantees on all subscription types. ● Alternatively, if messages have the same ordering key and are in the same region, you can enable message ordering. ● After you set the message ordering property, the Pub/Sub service delivers messages with the same ordering key and in the order that the Pub/Sub service receives the messages. 17
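A hedged sketch of message ordering: it must be enabled on the subscription and in the publisher options, and messages that should arrive in order share an ordering key. Resource names and the ordering key are placeholders; in practice, ordering applies to messages published in the same region.

```python
# Hedged sketch: enable ordered delivery for messages sharing an ordering key.
# Topic, subscription, and key names are placeholders.
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.types import PublisherOptions

project = "my-project"
topic_path = f"projects/{project}/topics/orders-topic"
sub_path = f"projects/{project}/subscriptions/orders-ordered-sub"

subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(
    request={"name": sub_path, "topic": topic_path, "enable_message_ordering": True}
)

publisher = pubsub_v1.PublisherClient(
    publisher_options=PublisherOptions(enable_message_ordering=True)
)
# Messages sharing an ordering key are delivered in publish order.
for i in range(3):
    publisher.publish(topic_path, f"event {i}".encode(), ordering_key="customer-123")
```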
• 299. Delivery Types - exactly-once delivery ● Pub/Sub also supports exactly-once delivery, for pull subscriptions within a cloud region. ● Without exactly-once delivery, Pub/Sub generally delivers each message once and in the order in which it was published, but messages may sometimes be delivered out of order or more than once. ● Pub/Sub might redeliver a message even after an acknowledgement request for the message returns successfully. ● This redelivery can be caused by issues such as server-side restarts or client-side issues. ● Thus, although rare, any message can be redelivered at any time. ● Accommodating more-than-once delivery requires your subscriber to be idempotent when processing messages. 18
  • 300. Message Retention ● Unacknowledged messages are retained for a default of 7 days (configurable by the subscription's message_retention_duration property). ● A topic can retain published messages for a maximum of 31 days (configurable by the topic's message_retention_duration property) even after they have been acknowledged by all attached subscriptions. ● In cases where the topic's message_retention_duration is greater than the subscription's message_retention_duration, Pub/Sub discards a message only when its age exceeds the topic's message_retention_duration. ● By default, subscriptions expire after 31 days of subscriber inactivity or if there are no updates made to the subscription. ● When you modify either the message retention duration or subscription expiration policy, the expiration period must be set to a value greater than the message retention duration. The default message retention duration is 7 days and the default expiration period is 31 days. 19
• 301. Exponential backoff ● Exponential backoff lets you add progressively longer delays between retry attempts. ● After the first delivery failure, Pub/Sub waits for a minimum backoff time before retrying. ● For each consecutive message failure, more time is added to the delay, up to a configurable maximum delay (both bounds must be between 0 and 600 seconds). ● The minimum and maximum delay intervals are not fixed and should be configured based on factors local to your application. 20
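A hedged sketch of setting those backoff bounds through a subscription retry policy with the Python client. The resource names and the 10s/300s bounds are placeholders chosen only to stay within the 0-600 second range.

```python
# Hedged sketch: configure per-subscription retry backoff bounds.
# Resource names and backoff values are placeholders.
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

project = "my-project"
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "orders-retry-sub")

subscriber.create_subscription(
    request={
        "name": sub_path,
        "topic": f"projects/{project}/topics/orders-topic",
        # Delays between redelivery attempts grow from the minimum toward
        # the maximum backoff.
        "retry_policy": pubsub_v1.types.RetryPolicy(
            minimum_backoff=duration_pb2.Duration(seconds=10),
            maximum_backoff=duration_pb2.Duration(seconds=300),
        ),
    }
)
```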
  • 305. BigQuery BigQuery is Google Cloud's enterprise data warehouse Ingest Store Analyze Visualize 2
  • 306. Supports Ingesting Data via Batch or Streaming Directly Fully-managed data warehouse Petabyte scale BigQuery 3
  • 307. BigQuery supports a standard SQL dialect that is ANSI compliant Interacting with BigQuery is easy BigQuery 4
  • 308. You can use the Cloud Console UI, BigQuery command-line tool bq, or use the API with client libraries of your choice BigQuery integrates with several business intelligence tools BigQuery 5
  • 309. Simple pricing model You pay for data storage, streaming inserts, and querying data Loading and exporting data are free of charge BigQuery 6
  • 310. Storage costs are based on the amount of data stored For queries, you can choose to pay per query or a flat rate for dedicated resources BigQuery 7
  • 313. BigQuery Views A view is a virtual table defined by a SQL query. Query it in the same way you query a table. When a user queries the view, the query results contain data only from the tables and fields specified in the query that defines the view. How to use Query editor box in the Google Cloud console bq command-line tool's bq query command BigQuery REST API to programmatically call the jobs.query or query-type jobs.insert methods BigQuery client libraries You can also use a view as a data source for a visualization tool such as Google Data Studio. 2
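A minimal sketch of defining a view with the BigQuery Python client; the same could be done with a `CREATE VIEW` statement in the query editor or with `bq`. The project, dataset, and table names are placeholders.

```python
# Hedged sketch: define a view with the BigQuery Python client.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

view = bigquery.Table("my-project.reporting.daily_orders_view")
view.view_query = """
    SELECT order_date, COUNT(*) AS order_count
    FROM `my-project.sales.orders`
    GROUP BY order_date
"""
view = client.create_table(view)  # the view can now be queried like a table
```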
  • 314. BigQuery View Limitation Read-only. No DML (insert, update, delete) queries against a view. The dataset that contains your view and the dataset that contains the tables referenced by the view must be in the same location. No exporting of data from a view. You cannot mix standard SQL and legacy SQL queries when using views. You cannot reference query parameters in views. 3
  • 315. BigQuery View Limitation You cannot include a temporary user-defined function or a temporary table in the SQL query that defines a view. You cannot reference a view in a wildcard table query. 4
  • 319. BigQuery is Google's data warehouse solution 3
  • 320. BigQuery - An Ideal Data Warehouse ● Interactive SQL queries over large datasets (petabytes) in seconds ● Serverless and no-ops, including ad hoc queries ● Ecosystem of visualization and reporting tools ● Ecosystem of ETL and data processing tools ● Up-to-the-minute data ● Machine learning ● Security and collaboration 4
• 321. Why BigQuery is different from Traditional Databases? 5
  • 324. Filter row data based on sensitive data 8
  • 326. Authorized View - 1 Save result to another dest table 10
  • 327. Authorized View - 2 Create another Dataset 11
  • 328. Authorized View - 3 Save view 12
  • 329. Authorized View - 4 Add necessary permissions 13
  • 330. Authorized View - 5 Assign permissions 14
  • 331. Authorized View - 6 Give View access to Dataset 15
• 332. Authorized View Procedure ● Step 1: Start with your source dataset. This is the dataset containing the sensitive data you don't want to share. ● Step 2: Create a separate dataset to store the view. Authorized views require the source data to sit in a separate dataset from the view (the reason becomes clear in step 6). ● Step 3: Create a view in the new dataset. This is the view you intend to share with your data analysts, defined by a SQL query that includes only the data the analysts need to see. ● Step 4: Assign access controls to the project. To query the view, your analysts need permission to run queries; assigning them the BigQuery User role gives them this ability. This access does not let them view or query any datasets within the project. ● Step 5: Assign access controls to the dataset containing the view. To query the view, the analysts must be granted the BigQuery Data Viewer role on the specific dataset that contains the view. ● Step 6: Authorize the view to access the source dataset. This gives the view itself access to the source data. It is needed because the view takes on the permissions of the person using it, and since the analysts don't have access to the source table, they would otherwise get an error when querying the view. 16
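A hedged sketch of steps 2-6 with the BigQuery Python client: the view is created in its own dataset, then authorized against the source dataset. All project, dataset, and table names are placeholders; the IAM grants in steps 4-5 are omitted.

```python
# Hedged sketch of the authorized-view steps (names are placeholders).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Step 3: create the shared view in its own dataset (created in step 2).
view = bigquery.Table("my-project.shared_views.customers_no_pii")
view.view_query = """
    SELECT customer_id, country
    FROM `my-project.private.customers`
"""
view = client.create_table(view)

# Steps 4-5 (granting roles/bigquery.user on the project and
# roles/bigquery.dataViewer on the view's dataset) are IAM grants, omitted here.

# Step 6: authorize the view against the source dataset so the view itself
# can read the sensitive table on the analysts' behalf.
source_dataset = client.get_dataset("my-project.private")
entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```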
  • 333. Materialized Views ● Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. ● BigQuery leverages precomputed results from materialized views and whenever possible reads only delta changes from the base tables to compute up-to-date results. ● Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables. ● Queries that use materialized views are generally faster and consume fewer resources than queries that retrieve the same data only from the base tables. 17
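A hedged sketch of creating a materialized view with a DDL statement run through the Python client. The names are placeholders, and the query assumes a simple aggregation over a single base table, which is the kind of query materialized views are designed for.

```python
# Hedged sketch: create a materialized view via DDL. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE MATERIALIZED VIEW `my-project.reporting.mv_daily_revenue` AS
SELECT order_date, SUM(amount) AS revenue
FROM `my-project.sales.orders`
GROUP BY order_date
"""
client.query(ddl).result()  # BigQuery maintains and refreshes the view
```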