Google Cloud
Certified Professional
Data Engineer Exam
Strategy, Tips and
Overview 1
About me
• I have 20 years of experience in the IT industry, with
a focus on Cloud, Data, ML & DevOps
• I hold more than 50 certifications across all of the
above fields
• I am a 2-time Google Professional Data Engineer
Certified, and also a 1-time Google DevOps and
Machine Learning Engineer Certified
• LinkedIn Profile
2
3
4
Exam Cost
• 200 USD list price
• 141.60 USD is what I paid after taxes, etc.
• The price is the same whether taken at home or at an exam center
5
Certification Exam Overview - 1
•50 Questions to be answered in 2 hours
•No negative marks
•All Multiple Choice Questions
•Select 1 right answer out of 4
•Select 2 or 3 right answers out of 5 or 6 choices
•144 seconds for every question
6
Certification Exam Overview - 2
• Questions can be answered and also marked for
review if you want to revisit them later
• No pen and paper allowed or provided
• Cert valid for 2 years
• Can be taken onsite at an exam center or remotely
proctored
• Result is Pass or Fail; marks will not be shared
• The exam is hard
7
Exam Strategy
• Process of elimination
• Eliminate usually 2 wrong choices
• Then make a decision/guess between the last 2
strong options
• On my first go I made a choice and marked all
questions for review later
• On 2nd run I read through all my answers and
either confirmed or changed my choice
8
Top topics (in order)
• BigQuery
• Dataflow
• Bigtable
• Dataproc
• Pub/Sub
• Cloud SQL
• Cloud Spanner
9
• Cloud Composer
• Data Prep
• Data Fusion
• Cloud DLP
• Pre-Trained ML/AI
APIs
• Fundamentals of
Machine Learning
• Feature
Engineering
• Overfitting
• IAM, Access
Control, Service
Accounts, Users,
Groups, Roles
• Kafka, Hive, Pig,
Hadoop
Tips (Guesswork)
• Choose Ideal Recommended Solution (Reference Architectures)
• Choose Google Product over other products
• Avoid complex, convoluted, lengthy, manual, error-prone,
cumbersome solutions
• If you are a little unsure or a question is too time consuming, mark it for
review later.
• After reading through all 50 questions you will figure out some of the
unsure questions
• If all else fails, go with your first guess
10
Do Review
• Sample Questions
• Exam Guide
11
Some Motivation
12
Thanks
13
For
Watching
Roles
& Responsibilities
of a Data Engineer
1
2
•Data Engineers are responsible for
designing, building, and maintaining
the infrastructure and systems that
are used to store, process, and
analyze large amounts of data.
3
•Design, build and maintain the data
pipeline: Data Engineers are
responsible for designing and building
the data pipeline, which includes the
collection, storage, processing, and
transportation of data.
•They also make sure that the pipeline is
efficient, reliable, and scalable.
4
•Data storage and management: Data
Engineers are responsible for designing
and maintaining the data storage
systems, such as relational databases,
NoSQL databases, and data
warehouses.
•They also ensure that the data is properly
indexed, partitioned, and backed up.
5
•Data quality and integrity: Data
Engineers are responsible for ensuring
the quality and integrity of the data as
it flows through the pipeline.
•This includes cleaning, normalizing,
and validating the data before it is
stored.
6
•Data security: Data Engineers are
responsible for implementing security
measures to protect sensitive data
from unauthorized access.
•This includes implementing
encryption, access controls, and
monitoring for security breaches.
7
•Performance tuning and optimization:
Data Engineers are responsible for
monitoring the performance of the data
pipeline and making adjustments to
optimize its performance.
•This includes identifying and resolving
bottlenecks, and scaling resources as
needed.
8
•Collaboration with other teams: Data
Engineers often work closely with
Data Scientists, Data Analysts and
Business Intelligence teams to
understand their data needs and
ensure that the data pipeline is able
to support their requirements.
9
•Keeping up with the latest
technologies: Data Engineers need to
keep up-to-date with the latest
technologies and trends in the field,
such as new data storage and
processing systems, big data
platforms, and data governance best
practices.
Thanks
for
Watching
10
Types of Data
Storage Systems
1
Relational Databases
•Relational databases: These are the most common
type of data storage systems, and include popular
options such as MySQL, Oracle, and Microsoft SQL
Server.
•Relational databases store data in tables with rows
and columns, and are based on the relational
model.
•They are well suited for structured data and support
the use of SQL for querying and manipulating data.
2
NoSQL Databases
•NoSQL databases: These are non-relational
databases that are designed to handle large
amounts of unstructured or semi-structured data.
•Examples include MongoDB, Cassandra, and
Hbase.
•NoSQL databases are often used for big data and
real-time web applications.
•They are horizontally scalable and provide high
performance and availability.
3
Data Warehouses
•Data warehouses: These are specialized relational
databases that are optimized for reporting and
analytics.
•They are designed to handle large amounts of
historical data and support complex queries and
aggregations.
•Examples include Amazon Redshift, Google
BigQuery, and Microsoft Azure SQL Data
Warehouse
4
Data Lakes
•Data lakes: Data lake is a data storage
architecture that allows storing raw,
unstructured and structured data at any
scale.
•Data lake technologies, such as Amazon
S3, Azure Data Lake Storage, and Google
Cloud Storage, provide a centralized
repository that can store all types of data.
5
Columnar Databases
•Columnar databases: Columnar
databases, such as Apache Parquet,
Apache ORC, and Google Bigtable, are
used for storing and querying large
amounts of data in a columnar format.
•This format is optimized for read-intensive
workloads and analytical querying.
6
Key-value Databases
•Key-value databases: Key-value databases,
such as Redis and memcached, are
designed to store large amounts of data in
a simple key-value format, and are
optimized for read-heavy workloads.
•They are particularly well-suited for use
cases such as caching, session
management, and real-time analytics.
7
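To make the key-value model concrete, here is a minimal sketch in Python using the redis-py client; it assumes the redis package is installed, a Redis server is reachable on localhost:6379, and the key and value are made up for illustration.

import redis

# Connect to a local Redis server (assumed to be running on the default port).
r = redis.Redis(host="localhost", port=6379, db=0)

# Store a session token under a key with a 30-minute time-to-live,
# a typical caching / session-management pattern.
r.set("session:user:42", "opaque-session-token", ex=1800)

# Look the value up again by its key.
token = r.get("session:user:42")
print(token)  # b'opaque-session-token' (bytes unless decode_responses=True)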
Object Storage
•Object storage: These systems are designed
to store and retrieve unstructured data, such
as images, videos, and audio files.
•They are often used for cloud storage and
archiving.
•Examples include Amazon S3, Microsoft
Azure Blob Storage, and OpenStack Swift.
8
File Storage
•File storage: These systems store data as
files and directories, which can be organized
in a hierarchical file system.
•They are often used for storing large files
and streaming media, and are commonly
used in distributed systems like Hadoop
HDFS.
•Other Examples: NTFS, ext4
9
Time-series Databases
•Time-series databases: These are
specialized databases that are
optimized for storing and querying
time-series data, such as sensor data
and financial data.
•Examples include InfluxDB,
OpenTSDB, and TimescaleDB.
10
Graph Databases
•Graph databases: These databases are designed
for storing and querying graph data, which
consists of nodes and edges representing entities
and relationships.
•Examples include Neo4j and JanusGraph.
•They are well-suited for applications that require
querying complex relationships and patterns in
the data, such as social networks and
recommendation systems.
11
Thanks
for
Watching
12
ACID vs BASE
ACID
• ACID is an acronym that stands for Atomicity, Consistency, Isolation, and
Durability.
• These properties are a set of guarantees that a database system makes about
the behavior of transactions.
• ACID properties are important for maintaining data integrity, consistency, and
availability in a database.
• It ensures that the data stored in the database is accurate, consistent, and
can be relied upon.
Atomicity
• This property ensures that a transaction is treated as a single, indivisible unit
of work.
• Either all of the changes made in a transaction are committed to the
database, or none of them are.
• This means that if a transaction is interrupted or fails, any changes made in
that transaction will be rolled back, so that the database remains in a
consistent state.
Consistency
• This property ensures that a transaction brings the database from one valid
state to another valid state.
• A database starts in a consistent state, and any transaction that is executed
on the database should also leave the database in a consistent state.
Isolation
• This property ensures that the concurrent execution of transactions does not
affect the correctness of the overall system.
• Each transaction should execute as if it is the only transaction being
executed, even though other transactions may be executing at the same time.
Durability
• This property ensures that once a transaction is committed, its effects will
persist, even in the event of a failure (such as a power outage or a crash).
• This is typically achieved by writing the changes to non-volatile storage, such
as disk.
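To illustrate atomicity and durability in practice, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and amounts are hypothetical, and any ACID-compliant database would behave the same way.

import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts (id, balance) VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # Both updates form one transaction: either both are committed or neither is.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()      # Durability: the committed changes are written to disk.
except sqlite3.Error:
    conn.rollback()    # Atomicity: a failure undoes every change in the transaction.
finally:
    conn.close()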
BASE
• BASE stands for Basically Available, Soft state, Eventually consistent.
• It is a set of properties that describe the behavior of a distributed database
system or a distributed data store.
Basically Available
• This property ensures that the data store is available for read and write
operations, although there may be some limitations on availability due to
network partitions or other failures.
Soft state
• This property acknowledges that the state of the data store may change over
time, even without input.
• This is due to the distributed nature of the data store and the inherent
uncertainty in network communication.
Eventually consistent
• This property ensures that all nodes in the distributed data store will
eventually converge to the same state, even if they do not have immediate
access to the same information.
• This means that it may take some time for all nodes to have the same data,
but eventually, they will.
Difference
• ACID guarantees consistency and
isolation for transactions, but it
comes with a cost of overhead and
less scalability
• BASE prioritizes availability and
scalability over consistency, which
can make it more difficult to reason
about and predict the behavior of the
system.
Thanks
for
Watching
12
OLTP vs OLAP
OLTP - 1
• OLTP stands for Online Transaction Processing.
• It is a type of database system that is optimized for handling a large number
of short, transactional requests, such as inserting, updating, or retrieving data
from a database.
• In an OLTP system, the database is designed to handle a high number of
concurrent connections and transactions, with a focus on fast, consistent
response times.
• The data in an OLTP system is typically stored in normalized, relational tables,
which allows for efficient querying and indexing.
• OLTP systems are used in a variety of applications, such as e-commerce
systems, financial systems, and inventory management systems.
OLTP - 2
• The goal of OLTP is to enable the processing of business transactions as fast
as possible, with a high degree of consistency and data integrity.
• OLTP systems are typically characterized by a high number of read and write
operations, a large number of concurrent users, and a high volume of data.
• They are also characterized by a high degree of normalization and data
integrity, with strict constraints and triggers to ensure data consistency and
prevent data corruption.
• Overall, OLTP is designed to handle a large number of concurrent
transactions and to provide fast, consistent response times.
• It is an essential component of many business systems, and it is used to
support a wide variety of transactions and business processes.
OLAP - 1
• OLAP, or Online Analytical Processing is a powerful technology that allows users
to easily analyze large, complex data sets and make informed business decisions.
• It is commonly used in business intelligence and decision support systems to
support complex queries and analysis on large datasets.
• OLAP databases are typically built on top of a relational database and use a
multidimensional data model, which organizes data into a cube structure.
• Each dimension in the cube represents a different aspect of the data, such as
time, location, or product, and each cell in the cube contains a measure, such as
sales or profit.
• Users can interact with the OLAP cube using a client tool, such as Microsoft
Excel, to drill down, roll up, and slice and dice the data to gain insights.
OLAP - 2
• For example, a user could start by looking at total sales for a given time period, then
drill down to see sales by region, and then by individual store.
• There are three main types of OLAP systems: relational OLAP (ROLAP),
multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP).
• ROLAP uses a relational database as the underlying data store, while MOLAP uses a
specialized multidimensional data store.
• HOLAP combines the benefits of both ROLAP and MOLAP by using a relational data
store for the detailed data and a multidimensional data store for the summarized data.
• OLAP also provides several advanced analytical capabilities such as time series
analysis, forecasting, budgeting, and data mining.
• In addition, many OLAP tools provide a graphical interface that makes it easy for
users to interact with the data and perform advanced analysis.
Difference
• OLTP is designed to handle a high volume
of short online transactions, such as
inserting, updating, and retrieving data.
• It is optimized for transactional
consistency, and data is stored in a
normalized form to minimize data
redundancy.
• The main goal of OLTP is to ensure data
accuracy and integrity, and to make sure
that the system can handle a large
number of concurrent users and
transactions.
• OLTP is used for operational systems that
handle day-to-day transactions
• OLAP is designed to handle complex,
multi-dimensional analysis of data.
• It is optimized for fast query performance
and efficient data aggregation, and data
is stored in a denormalized form to
enable faster data retrieval.
• The main goal of OLAP is to support
business intelligence and decision-
making by providing users with the ability
to analyze large amounts of data from
multiple dimensions and levels of detail.
• OLAP is used for analytical systems that
support business intelligence and
decision-making
Thanks
for
Watching
7
4 V's of Big Data
1
4 V's of Big Data
•Volume
•Velocity
•Variety
•Veracity
2
Volume
•Big data is characterized by its large volume,
which can range from terabytes to petabytes
and even exabytes.
•This large volume of data is generated from
various sources, such as social media, IoT
devices, and transactional systems.
3
Velocity
•Big data is characterized by its high velocity,
which refers to the speed at which data is
generated and collected.
•This high velocity of data requires real-time
processing and analysis to extract insights
and make decisions.
4
Variety
•Big data is characterized by its wide variety of
data types, such as structured, unstructured,
and semi-structured data.
•This variety of data types requires specialized
tools and techniques for processing and
analysis.
5
Veracity
•Big data is characterized by its uncertainty
and lack of trustworthiness, which makes it
difficult to validate and verify the accuracy of
the data.
•This requires data quality and data
governance processes to ensure that the data
is accurate and reliable.
6
Thanks
for
Watching
Vertical vs Horizontal Scaling
1
Vertical Scaling
•Vertical scaling is the process of increasing the
capacity of a single server or machine by adding
more resources such as CPU, memory, or storage.
•Vertical scaling is often used to improve the
performance of a single server or to add capacity to
a machine that has reached its limits.
•The main disadvantage of vertical scaling is that it
can reach a physical limit of how much resources
can be added to a single machine.
2
Horizontal Scaling
•Horizontal scaling is the process of adding more machines to a network to
distribute the load and increase capacity.
•Horizontal scaling is often used in cloud computing and other distributed
systems to handle large amounts of traffic or data.
•It allows for the system to handle more requests by adding more machines
to the network, rather than upgrading the resources of a single machine.
•Horizontal scaling is often considered more flexible and cost-effective than
vertical scaling, as it allows for the easy addition or removal of machines
as needed.
•However, it may also require a load balancer and a way to share data
between the machines.
3
Difference
•Adding more resources
to a single server or
machine, such as CPU,
memory, or storage
•There is a physical limit
to how much resources
can be added to a
single machine
4
•Adding more
machines to a network
to distribute the load
and increase capacity
•It may also require a
load balancer and a
way to share data
between the machines
Thanks
for
Watching
Batch
& Streaming Data
1
Batch Data
2
• Batch data refers to data that is collected and processed in
fixed, non-overlapping intervals, also known as batches.
• Batch processing is commonly used when working with large
amounts of historical data, such as data from a data warehouse.
• The data is collected over a certain period of time and then
processed all at once.
• Batch processing is well suited for tasks that do not require real-
time processing, such as generating reports, running analytics,
or training machine learning models.
Streaming Data
3
• Streaming data, on the other hand, refers to data that is
generated and processed in real-time, as it is being generated.
• Streaming data is typically generated by various sources, such
as IoT devices, social media, or financial transactions.
• The data is processed as it is received, with minimal latency,
and is often used to support real-time decision-making and
event detection.
• Examples of streaming data processing include monitoring
sensor data, analyzing social media feeds, and detecting fraud
in financial transactions.
Main Difference
4
• The main difference between batch data and streaming data
is the way they are processed.
• Batch data is processed in fixed intervals, while streaming
data is processed as it is generated.
• Batch data is well suited for tasks that do not require real-
time processing, while streaming data is well suited for real-
time tasks such as monitoring and event detection.
Note
5
• It's worth noting that, these days, many systems combine
both batch and streaming data processing, this is known as
Lambda architecture.
• This is a way to handle both real-time and historical data in a
single system, which can be useful in cases where real-time
decisions need to be made based on historical data.
Thanks
for
Watching
6
Data Processing Pipeline
1
2
Data Processing Pipeline
3
• A data processing pipeline is a series of stages or phases that
data goes through from the time it is collected to the time it is
used for analysis or reporting.
Data collection
4
• The first stage of the data processing pipeline is data
collection.
• This includes acquiring data from various sources, such as
sensors, log files, social media, and transactional systems.
• Data collection may also include pre-processing, such as
filtering, sampling, and transforming the data to make it
suitable for further processing.
Data Storage
5
• After the data is collected, it needs to be stored in a reliable
and efficient manner.
• This includes storing the data in a data warehouse, data lake,
or other data storage systems, as well as indexing,
partitioning, and backing up the data.
Data Processing
6
• The next stage of the data processing pipeline is data
processing, which includes cleaning, normalizing, validating,
and transforming the data.
• This step is critical for ensuring the quality and integrity of the
data.
Data Modeling
7
• After the data is cleaned and processed, it can be used for
data modeling, which includes building and training machine
learning models, and creating data visualizations.
Data Analysis
8
• Data analysis: The final stage of the data processing pipeline
is data analysis, which includes querying, reporting, and
visualizing the data to gain insights and make data-driven
decisions.
Data Governance
9
• Data governance is an ongoing process that covers the data
life cycle, and it starts at the data collection phase and
continues throughout the entire pipeline.
• It includes data quality, data lineage, data privacy, data
security, data archiving, and data cataloging.
Note
10
• It's worth noting that these stages may not be strictly
sequential and can be executed in parallel, and the specific
stages may vary depending on the specific application and
the requirements of the organization.
• Additionally, the pipeline may include different tools,
technologies and frameworks at each stage and the pipeline
can be iterated to improve the quality of the data and the
accuracy of the models.
Thanks
for
Watching
11
Google's Data
Processing
Pipeline Products
1. Ingest
2. Store
3. Process and Analyze
4. Explore and Visualize
Thanks
for
Watching
Google Cloud
Data Product
Decision Tree
1
2
Cloud Storage
1
Cloud Storage
Cloud Storage is a managed service for storing unstructured
data.
Store any amount of data and retrieve it as often as you like.
2
Features
Automatic storage class transitions
Continental-scale and SLA-backed replication
Fast and flexible transfer services
Default and configurable data security
Leading analytics and ML/AI tools
Object lifecycle management
Object Versioning
Retention policies
Object holds
3
Features
Customer-managed encryption keys
Customer-supplied encryption keys
Uniform bucket-level access
Requester pays
Bucket Lock
Pub/Sub notifications for Cloud Storage
Cloud Audit Logs with Cloud Storage
Object- and bucket-level permissions
4
Storage Options
Storage Class | Use Cases | Minimum Duration
Standard Storage | Storage for data that is frequently accessed ("hot" data) and/or stored for only brief periods of time, including websites, streaming videos, and mobile apps. | None
Nearline Storage | Low-cost, highly durable storage service for storing infrequently accessed data. | 30 days
Coldline Storage | A very low-cost, highly durable storage service for storing infrequently accessed data. | 90 days
Archive Storage | The lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. | 365 days
5
Common Use Cases
Backup and Archives
Use Cloud Storage for backup, archives, and recovery
Cloud Storage's nearline storage provides fast, low-cost, highly durable storage for data accessed less than once a month,
reducing the cost of backups and archives while still retaining immediate access.
Backup data in Cloud Storage can be used for more than just recovery because all storage classes have ms latency and are
accessed through a single API.
Media content storage and delivery
Store data to stream audio or video
Stream audio or video directly to apps or websites with Cloud Storage's geo-redundant capabilities.
Geo-redundant storage with the highest level of availability and performance is ideal for low latency, high-QPS content
serving to users distributed across geographic regions.
Data lakes and big data analytics
Create an integrated repository for analytics
Develop and deploy an app or service in a space that provides collaboration and version control for your code.
Cloud Storage offers high availability and performance while being strongly consistent, giving you confidence and accuracy
in analytics workloads.
6
Common Use Cases
Machine learning and AI
Plug into world class machine learning and AI tools
Once your data is stored in Cloud Storage, take advantage of our options for training deep learning and machine learning
models cost-effectively.
Host a website
Hosting a static website with Cloud Storage
If you have a web app that needs to serve static content or user-uploaded static media, using Cloud Storage can be a cost-
effective and efficient way to host and serve this content, while reducing the amount of dynamic requests to your web
app.
7
Automatic storage class transitions
With features like Object Lifecycle Management (OLM) and
Autoclass you can easily optimize costs with object placement
across storage classes.
You can enable, at the bucket level, policy-based automatic
object movement to colder storage classes based on the last
access time.
There are no early deletion or retrieval fees, nor class transition
charges for object access in colder storage classes.
8
Continental-scale and SLA backed replication
Industry leading dual-region buckets support an expansive
number of regions.
A single, continental-scale bucket offers nine regions across
three continents, providing a Recovery Time Objective (RTO) of
zero.
In the event of an outage, applications seamlessly access the
data in the alternate region.
There is no failover and failback process.
For organizations requiring ultra availability, turbo replication
with dual-region buckets offers a 15 minute Recovery Point
Objective (RPO) SLA.
9
Fast and flexible transfer services
Storage Transfer Service offers a highly performant, online
pathway to Cloud Storage—both with the scalability and speed
you need to simplify the data transfer process.
For offline data transfer our Transfer Appliance is a shippable
storage server that sits in your datacenter and then ships to an
ingest location where the data is uploaded to Cloud Storage.
10
Default and configurable data security
Cloud Storage offers secure-by-design features to protect your
data and advanced controls and capabilities to keep your data
private and secure against leaks or compromises.
Security features include access control policies, data
encryption, retention policies, retention policy locks, and signed
URLs.
11
Leading analytics and ML/AI tools
Once your data is stored in Cloud Storage, easily plug into
Google Cloud’s powerful tools to create your data warehouse
with BigQuery, run open-source analytics with Dataproc, or build
and deploy machine learning (ML) models with Vertex AI.
12
Object lifecycle management
Define conditions that trigger data deletion or transition to a
cheaper storage class.
13
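As an illustration of defining such lifecycle conditions programmatically, here is a minimal sketch using the google-cloud-storage Python client; the bucket name is hypothetical, and it assumes the library is installed and credentials are configured.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")  # hypothetical bucket name

# Move objects older than 30 days to Coldline, and delete them after 365 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # push the updated lifecycle configuration to Cloud Storage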
Object Versioning
Continue to store old copies of objects when they are deleted or
overwritten
14
Retention policies
Define minimum retention periods that objects must be stored
for before they’re deletable.
15
Object holds
Place a hold on an object to prevent its deletion.
16
Customer-managed encryption keys
Encrypt object data with encryption keys stored by the Cloud
Key Management Service and managed by you.
17
Customer-supplied encryption keys
Encrypt object data with encryption keys created and managed
by you.
18
Uniform bucket-level access
Uniformly control access to your Cloud Storage resources by
disabling object ACLs.
19
Requester pays
Require accessors of your data to include a project ID to bill for
network charges, operation charges, and retrieval fees.
20
Bucket Lock
Bucket Lock allows you to configure a data retention policy for a
Cloud Storage bucket that governs how long objects in the
bucket must be retained.
21
Pub/Sub notifications for Cloud Storage
Send notifications to Pub/Sub when objects are created,
updated, or deleted.
22
Object- and bucket-level permissions
Cloud Identity and Access Management (IAM) allows you to
control who has access to your buckets and objects.
23
Thanks
for
Watching 24
Migration to Google Cloud:
Transferring your large datasets
1
2
Where you're moving data from | Scenario | Suggested products
Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud | Any | Storage Transfer Service
Cloud Storage to Cloud Storage (two different buckets) | Any | Storage Transfer Service
Your private data center to Google Cloud | Enough bandwidth to meet your project deadline, for less than 1 TB of data | gsutil
Your private data center to Google Cloud | Enough bandwidth to meet your project deadline, for more than 1 TB of data | Storage Transfer Service for on-premises data
Your private data center to Google Cloud | Not enough bandwidth to meet your project deadline | Transfer Appliance
Products
•Storage Transfer Service
•gsutil
•Transfer Appliance
3
Storage Transfer Service
•Move or back up data to a Cloud Storage bucket either from
other cloud storage providers or from a local or cloud POSIX
file system.
•Move data from one Cloud Storage bucket to another, so that
it is available to different groups of users or applications.
•Move data from Cloud Storage to a local or cloud file system.
•Move data between file systems.
•Periodically move data as part of a data processing pipeline
or analytical workflow.
4
Storage Transfer Service - Options
•Schedule one-time transfer operations or recurring
transfer operations.
•Delete existing objects in the destination bucket if they
don't have a corresponding object in the source.
•Delete data source objects after transferring them.
•Schedule periodic synchronization from a data source
to a data sink with advanced filters based on file
creation dates, filenames, and the times of day you
prefer to import data.
5
gsutil - 1
• The gsutil tool is the standard tool for small- to medium-sized transfers (less
than 1 TB) over a typical enterprise-scale network, from a private data center
or from another cloud provider to Google Cloud.
• It's also available by default when you install the Google Cloud CLI.
• It's a reliable tool that provides all the basic features you need to manage
your Cloud Storage instances, including copying your data to and from the
local file system and Cloud Storage.
• It can also move and rename objects and perform real-time incremental
syncs, like rsync, to a Cloud Storage bucket.
6
gsutil is especially useful
• Your transfers need to be executed on an as-needed basis, or during
command-line sessions by your users.
• You're transferring only a few files or very large files, or both.
• You're consuming the output of a program (streaming output to Cloud
Storage).
• You need to watch a directory with a moderate number of files and sync any
updates with very low latencies.
7
Transfer Appliance
•Transfer Appliance is a high-
capacity storage device that
enables you to transfer and
securely ship your data to a
Google upload facility, where
we upload your data to Cloud
Storage
8
Transfer Appliance - How it works
1. Request an appliance
2. Upload your data
3. Ship the appliance back
4. Google uploads the data
5. Transfer is complete
9
10
Transfer Appliance weights and capacities
11
Transfer Calculator
Thanks
for
Watching
Google Cloud SQL
● Google Cloud SQL is a fully-managed database service
that makes it easy to set up, maintain, manage, and
administer your relational databases on Google Cloud
Platform.
● It is based on the MySQL and PostgreSQL database
engines and provides a number of features to help you
manage your databases with ease, including:
● Easy setup: You can set up a new Cloud SQL instance in
just a few clicks using the Google Cloud Console, the
gcloud command-line tool, or the Cloud SQL API.
● Automatic patches and updates: Cloud SQL automatically
applies patches and updates to your database, so you
don't have to worry about maintenance or downtime.
● High availability: Cloud SQL provides built-in high
availability, with automatic failover and replication to
ensure that your database is always available.
● Scalability: You can easily scale your Cloud SQL instances
up or down to meet the changing needs of your
application.
● Security: Cloud SQL provides a number of security
features to help protect your data, including encryption at
rest, network isolation, and integration with Google Cloud's
identity and access management (IAM) system.
● Monitoring and diagnostics: Cloud SQL provides detailed
monitoring and diagnostics information to help you
troubleshoot issues with your database.
● Integration with other Google Cloud services: Cloud SQL
integrates seamlessly with other Google Cloud services,
such as Google Kubernetes Engine, Cloud Functions, and
Cloud Run, making it easy to build and deploy applications
on Google Cloud Platform.
● Cloud SQL supports MySQL, PostgreSQL and SQL Server
databases. You can choose the database engine that best
fits your needs and get all the features and benefits of that
engine, along with the added benefits of being fully
managed on Google Cloud Platform.
● Cloud SQL provides multiple pricing options to fit your
needs and budget. You can choose between on-demand
pricing, which charges you based on the resources you
use, or committed use pricing, which provides discounted
rates in exchange for a commitment to use a certain
amount of resources over a one or three year period.
● Cloud SQL provides a number of tools and features to help
you manage your databases and optimize performance.
These include a web-based SQL client, the ability to import
and export data, support for connection pooling and load
balancing, and the ability to scale your instances up or
down as needed.
● Cloud SQL integrates with other Google Cloud services,
such as Cloud Functions and Cloud Run, making it easy to
build and deploy cloud-native applications. You can also
use Cloud SQL with popular open-source tools such as
MySQL Workbench and PostgreSQL clients, or connect to
it using standard MySQL and PostgreSQL drivers.
● Cloud SQL provides a number of security features to help
protect your data, including encryption at rest, network
isolation, and integration with Google Cloud's identity and
access management (IAM) system. You can also use
Cloud SQL with Cloud Security Command Center to
monitor and manage your database security.
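A minimal sketch of connecting with the standard MySQL drivers mentioned above, using the PyMySQL driver from Python; it assumes the Cloud SQL Auth proxy is listening locally on 127.0.0.1:3306 (or that the instance's IP is otherwise reachable), and the user, password, database, and table names are hypothetical.

import pymysql

# Connect through the Cloud SQL Auth proxy listening on localhost
# (the host would be the instance's IP address if connecting directly).
conn = pymysql.connect(
    host="127.0.0.1",
    port=3306,
    user="app_user",          # hypothetical user
    password="app_password",  # use Secret Manager or env vars in real code
    database="inventory",     # hypothetical database
)

try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM products LIMIT 5")  # hypothetical table
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()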
Key Terms
● Instance: A Cloud SQL instance is a container for your
databases. It has a specific configuration and can host one
or more databases.
● Database: A database is a collection of data that is
organized in a specific way, making it easy to access,
update, and query. Cloud SQL supports several database
engines, including MySQL and PostgreSQL.
Key Terms
● Region: A region is a geographic area where Google Cloud
Platform resources are located. When you create a Cloud
SQL instance, you can choose which region it should be
located in.
● High availability: Cloud SQL instances can be configured
for high availability, which means that they are designed to
remain available even if there is a hardware failure or other
issue.
Key Terms
● Backup and recovery: Cloud SQL provides automatic and
on-demand backups of your database, as well as tools for
recovering from a disaster or data loss.
● Security: Cloud SQL takes security seriously, with features
such as encryption at rest, network isolation, and user
authentication.
Key Terms
● Monitoring and debugging: Cloud SQL provides monitoring
and debugging tools to help you track the performance of
your database and troubleshoot any issues that may arise.
● Scalability: Cloud SQL allows you to scale your database
up or down as needed, so you can handle changes in
demand without having to worry about capacity planning.
Pricing
● Google Cloud SQL charges for usage based on the type
and number of resources you consume, such as the
number of instances, the size of the instances, and the
amount of data stored.
● Here are some of the factors that can affect the cost of
Cloud SQL:
Pricing
● Instance type: Cloud SQL offers several instance types,
each with a different combination of CPU, memory, and
storage. The type of instance you choose will affect the
price.
● Instance size: The size of a Cloud SQL instance is
determined by the amount of CPU, memory, and storage it
has. You can choose from a range of sizes, and the cost
will depend on the size you choose.
Pricing
● Data storage: Cloud SQL charges for the amount of data
stored in your database, as well as for any additional
storage you may need.
● Network egress: Cloud SQL charges for the data that is
transferred out of a region. If you have a lot of data
transfer, it could increase your costs.
Pricing
● High availability: If you configure your Cloud SQL instance
for high availability, it will incur additional costs.
● To get an estimate of the cost of using Cloud SQL, you can
use the Google Cloud Pricing Calculator. This tool allows
you to specify your usage patterns and get an estimate of
the cost based on your specific needs.
Use Cases
● Web and mobile applications: Cloud SQL is well-suited for
powering the back-end of web and mobile applications. It
can handle high levels of concurrency and offers fast
response times, making it ideal for applications with a lot of
users.
● Microservices: Cloud SQL can be used to store data for
microservices-based architectures. It offers fast response
times and can be easily integrated with other Google Cloud
Platform services.
Use Cases
● E-commerce: Cloud SQL can be used to store and
manage data for e-commerce applications, including
customer information, order history, and inventory data.
● Internet of Things (IoT): Cloud SQL can be used to store
and process data from IoT devices, allowing you to
analyze and gain insights from the data.
Use Cases
● Gaming: Cloud SQL can be used to store and manage
data for online gaming applications, including player
profiles, game progress, and leaderboards.
Cloud SQL for MySQL
● Fully managed MySQL Community Edition databases in
the cloud.
● Custom machine types with up to 624 GB of RAM and 96
CPUs.
● Up to 64 TB of storage available, with the ability to
automatically increase storage size as needed.
● Create and manage instances in the Google Cloud
console.
Cloud SQL for MySQL
● Instances available in the Americas, EU, Asia, and
Australia.
● Supports migration from source databases to Cloud SQL
destination databases using Database Migration Service
(DMS).
● Customer data encrypted on Google's internal networks
and in database tables, temporary files, and backups.
● Support for secure external connections with the Cloud
SQL Auth proxy or with the SSL/TLS protocol.
Cloud SQL for MySQL
● Support for private IP (private services access).
● Data replication between multiple zones with automatic
failover.
● Import and export databases using mysqldump, or import
and export CSV files.
● Support for MySQL wire protocol and standard MySQL
connectors.
● Automated and on-demand backups and point-in-time
recovery.
Cloud SQL for MySQL
● Instance cloning.
● Integration with Google Cloud's operations suite logging
and monitoring.
● ISO/IEC 27001 compliant.
Unsupported MySQL features
● Federated Engine
● Memory Storage Engine
● The following feature is unsupported for MySQL for Cloud
SQL 5.6 and 5.7:
● The SUPER privilege
● Because Cloud SQL is a managed service, it restricts
access to certain system procedures and tables that
require advanced privileges.
Unsupported MySQL features
● The following features are unsupported for MySQL for
Cloud SQL 8.0:
● FIPS mode
● Resource groups
Unsupported plugins
● InnoDB memcached plugin
● X plugin
● Clone plugin
● InnoDB data-at-rest encryption
● validate_password component
Unsupported statements
● LOAD DATA INFILE
● SELECT ... INTO OUTFILE
● SELECT ... INTO DUMPFILE
● INSTALL PLUGIN …
● UNINSTALL PLUGIN
● CREATE FUNCTION ... SONAME …
Thanks
For
Watching
Google Cloud Spanner
About
● Google Cloud Spanner is a fully managed, horizontally scalable, cloud-native
database service that offers globally consistent, high-performance
transactions, and strong consistency across all rows, tables, and indexes. It is
designed to handle the most demanding workloads and provides the ability to
scale up or down as needed.
● Cloud Spanner is well-suited for applications that require high availability,
strong consistency, and high performance, such as financial systems,
e-commerce platforms, and real-time analytics.
Key Features
● Global distribution: Cloud Spanner allows you to replicate your data across
multiple regions, ensuring low latency and high availability for your
applications.
● Strong consistency: Cloud Spanner provides strong consistency across all
rows, tables, and indexes, allowing you to always read the latest data.
● High performance: Cloud Spanner is designed to handle the most demanding
workloads, with the ability to scale up or down as needed.
Key Features
● Fully managed: Cloud Spanner is fully managed by Google, meaning you
don't have to worry about hardware, software, or infrastructure.
● SQL support: Cloud Spanner supports a standard SQL API, making it easy to
integrate with existing applications and tools.
● Integration with other Google Cloud services: Cloud Spanner integrates with
other Google Cloud services, such as BigQuery and Cloud Functions,
allowing you to build scalable and powerful applications.
Additional Details
● Data modeling: Cloud Spanner uses a traditional relational database model,
with tables, rows, and columns. It supports the standard SQL data types, such
as INT64, FLOAT64, BOOL, and STRING. You can also use Cloud Spanner's
data definition language (DDL) to create and modify tables, indexes, and
other database objects.
● Indexing: Cloud Spanner supports both primary keys and secondary indexes,
allowing you to query and filter your data efficiently. You can create unique
and non-unique indexes, as well as composite indexes that cover multiple
columns.
Additional Details
● Transactions: Cloud Spanner supports transactions, allowing you to execute
multiple SQL statements as a single unit of work. Transactions provide ACID
(atomicity, consistency, isolation, and durability) guarantees, ensuring that
your data is always consistent and accurate.
● Replication: Cloud Spanner uses a distributed architecture to replicate your
data across multiple regions, providing high availability and low latency for
your applications. You can choose how many replicas you want for each
region, based on your performance and availability requirements.
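For illustration, a minimal sketch of the transaction support described above using the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical and credentials are assumed to be configured.

from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")    # hypothetical instance ID
database = instance.database("my-database")  # hypothetical database ID

def transfer(transaction):
    # Both DML statements commit atomically as a single transaction.
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - 30 WHERE AccountId = 1"
    )
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 30 WHERE AccountId = 2"
    )

database.run_in_transaction(transfer)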
Additional Details
● Security: Cloud Spanner follows best practices for data security and privacy,
including encryption of data at rest and in transit, access controls, and
auditing. It also integrates with Google Cloud's Identity and Access
Management (IAM) service, allowing you to set fine-grained permissions for
your users and applications.
How does google cloud spanner work ?
● You create a Cloud Spanner database and define your schema, including
tables, columns, and indexes.
● You can then load data into your Cloud Spanner database using SQL
INSERT, UPDATE, and DELETE statements, or using one of the available
import tools, such as Cloud Data Fusion or Cloud Dataproc.
● Cloud Spanner stores your data in a distributed data storage system called
Colossus, which is designed to scale horizontally across multiple servers and
regions. Colossus uses a combination of hard disks and solid-state drives
(SSDs) to store your data, with data replicated across multiple nodes for high
availability and low latency.
How does google cloud spanner work ?
● When you execute a SQL query or a transaction on your Cloud Spanner
database, the query or transaction is routed to the appropriate node based on
the data being accessed. Cloud Spanner uses a distributed lock manager to
ensure that transactions are executed in the correct order and to prevent
conflicts between concurrent transactions.
● Cloud Spanner automatically manages the underlying infrastructure and
software, including hardware provisioning, data replication, backup and
recovery, and security. You don't have to worry about these tasks, and you
can focus on building your applications..
Pricing
● Nodes: The number of nodes that you use determines the amount of
read/write throughput and storage capacity that your database has. You can
choose from two types of nodes:
● - Standard nodes: These nodes provide a good balance between cost and
performance, and are suitable for most workloads.
● - Memory nodes: These nodes offer higher read/write throughput and storage
capacity, but are more expensive than standard nodes.
Pricing
● Storage: The amount of storage that you use is based on the size of your
data, including indexes and backups. You can choose from two types of
storage:
● - SSD storage: This type of storage is suitable for most workloads and offers
good performance at a lower cost.
● - HDD storage: This type of storage is less expensive than SSD storage, but
offers slower performance.
Pricing
● Read/write operations: The number of read/write operations that you perform
is based on the number of queries and updates that you make to your
database. Read/write operations are charged per million operations.
● In addition to these components, Google Cloud Spanner also charges for
additional services such as data replication and backup storage. You can use
the Google Cloud Pricing Calculator to estimate the cost of using Google
Cloud Spanner for your specific workload.
● It's worth noting that Google Cloud Spanner offers a number of pricing
discounts and commitments, such as sustained use discounts and custom
usage commitments, which can help you save money on your Cloud Spanner
usage.
Use Cases
● Online transaction processing (OLTP) applications: Cloud Spanner is
well-suited for applications that require low-latency read/write access to a
large number of records, such as e-commerce platforms, financial systems,
and customer relationship management (CRM) systems.
● Analytics and reporting: Cloud Spanner can be used to store and analyze
large amounts of data in real-time, making it suitable for applications such as
business intelligence, data warehousing, and data lakes.
Use Cases
● Internet of Things (IoT) applications: Cloud Spanner can handle the large
volume of data generated by IoT devices, making it suitable for applications
such as smart cities, connected cars, and industrial IoT.
● Mobile and web applications: Cloud Spanner can support the high read/write
throughput and availability requirements of mobile and web applications,
making it suitable for applications such as social networks, gaming, and
content management systems.
Use Cases
● Hybrid and multi-cloud applications: Cloud Spanner can support hybrid and
multi-cloud architectures, making it suitable for applications that require data
to be accessed and modified from multiple locations.
● Microservices and distributed systems: Cloud Spanner can support the high
availability and consistency requirements of microservices and distributed
systems, making it suitable for applications such as distributed databases,
distributed caches, and event-driven architectures.
How does google cloud spanner provide high availability &
scalability ?
● High availability: Spanner is designed to provide 99.999% uptime, which
means that it is able to operate with minimal downtime. It achieves this
through a combination of techniques such as distributed data storage,
replication, and failover.
● Scalability: Spanner is able to scale horizontally, which means that you can
easily add more capacity to your database by adding more machines. It also
has automatic sharding, which means that it can automatically distribute your
data across multiple machines as your data grows.
How does google cloud spanner provide high availability &
scalability ?
● Consistency: Spanner uses a technology called "TrueTime" to provide strong
consistency guarantees across all of its replicas, which means that you can
be confident that all replicas of your data will be consistent with each other at
all times.
How does google cloud spanner provide global
consistency ?
● Google Cloud Spanner provides global consistency through the use of a
technology called "TrueTime." TrueTime is a distributed global clock that
provides a consistent view of time across all of the machines in a Spanner
cluster.
● TrueTime works by using a combination of atomic clocks, GPS receivers, and
network time protocol (NTP) servers to provide a highly accurate and
consistent view of time. It allows Spanner to provide strong consistency
guarantees across all of its replicas, which means that you can be confident
that all replicas of your data will be consistent with each other at all times.
How does google cloud spanner provide global
consistency ?
● TrueTime is used by Spanner to provide a consistent view of time for
operations such as transactions and reads. For example, if you execute a
transaction that involves multiple reads and writes, Spanner will use TrueTime
to ensure that the reads and writes are all executed in the correct order, even
if they are distributed across different machines. This helps to ensure that
your data remains consistent and correct, even in the face of network delays
and other potential issues.
Thanks
For
Watching
Dataflow
1
Dataflow
Serverless, fast, and cost-effective data-
processing service
Stream and batch data
Automatic Infrastructure provisioning
Automatic Scaling as your data grows
2
Dataflow
Real-time data comes from many different sources, but
capturing, processing, and analyzing it is
not easy because it's usually not in the
desired format for your downstream
systems
3
Dataflow
Read the data from the source ->
transform -> write it back into a sink
4
Dataflow
Portable
Processing pipeline created using open
source Apache Beam libraries in the
language of your choice
Dataflow job
Processing on worker virtual machines
5
Dataflow
Run Dataflow jobs using the Cloud Console
UI, the gcloud CLI, or the APIs
Prebuilt or Custom templates
Write SQL statements to develop pipelines
right from BigQuery UI or use AI Platform
Notebooks
6
Dataflow
Data encrypted at rest
In Transit with an option to use customer-
managed encryption keys
Use private IPs and VPC service controls to
secure the environment
7
Dataflow
Dataflow is a great choice for use cases
such as real-time AI, data warehousing or
stream analytics
8
Thanks
for
Watching
9
Google Dataflow
1
Definition
2
● A fully managed service for executing Apache Beam pipelines within the
Google Cloud ecosystem
● Google Cloud Dataflow was announced in June 2014 and released to the
general public as an open beta in April 2015
Features
● NoOps and Serverless
● Handles infrastructure setup
● Handles maintenance
● Built on Google infrastructure
● Reliable auto scaling
● Meet data pipeline demands
3
Dataflow vs. Dataproc
4
Dataflow, Dataproc comparison
5
| Dataflow | Dataproc
Recommended for | New data processing pipelines, unified batch and streaming | Existing Hadoop/Spark applications, machine learning/data science ecosystem, large-batch jobs, preemptible VMs
Fully-managed | Yes | No
Auto-scaling | Yes, transform-by-transform (adaptive) | Yes, based on cluster utilization (reactive)
Expertise | Apache Beam | Hadoop, Hive, Pig, Apache Big Data ecosystem, Spark, Flink, Presto, Druid
Apache Beam = Batch + strEAM
6
Dataflow pipeline = Directed Acyclic Graph
7
What is a PCollection ?
8
● In Apache Beam, a PCollection (short for "Parallel Collection") is an immutable data
set that is distributed across a set of workers for parallel processing.
● It represents a distributed dataset that can be processed in parallel using the
Apache Beam programming model.
● A PCollection can be created from an external source, such as a text file or a
database, or it can be created as the output of a Beam transform, such as a map or
filter operation.
● PCollections can be transformed and combined with other PCollections using
operations like map, filter, and join.
● Once a pipeline has been defined, the data in a PCollection can be processed by
executing the pipeline using a runner, such as the Google Cloud Dataflow, Apache
Flink, or Apache Spark runners.
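A minimal sketch of building PCollections with the Apache Beam Python SDK; it runs on the local DirectRunner by default, and the sample words are made up.

import apache_beam as beam

# Runs locally on the DirectRunner by default; pass pipeline options to target Dataflow.
with beam.Pipeline() as pipeline:
    words = pipeline | "CreateWords" >> beam.Create(["storage", "bigquery", "dataflow"])
    lengths = (
        words
        | "ToLength" >> beam.Map(lambda w: (w, len(w)))    # element-wise transform
        | "KeepLong" >> beam.Filter(lambda kv: kv[1] > 7)  # keep only the longer words
    )
    lengths | "Print" >> beam.Map(print)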
What is a PTransform ?
● In Apache Beam, a PTransform (short for "Parallel Transform") is a
fundamental building block for constructing data processing pipelines.
● It represents a computation that takes one or more PCollections as input,
performs a set of operations on the data, and produces one or more output
PCollections.
● PTransforms can be either pre-defined (e.g., Map, Filter, GroupByKey) or
user-defined.
● Pre-defined PTransforms are provided by the Apache Beam SDK and can be
used to perform common data processing tasks, such as mapping, filtering,
and grouping data. User-defined PTransforms allow you to implement custom
logic for your data processing needs.
9
What is a PTransform ?
● PTransforms are applied to PCollections using the apply() method, which takes
one or more PCollections as input and returns one or more output
PCollections.
● For example, the following code applies a Map PTransform to a PCollection
words to produce a new PCollection lengths:
● lengths = words | beam.Map(lambda x: len(x))
● In this example, the Map PTransform takes as input a PCollection words and
applies the lambda function lambda x: len(x) to each element in the collection,
producing a new PCollection lengths that contains the lengths of the words in
words.
10
ParDo = Parallel Do = Parallel Execution [Transform]
11
GroupByKey Transform
12
Takes a keyed collection of elements and produces a collection where each element consists of a
key and all values associated with that key.
GroupByKey Transform Output
13
GroupByKey explicitly shuffles key-value pairs
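A small sketch of GroupByKey in the Beam Python SDK, using made-up key-value pairs.

import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | beam.Create([("cat", 1), ("dog", 5), ("cat", 9), ("dog", 2)])
    # Groups all values under each key: ("cat", [1, 9]) and ("dog", [5, 2]).
    grouped = pairs | beam.GroupByKey()
    grouped | beam.Map(print)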
CoGroupByKey Transform
14
● Aggregates all input elements by their key and allows downstream processing to consume all
values associated with the key.
● While GroupByKey performs this operation over a single input collection and thus a single type
of input values, CoGroupByKey operates over multiple input collections.
CoGroupByKey Transform Output
15
CoGroupByKey joins two or more key-value pairs
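A small sketch of CoGroupByKey in the Beam Python SDK joining two keyed collections; the email and phone values are made up.

import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("amy", "amy@example.com"), ("carl", "carl@example.com")])
    phones = p | "Phones" >> beam.Create([("amy", "111-222"), ("james", "333-444")])

    # Each output element is (key, {'emails': [...], 'phones': [...]}).
    joined = {"emails": emails, "phones": phones} | beam.CoGroupByKey()
    joined | beam.Map(print)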
CombinePerKey Transform
16
Combines all elements for each key in a collection.
CombineGlobally Transform
17
Combines all elements in a collection.
CombineGlobally Transform Output
18
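A small sketch contrasting CombinePerKey and CombineGlobally in the Beam Python SDK, on made-up sales figures.

import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([("us", 10), ("eu", 4), ("us", 6), ("eu", 1)])

    per_key = sales | beam.CombinePerKey(sum)                   # ("us", 16), ("eu", 5)
    total = sales | beam.Values() | beam.CombineGlobally(sum)   # 21

    per_key | "PrintPerKey" >> beam.Map(print)
    total | "PrintTotal" >> beam.Map(print)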
Flatten Transform
19
● Merges multiple PCollection objects into a single logical PCollection.
● A transform for PCollection objects that store the same data type.
Partition Transform
20
● Separates elements in a collection into multiple output collections.
● The partitioning function contains the logic that determines how to separate
the elements of the input collection into each resulting partition output
collection.
● The number of partitions must be determined at graph construction time. You
cannot determine the number of partitions in mid-pipeline
Partition Transform
21
Partition Transform Output
22
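A small sketch of Partition in the Beam Python SDK; the two-way split and the threshold of 100 are arbitrary choices for illustration.

import apache_beam as beam

def by_size(order_value, num_partitions):
    # Partition 0 = small orders, partition 1 = large orders.
    return 1 if order_value >= 100 else 0

with beam.Pipeline() as p:
    orders = p | beam.Create([20, 150, 75, 300])
    # The number of partitions (2) is fixed at graph construction time.
    small, large = orders | beam.Partition(by_size, 2)
    small | "PrintSmall" >> beam.Map(lambda x: print("small:", x))
    large | "PrintLarge" >> beam.Map(lambda x: print("large:", x))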
DoFn
● The DoFn object that you pass to ParDo contains the processing logic that
gets applied to the elements in the input collection.
● executed with ParDo
● exposed to the context (timestamp, window pane, etc)
● can consume side inputs
● can produce multiple outputs or no outputs at all
● can produce side outputs
● can use Beam's persistent state APIs
● dynamically typed
23
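A sketch of a user-defined DoFn executed with ParDo in the Beam Python SDK, showing the multiple-outputs capability listed above via tagged outputs; the input line and output tag names are made up.

import apache_beam as beam

class SplitWords(beam.DoFn):
    def process(self, element):
        # A DoFn's process() can yield zero, one, or many output elements.
        for word in element.split():
            if word.isupper():
                # Emit acronyms to a separate, tagged output.
                yield beam.pvalue.TaggedOutput("acronyms", word)
            else:
                yield word

with beam.Pipeline() as p:
    lines = p | beam.Create(["GCP dataflow runs BEAM pipelines"])
    results = lines | beam.ParDo(SplitWords()).with_outputs("acronyms", main="words")
    results.words | "PrintWords" >> beam.Map(print)
    results.acronyms | "PrintAcronyms" >> beam.Map(print)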
Dataflow templates
● Dataflow templates allow you to package a Dataflow pipeline for deployment.
● Anyone with the correct permissions can then use the template to deploy the
packaged pipeline.
● You can create your own custom Dataflow templates, and Google provides
pre-built templates for common scenarios.
Flex templates, which are newer and recommended
Classic templates
24
Google Provided pre-built templates
Streaming
● Pub/Sub to BigQuery
● Pub/Sub to Cloud Storage
● Datastream to BigQuery
● Pub/Sub to MongoDB
Batch
● BigQuery to Cloud Storage
● Bigtable to Cloud Storage
● Cloud Storage to BigQuery
● Cloud Spanner to Cloud Storage
Utility
● Bulk compression of Cloud Storage files
● Firestore bulk delete
● File format conversion
25
Windows and Windowing Function
● Tumbling windows (called fixed windows in Apache Beam)
● Hopping windows (called sliding windows in Apache Beam)
● Sessions
26
Tumbling windows
27
Hopping windows
A hopping window represents a consistent time interval in the data stream.
Hopping windows can overlap, whereas tumbling windows are disjoint.
28
Sessions Windows
A session window contains elements within a gap duration of another element.
The gap duration is an interval between new data in a data stream. If data arrives
after the gap duration, the data is assigned to a new window.
29
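A sketch of applying these window types with the Beam Python SDK's WindowInto transform; the timestamps, keys, and window sizes are made up, and the commented alternatives show the hopping and session variants.

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    # Attach event-time timestamps (in seconds) to made-up (user, click) pairs.
    events = (
        p
        | beam.Create([("user1", 1, 10), ("user1", 1, 70), ("user2", 1, 12)])
        | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
    )

    # Tumbling (fixed) 1-minute windows; swap in window.SlidingWindows(60, 15)
    # for hopping windows, or window.Sessions(600) for a 10-minute gap duration.
    counts = (
        events
        | beam.WindowInto(window.FixedWindows(60))
        | beam.CombinePerKey(sum)
    )
    counts | beam.Map(print)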
Watermarks
● A watermark is a threshold that indicates when Dataflow expects all of the
data in a window to have arrived.
● If new data arrives with a timestamp that's in the window but older than the
watermark, the data is considered late data.
● Dataflow tracks watermarks because of the following:
● Data is not guaranteed to arrive in time order or at predictable intervals.
● Data events are not guaranteed to appear in pipelines in the same order that
they were generated.
● The data source determines the watermark.
● You can allow late data with the Apache Beam SDK.
● Dataflow SQL does not process late data.
30
Triggers
● Triggers determine when to emit aggregated results as data arrives. By
default, results are emitted when the watermark passes the end of the
window.
● You can use the Apache Beam SDK to create or modify triggers for each
collection in a streaming pipeline. You cannot set triggers with Dataflow SQL.
● Types of triggers
○ Event time: as indicated by the timestamp on each data element.
○ Processing time: which is the time that the data element is processed at any given stage in the
pipeline.
○ The number of data elements in a collection.
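A hedged Beam Python sketch of a composite trigger; the durations and the allowed lateness are assumptions chosen for illustration:

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (AccumulationMode, AfterCount,
                                             AfterProcessingTime, AfterWatermark)

# Emit an early (speculative) result every 30 seconds of processing time,
# the main result when the watermark passes the end of each 60-second window,
# and one late result per late element, allowing 10 minutes of lateness.
windowed_and_triggered = beam.WindowInto(
    window.FixedWindows(60),
    trigger=AfterWatermark(early=AfterProcessingTime(30), late=AfterCount(1)),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=600)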
31
Side Inputs
A side input is an additional input that your DoFn can access each time it processes an
element in the input PCollection
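For example (an illustrative Beam Python sketch), the mean word length below is computed once and passed to every call of the filter function as a side input:

import apache_beam as beam

with beam.Pipeline() as p:
    words = p | beam.Create(["a", "bb", "ccc", "dddd"])
    mean_len = (words
                | beam.Map(len)
                | beam.combiners.Mean.Globally())
    # AsSingleton turns the one-element collection into a value the function can read.
    longer_than_avg = words | beam.Filter(
        lambda word, avg: len(word) > avg,
        avg=beam.pvalue.AsSingleton(mean_len))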
32
Run Cloud Dataflow Pipelines
● Locally - lets you test and debug your Apache Beam pipeline
● Dataflow - a managed data processing service for running Apache Beam
pipelines
33
Cloud Dataflow Managed Service
34
Security and permissions for pipelines
● The Dataflow service account - The Dataflow service uses the Dataflow
service account as part of the job creation request, such as to check project
quota and to create worker instances on your behalf, and during job execution
to manage the job. This account is also known as the Dataflow service agent.
● The worker service account - Worker instances use the worker service
account to access input and output resources after you submit your job. By
default, workers use your project's Compute Engine default service account
as the worker service account.
35
Worker Service Account Roles
● For the worker service account to be able to create, run, and examine a job, it
must have the following roles:
○ roles/dataflow.admin
○ roles/dataflow.worker
36
Additional Roles for accessed service
● You need to grant the required roles to your Dataflow project's worker service
account so that it can access the resources while running the Dataflow job
● If your job writes to BigQuery, your service account must also have at least
the roles/bigquery.dataEditor role.
● Other Services
○ Cloud Storage buckets
○ BigQuery datasets
○ Pub/Sub topics and subscriptions
○ Firestore datasets
37
Cloud Dataflow Service Account
● Automatically created
● Manages job resources
● Assumes the Cloud Dataflow Service Agent role
● Has read/write access to project resources
38
Required permissions that the caller must have
39
Dataflow Built-in Roles
● Dataflow Admin - (roles/dataflow.admin)
● Dataflow Developer - (roles/dataflow.developer)
● Dataflow Viewer - (roles/dataflow.viewer)
● Dataflow Worker - (roles/dataflow.worker)
40
Integrations
41
Typical Workflow
42
High availability and geographic redundancy
43
Example CI/CD Pipeline
44
Thanks
For
Watching
45
Cloud Dataproc
1
Cloud Dataproc
Dataproc is a managed service for running open source software (OSS)
jobs for big data processing, including ETL and machine
learning
Out-of-the-box support for the most popular open-source
software
You can use Dataproc to migrate your on-premises OSS clusters to
the cloud
Maximizing efficiency and enabling scale
Use it with Cloud AI Notebook or BigQuery to build an end-to-end
data science environment
You can launch an IT governed, auto-scaling cluster in just 90
seconds 2
Cloud Dataproc
It manages the cluster creation, monitoring, and job
orchestration for you
Web UI, Cloud SDK, REST APIs, or with SSH access
You can submit jobs in your opensource framework of choice
Scale your cluster up or down at any time
Even when jobs are running
Pay for what you use down to the second
3
Thanks
for
Watching 4
Google Dataproc
1
Dataproc
2
● Dataproc is a fully managed and highly scalable service for running Apache
Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and
frameworks.
● Use Dataproc for data lake modernization, ETL, and secure data science, at
scale, integrated with Google Cloud, at a fraction of the cost.
Where does it stand in Data Pipeline
3
Benefits
● Open: Run open source data analytics at scale, with enterprise grade security
● Flexible: Use serverless, or manage clusters on Google Compute and
Kubernetes
● Intelligent: Enable data users through integrations with Vertex AI, BigQuery,
and Dataplex
● Secure: Configure advanced security such as Kerberos, Apache Ranger and
Personal Authentication
● Cost-effective: Realize 54% lower TCO compared to on-prem data lakes with
per-second pricing
4
Key features - 1
● Fully managed and automated big data open source software
● Containerize Apache Spark jobs with Kubernetes
● Enterprise security integrated with Google Cloud
● The best of open source with the best of Google Cloud
● Serverless Spark
● Resizable clusters
● Autoscaling clusters
● Cloud integrated
● Versioning
● Cluster scheduled deletion
● Automatic or manual configuration
5
Key features - 2
● Developer tools
● Initialization actions
● Optional components
● Custom containers and images
● Flexible virtual machines
● Component Gateway and notebook access
● Workflow templates
● Automated policy management
● Smart alerts
● Dataproc metastore
6
Fully managed and automated big data open source
software
● Serverless deployment, logging, and monitoring let you focus on your data
and analytics, not on your infrastructure.
● Reduce TCO of Apache Spark management by up to 54%.
● Enable data scientists and engineers to build and train models 5X faster,
compared to traditional notebooks, through integration with Vertex AI
Workbench.
● The Dataproc Jobs API makes it easy to incorporate big data processing into
custom applications, while Dataproc Metastore eliminates the need to run
your own Hive metastore or catalog service.
7
Containerize Apache Spark jobs with Kubernetes
● Build your Apache Spark jobs using Dataproc on Kubernetes so you can use
Dataproc with Google Kubernetes Engine (GKE) to provide job portability and
isolation.
How Dataproc on GKE works
● Dataproc on GKE deploys Dataproc virtual clusters on a GKE cluster.
● Unlike Dataproc on Compute Engine clusters, Dataproc on GKE virtual
clusters do not include separate master and worker VMs.
● Instead, when you create a Dataproc on GKE virtual cluster, Dataproc on
GKE creates node pools within a GKE cluster.
● The node pools and scheduling of pods on the node pools are managed by
GKE.
8
Enterprise security integrated with Google Cloud
● When you create a Dataproc cluster, you can enable Hadoop Secure Mode
via Kerberos by adding a Security Configuration.
● Additionally, some of the most commonly used Google Cloud-specific security
features used with Dataproc include default at-rest encryption, OS Login, VPC
Service Controls, and customer-managed encryption keys (CMEK).
9
Best Open source with the best of Google Cloud
● Dataproc lets you take the open source tools, algorithms, and programming
languages that you use today, but makes it easy to apply them on cloud-scale
datasets.
● At the same time, Dataproc has out-of-the-box integration with the rest of the
Google Cloud analytics, database, and AI ecosystem.
● Data scientists and engineers can quickly access data and build data
applications connecting Dataproc to BigQuery, Vertex AI, Cloud Spanner,
Pub/Sub, or Data Fusion.
10
Serverless Spark
● Deploy Spark applications and pipelines that autoscale without any manual
infrastructure provisioning or tuning.
● Spark is integrated with BigQuery, Vertex AI, and Dataplex, so you can write
and run it from these interfaces in two clicks, without custom integrations, for
ETL, data exploration, analysis, and ML.
11
Resizable clusters
● Create and scale clusters quickly with various virtual machine types, disk
sizes, number of nodes, and networking options.
● After creating a Dataproc cluster, you can adjust ("scale") the cluster by
increasing or decreasing the number of primary or secondary worker nodes
(horizontal scaling) in the cluster.
● You can scale a Dataproc cluster at any time, even when jobs are running on
the cluster.
● You cannot change the machine type of an existing cluster (vertical scaling).
● To vertically scale, create a cluster using a supported machine type, then
migrate jobs to the new cluster.
12
Autoscaling clusters
● Dataproc autoscaling provides a mechanism for automating cluster resource
management and enables automatic addition and subtraction of cluster
workers (nodes).
● The Dataproc AutoscalingPolicies API provides a mechanism for automating
cluster resource management and enables cluster worker VM autoscaling.
● An Autoscaling Policy is a reusable configuration that describes how cluster
workers using the autoscaling policy should scale.
● It defines scaling boundaries, frequency, and aggressiveness to provide
fine-grained control over cluster resources throughout cluster lifetime.
13
When to use autoscaling
● on clusters that store data in external services, such as Cloud Storage or
BigQuery
● on clusters that process many jobs
● to scale up single-job clusters
● with Enhanced Flexibility Mode for Spark batch jobs
14
When NOT to use autoscaling - 1
● HDFS
○ HDFS utilization is not a signal for autoscaling.
○ HDFS data is only hosted on primary workers.
○ The number of primary workers must be sufficient to host all HDFS data.
○ Decommissioning HDFS DataNodes can delay the removal of workers.
15
When NOT to use autoscaling - 2
● Autoscaling does not support YARN Node Labels, nor the property
dataproc:am.primary_only
● Autoscaling does not support Spark Structured Streaming
● Autoscaling is not recommended for the purpose of scaling a cluster down to
minimum size when the cluster is idle.
● When small and large jobs run on a cluster, graceful decommissioning
scale-down will wait for large jobs to finish.
16
Single node clusters
● This single node acts as the master and worker for your Dataproc cluster.
● Trying out new versions of Spark and Hadoop or other open source
components
● Building proof-of-concept (PoC) demonstrations
● Lightweight data science
● Small-scale non-critical data processing
● Education related to the Spark and Hadoop ecosystem
17
Limitations
● Single node clusters are not recommended for large-scale parallel data
processing.
● Single node clusters are not available with high-availability since there is only
one node in the cluster.
● Single node clusters cannot use preemptible VMs.
18
High Availability Mode
● When creating a Dataproc cluster, you can put the cluster into Hadoop High
Availability (HA) mode by specifying the number of master instances in the
cluster.
● The number of masters can only be specified at cluster creation time.
○ 1 master (default, non HA)
○ 3 masters (Hadoop HA)
19
Cloud integrated
● Built-in integration with Cloud Storage, BigQuery, Dataplex, Vertex AI,
Composer, Cloud Bigtable, Cloud Logging, and Cloud Monitoring, giving you
a more complete and robust data platform.
● Dataproc has built-in integration with other Google Cloud Platform services,
such as BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud
Monitoring, so you have more than just a Spark or Hadoop cluster—you have
a complete data platform.
● For example, you can use Dataproc to effortlessly ETL terabytes of raw log
data directly into BigQuery for business reporting.
20
Versioning
● Image versioning allows you to switch between different versions of Apache
Spark, Apache Hadoop, and other tools.
● Dataproc uses images to tie together useful Google Cloud Platform
connectors and Apache Spark & Apache Hadoop components into one
package that can be deployed on a Dataproc cluster.
● These images contain the base operating system (Debian or Ubuntu) for the
cluster, along with core and optional components needed to run jobs, such as
Spark, Hadoop, and Hive.
● These images will be upgraded periodically to include new improvements and
features.
● Dataproc versioning allows you to select sets of software versions when you
create clusters.
21
Cluster scheduled deletion
● To help avoid incurring Google Cloud charges for an inactive cluster, use
Dataproc's Cluster Scheduled Deletion feature when you create a cluster.
● This feature provides options to delete a cluster:
○ after a specified cluster idle period
○ at a specified future time
○ after a specified period that starts from the time of submission of the cluster creation request
22
Automatic or manual configuration
● Dataproc automatically configures hardware and software but also gives you
manual control.
● The open source components installed on Dataproc clusters contain many
configuration files.
● For example, Apache Spark and Apache Hadoop have several XML and plain
text configuration files.
● You can use the --properties flag of the gcloud dataproc clusters create
command to modify many common configuration files when creating a cluster.
23
Developer tools
● Multiple ways to manage a cluster, including an easy-to-use web UI, the
Cloud SDK, RESTful APIs, and SSH access.
● Integrate with APIs using Client Libraries for Java, Python, Node.js, Ruby, Go,
.NET, and PHP
● Script or interact with cloud resources at scale using the Google Cloud CLI
● Accelerate local development with emulators for Pub/Sub, Spanner, Bigtable,
and Datastore
24
Initialization actions - 1
● Run initialization actions to install or customize the settings and libraries you
need when your cluster is created.
25
Initialization actions - 2
● Initialization actions run as the root user; you do not need to use sudo
● Initialization actions are executed on each node during cluster creation
● Use absolute paths in initialization actions
● Use a shebang line in initialization actions to indicate how the script should be
interpreted (such as #!/bin/bash or #!/usr/bin/python); a minimal sketch follows below
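A minimal sketch of what a Python initialization action could look like; the installed package is only an example and the script assumes pip is available on the cluster image:

#!/usr/bin/python
# Hypothetical Dataproc initialization action: runs as root on every node
# while the cluster is being created.
import subprocess

# Install an extra Python package on each node (the package is illustrative).
subprocess.check_call(["pip", "install", "requests==2.31.0"])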
26
Optional components
● Use optional components to install and configure additional components on
the cluster.
● Optional components are integrated with Dataproc components and offer fully
configured environments for Zeppelin, Presto, and other open source
software components related to the Apache Hadoop and Apache Spark
ecosystem.
27
Optional Components - Example
28
Custom containers and images
● Dataproc serverless Spark can be provisioned with custom Docker containers.
● Dataproc clusters can be provisioned with a custom image that includes your
pre-installed Linux operating system packages.
29
Flexible virtual machines
● Clusters can use custom machine types and preemptible virtual machines to make
them the perfect size for your needs.
● Dataproc clusters are built on Compute Engine instances.
● Machine types define the virtualized hardware resources available to an instance.
● Compute Engine offers both predefined machine types and custom machine types.
● Dataproc clusters can use both predefined and custom types for both master and/or
worker nodes.
● In addition to using standard Compute Engine VMs as Dataproc workers (called
"primary" workers), Dataproc clusters can use secondary workers.
● There are three types of secondary workers: spot VMs, standard preemptible VMs,
and non-preemptible VMs.
● If you specify secondary workers for your cluster, they must be the same type.
● The default Dataproc secondary worker type is the standard preemptible VM.
30
Component Gateway and notebook access
● Dataproc Component Gateway enables secure, one-click access to Dataproc
default and optional component web interfaces running on the cluster.
● Open source components included with Google Dataproc clusters, such as
Apache Hadoop and Apache Spark, provide web interfaces.
● These interfaces can be used to manage and monitor cluster resources and
facilities, such as the YARN resource manager, the Hadoop Distributed File
System (HDFS), MapReduce, and Spark.
● Component Gateway provides secure access to web endpoints for Dataproc
default and optional components.
● Clusters created with Dataproc image version 1.3.29 and later can enable
access to component web interfaces without relying on SSH tunnels or
modifying firewall rules to allow inbound traffic.
31
Workflow templates
● Dataproc workflow templates provide a flexible and easy-to-use mechanism
for managing and executing workflows.
● A workflow template is a reusable workflow configuration that defines a graph
of jobs with information on where to run those jobs.
● A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs
on a cluster.
● Workflows are ideal for complex job flows. You can create job dependencies
so that a job starts only after its dependencies complete successfully.
32
Automated policy management
● Standardize security, cost, and infrastructure policies across a fleet of
clusters.
● You can create policies for resource management, security, or network at a
project level.
● You can also make it easy for users to use the correct images, components,
metastore, and other peripheral services, enabling you to manage your fleet
of clusters and serverless Spark policies in the future.
33
Smart alerts
● Dataproc recommended alerts allow customers to adjust the thresholds of the
pre-configured alerts to get alerted on idle clusters, runaway clusters and jobs,
overutilized clusters, and more.
● Customers can further customize these alerts and even create advanced
cluster and job management capabilities.
● These capabilities allow customers to manage their fleet at scale.
34
Dataproc metastore
● Fully managed, highly available Hive Metastore (HMS) with fine-grained
access control and integration with BigQuery metastore, Dataplex, and Data
Catalog.
● Dataproc Metastore provides you with a fully compatible Hive Metastore
(HMS), which is the established standard in the open source big data
ecosystem for managing technical metadata.
● This service helps you manage the metadata of your data lakes and provides
interoperability between the various data processing tools you're using.
35
Connectors
● BigQuery connector - enables programmatic read/write access to BigQuery
● Bigtable connector
○ Bigtable is an excellent option for Apache Spark or Hadoop workloads that require Apache
HBase.
○ Bigtable supports the Apache HBase 1.0+ APIs and offers a Bigtable HBase client in Maven,
so it is easy to use Bigtable with Dataproc.
● Cloud Storage connector - The Cloud Storage connector is an open source
Java library that lets you run Apache Hadoop or Apache Spark jobs directly on
data in Cloud Storage, and offers a number of benefits over choosing the
Hadoop Distributed File System (HDFS).
● Pub/Sub Lite - The Pub/Sub Lite Spark Connector supports Pub/Sub Lite as
an input source to Apache Spark Structured Streaming in the default
micro-batch processing and experimental continuous processing modes.
36
Dataproc on Compute Engine pricing
● Dataproc is billed by the second, and all Dataproc clusters are billed in
one-second clock-time increments, subject to a 1-minute minimum billing.
● Dataproc on Compute Engine pricing is based on the size of Dataproc
clusters and the duration of time that they run.
● The size of a cluster is based on the aggregate number of virtual CPUs
(vCPUs) across the entire cluster, including the master and worker nodes.
● The duration of a cluster is the length of time between cluster creation and
cluster stopping or deletion.
● Total Price = $0.010 * # of vCPUs * hourly duration
37
Pricing example
For a cluster with 24 vCPUs in total that runs for 2 hours:
Dataproc charge = # of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 =
$0.48
38
Dataproc on GKE pricing
● The Dataproc on GKE pricing formula, $0.010 * # of vCPUs * hourly duration,
is the same as the Dataproc on Compute Engine pricing formula, and is
applied to the aggregate number of virtual CPUs running in VMs instances in
Dataproc-created node pools in the cluster.
39
Project roles
40
Dataproc Roles
● Dataproc Admin
● Dataproc Editor
● Dataproc Viewer
● Dataproc Worker (for service accounts only)
41
IAM roles and Dataproc operations summary
42
Dataproc service accounts
● Dataproc VM service account: VMs in a Dataproc cluster use this service
account for Dataproc data plane operations, such as reading and writing data
from and to Cloud Storage and BigQuery
● Dataproc Service Agent service account: Dataproc creates this service
account with the Dataproc Service Agent role in a Dataproc user's Google
Cloud project.
43
Thanks
For
Watching
44
Cloud Pub/Sub
1
Cloud Pub/Sub
Cloud Pub/Sub is an asynchronous messaging service
Send, Receive, and Filter events or data streams
Durable Message Storage
Scalable in-order message delivery
Consistently high availability
Performance at any scale
Runs in all Google Cloud regions of the world
Serverless
Scales global data delivery auto-magically
Millions of messages per second
Data producers don't need to change anything when the
consumers of their data change 2
Cloud Pub/Sub
Services can be entirely stateless
Set up Pub/Sub between services or applications by defining
topics and then subscriptions
Services to receive the messages published on those topics
one-to-many communications
Spread your workload over multiple workers
E.g. send logs from your security system to archiving, processing,
and analytic services
Stream your data into BigQuery or Dataflow for intelligent
processing
Ideal for notifications
3
Thanks
for
Watching 4
Google Pub/Sub
1
Core Concepts
2
Core Concepts - 1
● Topic. A named resource to which messages are sent by publishers.
● Subscription. A named resource representing the stream of messages from a
single, specific topic, to be delivered to the subscribing application. For more
details about subscriptions and message delivery semantics, see the
Subscriber Guide.
● Message. The combination of data and (optional) attributes that a publisher
sends to a topic and is eventually delivered to subscribers.
● Message attribute. A key-value pair that a publisher can define for a message.
For example, key iana.org/language_tag and value en could be added to
messages to mark them as readable by an English-speaking subscriber.
3
Core Concepts - 2
● Publisher. An application that creates and sends messages to a single or
multiple topics.
● Subscriber. An application with a subscription to a single or multiple topics to
receive messages from it.
● Acknowledgment (or "ack"). A signal sent by a subscriber to Pub/Sub after it
has received a message successfully. Acknowledged messages are removed
from the subscription message queue.
● Push and pull. The two message delivery methods. A subscriber receives
messages either by Pub/Sub pushing them to the subscriber chosen
endpoint, or by the subscriber pulling them from the service.
4
Many-to-one (fan-in) and One-to-many (fan-out)
5
Many-to-many
6
Common use cases
● Ingesting user interaction and server events.
● Real-time event distribution.
● Replicating data among databases.
● Parallel processing and workflows.
● Enterprise event bus.
● Data streaming from applications, services, or IoT devices.
● Refreshing distributed caches.
● Load balancing for reliability.
7
Integrations - 1
Stream processing and data integration.
● Dataflow: Dataflow templates and SQL, which allow processing and data
integration into BigQuery and data lakes on Cloud Storage.
● Dataflow templates for moving data from Pub/Sub to Cloud Storage,
BigQuery, and other products are available in the Pub/Sub and Dataflow UIs
in the Google Cloud console.
● Integration with Apache Spark, particularly when managed with Dataproc, is
also available.
● Visual composition of integration and processing pipelines running on Spark +
Dataproc can be accomplished with Data Fusion.
8
Integrations - 2
Monitoring, Alerting and Logging.
● Supported by Monitoring and Logging products.
Authentication and IAM.
● Pub/Sub relies on a standard OAuth authentication used by other Google
Cloud products and supports granular IAM, enabling access control for
individual resources.
9
Integrations - 3
APIs.
● Pub/Sub uses standard gRPC and REST service API technologies along with
client libraries for several languages.
Triggers, notifications, and webhooks.
● Pub/Sub offers push-based delivery of messages as HTTP POST requests to
webhooks.
● You can implement workflow automation using Cloud Functions or other
serverless products.
10
Integrations - 4
Orchestration.
● Pub/Sub can be integrated into multistep serverless Workflows declaratively.
● Big data and analytics orchestration is often done with Cloud Composer, which
supports Pub/Sub triggers.
● Application Integration provides a Pub/Sub trigger to trigger or start
integrations.
11
You can filter messages by their attributes from a
subscription
● When you receive messages from a subscription with a filter, you only receive
the messages that match the filter.
● The Pub/Sub service automatically acknowledges the messages that don't
match the filter.
● You can filter messages by their attributes, but not by the data in the
message.
● You can have multiple subscriptions attached to a topic and each subscription
can have a different filter.
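A hedged sketch using the google-cloud-pubsub Python client; the project, topic, subscription, and attribute names are assumptions:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "orders")
subscription_path = subscriber.subscription_path("my-project", "orders-eu-only")

# Only messages whose 'region' attribute equals 'eu' are delivered to this
# subscription; non-matching messages are acknowledged by the service.
subscriber.create_subscription(request={
    "name": subscription_path,
    "topic": topic_path,
    "filter": 'attributes.region = "eu"',
})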
12
Types of subscriptions
● Pull subscription
● Push subscription
● BigQuery subscription
13
Pull subscription
14
Push subscription
15
BigQuery subscription
16
Delivery Types - at-least-once delivery
● By default, Pub/Sub offers at-least-once delivery with no ordering guarantees
on all subscription types.
● Alternatively, if messages have the same ordering key and are in the same
region, you can enable message ordering.
● After you set the message ordering property, the Pub/Sub service delivers
messages with the same ordering key and in the order that the Pub/Sub
service receives the messages.
17
Delivery Types - exactly-once delivery
● Pub/Sub also supports exactly-once delivery.
● In general, Pub/Sub delivers each message once and in the order in which it
was published.
● However, messages may sometimes be delivered out of order or more than
once.
● Pub/Sub might redeliver a message even after an acknowledgement request
for the message returns successfully.
● This redelivery can be caused by issues such as server-side restarts or
client-side issues.
● Thus, although rare, any message can be redelivered at any time.
● Accommodating more-than-once delivery requires your subscriber to be
idempotent when processing messages.
18
Message Retention
● Unacknowledged messages are retained for a default of 7 days (configurable by the
subscription's message_retention_duration property).
● A topic can retain published messages for a maximum of 31 days (configurable by the
topic's message_retention_duration property) even after they have been acknowledged by
all attached subscriptions.
● In cases where the topic's message_retention_duration is greater than the subscription's
message_retention_duration, Pub/Sub discards a message only when its age exceeds the
topic's message_retention_duration.
● By default, subscriptions expire after 31 days of subscriber inactivity or if there are no
updates made to the subscription.
● When you modify either the message retention duration or subscription expiration policy,
the expiration period must be set to a value greater than the message retention duration.
The default message retention duration is 7 days and the default expiration period is 31
days.
19
Exponential backoff
● Exponential backoff lets you add progressively longer delays between retry
attempts.
● After the first delivery failure, Pub/Sub waits for a minimum backoff time
before retrying.
● For each consecutive message failure, more time is added to the delay, up to
a configurable maximum delay (the minimum and maximum backoff can each be
set between 0 and 600 seconds).
● The minimum and maximum delay intervals are not fixed; configure them
based on factors local to your application (a conceptual sketch follows below).
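A conceptual Python sketch of exponential backoff with jitter; it illustrates the idea only and is not Pub/Sub's internal implementation:

import random

def backoff_delays(minimum=1.0, maximum=600.0):
    # Yields progressively longer, jittered delays between retry attempts.
    delay = minimum
    while True:
        yield min(maximum, delay * random.uniform(0.5, 1.5))
        delay *= 2

delays = backoff_delays()
for attempt in range(5):
    print(f"retry {attempt}: wait {next(delays):.1f} s")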
20
Publishing with Pub/Sub Code
21
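The code on the slide itself is not reproduced here; as a hedged sketch, publishing with the google-cloud-pubsub Python client typically looks like this (project and topic names are placeholders):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "orders")

# Message data must be bytes; attributes are optional key-value strings.
future = publisher.publish(topic_path, b"order created",
                           order_id="12345", region="eu")
print(future.result())  # blocks until the server-assigned message ID is returned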
Thanks
For
Watching
22
BigQuery
1
BigQuery
BigQuery is Google Cloud's enterprise data
warehouse
Ingest
Store
Analyze
Visualize
2
Supports Ingesting Data via Batch or
Streaming Directly
Fully-managed data warehouse
Petabyte scale
BigQuery
3
BigQuery supports a standard SQL dialect
that is ANSI compliant
Interacting with BigQuery is easy
BigQuery
4
You can use the Cloud Console UI,
BigQuery command-line tool bq, or use
the API with client libraries of your choice
BigQuery integrates with several business
intelligence tools
BigQuery
5
Simple pricing model
You pay for data storage, streaming
inserts, and querying data
Loading and exporting data are free of
charge
BigQuery
6
Storage costs are based on the amount of
data stored
For queries, you can choose to pay per
query or a flat rate for dedicated
resources
BigQuery
7
Thanks
for
Watching
8
BigQuery
Views
1
BigQuery Views
A view is a virtual table defined by a SQL query.
Query it in the same way you query a table.
When a user queries the view, the query results contain data only from
the tables and fields specified in the query that defines the view.
How to use
Query editor box in the Google Cloud console
bq command-line tool's bq query command
BigQuery REST API to programmatically call the jobs.query or
query-type jobs.insert methods
BigQuery client libraries
You can also use a view as a data source for a visualization tool such as
Google Data Studio.
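As an illustrative sketch with the BigQuery Python client library (project, dataset, and view names are placeholders), a view is a table object whose view_query is set:

from google.cloud import bigquery

client = bigquery.Client()

view = bigquery.Table("my-project.reporting.usa_names_view")
view.view_query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
"""
view = client.create_table(view)  # afterwards the view is queried like a table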
2
BigQuery View Limitation
Read-only.
No DML (insert, update, delete) queries against a view.
The dataset that contains your view and the dataset that
contains the tables referenced by the view must be in the same
location.
No exporting of data from a view.
You cannot mix standard SQL and legacy SQL queries when using
views.
You cannot reference query parameters in views. 3
BigQuery View Limitation
You cannot include a temporary user-defined function or a
temporary table in the SQL query that defines a view.
You cannot reference a view in a wildcard table query.
4
Thanks
for
Watching 5
Google BigQuery
1
Typical Workflow
2
BigQuery is Google's data warehouse solution
3
BigQuery - An Ideal Data Warehouse
● Interactive SQL queries over large datasets (petabytes) in seconds
● Serverless and no-ops, including ad hoc queries
● Ecosystem of visualization and reporting tools
● Ecosystem of ETL and data processing tools
● Up-to-the-minute data
● Machine learning
● Security and collaboration
4
Why BigQuery is different from Traditional Databases ?
5
BigQuery Data Organization
6
Row Level Security
7
Filter row data based on sensitive data
8
Authorized Views
9
Authorized View - 1 Save result to another dest table
10
Authorized View - 2 Create another Dataset
11
Authorized View - 3 Save view
12
Authorized View - 4 Add necessary permissions
13
Authorized View - 5 Assign permissions
14
Authorized View - 6 Give View access to Dataset
15
Authorized View Procedure
● Step 1: Start with your source dataset. This is the dataset with the sensitive data you don't want to share.
● Step 2: Create a separate dataset to store the view. Authorized views require the source data to sit in a separate
dataset from the view; the reason becomes clear in step 6.
● Step 3: Create a view in the new dataset. In the new dataset you create the view you intend to share with your
data analysts. This view is created using a SQL query that includes only the data the analysts need to see.
● Step 4: Assign access controls to the project. In order to query the view, your analysts need permission to run
queries; assigning them the BigQuery User role gives them this ability. This access does not give them the
ability to view or query any datasets within the project.
● Step 5: Assign access controls to the dataset containing the view. In order for your analysts to query the view, they
need to be granted the BigQuery Data Viewer role on the specific dataset that contains the view. And finally,
● Step 6: Authorize the view to access the source dataset. This gives the view itself access to the source data. This is
needed because the view takes on the permissions of the person using it, and since the analysts don't
have access to the source table, they would otherwise get an error if they tried to query this view.
16
Materialized Views
● Materialized views are precomputed views that periodically cache the results
of a query for increased performance and efficiency.
● BigQuery leverages precomputed results from materialized views and
whenever possible reads only delta changes from the base tables to compute
up-to-date results.
● Materialized views can be queried directly or can be used by the BigQuery
optimizer to process queries to the base tables.
● Queries that use materialized views are generally faster and consume fewer
resources than queries that retrieve the same data only from the base tables.
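A hedged sketch of creating a materialized view with DDL through the Python client; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW `my-project.reporting.daily_totals_mv` AS
SELECT order_date, SUM(amount) AS total_amount
FROM `my-project.sales.orders`
GROUP BY order_date
"""
client.query(ddl).result()  # BigQuery then keeps the view's results fresh incrementally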
17
Materialized Views
18
Binder1.pdf

  • 1. Google Cloud Certified Professional Data Engineer Exam Strategy, Tips and Overview 1
  • 2. About me • I have 20 years experience in IT industry with focus on Cloud, Data, ML & DevOps • I hold more than 50+ certi fi cation in all the above fi elds • I am a 2 time Google Professional Data Engineer Certi fi ed and also 1 time Google DevOps and Machine Learning Engineer Certi fi ed • LinkedIn Pro fi le 2
  • 3. 3
  • 4. 4
  • 5. Exam Cost • 200 USD listed price • 141.60 USD I paid after taxes etc. • Price same whether at home or at exam center 5
  • 6. Certification Exam Overview - 1 •50 Questions to be answered in 2 hours •No negative marks •All Multiple Choice Questions •Select 1 right answer out of 4 •Select 2 or 3 right answers out of 5 or 6 choices •144 seconds for every question 6
  • 7. Certification Exam Overview - 2 • Questions can be answered and also marked for review if you want to review later • No pen and paper allowed or provided • Cert Valid for 2 years • Can be taken onsite at exam center or remotely proctored • Pass or Fail, marks will not be shared • Exam is hard 7
  • 8. Exam Strategy • Process of elimination • Eliminate usually 2 wrong choices • Then make a decision/guess between the last 2 strong options • On my fi rst go I made a choice and marked all questions for review later • On 2nd run I read through all my answers and either con fi rmed or changed my choice 8
  • 9. Top topics (in order) • BigQuery • Data fl ow • Bigtable • Dataproc • Pub/Sub • Cloud SQL • Cloud Spanner 9 • Cloud Composer • Data Prep • Data Fusion • Cloud DLP • Pre-Trained ML AI APIS • Fundamentals of Machine Learning • Feature Engineering • Over fi tting • IAM, Access Control, Service Accounts, Users, Groups, Roles • Kafka, Hive, Pig, Hadoop
  • 10. Tips (Guesswork) • Choose Ideal Recommended Solution (Reference Architectures) • Choose Google Product over other products • Avoid Complex Convoluted Lengthy Manual Error prone Cumbersome solution • If little bit unsure or question is too time consuming, mark for review later. • After reading through all 50 questions you will fi gure out some unsure questions • If all else fail go with your fi rst guess 10
  • 11. Do Review • Sample Questions • Exam Guide 11
  • 14. Roles & Responsibilities of a Data Engineer 1
  • 15. 2 •Data Engineers are responsible for designing, building, and maintaining the infrastructure and systems that are used to store, process, and analyze large amounts of data.
  • 16. 3 •Design, build and maintain the data pipeline: Data Engineers are responsible for designing and building the data pipeline, which includes the collection, storage, processing, and transportation of data. •They also make sure that the pipeline is e ffi cient, reliable, and scalable.
  • 17. 4 •Data storage and management: Data Engineers are responsible for designing and maintaining the data storage systems, such as relational databases, NoSQL databases, and data warehouses. •They also ensure that the data is properly indexed, partitioned, and backed up.
  • 18. 5 •Data quality and integrity: Data Engineers are responsible for ensuring the quality and integrity of the data as it fl ows through the pipeline. •This includes cleaning, normalizing, and validating the data before it is stored.
  • 19. 6 •Data security: Data Engineers are responsible for implementing security measures to protect sensitive data from unauthorized access. •This includes implementing encryption, access controls, and monitoring for security breaches.
  • 20. 7 •Performance tuning and optimization: Data Engineers are responsible for monitoring the performance of the data pipeline and making adjustments to optimize its performance. •This includes identifying and resolving bottlenecks, and scaling resources as needed.
  • 21. 8 •Collaboration with other teams: Data Engineers often work closely with Data Scientists, Data Analysts and Business Intelligence teams to understand their data needs and ensure that the data pipeline is able to support their requirements.
  • 22. 9 •Keeping up with the latest technologies: Data Engineers need to keep up-to-date with the latest technologies and trends in the fi eld, such as new data storage and processing systems, big data platforms, and data governance best practices.
  • 25. Relational Databases •Relational databases: These are the most common type of data storage systems, and include popular options such as MySQL, Oracle, and Microsoft SQL Server. •Relational databases store data in tables with rows and columns, and are based on the relational model. •They are well suited for structured data and support the use of SQL for querying and manipulating data. 2
  • 26. NoSQL Databases •NoSQL databases: These are non-relational databases that are designed to handle large amounts of unstructured or semi-structured data. •Examples include MongoDB, Cassandra, and HBase. •NoSQL databases are often used for big data and real-time web applications. •They are horizontally scalable and provide high performance and availability. 3
  • 27. Data Warehouses •Data warehouses: These are specialized relational databases that are optimized for reporting and analytics. •They are designed to handle large amounts of historical data and support complex queries and aggregations. •Examples include Amazon Redshift, Google BigQuery, and Microsoft Azure SQL Data Warehouse 4
  • 28. Data Lakes •Data lakes: A data lake is a data storage architecture that allows storing raw, unstructured, and structured data at any scale. •Data lake technologies, such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, provide a centralized repository that can store all types of data. 5
  • 29. Columnar Databases •Columnar storage: Columnar file formats such as Apache Parquet and Apache ORC, and wide-column databases such as Google Bigtable, store and query large amounts of data in a columnar layout. •This format is optimized for read-intensive workloads and analytical querying. 6
  • 30. Key-value Databases •Key-value databases: Key-value databases, such as Redis and memcached, are designed to store large amounts of data in a simple key-value format, and are optimized for read-heavy workloads. •They are particularly well-suited for use cases such as caching, session management, and real-time analytics. 7
  • 31. Object Storage •Object storage: These systems are designed to store and retrieve unstructured data, such as images, videos, and audio files. •They are often used for cloud storage and archiving. •Examples include Amazon S3, Microsoft Azure Blob Storage, and OpenStack Swift. 8
  • 32. File Storage •File storage: These systems store data as files and directories, which can be organized in a hierarchical file system. •They are often used for storing large files and streaming media, and are commonly used in distributed systems like Hadoop HDFS. •Other Examples: NTFS, ext4 9
  • 33. Time-series Databases •Time-series databases: These are specialized databases that are optimized for storing and querying time-series data, such as sensor data and financial data. •Examples include InfluxDB, OpenTSDB, and TimescaleDB. 10
  • 34. Graph Databases •Graph databases: These databases are designed for storing and querying graph data, which consists of nodes and edges representing entities and relationships. •Examples include Neo4j and JanusGraph. •They are well-suited for applications that require querying complex relationships and patterns in the data, such as social networks and recommendation systems. 11
  • 37. ACID • ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. • These properties are a set of guarantees that a database system makes about the behavior of transactions. • ACID properties are important for maintaining data integrity, consistency, and availability in a database. • It ensures that the data stored in the database is accurate, consistent, and can be relied upon.
  • 38. Atomicity • This property ensures that a transaction is treated as a single, indivisible unit of work. • Either all of the changes made in a transaction are committed to the database, or none of them are. • This means that if a transaction is interrupted or fails, any changes made in that transaction will be rolled back, so that the database remains in a consistent state.
  • 39. Consistency • This property ensures that a transaction brings the database from one valid state to another valid state. • A database starts in a consistent state, and any transaction that is executed on the database should also leave the database in a consistent state.
  • 40. Isolation • This property ensures that the concurrent execution of transactions does not affect the correctness of the overall system. • Each transaction should execute as if it is the only transaction being executed, even though other transactions may be executing at the same time.
  • 41. Durability • This property ensures that once a transaction is committed, its effects will persist, even in the event of a failure (such as a power outage or a crash). • This is typically achieved by writing the changes to non-volatile storage, such as disk.
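To make these properties concrete, here is a minimal, hypothetical sketch of atomicity and rollback using Python's built-in sqlite3 module; the accounts table and the transfer amount are invented for illustration, not taken from the slides.

```python
# Atomicity sketch: either both UPDATEs commit, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            # Raising inside the block aborts the whole transaction.
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass

# Neither UPDATE is visible: the transaction was rolled back as a single unit.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```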
  • 42. BASE • BASE stands for Basically Available, Soft state, Eventually consistent. • It is a set of properties that describe the behavior of a distributed database system or a distributed data store.
  • 43. Basically Available • This property ensures that the data store is available for read and write operations, although there may be some limitations on availability due to network partitions or other failures.
  • 44. Soft state • This property acknowledges that the state of the data store may change over time, even without input. • This is due to the distributed nature of the data store and the inherent uncertainty in network communication.
  • 45. Eventually consistent • This property ensures that all nodes in the distributed data store will eventually converge to the same state, even if they do not have immediate access to the same information. • This means that it may take some time for all nodes to have the same data, but eventually, they will.
  • 46. Difference • ACID guarantees consistency and isolation for transactions, but at the cost of additional overhead and reduced scalability. • BASE prioritizes availability and scalability over consistency, which can make the behavior of the system more difficult to reason about and predict.
  • 49. OLTP - 1 • OLTP stands for Online Transaction Processing. • It is a type of database system that is optimized for handling a large number of short, transactional requests, such as inserting, updating, or retrieving data from a database. • In an OLTP system, the database is designed to handle a high number of concurrent connections and transactions, with a focus on fast, consistent response times. • The data in an OLTP system is typically stored in normalized, relational tables, which allows for efficient querying and indexing. • OLTP systems are used in a variety of applications, such as e-commerce systems, financial systems, and inventory management systems.
  • 50. OLTP - 2 • The goal of OLTP is to enable the processing of business transactions as fast as possible, with a high degree of consistency and data integrity. • OLTP systems are typically characterized by a high number of read and write operations, a large number of concurrent users, and a high volume of data. • They are also characterized by a high degree of normalization and data integrity, with strict constraints and triggers to ensure data consistency and prevent data corruption. • Overall, OLTP is designed to handle a large number of concurrent transactions and to provide fast, consistent response times. • It is an essential component of many business systems, and it is used to support a wide variety of transactions and business processes.
  • 51. OLAP - 1 • OLAP, or Online Analytical Processing, is a powerful technology that allows users to easily analyze large, complex data sets and make informed business decisions. • It is commonly used in business intelligence and decision support systems to support complex queries and analysis on large datasets. • OLAP databases are typically built on top of a relational database and use a multidimensional data model, which organizes data into a cube structure. • Each dimension in the cube represents a different aspect of the data, such as time, location, or product, and each cell in the cube contains a measure, such as sales or profit. • Users can interact with the OLAP cube using a client tool, such as Microsoft Excel, to drill down, roll up, and slice and dice the data to gain insights.
  • 52. OLAP - 2 • For example, a user could start by looking at total sales for a given time period, then drill down to see sales by region, and then by individual store. • There are three main types of OLAP systems: relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP). • ROLAP uses a relational database as the underlying data store, while MOLAP uses a specialized multidimensional data store. • HOLAP combines the benefits of both ROLAP and MOLAP by using a relational data store for the detailed data and a multidimensional data store for the summarized data. • OLAP also provides several advanced analytical capabilities such as time series analysis, forecasting, budgeting, and data mining. • In addition, many OLAP tools provide a graphical interface that makes it easy for users to interact with the data and perform advanced analysis.
  • 53. Difference • OLTP is designed to handle a high volume of short online transactions, such as inserting, updating, and retrieving data. • It is optimized for transactional consistency, and data is stored in a normalized form to minimize data redundancy. • The main goal of OLTP is to ensure data accuracy and integrity, and to make sure that the system can handle a large number of concurrent users and transactions. • OLTP is used for operational systems that handle day-to-day transactions. • OLAP is designed to handle complex, multi-dimensional analysis of data. • It is optimized for fast query performance and efficient data aggregation, and data is stored in a denormalized form to enable faster data retrieval. • The main goal of OLAP is to support business intelligence and decision-making by providing users with the ability to analyze large amounts of data from multiple dimensions and levels of detail. • OLAP is used for analytical systems that support business intelligence and decision-making.
  • 55. 4 V's of Big Data 1
  • 56. 4 V's of Big Data •Volume •Velocity •Variety •Veracity 2
  • 57. Volume •Big data is characterized by its large volume, which can range from terabytes to petabytes and even exabytes. •This large volume of data is generated from various sources, such as social media, IoT devices, and transactional systems. 3
  • 58. Velocity •Big data is characterized by its high velocity, which refers to the speed at which data is generated and collected. •This high velocity of data requires real-time processing and analysis to extract insights and make decisions. 4
  • 59. Variety •Big data is characterized by its wide variety of data types, such as structured, unstructured, and semi-structured data. •This variety of data types requires specialized tools and techniques for processing and analysis. 5
  • 60. Veracity •Big data is characterized by its uncertainty and lack of trustworthiness, which makes it difficult to validate and verify the accuracy of the data. •This requires data quality and data governance processes to ensure that the data is accurate and reliable. 6
  • 63. Vertical Scaling •Vertical scaling is the process of increasing the capacity of a single server or machine by adding more resources such as CPU, memory, or storage. •Vertical scaling is often used to improve the performance of a single server or to add capacity to a machine that has reached its limits. •The main disadvantage of vertical scaling is that it can reach a physical limit of how much resources can be added to a single machine. 2
  • 64. Horizontal Scaling •Horizontal scaling is the process of adding more machines to a network to distribute the load and increase capacity. •Horizontal scaling is often used in cloud computing and other distributed systems to handle large amounts of traffic or data. •It allows the system to handle more requests by adding more machines to the network, rather than upgrading the resources of a single machine. •Horizontal scaling is often considered more flexible and cost-effective than vertical scaling, as it allows for the easy addition or removal of machines as needed. •However, it may also require a load balancer and a way to share data between the machines. 3
  • 65. Difference •Vertical scaling: adding more resources to a single server or machine, such as CPU, memory, or storage; there is a physical limit to how much can be added to a single machine. 4 •Horizontal scaling: adding more machines to a network to distribute the load and increase capacity; it may also require a load balancer and a way to share data between the machines.
  • 68. Batch Data 2 • Batch data refers to data that is collected and processed in fixed, non-overlapping intervals, also known as batches. • Batch processing is commonly used when working with large amounts of historical data, such as data from a data warehouse. • The data is collected over a certain period of time and then processed all at once. • Batch processing is well suited for tasks that do not require real-time processing, such as generating reports, running analytics, or training machine learning models.
  • 69. Streaming Data 3 • Streaming data, on the other hand, refers to data that is generated and processed in real time, as it is being generated. • Streaming data is typically generated by various sources, such as IoT devices, social media, or financial transactions. • The data is processed as it is received, with minimal latency, and is often used to support real-time decision-making and event detection. • Examples of streaming data processing include monitoring sensor data, analyzing social media feeds, and detecting fraud in financial transactions.
  • 70. Main Difference 4 • The main difference between batch data and streaming data is the way they are processed. • Batch data is processed in fixed intervals, while streaming data is processed as it is generated. • Batch data is well suited for tasks that do not require real-time processing, while streaming data is well suited for real-time tasks such as monitoring and event detection.
  • 71. Note 5 • It's worth noting that many systems these days combine batch and streaming data processing; this is known as the Lambda architecture. • It is a way to handle both real-time and historical data in a single system, which can be useful in cases where real-time decisions need to be made based on historical data.
  • 75. Data Processing Pipeline 3 • A data processing pipeline is a series of stages or phases that data goes through from the time it is collected to the time it is used for analysis or reporting.
  • 76. Data collection 4 • The first stage of the data processing pipeline is data collection. • This includes acquiring data from various sources, such as sensors, log files, social media, and transactional systems. • Data collection may also include pre-processing, such as filtering, sampling, and transforming the data to make it suitable for further processing.
  • 77. Data Storage 5 • After the data is collected, it needs to be stored in a reliable and efficient manner. • This includes storing the data in a data warehouse, data lake, or other data storage systems, as well as indexing, partitioning, and backing up the data.
  • 78. Data Processing 6 • The next stage of the data processing pipeline is data processing, which includes cleaning, normalizing, validating, and transforming the data. • This step is critical for ensuring the quality and integrity of the data.
  • 79. Data Modeling 7 • After the data is cleaned and processed, it can be used for data modeling, which includes building and training machine learning models, and creating data visualizations.
  • 80. Data Analysis 8 • Data analysis: The final stage of the data processing pipeline is data analysis, which includes querying, reporting, and visualizing the data to gain insights and make data-driven decisions.
  • 81. Data Governance 9 • Data governance is an ongoing process that covers the data life cycle, and it starts at the data collection phase and continues throughout the entire pipeline. • It includes data quality, data lineage, data privacy, data security, data archiving, and data cataloging.
  • 82. Note 10 • It's worth noting that these stages may not be strictly sequential and can be executed in parallel, and the specific stages may vary depending on the specific application and the requirements of the organization. • Additionally, the pipeline may include different tools, technologies, and frameworks at each stage, and the pipeline can be iterated to improve the quality of the data and the accuracy of the models.
  • 85. 1. Ingest 2. Store 3. Process and Analyze 4. Explore and Visualize
  • 88. 3. Process and Analyze
  • 89. 4. Explore and visualize
  • 94. Cloud Storage Cloud Storage is a managed service for storing unstructured data. Store any amount of data and retrieve it as often as you like. 2
  • 95. Features Automatic storage class transitions Continental-scale and SLA-backed replication Fast and flexible transfer services Default and configurable data security Leading analytics and ML/AI tools Object lifecycle management Object Versioning Retention policies Object holds 3
  • 96. Features Customer-managed encryption keys Customer-supplied encryption keys Uniform bucket-level access Requester pays Bucket Lock Pub/Sub notifications for Cloud Storage Cloud Audit Logs with Cloud Storage Object- and bucket-level permissions 4
  • 97. Storage Options (Storage Class / Use Cases / Minimum Duration)
  Standard storage: storage for data that is frequently accessed ("hot" data) and/or stored for only brief periods of time, including websites, streaming videos, and mobile apps.
  Nearline Storage: low-cost, highly durable storage service for storing infrequently accessed data. 30 days.
  Coldline Storage: a very low-cost, highly durable storage service for storing infrequently accessed data. 90 days.
  Archival Storage: the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. 365 days. 5
  • 98. Common Use Cases Backup and Archives Use Cloud Storage for backup, archives, and recovery Cloud Storage's Nearline storage provides fast, low-cost, highly durable storage for data accessed less than once a month, reducing the cost of backups and archives while still retaining immediate access. Backup data in Cloud Storage can be used for more than just recovery because all storage classes have millisecond latency and are accessed through a single API. Media content storage and delivery Store data to stream audio or video Stream audio or video directly to apps or websites with Cloud Storage's geo-redundant capabilities. Geo-redundant storage with the highest level of availability and performance is ideal for low-latency, high-QPS content serving to users distributed across geographic regions. Data lakes and big data analytics Create an integrated repository for analytics Consolidate data from many sources into a single data lake on Cloud Storage. Cloud Storage offers high availability and performance while being strongly consistent, giving you confidence and accuracy in analytics workloads. 6
  • 99. Common Use Cases Machine learning and Al Plug into world class machine learning and Al tools Once your data is stored in Cloud Storage, take advantage of our options for training deep learning and machine learning models cost-effectively. Host a website Hosting a static website with Cloud Storage If you have a web app that needs to serve static content or user-uploaded static media, using Cloud Storage can be a cost- effective and efficient way to host and serve this content, while reducing the amount of dynamic requests to your web app. 7
  • 100. Automatic storage class transitions With features like Object Lifecycle Management (OLM) and Autoclass you can easily optimize costs with object placement across storage classes. You can enable, at the bucket level, policy-based automatic object movement to colder storage classes based on the last access time. There are no early deletion or retrieval fees, nor class transition charges for object access in colder storage classes. 8
  • 101. Continental-scale and SLA backed replication Industry leading dual-region buckets support an expansive number of regions. A single, continental-scale bucket offers nine regions across three continents, providing a Recovery Time Objective (RTO) of zero. In the event of an outage, applications seamlessly access the data in the alternate region. There is no failover and failback process. For organizations requiring ultra availability, turbo replication with dual-region buckets offers a 15 minute Recovery Point Objective (RPO) SLA. 9
  • 102. Fast and flexible transfer services Storage Transfer Service offers a highly performant, online pathway to Cloud Storage—both with the scalability and speed you need to simplify the data transfer process. For offline data transfer our Transfer Appliance is a shippable storage server that sits in your datacenter and then ships to an ingest location where the data is uploaded to Cloud Storage. 10
  • 103. Default and configurable data security Cloud Storage offers secure-by-design features to protect your data and advanced controls and capabilities to keep your data private and secure against leaks or compromises. Security features include access control policies, data encryption, retention policies, retention policy locks, and signed URLs. 11
  • 104. Leading analytics and ML/AI tools Once your data is stored in Cloud Storage, easily plug into Google Cloud’s powerful tools to create your data warehouse with BigQuery, run open-source analytics with Dataproc, or build and deploy machine learning (ML) models with Vertex AI. 12
  • 105. Object lifecycle management Define conditions that trigger data deletion or transition to a cheaper storage class. 13
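As a rough illustration, lifecycle rules can also be set programmatically with the google-cloud-storage Python client; the bucket name below is a placeholder and the rule ages are examples only, not values from the slides.

```python
# Sketch: configure Object Lifecycle Management rules on a bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")  # hypothetical bucket name

# Transition objects to Coldline after 90 days, then delete them after 365 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persists the updated lifecycle configuration on the bucket

for rule in bucket.lifecycle_rules:
    print(rule)
```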
  • 106. Object Versioning Continue to store old copies of objects when they are deleted or overwritten 14
  • 107. Retention policies Define minimum retention periods that objects must be stored for before they’re deletable. 15
  • 108. Object holds Place a hold on an object to prevent its deletion. 16
  • 109. Customer-managed encryption keys Encrypt object data with encryption keys stored by the Cloud Key Management Service and managed by you. 17
  • 110. Customer-supplied encryption keys Encrypt object data with encryption keys created and managed by you. 18
  • 111. Uniform bucket-level access Uniformly control access to your Cloud Storage resources by disabling object ACLs. 19
  • 112. Requester pays Require accessors of your data to include a project ID to bill for network charges, operation charges, and retrieval fees. 20
  • 113. Bucket Lock Bucket Lock allows you to configure a data retention policy for a Cloud Storage bucket that governs how long objects in the bucket must be retained. 21
  • 114. Pub/Sub notifications for Cloud Storage Send notifications to Pub/Sub when objects are created, updated, or deleted. 22
  • 115. Object- and bucket-level permissions Cloud Identity and Access Management (IAM) allows you to control who has access to your buckets and objects. 23
  • 117. Migration to Google Cloud: Transferring your large datasets 1
  • 118. 2 Where you're moving data from | Scenario | Suggested products
  Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud | | Storage Transfer Service
  Cloud Storage to Cloud Storage (two different buckets) | | Storage Transfer Service
  Your private data center to Google Cloud | Enough bandwidth to meet your project deadline for less than 1 TB of data | gsutil
  Your private data center to Google Cloud | Enough bandwidth to meet your project deadline for more than 1 TB of data | Storage Transfer Service for on-premises data
  Your private data center to Google Cloud | Not enough bandwidth to meet your project deadline | Transfer Appliance
  • 120. Storage Transfer Service •Move or backup data to a Cloud Storage bucket either from other cloud storage providers or from a local or cloud POSIX file system. •Move data from one Cloud Storage bucket to another, so that it is available to different groups of users or applications. •Move data from Cloud Storage to a local or cloud file system. •Move data between file systems. •Periodically move data as part of a data processing pipeline or analytical workflow. 4
  • 121. Storage Transfer Service - Options •Schedule one-time transfer operations or recurring transfer operations. •Delete existing objects in the destination bucket if they don't have a corresponding object in the source. •Delete data source objects after transferring them. •Schedule periodic synchronization from a data source to a data sink with advanced filters based on file creation dates, filenames, and the times of day you prefer to import data. 5
  • 122. gsutil - 1 • The gsutil tool is the standard tool for small- to medium-sized transfers (less than 1 TB) over a typical enterprise-scale network, from a private data center or from another cloud provider to Google Cloud. • It's also available by default when you install the Google Cloud CLI. • It's a reliable tool that provides all the basic features you need to manage your Cloud Storage instances, including copying your data to and from the local file system and Cloud Storage. • It can also move and rename objects and perform real-time incremental syncs, like rsync, to a Cloud Storage bucket. 6
  • 123. gsutil is especially useful • Your transfers need to be executed on an as-needed basis, or during command-line sessions by your users. • You're transferring only a few files or very large files, or both. • You're consuming the output of a program (streaming output to Cloud Storage). • You need to watch a directory with a moderate number of files and sync any updates with very low latencies. 7
  • 124. Transfer Appliance •Transfer Appliance is a high- capacity storage device that enables you to transfer and securely ship your data to a Google upload facility, where we upload your data to Cloud Storage 8
  • 125. Transfer Appliance - How it works 1. Request an appliance 2. Upload your data 3. Ship the appliance back 4. Google uploads the data 5. Transfer is complete 9
  • 126. 10 Transfer Appliance weights and capacities
  • 130. ● Google Cloud SQL is a fully-managed database service that makes it easy to set up, maintain, manage, and administer your relational databases on Google Cloud Platform. ● It is based on the MySQL and PostgreSQL database engines and provides a number of features to help you manage your databases with ease, including: ● Easy setup: You can set up a new Cloud SQL instance in just a few clicks using the Google Cloud Console, the gcloud command-line tool, or the Cloud SQL API.
  • 131. ● Automatic patches and updates: Cloud SQL automatically applies patches and updates to your database, so you don't have to worry about maintenance or downtime. ● High availability: Cloud SQL provides built-in high availability, with automatic failover and replication to ensure that your database is always available. ● Scalability: You can easily scale your Cloud SQL instances up or down to meet the changing needs of your application.
  • 132. ● Security: Cloud SQL provides a number of security features to help protect your data, including encryption at rest, network isolation, and integration with Google Cloud's identity and access management (IAM) system. ● Monitoring and diagnostics: Cloud SQL provides detailed monitoring and diagnostics information to help you troubleshoot issues with your database. ● Integration with other Google Cloud services: Cloud SQL integrates seamlessly with other Google Cloud services, such as Google Kubernetes Engine, Cloud Functions, and Cloud Run, making it easy to build and deploy applications on Google Cloud Platform.
  • 133. ● Cloud SQL supports MySQL, PostgreSQL and SQL Server databases. You can choose the database engine that best fits your needs and get all the features and benefits of that engine, along with the added benefits of being fully managed on Google Cloud Platform. ● Cloud SQL provides multiple pricing options to fit your needs and budget. You can choose between on-demand pricing, which charges you based on the resources you use, or committed use pricing, which provides discounted rates in exchange for a commitment to use a certain amount of resources over a one or three year period.
  • 134. ● Cloud SQL provides a number of tools and features to help you manage your databases and optimize performance. These include a web-based SQL client, the ability to import and export data, support for connection pooling and load balancing, and the ability to scale your instances up or down as needed. ● Cloud SQL integrates with other Google Cloud services, such as Cloud Functions and Cloud Run, making it easy to build and deploy cloud-native applications. You can also use Cloud SQL with popular open-source tools such as MySQL Workbench and PostgreSQL clients, or connect to it using standard MySQL and PostgreSQL drivers.
  • 135. ● Cloud SQL provides a number of security features to help protect your data, including encryption at rest, network isolation, and integration with Google Cloud's identity and access management (IAM) system. You can also use Cloud SQL with Cloud Security Command Center to monitor and manage your database security.
  • 136. Key Terms ● Instance: A Cloud SQL instance is a container for your databases. It has a specific configuration and can host one or more databases. ● Database: A database is a collection of data that is organized in a specific way, making it easy to access, update, and query. Cloud SQL supports several database engines, including MySQL and PostgreSQL.
  • 137. Key Terms ● Region: A region is a geographic area where Google Cloud Platform resources are located. When you create a Cloud SQL instance, you can choose which region it should be located in. ● High availability: Cloud SQL instances can be configured for high availability, which means that they are designed to remain available even if there is a hardware failure or other issue.
  • 138. Key Terms ● Backup and recovery: Cloud SQL provides automatic and on-demand backups of your database, as well as tools for recovering from a disaster or data loss. ● Security: Cloud SQL takes security seriously, with features such as encryption at rest, network isolation, and user authentication.
  • 139. Key Terms ● Monitoring and debugging: Cloud SQL provides monitoring and debugging tools to help you track the performance of your database and troubleshoot any issues that may arise. ● Scalability: Cloud SQL allows you to scale your database up or down as needed, so you can handle changes in demand without having to worry about capacity planning.
  • 140. Pricing ● Google Cloud SQL charges for usage based on the type and number of resources you consume, such as the number of instances, the size of the instances, and the amount of data stored. ● Here are some of the factors that can affect the cost of Cloud SQL:
  • 141. Pricing ● Instance type: Cloud SQL offers several instance types, each with a different combination of CPU, memory, and storage. The type of instance you choose will affect the price. ● Instance size: The size of a Cloud SQL instance is determined by the amount of CPU, memory, and storage it has. You can choose from a range of sizes, and the cost will depend on the size you choose.
  • 142. Pricing ● Data storage: Cloud SQL charges for the amount of data stored in your database, as well as for any additional storage you may need. ● Network egress: Cloud SQL charges for the data that is transferred out of a region. If you have a lot of data transfer, it could increase your costs.
  • 143. Pricing ● High availability: If you configure your Cloud SQL instance for high availability, it will incur additional costs. ● To get an estimate of the cost of using Cloud SQL, you can use the Google Cloud Pricing Calculator. This tool allows you to specify your usage patterns and get an estimate of the cost based on your specific needs.
  • 144. Use Cases ● Web and mobile applications: Cloud SQL is well-suited for powering the back-end of web and mobile applications. It can handle high levels of concurrency and offers fast response times, making it ideal for applications with a lot of users. ● Microservices: Cloud SQL can be used to store data for microservices-based architectures. It offers fast response times and can be easily integrated with other Google Cloud Platform services.
  • 145. Use Cases ● E-commerce: Cloud SQL can be used to store and manage data for e-commerce applications, including customer information, order history, and inventory data. ● Internet of Things (IoT): Cloud SQL can be used to store and process data from IoT devices, allowing you to analyze and gain insights from the data.
  • 146. Use Cases ● Gaming: Cloud SQL can be used to store and manage data for online gaming applications, including player profiles, game progress, and leaderboards.
  • 147. Cloud SQL for MySQL ● Fully managed MySQL Community Edition databases in the cloud. ● Custom machine types with up to 624 GB of RAM and 96 CPUs. ● Up to 64 TB of storage available, with the ability to automatically increase storage size as needed. ● Create and manage instances in the Google Cloud console.
  • 148. Cloud SQL for MySQL ● Instances available in the Americas, EU, Asia, and Australia. ● Supports migration from source databases to Cloud SQL destination databases using Database Migration Service (DMS). ● Customer data encrypted on Google's internal networks and in database tables, temporary files, and backups. ● Support for secure external connections with the Cloud SQL Auth proxy or with the SSL/TLS protocol.
  • 149. Cloud SQL for MySQL ● Support for private IP (private services access). ● Data replication between multiple zones with automatic failover. ● Import and export databases using mysqldump, or import and export CSV files. ● Support for MySQL wire protocol and standard MySQL connectors. ● Automated and on-demand backups and point-in-time recovery.
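Because Cloud SQL for MySQL speaks the standard MySQL wire protocol, any standard driver can connect. Below is a minimal sketch using PyMySQL with a hypothetical host, user, password, and database; in practice you would typically connect through the Cloud SQL Auth proxy or a private IP rather than hard-coding credentials.

```python
# Sketch: connect to a Cloud SQL for MySQL instance with a standard driver.
import pymysql

conn = pymysql.connect(
    host="127.0.0.1",        # e.g. localhost when tunneling through the Cloud SQL Auth proxy
    user="app_user",         # hypothetical user
    password="change-me",    # hypothetical password
    database="inventory",    # hypothetical database
)

with conn.cursor() as cur:
    cur.execute("SELECT NOW()")   # simple round-trip to verify the connection
    print(cur.fetchone())

conn.close()
```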
  • 150. Cloud SQL for MySQL ● Instance cloning. ● Integration with Google Cloud's operations suite logging and monitoring. ● ISO/IEC 27001 compliant.
  • 151. Unsupported MySQL features ● Federated Engine ● Memory Storage Engine ● The following feature is unsupported for MySQL for Cloud SQL 5.6 and 5.7: ● The SUPER privilege ● Because Cloud SQL is a managed service, it restricts access to certain system procedures and tables that require advanced privileges.
  • 152. Unsupported MySQL features ● The following features are unsupported for MySQL for Cloud SQL 8.0: ● FIPS mode ● Resource groups
  • 153. Unsupported plugins ● InnoDB memcached plugin ● X plugin ● Clone plugin ● InnoDB data-at-rest encryption ● validate_password component
  • 154. Unsupported statements ● LOAD DATA INFILE ● SELECT ... INTO OUTFILE ● SELECT ... INTO DUMPFILE ● INSTALL PLUGIN … ● UNINSTALL PLUGIN ● CREATE FUNCTION ... SONAME …
  • 157. About ● Google Cloud Spanner is a fully managed, horizontally scalable, cloud-native database service that offers globally consistent, high-performance transactions, and strong consistency across all rows, tables, and indexes. It is designed to handle the most demanding workloads and provides the ability to scale up or down as needed. ● Cloud Spanner is well-suited for applications that require high availability, strong consistency, and high performance, such as financial systems, e-commerce platforms, and real-time analytics.
  • 158. Key Features ● Global distribution: Cloud Spanner allows you to replicate your data across multiple regions, ensuring low latency and high availability for your applications. ● Strong consistency: Cloud Spanner provides strong consistency across all rows, tables, and indexes, allowing you to always read the latest data. ● High performance: Cloud Spanner is designed to handle the most demanding workloads, with the ability to scale up or down as needed.
  • 159. Key Features ● Fully managed: Cloud Spanner is fully managed by Google, meaning you don't have to worry about hardware, software, or infrastructure. ● SQL support: Cloud Spanner supports a standard SQL API, making it easy to integrate with existing applications and tools. ● Integration with other Google Cloud services: Cloud Spanner integrates with other Google Cloud services, such as BigQuery and Cloud Functions, allowing you to build scalable and powerful applications.
  • 160. Additional Details ● Data modeling: Cloud Spanner uses a traditional relational database model, with tables, rows, and columns. It supports the standard SQL data types, such as INT64, FLOAT64, BOOL, and STRING. You can also use Cloud Spanner's data definition language (DDL) to create and modify tables, indexes, and other database objects. ● Indexing: Cloud Spanner supports both primary keys and secondary indexes, allowing you to query and filter your data efficiently. You can create unique and non-unique indexes, as well as composite indexes that cover multiple columns.
  • 161. Additional Details ● Transactions: Cloud Spanner supports transactions, allowing you to execute multiple SQL statements as a single unit of work. Transactions provide ACID (atomicity, consistency, isolation, and durability) guarantees, ensuring that your data is always consistent and accurate. ● Replication: Cloud Spanner uses a distributed architecture to replicate your data across multiple regions, providing high availability and low latency for your applications. You can choose how many replicas you want for each region, based on your performance and availability requirements.
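A hedged sketch of such a read-write transaction with the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical, and everything inside the function either commits or rolls back as a single unit.

```python
# Sketch: an ACID read-write transaction in Cloud Spanner.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("demo-instance")   # hypothetical instance ID
database = instance.database("demo-db")       # hypothetical database ID

def transfer(transaction, from_id, to_id, amount):
    # All reads and writes in this function commit or roll back together.
    rows = list(transaction.execute_sql(
        "SELECT Balance FROM Accounts WHERE AccountId = @id",
        params={"id": from_id},
        param_types={"id": spanner.param_types.INT64},
    ))
    if rows[0][0] < amount:
        raise ValueError("insufficient funds")   # aborts the whole transaction
    for account_id, delta in [(from_id, -amount), (to_id, amount)]:
        transaction.execute_update(
            "UPDATE Accounts SET Balance = Balance + @delta WHERE AccountId = @id",
            params={"delta": delta, "id": account_id},
            param_types={"delta": spanner.param_types.INT64,
                         "id": spanner.param_types.INT64},
        )

database.run_in_transaction(transfer, 1, 2, 100)
```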
  • 162. Additional Details ● Security: Cloud Spanner follows best practices for data security and privacy, including encryption of data at rest and in transit, access controls, and auditing. It also integrates with Google Cloud's Identity and Access Management (IAM) service, allowing you to set fine-grained permissions for your users and applications.
  • 163. How does google cloud spanner work ? ● You create a Cloud Spanner database and define your schema, including tables, columns, and indexes. ● You can then load data into your Cloud Spanner database using SQL INSERT, UPDATE, and DELETE statements, or using one of the available import tools, such as Cloud Data Fusion or Cloud Dataproc. ● Cloud Spanner stores your data in a distributed data storage system called Colossus, which is designed to scale horizontally across multiple servers and regions. Colossus uses a combination of hard disks and solid-state drives (SSDs) to store your data, with data replicated across multiple nodes for high availability and low latency.
  • 164. How does google cloud spanner work ? ● When you execute a SQL query or a transaction on your Cloud Spanner database, the query or transaction is routed to the appropriate node based on the data being accessed. Cloud Spanner uses a distributed lock manager to ensure that transactions are executed in the correct order and to prevent conflicts between concurrent transactions. ● Cloud Spanner automatically manages the underlying infrastructure and software, including hardware provisioning, data replication, backup and recovery, and security. You don't have to worry about these tasks, and you can focus on building your applications.
  • 165. Pricing ● Compute capacity: The compute capacity you provision, as nodes or processing units (1,000 processing units equal 1 node), determines the read/write throughput and the amount of storage your instance can support, and is billed per hour.
  • 166. Pricing ● Storage: The amount of database storage that you use is billed per GB per month, based on the size of your data, including indexes. Backup storage is billed separately from database storage.
  • 167. Pricing ● Network: Cloud Spanner also charges for network egress out of a region. ● You can use the Google Cloud Pricing Calculator to estimate the cost of using Google Cloud Spanner for your specific workload. ● It's worth noting that Google Cloud Spanner offers pricing discounts, such as committed use discounts, which can help you save money on your Cloud Spanner usage.
  • 168. Use Cases ● Online transaction processing (OLTP) applications: Cloud Spanner is well-suited for applications that require low-latency read/write access to a large number of records, such as e-commerce platforms, financial systems, and customer relationship management (CRM) systems. ● Analytics and reporting: Cloud Spanner can be used to store and analyze large amounts of data in real-time, making it suitable for applications such as business intelligence, data warehousing, and data lakes.
  • 169. Use Cases ● Internet of Things (IoT) applications: Cloud Spanner can handle the large volume of data generated by IoT devices, making it suitable for applications such as smart cities, connected cars, and industrial IoT. ● Mobile and web applications: Cloud Spanner can support the high read/write throughput and availability requirements of mobile and web applications, making it suitable for applications such as social networks, gaming, and content management systems.
  • 170. Use Cases ● Hybrid and multi-cloud applications: Cloud Spanner can support hybrid and multi-cloud architectures, making it suitable for applications that require data to be accessed and modified from multiple locations. ● Microservices and distributed systems: Cloud Spanner can support the high availability and consistency requirements of microservices and distributed systems, making it suitable for applications such as distributed databases, distributed caches, and event-driven architectures.
  • 171. How does google cloud spanner provide high availability & scalability ? ● High availability: Spanner is designed to provide 99.999% uptime, which means that it is able to operate with minimal downtime. It achieves this through a combination of techniques such as distributed data storage, replication, and failover. ● Scalability: Spanner is able to scale horizontally, which means that you can easily add more capacity to your database by adding more machines. It also has automatic sharding, which means that it can automatically distribute your data across multiple machines as your data grows.
  • 172. How does google cloud spanner provide high availability & scalability ? ● Consistency: Spanner uses a technology called "TrueTime" to provide strong consistency guarantees across all of its replicas, which means that you can be confident that all replicas of your data will be consistent with each other at all times.
  • 173. How does google cloud spanner provide global consistency ? ● Google Cloud Spanner provides global consistency through the use of a technology called "TrueTime." TrueTime is a distributed global clock that provides a consistent view of time across all of the machines in a Spanner cluster. ● TrueTime works by using a combination of atomic clocks, GPS receivers, and network time protocol (NTP) servers to provide a highly accurate and consistent view of time. It allows Spanner to provide strong consistency guarantees across all of its replicas, which means that you can be confident that all replicas of your data will be consistent with each other at all times.
  • 174. How does google cloud spanner provide global consistency ? ● TrueTime is used by Spanner to provide a consistent view of time for operations such as transactions and reads. For example, if you execute a transaction that involves multiple reads and writes, Spanner will use TrueTime to ensure that the reads and writes are all executed in the correct order, even if they are distributed across different machines. This helps to ensure that your data remains consistent and correct, even in the face of network delays and other potential issues.
  • 177. Dataflow Serverless, fast, and cost-effective data- processing service Stream and batch data Automatic Infrastructure provisioning Automatic Scaling as your data grows 2
  • 178. Dataflow Real-time data comes from many different sources, but capturing, processing, and analyzing it is not easy because it's usually not in the desired format for your downstream systems 3
  • 179. Dataflow Read the data from the source -> transform -> write it back into a sink 4
  • 180. Dataflow Portable Processing pipeline created using open source Apache Beam libraries in the language of your choice Dataflow job Processing on worker virtual machines 5
  • 181. Dataflow Run Dataflow jobs using the Cloud Console UI, the gcloud CLI, or the APIs Prebuilt or Custom templates Write SQL statements to develop pipelines right from the BigQuery UI or use AI Platform Notebooks 6
  • 182. Dataflow Data encrypted at rest In Transit with an option to use customer- managed encryption keys Use private IPs and VPC service controls to secure the environment 7
  • 183. Dataflow Dataflow is a great choice for use cases such as real-time AI, data warehousing or stream analytics 8
  • 186. Definition 2 ● A fully managed service for executing Apache Beam pipelines within the Google Cloud ecosystem ● Google Cloud Dataflow was announced in June 2014 and released to the general public as an open beta in April 2015
  • 187. Features ● NoOps and Serverless ● Handles infrastructure setup ● Handles maintenance ● Built on Google infrastructure ● Reliable auto scaling ● Meet data pipeline demands 3
  • 189. Dataflow, Dataproc comparison 5
  Recommended for: Dataflow - new data processing pipelines, unified batch and streaming; Dataproc - existing Hadoop/Spark applications, machine learning/data science ecosystem, large-batch jobs, preemptible VMs
  Fully managed: Dataflow - yes; Dataproc - no
  Auto-scaling: Dataflow - yes, transform-by-transform (adaptive); Dataproc - yes, based on cluster utilization (reactive)
  Expertise: Dataflow - Apache Beam; Dataproc - Hadoop, Hive, Pig, the Apache big data ecosystem, Spark, Flink, Presto, Druid
  • 190. Apache Beam = Batch + strEAM 6
  • 191. Dataflow pipeline = Directed Acyclic Graph 7
  • 192. What is a PCollection ? 8 ● In Apache Beam, a PCollection (short for "Parallel Collection") is an immutable data set that is distributed across a set of workers for parallel processing. ● It represents a distributed dataset that can be processed in parallel using the Apache Beam programming model. ● A PCollection can be created from an external source, such as a text file or a database, or it can be created as the output of a Beam transform, such as a map or filter operation. ● PCollections can be transformed and combined with other PCollections using operations like map, filter, and join. ● Once a pipeline has been defined, the data in a PCollection can be processed by executing the pipeline using a runner, such as the Dataflow, Apache Flink, or Apache Spark runners.
  • 193. What is a PTransform ? ● In Apache Beam, a PTransform (short for "Parallel Transform") is a fundamental building block for constructing data processing pipelines. ● It represents a computation that takes one or more PCollections as input, performs a set of operations on the data, and produces one or more output PCollections. ● PTransforms can be either pre-defined (e.g., Map, Filter, GroupByKey) or user-defined. ● Pre-defined PTransforms are provided by the Apache Beam SDK and can be used to perform common data processing tasks, such as mapping, filtering, and grouping data. User-defined PTransforms allow you to implement custom logic for your data processing needs. 9
  • 194. What is a PTransform ? ● PTransforms are applied to PCollections using the apply() method, which takes one or more PCollections as input and returns one or more output PCollections. ● For example, the following code applies a Map PTransform to a PCollection words to produce a new PCollection lengths: ● lengths = words | beam.Map(lambda x: len(x)) ● In this example, the Map PTransform takes as input a PCollection words and applies the lambda function lambda x: len(x) to each element in the collection, producing a new PCollection lengths that contains the lengths of the words in words. 10
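Expanding the slide's one-liner into a self-contained pipeline that runs locally on the DirectRunner; the sample words are invented.

```python
# Sketch: a complete mini pipeline around the Map example above.
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Create" >> beam.Create(["spanner", "bigquery", "dataflow"])
    lengths = words | "Lengths" >> beam.Map(lambda w: (w, len(w)))
    lengths | "Print" >> beam.Map(print)
```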
  • 195. ParDo = Parallel Do = Parallel Execution [Transform] 11
  • 196. GroupByKey Transform 12 Takes a keyed collection of elements and produces a collection where each element consists of a key and all values associated with that key.
  • 197. GroupByKey Transform Output 13 GroupByKey explicitly shuffles key-value pairs
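A minimal GroupByKey sketch with invented sample data, showing each key emitted once with all of its values.

```python
# Sketch: group (key, value) pairs so each key appears once with its values.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("fruit", "apple"), ("veg", "carrot"), ("fruit", "pear")])
        | beam.GroupByKey()
        | beam.MapTuple(lambda k, vs: (k, sorted(vs)))  # e.g. ("fruit", ["apple", "pear"])
        | beam.Map(print)
    )
```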
  • 198. CoGroupByKey Transform 14 ● Aggregates all input elements by their key and allows downstream processing to consume all values associated with the key. ● While GroupByKey performs this operation over a single input collection and thus a single type of input values, CoGroupByKey operates over multiple input collections.
  • 199. CoGroupByKey Transform Output 15 CoGroupByKey joins two or more key-value pairs
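A minimal CoGroupByKey sketch joining two keyed PCollections; the email and phone records are invented sample data.

```python
# Sketch: join two keyed collections by key with CoGroupByKey.
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("ann", "ann@example.com")])
    phones = p | "Phones" >> beam.Create([("ann", "555-0100"), ("bob", "555-0101")])

    joined = (
        {"emails": emails, "phones": phones}
        | beam.CoGroupByKey()
        # -> ("ann", {"emails": [...], "phones": [...]}), ("bob", {...})
    )
    joined | beam.Map(print)
```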
  • 200. CombinePerKey Transform 16 Combines all elements for each key in a collection.
  • 201. CombineGlobally Transform 17 Combines all elements in a collection.
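The two combine transforms side by side in a small sketch, using sum as the combine function over invented sales figures.

```python
# Sketch: CombinePerKey aggregates per key, CombineGlobally over everything.
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([("eu", 10), ("us", 20), ("eu", 5)])

    per_key = sales | beam.CombinePerKey(sum)                   # ("eu", 15), ("us", 20)
    total = sales | beam.Values() | beam.CombineGlobally(sum)   # 35

    per_key | "PrintPerKey" >> beam.Map(print)
    total | "PrintTotal" >> beam.Map(print)
```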
  • 203. Flatten Transform 19 ● Merges multiple Collection objects into a single logical Collection. ● A transform for Collection objects that store the same data type.
  • 204. Partition Transform 20 ● Separates elements in a collection into multiple output collections. ● The partitioning function contains the logic that determines how to separate the elements of the input collection into each resulting partition output collection. ● The number of partitions must be determined at graph construction time. You cannot determine the number of partitions in mid-pipeline
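A small sketch of Partition (with Flatten merging the results back together), where the number of partitions is fixed at graph construction time as the slide notes; the parity split is an invented example.

```python
# Sketch: split a PCollection into a fixed number of partitions, then re-merge.
import apache_beam as beam

def by_parity(n, num_partitions):
    # The partitioning function returns the index of the output partition.
    return n % num_partitions   # 0 -> evens, 1 -> odds

with beam.Pipeline() as p:
    numbers = p | beam.Create(range(10))
    parts = numbers | beam.Partition(by_parity, 2)
    evens, odds = parts[0], parts[1]

    # Flatten merges PCollections of the same type back into one collection.
    merged = (evens, odds) | beam.Flatten()
    merged | beam.Map(print)
```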
  • 207. DoFn ● The DoFn object that you pass to ParDo contains the processing logic that gets applied to the elements in the input collection. ● executed with ParDo ● exposed to the context (timestamp, window pane, etc) ● can consume side inputs ● can produce multiple outputs or no outputs at all ● can produce side outputs ● can use Beam's persistent state APIs ● dynamically typed 23
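A minimal ParDo/DoFn sketch: the DoFn below holds the per-element processing logic and may emit zero, one, or many outputs per input element.

```python
# Sketch: a user-defined DoFn executed with ParDo.
import apache_beam as beam

class SplitWordsFn(beam.DoFn):
    def process(self, element):
        # One input line can yield many output words (or none at all).
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["the quick brown fox", "jumps over"])
        | beam.ParDo(SplitWordsFn())
        | beam.Map(print)
    )
```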
  • 208. Dataflow templates ● Dataflow templates allow you to package a Dataflow pipeline for deployment. ● Anyone with the correct permissions can then use the template to deploy the packaged pipeline. ● You can create your own custom Dataflow templates, and Google provides pre-built templates for common scenarios. ● There are two types: Flex templates, which are newer and recommended, and Classic templates. 24
  • 209. Google Provided pre-built templates Streaming ● Pub/Sub to BigQuery ● Pub/Sub to Cloud Storage ● Datastream to BigQuery ● Pub/Sub to MongoDB Batch ● BigQuery to Cloud Storage ● Bigtable to Cloud Storage ● Cloud Storage to BigQuery ● Cloud Spanner to Cloud Storage Utility ● Bulk compression of Cloud Storage files ● Firestore bulk delete ● File format conversion 25
  • 210. Windows and Windowing Function ● Tumbling windows (called fixed windows in Apache Beam) ● Hopping windows (called sliding windows in Apache Beam) ● Sessions 26
  • 212. Hopping windows A hopping window represents a consistent time interval in the data stream. Hopping windows can overlap, whereas tumbling windows are disjoint. 28
  • 213. Sessions Windows A session window contains elements within a gap duration of another element. The gap duration is an interval between new data in a data stream. If data arrives after the gap duration, the data is assigned to a new window. 29
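A hedged windowing sketch with invented event times: timestamps are assigned, elements are grouped into 60-second tumbling (fixed) windows, and the commented alternatives show hopping (sliding) and session windows.

```python
# Sketch: assign event timestamps, window the data, then count per key per window.
import apache_beam as beam
from apache_beam.transforms import window

events = [("login", 1), ("login", 10), ("login", 70)]  # (key, event time in seconds)

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
        | beam.WindowInto(window.FixedWindows(60))      # tumbling (fixed) windows
        # window.SlidingWindows(60, 30) -> hopping (sliding) windows
        # window.Sessions(600)          -> session windows with a 10-minute gap
        | beam.combiners.Count.PerKey()
        | beam.Map(print)   # ("login", 2) for [0, 60), ("login", 1) for [60, 120)
    )
```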
  • 214. Watermarks ● A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. ● If new data arrives with a timestamp that's in the window but older than the watermark, the data is considered late data. ● Dataflow tracks watermarks because of the following: ● Data is not guaranteed to arrive in time order or at predictable intervals. ● Data events are not guaranteed to appear in pipelines in the same order that they were generated. ● The data source determines the watermark. ● You can allow late data with the Apache Beam SDK. ● Dataflow SQL does not process late data. 30
  • 215. Triggers ● Triggers determine when to emit aggregated results as data arrives. By default, results are emitted when the watermark passes the end of the window. ● You can use the Apache Beam SDK to create or modify triggers for each collection in a streaming pipeline. You cannot set triggers with Dataflow SQL. ● Types of triggers ○ Event time: as indicated by the timestamp on each data element. ○ Processing time: which is the time that the data element is processed at any given stage in the pipeline. ○ The number of data elements in a collection. 31
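A hedged sketch of configuring a trigger with the Apache Beam Python SDK: early results on processing time, an on-time result when the watermark passes the end of the window, and a two-minute allowed lateness; the window size, delays, and sample scores are arbitrary examples.

```python
# Sketch: a windowing strategy with an early-firing trigger and allowed lateness.
import apache_beam as beam
from apache_beam.transforms import trigger, window

def with_early_firings(pcoll):
    return pcoll | beam.WindowInto(
        window.FixedWindows(60),                 # 60-second fixed windows
        trigger=trigger.AfterWatermark(          # final result at end of window...
            early=trigger.AfterProcessingTime(30)),  # ...plus early results every 30 s
        accumulation_mode=trigger.AccumulationMode.DISCARDING,
        allowed_lateness=120,                    # accept data up to 2 minutes late
    )

with beam.Pipeline() as p:
    scores = (
        p
        | beam.Create([("player1", 5), ("player1", 3)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, 0))
    )
    with_early_firings(scores) | beam.CombinePerKey(sum) | beam.Map(print)
```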
  • 216. Side Inputs A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection 32
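A minimal side-input sketch: a small lookup PCollection is passed to every element-wise call via beam.pvalue.AsDict; the currency rates and orders are invented sample data.

```python
# Sketch: broadcast a lookup table to each element-wise call as a side input.
import apache_beam as beam

with beam.Pipeline() as p:
    rates = p | "Rates" >> beam.Create([("usd", 1.0), ("eur", 1.1)])
    orders = p | "Orders" >> beam.Create([("eur", 100), ("usd", 50)])

    converted = orders | beam.Map(
        lambda order, rates: order[1] * rates[order[0]],
        rates=beam.pvalue.AsDict(rates),   # side input, materialized per window
    )
    converted | beam.Map(print)
```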
  • 217. Run Cloud Dataflow Pipelines ● Locally - which lets you test and debug your Apache Beam pipeline ● Dataflow - data processing system available for running Apache Beam pipelines 33
  • 218. Cloud Dataflow Managed Service 34
  • 219. Security and permissions for pipelines ● The Dataflow service account - The Dataflow service uses the Dataflow service account as part of the job creation request, such as to check project quota and to create worker instances on your behalf, and during job execution to manage the job. This account is also known as the Dataflow service agent. ● The worker service account - Worker instances use the worker service account to access input and output resources after you submit your job. By default, workers use your project's Compute Engine default service account as the worker service account. 35
  • 220. Worker Service Account Roles ● For the worker service account to be able to create, run, and examine a job, it must have the following roles: ○ roles/dataflow.admin ○ roles/dataflow.worker 36
  • 221. Additional Roles for accessed service ● You need to grant the required roles to your Dataflow project's worker service account so that it can access the resources while running the Dataflow job ● If your job writes to BigQuery, your service account must also have at least the roles/bigquery.dataEditor role. ● Other Services ○ Cloud Storage buckets ○ BigQuery datasets ○ Pub/Sub topics and subscriptions ○ Firestore datasets 37
  • 222. Cloud Dataflow Service Account ● It is Automatically created ● Manages job resources ● Assumes cloud dataflow service agent role ● Can read/write access to project resources 38
  • 223. Required permissions that the caller must have 39
  • 224. Dataflow Built-in Roles ● Dataflow Admin - (roles/dataflow.admin) ● Dataflow Developer - (roles/dataflow.developer) ● Dataflow Viewer - (roles/dataflow.viewer) ● Dataflow Worker - (roles/dataflow.worker) 40
  • 227. High availability and geographic redundancy 43
  • 231. Cloud Dataproc Dataproc is a managed service for running open source software (OSS) big data processing jobs, including ETL and machine learning Out-of-the-box support for the most popular open-source software You can use Dataproc to migrate your on-premises OSS clusters to the cloud Maximizing efficiency and enabling scale Use it with Cloud AI Notebook or BigQuery to build an end-to-end data science environment You can launch an IT-governed, auto-scaling cluster in just 90 seconds 2
  • 232. Cloud Dataproc It manages the cluster creation, monitoring, and job orchestration for you Web UI, Cloud SDK, REST APIs, or with SSH access You can submit jobs in your open source framework of choice Scale your cluster up or down at any time Even when jobs are running Pay for what you use down to the second 3
  • 235. Dataproc 2 ● Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. ● Use Dataproc for data lake modernization, ETL, and secure data science, at scale, integrated with Google Cloud, at a fraction of the cost.
  • 236. Where does it stand in Data Pipeline 3
• 237. Benefits ● Open: Run open source data analytics at scale, with enterprise-grade security ● Flexible: Use serverless, or manage clusters on Google Compute Engine and Kubernetes ● Intelligent: Enable data users through integrations with Vertex AI, BigQuery, and Dataplex ● Secure: Configure advanced security such as Kerberos, Apache Ranger, and Personal Cluster Authentication ● Cost-effective: Realize 54% lower TCO compared to on-prem data lakes, with per-second pricing 4
  • 238. Key features - 1 ● Fully managed and automated big data open source software ● Containerize Apache Spark jobs with Kubernetes ● Enterprise security integrated with Google Cloud ● The best of open source with the best of Google Cloud ● Serverless Spark ● Resizable clusters ● Autoscaling clusters ● Cloud integrated ● Versioning ● Cluster scheduled deletion ● Automatic or manual configuration 5
  • 239. Key features - 2 ● Developer tools ● Initialization actions ● Optional components ● Custom containers and images ● Flexible virtual machines ● Component Gateway and notebook access ● Workflow templates ● Automated policy management ● Smart alerts ● Dataproc metastore 6
  • 240. Fully managed and automated big data open source software ● Serverless deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure. ● Reduce TCO of Apache Spark management by up to 54%. ● Enable data scientists and engineers to build and train models 5X faster, compared to traditional notebooks, through integration with Vertex AI Workbench. ● The Dataproc Jobs API makes it easy to incorporate big data processing into custom applications, while Dataproc Metastore eliminates the need to run your own Hive metastore or catalog service. 7
  • 241. Containerize Apache Spark jobs with Kubernetes ● Build your Apache Spark jobs using Dataproc on Kubernetes so you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation. How Dataproc on GKE works ● Dataproc on GKE deploys Dataproc virtual clusters on a GKE cluster. ● Unlike Dataproc on Compute Engine clusters, Dataproc on GKE virtual clusters do not include separate master and worker VMs. ● Instead, when you create a Dataproc on GKE virtual cluster, Dataproc on GKE creates node pools within a GKE cluster. ● The node pools and scheduling of pods on the node pools are managed by GKE. 8
  • 242. Enterprise security integrated with Google Cloud ● When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. ● Additionally, some of the most commonly used Google Cloud-specific security features used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK). 9
  • 243. Best Open source with the best of Google Cloud ● Dataproc lets you take the open source tools, algorithms, and programming languages that you use today, but makes it easy to apply them on cloud-scale datasets. ● At the same time, Dataproc has out-of-the-box integration with the rest of the Google Cloud analytics, database, and AI ecosystem. ● Data scientists and engineers can quickly access data and build data applications connecting Dataproc to BigQuery, Vertex AI, Cloud Spanner, Pub/Sub, or Data Fusion. 10
  • 244. Serverless Spark ● Deploy Spark applications and pipelines that autoscale without any manual infrastructure provisioning or tuning. ● Spark is integrated with BigQuery, Vertex AI, and Dataplex, so you can write and run it from these interfaces in two clicks, without custom integrations, for ETL, data exploration, analysis, and ML. 11
  • 245. Resizable clusters ● Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options. ● After creating a Dataproc cluster, you can adjust ("scale") the cluster by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling) in the cluster. ● You can scale a Dataproc cluster at any time, even when jobs are running on the cluster. ● You cannot change the machine type of an existing cluster (vertical scaling). ● To vertically scale, create a cluster using a supported machine type, then migrate jobs to the new cluster. 12
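A hedged sketch of horizontal scaling with the google-cloud-dataproc Python client: the primary worker count is changed on a running cluster. The project, region, cluster name, and worker count are placeholders, and this assumes the v1 `update_cluster` API with an update mask.

```python
# Hedged sketch: scale a Dataproc cluster's primary workers.
# Project, region, cluster name, and worker count are placeholders.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = client.update_cluster(
    request={
        "project_id": "my-project",
        "region": region,
        "cluster_name": "my-cluster",
        # Only the field named in the update mask is changed.
        "cluster": {
            "cluster_name": "my-cluster",
            "config": {"worker_config": {"num_instances": 5}},
        },
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()  # block until the resize completes
```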
  • 246. Autoscaling clusters ● Dataproc autoscaling provides a mechanism for automating cluster resource management and enables automatic addition and subtraction of cluster workers (nodes). ● The Dataproc AutoscalingPolicies API provides a mechanism for automating cluster resource management and enables cluster worker VM autoscaling. ● An Autoscaling Policy is a reusable configuration that describes how cluster workers using the autoscaling policy should scale. ● It defines scaling boundaries, frequency, and aggressiveness to provide fine-grained control over cluster resources throughout cluster lifetime. 13
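A hedged sketch of creating a reusable autoscaling policy with the Dataproc Python client. Field names follow the v1 AutoscalingPolicy resource; the policy ID, project, region, and all numeric bounds are illustrative assumptions, not recommendations.

```python
# Hedged sketch: create a reusable Dataproc autoscaling policy.
# Policy ID, project, region, and numeric values are placeholders.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = dataproc_v1.AutoscalingPolicy(
    id="my-scaling-policy",
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            scale_up_factor=0.5,     # aggressiveness when scaling up
            scale_down_factor=0.5,   # aggressiveness when scaling down
            graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
        )
    ),
    # Scaling boundaries for primary workers.
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2, max_instances=20
    ),
)

client.create_autoscaling_policy(
    parent=f"projects/my-project/regions/{region}", policy=policy
)
```

The policy can then be referenced when creating clusters, so the same scaling boundaries apply across a fleet.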
  • 247. When to use autoscaling ● on clusters that store data in external services, such as Cloud Storage or BigQuery ● on clusters that process many jobs ● to scale up single-job clusters ● with Enhanced Flexibility Mode for Spark batch jobs 14
  • 248. When NOT to use autoscaling - 1 ● HDFS ○ HDFS utilization is not a signal for autoscaling. ○ HDFS data is only hosted on primary workers. ○ The number of primary workers must be sufficient to host all HDFS data. ○ Decommissioning HDFS DataNodes can delay the removal of workers. 15
  • 249. When NOT to use autoscaling - 2 ● Autoscaling does not support YARN Node Labels, nor the property dataproc:am.primary_only ● Autoscaling does not support Spark Structured Streaming ● Autoscaling is not recommended for the purpose of scaling a cluster down to minimum size when the cluster is idle. ● When small and large jobs run on a cluster, graceful decommissioning scale-down will wait for large jobs to finish. 16
• 250. Single node clusters ● A single node cluster has one node that acts as both the master and a worker for your Dataproc cluster. ● Typical use cases: ○ Trying out new versions of Spark and Hadoop or other open source components ○ Building proof-of-concept (PoC) demonstrations ○ Lightweight data science ○ Small-scale non-critical data processing ○ Education related to the Spark and Hadoop ecosystem 17
  • 251. Limitations ● Single node clusters are not recommended for large-scale parallel data processing. ● Single node clusters are not available with high-availability since there is only one node in the cluster. ● Single node clusters cannot use preemptible VMs. 18
  • 252. High Availability Mode ● When creating a Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. ● The number of masters can only be specified at cluster creation time. ○ 1 master (default, non HA) ○ 3 masters (Hadoop HA) 19
• 253. Cloud integrated ● Built-in integration with Cloud Storage, BigQuery, Dataplex, Vertex AI, Composer, Cloud Bigtable, Cloud Logging, and Cloud Monitoring gives you more than just a Spark or Hadoop cluster: you get a complete and robust data platform. ● For example, you can use Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting. 20
  • 254. Versioning ● Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools. ● Dataproc uses images to tie together useful Google Cloud Platform connectors and Apache Spark & Apache Hadoop components into one package that can be deployed on a Dataproc cluster. ● These images contain the base operating system (Debian or Ubuntu) for the cluster, along with core and optional components needed to run jobs, such as Spark, Hadoop, and Hive. ● These images will be upgraded periodically to include new improvements and features. ● Dataproc versioning allows you to select sets of software versions when you create clusters. 21
  • 255. Cluster scheduled deletion ● To help avoid incurring Google Cloud charges for an inactive cluster, use Dataproc's Cluster Scheduled Deletion feature when you create a cluster. ● This feature provides options to delete a cluster: ○ after a specified cluster idle period ○ at a specified future time ○ after a specified period that starts from the time of submission of the cluster creation request 22
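A hedged sketch of scheduled deletion with the Dataproc Python client: the cluster's lifecycle config sets an idle TTL so the cluster deletes itself after a period of inactivity. Cluster name, machine types, and the TTL value are placeholders.

```python
# Hedged sketch: create a cluster that deletes itself after 2 hours idle.
# Names, machine types, and TTL values are placeholders.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "my-ephemeral-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Scheduled deletion: remove the cluster after the idle period.
        "lifecycle_config": {
            "idle_delete_ttl": duration_pb2.Duration(seconds=7200)
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # wait for cluster creation to finish
```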
• 256. Automatic or manual configuration ● Dataproc automatically configures hardware and software but also gives you manual control. ● The open source components installed on Dataproc clusters contain many configuration files. ● For example, Apache Spark and Apache Hadoop have several XML and plain text configuration files. ● You can use the --properties flag of the gcloud dataproc clusters create command to modify many common configuration files when creating a cluster. 23
  • 257. Developer tools ● Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access. ● Integrate with APIs using Client Libraries for Java, Python, Node.js, Ruby, Go, .NET, and PHP ● Script or interact with cloud resources at scale using the Google Cloud CLI ● Accelerate local development with emulators for Pub/Sub, Spanner, Bigtable, and Datastore 24
  • 258. Initialization actions - 1 ● Run initialization actions to install or customize the settings and libraries you need when your cluster is created. 25
• 259. Initialization actions - 2 ● Initialization actions run as the root user, so you do not need to use sudo ● Initialization actions are executed on each node during cluster creation ● Use absolute paths in initialization actions ● Use a shebang line in initialization actions to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python) 26
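A hedged sketch of attaching an initialization action when creating a cluster with the Python client. The script itself is a shell script (with a shebang line) stored in Cloud Storage; the bucket path, script name, and timeout here are placeholders.

```python
# Hedged sketch: reference an initialization-action script at cluster
# creation time. Bucket path, script name, and timeout are placeholders.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "my-init-cluster",
    "config": {
        "worker_config": {"num_instances": 2},
        # Each node runs this script as root during cluster creation.
        "initialization_actions": [
            {
                "executable_file": "gs://my-bucket/scripts/install-libs.sh",
                "execution_timeout": duration_pb2.Duration(seconds=600),
            }
        ],
    },
}

client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
).result()
```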
  • 260. Optional components ● Use optional components to install and configure additional components on the cluster. ● Optional components are integrated with Dataproc components and offer fully configured environments for Zeppelin, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem. 27
  • 261. Optional Components - Example 28
• 262. Custom containers and images ● Dataproc Serverless for Spark can be provisioned with custom Docker containers. ● Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages. 29
• 263. Flexible virtual machines ● Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs. ● Dataproc clusters are built on Compute Engine instances. ● Machine types define the virtualized hardware resources available to an instance. ● Compute Engine offers both predefined machine types and custom machine types. ● Dataproc clusters can use both predefined and custom types for master and worker nodes. ● In addition to using standard Compute Engine VMs as Dataproc workers (called "primary" workers), Dataproc clusters can use secondary workers. ● There are three types of secondary workers: spot VMs, standard preemptible VMs, and non-preemptible VMs. ● If you specify secondary workers for your cluster, they must all be the same type. ● The default Dataproc secondary worker type is the standard preemptible VM. 30
  • 264. Component Gateway and notebook access ● Dataproc Component Gateway enables secure, one-click access to Dataproc default and optional component web interfaces running on the cluster. ● Open source components included with Google Dataproc clusters, such as Apache Hadoop and Apache Spark, provide web interfaces. ● These interfaces can be used to manage and monitor cluster resources and facilities, such as the YARN resource manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark. ● Component Gateway provides secure access to web endpoints for Dataproc default and optional components. ● Clusters created with Dataproc image version 1.3.29 and later can enable access to component web interfaces without relying on SSH tunnels or modifying firewall rules to allow inbound traffic. 31
  • 265. Workflow templates ● Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. ● A workflow template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs. ● A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster. ● Workflows are ideal for complex job flows. You can create job dependencies so that a job starts only after its dependencies complete successfully. 32
  • 266. Automated policy management ● Standardize security, cost, and infrastructure policies across a fleet of clusters. ● You can create policies for resource management, security, or network at a project level. ● You can also make it easy for users to use the correct images, components, metastore, and other peripheral services, enabling you to manage your fleet of clusters and serverless Spark policies in the future. 33
• 267. Smart alerts ● Dataproc recommended alerts let customers adjust the thresholds of pre-configured alerts to be notified about idle clusters, runaway clusters or jobs, overutilized clusters, and more. ● Customers can further customize these alerts and even build advanced cluster and job management capabilities on top of them. ● These capabilities allow customers to manage their fleet at scale. 34
  • 268. Dataproc metastore ● Fully managed, highly available Hive Metastore (HMS) with fine-grained access control and integration with BigQuery metastore, Dataplex, and Data Catalog. ● Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata. ● This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing tools you're using. 35
• 269. Connectors ● BigQuery connector - enables programmatic read/write access to BigQuery ● Bigtable connector ○ Bigtable is an excellent option for any Apache Spark or Hadoop use case that requires Apache HBase. ○ Bigtable supports the Apache HBase 1.0+ APIs and offers a Bigtable HBase client in Maven, so it is easy to use Bigtable with Dataproc. ● Cloud Storage connector - The Cloud Storage connector is an open source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage, and offers a number of benefits over choosing the Hadoop Distributed File System (HDFS). ● Pub/Sub Lite - The Pub/Sub Lite Spark Connector supports Pub/Sub Lite as an input source to Apache Spark Structured Streaming in the default micro-batch processing and experimental continuous processing modes. 36
  • 270. Dataproc on Compute Engine pricing ● Dataproc is billed by the second, and all Dataproc clusters are billed in one-second clock-time increments, subject to a 1-minute minimum billing. ● Dataproc on Compute Engine pricing is based on the size of Dataproc clusters and the duration of time that they run. ● The size of a cluster is based on the aggregate number of virtual CPUs (vCPUs) across the entire cluster, including the master and worker nodes. ● The duration of a cluster is the length of time between cluster creation and cluster stopping or deletion. ● Total Price = $0.010 * # of vCPUs * hourly duration 37
• 271. Pricing example For a cluster with 24 vCPUs in total that runs for 2 hours: Dataproc charge = # of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 = $0.48 38
  • 272. Dataproc on GKE pricing ● The Dataproc on GKE pricing formula, $0.010 * # of vCPUs * hourly duration, is the same as the Dataproc on Compute Engine pricing formula, and is applied to the aggregate number of virtual CPUs running in VMs instances in Dataproc-created node pools in the cluster. 39
  • 274. Dataproc Roles ● Dataproc Admin ● Dataproc Editor ● Dataproc Viewer ● Dataproc Worker (for service accounts only) 41
  • 275. IAM roles and Dataproc operations summary 42
• 276. Dataproc service accounts ● Dataproc VM service account: VMs in a Dataproc cluster use this service account for Dataproc data plane operations, such as reading and writing data from and to Cloud Storage and BigQuery ● Dataproc Service Agent service account: Dataproc creates this service account with the Dataproc Service Agent role in a Dataproc user's Google Cloud project. 43
• 279. Cloud Pub/Sub Cloud Pub/Sub is an asynchronous messaging service Send, Receive, and Filter events or data streams Durable Message Storage Scalable in-order message delivery Consistently high availability Performance at any scale Runs in all Google Cloud regions of the world Serverless Scales global data delivery automatically Millions of messages per second Data producers don't need to change anything when the consumers of their data change 2
• 280. Cloud Pub/Sub Services can be entirely stateless Set up Pub/Sub between services or applications by defining topics and then subscriptions Subscriber services receive the messages published on those topics One-to-many communications Spread your workload over multiple workers E.g. send logs from your security system to archiving, processing, and analytic services Stream your data into BigQuery or Dataflow for intelligent processing Ideal for notifications 3
  • 284. Core Concepts - 1 ● Topic. A named resource to which messages are sent by publishers. ● Subscription. A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application. For more details about subscriptions and message delivery semantics, see the Subscriber Guide. ● Message. The combination of data and (optional) attributes that a publisher sends to a topic and is eventually delivered to subscribers. ● Message attribute. A key-value pair that a publisher can define for a message. For example, key iana.org/language_tag and value en could be added to messages to mark them as readable by an English-speaking subscriber. 3
  • 285. Core Concepts - 2 ● Publisher. An application that creates and sends messages to a single or multiple topics. ● Subscriber. An application with a subscription to a single or multiple topics to receive messages from it. ● Acknowledgment (or "ack"). A signal sent by a subscriber to Pub/Sub after it has received a message successfully. Acknowledged messages are removed from the subscription message queue. ● Push and pull. The two message delivery methods. A subscriber receives messages either by Pub/Sub pushing them to the subscriber chosen endpoint, or by the subscriber pulling them from the service. 4
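A minimal sketch that ties these concepts together with the google-cloud-pubsub Python client: publish a message with an attribute, then pull and acknowledge it. The project, topic, and subscription IDs are placeholders and assume the resources already exist.

```python
# Hedged sketch: publish a message with an attribute, then pull and ack it.
# Project, topic, and subscription IDs are placeholders.
from google.cloud import pubsub_v1

project = "my-project"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project, "orders-topic")
sub_path = subscriber.subscription_path(project, "orders-sub")

# Publish: data must be bytes; keyword arguments become message attributes.
future = publisher.publish(topic_path, b"order 42 created", origin="web")
print("published message id:", future.result())

# Pull: synchronously fetch up to 10 messages, then acknowledge them so
# they are removed from the subscription's message queue.
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
ack_ids = [msg.ack_id for msg in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})
```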
  • 286. Many-to-one (fan-in) and One-to-many (fan-out) 5
• 288. Common use cases ● Ingesting user interaction and server events. ● Real-time event distribution. ● Replicating data among databases. ● Parallel processing and workflows. ● Enterprise event bus. ● Data streaming from applications, services, or IoT devices. ● Refreshing distributed caches. ● Load balancing for reliability. 7
  • 289. Integrations - 1 Stream processing and data integration. ● Dataflow: Dataflow templates and SQL, which allow processing and data integration into BigQuery and data lakes on Cloud Storage. ● Dataflow templates for moving data from Pub/Sub to Cloud Storage, BigQuery, and other products are available in the Pub/Sub and Dataflow UIs in the Google Cloud console. ● Integration with Apache Spark, particularly when managed with Dataproc is also available. ● Visual composition of integration and processing pipelines running on Spark + Dataproc can be accomplished with Data Fusion. 8
  • 290. Integrations - 2 Monitoring, Alerting and Logging. ● Supported by Monitoring and Logging products. Authentication and IAM. ● Pub/Sub relies on a standard OAuth authentication used by other Google Cloud products and supports granular IAM, enabling access control for individual resources. 9
  • 291. Integrations - 3 APIs. ● Pub/Sub uses standard gRPC and REST service API technologies along with client libraries for several languages. Triggers, notifications, and webhooks. ● Pub/Sub offers push-based delivery of messages as HTTP POST requests to webhooks. ● You can implement workflow automation using Cloud Functions or other serverless products. 10
• 292. Integrations - 4 Orchestration. ● Pub/Sub can be integrated into multistep serverless Workflows declaratively. ● Big data and analytics orchestration is often done with Cloud Composer, which supports Pub/Sub triggers. ● Application Integration provides a Pub/Sub trigger to trigger or start integrations. 11
  • 293. You can filter messages by their attributes from a subscription ● When you receive messages from a subscription with a filter, you only receive the messages that match the filter. ● The Pub/Sub service automatically acknowledges the messages that don't match the filter. ● You can filter messages by their attributes, but not by the data in the message. ● You can have multiple subscriptions attached to a topic and each subscription can have a different filter. 12
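A hedged sketch of an attribute filter set at subscription creation time (filters cannot be changed afterwards). The project, topic, subscription names, and the `region` attribute are placeholders.

```python
# Hedged sketch: create a subscription that only delivers messages whose
# "region" attribute equals "emea". Resource names are placeholders.
from google.cloud import pubsub_v1

project = "my-project"
subscriber = pubsub_v1.SubscriberClient()
topic_path = f"projects/{project}/topics/orders-topic"
sub_path = subscriber.subscription_path(project, "orders-emea-sub")

subscriber.create_subscription(
    request={
        "name": sub_path,
        "topic": topic_path,
        # Messages that do not match are acknowledged automatically by the service.
        "filter": 'attributes.region = "emea"',
    }
)
```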
  • 294. Types of subscriptions ● Pull subscription ● Push subscription ● BigQuery subscription 13
  • 298. Delivery Types - at-least-once delivery ● By default, Pub/Sub offers at-least-once delivery with no ordering guarantees on all subscription types. ● Alternatively, if messages have the same ordering key and are in the same region, you can enable message ordering. ● After you set the message ordering property, the Pub/Sub service delivers messages with the same ordering key and in the order that the Pub/Sub service receives the messages. 17
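A hedged sketch of message ordering: it must be enabled on the subscription and in the publisher options, and messages that should arrive in order share an ordering key. Resource names and the ordering key are placeholders; in practice, ordering applies to messages published in the same region.

```python
# Hedged sketch: enable ordered delivery for messages sharing an ordering key.
# Topic, subscription, and key names are placeholders.
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.types import PublisherOptions

project = "my-project"
topic_path = f"projects/{project}/topics/orders-topic"
sub_path = f"projects/{project}/subscriptions/orders-ordered-sub"

subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(
    request={"name": sub_path, "topic": topic_path, "enable_message_ordering": True}
)

publisher = pubsub_v1.PublisherClient(
    publisher_options=PublisherOptions(enable_message_ordering=True)
)
# Messages sharing an ordering key are delivered in publish order.
for i in range(3):
    publisher.publish(topic_path, f"event {i}".encode(), ordering_key="customer-123")
```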
• 299. Delivery Types - exactly-once delivery ● Pub/Sub also supports exactly-once delivery, for pull subscriptions within a cloud region. ● Without exactly-once delivery, Pub/Sub generally delivers each message once and in the order in which it was published, but messages may sometimes be delivered out of order or more than once. ● Pub/Sub might redeliver a message even after an acknowledgement request for the message returns successfully. ● This redelivery can be caused by issues such as server-side restarts or client-side issues. ● Thus, although rare, any message can be redelivered at any time. ● Accommodating more-than-once delivery requires your subscriber to be idempotent when processing messages. 18
  • 300. Message Retention ● Unacknowledged messages are retained for a default of 7 days (configurable by the subscription's message_retention_duration property). ● A topic can retain published messages for a maximum of 31 days (configurable by the topic's message_retention_duration property) even after they have been acknowledged by all attached subscriptions. ● In cases where the topic's message_retention_duration is greater than the subscription's message_retention_duration, Pub/Sub discards a message only when its age exceeds the topic's message_retention_duration. ● By default, subscriptions expire after 31 days of subscriber inactivity or if there are no updates made to the subscription. ● When you modify either the message retention duration or subscription expiration policy, the expiration period must be set to a value greater than the message retention duration. The default message retention duration is 7 days and the default expiration period is 31 days. 19
• 301. Exponential backoff ● Exponential backoff lets you add progressively longer delays between retry attempts. ● After the first delivery failure, Pub/Sub waits for a minimum backoff time before retrying. ● For each consecutive message failure, more time is added to the delay, up to a configurable maximum delay (both bounds must be between 0 and 600 seconds). ● The minimum and maximum delay intervals are not fixed and should be configured based on factors local to your application. 20
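A hedged sketch of setting those backoff bounds through a subscription retry policy with the Python client. The resource names and the 10s/300s bounds are placeholders chosen only to stay within the 0-600 second range.

```python
# Hedged sketch: configure per-subscription retry backoff bounds.
# Resource names and backoff values are placeholders.
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

project = "my-project"
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "orders-retry-sub")

subscriber.create_subscription(
    request={
        "name": sub_path,
        "topic": f"projects/{project}/topics/orders-topic",
        # Delays between redelivery attempts grow from the minimum toward
        # the maximum backoff.
        "retry_policy": pubsub_v1.types.RetryPolicy(
            minimum_backoff=duration_pb2.Duration(seconds=10),
            maximum_backoff=duration_pb2.Duration(seconds=300),
        ),
    }
)
```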
  • 305. BigQuery BigQuery is Google Cloud's enterprise data warehouse Ingest Store Analyze Visualize 2
  • 306. Supports Ingesting Data via Batch or Streaming Directly Fully-managed data warehouse Petabyte scale BigQuery 3
  • 307. BigQuery supports a standard SQL dialect that is ANSI compliant Interacting with BigQuery is easy BigQuery 4
  • 308. You can use the Cloud Console UI, BigQuery command-line tool bq, or use the API with client libraries of your choice BigQuery integrates with several business intelligence tools BigQuery 5
  • 309. Simple pricing model You pay for data storage, streaming inserts, and querying data Loading and exporting data are free of charge BigQuery 6
  • 310. Storage costs are based on the amount of data stored For queries, you can choose to pay per query or a flat rate for dedicated resources BigQuery 7
  • 313. BigQuery Views A view is a virtual table defined by a SQL query. Query it in the same way you query a table. When a user queries the view, the query results contain data only from the tables and fields specified in the query that defines the view. How to use Query editor box in the Google Cloud console bq command-line tool's bq query command BigQuery REST API to programmatically call the jobs.query or query-type jobs.insert methods BigQuery client libraries You can also use a view as a data source for a visualization tool such as Google Data Studio. 2
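A minimal sketch of defining a view with the BigQuery Python client; the same could be done with a `CREATE VIEW` statement in the query editor or with `bq`. The project, dataset, and table names are placeholders.

```python
# Hedged sketch: define a view with the BigQuery Python client.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

view = bigquery.Table("my-project.reporting.daily_orders_view")
view.view_query = """
    SELECT order_date, COUNT(*) AS order_count
    FROM `my-project.sales.orders`
    GROUP BY order_date
"""
view = client.create_table(view)  # the view can now be queried like a table
```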
  • 314. BigQuery View Limitation Read-only. No DML (insert, update, delete) queries against a view. The dataset that contains your view and the dataset that contains the tables referenced by the view must be in the same location. No exporting of data from a view. You cannot mix standard SQL and legacy SQL queries when using views. You cannot reference query parameters in views. 3
  • 315. BigQuery View Limitation You cannot include a temporary user-defined function or a temporary table in the SQL query that defines a view. You cannot reference a view in a wildcard table query. 4
  • 319. BigQuery is Google's data warehouse solution 3
  • 320. BigQuery - An Ideal Data Warehouse ● Interactive SQL queries over large datasets (petabytes) in seconds ● Serverless and no-ops, including ad hoc queries ● Ecosystem of visualization and reporting tools ● Ecosystem of ETL and data processing tools ● Up-to-the-minute data ● Machine learning ● Security and collaboration 4
• 321. Why BigQuery is different from Traditional Databases? 5
  • 324. Filter row data based on sensitive data 8
  • 326. Authorized View - 1 Save result to another dest table 10
  • 327. Authorized View - 2 Create another Dataset 11
  • 328. Authorized View - 3 Save view 12
  • 329. Authorized View - 4 Add necessary permissions 13
  • 330. Authorized View - 5 Assign permissions 14
  • 331. Authorized View - 6 Give View access to Dataset 15
• 332. Authorized View Procedure ● Step 1: Start with your source dataset. This is the dataset containing the sensitive data you don't want to share. ● Step 2: Create a separate dataset to store the view. Authorized views require the source data to sit in a separate dataset from the view (the reason becomes clear in step 6). ● Step 3: Create a view in the new dataset. This is the view you intend to share with your data analysts, defined by a SQL query that includes only the data the analysts need to see. ● Step 4: Assign access controls to the project. To query the view, your analysts need permission to run queries; assigning them the BigQuery User role gives them this ability. This access does not let them view or query any datasets within the project. ● Step 5: Assign access controls to the dataset containing the view. To query the view, the analysts must be granted the BigQuery Data Viewer role on the specific dataset that contains the view. ● Step 6: Authorize the view to access the source dataset. This gives the view itself access to the source data. It is needed because the view takes on the permissions of the person using it, and since the analysts don't have access to the source table, they would otherwise get an error when querying the view. 16
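A hedged sketch of steps 2-6 with the BigQuery Python client: the view is created in its own dataset, then authorized against the source dataset. All project, dataset, and table names are placeholders; the IAM grants in steps 4-5 are omitted.

```python
# Hedged sketch of the authorized-view steps (names are placeholders).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Step 3: create the shared view in its own dataset (created in step 2).
view = bigquery.Table("my-project.shared_views.customers_no_pii")
view.view_query = """
    SELECT customer_id, country
    FROM `my-project.private.customers`
"""
view = client.create_table(view)

# Steps 4-5 (granting roles/bigquery.user on the project and
# roles/bigquery.dataViewer on the view's dataset) are IAM grants, omitted here.

# Step 6: authorize the view against the source dataset so the view itself
# can read the sensitive table on the analysts' behalf.
source_dataset = client.get_dataset("my-project.private")
entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```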
  • 333. Materialized Views ● Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. ● BigQuery leverages precomputed results from materialized views and whenever possible reads only delta changes from the base tables to compute up-to-date results. ● Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables. ● Queries that use materialized views are generally faster and consume fewer resources than queries that retrieve the same data only from the base tables. 17
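A hedged sketch of creating a materialized view with a DDL statement run through the Python client. The names are placeholders, and the query assumes a simple aggregation over a single base table, which is the kind of query materialized views are designed for.

```python
# Hedged sketch: create a materialized view via DDL. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE MATERIALIZED VIEW `my-project.reporting.mv_daily_revenue` AS
SELECT order_date, SUM(amount) AS revenue
FROM `my-project.sales.orders`
GROUP BY order_date
"""
client.query(ddl).result()  # BigQuery maintains and refreshes the view
```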