Google Cloud Platform
Overview
By Balvinder Khurana & Sarang Shinde
Agenda
1. Overview of GCP Components
2. Storage
3. Compute
4. Processing
5. Batch Data Pipeline
6. Real Time Data Pipeline
7. Visualizations
8. Logging and Monitoring
9. Challenges and Learnings
10. Questions
Big Data Foundation
[Diagram with three columns. Data Usages: Analytics, Compliance, Operations. Data Characteristics: Volume, Velocity, Variety. Data Tech Stack / Architecture: Data Management and Storage; Data Integration and Processing; Coordination and Workflow Management.]
Storage
1. Cloud Bigtable
2. Cloud Datastore
3. Cloud SQL
4. Cloud Spanner
5. BigQuery
6. Cloud Storage/Buckets
When you need                      | Use                         | Use in GCP
Storage for compute, block storage | Persistent (hard disk), SSD | Persistent (hard disk), SSD
Storing media, blob storage        | File system - HDFS          | Cloud Storage
SQL interface atop file data       | Hive (SQL-like)             | BigQuery
Document database, NoSQL           | CouchDB, MongoDB            | Cloud Datastore
Fast scanning NoSQL                | HBase                       | Cloud Bigtable
Transaction processing (OLTP)      | RDBMS                       | Cloud SQL, Cloud Spanner
Analytics/data warehouse (OLAP)    | Hive                        | BigQuery
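The mapping above can be sketched as a small lookup, which is handy when checking a migration plan against the slide. This is only an illustration: the dictionary contents come from the slide, while the names STORAGE_CHOICES and gcp_storage_for are hypothetical helpers, not part of any GCP SDK.

```python
# Lookup built from the slide's "When you need / Use / Use in GCP" table.
# Keys are the "When you need" entries, lowercased; values are
# (non-GCP technology, GCP service) pairs.
STORAGE_CHOICES = {
    "storage for compute, block storage": ("Persistent (hard disk), SSD", "Persistent (hard disk), SSD"),
    "storing media, blob storage": ("File system - HDFS", "Cloud Storage"),
    "sql interface atop file data": ("Hive (SQL-like)", "BigQuery"),
    "document database, nosql": ("CouchDB, MongoDB", "Cloud Datastore"),
    "fast scanning nosql": ("HBase", "Cloud Bigtable"),
    "transaction processing (oltp)": ("RDBMS", "Cloud SQL, Cloud Spanner"),
    "analytics/data warehouse (olap)": ("Hive", "BigQuery"),
}


def gcp_storage_for(need: str) -> str:
    """Return the GCP service the slide suggests for a given need."""
    key = need.strip().lower()
    if key not in STORAGE_CHOICES:
        raise ValueError(f"No entry on the slide for: {need!r}")
    return STORAGE_CHOICES[key][1]


if __name__ == "__main__":
    print(gcp_storage_for("Fast scanning NoSQL"))
```

Keeping both columns in the tuple makes it easy to also answer the reverse question (which on-premises tool a GCP service replaces) without a second table.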
Typical Storage Use Cases - IoT
Processing & Compute
1. App Engine
2. Google Kubernetes Engine (GKE, formerly Container Engine)
3. Google Compute Engine (GCE)
4. Dataflow
5. Dataproc
6. Cloud Pub/Sub
7. Cloud ML
Cloud Dataflow vs Cloud Dataproc
Dataproc
Dataflow
Dataflow Features
- Automated resource management and horizontal autoscaling.
- Dynamic work balancing.
- Unified programming model for batch and streaming.
- Built-in support for fault tolerance.
- Pipeline portability with Spark, Flink, etc.
Apache Beam Model
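The "unified programming model for batch and streaming" idea can be illustrated without any Beam dependency: write the transform once, then feed it either a bounded or an unbounded source. The sketch below is plain Python and deliberately does not use the Apache Beam API; the record format ("sensor,value"), the threshold, and all function names are assumptions made for the example.

```python
import itertools
from typing import Iterable, Iterator, Tuple

THRESHOLD = 30.0  # illustrative cutoff, not from the slides


def transform(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Parse 'sensor,value' records and keep readings above THRESHOLD.

    Implemented as a generator, so it consumes its input lazily and
    works the same way on a finite list or an endless stream.
    """
    for line in lines:
        sensor, value = line.split(",")
        reading = float(value)
        if reading > THRESHOLD:
            yield sensor.strip(), reading


def endless_source() -> Iterator[str]:
    """Stand-in for an unbounded source such as a message queue."""
    for i in itertools.count():
        yield f"s{i},{i * 10}"


# Bounded (batch) input: process everything, then stop.
batch_result = list(transform(["a,10", "b,35.5", "c,42"]))

# Unbounded (streaming) input: same transform, take results as they arrive.
stream_result = list(itertools.islice(transform(endless_source()), 3))

if __name__ == "__main__":
    print(batch_result)
    print(stream_result)
```

In Beam the same separation holds at a larger scale: the pipeline logic stays the same and the choice of source and runner decides whether it executes as a batch or a streaming job.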
Composer
Typical Data Pipeline
Batch Data Pipeline
Real Time Data Pipeline
Demo
Data Fusion (Beta)
Monitoring and Logging - Stackdriver
- Logging
- Profiling
- Tracing
- Alerts
- Debugging
Visualizations - Data Studio
Challenges and Learnings
1. Dataproc
2. Dataflow
3. BigQuery
4. Cloud Composer
5. Data Studio
More on GCP
Cloud Memorystore
Cloud Dataprep
Cloud Data Catalog
Cloud AutoML
Cloud Natural Language
Cloud Life Sciences
Questions?


Editor's Notes

  • #6 Storage: choosing and using technologies for a solution. 2. Cloud Datastore: highly scalable NoSQL database; automatically handles sharding and replication; ACID transactions; SQL-like queries; simple and integrated; fast and highly scalable; easy-to-use query language; rich admin dashboard; multiple access methods; fully managed; diverse data types.
  • #7 Cloud SQL: Hosted service for monolithic MySQL and PostgreSQL. Single write node that can achieve a max IOPS of 10K, with a max storage of 10 TB and max RAM of 416 GB. Good for monolithic OLTP apps such as content management systems, ERP and CRM.
    Cloud Spanner: Globally distributed, highly available relational database service with both single-region and multi-region deployment configurations. Inserts and updates go through a custom API, while reads and DDL operations go through a Spanner-specific flavor of SQL. Good for distributed OLTP apps such as retail product catalogs, SaaS user identity and online gaming.
    Cloud Datastore: Highly scalable NoSQL database with a document data model, atomic (but not fully ACID) transactions, a SQL-like query language and support for single-region/multi-region configurations. Note that Cloud Datastore is being retired in favor of a new product called Cloud Firestore. Good for distributed OLTP apps, similar to Cloud Spanner.
    Cloud Bigtable: Single-region, highly scalable, wide-column NoSQL service with low latency and high throughput. Deeply integrated with the Hadoop ecosystem, including HBase API compatibility. Good for time-series-like Hybrid Transactional/Analytical Processing (HTAP) apps that do not require multi-region deployments. Low-latency, massively scalable NoSQL; sub-10 ms latency; high availability and durability; resilience; fast and performant; seamless scaling and replication; fully managed; hundreds of petabytes; millions of operations per second; no downtime during reconfiguration. Use cases: ideal for fintech, IoT, ad tech.
    BigQuery: Google's fully managed, low-cost, serverless data warehouse that scales with your storage and computing power needs. It is a columnar, ANSI SQL database that can analyze terabytes to petabytes of data at high speed. Analyze geospatial data using familiar SQL with BigQuery GIS. Quickly build and operationalize ML models on large-scale structured or semi-structured data using simple SQL with BigQuery ML, and support real-time interactive dashboarding with sub-second query latency using BigQuery BI Engine. BigQuery also offers data transfer services, flexible data ingestion, and pay-for-what-you-use pricing.
    Cloud Storage: High-performance object storage; backup and archival storage. Storage classes: Multi-Regional, Regional, Nearline, Coldline. Single API across all storage classes. Switch across classes with Object Lifecycle Management. Exabytes of data. Use cases: media content storage and delivery; integrated repository for analysis and machine learning; backups and archives.
  • #12 Just mention what Pub/Sub and Cloud ML are; no need to go into details.
  • #16 Google-managed Hadoop cluster. Easy and rapid cluster creation. Helps migrate existing on-premises batch/streaming workloads into the cloud easily, with either ephemeral or persistent clusters. Tools available: Spark, Hive, Pig, Apache Tez, GCS connector; optional: Druid, Flink, Presto, etc. Has autoscaling features.
  • #17 Talk about the Apache Beam model and its features.
  • #18 Talk about orchestration using Lambda/Cloud Functions and its drawbacks. Composer with Apache Airflow. Airflow operators. Features.
  • #25 Filter, search and view logs. Profile resource consumption in your application. Connect your production data with source code. Per-URL statistics and latency distribution. Create alerting policies and notifications.
  • #26 Data Studio: Supports most chart types. Connectors available for BigQuery, Cloud Spanner, Cloud SQL and Cloud Storage. Built-in caching, which speeds up visualizations. Easy to use and explore for visualizations. Data Studio can also be used for cost/billing monitoring. Available for free.
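
Note #7 mentions switching Cloud Storage objects across classes with Object Lifecycle Management. A minimal sketch of what such a policy might look like follows. The rule layout (a "rule" list of action/condition pairs with a SetStorageClass action and an age condition in days) reflects the lifecycle configuration format as I understand it and should be checked against the current Cloud Storage documentation; the helper name lifecycle_config and the 30/90-day thresholds are illustrative assumptions.

```python
import json


def lifecycle_config(nearline_after_days: int, coldline_after_days: int) -> dict:
    """Build a lifecycle policy that moves objects to colder classes as they age.

    Hypothetical helper: it only builds the configuration document and
    does not call any Cloud Storage API.
    """
    if not 0 <= nearline_after_days < coldline_after_days:
        raise ValueError("expected 0 <= nearline_after_days < coldline_after_days")
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": coldline_after_days},
            },
        ]
    }


if __name__ == "__main__":
    # Illustrative thresholds: Nearline after 30 days, Coldline after 90.
    print(json.dumps(lifecycle_config(30, 90), indent=2))
```

The printed JSON could be saved to a file and applied to a bucket with the tooling of your choice; because the API is the same across storage classes, readers of the objects should not need to change.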