Chicago Cloud Conference 2020
Architecting Analytic
Pipelines on GCP
Who am I?
Mariano is an engineer with more than 15 years of
experience with the JVM. He enjoys working with
and exploring a variety of big data technologies. He is
an avid open-source contributor.
Data/Platform Architect at Otus
Mariano Gonzalez
Most importantly, I am just a person trying to learn about and share big
data technologies and approaches.
Agenda
● Goal for this session
● Overview of GCP services
● Apache Beam and GCP Dataflow
● Natural Language Processing for sentiment analysis
● Demo ETL/Analytics
● Q&A
Goal for this Session
Find an elegant way to build and deploy data/analytic
pipelines that:
● Support multiple workloads
● Scale compute and storage independently
● Are backed by managed services
● Are cost-effective
Common Architecture Analytics Pipeline
[Diagram: different types and formats of data, data storage, analytic/data pipelines, user]
Overview of GCP services - App Engine
● Good alternative if K8s infrastructure is not in place
● Easy deployment
○ Similar to AWS SAM from a CLI perspective
○ Similar to AWS Beanstalk from a deployment perspective
● Well integrated with other cloud services
○ GCP docker Registry
● Multiple Runtimes
○ Custom (Docker)
○ JVM/Node/Python
Overview of GCP services - Storage
● Hot - durable, highly available, high-performance object storage for frequently accessed data
○ Amazon S3 Standard
○ Microsoft Azure Hot Blob Storage
○ Google Cloud Storage standard
● Cool - storage class for data that is accessed less frequently but requires rapid access when needed
○ Amazon S3 Standard-IA and S3 One Zone-IA
○ Microsoft Azure Cool Blob Storage
○ Google Cloud Storage Nearline
● Cold - secure, durable, and low-cost storage service for data archiving
○ Amazon S3 Glacier
○ Microsoft Azure Blob Archive Storage
○ Google Cloud Storage Coldline
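As a rough illustration of the three tiers above, the helper below maps an expected access interval to a Google Cloud Storage class. The function name and the 30-day and 90-day cut-offs are assumptions made for this sketch (they mirror the minimum storage durations usually documented for Nearline and Coldline); they are not a rule stated in the talk.

```python
def pick_storage_class(days_between_reads: float) -> str:
    """Suggest a GCS storage class from how often the data is read.

    The thresholds are illustrative assumptions, not official guidance.
    """
    if days_between_reads < 0:
        raise ValueError("days_between_reads must be non-negative")
    if days_between_reads < 30:
        return "STANDARD"   # hot: frequently accessed data
    if days_between_reads < 90:
        return "NEARLINE"   # cool: read roughly once a month or less
    return "COLDLINE"       # cold: archival data that is rarely read


if __name__ == "__main__":
    for days in (1, 45, 365):
        print(days, pick_storage_class(days))
```

In practice the choice also depends on retrieval fees and object size, which this sketch ignores.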
Overview of GCP services - Pubsub
Why not just use Kafka?
● Fully managed services
○ Both systems are available as fully managed services in the cloud
● Cloud vs On-prem
○ Pubsub is only offered as part of the GCP ecosystem, whereas Apache Kafka
can be used both as a cloud service and on-prem
● Message duplication
○ Kafka tracks consumer progress with offsets (kept in ZooKeeper by older clients; newer clients store them in an internal Kafka topic)
○ Pubsub tracks delivery by having subscribers acknowledge each message
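The difference between acknowledging individual messages and committing an offset can be sketched with two toy in-memory classes. Both classes and their method names are hypothetical and exist only for this illustration; neither is the Pubsub or Kafka client API.

```python
from collections import OrderedDict


class AckSubscription:
    """Toy ack-based subscription: a message is redelivered until acked."""

    def __init__(self, messages):
        # ack_id -> payload for every message not yet acknowledged
        self._pending = OrderedDict(enumerate(messages))

    def pull(self):
        """Return all outstanding (ack_id, payload) pairs."""
        return list(self._pending.items())

    def ack(self, ack_id):
        """Acknowledge one message so it is not delivered again."""
        self._pending.pop(ack_id, None)


class OffsetLog:
    """Toy offset-based log: the consumer commits a position, not messages."""

    def __init__(self, messages):
        self._log = list(messages)
        self.committed = 0  # next offset to read after a restart

    def poll(self):
        """Return every record from the committed offset onward."""
        return self._log[self.committed:]

    def commit(self, offset):
        """Record that everything before `offset` has been processed."""
        self.committed = offset


if __name__ == "__main__":
    sub = AckSubscription(["a", "b", "c"])
    first_ack_id, _ = sub.pull()[0]
    sub.ack(first_ack_id)   # only "a" is acknowledged
    print(sub.pull())       # the unacked messages are offered again

    log = OffsetLog(["a", "b", "c"])
    log.commit(1)           # "a" processed, position saved
    print(log.poll())       # reading resumes after the committed offset
```

In both models a consumer that crashes before acknowledging or committing will see the same message again, which is why downstream processing usually has to tolerate duplicates.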
Overview of GCP services - Pubsub
Why not just use Kafka?
● Retention policy
○ Both Kafka and Pubsub have options to configure the maximum retention
time
● Consumer groups vs subscriptions
○ Pubsub uses subscriptions: you create a subscription and then you start
reading messages from that subscription
○ Kafka uses the concepts of "consumer group" and "partition"
Overview of GCP services - BigQuery
● Query engines are probably one of the most contested service categories today:
○ Snowflake
○ Presto
○ Redshift
● How are these warehouses different?
● Presto
○ Self hosted open source solution
● Pre-RA3 Redshift
○ Somewhat more fully managed, but still requires the user to configure individual
compute clusters with a fixed amount of memory, compute and storage
● Redshift RA3
○ Closer to the user experience of Snowflake by separating compute from storage
● Snowflake
○ The user only configures the size and number of compute clusters
○ Every compute cluster sees the same data
○ Compute clusters can be created and removed in seconds
Overview of GCP services - BigQuery
BigQuery
● Flat-rate mode is similar to Snowflake, except there is no concept of a compute cluster, just a configurable number
of "compute slots"
● On-demand mode is a pure serverless model, where the user submits queries one at a time and pays per query
● On-demand mode can be much more expensive, or much cheaper, than flat-rate mode, depending on the nature of your
workload
A "steady" workload that utilizes your compute capacity 24/7 will be much cheaper in flat-rate mode. A
"spiky" workload that contains periodic large queries spaced with long periods of idleness or lower utilization
will be much cheaper in on-demand mode.
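The steady-versus-spiky trade-off can be made concrete with a small break-even calculation. This is a sketch under stated assumptions: it treats on-demand cost as proportional to the volume of data scanned per month, and both prices used in the example are placeholders, not quoted BigQuery prices.

```python
def on_demand_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly on-demand cost: pay only for the data scanned by queries."""
    return tb_scanned * price_per_tb


def cheaper_mode(tb_scanned: float, price_per_tb: float,
                 flat_rate_monthly: float) -> str:
    """Return which pricing mode costs less for one month of workload."""
    if on_demand_cost(tb_scanned, price_per_tb) < flat_rate_monthly:
        return "on-demand"
    return "flat-rate"


def break_even_tb(price_per_tb: float, flat_rate_monthly: float) -> float:
    """Scanned volume (TB per month) at which both modes cost the same."""
    return flat_rate_monthly / price_per_tb


if __name__ == "__main__":
    PRICE_PER_TB = 5.0     # placeholder value, not a quoted price
    FLAT_RATE = 10_000.0   # placeholder value, not a quoted price
    print(break_even_tb(PRICE_PER_TB, FLAT_RATE))
    print(cheaper_mode(50, PRICE_PER_TB, FLAT_RATE))      # light, spiky usage
    print(cheaper_mode(5_000, PRICE_PER_TB, FLAT_RATE))   # heavy, steady usage
```

Plugging in your own scan volume and current list prices shows on which side of the break-even point a workload sits.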
Overview of GCP services - Dataflow
What is Google Cloud Dataflow?
● Data processing service for both:
○ batch
○ real-time data streaming applications
● Benefits
○ Enables developers to set up analytic pipelines immediately
● Nextgen MapReduce
○ Designed to bring to entire analytics pipelines the style of fast parallel execution that MapReduce
brought to a single type of computation, batch processing jobs
○ It's based partly on MillWheel and Flume (two Google-developed systems for data ingestion and low-latency
processing).
Overview of GCP services - Dataflow
Apache Beam SDK and Dataflow Runner
Google Cloud Dataflow overlaps with services such as:
● Amazon Kinesis
● Apache Storm
● Apache Spark
● Facebook Flux
$ java -jar build/libs/transformation-1.0-all.jar \
    --project=ccc-2020-289323 \
    --runner=DataflowRunner \
    --streaming=true \
    --region=us-east1 \
    --tempLocation=gs://chicago-cloud-conference-2020/temp/ \
    --stagingLocation=gs://chicago-cloud-conference-2020/jars/ \
    --filesToStage=build/libs/transformation-1.0-all.jar \
    --maxNumWorkers=2 \
    --numWorkers=1
Overview of GCP services - Dataproc
On demand Hadoop Cluster
● Of the three managed services for Hadoop clusters (Amazon EMR, Azure HDInsight, and Dataproc),
Dataproc is the fastest to provision
● Easy runtime customization via pip packages
● Not as well integrated with third-party services (Azure HDInsight - Databricks, Amazon EMR
- Apache Zeppelin)
$ gcloud beta dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=1.4 \
    --enable-component-gateway \
    --bucket=chicago-cloud-conference-2020 \
    --region=us-east1 \
    --project=ccc-2020-289323 \
    --metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage numpy pandas matplotlib'
Overview of GCP services - Cloud Natural Language API
● What can we do with the Cloud Natural Language API?
○ Reveal the structure and meaning of text via machine learning models
○ Extract information about people, places, and events, mentioned in text
documents, news articles or blog posts
○ Understand sentiment about product on social media or parse intent from
customer conversations happening in a call center or a messaging app
● How can we use it?
○ Analyze text uploaded as part of an HTTP request
○ Integrate with Google Cloud Storage
NLP - Sentiment Analysis
Two types of metrics to consider:
1. Score
a. It ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the general emotional tendency of the text
2. Magnitude
a. Indicates the general intensity of emotion (both positive and negative) in a given text, between 0.0 and +inf
b. Magnitude is not normalized, and each expression of emotion in the text (both positive and negative) contributes to the value
Sentiment Sample Values
Positive: score 0.8, magnitude 3.0
Negative: score -0.6, magnitude 4.0
Neutral: score 0.1, magnitude 0.0
Mixed: score 0.0, magnitude 4.0
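One way to turn a (score, magnitude) pair into a label like the samples above is a simple threshold rule. The function name and both cut-off values are assumptions chosen for this sketch; suitable thresholds depend on the text being analyzed and should be tuned per use case.

```python
def label_sentiment(score: float, magnitude: float,
                    score_cutoff: float = 0.25,
                    magnitude_cutoff: float = 1.0) -> str:
    """Label a text from its sentiment score and magnitude.

    A clearly signed score wins; a near-zero score is "mixed" when the
    text still carries a lot of emotion, and "neutral" otherwise.
    The default cut-offs are illustrative assumptions.
    """
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    return "mixed" if magnitude >= magnitude_cutoff else "neutral"


if __name__ == "__main__":
    samples = [(0.8, 3.0), (-0.6, 4.0), (0.1, 0.0), (0.0, 4.0)]
    for score, magnitude in samples:
        print(score, magnitude, label_sentiment(score, magnitude))
```

The near-zero branch is why magnitude matters: a score of 0.0 alone cannot distinguish an emotionless text from one whose positive and negative statements cancel out.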
Demo - ETL
• Extract – Different sources (Twitter in this case)
• Transform – Cleanup and data presentation
• Load – Columnar format
https://github.com/eschizoid/ccc-2020
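The three ETL stages can be sketched in a few lines of plain Python. This is not code from the demo repository: the sample records, the function names, and the cleanup rules (dropping URLs, collapsing whitespace) are all assumptions made for illustration, and the "columnar" output is just a dict of lists rather than a real columnar file format.

```python
import re


def extract():
    """Stand-in for a Twitter source: returns hard-coded sample records."""
    return [
        {"id": 1, "text": "Loving   #GCP!  https://example.com/a"},
        {"id": 2, "text": "  Dataflow is neat "},
    ]


def transform(record):
    """Drop URLs and collapse whitespace in the tweet text."""
    text = re.sub(r"https?://\S+", "", record["text"])
    text = re.sub(r"\s+", " ", text).strip()
    return {"id": record["id"], "text": text}


def load(records):
    """Pivot row-oriented records into a column-oriented dict of lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def run():
    """Chain the three stages: extract, then transform, then load."""
    return load(transform(record) for record in extract())


if __name__ == "__main__":
    print(run())
```

Keeping each stage a separate function mirrors how a pipeline framework would treat them as independent steps that can be scaled or replaced on their own.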
Demo - Analytics
Conclusion
• Cost-effective solution if you know your data access patterns
• Fully serverless architecture
• Extensible workloads
Q&A

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 

Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020

  • 1. Chicago Cloud Conference 2020 Architecting Analytic Pipelines on GCP
  • 2. Who am I? Mariano is an engineer with more than 15 years of experience with the JVM. He enjoys working with and exploring a variety of big data technologies. He is an avid open-source contributor. Data/Platform Architect at Otus Mariano Gonzalez Most importantly, I am just a person trying to learn about and share big data technologies and approaches.
  • 3. Agenda ● Goal for this session ● Overview of GCP services ● Apache Beam and GCP Dataflow ● Natural Language Processing for sentiment analysis ● Demo ETL/Analytics ● QA
  • 4. Goal for this Session Find an elegant way to build and deploy data/analytic pipelines that: ● Support multiple workloads ● Scale compute and storage independently ● Are backed by managed services ● Are cost effective
  • 5. Common Architecture Analytics Pipeline [diagram labels: Different Types and Formats of Data, Analytic/Data Pipelines, Data Storage, User]
  • 6. Overview of GCP services - App Engine ● Good alternative if K8s infrastructure is not in place ● Easy deployment ○ Similar to AWS SAM from a CLI perspective ○ Similar to AWS Beanstalk from a deployment perspective ● Well integrated with other cloud services ○ GCP docker Registry ● Multiple Runtimes ○ Custom (Docker) ○ JVM/Node/Python
  • 7. Overview of GCP services - Storage ● Hot - durable, highly available, high-performance object storage for frequently accessed data ○ Amazon S3 Standard ○ Microsoft Azure Hot Blob Storage ○ Google Cloud Storage Standard ● Cool - storage class for data that is accessed less frequently, but requires rapid access when needed ○ Amazon S3 Standard-IA and S3 One Zone-IA ○ Microsoft Azure Cool Blob Storage ○ Google Cloud Storage Nearline ● Cold - secure, durable, and low-cost storage service for data archiving ○ Amazon S3 Glacier ○ Microsoft Azure Blob Archive Storage ○ Google Cloud Storage Coldline
  • 8. Overview of GCP services - Pubsub Why not just use Kafka? ● Fully managed services ○ Both systems are available as fully managed services in the cloud ● Cloud vs On-prem ○ Pubsub is only offered as part of the GCP ecosystem, whereas Apache Kafka can run both as a cloud service and on-prem ● Message duplication ○ Kafka tracks consumer progress with offsets (stored in ZooKeeper in older versions, in an internal topic in newer ones) ○ Pubsub tracks delivery by having consumers acknowledge each message
  • 9. Overview of GCP services - Pubsub Why not just use Kafka? ● Retention policy ○ Both Kafka and Pubsub have options to configure the maximum retention time ● Consumer Groups vs Subscriptions ○ Pubsub uses subscriptions: you create a subscription and then read messages from that subscription ○ Kafka uses the concepts of "consumer group" and "partition"
  • 10. Overview of GCP services - BigQuery ● Query engines are probably one of the most competitive service categories today: ○ Snowflake ○ Presto ○ Redshift ● How are these warehouses different?
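A toy model may make the offset versus acknowledgement distinction from slides 8 and 9 more concrete. The sketch below is plain Python with no Kafka or Pub/Sub client involved; the class names `OffsetLog` and `AckSubscription` are hypothetical, and the model deliberately ignores partitions, ack deadlines, and retention.

```python
class OffsetLog:
    """Kafka-style: the log keeps every message; each consumer group tracks an offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = {}  # consumer group -> next offset to read

    def poll(self, group, max_messages=2):
        start = self.committed.get(group, 0)
        batch = self.messages[start:start + max_messages]
        return list(enumerate(batch, start))  # (offset, message) pairs

    def commit(self, group, offset):
        # Committing offset N means everything up to and including N is processed.
        self.committed[group] = offset + 1


class AckSubscription:
    """Pub/Sub-style: each message stays outstanding until it is acknowledged."""

    def __init__(self, messages):
        self.outstanding = dict(enumerate(messages))  # ack id -> message

    def pull(self, max_messages=2):
        return list(self.outstanding.items())[:max_messages]

    def ack(self, ack_id):
        self.outstanding.pop(ack_id, None)


log = OffsetLog(["a", "b", "c"])
first = log.poll("analytics")
log.commit("analytics", first[0][0])     # commit only the first message
redelivered_log = log.poll("analytics")  # reading resumes after the commit

sub = AckSubscription(["a", "b", "c"])
batch = sub.pull()
sub.ack(batch[1][0])                     # ack the second message, not the first
redelivered_sub = sub.pull()             # the unacked first message comes back
```

In the offset model, anything after the last committed offset is read again as a block; in the ack model, redelivery is decided message by message.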
  • 11. Overview of GCP services - BigQuery ● Presto ○ Self-hosted open source solution ● Pre-RA3 Redshift ○ Somewhat more fully managed, but still requires the user to configure individual compute clusters with a fixed amount of memory, compute and storage ● Redshift RA3 ○ Closer to the user experience of Snowflake by separating compute from storage ● Snowflake ○ The user only configures the size and number of compute clusters ○ Every compute cluster sees the same data ○ Compute clusters can be created and removed in seconds
  • 12. Overview of GCP services - BigQuery ● Flat-rate mode is similar to Snowflake, except there is no concept of a compute cluster, just a configurable number of "compute slots" ● On-demand mode is a pure serverless model, where the user submits queries one at a time and pays per query ● On-demand mode can be much more expensive, or much cheaper, than flat-rate mode, depending on the nature of your workload ○ A "steady" workload that utilizes your compute capacity 24/7 will be much cheaper in flat-rate mode ○ A "spiky" workload that contains periodic large queries spaced with long periods of idleness or lower utilization will be much cheaper in on-demand mode
  • 13. Overview of GCP services - Dataflow What is Google Cloud Dataflow? ● Data processing service for both: ○ batch ○ real-time data streaming applications ● Benefits ○ Enables developers to set up analytic pipelines immediately ● Nextgen MapReduce ○ Designed to bring to entire analytics pipelines the style of fast parallel execution that MapReduce brought to a single type of computation, batch processing jobs ○ It's based partly on MillWheel and Flume (two Google-developed projects for data ingestion and low-latency processing)
  • 14. Apache Beam SDK and Dataflow Runner Google Cloud Dataflow overlaps with services such as: ● Amazon Kinesis ● Apache Storm ● Apache Spark ● Facebook Flux
      $ java -jar build/libs/transformation-1.0-all.jar \
          --project=ccc-2020-289323 \
          --runner=DataflowRunner \
          --streaming=true \
          --region=us-east1 \
          --tempLocation=gs://chicago-cloud-conference-2020/temp/ \
          --stagingLocation=gs://chicago-cloud-conference-2020/jars/ \
          --filesToStage=build/libs/transformation-1.0-all.jar \
          --maxNumWorkers=2 \
          --numWorkers=1
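The steady versus spiky trade-off on slide 12 comes down to a break-even calculation. The sketch below shows the arithmetic only; `PRICE_PER_TB` and `FLAT_RATE_PER_MONTH` are placeholder numbers chosen for illustration, not actual BigQuery prices, and the function names are hypothetical.

```python
# Placeholder prices for illustration only; check current pricing before relying on them.
PRICE_PER_TB = 5.0              # assumed on-demand price per TB scanned
FLAT_RATE_PER_MONTH = 10_000.0  # assumed monthly price of a slot commitment


def on_demand_cost(tb_scanned_per_month, price_per_tb=PRICE_PER_TB):
    """Monthly cost when every query is billed by the data it scans."""
    return tb_scanned_per_month * price_per_tb


def cheaper_mode(tb_scanned_per_month):
    """Return which billing mode costs less for a given monthly scan volume."""
    if on_demand_cost(tb_scanned_per_month) < FLAT_RATE_PER_MONTH:
        return "on-demand"
    return "flat-rate"


# Scan volume at which both modes cost the same under the assumed prices.
break_even_tb = FLAT_RATE_PER_MONTH / PRICE_PER_TB

spiky = cheaper_mode(50)      # a few large queries per month
steady = cheaper_mode(5_000)  # capacity in use around the clock
```

Under these assumed prices, a workload scanning less than the break-even volume per month favors on-demand mode, and one scanning more favors flat-rate mode.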
  • 15. Apache Beam SDK and Dataflow Runner
  • 16. Overview of GCP services - Dataproc On-demand Hadoop cluster ● Of the three managed Hadoop cluster services (Amazon EMR, Azure HDInsight, Dataproc), Dataproc is the fastest to provision ● Easy runtime customization via pip packages ● Not as well integrated with third-party services as the alternatives (Azure HDInsight - Databricks, Amazon EMR - Apache Zeppelin)
      $ gcloud beta dataproc clusters create cluster-name \
          --optional-components=ANACONDA,JUPYTER \
          --image-version=1.4 \
          --enable-component-gateway \
          --bucket=chicago-cloud-conference-2020 \
          --region=us-east1 \
          --project=ccc-2020-289323 \
          --metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage numpy pandas matplotlib'
  • 17. Overview of GCP services - Cloud Natural Language API ● What can we do with the Cloud Natural Language API? ○ Reveal the structure and meaning of text via machine learning models ○ Extract information about people, places, and events mentioned in text documents, news articles or blog posts ○ Understand sentiment about a product on social media, or parse intent from customer conversations happening in a call center or a messaging app ● How can we use it? ○ Analyze text uploaded as part of an HTTP request ○ Integrate with Google Cloud Storage
  • 18. NLP - Sentiment Analysis Two types of metrics to consider: 1. Score a. Ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional tendency of the text 2. Magnitude a. Indicates the overall intensity of emotion (both positive and negative) in a given text, between 0.0 and +inf b. Magnitude is not normalized; each expression of emotion in the text (both positive and negative) contributes to the value Sample values: ● Positive: score 0.8, magnitude 3.0 ● Negative: score -0.6, magnitude 4.0 ● Neutral: score 0.1, magnitude 0.0 ● Mixed: score 0.0, magnitude 4.0
  • 19. Demo - ETL • Extract – Different sources (Twitter in this case) • Transform – Cleanup and data presentation • Load – Columnar format https://github.com/eschizoid/ccc-2020
  • 21. Conclusion • Cost-effective solution if you know your data access patterns • Fully serverless architecture • Extensible workloads
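One way to read score and magnitude together is a small classifier over the two values. The sketch below is a minimal illustration, not part of the Natural Language API; the function name and the cutoffs `score_cutoff=0.25` and `magnitude_cutoff=1.0` are assumptions that would need tuning for real data.

```python
def classify_sentiment(score, magnitude, score_cutoff=0.25, magnitude_cutoff=1.0):
    """Map a (score, magnitude) pair to a coarse label.

    The cutoffs are hypothetical defaults, not values defined by the API.
    """
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # Score near zero: strong emotion in both directions reads as mixed,
    # little emotion overall reads as neutral.
    return "mixed" if magnitude >= magnitude_cutoff else "neutral"


# The four sample values from slide 18, as (score, magnitude) pairs.
samples = [(0.8, 3.0), (-0.6, 4.0), (0.1, 0.0), (0.0, 4.0)]
labels = [classify_sentiment(score, magnitude) for score, magnitude in samples]
```

The key design point is that score alone cannot separate neutral from mixed text; magnitude is what distinguishes the two when the score is close to zero.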
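The three ETL stages can be sketched in a few lines of plain Python. This is not the code from the linked repository; `RAW_TWEETS` is made-up stand-in data for the extract step, and the function names and the in-memory column layout are assumptions used only to show the shape of each stage.

```python
import re

# Extract: made-up records standing in for tweets pulled from a real source.
RAW_TWEETS = [
    {"id": 1, "text": "Loving #GCP!   https://example.com/a"},
    {"id": 2, "text": "  Dataflow is   slow today @someone "},
]


def transform(tweet):
    """Clean up the text: drop URLs and mentions, keep hashtag words."""
    text = re.sub(r"https?://\S+", "", tweet["text"])
    text = re.sub(r"@\w+", "", text)
    text = text.replace("#", "")
    text = re.sub(r"\s+", " ", text).strip()
    return {"id": tweet["id"], "text": text}


def load_columnar(rows):
    """Pivot row dictionaries into one list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


table = load_columnar(transform(tweet) for tweet in RAW_TWEETS)
```

Storing one list per column mirrors, at toy scale, why columnar formats suit analytics: a query that only needs `text` never touches `id`.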
  • 22. QA