SlideShare a Scribd company logo
1 of 31
Download to read offline
Elephants in The Cloud
or How to Become Cloud Ready
Krzysztof Adamski, GetInData
So You Say You Don’t Use Cloud?
HR System Online Documents Mobile PhoneEmail Server
Trust as a Key Factor
Image source: https://www.forbes.com/sites/louiscolumbus/2017/04/23/2017-state-of-cloud-adoption-and-security
More Secure or Not
In the end, do you
really think you can
provide better
infrastructure security
than cloud providers
???
Migration Questions?
How fast can you start/expand your analytics initiative?
1
How often is your cluster fully busy and your employees want more computing
power right now?2
How much time you spend on maintaining your infra?
3
How much time does it take you to gracefully apply all the security patches in
your Hadoop cluster?4
Do you need hardware that you don’t have in your data-center e.g. GPU,
terrible amounts of RAM5
Hadoop Operations at Scale
Migration Goals
Transition from infrastructure engineering
towards data engineering
1
Use the best possible technology stack in the
world
2
Free your time
3
Attract the best engineers
4
Ultimate world domination ;)
5
Krzysztof Adamski
Before You Start
Be smart
with which
service you
choose
Avoid
lock-in
Try to
estimate
the costs
See what
others
are doing
Technology
choices
Yet another
migration
Hardware,
engineering, legal
Netflix, Spotify, Etsy
What’s different in
the Cloud ?
Decoupled
storage
and
processing
Different Technologies
Hadoop Ecosystem Google Cloud Platform
File System HDFS Google Cloud Storage
Key Value Store HBase, Cassandra BigTable
SQL Hive, SparkSQL, Presto BigQuery
Messaging Queue Kafka PubSub
Geo-Replicated
RDBMS
CockroachDB Spanner
Cloud
Storage
Decision
Tree
Storage
Connectors
Strong Global Consistency
Google Cloud Storage provides strong global consistency for the following
operations, including both data and metadata:
● Read-after-write
● Read-after-metadata-update
● Read-after-delete
● Bucket listing, Object listing
● Granting access to resources
Eventual Consistency
● Revoking access from resources
It typically takes about a minute for revoking access to take effect. In some
cases it may take longer.
Beware of a cache though.
Pricing
● Pay-per-second billing
Keep in mind that if you often do sub-10
minute analyses using VMs, serverless
options may be better suited since VMs
are relatively slow to boot and serverless
functions are billed at every 100ms.
I want to start.
What’s next?
Data
repository
in a good
shape
Find best
candidates
for
migration
Isolated / self-contained
applications
With mainly external
(public data)
dependencies
Global use case
Baby Steps
Prepare your hadoop cluster to interact
with object storage.
1
Look for existing operators for popular
tools like Apache Airflow.
2
Make a copy of your critical datasets to
the cloud.
3
Use both BigQuery for fast analytics and
GCS output for more advanced trials.
4
Audit costs per query.
5
Networking
High bandwidth, low
latency and consistent
network connectivity is
critical.
Pay attention to such
things like choosing the
right region, number of
cores or even TCP
window size.
But to get the full speed
dedicated interconnect /
direct peering is the way
to go.
Multiple VPN tunnels
are a good starting
point to increase
bandwidth.
Transfer appliances for
offline data migration.
Data
Transfer
Time
Package Your Deployments
● Containers (docker) for tooling.
● Deployment artifacts (Spark / MR
jars).
● Tools like Spydra can help you
executing your packages in both
worlds
$ cat examples.json
{
"client_id": "simple-spydra-test",
"cluster_type": "dataproc",
"log_bucket": "spydra-test-logs",
"region": "europe-west1",
"cluster": {
"options": {
"project": "spydra-test"
}
},
"submit": {
"job_args": [
"pi",
"8",
"100"
],
"options": {
"jar": "hadoop-mapreduce-examples.jar"
}
}
}
$ spydra submit --spydra-json example.json
Other Important Features
● Cluster pooling - using init actions to kill old clusters
● Autoscaling - based on the workload
● Preemptible instances:
○ A reasonable choice for your cluster
○ Keep in mind final resilience (idempotence)
○ Available also with GPUs
No Long-Lived Services
● No patching! - YAY
● No wasting resources
● Latest security patches
applied automatically
Predictions
Forrester predicts
SaaS vendors will de-prioritize
their platform efforts to attain
global scale.
They will compete more at the platform level by running
portions of their services on AWS, Azure, GCP or Oracle Cloud
in 2018.
”
”
Future
Interesting projects:
● Spark on k8s
● dA Platform 2
Kubernetes
There no right answer - it's tradeoff that depends on many variables
Should I Stay or Should I Go?
Elephants in the cloud or How to become cloud ready

More Related Content

What's hot

How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...
How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...
How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...InfluxData
 
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...Spark Summit
 
How to Use Telegraf and Its Plugin Ecosystem
How to Use Telegraf and Its Plugin EcosystemHow to Use Telegraf and Its Plugin Ecosystem
How to Use Telegraf and Its Plugin EcosystemInfluxData
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021InfluxData
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...InfluxData
 
Chronografand dashboarding
Chronografand dashboardingChronografand dashboarding
Chronografand dashboardingInfluxData
 
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...Flink Forward
 
Elephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud readyElephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud readyKrzysztof Adamski
 
Druid meetup @ Netflix (11/14/2018 )
Druid meetup @ Netflix  (11/14/2018 )Druid meetup @ Netflix  (11/14/2018 )
Druid meetup @ Netflix (11/14/2018 )Jaebin Yoon
 
tado° Makes Your Home Environment Smart with InfluxDB
tado° Makes Your Home Environment Smart with InfluxDBtado° Makes Your Home Environment Smart with InfluxDB
tado° Makes Your Home Environment Smart with InfluxDBInfluxData
 
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...HostedbyConfluent
 
Natalie Godec - AirFlow and GCP: tomorrow's health service data platform
Natalie Godec - AirFlow and GCP: tomorrow's health service data platformNatalie Godec - AirFlow and GCP: tomorrow's health service data platform
Natalie Godec - AirFlow and GCP: tomorrow's health service data platformmatteo mazzeri
 
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...Dataconomy Media
 
Distributed Deep Learning At Scale On Apache Spark With BigDL
Distributed Deep Learning At Scale On Apache Spark With BigDLDistributed Deep Learning At Scale On Apache Spark With BigDL
Distributed Deep Learning At Scale On Apache Spark With BigDLYulia Tell
 
Setting up InfluxData for IoT
Setting up InfluxData for IoTSetting up InfluxData for IoT
Setting up InfluxData for IoTInfluxData
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
 
Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.Altoros
 
How to Manage Your Time Series Data Pipeline at the Edge with InfluxDB
How to Manage Your Time Series Data Pipeline at the Edge with InfluxDBHow to Manage Your Time Series Data Pipeline at the Edge with InfluxDB
How to Manage Your Time Series Data Pipeline at the Edge with InfluxDBInfluxData
 
How to Improve Performance Testing Using InfluxDB and Apache JMeter
How to Improve Performance Testing Using InfluxDB and Apache JMeterHow to Improve Performance Testing Using InfluxDB and Apache JMeter
How to Improve Performance Testing Using InfluxDB and Apache JMeterInfluxData
 

What's hot (20)

How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...
How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...
How to Deliver a Critical and Actionable Customer-Facing Metrics Product with...
 
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
 
How to Use Telegraf and Its Plugin Ecosystem
How to Use Telegraf and Its Plugin EcosystemHow to Use Telegraf and Its Plugin Ecosystem
How to Use Telegraf and Its Plugin Ecosystem
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
 
Chronografand dashboarding
Chronografand dashboardingChronografand dashboarding
Chronografand dashboarding
 
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...
 
Elephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud readyElephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud ready
 
Druid meetup @ Netflix (11/14/2018 )
Druid meetup @ Netflix  (11/14/2018 )Druid meetup @ Netflix  (11/14/2018 )
Druid meetup @ Netflix (11/14/2018 )
 
tado° Makes Your Home Environment Smart with InfluxDB
tado° Makes Your Home Environment Smart with InfluxDBtado° Makes Your Home Environment Smart with InfluxDB
tado° Makes Your Home Environment Smart with InfluxDB
 
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
 
Natalie Godec - AirFlow and GCP: tomorrow's health service data platform
Natalie Godec - AirFlow and GCP: tomorrow's health service data platformNatalie Godec - AirFlow and GCP: tomorrow's health service data platform
Natalie Godec - AirFlow and GCP: tomorrow's health service data platform
 
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
 
Distributed Deep Learning At Scale On Apache Spark With BigDL
Distributed Deep Learning At Scale On Apache Spark With BigDLDistributed Deep Learning At Scale On Apache Spark With BigDL
Distributed Deep Learning At Scale On Apache Spark With BigDL
 
Setting up InfluxData for IoT
Setting up InfluxData for IoTSetting up InfluxData for IoT
Setting up InfluxData for IoT
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.
 
How to Manage Your Time Series Data Pipeline at the Edge with InfluxDB
How to Manage Your Time Series Data Pipeline at the Edge with InfluxDBHow to Manage Your Time Series Data Pipeline at the Edge with InfluxDB
How to Manage Your Time Series Data Pipeline at the Edge with InfluxDB
 
How to Improve Performance Testing Using InfluxDB and Apache JMeter
How to Improve Performance Testing Using InfluxDB and Apache JMeterHow to Improve Performance Testing Using InfluxDB and Apache JMeter
How to Improve Performance Testing Using InfluxDB and Apache JMeter
 

Similar to Elephants in the cloud or How to become cloud ready

Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediAnimesh Chaturvedi
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAlluxio, Inc.
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Digital Forensics and Incident Response in The Cloud
Digital Forensics and Incident Response in The CloudDigital Forensics and Incident Response in The Cloud
Digital Forensics and Incident Response in The CloudVelocidex Enterprises
 
Building Cloud capability for startups
Building Cloud capability for startupsBuilding Cloud capability for startups
Building Cloud capability for startupsSekhar Mohanty
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architectureconfluent
 
Cloud Busting: Understanding Cloud-based Digital Forensics
Cloud Busting: Understanding Cloud-based Digital ForensicsCloud Busting: Understanding Cloud-based Digital Forensics
Cloud Busting: Understanding Cloud-based Digital ForensicsKerry Hazelton
 
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)confluent
 
Slides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data LakesSlides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data LakesDATAVERSITY
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用lantianlcdx
 
Gruntwork Executive Summary
Gruntwork Executive SummaryGruntwork Executive Summary
Gruntwork Executive SummaryYevgeniy Brikman
 

Similar to Elephants in the cloud or How to become cloud ready (20)

Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
Digital Forensics and Incident Response in The Cloud
Digital Forensics and Incident Response in The CloudDigital Forensics and Incident Response in The Cloud
Digital Forensics and Incident Response in The Cloud
 
Building Cloud capability for startups
Building Cloud capability for startupsBuilding Cloud capability for startups
Building Cloud capability for startups
 
moveMountainIEEE
moveMountainIEEEmoveMountainIEEE
moveMountainIEEE
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture
 
Cloud Busting: Understanding Cloud-based Digital Forensics
Cloud Busting: Understanding Cloud-based Digital ForensicsCloud Busting: Understanding Cloud-based Digital Forensics
Cloud Busting: Understanding Cloud-based Digital Forensics
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
El35782786
El35782786El35782786
El35782786
 
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)
3 Ways to Deliver an Elastic, Cost-Effective Cloud Architecture (ANZ)
 
kumarResume
kumarResumekumarResume
kumarResume
 
Slides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data LakesSlides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data Lakes
 
Self-Service Supercomputing
Self-Service SupercomputingSelf-Service Supercomputing
Self-Service Supercomputing
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用
 
Final White Paper_
Final White Paper_Final White Paper_
Final White Paper_
 
Gruntwork Executive Summary
Gruntwork Executive SummaryGruntwork Executive Summary
Gruntwork Executive Summary
 

More from GetInData

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...GetInData
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczGetInData
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competitionGetInData
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? GetInData
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierGetInData
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformGetInData
 
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...GetInData
 
MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...GetInData
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataGetInData
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataGetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...GetInData
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataGetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...GetInData
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...GetInData
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...GetInData
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataGetInData
 
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...GetInData
 
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...GetInData
 
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...GetInData
 

More from GetInData (20)

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competition
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team?
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easier
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML Platform
 
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
 
MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
 
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
 
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
 
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
 

Recently uploaded

AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Skynet Technologies
 
Navigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiNavigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiRaviKumarDaparthi
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfAnubhavMangla3
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...ScyllaDB
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptxCyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptxMasterG
 
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...SOFTTECHHUB
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 

Recently uploaded (20)

AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
Navigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiNavigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi Daparthi
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptxCyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
 
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 

Elephants in the cloud or How to become cloud ready

  • 1. Elephants in The Cloud or How to Become Cloud Ready Krzysztof Adamski, GetInData
  • 2. So You Say You Don’t Use Cloud? HR System Online Documents Mobile PhoneEmail Server
  • 3. Trust as a Key Factor Image source: https://www.forbes.com/sites/louiscolumbus/2017/04/23/2017-state-of-cloud-adoption-and-security
  • 4. More Secure or Not In the end, do you really think you can provide better infrastructure security than cloud providers ???
  • 5. Migration Questions? How fast can you start/expand your analytics initiative? 1 How often is your cluster fully busy and your employees want more computing power right now?2 How much time you spend on maintaining your infra? 3 How much time does it take you to gracefully apply all the security patches in your Hadoop cluster?4 Do you need hardware that you don’t have in your data-center e.g. GPU, terrible amounts of RAM5
  • 7. Migration Goals Transition from infrastructure engineering towards data engineering 1 Use the best possible technology stack in the world 2 Free your time 3 Attract the best engineers 4 Ultimate world domination ;) 5
  • 9. Before You Start Be smart with which service you choose Avoid lock-in Try to estimate the costs See what others are doing Technology choices Yet another migration Hardware, engineering, legal Netflix, Spotify, Etsy
  • 12. Different Technologies Hadoop Ecosystem Google Cloud Platform File System HDFS Google Cloud Storage Key Value Store HBase, Cassandra BigTable SQL Hive, SparkSQL, Presto BigQuery Messaging Queue Kafka PubSub Geo-Replicated RDBMS CockroachDB Spanner
  • 15. Strong Global Consistency Google Cloud Storage provides strong global consistency for the following operations, including both data and metadata: ● Read-after-write ● Read-after-metadata-update ● Read-after-delete ● Bucket listing, Object listing ● Granting access to resources
  • 16. Eventual Consistency ● Revoking access from resources It typically takes about a minute for revoking access to take effect. In some cases it may take longer. Beware of a cache though.
  • 17. Pricing ● Pay-per-second billing Keep in mind that if you often do sub-10 minute analyses using VMs, serverless options may be better suited since VMs are relatively slow to boot and serverless functions are billed at every 100ms.
  • 18. I want to start. What’s next?
  • 20. Find best candidates for migration Isolated / self-contained applications With mainly external (public data) dependencies Global use case
  • 21. Baby Steps Prepare your hadoop cluster to interact with object storage. 1 Look for existing operators for popular tools like Apache Airflow. 2 Make a copy of your critical datasets to the cloud. 3 Use both BigQuery for fast analytics and GCS output for more advanced trials. 4 Audit costs per query. 5
  • 22. Networking High bandwidth, low latency and consistent network connectivity is critical. Pay attention to such things like choosing the right region, number of cores or even TCP window size. But to get the full speed dedicated interconnect / direct peering is the way to go. Multiple VPN tunnels are a good starting point to increase bandwidth. Transfer appliances for offline data migration.
  • 24. Package Your Deployments ● Containers (docker) for tooling. ● Deployment artifacts (Spark / MR jars). ● Tools like Spydra can help you executing your packages in both worlds $ cat examples.json { "client_id": "simple-spydra-test", "cluster_type": "dataproc", "log_bucket": "spydra-test-logs", "region": "europe-west1", "cluster": { "options": { "project": "spydra-test" } }, "submit": { "job_args": [ "pi", "8", "100" ], "options": { "jar": "hadoop-mapreduce-examples.jar" } } } $ spydra submit --spydra-json example.json
  • 25. Other Important Features ● Cluster pooling - using init actions to kill old clusters ● Autoscaling - based on the workload ● Preemptible instances: ○ A reasonable choice for your cluster ○ Keep in mind final resilience (idempotence) ○ Available also with GPUs
  • 26. No Long-Lived Services ● No patching! - YAY ● No wasting resources ● Latest security patches applied automatically
  • 27. Predictions Forrester predicts SaaS vendors will de-prioritize their platform efforts to attain global scale. They will compete more at the platform level by running portions of their services on AWS, Azure, GCP or Oracle Cloud in 2018. ” ”
  • 28. Future Interesting projects: ● Spark on k8s ● dA Platform 2
  • 30. There no right answer - it's tradeoff that depends on many variables Should I Stay or Should I Go?