Elephants in The Cloud
or How to Become Cloud Ready
Krzysztof Adamski, GetInData
So You Say You Don’t Use Cloud?
HR System, Online Documents, Mobile Phone, Email Server
Trust as a Key Factor
Image source: https://www.forbes.com/sites/louiscolumbus/2017/04/23/2017-state-of-cloud-adoption-and-security
More Secure or Not
In the end, do you really think you can provide better infrastructure security than cloud providers?
Migration Questions?
1. How fast can you start or expand your analytics initiative?
2. How often is your cluster fully busy while your employees want more computing power right now?
3. How much time do you spend on maintaining your infrastructure?
4. How much time does it take you to gracefully apply all the security patches in your Hadoop cluster?
5. Do you need hardware that you don’t have in your data center, e.g. GPUs or very large amounts of RAM?
Hadoop Operations at Scale
Migration Goals
1. Transition from infrastructure engineering towards data engineering
2. Use the best possible technology stack in the world
3. Free your time
4. Attract the best engineers
5. Ultimate world domination ;)
Before You Start
● Be smart about which service you choose (technology choices)
● Avoid lock-in (yet another migration)
● Try to estimate the costs (hardware, engineering, legal)
● See what others are doing (Netflix, Spotify, Etsy)
What’s Different in the Cloud?
Decoupled storage and processing
Different Technologies
                      Hadoop Ecosystem         Google Cloud Platform
File System           HDFS                     Google Cloud Storage
Key Value Store       HBase, Cassandra         Bigtable
SQL                   Hive, Spark SQL, Presto  BigQuery
Messaging Queue       Kafka                    Pub/Sub
Geo-Replicated RDBMS  CockroachDB              Spanner
Cloud Storage Decision Tree
Storage Connectors
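
A storage connector lets existing Hadoop jobs address object storage through a file-system URI such as gs://bucket/path. The following is a minimal sketch, assuming the open-source Cloud Storage connector for Hadoop and the property names it has commonly documented; the project ID is a hypothetical placeholder, and the property names should be verified against the connector version you actually install.

```python
import xml.etree.ElementTree as ET

# Assumed property names for the Cloud Storage connector for Hadoop;
# verify them against the connector version you deploy.
GCS_CONNECTOR_PROPERTIES = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "fs.gs.project.id": "my-example-project",  # hypothetical project ID
}


def render_core_site(properties):
    """Render a dict of Hadoop properties as a core-site.xml document string."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


core_site_xml = render_core_site(GCS_CONNECTOR_PROPERTIES)
print(core_site_xml)
```

The rendered fragment would be merged into the cluster's existing core-site.xml, alongside the connector jar on the Hadoop classpath and whatever credentials setup your environment requires.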
Strong Global Consistency
Google Cloud Storage provides strong global consistency for the following
operations, including both data and metadata:
● Read-after-write
● Read-after-metadata-update
● Read-after-delete
● Bucket listing, Object listing
● Granting access to resources
Eventual Consistency
● Revoking access from resources
Revoking access typically takes about a minute to take effect; in some cases it may take longer.
Beware of caching, though.
Pricing
● Pay-per-second billing
Keep in mind that if you often run sub-10-minute analyses on VMs, serverless options may be better suited, since VMs are relatively slow to boot and serverless functions are billed in 100 ms increments.
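
To make the billing-granularity point concrete, here is a small sketch of how billed time could be computed under each model. The 45-second boot time, the 60-second minimum charge, and the job durations are illustrative assumptions, not quoted provider terms; unit prices are deliberately left out because they differ between products.

```python
import math


def vm_billed_seconds(job_seconds, boot_seconds, minimum_seconds=60):
    """Seconds billed for a VM: boot time is paid for, with an assumed minimum charge."""
    return max(minimum_seconds, math.ceil(job_seconds + boot_seconds))


def serverless_billed_ms(job_ms, increment_ms=100):
    """Milliseconds billed when usage is rounded up to a fixed increment."""
    return math.ceil(job_ms / increment_ms) * increment_ms


# Hypothetical workload: a 20-second analysis on a VM that takes 45 seconds to boot.
vm_seconds = vm_billed_seconds(job_seconds=20, boot_seconds=45)
overhead_share = 45 / vm_seconds

# The same analysis as a function call finishing at 20,030 ms, billed in 100 ms steps.
fn_ms = serverless_billed_ms(20_030)

print(vm_seconds, round(overhead_share, 2), fn_ms)
```

Under these assumptions most of the VM's billed time is boot overhead, which is the effect the slide warns about for short analyses.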
I want to start. What’s next?
Data repository in good shape
Find the best candidates for migration
● Isolated / self-contained applications
● With mainly external (public data) dependencies
● Global use case
Baby Steps
1. Prepare your Hadoop cluster to interact with object storage.
2. Look for existing operators for popular tools like Apache Airflow.
3. Make a copy of your critical datasets in the cloud.
4. Use both BigQuery for fast analytics and GCS output for more advanced trials.
5. Audit costs per query.
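
Auditing costs per query can start very simply. The sketch below assumes an on-demand pricing model in which each query is charged by the bytes it is billed for; the job records and the price per TiB are hypothetical placeholders, and real figures would come from your job history or billing export.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def query_cost(bytes_billed, price_per_tib):
    """Estimated on-demand cost of one query from the bytes it was billed for."""
    return bytes_billed / TIB * price_per_tib


def cost_report(queries, price_per_tib):
    """Return (query_id, cost) pairs sorted from most to least expensive."""
    costs = [(q["id"], query_cost(q["bytes_billed"], price_per_tib)) for q in queries]
    return sorted(costs, key=lambda pair: pair[1], reverse=True)


# Hypothetical job records for illustration only.
queries = [
    {"id": "daily_report", "bytes_billed": 2 * TIB},
    {"id": "adhoc_scan", "bytes_billed": 10 * TIB},
    {"id": "small_lookup", "bytes_billed": TIB // 4},
]

report = cost_report(queries, price_per_tib=5.0)  # 5.0 is a placeholder price
for query_id, cost in report:
    print(f"{query_id}: {cost:.2f}")
```

Sorting by cost puts the queries worth optimizing first at the top of the report.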
Networking
High-bandwidth, low-latency, and consistent network connectivity is critical.
Pay attention to things like choosing the right region, the number of cores, or even the TCP window size.
To get full speed, a dedicated interconnect or direct peering is the way to go.
Multiple VPN tunnels are a good starting point to increase bandwidth.
Transfer appliances are an option for offline data migration.
Data Transfer Time
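
Transfer time can be estimated up front from data volume and link speed. The sketch below uses decimal terabytes and an assumed sustained link utilization of 80%; both the utilization figure and the example sizes are illustrative assumptions, not measurements.

```python
def transfer_days(data_terabytes, link_gbps, utilization=0.8):
    """Estimated days to move data over a link at a given sustained utilization."""
    bits = data_terabytes * 1e12 * 8           # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * utilization
    return bits / effective_bps / 86_400       # seconds to days


for gbps in (0.1, 1, 10):
    print(f"100 TB over {gbps} Gbps: {transfer_days(100, gbps):.1f} days")
```

When an estimate like this runs into weeks or months, an offline transfer appliance or a faster dedicated link becomes worth considering.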
Package Your Deployments
● Containers (Docker) for tooling.
● Deployment artifacts (Spark / MR jars).
● Tools like Spydra can help you execute your packages in both worlds.
$ cat example.json
{
  "client_id": "simple-spydra-test",
  "cluster_type": "dataproc",
  "log_bucket": "spydra-test-logs",
  "region": "europe-west1",
  "cluster": {
    "options": {
      "project": "spydra-test"
    }
  },
  "submit": {
    "job_args": [
      "pi",
      "8",
      "100"
    ],
    "options": {
      "jar": "hadoop-mapreduce-examples.jar"
    }
  }
}
$ spydra submit --spydra-json example.json
Other Important Features
● Cluster pooling - using init actions to kill old clusters
● Autoscaling - based on the workload
● Preemptible instances:
○ A reasonable choice for your cluster
○ Keep in mind final resilience (idempotence)
○ Available also with GPUs
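
Because a preemptible instance can disappear at any moment, tasks get retried, and a retried task must not corrupt or duplicate its output. One common way to keep a task idempotent is to write to a temporary file and publish it with an atomic rename. The sketch below uses a local file system as a stand-in for the job's output location; the function name and file names are hypothetical, and object stores have different rename semantics that need separate care.

```python
import os
import tempfile


def write_output_idempotently(final_path, lines):
    """Write to a temporary file, then atomically publish it under final_path."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.writelines(line + "\n" for line in lines)
        os.replace(tmp_path, final_path)  # atomic on the same file system
    except BaseException:
        # Clean up the partial file so a failed attempt leaves no trace.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "part-00000")
    # Simulate a task that finished, was preempted, and was then retried.
    write_output_idempotently(target, ["a", "b"])
    write_output_idempotently(target, ["a", "b"])
    with open(target) as handle:
        result = handle.read()
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]

print(repr(result), leftovers)
```

Running the task twice leaves exactly one complete output file and no temporary files, so readers never observe a partial result.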
No Long-Lived Services
● No patching! - YAY
● No wasting resources
● Latest security patches applied automatically
Predictions
Forrester predicts:
“SaaS vendors will de-prioritize their platform efforts to attain global scale. They will compete more at the platform level by running portions of their services on AWS, Azure, GCP or Oracle Cloud in 2018.”
Future
Interesting projects:
● Spark on k8s
● dA Platform 2
Kubernetes
Should I Stay or Should I Go?
There is no right answer; it's a tradeoff that depends on many variables.