The way you operate your Big Data environment is not going to be the same anymore. This session is based on our experience managing on-premises environments and on the lessons learned from innovative data-driven companies that successfully migrated their multi-petabyte Hadoop clusters. We cover where to start and which decisions you have to make to gradually become cloud ready. The examples refer to Google Cloud Platform, yet the challenges are common.
Elephants in The Cloud
or How to Become Cloud Ready
Krzysztof Adamski, GetInData
So You Say You Don’t Use Cloud?
HR system, online documents, mobile phone, email server
Trust as a Key Factor
Image source: https://www.forbes.com/sites/louiscolumbus/2017/04/23/2017-state-of-cloud-adoption-and-security
More Secure or Not?
In the end, do you really think you can do security better than the cloud providers?
1. How fast can you start/expand your analytics initiative?
2. How often is your cluster fully busy while your employees want more computing power right now?
3. How much time do you spend on maintaining your infrastructure?
4. How much time does it take you to gracefully apply all the security patches in your Hadoop cluster?
5. Do you need hardware that you don't have in your data center, e.g. GPUs or terrible amounts of RAM?
● Transition from infrastructure engineering towards data engineering
● Use the best possible technology stack on the market
● Free your time
● Attract the best engineers
● Ultimate world domination ;)
Before You Start
Netflix, Spotify, Etsy
Strong Global Consistency
Google Cloud Storage provides strong global consistency for the following operations, including both data and metadata:
● Bucket listing, object listing
● Granting access to resources
Revoking access from resources, however, is only eventually consistent: it typically takes about a minute to take effect, and in some cases it may take longer. Beware of object caching, too.
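A minimal sketch of what this means in practice with gsutil; the bucket name and user account are made up for illustration:
# Granting access is strongly consistent: the user can read objects right away.
$ gsutil iam ch user:analyst@example.com:objectViewer gs://my-data-lake
# Revoking access is eventually consistent: allow a minute or more before
# assuming the permission is really gone.
$ gsutil iam ch -d user:analyst@example.com:objectViewer gs://my-data-lake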
● Pay-per-second billing
Keep in mind that if you often run sub-10-minute analyses on VMs, serverless options may be better suited, since VMs are relatively slow to boot and serverless functions are billed in 100 ms increments.
● Isolated / self-contained use cases
● Use cases with mainly external data
● Global use cases
● Prepare your Hadoop cluster to interact with object storage.
● Look for existing operators for popular tools like Apache Airflow.
● Make a copy of your critical datasets to GCS (see the sketch after this list).
● Use both BigQuery for fast analytics and GCS output for more advanced trials.
● Audit costs per query.
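A minimal sketch of the copy and cost-audit steps, assuming the GCS connector is already installed on the Hadoop cluster and a service account is configured; bucket, path and dataset names are made up:
# Copy a critical dataset from HDFS to a GCS bucket (hypothetical paths).
$ hadoop distcp hdfs:///data/events/2018/ gs://my-backup-bucket/events/2018/
# Estimate BigQuery cost before running a query: --dry_run reports the bytes
# that would be scanned without actually executing (or charging for) the query.
$ bq query --dry_run --use_legacy_sql=false \
    'SELECT user_id, COUNT(*) AS events FROM analytics.events GROUP BY user_id'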
● High bandwidth, low latency and consistent network connectivity is crucial (a quick way to measure it is sketched after this list).
● Pay attention to things like choosing the right region, the number of cores or even TCP tuning.
● To get the full speed, dedicated interconnect / direct peering is the way to go.
● Multiple VPN tunnels are a good starting point to increase throughput.
● Transfer appliances for offline data migration.
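One quick way to check what your current link to GCS actually delivers before investing in an interconnect; the bucket and local paths are made up:
# Run Google's built-in diagnostic against a test bucket to measure
# read/write throughput and latency from your data center.
$ gsutil perfdiag gs://my-test-bucket
# For large one-off uploads, parallelism helps: -m copies files in parallel
# and composite uploads split big files into parallel chunks.
$ gsutil -m -o "GSUtil:parallel_composite_upload_threshold=150M" \
    cp -r /data/exports gs://my-test-bucket/exports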
Package Your Deployments
● Containers (Docker) for tooling.
● Deployment artifacts (Spark / MR jobs).
● Tools like Spydra can help you execute your packages in both environments.
$ cat example.json
$ spydra submit --spydra-json example.json
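If you are not using Spydra, a roughly equivalent ephemeral-cluster flow can be sketched with plain gcloud; the cluster, bucket, jar and class names are made up and flags may vary between gcloud versions:
# Create a short-lived cluster, run the packaged job, then delete the cluster.
$ gcloud dataproc clusters create ephemeral-etl --region=europe-west1 --num-workers=2
$ gcloud dataproc jobs submit spark --cluster=ephemeral-etl --region=europe-west1 \
    --class=com.example.EventsJob --jars=gs://my-artifacts/events-job.jar -- 2018-10-01
$ gcloud dataproc clusters delete ephemeral-etl --region=europe-west1 --quiet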
Other Important Features
● Cluster pooling - using init actions to kill old clusters
● Autoscaling - based on the workload
● Preemptible instances:
○ A reasonable choice for your cluster
○ Keep in mind failure resilience (idempotence)
○ Available also with GPUs
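A hedged sketch of what these features look like with gcloud Dataproc; names are made up and some flags (e.g. --max-idle) may require the beta track depending on your gcloud version:
# Cluster with cheap preemptible workers that deletes itself after 30 minutes of idle time.
$ gcloud beta dataproc clusters create batch-pool --region=europe-west1 \
    --num-workers=2 --num-preemptible-workers=8 --max-idle=30m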
No Long-Lived Services
● No patching! - YAY
● No wasting resources
● Latest security patches
SaaS vendors will de-prioritize their own platform efforts; instead, they will compete more at the platform level by running portions of their services on AWS, Azure, GCP or Oracle Cloud.
● Spark on k8s
● dA Platform 2
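For context, a hedged sketch of what submitting to Spark on Kubernetes looks like (Spark 2.3+); the API server address, container image and jar path are made up:
# Submit a Spark application directly to a Kubernetes cluster.
$ spark-submit \
    --master k8s://https://k8s-apiserver.example.com:6443 \
    --deploy-mode cluster \
    --name events-job \
    --class com.example.EventsJob \
    --conf spark.executor.instances=4 \
    --conf spark.kubernetes.container.image=example/spark:2.4.0 \
    local:///opt/spark/jars/events-job.jar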