
Elephants in the cloud or How to become cloud ready



The way you operate your Big Data environment is not going to stay the same. This session is based on our experience managing on-premise environments and on lessons from innovative data-driven companies that have successfully migrated their multi-PB Hadoop clusters: where to start, and which decisions you have to make to gradually become cloud ready. The examples refer to Google Cloud Platform, yet the challenges are common.

Published in: Technology


  1. Elephants in The Cloud or How to Become Cloud Ready Krzysztof Adamski, GetInData
  2. So You Say You Don’t Use Cloud? HR System, Online Documents, Mobile Phone, Email Server
  3. Trust as a Key Factor Image source:
  4. More Secure or Not? In the end, do you really think you can provide better infrastructure security than the cloud providers?
  5. Migration Questions
     1. How fast can you start/expand your analytics initiative?
     2. How often is your cluster fully busy while your employees want more computing power right now?
     3. How much time do you spend on maintaining your infra?
     4. How much time does it take you to gracefully apply all the security patches in your Hadoop cluster?
     5. Do you need hardware that you don’t have in your data center, e.g. GPUs or huge amounts of RAM?
  6. Hadoop Operations at Scale
  7. Migration Goals
     1. Transition from infrastructure engineering towards data engineering
     2. Use the best possible technology stack in the world
     3. Free your time
     4. Attract the best engineers
     5. Ultimate world domination ;)
  8. Krzysztof Adamski
  9. Before You Start
     ● Be smart with which service you choose - technology choices
     ● Avoid lock-in - yet another migration
     ● Try to estimate the costs - hardware, engineering, legal
     ● See what others are doing - Netflix, Spotify, Etsy
  10. What’s different in the Cloud?
  11. Decoupled storage and processing
  12. Different Technologies (Hadoop Ecosystem → Google Cloud Platform)
      ● File System: HDFS → Google Cloud Storage
      ● Key Value Store: HBase, Cassandra → BigTable
      ● SQL: Hive, SparkSQL, Presto → BigQuery
      ● Messaging Queue: Kafka → PubSub
      ● Geo-Replicated RDBMS: CockroachDB → Spanner
  13. Cloud Storage Decision Tree
  14. Storage Connectors
  15. Strong Global Consistency
      Google Cloud Storage provides strong global consistency for the following operations, covering both data and metadata:
      ● Read-after-write
      ● Read-after-metadata-update
      ● Read-after-delete
      ● Bucket listing, object listing
      ● Granting access to resources
  16. Eventual Consistency
      ● Revoking access from resources
      Revoking access typically takes about a minute to take effect; in some cases it may take longer. Beware of caching, though.
  17. Pricing
      ● Pay-per-second billing
      Keep in mind that if you often run sub-10-minute analyses on VMs, serverless options may be better suited: VMs are relatively slow to boot, while serverless functions are billed in 100 ms increments.
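The boot-time overhead described above can be put into numbers. The sketch below compares per-second VM billing (where you also pay while the VM boots) against 100 ms serverless billing; all prices and the boot time are illustrative assumptions, not real GCP list prices.

```python
import math

# Back-of-the-envelope cost comparison: a VM billed per second (boot time
# is paid for but does no useful work) vs. a serverless function billed in
# 100 ms increments. Prices and boot time are made-up illustrative values.

def vm_cost(job_seconds, boot_seconds=45, price_per_second=0.00053):
    """Per-second billing; the boot time is included in the bill."""
    return math.ceil(job_seconds + boot_seconds) * price_per_second

def serverless_cost(job_seconds, price_per_100ms=0.000059):
    """No boot overhead; billed in 100 ms slices."""
    return math.ceil(job_seconds * 10) * price_per_100ms

for seconds in (30, 300, 3600):
    print(f"{seconds:>5} s  vm={vm_cost(seconds):.4f}  "
          f"serverless={serverless_cost(seconds):.4f}")
```

With these assumed prices the serverless option wins for short jobs (the fixed boot time dominates), while the VM wins for hour-long jobs; your own break-even point depends on real prices and boot times.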
  18. I want to start. What’s next?
  19. Data repository in a good shape
  20. Find the Best Candidates for Migration
      ● Isolated / self-contained applications
      ● Applications with mainly external (public data) dependencies
      ● Global use cases
  21. Baby Steps
     1. Prepare your Hadoop cluster to interact with object storage.
     2. Look for existing operators for popular tools like Apache Airflow.
     3. Make a copy of your critical datasets to the cloud.
     4. Use both BigQuery for fast analytics and GCS output for more advanced trials.
     5. Audit costs per query.
  22. Networking
      High-bandwidth, low-latency, consistent network connectivity is critical. Pay attention to things such as choosing the right region, the number of cores, and even the TCP window size. Multiple VPN tunnels are a good starting point to increase bandwidth, but for full speed a dedicated interconnect / direct peering is the way to go. Use transfer appliances for offline data migration.
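Why the TCP window size matters on a high-latency link to a cloud region can be shown with the bandwidth-delay product. The numbers below (10 Gbit/s link, 20 ms RTT) are illustrative assumptions:

```python
# Bandwidth-delay product: the TCP window needed to keep a long, fat pipe
# to a cloud region fully utilized. Link speed and RTT are example values.

def required_window_bytes(bandwidth_gbps, rtt_ms):
    """Window (bytes) = bandwidth (bits/s) * RTT (s) / 8."""
    return int(bandwidth_gbps * 1e9 * (rtt_ms / 1000) / 8)

# A 10 Gbit/s interconnect with 20 ms RTT needs a ~25 MB window.
win = required_window_bytes(10, 20)
print(f"required window: {win / 1e6:.1f} MB")

# With a fixed 64 KB window the throughput ceiling is far below line rate:
ceiling_mbps = 64 * 1024 * 8 / (20 / 1000) / 1e6
print(f"64 KB window caps at ~{ceiling_mbps:.0f} Mbit/s")
```

In other words, an untuned 64 KB window would cap a 10 Gbit/s link at roughly 26 Mbit/s, which is why window scaling and tuning deserve attention before blaming the link itself.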
  23. Data Transfer Time
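A quick back-of-the-envelope calculation shows why offline transfer appliances come up at multi-PB scale. The link efficiency factor below is an assumption:

```python
# How long does it take to move a multi-PB dataset over the wire?
# Assumes the link sustains 70% of its nominal rate (an illustrative guess).

def transfer_days(dataset_tb, link_gbps, efficiency=0.7):
    """Days needed to push dataset_tb terabytes over a link_gbps link."""
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400

for gbps in (1, 10, 100):
    print(f"1 PB over {gbps:>3} Gbit/s: {transfer_days(1000, gbps):.1f} days")
```

Even over a dedicated 10 Gbit/s interconnect, 1 PB takes on the order of two weeks; that is the point where shipping a transfer appliance starts to beat the network.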
  24. Package Your Deployments
      ● Containers (Docker) for tooling.
      ● Deployment artifacts (Spark / MR jars).
      ● Tools like Spydra can help you execute your packages in both worlds:
      $ cat example.json
      {
        "client_id": "simple-spydra-test",
        "cluster_type": "dataproc",
        "log_bucket": "spydra-test-logs",
        "region": "europe-west1",
        "cluster": {
          "options": { "project": "spydra-test" }
        },
        "submit": {
          "job_args": [ "pi", "8", "100" ],
          "options": { "jar": "hadoop-mapreduce-examples.jar" }
        }
      }
      $ spydra submit --spydra-json example.json
  25. Other Important Features
      ● Cluster pooling - using init actions to kill old clusters
      ● Autoscaling - based on the workload
      ● Preemptible instances:
        ○ A reasonable choice for your cluster
        ○ Keep resilience (idempotence) in mind
        ○ Available also with GPUs
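Preemptible workers can vanish mid-task, so every task must be safe to retry. One common pattern is to write output to a temporary file and commit it with an atomic rename, so a rerun either sees the complete result or nothing. The sketch below is a minimal local-filesystem illustration; object stores need their own commit protocol (e.g. staging prefixes), and the function and path names are made up for the example.

```python
import os
import tempfile

def commit_output(path, data):
    """Idempotent write: retries are harmless, readers never see partial files."""
    if os.path.exists(path):            # a previous attempt already committed
        return
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)           # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)                  # clean up the half-written temp file
        raise

# Running the "task" twice yields exactly one complete output file.
out_path = os.path.join(tempfile.mkdtemp(), "part-00000")
commit_output(out_path, b"results")
commit_output(out_path, b"results")
```

Because the second call is a no-op and readers only ever observe fully written files, a preempted and restarted task cannot corrupt the output.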
  26. No Long-Lived Services
      ● No patching! - YAY
      ● No wasted resources
      ● Latest security patches applied automatically
  27. Predictions
      “Forrester predicts SaaS vendors will de-prioritize their platform efforts to attain global scale. They will compete more at the platform level by running portions of their services on AWS, Azure, GCP or Oracle Cloud in 2018.”
  28. Future
      Interesting projects:
      ● Spark on k8s
      ● dA Platform 2
  29. Kubernetes
  30. Should I Stay or Should I Go?
      There is no right answer - it's a tradeoff that depends on many variables.