Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Extending DevOps to Big Data Applications with Kubernetes


Published on

DevOps, continuous delivery and modern architectural trends can incredibly speed up the software development process. Big Data applications cannot be an exception and need to keep the same pace.

Published in: Engineering

Extending DevOps to Big Data Applications with Kubernetes

  2. 2. ABOUT ME NICOLA FERRARO Software Engineer at Red Hat Working on Apache Camel, Fabric8, JBoss Fuse, Fuse Integration Services for Openshift Previously Solution Architect for Big Data systems at Engineering (IT) @ni_ferraro
  3. 3. SUMMARY Overview of Big Data Systems and Apache Spark DevOps, Docker, Kubernetes and Openshift Origin Demo: Oshinko, Kafka, Spring-Boot and Fabric8 tools
  4. 4. BIG DATA Big Data ≈ Machine Learning at Scale Big Data systems are capable of handling data with high: Volume Terabytes, petabytes, ... Log files Transactions ​Velocity Streaming ​data Micro batching Near real-time ​Variety Structured data​ Images Videos Free text Volume Velocity Variety Here > 100 machines
  5. 5. EVOLUTION OF BIG DATA SOFTWARE Hadoop Pig Hive (a whole zoo) Spark
  6. 6. EVOLUTION OF BIG DATA INFRASTRUCTURE Commodity hardware (Really ?) Specialized hardware (Appliances) The Cloud
  7. 7. EVOLUTION OF ARCHITECTURES: BATCH The first Big Data architecture was heavily based on Hadoop: MapReduce + HDFS. Some commercial advertising motivations: Extract meaningful information from raw data: "companies analyze around <<put-a- random-number-between-0-and-20- here>> % of the data they produce" More data = more precision for machine learning algorithms = better insights Assist the decision making process: predict the future using all the available data SELECT page, count(*) FROM logs
  8. 8. ARCHITECTURES: BATCH Hadoop (HDFS + MapReduce) Nodes App Server Logs + Data (Flume) Report (DWH) Client Batch Processing (Hive) MR
  9. 9. EVOLUTION OF ARCHITECTURES: HYBRID Second generation architectures were focused to improve user experience through machine learning: Not everything could be executed in streaming, for performance reasons Execute heavyweight steps offline in batches (generate the "view", e.g. a machine learning model) Execute lightweight steps online with streaming applications The view must be refreshed periodically +
  10. 10. ARCHITECTURES: HYBRID (LAMBDA) Hadoop Nodes App Server Client Batch Processing NoSQL + Messaging + Other Streaming Processing
  11. 11. EVOLUTION OF ARCHITECTURES: STREAMING The new generation of architectures are streaming only: Provide immediate feedback to the user No need to execute offline work Provide a personalized experience to each user Here we focus (mainly) on streaming applications ...
  12. 12. ARCHITECTURES: STREAMING (KAPPA) Hadoop Nodes App Server Client NoSQL + Messaging + Other Streaming Processing Here we focus (mainly) on streaming applications ...
  13. 13. APACHE SPARK One of the main reason of Spark success is that it allows to define, by design, with a unified model: Batch processing and streaming (MapReduce / Storm) Multi-language: core in Scala, can be also used in Java, Python and R Declarative semantics (SQL), procedural and functional programming (Hive / MapReduce / Pig) General purpose data processing and machine learning (MapReduce / Mahout) Logistic regression in Spark vs. Hadoop (it was so in 2013) And the performance are really impressive!
  14. 14. SPARK ARCHITECTURE Driver Cluster Manager Driver (2nd app) Workers Driver App (main) Executor Executor Single Worker Tasks: "do something on a data partition" Assign executors to the driver app Define the app workflow Do the dirty job Logical viewPhysical view
  15. 15. SPARK ARCHITECTURE: DATA DISTRIBUTION HDFS / S3 Spark Executo r Spark Executo r Spark Executo r r/w Spark Driver Kafka Partition Spark Executo r Spark Executo r Spark Executo r Spark Driver Kafka Partition Kafka Partition HDFS / S3 HDFS / S3 Why distribution? ... to spread the work across thousands of workers
  16. 16. SPARK DEMO A simple quickstart showing a batch application doing a word count. The purpose is showing which part of the software is executed in the driver and which parts are sent to the remote executors.
  17. 17. SPARK: WHAT CAN GO WRONG ??? Programming with Spark is awesome, but there are a lot of problems you need to solve: Most of the problems do not happen in a standalone cluster (serialization issues, classloading issues, network issues, ...) Something that works well with a small dataset hangs forever in production: you must test it earlier with a lot of data! Spark applications must be tuned: decide caching steps, decide partitioning, reduce the number of shuffles, do not "collect()" For streaming apps: Exactly-once message semantics cannot be guaranteed all the times: you have to write idempotent operations to take into account duplicate messages org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCl at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$Closur at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:1 at org.apache.spark.SparkContext.clean(SparkContext.scala:2055) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationSco at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationSco at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at
  18. 18. AND SPARK IS THE MINOR PROBLEMEvery operation is executed by different software distributed in hundreds of machines. Production different from the development environment: difficult to maintain software aligned (patches and upgrades of NoSQL databases, message brokers, ...): you need to test the integration with different versions of external systems The whole system randomly doesn't work: e.g. you expect a message in a queue or you expect a result in a query, and your expectations are not satisfied Debugging is hard: it's better to fix bugs earlier, but bugs appear late in production! There are consistency problems in some NoSQL databases: the CAP theorem​ Too many steps from development to release, with different teams involved: development, qa, operations I can continue...
  19. 19. THE 2 MAIN PROBLEMS 1 The local development environment is usually too different from production (replication and versions), so you find bugs late 2 Big Data applications are difficult to test effectively, because configuring a good testing environment is a hard and long task: it's difficult to debug bugs all together How? we can apply DevOps practices with Kubernetes ... Easy Solution: Test early Test more Test the production software
  20. 20. SUMMARY Overview of Big Data Systems and Apache Spark DevOps, Docker, Kubernetes and Openshift Origin Demo: Oshinko, Kafka, Spring-Boot and Fabric8 tools
  21. 21. DEVOPS No common definition Combination of software development, qa and operations: working together to release working software frequently Applying common dev and agile techniques to operations It's a cultural change, rather than just a change in development and administration tools Common practices: Infrastructure as code Continuous integration Continuous delivery DEV OPS
  22. 22. WHY DEVOPS Increased communication and collaboration among: dev, ops, qa, business Fast time to market: accomodating business and customer needs quickly to have competitive advantage Embrace change, do not fear it Embrace innovation (and team engagement), through usage of new technologies Improved quality of software through automated testing DEV OPS
  23. 23. A DEVOPS WORKFLOW The key point of DevOps is being able to collaborate on the infrastructure the same way you collaborate on software: Common shared repository (SCM) Pull requests Peer review Automated checks (correctness, quality) Continuous integration Continuous delivery Testing the infrastructure like you test software: scalability, fault tolerance, backup, recovery options. DevOps practices become difficult to apply and uneffective when your infrastructure does not follow the software release cycle.
  24. 24. DOCKER Docker has revolutionized the way we build and ship software. Containers can be created using a Dockerfile (example on the right) There are at least 10 different ways to create Docker containers A container is not a VM (even if it seems really close to it) A container hosts a single application Containers need to be orchestrated to deliver their full power FROM ubuntu:16.04 RUN apt-key adv --keyserver hkp:// RUN echo "deb $(cat RUN apt-get update && apt-get install -y mongodb-org # Create the MongoDB data directory RUN mkdir -p /data/db # Expose port #27017 from the container to the host EXPOSE 27017 # Set /usr/bin/mongod as the dockerized entry-point app ENTRYPOINT ["/usr/bin/mongod"]
  25. 25. DOCKER ARCHITECTURE Docker is the building block of infrastructure as code: The image is completely described by the Dockerfile Any container is stateless (although stateful data can be attached with volumes). This means: You don't ssh your server and change the configuration You rebuild a new version of the application when you need to change something "Deployment" means basically replacing a container with a different container
  26. 26. KUBERNETES Cloud platform, with several deployment options (including local and private cloud), to Orchestrate (Docker) containers: Born at Google Production ready (Pokemon Go!) Provides: Application composition in namespaces Virtual networks Service discovery Load balancing Auto (and manual) scaling Automatic recovery Health checking Rolling upgrades Monitoring Logging .... kubectl run --image=nginx nginx-app --port=80 oc run --image=nginx nginx-app --port=80 # or better.. # oc new-app nginx --name nginx-app2 BOTH Kuebertes and Openshift Origin are Open Source projects!
  27. 27. KUBERNETES ARCHITECTURE Very high level architecture of Kubernetes. Note: it resembles the Spark architecture seen in previous slides (with the Kubernetes master playing the role of the cluster manager), but we want to deploy a whole Spark application inside Kubernetes (including the driver).
  28. 28. KUBERNETES FOR LOCAL DEV Minikube OPENSHIFT FOR LOCAL DEV Minishift "oc cluster up" $ minikube start Starting local Kubernetes cluster... Running pre-create checks... Creating machine... Starting local Kubernetes cluster... $ kubectl run hello-minikube --port=8080 $ minishift start ... $ oc cluster up -- Checking OpenShift client ... OK -- Checking Docker client ... OK ...
  29. 29. SUMMARY Overview of Big Data Systems and Apache Spark DevOps, Docker, Kubernetes and Openshift Origin Demo: Oshinko, Kafka, Spring-Boot and Fabric8 tools
  30. 30. OPENSHIFT ORIGIN DEMO A simple recommender system using collaborative filtering on Openshift origin. Deploy a: Spring-boot microservice Kafka Broker Spark Cluster Into openshift origin. Source code: Rating **** Rating **** Recommendations Recommendations Spark Cluster
  31. 31. SPARK ON KUBERNETES The problem: A Spark application needs a cluster of n worker machines to run You need 1 cluster manager node The application runs on 1 driver node We want to deploy everything inside Kubernetes or Openshift Origin Basically, every piece of the Spark architecture should become a POD. There are multiple solutions. I'm going to introduce the Oshinko project. Driver Cluster Manager Recall the Spark architecture PODs Worker Worker Worker
  32. 32. OSHINKO Making Apache Spark Cloud Native on Openshift Origin. W WW CM D Spark Cluster W WW CM D Spark Cluster 1 cluster per application! Deployed in the application namespace Oshinko console
  33. 33. DEMO TIME We will deploy the Oshinko rest server and then deploy the spark streaming application. It will connect to Kafka to retrieve ratings and return personalized (not really) recommendations. Sources: I used a basic implementation of the slope one algorithm ( ) for collaborative filtering. It's just an example of "real life" custom Spark application that sometimes doesn't work :)
  34. 34. TESTING One of the major advantages of making a cloud native Spark application is the possibility of executing system tests on software identical to production. Production pods Optional testing tools or load generators Test driver (inside or outside the namespace) You can create virtual test environments on the fly and do assertions. Demo: using fabric8-arquillian and the fabric8 kubernetes client
  35. 35. TESTING: SOME SCENARIOS Some tests that you may want to run: Deploy the production infrastructure (or part of it) and run a suite of functional tests to ensure every piece works as expected to produce the result Deploy the production infrastructure and do performance tests using external or injected data Deploy the production infrastructure (or part of it) and check that every piece is healthy. Then randomly kill pods while sending a load to test the system availability (system tests) The Chaos Monkey!
  36. 36. FABRIC8 We have used some tools from Fabric8: Fabric8 Maven Plugin: to build Kubernetes resources for Java projects Docker Maven Plugin: used by the fabric8 maven plugin to create docker images Fabric8 Arquillian: to create system tests from artifacts using the fabric8 plugin Fabric8 Kubernetes Client: to interact with Kubernetes resources (pods, services, etc.) Fabric8 is a complete development platform for Kubernetes! Uses Jenkins Pipelines and to provide: Continuous integration Continuous delivery Out of the box!
  37. 37. QUESTIONS?
  38. 38. Q/A What about storage and data locality? In cloud native applications, using a storage system like s3 has many advantages over HDFS (costs, availability, usage patterns in testing). Performance of HDFS are superior (6x), but you can use larger clusters for a limited amount of time to gain the same performance. More info: Can you scale the application directly from Openshift, or you always need to change the code? Sure, you can use Openshift to scale the application. Beware that, in a pure DevOps approach, changing the configuration of a system at runtime is not something you should do, especially on production. You lose the benefits of having all the software and configuration in a "executable" state, and you lose the effectiveness of system tests because tests will not run on an environment identical to production. As an alternative, you can use the auto-scaling feature (enabling it in the code). Is the Spark demo using a machine learning algorithm? The demo is using a basic collaborative filtering algorithm called "slope one" (the focus of this talk is not machine learning). Spark provides many algorithms out of the box, also for collaborative filtering. Many of them are supposed to be executed in a batch application, rather than in a streaming app.
  39. 39. THANKS Follow me on Twitter! @ni_ferraro