
Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Flink operations at GoJek"



At GO-JEK, we build products that help millions of Indonesians commute, shop, eat and pay, daily. The Data Engineering team is responsible for creating a reliable data infrastructure across all of GO-JEK’s 18+ products. We use Flink extensively to provide real-time streaming aggregation and analytics for billions of data points generated daily. Working at such a large scale makes it really important to automate operations across infrastructure, failover, and monitoring. This way we can push features faster without causing chaos and disruption to the production environment.

1. Provisioning and deployment: Given the nature of the business at GoJek, we found ourselves provisioning Flink clusters quite often. Currently we run around 1000 jobs across 10 clusters for different data streams, with the number of requests increasing day by day. We also provision on-the-fly clusters with custom configuration for load testing, experimentation and chaos engineering. Provisioning this many clusters from the ground up required a lot of man-hours and involved setting up virtual machines, monitoring agents, access management, configuration management, load testing and data stream integration. Our current setup runs Flink on Yarn clusters as well as on Kubernetes. We use our in-house provisioning tool Odin, built on top of Terraform and Chef for Yarn clusters and on Kubernetes controllers for Kubernetes-based deployments. It enables us to safely and predictably create and modify Flink infrastructure. Odin has helped us reduce provisioning time by 99% despite the increasing number of requests.

2. Isolation and access control: Given the real-time and distributed nature of GoJek's services, events are classified into different streams depending on nature, time and transactional criticality, sensitivity and volume of data. This requires setting up separate clusters based on security concerns, team segregation, job loads and criticality, which comes at the cost of handling large-volume data replication and maintenance.

3. Data quality control: The quality of ingested events is controlled by a Protobuf-based, version-controlled, strict event type schema with a fully automated deployment pipeline. Deployed jobs are locked to a certain data schema and version, which helps us avoid accidental breaking schema changes and maintain backward compatibility during migration and failover.

4. Monitoring and alerting: All the clusters are monitored using a dedicated TICK setup. We monitor clusters for resource utilization, job stats and business impact per job.

5. Failover and upgrading: Failover and upgrade operations are fully automated for Yarn cluster failover and input stream failovers, e.g. Kafka failover with stateless job strategies. This helps us move jobs from one cluster to another without any data loss or broken metric flow.

6. Chaos engineering and load testing: Loki is our disaster simulation tool that helps ensure the Flink infrastructure can tolerat
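
The idea in item 3, that a deployed job is locked to one schema version, can be illustrated with a small sketch. This is only an assumption about how such locking might look, not GO-JEK's implementation: `SchemaRegistry`, `Job`, the event name `booking_event` and its field names are all hypothetical, and a tuple of field names stands in for a real Protobuf descriptor.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class SchemaRegistry:
    """Hypothetical in-memory registry: each publish creates a new immutable version."""

    def __init__(self) -> None:
        self._schemas: Dict[Tuple[str, int], Tuple[str, ...]] = {}
        self._latest: Dict[str, int] = {}

    def publish(self, name: str, fields: Tuple[str, ...]) -> int:
        # Versions start at 1 and increase by one per publish for a given name.
        version = self._latest.get(name, 0) + 1
        self._schemas[(name, version)] = fields
        self._latest[name] = version
        return version

    def get(self, name: str, version: int) -> Tuple[str, ...]:
        return self._schemas[(name, version)]


@dataclass(frozen=True)
class Job:
    """Hypothetical job record that pins (locks) one schema version."""

    name: str
    schema_name: str
    schema_version: int

    def fields(self, registry: SchemaRegistry) -> Tuple[str, ...]:
        # Always resolve the pinned version, never "latest".
        return registry.get(self.schema_name, self.schema_version)

    def rerun_with(self, version: int) -> "Job":
        # Re-running with another schema yields a new, explicitly re-locked job.
        return Job(self.name, self.schema_name, version)


registry = SchemaRegistry()
v1 = registry.publish("booking_event", ("order_id", "driver_id"))
job = Job("surge_aggregation", "booking_event", v1)

# A later schema change does not affect the already deployed job.
v2 = registry.publish("booking_event", ("order_id", "driver_id", "fare"))
relocked = job.rerun_with(v2)
```

Because the job resolves its pinned version rather than the latest one, publishing a new schema cannot change what a running job decodes; moving to the new schema is an explicit redeploy.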

Published in: Technology


  1. 1. Managing Flink operations at GO-JEK Ravi Suhag | Sumanth K N
  2. 2. ONE APP FOR ALL YOUR NEEDS Established in 2010 as a motorcycle ride-hailing phone-based service, GO-JEK has evolved into an on-demand provider of transport and other lifestyle services. GO-JEK?
  3. 3. Agenda: Resource Provisioning, Resource Isolation, Data Quality, Monitoring, Cluster Failovers, Chaos and Load Testing
  4. 4. As data engineers, we look out for patterns in data usage and transformation and build tools, infrastructure, frameworks, and services. SCALE, AUTOMATION, PRODUCT MINDSET
  5. 5. At GO-JEK, location is built into the fabric of all our products: 2B+ GPS points per day, 25M+ booking events per day, 3B+ API log events per day
  6. 6. Flink use cases at GO-JEK: surge pricing, API health monitoring, driver allocation monitoring, fraud detection and more …
  7. 7. Resource Provisioning: How do we create new clusters with an increasing number of requests for more than 15 internal teams?
  8. 8. Challenges: multiple cloud providers and DCs; frequent need for cluster provisioning; man-hour intensive and repetitive; error-prone process
  9. 9. Architecture
  10. 10. Example
  11. 11. Impact: provisioning time reduced by 90%; on-the-fly infra for load testing; Infrastructure as Code; self-serve and no workflows
  12. 12. Resource Isolation: How do we isolate resources for security, resilience, and segregation for more than 15 internal teams?
  13. 13. Why: critical Kafka topics handicapped by low-priority scripts; non-critical jobs wipe out resources of critical jobs; extensive downtime due to issues in Kafka; human errors
  14. 14. Isolation Architecture
  15. 15. Kafka input stream: nature, time and transactional criticality, sensitivity and volume of data. Flink clusters: security concerns, team segregation, job loads and criticality, which comes at the cost of handling large-volume data replication and maintenance.
  16. 16. Kafka Clusters Management
  17. 17. Flink Clusters Management
  18. 18. Job Management
  19. 19. Data Quality: How do we manage the quality of data for aggregation with more than 15 teams being responsible for producing data?
  20. 20. Challenges: JSON is weakly typed; errors during schema update; bad deployment sequence of an incompatible schema change; no data for real-time actors
  21. 21. Measures: Protobuf, strongly typed; maintains version of the schema; version locking per job; allows re-running a job with a different schema; upstream actors to have an `inadequate data points` behavior
  22. 22. Monitoring: How do we manage monitoring for multiple clusters across data centers?
  23. 23. Failovers: How do we manage Kafka input stream failover and Flink cluster failover for multiple clusters?
  24. 24. Why do we need failover? Kafka cluster down; Flink clusters down or in a bad state; Kafka or Flink cluster migration; job migration for teams
  25. 25. Kafka Failover
  26. 26. Kafka Failover
  27. 27. Kafka Failover
  28. 28. Flink Failover
  29. 29. Flink Failover
  30. 30. Impact: Kafka input stream resiliency; allocate jobs dynamically to clusters; testing and upgrades become easy; self-serve and no workflow
  31. 31. Chaos and Load Testing: How do we build confidence in system behaviour through disaster simulation experiments?
  32. 32. LOKI for chaos engineering: disaster simulation; load testing; reports; CLI
  33. 33. Architecture
  34. 34. Simulation: simulate throughput in Flink jobs with Kafka ingestion; play chaos simulation on Flink cluster or Kafka cluster; generate reports; prepare playbook
  35. 35. Reports
  36. 36. Road Ahead: Flink on Kubernetes; Complex Event Processing; Data Enrichment
  37. 37. Let’s talk! Ravi Suhag @ravi_suhag Sumanth K N
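
The failover slides mention allocating jobs dynamically to clusters when a Flink cluster is down. The slides do not say how targets are chosen, so the following is a minimal sketch under an assumed policy: each job from the failed cluster moves to the healthy cluster that currently runs the fewest jobs, with ties broken by cluster name. The function `reassign_jobs` and the cluster and job names are hypothetical, and the sketch assumes stateless jobs, so no state migration is modelled.

```python
from typing import Dict, List


def reassign_jobs(assignments: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new cluster -> jobs mapping with the failed cluster's jobs moved away.

    Assumed policy (not from the talk): least-loaded healthy cluster first,
    ties broken alphabetically by cluster name.
    """
    # Copy the healthy clusters so the input mapping is left untouched.
    healthy = {c: list(jobs) for c, jobs in assignments.items() if c != failed}
    if not healthy:
        raise ValueError("no healthy cluster available for failover")

    for job in assignments.get(failed, []):
        target = min(healthy, key=lambda c: (len(healthy[c]), c))
        healthy[target].append(job)
    return healthy


before = {
    "cluster-a": ["job-1", "job-2", "job-3"],
    "cluster-b": ["job-4"],
    "cluster-c": [],
}
after = reassign_jobs(before, failed="cluster-a")
```

Returning a new mapping instead of mutating the input keeps the previous assignment available, which is convenient if the move has to be rolled back.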