
OSDC 2019 | Democratizing Data at Go-JEK by Maulik Soneji


With 6666x growth in the last 3 years, the data generated has grown exponentially. As Data Engineers at Go-JEK, we found that our entire team was tied up handling infrastructure requests. This led us to create an internal portal that lets other teams self-provision their data infrastructure. This talk is divided into two parts. In the first part, I will cover how we scaled our data engineering infrastructure to handle more than 40 million messages per day, including data consumption, aggregation, monitoring and cold storage. In the second part, I will cover how we built our internal portal for infrastructure orchestration. The infrastructure, backed by Kubernetes, enables teams to self-provision data infrastructure without any supervision.




  1. 1. Democratizing Data at Go-jek
  2. 2. What do we mean by Democratizing Data?
  3. 3. What do we mean by Democratizing Data?
      - Self serve in a way that you are dependent on tools and not on people
      - Abstract in such a way that people only need to know:
        - the data they are publishing
        - the insights that they want
  4. 4. Agenda
      - About Go-jek
      - Core components of Data Platform
      - Building Data platform around core components
      - Whole picture of how components interact with each other
      - References
      - QnA
  5. 5. About Go-jek
      - A mobile app for daily needs
        - Transport
        - Food Delivery
        - Payments
        - Logistics
      - 18+ products
      - Operating internationally
        - Indonesia
        - Singapore
        - Vietnam
        - Thailand
      - 2010: Call-center for ojek* services
      - 2015: App launched with 3 services
      - 2016: Expansion and new services
      *ojek is an Indonesian term for motorcycle ride hailing
  6. 6. Uses of Data
      - Application Monitoring
      - Fraud Detection
      - Pricing
      - Report Generation
      - User segmentation
      Let's talk about the scale of Data
      Growth of Data
      - 6666x growth in 18 months
      - Expansion to 3 countries (Singapore, Vietnam, Thailand)
      Type of Data
      - 18+ products ranging from:
        - Ride
        - Food Delivery
        - Payments
        - Shipments
      Given the volume of data at Go-jek, it becomes essential to build a data platform that works at scale and supports automation.
  7. 7. Core Components
  8. 8. Core Components
      - Components that serve as the backbone of the data platform
      - We have a bunch of these core components for different use cases
      - Multiple options are available for replacing any of these core components with another well-suited option
  13. 13. Core Components
      - Main components of our Data platform:
        - Kafka => message broker for producing and consuming data
        - Flink => real time stream processing
        - Kubernetes => deployment and management of containers
        - Airflow => workflow management
        - Dataproc => managed Spark cluster for batch processing
  14. 14. Data Platform
  15. 15. Typical day for Data Engineer at Gojek
      - Tackling the following aspects of Data:
        - Ingestion
        - Consumption
        - Aggregation
        - Cold Storage
        - Visualization (in early stages currently)
      - Automation for Resource creation:
        - Resource is any tool used for data ingestion, consumption etc.
  16. 16. Focus areas for scaling
      - How to take care of these aspects while scaling the data platform:
        - Reliability => No data loss
        - Abstraction => Downtime due to one team shouldn’t impact other teams
        - Automation => Infrastructure, deployment
        - Monitoring => Monitoring and alerting for all resources
        - Self Serve => Anyone can access the platform without intervention
  17. 17. Data Ingestion - Uniformity by Protobufs
      - All Kafka messages are serialized/deserialized using protobuf
      - Teams maintain the protobuf schema
      - Client libraries in various languages to push messages
      - Helps in terms of:
        - Defining a schema for pushing to/reading from Kafka
        - No breaking changes: a newer version of the schema shouldn’t break consumers of the older version
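
The "no breaking change" point on this slide relies on how protobuf encodes messages: every field is tagged with a number, and a reader skips field numbers it does not know. The sketch below hand-encodes a small subset of the protobuf wire format (varint and length-delimited fields only) to show that behaviour. It is an illustration, not the client library used at Go-jek; the field names `order_id`, `amount` and `promo_code` and the two schema versions are hypothetical.

```python
from typing import Dict, Tuple, Union

VARINT, LENGTH_DELIMITED = 0, 2  # the two protobuf wire types this sketch handles


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number: int, value: Union[int, str]) -> bytes:
    """Encode one field as tag + payload (ints as varint, strings as bytes)."""
    if isinstance(value, int):
        return encode_varint((number << 3) | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    tag = encode_varint((number << 3) | LENGTH_DELIMITED)
    return tag + encode_varint(len(data)) + data


def decode(buf: bytes, known_fields: Dict[int, str]) -> Dict[str, Union[int, str]]:
    """Decode a message, silently skipping field numbers the reader does not know."""
    message: Dict[str, Union[int, str]] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in known_fields:
            message[known_fields[number]] = value
    return message


# Hypothetical schema versions: v2 adds field 3 without touching fields 1 and 2.
V1_FIELDS = {1: "order_id", 2: "amount"}
V2_FIELDS = {1: "order_id", 2: "amount", 3: "promo_code"}

v2_bytes = encode_field(1, "RB-1") + encode_field(2, 15000) + encode_field(3, "HEMAT")
print(decode(v2_bytes, V1_FIELDS))  # an old consumer ignores the new field
print(decode(v2_bytes, V2_FIELDS))  # a new consumer sees all three fields
```

The practical rule that follows is to add new fields under new numbers and never reuse or renumber existing ones, so older consumers keep working.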
  18. 18. Why Protobufs?
      - All protobufs can be imported like a client library
      - Use the protoc command to generate a library for different languages
      - Easy to import and use
      Create Golang library from Java library
  19. 19. Data Ingestion - Reliability
      - Reliability by Fronting Systems
        - Fronting serves as a reliable way to push data to Kafka
        - First pushes to an internal Kafka and then pushes batched data to the main Kafka
        - Internal Redis store to hold failed messages
        - Individual frontings for each team to isolate issues
        - High availability with HAProxy
      - Can be replaced with any system that enables reliable push to Kafka
        - Sidekiq jobs
        - Application level retry mechanism
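
The fronting flow on this slide (buffer locally, push batches to the main cluster, park failed batches for retry) can be sketched as below. This is a minimal in-memory illustration under stated assumptions: the `deque` objects stand in for the internal Kafka and the Redis store, and `FlakyProducer`, `Fronting` and their methods are hypothetical names, not Go-jek's implementation.

```python
from collections import deque
from typing import Deque, List


class FlakyProducer:
    """Stand-in for the main Kafka producer; fails the first `failures` sends."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.delivered: List[str] = []

    def send_batch(self, batch: List[str]) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("main cluster unavailable")
        self.delivered.extend(batch)


class Fronting:
    """Accepts messages, forwards them in batches, and keeps failed batches."""

    def __init__(self, producer: FlakyProducer, batch_size: int) -> None:
        self.producer = producer
        self.batch_size = batch_size
        self.buffer: Deque[str] = deque()        # stand-in for the internal Kafka
        self.failed: Deque[List[str]] = deque()  # stand-in for the Redis store

    def accept(self, message: str) -> None:
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        batch = list(self.buffer)
        self.buffer.clear()
        if not batch:
            return
        try:
            self.producer.send_batch(batch)
        except ConnectionError:
            self.failed.append(batch)  # keep the batch instead of dropping it

    def retry_failed(self) -> None:
        # Try each parked batch once; batches that fail again stay parked.
        for _ in range(len(self.failed)):
            batch = self.failed.popleft()
            try:
                self.producer.send_batch(batch)
            except ConnectionError:
                self.failed.append(batch)


producer = FlakyProducer(failures=1)
fronting = Fronting(producer, batch_size=2)
for message in ["m1", "m2", "m3", "m4"]:
    fronting.accept(message)
fronting.retry_failed()
print(sorted(producer.delivered), len(fronting.failed))
```

Delivery order is not preserved across retries in this sketch; a real fronting would also need durable storage so that parked batches survive a restart.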
  20. 20. Fronting Architecture
  21. 21. Data Consumption
      - Firehose: Custom Kafka consumer that pushes to sinks:
        - Database Sink
        - HTTP Sink
        - Influx Sink
        - Elasticsearch Sink
  22. 22. Data Cold Storage
      - BigQuery (Data warehouse) => Beast
      - GCS (Cold storage) => Sakaar
      - Beast and Sakaar push raw data from Kafka to BigQuery and GCS respectively
      - Governance of data by service accounts
      - Metabase and Zeppelin to query data
  23. 23. Kafka consumers architecture Downstream
  24. 24. Kafka - Abstraction
      - To scale Kafka, formalize abstractions of different Kafka clusters
      - Different Kafka clusters for application use cases:
        - Mainstream for all booking data
        - Appstream for mobile application data
        - Locstream for location data
      - Mirror Kafka data for data use cases:
        - Data clusters used for aggregation, auditing, cold storage of data
  25. 25. Data Aggregation
      - Flink (for real time aggregation)
      - Dataproc (for batch aggregation)
      - SQL interface following Apache Calcite to create aggregation jobs
      - User Defined Functions to support complex functionality
      - Different sinks to write aggregated data to Elasticsearch, Database, Influx and others
      - Airflow to schedule jobs
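
To make the real time aggregation on this slide concrete, here is a plain-Python sketch of a tumbling-window count, the kind of result a windowed `GROUP BY` in a streaming SQL job produces. It is not Flink code and does not reflect the actual jobs at Go-jek; the event shape `(timestamp_seconds, service_type)` and the function name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key) using fixed, non-overlapping windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Each event belongs to exactly one window: the one starting at the
        # largest multiple of window_seconds that is <= its timestamp.
        window_start = timestamp - timestamp % window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical booking events: (event time in seconds, service type).
events = [(0, "ride"), (30, "ride"), (59, "food"), (60, "ride"), (125, "food")]
for (window_start, service), count in sorted(tumbling_window_counts(events, 60).items()):
    print(window_start, service, count)
```

A streaming engine adds what this sketch leaves out: unbounded input, late or out-of-order events, and emitting each window's result once the window closes.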
  26. 26. Data Visualization
      - Geo visualization platform to explore location data
      - D3 and chart libraries to explore data
      - Deck.GL for building heatmaps, 2D/3D visualization layers
      - Booking, payments heatmaps
  27. 27. Automation
      - Terraform for IaC and following convention
      - Creating all aspects of the data platform through Terraform:
        - Kubernetes cluster
        - VMs
        - Kafka/Core Components
        - Grants to service accounts
  28. 28. Monitoring
      - TICK script based monitoring/alerting setup
        - TICK (Telegraf, InfluxDB, Chronograf, Kapacitor)
        - StatsD client to collect metrics and write data to Influx
        - Kapacitor scripts to create alerts
        - Integrations with Slack, PagerDuty for alerting
      - Grafana for monitoring
      - Create generic TICK templates and then use a template to create alerts
  29. 29. TICK setup example
      - In this template, warn_threshold and crit_threshold are to be supplied when creating the alert.
      - Create the task through a curl call
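
The template on this slide is not reproduced in the transcript, so the sketch below only illustrates the idea: one generic alert definition, with `warn_threshold` and `crit_threshold` supplied per alert. The `type`/`value` layout of the variables is modelled on the way Kapacitor template variables are commonly written and should be checked against the Kapacitor version in use; the function names and the example metric values are hypothetical.

```python
import json
from typing import Dict


def template_vars(warn_threshold: float, crit_threshold: float) -> Dict[str, Dict[str, object]]:
    """Build the per-alert variables that fill in a generic alert template."""
    if warn_threshold > crit_threshold:
        raise ValueError("warn_threshold must not exceed crit_threshold")
    return {
        "warn_threshold": {"type": "float", "value": float(warn_threshold)},
        "crit_threshold": {"type": "float", "value": float(crit_threshold)},
    }


def alert_level(value: float, warn_threshold: float, crit_threshold: float) -> str:
    """Classify a metric value the way a two-threshold alert would."""
    if value >= crit_threshold:
        return "CRITICAL"
    if value >= warn_threshold:
        return "WARNING"
    return "OK"


# Hypothetical alert on consumer lag: warn at 1000 messages, critical at 5000.
print(json.dumps(template_vars(1000, 5000), indent=2))
for lag in (200, 1500, 9000):
    print(lag, alert_level(lag, 1000, 5000))
```

Keeping the thresholds outside the template is what lets one template back many alerts: each team supplies only its own two numbers.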
  30. 30. Self Serve
      - Data Platform to DIY provision data products
      - Web interface to collect information from the user
      - Helm chart for all data products
      - Kubernetes client on the backend to provision resources
      - Resource to team mapping for authentication and authorization
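
The resource to team mapping on this slide can be sketched as a small ownership registry: provisioning records which team owns a resource, and later changes are allowed only for that team. Everything here is an assumption made for illustration, including the class name, the shape of the release request handed to the backend, and the chart and team names; it does not call Helm or Kubernetes.

```python
from typing import Dict


class SelfServePortal:
    """Tracks which team owns each provisioned resource."""

    def __init__(self) -> None:
        self.owner_by_resource: Dict[str, str] = {}

    def provision(self, team: str, chart: str, name: str) -> Dict[str, object]:
        """Register ownership and return a (hypothetical) release request."""
        if name in self.owner_by_resource:
            raise ValueError(f"resource {name!r} already exists")
        self.owner_by_resource[name] = team
        # In a real portal this request would go to the backend that installs
        # the chart on the cluster; here it is just returned to the caller.
        return {"release": name, "chart": chart, "values": {"team": team}}

    def can_manage(self, team: str, name: str) -> bool:
        """Only the owning team may update or delete a resource."""
        return self.owner_by_resource.get(name) == team


portal = SelfServePortal()
request = portal.provision(team="pricing", chart="firehose", name="pricing-es-sink")
print(request)
print(portal.can_manage("pricing", "pricing-es-sink"))  # owning team
print(portal.can_manage("fraud", "pricing-es-sink"))    # another team
```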
  31. 31. The whole picture
  32. 32. References
      - Data Infra blog: https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929
      - Fronting blog: https://blog.gojekengineering.com/kafka-4066a4ea8d0d
      - Aggregation blog: https://blog.gojekengineering.com/daggers-data-aggregation-in-real-time-4a32eb9ad2d1
      - Sakaar blog: https://blog.gojekengineering.com/sakaar-taking-kafka-data-to-cloud-storage-at-go-jek-7839da20b5f3
      - Data visualization blog: https://blog.gojekengineering.com/atlas-go-jeks-real-time-geospatial-visualization-platform-1cf5e16814c5
      - Open sourced repos:
        - https://github.com/gojek/beast
        - https://github.com/gojekfarm/stencil
      - Helm charts: https://github.com/gojektech/charts
  33. 33. Questions?
