OSDC 2019 | Democratizing Data at Go-JEK by Maulik Soneji

What do we mean by Democratizing Data?
- Self-serve: users depend on tools, not on people
- Abstraction: people only need to know
  - the data they are publishing
  - the insights they want

Agenda
- About Go-jek
- Core components of Data Platform
- Building Data platform around core components
- Whole picture of how components interact with each other
- References
- QnA

About Go-jek
- A mobile app for daily needs
  - Transport
  - Food Delivery
  - Payments
  - Logistics
- 18+ products
- Operates internationally
  - Indonesia
  - Singapore
  - Vietnam
  - Thailand
- Timeline
  - 2010: Call center for ojek* services
  - 2015: App launched with 3 services
  - 2016: Expansion and new services
*ojek is an Indonesian term for motorcycle ride-hailing

Uses of Data
- Application Monitoring
- Fraud Detection
- Pricing
- Report Generation
- User segmentation

Let's talk about the scale of data
Growth of data
- 6666x growth in 18 months
- Expansion to 3 countries (Singapore, Vietnam, Thailand)
Types of data
- 18+ products, ranging across:
  - Ride
  - Food Delivery
  - Payments
  - Shipments
Given this volume and variety of data, Go-jek needs a data platform that scales and supports automation.

Core Components
- Components that serve as the backbone of the data platform
- We have several of these core components for different use cases
- Any core component can be swapped for another well-suited option

Core Components
- Main components of our data platform:
  - Kafka => message broker for producing and consuming data
  - Flink => real-time stream processing
  - Kubernetes => deployment and management of containers
  - Airflow => workflow management
  - Dataproc => managed Spark cluster for batch processing

Typical day for a Data Engineer at Go-jek
- Tackling the following aspects of data:
  - Ingestion
  - Consumption
  - Aggregation
  - Cold Storage
  - Visualization (currently in early stages)
- Automation for resource creation:
  - A resource is any tool used for data ingestion, consumption, etc.

Focus areas for scaling
- How to take care of these aspects while scaling the data platform:
  - Reliability => no data loss
  - Abstraction => downtime caused by one team shouldn’t impact others
  - Automation => infrastructure, deployment
  - Monitoring => monitoring and alerting for all resources
  - Self-serve => anyone can access the platform without manual intervention

Data Ingestion
- Uniformity through Protobufs
  - All Kafka messages are serialized/deserialized using protobuf
  - Teams maintain the protobuf schema
  - Client libraries in various languages to push messages
- Helps in terms of:
  - Defining a schema for pushing to/reading from Kafka
  - No breaking changes: a newer version of the schema shouldn’t break the older conventions
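
The "no breaking changes" rule above can be sketched as a compatibility check between two schema versions. This is only an illustration, not Go-jek's tooling: schemas are modelled as plain dicts keyed by protobuf field number, `breaking_changes` is a hypothetical helper, and the check is deliberately stricter than protobuf's actual wire-compatibility rules (it flags every removal and every type change).

```python
# Hypothetical model: a schema is {field_number: (field_name, field_type)}.
Schema = dict[int, tuple[str, str]]


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """Return the reasons why `new` could break users of `old`."""
    problems = []
    for number, (name, ftype) in sorted(old.items()):
        if number not in new:
            problems.append(f"field {number} ({name}) was removed")
        elif new[number][1] != ftype:
            problems.append(
                f"field {number} ({name}) changed type {ftype} -> {new[number][1]}"
            )
    return problems


booking_v1 = {1: ("order_id", "string"), 2: ("amount", "int64")}
# Adding a field under a new field number keeps old fields intact.
booking_v2 = {**booking_v1, 3: ("currency", "string")}
# Changing the type of an existing field number is flagged.
booking_bad = {1: ("order_id", "string"), 2: ("amount", "string")}

print(breaking_changes(booking_v1, booking_v2))
print(breaking_changes(booking_v1, booking_bad))
```

A check of this kind could run before a new schema version is published, so that producers and consumers on older versions keep working.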

Why Protobufs?
- All protobufs can be imported like a client library
- Use the protoc command to generate libraries for different languages
- Easy to import and use
Create Golang library from Java library

Data Ingestion - Reliability
- Reliability through fronting systems
  - Fronting serves as a reliable way to push data to Kafka
  - It first pushes to an internal Kafka and then pushes batched data to the main Kafka
  - Internal Redis store to hold failed messages
  - Individual frontings for each team to isolate issues
  - High availability with HAProxy
- Can be replaced with any system that enables reliable push to Kafka
  - Sidekiq jobs
  - Application-level retry mechanism
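
The fronting idea (buffer first, forward in batches, park failures instead of dropping them) can be sketched in a few lines. This is a minimal in-memory sketch under stated assumptions, not the actual fronting service: the `Fronting` class, its parameters, and the `send` callback are hypothetical, and plain lists stand in for the internal Kafka and the Redis store.

```python
from typing import Callable, List


class Fronting:
    """Buffers messages, forwards them in batches, and parks failures for replay."""

    def __init__(self, send_batch: Callable[[List[bytes]], None],
                 batch_size: int = 3, max_retries: int = 2) -> None:
        self.send_batch = send_batch    # stand-in for a producer to the main Kafka
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.buffer: List[bytes] = []   # stand-in for the internal Kafka
        self.failed: List[bytes] = []   # stand-in for the Redis store

    def push(self, message: bytes) -> None:
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        batch, self.buffer = self.buffer, []
        if not batch:
            return
        for _ in range(1 + self.max_retries):
            try:
                self.send_batch(batch)
                return
            except ConnectionError:
                continue
        # Every attempt failed: keep the messages instead of dropping them.
        self.failed.extend(batch)

    def replay_failed(self) -> None:
        """Retry parked messages ahead of anything still buffered."""
        self.buffer, self.failed = self.failed + self.buffer, []
        self.flush()


delivered: List[bytes] = []
state = {"healthy": False}


def send(batch: List[bytes]) -> None:
    if not state["healthy"]:
        raise ConnectionError("main Kafka unreachable")
    delivered.extend(batch)


fronting = Fronting(send)
for i in range(3):
    fronting.push(f"event-{i}".encode())
parked = len(fronting.failed)   # messages held while the main cluster is down

state["healthy"] = True
fronting.replay_failed()
print(parked, delivered)
```

The point of the design is that a producer's `push` never loses a message when the main cluster is unavailable; delivery is deferred until a replay succeeds.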

Data Consumption
- Firehose: custom Kafka consumer that pushes to sinks:
  - Database Sink
  - HTTP Sink
  - Influx Sink
  - Elasticsearch Sink
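
One way to picture a consumer with pluggable sinks is a small common interface that each sink implements. The sketch below is illustrative only and does not reflect Firehose's real code: the `Sink` protocol, the two sink classes, and `run_consumer` are hypothetical names, and the sinks keep data in memory instead of talking to a database or an HTTP endpoint.

```python
from typing import Dict, Iterable, List, Protocol


class Sink(Protocol):
    def write(self, records: List[dict]) -> None: ...


class HttpSink:
    """Would POST each batch to an endpoint; here it only records the batches."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.requests: List[List[dict]] = []

    def write(self, records: List[dict]) -> None:
        self.requests.append(records)


class DatabaseSink:
    """Would upsert rows into a table; here it keeps the latest row per key."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.rows: Dict[str, dict] = {}

    def write(self, records: List[dict]) -> None:
        for record in records:
            self.rows[record[self.key]] = record


def run_consumer(batches: Iterable[List[dict]], sink: Sink) -> int:
    """Drain batches (a stand-in for Kafka polls) into one sink; return the record count."""
    total = 0
    for batch in batches:
        sink.write(batch)
        total += len(batch)
    return total


batches = [
    [{"order_id": "a1", "status": "CREATED"}],
    [{"order_id": "a1", "status": "COMPLETED"},
     {"order_id": "b2", "status": "CREATED"}],
]
db = DatabaseSink(key="order_id")
count = run_consumer(batches, db)
print(count, db.rows)
```

Because the consumer loop only depends on `write`, supporting another destination means adding a sink class rather than changing the consumer.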

Data Cold Storage
- BigQuery (data warehouse) => Beast
- GCS (cold storage) => Sakaar
- Beast and Sakaar push raw data from Kafka to BigQuery and GCS respectively
- Governance of data through service accounts
- Metabase and Zeppelin to query data

[Figure: Kafka consumers architecture, showing downstream consumers]

Kafka - Abstraction
- To scale Kafka, formalize abstractions over the different Kafka clusters
- Different Kafka clusters for application use cases:
  - Mainstream for all booking data
  - Appstream for mobile application data
  - Locstream for location data
- Mirror Kafka data for data use cases:
  - Data clusters used for aggregation, auditing, and cold storage of data

Data Aggregation
- Flink (for real-time aggregation)
- Dataproc (for batch aggregation)
- SQL interface based on Apache Calcite to create aggregation jobs
- User-defined functions to support complex functionality
- Different sinks to write aggregated data to Elasticsearch, databases, Influx and others
- Airflow to schedule jobs
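
To make concrete what a typical real-time aggregation job computes, the sketch below counts events per key in fixed (tumbling) time windows. It is plain Python rather than Flink SQL, and the event fields (`timestamp`, `service_type`) and the function name are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_counts(events: Iterable[dict],
                    window_seconds: int) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, service type) in fixed-size windows."""
    counts: Counter = Counter()
    for event in events:
        # Align the timestamp down to the start of its window.
        window_start = event["timestamp"] - event["timestamp"] % window_seconds
        counts[(window_start, event["service_type"])] += 1
    return dict(counts)


events = [
    {"timestamp": 0, "service_type": "ride"},
    {"timestamp": 30, "service_type": "ride"},
    {"timestamp": 45, "service_type": "food"},
    {"timestamp": 61, "service_type": "ride"},
]
result = tumbling_counts(events, window_seconds=60)
print(result)
```

In a streaming engine the same grouping would be expressed declaratively and evaluated continuously as events arrive; the batch loop here only shows the shape of the result.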

Data Visualization
- Geo visualization platform to explore location data
- D3 and chart libraries to explore data
- deck.gl for building heatmaps and 2D/3D visualization layers
- Booking and payments heatmaps

Automation
- Terraform for infrastructure as code (IaC), following conventions
- Creating all aspects of the data platform through Terraform:
  - Kubernetes cluster
  - VMs
  - Kafka/Core Components
  - Grants to service accounts

Monitoring
- TICKscript-based monitoring/alerting setup
  - TICK (Telegraf, InfluxDB, Chronograf, Kapacitor)
  - StatsD client to collect metrics and write data to InfluxDB
  - Kapacitor scripts to create alerts
  - Integrations with Slack and PagerDuty for alerting
  - Grafana for monitoring
- Create generic TICKscript templates, then use the templates to create alerts

TICK setup example
In this template, warn_threshold and crit_threshold are supplied when creating the alert.
The task is created through a curl call.
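
The original template and curl command are not reproduced in this text. As a rough sketch of the same step, the code below builds (but does not send) a request to Kapacitor's task endpoint that defines a task from a template and supplies the two thresholds as template variables. The task ID, template ID, database, retention policy, threshold values, the `stream` task type, and the default base URL are all assumptions for illustration.

```python
import json
from urllib.request import Request


def build_task_request(task_id: str, template_id: str, database: str,
                       retention_policy: str, warn_threshold: float,
                       crit_threshold: float,
                       base_url: str = "http://localhost:9092") -> Request:
    """Build, without sending, the request that defines a task from a template."""
    body = {
        "id": task_id,
        "type": "stream",  # assumed; a batch template would use "batch"
        "template_id": template_id,
        "dbrps": [{"db": database, "rp": retention_policy}],
        # Template variables are passed as typed values.
        "vars": {
            "warn_threshold": {"type": "float", "value": warn_threshold},
            "crit_threshold": {"type": "float", "value": crit_threshold},
        },
    }
    return Request(
        f"{base_url}/kapacitor/v1/tasks",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All identifiers below are made up for illustration.
request = build_task_request(
    task_id="firehose_lag_alert",
    template_id="generic_threshold_alert",
    database="telegraf",
    retention_policy="autogen",
    warn_threshold=80.0,
    crit_threshold=95.0,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Keeping the thresholds as variables is what lets one generic template back many alerts: each alert is just a new task with different values.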

Self Serve
- Data platform to DIY-provision data products
  - Web interface to collect information from the user
  - Helm charts for all data products
  - Kubernetes client on the backend to provision resources
  - Resource-to-team mapping for authentication and authorization
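
The resource-to-team mapping can be pictured as a simple ownership lookup. The sketch below is hypothetical and not the platform's actual model: the resource names, team names, users, and the `can_manage` helper are invented for illustration.

```python
from typing import Dict, Set

# Hypothetical data: which team owns each resource, and who is on each team.
RESOURCE_TEAM: Dict[str, str] = {
    "firehose-booking-to-db": "transport",
    "dagger-payments-hourly": "payments",
}
TEAM_MEMBERS: Dict[str, Set[str]] = {
    "transport": {"asha", "budi"},
    "payments": {"citra"},
}


def can_manage(user: str, resource: str) -> bool:
    """A user may manage a resource only if they belong to the owning team."""
    team = RESOURCE_TEAM.get(resource)
    return team is not None and user in TEAM_MEMBERS.get(team, set())


print(can_manage("asha", "firehose-booking-to-db"))
print(can_manage("citra", "firehose-booking-to-db"))
```

Unknown resources are denied by default, so a missing mapping fails closed rather than open.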

References
- Data Infra blog: https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929
- Fronting blog: https://blog.gojekengineering.com/kafka-4066a4ea8d0d
- Aggregation blog: https://blog.gojekengineering.com/daggers-data-aggregation-in-real-time-4a32eb9ad2d1
- Sakaar blog: https://blog.gojekengineering.com/sakaar-taking-kafka-data-to-cloud-storage-at-go-jek-7839da20b5f3
- Data visualization blog: https://blog.gojekengineering.com/atlas-go-jeks-real-time-geospatial-visualization-platform-1cf5e16814c5
- Open-sourced repos:
  - https://github.com/gojek/beast
  - https://github.com/gojekfarm/stencil
- Helm charts: https://github.com/gojektech/charts
