SlideShare a Scribd company logo
Democratizing
Data at Go-jek
What do we mean by Democratizing Data?
What do we mean by Democratizing Data?
- Self serve in a way that you are dependent on tools and not on people
- Abstract in such a way that people only need to know
- data they are publishing
- Insights that they want
Agenda
- About Go-jek
- Core components of Data Platform
- Building Data platform around core components
- Whole picture of how components interact with each other
- References
- QnA
About Go-jek
- A mobile app for daily needs
- Transport
- Food Delivery
- Payments
- Logistics
- 18+ products
- Operational Internationally
- Indonesia
- Singapore
- Vietnam
- Thailand
2016 Expansion and
new services
2010 Call-center for
ojek* services
2015 App launched
with 3 services
*ojek is an Indonesian term of motorcycle ride hailing
Uses of Data
- Application Monitoring
- Fraud Detection
- Pricing
- Report Generation
- User segmentation
Let's talk about the scale of Data
Growth of Data
- 6666x growth in 18
months
- Expansion to 3
countries(Singapore,
Vietnam, Thailand)
Type of Data
- 18+ products ranging from:
- Ride
- Food Delivery
- Payments
- Shipments
Given the data at Go-jek it becomes essential to create a data platform at scale and supporting automation.
Core Components
Core Components
- Components that serve as backbone for the data platform
- We have a bunch of these core components for different use cases
- Multiple options available for changing any of these core components
with another well suited option
Core Components
- Main components of our Data platform:
- Kafka => message broker for producing and consuming data
Core Components
- Main components of our Data platform:
- Kafka => message broker for producing and consuming data
- Flink => real time stream processing
Core Components
- Main components of our Data platform:
- Kafka => message broker for producing and consuming data
- Flink => real time stream processing
- Kubernetes => Deployment and management of containers
Core Components
- Main components of our Data platform:
- Kafka => message broker for producing and consuming data
- Flink => real time stream processing
- Kubernetes => Deployment and management of containers
- Airflow => workflow management
Core Components
- Main components of our Data platform:
- Kafka => message broker for producing and consuming data
- Flink => real time stream processing
- Kubernetes => Deployment and management of containers
- Airflow => workflow management
- Dataproc => managed spark cluster for batch processing
Data Platform
Typical day for Data Engineer at Gojek
- Tackling the following aspects of Data:
- Ingestion
- Consumption
- Aggregation
- Cold Storage
- Visualization (in early stages currently)
- Automation for Resource creation:
- Resource is any tool used for data ingestion, consumption etc.
Focus areas for scaling
- How to take care of these aspects while scaling data platform:
- Reliability => No Data loss
- Abstraction => Downtime due to one team shouldn’t impact other
- Automation => Infrastructure, deployment
- Monitoring => Monitoring and Alerting for all resources
- Self Serve => Anyone can access platform without intervention
Data Ingestion
- Uniformity by Protobufs
- All kafka messages are serialized/deserialized by using protobuf
- Teams maintain the protobuf schema
- Client libraries in various languages to push messages
- Helps in terms of:
- Defining a schema for pushing/reading from kafka
- No breaking change, a newer version of the schema shouldn’t
break the older conventions
Why Protobufs?
- All protobufs can be imported like client library
- Use protoc command to generate library for different languages
- Easy to import and use
Create Golang library from Java library
Data Ingestion - Reliability
- Reliability by Fronting Systems
- Fronting serves as reliable way to push data to kafka
- First pushes to internal kafka and then pushes batched data to
main kafka
- Internal redis store to store failed messages
- Individual frontings for each team to isolate issues
- High availability with HAProxy
- Can be replaced with any system that enables reliable push to kafka
- Sidekiq jobs
- Application level retry mechanism
Fronting Architecture
Data Consumption
- Firehose: Custom kafka consumer that pushes to sinks:
- Database Sink
- HTTP Sink
- Influx Sink
- Elasticsearch Sink
Data Cold Storage
- Bigquery (Data warehouse) => Beast
- GCS (Cold storage) => Sakaar
- Pushes raw data from kafka to Bigquery and GCS
- Governance of data by service accounts
- Metabase and Zeppelin to query data
Kafka consumers architecture
Downstream
Kafka - Abstraction
- To scale kafka, formalize abstractions of different kafka clusters
- Different Kafka clusters for application usecases:
- Mainstream for all booking data
- Appstream for mobile application data
- Locstream for location data
- Mirror kafka data for data usecases:
- Data clusters used for aggregation, auditing, cold storage of
data
Data Aggregation
- Flink (for real time aggregation)
- Dataproc (for batch aggregation)
- SQL interface following Apache Calcite to create aggregation jobs
- User Defined Functions to support complex functionalities
- Different sinks to write aggregated data to Elasticsearch,
Database, Influx and others
- Airflow to schedule jobs
Data Visualization
- Geo visualization platform to explore location data
- D3 and chart libraries to explore data
- Deck.GL for building heatmaps, 2D/3D visualization layers
- Booking, payments heatmaps
Automation
- Terraform for IAC and following convention
- Creating all aspects of data platform through terraform:
- Kubernetes cluster
- VMs
- Kafka/Core Components
- Grants to service accounts
Monitoring
- TICK script based monitoring/alerting setup
- TICK(Telegraf, InfluxDB, Chronograph, Kapacitor)
- StatsD client to collect metrics and write data to influx
- Kapacitor scripts to create alerts
- Integrations with slack, pagerduty for alerting
- Grafana for monitoring
- Create generic Tick templates and then use template to create alerts
Tick setup example
In this template, the warn_threshold and
crit_threshold are to be supplied when creating alert.
Create task through a curl call
Self Serve
- Data Platform to DIY provision data products
- Web interface to collect information from user
- Helm chart for all data products
- Kubernetes client on backend to provision resource
- Resource to team mapping for authentication and authorization
The whole picture
References
- Data Infra blog: https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929
- Fronting blog: https://blog.gojekengineering.com/kafka-4066a4ea8d0d
- Aggregation blog:
https://blog.gojekengineering.com/daggers-data-aggregation-in-real-time-4a32eb9ad2d1
- Sakaar blog:
https://blog.gojekengineering.com/sakaar-taking-kafka-data-to-cloud-storage-at-go-jek-7839da20b5f3
- Data visualization blog:
https://blog.gojekengineering.com/atlas-go-jeks-real-time-geospatial-visualization-platform-1cf5e16814c5
- Open sourced repos:
- https://github.com/gojek/beast
- https://github.com/gojekfarm/stencil
- Helm charts: https://github.com/gojektech/charts
Questions?

More Related Content

What's hot

Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
confluent
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
Cask Data
 
Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...
Cask Data
 
Real-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka EcosystemReal-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka Ecosystem
confluent
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
Dr. Mirko Kämpf
 
New Approaches for Fraud Detection on Apache Kafka and KSQL
New Approaches for Fraud Detection on Apache Kafka and KSQLNew Approaches for Fraud Detection on Apache Kafka and KSQL
New Approaches for Fraud Detection on Apache Kafka and KSQL
confluent
 
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
HostedbyConfluent
 
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, GlooHow a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
HostedbyConfluent
 
From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017
Apache Apex
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to..."Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
Cask Data
 
Ingesting Data from Kafka to JDBC with Transformation and Enrichment
Ingesting Data from Kafka to JDBC with Transformation and EnrichmentIngesting Data from Kafka to JDBC with Transformation and Enrichment
Ingesting Data from Kafka to JDBC with Transformation and Enrichment
Apache Apex
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetes
confluent
 
Data Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowData Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, How
Pat Patterson
 
Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...
Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...
Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...
Flink Forward
 
KSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache KafkaKSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache Kafka
confluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of KafkaKafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
confluent
 
Building a Streaming Platform with Kafka
Building a Streaming Platform with KafkaBuilding a Streaming Platform with Kafka
Building a Streaming Platform with Kafka
confluent
 
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQLIngesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Guido Schmutz
 

What's hot (20)

Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...
 
Real-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka EcosystemReal-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka Ecosystem
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
New Approaches for Fraud Detection on Apache Kafka and KSQL
New Approaches for Fraud Detection on Apache Kafka and KSQLNew Approaches for Fraud Detection on Apache Kafka and KSQL
New Approaches for Fraud Detection on Apache Kafka and KSQL
 
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
 
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, GlooHow a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
 
From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to..."Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
 
Ingesting Data from Kafka to JDBC with Transformation and Enrichment
Ingesting Data from Kafka to JDBC with Transformation and EnrichmentIngesting Data from Kafka to JDBC with Transformation and Enrichment
Ingesting Data from Kafka to JDBC with Transformation and Enrichment
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetes
 
Data Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowData Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, How
 
Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...
Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...
Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p...
 
KSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache KafkaKSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache Kafka
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of KafkaKafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
 
Building a Streaming Platform with Kafka
Building a Streaming Platform with KafkaBuilding a Streaming Platform with Kafka
Building a Streaming Platform with Kafka
 
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQLIngesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
 

Similar to OSDC 2019 | Democratizing Data at Go-JEK by Maulik Soneji

Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
 
Data Infrastructure in Kumparan
Data Infrastructure in KumparanData Infrastructure in Kumparan
Data Infrastructure in Kumparan
Yosua Michael Maranatha
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
kgshukla
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
Surendar S
 
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
HostedbyConfluent
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association
 
Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?
Micron Technology
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
AboutYouGmbH
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Data Con LA
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Anant Corporation
 
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
Amazon Web Services
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
Michael Stephenson
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
VoltDB
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
Busy Bee Application Develompent Platform
Busy Bee Application Develompent PlatformBusy Bee Application Develompent Platform
Busy Bee Application Develompent Platform
Utkarsh Shukla
 
End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud
Tu Pham
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
Stepan Pushkarev
 
The role of AWS in the Datalandscape of a fast growing Startup
The role of AWS in the Datalandscape of a fast growing StartupThe role of AWS in the Datalandscape of a fast growing Startup
The role of AWS in the Datalandscape of a fast growing Startup
Maximilian Ehrlich
 
Data science for infrastructure dev week 2022
Data science for infrastructure   dev week 2022Data science for infrastructure   dev week 2022
Data science for infrastructure dev week 2022
ZainAsgar1
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 

Similar to OSDC 2019 | Democratizing Data at Go-JEK by Maulik Soneji (20)

Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Data Infrastructure in Kumparan
Data Infrastructure in KumparanData Infrastructure in Kumparan
Data Infrastructure in Kumparan
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
 
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 
Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
 
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
 
Busy Bee Application Develompent Platform
Busy Bee Application Develompent PlatformBusy Bee Application Develompent Platform
Busy Bee Application Develompent Platform
 
End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
The role of AWS in the Datalandscape of a fast growing Startup
The role of AWS in the Datalandscape of a fast growing StartupThe role of AWS in the Datalandscape of a fast growing Startup
The role of AWS in the Datalandscape of a fast growing Startup
 
Data science for infrastructure dev week 2022
Data science for infrastructure   dev week 2022Data science for infrastructure   dev week 2022
Data science for infrastructure dev week 2022
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
 

Recently uploaded

The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
Yara Milbes
 
TheFutureIsDynamic-BoxLang-CFCamp2024.pdf
TheFutureIsDynamic-BoxLang-CFCamp2024.pdfTheFutureIsDynamic-BoxLang-CFCamp2024.pdf
TheFutureIsDynamic-BoxLang-CFCamp2024.pdf
Ortus Solutions, Corp
 
Beginner's Guide to Observability@Devoxx PL 2024
Beginner's  Guide to Observability@Devoxx PL 2024Beginner's  Guide to Observability@Devoxx PL 2024
Beginner's Guide to Observability@Devoxx PL 2024
michniczscribd
 
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
sandeepmenon62
 
What’s New in VictoriaLogs - Q2 2024 Update
What’s New in VictoriaLogs - Q2 2024 UpdateWhat’s New in VictoriaLogs - Q2 2024 Update
What’s New in VictoriaLogs - Q2 2024 Update
VictoriaMetrics
 
The Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdfThe Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdf
mohitd6
 
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
manji sharman06
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
ervikas4
 
Hands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion StepsHands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion Steps
servicesNitor
 
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in NashikUpturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies
 
How GenAI Can Improve Supplier Performance Management.pdf
How GenAI Can Improve Supplier Performance Management.pdfHow GenAI Can Improve Supplier Performance Management.pdf
How GenAI Can Improve Supplier Performance Management.pdf
Zycus
 
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdfSoftware Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
kalichargn70th171
 
42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert
vaishalijagtap12
 
Optimizing Your E-commerce with WooCommerce.pptx
Optimizing Your E-commerce with WooCommerce.pptxOptimizing Your E-commerce with WooCommerce.pptx
Optimizing Your E-commerce with WooCommerce.pptx
WebConnect Pvt Ltd
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
Alina Yurenko
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
Reetu63
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
Anand Bagmar
 
Computer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdfComputer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdf
chandangoswami40933
 
Hyperledger Besu 빨리 따라하기 (Private Networks)
Hyperledger Besu 빨리 따라하기 (Private Networks)Hyperledger Besu 빨리 따라하기 (Private Networks)
Hyperledger Besu 빨리 따라하기 (Private Networks)
wonyong hwang
 

Recently uploaded (20)

The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
 
TheFutureIsDynamic-BoxLang-CFCamp2024.pdf
TheFutureIsDynamic-BoxLang-CFCamp2024.pdfTheFutureIsDynamic-BoxLang-CFCamp2024.pdf
TheFutureIsDynamic-BoxLang-CFCamp2024.pdf
 
Beginner's Guide to Observability@Devoxx PL 2024
Beginner's  Guide to Observability@Devoxx PL 2024Beginner's  Guide to Observability@Devoxx PL 2024
Beginner's Guide to Observability@Devoxx PL 2024
 
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
 
What’s New in VictoriaLogs - Q2 2024 Update
What’s New in VictoriaLogs - Q2 2024 UpdateWhat’s New in VictoriaLogs - Q2 2024 Update
What’s New in VictoriaLogs - Q2 2024 Update
 
The Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdfThe Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdf
 
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
Hands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion StepsHands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion Steps
 
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in NashikUpturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in Nashik
 
How GenAI Can Improve Supplier Performance Management.pdf
How GenAI Can Improve Supplier Performance Management.pdfHow GenAI Can Improve Supplier Performance Management.pdf
How GenAI Can Improve Supplier Performance Management.pdf
 
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdfSoftware Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
 
42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert
 
Optimizing Your E-commerce with WooCommerce.pptx
Optimizing Your E-commerce with WooCommerce.pptxOptimizing Your E-commerce with WooCommerce.pptx
Optimizing Your E-commerce with WooCommerce.pptx
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
 
Computer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdfComputer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdf
 
Hyperledger Besu 빨리 따라하기 (Private Networks)
Hyperledger Besu 빨리 따라하기 (Private Networks)Hyperledger Besu 빨리 따라하기 (Private Networks)
Hyperledger Besu 빨리 따라하기 (Private Networks)
 

OSDC 2019 | Democratizing Data at Go-JEK by Maulik Soneji

  • 2. What do we mean by Democratizing Data?
  • 3. What do we mean by Democratizing Data? - Self serve in a way that you are dependent on tools and not on people - Abstract in such a way that people only need to know - data they are publishing - Insights that they want
  • 4. Agenda - About Go-jek - Core components of Data Platform - Building Data platform around core components - Whole picture of how components interact with each other - References - QnA
  • 5. About Go-jek - A mobile app for daily needs - Transport - Food Delivery - Payments - Logistics - 18+ products - Operational Internationally - Indonesia - Singapore - Vietnam - Thailand 2016 Expansion and new services 2010 Call-center for ojek* services 2015 App launched with 3 services *ojek is an Indonesian term of motorcycle ride hailing
  • 6. Uses of Data - Application Monitoring - Fraud Detection - Pricing - Report Generation - User segmentation Let's talk about the scale of Data Growth of Data - 6666x growth in 18 months - Expansion to 3 countries(Singapore, Vietnam, Thailand) Type of Data - 18+ products ranging from: - Ride - Food Delivery - Payments - Shipments Given the data at Go-jek it becomes essential to create a data platform at scale and supporting automation.
  • 8. Core Components - Components that serve as backbone for the data platform - We have a bunch of these core components for different use cases - Multiple options available for changing any of these core components with another well suited option
  • 9. Core Components - Main components of our Data platform: - Kafka => message broker for producing and consuming data
  • 10. Core Components - Main components of our Data platform: - Kafka => message broker for producing and consuming data - Flink => real time stream processing
  • 11. Core Components - Main components of our Data platform: - Kafka => message broker for producing and consuming data - Flink => real time stream processing - Kubernetes => Deployment and management of containers
  • 12. Core Components - Main components of our Data platform: - Kafka => message broker for producing and consuming data - Flink => real time stream processing - Kubernetes => Deployment and management of containers - Airflow => workflow management
  • 13. Core Components - Main components of our Data platform: - Kafka => message broker for producing and consuming data - Flink => real time stream processing - Kubernetes => Deployment and management of containers - Airflow => workflow management - Dataproc => managed spark cluster for batch processing
  • 15. Typical day for Data Engineer at Gojek - Tackling the following aspects of Data: - Ingestion - Consumption - Aggregation - Cold Storage - Visualization (in early stages currently) - Automation for Resource creation: - Resource is any tool used for data ingestion, consumption etc.
  • 16. Focus areas for scaling - How to take care of these aspects while scaling data platform: - Reliability => No Data loss - Abstraction => Downtime due to one team shouldn’t impact other - Automation => Infrastructure, deployment - Monitoring => Monitoring and Alerting for all resources - Self Serve => Anyone can access platform without intervention
  • 17. Data Ingestion - Uniformity by Protobufs - All kafka messages are serialized/deserialized by using protobuf - Teams maintain the protobuf schema - Client libraries in various languages to push messages - Helps in terms of: - Defining a schema for pushing/reading from kafka - No breaking change, a newer version of the schema shouldn’t break the older conventions
  • 18. Why Protobufs? - All protobufs can be imported like client library - Use protoc command to generate library for different languages - Easy to import and use Create Golang library from Java library
  • 19. Data Ingestion - Reliability - Reliability by Fronting Systems - Fronting serves as reliable way to push data to kafka - First pushes to internal kafka and then pushes batched data to main kafka - Internal redis store to store failed messages - Individual frontings for each team to isolate issues - High availability with HAProxy - Can be replaced with any system that enables reliable push to kafka - Sidekiq jobs - Application level retry mechanism
  • 21. Data Consumption - Firehose: Custom kafka consumer that pushes to sinks: - Database Sink - HTTP Sink - Influx Sink - Elasticsearch Sink
  • 22. Data Cold Storage - Bigquery (Data warehouse) => Beast - GCS (Cold storage) => Sakaar - Pushes raw data from kafka to Bigquery and GCS - Governance of data by service accounts - Metabase and Zeppelin to query data
  • 24. Kafka - Abstraction - To scale kafka, formalize abstractions of different kafka clusters - Different Kafka clusters for application usecases: - Mainstream for all booking data - Appstream for mobile application data - Locstream for location data - Mirror kafka data for data usecases: - Data clusters used for aggregation, auditing, cold storage of data
  • 25. Data Aggregation - Flink (for real time aggregation) - Dataproc (for batch aggregation) - SQL interface following Apache Calcite to create aggregation jobs - User Defined Functions to support complex functionalities - Different sinks to write aggregated data to Elasticsearch, Database, Influx and others - Airflow to schedule jobs
  • 26. Data Visualization - Geo visualization platform to explore location data - D3 and chart libraries to explore data - Deck.GL for building heatmaps, 2D/3D visualization layers - Booking, payments heatmaps
  • 27. Automation - Terraform for IAC and following convention - Creating all aspects of data platform through terraform: - Kubernetes cluster - VMs - Kafka/Core Components - Grants to service accounts
  • 28. Monitoring - TICK script based monitoring/alerting setup - TICK(Telegraf, InfluxDB, Chronograph, Kapacitor) - StatsD client to collect metrics and write data to influx - Kapacitor scripts to create alerts - Integrations with slack, pagerduty for alerting - Grafana for monitoring - Create generic Tick templates and then use template to create alerts
  • 29. Tick setup example In this template, the warn_threshold and crit_threshold are to be supplied when creating alert. Create task through a curl call
  • 30. Self Serve - Data Platform to DIY provision data products - Web interface to collect information from user - Helm chart for all data products - Kubernetes client on backend to provision resource - Resource to team mapping for authentication and authorization
  • 32.
  • 33. References - Data Infra blog: https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929 - Fronting blog: https://blog.gojekengineering.com/kafka-4066a4ea8d0d - Aggregation blog: https://blog.gojekengineering.com/daggers-data-aggregation-in-real-time-4a32eb9ad2d1 - Sakaar blog: https://blog.gojekengineering.com/sakaar-taking-kafka-data-to-cloud-storage-at-go-jek-7839da20b5f3 - Data visualization blog: https://blog.gojekengineering.com/atlas-go-jeks-real-time-geospatial-visualization-platform-1cf5e16814c5 - Open sourced repos: - https://github.com/gojek/beast - https://github.com/gojekfarm/stencil - Helm charts: https://github.com/gojektech/charts