www.globalbigdataconference.com
Twitter : @bigdataconf
Big Data Ingestion
Navneet Gupta
Flipkart Data Platform
navneet.gupta@flipkart.com
Flipkart Data Platform (FDP)
● Data governance - democratizing data at Flipkart.
● Divided into three sub-teams: Ingestion, Processing and Consumption.
● Created out of the vision to make Flipkart a data-centric company (examples
of such companies are Facebook, Google and LinkedIn).
● Works with all teams in Flipkart and acts as a broker between teams for
exchanging data (raw or processed).
● Provides capabilities for data processing and consumption but is agnostic to
business processes. Does not itself build any apps on top of the collected data.
● Example of an application built on top of FDP - Seller Analytics.
More about FDP ...
● Responsibility for pushing data to FDP lies with the source teams.
● Responsibility for reporting data availability lies with FDP, which should
call out when source teams are not pushing data.
● All business processes are modeled as entities/events, and FDP provides a
console to define those entities/events using custom schema management
(open source alternatives include Avro, Thrift and Protocol Buffers).
● Validation is bundled with the schema definition.
● Having a schema allows strong assumptions about the fields in the data.
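To illustrate what "validation is bundled with schema definition" could look like, here is a minimal sketch. FDP's schema management is custom and is not shown in the slides, so every name here (`Field`, `Schema`, `ORDER_PLACED`, the field names) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Field:
    name: str
    py_type: type
    required: bool = True
    # Optional extra rule; this is the "validation bundled with the schema".
    check: Optional[Callable[[Any], bool]] = None


@dataclass
class Schema:
    name: str
    fields: List[Field]

    def validate(self, payload: Dict[str, Any]) -> List[str]:
        """Return a list of error strings; an empty list means valid."""
        errors = []
        for f in self.fields:
            if f.name not in payload:
                if f.required:
                    errors.append(f"missing field: {f.name}")
                continue
            value = payload[f.name]
            if not isinstance(value, f.py_type):
                errors.append(f"{f.name}: expected {f.py_type.__name__}")
            elif f.check is not None and not f.check(value):
                errors.append(f"{f.name}: failed validation rule")
        return errors


# Hypothetical event definition, not taken from the talk.
ORDER_PLACED = Schema("order_placed", [
    Field("order_id", str),
    Field("amount", float, check=lambda v: v > 0),
    Field("coupon", str, required=False),
])

print(ORDER_PLACED.validate({"order_id": "OD1", "amount": 499.0}))  # []
```

Because every accepted payload has passed these checks, downstream consumers can rely on the fields being present and well typed, which is the "strong assumptions" point above.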
Data has many faces at Flipkart!
● Flipkart teams work with varied datastores such as MySQL, MongoDB, CouchDB,
HBase and Hadoop.
● Some teams onboard later than others, so huge volumes of existing data
sometimes have to be bootstrapped.
● A single ingestion mechanism might not suit all teams at Flipkart. Some teams
prefer streaming ingestion, others want batch, and some want support to ingest
their data in a Hadoop cluster.
● Data could be present in many formats such as binary blobs, JSON, XML and CSV.
We don't want to deal with each format and currently support only JSON payloads.
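Since only JSON payloads are supported, a source team holding data in another format has to convert it before ingesting. The following is a rough sketch of that step for CSV, using only the Python standard library. The function name and column names are made up for illustration and are not part of FDP.

```python
import csv
import io
import json
from typing import List


def csv_to_json_payloads(csv_text: str) -> List[str]:
    """Turn each CSV row into one JSON object string, keyed by the header row.

    All values stay strings; casting to numbers would be a schema concern.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row, sort_keys=True) for row in reader]


payloads = csv_to_json_payloads("order_id,amount\nOD1,499\nOD2,1250\n")
print(payloads[0])  # {"amount": "499", "order_id": "OD1"}
```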
Some numbers ...
● Almost 2 billion ingestions on an average day.
● Half of those ingestions arrive in streaming fashion (HTTP endpoint).
● Other ingestion mechanisms:
○ Hadoop-based ingestion
○ Java library
○ Daemon processes on source machines
○ Command-line tools to ingest a file in one shot
● Plan to support 5-10x the current ingestion numbers for the next BBD
(Big Billion Days sale).
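As a back-of-the-envelope check, the daily figures above can be turned into average per-second rates. This only computes averages from the numbers on the slide. Peak rates are not given in the talk and would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400

daily_ingestions = 2_000_000_000         # "almost 2 billion" per day
streaming_share = 0.5                    # "half" arrive over the HTTP endpoint

avg_rate = daily_ingestions / SECONDS_PER_DAY
streaming_rate = avg_rate * streaming_share

# Planned headroom of 5-10x the current daily volume.
target_low = 5 * daily_ingestions
target_high = 10 * daily_ingestions

print(f"average overall:   {avg_rate:,.0f} ingestions/s")        # 23,148
print(f"average streaming: {streaming_rate:,.0f} ingestions/s")  # 11,574
print(f"planned capacity:  {target_low:,} to {target_high:,} per day")
```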
Streaming Ingestion
● Dropwizard-based Java app with endpoints defined for ingesting data.
● Performs schema validation online.
● Relays validated data to Kafka.
● Validation failures go through a different flow, and customers are alerted
if the number of failures breaches the alerting rules.
● Clients get a 200 response code and a traceId once the ingested data is
actually accepted by the service.
● Monitoring is built for the service by exposing JMX metrics, which go to
a central monitoring service.
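The request flow described above can be sketched roughly as follows. The real service is a Dropwizard Java app, so this is an illustration of the flow only, not its code. The Kafka producer is replaced by an in-memory list, and the validator, the 400 status for rejected payloads and the simple count-based alert rule are all assumptions that the slides do not specify.

```python
import json
import uuid
from typing import Any, Callable, Dict, List, Tuple


def require_order_id(payload: Any) -> List[str]:
    """Hypothetical validator standing in for real schema validation."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    return [] if "order_id" in payload else ["missing field: order_id"]


class IngestionService:
    def __init__(self, validate: Callable[[Any], List[str]],
                 alert_threshold: int = 100) -> None:
        self.validate = validate
        self.alert_threshold = alert_threshold      # assumed alert rule
        self.kafka_topic: List[Dict[str, Any]] = []  # stand-in for a producer
        self.failures: List[Dict[str, Any]] = []     # separate failure flow
        self.alerts: List[str] = []

    def ingest(self, body: str) -> Tuple[int, Dict[str, Any]]:
        trace_id = uuid.uuid4().hex
        try:
            payload = json.loads(body)
            errors = self.validate(payload)
        except json.JSONDecodeError:
            errors = ["body is not valid JSON"]
        if errors:
            # Failures never reach the main topic; they are tracked separately.
            self.failures.append({"traceId": trace_id, "errors": errors})
            if len(self.failures) == self.alert_threshold:
                self.alerts.append(f"{len(self.failures)} validation failures")
            return 400, {"traceId": trace_id, "errors": errors}  # assumed status
        # Only validated data is relayed, and only then is 200 returned.
        self.kafka_topic.append(payload)
        return 200, {"traceId": trace_id}


service = IngestionService(require_order_id, alert_threshold=2)
print(service.ingest('{"order_id": "OD1"}')[0])  # 200
print(service.ingest('{"sku": "X"}')[0])         # 400
```

Returning the traceId only after the relay step gives the client a handle for an event the service has actually accepted.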
More about Kafka
● Kafka is a distributed, partitioned, replicated and fault-tolerant
publish-subscribe system (but with a unique design).
● Invented at LinkedIn and used by many other large companies today (Yahoo,
Twitter, Netflix, Uber, Goldman Sachs).
● Has the notions of producers, consumers, brokers, topics and partitions.
● Messages are persistent. Multiple consumers can consume the same messages,
and a consumer can read a message again by resetting its offset (replay).
● Highly scalable and highly configurable.
● Excellent documentation and community support.
● Battle-tested and easy to administer.
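The persistence and replay behaviour can be illustrated with a toy in-memory model of a single partition. This is not the Kafka client API. The class and method names are invented to show the idea that reading does not remove messages and that each consumer tracks its own offset.

```python
from typing import List


class PartitionLog:
    """Append-only log: messages stay in place after being read."""

    def __init__(self) -> None:
        self._messages: List[str] = []

    def append(self, message: str) -> int:
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to the new message

    def read(self, offset: int, max_messages: int = 100) -> List[str]:
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer keeps its own position in the log."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self) -> List[str]:
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        self.offset = offset  # resetting the offset enables replay


log = PartitionLog()
for event in ["e0", "e1", "e2"]:
    log.append(event)

batch_flow, realtime_flow = Consumer(log), Consumer(log)
print(batch_flow.poll())     # ['e0', 'e1', 'e2']
print(realtime_flow.poll())  # ['e0', 'e1', 'e2'] (independent of batch_flow)
batch_flow.seek(1)
print(batch_flow.poll())     # ['e1', 'e2'] (replayed)
```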
Onto downstream systems ...
● Kafka is a temporary store and retains only the last 30 days of data
(retention is configurable by number of days or by size).
● Current consumers of our Kafka cluster include batch-processing and
real-time processing flows.
● We use Camus to copy data from Kafka to Hadoop. A Camus instance currently
runs every hour to copy all the new data in Kafka to Hadoop.
● The stream-processing flow, built on top of Storm, uses the official
KafkaSpout to consume data from Kafka.
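An hourly job that copies "all the new data" needs to remember where the previous run stopped. The sketch below shows that idea with a per-partition offset checkpoint. It is a toy model under that assumption, not Camus code or its API, and it ignores retention (messages expiring from Kafka before they are copied).

```python
from typing import Dict, List, Tuple


def copy_new_messages(
    partitions: Dict[int, List[str]],
    checkpoint: Dict[int, int],
) -> Tuple[List[str], Dict[int, int]]:
    """Copy messages added since the last run and return the new checkpoint.

    `partitions` maps a partition id to its message log; `checkpoint` maps a
    partition id to the next offset to read. Both are in-memory stand-ins.
    """
    copied: List[str] = []
    new_checkpoint = dict(checkpoint)
    for pid, messages in sorted(partitions.items()):
        start = checkpoint.get(pid, 0)
        copied.extend(messages[start:])
        new_checkpoint[pid] = len(messages)
    return copied, new_checkpoint


partitions = {0: ["a", "b"], 1: ["c"]}
first_run, checkpoint = copy_new_messages(partitions, {})
print(first_run)   # ['a', 'b', 'c']

partitions[0].append("d")  # new data arrives before the next hourly run
second_run, checkpoint = copy_new_messages(partitions, checkpoint)
print(second_run)  # ['d']
```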
References
● Streaming Ingestion and Processing at FDP - speakerdeck.com/sids/streaming-ingestion-and-processing-at-flipkart
● Kafka - http://kafka.apache.org/081/documentation.html
● LinkedIn Camus - https://github.com/linkedin/camus
● Apache Avro - http://avro.apache.org/docs/current/
● Dropwizard - http://www.dropwizard.io/
● Blog on building a stream data platform - http://blog.confluent.io/2015/02/25/stream-data-platform-2/
Questions?
BTW, We are hiring !!
careers.flipkart.com
