The New <Data> Deal:
RealTime Ingest
A fast pipeline befitting hundreds of millions of
customers.
Agenda
• What Paytm Labs does
• Our Data Platform
• Moving to Realtime data ingestion
2
What do we do at Paytm Toronto?
• Create Data Platforms and Data
Products to be used by Paytm in
the following areas:
• Create a comprehensive data
platform for archive, processing,
and data-based decisions
• Fraud Detection and Prevention
• Analytical Scoring, Reporting, and
other Data Science challenges
• Building an advertising technology
platform to generate revenue and
increase customer engagement
• Blockchain(!) Stay tuned…
3
The Platform
4
The Data Platform at Paytm:
• Maquette
• Continues to be the mainstay of RT Fraud Prevention
• Provides a Rule DSL as well as a flexible data model
• Fabrica
• Modularized feature creation framework (yes, a real framework)
• In Spark and SparkSQL for now
• Will move to Spark Streaming very soon
• Chilika
• Hadoop and related technologies
• Processing and Storage muscle for most data platform tasks
• Includes data export and BI tooling like Atscale, Tableau, and ES
5
Fabrica
• Modular framework for execution of feature creation, scoring,
and export jobs
• Parallel job execution, optimized by caching targeted
datasets
• Handles complex transformations and can automate new
feature creation
• Easily consumes Machine Learning libraries, especially Spark
MLLib
• Starts as a Spark batch job and moves to a Spark Streaming
application
• Written in Spark Scala
• A DSL coming later
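To make the "parallel jobs over a cached dataset" idea concrete, here is a minimal plain-Python sketch. It is not Fabrica and does not use Spark: the dataset, the feature functions, and the names `load_dataset`, `FEATURE_JOBS`, and `run_jobs` are all hypothetical, and `lru_cache` only stands in for caching a targeted dataset so that several feature jobs can share one load.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

load_count = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=None)
def load_dataset(name: str) -> tuple:
    """Stand-in for an expensive read; cached so each dataset loads once."""
    global load_count
    load_count += 1
    # Hypothetical rows: (customer_id, order_amount)
    return (("c1", 120), ("c1", 80), ("c2", 40))


def total_spend(rows):
    """Feature module: total order amount per customer."""
    out = {}
    for customer, amount in rows:
        out[customer] = out.get(customer, 0) + amount
    return out


def order_count(rows):
    """Feature module: number of orders per customer."""
    out = {}
    for customer, _ in rows:
        out[customer] = out.get(customer, 0) + 1
    return out


# Each feature is an independent module registered by name.
FEATURE_JOBS = {"total_spend": total_spend, "order_count": order_count}


def run_jobs(dataset: str) -> dict:
    rows = load_dataset(dataset)  # warm the cache before fanning out
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(job, rows) for name, job in FEATURE_JOBS.items()}
        return {name: future.result() for name, future in futures.items()}


features = run_jobs("orders")
```

Adding a feature in this sketch means registering one more function in `FEATURE_JOBS`; the shared dataset is still loaded only once.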
6
Fabrica
7
Maquette:
• Realtime rule engine for fraud detection
• All of our marketplace transactions are evaluated in realtime
with concurrent evaluation on hundreds of fraud rules
• Custom Scala/Akka application with a Cassandra datastore
• Can be used with our datastores, such as Titan,
GraphIntelligence, HBase, etc
• Interface for Rule and Threshold tuning
• Handles millions of txns per day at an average response time
of 20ms
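As an illustration of evaluating many rules concurrently against a single transaction, here is a small Python sketch. Maquette itself is a Scala/Akka application; the rule names, thresholds, and transaction fields below are invented for the example and are not taken from the real rule set.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    name: str
    predicate: Callable[[Dict], bool]  # returns True when the rule fires


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule("large_amount", lambda txn: txn["amount"] > 50_000),
    Rule("new_account", lambda txn: txn["account_age_days"] < 1),
    Rule("many_items", lambda txn: txn["item_count"] > 20),
]


def evaluate(txn: Dict, rules: List[Rule], pool: ThreadPoolExecutor) -> List[str]:
    """Run every rule against one transaction concurrently; return fired rule names."""
    results = pool.map(lambda rule: (rule.name, rule.predicate(txn)), rules)
    return [name for name, fired in results if fired]


with ThreadPoolExecutor(max_workers=4) as pool:
    fired = evaluate(
        {"amount": 75_000, "account_age_days": 0, "item_count": 2}, RULES, pool
    )
```

Because each rule only reads the transaction, the rules can run in any order or in parallel, and tuning a threshold means changing one predicate.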
8
Chilika (aka Datalake, aka Hadoop Cluster):
Moving to a Realtime Data Pipeline
9
What we have been using…
10
A “traditional” batch ingest process to Hadoop
• A 24 hour cycle batch-driven process
• A reservoir of detailed data for the past 6 months for core
tables, 2 years for some higher level data, and a few months
for the new data sets
• Hadoop (HDP) data processing tools primarily, specifically Hive
• Hive (SQL) transformations require so much additional logic
for reprocessing, alerting, etc. that Python programs have to
call them
• For event streams (aka Real-Time), we load into Kafka. We pull
this event data off into an aggregated Avro file for archive in
HDFS.
11
When MySQL fails…we fail
12
Whenever you change a schema, you kill a kitten
somewhere in the world…
13
Lots of room for improvement…
• A 24 hour cycle batch-driven process means stale data for
a lot of use cases
• The most important and most fragile pipeline is MySQL
• The MySQL instances rely on a Master-Replica-Replica-Replica
chain to get to Hadoop. This chain fails a lot
• The MySQL chain has a fixed schema from RDBMS to
Hive.
• Assumptions that this schema is fixed are carried forward
throughout our own processing pipeline.
• Changes to schema result in a cascading failure
• Hive does not have a resilient and programmatic way of
handling schema change
• Others have attempted to write custom Java Hive SerDes to
correct data but this puts too much logic in the wrong spot
• By using Hive for transformations that are complicated,
we have forced unnecessary temporary tables, created
hacky nanny scripts, and made it nearly impossible to
compose complicated transformations
14
A word on impatience…
• The number of signals and actions that a mobile
phone user generates is much higher than that of a
web user, by virtue of their mobility
• Reducing the MTTF (Mean Time To Feature) from
hours to minutes opens up an entirely new set of
interactions with users:
• More advertising inventory with hyperlocal targeting (e.g. when a
user walks into a store), i.e. more revenue potential
• Better fraud detection and risk assessment
• More opportunities to construct a meaningful relationship
with the customer through helpful engagement:
• bill splitting
• localized shopping reminders – “while you are here...”
• Experience planning (you were looking for a restaurant, so we
suggest something you would like, plan your transit/train, and
order via FoodPanda)
15
Chilika Realtime Edition
16
Using
Confluent.io
17
Our realtime approach: DFAI (Direct-From-App-Ingest)
• Requires our primary applications to implement an SDK
that we provide
• The SDK wraps the Confluent.io SDKs, with our
schemas registered
• Schema management is done automatically with the
Confluent.io schema registry using Apache Avro
• Avro schemas are flexible and can evolve, unlike @#$^@!!! SQL
schemas
• Avro is open source, so our schemas would still be
manageable even if we moved away from using
Confluent. Our data is safe for the long term.
18
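The flexibility claimed for Avro comes from its schema resolution rules: a reader can use a newer schema than the writer as long as added fields carry defaults, and fields the reader does not know are ignored. The sketch below imitates only that one rule in plain Python. It is a simplification rather than the Avro library, and the `discount_amount` field is hypothetical.

```python
# Schema the producing app wrote with (version 1).
WRITER_V1 = {
    "type": "record",
    "name": "order_invoice_value",
    "namespace": "com.paytm",
    "fields": [
        {"name": "tax_amount", "type": ["null", "long"], "default": None},
    ],
}

# Schema a consumer reads with (version 2): one added field with a default.
READER_V2 = {
    "type": "record",
    "name": "order_invoice_value",
    "namespace": "com.paytm",
    "fields": [
        {"name": "tax_amount", "type": ["null", "long"], "default": None},
        # Hypothetical new field, added for illustration.
        {"name": "discount_amount", "type": ["null", "long"], "default": None},
    ],
}


def resolve(record: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Project a decoded record from the writer's schema onto the reader's schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]          # field present in both schemas
        elif "default" in field:
            out[name] = field["default"]      # reader-only field: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out  # writer-only fields are simply dropped


# Old data read with the new schema, and new data read with the old schema.
upgraded = resolve({"tax_amount": 1200}, WRITER_V1, READER_V2)
downgraded = resolve({"tax_amount": 1200, "discount_amount": 50}, READER_V2, WRITER_V1)
```

Under this rule, adding a field with a default does not break existing consumers. A fixed SQL-to-Hive schema, by contrast, fails downstream when a column changes.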
DFAI = Direct-From-App-Ingest
• Market-order
  • order/invoice: sales_order_invoice table
    • create
    • updateAddress
  • order/order: sales_order table
    • create
    • update
  • order/payment: sales_order_payment table
    • create
    • insertUpdateExisting
    • update
  • order/item: sales_order_item table
    • create
    • update
  • order/address: sales_order_address table
    • create
    • updateAddress
  • order/return_fulfillment: sales_order_return table
    • create
    • update
  • order/refund: sales_order_refund table
    • create
    • update
Order/invoice schema example:
{
  "namespace": "com.paytm",
  "name": "order_invoice_value",
  "type": "record",
  "fields": [
    { "name": "tax_amount", "type": unitSchemas.nullLong },
    { "name": "surchange_amount", "type": unitSchemas.nullLong.name },
    { "name": "subtotal", "type": unitSchemas.nullLong.name },
    { "name": "subtotal_incl_tax", "type": unitSchemas.nullLong.name },
    { "name": "fulfillment_id", "type": unitSchemas.nullLong.name },
    { "name": "created_at", "type": unitSchemas.str }
  ]
}
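The schema above is a template that refers to shared type fragments (`unitSchemas.nullLong`, `unitSchemas.str`). The Python sketch below shows one way such a template could be expanded into plain Avro JSON. The expansions are assumptions, not taken from the deck: `nullLong` is guessed to be the nullable-long union `["null", "long"]` and `str` to be `"string"`. The template's `.name` references are ignored here, and the field names are copied as they appear on the slide.

```python
import json

# Assumed expansions of the shared fragments (not confirmed by the source).
UNIT_SCHEMAS = {
    "nullLong": ["null", "long"],  # nullable 64-bit integer
    "str": "string",
}


def field(name: str, unit: str) -> dict:
    """Build one Avro field definition from a shared unit schema."""
    return {"name": name, "type": UNIT_SCHEMAS[unit]}


ORDER_INVOICE_VALUE = {
    "namespace": "com.paytm",
    "name": "order_invoice_value",
    "type": "record",
    "fields": [
        field("tax_amount", "nullLong"),
        field("surchange_amount", "nullLong"),
        field("subtotal", "nullLong"),
        field("subtotal_incl_tax", "nullLong"),
        field("fulfillment_id", "nullLong"),
        field("created_at", "str"),
    ],
}

# Plain JSON text of the kind a schema registry accepts.
schema_json = json.dumps(ORDER_INVOICE_VALUE, indent=2)
```

Keeping the unit schemas in one place means every topic spells "nullable long" the same way, which makes later schema changes easier to review.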
19
Event Sourcing & Stream Processing
• Read this: http://www.confluent.io/blog/making-sense-of-stream-processing/
• Basic concept: treat the stream as an immutable set of
events and then aggregate views specific to use cases
• We will use Fabrica to stream and aggregate data in
realtime eventually
• For batch work, we post-process the create / update
events to yield an aggregate state view. In other words,
the last state
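The batch post-processing step can be pictured as a fold over the event log: replay the create and update events in order and keep the merged result per key. Here is a minimal Python sketch, assuming a hypothetical event shape with `ts`, `op`, `key`, and `fields`; the real DFAI event format is not shown in the deck.

```python
from typing import Dict, Iterable


def latest_state(events: Iterable[dict]) -> Dict[str, dict]:
    """Replay events in timestamp order and return the last state per key.

    Both "create" and "update" are treated as upserts: their fields are
    merged into whatever is already known about the key.
    """
    state: Dict[str, dict] = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        record = state.setdefault(event["key"], {})
        record.update(event["fields"])
    return state


# Hypothetical events; the log itself is never modified.
EVENTS = [
    {"ts": 1, "op": "create", "key": "o-1", "fields": {"status": "placed", "subtotal": 500}},
    {"ts": 3, "op": "update", "key": "o-1", "fields": {"status": "shipped"}},
    {"ts": 2, "op": "create", "key": "o-2", "fields": {"status": "placed", "subtotal": 120}},
]

view = latest_state(EVENTS)
```

Since the events stay immutable, other use cases can build different views (counts per status, time between create and update) from the same log without another ingest.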
20
Say no to batch ingest:
talk to your Data Architect about DFAI today.
21
Thank you!
Adam Muise
+1-416-417-4037
adam@paytm.com

2015 nov 27_thug_paytm_rt_ingest_brief_final
