The New <Data> Deal:
RealTime Ingest
A fast pipeline befitting hundreds of millions of
customers.
Agenda
• What Paytm Labs does
• Our Data Platform
• Moving to Realtime data ingestion
2
What do we do at Paytm Toronto?
• Create Data Platforms and Data
Products to be used by Paytm in
the following areas:
• Create a comprehensive data
platform for archive, processing,
and data-based decisions
• Fraud Detection and Prevention
• Analytical Scoring, Reporting, and
other Data Science challenges
• Building an advertising technology
platform to generate revenue and
increase customer engagement
• Blockchain(!) Stay tuned…
3
The Platform
4
The Data Platform at Paytm:
• Maquette
• Continues to be the mainstay of RT Fraud Prevention
• Provides a Rule DSL as well as a flexible data model
• Fabrica
• Modularized feature creation framework (yes, a real framework)
• In Spark and SparkSQL for now
• Will move to Spark Streaming very soon
• Chilika
• Hadoop and related technologies
• Processing and Storage muscle for most data platform tasks
• Includes data export and BI tooling like AtScale, Tableau, and ES
5
Fabrica
• Modular framework for the execution of feature creation, scoring, and export jobs (a minimal sketch follows this slide)
• Parallel job execution, optimized by caching targeted datasets
• Handles complex transformations and can automate new feature creation
• Easily consumes machine learning libraries, especially Spark MLlib
• Starts as a Spark batch job and will move to a Spark Streaming application
• Written in Scala on Spark
• A DSL is coming later
6
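A minimal sketch (not the actual Fabrica code) of what a modular feature job might look like in Spark Scala. The FeatureModule trait, the sales_order table, and the order_velocity_7d feature are hypothetical stand-ins; the point is that modules declare what they need, and the runner caches the targeted datasets once so modules can run in parallel without recomputing them.

// Hypothetical sketch of a Fabrica-style feature module; names are illustrative.
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions._

trait FeatureModule {
  def name: String
  def build(inputs: Map[String, DataFrame]): DataFrame
}

object OrderVelocityFeature extends FeatureModule {
  val name = "order_velocity_7d"
  def build(inputs: Map[String, DataFrame]): DataFrame = {
    val orders = inputs("sales_order")  // already cached by the runner
    orders
      .filter(col("created_at") >= date_sub(current_date(), 7))
      .groupBy("customer_id")
      .agg(count("order_id").as("orders_last_7d"),
           sum("grand_total").as("spend_last_7d"))
  }
}

object FeatureRunner {
  // Cache the targeted datasets once, then run the modules in parallel.
  def run(sqlContext: SQLContext, modules: Seq[FeatureModule]): Unit = {
    val inputs = Map("sales_order" -> sqlContext.table("sales_order").cache())
    modules.par.foreach { m =>
      m.build(inputs).write.mode("overwrite").saveAsTable(s"features_${m.name}")
    }
  }
}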
Fabrica
7
Maquette:
• Realtime rule engine for fraud detection
• All of our marketplace transactions are evaluated in realtime, with hundreds of fraud rules evaluated concurrently (a rough sketch follows this slide)
• Custom Scala/Akka application with a Cassandra datastore
• Can be used with our datastores, such as Titan, GraphIntelligence, HBase, etc.
• Interface for Rule and Threshold tuning
• Handles millions of txns per day at an average response time
of 20ms
8
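Maquette itself is a custom Scala/Akka application backed by Cassandra; the following is only a rough illustration of the idea of evaluating many rules concurrently against a single transaction. The Txn fields, Rule trait, threshold, and time budget are hypothetical.

// Illustrative only: concurrent rule evaluation, not the real Maquette engine.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

case class Txn(txnId: String, customerId: String, amount: Long, merchant: String)

trait Rule {
  def name: String
  def evaluate(txn: Txn): Boolean  // true = the rule fired
}

object HighAmountRule extends Rule {
  val name = "high_amount"
  def evaluate(txn: Txn): Boolean = txn.amount > 500000L  // hypothetical threshold
}

object RuleEngine {
  // Evaluate all rules concurrently and collect the names of those that fired,
  // bounded by a response-time budget in the same ballpark as the ~20ms average.
  def evaluate(txn: Txn, rules: Seq[Rule]): Seq[String] = {
    val checks = Future.traverse(rules) { r =>
      Future(r.evaluate(txn)).map(fired => if (fired) Some(r.name) else None)
    }
    Await.result(checks, 50.milliseconds).flatten
  }
}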
Chilika (aka Data Lake, aka Hadoop Cluster):
Moving to a Realtime Data Pipeline
9
What we have been using…
10
A “traditional” batch ingest process to Hadoop
• A 24 hour cycle batch-driven process
• A reservoir of detailed data for the past 6 months for core
tables, 2 years for some higher level data, and a few months
for the new data sets
• Hadoop (HDP) data processing tools primarily, specifically Hive
• Hive (SQL) transformations require so much additional logic for reprocessing, alerting, etc. that they are wrapped in Python programs that call them
• For event-streams (aka Real-Time), we load into Kafka. We pull this event data off into aggregated Avro files for archive in HDFS (the archive step is sketched after this slide).
11
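For the archive step only, a minimal sketch of how a pulled-off batch of events could be written as a single Avro container file in HDFS, assuming the standard Avro and Hadoop FileSystem APIs. Kafka consumption, batching, and offset management are omitted, and archiveBatch and the path are hypothetical.

// Sketch of archiving a batch of consumed events as an Avro file in HDFS.
// Kafka consumption/offset management is omitted; names are illustrative.
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def archiveBatch(records: Seq[GenericRecord], schema: Schema, hdfsPath: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(hdfsPath))
  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, out)   // writes the Avro header with the schema
  records.foreach(writer.append)
  writer.close()               // flushes and closes the underlying HDFS stream
}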
When MySQL fails…we fail
12
Whenever you change a schema, you kill a kitten
somewhere in the world…
13
Lots of room for improvement…
• A 24 hour cycle batch-driven process means stale data for
a lot of use cases
• The most important and most fragile pipeline is MySQL
• The MySQL instances rely on a chain of Master-Replica-
Replica-Replica to get to Hadoop. This chain fails a lot
• The MySQL chain has a fixed schema from RDBMS to
Hive.
• Assumptions that this schema is fixed are carried forward throughout our own processing pipeline.
• Changes to schema result in a cascading failure
• Hive does not have a resilient and programmatic way of
handling schema change
• Others have attempted to write custom Java Hive SerDes to correct data, but this puts too much logic in the wrong spot
• By using Hive for complicated transformations, we have forced unnecessary temporary tables, created hacky nanny scripts, and made it nearly impossible to compose complex transformations
14
A word on impatience…
• The number of signals and actions that a mobile phone user generates is much higher than for a web user, by virtue of their mobility
• Reducing the MTTF (Mean Time To Feature) from hours to minutes opens up an entirely new set of interactions with users:
• More advertising inventory with hyperlocal (i.e. walk into a store) targeting, i.e. more revenue potential
• Better fraud detection and risk assessment
• More opportunities to construct a meaningful relationship
with the customer through helpful engagement:
• bill splitting
• localized shopping reminders – “while you are here...”
• Experience planning (you were looking for a restaurant, so we
suggest something you would like, plan your transit/train, and
order via FoodPanda)
15
Chilika Realtime Edition
16
Using
Confluent.io
17
Our realtime approach: DFAI (Direct-From-App-Ingest)
• Requires our primary applications to implement an SDK
that we provide
• The SDK wraps the Confluent.io SDKs and registers our schemas (a rough producer sketch follows this slide)
• Schema management is done automatically with the Confluent.io Schema Registry, using Apache Avro
• Avro schemas are flexible, unlike @#$^@!!! SQL schemas
• Avro is open source, so our schemas would still be manageable even if we moved away from Confluent. Our data is safe for the long term.
18
DFAI = Direct-From-App-Ingest
• Market-order
• order/invoice: sales_order_invoice table
• create
• updateAddress
• order/order : sales_order table
• create
• update
• order/payment: sales_order_payment table
• create
• insertUpdateExisting
• update
• order/item: sales_order_item table
• create
• update
• order/address: sales_order_address table
• create
• updateAddress
• order/return_fulfillment: sales_order_return table
• create
• update
• order/refund: sales_order_refund table
• create
• update
Order/invoice schema example (a usage sketch follows this slide):
{ "namespace" : "com.paytm",
"name": "order_invoice_value",
"type": "record",
"fields": [ { "name": "tax_amount", "type": unitSchemas.nullLong},
{ "name": "surchange_amount", "type": unitSchemas.nullLong.name},
{ "name": "subtotal", "type":unitSchemas.nullLong.name},
{ "name": "subtotal_incl_tax", "type":unitSchemas.nullLong.name},
{ "name": "fulfillment_id", "type":unitSchemas.nullLong.name},
{ "name": "created_at", "type": unitSchemas.str} ]
}
19
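Hypothetical usage of the order/invoice schema above, assuming unitSchemas.nullLong resolves to the Avro union ["null", "long"] and unitSchemas.str to "string". The field values, topic, and key are made up, and DfaiProducer is the sketch from the previous slide.

// Build a GenericRecord against the order_invoice_value schema and send it.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

val invoiceSchema: Schema = new Schema.Parser().parse(
  """{"namespace": "com.paytm", "name": "order_invoice_value", "type": "record",
    | "fields": [
    |   {"name": "tax_amount",        "type": ["null", "long"]},
    |   {"name": "surchange_amount",  "type": ["null", "long"]},
    |   {"name": "subtotal",          "type": ["null", "long"]},
    |   {"name": "subtotal_incl_tax", "type": ["null", "long"]},
    |   {"name": "fulfillment_id",    "type": ["null", "long"]},
    |   {"name": "created_at",        "type": "string"}]}""".stripMargin)

val invoice: GenericRecord = new GenericData.Record(invoiceSchema)
invoice.put("tax_amount", 1800L)
invoice.put("surchange_amount", null)
invoice.put("subtotal", 10000L)
invoice.put("subtotal_incl_tax", 11800L)
invoice.put("fulfillment_id", 42L)
invoice.put("created_at", "2015-11-27T10:15:00Z")

// Sent through the hypothetical wrapper from the previous sketch.
new DfaiProducer("broker:9092", "http://schema-registry:8081")
  .send("order_invoice", "order-123", invoice)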
Event Sourcing & Stream Processing
• Read this: http://www.confluent.io/blog/making-sense-of-stream-processing/
• Basic concept: treat the stream as an immutable set of
events and then aggregate views specific to use cases
• We will use Fabrica to stream and aggregate data in
realtime eventually
• For batch work, we post-process the create/update events to yield an aggregated state view; in other words, the last state (a minimal sketch follows this slide)
20
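A minimal sketch, in Spark SQL, of the batch post-processing described above: rank the create/update events per order by event time and keep only the newest one as the last state. Column names (order_id, event_time) are illustrative.

// Collapse an event stream into its last-state view: newest event per order wins.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def lastStateView(events: DataFrame): DataFrame = {
  val newestFirst = Window.partitionBy("order_id").orderBy(col("event_time").desc)
  events
    .withColumn("rn", row_number().over(newestFirst))
    .filter(col("rn") === 1)   // keep only the most recent create/update event
    .drop("rn")
}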
Say no to batch ingest:
talk to your Data Architect about DFAI today.
21
Thank you!
Adam Muise
+1-416-417-4037
adam@paytm.com
