To prevent fraud more effectively, one needs to process a lot more data: older data, by looking back much further in time; richer and wider data, by integrating additional sources such as social and connected data. Scaling such processing to web-scale volume and latency requires novel technologies and alternative thinking.
Hi everyone, this is your last chance to get out in case you have the wrong room and session.
Okay, relax. Don’t be afraid if “fraud prevention” and “friends” are the only words in the title you’re familiar with.
My goal today is to make this accessible to all of you.
My name is Edi Bice and I’m a senior data scientist at eBay Enterprise, a former division of eBay and different from eBay Marketplaces.
Together with Innotrack, we are the world’s largest omnichannel commerce provider, and a partner to the world’s most iconic brands.
We provide everything from retail order management, payments, fraud and tax to fulfillment and transportation, store fulfillment, customer service etc.
Older Data – looking back a lot further in time. “Older data is not effective” is a poor excuse – what about “home for the holidays”? Profiling infrequent behavior requires longer timeframes to gather sufficient data points.
Wider Data – exploiting all available data. Still, how wide can that be!? How wide is “Edi Bice” – 7 characters, 14 bytes’ worth? Infinite! That’s just the compressed form: expand it into my purchase history, web logs, FB posts I liked, tweets I retweeted, the things mentioned in them, and so on – not very compressed at all.
Richer Data – Social/public data: people, places, interests graph – who is friends/related with/to whom, lived/studied/travelled where. Connected order data: who shipped to whom, where; tender, email, devices, IP addresses. Unstructured data: FB posts, tweets, Pinterest pins.
Faster Data – Clickstream data: browsing, researching specific product, reading reviews, typing characteristics, etc.
Web scale capable – what is web-scale? Google, Facebook, … But it’s more about the ability to scale up, or down, than the size itself. “By 2017, Web-scale IT … in 50 percent of global enterprises … according to Gartner, Inc.”
Low Latency – milliseconds, not hours.
Fault Tolerance – expected failure, redundancy, distribution of responsibilities
… we are all normal in our own abnormal, special ways. So we can’t define one average and apply it to all customers – we need one for each customer. And not just that: we need averages over periods of time, at the same or similar merchants, and so on.
Using transaction number/amount velocity average to detect likely aberrant behavior.
Typical implementation: Why a sliding window? We can’t store and pull the entire, unlimited history in/from a DB, so we restrict ourselves to recent history. Why a single SQL DB? Who can afford Oracle RAC?!
Sliding window restricts calculation of normal customer profile. How often do you shop at X? Using card Y? Shipping to your friend Z? Oops, we let it slide out of the sliding window!!
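A minimal sketch of the problem, with hypothetical data and window sizes: a customer who ships a gift to friend Z once a year looks like a first-time anomaly to a 30-day window.

```python
from datetime import date, timedelta

def in_window(txns, today, window_days):
    """Keep only transactions inside the sliding window."""
    cutoff = today - timedelta(days=window_days)
    return [t for t in txns if t["date"] >= cutoff]

# Hypothetical history: a gift shipped to friend Z every December.
history = [{"date": date(y, 12, 20), "ship": "friend Z"} for y in (2012, 2013, 2014)]

today = date(2015, 12, 20)
recent = in_window(history, today, window_days=30)
# The 30-day window sees none of the yearly gifts, so this year's
# order to friend Z looks brand new and suspicious.
print(len(recent))                                       # 0
print(len(in_window(history, today, window_days=400)))   # 1
```

Widening the window recovers the pattern, but only by pulling ever more history from the database on every live transaction.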
How can we profile properly, efficiently, and quickly? At large scale?
Assembly line processing!
The Venetian Arsenal, dating to about 1104, operated much like a production line: ships moved down a canal and were fitted out by the various shops they passed.
Ransom Olds patented the assembly line concept, which he put to work in his Olds Motor Vehicle Company factory in 1901.
In 1913, Henry Ford perfected the assembly line by installing driven conveyor belts that could produce a Model T in 93 minutes.
Streaming Analytics is assembly line processing!
An order comes through: we look up CTxns for Edi, find it’s 0, increment it to 1, and store it back. We look up CustAvgAmt for Edi, find it’s 0, calculate the new average, and store it.
Another order comes through, we look up … and so on.
We don’t store and access all of Edi’s transactions in order to compute total number of transactions and average transaction amount.
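The per-customer bookkeeping just described can be sketched in a few lines (the dictionary store and field names are illustrative): each order updates the running count and average in constant time, with no transaction history kept anywhere.

```python
# Running profile per customer, updated in O(1) per order,
# without storing the full transaction history.
profiles = {}  # customer -> {"ctxns": int, "avg_amt": float}

def update(customer, amount):
    p = profiles.setdefault(customer, {"ctxns": 0, "avg_amt": 0.0})
    # New average folds the new amount into the old average and count.
    new_avg = (amount + p["avg_amt"] * p["ctxns"]) / (p["ctxns"] + 1)
    p["ctxns"] += 1
    p["avg_amt"] = new_avg
    return p

update("Edi", 50)        # ctxns=1, avg=50.0
p = update("Edi", 100)   # ctxns=2, avg=75.0
print(p)
```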
“summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible.”
Average, standard deviation, skewness etc.
1964 Doug McIlroy internal Bell Labs memo:
“We should have some ways of connecting programs like [a] garden hose – screw in another segment when it becomes necessary to massage data in another way.”
Unix philosophy 1978:
1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features.”
2. Expect the output of every program to become the input to another, as yet unknown, program.
A Samza job is like an assembly line worker/robot. Except that it can pull work items from many (input) conveyor belts (streams), and place completed work in many (output) conveyor belts (streams).
Samza tasks are like worker clones – if one task per job is good enough, use one; otherwise add more (up to the total number of stream partitions).
Kafka – the real conveyor belts (streams).
The Kafka ecosystem at LinkedIn handles over 800 billion messages per day, which amounts to over 175 terabytes of data. To handle all these messages, LinkedIn runs over 1100 Kafka brokers/nodes.
How does Kafka do this?
Partitioned topics/streams (how else could Henry Ford produce more than 1 Model T per 93 minutes?)
Sequential disk IO (append-only, no updates, unlike most databases)
Multiple producers, multiple consumers per topic
Distributed (teamwork – different Kafka nodes handling different partitions)
Track your own work (consumers track their own offsets – no manager bottleneck)
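Keyed partitioning is what makes the teamwork possible: every message with the same key lands on the same partition, so one consumer sees a key's full stream, in order. A toy sketch (Kafka's default partitioner hashes the key bytes; a stable CRC32 stands in for it here, and the partition count is made up):

```python
import zlib

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(key: str) -> int:
    # Deterministic hash of the key, modulo the partition count.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

p1 = partition_for("edi@example.com")
p2 = partition_for("edi@example.com")
print(p1 == p2)  # True - same key, same partition, every time
```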
Live orders are streamed to “orders” Kafka topic/stream (conveyor belt).
The OrderManifold task partitions and routes orders to several Kafka topics, each keyed by a specific order field, ensuring that one worker sees all orders for a given email, for example.
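A manifold (1-in-N-out) step can be sketched as re-emitting each incoming order under several different keys; the topic names and order fields below are illustrative, not the actual pipeline's.

```python
# Sketch of a manifold step: fan one order out to several keyed topics
# so each downstream worker can partition by the key it cares about.
def manifold(order):
    return [
        ("orders_by_billemail", order["bill_email"], order),
        ("orders_by_shipaddr", order["ship_addr"], order),
        ("orders_by_device", order["device_id"], order),
    ]

order = {"bill_email": "edi@example.com",
         "ship_addr": "123 Main St",
         "device_id": "d42"}

for topic, key, payload in manifold(order):
    print(topic, key)
```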
OrderByBillEmail task receives, calculates and stores all respective email statistics.
JoinBillEmailFeats receives input from two streams, “orders” and “order_by_billemail_feats”, joins them, and sends the result down the assembly line.
ReviewedOrderRouter reads from “reviewed_orders” stream and sends fraudulent ones to “reviewed_fraud” stream and otherwise to “reviewed_notfraud” stream.
FraudOrderManifold is similar to OrderManifold. FraudByBillEmailRisk calculates risk statistics for email and email domain, and so on.
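The join step (as in JoinBillEmailFeats) can be sketched as buffering the latest feature record per key and enriching each matching order as it arrives. The function and field names below are illustrative, not the Samza API:

```python
# Keyed stream-stream join sketch: remember the latest features per
# billing email, and attach them to each arriving order.
feats_by_email = {}

def on_feats(email, feats):
    feats_by_email[email] = feats          # keep only the latest record

def on_order(order):
    feats = feats_by_email.get(order["bill_email"], {})
    return {**order, "billemail_feats": feats}   # enriched order, sent on

on_feats("edi@example.com", {"ctxns": 3, "avg_amt": 100.0})
joined = on_order({"id": 7, "bill_email": "edi@example.com", "amt": 1500})
print(joined["billemail_feats"]["ctxns"])  # 3
```

In practice the buffered state lives in the task's local store rather than an in-memory dict, as described below.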
At the end of the assembly line the car looks like this.
Graphical view of the sample JSON output.
Each + node is expandable and looks similar to billaddr_feats.
Each + node is the product of one Samza job/task (assembly line worker) – specialization of labor principle.
Each + node is appended to the work-in-progress car by a “join”er worker.
Time spent by each task (worker) is tracked to identify slow tasks (workers) – join_billaddr_feats_beg/end in ms (2!)
Local state – instead of a central DB, each Samza job (task) uses a local (Rocks)DB.
RocksDB, now open source, was created at Facebook as a fork of LevelDB, which was created at Google.
“RocksDB can be used by applications that need low latency database accesses.”
Kafka changelog – update RocksDB, write the change to a Kafka stream; Kafka auto-compacts the changelog stream (throws away all but the latest value for a given key).
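Log compaction in miniature (the keys and values here are made up): only the latest value per key survives, and that is exactly enough to rebuild the local store.

```python
# A changelog stream of (key, value) updates as they were written.
changelog = [
    ("edi:ctxns", 1), ("edi:avg_amt", 50.0),
    ("edi:ctxns", 2), ("edi:avg_amt", 75.0),
]

def compact(log):
    latest = {}
    for key, value in log:   # later entries overwrite earlier ones
        latest[key] = value
    return latest

# Restoring the local key-value state after a restart:
store = compact(changelog)
print(store)  # {'edi:ctxns': 2, 'edi:avg_amt': 75.0}
```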
Machine dies, restart on another machine. Who, how?
Enter Hadoop 2.0 YARN (Yet Another Resource Negotiator).
Launching a Samza job
The Samza client talks to the YARN RM when it wants to start a new Samza job. The YARN RM asks a YARN NM for space on the cluster for Samza’s App Master. Once the NM allocates space, it starts the Samza AM. The AM asks the YARN RM for one or more YARN containers to run SamzaContainers (tasks). Once the resources have been allocated, the NMs start the Samza containers. If any container fails, YARN starts a new one – on the same machine if possible, or on a different one.
Great, so it’s fault tolerant. Is it fail/fool proof?!
It’s a complex stack!
Lots can go wrong – monitoring and alerting are key to proactive maintenance and quick troubleshooting when things do go wrong.
• metrics that show how many messages have been processed and sent, the current offset in the input stream partition, and other details
• metrics about the JVM (heap size, garbage collection information, threads, etc.)
• internal metrics of the Kafka producers and consumers, and more
• custom metrics
Low-latency, web-scale metrics monitoring ;-)
Samza JMX metrics → jmxtrans → OpenTSDB/HBase → Grafana
Now, if we have time
Right after the session
Over drinks later on
Shoot me an email
Low Latency Web Scale Fraud prevention with Apache Samza, Kafka and Friends
Senior Data Scientist at eBay Enterprise, leading R&D efforts in applying machine learning to fraud prevention and elsewhere.
Commerce is getting more convenient and more complex – and so is fraud. To keep up, fraud prevention solutions need to process a lot more data:
• Older Data
• Looking back a lot further in time
• “Older data is not effective” is a poor excuse – home for the holidays?
• Wider Data
• Using all available data sources
• How wide can customer name possibly be?
• Richer Data
• Social/unstructured data – people, places, interests
• Connected data – who shipped to whom, where; email, devices, IP addresses
• Faster Data
• Clickstream data – website click patterns
Modern Fraud Prevention Architecture Requirements
• Web scale capable (horizontal scaling using commodity hardware)
• handle more actions and data for each user
• handle more users and more volume from each user
• handle more customers of all sizes (lowest processing cost)
• Low latency (milliseconds not hours)
• card present, digital goods, gift cards, store pickup (in-store online shopping!?)
• e-commerce physical goods? – no teleporting yet so speed up what we can
• process customer interactions in real time (personalization, loyalty, shopping experience)
• dynamic order process (identification, authentication, tender presentation)
• Fault tolerance
• Commodity hardware is not without faults
• Expect and design for routine failures – more like shift changes, or relay races
Preventing fraud is all about detecting abnormal behavior.
Normal behavior is not normal – we are all normal in our
own abnormal special ways.
• Typical customer profiling calculations
• Transaction velocity (#txns_day) and change (#txns_day_1days/#txns_day_10days)
• Amount velocity ($txns_day) and change ($txns_day_1days/$txns_day_10days)
• Typical implementation and technologies
1. Define sliding window interval (7 days, a month, 6 months?)
2. For each live txn pull matching txns (card, ...) from single SQL DB within that sliding window
3. Loop over pulled transactions filtering based on timestamp to calculate change over sub-windows
• Issues, Problems, Solutions?!
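The three steps above can be sketched as a single pass over the pulled transactions, computing the velocity change as the ratio of the 1-day count to the 10-day count (dates and counts below are illustrative):

```python
from datetime import date, timedelta

def velocity_change(txns, today, short_days=1, long_days=10):
    """#txns in the short sub-window divided by #txns in the long one."""
    short_cut = today - timedelta(days=short_days)
    long_cut = today - timedelta(days=long_days)
    n_short = sum(1 for t in txns if t >= short_cut)
    n_long = sum(1 for t in txns if t >= long_cut)
    return n_short / n_long if n_long else 0.0

# 3 transactions today, 3 more about a week ago.
txns = [date(2015, 8, 5)] * 3 + [date(2015, 7, 30)] * 3
print(velocity_change(txns, date(2015, 8, 5)))  # 0.5
```

The cost of step 2 – pulling every matching transaction from the database per live order – is exactly what the streaming approach below avoids.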
CName | Date | $ | ShipAddr | … | CTxns | CustAvgAmt | TxAmt_AvgAmt_Ratio | Shipping #Txns | Shipping AvgAmt
Edi Bice | 8/3/15 | 50 | 123 Main St | … | 1 | 50 = (50 + 0) / 1 | NA | 1 | 50
Edi Bice | 8/3/15 | 100 | 123 Main St | … | 2 | 75 = (100 + 50*1) / 2 | 2.0 = 100 / 50 | 2 | 75
Edi Bice | 8/4/15 | 150 | 123 Main St | … | 3 | 100 = (150 + 75*2) / 3 | 2.0 = 150 / 75 | 3 | 100
Edi Bice | 8/5/15 | 1500 | 999 Wall St | … | 4 | 450 = (1500 + 100*3) / 4 | 15.0 = 1500 / 100 | 1 | 1500
Streaming Analytics: New Avg Amt = (New Txn Amt + Curr Avg Amt * Curr Num Txns) / (Curr Num Txns + 1)
Kafka, Samza, and the Unix philosophy of distributed data by Martin Kleppmann
• Distributed, scalable, publish-subscribe messaging system
• Persistent, high-throughput messaging
• Designed for real time activity stream data processing
PreCog Samza Job Pipeline
Manifold (1-in-N-out) jobs
Risk-by-Y calc jobs
X-by-Y calc jobs
(Diagram: Samza job partitions 0 and 1, each with its own local key-value store, backed by a durable changelog in Kafka)
Embedded key-value: very fast
Machine dies ⇒ local key-value store is lost
Solution: replicate all writes to Kafka!
Machine dies ⇒ restart on another machine
Restore key-value store from changelog
Changelog compaction in the background