Low Latency Web Scale Fraud prevention with Apache Samza, Kafka and Friends

Edi Bice
ebice@ebay.com
Senior Data Scientist at eBay Enterprise, leading R&D efforts in applying machine learning to fraud prevention and elsewhere.
Commerce is getting more convenient and more complex, and so is fraud. To keep up, fraud prevention solutions need to process a lot more data:
• Older Data
  • Looking back a lot further in time
  • "Older data is not effective" is an excuse – home for the holidays?
• Wider Data
  • Using all available data sources
  • How wide can a customer name possibly be?
• Richer Data
  • Social/unstructured data – people, places, interests
  • Connected data – who shipped to whom, where; email, devices, IP addresses
• Faster Data
  • Clickstream data – website click patterns
Modern Fraud Prevention Architecture Requirements
• Web-scale capable (horizontal scaling using commodity hardware)
  • handle more actions and data for each user
  • handle more users and more volume from each user
  • handle more customers of all sizes (lowest processing cost)
• Low latency (milliseconds, not hours)
  • card present, digital goods, gift cards, store pickup (in-store online shopping!?)
  • e-commerce physical goods? – no teleporting yet, so speed up what we can
  • process customer interactions in real time (personalization, loyalty, shopping experience)
  • dynamic order process (identification, authentication, tender presentation)
• Fault tolerance
  • Commodity hardware is not without faults
  • Expect and design for routine failures – more like shift changes, or relay races
Preventing fraud is all about detecting abnormal behavior.
Normal behavior is not normal – we are all normal in our own abnormal special ways.
• Typical customer profiling calculations
  • Transaction velocity (#txns_day) and change (#txns_day_1days/#txns_day_10days)
  • Amount velocity ($txns_day) and change ($txns_day_1days/$txns_day_10days)
• Typical implementation and technologies
  1. Define a sliding window interval (7 days, a month, 6 months?)
  2. For each live txn, pull matching txns (card, ...) from a single SQL DB within that sliding window
  3. Loop over the pulled transactions, filtering on timestamp to calculate change over sub-windows
• Issues, Problems, Solutions?!
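The three-step "typical implementation" above can be sketched as follows. This is a minimal illustration, not the system described in the talk: an in-memory list stands in for the single SQL DB, the function and field names (`pull_window`, `velocity_change`, `card`, `ts`, `amount`) are hypothetical, and reading `#txns_day_1days/#txns_day_10days` as "per-day rate over the last 1 day divided by per-day rate over the last 10 days" is an assumption.

```python
from datetime import timedelta


def pull_window(all_txns, card, now, window_days):
    """Step 2: stand-in for the SQL query that pulls matching txns in the sliding window."""
    start = now - timedelta(days=window_days)
    return [t for t in all_txns if t["card"] == card and start < t["ts"] <= now]


def per_day(txns, now, days, value=lambda t: 1):
    """Step 3: filter on timestamp, then average the chosen value per day."""
    start = now - timedelta(days=days)
    return sum(value(t) for t in txns if start < t["ts"] <= now) / days


def velocity_change(txns, now, short_days=1, long_days=10, value=lambda t: 1):
    """Ratio of the short-window rate to the long-window rate (None if no history)."""
    long_rate = per_day(txns, now, long_days, value)
    if long_rate == 0:
        return None
    return per_day(txns, now, short_days, value) / long_rate
```

Passing `value=lambda t: t["amount"]` gives the amount-velocity change instead of the transaction-count change. Note that every live transaction re-reads and re-scans its whole window, and anything older than the window is invisible to the profile.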
| CName | Date | $ | ShipAddr | … | CTxns | CustAvgAmt | TxAmt_AvgAmt_Ratio | Shipping Addr Txns | Shipping Addr Avg Amount |
|---|---|---|---|---|---|---|---|---|---|
| Edi Bice | 8/3/15 | 50 | 123 Main St | … | 1 | 50 = (50 + 0) / 1 | NA | 1 | 50 |
| Edi Bice | 8/3/15 | 100 | 123 Main St | … | 2 | 75 = (100 + 50*1) / 2 | 2.0 = 100 / 50 | 2 | 75 |
| Edi Bice | 8/4/15 | 150 | 123 Main St | … | 3 | 100 = (150 + 75*2) / 3 | 2.0 = 150 / 75 | 3 | 100 |
| Edi Bice | 8/5/15 | 1500 | 999 Wall St | … | 4 | 450 = (1500 + 100*3) / 4 | 15.0 = 1500 / 100 | 1 | 1500 |

Streaming Analytics
New Avg Amt = (New Txn Amt + Curr Avg Amt * Curr Num Txns) / (Curr Num Txns + 1)
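The update rule above needs only the current count and current average per key, not the transaction history. The sketch below applies it per customer and per shipping address; the class and field names (`RunningProfile`, `cname`, `ship_addr`, `amount`) are hypothetical, and plain dictionaries stand in for whatever keyed store a real job would use.

```python
from collections import defaultdict


class RunningProfile:
    """Holds only a count and a running average for one key."""

    def __init__(self):
        self.count = 0
        self.avg = 0.0

    def update(self, amount):
        """Apply the streaming update; return amount / previous average (None on first txn)."""
        ratio = amount / self.avg if self.count > 0 and self.avg != 0 else None
        self.avg = (amount + self.avg * self.count) / (self.count + 1)
        self.count += 1
        return ratio


def profile_stream(orders):
    """Process orders one at a time, emitting the derived columns for each."""
    by_customer = defaultdict(RunningProfile)
    by_ship_addr = defaultdict(RunningProfile)
    rows = []
    for order in orders:
        cust = by_customer[order["cname"]]
        addr = by_ship_addr[order["ship_addr"]]
        ratio = cust.update(order["amount"])
        addr.update(order["amount"])
        rows.append({
            "ctxns": cust.count,
            "cust_avg_amt": cust.avg,
            "txamt_avgamt_ratio": ratio,
            "ship_addr_txns": addr.count,
            "ship_addr_avg_amt": addr.avg,
        })
    return rows
```

Fed the four orders from the table, this yields the same counts, averages and ratios row by row, with the 1500 order to a new shipping address standing out at a ratio of 15.0.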
job pipelines
Kafka, Samza, and the Unix philosophy of distributed data by Martin Kleppmann
Apache Kafka
• Distributed, scalable, publish-subscribe messaging system
• Persistent, high-throughput messaging
• Designed for real time activity stream data processing
PreCog Samza Job Pipeline
Manifold (1-in-N-out) jobs
Risk-by-Y calc jobs
X-by-Y calc jobs
Assembly jobs
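A manifold (1-in-N-out) job takes each incoming order and re-emits it once per key field, so that a downstream calc job sees every order for a given key in one place. The sketch below shows only that routing idea. The topic naming (`orders_by_<field>`), the field names and the CRC32-based partition choice are assumptions made for illustration; they are not Samza's or Kafka's actual APIs or default partitioner.

```python
import zlib


def partition_for(key, num_partitions):
    """Deterministic key -> partition mapping (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def order_manifold(order, key_fields, num_partitions):
    """Fan one order out into one keyed message per key field.

    Returns a list of (topic, partition, key, order) tuples.
    """
    messages = []
    for field in key_fields:
        key = str(order[field])
        topic = f"orders_by_{field}"  # hypothetical topic naming
        messages.append((topic, partition_for(key, num_partitions), key, order))
    return messages
```

Because the partition depends only on the key, two orders with the same billing email always land on the same partition of the email-keyed topic, which is what lets a single task keep that email's statistics in local state.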
Fault-Tolerant Local State
[Diagram: Samza job partition 0 and Samza job partition 1 each hold a local RocksDB store and replicate writes to a durable changelog in Kafka.]
• Embedded key-value: very fast
• Machine dies ⇒ local key-value store is lost
• Solution: replicate all writes to Kafka!
• Machine dies ⇒ restart on another machine
• Restore key-value store from changelog
• Changelog compaction in the background
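The write-through-and-replay idea above can be illustrated with a toy model. In this sketch a Python dict stands in for the local RocksDB store and a list stands in for one partition of the Kafka changelog; `ChangelogStore` and `compact` are hypothetical names, and none of this is Samza's actual key-value store API.

```python
class ChangelogStore:
    """Toy local key-value store that replicates every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list standing in for a Kafka changelog partition
        self.local = {}             # dict standing in for the local RocksDB store

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))  # replicate the write

    def get(self, key, default=None):
        return self.local.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a new machine by replaying the changelog in order."""
        store = cls(changelog)
        for key, value in changelog:
            store.local[key] = value
        return store


def compact(changelog):
    """Keep only the latest value per key, ordered by each key's last write."""
    latest = {}
    for key, value in changelog:
        latest.pop(key, None)
        latest[key] = value
    return list(latest.items())
```

Replaying the compacted log produces the same local state as replaying the full log, which is why compaction can run in the background without affecting recovery.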
Samza Jobs on Hadoop 2.0 (YARN)
[Diagram: Machines 1, 2 and 3 each run a YARN Node Manager, a Kafka Broker, and a Samza TaskRunner (Partitions 1, 2 and 3 respectively) that calls aStreamTask:process(); the Samza App Master also runs on the cluster.]
Monitoring Samza: Metrics and More
Samza JMX metrics → jmxtrans → OpenTSDB/HBase → Grafana
Questions?
http://www.ebayenterprise.com/
ebice@ebay.com
@edi_bice
https://www.linkedin.com/in/ebice

Editor's Notes

  1. Hi everyone, this is your last chance to get out in case you have the wrong room and session. Okay, relax. Don’t be afraid if “fraud prevention” and “friends” are the only words you’re familiar with. My goal today is to make this accessible to all of you.
  2. My name is Edi Bice and I’m a senior data scientist at eBay Enterprise, a former division of eBay and different from eBay Marketplaces. Together with Innotrack, we are the world’s largest omnichannel commerce provider, and a partner to the world’s most iconic brands. We provide everything from retail order management, payments, fraud and tax to fulfillment and transportation, store fulfillment, customer service etc.
  3. Older Data – looking back a lot further in time. “Older data is not effective” is an excuse – home for the holidays? Profiling infrequent behavior requires longer timeframes to collect sufficient data points. Wider Data – exploiting all available data. Still, how wide can that be!? How wide is Edi Bice? 7 characters, 14 bytes’ worth? Infinite! Compressed: my purchase history, web log, FB posts I liked, tweets I retweeted, things mentioned there, and so on; not very compressed. Richer Data – social/public data: the people, places, interests graph (who is friends with or related to whom; who lived, studied or travelled where). Connected order data: who shipped to whom, where; tender, email, devices, IP addresses. Unstructured data: FB posts, tweets, Pinterest pins. Faster Data – clickstream data: browsing, researching a specific product, reading reviews, typing characteristics, etc.
  4. Web scale capable – what is web-scale? Google, Facebook, … But it’s more about the ability to scale up, or down, than the size itself. “By 2017, Web-scale IT … in 50 percent of global enterprises … according to Gartner, Inc.” Low Latency – Fault Tolerance – expected failure, redundancy, distribution of responsibilities
  5. … we are all normal in our own abnormal special ways. So we can’t define one average and apply it to all customers – we need one for each customer. But not just that: we need averages over periods of time, at the same or similar merchants, and so on. We use transaction number/amount velocity averages to detect likely aberrant behavior. Typical implementation: Why a sliding window? We can’t store or pull an entire, unlimited history in or from the DB, so we restrict it to recent history. Why a single SQL DB? Who can afford Oracle RAC?! The sliding window restricts the calculation of a normal customer profile. How often do you shop at X? Using card Y? Shipping to your friend Z? Oops, we let it slide out of the sliding window!! How can we profile properly, efficiently, and quickly? At large scale?
  6. Assembly line processing! The Venetian Arsenal, dating to about 1104, operated similarly to a production line. Ships moved down a canal and were fitted by the various shops they passed. Ransom Olds patented the assembly line concept, which he put to work in his Olds Motor Vehicle Company factory in 1901. In 1913 Henry Ford perfected the assembly line by installing driven conveyor belts that could produce a Model T in 93 minutes.
  7. Streaming Analytics is assembly line processing! Order comes through, we look up CTxns for Edi, find it’s 0, increment by 1 and store back. Look up CustAvgAmt for Edi, find it’s 0, calculate the new average and store. Another order comes through, we look up … and so on. We don’t store and access all of Edi’s transactions in order to compute total number of transactions and average transaction amount. “summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible.” Average, standard deviation, skewness etc.
  8. 1964 Doug McIlroy internal Bell Labs memo: “We should have some ways of connecting programs like [a] garden hose – screw in another segment when it becomes necessary to massage data in another way.” Unix philosophy 1978: 1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features.” 2. Expect the output of every program to become the input to another, as yet unknown, program. A Samza job is like an assembly line worker/robot, except that it can pull work items from many (input) conveyor belts (streams) and place completed work on many (output) conveyor belts (streams). Samza tasks are like worker clones – if one task per job is enough, good; otherwise add more (up to the total number of stream partitions).
  9. Kafka – read: conveyor belts. The Kafka ecosystem at LinkedIn is sent over 800 billion messages per day, which amounts to over 175 terabytes of data. To handle all these messages, LinkedIn runs over 1100 Kafka brokers/nodes. How does Kafka do this? Partitioned topics/streams (how could Henry Ford produce > 1/93 Model T per min?). Sequential disk IO (append-only, no in-place updates as in most databases). Multiple producers and multiple consumers per topic. Distributed (team work – different Kafka nodes handling different partitions). Track your own work (no manager bottleneck).
  10. Live orders are streamed to the “orders” Kafka topic/stream (conveyor belt). The OrderManifold task partitions and routes orders to several Kafka topics, each keyed by a specific order field, ensuring that a worker sees all orders for a given email, for example. OrderByBillEmail receives orders and calculates and stores all the respective email statistics. JoinBillEmailFeats receives input from two streams, “orders” and “order_by_billemail_feats”, joins them and sends them down the assembly line. ReviewedOrderRouter reads from the “reviewed_orders” stream and sends fraudulent ones to the “reviewed_fraud” stream and the rest to the “reviewed_notfraud” stream. FraudOrderManifold is similar to OrderManifold. FraudByBillEmailRisk calculates risk statistics for email and email domain, and so on. At the end of the assembly line the car looks like this.
  11. Graphical view of the sample JSON output. Each + node is expandable and looks similar to billaddr_feats. Each + node is the product of one Samza job/task (assembly line worker) – specialization of labor principle. Each + node is appended to the work-in-progress car by a “join”er worker. Time spent by each task (worker) is tracked to identify slow tasks (workers) – join_billaddr_feats_beg/end in ms (2!)
  12. Local state – instead of a central DB, each Samza job (task) uses a local (Rocks)DB. RocksDB, now open source, was created at Facebook as a fork of LevelDB, which was created at Google. “RocksDB can be used by applications that need low latency database accesses.” Kafka changelog – update RocksDB, write the change to a Kafka stream, and Kafka automatically compacts the changelog stream (throws away all but the latest value for a given variable). Machine dies, restart on another machine. Who, how?
  13. Enter Hadoop 2.0 YARN (Yet Another Resource Negotiator). Launching a Samza job: The Samza client talks to the YARN RM when it wants to start a new Samza job. The YARN RM talks to a YARN NM to allocate space on the cluster for Samza’s App Master. Once the NM allocates space, it starts the Samza AM. The AM asks the YARN RM for 1+ YARN containers to run SamzaContainers (tasks). Once those resources have been allocated, the NMs start the Samza containers. If any fail, YARN starts a new one, on the same machine if possible, or on a different one. Great, so it’s fault tolerant. Is it fail/fool proof?!
  14. It’s a complex stack! Lots can go wrong – monitoring and alerting are key to proactive maintenance and quick troubleshooting in case things do go wrong. Samza provides: metrics that allow you to see how many messages have been processed and sent, the current offset in the input stream partition, and other details; metrics about the JVM (heap size, garbage collection information, threads, etc.); internal metrics of the Kafka producers and consumers, and more; and custom metrics. Low-latency, web-scale metrics monitoring ;-) Samza JMX metrics → jmxtrans → OpenTSDB/HBase → Grafana
  15. Questions? Now, if we have time. Right after the session. Over drinks later on. Or shoot me an email.