Near Real Time Streaming With Apache Samza
Antispam Use Case
Hello!
Michael Sklyar
Team leader, R&D @ Cyren
Cyren
CYREN Customers
Business Case Antispam
https://www.youtube.com/watch?v=M_eYSuPKP3Y
Remember 2002?
VS 2010
Spam is business
Most of the email traffic is spam
Spamvertising
Malware
Fraud, 419
Cryptolocker
Phishing
CYREN Antispam
SDKs and SAAS
Inbound and Outbound
Global View
24/7
RPD
Reputation
Antispam Main Challenge
FPs FNs
Anti Spam Backend
● Known unwanted traffic is blocked in real time
● Unknowns
● Backend purpose: Generate new classifications from “unknown” traffic
Sample classification logic: Bulk new domains
If it smells like spam...
● A link to the domain is seen in a large number of emails
● The domain was registered recently
● The domain doesn’t have a good (Cyren) reputation
● The domain passes the defense mechanisms
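The four conditions above can be read as a single predicate. The following is a minimal sketch, not Cyren's actual logic: the `DomainStats` class, the 1,000-email threshold and the 30-day age limit are all hypothetical, and "passes the defense mechanisms" is interpreted here as "survived the safeguards against false positives".

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical per-domain facts collected by the backend; all names are illustrative. */
final class DomainStats {
    final String domain;
    final long emailsInWindow;        // emails linking to the domain in the current window
    final Instant registeredAt;       // registration time, e.g. from an external lookup
    final boolean hasGoodReputation;  // result of a reputation lookup
    final boolean passedDefenses;     // survived safeguards such as whitelists (assumed meaning)

    DomainStats(String domain, long emailsInWindow, Instant registeredAt,
                boolean hasGoodReputation, boolean passedDefenses) {
        this.domain = domain;
        this.emailsInWindow = emailsInWindow;
        this.registeredAt = registeredAt;
        this.hasGoodReputation = hasGoodReputation;
        this.passedDefenses = passedDefenses;
    }
}

final class BulkNewDomainRule {
    // Assumed thresholds, chosen only for illustration.
    static final long BULK_THRESHOLD = 1_000;
    static final Duration NEW_DOMAIN_AGE = Duration.ofDays(30);

    /** Returns true only when all four conditions from the slide hold. */
    static boolean looksLikeSpam(DomainStats s, Instant now) {
        boolean bulk = s.emailsInWindow >= BULK_THRESHOLD;
        boolean recent = Duration.between(s.registeredAt, now).compareTo(NEW_DOMAIN_AGE) <= 0;
        return bulk && recent && !s.hasGoodReputation && s.passedDefenses;
    }
}
```

Keeping the rule a pure function of already-collected facts leaves the expensive parts (counting and external lookups) to earlier stages.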
Bulk New Domains Logic Challenges
● Process billions of transactions per day
● Count the appearance of domains in “Windows”
● External Services
● DO IT FAST
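Counting domains in "windows" can be sketched as a tumbling-window counter: increment on every transaction, then emit and reset when the window closes. This is a plain-Java illustration under assumed names (`TumblingDomainCounter`, `record`, `closeWindow`), not the production code; in a stream processor the two methods would correspond to the per-message callback and the periodic window callback.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tumbling-window counter; not thread-safe by design. */
final class TumblingDomainCounter {
    private final Map<String, Long> counts = new HashMap<>();

    /** Called once per parsed transaction that contains a link to the domain. */
    void record(String domain) {
        counts.merge(domain, 1L, Long::sum);
    }

    /** Called when the window closes: returns a snapshot of the counts and resets them. */
    Map<String, Long> closeWindow() {
        Map<String, Long> snapshot = new HashMap<>(counts);
        counts.clear();
        return snapshot;
    }
}
```

A tumbling window is the simplest choice; a sliding window would need per-bucket counts and is not shown here.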
Before we dig in...
Questions about the use case?
Replacing the backend
Bulk new domain abstraction
Raw Transactions → Parse → Enrich → Count → Classify
Other diagram labels: Unknown traffic, DCs, Replication
Replacing the backend Technological stack
Apache Samza Stream Processing Framework
YARN KAFKA
Samza API
Layers:
1. Streaming: Kafka
2. Execution: YARN
3. Processing: Samza API
Originally by LinkedIn
The Obvious:
Distributed
High availability
Horizontally scalable
Apache Kafka Streaming
● Topics
● Partitions
● Offsets
● Producers
● Consumers
Distributed, partitioned, replicated publish-subscribe messaging system
Apache YARN Execution Framework
Global Resource Manager
Per-application Application Master
Containers
Apache Samza Vocabulary & Concepts
Streams And Jobs
Jobs are Decoupled!
Apache Samza Vocabulary & Concepts
Partitions and Tasks
Incoming Stream (Kafka Topic): Partition 1, Partition 2, Partition 3
Samza Job: Task 1, Task 2, Task 3
Outgoing Stream: Partition 1, Partition 2
Tasks are single-threaded
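The partition-to-task relationship matters for counting: if messages are keyed by domain, every message for a given domain lands in the same partition and therefore in the same task. The sketch below illustrates that routing in plain Java. Both functions are simplified assumptions: Kafka's real default partitioner uses a different hash, and the round-robin spread of tasks over containers is only one possible grouping policy.

```java
/** Hypothetical routing helpers; they mimic the idea, not the real Kafka or Samza code. */
final class PartitionRouting {

    /** Simplified key-hash partitioner: the same key always maps to the same partition. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** One task per input partition; tasks are spread over containers round-robin (assumed). */
    static int containerFor(int taskId, int numContainers) {
        return taskId % numContainers;
    }

    public static void main(String[] args) {
        int partitions = 3;
        int containers = 2;
        for (String domain : new String[] {"a.test", "b.test", "a.test"}) {
            int task = partitionFor(domain, partitions); // task id == partition id
            System.out.println(domain + " -> task " + task
                    + " in container " + containerFor(task, containers));
        }
    }
}
```

Because each task is single-threaded and owns its partition, per-domain counters need no locking.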
Apache Samza Vocabulary & Concepts
Fault Tolerance - Checkpointing - Committing - At-least-once processing
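Why checkpointing gives at-least-once rather than exactly-once processing can be shown with a small simulation: a task commits its offset only every few messages, so after a crash it resumes from the last checkpoint and reprocesses whatever came after it. The class and parameter names below are hypothetical, and the "stream" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical simulation of checkpointed consumption with a crash and a restart. */
final class AtLeastOnceDemo {

    /**
     * Processes messages from startOffset, checkpointing after every commitEvery-th offset.
     * Stops just before processing offset crashAt (use -1 for no crash).
     * Returns the offset to resume from: the last checkpoint, or the end of the stream.
     */
    static int run(List<String> stream, int startOffset, int commitEvery,
                   int crashAt, List<String> processed) {
        int checkpoint = startOffset;
        for (int offset = startOffset; offset < stream.size(); offset++) {
            if (offset == crashAt) {
                return checkpoint;          // simulated crash: progress since checkpoint is lost
            }
            processed.add(stream.get(offset));
            if ((offset + 1) % commitEvery == 0) {
                checkpoint = offset + 1;    // commit: the next offset to read after a restart
            }
        }
        return stream.size();
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList("m0", "m1", "m2", "m3", "m4", "m5");
        List<String> processed = new ArrayList<>();
        int resumeFrom = run(stream, 0, 4, 5, processed);   // crash before m5
        run(stream, resumeFrom, 4, -1, processed);          // restart from the checkpoint
        System.out.println(processed);                      // m4 appears twice
    }
}
```

Downstream logic therefore has to tolerate duplicates, for example by making updates idempotent.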
Apache Samza Vocabulary & Concepts
State Management with RocksDB
Remote State: Input Stream → Stateless Samza Tasks (querying a remote DB) → Output Stream
Local State: Input Stream → Stateful Samza Tasks (each with a local DB, backed by a Changelog Stream) → Output Stream
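The local-state idea can be illustrated without any framework: every write goes to a local store and is also appended to a changelog, so a restarted task can rebuild its store by replaying the changelog. In the sketch below a `HashMap` stands in for RocksDB and a `List` stands in for the Kafka changelog topic; the class name and methods are hypothetical and are not the Samza store API.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical changelog-backed key-value store; illustrates the recovery idea only. */
final class ChangelogBackedStore {
    private final Map<String, Long> local = new HashMap<>();    // stands in for RocksDB
    private final List<Map.Entry<String, Long>> changelog;      // stands in for a Kafka topic

    /** Restores local state by replaying the changelog; later entries win. */
    ChangelogBackedStore(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
        for (Map.Entry<String, Long> entry : changelog) {
            local.put(entry.getKey(), entry.getValue());
        }
    }

    /** Writes locally and appends the same change to the changelog. */
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new AbstractMap.SimpleImmutableEntry<>(key, Long.valueOf(value)));
    }

    /** Reads are served from local state only; missing keys read as 0. */
    long get(String key) {
        return local.getOrDefault(key, 0L);
    }
}
```

Reads never leave the machine, which is the main performance argument for local state over a remote DB.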
Apache Samza API
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class SplitStringIntoWordsTask implements StreamTask {
    // Send outgoing messages to a stream called "words" in the "kafka" system.
    private final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "words");

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        for (String word : message.split(" ")) {
            // Use the word as the key, and 1 as the value.
            // A second task can add the 1's to get the word count.
            collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, word, 1));
        }
    }
}
Apache Samza API
/** Implement this if you want a callback when your task starts up. */
public interface InitableTask {
    void init(Config config, TaskContext context);
}

public interface WindowableTask {
    void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception;
}
Writing a Samza Job - Include and build Samza
Use your favourite dependency management and build tool to:
● Include the Samza and Kafka dependencies:
    runtime(group: 'org.apache.samza', name: 'samza-core_2.10', version: "$SAMZA_VERSION")
    runtime(group: 'org.apache.samza', name: 'samza-log4j', version: "$SAMZA_VERSION")
    runtime(group: 'org.apache.samza', name: 'samza-shell', version: "$SAMZA_VERSION")
    runtime(group: 'org.apache.samza', name: 'samza-yarn_2.10', version: "$SAMZA_VERSION")
    runtime(group: 'org.apache.samza', name: 'samza-kafka_2.10', version: "$SAMZA_VERSION")
    runtime(group: 'org.apache.kafka', name: 'kafka_2.10', version: "$KAFKA_VERSION")
● Build a .tar.gz distribution file with your code and dependencies
● “Extract” the samza-shell library (it contains the Samza shell scripts)
Writing a Samza Job - The code
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class SplitStringIntoWordsTask implements StreamTask {
    // Send outgoing messages to a stream called "words" in the "kafka" system.
    private final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "words");

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        for (String word : message.split(" ")) {
            // Use the word as the key, and 1 as the value.
            // A second task can add the 1's to get the word count.
            collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, word, 1));
        }
    }
}
Writing a Samza Job - Task Properties
Nasty and painful configuration specifying:
● Job properties: Name, artifact location, number of tasks, RAM, CPU
● Kafka Configuration: Systems and streams: incoming, outgoing, bootstrapping, coordination.
● SerDe definitions
https://samza.apache.org/learn/documentation/0.10/jobs/configuration-table.html
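To give a feel for the three groups of settings listed above, here is a sketch that assembles a small set of them with `java.util.Properties` and prints the result in `.properties` format. The key names are written from memory of the Samza 0.10 configuration table linked above and should be checked against it; every value (job name, task class, host names, paths, sizes) is hypothetical.

```java
import java.io.IOException;
import java.util.Properties;

/** Hypothetical job configuration; values are placeholders, keys should be verified. */
final class BulkDomainJobConfig {

    static Properties build() {
        Properties p = new Properties();
        // Job properties: name, artifact location, resources.
        p.setProperty("job.factory.class", "org.apache.samza.job.yarn.YarnJobFactory");
        p.setProperty("job.name", "bulk-new-domains");
        p.setProperty("yarn.package.path", "hdfs://namenode:8020/samza/bulk-new-domains-dist.tar.gz");
        p.setProperty("yarn.container.count", "4");
        p.setProperty("yarn.container.memory.mb", "2048");
        // Task: which class to run, which stream to read, how often to call window().
        p.setProperty("task.class", "com.example.antispam.CountDomainsTask");
        p.setProperty("task.inputs", "kafka.parsed-transactions");
        p.setProperty("task.window.ms", "60000");
        p.setProperty("task.checkpoint.factory",
                "org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory");
        p.setProperty("task.checkpoint.system", "kafka");
        // Kafka system definition.
        p.setProperty("systems.kafka.samza.factory", "org.apache.samza.system.kafka.KafkaSystemFactory");
        p.setProperty("systems.kafka.samza.msg.serde", "string");
        p.setProperty("systems.kafka.consumer.zookeeper.connect", "zk1:2181");
        p.setProperty("systems.kafka.producer.bootstrap.servers", "kafka1:9092");
        // SerDe definitions.
        p.setProperty("serializers.registry.string.class",
                "org.apache.samza.serializers.StringSerdeFactory");
        return p;
    }

    public static void main(String[] args) throws IOException {
        // Prints the settings in .properties format, ready to be saved to a file.
        build().store(System.out, "bulk-new-domains job (sketch)");
    }
}
```

In practice these settings would live directly in a `.properties` file; building them in code is only a convenient way to show them grouped and commented.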
Running a Samza Job
1. Build the distribution
2. Extract the distribution (to gain access to the Samza shell scripts)
3. Upload the distribution to a location accessible to the YARN cluster (for instance, HDFS)
4. Execute run-job.sh and specify your properties file
Bulk new domain detection with Samza
Remember what we were trying to achieve?
Recognize bulk new domains... FAST
Raw Transactions → Parse → Enrich → Count → Classify
Other diagram labels: Unknown traffic, DCs, Replication
Bulk new domain detection Performance
What affects performance?
● Kafka and YARN cluster sizes
● Number of Kafka Partitions (Partition => Samza Task)
● Number of Containers
● container
● Kafka: HDD IO
● Caching
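The caching point is easiest to see for the enrichment step, where each domain may require a call to an external service. A small LRU cache in front of that call avoids repeating lookups for domains seen recently. The sketch below is an assumption about how such a cache could look, built on `LinkedHashMap` in access order; the class name, the capacity and the loader function are hypothetical. It is deliberately not thread-safe, which is acceptable when each task is single-threaded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache for external lookups; not thread-safe. */
final class LookupCache<K, V> {
    private final Function<K, V> loader;   // the expensive external call
    private final Map<K, V> cache;
    private int misses;

    LookupCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, calling the loader only on a miss. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            misses++;
            cache.put(key, value);
        }
        return value;
    }

    /** Number of times the loader had to be called. */
    int misses() {
        return misses;
    }
}
```

The miss counter is the kind of number worth exporting as an application metric, since the hit rate directly bounds the load on the external service.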
Metrics
Samza Metrics
Application Metrics - Codahale, Netflix servo
Time Series DB - Graphite/InfluxDB
UI - Grafana
Samza Impressions
The Good
Stable
Very easy to write jobs
Thread Safety is not an issue
Easy to add branches to the job graphs
High Throughput
Low Latency
Replayable streams
Samza Impressions
Could be better
Nasty Properties
Doesn’t support changing the number of partitions
Doesn’t spoil you: no fancy UIs
Only supports the at-least-once processing paradigm
Not enough producers
Not the most mature
If you don’t already have a YARN cluster…
Samza Getting Started
Writing the code is the easiest part
Setting up Kafka, ZooKeeper, YARN
Building - packing your code into an artifact, preparing it for extraction
Understanding Properties
Good News
Hello-Samza
http://samza.apache.org/startup/hello-samza/0.10/
Thank You,
Questions?
