When evaluating open source software, or any software of significant size or complexity, organizations frequently want to conduct a pilot project or proof of concept (POC). This talk describes a process for shortening the pilot by reusing configurations from performance testing as POC starting configurations.
18. Social Network Analysis: Example System Configuration
[Diagram: 13 numbered data-flow steps connecting the components below]
Components:
• Text or Twitter API
• Java 1 and GUI
• Kafka
• Kamanja
• Java 2 for features (sentiment or Stanford NLP)
• Java 3 for analysis
• Data store
• Matched_tags_per_text table
• Alerts table
• Tomcat web service
Data-flow labels:
• Java calls the API and the Kafka producer
• Tweets returned in JSON
• JSON tweets sent to Kafka
• Kafka JSON to Kamanja
• JSON with features saved in DB
• Java: every "time window", queries the DB to aggregate (i.e. count (tags) by (tags) by..)
• JSON returns the aggregate query results to Java
• JSON query results to Kafka
• Results to Java 3 for scoring, with thresholds
• JSON results of rule scoring, alert text
• Save results to DB
• Java 1: check for updates to the Alerts table
• Tomcat web service displays data and charts (step 13)
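The diagram's aggregation and scoring steps (every time window, count tags, then score the counts against thresholds to produce alert text) can be sketched roughly as follows. This is only an illustrative sketch: the record layout, the function names `count_tags` and `alerts_for_window`, and the threshold value are assumptions, not the system's actual code or schema.

```python
from collections import Counter
from typing import Iterable, List


def count_tags(records: Iterable[dict]) -> Counter:
    """Count how often each tag appears across the records of one time window."""
    counts = Counter()
    for record in records:
        counts.update(record.get("tags", []))
    return counts


def alerts_for_window(counts: Counter, threshold: int) -> List[str]:
    """Return alert text for every tag whose count reaches the threshold."""
    return [
        f"ALERT: tag '{tag}' seen {n} times in window"
        for tag, n in sorted(counts.items())
        if n >= threshold
    ]


# Hypothetical feature-enriched records for a single time window.
window = [
    {"text": "great service", "tags": ["service", "positive"]},
    {"text": "service outage again", "tags": ["service", "outage"]},
    {"text": "outage fixed", "tags": ["outage"]},
]

counts = count_tags(window)
alerts = alerts_for_window(counts, threshold=2)
for line in alerts:
    print(line)
```

In the pictured system the counting would presumably happen in the database query and the thresholding in the scoring component; the sketch collapses both into one process only to show the logic.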
ligaDATA
23. Kamanja: 220k to 230k messages / second
CONFIGURATION:
• 16-core box with solid-state disk (SSD)
• Sample tool generates messages of size 1k (not reduced)
• Data mining uses hundreds to 100k fields, not a 100-byte message
• Kafka queue: 3 input queues, each with 8 partitions
• Kamanja engine: uses the remaining 12-13 cores
• Not saving score results per record in this test
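A few derived numbers help when carrying this configuration over to a POC. The short calculation below uses only the figures on the slide; reading "1k" as 1000 bytes is an assumption, and the variable names are illustrative.

```python
# Figures taken from the slide.
CORES_TOTAL = 16
KAMANJA_CORES = (12, 13)               # "remaining 12-13 cores"
INPUT_QUEUES = 3
PARTITIONS_PER_QUEUE = 8
THROUGHPUT_RANGE = (220_000, 230_000)  # messages per second

# Assumption: "1k" is read as 1000 bytes per message.
MESSAGE_BYTES = 1000

# Total Kafka partitions feeding the engine.
total_partitions = INPUT_QUEUES * PARTITIONS_PER_QUEUE

# Partitions each engine core has to serve, for 12 and for 13 cores.
partitions_per_core = [total_partitions / cores for cores in KAMANJA_CORES]

# Cores not used by the engine.
cores_not_for_engine = [CORES_TOTAL - cores for cores in KAMANJA_CORES]

# Raw input data rate in megabytes per second.
mb_per_second = [rate * MESSAGE_BYTES / 1e6 for rate in THROUGHPUT_RANGE]

print(total_partitions, partitions_per_core, cores_not_for_engine, mb_per_second)
```

Under these assumptions the test pushes roughly 220 to 230 MB of input per second through 24 partitions, or about two partitions per engine core.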
SO WHAT? COMPARISON:
• Storm is currently the lowest latency Apache big data system
• Storm integration reached 90k to 100k messages / second on the same data
• Kamanja is about 2.4 times faster than Storm (225k / 95k) in this test
• Spark Streaming processes data in mini-batches, with higher latency than Storm or Kamanja
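The 2.4 figure is the ratio of the midpoints of the two measured ranges. The sketch below reproduces it and also shows the spread implied by the range endpoints; the helper name `midpoint` is illustrative.

```python
# Measured ranges from the slide, in messages per second.
kamanja = (220_000, 230_000)
storm = (90_000, 100_000)


def midpoint(value_range):
    """Midpoint of a (low, high) range."""
    return sum(value_range) / 2


# Ratio of midpoints: 225k / 95k, as quoted on the slide.
ratio_mid = midpoint(kamanja) / midpoint(storm)

# Most and least favorable pairings of the range endpoints.
ratio_low = kamanja[0] / storm[1]   # slowest Kamanja run vs fastest Storm run
ratio_high = kamanja[1] / storm[0]  # fastest Kamanja run vs slowest Storm run

print(round(ratio_mid, 2), round(ratio_low, 2), round(ratio_high, 2))
```

Depending on which endpoints are paired, the speedup in this test lies between roughly 2.2 and 2.6.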
Why is Kamanja faster than Storm?
Storm reads the data from the input queue (spout) and passes it to bolts. On each pass from spout to bolt, the data is serialized and deserialized. There is other overhead as well.
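The per-hop cost described here can be illustrated with a toy pipeline. This is not Storm or Kamanja code, only a sketch of the general idea: one variant hands the message object directly from stage to stage, the other serializes and deserializes it before every stage. The stage functions and the use of JSON as the wire format are assumptions made for the illustration.

```python
import json


def stage_add_length(msg):
    """First stage: attach the text length."""
    out = dict(msg)
    out["length"] = len(msg["text"])
    return out


def stage_flag_long(msg):
    """Second stage: flag messages longer than 10 characters."""
    out = dict(msg)
    out["is_long"] = out["length"] > 10
    return out


STAGES = [stage_add_length, stage_flag_long]


def run_in_process(msg):
    """Pass the message object directly from stage to stage."""
    for stage in STAGES:
        msg = stage(msg)
    return msg, 0  # no serialization round trips


def run_with_serialized_hops(msg):
    """Serialize and deserialize the message before every stage."""
    round_trips = 0
    for stage in STAGES:
        wire = json.dumps(msg)  # serialize for the hop
        msg = json.loads(wire)  # deserialize on the receiving side
        round_trips += 1
        msg = stage(msg)
    return msg, round_trips


message = {"text": "hello kafka world"}
direct_result, direct_trips = run_in_process(message)
hop_result, hop_trips = run_with_serialized_hops(message)
print(direct_result == hop_result, direct_trips, hop_trips)
```

Both variants produce the same result, but the second performs one extra encode and decode per stage, and with 1k messages at hundreds of thousands of messages per second that extra work adds up.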
Kamanja: One Speed Analysis