Real-Time Data Processing
Emerging Business Meetup - 07/24/13
Bryan Warner @ Traackr
About Me
● Bryan Warner - Engineer @ Traackr
○ bwarner@traackr.com
● Primary background is in Java
○ Breaking into Scala development this past year
● Interested in search, data scalability, and distributed computing

About Traackr
● Influencer search engine
○ Platform for discovering and engaging online individuals who matter
● We track content and metrics for our database of influential people
○ Both in RT and daily processes
● Some of our back-end stack includes: ElasticSearch, MongoDb, Java/Spring, Scala/Akka, etc.
● Looking for developers to use our API!

Overview
● Review Traackr's use case for real-time data processing
● Technical solution we decided on
● Questions

Traackr Use Case
1. Real-time content stream for a targeted group of influencers within our platform
a. Primarily to show real-time tweets via our Twitter data provider (GNIP)
2. On-demand content tracking and searching for new influencers
a. Users can add up to a hundred people at once
b. Expect that new influencer content is searchable near real-time

Traackr Use Case
Data Processing Requirements
1. Incoming data is not lost
2. Data needs to be analyzed and enriched
3. Each type of data has its own processing component
* Blog Posts, Tweets, Videos, Images, etc.
4. Components should be configurable for maximum throughput!
5. Components should act like small building blocks

Bird's Eye View
[Architecture diagram. Labels: Tracking App; GNIP Listener App; Initial Persist; MongoDb; "Post" Payload; RabbitMQ Broker; Queue; Queue Listener; Content "Enrichment" Pipeline; Make content searchable; ElasticSearch]

Content Pipeline
● Apache Camel (http://camel.apache.org/)
○ Integration framework based on Enterprise Integration Patterns (EIP)
● Flexible route building
○ Supports direct and asynchronous components
○ Integrates with DI frameworks (e.g. Spring, Guice)
○ Tons of native support for various transports (http, jms, amqp, tcp, imap, etc.)
● Good support for unit testing
○ org.apache.camel.component.mock.MockEndpoint

Content Pipeline
[Route diagram. Labels: Queue; Queue Listener; ROUTE; Routing Filter; Tweet Processor; Blog Processor; Image Processor; Search Indexer]

Content Pipeline
Initial Approach:
● Route(s) live within a CamelContext in your JVM
● Initial route is utilizing direct components (serial)
from(<queue.uri>).routeId("my-route")
    .choice()
        .when(simple("${in.body.isTweet()}"))
            .to("bean:languageAnalyzer?method=detectLanguage")
            .to("bean:tweetAnalyzer?method=extractMentions")
        .when(simple("${in.body.isBlog()}"))
            .to("bean:httpService?method=fetchFullContent")
            .to("bean:languageAnalyzer?method=detectLanguage")
        .otherwise()
            .to("bean:imageAnalyzer?method=categorizeImage")
    .end()
    .to("bean:searchService?method=indexContent");
But there's a throughput problem...

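As a reading aid, the serial route above can be approximated in plain Java with no Camel dependency: one thread takes a message, picks a branch, runs that branch's steps in order, and only then hands the result to the indexer. The class name `SerialRoute` and the helper `stepsFor` are hypothetical; the step names are the bean methods shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;

class SerialRoute {
    // Returns, in order, the bean methods a message of the given type passes through.
    static List<String> stepsFor(String type) {
        List<String> steps = new ArrayList<>();
        if (type.equals("TWEET")) {
            steps.add("languageAnalyzer.detectLanguage");
            steps.add("tweetAnalyzer.extractMentions");
        } else if (type.equals("BLOG")) {
            steps.add("httpService.fetchFullContent");
            steps.add("languageAnalyzer.detectLanguage");
        } else {
            // Corresponds to the otherwise() branch.
            steps.add("imageAnalyzer.categorizeImage");
        }
        // After end(), every branch continues to the same indexing step.
        steps.add("searchService.indexContent");
        return steps;
    }

    public static void main(String[] args) {
        for (String type : new String[] {"TWEET", "BLOG", "IMAGE"}) {
            System.out.println(type + " -> " + stepsFor(type));
        }
    }
}
```

Because all of these steps run on the thread that took the message off the queue, the next message waits until the slowest step of the current one has finished.
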
Content Pipeline
[Queue diagram, labeled "Queue HEAD": TWEET TWEET IMAGE TWEET TWEET BLOG TWEET TWEET TWEET BLOG TWEET TWEET]
● If Tweets come into the system at 5/sec, then Tweet processing rate has to be >= 5/sec
● If a blog post takes 5 seconds to process (on average)...
● And an image takes 30 seconds to process (on average)... then...

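The arithmetic behind this slide can be worked through with a short sketch. The blog (5 s) and image (30 s) costs and the 5 tweets/sec arrival rate come from the slide; the 100 ms tweet cost is an assumed placeholder, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SerialBacklog {
    // Average processing cost per message type, in milliseconds.
    static long costMillis(String type) {
        switch (type) {
            case "TWEET": return 100;     // assumption, not from the slides
            case "BLOG":  return 5_000;   // from the slide
            case "IMAGE": return 30_000;  // from the slide
            default: throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    // Time one serial consumer needs to work through the whole queue.
    static long serialMillis(List<String> queue) {
        long total = 0;
        for (String type : queue) {
            total += costMillis(type);
        }
        return total;
    }

    // Number of new messages arriving during the given time window.
    static long arrivalsDuring(long millis, int perSecond) {
        return millis * perSecond / 1000;
    }

    public static void main(String[] args) {
        // The twelve messages shown in the queue diagram on the slide.
        List<String> queue = Arrays.asList(
            "TWEET", "TWEET", "IMAGE", "TWEET", "TWEET", "BLOG",
            "TWEET", "TWEET", "TWEET", "BLOG", "TWEET", "TWEET");
        long busy = serialMillis(queue);
        System.out.println("serial processing time (ms): " + busy);
        System.out.println("tweets arriving meanwhile:   " + arrivalsDuring(busy, 5));
    }
}
```

Under these assumptions the twelve queued messages take about 41 seconds to process serially, and roughly 200 new tweets arrive in that time. A single image alone holds the consumer for 30 seconds, during which 150 tweets queue up behind it.
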
Content Pipeline
Expanded Approach:
● Utilize SEDA components (http://camel.apache.org/seda.html)
○ Underlying thread pool with BlockingQueue
from(<queue.uri>).routeId("my-route")
    .choice()
        .when(simple("${in.body.isTweet()}")).to("seda:tweetEnricher")
        .when(simple("${in.body.isBlog()}")).to("seda:blogEnricher")
        .otherwise().to("seda:imageEnricher")
    .end();
from("seda:tweetEnricher?concurrentConsumers=10").routeId("tweet-route")
    .to("bean:languageAnalyzer?method=detectLanguage")
    .to("bean:tweetAnalyzer?method=extractMentions")
    .to("seda:searchService");
from("seda:blogEnricher?concurrentConsumers=2").routeId("blog-route")
    ...
from("seda:imageEnricher?concurrentConsumers=2").routeId("img-route")
    ...
// Routes re-join
from("seda:searchService?concurrentConsumers=X").routeId("s-indexer")
    .to("bean:searchService?method=indexContent");

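To make "thread pool with BlockingQueue" concrete, here is a minimal plain-Java sketch of one staged queue with a fixed number of concurrent consumers. It illustrates the general idea only and is not Camel's SEDA implementation; `SedaStage`, `SedaDemo`, and the consumer counts used in the demo are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class SedaStage<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final ExecutorService consumers;
    private volatile boolean open = true;

    SedaStage(int concurrentConsumers, Consumer<T> handler) {
        consumers = Executors.newFixedThreadPool(concurrentConsumers);
        for (int i = 0; i < concurrentConsumers; i++) {
            consumers.execute(() -> {
                try {
                    // Keep draining until the stage is closed and the queue is empty.
                    while (open || !queue.isEmpty()) {
                        T item = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (item != null) {
                            handler.accept(item);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    // Hands a message to this stage and returns immediately.
    void send(T item) {
        queue.add(item);
    }

    // Stops accepting work and waits for queued messages to be handled.
    void close() {
        open = false;
        consumers.shutdown();
        try {
            consumers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class SedaDemo {
    public static void main(String[] args) {
        AtomicInteger indexed = new AtomicInteger();
        // Both enrichment stages forward into one shared indexing stage.
        SedaStage<String> indexer = new SedaStage<>(4, post -> indexed.incrementAndGet());
        SedaStage<String> tweets = new SedaStage<>(10, indexer::send);
        SedaStage<String> images = new SedaStage<>(2, indexer::send);

        String[] incoming = {"TWEET", "TWEET", "IMAGE", "TWEET", "TWEET"};
        for (String post : incoming) {
            (post.equals("TWEET") ? tweets : images).send(post);
        }
        // Close upstream stages first so everything reaches the indexer.
        tweets.close();
        images.close();
        indexer.close();
        System.out.println("indexed " + indexed.get() + " posts");
    }
}
```

Each content type gets its own queue and its own consumers, so a slow image no longer delays the tweets behind it. The queue here lives only in memory, which is what the caveats on the next slide are about.
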
Content Pipeline
[Route diagram. Labels: Queue; Queue Listener; ROUTE; Routing Filter; Tweet Processor; Blog Processor; Image Processor; Search Indexer; ThreadPool + BlockingQueue]

Content Pipeline
Caveats:
● No visibility into SEDA's thread pool state (e.g. how many objects on its internal queue?)
● If VM crashes, those payloads on the SEDA thread pool blocking queue are lost
● Our route is assuming that each payload consists of only one message
○ In reality, our payloads are a mix of different post types ... how to handle this efficiently?

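One possible way to handle a mixed payload is to split it into per-type batches before routing. The sketch below is an illustration under assumptions, not the deck's actual splitter: the `Post` class, its fields, and `splitByType` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PayloadSplitter {
    // Hypothetical post model: a type tag plus a body.
    static final class Post {
        final String type;
        final String body;

        Post(String type, String body) {
            this.type = type;
            this.body = body;
        }
    }

    // Splits one mixed payload into per-type batches, keeping arrival order
    // both across types (first seen first) and within each batch.
    static Map<String, List<Post>> splitByType(List<Post> payload) {
        Map<String, List<Post>> batches = new LinkedHashMap<>();
        for (Post post : payload) {
            batches.computeIfAbsent(post.type, k -> new ArrayList<>()).add(post);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Post> payload = Arrays.asList(
            new Post("TWEET", "first tweet"),
            new Post("BLOG", "a blog post"),
            new Post("TWEET", "second tweet"));
        for (Map.Entry<String, List<Post>> batch : splitByType(payload).entrySet()) {
            System.out.println(batch.getKey() + ": " + batch.getValue().size() + " post(s)");
        }
    }
}
```

Each batch can then be sent to the processor for its type as a single message, so a payload containing one image and many tweets does not tie the tweets to the image's processing time.
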
Content Pipeline
Final Solution:
from(<queue.uri>).routeId("my-route")
    .split().method("payloadSplitterService", "splitMessage")
    .choice()
        .when(header("enrichTweets").isEqualTo(true)).to(<queue.uri.tweet>)
        .when(header("enrichBlogs").isEqualTo(true)).to(<queue.uri.blogs>)
        .otherwise().to(<queue.uri.images>)
    .end();
from(<queue.uri.tweet>).routeId("queue-in-tweet-route")
    .to("seda:tweetEnricher?timeout=0");
from("seda:tweetEnricher?concurrentConsumers=10&size=0&blockWhenFull=true").routeId("tweet-route")
    .to("bean:languageAnalyzer?method=detectLanguage")
    .to("bean:tweetAnalyzer?method=extractMentions")
    .to("seda:searchService");
from("seda:searchService?concurrentConsumers=X").routeId("s-indexer")
    .to("bean:searchService?method=indexContent");

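The `blockWhenFull=true` option points at a back-pressure idea: when the in-memory stage is full, the sender waits instead of failing or piling up more messages. The plain-Java sketch below shows that idea with `ArrayBlockingQueue`. It does not reproduce Camel's SEDA internals or the exact meaning of the options on the slide, and the class name, capacity, and message counts are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedHandoff {
    // Fills a queue to capacity, then tries one more non-blocking send.
    // Returns false because a full bounded queue rejects offer().
    static boolean offerWhenFull(int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.offer(i);
        }
        return queue.offer(-1);
    }

    // Pushes `count` messages through a bounded queue with one slow consumer
    // and returns the largest queue depth the producer observed.
    static int maxDepth(int capacity, int count) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    queue.take();
                    Thread.sleep(1); // stand-in for enrichment work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        int max = 0;
        try {
            for (int i = 0; i < count; i++) {
                queue.put(i); // blocks while the queue is full
                max = Math.max(max, queue.size());
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("extra offer accepted when full: " + offerWhenFull(2));
        System.out.println("max in-memory depth: " + maxDepth(2, 50));
    }
}
```

With a blocking, bounded hand-off, the number of messages held only in memory never exceeds the queue capacity. Anything not yet accepted stays with the upstream sender, which in the route above is a broker queue.
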
Questions
