@asnare / @fzk / @godatadriven
Divolte Collector
Andrew Snare / Friso van Vollenhoven
Because life’s too short for log file parsing
99% of all data in Hadoop
- - [28/Jul/1995:11:53:28 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
- - [28/Jul/1995:11:53:29 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0
- - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324
- - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573
- - [28/Jul/1995:11:53:54 -0400] "GET /shuttle/missions/sts-67/images/images.html HTTP/1.0"
- - [28/Jul/1995:11:54:02 -0400] "GET /shuttle/technology/sts-newsref/sts-mps.html HTTP/1.0" 2
- - [28/Jul/1995:11:54:05 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
- - [28/Jul/1995:11:54:05 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891
- - [28/Jul/1995:11:54:07 -0400] "GET /shuttle/technology/sts-newsref/stsref-toc.html HTTP/1.
- - [28/Jul/1995:11:54:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
- - [28/Jul/1995:11:54:15 -0400] "GET /shuttle/resources/orbiters/challenger.html HTTP/1.0"
- - [28/Jul/1995:11:54:16 -0400] "GET /images/shuttle-patch-small.gif HTTP/1.0" 200 4179
- - [28/Jul/1995:11:54:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0
- - [28/Jul/1995:11:54:18 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
- - [28/Jul/1995:11:54:19 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713
- - [28/Jul/1995:11:54:19 -0400] "GET /shuttle/resources/orbiters/challenger-logo.gif HTTP/1
- - [28/Jul/1995:11:54:25 -0400] "GET /shuttle/missions/sts-70/images/images.html HTTP/1.0" 2
- - [28/Jul/1995:11:54:26 -0400] "GET /history/rocket-history.txt HTTP/1.0" 200 26990
- - [28/Jul/1995:11:54:30 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 304 0
- - [28/Jul/1995:11:54:31 -0400] "GET /images/launch-logo.gif HTTP/1.0" 304 0
- - [28/Jul/1995:11:54:38 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 304
- - [28/Jul/1995:11:54:48 -0400] "GET /shuttle/missions/sts-65/mission-sts-65.html HTTP/1.0"
- - [28/Jul/1995:11:54:53 -0400] "GET /shuttle/missions/sts-67/images/KSC-95EC-0390.jpg HTTP
- - [28/Jul/1995:11:54:58 -0400] "GET /shuttle/missions/missions.html HTTP/1.0" 200 8677
- - [28/Jul/1995:11:55:02 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853
- - [28/Jul/1995:11:55:05 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
- - [28/Jul/1995:11:55:09 -0400] "GET /procurement/procurement.html HTTP/1.0" 200 3499
- - [28/Jul/1995:11:55:10 -0400] "GET /images/op-logo-small.gif HTTP/1.0" 200 14915
- - [28/Jul/1995:11:55:11 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
- - [28/Jul/1995:11:55:11 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
How do we use our data?
•Ad hoc
Typical web optimization architecture
[Diagram. Labels: HTTP request; log event (2012-07-01T06:00:02.500Z /org/apache/hadoop/io/IOUtils.html); log transport; transport logs to compute cluster; offline analytics / model training; batch update model state; streaming log; streaming update model state; serve model result (e.g. recommendations)]
Parse HTTP server logs
How did it get there?
Option 1: parse HTTP server logs
•Ship log files on a schedule
•Parse using MapReduce jobs
•Batch analytics jobs feed online systems
HTTP server log parsing
•Inherently batch oriented
•Schema-less (URL format is the schema)
•Initial job to parse logs into structured format
•Usually multiple versions of parsers required
•Requires sessionizing
•Logs usually contain more than you ask for (bots,
image requests, spiders, health checks, etc.)
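To make the "initial job to parse logs into structured format" concrete, here is a minimal Python sketch for a line in the Common Log Format, like the sample shown earlier. The regular expression, the `parse_line` name and the `example.host` host name are illustrative assumptions, not part of any real pipeline; production jobs typically need several parser versions, as the list above notes.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict for a well-formed line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record['timestamp'] = datetime.strptime(
        record['timestamp'], '%d/%b/%Y:%H:%M:%S %z')
    record['status'] = int(record['status'])
    record['size'] = 0 if record['size'] == '-' else int(record['size'])
    return record


# 'example.host' is a made-up host name; the rest mirrors the sample log.
line = ('example.host - - [28/Jul/1995:11:53:31 -0400] '
        '"GET /shuttle/countdown/ HTTP/1.0" 200 4324')
record = parse_line(line)
print(record['path'], record['status'], record['size'])
```

Truncated or malformed lines simply return `None` here, which is one reason the parsed output still needs filtering afterwards.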
Stream HTTP server logs
[Diagram. Labels: tail -F; Message Queue or Event Transport (Kafka, Flume, etc.)]
How did it get there?
Option 2: stream HTTP server logs
•tail -F logfiles
•Use a queue for transport (e.g. Flume or Kafka)
•Parse logs on the fly
•Or write semi-schema’d logs, like JSON
•Parse again for the batch workload
Stream HTTP server logs
•Allows for near real-time event handling when
consuming from queues
•Sessionizing? Duplicates? Bots?
•Still requires parser logic
•No schema
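Sessionizing is the step that both log-based options leave to the consumer. A minimal sketch, assuming each event already carries a visitor key and a timestamp in seconds (server logs often lack a reliable visitor key) and assuming a conventional 30-minute inactivity timeout; `sessionize` and `SESSION_TIMEOUT` are hypothetical names.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed, conventional value


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (visitor, timestamp) pairs into per-visitor sessions.

    A new session starts when the gap since the visitor's previous
    event exceeds `timeout`.
    """
    sessions = defaultdict(list)  # visitor -> list of sessions (lists of timestamps)
    for visitor, ts in sorted(events):
        visitor_sessions = sessions[visitor]
        if visitor_sessions and ts - visitor_sessions[-1][-1] <= timeout:
            visitor_sessions[-1].append(ts)
        else:
            visitor_sessions.append([ts])
    return dict(sessions)


events = [('a', 0), ('a', 600), ('b', 100), ('a', 4000)]
print(sessionize(events))
# {'a': [[0, 600], [4000]], 'b': [[100]]}
```

This version sorts the whole input, so it only fits batch work; a streaming variant would have to keep per-visitor state and cope with late events.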
[Diagram. Labels: web server; tracking server; web page traffic; tracking traffic; structured events; Message Queue or Event Transport (Kafka, Flume, etc.)]
How did it get there?
Option 3: tagging
•Instrument pages with a special ‘tag’, i.e. a piece of
JavaScript or an image that exists just to log the request
•Create special endpoint that handles the tag
request in a structured way
•Tag endpoint handles logging the events
•Not a new idea (Google Analytics, Omniture,
•Less garbage traffic, because a browser is
required to evaluate the tag
•Event logging is asynchronous
•Easier to do inflight processing (apply a schema,
add enrichments, etc.)
•Allows for custom events (other than page view)
•Manage session through cookies on the client
•Incoming data is already sessionized
•Extract additional information from clients
•Screen resolution
•Viewport size
Looks familiar?
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
ga('create', 'UA-40578233-2', '');
ga('send', 'pageview');
Divolte Collector
Click stream data collection for Hadoop and Kafka.
Divolte Collector
[Diagram. Labels: web server; tracking server; web page traffic; tracking traffic; structured events; Message Queue or Event Transport (Kafka, Flume, etc.)]
Divolte Collector: Vision
•Focus purely on collection
•Processing is a separate concern
•Minimal on the fly enrichment
•The Hadoop tools ecosystem evolves too fast to compete
(SQL solutions, streaming, machine learning, etc.)
•Just provide data
•Data source for custom data science solutions
•Not a web analytics solution per se; descriptive web
analytics is a side effect
•Use cases will vary; try not to make too many assumptions
about users’ needs
Divolte Collector: Vision
•Solve the web specific tricky parts
•ID generation on client side (JavaScript)
•In-stream duplicate detection
•Data will be written in a schema-evolution-
friendly open format (Apache Avro)
•No arbitrary (JSON) objects
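One way to picture in-stream duplicate detection is a bounded memory of recently seen event IDs. The sketch below is an illustration only and is not Divolte Collector's implementation; the `RecentIdMemory` class and its `capacity` parameter are hypothetical.

```python
from collections import OrderedDict


class RecentIdMemory:
    """Remember the most recent event IDs, forgetting the oldest when full."""

    def __init__(self, capacity=1000000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, event_id):
        if event_id in self._seen:
            # Refresh the entry so frequently repeated IDs stay remembered.
            self._seen.move_to_end(event_id)
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # drop the least recently seen ID
        return False


memory = RecentIdMemory(capacity=2)
print([memory.is_duplicate(i) for i in ['a', 'a', 'b', 'c', 'a']])
# [False, True, False, False, False]
```

Because the memory is bounded, a duplicate that arrives after its ID has been evicted is not flagged (the last 'a' above), so downstream consumers should treat the flag as best effort.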
JavaScript-based tag
Your page content here.
Include Divolte Collector
just before the closing
body tag
<script src="//"
defer async>
Effectively stateless
Data with a schema in Avro
{
  "namespace": "com.example.record",
  "type": "record",
  "name": "MyEventRecord",
  "fields": [
    { "name": "location", "type": "string" },
    { "name": "pageType", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}
Map incoming data onto Avro records
mapping {
  map clientTimestamp() onto 'timestamp'
  map location() onto 'location'

  def u = parse location() to uri
  section {
    when u.path().equalTo('/checkout') apply {
      map 'checkout' onto 'pageType'
      exit()  // Assumed: leave the section so 'normal' does not overwrite 'checkout'.
    }
    map 'normal' onto 'pageType'
  }
}
User agent parsing
map userAgent().family() onto 'browserName'
map userAgent().osFamily() onto 'operatingSystemName'
map userAgent().osVersion() onto 'operatingSystemVersion'
// Etc... More fields available
IP to geolocation lookup
Useful performance
Requests per second: 14010.80 [#/sec] (mean)
Time per request: 0.571 [ms] (mean)
Time per request: 0.071 [ms] (mean, across all concurrent requests)
Transfer rate: 4516.55 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.2      0       3
Waiting:        0    0   0.2      0       3
Total:          0    1   0.2      1       3
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 3 (longest request)
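The headline numbers above can be cross-checked with a little arithmetic. The sketch assumes the output follows the usual ApacheBench conventions, where the "across all concurrent requests" time is the inverse of the request rate and the plain mean is that value multiplied by the concurrency level. The concurrency level is not shown on the slide, so the value computed here is only implied.

```python
requests_per_second = 14010.80
time_per_request_ms = 0.571  # mean, per request

# Mean time per request across all concurrent requests, in milliseconds.
time_across_all_ms = 1000.0 / requests_per_second

# Concurrency level implied by the two "Time per request" figures.
implied_concurrency = time_per_request_ms / time_across_all_ms

print(round(time_across_all_ms, 3), round(implied_concurrency))
# 0.071 8
```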
Custom events
divolte.signal('addToBasket', {
  productId: 309125,
  count: 1
})
In the page (JavaScript)
map eventParameter('productId') onto 'basketProductId'
map eventParameter('count') onto 'basketNumProducts'
In the mapping (Groovy)
Avro data, use any tool
Divolte Collector
•Apache License, Version 2.0
Ad hoc
1. Pick n images randomly
2. Optimise displayed image using bandit optimisation
3. After X iterations:
•Pick n / 2 new images randomly
•Select n / 2 images from existing set using learned
•Construct new set of images using half of existing
set and newly selected random images
4. Goto 2
Bayesian Bandits
•For each image, keep track of:
•Number of impressions
•Number of clicks
•When serving an image:
•Draw a random number from a Beta
distribution with parameters alpha = # of clicks,
beta = # of impressions, for each image
•Show image where sample value is largest
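The serving rule above can be sketched in a few lines of Python using only the standard library. It follows the parameterisation on the slide (alpha = clicks, beta = impressions); a common textbook variant uses impressions minus clicks for beta instead. The `choose_image` name and the sample counts are assumptions for illustration, and both counts must be positive, so every image should start at 1.

```python
import random


def choose_image(counts, rng=random):
    """Pick the image whose Beta sample is largest.

    `counts` maps image id -> (clicks, impressions); both must be > 0.
    """
    samples = {
        image: rng.betavariate(clicks, impressions)
        for image, (clicks, impressions) in counts.items()
    }
    return max(samples, key=samples.get)


counts = {'a': (2, 2), 'b': (2, 3), 'c': (1, 2)}
print(choose_image(counts, random.Random(42)))
```

Images with few observations produce widely spread samples and so still get shown occasionally, which is how the method balances exploration against exploitation.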
Bayesian Bandits
Prototype UI
class HomepageHandler(ShopHandler):
    @gen.coroutine  # Assumed: Tornado coroutine decorator, required for the yield below.
    def get(self):
        # Hard-coded ID for a pretty flower.
        # Later this ID will be decided by the bandit optimization.
        winner = '15442023790'

        # Grab the item details from our catalog service.
        top_item = yield self._get_json('catalog/item/%s' % winner)

        # Render the homepage
Prototype UI
<div class="col-md-6">
  <h4>Top pick:</h4>
  <!-- Link to the product page with a source identifier for tracking -->
  <a href="/product/{{ top_item['id'] }}/#/?source=top_pick">
    <img class="img-responsive img-rounded" src="{{ top_item['variants']['Medium']['img_source'] }}">
  </a>
  <!-- Signal that we served an impression of this image -->
  <script>divolte.signal('impression', { source: 'top_pick', productId: '{{ top_item['id'] }}'})</script>
  Photo by {{ top_item['owner']['real_name'] or top_item['owner']['user_name']}}
</div>
Data collection in Divolte Collector
{
  "name": "source",
  "type": ["null", "string"],
  "default": null
}

def locationUri = parse location() to uri
when eventType().equalTo('pageView') apply {
  def fragmentUri = parse locationUri.rawFragment() to uri
  map fragmentUri.query().value('source') onto 'source'
}
when eventType().equalTo('impression') apply {
  map eventParameters().value('productId') onto 'productId'
  map eventParameters().value('source') onto 'source'
}
Keep counts in Redis
{
  'c|14502147379': '2',
  'c|15106342717': '2',
  'c|15624953471': '1',
  'c|9609633287': '1',
  'i|14502147379': '2',
  'i|15106342717': '3',
  'i|15624953471': '2',
  'i|9609633287': '3'
}
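The key layout above packs two counters per item into one hash: a `c|` prefix for clicks and an `i|` prefix for impressions. A small sketch of turning such a hash into per-item pairs, assuming string keys and values (a Redis client may return bytes unless configured to decode responses); `counts_per_item` is a hypothetical helper.

```python
def counts_per_item(item_hash):
    """Turn {'c|<id>': n, 'i|<id>': m} into {<id>: (clicks, impressions)}."""
    counts = {}
    for key, value in item_hash.items():
        prefix, item = key[:2], key[2:]
        clicks, impressions = counts.get(item, (0, 0))
        if prefix == 'c|':
            clicks = int(value)
        elif prefix == 'i|':
            impressions = int(value)
        counts[item] = (clicks, impressions)
    return counts


state = {
    'c|14502147379': '2', 'i|14502147379': '2',
    'c|9609633287': '1', 'i|9609633287': '3',
}
print(counts_per_item(state))
# {'14502147379': (2, 2), '9609633287': (1, 3)}
```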
Consuming Kafka in Python
def start_consumer(args):
    # Load the Avro schema used for serialization.
    with open(args.schema) as schema_file:
        schema = avro.schema.Parse(schema_file.read())

    # Create a Kafka consumer and Avro reader. Note that
    # it is trivially possible to create a multi process
    # consumer.
    consumer = KafkaConsumer(args.topic,
                             ...)  # Remaining arguments are cut off on the slide.
    reader = avro.io.DatumReader(schema)  # Assumed: the standard Avro datum reader.

    # Consume messages.
    for message in consumer:
        handle_event(message, reader)
Consuming Kafka in Python
def handle_event(message, reader):
    # Decode Avro bytes into a Python dictionary.
    # Assumed: the standard Avro binary decoder and datum reader calls.
    message_bytes = io.BytesIO(message.value)
    decoder = avro.io.BinaryDecoder(message_bytes)
    event = reader.read(decoder)

    # Event logic.
    if 'top_pick' == event['source'] and 'pageView' == event['eventType']:
        # Register a click.
        redis_client.hincrby(
            ITEM_HASH_KEY,
            CLICK_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
    elif 'top_pick' == event['source'] and 'impression' == event['eventType']:
        # Register an impression and increment experiment count.
        p = redis_client.pipeline()
        p.incr(EXPERIMENT_COUNT_KEY)  # Assumed key name for the experiment counter.
        p.hincrby(
            ITEM_HASH_KEY,
            IMPRESSION_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
        experiment_count, ignored = p.execute()
        if experiment_count == REFRESH_INTERVAL:
            refresh_items()
def refresh_items():
    # Fetch current model state. We convert everything to str.
    current_item_dict = redis_client.hgetall(ITEM_HASH_KEY)
    current_items = numpy.unique([k[2:] for k in current_item_dict.keys()])

    # Fetch random items from ElasticSearch. Note we fetch more than we need,
    # but we filter out items already present in the current set and truncate
    # the list to the desired size afterwards.
    random_items = [
        item  # Assumed: the element expression is missing from the slide text.
        for item in random_item_set(NUM_ITEMS + NUM_ITEMS - len(current_items) // 2)
        if item not in current_items][:NUM_ITEMS - len(current_items) // 2]

    # Draw random samples.
    samples = [
        numpy.random.beta(  # Assumed: Beta sampling, as described for the bandit.
            int(current_item_dict[CLICK_KEY_PREFIX + item]),
            int(current_item_dict[IMPRESSION_KEY_PREFIX + item]))
        for item in current_items]

    # Select top half by sample values. current_items is conveniently
    # a Numpy array here.
    survivors = current_items[numpy.argsort(samples)[len(current_items) // 2:]]

    # New item set is survivors plus the random ones.
    new_items = numpy.concatenate([survivors, random_items])

    # Update model state to reflect new item set. This operation is atomic
    # in Redis.
    p = redis_client.pipeline(transaction=True)
    p.delete(ITEM_HASH_KEY)  # Assumed: clear the old state first.
    for item in new_items:
        p.hincrby(ITEM_HASH_KEY, CLICK_KEY_PREFIX + item, 1)
        p.hincrby(ITEM_HASH_KEY, IMPRESSION_KEY_PREFIX + item, 1)  # Assumed.
    p.execute()  # Assumed: run the queued commands as one transaction.
Serving a recommendation
class BanditHandler(web.RequestHandler):
    redis_client = None

    def initialize(self, redis_client):
        self.redis_client = redis_client

    @gen.coroutine  # Assumed: Tornado coroutine decorator, required for the yield below.
    def get(self):
        # Fetch model state.
        item_dict = yield gen.Task(self.redis_client.hgetall, ITEM_HASH_KEY)
        items = numpy.unique([k[2:] for k in item_dict.keys()])

        # Draw random samples.
        samples = [
            numpy.random.beta(  # Assumed: Beta sampling, as described for the bandit.
                int(item_dict[CLICK_KEY_PREFIX + item]),
                int(item_dict[IMPRESSION_KEY_PREFIX + item]))
            for item in items]

        # Select item with largest sample value.
        winner = items[numpy.argmax(samples)]
        self.write(json_encode(winner))  # Assumed: the caller decodes a JSON body.
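class HomepageHandler(ShopHandler):
    @gen.coroutine  # Assumed: Tornado coroutine decorator, required for the yields below.
    def get(self):
        # Ask the bandit service which item to show.
        http = AsyncHTTPClient()
        request = HTTPRequest(url='http://localhost:8989/item', method='GET')
        response = yield http.fetch(request)
        winner = json_decode(response.body)

        # Grab the item details from our catalog service.
        top_item = yield self._get_json('catalog/item/%s' % winner)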
Server side - short term
•Allow multiple sources / sink channels
•With different input → schema mappings
•Server side events
•Support for server side event logging (JSON
•Enabler for mobile SDKs
•Trivial to add pixel based end-point (server
managed cookies)
Client side
•Specific browser related bug fixes (IE9)
•Allow for setting session scoped parameters
•JavaScript Data Layer
Collector next steps
•Integrate with Planout (
•Allow definition of online experiments in one place
•All event logging automatically includes random
parameters generated for experiment selection
•Single solution for data collection for online
experimentation / optimization
We’re hiring / Questions? / Thank you!
