Divolte Collector 
Because life’s too short for log file parsing 
GoDataDriven 
PROUDLY PART OF THE XEBIA GROUP 
@asnare / @fzk 
signal@godatadriven.com 
Andrew Snare / Friso van Vollenhoven
99% of all data in Hadoop 
156.68.7.63 - - [28/Jul/1995:11:53:28 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669 
137.244.160.140 - - [28/Jul/1995:11:53:29 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0 
163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324 
163.205.160.5 - - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573 
140.229.50.189 - - [28/Jul/1995:11:53:54 -0400] "GET /shuttle/missions/sts-67/images/images.html HTTP/1.0"
163.206.89.4 - - [28/Jul/1995:11:54:02 -0400] "GET /shuttle/technology/sts-newsref/sts-mps.html HTTP/1.0" 200
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891 
131.110.53.48 - - [28/Jul/1995:11:54:07 -0400] "GET /shuttle/technology/sts-newsref/stsref-toc.html HTTP/1.0"
163.205.160.5 - - [28/Jul/1995:11:54:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
130.160.196.81 - - [28/Jul/1995:11:54:15 -0400] "GET /shuttle/resources/orbiters/challenger.html HTTP/1.0"
131.110.53.48 - - [28/Jul/1995:11:54:16 -0400] "GET /images/shuttle-patch-small.gif HTTP/1.0" 200 4179
137.244.160.140 - - [28/Jul/1995:11:54:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0"
131.110.53.48 - - [28/Jul/1995:11:54:18 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
131.110.53.48 - - [28/Jul/1995:11:54:19 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713 
130.160.196.81 - - [28/Jul/1995:11:54:19 -0400] "GET /shuttle/resources/orbiters/challenger-logo.gif HTTP/1.0"
163.205.160.5 - - [28/Jul/1995:11:54:25 -0400] "GET /shuttle/missions/sts-70/images/images.html HTTP/1.0" 200
130.181.4.158 - - [28/Jul/1995:11:54:26 -0400] "GET /history/rocket-history.txt HTTP/1.0" 200 26990
137.244.160.140 - - [28/Jul/1995:11:54:30 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 304 0 
137.244.160.140 - - [28/Jul/1995:11:54:31 -0400] "GET /images/launch-logo.gif HTTP/1.0" 304 0 
137.244.160.140 - - [28/Jul/1995:11:54:38 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 304
168.178.17.149 - - [28/Jul/1995:11:54:48 -0400] "GET /shuttle/missions/sts-65/mission-sts-65.html HTTP/1.0"
140.229.50.189 - - [28/Jul/1995:11:54:53 -0400] "GET /shuttle/missions/sts-67/images/KSC-95EC-0390.jpg HTTP/
131.110.53.48 - - [28/Jul/1995:11:54:58 -0400] "GET /shuttle/missions/missions.html HTTP/1.0" 200 8677
131.110.53.48 - - [28/Jul/1995:11:55:02 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853 
131.110.53.48 - - [28/Jul/1995:11:55:05 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786 
128.159.111.141 - - [28/Jul/1995:11:55:09 -0400] "GET /procurement/procurement.html HTTP/1.0" 200 3499 
128.159.111.141 - - [28/Jul/1995:11:55:10 -0400] "GET /images/op-logo-small.gif HTTP/1.0" 200 14915 
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786 
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204 
192.213.154.220 - - [28/Jul/1995:11:55:15 -0400] "GET /shuttle/countdown/tour.html HTTP/1.0" 200 4347
Typical web optimization architecture 
[Architecture diagram: the USER sends an HTTP request (e.g. /org/apache/hadoop/io/IOUtils.html) to a service, which emits a log event (e.g. 2012-07-01T06:00:02.500Z /org/apache/hadoop/io/IOUtils.html). Logs are transported to a compute cluster, where offline analytics / model training produces batch updates to model state and streaming log processing produces streaming updates to model state; the model result (e.g. recommendations) is served back to the user.]
Parse HTTP server logs 
access.log
How did it get there? 
Option 1: parse HTTP server logs 
• Ship log files on a schedule 
• Parse using MapReduce jobs 
• Batch analytics jobs feed online systems
HTTP server log parsing 
• Inherently batch oriented 
• Schema-less (the URL format is the schema) 
• Initial job needed to parse logs into a structured format (a parser sketch follows below) 
• Usually multiple versions of parsers required 
• Requires sessionizing 
• Logs usually contain more than you asked for (bots, image requests, spiders, health checks, etc.)
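
To make the parser-logic burden concrete, here is a minimal sketch (ours, not from the deck) of parsing a single Common Log Format line in Java; real parsers also need to handle broken lines, multiple log formats, and format drift over time.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClfParser {
    // Common Log Format: host ident authuser [timestamp] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        String line = "156.68.7.63 - - [28/Jul/1995:11:53:28 -0400] "
            + "\"GET /images/WORLD-logosmall.gif HTTP/1.0\" 200 669";
        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            System.out.println("host=" + m.group(1)
                + " request=" + m.group(5) + " status=" + m.group(6));
        } else {
            // Truncated or malformed lines are common in practice.
            System.out.println("unparseable: " + line);
        }
    }
}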
Stream HTTP server logs 
[Diagram: the web server's access.log is followed with tail -F; the resulting events flow into a message queue or event transport (Kafka, Flume, etc.) and on to other consumers.]
How did it get there? 
Option 2: stream HTTP server logs 
• tail -F logfiles 
• Use a queue for transport (e.g. Flume or Kafka); a shipper sketch follows below 
• Parse logs on the fly 
• Or write semi-schema’d logs, like JSON 
• Parse again for the batch workload
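
As an illustration of the transport side, a minimal shipper sketch using the kafka-clients producer API (which postdates this deck's old-producer config): read log lines from stdin, e.g. piped from tail -F, and publish each one as a message. The broker address and topic name are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Usage: tail -F access.log | java LogShipper
public class LogShipper {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Ship the raw line; parsing is deferred to the consumers.
                producer.send(new ProducerRecord<>("access-log", line));
            }
        }
    }
}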
Stream HTTP server logs 
• Allows for near real-time event handling when 
consuming from queues 
• Sessionizing? Duplicates? Bots? 
• Still requires parser logic 
• No schema
Tagging 
[Diagram: the browser fetches index.html and script.js from the web server (web page traffic, logged to access.log); the script sends asynchronous tracking traffic to a dedicated tracking server, which emits structured events into a message queue or event transport (Kafka, Flume, etc.) and on to other consumers.]
How did it get there? 
Option 3: tagging 
• Instrument pages with a special ‘tag’: JavaScript or an image request that exists purely to log the request 
• Create a special endpoint that handles the tag request in a structured way (a toy endpoint is sketched below) 
• The tag endpoint handles logging the events
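
For intuition only, a toy tag endpoint on the JDK's built-in HTTP server (our sketch, not Divolte's implementation): it serves a 1×1 transparent GIF and logs the query string, which is where the structured event data travels.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.Base64;

public class TagEndpoint {
    // Smallest transparent 1x1 GIF, base64-encoded.
    private static final byte[] PIXEL = Base64.getDecoder().decode(
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7");

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8290), 0);
        server.createContext("/pixel.gif", exchange -> {
            // The query string carries the structured event data.
            System.out.println("event: " + exchange.getRequestURI().getQuery());
            exchange.getResponseHeaders().add("Content-Type", "image/gif");
            exchange.sendResponseHeaders(200, PIXEL.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(PIXEL);
            }
        });
        server.start();
    }
}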
Tagging 
• Not a new idea (Google Analytics, Omniture, 
etc.) 
• Less garbage traffic, because a browser is 
required to evaluate the tag 
• Event logging is asynchronous 
• Easier to do inflight processing (apply a schema, 
add enrichments, etc.) 
• Allows for custom events (other than page view)
Also… 
• Manage session through cookies on the client 
side 
• Incoming data is already sessionized 
• Extract additional information from clients 
• Screen resolution 
• Viewport size 
• Timezone
Looks familiar? 
<script> 
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ 
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), 
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) 
})(window,document,'script','//www.google-analytics.com/analytics.js','ga'); 

ga('create', 'UA-40578233-2', 'godatadriven.com'); 
ga('send', 'pageview'); 

</script>
Divolte Collector 
Tag based click stream 
data collection for 
Hadoop and Kafka.
Divolte Collector 
[Diagram: the same tagging architecture as before, now with Divolte Collector as the tracking server: asynchronous tracking traffic from the browser goes to Divolte, which emits structured events into a message queue or event transport (Kafka, Flume, etc.) and on to other consumers, while regular web page traffic still hits the web server and its access.log.]
The TAG 
<script src="//tr.example.com/divolte.js" 
defer 
async> 
</script>
Schema! 
{ 
"namespace": "com.example.record", 
"type": "record", 
"name": "ClickEventRecord", 
"fields": [ 
{ "name": "productNumber", "type": ["null", "string"], "default": null }, 
{ "name": "shop", "type": ["null", "string"], "default": null }, 
{ "name": "category", "type": ["null", "string"], "default": null }, 
{ "name": "advisor", "type": ["null", "string"], "default": null }, 
{ "name": "searchPhrase", "type": ["null", "string"], "default": null }, 
{ "name": "basketProductNumber", "type": ["null", "string"], "default": null }, 
{ "name": "basketSizeCode", "type": ["null", "string"], "default": null }, 
{ "name": "basketProductCount", "type": ["null", "string"], "default": null } 
] 
}
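
To see what an event conforming to this schema looks like, a small sketch using the standard Avro generic API; the field values are hypothetical.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RecordDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("ClickEventRecord.avsc"));
        GenericRecord record = new GenericData.Record(schema);
        record.put("productNumber", "a3bc38de"); // hypothetical values
        record.put("searchPhrase", "fiets");
        // Unset fields stay null, as the ["null", "string"] unions allow.
        System.out.println(record); // Avro's toString renders JSON
    }
}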
Mapping 
// Page type detector: 
// http://.../basket 
basket = "^https?://[^/]+/basket(?:[?#].*)?$" 

// Page type detector: 
// http://.../search?q=fiets 
search = "^https?://[^/]+/search?.*$" 

// Page type detector: 
// http://.../checkout 
checkout = "^https?://[^/]+/checkout(?:[?#].*)?$" 

// Page type detector: 
// http://.../thankyou 
payment_ok = "^https://[^/]+/thankyou(?:[?#].*)?$"
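
These detectors are plain regular expressions, so they are easy to sanity-check in isolation. (One aside: in the search pattern, the ? after search makes the final h optional rather than matching a literal query string, though the intended URLs still match.) A quick check with hypothetical URLs:

import java.util.regex.Pattern;

public class DetectorCheck {
    public static void main(String[] args) {
        Pattern basket = Pattern.compile("^https?://[^/]+/basket(?:[?#].*)?$");
        Pattern search = Pattern.compile("^https?://[^/]+/search?.*$");

        System.out.println(
            basket.matcher("http://shop.example.com/basket").matches());          // true
        System.out.println(
            basket.matcher("http://shop.example.com/basket?ref=home").matches()); // true
        System.out.println(
            search.matcher("http://shop.example.com/search?q=fiets").matches());  // true
    }
}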
Mapping 
pageType { 
  type = regex_name 
  regexes = [ home, category, shop, basket, search, customercare ] 
  field = location 
} 

productNumber { 
  type = regex_group 
  regex = pdp 
  field = location 
  group = product 
} 

viewportPixelWidth = viewportPixelWidth 
viewportPixelHeight = viewportPixelHeight 
screenPixelWidth = screenPixelWidth 
screenPixelHeight = screenPixelHeight
Configure 
divolte { 
  server { 
    host = 0.0.0.0 
    use_x_forwarded_for = true 
    landing_page = false 
  } 

  tracking { 
    cookie_domain = .example.com 
    include "click-schema-mapping.conf" 
    schema_file = /etc/divolte/ClickEventRecord.avsc 
  } 

  …
Configure 
kafka_flusher { 
  enabled = true 
  producer = { 
    metadata.broker.list = [ 
      "broker1:9092", 
      "broker2:9092", 
      "broker3:9092" 
    ] 
  } 
} 

…
Configure 
hdfs_flusher { 
  hdfs { 
    replication = 3 
  } 

  simple_rolling_file_strategy { 
    roll_every = 60 minutes 
    sync_file_after_records = 1000 
    sync_file_after_duration = 10 seconds 

    working_dir = /divolte/inflight 
    publish_dir = /divolte/published 
  } 
} 
} // closes the divolte block opened earlier
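
The files under publish_dir are standard Avro container files, so anything that speaks Avro can read them. A minimal local sketch; the file name is a placeholder and the location field follows the earlier mapping examples.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadPublished {
    public static void main(String[] args) throws Exception {
        File file = new File("/divolte/published/some-events.avro"); // placeholder
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("location"));
            }
        }
    }
}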
Run 
./bin/divolte-collector
Demo: Javadoc analytics! 
javadoc -d outputdir \
  -bottom '<script src="//localhost:8290/divolte.js" defer async></script>' \
  -subpackages .
Kafka event consumer
private static class JavadocEventHandler implements EventHandler<JavadocEventRecord> { 
    private static final String TCP_SERVER_HOST = "127.0.0.1"; 
    private static final int TCP_SERVER_PORT = 1234; 

    private Socket socket = null; 
    private OutputStream stream; 

    @Override 
    public void setup() throws Exception { 
        socket = new Socket(TCP_SERVER_HOST, TCP_SERVER_PORT); 
        stream = socket.getOutputStream(); 
    } 

    @Override 
    public void handle(JavadocEventRecord event) throws Exception { 
        if (!event.getDetectedDuplicate()) { 
            // Avro's toString already produces JSON. 
            stream.write(event.toString().getBytes(StandardCharsets.UTF_8)); 
            stream.write("\n".getBytes(StandardCharsets.UTF_8)); 
        } 
    } 

    @Override 
    public void shutdown() throws Exception { 
        if (null != stream) stream.close(); 
        if (null != socket) socket.close(); 
    } 
}
public static void main(String[] args) { 
    final DivolteKafkaConsumer<JavadocEventRecord> consumer = 
        DivolteKafkaConsumer.createConsumer( 
            KAFKA_TOPIC, 
            ZOOKEEPER_QUORUM, 
            KAFKA_CONSUMER_GROUP_ID, 
            NUM_CONSUMER_THREADS, 
            () -> new JavadocEventHandler(), 
            JavadocEventRecord.getClassSchema()); 

    Runtime.getRuntime().addShutdownHook(new Thread(() -> { 
        System.out.println("Shutting down consumer."); 
        consumer.shutdownConsumer(); 
    })); 

    System.out.println("Starting consumer."); 
    consumer.startConsumer(); 
}
SQL FTW!
CREATE EXTERNAL TABLE javadoc_analytics ( 
  firstInSession boolean 
  -- other fields are created automatically from schema 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS 
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
LOCATION '/divolte/published' 
TBLPROPERTIES ( 
  'avro.schema.url'='hdfs:///JavadocEventRecord.avsc' 
);
Python & Spark
export IPYTHON=1 
export IPYTHON_OPTS="notebook --ip=0.0.0.0" 
pyspark \
  --jars divolte-spark-assembly-0.1.jar \
  --driver-class-path divolte-spark-assembly-0.1.jar \
  --num-executors 40
Spark & Spark Streaming
import io.divolte.spark.avro._ 
import org.apache.avro.generic.IndexedRecord 
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 

val sc = new SparkContext() 
val events = sc.newAvroFile[IndexedRecord](path) 

// And then… 
val records = events.toRecords 
// or 
val eventFields = events.fields("sessionId", "location", "timestamp")
// Kafka configuration. 
val consumerConfig = Map( 
  "group.id" -> "some-id-for-the-consumer-group", 
  "zookeeper.connect" -> "zookeeper-connect-string", 
  "auto.commit.interval.ms" -> "5000", 
  "auto.offset.reset" -> "largest" 
) 
val topicSettings = Map("divolte" -> Runtime.getRuntime.availableProcessors()) 

val sc = new SparkContext() 
val ssc = new StreamingContext(sc, Seconds(15)) 

// Establish the source event stream. 
val stream = ssc.divolteStream[GenericRecord](consumerConfig, topicSettings, StorageLevel.MEMORY_ONLY) 

// And then… 
val eventStream = stream.toRecords 
// or 
val locationStream = stream.fields("location")
Also in the box
Zero config deploy 
• Easy to use for local development 
• Works out of the box with zero custom config 
• Comes with a built-in schema and mapping 
• Works on a local machine without Hadoop 
• Flushes to /tmp on the local file system
Collector has no global state 
• Load balancer friendly 
• Horizontally scalable 
• Shared nothing 
• (other than HDFS and Kafka)
In-stream de-duplication 
• The internet is a mean place; data will have noise 
• In-stream, hash-based de-duplication (idea sketched below) 
• Low false-negative rate 
• Virtually zero false-positive rate 
• Requires URI-based routing from the load balancer 
• Easy to set up on nginx 
• Supported on many hardware load balancers
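
The deck doesn't show Divolte's actual algorithm, but the idea of in-stream hash-based de-duplication can be sketched as a bounded, LRU-style set of recently seen event hashes:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not Divolte's actual implementation.
public class Deduplicator {
    private final Map<Long, Boolean> seen;

    public Deduplicator(final int capacity) {
        // Access-ordered LinkedHashMap doubling as a simple LRU set.
        this.seen = new LinkedHashMap<Long, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // True if this hash was seen recently, i.e. the event is likely a duplicate.
    public boolean isDuplicate(long eventHash) {
        return seen.put(eventHash, Boolean.TRUE) != null;
    }
}

This also explains the trade-offs above: a duplicate that arrives after eviction, or at a different collector instance, slips through (false negative), which is why URI-based routing from the load balancer matters; a false positive requires an outright hash collision.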
Corrupt request detection 
• The internet is still a mean place… some URLs arrive truncated 
• Incomplete events are detected and discarded
Defeat Chrome’s pre-rendering 
• Chrome sometimes speculatively pre-renders pages in the background 
• This triggers JS even if the page is not shown 
• Unless you use the Page Visibility API to detect this 
• Which we do 
• We take care of many other JS caveats as well
Custom events 
• Divolte presents itself as a JS library 
• Map custom event parameters directly onto Avro fields 

<!-- client side --> 
<script> 
  divolte.signal("addToBasket", { 
    count: 2, 
    productId: "a3bc38de" 
  }) 
</script> 

// server side mapping 
eventType = eventType 

basketProductId { 
  type = event_parameter 
  name = productId 
}
Bring your own IDs 
• Generate the page view ID on the server side (sketched below) 
• Makes it possible to relate server-side logging to page views and other client-side events
<script 
src="//…/divolte.js#a28de3bf42a5dc98c03" 
defer 
async> 
</script>
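
A sketch of the server side (the host name and ID format are assumptions): generate a page view ID per request, embed it in the fragment of the script URL, and reuse the same ID in your own server-side logs.

import java.util.UUID;

public class TagRenderer {
    // Render the Divolte tag with a server-generated page view ID so that
    // server-side log lines for this request can carry the same ID.
    public static String divolteTag(String pageViewId) {
        return "<script src=\"//tr.example.com/divolte.js#" + pageViewId
            + "\" defer async></script>";
    }

    public static void main(String[] args) {
        String pageViewId = UUID.randomUUID().toString().replace("-", "");
        System.out.println(divolteTag(pageViewId));
    }
}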
User agent parsing 
• On-the-fly parsing of the user agent string 
• Uses http://uadetector.sourceforge.net/ (standalone usage sketched below) 
• Updates the user agent database at runtime, without a restart
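
Used standalone, the uadetector library looks roughly like this (the user agent string is just an example):

import net.sf.uadetector.ReadableUserAgent;
import net.sf.uadetector.UserAgentStringParser;
import net.sf.uadetector.service.UADetectorServiceFactory;

public class UaDemo {
    public static void main(String[] args) {
        UserAgentStringParser parser =
            UADetectorServiceFactory.getResourceModuleParser();
        ReadableUserAgent agent = parser.parse(
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"
            + " (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36");
        System.out.println(agent.getName());                      // browser name
        System.out.println(agent.getOperatingSystem().getName()); // OS name
    }
}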
IP to geo coordinates 
• On-the-fly enrichment with geo coordinates, based on the client IP address 
• Uses the MaxMind GeoIP database: https://www.maxmind.com/en/geoip2-databases (standalone usage sketched below) 
• Updates the database at runtime, without a restart 
• Sets: 
• Latitude & longitude 
• Country, city, subdivision
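
Standalone usage of the MaxMind GeoIP2 Java API looks roughly like this; the database path and IP address are placeholders.

import java.io.File;
import java.net.InetAddress;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

public class GeoDemo {
    public static void main(String[] args) throws Exception {
        DatabaseReader reader = new DatabaseReader.Builder(
            new File("GeoLite2-City.mmdb")).build();   // placeholder path
        CityResponse r = reader.city(InetAddress.getByName("128.101.101.101"));
        System.out.println(r.getLocation().getLatitude() + ", "
            + r.getLocation().getLongitude());
        System.out.println(r.getCountry().getName());
        System.out.println(r.getCity().getName());
        System.out.println(r.getMostSpecificSubdivision().getName());
    }
}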
https://github.com/divolte/divolte-collector 
https://github.com/divolte/divolte-examples 
https://github.com/divolte/divolte-kafka-consumer 
https://github.com/divolte/divolte-spark
GoDataDriven 
We’re hiring / Questions? / Thank you! 
@asnare / @fzk 
signal@godatadriven.com 
Andrew Snare / Friso van Vollenhoven
