GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
@asnare / @fzk / @godatadriven
signal@godatadriven.com
Divolte Collector
Andrew Snare / Friso van Vollenhoven
Because life’s too short for log file parsing
99% of all data in Hadoop
156.68.7.63 - - [28/Jul/1995:11:53:28 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
137.244.160.140 - - [28/Jul/1995:11:53:29 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0
163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324
163.205.160.5 - - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573
140.229.50.189 - - [28/Jul/1995:11:53:54 -0400] "GET /shuttle/missions/sts-67/images/images.html HTTP/1.0"
163.206.89.4 - - [28/Jul/1995:11:54:02 -0400] "GET /shuttle/technology/sts-newsref/sts-mps.html HTTP/1.0" 2
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891
131.110.53.48 - - [28/Jul/1995:11:54:07 -0400] "GET /shuttle/technology/sts-newsref/stsref-toc.html HTTP/1.
163.205.160.5 - - [28/Jul/1995:11:54:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
130.160.196.81 - - [28/Jul/1995:11:54:15 -0400] "GET /shuttle/resources/orbiters/challenger.html HTTP/1.0"
131.110.53.48 - - [28/Jul/1995:11:54:16 -0400] "GET /images/shuttle-patch-small.gif HTTP/1.0" 200 4179
137.244.160.140 - - [28/Jul/1995:11:54:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0
131.110.53.48 - - [28/Jul/1995:11:54:18 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
131.110.53.48 - - [28/Jul/1995:11:54:19 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713
130.160.196.81 - - [28/Jul/1995:11:54:19 -0400] "GET /shuttle/resources/orbiters/challenger-logo.gif HTTP/1
163.205.160.5 - - [28/Jul/1995:11:54:25 -0400] "GET /shuttle/missions/sts-70/images/images.html HTTP/1.0" 2
130.181.4.158 - - [28/Jul/1995:11:54:26 -0400] "GET /history/rocket-history.txt HTTP/1.0" 200 26990
137.244.160.140 - - [28/Jul/1995:11:54:30 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:31 -0400] "GET /images/launch-logo.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:38 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 304
168.178.17.149 - - [28/Jul/1995:11:54:48 -0400] "GET /shuttle/missions/sts-65/mission-sts-65.html HTTP/1.0"
140.229.50.189 - - [28/Jul/1995:11:54:53 -0400] "GET /shuttle/missions/sts-67/images/KSC-95EC-0390.jpg HTTP
131.110.53.48 - - [28/Jul/1995:11:54:58 -0400] "GET /shuttle/missions/missions.html HTTP/1.0" 200 8677
131.110.53.48 - - [28/Jul/1995:11:55:02 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853
131.110.53.48 - - [28/Jul/1995:11:55:05 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
128.159.111.141 - - [28/Jul/1995:11:55:09 -0400] "GET /procurement/procurement.html HTTP/1.0" 200 3499
128.159.111.141 - - [28/Jul/1995:11:55:10 -0400] "GET /images/op-logo-small.gif HTTP/1.0" 200 14915
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
How do we use our data?
•Ad hoc
•Batch
•Streaming
Typical web optimization architecture
[Diagram: a user's HTTP request (/org/apache/hadoop/io/IOUtils.html) is recorded as a log event (2012-07-01T06:00:02.500Z /org/apache/hadoop/io/IOUtils.html) by a log transport service. Logs are transported to a compute cluster for offline analytics and model training, which batch-updates the model state. Streaming log processing applies streaming updates to the same model state. The model result (e.g. recommendations) is served back to the user.]
Parse HTTP server logs
access.log
How did it get there?
Option 1: parse HTTP server logs
•Ship log files on a schedule
•Parse using MapReduce jobs
•Batch analytics jobs feed online systems
HTTP server log parsing
•Inherently batch oriented
•Schema-less (URL format is the schema)
•Initial job to parse logs into structured format
•Usually multiple versions of parsers required
•Requires sessionizing
•Logs usually have more than you ask for (bots,
image requests, spiders, health check, etc.)
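To make the cost of this approach concrete, here is a minimal sketch of the initial parsing job for lines in the Common Log Format shown earlier. The regular expression, the `parse_line` and `is_page_view` helpers and the list of static suffixes are illustrative assumptions, not part of any particular tool; real logs need more tolerant parsing and better bot filtering.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "method path protocol" status size
# (the pattern is an illustrative assumption, not a complete parser)
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

# Assumed list of suffixes that mark requests we do not care about.
STATIC_SUFFIXES = ('.gif', '.jpg', '.png', '.css', '.js')


def parse_line(line):
    """Return a dict for a well-formed line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record['time'] = datetime.strptime(record['time'], '%d/%b/%Y:%H:%M:%S %z')
    record['status'] = int(record['status'])
    record['size'] = 0 if record['size'] == '-' else int(record['size'])
    return record


def is_page_view(record):
    """Crude filter: drop image and other static asset requests."""
    return not record['path'].lower().endswith(STATIC_SUFFIXES)


lines = [
    '163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324',
    '163.205.160.5 - - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573',
]
records = [parse_line(line) for line in lines]
page_views = [r for r in records if r is not None and is_page_view(r)]
```

Even this small sketch already needs a format-specific pattern and a hand-maintained filter, which is the maintenance burden the bullets above describe.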
Stream HTTP server logs
[Diagram: access.log is followed with tail -F and shipped as events into a message queue or event transport (Kafka, Flume, etc.), from which the events are delivered to other consumers.]
How did it get there?
Option 2: stream HTTP server logs
•tail -F logfiles
•Use a queue for transport (e.g. Flume or Kafka)
•Parse logs on the fly
•Or write semi-schema’d logs, like JSON
•Parse again for batch work load
Stream HTTP server logs
•Allows for near real-time event handling when
consuming from queues
•Sessionizing? Duplicates? Bots?
•Still requires parser logic
•No schema
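As an illustration of what "sessionizing" means for raw log events, the sketch below assigns session numbers using an inactivity gap. The 30 minute timeout, the use of the client IP as the visitor key and the `sessionize` helper are all assumptions made for the example; IP addresses in particular are a poor visitor identifier when clients share an address.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def sessionize(events):
    """Assign a session number to (visitor, timestamp) pairs.

    A new session starts when a visitor has been idle for longer than
    SESSION_TIMEOUT. Events are processed in timestamp order.
    """
    last_seen = {}    # visitor -> timestamp of the previous event
    session_no = {}   # visitor -> current session counter
    result = []
    for visitor, ts in sorted(events, key=lambda e: e[1]):
        previous = last_seen.get(visitor)
        if previous is None or ts - previous > SESSION_TIMEOUT:
            session_no[visitor] = session_no.get(visitor, 0) + 1
        last_seen[visitor] = ts
        result.append((visitor, ts, session_no[visitor]))
    return result


events = [
    ('163.205.160.5', datetime(1995, 7, 28, 11, 53, 31)),
    ('163.205.160.5', datetime(1995, 7, 28, 11, 54, 14)),
    ('128.159.111.141', datetime(1995, 7, 28, 11, 55, 9)),
    ('163.205.160.5', datetime(1995, 7, 28, 12, 40, 0)),
]
sessions = sessionize(events)
```

In a streaming setting the same logic has to cope with out-of-order and duplicate events, which is why the questions on this slide remain open when consuming from a queue.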
Tagging
[Diagram: the web server serves index.html and script.js (web page traffic, logged to access.log). The script sends asynchronous tracking traffic to a tracking server, which emits structured events to a message queue or event transport (Kafka, Flume, etc.) and on to other consumers.]
How did it get there?
Option 3: tagging
•Instrument pages with special ‘tag’, i.e. special
JavaScript or image just for logging the request
•Create special endpoint that handles the tag
request in a structured way
•Tag endpoint handles logging the events
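A tag endpoint can be thought of as a function from a tracking request to a structured event. The sketch below shows that idea with the Python standard library only. The parameter names (`event`, `location`, `session`), the example URL and the `event_from_tag_request` helper are hypothetical; a real tag defines its own request format.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names; a real tag defines its own.
EXPECTED_FIELDS = ('event', 'location', 'session')


def event_from_tag_request(url):
    """Turn the query string of a tag request into a fixed-shape event."""
    query = parse_qs(urlsplit(url).query)
    event = {}
    for field in EXPECTED_FIELDS:
        values = query.get(field)
        # Missing parameters become None so every event has the same keys.
        event[field] = values[0] if values else None
    return event


url = ('https://tracker.example.com/tag'
       '?event=pageView'
       '&location=https%3A%2F%2Fexample.com%2Fcheckout'
       '&session=abc123')
event = event_from_tag_request(url)
```

Because the endpoint only accepts a known set of fields, everything downstream can rely on a fixed event shape instead of re-parsing free-form log lines.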
Tagging
•Not a new idea (Google Analytics, Omniture,
etc.)
•Less garbage traffic, because a browser is
required to evaluate the tag
•Event logging is asynchronous
•Easier to do inflight processing (apply a schema,
add enrichments, etc.)
•Allows for custom events (other than page view)
Also…
•Manage session through cookies on the client
side
•Incoming data is already sessionized
•Extract additional information from clients
•Screen resolution
•Viewport size
•Timezone
Looks familiar?
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-40578233-2', 'godatadriven.com');
ga('send', 'pageview');
</script>
Divolte Collector
Click stream data collection for Hadoop and Kafka.
Divolte Collector
[Diagram: the web server serves index.html and script.js (web page traffic, logged to access.log). The script sends asynchronous tracking traffic to a tracking server, which emits structured events to a message queue or event transport (Kafka, Flume, etc.) and on to other consumers.]
Divolte Collector: Vision
•Focus purely on collection
•Processing is a separate concern
•Minimal on the fly enrichment
•The Hadoop tools ecosystem evolves too fast to compete
(SQL solutions, streaming, machine learning, etc.)
•Just provide data
•Data source for custom data science solutions
•Not a web analytics solution per se; descriptive web
analytics is a side effect
•Use cases will vary; try not to make too many assumptions
about users’ needs
Divolte Collector: Vision
•Solve the web specific tricky parts
•ID generation on client side (JavaScript)
•In-stream duplicate detection
•Schema!
•Data will be written in a schema-evolution-
friendly open format (Apache Avro)
•No arbitrary (JSON) objects
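To illustrate what in-stream duplicate detection involves, here is a minimal sketch that remembers a bounded number of recently seen event IDs. This is not a description of how Divolte Collector implements it; the `RecentIdFilter` class and its eviction policy are assumptions chosen to keep the example short.

```python
from collections import OrderedDict


class RecentIdFilter:
    """Flag events whose ID is among the last `capacity` distinct IDs seen."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, event_id):
        if event_id in self._seen:
            # Refresh the ID so it is not evicted while still recurring.
            self._seen.move_to_end(event_id)
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # forget the oldest ID
        return False


dedup = RecentIdFilter(capacity=2)
flags = [dedup.is_duplicate(i) for i in ['a', 'a', 'b', 'c', 'a']]
```

The bounded memory is the point of the sketch: a duplicate that arrives after its ID has been evicted goes undetected, so this kind of filter trades completeness for constant memory use.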
JavaScript-based tag
<body>
<!--
Your page content here.
-->
<!--
Include Divolte Collector
just before the closing
body tag
-->
<script src="//example.com/divolte.js"
defer async>
</script>
</body>
Effectively stateless
Data with a schema in Avro
{
"namespace": "com.example.record",
"type": "record",
"name": "MyEventRecord",
"fields": [
{ "name": "location", "type": "string" },
{ "name": "pageType", "type": "string" },
{ "name": "timestamp", "type": "long" }
]
}
Map incoming data onto Avro records
mapping {
    map clientTimestamp() onto 'timestamp'
    map location() onto 'location'
    def u = parse location() to uri
    section {
        when u.path().equalTo('/checkout') apply {
            map 'checkout' onto 'pageType'
            exit()
        }
        map 'normal' onto 'pageType'
    }
}
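For readers unfamiliar with the mapping DSL, the same decision logic can be written as a plain Python function. This is only a sketch of what the mapping above expresses, not how Divolte Collector evaluates it; the `map_event` function and its arguments are names made up for the illustration.

```python
from urllib.parse import urlsplit


def map_event(location, client_timestamp):
    """Build a record with the three fields of the example Avro schema."""
    record = {'timestamp': client_timestamp, 'location': location}
    path = urlsplit(location).path
    # The checkout rule wins and ends the section; otherwise fall through
    # to the default page type.
    record['pageType'] = 'checkout' if path == '/checkout' else 'normal'
    return record


checkout = map_event('https://shop.example.com/checkout', 1430000000000)
other = map_event('https://shop.example.com/product/42', 1430000001000)
```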
User agent parsing
map userAgent().family() onto 'browserName'
map userAgent().osFamily() onto 'operatingSystemName'
map userAgent().osVersion() onto 'operatingSystemVersion'
// Etc... More fields available
IP to geolocation lookup
Useful performance
Requests per second: 14010.80 [#/sec] (mean)
Time per request: 0.571 [ms] (mean)
Time per request: 0.071 [ms] (mean, across all concurrent requests)
Transfer rate: 4516.55 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 0 0 0.2 0 3
Waiting: 0 0 0.2 0 3
Total: 0 1 0.2 1 3
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 3 (longest request)
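The figures in this benchmark output are related by simple arithmetic, which is worth checking when reading such results. The sketch below assumes the output follows the usual ApacheBench definitions (the slides do not name the tool): the "across all concurrent requests" time is the inverse of the throughput, and the ratio of the two "time per request" values gives the number of concurrent connections.

```python
requests_per_second = 14010.80
mean_time_per_request_ms = 0.571

# Mean time across all concurrent requests is the inverse of throughput.
across_all_ms = 1000.0 / requests_per_second

# Requests in flight = throughput x time per request (Little's law).
implied_concurrency = requests_per_second * mean_time_per_request_ms / 1000.0

print('across all concurrent requests: %.3f ms' % across_all_ms)
print('implied concurrency: %.1f' % implied_concurrency)
```

Under that assumption the run used roughly eight concurrent connections, each seeing a mean latency of about 0.57 ms.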
Custom events
divolte.signal('addToBasket', {
productId: 309125,
count: 1
})
In the page (JavaScript)
map eventParameter('productId') onto 'basketProductId'
map eventParameter('count') onto 'basketNumProducts'
In the mapping (Groovy)
Avro data, use any tool
Divolte Collector
•http://divolte.io
•Apache License, Version 2.0
Examples
Ad hoc
Batch
Online
Example
Example
Approach
1. Pick n images randomly
2. Optimise displayed image using bandit optimisation
3. After X iterations:
•Pick n / 2 new images randomly
•Select n / 2 images from existing set using learned
distribution
•Construct new set of images using half of existing
set and newly selected random images
4. Go to step 2
Bayesian Bandits
•For each image, keep track of:
•Number of impressions
•Number of clicks
•When serving an image:
•Draw a random number from a Beta
distribution with parameters alpha = # of clicks,
beta = # of impressions, for each image
•Show image where sample value is largest
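The serving rule above can be sketched with the standard library alone. The `choose_image` function and the example counts are illustrative assumptions. The Beta parameters follow the slide (alpha = clicks, beta = impressions); a textbook Beta posterior for a click-through rate would instead use the number of impressions without a click for beta, so treat this as a sketch of the talk's variant rather than a reference implementation.

```python
import random


def choose_image(stats, rng=random):
    """Pick the image with the largest Beta sample.

    stats maps image ID -> (clicks, impressions); both must be > 0,
    because a Beta distribution needs strictly positive parameters.
    """
    samples = {
        image_id: rng.betavariate(clicks, impressions)
        for image_id, (clicks, impressions) in stats.items()
    }
    return max(samples, key=samples.get)


# Made-up counts: 'b' has been clicked far more often than 'a'.
stats = {'a': (1, 1000), 'b': (900, 1000)}
winner = choose_image(stats, random.Random(42))
```

Because each request draws fresh samples, images with few impressions still win occasionally, which is how the bandit keeps exploring while mostly showing the best-performing image.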
Bayesian Bandits
•https://en.wikipedia.org/wiki/Multi-armed_bandit
•http://tdunning.blogspot.nl/2012/02/bayesian-bandits.html
•https://www.chrisstucchio.com/blog/2013/bayesian_bandit.html
Prototype UI
from tornado.gen import coroutine


class HomepageHandler(ShopHandler):
    @coroutine
    def get(self):
        # Hard-coded ID for a pretty flower.
        # Later this ID will be decided by the bandit optimization.
        winner = '15442023790'

        # Grab the item details from our catalog service.
        top_item = yield self._get_json('catalog/item/%s' % winner)

        # Render the homepage.
        self.render(
            'index.html',
            top_item=top_item)
Prototype UI
<div class="col-md-6">
<h4>Top pick:</h4>
<p>
<!-- Link to the product page with a source identifier for tracking -->
<a href="/product/{{ top_item['id'] }}/#/?source=top_pick">
<img class="img-responsive img-rounded" src="{{ top_item['variants']['Medium']['img_source'] }}">
<!-- Signal that we served an impression of this image -->
<script>divolte.signal('impression', { source: 'top_pick', productId: '{{ top_item['id'] }}'})</script>
</a>
</p>
<p>
Photo by {{ top_item['owner']['real_name'] or top_item['owner']['user_name']}}
</p>
</div>
Data collection in Divolte Collector
{
"name": "source",
"type": ["null", "string"],
"default": null
}
def locationUri = parse location() to uri
when eventType().equalTo('pageView') apply {
    def fragmentUri = parse locationUri.rawFragment() to uri
    map fragmentUri.query().value('source') onto 'source'
}
when eventType().equalTo('impression') apply {
    map eventParameters().value('productId') onto 'productId'
    map eventParameters().value('source') onto 'source'
}
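The page view rule relies on the source identifier living in the URL fragment (for example `/product/15442023790/#/?source=top_pick`), which is itself parsed as a URI. The sketch below reproduces that two-step parse in Python to show which value ends up in the `source` field. The `source_from_location` helper and the example host name are assumptions for the illustration.

```python
from urllib.parse import urlsplit, parse_qs


def source_from_location(location):
    """Extract the 'source' query parameter from the URL fragment."""
    fragment = urlsplit(location).fragment   # e.g. '/?source=top_pick'
    query = urlsplit(fragment).query         # e.g. 'source=top_pick'
    values = parse_qs(query).get('source')
    return values[0] if values else None


clicked = source_from_location(
    'http://shop.example.com/product/15442023790/#/?source=top_pick')
direct = source_from_location(
    'http://shop.example.com/product/15442023790/')
```

A page view without the fragment yields no source, which matches the nullable `source` field with a `null` default in the schema fragment above.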
Keep counts in Redis
{
'c|14502147379': '2',
'c|15106342717': '2',
'c|15624953471': '1',
'c|9609633287': '1',
'i|14502147379': '2',
'i|15106342717': '3',
'i|15624953471': '2',
'i|9609633287': '3'
}
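The hash above appears to hold one click counter (`c|` prefix) and one impression counter (`i|` prefix) per item ID; that reading is an inference from the key prefixes used in the consumer code. A small sketch of regrouping it per item, with the `counts_by_item` helper being a name introduced only for this example:

```python
def counts_by_item(item_hash):
    """Regroup {'c|id': n, 'i|id': m} into {id: (clicks, impressions)}."""
    counts = {}
    for key, value in item_hash.items():
        kind, item_id = key.split('|', 1)
        clicks, impressions = counts.get(item_id, (0, 0))
        if kind == 'c':
            clicks = int(value)
        elif kind == 'i':
            impressions = int(value)
        counts[item_id] = (clicks, impressions)
    return counts


# The model state as shown on the slide.
item_hash = {
    'c|14502147379': '2',
    'c|15106342717': '2',
    'c|15624953471': '1',
    'c|9609633287': '1',
    'i|14502147379': '2',
    'i|15106342717': '3',
    'i|15624953471': '2',
    'i|9609633287': '3',
}
counts = counts_by_item(item_hash)
```

Keeping both counters in a single Redis hash means the whole model state can be read with one `HGETALL` and replaced in one transaction.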
Consuming Kafka in Python
import avro.io
import avro.schema
from kafka import KafkaConsumer


def start_consumer(args):
    # Load the Avro schema used for serialization.
    schema = avro.schema.Parse(open(args.schema).read())

    # Create a Kafka consumer and Avro reader. Note that
    # it is trivially possible to create a multi-process
    # consumer.
    consumer = KafkaConsumer(args.topic,
                             client_id=args.client,
                             group_id=args.group,
                             metadata_broker_list=args.brokers)
    reader = avro.io.DatumReader(schema)

    # Consume messages.
    for message in consumer:
        handle_event(message, reader)
Consuming Kafka in Python
import io


def handle_event(message, reader):
    # Decode Avro bytes into a Python dictionary.
    message_bytes = io.BytesIO(message.value)
    decoder = avro.io.BinaryDecoder(message_bytes)
    event = reader.read(decoder)

    # Event logic.
    if 'top_pick' == event['source'] and 'pageView' == event['eventType']:
        # Register a click.
        redis_client.hincrby(
            ITEM_HASH_KEY,
            CLICK_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
    elif 'top_pick' == event['source'] and 'impression' == event['eventType']:
        # Register an impression and increment experiment count.
        p = redis_client.pipeline()
        p.incr(EXPERIMENT_COUNT_KEY)
        p.hincrby(
            ITEM_HASH_KEY,
            IMPRESSION_KEY_PREFIX + ascii_bytes(event['productId']),
            1)
        experiment_count, ignored = p.execute()

        if experiment_count == REFRESH_INTERVAL:
            refresh_items()
def refresh_items():
    # Fetch current model state. We convert everything to str.
    current_item_dict = redis_client.hgetall(ITEM_HASH_KEY)
    current_items = numpy.unique([k[2:] for k in current_item_dict.keys()])

    # Fetch random items from Elasticsearch. Note we fetch more than we need,
    # but we filter out items already present in the current set and truncate
    # the list to the desired size afterwards.
    random_items = [
        ascii_bytes(item)
        for item in random_item_set(NUM_ITEMS + NUM_ITEMS - len(current_items) // 2)
        if item not in current_items][:NUM_ITEMS - len(current_items) // 2]

    # Draw random samples.
    samples = [
        numpy.random.beta(
            int(current_item_dict[CLICK_KEY_PREFIX + item]),
            int(current_item_dict[IMPRESSION_KEY_PREFIX + item]))
        for item in current_items]

    # Select top half by sample values. current_items is conveniently
    # a Numpy array here.
    survivors = current_items[numpy.argsort(samples)[len(current_items) // 2:]]

    # New item set is survivors plus the random ones.
    new_items = numpy.concatenate([survivors, random_items])

    # Update model state to reflect new item set. This operation is atomic
    # in Redis.
    p = redis_client.pipeline(transaction=True)
    p.set(EXPERIMENT_COUNT_KEY, 1)
    p.delete(ITEM_HASH_KEY)
    for item in new_items:
        p.hincrby(ITEM_HASH_KEY, CLICK_KEY_PREFIX + item, 1)
        p.hincrby(ITEM_HASH_KEY, IMPRESSION_KEY_PREFIX + item, 1)
    p.execute()
Serving a recommendation
import numpy
from tornado import gen, web


class BanditHandler(web.RequestHandler):
    redis_client = None

    def initialize(self, redis_client):
        self.redis_client = redis_client

    @gen.coroutine
    def get(self):
        # Fetch model state.
        item_dict = yield gen.Task(self.redis_client.hgetall, ITEM_HASH_KEY)
        items = numpy.unique([k[2:] for k in item_dict.keys()])

        # Draw random samples.
        samples = [
            numpy.random.beta(
                int(item_dict[CLICK_KEY_PREFIX + item]),
                int(item_dict[IMPRESSION_KEY_PREFIX + item]))
            for item in items]

        # Select item with largest sample value.
        winner = items[numpy.argmax(samples)]
        self.write(winner)
Integrate
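from tornado.escape import json_decode
from tornado.gen import coroutine
from tornado.httpclient import AsyncHTTPClient, HTTPRequest


class HomepageHandler(ShopHandler):
    @coroutine
    def get(self):
        # Ask the bandit service which item to show.
        http = AsyncHTTPClient()
        request = HTTPRequest(url='http://localhost:8989/item', method='GET')
        response = yield http.fetch(request)
        winner = json_decode(response.body)

        # Grab the item details from our catalog service and render.
        top_item = yield self._get_json('catalog/item/%s' % winner)
        self.render(
            'index.html',
            top_item=top_item)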
Roadmap
Server side - short term
•Allow multiple sources / sink channels
•With different input → schema mappings
•Server side events
•Support for server side event logging (JSON
endpoint)
•Enabler for mobile SDKs
•Trivial to add pixel based end-point (server
managed cookies)
Client side
•Specific browser related bug fixes (IE9)
•Allow for setting session scoped parameters
•JavaScript Data Layer
Collector next steps
•Integrate with Planout (https://facebook.github.io/planout/)
•Allow definition of online experiments in one
place
•All event logging automatically includes random
parameters generated for experiment selection
•Single solution for data collection for online
experimentation / optimization
References
•http://blog.godatadriven.com/rapid-prototyping-online-machine-learning-divolte-collector.html
•http://divolte.io
•https://github.com/divolte/divolte-collector
•https://github.com/divolte/divolte-examples
GoDataDriven
We’re hiring / Questions? / Thank you!
@asnare / @fzk / @godatadriven
signal@godatadriven.com
Andrew Snare / Friso van Vollenhoven

More Related Content

What's hot

Architecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High PerformanceArchitecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High Performance
SamanthaBerlant
 
"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group
"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group
"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group
AWS Chicago
 
Amazon API Gateway
Amazon API GatewayAmazon API Gateway
Amazon API Gateway
Amazon Web Services
 
온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020
온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020
온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020
AWSKRUG - AWS한국사용자모임
 
AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17
Neal Davis
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
 
Introducing PagerDuty Process Automation
Introducing PagerDuty Process AutomationIntroducing PagerDuty Process Automation
Introducing PagerDuty Process Automation
Rundeck
 
Strategic Approach To Data Migration Project Plan
Strategic Approach To Data Migration Project PlanStrategic Approach To Data Migration Project Plan
Strategic Approach To Data Migration Project Plan
SlideTeam
 
B2B Integration in the Cloud
B2B Integration in the CloudB2B Integration in the Cloud
B2B Integration in the Cloud
i8c
 
Cloud Migration, Application Modernization, and Security
Cloud Migration, Application Modernization, and Security Cloud Migration, Application Modernization, and Security
Cloud Migration, Application Modernization, and Security
Tom Laszewski
 
Data Cloud.pptx
Data Cloud.pptxData Cloud.pptx
Data Cloud.pptx
darshanpatil1401
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
Sudhir Tonse
 
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017
AWSKRUG - AWS한국사용자모임
 
Data Migration PowerPoint Presentation Slides
Data Migration PowerPoint Presentation Slides Data Migration PowerPoint Presentation Slides
Data Migration PowerPoint Presentation Slides
SlideTeam
 
Introduction to Amazon Lightsail
Introduction to Amazon Lightsail Introduction to Amazon Lightsail
Introduction to Amazon Lightsail
Amazon Web Services
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and Manufacturing
Cloudera, Inc.
 
Office 365 migration
Office 365 migrationOffice 365 migration
Office 365 migration
Motty Ben Atia
 
AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3) AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3)
zekeLabs Technologies
 
Migrating on premises and cloud contents to SharePoint Online at no cost with...
Migrating on premises and cloud contents to SharePoint Online at no cost with...Migrating on premises and cloud contents to SharePoint Online at no cost with...
Migrating on premises and cloud contents to SharePoint Online at no cost with...
Juan Carlos Gonzalez
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
Amazon Web Services
 

What's hot (20)

Architecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High PerformanceArchitecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High Performance
 
"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group
"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group
"Introduction to FinOps" – Greg VanderWel at Chicago AWS user group
 
Amazon API Gateway
Amazon API GatewayAmazon API Gateway
Amazon API Gateway
 
온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020
온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020
온라인 주문 서비스를 서버리스 아키텍쳐로 구축하기 - 김태우(Classmethod) :: AWS Community Day Online 2020
 
AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Introducing PagerDuty Process Automation
Introducing PagerDuty Process AutomationIntroducing PagerDuty Process Automation
Introducing PagerDuty Process Automation
 
Strategic Approach To Data Migration Project Plan
Strategic Approach To Data Migration Project PlanStrategic Approach To Data Migration Project Plan
Strategic Approach To Data Migration Project Plan
 
B2B Integration in the Cloud
B2B Integration in the CloudB2B Integration in the Cloud
B2B Integration in the Cloud
 
Cloud Migration, Application Modernization, and Security
Cloud Migration, Application Modernization, and Security Cloud Migration, Application Modernization, and Security
Cloud Migration, Application Modernization, and Security
 
Data Cloud.pptx
Data Cloud.pptxData Cloud.pptx
Data Cloud.pptx
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
 
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017
 
Data Migration PowerPoint Presentation Slides
Data Migration PowerPoint Presentation Slides Data Migration PowerPoint Presentation Slides
Data Migration PowerPoint Presentation Slides
 
Introduction to Amazon Lightsail
Introduction to Amazon Lightsail Introduction to Amazon Lightsail
Introduction to Amazon Lightsail
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and Manufacturing
 
Office 365 migration
Office 365 migrationOffice 365 migration
Office 365 migration
 
AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3) AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3)
 
Migrating on premises and cloud contents to SharePoint Online at no cost with...
Migrating on premises and cloud contents to SharePoint Online at no cost with...Migrating on premises and cloud contents to SharePoint Online at no cost with...
Migrating on premises and cloud contents to SharePoint Online at no cost with...
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
 

Similar to Divolte collector overview

Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentation
fvanvollenhoven
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collector
fvanvollenhoven
 
Online machine Learning with Divolte
Online machine Learning with DivolteOnline machine Learning with Divolte
Online machine Learning with Divolte
GoDataDriven
 
Stream processing in Mercari - Devsumi 2015 autumn LT
Stream processing in Mercari - Devsumi 2015 autumn LTStream processing in Mercari - Devsumi 2015 autumn LT
Stream processing in Mercari - Devsumi 2015 autumn LT
Masahiro Nagano
 
Best Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesBest Practices in Handling Performance Issues
Best Practices in Handling Performance Issues
Odoo
 
Using Modern Browser APIs to Improve the Performance of Your Web Applications
Using Modern Browser APIs to Improve the Performance of Your Web ApplicationsUsing Modern Browser APIs to Improve the Performance of Your Web Applications
Using Modern Browser APIs to Improve the Performance of Your Web Applications
Nicholas Jansma
 
Measuring User Experience in the Browser
Measuring User Experience in the BrowserMeasuring User Experience in the Browser
Measuring User Experience in the BrowserAlois Reitbauer
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
Rafał Kuć
 
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & ElasticsearchFrom Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
Sematext Group, Inc.
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
Amazon Web Services
 
Measuring User Experience
Measuring User ExperienceMeasuring User Experience
Measuring User ExperienceAlois Reitbauer
 
MongoDB for Time Series Data: Setting the Stage for Sensor Management
MongoDB for Time Series Data: Setting the Stage for Sensor ManagementMongoDB for Time Series Data: Setting the Stage for Sensor Management
MongoDB for Time Series Data: Setting the Stage for Sensor ManagementMongoDB
 
Orchestrate Event-Driven Infrastructure with SaltStack
Orchestrate Event-Driven Infrastructure with SaltStackOrchestrate Event-Driven Infrastructure with SaltStack
Orchestrate Event-Driven Infrastructure with SaltStack
Love Nyberg
 
20190516 web security-basic
20190516 web security-basic20190516 web security-basic
20190516 web security-basic
MksYi
 
OSDC 2015: Pere Urbon | Scaling Logstash: A Collection of War Stories
OSDC 2015: Pere Urbon | Scaling Logstash: A Collection of War StoriesOSDC 2015: Pere Urbon | Scaling Logstash: A Collection of War Stories
OSDC 2015: Pere Urbon | Scaling Logstash: A Collection of War Stories
NETWAYS
 
Building Faster Websites
Building Faster WebsitesBuilding Faster Websites
Building Faster Websites
Matthew Farina
 
2013.10 Operating * by the Numbers
2013.10 Operating * by the Numbers2013.10 Operating * by the Numbers
2013.10 Operating * by the Numbers
Allison Miller
 
Site activity & performance analysis
Site activity & performance analysisSite activity & performance analysis
Site activity & performance analysisEyal Vardi
 
What should I do when my website got hack?
What should I do when my website got hack?What should I do when my website got hack?
What should I do when my website got hack?
Sumedt Jitpukdebodin
 
Derek Pearcy - Reading Users' Minds For Fun And Profit
Derek Pearcy - Reading Users' Minds For Fun And ProfitDerek Pearcy - Reading Users' Minds For Fun And Profit
Derek Pearcy - Reading Users' Minds For Fun And Profit
bolt peters
 

Similar to Divolte collector overview (20)

Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentation
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collector
 
Online machine Learning with Divolte
Online machine Learning with DivolteOnline machine Learning with Divolte
Online machine Learning with Divolte
 
Stream processing in Mercari - Devsumi 2015 autumn LT
Stream processing in Mercari - Devsumi 2015 autumn LTStream processing in Mercari - Devsumi 2015 autumn LT
Stream processing in Mercari - Devsumi 2015 autumn LT
 
Best Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesBest Practices in Handling Performance Issues
Best Practices in Handling Performance Issues
 
Using Modern Browser APIs to Improve the Performance of Your Web Applications
Using Modern Browser APIs to Improve the Performance of Your Web ApplicationsUsing Modern Browser APIs to Improve the Performance of Your Web Applications
Using Modern Browser APIs to Improve the Performance of Your Web Applications
 
Measuring User Experience in the Browser
Measuring User Experience in the BrowserMeasuring User Experience in the Browser
Measuring User Experience in the Browser
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
 
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & ElasticsearchFrom Zero to Hero - Centralized Logging with Logstash & Elasticsearch





Divolte collector overview

  • 1. GoDataDriven PROUDLY PART OF THE XEBIA GROUP @asnare / @fzk / @godatadriven signal@godatadriven.com Divolte Collector Andrew Snare / Friso van Vollenhoven Because life’s too short for log file parsing
  • 2. 99% of all data in Hadoop
      156.68.7.63 - - [28/Jul/1995:11:53:28 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
      137.244.160.140 - - [28/Jul/1995:11:53:29 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0
      163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324
      163.205.160.5 - - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573
      140.229.50.189 - - [28/Jul/1995:11:53:54 -0400] "GET /shuttle/missions/sts-67/images/images.html HTTP/1.0"
      163.206.89.4 - - [28/Jul/1995:11:54:02 -0400] "GET /shuttle/technology/sts-newsref/sts-mps.html HTTP/1.0" 2
      163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
      163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891
      131.110.53.48 - - [28/Jul/1995:11:54:07 -0400] "GET /shuttle/technology/sts-newsref/stsref-toc.html HTTP/1.
      163.205.160.5 - - [28/Jul/1995:11:54:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
      130.160.196.81 - - [28/Jul/1995:11:54:15 -0400] "GET /shuttle/resources/orbiters/challenger.html HTTP/1.0"
      131.110.53.48 - - [28/Jul/1995:11:54:16 -0400] "GET /images/shuttle-patch-small.gif HTTP/1.0" 200 4179
      137.244.160.140 - - [28/Jul/1995:11:54:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0
      131.110.53.48 - - [28/Jul/1995:11:54:18 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
      131.110.53.48 - - [28/Jul/1995:11:54:19 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713
      130.160.196.81 - - [28/Jul/1995:11:54:19 -0400] "GET /shuttle/resources/orbiters/challenger-logo.gif HTTP/1
      163.205.160.5 - - [28/Jul/1995:11:54:25 -0400] "GET /shuttle/missions/sts-70/images/images.html HTTP/1.0" 2
      130.181.4.158 - - [28/Jul/1995:11:54:26 -0400] "GET /history/rocket-history.txt HTTP/1.0" 200 26990
      137.244.160.140 - - [28/Jul/1995:11:54:30 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 304 0
      137.244.160.140 - - [28/Jul/1995:11:54:31 -0400] "GET /images/launch-logo.gif HTTP/1.0" 304 0
      137.244.160.140 - - [28/Jul/1995:11:54:38 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 304
      168.178.17.149 - - [28/Jul/1995:11:54:48 -0400] "GET /shuttle/missions/sts-65/mission-sts-65.html HTTP/1.0"
      140.229.50.189 - - [28/Jul/1995:11:54:53 -0400] "GET /shuttle/missions/sts-67/images/KSC-95EC-0390.jpg HTTP
      131.110.53.48 - - [28/Jul/1995:11:54:58 -0400] "GET /shuttle/missions/missions.html HTTP/1.0" 200 8677
      131.110.53.48 - - [28/Jul/1995:11:55:02 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853
      131.110.53.48 - - [28/Jul/1995:11:55:05 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
      128.159.111.141 - - [28/Jul/1995:11:55:09 -0400] "GET /procurement/procurement.html HTTP/1.0" 200 3499
      128.159.111.141 - - [28/Jul/1995:11:55:10 -0400] "GET /images/op-logo-small.gif HTTP/1.0" 200 14915
      128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
      128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
  • 3. How do we use our data?
      • Ad hoc
      • Batch
      • Streaming
  • 4. Typical web optimization architecture (diagram; labels: user; HTTP request: /org/apache/hadoop/io/IOUtils.html; log event: 2012-07-01T06:00:02.500Z /org/apache/hadoop/io/IOUtils.html; log transport service; transport logs to compute cluster; offline analytics / model training; batch update model state; streaming log processing; streaming update model state; serve model result, e.g. recommendations)
  • 5. Parse HTTP server logs access.log
  • 6. How did it get there? Option 1: parse HTTP server logs
      • Ship log files on a schedule
      • Parse using MapReduce jobs
      • Batch analytics jobs feed online systems
  • 7. HTTP server log parsing
      • Inherently batch oriented
      • Schema-less (URL format is the schema)
      • Initial job to parse logs into structured format
      • Usually multiple versions of parsers required
      • Requires sessionizing
      • Logs usually have more than you ask for (bots, image requests, spiders, health checks, etc.)
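The "initial job to parse logs into structured format" can be pictured with a minimal sketch in Python. It assumes the logs are in Common Log Format, as in the NASA sample on slide 2; the pattern and the `parse_clf_line` helper are illustrative names, not part of the talk.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 28/Jul/1995:11:53:31 -0400.
    record['time'] = datetime.strptime(record['time'], '%d/%b/%Y:%H:%M:%S %z')
    record['status'] = int(record['status'])
    # A size of "-" means no body was sent.
    record['size'] = 0 if record['size'] == '-' else int(record['size'])
    return record


line = ('163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] '
        '"GET /shuttle/countdown/ HTTP/1.0" 200 4324')
record = parse_clf_line(line)
```

Even this small parser has to decide what to do with truncated or malformed lines (here they yield `None`), which is one reason several parser versions tend to accumulate.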
  • 8. Stream HTTP server logs (diagram; labels: access.log; tail -F; events; message queue or event transport (Kafka, Flume, etc.); events; other consumers)
  • 9. How did it get there? Option 2: stream HTTP server logs
      • tail -F logfiles
      • Use a queue for transport (e.g. Flume or Kafka)
      • Parse logs on the fly
      • Or write semi-schema’d logs, like JSON
      • Parse again for batch workload
  • 10. Stream HTTP server logs
      • Allows for near real-time event handling when consuming from queues
      • Sessionizing? Duplicates? Bots?
      • Still requires parser logic
      • No schema
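To make the sessionizing question concrete, here is a rough Python sketch of what has to be rebuilt from raw logs. It assumes a visitor is identified by host alone and that 30 minutes of inactivity ends a session; both are assumptions for illustration, not rules from the talk.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def sessionize(events):
    """Assign a per-host session index to (host, timestamp) events.

    A new session starts when a host has been silent for longer than
    SESSION_TIMEOUT. Returns a list of (host, timestamp, session_index).
    """
    last_seen = {}
    session_index = {}
    result = []
    for host, ts in sorted(events, key=lambda event: event[1]):
        previous = last_seen.get(host)
        if previous is None:
            session_index[host] = 0
        elif ts - previous > SESSION_TIMEOUT:
            session_index[host] += 1
        last_seen[host] = ts
        result.append((host, ts, session_index[host]))
    return result


start = datetime(1995, 7, 28, 11, 53, 31)
events = [
    ('163.205.160.5', start),
    ('163.205.160.5', start + timedelta(minutes=10)),
    ('163.205.160.5', start + timedelta(minutes=50)),
    ('131.110.53.48', start + timedelta(minutes=5)),
]
sessions = sessionize(events)
```

Identifying visitors by host is fragile (proxies, shared addresses), which is part of the argument for the cookie-based approach on the tagging slides.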
  • 11. Tagging (diagram; labels: index.html; script.js; web server; access.log; web page traffic; tracking traffic (asynchronous); tracking server; structured events; message queue or event transport (Kafka, Flume, etc.); events; other consumers)
  • 12. How did it get there? Option 3: tagging
      • Instrument pages with a special ‘tag’, i.e. a special JavaScript or image used just for logging the request
      • Create a special endpoint that handles the tag request in a structured way
      • Tag endpoint handles logging the events
  • 13. Tagging
      • Not a new idea (Google Analytics, Omniture, etc.)
      • Less garbage traffic, because a browser is required to evaluate the tag
      • Event logging is asynchronous
      • Easier to do in-flight processing (apply a schema, add enrichments, etc.)
      • Allows for custom events (other than page view)
  • 14. Also…
      • Manage session through cookies on the client side
      • Incoming data is already sessionized
      • Extract additional information from clients
          • Screen resolution
          • Viewport size
          • Timezone
  • 16. Divolte Collector Click stream data collection for Hadoop and Kafka.
  • 17. Divolte Collector (diagram; labels: index.html; script.js; web server; access.log; web page traffic; tracking traffic (asynchronous); tracking server; structured events; message queue or event transport (Kafka, Flume, etc.); events; other consumers)
  • 18. Divolte Collector: Vision
      • Focus purely on collection
      • Processing is a separate concern
      • Minimal on-the-fly enrichment
      • The Hadoop tools ecosystem evolves too fast to compete (SQL solutions, streaming, machine learning, etc.)
      • Just provide data
      • Data source for custom data science solutions
      • Not a web analytics solution per se; descriptive web analytics is a side effect
      • Use cases will vary; try not to make too many assumptions about users’ needs
  • 19. Divolte Collector: Vision
      • Solve the web-specific tricky parts
      • ID generation on client side (JavaScript)
      • In-stream duplicate detection
      • Schema!
      • Data will be written in a schema-evolution-friendly open format (Apache Avro)
      • No arbitrary (JSON) objects
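One way to picture in-stream duplicate detection is a bounded memory of recently seen event IDs. The Python sketch below is an illustration under that assumption only; the slides do not describe how Divolte Collector implements it, and `RecentEventFilter` is a hypothetical name.

```python
from collections import OrderedDict


class RecentEventFilter:
    """Flags events whose ID is among the last `capacity` distinct IDs seen."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, event_id):
        if event_id in self._seen:
            # Refresh the ID so it is evicted later.
            self._seen.move_to_end(event_id)
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            # Evict the least recently seen ID to keep memory bounded.
            self._seen.popitem(last=False)
        return False


event_filter = RecentEventFilter(capacity=2)
results = [event_filter.is_duplicate(i) for i in ['a', 'a', 'b', 'c', 'a']]
```

Because memory is bounded, a duplicate that arrives after its ID has been evicted is not caught (the last `'a'` above), so this kind of filter reduces duplicates rather than guaranteeing their absence. It also relies on each event carrying a unique ID, which is what client-side ID generation provides.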
  • 20. JavaScript-based tag
      <body>
        <!-- Your page content here. -->
        <!-- Include Divolte Collector just before the closing body tag -->
        <script src="//example.com/divolte.js" defer async></script>
      </body>
  • 22. Data with a schema in Avro
      {
        "namespace": "com.example.record",
        "type": "record",
        "name": "MyEventRecord",
        "fields": [
          { "name": "location", "type": "string" },
          { "name": "pageType", "type": "string" },
          { "name": "timestamp", "type": "long" }
        ]
      }
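To show what "data with a schema" buys downstream code, here is a small standard-library Python sketch that checks a record against the schema above. It is deliberately not real Avro validation: it only understands the `string` and `long` primitive types used on the slide, and the `conforms` helper is a hypothetical name.

```python
import json

SCHEMA = json.loads("""
{
  "namespace": "com.example.record",
  "type": "record",
  "name": "MyEventRecord",
  "fields": [
    { "name": "location", "type": "string" },
    { "name": "pageType", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}
""")

# Assumption: only the two primitive types from the slide are handled.
PRIMITIVES = {'string': str, 'long': int}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    fields = schema['fields']
    if set(record) != {field['name'] for field in fields}:
        return False
    return all(
        isinstance(record[field['name']], PRIMITIVES[field['type']])
        for field in fields
    )


good = {'location': 'http://example.com/checkout', 'pageType': 'checkout',
        'timestamp': 807033211000}
bad = {'location': 'http://example.com/', 'pageType': 'normal',
       'timestamp': 'yesterday'}
```

In practice an Avro library does this work (plus unions, defaults and schema evolution); the point is that consumers can rely on field names and types instead of re-parsing strings.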
  • 23. Map incoming data onto Avro records
      mapping {
        map clientTimestamp() onto 'timestamp'
        map location() onto 'location'

        def u = parse location() to uri
        section {
          when u.path().equalTo('/checkout') apply {
            map 'checkout' onto 'pageType'
            exit()
          }
          map 'normal' onto 'pageType'
        }
      }
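For readers unfamiliar with the mapping DSL, the same logic can be paraphrased in plain Python. This is only a reading aid: `map_event` is a hypothetical function, and Divolte Collector evaluates the Groovy-based mapping itself rather than anything like this.

```python
from urllib.parse import urlparse


def map_event(location, client_timestamp):
    """Build a record with the three fields of the example Avro schema."""
    record = {'timestamp': client_timestamp, 'location': location}
    path = urlparse(location).path
    # Mirrors the `section` block: the checkout rule wins and exits early,
    # otherwise the page type falls through to 'normal'.
    if path == '/checkout':
        record['pageType'] = 'checkout'
        return record
    record['pageType'] = 'normal'
    return record


checkout = map_event('http://example.com/checkout?step=1', 807033211000)
other = map_event('http://example.com/shuttle/countdown/', 807033212000)
```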
  • 24. User agent parsing
      map userAgent().family() onto 'browserName'
      map userAgent().osFamily() onto 'operatingSystemName'
      map userAgent().osVersion() onto 'operatingSystemVersion'
      // Etc... More fields available
  • 26. Useful performance
      Requests per second:    14010.80 [#/sec] (mean)
      Time per request:       0.571 [ms] (mean)
      Time per request:       0.071 [ms] (mean, across all concurrent requests)
      Transfer rate:          4516.55 [Kbytes/sec] received

      Connection Times (ms)
                    min  mean[+/-sd] median   max
      Connect:        0    0   0.1      0       1
      Processing:     0    0   0.2      0       3
      Waiting:        0    0   0.2      0       3
      Total:          0    1   0.2      1       3

      Percentage of the requests served within a certain time (ms)
        50%      1
        66%      1
        75%      1
        80%      1
        90%      1
        95%      1
        98%      1
        99%      1
       100%      3 (longest request)
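The figures in this benchmark output are related by simple arithmetic, worked through in Python below. The concurrency level is not shown on the slide; the value computed here is inferred from the two "Time per request" lines, assuming the output comes from ApacheBench (ab), where the first line equals the second multiplied by the concurrency.

```python
requests_per_second = 14010.80
time_per_request_ms = 0.571  # mean, per request as seen by one client

# Mean time per request across all concurrent requests is the inverse
# of the throughput, expressed in milliseconds.
mean_across_concurrent_ms = 1000.0 / requests_per_second

# Inferred, not stated on the slide: how many requests were in flight at once.
implied_concurrency = time_per_request_ms / mean_across_concurrent_ms

print(round(mean_across_concurrent_ms, 3))  # 0.071
print(round(implied_concurrency))           # 8
```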
  • 27. Custom events
      In the page (JavaScript):
      divolte.signal('addToBasket', { productId: 309125, count: 1 })

      In the mapping (Groovy):
      map eventParameter('productId') onto 'basketProductId'
      map eventParameter('count') onto 'basketNumProducts'
  • 28. Avro data, use any tool
  • 32. Batch
  • 36. Approach
      1. Pick n images randomly
      2. Optimise displayed image using bandit optimisation
      3. After X iterations:
          • Pick n / 2 new images randomly
          • Select n / 2 images from existing set using learned distribution
          • Construct new set of images using half of existing set and newly selected random images
      4. Goto 2
  • 37. Bayesian Bandits
      • For each image, keep track of:
          • Number of impressions
          • Number of clicks
      • When serving an image:
          • Draw a random number from a Beta distribution with parameters alpha = # of clicks, beta = # of impressions, for each image
          • Show image where sample value is largest
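The serving rule above can be sketched with nothing but the Python standard library. The sketch follows the slide's parameterisation (alpha = clicks, beta = impressions), which requires both counts to be positive; the textbook Thompson sampling variant would instead use alpha = clicks + 1 and beta = impressions - clicks + 1. The function name and the sample counts are illustrative assumptions.

```python
import random


def pick_image(counts, rng=random):
    """Pick the image with the largest Beta sample.

    counts maps image_id -> (clicks, impressions); both must be > 0.
    """
    samples = {
        image_id: rng.betavariate(clicks, impressions)
        for image_id, (clicks, impressions) in counts.items()
    }
    return max(samples, key=samples.get)


# Hypothetical counts: image 'a' is clicked far more often than image 'b'.
counts = {'a': (90, 100), 'b': (1, 100)}
rng = random.Random(0)
picks = [pick_image(counts, rng) for _ in range(200)]
```

Because each choice is a random draw, poorly performing images are still shown occasionally, which is how the bandit keeps exploring while mostly exploiting the current best image.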
  • 39. Prototype UI
      class HomepageHandler(ShopHandler):
          @coroutine
          def get(self):
              # Hard-coded ID for a pretty flower.
              # Later this ID will be decided by the bandit optimization.
              winner = '15442023790'

              # Grab the item details from our catalog service.
              top_item = yield self._get_json('catalog/item/%s' % winner)

              # Render the homepage
              self.render(
                  'index.html',
                  top_item=top_item)
  • 40. Prototype UI
      <div class="col-md-6">
        <h4>Top pick:</h4>
        <p>
          <!-- Link to the product page with a source identifier for tracking -->
          <a href="/product/{{ top_item['id'] }}/#/?source=top_pick">
            <img class="img-responsive img-rounded" src="{{ top_item['variants']['Medium']['img_source'] }}">
            <!-- Signal that we served an impression of this image -->
            <script>divolte.signal('impression', { source: 'top_pick', productId: '{{ top_item['id'] }}'})</script>
          </a>
        </p>
        <p>
          Photo by {{ top_item['owner']['real_name'] or top_item['owner']['user_name']}}
        </p>
      </div>
  • 41. Data collection in Divolte Collector
      Avro schema field:
      { "name": "source", "type": ["null", "string"], "default": null }

      Mapping:
      def locationUri = parse location() to uri
      when eventType().equalTo('pageView') apply {
        def fragmentUri = parse locationUri.rawFragment() to uri
        map fragmentUri.query().value('source') onto 'source'
      }
      when eventType().equalTo('impression') apply {
        map eventParameters().value('productId') onto 'productId'
        map eventParameters().value('source') onto 'source'
      }
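The page-view branch of this mapping parses the URL fragment as a URI of its own and reads `source` from the fragment's query string. A Python paraphrase, using the link format from the prototype template (`/product/<id>/#/?source=top_pick`), might look as follows; the host name and the `source_from_location` helper are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlparse


def source_from_location(location):
    """Return the `source` value carried in the URL fragment, or None."""
    fragment = urlparse(location).fragment  # e.g. '/?source=top_pick'
    query = urlparse(fragment).query        # e.g. 'source=top_pick'
    values = parse_qs(query).get('source')
    return values[0] if values else None


clicked = source_from_location(
    'http://shop.example/product/15442023790/#/?source=top_pick')
direct = source_from_location('http://shop.example/product/15442023790/')
```

Putting the marker in the fragment keeps it out of the request sent to the web server, while the tag running in the browser can still read it.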
  • 42. Keep counts in Redis
      {
        'c|14502147379': '2',
        'c|15106342717': '2',
        'c|15624953471': '1',
        'c|9609633287': '1',
        'i|14502147379': '2',
        'i|15106342717': '3',
        'i|15624953471': '2',
        'i|9609633287': '3'
      }
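The hash stores click counts under a `c|` prefix and impression counts under an `i|` prefix, one field per item. A small Python sketch of reading that layout back into per-item pairs, using a plain dict in place of a Redis reply (the `counts_by_item` helper is a hypothetical name):

```python
def counts_by_item(item_hash):
    """Turn {'c|<id>': n, 'i|<id>': m} into {id: (clicks, impressions)}."""
    counts = {}
    for key, value in item_hash.items():
        kind, item_id = key.split('|', 1)
        clicks, impressions = counts.get(item_id, (0, 0))
        if kind == 'c':
            clicks = int(value)
        elif kind == 'i':
            impressions = int(value)
        counts[item_id] = (clicks, impressions)
    return counts


item_hash = {
    'c|14502147379': '2', 'c|15106342717': '2',
    'c|15624953471': '1', 'c|9609633287': '1',
    'i|14502147379': '2', 'i|15106342717': '3',
    'i|15624953471': '2', 'i|9609633287': '3',
}
counts = counts_by_item(item_hash)
```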
  • 43. Consuming Kafka in Python
      import avro.io
      import avro.schema
      from kafka import KafkaConsumer

      def start_consumer(args):
          # Load the Avro schema used for serialization.
          schema = avro.schema.Parse(open(args.schema).read())

          # Create a Kafka consumer and Avro reader. Note that
          # it is trivially possible to create a multi process
          # consumer.
          consumer = KafkaConsumer(args.topic,
                                   client_id=args.client,
                                   group_id=args.group,
                                   metadata_broker_list=args.brokers)
          reader = avro.io.DatumReader(schema)

          # Consume messages.
          for message in consumer:
              handle_event(message, reader)
  • 44. Consuming Kafka in Python
      import io

      def handle_event(message, reader):
          # Decode Avro bytes into a Python dictionary.
          message_bytes = io.BytesIO(message.value)
          decoder = avro.io.BinaryDecoder(message_bytes)
          event = reader.read(decoder)

          # Event logic.
          if 'top_pick' == event['source'] and 'pageView' == event['eventType']:
              # Register a click.
              redis_client.hincrby(
                  ITEM_HASH_KEY,
                  CLICK_KEY_PREFIX + ascii_bytes(event['productId']),
                  1)
          elif 'top_pick' == event['source'] and 'impression' == event['eventType']:
              # Register an impression and increment experiment count.
              p = redis_client.pipeline()
              p.incr(EXPERIMENT_COUNT_KEY)
              p.hincrby(
                  ITEM_HASH_KEY,
                  IMPRESSION_KEY_PREFIX + ascii_bytes(event['productId']),
                  1)
              experiment_count, ignored = p.execute()

              if experiment_count == REFRESH_INTERVAL:
                  refresh_items()
  • 45. def refresh_items():
          # Fetch current model state. We convert everything to str.
          current_item_dict = redis_client.hgetall(ITEM_HASH_KEY)
          current_items = numpy.unique([k[2:] for k in current_item_dict.keys()])

          # Fetch random items from ElasticSearch. Note we fetch more than we need,
          # but we filter out items already present in the current set and truncate
          # the list to the desired size afterwards.
          random_items = [
              ascii_bytes(item)
              for item in random_item_set(NUM_ITEMS + NUM_ITEMS - len(current_items) // 2)
              if item not in current_items][:NUM_ITEMS - len(current_items) // 2]

          # Draw random samples.
          samples = [
              numpy.random.beta(
                  int(current_item_dict[CLICK_KEY_PREFIX + item]),
                  int(current_item_dict[IMPRESSION_KEY_PREFIX + item]))
              for item in current_items]

          # Select top half by sample values. current_items is conveniently
          # a Numpy array here.
          survivors = current_items[numpy.argsort(samples)[len(current_items) // 2:]]

          # New item set is survivors plus the random ones.
          new_items = numpy.concatenate([survivors, random_items])

          # Update model state to reflect new item set. This operation is atomic
          # in Redis.
          p = redis_client.pipeline(transaction=True)
          p.set(EXPERIMENT_COUNT_KEY, 1)
          p.delete(ITEM_HASH_KEY)
          for item in new_items:
              p.hincrby(ITEM_HASH_KEY, CLICK_KEY_PREFIX + item, 1)
              p.hincrby(ITEM_HASH_KEY, IMPRESSION_KEY_PREFIX + item, 1)
          p.execute()
  • 46. Serving a recommendation
      import numpy
      from tornado import gen, web

      class BanditHandler(web.RequestHandler):
          redis_client = None

          def initialize(self, redis_client):
              self.redis_client = redis_client

          @gen.coroutine
          def get(self):
              # Fetch model state.
              item_dict = yield gen.Task(self.redis_client.hgetall, ITEM_HASH_KEY)
              items = numpy.unique([k[2:] for k in item_dict.keys()])

              # Draw random samples.
              samples = [
                  numpy.random.beta(
                      int(item_dict[CLICK_KEY_PREFIX + item]),
                      int(item_dict[IMPRESSION_KEY_PREFIX + item]))
                  for item in items]

              # Select item with largest sample value.
              winner = items[numpy.argmax(samples)]
              self.write(winner)
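  • 47. Integrate
      from tornado.escape import json_decode
      from tornado.gen import coroutine
      from tornado.httpclient import AsyncHTTPClient, HTTPRequest

      class HomepageHandler(ShopHandler):
          @coroutine
          def get(self):
              # Ask the bandit service which item to show.
              http = AsyncHTTPClient()
              request = HTTPRequest(url='http://localhost:8989/item', method='GET')
              response = yield http.fetch(request)
              winner = json_decode(response.body)

              # Grab the item details from our catalog service and render.
              top_item = yield self._get_json('catalog/item/%s' % winner)
              self.render(
                  'index.html',
                  top_item=top_item)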
  • 49. Server side - short term
      • Allow multiple sources / sink channels
      • With different input → schema mappings
      • Server side events
      • Support for server side event logging (JSON endpoint)
      • Enabler for mobile SDKs
      • Trivial to add pixel based end-point (server managed cookies)
  • 50. Client side
      • Specific browser related bug fixes (IE9)
      • Allow for setting session scoped parameters
      • JavaScript Data Layer
  • 51. Collector next steps
      • Integrate with Planout (https://facebook.github.io/planout/)
      • Allow definition of online experiments in one place
      • All event logging automatically includes random parameters generated for experiment selection
      • Single solution for data collection for online experimentation / optimization
  • 53. GoDataDriven We’re hiring / Questions? / Thank you! @asnare / @fzk / @godatadriven signal@godatadriven.com Andrew Snare / Friso van Vollenhoven