Hearst Publishing uses Amazon Kinesis and Amazon EMR to process clickstream data from over 200 Hearst properties worldwide in near real-time. Originally, Hearst used Pig on EMR to transform and analyze clickstream data in batch mode, but the 15-minute latency was too slow. Hearst then migrated to Apache Spark Streaming on EMR to process data from Amazon Kinesis in real-time windows of 5 minutes or less, enabling faster insights. This allowed Hearst to power features like Buzzing, which provides instant feedback on article engagement across their properties.
Join us for an Amazon Kinesis tutorial webinar. In this session we will provide a reference architecture and instructions for building a system that performs real-time sliding-window analysis over streaming clickstream data. We will use Amazon Kinesis for managed ingestion of streaming data at scale, with the ability to replay past data, and run the sliding-window computation using Apache Storm. In the webinar, we'll demonstrate how to build and deploy the system on AWS and walk through all the steps, from ingesting and processing to storing and visualizing the data in real time.
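The sliding-window computation at the heart of this session can be sketched in a few lines of plain Python. This is an illustration only, not Apache Storm code; the window length and the event fields (timestamp, URL) are assumptions:

```python
from collections import deque

class SlidingWindowCounter:
    """Count clicks per URL over a sliding time window (illustrative sketch)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, url) pairs in arrival order

    def record(self, timestamp, url):
        self.events.append((timestamp, url))

    def counts(self, now):
        # Evict events older than the window, then count clicks per URL.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        result = {}
        for _, url in self.events:
            result[url] = result.get(url, 0) + 1
        return result

counter = SlidingWindowCounter(window_seconds=60)
counter.record(0, "/home")
counter.record(30, "/article")
counter.record(70, "/article")
print(counter.counts(now=80))  # "/home" at t=0 has aged out of the 60s window
```

A Storm topology would distribute this same eviction-and-count logic across bolts, with Kinesis shards feeding the spouts.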
An overview of Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Kinesis Streams so you can quickly get started with real-time, streaming data.
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv - Amazon Web Services
"Low-latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing analytics on moving data using Amazon Kinesis and EMR/Spark Streaming, and share some best practices and real-world examples."
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- Amazon Aurora, the latest relational database engine: MySQL-compatible and highly available, with up to five times the performance of MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak,
Sr. Manager of Software Development
"Amgen discovers, develops, manufactures, and delivers innovative human therapeutics, helping millions of people in the fight against serious illnesses. In 2014, Amgen implemented a solution to offload ETL data across a diverse data set (U.S. pharmaceutical prescriptions and claims) using Amazon EMR. The solution has transformed the way Amgen delivers insights and reports to its sales force. To support Amgen’s entry into a much larger market, the ETL process had to scale to eight times its existing data volume. We used Amazon EC2, Amazon S3, Amazon EMR, and Amazon Redshift to generate weekly sales reporting metrics.
This session discusses highlights in Amgen's journey to leverage big data technologies and lay the foundation for future growth: benefits of ETL offloading in Amazon EMR as an entry point for big data technologies; benefits and challenges of using Amazon EMR vs. expanding on-premises ETL and reporting technologies; and how to architect an ETL offload solution using Amazon S3, Amazon EMR, and Impala."
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS - Amazon Web Services
"The AWS platform enables large enterprises to use data to solve business problems and uncover opportunities more easily and affordably than ever before. However, to truly take advantage of AWS, enterprises need a way to collect, store, process, analyze, and continually execute on their data.
Datapipe has been an AWS partner for more than five years. In that time, it has developed a proprietary process for the deployment of AWS environments, as well as the processing and evaluation of big data analytics to optimize these environments over time. This flexible solution includes automation tools, continuous monitoring, and cloud analytics. It protects against architectural sprawl and continually redesigns for scalability. This kind of continuous build environment allows Datapipe to examine the AWS environment as a complete picture and ensure the cloud environment is running as efficiently and effectively as possible, ultimately reducing overhead costs for the enterprise.
In this session, Jason Woodlee, Senior Director of Cloud Products at Datapipe, will discuss the technical details of designing and deploying a modern big data architecture on AWS, including application purpose and design, development environment and language overview, DevOps automation best practices, and continuous build and test frameworks. Session sponsored by Datapipe."
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac... - Amazon Web Services
AWS has a large and growing portfolio of big data management and analytics services, designed to be integrated into solution architectures that meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, and we explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics. AWS services to be covered include Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10... - Amazon Web Services
"This presentation will introduce Kinesis, the new AWS service for real-time streaming big data ingestion and processing.
We’ll provide an overview of the key scenarios and business use cases suitable for real-time processing, and discuss how AWS designed Amazon Kinesis to help customers shift from a traditional batch-oriented processing of data to a continual real-time processing model. We’ll provide an overview of the key concepts, attributes, APIs and features of the service, and discuss building a Kinesis-enabled application for real-time processing. We’ll also contrast with other approaches for streaming data ingestion and processing. Finally, we’ll also discuss how Kinesis fits as part of a larger big data infrastructure on AWS, including S3, DynamoDB, EMR, and Redshift."
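One of the key concepts the session covers is how a record reaches a shard: Kinesis hashes the record's partition key into a 128-bit space, and each shard owns a contiguous slice of that space. The sketch below mimics that routing in plain Python; the shard count and partition keys are hypothetical:

```python
import hashlib

# Illustrative sketch of Kinesis partition-key routing: MD5-hash the key
# into a 128-bit number, then map it onto one of NUM_SHARDS equal slices.
NUM_SHARDS = 4
HASH_SPACE = 2 ** 128

def shard_for(partition_key: str) -> int:
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * NUM_SHARDS // HASH_SPACE  # index of the shard owning this hash

for key in ("user-17", "user-42", "user-99"):
    print(key, "-> shard", shard_for(key))
```

The same partition key always lands on the same shard, which is what preserves per-key ordering and lets you scale throughput by adding shards.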
AWS Webcast - Managing Big Data in the AWS Cloud_20140924 - Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315) - Amazon Web Services
During this session, Greg Brandt and Liyin Tang, data infrastructure engineers from Airbnb, will discuss the design and architecture of Airbnb's streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb's data warehouse using a system called SpinalTap. We will also discuss how we leverage Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
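The "cleanly time-partitioned Hive tables" mentioned above come down to bucketing each record by its event time. A minimal sketch of that step, assuming a Hive-style `dt=`/`hr=` partition layout (the table name, field names, and layout are illustrative, not Airbnb's actual code):

```python
from datetime import datetime, timezone

def partition_path(table: str, event_ts: float) -> str:
    """Derive a Hive-style partition path from an epoch timestamp (UTC)."""
    dt = datetime.fromtimestamp(event_ts, tz=timezone.utc)
    return f"{table}/dt={dt:%Y-%m-%d}/hr={dt:%H}"

# A streaming ETL job would route each change event into its partition:
print(partition_path("bookings", 1460000000))
```

Partitioning on event time (rather than arrival time) keeps late-arriving records in the correct bucket, at the cost of occasionally rewriting a closed partition.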
Day 4 - Big Data on AWS - Redshift, EMR & the Internet of Things - Amazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce, a managed Hadoop framework, and Amazon Redshift, a fully managed, petabyte-scale data warehouse solution.
- Learn about real time data processing with Amazon Kinesis.
The AWS cloud computing platform has disrupted big data. Managing big data applications used to be only for well-funded research organizations and large corporations, but not any longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, to learn how our customers are using big data services in the AWS cloud to innovate faster than ever before. Not only is AWS technology available to everyone, it is self-service and on-demand, with innovative technology and flexible pricing models at low cost and with no commitments. Learn from customer success stories as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion of the tutorials, public datasets, test drives, and our grants program: all of the resources needed to get you started quickly.
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
Two years ago, if someone had claimed they could stand up a petabyte-scale data warehouse in under an hour and then have a non-technical business user querying it live 30 minutes later without knowing any SQL or coding language, they would have been laughed out of the room. These days, that's called taking advantage of disruptive technology. Amazon Web Services and Tableau Software have shifted the entire paradigm by which organizations not only store and access their data, but ultimately how they innovate with it. The fast, scalable, and inexpensive services that AWS provides for housing data, combined with Tableau's flexible and user-friendly visual analytics solution, mean that within hours an organization can securely put the power of its massive data assets into the hands of its domain experts, without expensive overhead or lengthy ramp-up time. Attend this webinar to learn how Amazon Web Services and Tableau Software are leveraged together every day to:
- Empower visual ad-hoc data discovery against big data
- Revolutionize corporate reporting and dashboards
- Promote data-driven decision making at every level
The presentation will include:
- A live demonstration of AWS and Tableau working together
- A real customer case study focused on fraud detection and online video metrics
- Live Q&A and an opportunity to trial both solutions
The AWS Workshop Series Online is a series of live webinars designed for IT professionals who are looking to leverage the AWS Cloud to build and transform their business, who are new to the AWS Cloud, or who are looking to further expand their skills and expertise. In this series, we will cover 'Modern Data Architectures for Business Insights at Scale'.
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent... - Amazon Web Services
Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3, and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop), and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability, at the right cost.
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe... - Amazon Web Services
Originally, Hadoop was used as a batch analytics tool; however, this is rapidly changing as applications move towards real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size — from university students to Fortune 50 companies. We recently launched Amazon Kinesis, a managed service for real-time processing of high-volume streaming data. Amazon Kinesis enables a new class of big data applications that can continuously analyze data at any volume and throughput, in real time. Adi will discuss each service, dive into how customers are adopting the services for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution that can ingest and process terabytes of data per hour from hundreds of thousands of concurrent sources. Forever change how you process website clickstreams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift - Amazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
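As a rough illustration of the schema-design and loading practices the session covers, the snippet below holds example Redshift SQL: a table with a distribution key and sort key, and a parallel COPY load from S3. The table, column, bucket, and IAM role names are all hypothetical:

```python
# Example Redshift DDL and load command (names are illustrative only).
ddl = """
CREATE TABLE clicks (
    event_time TIMESTAMP,
    user_id    BIGINT,
    url        VARCHAR(2048)
)
DISTKEY (user_id)        -- co-locate each user's rows on one slice for joins
SORTKEY (event_time);    -- range-restricted scans can skip cold blocks
"""

copy_cmd = """
COPY clicks
FROM 's3://example-bucket/clicks/2016/04/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP DELIMITER '\\t';
"""

print(ddl.strip())
print(copy_cmd.strip())
```

COPY loads from S3 in parallel across the cluster's slices, which is why it is preferred over row-at-a-time INSERTs for bulk loading.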
AWS APAC Webinar Week - Big Data on AWS. Redshift, EMR, & IoT - Amazon Web Services
The world is producing an ever-increasing volume, velocity, and variety of data, including data from devices. As we step into the era of the Internet of Things (IoT), batch analytics is no longer enough for many consumers; they need sub-second analysis of fast-moving data. AWS delivers many technologies for solving big data and IoT problems. But which services should you use, why, when, and how? In this webinar, we simplify big data processing as a pipeline comprising various stages: ingest, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems.
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM... - Amazon Web Services
Amazon QuickSight is a fast BI service that makes it easy for you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight is built to harness the power and scalability of the cloud, so you can easily run analysis on large datasets and support hundreds of thousands of users. In this session, we'll demonstrate how you can easily get started with Amazon QuickSight: uploading files, connecting to S3 and Redshift, and creating analyses from visualizations that are optimized based on the underlying data. Once we've built our analysis and dashboard, we'll show you how easy it is to share it with colleagues and stakeholders in just a few seconds. And with SPICE, QuickSight's in-memory calculation engine, you can go from data to insights faster than ever.
Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtration and transformation. Finally, we'll address tips and techniques for emitting data into S3, DynamoDB, and Redshift.
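The filtration-and-transformation step described above can be sketched in plain Python. This is an illustration of the pattern, not the Kinesis Client Library itself, and the event field names are assumptions:

```python
import json

def transform(raw_records):
    """Filter a batch of raw JSON records and project the fields we keep."""
    out = []
    for raw in raw_records:
        event = json.loads(raw)
        if event.get("event_type") != "click":   # filtration: drop non-clicks
            continue
        out.append({                             # transformation: project fields
            "user": event["user_id"],
            "url": event["url"].lower(),
        })
    return out

batch = [
    json.dumps({"event_type": "click", "user_id": 1, "url": "/Home"}),
    json.dumps({"event_type": "heartbeat"}),
]
print(transform(batch))
```

In a real consumer this function would sit inside the record-processing loop, with the transformed batch buffered and flushed to S3, DynamoDB, or Redshift.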
Structured, Unstructured, and Streaming Big Data on AWS - Amazon Web Services
It has never been easier or more affordable to use AWS to solve business problems and uncover new opportunities with data. Now, businesses of all sizes and across all industries can take advantage of big data technologies and easily collect, store, process, analyze, and share their data. Gain a thorough understanding of what AWS offers across the big data lifecycle and learn architectural best practices for applying these technologies to your projects. We will also deep dive into how to use AWS services such as Kinesis, DynamoDB, Redshift, and QuickSight to optimize logging, build real-time applications, and analyze and visualize data at any scale.
Clickstream Data Warehouse - Turning clicks into customers - Albert Hui
As the web becomes a main channel for reaching customers and prospects, clickstream data generated by websites has become another important enterprise data source, alongside traditional business sources such as store transactions, CRM data, and call center logs. As simple as recording every click a customer makes may sound, clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It is a data source that has been underutilized. However, the benefits come with a problem: Amazon records 5 billion clicks a day, and the US as a whole generates 400 billion clicks, equivalent to 3.4 petabytes, a day. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source.
This presentation will use big data technology to help solve this big data problem; the presenter will cover clickstream data end to end: benefits, challenges, and the solution. The solution will include a proposed data architecture, ETL, and various machine learning algorithms. A real-world success story will also be presented to help the audience better grasp the concept and its applications. Sample code and a demo will be shown so the audience can apply them in their respective areas.
Deep Dive and Best Practices for Real Time Streaming ApplicationsAmazon Web Services
Get answers to technical questions frequently asked by those starting to work with streaming data. Learn best practices for building a real-time streaming data architecture on AWS with Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. First, we will focus on building a scalable, durable streaming data ingestion workflow from data producers like mobile devices, servers, or even web browsers. We will provide guidelines to minimize duplicates and achieve exactly-once processing semantics in your stream-processing applications. Then, we will show some of the proven architectures for processing streaming data using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming running on Amazon EMR.
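The duplicate-minimizing guideline above rests on one idea: Kinesis delivers records at least once, so a consumer that tracks producer-assigned record IDs can skip replays and approximate exactly-once processing. A minimal sketch, with illustrative IDs and an in-memory seen-set standing in for a durable store:

```python
class DedupingConsumer:
    """Skip replayed records by remembering producer-assigned IDs (sketch)."""

    def __init__(self):
        self.seen = set()     # in production this would be a durable store
        self.processed = []

    def handle(self, record_id, payload):
        if record_id in self.seen:
            return False      # replayed duplicate: skip
        self.seen.add(record_id)
        self.processed.append(payload)
        return True

consumer = DedupingConsumer()
consumer.handle("evt-1", "click:/home")
consumer.handle("evt-1", "click:/home")  # redelivery after a producer retry
consumer.handle("evt-2", "click:/article")
print(consumer.processed)
```

The design choice here is idempotency on the consumer side: since retries and shard-reader restarts can redeliver records, deduplicating on a stable ID is cheaper than trying to prevent duplicates at the source.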
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Amazon Web Services
AWS has a large and growing portfolio of big data management and analytics services, designed to be integrated into solution architectures that meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, and we explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing and Interactive analytics. AWS services to be covered include: Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Amazon Web Services
"This presentation will introduce Kinesis, the new AWS service for real-time streaming big data ingestion and processing.
We’ll provide an overview of the key scenarios and business use cases suitable for real-time processing, and discuss how AWS designed Amazon Kinesis to help customers shift from a traditional batch-oriented processing of data to a continual real-time processing model. We’ll provide an overview of the key concepts, attributes, APIs and features of the service, and discuss building a Kinesis-enabled application for real-time processing. We’ll also contrast with other approaches for streaming data ingestion and processing. Finally, we’ll also discuss how Kinesis fits as part of a larger big data infrastructure on AWS, including S3, DynamoDB, EMR, and Redshift."
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)Amazon Web Services
During this session Greg Brandt and Liyin Tang, Data Infrastructure engineers from Airbnb, will discuss the design and architecture of Airbnb's streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb's data warehouse, using a system called SpinalTap. We will also discuss how we leverage Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce and Amazon Redshift, fully managed petabyte-scale data warehouse solutions.
- Learn about real time data processing with Amazon Kinesis.
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
The AWS cloud computing platform has disrupted big data. Managing big data applications used to be for only well-funded research organizations and large corporations, but not any longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, to learn how our customers are using big data services in the AWS cloud to innovate faster than ever before. Not only is AWS technology available to everyone, but it is self-service, on-demand, and featuring innovative technology and flexible pricing models at low cost with no commitments. Learn from customer success stories, as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion around the tutorials, public datasets, test drives, and our grants program - all of the resources needed to get you started quickly.
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
Two years ago, if someone had claimed they could stand up a petabyte-scale data warehouse in under an hour and then have a non-technical business user querying it live 30 minutes later without knowing any SQL or coding language, they would have been laughed out of the room. These days, that’s called taking advantage of disruptive technology. Amazon Web Services and Tableau Software have shifted the entire paradigm by which organizations not only store and access their data, but ultimately how they innovate with it. The fast, scalable, and inexpensive services that AWS provides for housing data, combined with Tableau’s unbelievably flexible and user-friendly visual analytics solution, mean that within hours an organization can securely put the power of its massive data assets into the hands of its domain experts, without expensive overhead or a lengthy ramp-up time.
Attend this webinar to learn how Amazon Web Services and Tableau Software are leveraged together every day to:
• Empower visual ad-hoc data discovery against big data
• Revolutionize corporate reporting and dashboards
• Promote data-driven decision making at every level
The presentation will include:
• A live demonstration of AWS and Tableau working together
• A real customer case study focused on fraud detection and online video metrics
• Live Q&A and an opportunity to trial both solutions
The AWS Workshop Series Online is a series of live webinars designed for IT professionals who want to leverage the AWS Cloud to build and transform their business, are new to the AWS Cloud, or are looking to further expand their skills and expertise. In this series, we will cover 'Modern Data Architectures for Business Insights at Scale'.
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent... - Amazon Web Services
Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3 and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop) and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability - at the right cost.
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe... - Amazon Web Services
Originally, Hadoop was used as a batch analytics tool; however, this is rapidly changing, as applications move towards real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size — from university students to Fortune 50 companies. We recently launched Amazon Kinesis – a managed service for real-time processing of high volume, streaming data. Amazon Kinesis enables a new class of big data applications which can continuously analyze data at any volume and throughput, in real-time. Adi will discuss each service, dive into how customers are adopting the services for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution which can ingest and process terabytes of data per hour from hundreds of thousands of different concurrent sources. Forever change how you process web site click-streams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift - Amazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
AWS APAC Webinar Week - Big Data on AWS: Redshift, EMR, & IoT - Amazon Web Services
The world is producing an ever-increasing volume, velocity, and variety of data, including data from devices. As we step into the era of the Internet of Things (IoT), for many consumers, batch analytics is no longer enough; they need sub-second analysis of fast-moving data. AWS delivers many technologies for solving big data and IoT problems. But what services should you use, why, when, and how? In this webinar, we simplify big data processing as a pipeline comprising various stages: ingest, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems.
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM... - Amazon Web Services
Amazon QuickSight is a fast BI service that makes it easy for you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight is built to harness the power and scalability of the cloud, so you can easily run analysis on large datasets and support hundreds of thousands of users. In this session, we’ll demonstrate how you can easily get started with Amazon QuickSight: uploading files, connecting to S3 and Redshift, and creating analyses from visualizations that are optimized based on the underlying data. Once we’ve built our analysis and dashboard, we’ll show you how easy it is to share it with colleagues and stakeholders in just a few seconds. And with SPICE, QuickSight’s in-memory calculation engine, you can go from data to insights faster than ever.
Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtration and transformation. Finally, we'll address tips and techniques for emitting data into S3, DynamoDB, and Redshift.
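As a concrete sketch of the filtration and transformation steps mentioned above, here is a minimal, self-contained example. The record layout (a base64-encoded JSON payload under a `data` key) and the field names are illustrative assumptions, not the talk's actual code:

```python
import base64
import json

def transform_records(raw_records):
    """Filter and transform a batch of records, Kinesis-consumer style.

    Each raw record is assumed to carry a base64-encoded JSON payload,
    mirroring how Kinesis client code typically receives data. Records
    without a 'page' field are filtered out; the rest are reshaped for
    downstream storage in S3, DynamoDB, or Redshift.
    """
    out = []
    for rec in raw_records:
        payload = json.loads(base64.b64decode(rec["data"]))
        if "page" not in payload:        # filtration step
            continue
        out.append({                     # transformation step
            "page": payload["page"],
            "user": payload.get("user", "anonymous"),
        })
    return out
```

In a real consumer this function would run inside the record-processing loop of the Kinesis Client Library, with the emitted batch written to the chosen sink.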
Structured, Unstructured and Streaming Big Data on the AWS - Amazon Web Services
It has never been easier or more affordable to use AWS to solve business problems and uncover new opportunities with data. Now, businesses of all sizes and across all industries can take advantage of big data technologies and easily collect, store, process, analyze, and share their data. Gain a thorough understanding of what AWS offers across the big data lifecycle and learn architectural best practices for applying these technologies to your projects. We will also deep dive into how to use AWS services such as Kinesis, DynamoDB, Redshift, and QuickSight to optimize logging, build real-time applications, and analyze and visualize data at any scale.
Clickstream Data Warehouse - Turning clicks into customers - Albert Hui
As the web becomes a main channel for reaching customers and prospects, the clickstream data generated by websites has become another important enterprise data source, alongside traditional business sources such as store transactions, CRM data, and call center logs. As simple as recording every click a customer makes may sound, clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It is a data source that has been under-utilized. However, the benefits come with a problem: Amazon records 5 billion clicks a day, and the US as a whole generates 400 billion clicks, equivalent to 3.4 petabytes a day. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source.
This presentation uses big data technology to help solve that problem; the presenter explains clickstream data in depth, covering its benefits, challenges, and a solution. The end-to-end solution includes a proposed data architecture, ETL, and various machine learning algorithms. A real-world success story is also presented to help the audience better grasp the concept and its applications, along with sample code and a demo that attendees can apply in their respective areas.
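One of the first ETL steps such a clickstream solution typically performs is sessionization: grouping a visitor's clicks into sessions separated by a period of inactivity. A minimal sketch, assuming clicks arrive as time-sorted (timestamp, url) pairs and using the conventional 30-minute gap:

```python
def sessionize(clicks, gap_seconds=1800):
    """Group one user's clicks into sessions split by inactivity.

    `clicks` is a list of (timestamp_seconds, url) tuples, assumed
    sorted by time. A gap longer than `gap_seconds` (30 minutes by
    default) starts a new session. This is a toy version of the
    sessionization step an ETL pipeline runs before behavioural
    modelling.
    """
    sessions = []
    current = []
    last_ts = None
    for ts, url in clicks:
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)   # close the previous session
            current = []
        current.append((ts, url))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

At warehouse scale the same logic is usually expressed as a window function over user and timestamp rather than a Python loop.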
Deep Dive and Best Practices for Real Time Streaming Applications - Amazon Web Services
Get answers to technical questions frequently asked by those starting to work with streaming data. Learn best practices for building a real-time streaming data architecture on AWS with Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. First, we will focus on building a scalable, durable streaming data ingestion workflow from data producers like mobile devices, servers, or even web browsers. We will provide guidelines to minimize duplicates and achieve exactly-once processing semantics in your stream-processing applications. Then, we will show some of the proven architectures for processing streaming data using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming running on Amazon EMR.
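The duplicate-minimization guideline above usually comes down to idempotent processing keyed on a record identifier such as the Kinesis sequence number. A toy sketch, where the in-memory `seen` set stands in for a durable checkpoint store (e.g., DynamoDB) that a real application would use:

```python
class DedupingProcessor:
    """Sketch of turning at-least-once delivery into effectively-once
    processing.

    Streams like Kinesis can deliver a record more than once, so a
    consumer that must not double-count tracks the sequence numbers
    (or idempotency keys) it has already applied. Here the 'seen' set
    lives in memory; durably checkpointing it alongside the output is
    what makes the approach safe across restarts.
    """
    def __init__(self):
        self.seen = set()
        self.total = 0

    def process(self, seq_no, value):
        if seq_no in self.seen:   # duplicate delivery: skip
            return False
        self.seen.add(seq_no)
        self.total += value       # the side effect we must not repeat
        return True
```

The same pattern generalizes: any side effect keyed by a unique record ID can be retried safely.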
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose - Amazon Web Services
Amazon Kinesis Firehose is a fully-managed, elastic service to deliver real-time data streams to Amazon S3, Amazon Redshift, and other destinations. In this session, we start with overviews of Amazon Kinesis Firehose and Amazon Kinesis Analytics. We then discuss how Amazon Kinesis Firehose makes it even easier to get started with streaming data, without writing a stream processing application or provisioning a single resource. You learn about the key features of Amazon Kinesis Firehose, including its companion agent that makes emitting data from data producers even easier. We walk through capture and delivery with an end-to-end demo, and discuss key metrics that will help developers and architects understand their streaming data flow. Finally, we look at some patterns for data consumption as the data streams into S3. We show two examples: using AWS Lambda, and how you can use Apache Spark running within Amazon EMR to query data directly in Amazon S3 through EMRFS.
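Under the hood, Firehose delivery rests on buffering: records accumulate and are delivered as a batch once a size threshold or a time interval is reached, whichever comes first. The sketch below is an illustrative model of that behaviour only; real Firehose buffering hints are configured on the delivery stream, not written as client code, and the class and parameter names here are invented:

```python
class Buffer:
    """Toy model of Firehose-style buffering.

    Records accumulate and are flushed as a batch once a record-count
    threshold or a maximum age is reached, whichever comes first. The
    clock is passed in explicitly so behaviour is deterministic and
    easy to test.
    """
    def __init__(self, max_records=3, max_age=60):
        self.max_records = max_records
        self.max_age = max_age
        self.records = []
        self.opened_at = None
        self.flushed = []            # batches "delivered" downstream

    def put(self, record, now):
        if not self.records:
            self.opened_at = now     # batch starts with its first record
        self.records.append(record)
        if (len(self.records) >= self.max_records
                or now - self.opened_at >= self.max_age):
            self.flushed.append(self.records)
            self.records = []
```

This trade-off (bigger batches vs. lower latency) is exactly what Firehose's buffer size and buffer interval settings tune.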
DPACC Acceleration Progress and Demonstration - OPNFV
The session provides an update on the DPACC project within OPNFV, with a brief discussion of APIs and implementation progress. It will review the API definition progress and follow up with a demo highlighting a common application as the vNF running on top of the DPACC-defined layers. The demo will highlight the use of both hardware and software acceleration via the DPACC-defined acceleration layers, and will show the progress in optimizing the performance and latency characteristics of a platform to realize the vision of NFV while meeting the stringent requirements, particularly for certain workloads, imposed by carriers.
Analytics & Reporting for Amazon Cloud Logs - Cloudlytics
A deep dive into the Cloudlytics Reports section, covering the following reports in detail and how they can help with your business use case:
- Geo Tracker Report
- IP Tracker Report
- Timeline Report
- ELB Tracker
- CloudFront Cost Analyzer
- Custom Function
World's Best AWS Cloud Log Analytics & Management Tool - Cloudlytics
An overview of Cloudlytics features, pricing details, offers for AWS Activate customers, and AWS Marketplace info, plus a sneak preview of all the analytics (the Reports section will be covered in detail in our next presentation).
AWS re:Invent 2016 | GAM301 | How EA Leveraged Amazon Redshift and AWS Partner... - Amazon Web Services
In November 2015, Capital Games launched a mobile game accompanying a major feature film release. The back end of the game is hosted in AWS and uses big data services like Amazon Kinesis, Amazon EC2, Amazon S3, Amazon Redshift, and AWS Data Pipeline. Capital Games will describe some of the challenges in their initial setup and usage of Amazon Redshift and Amazon EMR. They will then go over their engagement with AWS Partner 47Lining and talk about specific best practices regarding solution architecture, data transformation pipelines, and system maintenance using AWS big data services. Attendees of this session can expect a candid view of the process of implementing a big data solution, from problem statement identification to visualizing data, with an in-depth look at the technical challenges and hurdles along the way.
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014 - Amazon Web Services
Choosing the right mobile analytics solution can help you understand user behavior, engage users, and maximize user lifetime value. After this session, you will understand how you can learn more about your users and their behavior quickly across platforms with just one line of code using Amazon Mobile Analytics.
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK - Nate Wiger
See the latest analytics architectures for companies succeeding in the free-to-play space, such as Supercell, GREE, and Rovio. Also see how to create a real-time analytics pipeline to connect to your players, enabling you to deliver deeper experiences.
• What is a web log?
• Where do they come from?
• Why are they relevant?
• How can we analyze them?
• What about Clickstream?
Facing these questions, I did some personal research and put together a synthesis, which helped me clarify some ideas. This presentation does not claim to be exhaustive on the subject, but it may bring you some useful insights.
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi... - Amazon Web Services
Organizations processing mission critical high-volume data must be able to achieve high levels of throughput and durability in data processing workflows. In this session, we will learn how DataXu is using Amazon Kinesis, Amazon S3, and Amazon EMR for its patented approach to programmatic marketing. Every second, the DataXu Marketing Cloud processes over 1 Million ad requests and makes more than 40 billion decisions to select and bid on ad impressions that are most likely to convert. In addition to addressing the scalability and availability of the platform, we will explore Amazon Kinesis producer and consumer applications that support high levels of scalability and durability in mission-critical record processing.
(GAM302) EA's Real-World Hurdles with Millions of Players in The Simpsons: Ta... - Amazon Web Services
How do you really architect a game that can handle 5, 6, or 7 million daily active users? Learn about the scalability challenges that EA had to overcome for The Simpsons: Tapped Out. Hear how EA had to redesign their MySQL-based database layer on the fly, migrating over to Amazon DynamoDB, while keeping the game running. See how EA added AWS Elastic Beanstalk and Auto Scaling to simplify their deployments, while also lowering costs by enabling them to respond to changing player counts. EA shows how they switched from sticky sessions to Amazon ElastiCache, solving player disconnects and allowing further scaling out. Finally, EA shares some interesting statistics about The Simpsons: Tapped Out, as well as their overall learnings about how best to develop, deploy, and monitor a game on AWS.
AWS ML and SparkML on EMR to Build Recommendation Engine - Amazon Web Services
Machine Learning
A managed supervised learning environment for building different model types, including binary classification, multi-class classification, and regression. The first demo uses a dataset of banking customers with demographics to predict the likelihood of default using binary classification; the second predicts traffic at a UK bike rental shop using linear regression; and the third predicts rainforest soil type using multi-class classification.
Benefits: A managed, on-demand environment for supervised learning algorithms, available as batch processing or a real-time API.
Spark ML Cluster
Running Spark on an AWS-managed cluster, storing data on HDFS / S3 persistent storage, with modules including MLlib and Zeppelin (a web notebook), to build a movie recommendation engine based on collaborative filtering. The dataset contains 10M ratings provided by GroupLens from the MovieLens website.
Benefits: Fully managed clusters, with HA, scalability, elasticity, and Spot Instance pricing.
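To make the collaborative-filtering idea concrete without a cluster, here is a toy user-based recommender in plain Python. Spark MLlib's ALS solves the same problem at scale via matrix factorization; the function names and the tiny rating dicts below are purely illustrative:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating dicts {item: rating}."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (sqrt(sum(x * x for x in u.values()))
           * sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(target, others, k=1):
    """User-based collaborative filtering in miniature.

    Find the user most similar to `target` and suggest items they
    rated highly that `target` has not seen. This captures the
    'users with similar tastes' intuition that MLlib's ALS realizes
    at scale through matrix factorization.
    """
    best = max(others, key=lambda o: cosine(target, o))
    candidates = [i for i in best if i not in target]
    return sorted(candidates, key=lambda i: -best[i])[:k]
```

With 10M MovieLens ratings the similarity computation must be distributed, which is exactly the job EMR and Spark take on.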
AWS Data Transfer Services: Data Ingest Strategies Into the AWS Cloud - Amazon Web Services
Different types and sizes of data require different strategies. In this session, learn about the various features and services available for migrating data, be it small ongoing transactional data or large multi-petabyte volumes. Come learn how customers are using the latest network, streaming and large scale ingest features for their cloud data migrations to AWS storage services.
(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk - Amazon Web Services
Learn about Sony's efforts to build a cloud-native authentication and profile management platform on AWS. Sony engineers demonstrate how they used AWS Elastic Beanstalk to deploy, manage, and scale their applications. They also describe how they use AWS CloudFormation for resource provisioning, Amazon DynamoDB for the main database, and AWS Lambda and Amazon Redshift for log handling and analysis. This discussion focuses on best practices, security considerations, tradeoffs, and the final architecture and implementation. By the end of the session, you will clearly understand how to use Elastic Beanstalk as a platform to quickly and easily build at-scale web applications on AWS, and how to use Elastic Beanstalk with other AWS services to build cloud-native applications.
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014 - Amazon Web Services
Log data contains some of the most valuable raw information you can gather and analyze about your infrastructure and applications. Amid the mess of confusing lines of seemingly random text can be hints about performance, security, flaws in code, user access patterns, and other operational data. Without the proper tools, finding insights in these logs can be like searching for a hay-colored needle in a haystack. In this session you learn what practices and patterns you can easily implement that can help you better understand your log files. You see how you can customize web logs to add more information to them, how to digest logs from around your infrastructure, and how to analyze your log files in near real time.
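As a small example of the log digestion described above, the sketch below tallies HTTP status codes from Common Log Format lines, skipping anything that does not parse. The regex covers the standard fields; treat the specifics as an assumption of this toy rather than the session's actual tooling:

```python
import re

# Common Log Format, e.g.:
# 127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def status_counts(lines):
    """Tally HTTP status codes across raw access-log lines.

    Unparseable lines are skipped -- the kind of first-pass digestion
    a near-real-time log pipeline performs before alerting on error
    rates or access patterns.
    """
    counts = {}
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            status = m.group("status")
            counts[status] = counts.get(status, 0) + 1
    return counts
```

Spikes in the 4xx or 5xx buckets of such a tally are often the first operational signal that something in the application has gone wrong.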
Learn best practices for building a real-time streaming data architecture on AWS with Spark Streaming, Amazon Kinesis, and Amazon Elastic MapReduce (EMR). Get a closer look at how to ingest streaming data scalably and durably from data producers like mobile devices, servers, and even web browsers, and design a stream processing application with minimal data duplication and exactly-once processing.
Presented by: Guy Ernest, Principal Business Development Manager, Amazon Web Services
Customer Guest: Harry Koch, Solutions Architecture, Philips
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018 - Amazon Web Services
Learn best practices from Hilton Hotels Worldwide as they built an Enterprise Data Lake/Management (EDM) platform on AWS to drive insights and analytics for their business applications, including worldwide hotel booking and reservation management systems. The EDM architecture is built with Hadoop clusters running on Amazon EC2 combined with Amazon Redshift and Amazon Athena for data warehousing and ad hoc SQL analytics. This is a great opportunity to get an unfiltered customer perspective on their road to data nirvana!
Amazon Kinesis provides services for you to work with streaming data on AWS. Learn how to load streaming data continuously and cost-effectively to Amazon S3 and Amazon Redshift using Amazon Kinesis Firehose without writing custom stream processing code. Get an introduction to building custom stream processing applications with Amazon Kinesis Streams for specialised needs.
Presented at the AWS Summit in London, here's a deep dive on getting started with Amazon Kinesis and use-case with Jampp, the world's leading mobile app marketing platform.
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:... - Amazon Web Services
Knowing what users are doing on your websites in real time provides insights you can act on without waiting for delayed batch processing of clickstream data. Watching the immediate impact on user behavior after new releases, detecting and responding to anomalies, situational awareness, and evaluating trends are all benefits of real-time website analytics. In this workshop, we build a cost-optimized platform to capture web beacon traffic, analyze it for interesting metrics, and display it on a customized dashboard. We start by deploying the Web Analytics Solution Accelerator; once the core is complete, we extend the solution to capture new and interesting metrics, process them with Amazon Kinesis Analytics, and display new graphs on the custom dashboard. Participants come away with a fully functional system for capturing, analyzing, and displaying valuable website metrics in real time.
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics... - Amazon Web Services
It is becoming increasingly important to analyze real-time streaming data, as it allows organizations to remain competitive by uncovering relevant, actionable insights. AWS makes it easy to capture, store, and analyze real-time streaming data.
In this webinar, we will guide you through some of the proven architectures for processing streaming data, using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming on Amazon Elastic MapReduce (EMR). We will then talk about common use cases and best practices for real-time data analysis on AWS.
Learning Objectives:
Understand how you can analyze real-time data streams using Amazon Kinesis, AWS Lambda, and Spark running on Amazon EMR
Learn use cases and best practices for streaming data applications on AWS
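The windowed analysis these sessions describe boils down to maintaining per-key counts over a moving time window, evicting events as they age out. A minimal, framework-free sketch (the class and parameter names are illustrative, not any framework's API):

```python
from collections import deque

class SlidingWindowCounter:
    """Count events per key over the last `window` seconds.

    This is the core operation behind Spark Streaming's window
    transformations and Kinesis Analytics' windowed SQL: events enter
    in timestamp order, and old events are evicted as the clock
    advances past the window boundary.
    """
    def __init__(self, window=300):
        self.window = window
        self.events = deque()        # (timestamp, key) pairs in order

    def add(self, ts, key):
        self.events.append((ts, key))

    def counts(self, now):
        # Evict everything older than the window, then tally the rest.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        result = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result
```

Distributed engines shard this state by key and checkpoint it, but the eviction-then-aggregate loop is the same idea.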
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304) - Amazon Web Services
In this session, you learn about the latest and hottest features of Amazon Redshift. Join Vidhya Srinivasan, General Manager of Amazon Redshift, to take a deep dive into the architecture and inner workings of Amazon Redshift. You discover how the recent availability, performance, and manageability improvements we’ve made can significantly enhance your end user experience. You also get a glimpse of what we are working on and our plans for the future.
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and... - Amazon Web Services
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Path to the future #4 - Real-time data ingestion, processing, and analysis - Amazon Web Services LATAM
In this session we will run a demonstration of air traffic control and defense using real-time processing.
We will cover best practices for data ingestion, storage, processing, and visualization using AWS services such as Kinesis, DynamoDB, Lambda, Redshift, QuickSight, and Amazon Machine Learning.
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat... - Amazon Web Services
In this session, we provide an update on Amazon Redshift, and look at a case study from Equinox Fitness Clubs. We show you how Amazon Redshift queries data across your data warehouse and data lake, without the need or delay of loading data, to deliver insights you cannot obtain by querying independent data silos. Discover how Equinox Fitness Clubs transitioned from on-premises data warehouses and data marts to a cloud-based, integrated data platform, built on AWS and Amazon Redshift. Learn about their journey from static reports, redundant data, and inefficient data integration to a modern and flexible data lake and data warehouse architecture that delivers dynamic reports based on trusted data.
by Darin Briskman, Technical Evangelist, AWS
Amazon Kinesis Data Analytics gives us the tools to run SQL queries against active data streams. We'll look at how we can perform real-time log analytics and build entire streaming applications using SQL, so that you can gain actionable insights promptly.
Amazon Kinesis is a platform for streaming data on AWS, offering powerful services to make it easy to load and analyze streaming data. In this session, you'll learn about how AWS customers are transitioning from batch to real-time processing using Amazon Kinesis, and how to get started. We will provide an overview of streaming data applications and introduce the Amazon Kinesis platform and its services. We will walk through a production use case to demonstrate how to ingest streaming data, prepare it, and analyze it to gain actionable insights in real time using Amazon Kinesis. We will also provide pointers to tutorials and other resources so you can quickly get started with your streaming data application.
A data lake allows an organisation to store all of its data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data. In this session we will explore the architecture of a data lake on AWS and cover topics such as storage, processing and security.
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017 - Amazon Web Services
Real-time streaming analytics has become popular across many verticals and use cases. In AdTech, gaming, financial services, and IoT, AWS customers are leveraging the Amazon Kinesis platform to ingest billions of events every day and process them in real time. In this session, we will discuss Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics. We will show best practices and design patterns for integrating the Amazon Kinesis platform with other services like Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, and AWS Lambda, as well as third-party connectors like Storm, Spark, and more.
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec... - Amazon Web Services
Learning Objectives:
- Identify common problems that streaming data can help solve
- Understand the AWS services that are used to solve these problems, including Amazon Kinesis
- Try out one of 5+ different solutions powered by Amazon Kinesis through AWS CloudFormation templates
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018 - Amazon Web Services
Realizing the value of social media analytics can bolster your business goals. This type of analysis has grown in recent years due to the large amount of available information and the speed at which it can be collected and analyzed. In this workshop, we build a serverless data processing and machine learning (ML) pipeline that provides a multi-lingual social media dashboard of tweets within Amazon QuickSight. We leverage API-driven ML services, AWS Glue, Amazon Athena and Amazon QuickSight. These building blocks are put together with very little code by leveraging serverless offerings within AWS.
How to Build Forecasting Services Using ML and Deep Learn... - Amazon Web Services
Forecasting is an important process for a great many companies and is used in various areas to try to accurately predict the growth and distribution of a product, the resources needed on production lines, financial presentations, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data that contains a temporal component, and then use an algorithm that produces an accurate forecast based on the type of data analyzed.
Big Data for Startups: How to Create Big Data Applications in Server... - Amazon Web Services
The variety and quantity of data created every day is accelerating ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can seem complex: building large-scale Big Data clusters looks like an investment accessible only to established companies. But the elasticity of the Cloud and, in particular, Serverless services let us break through these limits.
Let's see, then, how to develop Big Data applications quickly, without worrying about infrastructure, devoting all our resources to developing our ideas and creating innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and show how to deploy your application in a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing its pace of innovation. Over this period we learned how changing our approach to application development allowed us to greatly increase agility and release velocity and, ultimately, to build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only application architecture, but also organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to Spend Up to 90% Less with Containers and Spot Instances - Amazon Web Services
The use of containers keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
AWS ECS, EKS, and Kubernetes on EC2 can take advantage of Spot Instances, leading to average savings of 70% compared to On-Demand Instances. In this session we will explore the characteristics of Spot Instances and how they can easily be used on AWS. We will also learn how Spreaker uses Spot Instances to run applications of various kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us how to monetize Open APIs, simplify Fintech integrations, and accelerate adoption of the various Open Banking business models. AWS and FinConecta would therefore like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda:
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make your startup's offering stand out in the market with Machine Learning services — Amazon Web Services
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and build the differentiating elements of your own offering.
Focusing on Machine Learning technologies, we will see how to choose among the AI services offered by AWS and, with the help of a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployment of your EC2 instances — Amazon Web Services
With the traditional approach to IT, implementing DevOps practices was difficult for many years: they often involved manual activities that occasionally caused application downtime and interrupted users' work. With the advent of the cloud, DevOps techniques are now within everyone's reach at low cost for any kind of workload, providing greater system reliability and significant improvements in business continuity.
AWS offers AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances using Chef and Puppet.
Learn how to use AWS OpsWorks to guarantee the reliability of your applications running on EC2 instances.
Microsoft Active Directory on AWS to support your Windows workloads — Amazon Web Services
Want to know your options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support Group Policy management, authentication, and authorization. In this session we will discuss the options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and deploying Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis powered by artificial-intelligence techniques is evolving and maturing at a rapid pace. In this webinar we will explore what AWS services make possible when applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are hosting a free virtual event on Wednesday, October 14th from 12:00 to 13:00 dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a broad range of AWS services, fully exploiting the potential of the AWS cloud while protecting existing VMware investments.
Build your first serverless ledger-based app with QLDB and NodeJS — Amazon Web Services
Many companies today build applications with ledger-like functionality, for example to verify the history of credits and debits in banking transactions, or to track the supply-chain flow of their products.
At the core of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but which are complex and costly tools to manage.
Amazon QLDB removes the need to build complex custom systems by providing a fully managed, serverless ledger database.
In this session we will see how to build a complete serverless application that uses QLDB's capabilities.
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for delivering a great experience to end users. In this session we will learn how to tackle modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dig into several scenarios, seeing how AppSync can help solve these use cases by building modern APIs with real-time and offline data-update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle databases and VMware Cloud™ on AWS: myths debunked — Amazon Web Services
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can introduce complexity during application modernization and refactoring, along with performance risks that can arise when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and simplify the migration of Oracle workloads and accelerate the transformation to the cloud; they also dive into the architecture and show how to fully exploit the potential of VMware Cloud™ on AWS.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies managing Docker containers through an orchestration layer controlling deployment and lifecycle. In this session we will present the main features of the service, reference architectures for different workloads, and the few simple steps needed to quickly migrate one or more of your containers.
2. What to Expect from the Session
• Common patterns for clickstream analytics
• Tips on using Amazon Kinesis and Amazon EMR for clickstream processing
• Hearst’s big data journey in building the Hearst analytics stack for clickstream
• Lessons learned
• Q&A
3. Clickstream Analytics = Business Value

Verticals / Use Cases | Accelerated Ingest-Transform-Load to final destination | Continual Metrics / KPI Extraction | Actionable Insights
Ad Tech / Marketing Analytics | Advertising data aggregation | Advertising metrics like coverage, yield, conversion, scoring webpages | User activity engagement analytics, optimized bid/buy engines
Consumer Online / Gaming | Online customer engagement data aggregation | Consumer/app engagement metrics like page views, CTR | Customer clickstream analytics, recommendation engines
Financial Services | Digital assets, improve customer experience on bank website | Financial market data metrics | Fraud monitoring, value-at-risk assessment, auditing of market order data
IoT / Sensor Data | Fitness device, vehicle sensor, telemetry data ingestion | Wearable sensor operational metrics and dashboards | Devices / sensor operational intelligence
4. DataXu Records

APACHE ACCESS LOG
192.168.198.92 - - [22/Dec/2013:23:08:37 -0400] "GET / HTTP/1.1" 200 6394 www.yahoo.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-"
192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-"
192.168.72.177 - - [22/Dec/2013:23:32:14 -0400] "GET ...

JSON
{"cId":"10049","cdid":"5961","campID":"8","loc":"b","ip_address":"174.56.106.10","icctm_ht_athr":"","icctm_ht_aid":"","icctm_ht_attl":"Family Circus","icctm_ht_dtpub":"2011-04-05","icctm_ht_stnm":"SEATTLE POST-INTELLIGENCER","icctm_ht_cnocl":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","ts":"1422839422426","url":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","hash":"d98ace5874334232f6db3e1c0f8be3ab","load":"5.096","ref":"http://www.seattlepi.com/comics-and-games","bu":"HNP","brand":"SEATTLE POST-INTELLIGENCER","ref_type":"SAMESITE","ref_subtype":"SAMESITE","ua":"desktop:chrome"}

Clickstream Record
• Number of fields is not fixed
• Tag names change
• Multiple pages/sites
• Format can be defined as we store the data: AVRO, CSV, TSV, JSON
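Because the number of fields is not fixed and tag names change, downstream consumers should read these JSON records defensively rather than assuming a schema. A minimal Python sketch (the `get_field` helper is illustrative, not part of the Hearst code; the field names come from the sample record above):

```python
import json

def get_field(record, name, default=""):
    """Read a field from a parsed clickstream record, tolerating missing tags."""
    return record.get(name, default)

sample = json.loads(
    '{"cId": "10049", "load": "5.096", '
    '"url": "http://www.seattlepi.com/comics-and-games/fun/Family_Circus"}'
)

page_url = get_field(sample, "url")                # present in this record
load_secs = float(get_field(sample, "load", "0"))  # numeric fields arrive as strings
author = get_field(sample, "icctm_ht_athr")        # tag absent here -> default ""
```

Reading with defaults like this is what lets new tags be added upstream without breaking existing consumers.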
6. Clickstream Analytics – Common Patterns
• Flume → HDFS → Hive: batch, high latency on retrieve
• Flume → HDFS → Hive & Pig → SQL: batch, low latency on retrieve
• Flume / Sqoop → HDFS → Impala, SparkSQL, Presto, other: more options, batch with lower latency on retrieve
8. It’s All About the Pace, About the Pace…
Big data
• Hourly server logs: were your systems misbehaving 1 hr ago?
• Weekly / monthly bill: what you spent this billing cycle
• Daily customer-preferences report from your web site’s click stream: what deal or ad to try next time
• Daily fraud reports: was there fraud yesterday?
Real-time big data
• Amazon CloudWatch metrics: what went wrong now
• Real-time spending alerts/caps: prevent overspending now
• Real-time analysis: what to offer the current customer now
• Real-time detection: block fraudulent use now
9. Clickstream Storage and Processing with Amazon Kinesis
An Amazon Kinesis stream (Shard 1, Shard 2, … Shard N, replicated across Availability Zones) sits behind an AWS endpoint and fans out to multiple applications:
• App 1: aggregate and ingest data to S3 (data lake)
• App 2: aggregate and ingest data to Amazon Redshift
• App 3: ETL/ELT and machine learning on EMR, with state in DynamoDB
• App N: live dashboard
10. Amazon EMR
• Managed, elastic Hadoop (1.x & 2.x) cluster
• Integrates with Amazon S3, Amazon DynamoDB, and Amazon Redshift
• Installs Storm, Spark, Hive, Pig, Impala, and end-user tools automatically
• Support for Spot Instances
• Integrated HBase NoSQL database
Amazon EMR with Apache Spark: on top of Apache Spark sit Spark SQL, Spark Streaming, MLlib, and GraphX.
15. Resize Nodes with Spot Instances
• 50% less run-time (14 → 7)
• 25% less cost (140 → 105)
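The slide's numbers are consistent with adding Spot task nodes that double cluster throughput at roughly half the On-Demand rate. A worked sketch of that arithmetic (the rates and the doubling assumption are illustrative; only the 14 → 7 and 140 → 105 figures come from the slide):

```python
baseline_hours = 14          # run-time with core nodes only
baseline_cost = 140.0        # total cost at the On-Demand rate
on_demand_rate = baseline_cost / baseline_hours   # 10.0 per cluster-hour

# Assume Spot task nodes match core capacity at ~50% of the On-Demand price,
# halving run-time while adding a cheaper second line item.
spot_rate = on_demand_rate / 2
new_hours = baseline_hours / 2
new_cost = new_hours * (on_demand_rate + spot_rate)

runtime_saving = 1 - new_hours / baseline_hours   # 50% less run-time
cost_saving = 1 - new_cost / baseline_cost        # 25% less cost
```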
16. Amazon EMR and Amazon Kinesis for Batch and Interactive Processing
• Streaming log analysis
• Interactive ETL
Data flows from Amazon Kinesis into Amazon EMR, and on to Amazon Redshift (for BI) and Amazon S3; data scientists use a separate Amazon EMR cluster running on Spot Instances.
18. Amazon Kinesis Applications – Tips
• Amazon Software License linking: add the ASL dependency to your SBT/Maven project (artifactId = spark-streaming-kinesis-asl_2.10)
• Shards: include head-room for catching up with data in the stream
• Tracking Amazon Kinesis application state (DynamoDB)
  • Kinesis application : DynamoDB table (1:1)
  • Created automatically
  • Make sure the application name doesn’t conflict with existing DynamoDB tables
  • Adjust DynamoDB provisioned throughput if necessary (default 10 reads per second & 10 writes per second)
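The 1:1 application-to-table mapping exists because the table stores, per shard, the last checkpointed sequence number, which a restarted worker reads in order to resume. A toy in-memory stand-in for that DynamoDB state (class and method names are invented for illustration):

```python
class KinesisAppState:
    """Simulates the per-application DynamoDB table the Kinesis client library keeps."""

    def __init__(self, app_name):
        self.app_name = app_name      # must not clash with an existing table name
        self.checkpoints = {}         # shard id -> last processed sequence number

    def checkpoint(self, shard_id, sequence_number):
        self.checkpoints[shard_id] = sequence_number

    def resume_from(self, shard_id):
        # None means no checkpoint yet: start from the initial stream position
        return self.checkpoints.get(shard_id)

state = KinesisAppState("hearst-clickstream-app")
state.checkpoint("shardId-000000000000", "49540000000000000000001")
```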
19. Spark on Amazon EMR – Tips
• Amazon EMR supports Spark as an application after version 3.8.0 (no need to run bootstrap actions)
• Use Spot Instances for time & cost savings, especially when using Spark
• Run in YARN cluster mode (--master yarn-cluster) for production jobs: the Spark driver runs in the application master (high availability)
• Data serialization: use Kryo if possible to boost performance (spark.serializer=org.apache.spark.serializer.KryoSerializer)
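Put together, the last two tips translate into a launch command along these lines (a sketch for Spark 1.x-era EMR; the jar name is a placeholder):

```
spark-submit --master yarn-cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  clickstream-etl.jar
```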
20. The Life of a Click at Hearst
• Hearst’s journey with their big data analytics platform on AWS
• Demo
• Clickstream analysis patterns
• Lessons learned
23. BUSINESS MEDIA: operates more than 20 business-to-business properties with significant holdings in the automotive, electronic, medical and finance industries
MAGAZINES: publishes 20 U.S. titles and close to 300 international editions
BROADCASTING: comprises 31 television and two radio stations
NEWSPAPERS: owns 15 daily and 34 weekly newspapers
Hearst includes over 200 businesses in over 100 countries around the world
24. Data Services at Hearst – Our Mission
• Ensure that Hearst leverages its combined data assets
• Unify Hearst’s data streams
  • Development of a Big Data Analytics Platform using AWS services
• Promote enterprise-wide product development
  • Example: product initiative led by all of Hearst’s editors, Buzzing@Hearst
26. Business Value of Buzzing
• Instant feedback on articles from our audiences
• Incremental re-syndication of popular articles across properties (e.g. trending newspaper articles can be adopted by magazines)
• Inform the editors to write articles that are more relevant to our audiences, and learn which channels our audiences use to read our articles
• Ultimately, drive incremental value
  • 25% more page views and 15% more visitors, which lead to incremental revenue
27. Engineering Requirements of Buzzing…
• Throughput goal: transport data from all 250+ Hearst properties worldwide
• Latency goal: click-to-tool in under 5 minutes
• Agile: easily add new data fields into the clickstream
• Unique metrics requirements defined by the Data Science team (e.g., standard deviations, regressions, etc.)
• Data reporting windows ranging from 1 hour to 1 week
• Front end developed “from scratch”, so data exposed through the API must support the development team’s unique requirements
Most importantly, operation of existing sites cannot be affected!
28. What we had to work with… a “static” clickstream collection process on many Hearst sites
Users of Hearst properties generate a clickstream that lands, once per day, in a Netezza data warehouse in the corporate data center: ~30 GB per day containing basic web log data (e.g., referrer, url, user agent, cookie, etc.), used for ad hoc SQL-based reporting and analytics.
…now how do we get there?
29. …and we own Hearst’s tag management system
The JavaScript on our web pages collects the clickstream from users of Hearst properties. This not only gave us access to the clickstream but also to the JavaScript code that lives on our websites.
30. Phase 1 – Ingest Clickstream Data Using AWS
• Use the tag manager to easily deploy JavaScript to all sites: JavaScript on the sites calls an exposed endpoint and passes in query parameters
• Elastic Beanstalk with Node.JS (the proxy app) exposes an HTTP endpoint which asynchronously takes the data and feeds it to Amazon Kinesis as “raw JSON”
• Kinesis Client Libraries and Kinesis Connectors persist the raw data to Amazon S3 for durability
31. Node.JS – Push clickstream to Amazon Kinesis

// Modules and setup implied by the original snippet (added for completeness);
// guid(), streamName, imageGIF and app.set('port', ...) are defined elsewhere
var express = require('express');
var async = require('async');
var url = require('url');
var http = require('http');
var AWS = require('aws-sdk');
var app = express();
var kinesis = new AWS.Kinesis();

function pushToKinesis(data) {
  var params = {
    Data: data, /* required */
    PartitionKey: guid(),
    StreamName: streamName /* required */
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
    }
  });
}

app.get('/hearstkin.gif', function(req, res){
  async.series([function(callback){
    var queryData = url.parse(req.url, true).query;
    queryData.proxyts = new Date().getTime().toString();
    pushToKinesis(JSON.stringify(queryData));
    callback(null);
  }]);
  res.writeHead(200, {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'});
  res.end(imageGIF, 'binary');
});

http.createServer(app).listen(app.get('port'), function(){
  console.log('Express server listening on port ' + app.get('port'));
});

• Asynchronous calls: ensures no user experience interruption
• Server timestamp (proxyts): to create a unified timestamp. Amazon Kinesis now offers this out of the box!
• JSON format: this helps us downstream
• Kinesis partition key: guid() is a good partition key to ensure even distribution across the shards
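The reason guid() works well as a partition key: Kinesis assigns a record to a shard by MD5-hashing its partition key, so random keys land uniformly across shards. A sketch of that behavior (the modulo mapping simplifies the real hash-key ranges):

```python
import hashlib
import uuid

NUM_SHARDS = 4

def shard_for(partition_key):
    # Kinesis MD5-hashes the partition key; modulo stands in for its hash-key ranges
    digest = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return digest % NUM_SHARDS

counts = [0] * NUM_SHARDS
for _ in range(10000):
    counts[shard_for(str(uuid.uuid4()))] += 1
# counts ends up close to [2500, 2500, 2500, 2500]
```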
32. Ingest Monitoring – AWS
• Amazon Kinesis monitoring
• AWS Elastic Beanstalk monitoring
• Auto Scaling triggered by network in > 20 MB, then scale up to 40 instances
33. Phase 1 – Summary
• Use JSON formatting for payloads so more fields can be easily added without impacting downstream processing
• The HTTP call requires minimal code introduced to the actual site implementations
• Flexible to meet rollout and growing demand
  • Elastic Beanstalk can be scaled
  • The Amazon Kinesis stream can be re-sharded
  • Amazon S3 provides high-durability storage for raw data
• Once a reliable, scalable onboarding platform is in place, we can now focus on ETL!
34. Phase 2a – Data Processing, First Version (EMR)
ETL on Amazon EMR turns the “raw JSON” raw data into clean aggregate data.
• Amazon EMR was chosen initially for processing due to the ease of Amazon EMR cluster creation … and Pig because we knew how to code in PigLatin
• 50+ UDFs were written using Python … also because we knew Python
35. Processing Clickstream Data with Pig
Unfortunately, Pig was not performing well – 15 min latency

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

REGISTER '/home/hadoop/PROD/parser.py' USING jython AS pyudf;
REGISTER '/home/hadoop/PROD/refclean.py' USING jython AS refs;

AA0 = LOAD 's3://BUCKET_NAME/rawjson/datadata.tsv.gz' USING TextLoader AS (line:chararray);
A0 = FILTER AA0 BY (pyudf.get_obj(line,'url') MATCHES '.*(a\\d+|g\\d+).*');
A1 = FOREACH A0 GENERATE
  pyudf.urlclean(pyudf.get_obj(line,'url')) AS url:chararray,
  pyudf.get_obj(line,'hash') AS hash:chararray,
  pyudf.get_obj(line,'icxid') AS icxid:chararray,
  pyudf.pubclean(pyudf.get_obj(line,'icctm_ht_dtpub')) AS pubdt:chararray,
  pyudf.get_obj(line,'icctm_ht_cnocl') AS cnocl:chararray,
  pyudf.get_obj(line,'icctm_ht_athr') AS author:chararray,
  pyudf.get_obj(line,'icctm_ht_attl') AS title:chararray,
  pyudf.get_obj(line,'icctm_ht_aid') AS cms_id:chararray,
  pyudf.num1(pyudf.get_obj(line,'mxdpth')) AS mxdpth:double,
  pyudf.num2(pyudf.get_obj(line,'load')) AS topsecs:double,
  refs.classy(pyudf.get_obj(line,'url'),1) AS bu:chararray,
  pyudf.get_obj(line,'ip_address') AS ip:chararray,
  pyudf.get_obj(line,'img') AS img:chararray;

• Gzip your output
• Regex in Pig!
• Python imports limited to what is allowed by Jython
36. Phase 2b – Data Processing (Spark Streaming)
Users of Hearst properties → clickstream → Node.JS proxy app → Amazon Kinesis → ETL on EMR → clean aggregate data
• Welcome Apache Spark: one framework for batch and realtime
• Benefits: using the same code for batch and real-time ETL
• Use Spot Instances: cost savings
• Drawbacks: Scala!
37. Using SQL with Scala (SparkSQL)
Since we knew SQL, we wrote Scala with embedded SQL queries.

Configuration:
endpointUrl = kinesis.us-west-2.amazonaws.com
streamName = hearststream
outputLoc.json.streaming = s3://hearstkinesisdata/processedsparkjson
window.length = 300
sliding.interval = 300
outputLimit = 5000000
query1Table = hearst1
query1 = SELECT
  simplestartq(proxyts, 5) as startq,
  urlclean(url) as url,
  hash,
  icxid,
  pubclean(icctm_ht_dtpub) as pubdt,
  classy(url,1) as bu,
  ip_address as ip,
  artcheck(classy(url,1),url) as artcheck,
  ref_type(ref,url) as ref_type,
  img,
  wc,
  contentSource
FROM hearst1

Scala driver:
val jsonRDD = sqlContext.jsonRDD(rdd1)
jsonRDD.registerTempTable(query1Table.trim)
val query1Result = sqlContext.sql(query1) //.limit(outputLimit.toInt)
query1Result.registerTempTable(query2Table.trim)
val query2Result = sqlContext.sql(query2)
query2Result.registerTempTable(query3Table.trim)
val query3Result = sqlContext.sql(query3).limit(outputLimit.toInt)
val outPartitionFolder = UDFUtils.output60WithRolling(slidingInterval.toInt)
query3Result.toJSON.saveAsTextFile("%s/%s".format(outputLocJSON, outPartitionFolder), classOf[org.apache.hadoop.io.compress.GzipCodec])
logger.info("New JSON file written to " + outputLoc + "/" + outPartitionFolder)
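With window.length and sliding.interval both set to 300 seconds, the windows tumble: each click belongs to exactly one non-overlapping five-minute batch. The bucketing can be sketched as follows (the timestamps are invented epoch seconds):

```python
from collections import Counter

WINDOW_SECONDS = 300  # window.length == sliding.interval -> tumbling windows

def window_start(epoch_seconds):
    """Map a click timestamp to the start of its five-minute window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

clicks = [1422839422, 1422839430, 1422839700, 1422839999]
per_window = Counter(window_start(ts) for ts in clicks)
# two adjacent windows, two clicks each
```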
38. Python UDF versus Scala

Python:
def artcheck(bu, url):
    try:
        if url and bu:
            cleanurl = url[0:url.find("?")].strip('/')
            tailurl = url[findnth(url, '/', 3)+1:url.find("?")].strip('/')
            revurl = cleanurl[::-1]
            root = revurl[0:revurl.find('/')][::-1]
            if (bu=='HMI' or bu=='HMG') and re.compile(r'a\d+|g\d+').search(tailurl) != None: return 'T'
            elif bu=='HTV' and root.isdigit()==True and re.compile('/search/').search(cleanurl)==None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix').search(url)!=None and re.compile(r'\S*[0-9]{4,4}/[0-9]{2,2}/[0-9]{2,2}\S*').search(tailurl)!=None: return 'T'
            elif bu=='HNP' and re.compile('businessinsider').search(url)!=None and re.compile(r'\S*[0-9]{4,4}-[0-9]{2,2}').search(root)!=None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix|businessinsider').search(url)==None and re.compile(r'\.php').search(url)!=None: return 'T'
            else: return 'F'
        else: return 'F'
    except:
        return 'F'

Scala:
def artcheck(bu: String, url: String) = {
  try {
    val cleanurl = UDFUtils.utilurlclean(url.trim).stripSuffix("/")
    val pathClean = UDFUtils.pathURI(cleanurl)
    val lastContext = pathClean.split("/").last
    var resp = "F"
    if (("HMI"==bu || "HMG"==bu) && Pattern.compile("/a\\d+|/g\\d+").matcher(pathClean).find()) resp = "T"
    else if ("HTV"==bu && StringUtils.isNumeric(lastContext) && !cleanurl.contains("/search/")) resp = "T"
    else if ("HNP"==bu && Pattern.compile("blog|fuelfix").matcher(url).find() && Pattern.compile("\\d{4}/\\d{2}/\\d{2}").matcher(pathClean).find()) resp = "T"
    else if ("HNP"==bu && Pattern.compile("businessinsider").matcher(url).find() && Pattern.compile("\\d{4}-\\d{2}").matcher(lastContext).find()) resp = "T"
    else if ("HNP"==bu && !Pattern.compile("blog|fuelfix|businessinsider").matcher(url).find() && Pattern.compile("\\.php").matcher(url).find()) resp = "T"
    resp
  } catch {
    case e: Exception => "F"
  }
}

Don’t be intimidated by Scala … if you know Python, the syntax can be similar:
re.compile(r'a\d+|g\d+')  ↔  Pattern.compile("a\\d+|g\\d+")
try: / except:  ↔  try {} catch {}
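The date-matching rules in the UDF above look for a yyyy/mm/dd path segment (HNP blog/fuelfix permalinks) and for an a/g-prefixed numeric article id (HMI/HMG). A quick check of how those patterns classify some invented sample paths:

```python
import re

date_path = re.compile(r'\d{4}/\d{2}/\d{2}')   # HNP blog/fuelfix permalink rule
article_id = re.compile(r'a\d+|g\d+')          # assumed reading of the HMI/HMG rule

blog_hit = bool(date_path.search('blog/fuelfix/2013/12/22/some-post'))
blog_miss = bool(date_path.search('blog/fuelfix/about'))
mag_hit = bool(article_id.search('/fashion/a12345/spring-trends/'))
```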
39. Phase 3a – Data Science!
Amazon Kinesis → ETL on EMR → clean aggregate data → Data Science on EC2 → API-ready data
• We decided to perform our data science using SAS on Amazon EC2 initially because of the ability to perform both data manipulation and easily run complex data-science techniques (e.g., regressions)
• Great for exploration and initial development
• Performing data science using this method took 3-5 minutes to complete
40. SAS Code Example

data _null_;
  call system("aws s3 cp s3://BUCKET_NAME/file.gz /home/ec2-user/LOGFILES/file.gz");
run;

FILENAME IN pipe "gzip -dc /home/ec2-user/LOGFILES/file.gz" lrecl=32767;

data temp1;
  FORMAT startq DATETIME19.;
  infile IN delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=1;
  input
    startq :YMDDTTM.
    url :$1000.
    pageviews :best32.
    visits :best32.
    author :$100.
    cms_id :$100.
    img :$1000.
    title :$1000.;
run;

proc sql;
  CREATE TABLE metrics AS
  SELECT
    url FORMAT=$1000.,
    SUM(pageviews) as pageviews,
    SUM(visits) as visits,
    SUM(fvisits) as fvisits,
    SUM(evisits) as evisits,
    MIN(ttct) as rec,
    COUNT(distinct startq) as frq,
    AVG(visits) as avg_visits_pp,
    SUM(visits1) as visits_soc,
    SUM(visits2) as visits_dir,
    SUM(visits3) as visits_int,
    SUM(visits4) as visits_sea,
    SUM(visits5) as visits_web,
    SUM(visits6) as visits_nws,
    SUM(visits7) as visits_pd,
    SUM(visits8) as visits_soc_fb,
    SUM(visits9) as visits_soc_tw,
    SUM(visits10) as visits_soc_pi,
    SUM(visits11) as visits_soc_re,
    SUM(visits12) as visits_soc_yt,
    SUM(visits13) as visits_soc_su,
    SUM(visits14) as visits_soc_gp,
    SUM(visits15) as visits_soc_li,
    SUM(visits16) as visits_soc_tb,
    SUM(visits17) as visits_soc_ot,
    CASE WHEN (SUM(v1) - SUM(v3)) > 20 THEN (SUM(v1) - SUM(v3)) / 2 ELSE 0 END as trending
  FROM temp1
  GROUP BY 1;

• Use pipe to read in S3 data and keep it compressed
• Use PROC SQL when possible for easier translation to Amazon Redshift for production later on
41. Phase 3b – Split Data Science into Development and Production
Amazon Kinesis → ETL on EMR → clean aggregate data → Data Science “Production” on Amazon Redshift → API-ready data. Data Science “Development” on EC2 produces statistical models, run once per day; S3 stores the data-science models and the aggregate data, and the models are applied using Amazon Redshift.
• Once the data-science models were established, we split the modeling and production
• Production was moved to Amazon Redshift, which provided a much faster ability to read Amazon S3 data and process it
• Data-science processing time went down to 100 seconds!
42. select
clean_url as url,
trim(substring(max(proxyts||domain) from 20 for 1000)) as domain,
trim(substring(max(proxyts||clean_cnocl) from 20 for 1000)) as cnocl,
trim(substring(max(proxyts||img) from 20 for 1000)) as img,
trim(substring(max(proxyts||title) from 20 for 1000)) as title,
trim(substring(max(proxyts||section) from 20 for 1000)) as section,
approximate count(distinct ic_fpc) as visits,
count(1) as hits
from kinesis_hits
where bu='HMG' and (article_id is not null or author is not null or title is not null)
group by 1;
Amazon Redshift Code Example
Cool trick to find the most recent value of a character field in one pass through the data
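A quick way to see why the trick works: a fixed-width timestamp prefix makes a lexical MAX() select the latest row, and stripping the first 19 characters recovers the field value. A small sqlite3 sketch (table and column names only loosely mirror the Redshift example; data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (url TEXT, proxyts TEXT, title TEXT)")
conn.executemany("INSERT INTO hits VALUES (?,?,?)", [
    ("a.html", "2015-10-01 00:00:01", "Old Title"),
    ("a.html", "2015-10-02 12:30:00", "New Title"),
])
# MAX() over 'timestamp || value' sorts lexically by the 19-character
# timestamp prefix; SUBSTR from position 20 then drops the prefix,
# leaving the most recent value of the character field.
row = conn.execute("""
    SELECT url, TRIM(SUBSTR(MAX(proxyts || title), 20)) AS title
    FROM hits
    GROUP BY 1
""").fetchone()
print(row)  # ('a.html', 'New Title')
```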
43. Phase 4a- Elasticsearch Integration
[Diagram: ETL on EMR and Data Science on Amazon Redshift write Models, Agg Data, and API-Ready Data to S3 Storage; Amazon EMR pushes to the Buzzing API]
Since we had the Amazon EMR cluster running already, we used a handy Pig jar that made it easy to push data to Elasticsearch.
44. REGISTER /home/hadoop/pig/lib/piggybank.jar;
REGISTER /home/hadoop/PROD/elasticsearch-hadoop-2.0.2.jar;
-- Define the Elasticsearch storage handler (DEV cluster settings)
DEFINE EsStorageDEV org.elasticsearch.hadoop.pig.EsStorage
('es.nodes = es-dev.hearst.io',
'es.port = 9200',
'es.http.timeout = 5m',
'es.index.auto.create = true');
-- Load the tab-separated section metrics from S3
SECTIONS = load 's3://hearstkinesisdata/ss.tsv' USING PigStorage('\t') as
(sectionid:chararray, cnt:long, visits:long, sectionname:chararray);
STORE SECTIONS INTO 'content-sections-sync/content-sections-sync' USING EsStorageDEV;
Pig Code – Push to ES Example
Use handy Pig jar to push data to Elasticsearch
The “Amazon EMR overhead” required to read small files added 2 min to latency
45. Phase 4b- Elasticsearch Integration Sped Up
[Diagram: ETL on EMR and Data Science on Amazon Redshift write Models, Agg Data, and API-Ready Data to S3 Storage, which feeds the Buzzing API]
Since the Amazon Redshift code was run in a Python wrapper, the solution was to push data directly into Elasticsearch.
46. # Convert the big JSON file into bulk-insert compatible (row-by-row) format
$bin/convert_json.php big.json create rowbyrow.json
# Get the mapping file
${aws} s3 cp s3://hearst/es_mapping es_mapping
# Create the new ES index (note the @ prefix so curl sends the file contents)
curl -XPUT http://es.hearst.io/content-web180-sync --data-binary @es_mapping -s
# Perform the bulk API call
curl -XPOST http://es.hearst.io/content-web180-sync/_bulk --data-binary @rowbyrow.json -s
Script to Push to Elasticsearch Directly
Converting one big input JSON file to a row-by-row JSON is a key step for making the data batch compatible.
Use a mapping file to manage the formatting in your index… very important for dates and numeric values that look like strings.
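The convert step could equally be sketched in Python: split one big JSON array into the newline-delimited action/document pairs the Elasticsearch _bulk API expects. The index name and documents below are illustrative, and older ES versions also require a _type in the action line:

```python
import json

def to_bulk(docs, index="content-web180-sync"):
    """Emit one 'index' action line plus one document line per record."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk payload must end with a newline

# Invented stand-in for the big input JSON file
big = [{"url": "a.html", "visits": 12}, {"url": "b.html", "visits": 3}]
payload = to_bulk(big)
print(payload)
```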
47. Final Data Pipeline
[Diagram: Clickstream from a Node.JS app proxying users to Hearst properties → Amazon Kinesis → S3 Storage → ETL on EMR → Amazon Redshift (Models, Agg Data) → Data Science Application → API-Ready Data → Buzzing API]
LATENCY / THROUGHPUT by stage: milliseconds at 100GB/day; 5 seconds at 1G/day; 30 seconds at 5GB/day; 100 seconds at 1G/day
49. Pipeline versions (Transport → Storage → ETL → Storage → Analysis → Storage → Exposure, with latency):
V1: Amazon Kinesis → S3 → EMR-Pig → S3 → EC2-SAS → S3 → EMR to ElasticSearch (1 hour)
Today: Amazon Kinesis → Spark-Scala → S3 → Amazon Redshift → ElasticSearch (<5 min)
Tomorrow: Amazon Kinesis → PySpark + SparkR → ElasticSearch (<2 min)
Lessons learned ("No Duh's"): removing "stoppage" points, speeding up processing, and combining processes all improve latency.
50. Data Science Tool Box
[Diagram: the final pipeline from slide 47, with the Data Science Toolbox spanning the Data, Models, and Amazon Redshift layers]
• IPython Notebook
• On Spark and Amazon Redshift
• Code sharing (and insights)
• User-friendly development environment for data scientists
• Auto-convert .ipynb to .py
52. Next Steps
• Amazon EMR 4.1.0 with Spark 1.5 released, so we can do more with PySpark; look at Apache Zeppelin on Amazon EMR
• Amazon Kinesis just released a new feature to retain data for up to 7 days; we could do more ETL "in the stream"
• Amazon Kinesis Firehose and Lambda – zero touch (no Amazon EC2 maintenance)
• More complex data science that requires…
• Amazon Redshift UDFs
• Python shell that calls Amazon Redshift but also allows for complex statistical methods (e.g., using R or machine learning)
53. Conclusion
• Clickstreams are the new "data currency" of business
• AWS provides great technology to process data
• High speed
• Lower costs – Using Spot…
• Very agile
• Do more with less: this can all be done with a team of 2 FTEs!
• 1 developer (well versed in AWS) + 1 data scientist
54. Call To Action
[Diagram: Ingest → Store → Process → Analyze, from Click to Insight over Time, using Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, KCL Apps, Amazon EMR, and Amazon Redshift]
55. Call To Action
Use Amazon Kinesis, EMR and Amazon Redshift for Clickstream
Open source connectors:
- http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
AWS Big Data blog:
- http://blogs.aws.amazon.com/bigdata/
AWS re:Invent Big Data booth
AWS Big Data Marketplace and Partner ecosystem
Hearst Booth – Hall C1156: Learn more about the interesting things we are doing with data!