
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Hearst Corporation monitors trending content on 250+ sites worldwide, providing metrics to editors and promoting cross-platform content sharing. To facilitate this, Hearst built a clickstream analytics platform on AWS that transmits and processes over 30 TB of data a day using AWS resources such as AWS Elastic Beanstalk, Amazon Kinesis, Spark on Amazon EMR, Amazon S3, Amazon Redshift, and Amazon Elasticsearch. In this session, learn how Hearst designed its clickstream analytics application and how you can use the same architecture to build your own and be ready for the changing world of clickstream data. Dive into how to run Spark Streaming against an Amazon Kinesis stream, use timestamps to cleanse and validate data coming from diverse sources, and see how the system has evolved as the incoming data has changed from HTTP GET requests to RESTful JSON requests. Finally, see how Hearst's data scientists interact with and use the cleansed data provided by the platform to perform ad hoc analyses, develop home-grown algorithms, and create visualizations and dashboards that support Hearst business stakeholders.



  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. BDT306: The Life of a Click – How Hearst Publishing Manages Clickstream Analytics with AWS. Roy Ben-Alta, Business Development Manager, AWS; Rick McFarland, VP of Data Services, Hearst. October 2015
  2. 2. What to Expect from the Session • Common patterns for clickstream analytics • Tips on using Amazon Kinesis and Amazon EMR for clickstream processing • Hearst’s big data journey in building the Hearst analytics stack for clickstream • Lessons learned • Q&A
  3. 3. Clickstream Analytics = Business Value. Verticals/use cases, by accelerated ingest-transform-load to final destination, continual metrics/KPI extraction, and actionable insights:
  • Ad Tech / Marketing Analytics – advertising data aggregation; advertising metrics like coverage, yield, conversion, scoring webpages; user activity engagement analytics, optimized bid/buy engines
  • Consumer Online / Gaming – online customer engagement data aggregation; consumer/app engagement metrics like page views, CTR; customer clickstream analytics, recommendation engines
  • Financial Services – digital assets, improve customer experience on bank website; financial market data metrics; fraud monitoring, value-at-risk assessment, auditing of market order data
  • IoT / Sensor Data – fitness device, vehicle sensor, telemetry data ingestion; wearable sensor operational metrics and dashboards; devices/sensor operational intelligence
  4. 4. DataXu Records
  Apache access log: 68.198.92 - - [22/Dec/2013:23:08:37 -0400] "GET / HTTP/1.1" 200 6394 www.yahoo.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-" | 192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-" | 192.168.72.177 - - [22/Dec/2013:23:32:14 -0400] "GET
  JSON clickstream record: {"cId":"10049","cdid":"5961","campID":"8","loc":"b","ip_address":"174.56.106.10","icctm_ht_athr":"","icctm_ht_aid":"","icctm_ht_attl":"Family Circus","icctm_ht_dtpub":"2011-04-05","icctm_ht_stnm":"SEATTLE POST-INTELLIGENCER","icctm_ht_cnocl":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","ts":"1422839422426","url":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","hash":"d98ace5874334232f6db3e1c0f8be3ab","load":"5.096","ref":"http://www.seattlepi.com/comics-and-games","bu":"HNP","brand":"SEATTLE POST-INTELLIGENCER","ref_type":"SAMESITE","ref_subtype":"SAMESITE","ua":"desktop:chrome"}
  The number of fields is not fixed, tag names change, and records come from multiple pages/sites; the format can be defined as we store the data (Avro, CSV, TSV, JSON).
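To make the schema-on-read point above concrete, here is a minimal Python sketch of pulling a few fields out of a JSON clickstream record like the sample on the slide; it is an illustration only, and the defensive defaults are assumptions rather than Hearst's actual parser.

    import json

    # Abbreviated record in the JSON shape shown on the slide.
    raw = ('{"cId":"10049","ts":"1422839422426",'
           '"url":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus",'
           '"load":"5.096","ua":"desktop:chrome"}')

    record = json.loads(raw)

    # Unlike a fixed-layout Apache access log, fields may be missing or renamed,
    # so read them defensively with .get() instead of positional parsing.
    page_url = record.get("url", "")
    load_secs = float(record.get("load") or 0)
    ts_millis = int(record.get("ts") or 0)

    print(page_url, load_secs, ts_millis)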
  5. 5. Clickstream Analytics Is the New “Hello World” (Hello World → word count → clickstream)
  6. 6. Clickstream Analytics – Common Patterns • Flume → HDFS → Hive → SQL: batch, high latency on retrieval • Flume → HDFS → Hive & Pig: batch, low latency on retrieval • Flume/Sqoop → HDFS → Impala, Spark SQL, Presto, other: more options, batch with lower latency on retrieval
  7. 7. (Architecture diagram: users and web servers feed Amazon Kinesis; a Kinesis-enabled app writes to Amazon S3, which feeds Amazon EMR and Amazon Redshift)
  8. 8. It’s All About the Pace, About the Pace… Big data: • Hourly server logs: were your systems misbehaving 1 hour ago? • Weekly/monthly bill: what you spent this billing cycle • Daily customer-preferences report from your web site’s clickstream: what deal or ad to try next time • Daily fraud reports: was there fraud yesterday? Real-time big data: • Amazon CloudWatch metrics: what went wrong now • Real-time spending alerts/caps: prevent overspending now • Real-time analysis: what to offer the current customer now • Real-time detection: block fraudulent use now
  9. 9. Clickstream Storage and Processing with Amazon Kinesis (diagram): an AWS endpoint feeds an Amazon Kinesis stream whose shards (Shard 1, Shard 2, … Shard N) are replicated across Availability Zones; App 1 aggregates and ingests data to S3 (data lake), App 2 aggregates and ingests data to Amazon Redshift, App 3 performs ETL/ELT and machine learning (EMR), and App N powers a live dashboard; DynamoDB also appears in the diagram.
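As a rough illustration of what one of the consumer "apps" on this slide does at the lowest level, here is a hedged boto3 sketch that walks each shard of a stream and reads a batch of records; the stream name and region are placeholders, and in practice the KCL (with its DynamoDB checkpointing) handles this loop for you.

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region
    stream_name = "clickstream"                                 # placeholder stream name

    shards = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"]
    for shard in shards:
        # Start from the oldest record still retained in the shard.
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]

        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for rec in batch["Records"]:
            print(rec["PartitionKey"], rec["Data"][:80])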
  10. 10. Amazon EMR with Apache Spark • Managed, elastic Hadoop (1.x & 2.x) cluster • Integrates with Amazon S3, Amazon DynamoDB, and Amazon Redshift • Installs Storm, Spark, Hive, Pig, Impala, and end-user tools automatically • Support for Spot Instances • Integrated HBase NoSQL database • Spark stack: Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX
  11. 11. Spot Integration with Amazon EMR aws emr create-cluster --name "Spot cluster" --ami-version 3.3 --instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
  12. 12. Spot Integration with Amazon EMR: a 10-node cluster running for 14 hours. Cost = $1.00/hr × 10 nodes × 14 hours = $140
  13. 13. Resize Nodes with Spot Instances Add 10 more nodes on Spot
  14. 14. Resize Nodes with Spot Instances: a 20-node cluster running for 7 hours. Cost = ($1.00 × 10 × 7) + ($0.50 × 10 × 7) = $70 + $35 = $105 total
  15. 15. Resize Nodes with Spot Instances: 50% less runtime (14 → 7 hours), 25% less cost ($140 → $105)
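The arithmetic behind the last three slides, written out as a short Python check using the slides' illustrative $1.00 On-Demand and $0.50 Spot hourly rates (not current EC2 pricing):

    on_demand_rate, spot_rate = 1.00, 0.50   # $/node-hour, from the slides

    baseline = on_demand_rate * 10 * 14                      # 10 nodes, 14 hours -> $140
    resized = on_demand_rate * 10 * 7 + spot_rate * 10 * 7   # 10 + 10 nodes, 7 hours -> $105

    print(baseline, resized)                   # 140.0 105.0
    print(1 - 7 / 14, 1 - resized / baseline)  # 0.5 less runtime, 0.25 less cost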
  16. 16. Amazon EMR and Amazon Kinesis for Batch and Interactive Processing • Streaming log analysis • Interactive ETL (Diagram: Amazon Kinesis → Amazon EMR → Amazon S3 → Amazon Redshift → BI, plus a separate Amazon EMR cluster on Spot Instances for data scientists)
  17. 17. AWS Elastic Beanstalk – app to push data into Amazon Kinesis
  18. 18. Amazon Kinesis Applications – Tips • Amazon Software License (ASL) linking – add the ASL dependency to your SBT/Maven project (artifactId = spark-streaming-kinesis-asl_2.10) • Shards – include headroom for catching up with data in the stream • Tracking Amazon Kinesis application state (DynamoDB) • Kinesis application : DynamoDB table (1:1), created automatically • Make sure the application name doesn’t conflict with existing DynamoDB tables • Adjust DynamoDB provisioned throughput if necessary (default is 10 reads per second & 10 writes per second)
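For the last tip, a minimal boto3 sketch of raising the provisioned throughput on the KCL's checkpoint table; the table name is hypothetical and must match your Kinesis application name, and the capacity numbers are just examples.

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # placeholder region

    dynamodb.update_table(
        TableName="my-kinesis-app",        # hypothetical; the KCL names the table after the app
        ProvisionedThroughput={
            "ReadCapacityUnits": 50,       # raise from the default 10 if the app falls behind
            "WriteCapacityUnits": 50,
        },
    )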
  19. 19. Spark on Amazon EMR – Tips • Spark is available as an Amazon EMR application from version 3.8.0 onward (no need to run bootstrap actions) • Use Spot Instances for time and cost savings, especially when using Spark • Run in YARN cluster mode (--master yarn-cluster) for production jobs – the Spark driver runs in the application master (high availability) • Data serialization – use Kryo if possible to boost performance (spark.serializer=org.apache.spark.serializer.KryoSerializer)
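A minimal PySpark sketch of the serializer tip; Kryo mainly affects JVM-side serialization, but it is set the same way from Python, while yarn-cluster mode is selected at spark-submit time rather than in code. The application name is a placeholder.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("clickstream-etl")   # placeholder name
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

    sc = SparkContext(conf=conf)
    # ... job logic ...
    # Submit with: spark-submit --master yarn-cluster your_job.py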
  20. 20. The Life of a Click at Hearst • Hearst’s journey with their big data analytics platform on AWS • Demo • Clickstream analysis patterns • Lessons learned
  21. 21. Have you heard of Hearst?
  22. 22. Hearst includes over 200 businesses in over 100 countries around the world • BUSINESS MEDIA operates more than 20 business-to-business properties with significant holdings in the automotive, electronic, medical, and finance industries • MAGAZINES publishes 20 U.S. titles and close to 300 international editions • BROADCASTING comprises 31 television and two radio stations • NEWSPAPERS owns 15 daily and 34 weekly newspapers
  23. 23. Data Services at Hearst – Our Mission • Ensure that Hearst leverages its combined data assets • Unify Hearst’s data streams • Develop a big data analytics platform using AWS services • Promote enterprise-wide product development • Example: a product initiative led by all of Hearst’s editors – Buzzing@Hearst
  24. 24. (Screenshot: Buzzing@Hearst)
  25. 25. Business Value of Buzzing • Instant feedback on articles from our audiences • Incremental re-syndication of popular articles across properties (e.g., trending newspaper articles can be adopted by magazines) • Informs editors which articles are most relevant to our audiences and which channels our audiences use to read them • Ultimately, drives incremental value: 25% more page views and 15% more visitors, which leads to incremental revenue
  26. 26. Engineering Requirements of Buzzing… • Throughput goal: transport data from all 250+ Hearst properties worldwide • Latency goal: click-to-tool in under 5 minutes • Agile: easily add new data fields into the clickstream • Unique metrics requirements defined by the Data Science team (e.g., standard deviations, regressions) • Data reporting windows ranging from 1 hour to 1 week • Front end developed “from scratch,” so data exposed through the API must support the development team’s unique requirements • Most importantly, operation of existing sites cannot be affected!
  27. 27. What we had to work with… a “static” clickstream collection process on many Hearst sites: users to Hearst properties → clickstream → corporate data center (Netezza data warehouse), loaded once per day and used for ad hoc SQL-based reporting and analytics; ~30 GB per day containing basic web log data (e.g., referrer, URL, user agent, cookie). …now how do we get there?
  28. 28. …and we own Hearst’s tag management system (JavaScript on web pages). This gave us access not only to the clickstream but also to the JavaScript code that lives on our websites.
  29. 29. Phase 1 – Ingest Clickstream Data Using AWS • Implement JavaScript on sites that calls an exposed endpoint and passes in query parameters; use the tag manager to easily deploy the JavaScript to all sites • Elastic Beanstalk with Node.js exposes an HTTP endpoint that asynchronously takes the data and feeds it to Amazon Kinesis • Kinesis Client Library (KCL) apps and Kinesis connectors persist the “raw JSON” data to Amazon S3 for durability (Flow: users to Hearst properties → clickstream → Node.js proxy app → Amazon Kinesis → KCL app → S3)
  30. 30. Node.JS – Push clickstream to Amazon Kinesis
function pushToKinesis(data) {
  var params = {
    Data: data,                /* required */
    PartitionKey: guid(),
    StreamName: streamName     /* required */
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) { console.log(err, err.stack); } // an error occurred
  });
}

app.get('/hearstkin.gif', function(req, res) {
  async.series([function(callback) {
    var queryData = url.parse(req.url, true).query;
    queryData.proxyts = new Date().getTime().toString();
    pushToKinesis(JSON.stringify(queryData));
    callback(null);
  }]);
  res.writeHead(200, {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'});
  res.end(imageGIF, 'binary');
});

http.createServer(app).listen(app.get('port'), function() {
  console.log('Express server listening on port ' + app.get('port'));
});
• Asynchronous calls – ensure no interruption to the user experience • Server timestamp – creates a unified timestamp (Amazon Kinesis now offers this out of the box!) • JSON format – helps us downstream • Kinesis partition key – guid() is a good partition key to ensure even distribution across the shards
  31. 31. Ingest Monitoring on AWS: Amazon Kinesis monitoring and AWS Elastic Beanstalk monitoring. Auto Scaling is triggered when NetworkIn exceeds 20 MB, scaling up to 40 instances.
  32. 32. Phase 1- Summary • Use JSON formatting for payloads so more fields can be easily added without impacting downstream processing • HTTP call requires minimal code introduced to the actual site implementations • Flexible to meet rollout and growing demand • Elastic Beanstalk can be scaled • Amazon Kinesis stream can be re-sharded • Amazon S3 provides high durability storage for raw data • Once a reliable, scalable onboarding platform is in place, we can now focus on ETL!
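On the "re-sharded" point, here is a hedged boto3 sketch that splits one open shard roughly in half to add write capacity; the stream name and region are placeholders, and in production you would pick the hot shard rather than simply the first one.

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region
    stream_name = "clickstream"                                 # placeholder stream name

    shard = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]
    lo = int(shard["HashKeyRange"]["StartingHashKey"])
    hi = int(shard["HashKeyRange"]["EndingHashKey"])

    # Split the shard at the midpoint of its hash key range.
    kinesis.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str((lo + hi) // 2),
    )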
  33. 33. Phase 2a – Data Processing, First Version (EMR): ETL on Amazon EMR turns “raw JSON” raw data into clean aggregate data • Amazon EMR was chosen initially for processing because of the ease of creating Amazon EMR clusters … and Pig because we knew how to code in Pig Latin • 50+ UDFs were written in Python … also because we knew Python
  34. 34. Processing Clickstream Data with Pig – unfortunately, Pig was not performing well (15-minute latency)
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
REGISTER '/home/hadoop/PROD/parser.py' USING jython as pyudf;
REGISTER '/home/hadoop/PROD/refclean.py' USING jython AS refs;
AA0 = load 's3://BUCKET_NAME/rawjson/datadata.tsv.gz' using TextLoader as (line:chararray);
A0 = FILTER AA0 BY (pyudf.get_obj(line,'url') MATCHES '.*(a\\d+|g\\d+).*');
A1 = FOREACH A0 GENERATE
    pyudf.urlclean(pyudf.get_obj(line,'url')) as url:chararray,
    pyudf.get_obj(line,'hash') as hash:chararray,
    pyudf.get_obj(line,'icxid') as icxid:chararray,
    pyudf.pubclean(pyudf.get_obj(line,'icctm_ht_dtpub')) as pubdt:chararray,
    pyudf.get_obj(line,'icctm_ht_cnocl') as cnocl:chararray,
    pyudf.get_obj(line,'icctm_ht_athr') as author:chararray,
    pyudf.get_obj(line,'icctm_ht_attl') as title:chararray,
    pyudf.get_obj(line,'icctm_ht_aid') as cms_id:chararray,
    pyudf.num1(pyudf.get_obj(line,'mxdpth')) as mxdpth:double,
    pyudf.num2(pyudf.get_obj(line,'load')) as topsecs:double,
    refs.classy(pyudf.get_obj(line,'url'),1) as bu:chararray,
    pyudf.get_obj(line,'ip_address') as ip:chararray,
    pyudf.get_obj(line,'img') as img:chararray;
• Gzip your output • Regex in Pig! • Python imports are limited to what is allowed by Jython
  35. 35. Phase 2b – Data Processing (Spark Streaming): Amazon Kinesis → ETL on EMR → clean aggregate data • Welcome, Apache Spark – one framework for batch and real time • Benefit – the same code is used for batch and real-time ETL • Use Spot Instances – cost savings • Drawback – Scala!
  36. 36. Using SQL with Scala (Spark SQL) – since we knew SQL, we wrote Scala with embedded SQL queries
Configuration:
endpointUrl = kinesis.us-west-2.amazonaws.com
streamName = hearststream
outputLoc.json.streaming = s3://hearstkinesisdata/processedsparkjson
window.length = 300
sliding.interval = 300
outputLimit = 5000000
query1Table = hearst1
query1 = SELECT simplestartq(proxyts, 5) as startq, urlclean(url) as url, hash, icxid, pubclean(icctm_ht_dtpub) as pubdt, classy(url,1) as bu, ip_address as ip, artcheck(classy(url,1),url) as artcheck, ref_type(ref,url) as ref_type, img, wc, contentSource FROM hearst1

val jsonRDD = sqlContext.jsonRDD(rdd1)
jsonRDD.registerTempTable(query1Table.trim)
val query1Result = sqlContext.sql(query1)//.limit(outputLimit.toInt)
query1Result.registerTempTable(query2Table.trim)
val query2Result = sqlContext.sql(query2)
query2Result.registerTempTable(query3Table.trim)
val query3Result = sqlContext.sql(query3).limit(outputLimit.toInt)
val outPartitionFolder = UDFUtils.output60WithRolling(slidingInterval.toInt)
query3Result.toJSON.saveAsTextFile("%s/%s".format(outputLocJSON, outPartitionFolder), classOf[org.apache.hadoop.io.compress.GzipCodec])
logger.info("New JSON file written to " + outputLoc + "/" + outPartitionFolder)
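The Hearst job above is Scala, but the deck later mentions moving to PySpark; as a rough, hedged PySpark sketch of just the Kinesis hookup, using the endpoint, stream name, and 300-second interval from the configuration above (it assumes the spark-streaming-kinesis-asl package is supplied to spark-submit, and the application name is a placeholder):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

    sc = SparkContext(appName="clickstream-streaming")   # placeholder app name
    ssc = StreamingContext(sc, batchDuration=300)        # 300-second batches, as configured above

    records = KinesisUtils.createStream(
        ssc, "clickstream-streaming", "hearststream",
        "https://kinesis.us-west-2.amazonaws.com", "us-west-2",
        InitialPositionInStream.LATEST, 300)             # checkpoint interval in seconds

    # Each record is a JSON string; the real job parses it and runs the SQL shown above.
    records.count().pprint()

    ssc.start()
    ssc.awaitTermination()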
  37. 37. Python UDF versus Scala
Python:
def artcheck(bu, url):
    try:
        if url and bu:
            cleanurl = url[0:url.find("?")].strip('/')
            tailurl = url[findnth(url, '/', 3)+1:url.find("?")].strip('/')
            revurl = cleanurl[::-1]
            root = revurl[0:revurl.find('/')][::-1]
            if (bu == 'HMI' or bu == 'HMG') and re.compile(r'a\d+|g\d+').search(tailurl) != None:
                return 'T'
            elif bu == 'HTV' and root.isdigit() == True and re.compile('/search/').search(cleanurl) == None:
                return 'T'
            elif bu == 'HNP' and re.compile('blog|fuelfix').search(url) != None and re.compile(r'\S*[0-9]{4,4}/[0-9]{2,2}/[0-9]{2,2}\S*').search(tailurl) != None:
                return 'T'
            elif bu == 'HNP' and re.compile('businessinsider').search(url) != None and re.compile(r'\S*[0-9]{4,4}-[0-9]{2,2}').search(root) != None:
                return 'T'
            elif bu == 'HNP' and re.compile('blog|fuelfix|businessinsider').search(url) == None and re.compile(r'\.php').search(url) != None:
                return 'T'
            else:
                return 'F'
        else:
            return 'F'
    except:
        return 'F'

Scala:
def artcheck(bu: String, url: String) = {
  try {
    val cleanurl = UDFUtils.utilurlclean(url.trim).stripSuffix("/")
    val pathClean = UDFUtils.pathURI(cleanurl)
    val lastContext = pathClean.split("/").last
    var resp = "F"
    if (("HMI" == bu || "HMG" == bu) && Pattern.compile("/a\\d+|/g\\d+").matcher(pathClean).find()) resp = "T"
    else if ("HTV" == bu && StringUtils.isNumeric(lastContext) && !cleanurl.contains("/search/")) resp = "T"
    else if ("HNP" == bu && Pattern.compile("blog|fuelfix").matcher(url).find() && Pattern.compile("\\d{4}/\\d{2}/\\d{2}").matcher(pathClean).find()) resp = "T"
    else if ("HNP" == bu && Pattern.compile("businessinsider").matcher(url).find() && Pattern.compile("\\d{4}-\\d{2}").matcher(lastContext).find()) resp = "T"
    else if ("HNP" == bu && !Pattern.compile("blog|fuelfix|businessinsider").matcher(url).find() && Pattern.compile("\\.php").matcher(url).find()) resp = "T"
    resp
  } catch {
    case e: Exception => "F"
  }
}
Don’t be intimidated by Scala … if you know Python, the syntax can be similar: re.compile(...) ↔ Pattern.compile(...); try/except ↔ try/catch.
  38. 38. Phase 3a – Data Science! Data science on EC2: Amazon Kinesis → ETL on EMR → clean aggregate data → API-ready data • We initially decided to do our data science with SAS on Amazon EC2 because it could handle both data manipulation and complex data science techniques (e.g., regressions) • Great for exploration and initial development • Performing data science this way took 3–5 minutes to complete
  39. 39. SAS Code Example data _null_; call system("aws s3 cp s3://BUCKET_NAME/file.gz /home/ec2-user/LOGFILES/file.gz"); run; FILENAME IN pipe "gzip -dc /home/ec2-user/LOGFILES/file.gz" lrecl=32767; data temp1; FORMAT startq DATETIME19.; infile IN delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=1; input startq :YMDDTTM. url :$1000. pageviews :best32. visits :best32. author :$100. cms_id :$100. img :$1000. title :$1000.; run; Use pipe to read in S3 data and keep it compressed proc sql; CREATE TABLE metrics AS SELECT url FORMAT=$1000., SUM(pageviews) as pageviews, SUM(visits) as visits, SUM(fvisits) as fvisits, SUM(evisits) as evisits, MIN(ttct) as rec, COUNT(distinct startq) as frq, AVG(visits) as avg_visits_pp, SUM(visits1) as visits_soc, SUM(visits2) as visits_dir, SUM(visits3) as visits_int, SUM(visits4) as visits_sea, SUM(visits5) as visits_web, SUM(visits6) as visits_nws, SUM(visits7) as visits_pd, SUM(visits8) as visits_soc_fb, SUM(visits9) as visits_soc_tw, SUM(visits10) as visits_soc_pi, SUM(visits11) as visits_soc_re, SUM(visits12) as visits_soc_yt, SUM(visits13) as visits_soc_su, SUM(visits14) as visits_soc_gp, SUM(visits15) as visits_soc_li, SUM(visits16) as visits_soc_tb, SUM(visits17) as visits_soc_ot, CASE WHEN (SUM(v1) - SUM(v3) ) > 20 THEN ( SUM(v1) - SUM(v3) ) / 2 ELSE 0 END as trending FROM temp1 GROUP BY 1; Use PROC SQL when possible for easier translation to Amazon Redshift for production later on.
  40. 40. Phase 3b – Split Data Science into Development and Production • Once the data science models were established, we split modeling and production • Production was moved to Amazon Redshift, which could read data from Amazon S3 and process it much faster • Data science processing time went down to 100 seconds! • Use S3 to store the data science models and apply them in Amazon Redshift • Data science “development” stays on EC2; statistical models run once per day (Flow: Amazon Kinesis → ETL on EMR → clean aggregate data → data science “production” on Amazon Redshift → API-ready data; models and agg data in S3)
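A hedged sketch of the Redshift side of this phase: bulk-loading the gzipped JSON that the Spark job writes to S3 into a staging table with COPY. The S3 prefix follows the Spark output location shown earlier and the table name follows the query on the next slide; the connection details and IAM role are placeholders.

    import psycopg2

    # Placeholder connection details for the Redshift cluster.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="...")

    copy_sql = """
        COPY kinesis_hits
        FROM 's3://hearstkinesisdata/processedsparkjson/'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/redshift-copy'
        GZIP FORMAT AS JSON 'auto';
    """

    # The connection context manager commits the transaction on success.
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)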
  41. 41. select clean_url as url, trim(substring(max(proxyts||domain) from 20 for 1000)) as domain, trim(substring(max(proxyts||clean_cnocl) from 20 for 1000)) as cnocl, trim(substring(max(proxyts||img) from 20 for 1000)) as img, trim(substring(max(proxyts||title) from 20 for 1000)) as title, trim(substring(max(proxyts||section) from 20 for 1000)) as section, approximate count(distinct ic_fpc) as visits, count(1) as hits from kinesis_hits where bu='HMG' and (article_id is not null or author is not null or title is not null) group by 1; Amazon Redshift Code Example Cool trick to find the most recent value of a character field in one pass through the data
  42. 42. Phase 4a – Elasticsearch Integration. Since we already had the Amazon EMR cluster running, we used a handy Pig jar that made it easy to push data to Elasticsearch. (Flow: ETL on EMR → S3 storage → data science on Amazon Redshift → S3 storage (models, agg data, API-ready data) → Amazon EMR push → Buzzing API)
  43. 43. Pig Code – Push to ES Example
REGISTER /home/hadoop/pig/lib/piggybank.jar;
REGISTER /home/hadoop/PROD/elasticsearch-hadoop-2.0.2.jar;
DEFINE EsStorageDEV org.elasticsearch.hadoop.pig.EsStorage ('es.nodes = es-dev.hearst.io', 'es.port = 9200', 'es.http.timeout = 5m', 'es.index.auto.create = true');
SECTIONS = load 's3://hearstkinesisdata/ss.tsv' USING PigStorage('\t') as (sectionid:chararray, cnt:long, visits:long, sectionname:chararray);
STORE SECTIONS INTO 'content-sections-sync/content-sections-sync' USING EsStoragePROD;
• Use the handy Pig jar to push data to Elasticsearch • The “Amazon EMR overhead” required to read small files added 2 minutes to latency
  44. 44. Phase 4b – Elasticsearch Integration Sped Up. Since the Amazon Redshift code already ran inside a Python wrapper, the solution was to push data directly into Elasticsearch from that wrapper. (Flow: ETL on EMR → data science on Amazon Redshift → S3 storage (models, agg data, API-ready data) → Buzzing API)
  45. 45. Script to Push to Elasticsearch Directly
# Converting file into bulk-insert compatible format
$bin/convert_json.php big.json create rowbyrow.json
# Get mapping file
${aws} s3 cp s3://hearst/es_mapping es_mapping
# Creating new ES index
curl -XPUT http://es.hearst.io/content-web180-sync --data-binary @es_mapping -s
# Performing bulk API call
curl -XPOST http://es.hearst.io/content-web180-sync/_bulk --data-binary @rowbyrow.json -s
• Converting one big input JSON file to row-by-row JSON is a key step for making the data bulk-API compatible • Use a mapping file to manage the formatting in your index … very important for dates and numeric values that look like strings
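A hedged Python version of the same steps, in case the conversion and bulk call are done inside the wrapper itself: it turns one big JSON array into the row-by-row action/document format the bulk API expects and POSTs it. The endpoint is the one on the slide; the file name and the document type are assumptions.

    import json
    import requests

    es_index_url = "http://es.hearst.io/content-web180-sync"   # endpoint from the slide

    with open("big.json") as f:
        docs = json.load(f)                      # assumes one big JSON array of documents

    lines = []
    for doc in docs:
        # Document type is an assumption (a type was required in the Elasticsearch 1.x/2.x era).
        lines.append(json.dumps({"index": {"_type": "content"}}))  # bulk action line
        lines.append(json.dumps(doc))                              # document line
    body = "\n".join(lines) + "\n"               # the bulk API requires a trailing newline

    resp = requests.post(es_index_url + "/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()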
  46. 46. Final Data Pipeline: users to Hearst properties → clickstream → Node.js proxy app → Amazon Kinesis → S3 storage → ETL on EMR → data science application on Amazon Redshift (models, agg data) → API-ready data → Buzzing API. Per-stage latency/throughput callouts: 100 seconds (1 GB/day), 30 seconds (5 GB/day), 5 seconds (1 GB/day), milliseconds (100 GB/day).
  47. 47. A more “visual” representation of our pipeline! (Clickstream data → Amazon Kinesis → ETL → data science on Amazon Redshift → results API)
  48. 48. Lessons learned (“no duh’s”): removing “stoppage” points, speeding up processing, and combining processes all improve latency.
  • V1: Amazon Kinesis → S3 → EMR (Pig) → S3 → EC2 (SAS) → S3 → EMR to Elasticsearch; latency: 1 hour
  • Today: Amazon Kinesis → Spark (Scala) → S3 → Amazon Redshift → Elasticsearch; latency: <5 min
  • Tomorrow: Amazon Kinesis → PySpark + SparkR → Elasticsearch; latency: <2 min
  49. 49. Data Science Toolbox • IPython Notebook on Spark and Amazon Redshift • Code sharing (and insights) • User-friendly development environment for data scientists • Auto-convert .ipynb → .py (The toolbox sits alongside the pipeline: users to Hearst properties → clickstream → Node.js proxy app → Amazon Kinesis → S3 storage → ETL on EMR → data science application on Amazon Redshift (models, agg data) → API-ready data → Buzzing API)
  50. 50. Data Science at Hearst – Notebook
  51. 51. Next Steps • Amazon EMR 4.1.0 with Spark 1.5 has been released, so we can do more with PySpark; look at Apache Zeppelin on Amazon EMR • Amazon Kinesis just released a new feature to retain data for up to 7 days – we could do more ETL “in the stream” • Amazon Kinesis Firehose and AWS Lambda – zero touch (no Amazon EC2 maintenance) • More complex data science that requires Amazon Redshift UDFs, and a Python shell that calls Amazon Redshift but also allows for complex statistical methods (e.g., using R or machine learning)
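For the 7-day retention item, a minimal boto3 sketch; the retention setting is applied per stream (the default is 24 hours), and the stream name reuses the one from the earlier Spark configuration.

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-west-2")

    kinesis.increase_stream_retention_period(
        StreamName="hearststream",      # stream name from the earlier configuration
        RetentionPeriodHours=168,       # 7 days (default retention is 24 hours)
    )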
  52. 52. Conclusion • Clickstreams are the new “data currency” of business • AWS provides great technology to process data • High speed • Lower costs – Using Spot… • Very agile • Do more with less: this can all be done with a team of 2 FTEs! • 1 developer (well versed in AWS) + 1 data scientist
  53. 53. Call to Action – from click to insight over time: Ingest → Store → Process → Analyze (Amazon Kinesis, KCL apps, AWS Lambda, Amazon S3, Amazon DynamoDB, Amazon RDS (Aurora), Amazon EMR, Amazon Redshift)
  54. 54. Call to Action – Use Amazon Kinesis, EMR, and Amazon Redshift for clickstream • Open-source connectors: http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html • AWS Big Data blog: http://blogs.aws.amazon.com/bigdata/ • AWS re:Invent Big Data booth • AWS Big Data Marketplace and partner ecosystem • Hearst booth (Hall C1156): learn more about the interesting things we are doing with data!
  55. 55. Remember to complete your evaluations!
  56. 56. Thank you!
