Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interactive - Michael Sun - CBSi
 

  • CBSi has a number of brands; this slide shows the biggest ones.
  • We have a lot of traffic and data. We’ve been using Hadoop quite extensively for a few years now.

Presentation Transcript

  • Building Web Analytics on Hadoop at CBS Interactive. Michael Sun, [email_address]. Hadoop World, November 8, 2011
  • Deep Thoughts: $1 What is cloud computing? Convenient, on-demand, scalable, network-accessible service over a shared pool of computing resources. $2 What is fog computing? Vaporware. $5 What is vapor computing? Local cloud computing.
  • About Me
    • Lead Software Engineer
    • Manager of Data Warehouse Operations
    • Background in Software Systems Architecture, Design and Development
    • Expertise in data warehousing, data analytics, statistics, databases, and distributed systems
    • Ph.D. in Applied Physics from Harvard University
  • Brands and Websites of CBS Interactive (samples): GAMES & MOVIES, TECH, BIZ & NEWS, SPORTS, ENTERTAINMENT, MUSIC
  • CBSi Scale
    • Top 20 global web property
    • 235M worldwide monthly unique users
    • Hadoop Cluster size:
      • Current workers: 40 nodes (260 TB)
      • This month: add 24 nodes, 800 TB total
      • Next quarter: ~ 80 nodes, ~ 1 PB
    • DW peak processing: > 500M events/day globally, doubling next quarter (ad logs)
    1 - Source: comScore, March 2011
  • Web Analytics Processing
    • Collect web logs for web metrics analysis
      • Web logs track clicks, page views, downloads, streaming video events, ad events, etc.
    • Provide internal metrics for web sites monitoring
    • A/B testing
    • Billers apps, external reporting
    • Ad event tracking to support sales
    • Provide data service
      • Support marketing by providing data for data mining
      • User-centric datastore (stay tuned)
      • Optimize user experience
  • Modernize the platform
    • Web log processing on a proprietary platform ran into its limits
      • The code base was 10 years old
      • The version we used is no longer supported by the vendor
      • Not fault-tolerant
      • Upgrading to the newer version was not cost-effective
    • Data volume is increasing all the time
      • 300+ web sites
      • Video tracking increasing the fastest
      • Support new business initiatives
    • Use open source systems as much as possible
  • Hadoop to the Rescue / Research
    • Open-source: scalable data processing framework based on MapReduce
    • Processing PBs of data using the Hadoop Distributed File System (HDFS)
      • High throughput
      • Fault-tolerant
    • Distributed computing model
      • Based on a functional programming model
        • MapReduce (M|S|R)
    • Execution engine
      • Used as a cluster for ETL
      • Collect data (distributed harvester)
      • Analyze data (M/R, streaming + scripting + R, Pig/Hive)
      • Archive data (distributed archive)
  • The Plan
    • Build web logs collection (codename Fido)
      • Apache web log piped to cronolog
      • Hourly M/R collector job to
        • Gzip hourly log files & checksum
        • Scp from web servers to Hadoop datanodes
        • Put on HDFS
    • Build Python ETL framework (codename Lumberjack)
      • Based on stdin/stdout streaming; one process, one thread
      • Can run stand-alone or on Hadoop
      • Pipeline
      • Filter
      • Schema
    • Build web log processing with Lumberjack
      • Parse
      • Sessionize
      • Lookup
      • Format data/Load to DB
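  A minimal sketch of one hourly Fido collection step described in the plan above (gzip, checksum, scp, put on HDFS); this is not the actual Fido code, and the host names, directories, and md5 checksum choice are assumptions for illustration:
    # Hypothetical sketch of one hourly collection step (not the actual Fido code).
    # Host names, directories, and the md5 checksum choice are assumptions.
    import os
    import subprocess

    def collect_hourly_log(web_host, remote_dir, log_name, local_dir, hdfs_dir):
        """Gzip one rotated hourly Apache log on the web server, copy it to a
        Hadoop datanode, verify its checksum, and put it on HDFS."""
        gz = log_name + '.gz'
        # Compress the hourly log and write a checksum next to it (relative to remote_dir).
        subprocess.check_call(['ssh', web_host,
                               'cd %s && gzip -f %s && md5sum %s > %s.md5' % (remote_dir, log_name, gz, gz)])
        # Pull the compressed log and its checksum onto this datanode.
        subprocess.check_call(['scp',
                               '%s:%s/%s' % (web_host, remote_dir, gz),
                               '%s:%s/%s.md5' % (web_host, remote_dir, gz),
                               local_dir])
        # Verify the transfer, then load the file into HDFS.
        subprocess.check_call(['md5sum', '-c', gz + '.md5'], cwd=local_dir)
        subprocess.check_call(['hadoop', 'fs', '-put', os.path.join(local_dir, gz), hdfs_dir])

    # Example (hypothetical arguments):
    # collect_hourly_log('web01.example.com', '/var/log/apache', 'access.2011110812.log',
    #                    '/data/fido/incoming', '/weblogs/raw/2011/11/08/12')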
  • Web Analytics on Hadoop (architecture diagram): Apache logs from the sites are distributed to HDFS by Fido; Python-ETL, MapReduce, and Hive process them on the Hadoop cluster; results feed the DW database, web metrics, billers, data mining, and CMS systems; external data sources also land in HDFS
  • The Python ETL Framework Lumberjack on Hadoop
    • Written in Python
    • Foundation Classes
      • Pipeline, stdin/stdout streaming, consisting of connected Filters
      • Schema, metadata describing the data being sent between Filters
        • String, for encoding/decoding
        • Datetime, timezone
        • Numeric, validation
        • Null handling
      • Filter, stage in a Pipeline
      • Pipe, connecting Filters
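  To make the Schema idea concrete, the toy sketch below shows the kind of coercion and validation a schema field performs (null handling, integer range checks, datetime parsing). The class and method names are invented for illustration and are not the Lumberjack API; the real SchemaField usage appears in the example schema slide below.
    # Illustrative toy only -- not the Lumberjack Schema API. The field parameters
    # (default, signed, bits, io_format, on_null) mirror those visible in the real
    # example schema shown later in this deck.
    from datetime import datetime

    class ToySchemaField(object):
        def __init__(self, name, kind, default=None, signed=True, bits=32,
                     io_format='%Y-%m-%d %H:%M:%S', on_null='strict'):
            self.name, self.kind, self.default = name, kind, default
            self.signed, self.bits = signed, bits
            self.io_format, self.on_null = io_format, on_null

        def coerce(self, raw):
            # Null handling: either fail fast ('strict') or substitute the default.
            if raw in (None, '', '-'):
                if self.on_null == 'strict':
                    raise ValueError('%s: null not allowed' % self.name)
                return self.default
            if self.kind == 'int':
                value = int(raw)
                low = -(2 ** (self.bits - 1)) if self.signed else 0
                high = 2 ** (self.bits - 1) - 1 if self.signed else 2 ** self.bits - 1
                if not low <= value <= high:
                    raise ValueError('%s: %d out of range' % (self.name, value))
                return value
            if self.kind == 'datetime':
                return datetime.strptime(raw, self.io_format)
            return raw  # strings pass through (encoding handling omitted here)

    # Example: a 32-bit unsigned field like client_ip_addr in the real schema.
    ip_field = ToySchemaField('client_ip_addr', 'int', default=0, signed=False, bits=32)
    print(ip_field.coerce('3232235777'))  # 192.168.1.1 packed as an unsigned int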
  • The Python ETL Framework Lumberjack on Hadoop
    • Filter
      • Extract
        • DelimitedFileInput
        • DbQuery
        • RegexpFileInput
      • Transform
        • Expression
        • Lookup
        • Regex
        • Range Lookup
      • Load
        • DelimitedFileOutput
        • DbLoader
        • PickledFileOutput
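  The Extract/Transform/Load filters above get composed into a Pipeline with the | operator, as the real example on the next slide shows. The self-contained toy below illustrates that general pattern (each filter processes one record at a time and filters are chained); all class names here are invented for illustration and do not match the actual Lumberjack API.
    # Toy illustration of the filter-chaining pattern, not the Lumberjack API.
    import re
    import sys

    class ToyFilter(object):
        def __or__(self, nxt):               # chain filters with |
            nxt.upstream = self
            return nxt
        def records(self):
            for rec in self.upstream.records():
                out = self.process(rec)
                if out is not None:
                    yield out
        def process(self, rec):
            return rec

    class DelimitedInput(ToyFilter):         # Extract: tab-delimited lines -> dicts
        def __init__(self, stream, fields):
            self.stream, self.fields = stream, fields
        def records(self):
            for line in self.stream:
                yield dict(zip(self.fields, line.rstrip('\n').split('\t')))

    class RegexDropFilter(ToyFilter):        # Transform: drop records matching a pattern
        def __init__(self, field, pattern):
            self.field, self.pattern = field, re.compile(pattern)
        def process(self, rec):
            return None if self.pattern.search(rec.get(self.field, '')) else rec

    class DelimitedOutput(ToyFilter):        # Load: dicts -> tab-delimited lines
        def __init__(self, stream, fields):
            self.stream, self.fields = stream, fields
        def run(self):
            for rec in self.records():
                self.stream.write('\t'.join(rec[f] for f in self.fields) + '\n')

    # Drop obvious bot traffic, then re-emit the surviving records.
    pipeline = (DelimitedInput(sys.stdin, ['ip', 'url', 'user_agent'])
                | RegexDropFilter('user_agent', r'(?i)bot|crawler|spider')
                | DelimitedOutput(sys.stdout, ['ip', 'url', 'user_agent']))
    pipeline.run()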
  • The Python ETL Framework Lumberjack on Hadoop, Example Python schema
    schema = Schema((
        SchemaField(u'anon_cookie', 'unicode', False, default=u'-', maxlen=19,
                    io_encoding='utf-8', encoding_errors='replace', cache_size=0,
                    on_null='strict', on_type_error='strict', on_range_error='strict'),
        SchemaField(u'client_ip_addr', 'int', False, default=0, signed=False, bits=32,
                    precision=None, cache_size=1,
                    on_null='strict', on_type_error='strict', on_range_error='strict'),
        SchemaField(u'session_id', 'int', False, default=-1, signed=True, bits=64,
                    precision=None, cache_size=1,
                    on_null='strict', on_type_error='strict', on_range_error='strict'),
        SchemaField(u'session_start_dt_ht', 'datetime', False, default='1970-01-01 08:00:00',
                    timezone='PRC', io_format='%Y-%m-%d %H:%M:%S', cache_size=-1,
                    on_null='strict', on_type_error='strict', on_range_error='strict'),
        SchemaField(u'session_start_dt_ut', 'datetime', False, default='1970-01-01 00:00:00',
                    timezone='UTC', io_format='%Y-%m-%d %H:%M:%S', cache_size=-1,
                    on_null='strict', on_type_error='strict', on_range_error='strict'),
    ))
  • The Python ETL Framework Lumberjack on Hadoop, Example Pipeline
    pl = etl.Pipeline(
        etl.file.DelimitedInput(stdin, input_schema,
            drop=('empty1', 'empty2', 'bytes_sent'),
            errors_policy=etl.file.DelimitedInputSkipAndCountErrors(
                etl.counter.HadoopStreamingCounter('Dropped Records', group='DelimitedInput')))
        | etl.transform.StringCleaner()
        | dw_py_mod_utils.IPFilter('ip_address', 'CNET_Exclude_Addresses.txt', None, None)
        | cnclear_parser.CnclearParser()
        | etl.transform.StringCleaner(fields=('xref', 'src_url', 'title'))
        | parse_url.ParseURL('src_url')
        | event_url_title_cleaner.CNClearURLTitleCleaner()
        | dw_py_mod_utils.Sha1Hash(['user_agent', 'skc_url', 'title'])
        | parse_ua.ParseUA('user_agent')
        | etl.transform.ProjectToOutputSchema(output_schema)
        | etl.file.DelimitedOutput(stdout)
    )
    pl.setup()
    pl.run()
  • The Python ETL Framework Lumberjack on Hadoop
      • Based on Hadoop streaming
      • Written in Python
      • Mapper and Reducer as pipeline (stdin/stdout streaming)
    • Mapper for all transformations
      • Expression
      • Lookup using Tokyo Cabinet
      • Range lookup
      • Regex
    • Use the shuffle phase (in between M/R) for sorting
      • Aggregation by Reducer
      • e.g., Sessionize (covered in detail below)
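  One way a streaming mapper can do fast exact-key lookups (the deck mentions Tokyo Cabinet for this) is to load a lookup table that ships with the job into a local key-value store. The sketch below substitutes a plain in-memory dict for Tokyo Cabinet, since the binding details are not shown in the deck; the file name and column positions are assumptions.
    # Sketch of an exact-key lookup inside a streaming mapper. The deck uses Tokyo
    # Cabinet for this; a plain dict stands in here. The lookup file name and the
    # column layout are assumptions (the file would be shipped to the task, e.g. via -file).
    import sys

    def load_lookup(path):
        """Load 'site_id<TAB>product_key' rows into a dict."""
        table = {}
        with open(path) as f:
            for line in f:
                site_id, product_key = line.rstrip('\n').split('\t')
                table[site_id] = product_key
        return table

    if __name__ == '__main__':
        product_keys = load_lookup('product_key_lookup.txt')    # assumed side file
        for line in sys.stdin:
            fields = line.rstrip('\n').split('\t')
            site_id = fields[0]                                  # assumed column position
            fields.append(product_keys.get(site_id, '-'))        # append the looked-up product key
            sys.stdout.write('\t'.join(fields) + '\n')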
  • Web log Processing by Hadoop Streaming and Python-ETL
    • Parsing web logs
      • IAB filtering and checking
      • Parsing user agents by regex
      • IP range lookup
      • Look up product key, etc.
    • Sessionization
      • Prepare Sessionize
      • Sessionize
      • Filter-unpack
    • Process huge dimensions, URL/Page Title
    • Load Facts
      • Format load data / load data into the DB
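  Two of the parsing steps above (user-agent parsing by regex and IP range lookup) are lookup problems; the sketch below shows the common technique for a range lookup: binary search over a sorted table of (range_start, range_end, label) rows. The ranges and labels are made up for illustration, and this is not CBSi's actual lookup code.
    # Illustrative range-lookup sketch (not the actual CBSi lookup code).
    import bisect
    import socket
    import struct

    def ip_to_int(ip):
        """Convert a dotted-quad IPv4 address to an unsigned 32-bit int."""
        return struct.unpack('!I', socket.inet_aton(ip))[0]

    # Hypothetical ranges; a real table would be loaded from a dimension file.
    RANGES = sorted([
        (ip_to_int('10.0.0.0'),    ip_to_int('10.255.255.255'),  'internal'),
        (ip_to_int('192.168.0.0'), ip_to_int('192.168.255.255'), 'internal'),
        (ip_to_int('8.8.8.0'),     ip_to_int('8.8.8.255'),       'external'),
    ])
    STARTS = [r[0] for r in RANGES]

    def range_lookup(ip):
        value = ip_to_int(ip)
        i = bisect.bisect_right(STARTS, value) - 1   # last range starting at or before value
        if i >= 0 and value <= RANGES[i][1]:
            return RANGES[i][2]
        return 'unknown'

    print(range_lookup('192.168.1.1'))   # internal
    print(range_lookup('8.8.8.8'))       # external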
  • Sessionize on Hadoop in Detail
    • Group web events (page impressions, clicks, video tracking, etc.) into user sessions based on a set of business rules, e.g. a 30-minute timeout
    • Enable analysis of user behavior patterns
    • Gather session-level facts
  • Input to Sessionize: take the parsed output data of these event types: page impression, click-payable, click-nonpayable, video tracking, and optimization events
  • Prep-sessionize (Mapper)
    • Pre-sessionize lookups
    • Normalize event records of all event types to the same schema
      • Order fields in the same sequence for all event types
    • IP range lookup
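  Because the next step relies on Hadoop's shuffle to sort events, the prep-sessionize mapper has to put the sort keys (anon_cookie, then the event timestamp) at the front of each output line. A minimal sketch of that key emission follows; the input column layout is assumed for illustration, not the actual normalized schema.
    #!/usr/bin/env python
    # Minimal sketch of the key emission a prep-sessionize mapper does so the shuffle
    # can sort by (anon_cookie, event_dt_ht). Input column positions are assumptions.
    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 3:
            continue                        # skip malformed records
        anon_cookie, event_dt_ht = fields[0], fields[1]
        payload = '\t'.join(fields[2:])     # the rest of the normalized event record
        # The first two tab-separated fields form the composite sort key
        # (see the streaming options on the next slide).
        sys.stdout.write('%s\t%s\t%s\n' % (anon_cookie, event_dt_ht, payload))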
  • Sorting before Sessionize
    • Sort events by anon_cookie (the same user) + event_dt_ht, forming a time-ordered event stream per user
    • Hadoop streaming handles the sorting, given these options:
      • -D stream.num.map.output.key.fields=2
      • -D mapred.text.key.partitioner.options=-k1
      • -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
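  Those options go straight to the Hadoop Streaming jar: the key has two fields (cookie, then timestamp) so the shuffle delivers each user's events in time order, and the KeyFieldBasedPartitioner keys partitioning on the cookie so one user's events all reach the same reducer. Below is a hedged sketch of how such a job might be launched from Python; the jar location, script names, and HDFS paths are assumptions, while the -D options are the ones listed above.
    # Sketch of launching the sessionize step as a Hadoop Streaming job.
    # The jar location, script names, and HDFS paths are assumptions for illustration.
    import subprocess

    STREAMING_JAR = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar'  # assumed path

    subprocess.check_call([
        'hadoop', 'jar', STREAMING_JAR,
        '-D', 'stream.num.map.output.key.fields=2',
        '-D', 'mapred.text.key.partitioner.options=-k1',
        '-partitioner', 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner',
        '-input', '/dw/parsed_events/2011/11/08',      # assumed HDFS path
        '-output', '/dw/sessionized/2011/11/08',       # assumed HDFS path
        '-mapper', 'prep_sessionize_mapper.py',        # assumed script name
        '-reducer', 'sessionize_reducer.py',           # assumed script name
        '-file', 'prep_sessionize_mapper.py',
        '-file', 'sessionize_reducer.py',
    ])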
  • Sessionize (Reducer)
    • Apply sessionize business rules
    • Assign session_id from a distributed sequence
    • Summarize to gather session-level facts; join the session facts with the web events
    • Assemble all output rows into one output stream (web events, sessions, rejects) by simply extending the fields:
      • reject_flag to mark rows rejected
      • event_type (session) indicator field
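  As a concrete illustration of the 30-minute rule, the sketch below shows the core of a streaming sessionize reducer: read events already sorted by (anon_cookie, event time), start a new session whenever the user changes or the inactivity gap exceeds the timeout, and tag each event with its session. It is deliberately simplified; the real reducer assigns session_ids from a distributed sequence, gathers session-level facts, and handles rejects, and the field positions here are assumptions.
    #!/usr/bin/env python
    # Simplified sketch of 30-minute-timeout session assignment in a streaming reducer.
    # Input lines arrive sorted by (anon_cookie, event_dt_ht); field positions and the
    # local session counter are simplifications of the real job described above.
    import sys
    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)
    FMT = '%Y-%m-%d %H:%M:%S'

    prev_cookie, prev_ts, session_no = None, None, 0

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        cookie, ts = fields[0], datetime.strptime(fields[1], FMT)
        # Start a new session when the user changes or the inactivity gap exceeds the timeout.
        if cookie != prev_cookie or ts - prev_ts > TIMEOUT:
            session_no += 1
        sys.stdout.write('%s\t%d\n' % ('\t'.join(fields), session_no))
        prev_cookie, prev_ts = cookie, ts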
  • Output (Mapper) from Sessionize
    • Filter for events of a specific event type
    • Unpack the event type + session facts
    • Provide data to prepare_load
  • Web Analytics on Hadoop (architecture diagram, recap): Apache logs from the sites are distributed to HDFS by Fido; Python-ETL, MapReduce, and Hive process them on the Hadoop cluster; results feed the DW database, web metrics, billers, data mining, and CMS systems; external data sources also land in HDFS
  • Benefits to Ops
    • Reduced processing time to reach SLA, saving 6 hours
    • Running 2 years in production without any big issues
    • Withstood a 50%-per-year increase in data volume
    • The architecture was designed to make adding new processing logic easy
    • Robust and Fault-Tolerant
      • Five dead datanodes, jobs still ran OK
      • Upgraded JVM on a few datanodes while jobs running
      • Reprocessing old data while processing data of current day
  • Conclusions I – Create a Tool Appropriate to the Job if Existing Ones Don't Have What You Want
    • Python ETL Framework and Hadoop Streaming together can do complex, big volume ETL work
    • Python ETL Framework
      • Home grown, under review for open-source release
      • Rich functionality via Python
      • Extensible
      • NLS support
    • Runs on top of another platform, e.g. Hadoop
      • Distributed/Parallel
      • Sorting
      • Aggregation
  • Conclusions II – Power and Flexibility for Processing Big Data
    • Hadoop - scale and computing horsepower
      • Robustness
      • Fault-tolerance
      • Scalability
      • Significant reduction of processing time to reach SLA
      • Cost-effective
        • Commodity HW
        • Free SW
    • Currently:
      • Building multi-tenant Hadoop clusters using the Fair Scheduler
  • The Team (alphabetical order): Batu Ulug, Dan Lescohier, Jim Haas (presenting “Hadoop in Mission-critical Environment”), Michael Sun, Richard Zhang, Ron Mahoney, Slawomir Krysiak
  • Questions? [email_address] Follow up on Lumberjack: [email_address]
  • Abstract: CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from the hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack, the Extraction, Transformation and Loading framework we built based on Python and Hadoop streaming, which is under review for open-source release, Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault-tolerance, scalability, and a significant reduction of processing time to reach SLA (over six hours of reduction so far).