Hw09   Analytics And Reporting
 

Hw09 Analytics And Reporting

on

  • 2,810 views

 

Statistics

Views

Total Views
2,810
Views on SlideShare
2,803
Embed Views
7

Actions

Likes
2
Downloads
85
Comments
0

1 Embed 7

http://www.slideshare.net 7

Accessibility

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Diversity collapse, dot com collapse, and how the stock survived
  • Grew 28% sequentially and 346% annually (Q299 vs. Q298)
  • Economics Professor at Stanford
  • In PIG need 2 functions (1) to check a session contains a search event. (2) to find view event for a search event.
  • Need to write down schema of the data. Explain pattern. Also explain {}
  • GROUP is like PIG Group
  • GROUP is like PIG Group
  • GROUP is like PIG Group

Hw09   Analytics And Reporting Hw09 Analytics And Reporting Presentation Transcript

  • Scalable Stream Processing and Map-Reduce
    • Neel Sundaresan, Evan Chiu, Gyanit Singh
    • eBay Research Labs
    eBay Research Labs
  • About eBay Research Labs
    • Who we are
      • eBay Research Labs was formed in July of 2005. The group's charter is to conduct forward looking research and deliver innovative solutions to business and product
    • What we do
      • Search & IR
      • Machine learning
      • Analytics and Optimization
      • Reputation, Trust and Safety
      • Distributed and Grid Computing
      • Social and Incentive Networks
      • Large Scale Visualization
      • Scalable Martix and Graph Computing
    • Basically we “Dig” into data!
    eBay Inc. Confidential
    • On an average day on eBay…
    The Land of Large Numbers eBay Inc. Confidential eBay users trade about $1,400 worth of goods on the site every second.
      • A vehicle sells every 2 minutes
      • A part or accessory sells every 3 seconds
      • Diamond jewelry sells every 83 seconds
      • A Timberland shoe sells every 10 minutes
      • A trading card sells every 6 seconds
  • Listings eBay Inc. Confidential 546.4 Million 1999 2000 2001 2002 2003 1998 2004 2005
  • Registered Users eBay Inc. Confidential 180.6 Million 1999 2000 2001 2002 2003 1998 2004 2005
  • League of Long Tail
    • Lots of our research is in data mining field.
      • Data collection is important to us.
    • Example:
      • View item click through rate for search algorithms.
      • Browsing pattern of users performing searches in different categories.
        • For e.g. computer v.s. clothing.
      • Purchase rate for various merchandizing algorithms.
      • Other kind of sessionized data.
    eBay Inc. Confidential
  • Of Needles, Haystacks, and Magnets…
    • Transaction Data and session data
      • Terabytes per day
    • Data mining researcher’s source of truth
    • Nature of session data
      • Sessionized streams
      • Semi-structured
      • Constantly changing schema
    • Examples:
      • View item click through rate for search algorithms.
      • Browsing pattern of users performing searches in different categories.
        • For e.g. computer vs clothing.
      • Purchase rate for various recommendation algorithms – A/B experimentation
    eBay Inc. Confidential
  • The Log Challenge
    • The amount of the data is huge.
      • TB+ / day
      • Need to perform analysis on weeks or even years worth of data.
    • Analysis takes time.
      • Making it distributed will help.
    • Text streams are not indexed.
      • Difficult to query
      • Fields often change
    • Difficult to perform sessionized analysis
      • E.g.: Study the session paths of session in which a given search algorithm is used.
      • E.g.: Differences in UK user and US users.
    • Large Number of session paths (in millions)
      • Visualization is difficult
    • Scaling up to potential large user base
    eBay Inc. Confidential
  • What we need is..
    • An analytic tool:
      • Easy to perform session analysis.
      • Large scale stream processing.
      • Quick turnaround on analysis.
      • Effective query language.
      • Visual analytics that caters to intuition and provides extensive analytics.
      • Highly customizable processing.
      • Provide interfaces at different levels.
    eBay Inc. Confidential
  • Architecture - Mobius eBay Inc. Confidential Log Processing API NFS/HDFS Web Interface Java M/R Template Click stream Visualizer Logs Log Query Language Application Layer Data Source Layer Cluster Layer Extension (mapper/reducer plugin) Hadoop Hybrid Hadoop Cluster Solaris Servers Solaris Servers Solaris Servers Windows Desktop Windows Desktop Windows Desktop
  • eBay Inc. Confidential ?% ?% ?% ?% ?% start search exit exit view purchase View through rate Buy rate
  • User session information
    • Questions
      • What percentage of searches done receive clicks?
      • Out of those clicked results, how many are abandoned?
      • How many viewed results are followed to bid?
    • Data
      • Session activity grouped together as a stream.
      • A session is a bag of events.
      • Each event is a tuple with various fields.
    • Process:
      • Extract session with searches.
      • Compute view through rate.
    eBay Inc. Confidential
  • What do we need in a Stream Query Language
    • Detect patterns from the stream over a window
      • window
        • time-based, count-based, event/trigger-based
        • Sliding vs Landmark windows
      • Structure
        • Repetition (*, +), Sequence ( //), condition
      • Naming patterns
    • Integrate Relational and Pattern operators
    • “Inner” Queries
    eBay Inc. Confidential
  • Mobius Query Language (MQL) - Structure of a Query eBay Inc. Confidential SELECT * FROM source WHERE where_condition PATTERN pattern WITH condition START start_condition END end_condition FROM START END FILTER PATTERN SELECT Input stream
  • Simple Example eBay Inc. Confidential
    • Input: Stream of events.
    • Event is modeled as tuples.
    • Stream is modeled as ordered bag.
    (1,11:00AM) (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . inputStream
  • Simple Example Contd… eBay Inc. Confidential SELECT size(A), B FROM inputStream WHERE $0 > 0 PATTERN ( ($0 < 5) + AS A[ ] // $0 > 5 AS B ) WITH A[0].time + 5 min > B.time && size(A) < 3 START $1 > 11:01AM END $0 < 0 FROM START END FILTER PATTERN SELECT Input Stream (1,11:00AM) (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . (1,11:00AM) (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) ({(2,11:02AM)}, (6,11:04AM)) ({(3,11:06AM)},(7,11:07AM) ) (1, (6,11:04AM)) (1, (7,11:07AM) ) Output: Start-end Sub-stream Filter Pattern Tuple in a bag
  • Other Systems
    • PIG and Hive
      • Patterns are not available.
      • Other stream processing operator also not available. E.g. Start, End.
    • CEP based Stream Processing Languages. (STREAM, Streambase, Cayuga)
      • Have flat data model.
        • Can only store few features of patterns.
      • Degree of parallelism is restricted to 1 due to inability to represent substreams.
        • In some cases splitting is done but that splitting operator has 1 degree of parallelism.
    • Active Databases
      • ECA (events-conditions-actions) and triggers
    eBay Inc. Confidential
  • Pattern Query
    • Problem: For each user identify a set of “search” followed by “view” (click-thru) events
    • Input Stream : (uid,session(name,time, itemID,…))*
    • Output Stream : (uid,views(S,V)*)*
    • DATASET clicks = SELECT uid, {
    • SELECT S, V
    • FROM session
    • PATTERN (name == “search” AS S //
    • ** //
    • name == “view” AS V)
    • WITH V.itemID belongsto S.impressions
    • } AS views
    • FROM logs
    • WHERE size(session) > 1;
    • Pattern defined in the pattern query is
    eBay Inc. Confidential Pattern Query search * view
  • Recommender Systems
    • Products are recommended to users on various pages.
      • How many clicks does the recommendation gets?
      • Do those clicks result in purchase?
      • Performance of different recommendation algorithms?
    • Data
      • User session containing events.
      • Events are of different types
    • Process
      • Extract session with recommendations.
      • Group them by algorithm used.
      • Calculate the view and purchase through rate.
    eBay Inc. Confidential
  • Sessionized Analysis 1 eBay Inc. Confidential X % Y % X1 % X2 % X3 % X4 % Y1 % Y2 % Y3 % Y4 % Start Algo1 Algo2 At least one view Left eBay w/o View Left eBay w/o View At least one view At least one purchase no purchase were made no purchase were made At least one purchase View through rate Buy rate
  • Step1 : Extract all sessions with “view”s
    • Input Stream: merch_logs is (uid, sessionid, (name,time,itemID…)*)*
    • DATASET merchView = SELECT uid, sessionid, {
    • SELECT S.algorithm AS algo, V.itemID AS itemID
    • FROM session
    • PATTERN ( name == “ recommend ” AS S//
    • ** AS B[ ] //
    • name == “ view ” AS V)
    • WITH (S.itemID == V.itemID) AND
    • (B[i].itemid != S.itemID)
    • } AS viewedRecos
    • FROM merch_logs ;
    • The PATTERN part defines pattern
    • Schema of merchView is (uid, sessionid, (algo,itemID)*)*
    eBay Inc. Confidential recommend * view Bag named session Bag named viewedRecos A
  • Step 2: Extract all instances of recommendations made…
    • DATASET merchShown = SELECT uid, sessionid, flatten({
    • SELECT algorithm AS algo
    • FROM session
    • WHERE name == “recommend”
    • })
    • FROM merch_logs;
    • (uid, sessionid, (name,time,itemID…)*)* => (uid,sessionId, algo)*
    eBay Inc. Confidential B
  • Step 3: Produce a flattened view
    • DATASET unnestedmerchView = SELECT uid, sessionid, flatten(viewedRecos)
    • FROM merchView;
    • (uid, sessionid, (algo,itemID)*)* => (uid, sessionid, algo, itemId)*
    eBay Inc. Confidential C
  • Step 4: Group the data by algorithm type and compute counts
    • DATASET merchData = SELECT groupid AS algo, size(unnestedmerchView) AS clicks,
    • size(merchShown) AS impressions
    • FROM unnestedmerchView, merchShown
    • GROUP unnestedmerchView BY algo
    • ALSO merchShown BY algo
    • (uid, sessionid, algo, itemId)* , (uid,sessionId, algo)* => (algo, #clicks, #impression)* -- stream/bag of length #algos
    eBay Inc. Confidential D
  • Parallel Implementation
    • MQL compiling engine compiles queries in to a DAG. (similar to PIG)
    • Then the DAG is compiled into one or more map-reduce jobs.
    • When ever a grouping, sorting, union operator is seen a reduce phase is needed.
    eBay Inc. Confidential From Filter Select Group Load Save Select Scatter Gather Gather Scatter select select Pattern B A D C
  • Understanding System Health
    • Data
      • System logs containing many beacon streams.
      • Each machine beacon stream comprising of single beacon message.
      • Message contains various state data.
    • Problem
      • Contiguous time period load is more than 60%.
      • Time period only interesting if it is more than delta mins.
    eBay Inc. Confidential Time of the day % load 60% t
    • Input Stream: Schema of systemlogs is (systemname, beacons(load, time, …))
    • Output Stream: (systemname, (load,time)*)*
    • DATASET system = SELECT systemname, {
    • SELECT LE AS load
    • FROM beacons
    • PATTERN ( load < 60 AS SEVENT //
    • (load > 60)+ AS LE[ ] //
    • load < 60 AS EEVENT)
    • WITH LE[size(LE)].time – LE[0].time > 10mins
    • } AS LoadTimes
    • FROM systemlogs;
    • Pattern defined in pattern query is
    eBay Inc. Confidential load<60 load > 60 load<60
  • Ongoing and Future Work
    • Optimization mechanism.
      • Filter cannot be pushed ahead of user defined functions and pattern queries.
      • Inferring projection is also limited.
    • Compilation to Map-Reduce Jobs
      • Inferring the best strategy to split work between map-reduce in case of multiple queries.
    • Degree of parallelization when stream is not split by the users into substreams.
    • Near real-time stream processing engine.
    • Reporting mechanism.
    eBay Inc. Confidential