Hw09 Analytics And Reporting


Published on

  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Diversity collapse, dot com collapse, and how the stock survived
  • Grew 28% sequentially and 346% annually (Q299 vs. Q298)
  • Economics Professor at Stanford
  • In PIG need 2 functions (1) to check a session contains a search event. (2) to find view event for a search event.
  • Need to write down schema of the data. Explain pattern. Also explain {}
  • GROUP is like PIG Group
  • GROUP is like PIG Group
  • GROUP is like PIG Group
  • Hw09 Analytics And Reporting

    1. 1. Scalable Stream Processing and Map-Reduce <ul><li>Neel Sundaresan, Evan Chiu, Gyanit Singh </li></ul><ul><li>eBay Research Labs </li></ul>eBay Research Labs
    2. 2. About eBay Research Labs <ul><li>Who we are </li></ul><ul><ul><li>eBay Research Labs was formed in July of 2005. The group's charter is to conduct forward looking research and deliver innovative solutions to business and product </li></ul></ul><ul><li>What we do </li></ul><ul><ul><li>Search & IR </li></ul></ul><ul><ul><li>Machine learning </li></ul></ul><ul><ul><li>Analytics and Optimization </li></ul></ul><ul><ul><li>Reputation, Trust and Safety </li></ul></ul><ul><ul><li>Distributed and Grid Computing </li></ul></ul><ul><ul><li>Social and Incentive Networks </li></ul></ul><ul><ul><li>Large Scale Visualization </li></ul></ul><ul><ul><li>Scalable Martix and Graph Computing </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Basically we “Dig” into data! </li></ul>eBay Inc. Confidential
    3. 3. <ul><li>On an average day on eBay… </li></ul>The Land of Large Numbers eBay Inc. Confidential eBay users trade about $1,400 worth of goods on the site every second. <ul><ul><li>A vehicle sells every 2 minutes </li></ul></ul><ul><ul><li>A part or accessory sells every 3 seconds </li></ul></ul><ul><ul><li>Diamond jewelry sells every 83 seconds </li></ul></ul><ul><ul><li>A Timberland shoe sells every 10 minutes </li></ul></ul><ul><ul><li>A trading card sells every 6 seconds </li></ul></ul>
    4. 4. Listings eBay Inc. Confidential 546.4 Million 1999 2000 2001 2002 2003 1998 2004 2005
    5. 5. Registered Users eBay Inc. Confidential 180.6 Million 1999 2000 2001 2002 2003 1998 2004 2005
    6. 6. League of Long Tail <ul><li>Lots of our research is in data mining field. </li></ul><ul><ul><li>Data collection is important to us. </li></ul></ul><ul><li>Example: </li></ul><ul><ul><li>View item click through rate for search algorithms. </li></ul></ul><ul><ul><li>Browsing pattern of users performing searches in different categories. </li></ul></ul><ul><ul><ul><li>For e.g. computer v.s. clothing. </li></ul></ul></ul><ul><ul><li>Purchase rate for various merchandizing algorithms. </li></ul></ul><ul><ul><li>Other kind of sessionized data. </li></ul></ul>eBay Inc. Confidential
    7. 7. Of Needles, Haystacks, and Magnets… <ul><li>Transaction Data and session data </li></ul><ul><ul><li>Terabytes per day </li></ul></ul><ul><li>Data mining researcher’s source of truth </li></ul><ul><li>Nature of session data </li></ul><ul><ul><li>Sessionized streams </li></ul></ul><ul><ul><li>Semi-structured </li></ul></ul><ul><ul><li>Constantly changing schema </li></ul></ul><ul><li>Examples: </li></ul><ul><ul><li>View item click through rate for search algorithms. </li></ul></ul><ul><ul><li>Browsing pattern of users performing searches in different categories. </li></ul></ul><ul><ul><ul><li>For e.g. computer vs clothing. </li></ul></ul></ul><ul><ul><li>Purchase rate for various recommendation algorithms – A/B experimentation </li></ul></ul>eBay Inc. Confidential
    8. 8. The Log Challenge <ul><li>The amount of the data is huge. </li></ul><ul><ul><li>TB+ / day </li></ul></ul><ul><ul><li>Need to perform analysis on weeks or even years worth of data. </li></ul></ul><ul><li>Analysis takes time. </li></ul><ul><ul><li>Making it distributed will help. </li></ul></ul><ul><li>Text streams are not indexed. </li></ul><ul><ul><li>Difficult to query </li></ul></ul><ul><ul><li>Fields often change </li></ul></ul><ul><li>Difficult to perform sessionized analysis </li></ul><ul><ul><li>E.g.: Study the session paths of session in which a given search algorithm is used. </li></ul></ul><ul><ul><li>E.g.: Differences in UK user and US users. </li></ul></ul><ul><li>Large Number of session paths (in millions) </li></ul><ul><ul><li>Visualization is difficult </li></ul></ul><ul><li>Scaling up to potential large user base </li></ul>eBay Inc. Confidential
    9. 9. What we need is.. <ul><li>An analytic tool: </li></ul><ul><ul><li>Easy to perform session analysis. </li></ul></ul><ul><ul><li>Large scale stream processing. </li></ul></ul><ul><ul><li>Quick turnaround on analysis. </li></ul></ul><ul><ul><li>Effective query language. </li></ul></ul><ul><ul><li>Visual analytics that caters to intuition and provides extensive analytics. </li></ul></ul><ul><ul><li>Highly customizable processing. </li></ul></ul><ul><ul><li>Provide interfaces at different levels. </li></ul></ul>eBay Inc. Confidential
    10. 10. Architecture - Mobius eBay Inc. Confidential Log Processing API NFS/HDFS Web Interface Java M/R Template Click stream Visualizer Logs Log Query Language Application Layer Data Source Layer Cluster Layer Extension (mapper/reducer plugin) Hadoop Hybrid Hadoop Cluster Solaris Servers Solaris Servers Solaris Servers Windows Desktop Windows Desktop Windows Desktop
    11. 11. eBay Inc. Confidential ?% ?% ?% ?% ?% start search exit exit view purchase View through rate Buy rate
    12. 12. User session information <ul><li>Questions </li></ul><ul><ul><li>What percentage of searches done receive clicks? </li></ul></ul><ul><ul><li>Out of those clicked results, how many are abandoned? </li></ul></ul><ul><ul><li>How many viewed results are followed to bid? </li></ul></ul><ul><li>Data </li></ul><ul><ul><li>Session activity grouped together as a stream. </li></ul></ul><ul><ul><li>A session is a bag of events. </li></ul></ul><ul><ul><li>Each event is a tuple with various fields. </li></ul></ul><ul><li>Process: </li></ul><ul><ul><li>Extract session with searches. </li></ul></ul><ul><ul><li>Compute view through rate. </li></ul></ul>eBay Inc. Confidential
    13. 13. What do we need in a Stream Query Language <ul><li>Detect patterns from the stream over a window </li></ul><ul><ul><li>window </li></ul></ul><ul><ul><ul><li>time-based, count-based, event/trigger-based </li></ul></ul></ul><ul><ul><ul><li>Sliding vs Landmark windows </li></ul></ul></ul><ul><ul><li>Structure </li></ul></ul><ul><ul><ul><li>Repetition (*, +), Sequence ( //), condition </li></ul></ul></ul><ul><ul><li>Naming patterns </li></ul></ul><ul><li>Integrate Relational and Pattern operators </li></ul><ul><li>“Inner” Queries </li></ul>eBay Inc. Confidential
    14. 14. Mobius Query Language (MQL) - Structure of a Query eBay Inc. Confidential SELECT * FROM source WHERE where_condition PATTERN pattern WITH condition START start_condition END end_condition FROM START END FILTER PATTERN SELECT Input stream
    15. 15. Simple Example eBay Inc. Confidential <ul><li>Input: Stream of events. </li></ul><ul><li>Event is modeled as tuples. </li></ul><ul><li>Stream is modeled as ordered bag. </li></ul>(1,11:00AM) (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . inputStream
    16. 16. Simple Example Contd… eBay Inc. Confidential SELECT size(A), B FROM inputStream WHERE $0 > 0 PATTERN ( ($0 < 5) + AS A[ ] // $0 > 5 AS B ) WITH A[0].time + 5 min > B.time && size(A) < 3 START $1 > 11:01AM END $0 < 0 FROM START END FILTER PATTERN SELECT Input Stream (1,11:00AM) (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . (1,11:00AM) (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) . . . (5,11:01AM) (2,11:02AM) (6,11:04AM) (10,11:05AM) (3,11:06AM) (7,11:07AM) (2,11:08AM) (-1,11:10AM) ({(2,11:02AM)}, (6,11:04AM)) ({(3,11:06AM)},(7,11:07AM) ) (1, (6,11:04AM)) (1, (7,11:07AM) ) Output: Start-end Sub-stream Filter Pattern Tuple in a bag
    17. 17. Other Systems <ul><li>PIG and Hive </li></ul><ul><ul><li>Patterns are not available. </li></ul></ul><ul><ul><li>Other stream processing operator also not available. E.g. Start, End. </li></ul></ul><ul><li>CEP based Stream Processing Languages. (STREAM, Streambase, Cayuga) </li></ul><ul><ul><li>Have flat data model. </li></ul></ul><ul><ul><ul><li>Can only store few features of patterns. </li></ul></ul></ul><ul><ul><li>Degree of parallelism is restricted to 1 due to inability to represent substreams. </li></ul></ul><ul><ul><ul><li>In some cases splitting is done but that splitting operator has 1 degree of parallelism. </li></ul></ul></ul><ul><li>Active Databases </li></ul><ul><ul><li>ECA (events-conditions-actions) and triggers </li></ul></ul>eBay Inc. Confidential
    18. 18. Pattern Query <ul><li>Problem: For each user identify a set of “search” followed by “view” (click-thru) events </li></ul><ul><li>Input Stream : (uid,session(name,time, itemID,…))* </li></ul><ul><li>Output Stream : (uid,views(S,V)*)* </li></ul><ul><li>DATASET clicks = SELECT uid, { </li></ul><ul><li>SELECT S, V </li></ul><ul><li>FROM session </li></ul><ul><li>PATTERN (name == “search” AS S // </li></ul><ul><li>** // </li></ul><ul><li>name == “view” AS V) </li></ul><ul><li>WITH V.itemID belongsto S.impressions </li></ul><ul><li>} AS views </li></ul><ul><li>FROM logs </li></ul><ul><li>WHERE size(session) > 1; </li></ul><ul><li>Pattern defined in the pattern query is </li></ul>eBay Inc. Confidential Pattern Query search * view
    19. 19. Recommender Systems <ul><li>Products are recommended to users on various pages. </li></ul><ul><ul><li>How many clicks does the recommendation gets? </li></ul></ul><ul><ul><li>Do those clicks result in purchase? </li></ul></ul><ul><ul><li>Performance of different recommendation algorithms? </li></ul></ul><ul><li>Data </li></ul><ul><ul><li>User session containing events. </li></ul></ul><ul><ul><li>Events are of different types </li></ul></ul><ul><li>Process </li></ul><ul><ul><li>Extract session with recommendations. </li></ul></ul><ul><ul><li>Group them by algorithm used. </li></ul></ul><ul><ul><li>Calculate the view and purchase through rate. </li></ul></ul>eBay Inc. Confidential
    20. 20. Sessionized Analysis 1 eBay Inc. Confidential X % Y % X1 % X2 % X3 % X4 % Y1 % Y2 % Y3 % Y4 % Start Algo1 Algo2 At least one view Left eBay w/o View Left eBay w/o View At least one view At least one purchase no purchase were made no purchase were made At least one purchase View through rate Buy rate
    21. 21. Step1 : Extract all sessions with “view”s <ul><li>Input Stream: merch_logs is (uid, sessionid, (name,time,itemID…)*)* </li></ul><ul><li>DATASET merchView = SELECT uid, sessionid, { </li></ul><ul><li>SELECT S.algorithm AS algo, V.itemID AS itemID </li></ul><ul><li>FROM session </li></ul><ul><li>PATTERN ( name == “ recommend ” AS S// </li></ul><ul><li>** AS B[ ] // </li></ul><ul><li>name == “ view ” AS V) </li></ul><ul><li>WITH (S.itemID == V.itemID) AND </li></ul><ul><li>(B[i].itemid != S.itemID) </li></ul><ul><li> } AS viewedRecos </li></ul><ul><li>FROM merch_logs ; </li></ul><ul><li>The PATTERN part defines pattern </li></ul><ul><li>Schema of merchView is (uid, sessionid, (algo,itemID)*)* </li></ul>eBay Inc. Confidential recommend * view Bag named session Bag named viewedRecos A
    22. 22. Step 2: Extract all instances of recommendations made… <ul><li>DATASET merchShown = SELECT uid, sessionid, flatten({ </li></ul><ul><li>SELECT algorithm AS algo </li></ul><ul><li>FROM session </li></ul><ul><li>WHERE name == “recommend” </li></ul><ul><li> }) </li></ul><ul><li>FROM merch_logs; </li></ul><ul><li>(uid, sessionid, (name,time,itemID…)*)* => (uid,sessionId, algo)* </li></ul>eBay Inc. Confidential B
    23. 23. Step 3: Produce a flattened view <ul><li>DATASET unnestedmerchView = SELECT uid, sessionid, flatten(viewedRecos) </li></ul><ul><li>FROM merchView; </li></ul><ul><li>(uid, sessionid, (algo,itemID)*)* => (uid, sessionid, algo, itemId)* </li></ul>eBay Inc. Confidential C
    24. 24. Step 4: Group the data by algorithm type and compute counts <ul><li>DATASET merchData = SELECT groupid AS algo, size(unnestedmerchView) AS clicks, </li></ul><ul><li>size(merchShown) AS impressions </li></ul><ul><li>FROM unnestedmerchView, merchShown </li></ul><ul><li>GROUP unnestedmerchView BY algo </li></ul><ul><li>ALSO merchShown BY algo </li></ul><ul><li>(uid, sessionid, algo, itemId)* , (uid,sessionId, algo)* => (algo, #clicks, #impression)* -- stream/bag of length #algos </li></ul>eBay Inc. Confidential D
    25. 25. Parallel Implementation <ul><li>MQL compiling engine compiles queries in to a DAG. (similar to PIG) </li></ul><ul><li>Then the DAG is compiled into one or more map-reduce jobs. </li></ul><ul><li>When ever a grouping, sorting, union operator is seen a reduce phase is needed. </li></ul>eBay Inc. Confidential From Filter Select Group Load Save Select Scatter Gather Gather Scatter select select Pattern B A D C
    26. 26. Understanding System Health <ul><li>Data </li></ul><ul><ul><li>System logs containing many beacon streams. </li></ul></ul><ul><ul><li>Each machine beacon stream comprising of single beacon message. </li></ul></ul><ul><ul><li>Message contains various state data. </li></ul></ul><ul><li>Problem </li></ul><ul><ul><li>Contiguous time period load is more than 60%. </li></ul></ul><ul><ul><li>Time period only interesting if it is more than delta mins. </li></ul></ul>eBay Inc. Confidential Time of the day % load 60% t
    27. 27. <ul><li>Input Stream: Schema of systemlogs is (systemname, beacons(load, time, …)) </li></ul><ul><li>Output Stream: (systemname, (load,time)*)* </li></ul><ul><li>DATASET system = SELECT systemname, { </li></ul><ul><li>SELECT LE AS load </li></ul><ul><li>FROM beacons </li></ul><ul><li>PATTERN ( load < 60 AS SEVENT // </li></ul><ul><li>(load > 60)+ AS LE[ ] // </li></ul><ul><li>load < 60 AS EEVENT) </li></ul><ul><li>WITH LE[size(LE)].time – LE[0].time > 10mins </li></ul><ul><li>} AS LoadTimes </li></ul><ul><li>FROM systemlogs; </li></ul><ul><li>Pattern defined in pattern query is </li></ul>eBay Inc. Confidential load<60 load > 60 load<60
    28. 28. Ongoing and Future Work <ul><li>Optimization mechanism. </li></ul><ul><ul><li>Filter cannot be pushed ahead of user defined functions and pattern queries. </li></ul></ul><ul><ul><li>Inferring projection is also limited. </li></ul></ul><ul><li>Compilation to Map-Reduce Jobs </li></ul><ul><ul><li>Inferring the best strategy to split work between map-reduce in case of multiple queries. </li></ul></ul><ul><li>Degree of parallelization when stream is not split by the users into substreams. </li></ul><ul><li>Near real-time stream processing engine. </li></ul><ul><li>Reporting mechanism. </li></ul>eBay Inc. Confidential