Presentation_BigData_NenaMarin


  1. Big Data Ad Analytics
     Nena Marín, Ph.D.
     Principal Scientist & Director of Research
  2. Outline
     Background
     Big Data Challenge: MySQL vs. my NoSQL
     Data Intensive Computing: online vs. offline
     Big Data Solution
     Schema changes
     Analytics approach changes
     Results
     Conclusions
  3. Big Data Problem
     Overwhelming amounts of data, and growing
     Data exceeding 100s of GB/TB/PB
     Unstructured data/content
     Structured data:
       Warehouse for tri-state Electric and Gas Utility Services (Oracle OWB + Cognos)
       346,000 electric and 306,000 natural gas customers (2007)
     Semi- and unstructured data (Adometry, 2010-now):
       Impression & click stream data
       Scale: ~3B impressions per day
       Page & Ad Tag customers
       Ad Server Log File customers
       Growth rates: 2-9% per month
     Unstructured content:
       Email (Adometry: Cross-Channel AA)
       Sensor data (HALO Project: 8.5B particles, 2009-2010)
  4. Ad Analytics: X Insurance
  5. X Insurance: over 4 Billion Impressions per month
  6. Three Clients' Volume
  7. Big Data Analysis
     Off-line (batch):
       Ad Analytics Warehouse
       HALO project: Terascale Astronomical Dataset on 240 compute nodes, Longhorn Cluster (ref 1)
       Hierarchical Density Shaving clustering algorithm
       Dataflow clustering algorithm
       Hadoop (128 compute cores, Spur cluster @ UT/TACC)
     On-line (real-time):
       Netflix Recommender: 100 million ratings in 16.31 minutes; effectively 9.6 microseconds per rating (ref 2)
     Ref 1: http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.26
     Ref 2: http://kdd09.crowdvine.com/talks/4963 (KDD 2009)
  8. Outline
     Background
     Big Data Challenge: MySQL vs. my NoSQL
     Data Intensive Computing: online vs. offline
     Big Data Solution
     Schema changes
     Analytics approach changes
     Results
     Conclusions
  9. Big Data Solution Questions
     How will we add new nodes (grow)?
     Any single points of failure?
     Do the writes scale as well?
     How much administration will the system require?
     What is the implementation learning curve?
     If it's open source, is there a healthy community?
     How much time and effort would we have to expend to deploy and integrate it?
     Does it use technology we know we can work with? Integration tools, presentation tools, analytics (data mining) tools
  10. (image slide)
  11. My two cents
     30 billion rows per month puts you in VLDB territory, so you need partitioning.
     The low-cardinality dimensions also suggest that bitmap indexes would be a performance win.
     Column store:
       Most aggregates are by columns
       Agility to update the schema: add columns
     QuickLZ compression on partitions and tables
  12. Columnar Database
     Review table (column list elided on the slide):
     CREATE TABLE review ( ... )
     WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz, OIDS=FALSE)
     DISTRIBUTED BY (id)
     PARTITION BY RANGE(productid)
       SUBPARTITION BY RANGE(submissiontime)
       SUBPARTITION BY LIST(status)
     (
       PARTITION clnt100002 START (100002) END (100003) EVERY (1)
         WITH (appendonly=true, compresstype=quicklz, orientation=column)
       (
         START ('2011-05-01 00:00:00'::timestamp without time zone)
         END ('2011-08-01 00:00:00'::timestamp without time zone)
         EVERY ('1 day'::interval)
         WITH (appendonly=true, compresstype=quicklz, orientation=column)
         (
           SUBPARTITION subm VALUES ('submitted') WITH (appendonly=true, compresstype=quicklz, orientation=column),
           SUBPARTITION appr VALUES ('approved') WITH (appendonly=true, compresstype=quicklz, orientation=column),
           DEFAULT SUBPARTITION other WITH (appendonly=true, compresstype=quicklz, orientation=column)
         )
       )
     );
  13. Why Partitioning?
     Because the table is partitioned by date, a query with WHERE submissiondate >= ... AND submissiondate <= ... skips the partitions outside the date range, so they don't impact performance.
     If a partition is no longer needed, you can create a new table with the content of the partition and drop the partition.
     Partitions can be recreated with the "compression" and "append only" options to save disk space and I/O bandwidth.
  14. (image slide)
  15. Outline
     Background
     Big Data Challenge: MySQL vs. my NoSQL
     Data Intensive Computing: online vs. offline
     Big Data Solution
     Schema changes
     Analytics approach changes
     Results
     Conclusions
  16. Question 1: Top rated products (filtered by brand or category)
     MDX:
     WITH SET [TCat] AS TopCount([Product].[Subcategory].[Subcategory], 10, [Measures].[Rating])
     MEMBER [Product].[Subcategory].[Other] AS Aggregate([Product].[Subcategory].[Subcategory] - TCat)
     SELECT { [Measures].[Rating] } ON COLUMNS,
            TCat + [Other] ON ROWS
     FROM [DW_PRODUCTS]
     SQL query:
     SELECT TOP X rating, count(id) FROM fact
     WHERE brand = x AND category = y
     GROUP BY rating ORDER BY count(id) DESC
  17. Recommender Systems
     We Know What You Ought To Be Watching This Summer
     9/6/2011
  18. Ratings Data Matrix
     Sparse matrix representation:
     rowID, colID, Rating, colID, Rating, ...
     User1, 1, 1, 2, 5
     User2, 3, 3, 4, 4, 5, 3
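The sparse row format above (a rowID followed by colID/rating pairs) can be sketched in Python. This is a minimal illustration, not the project's actual loader; the function name and the dict-of-dicts layout are assumptions:

```python
# Parse the sparse "rowID, colID, rating, colID, rating, ..." format
# into a dict mapping each user to a {movie_id: rating} dict.
# (Hypothetical helper for illustration only.)
def parse_sparse_ratings(lines):
    ratings = {}
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        user, rest = fields[0], fields[1:]
        # The remaining fields come in (movie_id, rating) pairs.
        ratings[user] = {int(rest[i]): int(rest[i + 1])
                         for i in range(0, len(rest), 2)}
    return ratings

matrix = parse_sparse_ratings([
    "User1, 1, 1, 2, 5",
    "User2, 3, 3, 4, 4, 5, 3",
])
# matrix["User1"] == {1: 1, 2: 5}
```

Only the non-zero entries are stored, which is what makes a 100M-rating matrix tractable in memory.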
  19. (figure: raw ratings data, then after row clustering, then after column clustering; iterate until convergence. Movies-by-users ratings matrix reordered into K-by-L co-clusters: K row clusters, L column clusters)
  20. From the Training
     K-by-L co-clusters: average ratings per cluster
     Average ratings per user
     Average ratings per movie
     Global average rating is 3.68
  21. Prediction Algorithm
     Case 1: known user, known movie: rating = cluster average
     Case 2: known user, unknown movie: rating = user average
     Case 3: unknown user, known movie: rating = movie average
     Case 4: unknown user, unknown movie: rating = global average
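The four-case fallback above can be sketched as follows. The cluster-assignment maps and average tables are assumed inputs (the slides do not specify the data structures), and 3.68 is the global average quoted on the training slide:

```python
def predict(user, movie, user_cluster, movie_cluster,
            cocluster_avg, user_avg, movie_avg, global_avg=3.68):
    """Fall back from the most specific average to the most general one."""
    if user in user_cluster and movie in movie_cluster:
        # Case 1: known user and movie -> average of their co-cluster
        return cocluster_avg[(user_cluster[user], movie_cluster[movie])]
    if user in user_cluster:
        # Case 2: known user, unknown movie -> that user's average
        return user_avg[user]
    if movie in movie_cluster:
        # Case 3: unknown user, known movie -> that movie's average
        return movie_avg[movie]
    # Case 4: both unknown -> global average rating
    return global_avg
```

Each lookup is O(1), which is consistent with a per-rating prediction cost measured in microseconds.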
  29. DataRush-based Recommender System
     Dataflow prediction application graph
     Dataflow training application graph
  30. Results: Scalability Across Cores
  31. Results
  32. Question 1: Top Rated Products
     Build a recommender data-mining model based on co-clustering customers & ratings.
     Training runtime for 100,480,507 ratings: 16.31 minutes
     Apply the recommender in real time; effective prediction runtime: 9.738615 μs per rating
  33. Question 2: Fastest Rising Products
     Method 1: store the co-clustering model in the DW; identify when products move from one cluster to another.
     Method 2: product ratings distribution. Bin the ratings and establish a distribution baseline (mu, s) for each product. When mu or s change beyond a certain threshold, identify the movers & shakers and the biggest losers.
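Method 2 above can be sketched as a z-score style check against the per-product baseline (mu, s). The function name and the 2-standard-deviation threshold are assumptions for illustration, not values from the talk:

```python
def flag_movers(baseline, current, threshold=2.0):
    """Flag products whose current mean rating moved more than `threshold`
    baseline standard deviations away from their historical mean.

    baseline: {product: (mu, sigma)}   current: {product: new_mean}
    """
    movers = []
    for product, (mu, sigma) in baseline.items():
        new_mu = current.get(product)
        if new_mu is None or sigma == 0:
            continue  # no fresh data, or a degenerate baseline
        shift = (new_mu - mu) / sigma
        if abs(shift) >= threshold:
            movers.append((product, shift))  # shift > 0: rising, < 0: falling
    # Biggest risers first, biggest losers last.
    return sorted(movers, key=lambda p: p[1], reverse=True)
```

Because only (mu, s) per product is kept, the baseline fits comfortably in the warehouse and the check is a single pass over current aggregates.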
  34. Question 3: Category-Level Stats
     Category-level statistics (including roll-ups) such as average rating, content volume, etc.
     Average = Sum(X)/n
     Greenplum OLAP grouping extensions: CUBE, ROLLUP, GROUPING SETS
     SELECT productcategory, productid, sum(rating)/count(*)
     FROM review
     GROUP BY ROLLUP(productcategory, productid)
     ORDER BY 1, 2, 3;
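As a cross-check of what GROUP BY ROLLUP(productcategory, productid) emits, here is a small Python sketch that computes the same three grouping levels (leaf, per-category, and grand total, with None standing in for the rolled-up column). Names are illustrative, not the warehouse schema:

```python
from collections import defaultdict

def rollup_avg(rows):
    """Average rating at (category, product), (category,), and grand-total
    levels: the grouping sets that ROLLUP(category, product) produces."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for category, product, rating in rows:
        # Each row contributes to its leaf group and every roll-up above it.
        for key in ((category, product), (category, None), (None, None)):
            sums[key][0] += rating
            sums[key][1] += 1
    return {key: s / n for key, (s, n) in sums.items()}
```

A single pass updates all three levels, which is essentially how the OLAP extension avoids re-scanning the fact table per grouping set.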
  35. Question 4: Top Contributors
     Top contributors: score = content submissions + helpfulness votes
     Tag content:
       Approve positive reviews; reject negative or inappropriate/price-focused ones
       Snippets: highlight sentences in reviews
     Keep score, sentiment, reason codes, snippets, product flaws, intelligence data, and <Key> in the DW.
     Move content to a readily available <Key, Value> store.
     Query the top-score contributors; use <Key> to pull content in real time.
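The scoring and <Key> pattern above can be sketched as follows. The field names (`submissions`, `helpful_votes`, `content_key`) are assumptions for illustration, not the actual schema; the idea is that the warehouse ranks by the compact score and only the returned keys are dereferenced against the key-value store:

```python
def top_contributors(stats, n=10):
    """Rank contributors by submissions + helpfulness votes and return
    (user, score, content_key) so content can be fetched from the
    <Key, Value> store on demand."""
    scored = [(user, s["submissions"] + s["helpful_votes"], s["content_key"])
              for user, s in stats.items()]
    scored.sort(key=lambda t: (-t[1], t[0]))  # highest score first, ties by name
    return scored[:n]
```

Keeping only scores and keys in the DW keeps the ranking query small; the bulky review text never enters the warehouse.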
  36. Brisk by DataStax (buy local)
  37. Cassandra
     First developed at Facebook
     SuperColumns can turn a simple key-value architecture into one that handles sorted lists, based on an index specified by the user.
     Can scale from one node to several thousand nodes clustered across different data centers.
     Can be tuned for more consistency or more availability.
     Smooth node replacement if one goes down.
  38. Outline
     Background
     Big Data Challenge: MySQL vs. my NoSQL
     Data Intensive Computing: online vs. offline
     Big Data Solution
     Schema changes
     Analytics approach changes
     Results
     Conclusions
  39. Greenplum Bulk Loader
     gpload -f load_reviewer.yml -q -l ./gpload_reviewer.log
  40. Load Times
  41. Path to Conversion
  42. Attribution Credit
  43. (image slide)
  44. X Insurance: Historical Data Load Times
     Weekly reports: 20 minutes
     Attribution reports (pre-aggregated by 4 dimensions and deployed to the GUI)
     Reach & frequency reports
     Campaign optimization:
       ANN + non-linear optimization
       Allocate budget across different sites + placements to maximize conversions
  45. Attribute by Creative Size
     INSIGHT: top-ranked creative sizes show
       High propensity to convert
       High number of conversions
       Low cost
       High revenue
  46. Overlap Report
     Actionable: cookie-sync with "Turn" cookies; use as a block list to prevent reaching the same cookies twice
  47. (image slide)
  48. Outline
     Background
     Big Data Challenge: MySQL vs. my NoSQL
     Data Intensive Computing: online vs. offline
     Big Data Solution
     Schema changes
     Analytics approach changes
     Results
     Conclusions
  49. Common Problems (All)
     Data quality:
       Discovery stats: before load & after load
       Establish baselines and use them for validation
     Performance:
       Growth rates and loading windows; low-space triggers
       Latency of online queries
       Latency of offline queries
     Agility of schema to change:
       Don't under-estimate the value of metadata design
     Integration:
       Self-documentation
       Self-governance
  50. Common Problems (Internet Advertising)
     Bad data; need well-defined customer data requirements:
       Context: campaign, site, placement
       IP: invalids, geo & demographics
       Cost
       Revenue
     Common cookie across the different data sources:
       Have resigned in some cases to IP & user agent (browser & language)
     Only aggregate data
     Black-out periods and agility to roll over new quarters
  51. Greenplum: Lessons Learned
     Adding new nodes:
       Expanded the cluster from 4 to 8 nodes
       The redistribution tool failed: duplicate rowids on multiple nodes; had to re-load
     Single point of failure:
       The GPMASTER node is a single point of failure
       All slave nodes are mirrored, and failed segments can be recovered
     Read scale: network bandwidth
     Write scale: network bandwidth; hard disk space (dead in the water at 80% use on GPMASTER)
     IT resources:
       Full product support
       Have been down for two weeks at a time
     Open source: healthy community
     Technology:
       Works well with others: PostgreSQL, pgAdmin, Pentaho, Talend Studio, etc.
       Learning curve: none; everyone knew SQL
       Deployment: had several initial install issues, but deploying new clients is automated using Python and SQL
  52. Data:
       Append-only
       Columnar orientation
       Load balance via DISTRIBUTED BY
       No indexing, but partitioning
       Agility of schema to change
     Content:
       Evaluating Cassandra & Brisk
       <Key, Value> store for content
       Score, sentiment, reason codes, + <Key> to the DW
     Real-time:
       Leverage data mining models (PMML + store in DW)
       Use stored models to identify changes in patterns
       Recommendations
  53. Q & A
  54. HALO Project: Visualization and Data Analysis
     Longhorn Cluster at TACC, UT Austin
     240 Dell R610 compute nodes, each with:
       2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
       48 GB RAM
       73 GB local disk
     (figure: halo detection for visualization; original dataset and discovered clusters)
  57. Mapping Customer Data to AA Schema
