Presentation_BigData_NenaMarin


Transcript

  • 1. Big Data
    Ad Analytics
    Nena Marín, Ph.D.
    Principal Scientist & Director of Research
  • 2. Outline
    Background
    Big Data Challenge: MySQL vs. MyNoSQL
    Data Intensive Computing: online vs offline
    Big Data Solution
    Schema changes
    Analytics approach changes
    Results
    Conclusions
  • 3. Big Data Problem
    Overwhelming Amounts of Data & Growing
    Data exceeding 100 GB/TB/PBs
    Unstructured data/Content
    Structured data:
    Warehouse for tri-state Electric and Gas Utility Services (Oracle OWB + Cognos)
    346,000 electric and 306,000 natural gas customers (2007)
    Semi & Unstructured data (Adometry 2010-now)
    Impression & click stream data
    Scale: ~3B impressions per day
    Page & Ad Tag customers
    Ad Server Log File customers
    Growth rates: 2 – 9% per month
    Unstructured content
    Email (Adometry: Cross-Channel AA)
    Sensor Data (HALO Project: 8.5 B particles 2009-2010)
  • 4. Ad Analytics: X Insurance
  • 5. X Insurance: over 4 Billion Impressions per month
  • 6. Three Clients Volume
  • 7. Big Data Analysis
    Off-line (batch)
    Ad Analytics Warehouse
    HALO project: Terascale Astronomical Dataset on 240 compute nodes, Longhorn Cluster (ref1)
    Hierarchical Density Shaving clustering algorithm
    Dataflow Clustering Algorithm
    Hadoop (128 compute-cores Spur cluster @ UT/TACC)
    On-line (realtime)
    Netflix Recommender: 100 million ratings in 16.31 minutes. Effective = 9.6 microseconds per rating (ref2)
    REF1:http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.26
    REF2: http://kdd09.crowdvine.com/talks/4963 (KDD2009)
  • 8. Outline
    Background
    Big Data Challenge: MySQL vs. MyNoSQL
    Data Intensive Computing: online vs offline
    Big Data Solution
    Schema changes
    Analytics approach changes
    Results
    Conclusions
  • 9. Big Data Solution Questions
    How will we add new nodes (grow)?
    Any single points of failure?
    Do the writes scale as well?
    How much administration will the system require?
    Implementation learning curve?
    If it's open source, is there a healthy community?
    How much time and effort would we have to expend to deploy and integrate it?
    Does it use technology which we know we can work with? Integration tools, Presentation Tools, Analytics (data mining) Tools
  • 10.
  • 11. My two cents
    30 Billion rows per month puts you in VLDB territory, so you need partitioning.
    The low cardinality dimensions would also suggest that bitmap indexes would be a performance win.
    Column store
    Most aggregates are by columns
    Agility to update schema: add columns
    Quicklz compression on partitions and tables
  • 12. Columnar Database
    Review Table:
    CREATE TABLE review ( ... )
    WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz, OIDS=FALSE)
    DISTRIBUTED BY (id)
    PARTITION BY RANGE (productid)
    SUBPARTITION BY RANGE (submissiontime)
    SUBPARTITION BY LIST (status)
    (
      PARTITION clnt100002 START (100002) END (100003) EVERY (1)
        WITH (appendonly=true, compresstype=quicklz, orientation=column)
      (
        START ('2011-05-01 00:00:00'::timestamp without time zone)
          END ('2011-08-01 00:00:00'::timestamp without time zone)
          EVERY ('1 day'::interval)
          WITH (appendonly=true, compresstype=quicklz, orientation=column)
        (
          SUBPARTITION subm VALUES ('submitted')
            WITH (appendonly=true, compresstype=quicklz, orientation=column),
          SUBPARTITION appr VALUES ('approved')
            WITH (appendonly=true, compresstype=quicklz, orientation=column),
          DEFAULT SUBPARTITION other
            WITH (appendonly=true, compresstype=quicklz, orientation=column)
        )
      )
    );
  • 13. Why Partitioning?
    Because the table is partitioned by date, a query with WHERE submissiontime >= ... AND submissiontime <= ... scans only the partitions inside the date range; partitions outside it are skipped and don't impact performance.
    If a partition is no longer needed, we can create a new table from the partition's content and drop the partition.
    Partitions can be recreated with the "compression" and "append only" options to save disk space and I/O bandwidth.
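    The pruning behaviour can be illustrated with a small in-memory sketch; the partition layout and rows are made up, and this only mimics what the planner's partition elimination does:

    ```python
    from datetime import date

    # Hypothetical model of daily range partitions: each partition holds
    # the rows for one day, keyed by that day's date.
    partitions = {
        date(2011, 5, 1): [("r1", 4), ("r2", 5)],
        date(2011, 5, 2): [("r3", 3)],
        date(2011, 5, 3): [("r4", 2), ("r5", 5)],
    }

    def scan(start, end):
        """Scan only partitions whose key falls inside [start, end];
        the rest are pruned and never touched, mirroring how a
        WHERE submissiontime range skips out-of-range partitions."""
        scanned, rows = [], []
        for day, part in sorted(partitions.items()):
            if start <= day <= end:          # partition elimination
                scanned.append(day)
                rows.extend(part)
        return scanned, rows

    scanned, rows = scan(date(2011, 5, 1), date(2011, 5, 2))
    # only two of the three partitions are read
    ```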
  • 14.
  • 15. Outline
    Background
    Big Data Challenge: MySQL vs. MyNoSQL
    Data Intensive Computing: online vs offline
    Big Data Solution
    Schema changes
    Analytics approach changes
    Results
    Conclusions
  • 16. Question 1: Top rated products (filtered by brand or category)
    MDX: WITH SET [TCat] AS TopCount([Product].[Subcategory].[Subcategory], 10, [Measures].[Rating]) MEMBER [Product].[Subcategory].[Other] AS Aggregate([Product].[Subcategory].[Subcategory] - TCat) SELECT { [Measures].[Rating] } ON COLUMNS, TCat + [Other] ON ROWS FROM [DW_PRODUCTS]
    SQL query: select top X rating, count(id) from fact where brand = x and category = y group by rating order by count(id) desc
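    As an illustration, the same filtered group-by-rating aggregation can be sketched in plain Python; the tuple layout and values are invented for the example:

    ```python
    from collections import Counter

    # Toy fact rows: (product_id, brand, category, rating).
    facts = [
        (1, "x", "y", 5), (1, "x", "y", 5), (2, "x", "y", 4),
        (2, "x", "y", 4), (2, "x", "y", 4), (3, "x", "z", 5),
    ]

    def top_ratings(brand, category, limit=10):
        # WHERE brand = x AND category = y
        filtered = [f for f in facts if f[1] == brand and f[2] == category]
        # GROUP BY rating, ORDER BY count(id) DESC, top N
        counts = Counter(f[3] for f in filtered)
        return counts.most_common(limit)

    top = top_ratings("x", "y")
    ```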
  • 17. Recommender systems
    We Know What You Ought To Be Watching This Summer
    9/6/2011
    17
  • 18. Ratings Data Matrix
    Sparse Matrix Representation
    rowID, colID, Rating, colID, Rating, …
    User1, 1, 1, 2, 5
    User2, 3, 3, 4, 4, 5, 3
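    A minimal parser for this sparse row format might look like this (the function name is ours):

    ```python
    def parse_sparse_row(line):
        """Parse one line of the sparse ratings format
        'rowID, colID, Rating, colID, Rating, ...'
        into (row_id, {col_id: rating})."""
        fields = [f.strip() for f in line.split(",")]
        row_id, pairs = fields[0], fields[1:]
        ratings = {int(pairs[i]): int(pairs[i + 1])
                   for i in range(0, len(pairs), 2)}
        return row_id, ratings

    user, ratings = parse_sparse_row("User2, 3, 3, 4, 4, 5, 3")
    ```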
  • 19. [Figure: co-clustering of the Users × Movies ratings matrix: raw data, after row clustering, after column clustering; iterate until convergence to yield K by L coclusters (K row clusters, L column clusters).]
  • 20. From the Training
    K by L Coclusters
    Average Ratings per cluster
    Average Ratings per User
    Average Ratings per Movie
    Global Average rating is 3.68
  • 21. Prediction Algorithm
    Case 1: known User, known Movie → Rating = Cluster average
    Case 2: known User, unknown Movie → Rating = User average
    Case 3: unknown User, known Movie → Rating = Movie average
    Case 4: unknown User, unknown Movie → Rating = Global average
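    The four-case fallback above can be sketched directly; the lookup tables are toy data, and only the 3.68 global average comes from the slides:

    ```python
    # Hypothetical averages produced by training.
    cluster_avg = {("u1", "m1"): 4.2}   # (user, movie) -> cocluster average
    user_avg = {"u1": 3.9}
    movie_avg = {"m1": 4.0, "m2": 2.5}
    GLOBAL_AVG = 3.68                    # global average from the deck

    def predict(user, movie):
        if user in user_avg and movie in movie_avg:
            # Case 1: known user, known movie -> cocluster average
            # (fall back to the global average if the pair is unmapped)
            return cluster_avg.get((user, movie), GLOBAL_AVG)
        if user in user_avg:
            # Case 2: known user, unknown movie -> user average
            return user_avg[user]
        if movie in movie_avg:
            # Case 3: unknown user, known movie -> movie average
            return movie_avg[movie]
        # Case 4: unknown user, unknown movie -> global average
        return GLOBAL_AVG
    ```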
  • 29. DataRush-based Recommender System
    Dataflow Prediction Application Graph
    Dataflow Training Application Graph
  • 30. Results: Scalability
    Across Cores
  • 31. Results
  • 32. Question 1: Top rated products
    Build Recommender Data Mining Model based on Coclustering customers & ratings.
    Training runtime for 100,480,507 ratings
    16.31 minutes
    Apply recommender real-time:
    Effective prediction runtime: 9.738615 μs per rating
  • 33. Question 2: Fastest Rising Products
    Method 1: Store Co-Clustering Model in DW
    Identify when products move from one cluster to another
    Method 2: Product Ratings Distribution: bin ratings and establish a distribution baseline (μ, σ) for each product
    When μ, σ change beyond a certain threshold, identify the movers & shakers and the biggest losers
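    Method 2 can be sketched as a simple baseline check; the threshold and ratings are illustrative, and a fuller version would also track the shift in σ:

    ```python
    import statistics

    def drifted(baseline_ratings, new_ratings, mu_threshold=0.5):
        """Flag a product whose mean rating drifts past the threshold
        relative to its stored baseline (sigma drift handled analogously)."""
        mu0 = statistics.mean(baseline_ratings)
        mu1 = statistics.mean(new_ratings)
        return abs(mu1 - mu0) > mu_threshold

    # A product whose average jumps from ~3 to ~4.6 is a "mover".
    rising = drifted([3, 3, 4, 2, 3], [5, 4, 5, 4, 5])
    steady = drifted([3, 3, 4, 2, 3], [3, 2, 4, 3, 3])
    ```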
  • 34. Question 3: Category Level Stats
    Category level statistics (including roll-ups) such as avg rating, content volume, etc
    Average = Sum(X)/n
    Greenplum OLAP grouping extensions: CUBE, ROLLUP, GROUPING SETS
    SELECT productcategory, productid, sum(rating)/count(*) FROM review
    GROUP BY ROLLUP(productcategory, productid)
    ORDER BY 1,2,3;
  • 35. Question 4: Top Contributors
    Top contributors
    Score (content submissions + Helpfulness Votes)
    Tag content:
    approve positive reviews;
    reject negative, inappropriate, or price-related reviews.
    Snippets: Highlight sentences in reviews
    Keep Score, sentiment, reason codes, snippets, product flaws, intelligence data, <Key> in DW
    Move content to readily available <Key, Value> Store.
    Query Top Score Contributors use <Key> to pull content real-time.
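    The DW-plus-KV split described here can be sketched with a dict standing in for the <Key, Value> store (Cassandra in the deck); all names and rows are invented:

    ```python
    # Warehouse side: scores plus a content key, no bulky text.
    dw_rows = [
        {"contributor": "alice", "score": 120, "content_key": "rev:1"},
        {"contributor": "bob",   "score": 95,  "content_key": "rev:2"},
        {"contributor": "carol", "score": 200, "content_key": "rev:3"},
    ]
    # <Key, Value> store side: full review content, fetched on demand.
    kv_store = {
        "rev:1": "Great product, works as advertised.",
        "rev:2": "Decent value for the price.",
        "rev:3": "Best purchase I made this year.",
    }

    def top_contributors(n):
        # Query the DW for top scores, then pull content by key in real time.
        top = sorted(dw_rows, key=lambda r: r["score"], reverse=True)[:n]
        return [(r["contributor"], kv_store[r["content_key"]]) for r in top]

    leaders = top_contributors(2)
    ```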
  • 36. Brisk by DataStax (buy local)
  • 37. Cassandra
    First developed by Facebook
    SuperColumns can turn a simple key-value architecture into an architecture that handles sorted lists, based on an index specified by the user.
    Can scale from one node to several thousand nodes clustered in different data centers.
    Can be tuned for more consistency or availability
    Smooth node replacement if one goes down
  • 38. Outline
    Background
    Big Data Challenge: MySQL vs. MyNoSQL
    Data Intensive Computing: online vs offline
    Big Data Solution
    Schema changes
    Analytics approach changes
    Results
    Conclusions
  • 39. Greenplum Bulk Loader
    gpload -f load_reviewer.yml -q -l ./gpload_reviewer.log
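    The control file itself isn't shown in the deck; a minimal sketch of what load_reviewer.yml might look like, assuming a CSV source (connection values, paths, and table name are placeholders):

    ```yaml
    VERSION: 1.0.0.1
    DATABASE: dw            # placeholder database name
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - /data/reviewer.csv   # placeholder input path
        - FORMAT: csv
        - DELIMITER: ','
      OUTPUT:
        - TABLE: reviewer            # placeholder target table
        - MODE: insert
    ```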
  • 40. Load Times
  • 41. Path to Conversion
  • 42. Attribution Credit
  • 43.
  • 44. X Insurance: Historical Data Load Times
    Weekly reports: 20 minutes
    Attribution Reports (pre-aggregated by 4 dimensions and deployed to GUI)
    Reach & Frequency Reports
    Campaign Optimization:
    ANN + Non-Linear Optimization
    Allocate budget onto different Sites + Placements to maximize conversions
  • 45. Attribute by CreativeSize
    INSIGHT: Top Ranked Creative Sizes
    High Propensity to Convert
    High number of Conversions
    Low Cost
    High Revenue
  • 46. Overlap Report
    Actionable:
    cookie sync with “Turn” cookies, use as block list to prevent reaching same cookies twice
  • 47.
  • 48. Outline
    Background
    Big Data Challenge: MySQL vs. MyNoSQL
    Data Intensive Computing: online vs offline
    Big Data Solution
    Schema changes
    Analytics approach changes
    Results
    Conclusions
  • 49. Common Problems (ALL)
    Data Quality
    Discovery Stats: before load & after load
    Establish Baselines and use them for validation
    Performance
    Growth rates and loading windows: low space triggers,
    Latency of online queries
    Latency of offline queries
    Agility of Schema to change
    Under-estimate value of metadata design
    Integration
    Self documentation
    Self-governance
  • 50. Common Problems (Internet Advertising)
    Bad Data:
    Well defined Customer Data requirements
    Context: Campaign, Site, Placement
    IP: invalids, GEO & Demog.
    Cost
    Revenue
    Common cookie across the different data sources
    Have fallen back in some cases on IP & user agent (browser & language)
    Only Aggregate Data
    Black-out periods and agility to rollover new quarters
  • 51. Greenplum: Lessons Learned
    Adding new nodes
    expanded cluster from 4 to 8 nodes
    Redistribution tool failed
    Duplicate rowids in multiple nodes had to re-load
    Single point of failure
    GPMASTER node is single point of failure
    All slave nodes are mirrored and failed segments can be recovered
    Read scale
    Network bandwidth
    Write scale
    Network Bandwidth
    Hard disk space (dead in the water at 80% use on GPMaster)
    IT resources
    Full Product Support
    Have been down for two weeks at a time
    Open source – healthy community
    Technology
    Works well with others: PostgreSQL, pgAdmin, Pentaho, Talend Studio, etc.
    Learning curve = none; everyone knew SQL
    Deployment = had several initial install issues. But deploying new clients is automated using Python and SQL
  • 52. Data
    Append Only
    Columnar Orientation
    Load balance via Distributed by:
    No indexing but partitioning by:
    Agility of Schema to Change
    Content
    Evaluating Cassandra & Brisk
    <Key, Value> store for content
    Score, sentiment, reason codes, + <Key> to DW
    Real-time
    Leverage Data Mining Models (PMML + store in DW)
    Use stored models to identify change in patterns
    Recommendations
  • 53. Q & A
  • 54. HALO Project: Visualization and Data Analysis, Longhorn Cluster at TACC, UT Austin
    240 Dell R610 Compute Nodes, each with:
    2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
    48 GB RAM
    73 GB local disk
  • Halo detection for visualization [Figure: original dataset vs. discovered clusters]
  • 57. Mapping Customer Data to AA Schema
