April 10-12, Chicago, IL
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Dianne Cantwell and Denny Lee
This is the presentation that Dianne Wood and I gave at the SQL PASS Business Analytics 2013 Conference: Yahoo!, Big Data, and Microsoft BI - Bigger and Better Together.

Published in: Technology

Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

  1. Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
     Dianne Cantwell and Denny Lee
     April 10-12, Chicago, IL
  2. Please silence cell phones
  3. Agenda
     • Yahoo! Business Case for Hadoop and BI
     • Big Data, Fast Queries
     • Big Data / BI Themes
       • Get the Hardware Balance Right
       • Partitioning, Partitioning, Partitioning
       • Keep it Simple
       • It is the order of things
  4. Yahoo! TAO Business Challenge
     Yahoo! manages a powerful scalable advertising exchange that includes publishers and advertisers.
  5. Yahoo! TAO Business Challenge
     Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently.
  6. Yahoo! TAO Business Challenge
     Yahoo! needs visibility into how consumers are responding to ads along many dimensions: web sites, creatives, time of day, user segments (e.g. gender, age, location) to make the exchange work as efficiently and effectively as possible.
  7. Yahoo! TAO Technical Requirements
     • Visitors to Yahoo! Branded sites: 680,000,000
     • Ad Impressions: 3,500,000,000 (per day)
     • Refresh Frequency: Hourly
     • Rows Loaded: 464,000,000,000 (per qtr)
     • Average Query Time: <10 seconds
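To put the "Rows Loaded" figure in perspective, the sketch below converts it into average rates. It assumes a quarter of about 91 days and an even spread of rows over time, neither of which is stated on the slide; `average_rates` is an illustrative helper name.

```python
# Average load rates implied by the "Rows Loaded" figure on the slide.
ROWS_PER_QUARTER = 464_000_000_000  # from the slide
DAYS_PER_QUARTER = 91               # assumption: about 13 weeks


def average_rates(rows_per_quarter: int, days: int) -> dict:
    """Return average rows per day, per hourly refresh, and per second."""
    rows_per_day = rows_per_quarter / days
    rows_per_hour = rows_per_day / 24
    rows_per_second = rows_per_hour / 3600
    return {
        "rows_per_day": rows_per_day,
        "rows_per_hour": rows_per_hour,
        "rows_per_second": rows_per_second,
    }


rates = average_rates(ROWS_PER_QUARTER, DAYS_PER_QUARTER)
for name, value in rates.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions the averages come to roughly 5.1 billion rows per day, about 212 million rows per hourly refresh, and about 59,000 rows per second sustained.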
  8. Yahoo! TAO Platform Architecture
     How did we load so much so quickly?
     • Data Aggregation & ETL: Hadoop, 2PB cluster
     • Data Archive & Staging: Oracle 11G RAC
     • BI Server: SQL Server Analysis Services 2008 R2, 24TB Cube /qtr
     • Data volumes: 1.2TB /day, 135GB/day compressed
     • Diagram labels: File 1, File 2, File N; Partition 1, Partition 2, Partition N (shown twice)
  9. Yahoo! TAO Platform Architecture
     Queries at the “speed of thought”
     • BI Query Servers: SQL Server Analysis Services 2008 R2, 24TB Cube /qtr
     • Adhoc Query/Visualization: Tableau Desktop 7
     • Optimization Application: Custom J2EE App
     • 464B rows of event level data /qtr
       • Dimensions: 42
       • Attributes: 296
       • Measures: 278
     • Avg Query Time: 2 secs
     • Avg Query Time: 5 secs
  10. Yahoo! TAO Return on Investment
      • For campaigns optimized using TAO, advertisers spent more with Yahoo! than before
      • For campaigns optimized using TAO, more eCPMs (revenue)!
  11. Yahoo! TAO Return on Investment
      • Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time!
      • No longer “flying audience blind”
  12. Yahoo! TAO Future Direction
      • Increase Segments by 3x
      • Increase data size and cartesian
      • No longer doing distinct count
      • Built frequency reports and sampling to deliver this due to the inherent complexity!
      Current Challenge
      • Hadoop to SSAS cube (more later)
      • External access to cubes
      • More disk due to need for more IO
  13. Big Data Analytics Challenges (diagram: Cube)
  14. Get the data out!
  15. Extracting the data
      • File Generation: Hadoop jobs create many files that are exported / dumped to disk in tabular format
      • File Staging: Files are propped to a staging folder for relational dB access
      • Oracle External Tables: Generate external tables that point to the staged files
        • No need to import the data
        • Processing is slow
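As an illustration of the external-table step, here is a minimal sketch that builds an Oracle `CREATE TABLE ... ORGANIZATION EXTERNAL` statement for a set of staged files. The table name, column list, directory object (`stage_dir`), file names, and delimiter are all hypothetical; the slides do not show the actual DDL. The directory object is assumed to already exist in Oracle.

```python
from typing import List, Tuple


def external_table_ddl(
    table: str,
    columns: List[Tuple[str, str]],
    directory: str,
    files: List[str],
    delimiter: str = "|",
) -> str:
    """Build DDL for an Oracle external table over delimited text files.

    Names are interpolated as-is, so this is only suitable for trusted input.
    """
    column_list = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    locations = ", ".join(f"'{name}'" for name in files)
    return (
        f"CREATE TABLE {table} (\n    {column_list}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "    TYPE ORACLE_LOADER\n"
        f"    DEFAULT DIRECTORY {directory}\n"
        "    ACCESS PARAMETERS (\n"
        "        RECORDS DELIMITED BY NEWLINE\n"
        f"        FIELDS TERMINATED BY '{delimiter}'\n"
        "    )\n"
        f"    LOCATION ({locations})\n"
        ")\n"
        "REJECT LIMIT UNLIMITED"
    )


# Hypothetical example: two staged Hadoop output files.
ddl = external_table_ddl(
    table="ad_events_ext",
    columns=[("event_hour", "VARCHAR2(13)"), ("site_id", "NUMBER"), ("impressions", "NUMBER")],
    directory="stage_dir",
    files=["part-00000.txt", "part-00001.txt"],
)
print(ddl)
```

Because the table only points at the files, nothing is imported; every query re-reads and re-parses the text, which is consistent with the slide's note that processing is slow.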
  16. AS on Oracle Case
      • Oracle OLEDB 10K rows/sec
      • 100K rows/sec
      • SSIS Connector 20K rows/sec
      • Diagram labels: Oracle, Analysis Services; Oracle, SQL, Analysis Services
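The throughput figures above can be compared against a required load rate with some simple arithmetic. The sketch below assumes a target of about 59,000 rows/sec (the average implied by 464 billion rows over a quarter of roughly 91 days) and assumes throughput scales linearly with the number of parallel streams; both are assumptions for illustration, not claims from the slides. The rates are used as plain numbers, without attributing the 100K figure to a specific connector.

```python
import math

QUOTED_RATES = [10_000, 20_000, 100_000]  # rows/sec figures from the slide
TARGET_ROWS_PER_SEC = 59_000              # assumption: average rate over a quarter


def streams_needed(target_rows_per_sec: float, rate_per_stream: float) -> int:
    """Parallel streams needed to reach the target, assuming linear scaling."""
    return math.ceil(target_rows_per_sec / rate_per_stream)


def hours_to_load(rows: float, rate_per_stream: float, streams: int = 1) -> float:
    """Hours needed to load `rows` at the given per-stream rate."""
    return rows / (rate_per_stream * streams) / 3600


for rate in QUOTED_RATES:
    print(f"{rate:>7,} rows/sec per stream -> {streams_needed(TARGET_ROWS_PER_SEC, rate)} stream(s)")
```

Under these assumptions a single stream at 10K or 20K rows/sec cannot keep pace on its own, which helps explain the emphasis on parallel partition processing later in the deck.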
  17. Passthrough Query to Linked Server
      http://msdn.microsoft.com/en-us/library/jj710329.aspx
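The slide only names the technique. In T-SQL, a pass-through query to a linked server is commonly written with `OPENQUERY`, which sends the query text to the remote server unchanged. The sketch below builds such a statement; the linked server name `ORA_STAGE` and the remote table `fact_events` are hypothetical, and the helper is an illustration rather than the presenters' code.

```python
def openquery(linked_server: str, remote_sql: str) -> str:
    """Wrap a remote query in a T-SQL OPENQUERY pass-through call.

    Single quotes in the remote query are doubled because OPENQUERY
    takes the query as a string literal.
    """
    escaped = remote_sql.replace("'", "''")
    return f"SELECT * FROM OPENQUERY({linked_server}, '{escaped}')"


# Hypothetical example: pull one day of rows from an Oracle staging table.
statement = openquery(
    "ORA_STAGE",
    "SELECT * FROM fact_events WHERE event_date = DATE '2010-10-10'",
)
print(statement)
```

The point of the pass-through form is that filtering runs on the remote database, so only the matching rows cross the link.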
  18. Partitioning, Partitioning, Partitioning
  19. Partitions: Yahoo Example – “Fast” Oracle Load
      • Data is streamed in to Oracle to files
      • To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently
      • Super fast data loads
      • Problem is that it requires constant merging of partitions
      Diagram: files are streamed in as they become available. Temp partitions 10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645 are shown in Oracle 10g and in SSAS, where they merge into the 10/10/10 partition.
  20. Partitions – Directly Merging
      • New model allows for set hourly partitions
      • No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless…
      • Process multiple cubes or measure groups in parallel
      Diagram: Oracle 10g partitions 10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00 map to the same hourly SSAS partitions for Segments, Activities, and Uniques.
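The hourly-partition model can be sketched as follows: generate the 24 partitions for a day, then process the same hour for several measure groups in parallel. The measure group names come from the slide; `process_partition` is a placeholder, since the slides do not show how processing commands were issued to SSAS.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from typing import List

MEASURE_GROUPS = ["Segments", "Activities", "Uniques"]  # names from the slide


def hourly_partitions(day: datetime) -> List[str]:
    """Return the 24 hourly partition labels for the given day."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return [(start + timedelta(hours=h)).strftime("%Y-%m-%d %H:00") for h in range(24)]


def process_partition(measure_group: str, partition: str) -> str:
    """Placeholder: a real version would send a processing command to SSAS."""
    return f"{measure_group}/{partition}"


def process_hour(partition: str) -> List[str]:
    """Process one hourly partition for every measure group in parallel."""
    with ThreadPoolExecutor(max_workers=len(MEASURE_GROUPS)) as pool:
        return list(pool.map(lambda group: process_partition(group, partition), MEASURE_GROUPS))


partitions = hourly_partitions(datetime(2010, 10, 10))
print(process_hour(partitions[0]))
```

With fixed hourly partitions there is only one new partition per measure group per hour, so parallelism comes from the number of measure groups (or cubes) rather than from many temp partitions.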
  21. It is the order of things
  22. It is the order of things
      “I am a Jem'Hadar. He is a Vorta. It is the order of things."
      "Do you really want to give up your life for the 'order of things'?"
      "It is not my life to give up, Captain – and it never was.”
      Rocks and Shoals, Deep Space Nine. Written by Ronald D. Moore
  23. Segments and the importance of sort order

      Data File                  Sorted   Not Sorted  % Diff
      fact.data             195,708,592  344,502,968  43.19%
      agg.rigid.data        106,825,677  106,825,677   0.00%
      dim1.dim2.fact.map     17,332,729   32,989,946  47.46%
      dim1.dim3.fact.map     16,923,276   32,222,813  47.48%
      dim1.dim4.fact.map      6,079,396   12,286,978  50.52%
      dim5.dim6.fact.map      2,630,888    6,057,334  56.57%
      dim1.dim7.fact.map      1,809,725    3,904,004  53.64%
      dim8.dim9.fact.map      1,592,886    3,793,452  58.01%
      dim1.dim10.fact.map     1,419,255    3,108,248  54.34%
      dim8.dim11.fact.map     1,301,221    3,042,638  57.23%
      dim1.dim12.fact.map     2,949,432    2,949,432   0.00%
      dim1.dim13.fact.map     2,934,836    2,934,836   0.00%
      dimA.dimA.fact.map      1,101,552    2,716,289  59.45%
      dim8.dimB.fact.map        961,332    2,451,956  60.79%
      dim1.dimC.fact.map      1,027,305    2,323,906  55.79%
      dim8.dim8.fact.map      1,592,886    2,308,232  30.99%
      dimA.dimD.fact.map        851,095    2,170,962  60.80%
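The "% Diff" column matches 1 - Sorted / Not Sorted, expressed as a percentage. The sketch below reproduces that calculation for two rows of the table and then shows the general effect with a toy experiment: the same values compress far better when sorted. The toy part uses `zlib` purely as a stand-in; it is not the encoding Analysis Services uses, and the synthetic data is an assumption for illustration.

```python
import random
import zlib


def pct_smaller(sorted_size: int, unsorted_size: int) -> float:
    """Percentage by which the sorted file is smaller than the unsorted one."""
    return round((1 - sorted_size / unsorted_size) * 100, 2)


# Two rows taken from the table on the slide.
print(pct_smaller(195_708_592, 344_502_968))  # fact.data
print(pct_smaller(17_332_729, 32_989_946))    # dim1.dim2.fact.map


def compressed_size(values) -> int:
    """Size in bytes of the zlib-compressed byte sequence."""
    return len(zlib.compress(bytes(values)))


# Toy data: 100,000 low-cardinality keys in random order.
random.seed(0)
keys = [random.randrange(100) for _ in range(100_000)]
toy_unsorted = compressed_size(keys)
toy_sorted = compressed_size(sorted(keys))
print(toy_unsorted, toy_sorted)
```

Sorting groups equal values into long runs, which is the property any run-oriented or dictionary-based compressor can exploit.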
  24. Across the Eighth Dimension!
      How do you associate dimensions with Star Trek Into Darkness? (diagram: Cube)
  25.
  26. Back to cube dimensions
      • Running ProcessUpdate takes a long time to run because all of the fact partitions are re-indexed!
      • Minimize likelihood by building SCD-2 dimensions
      • Composite Key based on lowest level unique values to represent row
      • Sometimes identity can be just as effective though hashing requires mapping or lookup tables
      • Create SK to allow for SCD-2 dimensions
      • Key is that we keep the memory space of the SK small
      • Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do not expect Type-2 for fact-based dimensions
      • Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
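To make the surrogate key (SK) options concrete, here is a minimal sketch of two ways to derive a key from a composite of the lowest-level values: an identity-style lookup table that hands out small sequential integers, and a hash of the composite. The class and function names, the sample attribute values, and the choice of MD5 truncated to 64 bits are all assumptions for illustration; the slides do not specify an implementation.

```python
import hashlib
from typing import Dict, Tuple

Composite = Tuple[str, ...]


class SurrogateKeyMap:
    """Identity-style keys: a lookup table from composite key to a small integer."""

    def __init__(self) -> None:
        self._keys: Dict[Composite, int] = {}

    def key_for(self, composite: Composite) -> int:
        # Assign the next integer the first time a composite is seen.
        if composite not in self._keys:
            self._keys[composite] = len(self._keys) + 1
        return self._keys[composite]


def hash_key(composite: Composite) -> int:
    """Hash-style key: deterministic 64-bit integer derived from the composite."""
    joined = "\x1f".join(composite).encode("utf-8")  # unit separator avoids ambiguous joins
    return int.from_bytes(hashlib.md5(joined).digest()[:8], "big")


sk_map = SurrogateKeyMap()
first = sk_map.key_for(("US", "M", "25-34"))
second = sk_map.key_for(("US", "F", "25-34"))
again = sk_map.key_for(("US", "M", "25-34"))
print(first, second, again, hash_key(("US", "M", "25-34")))
```

The sequential approach keeps keys compact but needs the lookup table to be maintained during loads. The hash approach can be computed independently from the fact data, at the cost of wider keys and a small collision risk.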
  27. Let’s aggregate it up
  28. Thank you!
