Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Denny Lee, Staff Developer Advocate
April 10-12, Chicago, IL
Dianne Cantwell and Denny Lee
Agenda
• Yahoo! Business Case for Hadoop and BI
• Big Data, Fast Queries
• Big Data / BI Themes
  • Get the Hardware Balance Right
  • Partitioning, Partitioning, Partitioning
  • Keep it Simple
  • It is the order of things
Yahoo! TAO Business Challenge
• Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers
• Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently
• Yahoo! needs visibility into how consumers are responding to ads along many dimensions (web sites, creatives, time of day, and user segments such as gender, age, and location) to make the exchange work as efficiently and effectively as possible
Yahoo! TAO Technical Requirements
Visitors to Yahoo! branded sites: 680,000,000
Ad Impressions: 3,500,000,000 (per day)
Rows Loaded: 464,000,000,000 (per qtr)
Refresh Frequency: Hourly
Average Query Time: <10 seconds
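To put the load requirement in perspective, the quarterly row count can be turned into a sustained rate. This is only a back-of-envelope sketch: the 91-day quarter and the assumption of an even load across all hours are mine, not figures from the slides.

```python
# Back-of-envelope load rate implied by the "Rows Loaded" requirement.
ROWS_PER_QUARTER = 464_000_000_000  # from the slide
DAYS_PER_QUARTER = 91               # assumption: not stated in the slides


def required_rates(rows_per_quarter: int, days: int) -> dict:
    """Return average rows per day, hour, and second, assuming an even load."""
    rows_per_day = rows_per_quarter / days
    rows_per_hour = rows_per_day / 24
    rows_per_second = rows_per_hour / 3600
    return {
        "rows_per_day": rows_per_day,
        "rows_per_hour": rows_per_hour,
        "rows_per_second": rows_per_second,
    }


rates = required_rates(ROWS_PER_QUARTER, DAYS_PER_QUARTER)
for name, value in rates.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions each hourly refresh has to absorb roughly 212 million rows, or about 59 thousand rows every second on average.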
Yahoo! TAO Platform Architecture
How did we load so much so quickly?
• Data Aggregation & ETL: Hadoop, 2PB cluster (File 1, File 2, … File N)
• Data Archive & Staging: Oracle 11G RAC (Partition 1, Partition 2, … Partition N)
• BI Server: SQL Server Analysis Services 2008 R2, 24TB cube/qtr (Partition 1, Partition 2, … Partition N)
• Data volumes shown between stages: 1.2TB/day and 135GB/day compressed
Yahoo! TAO Platform Architecture
Queries at the “speed of thought”
• BI Query Servers: SQL Server Analysis Services 2008 R2, 24TB cube/qtr
• Adhoc Query/Visualization: Tableau Desktop 7
• Optimization Application: Custom J2EE App
464B rows of event level data /qtr
• Dimensions: 42
• Attributes: 296
• Measures: 278
Avg Query Time: 2 secs
Avg Query Time: 5 secs
Yahoo! TAO Return on Investment
• For campaigns optimized using TAO, advertisers spent more with Yahoo! than before
• For campaigns optimized using TAO, more eCPMs (revenue)!
• Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer “flying audience blind”
Yahoo! TAO Future Direction
• Increase Segments by 3x
  • Increase data size and cartesian
• No longer doing distinct count
  • Built frequency reports and sampling to deliver this due to the inherent complexity!
• Current Challenge
  • Hadoop to SSAS cube (more later)
  • External access to cubes
  • More disk due to need for more IO
Big Data Analytics Challenges
Get the data out!
Extracting the data
• File Generation
  • Hadoop jobs create many files that are exported / dumped to disk in tabular format
• File Staging
  • Files are propagated to a staging folder for relational DB access
• Oracle External Tables
  • Generate external tables that point to the staged files
  • No need to import the data
  • Processing is slow
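As a rough illustration of the external-table step, the sketch below builds an Oracle `CREATE TABLE ... ORGANIZATION EXTERNAL` statement that points at staged files. The table name, columns, directory object, file names, and comma delimiter are all hypothetical; the slides do not show the actual DDL Yahoo! used.

```python
from typing import Sequence, Tuple


def external_table_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    directory: str,
    files: Sequence[str],
    delimiter: str = ",",  # assumption: the slides do not state the delimiter
) -> str:
    """Build DDL for an Oracle external table over already-staged files."""
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    location_sql = ", ".join(f"'{name}'" for name in files)
    return (
        f"CREATE TABLE {table} (\n  {column_sql}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        "  ACCESS PARAMETERS (\n"
        "    RECORDS DELIMITED BY NEWLINE\n"
        f"    FIELDS TERMINATED BY '{delimiter}'\n"
        "  )\n"
        f"  LOCATION ({location_sql})\n"
        ")\n"
        "REJECT LIMIT UNLIMITED"
    )


# Hypothetical names, for illustration only.
print(external_table_ddl(
    "stg_ad_events",
    [("event_hour", "VARCHAR2(13)"), ("site_id", "NUMBER"), ("impressions", "NUMBER")],
    "staging_dir",
    ["part-00000.csv", "part-00001.csv"],
))
```

Because the table only references the files, nothing is copied into Oracle when the DDL runs; the files are parsed each time the table is queried, which is consistent with the slide's note that processing is slow.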
AS on Oracle Case
• Oracle to Analysis Services, via Oracle OLEDB: 10K rows/sec
• Oracle to SQL to Analysis Services, via SSIS Connector: 20K rows/sec; 100K rows/sec
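The rates on this slide are easier to compare as elapsed time. The sketch below converts a row count and a rows-per-second rate into hours. The one-billion-row batch is an arbitrary round number of my choosing, and the assumption that throughput scales linearly with thread count is a simplification, not something the slides claim.

```python
# Rates quoted on the slide, in rows per second.
RATES_ROWS_PER_SEC = {
    "Oracle OLEDB (10K rows/sec)": 10_000,
    "SSIS Connector (20K rows/sec)": 20_000,
    "100K rows/sec figure": 100_000,
}


def hours_to_load(rows: int, rows_per_sec: int, threads: int = 1) -> float:
    """Hours needed to move `rows`, assuming throughput scales linearly with threads."""
    return rows / (rows_per_sec * threads) / 3600


BATCH_ROWS = 1_000_000_000  # illustrative batch size, not from the slides
for label, rate in RATES_ROWS_PER_SEC.items():
    print(f"{label}: {hours_to_load(BATCH_ROWS, rate):.2f} hours")
```

At 10K rows/sec a single stream needs more than a day for a billion rows, which suggests why the higher-throughput path and parallel partition processing matter for an hourly refresh.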
Passthrough Query to Linked Server
http://msdn.microsoft.com/en-us/library/jj710329.aspx
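One common way to issue a passthrough query against a SQL Server linked server is T-SQL's `OPENQUERY`, which sends the inner statement to the remote source unchanged. The sketch below only builds such a statement as a string. The linked server name and the remote query are hypothetical, and I am assuming `OPENQUERY` is the mechanism the slide has in mind; the linked article is not reproduced here.

```python
def passthrough_query(linked_server: str, remote_sql: str) -> str:
    """Wrap `remote_sql` in a T-SQL OPENQUERY call against `linked_server`."""
    # T-SQL string literals escape an embedded single quote by doubling it.
    escaped = remote_sql.replace("'", "''")
    return f"SELECT * FROM OPENQUERY([{linked_server}], '{escaped}')"


# Hypothetical linked server and remote table, for illustration only.
print(passthrough_query(
    "BIGDATA_LS",
    "SELECT site_id, SUM(impressions) FROM ad_events "
    "WHERE event_date = '2010-10-10' GROUP BY site_id",
))
```

The remote engine does the filtering and aggregation, so only the result set crosses the link.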
Partitioning, Partitioning, Partitioning
Partitions
Yahoo Example – “Fast” Oracle Load
• Data is streamed in to Oracle to files
• To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions
Diagram: files are streamed in as they become available (10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645), with matching temp partitions in Oracle 10g and in SSAS; the SSAS temp partitions are then merged into a single 10/10/10 partition.
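The merge step in the diagram can be sketched as a grouping problem: every temp partition named `<date> T<id>` is folded into the partition for its date. This is only an illustration of the bookkeeping, with the naming pattern taken from the slide; the `10/11/10` entry is made up, and the real merge would be carried out by SSAS rather than by this code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def plan_merges(temp_partitions: Iterable[str]) -> Dict[str, List[str]]:
    """Group temp partition names of the form '<date> T<id>' by their date."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for name in temp_partitions:
        day, temp_id = name.split(" ", 1)
        plan[day].append(temp_id)
    return dict(plan)


temp_partitions = [
    "10/10/10 T360772",
    "10/10/10 T360773",
    "10/10/10 T361645",
    "10/11/10 T361700",  # hypothetical next-day partition
]
for day, temp_ids in plan_merges(temp_partitions).items():
    print(f"merge {len(temp_ids)} temp partitions into {day}")
```

The plan grows with every streamed file, which is the "constant merging" cost the slide calls out.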
Partitions – Directly Merging
• New model allows for set hourly partitions
• No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless…
• Process multiple cubes or measure groups in parallel
Diagram: Oracle 10g holds hourly partitions (10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00); SSAS holds the same hourly partitions for each of Segments, Activities, and Uniques.
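A minimal sketch of the "process in parallel" idea: for a given hour, one worker per measure group processes that hour's partition. The `process_partition` function is a placeholder of my own; a real job would send a processing command to SSAS at that point, and the slides do not show how Yahoo! orchestrated this.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

MEASURE_GROUPS = ["Segments", "Activities", "Uniques"]  # names from the slide


def process_partition(measure_group: str, hour: int) -> str:
    """Placeholder for processing one hourly partition of one measure group."""
    return f"{measure_group} 10/10/10 {hour:02d}:00"


def process_hour(hour: int, max_workers: int = 3) -> List[str]:
    """Process the same hourly partition of every measure group concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order of MEASURE_GROUPS.
        return list(pool.map(lambda mg: process_partition(mg, hour), MEASURE_GROUPS))


print(process_hour(0))
```

With one partition per hour there is less to parallelize inside a measure group, so the concurrency comes from running the groups side by side.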
It is the order of things
“I am a Jem'Hadar. He is a Vorta. It is the order of things.”
“Do you really want to give up your life for the 'order of things'?”
“It is not my life to give up, Captain – and it never was.”
Rocks and Shoals, Deep Space Nine
Written by Ronald D. Moore
Segments and the importance of sort order
Data File                  Sorted    Not Sorted   % Diff
fact.data             195,708,592   344,502,968   43.19%
agg.rigid.data        106,825,677   106,825,677    0.00%
dim1.dim2.fact.map     17,332,729    32,989,946   47.46%
dim1.dim3.fact.map     16,923,276    32,222,813   47.48%
dim1.dim4.fact.map      6,079,396    12,286,978   50.52%
dim5.dim6.fact.map      2,630,888     6,057,334   56.57%
dim1.dim7.fact.map      1,809,725     3,904,004   53.64%
dim8.dim9.fact.map      1,592,886     3,793,452   58.01%
dim1.dim10.fact.map     1,419,255     3,108,248   54.34%
dim8.dim11.fact.map     1,301,221     3,042,638   57.23%
dim1.dim12.fact.map     2,949,432     2,949,432    0.00%
dim1.dim13.fact.map     2,934,836     2,934,836    0.00%
dimA.dimA.fact.map      1,101,552     2,716,289   59.45%
dim8.dimB.fact.map        961,332     2,451,956   60.79%
dim1.dimC.fact.map      1,027,305     2,323,906   55.79%
dim8.dim8.fact.map      1,592,886     2,308,232   30.99%
dimA.dimD.fact.map        851,095     2,170,962   60.80%
Chart: Sorted vs. Not Sorted
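The "% Diff" column appears to be the reduction relative to the unsorted size, (Not Sorted − Sorted) / Not Sorted; the `fact.data` row matches that formula. The sketch below checks the formula and shows the general effect with `zlib` on synthetic rows. It is an analogy only: SSAS uses its own storage format, and the synthetic data and key ranges are assumptions of mine.

```python
import random
import zlib


def pct_diff(sorted_size: int, unsorted_size: int) -> float:
    """Reduction relative to the unsorted size, as a percentage."""
    return round(100 * (unsorted_size - sorted_size) / unsorted_size, 2)


def compressed_size(rows) -> int:
    """Size in bytes of the rows after serializing to text and compressing."""
    payload = "\n".join(f"{a},{b}" for a, b in rows).encode("ascii")
    return len(zlib.compress(payload))


# Reproduces the fact.data row of the table.
print(pct_diff(195_708_592, 344_502_968))

# Synthetic low-cardinality keys (hypothetical data, fixed seed for repeatability).
rng = random.Random(0)
rows = [(rng.randrange(50), rng.randrange(20)) for _ in range(100_000)]
unsorted_size = compressed_size(rows)
sorted_size = compressed_size(sorted(rows))
print(sorted_size, unsorted_size, pct_diff(sorted_size, unsorted_size))
```

Sorting places identical key combinations next to each other, so the compressor sees long repeated runs instead of a shuffled stream.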
Across the Eighth Dimension!
How do you associate dimensions with Star Trek Into Darkness?
Back to cube dimensions
• Running ProcessUpdate
  • Takes a long time to run because all of the fact partitions are re-indexed!
  • Minimize likelihood by building SCD-2 (Type 2 slowly changing) dimensions
• Composite Key based on lowest level unique values to represent row
  • Sometimes identity can be just as effective, though hashing requires mapping or lookup tables
• Create SK (surrogate key) to allow for SCD-2 dimensions
  • Key is that we keep the memory space of the SK small
• Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do not expect Type-2 for fact-based dimensions
• Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
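The Hash(Composite) and surrogate-key ideas can be sketched together: hash the composite of lowest-level values, then use a lookup table to map each hash to a small integer SK. The class name, the SHA-256 choice, and the sample attribute values are all my assumptions; the slides do not say which hash or key width was used.

```python
import hashlib
from typing import Dict, Sequence


class SurrogateKeyMap:
    """Lookup table from a composite natural key to a compact integer SK."""

    def __init__(self) -> None:
        self._sk_by_hash: Dict[str, int] = {}

    @staticmethod
    def hash_composite(parts: Sequence[object]) -> str:
        """Hash(Composite): digest of the lowest-level values that identify a row."""
        # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
        joined = "\x1f".join(str(p) for p in parts)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def surrogate_key(self, parts: Sequence[object]) -> int:
        """Return the existing SK for this composite, or assign the next integer."""
        digest = self.hash_composite(parts)
        return self._sk_by_hash.setdefault(digest, len(self._sk_by_hash) + 1)


keys = SurrogateKeyMap()
print(keys.surrogate_key(("site-1", "female", "25-34")))  # hypothetical values
print(keys.surrogate_key(("site-2", "male", "18-24")))
print(keys.surrogate_key(("site-1", "female", "25-34")))  # same composite, same SK
```

The integer SK stays small no matter how wide the composite is, which is the memory point on the slide; the price is maintaining the mapping table.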
Let’s aggregate it up
Thank you!
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

  • 1. April 10-12, Chicago, IL Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together Dianne Cantwell and Denny Lee
  • 2. April 10-12, Chicago, IL Please silence cell phones
  • 3. 3 Agenda Yahoo! Business Case for Hadoop and BI Big Data, Fast Queries Big Data / BI Themes Get the Hardware Balance Right Partitioning, Partitioning, Partitioning Keep it Simple It is the order of things
  • 4. 4 Yahoo! manages a powerful scalable advertising exchange that includes publishers and advertisers Yahoo! TAO Business Challenge
  • 5. 5 Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently Yahoo! TAO Business Challenge
  • 6. 6 Yahoo! needs visibility into how consumers are responding to ads along many dimensions: web sites, creatives, time of day, user segments (e.g. gender, age, location) to make the exchange work as efficiently and effectively as possible Yahoo! TAO Business Challenge
  • 7. 7 Yahoo! TAO Technical Requirements 680,000,000Visitors to Yahoo! Branded sites: Ad Impressions: 3,500,000,000(perday) Refresh Frequency: Hourly 464,000,000,000(perqtr) Rows Loaded: Average Query Time: <10 seconds
  • 8. 8 Yahoo! TAO Platform Architecture How did we load so much so quickly? Data Archive & Staging Oracle 11G RAC File 1 File 2 File N Partition 1 Partition 2 Partition N Partition 1 Partition 2 Partition N 24TB Cube /qtr 1.2TB /day 135GB/day compressed 2PB cluster Data Aggregation & ETL Hadoop BI Server SQL Server Analysis Services 2008 R2
  • 9. 9 BI Query Servers SQL Server Analysis Services 2008 R2 24TB Cube /qtr Adhoc Query/Visualization Tableau Desktop 7 Optimization Application Custom J2EE App Yahoo! TAO Platform Architecture Queries at the “speed of thought” 464B rows of event level data /qtr • Dimensions: 42 • Attributes: 296 • Measures: 278 Avg Query Time: 2 secs Avg Query Time: 5 secs
  • 10. 10 Yahoo! TAO Return on Investment For campaigns optimized using TAO, advertisers spent more with Yahoo! than before For campaigns optimized using TAO, more eCPMs (revenue)!
  • 11. 11 Yahoo! TAO Return on Investment Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer “flying audience blind”
  • 12. 12 Yahoo! TAO Future Direction Increase Segments by 3x Increase data size and cartesian No longer doing distinct count Built frequency reports and sampling to deliver this due to the inherent complexity! Current Challenge Hadoop to SSAS cube (more later) External access to cubes More disk due to need for more IO
  • 13. 13 Big Data Analytics Challenges Cube F
  • 15. 15 Extracting the data File Generation Hadoop jobs create many files that are exported / dumped to disk in tabular format File Staging Files are propped to a staging folder for relational dB access Oracle External Tables Generate external tables that point to the staged files No need to import the data Processing is slow
  • 16. 16 AS on Oracle Case Oracle OLEDB 10K rows/sec 100K rows/sec SSIS Connector 20K rows/sec Oracle Analysis Services Oracle SQL Analysis Services
  • 17. 17 Passthrough Query to Linked Server http://msdn.microsoft.com/en-us/library/jj710329.aspx
  • 19. 19 PartitionsPartitions Yahoo Example – “Fast” Oracle Load • Data is streamed in to Oracle to files • To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently • Super fast data loads • Problem is that it requires constant merging of partitions Files are streamed in as they become available 10/10/10 T360772 10/10/10 T360773 … 10/10/10 T361645 10/10/10 T360772 Oracle 10g 10/10/10 T360773 10/10/10 T361645 … 10/10/10 T360772 10/10/10 T360773 10/10/10 T361645 … SSAS 10/10/10 Merge
  • 20. 20 Partitions – Directly Merging Partitions 10/10/10 00:00 Oracle 10g 10/10/10 01:00 10/10/10 23:00 … • New model allows for set hourly partitions • No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless… • Process multiple cubes or measure groups in parallel Partitions 10/10/10 00:00 10/10/10 01:00 10/10/10 23:00 … SSAS Segments 10/10/10 00:00 10/10/10 01:00 10/10/10 23:00 … Activities 10/10/10 00:00 10/10/10 01:00 10/10/10 23:00 … Uniques
  • 21. 21 It is the order of things
  • 22. 22 It is the order of things “I am a Jem'Hadar. He is a Vorta. It is the order of things." "Do you really want to give up your life for the 'order of things'?" "It is not my life to give up, Captain – and it never was.” Rocks and Shoals, Deep Space Nine Written by Ronald D. Moore
  • 23. 23 Segments and the importance of sort order
      Data File              Sorted         Not Sorted     % Diff
      fact.data              195,708,592    344,502,968    43.19%
      agg.rigid.data         106,825,677    106,825,677     0.00%
      dim1.dim2.fact.map      17,332,729     32,989,946    47.46%
      dim1.dim3.fact.map      16,923,276     32,222,813    47.48%
      dim1.dim4.fact.map       6,079,396     12,286,978    50.52%
      dim5.dim6.fact.map       2,630,888      6,057,334    56.57%
      dim1.dim7.fact.map       1,809,725      3,904,004    53.64%
      dim8.dim9.fact.map       1,592,886      3,793,452    58.01%
      dim1.dim10.fact.map      1,419,255      3,108,248    54.34%
      dim8.dim11.fact.map      1,301,221      3,042,638    57.23%
      dim1.dim12.fact.map      2,949,432      2,949,432     0.00%
      dim1.dim13.fact.map      2,934,836      2,934,836     0.00%
      dimA.dimA.fact.map       1,101,552      2,716,289    59.45%
      dim8.dimB.fact.map         961,332      2,451,956    60.79%
      dim1.dimC.fact.map       1,027,305      2,323,906    55.79%
      dim8.dim8.fact.map       1,592,886      2,308,232    30.99%
      dimA.dimD.fact.map         851,095      2,170,962    60.80%
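The % Diff column appears to be the reduction relative to the unsorted figure, that is (not sorted - sorted) / not sorted. The slide does not state the formula, so treat that as an inference. A short check against three rows of the table:

```python
def pct_reduction(sorted_size, unsorted_size):
    """Percent by which the sorted figure is smaller than the unsorted one."""
    return (unsorted_size - sorted_size) / unsorted_size * 100


# (sorted, not sorted) pairs copied from the slide's table.
rows = {
    "fact.data": (195_708_592, 344_502_968),
    "dim1.dim2.fact.map": (17_332_729, 32_989_946),
    "agg.rigid.data": (106_825_677, 106_825_677),
}

if __name__ == "__main__":
    for name, (sorted_size, unsorted_size) in rows.items():
        print(f"{name}: {pct_reduction(sorted_size, unsorted_size):.2f}%")
```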
  • 24. 24 Across the Eighth Dimension! How do you associate dimensions with Star Trek Into Darkness? Cube
  • 25. 25
  • 26. 26 Back to cube dimensions. Running ProcessUpdate takes a long time because all of the fact partitions are re-indexed! Minimize the likelihood by building SCD-2 dimensions. Composite key: based on the lowest-level unique values to represent a row; sometimes identity can be just as effective, though hashing requires mapping or lookup tables. Create an SK to allow for SCD-2 dimensions; the key is that we keep the memory space of the SK small. Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact, BUT do not expect Type-2 for fact-based dimensions. Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables).
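The key options named on this slide can be sketched side by side. Nothing here comes from the Yahoo! implementation: the `|` delimiter, the SHA-256 hash truncated to 64 bits, and the in-memory `SurrogateKeyMap` are all assumptions chosen for illustration.

```python
import hashlib


def composite_key(*parts):
    """Join the lowest-level unique values into one key.

    Assumes no part contains the '|' delimiter.
    """
    return "|".join(parts)


def hashed_key(*parts, bits=64):
    """Hash(Composite): a fixed-width integer derived from the composite key."""
    digest = hashlib.sha256(composite_key(*parts).encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


class SurrogateKeyMap:
    """Lookup table that hands out small sequential surrogate keys (SKs)."""

    def __init__(self):
        self._keys = {}

    def get(self, *parts):
        key = composite_key(*parts)
        # An existing key keeps its SK; a new key gets the next integer.
        return self._keys.setdefault(key, len(self._keys) + 1)


if __name__ == "__main__":
    sk = SurrogateKeyMap()
    print(sk.get("site42", "creative7"), hashed_key("site42", "creative7"))
```

The hashed key needs no lookup table to compute, but mapping it back to its source values does; the sequential SK keeps the key space small at the cost of maintaining the map.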
  • 28. Thank you!

Editor's Notes

  1. Like the NYSE, the Yahoo! ad network behaves like an exchange for display advertising. Advertisers are the buyers; publishers (web sites) are the sellers (Yahoo! is one of the publishers). Yahoo! needs to create the most efficient exchange possible.
  2. Performance display advertising requires that we can: identify the target audience for a campaign; monitor how they behave across a number of different dimensions.
  3. Huge opportunity for optimization but difficult given the large number of discrete dimensions
  4. The number of ad performance factors (i.e. dimensions) and the number of ad impressions per day is huge. Yahoo! branded sites attract 680 million unique visitors worldwide, and 3.5B performance display ad impressions are served on the Yahoo! exchange per day. There are large many-to-many relationships (a consumer can be a member of more than one segment): each consumer is a member of an average of 10 segments, which explodes the data by 10x. That gives 161B rows per quarter for impression data and 203B rows per quarter for segment data (compressed, but the # of rows processed is really 10x = 2 trillion). Given the number of permutations, query performance needs to be speed of thought or the system is useless; traditional ROLAP is too slow. Hundreds of dimensions, attributes and metrics create complexity. Need integration with good visualization tools to find relevant trends and performance improvement opportunities. Data needs to be fresh (from ad impression to query in less than 24 hours) or opportunities are lost: display ad campaigns have very short timeframes (< 2 weeks).
  5. Key design concepts are: use standard, off-the-shelf parts; loosely coupled components (using a pull architecture); centralize data aggregation on the grid using Hadoop; leverage Oracle’s external table feature to make data available to SSAS with minimal latency; one-to-one match of SSAS partitions to Oracle partitions so no aggregation is needed and partition pruning is enabled (30+ trillion rows in Oracle tables); maximize parallel loading (90+ threads loading in parallel); separate cube building from cube querying. Improvements in HW/design: 9h -> 2.5h from a change in HW (IBM x3560 M3, 256GB RAM, 48 cores; EMC Clariion SAN); 2.5h -> 1.25h from use of Data Direct / Attunity drivers.
  6. The cube is complex due to the nature of the ad business. Need to provide an “anything by anything” query environment to find the optimization opportunities; if queries aren’t fast, we lose the value. Need to update the cube continuously given that there’s limited time to optimize a display ad campaign (data needs to be updated 4x a day at minimum). Used SSAS aggregations extensively, which cut down on Hadoop aggregations dramatically. Only 8 fact tables are loaded (4 areas, 1 detail, 1 aggregate), as opposed to an existing ROLAP application at Yahoo! that requires 3,600 fact (aggregate) tables.
  7. Doubled the eCPM (revenue) by allowing our campaign managers to “tune” campaign targeting and creatives. Drove an increase in spend from advertisers since they got better performance by advertising through Yahoo!
  8. IMPORTANT: Sorting is required for both the source and the cube partition queries.
  9. Haven’t used UBO yet due to the 2005 issues. Creates own spreadsheet (above) to hand-make aggregations. Extremely difficult to make/explain aggs. Analysis: once you split, how long is ProcessData vs. ProcessIndexes? To determine if aggregation creation is the issue or not.
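Note 8's point about sorting, together with the file comparison on slide 23, can be illustrated with a generic compressor. This is an analogy only: zlib stands in for whatever compression Analysis Services applies to its segments, and the key cardinality and row count are arbitrary assumptions.

```python
import random
import zlib

random.seed(0)  # fixed seed so the comparison is repeatable

# 200,000 rows referencing 1,000 distinct dimension keys (arbitrary sizes).
keys = [random.randrange(1000) for _ in range(200_000)]


def compressed_size(values):
    """Pack each key as a 4-byte integer and return the zlib-compressed size."""
    raw = b"".join(value.to_bytes(4, "little") for value in values)
    return len(zlib.compress(raw))


unsorted_size = compressed_size(keys)
sorted_size = compressed_size(sorted(keys))

if __name__ == "__main__":
    print("not sorted:", unsorted_size, "sorted:", sorted_size)
```

Sorting groups identical keys into long runs, which a compressor can encode far more compactly than the same values in arrival order.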