Spark’s role in building a new reporting infrastructure in 45 days
Kyle Burke – Manager, Data Science
Dexter Jones – Manager, BI
Ken Rona – Chief Data Scientist
We are hiring
Look for us by the ping-pong table
3
Challenge: Build BI/Reporting capability in 45 days
¶ Existing in-house solution was limited and hard to change.
¶ Operations relied on a third party for reporting.
• Some clients were not configured to provide data to the third party.
¶ Onboarding 40 new clients in 45 days provided an opportunity to rethink how we provided reporting to all clients.
• Separated compute and storage layers to provide flexibility to grow as needs change.
• Based on standard fact and dimension tables.
• Looked to AWS/EMR/Spark/Presto/Athena/Hive for a solution that met speed and cost needs.
• Needed to process 3 billion transactions per day.
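To put the 3 billion transactions per day in perspective, the arithmetic below converts it to hourly and per-second rates. It assumes traffic is spread evenly across the day, which is a simplification; real peaks would be higher.

```python
# Back-of-the-envelope throughput implied by 3 billion transactions per day.
# Assumption: load is spread evenly over 24 hours (real traffic will peak higher).
TRANSACTIONS_PER_DAY = 3_000_000_000

per_hour = TRANSACTIONS_PER_DAY / 24
per_second = TRANSACTIONS_PER_DAY / (24 * 60 * 60)

print(f"{per_hour:,.0f} transactions per hour")
print(f"{per_second:,.0f} transactions per second")
```

The hourly figure is the more relevant one here, since it approximates the volume each hourly batch would have to handle under the even-spread assumption.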
4
Entire solution runs on AWS
[Diagram: Dimensions, Facts, Metastore, Reporting Data, ETL]
5
Source Systems are S3 and Redshift
¶ Focused on data that was available
¶ Accepted that some data would still be missing at the end of phase 1
[Diagram: Source Systems, Dimensions, Facts]
6
ETL in Spark and data written to Parquet files
¶ Dimensions and Facts are processed and summarized every hour
¶ Files are written to S3 in Parquet (saves space/money/time)
[Diagram: Source Systems, Compute - ETL, Storage, Dimensions, Facts]
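The hourly summarization described on this slide can be sketched in plain Python. This is only an illustration of the roll-up logic, not the Spark job from the talk: the record layout (client, timestamp, amount), the `dt=`/`hour=` partition keys, the `facts/` prefix, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def hour_partition(ts: datetime) -> str:
    """Build a Hive-style partition path for the hour containing ts."""
    return f"dt={ts:%Y-%m-%d}/hour={ts:%H}"


def summarize_hourly(transactions):
    """Roll up (client_id, timestamp, amount) records per client and hour."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for client_id, ts, amount in transactions:
        key = (client_id, hour_partition(ts))
        totals[key]["count"] += 1
        totals[key]["amount"] += amount
    return dict(totals)


# Hypothetical sample data: two transactions in one hour, one in the next.
sample = [
    ("client_a", datetime(2017, 6, 1, 9, 5), 10.0),
    ("client_a", datetime(2017, 6, 1, 9, 40), 5.0),
    ("client_b", datetime(2017, 6, 1, 10, 1), 7.5),
]

summary = summarize_hourly(sample)
for (client_id, partition), row in sorted(summary.items()):
    print(f"facts/{partition}/client={client_id} -> {row}")
```

Keying the output by an hour-level partition path is what lets each hourly run write its own files without touching earlier hours.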
7
Hive is used as a metastore for partition data
¶ Hive Metastore serves as the reference for systems that access the Parquet data on S3
¶ Tracks how data is partitioned, which decreases query time
[Diagram: Source Systems, Compute - ETL, Storage, Dimensions, Facts, Metastore]
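Why tracking partitions decreases query time can be shown with a toy catalog. The sketch below is not the Hive Metastore API; it is a plain dictionary standing in for one, and the bucket name, partition keys, and `prune` function are hypothetical. The point is that a filter on partition values selects a subset of locations before any file is read.

```python
# Hypothetical catalog: (date, hour) partition values -> S3 prefix.
# A real metastore holds this mapping; the bucket name here is made up.
catalog = {
    ("2017-06-01", "09"): "s3://example-bucket/facts/dt=2017-06-01/hour=09/",
    ("2017-06-01", "10"): "s3://example-bucket/facts/dt=2017-06-01/hour=10/",
    ("2017-06-02", "09"): "s3://example-bucket/facts/dt=2017-06-02/hour=09/",
    ("2017-06-02", "10"): "s3://example-bucket/facts/dt=2017-06-02/hour=10/",
}


def prune(catalog, dt=None, hour=None):
    """Return only the locations whose partition values match the filter."""
    return [
        location
        for (p_dt, p_hour), location in sorted(catalog.items())
        if (dt is None or p_dt == dt) and (hour is None or p_hour == hour)
    ]


all_locations = prune(catalog)                        # no filter: every partition
one_hour = prune(catalog, dt="2017-06-02", hour="09")  # filter: a single partition

print(len(all_locations), "locations without a filter")
print(len(one_hour), "location with a date and hour filter")
```

A query engine that consults such a mapping only has to list and read the matching prefixes, which is where the time saving comes from.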
8
Presto serves as query engine for summarized hourly data
¶ Presto is used for real-time querying of the data, and results are written back to S3
[Diagram: Source Systems, Compute - ETL, Storage, Dimensions, Facts, Metastore, Compute - Query]
9
Athena is similar to Presto and might be more cost efficient. Under review for power users
¶ Athena allows power users to efficiently query storage
¶ Charges by the amount of data scanned
[Diagram: Source Systems, Compute - ETL, Storage, Dimensions, Facts, Metastore, Compute - Query (x2)]
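Because the charge depends on the amount of data scanned, partitioning affects cost as well as speed. The sketch below estimates that effect under stated assumptions: the price per terabyte is a placeholder parameter (check current AWS pricing), and the table size and partition count are invented for illustration.

```python
TB = 1024 ** 4  # bytes in one tebibyte, used here as the billing unit


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate query cost from bytes scanned; usd_per_tb is an assumed rate."""
    return bytes_scanned / TB * usd_per_tb


# Hypothetical table: 2 TB split evenly over 30 days of hourly partitions.
table_bytes = 2 * TB
hourly_partitions = 30 * 24

full_scan = scan_cost_usd(table_bytes)
one_partition = scan_cost_usd(table_bytes // hourly_partitions)

print(f"Full scan:      ${full_scan:.2f}")
print(f"One partition:  ${one_partition:.4f}")
```

Under these assumptions, a query restricted to a single hourly partition scans 1/720 of the table, and its estimated cost shrinks by roughly the same factor.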
10
Demo of partition speed using Athena
11
Currently using MicroStrategy as authoring tool for BI/analysts and for report delivery
¶ MicroStrategy is the BI tool
¶ Experimented with QuickSight (AWS), but it was impractical to manage for multiple clients
[Diagram: Source Systems, Compute - ETL, Storage, Dimensions, Facts, Metastore, Compute - Query (x2), Reporting Data, Reporting]

 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Real Time Reporting Platform

  • 1. Spark’s role in building a new reporting infrastructure in 45 days
      Kyle Burke, Manager, Data Science
      Dexter Jones, Manager, BI
      Ken Rona, Chief Data Scientist
  • 2. We are hiring. Look by the Ping Pong table.
  • 3. Challenge: Build BI/reporting capability in 45 days
      • The existing in-house solution was limited and hard to change.
      • Operations relied on a third party for reporting.
          • Some clients were not configured to provide data to the third party.
      • Onboarding 40 new clients in 45 days provided an opportunity to rethink how we provided reporting to all clients.
          • Separated the compute and storage layers to provide flexibility to grow as needs change.
          • Based on standard fact and dimension tables.
          • Looked to AWS/EMR/Spark/Presto/Athena/Hive for a fast, cost-effective solution.
          • Needed to process 3 billion transactions per day.
  • 4. Entire solution runs on AWS
      [Diagram labels: Dimensions, Facts, Metastore, Reporting Data, ETL]
  • 5. Source systems are S3 and Redshift
      • Focused on data that was available.
      • Accepted that some data would still be missing at the end of phase 1.
      [Diagram labels: Source Systems, Dimensions, Facts]
  • 6. ETL in Spark and data written to Parquet files
      • Dimensions and Facts are processed and summarized every hour.
      • Files are written to S3 in Parquet (saves space/money/time).
      [Diagram labels: Source Systems, Compute - ETL, Storage, Dimensions, Facts]
  • 7. Hive is used as a metastore for partition data
      • The Hive Metastore serves as the reference for systems that access the Parquet data on S3.
      • It tracks how the data is partitioned, which decreases query time.
      [Diagram labels: Source Systems, Compute - ETL, Storage, Metastore, Dimensions, Facts]
  • 8. Presto serves as query engine for summarized hourly data
      • Presto is used for real-time querying of the data, and results are written back to S3.
      [Diagram labels: Source Systems, Compute - ETL, Storage, Metastore, Compute - Query, Dimensions, Facts]
  • 9. Athena is similar to Presto and might be more cost efficient. Under review for power users
      • Athena allows power users to query storage efficiently.
      • Charges are based on the amount of data scanned.
      [Diagram labels: Source Systems, Compute - ETL, Storage, Metastore, Compute - Query (x2), Dimensions, Facts]
  • 10. Demo of partition speed using Athena
  • 11. Currently using MicroStrategy as authoring tool for BI/analysts and for report delivery
      • MicroStrategy is the BI tool.
      • Experimented with QuickSight (AWS) but found it impractical to manage for multiple clients.
      [Diagram labels: Source Systems, Compute - ETL, Storage, Metastore, Compute - Query (x2), Reporting Data, Reporting, Dimensions, Facts]
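
Slides 6, 7, and 9 together suggest why partitioning matters here: data lands hourly, the metastore records how it is partitioned, and Athena charges by the amount of data scanned. The following is a minimal Python sketch of that reasoning, not the presenters' implementation. The bucket name, the `dt`/`hour` partition keys, the 2 GiB per-partition size, and the price per terabyte are all hypothetical placeholders chosen for illustration; the slides do not give any of them.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

BYTES_PER_TB = 1024 ** 4  # binary terabyte; an assumption for this sketch

PartitionKey = Tuple[str, str]  # (dt, hour), hypothetical partition columns


def partition_path(root: str, table: str, ts: datetime) -> str:
    """Build a Hive-style key=value partition directory for one hour of data."""
    return f"{root}/{table}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


def prune(sizes: Dict[PartitionKey, int], dt: str,
          hour: Optional[str] = None) -> Dict[PartitionKey, int]:
    """Keep only the partitions that match the dt (and optional hour) filter."""
    return {
        key: size for key, size in sizes.items()
        if key[0] == dt and (hour is None or key[1] == hour)
    }


def scan_cost(bytes_scanned: int, usd_per_tb: float) -> float:
    """Cost of a query billed purely by bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Thirty days of hourly partitions, each assumed to hold 2 GiB of Parquet.
start = datetime(2024, 1, 1)
sizes: Dict[PartitionKey, int] = {}
for i in range(30 * 24):
    ts = start + timedelta(hours=i)
    sizes[(ts.strftime("%Y-%m-%d"), ts.strftime("%H"))] = 2 * 1024 ** 3

full_scan = sum(sizes.values())
one_hour = sum(prune(sizes, "2024-01-15", "09").values())
one_day = sum(prune(sizes, "2024-01-15").values())

PRICE = 5.0  # hypothetical USD per TB scanned; check current pricing
example_path = partition_path(
    "s3://example-bucket/reporting", "facts", datetime(2024, 1, 15, 9)
)
full_cost = scan_cost(full_scan, PRICE)
hour_cost = scan_cost(one_hour, PRICE)
```

Under these assumed numbers, a query filtered to a single hour reads one partition out of 720, so both the bytes scanned and the scan-based cost drop by the same factor. Real savings depend on actual partition sizes, file layout, and the columns a query touches.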