This presentation was given at the Atlanta Hadoop User Group and outlines the architecture of a real-time reporting platform we built in 45 days at IgnitionOne.
Real Time Reporting Platform
1. Spark’s role in building a new reporting infrastructure in 45 days
Kyle Burke – Manager, Data Science
Dexter Jones – Manager, BI
Ken Rona – Chief Data Scientist
3. Challenge: Build BI/Reporting capability in 45 days
¶ Existing in-house solution was limited and hard to change.
¶ Operations relied on a third party for reporting.
• Some clients were not configured to provide data to the third party.
¶ Onboarding 40 new clients in 45 days provided an opportunity to re-think how we provided reporting to all clients.
• Separated compute and storage layers to provide flexibility to grow as needs change.
• Based on standard fact and dimension tables.
• Looked to AWS (EMR, Spark, Presto, Athena, Hive) for a solution that met speed and cost needs.
• Needed to process 3 billion transactions per day.
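To put the 3 billion transactions per day in perspective, the figure can be converted into per-hour and per-second rates. This is only back-of-the-envelope arithmetic and assumes traffic is spread evenly across the day, which the slides do not state.

```python
# Rough throughput implied by the daily volume on the slide.
# Assumption: transactions arrive evenly over 24 hours.
TRANSACTIONS_PER_DAY = 3_000_000_000
HOURS_PER_DAY = 24
SECONDS_PER_DAY = 24 * 60 * 60

per_hour = TRANSACTIONS_PER_DAY // HOURS_PER_DAY
per_second = TRANSACTIONS_PER_DAY / SECONDS_PER_DAY

print(f"{per_hour:,} transactions per hour")
print(f"{per_second:,.0f} transactions per second on average")
```

Under that assumption, each hour covers about 125 million transactions, or roughly 35 thousand per second.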
4. Entire solution runs on AWS
[Diagram labels: Dimensions, Facts, Metastore, Reporting Data, ETL]
5. Source Systems are S3 and Redshift
¶ Focused on data that was available
¶ Embraced that data was going to be missing at the end of phase 1
[Diagram labels: Dimensions, Facts, Source Systems]
6. ETL in Spark and data written to Parquet files
¶ Dimensions and Facts are processed and summarized every hour
¶ Files are written to S3 in Parquet (saves space/money/time)
[Diagram labels: Dimensions, Facts, Source Systems, Compute - ETL, Storage]
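The hourly summarization step can be illustrated with a small sketch. It uses plain Python rather than Spark so that it runs anywhere, and the row layout, the `summarize_hourly` and `partition_prefix` helpers, the bucket name, and the `dt=/hour=` directory layout are all hypothetical, not taken from the actual pipeline.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw fact rows: (event timestamp, client id, amount).
raw_facts = [
    (datetime(2017, 3, 1, 9, 5), "client_a", 2.0),
    (datetime(2017, 3, 1, 9, 40), "client_a", 3.0),
    (datetime(2017, 3, 1, 9, 59), "client_b", 1.5),
    (datetime(2017, 3, 1, 10, 1), "client_a", 4.0),
]


def summarize_hourly(rows):
    """Count events and sum amounts per (hour, client)."""
    totals = defaultdict(lambda: {"events": 0, "amount": 0.0})
    for ts, client, amount in rows:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        bucket = totals[(hour, client)]
        bucket["events"] += 1
        bucket["amount"] += amount
    return dict(totals)


def partition_prefix(hour, root="s3://example-bucket/facts"):
    """Hive-style prefix for one hourly partition (assumed layout)."""
    return f"{root}/dt={hour:%Y-%m-%d}/hour={hour:%H}"


summary = summarize_hourly(raw_facts)
for (hour, client), agg in sorted(summary.items()):
    print(partition_prefix(hour), client, agg)
```

Writing one directory per hour keeps each run's output separate and lets later queries skip hours they do not need.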
7. Hive is used as a metastore for partition data
¶ Hive Metastore serves as the reference for systems that look to access the Parquet data on S3
¶ Tracks how data is partitioned, which decreases query time
[Diagram labels: Dimensions, Facts, Source Systems, Compute - ETL, Storage, Metastore]
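Why tracking partitions reduces query time can be shown with a toy model of partition pruning. The dictionary below only stands in for a metastore; it is not the Hive Metastore API, and the paths and the `prune` helper are hypothetical.

```python
# Toy stand-in for a metastore: partition values -> file locations.
# Paths and partition keys (dt, hour) are illustrative assumptions.
metastore = {
    ("2017-03-01", "09"): ["s3://example-bucket/facts/dt=2017-03-01/hour=09/part-0.parquet"],
    ("2017-03-01", "10"): ["s3://example-bucket/facts/dt=2017-03-01/hour=10/part-0.parquet"],
    ("2017-03-02", "09"): ["s3://example-bucket/facts/dt=2017-03-02/hour=09/part-0.parquet"],
}


def prune(partitions, dt=None, hour=None):
    """Return only the files whose partition values match the filter."""
    files = []
    for (p_dt, p_hour), paths in partitions.items():
        if dt is not None and p_dt != dt:
            continue  # whole partition skipped without reading any file
        if hour is not None and p_hour != hour:
            continue
        files.extend(paths)
    return files


all_files = prune(metastore)
one_hour = prune(metastore, dt="2017-03-01", hour="09")
print(f"unfiltered query reads {len(all_files)} files")
print(f"filtered query reads {len(one_hour)} file(s)")
```

A query engine that knows the partition layout can make the same decision from the filter alone, before opening any Parquet file.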
8. Presto serves as query engine for summarized hourly data
¶ Presto is used for real-time querying of the data, and results are written back to S3
[Diagram labels: Dimensions, Facts, Source Systems, Compute - ETL, Storage, Metastore, Compute - Query]
9. Athena is similar to Presto and might be more cost efficient. Under review for power users
¶ Athena allows power users to efficiently query storage
¶ Charges by amount of data scanned
[Diagram labels: Dimensions, Facts, Source Systems, Compute - ETL, Storage, Metastore, Compute - Query (shown twice)]
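Because the charge depends on the amount of data scanned, restricting a query to one partition changes its cost directly. The sketch below is illustrative only: the price per terabyte is a parameter with an assumed default, a terabyte is taken as 1024^4 bytes, and the table size and partition count are made up.

```python
def scan_cost_usd(bytes_scanned, price_per_tb=5.0):
    """Cost of one query under a pay-per-data-scanned model.

    price_per_tb is an assumed value; check current pricing before relying on it.
    """
    terabytes = bytes_scanned / 1024**4
    return terabytes * price_per_tb


# Hypothetical table: 2 TB spread evenly over 30 days of hourly partitions.
full_table_bytes = 2 * 1024**4
hourly_partitions = 30 * 24
one_partition_bytes = full_table_bytes / hourly_partitions

full_scan = scan_cost_usd(full_table_bytes)
single_hour = scan_cost_usd(one_partition_bytes)

print(f"full scan:   ${full_scan:.2f}")
print(f"single hour: ${single_hour:.4f}")
```

With these made-up numbers a single-hour query scans 1/720 of the data, so it costs 1/720 of a full scan.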
10. Demo of partition speed using Athena
11. Currently using MicroStrategy as authoring tool for BI/analysts and for report delivery
¶ MicroStrategy is the BI tool
¶ Experimented with QuickSight (AWS) but found it impractical for managing multi-client reporting
[Diagram labels: Dimensions, Facts, Source Systems, Compute - ETL, Storage, Metastore, Compute - Query (shown twice), Reporting Data, Reporting]