Architecting Data Lakes on AWS with HiFX
Established in the year 2001, HiFX is an Amazon Web Services Consulting Partner
that has been designing and migrating applications and workloads in the cloud since
2010. We have been helping organisations to become truly data driven by building
data lakes in AWS since 2015.
The Challenges
Lack of agility and accessibility in data analysis, which the product team needs to make smart business
decisions and improve strategies.
Increasing volume and velocity of data. With new digital properties getting added, there was a need to
design the collection and storage layers that would scale well.
Dozens of independently managed collections of data, leading to data silos. Having no single source of truth was
leading to difficulties in identifying what type of data is available, getting access to it and integration.
Poorly recorded data. Often, the meaning and granularity of the data was getting lost in processing.
Our Journey from Data to Decisions with an AWS powered Data Lake
Connecting dozens of data
streams and repositories to a
unified data pipeline enabling
near realtime access to any data
source
Engineering well-designed big
data stores for reporting and
exploratory analysis
Architecting a secure, well-governed
data lake to store all
data in a raw format. S3 is the
fabric with which we have woven
the solution.
Processing data in streams or
batches to aid analytics and
machine learning, supplemented
by smart workflow management to
orchestrate the tasks
Dynamic dashboards and
visualisations that make data tell
stories and help drive insights.
Offering recommendations and
predictive analytics off the data
in the data lake
COLLECT STORE PROCESS CONSUME
Scribe (Collector)
Accumulo (Storage)
Accumulo is the data consumer
component responsible for reading data
from the event streams (Kinesis Streams),
performing rudimentary data quality checks
and converting data to Avro format before
loading it into the Data Lake
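A minimal sketch of what a rudimentary data quality check on incoming stream records might look like. The field names (`event_id`, `event_type`, `timestamp`) and the rejection reasons are assumptions for illustration; the slides do not describe the actual event schema or the rules Accumulo applies, and the Avro conversion step is left out.

```python
import json
from typing import Any, Optional, Tuple

# Hypothetical required fields and their expected types (not from the slides).
REQUIRED_FIELDS = {"event_id": str, "event_type": str, "timestamp": (int, float)}


def check_event(raw: bytes) -> Tuple[Optional[dict], Optional[str]]:
    """Return (event, None) when the record passes, else (None, reason)."""
    try:
        event: Any = json.loads(raw)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None, "not valid JSON"
    if not isinstance(event, dict):
        return None, "not a JSON object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            return None, f"missing field: {name}"
        value = event[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None, f"wrong type for field: {name}"
    if event["timestamp"] <= 0:
        return None, "non-positive timestamp"
    return event, None


def split_records(records):
    """Separate records that pass the checks from those that fail."""
    accepted, rejected = [], []
    for raw in records:
        event, reason = check_event(raw)
        if reason is None:
            accepted.append(event)
        else:
            rejected.append((raw, reason))
    return accepted, rejected
```

Keeping the rejected records together with a reason makes it possible to route them to a separate location for inspection instead of dropping them silently.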
Our Data Lake in S3 captures and stores
raw data at scale for a low cost. It allows us
to store many types of data in the same
repository while allowing us to define the
structure of the data at the time when it is
used
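Defining structure at the time of use (often called schema-on-read) can be illustrated with a small sketch. The raw records, field names and the two schemas below are invented for the example and are not taken from the actual data lake.

```python
import json

# Stand-in for raw records stored as-is in the lake (hypothetical content).
RAW_LINES = [
    '{"user": "u1", "price": "19.99", "ts": "1700000000", "page": "/p/42"}',
    '{"user": "u2", "price": "5.00", "ts": "1700000060"}',
]

# Two consumers apply different structures to the same raw data.
SALES_SCHEMA = {"user": str, "price": float}
TRAFFIC_SCHEMA = {"user": str, "ts": int, "page": str}


def read_with_schema(lines, schema):
    """Project and cast each raw record using a schema chosen by the reader."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field])
            for field, cast in schema.items()
            if field in record  # fields absent from a record are skipped
        }


sales_rows = list(read_with_schema(RAW_LINES, SALES_SCHEMA))
traffic_rows = list(read_with_schema(RAW_LINES, TRAFFIC_SCHEMA))
```

The raw lines are never rewritten; each reader decides which fields it needs and how to type them.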
Scribe collects data from the trackers
and writes it to Kinesis Streams
It is written in Go and engineered for
high concurrency, low latency and
horizontal scalability
Currently running on two c4.large
instances, our API latency is 12.6ms at
the 50th percentile and 36ms at the 75th
percentile. This is made possible
by the consistent and
predictable performance of Kinesis
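For readers unfamiliar with percentile latency figures, here is a small sketch of the nearest-rank method. The sample latencies are made up for illustration and are not measurements from Scribe; the slides do not say which percentile method was used.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [5, 7, 9, 12, 15, 20, 30, 80]
p50 = percentile(latencies_ms, 50)
p75 = percentile(latencies_ms, 75)
```

Percentiles are less sensitive to a few slow requests than the mean, which is why they are commonly reported for API latency.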
Why Amazon S3 for the Data Lake?
Performance is somewhat lower than that of an HDFS cluster, but this doesn't affect our workloads significantly. EMRFS with
consistent view (backed by DynamoDB) works really well
Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies, and
security features: SSL, client/server-side encryption
Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999%
durability. Lower TCO and easier to scale than HDFS
Decoupled storage and compute allowing multiple & heterogeneous analysis clusters to use the same data
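As an illustration of tiered storage via lifecycle policies, the sketch below builds a rule set in the shape accepted by the S3 lifecycle configuration API. The prefix, rule ID and day thresholds are assumptions chosen for the example, not values from the actual setup, and the function only builds the dictionary without calling AWS.

```python
def build_lifecycle_rules(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects under a prefix.

    Objects move to Standard-IA first and to Glacier later. The thresholds
    are hypothetical defaults.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# With boto3, a dictionary like this could be passed as the
# LifecycleConfiguration argument of put_bucket_lifecycle_configuration.
config = build_lifecycle_rules("raw/")
```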
Prism (Processor)
Lens (Consumer)
Custom built reporting & visualisation app
to help business owners to easily interpret,
visualise and record data and derive
insights
Detailed analysis of KPIs, Event
Segmentation, Funnels, Search Insights,
Path Finder, Retention/Addiction Analysis,
etc., powered by Redshift and Druid. Pgpool
is used to cache Redshift queries.
Unified processing engine written in
Scala, using Apache Spark running
on EMR
Airflow is used to programmatically
author, schedule and monitor
workflows
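Workflow managers such as Airflow run tasks in an order that respects their dependencies. The sketch below shows only that ordering idea using Python's standard `graphlib` module; it is not Airflow code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are invented for illustration.
PIPELINE = {
    "load_raw_events": set(),
    "compute_kpis": {"load_raw_events"},
    "build_funnels": {"load_raw_events"},
    "publish_to_warehouse": {"compute_kpis", "build_funnels"},
}


def run_order(dependencies):
    """Return one valid execution order for the given dependency graph."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(PIPELINE)
```

A real scheduler adds retries, scheduling intervals and monitoring on top of this ordering, and can run independent tasks such as the two middle ones in parallel.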
Prism generates data for tracking KPIs
and performs funnel, pathflow, retention
and affinity analysis. It also includes
machine learning workloads that
generate recommendations and
predictions
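To make funnel analysis concrete, here is a small single-machine sketch that counts how many users reach each step in order. The step names and sample events are assumptions; the production job described above runs on Spark, and its logic is not shown in the slides.

```python
from collections import defaultdict

# Hypothetical funnel definition.
FUNNEL_STEPS = ["view_product", "add_to_cart", "checkout", "purchase"]


def funnel_counts(events, steps):
    """Count users reaching each step, requiring steps to occur in order.

    events: iterable of (user_id, timestamp, event_type) tuples.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, event_type in events:
        by_user[user_id].append((timestamp, event_type))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        next_step = 0
        for _, event_type in sorted(user_events):  # chronological order
            if next_step < len(steps) and event_type == steps[next_step]:
                counts[next_step] += 1
                next_step += 1
    return dict(zip(steps, counts))


sample_events = [
    ("u1", 1, "view_product"), ("u1", 2, "add_to_cart"),
    ("u1", 3, "checkout"), ("u1", 4, "purchase"),
    ("u2", 1, "view_product"), ("u2", 2, "add_to_cart"),
    ("u3", 1, "add_to_cart"), ("u3", 2, "view_product"),
]
result = funnel_counts(sample_events, FUNNEL_STEPS)
```

User `u3` adds to cart before viewing a product, so only the view step is counted for that user.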
KPIs
Product relationship
Understand which products are viewed consecutively
Product affinity
Understand which products are purchased together
Sales
Hourly, daily, weekly, monthly, quarterly, and annual
Average market basket
Average order size
Cart abandonment rate
Shopping cart abandonment rate
Days/ Visits To "Purchase"
The average number of days and sessions from the first website
interaction to purchase.
Cost per Acquisition
(Total Cost of Marketing Activities) / (# of Conversions)
Repeat purchase rate
What % of our customers are repeat customers
Product page performance
Measuring product performance
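Several of the KPIs above are simple ratios. The sketch below spells them out; the cost per acquisition formula follows the definition given in the list, while the cart abandonment and repeat purchase formulas use common definitions that are assumed here rather than stated in the slides. All input numbers are invented.

```python
def cost_per_acquisition(total_marketing_cost, conversions):
    """(Total cost of marketing activities) / (number of conversions)."""
    if conversions <= 0:
        raise ValueError("conversions must be positive")
    return total_marketing_cost / conversions


def cart_abandonment_rate(carts_created, purchases_completed):
    """Assumed definition: share of created carts that did not end in a purchase."""
    if carts_created <= 0:
        raise ValueError("carts_created must be positive")
    return 1 - purchases_completed / carts_created


def repeat_purchase_rate(orders_per_customer):
    """Assumed definition: share of purchasing customers with two or more orders.

    orders_per_customer: dict mapping customer id to number of orders.
    """
    buyers = [n for n in orders_per_customer.values() if n >= 1]
    if not buyers:
        raise ValueError("no purchasing customers")
    return sum(1 for n in buyers if n >= 2) / len(buyers)


def average_order_size(total_revenue, order_count):
    """Revenue per order."""
    if order_count <= 0:
        raise ValueError("order_count must be positive")
    return total_revenue / order_count


# Hypothetical inputs.
cpa = cost_per_acquisition(5000, 200)
abandonment = cart_abandonment_rate(400, 100)
repeat_rate = repeat_purchase_rate({"a": 3, "b": 1, "c": 2, "d": 1})
```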
The scatter plot compares the number of unique
users that view each product with the number of
unique users that add the product to basket, with
the size of each dot being the number of uniques
that buy the product.
Any products located in the lower right corner are
highly trafficked but low converting - any effort
spent fixing those product pages (e.g. by checking
the copy, updating the product images or lowering
the price) should be rewarded with a significant
sales uplift, given the number of people visiting
those pages
In contrast, products located in the top left of the
plot are very highly converting, but low trafficked
pages. We should drive more traffic to these
pages, either by positioning those products more
prominently on catalog pages, for example, or by
spending marketing dollars driving more traffic to
those pages specifically. Again, that investment
should result in a significant uplift in sales, given
how highly converting those products are.
Similarly, products in the lower left corner are
performing poorly, but it is not clear whether this is
because they have low traffic levels and/or are
poor at driving conversions. We should invest in
improving the performance of these pages, but the
return on that investment is likely to be smaller (or
harder to achieve) than the other two opportunities.
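The quadrant reading of the scatter plot can be approximated in code. This sketch assumes each product is described by unique viewers and unique users adding it to the basket, uses the add-to-basket rate as a proxy for conversion, and splits at the median of each measure; the cut-offs, labels and sample products are assumptions rather than the method used in the slides.

```python
from statistics import median


def classify_products(products):
    """Assign each product to a quadrant of the traffic/conversion plot.

    products: dict mapping product name to (unique_viewers, unique_adders).
    """
    rates = {
        name: (adders / viewers if viewers else 0.0)
        for name, (viewers, adders) in products.items()
    }
    traffic_cut = median(viewers for viewers, _ in products.values())
    rate_cut = median(rates.values())

    labels = {}
    for name, (viewers, _) in products.items():
        high_traffic = viewers > traffic_cut
        high_rate = rates[name] > rate_cut
        if high_traffic and not high_rate:
            labels[name] = "fix product page"    # lower right of the plot
        elif not high_traffic and high_rate:
            labels[name] = "drive more traffic"  # top left of the plot
        elif not high_traffic and not high_rate:
            labels[name] = "investigate"         # lower left of the plot
        else:
            labels[name] = "performing well"     # top right of the plot
    return labels


# Hypothetical products: (unique viewers, unique users adding to basket).
sample = {"A": (1000, 20), "B": (100, 30), "C": (80, 2), "D": (900, 300)}
quadrants = classify_products(sample)
```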
Identifying products / content that go well together
Market basket analysis is an association rule learning technique aimed at uncovering the associations and
connections between specific products in our store.
In a market basket analysis, we look to see if there are combinations of products that frequently co-occur in
transactions.
We can use this type of analysis to:
• Inform the placement of content items on sites, or products in catalogue
• Drive recommendation engines (like Amazon’s customers who bought this product also bought these
products…)
• Deliver targeted marketing (e.g. emailing customers who bought specific products with other
products and offers that are likely to be interesting to them)
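A minimal sketch of the co-occurrence idea behind market basket analysis, restricted to pairs of products. It computes the standard support, confidence and lift measures for each pair; the sample baskets are invented, and a production system would typically use a full association rule algorithm over larger item sets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.0):
    """Compute support, confidence and lift for every product pair.

    transactions: list of baskets, each a list of product names.
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]    # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)
            rules.append({"lhs": lhs, "rhs": rhs, "support": support,
                          "confidence": confidence, "lift": lift})
    return rules


# Hypothetical transactions.
baskets = [["bread", "butter"], ["bread", "butter", "jam"], ["bread"], ["jam"]]
rules = pair_rules(baskets)
butter_to_bread = next(r for r in rules if r["lhs"] == "butter" and r["rhs"] == "bread")
```

A lift above 1 suggests the two products appear together more often than they would if purchases were independent, which is the signal a recommendation or placement decision would look for.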

Architecting Data Lake on AWS by the Data Engineering Team at HiFX IT

  • 1. Architecting Data Lakes on AWS with HiFX Established in the year 2001, HiFX is an Amazon Web Services Consulting Partner that has been designing and migrating applications and workloads in the cloud since 2010. We have been helping organisations to become truly data driven by building data lakes in AWS since 2015.
  • 2. 2 The Challenges Lack of agility and accessibility for data analysis which would aid the product team to make smart business decisions and improve strategies Increasing volume and velocity of data. With new digital properties getting added, there was a need to design the collection and storage layers that would scale well. Dozens of independently managed collections of data, leading to data silos. Having no single source of truth was leading to difficulties in identifying what type of data is available, getting access to it and integration. Poorly recorded data. Often, the meaning and granularity of the data was getting lost in processing. Dozens of independently managed collections of data, leading to data silos. Having no single source of truth was leading to difficulties in identifying what type of data is available, granting access and integration. 04 03 02 01
  • 3. Our Journey from Data to Decisions with an AWS-powered Data Lake. Connecting dozens of data streams and repositories to a unified data pipeline, enabling near-realtime access to any data source. Engineering well-designed big data stores for reporting and exploratory analysis. Architecting a secure, well-governed data lake to store all data in a raw format; S3 is the fabric with which we have woven the solution. Processing data in streams or batches to aid analytics and machine learning, supplemented by smart workflow management to orchestrate the tasks. Dynamic dashboards and visualisations that make data tell stories and help drive insights. Offering recommendations and predictive analytics off the data in the data lake.
  • 4. COLLECT, STORE, PROCESS, CONSUME: Scribe (Collector) and Accumulo (Storage). Scribe collects data from the trackers and writes it to Kinesis Streams. It is written in Go and engineered for high concurrency, low latency and horizontal scalability. Currently running on two c4.large instances, our API latency is 12.6 ms at the 50th percentile and 36 ms at the 75th percentile. This is made possible by the consistent and predictable performance of Kinesis. Accumulo is the data consumer component responsible for reading data from the event streams (Kinesis Streams), performing rudimentary data quality checks and converting the data to Avro format before loading it into the Data Lake. Our Data Lake in S3 captures and stores raw data at scale at low cost. It lets us store many types of data in the same repository and define the structure of the data at the time it is used.
  • 6. Why Amazon S3 for the Data Lake? Performance is somewhat lower than that of an HDFS cluster, but this does not affect our workloads significantly. EMRFS with consistent view (backed by DynamoDB) works really well. Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies, and security (SSL, client-side and server-side encryption). Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999% durability. Lower TCO and easier to scale than HDFS. Decoupled storage and compute, allowing multiple, heterogeneous analysis clusters to use the same data.
  • 7. Prism (Processor) and Lens (Consumer). Prism: a unified processing engine using Apache Spark, written in Scala and running on EMR. Airflow is used to programmatically author, schedule and monitor workflows. Prism generates data for tracking KPIs and performs funnel, pathflow, retention and affinity analysis. It also includes machine learning workloads that generate recommendations and predictions. Lens: a custom-built reporting and visualisation app that helps business owners easily interpret, visualise and record data and derive insights. It offers detailed analysis of KPIs, Event Segmentation, Funnels, Search Insights, Path Finder, Retention/Addiction Analysis etc., powered by Redshift and Druid, with Pgpool used to cache Redshift queries.
  • 9. KPIs. Product relationship: understand which products are viewed consecutively. Product affinity: understand which products are purchased together. Sales: hourly, daily, weekly, monthly, quarterly and annual. Average market basket: average order size. Cart abandonment rate: shopping cart abandonment rate. Days/Visits to "Purchase": the average number of days and sessions from the first website interaction to purchase. Cost per Acquisition: (Total Cost of Marketing Activities) / (# of Conversions). Repeat purchase rate: what % of our customers are repeat customers.
  • 10. Product page performance: measuring product performance. The scatter plot compares the number of unique users that view each product with the number of unique users that add the product to basket, with the size of each dot being the number of uniques that buy the product. Any products located in the lower right corner are highly trafficked but low converting. Any effort spent fixing those product pages (e.g. by checking the copy, updating the product images or lowering the price) should be rewarded with a significant sales uplift, given the number of people visiting those pages.
  • 11. Product page performance: measuring product performance. In contrast, products located in the top left of the plot are very highly converting, but low trafficked pages. We should drive more traffic to these pages, either by positioning those products more prominently on catalog pages, for example, or by spending marketing dollars driving more traffic to those pages specifically. Again, that investment should result in a significant uplift in sales, given how highly converting those products are. Similarly, products in the lower left corner are performing poorly, but it is not clear whether this is because they have low traffic levels and/or are poor at driving conversions. We should invest in improving the performance of these pages, but the return on that investment is likely to be smaller (or harder to achieve) than the other two opportunities.
  • 12. Identifying products / content that go well together. Market basket analysis is an association rule learning technique aimed at uncovering the associations and connections between specific products in our store. In a market basket analysis, we look to see if there are combinations of products that frequently co-occur in transactions. We can use this type of analysis to: inform the placement of content items on sites, or products in the catalogue; drive recommendation engines (like Amazon's "customers who bought this product also bought these products…"); and deliver targeted marketing (e.g. emailing customers who bought specific products with other products, and offers on those products, that are likely to be interesting to them).
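
The co-occurrence idea behind market basket analysis on slide 12 can be sketched in a few lines of Python. This is an illustrative sketch only, not the code Prism runs (the slides say Prism is written in Scala on Spark). The function name `pair_rules` and the dictionary keys are hypothetical. It counts how often pairs of products share a transaction and derives the standard association-rule measures of support, confidence and lift.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.0):
    """Return pairwise association rules sorted by lift (hypothetical helper).

    transactions: iterable of iterables of product identifiers.
    support    = share of transactions containing both products
    confidence = P(then-product in basket | if-product in basket)
    lift       = confidence / P(then-product in basket)
    """
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    if n == 0:
        return []

    # How many baskets contain each product, and each unordered pair.
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each unordered pair yields two directed rules.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append({
                "if": x,
                "then": y,
                "support": support,
                "confidence": confidence,
                "lift": lift,
            })
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


if __name__ == "__main__":
    demo = [["bread", "butter"], ["bread", "butter", "jam"], ["bread"], ["jam"]]
    for rule in pair_rules(demo, min_support=0.25):
        print(rule)
```

A lift above 1 suggests the two products appear together more often than chance would predict, which is the signal a recommendation or catalogue-placement feature would look for. On real transaction volumes, a distributed implementation would be needed rather than this in-memory version.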

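Some of the KPI definitions on slide 9 are simple ratios, and a small Python sketch may make them concrete. Only the Cost per Acquisition formula is stated on the slide. The cart abandonment formula (carts not purchased divided by carts created) and the reading of "average order size" as a plain mean are assumptions, and all function names are hypothetical.

```python
def cost_per_acquisition(total_marketing_cost, conversions):
    """(Total cost of marketing activities) / (# of conversions), as on the slide."""
    if conversions == 0:
        return None  # undefined when nothing converted
    return total_marketing_cost / conversions


def repeat_purchase_rate(orders_per_customer):
    """Percentage of purchasing customers with more than one order.

    orders_per_customer: dict mapping customer id -> number of orders.
    """
    buyers = [n for n in orders_per_customer.values() if n > 0]
    if not buyers:
        return None
    repeaters = sum(1 for n in buyers if n > 1)
    return 100.0 * repeaters / len(buyers)


def cart_abandonment_rate(carts_created, carts_purchased):
    """Assumed definition: percentage of created carts that were never purchased."""
    if carts_created == 0:
        return None
    return 100.0 * (1 - carts_purchased / carts_created)


def average_order_size(order_sizes):
    """Assumed definition: arithmetic mean of a per-order measure (value or items)."""
    if not order_sizes:
        return None
    return sum(order_sizes) / len(order_sizes)
```

Returning `None` for empty inputs is a design choice for this sketch, so that a dashboard can show "no data" instead of a misleading zero.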
Editor's Notes

  1. Trackers make it possible to collect data from any type of application (web, mobile), service or device. All trackers adhere to the predefined Tracker Protocol. They send data asynchronously and hence do not affect performance. Collectors are stateless and horizontally scalable. Each shard in a Kinesis stream can support reads of up to 2 MB per second and writes of up to 1,000 records or 1 MB per second. Scribe and Accumulo automatically detect new shards and scale. Accumulo is a KCL Java app that buffers the events and uploads the batches as Avro files to the Data Lake. DAGs in Airflow pull dimension and offline data and load them into the Data Lake.
  2. Streaming workloads are used for near-realtime reports (news) and batch workloads for daily reports (classifieds). EMR with instance fleets provides a cost-effective way to process data. Data processing involves quality checks, cleansing, reconciling and enrichment. A subset of the data (excluding page views and data older than the last 2 years) is sent to Druid and Redshift. All data (historical) is stored as Parquet in S3 with a lifecycle policy. Athena can point to this data for ad hoc analysis. Druid is used for realtime data and aggregate queries that do not require joins, and Redshift for everything else. LENS is built with React using the nvd3 charting library. It is built for multi-tenancy with fine-grained ACLs. APIs are powered by Go. Recommendations are powered by DynamoDB (predictable performance and no need to sort on multiple fields).
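
The per-shard Kinesis limits quoted in note 1 (writes of up to 1,000 records or 1 MB per second, reads of up to 2 MB per second) lend themselves to a rough capacity estimate. The Python sketch below is a back-of-the-envelope calculation, not part of Scribe or Accumulo. The function name `required_shards`, the `consumers` parameter and the assumption that every consumer reads the full stream through the shared per-shard read limit are all illustrative.

```python
import math

# Per-shard limits as quoted in the editor's notes.
WRITE_RECORDS_PER_SEC = 1000
WRITE_MB_PER_SEC = 1.0
READ_MB_PER_SEC = 2.0


def required_shards(records_per_sec, avg_record_kb, consumers=1):
    """Estimate the minimum shard count for a steady load (hypothetical helper).

    records_per_sec: expected peak ingest rate in records per second
    avg_record_kb:   average record size in KB (1 KB = 1024 bytes here)
    consumers:       applications that each read the whole stream (assumption)
    """
    write_mb = records_per_sec * avg_record_kb / 1024.0
    read_mb = write_mb * consumers
    return max(
        1,
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),  # record-count limit
        math.ceil(write_mb / WRITE_MB_PER_SEC),              # write-bandwidth limit
        math.ceil(read_mb / READ_MB_PER_SEC),                # read-bandwidth limit
    )
```

Whichever of the three limits is hit first determines the shard count: many small records are bound by the record-count limit, while large records or several consumers are bound by bandwidth. Real sizing would also leave headroom for traffic spikes, which this sketch does not model.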