This is the presentation we shared at the AWS Summit 2017 in Bangalore. We showcase our high-performance framework and the various components that enable an organization to be data driven. Find out how our components are engineered to scale, store data securely, and process data for insights.
UTAD - Jornadas de Informática - Potential of Big Data - Marco Silva
Short presentation given at the Universidade de Trás-os-Montes e Alto Douro (UTAD) IT event for students and faculty members. The talk is meant to be an overview of Big Data, how Microsoft technologies tackle that subject, and how students could leverage these tools in their projects and future careers.
CData Power BI Connectors - MS Business Application Summit - Jerod Johnson
The CData presentation introducing and demonstrating the CData Power BI Connectors (offering live connectivity to more than 100 SaaS, Big Data, and NoSQL sources).
Presentation on Data Mesh: a paradigm shift toward a modern, distributed ecosystem architecture that treats domain-specific data as a product, enabling each domain to handle its own data pipelines.
Everything generates logs. Applications, infrastructure, security ... everything. Keeping track of the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we will use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data, and configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
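The transformation step that Firehose hands off to Lambda can be sketched as follows. This is a minimal, hypothetical example (the `source` field it adds is an illustration, not something from the session); it follows the record shape Firehose passes to a transform Lambda: base64-encoded data in, a `recordId`/`result`/`data` triple out.

```python
import base64
import json

def handler(event, context):
    """Minimal Kinesis Firehose transform Lambda sketch: decode each
    record, tag it with a hypothetical 'source' field, re-encode it,
    and mark it 'Ok' so Firehose delivers it downstream."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "cloudwatch"  # hypothetical enrichment
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()
            ).decode(),
        })
    return {"records": output}
```

Records that fail to parse would instead be returned with `"result": "ProcessingFailed"` so Firehose can route them to the error prefix.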
Choosing the Right Database for My Workload: Purpose-Built Databases - AWS Germany
AWS offers a broad range of databases purpose-built for your specific application use cases. Our fully managed database services include relational databases for transactional applications, non-relational databases for internet-scale applications, a data warehouse for analytics, an in-memory data store for caching and real-time workloads, and a graph database for building applications with highly connected data. If you are looking to migrate your existing databases to AWS, the AWS Database Migration Service makes it easy and cost-effective to do so. The session will cover various SQL engines, “cloud-native SQL” (Aurora), SQL DWH + Spectrum, NoSQL, GraphDB.
4 slide overview of Microsoft BI & analytics architecture and how it would work with your current environment. See the PointDrive for more information - https://ptdrv.linkedin.com/5fj1ey0
Data Analytics Week at the San Francisco Loft
Uses of Data Lakes
Examples of using data lakes from different AWS customers.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Marie Yap - Enterprise Solutions Architect, AWS
This presentation is about Data Warehouse modernization: extending the warehouse into a modern data platform by adding a Big Data solution using EMR and Spark, and streaming data with Kinesis Firehose. It also covers the use case of a complementary data lake for the data warehouse, the ETL tool selection process, and ML considerations.
At Polestar, we hope to bring the power of data to organizations across industries, helping them analyze billions of data points and data sets to provide real-time insights, and enabling them to make critical decisions to grow their business.
Data Lakes: 8 Enterprise Data Management Requirements - SnapLogic
2016 is the year of the data lake. As you consider adopting an enterprise data lake strategy to manage more dynamic, poly-structured data, your data integration strategy must also evolve to handle new requirements. Thinking you can simply hire more developers to write code or rely on your legacy rows-and-columns centric tools is a recipe to sink in a data swamp instead of swimming in a data lake.
In this presentation, you'll learn about eight enterprise data management requirements that must be addressed in order to get maximum value from your big data technology investments.
To learn more, visit: https://www.snaplogic.com/big-data
Why HR Should Consider Agile Modern Data Delivery Platforms - syed_javed
A modern data delivery platform like Lyftron provides a universal data model capability to HR departments: it propagates changes from the source dynamically into the semantic layer, allowing enterprises to avoid manual semantic data model changes.
Power BI Dashboard | Microsoft Power BI Tutorial | Data Visualization | Edureka - Edureka!
This Edureka Power BI Dashboard Tutorial will take you through the step-by-step creation of a Power BI dashboard. It helps you learn the different functionalities in the Power BI tool with a demo on the superstore dataset. You will learn how to create a Power BI dashboard by extracting multiple insights from the superstore dataset and representing them visually.
Ubiquitous data does not always translate to actionable data, though most financial institutions have a treasure trove of data they are moving to the cloud and could be using today. The potential is huge, but most struggle just to make actionable data available, let alone turn it into business value at scale. This session will highlight some of the key use cases and technologies that provide the greatest returns and organizational impact.
Considerations for Data Access in the Lakehouse - Databricks
Organizations are increasingly exploring lakehouse architectures with Databricks to combine the best of data lakes and data warehouses. Databricks SQL Analytics introduces new innovation on the “house” to deliver data warehousing performance with the flexibility of data lakes. The lakehouse supports a diverse set of use cases and workloads that require distinct considerations for data access. On the lake side, tables with sensitive data require fine-grained access controls that are enforced across the raw data and derivative data products built via feature engineering or transformations. On the house side, tables can require fine-grained data access such as row-level segmentation for data sharing, plus additional transformations using analytics engineering tools. On the consumption side, there are additional considerations for managing access from popular BI tools such as Tableau, Power BI, or Looker.
The product team at Immuta, a Databricks partner, will share their experience building data access governance solutions for lakehouse architectures across different data lake and warehouse platforms to show how to set up data access for common scenarios for Databricks teams new to SQL Analytics.
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes, and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
Automate Business Insights on AWS - Simple, Fast, and Secure Analytics Platforms - Amazon Web Services
Business analysts require easy access to data from across different parts of the business. In this session, learn why more customers have adopted Amazon Redshift than any other cloud-native Data Warehouse, and how they are building a broader analytics capability with data lakes on AWS.
Understand how AWS built machine learning (ML) into the services, taking away many of the time-intensive tasks of building an analytics platform. We cover why these customers choose Amazon Redshift for the accessibility to analysts, business reporting, deep security, ability to scale from GB to PB, and integration with the broader platform.
Learn about these customers who are increasingly opening insights to data analysts for data discovery and data scientists for machine learning. We also share how AWS services such as AWS Glue and the coming ML-enabled AWS Lake Formation take away most of the heavy lifting.
In this slide, I have tried to explain what a data engineer does and the difference between a data engineer, a data analyst, and a data scientist.
What is OLAP - Data Warehouse Concepts - IT Online Training @ Newyorksys - NEWYORKSYS-IT SOLUTIONS
NEWYORKSYS TRAINING is dedicated to offering quality IT online training and comprehensive IT consulting services with a complete business service delivery orientation.
How a Semantic Layer Makes Data Mesh Work at Scale - DATAVERSITY
Data Mesh is a trending approach to building a decentralized data architecture by leveraging a domain-oriented, self-service design. However, the pure definition of Data Mesh lacks a center of excellence or central data team and doesn’t address the need for a common approach for sharing data products across teams. The semantic layer is emerging as a key component to supporting a Hub and Spoke style of organizing data teams by introducing data model sharing, collaboration, and distributed ownership controls.
This session will explain how data teams can define common models and definitions with a semantic layer to decentralize analytics product creation using a Hub and Spoke architecture.
Attend this session to learn about:
- The role of a Data Mesh in the modern cloud architecture.
- How a semantic layer can serve as the binding agent to support decentralization.
- How to drive self service with consistency and control.
Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. Using our cloud-based service you can easily connect to your data, perform advanced analysis, and create stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.
Big Data and Analytics on Amazon Web Services: Building A Business-Friendly P... - Amazon Web Services
If you are crafting a better customer experience, automating your business, or modernizing your systems, you are likely finding that your data and analytics platform is absolutely critical to your success. In this session, we will look at how customers are building on the managed services from Amazon Web Services to meet the needs of the business. Patterns we see gaining popularity include near-real-time engagement with customers over mobile, combining and analyzing unstructured consumer behavior with structured transactional data, and managing spiky data workloads. See how our customers use our managed, elastic, secure, and highly available services to change what is possible.
Craig Stires, Head of Big Data and Analytics, Amazon Web Services, APAC
Processing the volume and variety of data that today’s organizations produce can be both challenging and costly – especially with a legacy data warehouse. Combining the scale and performance of the cloud with AWS and APN Partner solutions for migration, integration, analysis, and visualization can help overcome these obstacles. With a modern data warehouse architecture, organizations can store, process, and analyze massive volumes of data of virtually any type. Register for this upcoming webinar, where Pearson - an education and media conglomerate - will share in detail how they built a scalable and flexible business intelligence platform on the cloud, with Tableau and AWS.
Learn how you can seamlessly load and transform data in Amazon Redshift with Matillion ETL and analyze it with Tableau. Hear how 47Lining and NorthBay can provide insights to guide you through migration with ease. Tableau will discuss best practices to analyze your data on AWS and share new insights throughout your organization.
Kyvos Insights is unlocking the power of Big Data analytics with “OLAP on Hadoop” technology.
Kyvos is a solution which brings a new model of online analytical processing (OLAP) to Big Data that allows users to visually create and analyze cubes on Hadoop. This technology enables users to easily derive valuable insights for better, more informed business decisions through previously unattainable levels of scalability and interactivity.
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Notes on primitives for graph algorithms, such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
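The "skip computation on converged vertices" idea above can be sketched in plain Python. This is a simplified illustration, not the STICD implementation; it tracks per-vertex convergence and assumes the input graph has no dangling nodes.

```python
def pagerank_skip_converged(adj, d=0.85, eps=1e-8, max_iter=100):
    """PageRank sketch that skips recomputation for vertices whose rank
    has already converged. 'adj' maps node -> list of out-neighbours;
    the graph is assumed to have no dangling nodes."""
    n = len(adj)
    ranks = {v: 1.0 / n for v in adj}
    converged = {v: False for v in adj}
    # Build the reverse adjacency (in-edges) once, up front.
    in_edges = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            in_edges[v].append(u)
    for _ in range(max_iter):
        if all(converged.values()):
            break
        new = {}
        for v in adj:
            if converged[v]:
                new[v] = ranks[v]  # skip the work for converged vertices
                continue
            contrib = sum(ranks[u] / len(adj[u]) for u in in_edges[v])
            new[v] = (1.0 - d) / n + d * contrib
            if abs(new[v] - ranks[v]) < eps:
                converged[v] = True
        ranks = new
    return ranks
```

Note the trade-off the report describes: a converged vertex's in-neighbours may still change, so per-vertex skipping trades a little accuracy for iteration time.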
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Architecting Data Lake on AWS by the Data Engineering Team at HiFX IT
1. Architecting Data Lakes on AWS with HiFX
Established in the year 2001, HiFX is an Amazon Web Services Consulting Partner that has been designing and migrating applications and workloads in the cloud since 2010. We have been helping organisations to become truly data driven by building data lakes in AWS since 2015.
2. The Challenges
- Lack of agility and accessibility for data analysis that would aid the product team in making smart business decisions and improving strategies.
- Increasing volume and velocity of data. With new digital properties getting added, there was a need to design collection and storage layers that would scale well.
- Dozens of independently managed collections of data, leading to data silos. Having no single source of truth was leading to difficulties in identifying what type of data is available, getting access to it, and integration.
- Poorly recorded data. Often, the meaning and granularity of the data was getting lost in processing.
3. Our Journey from Data to Decisions with an AWS-powered Data Lake
- Connecting dozens of data streams and repositories to a unified data pipeline, enabling near-realtime access to any data source.
- Engineering well-designed big data stores for reporting and exploratory analysis.
- Architecting a secure, well-governed data lake to store all data in a raw format. S3 is the fabric with which we have woven the solution.
- Processing data in streams or batches to aid analytics and machine learning, supplemented by smart workflow management to orchestrate the tasks.
- Dynamic dashboards and visualisations that make data tell stories and help drive insights.
- Offering recommendations and predictive analytics off the data in the data lake.
4. COLLECT STORE PROCESS CONSUME
Scribe (Collector)
Scribe collects data from the trackers and writes them to Kinesis Streams. It is written in Go and engineered for high concurrency, low latency and horizontal scalability. Currently running on two c4.large instances, our API latency at the 50th percentile is 12.6 ms and at the 75th percentile is 36 ms. This is made possible by the consistent and predictable performance of Kinesis.

Accumulo (Storage)
Accumulo is the data consumer component responsible for reading data from the event streams (Kinesis Streams), performing rudimentary data quality checks and converting data to Avro format before loading it to the Data Lake. Our Data Lake in S3 captures and stores raw data at scale for a low cost. It allows us to store many types of data in the same repository while letting us define the structure of the data at the time it is used.
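The deck says the collector is written in Go; purely as an illustration (in Python, with hypothetical field and stream names), the sketch below shows the shape of batching tracker events into Kinesis `PutRecords` requests. Partitioning on a stable key keeps each user's events ordered within a shard, and Kinesis caps a batch at 500 records.

```python
import json

def build_put_records_batches(events, key_field="user_id"):
    """Build Kinesis PutRecords batches from a list of event dicts.
    'user_id' as the partition-key field is a hypothetical choice;
    batches are chunked to the 500-record PutRecords limit."""
    records = [
        {"Data": (json.dumps(e) + "\n").encode(),
         "PartitionKey": str(e.get(key_field, "anonymous"))}
        for e in events
    ]
    return [records[i:i + 500] for i in range(0, len(records), 500)]

# Sending the batches (requires AWS credentials), sketched only:
# import boto3
# kinesis = boto3.client("kinesis")
# for batch in build_put_records_batches(events):
#     kinesis.put_records(StreamName="tracker-events", Records=batch)
```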
6. Why Amazon S3 For Data Lake?
- Performance relatively lower than an HDFS cluster, but this doesn't affect our workloads significantly. EMRFS with consistent view (backed by DynamoDB) works really well.
- Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via life-cycle policies, and security: SSL, client/server-side encryption.
- Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999% durability. Lower TCO and easier to scale than HDFS.
- Decoupled storage and compute, allowing multiple heterogeneous analysis clusters to use the same data.
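The tiered-storage point above maps onto an S3 lifecycle configuration. A minimal sketch follows; the prefix, day thresholds, and bucket name are hypothetical examples, not values from the deck.

```python
def data_lake_lifecycle(prefix="raw/"):
    """S3 lifecycle configuration sketch: transition objects under a
    prefix to Standard-IA after 30 days and to Glacier after 365 days
    (thresholds are illustrative)."""
    return {
        "Rules": [{
            "ID": "tier-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    }

# Applying it (requires AWS credentials), sketched only:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",
#     LifecycleConfiguration=data_lake_lifecycle())
```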
7. Prism (Processor)
Prism is a unified processing engine using Apache Spark running on EMR, written in Scala. Airflow is used to programmatically author, schedule and monitor workflows. Prism generates data for tracking KPIs and performs funnel, pathflow, retention and affinity analysis. It also includes machine learning workloads that generate recommendations and predictions.

Lens (Consumer)
Lens is a custom-built reporting and visualisation app that helps business owners easily interpret, visualise and record data and derive insights. It offers detailed analysis of KPIs, event segmentation, funnels, search insights, path finder, and retention/addiction analysis, powered by Redshift and Druid, using Pgpool to cache Redshift queries.
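The processing stages the Editor's Notes mention (quality checks, cleansing, and enrichment) can be sketched, independently of Spark, as record-level functions that a Spark job would map over. Everything here is illustrative: the field names and the geo lookup are hypothetical, not part of the HiFX pipeline.

```python
def is_valid(event):
    """Rudimentary quality check: required fields must be present."""
    return all(k in event for k in ("event_type", "timestamp", "user_id"))

def cleanse(event):
    """Normalise fields, e.g. trim and lowercase the event type."""
    out = dict(event)
    out["event_type"] = out["event_type"].strip().lower()
    return out

def enrich(event, geo_lookup):
    """Enrich with a (hypothetical) geo lookup keyed by user_id."""
    out = dict(event)
    out["country"] = geo_lookup.get(out["user_id"], "unknown")
    return out

def process(events, geo_lookup):
    """The filter/map pipeline a Spark job would express over an RDD
    or DataFrame: drop invalid records, cleanse, then enrich."""
    return [enrich(cleanse(e), geo_lookup) for e in events if is_valid(e)]
```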
9. KPIs
- Product relationship: understand which products are viewed consecutively.
- Product affinity: understand which products are purchased together.
- Sales: hourly, daily, weekly, monthly, quarterly, and annual.
- Average market basket: average order size.
- Cart abandonment rate: shopping cart abandonment rate.
- Days / visits to purchase: the average number of days and sessions from the first website interaction to purchase.
- Cost per acquisition: (total cost of marketing activities) / (# of conversions).
- Repeat purchase rate: what % of our customers are repeat customers.
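Two of the KPIs above come with explicit definitions (cost per acquisition and repeat purchase rate); a small sketch of computing them follows. The input shapes and sample numbers are illustrative.

```python
def cost_per_acquisition(total_marketing_cost, conversions):
    """CPA = (total cost of marketing activities) / (# of conversions),
    as defined on the slide."""
    return total_marketing_cost / conversions

def repeat_purchase_rate(orders_by_customer):
    """Share of customers with more than one order, given a mapping of
    customer id -> order count (a common definition of the KPI)."""
    repeat = sum(1 for n in orders_by_customer.values() if n > 1)
    return repeat / len(orders_by_customer)

# Illustrative numbers:
cpa = cost_per_acquisition(5000.0, 200)   # 25.0 per conversion
rate = repeat_purchase_rate({"a": 3, "b": 1, "c": 2, "d": 1})  # 0.5
```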
10. Product page performance
Measuring product performance
The scatter plot compares the number of unique users that view each product with the number of unique users that add the product to basket, with the size of each dot being the number of unique users that buy the product. Any products located in the lower right corner are highly trafficked but low converting; any effort spent fixing those product pages (e.g. by checking the copy, updating the product images or lowering the price) should be rewarded with a significant sales uplift, given the number of people visiting those pages.
11. Measuring product performance
In contrast, products located in the top left of the plot are very highly converting, but low-trafficked pages. We should drive more traffic to these pages, either by positioning those products more prominently on catalog pages, for example, or by spending marketing dollars driving more traffic to those pages specifically. Again, that investment should result in a significant uplift in sales, given how highly converting those products are. Similarly, products in the lower left corner are performing poorly, but it is not clear whether this is because they have low traffic levels and/or are poor at driving conversions. We should invest in improving the performance of these pages, but the return on that investment is likely to be smaller (or harder to achieve) than the other two opportunities.
12. Identifying products / content that go well together
Market basket analysis is an association rule learning technique aimed at uncovering the associations and connections between specific products in our store. In a market basket analysis, we look to see if there are combinations of products that frequently co-occur in transactions.
We can use this type of analysis to:
• Inform the placement of content items on sites, or of products in the catalogue
• Drive recommendation engines (like Amazon’s “customers who bought this product also bought these products…”)
• Deliver targeted marketing (e.g. emailing customers who bought specific products with offers on other products that are likely to be interesting to them)
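The pair co-occurrence counting at the heart of market basket analysis can be sketched in a few lines. This is a toy version over in-memory baskets; a real pipeline would compute support, confidence, and lift over far larger transaction data.

```python
from collections import Counter
from itertools import combinations

def pair_supports(transactions):
    """Count how often each unordered product pair co-occurs in a
    basket, and return support = co-occurrence / number of baskets."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) gives each unordered pair a canonical key
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["eggs", "milk"],
    ["bread", "eggs"],
]
supports = pair_supports(baskets)
# ("bread", "milk") co-occurs in 2 of 4 baskets -> support 0.5
```

Pairs whose support clears a threshold would then feed placement decisions, recommendations, or targeted offers as the slide lists.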
Editor's Notes
Trackers allow us to collect data from any type of application (web, mobile), service or device.
All trackers adhere to the predefined Tracker Protocol.
They send data asynchronously, and hence do not affect application performance.
Collectors are stateless and horizontally scalable.
Each shard in a Kinesis stream can support reads of up to 2 MB per second and writes of up to 1,000 records / 1 MB per second.
Scribe and Accumulo automatically detect new shards and scale.
Accumulo is a KCL Java app that buffers the events and uploads the batches as Avro files to the Data Lake.
DAGs in Airflow pull dimension and offline data and load them to the Data Lake.
Streaming workloads serve near-realtime reports (news) and batch workloads serve daily reports (classifieds).
EMR with instance fleets provides a cost-effective way to process data.
Data processing involves quality checks, cleansing, reconciling and enrichment.
A subset of the data (sans page views and data beyond the last 2 years) is sent to Druid & Redshift.
All data (historical) is stored as Parquet in S3 with a lifecycle policy; Athena can point to this data for ad hoc analysis.
Druid is used for realtime data and aggregate queries that do not require joins; Redshift for everything else.
Lens is built with React using the nvd3 charting library, built for multi-tenancy with fine-grained ACLs, with APIs powered by Go.
Recommendations are powered by DynamoDB (predictable performance and no need to sort on multiple fields).