Architecting Data Lakes on AWS
1. Architecting Data Lakes on AWS with HiFX
Established in 2001, HiFX is an Amazon Web Services Advanced Consulting
Partner. We have been designing and migrating workloads on the AWS cloud since 2010
and helping organisations become truly data-driven by building big data solutions
since 2015.
About Malayala Manorama
Malayala Manorama is one of the largest media conglomerates in India. It runs
manoramaonline.com, the largest news portal for Malayalees around the world, and
several digital media properties including manoramanews.com, m4marry.com,
helloaddress.com, tapeytapey.com, entedeal.com, quickerala.com, qkdoc.com,
manoramahorizon.com, as well as various mobile applications.
2. The Challenges
Lack of agility and accessibility in data analysis to help the product team make smart
business decisions and improve strategies
Increasing volume and velocity of data. With new digital properties getting added, there was a need to
design the collection and storage layers that would scale well
Dozens of independently managed collections of data, leading to data silos. Having no single source of truth
made it difficult to identify what type of data was available, get access to it and integrate it
Poorly recorded data. Often, the meaning and granularity of the data were lost in processing
3. Our Journey from Data to Decisions with an AWS powered Data Lake
Connecting dozens of data streams and repositories to a unified data pipeline, enabling near-realtime access to any data source
Engineering well-designed big data stores for reporting and exploratory analysis
Architecting a secure, well-governed data lake to store all data in raw format. S3 is the fabric with which we have woven the solution
Processing data in streams or batches to aid analytics and machine learning, supplemented by smart workflow management to orchestrate the tasks
Dynamic dashboards and visualisations that make data tell stories and help drive insights. Offering recommendations and predictive analytics off the data in the data lake
5. COLLECT STORE PROCESS CONSUME
Scribe (Collector)
Scribe collects data from the trackers and writes it to Kinesis Streams. It is written in Go and engineered for high concurrency, low latency and horizontal scalability. Currently running on two c4.large instances, our API latency is 12.6 ms at the 50th percentile and 27 ms at the 75th percentile, made possible by the consistent and predictable performance of Kinesis. (A minimal put-to-Kinesis sketch follows after this slide.)
Accumulo (Storage)
Accumulo is the data consumer component responsible for reading data from the event streams (Kinesis Streams), performing rudimentary data quality checks and converting data to Avro format before loading it into the Data Lake. Our Data Lake in S3 captures and stores raw data at scale for a low cost. It allows us to store many types of data in the same repository while letting us define the structure of the data at the time it is used.
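The collector itself is the Go-based Scribe API described above; purely as an illustration of the put-to-Kinesis step, here is a minimal Python (boto3) sketch. The stream name, region and event shape are hypothetical assumptions, not the production values.

```python
import json
import boto3

# Hypothetical stream name, region and event payload; the real collector
# is the Go-based Scribe API described above.
kinesis = boto3.client("kinesis", region_name="ap-south-1")

event = {
    "user_id": "u-12345",
    "event_type": "pageview",
    "url": "https://www.manoramaonline.com/",
    "ts": 1514764800,
}

# Partitioning by user_id keeps a given user's events ordered within a shard.
kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```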
6. Collect / Store architecture
[Architecture diagram] Trackers (the JS SDK and the Android and iOS SDKs for clickstream events from the apps and sites, plus a data sink fed by Java / Go / Python / PHP SDKs for server events) send data through an ELB to the Scribe collector API, written in Go and running in an Auto Scaling group across Availability Zones. Scribe writes the events to Kinesis Streams. Accumulo, a KCL consumer app in Java, reads the streams and lands raw data in Avro format in the Data Lake in AWS S3; dimension / offline data is brought into the lake via Airflow.
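The actual consumer is the KCL app in Java shown above; purely to illustrate the "convert to Avro and land in S3" step, here is a minimal Python sketch using fastavro and boto3. The schema, bucket and key are illustrative assumptions.

```python
import io
import boto3
from fastavro import parse_schema, writer

# Hypothetical event schema; the real schema is owned by Accumulo (Java/KCL).
schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "event_type", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
})

records = [{"user_id": "u-12345", "event_type": "pageview", "ts": 1514764800}]

# Serialise the batch to an Avro container file in memory, then land it
# in the raw zone of the data lake (bucket and prefix are illustrative).
buf = io.BytesIO()
writer(buf, schema, records)
boto3.client("s3").put_object(
    Bucket="example-data-lake",
    Key="raw/events/2018/01/01/batch-0001.avro",
    Body=buf.getvalue(),
)
```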
7. Why Amazon S3 for the Data Lake?
Performance is somewhat lower than an HDFS cluster, but this doesn't affect our workloads significantly. EMRFS with
consistent view (backed by DynamoDB) works really well
Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies, and security
via SSL in transit plus client- and server-side encryption (see the sketch after this list)
Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999%
durability. Lower TCO and easier to scale than HDFS
Decoupled storage and compute allowing multiple & heterogeneous analysis clusters to use the same
data
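A minimal sketch of the versioning and lifecycle configuration referred to above, using boto3. The bucket name, prefix and transition windows are illustrative assumptions, not the actual policy.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

# Turn on object versioning for the data lake bucket.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Tier raw data down through Standard-IA and Glacier as it ages.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```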
8. Prism (Processor) and Lens (Consumer)
Prism (process)
Unified processing engine using Apache Spark running on EMR, written in Scala. Airflow is used to programmatically author, schedule and monitor workflows. Prism generates the data for tracking KPIs and performs funnel, path-flow, retention and affinity analysis. It also includes the machine learning workloads that generate recommendations and predictions. (Spark and Airflow sketches follow after this slide.)
Lens (consume)
Custom-built reporting and visualisation app that helps business owners easily interpret, visualise and record data and derive insights. Detailed analysis of KPIs, event segmentation, funnels, search insights, path finder, retention/addiction analysis etc., powered by Redshift and Druid, with Pgpool used to cache Redshift queries.
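Prism itself is written in Scala; the following PySpark sketch only illustrates the read-Avro-from-the-lake, aggregate, write-Parquet pattern it describes. The S3 paths, field names (event_type, ts, property) and the Avro data source name are assumptions that vary with the EMR/Spark version.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("prism-kpi-sketch").getOrCreate()

# Raw Avro events from the data lake (path and schema are hypothetical).
# Depending on the Spark version, the Avro source is the built-in "avro"
# format (2.4+) or the external "com.databricks.spark.avro" package.
events = spark.read.format("avro").load("s3://example-data-lake/raw/events/")

# Daily pageviews per property: a stand-in for one of the KPI aggregations.
daily_kpis = (
    events
    .filter(F.col("event_type") == "pageview")
    .withColumn("day", F.to_date(F.from_unixtime(F.col("ts"))))
    .groupBy("day", "property")
    .agg(F.count("*").alias("pageviews"))
)

# Enriched/aggregated data goes back to S3 as Parquet for Athena queries,
# Redshift loads and the Lens dashboards.
(daily_kpis.write
    .mode("overwrite")
    .partitionBy("day")
    .parquet("s3://example-data-lake/enriched/daily_kpis/"))
```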
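A minimal sketch of how a daily Prism batch run could be authored and scheduled in Airflow (1.x style), as mentioned above. The DAG id, schedule and spark-submit command are illustrative placeholders, not the production workflow.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",  # hypothetical owner
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

# One DAG run per day over the previous day's raw events.
dag = DAG(
    dag_id="prism_daily_batch",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# Run the Spark aggregation job on the on-demand EMR cluster
# (the job path and submit flags are placeholders).
compute_kpis = BashOperator(
    task_id="compute_daily_kpis",
    bash_command="spark-submit --deploy-mode cluster s3://example-bucket/jobs/daily_kpis.py",
    dag=dag,
)

# Load the resulting Parquet aggregates into Redshift for Lens.
load_redshift = BashOperator(
    task_id="load_kpis_to_redshift",
    bash_command="python /opt/prism/load_kpis_to_redshift.py",
    dag=dag,
)

compute_kpis >> load_redshift
```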
9. Process / Consume architecture
[Architecture diagram] Prism reads from the Data Lake in AWS S3 using two EMR Spark clusters: a persistent cluster for ML and realtime updates, and an on-demand cluster for batch workloads. Enriched data is written back in Parquet format (with Athena / Presto on top for ad hoc queries), and processed data is loaded into Redshift, into DynamoDB (which stores the results of recommendations, market basket analysis etc.) and into Druid (a realtime streaming data ingestion engine fed from Kinesis). The Lens BI dashboard, running in an Auto Scaling group across Availability Zones, consumes from these reporting stores.
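As an example of the "Athena / Presto on top for ad hoc queries" path in the diagram, here is a hedged boto3 sketch; the database, table and output location are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Ad hoc query over the enriched Parquet data registered in the
# Glue / Athena catalog (database and table names are illustrative).
response = athena.start_query_execution(
    QueryString="""
        SELECT day, property, SUM(pageviews) AS pageviews
        FROM analytics.daily_kpis
        GROUP BY day, property
        ORDER BY day DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

print(response["QueryExecutionId"])
```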
10. Scalability / Performance
Collect, Store and Process layers are designed to autoscale
Latency numbers at the data collector: 27 ms at the 75th percentile and 156 ms at the 95th percentile
Currently handling 60 million events per month. Expecting 100x scale in 2018
Horizontally Scalable Data Collectors, Data Consumers, Data Processors and Data Reporting Stores
11. The Benefits
Ability to run targeted mobile push and email campaigns
Consistent KPI measurement. The client has a consistent framework across properties to measure KPIs
Better user experience. Recommendations running off the data in the Data Lake add value to the digital
properties we manage
Better business agility and product decisions based on behavioural insights. The journey from data to
decisions is made swifter