About HiFX
Established in 2001, HiFX is an Amazon Web Services
Consulting Partner.
We have been designing and migrating workloads on the AWS cloud
since 2010 and helping organizations become truly data-driven
by building big data solutions since 2015.
Case Study with Malayala Manorama
Malayala Manorama is one of the largest media conglomerates in India. They
run manoramaonline.com, the largest news portal for Malayalees around the
world, and several digital media properties.
In 2016, Manorama embarked on a project to develop an in-house
analytics pipeline that could unify enormous amounts of raw data from
multiple web domains and convert it into meaningful insights. The
company currently has 10 domains, such as its matrimonial and real
estate sites, with plans to further expand its digital footprint.
HiFX has been Malayala Manorama’s technology partner for more than
18 years and was approached to design this new data analytics pipeline.
Manorama Online
Manorama News
The Week
Vanitha
Watchtime India
E-paper/E-magazine
Chuttuvattom
OnManorama
M4Marry
HelloAddress
Quickerala
Qkdoc
Entedeal
Manorama Horizon
Android
iOS
Manorama MAX
The Challenges
01 Lack of agility and accessibility for data analysis that would aid the product team in
making smart business decisions and improving strategies
02 Increasing volume and velocity of data. With new digital properties being added, the
collection and storage layers needed to be designed to scale well
03 Dozens of independently managed collections of data, leading to data silos. Having no
single source of truth made it difficult to identify what data was available, get access
to it and integrate it
04 Poorly recorded data. Often, the meaning and granularity of the data was lost in
processing
About Lens
Vision: “Lens is a unified data platform with a consolidated solution stack to
generate meaningful real-time insights and drive revenue”
• Better product decisions based on behavioral insights
• Deeply understand every user's journey
• Increase CLV
• Immediate actions, smart targeting and marketing automation
• Positively impact KPIs
• Add value to our businesses
Components
01 UNIFIED DATA PIPELINE: connecting dozens of data streams and repositories to a
unified data pipeline, enabling near real-time access to any data source
02 WELL GOVERNED DATA LAKE: a well governed data lake architected to store raw and
enriched data, thereby eliminating storage silos
03 DATA PROCESSING FRAMEWORK: a data processing framework supporting stream and batch
workloads to aid analytics and machine learning, along with smart workflow management
04 BIG DATA STORES FOR OLAP: well designed big data stores for reporting and
exploratory analysis
05 RECOMMENDATIONS ENGINE: a recommendations and personalization engine powered by
machine learning
06 SMART DASHBOARDS: dynamic dashboards and smart visualizations that make data tell
stories and drive insights
Solution Stack
01 STREAMING ANALYTICS: watch attention shift in real time; updates every few seconds
to quickly capitalize on the attention to every post, campaign and section
02 BATCH ANALYTICS: a historical view of unique attention metrics to understand what
happened in the past and use it to plan for the future
03 FB IA AND GOOGLE AMP INTEGRATIONS: integrations with Google Accelerated Mobile Pages
and Facebook Instant Articles
04 VIDEO ANALYTICS: track key metrics such as visits, plays, dropouts and minutes
watched
05 CONTENT PERSONALIZATION: a recommendations and personalization engine powered by
machine learning
06 ADVANCED REPORTING: dynamic dashboards and smart visualizations that make data tell
stories and drive insights
07 RAW DATA ACCESS: clean, structured data that teams can analyze directly
Key Infrastructure Components
CloudFront
ECS
Kinesis Stream
S3 Bucket
EMR Spark
Sagemaker
Aurora
Redshift
Elasticsearch Service
DynamoDB
Databricks
AWS ALB
Apache Airflow
Architecture
Trackers
Android SDK iOS SDK JS SDK PHP SDK Java SDK
Data / Event Trackers
Trackers allow us to collect data from any
type of digital application, service or
device. All trackers adhere to the LENS
Tracker Protocol.
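To make this concrete, here is a minimal sketch of what a tracker event envelope might look like. The field names and the validation rule are illustrative assumptions; the actual LENS Tracker Protocol is not spelled out in this deck.

```python
import time
import uuid

# Hypothetical required fields; the real LENS Tracker Protocol may differ.
REQUIRED_FIELDS = {"event_type", "anonymous_id", "timestamp", "domain"}

def build_event(event_type, anonymous_id, domain, properties=None):
    """Assemble a minimal tracker event envelope (illustrative shape)."""
    return {
        "event_id": str(uuid.uuid4()),        # unique per event
        "event_type": event_type,             # e.g. "pageview", "video_play"
        "anonymous_id": anonymous_id,         # device/browser identity
        "domain": domain,                     # which digital property emitted it
        "timestamp": int(time.time() * 1000), # client clock, epoch millis
        "properties": properties or {},
    }

def is_valid(event):
    """Rudimentary check that the envelope carries the required fields."""
    return REQUIRED_FIELDS.issubset(event)

ev = build_event("pageview", "anon-123", "manoramaonline.com")
```

A shared envelope like this is what lets every SDK (Android, iOS, JS, PHP, Java) feed the same downstream pipeline.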
Collectors - Scribe

Data Collectors
01 Horizontally scalable
02 Designed for low latency
03 Engineered for high concurrency
04 Written in Go/Java

Scribe collects data from the trackers and writes it to the Kinesis Data
Firehose. This allows near-real-time processing of data as well as storage
in the data lake for further batch analysis. We use ECS Fargate for
containerization.

Scribe API endpoints:
• Event tracker
• Pixel tracker
• Click tracker
• AMP tracker
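A rough sketch of how a collector like Scribe might batch events for Kinesis Data Firehose, whose PutRecordBatch API accepts at most 500 records per call. The batching helper and stream name are illustrative, not Scribe's actual implementation.

```python
import json

MAX_BATCH = 500  # Kinesis Firehose PutRecordBatch accepts at most 500 records

def to_batches(events, max_batch=MAX_BATCH):
    """Split incoming tracker events into Firehose-sized record batches."""
    records = [{"Data": (json.dumps(e) + "\n").encode()} for e in events]
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]

# In the real collector each batch would be handed to boto3, e.g.
#   firehose.put_record_batch(DeliveryStreamName="lens-events", Records=batch)
# ("lens-events" is a made-up name; failed records must be retried).

batches = to_batches([{"event_type": "pageview", "n": i} for i in range(1200)])
```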
Accumulo / Data Lake

ACCUMULO
The data consumer component, responsible for:
• Reading data from the event firehose (Kinesis Streams)
• Performing rudimentary data quality checks
• Converting data to Avro format with Snappy compression
• Loading it into the data lake

DATA LAKE
The data lake supports the following capabilities:
• Capture and store raw data securely, at scale and at low cost
• Store many types of data in the same repository
• Define the structure of the data at the time it is used
It is designed to:
• Retain all data
• Support all data types
• Adapt easily to changes
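To make "define the structure of the data at the time it is used" concrete, data lakes are usually laid out with partitioned object keys so that query engines can prune by date and type. The sketch below shows one plausible Hive-style layout for the raw zone; the bucket prefix and naming scheme are assumptions, not the layout documented here.

```python
from datetime import datetime, timezone

def lake_key(domain, event_type, ts_millis, event_id):
    """Hive-style partitioned S3 key for a raw zone. The raw/ prefix and the
    partition columns are illustrative, not this platform's actual scheme."""
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return (
        f"raw/domain={domain}/event_type={event_type}/"
        f"dt={dt:%Y-%m-%d}/hour={dt:%H}/{event_id}.avro.snappy"
    )

key = lake_key("manoramaonline.com", "pageview", 1_560_000_000_000, "ev-1")
```

Partition pruning on `domain`, `event_type` and `dt` is what keeps Spark and Redshift Spectrum scans cheap as the lake grows.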
Prism - Processing Engine

Prism uses Apache Spark as its processing engine and is written in Scala.
It can run on EMR 5.27 or as a Databricks job running on AWS
spot/on-demand instances.

Prism: Unified Processing Engine / Analytics Engine
Prism - Processing Engine

Data Cleanser
Performs data cleansing, including:
• Normalization
• De-duplication
• Bot exclusion
• Fixes for client clock issues
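A toy sketch of two of these cleansing steps, de-duplication and the client-clock fix. The skew threshold and the clamp-to-server-time strategy are illustrative assumptions, not the production logic.

```python
def cleanse(events, server_now_ms, max_skew_ms=5 * 60 * 1000):
    """De-duplicate on event_id and clamp timestamps that sit implausibly
    far in the future (a client clock issue) to the server receive time."""
    seen, out = set(), []
    for e in events:
        if e["event_id"] in seen:           # de-duplication
            continue
        seen.add(e["event_id"])
        if e["timestamp"] > server_now_ms + max_skew_ms:  # client clock ahead
            e = {**e, "timestamp": server_now_ms}
        out.append(e)
    return out

cleaned = cleanse(
    [{"event_id": "a", "timestamp": 100},
     {"event_id": "a", "timestamp": 100},             # duplicate
     {"event_id": "b", "timestamp": 9_999_999_999}],  # clock far ahead
    server_now_ms=1_000,
)
```

In Prism this logic would run as Spark transformations over the event stream rather than a Python loop.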
Data Enricher
Performs enrichment activities, including:
• User-agent parsing to understand OS/platform
• Referrer parsing to understand channels
• IP-to-location transformation
• Lat/long-to-location transformation
• Widening event data with user profile information

Data Quality Checks
Performs the data quality checks needed to detect, report and omit
instrumentation errors

Data Reconciler
Reconciles sacrosanct data, such as transactions, against the feeds
generated by the master DB

Sessionization / User Merging
Sessionizes and merges users based on domain/anonymous ID
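Sessionization can be sketched as splitting each user's event stream on an inactivity gap. The 30-minute window below is a common industry default, assumed here because the deck does not state the production value.

```python
SESSION_GAP_MS = 30 * 60 * 1000  # assumed 30-minute inactivity window

def sessionize(events):
    """Assign session numbers per (domain, anonymous_id), starting a new
    session whenever the gap since the previous event exceeds the window."""
    events = sorted(events,
                    key=lambda e: (e["domain"], e["anonymous_id"], e["timestamp"]))
    out, last_key, last_ts, n = [], None, None, 0
    for e in events:
        key = (e["domain"], e["anonymous_id"])
        if key != last_key or e["timestamp"] - last_ts > SESSION_GAP_MS:
            n += 1                 # new user or inactivity gap: new session
        out.append({**e, "session_id": n})
        last_key, last_ts = key, e["timestamp"]
    return out

demo = sessionize([
    {"domain": "d", "anonymous_id": "u", "timestamp": 0},
    {"domain": "d", "anonymous_id": "u", "timestamp": 10_000},
    {"domain": "d", "anonymous_id": "u", "timestamp": 45 * 60 * 1000},
])
```

In Spark this is typically expressed with a window function over `lag(timestamp)` partitioned by the user key.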
Prism - Analytics Engine

Data Refresher
Loads the data into the respective tables in the data warehouse and
other reporting data stores
Prism - Real-time Analytics
• Uses Spark Structured Streaming to stream live events into Elasticsearch
• The stack can run on both EMR and Databricks
• Runs on 50 r4.xlarge instances, scaled up to 100 instances during
election time
• Configuration:
spark.executor.cores=4
spark.executor.memory=25g
spark.executor.instances=50
Spark Streaming
Prism - Batch Analytics
Spark on EMR/Databricks
• A scheduled job kicks off every day to process all the events for the
day and writes the cleansed raw/aggregated data to Redshift (the
primary data store)
• It also writes the data in Parquet format so that Presto or
Databricks Delta Lake can run on top of it if needed
• Runs on 20 r4.2xlarge instances
• Configuration:
spark.executor.cores=3
spark.executor.memory=20g
spark.executor.instances=39
Data Stores

01 DATA WAREHOUSE: AMAZON REDSHIFT
Primary data store
• Supports batch workloads
• Supports up to 50 concurrent queries
• pgpool deployed as a cache layer
• WLM and concurrency scaling enabled
• Elastic Resize
• Redshift Spectrum to query archived data in S3

02 REALTIME REPORTING STORE: ELASTICSEARCH
Content analytics real-time dashboard
• Fluidic dashboard with granular filters
• Data visualization using Kibana

03 RECOMMENDATION RESULTS: DYNAMODB
Features like horizontal scalability, low operational overhead and
predictable performance make DynamoDB a good choice for storing
recommendation results
Orchestration
Apache Airflow

Workflow Management
Used to programmatically author, schedule and monitor workflows.

Rich UI
A rich UI makes it easy to visualize pipelines running in production,
monitor progress and troubleshoot issues when needed.
Data Retention Strategy
• Find a balance between what’s optimal for the client’s business needs and operational cost effectiveness
• Ensure the data retention policies align with regulatory restrictions (GDPR)
• Define proper lifecycle policies at the different stages
• An S3-IA/Glacier lifecycle policy is defined for data at rest in the data lake, and a scheduled purging policy
is defined for the primary data store (Redshift)
• We keep a quarter’s worth of data in the primary data store (Redshift); older data is archived to S3
• Redshift Spectrum is used for detailed analysis of older data
• For YoY and QoQ comparisons, we pre-calculate the aggregates as part of the quarterly process and store the
results in the data store
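The S3-IA/Glacier tiering described above can be expressed as an S3 lifecycle configuration. The sketch below builds one; the `raw/` prefix and the 90/365-day thresholds are illustrative assumptions (90 days roughly mirrors the quarter kept hot in Redshift), not the values actually deployed.

```python
def lifecycle_rules(ia_after_days=90, glacier_after_days=365):
    """Build an S3 lifecycle configuration that tiers data-lake objects to
    S3-IA and then Glacier. Prefix and day counts are assumptions."""
    return {
        "Rules": [{
            "ID": "lens-raw-zone-tiering",          # illustrative rule name
            "Filter": {"Prefix": "raw/"},           # assumed raw-zone prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
        }]
    }

# boto3 would apply it with:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="lens-data-lake",  # bucket name is illustrative
#       LifecycleConfiguration=lifecycle_rules())
cfg = lifecycle_rules()
```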
Dashboard - KPIs / Different Angles

Domain-Specific KPIs
Key metrics in the Content Dashboard:
• Page views
• New and returning visitors
• Engaged time
• Social shares and referrals
• Bounce rate
• Video play rate

Different Angles
Explore the content data from these angles:
• Titles
• Authors
• Sections
• Tags
• Referrers
• Campaigns
• Google AMP
• Facebook IA
Scalability / Performance
01 Collect, storage and process layers designed to autoscale
02 Horizontally scalable data collectors, data consumers, data processors and data reporting
stores
03 Currently handles about 150 GB of data per day, with an average of 300 million events
processed per day
04 Turnaround latency at the data collector: 27 ms at the 75th percentile and 156 ms at the
95th percentile
05 Batch analytics takes an average of 30-40 minutes to process and refresh data for the
entire day across all reporting dashboards
06 The real-time streaming stack currently processes 500K events in less than 10 seconds
Best Practices in Spark
• Use Datasets, DataFrames and Spark SQL instead of RDDs to get the benefits of the Catalyst optimizer
• Choose the best data format and compression:
  • Apache Parquet gives the fastest read performance with Spark thanks to its vectorized Parquet reader;
Presto or Databricks Delta Lake can run on top of it if needed
  • Avro offers rich schema support and more efficient writes than Parquet
  • Choose either Snappy or LZO compression, as both strike a balance between splittability and block compression
• Use the Spark Web UI to explore your jobs, storage and SQL query plans to optimize your Spark execution:
  • Look at the Spark event timeline to see the amount of time spent in each stage/task
  • Check the shuffles between stages and the amount of data shuffled (use the spark.sql.shuffle.partitions
option if needed)
• Check the join algorithms being used:
  • A broadcast join should be used when one table is small
  • A sort-merge join should be used for large tables. You can use bucketing to pre-sort and group tables; this
avoids shuffling in the sort-merge
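The broadcast vs. sort-merge trade-off can be illustrated outside Spark with two toy joins in plain Python: a broadcast join hashes the small side once and probes it per row of the big side (no shuffle of the big table), while a sort-merge join sorts both sides and walks them in lockstep. These are conceptual sketches, not Spark's actual implementations.

```python
def broadcast_hash_join(big, small, key):
    """Broadcast-style join: build a hash map of the small table once,
    then probe it for every row of the big table."""
    lookup = {}
    for r in small:
        lookup.setdefault(r[key], []).append(r)
    return [{**b, **s} for b in big for s in lookup.get(b[key], [])]

def sort_merge_join(left, right, key):
    """Sort-merge-style join: sort both inputs on the key, then advance
    two cursors in lockstep; suits two large, similarly sized tables."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        while j < len(right) and right[j][key] < l[key]:
            j += 1
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})
            k += 1
    return out

events = [{"uid": 1, "page": "a"}, {"uid": 2, "page": "b"}]
users = [{"uid": 1, "city": "Kochi"}]
```

In Spark SQL the equivalent choice is made by the optimizer, nudged with `broadcast()` hints or the autoBroadcastJoinThreshold setting.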
• Enable Dynamic Partition Pruning, flattenScalarSubqueriesWithAggregates, Bloom Filter Join and Optimized Join
Reorder
• Use the s3 protocol instead of s3a/s3n to refer to the data so that it goes through the optimized path
• Use EMRFS consistent view only if it’s required
• Find an optimal configuration for the number of executors, the memory setting for each executor and the number
of cores for the Spark job
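As a worked example of executor sizing, one common rule of thumb reproduces the batch configuration quoted earlier (20 r4.2xlarge nodes at 8 vCPUs each, 3 cores per executor, 39 executor instances). The formula below is an assumption about how such numbers are typically derived, not the documented derivation.

```python
def executor_plan(nodes, vcpus_per_node, cores_per_executor):
    """Rule-of-thumb executor count: reserve one vCPU per node for the
    OS/YARN daemons, pack executors by core count, then subtract one
    executor slot for the driver / application master."""
    executors_per_node = (vcpus_per_node - 1) // cores_per_executor
    return nodes * executors_per_node - 1

# 20 x r4.2xlarge (8 vCPUs each), 3 cores per executor:
# (8 - 1) // 3 = 2 executors per node; 20 * 2 - 1 = 39 executors
n = executor_plan(nodes=20, vcpus_per_node=8, cores_per_executor=3)
```

The memory setting follows the same logic: the usable RAM per node, divided among the executors on it, minus the off-heap overhead Spark reserves.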
Outcomes
• Ability to run targeted mobile push and email campaigns
• Consistent KPI measurement: the client has a consistent framework across properties to
measure KPIs
• A single source of truth: consolidating dozens of independently managed collections of
data eliminated data silos, making it easier to identify what data is available, get
access to it and integrate it
• Better user experience: recommendations running off the data in the data lake add value
to the digital properties we manage
• Better business agility and product decisions based on behavioural insights: the journey
from data to decisions is made swifter
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS