Architecting Data Lakes on AWS with HiFX
Established in 2001, HiFX is an Amazon Web Services Advanced Consulting
Partner. We have been designing and migrating workloads in the AWS cloud since 2010
and helping organisations become truly data driven by building big data solutions
since 2015.
About Malayala Manorama
Malayala Manorama is one of the largest media conglomerates in India. They run
manoramaonline.com, the largest news portal for Malayalees around the world and
several digital media properties including manoramanews.com, m4marry.com,
helloaddress.com, tapeytapey.com, entedeal.com, quickerala.com, qkdoc.com,
manoramahorizon.com and various mobile applications
The Challenges
Lack of agility and accessibility in data analysis, which the product team needs to make smart
business decisions and improve strategies
Increasing volume and velocity of data. With new digital properties getting added, there was a need to
design the collection and storage layers that would scale well
Dozens of independently managed collections of data, leading to data silos. Having no single source of truth
was leading to difficulties in identifying what type of data is available, getting access to it and integration.
Poorly recorded data. Often, the meaning and granularity of the data were getting lost in processing
Our Journey from Data to Decisions with an AWS powered Data Lake
Connecting dozens of data
streams and repositories to a
unified data pipeline enabling
near realtime access to any data
source
Engineering well designed big
data stores for reporting and
exploratory analysis
Architect a secure, well
governed data lake to store all
data in a raw format. S3 is the
fabric with which we have woven
the solution.
Processing data in streams or
batches to aid analytics and
machine learning, supplemented
by smart workflow management to
orchestrate the tasks
Dynamic dashboards and
visualisations that make data tell
stories and help drive insights.
Offering recommendations and
predictive analytics off the data
in the data lake
Faster, Smarter & Better Decisions
COLLECT STORE PROCESS CONSUME
Scribe (Collector)
Accumulo (Storage)
Accumulo is the data consumer
component responsible for reading data
from the event streams (Kinesis Streams),
performing rudimentary data quality checks
and converting data to Avro format before
loading it into the Data Lake
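The consumer step described above can be illustrated with a minimal sketch. It is written in Python for brevity (the slides say the real consumer is a KCL app in Java), and the event fields, the schema and the specific checks are assumptions made for illustration, not the ones used in the original system. It shows an Avro record schema expressed as a dictionary and a validation function that rejects malformed payloads before they would be serialised.

```python
import json

# Hypothetical Avro schema for a tracked event; the field names are
# illustrative assumptions, not the schema used in the original system.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "event_type", "type": "string"},
        {"name": "timestamp_ms", "type": "long"},
        {"name": "user_id", "type": ["null", "string"], "default": None},
    ],
}

# Fields that must be present, with the Python type each must have.
REQUIRED = {"event_id": str, "event_type": str, "timestamp_ms": int}


def check_event(raw: bytes):
    """Return (record, None) if the payload passes the checks, else (None, reason)."""
    try:
        event = json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return None, "not valid JSON"
    if not isinstance(event, dict):
        return None, "not a JSON object"
    for field, expected in REQUIRED.items():
        value = event.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            return None, f"missing or invalid field: {field}"
    if event["timestamp_ms"] <= 0:
        return None, "non-positive timestamp"
    user_id = event.get("user_id")
    if user_id is not None and not isinstance(user_id, str):
        return None, "invalid field: user_id"
    # Keep only schema fields so the record matches EVENT_SCHEMA exactly.
    record = {f["name"]: event.get(f["name"]) for f in EVENT_SCHEMA["fields"]}
    return record, None
```

A record that passes could then be handed to an Avro serialiser together with the schema; rejected payloads could be counted or routed to a separate location for inspection.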
Our Data Lake in S3 captures and stores
raw data at scale at a low cost. It allows us
to store many types of data in the same
repository while letting us define the
structure of the data at the time it is
used
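Defining the structure at the time of use is often called schema on read. The following is a small Python sketch of the idea, assuming raw events are kept as JSON lines (the sample records and field names are made up for illustration). Each reader names the columns it cares about, and records that lack a column simply yield `None` for it.

```python
import json

# Raw events kept exactly as they arrived (illustrative sample data).
RAW_LINES = [
    '{"event_type": "page_view", "url": "/news/1", "ts": 1}',
    '{"event_type": "search", "query": "weather", "ts": 2}',
    '{"event_type": "page_view", "url": "/news/2", "ts": 3, "referrer": "home"}',
]


def read_with_schema(lines, columns):
    """Project each raw record onto the columns the reader asks for."""
    for line in lines:
        record = json.loads(line)
        yield {col: record.get(col) for col in columns}


# One reader only needs page views and their URLs.
page_views = [
    row for row in read_with_schema(RAW_LINES, ["event_type", "url"])
    if row["event_type"] == "page_view"
]
```

A different reader can apply a different projection to the same raw lines, for example `["query"]` for search analysis, without the stored data changing.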
Scribe collects data from the trackers
and writes it to Kinesis Streams
It is written in Go and engineered for
high concurrency, low latency and
horizontal scalability
Currently running on two c4.large
instances, our API latency is 12.6 ms at
the 50th percentile and 27 ms at the 75th
percentile. This is made possible
because of the consistent and
predictable performance of Kinesis
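As a rough illustration of the collector's write path, the sketch below groups incoming events into batches shaped like the `Records` argument of the Kinesis `PutRecords` call. It is written in Python rather than Go, and the function name, the `user_id` partition key and the fallback key are assumptions for illustration, not details from Scribe itself. The sketch only enforces the 500-records-per-request limit; request and record size limits are not checked.

```python
import json

# Documented Kinesis PutRecords limit: at most 500 records per request.
MAX_RECORDS_PER_REQUEST = 500


def to_put_records_batches(events, key_field="user_id"):
    """Group events into lists shaped like the Records argument of PutRecords."""
    batch = []
    for event in events:
        batch.append({
            "Data": json.dumps(event).encode("utf-8"),
            # Records sharing a partition key are routed to the same shard.
            "PartitionKey": str(event.get(key_field, "anonymous")),
        })
        if len(batch) == MAX_RECORDS_PER_REQUEST:
            yield batch
            batch = []
    # Emit the final, possibly smaller, batch.
    if batch:
        yield batch
```

With boto3, each batch could be passed as `Records=batch` to the Kinesis client's `put_records` call along with a stream name. Batching trades a little latency for fewer requests, which is one common way to keep a collector's per-event cost low.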
[Architecture diagram, Collect and Store stages. Trackers: JS SDK, Android SDK and iOS SDK (clickstream; Android app and iOS app) and Java / Go / Python / PHP SDKs (server events). Scribe: Collector API in Go behind an ELB, in an Auto Scaling group across Availability Zones A and B. Kinesis Streams. Accumulo: consumer (KCL app in Java) in an Auto Scaling group across Availability Zones A and B, with raw data (Avro format) stored in the Data Lake in AWS S3. Also shown: Airflow, dimension / offline data and a data sink.]
Why Amazon S3 For Data Lake ?
Performance is somewhat lower than that of an HDFS cluster, but this doesn't affect our workloads significantly. EMRFS with
consistent view (backed by DynamoDB) works really well
Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies, and security:
SSL and client-side or server-side encryption
Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999%
durability. Lower TCO and easier to scale than HDFS
Decoupled storage and compute, allowing multiple heterogeneous analysis clusters to use the same
data
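The tiered storage point above can be made concrete with a lifecycle rule. The sketch below is a Python dictionary in the shape S3 expects for a bucket lifecycle configuration; the rule ID, the `raw/` prefix and the day counts are illustrative assumptions, not the values used in this project. The small helper only models which tier an object of a given age would be in under the rule; S3 applies transitions on its own schedule.

```python
# Lifecycle rule in the shape accepted by S3's lifecycle configuration API.
# The prefix, rule ID and day counts are illustrative assumptions.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "tier-raw-events",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}


def storage_class_at(age_days, rule):
    """Return the storage class an object of the given age would be in."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current
```

With boto3, a dictionary like this could be passed as `LifecycleConfiguration` to the S3 client's `put_bucket_lifecycle_configuration` call.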
Prism (Processor)
Lens (Consumer)
Lens is a custom-built reporting & visualisation app
that helps business owners easily interpret,
visualise and record data and derive
insights
Detailed Analysis of KPIs, Event
Segmentation, Funnels, Search Insights,
Path Finder, Retention/Addiction Analysis,
etc., powered by Redshift and Druid.
Pgpool is used to cache Redshift queries.
Prism is a unified processing engine written
in Scala, using Apache Spark running
on EMR
Airflow is used to programmatically
author, schedule and monitor
workflows
Prism generates data for tracking KPIs
and performs funnel, pathflow, retention
and affinity analysis. It also includes
machine learning workloads that
generate recommendations and
predictions
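To make the funnel analysis mentioned above concrete, here is a minimal single-machine sketch in Python. It is not Prism's Spark implementation, which the slides do not show; the function name and the event tuple layout are assumptions for illustration. For each user it walks the events in time order and counts how far through an ordered list of steps that user got.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count users who completed each funnel step in order.

    events: iterable of (user_id, timestamp, event_type) tuples.
    steps: ordered list of event types that define the funnel.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, event_type in events:
        by_user[user_id].append((timestamp, event_type))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        next_step = 0
        # Walk each user's events in time order, advancing one step at a time.
        for _, event_type in sorted(user_events):
            if next_step < len(steps) and event_type == steps[next_step]:
                counts[next_step] += 1
                next_step += 1
    return dict(zip(steps, counts))
```

A user who performs a later step before an earlier one is only counted from the point where the steps occur in the defined order. In a distributed engine the same logic would typically be expressed as a group-by on the user followed by a per-group scan.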
[Architecture diagram, Process and Consume stages. Prism: EMR Spark persistent cluster (ML and realtime updates) and EMR Spark on-demand cluster (batch workloads), shown with Kinesis and the Data Lake in AWS S3. Enriched data in Parquet format (run Athena / Presto on top for ad hoc queries). Processed data stores: Redshift, DynamoDB (stores the results of recommendations, market basket analysis, etc.) and Druid (realtime streaming data ingestion engine). Lens: BI dashboard in an Auto Scaling group (1:1) across Availability Zones A and B.]
Scalability / Performance
Collect, Store and Process layers are designed to auto scale
Latency at the data collector: 27 ms at the 75th percentile and 156 ms at the 95th percentile
Currently handling 60 million events per month, with 100x growth expected in 2018
Horizontally scalable data collectors, data consumers, data processors and data reporting stores
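The volume figures above translate into fairly modest per-second rates, which the following back-of-the-envelope sketch works out. The 30-day month and the peak-to-average factor are assumptions made for illustration; the per-shard figure is the documented Kinesis write limit of 1,000 records per second, and the sketch ignores the separate per-shard byte limit.

```python
import math

EVENTS_PER_MONTH = 60_000_000        # from the slide
GROWTH_FACTOR = 100                  # from the slide
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month
PEAK_TO_AVERAGE = 5                  # hypothetical traffic burst factor
SHARD_RECORDS_PER_SECOND = 1000      # documented Kinesis per-shard write limit


def average_rate(events_per_month):
    """Average events per second for a monthly event count."""
    return events_per_month / SECONDS_PER_MONTH


def shards_needed(events_per_second):
    """Minimum shard count for a given record rate (record size ignored)."""
    return max(1, math.ceil(events_per_second / SHARD_RECORDS_PER_SECOND))


current_avg = average_rate(EVENTS_PER_MONTH)                  # about 23 events/s
future_avg = average_rate(EVENTS_PER_MONTH * GROWTH_FACTOR)   # about 2,315 events/s
future_peak = future_avg * PEAK_TO_AVERAGE
```

Under these assumptions the 100x target needs at least 3 shards for average traffic and 12 at the assumed peak, which is the kind of headroom a horizontally scalable stream is meant to absorb.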
The Benefits
Ability to run targeted mobile push and email campaigns
Consistent KPI measurement. The client has a consistent framework across properties to measure KPIs
Better user experience. Recommendations running off the data in the Data Lake add value to the digital
properties we manage
Better business agility and product decisions based on behavioural insights. The journey from data to
decisions is made swifter
THANK YOU

More Related Content

What's hot

Customer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data PerspectiveCustomer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data PerspectiveDatabricks
 
Amazon big success using big data analytics
Amazon big success using big data analyticsAmazon big success using big data analytics
Amazon big success using big data analyticsKovid Academy
 
Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Amazon Web Services
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperVasu S
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessDataWorks Summit
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight OverviewLam Le
 
Data Estate Modernization
Data Estate ModernizationData Estate Modernization
Data Estate ModernizationKarina Matos
 
When Open Source Meets the Enterprise
When Open Source Meets the EnterpriseWhen Open Source Meets the Enterprise
When Open Source Meets the EnterpriseMariaDB plc
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Amazon Web Services
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveMartin Zapletal
 
Qubole on AWS - White paper
Qubole on AWS - White paper Qubole on AWS - White paper
Qubole on AWS - White paper Vasu S
 
Delivering business insights and automation utilizing aws data services
Delivering business insights and automation utilizing aws data servicesDelivering business insights and automation utilizing aws data services
Delivering business insights and automation utilizing aws data servicesBhuvaneshwaran R
 
Data analysis trend 2015 2016 v071
Data analysis trend 2015 2016 v071Data analysis trend 2015 2016 v071
Data analysis trend 2015 2016 v071Chun Myung Kyu
 
Big Data Analytics on AWS - Carlos Conde - AWS Summit Paris
Big Data Analytics on AWS - Carlos Conde - AWS Summit ParisBig Data Analytics on AWS - Carlos Conde - AWS Summit Paris
Big Data Analytics on AWS - Carlos Conde - AWS Summit ParisAmazon Web Services
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 

What's hot (20)

Customer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data PerspectiveCustomer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data Perspective
 
Amazon big success using big data analytics
Amazon big success using big data analyticsAmazon big success using big data analytics
Amazon big success using big data analytics
 
Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
 
Data Estate Modernization
Data Estate ModernizationData Estate Modernization
Data Estate Modernization
 
Azure Big data
Azure Big data Azure Big data
Azure Big data
 
When Open Source Meets the Enterprise
When Open Source Meets the EnterpriseWhen Open Source Meets the Enterprise
When Open Source Meets the Enterprise
 
Managed NoSQL databases
Managed NoSQL databasesManaged NoSQL databases
Managed NoSQL databases
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspective
 
Qubole on AWS - White paper
Qubole on AWS - White paper Qubole on AWS - White paper
Qubole on AWS - White paper
 
Lecture1
Lecture1Lecture1
Lecture1
 
Delivering business insights and automation utilizing aws data services
Delivering business insights and automation utilizing aws data servicesDelivering business insights and automation utilizing aws data services
Delivering business insights and automation utilizing aws data services
 
Data analysis trend 2015 2016 v071
Data analysis trend 2015 2016 v071Data analysis trend 2015 2016 v071
Data analysis trend 2015 2016 v071
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Big Data Analytics on AWS - Carlos Conde - AWS Summit Paris
Big Data Analytics on AWS - Carlos Conde - AWS Summit ParisBig Data Analytics on AWS - Carlos Conde - AWS Summit Paris
Big Data Analytics on AWS - Carlos Conde - AWS Summit Paris
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 

Similar to Architecting Data Lakes on AWS

Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsAmazon Web Services
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSAWS User Group Kochi
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSAmazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Amazon Web Services
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Building a modern data platform in AWS
Building a modern data platform in AWSBuilding a modern data platform in AWS
Building a modern data platform in AWSAmazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewAmazon Web Services
 
Value of Data Beyond Analytics by Darin Briskman
 Value of Data Beyond Analytics by Darin Briskman Value of Data Beyond Analytics by Darin Briskman
Value of Data Beyond Analytics by Darin BriskmanSameer Kenkare
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018
Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018
Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018Amazon Web Services
 
Using AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your ApplicationsUsing AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your ApplicationsAmazon Web Services
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSAmazon Web Services
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
Data Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxData Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxArunPandiyan890855
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 

Similar to Architecting Data Lakes on AWS (20)

Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
 
Big Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS CloudBig Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS Cloud
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
 
Building a modern data platform in AWS
Building a modern data platform in AWSBuilding a modern data platform in AWS
Building a modern data platform in AWS
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 
Value of Data Beyond Analytics by Darin Briskman
 Value of Data Beyond Analytics by Darin Briskman Value of Data Beyond Analytics by Darin Briskman
Value of Data Beyond Analytics by Darin Briskman
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018
Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018
Drive Customer Value with Data-Driven Decisions (GPSBUS206) - AWS re:Invent 2018
 
Using AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your ApplicationsUsing AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your Applications
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Data Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxData Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptx
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Architecting Data Lakes on AWS

  • 1. Architecting Data Lakes on AWS with HiFX. Established in 2001, HiFX is an Amazon Web Services Advanced Consulting Partner. We have been designing and migrating workloads in the AWS cloud since 2010 and helping organisations become truly data driven by building big data solutions since 2015. About Malayala Manorama: Malayala Manorama is one of the largest media conglomerates in India. They run manoramaonline.com, the largest news portal for Malayalees around the world, and several digital media properties including manoramanews.com, m4marry.com, helloaddress.com, tapeytapey.com, entedeal.com, quickerala.com, qkdoc.com, manoramahorizon.com and various mobile applications.
  • 2. The Challenges: Lack of agility and accessibility for data analysis that would help the product team make smart business decisions and improve strategies. Increasing volume and velocity of data: with new digital properties being added, the collection and storage layers had to be designed to scale well. Dozens of independently managed collections of data, leading to data silos: with no single source of truth, it was difficult to identify what type of data was available, to get access to it and to integrate it. Poorly recorded data: often, the meaning and granularity of the data were getting lost in processing.
  • 3. Our Journey from Data to Decisions with an AWS powered Data Lake: Connecting dozens of data streams and repositories to a unified data pipeline, enabling near-realtime access to any data source. Engineering well designed big data stores for reporting and exploratory analysis. Architecting a secure, well governed data lake to store all data in a raw format; S3 is the fabric with which we have woven the solution. Processing data in streams or batches to aid analytics and machine learning, supplemented by smart workflow management to orchestrate the tasks. Dynamic dashboards and visualisations that make data tell stories and help drive insights. Offering recommendations and predictive analytics off the data in the data lake.
  • 4. Faster, Smarter & Better Decisions
  • 5. COLLECT, STORE, PROCESS, CONSUME. Scribe (Collector): Scribe collects data from the trackers and writes it to Kinesis Streams. It is written in Go and engineered for high concurrency, low latency and horizontal scalability. Currently running on two c4.large instances, our API latency is 12.6 ms at the 50th percentile and 27 ms at the 75th percentile. This is made possible by the consistent and predictable performance of Kinesis. Accumulo (Storage): Accumulo is the data consumer component responsible for reading data from the event streams (Kinesis Streams), performing rudimentary data quality checks and converting data to Avro format before loading it into the Data Lake. Our Data Lake in S3 captures and stores raw data at scale at low cost. It allows us to store many types of data in the same repository while letting us define the structure of the data at the time it is used.
  • 6. [Architecture diagram for the COLLECT and STORE stages. Labelled components: TRACKERS (JS SDK, Android SDK and iOS SDK for clickstream data from the web, Android app and iOS app; Java / Go / Python / PHP SDK for server events); SCRIBE (ELB and Collector API in Go instances); Kinesis Streams; ACCUMULO (Consumer KCL App in Java); Airflow; Auto Scaling groups (one labelled 1:1) spanning Availability Zones A and B; data sink: Data Lake in AWS S3, holding raw data (Avro format) and dimension / offline data.]
  • 7. Why Amazon S3 for the Data Lake? Performance is relatively lower than an HDFS cluster, but this doesn't affect our workloads significantly; EMRFS with consistent view (backed by DynamoDB) works really well. Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies, and security (SSL, client/server-side encryption). Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999% durability; lower TCO and easier to scale than HDFS. Decoupled storage and compute, allowing multiple and heterogeneous analysis clusters to use the same data.
  • 8. Prism (Processor): A unified processing engine written in Scala, using Apache Spark running on EMR. Airflow is used to programmatically author, schedule and monitor workflows. Prism generates data for tracking KPIs and performs funnel, pathflow, retention and affinity analysis. It also includes machine learning workloads that generate recommendations and predictions. Lens (Consumer): A custom-built reporting and visualisation app that helps business owners easily interpret, visualise and record data and derive insights. Detailed analysis of KPIs, event segmentation, funnels, search insights, path finder, retention/addiction analysis etc., powered by Redshift and Druid. Pgpool is used to cache Redshift queries.
  • 9. [Architecture diagram for the PROCESS and CONSUME stages. Labelled components: Data Lake in AWS S3; Kinesis; PRISM (EMR Spark persistent cluster for ML and realtime updates; EMR Spark on-demand cluster for batch workloads); enriched data in Parquet format (run Athena / Presto on top for ad hoc queries); processed data stores: Redshift, DynamoDB (stores the results of recommendations, market basket analysis etc.) and Druid (realtime streaming data ingestion engine); LENS (Lens BI Dashboard); Auto Scaling group (1:1) spanning Availability Zones A and B.]
  • 10. Scalability / Performance: Collect, Store and Process layers are designed to autoscale. Latency at the data collector: 27 ms at the 75th percentile and 156 ms at the 95th percentile. Currently handling 60 million events per month, with 100x scale expected in 2018. Horizontally scalable data collectors, data consumers, data processors and data reporting stores.
  • 11. The Benefits: Ability to run targeted mobile push and email campaigns. Consistent KPI measurement: the client has a consistent framework across properties to measure KPIs. Better user experience: recommendations running off the data in the Data Lake add value to the digital properties we manage. Better business agility and product decisions based on behavioural insights; the journey from data to decisions is made swifter.
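
Slide 5 says that Accumulo performs rudimentary data quality checks before converting events to Avro. The deck does not show those checks, so the following is only a minimal Python sketch of what such a step might look like. The field names (`event_id`, `event_type`, `timestamp`) and the rules are hypothetical and not taken from the HiFX implementation, and the Avro conversion itself is left out.

```python
import json

# Hypothetical required fields; the real event schema is not given in the deck.
REQUIRED_FIELDS = ("event_id", "event_type", "timestamp")


def check_event(raw: bytes):
    """Return (event, None) if the record passes, else (None, reason)."""
    try:
        event = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None, "not valid JSON"
    if not isinstance(event, dict):
        return None, "not a JSON object"
    for field in REQUIRED_FIELDS:
        if event.get(field) in (None, ""):
            return None, f"missing {field}"
    ts = event["timestamp"]
    # Assumed rule: the timestamp is a positive number (e.g. epoch seconds).
    if isinstance(ts, bool) or not isinstance(ts, (int, float)) or ts <= 0:
        return None, "bad timestamp"
    return event, None


def split_batch(records):
    """Separate a batch of raw records into accepted events and rejects."""
    accepted, rejected = [], []
    for raw in records:
        event, reason = check_event(raw)
        if event is None:
            rejected.append((raw, reason))
        else:
            accepted.append(event)
    return accepted, rejected
```

Keeping the rejected records together with a reason, rather than dropping them silently, is one way to avoid the "poorly recorded data" problem listed among the challenges on slide 2.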
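
Slide 7 mentions tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies. As an illustration only, the Python sketch below builds a lifecycle configuration in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The `raw/` prefix, the rule ID, the bucket name and the 30 and 365 day thresholds are assumptions made for the example; the deck does not state the actual policy.

```python
def raw_data_lifecycle(prefix="raw/", ia_after_days=30, glacier_after_days=365):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    All argument defaults are illustrative assumptions, not HiFX's settings.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-raw-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = raw_data_lifecycle()
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```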
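
Slide 10 quotes 60 million events per month and an expected 100x increase. A quick back-of-the-envelope calculation in Python turns those figures into average per-second rates and a lower bound on Kinesis shards, using the documented per-shard write limit of 1,000 records per second. It assumes a 30-day month and evenly spread traffic, which real clickstream traffic is not, and it ignores the 1 MB per second per-shard byte limit because the deck gives no event sizes. Treat the result as a floor, not a sizing recommendation.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumes a 30-day month
SHARD_WRITE_LIMIT = 1000                # Kinesis records per second per shard


def average_rate(events_per_month):
    """Average events per second if traffic were spread evenly."""
    return events_per_month / SECONDS_PER_MONTH


def min_shards(events_per_month):
    """Lower bound on shards needed for the average write rate alone."""
    return max(1, math.ceil(average_rate(events_per_month) / SHARD_WRITE_LIMIT))


current = 60_000_000          # figure from the slide
projected = current * 100     # the "100x" expectation from the slide

print(f"current:   {average_rate(current):.1f} events/s, >= {min_shards(current)} shard(s)")
print(f"projected: {average_rate(projected):.1f} events/s, >= {min_shards(projected)} shard(s)")
```

Under these assumptions the average works out to roughly 23 events per second today and roughly 2,315 events per second at 100x, so even the average load at the projected scale needs at least three shards before any allowance for peaks.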