SlideShare a Scribd company logo
1 of 12
Hadoop at Aadhaar
(Data Store, OLTP & OLAP)
github.com/regunathb
RegunathB
Bangalore Hadoop Meetup
Enrolment Data
•
600 to 800 million UIDs in 4 years
– 1 million a day with transaction, durability guarantees
– 350+ trillion matches every day
•
~5MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted)
– About 30 TB I/O every day
– Replication and backup across DCs of about 5+ TB of incremental
data every day
– Lifecycle updates and new enrolments will continue for ever
•
Enrolment data moves from very hot to cold, needing
multi-layered storage architecture
•
Additional process data
– Several million events on an average moving through async
channels (some persistent and some transient)
– Needing insert and update guarantees across data stores
2
Authentication Data
•
100+ million authentications per day (10 hrs)
– Possible high variance on peak and average
– Sub second response
– Guaranteed audits
•
Multi-DC architecture
– All changes needs to be propagated from enrolment data stores to
all authentication sites
•
Authentication request is about 4 K
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB encrypted audit logs in 10 days
– Audit write must be guaranteed
3
Aadhaar Data Stores
Mongo cluster
(all enrolment records/documents
– demographics + photo)
Shard
1
Shard
4
Shard
5
Shard
2
Shard
3 Low latency indexed read (Documents per sec),
High latency random search (seconds per read)
MySQL
(all UID generated records - demographics only,
track & trace, enrolment status )
Low latency indexed read (milli-
seconds per read),
High latency random search (seconds
per read)
UID master
(sharded)
Enrolment
DB
Solr cluster
(all enrolment records/documents
– selected demographics only)
Low latency indexed read (Documents per sec),
Low latency random search (Documents per sec)
Shard
0
Shard
2
Shard
6
Shard
9
Shard
a
Shard
d
Shard
f
HDFS
(all raw packets)
Data
Node 1
Data
Node 10
Data
Node ..
High read throughput (MB per sec),
High latency read (seconds per read)
Data
Node 20
HBase
(all enrolment
biometric templates)
Region
Ser. 1
Region
Ser. 10
Region
Ser. ..
High read throughput (MB per sec),
Low-to-Medium latency read (milli-seconds per read)Region
Ser. 20
NFS
(all archived raw packets)
Moderate read throughput,
High latency read (seconds per read)
LUN 1 LUN 2 LUN 3 LUN 4
Systems Architecture
•
Work distribution
using SEDA &
Messaging
•
Ability to scale
within JVM and
across
•
Recovery through
check-pointing
•
Sync Http based
Auth gateway
•
Protocol Buffers &
XML payloads
•
Sharded clusters
•
Near Real-time data delivery to warehouse
•
Nightly data-sets used to build dashboards,
data marts and reports
•
Real-time monitoring using Events
Enrolment Biometric Middleware
•
Distribute, Reconcile biometric data extraction and de-dup
requests across multiple vendors (ABISs)
•
Biometric data de-referencing/read service(Http) over
sharded HDFS and NFS
– Serves bulk of the HDFS read requests (25TB per day)
– Locate data from multiple HDFS clusters
●
Sharded by read/write patterns : New, Archive,
Purge
•
Calculates and maintains Volume allocation, SLA breach
thresholds of ABISs
– Thresholds stored in ZK and pushed to middleware
nodes
6
Event Streams & Sinks
•
Event framework supporting different interaction/data
durability patterns
– P2P, Pub-Sub
– Intra-JVM and Queue destinations - Durable / Non-Durable
– Fire & Forget, Ack. after processing
•
Event Sinks
– Ephemeral data consumed by counters, metrics (dashboard)
– Rolling file appenders that push data to HDFS
●
Primary mechanism for delivering raw fact data from
transactional systems to the warehouse staging area
7
Data Analysis
•
Statistical analysis from millions of events
– View into quality of enrolments – e.g. Enrolment
Agencies, Operators
– Feature introduction – e.g. Based on avg. time taken for
biometric capture, demographic data input
– Enrolment volumes – e.g. By Registrar, Agency,
Operator etc
●
Useful in fraud detection
•
Goal to share anonymized data sets for use by industry and
academia – information transparency
•
Various reports – Self-serve, Canned, Operational and/or
Aggregates
8
UID BI Platform
Data Analysis architecture
9
Data Access Framework
UIDAI Systems
Events
(Rabbit MQ)
Server DB
(MySQL)
Hadoop HDFS
Data Warehouse (HDFS/Hive)
Event CSV
Fact DataDimension Data
Datasets
On-Demand Datasets
Datamarts
(MySQL)
Raw Data
Dimension Data
(MySQL)
Pig
Pentaho Kettle
Hive
Pentaho Kettle
Canned Reports Dashboard
Self-service
Analytics
Pentaho BI
FusionCharts
E-mail/Portal/Others
Hadoop stack summary
•
CDH2 (Enrolment, Analysis), CDH3(Authentication)
•
Data Store
– HDFS : Enrolment, Events, Audit Logs, Warehouse
– HBase : Biometric templates used in Authentication
•
Coordination/Config
– ZK : Biometric middleware thresholds
•
Analysis
– Pig : ETL for loading analysis data from staging to atomic
warehouse
– Hive : Dataset generation framework
10
Learnings
•
Watch out for“too many small files”. HDFS is better suited for
fewer but large files
•
Data loss from HDFS in spite of having 3 replica copies – maybe
fixed in releases post CDH2?
•
Give careful consideration to HBase table design – row key
primarily to avoid region-server hot-spotting
•
Hive data (HDFS files) does not handle duplicate records – can
be an issue if data injestion is replayed for data sets
– Hive over Hbase is a viable alternative
11
References
•
Aadhaar Portal :
https://portal.uidai.gov.in/uidwebportal/dashboard.do
•
Data Portal :
https://data.uidai.gov.in/uiddatacatalog/dataCatalogHom
e.do
•
Analytics whitepaper :
http://uidai.gov.in/images/FrontPageUpdates/uid_doc_30
012012.pdf
12

More Related Content

What's hot

Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureVARUN SAXENA
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Cloud File System and Cloud Data Management Interface (CDMI)
Cloud File System and Cloud Data Management Interface (CDMI)Cloud File System and Cloud Data Management Interface (CDMI)
Cloud File System and Cloud Data Management Interface (CDMI)Calsoft Inc.
 
RedisConf18 - Redis on Google Cloud Platform
RedisConf18 - Redis on Google Cloud PlatformRedisConf18 - Redis on Google Cloud Platform
RedisConf18 - Redis on Google Cloud PlatformRedis Labs
 
MongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB
 
Generating Code with Oracle SQL Developer Data Modeler
Generating Code with Oracle SQL Developer Data ModelerGenerating Code with Oracle SQL Developer Data Modeler
Generating Code with Oracle SQL Developer Data ModelerRob van den Berg
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionDataWorks Summit
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 

What's hot (20)

Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Cloud File System and Cloud Data Management Interface (CDMI)
Cloud File System and Cloud Data Management Interface (CDMI)Cloud File System and Cloud Data Management Interface (CDMI)
Cloud File System and Cloud Data Management Interface (CDMI)
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
RedisConf18 - Redis on Google Cloud Platform
RedisConf18 - Redis on Google Cloud PlatformRedisConf18 - Redis on Google Cloud Platform
RedisConf18 - Redis on Google Cloud Platform
 
MongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB Memory Management Demystified
MongoDB Memory Management Demystified
 
Generating Code with Oracle SQL Developer Data Modeler
Generating Code with Oracle SQL Developer Data ModelerGenerating Code with Oracle SQL Developer Data Modeler
Generating Code with Oracle SQL Developer Data Modeler
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache samza
Apache samzaApache samza
Apache samza
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 

Viewers also liked

Building the Flipkart phantom
Building the Flipkart phantomBuilding the Flipkart phantom
Building the Flipkart phantomRegunath B
 
Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Regunath B
 
practical risks in aadhaar project and measures to overcome them
practical risks in aadhaar project and measures to overcome thempractical risks in aadhaar project and measures to overcome them
practical risks in aadhaar project and measures to overcome themsaipriyadonthula
 
Unique identification authority of india uid
Unique identification authority of india   uidUnique identification authority of india   uid
Unique identification authority of india uidAjit Dadresa
 
E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres Regunath B
 
Facebook style notifications using hbase and event streams
Facebook style notifications using hbase and event streamsFacebook style notifications using hbase and event streams
Facebook style notifications using hbase and event streamsRegunath B
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagationRegunath B
 
Building tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsBuilding tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsRegunath B
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantageRegunath B
 
Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)Ali Raw
 

Viewers also liked (14)

Building the Flipkart phantom
Building the Flipkart phantomBuilding the Flipkart phantom
Building the Flipkart phantom
 
Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3
 
practical risks in aadhaar project and measures to overcome them
practical risks in aadhaar project and measures to overcome thempractical risks in aadhaar project and measures to overcome them
practical risks in aadhaar project and measures to overcome them
 
Srikanth Nadhamuni
Srikanth NadhamuniSrikanth Nadhamuni
Srikanth Nadhamuni
 
Aadhaar
AadhaarAadhaar
Aadhaar
 
Unique identification authority of india uid
Unique identification authority of india   uidUnique identification authority of india   uid
Unique identification authority of india uid
 
E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres
 
What database
What databaseWhat database
What database
 
Facebook style notifications using hbase and event streams
Facebook style notifications using hbase and event streamsFacebook style notifications using hbase and event streams
Facebook style notifications using hbase and event streams
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
 
Building tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsBuilding tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systems
 
Uid
UidUid
Uid
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantage
 
Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)
 

Similar to Hadoop at aadhaar

Real time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
Real time monitoring-alerting: storing 2Tb of logs a day in ElasticsearchReal time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
Real time monitoring-alerting: storing 2Tb of logs a day in ElasticsearchAli Kheyrollahi
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWSSungmin Kim
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataHitoshi Sato
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scaledatamantra
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)Sascha Dittmann
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache ApexApache Apex
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016kbajda
 
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesCrossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesIsuru Suriarachchi
 
Red Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep DiveRed Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep DiveRed_Hat_Storage
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDBMongoDB
 
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard confluent
 
A Gentle Introduction to Big Data
A Gentle Introduction to Big DataA Gentle Introduction to Big Data
A Gentle Introduction to Big DataMehmet Ali Akyol
 

Similar to Hadoop at aadhaar (20)

Real time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
Real time monitoring-alerting: storing 2Tb of logs a day in ElasticsearchReal time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
Real time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesCrossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Red Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep DiveRed Hat Storage Server Administration Deep Dive
Red Hat Storage Server Administration Deep Dive
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Hadoop and friends
Hadoop and friendsHadoop and friends
Hadoop and friends
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
 
A Gentle Introduction to Big Data
A Gentle Introduction to Big DataA Gentle Introduction to Big Data
A Gentle Introduction to Big Data
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Hadoop at aadhaar

  • 1. Hadoop at Aadhaar (Data Store, OLTP & OLAP) github.com/regunathb RegunathB Bangalore Hadoop Meetup
  • 2. Enrolment Data • 600 to 800 million UIDs in 4 years – 1 million a day with transaction, durability guarantees – 350+ trillion matches every day • ~5MB per resident – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted) – About 30 TB I/O every day – Replication and backup across DCs of about 5+ TB of incremental data every day – Lifecycle updates and new enrolments will continue for ever • Enrolment data moves from very hot to cold, needing multi-layered storage architecture • Additional process data – Several million events on an average moving through async channels (some persistent and some transient) – Needing insert and update guarantees across data stores 2
  • 3. Authentication Data • 100+ million authentications per day (10 hrs) – Possible high variance on peak and average – Sub second response – Guaranteed audits • Multi-DC architecture – All changes needs to be propagated from enrolment data stores to all authentication sites • Authentication request is about 4 K – 100 million authentications a day – 1 billion audit records in 10 days (30+ billion a year) – 4 TB encrypted audit logs in 10 days – Audit write must be guaranteed 3
  • 4. Aadhaar Data Stores Mongo cluster (all enrolment records/documents – demographics + photo) Shard 1 Shard 4 Shard 5 Shard 2 Shard 3 Low latency indexed read (Documents per sec), High latency random search (seconds per read) MySQL (all UID generated records - demographics only, track & trace, enrolment status ) Low latency indexed read (milli- seconds per read), High latency random search (seconds per read) UID master (sharded) Enrolment DB Solr cluster (all enrolment records/documents – selected demographics only) Low latency indexed read (Documents per sec), Low latency random search (Documents per sec) Shard 0 Shard 2 Shard 6 Shard 9 Shard a Shard d Shard f HDFS (all raw packets) Data Node 1 Data Node 10 Data Node .. High read throughput (MB per sec), High latency read (seconds per read) Data Node 20 HBase (all enrolment biometric templates) Region Ser. 1 Region Ser. 10 Region Ser. .. High read throughput (MB per sec), Low-to-Medium latency read (milli-seconds per read)Region Ser. 20 NFS (all archived raw packets) Moderate read throughput, High latency read (seconds per read) LUN 1 LUN 2 LUN 3 LUN 4
  • 5. Systems Architecture • Work distribution using SEDA & Messaging • Ability to scale within JVM and across • Recovery through check-pointing • Sync Http based Auth gateway • Protocol Buffers & XML payloads • Sharded clusters • Near Real-time data delivery to warehouse • Nightly data-sets used to build dashboards, data marts and reports • Real-time monitoring using Events
  • 6. Enrolment Biometric Middleware • Distribute, Reconcile biometric data extraction and de-dup requests across multiple vendors (ABISs) • Biometric data de-referencing/read service(Http) over sharded HDFS and NFS – Serves bulk of the HDFS read requests (25TB per day) – Locate data from multiple HDFS clusters ● Sharded by read/write patterns : New, Archive, Purge • Calculates and maintains Volume allocation, SLA breach thresholds of ABISs – Thresholds stored in ZK and pushed to middleware nodes 6
  • 7. Event Streams & Sinks • Event framework supporting different interaction/data durability patterns – P2P, Pub-Sub – Intra-JVM and Queue destinations - Durable / Non-Durable – Fire & Forget, Ack. after processing • Event Sinks – Ephemeral data consumed by counters, metrics (dashboard) – Rolling file appenders that push data to HDFS ● Primary mechanism for delivering raw fact data from transactional systems to the warehouse staging area 7
  • 8. Data Analysis • Statistical analysis from millions of events – View into quality of enrolments – e.g. Enrolment Agencies, Operators – Feature introduction – e.g. Based on avg. time taken for biometric capture, demographic data input – Enrolment volumes – e.g. By Registrar, Agency, Operator etc ● Useful in fraud detection • Goal to share anonymized data sets for use by industry and academia – information transparency • Various reports – Self-serve, Canned, Operational and/or Aggregates 8
  • 9. UID BI Platform Data Analysis architecture 9 Data Access Framework UIDAI Systems Events (Rabbit MQ) Server DB (MySQL) Hadoop HDFS Data Warehouse (HDFS/Hive) Event CSV Fact DataDimension Data Datasets On-Demand Datasets Datamarts (MySQL) Raw Data Dimension Data (MySQL) Pig Pentaho Kettle Hive Pentaho Kettle Canned Reports Dashboard Self-service Analytics Pentaho BI FusionCharts E-mail/Portal/Others
  • 10. Hadoop stack summary • CDH2 (Enrolment, Analysis), CDH3(Authentication) • Data Store – HDFS : Enrolment, Events, Audit Logs, Warehouse – HBase : Biometric templates used in Authentication • Coordination/Config – ZK : Biometric middleware thresholds • Analysis – Pig : ETL for loading analysis data from staging to atomic warehouse – Hive : Dataset generation framework 10
  • 11. Learnings • Watch out for“too many small files”. HDFS is better suited for fewer but large files • Data loss from HDFS in spite of having 3 replica copies – maybe fixed in releases post CDH2? • Give careful consideration to HBase table design – row key primarily to avoid region-server hot-spotting • Hive data (HDFS files) does not handle duplicate records – can be an issue if data injestion is replayed for data sets – Hive over Hbase is a viable alternative 11
  • 12. References • Aadhaar Portal : https://portal.uidai.gov.in/uidwebportal/dashboard.do • Data Portal : https://data.uidai.gov.in/uiddatacatalog/dataCatalogHom e.do • Analytics whitepaper : http://uidai.gov.in/images/FrontPageUpdates/uid_doc_30 012012.pdf 12