Big Data at Aadhaar


Slides used in the talk at the Fifth Elephant Big Data conference by Dr. Pramod Varma and Regunath B


  1. Big Data at Aadhaar
     Dr. Pramod K Varma (pramod.uid@gmail.com, Twitter: @pramodkvarma)
     Regunath Balasubramanian (regunathb@gmail.com, Twitter: @RegunathB)
  2. Aadhaar at a Glance
  3. India
     • 1.2 billion residents
       – 640,000 villages; ~60% live under $2/day
       – ~75% literacy; <3% pay income tax; <20% have access to banking
       – ~800 million mobile connections; ~200-300 million migrant workers
     • Govt. spends about $25-40 bn on direct subsidies
       – Residents have no standard identity document
       – Most programs are plagued by ghost and duplicate identities, causing leakage of 30-40%
  4. Vision
     • Create a common “national identity” for every “resident”
       – Biometric-backed identity to eliminate duplicates
       – “Verifiable online identity” for portability
     • Application ecosystem using open APIs
       – Aadhaar-enabled bank accounts and payment platform
       – Aadhaar-enabled electronic, paperless KYC
  5. Aadhaar System
     • Enrolment
       – One time in a person’s lifetime
       – Minimal demographics
       – Multi-modal biometrics (fingerprints, iris)
       – 12-digit unique Aadhaar number assigned
     • Authentication
       – Verify “you are who you claim to be”
       – Open-API based
       – Multi-device, multi-factor, multi-modal
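The 12-digit Aadhaar number ends in a Verhoeff checksum digit, a scheme that catches all single-digit errors and adjacent transpositions. As a sketch (the tables below are the standard published Verhoeff tables, not UIDAI code), generation and validation look like this:

```python
# Verhoeff checksum sketch. D is the multiplication table of the
# dihedral group D5, P the position-dependent permutation table,
# INV the group inverses.
D = [
    [0,1,2,3,4,5,6,7,8,9],
    [1,2,3,4,0,6,7,8,9,5],
    [2,3,4,0,1,7,8,9,5,6],
    [3,4,0,1,2,8,9,5,6,7],
    [4,0,1,2,3,9,5,6,7,8],
    [5,9,8,7,6,0,4,3,2,1],
    [6,5,9,8,7,1,0,4,3,2],
    [7,6,5,9,8,2,1,0,4,3],
    [8,7,6,5,9,3,2,1,0,4],
    [9,8,7,6,5,4,3,2,1,0],
]
P = [
    [0,1,2,3,4,5,6,7,8,9],
    [1,5,7,6,2,8,3,0,9,4],
    [5,8,0,3,7,9,6,1,4,2],
    [8,9,1,6,0,4,3,5,2,7],
    [9,4,5,3,1,2,6,8,7,0],
    [4,2,8,6,5,7,3,9,0,1],
    [2,7,9,3,8,0,6,4,1,5],
    [7,0,4,6,9,1,3,2,5,8],
]
INV = [0,4,3,2,1,5,6,7,8,9]

def checksum_digit(payload: str) -> str:
    """Check digit to append to an 11-digit payload."""
    c = 0
    for i, ch in enumerate(reversed(payload), start=1):
        c = D[c][P[i % 8][int(ch)]]
    return str(INV[c])

def validate(number: str) -> bool:
    """True if the full number (payload + check digit) is consistent."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0
```

The check digit is the group inverse of the folded payload, so folding the full number always lands back on the identity element 0.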
  6. Architecture Principles
     • Design for scale
       – Every component needs to scale to large volumes
       – Millions of transactions and billions of records
       – Accommodate failure and design for recovery
     • Open architecture
       – Use of open standards to ensure interoperability
       – Allow the ecosystem to build libraries against standard APIs
       – Use of open-source technologies wherever prudent
     • Security
       – End-to-end security of resident data
       – Use of open source
       – Data privacy handling (API and data anonymization)
  7. Designed for Scale
     • Horizontal scalability for all components
       – “Open scale-out” is the key
       – Distributed computing on commodity hardware
       – Distributed data stores and data partitioning
       – Horizontal scaling of the data store is a must
       – Use the right data store for the right purpose
     • No single point of bottleneck for scaling
     • Asynchronous processing throughout the system
       – Allows loose coupling of the various components
       – Allows independent component-level scaling
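Data partitioning with no single routing bottleneck usually comes down to a deterministic key-to-shard function every node can compute locally. A minimal sketch (hypothetical scheme and shard count, not the actual Aadhaar partitioning logic):

```python
import hashlib

NUM_SHARDS = 10  # assumed shard count, for illustration only

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a stable record key (e.g. an enrolment ID) to a shard index.

    Hashing spreads keys evenly and lets any node route a request
    without consulting a central lookup service.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards
```

Because the function is pure, routing scales with the number of callers; resharding, however, moves keys, which is why schemes like consistent hashing are often preferred when shard counts change.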
  8. Enrolment Volume
     • 600 to 800 million UIDs in 4 years
       – 1 million enrolments a day
       – 200+ trillion biometric matches every day
     • ~5 MB per resident
       – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted)
       – About 30 TB of I/O every day
       – Replication and backup across DCs of 5+ TB of incremental data every day
       – Lifecycle updates and new enrolments will continue indefinitely
     • Additional process data
       – Several million events on average moving through async channels (some persistent, some transient)
       – Requiring complete update and insert guarantees across data stores
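The daily figures on this slide follow directly from the stated rates; a back-of-envelope check (decimal units assumed):

```python
# Back-of-envelope check of the enrolment volume figures.
MB, TB = 1e6, 1e12

enrolments_per_day = 1_000_000
packet_size = 5 * MB          # ~5 MB per resident packet

daily_ingest = enrolments_per_day * packet_size
# 1M enrolments/day x 5 MB = 5 TB/day of new packet data,
# matching the "5+ TB of incremental data" replicated across DCs.
assert daily_ingest == 5 * TB
```

The ~30 TB/day I/O figure is plausibly the same 5 TB amplified by the multiple reads, writes and replicas each packet incurs across the store tiers.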
  9. Authentication Volume
     • 100+ million authentications per day (in a 10-hour window)
       – Possibly high variance between peak and average load
       – Sub-second response
       – Guaranteed audits
     • Multi-DC architecture
       – All changes need to be propagated from the enrolment data stores to all authentication sites
     • Each authentication request is about 4 KB
       – 100 million authentications a day
       – 1 billion audit records in 10 days (30+ billion a year)
       – 4 TB of encrypted audit logs in 10 days
       – Audit writes must be guaranteed
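These numbers are internally consistent, and working them through shows the sustained rate the gateway must absorb:

```python
# Back-of-envelope for the authentication volume figures.
auths_per_day = 100_000_000
window_hours = 10

avg_tps = auths_per_day / (window_hours * 3600)
# ~2,800 authentications/sec sustained average; peaks run higher.
assert round(avg_tps) == 2778

audit_record = 4_000                      # ~4 KB per request/audit record
audits_10_days = auths_per_day * 10
assert audits_10_days == 1_000_000_000    # 1 billion records in 10 days
assert audits_10_days * audit_record == 4e12   # 4 TB of audit logs
```

At ~2,800 requests/sec average with sub-second latency and a guaranteed audit write per request, the audit path has to sustain roughly the same write rate as the authentication path itself.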
  10. Open APIs
     • Aadhaar services
       – Core Authentication API, with supporting Best Finger Detection and OTP Request APIs
       – New services being built on top
     • Aadhaar open standards for plug-n-play
       – Biometric Device API
       – Biometric SDK API
       – Biometric Identification System API
       – Transliteration API for Indian languages
  11. Implementation
  12. Patterns & Technologies
     • Principles
       – POJO-based application implementation
       – Light-weight, custom application container
       – HTTP gateway for APIs
     • Compute patterns
       – Data locality
       – Distributed compute (within an OS process and across processes)
     • Compute architectures
       – SEDA: Staged Event-Driven Architecture
       – Master-worker compute grid
     • Data access types
       – High-throughput streaming: biometric de-duplication, analytics
       – High volume, moderate latency: workflow, UID records
       – High volume, low latency: authentication, demographic de-duplication, search (eAadhaar, KYC)
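In SEDA, each stage owns an input queue and a worker pool, so stages are decoupled and can be sized independently. A minimal single-process illustration (the real system is JVM-based; this toy `Stage` class and its handler functions are hypothetical):

```python
import queue
import threading

class Stage:
    """One SEDA stage: a queue, a worker pool, and an optional
    downstream stage to forward results to."""

    def __init__(self, name, handler, downstream=None, workers=2):
        self.name = name
        self.handler = handler        # event -> transformed event
        self.downstream = downstream  # next Stage, or None for a sink
        self.inbox = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, event):
        self.inbox.put(event)

    def _run(self):
        while True:
            event = self.inbox.get()
            result = self.handler(event)
            if self.downstream is not None:
                self.downstream.submit(result)
            self.inbox.task_done()

    def drain(self):
        """Block until this stage's queue is fully processed."""
        self.inbox.join()
```

Usage: chain stages and feed events into the first one, e.g. `parse = Stage("parse", str.upper, downstream=sink)`. The queue between stages is what allows each stage's worker count to be tuned to its own throughput, the core SEDA idea.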
  13. Aadhaar Data Stores (data consistency challenges)
     • Solr cluster, sharded (all enrolment records/documents; selected demographics only)
       – Low-latency indexed reads (documents per sec); low-latency random search (documents per sec)
     • MongoDB cluster, sharded (all enrolment records/documents; demographics + photo)
       – Low-latency indexed reads (documents per sec); high-latency random search (seconds per read)
     • MySQL, sharded: enrolment UID master DB (all generated UID records; demographics only, track & trace, enrolment status)
       – Low-latency indexed reads (milliseconds per read); high-latency random search (seconds per read)
     • HBase, ~20 region servers (all enrolment biometric templates)
       – High read throughput (MB per sec); low-to-medium latency reads (milliseconds per read)
     • HDFS, ~20 data nodes (all raw packets)
       – High read throughput (MB per sec); high-latency reads (seconds per read)
     • NFS over LUNs (all archived raw packets)
       – Moderate read throughput; high-latency reads (seconds per read)
  14. Aadhaar Architecture
     • Real-time monitoring using events
     • Work distribution using SEDA & messaging
     • Ability to scale within the JVM and across JVMs
     • Recovery through check-pointing
     • Synchronous HTTP-based authentication gateway
     • Protocol Buffers & XML payloads
     • Sharded clusters
     • Near real-time data delivery to the warehouse
     • Nightly data sets used to build dashboards, data marts and reports
  15. Deployment & Monitoring
  16. Learnings
     • Make everything API-based
     • Everything fails (hardware, software, network, storage)
       – The system must recover, retry transactions, and to a degree self-heal
     • Security and privacy should not be afterthoughts
     • Scalability does not come from a single product
     • Open scale-out is the only way to go
       – Heterogeneous, multi-vendor, commodity compute, growing in linear fashion; nothing else can adapt
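The simplest concrete form of "everything fails, so recover and retry" is a retry wrapper with exponential backoff and jitter. A hedged sketch (hypothetical helper, not UIDAI code):

```python
import random
import time

def with_retries(op, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run op(); on failure, retry with exponential backoff + jitter.

    Re-raises the last exception once the attempt budget is spent,
    so callers still see unrecoverable failures.
    """
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))  # jitter avoids retry storms
```

Backoff with jitter matters at this scale: thousands of clients retrying in lockstep after a shared failure would otherwise hammer the recovering component all over again.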
  17. Thank You!
     Dr. Pramod K Varma (pramod.uid@gmail.com, Twitter: @pramodkvarma)
     Regunath Balasubramanian (regunathb@gmail.com, Twitter: @RegunathB)