
Aadhaar at 5th_elephant_v3


Slides used in the talk at the Fifth Elephant Big Data conference by Dr. Pramod Varma and Regunath B

Published in: Technology, Education


  1. Big Data at Aadhaar. Dr. Pramod K Varma (Twitter: @pramodkvarma), Regunath B (regunathb@gmail.com, Twitter: @RegunathB)
  2. Aadhaar at a Glance
  3. India
     • 1.2 billion residents
       – 640,000 villages, ~60% live under $2/day
       – ~75% literacy, <3% pay Income Tax, <20% have banking access
       – ~800 million mobile connections, ~200-300 mn migrant workers
     • Govt. spends about $25-40 bn on direct subsidies
       – Residents have no standard identity document
       – Most programs are plagued with ghost and multiple identities, causing leakage of 30-40%
  4. Vision
     • Create a common “national identity” for every “resident”
       – Biometric-backed identity to eliminate duplicates
       – “Verifiable online identity” for portability
     • Applications ecosystem using open APIs
       – Aadhaar-enabled bank account and payment platform
       – Aadhaar-enabled electronic, paperless KYC
  5. Aadhaar System
     • Enrolment
       – One time in a person’s lifetime
       – Minimal demographics
       – Multi-modal biometrics (fingerprints, iris)
       – 12-digit unique Aadhaar number assigned
     • Authentication
       – Verify “you are who you claim to be”
       – Open API based
       – Multi-device, multi-factor, multi-modal
  6. Architecture Principles
     • Design for scale
       – Every component needs to scale to large volumes
       – Millions of transactions and billions of records
       – Accommodate failure and design for recovery
     • Open architecture
       – Use of open standards to ensure interoperability
       – Allow the ecosystem to build libraries to standard APIs
       – Use of open-source technologies wherever prudent
     • Security
       – End-to-end security of resident data
       – Use of open source
       – Data privacy handling (API and data anonymization)
  7. Designed for Scale
     • Horizontal scalability for all components
       – “Open scale-out” is the key
       – Distributed computing on commodity hardware
       – Distributed data store and data partitioning
       – Horizontal scaling of the data store is a must!
       – Use of the right data store for the right purpose
     • No single point of bottleneck for scaling
     • Asynchronous processing throughout the system
       – Allows loose coupling of the various components
       – Allows independent component-level scaling
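The data-partitioning idea above can be sketched as a stateless hash-based shard router (a minimal illustration; the slides do not describe UIDAI's actual partitioning scheme, and the key and shard count here are made up):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a record key to a shard by hashing it (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always lands on the same shard, so each shard can be
# stored, scaled, and recovered independently of the others.
shard = shard_for("999912345678", 10)
```

Because the routing is stateless, any node can compute the target shard locally, which avoids a single point of bottleneck for lookups.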
  8. Enrolment Volume
     • 600 to 800 million UIDs in 4 years
       – 1 million a day
       – 200+ trillion matches every day!
     • ~5 MB per resident
       – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
       – About 30 TB of I/O every day
       – Replication and backup across DCs of about 5+ TB of incremental data every day
       – Lifecycle updates and new enrolments will continue forever
     • Additional process data
       – Several million events on average moving through async channels (some persistent, some transient)
       – Needing complete update and insert guarantees across data stores
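The headline figures on this slide can be sanity-checked with back-of-the-envelope arithmetic (the 200-million gallery size is an assumption chosen to reproduce the slide's matches-per-day figure, and the replication comment is an inference, not from the slides):

```python
new_enrolments_per_day = 1_000_000       # 1 million a day
gallery_size = 200_000_000               # assumed residents already enrolled
matches_per_day = new_enrolments_per_day * gallery_size
print(f"{matches_per_day:.0e} biometric matches/day")     # 2e+14, i.e. 200+ trillion

raw_mb_per_resident = 5
total_pb = 800_000_000 * raw_mb_per_resident / 1e9        # MB -> PB
print(f"~{total_pb:.0f} PB of raw data per copy")         # ~4 PB; the slide's
# 10-15 PB presumably includes replicas and packaging overhead
```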
  9. Authentication Volume
     • 100+ million authentications per day (in a 10-hour window)
       – Possibly high variance between peak and average
       – Sub-second response
       – Guaranteed audits
     • Multi-DC architecture
       – All changes need to be propagated from enrolment data stores to all authentication sites
     • An authentication request is about 4 KB
       – 100 million authentications a day
       – 1 billion audit records in 10 days (30+ billion a year)
       – 4 TB of encrypted audit logs in 10 days
       – Audit writes must be guaranteed
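The 10-hour window and ~4 KB request size come straight from the slide; the arithmetic below just unpacks what they imply for sustained load and audit storage:

```python
auths_per_day = 100_000_000
window_seconds = 10 * 3600                    # traffic concentrated in ~10 hours
avg_tps = auths_per_day / window_seconds
print(round(avg_tps))                         # ~2778 requests/sec on average

audit_record_bytes = 4 * 1024                 # ~4 KB per audit record
audit_tb_10_days = auths_per_day * audit_record_bytes * 10 / 1e12
print(f"~{audit_tb_10_days:.1f} TB of audit logs in 10 days")   # ~4.1 TB
```

Note this is the average; with high peak-to-average variance the system must be provisioned well above ~2,800 TPS.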
  10. Open APIs
     • Aadhaar services
       – Core Authentication API and supporting Best Finger Detection and OTP Request APIs
       – New services being built on top
     • Aadhaar open standards for plug-n-play
       – Biometric Device API
       – Biometric SDK API
       – Biometric Identification System API
       – Transliteration API for Indian languages
  11. Implementation
  12. Patterns & Technologies
     • Principles
       – POJO-based application implementation
       – Light-weight, custom application container
       – HTTP gateway for APIs
     • Compute patterns
       – Data locality
       – Distributed compute (within an OS process and across processes)
     • Compute architectures
       – SEDA: Staged Event-Driven Architecture
       – Master-worker compute grid
     • Data access types
       – High-throughput streaming: biometric de-dupe, analytics
       – High volume, moderate latency: workflow, UID records
       – High volume, low latency: auth, demographic de-dupe, search (eAadhaar, KYC)
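The SEDA pattern named above can be sketched as stages connected by queues, each drained by its own worker pool (an illustrative toy, not the UIDAI container; the stage names and handlers are invented):

```python
import queue
import threading

def stage(inbox, outbox, handler, workers=2):
    """One SEDA stage: an input queue plus a small pool of worker threads."""
    def worker():
        while True:
            item = inbox.get()
            outbox.put(handler(item))
            inbox.task_done()
    for _ in range(workers):
        threading.Thread(target=worker, daemon=True).start()

# Two loosely coupled stages; each can be scaled (given more workers)
# independently of the other, which is the point of SEDA.
validate_q, dedupe_q, done_q = queue.Queue(), queue.Queue(), queue.Queue()
stage(validate_q, dedupe_q, lambda pkt: {**pkt, "valid": True})
stage(dedupe_q, done_q, lambda pkt: {**pkt, "unique": True}, workers=4)

for i in range(3):
    validate_q.put({"packet": i})
results = [done_q.get(timeout=5) for _ in range(3)]
```

The queues between stages also give the loose coupling and per-component scaling called out on slide 7.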
  13. Aadhaar Data Stores (data consistency challenges)
     • Solr cluster (sharded) – all enrolment records/documents, selected demographics only: low-latency indexed read (documents per sec), low-latency random search (documents per sec)
     • MongoDB cluster (sharded) – all enrolment records/documents, demographics + photo: low-latency indexed read (documents per sec), high-latency random search (seconds per read)
     • MySQL (sharded) – enrolment UID master DB, all UID-generated records (demographics only, track & trace, enrolment status): low-latency indexed read (milliseconds per read), high-latency random search (seconds per read)
     • HBase (~20 region servers) – all enrolment biometric templates: high read throughput (MB per sec), low-to-medium-latency read (milliseconds per read)
     • HDFS (~20 data nodes) – all raw packets: high read throughput (MB per sec), high-latency read (seconds per read)
     • NFS (LUN-backed) – all archived raw packets: moderate read throughput, high-latency read (seconds per read)
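Slide 7's "right data store for the right purpose" principle is what this slide materializes; a trivial dispatch table makes the idea concrete (the access-type names are invented for illustration, only the store list comes from the slide):

```python
# Store choice keyed by access pattern, following the characteristics above.
STORE_FOR = {
    "demographic_search": "solr",      # low-latency random search
    "full_record_read": "mongodb",     # indexed read, demographics + photo
    "track_and_trace": "mysql",        # UID master, indexed reads
    "biometric_templates": "hbase",    # high-throughput template reads
    "raw_packets": "hdfs",             # bulk, high-latency reads
    "archived_packets": "nfs",         # archival access
}

def store_for(access_type: str) -> str:
    """Pick the backing store for a given access pattern (illustrative)."""
    return STORE_FOR[access_type]
```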
  14. Aadhaar Architecture
     • Real-time monitoring using events
     • Work distribution using SEDA & messaging
     • Ability to scale within a JVM and across JVMs
     • Recovery through check-pointing
     • Sync HTTP-based auth gateway
     • Protocol Buffers & XML payloads
     • Sharded clusters
     • Near real-time data delivery to the warehouse
     • Nightly data-sets used to build dashboards, data marts and reports
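The check-pointing recovery mentioned above can be sketched as persisting a processed offset so that a restarted worker resumes where it left off instead of reprocessing everything (a file-based simplification; the slides do not say how checkpoints are actually stored):

```python
import json
import os
import tempfile

def process_batch(items, ckpt_path):
    """Process items, checkpointing progress so a restart resumes, not repeats."""
    offset = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            offset = json.load(f)["offset"]
    for i in range(offset, len(items)):
        # ... handle items[i] here ...
        with open(ckpt_path, "w") as f:
            json.dump({"offset": i + 1}, f)   # checkpoint after each item
    return len(items) - offset                # items processed this run

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first_run = process_batch(["a", "b", "c"], ckpt)   # processes all 3
second_run = process_batch(["a", "b", "c"], ckpt)  # "restart": nothing to redo
```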
  15. Deployment & Monitoring
  16. Learnings
     • Make everything API based
     • Everything fails (hardware, software, network, storage)
       – The system must recover, retry transactions, and to some extent self-heal
     • Security and privacy should not be an afterthought
     • Scalability does not come from one product
     • Open scale-out is the only way to go
       – Heterogeneous, multi-vendor, commodity compute, growing in linear fashion. Nothing else can adapt!
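The "recover, retry, self-heal" learning can be sketched as a small retry wrapper with exponential backoff (illustrative; the attempt count, delays, and the `flaky` operation are all made up):

```python
import time

def with_retries(op, attempts=3, base_delay=0.01):
    """Run op(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                                # out of attempts: give up
            time.sleep(base_delay * 2 ** attempt)    # back off, then retry

calls = {"n": 0}
def flaky():
    """A stand-in operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky)   # succeeds on the third attempt
```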
  17. Thank You! Dr. Pramod K Varma (Twitter: @pramodkvarma), Regunath B (regunathb@gmail.com, Twitter: @RegunathB)