Search data store for the world's largest biometric identity system

The Aadhaar application stores and searches the data of 200M residents, comprising personal and biometric information. A user can search for records by various criteria, such as a resident's personal or system information. The session discusses the approach and challenges involved in creating a data store that handles 2M inserts/updates and 10M reads per day. You will learn details of storing and handling 16 TB of data spread over 8 shards for high availability, and the approach to scaling it to hold information for a total of 1.2 billion residents in such a way that it can be processed for analytics.

  1. Search data store for the world's largest biometric identity system
     Regunath Balasubramanian (regunathb@gmail.com, twitter @regunathb)
     Shashikant Soni (soni.shashikant@gmail.com)
     CONFIDENTIAL: For limited circulation only
  2. India
     ● 1.2 billion residents
       ● 640,000 villages; ~60% live on under $2/day
       ● ~75% literacy; <3% pay income tax; <20% have access to banking
       ● ~800 million mobile users; ~200-300 million migrant workers
     ● Government spends about $25-40B on direct subsidies
       ● Residents have no standard identity document
       ● Most programs are plagued by ghost and multiple identities, causing leakage of 30-40%
  3. Aadhaar
     ● Create a common ‘national identity’ for every ‘resident’
       ● Biometric-backed identity to eliminate duplicates
       ● ‘Verifiable online identity’ for portability
     ● Applications ecosystem using open APIs
       ● Aadhaar-enabled bank account and payment platform
       ● Aadhaar-enabled electronic, paperless KYC (Know Your Customer)
  4. Search Requirements
     ● Multi-attribute query like:
       name contains ‘regunath’ AND city = ‘bangalore’ AND address contains ‘J P Nagar’ AND YearOfBirth = ……
     ● Search 1.2B residents' data with photo and history
       ● Average record size: 35 KB
     ● Response times in milliseconds
     ● Open scale-out
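
The multi-attribute query on this slide can be read as a conjunction of "contains" and "equals" predicates. The following is a minimal in-memory sketch of that semantics only, not the deck's Solr-backed implementation. The sample records, the helper names `contains`, `equals` and `search`, and the case-insensitive matching are all assumptions made for illustration. The YearOfBirth clause is left out because the slide elides its value.

```javascript
// Invented sample records; field names follow the example query on the slide.
const residents = [
  { name: "Regunath B", city: "Bangalore", address: "12 J P Nagar" },
  { name: "Regunath K", city: "Mysore", address: "4 Main Road" },
  { name: "Shashikant S", city: "Bangalore", address: "7 J P Nagar" },
];

// Hypothetical predicate builders: each returns a function that tests one record.
// Matching is case-insensitive here, which is an assumption, not a stated requirement.
const contains = (field, text) => (rec) =>
  String(rec[field] ?? "").toLowerCase().includes(text.toLowerCase());
const equals = (field, value) => (rec) =>
  String(rec[field] ?? "").toLowerCase() === String(value).toLowerCase();

// A record is a hit only if every predicate holds (logical AND).
const search = (records, predicates) =>
  records.filter((rec) => predicates.every((p) => p(rec)));

const hits = search(residents, [
  contains("name", "regunath"),
  equals("city", "bangalore"),
  contains("address", "J P Nagar"),
]);

console.log(hits.map((r) => r.name));
```

A linear scan like this would not meet millisecond response times at 1.2B records, which is why the deck moves search into dedicated indexes.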
  5. Why MongoDB
     ● Auto-sharding
     ● Replication
     ● Failover
       … Essentially an AP (slaveOk) data store in CAP parlance
     ● Evolving schema
     ● Map-Reduce for analysis
     ● Full text search
       ● Compound (or) multi-keys
  6. Design
     [Diagram: Client App -> Search API. Step 1: the Search API queries the Solr indexes (Name: ‘abcde’, Address: ‘some place’, year: 1980) with Name=‘abcde’, Address=‘some place’, Year=1980. Step 2: it reads the document { _id:123456789, name: ‘abcde’, year:1980, ….. } from MongoDB.]
     ● Read/Search
       ● Sharded Solr indexes for search
       ● Keyed document read from MongoDB
     ● Write
       ● Eventual consistency (across data sources) driven by the application
       ● Composite MongoDB-Solr app persistence handler
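
The read and write paths on this slide can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not the authors' handler: `DocumentStore` stands in for MongoDB, `SearchIndex` for Solr, and `CompositeHandler`, `flushIndex` and the pending queue are hypothetical names invented here to show how an application might drive eventual consistency between the two stores.

```javascript
// Stand-in for MongoDB: documents keyed by _id.
class DocumentStore {
  constructor() { this.docs = new Map(); }
  put(doc) { this.docs.set(doc._id, doc); }
  get(id) { return this.docs.get(id); }
}

// Stand-in for Solr: holds only the searchable fields and returns matching ids.
class SearchIndex {
  constructor() { this.entries = new Map(); }
  index(id, fields) { this.entries.set(id, fields); }
  query(criteria) {
    const ids = [];
    for (const [id, fields] of this.entries) {
      const matches = Object.entries(criteria).every(([k, v]) => fields[k] === v);
      if (matches) ids.push(id);
    }
    return ids;
  }
}

// Hypothetical composite handler: writes go to the document store first,
// and the index catches up later (eventual consistency driven by the app).
class CompositeHandler {
  constructor(store, index) {
    this.store = store;
    this.index = index;
    this.pending = []; // ids written to the store but not yet indexed
  }
  save(doc) {
    this.store.put(doc);
    this.pending.push(doc._id);
  }
  flushIndex() {
    while (this.pending.length > 0) {
      const id = this.pending.shift();
      const { name, year, address } = this.store.get(id);
      this.index.index(id, { name, year, address });
    }
  }
  // Step 1: search the index for ids. Step 2: keyed read of each document.
  search(criteria) {
    return this.index.query(criteria).map((id) => this.store.get(id));
  }
}

const handler = new CompositeHandler(new DocumentStore(), new SearchIndex());
handler.save({ _id: 123456789, name: "abcde", year: 1980, address: "some place" });

const beforeFlush = handler.search({ name: "abcde" }); // index has not caught up yet
handler.flushIndex();
const afterFlush = handler.search({ name: "abcde", year: 1980 });

console.log(beforeFlush.length, afterFlush.length);
```

In this sketch the document is readable by key as soon as it is saved, but it becomes searchable only after the index is flushed. That gap is the window of inconsistency the application has to tolerate.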
  7. Implementation and Deployment
     ● Start: 4M records in 2 shards
       Current: 250M records in 8 shards (8 x ~2 TB x 3 replicas)
     ● Performance, Reliability & Durability
       ● SlaveOk
       ● getLastError, Write Concern: availability vs durability
         ● j = journaling
         ● w = nodes-to-write
     ● Replica-sets / Shards - how?
       [Diagram: replica sets RS 1 and RS 2, each with three members (legend: Primary, Secondary, Arbiter); config servers Config 1, Config 2, Config 3; three Routers]
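
The availability-versus-durability trade-off behind `w` and `j` can be illustrated with a toy model. This is a deliberately simplified sketch and not MongoDB's actual acknowledgement logic, whose details vary by version. It assumes a member counts toward `w` only once it has applied the write and, when `j` is set, only once it has also journaled it. The function `isAcknowledged` and the member records are hypothetical.

```javascript
// Toy model: decide whether a write would be acknowledged for a given write concern.
// Assumption: with j set, every counted member must have journaled the write.
function isAcknowledged(members, { w = 1, j = false } = {}) {
  const confirming = members.filter((m) => m.applied && (!j || m.journaled));
  return confirming.length >= w;
}

// Hypothetical snapshot of a three-member replica set just after a write.
const replicaSet = [
  { name: "primary", applied: true, journaled: true },
  { name: "secondary-1", applied: true, journaled: false },
  { name: "secondary-2", applied: false, journaled: false }, // lagging member
];

const results = {
  fast: isAcknowledged(replicaSet, { w: 1 }),
  replicated: isAcknowledged(replicaSet, { w: 2 }),
  durable: isAcknowledged(replicaSet, { w: 2, j: true }),
  all: isAcknowledged(replicaSet, { w: 3 }),
};

console.log(results);
```

Under this model, stricter settings make a write safer once it is acknowledged, but they also make it more likely that the client waits or fails while a member is slow or down.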
  8. Monitoring and Troubleshooting
     ● Monitoring tools evaluated
       ● MMS
       ● munin
     ● Manual approach: a daily ritual
       ● RS, DB, config, router: health and stats
     ● Problem analysis stats
       ● mongostat, iostat, currentOp, logs
       ● Client connections
     ● Stats for storage and shard addition
       ● Data file size
       ● Shard data distribution
       ● Replication
  9. Key Learnings on MongoDB
     ● Indexing 32 fields
       ● Compound indexes
       ● Multi-key indexes
         ● {…"indexes" : [{ "email":"john.doe@email.com", "phone":"123456789" }] }
         ● db.coll.find({ "indexes.email" : "john.doe@email.com" })
       ● Indexes use b-tree
       ● Many fields to index
       ● Performs well up to 1-2M documents
       ● Best if index fits in memory
     ● Data replication, RS failover
       ● Rollback when RS goes out of sync
         ● Manual restore (physical data copy)
         ● Restarting a very stale node
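
The multi-key lookup on this slide, `db.coll.find({ "indexes.email" : ... })`, relies on the index holding one entry per array element. The following is a rough in-memory sketch of that idea. It uses a `Map` where MongoDB uses a b-tree, and `buildMultikeyIndex` and the sample documents (other than the slide's john.doe entry) are hypothetical.

```javascript
// Build a value -> set-of-_ids index over one sub-field of an array field.
// Each array element contributes its own entry, which is what makes it "multi-key".
function buildMultikeyIndex(docs, arrayField, subField) {
  const index = new Map();
  for (const doc of docs) {
    for (const element of doc[arrayField] ?? []) {
      const key = element[subField];
      if (key === undefined) continue; // element lacks the indexed sub-field
      if (!index.has(key)) index.set(key, new Set());
      index.get(key).add(doc._id);
    }
  }
  return index;
}

// Invented documents shaped like the slide's example.
const docs = [
  { _id: 1, indexes: [{ email: "john.doe@email.com", phone: "123456789" }] },
  { _id: 2, indexes: [{ email: "jane.roe@email.com" }, { email: "john.doe@email.com" }] },
  { _id: 3 }, // no "indexes" array, so it contributes no entries
];

const emailIndex = buildMultikeyIndex(docs, "indexes", "email");

// Equivalent in spirit to find({ "indexes.email": "john.doe@email.com" }).
const matches = [...(emailIndex.get("john.doe@email.com") ?? [])];

console.log(matches);
```

Because every array element adds an entry, an index like this grows faster than the document count, which is consistent with the slide's note that indexes work best when they fit in memory.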
  10. Questions?
      Regunath Balasubramanian (regunathb@gmail.com, twitter @regunathb)
      Shashikant Soni (soni.shashikant@gmail.com)
