Biometric Databases and Hadoop__HadoopSummit2010

Hadoop Summit 2010 - Application Track
Biometric Databases and Hadoop
Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton

Transcript

  • 1. Hadoop for Large-scale Biometric Databases
    Jason Trost
    Cloud Computing Team
    Booz | Allen | Hamilton
  • 2. Session Agenda
    This session shows the application of Hadoop and a large-scale, low-latency distributed fuzzy matching database to Biometrics
    Background – what you need to know about Biometrics
    The Problem – Big Data and unordered fuzzy matching
    A Solution – Hadoop Applications for Biometrics
  • 3. Key Takeaways from this Session
    Searching large-scale Biometric Databases is a hard problem
    Hadoop is a potential solution to this problem
    Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even low latency searching
  • 4. Introduction to Biometrics
    Biometrics:
    The science of establishing the identity of an individual based on the physical, chemical, or behavioral attributes of the person*
    Modality:
    Physical or behavioral characteristics of an individual used to establish identity*
    Template:
    A symbolic or numeric representation of a modality optimized for storage and/or matching
    Example modalities: Iris, Face, Fingerprint, Palm Print, Gait, Hand Geometry, Signature, Ear, Voice, Keystroke Pattern, Facial Thermogram, Vein Pattern
    * Handbook of Biometrics. A. Jain, P. Flynn, A. Ross.
  • 5. Why are Biometrics Important?
    Enables identifying/authenticating individuals based on “credentials” that are hard to forge
    It has many useful applications where establishing identity is important
    Banks and Financial Services companies are using biometrics to prevent banking and identity fraud
    National governments are creating biometric databases for law enforcement & security reasons:
    Assist with criminal investigations (e.g. crime scene fingerprints)
    Identify individuals entering and leaving the country
    Surveillance
  • Biometric Database Operations
    Enrollment – Add an identity and its associated biometric data to the database if the identity does not already exist
    Verification – Look up the biometric template for a single individual and determine whether it matches a captured biometric measurement (1-to-1 match)
    Identification – Determine the identity of an individual given some biometric measurements (1-to-N match)
  • 9. Enrollment: Adding New Identities and Biometrics Data to the Database
    Collect biographic information from an individual such as name, address, SSN, etc
    Capture biometric data in raw form (e.g. high resolution images)
    Transform raw biometric data into encoded biometric template (feature vector)
    Store all this information in the biometrics database
  • 10. Verification: One-to-one Matching
    Look up the biometric template for a particular individual
    Verify that the stored template and the recently captured template match
    Fuzzy matching is used for matching the biometric templates
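The one-to-one flow above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the class `VerificationSketch`, the in-memory map standing in for the biometrics database, the cosine-similarity matcher, and the threshold are all assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of 1-to-1 verification; every name here is an assumption.
final class VerificationSketch {
    // Stand-in for the biometrics database: identity id -> enrolled template (feature vector).
    private final Map<String, double[]> enrolled = new HashMap<>();

    // Enrollment only adds an identity that does not already exist.
    void enroll(String id, double[] template) {
        enrolled.putIfAbsent(id, template.clone());
    }

    // Cosine similarity in [-1, 1], used here only as an example fuzzy matcher.
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("templates must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Look up the stored template for the claimed identity and
    // fuzzy-match the freshly captured template against it.
    boolean verify(String claimedId, double[] captured, double threshold) {
        double[] stored = enrolled.get(claimedId);
        return stored != null && similarity(stored, captured) >= threshold;
    }
}
```

Verification touches a single stored template, so its cost does not depend on the size of the database; only the lookup by identity does.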
  • 11. Identification: One-to-Many Searching
    Capture some number of raw biometric features and convert them into biometric templates
    Perform fuzzy matching against a large number of stored biometric templates to determine the identity
    If latency is not an issue, this is relatively straightforward, especially in MapReduce
    This is a hard problem for low-latency applications, and it becomes harder as these databases grow
    There is a speed/accuracy tradeoff
    The search space can be reduced using clustering techniques, but this only goes so far
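A brute-force version of the one-to-many search might look like the sketch below. It is illustrative only: `IdentificationSketch`, the distance-based similarity, and the threshold are assumptions, and the gallery is a plain in-memory map. The point is that every stored template is scored, so the cost grows linearly with the number of identities.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of 1-to-N identification by exhaustive scan.
final class IdentificationSketch {
    // Similarity in (0, 1]: 1.0 for identical vectors, falling toward 0 as
    // Euclidean distance grows. Assumes both vectors have the same length.
    static double similarity(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    // Score the probe against every enrolled template and return the id of
    // the best match whose score reaches the threshold, if there is one.
    static Optional<String> identify(Map<String, double[]> gallery,
                                     double[] probe, double threshold) {
        String bestId = null;
        double bestScore = threshold;
        for (Map.Entry<String, double[]> entry : gallery.entrySet()) {
            double score = similarity(probe, entry.getValue());
            if (score >= bestScore) {
                bestScore = score;
                bestId = entry.getKey();
            }
        }
        return Optional.ofNullable(bestId);
    }
}
```

Raising the threshold trades missed matches for fewer false matches, which is one face of the speed/accuracy tradeoff noted on the slide.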
  • 12. What is Fuzzy matching?
    Fuzzy matching is an operation performed on two objects that determines how similar the objects are to each other
    Typically this operation produces a numeric similarity score
    Necessary when data collected from a sensor is noisy and matching needs to be very accurate
    Almost all biometric matching algorithms perform some sort of fuzzy matching:
    Elastic Bunch Graph Matching – face recognition algorithm
    BOZORTH3 – minutiae-based fingerprint matching algorithm
    IrisCode - iris matching algorithm
    Other Examples:
    Image comparison
    Audio comparison
    Video comparison
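As a concrete example of an operation that turns two objects into a numeric similarity score, the sketch below compares fixed-length bit templates by the fraction of bits on which they agree. Iris codes are commonly compared using a normalized Hamming distance, but this toy version is not the IrisCode algorithm; the class name and the `long[]` template layout are assumptions.

```java
// Hypothetical sketch of a fuzzy matcher over fixed-length bit templates.
final class HammingMatcherSketch {
    // Returns the fraction of bit positions on which the two templates agree:
    // 1.0 means identical, 0.0 means every bit differs.
    static double similarity(long[] a, long[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("templates must be non-empty and equal in length");
        }
        int differingBits = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR leaves a 1 wherever the templates disagree.
            differingBits += Long.bitCount(a[i] ^ b[i]);
        }
        return 1.0 - differingBits / (64.0 * a.length);
    }
}
```

A noisy re-capture flips some bits, so the score for the same person is rarely exactly 1.0; a threshold on the score decides what counts as a match.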
  • 13. Why Fuzzy Matching?
    Biometric data is inherently noisy and dirty
    Conditions are not exactly the same when the original biometric data was captured (Enrollment) and when a new reading occurs (Identification)
    Different types of cameras and sensors made by different companies
    Partial or smudged fingerprints (e.g. crime scene)
    Changes in skin tone, facial hair, makeup
    Different lighting conditions
    Aging and skin damage
    Weight gain, Weight loss
    Injury
    Derived from http://www.flickr.com/photos/glennji/3558118429/. Licensed under Creative Commons
  • 14. Existing Large-scale Biometric Databases
    US Visitor & Immigrant Status Indicator Technology (US-VISIT)*
    International travelers’ biometrics (fingerprint and face)
    Collected at US ports of entry, Immigration Services, and State Department
    Used to support the Department of Homeland Security's mission
    FBI Integrated Automated Fingerprint Identification System (IAFIS)**
    Used to solve and prevent crime and catch criminals and terrorists
    Includes fingerprints, criminal histories, mug shots, scars and tattoo photos, physical characteristics like height, weight, and hair and eye color, and aliases
    AllTrust Networks Paycheck Secure System***
    Uses fingerprints to support secure check cashing
    Designed to stop fraud and speed check cashing
    Plus many more
    * One Team, One Mission, Securing our Homeland. US DHS.
    ** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm
    *** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
  • 16. Growth of Biometric Databases
    Combined U.S. government biometric databases are expected to grow to hold billions of identities
    The DHS’s US-VISIT program has the world’s largest and fastest biometric database (called IDENT) with over 110 million identities and roughly 145,000 identities enrolled or verified daily*
    From the FBI’s Integrated Automated Fingerprint Identification System (IAFIS) alone, there are 66.5 million identities with 8,000-10,000 more subjects added each day**
    India is reportedly creating a biometric database to hold the fingerprints and face images for each of its 1.2 billion citizens as part of its Unique Identification Project***
    European Union’s Biometric Matching System (EU-BMS) is expected to hold biometric information of 70 Million people to support visa applications, border control, and immigration****
    AllTrust Networks Paycheck Secure system has enrolled over 6 Million users and has performed over 70 Million transactions*****
    * US-VISIT: The world’s largest biometric application. William Graves.
    ** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm
    *** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/
    **** http://www.findbiometrics.com/articles/i/5220/
    ***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
  • 17. Biometric Databases are a Big Data Problem
    Large scale operations
    Searching and storing 100 Million to 1 Billion Identities
    Multiple biometric templates and raw files per identity for multimodal matching (Fingerprints, Faces, and Iris)
    Typically, new raw files and templates are stored after each Verification and Identification operation because the biometrics readings change over time
    Raw Images:
    (500M Identities x 16KB-300KB* x 10-20) = 1-2 PB
    Biometric Templates:
    (500M Identities x 256b-3KB** x 10-20) = 2-27 TB
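The two estimates above are each a product of three ranges, and the extremes can be checked with the sketch below. It assumes decimal units (1 KB = 1,000 bytes) and reads `256b` as 256 bytes; the slide states neither, so both are assumptions, as is the class name.

```java
// Hypothetical back-of-the-envelope check of the storage estimates.
final class StorageEstimateSketch {
    static final long KB = 1_000L;                 // assumption: decimal kilobyte
    static final long TB = 1_000_000_000_000L;     // assumption: decimal terabyte

    // identities x bytes per item x items stored per identity, in bytes.
    // multiplyExact throws instead of silently overflowing.
    static long totalBytes(long identities, long bytesPerItem, long itemsPerIdentity) {
        return Math.multiplyExact(Math.multiplyExact(identities, bytesPerItem), itemsPerIdentity);
    }

    // Convenience conversion for printing results.
    static double toTerabytes(long bytes) {
        return bytes / (double) TB;
    }
}
```

Under those assumptions, `totalBytes(500_000_000L, 16 * KB, 10)` is 80 TB and `totalBytes(500_000_000L, 300 * KB, 20)` is 3,000 TB (3 PB) for raw images, while templates range from 1.28 TB (256 bytes x 10) to 30 TB (3 KB x 20). The slide's rounded figures of 1-2 PB and 2-27 TB fall inside these extremes.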
  • 18. Biometric Databases Must Perform Fuzzy Matching
    Fuzzy matching techniques must be used because the data is noisy and “dirty”
    Most applications require low latency fuzzy match searches in order to be useful
    The objects being searched for cannot be ordered effectively to speed up searches
    Clustering techniques can be used to reduce the search space, but this only goes so far
    Fuzzy match searches are expensive and typically a large number of objects need to be searched to find a match
  • 24. Hadoop and Biometric Databases
    HDFS as file storage for petabytes worth of images
    Redundancy
    Distribution
    Opens the doors to storing more and more raw images and at higher resolutions
    Mahout/MapReduce can be used for indexing and clustering biometric templates to improve overall search speeds
    MapReduce can be used for improving feature selection by analyzing the entire database to select features that are most effective in distinguishing identities
    Easy to test and deploy new algorithms against all data at scale
    N-to-N matching search (a special type of Identification search) to cleanse the database and find people trying to circumvent the system (identity fraud, etc.)
    MapReduce can be used for batched searching where latency doesn’t matter
    What about low latency searching…?
  • Fuzzy Table: A Solution to Large-scale, Low Latency, Fuzzy Matching
    Fuzzy Table is a large scale, low latency, distributed fuzzy matching database
    It enables fast parallel searches against keys that cannot be effectively ordered and that require fuzzy matching such as biometrics identification, large scale image search, large scale audio search, etc
    It provides the benefits of Hadoop against problems that require large scale low latency fuzzy matching
    Horizontal scalability over commodity hardware
    Distributed and parallel computation
    High reliability and redundancy
    Enabled by Mahout and MapReduce for binning/clustering, re-encoding, and other bulk data operations
    We have found no other solution with these characteristics
  • 30. Fuzzy Table Architecture
  • 31. Fuzzy Table: Bulk Data Processing Component
    Mahout’s Canopy Clustering and K-means Clustering are used to partition the data into clusters (referred to as bins) in order to reduce the search space
    This makes searching faster because only a small subset of the data must be processed
    This concept is based on work done in academia*
    The centroids from K-means clustering are used to create a “Bin classifier” that is used to determine the best bins to search for a given key
    {Key, Value} records are stored as SequenceFiles in HDFS, and the files are laid out to spread these records across the cluster for optimal parallel searching
    MapReduce is used for all other bulk or batch data processing including:
    Re-encoding the raw files into feature vectors
    Performing large-scale feature evaluation to improve clustering
    Batch fuzzy match searching
    * Efficient Search and Retrieval in Biometric Databases. Amit Mhatre, Srinivas Palla, Sharat Chikkerur, Venu Govindaraju.
    * Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot.
  • 34. Bulk Clustering and Real-time Classification
    This makes searching for keys faster because only a small subset of the entire dataset needs to be processed using fuzzy matching
    The classifier determines which Bins need to be searched in order to find the most likely matching keys
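One plausible reading of a classifier built from cluster centroids is a nearest-centroid lookup: rank the bins by how close their centroid is to the query key and search the closest few. The sketch below shows that idea only; the class `BinClassifierSketch`, the squared Euclidean distance, and the choice to return the `n` nearest bins are assumptions, not details given in the slides.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical nearest-centroid bin classifier; bin id = centroid index.
final class BinClassifierSketch {
    private final double[][] centroids;

    BinClassifierSketch(double[][] centroids) {
        this.centroids = centroids;
    }

    // Squared Euclidean distance; assumes both vectors have the same length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the ids of the n bins whose centroids are closest to the key,
    // nearest first. Only these bins would then be fuzzy-searched.
    int[] bestBins(double[] key, int n) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer i) -> squaredDistance(key, centroids[i])));
        int[] result = new int[Math.min(n, order.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

Searching more than one bin guards against a noisy key landing just across a cluster boundary, at the cost of scanning more records.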
  • 35. Fuzzy Table: Data Storage and Bins
    Bins are represented as directories in HDFS containing one or more chunk files (stored as SequenceFiles): /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
    Chunk files contain many {Key, Value} pairs and are a small multiple of the HDFS block size
    Chunk files are distributed uniformly and randomly across the Data Servers in the cluster
    This ensures that the bins are striped across the cluster for optimal parallel searching
    Also, chunk files are replicated across the Data Servers using the replication mechanism in HDFS
    Data Servers only search through chunk files that reside locally and results are returned in real-time as soon as a match is found
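The directory layout in the example path can be reproduced with a small formatter. This sketch infers the six-digit zero padding and the `_table_`, `_bin_`, and `_chunk_` prefixes from the single path shown on the slide; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper that builds chunk-file paths in the layout shown on the slide.
final class ChunkPathSketch {
    // Example: chunkPath("fingerprints", 1, 1)
    // yields /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
    static String chunkPath(String table, int bin, int chunk) {
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT,
                "/fuzzytable/_table_%s/_bin_%06d/_chunk_%06d", table, bin, chunk);
    }
}
```

Because each bin is a directory, listing one directory is enough to find every chunk file that a search of that bin must cover.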
  • 36. Fuzzy Table: Low Latency Fuzzy Matching Component
    The low latency component consists of three main parts
    Client – submit queries for Keys and get back {Key, Value} pairs
    Master Server – serve metadata about which Data Servers host which bins
    Data Servers – Actually perform fuzzy matching searches
    Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records
    double score = fuzzyMatcher.match(key, storedRec.getKey());
    if(score >= threshold)
    return storedRec;
    Fuzzy matching searches are performed in parallel across many Data Servers
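To show how the per-record test in the snippet above fits into a parallel search, the sketch below simulates each Data Server as a thread scanning its own local list of records. Everything beyond the snippet's comparison is an assumption: the `FuzzyMatcher` interface, the `StoredRecord` type with `long` keys, and the thread pool standing in for separate machines. The Master Server lookup that tells the client which Data Servers host the relevant bins is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical simulation of parallel fuzzy search across Data Servers.
final class DataServerSearchSketch {
    interface FuzzyMatcher {
        double match(long key, long storedKey);
    }

    static final class StoredRecord {
        final long key;
        final String value;
        StoredRecord(long key, String value) { this.key = key; this.value = value; }
        long getKey() { return key; }
    }

    // One Data Server scanning records that reside locally.
    static List<StoredRecord> searchLocal(List<StoredRecord> chunk, long key,
                                          FuzzyMatcher matcher, double threshold) {
        List<StoredRecord> hits = new ArrayList<>();
        for (StoredRecord storedRec : chunk) {
            double score = matcher.match(key, storedRec.getKey());
            if (score >= threshold) {
                hits.add(storedRec);
            }
        }
        return hits;
    }

    // Client side: query every server at once and gather the matches.
    static List<StoredRecord> searchAll(List<List<StoredRecord>> servers, long key,
                                        FuzzyMatcher matcher, double threshold) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, servers.size()));
        try {
            List<Future<List<StoredRecord>>> pending = new ArrayList<>();
            for (List<StoredRecord> chunk : servers) {
                Callable<List<StoredRecord>> task = () -> searchLocal(chunk, key, matcher, threshold);
                pending.add(pool.submit(task));
            }
            List<StoredRecord> hits = new ArrayList<>();
            for (Future<List<StoredRecord>> result : pending) {
                hits.addAll(result.get());
            }
            return hits;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("search failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

This version waits for every server before returning; the slides describe results being returned as soon as a match is found, which would need a streaming or callback design instead.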
  • 37-44. Fuzzy Table Query (eight diagram slides)
  • 45. Future Work
    Fuzzy Table is still a research prototype, but we plan to keep building it out to support this biometrics work
    Locality Sensitive Hashing instead of K-means clustering for binning and search space reduction
    Distributed/Replicated master servers (and Zookeeper integration)
    Real-time ingest
    Hopefully we will have performance/scalability metrics as well as more features and example applications to share within the next few months
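For readers unfamiliar with Locality Sensitive Hashing, the sketch below shows one simple member of that family, bit sampling for Hamming distance: a bin id is formed from a fixed random subset of a key's bits, so keys that differ in only a few bits often share a bin. It illustrates the general technique only and is not the presenters' planned design; the class name, 64-bit keys, and seed handling are assumptions.

```java
import java.util.Random;

// Hypothetical bit-sampling LSH for 64-bit keys under Hamming distance.
final class BitSamplingLshSketch {
    private final int[] positions;

    // numBits sampled positions give 2^numBits possible bins (numBits must be 1..31).
    BitSamplingLshSketch(int numBits, long seed) {
        if (numBits < 1 || numBits > 31) {
            throw new IllegalArgumentException("numBits must be between 1 and 31");
        }
        Random rng = new Random(seed);
        positions = new int[numBits];
        for (int i = 0; i < numBits; i++) {
            positions[i] = rng.nextInt(64);
        }
    }

    // Bin id = the sampled bits of the key packed into an int.
    int bin(long key) {
        int id = 0;
        for (int p : positions) {
            id = (id << 1) | (int) ((key >>> p) & 1L);
        }
        return id;
    }
}
```

Unlike K-means binning, this needs no training pass over the data, but several independent hash tables are usually required to keep the miss rate low.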
  • 46. Conclusion
    Searching large-scale Biometric Databases is a hard problem
    Hadoop is a potential solution to this problem
    We used MapReduce for bulk processing to enable distributed low latency fuzzy matching over HDFS
    Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even for low latency searching
  • 47. Contributors
    Cloud Computing Team
    Jason Trost
    Lalit Kapoor
    Daniel Neuberger
    Michael Beck
    Edmund Kohlwey
    Josh Sullivan
    Identity Management/Biometrics Team
    Abel Sussman
    Eric Karlinsky
    Deanna Walters
    Joel Rader
    Allen Wight
  • 48. Questions?
  • 49. Contact Information – Cloud Computing Team
    Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701
    Joshua Sullivan, Senior Associate, (301)543-4611, sullivan_joshua@bah.com
    Lalit Kapoor, Senior Consultant, (301)821-8000, kapoor_lalit@bah.com
    Michael Beck, Senior Consultant, (301)821-8000
    Daniel Neuberger, Senior Consultant, (301)821-8000
    Jason Trost, Associate, (301)543-4400, trost_jason@bah.com
    Edmund Kohlwey, Consultant, (301)617-3523, kohlwey_edmund@bah.com
  • 50. Contact Information – Identity Management Team
    Booz Allen Hamilton Inc., 13200 Woodland Park Rd., Herndon, VA 20171
    Joel Rader, Identity Analyst, (703) 984-0312, rader_joel@bah.com
    Eric Karlinsky, Identity Analyst, (703) 984-3532, Karlinsky_eric@bah.com
    Deanna Walters, Biometrics Analyst, (703) 984-1982, walters_deanna@bah.com
    Allen Wight, Biometrics Analyst, (703) 984-1978, wight_allen@bah.com
    Abel Sussman, Biometrics Subject Matter Expert, (703) 984-7663, sussman_abel@bah.com
