Biometric Databases and Hadoop__HadoopSummit2010
 

Hadoop Summit 2010 - Application Track
Biometric Databases and Hadoop
Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton


Usage Rights

© All Rights Reserved


    Biometric Databases and Hadoop__HadoopSummit2010 Presentation Transcript

    • Hadoop for Large-scale Biometric Databases
      Jason Trost
      Cloud Computing Team
      Booz | Allen | Hamilton
    • Session Agenda
      This session shows the application of Hadoop and a large-scale, low-latency distributed fuzzy matching database to Biometrics
      Background - what you need to know about Biometrics
      The Problem - Big Data and unordered fuzzy matching
      A Solution - Hadoop Applications for Biometrics
    • Key Takeaways from this Session
      Searching large-scale Biometric Databases is a hard problem
      Hadoop is a potential solution to this problem
      Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even low latency searching
    • Introduction to Biometrics
      Biometrics:
      The science of establishing the identity of an individual based on the physical, chemical, or behavioral attributes of the person *
      Modality:
      Physical or behavioral characteristics of an individual used to establish identity *
      Template:
      A symbolic or numeric representation of a modality optimized for storage and/or matching
      Example modalities: Iris, Face, Fingerprint, Palm Print, Gait, Hand Geometry, Signature, Ear, Voice, Keystroke Pattern, Facial Thermogram, Vein Pattern
      * Handbook of Biometrics. A. Jain, P. Flynn, A. Ross.
    • Why are Biometrics Important?
      • Biometrics enable identifying and authenticating individuals based on “credentials” that are hard to forge
      • They have many useful applications where establishing identity is important
      • Banks and financial services companies are using biometrics to prevent banking and identity fraud
      • National governments are creating biometric databases for law enforcement and security reasons:
        Assist with criminal investigations (e.g. crime scene fingerprints)
        Identify individuals entering and leaving the country
        Surveillance
    • Biometric Database Operations
      Enrollment - Add an identity and its associated biometric data to the database if the identity does not already exist
      Verification - Look up the biometric template for a single individual and determine whether it matches a captured biometric measurement (1-to-1 match)
      Identification - Determine the identity of an individual given some biometric measurements (1-to-N match)
    • Enrollment: Adding New Identities and Biometrics Data to the Database
      Collect biographic information from an individual, such as name, address, SSN, etc.
      Capture biometric data in raw form (e.g. high resolution images)
      Transform raw biometric data into encoded biometric template (feature vector)
      Store all this information in the biometrics database
    • Verification: One-to-one Matching
      Look up the biometric template for a particular individual
      Verify that the stored template and the recently captured template match
      Fuzzy matching is used for matching the biometric templates
    • Identification: One-to-Many Searching
      Capture some number of raw biometric features and convert them into biometric templates
      Perform fuzzy matching against a large number of stored biometric templates to determine the identity
      If latency is not an issue, this is relatively straightforward, especially in MapReduce
      This is a hard problem for low-latency applications, and it gets harder as these databases grow
      There is a speed/accuracy tradeoff
      The search space can be reduced using clustering techniques, but this only goes so far
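The difference between Verification (1-to-1) and Identification (1-to-N) can be sketched in a few lines of Java. This is only an illustration under loud assumptions: templates are plain `double[]` feature vectors, similarity is cosine similarity, and the database is an in-memory `Map`. The class and method names (`MatchOperations`, `verify`, `identify`) are hypothetical and do not come from the talk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: templates are double[] feature vectors and similarity
// is cosine similarity. Real biometric matchers are modality-specific.
class MatchOperations {

    // Cosine similarity in [-1, 1]; 1.0 means the vectors point the same way.
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Verification (1-to-1): compare the probe against one claimed identity.
    static boolean verify(Map<String, double[]> db, String claimedId,
                          double[] probe, double threshold) {
        double[] stored = db.get(claimedId);
        return stored != null && similarity(probe, stored) >= threshold;
    }

    // Identification (1-to-N): scan every stored template, keep the best score
    // that reaches the threshold.
    static Optional<String> identify(Map<String, double[]> db,
                                     double[] probe, double threshold) {
        String bestId = null;
        double bestScore = threshold;
        for (Map.Entry<String, double[]> entry : db.entrySet()) {
            double score = similarity(probe, entry.getValue());
            if (score >= bestScore) {
                bestScore = score;
                bestId = entry.getKey();
            }
        }
        return Optional.ofNullable(bestId);
    }

    public static void main(String[] args) {
        Map<String, double[]> db = new LinkedHashMap<>();
        db.put("alice", new double[] {1.0, 0.0, 0.2});
        db.put("bob", new double[] {0.0, 1.0, 0.1});
        double[] probe = {0.9, 0.1, 0.2}; // noisy re-capture of alice's template
        System.out.println(verify(db, "alice", probe, 0.95));
        System.out.println(identify(db, probe, 0.95));
    }
}
```

Verification touches one stored template, while identification touches all N, which is why the identification path is the one that needs search-space reduction at scale.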
    • What is Fuzzy matching?
      Fuzzy matching is an operation performed on two objects that determines how similar the objects are to each other
      Typically this operation produces a numeric similarity score
      Necessary when data collected from sensors is noisy and matching needs to be very accurate
      Almost all biometric matching algorithms perform some sort of fuzzy matching:
      Elastic Bunch Graph Matching - face recognition algorithm
      BOZORTH3 - minutiae-based fingerprint matching algorithm
      IrisCode - iris matching algorithm
      Other Examples:
      Image comparison
      Audio comparison
      Video comparison
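As a concrete example of an operation that "produces a numeric similarity score", here is a minimal sketch that compares two fixed-length bit templates by normalized Hamming similarity, in the spirit of how binary iris codes are commonly compared. It is not the IrisCode algorithm itself: the `long[]` template layout and the `HammingMatcher` name are assumptions made for illustration.

```java
// Illustrative only: fixed-length bit templates compared by normalized
// Hamming similarity. Names and template layout are hypothetical.
class HammingMatcher {

    // Returns a score in [0, 1]; 1.0 means every bit agrees.
    static double match(long[] a, long[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("templates must be non-empty and equal in length");
        }
        int differingBits = 0;
        for (int i = 0; i < a.length; i++) {
            differingBits += Long.bitCount(a[i] ^ b[i]); // XOR marks disagreeing bits
        }
        return 1.0 - (double) differingBits / (a.length * 64);
    }

    public static void main(String[] args) {
        long[] enrolled = {0xFFFF0000FFFF0000L, 0x0F0F0F0F0F0F0F0FL};
        long[] probe = {0xFFFF0000FFFF0001L, 0x0F0F0F0F0F0F0F0FL}; // one bit differs
        System.out.println(match(enrolled, probe));
    }
}
```

A caller would compare the returned score against a threshold chosen for the desired speed/accuracy tradeoff.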
    • Why Fuzzy Matching?
      Biometric data is inherently noisy and dirty
      Conditions are rarely identical between the original capture (Enrollment) and a later reading (Identification)
      Different types of cameras and sensors made by different companies
      Partial or smudged fingerprints (e.g. crime scene)
      Changes in skin tone, facial hair, makeup
      Different lighting conditions
      Aging and skin damage
      Weight gain, Weight loss
      Injury
      Derived from http://www.flickr.com/photos/glennji/3558118429/. Licensed under Creative Commons
    • Existing Large-scale Biometric Databases
      US Visitor & Immigrant Status Indicator Technology (US-VISIT)*
      International travelers’ biometrics (fingerprint and face)
      Collected at US ports of entry, Immigration Services, and State Department
      Used to support the Department of Homeland Security's mission
      FBI Integrated Automated Fingerprint Identification System (IAFIS)**
      Used to solve and prevent crime and catch criminals and terrorists
      Includes fingerprints, criminal histories, mug shots, scars and tattoo photos, physical characteristics like height, weight, and hair and eye color, and aliases
      AllTrust Networks Paycheck Secure System***
      Uses fingerprints to support secure check cashing
      Designed to stop fraud and speed check cashing
      Plus many more
      * One Team, One Mission, Securing our Homeland. US DHS.
      ** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm
      *** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
    • Growth of Biometric Databases
      Combined U.S. government biometric databases are expected to grow to hold billions of identities
      The DHS’s US-VISIT program has the world’s largest and fastest biometric database (called IDENT) with over 110 million identities and roughly 145,000 identities enrolled or verified daily*
      From the FBI’s Integrated Automated Fingerprint Identification System (IAFIS) alone, there are 66.5 million identities with 8,000-10,000 more subjects added each day **
      India is reportedly creating a biometric database to hold the fingerprints and face images for each of its 1.2 billion citizens as part of its Unique Identification Project ***
      European Union’s Biometric Matching System (EU-BMS) is expected to hold biometric information of 70 Million people to support visa applications, border control, and immigration ****
      AllTrust Networks Paycheck Secure system has enrolled over 6 Million users and has performed over 70 Million transactions*****
      * US-VISIT: The world’s largest biometric application. William Graves.
      ** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm
      *** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/
      **** http://www.findbiometrics.com/articles/i/5220/
      ***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
    • Biometric Databases are a Big Data Problem
      Large scale operations
      Searching and storing 100 Million to 1 Billion Identities
      Multiple biometric templates and raw files per identity for multimodal matching (Fingerprints, Faces, and Iris)
      Typically, new raw files and templates are stored after each Verification and Identification operation because biometric readings change over time
      Raw Images:
      (500M Identities x 16KB-300KB* x 10-20) = 1-2 PB
      Biometric Templates:
      (500M Identities x 256b-3KB** x 10-20) = 2-27 TB
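The storage estimates above are products of identities, bytes per sample, and samples per identity. The sketch below simply multiplies out the extremes of each stated range. It assumes decimal units (1 KB = 1,000 bytes) and reads "256b" as 256 bytes; both are assumptions, since the slide does not say. Under those assumptions the extremes come out wider than the slide's rounded figures, which suggests the authors used narrower typical values; the code does not try to reproduce their exact numbers.

```java
// Back-of-the-envelope bounds for the storage estimate on the slide.
// Assumptions: decimal units (1 KB = 1,000 bytes), "256b" read as 256 bytes.
class StorageEstimate {

    static long totalBytes(long identities, long bytesPerSample, long samplesPerIdentity) {
        return identities * bytesPerSample * samplesPerIdentity;
    }

    static double toTerabytes(long bytes) {
        return bytes / 1e12;
    }

    public static void main(String[] args) {
        long identities = 500_000_000L;
        // Raw images: 16 KB to 300 KB each, 10 to 20 samples per identity.
        System.out.printf("raw images: %.0f TB to %.0f TB%n",
                toTerabytes(totalBytes(identities, 16_000L, 10)),
                toTerabytes(totalBytes(identities, 300_000L, 20)));
        // Templates: 256 bytes to 3 KB each, 10 to 20 samples per identity.
        System.out.printf("templates: %.2f TB to %.0f TB%n",
                toTerabytes(totalBytes(identities, 256L, 10)),
                toTerabytes(totalBytes(identities, 3_000L, 20)));
    }
}
```

Either way, the point of the slide holds: raw images are a petabyte-class storage problem, while templates are small enough to spread across the memory or local disks of a modest cluster.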
    • Biometric Databases Must Perform Fuzzy Matching
      • Fuzzy matching techniques must be used because the data is noisy and “dirty”
      • Most applications require low latency fuzzy match searches in order to be useful
      • The objects being searched for cannot be ordered effectively to speed up searches
      • Clustering techniques can be used to reduce the search space, but this only goes so far
      • Fuzzy match searches are expensive and typically a large number of objects need to be searched to find a match
    • Hadoop and Biometric Databases
      HDFS as file storage for petabytes worth of images
      Redundancy
      Distribution
      Opens the doors to storing more and more raw images and at higher resolutions
      • Mahout/MapReduce can be used for indexing and clustering biometric templates to improve overall search speeds
      • MapReduce can be used for improving feature selection by analyzing the entire database to select features that are most effective in distinguishing identities
      • Easy to test and deploy new algorithms against all data at scale
      • N-to-N matching search (a special type of Identification search) to cleanse the database and find people trying to circumvent the system (identity fraud, etc.)
      • MapReduce can be used for batched searching where latency doesn’t matter
      • What about low latency searching…?
    • Fuzzy Table: A Solution to Large-scale, Low Latency, Fuzzy Matching
      Fuzzy Table is a large scale, low latency, distributed fuzzy matching database
      It enables fast parallel searches against keys that cannot be effectively ordered and that require fuzzy matching, such as biometric identification, large-scale image search, large-scale audio search, etc.
      It provides the benefits of Hadoop against problems that require large scale low latency fuzzy matching
      Horizontal scalability over commodity hardware
      Distributed and parallel computation
      High reliability and redundancy
      Enabled by Mahout and MapReduce for binning/clustering, re-encoding, and other bulk data operations
      We have found no other solution with these characteristics
    • Fuzzy Table Architecture
    • Fuzzy Table: Bulk Data Processing Component
      • Mahout’s Canopy Clustering and K-means Clustering are used to partition the data into clusters (referred to as bins) in order to reduce the search space
      • This makes searching faster because only a small subset of the data must be processed
      • This concept is based on work done in academia*
      The centroids from K-means clustering are used to create a “Bin classifier” that is used to determine the best bins to search for a given key
      {Key, Value} records are stored as SequenceFiles in HDFS, and the files are laid out so that the records are spread across the cluster for optimal parallel searching
      MapReduce is used for all other bulk or batch data processing, including:
      Re-encoding the raw files into feature vectors
      Performing large-scale feature evaluation to improve clustering
      Batch fuzzy match searching
      * Efficient Search and Retrieval in Biometric Databases. Amit Mhatre, Srinivas Palla, Sharat Chikkerur and Venu Govindaraju
      * Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot
    • Bulk Clustering and Real-time Classification
      This makes searching for keys faster because only a small subset of the entire dataset needs to be processed using fuzzy matching
      The classifier determines which Bins need to be searched in order to find the most likely matching keys
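One plausible reading of the bin classifier described above is a nearest-centroid ranking: given the centroids produced offline by clustering, pick the few bins whose centroids lie closest to the query key. The sketch below shows that idea only. `BinClassifier`, `bestBins`, and the use of squared Euclidean distance are assumptions for illustration, not the Fuzzy Table implementation.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical nearest-centroid "bin classifier": given centroids computed
// offline (e.g. by K-means), rank bins by distance to the query vector.
class BinClassifier {
    private final double[][] centroids;

    BinClassifier(double[][] centroids) {
        this.centroids = centroids;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the ids of the n bins whose centroids are closest to the key,
    // nearest first.
    int[] bestBins(double[] key, int n) {
        Integer[] ids = new Integer[centroids.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, Comparator.comparingDouble(
                (Integer id) -> squaredDistance(key, centroids[id])));
        int[] best = new int[Math.min(n, ids.length)];
        for (int i = 0; i < best.length; i++) {
            best[i] = ids[i];
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}, {0, 10}};
        BinClassifier classifier = new BinClassifier(centroids);
        System.out.println(Arrays.toString(classifier.bestBins(new double[] {1, 8}, 2)));
    }
}
```

Searching more than one bin trades some speed for accuracy, since a noisy key can fall just across a cluster boundary from its true match.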
    • Fuzzy Table: Data Storage and Bins
      Bins are represented as directories in HDFS containing one or more chunk files (stored as SequenceFiles): /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
      Chunk files contain many {Key, Value} pairs and are a small multiple of the HDFS block size
      Chunk files are distributed uniformly and randomly across the Data Servers in the cluster
      This ensures that the bins are striped across the cluster for optimal parallel searching
      Also, chunk files are replicated across the Data Servers using the replication mechanism in HDFS
      Data Servers search only the chunk files that reside locally, and results are returned in real time as soon as a match is found
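The directory layout above can be captured by a small path helper. This sketch only reproduces the shape of the single example path on the slide. The six-digit zero padding is inferred from that one example, and `ChunkPaths` and `chunkPath` are hypothetical names.

```java
import java.util.Locale;

// Builds HDFS-style paths in the layout shown on the slide. The six-digit
// zero padding is inferred from the one example path and may not be exact.
class ChunkPaths {

    static String chunkPath(String table, int bin, int chunk) {
        return String.format(Locale.ROOT,
                "/fuzzytable/_table_%s/_bin_%06d/_chunk_%06d", table, bin, chunk);
    }

    public static void main(String[] args) {
        System.out.println(chunkPath("fingerprints", 1, 1));
    }
}
```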
    • Fuzzy Table: Low Latency Fuzzy Matching Component
      The low latency component consists of three main parts
      Client - submits queries for Keys and gets back {Key, Value} pairs
      Master Server - serves metadata about which Data Servers host which bins
      Data Servers - perform the actual fuzzy matching searches
      Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records
      double score = fuzzyMatcher.match(key, storedRec.getKey());
      if (score >= threshold) {
          return storedRec; // similarity is high enough to count as a match
      }
      Fuzzy matching searches are performed in parallel across many Data Servers
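The query path above can be modeled in a single process to show the control flow: each Data Server scans only its local records, and the client fans the query out to all servers in parallel. Everything here is a stand-in: `ParallelSearch`, `Rec`, `FuzzyMatcher`, the toy string matcher, and the use of a parallel stream in place of networked servers are all assumptions, not the Fuzzy Table API.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical in-process model of the query path. Each inner list plays the
// role of one Data Server's local records.
class ParallelSearch {

    interface FuzzyMatcher {
        double match(String queryKey, String storedKey);
    }

    static final class Rec {
        final String key;
        final String value;

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Toy matcher: fraction of character positions that agree.
    static double positionalMatch(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        int same = 0;
        for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / longest;
    }

    // One Data Server: linear scan over local records, first hit is returned.
    static Optional<Rec> searchLocal(List<Rec> local, String key,
                                     FuzzyMatcher matcher, double threshold) {
        for (Rec storedRec : local) {
            if (matcher.match(key, storedRec.key) >= threshold) {
                return Optional.of(storedRec);
            }
        }
        return Optional.empty();
    }

    // Client view: query all servers in parallel and take any match.
    static Optional<Rec> search(List<List<Rec>> servers, String key,
                                FuzzyMatcher matcher, double threshold) {
        return servers.parallelStream()
                .map(local -> searchLocal(local, key, matcher, threshold))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .findAny();
    }

    public static void main(String[] args) {
        List<List<Rec>> servers = List.of(
                List.of(new Rec("AAAAAAAA", "id-1"), new Rec("CCCCCCCC", "id-2")),
                List.of(new Rec("GATTACAG", "id-3")));
        Optional<Rec> hit = search(servers, "GATTACAT", ParallelSearch::positionalMatch, 0.8);
        System.out.println(hit.map(rec -> rec.value).orElse("no match"));
    }
}
```

In the real system the fan-out would go only to the Data Servers hosting the bins chosen by the classifier, using the metadata served by the Master Server.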
    • Fuzzy Table Query
      [Slides 25-32 are diagram-only; no text was captured in this transcript]
    • Future Work
      Fuzzy Table is still a research prototype, but we plan to keep building it out to support this biometrics work
      Locality Sensitive Hashing instead of K-means clustering for binning and search space reduction
      Distributed/replicated master servers (and ZooKeeper integration)
      Real-time ingest
      Hopefully we will have performance/scalability metrics as well as more features and example applications to share within the next few months
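For readers unfamiliar with the Locality Sensitive Hashing idea mentioned above, here is a minimal sketch of one common variant, random-hyperplane hashing, where each bit of the bin id records which side of a random hyperplane the feature vector falls on. The talk does not say which LSH family the authors had in mind, so this choice, along with the `HyperplaneLsh` and `binOf` names, is an assumption.

```java
import java.util.Random;

// Hypothetical random-hyperplane LSH: each bit records which side of a random
// hyperplane the vector falls on, so nearby vectors tend to share a bin id.
class HyperplaneLsh {
    private final double[][] planes;

    HyperplaneLsh(int bits, int dimensions, long seed) {
        if (bits < 1 || bits > 31) {
            throw new IllegalArgumentException("bits must be between 1 and 31");
        }
        Random rng = new Random(seed); // fixed seed keeps bin ids reproducible
        planes = new double[bits][dimensions];
        for (double[] plane : planes) {
            for (int d = 0; d < dimensions; d++) {
                plane[d] = rng.nextGaussian();
            }
        }
    }

    // Packs one sign bit per hyperplane into an int used as the bin id.
    int binOf(double[] vector) {
        int bin = 0;
        for (double[] plane : planes) {
            double dot = 0;
            for (int d = 0; d < vector.length; d++) {
                dot += plane[d] * vector[d];
            }
            bin = (bin << 1) | (dot >= 0 ? 1 : 0);
        }
        return bin;
    }

    public static void main(String[] args) {
        HyperplaneLsh lsh = new HyperplaneLsh(8, 3, 42L);
        System.out.println(lsh.binOf(new double[] {0.3, -1.2, 2.5}));
    }
}
```

Unlike K-means binning, this needs no clustering pass over the data, which is one reason it pairs naturally with real-time ingest.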
    • Conclusion
      Searching large-scale Biometric Databases is a hard problem
      Hadoop is a potential solution to this problem
      We used MapReduce for bulk processing to enable distributed low latency fuzzy matching over HDFS
      Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even for low latency searching
    • Contributors
      Cloud Computing Team
      Jason Trost
      Lalit Kapoor
      Daniel Neuberger
      Michael Beck
      Edmund Kohlwey
      Josh Sullivan
      Identity Management/Biometrics Team
      Abel Sussman
      Eric Karlinsky
      Deanna Walters
      Joel Rader
      Allen Wight
    • Questions?
    • Contact Information – Cloud Computing Team
      Joshua Sullivan
      Senior Associate
      Lalit Kapoor
      Senior Consultant
      Michael Beck
      Senior Consultant
      Daniel Neuberger
      Senior Consultant
      Jason Trost
      Associate
      Booz Allen Hamilton Inc.
      134 National Business Parkway.
      Annapolis Junction, Maryland 20701
      (301)543-4611
      sullivan_joshua@bah.com
      Booz Allen Hamilton Inc.
      134 National Business Parkway.
      Annapolis Junction, Maryland 20701
      (301)821-8000 kapoor_lalit@bah.com
      Booz Allen Hamilton Inc.
      134 National Business Parkway.
      Annapolis Junction, Maryland 20701
      (301)821-8000 kapoor_lalit@bah.com
      Booz Allen Hamilton Inc.
      134 National Business Parkway.
      Annapolis Junction, Maryland 20701
      (301)821-8000 kapoor_lalit@bah.com
      Booz Allen Hamilton Inc.
      134 National Business Parkway.
      Annapolis Junction, Maryland 20701
      (301)543-4400
      trost_jason@bah.com
      Edmund Kohlwey
      Consultant
      Booz Allen Hamilton Inc.
      134 National Business Parkway.
      Annapolis Junction, Maryland 20701
      (301)617-3523 kohlwey_edmund@bah.com
    • Contact Information – Identity Management Team
      Joel Rader
      Identity Analyst
      Eric Karlinsky
      Identity Analyst
      Deanna Walters
      Biometrics Analyst
      Allen Wight
      Biometrics Analyst
      Booz Allen Hamilton Inc.
      13200 Woodland Park Rd
      Herndon, VA 20171
      (703) 984-0312
      rader_joel@bah.com
      Booz Allen Hamilton Inc.
      13200 Woodland Park Rd.
      Herndon, VA 20171
      (703) 984-3532 Karlinsky_eric@bah.com
      Booz Allen Hamilton Inc.
      13200 Woodland Park Rd
      Herndon, VA 20171
      (703) 984-1982
      walters_deanna@bah.com
      Booz Allen Hamilton Inc.
      13200 Woodland Park Rd
      Herndon, VA 20171
      (703) 984-1978
      wight_allen@bah.com
      Abel Sussman
      Biometrics Subject Matter Expert
      Booz Allen Hamilton Inc.
      13200 Woodland Park Rd.
      Herndon, VA 20171
      (703) 984-7663
      sussman_abel@bah.com