Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley - Crucial Point LLC

Federal, state, and local governments and the development community surrounding them are busy creating solutions that leverage the Apache Software Foundation's Hadoop capabilities. This session will highlight the top five solutions selected by an all-star panel of judges. Who will take home the coveted Hadoop Award for Government Excellence (the "Haggie")? Nominations for Haggies are being accepted now at http://CTOlabs.com.

  • An important mission of the Department of State (DoS), Bureau of Consular Affairs (CA) is to issue travel documents to U.S. and foreign citizens. CA uses a suite of software applications at locations around the world to collect applicant data for the purpose of issuing immigrant visas, non-immigrant visas, and United States passports. CA stores data collected from consular posts abroad and domestic processing centers, as well as from other government agencies, in the Consular Consolidated Database (CCD). Since its introduction, the CCD has proven to be a robust, economical, and analytically powerful data platform in an environment where fragmentation and inefficiency had been the norm. Indeed, without the CCD and its capabilities, CA would not be able to make effective use of the massive amount of data it collects.

The Size and Complexities of Consular Data: CA stores 115 terabytes of data in the CCD. On average, the CCD grows by 6-8 terabytes each month. Currently, over 170 software applications collect information for CA. CA uses these applications to process the many types of travel documents issued by the bureau. These applications also provide the interfaces with the numerous partner agencies that share data with CA. Most of these applications use a 'case' (such as a visa or passport application), and not a person record, as the basis of their data storage and retrieval. Each application collects different data, in a variety of formats, and with varying levels of detail. At the application level, it is extremely difficult to link person information in one application to potentially matching person information contained in another application. A person could apply for a visa at one location, and then apply at another location under a different name, and an adjudicator may not be able to establish the link between the cases.
However, since all CA data is stored in one central repository (the CCD), the CCD can leverage all available data elements from all applications throughout the system to determine all of the potential identity matches for any given person that CA has encountered. The CCD also contains unstructured data, such as free-form comments or case notes, and must deal with millions of large image files, such as applicant photos or scanned documents. The CCD's powerful, custom-built analytical tools synthesize the complex data captured by CA with the equally complex data received from other agencies. The CCD thus gives its users the ability to make informed decisions, detect and prevent fraud, and identify potential national security threats.

Sharing Consular Data: The CCD is at the heart of information sharing between the government agencies involved in the national security of the United States. Over 34,000 national security officials in the Department of State and its partner agencies use the CCD. In fact, the CCD now serves more external users (23,000) than internal DoS users (11,600). The statistics below illustrate just how vital the CCD is to the entire national security apparatus:
  • DHS: The CCD is the single most important and frequently used source of data. DHS has over 17,000 users worldwide, averaging 7 million hits per month.
  • FBI: 1,700 users of the CCD, averaging 420,000 hits per month.
  • DoD: 200 users, averaging 180,000 hits per month.
Because of the CCD, information sharing between posts and security partners is no longer a cumbersome effort. Instead, it is automated, simplified, and routine. For example, in November 2010, the average response time for the over 630,000 fingerprint checks submitted to DHS was 10.5 minutes. The average response time for the 588,000 fingerprint checks submitted to the FBI was 14.6 minutes.
This near real-time collection, distribution, and analysis of consular data is vital to the stakeholders who rely on it to make informed decisions.

An Improved Architecture: The CCD's architecture is designed to be flexible and scalable. It uses the latest generation of technologies and methodologies to enable rapid capture, distribution, and analysis of the massive amount of data collected by CA. The CCD captures data from 270 posts around the world and replicates that data using Oracle Multimaster Replication to a centralized repository in near real time. This architecture replaces the stove-piped concept of the past with a web-enabled, directly accessible database platform. The CCD connects users to their data via a single-platform design that is forward-looking and easily integrated with external systems. Before the CCD, consular data resided on a decentralized global network of approximately 270 consular posts supported by independent, in-house systems. These systems contained all of the significant inefficiencies inherent when data resources are structurally isolated and widely distributed. Management reporting was inefficient, and O&M costs were burdened with the necessity of delivering services individually to each post. The old architecture created formidable logistical and fiscal hurdles: sharing data between posts and with partners was difficult and time-consuming, and the inability to rapidly share information and obtain early access to application data negatively impacted fraud detection and prevention. The new CCD architecture consolidated the individual data assets of each post into a design that incorporates advanced infrastructure components. This forward-looking model has the flexibility needed for system modifications, the easy integration of new stakeholders, and the ability to make use of future technologies.
Today, according to Oracle, the CCD is the "largest connected/replicating database structure in the government." The CCD is economical, too: its architecture has saved CA $1.4 million annually and established an enviable green profile. The CCD made possible the elimination of an entire Data Share Group, with a hardware reduction of 100 servers, an 80% reduction in passport database servers, and reduced support costs through the elimination of entire storage networks.

Making Sense of Consular Data: The data contained in the CCD would mean little to the consular officer adjudicating a visa application, or to a Customs and Border Protection agent at a border crossing, if not for the CCD's ability to make sense of the enormous volume of data (over 115 terabytes) it contains. The CCD can pre-screen a visa record before an adjudicator even looks at it. The CCD provides the means to conduct vetting checks against various government databases. The CCD contains powerful analytical tools and a set of custom-built services that allow users to do everything from sending a mass email to American citizens abroad to tracking fraud investigations. In short, the CCD is a one-stop shop for collecting, analyzing, and making informed use of consular data. Consulate staffs and Customs and Border Protection agents are under immense pressure to do thorough and accurate identity and background checks on both citizens and non-citizens. In this age of international terrorism, the success and accuracy of staff decisions has critical implications for the security of the United States. The CCD gives its users the tools and data to make informed decisions. Before an adjudicator looks at a visa applicant's record, the CCD has already done much of the pre-processing automatically. Rather than an adjudicator sorting through terabytes of data, the CCD has already sorted through the over 115 terabytes of data and made the connections that are simply impossible for an individual user to make.
At each encounter with an applicant for a visa or passport, the CCD automatically establishes links between all cases involving that applicant and other potentially matching cases, enabling the detection of potential fraud or national security threats. For example, the CCD can base these links on the applicant using the same point of contact in the United States that was used on another case. The CCD can establish links based on the results of a biometric check, such as fingerprints or facial recognition. The CCD can even establish links using unstructured data by searching for certain text strings and linking records in which these strings appear. The CCD examines every conceivable combination of data elements when looking for potential matches. Due to the wide variety of resources used by the CCD, the system can establish links between two applicants using completely different names. With each subsequent encounter, the CCD creates additional links, resulting in a searchable, fully cross-referenced web of information that traces a person's activities across all of CA's data. By being able to see these links in a person-centric view, adjudicators have a broader, more complete, and more easily accessible set of data with which to make better-informed decisions.

The CCD automatically initiates biometric checks, including fingerprint checks and facial recognition checks. The CCD can also automatically look for possible fraud indicators in the data the applicant provided in his or her application. The CCD can then alert the adjudicator to look into these indicators, saving the adjudicator time. If the adjudicator finds a case of potential fraud, he or she can refer the case for fraud investigation right from the CCD. The fraud investigator can record the results of his or her investigation in the CCD and has access to all of the analytical tools and biometric checks available in the system.
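The linking behavior described above, where cases sharing any data element (a U.S. point of contact, a fingerprint match, a name) are joined into one cross-referenced web even when applicant names differ, can be illustrated with a minimal record-linkage sketch. This is a hypothetical simplification, not DoS code; the field names and matching rules are assumptions for illustration only.

```python
# Minimal record-linkage sketch: link case records that share any
# data element, then take the transitive closure so each new
# encounter joins the existing web of links.
# Field names here are invented; this is NOT the actual CCD schema.
from collections import defaultdict

def link_cases(cases):
    """Group case dicts whose 'name', 'us_contact', or 'fingerprint_id'
    values collide, even when the applicant names differ."""
    # Union-find over case ids, with path compression in find()
    parent = {c["case_id"]: c["case_id"] for c in cases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Index every linkable data element to the cases containing it
    index = defaultdict(list)
    for c in cases:
        for key in ("name", "us_contact", "fingerprint_id"):
            value = c.get(key)
            if value:
                index[(key, value)].append(c["case_id"])

    # Any two cases sharing an element are linked
    for ids in index.values():
        for other in ids[1:]:
            union(ids[0], other)

    groups = defaultdict(set)
    for c in cases:
        groups[find(c["case_id"])].add(c["case_id"])
    return sorted(map(sorted, groups.values()))

cases = [
    {"case_id": "V1", "name": "A. Smith", "us_contact": "555-0100"},
    {"case_id": "V2", "name": "B. Jones", "us_contact": "555-0100"},
    {"case_id": "V3", "name": "B. Jones", "fingerprint_id": "fp9"},
    {"case_id": "P1", "name": "C. Lee"},
]
print(link_cases(cases))  # [['P1'], ['V1', 'V2', 'V3']]
```

Note how V1 and V3 end up linked despite having different names and no field in common: V1 shares a contact with V2, and V2 shares a name with V3, which is the transitive "web of information" effect the notes describe.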
The CCD captures all of the data entered during the process and automatically creates cross-references using the new data. The CCD completes the loop. When a CCD user pulls up an applicant record, he or she will see much more than just the applicant's biographical data and the current status of the case. The user can see the results of all of the background checks that the adjudicator ran, all of the previous visa or passport records for that applicant, all of the applicant's images and fingerprints, and even a list of other CCD records that are linked to the applicant in one way or another. The CCD makes all of the information related to a case accessible in a single, consolidated view. In fact, the CCD is so easy to use that each month its users run 20 million reports, generate 120 million hits, process 1 million applicants, conduct 4 million facial recognition searches, submit 800,000 fingerprint check requests, and much more. Users add 6-8 terabytes of data to the CCD each month. Without the robust functionality built into the CCD, this workload would be unimaginable.

Conclusion: Before the CCD, visa and U.S. passport application data were located on independent databases, making data sharing within the Department of State and with its national security partner agencies difficult. CA needed to maximize the accuracy and availability of consular data by creating a single, consolidated database. CA needed a state-of-the-art data archiving and data-sharing platform that provided rapid access to data and enabled the fluid exchange of information, while reducing expenses and encouraging inter-agency collaboration. The CCD has transformed CA's mission delivery by breaking the paradigm of data isolated in independent databases. The CCD is a single platform of common, trusted data. The CCD uses a simplified, robust, and innovative network architecture that has streamlined CA's physical IT infrastructure.
The CCD today consolidates data from posts all over the world into a central repository that is over 115 terabytes in size and growing by 6-8 terabytes each month. In terms of both improved resource use and enhanced national security through better data analysis, it is impossible to overstate the benefits that the CCD brings to CA. The CCD allows staff to focus their time on better customer service, investigative activities, and analysis. CA's technical achievement with the CCD has been to create a robust, economical, and analytically powerful data platform in an environment where fragmentation and inefficiency had been the norm.

    1. Government Big Data Solutions Award. Bob Gourley, CTOlabs.com, http://ctolabs.com, Nov 2011
    2. About This Presentation:
       • How can we help accelerate public sector innovation?
       • Top Federal Mission Needs for Big Data
       • The State of Big Data Solutions in the Federal Space
       • The Intent of the Government Big Data Solutions Award
       • Criteria
       • Judges
       • Top Nominees for 2011
       • How to Nominate for 2012
       • The Judges' Choice for 2011
    3. Our Challenge
    4. The Government Needs More Agility*
       "High tech runs three times faster than normal businesses. And the government runs three times slower than normal businesses. So we have a nine-times gap." – Andy Grove
       The government can rapidly benefit from the lessons of high tech by being a faster follower, especially when it comes to Big Data constructs.
       Thesis: If the Big Data community understands more about federal missions, challenges, and successes, we can improve the speed and effectiveness of federal solutions.
       *Among other needs
    5. Top Federal Mission Needs for Big Data
       • Financial fraud detection across large, rapidly changing data sets
       • Cyber security: rapid, real-time analysis of all relevant data
       • Rapid return of geospatial data based on query
       • Location-based push of data, focused on emergency response
       • Real-time return of relevant search: USA.gov is exemplar
       • Real-time suggestion of topics: USA.gov is exemplar
       • Real-time suggestion of correlations: DoD has many use cases
       • Bioinformatics: human genome
       • Bioinformatics: patient location, treatment, outcomes
       These needs must be met in an era of significant downward pressure on budgets. Scalable systems with well-thought-out governance and extensive automation are key.
    6. Most active federal solution areas:
       • Federal integrators: spending internal research and development funds to create prototypes and full solutions relevant to federal missions
       • DoD and IC agencies: using Big Data approaches to solve "needle in the haystack" and "connect the dots" problems
       • National Labs: bioinformatics solutions have been put in place by federal researchers
       • OMB and GSA: ensuring sharing of lessons and solutions; key exemplars around web search methods
       • Solutions inside government agencies and on citizen-facing properties
       Big Data solutions are already making a difference in government service to citizens. Highlighting some of this virtuous work is a goal of our Government Big Data Solutions Award.
    7. The Intent of the Government Big Data Solutions Award
       • Established to help facilitate the exchange of best practices, lessons learned, and creative ideas for solutions to hard data challenges
       • Special focus on solutions built around the Apache Hadoop framework
       • Nominees and award winners to be written up in CTOlabs.com technology reviews
       • Award meant to help generate exchange of lessons learned
       We established a team of judges, asked them to consider mission impact as the primary criterion, and solicited award nominations via sites frequented by government IT professionals and solution providers.
    8. Judges
       • Doug Cutting: an advocate for and creator of open source search technologies (@cutting)
       • Chris Dorobek: founder, editor, and publisher of DorobekInsider.com (@DorobekINSIDER)
       • Ed Granstedt: QinetiQ Strategic Solution Center
       • Ryan LaSalle: Accenture Technology Labs (@Labsguy)
       • Alan Wade: experienced federal CIO
       Judges are all experienced innovators known for mastery in their fields.
    9. Top Nominees for 2011
       • USA Search: best-in-class hosted search services over more than 400 government sites. Great use of CDH3.
       • GCE Federal: cloud-based financial management solutions. Apache Hadoop, HBase, and Lucene for the Dept of Labor.
       • PNNL Bioinformatics: leading researcher Dr. Taylor of PNNL is advancing understanding of health, biology, genetics, and computing using Apache Hadoop/MapReduce/HBase.
       • SherpaSurfing: use of CDH as a cybersecurity solution. Ingest packet capture in any format, analyze trends, find malware, alert.
       • US Department of State, Bureau of Consular Affairs: large data with important applications for citizen service and national security.
       Each of these is making a difference for government missions right now.
    10. Please Think Now About 2012 Nominations
    11. How to Nominate for 2012: Click Here. Fill In Form. Hit "Submit"
       • We expect (and hope for) a much more crowded field of contenders next year.
       • Please let us know if you are working on things that feds should be aware of.
       • You can also submit technologies for review on our site.
    12. Special Mention: Department of State, Consular Consolidated Database
    13. Department of State (DoS), Bureau of Consular Affairs (CA), Consular Consolidated Database (CCD)
       • CCD is critical to citizen support and important in facilitating lawful visits to the US
       • First line of defense against unlawful entry
       • Largest connected/replicating database structure in the government
       • Pre-screens visa applicants, helping adjudicators weed out fraud
       • Used by multiple agencies
       • Very smart use of current data approaches to solve hard problems
    14. Judges' Choice 2011: GSA USA Search
    15. (image slide)
    16. USA Search
       • Program of the General Services Administration's (GSA) Office of Citizen Services and Information Technologies.
       • Hosted search services for USA.gov and over 500 other government websites.
       • Solves big data challenges with open source capabilities. CDH3 since fall 2010.
       • HDFS, Hadoop, and Hive used in a cost-effective, resilient, scalable solution.
       • Search results. Search suggestions. Trend analysis. Analytic dashboards.
       Bottom line: USA Search brings the best of the open source community to multiple government missions, including direct citizen support.
    17. (image slide)
    18. Questions/Comments?
    19. This Presentation Prepared By: Bob Gourley, CTOlabs.com, http://twitter.com/bobgourley
    20. Backup Slides
    21. Department of State (DoS), Bureau of Consular Affairs (CA), Consular Consolidated Database (CCD)
       • The Bureau of Consular Affairs issues travel documents to U.S. and foreign citizens. CA stores data collected from consular posts abroad and domestic processing centers, as well as other government agencies, in the Consular Consolidated Database (CCD).
       • The CCD holds over 115 terabytes of data, growing by 6-8 terabytes each month. Over 170 software applications collect this information and provide interfaces with the numerous partner agencies that share data with CA.
       • The CCD is the "largest connected/replicating database structure in the government."
       • Most of these applications use a 'case' (such as a visa or passport application), and not a person record, as the basis of their data storage and retrieval. At the application level, it is extremely difficult to link person information in one application to potentially matching person information contained in another application. A person could apply for a visa at one location, and then apply at another location under a different name, and an adjudicator may not be able to establish the link between the cases. The CCD can leverage all available data elements from all applications throughout the system to determine all of the potential identity matches for any given person that CA has encountered.
       • The CCD also contains unstructured data, such as free-form comments or case notes. The CCD must deal with millions of large image files, such as applicant photos or scanned documents. The CCD's powerful, custom-built analytical tools synthesize the complex data captured by CA with the equally complex data received from other agencies. The CCD thus gives its users the ability to make informed decisions, detect and prevent fraud, and identify potential national security threats.
    22. Department of State (DoS), Bureau of Consular Affairs (CA), Consular Consolidated Database (CCD)
       • The CCD is based on Oracle tools.
       • The CCD can pre-screen a visa record before an adjudicator even looks at it. The CCD provides the means to conduct vetting checks against various government databases.
       • Due to the wide variety of resources used by the CCD, the system can establish links between two applicants using completely different names. With each subsequent encounter, the CCD creates additional links, resulting in a searchable, fully cross-referenced web of information that traces a person's activities across all of CA's data. By being able to see these links in a person-centric view, adjudicators have a broader, more complete, and more easily accessible set of data with which to make better-informed decisions.
       • The CCD automatically initiates biometric checks. The CCD automatically looks for fraud indicators. The CCD captures all of the data entered during the process and automatically creates cross-references using the new data.
       • The CCD has transformed CA's mission delivery by breaking the paradigm of data isolated in independent databases.
       • The CCD allows staff to focus its time on better customer service, investigative activities, and analysis. CA's technical achievement with the CCD has been to create a robust, economical, and analytically powerful data platform in an environment where fragmentation and inefficiency had been the norm.
    23. USA Search: A Strategic Resource
       • USASearch is a program of the General Services Administration's (GSA) Office of Citizen Services and Information Technologies.
       • GSA believes in building once and using many times. USASearch is no exception. Since 2000, USASearch has provided hosted search services for USA.gov and for more than 400 government websites, across all levels of government, at no cost through its Affiliate Program.
       • USASearch instituted many innovative changes in 2010, making it a model for the Obama administration's effort to leverage open source technologies and shared solutions to bring substantial cost savings to the government. With its new open-architecture model, the USASearch Program provides viable and scalable shared search services.
       • USASearch solves big data challenges.
    24. USA Search: A Strategic Resource
       • USASearch began using Cloudera's Distribution including Apache Hadoop (CDH3) for the first time in the fall of 2010, and since then has seen its usage grow every month, not just in scale but also in scope.
       • All of the search traffic across USA.gov and the hundreds of affiliate sites comes through a single search service, and this generates a lot of data. To continuously improve the service, USASearch needs aggregated information on what searchers look for, how well they find it, and emerging trends, among other information. Once searches are initiated, USASearch also needs to know what results are shown and clicked on. This information needs to be broken down by affiliate and by time, and also aggregated across all affiliates.
       • The initial system was fairly simple and did just enough to address the most pressing data needs. As USASearch watched its data grow and the nightly batch jobs took longer and longer, it became clear that it would soon exhaust its existing resources. USASearch considered scaling up the hardware vertically and sharding the database horizontally, but both options seemed to kick the can down the road. Larger database hardware is both costly and eventually insufficient for USASearch's needs, and sharding promised to take all the usual issues associated with a single database system and multiply them.
       • USASearch determined it needed HDFS, Hadoop, and Apache Hive: a big data system that could grow cost-effectively and without downtime, be naturally resilient to failures, and sensibly handle backups.
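The batch aggregation described here, breaking search traffic down by affiliate and query, can be sketched as a Hadoop Streaming style map/reduce pair. The log format and metric (query counts per affiliate) are assumptions for illustration, not USASearch's actual jobs, and for clarity the two phases run in-process rather than under the Hadoop Streaming runner.

```python
# Sketch of a Hadoop Streaming style job: count searches per
# (affiliate, query) pair, the kind of aggregate that replaced the
# nightly RDBMS batch jobs. The tab-separated log format is invented.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Each log line: "<affiliate>\t<query>\t<clicked_result_or_->"
    for line in lines:
        affiliate, query, _click = line.rstrip("\n").split("\t")
        yield (affiliate, query.lower()), 1

def reducer(pairs):
    # Hadoop sorts by key between map and reduce; emulate that here,
    # then sum the counts for each distinct (affiliate, query) key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(count for _, count in group)

log = [
    "usa.gov\tpassport\tresult1",
    "usa.gov\tPassport\t-",
    "nps.gov\tgrand canyon\tresult7",
]
print(dict(reducer(mapper(log))))
# {('nps.gov', 'grand canyon'): 1, ('usa.gov', 'passport'): 2}
```

In a real deployment the same mapper/reducer logic would read stdin and write stdout under `hadoop jar ... streaming` over HDFS-resident logs (or be expressed as a Hive GROUP BY), which is what lets the job scale horizontally instead of outgrowing a single database server.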
    25. USA Search: A Strategic Resource
       • USASearch makes data actionable. USASearch displays the results of its Hive analyses in various analytics dashboards, but, more importantly, it also ensures the results positively affect searchers' experience on government websites. For example, USASearch uses Hadoop to generate contextually relevant and timely search suggestions for each of its affiliated government websites. Compare the different type-ahead suggestions for 'gran' on NPS.gov and USA.gov: both websites use the same USASearch backend system, but the suggestions differ completely.
       • USASearch is a success. The overhaul of USASearch's analytics is a dramatic success story. In the space of a few months, USASearch went from having a brittle and hard-to-scale RDBMS-based analytics platform to a much more agile Hadoop-based system that is intrinsically designed to scale. USASearch continues to see its Hadoop usage grow in scope with each new data source it adds, and it is clear that USASearch will rely on it more and more as the suite of tools and resources around Hadoop grows and matures.
       • By using state-of-the-art open source technology, USASearch has created a radically different search service that transforms the customer experience. Having a government-owned and -controlled search service allows the government to constantly understand what is on the minds of Americans and to drive enhancements to other delivery channels. The public has a much improved experience when interacting with the government due to USASearch.
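The per-affiliate type-ahead behavior above ('gran' completing differently on NPS.gov and USA.gov from one shared backend) can be sketched as a prefix lookup over each affiliate's own aggregated query counts. The counts and ranking rule below are invented for illustration and are not the USASearch implementation.

```python
# Sketch: per-affiliate type-ahead. Suggestions come only from that
# affiliate's own search history, so the same prefix completes
# differently on different sites. All counts below are invented.
def suggest(history, affiliate, prefix, k=3):
    """Return up to k most-searched queries for `affiliate` that
    start with `prefix`, most frequent first (ties alphabetical)."""
    counts = history.get(affiliate, {})
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in matches[:k]]

# Hypothetical aggregated query counts, e.g. the output of a nightly
# Hadoop job over each affiliate's search logs.
history = {
    "nps.gov": {"grand canyon": 900, "grand teton": 400, "grants": 20},
    "usa.gov": {"grants": 700, "grant programs": 300, "grand jury": 50},
}
print(suggest(history, "nps.gov", "gran"))  # ['grand canyon', 'grand teton', 'grants']
print(suggest(history, "usa.gov", "gran"))  # ['grants', 'grant programs', 'grand jury']
```

The design point is that the suggestion index is partitioned by affiliate rather than global, which is why one shared service can still feel site-specific to each of its hundreds of affiliates.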
