Government Big Data Solutions Award




 Bob Gourley
 CTOlabs.com   http://ctolabs.com   Nov 2011
About This Presentation:
•   How can we help accelerate public sector innovation?
•   Top Federal Mission Needs for Big Data
•   The State of Big Data Solutions in the Federal Space
•   The Intent of the Government Big Data Solutions Award
•   Criteria
•   Judges
•   Top Nominees for 2011
•   How to Nominate for 2012
•   The Judges' Choice for 2011



CTOlabs.com                                                 2
Our Challenge




CTOlabs.com
The Government Needs More Agility*
"High tech runs three times faster than normal businesses. And the
   government runs three times slower than normal businesses.
                   So we have a nine-times gap."
                                                      – Andy Grove


 The government can rapidly benefit from the lessons of high tech
  by being a faster follower, especially when it comes to Big Data
  constructs.

 Thesis: If the Big Data community understands more about federal
  missions, challenges and successes, we can improve the speed
  and effectiveness of federal solutions.


                            *Among other needs
CTOlabs.com                                                     4
Top Federal Mission Needs for Big Data
 Financial fraud detection across large, rapidly changing data sets
 Cyber Security: rapid real time analysis of all relevant data
 Rapid return of geospatial data based on query
 Location based push of data: Focused on emergency response
 Real time return of relevant search: USA.gov is exemplar
 Real time suggestion of topics: USA.gov is exemplar
 Real time suggestion of correlations: DoD has many use cases
 Bioinformatics: Human Genome
 Bioinformatics: Patient location, treatment, outcomes
  These needs must be met in an era of significant downward pressure on budgets.
 Scalable systems with well-thought-out governance and extensive automation are key.
 CTOlabs.com                                                                   5
Most active fed solution areas:
 Federal integrators: Spending internal research and development
  funds to create prototypes and full solutions relevant to fed
  missions

 DoD and IC agencies: Using Big Data approaches to solve
  "needle in the haystack" and "connect the dots" problems

 National Labs: Bioinformatics solutions have been put in place by
  federal researchers

 OMB and GSA: Ensuring sharing of lessons and solutions. Key
  exemplars around web search methods. Solutions inside
  government agencies and on citizen facing properties
     Big Data solutions are already making a difference in government service to
     citizens. Highlighting some of this virtuous work is a goal of our Government
                               Big Data Solutions Award.
 CTOlabs.com                                                                         6
The Intent of the Government Big Data
            Solutions Award
 Established to help facilitate exchange of best practices, lessons
  learned and creative ideas for solutions to hard data challenges

 Special focus on solutions built around Apache Hadoop framework

 Nominees and award winners to be written up in CTOlabs.com
  technology reviews

 Award meant to help generate exchange of lessons learned


     We established a team of judges, asked them to consider mission impact as the
      primary criterion, and solicited award nominations via sites frequented by
                government IT professionals and solution providers.

CTOlabs.com                                                                      7
Judges
 Doug Cutting: An advocate and creator of open source search
  technologies (@cutting)

 Chris Dorobek: Founder, editor, publisher of DorobekInsider.com
  (@DorobekINSIDER)

 Ed Granstedt: QinetiQ Strategic Solution Center

 Ryan LaSalle: Accenture Technology Labs (@Labsguy)

 Alan Wade: Experienced federal CIO


       Judges are all experienced innovators known for mastery in their fields


 CTOlabs.com                                                                     8
Top Nominees for 2011
 USA Search: Best-in-class hosted search services for more than
  400 gov sites. Great use of CDH3.

 GCE Federal: Cloud-based financial management solutions.
  Apache Hadoop, HBase, and Lucene for the Dept of Labor.

 PNNL Bioinformatics: Leading researcher Dr. Taylor of PNNL is
  advancing understanding of health, biology, genetics and computing
  using Apache Hadoop/MapReduce/HBase.

 SherpaSurfing: Use of CDH as a cybersecurity solution. Ingest
  packet capture in any format, analyze trends, find malware, alert.

 US Department of State: Bureau of Consular Affairs. Large data holdings
  with important applications for citizen service and national security.

      Each of these is making a difference for government missions right now.

 CTOlabs.com                                                                     9
Please Think Now About
       2012 Nominations



CTOlabs.com
How to Nominate for 2012
                               Click Here.
                               Fill In Form.
                               Hit “Submit”

                            • We expect (and hope
                              for) a much more
                              crowded field of
                              contenders next year.

                            • Please let us know if
                              you are working on
                              things that feds should
                              be aware of.

                            • You can also submit
                              technologies for review
                              on our site.

CTOlabs.com                                       11
Special Mention
     Department of State
Consular Consolidated Database


 CTOlabs.com
Department of State (DoS), Bureau of Consular
     Affairs (CA) Consular Consolidated Database
                        (CCD)

 CCD is critical to citizen support and important in facilitating lawful
  visits to US

 First line of defense against unlawful entry

 Largest connected/replicating database structure in the government

 Pre-screening visa applicants, helps adjudicators weed out fraud

 Used by multiple agencies


         Very smart use of current data approaches to solve hard problems

 CTOlabs.com                                                                13
Judges' Choice 2011
                 GSA
             USA Search


CTOlabs.com
CTOlabs.com   15
USA Search
 Program of the General Services Administration's (GSA) Office of
  Citizen Services and Information Technologies.

 Hosted search services for USA.gov and over 500 other
  government websites.

 Solves big data challenges with open source capabilities.

 CDH3 since fall 2010. HDFS, Hadoop, and Hive used in a cost-effective,
  resilient, scalable solution.

 Search results. Search suggestions. Trend analysis. Analytic
  dashboards.

     Bottom Line: USA Search brings the best of the open source community to
           multiple government missions, including direct citizen support
 CTOlabs.com                                                                   16
CTOlabs.com   17
Questions/Comments?




CTOlabs.com
This Presentation Prepared By:
          Bob Gourley
          CTOlabs.com
 http://twitter.com/bobgourley


 CTOlabs.com
Backup Slides




CTOlabs.com                   20
Department of State (DoS), Bureau of Consular
         Affairs (CA) Consular Consolidated Database
                                                (CCD)
• The Bureau of Consular Affairs issues travel documents to U.S. and foreign citizens. CA stores data collected from
consular posts abroad and domestic processing centers, as well as other government agencies, in the Consular
Consolidated Database (CCD).

• CCD holds over 115 terabytes of data, growing by 6-8 terabytes each month. Over 170 software
applications collect this information and provide interfaces with the numerous partner agencies that share data
with CA.

• CCD is the "largest connected/replicating database structure in the government."

• Most of these applications use a 'case' (such as a visa or passport application), and not a person record, as the
basis of their data storage and retrieval. At the application level, it is extremely difficult to link person information
in one application to potentially matching person information contained in another application. A person could
apply for a visa at one location, and then apply at another location under a different name, and an adjudicator
may not be able to establish the link between the cases. The CCD can leverage all available data elements from
all applications throughout the system in order to determine all of the potential identity matches of any given
person that CA has encountered (see the illustrative sketch after this list).

• The CCD also contains unstructured data, such as free-form comments or case notes. The CCD must deal with
millions of large image files, such as applicant photos or scanned documents. The CCD's powerful, custom-built
analytical tools synthesize the complex data captured by CA with the equally complex data received from other
agencies. The CCD thus gives its users the ability to make informed decisions, detect and prevent fraud, and
identify potential national security threats.
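
The person-centric matching described above can be illustrated with a minimal, hypothetical Python sketch that groups 'case' records by a shared identifier. The record fields, sample values, and matching rule are invented for illustration only; they are not the CCD's actual schema or algorithm.

```python
from collections import defaultdict

# Hypothetical case records from different consular applications.
# Field names and values are illustrative only, not the CCD schema.
cases = [
    {"case_id": "V-1001", "app": "visa",     "name": "A. Rahman", "passport": "P123"},
    {"case_id": "V-2002", "app": "visa",     "name": "Ali Raman", "passport": "P123"},  # same person, different name
    {"case_id": "P-3003", "app": "passport", "name": "J. Smith",  "passport": "P999"},
]

def person_clusters(cases, key="passport"):
    """Group 'case' records into person-centric clusters by a shared identifier.

    A real system would combine many identifiers (biometrics, document numbers,
    dates of birth) and fuzzy matching; this only shows the person-centric pivot
    away from per-application case storage.
    """
    clusters = defaultdict(list)
    for c in cases:
        clusters[c[key]].append(c["case_id"])
    return dict(clusters)

print(person_clusters(cases))  # -> {'P123': ['V-1001', 'V-2002'], 'P999': ['P-3003']}
```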



   CTOlabs.com                                                                                                    21
Department of State (DoS), Bureau of Consular
         Affairs (CA) Consular Consolidated Database
                                                (CCD)
• CCD is based on Oracle tools.
• The CCD can pre-screen a visa record before an adjudicator even looks at it. The CCD provides the means to
conduct vetting checks against various government databases.

• Due to the wide variety of resources used by the CCD, the system can establish links between two applicants
using completely different names. With each subsequent encounter, the CCD creates additional links, resulting in
a searchable, fully cross-referenced web of information that traces a person's activities across all of CA data. By
being able to see these links in a person-centric view, the adjudicators have a broader, more complete, and more
easily accessible set of data with which to make better-informed decisions.

• The CCD automatically initiates biometric checks. The CCD automatically looks for fraud indicators. The CCD
captures all of the data entered during the process and automatically creates cross-references using the new
data.

• The CCD has transformed CA's mission delivery by breaking the paradigm of data isolated in independent
databases.

• The CCD allows staff to focus their time on better customer service, investigative activities, and analysis. CA's
technical achievement with the CCD has been to create a robust, economical, and analytically powerful data
platform in an environment where fragmentation and inefficiency had been the norm.




   CTOlabs.com                                                                                                22
USA Search: A Strategic Resource
• USASearch is a program of the General Services Administration's (GSA)
  Office of Citizen Services and Information Technologies.

• GSA believes in building once and using many times. USASearch is no
  exception. Since 2000, USASearch has provided hosted search services for
  USA.gov and for more than 400 government websites—across all levels of
  government—at no cost through its Affiliate Program.

• USASearch instituted many innovative changes in 2010—making it a model
  for the Obama administration's effort to leverage open source technologies
  and shared solutions to bring substantial cost savings for the government.
  With its new open architecture model, the USASearch Program provides
  viable and scalable shared search services.

• USASearch Solves Big Data Challenges



 CTOlabs.com                                                           23
USA Search: A Strategic Resource
•    USASearch began using Cloudera's Distribution including Apache Hadoop (CDH3) for
     the first time in the fall of 2010, and since then has seen its usage grow every month—
     not just in scale, but also in scope.

•    All of the search traffic across USA.gov and the hundreds of affiliate sites comes through
     a single search service, and this generates a lot of data. To continuously improve the
     service, USASearch needs aggregated information on what searchers look for, how well
     they find it, and emerging trends, among other information. Once searches are initiated,
     USASearch also needs to know what results are shown and clicked on. This information
     needs to be broken down by affiliate and by time, and also aggregated across all
     affiliates (a rough sketch of this kind of roll-up follows the last bullet below).

•    The initial system was fairly simple and did just enough to address the most pressing
     data needs. As USASearch watched its data grow and the nightly batch jobs took longer
     and longer, it became clear that it would soon exhaust its existing resources. USASearch
     considered scaling up the hardware vertically and sharding the database horizontally, but
     both options seemed to kick the can down the road. Larger database hardware is both
     costly and eventually insufficient for USASearch's needs, and sharding promised to take
     all the usual issues associated with a single database system and multiply them.

•    USASearch determined it needed HDFS, Hadoop, and Apache Hive—a big data system
     that could grow cost effectively and without downtime, be naturally resilient to failures,
     and sensibly handle backups.
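
To make the roll-up concrete, here is a minimal, hypothetical Hadoop Streaming job in Python that counts queries per affiliate and hour. The log format, field order, and invocation flags are assumptions for the sketch; USASearch's actual ingestion jobs and Hive tables are not documented here.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming job: count queries per (affiliate, hour).

Assumed tab-separated log line: timestamp<TAB>affiliate<TAB>query<TAB>clicked_url
Run roughly as (flags are illustrative):
  hadoop jar hadoop-streaming.jar -D stream.num.map.output.key.fields=2 \
      -input /search/logs -output /search/rollup \
      -mapper 'python search_rollup.py map' \
      -reducer 'python search_rollup.py reduce' -file search_rollup.py
Or test locally:
  cat search.log | python search_rollup.py map | sort | python search_rollup.py reduce
"""
import sys

def mapper(lines):
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue                      # skip malformed lines
        ts, affiliate = parts[0], parts[1]
        hour = ts[:13]                    # e.g. "2011-10-30T14" under the assumed format
        print("%s\t%s\t1" % (affiliate, hour))

def reducer(lines):
    current_key, count = None, 0
    for line in lines:
        affiliate, hour, n = line.rstrip("\n").split("\t")
        key = (affiliate, hour)
        if current_key is not None and key != current_key:
            print("%s\t%s\t%d" % (current_key[0], current_key[1], count))
            count = 0
        current_key = key
        count += int(n)
    if current_key is not None:
        print("%s\t%s\t%d" % (current_key[0], current_key[1], count))

if __name__ == "__main__":
    if sys.argv[1:] == ["reduce"]:
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)
```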

    CTOlabs.com                                                                           24
USA Search: A Strategic Resource
•    USASearch Makes Data Actionable: USASearch displays the results of its Hive
     analyses in various analytics dashboards, but, more importantly, it also ensures
     the results positively affect searchers' experience on government websites.
     For example, USASearch uses Hadoop to generate contextually relevant and
     timely search suggestions for each of its affiliated government websites.
     Compare the different type-ahead suggestions for 'gran' on NPS.gov and
     USA.gov. Both websites use the same USASearch backend system, but the
     suggestions differ completely (see the illustrative sketch after these bullets).
•    USASearch Is a Success: The overhaul of USASearch's analytics is a dramatic
     success story. In the space of a few months, USASearch went from having a
     brittle and hard-to-scale RDBMS-based analytics platform to a much more agile
     Hadoop-based system that is intrinsically designed to scale. USASearch
     continues to see its Hadoop usage grow in scope with each new data source it
     adds, and it is clear that USASearch will rely on it more and more as the suite of
     tools and resources around Hadoop grows and matures in the future.
•    By using state-of-the-art open source technology, USASearch has created a
     radically different search service that transforms the customer experience.
     Having a government-owned and -controlled search service allows GSA to
     constantly understand what's on the minds of Americans and drive enhancements
     to other delivery channels. The public has a much improved experience when
     interacting with the government due to USASearch.
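
As a purely illustrative sketch of the per-affiliate type-ahead behavior described in the first bullet above, the snippet below builds suggestions by prefix-matching each affiliate's own aggregated query counts. The sample queries, counts, and threshold are invented; the real USASearch suggestion pipeline is more sophisticated.

```python
# Illustrative only: per-affiliate type-ahead built from aggregated query counts.
# The counts below are invented; a real system would read Hive/Hadoop output.
query_counts = {
    "nps.gov": {"grand canyon": 5200, "grand teton": 3100, "grants": 40},
    "usa.gov": {"grants": 9800, "grant programs": 2400, "grand jury duty": 310},
}

def suggest(affiliate, prefix, top_n=3, min_count=100):
    """Return the affiliate's most popular queries starting with the prefix."""
    counts = query_counts.get(affiliate, {})
    matches = [(q, n) for q, n in counts.items()
               if q.startswith(prefix.lower()) and n >= min_count]
    matches.sort(key=lambda qn: qn[1], reverse=True)
    return [q for q, _ in matches[:top_n]]

# Same backend logic, different suggestions per site:
print(suggest("nps.gov", "gran"))  # ['grand canyon', 'grand teton']
print(suggest("usa.gov", "gran"))  # ['grants', 'grant programs', 'grand jury duty']
```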

    CTOlabs.com                                                                  25

More Related Content

What's hot

Big Data Systems: Past, Present & (Possibly) Future with @techmilind
Big Data Systems: Past, Present &  (Possibly) Future with @techmilindBig Data Systems: Past, Present &  (Possibly) Future with @techmilind
Big Data Systems: Past, Present & (Possibly) Future with @techmilindEMC
 
The technical case for a semantic web
The technical case for a semantic webThe technical case for a semantic web
The technical case for a semantic webTony Dobaj
 
Big Data v. Small data - Rules to thumb for 2015
Big Data v. Small data - Rules to thumb for 2015Big Data v. Small data - Rules to thumb for 2015
Big Data v. Small data - Rules to thumb for 2015Visart
 
Online text data for machine learning, data science, and research - Who can p...
Online text data for machine learning, data science, and research - Who can p...Online text data for machine learning, data science, and research - Who can p...
Online text data for machine learning, data science, and research - Who can p...Fredrik Olsson
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data ScienceEdureka!
 
Applications of Machine Learning at USC
Applications of Machine Learning at USCApplications of Machine Learning at USC
Applications of Machine Learning at USCSri Ambati
 
Creating a Data-Driven Government: Big Data With Purpose
Creating a Data-Driven Government: Big Data With PurposeCreating a Data-Driven Government: Big Data With Purpose
Creating a Data-Driven Government: Big Data With PurposeTyrone Grandison
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data AnalyticsEMC
 
Preservation and Proportionality: Lowering the Burden of Preserving Data in C...
Preservation and Proportionality: Lowering the Burden of Preserving Data in C...Preservation and Proportionality: Lowering the Burden of Preserving Data in C...
Preservation and Proportionality: Lowering the Burden of Preserving Data in C...Zapproved
 
HPE IDOL 10 (Intelligent Data Operating Layer)
HPE IDOL 10 (Intelligent Data Operating Layer)HPE IDOL 10 (Intelligent Data Operating Layer)
HPE IDOL 10 (Intelligent Data Operating Layer)Andrey Karpov
 
KM Russia 2014 - John Girard
KM Russia 2014 - John GirardKM Russia 2014 - John Girard
KM Russia 2014 - John GirardJohn Girard
 
Got Chaos? Extracting Business Intelligence from Email with Natural Language ...
Got Chaos? Extracting Business Intelligence from Email with Natural Language ...Got Chaos? Extracting Business Intelligence from Email with Natural Language ...
Got Chaos? Extracting Business Intelligence from Email with Natural Language ...Digital Reasoning
 
Open Data for Transportation Agencies
Open Data for Transportation AgenciesOpen Data for Transportation Agencies
Open Data for Transportation AgenciesNovavia Solutions
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextMurad Daryousse
 

What's hot (20)

Big Data Systems: Past, Present & (Possibly) Future with @techmilind
Big Data Systems: Past, Present &  (Possibly) Future with @techmilindBig Data Systems: Past, Present &  (Possibly) Future with @techmilind
Big Data Systems: Past, Present & (Possibly) Future with @techmilind
 
The technical case for a semantic web
The technical case for a semantic webThe technical case for a semantic web
The technical case for a semantic web
 
Big Data v. Small data - Rules to thumb for 2015
Big Data v. Small data - Rules to thumb for 2015Big Data v. Small data - Rules to thumb for 2015
Big Data v. Small data - Rules to thumb for 2015
 
Online text data for machine learning, data science, and research - Who can p...
Online text data for machine learning, data science, and research - Who can p...Online text data for machine learning, data science, and research - Who can p...
Online text data for machine learning, data science, and research - Who can p...
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
Applications of Machine Learning at USC
Applications of Machine Learning at USCApplications of Machine Learning at USC
Applications of Machine Learning at USC
 
Data Journalism for Business Reporting
Data Journalism for Business ReportingData Journalism for Business Reporting
Data Journalism for Business Reporting
 
Big Data World
Big Data WorldBig Data World
Big Data World
 
Creating a Data-Driven Government: Big Data With Purpose
Creating a Data-Driven Government: Big Data With PurposeCreating a Data-Driven Government: Big Data With Purpose
Creating a Data-Driven Government: Big Data With Purpose
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data Paper
Big data PaperBig data Paper
Big data Paper
 
On Big Data
On Big DataOn Big Data
On Big Data
 
Preservation and Proportionality: Lowering the Burden of Preserving Data in C...
Preservation and Proportionality: Lowering the Burden of Preserving Data in C...Preservation and Proportionality: Lowering the Burden of Preserving Data in C...
Preservation and Proportionality: Lowering the Burden of Preserving Data in C...
 
HPE IDOL 10 (Intelligent Data Operating Layer)
HPE IDOL 10 (Intelligent Data Operating Layer)HPE IDOL 10 (Intelligent Data Operating Layer)
HPE IDOL 10 (Intelligent Data Operating Layer)
 
ANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEWANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEW
 
The Future of LOD
The Future of LODThe Future of LOD
The Future of LOD
 
KM Russia 2014 - John Girard
KM Russia 2014 - John GirardKM Russia 2014 - John Girard
KM Russia 2014 - John Girard
 
Got Chaos? Extracting Business Intelligence from Email with Natural Language ...
Got Chaos? Extracting Business Intelligence from Email with Natural Language ...Got Chaos? Extracting Business Intelligence from Email with Natural Language ...
Got Chaos? Extracting Business Intelligence from Email with Natural Language ...
 
Open Data for Transportation Agencies
Open Data for Transportation AgenciesOpen Data for Transportation Agencies
Open Data for Transportation Agencies
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data Context
 

Similar to Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley - Crucial Point LLC

Open government international garry lloyd
Open government international   garry lloydOpen government international   garry lloyd
Open government international garry lloydGarry Lloyd
 
Why CxOs care about Data Governance; the roadblock to digital mastery
Why CxOs care about Data Governance; the roadblock to digital masteryWhy CxOs care about Data Governance; the roadblock to digital mastery
Why CxOs care about Data Governance; the roadblock to digital masteryCoert Du Plessis (杜康)
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementationSandip Tipayle Patil
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltoolssuresh sood
 
EDF2012 Rufus Pollock - Open Data. Where we are where we are going
EDF2012  Rufus Pollock - Open Data. Where we are where we are goingEDF2012  Rufus Pollock - Open Data. Where we are where we are going
EDF2012 Rufus Pollock - Open Data. Where we are where we are goingEuropean Data Forum
 
Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Gregg Barrett
 
First they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and UsedFirst they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and UsedRensselaer Polytechnic Institute
 
Big Data
Big DataBig Data
Big DataBBDO
 
141900791 big-data
141900791 big-data141900791 big-data
141900791 big-dataglittaz
 
BBDO Proximity: Big-data May 2013
BBDO Proximity: Big-data May 2013BBDO Proximity: Big-data May 2013
BBDO Proximity: Big-data May 2013Brian Crotty
 
QuickView #3 - Big Data
QuickView #3 - Big DataQuickView #3 - Big Data
QuickView #3 - Big DataSonovate
 
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate Oomph! Recruitment
 
Move It Don't Lose It: Is Your Big Data Collecting Dust?
Move It Don't Lose It: Is Your Big Data Collecting Dust?Move It Don't Lose It: Is Your Big Data Collecting Dust?
Move It Don't Lose It: Is Your Big Data Collecting Dust?Jennifer Walker
 

Similar to Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley - Crucial Point LLC (20)

Open government international garry lloyd
Open government international   garry lloydOpen government international   garry lloyd
Open government international garry lloyd
 
Why CxOs care about Data Governance; the roadblock to digital mastery
Why CxOs care about Data Governance; the roadblock to digital masteryWhy CxOs care about Data Governance; the roadblock to digital mastery
Why CxOs care about Data Governance; the roadblock to digital mastery
 
data, big data, open data
data, big data, open datadata, big data, open data
data, big data, open data
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementation
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltools
 
EDF2012 Rufus Pollock - Open Data. Where we are where we are going
EDF2012  Rufus Pollock - Open Data. Where we are where we are goingEDF2012  Rufus Pollock - Open Data. Where we are where we are going
EDF2012 Rufus Pollock - Open Data. Where we are where we are going
 
Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...
 
Big Data
Big DataBig Data
Big Data
 
Big Data - Gerami
Big Data - GeramiBig Data - Gerami
Big Data - Gerami
 
Big Data Analytics (1).ppt
Big Data Analytics (1).pptBig Data Analytics (1).ppt
Big Data Analytics (1).ppt
 
First they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and UsedFirst they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and Used
 
Big Data
Big DataBig Data
Big Data
 
141900791 big-data
141900791 big-data141900791 big-data
141900791 big-data
 
BBDO Proximity: Big-data May 2013
BBDO Proximity: Big-data May 2013BBDO Proximity: Big-data May 2013
BBDO Proximity: Big-data May 2013
 
Big data
Big data Big data
Big data
 
Big Data - CRM's Promise Land
Big Data - CRM's Promise LandBig Data - CRM's Promise Land
Big Data - CRM's Promise Land
 
QuickView #3 - Big Data
QuickView #3 - Big DataQuickView #3 - Big Data
QuickView #3 - Big Data
 
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
 
Move It Don't Lose It: Is Your Big Data Collecting Dust?
Move It Don't Lose It: Is Your Big Data Collecting Dust?Move It Don't Lose It: Is Your Big Data Collecting Dust?
Move It Don't Lose It: Is Your Big Data Collecting Dust?
 
Using Data Riches A tale of two projects - Ajay Vinze
Using Data Riches A tale of two projects - Ajay VinzeUsing Data Riches A tale of two projects - Ajay Vinze
Using Data Riches A tale of two projects - Ajay Vinze
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Recently uploaded (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley - Crucial Point LLC

  • 1. Government Big Data Solutions Award Bob Gourley CTOlabs.com http://ctolabs.com Nov 2011
  • 2. About This Presentation: • How can we help accelerate public sector innovation? • Top Federal Mission Needs for Big Data • The State of Big Data Solutions in the Federal Space • The Intent of the Government Big Data Solutions Award • Criteria • Judges • Top Nominees for 2011 • How to Nominate for 2012 • The Judges Choice for 2011 CTOlabs.com 2
  • 4. The Government Needs More Agility* ―High tech runs three-times faster than normal businesses. And the government runs three-times slower than normal businesses. So we have a nine-times gap‖ – Andy Grove  The government can rapidly benefit from the lessons of high tech by being a faster follower, especially when it comes to Big Data constructs  Thesis: If the Big Data community understands more about federal missions, challenges and successes, we can improve the speed and effectiveness of federal solutions. *Among other needs CTOlabs.com 4
  • 5. Top Federal Mission Needs for Big Data  Financial fraud detection across large, rapidly changing data sets  Cyber Security: rapid real time analysis of all relevant data  Rapid return of geospatial data based on query  Location based push of data: Focused on emergency response  Real time return of relevant search: USA.gov is exemplar  Real time suggestion of topics: USA.gov is exemplar  Real time suggestion of correlations: DoD has many use cases  Bioinformatics: Human Genome  Bioinformatics: Patient location, treatment, outcomes These needs must be met in an era of significant downward pressure on budgets. Scalable systems with well thought out governance & extensive automation are key. CTOlabs.com 5
  • 6. Most active fed solution areas:  Federal integrators: Spending internal research and development funds to create prototypes and full solutions relevant to fed missions  DoD and IC agencies: Using Big Data approaches to solve ―needle in the haystack‖ and ―connect the dots‖ problems  National Labs: Bioinformatics solutions have been put in place by federal researchers  OMB and GSA: Ensuring sharing of lessons and solutions. Key exemplars around web search methods. Solutions inside government agencies and on citizen facing properties Big Data solutions are already making a difference in government service to citizens. Highlighting some of this virtuous work is a goal of our Government Big Data Solutions Award. CTOlabs.com 6
  • 7. The Intent of the Government Big Data Solutions Award  Established to help facilitate exchange of best practices, lessons learned and creative ideas for solutions to hard data challenges  Special focus on solutions built around Apache Hadoop framework  Nominees and award winners to be written up in CTOlabs.com technology reviews  Award meant to help generate exchange of lessons learned We established a team of judges, asked them to consider mission impact as primary criteria, and solicited award nominations via sites frequented by government IT professionals and solution providers. CTOlabs.com 7
  • 8. Judges  Doug Cutting: An advocate and creator of open source search technologies (@cutting)  Chris Dorobek: Founder, editor, publisher of DorobekInsider.com (@DorobekINSIDER)  Ed Granstedt: QinetiQ Strategic Solution Center  Ryan LaSalle: Accenture Technology Labs (@Labsguy)  Alan Wade: Experienced federal CIO Judges are all experienced innovators known for mastery in their fields CTOlabs.com 8
  • 9. Top Nominees for 2011  USA Search: Best in class hosted search services over more than 400 gov sites. Great use of CDH3.  GCE Federal: Cloud-based financial management solutions. Apache Hadoop, Hbase, Lucene for Dept of Labor.  PNNL Bioinformatics: Leading researcher Dr. Taylor of PNNL is advancing understanding of health, biology, genetics and computing using Apache Hadoop/MapReduce/HBase.  SherpaSurfing: Use of CDH as a cybersecurity solution. Ingest packet capture in any format, analyze trends, find malware, alert.  US Department of State: Bureau of Counselor Affairs. Large data with important applications for citizen service and national security. Each of these are making a difference for government missions right now. CTOlabs.com 9
  • 10. Please Think Now About 2012 Nominations CTOlabs.com
  • 11. How to Nominate for 2012 Click Here. Fill In Form. Hit “Submit” • We expect (and hope for) a much more crowded field of contenders next year. • Please let us know if you are working on things that feds should be aware of. • You can also submit technologies for review on our site. CTOlabs.com 11
  • 12. Special Mention Department of State Consular Consolidated Database CTOlabs.com
  • 13. Department of State (DoS), Bureau of Consular Affairs (CA) Consular Consolidated Database (CCD)  CCD is critical to citizen support and important in facilitating lawful visits to US  First line of defense against unlawful entry  Largest connected/replicating database structure in the government  Pre-screening visa applicants, helps adjudicators weed out fraud  Used by multiple agencies Very smart use of current data approaches to solve hard problems CTOlabs.com 13
  • 14. Judge’s Choice 2011 GSA USA Search CTOlabs.com
  • 16. USA Search  Program of General Services Administration‘s (GSA) Office of Citizen Services and Information Technologies.  Hosted search services for USA.gov and over 500 other government websites.  Solves big data challenges with open source capabilities.  CDH3 since fall 2010. HDFS, Hadoop and Hive used in cost effective, resilient, scalable solution.  Search Results. Search Suggestions. Trend analysis. Analytic dashboards. Bottom Line: USA Search brings the best of the open source community to multiple government missions, including direct citizen support CTOlabs.com 16
  • 19. This Presentation Prepared By: Bob Gourley CTOlabs.com http://twitter.com/bobgourley CTOlabs.com
  • 21. Department of State (DoS), Bureau of Consular Affairs (CA) Consular Consolidated Database (CCD) •Bureau of Consular Affairs issues travel documents to U.S. and foreign citizens. CA stores data collected from consular posts abroad and domestic processing centers, as well as other government agencies in the Consular Consolidated Database (CCD). •CCD holds over one hundred (115) terabytes of data, growing by 6-8 terabytes each month. Over 170 software applications collect this information and provide interfaces with the numerous partner agencies that share data with CA. •CCD is the ―largest connected/replicating database structure in the government.‖ •Most of these applications use a ‗case‘ (such as a visa or passport application), and not a person record, as the basis of their data storage and retrieval. At the application level, it is extremely difficult to link person information in one application to potentially-matching person information contained in another application. A person could apply for a visa at one location, and then apply at another location under a different name, and an adjudicator may not be able to establish the link between the cases. The CCD can leverage all available data elements from all applications throughout the system in order to determine all of the potential identity matches of any given person that CA has encountered. •The CCD also contains unstructured data, such as free-form comments or case notes. The CCD must deal with millions of large image files, such as applicant photos or scanned documents. The CCD‘s powerful, custom-built analytical tools synthesize the complex data captured by CA with the equally-complex data received from other agencies. The CCD thus gives its users the ability to make informed decisions, detect and prevent fraud, and identify potential national security threats. CTOlabs.com 21
  • 22. Department of State (DoS), Bureau of Consular Affairs (CA) Consular Consolidated Database •CCD is based on Oracle tools. (CCD) •The CCD can pre-screen a visa record before an adjudicator even looks at it. The CCD provides the means to conduct vetting checks against various government databases. •Due to the wide variety of resources used by the CCD, the system can establish links between two applicants using completely different names. With each subsequent encounter, the CCD creates additional links, resulting in a searchable, fully cross-referenced web of information that traces a person‘s activities across all of CA data. By being able to see these links in a person-centric view, the adjudicators have a broader, more complete, and more easily-accessible set of data with which to make better-informed decisions. •The CCD automatically initiates biometric checks. The CCD automatically looks for fraud indicators. The CCD captures all of the data entered during the process and automatically creates cross-references using the new data. •The CCD has transformed CA‘s mission delivery by breaking the paradigm of data isolated in independent databases •The CCD allows staff to focus its time on better customer service, investigative activities, and analysis. CA‘s technical achievement with the CCD has been to create a robust, economical, and analytically-powerful data platform in an environment where fragmentation and inefficiency had been the norm. CTOlabs.com 22
  • 23. USA Search: A Strategic Resource • USASearch is a program of the General Services Administration‘s (GSA) Office of Citizen Services and Information Technologies. • GSA believes in building once and using many times. USASearch is no exception. Since 2000, USASearch has provided hosted search services for USA.gov and for more than 400 government websites—across all levels of government—at no cost through its Affiliate Program. • USASearch instituted many innovative changes in 2010—making it a model for the Obama administration‘s effort to leverage open source technologies and shared solutions to bring substantial cost savings for the government. With its new open architecture model, the USASearch Program provides viable and scalable shared search services. • USASearch Solves Big Data Challenges CTOlabs.com 23
  • 24. USA Search: A Strategic Resource • USASearch began using Cloudera‘s Distribution including Apache Hadoop (CDH3) for the first time in the fall of 2010, and since then has seen its usage grow every month— not just in scale, but also in scope. • All of the search traffic across USA.gov and the hundreds of affiliate sites comes through a single search service, and this generates a lot of data. To continuously improve the service, USASearch needs aggregated information on what searchers look for, how well they find it, and emerging trends, among other information. Once searches are initiated, USASearch also needs to know what results are shown and clicked on. This information needs to be broken down by affiliate and by time, and also aggregated across all affiliates. • The initial system was fairly simple and did just enough to address the most pressing data needs. As USASearch watched its data grow and the nightly batch jobs took longer and longer, it became clear that it would soon exhaust its existing resources. USASearch considered scaling up the hardware vertically and sharding the database horizontally, but both options seemed to kick the can down the road. Larger database hardware is both costly and eventually insufficient for USASearch‘s needs, and sharding promised to take all the usual issues associated with a single database system and multiply them. • USASearch determined it needed HDFS, Hadoop, and Apache Hive—a big data system that could grow cost effectively and without downtime, be naturally resilient to failures, and sensibly handle backups. CTOlabs.com 24
• 25. USASearch: A Strategic Resource
• USASearch Makes Data Actionable: USASearch displays the results of its Hive analyses in various analytics dashboards, but, more importantly, it also ensures the results positively affect searchers' experience on government websites. For example, USASearch uses Hadoop to generate contextually relevant and timely search suggestions for each of its affiliated government websites. Compare the different type-ahead suggestions for 'gran' on NPS.gov and USA.gov: both websites use the same USASearch backend system, but the suggestions differ completely. (A sketch of how such per-affiliate suggestions can be derived follows this slide.)
• USASearch Is a Success: The overhaul of USASearch's analytics is a dramatic success story. In the space of a few months, USASearch went from a brittle and hard-to-scale RDBMS-based analytics platform to a much more agile Hadoop-based system that is intrinsically designed to scale. USASearch continues to see its Hadoop usage grow in scope with each new data source it adds, and it is clear that USASearch will rely on it more and more as the suite of tools and resources around Hadoop grows and matures.
• By using state-of-the-art open source technology, USASearch has created a radically different search service that transforms the customer experience. Having a government-owned and -controlled search service allows us to constantly understand what's on the minds of Americans to drive enhancements to other delivery channels. The public has a much improved experience when interacting with the government due to USASearch.
CTOlabs.com 25
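The per-affiliate suggestions are worth a small illustration. The sketch below is not USASearch's code; it assumes the per-affiliate query counts already exist (for example, as output of a Hive job over the aggregated search logs) and shows why the prefix 'gran' can complete differently on NPS.gov and USA.gov even though both sites sit on the same backend.

```python
from collections import defaultdict


def build_suggestions(query_counts, max_prefix_len=10, per_prefix=5):
    """Build an (affiliate, prefix) -> top completions lookup table from
    per-affiliate query counts.  Keying by affiliate is what lets the same
    prefix complete differently on different government sites."""
    table = defaultdict(lambda: defaultdict(int))
    for affiliate, counts in query_counts.items():
        for query, count in counts.items():
            for i in range(1, min(len(query), max_prefix_len) + 1):
                table[(affiliate, query[:i])][query] += count
    return {
        key: [q for q, _ in sorted(completions.items(),
                                   key=lambda item: item[1],
                                   reverse=True)[:per_prefix]]
        for key, completions in table.items()
    }


# Invented counts -- in USASearch's case these would come out of Hive jobs
# over the aggregated search logs described on the previous slide.
counts = {
    "nps.gov": {"grand canyon": 900, "grand teton": 400},
    "usa.gov": {"grants": 1200, "grant applications": 300},
}
suggestions = build_suggestions(counts)
print(suggestions[("nps.gov", "gran")])  # ['grand canyon', 'grand teton']
print(suggestions[("usa.gov", "gran")])  # ['grants', 'grant applications']
```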

Editor's Notes

1. An important mission of the Department of State (DoS), Bureau of Consular Affairs (CA) is to issue travel documents to U.S. and foreign citizens. CA uses a suite of software applications at locations around the world to collect applicant data for the purpose of issuing immigrant visas, non-immigrant visas, and United States passports. CA stores data collected from consular posts abroad and domestic processing centers, as well as other government agencies, in the Consular Consolidated Database (CCD). Since its introduction, the CCD has proven to be a robust, economical, and analytically powerful data platform in an environment where fragmentation and inefficiency had been the norm. Indeed, without the CCD and its capabilities, CA would not be able to make effective use of the massive amount of data it collects.

The Size and Complexities of Consular Data: CA stores one hundred fifteen (115) terabytes of data in the CCD. On average, the CCD grows by 6-8 terabytes each month. Currently, over 170 software applications collect information for CA. CA uses these applications to process the many types of travel documents issued by the bureau. These applications also provide the interfaces with the numerous partner agencies that share data with CA. Most of these applications use a 'case' (such as a visa or passport application), and not a person record, as the basis of their data storage and retrieval. Each application collects different data, in a variety of formats, and with varying levels of detail. At the application level, it is extremely difficult to link person information in one application to potentially matching person information contained in another application. A person could apply for a visa at one location, and then apply at another location under a different name, and an adjudicator may not be able to establish the link between the cases. However, since all CA data is stored in one central repository (the CCD), the CCD can leverage all available data elements from all applications throughout the system in order to determine all of the potential identity matches of any given person that CA has encountered.

The CCD also contains unstructured data, such as free-form comments or case notes, and it must deal with millions of large image files, such as applicant photos or scanned documents. The CCD's powerful, custom-built analytical tools synthesize the complex data captured by CA with the equally complex data received from other agencies. The CCD thus gives its users the ability to make informed decisions, detect and prevent fraud, and identify potential national security threats.

Sharing Consular Data: The CCD is at the heart of information sharing between the government agencies involved in the national security of the United States. Over 34,000 national security officials in the Department of State and its partner agencies use the CCD. In fact, the CCD now serves more external users (23,000) than internal DoS users (11,600). The statistics below illustrate just how vital the CCD is to the entire national security apparatus:
• DHS: The CCD is the single most important and frequently used source of data. DHS has over 17,000 users worldwide, averaging 7 million hits per month.
• FBI: 1,700 users of the CCD, averaging 420,000 hits per month.
• DoD: 200 users, averaging 180,000 hits per month.
Because of the CCD, information sharing between posts and security partners is no longer a cumbersome effort. Instead, it is automated, simplified, and routine.
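The person-centric matching described in this note — many case-oriented applications, one consolidated repository that links potentially matching identities — can be illustrated with a small union-find sketch. The record layout and attribute names below are invented for the example; the CCD's actual matching logic is far richer (biometrics, fuzzy name matching, unstructured text) and is not reproduced here.

```python
from collections import defaultdict


def link_cases(cases):
    """Cluster case records into person-centric groups: any two cases that
    share an identifying attribute value are linked, and links are
    transitive across encounters (union-find).  Toy schema, for
    illustration only."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_case_with = {}
    for case_id, attrs in cases.items():
        for attr in attrs.items():  # e.g. ("contact_phone", "202-555-0100")
            if attr in first_case_with:
                union(case_id, first_case_with[attr])
            else:
                first_case_with[attr] = case_id

    clusters = defaultdict(list)
    for case_id in cases:
        clusters[find(case_id)].append(case_id)
    return list(clusters.values())


# Two applications filed under different names share a U.S. point of
# contact, so they end up in the same person-centric cluster.
cases = {
    "visa-001": {"name": "A. Smith", "contact_phone": "202-555-0100"},
    "visa-002": {"name": "B. Jones", "contact_phone": "202-555-0100"},
    "visa-003": {"name": "C. Lee", "contact_phone": "703-555-0199"},
}
print(link_cases(cases))  # [['visa-001', 'visa-002'], ['visa-003']]
```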
For example, in November 2010, the average response time for the over 630,000 fingerprint checks submitted to DHS was 10.5 minutes. The average response time for the 588,000 fingerprint checks submitted to the FBI was 14.6 minutes. This near real-time collection, distribution, and analysis of consular data is vital to those stakeholders who rely on consular data to make informed decisions.

An Improved Architecture: The CCD's architecture is designed to be flexible and scalable. It uses the latest generation of technologies and methodologies to enable rapid capture, distribution, and analysis of the massive amount of data collected by CA. The CCD captures data from 270 posts around the world and replicates that data using Oracle Multimaster Replication to a centralized repository in near real time. This architecture replaces the stove-piped concept of the past with a web-enabled, directly accessible database platform. The CCD connects users to their data via a single-platform design that is forward-looking and easily integrated with external systems.

Before the CCD, consular data resided on a decentralized global network of approximately 270 consular posts supported by independent, in-house systems. These systems contained all of the significant inefficiencies inherent when data resources are structurally isolated and widely distributed. Management reporting was inefficient. O&M costs were burdened with the necessity of delivering services individually to each post. The old architecture created formidable logistical and fiscal hurdles. Sharing data between posts and with partners was difficult and time-consuming. The inability to rapidly share information and to obtain early access to application data negatively impacted fraud detection and prevention.

The new CCD architecture consolidated the individual data assets of each post into a design that incorporated advanced infrastructure components. This forward-looking model has the flexibility needed for system modifications, the easy integration of new stakeholders, and the ability to make use of future technologies. Today, according to Oracle, the CCD is the "largest connected/replicating database structure in the government." The CCD is economical, too: the architecture has saved CA $1.4 million annually. It has also established an enviable green profile, making possible the elimination of an entire Data Share Group with a hardware reduction of 100 servers, an 80% reduction in passport database servers, and reduced support costs by eliminating entire storage networks.

Making Sense of Consular Data: The data contained in the CCD would mean little to the consular officer adjudicating a visa application, or to a Customs and Border Protection agent at a border crossing, if not for the CCD's ability to make sense of the enormity of the data (over 115 terabytes) it contains. The CCD can pre-screen a visa record before an adjudicator even looks at it. The CCD provides the means to conduct vetting checks against various government databases. The CCD contains powerful analytical tools and a set of custom-built services that allow users to do everything from sending a mass email to American citizens abroad to tracking fraud investigations. In short, the CCD is a one-stop shop for collecting, analyzing, and making informed use of consular data. Consulate staffs and Customs and Border Protection agents are under immense pressure to do thorough and accurate identity and background checks on both citizens and non-citizens.
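For planning purposes, the growth figures quoted in this note (roughly 115 TB today, 6-8 TB added per month) translate into a simple capacity projection; the sketch below is only arithmetic on those published numbers, not a DoS estimate.

```python
def projected_ccd_size_tb(months, baseline_tb=115, monthly_growth_tb=(6, 8)):
    """Project the CCD's size using the figures cited in this note:
    roughly 115 TB today, growing 6-8 TB per month.  Returns a (low, high)
    range in terabytes after the given number of months."""
    low, high = monthly_growth_tb
    return baseline_tb + low * months, baseline_tb + high * months


# A year out, the repository would need roughly 187-211 TB of capacity.
print(projected_ccd_size_tb(12))  # (187, 211)
```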
In this age of international terrorism, the success and accuracy of staff decisions has critical implications for the security of the United States. The CCD gives its users the tools and data to make informed decisions.

Before an adjudicator looks at a visa applicant's record, the CCD has already done much of the pre-processing automatically. Rather than an adjudicator sorting through terabytes of data, the CCD has already sorted through the more than 115 terabytes of data and made the connections that are simply impossible for an individual user to make. At each encounter with an applicant for a visa or passport, the CCD automatically establishes links between all cases involving that applicant and other potentially matching cases, enabling the detection of potential fraud or national security threats. For example, the CCD can base these links on the applicant using the same point of contact in the United States that was used on another case. The CCD can establish links based on the results of a biometric check, such as fingerprints or facial recognition. The CCD can even establish links using unstructured data by searching for certain text strings and linking records in which these strings appear. The CCD examines every conceivable combination of data elements when looking for potential matches. Due to the wide variety of resources used by the CCD, the system can establish links between two applicants using completely different names. With each subsequent encounter, the CCD creates additional links, resulting in a searchable, fully cross-referenced web of information that traces a person's activities across all CA data. By seeing these links in a person-centric view, adjudicators have a broader, more complete, and more easily accessible set of data with which to make better-informed decisions.

The CCD automatically initiates biometric checks, including fingerprint checks and facial recognition checks. The CCD can also automatically look for possible fraud indicators in the data the applicant provided in his or her application. The CCD can then alert the adjudicator to look into these indicators, saving the adjudicator time. If the adjudicator finds a case of potential fraud, he or she can refer the case for fraud investigation right from the CCD. The fraud investigator can record the results of his or her investigation in the CCD and has access to all of the analytical tools and biometric checks available in the system. The CCD captures all of the data entered during the process and automatically creates cross-references using the new data. The CCD completes the loop.

When a CCD user pulls up an applicant record in the CCD, he or she will see much more than just the applicant's biographical data and the current status of the case. The user can see the results of all of the background checks that the adjudicator ran. The user can see all of the previous visa or passport records for that applicant. The user can see all of the applicant's images, the applicant's fingerprints, and even a list of other CCD records that are linked to the applicant in one way or another. The CCD makes all of the information related to a case accessible in a single, consolidated view.

In fact, the CCD is so easy to use that each month its users run 20 million reports, generate 120 million hits, process 1 million applicants, conduct 4 million facial recognition searches, submit 800,000 fingerprint check requests, and much more. Users add 6-8 terabytes of data to the CCD each month.
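One of the more interesting links mentioned above is the one made through unstructured data. A toy version of that idea — extract candidate strings such as phone numbers from free-form case notes, then group the cases in which the same string appears — might look like the following. The notes, field names, and extraction rule are all invented for illustration; they are not the CCD's actual rules.

```python
import re
from collections import defaultdict


def link_by_text_strings(case_notes, min_len=7):
    """Illustrative sketch of linking records through unstructured text:
    pull candidate strings (here, just digit sequences such as phone
    numbers) out of free-form case notes, then group the cases in which
    the same string appears."""
    index = defaultdict(set)
    pattern = rf"\d{{{min_len},}}"  # runs of at least min_len digits
    for case_id, notes in case_notes.items():
        # Strip punctuation and spacing so "202-555-0147" and "(202) 555-0147"
        # normalize to the same digit run before matching.
        normalized = re.sub(r"[^0-9a-zA-Z]", "", notes)
        for token in re.findall(pattern, normalized):
            index[token].add(case_id)
    # Only strings that appear in more than one case are interesting links.
    return {token: sorted(ids) for token, ids in index.items() if len(ids) > 1}


notes = {
    "case-101": "Applicant's US contact reachable at (202) 555-0147.",
    "case-202": "Sponsor phone 202-555-0147; different applicant name.",
    "case-303": "No phone given.",
}
print(link_by_text_strings(notes))  # {'2025550147': ['case-101', 'case-202']}
```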
Without the robust functionality built into the CCD, this workload would be unimaginable.

Conclusion: Before the CCD, visa and U.S. passport application data were located on independent databases, making data sharing within the Department of State and with its national security partner agencies difficult. CA needed to maximize the accuracy and availability of consular data by creating a single, consolidated database. CA needed a state-of-the-art data archiving and data-sharing platform that provided rapid access to data and that enabled the fluid exchange of information, while reducing expenses and encouraging inter-agency collaboration. The CCD has transformed CA's mission delivery by breaking the paradigm of data isolated in independent databases. The CCD is a single platform of common, trusted data. The CCD uses a simplified, robust, and innovative network architecture that has streamlined CA's physical IT infrastructure. The CCD today consolidates data from posts all over the world into a central repository that is over 115 terabytes in size and growing by 6-8 terabytes each month. In terms of both improved resource use and enhanced national security through better data analysis, it is impossible to overstate the benefits that the CCD brings to CA. The CCD allows staff to focus its time on better customer service, investigative activities, and analysis. CA's technical achievement with the CCD has been to create a robust, economical, and analytically powerful data platform in an environment where fragmentation and inefficiency had been the norm.