Big Data Challenges
in the DoD and IC
Wes Caldwell
Chief Architect
Intelligent Software Solutions
Topics
• Introduction to ISS
• The growth of data
• Our customer’s data environment
• The need for effective big-data management
• Search as the cornerstone of a big-data strategy
About ISS
• Headquartered in Colorado Springs
• Other offices located in Washington DC, Hampton VA,
Tampa FL, and Rome NY
• Innovative Solutions from “Space to Mud
and Everything Between”
• Sole prime on multiple Air Force Research Labs
programs IDIQ
• Currently Executing More Than 100 Software
Development Projects
• Over 800 employees
• Strength in Solutions Development and
Deployment
• Consistently Recognized as a Leader
• Recognized as a Deloitte Fast 50 Colorado
company and a Deloitte Fast 500 company over
eight consecutive years
• Three-time Inc. Magazine 500 winner
• 2009 Defense Company of the Year
ISS Solution Space/Value Proposition
• Reusable and license-free to US
Federal Government (GOTS)
• Committed to providing best ROI
to our customers by integrating
leading open-source solutions into
our products and services
• Scalable from a single desktop
solution to large distributed
networks with thousands of users
• Customizable to each
organization’s unique analytical and
information technology
infrastructure
• Operationally proven, secure and
accredited for all major classified
networks
ISS Business Strategy
Government
Off The Shelf
(GOTS)
Commercial
Off The Shelf
(COTS)
Subject
MatterExperts
(SMEs)
• Low Barrier to Entry: No license
fees to US Government Agencies
• Fast: Proven baseline provides
immediate capability
• Turnkey: Highly customizable
solutions can be implemented
quickly with no development
• Solutions Oriented: Subject Matter
Experts support implementation in
each domain
• Low Cost: Cost of Adding Features
is shared across large customer
base; all customers benefit
Blending the best elements of each industry model to provide low risk,
nonproprietary, high payoff solutions—fast! 6
The growth of data
• Most electronic information is not relational,
but unstructured (textual, binary) or semi-
structured (spreadsheet, RSS feed, etc.)
– In 2007, the estimated information content of all
human knowledge was 295 exabytes(295 million
terabytes)
– Data production will be 44 times greater in 2020
than in 2009
• Approx 35 zetabytes total (35 billion terabytes)
• A majority of the data produced in the future will
be unstructured
– A tremendous amount of information and
knowledge is dormant within unstructured data
Our customer’s data environment
• Literally thousands of data sources/feeds
from a variety of strategic, national, and
tactical sources
– Media (documents, images, etc.)
– Human interactions
– Geospatial
– Open Source (News feeds, RSS)
– Imagery/Video
– Many more…
How our analysts feel
The need for effective “big-data” management
• Analysts are looking to extract knowledge from the massive heterogeneous
data sets, providing “actionable intelligence”
• Tactical environments absolutely demand effective management of data
– Time to live on the relevance of data collected can be very short
– Communications pipes aren’t as optimal as large CONUS-based data
centers, so reduction of data based on tactical conditions (i.e. AOR,
Problem Domain, etc.) is critical
• Search and Analytics are key enablers to allow an analyst to reliably search
through large amounts of information, and to focus their efforts around a
subset of that information to perform deeper analysis
Search IS the cornerstone of an effective big-data strategy
Structured Content
Semi-Structured
Content
Un-Structured
Content
Content Cache
(Haystacks)
Content Acquisition
Tenets
• Connector architecture
• Data normalization
• Data staging
• Data Compartmenting
(Multiple Haystacks)
Tenets
• Optimized Index of Content
for Search and Discovery of
Big Data
• Analyst Topics that “Shrink
the Haystack” Search
Features (Facets, Auto-
Complete, Tagging,
Comments, etc.)
• Semantic (Synonym) Search
based on pluggable
taxonomies
Search/Discovery
Content Index
NLP Pipeline
Semantic Enrichment
Categorization
Named
Entity
Recognition
Clustering
Gazetteers
Tenets
• “Domain Spaces” that
support pluggable entity
recognition and
categorization
• Continuous feedback loop
that improves the system
over time with analyst
input
• Lexicon-based analytics
that allows for targeted
categorization across
corpus of data
Tenets
• Data Reduction into
focused “Data
Perspectives”
• Data perspectives stored
in optimized formats
(e.g. Graph, Time Series,
Geo, etc.) for the
questions being asked
• Leveraging industry-
standard parallel
processing frameworks
for scalable analytics
Data Perspectives
Data
How can Search help you?
Have a great conference!!!

Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC

  • 2.
    Big Data Challenges inthe DoD and IC Wes Caldwell Chief Architect Intelligent Software Solutions
  • 3.
    Topics • Introduction toISS • The growth of data • Our customer’s data environment • The need for effective big-data management • Search as the cornerstone of a big-data strategy
  • 4.
    About ISS • Headquarteredin Colorado Springs • Other offices located in Washington DC, Hampton VA, Tampa FL, and Rome NY • Innovative Solutions from “Space to Mud and Everything Between” • Sole prime on multiple Air Force Research Labs programs IDIQ • Currently Executing More Than 100 Software Development Projects • Over 800 employees • Strength in Solutions Development and Deployment • Consistently Recognized as a Leader • Recognized as a Deloitte Fast 50 Colorado company and a Deloitte Fast 500 company over eight consecutive years • Three-time Inc. Magazine 500 winner • 2009 Defense Company of the Year
  • 5.
    ISS Solution Space/ValueProposition • Reusable and license-free to US Federal Government (GOTS) • Committed to providing best ROI to our customers by integrating leading open-source solutions into our products and services • Scalable from a single desktop solution to large distributed networks with thousands of users • Customizable to each organization’s unique analytical and information technology infrastructure • Operationally proven, secure and accredited for all major classified networks
  • 6.
    ISS Business Strategy Government OffThe Shelf (GOTS) Commercial Off The Shelf (COTS) Subject MatterExperts (SMEs) • Low Barrier to Entry: No license fees to US Government Agencies • Fast: Proven baseline provides immediate capability • Turnkey: Highly customizable solutions can be implemented quickly with no development • Solutions Oriented: Subject Matter Experts support implementation in each domain • Low Cost: Cost of Adding Features is shared across large customer base; all customers benefit Blending the best elements of each industry model to provide low risk, nonproprietary, high payoff solutions—fast! 6
  • 7.
    The growth ofdata • Most electronic information is not relational, but unstructured (textual, binary) or semi- structured (spreadsheet, RSS feed, etc.) – In 2007, the estimated information content of all human knowledge was 295 exabytes(295 million terabytes) – Data production will be 44 times greater in 2020 than in 2009 • Approx 35 zetabytes total (35 billion terabytes) • A majority of the data produced in the future will be unstructured – A tremendous amount of information and knowledge is dormant within unstructured data
  • 8.
    Our customer’s dataenvironment • Literally thousands of data sources/feeds from a variety of strategic, national, and tactical sources – Media (documents, images, etc.) – Human interactions – Geospatial – Open Source (News feeds, RSS) – Imagery/Video – Many more…
  • 9.
  • 10.
    The need foreffective “big-data” management • Analysts are looking to extract knowledge from the massive heterogeneous data sets, providing “actionable intelligence” • Tactical environments absolutely demand effective management of data – Time to live on the relevance of data collected can be very short – Communications pipes aren’t as optimal as large CONUS-based data centers, so reduction of data based on tactical conditions (i.e. AOR, Problem Domain, etc.) is critical • Search and Analytics are key enablers to allow an analyst to reliably search through large amounts of information, and to focus their efforts around a subset of that information to perform deeper analysis
  • 11.
    Search IS thecornerstone of an effective big-data strategy Structured Content Semi-Structured Content Un-Structured Content Content Cache (Haystacks) Content Acquisition Tenets • Connector architecture • Data normalization • Data staging • Data Compartmenting (Multiple Haystacks) Tenets • Optimized Index of Content for Search and Discovery of Big Data • Analyst Topics that “Shrink the Haystack” Search Features (Facets, Auto- Complete, Tagging, Comments, etc.) • Semantic (Synonym) Search based on pluggable taxonomies Search/Discovery Content Index NLP Pipeline Semantic Enrichment Categorization Named Entity Recognition Clustering Gazetteers Tenets • “Domain Spaces” that support pluggable entity recognition and categorization • Continuous feedback loop that improves the system over time with analyst input • Lexicon-based analytics that allows for targeted categorization across corpus of data Tenets • Data Reduction into focused “Data Perspectives” • Data perspectives stored in optimized formats (e.g. Graph, Time Series, Geo, etc.) for the questions being asked • Leveraging industry- standard parallel processing frameworks for scalable analytics Data Perspectives Data
  • 12.
    How can Searchhelp you? Have a great conference!!!