Boost PC performance: How more available memory can improve productivity
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
1.
2. Big Data Challenges
in the DoD and IC
Wes Caldwell
Chief Architect
Intelligent Software Solutions
3. Topics
• Introduction to ISS
• The growth of data
• Our customer’s data environment
• The need for effective big-data management
• Search as the cornerstone of a big-data strategy
4. About ISS
• Headquartered in Colorado Springs
• Other offices located in Washington DC, Hampton VA,
Tampa FL, and Rome NY
• Innovative Solutions from “Space to Mud
and Everything Between”
• Sole prime on multiple Air Force Research Labs
programs IDIQ
• Currently Executing More Than 100 Software
Development Projects
• Over 800 employees
• Strength in Solutions Development and
Deployment
• Consistently Recognized as a Leader
• Recognized as a Deloitte Fast 50 Colorado
company and a Deloitte Fast 500 company over
eight consecutive years
• Three-time Inc. Magazine 500 winner
• 2009 Defense Company of the Year
5. ISS Solution Space/Value Proposition
• Reusable and license-free to US
Federal Government (GOTS)
• Committed to providing best ROI
to our customers by integrating
leading open-source solutions into
our products and services
• Scalable from a single desktop
solution to large distributed
networks with thousands of users
• Customizable to each
organization’s unique analytical and
information technology
infrastructure
• Operationally proven, secure and
accredited for all major classified
networks
6. ISS Business Strategy
Government
Off The Shelf
(GOTS)
Commercial
Off The Shelf
(COTS)
Subject
MatterExperts
(SMEs)
• Low Barrier to Entry: No license
fees to US Government Agencies
• Fast: Proven baseline provides
immediate capability
• Turnkey: Highly customizable
solutions can be implemented
quickly with no development
• Solutions Oriented: Subject Matter
Experts support implementation in
each domain
• Low Cost: Cost of Adding Features
is shared across large customer
base; all customers benefit
Blending the best elements of each industry model to provide low risk,
nonproprietary, high payoff solutions—fast! 6
7. The growth of data
• Most electronic information is not relational,
but unstructured (textual, binary) or semi-
structured (spreadsheet, RSS feed, etc.)
– In 2007, the estimated information content of all
human knowledge was 295 exabytes(295 million
terabytes)
– Data production will be 44 times greater in 2020
than in 2009
• Approx 35 zetabytes total (35 billion terabytes)
• A majority of the data produced in the future will
be unstructured
– A tremendous amount of information and
knowledge is dormant within unstructured data
8. Our customer’s data environment
• Literally thousands of data sources/feeds
from a variety of strategic, national, and
tactical sources
– Media (documents, images, etc.)
– Human interactions
– Geospatial
– Open Source (News feeds, RSS)
– Imagery/Video
– Many more…
10. The need for effective “big-data” management
• Analysts are looking to extract knowledge from the massive heterogeneous
data sets, providing “actionable intelligence”
• Tactical environments absolutely demand effective management of data
– Time to live on the relevance of data collected can be very short
– Communications pipes aren’t as optimal as large CONUS-based data
centers, so reduction of data based on tactical conditions (i.e. AOR,
Problem Domain, etc.) is critical
• Search and Analytics are key enablers to allow an analyst to reliably search
through large amounts of information, and to focus their efforts around a
subset of that information to perform deeper analysis
11. Search IS the cornerstone of an effective big-data strategy
Structured Content
Semi-Structured
Content
Un-Structured
Content
Content Cache
(Haystacks)
Content Acquisition
Tenets
• Connector architecture
• Data normalization
• Data staging
• Data Compartmenting
(Multiple Haystacks)
Tenets
• Optimized Index of Content
for Search and Discovery of
Big Data
• Analyst Topics that “Shrink
the Haystack” Search
Features (Facets, Auto-
Complete, Tagging,
Comments, etc.)
• Semantic (Synonym) Search
based on pluggable
taxonomies
Search/Discovery
Content Index
NLP Pipeline
Semantic Enrichment
Categorization
Named
Entity
Recognition
Clustering
Gazetteers
Tenets
• “Domain Spaces” that
support pluggable entity
recognition and
categorization
• Continuous feedback loop
that improves the system
over time with analyst
input
• Lexicon-based analytics
that allows for targeted
categorization across
corpus of data
Tenets
• Data Reduction into
focused “Data
Perspectives”
• Data perspectives stored
in optimized formats
(e.g. Graph, Time Series,
Geo, etc.) for the
questions being asked
• Leveraging industry-
standard parallel
processing frameworks
for scalable analytics
Data Perspectives
Data