Data Exploration
Jim Hendler
Director, Rensselaer Institute for
Data Exploration and Applications

THE RENSSELAER IDEA
Rensselaer Polytechnic Institute, USA
http://www.cs.rpi.edu/~hendler
Data-driven research areas at RPI
• Data-driven Medical and Healthcare Applications
• Predictive Models for Business and Economics
• “Biome” studies for Built and Natural Environments
• Question Answering from texts and data
• Resiliency Models for Population-Scale Problems and cybersecurity domains
• Semantically-enabled Data Services for Science and Engineering Research
• Materials genome and nano-manufacturing informatics
• Platforms for testing Policy and Open Data issues
• …
IDEA
The Rensselaer IDEA: empowering our researchers
Application-specific data tools

Data discovery, integration, and interaction technologies

The trunk: Shared Data Technologies
High Performance Modeling and Simulation
• Center for Computational Innovation

Cognitive Computing
• Watson at Rensselaer IBM Partnership

Perceptualization
• Experimental Multimedia Performing Arts Center

Data Science
• Data Science Research Center
Roots: Data Exploration
Geekopedia: Data exploration helps a data consumer focus an information search on the pertinent
aspects of relevant data before true analysis can be achieved. In large data sets, data is not
gathered or controlled in a focused manner. Even in smaller data sets, data that are not gathered
with a rigid and specific technique can end up disorganized, with a myriad of subsets each…

Discover
Integrate
Validate
Explain

DATA
Data Exploration Challenges

Discover
Integrate
Validate
Explain
These needs live outside traditional data/info architectures
Discovery needs semantics

How do you find the Data you need?

Middle Eastern Terrorists for $800?
Discovery – there’s a lot out there

Discovery needs more than keywords

World Bank: Africa

Africover: Agriculture

Kenya: Agricultural

US Data.gov: Crop
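
The four catalog labels above describe related data under different words, so an exact keyword match finds only one of them. A minimal sketch of the difference, assuming a small hand-written vocabulary: the `VOCABULARY` mapping and both function names are hypothetical and only illustrate the idea. A real discovery service would draw its concepts from a shared ontology or thesaurus.

```python
# Catalog entries as listed on the slide: (source, label).
CATALOG = [
    ("World Bank", "Africa"),
    ("Africover", "Agriculture"),
    ("Kenya", "Agricultural"),
    ("US Data.gov", "Crop"),
]

# Hypothetical vocabulary (an assumption, not from the slides):
# concept name -> labels that are taken to denote that concept.
VOCABULARY = {
    "agriculture": {"agriculture", "agricultural", "crop"},
}


def keyword_search(term):
    """Return sources whose label equals the search term, ignoring case."""
    return [src for src, label in CATALOG if label.lower() == term.lower()]


def concept_search(term):
    """Return sources whose label shares a vocabulary concept with the term."""
    wanted = {term.lower()}
    for labels in VOCABULARY.values():
        if term.lower() in labels:
            wanted |= labels
    return [src for src, label in CATALOG if label.lower() in wanted]


print(keyword_search("agriculture"))  # only the exact label matches
print(concept_search("agriculture"))  # related labels are found as well
```

The geographic link between "Africa" and "Kenya" is deliberately left out; capturing it would need a second kind of relation (containment rather than synonymy), which is part of why keywords alone fall short.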
Integration needs Semantics

Person
Campus Personnel
RIN: 660125137
Address #: 1118
Address St: Pinehurst
Address zip: 12203

Course topic: CSCI
Course #: 4961
RPI ID: 660125137
Name: Hendler

Campus Classes
CRN: 1118
Name: Intro to Physics

RIN 660125137 and RPI ID 660125137: YES
Address # 1118 and CRN 1118: NO!!!!
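
The example on this slide can be sketched in code: matching on values alone links the RIN to the RPI ID (correct) but also links the house number 1118 to the CRN 1118 (wrong). The sketch below assumes each field carries a semantic type; the type names such as `rpi:PersonIdentifier` are hypothetical labels invented for illustration, not terms from any published vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str     # column name as it appears in the source table
    meaning: str  # hypothetical semantic type (an assumption for this sketch)
    value: str


# Values taken from the slide; the semantic types are illustrative only.
personnel = [
    Field("RIN", "rpi:PersonIdentifier", "660125137"),
    Field("Address #", "addr:HouseNumber", "1118"),
]
other_tables = [
    Field("RPI ID", "rpi:PersonIdentifier", "660125137"),
    Field("CRN", "rpi:CourseRegistrationNumber", "1118"),
]


def value_matches(left, right):
    """Pair up fields whose values are equal, ignoring what they mean."""
    return [(a.name, b.name) for a in left for b in right if a.value == b.value]


def semantic_matches(left, right):
    """Pair up fields only when both the value and the semantic type agree."""
    return [
        (a.name, b.name)
        for a in left
        for b in right
        if a.value == b.value and a.meaning == b.meaning
    ]


print(value_matches(personnel, other_tables))     # includes the false match
print(semantic_matches(personnel, other_tables))  # keeps only RIN / RPI ID
```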
Semantic Web and Linked Data (UK)

Royal Mail

County Council

IOGDC Open Data Tutorial

Ordnance Survey

Data Mashups

http://logd.tw.rpi.edu
Validation needs semantics

Easy for us
Hard for machines…

Head to head comparison shows that burglaries in Avon
and Somerset (UK) far exceed those in Los Angeles,
California

Data + everything else you know

Same or different?

Do the terms mean the same? Are they collected in the same way? Are
they processed differently? …
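
One way to make these questions machine-checkable is to carry the answers as metadata alongside each figure and refuse to compare figures whose metadata disagree. This is a minimal sketch under that assumption; the `Statistic` class, its field names, and all values below are hypothetical placeholders, not real crime statistics or real definitions.

```python
from dataclasses import dataclass

# Metadata fields that must agree before two figures are compared.
CHECKED = ("term_definition", "collection_method", "processing")


@dataclass(frozen=True)
class Statistic:
    region: str
    value: float
    term_definition: str    # what the counted term means in this source
    collection_method: str  # how the figure was collected
    processing: str         # how the raw figure was processed


def comparability_issues(a, b):
    """List the metadata fields on which the two statistics disagree."""
    return [name for name in CHECKED if getattr(a, name) != getattr(b, name)]


def compare(a, b):
    """Return a.value - b.value, but only if the metadata agree."""
    issues = comparability_issues(a, b)
    if issues:
        raise ValueError(f"not comparable, differing: {', '.join(issues)}")
    return a.value - b.value


# Placeholder values only: the counts and definitions are invented.
uk = Statistic("Avon and Somerset", 100.0, "definition A", "police records", "per capita")
us = Statistic("Los Angeles", 50.0, "definition B", "police records", "per capita")
print(comparability_issues(uk, us))  # the term definitions differ
```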
Validation/Explanation need knowledge

Trends in Smoking Prevalence, Tobacco Policy
Coverage and Tobacco Prices (1991-2007)

Statistical correlation needs explanation
Explanation also needs Semantics

Inference Web: McGuinness – various DoD/IC projects
Closing the loop: where do the semantics come from?
How do we go from the predictive analytics of Big Data to models/explanations that allow new understanding?

Data
Prediction
Design

Model

1. Better tools for Analytics, Agents and HPC

Make the tools and algorithms being developed by RPI
researchers more “reusable” and multitask (including
HPC data-analytic tools)
2. Next-Gen Visualization (at scale)

How can multi-modal, multi-user, large scale sensory (visualization,
sonification, haptics) interaction change the way we understand data?
3. Include “agents” in the modeling

Develop technologies that enable researchers to work with “human-based” data at larger scales and in new ways
• Population-scale computing models for agent-based simulations
Approach
Platform: Research in using supercomputers for discrete modeling
• Carothers’ ROSS model

KR Model:
• Weaver’s restricted rules on graphs

Challenge problem:
• Classification algorithms at petaflop scale
• “Logical” (nonlinear, discontinuous) agents
4. Exploit Cognitive Computing
IDEA will be the hub of Rensselaer’s cognitive computing research
• e.g., answer questions such as “Why” and “How” integrated with large scale simulations

Watson’s parallel model

© Making Watson Fast, IBM J Res and Dev, 3/4 2012

Distributed (coarse-grained) parallelism
Cognitive Computing at Scale
DeepQA type approach best on large clusters

(Physical) Simulation runs on supercomputers

Approach: link these computational models

Surmise (unproven): Cognitive Computing on a fast (large) cluster
can query computations run against data generated by simulations
(physical or agent-based) on the supercomputer
5. Data services will provide synergy across disciplines
• Semantics is a key technology for common data services

People
• Agency Policy Makers
• System Scientists
• Politicians

Decision-level semantic mediation: high-level vocabularies that facilitate policy-level decision-making

Integrated Applications (semantic interoperability)
• Inter-disciplinary Data Visualization Apps
• Integration Frameworks & Methodologies
• Eco & other system Assessment Apps

Application-level semantic mediation: mid-level vocabularies that facilitate the interoperability of system models and data products

Software, Tools & Apps (semantic interoperability; semantic query, hypothesis and inference; query, access and use of data)
• Discipline-specific model(s)
• Data-product Generator
• Information/Science Apps

Data-level semantic mediation: lower-level vocabularies applied to each data source for a specific science domain of interest

Data Repositories (metadata, schema, data)
• Federal Repository
• Commercial Database
• Researcher Private Database
• Other Data Sources
... ... ...

Discovery, Integration, Validation, Curation, Citation, Archiving …
Conclusions
• The “warehouse” is only a small part of the data ecosystem
• Database technologies are only part of the story
• Discovery, integration, …, validation, explanation are key to solving problems with data

• Closing the loop means “exploring” our data
• Humans are still a key player in this

• The Rensselaer IDEA will explore
• Data-driven applications and tools, but also…
• … multimodal visualization, multiscale and agent modeling, cognitive computing, and semantic data platforms
Rensselaer Institute for Data
Exploration and Applications

The Rensselaer IDEA: Data Exploration
