The Rensselaer IDEA: Data Exploration


The Rensselaer Institute for Data Exploration and Applications is addressing new modes of data exploration and integration to enhance the work of researchers on campus and beyond. This talk outlines the "data exploration" technologies being explored.

Published in: Education, Technology

  1. Data Exploration. Jim Hendler, Director, Rensselaer Institute for Data Exploration and Applications (THE RENSSELAER IDEA), Rensselaer Polytechnic Institute, USA
  2. Data-driven research areas at RPI • Data-driven Medical and Healthcare Applications • Predictive Models for Business and Economics • “Biome” studies for Built and Natural Environments • Question Answering from texts and data • Resiliency Models for Population-Scale Problems and cybersecurity domains • Semantically-enabled Data Services for Science and Engineering Research • Materials genome and nano-manufacturing informatics • Platforms for testing Policy and Open Data issues • …
  3. The Rensselaer IDEA: empowering our researchers • Application-specific data tools • Data discovery, integration, and interaction technologies
  4. The trunk: Shared Data Technologies. High Performance Modeling and Simulation • Center for Computational Innovation. Cognitive Computing • Watson at Rensselaer IBM Partnership. Perceptualization • Experimental Multimedia Performing Arts Center. Data Science • Data Science Research Center
  5. Roots: Data Exploration. Geekopedia: “Data exploration helps a data consumer focus an information search on the pertinent aspect of relevant data before true analysis can be achieved. In large data sets, data is not gathered or controlled in a focused manner. Even in smaller data sets, it is also true that data gathered are not in a very rigid and specific technique can result in a disorganized manner and a myriad of subsets each…” Diagram labels: Discover, Integrate, Validate, Explain, DATA
  6. Data Exploration Challenges: Discover, Integrate, Validate, Explain. These needs live outside traditional data/info architectures.
  7. Discovery needs semantics. How do you find the data you need? “Middle Eastern Terrorists for $800?”
  8. Discovery – there’s a lot out there
  9. Discovery needs more than keywords. World Bank: Africa; Africover: Agriculture; Kenya: Agricultural; US Crop
  10. Integration needs Semantics. Tables shown: Person; Campus Personnel; Campus Classes. Fields shown: RIN 660125137; Address # 1118; Address St Pinehurst; Address zip 12203; Course topic CSCI; Course # 4961; RPI ID 660125137; Name Hendler; CRN 1118; Name Intro to Physics. Annotations on the links between matching values: YES; NO!!!!
  11. Semantic Web and Linked Data (UK): Royal Mail, County Council, Ordnance Survey. Source: IOGDC Open Data Tutorial
  12. Data Mashups
  13. Validation needs semantics. Easy for us
  14. Hard for machines… Head-to-head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California.
  15. Data + everything else you know. Same or different? Do the terms mean the same? Are they collected in the same way? Are they processed differently? …
  16. Validation/Explanation need knowledge. Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007). Statistical correlation needs explanation.
  17. Explanation also needs Semantics. Inference Web: McGuinness – various DoD/IC projects
  18. Closing the loop: where do the semantics come from? How do we go from the predictive analytics of Big Data to models/explanations that allow new understanding? Diagram labels: Data, Prediction, Design, Model
  19. 1. Better tools for Analytics, Agents and HPC. Make the tools and algorithms being developed by RPI researchers more “reusable” and multitask (including HPC data-analytic tools).
  20. 2. Next-Gen Visualization (at scale). How can multi-modal, multi-user, large-scale sensory (visualization, sonification, haptics) interaction change the way we understand data?
  21. 3. Include “agents” in the modeling. Develop technologies that enable researchers to work with “human-based” data at larger scales and in new ways • Population-scale computing models for agent-based simulations
  22. Approach. Platform: research in using supercomputers for discrete modeling • Carothers’ ROSS model. KR Model: • Weaver’s restricted rules on graphs. Challenge problem: • Classification algorithms at petaflop scale • “Logical” (nonlinear, discontinuous) agents
  23. 4. Exploit Cognitive Computing. IDEA will be the hub of Rensselaer’s cognitive-computing research • e.g., answer questions such as “Why” and “How”, integrated with large-scale simulations
  24. Watson’s parallel model: distributed (coarse-grained) parallelism. © Making Watson Fast, IBM J Res and Dev, 3/4 2012
  25. Cognitive Computing at Scale. DeepQA-type approach best on large clusters. (Physical) simulation runs on supercomputers.
  26. Approach: link these computational models. Surmise (unproven): Cognitive Computing on a fast (large) cluster can query computations run against data generated by simulations (physical or agent-based) on the supercomputer.
  27. 5. Data services will provide synergy across disciplines • Semantics is a key technology for common data services. [Figure: layered semantic mediation. People: Agency Policy Makers, System Scientists, Politicians. Decision-level semantic mediation: high-level vocabularies that facilitate policy-level decision-making. Integrated Applications: Inter-disciplinary Data Visualization Apps, Integration Frameworks & Methodologies, Eco & other system Assessment Apps (semantic interoperability). Application-level semantic mediation: mid-level vocabularies that facilitate the interoperability of system models and data products. Software, Tools & Apps: Discipline-specific model(s), Data-product Generator, Information/Science Apps (semantic interoperability; semantic query, hypothesis and inference; query, access and use of data). Data-level semantic mediation: lower-level vocabularies applied to each data source for a specific science domain of interest. Data Repositories: Federal Repository, Commercial Database, Researcher Private Database, Other Data Sources (metadata, schema, data). Discovery, Integration, Validation, Curation, Citation, Archiving …]
  28. Conclusions • The “warehouse” is only a small part of the data ecosystem • Database technologies are only part of the story • Discovery, Integration, …, validation, explanation are key to solving problems with data • Closing the loop means “exploring” our data • Humans are still a key player in this • The Rensselaer IDEA will explore • Data-driven applications and tools, but also… • … multimodal visualization, multiscale and agent modeling, cognitive computing, and semantic data platforms
  29. Rensselaer Institute for Data Exploration and Applications
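
A rough illustration of the point on slide 10 ("Integration needs Semantics"): the Python sketch below links fields across tables first by raw value alone, then by value plus a semantic label. The field values are taken from the slide; everything else is an assumption made for illustration, including which fields belong to which table (the flattened slide does not make this clear), the concept labels, and the names `Field`, `value_links`, and `semantic_links`. It is not the method used in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    table: str    # table name from the slide (the grouping is a guess)
    name: str     # column label from the slide
    concept: str  # hypothetical semantic label, not from the talk
    value: str    # value shown on the slide


# Values come from slide 10; table grouping and concept labels are assumptions.
FIELDS = [
    Field("Campus Personnel", "RIN", "PersonID", "660125137"),
    Field("Campus Personnel", "Address #", "StreetNumber", "1118"),
    Field("Person", "RPI ID", "PersonID", "660125137"),
    Field("Person", "Name", "PersonName", "Hendler"),
    Field("Campus Classes", "CRN", "CourseID", "1118"),
    Field("Campus Classes", "Name", "CourseTitle", "Intro to Physics"),
]


def value_links(fields):
    """Link fields from different tables whenever their raw values are equal."""
    return [
        (a.name, b.name)
        for i, a in enumerate(fields)
        for b in fields[i + 1:]
        if a.table != b.table and a.value == b.value
    ]


def semantic_links(fields):
    """Keep only the value matches whose fields also share a concept label."""
    return [
        (a.name, b.name)
        for i, a in enumerate(fields)
        for b in fields[i + 1:]
        if a.table != b.table and a.value == b.value and a.concept == b.concept
    ]


if __name__ == "__main__":
    print("value only:", value_links(FIELDS))
    print("value + concept:", semantic_links(FIELDS))
```

Matching on values alone links RIN to RPI ID (the slide's "YES") but also links Address # to CRN, because both happen to be 1118 (the slide's "NO!!!!"). Requiring the concept labels to agree keeps the first link and drops the second.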