
The Science of Data Science


This talk presents areas of investigation underway at the Rensselaer Institute for Data Exploration and Applications. First presented at Flipkart, Bangalore India, 3/2015.

Published in: Technology

  1. The Science of Data Science (Data plus Semantics yields Knowledge). Prof. James Hendler, Tetherless World Constellation Chair of Computer, Web and Cognitive Sciences; Director, The Rensselaer IDEA
  2. The Rensselaer Institute for Data Exploration and Applications. Performance Plan to Budget Presentation, February 2015. The Rensselaer Institute for Data Exploration and Applications (IDEA) is a breakthrough initiative that brings together key research areas and advanced technologies to revolutionize the way we use data in science, engineering, and virtually every other research and educational discipline. By bridging the gaps between analytics, modeling, and simulation, we continue the Rensselaer tradition as a leader in applying critical technologies to improving everyday life and meeting the challenges of the future.
  3. The Rensselaer Institute for Data Exploration and Applications
     • Business Systems
     • Built and Natural Environments
     • Cyber-Resiliency
     • Policy, Ethics and Stewardship
     • Materials Informatics
     • Data-driven Physical/Life Sciences
     • Healthcare Analytics and Mobile Health
     • Social Network Analytics
     • Agents and Augmented Reality
  4. IDEA project examples
     • Healthcare in Context: Data mining/analytics to improve public health from a systems perspective at the individual to national scales.
     • Data-Centric Engineering Design: Data-driven design and control under uncertainty via data fusion across multiple scales and sources.
     • Supply Chain Resilience through Information Visibility: Demonstrate uses of supply chain information visibility for anticipating, mitigating and recovering from disruptive events.
     • Accelerated design of functional materials/Material Ontology: Address basic materials processing data-based informatics for complex, multifunctional (often nano) materials.
     • Biome-informatics: Develop data aggregation and computational tools to integrate disparate datasets into large ecosystem models using data collected on the microbial communities that inhabit the base of most ecosystems.
     • Deducing Structure to Function in Biomedicine: Develop systematic data-resourced methods for discovering and exploiting structure-to-function relationships.
  5. KDD Pipeline – as usually presented: Data Storage (Big Data Warehouse)
  6. KDD Pipeline – in the real world. Data is increasingly being brought in from external sources, with mixed provenance, and increasingly outside the analyzers’ control, at increasing rates and scales.
     [Figure: many separate Data Storage nodes fed by sensors and apps, social media, customer behaviors, Web, partners, and open data; labeled steps: formatting, standards use, data cleansing, data bias analysis, …]
  7. Tough data integration challenges: Enterprise analytics; Open Data Integration. Hard problems!
  8. Closing the loop on (big) data. IDEA is focusing on key data science areas which are revolutionizing engineering, science and business with significant social impact: Predictive Analytics, Discovery Informatics, Data Exploration
  9. Theme 1: Predictive Analytics. From “what is” to “what if”
  10. Example: Healthcare Data Analytics. The Digital Universe of Data to Better Diagnose and Treat Patients (courtesy of Eric Schadt, Mount Sinai)
  11. Identifying predictive features in data
      Each factor must be separately analyzed for its “predictivity”
        • Mutual information measure
      The “black art” of predictive analytics is finding the right ones
        • Use too few, and the model is weak
        • Use too many, and the model becomes slow and dominated by noise
      Algorithms are required to do this because the overwhelming number of “weak” factors defies human abilities to combine them
        • Machine learning identifies key features
          • some require “roll ups”
          • some require “pull outs”
        • Mathematical techniques then reduce the dimensionality
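As a rough illustration of scoring each factor separately with a mutual information measure, here is a minimal sketch. It assumes discrete features and labels; the function names, feature names, and toy data are invented for the example and do not come from the talk.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi

def rank_features(columns, labels):
    """Score each named feature column against the labels, best first."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up toy data: one feature copies the label, the other is independent of it.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "copies_label": [0, 0, 1, 1, 0, 1, 0, 1],
    "independent":  [0, 1, 0, 1, 0, 0, 1, 1],
}
ranking = rank_features(columns, labels)
```

Here `ranking` lists `copies_label` first with a score of 1 bit and `independent` last with a score of 0. A per-factor score like this ignores interactions between factors, which is one reason the slide calls for algorithms that combine many weak factors.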
  12. Predictive analytics in sensors: Extend-o-hand (Josh Shinavier, PhD). Classification of the sensor data (via machine learning) allows predictive recognition of different gestures (i.e., before the gesture is finished).
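To make "recognition before the gesture is finished" concrete, here is a small sketch that classifies a partial sensor trace by comparing it with the matching prefix of stored templates. This nearest-template rule is a stand-in chosen for brevity, not the method used in Extend-o-hand; the gesture names and sample values are hypothetical.

```python
def prefix_distance(partial, template):
    """Mean squared difference between a partial trace and the same-length template prefix."""
    n = min(len(partial), len(template))
    if n == 0:
        raise ValueError("need at least one sample to compare")
    return sum((partial[i] - template[i]) ** 2 for i in range(n)) / n

def classify_prefix(partial, templates):
    """Return the label of the template whose prefix is closest to the partial trace."""
    return min(templates, key=lambda label: prefix_distance(partial, templates[label]))

# Hypothetical one-axis accelerometer templates, six samples per full gesture.
templates = {
    "wave":  [0.0, 0.8, -0.8, 0.8, -0.8, 0.0],
    "point": [0.0, 0.5, 1.0, 1.0, 1.0, 1.0],
}

# Only the first three samples of an unfinished gesture have arrived.
partial = [0.1, 0.6, 0.9]
early_guess = classify_prefix(partial, templates)
```

With these made-up numbers the partial trace is already much closer to the `point` template than to `wave`, so `early_guess` is `"point"` after half the gesture. A learned classifier would replace the fixed templates in practice.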
  13. Predictive analytics in large scale behaviors. “List clusters at risk for Asian Clams <1mile Cook’s Bay.” Machine learning predicts future distributions of invasive species in Lake George based on current distributions and bathymetry similarity.
  14. Predictive Social Network Analytics (with RPI NeST center). Social Networks in Action
      • Analyzing cascading failures
      • Modeling (supply chain) networks… and predicting (cascading) network risks
      • Modeling network stressors (including human cognitive element)
      • Understanding network dynamics
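One common way to study cascading failures in a supply-chain-like network is a load-redistribution model, sketched below. The model is an assumption made for illustration (a failed node's load is split equally among its surviving neighbors), and the node names, loads, and capacities are invented; the talk does not specify the model used by the NeST center.

```python
from collections import deque

def simulate_cascade(neighbors, load, capacity, initial_failure):
    """Return the set of nodes that fail after one node is knocked out."""
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        node = queue.popleft()
        alive = [m for m in neighbors[node] if m not in failed]
        if not alive:
            continue  # nobody left to absorb this node's load
        share = load[node] / len(alive)
        for m in alive:
            load[m] += share
        for m in alive:
            if load[m] > capacity[m]:
                failed.add(m)
                queue.append(m)
    return failed

# Hypothetical four-node chain: A - B - C - D
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
load = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
capacity = {"A": 5.0, "B": 1.5, "C": 1.8, "D": 5.0}

failed = simulate_cascade(neighbors, load, capacity, initial_failure="A")
```

In this toy chain the loss of A overloads B, which in turn overloads C, while D has enough spare capacity to stop the cascade. Running the simulation from each node in turn gives a crude ranking of which single failures carry the most network risk.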
  15. Data Science Research Center: tools for data analytics
      • Theory & Algorithms: Randomized; Optimization; Approximation; Multilinear Algebra
      • Applications
      • Statistics: Multivariate analysis; Optimal Experimental Design
      • Computational concerns: Scaling; Cyber Security for Data
      • Dimension reduction by randomized algorithms for numerical linear algebra, for identifying significant components and visualizing petabyte-scale data matrices (P. Drineas, CSCI)
      • Parallel Factor Analysis for tensor systems creates a scalable solution, on AMOS, for a critical data-processing component of data analytics for large graphs (B. Yener, CSCI)
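For readers unfamiliar with dimension reduction by randomized numerical linear algebra, the sketch below shows one widely known variant, a random range finder followed by a small exact SVD. It is a generic textbook-style illustration using NumPy on a made-up toy matrix, not the specific algorithm or scale of the research cited on the slide; the function name and the `oversample` parameter are choices made for this example.

```python
import numpy as np

def randomized_svd(A, rank, oversample=5, seed=0):
    """Approximate the top-`rank` singular triplets of A with a random range finder."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    k = min(rank + oversample, n)
    # Multiply A by k random directions to sample its column space.
    Y = A @ rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(Y)       # orthonormal basis for the sampled range
    B = Q.T @ A                  # small k x n matrix that is cheap to decompose
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small              # map the left singular vectors back to the original space
    return U[:, :rank], s[:rank], Vt[:rank]

# Made-up 200 x 50 test matrix with exact rank 3.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))

U, s, Vt = randomized_svd(A, rank=3)
relative_error = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

Because the toy matrix has exact rank 3, the rank-3 approximation reproduces it up to floating-point error. The appeal of the approach is that the expensive decomposition runs on the small matrix `B` rather than on `A` itself.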
  16. Adding Semantics: Discovery Informatics. From “what if” to “why”
  17. Scientific data: Microbiome informatics (Human Biome, Environmental Biome, Built Environment; Data Analytics, Semantic Data Integration). While microbes are among the smallest organisms on the planet, they are also the largest influence on mass and nutrient transport in the biosphere. They are the base of most natural ecosystems, as well as the purveyors of air and water quality. It is also microbes that primarily govern disease transfer and human health in our built environments.
  18. Materials Processing Ontology (cMDIS/IDEA). The materials field has made much progress on systematically understanding materials structure-to-property relationships, but lacks an organized model of processing-to-property relations. A critical need for systematic development of new materials technologies! Goal: Create a (machine-readable) ontology for materials processing. By combining our expertise in data science, materials and manufacturing, we are creating a key missing link in the Materials Genome Initiative.
  19. Some questions need a qualitative answer: Platform for Experimental Collaborative Ethnography
  20. Discovery Informatics Requires Unstructured data: Integration of text analytics, natural language processing, network-based multimedia analysis and structured/unstructured data integration
  21. Requires Unstructured data (real-time feeds/images/video). DOE SEAB report on HPC: How might a neuromorphic “accelerator” type processor be used to improve the application performance, power consumption and overall system reliability of future exascale systems?
      • Power Consumption (w/IBM)
      • Network Learning (sensors)
      • Sparse Distributed Representations
      • Hybrid Neural/Symbolic Systems
      Neuromorphic Computing: software systems that implement models inspired by neural systems to analyze data tied to perception, motor control, or multisensory integration.
  22. Neuromorphic Computing (CCI/IDEA). Joint CCI/IDEA project to use a supercomputer to model state-of-the-art neuromorphic processors
      • Use for improving AMOS energy use (like autonomic control)
      • Use for exploring inputs from data-sensing systems (extrinsic control)
      Neuromorphic Computing requires critical Rensselaer technologies: integrating data analytics (on the fly) with simulation and modeling
      • CCI (AMOS) allows us to explore new variants on neuromorphic approaches
      • IDEA provides learning models and analytics capabilities for evaluation
      • Together, these allow us to attack audio/visual streaming data
  23. Theme 3: Data Exploration. From “why” to “what is”
  24. From visualization to exploration. … Unfortunately, visualization too often becomes an end product of scientific analysis, rather than an exploration tool that scientists can use throughout the research life cycle. However, new database technologies, coupled with emerging Web-based technologies, may hold the key to lowering the cost of visualization generation and allow it to become a more integral part of the scientific process.
  26. From what is, to what if, to why (and back). These capabilities are critical in “closing the loop” between data, simulation and modeling in scientific discovery, engineering design, and business innovation.
  27. A “Data Science” Research Agenda: Multiscale; Sparsity; Abductive; Agent-oriented
  28. Supporting the Scientific agenda
      • Gathering and representing information from multiple sources
        • topic of CODS talk
      • Systematic (and scalable) methods for predictive analytics
        • example: Parallel search for best kernel functions
      • New Data Exploration platforms
        • example: Patent pending on new multi-user collaborative device
      • Cognitive and immersive platforms
      • Data sharing standards
        • Research Data Alliance
        • W3C
  29. The Rensselaer IDEA Summary
      • Data is not just the “oil” of the new generation
        • information is the new power source generated from that “oil”
      • Using data for prediction is becoming less of an art, but still needs systematicity
        • Scaling tools beyond MapReduce
        • Better methods for rapid customization
      • Turning data into causal or design knowledge is in its early stages
      • Closing the loop from data to design requires new informatics, new mathematics, and new ways of thinking beyond data mining