A talk at the Urban Science workshop at the Puget Sound Regional Council, July 20, 2014, organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Laboratory and the University of Washington.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
A 25-minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
Smart Data - How you and I will exploit Big Data for personalized digital hea... (Amit Sheth)
Amit Sheth's keynote at IEEE BigData 2014, Oct 29, 2014.
Abstract from:
http://cci.drexel.edu/bigdata/bigdata2014/keynotespeech.htm
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for, and avoiding, an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or the Internet of Things around, on, and inside humans), public health signals (e.g., information coming from the healthcare system, such as hospital admissions), and population health signals (e.g., tweets by people related to asthma occurrences and allergens, or Web services providing pollen and smog information). However, no individual can process all these data without the help of appropriate technology, and each person has a different set of relevant data!
In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, then for all the data relevant to my child with the four V-challenges, what I care about is simply: “How is her current health, and what is the risk of an asthma attack in her current situation (now and today), especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques, similar to the close interworking of the top brain and the bottom brain in cognitive models.
For harnessing Volume, I will discuss the concept of Semantic Perception: how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss our experience in using agreement, represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss more recent work on Continuous Semantics, which seeks to dynamically create models of new objects, concepts, and relationships, and to use them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city.
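The Variety discussion above treats shared vocabularies as the "agreement" that makes heterogeneous sources interoperable. A minimal sketch of that idea, assuming two hypothetical sources with invented field names (this is illustrative only, not Kno.e.sis code; real systems would use published ontologies rather than a hand-built dictionary):

```python
# Two sources describe the same observation with different schemas.
source_a = {"temp_f": 98.6, "pt_id": "p42"}
source_b = {"temperature_celsius": 37.0, "patient": "p42"}

# A tiny shared vocabulary: canonical term -> (source field, unit converter).
# The canonical terms here are invented for illustration.
VOCAB = {
    "body_temperature_c": {
        "a": ("temp_f", lambda f: round((f - 32) * 5 / 9, 1)),
        "b": ("temperature_celsius", lambda c: c),
    },
    "patient_id": {
        "a": ("pt_id", lambda x: x),
        "b": ("patient", lambda x: x),
    },
}

def to_canonical(record, source):
    """Map a source-specific record onto the shared vocabulary."""
    out = {}
    for term, aliases in VOCAB.items():
        field, convert = aliases[source]
        if field in record:
            out[term] = convert(record[field])
    return out

# Both records now agree term-for-term and can be integrated.
print(to_canonical(source_a, "a"))  # {'body_temperature_c': 37.0, 'patient_id': 'p42'}
print(to_canonical(source_b, "b"))  # {'body_temperature_c': 37.0, 'patient_id': 'p42'}
```

Once both sources are expressed in the canonical vocabulary, downstream integration and querying no longer need to know which source a record came from.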
Citizen Sensor Data Mining, Social Media Analytics and Applications (Amit Sheth)
Opening talk at Singapore Symposium on Sentiment Analysis (S3A), February 6, 2015, Singapore. http://s3a.sentic.net/#s3a2015
Abstract
With the rapid rise in the popularity of social media and near-ubiquitous mobile access, the sharing of observations and opinions has become commonplace. This has given us unprecedented access to the pulse of a populace and the ability to perform analytics on social data to support a variety of socially intelligent applications -- be it for brand tracking and management, crisis coordination, organizing revolutions, or promoting social development in underdeveloped and developing countries.
I will review: 1) understanding and analysis of informal text, especially microblogs (e.g., issues of cultural entity extraction and the role of semantic/background-knowledge-enhanced techniques), and 2) how we built Twitris, a comprehensive social media analytics (social intelligence) platform.
I will describe the analysis capabilities along three dimensions: spatio-temporal-thematic, people-content-network, and sentiment-emotion-intent. I will couple technical insights with identification of computational techniques and real-world examples using live demos of Twitris (http://twitris2.knoesis.org).
The document provides an overview of funding and active projects at Kno.e.sis as of December 2015. Key details include total extramural funds exceeding $8.3 million with the majority obtained that year from competitive NSF and NIH sources. Active projects focus on areas such as context-aware harassment detection on social media, monitoring drug trends on social media, disaster management using social and physical sensing, and modeling social behavior for healthcare utilization in depression. The summary highlights student and faculty involvement and accomplishments across multiple funded projects.
Sensors and mobile devices are rapidly intertwining with the fabric of our lives. This has resulted in unprecedented growth in the number of observations from the physical and social worlds reported in the cyber world. A system of sensing and computational components embedded in the physical world is termed a Cyber-Physical System (CPS). The current science of CPS has yet to effectively integrate citizen observations into CPS analysis. We demonstrate the role of citizen observations in CPS and propose a novel approach to perform a holistic analysis of machine and citizen sensor observations. Specifically, we demonstrate the complementary, corroborative, and timely aspects of citizen sensor observations compared to machine sensor observations in Physical-Cyber-Social (PCS) Systems.
Physical processes are inherently complex and embody uncertainties. They manifest as machine and citizen sensor observations in PCS Systems. We propose a generic framework to move from observations to decision-making and actions in PCS systems, consisting of: (a) PCS event extraction, (b) PCS event understanding, and (c) PCS action recommendation. We demonstrate the role of Probabilistic Graphical Models (PGMs) as a unified framework to deal with the uncertainty, complexity, and dynamism involved in translating observations into actions. Data-driven approaches alone are not guaranteed to synthesize PGMs that accurately reflect real-world dependencies. To overcome this limitation, we propose to empower PGMs with declarative domain knowledge. Specifically, we propose four techniques: (a) automatic creation of massive training data for Conditional Random Fields (CRFs) using domain knowledge of entities, used in PCS event extraction; (b) Bayesian Network structure refinement using causal knowledge from ConceptNet, used in PCS event understanding; (c) knowledge-driven piecewise linear approximation of nonlinear time-series dynamics using Linear Dynamical Systems (LDS), used in PCS event understanding; and (d) transformation of knowledge of goals and actions into a Markov Decision Process (MDP) model, used in PCS action recommendation.
We evaluate the benefits of the proposed techniques on real-world applications involving traffic analytics and Internet of Things (IoT).
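Technique (d) above encodes knowledge of goals and actions as an MDP and solves it to recommend actions. A minimal sketch of that step, using a hypothetical traffic scenario with invented states, transitions, and rewards (not the abstract's actual model), solved by standard value iteration:

```python
# Toy traffic scenario: states of a road segment, actions a responder
# might recommend. Transition probabilities and rewards stand in for
# declarative knowledge of goals and action effects.
states = ["clear", "congested", "blocked"]
actions = ["wait", "reroute"]

# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
P = {
    "clear":     {"wait": [("clear", 0.9), ("congested", 0.1)],
                  "reroute": [("clear", 1.0)]},
    "congested": {"wait": [("congested", 0.6), ("blocked", 0.4)],
                  "reroute": [("clear", 0.7), ("congested", 0.3)]},
    "blocked":   {"wait": [("blocked", 1.0)],
                  "reroute": [("congested", 0.8), ("blocked", 0.2)]},
}
R = {
    "clear":     {"wait": 1.0, "reroute": 0.5},
    "congested": {"wait": -1.0, "reroute": -0.5},
    "blocked":   {"wait": -2.0, "reroute": -1.0},
}

def value_iteration(gamma=0.9, eps=1e-6):
    """Solve the MDP; return state values and the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {
        s: max(actions,
               key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
        for s in states
    }
    return V, policy

V, policy = value_iteration()
print(policy)  # the recommended action for each observed state
```

In the PCS pipeline, the state would come from event extraction and understanding, and the policy lookup would drive the action recommendation.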
The Ohio Center of Excellence in Knowledge-enabled Computing at Wright State University:
1) Ranks second globally in research impact on the World Wide Web and hosts the largest academic research group in the US working on the semantic web, social media, big data, and health applications.
2) Has exceptional student success, with internships and jobs at top companies, and a total of 100 researchers, including 15 highly cited faculty and 45 PhD students, largely funded through $2M+ annually in research funding.
3) Provides world-class resources for multidisciplinary projects across information technology and domains like biomedicine, with collaboration from industry partners like Google and IBM.
Data Science in 2016: Moving up, by Paco Nathan at Big Data Spain 2015 (Big Data Spain)
This document discusses trends in data science in 2016, including how data science is moving into new use cases such as medicine, politics, government, and neuroscience. It also covers trends in hardware, generalized libraries, leveraging workflows, and frameworks that could enable a big leap ahead. The document discusses learning trends like MOOCs, inverted classrooms, collaborative learning, and how O'Reilly Media is embracing Jupyter notebooks. It also covers measuring distance between learners and subject communities, and the importance of both people and automation working together.
The document discusses the concept of "Broad Data" which refers to the large amount of freely available but widely varied open data on the World Wide Web, including structured and semi-structured data. It provides examples such as the growing linked open data cloud and over 710,000 datasets available from governments around the world. Broad data poses new challenges for data search, modeling, integration and visualization of partially modeled datasets. International open government data search and linking government data to additional contexts are also discussed.
This document summarizes a project on social and physical sensing enabled decision support for disaster management. The project involves a collaboration between Kno.e.sis at Wright State University and Ohio State University. It aims to extract relevant information from citizen sensed data, develop adaptive models of hurricane storm surge coupled with citizen and remote sensed data, and provide tools to assist first responders by integrating data from multiple sources. The project will analyze multimodal data and develop methodologies to predict consequences of infrastructure damage. It is supported by the National Science Foundation.
This document discusses how to make data more engaging for the public. It suggests using games, art, and storytelling to bring data closer to people. Data needs to entertain and excite people as well as inform them. Frameworks are examined for how varying levels of participation, localization, and shareability impact public engagement with factual evidence. Tools and guidance are proposed to help communities communicate about data in inspiring ways and achieve wider civic participation. The talk considers how data interaction research can help understand how people search for, make sense of, and share data stories on social media in order to design systems that better support these tasks.
The Evidence Hub: Harnessing the Collective Intelligence of Communities to Bu... (Anna De Liddo)
The Evidence Hub is a tool that harnesses collective intelligence to build evidence-based knowledge. It allows communities to gather and debate evidence for ideas and solutions. Users can easily add evidence, counter-evidence, and have conversations to share knowledge. Visual analytics show social dynamics like key players and agreements/disagreements. Future research focuses on defining participation roles and processes, and developing reporting, discourse analytics, and geo-deliberation analytics.
This document discusses data-centric education and learning. It begins by outlining past and present technologies used in education. It then discusses how data-centric learning is enabled by devices that connect to the cloud and collect real-time student data. This data can provide adaptive instruction, feedback, and insights into learning processes. Examples are given of social network analysis and predictive analytics projects using large educational datasets. Finally, frameworks for designing data-driven learning environments and strategies to improve performance are presented. The conclusion emphasizes using data and analytics responsibly and strategically to improve education.
RDAP14: Developing a cross-institutional data management plan for a major par... (ASIS&T)
The document discusses the challenges of building a cross-institutional data management plan for the GlueX Experiment, a large particle physics collaboration involving over 30 institutions and one national lab that is expected to generate 15 petabytes of data per year. It notes the legal and compliance issues that can arise when sharing research data across multiple institutions, including responsibility, infrastructure, intellectual property, data ownership, and export controls. Representatives from the collaborating institutions acknowledge the difficulty of developing a consistent cross-institutional data management plan but recognize the benefits of improved data sharing and access.
Guest presentation: SASUF Symposium: Digital Technologies, Big Data, and Cybersecurity, Vaal University of Technology, Vanderbijlpark, South Africa, 15 May 2018
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ... (Amit Sheth)
Transforming big data into smart data involves deriving value from harnessing the volume, variety, and velocity of big data using semantics and the semantic web. This allows making sense of big data by providing actionable information that improves decision making. Examples discussed include a healthcare application called kHealth that uses personal sensor data along with population level data to provide personalized and timely health recommendations and interventions for conditions like asthma.
NITRD Big Data Interagency Working Group Workshop: Pioneering the Future of Federally Supported Data Repositories Jan 13, 2021 - Opening comments on where we are and one suggestion of where we might go with an International Data Science Institute (IDSI) - A blue sky view.
Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:
(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.
(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807--1818, 2020.
Explainable Fact Checking with Humans in-the-loop (Matthew Lease)
Invited Keynote at KDD 2021 TrueFact Workshop: Making a Credible Web for Tomorrow, August 15, 2021.
https://www.microsoft.com/en-us/research/event/kdd-2021-truefact-workshop-making-a-credible-web-for-tomorrow/#!program-schedule
There are many online and in-person courses available for librarians to learn about research data management, data analysis, and visualization, but after you have taken a course, how do you go about applying what you have learned? While it is possible to just start offering classes and consultations, your service will have a better chance of becoming relevant if you consider stakeholders and review your institutional environment. This lecture will give you some ideas to get started with data services at your institution.
Philip Bourne presented his viewpoint on the future of open science at an NIAID workshop. He argued that as science becomes more democratized, it will lead to more scrutiny of research, a need for new types of rewards beyond publications and citations, and a removal of artificial boundaries between fields. As an example, he discussed how open science allowed two researchers working in different areas to connect via common data references in their notebooks. Bourne believes this digitization and interconnection of research will accelerate, transforming institutions into digital enterprises where digital assets are identifiable and interoperable. However, fully realizing this vision will require coordinating tools across the research lifecycle through common frameworks and developing new support structures.
Opening/Framing Comments: John Behrens, Vice President, Center for Digital Data, Analytics, & Adaptive Learning, Pearson
Discussion of how the field of educational measurement is changing: how long-held assumptions may no longer be taken for granted, and how new terminology and language are coming into use.
Panel 1: Beyond the Construct: New Forms of Measurement
This panel presents new views of what assessment can be and new species of big data that push our understanding of what can be used in evidentiary arguments.
Marcia Linn (UC Berkeley) and Lydia Liu (ETS) discuss continuous assessment of science and new kinds of constructs that relate to collaboration and student reasoning.
John Byrnes from SRI International discusses text and other semi-structured data sources and different methods of analysis.
Kristin Dicerbo from Pearson discusses hidden assessments and the different student interactions and events that can be used in inferential processes.
Panel 2: The Test is Just the Beginning: Assessments Meet Systems Context
This panel looks at how assessments are not the end game, but often the first step in larger big-data practices at the district, state, and national levels.
Gerald Tindal from the University of Oregon discusses State data systems and special education, including curriculum-based measurement across geographic settings.
Jack Buckley, Commissioner of the National Center for Education Statistics, discusses national datasets where tests and other data connect.
Lindsay Page and Will Marinell from the Strategic Data Project at Harvard discuss state and district datasets used for evaluating teachers, colleges of education, and student progress.
Panel 3: Connecting the Dots: Research Agendas to Integrate Different Worlds
This panel looks at how research organizations view the connections between the perspectives presented in Panels 1 and 2: what is known, and what is yet to be discovered in order to achieve the promise of big connected data in education.
Andrea Conklin Bueschel Program Director at the Spencer Foundation
Ed Dieterle Senior Program Officer at the Bill and Melinda Gates Foundation
Edith Gummer Program Manager at National Science Foundation
Elsevier CWTS Open Data Report Presentation at RDA Meeting in Barcelona - Elsevier
The Open Data report is a result of a year-long, co-conducted study between Elsevier and the Centre for Science and Technology Studies (CWTS), part of Leiden University, the Netherlands. The study is based on a complementary methods approach consisting of a quantitative analysis of bibliometric and publication data, a global survey of 1,200 researchers and three case studies including in-depth interviews with key individuals involved in data collection, analysis and deposition in the fields of soil science, human genetics and digital humanities.
This document discusses the need for critical infrastructure to promote data synthesis and evidence-based nutrient management. It outlines 10 steps for real-time data uptake, analysis, and customized nutrient recommendations. Key challenges include data standards, minimum data sets, provenance, and repositories. The Purdue University Research Repository is presented as a solution, providing preservation, curation, and publication of agricultural data. Hands-on support from librarians and agronomists is discussed to help researchers transition data and ensure best practices.
Realizing the Potential of Research Data - Carole L. Palmer
The document discusses the challenges and opportunities in realizing the potential of research data. It notes that while institutions are well positioned with expertise and infrastructure to support data-intensive research, the scale and pace of changes pose significant challenges. New programs have emerged to train experts in data curation and e-science, and there is an abundance of data repositories, standards, and initiatives. Realizing the full potential of research data will require overcoming issues of interoperability between heterogeneous distributed data sources and establishing consensus around data sharing policies and practices.
Data science applications and use cases were discussed. Examples included using data science in business for tasks like optimizing operations, healthcare to improve efficiency and care, and urban planning to address challenges in cities. Data science contrasts with other disciplines by combining technical skills from computer science, mathematics, and statistics to analyze large datasets. Case studies demonstrated data science applications in domains like cancer research using patterns in biomedical data, healthcare to power precision medicine, political campaigns using social media microtargeting, and the growing Internet of Things producing large volumes of data.
Data science applications can be found in many domains including business, healthcare, urban planning, and more. In business, data science is used to optimize operations and customer experiences. In healthcare, data science aims to improve efficiency, reduce readmissions, and enable earlier disease detection. For urban areas experiencing rapid growth, data science combines with urban informatics to help address challenges. Case studies show how data science is used in cancer research by leveraging large datasets and algorithms, in healthcare by Stanford and Google to advance precision medicine, in political elections through micro-targeting, and with the growing Internet of Things to analyze data from billions of connected devices.
CODATA International Training Workshop in Big Data for Science for Researchers - Johann van Wyk
Presentation at NeDICC Meeting on 16 July 2014. Feedback from CODATA International Training Workshop in Big Data for Science for Researchers from Emerging and Developing Countries, Beijing, China, 5-20 June 2014
Data science applications and use cases were discussed. Examples included using data science in business for tasks like car design and insurance, in healthcare for reducing readmissions and improving care, and in urban planning to address challenges in growing cities. Cancer research was highlighted as an area using big data analytics and machine learning to identify patterns linked to cancer. Healthcare examples included using genetic data at Stanford Medicine for precision health. Data science was applied to political elections through Obama's targeted social media campaigns. Finally, the growing field of internet of things was noted as an area that will produce huge volumes of data for analysis.
The document discusses the rise of data science and its disruptive impact on higher education. It analyzes precedents like bioinformatics that were enabled by new digital data sources and technologies. The author advocates that universities should embrace data science by establishing interdisciplinary collaborations, investing in data infrastructure, and ensuring research has societal value and responsibility.
1) Rensselaer Polytechnic Institute aims to incorporate data science education across its curriculum to develop "data dexterity" in every student.
2) A proposed core curriculum includes data-intensive courses in science and the major, as well as collaborative projects through a new Data INCITE laboratory.
3) The goal is for data management and analysis to become as fundamental as calculus, with open data sharing and verification of results.
A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...) - LIBER Europe
This document discusses the opportunities and challenges of open science and open data. It argues that openly sharing scientific data and findings has significant benefits, including enabling faster scientific progress, deterring fraud, and supporting citizen science. However, for data to be truly open and useful to others, it needs to be accessible, intelligible, assessable, and reusable. The document also examines the roles and responsibilities of different stakeholders in working towards more open and reproducible science. This includes changing incentives for scientists, strategic funding for technical solutions from funders, and exploring how institutions like libraries and learned societies can help address the challenges of managing and making sense of the growing volume of research data.
What Data Science Will Mean to You - One Person's View, by Philip Bourne
This document provides an overview of data science from the perspective of Philip Bourne. Some key points:
- Data science is disruptive to higher education and all disciplines are being impacted by large amounts of digital data.
- Data science can be defined using a 4+1 model focusing on value, design, systems, analytics, and practice.
- Principles of excellence, inclusivity, openness, and fairness should guide data science work.
- Lessons from advances in computational biology and AlphaFold2 show the importance of open data, collaboration, and engineering challenges.
- A data science school should focus on responsible data practices while balancing open research that benefits patients.
Why should I care about information literacy? - nmjb
This document summarizes a workshop on improving researchers' competency in information handling and data management. The workshop covered how information literacy relates to researcher development, defined information literacy using the 7 Pillars model, and discussed national initiatives and case studies in applying information literacy. Participants engaged in group work applying information literacy concepts to the Researcher Development Framework and discussed motivation and examples of good practice in supporting information literacy development.
This document summarizes the Library Impact Data Project, which aimed to show correlations between library usage data (books borrowed, e-resources accessed) and student attainment across multiple universities. Phase 1 found statistical significance between library usage and grades. Phase 2 added more student data points and found further correlations with demographics. The project aims to create a shared analytics service to allow libraries to analyze usage and benchmark against peers. Key areas for the next phase include developing an intuitive dashboard, addressing ethical issues around profiling individuals, and integrating additional data sources.
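As an illustration of the kind of Phase 1 analysis described (testing whether library usage correlates with attainment), here is a minimal Pearson-correlation sketch in Python; the figures are invented, and the project's actual statistical tests may differ:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example data: items borrowed vs. final grade per student.
books_borrowed = [2, 5, 9, 14, 20, 31]
final_grade = [52, 58, 61, 64, 70, 75]
r = pearson_r(books_borrowed, final_grade)
```

A strong positive r on data like this is what "statistical significance between library usage and grades" would look like at its simplest, before demographics and further data points are added.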
This presentation was provided by Kristi Holmes of Northwestern University during the NISO hot topic virtual conference "Effective Data Management," which was held on September 29, 2021.
Similar to Data Science and Urban Science @ UW:
The document discusses using machine learning techniques to learn vector representations of SQL queries that can then be used for various workload management tasks without requiring manual feature engineering. It shows that representations learned from SQL strings using models like Doc2Vec and LSTM autoencoders can achieve high accuracy for tasks like predicting query errors, auditing users, and summarizing workloads for index recommendation. These learned representations allow workload management to be database agnostic and avoid maintaining database-specific feature extractors.
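As a toy illustration of representing SQL text as vectors for workload tasks, the sketch below hashes query tokens into a fixed-length count vector and compares queries by cosine similarity. This is not the Doc2Vec or LSTM autoencoder models from that work, and the queries are invented; it only shows the database-agnostic "strings in, vectors out" idea:

```python
import math
import re
import zlib

DIM = 64  # small, illustrative dimensionality

def sql_to_vector(query, dim=DIM):
    """Map a SQL string to a fixed-length vector of hashed token counts.
    A crude stand-in for learned representations; it requires no manual
    feature engineering and no knowledge of the database schema."""
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z_]+|\d+|[=<>*(),]", query.lower()):
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented queries: similar workloads should land near each other.
q1 = "SELECT name FROM users WHERE age > 30"
q2 = "SELECT name FROM users WHERE age > 40"
q3 = "DELETE FROM logs"
v1, v2, v3 = (sql_to_vector(q) for q in (q1, q2, q3))
```

Once queries live in a vector space, downstream tasks such as error prediction, user auditing, or workload summarization can operate on the vectors alone.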
This document discusses the responsible use of data science techniques and technologies. It describes data science as answering questions using large, noisy, and heterogeneous datasets that were collected for unrelated purposes. It raises concerns about the irresponsible use of data science, such as algorithms amplifying biases in data. The work of the DataLab group at the University of Washington is presented, which aims to address these issues by developing techniques to balance predictive accuracy with fairness, increase data sharing while protecting privacy, and ensure transparency in datasets and methods.
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
This document discusses democratizing data science in the cloud. It describes how cloud data management involves sharing resources like infrastructure, schema, data, and queries between tenants. This sharing enables new query-as-a-service systems that can provide smart cross-tenant services by learning from metadata, queries, and data across all users. Examples of possible services discussed include automated data curation, query recommendation, data discovery, and semi-automatic data integration. The document also describes some cloud data systems developed at the University of Washington like SQLShare and Myria that aim to realize this vision.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
This document summarizes a presentation about Myria, a relational algorithmics-as-a-service platform developed by researchers at the University of Washington. Myria allows users to write queries and algorithms over large datasets using declarative languages like Datalog and SQL, and executes them efficiently in a parallel manner. It aims to make data analysis scalable and accessible for researchers across many domains by removing the need to handle low-level data management and integration tasks. The presentation provides an overview of the Myria architecture and compiler framework, and gives examples of how it has been used for projects in oceanography, astronomy, biology and medical informatics.
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. We feature high level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
The University of Washington eScience Institute aims to help position UW at the forefront of eScience techniques and technologies. Its strategy includes hiring research scientists, adding faculty in key fields, and building a consultancy of students. The exponential growth of data is transitioning science from data-poor to data-rich. Techniques like sensors, data management, and cloud computing are important. The "long tail" of smaller science projects is also worthy of investment and can have high impact if properly supported.
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; an overview of some of our activities, including a Coursera course slated for Spring 2012.
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new "delivery vector" for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over "raw" tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
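The Upload-Query-Share workflow can be sketched with Python's built-in sqlite3; the table, data, and view names here are invented for illustration and do not reflect SQLShare's actual interface:

```python
import sqlite3

# "Upload": load raw tabular rows as-is, with no up-front schema design.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE casts (station TEXT, depth REAL, temp REAL)")
conn.executemany("INSERT INTO casts VALUES (?, ?, ?)", [
    ("P12", 5.0, 11.2), ("P12", 50.0, 9.8), ("P19", 5.0, 12.1),
])

# "Query": a direct, full-SQL interface over the raw table.
rows = conn.execute(
    "SELECT station, AVG(temp) AS avg_temp FROM casts GROUP BY station"
).fetchall()

# "Share": publish the query as a named view that others can build on.
conn.execute(
    "CREATE VIEW avg_temp_by_station AS "
    "SELECT station, AVG(temp) AS avg_temp FROM casts GROUP BY station"
)
```

The point of the design is visible even at this scale: no schema-design step, and the shareable artifact is a SQL view rather than a copied file.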
This document discusses the roles that cloud computing and virtualization can play in reproducible research. It notes that virtualization allows for capturing the full computational environment of an experiment. The cloud builds on this by providing scalable resources and services for storage, computation and managing virtual machines. Challenges include costs, handling large datasets, and cultural adoption issues. Databases in the cloud may help support exploratory analysis of large datasets. Overall, the cloud shows promise for improving reproducibility by enabling sharing of full experimental environments and resources for computationally intensive analysis.
This document discusses enabling end-to-end eScience through integrating query, workflow, visualization, and mashups at an ocean observatory. It describes using a domain-specific query algebra to optimize queries on unstructured grid data from ocean models. It also discusses enabling rapid prototyping of scientific mashups through visual programming frameworks to facilitate data integration and analysis.
This document describes HaLoop, a system that extends MapReduce to efficiently support iterative data processing on large clusters. HaLoop introduces caching mechanisms that allow loop-invariant data to be accessed without reloading or reshuffling between iterations. This improves performance for iterative algorithms like PageRank, transitive closure, and k-means clustering. The largest gains come from caching invariant data in the reducer input cache to avoid unnecessary loading and shuffling. HaLoop also eliminates extra MapReduce jobs for termination checking in some cases. Overall, HaLoop shows that minimal extensions to MapReduce can efficiently support a wide range of recursive programs and languages on large-scale clusters.
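A minimal sketch of the loop-invariant caching idea, using an in-memory toy PageRank rather than a real MapReduce cluster (the graph and parameters are invented):

```python
# Toy PageRank illustrating HaLoop's central idea: the loop-invariant
# link structure is cached once, while only the small rank table is
# recomputed each iteration. In real HaLoop this caching avoids
# reloading and reshuffling the invariant data between MapReduce jobs.
def pagerank(links, iterations=20, d=0.85):
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    # Invariant data, computed once and reused every iteration.
    out_degree = {n: len(outs) for n, outs in links.items()}
    for _ in range(iterations):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in links.items():  # read from the "cache", not disk
            for m in outs:
                new[m] += d * ranks[n] / out_degree[n]
        ranks = new
    return ranks

# Invented three-node cycle; by symmetry each rank converges to 1/3.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```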
This document discusses query-driven visualization in the cloud using MapReduce. It begins by explaining how all science is reducing to a database problem as data is acquired en masse independently of hypotheses. It then discusses why visualization and a cloud approach are useful before reviewing relevant technologies like relational databases, MapReduce, GridFields mesh algebra, and VisTrails workflows. Preliminary results are shown for climatology queries on a shared cloud and core visualization algorithms on a private cluster using MapReduce.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems - University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The binding of cosmological structures by massless topological defects - Sérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
The debris of the ‘last major merger’ is dynamically young - Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different from the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
Immersive Learning That Works: Research Grounding and Paths Forward - Leonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a ‘Uses, Practices & Strategies’ model operationalized by the ‘Immersive Learning Brain’ and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences, spotlighting research frontiers along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Westerlund 1 and 2 Open Clusters Survey - Sérgio Sacani
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters represent the most massive star-forming environments, dominated by feedback from massive stars and gravitational interactions among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low- and high-mass stars. The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically, the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec. Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution, with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
Describing and Interpreting an Immersive Learning Case with the Immersion Cube - Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and the solution of frictionless reproducibility, and call on the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
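The sampling strategies mentioned above (e.g., uniform, random) can be illustrated on a tiny configuration space. This is only a sketch: the boolean option names and the cost model below are invented for the example, standing in for real compile-time options and real measurements.

```python
import random

# Hypothetical boolean compile-time options; names are invented for the example.
OPTIONS = ["opt_level_high", "debug_symbols", "use_simd", "static_link"]

def sample_configurations(n, seed=0):
    """Uniform random sampling from the 2^|OPTIONS| configuration space."""
    rng = random.Random(seed)
    return [{opt: rng.choice([True, False]) for opt in OPTIONS} for _ in range(n)]

def measure(config):
    """Stand-in for a real measurement (e.g., build time or binary size)."""
    weights = {"opt_level_high": 3, "debug_symbols": 5, "use_simd": 1, "static_link": 4}
    return sum(w for opt, w in weights.items() if config[opt])

configs = sample_configurations(10)
costs = [measure(c) for c in configs]
print(min(costs), max(costs))
```

In a real study, `measure` would run a build or benchmark, and the sampled measurements would feed the learning techniques (transfer learning, feature selection) listed above.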
Invited talk at the Journées Nationales du GDR GPL 2024.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
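As a minimal numeric instance of the abstract's point, the sketch below (illustrative, not from the talk) checks that an elementwise ReLU commutes with the permutation action of S_n on coordinates, i.e., ReLU is an equivariant map for the permutation representation.

```python
def relu(v):
    """Elementwise ReLU applied coordinate by coordinate."""
    return [max(0.0, x) for x in v]

def permute(v, perm):
    """Permutation action of S_n on coordinates: (perm . v)[i] = v[perm[i]]."""
    return [v[i] for i in perm]

# Equivariance check: applying ReLU commutes with permuting coordinates,
# i.e. relu(perm . v) == perm . relu(v) for the permutation representation.
v = [1.5, -2.0, 0.0, 3.0]
perm = [2, 0, 3, 1]
lhs = relu(permute(v, perm))
rhs = permute(relu(v), perm)
print(lhs == rhs)  # True
```

The interesting representation theory starts here: ReLU is equivariant but not linear, which is what motivates the piecewise linear maps the talk studies.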
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
2.
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
3. The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
7/21/2014 Bill Howe, UW 3
4. “All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.”
2005-2008
In other words:
• Data-driven discovery will be ubiquitous
• UW must be a leader in inventing the capabilities
• UW must be a leader in translational activities – in putting these capabilities to work
• It’s about intellectual infrastructure (human capital) and software infrastructure (shared tools and services – digital capital)
5. A 5-year, US$37.8 million cross-institutional collaboration to create a data science environment
2014
6. $9.3 million from Washington Research Foundation to amplify the Moore/Sloan effort
• 6 × 5-year faculty lines in Data Science
• 6 × startup packages
• 15 × 3-year postdoctoral fellows
• Funds to remodel and furnish a WRF Data Science Studio
• Also $7.1 million to the closely related Institute for Neuroengineering, $8.0 million to the Institute for Protein Design, and $6.7 million to the Clean Energy Institute
7. Data Science Kickoff Session: 137 posters from 30+ departments and units
8. Broad collaborations
PIs on Moore/Sloan effort + eScience Institute Steering Committee + UW participants in February 7 Data Science poster session
9. Establish a virtuous cycle
• 6 working groups, each with 3–6 faculty from each institution
10. Key Activity: Promote interdisciplinary careers
• Interdisciplinary graduate students
– New, interdisciplinary “Data Science” Ph.D. tracks and program
• Interdisciplinary postdocs (“Data Science Fellows”)
– Dual-mentored postdocs with interests in both methods and a domain science
• Interdisciplinary research scientists (“Data Scientists”)
• Work across disciplines to solve people’s data science challenges
• Interdisciplinary faculty
– Supported with special hiring and funding initiatives
• “Senior Research Fellows”
– Short-term and long-term visitors
• A diverse faculty steering committee
11. UW Data Science Education Efforts
Audiences: students (CS/Informatics undergrads and grads; non-major undergrads and grads) and non-students (professionals and researchers)
• UWEO Data Science Certificate
• MOOC: Intro to Data Science
• IGERT: Big Data PhD Track
• New CS courses
• Bootcamps and workshops
• Intro to Data Programming
• Data Science Masters (planned)
• Incubator: hands-on training
12. Educational transformation · Big Data access and management · Big Data modeling · Big Data analytics · Collaborative Big Data science
Key Activity: Foster Interdisciplinary Education
• Ultimate goal: A new PhD program
– Initial goal: A new certificate based on Big Data tracks in all departments
– Education highlights: data science courses, co-advising, and internships
• End-to-End Research Agenda
– Big Data mgmt, analytics, modeling, & collaboration
• Cyberinfrastructure Development
– Big Data analysis service
13. • Additional data science educational activities
– Coursera MOOCs
• Introduction to Data Science (Bill Howe)
• Computational Methods of Data Analysis (Nathan Kutz)
• High Performance Scientific Computing (Randy LeVeque)
– Traditional courses
• Many! Example: Biochemistry for Computer Scientists (Joe Hellerstein)
• We try to list relevant courses on the eScience Institute website
– UW Educational Outreach
• 3-course Certificate in Data Science
• 3-course Certificate in Cloud Data Management & Analytics
• 3-course Certificate in Cloud Application Development on Amazon Web Services
• 3-course Certificate in Data Visualization
– Workshops and bootcamps
• Software Carpentry (Winter & Spring 2013; Winter, Spring, & Summer 2014)
• Cosmology and Machine Learning (Autumn 2014)
14. • An open shared R&D space where researchers from
across the campus will come to collaborate
• A resident data science team
– Permanent staff of ~5 Data Scientists – applied research and development
– ~15-20 Data Science Fellows (research scientists, visitors, postdocs, students)
– Entrepreneurial mentorship
• Modes of engagement
– Drop-in open workspace
– Studio “Office Hours”
– Incubation Program
– Plus seminars, sponsored lunches, workshops, bootcamps, joint proposals…
Key Activity: “Re-establish the watercooler”
15. Key Activity: Create scalable impact through a
Data Science Incubation Program
• Scale and concentrate our efforts
– Move from “accidental” encounters to engineered partnerships
– Identify emerging opportunities around campus
– Provide a shared environment where researchers can learn from an in-house team, external mentors, and each other
• A startup environment!
– “Seed grant” program
• Lightweight – 1-page proposals
– Significant potential for technology spinout – new markets for existing technology and new technology for existing markets
16. Key Activity: Democratize Access to Big Data and Big Data Infrastructure
• SQLShare: Database-as-a-Service for scientists and engineers
• Myria: Easy, Scalable Analytics-as-a-Service
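SQLShare's premise is that analysts load data once and then manipulate it with plain, declarative SQL rather than one-off scripts. The sketch below uses Python's standard-library sqlite3 purely as a stand-in to illustrate that workflow; it is not SQLShare's API, and the table and values are invented.

```python
import sqlite3

# sqlite3 is only a stand-in here: SQLShare itself is a hosted service, but
# the workflow is the same -- load a table, then analyze it with plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (site TEXT, depth REAL, temp REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("A", 10.0, 12.5), ("A", 20.0, 11.5), ("B", 10.0, 13.0)],
)

# The analyst writes a query instead of a custom processing script.
rows = conn.execute(
    "SELECT site, AVG(temp) FROM samples GROUP BY site ORDER BY site"
).fetchall()
print(rows)  # [('A', 12.0), ('B', 13.0)]
conn.close()
```

The database-as-a-service point is that this declarative step, not the hosting, is what lets non-specialists get answers from shared data.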
17. Open Data sharing platforms
• Database-as-a-service for open data analytics
• Interoperable with external tools and languages
• Local or cloud deployments
• Interoperable with existing database platforms
• Built-in data integration, profiling, analytics
Related platforms: Google Fusion Tables
Entrepreneurship
1) “Data once guarded for assumed but untested reasons is now open, and we're seeing benefits.”
-- Nigel Shadbolt, Open Data Institute
2) Need to help “non-specialists within an organization use data that had been the realm of programmers and DB admins”
-- Benjamin Romano, Xconomy
3) “Businesses are now using data the way scientists always have”
-- Jeff Hammerbacher, Cloudera
24. “Much of the material remains unprocessed, or, if processed, unanalyzed, or, if analyzed, not read, or, if read, not used or acted upon”
Objectives
• Design generalizable method to process HIS-like data
• Make important dataset available for analysis
• Explore actionable data analysis of HIS data
Why do we care?
25. Metadata Trace - saving
Reports of year n saved in January of year n+1
Years were not recorded for the first year of use…
26.
27.
28. REDPy
Repeating Earthquake Detector (Python)
An eScience Incubator Project
Project Lead: Alicia Hotovec-Ellis
Data Scientist: Jake Vanderplas
John Vidale, Alicia Hotovec-Ellis, Jake Vanderplas
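REDPy finds repeating earthquakes by comparing event waveforms for similarity. The sketch below is not REDPy's implementation; it only illustrates the underlying idea with a zero-lag normalized cross-correlation between two invented traces (a real detector correlates over lags and scans continuous data).

```python
import math

def norm_correlation(a, b):
    """Zero-lag normalized cross-correlation of two equal-length traces."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

# Invented "waveforms": the second is a scaled copy of the first, i.e. a
# textbook repeating event; a high correlation flags the pair as repeats.
trace1 = [0.0, 1.0, -0.5, 2.0, -1.0, 0.5]
trace2 = [2.0 * x for x in trace1]
cc = norm_correlation(trace1, trace2)
print(cc > 0.9)  # True
```

Thresholding this coefficient (repeats at the same source with the same path produce near-identical waveforms regardless of amplitude) is what lets small, routinely missed events be grouped into families.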
33. I talked with Alicia a bit yesterday, and she showed me that her earthquake-repeater-searching implementation is more general, and more powerful than I had thought, and closer to trial by others (and I have a particular use in mind in the ongoing iMUSH experiment on Mount St Helens) <snip>
So I'm encouraging her to continue to work on it a day per week or so for the foreseeable future, assuming you have the facilities to continue the incubation.
The project outlives the incubator…
Publications in the works on both the software and the science – from three months of half-time work
34. Using Twitter data to identify geographic clustering of anti-vaccination sentiments
Ben Brooks
June 12, 2014
Benjamin Brooks, Andrew Whitaker, Abie Flaxman
35. Initial approach
• Sentiment regarding vaccination can be discerned from Twitter.
• Can we find city- or county-level pockets of anti-vaccination sentiment?
• Do these locales correlate with outbreak and vaccination rate data (beyond H1N1)?
36. Training data issues
• Training data from PSU study labeled tweets as positive, negative, neutral, or irrelevant.
• Many tweet categorizations seemed suspect.
• Produced new training dataset; switched approach to negative tweets vs. all others.
• Of tweets we labeled as negative, PSU training data agreed with 36%.
• Sample non-negative tweets in training dataset from PSU study:
  – “RT @Lyn_Sue Lyn_Sue18 Reasons Why u Should NOT Vaccinate Your Children Against The Flu This Season”
  – “1882 -3 O RT @alexHroz Citizens From All Walks Intend To Refuse Swine Flu "Vaccine,”
  – “Eighteen Reasons Why You Should NOT Vaccinate Your Children Against The Flu This Season by Bill Sard”
  – “Swine Flu Vaccine not necessary and not healthy:”
37. Background: Previous work
• “For our sentiment classification, we used an ensemble method combining the Naive Bayes and the Maximum Entropy classifiers… The accuracy of this ensemble classifier was 84.29%.”
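For context on this kind of classifier, here is a minimal bag-of-words Naive Bayes sketch with Laplace smoothing. It is illustrative only: the study quoted above used an ensemble of Naive Bayes and Maximum Entropy, and the toy documents and labels below are invented.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and doc counts."""
    counts, totals = {}, Counter()
    for text, label in docs:
        c = counts.setdefault(label, Counter())
        for w in text.lower().split():
            c[w] += 1
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the label maximizing log P(label) + sum log P(word | label)."""
    vocab = {w for c in counts.values() for w in c}
    n = sum(totals.values())
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        lp = math.log(totals[label] / n)
        denom = sum(c.values()) + len(vocab)  # Laplace (add-one) smoothing
        for w in text.lower().split():
            lp += math.log((c[w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented training tweets for the binary negative-vs-others framing above.
docs = [
    ("vaccines are safe and effective", "other"),
    ("get your flu shot today", "other"),
    ("do not vaccinate your children", "negative"),
    ("refuse the flu vaccine", "negative"),
]
counts, totals = train(docs)
print(classify("refuse to vaccinate your children", counts, totals))
```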
38. Other sentiment approaches
• Precision: of all tweets labeled negative by the algorithm, what percentage are “true negatives”?
• Recall: of all “true negative” tweets, what percentage are labeled negative by the algorithm?

                              Precision   Recall
Vaccine-specific keywords        19%       59%
Modified general sentiment       25%       41%
Naïve Bayes                      79%       19%
Logistic regression              70%       28%
Labeled data from PSU study      41%       36%
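The precision and recall definitions on this slide can be computed directly from parallel lists of predicted and true labels. A minimal sketch, with labels invented for illustration (not the study's data):

```python
def precision_recall(predicted, actual):
    """Precision and recall for the 'negative-sentiment' label.

    predicted/actual are parallel lists of booleans; True means the tweet is
    (predicted to be / truly) negative toward vaccination.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    precision = tp / sum(predicted)  # of tweets labeled negative, share truly negative
    recall = tp / sum(actual)        # of truly negative tweets, share labeled negative
    return precision, recall

# Invented labels for five tweets, for illustration only.
predicted = [True, True, False, False, True]
actual    = [True, False, False, True, True]
p, r = precision_recall(predicted, actual)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

The table's pattern then reads naturally: keyword matching is high-recall/low-precision, while the trained classifiers trade recall for precision.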
39. Other sentiment approaches
• Data labeled by human beings does not perform dramatically better than other classifiers!

                              Precision   Recall
Vaccine-specific keywords        19%       59%
Modified general sentiment       25%       41%
Naïve Bayes                      79%       19%
Logistic regression              70%       28%
Labeled data from PSU study      41%       36%
40. Scalable Analytics over Call Record Data in Developing Nations
Project Lead: Ian Kelley, Information School, University of Washington
E-mail: ikelley@uw.edu
eScience Data Incubator - 12 June 2014
Ian Kelley, Andrew Whitaker, Josh Blumenstock
41. Map migration patterns of workers during labor market shortages (Rwanda)
Measure and categorize mobility patterns
Determine people’s geographic center of gravity
Discover the effects of violent events on internal population mobility (Afghanistan)
Track activity patterns over time; identify changes
Map connected areas of the country
42. Average position during a time period (e.g., day, week)
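The "geographic center of gravity" above can be sketched as the average position of a user's call locations over a window. The records and field layout below are invented for illustration; plain lat/lon averaging is only a reasonable approximation at small spatial scales.

```python
# Hypothetical call records: (user, day, lat, lon); values invented.
calls = [
    ("u1", "2014-06-01", -1.94, 30.06),
    ("u1", "2014-06-01", -1.96, 30.10),
    ("u1", "2014-06-02", -2.60, 29.74),
    ("u2", "2014-06-01", -1.50, 29.60),
]

def center_of_gravity(records, user):
    """Average call position for one user -- the 'geographic center of gravity'.

    Plain lat/lon averaging is adequate for a small-scale illustration; a real
    analysis would average on the sphere and might weight by call activity.
    """
    pts = [(lat, lon) for u, _, lat, lon in records if u == user]
    lat = sum(p[0] for p in pts) / len(pts)
    lon = sum(p[1] for p in pts) / len(pts)
    return lat, lon

lat, lon = center_of_gravity(calls, "u1")
print(round(lat, 2), round(lon, 2))  # -2.17 29.97
```

Tracking how this point shifts week to week is one way to quantify migration and the mobility changes the slide lists.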
44. Towards an Urban Science Incubation Cohort
• OneBusAway: Transit Traveler Information Systems
• Foreclosure rates and changes in poverty concentration
• PNW Seismic Network Early Warning System
• Ocean Observatories Initiative
• Education: CRPE
45. Seattle, the tech and innovation hub
• “most innovative state” (Bloomberg, 12/13)
• “smartest city” (Fast Company, 11/13)
• only US city on “ten best Internet cities” (UBM’s Future Cities blog, 8/13)
• ranked 2nd for women entrepreneurs (GeekWire, 2/13)
• ranked 4th as global startup hub, ahead of NYC (GeekWire, 11/12)
• “the top tech city” (GeekWire, 6/12)
• …and so on
46. eScience Institute + Urban Science
• Better public engagement than in physical and earth sciences
• Leverages our core interest in open data and open science
• Acute need relative to traditionally data-intensive fields
– relative newcomers in DS techniques and technologies
– We prefer collaborations with smaller labs and individuals as opposed to “Big Science” projects
• Seattle offers a unique testbed as an urbanizing region
– Brookings “metro”: Interconnected urban, suburban, rural, environment
– Engaged, active communities
– Strong local interest in open data, open government
– Global hub for technology and innovation (next slide)
• Connections with King County Executive’s office, State CIO’s office, Seattle CTO’s office, local gov data companies (Socrata)
47. Data Science @ UW
We are at the dawn of a revolutionary new era of discovery and learning
Editor's Notes
Institutional change rather than specific research projects
What is the studio?
it’s an open research space where anyone on campus can come to collaborate with a data science team that consists of several permanent staff with expertise in databases, machine learning, visualization, software engineering, reproducibility, and cluster and cloud computing – these are new “research and development” career paths in applied data science, attracting people with significant software backgrounds who are interested in applying their expertise to science problems
The Studio will also house a number of data science fellows – partially funded research scientists, visiting scientists, postdocs, and students (including IGERT students as Magda discussed)
The Studio will be a delivery vector for a number of activities – the seminar series, the lunches, workshops and bootcamps. But you can engage directly with the Studio in a number of ways: the space will be designed to support drop-in collaboration, we will hold scheduled office hours, and a flagship program that I’m really excited about is our data science incubator, which I’ll describe in a moment.
These data science collaborations can spin out tools like SQLShare, but we need to make these technology-oriented collaborations more common.
The next generation of this is an incubation program to scale and concentrate our collaborations
We want to move from “accidental” encounters to engineered partnerships -- identify promising new opportunities and new partners around campus and invest our time with them.
We need a shared environment where researchers can learn not only from our team, but also external mentors and most importantly **each other** – we routinely find shared solutions across very different fields. John Wilkerson in political science is using sequence alignment algorithms from biology for text analytics to trace the flow of ideas through legislation – he’ll have a student participating in our incubation program this spring.
And we intend this to be a true startup environment, with significant potential for technology spinout. We can help find new markets for existing technology as well as finding opportunities for new technologies.
Let me give you a brief example of a project a little further upstream that the incubation program can provide access to.
This work is in a space of open data sharing platforms, along with Socrata here in Seattle, products from Google and Microsoft, and a number of other companies.
Two observations motivate the products in this space:
First, there’s a movement toward open data that has researchers, government agencies, and even companies exposing their data assets online for use by others for reasons of transparency, efficiency, accountability. Even for commercial data, there are marketplaces emerging to facilitate the buying and selling of data. All of these use cases need new technology. So that’s one reason.
Second, if you’re going to use someone else’s data, you need it to be as accessible as possible. In particular, you need to help data analysts use data that “had previously been the realm of programmers and DB administrators” – here I’m quoting Benjamin Romano in an Xconomy article about Socrata.
SQLShare is an open data system, but emphasizes rich data manipulation rather than just fetch and retrieval, interoperability with external tools and existing databases, local or cloud deployments, and built-in services for data integration, profiling, and visualization.
Ginger mentioned this system in her talk – we have maintained a production deployment here on campus for three years focusing on science users. Our observation is that science use cases are a predictor for commercial use cases – businesses are beginning to use data the same way scientists always have – they collect it aggressively, torture it with analytics, use it to make predictions about the world. So we think if we can handle these difficult science use cases that we will also be addressing a significant commercial problem.
GENERALIZATION POSSIBLE – NOT A ONE-OFF ISSUE
DESCRIBE THE GRAPH AND THE AXES CLEARLY
Cluster in space and often in time
Everywhere! But small, often missed by routine network detection…
Eruptions!!
What can they tell us?… What’s the state of the science?
Great – we have a classifier that is accurate.
Let’s extend this work to see to what extent opinions manifest themselves in actions of public health importance.
A lot of discussion in media on return of vaccine-preventable disease outbreaks.
Fear of autism, etc.
H1N1 vaccination rates recorded in January 2010 (older than 6 months) vs. average sentiment score of users in regions (black) and states (gray)
Impressed with accuracy of 84% in a 4-class problem.
Not a surprising result that state-level information might not be that strongly correlated. Wanted to dig deeper into the geographic features of users.
MENTION INABILITY TO REPRODUCE ACCURATE CLASSIFIER
Dow Constantine’s office, King County Executive
Fred Jarrett, Chief of Staff for King County
Tom Stritikus, Dean of the UW College of Education
Thaisa Way, Landscape Architecture
Bill Glenn, Socrata
local company behind data.seattle.gov