Data Science: The Revolution in Science Education
Kirk D. Borne <email@example.com>
(George Mason University -- School of Physics, Astronomy, & Computational Sciences)
• Huge quantities of data are being generated, collected, and stored within all scientific, research, business,
government, and personal domains (including social networks of all sorts).
• Two significant challenges of this BIG DATA flood are addressed here:
o Training the next-generation workforce to manage and expertly use these data … “The Rise of the Data Scientist”
o Discovering the knowledge and surprises hidden within the data … Transforming our repositories from a data representation to a knowledge representation
[Figure: A sea of data (sea of CDs). This is the CD Sea in Kilmington, England (600,000 CDs ~ 300 TB).]
• So how do we address these challenges?
• First, we must face it – i.e., the future researchers that we train as well as knowledge workers (those who extract
knowledge from data and information) must recognize the need and face the challenge.
• Second, we need algorithms, tools, and methodologies from the discipline of Data Science:
… for Big Data management, data mining & knowledge discovery, efficient & effective indexing, data fusion & integration, visual analytics, relevance analysis, dimension reduction, feature selection, semantic mark-up, knowledge mining, knowledge-reuse, knowledge self extraction, self-description, recommendation systems, and more.
More data is different!

More Data is Different – Data Science is Essential
• The message should be clear: “more data is not simply more data, but more data is different.” BIG DATA has big volume, velocity, and variety!
• Numerous federal agencies (and others, of course) have addressed this, including the August 9, 2010 announcement from the White House OSTP:
o Big Data is a national challenge and a national priority, along with healthcare and national security.
o See http://www.aip.org/fyi (#87)
• International initiative by the CODATA organization to address this challenge:
o ADMIRE = Advanced Data Methods and Information technologies for Research and Education
• Many U.S. national study groups in the sciences have issued reports on the urgency of establishing both research and educational programs to face the Big Data challenges.
o Each of these reports has issued a call to action …

Discovery from BIG DATA: many names and many responses …
• Data Mining
• Machine Learning (ML)
• Exploratory Data Analysis (EDA)
• Intelligent Data Analysis (IDA)
• Data Analytics
• Predictive Analytics
• Discovery Informatics
• On-Line Analytical Processing
• Business Intelligence (BI)
• Business Analytics
• Customer Relationship Management
• Target Marketing
• Market Basket Analysis
• Credit Scoring
• Case-Based Reasoning (CBR)
• Connecting the Dots
• Intrusion Detection Systems (IDS)
• Recommendation / Personalization Systems!

Data Science: A National Imperative
1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data (1997)
2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries (2003), downloaded from http://www.sis.pitt.edu/~dlwkshop/report.pdf
3. NSF report: Cyberinfrastructure for Environmental Research and Education (2003), downloaded from http://www.ncar.ucar.edu/cyber/cyberreport.pdf
4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century (2005)
5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda (2005), downloaded from http://www.cra.org/reports/cyberinfrastructure.pdf
6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure (2005), downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf
7. NSF report: The Role of Academic Libraries in the Digital Data Universe (2006), downloaded from http://www.arl.org/bm~doc/digdatarpt.pdf
8. National Research Council, National Academies Press report: Learning to Think Spatially (2006)
9. NSF report: Cyberinfrastructure Vision for 21st Century Discovery (2007), downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf
10. JISC/NSF Workshop report on Data-Driven Science & Repositories (2007), http://www.sis.pitt.edu/~repwkshop/NSF-JISC-report.pdf
11. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale (2007), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf
12. DOE report: Mathematics for Analysis of Petascale Data Workshop Report (2008), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf
13. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society (2009), downloaded from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
14. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (2009)
15. NSF report: Data-Enabled Science in the Mathematical and Physical Sciences (2010), http://www.cra.org/ccc/docs/reports/DES-report_final.pdf

Addressing the D2K (Data-to-Knowledge) Challenge
• Complete end-to-end application of Informatics (BIG DATA Science):
o Data management, metadata management, data search, information extraction, data mining, knowledge discovery, knowledge representation
o All steps are necessary – a skilled workforce is needed to take data to information and then take information to knowledge.
o Applies to any discipline.

Data Science Education: Two Perspectives
• Informatics in Education – working with data in all learning settings
o Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
o Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
o http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”)
• An Education in Informatics – students are specifically trained:
o … to access large distributed data repositories
o … to conduct meaningful inquiries into the data
o … to mine, visualize, and analyze the data
o … to make objective data-driven inferences, discoveries, and decisions
• Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, and others …)
o http://spacs.gmu.edu/ (Computational & Data Sciences @ GMU)

Goals of Data Science Education
• Primary Goal: to increase students’ understanding of the role that data & information play across all disciplines, and to increase students’ ability to use the technologies and methodologies associated with data acquisition, management, search, mining, analysis, and visualization.
• Secondary goals:
o To increase students’ abilities to use databases for inquiry.
o To increase students’ abilities to acquire, process, and explore data with the use of a computer.
o To increase students’ confidence and comfort in using data to address real-world problems (in their chosen scientific discipline, or in any endeavor).
o To increase students’ awareness of ethical issues pertaining to data and information, including privacy, ownership, proper attribution, misuse and abuse of statistics and graphs, data falsification, and objective reasoning from data.
o To demonstrate and to share the joy of discovery from data.

CDS Undergraduate Program at GMU: http://spacs.gmu.edu/
• CDS = Computational and Data Sciences
• Undergraduate B.S. degree program at GMU since 2008
• The DATA SCIENCE component of the curriculum was developed with the support of a grant from the National Science Foundation:
o CUPIDS = Curriculum for an Undergraduate Program In Data Sciences
• Primary Goal: to increase students’ understanding of the role that data plays across the sciences, as well as to increase students’ ability to use the technologies associated with data acquisition, mining, analysis, and visualization.
• Objectives – students are trained:
o … to access large distributed data repositories
o … to conduct meaningful inquiries into the data
o … to mine, visualize, and analyze the data
o … to make objective data-driven inferences, discoveries, and decisions
• Core CDS courses & electives:
o CDS 101 – Introduction to Computational and Data Sciences
o CDS 130 – Computing for Scientists
o CDS 151 – Data Ethics
o CDS 251 – Introduction to Scientific …
o CDS 301 – Scientific Information and Data …
o CDS 302 – Scientific Data and Databases
o CDS 401 – Scientific Data Mining
o CDS 410 – Modeling and Simulations I
o CDS 411 – Modeling and Simulations II

Concluding Remarks and Reflections:
• Now is the time to integrate data-oriented methodologies (Informatics / Data Science) into all degree programs – training the next-generation workforce to use data for knowledge discovery and decision support.
• We have a grand opportunity now to establish dialogue and collaboration across diverse data-intensive research and application communities.
• Students with a broad interest in computers and sciences will benefit from these types of programs: Computational and Data Sciences.
o Actual quote from high school senior visiting the university: “I plan to major in biology, but I wish I could do something with computers also.”
• Students graduating with a traditional discipline-based bachelor’s degree in science generally do not have the background necessary to participate as productive members of modern interdisciplinary scientific research teams, which are becoming increasingly computational- and data-intensive.
• The motivating theme and goal of science degree programs should be to train the next-generation scientists in the tools and techniques of cyber-enabled science (e-Science) to prepare them to confront the emerging petascale challenges of data-intensive science.
• It is also good for society in general that all members of the 21st century workforce are trained in computational and data science skills – i.e., computational literacy and data literacy for all citizens!