What is Data Science?
Jeffrey Stanton
School of Information Studies
Syracuse University

Kilo, Mega, Giga, Tera, Peta, Exa, Zetta = 10^21 bytes
Over 95% of the digital universe is "unstructured data", meaning its content can't be truly represented by its field in a record, such as name, address, or date of last transaction.
In organizations, unstructured data accounts for more than 80% of all information. Source: IDC
…An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications. Not finding information costs that same organization an additional $5.3 million a year. Source: IDC

Why Data Science?
Available data on a scale millions of times larger than 20 years ago: customer transactions; environmental sensor outputs; genetic and epigenetic sequences; web documents; digital images and audio
Heterogeneous data sets, with different representations and formats; mixtures of structured and unstructured data; some, little, or no metadata; distributed across systems
Chaotic information life cycle, where little time and effort is spent on what should be kept and what can be discarded
Diverse and/or legacy infrastructure: mainframes running COBOL connected with high-speed networks to sensor arrays running Linux

Critical Questions
How will global climate change affect sea levels in major coastal metropolitan areas worldwide?
Does genetic screening reduce cancer mortality for adults between the ages of 50 and 59?
What gene sequences in cereal grains are associated with greater crop yields in arid environments?
How can we reduce false positives in automated airline baggage scans without reducing accuracy?
What Internet data can be mined as predictive of firm creation among startups that provide new jobs?

“Big Data” Provides Answers
Water sustainability
Climate analysis and prediction
Energy through fusion
CO2 sequestration
Hazard analysis and management
Cancer detection and therapy
Drug design and development
Advanced materials analysis
New combustion systems
Virtual product design
In silico semiconductor design
NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf

“All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions.”
NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf

Data Scientists: Transforming Data Into Decisions
[Diagram; labels as extracted:]
Knowledge Development for Industry, Education, Government, Research
Domain Experts
Infrastructure Professionals
Information Organization & Visualization
Expertise in specific subject areas
Rapid pace of IT development
Limited opportunity to master technology skills
Limited expertise in domain areas
Data Scientists
Information Analysis
Solution Integration
Proliferation of big data & new technology
Specialized knowledge of HW, FW, MW, SW
Digital Curation
Need for knowledge and information managers
Communication challenges

A Definition of a Data Scientist
A data scientist uses deep expertise in the management, transformation, and analysis of large, heterogeneous data sets to:
Help infrastructure experts with the architecture of hardware and software to manage big data challenges
Help domain experts and decision makers reduce the data deluge into usable knowledge, visualizations, and presentations
Help institutions and organizations control and curate data throughout the information lifecycle