What is Data Science?
Jeffrey Stanton
School of Information Studies
Syracuse University

BIG Data
Kilo, Mega, Giga, Tera, Peta, Exa
Zetta = 10^21 bytes
  • Over 95% of the digital universe is "unstructured data" – meaning its content can't be truly represented by its field in a record, such as name, address, or date of last transaction. In organizations, unstructured data accounts for more than 80% of all information. Source: IDC
  • …An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications. Not finding information costs that same organization an additional $5.3 million a year. Source: IDC

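The prefix ladder on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the decimal (SI) meaning of each prefix; the names SI_PREFIXES, bytes_in, and ratio are hypothetical, chosen for this example.

```python
# Decimal (SI) prefixes named on the slide, mapped to their power of ten.
# Assumption: decimal prefixes (10^3 steps), not binary ones (2^10 steps).
SI_PREFIXES = {
    "kilo": 3,
    "mega": 6,
    "giga": 9,
    "tera": 12,
    "peta": 15,
    "exa": 18,
    "zetta": 21,
}


def bytes_in(prefix: str) -> int:
    """Return how many bytes one unit of the given prefix represents."""
    return 10 ** SI_PREFIXES[prefix.lower()]


def ratio(larger: str, smaller: str) -> int:
    """Return how many `smaller` units fit into one `larger` unit."""
    return bytes_in(larger) // bytes_in(smaller)


if __name__ == "__main__":
    for name in SI_PREFIXES:
        print(f"1 {name}byte = 10^{SI_PREFIXES[name]} bytes")
    print("terabytes per zettabyte:", ratio("zetta", "tera"))
```

Each step up the ladder multiplies the scale by 1,000, so one zettabyte corresponds to a billion terabytes.
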
Why Data Science?
  • Available data on a scale millions of times larger than 20 years ago: customer transactions; environmental sensor outputs; genetic and epigenetic sequences; web documents; digital images and audio
  • Heterogeneous data sets, with different representations and formats; mixtures of structured and unstructured data; some, little, or no metadata; distributed across systems
  • Chaotic information life cycle, where little time and effort is spent on what should be kept and what can be discarded
  • Diverse and/or legacy infrastructure: mainframes running COBOL connected with high-speed networks to sensor arrays running Linux

Critical Questions
  • How will global climate change affect sea levels in major coastal metropolitan areas worldwide?
  • Does genetic screening reduce cancer mortality for adults between the ages of 50 and 59?
  • What gene sequences in cereal grains are associated with greater crop yields in arid environments?
  • How can we reduce false positives in automated airline baggage scans without reducing accuracy?
  • What Internet data can be mined as predictive of firm creation among startups that provide new jobs?

“Big Data” Provides Answers
  • Water sustainability
  • Climate analysis and prediction
  • Energy through fusion
  • CO2 sequestration
  • Hazard analysis and management
  • Cancer detection and therapy
  • Drug design and development
  • Advanced materials analysis
  • New combustion systems
  • Virtual product design
  • In silico semiconductor design
Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf

“All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions.”
Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf

Data Scientists: Transforming Data Into Decisions
  • Knowledge Development for Industry, Education, Government, Research
  • Domain Experts
  • Infrastructure Professionals
  • Information Organization & Visualization
  • Expertise in specific subject areas
  • Rapid pace of IT development
  • Limited opportunity to master technology skills
  • Limited expertise in domain areas
  • Data Scientists
  • Information Analysis
  • Solution Integration
  • Proliferation of big data & new technology
  • Specialized knowledge of HW, FW, MW, SW
  • Digital Curation
  • Need for knowledge and information managers
  • Communication challenges

A Definition of a Data Scientist
A data scientist uses deep expertise in the management, transformation, and analysis of large, heterogeneous data sets to:
  • Help infrastructure experts with the architecture of hardware and software to manage big data challenges
  • Help domain experts and decision makers reduce the data deluge into usable knowledge, visualizations, and presentations
  • Help institutions and organizations control and curate data throughout the information lifecycle

Editor's Notes

  • #3 Images: Facebook friend connections worldwide; a network diagram of the Enron email set; a comparison of similar gene sequences among humans, chimps, and macaques
  • #9 HW, FW, MW, SW: hardware, firmware, middleware, software