WHAT IS DATA SCIENCE?

Jeffrey Stanton
School of Information Studies
Syracuse University
BIG DATA
KILO, MEGA, GIGA, TERA, PETA, EXA
ZETTA = 10^21 BYTES
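
The prefix ladder above can be worked out in a few lines. This is only an illustrative sketch: it assumes the decimal (power-of-1,000) meaning of each prefix, and the names `SI_PREFIXES` and `bytes_for_prefix` are hypothetical helpers, not part of the original slides.

```python
# Decimal (SI) prefixes named on the slide; each step is a factor of 1,000.
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta"]


def bytes_for_prefix(prefix: str) -> int:
    """Return the number of bytes in one <prefix>byte (powers of 1,000 assumed)."""
    exponent = 3 * (SI_PREFIXES.index(prefix.lower()) + 1)
    return 10 ** exponent


# Print the whole ladder, ending at zetta = 10^21 bytes.
for name in SI_PREFIXES:
    exponent = 3 * (SI_PREFIXES.index(name) + 1)
    print(f"1 {name}byte = 10^{exponent} bytes")
```

Under that assumption, one zettabyte is a billion terabytes, since `bytes_for_prefix("zetta") // bytes_for_prefix("tera")` is 10^9.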
"…An organization employing 1,000 knowledge workers loses $5.7 million
annually just in time wasted having to reformat information as they move
among applications. Not finding information costs that same organization
an additional $5.3m a year."
Source: IDC

"Over 95% of the digital universe is 'unstructured data', meaning its
content can't be truly represented by its field in a record, such as
name, address, or date of last transaction. In organizations,
unstructured data accounts for more than 80% of all information."
Source: IDC
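
As a rough worked example, the two IDC cost figures quoted above can be combined into a per-worker estimate. The sketch assumes the two costs do not overlap and can simply be added, which the slide does not state; the variable and function names are hypothetical.

```python
# Figures quoted from IDC on the slide: US dollars per year
# for an organization employing 1,000 knowledge workers.
REFORMATTING_COST = 5_700_000   # time wasted reformatting information
NOT_FINDING_COST = 5_300_000    # cost of not finding information
WORKERS = 1_000


def annual_cost_per_worker(total_cost: int, workers: int) -> float:
    """Spread a yearly organization-wide cost evenly across its workers."""
    return total_cost / workers


# Assumption: the two costs are independent, so they can be summed.
total = REFORMATTING_COST + NOT_FINDING_COST
per_worker = annual_cost_per_worker(total, WORKERS)
print(f"Combined: ${total:,} per year, about ${per_worker:,.0f} per worker")
```

On these figures the combined loss is $11 million a year, or roughly $11,000 per knowledge worker.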
WHY DATA SCIENCE?

• Available data on a scale millions of times larger than 20
  years ago: customer transactions; environmental sensor
  outputs; genetic and epigenetic sequences; web documents;
  digital images and audio
• Heterogeneous data sets, with different representations and
  formats; mixtures of structured and unstructured data;
  some, little, or no metadata; distributed across systems
• Chaotic information life cycle, where little time or effort is
  spent deciding what should be kept and what can be discarded
• Diverse and/or legacy infrastructure: mainframes running
  COBOL connected by high-speed networks to sensor arrays
  running Linux
CRITICAL QUESTIONS

• How will global climate change affect sea levels in major
  coastal metropolitan areas worldwide?
• Does genetic screening reduce cancer mortality for adults
  between the ages of 50 and 59?
• What gene sequences in cereal grains are associated with
  greater crop yields in arid environments?
• How can we reduce false positives in automated airline
  baggage scans without reducing accuracy?
• What Internet data can be mined to predict firm creation
  among startups that provide new jobs?
“BIG DATA” PROVIDES ANSWERS

• Water sustainability
• Climate analysis and prediction
• Energy through fusion
• CO2 sequestration
• Hazard analysis and management
• Cancer detection and therapy
• Drug design and development
• Advanced materials analysis
• New combustion systems
• Virtual product design
• In silico semiconductor design

Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report,
March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
“All grand challenges face barriers due to challenges in
software, in data management and visualization, and in
coordinating the work of diverse communities that must
work together to develop new models and algorithms, and to
evaluate outputs as a basis for critical decisions.”

Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report,
March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
DATA SCIENTISTS: TRANSFORMING DATA INTO DECISIONS

[Diagram: Knowledge Development for Industry, Education, Government, Research]

Domain Experts (left)
• Expertise in specific subject areas
• Limited opportunity to master technology skills
• Proliferation of big data & new technology
• Need for knowledge and information managers

Data Scientists (center)
• Information Organization & Visualization
• Information Analysis
• Solution Integration
• Digital Curation

Infrastructure Professionals (right)
• Rapid pace of IT development
• Limited expertise in domain areas
• Specialized knowledge of HW, FW, MW, SW
• Communication challenges
A DEFINITION OF A DATA SCIENTIST

• A data scientist uses deep expertise in the
  management, transformation, and analysis of large,
  heterogeneous data sets to:
   • Help infrastructure experts with the architecture of hardware
     and software to manage big data challenges
   • Help domain experts and decision makers reduce the data
     deluge into usable knowledge, visualizations, and
     presentations
   • Help institutions and organizations control and curate data
     throughout the information lifecycle

Editor's Notes

  • #3 Facebook friend connections worldwide, a network diagram of the Enron email set, a comparison of similar gene sequences between humans, chimps, and macaques
  • #9 HW, FW, MW, SW: Hardware, Firmware, Middleware, Software