Jeff's What Is Data Science


  • Facebook friend connections worldwide, a network diagram of the Enron email set, a comparison of similar gene sequences between humans, chimps, and macaques
  • HW, FW, MW, SW: Hardware Firmware Middleware Software

    1. WHAT IS DATA SCIENCE? Jeffrey Stanton, School of Information Studies, Syracuse University
    2. BIG DATA
    3. KILO, MEGA, GIGA, TERA, PETA, EXA, ZETTA = 10^21 BYTES… An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications; not finding information costs that same organization an additional $5.3 million a year. Over 95% of the digital universe is "unstructured data," meaning its content can't be truly represented by a field in a record, such as name, address, or date of last transaction; in organizations, unstructured data accounts for more than 80% of all information. Source: IDC
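The prefix ladder on this slide can be made concrete with a short sketch. This is illustrative only: it simply walks the decimal (SI) prefixes named above, where each step is a factor of 10^3 and a zettabyte works out to 10^21 bytes.

```python
# The decimal byte-unit ladder from the slide: kilo through zetta.
# Each prefix is a factor of 10^3 larger than the previous one.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta"]

for i, name in enumerate(PREFIXES, start=1):
    print(f"1 {name}byte = 10^{3 * i} bytes = {10 ** (3 * i):,} bytes")
```

The last line printed confirms the slide's figure: 1 zettabyte = 10^21 bytes.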
    4. WHY DATA SCIENCE? Available data on a scale millions of times larger than 20 years ago: customer transactions; environmental sensor outputs; genetic and epigenetic sequences; web documents; digital images and audio. Heterogeneous data sets, with different representations and formats; mixtures of structured and unstructured data; some, little, or no metadata; distributed across systems. Chaotic information life cycle, where little time and effort is spent on what should be kept and what can be discarded. Diverse and/or legacy infrastructure: mainframes running Cobol connected with high-speed networks to sensor arrays running Linux.
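The structured/unstructured mixture described above can be sketched in a few lines. This is a minimal, hypothetical example (the record fields and note text are invented for illustration): each record carries fixed, queryable fields plus a free-text note that no single schema field fully represents, so the two kinds of content need different query styles.

```python
# Hypothetical records mixing structured fields with unstructured free text.
import re
from datetime import date

records = [
    {"name": "A. Customer", "last_transaction": date(2011, 3, 14),
     "note": "Called twice about a billing error; prefers email follow-up."},
]

# Structured query: an exact lookup against a typed field.
recent = [r for r in records if r["last_transaction"].year >= 2011]

# "Unstructured" query: a crude keyword search over the free text.
billing_issues = [r for r in records if re.search(r"\bbilling\b", r["note"])]

print(len(recent), len(billing_issues))  # prints: 1 1
```

The point of the contrast: the date field supports precise filtering, while the note can only be searched approximately, which is why unstructured content resists representation as "its field in a record."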
    5. CRITICAL QUESTIONS How will global climate change affect sea levels in major coastal metropolitan areas worldwide? Does genetic screening reduce cancer mortality for adults between the ages of 50 and 59? What gene sequences in cereal grains are associated with greater crop yields in arid environments? How can we reduce false positives in automated airline baggage scans without reducing accuracy? What Internet data can be mined as predictive of firm creation among startups that provide new jobs?
    6. "BIG DATA" PROVIDES ANSWERS Water sustainability; climate analysis and prediction; energy through fusion; CO2 sequestration; hazard analysis and management; cancer detection and therapy; drug design and development; advanced materials analysis; new combustion systems; virtual product design; in silico semiconductor design. Source: NSF Advisory Committee for Cyberinfrastructure, Task Force for Grand Challenges, Final Report, March 2011.
    7. "All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions." NSF Advisory Committee for Cyberinfrastructure, Task Force for Grand Challenges, Final Report, March 2011. http://www.n…taskforces/TaskForceReport_GrandChallenges.pdf
    8. Knowledge Development for Industry, Education, Government, Research. Domain Experts: expertise in specific subject areas; limited opportunity to master technology skills; proliferation of big data and new technology; need for knowledge and information managers. Infrastructure/Information Professionals: rapid pace of IT development; limited expertise in domain areas; specialized knowledge of HW, FW, MW, SW; communication challenges. Data Scientists (center of diagram): Organization & Visualization; Information Analysis; Solution Integration; Digital Curation. Caption: Data Scientists: Transforming Data Into Decisions.
    9. A DEFINITION OF A DATA SCIENTIST A data scientist uses deep expertise in the management, transformation, and analysis of large, heterogeneous data sets to: help infrastructure experts with the architecture of hardware and software to manage big data challenges; help domain experts and decision makers reduce the data deluge into usable knowledge, visualizations, and presentations; help institutions and organizations control and curate data throughout the information lifecycle.