Jeff's what isdatascience
Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

  • Facebook friend connections worldwide, a network diagram of the Enron email set, a comparison of similar gene sequences between humans, chimps, and macaques
  • HW, FW, MW, SW: hardware, firmware, middleware, software

Jeff's what isdatascience Presentation Transcript

  • 1. WHAT IS DATA SCIENCE? Jeffrey Stanton, School of Information Studies, Syracuse University
  • 2. BIG DATA
  • 3. KILO, MEGA, GIGA, TERA, PETA, EXA, ZETTA = 10^21 BYTES… An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications. Not finding information costs that same organization an additional $5.3m a year. (Source: IDC) Over 95% of the digital universe is "unstructured data," meaning its content can't be truly represented by its field in a record, such as name, address, or date of last transaction. In organizations, unstructured data accounts for more than 80% of all information. (Source: IDC)
  • 4. WHY DATA SCIENCE? Available data on a scale millions of times larger than 20 years ago: customer transactions; environmental sensor outputs; genetic and epigenetic sequences; web documents; digital images and audio. Heterogeneous data sets, with different representations and formats; mixtures of structured and unstructured data; some, little, or no metadata; distributed across systems. Chaotic information life cycle, where little time and effort is spent on what should be kept and what can be discarded. Diverse and/or legacy infrastructure: mainframes running COBOL connected with high-speed networks to sensor arrays running Linux.
  • 5. CRITICAL QUESTIONS How will global climate change affect sea levels in major coastal metropolitan areas worldwide? Does genetic screening reduce cancer mortality for adults between the ages of 50 and 59? What gene sequences in cereal grains are associated with greater crop yields in arid environments? How can we reduce false positives in automated airline baggage scans without reducing accuracy? What Internet data can be mined as predictive of firm creation among startups that provide new jobs?
  • 6. “BIG DATA” PROVIDES ANSWERS Water sustainability; climate analysis and prediction; energy through fusion; CO2 sequestration; hazard analysis and management; cancer detection and therapy; drug design and development; advanced materials analysis; new combustion systems; virtual product design; in silico semiconductor design. Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
  • 7. “All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions.” Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
  • 8. Data Scientists: Transforming Data Into Decisions. Knowledge Development for Industry, Education, Government, Research. Domain Experts: expertise in specific subject areas; limited opportunity to master technology skills; proliferation of big data & new technology; need for knowledge and information managers. Data Scientists: Information Organization & Visualization; Information Analysis; Solution Integration; Digital Curation. Infrastructure Professionals: rapid pace of IT development; limited expertise in domain areas; specialized knowledge of HW, FW, MW, SW; communication challenges.
  • 9. A DEFINITION OF A DATA SCIENTIST A data scientist uses deep expertise in the management, transformation, and analysis of large, heterogeneous data sets to: help infrastructure experts with the architecture of hardware and software to manage big data challenges; help domain experts and decision makers reduce the data deluge into usable knowledge, visualizations, and presentations; help institutions and organizations control and curate data throughout the information lifecycle.