WHAT IS DATA SCIENCE?
Jeffrey Stanton, School of Information Studies, Syracuse University
KILO, MEGA, GIGA, TERA, PETA, EXA, ZETTA = 10^21 BYTES…
An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications. Not finding information costs that same organization an additional $5.3 million a year. Source: IDC
Over 95% of the digital universe is "unstructured data," meaning its content can't be truly represented by a field in a record, such as name, address, or date of last transaction. In organizations, unstructured data accounts for more than 80% of all information. Source: IDC
WHY DATA SCIENCE?
Available data on a scale millions of times larger than 20 years ago: customer transactions; environmental sensor outputs; genetic and epigenetic sequences; web documents; digital images and audio
Heterogeneous data sets, with different representations and formats; mixtures of structured and unstructured data; some, little, or no metadata; distributed across systems
Chaotic information life cycle, where little time and effort is spent deciding what should be kept and what can be discarded
Diverse and/or legacy infrastructure: mainframes running COBOL connected by high-speed networks to sensor arrays running Linux
CRITICAL QUESTIONS
How will global climate change affect sea levels in major coastal metropolitan areas worldwide?
Does genetic screening reduce cancer mortality for adults between the ages of 50 and 59?
What gene sequences in cereal grains are associated with greater crop yields in arid environments?
How can we reduce false positives in automated airline baggage scans without reducing accuracy?
What Internet data can be mined as predictive of firm creation among startups that provide new jobs?
“BIG DATA” PROVIDES ANSWERS
Water sustainability
Climate analysis and prediction
Energy through fusion
CO2 sequestration
Hazard analysis and management
Cancer detection and therapy
Drug design and development
Advanced materials analysis
New combustion systems
Virtual product design
In silico semiconductor design
NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
“All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions.”
NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
Data Scientists: Transforming Data Into Decisions
Knowledge Development for Industry, Education, Government, Research
Domain Experts: Expertise in specific subject areas; Limited opportunity to master technology skills; Proliferation of big data & new technology; Need for knowledge and information managers
Data Scientists: Information Organization & Visualization; Information Analysis; Solution Integration; Digital Curation
Infrastructure Professionals: Rapid pace of IT development; Limited expertise in domain areas; Specialized knowledge of HW, FW, MW, SW; Communication challenges
A DEFINITION OF A DATA SCIENTIST
A data scientist uses deep expertise in the management, transformation, and analysis of large, heterogeneous data sets to:
Help infrastructure experts with the architecture of hardware and software to manage big data challenges
Help domain experts and decision makers reduce the data deluge into usable knowledge, visualizations, and presentations
Help institutions and organizations control and curate data throughout the information lifecycle