This document provides an introduction to a masters level course on data science taught by Wray Buntine at Monash University. It outlines the course modules which cover topics such as data models, data types and storage, data analysis processes, and data curation. It also discusses definitions of data science, the data science process, and how data science has emerged and is impacting other fields.
introds_110116.pdf
1. Introduction to Data Science
Wray Buntine
http://topicmodels.org
Monash University
Intro. to Data Science, © Wray Buntine, 2015 Slide 1 / 142
2. Background
• material developed as part of an introductory masters unit at Monash University
• 6 modules over semester, 1 module in 2 weeks
• download the slides now:
  • links in the slides are active
  • there are some videos and readings I will recommend
• useful resources (blogs, news lists, magazines, etc.) at my blog
• please interrupt me with questions!
3. Overview of Content
1. Data Science and Data in Society
   overview and look at projects
   (job) roles, and the impact
2. Data Models in Organisations
   data business models
   application areas and case studies
3. Data Types and Storage
   characterising data and "big" data
   data sources and case studies
4. Data Resources, Processes, Standards and Tools
   resources and standards; resources case studies
5. Data Analysis Process
   data analysis theory; data analysis process
6. Data Curation and Management
   issues in data management
   data management frameworks
4. Outline
Data Science and Data in Society
Data Models in Organisations
Data Types and Storage
Data Resources, Processes, Standards and Tools
Data Analysis Process
Data Curation and Management
5. What is Data Science
basic descriptions and history
9. What is Data Science?
• contains the word “science” so cannot be a science; NB. this is an old joke ...
• circular: data science is what the data scientist does
• less circular but a tiny bit more helpful: data science is the technology of handling and extracting value from data
• narrow: machine learning on big data
10. Machine Learning Definition
(well understood and agreed on)
Machine Learning is concerned with the development of algorithms and techniques that allow computers to learn.
• concerned with building computational artifacts
• but the underlying theory is statistics
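To make "algorithms that allow computers to learn" concrete, here is a minimal sketch, not taken from the slides: a program that estimates a linear rule from example pairs using ordinary least squares, a standard statistical method. The function name fit_line and the data are illustrative assumptions.

```python
# Minimal sketch: "learning" a rule y = slope * x + intercept from examples.
# fit_line and the toy data below are illustrative assumptions.

def fit_line(xs, ys):
    """Ordinary least squares for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and co-variation of x with y, around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy examples generated by the rule y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Use the learned rule on an input that was not among the examples.
prediction = slope * 5.0 + intercept
```

The program is a computational artifact, but the rule it recovers comes from a statistical estimate over the examples rather than from hand-written logic.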
11. Why Machine Learning?
• Human expertise does not exist, e.g. Martian exploration.
• Humans cannot explain their expertise or reduce it to a ruleset, or their explanation is incomplete and needs tuning, e.g. speech recognition.
• Many solutions need to be adapted automatically, e.g. user personalisation.
• Situation changing in time, e.g. junk email.
• There are large amounts of data, e.g. discover astronomical objects.
• Humans are expensive to use for the work, e.g. zipcode recognition.
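The junk email case can be sketched in a few lines. The following is an illustrative toy, not from the slides: a word-count naive Bayes classifier with add-one smoothing, where the names train and classify and the four training messages are assumptions made for the example. Instead of maintaining a hand-written ruleset, the filter is re-estimated from labelled messages, so it can be retrained as junk email changes over time.

```python
import math
from collections import Counter

# Toy naive Bayes junk-email filter; all names and data are illustrative.

def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability score."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_msgs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Log prior: fraction of training messages with this label.
        score = math.log(label_counts[label] / total_msgs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
model = train(training)

label_1 = classify("win cheap money", *model)
label_2 = classify("meeting tomorrow at noon", *model)
```

Calling train again on a newer set of labelled messages replaces the word statistics, which is the automatic adaptation that a fixed ruleset lacks.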
12. Why Machine Learning?
you don’t want to
be this guy!
13. Why Machine Learning?
I the information society
I information warfare
I information overload
I information access
Exercise: Google these to find out about them!
14. Data Science Examples
I Google’s spell checker and translate
I Amazon.com’s recommendation engine
I “saturated fat is not bad for you after all”
I Microsoft’s Predictive Analytics for Traffic
from 2005
15. Historical Context
Wolfram Alpha: computable knowledge history
Cloud Infographic: Evolution Of Big Data
16. Web X.0 (credit: Nova Spivack)
17. The Data Science Process
what happens in a Data Science project?
I illustrating the process
I a quick walkthrough illustrating the steps
I the standard value chain
I our model of the process
18. The Data Science Process:
Illustrating the Process
a quick walkthrough illustrating the steps
19. The Data Science Process
I many different tasks come together to complete a
Data Science project
I a data scientist should be familiar with most, but
doesn’t need to be an expert in all
I not all are labelled as Data Science
I some come from other fields, such as computer engineering,
business, ...
20. Pitch your ideas.
“Young Business Man Holding a Tablet” by Pic Basement, CC-BY 2.0
21. Researchers preparing to x-ray a patient.
by Stephen Ausmus acquired from USDA ARS, public domain.
22. Scientists watch over data collected by the
gravimeter and magnetometer instruments.
by NASA/GSFC/Jefferson Beck, CC-BY 2.0
23. Data can be obtained from many sources.
icons by Openclipart.org, public domain
24. Some of the best data is Open Data.
by Libby Levi for opensource.com, CC-BY-SA 2.0
25. Linked Open Data (LOD) graph gives
semantics.
by Open Knowledge, CC-BY-SA 2.0
26. Navigate data standards and formats.
“The Web is Agreement” cropped, by Paul Downey, CC-BY 2.0
27. Understand the database schema.
by Eric, Sql Designer, CC-BY-SA 2.0
28. Governance cares for the data and its subjects.
icons by Openclipart.org, public domain;
Good and Evil by AJC ajcann.wordpress.com, CC-BY-SA 2.0
29. Data engineers make the back-end work.
by Intel Free Press, CC-BY 2.0
30. Inspect and clean the data.
“rstudio” by mararie, CC-BY-SA 2.0
31. Propose a conceptual/mathematical/functional model.
“Mathematics” by Tom Brown, CC-BY 2.0
32. Analyst builds models with his favorite tool.
33. Analysis, statistics and/or machine learning works on
the data.
“From Data to Wisdom” by Nick Webb, CC-BY 2.0
34. Choose visualizations, many different options!
“Visualization Matrix” cropped, by Lauren Manning, CC-BY 2.0
35. Visualise data to interpret/present results.
by Stephen Ausmus acquired from USDA ARS, public domain.
36. Data science process flowchart.
by Farcaster, CC-BY-SA 3.0
37. Operationalization: putting the results to work.
“Illustration of Strategy” by Denis Fadeev, CC-BY-SA 3.0
38. The Data Science Process:
A Proposed Value Chain
our model of the process
39. Parts of a Data Science Project
Collection: getting the data
Engineering: storage and computational resources across full
lifecycle
Governance: overall management of data across full lifecycle
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing the case that the results are significant
and useful
Operationalisation: putting the results to work, so as to gain
benefits or value
We call this the Standard Value Chain.
40. The Value Chain
Collection: getting the data
Engineering: storage and computational resources
Governance: overall management of data
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing that results are significant and useful
Operationalisation: putting the results to work
we will refer to this
throughout!
41. Doing Data Science
Data scientist ::= addresses the data science process to
extract meaning/value from data
42. From What is Data Science?
A quote from Jeff Hammerbacher
... on any given day, a team member could author a
multistage processing pipeline in Python, design a
hypothesis test, perform a regression analysis over
data samples with R, design and implement an
algorithm for some data-intensive product or service in
Hadoop, or communicate the results of our analyses
to other members of the organization ...
hypothesis test ::= statistical test to evaluate a simple claim
regression analysis ::= fitting a curve to real valued data
Hadoop ::= system for partitioning computation across a
compute cluster
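The definition of regression analysis above can be made concrete with a small sketch. It fits a straight line by ordinary least squares using only the standard library; the data points and the function name fit_line are made up for illustration, and a real analysis would normally use R or a statistics package.

```python
# Minimal sketch: ordinary least squares fit of a straight line y = a + b*x.
# The data points below are invented for illustration only.

def fit_line(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

The fitted slope and intercept are the "curve" in "fitting a curve to real valued data"; a hypothesis test could then ask whether the slope differs from zero.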
43. Data Science Emerges
the beginnings of data science
44. Related: Data Engineering
I building scalable systems for storage, processing data
I e.g. Amazon Web Services, Teradata, Hadoop, ...
I databases, distributed processing, datalakes, cloud
computing, GPUs, wrangling, ...
I huge, continuous improvement ....
45. Related: Data Analysis
I performing analysis and understanding results
I e.g. R, Tableau, Weka, Microsoft Azure Machine
Learning, ...
I machine learning, computational statistics,
visualisation, ...
I huge, continuous improvement ....
46. Related: Data Management
I managing data through its lifecycle
I e.g. ANDS, Talend, Master Data Management, ...
I ethics, privacy, provenance, curation, backup,
governance, ...
I huge, continuous improvement ....
47. Fits and Starts
I Data Analysis (John Tukey) in 1962
I Expert Systems in the 1980’s
I Machine Learning in the 1980’s
I Data Mining in the 1990’s
I see Business Week’s “Database Marketing” cover
story September 1994
48. Data Science Emerges ∼2000
I data analysis came of age 1990’s
I William Cleveland publishes in 2001
“Data Science: An Action Plan for ... the field of Statistics”
I data engineering came of age 2000’s (Dot.Com
boom)
I (digital) data management came of age 2000’s
(Dot.Com boom)
I the data/information society
I business pressure on decision making
I “data” as a valuable asset
I Dot.Com companies show the way
see also David Donoho’s “50 years of Data Science” (PDF paper)
50. Data Science Research
Programs
I National Institute of Standards and Technology (NIST, in US)
Big Data Working Group (2013-2015)
I US National Academy of Sciences’
Committee on the Analysis of Massive Data
(2013)
I Alan Turing Institute for Data Science at
London’s new Knowledge Quarter (near
National Library, 2016-???)
I major growth in universities internationally
51. Impact of Data Science
some examples of how data science is impacting others:
I your life in the cloud
I datafication of you
I science and social good
I scientific method holds true, but broadens technology
52. Your Life on the Cloud
From Year Zero: Our life timelines begin
Our personal information is increasingly stored in the cloud
(though perhaps behind firewalls): social life (Facebook), career
(LinkedIn), search history (Google, etc.), health and medical
(Fitbit, TBD), music (Apple), ...
Many, many advantages:
e.g. personal agents
I computerised support for health
I ...
But some disadvantages:
e.g. security and privacy breaches
I ...
53. Your Life on the Cloud (cont.)
But
I corporate leakage to government (security, tax, etc.)
I what if you don’t have rights to access/delete data?
I security and privacy breaches
I what if we’ve changed our ways?
I the department of pre-crime
I corporate mergers
I “the science is settled” and government mandates
54. Data Science for Science
I fields like physics, bioinformatics and earth science used
big data anyway
I had their own independent data science revolution
I in other areas has raised the profile of data-driven science
I spurred on governments to develop cross-disciplinary
programmes
I Alan Turing Institute for Data Science in the UK
I has provided new data sources and tools for collecting
data
I crowd sourcing
I social media
I allows for citizen/participatory science
I DataONE
55. Data Science for Social Good
Example:
“Data, Predictions, and Decisions in Support of People and Society”
by Eric Horvitz (Distinguished Scientist & Managing Director at
Microsoft) see the final section of video 46:51-53:00 mins.
Interactive website Aid Data (making development finance data
more accessible).
Data Science for Social Good movement training data
scientists to support community and charity.
56. Health Care Futurology
see “Big data – 2020 vision” talk by SAP manager John Schitka
I your stomach can be instrumented to assess contents,
nutrients, etc.
I your bloodstream can be instrumented to assess insulin
levels, etc.
I your “health” dashboard can be online and shared by your
GP
I health management organisations (HMO) tying funding
levels to patient care performance
I GP/HMO will know about your ice cream/beer binge last
night and you missing your morning run
I longitudinal studies feasible
57. Outline
Data Science and Data in Society
Data Models in Organisations
Data Types and Storage
Data Resources, Processes, Standards and
Tools
Data Analysis Process
Data Curation and Management
58. Business Models
From Wikipedia:
A business model describes the rationale of how an
organization creates, delivers, and captures value, in
economic, social, cultural or other contexts.
Examples of general classes:
I retailer versus wholesaler
I luxury consumer products
I software vendor
I service provider
What kinds of businesses do we have operating in the Data
Science world?
62. Amazon.com (cont.)
I an assembly line for the retail industry, with support for
embedded online retailers
I huge stock of books, DVDs, CDs, etc., easily searchable
I extensive customer reviews
63. Amazon.com (cont.)
Information-based differentiation: satisfies customers by
providing a differentiated service:
I superior information including reviews about
products
I superior range
Information-based delivery network: they deliver information for
others; retailers in the Amazon marketplace get:
I customers directed to them
I other retailers’ support
64. Data Business Models
Information brokering service: buys and sells data/information
for others.
Information-based differentiation: satisfies customers by
providing a differentiated service built on the
data/information.
Information-based delivery network: delivers data/information
for others.
“What a Big-Data Business Model Looks Like” by Ray Wang in
the Harvard Business Review claims these are unique in the
data world.
Data Science companies can pursue other business models,
software as a service, consulting, CRM, etc.
65. Data Providers
data provider ::= business selling the “data” it collects,
e.g., LexisNexis
I this is a traditional business model, selling data not widgets
I so does not fit into Wang’s categories (though is borderline
“data broker”)
I fastest growing segment of the IT industry post 2000 (see
Evan Quinn’s blog post on Infochimps.com April 2013
“Is Big Data the Tail Wagging the Data Economy Dog?”)
I some call this the data economy
67. Netflix: Example Case Study
I on demand internet streaming, and flat-rate DVD rental
I over 50 million subscribers in the US by 2014
I international market
I video recommendation!
I established the Netflix Prize in 2006-2009 as a
crowdsourced way of testing out algorithms
By Ivongala (Own work) [Public domain], via Wikimedia Commons
68. Netflix: Analysis
data sources: user rankings, user profiles
data volume: (2012) 25 million users, 4 million ratings/day, 3 million
searches/day, video cloud storage 2 petabytes
data velocity: video titles change daily, rankings/ratings updated
data variety: user rankings, user profiles, media properties
software: Hadoop, Pig, Cassandra, Teradata
analytics: personalised recommender system
processing: analytic processing, streaming video
capabilities: ratings and search per day, content delivery
security/privacy: protect user data; digital rights
lifecycle: continued ranking and updating
other: mobile interface
69. Application Areas from MGI
The McKinsey Global Institute report on Big Data from 2011,
“Big data: The next frontier for innovation, competition, and productivit
1. Health
2. Government
3. Retail
4. Manufacturing
5. Location Technology
NB. What happened to Science? MGI is an industry
organisation.
70. Application Areas from NIST
I government operation
I commercial
I defense
I healthcare and life sciences
I social media
I research infrastructure/ecosystem
I astronomy/physics
I earth science
I energy
72. Big Data
describing big data and its characterisation
74. Moore’s Law
I transistor count stated to double every 2 years, starting 1975
I transistor count translates to:
I more memory
I bigger CPUs
I faster memory, CPUs (smaller==faster)
I pace currently slowing
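As a worked piece of arithmetic for the doubling rule above: if a quantity doubles every 2 years, then after n years it has grown by a factor of 2^(n/2). The sketch below assumes an idealised, constant doubling period (real progress has not been this regular), and the function name is chosen here for illustration.

```python
# Idealised Moore's Law arithmetic: growth factor after a number of years,
# assuming a constant doubling period (an assumption, not a measured fact).

def moore_growth_factor(years, doubling_period=2.0):
    """Return how many times larger the transistor count would be."""
    return 2.0 ** (years / doubling_period)

for years in (2, 10, 20, 40):
    print(f"after {years} years: x{moore_growth_factor(years):,.0f}")
```

Forty years of doubling every two years gives 2^20, roughly a million-fold increase, which is why even a modest slowdown in pace matters.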
75. Big Data
From Big data on Wikipedia:
Big data usually includes data sets with sizes beyond
the ability of commonly used software tools to capture,
curate, manage, and process data within a tolerable
elapsed time. Big data "size" is a constantly moving
target, ...
I don’t always ask why, can simply detect patterns
I a cost-free byproduct of digital interaction
I enabled by the cloud: affordability, extensibility, agility
76. Big Data and “V”s
I 2001 Doug Laney produced report describing 3 V’s:
“3-D Data Management: Controlling Data Volume,
Velocity and Variety”
I these characterise bigness, adequately
I other V’s characterise problems with analysis and
understanding
Veracity: correctness, truth, i.e., lack of ...
Variability: change in meaning over time, e.g., natural
language
I other V’s characterise aspirations
Visualisation: one method for analysis
Value: what we want to get out of the data
I think of any more? write a blog!
77. Different Kinds of Data
some examples
83. Internet of Things Data
84. MetaData
metadata ::= structured information that describes, explains,
locates, or otherwise makes it easier to retrieve, use or manage
an information resource.
I is data about data
I a computer can process and interpret it
Descriptive: describes content for identification and retrieval
e.g. title, author
Structural: documents relationships and links
e.g. elements in XML, containers in MPEG
Administrative: helps to manage information
e.g. version number, archiving date, DRM
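To illustrate the three kinds of metadata, here is a small sketch of a record a computer can process. The field names and values are hypothetical and do not follow any particular standard such as Dublin Core.

```python
# Hypothetical metadata record for one document, grouped by the three kinds
# described above. Field names and values are invented for illustration.
metadata = {
    "descriptive": {"title": "Annual Rainfall Report", "author": "J. Smith"},
    "structural": {"format": "XML", "sections": ["summary", "tables", "appendix"]},
    "administrative": {"version": "1.2", "archived": "2015-06-30", "rights": "CC-BY 2.0"},
}

def find_by_author(records, author):
    """Retrieve records using descriptive metadata only (no content scan)."""
    return [r for r in records if r["descriptive"].get("author") == author]

matches = find_by_author([metadata], "J. Smith")
print(len(matches), "record(s) by J. Smith")
```

The lookup never touches the document's content, which is the point of metadata: it makes a resource easier to retrieve, use or manage.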
85. Metadata Example
Let us look at examples to characterise the metadata:
I Australian Government
Digital Transformation Office, Service Standard webpage
I medical bibliographic data in XML on PubMed,
“Lower respiratory tract disorder hospitalizations
among children born via elective early-term
delivery”
86. Infographics on Data
I “Data Science Matters” from the datascience@berkeley
Blog
I “60 Seconds – Things That Happen On Internet Every 60 secs”
from GO-Gulf
I “60 Seconds – Things That Happen Every 60 secs Part 2”
again
88. JSON Example
I example from Wikipedia
I no fixed format
I semi-structured,
key-value pairs,
hierarchical
I “friendly” alternative to
XML
I self-documenting
structure
I example,
EventRegistry file
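The properties listed above (key-value pairs, hierarchy, no fixed format) can be seen in a short sketch using Python's standard json module. The record is invented for illustration; it is not the Wikipedia or EventRegistry example.

```python
import json

# An invented JSON record: nested objects and arrays, no declared schema.
text = """
{
  "name": "Alice",
  "age": 30,
  "address": {"city": "Melbourne", "postcode": "3000"},
  "phones": [{"type": "home", "number": "555-0100"}]
}
"""

record = json.loads(text)                      # parse into dicts and lists
city = record["address"]["city"]               # navigate the hierarchy by key
first_phone = record["phones"][0]["number"]    # arrays are indexed by position
email = record.get("email")                    # absent keys are allowed: no fixed format

print(city, first_phone, email)
```

Because no schema is enforced, code reading JSON usually has to cope with missing keys, as the use of get for the absent "email" field suggests.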
89. Graph Database Example
I example graph
I example content
FreeBase page for “Arnold Schwarzenegger”
I example content format FreeBase extract
I stores graph, commonly as triples (subject, verb,
object)
I commonly used to store Linked Open Data
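A minimal sketch of the triple idea follows. It stores (subject, verb, object) tuples in a plain list and answers pattern queries; the triples are hand-written for illustration rather than taken from Freebase, and a real graph database would use URIs, indexes and a query language.

```python
# Hand-written triples for illustration (not an actual Freebase extract).
triples = [
    ("Arnold Schwarzenegger", "born_in", "Austria"),
    ("Arnold Schwarzenegger", "acted_in", "The Terminator"),
    ("The Terminator", "directed_by", "James Cameron"),
]

def match(triples, subject=None, verb=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, v, o)
        for (s, v, o) in triples
        if (subject is None or s == subject)
        and (verb is None or v == verb)
        and (obj is None or o == obj)
    ]

# Everything known about one subject:
for triple in match(triples, subject="Arnold Schwarzenegger"):
    print(triple)
```

Following an object of one triple to the subject of another (here "The Terminator") is how a query walks the graph.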
90. Database Background
Concepts
Many NoSQL and SQL DBs offer:
I large scale, distributed processing
I robustness achieved
I general query languages
I some notion of consistency
e.g. “eventually” as nodes spread updates
91. Beyond SQL Databases
(NoSQL)
Type             Examples                Notes
RDBMS            MySQL, MSSQL Server     SQL
Object DB        Zope, Objectivity       navigate network
Doc. DB          MongoDB, CouchDB        JSON like, Javascript like queries
key-val cache    Memcached, Coherence    in-memory
key-val store    Aerospike, HyperDex     not in-memory but highly optimised
tabular key-val  Cassandra, HBase        relational-like, “wide column store”
graph DB         Neo4j, OrientDB         RDF, SPARQL
92. Beyond SQL Databases, cont.
I NoSQL databases offer a rich variety beyond traditional
relational.
I Many target web applications.
I See blog post by Eric Knorr 19/11/2012 on Infoworld.com,
“The wild, crazy world of databases”
I See blog post by Fabian Pascal 12/17/2015 on
AllAnalytics.com,
“Data Fundamentals for Analysts: Documents and Databases”.
93. Overview: Processing
Interactive: bringing humans into the loop
Streaming: massive data streaming through system with little
storage
Batch: data stored and analysed in large blocks,
“batches,” easier to develop and analyse
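The difference between batch and streaming processing can be sketched with a simple statistic. The batch version needs all the data stored; the streaming version keeps only a count and a running mean. The class and function names are chosen here for illustration.

```python
# Batch: the whole data set is stored, then analysed in one pass.
def batch_mean(values):
    return sum(values) / len(values)

# Streaming: each value is seen once and discarded; only O(1) state is kept.
class StreamingMean:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        """Fold one new value into the running mean."""
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean

readings = [4.0, 8.0, 6.0, 2.0]
stream = StreamingMean()
for x in readings:
    stream.update(x)

print(batch_mean(readings), stream.mean)
```

Both give the same answer here, but many analyses have no exact streaming form, which is part of why batch processing is described as easier to develop and analyse.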
94. Distributed Analytics
I legacy systems provide powerful statistical tools on the
desktop
I SAS, R, Matlab
but often-times without distributed or multi-processor
support
I supporting distributed/multi-processor computation
requires special redesign of algorithms
I in-database analytics systems intended to support this
e.g. MADlib from Pivotal and MLlib from Spark integrate with
their distributed SQL
95. Hadoop
I Java implementation of Map-Reduce developed by
Doug Cutting while at Yahoo!
I architecture:
Common: Java libraries and utilities
YARN: job scheduling and cluster management
Hadoop Distributed File System (HDFS™):
MapReduce: core
I huge tool ecosystem
I well past the peak of the hype curve
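The Map-Reduce idea that Hadoop implements can be sketched on a single machine. The example below is plain Python and does not use Hadoop or its API; the function names are chosen for illustration. It shows the three stages: map each record to key-value pairs, group pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data is big", "data science uses data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

On a cluster, the map calls run in parallel on different partitions of the data and the shuffle moves pairs between machines; the logic of each stage stays the same.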
96. Spark
I another (open source) Apache top-level project at
Apache Spark
I developed at AMPLab at UC Berkeley
I builds on Hadoop infrastructure (HDFS, etc.)
I interfaces in Java, Scala, Python, R
I provides in-memory analytics
I works with some of the Hadoop ecosystem
98. Case Study: Health Care Data
When Health Care Gets a Healthy Dose of Data “How
Intermountain Healthcare is using data and analytics to
transform patient care,” June 25, 2015, Michael Fitzgerald
I 8000 word article in MIT Sloan Management Review
I behind “membership” (ask Wray if you want a copy)
I use of Electronic Health Records (EHR)
I Intermountain Healthcare has 22 hospitals and 185 clinics
I 2009 US government mandated “all health care providers
adopt and demonstrate ‘meaningful use’ of EHR”
I promote data-driven decision making
99. Health Care Data, cont.
I first computer support in 1985
I data quality and data gathering important
e.g. if you show physician their quality metrics are below
average they say (1) “your data isn’t accurate” and (2) “my
patients are different”
I need a common language for data across departments
and hospitals
I big clinical teams (newborn, cardiovascular, ear nose &
throat) have their own data manager and data analyst
I some nurses assigned to data recording roles
I approx. 10 analysts per hospital
100. Health Care Data, cont.
e.g. analysed lifestyle of diabetes patients with good blood
sugar levels to understand factors and inform the team
e.g. experimented with different practices in surgery (e.g. no
personal clothing items in operating theatre), measure
effect on infections after 6 months then keep/drop
e.g. approx. US$40 million supply costs per hospital per year;
analysed alternative items for use and cost to recommend
which to use
e.g. pull readings of patients’ vital signs and send an email alert
flagging patients at risk of heart failure
e.g. give assessment of patient’s likelihood of being readmitted
to the hospital once released
101. General Comments
I requires careful collection of standardised data
I embed analytics in processes rather than as an add-on, to
allow feedback loops
I implementation pushback from staff over the effort
I analytics demonstrated improved patient care due to
monitoring and improving processes
I main use is descriptive analytics not full data science
I comparative evaluation of alternatives
I data-driven process improvement
I experimental evaluation of processes
I some predictive analytics
I predicting readmittance
I predicting heart failure
103. Getting Data
data sources and working with data
104. NYC Data
NYC under Mayor Bloomberg embarked on a program to make
the city’s data accessible:
I “How data and open government are transforming NYC” in
Radar.O’Reilly:
“In God We Trust,” tweeted New York City Mayor
Mike Bloomberg this month. “Everyone else,
bring data.”
I “Bloomberg signs NYC ’Open Data Policy’ into law, plans
web portal for 2018,” in Engadget
I NYC Open Data portal
I City of Melbourne’s open data platform
105. NYC Data, cont.
“How we found the worst place to park in New York City” gives
examples, and a discussion of the complexities of getting data
out of NYC:
Map of road speed by day+time: from GPS data for NYC cabs;
data obtained via FOIL request, then made
public by recipient
Danger spots for cycles: NYPD crash data obtained by daily
download of PDF files followed by (non-trivial)
extraction
Dirty waterways: fecal coliform measurements on waterways
from Department of Environmental Protection’s
website; extracted from Excel sheets per site;
each in a different format
Faulty road markings: parking tickets for fire-hydrants by
location from NYC Open Data portal; need to
normalize the addresses supplied
106. Traffic Prediction
see 7:40-11:06 on Clearflow in
“Data, Predictions, and Decisions in Support of People and Society,”
by Eric Horvitz
I forecasting traffic: blockages, clearing, surprising
situations, alternate routes
I critical data:
I GPS data on traffic flow
I maps
I incidents and events
I weather
I see Microsoft Introduces Tool for Avoiding Traffic Jams in
NYT 2008
107. Democratization of Data
“The New Data Republic: Not Quite a Democracy” in MIT Sloan
Review 2015
I from Hal Varian: “information that once was available to
only a select few. . . available to everyone”
I from Robert Duffner: “finally puts crucial business
information in the hands of those who need it”
I government and IT departments building data and
infrastructure to allow sharing
I USA Open Gov Initiative
I analytic tools, desktop and web-based, available to
analyse it
I but people need the right skills
open data is all good and well, but people need to be able to
use it too!
108. Linked Open Data
LOD project started by
Prof. Sir Tim Berners-Lee, OM, KBE, FRS, FREng, FRSA, DFBCS.
I objects given a URI (like a URL)
I relationships between two objects can be represented as a
triple, (subject, verb, object)
I relation itself is another URI
I data has an open license for use
e.g. example on NYT
I a tutorial on LOD by Tom Heath
109. Wrangling
manipulating data to make it directly usable for analysis
110. Wrangling Examples
I want the core news text, title, date, etc. off the following
page Apple’s iPhone loses top spot to Android in Australia
I want the text plus details from the PDF file
“Data Wrangling: The Challenging Journey from the Wild to the L
I want all article titles from the PUBMED results xml
I want to digitize the text off a scanned letter
I want to extract all the sentences referring to Hillary Clinton
in a news article
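As one example of the wrangling tasks above, article titles can be pulled out of an XML result file with Python's standard xml.etree.ElementTree module. The XML below is a simplified, invented stand-in; the element names are an assumption about what a PubMed export looks like and should be checked against a real file.

```python
import xml.etree.ElementTree as ET

# Simplified, invented stand-in for a PubMed results file.
# Element names are assumptions; inspect a real export before relying on them.
xml_text = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article><ArticleTitle>Example title one</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article><ArticleTitle>Example title two</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()

root = ET.fromstring(xml_text)
# iter() walks the whole tree, so nesting depth does not matter here.
titles = [element.text for element in root.iter("ArticleTitle")]
print(titles)
```

Structured sources like this are the easy case; the web page, PDF and scanned-letter examples need extraction heuristics or OCR before any such parsing is possible.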
111. Wrangling Examples, cont.
I your company has customer records in 4 different
databases in different formats; you want a single
standardised set of customer names and addresses
I convert addresses in your customer database into
geographic latitude and longitude
I convert free text dates to standard format, e.g. “next
Tuesday”, “2nd January 15”, “January 3 next year”, “3rd
Friday in the month”, “03/31/15”, “31/03/15”
I recognise what values in your data are “unknown” or
“illegal”
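Two of the tasks above, standardising dates and recognising "unknown" values, can be sketched with the standard library. The sketch only handles numeric dates; free text such as “next Tuesday” needs a dedicated parser. The list of unknown markers and the function names are assumptions made for illustration.

```python
from datetime import datetime

# Markers treated as "unknown" are an assumption; real data needs its own list.
UNKNOWN_MARKERS = {"", "n/a", "na", "unknown", "?"}

def is_unknown(value):
    """True if the raw field is one of the assumed 'unknown' markers."""
    return value.strip().lower() in UNKNOWN_MARKERS

def candidate_dates(text, formats=("%d/%m/%y", "%m/%d/%y")):
    """Return every distinct date the text could mean under the given formats."""
    found = set()
    for fmt in formats:
        try:
            found.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # the text is not a valid date in this format
    return sorted(found)

for raw in ["31/03/15", "03/31/15", "03/04/15", "N/A"]:
    if is_unknown(raw):
        print(raw, "-> unknown")
    else:
        print(raw, "->", [d.isoformat() for d in candidate_dates(raw)])
```

“31/03/15” and “03/31/15” each have only one valid reading, but “03/04/15” has two, so the day/month order has to come from knowledge of the data source rather than from the value itself.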
113. Example Standards
I metadata such as Dublin Core
I XML formats for sharing models, PMML (see below)
I standards for the data mining/science process, such as
CRISP-DM
I health codes: disease and health problem codings ICD-10
I systematized nomenclature of medicine, clinical terms,
SNOMED CT
What other sorts of things might you have standards for?
114. Data Science Process
I using our own “standard Data Science value chain” to
describe the process
I CRISP-DM discussed previously
I statisticians sometimes use the term exploratory data
analysis for part of the process
I while not a standard, one can take this sort of specification
to the extreme: see “Data Science life-cycle”
115. The API Economy
I The Application Economy: A New Model for IT (CISCO)
I The Application Economy Is Changing the Future of Business
I ProgrammableWeb API Category: Data
I Top 30 Predictive Analytics API
116. Case Studies of Data and
Standards
look at some examples of standardised data collections
117. Freebase
I an example of a graph database we looked at earlier
I graph can be represented in RDF which is triples of URIs
I Freebase, now owned by Google, currently read-only and
to be decommissioned soon
I used by others as a knowledge-base in knowledge
language processing, e.g., TextRazor, “extract meaning
from your text”
I see also DBpedia
118. Medical Data Dictionaries
The Unified Medical Language System (UMLS)
119. Medical Data Dictionaries,
cont.
ICD: the International Classification of Diseases
I used ... to classify diseases and other health problems ...
on ... health and vital records
example: Pneumonia due to Streptococcus pneumoniae
120. Publishing Repositories
I PUBMED, we have seen before
I ACM Digital Library
I Patent databases (for WIPO, USPTO, EPO, etc.), e.g.,
Global Patent Search Network
121. News and Event Registry
I collect news articles globally, process and organise as
events
I perform concept and event identification
I create a document database for inspection
I Event Registry
I sometimes news stored as NewsML
122. Government Data
I US Government’s Data.GOV
I NYC Open Data
I Australian Urban Research Infrastructure Network (AURIN)
I BioGrid Australia
124. Essential Viewing
I “The wonderful and terrifying implications of computers
that can learn” at TED by Jeremy Howard
I “The Unreasonable Effectiveness of Data” lecture at Univ.
of British Columbia by Peter Norvig
I “Knowledge is Beautiful” by David McCandless at the RSA
I “The power of emotions: When big data meets emotion data”,
by Rana El Kaliouby
125. Types of Data Analysis
“Six types of analyses every data scientist should know”, by Jeffrey Leek
- extends SAS’s “analytic levels” with some nuances
- introduces inference and causality
126. Tools for the Data Analysis Process
popular software and prototyping
127. Common Software
access: SQL, Hadoop, MS SQL Server, PIG, Spark
wrangling: common scripting languages (Python, Perl)
visualisation: Tableau, Matlab, Javascript+D3.js
statistical analysis: Weka, SAS, R
multi-purpose: Python, R, SAS, KNIME, RapidMiner
cloud-based: Azure ML (Microsoft), AWS ML (Amazon)
128. Mapping Big Data
See “Mapping Big Data: A Data-Driven Market Report” by
Russell Jurney, published by O’Reilly 2015. See Table 1-6.
Cluster               Company
Old Data Platforms    IBM, Microsoft, Oracle, Dell, Netapp
Servers               Intel, SUSE, MSC Software, NVidia, Redline Trad
Analytic Tools        Tableau, Teradata, Informatica, Talend, Actian
New Data Platforms    Cloudera, Hortonworks, MapR, Datastax, Pivotal
Enterprise Software   HP, SAP, Cisco, VMWare, EMC
Cloud Computing       Amazon Web Svcs., Google, Rackspace, MarkLo
Note that the Enterprise Software segment is developing good connections with all the others, and already has strong connections with Old Data Platforms.
129. Scripting Languages
see Wikipedia entry scripting languages:
- no formal or universally agreed definition
- often interpreted, high-level programming languages
- automate tasks originally done one-by-one by hand
- also called extension languages or control languages
e.g. bash, Perl, Python, R, Matlab, ...
kinds: glue languages (connecting software components), GUI scripting, job control, macros, extensible languages, application specific, ...
- an endless discussion on StackExchange
130. Rapid Prototyping
see Wikipedia entry software prototyping:
- software development for data science projects is often (almost) one-off ... get the results, but ensure they are reproducible
- not standard software engineering, not “waterfall model”, not “agile”
- little requirements analysis
- the results are tested, not the software and its full capability
- development speed and agility are important
- hence the use of scripting languages
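As a minimal sketch of what "one-off but reproducible" can look like in a scripting language, the toy script below fixes the random seed and stores its parameters next to the result. The function name `run_analysis`, the seed value, and the simulated measurements are illustrative assumptions, not part of the course material.

```python
import json
import random
import statistics


def run_analysis(seed, n_samples):
    """Toy one-off analysis: summarise simulated measurements (hypothetical data)."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the data
    data = [rng.gauss(10.0, 2.0) for _ in range(n_samples)]
    return {
        # record the parameters together with the result so the run can be repeated
        "params": {"seed": seed, "n_samples": n_samples},
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),
    }


if __name__ == "__main__":
    result = run_analysis(seed=2015, n_samples=1000)
    print(json.dumps(result, indent=2))
```

Running the script twice with the same seed gives identical output, which is the property being tested here: the result, rather than the software's full capability.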
131. Discussion: Python versus R
- both are free
- R was developed by statisticians for statisticians, with huge support for analysis
- Python was developed by computer scientists for general use
- R is better for stand-alone analysis and exploration
- Python lets you integrate more easily with other systems
- Python is easier to learn and extend than R (better language)
- R has vectors and arrays as first class objects; similar to Matlab!
- R is currently less scalable.
See “In data science, the R language is swallowing Python” by Matt Asay, a recent blog post in InfoWorld.
132. Scientific Method
is Data Science writ large, so what can we learn
133. Scientific Method in Medicine
- “How science goes wrong” in The Economist, 2013
- “Battling Bad Science”, a TED talk by Ben Goldacre, 2011
- “The Truth Wears Off” by Jonah Lehrer in The New Yorker, 2010
- “Richard Smith: Time for science to be about truth rather than careers”, blog on BMJ, 2013
- “Offline: What is medicine’s 5 sigma?” by Richard Horton in The Lancet, 2015
- “The 10 stuff ups we all make when interpreting research” by Will J Grant and Rod Lamberts in The Conversation, 2015.
Broadly:
- industry coercion
- academic games
- errors in application of scientific method
134. Scientific Method in Medicine
Major application errors are:
- misuse of significance testing
- correlation does not imply causation
- not checking/testing the true costs
- inadequate reproducibility, e.g., difficult to repeat
- selection bias
135. Significance Testing Errors
Significance chasing: repeat many experiments until you get significance
- “I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How.” by John Bohannon
In parallel: multiple teams trying different experiments until one gets significance
Ignoring negative results: a variation on the above; similar to repeated testing until success
The decline effect: a variation on the above; as some negative results get recorded, eventually the original (flawed) positive result gets overturned
Inadequate repeatability: means subsequent teams cannot check your results, so your initial inadequate significance testing doesn’t get retested
136. Significance Testing
- be careful with P-values and significance levels: use strong significance levels and don’t “repeat until success”
- record negative results
- ensure repeatability by properly recording experimental methodology and data processing
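The effect of “repeat until success” can be illustrated with a small simulation. This is only a sketch under stated assumptions: both groups are drawn from the same population (so any "significant" result is false), the variance is taken as known so a simple z-test applies, and the function names, sample sizes and team counts are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, mean


def p_value_no_effect(rng, n=20):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1) population."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = (mean(a) - mean(b)) / (2.0 / n) ** 0.5  # unit variance assumed known in both groups
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def chance_of_false_claim(max_tries, alpha, n_teams=1000, seed=1):
    """Fraction of simulated teams that reach p < alpha within max_tries experiments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_teams):
        # a team stops as soon as one experiment looks "significant"
        if any(p_value_no_effect(rng) < alpha for _ in range(max_tries)):
            hits += 1
    return hits / n_teams


if __name__ == "__main__":
    print("one honest experiment, alpha=0.05 :", chance_of_false_claim(1, 0.05))
    print("up to 20 tries, alpha=0.05        :", chance_of_false_claim(20, 0.05))
    print("up to 20 tries, alpha=0.001       :", chance_of_false_claim(20, 0.001))
```

In theory the chance of at least one false positive in k independent tries is 1 - (1 - alpha)^k, which is about 0.64 for k = 20 at alpha = 0.05 and about 0.02 at alpha = 0.001; the simulated fractions should land near those values. This is why a stronger significance level and a record of every negative result both matter.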
137. Error: Correlation versus Causation
- See correlation does not imply causation (Wikipedia).
- also Hilarious Graphs
- happens when medical experts use observational data to draw conclusions, e.g., epidemiological data
- methods for testing/estimating causation from data are currently a research agenda in discovery science
- “intervention” is a basic part of double blind trials (a major experimental standard)
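A small simulation can show why observational data misleads and why intervention helps. In this sketch (all names and numbers are illustrative assumptions), a hidden confounder `z` drives both an exposure `x` and an outcome `y`, while `x` has no effect on `y` at all. Observationally the two are strongly correlated; when `x` is assigned at random, as in a trial, the correlation vanishes.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient, written out to avoid version-specific helpers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, intervene=False, seed=0):
    """Correlation between exposure x and outcome y when y depends only on z."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)            # hidden confounder (hypothetical, e.g. general health)
        if intervene:
            x = rng.gauss(0.0, 1.0)        # exposure assigned at random, independent of z
        else:
            x = z + rng.gauss(0.0, 0.5)    # observational: exposure follows the confounder
        y = z + rng.gauss(0.0, 0.5)        # outcome depends on z only, never on x
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


if __name__ == "__main__":
    print("observational correlation:", round(simulate(), 3))
    print("after random assignment  :", round(simulate(intervene=True), 3))
```

With these settings the observational correlation should be close to 0.8 (the theoretical value 1 / 1.25) even though `x` never influences `y`, while the randomised version should be close to zero.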
138. Data Analysis Meta Case Studies
What is Hard?
comments
139. The Hardest Parts
See blog “The hardest parts of data science” by Yanir Seroussi
23rd Nov. 2015.
Model fitting: core statistics/machine learning not usually hard
(e.g., many use R as a black box for this)
Data collection: can be critical sometimes, but often more
routine
Data cleaning: can be a lot of work, but often more routine
Problem definition: getting into the application and
understanding the real problem can be hard
Evaluation: what is measured? should multiple evaluations be
done? can be hard
Ambiguity and uncertainty: invariably these occur and we need
to live with them; can be hard
140. Outline
Data Science and Data in Society
Data Models in Organisations
Data Types and Storage
Data Resources, Processes, Standards and Tools
Data Analysis Process
Data Curation and Management
142. Terminology
- Privacy is (for our purposes) having control over how one shares oneself with others.
  e.g. closing the blinds in your living room
- Confidentiality is information privacy: how information about an individual is treated and shared.
  e.g. excluding others from viewing your search terms or browsing history
- Security is (for our purposes) the protection of data, preventing it from being improperly used.
  e.g. preventing hackers from stealing credit card data
- Ethics is (for our purposes) the moral handling of data (especially data about others)
143. Regulations and Compliance
- Regulations are devised by various government bodies: taxation, medical care, securities and investments, work health and safety, employment, corporate law.
- these bodies need to check companies for their compliance
- Auditing: systematic and independent examination of books, accounts, documents and vouchers of an organization to ascertain how far they present a true and fair view
- Regulatory compliance: organisations ensure that they are aware of and take steps to comply with relevant laws and regulations.
- auditing data and records are a good source for Data Science
144. Data Governance
Supporting and handling:
- ethics, confidentiality
- security
- regulatory compliance
- organisation policies
- organisation business outcomes
which may include handling the steps in the data science and/or big data value chain
145. Data Management
managing to achieve governance, etc.
146. Data Management
Data management is the development, execution and
supervision of plans, policies, programs and practices that
control, protect, deliver and enhance the value of data and
information assets.
147. Data Management and Data Science
medical informatics: for predicting fungal infections from nursing notes, the team needs to abide by confidentiality and security
internet advertising: what implicit and explicit data is stored about a user
retailing: conduct market intelligence on new products; put together data from different divisions, brands
predictive medical system: implementation may require changing standard operating procedures for staff
148. Contexts for Data Management
Science: reproducibility and credibility of scientific work,
producing artifacts of knowledge, creating
scientific data
Business: governance, compliance, information privacy, etc.
Curation: e.g. museums and libraries, preservation,
maintenance, etc.
Government: a unique regulatory environment (e.g.,
“transparency”), archiving, FOIs, support data
infrastructure, etc.
Medicine: significant privacy issues, conflicting corporate
financial constraints, government regulations and
furthering of medical science
149. Digital Curation Centre
About:
The Digital Curation Centre (DCC) is a world-leading
centre of expertise in digital information curation with a
focus on building capacity, capability and skills for
research data management across the UK’s higher
education research community.
See “The DCC Curation Lifecycle Model” by DCC (PDF)
150. Australian Public Service
Background:
the creation, collection, management, use and
disposal of agency data is governed by a number of
legislative and regulatory requirements, government
policies and plans
- data needs to be authentic, accurate and reliable
- strong governance framework
- sensible risk management and a focus on information security, privacy management
- clear and transparent privacy policies and provide ethical leadership
151. Conclusion
Data Science and Data in Society
Data Models in Organisations
Data Types and Storage
Data Resources, Processes, Standards and Tools
Data Analysis Process
Data Curation and Management