This document provides an overview and roadmap for the DSE 400 - Fast Track to Data Science course. The week 1 agenda includes introductions, reading assignments on data science topics, installing R and RStudio, practicing with math and machine learning datasets, and an assignment to import and display the Housing dataset from UCI Machine Learning Repository in R. The course aims to provide an introduction to data science, analytics, and visualization over 8 weeks covering topics like statistics, machine learning, Hadoop, ethics, and building data products.
Data scientist enablement dse 400 - week 1 roadmap
1. Data Scientist Enablement
DSE 400 - Fast Track to Data Science
Week 1 Roadmap
Advanced Center of Excellence
Modern Renaissance Corporation
In Collaboration with SONO team and others
Content of this document is under Creative Commons Licence CC-BY-4.0
2. Agenda
You can always find the latest version of this document at bit.ly/1hC5wAV
Welcome
Mission and Objectives
DSE Roadmap
DSE 400 at a glance
Week 1 at a glance
Discussions
Learning
Practice
Assignments and Submission
Looking ahead
References
Acknowledgement
In God we trust; all others must bring data. - W. Edwards Deming
3. Welcome
Welcome to the DSE 2014 Track. You are on one of the most
exciting programs to disseminate knowledge, diffuse
advancements and also stimulate adoption of Data/Decision
Sciences, Big Data Analytics and what we call Evidence-
Oriented Systems Engineering. The content and the courses
are designed to be easy, engaging and engendering.
Consequently, we hope this program will also be most
rewarding for you from intellectual, pragmatic and
professional development perspectives.
4. Mission and Objectives
The mission of our program is to provide free, open and world-class
enablement of Data Scientists and help advance the
profession of Data Science as well as allied disciplines.
We aim to prepare the participants with analytical and
practical skills emphasizing breadth and depth in a range of
relevant disciplines and capabilities in Data/Decision
Sciences, Big Data Analytics, Architecture and Systems
Engineering.
5. Data Scientist Enablement Roadmap - 2014
Fast track to
Data Science
Machine Learning with R
Modern Data Platforms
Advanced Techniques in
Big Data Analytics
“A Data Scientist is someone who knows how to extract meaning from and interpret data, which
requires both tools and methods from statistics and machine learning, as well as being human.”
- Rachel Schutt and Cathy O’Neil, Doing Data Science
6. DSE 2014 with tentative timeline
Fast track to Data Science (DSE 400): Jan 19 - Mar 15
Machine Learning with R (DSE 501): Mar 30 - May 10
Modern Data Platforms (DSE 502): May 25 - July 5
Advanced Techniques in Big Data Analytics (DSE 600): July 20 - Aug 30
7. Introductory course with NO prerequisites. It employs a
socialized learning paradigm involving individual effort,
teamwork, discussions and collaboration on the SONO (Social
Knowledge) platform.
Topics include Algorithms, Statistical
Inference, Data Analysis, Hadoop, R,
Data Engineering, Machine Learning,
Visualization, Applications, Case Studies,
employing a variety of tools and techniques.
DSE 400 at a glance
8. Discussions (on SONO):
Welcome, Introductions, Programming and Analytics background etc.
Reading plan:
Read Chapters 1-3 from An Introduction to Data Science by Jeffrey Stanton and Big Data
[sorry] & Data Science: What Does a Data Scientist Do?
Activities:
Installing R and RStudio; Fun with Math; Playing with ML Datasets; Research on Data
Visualization tools, etc.
Assignment 1:
Download the Housing dataset from the UCI Machine Learning Repository to your local machine or
cloud drive. Import this dataset into your R environment and display it.
DSE 400 - Week 1 at a glance
9. Log in to the SONO Community. Visit our Jump Pad (or
Knowledge Domain) called DSE 400. Go to DSE 2014
Global, then join the right participant group based on the first letter
of your last name. Also feel free to explore other
knowledge-rich communities on SONO.
http://getsokno.com/redvinef/controllers/cell.php?user_knocell=992
Social Engagement on SONO
10. Discussion 1: Welcome to the DSE program.
Discussion 2: What programming languages are you
familiar with? What languages do you use on a day-to-day
basis? Do you have any experience using the R language?
What kind of analytics tools, if any, have you used before?
<Optional> Discussion 3: Q&A. We will focus on topics
central to Week 1, but general questions are also welcome.
To participate in these discussions visit DSE 400 Week 1 at
http://getsokno.com/redvinef/controllers/cell.php?user_knocell=1001
Social Engagement on SONO - Week 1
11. DSE 400 is designed to be a broad introduction to Data
Science, Analytics Architecture and Visualization from both
learning and pragmatic perspectives. The following plan
is recommended for Week 1 to kickstart the program.
Read Chapters 1-3 from An Introduction to Data Science
by Jeffrey Stanton.
Read Big Data [sorry] & Data Science: What Does a Data
Scientist Do?
Week 1 Reading Plan
12. <Required> Visit http://www.rstudio.com/ and follow the instructions to
download and install R and RStudio. For specific advice on your system and its
configuration, several how-to videos on installing R and RStudio can be found
on YouTube. Skip this activity if you already have R and RStudio.
<Collaborative Research> <Required> Create a presentation on Data
Visualization Tools - A Comparative Study. Incorporate your unique ideas,
research and collective insights to arrive at the right evaluation methodology,
explain your thought process and justify your choices. Note: You will build this
presentation over 4 weeks. You and your team will present it during the 5th week.
Activities
13. <Practice> Math is Fun. Create a bar chart quickly with 10 random values
using the Data Graphs widget at the Math is Fun website. Change the graph to a Pie Chart.
Display percentages only, not the original values.
<Practice> Visit the UCI Machine Learning Repository. Familiarize yourself
with the various datasets at this site. Feel free to download any dataset you like. We
will be using this repository extensively in the DSE program. For Week 1 our focus is
on just the “Housing” dataset.
Activities - contd
14. Download RStudio, in case you have not already done so.
Download the Housing dataset from the UCI Machine Learning
Repository to your local machine or cloud drive. Import this
dataset into your R environment and display it.
Show a screenshot of your environment.
(See the sample image in the next slide.)
http://archive.ics.uci.edu/ml/datasets.html
Assignment 1 - Submission Required
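The import step of Assignment 1 can be sketched in base R as follows. This is a minimal illustration, not the course's official solution. It assumes the download was saved as `housing.data` in the working directory (a hypothetical path; adjust it to your own) and that the file is whitespace-separated with no header row. The column names are supplied by hand from the dataset's accompanying documentation, and `read_housing` is a helper name introduced here for illustration only.

```r
# Column names taken from the dataset's accompanying documentation.
housing_cols <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                  "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")

# Hypothetical helper: read a whitespace-separated file that has no
# header row and attach the column names above.
read_housing <- function(path) {
  read.table(path, header = FALSE, col.names = housing_cols)
}

# Assumed local path; change this to wherever you saved the download.
path <- "housing.data"

if (file.exists(path)) {
  housing <- read_housing(path)
  str(housing)          # column types and a short preview of each column
  print(head(housing))  # first six rows
} else {
  message("File not found: ", path)
}
```

In RStudio, running `View(housing)` after the import opens the data in a spreadsheet-style viewer, which is a convenient state to capture for the screenshot.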
16. Submissions
Deadline: Saturday, Jan 25, 11:59 PM your local time.
Submit <mail to datascience400@gmail.com> the
screenshots of your R workspace (on your
machine/laptop/desktop) showing the Housing dataset.
You can either paste the image into the body of the email or
create a document in PDF format and send it as an
attachment. No links please.
19. Week 2: Basic Statistics, Hypothesis Testing, Regression, Playing with Spreadsheets, Visualization with
R. If you are new to Statistics or need a refresher, read ahead in Think Stats: Probability and Statistics for
Programmers or watch the Statistics Playlist by Khan Academy.
Weeks 3-4: Intro to Machine Learning (ML) - Classification, Clustering, Prediction, Naive Bayes,
Recommendations and Boosting algorithms.
Week 5: Visualizations. Present your research: Data Visualization Tools - A Comparative Study.
Weeks 6-7: Processing large data sets. Hadoop Ecosystem. Stream Computing, etc.
Week 8: Ethics, Privacy and Building Data Products.
DSE 400 - Weeks 2-8 ahead
20. References and Additional Reading
An Introduction to Data Science by Jeffrey Stanton. This
is a good introduction to Data Science for non-technical
readers. This book is available under Creative Commons
Licence.
Learning R - Video Tutorial Lessons on YouTube
R for Machine Learning by Allison Chung
The Value of Big Data Isn't the Data (HBR article)
[MIT OCW] Prediction, Machine Learning and Statistics
21. Housing Data Set Information: Concerns housing values in suburbs of Boston.
Origin: This dataset was taken from the StatLib library, which is maintained at
Carnegie Mellon University. Creators: Harrison, D. and Rubinfeld, D.L.
'Hedonic prices and the demand for clean air', J. Environ. Economics &
Management, vol.5, 81-102, 1978.
Content that appears as is in this document only is under Creative Commons
BY-NC-SA. This license may not apply to material referenced here.
Citation
22. For More Information
The DSE 2014 stream is all set to commence on Jan 19, 2014.
For more details, visit DSE 400 Announcement Page bit.ly/18zPE1j
Visit DSE 2014 Global to participate in DSE and to get to know the DSE Core
Team and participants. Week 1 discussions can be found at DSE 400 Week 1.
We welcome questions, thoughts and suggestions. Post these on SONO in the
right forum/discussion or write to us at <datascience400@gmail.com>
23. We thank our community of committed and passionate
volunteers, experts, educators, innovators, benefactors,
advisers, advocates, mentors and supporters.
We are also grateful for the outstanding support and
encouragement from the SONO team as well as other
organizations such as R-Project, Open Courseware
Consortium, MIT, IBM, Creative Commons, Hortonworks,
Stanford University, Caltech and Data Science Central.
Acknowledgement