How to train the next generation for Big Data
Projects: building a curriculum
Christopher G. Wilson, Ph.D.
Associate Professor Physiology and Pediatrics
Center for Perinatal Biology
Experimental Biology, Mar 28th, 2015
Outline
• Assessing the need for a “Big Data” Analytics course
• Structure and grading of the course
• Overview of the curriculum
• Advantages to Python/IPython
• Examples/use cases
• Coalition institutions and participating faculty
Is a Big Data analytics course necessary?
• “Back in the day, when *I* was a graduate student…”
• First year Physics lab as a training ground…
• Contemporary students live in a digital world…
• Office suites are NOT suited to large-scale data analytics!
Work-flow of “Big Data” analysis
Or…
• Obtain data
• Scrub data
• Explore data
• Model the data
• Interpret the data
• Present the data
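The steps listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "heart rate" values are synthetic, and the `scrub` helper and its plausibility limits (30 to 250) are hypothetical choices, not part of the course material.

```python
import numpy as np

# Obtain: synthetic "heart rate" samples stand in for a real dataset (assumption)
rng = np.random.default_rng(0)
raw = rng.normal(loc=120.0, scale=10.0, size=200)
raw[[5, 50]] = np.nan   # simulate two missing samples
raw[100] = 900.0        # simulate one sensor glitch

# Scrub: drop missing values and readings outside a plausible range
def scrub(values, low=30.0, high=250.0):
    """Hypothetical helper: remove NaNs and out-of-range samples."""
    values = values[~np.isnan(values)]
    return values[(values >= low) & (values <= high)]

clean = scrub(raw)

# Explore: basic summary statistics
summary = {"n": clean.size, "mean": clean.mean(), "std": clean.std(ddof=1)}

# Model: least-squares linear trend over the sample index
slope, intercept = np.polyfit(np.arange(clean.size), clean, deg=1)

# Interpret/present: report the numbers a reader would need
print(f"n={summary['n']}, mean={summary['mean']:.1f}, std={summary['std']:.1f}")
print(f"trend: {slope:.4f} per sample (intercept {intercept:.1f})")
```

In a real project each step would usually live in its own notebook cell, so the intermediate results can be inspected before moving on.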
Why use Free/Open-Source Software?
• In this era of shrinking science funding, free software makes
more economic sense.
• Bugs/security issues are fixed FASTER than proprietary
software.
• With access to the source code, we can customize the
software to fit OUR needs.
• Reproducibility of analyses and algorithms is easier when all
code is free, can be shared, and examined/dissected.
• Free/Open-source software tends to be more reliable and
stable.
• See Eric Raymond’s The Cathedral and the Bazaar for a
more comprehensive explanation.
Using a “flipped” classroom
• On-line material or reading is provided to the student either
before or during the class meeting time
• The instructor provides a short summary/overview lecture
(~20 min)
• The remaining class time is spent working on the subject
matter as individuals and groups—with the instructor and TA
present
• More effective for learning “hands on” skills like
programming, bioinformatics, web design, etc.
Why use a flipped classroom
model instead of lecturing for 50
minutes and assigning
homework?
The data analytics team
•  Project manager—responsible for setting clear project objectives and deliverables.
The project manager should be someone with more experience in data analysis
and a more comprehensive background than the other team members.
•  Statistician—should have a strong mathematics/statistics background and will be
responsible for reporting and developing the statistics workflow for the project.
•  Visualization specialist—responsible for the design/development of data
visualization (figures/animation) for the project.
•  Database specialist—develops ontology/meta-tags to represent the data and
incorporates this information in the team's chosen database schema.
•  Content Expert—has the strongest background in the focus area of the project
(Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and
is responsible for providing background material relevant to the project's focus.
•  Web developer/integrator—responsible for web-content related to the project,
including the final report formatting (for web/hardcopy display).
•  Data analyst—the most junior member of the team will take on general
responsibilities to assist the other team members. This is a learning opportunity for
a team member who is new to data analysis and needs time to develop the skills
necessary to fully participate in the workflow.
Student self-assessment
Survey created using
Google Forms
Student self-assessment
From Doing Data Science by Cathy O’Neil and Rachel Schutt
Grading
•  Pass/No Pass
•  Weekly quizzes (concepts from short lectures, on-line resources, simple
code fragments/pseudo-code, etc.)
•  Projects
•  One individual project (basics of using IPython, simple statistics
computed via interaction with R—or using Pandas—and simple
visualization of a dataset).
•  Two short projects (small group, designed to develop team-based
distribution of workload, team roles assigned by instructor).
•  Larger scale project using a Big Data dataset (students will “self-
organize” their team roles). This project is envisioned as the final
exam for the class and each team will present their results and
project summary to the class.
•  Final projects will be posted on the class website along with IPython
notebooks and supporting materials used for the project.
Syllabus Overview (10 week course)
Foundations 1: Using text editors, using the IPython notebook for data exploration, using
version control software (git), using the class wiki.
Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas,
data visualization in IPython.
Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms,
bars, etc.), dynamical systems analyses of data variability, information theory measures
(entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum),
wavelets.
Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-
array, proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays.
Data set storage and retrieval: Basics of relational databases, SQL vs. NOSQL, cloud
storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology
for biomedical/patient data (XML), using secure databases (REDCap).
Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA)
and what it means for data management, de-identifying patient data (handling PHI), data
security best practices, making data available to the public—implications for data transparency
and large-scale data mining.
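As a rough illustration of two topics named in the Analysis Methods entry of the syllabus above (spectral measures via the FFT and an entropy measure), here is a minimal NumPy sketch. The 100 Hz sampling rate, the 5 Hz test sine, and the histogram-based `shannon_entropy` helper with 16 bins are all assumptions made for the example, not the course's own code.

```python
import numpy as np

fs = 100.0                              # sampling rate in Hz (assumed)
t = np.arange(1000) / fs                # 10 s of sample times
signal = np.sin(2 * np.pi * 5.0 * t)    # synthetic 5 Hz sine wave

# Frequency domain: magnitude spectrum of a real-valued signal
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
peak_freq = freqs[np.argmax(spectrum)]

def shannon_entropy(values, bins=16):
    """Hypothetical helper: Shannon entropy (bits) of a histogram of the values."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

entropy_bits = shannon_entropy(signal)

print(f"dominant frequency: {peak_freq:.1f} Hz")
print(f"amplitude entropy: {entropy_bits:.2f} bits (maximum {np.log2(16):.0f})")
```

The entropy value depends on the bin count, so comparisons between recordings only make sense when the same binning is used for each.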
Why Python?
•  Python is an easy-to-learn, complete programming language
that has rapidly become an important scientific programming
and data analysis environment with usage across multiple
disciplines.
•  Python was originally developed with a philosophy of “easy to
read” code incorporating object-oriented, imperative, and
functional programming styles.
•  Python allows the incorporation of specialized modules based
upon low-level code (C/C++) so it can run very fast.
•  Python has modules developed specifically for scientific
computing and signal processing (NumPy/SciPy).
•  Python has well-documented import/export hooks into
databases (both SQL and NOSQL) that are key to working with
Big Data.
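To make the database point concrete, the sketch below uses `sqlite3` from the Python standard library as a stand-in for a SQL database. The in-memory database, the `recordings` table, and its columns and values are made up for illustration; a production project would more likely connect to a server-based database.

```python
import sqlite3

# In-memory database so the example leaves no files behind
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recordings (subject_id TEXT, heart_rate REAL)")

# Import: parameterized inserts avoid building SQL strings by hand
rows = [("s01", 118.0), ("s01", 122.0), ("s02", 131.0)]
conn.executemany("INSERT INTO recordings VALUES (?, ?)", rows)
conn.commit()

# Export: aggregate in SQL, then pull the result into a Python dict
query = (
    "SELECT subject_id, AVG(heart_rate) FROM recordings "
    "GROUP BY subject_id ORDER BY subject_id"
)
mean_by_subject = dict(conn.execute(query))
conn.close()

print(mean_by_subject)
```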
Why IPython?
•  IPython is an interactive data exploration and visualization shell
that supports the inclusion of code, inline text, mathematical
expressions, 2D/3D plotting, multimedia, and dynamic widgets.
•  IPython is a suite of tools designed to cover scientific workflow
from interactive data transformation and analysis to publication.
•  The IPython notebook uses a web browser as its display “front
end” and provides a rich interactive environment similar to that
seen in Mathematica.
•  IPython notebooks make it possible to save analysis
procedures and output—providing reproducible, curatable data
analysis, and an easy way to share algorithms/methods.
•  IPython supports parallel coding and distributed data analysis to
take advantage of cloud/high-performance clusters.
Python as a data analytics environment
IPython
interface
http://ipython.org
Line plots with error bars
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)
plt.errorbar(x, y, xerr=0.2, yerr=0.4)
plt.show()
Heatmaps
import numpy as np
import matplotlib.pyplot as plt
# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)
heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.clf()
# histogram2d puts x along the first axis, so transpose for imshow
# and place the origin at the bottom so y increases upward
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()
Scatterplots
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii
plt.scatter(x, y, s=area, c=colors,
alpha=0.5)
plt.show()
3D contour map
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection=...) was removed in newer Matplotlib
X, Y, Z = axes3d.get_test_data(0.05)
ax.plot_surface(X, Y, Z, rstride=8,
cstride=8, alpha=0.3)
cset = ax.contour(X, Y, Z, zdir='z',
offset=-100, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='x',
offset=-40, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='y',
offset=40, cmap=cm.coolwarm)
ax.set_xlabel('X')
ax.set_xlim(-40, 40)
ax.set_ylabel('Y')
ax.set_ylim(-40, 40)
ax.set_zlabel('Z')
ax.set_zlim(-100, 100)
plt.show()
Example: Patient physiology waveforms + EMR
Example: Interrogating sequence data
Summary
• Free/Libre Open-Source software provides a viable “tool
stack” for Big Data analytics.
• Python provides a robust, easy-to-use foundation for data
analytics.
• IPython provides an easy-to-use interactive front-end for data
transformation, analysis, visualization, presentation, and
distribution.
• Team-based science depends upon developing a wide range
of data analytics skills.
• We have developed a coalition of institutions to serve
students who wish to become data scientists.
Coalition Institutions
The coding Queen and her Court…
Abby Dobyns
Princesses of Python
Rhaya Johnson
Regie Felix and Adaeze Anyanwu
And a Princeling….
Jamie Tillett
Acknowledgements
Loma Linda
• Traci Marin
• Charles Wang
• Wilson Aruni
• Valery Filippov
UC Riverside
• Thomas Girke
(Bioinformatics)
La Sierra University
•  Marvin Payne
CSU San Bernardino
•  Art Concepcion
(Bioinformatics)
UC Irvine
•  Alex Nicolau
(Comp Sci/Bioinf)
My laboratory’s git repository:
https://github.com/drcgw/bass
Further reading
• Doing Data Science by Cathy O’Neil and Rachel Schutt
• Data Analysis with Open-Source Tools by Philipp Janert
• The Art of R Programming by Norman Matloff
• R for Everyone by Jared P. Lander
• Python for Data Analysis by Wes McKinney
• Think Python by Allen B. Downey
• Think Stats by Allen B. Downey
• Think Complexity by Allen B. Downey
• Every one of Edward Tufte’s books (The Visual Display
of Quantitative Information, Visual Explanations,
Envisioning Information, Beautiful Evidence)
Questions?!

More Related Content

What's hot

[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
台灣資料科學年會
 
機器學習速遊
機器學習速遊機器學習速遊
機器學習速遊
台灣資料科學年會
 
ML DL AI DS BD - An Introduction
ML DL AI DS BD - An IntroductionML DL AI DS BD - An Introduction
ML DL AI DS BD - An Introduction
Dony Riyanto
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
David Raj Kanthi
 
Introduction to Python Objects and Strings
Introduction to Python Objects and StringsIntroduction to Python Objects and Strings
Introduction to Python Objects and Strings
Sangeetha S
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
shivani saluja
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
Pruet Boonma
 
[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016
Grigoris C
 
林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning
台灣資料科學年會
 
Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!
Turi, Inc.
 
Machine Learning Overview
Machine Learning OverviewMachine Learning Overview
Machine Learning Overview
Mykhailo Koval
 
Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)
Manjunath Sindagi
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
Department of Computer Science, Aalto University
 
An overview of machine learning
An overview of machine learningAn overview of machine learning
An overview of machine learning
drcfetr
 
Artificial Intelligence for Automating Data Analysis
Artificial Intelligence for Automating Data AnalysisArtificial Intelligence for Automating Data Analysis
Artificial Intelligence for Automating Data Analysis
Manuel Martín
 
Simple overview of machine learning
Simple overview of machine learningSimple overview of machine learning
Simple overview of machine learning
priyadharshini R
 
Learning for Big Data-林軒田
Learning for Big Data-林軒田Learning for Big Data-林軒田
Learning for Big Data-林軒田
台灣資料科學年會
 
Knowledge Discovery
Knowledge DiscoveryKnowledge Discovery
Knowledge Discovery
André Karpištšenko
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
BICA Labs
 
A Friendly Introduction to Machine Learning
A Friendly Introduction to Machine LearningA Friendly Introduction to Machine Learning
A Friendly Introduction to Machine Learning
Haptik
 

What's hot (20)

[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
 
機器學習速遊
機器學習速遊機器學習速遊
機器學習速遊
 
ML DL AI DS BD - An Introduction
ML DL AI DS BD - An IntroductionML DL AI DS BD - An Introduction
ML DL AI DS BD - An Introduction
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
Introduction to Python Objects and Strings
Introduction to Python Objects and StringsIntroduction to Python Objects and Strings
Introduction to Python Objects and Strings
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016
 
林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning
 
Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!
 
Machine Learning Overview
Machine Learning OverviewMachine Learning Overview
Machine Learning Overview
 
Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
 
An overview of machine learning
An overview of machine learningAn overview of machine learning
An overview of machine learning
 
Artificial Intelligence for Automating Data Analysis
Artificial Intelligence for Automating Data AnalysisArtificial Intelligence for Automating Data Analysis
Artificial Intelligence for Automating Data Analysis
 
Simple overview of machine learning
Simple overview of machine learningSimple overview of machine learning
Simple overview of machine learning
 
Learning for Big Data-林軒田
Learning for Big Data-林軒田Learning for Big Data-林軒田
Learning for Big Data-林軒田
 
Knowledge Discovery
Knowledge DiscoveryKnowledge Discovery
Knowledge Discovery
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
A Friendly Introduction to Machine Learning
A Friendly Introduction to Machine LearningA Friendly Introduction to Machine Learning
A Friendly Introduction to Machine Learning
 

Similar to 2015 03-28-eb-final

An Overview of Python for Data Analytics
An Overview of Python for Data AnalyticsAn Overview of Python for Data Analytics
An Overview of Python for Data Analytics
IRJET Journal
 
Abhishek Training PPT.pptx
Abhishek Training PPT.pptxAbhishek Training PPT.pptx
Abhishek Training PPT.pptx
KashishKashish22
 
FDS_dept_ppt.pptx
FDS_dept_ppt.pptxFDS_dept_ppt.pptx
FDS_dept_ppt.pptx
SatyajitPatil42
 
Python ml
Python mlPython ml
Python ml
Shubham Sharma
 
Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guide
priyanka rajput
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
sumitkumar600840
 
DataScience_RoadMap_2023.pdf
DataScience_RoadMap_2023.pdfDataScience_RoadMap_2023.pdf
DataScience_RoadMap_2023.pdf
MuhammadRizwanAmanat
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
AmarnathKambale
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
Yannick Wurm
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptx
hkabir55
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
PrashantYadav931011
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Ramiro Aduviri Velasco
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
Pramod Toraskar
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
ArmyTrilidiaDevegaSK
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
rohithprabhas1
 
Toolboxes for data scientists
Toolboxes for data scientistsToolboxes for data scientists
Toolboxes for data scientists
Sudipto Krishna Dutta
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
javed75
 
I2DS Project.pdf
I2DS Project.pdfI2DS Project.pdf
I2DS Project.pdf
AbdulnasserAlMaqrami
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Francesco Osborne
 

Similar to 2015 03-28-eb-final (20)

An Overview of Python for Data Analytics
An Overview of Python for Data AnalyticsAn Overview of Python for Data Analytics
An Overview of Python for Data Analytics
 
Abhishek Training PPT.pptx
Abhishek Training PPT.pptxAbhishek Training PPT.pptx
Abhishek Training PPT.pptx
 
FDS_dept_ppt.pptx
FDS_dept_ppt.pptxFDS_dept_ppt.pptx
FDS_dept_ppt.pptx
 
Python ml
Python mlPython ml
Python ml
 
Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guide
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
DataScience_RoadMap_2023.pdf
DataScience_RoadMap_2023.pdfDataScience_RoadMap_2023.pdf
DataScience_RoadMap_2023.pdf
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptx
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
Toolboxes for data scientists
Toolboxes for data scientistsToolboxes for data scientists
Toolboxes for data scientists
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
I2DS Project.pdf
I2DS Project.pdfI2DS Project.pdf
I2DS Project.pdf
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
 

Recently uploaded

The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfThe Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
mediapraxi
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
SSR02
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
RASHMI M G
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 

Recently uploaded (20)

The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfThe Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Leaf Initiation, Growth and Differentiation.pdf

2015 03-28-eb-final

  • 1. How to train the next generation for Big Data Projects: building a curriculum Christopher G. Wilson, Ph.D. Associate Professor Physiology and Pediatrics Center for Perinatal Biology Experimental Biology, Mar 28th, 2015
  • 2. Outline • Assessing the need for a “Big Data” Analytics course • Structure and grading of the course • Overview of the curriculum • Advantages to Python/IPython • Examples/use cases • Coalition institutions and participating faculty
  • 3. Is a Big Data analytics course necessary? • “Back in the day, when *I* was a graduate student…” • First year Physics lab as a training ground… • Contemporary students live in a digital world… • Office suites are NOT suited to large-scale data analytics!
  • 4. Work-flow of “Big Data” analysis
  • 5. Or… • Obtain data • Scrub data • Explore data • Model the data • Interpret the data • Present the data
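The six steps above can be sketched end to end in a few lines of Python. This is a minimal illustration using only the standard library, not the course's actual material: the heart-rate CSV, the column names, and the function names are all hypothetical, and the "model" is deliberately trivial (a mean and standard deviation).

```python
import csv
import io
import statistics

# Hypothetical raw data: one missing reading and one malformed reading.
RAW_CSV = """subject,heart_rate
a,72
b,
c,80
d,not_recorded
e,76
"""


def obtain(text):
    """Obtain: read rows from a CSV source (an in-memory string here)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: keep only the readings that parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append(float(row["heart_rate"]))
        except (TypeError, ValueError):
            continue  # drop missing or malformed readings
    return clean


def explore(values):
    """Explore: summarize the cleaned values."""
    return {"n": len(values), "min": min(values), "max": max(values)}


def model(values):
    """Model: a trivial stand-in, the sample mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


rows = obtain(RAW_CSV)
values = scrub(rows)
summary = explore(values)
mean, sd = model(values)

# Interpret and present: state the result in plain language.
report = f"{summary['n']} usable readings, mean {mean:.1f} bpm (SD {sd:.1f})"
print(report)
```

Keeping each step in its own function makes it easy to swap in a real data source or a more serious model without touching the other steps.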
  • 6.
  • 7. Why use Free/Open-Source Software? • In this era of shrinking science funding, free software makes more economic sense. • Bugs/security issues are fixed FASTER than in proprietary software. • With access to the source code, we can customize the software to fit OUR needs. • Reproducibility of analyses and algorithms is easier when all code is free and can be shared and examined/dissected. • Free/Open-source software tends to be more reliable and stable. • See Eric Raymond’s The Cathedral and the Bazaar for a more comprehensive explanation.
  • 8. Using a “flipped” classroom • On-line material or reading is provided to the student either before or during the class meeting time • The instructor provides a short summary/overview lecture (~20 min) • The remaining class time is spent working on the subject matter as individuals and groups—with the instructor and TA present • More effective for learning “hands on” skills like programming, bioinformatics, web design, etc.
  • 9. Why use a flipped classroom model instead of lecturing for 50 minutes and assigning homework?
  • 10. The data analytics team •  Project manager—responsible for setting clear project objectives and deliverables. The project manager should be someone with more experience in data analysis and a more comprehensive background than the other team members. •  Statistician—should have a strong mathematics/statistics background and will be responsible for reporting and developing the statistics workflow for the project. •  Visualization specialist—responsible for the design/development of data visualization (figures/animation) for the project. •  Database specialist—develops ontology/meta-tags to represent the data and incorporate this information in the team's chosen database schema. •  Content Expert—has the strongest background in the focus area of the project (Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and is responsible for providing background material relevant to the project's focus. •  Web developer/integrator—responsible for web-content related to the project, including the final report formatting (for web/hardcopy display). •  Data analyst—the most junior member of the team will take on general responsibilities to assist the other team members. This is a learning opportunity for a team member who is new to data analysis and needs time to develop the skills necessary to fully participate in the workflow.
  • 12. Student self-assessment From Doing Data Science by Cathy O’Neil and Rachel Schutt
  • 13. Grading •  Pass/No Pass •  Weekly quizzes (concepts from short lectures, on-line resources, simple code fragments/pseudo-code, etc.) •  Projects •  One individual project (basics of using IPython, simple statistics computed via interaction with R or using Pandas, and simple visualization of a dataset). •  Two short projects (small group, designed to develop team-based distribution of workload; team roles assigned by the instructor). •  Larger-scale project using a Big Data dataset (students will “self-organize” their team roles). This project is envisioned as the final exam for the class, and each team will present their results and project summary to the class. •  Final projects will be posted on the class website along with IPython notebooks and supporting materials used for the project.
  • 14. Syllabus Overview (10 week course) Foundations 1: Using text editors, using the IPython notebook for data exploration, using version control software (git), using the class wiki. Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas, data visualization in IPython. Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms, bars, etc.), dynamical systems analyses of data variability, information theory measures (entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum), wavelets. Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-array, proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays. Data set storage and retrieval: Basics of relational databases, SQL vs. NoSQL, cloud storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology for biomedical/patient data (XML), using secure databases (REDCap). Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA) and what it means for data management, de-identifying patient data (handling PHI), data security best practices, making data available to the public (implications for data transparency and large-scale data mining).
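As one concrete instance of the "information theory measures (entropy)" item in the syllabus, Shannon entropy of a sequence of discrete symbols can be computed with the standard library alone. This is a minimal sketch, not the course's own code: the function name `shannon_entropy` is an assumption, and a continuous signal would first have to be binned into discrete symbols, which is not shown here.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy, in bits, of a sequence of discrete symbols."""
    if len(symbols) == 0:
        raise ValueError("entropy is undefined for an empty sequence")
    counts = Counter(symbols)
    total = len(symbols)
    # H = sum over symbols of p * log2(1 / p), where p is the relative frequency.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A constant sequence carries no information; a uniform four-letter
# alphabet (for example, nucleotides) carries two bits per symbol.
print(shannon_entropy("AAAA"))
print(shannon_entropy("ABAB"))
print(shannon_entropy("ACGT"))
```

Higher values indicate a less predictable sequence, which is why entropy is used as a rough measure of complexity in time-series and sequence data.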
  • 15. Why Python? •  Python is an easy-to-learn, complete programming language that has rapidly become an important scientific programming and data analysis environment with usage across multiple disciplines. •  Python was originally developed with a philosophy of “easy to read” code incorporating object-oriented, imperative, and functional programming styles. •  Python can incorporate specialized modules built on low-level code (C/C++), so performance-critical routines can run very fast. •  Python has modules developed specifically for scientific computing and signal processing (NumPy/SciPy). •  Python has well-documented import/export hooks into databases (both SQL and NoSQL) that are key to working with Big Data.
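The point about database hooks can be illustrated with `sqlite3`, the SQL interface that ships with Python's standard library. This is a small sketch under stated assumptions: the `recordings` table, its columns, and the values are hypothetical, and an in-memory database stands in for a real project database.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recordings (subject TEXT, heart_rate REAL)")

# Export: write Python data into the database with a parameterized insert.
conn.executemany(
    "INSERT INTO recordings VALUES (?, ?)",
    [("a", 72.0), ("b", 80.0), ("c", 76.0)],
)
conn.commit()

# Import: pull a filtered subset and an aggregate back into Python.
rows = conn.execute(
    "SELECT subject, heart_rate FROM recordings "
    "WHERE heart_rate > ? ORDER BY subject",
    (74,),
).fetchall()
average = conn.execute("SELECT AVG(heart_rate) FROM recordings").fetchone()[0]
conn.close()

print(rows)
print(average)
```

Using `?` placeholders instead of string formatting lets the driver handle quoting, which matters once queries are built from user-supplied or patient-derived values.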
  • 16. Why IPython? •  IPython is an interactive data exploration and visualization shell that supports the inclusion of code, inline text, mathematical expressions, 2D/3D plotting, multimedia, and dynamic widgets. •  IPython is a suite of tools designed to cover scientific workflow from interactive data transformation and analysis to publication. •  The IPython notebook uses a web browser as its display “front end” and provides a rich interactive environment similar to that seen in Mathematica. •  IPython notebooks make it possible to save analysis procedures and output, providing reproducible, curatable data analysis and an easy way to share algorithms/methods. •  IPython supports parallel coding and distributed data analysis to take advantage of cloud/high-performance clusters.
  • 17. Python as a data analytics environment
  • 19. Line plots with error bars
      import numpy as np
      import matplotlib.pyplot as plt

      # example data
      x = np.arange(0.1, 4, 0.5)
      y = np.exp(-x)

      # constant symmetric error bars: 0.2 in x and 0.4 in y
      plt.errorbar(x, y, xerr=0.2, yerr=0.4)
      plt.show()
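  • 20. Heatmaps
      import numpy as np
      import matplotlib.pyplot as plt

      # Generate some test data
      x = np.random.randn(8873)
      y = np.random.randn(8873)

      # 2D histogram: counts of (x, y) points in a 50 x 50 grid
      heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
      extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

      plt.clf()
      # histogram2d puts x along the first axis, so transpose for imshow
      # and place the origin at the lower left to match the extent
      plt.imshow(heatmap.T, extent=extent, origin='lower')
      plt.show()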
  • 21. Scatterplots
      import numpy as np
      import matplotlib.pyplot as plt

      N = 50
      x = np.random.rand(N)
      y = np.random.rand(N)
      colors = np.random.rand(N)
      area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

      # marker size and color vary per point; alpha makes overlaps visible
      plt.scatter(x, y, s=area, c=colors, alpha=0.5)
      plt.show()
  • 22. 3D contour map
      from mpl_toolkits.mplot3d import axes3d
      import matplotlib.pyplot as plt
      from matplotlib import cm

      fig = plt.figure()
      # fig.gca(projection='3d') no longer works in recent Matplotlib
      # releases; add_subplot creates the 3D axes instead
      ax = fig.add_subplot(111, projection='3d')
      X, Y, Z = axes3d.get_test_data(0.05)

      # translucent surface plus contour projections onto each wall
      ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3)
      cset = ax.contour(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
      cset = ax.contour(X, Y, Z, zdir='x', offset=-40, cmap=cm.coolwarm)
      cset = ax.contour(X, Y, Z, zdir='y', offset=40, cmap=cm.coolwarm)

      ax.set_xlabel('X')
      ax.set_xlim(-40, 40)
      ax.set_ylabel('Y')
      ax.set_ylim(-40, 40)
      ax.set_zlabel('Z')
      ax.set_zlim(-100, 100)
      plt.show()
  • 23. Example: Patient physiology waveforms + EMR
  • 25. Summary • Free/Libre Open-Source software provides a viable “tool stack” for Big Data analytics. • Python provides a robust, easy-to-use foundation for data analytics. • IPython provides an easy-to-use interactive front-end for data transformation, analysis, visualization, presentation, and distribution. • Team-based science depends upon developing a wide range of data analytics skills. • We have developed a coalition of institutions to serve students who wish to become data scientists.
  • 27. The coding Queen and her Court… Abby Dobyns Princesses of Python Rhaya Johnson Regie Felix and Adaeze Anyanwu And a Princeling…. Jamie Tillett
  • 28. Acknowledgements Loma Linda • Traci Marin • Charles Wang • Wilson Aruni • Valery Filippov UC Riverside • Thomas Girke (Bioinformatics) La Sierra University • Marvin Payne CSU San Bernardino • Art Concepcion (Bioinformatics) UC Irvine • Alex Nicolau (Comp Sci/Bioinf) My laboratory’s git repository: https://github.com/drcgw/bass
  • 29. Further reading • Doing Data Science by Cathy O’Neil and Rachel Schutt • Data Analysis with Open-Source Tools by Philipp Janert • The Art of R Programming by Norman Matloff • R for Everyone by Jared P. Lander • Python for Data Analysis by Wes McKinney • Think Python by Allen B. Downey • Think Stats by Allen B. Downey • Think Complexity by Allen B. Downey • Every one of Edward Tufte’s books (The Visual Display of Quantitative Information, Visual Explanations, Envisioning Information, Beautiful Evidence)