This presentation contains slides from a talk I gave at the annual Experimental Biology meeting in 2015 on our curriculum for Big Data analytics in the Inland Empire.
1. How to train the next generation for Big Data
Projects: building a curriculum
Christopher G. Wilson, Ph.D.
Associate Professor Physiology and Pediatrics
Center for Perinatal Biology
Experimental Biology, Mar 28th, 2015
2. Outline
• Assessing the need for a “Big Data” Analytics course
• Structure and grading of the course
• Overview of the curriculum
• Advantages to Python/IPython
• Examples/use cases
• Coalition institutions and participating faculty
3. Is a Big Data analytics course necessary?
• “Back in the day, when *I* was a graduate student…”
• First year Physics lab as a training ground…
• Contemporary students live in a digital world…
• Office suites are NOT suited to large-scale data analytics!
7. Why use Free/Open-Source Software?
• In this era of shrinking science funding, free software makes
more economic sense.
• Bugs/security issues are fixed FASTER than proprietary
software.
• With access to the source code, we can customize the
software to fit OUR needs.
• Reproducibility of analyses and algorithms is easier when all
code is free, can be shared, and examined/dissected.
• Free/Open-source software tends to be more reliable and
stable.
• See Eric Raymond’s The Cathedral and the Bazaar for a
more comprehensive explanation.
8. Using a “flipped” classroom
• On-line material or reading is provided to the student either
before or during the class meeting time
• The instructor provides a short summary/overview lecture
(~20 min)
• The remaining class time is spent working on the subject
matter as individuals and groups—with the instructor and TA
present
• More effective for learning “hands on” skills like
programming, bioinformatics, web design, etc.
9. Why use a flipped classroom model instead of lecturing for 50 minutes and assigning homework?
10. The data analytics team
• Project manager—responsible for setting clear project objectives and deliverables.
The project manager should be someone with more experience in data analysis
and a more comprehensive background than the other team members.
• Statistician—should have a strong mathematics/statistics background and will be
responsible for reporting and developing the statistics workflow for the project.
• Visualization specialist—responsible for the design/development of data
visualization (figures/animation) for the project.
• Database specialist—develops the ontology/meta-tags used to represent the data and
incorporates this information into the team's chosen database schema.
• Content Expert—has the strongest background in the focus area of the project
(Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and
is responsible for providing background material relevant to the project's focus.
• Web developer/integrator—responsible for web-content related to the project,
including the final report formatting (for web/hardcopy display).
• Data analyst—the most junior member of the team will take on general
responsibilities to assist the other team members. This is a learning opportunity for
a team member who is new to data analysis and needs time to develop the skills
necessary to fully participate in the workflow.
13. Grading
• Pass/No Pass
• Weekly quizzes (concepts from short lectures, on-line resources, simple
code fragments/pseudo-code, etc.)
• Projects
• One individual project (basics of using IPython, simple statistics
computed via interaction with R—or using Pandas—and simple
visualization of a dataset).
• Two short projects (small group, designed to develop team-based
distribution of workload, team roles assigned by instructor).
• Larger scale project using a Big Data dataset (students will “self-organize”
their team roles). This project is envisioned as the final exam for the
class and each team will present their results and project summary to the class.
• Final projects will be posted on the class website along with IPython
notebooks and supporting materials used for the project.
14. Syllabus Overview (10 week course)
Foundations 1: Using text editors, using the IPython notebook for data exploration, using
version control software (git), using the class wiki.
Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas,
data visualization in IPython.
Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms,
bars, etc.) dynamical systems analyses of data variability, information theory measures
(entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum),
wavelets.
Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-array,
proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays.
Data set storage and retrieval: Basics of relational databases, SQL vs. NOSQL, cloud
storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology
for biomedical/patient data (XML), using secure databases (REDCap).
Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA)
and what it means for data management, de-identifying patient data (handling PHI), data
security best practices, making data available to the public—implications for data transparency
and large-scale data mining.
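As a taste of the Foundations 2 and Analysis Methods units, here is a minimal sketch of importing a time series into Pandas, summarizing it, and taking a frequency-domain view with the FFT. The dataset is synthetic and the column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic "breathing" signal: a 0.5 Hz rhythm plus noise,
# 60 seconds sampled at 100 Hz (a stand-in for real lab data)
fs = 100
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(0)
amplitude = np.sin(2 * np.pi * 0.5 * t) + 0.2 * rng.standard_normal(t.size)

df = pd.DataFrame({"time_s": t, "amplitude": amplitude})

# Simple summary statistics via Pandas
print(df["amplitude"].describe())

# Spectral view: find the dominant frequency with a real FFT
spectrum = np.abs(np.fft.rfft(df["amplitude"].to_numpy()))
freqs = np.fft.rfftfreq(len(df), d=1 / fs)
peak_hz = freqs[spectrum[1:].argmax() + 1]  # skip the DC bin
print(f"Dominant frequency: {peak_hz:.2f} Hz")
```

In class, the same notebook cell would typically be extended with a Matplotlib plot of the spectrum.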
15. Why Python?
• Python is an easy-to-learn, complete programming language
that has rapidly become an important scientific programming
and data analysis environment with usage across multiple
disciplines.
• Python was originally developed with a philosophy of “easy to
read” code incorporating object-oriented, imperative, and
functional programming styles.
• Python allows the incorporation of specialized modules based
upon low-level code (C/C++) so it can run very fast.
• Python has modules developed specifically for scientific
computing and signal processing (NumPy/SciPy).
• Python has well-documented import/export hooks into
databases (both SQL and NOSQL) that are key to working with
Big Data.
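For example, a few lines of NumPy/SciPy are enough to denoise a signal. A sketch with synthetic data—the sampling rate, cutoff frequency, and filter order are arbitrary choices for illustration:

```python
import numpy as np
from scipy import signal

# Noisy 2 Hz sine wave sampled at 100 Hz (synthetic example data)
fs = 100
t = np.arange(0, 5, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 2 * t) + 0.5 * rng.standard_normal(t.size)

# 4th-order low-pass Butterworth filter with a 5 Hz cutoff
b, a = signal.butter(4, 5, btype="low", fs=fs)
smoothed = signal.filtfilt(b, a, x)  # zero-phase filtering

clean = np.sin(2 * np.pi * 2 * t)
print(f"Noise std before: {np.std(x - clean):.2f}")
print(f"Noise std after:  {np.std(smoothed - clean):.2f}")
```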
16. Why IPython?
• IPython is an interactive data exploration and visualization shell
that supports the inclusion of code, inline text, mathematical
expressions, 2D/3D plotting, multimedia, and dynamic widgets.
• IPython is a suite of tools designed to cover scientific workflow
from interactive data transformation and analysis to publication.
• The IPython notebook uses a web browser as its display “front
end” and provides a rich interactive environment similar to that
seen in Mathematica.
• The IPython notebook makes it possible to save analysis
procedures and output—providing reproducible, curatable data
analysis and an easy way to share algorithms/methods.
• IPython supports parallel coding and distributed data analysis to
take advantage of cloud/high-performance clusters.
19. Line plots with error bars
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)
plt.errorbar(x, y, xerr=0.2, yerr=0.4)
plt.show()
20. Heatmaps
import numpy as np
import matplotlib.pyplot as plt
# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)
heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.clf()
# Transpose so the x bins run along the horizontal axis,
# and use origin='lower' so the y axis increases upward
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()
21. Scatterplots
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
22. 3D contour map
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm
fig = plt.figure()
# fig.gca(projection='3d') is deprecated; add a 3D subplot instead
ax = fig.add_subplot(projection='3d')
X, Y, Z = axes3d.get_test_data(0.05)
ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3)
# Project contours onto each coordinate plane
cset = ax.contour(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='x', offset=-40, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='y', offset=40, cmap=cm.coolwarm)
ax.set_xlabel('X')
ax.set_xlim(-40, 40)
ax.set_ylabel('Y')
ax.set_ylim(-40, 40)
ax.set_zlabel('Z')
ax.set_zlim(-100, 100)
plt.show()
25. Summary
• Free/Libre Open-Source software provides a viable “tool
stack” for Big Data analytics.
• Python provides a robust, easy-to-use foundation for data
analytics.
• IPython provides an easy to use interactive front-end for data
transformation, analysis, visualization, presentation, and
distribution.
• Team-based science depends upon developing a wide range
of data analytics skills.
• We have developed a coalition of institutions to serve
students who wish to become data scientists.
27. The coding Queen and her Court…
Abby Dobyns
Princesses of Python
Rhaya Johnson
Regie Felix and Adaeze Anyanwu
And a Princeling….
Jamie Tillett
28. Acknowledgements
Loma Linda
• Traci Marin
• Charles Wang
• Wilson Aruni
• Valery Filippov
UC Riverside
• Thomas Girke (Bioinformatics)
La Sierra University
• Marvin Payne
CSU San Bernardino
• Art Concepcion (Bioinformatics)
UC Irvine
• Alex Nicolau (Comp Sci/Bioinf)
My laboratory’s git repository: https://github.com/drcgw/bass
29. Further reading
• Doing Data Science by Cathy O’Neil and Rachel Schutt
• Data Analysis with Open-Source Tools by Philipp Janert
• The Art of R Programming by Norman Matloff
• R for Everyone by Jared P. Lander
• Python for Data Analysis by Wes McKinney
• Think Python by Allen B. Downey
• Think Stats by Allen B. Downey
• Think Complexity by Allen B. Downey
• Every one of Edward Tufte’s books (The Visual Display
of Quantitative Information, Visual Explanations,
Envisioning Information, Beautiful Evidence)