How to train the next generation for Big Data
Projects: building a curriculum
Christopher G. Wilson, Ph.D.
Associate Professor Physiology and Pediatrics
Center for Perinatal Biology
Experimental Biology, Mar 28th, 2015
Outline
• Assessing the need for a “Big Data” Analytics course
• Structure and grading of the course
• Overview of the curriculum
• Advantages to Python/IPython
• Examples/use cases
• Coalition institutions and participating faculty
Is a Big Data analytics course necessary?
• “Back in the day, when *I* was a graduate student…”
• First year Physics lab as a training ground…
• Contemporary students live in a digital world…
• Office suites are NOT suited to large-scale data analytics!
Work-flow of “Big Data” analysis
Or…
• Obtain data
• Scrub data
• Explore data
• Model the data
• Interpret the data
• Present the data
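The six steps above can be sketched end to end in a few lines of Python. This is only an illustration, not material from the course: the CSV text, the column names (age_days, heart_rate_bpm), and the helper functions are all hypothetical, and the numbers are made up for the example. It uses only the standard library.

```python
import csv
import io
import statistics

# Made-up example data (hypothetical column names); one row has a missing value.
RAW_CSV = """age_days,heart_rate_bpm
1,150
2,148
3,
4,143
5,141
"""

def obtain(text):
    """Obtain: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Scrub: drop rows with a missing heart rate and convert to floats."""
    clean = []
    for row in rows:
        if row["heart_rate_bpm"].strip() == "":
            continue
        clean.append((float(row["age_days"]), float(row["heart_rate_bpm"])))
    return clean

def explore(pairs):
    """Explore: basic summary statistics of the heart-rate column."""
    rates = [rate for _, rate in pairs]
    return {"n": len(rates),
            "mean": statistics.mean(rates),
            "stdev": statistics.stdev(rates)}

def model(pairs):
    """Model: ordinary least-squares line, heart rate as a function of age."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

pairs = scrub(obtain(RAW_CSV))
summary = explore(pairs)
slope, intercept = model(pairs)

# Interpret and present: report the fitted trend in plain language.
print(f"{summary['n']} usable rows, mean {summary['mean']:.1f} bpm")
print(f"fitted trend: {slope:.2f} bpm per day, intercept {intercept:.1f} bpm")
```

Keeping each step in its own function mirrors the workflow list and makes it easy to swap one stage (for example, a different model) without touching the others.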
Why use Free/Open-Source Software?
• In this era of shrinking science funding, free software makes
more economic sense.
• Bugs/security issues are often fixed FASTER than in proprietary
software.
• With access to the source code, we can customize the
software to fit OUR needs.
• Reproducibility of analyses and algorithms is easier when all
code is free, can be shared, and examined/dissected.
• Free/Open-source software tends to be more reliable and
stable.
• See Eric Raymond’s The Cathedral and the Bazaar for a
more comprehensive explanation.
Using a “flipped” classroom
• On-line material or reading is provided to the student either
before or during the class meeting time
• The instructor provides a short summary/overview lecture
(~20 min)
• The remaining class time is spent working on the subject
matter as individuals and groups—with the instructor and TA
present
• More effective for learning “hands on” skills like
programming, bioinformatics, web design, etc.
Why use a flipped classroom model instead of lecturing for 50 minutes and assigning homework?
The data analytics team
•  Project manager—responsible for setting clear project objectives and deliverables.
The project manager should be someone with more experience in data analysis
and a more comprehensive background than the other team members.
•  Statistician—should have a strong mathematics/statistics background and will be
responsible for reporting and developing the statistics workflow for the project.
•  Visualization specialist—responsible for the design/development of data
visualization (figures/animation) for the project.
•  Database specialist—develops ontology/meta-tags to represent the data and
incorporate this information in the team's chosen database schema.
•  Content Expert—has the strongest background in the focus area of the project
(Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and
is responsible for providing background material relevant to the project's focus.
•  Web developer/integrator—responsible for web-content related to the project,
including the final report formatting (for web/hardcopy display).
•  Data analyst—the most junior member of the team will take on general
responsibilities to assist the other team members. This is a learning opportunity for
a team member who is new to data analysis and needs time to develop the skills
necessary to fully participate in the workflow.
Student self-assessment
Survey created using Google Forms
Student self-assessment
From Doing Data Science by Cathy O’Neil and Rachel Schutt
Grading
•  Pass/No Pass
•  Weekly quizzes (concepts from short lectures, on-line resources, simple
code fragments/pseudo-code, etc.)
•  Projects
•  One individual project (basics of using IPython, simple statistics
computed via interaction with R—or using Pandas—and simple
visualization of a dataset).
•  Two short projects (small group, designed to develop team-based
distribution of workload, team roles assigned by instructor).
•  Larger scale project using a Big Data dataset (students will “self-
organize” their team roles). This project is envisioned as the final
exam for the class and each team will present their results and
project summary to the class.
•  Final projects will be posted on the class website along with IPython
notebooks and supporting materials used for the project.
Syllabus Overview (10 week course)
Foundations 1: Using text editors, using the IPython notebook for data exploration, using
version control software (git), using the class wiki.
Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas,
data visualization in IPython.
Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms,
bars, etc.), dynamical systems analyses of data variability, information theory measures
(entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum),
wavelets.
Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-
array, proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays.
Data set storage and retrieval: Basics of relational databases, SQL vs. NOSQL, cloud
storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology
for biomedical/patient data (XML), using secure databases (REDCap).
Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA)
and what it means for data management, de-identifying patient data (handling PHI), data
security best practices, making data available to the public—implications for data transparency
and large-scale data mining.
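As one illustration of the Analysis Methods unit in the syllabus above, the sketch below computes two of the listed measures with NumPy: a frequency-domain measure (the dominant frequency from an FFT) and an information-theory measure (Shannon entropy of the normalized power spectrum). The function names, the 100 Hz sample rate, and the synthetic sine-wave signals are assumptions chosen for the example, not part of the course materials.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest peak in the power spectrum."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(power)])

def spectral_entropy(signal):
    """Shannon entropy (bits) of the normalized power spectrum."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    p = power / power.sum()
    p = p[p > 0]  # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())

# Synthetic test signals (assumed values): 2 s sampled at 100 Hz.
sample_rate = 100.0
t = np.arange(200) / sample_rate
one_tone = np.sin(2 * np.pi * 5 * t)               # single 5 Hz sine
two_tones = one_tone + np.sin(2 * np.pi * 12 * t)  # add an equal 12 Hz sine

print(dominant_frequency(one_tone, sample_rate))
print(spectral_entropy(one_tone), spectral_entropy(two_tones))
```

With these inputs the single tone peaks at 5 Hz and has an entropy near 0 bits, while the two equal tones split the power between two bins and give about 1 bit. A more complex signal spreads power over more frequencies and scores higher.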
Why Python?
•  Python is an easy-to-learn, complete programming language
that has rapidly become an important scientific programming
and data analysis environment with usage across multiple
disciplines.
•  Python was originally developed with a philosophy of “easy to
read” code incorporating object-oriented, imperative, and
functional programming styles.
•  Python allows the incorporation of specialized modules based
upon low-level code (C/C++) so it can run very fast.
•  Python has modules developed specifically for scientific
computing and signal processing (NumPy/SciPy).
•  Python has well-documented import/export hooks into
databases (both SQL and NOSQL) that are key to working with
Big Data.
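The database point above can be illustrated with the sqlite3 module from Python's standard library. This is a minimal sketch under stated assumptions: the in-memory database, the table name (recordings), the column names, and the values are all hypothetical and exist only for the example.

```python
import sqlite3

# In-memory SQLite database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recordings (subject_id TEXT, heart_rate_bpm REAL)"
)

# Made-up rows; parameterized inserts avoid building SQL strings by hand.
rows = [("s01", 150.0), ("s01", 146.0), ("s02", 132.0)]
conn.executemany("INSERT INTO recordings VALUES (?, ?)", rows)
conn.commit()

# Let the database do the aggregation, then pull the result into Python.
summary = conn.execute(
    "SELECT subject_id, AVG(heart_rate_bpm) "
    "FROM recordings GROUP BY subject_id ORDER BY subject_id"
).fetchall()
conn.close()

for subject_id, mean_rate in summary:
    print(subject_id, mean_rate)
```

The same connect/execute/fetch pattern is shared by most Python SQL drivers, so code written against SQLite for a class exercise usually needs only small changes to target a server-based database.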
Why IPython?
•  IPython is an interactive data exploration and visualization shell
that supports the inclusion of code, inline text, mathematical
expressions, 2D/3D plotting, multimedia, and dynamic widgets.
•  IPython is a suite of tools designed to cover scientific workflow
from interactive data transformation and analysis to publication.
•  The IPython notebook uses a web browser as its display “front
end” and provides a rich interactive environment similar to that
seen in Mathematica.
•  IPython notebooks make it possible to save analysis
procedures and output, providing reproducible, curatable data
analysis and an easy way to share algorithms/methods.
•  IPython supports parallel coding and distributed data analysis to
take advantage of cloud/high-performance clusters.
Python as a data analytics environment
IPython interface (http://ipython.org)
Line plots with error bars
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)
plt.errorbar(x, y, xerr=0.2, yerr=0.4)
plt.show()
Heatmaps
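import numpy as np
import matplotlib.pyplot as plt
# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)
# Bin the points into a 50 x 50 grid of counts
heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.clf()
# histogram2d puts x along the first axis, so transpose and use
# origin='lower' to draw x horizontally with y increasing upward
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()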
Scatterplots
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
# marker areas in points^2, for radii from 0 to 15 points
area = np.pi * (15 * np.random.rand(N))**2
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
3D contour map
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm
fig = plt.figure()
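# fig.gca(projection='3d') was removed in newer Matplotlib releases;
# add_subplot creates the 3D axes directly
ax = fig.add_subplot(projection='3d')
X, Y, Z = axes3d.get_test_data(0.05)
ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3)
# project contours onto the three walls of the plot box
cset = ax.contour(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='x', offset=-40, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='y', offset=40, cmap=cm.coolwarm)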
ax.set_xlabel('X')
ax.set_xlim(-40, 40)
ax.set_ylabel('Y')
ax.set_ylim(-40, 40)
ax.set_zlabel('Z')
ax.set_zlim(-100, 100)
plt.show()
Example: Patient physiology waveforms + EMR
Example: Interrogating sequence data
Summary
• Free/Libre Open-Source software provides a viable “tool
stack” for Big Data analytics.
• Python provides a robust, easy-to-use foundation for data
analytics.
• IPython provides an easy-to-use interactive front-end for data
transformation, analysis, visualization, presentation, and
distribution.
• Team-based science depends upon developing a wide range
of data analytics skills.
• We have developed a coalition of institutions to serve
students who wish to become data scientists.
Coalition Institutions
The coding Queen and her Court…
Abby Dobyns
Princesses of Python
Rhaya Johnson
Regie Felix and Adaeze Anyanwu
And a Princeling….
Jamie Tillett
Acknowledgements
Loma Linda
• Traci Marin
• Charles Wang
• Wilson Aruni
• Valery Filippov
UC Riverside
• Thomas Girke (Bioinformatics)
La Sierra University
• Marvin Payne
CSU San Bernardino
• Art Concepcion (Bioinformatics)
UC Irvine
• Alex Nicolau (Comp Sci/Bioinf)
My laboratory’s git repository: https://github.com/drcgw/bass
Further reading
• Doing Data Science by Cathy O’Neil and Rachel Schutt
• Data Analysis with Open-Source Tools by Philipp Janert
• The Art of R Programming by Norman Matloff
• R for Everyone by Jared P. Lander
• Python for Data Analysis by Wes McKinney
• Think Python by Allen B. Downey
• Think Stats by Allen B. Downey
• Think Complexity by Allen B. Downey
• Every one of Edward Tufte’s books (The Visual Display
of Quantitative Information, Visual Explanations,
Envisioning Information, Beautiful Evidence)
Questions?!
