1. Data Science:
Notes and Toolkits
Dr. Haralambos Marmanis
Waltham, MA
April, 2014
___________________________________
Web: http://www.marmanis.com
Email: h@marmanis.com
Copyright (c) 2014 H. Marmanis.
All rights reserved
2. What is Science?
• Science is the systematic, data-based pursuit of knowledge
through reason
• Science is not about what we believe; it is about how we arrived
at what we believe
• Science has always relied on data, e.g. Copernicus’ and Kepler’s
theories needed Brahe’s data to grow and prosper
• The word “Science”, for most people, points to specific subject
areas such as Physics, Chemistry, etc.
• However, the methodology is not a priori restricted to these
fields; nearly everything that is taught in a university is the
outcome of a scientific endeavor
3. What is Data Science?
The systematic
data-based
pursuit of knowledge
through reason
in non-traditional fields
i.e. applying the same methodology that is applied in physics,
chemistry, biology, etc. to fields like e-Commerce, social networking,
finance, energy, marketing, and so on.
4. Why should I care?
• Scientists rejoice! There has never been a better time to be a data
scientist, and business analysts say the same.
• If you are a scientist today, you can become
the next Newton,
the next Maxwell,
the next Einstein in your field!
• These slides will provide you with an overview of Notes and Tools
that are necessary, although not sufficient, for achieving your own
discoveries
• The content of the slides is taken from my (forthcoming) book:
“The Data Science Revolution:
An overview of the field and its applications”
• Benefits range from “pats on the back” to a salary increase or a
generous bonus, and from corporate recognition to international
fame! So, your mileage may vary, but it’s all good!
Copyright(c)2014H.Marmanis.
Allrightsreserved
4
5. Where do I start?
1. The first thing that you need to start is a problem
2. The second is an understanding of the problem. An
understanding implies the following:
• Clear description of the problem
• Clear objectives
• Measurable success criteria
3. The third is a set of data related to the problem
4. The fourth is a set of hypotheses
5. The fifth is a set of tools that will allow us to assess the
validity of our hypotheses based on the available data
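The five ingredients above can be made concrete with a small sketch. Everything in it is hypothetical and chosen only for illustration: the problem (does a new checkout page raise the average order value?), the two data samples, the 0.05 success threshold, and the function name `permutation_p_value`. The tool used here is a simple permutation test, which is one possible way to assess a hypothesis against data, not the only one.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=5_000, seed=0):
    """One-sided permutation test for mean(b) > mean(a).

    Null hypothesis: both samples come from the same distribution,
    so the group labels are interchangeable.
    """
    rng = random.Random(seed)
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
        if diff >= observed:
            at_least_as_large += 1
    # The +1 terms keep the estimated p-value strictly positive.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# 3. Data related to the problem (hypothetical order values).
old_page = [20.1, 22.3, 19.8, 21.0, 20.5, 22.1, 19.9, 20.7]
new_page = [23.4, 24.1, 22.8, 25.0, 23.9, 24.6, 22.5, 24.3]

# 4. Hypothesis: the new page yields a higher average order value.
# 5. Tool: the permutation test defined above.
observed, p_value = permutation_p_value(old_page, new_page)

# 2. Measurable success criterion (assumed here): p-value below 0.05.
reject_null = p_value < 0.05
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

The success criterion is fixed before looking at the result, which mirrors the order of the list: objectives first, data and hypotheses next, tools last.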
8. What you really need …
Domain Expertise
Science
Engineering
9. Domain expertise
• Each domain defines its own “universe” that, like our physical
universe, waits to be explored by scientific means
• You do not have to be a domain expert yourself but you
should be able to grasp all the fundamentals quickly and
accurately
• Examples (just a few – this is practically endless):
• Supply chain management
• Auctions for Ads
• Financial derivatives pricing
• Mortgage risk assessment
• Drug discovery
10. Science
• A firm background in mathematics is essential; not just statistics!
• Applied Mathematics
• A firm understanding of the scientific method
1. Aggregate the questions/problems to be answered/solved
2. Conceptualize the problem’s domain
3. Formulate hypotheses and build models
4. Describe the problems based on the models
5. Solve the problems
6. Validate the solutions
7. Repeat steps 3 through 6, as needed
• Scientific computing
• Numerical Methods
• Visualization
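Steps 3 through 7 of the scientific method can be sketched as a small loop over candidate models. The observations, the two candidate models (a constant and a straight line), and the helper names below are all assumptions made for illustration; the slides do not prescribe any particular model or error measure.

```python
from statistics import mean


def fit_constant(xs, ys):
    """Model A (hypothesis: y does not depend on x)."""
    c = mean(ys)
    return lambda x: c


def fit_line(xs, ys):
    """Model B (hypothesis: y is linear in x), ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_squared_error(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]

# Hold out every second point so validation uses data unseen during fitting.
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]

candidates = {"constant": fit_constant, "linear": fit_line}
errors = {}
for name, fit in candidates.items():       # step 7: repeat for each hypothesis
    model = fit(train_x, train_y)          # steps 3-5: build the model and solve
    errors[name] = mean_squared_error(model, test_x, test_y)  # step 6: validate

best = min(errors, key=errors.get)
print(best, errors)
```

Validating on held-out points, rather than on the points used for fitting, is what lets the loop reject a hypothesis instead of merely describing the training data.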
11. Engineering
• Engineering is the systematic application of knowledge for the
purpose of designing, implementing, and maintaining physical
or virtual constructs in a way that optimizes multiple
objectives (e.g. cost, functional effectiveness, operational
efficiency, etc.) while respecting all applicable constraints.
• In the context of Data Science, engineering skills are required
for effectively integrating the scientific solution into the real-
world system (e.g. an online retail store, a social networking
site, a financial tool)
• In particular, software engineering proficiency is crucial, since
all the “objects of observation” are effectively digital and
accessible only through some software system
12. Computational environments
Name     Language                        Purpose                 License
MATLAB   C, C++, Java, MATLAB            General                 Proprietary
Scilab   C, C++, Java, Fortran, Scilab   General                 CeCILL (Open Source)
Octave   -                               General                 GNU GPL
R        C, Fortran, R                   Statistical, Graphics   GNU GPL
Julia    C, C++, Scheme                  General                 MIT License
ScaVis   Java                            General                 Mixed
SciPy    C, Fortran, Python              General                 BSD
13. Scientific Libraries
• Basic Linear Algebra Subprograms (BLAS) written in Fortran
• Linear Algebra Package (LAPACK) written in Fortran 90
• Numerical Algorithms Group (NAG) libraries
• GraphLab -- GraphLab API is written in C++
• MTJ -- Matrix Toolkit that integrates BLAS and LAPACK in Java
• EJML – linear algebra library written in Java
• Commons Math – Apache project that offers a lightweight,
self-contained, library for mathematics and statistics
• NumPy – support for matrices and high-level mathematical
functions for Python
• SciPy – efficient routines for numerical integration and
optimization in Python
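As a brief illustration of the last two entries, the sketch below uses NumPy for a small linear system and SciPy for integration and optimization. The matrix, the integrand, and the objective function are made-up examples; the sketch assumes both packages are installed.

```python
import numpy as np
from scipy import integrate, optimize

# Linear algebra with NumPy: solve A x = b for the system
#   3x +  y = 9
#    x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Numerical integration with SciPy: integral of sin(t) over [0, pi].
# quad returns the estimate and an estimate of the absolute error.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization with SciPy: minimize the scalar function (t - 3)^2.
result = optimize.minimize_scalar(lambda t: (t - 3.0) ** 2)

print(x, area, result.x)
```

NumPy's dense linear algebra routines are built on LAPACK, so this example also exercises, indirectly, the Fortran libraries listed at the top of the slide.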
15. Big Data technologies
• Hadoop – open-source software for reliable, scalable, distributed
computing
• OpenCL – open royalty-free standard for cross-platform, parallel
programming of modern processors found in personal computers,
servers and handheld/embedded devices
• Cloudify – Provision, configure, orchestrate, and monitor large
distributed systems on the cloud
• Spring XD -- a unified, distributed, and extensible system for data
ingestion, real time analytics, batch processing, and data export
• Proactive Parallel Suite -- an open source solution that enables the
orchestration of applications and seamlessly integrates with the
management of high-performance clouds
• Ibis -- an efficient Java-based platform for distributed computing
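To give a feel for the style of computation that Hadoop distributes across machines, here is a single-process sketch of the MapReduce programming model, which Hadoop implements. It does not use any Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the two input documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; in a real cluster each document could live on a different node.
documents = ["data science is science", "big data needs big tools"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can run in parallel on different machines, which is what makes the model scale.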