Data Science:
Notes and Toolkits
Dr. Haralambos Marmanis
Waltham, MA
April, 2014
___________________________________
Web: http://www.marmanis.com
Email: h@marmanis.com
Copyright (c) 2014 H. Marmanis.
All rights reserved.
What is Science?
• Science is the systematic, data-based pursuit of knowledge
through reason
• Science is not about what we believe; it is about how we arrived
at what we believe
• Science has always relied on data, e.g. Copernicus’ and Kepler’s
theories needed Brahe’s data to grow and prosper
• The word “Science”, for most people, points to specific subject
areas such as Physics, Chemistry, etc.
• However, the methodology is not a priori restricted to these
fields; nearly everything that is taught in a university is the
outcome of a scientific endeavor
What is Data Science?
The systematic
data-based
pursuit of knowledge
through reason
in non-traditional fields
i.e. applying the same methodology that is applied in physics,
chemistry, biology, etc. to fields like e-Commerce, social networking,
finance, energy, marketing, and so on.
Why should I care?
• Scientists rejoice! There has never been a better time to be a data
scientist; see what the business analysts say.
• If you are a scientist today, you can become
the next Newton,
the next Maxwell,
the next Einstein in your field!
• These slides will provide you with an overview of Notes and Tools
that are necessary, although not sufficient, for achieving your own
discoveries
• The content of the slides is taken from my (forthcoming) book:
“The Data Science Revolution:
An overview of the field and its applications”
• Benefits range from “pats on the back” to a salary increase or a
generous bonus, and from corporate recognition to international
fame! So, your mileage may vary, but it’s all good!
Where do I start?
1. The first thing that you need to start is a problem
2. The second is an understanding of the problem. An
understanding implies the following:
• Clear description of the problem
• Clear objectives
• Measurable success criteria
3. The third is a set of data related to the problem
4. The fourth is a set of hypotheses
5. The fifth is a set of tools that will allow us to assess the
validity of our hypotheses based on the available data
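As a minimal sketch of item 5, the snippet below assesses one hypothesis against data with a permutation test, using only the Python standard library. The data values, the function name `permutation_p_value`, and the threshold `ALPHA` are hypothetical and chosen purely for illustration; a permutation test is just one of many possible tools.

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, n_permutations=10_000, seed=0):
    """Estimate a one-sided p-value for 'treatment mean > control mean'."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so we reshuffle them and recompute the difference in means.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data (item 3): a metric measured with and without a change.
control = [20.1, 22.4, 19.8, 21.0, 20.5, 21.7, 19.9, 20.8]
treatment = [23.9, 25.1, 24.4, 26.0, 23.5, 24.8, 25.5, 24.1]

ALPHA = 0.05  # measurable success criterion (item 2), fixed before testing
p = permutation_p_value(control, treatment)
print(f"estimated p-value: {p:.4f}; reject the null hypothesis: {p < ALPHA}")
```

The success criterion is fixed before the data are examined, which mirrors the order of the list: problem and criteria first, data and hypotheses next, tools last.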
Buzzword overview
• Big Data
• Data Analysis
• Intelligent Web
• Machine Learning
• Artificial Intelligence
• Statistical Analysis
What you really need …
Domain
Expertise
Science
Engineering
Domain expertise
• Each domain defines its own “universe” that, like our physical
universe, waits to be explored by scientific means
• You do not have to be a domain expert yourself but you
should be able to grasp all the fundamentals quickly and
accurately
• Examples (just a few; the list is practically endless):
• Supply chain management
• Auctions for Ads
• Financial derivatives pricing
• Mortgage risk assessment
• Drug discovery
Science
• A firm background in mathematics is essential; not just statistics!
• Applied Mathematics
• A firm understanding of the scientific method
1. Aggregate the questions/problems to be answered/solved
2. Conceptualize the problem’s domain
3. Formulate hypotheses and build models
4. Describe the problems based on the models
5. Solve the problems
6. Validate the solutions
7. Repeat steps 3 through 6, as needed
• Scientific computing
• Numerical Methods
• Visualization
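Steps 3 through 7 of the scientific method above can be sketched as a loop over candidate models. This is only an illustration under stated assumptions: the observations are hypothetical, each hypothesis is a one-parameter model y = a * f(x), and the names `fit`, `rmse`, `explore`, and `TOLERANCE` are invented for this sketch.

```python
from math import sqrt

# Hypothetical observations (x, y), made up for illustration only.
data = [(1, 3.1), (2, 11.8), (3, 27.2), (4, 47.9),
        (5, 75.3), (6, 107.6), (7, 147.4), (8, 191.7)]
train, holdout = data[::2], data[1::2]

# Step 3: candidate hypotheses, each a one-parameter model y = a * f(x).
hypotheses = [
    ("linear", lambda x: x),
    ("quadratic", lambda x: x * x),
    ("cubic", lambda x: x ** 3),
]


def fit(f, points):
    """Steps 4-5: closed-form least-squares estimate of the coefficient a."""
    return sum(f(x) * y for x, y in points) / sum(f(x) ** 2 for x, _ in points)


def rmse(f, a, points):
    """Step 6: root-mean-square error of the fitted model on given points."""
    return sqrt(sum((a * f(x) - y) ** 2 for x, y in points) / len(points))


TOLERANCE = 1.0  # validation threshold, fixed before looking at the results


def explore(hypotheses, train, holdout, tolerance):
    """Step 7: try hypotheses in turn until one validates on held-out data."""
    for name, f in hypotheses:
        a = fit(f, train)
        error = rmse(f, a, holdout)
        if error <= tolerance:
            return name, a, error
    return None  # no hypothesis survived; go back and formulate new ones


result = explore(hypotheses, train, holdout, TOLERANCE)
print(result)
```

Validating on points that were not used for fitting is the design choice that matters here: a model that merely memorizes the training data would fail step 6.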
Engineering
• Engineering is the systematic application of knowledge for the
purpose of designing, implementing, and maintaining physical
or virtual constructs in a way that optimizes multiple
objectives (e.g. cost, functional effectiveness, operational
efficiency, etc.) while respecting all applicable constraints.
• In the context of Data Science, engineering skills are required
for effectively integrating the scientific solution into the real-
world system (e.g. an online retail store, a social networking
site, a financial tool)
• In particular, software engineering proficiency is crucial, since
all the “objects of observation” are effectively digital and
accessible only through some software system
Computational environments
Name     Language                        Purpose                 License
MATLAB   C, C++, Java, MATLAB            General                 Proprietary
Scilab   C, C++, Java, Fortran, Scilab   General                 CeCILL (Open Source)
Octave                                   General                 GNU GPL
R        C, Fortran, R                   Statistical, Graphics   GNU GPL
Julia    C, C++, Scheme                  General                 MIT License
ScaVis   Java                            General                 Mixed
SciPy    C, Fortran, Python              General                 BSD
Scientific Libraries
• Basic Linear Algebra Subprograms (BLAS) written in Fortran
• Linear Algebra Package (LAPACK) written in Fortran 90
• Numerical Algorithms Group (NAG) libraries
• GraphLab – the GraphLab API is written in C++
• MTJ – Matrix Toolkit that integrates BLAS and LAPACK in Java
• EJML – linear algebra library written in Java
• Commons Math – Apache project that offers a lightweight,
self-contained, library for mathematics and statistics
• NumPy – support for matrices and high-level mathematical
functions for Python
• SciPy – efficient numerical routines for Python, including
integration and optimization
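As a small illustration of how these libraries are typically used, the sketch below solves a linear system with NumPy, whose `numpy.linalg.solve` routine delegates to LAPACK. It assumes NumPy is installed; the matrix and right-hand side are arbitrary values chosen for the example.

```python
import numpy as np

# Arbitrary 2x2 system A @ x = b, chosen only for illustration.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

# numpy.linalg.solve calls a LAPACK routine under the hood.
x = np.linalg.solve(A, b)

# Validate the solution by checking the size of the residual.
residual = np.linalg.norm(A @ x - b)
print("solution:", x)
print("residual:", residual)
```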
Machine Learning libraries
• JGAP – Genetic algorithms library
• Encog – Neural networks library
• Opt4J – Evolutionary computation library
• Weka – Clustering and classification algorithms
• Yooreeka – Search, recommendations, clustering,
classification, and mathematical analysis
Big Data technologies
• Hadoop – open-source software for reliable, scalable, distributed
computing
• OpenCL – open royalty-free standard for cross-platform, parallel
programming of modern processors found in personal computers,
servers and handheld/embedded devices
• Cloudify – Provision, configure, orchestrate, and monitor large
distributed systems on the cloud
• Spring XD – a unified, distributed, and extensible system for data
ingestion, real-time analytics, batch processing, and data export
• ProActive Parallel Suite – an open-source solution that enables the
orchestration of applications and seamlessly integrates with the
management of high-performance clouds
• Ibis – an efficient Java-based platform for distributed computing
The Data Science Revolution:
An overview of the field and its applications
