Data Science: Notes and Toolkits

Published in: Technology, Education
  1. Data Science: Notes and Toolkits
     Dr. Haralambos Marmanis
     Waltham, MA, April 2014
     Web: http://www.marmanis.com
     Email: h@marmanis.com
     Copyright (c) 2014 H. Marmanis. All rights reserved.
  2. What is Science?
     • Science is the systematic, data-based pursuit of knowledge through reason
     • Science is not about what we believe; it is about how we arrived at what we believe
     • Science has always relied on data, e.g. Copernicus’ and Kepler’s theories needed Brahe’s data to grow and prosper
     • For most people, the word “Science” points to specific subject areas such as Physics, Chemistry, etc.
     • However, the methodology is not a priori restricted to these fields; nearly everything that is taught in a university is the outcome of a scientific endeavor
  3. What is Data Science?
     The systematic, data-based pursuit of knowledge through reason in non-traditional fields, i.e. applying the same methodology that is applied in physics, chemistry, biology, etc. to fields like e-Commerce, social networking, finance, energy, marketing, and so on.
  4. Why should I care?
     • Scientists rejoice! There has never been a better time to be a data scientist; see what the business analysts say.
     • If you are a scientist today, you can become the next Newton, the next Maxwell, the next Einstein in your field!
     • These slides provide an overview of the Notes and Tools that are necessary, although not sufficient, for achieving your own discoveries
     • The content of the slides is taken from my (forthcoming) book: “The Data Science Revolution: An overview of the field and its applications”
     • Benefits range from “pats on the back” to a salary increase or a generous bonus, and from corporate recognition to international fame! Your mileage can vary, but it’s all good!
  5. Where do I start?
     1. The first thing you need in order to start is a problem
     2. The second is an understanding of the problem. An understanding implies the following:
        • Clear description of the problem
        • Clear objectives
        • Measurable success criteria
     3. The third is a set of data related to the problem
     4. The fourth is a set of hypotheses
     5. The fifth is a set of tools that will allow you to assess the validity of your hypotheses based on the available data
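One way to make items 3 through 5 above concrete is a permutation test: given data for two groups and the hypothesis that the groups differ, it estimates how often a difference at least as large would appear by chance. The sketch below is only an illustration under stated assumptions. The function name `permutation_test`, the variables `layout_a` and `layout_b`, and the numbers in them are all hypothetical and do not come from the slides.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reshuffle the pooled data and split it again.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Made-up order values for two hypothetical page layouts.
layout_a = [20.1, 22.4, 19.8, 21.0, 23.5, 20.7]
layout_b = [24.9, 26.1, 25.3, 27.0, 24.2, 26.8]

p_value = permutation_test(layout_a, layout_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests that the observed difference is unlikely under the "no difference" hypothesis. Whether that counts as success depends on the measurable criteria fixed in item 2.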
  7. Buzzword overview
     • Big Data
     • Data Analysis
     • Intelligent Web
     • Machine Learning
     • Artificial Intelligence
     • Statistical Analysis
  8. What you really need …
     (Diagram labels: Domain Expertise, Science, Engineering)
  9. Domain expertise
     • Each domain defines its own “universe” that, like our physical universe, waits to be explored by scientific means
     • You do not have to be a domain expert yourself but you should be able to grasp all the fundamentals quickly and accurately
     • Examples (just a few; the list is practically endless):
        • Supply chain management
        • Auctions for Ads
        • Financial derivatives pricing
        • Mortgage risk assessment
        • Drug discovery
  10. Science
      • A firm background in mathematics is essential; not just statistics!
      • Applied Mathematics
      • A firm understanding of the scientific method
         1. Aggregate the questions/problems to be answered/solved
         2. Conceptualize the problem’s domain
         3. Formulate hypotheses → build models
         4. Describe the problems based on the models
         5. Solve the problems
         6. Validate the solutions
         7. Repeat steps 3 through 6, as needed
      • Scientific computing
      • Numerical Methods
      • Visualization
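Steps 3 through 6 of the scientific method listed above can be sketched in a few lines: state a hypothesis as a model, solve for its parameters, and validate the result on data the model has not seen. This is a minimal illustration, not a prescribed workflow. The helper names `fit_line` and `mean_squared_error` are hypothetical, and the data is synthetic (generated as y = 3x + 2 plus noise), not taken from the slides.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(predict, xs, ys):
    """Average squared prediction error of predict over (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic observations (an assumption for illustration only).
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.5) for x in xs]

# Hold out every fourth point so validation uses unseen data.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]
train_x, train_y = zip(*train)
valid_x, valid_y = zip(*valid)

# Steps 3-5: the hypothesis "y depends linearly on x" becomes a model,
# and solving the least-squares problem yields its parameters.
slope, intercept = fit_line(train_x, train_y)

# Step 6: validate against a baseline that ignores x entirely.
baseline = sum(train_y) / len(train_y)
model_mse = mean_squared_error(lambda x: slope * x + intercept, valid_x, valid_y)
baseline_mse = mean_squared_error(lambda x: baseline, valid_x, valid_y)

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"validation MSE: model={model_mse:.3f} baseline={baseline_mse:.3f}")
```

If the model does not beat the baseline on the held-out points, step 7 applies: revise the hypothesis and repeat.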
  11. Engineering
      • Engineering is the systematic application of knowledge for the purpose of designing, implementing, and maintaining physical or virtual constructs in a way that optimizes multiple objectives (e.g. cost, functional effectiveness, operational efficiency, etc.) while respecting all applicable constraints.
      • In the context of Data Science, engineering skills are required for effectively integrating the scientific solution into the real-world system (e.g. an online retail store, a social networking site, a financial tool)
      • In particular, software engineering proficiency is crucial, since all the “objects of observation” are effectively digital and accessible only through some software system
  12. Computational environments

      Name    | Language                       | Purpose                | License
      MATLAB  | C, C++, Java, MATLAB           | General                | Proprietary
      Scilab  | C, C++, Java, Fortran, Scilab  | General                | CeCILL (Open Source)
      Octave  |                                | General                | GNU GPL
      R       | C, Fortran, R                  | Statistical, Graphics  | GNU GPL
      Julia   | C, C++, Scheme                 | General                | MIT License
      ScaVis  | Java                           | General                | Mixed
      SciPy   | C, Fortran, Python             | General                | BSD
  13. Scientific Libraries
      • Basic Linear Algebra Subprograms (BLAS) – written in Fortran
      • Linear Algebra Package (LAPACK) – written in Fortran 90
      • Numerical Algorithms Group (NAG) libraries
      • GraphLab – its API is written in C++
      • MTJ – Matrix Toolkit that integrates BLAS and LAPACK in Java
      • EJML – linear algebra library written in Java
      • Commons Math – Apache project that offers a lightweight, self-contained library for mathematics and statistics
      • NumPy – support for matrices and high-level mathematical functions for Python
      • SciPy – efficient routines for numerical integration and optimization
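To give a feel for the kind of routine that libraries such as the ones above provide, here is a hedged, standard-library-only sketch of numerical integration (composite Simpson's rule) and one-dimensional optimization (golden-section search). The function names `simpson` and `golden_section_minimize` are hypothetical and chosen for this illustration. The code is not how any of the listed libraries implements these tasks; their routines are far more robust and efficient.

```python
import math


def simpson(f, a, b, n=100):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weights 4 (odd) and 2 (even).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Locate the minimizer of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2


# Integral of sin(x) over [0, pi]; the exact value is 2.
area = simpson(math.sin, 0.0, math.pi)

# Minimizer of (x - 2)^2 + 1 on [0, 5]; the exact minimizer is x = 2.
x_min = golden_section_minimize(lambda x: (x - 2.0) ** 2 + 1.0, 0.0, 5.0)

print(f"area={area:.6f} x_min={x_min:.6f}")
```

For simplicity this golden-section loop re-evaluates both interior points on every iteration. A production routine would reuse one of them, and in practice a library routine such as `scipy.integrate.quad` or `scipy.optimize.minimize_scalar` is the better choice.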
  14. Machine Learning libraries
      • JGAP – Genetic algorithms library
      • Encog – Neural networks library
      • Opt4J – Evolutionary computation library
      • Weka – Clustering and classification algorithms
      • Yooreeka – Search, recommendations, clustering, classification, and mathematical analysis
  15. Big Data technologies
      • Hadoop – open-source software for reliable, scalable, distributed computing
      • OpenCL – open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices
      • Cloudify – provision, configure, orchestrate, and monitor large distributed systems on the cloud
      • Spring XD – a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export
      • ProActive Parallel Suite – an open-source solution that enables the orchestration of applications and seamlessly integrates with the management of high-performance clouds
      • Ibis – an efficient Java-based platform for distributed computing
  16. The Data Science Revolution: An overview of the field and its applications
