Data Science: Notes and Toolkits

Published in: Technology, Education

Transcript

  • 1. Data Science: Notes and Toolkits
      Dr. Haralambos Marmanis, Waltham, MA, April 2014
      Web: http://www.marmanis.com
      Email: h@marmanis.com
      Copyright (c) 2014 H. Marmanis. All rights reserved.
  • 2. What is Science?
      • Science is the systematic, data-based pursuit of knowledge through reason
      • Science is not about what we believe; it is about how we arrived at what we believe
      • Science has always relied on data, e.g. Copernicus’ and Kepler’s theories needed Brahe’s data to grow and prosper
      • The word “Science”, for most people, points to specific subject areas such as Physics, Chemistry, etc.
      • However, the methodology is not a priori restricted to these fields; nearly everything that is taught in a university is the outcome of a scientific endeavor
  • 3. What is Data Science?
      The systematic, data-based pursuit of knowledge through reason in non-traditional fields, i.e. applying the same methodology that is applied in physics, chemistry, biology, etc. to fields like e-commerce, social networking, finance, energy, marketing, and so on.
  • 4. Why should I care?
      • Scientists rejoice! There has never been a better time to be a data scientist; see what the business analysts say (link in the original slide).
      • If you are a scientist today, you can become the next Newton, the next Maxwell, the next Einstein in your field!
      • These slides will provide you with an overview of notes and tools that are necessary, although not sufficient, for achieving your own discoveries
      • The content of the slides is taken from my (forthcoming) book: “The Data Science Revolution: An overview of the field and its applications”
      • Benefits range from “pats on the back” to a salary increase or a generous bonus, and from corporate recognition to international fame! Your mileage may vary, but it’s all good!
  • 5. Where do I start?
      1. The first thing you need is a problem
      2. The second is an understanding of the problem, which implies the following:
          • Clear description of the problem
          • Clear objectives
          • Measurable success criteria
      3. The third is a set of data related to the problem
      4. The fourth is a set of hypotheses
      5. The fifth is a set of tools that will allow you to assess the validity of your hypotheses based on the available data
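The five ingredients above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the problem (does a redesigned checkout page raise the average order value?), the numbers, the 0.05 threshold used as the measurable success criterion, and the choice of a permutation test as the tool. It uses only the Python standard library and is not taken from the slides.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(control, variant, n_resamples=5000, seed=0):
    """One-sided permutation test for mean(variant) > mean(control).

    Returns the observed difference in means and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = mean(variant) - mean(control)
    pooled = list(control) + list(variant)
    n_control = len(control)
    at_least_as_large = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_large + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical data: order values (in dollars) for two page designs.
control = [20.1, 22.4, 19.8, 21.0, 23.5, 20.7, 22.0, 21.3]
variant = [23.0, 24.1, 22.8, 25.2, 23.9, 24.6, 22.5, 24.0]

ALPHA = 0.05  # hypothetical success criterion, fixed before looking at the data

observed, p_value = permutation_test(control, variant)
print(f"difference in means: {observed:.2f}, p-value: {p_value:.4f}")
print("hypothesis supported" if p_value < ALPHA else "no evidence of improvement")
```

Fixing the threshold before running the test is what makes the success criterion measurable rather than a judgment made after the fact.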
  • 7. Buzzword overview
      • Big Data
      • Data Analysis
      • Intelligent Web
      • Machine Learning
      • Artificial Intelligence
      • Statistical Analysis
  • 8. What you really need …
      • Domain Expertise
      • Science
      • Engineering
  • 9. Domain expertise
      • Each domain defines its own “universe” that, like our physical universe, waits to be explored by scientific means
      • You do not have to be a domain expert yourself, but you should be able to grasp all the fundamentals quickly and accurately
      • Examples (just a few; the list is practically endless):
          • Supply chain management
          • Auctions for Ads
          • Financial derivatives pricing
          • Mortgage risk assessment
          • Drug discovery
  • 10. Science
      • A firm background in mathematics is essential; not just statistics!
      • Applied Mathematics
      • A firm understanding of the scientific method
          1. Aggregate the questions/problems to be answered/solved
          2. Conceptualize the problem’s domain
          3. Formulate hypotheses → build models
          4. Describe the problems based on the models
          5. Solve the problems
          6. Validate the solutions
          7. Repeat steps 3 through 6, as needed
      • Scientific computing
      • Numerical Methods
      • Visualization
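Steps 3 through 6 of the scientific method listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the hypothesis (the response grows linearly with the input), the data points, and the function names `fit_line` and `rmse` are all hypothetical and not from the slides. The model is fitted by ordinary least squares on one part of the data and validated on points it has not seen.

```python
from math import sqrt


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rmse(xs, ys, intercept, slope):
    """Root-mean-square error of the fitted line on the given points."""
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observations, split into a fitting set and a validation set.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]
holdout_x = [7, 8]
holdout_y = [15.0, 17.1]

# Steps 3-5: formulate the hypothesis as a model and solve for its parameters.
intercept, slope = fit_line(train_x, train_y)

# Step 6: validate the solution on data that was not used for fitting.
print(f"model: y = {intercept:.2f} + {slope:.2f} * x")
print(f"training RMSE: {rmse(train_x, train_y, intercept, slope):.3f}")
print(f"holdout RMSE:  {rmse(holdout_x, holdout_y, intercept, slope):.3f}")
```

A holdout error much larger than the training error would be the signal to return to step 3 and revise the hypothesis.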
  • 11. Engineering
      • Engineering is the systematic application of knowledge for the purpose of designing, implementing, and maintaining physical or virtual constructs in a way that optimizes multiple objectives (e.g. cost, functional effectiveness, operational efficiency, etc.) while respecting all applicable constraints.
      • In the context of Data Science, engineering skills are required for effectively integrating the scientific solution into the real-world system (e.g. an online retail store, a social networking site, a financial tool)
      • In particular, software engineering proficiency is crucial, since all the “objects of observation” are effectively digital and accessible only through some software system
  • 12. Computational environments
      Name   | Language                        | Purpose              | License
      MATLAB | C, C++, Java, MATLAB            | General              | Proprietary
      SciLab | C, C++, Java, Fortran, Scilab   | General              | CeCILL (Open Source)
      Octave |                                 | General              | GNU GPL
      R      | C, Fortran, R                   | Statistical, Graphics | GNU GPL
      Julia  | C, C++, Scheme                  | General              | MIT License
      ScaVis | Java                            | General              | Mixed
      SciPy  | C, Fortran, Python              | General              | BSD
  • 13. Scientific Libraries
      • Basic Linear Algebra Subprograms (BLAS) – written in Fortran
      • Linear Algebra Package (LAPACK) – written in Fortran 90
      • Numerical Algorithms Group (NAG) libraries
      • GraphLab – the GraphLab API is written in C++
      • MTJ – Matrix Toolkit that integrates BLAS and LAPACK in Java
      • EJML – linear algebra library written in Java
      • Commons Math – Apache project that offers a lightweight, self-contained library for mathematics and statistics
      • NumPy – support for matrices and high-level mathematical functions for Python
      • SciPy – efficient numerical routines, e.g. for numerical integration and optimization
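To give a feel for the kind of routine these libraries package, here is a minimal sketch of numerical integration using the composite Simpson's rule, written with the Python standard library only. It is not how SciPy or any library listed above implements integration; the function name `simpson` and its parameters are assumptions for this example. In practice a tested library routine (for instance `scipy.integrate.quad`) is the better choice.

```python
import math


def simpson(f, a, b, n=100):
    """Approximate the integral of f over [a, b] with composite Simpson's rule.

    n is the number of subintervals and must be a positive even integer.
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd i) and 2 (even i).
        weight = 4 if i % 2 == 1 else 2
        total += weight * f(a + i * h)
    return total * h / 3


# The exact value of the integral of sin(x) from 0 to pi is 2.
approx = simpson(math.sin, 0.0, math.pi)
print(f"approximation: {approx:.8f}, absolute error: {abs(approx - 2.0):.2e}")
```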
  • 14. Machine Learning libraries
      • JGAP – Genetic algorithms library
      • Encog – Neural networks library
      • Opt4J – Evolutionary computation library
      • Weka – Clustering and classification algorithms
      • Yooreeka – Search, recommendations, clustering, classification, and mathematical analysis
  • 15. Big Data technologies
      • Hadoop – open-source software for reliable, scalable, distributed computing
      • OpenCL – open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices
      • Cloudify – provision, configure, orchestrate, and monitor large distributed systems on the cloud
      • Spring XD – a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export
      • Proactive Parallel Suite – an open-source solution that enables the orchestration of applications and seamlessly integrates with the management of high-performance clouds
      • Ibis – an efficient Java-based platform for distributed computing
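Hadoop is best known for the MapReduce programming model, and the idea behind that model can be sketched on a single machine. The code below is a toy, in-memory illustration only: it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. A real framework runs the same three phases across many machines and handles failures and data movement.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one file in a distributed store.
documents = ["big data needs big tools", "data science needs data"]
print(word_count(documents))
```

Because each map call sees only one document and each reduce call sees only one key, both phases can be spread across machines without coordination, which is what lets the model scale.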
  • 16. The Data Science Revolution: An overview of the field and its applications
