Data Science:
Notes and Toolkits
Dr. Haralambos Marmanis
Waltham, MA
April, 2014
___________________________________
Web: http://www.marmanis.com
Email: h@marmanis.com
Copyright (c) 2014 H. Marmanis. All rights reserved.
What is Science?
• Science is the systematic, data-based pursuit of knowledge
through reason
• Science is not about what we believe; it is about how we arrived
at what we believe
• Science has always relied on data, e.g. Copernicus’ and Kepler’s
theories needed Brahe’s data to grow and prosper
• The word “Science”, for most people, points to specific subject
areas such as Physics, Chemistry, etc.
• However, the methodology is not a priori restricted to these
fields; nearly everything that is taught in a university is the
outcome of a scientific endeavor
What is Data Science?
The systematic, data-based pursuit of knowledge through reason
in non-traditional fields, i.e. applying the same methodology that is
applied in physics, chemistry, biology, etc. to fields like e-Commerce,
social networking, finance, energy, marketing, and so on.
Why should I care?
• Scientists rejoice! There has never been a better time to be a data
scientist, according to business analysts.
• If you are a scientist today, you can become
the next Newton,
the next Maxwell,
the next Einstein in your field!
• These slides will provide you with an overview of Notes and Tools
that are necessary, although not sufficient, for achieving your own
discoveries
• The content of the slides is taken from my (forthcoming) book:
“The Data Science Revolution:
An overview of the field and its applications”
• Benefits range from “pats on the back” to a salary increase or a
generous bonus, and from corporate recognition to international
fame! So, your mileage may vary, but it’s all good!
Where do I start?
1. The first thing that you need to start is a problem
2. The second is an understanding of the problem. An
understanding implies the following:
• Clear description of the problem
• Clear objectives
• Measurable success criteria
3. The third is a set of data related to the problem
4. The fourth is a set of hypotheses
5. The fifth is a set of tools that will allow us to assess the
validity of our hypotheses based on the available data
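As an illustration of the fifth ingredient, a tool that weighs a hypothesis against the available data, here is a minimal sketch of a permutation test. It is not taken from the slides: the function name `permutation_p_value`, the two-group setup, and the sample numbers are all hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often a random relabeling of the pooled data yields a
    difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical data: order values (in dollars) under two page layouts.
layout_a = [12.1, 11.8, 12.6, 13.0, 12.4, 12.9]
layout_b = [10.2, 10.9, 11.1, 10.5, 10.8, 11.0]
p = permutation_p_value(layout_a, layout_b)
```

A small `p` suggests the observed difference is unlikely under the hypothesis that the layout makes no difference. Whether that counts as success depends on the measurable criteria fixed in step 2.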
Buzzword overview
• Big Data
• Data Analysis
• Intelligent Web
• Machine Learning
• Artificial Intelligence
• Statistical Analysis
What you really need …
• Domain Expertise
• Science
• Engineering
Domain expertise
• Each domain defines its own “universe” that, like our physical
universe, waits to be explored by scientific means
• You do not have to be a domain expert yourself but you
should be able to grasp all the fundamentals quickly and
accurately
• Examples (just a few – this is practically endless):
• Supply chain management
• Auctions for Ads
• Financial derivatives pricing
• Mortgage risk assessment
• Drug discovery
Science
• A firm background in mathematics is essential; not just statistics!
• Applied Mathematics
• A firm understanding of the scientific method
1. Aggregate the questions/problems to be answered/solved
2. Conceptualize the problem’s domain
3. Formulate hypotheses → build models
4. Describe the problems based on the models
5. Solve the problems
6. Validate the solutions
7. Repeat steps 3 through 6, as needed
• Scientific computing
• Numerical Methods
• Visualization
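Steps 3 through 6 of the scientific method listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's method: the hypothesis is "the response grows linearly with the input", the model is a least-squares line, and validation uses held-out points. The helper names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (the model of step 3)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mean_squared_error(model, xs, ys):
    """Validation metric (step 6): average squared prediction error."""
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical observations: ad spend (x) versus units sold (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]

# Hold out the last two points so validation uses data the model never saw.
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]

model = fit_line(train_x, train_y)                            # steps 3 to 5
validation_error = mean_squared_error(model, test_x, test_y)  # step 6
```

If `validation_error` exceeds the agreed success criterion, step 7 applies: revise the hypothesis (for example, try a non-linear model) and repeat.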
Engineering
• Engineering is the systematic application of knowledge for the
purpose of designing, implementing, and maintaining physical
or virtual constructs in a way that optimizes multiple
objectives (e.g. cost, functional effectiveness, operational
efficiency, etc.) while respecting all applicable constraints.
• In the context of Data Science, engineering skills are required
for effectively integrating the scientific solution into the
real-world system (e.g. an online retail store, a social networking
site, a financial tool)
• In particular, software engineering proficiency is crucial, since
all the “objects of observation” are effectively digital and
accessible only through some software system
Computational environments
Name   | Language                      | Purpose              | License
MATLAB | C, C++, Java, MATLAB          | General              | Proprietary
SciLab | C, C++, Java, Fortran, Scilab | General              | CeCILL (Open Source)
Octave |                               | General              | GNU GPL
R      | C, Fortran, R                 | Statistical, Graphics | GNU GPL
Julia  | C, C++, Scheme                | General              | MIT License
ScaVis | Java                          | General              | Mixed
SciPy  | C, Fortran, Python            | General              | BSD
Scientific Libraries
• Basic Linear Algebra Subprograms (BLAS) written in Fortran
• Linear Algebra Package (LAPACK) written in Fortran 90
• Numerical Algorithms Group (NAG) libraries
• GraphLab -- GraphLab API is written in C++
• MTJ -- Matrix Toolkit that integrates BLAS and LAPACK in Java
• EJML – linear algebra library written in Java
• Commons Math – Apache project that offers a lightweight,
self-contained, library for mathematics and statistics
• NumPy – support for matrices and high-level mathematical
functions for Python
• SciPy – it includes efficient numerical routines for numerical
integration and optimization
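To give a feel for the kind of routine these libraries provide, here is a minimal plain-Python sketch of numerical integration using the composite Simpson's rule. It is an illustration only and is not how SciPy or any library listed above implements integration; the function name `simpson` is chosen here for the example.

```python
import math

def simpson(f, a, b, n=100):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd i) and 2 (even i).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Integrate sin(x) over [0, pi]; the exact value is 2.
area = simpson(math.sin, 0.0, math.pi)
```

In practice a tested library routine is usually the better choice, since it also handles error estimation and difficult integrands.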
Machine Learning libraries
• JGAP – Genetic algorithms library
• Encog – Neural networks library
• Opt4J – Evolutionary computation library
• Weka – Clustering and classification algorithms
• Yooreeka – Search, recommendations, clustering,
classification, and mathematical analysis
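Clustering, one of the tasks these libraries cover, can be sketched with Lloyd's k-means algorithm. The code below is a minimal plain-Python illustration and does not reproduce the API of Weka, Yooreeka, or any other library named above; the function `kmeans`, its parameters, and the sample points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

The initial centroids are passed in explicitly so the run is deterministic; library implementations typically offer randomized or smarter initialization.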
Big Data technologies
• Hadoop – open-source software for reliable, scalable, distributed
computing
• OpenCL – open royalty-free standard for cross-platform, parallel
programming of modern processors found in personal computers,
servers and handheld/embedded devices
• Cloudify – Provision, configure, orchestrate, and monitor large
distributed systems on the cloud
• Spring XD -- a unified, distributed, and extensible system for data
ingestion, real time analytics, batch processing, and data export
• Proactive Parallel Suite -- an open source solution that enables the
orchestration of applications and seamlessly integrates with the
management of high-performance clouds
• Ibis -- an efficient Java-based platform for distributed computing
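The map, shuffle, and reduce pattern associated with Hadoop's MapReduce model can be sketched in a single process. This is an illustration of the idea only: it does not use Hadoop's API, nothing here is distributed, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs, as a mapper would for one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data needs big tools", "data science needs data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data across the network, which is where the reliability and scalability concerns come in.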
The Data Science Revolution:
An overview of the field and its applications
