FIRST STEPS IN DATASCIENCE
Tips and tools for wannabe data analysts
By Sheshachalam Ratnala
Data analytics Aka Machine Learning
Data analytics is an area where the available digital data is treated as a gold mine from which tangible output is obtained; when applied, that output affects a business and its efficiency.
Machine learning is the tool, in the form y = f(x), that correlates the parameters in the data, learns the relationship between them, and keeps improving that relationship.
Data analytics Aka Machine Learning
Data: a set of values of quantitative and qualitative variables; historical information or knowledge represented in usable form.
Population - The entire group
The collection of data that represents the whole of the problem domain.
Sample - A portion of the group
A subset of the population taken for inference, which should be a true representation of the overall population.
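The population/sample distinction above can be sketched in a few lines of Python. This is only an illustration: the "heights" population, its size, and the sample size of 500 are made-up assumptions, not values from the slides.

```python
import random
from statistics import fmean

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 heights in cm (assumed for illustration).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Sample: a randomly drawn subset used to infer something about the population.
sample = random.sample(population, 500)

population_mean = fmean(population)
sample_mean = fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

A randomly drawn sample tends to give a mean close to the population mean; a biased selection (for example, only the first 500 rows of a sorted table) would not be a true representation.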
Data analytics – How to start
Data science or data analytics, whatever name you know it by, has essentially three areas to cover:
Business
Statistics
Programming
Data analytics – How to start
Business – Critical thinking
1. Objective analysis and evaluation of an issue in order to form a judgement
2. This is the stage to build the hypothesis for the problem domain in context
3. The model below could be a way to follow
Data analytics – How to start
Statistics – Mathematical Analysis
Data is treated as variables, and the hierarchy is as follows:
Data (variables)
  Numerical (quantitative)
    Discrete: whole numbers, e.g. 5, 10
    Continuous: any value within a permitted range, e.g. 5.3, 5.35, 5.45, 6.0
  Categorical (qualitative)
    Ordinal (logically ordered): e.g. Low; Med; High
    Nominal (unordered): e.g. Male; Female, or different types of four-wheelers
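A small Python sketch may help make the four variable types concrete. The rating labels, vehicle names and the numeric encoding in `ORDER` are assumptions chosen for illustration.

```python
from collections import Counter

# Ordinal: the labels carry a logical order, so ranking and sorting make sense.
ORDER = {"Low": 0, "Med": 1, "High": 2}  # assumed encoding, for illustration only
ratings = ["High", "Low", "Med", "Low"]
sorted_ratings = sorted(ratings, key=ORDER.get)

# Nominal: there is no order, so counting categories is meaningful but ranking is not.
vehicles = ["sedan", "SUV", "sedan", "hatchback"]
vehicle_counts = Counter(vehicles)

# Numerical: discrete values are whole numbers, continuous values can fall anywhere in a range.
discrete = [5, 10]
continuous = [5.3, 5.35, 5.45, 6.0]

print(sorted_ratings)
print(vehicle_counts.most_common(1))
```

The practical point is that the variable type decides which operations are valid: sorting suits ordinal data, counting suits nominal data, and arithmetic suits numerical data.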
Data analytics – How to start
Programming - Execution
R is widely used owing to its history of statistical use and its abundant statistical libraries.
Python, an interpreted language, provides a wide variety of packages for application development as well as statistical libraries.
Data ingestion tools: Spark, Hadoop
Data analytics – Problem perspective
Solution hypothesis
Supervised learning
  Numerical data (target variable): Regression
    Linear Regression, Time Series
  Categorical data (target variable): Classification
    Decision Trees, Random Forest, K-NN, Logistic Regression
Demand Forecasting
Reinforcement learning
Semi-supervised
  NLP and AI
Unsupervised
  Clustering: K-Means, Hierarchical clustering
  Dimensionality Reduction
  Collaborative filtering
Classifying the problem
Data analytics – Problem Complexity
Solution complexity and data volume increase with the kind of business value being generated.
Credits : odoscope: Overview of analytics methods
Data analytics – The execution
Basic Terminology
• Attribute - Features are quantitative attributes of the samples being observed
• Axis - Features are orthogonal axes of their feature space if they are linearly independent
• Column/Independent variable - Features are represented as columns in your dataset
• Dimension - A dataset's features, grouped together, can be treated as an n-dimensional coordinate space
• Input - Feature values are the input of data-driven machine learning algorithms
• Predictor - Features used to predict other attributes are called predictors; the attribute being predicted is the dependent variable
• View - Each feature conveys a quantitative trait or perspective about the sample being observed
• Independent variable - Autonomous features used to calculate others are like independent variables in algebraic equations
Structuring the data
Data analytics – The execution
The rule of Seven
The steps are iterative at any stage
• Data collection (problem context)
• Data wrangling / data munging (data cleaning)
• Data exploring / analysis
• Data transforming
• Modelling
• Model evaluation
• Data visualization (intelligence)
Machine learning models work only on clean, structured data. 5 out of 7 steps relate to pre-processing the data given to the model.
Data analytics – The execution
1. Data collection / selection
1. No bias in the data features
2. Relevant data features
3. Techniques to handle
a) Data collection:
1. Data from sources related to the problem, i.e. databases, weblogs, emails, etc.
2. Any audio, video, sensor data, etc.
3. The 6 Vs of data: Variety, Velocity, Veracity, Volume, Value, Viability
b) Data selection:
1. PCA (principal component analysis): for unlabeled (unsupervised) data
2. LDA (linear discriminant analysis): for labeled (supervised) data
Data analytics – The execution
2. Data cleaning (garbage in, garbage out)
1. Data as obtained is not clean and has the following issues:
1. Outliers 2. Missing data 3. Malicious data 4. Erroneous data 5. Irrelevant data 6. Inconsistent data 7. Needs formatting
2. Techniques to handle
1. Impute missing values with the mean, median or mode
2. Treat outliers by deleting the row if it is not at all related; otherwise analyze it with more data
3. Binning
4. Creating new features from the given features
5. Dummy variables
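Three of the cleaning techniques listed above (imputation, binning and dummy variables) can be sketched with the Python standard library alone. The `ages` and `cities` columns, the bin edges and the `age_bin` helper are hypothetical; in practice a library such as Pandas would usually do this work.

```python
from statistics import median

# Toy columns (assumed data): None marks a missing value.
ages = [23, None, 35, 41, None, 29]
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"]

# Imputation: replace missing values with the median of the observed ones.
observed = [a for a in ages if a is not None]
fill = median(observed)
ages_imputed = [fill if a is None else a for a in ages]

# Binning: turn a numerical feature into a small set of ranges.
def age_bin(age):
    """Hypothetical bin edges chosen only for this example."""
    if age < 30:
        return "under_30"
    if age < 40:
        return "30_to_39"
    return "40_plus"

age_bins = [age_bin(a) for a in ages_imputed]

# Dummy variables: one 0/1 column per category of a nominal feature.
categories = sorted(set(cities))
dummies = [[int(city == cat) for cat in categories] for city in cities]

print(ages_imputed)
print(age_bins)
print(categories, dummies[0])
```

The median is used here because it is less sensitive to outliers than the mean; the mean or mode would follow the same pattern.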
Data analytics – The execution
3. Data analysis (data exploring)
1. Find the relevance of the feature set. Apply all the basic statistical exploration, i.e. moments
2. Obtain the statistical relationships.
3. Perform basic visualizations to arrive at a concrete feature set.
4. Techniques to handle
1. Univariate analysis (mean, mode, normal distribution, variance, skewness, kurtosis)
2. Bivariate analysis (scatter plot, box plot, histogram)
3. Multivariate analysis (probability distribution functions, PDFs)
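The univariate statistics named above are all moments of a single feature. The sketch below computes them for a small made-up list of values; `central_moment` is a helper written for this example, and the population (not sample) definitions are assumed.

```python
from statistics import fmean, mode

values = [2, 4, 4, 4, 5, 5, 7, 9]  # toy feature values (assumed)

def central_moment(data, k):
    """k-th central moment: the average of (x - mean)^k."""
    mu = fmean(data)
    return fmean([(x - mu) ** k for x in data])

mean_value = fmean(values)
mode_value = mode(values)
variance = central_moment(values, 2)                   # spread around the mean
skewness = central_moment(values, 3) / variance ** 1.5  # asymmetry
kurtosis = central_moment(values, 4) / variance ** 2    # tail weight (3 for a normal distribution)

print(mean_value, mode_value, variance, skewness, kurtosis)
```

A positive skewness, as in this toy list, indicates a longer tail on the right; a kurtosis below 3 indicates lighter tails than a normal distribution.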
Credits: https://jixta.wordpress.com/
Data analytics – The execution
Data analysis – adopt a few basic visualizations from the list
Data analytics – The execution
4. Data transformation (data on the same scale)
1. Ensure that the remaining features are informative; a transformation changes either the number of features or the feature values. This is also known as feature engineering
2. Dimensionality reduction
3. Curse of dimensionality
4. Techniques to handle
1. PCA: principal component analysis
2. Kernel trick
3. Normalization
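Normalization and PCA from the list above can be sketched with NumPy. This is a minimal illustration, not a full implementation: the three toy features and their scales are assumptions, z-score scaling is one of several normalization choices, and PCA is computed here through a singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): 200 samples, 3 features on very different scales.
# The third feature is almost a rescaled copy of the first, so it adds little information.
x1 = rng.normal(0, 1, 200)
x2 = rng.normal(0, 100, 200)
x3 = 1000 * x1 + rng.normal(0, 1, 200)
X = np.column_stack([x1, x2, x3])

# Normalization (z-score): every feature ends up with mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Dimensionality reduction: keep the first two components only.
X_reduced = Z @ Vt[:2].T

print(np.round(explained, 4))
print(X_reduced.shape)
```

Because the third feature is nearly redundant, the last component explains almost none of the variance, so dropping it loses very little information.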
Data analytics – The execution
6. Machine learning modeling
1. Split the data into train and test sets.
2. Keep some data that is never tested on, or obtain a fresh sample, termed “out of sample”
3. Apply the appropriate ML algorithm to the training data.
4. Check the accuracy with the test data.
5. Observe the bias and variance
a) Bias is how far the average prediction is from the actual value
b) Variance is how widely the predictions are spread around their average
c) Error = Variance + Bias² (plus irreducible noise)
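The modeling steps above, and the error decomposition in (c), can be sketched with NumPy. Everything concrete here is an assumption for illustration: the data follow y = 3x + 2 plus noise, the split is 80/20, the model is a straight-line fit, and `fit_slope` is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
TRUE_SLOPE, TRUE_INTERCEPT = 3.0, 2.0  # assumed ground truth for the toy data

# Toy data: y = 3x + 2 plus noise.
x = rng.uniform(0, 10, 200)
y = TRUE_SLOPE * x + TRUE_INTERCEPT + rng.normal(0, 1.0, 200)

# Step 1: shuffle the row indices and split 80/20 into train and test.
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:160], idx[160:]

# Step 3: fit a straight line on the training rows only.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Step 4: check the error on rows the model has never seen.
pred = slope * x[test_idx] + intercept
test_mse = np.mean((y[test_idx] - pred) ** 2)

# Step 5: bias and variance of the slope estimate over many fresh small samples.
def fit_slope(n=20):
    xs = rng.uniform(0, 10, n)
    ys = TRUE_SLOPE * xs + TRUE_INTERCEPT + rng.normal(0, 1.0, n)
    return np.polyfit(xs, ys, deg=1)[0]

estimates = np.array([fit_slope() for _ in range(300)])
bias = estimates.mean() - TRUE_SLOPE             # how far the average estimate is from the truth
variance = estimates.var()                       # how spread out the estimates are
mse = np.mean((estimates - TRUE_SLOPE) ** 2)     # equals variance + bias**2

print(f"test MSE: {test_mse:.3f}")
print(f"mse={mse:.5f}  variance + bias^2={variance + bias**2:.5f}")
```

The last line shows the decomposition for an estimated parameter, where there is no separate noise term; for predictions of noisy targets the irreducible noise is added on top.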
Data analytics – The execution
6.1 Machine learning modeling
2. Apply the appropriate algorithm as described by the solution hypothesis
Ref: cheatsheet
Data analytics – The execution
6.2 Machine learning model
1. Model performance
1. Model validation
1. MSE (mean squared error) 2. Hypothesis testing 3. Cross-validation
2. Algorithm tuning
1. Tuning the coefficient parameters 2. Increasing the splits
3. Feature engineering (iterate again over the features)
4. Cross-validation
1. K-fold
5. Ensemble methods (combining ML algorithms)
1. Voting (selection based on voting on performance) 2. Bagging (bootstrapping + aggregating) 3. Boosting (weak learners to a strong learner)
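Of the ensemble methods listed above, voting is the simplest to sketch. The example below uses plain majority voting over class predictions, which is one common reading of "voting"; the three models and their predictions are made up, and `majority_vote` is a helper written for this example.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote per sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three different models on the same four samples.
model_a = ["spam", "ham", "spam", "ham"]
model_b = ["spam", "spam", "spam", "ham"]
model_c = ["ham", "ham", "spam", "spam"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model is wrong on a different sample here, so the combined vote can be right where any single model is not. An odd number of models avoids ties when there are two classes.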
Data analytics Aka Machine Learning
6.3.1 Machine learning model performance
1. Confusion matrix (hypothesis testing)
Measurement terms
1. Precision 2. Recall 3. Accuracy 4. Specificity 5. False positive rate (fall-out) 6. False negative rate (miss rate)
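The six measurement terms above all derive from the four cells of a binary confusion matrix. A minimal sketch, with made-up labels where 1 is the positive class:

```python
# Toy labels (assumed): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)            # of the predicted positives, how many are right
recall = tp / (tp + fn)               # of the actual positives, how many were found
accuracy = (tp + tn) / len(pairs)     # share of all predictions that are right
specificity = tn / (tn + fp)          # of the actual negatives, how many were found
false_positive_rate = fp / (fp + tn)  # fall-out = 1 - specificity
false_negative_rate = fn / (fn + tp)  # miss rate = 1 - recall

print(tp, fn, fp, tn)
print(precision, recall, accuracy, specificity)
```

Accuracy alone can mislead on imbalanced data, which is why precision, recall and specificity are usually reported alongside it.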
Data analytics Aka Machine Learning
6.3.2 Machine learning model performance
1. Cross-fold validation
• Random division of the data set into subsets (folds)
• ML algorithm checked on each subset
• Overall efficiency taken as the final accuracy of the model
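The three bullets above can be sketched as a small k-fold loop in plain Python. The labels, the choice of k = 5 and the majority-class "model" are assumptions made to keep the example short; `k_fold_indices` is a helper written for this example, not a library function.

```python
import random
from collections import Counter
from statistics import fmean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k randomly assigned folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random division of the data set
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

labels = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # toy target values (assumed)

fold_accuracies = []
for train, test in k_fold_indices(len(labels), k=5):
    # Stand-in "model": always predict the most frequent label seen in the training rows.
    majority = Counter(labels[j] for j in train).most_common(1)[0][0]
    correct = sum(1 for j in test if labels[j] == majority)
    fold_accuracies.append(correct / len(test))   # check on the held-out fold

mean_accuracy = fmean(fold_accuracies)            # overall figure across all folds
print(fold_accuracies, mean_accuracy)
```

Every row is used for testing exactly once and for training k - 1 times, so the averaged score depends less on one lucky or unlucky split.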
Data analytics Aka Machine Learning
7. Data visualization
1. Telling the story of the data analysis as descriptive, prescriptive or predictive
2. Effective use of visual graphs.
3. Tools like Tableau, D3.js, Matplotlib, Chart.js
Data analytics Aka Machine Learning
Tools in practice
Core – Python libraries
NumPy (mathematical computing functions / N-dimensional arrays)
Pandas (data analysis, data munging via in-memory data representation)
Matplotlib (2D visualization library)
Scikit-learn (machine learning algorithms)
For a high-level language user, Python is the best tool available
Data analytics Aka Machine Learning
Tools sources
1. Anaconda
1. Use the IPython universal editor
2. Python 2.7+ or 3.5
3. Be careful about the version, because the supported functions differ
4. A good starting tool
5. Spyder: interactive editor for basic Python learning
2. Enthought Canopy
1. Interactive environment
3. PyCharm by JetBrains: interactive IDE and debugger
Data analytics Aka Machine Learning
Tools cheat sheets
Must visit sites
KdNuggets
Kaggle
DatascienceCentral
DataCamp
https://www.class-central.com/
http://analyticsvidhya.com/
https://www.odsc.com/
http://www.pythonlearn.com/
http://datascienceplus.com/
Practice data sets
http://ipython-books.github.io/minibook/
http://learnds.com/
https://vincentarelbundock.github.io/Rdatasets/
Thank you !!!
Connect with me at
r.shera01@gmail.com
Quarkus Hidden and Forbidden Extensions
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 

Data analytics: first steps

  • 1. First Steps in Data Science: Tips and tools for wannabe data analysts. By Sheshachalam Ratnala
  • 2. Data analytics aka Machine Learning. Data analytics is an area where the available digital data is treated as a gold mine from which tangible output is obtained; when applied, that output impacts businesses and their efficiency. Machine learning is the tool, in the form y = f(x), that correlates the parameters in the data to obtain a relation, which it learns from these parameters and keeps improving.
  • 3. Data analytics aka Machine Learning.
      • Data: a set of values of quantitative and qualitative variables; historic information or knowledge represented in usable form.
      • Population (the entire group): the collection of data that represents the whole of the problem domain.
      • Sample (a portion of the group): a subset of the population taken for inference, which should be a true representation of the overall population.
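To make the population/sample distinction concrete, here is a minimal sketch using only the Python standard library. The "population" is synthetic data invented for illustration (the mean of 100 and spread of 15 are assumptions, not figures from the slides); the point is only that a random sample's mean lands close to the population mean.

```python
import random
import statistics

# Hypothetical population: 10,000 synthetic order values (illustration only).
rng = random.Random(42)
population = [rng.gauss(100, 15) for _ in range(10_000)]

# Simple random sample without replacement: a portion of the group.
sample = rng.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

A sample drawn at random like this is what makes inference about the population reasonable; a biased selection (for example, only the largest orders) would not be representative.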
  • 4. Data analytics – How to start. Data science or data analytics, by whatever name it is known to you, has essentially 3 areas to cover: Business, Statistics and Programming.
  • 5. Data analytics – How to start. Business – Critical thinking
      1. Objective analysis and evaluation of an issue in order to form a judgement.
      2. This is the stage to build the hypothesis for the problem domain in context.
      3. The model below could be a way to follow (diagram on the slide).
  • 6. Data analytics – How to start. Statistics – Mathematical analysis. Data is treated as variables, with the following hierarchy:
      • Numerical (quantitative)
          • Discrete: whole numbers, e.g. 5, 10
          • Continuous: any value within a permitted range, e.g. 5.3, 5.35, 5.45, 6.0
      • Categorical (qualitative)
          • Ordinal (logically ordered): e.g. Low; Med; High
          • Nominal (unordered): e.g. Male; Female, or different types of 4-wheelers
  • 7. Data analytics – How to start. Programming – Execution. R is widely used due to its historical statistical usage and its abundant statistical libraries. Python, an interpreted language, provides a wide variety of packages for application development as well as statistical libraries. Data ingestion tools: Spark, Hadoop.
  • 8. Data analytics – Problem perspective. Classifying the problem. Solution hypothesis:
      • Supervised learning
          • Numerical data (target variable): regression (linear regression, time series)
          • Categorical data (target variable): classification (decision trees, random forest, k-NN, logistic regression)
          • Demand forecasting
      • Reinforcement learning
      • Semi-supervised: NLP and AI
      • Unsupervised: clustering (k-means, hierarchical clustering), dimensionality reduction, collaborative filtering
  • 9. Data analytics – Problem complexity. The solution complexity and data volume increase with the kind of business value being generated. Credits: odoscope: Overview of analytics methods
  • 10. Data analytics – The execution. Structuring the data. Basic terminology:
      • Attribute - Features are quantitative attributes of the samples being observed
      • Axis - Features are orthogonal axes of their feature space, if they are linearly independent
      • Column - Features are represented as columns in your dataset
      • Dimension - A dataset's features, grouped together, can be treated as an n-dimensional coordinate space
      • Input - Feature values are the input of data-driven machine learning algorithms
      • Predictor - Features used to predict other attributes are called predictors (the attribute being predicted is the dependent variable)
      • View - Each feature conveys a quantitative trait or perspective about the sample being observed
      • Independent variable - Autonomous features used to calculate others are like independent variables in algebraic equations
  • 11. Data analytics – The execution. The rule of Seven. The steps are iterative at any stage:
      • Data collection (problem context)
      • Data wrangling / data munging (data cleaning)
      • Data exploring / analysis
      • Data transforming
      • Modelling
      • Model evaluation
      • Data visualization (intelligence)
    Machine learning models work only on clean, structured data. 5 out of 7 steps are related to pre-processing of the data given to the model.
  • 12. Data analytics – The execution. The rule of Seven. 1. Data collection / selection
      1. No bias in the data features
      2. Relevant data features
      3. Techniques to handle:
          a) Data collection:
              1. Data from sources related to the problem, e.g. databases, web logs, emails, etc.
              2. Any audio, video, sensor data, etc.
              3. The 6 Vs of data: Variety, Velocity, Veracity, Volume, Value, Viability
          b) Data selection:
              1. PCA (principal component analysis): unsupervised
              2. LDA (linear discriminant analysis): supervised
  • 13. Data analytics – The execution. The rule of Seven. 2. Data cleaning (garbage in, garbage out)
      1. Data obtained is not clean and has the following issues:
          1. Outliers
          2. Missing data
          3. Malicious data
          4. Erroneous data
          5. Irrelevant data
          6. Inconsistent data
          7. Needs formatting
      2. Techniques to handle:
          1. Impute values by mean, median or mode
          2. Treat outliers by deleting the row if it is not at all related; otherwise analyze with more data
          3. Binning
          4. Creating new features from given features
          5. Dummy variables
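The cleaning techniques listed on this slide can be sketched with pandas, which the deck later names as its data-munging library. This is a minimal illustration, not the author's code: the DataFrame, its column names, the age threshold of 100 and the bin edges are all made-up assumptions.

```python
import pandas as pd

# Hypothetical raw data with missing values and one implausible age (illustration only).
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 120],
    "city": ["Pune", "Delhi", "Pune", None, "Delhi", "Pune"],
    "income": [30_000, 52_000, 41_000, 60_000, None, 45_000],
})

# 1. Impute missing values: median for numeric columns, mode for the categorical one.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2. Treat outliers: drop rows outside a plausible range (the cut-off is an assumption).
df = df[df["age"] <= 100].copy()

# 3. Binning: turn a continuous column into ordered bands.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 40, 100], labels=["young", "mid", "senior"])

# 4. Create a new feature from existing ones.
df["income_per_year_of_age"] = df["income"] / df["age"]

# 5. Dummy variables: one indicator column per city.
df = pd.get_dummies(df, columns=["city"])

print(df)
```

Whether to delete or keep an outlier remains a judgement call, as the slide says; the hard threshold here is only the simplest possible rule.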
  • 14. Data analytics – The execution. The rule of Seven. 3. Data analysis (data exploring)
      1. Find the relevance of the feature set. Apply all the basic statistical exploration, i.e. moments.
      2. Obtain the statistical relation.
      3. Perform basic visualizations to arrive at the concrete feature set.
      4. Techniques to handle:
          1. Univariate analysis (mean, mode, normal distribution, variance, skewness, kurtosis)
          2. Bivariate analysis (scatter plot, box plot, histogram)
          3. Multivariate analysis (probability distribution functions, PDFs)
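As a rough illustration of the univariate measures named on this slide, the sketch below computes them with pandas on a small made-up series (the name `delivery_days` and the values are assumptions). The single large value is there to show how skewness reacts to a long right tail.

```python
import pandas as pd

# Hypothetical measurements; the value 30.0 creates a long right tail.
values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 30.0], name="delivery_days")

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],
    "variance": values.var(),      # sample variance (ddof=1)
    "skewness": values.skew(),     # > 0 means a longer right tail
    "kurtosis": values.kurt(),     # excess kurtosis (0 for a normal distribution)
}

for name, value in summary.items():
    print(f"{name:>9}: {value:.3f}")
```

A mean well above the median together with positive skewness is a quick hint that outliers deserve a closer look before modelling.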
  • 15. Data analytics – The execution. Data analysis: adopt a few basic visualizations from the list. Credits: https://jixta.wordpress.com/
  • 16. Data analytics – The execution. The rule of Seven. 4. Data transformation (data on the same scale)
      1. Ensure that the remaining features are informative; transformation changes the number of features or the feature values. This is also known as feature engineering.
      2. Dimensionality reduction
      3. Curse of dimensionality
      4. Techniques to handle:
          1. PCA: principal component analysis
          2. Kernel trick
          3. Normalization
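A short sketch of two of the techniques above, assuming scikit-learn and NumPy are installed. The data is synthetic, and standardization (zero mean, unit variance via `StandardScaler`) is used here as one common form of normalization; the slides do not say which form the author had in mind.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 5 features on very different scales (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 0.1])

# Put every feature on the same scale before PCA, otherwise the
# largest-scale feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: keep the 2 directions with the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The explained variance ratio shows how much information the retained components keep, which helps decide how many components are enough.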
  • 17. Data analytics – The execution. The rule of Seven. 6. Machine learning modeling
      1. Split the data into train and test sets.
      2. Hold back some data that is never used for training or testing, or obtain a fresh sample; this is termed "out of sample" data.
      3. Apply the appropriate ML algorithm on the train data.
      4. Check the accuracy with the test data.
      5. Observe the bias and variance:
          a) Bias is how far the predicted value is, on average, from the actual value
          b) Variance is how spread out the predicted values are around their own average
          c) Error = Bias² + Variance (plus irreducible noise)
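The split, fit and check steps can be sketched as follows, assuming scikit-learn is installed. The iris dataset (bundled with scikit-learn) and logistic regression are stand-ins chosen for illustration; the slides do not prescribe a dataset or algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example dataset bundled with scikit-learn (an assumption for this sketch).
X, y = load_iris(return_X_y=True)

# 1. Split into train and test sets (70% / 30%), keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 3. Fit the model on the training data only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Check accuracy on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

A large gap between training and test accuracy is one practical symptom of high variance (overfitting), while low accuracy on both suggests high bias.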
  • 18. Data analytics – The execution. The rule of Seven. 6.1 Machine learning modeling. 2. Apply the appropriate algorithm as described by the solution hypothesis. Ref: cheatsheet
  • 19. Data analytics – The execution. The rule of Seven. 6.2 Machine learning model
      1. Model performance
          1. Model validation
              1. MSE (mean squared error)
              2. Hypothesis testing
              3. Cross-validation
      2. Algorithm tuning
          1. Tuning the coefficient parameters
          2. Increasing the splits
      3. Feature engineering (iterate again for features)
      4. Cross-validation
          1. K-fold
      5. Ensemble methods (combining ML algorithms)
          1. Voting (selection based on voting on performance)
          2. Bagging (bootstrapping + aggregating)
          3. Boosting (weak learners combined into a strong learner)
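The three ensemble methods named above can be tried side by side with scikit-learn. This is a sketch under assumptions: the iris dataset, the base models and the hyperparameters are illustrative choices, and gradient boosting is used as one representative boosting algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

ensembles = {
    # Voting: different algorithms each predict; the majority class wins.
    "voting": VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("tree", DecisionTreeClassifier(random_state=0)),
            ("knn", KNeighborsClassifier()),
        ],
        voting="hard",
    ),
    # Bagging: many trees, each trained on a bootstrap sample, then aggregated.
    "bagging": BaggingClassifier(n_estimators=25, random_state=0),
    # Boosting: weak learners added one after another, each correcting the last.
    "boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test set
    print(f"{name:>8}: {scores[name]:.3f}")
```

On a dataset this small the three approaches will score similarly; the differences matter more on noisy or larger problems.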
  • 20. Data analytics aka Machine Learning. The rule of Seven. 6.3.1 Machine learning model performance
      1. Confusion matrix (hypothesis testing). Measurement terms:
          1. Precision
          2. Recall
          3. Accuracy
          4. Specificity
          5. False positive (fall-out rate)
          6. False negative (miss rate)
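Each of these measurement terms is a ratio of the four cells of a binary confusion matrix. The helper below is a hypothetical function written for illustration (its name and the example counts are assumptions), using the standard definitions.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from the four cells of a binary confusion matrix."""
    return {
        "precision": tp / (tp + fp),             # of predicted positives, how many are right
        "recall": tp / (tp + fn),                # of actual positives, how many were found
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "specificity": tn / (tn + fp),           # of actual negatives, how many were found
        "fallout": fp / (fp + tn),               # false positive rate = 1 - specificity
        "miss_rate": fn / (fn + tp),             # false negative rate = 1 - recall
    }


# Made-up counts: 40 true positives, 10 false positives, 5 false negatives, 45 true negatives.
metrics = confusion_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name:>11}: {value:.3f}")
```

Note that fall-out and specificity always sum to 1, as do miss rate and recall, so reporting one of each pair is enough.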
  • 21. Data analytics aka Machine Learning. The rule of Seven. 6.3.2 Machine learning model performance
      1. Cross-fold validation
          • Random division of the data set into subsets
          • The ML algorithm is checked on each subset
          • The overall efficiency is taken as the final accuracy of the model
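The three cross-fold steps map directly onto scikit-learn's `KFold` and `cross_val_score`. The sketch below assumes scikit-learn is installed; the iris dataset, the decision tree and the choice of 5 folds are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random division of the data set into 5 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Train on 4 folds, score on the remaining one, and repeat for every fold.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# The mean over all folds is reported as the overall accuracy.
mean_accuracy = scores.mean()
print("per-fold accuracy:", scores.round(3))
print(f"mean accuracy: {mean_accuracy:.3f}")
```

The spread of the per-fold scores is worth looking at too: a wide spread suggests the model's performance depends heavily on which rows it happened to see.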
  • 22. Data analytics aka Machine Learning. The rule of Seven. 7. Data visualization
      1. Telling the story of the data analysis as descriptive, prescriptive or predictive
      2. Effective use of visual graphs
      3. Tools like Tableau, D3.js, Matplotlib, Chart.js
  • 23. Data analytics aka Machine Learning. Tools in practice. Core Python libraries:
      • NumPy: mathematical computing functions / N-dimensional arrays
      • Pandas: data analysis and data munging through in-memory data representation
      • Matplotlib: 2D visualization library
      • Scikit-learn: machine learning algorithms
    For a high-level language user, Python is the best tool available to use.
  • 24. Data analytics aka Machine Learning. Tools sources
      1. Anaconda
          1. Use the IPython universal editor
          2. Python 2.7+ or 3.5
          3. Be careful about the version, because supported functions differ
          4. A good starting tool
          5. Spyder: interactive editor for basic Python learning
      2. Enthought Canopy
          1. Interactive environment
      3. PyCharm by JetBrains: interactive IDE and debugger
  • 25. Data analytics aka Machine Learning. Tools cheat sheets
      • Must-visit sites: KDnuggets, Kaggle, Data Science Central, DataCamp, https://www.class-central.com/, http://analyticsvidhya.com/, https://www.odsc.com/, http://www.pythonlearn.com/, http://datascienceplus.com/
      • Practice data sets: http://ipython-books.github.io/minibook/, http://learnds.com/, https://vincentarelbundock.github.io/Rdatasets/
  • 26. Thank you! Connect with me at r.shera01@gmail.com