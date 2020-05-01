
  1. 1. Get started with Data Science kaustuv.kunal at gmail.com
  2. 2. Data Science: Data science transforms hypotheses and data into actions.
  3. 3. Who is a Data Scientist? Someone who knows more statistics than a typical computer programmer and more programming than a typical statistician.
  4. 4. Starting up: Which programming language to choose? The options are R, which is good for statistics, and Python, which has a rich set of applications and libraries and wider user support. I started with R but shifted to Python.
  5. 5. Editor of choice: Learn it. Use it. Embrace it.
  6. 6. Stages of a Data Science Project: Data Explore; Data Prepare/Clean; Data Model; Model Evaluation & Fine-tuning; Model Deployment & Presentation.
  7. 7. Data Explore: Acquire domain knowledge and understand the business problem. Fetch data. Remember, no data set is perfect. Keep note of: missing values; inconsistent data ranges and units; incorrect data or outliers. *Python's pandas, NumPy, and Matplotlib libraries
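The checks listed on this slide can be sketched with pandas and NumPy. This is only an illustration: the toy table, its column names, and the `iqr_outliers` helper are made up for the example, and the 1.5 * IQR rule is one common outlier heuristic, not something the slides prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; a real project would load its own, e.g. with pd.read_csv.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 120],                # one missing value, one suspicious value
    "height": [1.75, 180.0, 1.62, 1.80, np.nan, 1.70],   # mixed units (metres vs centimetres)
    "city": ["Pune", "Delhi", "Pune", None, "Delhi", "Pune"],
})

# Missing values per column.
missing_counts = df.isna().sum()

# Summary statistics make inconsistent ranges and units visible.
summary = df.describe()


def iqr_outliers(series):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]


age_outliers = iqr_outliers(df["age"])

print(missing_counts)
print(summary)
print(age_outliers)
```

A histogram or box plot drawn with Matplotlib would show the same problems visually.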
  8. 8. Data Prepare/Clean: Do something about missing data: replace with mean/median values; remove the column/feature; remove the record; or transform it into a new categorical feature. Feature engineering: Normalization: convert values into a min-max range such as 0-1. Standardization: convert values to have zero mean and unit variance. Transformation: logarithmic, one-hot encoding. Prepare train, test, and cross-validation data sets. Python scikit-learn modules (e.g. imputer)
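A minimal sketch of these preparation steps with scikit-learn follows. The arrays are invented toy data, and `SimpleImputer` is assumed here as the current form of the "imputer" the slide mentions. The split comes first so that the imputer, scalers, and encoder learn their parameters from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric features (with gaps) and one categorical feature.
X_num = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0],
                  [5.0, 50.0], [6.0, 60.0], [7.0, np.nan], [8.0, 80.0]])
X_cat = np.array([["red"], ["blue"], ["red"], ["green"],
                  ["blue"], ["red"], ["green"], ["blue"]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Hold out a test set before fitting any transformer.
num_train, num_test, cat_train, cat_test, y_train, y_test = train_test_split(
    X_num, X_cat, y, test_size=0.25, random_state=0)

# Missing data: replace with the column mean learned from the training rows.
imputer = SimpleImputer(strategy="mean")
num_train_filled = imputer.fit_transform(num_train)
num_test_filled = imputer.transform(num_test)

# Normalization: rescale each column to the 0-1 range.
num_train_minmax = MinMaxScaler().fit_transform(num_train_filled)

# Standardization: zero mean and unit variance per column.
scaler = StandardScaler()
num_train_std = scaler.fit_transform(num_train_filled)
num_test_std = scaler.transform(num_test_filled)

# One-hot encoding; categories unseen during fit become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
cat_train_onehot = encoder.fit_transform(cat_train).toarray()
cat_test_onehot = encoder.transform(cat_test).toarray()

# Final feature matrices.
X_train = np.hstack([num_train_std, cat_train_onehot])
X_test = np.hstack([num_test_std, cat_test_onehot])
```

Normalization and standardization are alternatives. The sketch computes both only to show each call, and uses the standardized columns in the final matrices.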
  9. 9. Data Model: Select an ML model depending on the problem's learning type. It is advisable to use more than one modeling method. Learn ensemble methods, which combine weak learners into a strong one.
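As one illustration of combining weak learners, the sketch below compares a single decision stump with an AdaBoost ensemble on synthetic data. AdaBoost, the data set, and the parameter values are assumptions chosen for the example. The slides do not name a specific ensemble method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A weak learner: a decision tree limited to a single split (a "stump").
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
stump_acc = stump.score(X_test, y_test)

# AdaBoost fits many stumps in sequence, reweighting the examples that earlier
# stumps misclassified, and combines their votes.
ensemble = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
ensemble_acc = ensemble.score(X_test, y_test)

print(f"single stump accuracy: {stump_acc:.3f}")
print(f"AdaBoost accuracy:     {ensemble_acc:.3f}")
```

Random forests and gradient boosting follow the same idea with different ways of building and combining the individual trees.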
  10. 10. Data Model: Cheat Sheet. Supervised learning (labeled training set): Regression, predicting values, e.g. house prices: Linear Regression (Ordinary Least Squares), SVM for regression, KNN for regression, Random Forest for regression. Classification, predicting labels, e.g. spam detection: Logistic Regression, Decision Tree, Naive Bayes classifier, Random Forest for classification, KNN for classification, SVM for classification. Unsupervised learning (no labeled training set): Clustering, grouping items, e.g. grouping customers by buying patterns for ad targeting: K-Means, HCA (Hierarchical Cluster Analysis). Association rules, e.g. suggesting items based on items purchased: Apriori algorithm.
  11. 11. Model Evaluation & Fine-tuning: Compare with more than one modeling method. Decide on an evaluation method (generally specified in the project initiation phase). Iterate through feature engineering to improve the measure.
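Comparing several modeling methods under one evaluation method might look like the sketch below. It assumes 5-fold cross-validated accuracy as the agreed measure and uses the Iris data set bundled with scikit-learn. The three candidate models are arbitrary picks from the cheat sheet.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate models (illustrative choices and parameters).
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Same evaluation method for every candidate: mean accuracy over 5 folds.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

best_name = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best_name)
```

After each round of feature engineering, rerunning the same comparison shows whether the measure actually improved.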
  12. 12. Model Evaluation Measures. Classification model: Confusion Matrix (Accuracy, Precision, Recall/Sensitivity, Specificity, F1 Score); ROC (Receiver Operating Characteristic); AUC (Area Under the ROC Curve). Regression model: Mean Squared Error (MSE); Root Mean Squared Error (RMSE); R-squared; Mean Absolute Error (MAE); Correlation (Pearson's, Spearman's, Kendall's). Probability model: Log Likelihood; Deviance; Akaike Information Criterion (AIC); Entropy. Clustering: Silhouette.
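Several of these measures can be computed with `sklearn.metrics`. The sketch below uses small hand-made label and score vectors, so the numbers are illustrative only. Specificity is derived from the confusion matrix because it has no dedicated helper in the calls used here.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score,
                             roc_auc_score)

# Classification: made-up true labels and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05, 0.15])
y_pred = (y_score >= 0.5).astype(int)  # threshold the scores at 0.5

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)      # also called sensitivity
specificity = tn / (tn + fp)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)       # uses the scores, not the thresholded labels

# Regression: made-up targets and predictions.
r_true = np.array([3.0, 5.0, 2.0, 7.0])
r_pred = np.array([2.0, 5.0, 4.0, 8.0])

mse = mean_squared_error(r_true, r_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(r_true, r_pred)
r2 = r2_score(r_true, r_pred)

print(accuracy, precision, recall, specificity, f1, auc)
print(mse, rmse, mae, r2)
```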
  13. 13. Model Deployment: Most important, though often neglected. Deploy using Flask. Learn the basics of REST and HTML.
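A minimal sketch of serving predictions over REST with Flask is shown below. The `/predict` route, the `area_sqft` field, and the hard-coded `predict_price` function are hypothetical stand-ins. A real deployment would load a trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_price(area_sqft):
    """Placeholder for a trained model: a made-up linear rule, for illustration only."""
    return 50.0 * area_sqft + 10000.0


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"area_sqft": 1000}.
    payload = request.get_json(silent=True) or {}
    if "area_sqft" not in payload:
        return jsonify(error="area_sqft is required"), 400
    try:
        area = float(payload["area_sqft"])
    except (TypeError, ValueError):
        return jsonify(error="area_sqft must be a number"), 400
    return jsonify(prediction=predict_price(area))
```

Assuming the code is saved as app.py, running `flask run` in the same directory starts a development server, and a POST request with a JSON body to /predict returns the prediction as JSON.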

