I am Shubham Sharma, a graduate of Acropolis Institute of Technology in Computer Science and Engineering. I have spent around two years in the field of machine learning and currently work as a Data Scientist at Reliance Industries Private Limited, Mumbai, mainly focused on problems related to data handling, data analysis, modeling, forecasting, statistics, machine learning, deep learning, computer vision, and natural language processing. My areas of interest are data analytics, machine learning, time series forecasting, web information retrieval, algorithms, data structures, design patterns, and OOAD.
Machine learning techniques are powerful, but building and deploying such models for production use require a lot of care and expertise.
A lot of books, articles, and best practices have been written about machine learning techniques and feature engineering, but putting those techniques to use in a production environment is usually forgotten and underestimated. The aim of this talk is to shed some light on current machine learning deployment practices and go into detail on how to deploy sustainable machine learning pipelines.
NumPy Roadmap presentation at NumFOCUS Forum, by Ralf Gommers
This presentation is an attempt to summarize the NumPy roadmap and both technical and non-technical ideas for the next 1-2 years to users that heavily rely on NumPy, as well as potential funders.
Automated machine learning lectures given at the Advanced Course on Data Science & Machine Learning. Topics: AutoML, hyperparameter optimization, Bayesian optimization, neural architecture search, meta-learning, MAML.
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM, by lucenerevolution
In this session we will show how to build a text classifier using Apache Lucene/Solr with the libSVM libraries. We classify our corpus of job offers into a number of predefined categories. Each indexed document (a job offer) then belongs to zero, one or more categories. Known machine learning techniques for text classification include the naïve Bayes model, logistic regression, neural networks, support vector machines (SVM), etc. We use Lucene/Solr to construct the feature vector. Then we use the libSVM library, known as the reference implementation of the SVM model, to classify the document. We construct as many one-vs-all SVM classifiers as there are classes in our setting, then using the Hadoop MapReduce framework we reconcile the results of our classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
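The one-vs-all construction described above can be sketched in Python with scikit-learn as an illustrative stand-in (the talk itself builds feature vectors with Lucene/Solr and classifies with libSVM on Hadoop; the corpus, labels, and category names below are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Tiny invented corpus of "job offers" with predefined categories
docs = ["java developer wanted", "senior java engineer",
        "nurse position in hospital", "hospital seeks head nurse"]
labels = ["tech", "tech", "health", "health"]

# Feature-vector construction (the role Lucene/Solr plays in the talk)
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One binary SVM per class, reconciled into a single multi-class decision
clf = OneVsRestClassifier(LinearSVC()).fit(X, labels)
print(clf.predict(vec.transform(["java programmer needed"])))
```

The one-vs-all reconciliation here happens in-process; in the talk's setting, the same per-class decisions are merged across machines with MapReduce.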
mlflow: Accelerating the End-to-End ML lifecycle, by Databricks
Building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with over 50 contributors and new features including language APIs, integrations with popular ML libraries, and storage backends. I’ll show how MLflow works and explain how to get started with MLflow.
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ..., by Jimmy Lai
Big data analysis relies on a variety of handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning, and evaluation. Various tools are used in the flow, including Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
SDEC2011 Mahout - the what, the how and the why, by Korea Sdec
Mahout is an open source machine learning library from Apache. From its humble beginnings at Apache Lucene, the project has grown into an active community of developers, machine learning experts and enthusiasts. With v0.5 released recently, the project has been focusing full steam on developing stable APIs, with an eye on our major milestone of v1.0. The speaker has been with Mahout from his days in college as a computer science student. The talk will focus on the major use cases of Mahout: the design decisions, things that worked, things that didn't, and things to expect in future releases.
http://sdec.kr/
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016, by MLconf
DL4J and DataVec for Enterprise Deep Learning Workflows: Applications in NLP, sensor processing (IoT), image processing, and audio processing have all emerged as prime deep learning applications. In this session we will take a practical look at building secure Deep Learning workflows in the enterprise. We'll see how DL4J's DataVec tool enables scalable ETL and vectorization pipelines to be created for a single machine or scaled out to Spark on Hadoop. We'll also see how deep networks such as recurrent neural networks are able to leverage DataVec to more quickly process data for modeling.
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio..., by MLconf
Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding, or hire a PhD who will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday's data may not be the best option, and new data sources and types are made available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to the local optimization methods favored by typical machine learning applications, and discuss why these methods can create better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens..., by Databricks
Anomaly detection has numerous applications in a wide variety of fields. In banking, with ever growing heterogeneity and complexity, the difficulty of discovering deviating cases using conventional techniques and scenario definitions is on the rise. In our talk, we'll present an outline of Swedbank's ways of constructing and leveraging scalable pipelines based on Spark and TensorFlow, in combination with an in-house tailor-made platform, to develop, deploy and monitor deep anomaly detection models. In summary, this talk will present Swedbank's approach to building, unifying and scaling an end-to-end solution using large amounts of heterogeneous, imbalanced data. The talk will include sections on the following topics: feature engineering (transactions2vec); anomaly detection and its applications in banking; deep anomaly detection methods (Deep SVDD and generative adversarial networks), with a model overview and code snippets in the TensorFlow Estimator API; and model deployment, with an overview of how the different puzzle pieces outlined above are put together and operationalized to create an end-to-end deployment.
We explain various kinds of bad memory utilization patterns in Java applications, present a tool to efficiently detect them, and give a number of common solutions to these problems.
Update: Social Harvest is going open source, see http://www.socialharvest.io for more information.
My MongoSV 2011 talk about implementing machine learning and other algorithms in MongoDB. With a little real-world example at the end about what Social Harvest is doing with MongoDB. For more updates about my research, check out my blog at www.shift8creative.com
Online Machine Learning: introduction and examples, by Felipe
In this talk I introduce the topic of Online Machine Learning, which deals with techniques for doing machine learning in an online setting, i.e. where you train your model a few examples at a time, rather than using the full dataset (off-line learning).
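A minimal sketch of that online setting with scikit-learn's `SGDClassifier` (a library choice assumed here for illustration; the talk is not tied to it), where `partial_fit` updates the model on each small batch as it arrives instead of refitting on the full dataset:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared up front for partial_fit

# Simulate a stream: each iteration delivers a few new labelled examples
for _ in range(200):
    X = rng.normal(size=(4, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a simple linear concept to learn
    clf.partial_fit(X, y, classes=classes)   # incremental update, no full dataset kept

print(clf.predict([[2.0, 2.0], [-2.0, -2.0]]))
```

Because only the current mini-batch is ever in memory, the same loop works whether the stream lasts for seconds or for months.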
Python is the language of choice for data analysis.
The aim of this slide deck is to provide a learning path for people new to Python for data analysis, with a comprehensive overview of the steps you need to take to use Python for data analysis.
Data Science With Python | Python For Data Science | Python Data Science Cour..., by Simplilearn
This Data Science with Python presentation will help you understand what Data Science is, the basics of Python for data analysis, why to learn Python, how to install Python, Python libraries for data analysis, exploratory analysis using Pandas, an introduction to series and dataframes, the loan prediction problem, data wrangling using Pandas, building a predictive model using Scikit-learn, and implementing a logistic regression model in Python. The aim of this video is to provide comprehensive knowledge to beginners who are new to Python for data analysis, with an overview of the basic concepts you need to learn to use Python for data analysis. Now, let us understand how Python is used in Data Science for data analysis.
This Data Science with Python presentation will cover the following topics:
1. What is Data Science?
2. Basics of Python for data analysis
- Why learn Python?
- How to install Python?
3. Python libraries for data analysis
4. Exploratory analysis using Pandas
- Introduction to series and dataframe
- Loan prediction problem
5. Data wrangling using Pandas
6. Building a predictive model using Scikit-learn
- Logistic regression
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you'll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with Python certification training course. With Simplilearn's Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques.
Learn more at: https://www.simplilearn.com
Python for Data Science: A Comprehensive Guide, by priyanka rajput
To sum up, Python's popularity in data science is undeniable. Its simplicity, extensive library ecosystem, and community support make it the best option for data analysts and scientists. This thorough guide has highlighted the essential Python tools and best practices, enabling data enthusiasts to succeed in this fast-paced industry.
Big Data Analytics (ML, DL, AI) hands-on, by Dony Riyanto
These are supplementary slides to the Big Data Analytics introduction material (in the next file), which take us through hands-on work with machine/deep learning, big data (batch/streaming), and AI using TensorFlow.
Python is an interpreted, high-level, general-purpose programming language.
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusion and supporting decision-making.
A Hands-on Intro to Data Science and R Presentation.ppt, by Sanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
ProjectPro offers a hands-on approach to mastering machine learning and data science through 150+ solved end-to-end deployable machine learning and data science projects. They also provide 2000+ FREE data science code examples that can help one master the foundations of basic data science and machine learning concepts.
ProjectPro offers Solved End-to-End, Ready to Deploy, Enterprise-Grade Big Data, and Data Science Projects for Reuse and Upskilling. Each project solves a real business problem end-to-end and comes with solution code, explanation videos, cloud lab, and tech support.
Talk given at first OmniSci user conference where I discuss cooperating with open-source communities to ensure you get useful answers quickly from your data. I get a chance to introduce OpenTeams in this talk as well and discuss how it can help companies cooperate with communities.
Opendatabay - Open Data Marketplace.pptx, by Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
2. Outline
• Some Training Pics
• Why learn Python for Machine Learning?
• Python Libraries for Machine Learning
• Machine Learning key phases
• Series and Data-frames
• Case Study: Loan Prediction Problem
• Python/Predictive model in data analytics
• Main Resources
6. Why learn Python for Machine Learning?
• Python has gathered a lot of interest recently as a choice of language for data analysis. I had compared it against SAS & R some time back. Here are some reasons which go in favour of learning Python:
• Open source – free to install
• Awesome online community
• Very easy to learn
• Can become a common language for data science and the production of web-based analytics products.
7. Python Libraries for Machine Learning
• NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
• SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transforms, linear algebra, optimization and sparse matrices.
• Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in IPython Notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you omit the inline option, Pylab converts the IPython environment into one very similar to Matlab. You can also use LaTeX commands to add math to your plot.
• Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data science community.
8. • Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
• Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
• Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
• Bokeh for creating interactive plots, dashboards and data applications in modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
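As a minimal illustration of the scikit-learn workflow mentioned above (fit a classifier, then evaluate it on held-out data, using a built-in toy dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a built-in dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a classifier and score it on the held-out data
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # held-out accuracy, typically above 0.9 here
```

The same fit/predict/score pattern carries over to the regression, clustering and dimensionality-reduction estimators listed above.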
9. • Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
• Scrapy for web crawling. It is a very useful framework for extracting specific patterns of data. It can start at a website's home URL and then dig through the web pages within the website to gather information.
• SymPy for symbolic computation. It has wide-ranging capabilities, from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the results of computations as LaTeX code.
• Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code. You will find subtle differences from urllib2, but for beginners Requests might be more convenient.
10. Machine Learning key phases
• We will take you through the 3 key phases:
• Data Exploration – finding out more about the data we have
• Data Munging – cleaning the data and playing with it to make it better suit statistical modeling
• Predictive Modeling – running the actual algorithms and having fun
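A one-line taste of the Data Munging phase (the column name and values below are hypothetical):

```python
import numpy as np
import pandas as pd

# A column with a missing value, as often found during data exploration
df = pd.DataFrame({"LoanAmount": [100.0, np.nan, 150.0]})

# Munging: replace the missing value with the column mean before modeling
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())
print(df["LoanAmount"].tolist())  # [100.0, 125.0, 150.0]
```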
11. Python math and cmath libs
• math provides access to the mathematical functions defined by the C
standard.
• These functions cannot be used with complex numbers;
• cmath
• It provides access to mathematical functions for complex numbers.
The functions in this module accept integers, floating-point numbers
or complex numbers as arguments. They will also accept any Python
object that has either a __complex__() or a __float__() method:
12. Series and Dataframes
• Numpy and Scipy Documentation
• Introduction to Series and Dataframes
• Series can be understood as a 1 dimensional labelled / indexed array.
You can access individual elements of this series through these labels.
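A quick sketch of label-based access on a Series (the labels and values here are illustrative, not from the loan dataset):

```python
import pandas as pd

# A Series is a 1-D labelled array: values plus an index of labels
incomes = pd.Series([4500, 3000, 6100],
                    index=['applicant_a', 'applicant_b', 'applicant_c'])

print(incomes['applicant_b'])  # access by label  -> 3000
print(incomes.iloc[0])         # access by position -> 4500
```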
13. Practice data set – Loan Prediction Problem
• Steps:
• Step 1: Installation
• Install ipython
• Install pandas
• Install numpy
• Install matplotlib
• Then move on to exploration.
14. Practice data set – Loan Prediction Problem
• Step 2:- begin with exploration
• To begin, start the IPython interface in inline Pylab mode by typing the following at your terminal /
Windows command / PyDev (Eclipse) prompt:
• >>> ipython notebook --pylab=inline
• (On newer installations, use jupyter notebook and the %matplotlib inline magic instead.)
• Importing libraries and the data set:
• import pandas as pd
• import numpy as np
• import matplotlib.pyplot as plt
• df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv") # Reading the dataset into a dataframe using Pandas
• df.head(10)
• df.describe()
• df['Property_Area'].value_counts()
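Since the training CSV is not bundled here, a toy stand-in shows what `value_counts()` returns (the category values are illustrative):

```python
import pandas as pd

# Illustrative stand-in for the loan training data
df = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Urban',
                                     'Semiurban', 'Urban']})

# Frequency of each category, sorted by count descending
counts = df['Property_Area'].value_counts()
print(counts)   # Urban: 3, Rural: 1, Semiurban: 1
```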
15. Practice data set – Loan Prediction Problem
• Step 3 :- Distribution analysis
• Lets start by plotting the histogram of ApplicantIncome using the following commands:
• df['ApplicantIncome'].hist(bins=50)
• Next, we look at box plots to understand the distributions. A box plot for ApplicantIncome can be plotted by:
• df.boxplot(column='ApplicantIncome')
• Categorical variable analysis
• temp1 = df['Credit_History'].value_counts(ascending=True)
• temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], aggfunc=lambda x: x.map({'Y':1,'N':0}).mean())
• print('Frequency Table for Credit History:')
• print(temp1)
• print('\nProbability of getting loan for each Credit History class:')
• print(temp2)
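The same `value_counts` / `pivot_table` pattern can be run end-to-end on a few illustrative records (the data below is made up, not the real training set):

```python
import pandas as pd

# Illustrative records: credit history flag and loan approval outcome
df = pd.DataFrame({'Credit_History': [1, 1, 1, 0, 0],
                   'Loan_Status':    ['Y', 'Y', 'N', 'N', 'N']})

# Frequency of each credit-history class, least common first
temp1 = df['Credit_History'].value_counts(ascending=True)

# Approval rate per class: map Y/N to 1/0, then average within each group
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print(temp1)
print(temp2)   # class 0 -> 0.0, class 1 -> 0.666...
```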
16. Practice data set – Loan Prediction Problem
• Using matplotlib for plotting graph
• import matplotlib.pyplot as plt
• fig = plt.figure(figsize=(8,4))
• ax1 = fig.add_subplot(121)
• ax1.set_xlabel('Credit_History')
• ax1.set_ylabel('Count of Applicants')
• ax1.set_title("Applicants by Credit_History")
• temp1.plot(kind='bar')
• ax2 = fig.add_subplot(122)
• temp2.plot(kind = 'bar')
• ax2.set_xlabel('Credit_History')
• ax2.set_ylabel('Probability of getting loan')
• ax2.set_title("Probability of getting loan by credit history")
17. Practice data set – Loan Prediction Problem
• These two plots can also be visualized by combining them in a stacked chart:
• temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
• temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
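`pd.crosstab` simply counts each (Credit_History, Loan_Status) pair; on the same illustrative records as above:

```python
import pandas as pd

# Illustrative data; the real tutorial uses the loan training set
df = pd.DataFrame({'Credit_History': [1, 1, 1, 0, 0],
                   'Loan_Status':    ['Y', 'Y', 'N', 'N', 'N']})

# Rows: Credit_History values; columns: Loan_Status values; cells: counts
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
print(temp3)
```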
• 4. Data Munging in Python : Using Pandas
• Check missing values in the dataset
• df.apply(lambda x: sum(x.isnull()),axis=0)
• df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
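The missing-value check and mean fill can be sketched on a toy frame (values illustrative; assignment is used instead of `inplace=True`, which newer pandas discourages):

```python
import pandas as pd
import numpy as np

# Toy frame with one gap in LoanAmount
df = pd.DataFrame({'LoanAmount': [100.0, np.nan, 160.0, 110.0]})

# Count missing cells per column
print(df.apply(lambda x: x.isnull().sum(), axis=0))   # LoanAmount: 1

# Fill the gap with the mean of the observed values
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
print(df['LoanAmount'].tolist())
```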
18. Practice data set – Loan Prediction Problem
• 5. Building a Predictive Model in Python
• This can be done using the following code:
• from sklearn.preprocessing import LabelEncoder
• var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
• le = LabelEncoder()
• for i in var_mod:
• df[i] = le.fit_transform(df[i])
• df.dtypes
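On a single toy column (values illustrative), the LabelEncoder loop reduces to:

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Toy categorical column standing in for the loan data
df = pd.DataFrame({'Education': ['Graduate', 'Not Graduate', 'Graduate']})

le = LabelEncoder()
# fit_transform assigns integer codes in sorted (alphabetical) label order
df['Education'] = le.fit_transform(df['Education'])
print(df['Education'].tolist())   # [0, 1, 0]
print(list(le.classes_))          # ['Graduate', 'Not Graduate']
```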
• Python is really a great tool, and is becoming an increasingly popular language among data scientists.
19. Python with JSON,csv with Pandas
• Best blogs
• https://www.dataquest.io/blog/python-json-tutorial/
• http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-
data-journalists/
• https://automatetheboringstuff.com/chapter14/
20. Python/Predictive model in data analytics
• Predictive modeling is a process that uses data
mining and probability to forecast outcomes. Each model is made up
of a number of predictors, which are variables that are likely to
influence future results.
• sklearn.preprocessing.LabelEncoder()
• It converts pandas categorical (string) columns into the integer codes that scikit-learn estimators expect.
21. Python/Perfect way to build a Predictive Model
• Broadly, it can be divided into 4 parts.
• Descriptive analysis on the Data – 50% time
• Data treatment (Missing value and outlier fixing) – 40% time
• Data Modelling – 4% time
• Estimation of performance – 6% time
22. Python/Perfect way to build a Predictive Model
• Descriptive Analysis
• Descriptive statistics is the initial stage of analysis, used to describe and
summarize data. The availability of large amounts of data and very efficient
computational methods has strengthened this area of statistics. The steps
involved are:
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
23. Python/Perfect way to build a Predictive Model
• Data treatment:
• An important aspect of the statistical treatment of data is the handling of
errors. Common methods to treat missing values:
• Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion.
• Mean/ Mode/ Median Imputation: Imputation is a method to fill in the
missing values with estimated ones.
• Prediction Model: A prediction model is one of the more sophisticated methods for
handling missing data. Here, we create a predictive model to estimate
values that will substitute for the missing data.
• KNN Imputation: In this method, the missing values of an attribute are
imputed using a given number of records that are most similar to the
record whose values are missing.
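The two simplest treatments above, listwise deletion and mean imputation, can be sketched on a toy frame (values illustrative):

```python
import pandas as pd
import numpy as np

# Toy data with one gap in each column
df = pd.DataFrame({'income': [3000.0, np.nan, 5200.0],
                   'loan':   [120.0, 90.0, np.nan]})

# Listwise deletion: drop any row containing a missing cell
dropped = df.dropna()
print(dropped)   # only the first row survives

# Mean imputation: replace each gap with its column's mean
filled = df.fillna(df.mean(numeric_only=True))
print(filled)
```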
24. Python/Perfect way to build a Predictive Model
• Data Modelling: For larger datasets, you can consider running a
Random Forest. This will take the maximum amount of time.
• Estimation of Performance: Measure the model's performance; k-fold
cross-validation with k=7 is highly effective for an initial estimate. This
finally takes 1-2 minutes to execute and document.
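A minimal sketch of the k=7 fold split, using scikit-learn's `KFold` on an illustrative 70-sample array:

```python
import numpy as np
from sklearn.model_selection import KFold

# 70 illustrative samples split into k=7 folds
X = np.arange(70).reshape(-1, 1)
kf = KFold(n_splits=7, shuffle=True, random_state=1)

# Each iteration yields train/test index arrays; here we check fold sizes
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]
print(fold_sizes)   # seven folds of 10 samples each
```

Each model would be fit on the train indices and scored on the test indices, and the seven scores averaged for the performance estimate.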
25. Python/time-series-forecast-study
• The problem is to predict the number of monthly sales of champagne for the Perrin
Freres label (named for a region in France).
• The dataset provides the number of monthly sales of champagne from January 1964 to
September 1972, or just under 10 years of data.
• Download the dataset as a CSV file and place it in your current working directory with the filename "champagne.csv".
• The steps of this project that we will work through are as follows:
• Environment.
• Problem Description.
• Test Harness.
• Persistence.
• Data Analysis.
• ARIMA Models.
• Model Validation.
26. Python/time-series-forecast-study
• 3. Test Harness
• We must develop a test harness to investigate the data and evaluate candidate models.
• This involves two steps:
• Defining a Validation Dataset.
• Developing a Method for Model Evaluation.
• 3.1 Validation Dataset
• The code below will load the dataset as a Pandas Series and split into two, one for model development
(dataset.csv) and the other for validation (validation.csv).
• from pandas import read_csv
• series = read_csv('champagne.csv', header=0, index_col=0, parse_dates=True).squeeze('columns') # Series.from_csv was removed in pandas 1.0
• split_point = len(series) - 12
• dataset, validation = series[0:split_point], series[split_point:]
• print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
• dataset.to_csv('dataset.csv')
• validation.to_csv('validation.csv')
27. Python/time-series-forecast-study
• The specific contents of these files are:
• dataset.csv: Observations from January 1964 to September 1971 (93
observations)
• validation.csv: Observations from October 1971 to September 1972 (12
observations)
• 3.2 Model Evaluation
• Model evaluation will only be performed on the data
in dataset.csv prepared in the previous section.
• Model evaluation involves two elements:
• Performance Measure.
• Test Strategy.
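The test strategy in this style of study is walk-forward validation scored with RMSE, using a persistence (last-value) model as the baseline. A sketch with made-up monthly values (the real study would load dataset.csv):

```python
import numpy as np

# Illustrative monthly sales; not the actual champagne data
data = np.array([266.0, 146.0, 183.0, 119.0, 180.0, 169.0,
                 232.0, 225.0, 193.0, 123.0, 337.0, 186.0])

# Walk-forward validation: train on the first half, then step through
# the test set one observation at a time
train_size = int(len(data) * 0.5)
train, test = data[:train_size], data[train_size:]
history = list(train)

predictions = []
for obs in test:
    predictions.append(history[-1])  # persistence: predict the last value seen
    history.append(obs)              # then reveal the true observation

# Performance measure: root mean squared error over the test window
rmse = np.sqrt(np.mean((np.array(predictions) - test) ** 2))
print('Persistence RMSE: %.3f' % rmse)
```

Any candidate ARIMA model must beat this baseline RMSE to be worth keeping.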
28. Python/Recommender systems
• Movie recommend
• https://cambridgespark.com/content/tutorials/implementing-your-
own-recommender-systems-in-Python/index.html
• Movie recommend
• Matrix factorization recommender
• https://beckernick.github.io/matrix-factorization-recommender/
29. Python/Main resources
• I definitely recommend Harvard CS109 (cs109.org). There are a few
different courses, but the one I used when I was learning was Dataquest:
• https://www.dataquest.io/blog/pandas-cheat-sheet/
• https://www.dataquest.io/blog/data-science-portfolio-project/
• Loan prediction
• https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-
learn-data-science-python-scratch-2/