1. DATA SCIENCE: TOOLS,
TECHNIQUES and APPLICATIONS
Dr. Meenakshi Srivastava
Dr. Ranjana Rajnish
Assistant Professor
Amity University
msrivastava@lko.amity.edu
2. What and Why?
• WHAT is Data Science?
• WHY is Data Science Important?
• WHY are Data Scientists in High Demand?
• WHY Data Science in Academia?
3. Applications of Data Science:
Some Examples
I. HEALTHCARE
• Survival analysis
– Analyze survival statistics for different patient attributes (age, blood type, gender, etc.) and treatments.
• Medication (dosage) effectiveness
– Analyze the effects of administering different types and dosages of medication for a disease.
• Re-admission risk
– Predict the risk of re-admission based on patient attributes, medical history, diagnosis, and treatment.
4. II. MARKETING
• Predicting Lifetime Value (LTV)
What for: if you can predict the characteristics of high-LTV customers, this supports customer segmentation, identifies up-sell opportunities, and supports other marketing initiatives.
• Demand Forecasting
• Demand Forecasting
5. III. LOGISTICS
• How many of each item will customers need, and where will they need them?
(Enables lean inventory and prevents out-of-stock situations.)
7. What is Data Science?
Data Science is a broad umbrella term for applying scientific methods, mathematics, statistics, etc. to data sets in order to extract KNOWLEDGE and INSIGHT.
9. THE DATA SCIENCE UNICORN
• In medieval times, a Unicorn was a
rare and mythical creature with
great powers.
• In today’s world, a similar mythical
creature is a Data Science Unicorn,
who knows equally well the
Technology, Data Science, and
Business.
• Such a professional is the most
valuable resource of any data
science team.
• Many data professionals are
experts in the first two areas –
technology and data science, but
lack business/domain skills.
11. How To Become A Data Science
UNICORN?
Data Science UNICORN: Do Whatever Is
Necessary To Extract Value from the Data
• Statistics: Take a sample (data) and answer questions about the process that
produced it: Is it a normal distribution? Estimate its mean.
• Machine Learning: Take a sample (data) and build a model to answer
questions about future samples.
– Given a sample of named faces, design a model for naming a new, unseen
face.
• Data Mining: Mine a huge data store for interesting patterns or
relationships.
– Given a DB of transactions, apply tools and algorithms to find frequent product
bundles.
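The statistics-versus-machine-learning distinction above can be sketched in a few lines of Python. This is a toy illustration with made-up sample values, not a real analysis:

```python
import statistics

# Statistics: take a sample, answer questions about the process behind it.
sample = [4.8, 5.1, 5.0, 4.9, 5.2]
mean_estimate = statistics.mean(sample)   # estimate of the process mean
spread = statistics.stdev(sample)         # sample standard deviation

# Machine Learning (in miniature): use the sample to predict future,
# unseen observations. Here the "model" is just the sample mean.
def predict_next():
    return mean_estimate

print(mean_estimate, round(spread, 3), predict_next())
```

The statistical question looks backward at the process that produced the sample; the machine-learning question looks forward to the next observation.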
12. Machine Learning
Machine Learning refers to a computer's ability to learn from a dataset and adapt accordingly without having been explicitly programmed to do so.
Examples: Regression, Decision Tree, Neural Network, etc.
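"Learning without being explicitly programmed" can be made concrete with a decision stump, the simplest one-split decision tree. The sketch below, with made-up numbers, finds the split threshold from the data rather than hard-coding it:

```python
# Toy training data: a single numeric feature and a 0/1 class label.
heights = [150, 155, 160, 178, 182, 185]
labels  = [0,   0,   0,   1,   1,   1]

def fit_stump(xs, ys):
    """Pick the threshold that best separates the two classes."""
    best_t, best_acc = None, -1.0
    for t in xs:
        preds = [1 if x >= t else 0 for x in xs]
        acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

threshold = fit_stump(heights, labels)
print(threshold)  # the split point was learned from the data, not hand-coded
```

The program never states where the classes divide; the rule emerges from the dataset, which is the essence of the definition above.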
13. Data Mining
• To most people, data mining goes something like this: tons of data are collected, then quant wizards work their arcane magic, and then they know all of this amazing stuff.
• BUT WHAT DO THEY DO?
• They can tell us that "one of these things is not like the other", or they can show us categories and then sort things into pre-determined categories/classes.
15. COMPUTATIONAL TOOLS
• With the help of existing computational tools, you can very easily analyze your data.
• No Programming Skills Required.
• No in-depth knowledge of Statistics, Machine Learning, Data Mining, etc. is required.
16. Common Computational Tools
• RapidMiner (Open Source and Free):
This is very popular since it is ready-made, open-source, no-coding-required software that delivers advanced analytics. Written in Java, it incorporates multifaceted data mining functions such as data preprocessing, visualization, and predictive analysis, and it can be easily integrated with WEKA and the R tool to run models from scripts written in those two.
17. • WEKA (Open Source & Free):
This is a Java-based tool, free to use. It includes visualization along with predictive analysis and modeling techniques: clustering, association, regression, and classification.
18. • R Programming Tool (Open Source and Free):
This is written in C and FORTRAN and lets data miners write scripts in a full programming language. Hence, it is used to build statistical and analytical software for data mining. It supports graphical analysis, both linear and nonlinear modeling, classification, clustering, and time-series analysis.
19. • Python-based Orange and NLTK:
Python is very popular due to its ease of use and powerful features. Orange is an open-source tool written in Python with useful data analytics, text analysis, and machine-learning features embedded in a visual programming interface. NLTK, also written in Python, is a powerful language-processing and data mining toolkit whose data mining, machine learning, and data scraping features can easily be built up for customized needs.
20. • Rattle (Open Source and Free):
Rattle is a GUI tool built on the R statistical programming language. It exposes the statistical power of R by providing considerable data mining functionality. Rattle has an extensive, well-developed UI, and an inbuilt log tab that generates the R code corresponding to any activity performed in the GUI.
21. • DataMelt (Open Source and Free):
DataMelt, also known as DMelt, is a computation and visualization environment. It provides an interactive framework for data analysis and visualization, and is designed mainly for engineers, scientists, and students.
22. How Computational Tools Work
• Methods developed using Statistics, Machine Learning, and Data Mining are built in.
• These pre-developed methods can be easily applied to your data set.
• They provide in-built support for data visualization.
24. WHAT ALL CAN I DO WITH MY DATA?
• Regression:
In statistics, regression is a classic technique to identify the scalar relationship between two or more variables by fitting a straight line to the variable values.
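Fitting that straight line by least squares takes only a few lines of Python. The data points below are made up for illustration:

```python
# Toy data: y grows roughly linearly with x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

# Closed-form least-squares fit of y = a*x + b.
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(round(a, 2), round(b, 2))  # slope and intercept of the fitted line
```

The slope `a` is the scalar relationship the slide refers to: how much y changes per unit of x.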
25. Cont…
• Classification:
This is a machine-learning technique for assigning labels to observations, learned from a set of labeled training examples. With it, we can classify observations into one or more classes. Likelihood of sales, online fraud detection, and cancer classification (in medical science) are common applications of classification problems. Google Mail uses this technique to classify e-mails as spam or not.
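A drastically simplified spam/ham classifier can illustrate the idea: label a message "spam" if its words overlap more with spam training examples than with legitimate ones. This is a toy sketch with invented messages, far cruder than what a real mail filter does:

```python
# Tiny labeled training set (made-up messages).
spam_train = ["win money now", "free money offer"]
ham_train  = ["meeting at noon", "project status update"]

spam_words = set(" ".join(spam_train).split())
ham_words  = set(" ".join(ham_train).split())

def classify(message):
    """Label by which training vocabulary the message overlaps more."""
    words = set(message.split())
    return "spam" if len(words & spam_words) > len(words & ham_words) else "ham"

print(classify("free money"))      # overlaps the spam vocabulary
print(classify("status meeting"))  # overlaps the ham vocabulary
```

The key point matches the slide: the labeling rule is learned from training examples, not written by hand.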
26. • Clustering:
This technique is all about organizing similar items into groups from a given collection of items. Common applications include user segmentation, image compression, market segmentation, social network analysis, organizing computer clusters, and astronomical data analysis.
• Google News
uses this technique to group similar news items into the same category.
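One classic clustering algorithm is k-means. The sketch below runs it on one-dimensional toy values with k = 2: assign each point to its nearest center, recompute the centers, and repeat:

```python
# Toy data: two obvious groups of similar values.
points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.9]

c1, c2 = points[0], points[3]          # initial cluster centers
for _ in range(10):                    # a few refinement rounds
    g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
    g2 = [p for p in points if abs(p - c1) >  abs(p - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)

print(sorted(g1), sorted(g2))  # similar items end up grouped together
```

No labels are given in advance; the groups emerge from similarity alone, which is what distinguishes clustering from classification.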
27. Cont…
• Recommendation:
Recommendation algorithms power recommender systems, which are among the most immediately recognizable machine learning applications in use today. Web content recommendations may include similar websites, blogs, videos, or related content. Recommending online items is also helpful for cross-selling and up-selling.
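A minimal "customers who bought X also bought Y" recommender can be built from co-occurrence counts in purchase histories. The baskets below are invented for illustration; real systems use far richer signals:

```python
# Toy purchase histories, one set of items per customer.
purchases = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"phone", "charger"},
]

def recommend(item):
    """Return the item most often bought together with `item`."""
    counts = {}
    for basket in purchases:
        if item in basket:
            for other in basket - {item}:
                counts[other] = counts.get(other, 0) + 1
    return max(counts, key=counts.get) if counts else None

print(recommend("laptop"))  # the most frequently co-purchased item
```

Suggesting the co-purchased item at checkout is exactly the cross-selling use the slide mentions.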
28. • Association Rules:
This data mining technique helps to find associations between two or more items and discovers hidden patterns in the data set.
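The first step of association-rule mining is finding frequent itemsets. The sketch below counts item pairs across toy transactions and keeps those above a minimum support, in the spirit of (but much simpler than) the Apriori algorithm:

```python
from itertools import combinations

# Toy market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
min_support = 2  # a pair must appear in at least 2 transactions

pair_counts = {}
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1

frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)  # the "hidden pattern": pairs that often occur together
```

From each frequent pair one can then derive rules such as "bread → milk" and score them by confidence.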
29. • Outlier Detection:
This data mining technique refers to identifying data items in the dataset that do not match an expected pattern or expected behavior. It can be used in a variety of domains, such as intrusion detection and fraud or fault detection. Outlier detection is also called Outlier Analysis or Outlier Mining.
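A simple statistical outlier check flags values lying more than a chosen number of standard deviations from the mean (a z-score rule). Toy data, with one value that clearly breaks the expected pattern:

```python
import statistics

# Toy measurements; 48 does not match the expected behavior of the rest.
values = [10, 11, 9, 10, 12, 10, 11, 48]

mean = statistics.mean(values)
sd = statistics.stdev(values)
outliers = [v for v in values if abs(v - mean) / sd > 2]
print(outliers)  # values more than 2 standard deviations from the mean
```

In a fraud-detection setting, such flagged values would be the transactions sent on for closer inspection.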
30. • Prediction:
Prediction uses a combination of the other data mining techniques, such as trend analysis, sequential patterns, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
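A minimal example of "analyzing past events to predict a future one" is trend extrapolation: fit a straight-line trend to past values and project it one step ahead. The history below is made up:

```python
# Toy history of past values (e.g., monthly sales figures).
history = [100, 110, 120, 130, 140]

# Least-squares trend line over time steps 0..n-1, extrapolated to step n.
n = len(history)
xs = list(range(n))
mx = sum(xs) / n
my = sum(history) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, history)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
forecast = slope * n + intercept
print(forecast)  # predicted next value, continuing the past trend
```

This combines the regression technique from earlier with the time ordering of the events, which is the combination the slide describes.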
31. ADVANTAGES
Use computational tools to predict the behavior of your compound.
Use computational tools to analyze the same data with a different vision.
Cost Cutting.
Time Saving.
A clear, precise view of your research.