2. Introduction
• Data mining is the process of extracting information
from large data sets using algorithms and techniques
drawn from the fields of statistics, machine learning
and database management systems.
• “Mining” means finding something that already exists.
• Data mining can therefore be defined as the process of
identifying hidden patterns, relationships and trends
within data.
• Traditional methods often involve:
1) manual work
2) interpretation of data.
3. • Data mining, popularly called knowledge discovery in
large data, enables organizations to make calculated
decisions by
• assembling
• accumulating
• analyzing and
• accessing corporate data.
4. [Figure: the knowledge discovery process. Labels: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding.]
5. • The scope of pharmaceutical applications is large and
may involve drug manufacturing processes as well as
data processing.
• Data processing and analysis is a key area in the
pharmaceutical industry.
• The vision that data mining can help the industry
achieve: pharmaceutical companies deliver drugs and
develop test kits (including genetic tests) and computer
programs that get the best drug to each patient.
7. Pharmaceutical companies can also apply data mining
methods to large volumes of genomic data to predict how
a patient’s genetic makeup determines his or her response
to a drug therapy.
Genome: the complete set of chromosomal and
extrachromosomal genes of an organism, a cell, an
organelle or a virus; the complete DNA component of an
organism. Genomic data describe this set.
10. Decision Support System (DSS) tools.
• Decision support systems (DSS) are defined as
interactive computer-based systems intended to help
decision makers use data and models to identify
problems, solve problems and make decisions.
11. DATA MINING TECHNIQUES.
• Many organizations generate mountains of data about
newly discovered drugs, their performance reports, etc.
• This data is a strategic resource.
• Making the most of this strategic resource will
improve the quality of the pharma industry.
12. • The data mining process has six important steps:
1. Problem Definition.
2. Knowledge acquisition.
3. Data selection.
4. Data Preprocessing.
5. Analysis and Interpretation.
6. Reporting and Use.
13. The data mining process can also be identified as:
1. Definition of the objectives of the analysis.
2. Selection and pretreatment of the data.
3. Exploratory analysis.
4. Specification of the statistical methods.
5. Analysis of the data.
6. Evaluation and comparison of methods.
7. Interpretation of the chosen model.
14. 1. Definition of the objectives of the analysis.
This means understanding the project objectives and
requirements from a business perspective, then
converting this knowledge into a data mining
problem definition with a preliminary plan
designed to achieve the objectives.
15. Relevant data sources for the pharma industry are:
•clinical data (patient data, pharmaceutical data,
medical treatments, length of stay);
•administrative data (staff skills, overtime, nursing
care hours, staff sick leave);
• financial data (treatment costs, drug costs, staff
salaries, accounting, cost-effectiveness studies); and
• organizational data (room occupation, facilities,
equipment).
16. Data mining is used to support:
•The clinicians at the point of care delivery;
•The controlling of clinical treatment pathways;
•The administrative and management tasks; and
•Efficient management of organizational and
financial data.
17. Associations, Mining Frequent
Patterns.
• These methods identify rules of affinity among
collections of items.
• Rules of affinity: relationships among data items.
• Frequent patterns are those that occur often in the
data being mined.
• Applications of association rules include:
• market basket analysis
• attached mailing in direct marketing
• fraud detection
• department store floor/shelf planning, etc.
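A minimal sketch of how rules of affinity are scored, assuming the usual support and confidence measures. The baskets and item names are hypothetical, and the helper names `support`, `confidence` and `frequent_pairs` are illustrative only.

```python
from itertools import combinations

# Hypothetical prescription "baskets"; item names are illustrative only.
baskets = [
    {"aspirin", "antacid"},
    {"aspirin", "antacid", "vitamin_c"},
    {"aspirin", "vitamin_c"},
    {"antacid"},
    {"aspirin", "antacid"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def frequent_pairs(baskets, min_support):
    """All item pairs whose support is at least min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for a, b in combinations(items, 2):
        s = support({a, b}, baskets)
        if s >= min_support:
            result[(a, b)] = s
    return result

pairs = frequent_pairs(baskets, min_support=0.5)
rule_conf = confidence({"aspirin"}, {"antacid"}, baskets)
```

Here only the pair (antacid, aspirin) passes the 0.5 support threshold, and the rule "aspirin implies antacid" holds in three of the four baskets that contain aspirin. Checking every pair is fine for a handful of items; real tools prune the search, for example with the Apriori algorithm.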
18. Pharma examples:
•Association of diseases with drugs
•Association and analysis of staff movements
•Tracking how physicians adopt drugs, based on
customers’ prescriptions
19. Classification And Prediction.
• Classification and prediction are two data analysis
techniques used to describe data classes and to
predict future data classes.
• E.g. a credit card company whose customers’ credit
history is known can classify each customer record as
Good, Medium, or Poor.
20. •Predicting consumer behavior
•Predicting the likelihood of success in a drug
adoption process
•Predicting the percentage accuracy in the performance
of a drug
•Classifying historical health records
•Predicting which types of drugs are most likely to be
retained, most likely to be dropped, and most likely to
have their composition changed.
21. Predicting pharma product behavior and attitude
•Predicting demand projections by seasonal variations
•Predicting the performance progress of segments
throughout the performance period
•Identifying the best profile for different drugs
•Classifying trends in how successful and unsuccessful
patient historical records move through the organization
•Categorizing drugs, diseases and patients.
22. • Classification schemes based on decision
trees and neural networks are very
useful in the pharma industry.
23. • Decision trees: a decision tree is a common knowledge
representation used for classification.
• In classification, one is given data about a specific
instance, and the decision tree predicts, based on that
data, which of two or more classes the instance
belongs to.
• Each instance contains data for multiple attributes.
• The tree is built from training instances: previously
acquired data that have already been sorted into class
labels.
• Building the tree means determining which tests best
divide the training instances into separate classes.
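One common way to decide which test "best divides" the instances is information gain, as used by ID3-style learners. The slides do not name a criterion, so this choice is an assumption. The sketch below picks only the root test, on hypothetical patient data with made-up attribute names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting the instances on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels):
    """The attribute whose test best divides the instances into classes."""
    return max(rows[0], key=lambda a: information_gain(rows, labels, a))

# Hypothetical training instances: did the patient respond to a drug?
rows = [
    {"age_group": "young", "smoker": "no"},
    {"age_group": "young", "smoker": "yes"},
    {"age_group": "old", "smoker": "no"},
    {"age_group": "old", "smoker": "yes"},
]
labels = ["responds", "no_response", "responds", "no_response"]

root_test = best_attribute(rows, labels)
```

In this toy data `smoker` separates the two classes perfectly while `age_group` tells us nothing, so `smoker` becomes the root test. A full tree repeats the same choice on each resulting subset.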
25. • Neural Networks
– Learn through training
– Resemble biological networks in structure
– Can produce very good predictions
– Are not easy to use or to understand
– Do not handle missing data well
26. Uses a Bayesian neural network
• The prior probability is the probability that any report
contains a reference to the adverse event.
• The posterior probability is the probability that a report
links the drug and the adverse event.
• Together they determine the “strength” of the link between
adverse event and drug (called the Information Component, or
IC).
• More complicated than it appears: a patient may
take multiple drugs, so which one caused the
adverse event?
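A sketch of the Information Component in its plain observed-versus-expected form, IC = log2(P(drug, event) / (P(drug) · P(event))). It reads the posterior as P(event | drug), which is an assumption about the slide's wording; IC is then log2(posterior / prior). The Bayesian smoothing of the counts is left out, and the report counts are hypothetical.

```python
import math

def information_component(n_both, n_drug, n_event, n_total):
    """log2 of observed vs. expected co-reporting of a drug and an adverse event.

    Plain observed/expected form; a full Bayesian treatment would also
    smooth the counts with priors, which is omitted here.
    """
    p_drug = n_drug / n_total
    p_event = n_event / n_total
    p_both = n_both / n_total
    return math.log2(p_both / (p_drug * p_event))

# Hypothetical report counts, for illustration only.
ic = information_component(n_both=40, n_drug=200, n_event=1000, n_total=10000)

# The same quantity seen as posterior vs. prior probability of the event.
prior = 1000 / 10000      # P(event) over all reports
posterior = 40 / 200      # P(event | drug)
ic_from_ratio = math.log2(posterior / prior)
```

A positive IC means the drug and the event are reported together more often than independence would predict; an IC near zero means no signal. With these counts the pair is reported twice as often as expected, so the IC is about 1.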
28. • Classification works on discrete, unordered data, while prediction
works on continuous data.
• E.g. discrete data: this data set shows a group of discrete data.
Music format    Number sold
CD albums       140
CD singles      70
Downloads       55
Vinyl           5
Total sales     270
• This is called discrete data because the units of measurement (for example,
CDs) cannot be split up; there is nothing between 1 CD and 2 CDs.
• E.g. continuous data:
Distance in miles: 0.1 0.2 0.6 1.1 1.2 1.8 2.0 2.7 3.4 4.6 6.2 8.0 12.1 14.2
• This data is called continuous because the scale of measurement, distance,
has meaning at all points between the numbers given; e.g. we can travel a
distance of 1.2, 1.85 or even 1.632 miles.
29. • Regression is often used, as it is a
statistical method for numeric
prediction.
• Primary emphasis should be placed on
the selection, measurement accuracy
and predictive efficiency of any
new drug discovery.
• Simple or multiple regression is
the basic prediction model that
enables a decision maker to forecast
the status of each criterion based on
predictor information.
• Neural network technology is also useful
in different areas of business.
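A minimal sketch of simple (one-predictor) regression by ordinary least squares. The dose and response figures are hypothetical, and `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: dose (mg) versus a measured response.
doses = [10, 20, 30, 40]
responses = [12.0, 22.0, 32.0, 42.0]

intercept, slope = fit_line(doses, responses)
predicted = intercept + slope * 50  # forecast for an unseen dose
```

Multiple regression follows the same idea with several predictors; in practice one would use a statistics package rather than hand-written sums.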
30. CLUSTERING.
• It is a method by which similar
records are grouped together.
• Clustering is often used to mean
segmentation.
• An organization can build a
hierarchy of classes that group
similar events.
• Using clustering, patients can be
grouped by age, name,
disease, etc.
• In business, clustering helps identify
groups of similar customers and
characterize them based
on purchasing patterns, etc.
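A sketch of grouping patients by a single numeric attribute (age) with k-means. The slides do not name a clustering algorithm, so k-means is an assumption, and the ages and starting centroids are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on one numeric attribute, from given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

ages = [22, 25, 27, 61, 64, 70]
centroids, clusters = k_means_1d(ages, centroids=[20, 80])
```

The two resulting groups separate the younger from the older patients. With several attributes the absolute difference would be replaced by a distance over scaled feature vectors.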
31. DATA MINING AND STATISTICS.
• The ability to build a successful
predictive model depends on past
data.
• Data mining is designed to learn from
past successes and failures in order to
predict what will happen
next (future prediction).
• The data mining tool checks the
statistical significance of the
predicted patterns and reports them.
32. The difference between Data Mining
and statistics
• Data mining automates a statistical process
that would otherwise require several tools.
• Statistical inference is assumption driven, in the
sense that a hypothesis is formed and tested
against data.
• Data mining, in contrast, is discovery driven:
the hypothesis is automatically
extracted from the given data.
33. Data Mining can answer analytical
questions such as:
• Which new molecules have been discovered, and
what issues surround them?
• What factors or combinations of factors directly
affect the drugs?
• What are the best and most outstanding drugs?
• Which drugs are likely to be retained?
• How can resources be allocated optimally to ensure
effectiveness and efficiency? etc.
34. • An intelligent text mining system could
provide a platform for extracting and
managing specific information at the entity
level.
• E.g. information pertaining to
• genes
• proteins
• diseases
• organisms
• chemical substances, etc. can be analytically
extracted for patterns.
35. It would also provide insights into interrelationships
such as
• protein-protein
• Gene-gene
• Protein-Chemical
• Gene-Disease and
• Drug-Drug interactions.
• Text mining can be applied to biomedical literature,
clinical documents and other medical literary sources
for data curation and database population in a semi-automated
manner.
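A toy sketch of entity-level text mining: look up known terms in each sentence and count which entities are mentioned together. The entity dictionary and the sentences are hypothetical, and co-occurrence is only a first hint of a relationship; a real system would use curated vocabularies and proper language processing.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity dictionary: term -> entity type.
ENTITIES = {
    "brca1": "gene",
    "tp53": "gene",
    "breast cancer": "disease",
    "tamoxifen": "drug",
}

def find_entities(sentence):
    """Dictionary terms that appear in the sentence as whole words."""
    text = sentence.lower()
    return sorted(t for t in ENTITIES if re.search(r"\b" + re.escape(t) + r"\b", text))

def co_occurrences(sentences):
    """Count how often two entities are mentioned in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        for pair in combinations(find_entities(sentence), 2):
            counts[pair] += 1
    return counts

# Illustrative sentences standing in for biomedical literature.
sentences = [
    "Mutations in BRCA1 are associated with breast cancer.",
    "Tamoxifen is used to treat breast cancer.",
    "BRCA1 and TP53 are both linked to breast cancer.",
]
pairs = co_occurrences(sentences)
```

Pairs with high counts, such as gene-disease or drug-disease pairs, become candidates for curation and database population.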
36. Applications Of Data Mining In
The Pharmaceutical Industry
• A lot of information is hidden in legacy
systems.
• With data mining, this information can be extracted.
• Most of the time this cannot be done directly
on the legacy systems, because they were not
built to answer unpredictable
questions.
37. • A user interface may be designed to accept all kinds
of information from the user (e.g. weight, sex, age,
foods consumed, reactions reported, dosage, length of
usage).
• Then, based upon the information in the databases
and the relevant data entered by the user, a list of
warnings or known reactions (accompanied by
probabilities) should be reported.
• Note that user profiles can contain large amounts of
information, and efficient and effective data mining
tools need to be developed to probe the databases for
relevant information.
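A sketch of the lookup described above, assuming the "probabilities" are simply relative frequencies among past reports that match the user's profile. The report tuples, field order and drug names are hypothetical.

```python
from collections import Counter

# Hypothetical past reports: (drug, sex, age_group, reaction or None).
reports = [
    ("drug_a", "F", "adult", "nausea"),
    ("drug_a", "F", "adult", None),
    ("drug_a", "F", "adult", "nausea"),
    ("drug_a", "F", "adult", "rash"),
    ("drug_a", "M", "adult", None),
    ("drug_b", "F", "adult", "headache"),
]

def warnings_for(drug, sex, age_group, reports):
    """Reactions seen in matching reports, with their relative frequencies."""
    matching = [r for r in reports if r[:3] == (drug, sex, age_group)]
    if not matching:
        return []
    counts = Counter(r[3] for r in matching if r[3] is not None)
    return [(reaction, n / len(matching)) for reaction, n in counts.most_common()]

warnings = warnings_for("drug_a", "F", "adult", reports)
```

With only a few matching reports such frequencies are unreliable, which is why the slides stress the need for larger and more complete databases.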
38. • Secondly, the patient's (anonymous) profile should
be recorded along with any adverse reactions
reported by the patient, so that future correlations
can be reported.
• Over time, the databases will become much larger,
and interaction data for existing medicines will
become more complete.
• The amount of existing pharmaceutical information
(pharmacological properties, dosages,
contraindications, warnings, etc.) is enormous;
however, this fact reflects the number of medicines
on the market, rather than an abundance of detailed
information about each product.
39. One of the major problems with pharmaceutical
data is a lack of information.
• A food and drug administration department
estimated that only about 1% of serious events are
reported to it.
• Fear of litigation may be a contributing factor;
however, most health care providers simply
don't have the time to fill out reports of
possible adverse drug reactions.
40. •Furthermore, it is expensive and time consuming
for pharmaceutical companies to perform a
thorough job of data collection, especially when
most of the information is not required by law.
•Finally, one should note that the food and drug
administration department does not require
manufacturers to test new medicines for potential
interactions.
41. Four stages of drug development
• Research finds new drugs
• Development tests and predicts drug behavior
• Clinical trials test the drug in humans
• Commercialization takes the drug and sells it to
likely consumers (doctors and patients).
43. 1) Clinical data analysis: evaluates and streamlines
large amounts of information.
Data mining helps to see trends, irregularities and
risks during product development and launch.
2) Marketing and sales analysis: the
identification of the most profitable products and
the allocation of marketing funds.
Data mining here helps to examine consumer
behavior in terms of prescription renewals and
product purchases.
44. 3) Customer analysis – using data mining one can
develop more targeted customer profiles that focus
not only on products, but also on the ability to pay
for them by analyzing historical health trends in
combination with demographics.
4) Physician targeting: reach physicians who have high
prescription rates for a certain drug or treatment with
information on new drugs that treat complementary
symptoms or conditions.
45. DEVELOPMENT OF NEW
DRUGS.
• Development of new drugs can be supported by
clustering molecules into groups according to
their chemical properties via
cluster analysis.
• Every time a new molecule is discovered, it can
be grouped with other chemically similar
molecules.
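A sketch of placing a newly discovered molecule into an existing group. It assumes each molecule is described by a set of structural features and uses Tanimoto (Jaccard) similarity, a common choice in cheminformatics that the slides do not specify. The feature names and groups are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def assign_group(molecule, groups):
    """Name of the group whose members are, on average, most similar."""
    def mean_similarity(members):
        return sum(tanimoto(molecule, m) for m in members) / len(members)
    return max(groups, key=lambda name: mean_similarity(groups[name]))

# Hypothetical feature sets; a real system would use computed fingerprints.
groups = {
    "group_1": [{"ring", "hydroxyl", "amine"}, {"ring", "hydroxyl"}],
    "group_2": [{"chain", "carboxyl"}, {"chain", "carboxyl", "ester"}],
}
new_molecule = {"ring", "amine"}
group = assign_group(new_molecule, groups)
```

The new molecule shares features only with the members of `group_1`, so it is placed there.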
46. •Mining can help us measure the chemical activity
of a molecule against a specific disease, say tuberculosis,
and find out which part of the molecule is causing the
action.
•This way we can combine parts from a vast number of
molecules into a super molecule that keeps only the
specific parts responsible for
the action and inhibits the other parts.
•This would greatly reduce the adverse effects
associated with drug actions.
47. • Researchers use high-speed screening to test tens,
hundreds, or thousands of drugs very quickly.
• The general goal is to find activity on
relevant genes or to find drug compounds that
have desirable characteristics.
• The data mining techniques used in
developing new drugs are clustering,
classification and neural networks.
• The basic objective is to determine
compounds with similar activity.
48. • The reason is that compounds with similar activity
behave similarly.
• This is possible only when we have a known
compound and are looking for something better.
• When we don’t have a known compound but
have a desired activity and want to find a
compound that exhibits this activity, data
mining comes to the rescue.
49. DEVELOPMENT TESTS AND
PREDICTS DRUG BEHAVIOR
• Several issues affect the success of a drug and
can impact its future development:
1) Adverse reactions to drugs are reported
spontaneously and not in any organized manner.
2) We can only compare adverse reactions among
the drugs of our own company, not with
drugs from competing firms.
3) We only have information on the patient taking
the drug, not on the adverse reaction that the patient
is suffering from.
50. Solution
• All of this can be solved by creating a data
warehouse for drug reactions and running
business intelligence tools on it.
• BI tools: business intelligence tools are a type of
software designed to retrieve, analyze and
report data.
• This broad definition includes everything from
spreadsheets, visual analytics, and querying
software to data mining, warehousing, and
decision engineering.
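A small sketch of the idea, using an in-memory SQLite database as a stand-in for the data warehouse. The table name, columns and rows are hypothetical; the query is the kind of report a BI tool would run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adverse_reaction (drug TEXT, manufacturer TEXT, reaction TEXT)"
)
# Hypothetical rows, including drugs from more than one company.
conn.executemany(
    "INSERT INTO adverse_reaction VALUES (?, ?, ?)",
    [
        ("drug_a", "our_company", "drowsiness"),
        ("drug_a", "our_company", "nausea"),
        ("drug_a", "our_company", "drowsiness"),
        ("drug_b", "competitor", "drowsiness"),
    ],
)

# A typical reporting query: reaction counts per drug, largest first.
summary = conn.execute(
    "SELECT drug, reaction, COUNT(*) AS n FROM adverse_reaction "
    "GROUP BY drug, reaction ORDER BY n DESC, drug, reaction"
).fetchall()
conn.close()
```

Keeping reactions for all drugs in one organized table addresses the first two issues: reports are no longer scattered, and drugs from different firms can be compared in a single query.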
52. •The drug undergoes testing in animals and human
tissue to observe its effect and to determine how much
of the drug to take for the desired effect and how
dangerous the drug is.
•The data mining techniques that can be used here are
classification and neural networks.
53. • The goal here is to predict whether the treatment will aid
patients.
• If the drug will not aid patients, it serves no
purpose.
• Predicting drug behavior is feasible when we
have data supporting use of the drug and also have
training data that shows the effects of the drug (positive
or negative).
• The test should be able to predict which patients
will benefit, e.g. which treatment will help sickle cell
anemia patients.
54. How it works
•Information such as gender, body weight,
disease state, etc. plays a crucial role.
•This data is fed into a neural
network, which predicts whether the patient will
benefit from the drug.
•Each training example carries one of two
classifications: yes or no.
•The network is trained on the yes
examples and a snapshot is taken of the
neural network.
•Then the network is trained on the no
examples and another snapshot is
taken.
•The output is yes or no, depending on
whether the inputs are more similar to the
yes or the no training data.
•E.g. ARTMAP.
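The "more similar to the yes or the no training data" idea can be sketched as a nearest-template classifier. This is a deliberate simplification and not ARTMAP itself: each class gets one template (the mean of its training inputs) and a new patient receives the label of the closest template. The inputs are hypothetical and assumed to be scaled to the range 0 to 1.

```python
import math

def mean_vector(rows):
    """Component-wise mean of equally long numeric vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_templates(rows, labels):
    """One template per class: the mean of that class's training inputs."""
    templates = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        templates[label] = mean_vector(members)
    return templates

def classify(x, templates):
    """Label of the template closest to x (Euclidean distance)."""
    return min(templates, key=lambda label: math.dist(x, templates[label]))

# Hypothetical, already-scaled inputs: [weight, blood_pressure] in 0..1.
rows = [[0.2, 0.3], [0.3, 0.2], [0.8, 0.9], [0.7, 0.8]]
labels = ["yes", "yes", "no", "no"]

templates = train_templates(rows, labels)
prediction = classify([0.25, 0.35], templates)
```

Categorical inputs such as gender would need to be encoded as numbers first, and a real network would learn many templates per class rather than one.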
55. [Figure: neural network with inputs Weight, Height, Gender and Blood Pressure and output “Patient Benefits?”. Imagine an array of weights, one for each “template”; the template closest to the input is chosen, and the path of “least resistance” is chosen for the output.]
56. CLINICAL TRIALS TEST THE
DRUG IN HUMANS
• The company tests drugs in actual patients on a larger
scale.
• The company has to keep track of data about patient
progress.
• Because the government wants to protect the health of
citizens, many rules govern clinical trials.
• In developed countries a food and drug
administration oversees trials.
• The data mining technique used here can be
neural networks.
57. • Here data is collected by the pharmaceutical
company and undergoes statistical analysis to
determine the success of the trial.
• Data is generally reported to the food and drug
administration department and inspected
closely.
• Too many negative reactions might indicate that
the drug is too dangerous.
• An example of an adverse event is a medicine causing
drowsiness.
58. • The goal is to detect when too many adverse
events occur, or to detect a link between a drug and
an adverse event.
• Too many adverse events linked to a drug might
indicate that the drug is too dangerous or that the health
of patients is at risk.
• Adverse events are reported to the food and drug
administration when a link is suspected.
• One can feed the information on adverse events
pertaining to drugs into a
neural network and let the network lead us to what is
meant by ‘too many’.
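The slides suggest letting a neural network learn what "too many" means. As a simpler stand-in (a swapped-in technique, not the method above), the sketch below uses an exact binomial tail test: it flags a trial when the observed number of adverse events would be unlikely at an assumed background rate. The patient counts, baseline rate and threshold `alpha` are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def too_many_events(events, patients, baseline_rate, alpha=0.05):
    """True when this many events would be unlikely at the baseline rate."""
    return binomial_tail(events, patients, baseline_rate) < alpha

# Hypothetical trial: 100 patients, 5% background rate of drowsiness.
flag_high = too_many_events(events=12, patients=100, baseline_rate=0.05)
flag_normal = too_many_events(events=6, patients=100, baseline_rate=0.05)
```

Twelve events in 100 patients is flagged, while six is within what the background rate alone would produce. Choosing the baseline rate and the threshold is a judgment call that this sketch does not settle.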
59. Benefits
• Research stage: instead of trial and error, data
mining can help find drugs that have desirable
activity
• Development stage: data mining can help
predict who will benefit from the drug
• Clinical trials stage: data mining protects
patients and helps regulate drug testing
• Commercialization stage: data mining can
optimize the use of sales resources such as manpower
and advertising
60. CONCLUSION.
• Thanks to increased computerization and consumer/patient
awareness, reporting (via the internet) by health care workers
can easily be facilitated.
• Data collection in hospitals and extended care facilities is not
difficult, and this information is of high quality since such
institutions typically have tailored diets for their patients, and
maintain accurate records of treatments, lab tests, and
administration of prescriptions.
• Furthermore, given the popularity of the internet, it is
relatively easy for consumers to voluntarily fill in and submit
detailed profiles of themselves.
61. •Data mining techniques are still
seldom used in the pharmaceutical environment.
•Data mining can help find drugs that have desirable
activity and predict who will benefit from a drug.
•Data mining protects patients, helps regulate drug
testing and optimizes the use of sales resources such as
manpower and advertising.