NADAR SARSWATHI COLLEGE
OF ARTS AND SCIENCE
THENI
DEPARTMENT OF INFORMATION
TECHNOLOGY
BIG DATA ANALYTICS
SEMI-SUPERVISED LEARNING
BY:
S.SABTHAMI
II M.Sc(IT)
SEMI-SUPERVISED LEARNING
• Semi-supervised learning
– Real-life applications fall somewhere between
fully supervised and fully unsupervised learning.
The Challenge
• The unsupervised portion of the corpus
adds to
– Vocabulary
– Knowledge about the joint distribution of terms
– Unsupervised measures of inter-document
similarity.
• E.g.: site name, directory path, hyperlinks
• Put together multiple sources of evidence of
similarity and class membership into a
label-learning system
– Combine different features with partial supervision
Hard Classification
• Train a supervised learner on the available
labeled data
• Label all unlabeled documents with this
classifier
• Retrain the classifier, adopting the new labels
for documents where the classifier was
most confident
• Continue until the labels no longer change
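The hard-classification loop above can be sketched as self-training with a toy nearest-centroid classifier on 1-D points; the classifier, the data, and the margin-based confidence threshold are illustrative assumptions, not the deck's specific setup.

```python
# Self-training sketch: train on labeled data, pseudo-label confident
# unlabeled points, retrain, and repeat until nothing changes.

def train_centroid(labeled):
    # Compute one centroid per class from (x, y) pairs.
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    # Label = nearest centroid; confidence = margin over the runner-up.
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
    return ranked[0], margin

def self_train(labeled, unlabeled, threshold=1.0, max_iters=10):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_iters):
        centroids = train_centroid(labeled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict(centroids, x)
            (confident if margin >= threshold else rest).append((x, y))
        if not confident:           # labels no longer change: stop
            break
        labeled += confident        # adopt confident pseudo-labels
        pool = [x for x, _ in rest]
    return train_centroid(labeled)

centroids = self_train([(0.0, "a"), (10.0, "b")], [1.0, 2.0, 8.5, 9.5])
```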
Expectation maximization
• A softer variant of the previous algorithm
• Steps
– Set up some fixed number of clusters with
some arbitrary initial distributions
– Alternate the following steps
• E-step: re-estimate Pr(c|d) for each cluster c and
each document d, based on the current parameters
of the distribution that characterizes c
• M-step: re-estimate the parameters of the
distribution for each cluster
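The alternating steps can be sketched in a minimal soft-EM loop; two 1-D clusters with fixed unit variance and uniform priors are illustrative assumptions, not the document-specific term model.

```python
import math

# Soft EM sketch: E-step computes Pr(c|d); M-step re-estimates each
# cluster's parameter (here, just its mean) from those responsibilities.

def e_step(means, xs):
    # Pr(c|d) proportional to exp(-(x - mu_c)^2 / 2), uniform priors.
    resp = []
    for x in xs:
        w = [math.exp(-(x - m) ** 2 / 2) for m in means]
        z = sum(w)
        resp.append([wi / z for wi in w])
    return resp

def m_step(resp, xs):
    # Each mean becomes a responsibility-weighted average of the data.
    means = []
    for c in range(len(resp[0])):
        num = sum(r[c] * x for r, x in zip(resp, xs))
        den = sum(r[c] for r in resp)
        means.append(num / den)
    return means

xs = [0.0, 0.5, 5.0, 5.5]
means = [0.1, 4.0]            # arbitrary initial distributions
for _ in range(20):           # alternate E and M steps
    means = m_step(e_step(means, xs), xs)
```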
Experiment: EM
• Set up one cluster for each class label
• Estimate a class-conditional distribution
which includes information from D
• Simultaneously estimate the cluster
memberships of the unlabeled documents.
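The experimental setup above, one cluster per class label with cluster memberships of the unlabeled documents estimated simultaneously, can be sketched by clamping labeled points to their known cluster while unlabeled points get soft responsibilities. The 1-D data and fixed-variance model are illustrative assumptions.

```python
import math

# Semi-supervised EM sketch: labeled points keep hard Pr(c|d) = 1 for
# their known class; unlabeled points get soft responsibilities.

def responsibilities(means, labeled, unlabeled):
    resp = []
    for x, y in labeled:
        # Clamp: the known label fixes this point's cluster membership.
        resp.append((x, [1.0 if c == y else 0.0 for c in range(len(means))]))
    for x in unlabeled:
        w = [math.exp(-(x - m) ** 2 / 2) for m in means]
        z = sum(w)
        resp.append((x, [wi / z for wi in w]))
    return resp

def update_means(resp, k):
    return [sum(r[c] * x for x, r in resp) / sum(r[c] for _, r in resp)
            for c in range(k)]

labeled = [(0.0, 0), (6.0, 1)]   # class labels anchor the clusters
unlabeled = [1.0, 5.0]
means = [0.0, 6.0]
for _ in range(10):
    means = update_means(responsibilities(means, labeled, unlabeled), 2)
```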
EM: Issues
• For labeled documents, we know the class
label cd
– Question: how do we use this information?
– Will be dealt with later
• Using Laplace estimates instead of ML
estimates
– Not strictly EM
– Convergence takes place in practice
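The Laplace-vs-ML point can be made concrete: under maximum likelihood, a term unseen in a class gets probability zero, while add-one (Laplace) smoothing keeps every vocabulary term strictly positive. The toy vocabulary and counts below are illustrative assumptions.

```python
# ML vs Laplace (add-one) estimates of Pr(term | class).

def ml_estimate(counts, total):
    # Maximum likelihood: unseen terms simply get no mass.
    return {t: n / total for t, n in counts.items()}

def laplace_estimate(counts, total, vocab):
    # Add one to every count; normalize over total + |vocab|.
    return {t: (counts.get(t, 0) + 1) / (total + len(vocab)) for t in vocab}

vocab = ["data", "mining", "web"]
counts = {"data": 3, "mining": 1}     # "web" unseen in this class
ml = ml_estimate(counts, 4)
lap = laplace_estimate(counts, 4, vocab)
```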
A metric graph-labeling problem:
NP-Completeness
• NP-complete [Kleinberg and Tardos]
• Approximation algorithms exist
– Within an O(log k log log k) multiplicative factor
of the minimal cost
– k = number of distinct class labels
Problems with approaches so far
• Metric or relaxation labeling
– Representing accurate joint distributions over thousands of
terms
• High space and time complexity
• Naïve Models
– Fast: assume class-conditional attribute independence,
– Dimensionality of textual sub-problem >> dimensionality of
link sub-problem,
– Pr(vT|f(v)) tends to be lower in magnitude than
Pr(f(N(v))|f(v)).
– Hacky workaround: aggressive pruning of textual features
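The speed of the naive model comes from class-conditional attribute independence: the joint term likelihood factors into a product, so scoring is a sum of per-term log probabilities. The classes, priors, and term probabilities below are illustrative assumptions.

```python
import math

# Naive Bayes scoring under class-conditional term independence:
# log score(c, d) = log Pr(c) + sum over terms t in d of log Pr(t|c).

def nb_score(doc_terms, prior, term_probs):
    return math.log(prior) + sum(math.log(term_probs[t]) for t in doc_terms)

params = {
    "sports":   (0.5, {"ball": 0.4, "game": 0.4, "vote": 0.2}),
    "politics": (0.5, {"ball": 0.1, "game": 0.2, "vote": 0.7}),
}

def classify(doc_terms):
    # Pick the class with the highest factored log score.
    return max(params, key=lambda c: nb_score(doc_terms, *params[c]))
```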
Co-Training [Blum and Mitchell]
• Classifiers with disjoint feature spaces.
• Co-training of classifiers
– Scores used by each classifier to train the other
– Semi-supervised EM-like training with two
classifiers
• Assumptions
– Each document has two feature views, dA
and dB
– Must be no instance d for which
– Given the label, dA is conditionally
independent of dB (and vice versa)
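The co-training loop can be sketched with two toy nearest-centroid classifiers, one per feature view, that hand confidently labeled examples to each other's training set. The view classifiers, data, and confidence margin are illustrative assumptions, not Blum and Mitchell's original setup.

```python
# Co-training sketch: two classifiers on disjoint feature views label
# confident unlabeled examples, which are then used to retrain both.

def train_view(labeled, view):
    # One centroid per class, using only one feature index (the view).
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x[view]
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_view(model, x, view):
    ranked = sorted(model, key=lambda y: abs(x[view] - model[y]))
    margin = abs(x[view] - model[ranked[1]]) - abs(x[view] - model[ranked[0]])
    return ranked[0], margin

def co_train(labeled, unlabeled, rounds=5, threshold=1.0):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        models = [train_view(labeled, v) for v in (0, 1)]
        added, rest = [], []
        for x in pool:
            # Each view proposes a label; adopt the more confident one.
            props = [predict_view(models[v], x, v) for v in (0, 1)]
            y, margin = max(props, key=lambda p: p[1])
            (added if margin >= threshold else rest).append((x, y))
        if not added:
            break
        labeled += added
        pool = [x for x, _ in rest]
    return [train_view(labeled, v) for v in (0, 1)]

models = co_train([((0.0, 0.0), "a"), ((10.0, 10.0), "b")],
                  [(1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)])
```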