NADAR SARSWATHI COLLEGE
OF ARTS AND SCIENCE
THENI
DEPARTMENT OF INFORMATION
TECHNOLOGY
BIG DATA ANALYTICS
SEMI-SUPERVISED LEARNING
BY:
S.SABTHAMI
II M.Sc(IT)
SEMI-SUPERVISED LEARNING
• Semi-supervised learning
– Real-life applications fall somewhere between
fully supervised and fully unsupervised learning.
The Challenge
• The unsupervised portion of the corpus
adds to
– Vocabulary
– Knowledge about the joint distribution of terms
– Unsupervised measures of inter-document
similarity.
• E.g.: site name, directory path, hyperlinks
• Put together multiple sources of evidence of
similarity and class membership into a
label-learning system
– Combine different features with partial supervision
Hard Classification
• Train a supervised learner on the available
labeled data
• Label all unlabeled documents with this
classifier
• Retrain the classifier, adopting the new labels
for documents where the classifier was
most confident
• Continue until the labels no longer change
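The hard-classification loop above can be sketched as self-training with a toy nearest-centroid classifier on 1-D points; the classifier, the data, and the margin-based confidence threshold are illustrative assumptions, not the deck's specific setup.

```python
# Self-training sketch: train on labeled data, pseudo-label confident
# unlabeled points, retrain, and repeat until nothing changes.

def train_centroid(labeled):
    # Compute one centroid per class from (x, y) pairs.
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    # Label = nearest centroid; confidence = margin over the runner-up.
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
    return ranked[0], margin

def self_train(labeled, unlabeled, threshold=1.0, max_iters=10):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_iters):
        centroids = train_centroid(labeled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict(centroids, x)
            (confident if margin >= threshold else rest).append((x, y))
        if not confident:           # labels no longer change: stop
            break
        labeled += confident        # adopt confident pseudo-labels
        pool = [x for x, _ in rest]
    return train_centroid(labeled)

centroids = self_train([(0.0, "a"), (10.0, "b")], [1.0, 2.0, 8.5, 9.5])
```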
Expectation maximization
• A softer variant of the previous algorithm
• Steps
– Set up some fixed number of clusters with
some arbitrary initial distributions
– Alternate the following steps
• E-step: re-estimate Pr(c|d) for each cluster c and
each document d, based on the current parameters
of the distribution that characterizes c
• M-step: re-estimate the parameters of the
distribution for each cluster
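The alternating steps can be sketched in a minimal soft-EM loop; two 1-D clusters with fixed unit variance and uniform priors are illustrative assumptions, not the document-specific term model.

```python
import math

# Soft EM sketch: E-step computes Pr(c|d); M-step re-estimates each
# cluster's parameter (here, just its mean) from those responsibilities.

def e_step(means, xs):
    # Pr(c|d) proportional to exp(-(x - mu_c)^2 / 2), uniform priors.
    resp = []
    for x in xs:
        w = [math.exp(-(x - m) ** 2 / 2) for m in means]
        z = sum(w)
        resp.append([wi / z for wi in w])
    return resp

def m_step(resp, xs):
    # Each mean becomes a responsibility-weighted average of the data.
    means = []
    for c in range(len(resp[0])):
        num = sum(r[c] * x for r, x in zip(resp, xs))
        den = sum(r[c] for r in resp)
        means.append(num / den)
    return means

xs = [0.0, 0.5, 5.0, 5.5]
means = [0.1, 4.0]            # arbitrary initial distributions
for _ in range(20):           # alternate E and M steps
    means = m_step(e_step(means, xs), xs)
```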
Experiment: EM
• Set up one cluster for each class label
• Estimate a class-conditional distribution
which includes information from D
• Simultaneously estimate the cluster
memberships of the unlabeled documents.
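The experimental setup above, one cluster per class label with cluster memberships of the unlabeled documents estimated simultaneously, can be sketched by clamping labeled points to their known cluster while unlabeled points get soft responsibilities. The 1-D data and fixed-variance model are illustrative assumptions.

```python
import math

# Semi-supervised EM sketch: labeled points keep hard Pr(c|d) = 1 for
# their known class; unlabeled points get soft responsibilities.

def responsibilities(means, labeled, unlabeled):
    resp = []
    for x, y in labeled:
        # Clamp: the known label fixes this point's cluster membership.
        resp.append((x, [1.0 if c == y else 0.0 for c in range(len(means))]))
    for x in unlabeled:
        w = [math.exp(-(x - m) ** 2 / 2) for m in means]
        z = sum(w)
        resp.append((x, [wi / z for wi in w]))
    return resp

def update_means(resp, k):
    return [sum(r[c] * x for x, r in resp) / sum(r[c] for _, r in resp)
            for c in range(k)]

labeled = [(0.0, 0), (6.0, 1)]   # class labels anchor the clusters
unlabeled = [1.0, 5.0]
means = [0.0, 6.0]
for _ in range(10):
    means = update_means(responsibilities(means, labeled, unlabeled), 2)
```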
EM: Issues
• For labeled documents, we know the class
label cd
– Question: how do we use this information?
– Will be dealt with later
• Using Laplace estimates instead of ML
estimates
– Not strictly EM
– Convergence takes place in practice
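The Laplace-vs-ML point can be made concrete: under maximum likelihood, a term unseen in a class gets probability zero, while add-one (Laplace) smoothing keeps every vocabulary term strictly positive. The toy vocabulary and counts below are illustrative assumptions.

```python
# ML vs Laplace (add-one) estimates of Pr(term | class).

def ml_estimate(counts, total):
    # Maximum likelihood: unseen terms simply get no mass.
    return {t: n / total for t, n in counts.items()}

def laplace_estimate(counts, total, vocab):
    # Add one to every count; normalize over total + |vocab|.
    return {t: (counts.get(t, 0) + 1) / (total + len(vocab)) for t in vocab}

vocab = ["data", "mining", "web"]
counts = {"data": 3, "mining": 1}     # "web" unseen in this class
ml = ml_estimate(counts, 4)
lap = laplace_estimate(counts, 4, vocab)
```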
A metric graph-labeling problem:
NP-Completeness
• NP-complete [Kleinberg and Tardos]
• Approximation algorithms exist
– Within an O(log k log log k) multiplicative factor
of the minimal cost
– k = number of distinct class labels
Problems with approaches so far
• Metric or relaxation labeling
– Representing accurate joint distributions over thousands of
terms
• High space and time complexity
• Naïve Models
– Fast: assume class-conditional attribute independence,
– Dimensionality of textual sub-problem >> dimensionality of
link sub-problem,
– Pr(vT|f(v)) tends to be lower in magnitude than
Pr(f(N(v))|f(v)).
– Hacky workaround: aggressive pruning of textual features
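The speed of the naive model comes from class-conditional attribute independence: the joint term likelihood factors into a product, so scoring is a sum of per-term log probabilities. The classes, priors, and term probabilities below are illustrative assumptions.

```python
import math

# Naive Bayes scoring under class-conditional term independence:
# log score(c, d) = log Pr(c) + sum over terms t in d of log Pr(t|c).

def nb_score(doc_terms, prior, term_probs):
    return math.log(prior) + sum(math.log(term_probs[t]) for t in doc_terms)

params = {
    "sports":   (0.5, {"ball": 0.4, "game": 0.4, "vote": 0.2}),
    "politics": (0.5, {"ball": 0.1, "game": 0.2, "vote": 0.7}),
}

def classify(doc_terms):
    # Pick the class with the highest factored log score.
    return max(params, key=lambda c: nb_score(doc_terms, *params[c]))
```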
Co-Training [Blum and Mitchell]
• Classifiers with disjoint feature spaces.
• Co-training of classifiers
– Scores used by each classifier to train the other
– Semi-supervised EM-like training with two
classifiers
• Assumptions
– Each document has two feature views, dA
and dB
– Must be no instance d for which
– Given the label, dA is conditionally
independent of dB (and vice versa)
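The co-training loop can be sketched with two toy nearest-centroid classifiers, one per feature view, that hand confidently labeled examples to each other's training set. The view classifiers, data, and confidence margin are illustrative assumptions, not Blum and Mitchell's original setup.

```python
# Co-training sketch: two classifiers on disjoint feature views label
# confident unlabeled examples, which are then used to retrain both.

def train_view(labeled, view):
    # One centroid per class, using only one feature index (the view).
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x[view]
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_view(model, x, view):
    ranked = sorted(model, key=lambda y: abs(x[view] - model[y]))
    margin = abs(x[view] - model[ranked[1]]) - abs(x[view] - model[ranked[0]])
    return ranked[0], margin

def co_train(labeled, unlabeled, rounds=5, threshold=1.0):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        models = [train_view(labeled, v) for v in (0, 1)]
        added, rest = [], []
        for x in pool:
            # Each view proposes a label; adopt the more confident one.
            props = [predict_view(models[v], x, v) for v in (0, 1)]
            y, margin = max(props, key=lambda p: p[1])
            (added if margin >= threshold else rest).append((x, y))
        if not added:
            break
        labeled += added
        pool = [x for x, _ in rest]
    return [train_view(labeled, v) for v in (0, 1)]

models = co_train([((0.0, 0.0), "a"), ((10.0, 10.0), "b")],
                  [(1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)])
```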