Medical Persona Classification in Social Media



Identifying the medical persona behind a social media post is of paramount importance for drug marketing and pharmacovigilance. In this work, we propose multiple approaches to infer the medical persona associated with a social media post. We pose this as a supervised multi-label text classification problem. The main challenge is to identify the hidden cues in a post that are indicative of a particular persona. We first propose a large set of manually engineered features for this task. Further, we propose multiple neural network based architectures to extract useful features from these posts using pre-trained word embeddings. Our experiments on thousands of blogs and tweets show that the proposed approach results in 7% and 5% gains in F-measure over a manual feature engineering based approach for blogs and tweets respectively.

Published in: Engineering


  1. Medical Persona Classification in Social Media. Nikhil Pattisapu (1), Manish Gupta (1, 2), Ponnurangam Kumaraguru (3), Vasudeva Varma (1). Affiliations: (1) IIIT Hyderabad, (2) Microsoft India, (3) IIIT Delhi. Advances in Social Network Analysis and Mining 2017 (ASONAM 2017).
  2. Overview: Motivation, Problem Definition, Related Work, Dataset, Approach, Evaluation Metrics, Experiments, Results, Analysis and Conclusion, Future Work.
  3. Motivation: What is a medical persona? Medical personae are the user groups and content providers of Web 2.0 applications in healthcare. Examples: Patient, Caretaker, Consultant, Journalist, Pharmacist, Researcher, Other.
  4. Motivation: Pharmaceutical firms use medical social media for drug marketing and pharmacovigilance. Figure: Sample post describing a patient’s experiences with the drug Keppra.
  5. Motivation: Use cases. A few use cases for identifying medical personae:
     • Gather information about drug usage, adverse events, benefits and side effects from patients.
     • Find out what kind of informational assistance caretakers seek and make such information readily available.
     • Identify key opinion leaders in a drug or disease area.
     • Find out whether a doctor has patients who can take part in a clinical trial.
  6. Motivation: Use cases (continued).
     • Gather information from conversations between pharmacists and others to identify drug dosage, interactions and therapeutic effects.
     • Acquire or collaborate on technologies invented by researchers that can become part of the drug pipeline.
     • Gather information from journalists’ surveys on the quality of life of patients.
  7. Problem Definition: Given a social media post, identify the medical personae associated with it. We pose this as a multi-label text classification problem, where the label set is {Patient, Caretaker, Consultant, Journalist, Pharmacist, Researcher, Other}. There are two primary reasons for framing this as a multi-label (rather than single-label) classification task:
     • A post may involve conversations between multiple personae, for example a blog describing a patient-consultant conversation.
     • A post may be ambiguous and hence could be mapped to more than one label by a human annotator.
  8. Related Work: This problem is primarily related to two problems that are thoroughly studied in the literature:
     • Authorship Attribution: the task of determining the author of a particular document.
     • Automatic Genre Identification (AGI): the task of classifying documents based on genre (which includes their form, structure, functional trait, communicative purpose, targeted audience and narrative style) rather than the content, topics or subjects that the documents span.
  9. Related Work: State-of-the-art methods. For both authorship attribution and AGI, supervised algorithms based on extensive feature engineering have been proposed. The top features include:
     • Word n-grams
     • Character n-grams
     • Common words
     • Function words
     • Part-of-speech tags
     • Document statistics (e.g. document length)
     • HTML tags
     • Stylistic features
     • Acronyms
     • Hashtags and reply mentions
  10. Related Work: Why can’t existing methods be trivially adapted?
     • Different features need to be explored for the medical domain.
     • Unlike most methods proposed in the literature, our task is a closed-set, multi-label one.
     • Each persona covers many users and is therefore itself heterogeneous.
  11. Dataset: Figure: Dataset Collection (stages: Blog / Tweet Search API, Noise Filtering & Deduplication, Human Annotation; data passed between stages: Query, Blogs / Tweets, Labeled Blogs / Tweets).
     • Our dataset consists of both blogs and tweets.
     • Examples of queries include drug names: minocycline, qvar, gilenya.
     • Whenever using only drug names as queries returned a lot of irrelevant content, drug-disease pairs (e.g. acne minocycline) were used as queries instead.
     • We used 50 queries and retrieved 50 blogs and 30 tweets per query.
     • Noisy posts and retweets were removed.
  12. Dataset: Figure: Dataset Statistics.
     • 1581 blogs and 1025 tweets were annotated.
     • The inter-annotator agreement between 4 annotators was found to be 0.708 for blogs and 0.70 for tweets.
     • The label cardinality of blogs and tweets was 1.18 and 1.24 respectively.
     • The maximum label cardinality of a blog was 2 and that of a tweet was 3.
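As a small illustration of the label cardinality statistic quoted on this slide (the average number of labels per post), here is a minimal sketch. The five toy posts are invented for the example and are not taken from the dataset.

```python
def label_cardinality(label_sets):
    """Average number of labels per post."""
    return sum(len(s) for s in label_sets) / len(label_sets)


# Hypothetical toy annotations (not from the real dataset).
toy_posts = [
    {"Patient"},
    {"Patient", "Consultant"},
    {"Researcher"},
    {"Caretaker"},
    {"Journalist", "Researcher"},
]

cardinality = label_cardinality(toy_posts)        # (1 + 2 + 1 + 1 + 2) / 5
max_labels = max(len(s) for s in toy_posts)       # largest label set on one post
```

On the toy data the cardinality is 1.4 and the largest label set has 2 labels; the slide's 1.18 and 1.24 are the same statistic computed over the annotated blogs and tweets.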
  13. Approach: Overview. We first transform the multi-label task into one or more single-label tasks using:
     • Binary label transformation
     • Label powerset transformation
     We then use the following approaches to solve the task:
     • N-gram approach
     • Feature Engineering
     • Averaged Word Vectors
     • CNN-LSTM
  14. Approach: Label transformation methods.
     • Binary Relevance Method: We train an individual classifier for each label. Given an unseen sample, the combined model predicts all labels for which the respective classifiers predict a positive result.
     • Label Powerset Method: We train one binary classifier for every label combination attested in the training set. For an unseen example, prediction is done using a voting scheme.
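The two label transformations can be sketched as plain data transformations. This is a minimal illustration under stated assumptions: it only builds the training targets and recombines binary predictions, and it leaves out the classifiers and the voting scheme mentioned on the slide. All function names are hypothetical.

```python
LABELS = ["Patient", "Caretaker", "Consultant", "Journalist",
          "Pharmacist", "Researcher", "Other"]


def binary_relevance_targets(label_sets, labels=LABELS):
    """One 0/1 target list per label; each list trains its own classifier."""
    return {lab: [1 if lab in s else 0 for s in label_sets] for lab in labels}


def combine_binary_predictions(per_label_predictions, index):
    """Predicted label set for one sample: every label whose classifier said 1."""
    return {lab for lab, preds in per_label_predictions.items()
            if preds[index] == 1}


def label_powerset_targets(label_sets):
    """Map each label combination seen in training to a single class id."""
    combo_to_id = {}
    targets = []
    for s in label_sets:
        key = frozenset(s)
        if key not in combo_to_id:
            combo_to_id[key] = len(combo_to_id)
        targets.append(combo_to_id[key])
    return targets, combo_to_id


# Hypothetical toy annotations.
toy = [{"Patient"}, {"Patient", "Consultant"}, {"Patient"}]
br = binary_relevance_targets(toy)
lp_targets, lp_classes = label_powerset_targets(toy)
```

With binary relevance a sample can receive any subset of labels, including combinations never seen in training; with the label powerset only attested combinations can be predicted.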
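  15. Approach: N-gram approach (baseline) and averaged word vectors.
     • N-gram approach (baseline): Each document is represented as a TF-IDF vector over the entire vocabulary. An SVM is trained to classify the document into one or more of the pre-defined personae. Both word n-grams and character n-grams are used.
     • Averaged word vectors: Each document is represented by the average of the embeddings of its words, where $w_{ij}$ is the $j$-th word of document $d_i$ and $\mathrm{len}(d_i)$ is its number of words:
     $$\text{document vector}(d_i) = \frac{\sum_{j} \text{word embedding}(w_{ij})}{\mathrm{len}(d_i)} \tag{1}$$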
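Equation (1) can be sketched in a few lines. The embedding table below is a hypothetical two-dimensional toy, and the handling of out-of-vocabulary words (they contribute a zero vector but still count toward the document length) is an assumption, since the slides do not say how such words are treated.

```python
def document_vector(tokens, embeddings, dim):
    """Average the word embeddings of a tokenized document (Equation 1)."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # assumption: unknown words contribute a zero vector
        for k in range(dim):
            total[k] += vec[k]
    n = len(tokens)  # denominator is the document length, as in Equation (1)
    return [x / n for x in total] if n else total


# Hypothetical toy embeddings; real ones would have 200 or 300 dimensions.
toy_embeddings = {"fever": [1.0, 0.0], "dengue": [0.0, 1.0]}
vec = document_vector(["fever", "dengue"], toy_embeddings, dim=2)
```

The resulting fixed-length vector can then be fed to any standard classifier, which is what makes this approach cheap compared with the neural models.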
  16. Approach: Word embedding details.
     Table: Pre-trained Word Embedding Details
     | ID | Training Source | Training Algorithm | #Dim | #Entries | Domain |
     |----|-----------------|--------------------|------|----------|--------|
     | 1 | Medical Tweets (ADR) | Word2Vec | 200 | 1344629 | Medical |
     | 2 | Twitter | GloVe | 200 | 1193515 | Generic |
     | 3 | Web crawl 1 | GloVe | 300 | 2196018 | Generic |
     | 4 | Web crawl 2 | GloVe | 300 | 1917495 | Generic |
     | 5 | PubMed, PMC, Wikipedia | Word2Vec | 200 | 5443656 | Medical |
  17. Approach: Feature engineering. For this task, we manually engineered a total of 89 features, distributed over 6 feature types.
     • Document-level features (4): Capture generic features of a post. Examples: number of sentences, average sentence length, average word length. Pharmacist blogs are lengthier than Patient blogs.
     • POS features (33): Capture the distribution of different parts of speech in the document. Example: number of adjectives. A Consultant is 1.6 times more likely to use adjectives than a Journalist.
  18. Approach: Feature engineering (continued).
     • List lookup features (7): The average frequency of terms that occur both in the document and in a particular list. Example: a list of abusive words. The terms MD, Dr., MBBS, FRCS and consultation fee were found to be more frequent in Consultant blogs than in others.
     • Syntactic features (7): Capture the presence or absence of various classes of terms. Examples: date, person, location, organization, time, money and percentage mentions. Researcher blogs contain more percentage mentions than others.
  19. Approach: Feature engineering (continued).
     • Semantic features (35): Medical domain specific features. Examples: number of disease mentions, drug mentions, chemical mentions, organ mentions. The distribution across these features gives significant clues about the persona. These features were extracted using MetaMap.
     • Tweet-specific features: Features that apply to tweets only. Example: number of hashtags.
  20. Approach: CNN architecture. For experiments on tweets, we use the following CNN architecture. Figure: CNN (layers shown: Pre-trained Word Embedding Layer, Convolution Layer, Max-pooling Layer, Softmax / Sigmoid; example input: "I am suffering pneumonia").
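To make the convolution and max-pooling layers in the figure concrete, here is a minimal pure-Python sketch of a single convolution filter followed by max-over-time pooling. It is only an illustration: the slides do not specify the framework, filter widths or number of filters, so the toy vectors, the filter weights and the choice of ReLU (one of the two activations listed in the experimental details) are all assumptions.

```python
def relu(x):
    return x if x > 0 else 0.0


def conv1d_max_pool(word_vectors, filt, bias=0.0):
    """Apply one filter over every window of consecutive words, then max-pool.

    word_vectors: list of embedding vectors, one per word in the post.
    filt: list of weight vectors; len(filt) is the filter width in words.
    Returns a single feature: the strongest activation over all positions.
    """
    width = len(filt)
    activations = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        s = bias + sum(w * x
                       for f_row, v_row in zip(filt, window)
                       for w, x in zip(f_row, v_row))
        activations.append(relu(s))
    # Assumption: posts shorter than the filter width yield a zero feature.
    return max(activations) if activations else 0.0


# Hypothetical 2-dimensional embeddings for a 3-word post.
toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_filter = [[1.0, 0.0], [0.0, 1.0]]   # width 2
feature = conv1d_max_pool(toy_sentence, toy_filter)
```

A full model would apply many such filters of several widths, concatenate the pooled features, and pass them to the softmax or sigmoid output layer.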
  21. Approach: CNN-LSTM architecture. For experiments on blogs, we use the following CNN-LSTM architecture. Figure: CNN-LSTM (layers shown: Pre-trained Word Embedding Layer, Convolution Layer, Max-pooling Layer, Sequential Layer of LSTM units, Softmax / Sigmoid Layer; example input sentences: "I treated a patient", "He was suffering fever", "Hygiene highly impacts dengue").
  22. Evaluation Metrics: Each evaluation metric is defined on a per-instance basis and subsequently averaged over all instances to obtain the aggregate value. Let $l$ and $pr$ be the true label set and the predicted label set for document $d$.
     $$\text{Exact Match} = \begin{cases} 1 & \text{if } l = pr \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
     $$\text{Jaccard Similarity} = \frac{|l \cap pr|}{|l \cup pr|} \tag{3}$$
     $$\text{Precision} = \frac{|l \cap pr|}{|pr|} \tag{4}$$
     $$\text{Recall} = \frac{|l \cap pr|}{|l|} \tag{5}$$
     $$\text{F-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$
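  23. Evaluation Metrics (continued):
     $$\text{Hamming Loss} = \frac{1}{|L|} \sum_{j=1}^{|L|} \mathrm{xor}(l_j, pr_j) \tag{7}$$
     $$\text{Hamming Score} = 1 - \text{Hamming Loss} \tag{8}$$
     where $l_j$ and $pr_j$ denote the $j$-th elements of $l$ and $pr$ respectively, viewed as binary indicator vectors over the label set $L$.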
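The per-instance metrics in Equations (2) to (8) can be sketched directly on Python sets. The handling of empty sets (returning 0.0 for precision, recall and F-score, and 1.0 for Jaccard when both sets are empty) is an assumption, since the slides do not cover those edge cases.

```python
def instance_metrics(true, pred, all_labels):
    """Per-instance multi-label metrics for one document."""
    true, pred = set(true), set(pred)
    inter = len(true & pred)
    union = len(true | pred)
    precision = inter / len(pred) if pred else 0.0   # assumption for empty pred
    recall = inter / len(true) if true else 0.0      # assumption for empty true
    denom = precision + recall
    # xor over indicator vectors equals the size of the symmetric difference.
    hamming_loss = len(true ^ pred) / len(all_labels)
    return {
        "exact_match": 1.0 if true == pred else 0.0,
        "jaccard": inter / union if union else 1.0,  # assumption: both empty
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / denom if denom else 0.0,
        "hamming_score": 1.0 - hamming_loss,
    }


def average_metrics(true_sets, pred_sets, all_labels):
    """Average each per-instance metric over all documents."""
    rows = [instance_metrics(t, p, all_labels)
            for t, p in zip(true_sets, pred_sets)]
    return {k: sum(r[k] for r in rows) / len(rows) for k in rows[0]}


LABEL_SET = ["Patient", "Caretaker", "Consultant", "Journalist",
             "Pharmacist", "Researcher", "Other"]
example = instance_metrics({"Patient", "Consultant"}, {"Patient"}, LABEL_SET)
```

For the example, one of two true labels is predicted and nothing extra, so precision is 1.0, recall and Jaccard are 0.5, and exactly one of the seven label positions disagrees.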
  24. Experimental Details: Throughout this work, we conduct 10-fold cross-validation experiments. For extracting semantic features we use MetaMap. For tuning hyperparameters in the CNN and CNN-LSTM models, we used a grid search over the entire hyperparameter space, which includes:
     • Number of convolution filters
     • Filter sizes
     • Activation functions (ReLU and sigmoid)
     • Size of hidden layer
     • Number of epochs
     We select the configuration that maximizes the F-Score on a hold-out validation set.
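The grid search described on this slide can be sketched as an exhaustive loop over hyperparameter combinations. The grid values and the scoring function below are hypothetical placeholders: in the actual experiments the score would be the F-Score of a trained model on the hold-out validation set, and the slides do not list the values that were searched.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Try every combination in the grid; keep the one with the best score."""
    names = sorted(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical grid; the real search space is not given in the slides.
toy_grid = {
    "num_filters": [64, 128],
    "filter_size": [3, 5],
    "activation": ["relu", "sigmoid"],
}


def toy_validation_score(config):
    """Stand-in for 'train the model, return validation F-Score'."""
    score = -abs(config["num_filters"] - 128) - abs(config["filter_size"] - 3)
    return score + (1 if config["activation"] == "relu" else 0)


best_config, best_score = grid_search(toy_grid, toy_validation_score)
```

Exhaustive search grows multiplicatively with every added hyperparameter, which is one practical reason to keep each list of candidate values short.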
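  25. Results: Blogs.
     Table: Results of all Approaches for Blogs (LT Method = label transformation method: BR = Binary Relevance, LP = Label Powerset; Emb Id = embedding ID; JS = Jaccard Similarity; EM = Exact Match; HS = Hamming Score)
     | Approach | LT Method | Emb Id | JS | EM | HS | F-Score |
     |----------|-----------|--------|----|----|----|---------|
     | Word unigrams | BR | - | 0.446 | 0.393 | 0.870 | 0.520 |
     | Word unigrams | LP | - | 0.566 | 0.511 | 0.865 | 0.570 |
     | Character n-grams | BR | - | 0.460 | 0.401 | 0.871 | 0.530 |
     | Character n-grams | LP | - | 0.577 | 0.523 | 0.868 | 0.580 |
     | Feature Engineering | BR | - | 0.461 | 0.409 | 0.872 | 0.530 |
     | Feature Engineering | LP | - | 0.574 | 0.518 | 0.867 | 0.580 |
     | Averaged Word2Vec | BR | 3 | 0.608 | 0.521 | 0.880 | 0.600 |
     | Averaged Word2Vec | LP | 4 | 0.627 | 0.568 | 0.886 | 0.640 |
     | CNN-LSTM | BR | 3 | 0.496 | 0.421 | 0.846 | 0.460 |
     | CNN-LSTM | LP | 3 | 0.586 | 0.514 | 0.869 | 0.600 |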
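  26. Results: Tweets.
     Table: Results of all Approaches for Tweets (LT Method = label transformation method: BR = Binary Relevance, LP = Label Powerset; Emb Id = embedding ID; JS = Jaccard Similarity; EM = Exact Match; HS = Hamming Score)
     | Approach | LT Method | Emb Id | JS | EM | HS | F-Score |
     |----------|-----------|--------|----|----|----|---------|
     | Word unigrams | BR | - | 0.427 | 0.352 | 0.862 | 0.500 |
     | Word unigrams | LP | - | 0.518 | 0.441 | 0.846 | 0.510 |
     | Character n-grams | BR | - | 0.421 | 0.353 | 0.864 | 0.480 |
     | Character n-grams | LP | - | 0.513 | 0.435 | 0.845 | 0.490 |
     | Feature Engineering | BR | - | 0.450 | 0.366 | 0.865 | 0.520 |
     | Feature Engineering | LP | - | 0.540 | 0.455 | 0.852 | 0.540 |
     | Averaged Word2Vec | BR | 3 | 0.563 | 0.469 | 0.863 | 0.560 |
     | Averaged Word2Vec | LP | 4 | 0.544 | 0.462 | 0.853 | 0.520 |
     | CNN | BR | 4 | 0.593 | 0.499 | 0.873 | 0.590 |
     | CNN | LP | 4 | 0.582 | 0.489 | 0.864 | 0.580 |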
  27. Analysis: Feature analysis.
     Table: Feature Analysis for Blogs and Tweets based on the χ2 metric. The number in parentheses is the feature rank (lower is better).
     | Feature Group | Best Feature (Blogs) | Best Feature (Tweets) |
     |---------------|----------------------|-----------------------|
     | Document | # characters (3) | # characters (8) |
     | Syntactic | # Money mentions (2) | # Money mentions (6) |
     | List lookup | # matching words with consultant list (1) | # matching words with patient word list (29) |
     | Semantic | # Inorganic chemical (38) | # research activity (34) |
     | POS | # Foreign word (163) | # Personal Pronoun (116) |
     | Tweet specific | - | # hashtags (9) |
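As an illustration of ranking features with a χ2 statistic, here is a minimal sketch using the standard 2x2 contingency formula. It assumes each feature has been binarized (present or absent in a post) and is scored against a single binary label; the slides do not say how the count features were prepared for the test, so this setup and the toy data are assumptions.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: feature present, label positive    b: feature present, label negative
    c: feature absent,  label positive    d: feature absent,  label negative
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or label carries no information
    return n * (a * d - b * c) ** 2 / denom


def rank_features(presence, labels):
    """Return feature names sorted by chi-square score, best first."""
    scores = {}
    for name, column in presence.items():
        pairs = list(zip(column, labels))
        a = sum(1 for x, y in pairs if x and y)
        b = sum(1 for x, y in pairs if x and not y)
        c = sum(1 for x, y in pairs if not x and y)
        d = sum(1 for x, y in pairs if not x and not y)
        scores[name] = chi_square_2x2(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: 4 posts, label 1 = Consultant.
toy_labels = [1, 1, 0, 0]
toy_presence = {
    "has_hashtag": [1, 0, 1, 0],      # independent of the label
    "mentions_money": [1, 1, 0, 0],   # perfectly aligned with the label
}
ranking = rank_features(toy_presence, toy_labels)
```

A feature's position in such a ranking corresponds to the rank shown in parentheses in the table.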
  28. Analysis and Conclusion:
     • Averaged word2vec (for blogs) and the CNN model (for tweets) outperform the other approaches.
     • The CNN-LSTM model fails to outperform the averaged word2vec method, mainly due to its high number of trainable parameters.
     • Word embeddings with superior medical concept coverage do not outperform the others. Coverage may not be crucial for this task.
     • Word embeddings trained purely on medical text (such as PubMed articles) do not outperform the others, for two likely reasons: a lack of diversity of personae in the training data, and the fact that most of the data is generated by a few personae (such as researchers for PubMed).
  29. Future Work:
     • The current features are limited to a post’s content. We would like to explore other features such as social features, for example the number of followers on Twitter.
     • We wish to experiment with distant supervision based methods to obtain automatically labeled examples for data-hungry models like CNN-LSTM.
  30. Thank You! For any queries, please contact