Making Natural Language Processing Robust to Sociolinguistic Variation:
Natural language processing on social media text has the potential to aggregate facts and opinions from millions of people all over the world. However, language in social media is highly variable, making it more difficult to analyze that conventional news texts. Fortunately, this variation is not random; it is often linked to social properties of the author. I will describe two machine learning methods for exploiting social network structures to make natural language processing more robust to socially-linked variation. The key idea behind both methods is linguistic homophily: the tendency of socially linked individuals to use language in similar ways. This idea is captured using embeddings of node positions in social networks. By integrating node embeddings into neural networks for language analysis, we obtained customized language processing systems for individual writers — even for individuals for whom we have no labeled data. The first application shows how to apply this idea to the problem of tweet-level sentiment analysis. The second application targets the problem of linking spans of text to known entities in a knowledge base.