Title: Two new machine learning approaches for text classification
Abstract: Text document classification is one of the best-studied applications of machine learning. Yet this technology is still limited by practical difficulties and invalid underlying assumptions.
First, many people who want text classifiers do not have the time or resources to annotate a dataset. They often employ a heuristic alternative: they create word lists for each label class, and then perform prediction by selecting the class whose list matches the largest number of words in the text. This heuristic is theoretically unjustified, and mistakenly assigns the same importance to every word in the list. I show that list-based classification can be viewed as a (very!) special case of Naive Bayes. Based on this analysis, it is possible to estimate weights for each word without supervision, using the method of moments.
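As a rough illustration of the word-list heuristic described above (a sketch, not the speaker's implementation), the Python below counts how many tokens match each class's list and picks the class with the most matches. The function name `list_based_predict`, the example word lists, and the optional `weights` argument are all hypothetical. The weights are supplied by hand here only to show how unequal word importance could change a prediction; the sketch does not estimate them with the method of moments.

```python
from collections import Counter


def list_based_predict(tokens, word_lists, weights=None):
    """Score each class by its (optionally weighted) count of matching words.

    With weights=None every listed word counts as 1.0, which is the plain
    word-list heuristic. Passing a dict of per-word weights is an
    illustrative extension, not the method from the talk.
    """
    counts = Counter(tokens)
    scores = {}
    for label, words in word_lists.items():
        scores[label] = sum(
            counts[w] * (1.0 if weights is None else weights.get(w, 1.0))
            for w in words
        )
    # Ties are broken in favor of the first class in word_lists.
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical word lists and input text, for illustration only.
word_lists = {
    "positive": {"good", "great", "love"},
    "negative": {"bad", "awful", "hate"},
}
tokens = "i love this great phone but the battery is bad".split()

label, scores = list_based_predict(tokens, word_lists)
weighted_label, weighted_scores = list_based_predict(
    tokens, word_lists, weights={"bad": 3.0}
)
```

In this toy input the unweighted heuristic sees two positive matches and one negative match, so it predicts the positive class; giving "bad" a larger hand-picked weight flips the prediction, which is the kind of difference that per-word weights are meant to capture.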
Second, machine learning approaches to text classification nearly always begin with an IID (independent and identically distributed) assumption. Yet words can mean different things to different people, raising the possibility of misunderstandings even in human-human conversation. One potential solution is to relax the IID assumption by personalizing text classifiers to the author. An apparent roadblock is the challenge of obtaining labeled data for each author. I will present a method that sidesteps this requirement by relying on the sociological theory of homophily, which states that people who are socially connected tend to share personal traits. This idea can be formalized by estimating node embeddings for each individual in a social network, and then using these embeddings to drive a social attentional mechanism in a neural ensemble classifier. The resulting system obtains significant improvements on sentiment analysis in Twitter. This project is joint work with Yi Yang.
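To make the idea of embedding-driven attention over an ensemble more concrete, here is a minimal sketch under stated assumptions; it is not the model from the talk. It assumes each ensemble member already outputs class probabilities for a text, and that each member has a key vector in the same space as the author's node embedding. The dot-product scoring, the names `social_attention_predict` and `attention_keys`, and all numbers are hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def social_attention_predict(author_embedding, attention_keys, basis_predictions):
    """Mix the ensemble members' predictions using author-specific attention.

    author_embedding:  node embedding of the text's author
    attention_keys:    one key vector per ensemble member (assumed form)
    basis_predictions: one class-probability list per ensemble member
    """
    # One attention logit per member: dot product of embedding and key.
    logits = [
        sum(a * k for a, k in zip(author_embedding, key))
        for key in attention_keys
    ]
    attention = softmax(logits)
    n_classes = len(basis_predictions[0])
    # Convex combination of the members' class probabilities.
    mixed = [
        sum(attention[m] * basis_predictions[m][c] for m in range(len(attention)))
        for c in range(n_classes)
    ]
    return attention, mixed


# Toy values: two ensemble members, two classes, 2-dimensional embeddings.
author_embedding = [1.0, 0.0]
attention_keys = [[2.0, 0.0], [0.0, 2.0]]
basis_predictions = [[0.9, 0.1], [0.2, 0.8]]

attention, mixed = social_attention_predict(
    author_embedding, attention_keys, basis_predictions
)
```

Because authors who are close in the social network receive similar embeddings, they would also receive similar attention weights in a setup like this, so an author with no labeled data can still be routed toward the ensemble members that suit their neighbors.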
Bio: Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on computational sociolinguistics, social media analysis, discourse, and machine learning. He is a recipient of the NSF CAREER Award, a member of the Air Force Office of Scientific Research (AFOSR) Young Investigator Program, and was a SICSA Distinguished Visiting Fellow at the University of Edinburgh. His work has also been supported by the National Institutes of Health, the National Endowment for the Humanities, and Google. Jacob was a postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award. Jacob's research has been featured in the New York Times, National Public Radio, and the BBC. Thanks to his brief appearance in If These Knishes Could Talk, Jacob has a Bacon number of 2.