Annotation: A Survey on Sentiment Analysis and Opinion Mining Techniques

The phrase “sentiment analysis” is high on the list of search terms for anyone seeking to understand how to process social media data. It’s a component of Natural Language Processing (NLP), where a machine extracts (somewhat) accurate meaning from human language and textual information.

I’ve talked with a number of CS friends who tell me how hard this is. For example, if you’re a computer, what do you make of a sentence like this:

“The man shot the elephant in his pajamas.”

For the moment let’s leave aside the crime of shooting an elephant. I agree this is a problem.

For the computer, the problem is less about morality and more about who is wearing the pajamas, that is, who “his” refers to. This isn’t a problem for you or me, because we just understand implicitly that elephants are extremely unlikely to wear pajamas. But “understanding implicitly” is not something computers generally do, unless we tell them what is implicit.

Unless I’m wrong, because the whole AI field and NLP seem to be moving forward fast once again.

Back to sentiment analysis. In reading this journal article I learned about sentiment lexicons, like SentiWordNet, which consist of words classified by emotional content or polarity (e.g. positive, negative, or objective).  A machine reading “the man shot the elephant in his pajamas” might simply extract that this was a violent event. If I posted on Facebook that “I am appalled that the man shot the elephant in his pajamas,” then the machine might conclude that “I” am expressing a highly negative sentiment toward shooting elephants. This is probably meaningful information about me to someone, for example the Wilderness Society, and possibly the NRA.

OK there’s still an elephant in the room so here it is: the annotation.

Kaur, Amandeep, and Vishal Gupta. “A Survey on Sentiment Analysis and Opinion Mining Techniques.” Journal of Emerging Technologies in Web Intelligence, vol. 5, no. 4, Nov. 2013, pp. 367–71.

For anyone seeking to use social media data to inform effective communication campaigns targeting social media users, it is essential to analyze their attitudes and opinions regarding the subjects of the communications. The term “sentiment analysis” is commonly used in reference to techniques for capturing public opinion on commercial products, political candidates and policy preferences, and various social events and movements. In this paper, Kaur and Gupta provide a survey of the literature on the subject, and an overview of the terminology and main techniques for extracting sentiment from social network data.

They use the term sentiment analysis to describe a general category of applications using Natural Language Processing (NLP) to extract and classify signifiers of opinion, mood, and emotion from a corpus of social media data. They identify this as a multidisciplinary problem for Artificial Intelligence, to “minimize the gap between human and computer,” and a process of “mining the text and classifying user sentiments, likes, dislikes, and wishes.” They describe three common approaches to sentiment analysis. In the “subjective lexicon” approach, a text is analyzed against a list of words that have each been scored as positive, negative, or objective. N-gram modeling works on sequences of consecutive words of a given length (“unigram” for one word, “bigram” for two, “trigram” for three, etc.), and/or combinations of such sequences, to extract probable meaning. “Machine learning” is a general term for learning a language model from the features of a given text.
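To make the n-gram terminology concrete, here is a minimal Python sketch of my own; it is not code from the paper. The helper name ngrams and the whitespace tokenization are assumptions made for illustration, and a real system would handle punctuation and casing with more care.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive whitespace tokenization (an assumption made for this sketch).
tokens = "the man shot the elephant in his pajamas".split()

unigrams = ngrams(tokens, 1)   # single words
bigrams = ngrams(tokens, 2)    # pairs of adjacent words
trigrams = ngrams(tokens, 3)   # triples of adjacent words

print(bigrams[:3])
print(trigrams[-1])
```

Note that the trigram ending the sentence keeps “in his pajamas” together, but nothing in the sequence itself says whether it attaches to the man or the elephant. Counting n-grams gives a model more context than single words, not understanding.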

The authors then describe the process of sentiment analysis. The first step is the generation of a sentiment lexicon for processing and classification. A number of techniques have been used, including a simple comparison of the words of an input text with a set of terms for which positive or negative associations have been assigned. They cite the development of SentiWordNet as a multi-language lexicon resource for extracting sentiment, and further techniques that combine single-word analysis with multi-word expressions, polarity-based word count within paragraphs, and rules based on antonym generation, among others.
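The simplest technique mentioned above, comparing the words of an input text against a scored list, can be sketched in a few lines of Python. This is my own illustration under stated assumptions, not the authors' method: TOY_LEXICON and its scores are invented for the example and are not taken from SentiWordNet, and the function names are hypothetical.

```python
# Hypothetical toy lexicon: words and scores are invented for illustration
# and do not come from SentiWordNet or from the paper.
TOY_LEXICON = {
    "appalled": -1.0,
    "shot": -0.5,
    "comfortable": 0.5,
    "love": 1.0,
}


def polarity_score(text, lexicon=TOY_LEXICON):
    """Sum the scores of known words; unknown words contribute 0."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return sum(lexicon.get(w, 0.0) for w in words)


def classify(text, lexicon=TOY_LEXICON):
    """Map the summed score to a coarse label."""
    score = polarity_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "objective"


print(classify("I am appalled that the man shot the elephant in his pajamas."))
print(classify("The car is comfortable"))
```

A word count like this ignores negation, sarcasm, and word order, which is presumably why the survey goes on to list multi-word expressions and rule-based refinements.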

They then discuss the categorization of text as subjective or objective. Subjective statements contained in a social media post indicate opinion, and thus are stronger signals of personal sentiment. For example, “the car is comfortable” is classifiable as subjective, whereas “Maruti launched a new car” is an objective statement of fact and less meaningful as a signal of sentiment about Maruti cars.

In addition to polarity, subjectivity, and objectivity, sentiment analysis may be concerned with more specific affective classes such as happiness, anger, sadness, and surprise. The authors mention ConceptNet as an example corpus for “sentic computing,” where states of sensitivity, attention, pleasantness, and aptitude are analyzed as a basis for sentiment classification.

The authors conclude with suggestions for further work on sentiment analysis research, including cross-language sentiment sense mapping, and resources for many more of the languages used in their home nation of India.