Twitter Movie Review – Chennai Express


In the spirit of my first post (Pappu Vs. Feku), I will continue to explore the use of Twitter as a window into events of contemporary interest, and movies certainly capture the interest of a large majority of the Indian audience. So I am looking at Chennai Express, which was released last week and has been generating all kinds of buzz both online and offline. The idea is to come up with some sort of review of the movie by fetching tweets about it and analyzing them with R code. I managed to pull about 1500 tweets (note: this is the maximum the Twitter API allows) that were directly talking about the movie. After a bit of clean-up and removing terms not helpful for review purposes (read: Karan Johar and Taran Adarsh), this is what the whole Twitter chatter about Chennai Express looks like:

[Image: tag cloud of tweets about Chennai Express]

Chennai Express on Twitter

For those who are interested in seeing a table, here is the list of the top five most frequent terms:

Word Frequency
Superb 195
Entertainer 188
Highest 120
Broken 109
Smashing 98

Since this review comes almost three days after the release, the tag cloud shows a lot of talk about the movie's box office collection. Still, we can easily see that a lot of people have described the movie as superb and entertaining. This is in line with the huge box office success that the movie has turned out to be.

Obviously such an approach only shows the loudest voices and can easily mask the overall opinion. To be somewhat more nuanced, I have drilled down a bit further and done some basic sentiment analysis. To keep things simple, I have examined sentiment only in terms of positive and negative. I break all the tweets down into individual words and create a sentiment score by classifying each word as positive or negative. For classification I am using Hu and Liu’s lexicon of about 6800 words (the download link is in the code comments below). Without getting into too many technical details, below is the sentiment score I calculated.

Positive Percentage : 61%

It must be noted that this number is based only on the positive and negative words in the tweets and ignores the large number of words that do not really express either sentiment. Also, I have no way of accounting for sarcasm or for vernacular words and phrases. But I think that, overall, such instances are likely to make up less than 15%. Going ahead, I would also like to include a map of clusters of words that occur together more often, in order to get a better idea.
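To make the arithmetic behind the percentage concrete, here is a minimal sketch in R. The two tiny word lists and the tokens are made up for illustration (they stand in for Hu and Liu's lexicon and the real tweets), so treat this as a toy version of the idea rather than the exact scoring code.

```r
# Toy lexicons (hypothetical; the real analysis uses Hu and Liu's word lists)
pos_words <- c("superb", "entertaining", "good")
neg_words <- c("boring", "bad")

# Hypothetical tokens, assumed to be already cleaned and lower-cased
tokens <- c("superb", "movie", "boring", "entertaining", "train", "good")

n_pos <- sum(tokens %in% pos_words)  # tokens found in the positive list
n_neg <- sum(tokens %in% neg_words)  # tokens found in the negative list

# Neutral tokens ("movie", "train") never enter the ratio
pos_percent <- n_pos / (n_pos + n_neg)
print(pos_percent)  # 3 / (3 + 1) = 0.75
```

Note that the two neutral words change nothing: the score would be the same 75% whether the tweets contained two neutral words or two thousand.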

So let us compare this score with what other popular reviews say about Chennai Express:

IMDB – 7.5/10

Rediff.com – 3/5

Yahoo – 3/5

Bookmyshow – 3.5/5

It looks like my Twitter review has done decently well compared to the above. However, the real test would be to do this analysis on the day of the release or the next morning and make some prediction about the success of the movie. Going forward, this is the idea: I will provide an analysis of one or two movies per week, and maybe I will be able to come up with a decent predictor of movie performance beforehand. Obviously such a review can never go into the nuances of the film, and frankly that is not even intended here; the idea is just to explore how much can be understood from opinion on Twitter and see how it tallies with reality. Let me know your thoughts and ideas in the comments section.

Technical Details:

If you have read my first post (I suggest you do, at least the technical part), then this is basically just an extension of that. About fifty percent of the code is still the same. I have tried to be more careful with stop words: I created a quick and dirty tag cloud and expanded the list of stop words based on it. Also, unlike last time, I have removed all the screen names/user names, since popular names tend to create a lot more noise and add little value. Finally, the sentiment analysis code is heavily influenced by a couple of online sources, which I mention at the end.
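As a rough illustration of the clean-up described above, here is a minimal sketch on a single made-up tweet. The tweet text, the user name, and the shortened stop list are assumptions for illustration only; the full code below applies the same idea to all the tweets with a much longer stop list.

```r
# Hypothetical tweet text, made up for illustration
tweet <- "RT @someuser Chennai Express is SUPERB fun 100cr http://t.co/abc"

# Split the tweet into tokens on single spaces
tokens <- strsplit(tweet, " ")[[1]]

# Keep purely alphabetic tokens; this drops @names, numbers and links
tokens <- tokens[grepl("^[[:alpha:]]+$", tokens)]
tokens <- tolower(tokens)

# Drop the retweet marker and stop words (stop list shortened for the example)
stop_list <- c("is", "rt")
tokens <- tokens[!tokens %in% stop_list]

print(tokens)  # "chennai" "express" "superb" "fun"
```

Because the filter keeps only alphabetic tokens, screen names disappear without any special handling: the leading "@" is enough to disqualify them.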

Code:


library("ROAuth")
library("twitteR")
library("RJSONIO")
library("wordcloud")
library("tm")
library("plyr")
library("stringr")
library("xtable")  # provides xtable(), used below to print the top-10 table

load("twitter authentication.Rdata")

registerTwitterOAuth(twitCred)

#searchTwitter for Movie -Chennai Express here

movie <- searchTwitter('#ChennaiExpress', n = 1500,lang = 'en', cainfo = "cacert.pem")

save(movie, file = "X:/TweetSent/ce.Rdata")

ce <- twListToDF(movie)

write.csv(ce, file = "X:/TweetSent/ce.csv")

#test to remove numbers as well

test <- strsplit(ce$text, " ")

stp <- stopwords("en")

#Customizing stop words list

stp <- c(stp, 'record','till', 'hey', 'thu', 'paid', 'now', 'fri', 'nett', 'breakdown', 'meant', 'last', 'ke', 'ki', 'par', 'r', 'ur', 'aur', 'pic', 'sun', 'beautiful','last','celebration', 'justified', 'done', 'd', 'dat', 'u', 'say', 'rs', 'records', 'cr', 'gauri', 'thats', 'will', 'needs', 'just', 'fans', 'day', 'srk', 'today', 'crs')

movie_test <- lapply(test, function(x) x[grep("^[[:alpha:]]+$", x)])

movie_test <- unlist(movie_test)

movie_test <- tolower(movie_test)

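# Drop retweet markers ("rt"/"mt"); unlike -grep(), !grepl() keeps everything when nothing matches
movie_test <- movie_test[!grepl("^[rm]t$", movie_test)]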

movie_test <- movie_test[!movie_test %in% stp]

movie.ts <- as.data.frame(table(movie_test))

movie.ts <- movie.ts[sort.list(movie.ts$Freq, decreasing = TRUE),]

print(xtable(head(movie.ts, 10), caption = "Top 10 words in tweets"), file = "X:/TweetSent/tabletest", include.rownames = FALSE)

#Selecting color palettes for wordcloud

library(RColorBrewer)
 pal2 <- brewer.pal(8,"Dark2")

wordcloud(movie.ts$movie_test, movie.ts$Freq, scale = c(4, 0.5), min.freq = 5, random.order = F, colors = pal2)

#Sentiment analysis on movie.ts

#Downloaded list of positive & negative words (about 6800) from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

pos <- scan("X:/TweetSent/opinion-lexicon-English/positive-words.txt", what = 'character', comment.char = ';')

neg <- scan("X:/TweetSent/opinion-lexicon-English/negative-words.txt", what = 'character', comment.char = ';')

#Sentiment Scoring Algorithm

words <- movie_test

score.sentiment <- function(words, pos, neg, .progress = 'none')
{
  require(plyr)
  require(stringr)

  # Score each word against the lexicon: c(1, 0) if it is a positive word,
  # c(0, 1) if it is a negative word, c(0, 0) if it is in neither list.
  # Note: the function uses its 'words' argument, not the global movie_test.
  scores <- laply(words, function(word, pos, neg) {

    # comparing the word with the downloaded lexicon
    positive <- sum(!is.na(match(word, pos)))
    negative <- sum(!is.na(match(word, neg)))

    return(c(positive, negative))

  }, pos, neg, .progress = .progress)

  # One row per word: positive flag, negative flag, and the word itself
  scores.df <- data.frame(score = scores, text = words)

  return(scores.df)
}

final <- score.sentiment(words, pos, neg)

colnames(final) <- c("Positive", "Negative", "Word")

pos.percent <- sum(final$Positive) / (sum(final$Positive) + sum(final$Negative))

print(pos.percent)

Acknowledgements:

Twitter Coverage ISMB

First Shot: Sentiment Analysis in R

R Tutorial on Twitter text Mining
