It seems like I have been going south since the blog started, first with the politicians and then Chennai Express. Continuing that journey, this time around I am looking at Madras Cafe to find out what the TwitterVerse has been saying about it. The process is pretty much the same as last time (see Twitter Movie Review – Chennai Express), however there are two improvements. To start with, the sample of tweets is larger: I am looking at about 3000 tweets (3099 precisely), and the tweets span from the morning of 23rd August to the early morning of the 24th. This helps in a couple of ways: first, we are likely to see a broader verdict, and second, the amount of noise in the tweets should be somewhat lower. It also helps that I collected the data pretty soon after the movie's release, and since the movie did not have as much of a PR and marketing blitzkrieg as Chennai Express, the tweets are much less noisy and more diverse.
This is how #MadrasCafe verdict looks on a word cloud.
Visually, it looks like Madras Cafe has got pretty good reviews and a thumbs-up from a lot of folks who have seen it and tweeted about it. Words like “awesome”, “superb”, “good” and “must watch” are frequently associated with it. We also see a lot of appreciation and chatter about John Abraham, who has produced the movie and acted in it as well. There is also a great deal of appreciation for the director Shoojit Sircar and for Nargis Fakhri. We also see some words not directly related to the movie, which is perhaps spillage from other topics that were being discussed or trending and were tagged #MadrasCafe.
Cluster analysis is a nice way to get better insights into the data. It helps to understand which objects, in our case terms or words, are closely associated with each other and which are far apart. So rather than just seeing the most frequent terms, we can look at terms that frequently occur together in a tweet. This helps to get a clear sense of what people are talking about when they use such terms. I have done some clean-up while ensuring that the structure of the data remains intact. This is needed because first, there is a lot of noise in the data and second, it is important to restrict the cluster analysis to finding important chunks of objects that go together. More on this in the technical part.
The diagram above shows what the clusters look like. I have divided the diagram into 10 clusters, indicated by red borders. The terms which appear higher in the clusters are the ones that occur more frequently, and the terms which are close to each other are the ones which occur together.
We can get a lot more insight about the terms we saw in the word cloud when we look at the clusters in which they occur. For example, the third cluster from the left is about the spillage I talked about earlier. Here we see the terms that occurred together, and it looks like some trolls were trying to get more visibility by using the Madras Cafe hashtag while tweeting about largely unrelated things. The first cluster on the left is about the director, and the one next to it is some sort of discussion about Nargis Fakhri. However, such terms sit slightly low on the diagram, indicating that they occur relatively less frequently.
The terms “watch” & “review” on the top right are the ones that occur a lot more frequently and also occur together often. The reason these two terms occur so frequently is that the data was extracted during the first one or two days of the release, when there were a lot of people who had watched or were planning to watch the movie. More importantly, this is also the time when leading critics, newspapers and websites post their reviews and tweet about them. Since they tend to have a lot of followers, such tweets get retweeted a lot, and as a result the frequency of these terms is higher.
The big cluster right in the middle is the one that we are most interested in. From the frequently occurring terms and the closeness of the terms, we can easily make out that this cluster is largely about the movie. Here we can confirm what we saw in the word cloud: the words “awesome”, “superb”, “kudos” and “brilliant” frequently occur with the words “story” and “acting”. There is also some talk about Chennai Express, probably a comparison of some sort. There is a second cluster towards the right which does seem to be related to the movie, but it looks more generic, perhaps talking about Bollywood and Indian cinema in the context of this movie. Only two terms in this cluster seem directly related to the movie, “political” & “thriller”.
Now let us look at the overall sentiment expressed on Twitter for the movie. Remember that in the last review, we saw Chennai Express getting about 61% positive sentiment on Twitter. In comparison, below is the sentiment for Madras Cafe:
Positive Percentage: 72%
Comparing this largely positive sentiment with the reviews and ratings from other places:
Again there is a close match between the sentiment from Twitter (specifically my code) and what is seen around the web. This is also in line with the word cloud and associations seen earlier. So in case you are geekily reading this review, you might want to check out the movie as well. :)
I have talked more in-depth about the sentiment analysis method I adopted in my earlier post, so if you are interested do check it out here.
One last thing that I was interested in is the number of tweets per user. This helps to understand whether there are a few individuals tweeting a lot or whether we have a more diverse group. It is important because there is a possibility that most of our tweets come from a few users who happened to be tweeting a lot about this topic during the time we downloaded the tweets. In that case our data set would be biased and the analysis might not represent the true picture. So what we are hoping to see via this analysis is a long tail graph showing many users tweeting in small amounts and only a few users with a high number of tweets. Here is what we find in our case:
Here we clearly see a long tail that sharply tapers off, and we have fewer than 10 users with more than 20 tweets about the topic. This is a good thing, as it indicates that the data we have does represent the opinion of many people and is not just limited to a few individuals.
Without disclosing the screen names, the highest tweeter has about 92 tweets. Having checked out the tweets, it looks as if the user is trying to get people to read his review by tweeting a lot and tagging many people. Among the others, there are a lot of tweets for a contest by Inox and from Box Office Updates.
Finally, thanks for reading, if you like what you just read, please share and do leave your comment/suggestions/question/PJ/or code :P . Thanks….!
The code is largely similar to last time and can be found at the end here. However, there are a couple of new additions, basically the cluster analysis and the long tail.
In the cluster analysis part, I have used Hierarchical Agglomerative Clustering, which is part of core R (the hclust function). I did some pre-processing of the data before clustering. For this part I took the raw data set and not the one used for the word cloud & sentiment analysis. The reason is that, while creating the word cloud, I just had words and their frequencies; everything else about the structure was largely lost. So I took the original data set and created a corpus, which is basically a vector of tweets. After that I did some clean-up of the data set and converted the entire data set into a term-document matrix. A term-document matrix is basically a matrix which describes the frequency of terms occurring in each of the documents; here the documents are tweets. The advantage of a term-document matrix is that it preserves the structure of the tweets. After this, I removed all the sparse terms to reduce the sparsity percentage. Since Twitter data contains a lot of noise, this is an essential step, or else our clusters will not come out properly. Finally, after this and lots of errors, I created the clusters. Here is the code for that part:
# Association and cluster analysis
# Note: mcafe (the list of downloaded tweets) and stp (the stop-word vector)
# are assumed to be defined earlier in the script.
library(tm)

mc.corpus <- lapply(mcafe, function(x) x$getText())
# Strips @mentions (this also drops the handle that follows "RT")
mc.corpus <- lapply(mc.corpus, function(x) gsub("[@][^ ]*", " ", x))
# Strips URLs
mc.corpus <- lapply(mc.corpus, function(x) gsub("http[^ ]*", " ", x))

# Creating the corpus and cleaning it up
mc.corpus <- Corpus(VectorSource(unlist(mc.corpus)))
mc.corpus <- tm_map(mc.corpus, content_transformer(tolower))  # wrapper needed in newer tm versions
mc.corpus <- tm_map(mc.corpus, removePunctuation)
mc.corpus <- tm_map(mc.corpus, removeWords, stp)
mc.corpus <- tm_map(mc.corpus, removeNumbers)

# Term-document matrix, then drop the sparse terms
mc.dtm <- TermDocumentMatrix(mc.corpus)
mc.dm <- removeSparseTerms(mc.dtm, sparse = 0.985)
mc.df <- as.data.frame(as.matrix(mc.dm))

# Creating hierarchical clusters
mc.scale <- scale(mc.df)
dm <- dist(mc.scale, method = "euclidean")
clust.mc <- hclust(dm, method = "ward.D")  # "ward" is called "ward.D" since R 3.1.0
plot(clust.mc)
groups <- cutree(clust.mc, k = 10)
rect.hclust(clust.mc, k = 10, border = "red")
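To make the steps above a bit more concrete, here is a minimal sketch on a handful of made-up tweets, using only base R so that it runs without the tm package. Everything in it is an assumption for illustration: the toy tweets, the variable names, the 0.8 sparsity cut-off and k = 2 are mine, and I have skipped the scaling step to keep it short. It is not the code that produced the diagram, just the same idea in miniature.

```r
# Toy stand-in for the cleaned tweets (made-up text, not real data)
tweets <- c(
  "awesome story superb acting",
  "superb story awesome acting",
  "watch review",
  "review watch today",
  "awesome acting",
  "watch review story"
)

# Split each tweet into words and collect the vocabulary
words <- strsplit(tweets, " ", fixed = TRUE)
terms <- sort(unique(unlist(words)))

# Term-document matrix: one row per term, one column per tweet,
# each cell holding how often the term occurs in that tweet
tdm <- vapply(
  words,
  function(w) as.integer(table(factor(w, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms
colnames(tdm) <- paste0("tweet", seq_along(tweets))

# Sparse-term removal: keep only terms that are absent from
# fewer than 80% of the tweets (hypothetical cut-off for this toy data)
empty_share <- rowMeans(tdm == 0)
tdm_dense <- tdm[empty_share < 0.8, , drop = FALSE]

# Hierarchical clustering of the terms: terms used in the same tweets
# have similar rows, so they end up close to each other
d <- dist(tdm_dense, method = "euclidean")
fit <- hclust(d, method = "ward.D")
groups <- cutree(fit, k = 2)
print(groups)
```

Calling plot(fit) followed by rect.hclust(fit, k = 2, border = "red") draws the small dendrogram with the two groups boxed, in the same way as the big diagram earlier.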
The process of creating the long tail was pretty simple. I extracted the screen names, created a table and sorted it by frequency. After this I plotted the screen names to get the graph that we saw. Here is the code for that:
# Long tail of users
library(ggplot2)

usr <- read.csv(file = "X:/TweetSent/mcafe.csv")
users <- as.data.frame(table(usr$screenName))
colnames(users) <- c("user", "tweets")
# Sort users from most to least active
user.sort <- users[order(-users$tweets), ]

# I() sets fixed values instead of mapping them to a legend
p <- qplot(1:nrow(user.sort), tweets, data = user.sort, color = I("red"), alpha = I(0.25), size = I(4), geom = "point", main = "Tweets per User", xlab = "Users", ylab = "Tweets")
p + theme_bw()
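Besides eyeballing the graph, a few numbers can back up the long tail reading. Below is a rough sketch on made-up screen names. The data, the variable names and the cut-off of 2 tweets are all assumptions for illustration; on the real data one would use usr$screenName and a cut-off such as 20.

```r
# Made-up screen names, one entry per tweet (not real data)
screen_names <- c(rep("user_a", 6), rep("user_b", 3),
                  "user_c", "user_d", "user_e", "user_f")

# Tweets per user, sorted from most to least active
counts <- sort(table(screen_names), decreasing = TRUE)
tweets_per_user <- as.vector(counts)

# Share of all tweets that come from the single most active user
top_share <- tweets_per_user[1] / sum(tweets_per_user)

# Number of users above a chosen cut-off (hypothetical value)
heavy_users <- sum(tweets_per_user > 2)

# Share of users who tweeted only once, i.e. the tail itself
single_share <- mean(tweets_per_user == 1)

print(counts)
print(c(top_share = top_share, heavy_users = heavy_users,
        single_share = single_share))
```

A small top share together with a large share of one-tweet users would point to the same conclusion as the graph: many people contributing a little each, rather than a few accounts dominating the sample.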