Wordclouds!
I've been having a play with Twitter-sourced wordclouds this afternoon. Here's one I made with the search term 'Jeremy Clarkson':
Click the permalink below to see a few more, and find out how I made them!
To make these wordclouds I used the R package 'twitteR', combined with Jason Davies' D3-based wordcloud extension.
Data Extraction
This article by Juianhi explains how to set up a new application on Twitter so you can use their API with twitteR, and this site outlines a number of techniques for processing the returned data, for instance removing 'stop words' (like 'the') and calculating word frequencies. All in all, the script below gets you as far as a CSV with two columns, 'word' and 'freq', which is the input to the visualisation step.
library(tm)
library(wordcloud)
library(RColorBrewer)
library(RCurl)
#Choose location
setwd("~/Data Science/Git Repository/TwitteR")
# Set SSL certs globally
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
require(twitteR)
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
apiKey <- "yourAPIKEY"
apiSecret <- "yourSecret"
twitCred <- OAuthFactory$new(consumerKey=apiKey,consumerSecret=apiSecret,requestURL=reqURL,accessURL=accessURL,authURL=authURL)
# handshake() prints an authorisation URL; open it in a browser and enter the PIN Twitter gives you
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
registerTwitterOAuth(twitCred)
search_term <- "crossfit"
search_term_vector <- strsplit(search_term, " ")[[1]]
mach_tweets = searchTwitter(search_term, n=500, lang="en")
mach_text = sapply(mach_tweets, function(x) x$getText())
# create a corpus
mach_corpus = Corpus(VectorSource(mach_text))
# create document term matrix applying some transformations
tdm = TermDocumentMatrix(mach_corpus,
control = list(removePunctuation = TRUE,
stopwords = c(search_term_vector, stopwords("english")),
removeNumbers = TRUE, tolower = TRUE))
# define tdm as matrix
m = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
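# remove terms which are websites
# (a logical mask from grepl is safe even when nothing matches; dm[-grep(...),] would drop every row in that case)
dm <- dm[!grepl("http", dm$word),]
# remove terms containing non alphanumeric symbols
dm <- dm[!grepl("[^[:alnum:]]", dm$word),]
# write out the top 200 terms (or fewer, if there are not that many)
write.table(head(dm, 200), "words.csv", sep=",", row.names=FALSE)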
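If you're curious what the stop-word removal and frequency count boil down to once you take the tm package away, here is a minimal sketch in JavaScript (the language of the visualisation step). It is not a port of the R script: `wordFrequencies` is a hypothetical helper, the stop-word list passed in is whatever you supply rather than tm's `stopwords("english")`, and punctuation is simply stripped from each token instead of dropping the term.

```javascript
// Hypothetical helper: count word frequencies across a list of texts,
// skipping any word found in the supplied stop-word list.
function wordFrequencies(texts, stopWords) {
  // Prototype-less objects, so words like "constructor" are not mistaken for entries.
  var stop = Object.create(null);
  stopWords.forEach(function (w) { stop[w.toLowerCase()] = true; });

  var counts = Object.create(null);
  texts.forEach(function (text) {
    text
      .toLowerCase()
      .split(/\s+/)
      // drop links first, before punctuation stripping turns them into junk words
      .filter(function (token) { return token.indexOf("http") === -1; })
      // keep letters only (removes punctuation and digits)
      .map(function (token) { return token.replace(/[^a-z]/g, ""); })
      .filter(function (word) { return word.length > 0 && !stop[word]; })
      .forEach(function (word) { counts[word] = (counts[word] || 0) + 1; });
  });

  // Turn the tally into rows of {word, freq}, most frequent first.
  return Object.keys(counts)
    .map(function (word) { return { word: word, freq: counts[word] }; })
    .sort(function (a, b) { return b.freq - a.freq || a.word.localeCompare(b.word); });
}
```

Called with a couple of tweets and a stop-word list such as `["the", "is"]`, it returns rows with the same two fields, 'word' and 'freq', that the CSV above contains.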
Visualisation
For the visualisation step, I started with the example code from Jason Davies' GitHub. It needed a few tweaks: reading the words from a CSV rather than having them hardcoded, and adding a linear scale object to control the sizing of terms. Once that was done I was good to go.
<!DOCTYPE html>
<meta charset="utf-8">
<body>
<!-- Note you will need the d3 library and Jason Davies' d3.layout.cloud extension, available here:
https://github.com/jasondavies/d3-cloud
-->
<script src="../lib/d3/d3.v3.min.js"></script>
<script src="./resources/d3.layout.cloud.js"></script>
<script>
var fill = d3.scale.category20();
var scale = d3.scale.linear()
.range([5,25]);
d3.csv("./resources/words.csv", function(error, data) {
scale.domain([0, d3.max(data, function(d) { return +d.freq; })]);
d3.layout.cloud().size([600, 600])
.words(data.map(function(d) {
return {text: d.word, size: scale(+d.freq)};
}))
.padding(0)
.rotate(function() { return ~~(Math.random() * 2) * 90; })
.font("Impact")
.fontSize(function(d) { return d.size; })
.on("end", draw)
.start();
});
function draw(words) {
d3.select("body").append("svg")
.attr("width", 600)
.attr("height", 600)
.append("g")
.attr("transform", "translate(150,150)")
.selectAll("text")
.data(words)
.enter().append("text")
.style("font-size", function(d) { return d.size + "px"; })
.style("font-family", "Impact")
.style("fill", function(d, i) { return fill(i); })
.attr("text-anchor", "middle")
.attr("transform", function(d) {
return "translate(" + [d.x+100, d.y+100] + ")rotate(" + d.rotate + ")";
})
.text(function(d) { return d.text; });
}
</script>
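To make the sizing step a bit more concrete, here is a small standalone sketch of what the linear scale does, written without D3 so it runs on its own. `linearScale` is a hypothetical stand-in for `d3.scale.linear()`, and the three rows are made-up sample data. One thing worth knowing: `d3.csv` hands every field over as a string, and `d3.max` compares values in natural order rather than numerically, so "9" would beat "120" unless the frequency is coerced to a number first.

```javascript
// Plain-JavaScript stand-in for d3.scale.linear(), for illustration only.
function linearScale(domain, range) {
  return function (value) {
    var t = (value - domain[0]) / (domain[1] - domain[0]);
    return range[0] + t * (range[1] - range[0]);
  };
}

// Made-up rows, shaped like those d3.csv passes to its callback:
// every field is a string.
var rows = [
  { word: "gym", freq: "9" },
  { word: "coach", freq: "10" },
  { word: "wod", freq: "120" }
];

// Coerce with unary plus before taking the maximum.
var maxFreq = Math.max.apply(null, rows.map(function (d) { return +d.freq; }));

// Same domain and range as the page above: 0..max frequency maps to 5..25 px.
var scale = linearScale([0, maxFreq], [5, 25]);
var sized = rows.map(function (d) {
  return { text: d.word, size: scale(+d.freq) };
});
```

The most frequent term lands on the top of the range (25px) and everything else is placed proportionally between 5px and 25px, which is how the range passed to the scale controls the overall look of the cloud.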
So there you have it. And finally, as promised, here are a few more I made for fun:
"Crossfit":
"Boris Johnson":
