Blog CS - Repository of things

I've been having a play with Twitter-sourced wordclouds this afternoon. Here's one I made with the search term 'Jeremy Clarkson':

Read on to see a few more, and find out how I made them!

To make these wordclouds I used the R package 'twitteR', combined with Jason Davies' D3-based wordcloud extension.

Data Extraction

This article by Juianhi explains how to set up a new application on Twitter so you can use their API with twitteR, and this site outlines a number of techniques for processing the returned data, such as removing 'stop words' (like 'the') and calculating word frequencies. All in all, the script below gets you as far as a CSV with two columns, 'word' and 'freq', which is the input to the visualisation step.

library(tm)
library(wordcloud)
library(RColorBrewer)
library(RCurl)

#Choose location
setwd("~/Data Science/Git Repository/TwitteR")

# Set SSL certs globally
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

require(twitteR)
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
apiKey <- "yourAPIKEY"
apiSecret <- "yourSecret"

twitCred <- OAuthFactory$new(consumerKey=apiKey,consumerSecret=apiSecret,requestURL=reqURL,accessURL=accessURL,authURL=authURL)
# handshake() prints an authorisation URL and then prompts for the PIN Twitter shows you

twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

registerTwitterOAuth(twitCred)

search_term <- "crossfit"
search_term_vector <- strsplit(search_term, " ")[[1]]
mach_tweets = searchTwitter(search_term, n=500, lang="en")
mach_text = sapply(mach_tweets, function(x) x$getText())

# create a corpus
mach_corpus = Corpus(VectorSource(mach_text))

# create document term matrix applying some transformations
tdm = TermDocumentMatrix(mach_corpus,
                         control = list(removePunctuation = TRUE,
                                        stopwords = c(search_term_vector, stopwords("english")),
                                        removeNumbers = TRUE, tolower = TRUE))

# define tdm as matrix
m = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE) 
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
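# remove terms which are websites
# (grepl rather than -grep: a negative index on an empty match would drop every row)
dm <- dm[!grepl("http", dm$word), ]
# remove terms containing non alphanumeric symbols
dm <- dm[!grepl("[^[:alnum:]]", dm$word), ]

# write out the top 200 terms (head copes with fewer than 200 rows)
write.table(head(dm, 200), "words.csv", sep=",", row.names=FALSE)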
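
To make the processing steps a bit more concrete, here is a rough sketch of the same idea (lowercase, strip punctuation and digits, drop stop words, count, sort) in plain JavaScript, the language used for the visualisation. It is only an illustration and not what tm does internally: `wordFreqs`, the tiny stop-word list and the sample sentences are all made up for this example.

```javascript
// Hypothetical helper, not part of tm or twitteR: counts word frequencies
// across an array of texts, skipping any word in stopWords.
function wordFreqs(texts, stopWords) {
  var counts = {};
  texts.forEach(function (text) {
    text
      .toLowerCase()
      .replace(/[^a-z\s]/g, "") // drop punctuation and digits
      .split(/\s+/)
      .forEach(function (word) {
        if (word === "" || stopWords.indexOf(word) !== -1) return;
        counts[word] = (counts[word] || 0) + 1;
      });
  });
  // Turn the counts into {word, freq} rows, most frequent first.
  return Object.keys(counts)
    .map(function (word) { return { word: word, freq: counts[word] }; })
    .sort(function (a, b) { return b.freq - a.freq; });
}

// Made-up input, for illustration only.
var rows = wordFreqs(
  ["The gym was busy today!", "Busy day at the gym, busy busy."],
  ["the", "was", "at"]
);
console.log(rows[0]); // the most frequent remaining word and its count
```

The rows have the same two fields, 'word' and 'freq', as the CSV written by the R script.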

Visualisation

For the visualisation step, I started with the example code from Jason Davies' GitHub repository. It needed a few tweaks: reading the words from a CSV rather than a hardcoded list, and adding a linear scale object to control the sizing of terms. Once that was done I was good to go.

<!DOCTYPE html>
<meta charset="utf-8">
<body>
<!-- Note you will need the d3 library and Jason Davies' d3.layout.cloud extension, available here:
https://github.com/jasondavies/d3-cloud
-->
<script src="../lib/d3/d3.v3.min.js"></script>
<script src="./resources/d3.layout.cloud.js"></script>
<script>
  var fill = d3.scale.category20();
  var scale = d3.scale.linear()
	.range([5,25]);
	
   d3.csv("./resources/words.csv", function(error, data) {

     // d3.csv returns every field as a string, so coerce freq to a number before taking the max
     scale.domain([0, d3.max(data, function(d) { return +d.freq; })]);

     d3.layout.cloud().size([600, 600])
      .words(data.map(function(d) {
        return {text: d.word, size: scale(d.freq)};
      }))
      .padding(0)
      .rotate(function() { return ~~(Math.random() * 2) * 90; })
      .font("Impact")
      .fontSize(function(d) { return d.size; })
      .on("end", draw)
      .start();
	});	

  function draw(words) {
    d3.select("body").append("svg")
        .attr("width", 600)
        .attr("height", 600)
      .append("g")
        .attr("transform", "translate(150,150)")
      .selectAll("text")
        .data(words)
      .enter().append("text")
        .style("font-size", function(d) { return d.size + "px"; })
        .style("font-family", "Impact")
        .style("fill", function(d, i) { return fill(i); })
        .attr("text-anchor", "middle")
        .attr("transform", function(d) {
          return "translate(" + [d.x+100, d.y+100] + ")rotate(" + d.rotate + ")";
        })
        .text(function(d) { return d.text; });
  }
</script>
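
The linear scale is what turns a term's frequency into a font size between 5 and 25 pixels. As a rough illustration of what it computes, here is a plain-JavaScript stand-in that runs without D3. `makeLinearScale` and the sample rows are made up for this sketch and are not part of D3. It also converts 'freq' to a number first, because d3.csv hands back every field as a string, and strings compare alphabetically rather than numerically.

```javascript
// Plain-JS stand-in for d3.scale.linear().domain([0, maxFreq]).range([5, 25]).
// Hypothetical helper for illustration; the page itself uses D3's scale.
function makeLinearScale(domain, range) {
  return function (x) {
    var t = (x - domain[0]) / (domain[1] - domain[0]);
    return range[0] + t * (range[1] - range[0]);
  };
}

// Made-up rows, shaped like what d3.csv returns: every field is a string.
var data = [
  { word: "alpha", freq: "120" },
  { word: "beta", freq: "60" },
  { word: "gamma", freq: "9" }
];

// Convert to numbers before taking the maximum; as raw strings,
// "9" would sort above "120".
var maxFreq = Math.max.apply(null, data.map(function (d) { return +d.freq; }));
var fontSize = makeLinearScale([0, maxFreq], [5, 25]);

// The most frequent term gets the top of the range, the rest scale down linearly.
var sizes = data.map(function (d) { return fontSize(+d.freq); });
console.log(sizes);
```

Starting the domain at 0 rather than at the smallest frequency means a term with half the top count lands exactly halfway between the smallest and largest font sizes.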

So there you have it. And finally, as promised, here are a few more I made for fun:

"Crossfit":

"Boris Johnson":
