Exploring Bias: Comparing approaches for collecting Twitter data


  1. We are tracking the UK debate on the EU referendum to explore the various ways in which the public imagines the European Union. We are using Twitter to track trends in response to emerging events; this analysis gives us a more nuanced understanding of those who are motivated to comment on UK-EU-related topics. See bit.ly/1KKov9w for interactive visualisations of the data.
  2. We have collected Twitter data on the referendum debate for the last five months. We are using three methods for collecting data from Twitter: 1) using hashtags chosen by an expert panel as search queries; 2) collecting a random sample without specified search terms and automatically extracting the data relevant to the referendum; and 3) collecting from the three official campaign groups @vote_leave, @LeaveEUOfficial and @StrongerIn.

    Each method of collection influences what data will be collected, and therefore each data set has certain biases. The hashtag and random stream sets are heavily influenced by the terms used for data collection. Those terms differ greatly depending on whether they are automatically extracted (the random stream set) or chosen by experts (the hashtag set). The expert method is designed to follow a wider variety of terms that the experts expect will become discussion topics over the longer-term referendum debate, whereas the automatic method extracts data using terms that are commonly associated with known referendum-specific terms. Examining the three different sets allows us to contrast what is being collected and gives us a broader understanding of public and elite opinion. In particular, we are examining how topics differ between these data sets and how they influence each other.

    The hashtag set is the largest by a considerable amount. Over the five-month period, the set collected using hashtags contained 5,556,027 tweets, the set extracted from the random stream contained 8,777 tweets and the official campaign set contained 2,606 tweets. To determine how relevant the collected data is to the debate, we extracted 100 tweets from each set and asked three annotators to consider the relevance of each tweet in two ways: 1) whether it is directly relevant to the UK-EU referendum debate, and 2) whether it is about a topic that would likely influence voter opinion. We found that the data from the official campaign groups and the data automatically extracted from the random stream are more relevant to the topic than the data gathered using hashtags. The hashtag set has a low relevance score for 'directly relevant to the referendum debate', but this rises significantly when topics that will influence the debate are considered.
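
The relevance check described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the majority-vote rule for combining the three annotators' judgements is an assumption, since the text does not say how their answers were combined.

```python
import random


def sample_tweets(tweets, k=100, seed=0):
    """Draw k tweets at random from one data set (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(k, len(tweets)))


def majority_relevance(labels_per_tweet):
    """Share of tweets that more than half of the annotators marked as relevant.

    labels_per_tweet holds one tuple of booleans per tweet, one boolean per
    annotator. Majority voting is an assumed aggregation rule.
    """
    if not labels_per_tweet:
        return 0.0
    relevant = sum(
        1 for labels in labels_per_tweet if sum(labels) > len(labels) / 2
    )
    return relevant / len(labels_per_tweet)
```

Running `majority_relevance` once for each of the two relevance questions, on each of the three samples, would give six scores that can be compared across collection methods.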

  3. This was as we expected. The differences can be explained as follows. The official campaign set contains information from the campaign groups, who publish tweets in order to influence the debate; this gives us a small, very specific, very opinion-driven set. The random stream set contains data from the wider public, but only tweets that contain terms closely related to the debate, and so it is a very topic-specific set. The hashtag-gathered set is much larger and is collected using a wider variety of terms; it contains more non-relevant information but also covers topics likely to influence voters that are not identified in the other sets.
  4. When we look at tweeting frequency over time, we find that all of the collection strategies pick up increases in data volume on the same dates. These are dates on which events prompt both the campaigns and general Twitter users to tweet. For example, there is a peak in data collected on 12 October, when the Britain Stronger in Europe campaign was launched. Other such events include David Cameron's speech at Chatham House on 10 November 2015, setting out his case for EU reform, and his letter to Donald Tusk, sent on 11 November 2015.
  5. We did find that a lot more data was collected by the hashtag method in early September; on further inspection, this data relates to refugees and migrants. This shows that the campaign groups are not talking about the refugee and migrant issue and that it is not being directly related to the UK-EU referendum debate, but it is being widely discussed.
  6. Analysing the frequency of commonly used hashtags gives an indication of the topics discussed in each of the data sets. Much of the discussion in the tweets from the official campaigns and the random stream is directly related to the UK-EU referendum; this is echoed by the hashtags #brexit, #leaveeu, #voteleave and #euref being the four most frequent in both collections.
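
As a rough illustration of how such peaks could be located, the sketch below counts tweets per day and flags days whose volume is well above average. The function names, the ISO-format timestamp strings and the "twice the mean" threshold are all assumptions made for this example, not a description of the project's own tooling.

```python
from collections import Counter
from datetime import date, datetime


def daily_counts(timestamps):
    """Count tweets per calendar day from ISO-format timestamp strings."""
    return Counter(datetime.fromisoformat(ts).date() for ts in timestamps)


def peak_days(counts, factor=2.0):
    """Return the days whose tweet count is at least `factor` times the mean.

    The threshold is an arbitrary choice for illustration.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(day for day, n in counts.items() if n >= factor * mean)
```

Applying `peak_days` to each of the three data sets separately and comparing the returned dates is one simple way to check whether the collection strategies react to the same events.
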

    Hashtags with a pro-Leave sentiment appear more frequently in all three of our data sets. We do not see any pro-Remain hashtags appearing in the hashtag-gathered set, and only #strongerin and #remainineu in the random stream set. We have a very small number of pro-Remain hashtags in the official campaign data. We are collecting from the three campaign groups, and as only one is pro-Remain we would expect a lower level of pro-Remain hashtags in the official set, but not as low as we are seeing. This suggests that pro-Remain supporters do not use hashtags, that they use them in unexpected ways, or that there is a strong pro-Leave sentiment within Twitter.
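
A hashtag frequency count of this kind can be sketched as follows. This is a minimal example, assuming each tweet is available as a plain text string; the function name and the simple `#\w+` pattern are illustrative choices, and hashtags are lower-cased so that #Brexit and #brexit are counted together.

```python
import re
from collections import Counter

# A simple pattern for illustration: '#' followed by letters, digits or underscores.
HASHTAG = re.compile(r"#\w+")


def top_hashtags(tweets, n=4):
    """Return the n most frequent hashtags (lower-cased) with their counts."""
    counts = Counter(
        tag.lower() for text in tweets for tag in HASHTAG.findall(text)
    )
    return counts.most_common(n)
```

Calling `top_hashtags` on each data set in turn gives ranked lists that can be compared side by side.
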
  7. We also see another phenomenon within the data: hashtags are used to draw attention to specific themes. Within the official stream, certain hashtags have been heavily used by the two pro-Leave campaigns. For example, @LeaveEUOfficial launched #theknoweu, #justsaying, #fudgeoff and #twibbon, and @vote_leave launched #wrongthenwrongnow and #theinvisableman. We can see that #twibbon also appears in the random stream data set and has therefore cross-pollinated and is being discussed by the wider public. The @StrongerIn campaign does not seem to be using hashtags to the same extent and rarely uses any beyond #strongerin. It is possible that the lack of hashtag use by the @StrongerIn campaign means that its supporters are not using hashtags either. This is something we will need to investigate further.
  8. In the hashtag-gathered data, many of the top hashtags indicate a focus on the topic of refugees (#refugeeswelcome, #refugee, #refugeecrisis) and on other countries (#uk, #usa, #syria, #germany). In the random stream data we also see discussion of the referendum-specific terms #brexit and #leaveeu, but very little occurrence of the #strongerin hashtag.
  9. This is written as part of the #ImagineEurope project. The project is part of the Economic and Social Research Council's UK in a Changing Europe programme (ukandeu.ac.uk). Look out for our regular updates as the project tracks developments in the debate on the UK's continued membership of the EU, and follow us @myimageoftheEU on Twitter for more information on this and other projects.