Post 6 – Scraping the web for data; Twitter

Twitter is an interesting platform and medium. It is a global source that is accessible to anyone with internet access or a mobile phone, and because of this it has redefined how quickly news can be spread or broken. If you want to break a story, or spread news about a particular topic, Twitter is your best friend. You aren’t following the person you want to reach? No problem. As long as you have an account you can share your opinion with, or inform, anyone you like, even if the subject is not amongst the popular topics of pop culture, technology, breaking news, or politics. Through its hashtags and trending topics, Twitter is easy to navigate, and files everything into neat little boxes, with further hashtags acting as sub-topics.

But what makes Twitter unique? What sets it apart from every other social media platform that keeps people connected and allows sharing? Twitter users are restricted to a 140-character limit in every post. This may sound easy to work within, but it is not when trying to condense complex readings into a short sentence. Because Twitter is generally used to spread breaking news, natural or human disasters, or popular issues, this restriction lets the point get across immediately. While keeping it concise means your attention is grabbed instantly, the challenge is shaping the post so that it still makes sense. There is nothing worse than a post with very important words but nothing connecting them. The tone of the post also contributes. Most posts on Twitter fall into two categories: opinionated (and biased), or informative (and educated).

With all of this in mind, it was time to undertake the web scraping task. Originally, the Twitter Advanced Search paired with the Twitter Archiver add-on seemed like the ideal tool to use. Not only was this task required, I also wanted to use it for my own benefit, to expand my knowledge of the Internet of Things and data privacy in general. The process of scraping the data with the Twitter Advanced Search and Archiver was simple: the words ‘data’ and ‘ownership’ had to be present, and ‘privacy’ was an optional keyword. However, this didn’t turn up much, and it felt as though the search was moving away from the originally intended issue. A few posts back, the Internet of Things was the specific issue within data that was being investigated. To get back on track, I conducted more secondary research and repeated some previous class exercises. By doing this, I hoped to get back onto an issue that was talked about more, and that I could possibly create some visual design responses for.

So here comes the tool Brand24: an online program that businesses can use to monitor what social media users are saying about their company, with the additional feature of being able to respond to them. With a new focus in mind, a new process was developed, helped along by the added features and functions of Brand24. The first step is for the tool to search the internet for any posts with the exact phrase ‘Internet of Things’ and the added keyword ‘privacy’. From here, the process is to restrict the search to Twitter posts, and then play around with the keywords. Based on the previous results, some keywords could be added to narrow the outcomes further, or excluded words could be entered to hopefully target particular users or situations. The next stage of the process is to play around with the added features of the influence slider and the emotion scale. The influence slider shows which tweets or people held the most influence in the search in terms of visits, retweets, comments and likes, while the emotion scale lets you filter for positive, negative or the default neutral posts. These extra features could aid the process, as well as the type of results, as I could see whether the tool was accurate in its findings and get straight to the most popular tweets surrounding the issue. The final stages of the process are to visit the top sites tweeted about, to expand my understanding of the issue further, and to revisit the saved search often to view any developments.
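To make the search logic described above a little more concrete, here is a minimal Python sketch of the same idea: keep only Twitter posts that contain the exact phrase, contain every required keyword, and contain none of the excluded words. This is only an illustration of the filtering steps and not how Brand24 works internally; the `Post` class, the `matches` function and the sample posts are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """A hypothetical post record; not a Brand24 or Twitter data structure."""
    text: str
    source: str


def matches(post, phrase, required, excluded, source="twitter"):
    """Return True if the post passes every step of the search process."""
    text = post.text.lower()
    if post.source != source:          # only search through Twitter posts
        return False
    if phrase.lower() not in text:     # the exact phrase must be present
        return False
    if any(word.lower() not in text for word in required):
        return False                   # every required keyword must appear
    # Simplification: plain substring checks, so 'company' also matches 'accompany'.
    return not any(word.lower() in text for word in excluded)


# Invented sample posts, for illustration only.
posts = [
    Post("Is the Internet of Things a privacy nightmare?", "twitter"),
    Post("Internet of Things privacy tips for your business", "twitter"),
    Post("Internet of Things gadgets I love", "twitter"),
    Post("Internet of Things and privacy explained", "blog"),
]

kept = [
    p for p in posts
    if matches(p, "Internet of Things", ["privacy"], ["business", "company", "patient"])
]

for p in kept:
    print(p.text)
```

Changing the `required` and `excluded` lists is the code equivalent of playing around with the keywords in the tool.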

Proposed process

Below is a flow chart that demonstrates the process that was actually taken in this web scraping task.

Actual process taken

The process itself, along with the Brand24 tool, proved to be a good combination. The detailed and generative process that was designed was enhanced by the features and added functions of the web scraper. The combination allowed me to explore a topic that was both specific and broad. I could begin with a broad area such as the Internet of Things, and narrow it down with ‘privacy’ keywords. Also, having excluded keywords such as ‘business’, ‘company’ and ‘patient’ allowed the search to zero in on more general posts that were hopefully aimed at the everyday social media user. It was interesting to see which posts were collated when these words weren’t included.

The parameters

This exclusion did work; however, I felt that the results were very informative and unemotional, although this was common to all of the posts gathered. Furthermore, the influence slider turned out to be both an advantage and a disadvantage. It was an advantage because it could narrow the search down to the most popular tweets, eliminating a lot of the retweets. It was also a disadvantage because, as the slider was increased, two things happened: almost all of the results were about 5 original posts retweeted multiple times, and some of the less retweeted, original content was eliminated, which was ultimately a loss.
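The trade-off with the influence slider can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the `influence` numbers, the sample posts and the helper functions are invented, and Brand24's real influence score is not something this post documents. The sketch raises a minimum-influence threshold and counts how many distinct original posts are left once retweets are folded back into the post they repeat.

```python
from collections import Counter


def strip_retweet_prefix(text):
    """Turn 'RT @user: original text' back into 'original text'."""
    if text.startswith("RT @") and ": " in text:
        return text.split(": ", 1)[1]
    return text


def filter_by_influence(posts, minimum):
    """Keep posts whose (hypothetical) influence score meets the threshold."""
    return [p for p in posts if p["influence"] >= minimum]


def count_originals(posts):
    """Count how many times each original post appears, retweets included."""
    return Counter(strip_retweet_prefix(p["text"]) for p in posts)


# Invented sample data; the influence values are assumptions for illustration.
posts = [
    {"text": "IoT devices leak more than you think", "influence": 9},
    {"text": "RT @alice: IoT devices leak more than you think", "influence": 7},
    {"text": "RT @bob: IoT devices leak more than you think", "influence": 6},
    {"text": "My smart kettle knows too much about me", "influence": 2},
]

low = count_originals(filter_by_influence(posts, 1))
high = count_originals(filter_by_influence(posts, 5))

print(len(low), "distinct posts at a low influence value")
print(len(high), "distinct posts at a high influence value")
```

With the threshold raised, the less retweeted original post drops out and only the heavily retweeted one remains, which is the loss described above.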

Examples of results with a low influence value
Examples of results with a high influence value

As implied previously, a lot of the posts were just statements, or the name of the article or document attached to the tweet. If they did express an opinion, they were usually direct retweets of the original opinion. This made things difficult, as I was hoping to discover some original posts that gave an opinion on the privacy issues. However, these were far too rare, possibly due to either the breadth of the data and privacy topic, or Twitter as a platform and its character limit. Overall, this facet was a little disappointing.

Examples of the expansive retweeting

As for the Brand24 tool, it decides whether each post is positive, negative, or neutral, but it often gets it wrong. If there is a negatively associated word in a positive post, it seems to judge the post on that word alone. If there is a link in the post, it generally marks it as neutral. The same happens if the post is a statement and not an opinion. The tool therefore gets it wrong a lot of the time, skewing the results, possibly because it lacks the human decision-making element.
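To show why a word-based approach would produce the mistakes described above, here is a deliberately naive Python sketch. It is not Brand24's algorithm, which I have no knowledge of; the word lists, the `naive_sentiment` function and the example posts are all assumptions made up to reproduce the behaviour I observed.

```python
# Tiny, made-up word lists; a real tool would use something far larger.
NEGATIVE = {"risk", "nightmare", "hack", "worst"}
POSITIVE = {"great", "love", "secure", "best"}


def naive_sentiment(text):
    """Label a post using single words only, with no sense of context."""
    words = {w.strip(".,!?:;'\"").lower() for w in text.split()}
    if any(w.startswith("http") for w in words):
        return "neutral"    # any post with a link is treated as neutral
    if words & NEGATIVE:
        return "negative"   # one negative word outweighs everything else
    if words & POSITIVE:
        return "positive"
    return "neutral"        # plain statements fall through to the default


examples = [
    "Great guide to reducing privacy risk in the Internet of Things",
    "IoT privacy is a nightmare https://example.com/article",
    "Smart homes and privacy: what the law says",
]

for post in examples:
    print(naive_sentiment(post), "|", post)
```

The first example is a positive post that gets labelled negative because of the word ‘risk’, and the second is a clearly negative post labelled neutral because it contains a link. A person reading either post would not make those mistakes.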

Negative tweet that's been categorised as neutral
Possibly positive tweet categorised as neutral

With these results in mind, there are a few visual design responses that could arise, although these are strictly initial concepts. Firstly, a response could be a set of posters or a service design that aims to educate and inform users about the lack of, or hidden nature of, privacy in the Internet of Things. Along the same lines, the response could be a system or service in the IoT, such as an app that acts as a VPN. It could be a new login screen on social media apps that lets users opt out of monitoring. Another response could be a flyer placed in the boxes of new appliances and products to warn people of their connection to the internet or iCloud.

Since this post was so large in content, ideas and data, here are my findings from the web scraping and from the task as a whole.

  1. Twitter allows for short posts, but this also restricts what a person can say, as shown by the extensive retweeting that occurs.
  2. With a topic as broad, new and big as the Internet of Things, most of the posts are informative and statement-based.
  3. It is best to search around for a web scraper or tool that works best for you, as it could make the process easier.
  4. Even though the process didn’t work the first time around, I kept trying and changing the parameters until I found something that was both interesting and produced reasonable results. Playing around with the parameters meant that different dynamics could be explored.
  5. When working with data and web scrapers, the task doesn’t always go to plan. Computers don’t think like us humans; they don’t see the emotional side.



Featured Image:

Twitter_cover n.d., Theme Expert, Google Images, viewed 12 September 2016, <;
