Blog Post 6

Web Scrapes: Reddit, Twitter, Huntonprivacy

My chosen social media platforms for my data analysis and web scrapes are Reddit, Twitter and the Hunton privacy blog. The first platform I will discuss is Reddit. Reddit is a message board where users submit links. What differentiates it from a real-time information network like Twitter, Facebook or Instagram is that the stream of content is administered by the community: posts considered significant are upvoted, and those that are not are downvoted. This determines where each post is positioned on the site, with the most popular posts making it to the front page, which is seen by hundreds of thousands of people.

Reddit has its own advanced search tools. Apart from the standard word or phrase search, Reddit supports more detailed search terms, such as the following.

Use the following search parameters to narrow your results:

  • subreddit:subreddit - find submissions in "subreddit"
  • author:username - find submissions by "username"
  • site:example.com - find submissions from "example.com"
  • url:text - search for "text" in the URL
  • selftext:text - search for "text" in self-post contents
  • self:yes (or self:no) - include (or exclude) self posts
  • nsfw:yes (or nsfw:no) - include (or exclude) results marked as NSFW

These parameters allow you to see who posts to Reddit from other websites. For example, if you want to find posts linking to a particular website you can, which gives a deeper insight into how blog posts and articles circulate around the web, and why some gain more traction than others. The main search parameter I used to get targeted results, however, was selftext:; this search also shows you the different subreddits into which the results are distributed.
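As a sketch of how these operators combine, the snippet below builds a Reddit search URL from a query string and a few of the parameters listed above. The helper function and its defaults are my own illustration, not part of Reddit's tooling:

```python
from urllib.parse import urlencode

def build_reddit_search_url(query, subreddit=None, author=None, site=None,
                            self_only=None, nsfw=None, sort="relevance", t="year"):
    """Compose a Reddit search URL using the advanced search operators."""
    parts = [query]
    if subreddit:
        parts.append("subreddit:" + subreddit)
    if author:
        parts.append("author:" + author)
    if site:
        parts.append("site:" + site)
    if self_only is not None:
        parts.append("self:" + ("yes" if self_only else "no"))
    if nsfw is not None:
        parts.append("nsfw:" + ("yes" if nsfw else "no"))
    qs = urlencode({"q": " ".join(parts), "sort": sort, "t": t})
    return "https://www.reddit.com/search?" + qs

# The search used in this post: self posts mentioning "online security"
url = build_reddit_search_url("selftext:online security", self_only=True)
print(url)
```

Opening the resulting URL in a browser returns the same results as typing the operators into Reddit's search box by hand.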

[Flow chart]

From my Reddit search I found 1,842 different posts broken up into these categories. Each category listed below is a subreddit containing a varying number of results for my search parameters. When I arranged the posts by number of comments, on page 1 alone the total number of comments on the popular articles reached over 39,759. What makes the Reddit system unique is that the ranking of posts is built on the legitimacy of the author; combined with the upvote component of the system, this means current events may push a post higher because its relevancy has increased due to a connected event in the real world.


  1. /r/autotldr (470)
  2. /r/personalfinance (223)
  3. /r/legaladvice (189)
  4. /r/techsupport (149)
  5. /r/nosleep (104)
  6. /r/relationships (81)
  7. /r/metalgearsolid (68)
  8. /r/sysadmin (65)
  9. /r/Bitcoin (63)
  10. /r/dirtypenpals (60)
  11. /r/DarkNetMarkets (59)
  12. /r/ITCareerQuestions (59)
  13. /r/cscareerquestions (56)
  14. /r/AskNetsec (53)
  15. /r/conspiracy (52)
  16. /r/unitsd8u (51)
  17. /r/freedonuts (45)

One method that could be used to visualise this data would be to break down how many upvotes each post got and relate that number to Morse code or binary patterns. You could select several different posts, take the number of upvotes, and translate that number into binary. I feel this could be an interesting way to show how data can be represented in a language far removed from the one in which the information was created. The idea of visual symbols such as Morse code is also an engaging concept, for the idea behind Morse code was to communicate over large distances using only combinations of long and short light or sound signals. This shows how, even though we have incredibly sophisticated methods of communication, we could still use older methods now and interpret them as symbols for text.
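A minimal sketch of the translation step this idea relies on: both encodings are simple digit-by-digit (or bit-by-bit) substitutions, so a count like the 470 results from /r/autotldr above can be rendered either way. The function names are my own:

```python
# International Morse code for the digits 0-9
MORSE_DIGITS = {
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}

def upvotes_to_binary(upvotes):
    """Render a count as a binary string."""
    return format(upvotes, "b")

def upvotes_to_morse(upvotes):
    """Render a count digit-by-digit in Morse code."""
    return " ".join(MORSE_DIGITS[d] for d in str(upvotes))

print(upvotes_to_binary(470))  # -> 111010110
print(upvotes_to_morse(470))   # -> ....- --... -----
```

The binary strings could then drive the visual directly, e.g. a row of filled and empty dots per post.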

Web Scrape: Twitter

For my second chosen multimedia platform I chose the more obvious alternative for my web scrape and decided to use Twitter. Based on the information we were given in the lecture and the quick workshop in class, the embedded data extraction set up between Google documents and Twitter allows someone to very easily archive large sets of information without having to independently import each data set into a spreadsheet. One reason Twitter is also a good platform for correlating data on certain social issues is that the results you can attain from observing Twitter patterns are similar to the quantitative data you can get from a survey. Quite often, when people complete surveys they are answering questions from an interviewer, and there is a temptation to embellish or bend the truth, perhaps because an agenda is present. This is why Twitter is a fantastic platform for gathering an array of opinions without letting people know you are interested in their perspective on these social issues, allowing you to avoid bias or prejudice.

Twitter has some interesting search functionality; below are the main features this particular app utilises. Twitter allows you to send and read other users' updates or messages, which are limited to 140 characters. You can send and receive updates via the Twitter website, SMS, email or a third-party application, and you can restrict delivery to your circle of friends. You can search for people by name or username, import friends from other networks, or invite friends via email. Twitter also has an advanced search option that lets you search certain phrases, exact words or hashtags directly related to your particular interests. Using this advanced functionality you can also limit your search to accounts and posts that mention other specific accounts, which is a good way to find out whether certain Twitter accounts are posting consistently on a particular issue.
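The advanced search options described above all reduce to operators combined into one query string, which is also what a scraping spreadsheet would feed into the search. A small illustrative helper (my own naming, not Twitter's API):

```python
def build_twitter_query(exact_phrase=None, hashtags=(), mentioning=(), from_user=None):
    """Compose a Twitter advanced-search query string from its parts."""
    parts = []
    if exact_phrase:
        # Quoting forces an exact-phrase match
        parts.append('"%s"' % exact_phrase)
    parts.extend("#" + h.lstrip("#") for h in hashtags)
    parts.extend("@" + m.lstrip("@") for m in mentioning)
    if from_user:
        # Restrict results to a single account
        parts.append("from:" + from_user.lstrip("@"))
    return " ".join(parts)

query = build_twitter_query(exact_phrase="online privacy", hashtags=["infosec"])
print(query)  # -> "online privacy" #infosec
```

Pasting the resulting string into Twitter's search box gives the same results as filling in the advanced search form field by field.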

[Flow chart]

The data I found from the Twitter archive search was comprehensive to say the least. From my own refined search parameters I was able to find 279 tweets about online privacy in the past week, and that was using only certain phrases. The amount of data collected has been overwhelming: based on my search spreadsheet, the posts reached a total of 1,859,876 people, which is the combined number of followers linked to each post. The average user in this search has 6,914 followers, which means that, compared to blog searching and other article-based posts, the ability to reach millions of people through this micro-blogging app can prove incredibly effective. I think one of the most interesting parts of this search is seeing the occupation of each Twitter account; this kind of information is important when trying to understand the expectations of those accounts on certain issues, and whether someone may or may not be more inclined to retweet a post based on their own specific account information.
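The reach and average figures above come from summing the follower column of the archive spreadsheet. A sketch of that calculation on a small hypothetical sample (the column names and numbers here are invented for illustration, not my actual data):

```python
import csv
import io

# Hypothetical slice of the archive spreadsheet: one row per tweet,
# with the posting account's follower count.
sample = io.StringIO("""tweet,followers
"Online privacy matters",12000
"New data retention bill",3500
"Encrypt everything",5200
""")

rows = list(csv.DictReader(sample))
total_reach = sum(int(row["followers"]) for row in rows)   # combined followers
average_followers = total_reach / len(rows)                # per-account mean

print(total_reach)              # -> 20700
print(round(average_followers)) # -> 6900
```

Run over the full 279-tweet spreadsheet, the same two lines produce the totals quoted above.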

One interesting visual style in which you could represent this data would be to use the bird theme associated with Twitter to create a comprehensive data set on online privacy. You could do this by recording each tweet being spoken and then extracting the sound waves to show what each tweet looks like visually. I think this could be a really engaging method of data visualisation, and it could show some similarities in the patterns of thought people have about these kinds of issues.


Scraping the web for data: Huntonprivacy

For my web scrape I decided to scan a very prolific legal services firm called Hunton & Williams. They are a legal firm covering a range of issues, one being internet privacy and the future of online security. One particular component of the site is their blog, Huntonprivacy, where they post in-depth articles about current global events and new information regarding digital privacy. This scrape allowed me to gather information quickly about the history of online privacy over the past several years, and then to collate that data into a small infographic that allowed for some insights into the patterns of information posted about these issues.

The main purpose of Hunton & Williams LLP is to provide experience, breadth of knowledge and outstanding client service on legal issues. The firm is a leader in its field and has been recognised for the publications in its blog section.

There are no real unique qualities to this website or the platforms they use to share information, and I deliberately chose a platform outside of the regular social media spheres, for I wanted to gather information on published articles and not on public speculation. Whilst I feel public opinion is vital to understanding the issue, I wanted to see how much accurate information I could gather on relevant online privacy issues that have been significant over the past several years. Whilst this website is limited in its functionality, I used a web scraping service called import.io, which creates intelligent technology that translates the web into data.

My process for using the web scraping service was first to figure out what kind of information I needed, whether text or image based; because I was looking for articles, I used the scrape to target articles posted about digital privacy over the past several years. Then I had to set parameters such as defining a column tied to a website attribute to acquire information. For example, I chose one column, clicked on a title, and it gave me every single title on that URL and categorised them into a data set. Once you have all your URLs input into the data extractor and your parameters set up, the extractor gets all the relevant information based on the input parameters and gives you your results. This scrape gave me a huge array of articles, including extracts of the content, dates of posts, number of comments and the titles of the articles themselves.
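Under the hood, that "click a title, get every title" step amounts to matching one example element and collecting everything with the same tag and class. A rough stand-in for it using only the Python standard library (the page fragment and class name below are invented for illustration; import.io's internals are of course more sophisticated):

```python
from html.parser import HTMLParser

class ColumnExtractor(HTMLParser):
    """Collect the text of every element matching one tag/class pair,
    mimicking import.io's define-a-column-by-example extraction."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.capturing = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and ("class", self.cls) in attrs:
            self.capturing = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.values[-1] += data.strip()

# Hypothetical fragment shaped like a blog archive listing
page = """
<article><h2 class="entry-title">EU Data Protection Reform</h2></article>
<article><h2 class="entry-title">FTC Settles Privacy Case</h2></article>
"""

extractor = ColumnExtractor("h2", "entry-title")
extractor.feed(page)
print(extractor.values)  # -> ['EU Data Protection Reform', 'FTC Settles Privacy Case']
```

Repeating this for a date column and a comment-count column, across each archive URL, yields the same kind of data set the extractor produced.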


The original scan I created resulted in several hundred articles. Whilst this was a wellspring of information, the parameters were too vague to let me see the important pieces of data. Once I refined this search I was able to extract a series of articles from my chosen multimedia site and analyse that data to create some insights.

My data consists of several different pieces of information; I chose some target questions to show what data was relevant to my particular interests. My focuses for this scrape were: which country each article focused on; the number of articles posted in six-month blocks; whether articles focused on personal data or on corporate and private-sector data; and the content of each article, whether it was about biometrics, legislation, cookies, etc.

This scrape has allowed me to understand how larger-scale social networks approach these issues. For a blog site devoted to posting only content related to my issues, I found the data to be a lot drier and harder to categorise into relevant insights; however, I also found that information gained just by searching larger social media sites or using Google carried some level of prejudice. Overall I found this method of data mining to be interesting and practical if you want a broad range of sources quickly; however, if you have a more individualised agenda this method will return too much content unrelated to your targeted ideas.

One visual response that could come from understanding this data is to use little folder symbols, with varying thickness correlating to how much information has been posted, revolving in a spherical manner around a central main image, i.e. a large digital brain, computer or planet, with certain peaks based on which words were used, which countries were mentioned, or whether or not the article is positive. The image below is an example of how you could describe this kind of information in an interesting visual way. Using this kind of circular module can yield a compelling, displayable set of data whilst giving the viewer enough of a coherent data structure.

My 5-point summary is:

  • More articles are focused on large-scale data retention
  • The United States has the most content related to online privacy
  • There were more articles posted in 2014 than 2015 (based on my site)
  • The most popular discussion about online privacy concerns laws and legislation
  • There are 8 articles specifically discussing the FTC in regard to the US's privacy laws
Hunton Privacy Blog, web scrape
My infographic on my web scrape - Monty Hayton 2016

References

D13yacurqjgara.cloudfront.net. (2016). [online] Available at: https://d13yacurqjgara.cloudfront.net/users/314283/screenshots/2102279/voice-rec3_1x.png [Accessed 30 Aug. 2016].

Reddit.com. (2016). autotldr: search results – selftext%3Aonline+security. [online] Available at: https://www.reddit.com/r/autotldr/search?q=selftext%253Aonline%2Bsecurity&sort=relevance&restrict_sr=on&t=year [Accessed 30 Aug. 2016].

Reddit.com. (2016). reddit.com: search results – selftext:online security. [online] Available at: https://www.reddit.com/search?q=selftext%3Aonline+security&sort=relevance&restrict_sr=&t=year [Accessed 30 Aug. 2016].
