BLOG POST 6: Scraping the web for data

‘Lexicons + The Internet Language’

The history and context of language are always changing and developing. As technology emerges and the Internet becomes part of everyday life, the way we consume media changes, and our language and vocabulary expand with it. Social media is a major contributor to the ways we communicate visually and audibly. The format and structure of each platform influence writing style as well as content. Twitter is a new form of media that delivers its messages within a 140-character limit. This restriction creates succinct, creative and empowering conversations that users can easily engage with and scroll through.

A lexicon is a linguistic resource that links a person’s vocabulary to sentiment (emotion): each word is marked as positive, negative or neutral. Small changes in how a word is written can change its message; for example, ‘NO!’ and ‘no’ convey different tones of voice. Twitter is a primary social media platform for informal language, and its tweets are generally a collection of colloquial expressions such as acronyms, abbreviations and misspelt or misused terms. Because of this vast range of expressions and variable factors, it is difficult to determine from the words alone whether a tweet is positive, negative or neutral, so emoticons are used as an additional signal.
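
To make the idea of a lexicon more concrete, here is a minimal sketch in Python. The word list and its scores are made up for illustration and are not taken from any real lexicon. It also shows the problem described above: informal spellings are simply not found, and ‘NO!’ and ‘no’ end up with the same score.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score (values are assumptions).
LEXICON = {"love": 1, "great": 1, "happy": 1, "no": -1, "hate": -1, "awful": -1}

def score_text(text):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def label(text):
    """Turn the summed score into 'positive', 'negative' or 'neutral'."""
    score = score_text(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label("I love this great day"))  # words found in the lexicon
print(label("gr8 2 c u lol"))          # informal spelling, nothing is found
print(label("NO!"), label("no"))       # tone of voice is lost
```

Because the text is lowercased and punctuation is dropped before the lookup, a plain word lexicon like this one cannot tell a shouted ‘NO!’ from a quiet ‘no’.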

Emoticons are a highly recognised feature of Internet language. They express a wide range of sentiment visually and are used all over the world. Emoticons are considered opinion lexicons and, unlike literal words, are stable for sentiment classification.

The default Twitter search allows users to add emoticons to a query to find positive or negative tweets. However, the majority of tweets do not contain emoticons, which limits this kind of search. Statistics reported in Databases Theory and Application: 25th Australasian Database Conference show that only 9.40% of tweets in 2011 contained at least one emoticon: 7.37% of tweets contained a positive emoticon and 2.03% a negative one (Sharaf & Wang 2014). These results show how limited emoticons are as a lexicon: they are used too rarely to classify most tweets.
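
The two figures add up to the total (7.37% + 2.03% = 9.40%), so roughly nine in ten tweets carry no emoticon signal at all. The sketch below shows one way such shares could be counted over a collection of tweets. The emoticon sets and the sample tweets are assumptions made for illustration; this is not the method used in the cited study.

```python
# Hypothetical emoticon sets, chosen for illustration only.
POSITIVE = {":)", ":-)", ":D", ";)"}
NEGATIVE = {":(", ":-(", ":'("}

def emoticon_label(tweet):
    """Return 'positive', 'negative', or None when there is no clear signal."""
    tokens = tweet.split()
    has_pos = any(token in POSITIVE for token in tokens)
    has_neg = any(token in NEGATIVE for token in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticon, or a mix of both kinds

def coverage(tweets):
    """Share of tweets per label, as fractions of all tweets."""
    labels = [emoticon_label(tweet) for tweet in tweets]
    total = len(labels)
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "unlabelled": 0.0}
    return {
        "positive": labels.count("positive") / total,
        "negative": labels.count("negative") / total,
        "unlabelled": labels.count(None) / total,
    }

# Made-up sample tweets.
sample = ["great day :)", "missed the bus :(", "just had lunch", "meeting at 3"]
print(coverage(sample))
```

Tweets that fall under ‘unlabelled’ are the ones an emoticon search would never find, which is the limitation the statistics point to.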



Twitter uses the # syntax as a means of collating tweets into categories, and hashtags have become a new form of Internet language. Hashtags are also a form of metadata: they collect tweets on the same topic and give context to each tweet. For example, #idontwanttowritethisblogpostanymore groups tweets with a similar concept. Topics that are not typical are often more difficult to evaluate, but they contribute to the global expansion of lexicons, which improves the performance of searches and the collation of material.
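
As a rough sketch of how hashtags work as metadata, the Python below pulls the hashtags out of a few made-up tweets and groups the tweets by tag. The pattern used to spot a hashtag (a # followed by letters, digits or underscores) is a simplifying assumption and not Twitter’s own rule.

```python
import re
from collections import defaultdict

# Simplified hashtag pattern (an assumption, not Twitter's exact definition).
HASHTAG = re.compile(r"#(\w+)")

def group_by_hashtag(tweets):
    """Map each lowercased hashtag to the list of tweets that contain it."""
    groups = defaultdict(list)
    for tweet in tweets:
        # A set avoids adding the same tweet twice when a tag is repeated.
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        for tag in tags:
            groups[tag].append(tweet)
    return dict(groups)

# Made-up sample tweets.
tweets = [
    "So tired #idontwanttowritethisblogpostanymore",
    "Same here #IDontWantToWriteThisBlogPostAnymore #uni",
    "no tags in this one",
]
groups = group_by_hashtag(tweets)
print(len(groups["idontwanttowritethisblogpostanymore"]))  # tweets sharing the tag
```

Lowercasing the tags means that tweets using different capitalisation still end up in the same group.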


Sharaf, M.A. & Wang, H. 2014, ‘Databases Theory and Application: 25th Australasian Database Conference’, Springer, Brisbane, viewed 4 September 2016.
Bravo, F. 2016, Lexicon Expansion, viewed 4 September 2016.
Reed, J. 2014, How social media is changing language, blog post, viewed 4 September 2016.