At the 29C3 Chaos Communication Congress, held in Germany in December last year, researcher Sadia Afroz told the audience that it is possible to identify up to 80% of anonymous underground forum users using methods including stylometric analysis, latent Dirichlet allocation and the authorship attribution framework JStylo. Stylometry uses linguistic information found in a document to perform authorship recognition.
The researchers' techniques may eventually be able to tell a conversation about, say, credit card hacking apart from a discussion about creating exploits; however, this is still being worked on and is not yet possible. For the moment, the analytic techniques look for function words as a means of identifying authors.
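To give a rough feel for how function words can act as a writing fingerprint, here is a minimal Python sketch. It is not the researchers' method or JStylo's implementation: the short word list, the cosine-similarity comparison and the helper names (function_word_profile, most_similar_author) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative list (an assumption); real systems use far larger lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that",
                  "is", "it", "for", "on", "with", "but", "not"]

def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar_author(unknown_text, known_texts):
    """Pick the author whose sample has the closest function-word profile.

    known_texts maps an author name to a sample of that author's writing.
    """
    unknown = function_word_profile(unknown_text)
    return max(
        known_texts,
        key=lambda author: cosine_similarity(
            unknown, function_word_profile(known_texts[author])
        ),
    )
```

Calling most_similar_author(post, {"user_a": sample_a, "user_b": sample_b}) returns whichever name has the closest profile. With only a few sentences per author the profiles are very noisy, which is one reason short forum posts are a hard case.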
“If our dataset contains 100 users we can at least identify 80 of them,” researcher Sadia Afroz told an audience at the 29C3 Chaos Communication Congress in Germany.
We wanted to see what stylometry could do when applied to an interesting real-world dataset containing short text in multiple languages. As a result, we applied stylometry to leaked underground forums. Online forums are frequently used by cyber-criminals around the world to establish trade relationships and exchange fraudulent goods and services such as the sale of stolen credit card numbers and compromised hosts, spamming, phishing, and online credential theft. These forums are popular among cyber-criminals as they are easily accessible and provide a high degree of anonymity. In this work, we examine several multilingual underground forums, for example, thebadhackerz.com, blackhatpalace.com, www.carders.cc, free-hack.com, hackel1te.info, hack-sector.forumh.net, rootwarez.org, L33tcrew.org, antichat.ru. We did authorship attribution on these users and so far have had 72% success in correct attribution (however, we believe this number will be significantly improved by the time of the talk as we continue our analysis and bring in new features). Authorship attribution in the underground forums requires new features since the text used in these forums is multilingual, contains numerical information such as credit card and bank account numbers, and has many symbols in the URLs and services being shared. These properties of the text are not similar to common writing. We are expecting a significant increase in the accuracy once the above-mentioned feature set is implemented.
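As a rough illustration of what features for number-heavy, symbol-heavy text might look like, here is a small Python sketch. The researchers' actual feature set is not described here, so everything in it is an assumption: the surface_features helper, the three ratios it returns and the simple URL pattern are hypothetical examples only.

```python
import re

# Deliberately simple URL matcher (an assumption, not a full URL grammar).
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def surface_features(post):
    """Return simple character-level features for one forum post."""
    length = len(post) or 1  # avoid dividing by zero on empty posts
    digits = sum(ch.isdigit() for ch in post)
    # Count anything that is neither a letter, a digit nor whitespace as a symbol.
    symbols = sum(not ch.isalnum() and not ch.isspace() for ch in post)
    return {
        "digit_ratio": digits / length,
        "symbol_ratio": symbols / length,
        "url_count": len(URL_PATTERN.findall(post)),
    }
```

Features like these do not depend on the language of the post, so in principle they could sit alongside word-based features when the text is multilingual.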
The analysis found up to 300 distinct discussion topics in the forums, with some of the most popular being carding, encryption services, password cracking and blackhat search engine optimisation (SEO) tools. Whilst the researchers state that they are able to identify up to 80% of individuals via their writing patterns, they could only do this when the text being analysed was in English. When the writing was in any other language, the accuracy dropped to 65%. Translating the foreign text using free online tools like Google Translate and Bing didn’t improve the results (if you have ever used Google to translate a passage of text from one language to another, you will understand why).
However, open-source tools have also been released that help users anonymise their writing against authorship recognition methods like those discussed by the researchers: Anonymouth (an authorship recognition evasion tool) and JStylo (an authorship recognition analysis tool). JStylo is the machine learning engine that powers Anonymouth.
Video tutorial on how to use Anonymouth and Deceiving Authorship Detection
Here is the talk by researcher Sadia Afroz and team, which is very eye-opening (45 mins long).
The research was carried out by a team from Drexel and George Mason universities, composed of Sadia Afroz, Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt, and Damon McCoy.
So what are the implications of the researchers’ findings, and how and when will companies and indeed governments start to use such analytic tools to look for individuals across the internet by their writing styles? That really is the big question. We could very well see people’s anonymous Twitter, Facebook and reddit accounts scanned into huge databases that an algorithm would then go to work on, attempting to link certain accounts and forum posts by their unique writing styles via function words and stylometric patterns. Maybe that is just me being very paranoid, but with the amount of snooping certain governments currently carry out, it’s not far-fetched by any stretch. Either way, it really was a fascinating bit of research from the Drexel and George Mason research teams, and one I hope isn’t used against law-abiding people instead of the intended criminal element.