RolePlay: How Can Machine Learning Systems Help in Detecting Spam Emails?

Cybersecurity has been in demand since the early days of the Internet. Today, given the sharp rise in data threats via phishing emails, spam-borne malware, spear phishing, and the like, reliable and smart anti-spam email filters have become a necessity.

Digital theft and email spam are on the rise!


We have many detectors and anti-spam filters; nevertheless, more dependable, feature-rich digital products remain in high demand. This is where we look to Artificial Intelligence: although it is a computer technology, it can imitate human-like reasoning and act intelligently.

Humans have taught machines empirical approaches to spam email filtering, which can help defeat spammers who gather sensitive user information from websites, chat rooms, and viruses.

“Knowingly or unknowingly, spam prevents a user from gaining the full advantage of time, storage, and network bandwidth, causing a destructive impact on email servers, CPU memory, power, user time and communication bandwidth.”

With fraud and spam on the web becoming ever more likely, the involvement of Artificial Intelligence and Machine Learning is seen as crucial for anti-spam solutions. Although we already have some workable ways to protect ourselves from deceitful emails, multiple gaps still need to be identified and filled to stop spammers and hackers with malicious intent.

Before digging into how technology can aid us, let’s get familiar with spam, its motives, its global impact, and the anti-spam systems currently in use.

What Is Spam?

Simply put, spam or junk email refers to unsolicited messages, mainly of an advertising type, sent in bulk. The most widely used form of delivery is email, but spam can also be sent through instant messaging programs or social networks.

But when we talk about spam, we are also referring to another type of email: malicious or fraudulent emails. These emails usually contain scams whose sole purpose is to obtain sensitive information from the recipient, such as bank account access data, passwords, and credit/debit card numbers.

The Purpose
Spam emails are propagated primarily to cause financial disruption or reputational damage, on both the personal and the institutional front.

The Types

  • Email Phishing (spear phishing, clone phishing)
  • Whaling
  • Email Spoofing
  • Clickjacking

The Contribution(-ves)

  • Email spam costs businesses $20.5 billion every year.
  • 45% of all emails are fraudulent, which works out to 14.5 million messages globally per day.
  • More than 70% of emails are unwanted brand promotion messages.
  • Spammers earn $7000 a day from spam email sites.
  • Apple IDs are the potential targets of around 25% of phishing scams.

“57.26% of the total email traffic was credited to spam accounts in 2019”

Source: Statista

Current Anti-Spam Solutions

-DomainKeys Identified Mail (DKIM)

A complex framework that verifies messages using public-key cryptography (digital signatures). Because of its low adoption and limited efficiency, however, an email with an empty DKIM field cannot be treated as confirmed spam.

-Content-Based Filtering

This technique creates automatic filtering rules using ML approaches such as Naive Bayesian classification, Support Vector Machines, Neural Networks, and K-Nearest Neighbors.

It analyzes the words and phrases in the email content and closely monitors word occurrence, distribution, and placement. It then applies rules to filter spam: if the message matches predefined spam patterns, a decision based on the intermediate calculations marks the email as fraudulent.
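To make the word-occurrence idea concrete, here is a minimal sketch of a Naive Bayes filter, one of the approaches named above. The training messages, the whitespace tokenizer, and the function names are all assumptions made for illustration; a real filter would train on a large labeled corpus and use far more careful pre-processing.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for real tokenization.
    return text.lower().split()

def train(messages):
    """Count word occurrences per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def log_score(text, label, model):
    """Log prior plus the log likelihood of each known word under `label`."""
    word_counts, doc_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        if word in vocab:  # words never seen in training are skipped
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
    return score

def classify(text, model):
    spam = log_score(text, "spam", model)
    ham = log_score(text, "ham", model)
    return "spam" if spam > ham else "ham"

# Hypothetical toy training set, for illustration only.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(TRAINING)
print(classify("claim your free money", model))
print(classify("agenda for lunch", model))
```

The "intermediate calculations" in this sketch are the two log scores: the message goes to whichever class gives it the higher score.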

-DNS Blacklisting

Domain Name System (DNS) blacklisting is done in two ways: first, by maintaining a list of IP addresses identified as spam propagators, and second, by marking domain names/websites using URIs (Uniform Resource Identifiers). The resulting blacklist data is kept in a centralized database.

Such databases are not efficient enough: their update process is time-consuming, so they miss the early signs of new phishing URLs. Moreover, spammers can bypass this technique by altering the source address (which itself can be spoofed) and by setting up cheap new domains before launching mass spamming.
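For readers curious how an IP-based DNS blacklist is typically queried, the sketch below builds the lookup name by reversing the IPv4 octets and appending the list's zone. The zone `dnsbl.example.org` and the `fake_resolver` helper are hypothetical stand-ins so the example runs without network access; in practice the default `socket.gethostbyname` would perform the real lookup.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the list zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Treat the IP as listed if the query name resolves at all."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except OSError:  # socket.gaierror (lookup failure) is a subclass of OSError
        return False
    return True

def fake_resolver(name):
    # Hypothetical stand-in for a DNS lookup, used only in this demo.
    if name == "10.2.0.192.dnsbl.example.org":
        return "127.0.0.2"
    raise socket.gaierror("name not found")

print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
print(is_listed("192.0.2.10", "dnsbl.example.org", resolve=fake_resolver))
```

Because the answer depends entirely on what the list already contains, a freshly registered domain or a new sending address simply comes back as "not listed", which is the weakness described above.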

-Cryptographic Hashing

This is one of the oldest methods, used by Yahoo, Hotmail, and others. The technique computes an alphanumeric string, called the signature of the email (i.e. the hash value), which is stored in a database.

The vendors relied on the fact that fraudsters send out emails in huge volumes, so some of them will reach honeypot accounts, where spam can be told apart from non-spam by the generated signature (hash value). A signature captured this way is added to the spam database, and any later email reaching a customer account is instantly discarded when its hash value matches a stored spam value.

The bottlenecks are the delayed nature of database updates and the techniques spammers devise to defeat the hashing programs and algorithms.
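The signature idea can be sketched in a few lines. This is only an illustration: SHA-256, the whitespace and case normalization, and the in-memory set standing in for the spam database are assumptions, not a description of what any particular vendor used.

```python
import hashlib

def signature(body):
    """Hash a normalized copy of the message body into a hex signature."""
    normalized = " ".join(body.lower().split())  # collapse case and whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Assumption: an in-memory set stands in for the shared spam database.
spam_signatures = set()

def report_from_honeypot(body):
    """Record the signature of a message that arrived at a honeypot account."""
    spam_signatures.add(signature(body))

def is_known_spam(body):
    """Check an incoming message against the stored spam signatures."""
    return signature(body) in spam_signatures

report_from_honeypot("Buy cheap   WATCHES now")
print(is_known_spam("buy cheap watches now"))   # same text after normalization
print(is_known_spam("buy cheap watches n0w"))   # one character changed
```

The second check shows why spammers can defeat this approach: changing a single character produces a completely different hash, so the altered message no longer matches.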

-IP Based Filtering

This is usually source-based filtering using IP addresses, which can also reveal geographical locations and so help identify the countries that are major sources of spam. Once detected, the suspected IP addresses can be blocked.

This method has been effective to a certain extent and has contributed to tackling botnets as well, but it still needs further development to solve higher-level issues.
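A rough sketch of source-based blocking, using Python's standard `ipaddress` module. The blocked ranges below are documentation-only address blocks chosen as placeholders; a real deployment would load its list from reputation or geolocation data.

```python
import ipaddress

# Hypothetical blocked ranges (reserved documentation networks, not real spam sources).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(ip):
    """Return True if the sender's IP falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in BLOCKED_NETWORKS)

print(is_blocked("203.0.113.45"))
print(is_blocked("192.0.2.1"))
```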

-Peer-to-Peer Infrastructure

Inspired by the robust technology of blockchain, this framework emphasizes the Bitmessage peer-to-peer communication protocol, which runs over a highly encrypted and authenticated network with decentralized servers.

Being highly complicated, the framework raises some scalability concerns relative to the existing email infrastructure, and the complex nature of Bitmessage addresses (unintuitive alphanumeric strings) makes them troublesome for ordinary users.

-Regular Expression (Regex) Based Filtering

Regex filtering is a static, rule-based system in which the rules are built from regular expressions, each assigned its own score that feeds into the judgment.


The scores of the matched rules are summed and compared with a preset threshold, which determines whether the email is spam or not. Regular expressions are patterns made of text and special characters that are used for string matching; for example, they can find an email address in a body of text. Tech experts can use these patterns to check the validity of an email in any programming language.

Such a heuristic system is a fast, useful, and effective anti-spam solution, unless scammers get hold of the ruleset and craft messages that avoid the filtering patterns.
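The score-and-threshold mechanism described above can be sketched as follows. The four rules, their scores, and the threshold of 3.0 are invented for illustration; production rulesets are much larger and carefully tuned.

```python
import re

# Hypothetical ruleset: (compiled pattern, score) pairs.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.5),
    (re.compile(r"\bwin(ner)?\b", re.IGNORECASE), 2.0),
    (re.compile(r"https?://\S+", re.IGNORECASE), 1.0),
    (re.compile(r"!{3,}"), 1.0),
]
THRESHOLD = 3.0  # assumed cut-off

def spam_score(text):
    """Sum the scores of every rule whose pattern matches the text."""
    return sum(score for pattern, score in RULES if pattern.search(text))

def is_spam(text):
    """Flag the message once the total score reaches the threshold."""
    return spam_score(text) >= THRESHOLD

print(spam_score("You are a WINNER, claim your free prize!!!"))
print(is_spam("Free lunch on Friday"))
```

The first message matches three rules (free, winner, and the run of exclamation marks), so it crosses the threshold; the second matches only one and passes through.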

AI-Powered Techniques for Spam Detection and Classification

-Genetic Algorithm Systems

Considering the limitations of comparable contemporary systems, a Regex filtering method was proposed as an effective way to counter spammers' mass-mailing attempts.

DiscoverRegex is a software system built on the idea of a local content filtering system that can be shared over a P2P network. It uses Genetic Programming (bio-inspired genetic algorithms) to create regular expressions.

Having achieved only partially successful results, the tool still needs to generate regular expressions from the spam body content, in addition to the subject header, to deliver complete results and make the system fully ready.
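The internals of DiscoverRegex are not described here, so the following is not its algorithm. It is only a sketch of the fitness-and-selection step that any genetic search over regular expressions needs: score each candidate by how many spam subjects it matches minus how many legitimate ones, and keep the best. The candidate patterns and sample subjects are made up, and the mutation and crossover steps that would generate new candidates are left out.

```python
import re

def fitness(pattern, spam_subjects, ham_subjects):
    """Spam subjects matched minus ham subjects matched; invalid regexes score worst."""
    try:
        regex = re.compile(pattern, re.IGNORECASE)
    except re.error:
        return float("-inf")  # a malformed candidate can never be selected
    hits = sum(1 for subject in spam_subjects if regex.search(subject))
    false_hits = sum(1 for subject in ham_subjects if regex.search(subject))
    return hits - false_hits

def select_best(candidates, spam_subjects, ham_subjects):
    """Selection step: keep the candidate with the highest fitness."""
    return max(candidates, key=lambda p: fitness(p, spam_subjects, ham_subjects))

# Hypothetical subject lines and candidate population, for illustration only.
SPAM = ["Free v1agra today", "Cheap viagra", "FREE prize inside"]
HAM = ["Monday meeting", "Free slot on Monday?"]
CANDIDATES = [r"free", r"monday", r"v[i1]agra", r"("]

print(select_best(CANDIDATES, SPAM, HAM))
```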

-Machine Learning-Based Systems

Machine Learning, a powerful subset of Artificial Intelligence, is considered a boon for the problem of telling spam from ham (legitimate email), thanks to its ability to evolve over time.

Research has found that the operating mechanisms of spam emails change over time. Changes in the content, structure, and delivery mechanisms of such emails threaten crucial organizational data with new types of deceit.

Many techniques and frameworks that leverage Machine Learning have been devised and proposed to counter the efforts of spammers. First, let’s get familiar with the types of ML algorithms used to process data for spam detection.

#Supervised ML Algorithms
Such programs learn from labeled datasets and infer a mapping between the input and the output. They fall into two classes, Classification and Regression: the former produces categorical results, while the latter outputs numerical values.

#Unsupervised ML Algorithms
No labeled or structured data is provided. The algorithms analyze the dataset to find common features and group the data points into clusters based on their similarities.

#Semi-supervised ML Algorithms
This is a combination of the two approaches above. Such algorithms produce effective results when only a limited amount of data is labeled within a huge collection of input data.

Let’s look at the ways in which the latest technology can help us protect sensitive data from spam attacks.

1 Artificial Neural Network

ANN (Artificial Neural Network) based systems use artificial neurons connected in a series of layers, in which information passes from the Input layer to the Hidden layer to the Output layer in a feedforward arrangement.
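The feedforward flow from input to hidden to output layer can be sketched in plain Python. The layer sizes, the sigmoid activation, and every weight below are arbitrary assumptions for illustration; a real network would learn its weights from labeled emails.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer: each neuron weighs all inputs, adds a bias, activates."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Pass the input through the hidden layer, then through the output layer."""
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, out_w, out_b)

# Hypothetical features for one email, e.g. flags for three suspicious words.
features = [1.0, 0.0, 1.0]
# Arbitrary untrained weights: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_w = [[0.8, -0.4, 0.6], [-0.5, 0.9, 0.3]]
hidden_b = [0.1, -0.2]
out_w = [[1.2, -0.7]]
out_b = [0.05]

score = forward(features, hidden_w, hidden_b, out_w, out_b)[0]
print(score)  # a value between 0 and 1 that could be read as a spam score
```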

Nosseir et al. used an ANN to determine acceptable and non-acceptable words in the email content. The neural classifiers dealt with the words in the email body, which were processed to eliminate articles and prepositions, misspellings, and obfuscated words (like I*n$u*rènce).

It was first applied to a small-scale database, where it achieved an accuracy of 99.87% for a five-character ANN, and it was recommended for testing on larger datasets to see how the results hold up.

Researchers have devised, and continue to devise, many other ANN-based approaches, which have proven effective against word obfuscation and phishing.

2 Deep Learning Frameworks

Multimodal spam classification was introduced using deep learning concepts. Deep learning algorithms such as DeepSVM, CNN (Convolutional Neural Networks), and the Deep Boltzmann Machine build a hierarchy of concepts and can learn in both a supervised and an unsupervised manner.
[Deep Learning is a subclass of Machine Learning]

The CNN approach tackles spam emails by targeting both images and text. It processes the image and the text of an email via multi-modal architectures and can work on large, rich datasets to produce results that generalize across different instances. On small datasets the output reached an accuracy of 98.11%, and similar results are expected when the approach is tried on larger amounts of data.

3 NB (Naive Bayes) Based Propositions

NB classifiers were used to detect whether an email is spam or not, supported by enhancements in the calculation and interpretation of thresholds (to define ternary email categories), as in earlier detection systems.

A combination of NB classifiers and DTRS (Decision-Theoretic Rough Sets) was tested to derive the threshold values, grouping emails from different datasets into spam, ham, or suspect.

This resulted in 90.05% accuracy (manual attempt) but is not considered an ideal solution: it does not classify every email automatically, leaving the user to decide on the emails marked as ‘suspect’, and the user might misjudge them.
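The ternary grouping can be pictured as a three-way decision on the spam probability produced by the classifier. In the sketch below the two thresholds, called `alpha` and `beta` here, are fixed example values; in the DTRS approach they would be derived from the data rather than hard-coded.

```python
def three_way_decision(p_spam, alpha=0.8, beta=0.3):
    """Map a spam probability to 'spam', 'ham', or 'suspect' using two thresholds.

    alpha and beta are assumed example values, not thresholds from the study.
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if p_spam >= alpha:
        return "spam"      # confident enough to reject
    if p_spam <= beta:
        return "ham"       # confident enough to accept
    return "suspect"       # left for the user to judge

for probability in (0.95, 0.5, 0.1):
    print(probability, three_way_decision(probability))
```

Everything that lands between the two thresholds ends up in the ‘suspect’ group, which is exactly the part that still depends on the user's judgment.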

Another approach introduced the Word Frequency Factor and the Average Word Frequency Factor, which build on Mutual Information feature selection to improve the efficiency of the algorithms across languages, especially Chinese and English. Emails were classified with NB classifiers, which gave good results for English emails and substandard results for Chinese ones, calling for further improvements in Chinese word segmentation and pre-processing.
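The two frequency factors are not defined in this article, so they are not reproduced here. As background, the sketch below shows one common form of the Mutual Information score that such feature selection starts from: the pointwise mutual information between a term and a class, estimated from document counts. The tiny corpus and the function name are assumptions for illustration.

```python
import math

def pointwise_mi(term, label, documents):
    """log( P(term, label) / (P(term) * P(label)) ), estimated from document counts."""
    n = len(documents)
    with_term = sum(1 for tokens, _ in documents if term in tokens)
    with_label = sum(1 for _, doc_label in documents if doc_label == label)
    both = sum(
        1 for tokens, doc_label in documents
        if term in tokens and doc_label == label
    )
    if both == 0:
        return float("-inf")  # the term never appears in this class
    return math.log((both * n) / (with_term * with_label))

# Hypothetical corpus: each document is (set of tokens, label).
DOCS = [
    ({"free", "money"}, "spam"),
    ({"free", "prize"}, "spam"),
    ({"meeting", "monday"}, "ham"),
    ({"free", "lunch"}, "ham"),
]

for term in ("money", "free", "meeting"):
    print(term, pointwise_mi(term, "spam", DOCS))
```

Terms with the highest scores for the spam class would be kept as features; here "money" outranks "free" because "free" also shows up in a legitimate message.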
