Spam Filters Explained
By Alan Hearnshaw
What do they do? How do they work? Which one is right for me?
Spam is a very real problem that many people have to deal with on a daily
basis. For those who have decided to do something about it and have started to
investigate spam filtering, this article provides a brief introduction to the
options and the types of spam filters available.
Despite the bewildering array of spam filters available today, all claiming to
be the best “of its kind”, there are really just five filtering methodologies
in general use, and all products rely on one of these or a combination of them:
Content-Based Filters
“In the beginning, there were content-based filters.”
These filters scan the contents of the message and look for tell-tale signs
that the message is spam. In the early days of spamming it was quite simple to
look out for “kill words” such as “Lose Weight” and mark a message as spam if
one was found.
Very soon, though, spammers got wise to this and started resorting
to all kinds of tricks to get their message past the filters. The days of
“obfuscation” had begun. We started getting messages containing the phrase
“L0se Welght” (notice the zero for “o” and the “l” for “i”) and even more
bizarre, and sometimes quite ingenious, variations. This rendered basic
content-based filters somewhat ineffective, although there are one or two on
the market now that are clever enough to “see through” these attempts and still
provide good results.
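As a rough illustration of how a content-based filter might “see through” obfuscation, the Python sketch below normalises a message before checking it against a kill-word list. The phrase list, the character substitutions and the function names are assumptions made for this example and are not taken from any particular product.

```python
import re

# Hypothetical kill-word list; a real filter would carry a much longer one.
KILL_PHRASES = ["lose weight", "free money"]

# Assumed look-alike substitutions, chosen to undo tricks like "L0se Welght".
LOOKALIKES = str.maketrans({"0": "o", "1": "i", "3": "e", "@": "a", "$": "s"})


def normalize(text: str) -> str:
    """Lower-case the text, map look-alike characters, collapse whitespace."""
    text = text.lower().translate(LOOKALIKES)
    return re.sub(r"\s+", " ", text)


def matches_phrase(text: str, phrase: str) -> bool:
    """Search for a phrase, letting "i" and "l" stand in for each other."""
    pattern = "".join("[il]" if ch in "il" else re.escape(ch) for ch in phrase)
    return re.search(pattern, text) is not None


def is_spam(message: str) -> bool:
    """Flag the message if any kill phrase survives normalisation."""
    cleaned = normalize(message)
    return any(matches_phrase(cleaned, phrase) for phrase in KILL_PHRASES)
```

Treating “i” and “l” as interchangeable is a deliberately blunt choice: it catches “Welght”, but it can also produce false matches, which is one reason simple kill-word filters struggle.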
Bayesian-Based Filters
“The Reverend Bayes comes to the rescue”
Born in London around 1702, the son of a minister, Thomas Bayes
developed a formula which allowed him to determine the probability of an event
occurring based on the probabilities of two or more independent evidentiary
events.
Bayesian filters “learn” from studying known good and bad
messages. Each message is split into single “word bytes”, or tokens, and these
tokens are placed into a database along with how often they are found in each
kind of message. When a new message arrives to be tested by the filter, it is
also split into tokens, and each token is looked up in the database. By
extrapolating results from the database and applying a form of the good
reverend’s formula, known as a “Naive Bayesian” formula, the filter gives the
message a “spamicity” rating, and the message can be dealt with accordingly.
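The token-counting approach described above can be sketched in a few lines of Python. This is a minimal illustration, not any product's actual algorithm: the class name, the tokenising rule, the add-one smoothing and the assumption that spam and good mail are equally likely are all choices made for the example.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy token-frequency filter (smoothing and equal priors are assumptions)."""

    def __init__(self) -> None:
        # The "database": how many messages of each kind contain each token.
        self.token_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    @staticmethod
    def tokenize(text: str) -> set:
        # One token per distinct word in the message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text: str, label: str) -> None:
        """Record a known message; label must be "spam" or "good"."""
        self.message_counts[label] += 1
        self.token_counts[label].update(self.tokenize(text))

    def _token_prob(self, token: str, label: str) -> float:
        # Add-one smoothing so an unseen token never gives a zero probability.
        seen = self.token_counts[label][token]
        return (seen + 1) / (self.message_counts[label] + 2)

    def spamicity(self, text: str) -> float:
        """Estimate P(spam | tokens), treating tokens as independent."""
        log_odds = 0.0
        for token in self.tokenize(text):
            log_odds += math.log(self._token_prob(token, "spam"))
            log_odds -= math.log(self._token_prob(token, "good"))
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Once the filter has been trained on a collection of known spam and known good messages, calling spamicity() on a new message returns a value between 0 and 1, and anything above a chosen cut-off can be treated as spam.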
Bayesian filters are typically capable of achieving very good accuracy rates
(above 97% is not uncommon) and require very little ongoing
maintenance.
Whitelist/Blacklist Filters
“Who goes there, friend or foe?”
This very basic form of filtering is seldom used on its own
nowadays, but can be useful as part of a larger filtering strategy.
A
“whitelist” is nothing more than a list of e-mail addresses from which you wish
to accept communications. A whitelist filter would only accept messages from
these people and all others would be rejected.
A “blacklist”, conversely, is a list of e-mail addresses, and sometimes IP
addresses (the numeric addresses that identify computers on the network), from
which communications will not be accepted.
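In code, both kinds of list come down to simple membership checks. The Python sketch below assumes hypothetical list contents and a made-up classify() function; the whitelist_only flag switches between a strict whitelist filter and a plain blacklist filter.

```python
from email.utils import parseaddr

# Hypothetical lists; the addresses and the IP address are placeholders.
WHITELIST = {"alice@example.com"}
BLACKLIST_ADDRESSES = {"bulk@example.net"}
BLACKLIST_IPS = {"203.0.113.7"}


def classify(from_header: str, source_ip: str, whitelist_only: bool = False) -> str:
    """Return "accept" or "reject" for a message."""
    # Pull the bare address out of a header such as "Alice <alice@example.com>".
    address = parseaddr(from_header)[1].lower()
    if address in WHITELIST:
        return "accept"
    if address in BLACKLIST_ADDRESSES or source_ip in BLACKLIST_IPS:
        return "reject"
    # A strict whitelist filter rejects every sender it does not know.
    return "reject" if whitelist_only else "accept"
```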
While both may seem like a good idea at the outset, a whitelist methodology is
too restrictive for most people. An address blacklist has its own problem:
virtually all spam e-mails carry a forged “from” address, so there is little
point in collecting this address to ban it in future, as it is very unlikely to
be the same next time.
There are bodies on the internet that maintain lists of known “bad” sources of
e-mail. Many filters today can query these servers to see whether the message
they are looking at comes from a source identified by such an Internet-based
blacklist, or RBL (Realtime Blackhole List). While quite effective, these lists
do tend to suffer from “false positives”, where good messages are incorrectly
identified as spam. This happens often with newsletters.
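Many such lists are published through DNS: the filter reverses the octets of the sending IP address, appends the list's zone name and looks the result up, and any answer means the address is listed. The Python sketch below assumes that convention. The zone rbl.example.org is a placeholder, not a real service, and the function names are made up for the example.

```python
import ipaddress
import socket

# Placeholder zone; substitute the zone of whichever blacklist service you use.
RBL_ZONE = "rbl.example.org"


def rbl_query_name(ip: str, zone: str = RBL_ZONE) -> str:
    """Build the DNS name used to look up an IPv4 address in a blacklist zone."""
    # IPv4Address validates the input and raises ValueError on bad addresses.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str = RBL_ZONE) -> bool:
    """Return True if the zone answers for this address.

    Any lookup failure, including a network problem, counts as "not listed".
    """
    try:
        socket.gethostbyname(rbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

Treating lookup failures as “not listed” means an unreachable list lets mail through rather than blocking it, which seems the safer default given the false-positive problem mentioned above.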
Challenge/Response Filters
“Open sesame!”
Challenge/Response filters are characterised by
their ability to automatically send a response to a previously unknown sender
asking them to take some further action before their message will be delivered.
This is often referred to as a "Turing Test" - named after a test devised by
British mathematician Alan Turing to determine if machines could
“think”.
Recent years have seen the appearance of some internet services
which automatically perform this Challenge/Response function for the user and
require the sender of an e-mail to visit their web site to facilitate the
receipt of their message.
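The hold-and-release logic behind a challenge/response filter might look something like the Python sketch below. The class and method names are invented for illustration, and actually e-mailing the challenge to the sender is left out; the token returned by receive() stands in for the link or code the sender would be asked to use.

```python
import secrets


class ChallengeResponseFilter:
    """Holds mail from unknown senders until they answer a challenge."""

    def __init__(self) -> None:
        self.known_senders = set()
        self.inbox = []
        self.pending = {}  # token -> (sender, list of held messages)

    def receive(self, sender: str, message: str):
        """Deliver mail from known senders; otherwise hold it.

        Returns None when the message is delivered, or the challenge token
        the message is being held under.
        """
        if sender in self.known_senders:
            self.inbox.append(message)
            return None
        for token, (held_sender, held) in self.pending.items():
            if held_sender == sender:
                held.append(message)  # a challenge is already outstanding
                return token
        token = secrets.token_urlsafe(16)
        self.pending[token] = (sender, [message])
        return token

    def confirm(self, token: str) -> bool:
        """Called when a sender completes the challenge; releases held mail."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return False  # unknown or already-used token
        sender, held = entry
        self.known_senders.add(sender)
        self.inbox.extend(held)
        return True
```

Once a sender has confirmed, they are effectively on a whitelist, so later messages from them go straight to the inbox.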
Critics of this system claim it to be too
drastic a measure and that it sends a message that "my time is more important
than yours" to the people trying to communicate with you.
For some low-traffic e-mail users, though, this system alone may be a perfectly
acceptable method of completely eliminating spam from their inbox, one step
above the “whitelist” system outlined above.
Community Filters
“A united front”
These types of filters work on the principle of “communal
knowledge” of spam. When a user receives a spam message, they simply mark it as
such in their filter. This information is sent to a central server where a
“fingerprint” of the message is stored. After enough people have “voted”
this message to be spam, it is stopped from reaching all the other people
in the community.
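The central server's job can be sketched in Python as follows. This is only an illustration under stated assumptions: the fingerprint here is a plain SHA-256 hash of the lightly normalised text, the vote threshold is arbitrary, and the class and method names are hypothetical. Real services are likely to use fuzzier fingerprints so that small changes to a message do not defeat the match.

```python
import hashlib
import re
from collections import defaultdict

VOTE_THRESHOLD = 3  # assumed number of distinct reporters needed to block


def fingerprint(message: str) -> str:
    """Hash the message after lower-casing it and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", message.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CommunityServer:
    """Collects spam reports and blocks messages with enough votes."""

    def __init__(self, threshold: int = VOTE_THRESHOLD) -> None:
        self.threshold = threshold
        self.reports = defaultdict(set)  # fingerprint -> users who reported it

    def report_spam(self, user: str, message: str) -> None:
        # Storing users in a set means each user's vote counts only once.
        self.reports[fingerprint(message)].add(user)

    def is_blocked(self, message: str) -> bool:
        voters = self.reports.get(fingerprint(message), ())
        return len(voters) >= self.threshold
```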
This type of filtering can prove to be quite effective,
although it stands to reason that it can never be 100% effective, as a few
people have to receive the spam for it to be “flagged” in the first place. Just
like its cousin, the Internet blacklist (RBL), this system can also suffer
from “false positives”, or messages incorrectly identified as
spam.
Hopefully you are now armed with a little more information to be
able to make an informed decision on the best spam filter for
you.
About the Author
Alan Hearnshaw is a computer programmer and the owner of
http://www.WhichSpamFilter.com, a site which provides weekly in-depth spam
filter reviews, user help and guidance, and a community forum.