...making Linux just a little more fun!
SpamAssassin, as most of our readers will know, is a popular spam classifier on Linux. This article assumes that you already have SpamAssassin installed and working. If you are interested in running SpamAssassin, but do not yet have it set up, there is a useful introduction at http://linux.org.mt/article/killspam.
When I set up SpamAssassin in Mandrake 9.2, it came very close to picking up 100% of my spam. Over time, however, many of the spammers have figured out how to fine-tune their spam and bypass the default ruleset. I find the default setup still picks up at least half the spam, maybe two thirds on a good day, but too much leaks through. If the spammers are tuning their messages, I guess the only thing to do is to tune my scoring. There are at least 8 possible ways of improving SpamAssassin's hit rate.
The options available in your configuration file are listed in http://spamassassin.org/doc/Mail_SpamAssassin_Conf.html, as well as in the relevant manpage (see "man Mail::SpamAssassin::Conf").
Most of the complexity is in the rules, but it's rarely necessary to roll your own SpamAssassin rules. Adjusting the threshold and the scores associated with existing rules is much simpler and can be very effective. If that's not enough, there are also additional rulesets available to download. Of course, if you think writing your own rulesets sounds like a fun thing to do, I've included a pointer below to get you started.
Most of the rules are listed in http://spamassassin.org/tests.html along with their associated scores, but the scores listed did not match those in my configuration files. They were probably from a newer release. You will probably find the default scores for your configuration in /usr/share/spamassassin/50_scores.cf, where there are 4 scores associated with each rule. The score chosen depends on whether Bayesian analysis (see below) and network tests are enabled. Where only one score is supplied, that score is always used. Other options, such as relative scores, are described in the aforementioned documentation.
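The way a rule's four scores map onto a configuration can be sketched in a few lines of Python. This is only an illustration, not SpamAssassin's own code: the function name and the example scores are made up, and the order of the four values is an assumption taken from the SpamAssassin configuration documentation, so check the manpage for your version.

```python
def pick_score(scores, use_bayes, use_net):
    """Return the score that applies for the given configuration.

    `scores` holds the numbers listed after a rule name: either a single
    value or four values. The assumed order of the four values is
    (no Bayes, no net), (no Bayes, net), (Bayes, no net), (Bayes, net).
    """
    if len(scores) == 1:
        # A single score is always used, whatever is enabled.
        return scores[0]
    if len(scores) != 4:
        raise ValueError("expected 1 or 4 scores, got %d" % len(scores))
    index = (2 if use_bayes else 0) + (1 if use_net else 0)
    return scores[index]


# Hypothetical rule with four made-up scores.
example_scores = [1.0, 1.5, 2.0, 2.5]
print(pick_score(example_scores, use_bayes=True, use_net=False))
```

When comparing the scores in 50_scores.cf with what you see in your headers, it helps to know which of the four columns your own setup is reading.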
Regular reviews are even more important as you tune your settings. No matter how careful you are, there is no guarantee that a change to the settings won't cause some collateral damage. Even with the default setup, you will see some false positives, e.g. newsletters from The National Trust and The Royal Opera House get marked as spam, so a key part of managing your spam - especially in the beginning - involves reviewing the hits and determining which senders you need to "whitelist".
Whitelisting is usually done with the settings whitelist_from and whitelist_to. These can be repeated as many times as you like. Simple globbing patterns (see "man bash" and search for "Pattern Matching" for a description) are used to specify wildcard matches. E.g., '?' matches a single character and '*' matches any number of characters (including zero). whitelist_from and whitelist_to subtract 100 points from the score, making it very rare for matching emails to reach the spam threshold.
    whitelist_from [email protected]
    whitelist_from *@importantclient.example.com
Also available are the options more_spam_to and all_spam_to. According to the documentation, "There are three levels of To-whitelisting, whitelist_to, more_spam_to and all_spam_to. Users in the first level may still get some spammish mails blocked, but users in all_spam_to should never get mail blocked."
You should consider using all_spam_to for postmaster addresses. It's very annoying if someone tries to report a spam and has their report blocked or rejected as spam.
You may sometimes see a large number of emails slipping through from a particular sender. Usually sender IDs are forged and chosen randomly. There is little point blacklisting most senders, but sometimes it can be worthwhile. A more useful option is blacklisting based on recipients. If your email address is [email protected] you may see a lot of spam with nearby addresses, such as [email protected] and [email protected] in the cc list and these recipients can be blacklisted.
According to the documentation, blacklisting is done with the settings blacklist_from and blacklist_to, but you may find that blacklist_to doesn't work on versions of SpamAssassin older than 2.6.0.
    blacklist_from *@evilspammers.example.org
    blacklist_to [email protected]
    blacklist_to *.wi*@example.com
There are a number of other settings for blacklisting and whitelisting. Global settings can be overridden locally by unwhitelist_from, unwhitelist_to, unblacklist_from and unblacklist_to. Please read the documentation to find out more about these and other available settings.
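If you want to check what a whitelist or blacklist pattern will match before adding it to your configuration, the globbing rules described above can be approximated in Python. This is a rough sketch rather than SpamAssassin's implementation: the function names and the sample addresses are invented for illustration, only '?' and '*' are treated as wildcards, and the case-insensitive comparison is an assumption.

```python
import re


def glob_to_regex(pattern):
    """Translate a simple glob ('?' and '*' only) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any number of characters, including zero
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    # Case-insensitive matching is an assumption made for this sketch.
    return re.compile("".join(parts), re.IGNORECASE)


def matches_any(address, patterns):
    """Return True if the whole address matches at least one pattern."""
    return any(glob_to_regex(p).fullmatch(address) for p in patterns)


# Hypothetical patterns and addresses, for illustration only.
patterns = ["*@importantclient.example.com", "*.wi*@example.com"]
for address in ["bob@importantclient.example.com", "neil.wilson@example.com",
                "someone@example.org"]:
    print(address, matches_any(address, patterns))
```

A pattern that matches more than you intended is easier to spot this way than by waiting for a false negative to turn up in your inbox.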
SpamAssassin checks the headers to see if the email has been relayed through any hosts with matches in certain blocklists. This is known not to work with a number of configurations because it only checks the first DNS entry in resolv.conf; if this does not point to a working DNS server, it will not work. This is a known problem under Mandrake 9.2.
DNS Blocklists can be disabled with the option skip_rbl_checks.
Bayesian analysis is a feature of recent versions of SpamAssassin and I find it very effective. Some work is required to build and maintain the database, but it is well worth the small effort involved.
To configure SpamAssassin to use Bayesian analysis, add the line
    use_bayes 1
to your user_prefs file.
You won't see any matches for Bayesian analysis yet. The algorithm requires at least 200 spam emails in its database before it will assign any probability to your emails. To get to this point, collect your spam emails in a separate mailbox and run
    $ sa-learn --spam --mbox ~/Mail/spamtrap
Often you will see that the number of emails it has learned from (analysed) is less than the number that appear to be in the mailbox it is learning from. This is because it has detected that some emails are duplicates of emails it has seen before.
You should also give it your "ham" emails to learn from using the command
    $ sa-learn --ham --mbox ~/Mail/inbox
Once it has learned from more than 200 spam emails, you should start seeing matches in the headers like
    BAYES_90 (4.5 points) BODY: Bayesian classifier says spam probability is 90 to 99%
Don't stop feeding it data when it starts to work. The more data it has, the more accurate it should be. If you are short of disk space, you should bear in mind that the database can get quite large. Mine is about 10MB.
If at any time you accidentally classify a message incorrectly this can be corrected. Move the message to a temporary folder, then use the command
    $ sa-learn --forget --mbox ~/Mail/temp
then move it back to the correct folder and classify it as usual.
In my experience a threshold of 3.0 or 3.5 will increase the amount of spam caught dramatically, but won't produce significantly more false positives. This is achieved very simply by changing or adding the required_hits setting, e.g.
    required_hits 3.5
If you have been using SpamAssassin for a while, you can use grep to assess the level at which you are likely to see a significant increase in false positives. This is done by searching your mail folders for X-Spam-Status header lines with different scores.
    $ grep 'X-Spam-Status: .* hits=[5-9]\.' ~/Mail/inbox | wc -l
    1
    $ grep 'X-Spam-Status: .* hits=[34]\.' ~/Mail/inbox | wc -l
    4
    $ grep 'X-Spam-Status: .* hits=2\.' ~/Mail/inbox | wc -l
    10
The first command shows that there is only 1 mail in my inbox that has scored between 5.0 and 9.9 points. The second shows that there are 4 mails that scored between 3.0 and 4.9 points, and the third that 10 mails scored between 2.0 and 2.9 points. It should be borne in mind that this ignores all emails from before setting up SpamAssassin and all emails that you have deleted since that time.
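Running one grep per score range gets tedious if you want the whole picture. As a sketch of the same idea, the Python below counts X-Spam-Status lines per whole-point band in a single pass. The function name and the sample lines are illustrative only, and it assumes the header uses the hits= form shown in this article and sits on one line.

```python
import math
import re
from collections import Counter

# Matches the score in headers such as
# "X-Spam-Status: No, hits=3.0 required=3.5 tests=... version=2.55"
STATUS_RE = re.compile(r"^X-Spam-Status: .*\bhits=(-?\d+(?:\.\d+)?)")


def score_bands(lines):
    """Count X-Spam-Status lines per whole-point band (3 means 3.0 to 3.9)."""
    bands = Counter()
    for line in lines:
        found = STATUS_RE.match(line)
        if found:
            bands[math.floor(float(found.group(1)))] += 1
    return bands


# Illustrative input; in practice you would pass an open mbox file, e.g.
#   with open(path_to_mbox, errors="replace") as f: score_bands(f)
sample = [
    "X-Spam-Status: No, hits=3.0 required=3.5 tests=BAYES_50,USER_AGENT version=2.55",
    "X-Spam-Status: No, hits=1.6 required=3.5 tests=HTML_20_30 version=2.55",
    "Subject: not a status header",
]
for band, count in sorted(score_bands(sample).items()):
    print(band, count)
```

Run against your inbox, the band where the counts start to climb is a reasonable lower bound for your threshold.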
Before we start, I should say that the default scores have been tuned using a genetic algorithm. Should you trust your judgment against that algorithm? My opinion is that spam is evolving: many spam messages are tested against the default SpamAssassin rules and fine-tuned until they pass. Also, everyone's spam problem is different. Statistically, what works for a large database of spam, possibly going back years, isn't necessarily the best for your current spam problem. If you find that your tuning efforts make the problem worse, you can always go back to the defaults.
Incoming messages should have some headers that indicate which rules were triggered. These look like:
    X-Spam-Status: No, hits=3.0 required=3.5 tests=BAYES_50,USER_AGENT version=2.55
These headers will not normally be displayed, but any decent mail client will have an option to display all headers. In kmail this option is View->Headers->All.
If you do not see these headers when you have all headers displayed, take a look at the section "Other Options" at the end of this article for the option controlling headers.
Looking at the matches given above, Bayesian analysis has given the mail a 50-60% probability of being spam. I have sufficient confidence in the Bayesian analysis to treat anything with a probability of 50% or more as spam, so I set the scores for those rules to my current threshold of 3.5.
Here's another one that sneaked in under the radar.
    X-Spam-Status: No, hits=1.6 required=3.5 tests=HTML_20_30,MIME_HTML_ONLY,USER_AGENT version=2.55
The USER_AGENT rule isn't very interesting. Most mail has a user-agent header and this scores 0.001. We'll leave that alone. The other tests seem to contradict each other, one apparently saying that the message is all HTML and the other that it's 20-30% HTML. I would guess that the 20-30% is the ratio of HTML tags to text, so it can be all HTML, but not all tags.
So, how should we adjust the scoring? HTML_20_30 matches 6 times in 8 months of legitimate email, but it matches a third of the mail currently in my spam folder, so it should be scored highly, but not highly enough to be conclusive on its own. It seems to be scored at 1.47, which may be a bit low, but it's not far wrong. MIME_HTML_ONLY matched 1 legitimate email, but matches 95% of my spam. Strangely, this only scores 0.1. I'm going to treat it as almost conclusive and score it at 3.0, requiring only another 0.5 points to trigger a match on my threshold of 3.5.
Another email got through with these matches:
    X-Spam-Status: No, hits=1.5 required=3.5 tests=GET_IT_NOW,HTML_10_20 version=2.55
Looking at my email, I find that HTML_10_20 matches a lot of legitimate email as well as spam, and GET_IT_NOW only matches 1 spam.
    $ grep HTML_10_20 ~/Mail/spamtrap | wc -l
    19
    $ grep HTML_10_20 ~/Mail/inbox ~/Mail/mailing-lists | wc -l
    8
    $ grep GET_IT_NOW ~/Mail/inbox ~/Mail/mailing-lists | wc -l
    0
    $ grep GET_IT_NOW ~/Mail/spamtrap | wc -l
    1
In this case I can't justify changing the scoring for either rule.
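The per-rule greps above can be generalised into a small tally of how often each rule fires in a folder, which makes it easier to compare spam and legitimate mail side by side. The following Python is a minimal sketch, not part of SpamAssassin: the function names are invented, and it assumes the X-Spam-Status value has the tests=...,... version=... layout shown in this article (possibly folded over several lines).

```python
from collections import Counter


def rules_in_status(status):
    """Extract the rule names from one X-Spam-Status header value."""
    _, found, rest = status.partition("tests=")
    if not found:
        return []
    rest = rest.split("version=")[0]
    # Drop any whitespace left by header folding, then split on commas.
    names = "".join(rest.split()).split(",")
    return [name for name in names if name]


def rule_counts(status_headers):
    """Count the number of messages in which each rule appears."""
    counts = Counter()
    for status in status_headers:
        counts.update(set(rules_in_status(status)))
    return counts


def hit_rate(rule, status_headers):
    """Fraction of the given messages that triggered `rule`."""
    headers = list(status_headers)
    if not headers:
        return 0.0
    return rule_counts(headers)[rule] / len(headers)


# The two status headers quoted earlier in this article.
leaked = [
    "No, hits=1.6 required=3.5 tests=HTML_20_30,MIME_HTML_ONLY,USER_AGENT version=2.55",
    "No, hits=1.5 required=3.5 tests=GET_IT_NOW,HTML_10_20 version=2.55",
]
print(rule_counts(leaked).most_common())
```

One way to collect the header values would be Python's standard mailbox module (mailbox.mbox and each message's get("X-Spam-Status")). A rule that shows a high hit rate in the spam folder and a low one in the inbox is a candidate for a higher score.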
If, like me, you aren't running the very latest distribution, you may find that you are a little behind the curve. The standard rulesets are always evolving and just running a more recent version should help to catch more spam.
As I write, the latest stable version of SpamAssassin is 2.63 and 3.00 is under development. The latest versions can be downloaded from http://spamassassin.apache.org/downloads.html.
I have not installed any of these rulesets and I am not recommending any of them. You should read the documentation and evaluate their suitability carefully before installing any new rulesets and monitor the results once they are installed.
Rolling your own SpamAssassin rules is likely to be a minority interest, but I guess it will appeal to some of our readers. If you spot a pattern in your spam that there doesn't seem to be a rule for, or you are just terminally curious then read "A straightforward guide to writing your own add-on rules for SpamAssassin", by Matt Kettler.
rewrite_subject { 0 | 1 } (default: 0)
    By default, the subject lines of suspected spam will not be tagged. This can be enabled here.
always_add_headers { 0 | 1 } (default: 1)
    By default, X-Spam-Status, X-Spam-Checker-Version, (and optionally X-Spam-Level) will be added to all messages scanned by SpamAssassin. If you don't want to add the headers to non-spam, set this value to 0. See also always_add_report.
always_add_report { 0 | 1 } (default: 0)
    By default, mail tagged as spam includes a report, either in the headers or in an attachment (report_safe). If you set this option to 1, the report will be included in the X-Spam-Report header, even if the message is not tagged as spam. Note that the report text always states that the mail is spam, since normally the report is only added if the mail is spam. This can be useful if you want to know what rules the mail triggered, and why it was not tagged as spam. See also always_add_headers.
spam_level_stars { 0 | 1 } (default: 1)
    By default, a header field called "X-Spam-Level" will be added to the message, with its value set to a number of asterisks equal to the score of the message. In other words, for a message scoring 7.2 points:
        X-Spam-Level: *******
    This can be useful for MUA rule creation.
spam_level_char { x (some character, unquoted) } (default: *)
    By default, the "X-Spam-Level" header will use a '*' character with its length equal to the score of the message. Some people don't like escaping *s though, so you can set the character to anything with this option. In other words, for a message scoring 7.2 points with this option set to .
        X-Spam-Level: .......
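To make the X-Spam-Level convention concrete, here is a small Python sketch of how the header value relates to the score and why it is handy for mail client filters. The function names are made up for this example, and treating negative scores as an empty value is an assumption rather than documented behaviour.

```python
def spam_level(score, char="*"):
    """Build an X-Spam-Level style value: one character per whole point."""
    # Assumption: scores below 1 (including negative ones) give an empty value.
    return char * max(0, int(score))


def at_least(level_value, points, char="*"):
    """True if the level value shows a score of `points` or more.

    This is the kind of substring test a mail client filter can apply.
    """
    return char * points in level_value


print(spam_level(7.2))
print(at_least(spam_level(7.2), 5))
```

A filter that looks for five consecutive level characters in X-Spam-Level therefore catches everything scoring 5.0 or more, without having to parse the number in X-Spam-Status.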
Neil is a programmer, specialising in C++ on Unix and Linux. He has degrees
in Computer science and Next Generation Computing.
Neil has worked on a wide range of systems from the control system for the
British Gas national grid to video servers for the Home Choice video on
demand service. He first programmed computers in 1980 with his school
General Studies class, which was allowed access to a mainframe at The
National Institute of Oceanography, programmed in Fortran on punch cards.
A computer science degree followed at Queen Mary College, London, then Neil
worked for Logica for 3 years before taking an MSc in New Generation
Computing at Exeter University.
The next 5 years saw Neil researching parallel simulation algorithms at the
Royal Signals and Radar Establishment, initially on transputers and
subsequently on SPARC based parallel systems. Since leaving RSRE, Neil has
mostly worked freelance and has worked on financial data feeds, video
servers and virus scanning proxies.
Neil first used Unix at college in 1982 and started working on Linux in
1996.
As of May 2004, Neil is working for Wirefast, a global messaging company.
Outside of computing, Neil is into motor sport, particularly Formula 1, the
World Rally Championship and the British Touring Car Championship. He
doesn't race himself. If you've seen Neil's driving, you'll understand why.