Sussing spammers from raw logs
Ajay asked me to teach him to figure out spammers based on the logs, like I do.
Well, Ajay, I imagine others want to do that as well, so here’s a post about it.
But I can’t promise you’ll figure it out. It’s got to do with how your brain processes information as well as knowing what to look for.
First of all, you need access to the raw logs. I’m not talking about Latest Visitors, like you’ll find in cPanel. I’m talking about the full raw logs, preferably going back to the start of the month. And you should save them at the end of each month. Save them to your computer in a zipped or gzipped format (that compresses them to about one tenth of the size).
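If you’d rather script the monthly archiving than zip the files by hand, a minimal Python sketch could look like the one below. The function name gzip_log and the file name in the usage comment are made up for illustration, and how much space you actually save depends on your logs.

```python
import gzip
import shutil
from pathlib import Path


def gzip_log(src, dest=None):
    """Compress the file at src into dest and return the path of the .gz file.

    If dest is not given, ".gz" is appended to the original file name.
    The original file is left untouched.
    """
    src = Path(src)
    dest = Path(dest) if dest else src.with_name(src.name + ".gz")
    # Stream the data across so large logs are never held in memory at once.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Hypothetical usage; "access.log" is an assumed file name:
# gzip_log("access.log")
```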
If you’ve got hosting that gives you ssh (command line) access, you can use grep (a Unix/Linux command-line search tool) on the raw logs while they’re still on the webhost. If you don’t (most budget webhosting doesn’t allow ssh), then you need to download the logs to your computer.
The best tool I’ve found for dealing with raw logs on a Windows computer is TextHarvest. It’s a free demo app. It’s plenty powerful for our purposes, but also shows how powerful the proprietary paid versions will be for other kinds of tasks.
When I use TextHarvest, I start by unzipping the log file and navigating to it, so it’s my input file. I then make a bookmark for the output file in my browser, and untick the autoview (the app will crash if you autoview several megabytes of log file).
Then, when I receive a comment I suspect to be from a jokester, like that comment from Dima, I find the IP address from the comment, and input that in the Keep list. I’ve also found that it makes sense to use the backslash (\) as the separator instead of the forward slash (/), since I often search for user agents as well, and those contain forward slashes. The / is the default they tell you to use, but any character will work.
Press start, and it chugs through the log file.
When I then look at the log output, it contains only the entries made by that IP address, and possibly entries where others have looked at that user’s page, if they’ve made any wiki edits.
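For those who would rather script this step, here is a rough Python sketch of the same kind of filtering. It is not how TextHarvest works internally. It assumes each log line starts with the client IP followed by a space, as in the common Apache log formats. The name filter_log, the sample lines, and the /wiki/User: URL are all made up for illustration.

```python
# Made-up sample lines in an assumed Apache combined log format.
SAMPLE = [
    '192.0.2.10 - - [20/Aug/2006:03:56:00 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [20/Aug/2006:03:57:00 +0000] "GET /wiki/User:192.0.2.10 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [20/Aug/2006:03:58:00 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]


def filter_log(lines, ip, anywhere=False):
    """Yield the log lines that belong to ip.

    With anywhere=False only the first field (the client address) is compared.
    With anywhere=True any line containing the text is kept, which also picks
    up requests by other visitors whose URL happens to contain the address.
    """
    for line in lines:
        if anywhere:
            hit = ip in line
        else:
            fields = line.split(None, 1)
            hit = bool(fields) and fields[0] == ip
        if hit:
            yield line


for line in filter_log(SAMPLE, "192.0.2.10"):
    print(line)
```

Comparing only the first field leaves out the user page hits from other visitors. Passing anywhere=True behaves more like a plain text search.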
What I then look at is whether the pattern is that of a normal user. Does it load any images? Does the browsing history look legit? There are so many things jokesters and spammers do, so I won’t detail them here. But YOU need to figure out what a normal browsing history looks like on YOUR website. And if some user deviates from that, what’s the reason? Sometimes normal users don’t have normal browsing patterns either.
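As a starting point only, the “does it load any images?” question can be sketched in Python like this. It assumes the Apache combined log format and that the lines have already been filtered down to one IP. The names LINE_RE, IMAGE_EXT and summarize are made up, and the list of image extensions is a guess you should adjust to your own site. The numbers prove nothing on their own; they are just a prompt to read the history more closely.

```python
import re

# Assumed Apache combined log format: IP, two fields, [time], "METHOD path ...", status, size
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')

# Assumed list of image extensions; adjust to what your site actually serves.
IMAGE_EXT = (".gif", ".jpg", ".jpeg", ".png", ".ico")


def summarize(lines):
    """Count requests and image loads, and collect the requested paths in order."""
    summary = {"requests": 0, "images": 0, "paths": []}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not look like the assumed format
        path = m.group(4)
        summary["requests"] += 1
        summary["paths"].append(path)
        # Ignore any query string when checking the file extension.
        if path.split("?", 1)[0].lower().endswith(IMAGE_EXT):
            summary["images"] += 1
    return summary
```

Many requests with zero image loads is worth a closer look, but a text-only browser or a visitor with images already cached produces the same result.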
When you look at the browsing history, pay attention to the user agents. Spammers sometimes have scripts that change their user agents more or less in mid flight.
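A small Python sketch for spotting this: it collects the user agents seen per IP and lists the addresses that used more than one. It assumes the user agent is the last quoted field on each line, as in the Apache combined log format, and that it contains no quote characters itself. The names UA_RE, agents_by_ip and multi_agent_ips are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: the line starts with the client IP and ends with the quoted user agent.
UA_RE = re.compile(r'^(\S+) .*"([^"]*)"\s*$')


def agents_by_ip(lines):
    """Map each IP address to the set of user agents it sent."""
    agents = defaultdict(set)
    for line in lines:
        m = UA_RE.match(line)
        if m:
            agents[m.group(1)].add(m.group(2))
    return agents


def multi_agent_ips(lines):
    """Return {ip: sorted list of agents} for addresses seen with more than one agent."""
    return {
        ip: sorted(seen)
        for ip, seen in agents_by_ip(lines).items()
        if len(seen) > 1
    }
```

Several agents from one address is a reason to read that address’s history, not a verdict. Several people behind the same proxy or office network will show up the same way.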
Finally, a warning: Don’t be too cocky now that I’ve taught you the basics of checking logs. There’s a LOT more to be learned, and I’ve seen people completely misunderstand the information in front of them. Hopefully I haven’t made too many errors myself. I started reading logs years before I became Spamhuntress, so I had a head start.
August 20th, 2006 at 3:56 am
Nice trick. Dead handy tool. Life’s much nicer with you around
August 20th, 2006 at 12:25 pm
[…] I asked Spamhuntress on how she finds spammers by going through her site logs and she was kind enough to write a detailed article on how to. […]
August 20th, 2006 at 12:47 pm
Thanks for the tutorial. Am gonna try it out.
Will keep your warning in mind…
August 20th, 2006 at 2:11 pm
I’m a big fan of the Cygwin toolkit for Windows. It provides most of the useful Unix tools like grep. It also has a nice command line window called rxvt (not part of the default install) which is much better than the Windows command prompt. I use it for SSH as well as local work.
http://cygwin.com/
August 21st, 2006 at 4:05 am
I have also thought a lot about fighting spam (fighting email spam is always an endless fight between David and Goliath) and I came up with many things at the same time:
- Use spider traps (spiders are the “harvesters” for the spammers) which are based on robots.txt and an .htaccess blocking list. Positive: Spambots get locked out by automatically updated .htaccess files. Negative: “Normal” people can also be caught when they click on such “trap links”. Spammers may sell blacklists of such spider traps at high prices.
- Put your email address and other personal data in an image with a fuzzy background. Positive: Spambots may find it much harder to read the data from images, even when they use OCR-enabled spambot software. Negative: Because of the fuzzy background, people with disabilities may not be able to read it correctly and so cannot contact you. And some browsers do not support images (like w3m/lynx for console freaks).
- Use extremely strong anti-spam filters in the server mail software and, in exchange, offer a contact form with captcha. Positive: Spam is not prevented, but it is reduced heavily. I want to show you my main.cf and master.cf for my Postfix/VHCS installation here:
- http://www.mxchange.org/master.cf
- http://www.mxchange.org/main.cf
Negative: Normal mails can also be blocked. Some people don’t like contact forms; they prefer email contact instead. Blocking software consumes lots of CPU resources due to the filtering process. This may slow down your server and make it vulnerable to (D)DoS attacks.
Well, for blog software, you may want to use a “battery” of plugins: BB2, Bodcheck, Email-Immunitzer and so on, all at the same time. See my blog for what anti-spam plugins I use. And you may also want to think about the resources every plugin uses.
August 21st, 2006 at 4:07 am
Please correct: […] Use spider traps (spiders are the “harvesters” for the spammer) which […]
August 21st, 2006 at 3:43 pm
Thanks for the instructions. Have become a regular reader and learn something each day.
August 23rd, 2006 at 12:11 pm
that posting here is so useless… i have no example to describe how useless it is.
August 24th, 2006 at 3:50 am
The guy above who signed with Shoemoney, is NOT the Shoemoney that’s most closely associated with that nick. It’s some jerk from Germany.