mailto: blog -at- heyrick -dot- eu

b.log abuse

Here's a quick heads up for those who would like to read my witterings off-line. I am soon to implement some logic to auto-ban spiders and content suckers that exhibit abusive behaviour.
I maintain a manual list of blocked IPs, and as such I keep an eye on my logs. Here is a recent excerpt from one address (my URL munged so it fits better; you know the URL, you're reading this!):
  2010/11/27 09:25:49  ?=20080621  http[...]/index.php?diary=00000000
  2010/11/27 09:25:49  ?=20080622  http[...]/index.php?diary=00000000
  2010/11/27 09:25:50  ?=20080627  http[...]/index.php?diary=00000000
  2010/11/27 09:25:50  ?=20080629  http[...]/index.php?diary=00000000
  2010/11/27 09:25:51  ?=20080630  http[...]/index.php?diary=00000000
  2010/11/27 09:25:51  ?=20080702  http[...]/index.php?diary=00000000
  2010/11/27 09:25:51  ?=20080704  http[...]/index.php?diary=00000000
  2010/11/27 09:25:52  ?=20080705  http[...]/index.php?diary=00000000
  2010/11/27 09:25:52  ?=20080706  http[...]/index.php?diary=00000000
It should perhaps come as no surprise to see that the IP address relates to Shenzhen, China (a place just north of Hong Kong). Sorry guys, but you're now on the block list (and looking at my list, I'm getting very close to blocking the entirety of China).

I am not going to say exactly how my algorithm will be implemented; suffice it to say it depends on how much time has elapsed since a specific IP address last requested something. The idea is to trap 'bots that behave in a manner similar to the above, but it may well catch out those who use my search and then try to "open in new tab" on everything of interest. Or, alternatively, those who want to wget everything.
It will be sad if the trigger happens for "open in new tab", but I'm in two minds regarding wget. While I am not overly bothered by people taking the content to read offline (for the stupid, this means personal private use permitted, not a giveaway of copyrights), I don't appreciate it if you clobber my server in order to achieve this.
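
To illustrate the general idea only (the real algorithm is deliberately not described here, and this is not it), a minimal Python sketch of an "elapsed time per IP" check might look something like the following. Everything in it is an assumption made up for illustration: the RequestThrottle name, the two-second gap, and the strike count.

```python
import time


class RequestThrottle:
    """Hypothetical per-IP throttle; NOT the algorithm used on this site."""

    def __init__(self, min_gap=2.0, max_strikes=5, clock=time.monotonic):
        self.min_gap = min_gap          # assumed: seconds expected between requests
        self.max_strikes = max_strikes  # assumed: rapid requests tolerated before a ban
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = {}             # ip -> time of that ip's previous request
        self.strikes = {}               # ip -> consecutive too-fast requests
        self.banned = set()

    def allow(self, ip):
        """Return True to serve the request, False to answer with a 403."""
        if ip in self.banned:
            return False
        now = self.clock()
        previous = self.last_seen.get(ip)
        self.last_seen[ip] = now
        if previous is not None and now - previous < self.min_gap:
            # Too soon after the last request: count a strike.
            self.strikes[ip] = self.strikes.get(ip, 0) + 1
            if self.strikes[ip] >= self.max_strikes:
                self.banned.add(ip)
                return False
        else:
            # A patient request wipes the slate clean.
            self.strikes[ip] = 0
        return True
```

With numbers like these, a 'bot firing two requests a second gets banned within a few seconds, while somebody waiting a minute between fetches never collects a single strike. It also shows why a burst of "open in new tab" clicks could trip the same wire.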

Therefore, if you are a wgetter, please may I suggest:

wget -PC:\ricksblog -w60 -r -k -nv -np -nc

Feel free to alter this to suit your preferences. I have specified to pull from the "entire index" and write to C:\ricksblog, munging internal links to point to files you have downloaded, plus retrieve content necessary to display the page (images, etc). The -nc and -w60 parts are very important. The -nc is for no-clobber (so if you run this again to sync your offline copy, it will not download everything all over again) and the critically important part is the -w60, which asks wget to wait a minute between fetches. This means it'll be slow, yes, but on the other hand you won't find your requests dissolving into a bunch of 403s ("forbidden"). I would not recommend you tweak this value. Just fire it up and leave it running in the background.
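
If you would rather script the fetching yourself, the same two courtesies (wait between fetches, don't re-download what you already have) can be sketched in a few lines of Python. This is only a rough illustration, not a replacement for wget: polite_fetch is a made-up name, it does not follow links or rewrite them, and the 60 second default simply mirrors the -w60 above.

```python
import os
import time
import urllib.request
from urllib.parse import quote


def polite_fetch(urls, dest_dir, wait=60, fetch=None, sleep=time.sleep):
    """Save each URL into dest_dir, pausing `wait` seconds between real fetches.

    Files that already exist are skipped, in the spirit of wget's -nc.
    `fetch` and `sleep` are injectable purely so this can be tested offline.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    os.makedirs(dest_dir, exist_ok=True)
    fetched = []
    for url in urls:
        # Percent-encode the whole URL so it is safe to use as a file name.
        path = os.path.join(dest_dir, quote(url, safe=""))
        if os.path.exists(path):
            continue  # already have it, no request made
        if fetched:
            sleep(wait)  # be kind to the server between requests
        with open(path, "wb") as out:
            out.write(fetch(url))
        fetched.append(path)
    return fetched
```

Running it a second time over the same list makes no requests at all, which is the whole point of the no-clobber behaviour.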

I will also be making modifications to my expiry settings, again to try to reduce 'bot behaviour, with the notable exceptions of the GoogleBot and Bing!bot, because in my experience these 'bots are well behaved.


Why am I doing this? I don't need to worry about bandwidth right now, but if my site moves in the future, bandwidth may become a factor. I think seeing automatic "allocation exceeded" messages due to some greedy asses helps nobody, and I'm sure as hell not going to cough up extra cash so Chinese and Russian 'bots can snarf my content at regular intervals. So I'm going to get a little more militant here. I welcome blinking thinking visitors. And be-my-boyfriend requests from cute Japanese girls (non-smoker, please!). I don't welcome scrapers and scavengers.



You've seen them. Those squirrelly, messed up words to verify you're a human and not a machine. Well, here's one I received on a site that, as far as I can work out, saw "com" in "communicate" and decided that I'd added a URL. Hehe...
I think the first letter is an 'a', but I have no idea what that dead ant after the number is. And, for the record, I did have Javascript enabled - how else would it have triggered seeing 'com' in my writing?
Incomprehensible recaptcha.


Recaptcha [update 2010/11/28]

Today's offering was even worse. Logic would say that the correct word is "present", but on the other hand a literal interpretation would say "tneserp" is just as valid, given that these things are supposed to be left-to-right with mutilated letters, not an entire word upside down.
Confusing recaptcha.


Your comments:


Rob, 27th November 2010, 21:23
I often see those captchas with random bits of punctuation after the words. I wonder sometimes if they have slurped some large document to get the selection, and forgotten to ignore it all.
Rick, 28th November 2010, 00:55
I read somewhere ages ago that the original concept of the recaptcha was to take scanned documents, pull out the words, and offer munged ones up for recaptcha. The first few times a word was offered, anything would be accepted, and once the system had a majority vote on what the word was, it would be distorted more and then you'd need to enter the correct word. 
Keeping track of all of this permitted a giant distributed form of human character recognition to be implemented. I guess they had to be careful about the quality of the input, for nobody would appreciate captchas like "rape child" or "bishop motherf***er". For that matter, what is "avrequen"? It's stumped Google, it doesn't appear to be a place, nor a word in Spanish or French. So....? 
Mick, 9th December 2010, 23:16
1. Are you sure you've banned most of China? 
Country: CHINA 
# ISO Code: CN 
# Total Networks: 1,818 
# Total Subnets: 267,237,376 
Would you like me to email you the deny list?? Your .htaccess file will swell! 
I also hate these darned boxes with completely garbled letters. Have you had the option to listen to and type in the spoken word yet? I forget where I stumbled upon this, but imagine a computer voice saying something without any programming of how to pronounce or punctuate the word and you are half way there. Superior Software Speech was far clearer as we humans spelt things 'rong' to get it ta talk propa like eye duz :-@
Rick, 11th December 2010, 16:11
I have not banned China because I don't have much .htaccess control - shame, as passwording folders would be useful too for beta releases and such. Never mind, I'd just block on IP fragments. And yes, I know there are many, but most of my nastiness seems to originate from either Shanghai or Sezchuan (or however you spell that). 
I have not tried the spoken captchas, but I have seen a few with the ability to speak, no doubt as a result of the American disability legislation (see above "Get an audio challenge"). I always wondered, in the case of separate letters or numbers (as opposed to words like above), if it wouldn't be possible to have a computer 'learn' the combinations spoken? 
Ahhh... Superior Speech. ;-) Have you heard Microsoft's Sam yet?

© 2010 Rick Murray
This web page is licenced for your personal, private, non-commercial use only. No automated processing by advertising systems is permitted.
RIPA notice: No consent is given for interception of page transmission.

