mailto: blog -at- heyrick -dot- eu


b.log abuse

Here's a quick heads-up for those who would like to read my witterings off-line. I am soon to implement some logic to auto-ban spiders and content suckers that behave abusively.
I maintain a manual list of blocked IPs, and this is the sort of thing that gets an address added to it (my URL munged so it fits better; you know the URL, you're reading this!):
  2010/11/27 09:25:49  ?=20080621  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:49  ?=20080622  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:50  ?=20080627  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:50  ?=20080629  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:51  ?=20080630  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:51  ?=20080702  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:51  ?=20080704  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:52  ?=20080705  183.17.47.28  http[...]/index.php?diary=00000000
  2010/11/27 09:25:52  ?=20080706  183.17.47.28  http[...]/index.php?diary=00000000
It should perhaps come as no surprise to see that the IP address relates to Shenzhen, China (a place just above Hong Kong). Sorry guys, but you're now on the block list (and looking at my list, I'm getting very close to blocking the entirety of China).
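
For anyone wanting to spot this pattern in their own logs, here is a minimal Python sketch. It assumes the whitespace-separated layout of the lines above (date, time, requested entry, IP, then the munged URL). The function names and the thresholds (more than three hits inside five seconds) are assumptions made for illustration, not anything taken from the real server.

```python
from collections import defaultdict
from datetime import datetime

# A few lines copied from the excerpt above.
LOG = """\
2010/11/27 09:25:49  ?=20080621  183.17.47.28  http[...]/index.php?diary=00000000
2010/11/27 09:25:49  ?=20080622  183.17.47.28  http[...]/index.php?diary=00000000
2010/11/27 09:25:50  ?=20080627  183.17.47.28  http[...]/index.php?diary=00000000
2010/11/27 09:25:52  ?=20080706  183.17.47.28  http[...]/index.php?diary=00000000
"""


def parse_line(line):
    """Return (timestamp, ip) for one log line in the layout shown above."""
    fields = line.split()
    stamp = datetime.strptime(f"{fields[0]} {fields[1]}", "%Y/%m/%d %H:%M:%S")
    return stamp, fields[3]


def busy_ips(log_text, max_hits=3, window_seconds=5):
    """Return the IPs that made more than max_hits requests within any window."""
    hits = defaultdict(list)
    for line in log_text.splitlines():
        if line.strip():
            stamp, ip = parse_line(line)
            hits[ip].append(stamp)

    flagged = set()
    for ip, stamps in hits.items():
        stamps.sort()
        start = 0
        for end, stamp in enumerate(stamps):
            # Slide the window start forward until it spans at most window_seconds.
            while (stamp - stamps[start]).total_seconds() > window_seconds:
                start += 1
            if end - start + 1 > max_hits:
                flagged.add(ip)
                break
    return flagged


print(busy_ips(LOG))  # {'183.17.47.28'}
```

A human clicking through search results would rarely make four requests in three seconds, which is why a short window with a small hit count is enough to separate the two in this sample.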

I am not going to say exactly how my algorithm will be implemented; suffice it to say it is a function of how much time has elapsed since a specific IP address last requested something. The idea is to trap 'bots that behave in a manner similar to the above, but it may well catch out those who use my search and then try to "open in new tab" on everything of interest. Or, alternatively, those who want to wget everything.
It will be sad if the trigger happens for "open in new tab", but I'm in two minds regarding wget. While I am not overly bothered by people taking the content to read offline (for the stupid, this means personal private use is permitted, not a giveaway of copyrights), I don't appreciate it if you clobber my server in order to achieve this.
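
To make the general idea concrete without giving the real thing away, the sketch below shows one generic way a time-since-last-request check could work, in Python. It is not the algorithm described here, which is deliberately left unspecified. The class name, the two-second gap and the five-strike limit are all assumptions picked for illustration.

```python
import time


class IntervalThrottle:
    """Ban an IP once too many consecutive requests arrive too close together.

    min_gap (seconds) and max_strikes are illustrative values only.
    """

    def __init__(self, min_gap=2.0, max_strikes=5, clock=time.monotonic):
        self.min_gap = min_gap
        self.max_strikes = max_strikes
        self.clock = clock
        self.last_seen = {}  # ip -> time of that IP's previous request
        self.strikes = {}    # ip -> consecutive too-fast requests so far
        self.banned = set()

    def allow(self, ip):
        """Return True to serve the request, False to answer with a 403."""
        if ip in self.banned:
            return False
        now = self.clock()
        previous = self.last_seen.get(ip)
        self.last_seen[ip] = now
        if previous is not None and now - previous < self.min_gap:
            self.strikes[ip] = self.strikes.get(ip, 0) + 1
            if self.strikes[ip] >= self.max_strikes:
                self.banned.add(ip)
                return False
        else:
            # A patient request resets the count, so slow readers are never banned.
            self.strikes[ip] = 0
        return True
```

Counting strikes rather than banning on the first quick request is one way to be kinder to the "open in new tab" crowd: a handful of rapid clicks would pass, while a sustained burst would not.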

Therefore, if you are a wgetter, please may I suggest:

wget -PC:\ricksblog -w60 -r -k -nv -np -nc http://www.heyrick.co.uk/blog/index.php?diary=00000000

Feel free to alter this to suit your preferences. I have told it to pull from the "entire index" and write to C:\ricksblog, munging internal links so they point to the files you have downloaded, and to retrieve the content necessary to display the page (images, etc). The -nc and -w60 parts are very important. The -nc is for no-clobber (so if you run this again to sync your offline copy, it will not download everything all over again) and the critically important part is the -w60, which asks wget to wait a minute between fetches. This means it'll be slow, yes, but on the other hand you won't find your request dissolves into a bunch of 403s ("forbidden"). I would not recommend you tweak this value. Just fire it up and leave it running in the background.
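
For the curious, the combined effect of -w60 and -nc can be sketched in a few lines of Python. This only illustrates the behaviour and is not how wget works internally. The names polite_mirror and fetch_url, and the idea of passing in a filename-to-URL mapping, are hypothetical.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def polite_mirror(pages, out_dir, wait=60, fetch=fetch_url, sleep=time.sleep):
    """Save each page to out_dir, skipping existing files and pausing between fetches.

    pages maps a local filename to the URL it should be fetched from.
    Returns the list of filenames that were actually downloaded.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name, url in pages.items():
        target = out / name
        if target.exists():
            # Same spirit as -nc: never download what is already on disk.
            continue
        if fetched:
            # Same spirit as -w60: wait between fetches, not before the first.
            sleep(wait)
        target.write_bytes(fetch(url))
        fetched.append(name)
    return fetched
```

Run a second time against the same folder, a loop like this makes no requests at all for pages already saved, which is the whole point of keeping -nc in the command.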

I will also be making modifications to my expiry settings, again to try to reduce 'bot behaviour, with the notable exceptions of the GoogleBot and Bing!bot, because in my experience these 'bots are well behaved.

 

Why am I doing this? I don't need to worry about bandwidth right now, but if my site moves in the future, bandwidth may become a factor. I think seeing automatic "allocation exceeded" messages due to some greedy asses helps nobody, and I'm sure as hell not going to cough up extra cash so Chinese and Russian 'bots can snarf my content at regular intervals. So I'm going to get a little more militant here. I welcome blinking thinking visitors. And be-my-boyfriend requests from cute Japanese girls (non-smoker, please!). I don't welcome scrapers and scavengers.

 

Recaptcha

You've seen them. Those squirly messed up words to verify you're a human and not a machine. Well, here's one I received on a site that, as far as I can work out, saw "com" in "communicate" and decided that I'd added a URL. Hehe...
I think the first letter is an 'a', but I have no idea what that dead ant after the number is. And, for the record, I did have JavaScript enabled - how else would it have triggered on seeing 'com' in my writing?
Incomprehensible recaptcha.

 

Recaptcha [update 2010/11/28]

Today's offering was even worse. Logic would say that the correct word is "present", but on the other hand a literal interpretation would say "tneserp" is just as valid, given that these things are supposed to be left-to-right with mutilated letters, not an entire word upside down.
Confusing recaptcha.

 

Your comments:

Rob, 27th November 2010, 21:23
I often see those captchas with random bits of punctuation after the words. I wonder sometimes if they have slurped some large document to get the selection, and forgotten to ignore it all.
Rick, 28th November 2010, 00:55
I read somewhere ages ago that the original concept of the recaptcha was to take scanned documents, pull out the words, and offer munged ones up for recaptcha. The first few times a word was offered, anything would be accepted, and once the system had a majority vote on what the word was, it would be distorted more and then you'd need to enter the correct word. 
Keeping track of all of this permitted a giant distributed form of human character recognition to be implemented. I guess they had to be careful about the quality of the input, for nobody would appreciate captchas like "rape child" or "bishop motherf***er". For that matter, what is "avrequen"? It's stumped Google, it doesn't appear to be a place, nor a word in Spanish or French. So....?
Mick, 9th December 2010, 23:16
1. Are you sure you've banned most of China? 
Country: CHINA 
# ISO Code: CN 
# Total Networks: 1,818 
# Total Subnets: 267,237,376 
 
Would you like me to email you the deny list?? Your .htaccess file will swell! 
 
I also hate these darned boxes with completely garbled letters. Have you had the option to listen to and type in the spoken word yet? I forget where I stumbled upon this but imagine a computer voice saying something without any programming of how to pronounce or punctuate the word and you are halfway there. Superior Software Speech was far clearer as we humans spelt things 'rong' to get it ta talk propa like eye duz :-@
Rick, 11th December 2010, 16:11
I have not banned China because I don't have much .htaccess control - shame, as passwording folders would be useful too for beta releases and such. Never mind, I'd just block on IP fragments. And yes, I know there are many, but most of my nastiness seems to originate from either Shanghai or Sezchuan (or however you spell that).
 
I have not tried the spoken captchas, but I have seen a few with the ability to speak, no doubt as a result of the American disability legislation (see above "Get an audio challenge"). I always wondered, in the case of separate letters or numbers (as opposed to words like above), if it wouldn't be possible to have a computer 'learn' the combinations spoken? 
 
Ahhh... Superior Speech. ;-) Have you heard Microsoft's Sam yet?
