First off, the boring introduction part.
Early last year I got hacked with a redirect code.
Thanks to the guys here, once the problem was found I was up and running again in a matter of minutes, so it's only fair I release the bot trap here first.
Everyone seems to have a downer on the likes of China, Russia etc.
But how many know the truth about where your content is going, or who is scraping it?
Some days I have been getting 3000+ unique visitors and my earnings have been beyond dire.
This led me to run some experiments on tracking where my hits come from and what is making them, and on finding a way to track and catch content scrapers.
Your content is being used to help another person grow their websites.
I have been running some tests using nothing more than a free hack from vBulletin.org by Paul M:
Track Guest Visits - vBulletin.org Forum
That's just a link to the 3.8 version, but there is also a 4 series version.
I have also been modifying my spiders_vbulletin.xml file.
The latest version can be found on vBulletin:
WS Spiders List (for updated vBulletin "spiders_vbulletin.xml" files)
Boring bit over
With the above running you will get a nice display of all the spiders hitting you.
As well as the spiders, you can add user agents to catch the content scrapers.
For those that don't know, the spiders_vbulletin.xml file in vB 3.8 is found under forum root > includes > xml.
To add any of the following code, open the file with Notepad, WordPad etc. and add it above </searchspiders>, which is located at or near the very end of the file.
Save it, re-upload it and then click on your Currently Active Users, which will then load the latest changes.
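For anyone unsure where a new entry goes, here is a rough sketch of how the end of the file might look after an edit. The two comment lines are only placeholders I have added for illustration (leave them out of your real file), the Java entry is just one of the examples listed further down, and your copy may be laid out a little differently.

```xml
	<!-- ...existing spider entries stay exactly as they are... -->

	<!-- new entry goes just above the closing tag -->
	<spider ident="Java">
		<name>Java</name>
	</spider>
</searchspiders>
```

The ident value is the piece of user agent text you want to catch, and the name is the label you want to see for it, so keep the name short and recognisable.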
These are some of my results tracking some user agents
<spider ident="FREE">
<name>Free</name>
</spider>
100% success rate on catching scrapers and spammers
<spider ident="Deepnet Explorer">
<name>Deepnet Explorer</name>
</spider>
100% success rate on catching scrapers and spammers
<spider ident="Gentoo">
<name>Gentoo</name>
</spider>
100% success rate. This will show a lot of hits from hosts such as SoftLayer and ThePlanet, to name a few
<spider ident="DigExt">
<name>DigExt</name>
</spider>
100% success rate. Scrapers and spammers
<spider ident="Java">
<name>Java</name>
</spider>
70% success rate. But this one found me the moron that had been overloading my server by scraping 150 pages at a time
<spider ident="Windows 98">
<name>Windows 98</name>
</spider>
How many people still use Windows 98? This caught another moron that was scraping me, along with a load of other unwanted visitors. Success rate 100%
<spider ident="WordPress">
<name>WordPress</name>
</spider>
This will show a lot of idiots running a WordPress add-on that scrapes you for content.
You will see some try to leave comments.
One guy was hitting me and scraping all my content via the RSS feeds and removing my link. Let's just say he had a divert placed to a rude site's RSS feeds.
You can see the above in action on my forum, because I have the results on show.
Plus you will get an idea of some of the other user agents I'm tracking to see what the results are like.
If anyone wants, I will add my own robots file. But it can change on an hourly basis if I find something else to track, or I get too many false positives.
The above are all good to use, going by my own experiments so far.
If you find the above helps you, bung me a backlink.
This is a project I have been working on for a few weeks, and I thought it only fair to share my findings and, with any luck, have others share theirs.

