vBulletin SEO Forums


Google Duplicate Proxy Exploit

This is a discussion on Google Duplicate Proxy Exploit within the General Discussion forums, part of the vBulletin SEO Discussion category.

#1
12-24-2007, 06:29 AM
NeutralizeR
Senior Member
Big Board Administrator
Real Name: Mavi KARANLIK
Join Date: Feb 2006
Location: Ankara/TÜRKİYE
Posts: 296

Quote:
Defend Your Website From Google Duplicate Proxy

There is a current and active way to knock a website out of Google's search engine results. It's simple and effective. This information is already in the public domain, and the more people who know about it, the more likely it is that Google will do something about it. This article explains how the exploit works, how a website gets knocked out of the search engine rankings and, most importantly, how to defend your own website against it.

To understand this exploit, you must first understand Google's Duplicate Content filter. Put simply: Google doesn't want you to search for "blue widget" and have the top 10 results all be copies of the same article on how great blue widgets are. They want to give you ONE copy of the Great Blue Widget article and 9 different results, on the off chance that you've already read that article and one of the other results is what you actually wanted.

To handle this, every time Google spiders and indexes a page, it checks whether it already has a page that is predominantly the same: a duplicate page, if you will. Nobody outside Google knows exactly how this is worked out, but it is likely to be a combination of some or all of: page text length, page title, headings, keyword densities, exactly copied sentence fragments, and so on. As a result of this duplicate content filter, a whole industry has grown up around trying to get round it; just search for "spin article".
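Google's actual method is not public, so the following is only a sketch of one common textbook approach to near-duplicate detection: break each page into overlapping word sequences ("shingles") and compare the two sets with Jaccard similarity. The function names, the 3-word shingle size and the 0.8 threshold are all assumptions chosen for illustration, not anything Google has confirmed.

```javascript
// Hypothetical near-duplicate check: word shingles + Jaccard similarity.
// Illustration only; this is NOT Google's duplicate content filter.

// Split text into lowercase words and build the set of k-word shingles.
function shingles(text, k = 3) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const set = new Set();
  for (let i = 0; i + k <= words.length; i++) {
    set.add(words.slice(i, i + k).join(' '));
  }
  return set;
}

// Jaccard similarity: shared shingles / all distinct shingles (0 to 1).
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const s of a) if (b.has(s)) shared++;
  return shared / (a.size + b.size - shared);
}

// Treat two pages as duplicates above an assumed, arbitrary threshold.
function looksDuplicate(pageA, pageB, threshold = 0.8) {
  return jaccard(shingles(pageA), shingles(pageB)) >= threshold;
}

const original = 'Blue widgets are great because blue widgets last for years.';
const proxyCopy = 'Blue widgets are great because blue widgets last for years.';
const other = 'Red gadgets are cheap but red gadgets break within months.';

console.log(looksDuplicate(original, proxyCopy)); // true
console.log(looksDuplicate(original, other));     // false
```

Note that a check like this says two pages are the same, but nothing about which one came first; that is the gap the rest of the article is about.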

Getting back to the story: Google indexes a page and, let's say, it fails its duplicate content check. What does Google do? These days, it dumps that duplicate page in Google's Supplemental Index. What, you didn't know that Google has 2 indexes? Well, it does: the main one and the supplemental one. Two things are important here: Google will always return results from its main index if it can, and it will only go to the supplemental index if it doesn't get enough results from the main index. This means that if your page is in the supplemental index, it's almost certain that you will never show up in the Search Engine Results Pages (SERPs), unless there is next to no competition for the phrase that was searched for.

This all seems pretty reasonable to me, so what's the problem? Well, there's another little step I haven't mentioned yet. What happens if someone copies your page, let's say the homepage of your business website, and when Google indexes that copy, it correctly determines that it's a duplicate? Now Google knows about 2 pages that are duplicates, and it has to decide which to dump in the supplemental index and which to keep in the main one. That's pretty obvious, right? But how does Google know which is the original and which is the copy? It doesn't. Sure, it has some clever algorithms to work it out, but even if they are 99% accurate, that leaves a lot of problems for the 1% of cases they get wrong!

And this is the heart of the exploit: if someone copies, say, your website's homepage and manages to convince Google that *their* page is the original, your homepage will get tossed into the supplemental index, not to see the light of day in the Search Engine Results Pages for a while. In case I'm not being clear enough, that's bad! But wait, it gets worse:

It's fair to say that when a person physically copies your page and hosts it, you can often get them to take it down through copyright lawyers and cease and desist letters to ISPs and the like, followed by a quick "Reinclusion Request" to Google. But recently there's a new threat that's a whole lot harder to stop: publicly accessible proxy websites. (If you don't know what a proxy is, it's basically an intermediary that fetches web pages on your behalf; caching proxies also make the web run faster by keeping copies of content closer to you. In principle they are generally a good thing.)

There are many such web proxies out there, and I won't list any here, but I will describe the process: they send out spiders (much like Google's) that crawl your page and take your content, then they host a copy of your website on their proxy site, nominally so that when their users request your page, they can serve up their local copy quickly rather than having to retrieve it from your server. The big issue is that Google can sometimes decide that the proxy copy of your web page is the original, and yours is not.

Worse again, there's some evidence that people are deliberately and maliciously using proxy servers to cache copies of web pages, then using normal (white and black hat) Search Engine Optimization (SEO) techniques to make those proxy pages rank in the search engine, increasing the likelihood that your legitimate page will be the one dumped by the search engines' duplicate content filters. Danger Will Robinson!

Even worse still, some of the proxy spiders actively spoof their origins so that you don't realize they come from a proxy: they pretend to be Googlebot, for example, or a Yahoo spider. This is why the major search engines publish guidelines on how to identify and validate their own spiders.

Now for the big question: how can you defend against this? There are several possible solutions, depending on your web hosting technology and technical competence.

Option 1 - If you are running Apache and PHP on your server, you can set the web host up to check for spiders that purport to be from the main search engines and, using PHP and the .htaccess file, block proxies from other sources. However, this only works for proxies that play by the rules and identify themselves correctly.
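The article does not say how a proxy "identifies itself". One plausible reading, and it is only an assumption, is the set of HTTP headers that well-behaved proxies add when they forward a request (`Via`, `Forwarded`, `X-Forwarded-For`). The sketch below shows that check in JavaScript rather than PHP; the `declaresProxy` name and the plain-object shape of `headers` are made up for illustration.

```javascript
// Headers that rule-following proxies add to the requests they forward.
// Which headers a given proxy sends is an assumption; many send none.
const PROXY_HEADERS = ['via', 'forwarded', 'x-forwarded-for'];

// `headers` is a plain object of request headers (hypothetical shape).
// Header names are compared case-insensitively.
function declaresProxy(headers) {
  const names = Object.keys(headers).map((name) => name.toLowerCase());
  return PROXY_HEADERS.some((header) => names.includes(header));
}

console.log(declaresProxy({ 'User-Agent': 'Mozilla/5.0', Via: '1.1 proxy.example' })); // true
console.log(declaresProxy({ 'User-Agent': 'Mozilla/5.0' }));                           // false
```

Ordinary visitors behind corporate proxies or CDNs can send the same headers, so a check like this is better suited to deciding what to serve than to blocking outright, and it does nothing against a proxy that strips them.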

Option 2 - If you are using MS Windows and IIS on your server, or if you are on a shared hosting plan that doesn't give you the ability to do anything clever, it's an awful lot harder, and you should take the advice of a professional on how to defend yourself from this kind of attack.

Option 3 - This is currently the best solution available, and applies if you are running a PHP or ASP based website: you set the robots meta tag on ALL pages to "noindex" and "nofollow", then you implement a PHP or ASP script on each page that checks whether the visitor is a valid spider from one of the major search engines and, if so, resets the robots meta tag to "index" and "follow". The important distinction here is that it's easier to validate a real spider, and to discount one that's trying to spoof you, because the major search engines publish processes and procedures for doing this, including IP lookups and the like.
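As a rough sketch of Option 3, here is the same idea in JavaScript (Node.js) instead of PHP or ASP. It follows the reverse-then-forward DNS lookup that search engines describe for validating their spiders: look up the hostname for the visitor's IP, check that it belongs to the search engine, then look that hostname up again and confirm it points back to the same IP. The function names, the hostname suffixes and the injectable `resolver` parameter are assumptions for illustration; check each search engine's current documentation for the real hostnames.

```javascript
const dns = require('dns').promises;

// Assumed hostname suffixes for Google's crawlers; confirm them against
// the search engine's own documentation before relying on this list.
const SPIDER_SUFFIXES = ['.googlebot.com', '.google.com'];

// Reverse-resolve the IP, check the hostname suffix, then resolve the
// hostname forward and confirm it maps back to the same IPv4 address.
// `resolver` is injectable so the logic can run without network access.
async function isVerifiedSpider(ip, resolver = dns) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip);
  } catch (err) {
    return false; // no reverse record: treat as unverified
  }
  for (const host of hostnames) {
    const name = host.toLowerCase();
    if (!SPIDER_SUFFIXES.some((suffix) => name.endsWith(suffix))) continue;
    try {
      const addresses = await resolver.resolve4(name);
      if (addresses.includes(ip)) return true;
    } catch (err) {
      // forward lookup failed: fall through to the next hostname
    }
  }
  return false;
}

// Default to noindex/nofollow; only verified spiders get index/follow.
function robotsMetaTag(verified) {
  const content = verified ? 'index, follow' : 'noindex, nofollow';
  return `<meta name="robots" content="${content}">`;
}

// Offline demo with a stub resolver; addresses and hostnames are made up.
const stubResolver = {
  reverse: async (ip) =>
    ip === '192.0.2.10' ? ['crawl-1.googlebot.com'] : ['cache.proxy.example'],
  resolve4: async (name) =>
    name === 'crawl-1.googlebot.com' ? ['192.0.2.10'] : [],
};

(async () => {
  for (const ip of ['192.0.2.10', '198.51.100.7']) {
    const verified = await isVerifiedSpider(ip, stubResolver);
    console.log(ip, robotsMetaTag(verified));
  }
})();
```

Because a proxy's own fetch fails the check, the copy it stores carries "noindex, nofollow", while a verified spider visiting your real site sees "index, follow". A spider that only fakes its User-Agent never gets that far, since the lookup works from the IP address and ignores what the visitor claims to be.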

So stay aware, stay knowledgeable, and stay protected. And if you see that you've suddenly been dumped from the Search Engine Results Pages, now you might know why, how, and what to do about it.

About this author
Sophie White is an Internet Marketing and Website Promotion Consultant at Intrinsic Marketing, an SEO and Pay Per Click firm dedicated to supplying Better Website ROI.
Source: Defend Your Website From Google Duplicate Proxy - SEO Marketing

I have a major problem with proxies indexing/caching my web site. Thousands of my pages have already been indexed/cached by some proxies. I suspect this is the cause of my critical traffic drop (80%).

Check this thread please:
From 3rd place to nowhere

Can you please implement this script in vBSEO?
SEO Egghead by Jaimie Sirovich: SimpleCloak v2 PHP Implementation

Please post your opinions, thanks.

#2
12-24-2007, 06:41 AM
NeutralizeR
The proxies I'm talking about:
site:tylerschnaidt.com msxlabs - Google Search
site:surffreedom.com msxlabs - Google Search

#3
12-25-2007, 02:42 AM
briansol
Senior Member
vBSEO Pre-Release Team, Design for SEO, Big Board Administrator
Real Name: Brian
Join Date: Apr 2006
Location: Central CT, USA
Posts: 5,545
You can add a simple JS snippet to your head include:

Code:
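<script type="text/javascript">
var host = 'msxlabs.com'; // the forum's real hostname
if (location.hostname != host) {
  // served from another domain (a proxy copy): reload from the real host
  top.location.replace(location.href.replace(location.hostname, host));
} else if (top != self) {
  top.location.replace(location.href); // framed by another site: break out
}
</script>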
That will break the page out of a frame and/or a proxy.

#4
12-25-2007, 06:00 AM
NeutralizeR
Quote:
Originally Posted by briansol
you can add a simple JS into your head include... that will break the site out of the frame and/or proxy

Will it also prevent proxy caching, i.e. stop Google from indexing the proxy copies of my pages?

Tags: duplicate, exploit, google, proxy



All times are GMT -4.

