After attending the Search Engine Strategies conference last week at the Hilton in Manhattan, New York City, I decided to share what I took away from the sessions I attended.
The following notes are from the duplicate content session. I will attach some PowerPoint slides for members here to download and view:
- Discussion on hosting the same content on multiple domains. Nothing new that we haven't already experienced or don't already know. (Kennedy 3)
- Watch out for dynamic URLs that serve the same information. (http://www.superpages.com/yellowpages/C-Art+Galleries+%26+Dealers/S-IL/T-Bloomington/ serves the same content as http://www.superpages.com/yellowpages/S-IL/T-Bloomington/C-Art+Galleries+%26+Dealers) I believe Parid fixed these issues last year. (Kennedy 11)
- If your content has been dropped from a search engine, you can fill out a reinclusion request. (Kennedy 15-17)
- Yahoo
- Search engines are looking for unique content. They are removing headers, footers, and side navigation when indexing.
- Press releases are not considered duplicate spam because of their linkage properties.
- Search engines look at host name resolution; multiple host names per IP address.
- When indexing, web pages are broken into word sets (shingles). Rearranging those shingles into a different order doesn't add any benefit. The search engine still sees that all the same shingles are there.
- If you must have duplicate content, use a meta tag (noindex) to keep the secondary copies out of the index.
- One panelist (Yahoo?) mentioned that print-friendly pages are a problem since they duplicate the original page. Matt Cutts said not to worry about print-friendly pages.
- Track paths through cookies, not URLs. This seems most appropriate for labs, since they do a lot of tracking through URLs, which ends up duplicating content.
- Make sure to link to directory pages consistently. These three links have different URLs (although I can't imagine that search engines really have trouble distinguishing between the first two):
- directory/
- directory
- directory/index.cgi
- Don't use session IDs.
- When redesigning a site, make sure that only one version of the site (old or new) is being indexed.
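
The URL consistency points above (the three directory variants and session IDs) can be sketched as a small URL normalizer. This is only a rough illustration, not something shown at the session. The index file names and session parameter names are hypothetical placeholders, and treating an extensionless last path segment as a directory is an assumption that will not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical values: adjust to whatever your own site actually uses.
INDEX_FILES = {"index.cgi", "index.html", "index.htm"}
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def canonical_url(url):
    """Collapse common duplicate-URL variants into one form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    head, _, last = path.rpartition("/")

    if last.lower() in INDEX_FILES:
        # directory/index.cgi -> directory/
        path = head + "/"
    elif "." not in last and not path.endswith("/"):
        # directory -> directory/ (assumes no extension means a directory)
        path += "/"

    # Drop session-ID style parameters, keep everything else in order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )


if __name__ == "__main__":
    for variant in (
        "http://example.com/directory",
        "http://example.com/directory/",
        "http://example.com/directory/index.cgi?sid=abc123",
    ):
        print(variant, "->", canonical_url(variant))
```

In practice you would use something like this to pick the one form you link to internally, and redirect the other variants to it, rather than relying on the search engine to work it out.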
- Matt Cutts
- People are more worried than they should be. Google knows mistakes happen and isn't looking to punish anyone for innocent mistakes. An example was given of someone who was asking about duplicate content issues. When asked how many domains their content was on, they sheepishly replied 2,500 domains. This is the type of person/site Google wants to go after (my 2c: at first, anyway). Search engines are aware there are good and bad reasons for showing duplicate content. (I remember the Yahoo panelist nodding at this point.)
- Google will soon be rolling out a new infrastructure that will specifically deal with multiple domains.
- Use Google Sitemaps, which lets you test your robots.txt. You can see exactly how Googlebot will crawl, which pages it gets stuck on, and which pages end up with duplicate content.
- Search engines are getting smarter about understanding JavaScript. (My 2c: My feeling when this was being talked about was that Google probably has a bot that understands JavaScript. They can, or soon will, figure out what the script is doing. The Yahoo panelist was nodding with a grin on his face as well. However, I expect they will only use these tools for internal analysis.)
- Question: "How can a search engine determine what the 'real page' is?"
- Yahoo: By using shingling techniques and algorithms.
- Google: By using algorithms that look at how often a site has duplicate content from other sources. "How much you copy from versus how much others are copying from you".
- Yahoo has a possible proposal for webmasters to "noindex" only portions of pages.
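
To make the shingling point in the notes more concrete, here is a minimal sketch of word-level shingling compared with Jaccard overlap. It is a textbook version of the idea, not how Yahoo or Google actually implement it. The shingle size of 3 and the function names are my own assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard overlap of the two shingle sets: 1.0 means identical sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    first = "Search engines look for unique content on every page."
    second = "Headers and footers are stripped before a page is indexed."
    original = first + " " + second
    reordered = second + " " + first  # same sentences, different order

    print("reordered vs original:", resemblance(original, reordered))
    print("unrelated vs original:",
          resemblance(original, "Completely different words appear here today."))
```

Because the comparison works on sets, swapping whole blocks of text only changes the few shingles that straddle the block boundaries, so a reordered page still scores as mostly the same content.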
You can download the PPT at Rantchaos.
Thanks
Mike
Sportsrant.com









