Results 16 to 30 of 36

Robots.txt now supports the sitemap file

This discussion is in the General Discussion forums, part of the vBSEO Google/Yahoo Sitemap category.

  1. #16
    Senior Member
    Real Name
    Ged
    Join Date
    Dec 2006
    Location
    UK
    Posts
    438
    Liked
    0 times
    Have done our site with this today, let's see what happens.

  2. #17
    Junior Member jgommel's Avatar
    Real Name
    James Gommel
    Join Date
    Aug 2006
    Location
    Cleveland, OH
    Posts
    24
    Liked
    0 times
    Quote Originally Posted by briansol View Post
    No, you only want to link to the root, as vBSEO has the .htaccess condition to rewrite it to its 'real' location.

    http://www.mysite.com/forum/sitemap_index.xml.gz
    or
    http://www.mysite.com/sitemap_index.xml.gz
    So if I understand correctly...it doesn't matter which one you put in your robots.txt because the .htaccess rewrite will direct it to the sitemap_index.xml.gz in the /data folder, correct?

  3. #18
    Senior Member Code Monkey's Avatar
    Real Name
    Code Monkey
    Join Date
    Aug 2006
    Posts
    780
    Liked
    0 times
    Yes, it does matter. If your forums are in a subdirectory and you want to include your front page, the link needs to point to the file in the root, not the subdirectory. Pages above the sitemap location will not be recognized. That's the reason for the smoke and mirrors that make the sitemap look like it is somewhere else.

    So if your forums are in the forums/ directory, you would move the .htaccess code for the sitemap (not the vBSEO code) to an .htaccess in your root and add forums/ to the link structure, like this:

    Code:
    RewriteRule ^sitemap(\.txt(\.gz)?)$ forums/vbseo_sitemap/vbseo_getsitemap.php?sitemap=urllist$1 [L]
    RewriteRule ^((urllist|sitemap).*\.(xml|txt)(\.gz)?)$ forums/vbseo_sitemap/vbseo_getsitemap.php?sitemap=$1 [L]
    Then you would use

    Code:
    http://www.example.com/sitemap_index.xml.gz
    Last edited by Code Monkey; 04-13-2007 at 11:47 PM.
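
A minimal sketch of what the two rules in post #18 do, written in Python rather than Apache configuration. It assumes the rules sit in an .htaccess file in the site root, where mod_rewrite matches the request path without its leading slash, and that Python's `re` module treats these simple patterns the same way Apache's regex engine does. The `rewrite` helper is hypothetical and only illustrates the mapping.

```python
import re

# The two patterns from the post, paired with their substitution targets.
# Apache's $1 becomes \1 in a Python replacement template.
RULES = [
    (re.compile(r"^sitemap(\.txt(\.gz)?)$"),
     r"forums/vbseo_sitemap/vbseo_getsitemap.php?sitemap=urllist\1"),
    (re.compile(r"^((urllist|sitemap).*\.(xml|txt)(\.gz)?)$"),
     r"forums/vbseo_sitemap/vbseo_getsitemap.php?sitemap=\1"),
]


def rewrite(path):
    """Return the internal target for a request path, or None if no rule matches.

    The first matching rule wins, which mirrors the [L] flag on each rule.
    """
    for pattern, target in RULES:
        match = pattern.match(path)
        if match:
            return match.expand(target)
    return None


for requested in ("sitemap_index.xml.gz", "sitemap.txt.gz", "index.php"):
    print(requested, "->", rewrite(requested))
```

The request for sitemap_index.xml.gz in the root never touches a real file there; it is handed to the script under forums/vbseo_sitemap/, which is why the sitemap appears to live at the top level.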

  4. #19
    Junior Member jgommel's Avatar
    Real Name
    James Gommel
    Join Date
    Aug 2006
    Location
    Cleveland, OH
    Posts
    24
    Liked
    0 times
    Quote Originally Posted by Code Monkey View Post
    Yes, it does matter. If your forums are in a subdirectory and you want to include your front page, the link needs to point to the file in the root, not the subdirectory. Pages above the sitemap location will not be recognized. That's the reason for the smoke and mirrors that make the sitemap look like it is somewhere else.

    So if your forums are in the forums/ directory, you would move the .htaccess code for the sitemap (not the vBSEO code) to an .htaccess in your root and add forums/ to the link structure, like this:

    Code:
    RewriteRule ^sitemap(\.txt(\.gz)?)$ forums/vbseo_sitemap/vbseo_getsitemap.php?sitemap=urllist$1 [L]
    RewriteRule ^((urllist|sitemap).*\.(xml|txt)(\.gz)?)$ forums/vbseo_sitemap/vbseo_getsitemap.php?sitemap=$1 [L]
    Then you would use

    Code:
    http://www.example.com/sitemap_index.xml.gz
    Thanks, I just changed my setup to mirror your explanation. I have vBulletin installed in a subdirectory called /vb/. Just added it to my Google Webmaster tools and it found the sitemap no problem.

    Thank you!
    Last edited by jgommel; 04-14-2007 at 12:33 AM.

  5. #20
    Senior Member Code Monkey's Avatar
    Real Name
    Code Monkey
    Join Date
    Aug 2006
    Posts
    780
    Liked
    0 times
    No problem. Once you get your head around it, it's smooth sailing from there. The point is to keep the sitemaps themselves in a secure, writable directory, yet make them appear to be at the top level of your site. I'm glad you got it straightened out. Get ready for the bot explosion.
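
The reason the apparent location matters is the scope rule mentioned earlier in the thread: pages above the sitemap's location are not recognized. A rough sketch of that check, assuming scope is simply "same scheme and host, and the page path starts with the sitemap's directory". The `in_sitemap_scope` helper and the example.com URLs are illustrative, not part of vBSEO.

```python
import posixpath
from urllib.parse import urlparse


def in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url sits at or below the directory of sitemap_url."""
    sitemap, page = urlparse(sitemap_url), urlparse(page_url)
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    base_dir = posixpath.dirname(sitemap.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    # Treat a bare host such as http://www.example.com as the root path.
    return (page.path or "/").startswith(base_dir)


front_page = "http://www.example.com/"
thread = "http://www.example.com/forum/showthread.php"

# Sitemap that appears inside /forum/: the front page is out of scope.
print(in_sitemap_scope("http://www.example.com/forum/sitemap_index.xml.gz", front_page))
# Sitemap that appears in the root: both pages are in scope.
print(in_sitemap_scope("http://www.example.com/sitemap_index.xml.gz", front_page))
print(in_sitemap_scope("http://www.example.com/sitemap_index.xml.gz", thread))
```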

  6. #21
    Senior Member
    Real Name
    Mohamed
    Join Date
    Dec 2006
    Posts
    3,891
    Liked
    1 times
    I would suggest not using .gz for the file in robots.txt. Google supports .gz and we can handle that from Webmaster Tools, but the robots.txt entry is for the other search engines too, and we are not sure whether they support it.

  7. #22
    Senior Member
    Real Name
    Keith Cohen
    Join Date
    Jul 2005
    Location
    Raleigh, NC USA
    Posts
    6,147
    Liked
    12 times
    But the filename has .gz, so it has to be referenced that way. Otherwise, it won't find the file.

  8. #23
    Senior Member Code Monkey's Avatar
    Real Name
    Code Monkey
    Join Date
    Aug 2006
    Posts
    780
    Liked
    0 times
    All the major search engines recognize it, so it's not a problem. Google, Yahoo, and MSN have all agreed to support the Google sitemap standard. I believe Ask.com has climbed on board as well.

  9. #24
    Senior Member
    Real Name
    Mohamed
    Join Date
    Dec 2006
    Posts
    3,891
    Liked
    1 times
    I just wasn't sure, but I found this on sitemaps.org:

    Q: Can I zip my Sitemaps or do they have to be gzipped?

    Please use gzip to compress your Sitemaps. Remember, your Sitemap must be no larger than 10MB (10,485,760 bytes), whether compressed or not.
    sitemaps.org - FAQ

    So .gz files must be supported by all major search engines.
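
For anyone generating a sitemap by hand, here is a small sketch of the gzip step using Python's standard `gzip` module. The size constant is the figure from the FAQ quote above; the sample XML, `gzip_sitemap` and `is_gzip` are made up for illustration, and the sketch applies the limit to the uncompressed bytes, which is the stricter reading of "whether compressed or not".

```python
import gzip

MAX_SITEMAP_BYTES = 10_485_760  # limit quoted from the sitemaps.org FAQ in this thread


def gzip_sitemap(xml_bytes):
    """Compress sitemap XML with gzip after checking the quoted size limit."""
    if len(xml_bytes) > MAX_SITEMAP_BYTES:
        raise ValueError("sitemap is larger than 10,485,760 bytes uncompressed")
    return gzip.compress(xml_bytes)


def is_gzip(data):
    """gzip streams start with the magic bytes 0x1f 0x8b; zip archives do not."""
    return data[:2] == b"\x1f\x8b"


SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b"  <url><loc>http://www.example.com/</loc></url>\n"
    b"</urlset>\n"
)

packed = gzip_sitemap(SAMPLE)
print(is_gzip(packed), gzip.decompress(packed) == SAMPLE)
```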

  10. #25
    Senior Member
    Real Name
    Mike
    Join Date
    Aug 2006
    Posts
    209
    Liked
    3 times
    Quote Originally Posted by Mike View Post
    file should go in root, and it is important to have one because virtually all spiders will first search to see if this file exists on your site to see if there are specific instructions for them to follow.

    What you can or should put in there is a debate that will go on for years, and partly depends on what you have in your site that you don't want search engines indexing.

    Here is an example from one of my sites

    Code:
    User-agent: *
    Disallow: /forum/vbseo_sitemap
    Disallow: /forum/admincp
    Disallow: /forum/attachments
    Disallow: /forum/attachment.php
    Disallow: /forum/arcade.php
    Disallow: /forum/calendar.php?do=add
    Disallow: /forum/cron.php
    Disallow: /forum/editpost.php
    Disallow: /forum/login.php
    Disallow: /forum/modcp
    Disallow: /forum/moderator.php
    Disallow: /forum/membermap.php
    Disallow: /forum/newreply.php
    Disallow: /forum/newthread.php
    Disallow: /forum/online.php
    Disallow: /forum/payments.php
    Disallow: /forum/pda
    Disallow: /forum/postings.php
    Disallow: /forum/printthread.php
    Disallow: /forum/private.php
    Disallow: /forum/profile.php
    Disallow: /forum/register.php
    Disallow: /forum/report.php
    Disallow: /forum/reputation.php
    Disallow: /forum/sendtofriend.php
    Disallow: /forum/search.php
    Disallow: /forum/sendmessage.php
    Disallow: /forum/showpost.php
    Disallow: /forum/subscription.php
    Disallow: /forum/threadrate.php
    Disallow: /forum/usercp.php
    Disallow: /forum/spy.php
    Disallow: /forum/tags/
    Here is an example of someone just blocking specific spiders
    Code:
    User-agent: BoardTracker
    Disallow: /
     
    User-agent: Gigabot
    Disallow: /
    Or, you could simply have a robots.txt file that is blank. You don't have to have anything in it, but most believe it is helpful to restrict spiders from going through unimportant files, and/or files that don't need to be seen by the public.

    Having a robots.txt file that is blank simply tells spiders that everything is available to be searched.

    Mike, when you disallow showpost.php, doesn't that prevent them from caching the pages?

    Disallow: /forum/showpost.php
    And how come you don't have showthread.php disallowed? Just trying to understand the whole robots.txt stuff.

    Last thing... My forums are at the root (mysite.com/index.php), so in the robots.txt file should I put "sitemap_index.xml.gz" at the top, or at the bottom after all the Disallow lines?

    I currently have it looking like this:

    User-agent: *
    Disallow: /clientscript/
    Disallow: /includes/
    Disallow: /install/
    Disallow: /customavatars/
    Disallow: /subscription.php
    Disallow: /payments.php
    Disallow: /profile.php
    Disallow: /faq.php
    Disallow: /calendar.php
    Disallow: /private.php
    Disallow: /poll.php
    Disallow: /sendmessage.php
    Disallow: /sendmessage.php?do=
    Disallow: /showgroups.php
    Disallow: /reputation.php
    Disallow: /report.php
    Disallow: /threadrate.php
    Disallow: /postings.php
    Disallow: /online.php
    Disallow: /search.php
    Disallow: /newthread.php
    Disallow: /newreply.php
    Disallow: /register.php
    Disallow: /login.php
    Disallow: /image.php
    Disallow: /cron.php
    Disallow: /joinrequests.php
    Disallow: /usercp.php
    Disallow: /member.php
    Sitemap: http://www.mysite.com/sitemap_index.xml.gz
    Thanks in advance.
    Last edited by mikeinjersey; 04-27-2007 at 11:28 PM.
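
The two special cases in the quoted post (blocking only named spiders, and a blank robots.txt) can be tried out with Python's standard `urllib.robotparser`. This is only a sketch of how that one parser reads the files; real crawlers may differ in details, and the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Parse robots.txt text held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


BLOCK_TWO_SPIDERS = """\
User-agent: BoardTracker
Disallow: /

User-agent: Gigabot
Disallow: /
"""

PAGE = "http://www.example.com/forum/index.php"

blocked = parse_rules(BLOCK_TWO_SPIDERS)
print("Gigabot:", blocked.can_fetch("Gigabot", PAGE))      # named spider is shut out
print("Googlebot:", blocked.can_fetch("Googlebot", PAGE))  # everyone else is unaffected

blank = parse_rules("")
print("blank file:", blank.can_fetch("Googlebot", PAGE))   # nothing is restricted
```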

  11. #26
    Senior Member
    Real Name
    Michael
    Join Date
    Oct 2005
    Posts
    1,755
    Liked
    1 times
    Blog Entries
    1
    I don't think it matters where you put the sitemap file. I just put mine at the bottom because that was the last thing I had added to the robots.txt file.

    I disallow showpost because I don't want spiders going into my individual posts. (I also don't rewrite the showpost URLs, to cut down on server load, since I block them from indexing anyway.)

    I do want them indexing my threads though, which is why I don't have showthread.php in there.
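
On the top-or-bottom question: the sitemaps.org protocol describes the Sitemap line as independent of the User-agent line, so its position should not matter. A quick sketch with Python's `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later) shows that this parser, at least, returns the same result either way. The shortened rule list is a stand-in for the full file posted above.

```python
from urllib.robotparser import RobotFileParser

SITEMAP_LINE = "Sitemap: http://www.mysite.com/sitemap_index.xml.gz"
RULES = ["User-agent: *", "Disallow: /search.php", "Disallow: /login.php"]


def sitemaps_in(lines):
    """Return the Sitemap URLs the parser finds, or None if there are none."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.site_maps()


at_top = sitemaps_in([SITEMAP_LINE] + RULES)
at_bottom = sitemaps_in(RULES + [SITEMAP_LINE])
print(at_top == at_bottom, at_bottom)
```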

  12. #27
    Senior Member
    Real Name
    Mike
    Join Date
    Aug 2006
    Posts
    209
    Liked
    3 times
    Quote Originally Posted by Mike View Post
    I disallow showpost because I don't want spiders going into my individual posts. (I also don't rewrite the showpost URLs, to cut down on server load, since I block them from indexing anyway.)

    I do want them indexing my threads though, which is why I don't have showthread.php in there.
    I'm trying to understand this... Posts are where all the rich content is, so why block them from reading posts? I can understand the server load part, but geez. Do others do this as well?

    So you don't allow the engines to index your posts at all? (Neither the archive nor the standard way?)

  13. #28
    Senior Member Code Monkey's Avatar
    Real Name
    Code Monkey
    Join Date
    Aug 2006
    Posts
    780
    Liked
    0 times
    The blocking is for the "view single post" pages, not the threads. The content of the posts is already contained in the thread pages. The individual post views can be considered duplicate content.
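
To see that a showpost.php rule leaves thread pages alone, here is a short sketch using Python's `urllib.robotparser` with two of the Disallow lines from Mike's example. The query strings (`p=123`, `t=45`) and the example.com host are made-up placeholders, and this parser does plain prefix matching, which is an assumption about how other crawlers behave.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/showpost.php
Disallow: /forum/printthread.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

single_post = "http://www.example.com/forum/showpost.php?p=123"
full_thread = "http://www.example.com/forum/showthread.php?t=45"

# The single-post view is blocked; the thread page holding the same text is not.
print("showpost:", parser.can_fetch("Googlebot", single_post))
print("showthread:", parser.can_fetch("Googlebot", full_thread))
```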

  14. #29
    Senior Member jw00dy's Avatar
    Real Name
    Jonathan
    Join Date
    Dec 2006
    Location
    Tooele, UT
    Posts
    184
    Liked
    0 times
    Great info. Thank you for sharing.

    I guess this means urllist.txt.gz will be going bye bye?
    Last edited by jw00dy; 05-01-2007 at 06:35 AM.

  15. #30
    Senior Member
    Real Name
    .
    Join Date
    Jul 2006
    Posts
    386
    Liked
    3 times
    Blog Entries
    1
    I know this is a bit late, but wouldn't it be better to write it as:
    Sitemap: /forums/sitemap_index.xml.gz

    instead of putting the whole url?
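
On the relative-path idea: the sitemaps.org protocol asks for the complete URL in the Sitemap line, so a bare path like /forums/sitemap_index.xml.gz is not what the spec describes. A tiny sketch of a check for that, using `urllib.parse`; the `is_complete_sitemap_url` helper is hypothetical and only tests for an http(s) scheme and a host.

```python
from urllib.parse import urlparse


def is_complete_sitemap_url(value):
    """Return True if value has an http(s) scheme and a host name."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in (
    "http://www.example.com/forums/sitemap_index.xml.gz",
    "/forums/sitemap_index.xml.gz",
):
    print(candidate, "->", is_complete_sitemap_url(candidate))
```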
