If you own a proxy website and have never used or even heard of robots.txt, you may find trouble coming your way from angry webmasters claiming that you have taken their web content without explicit permission. If that sounds unfamiliar, learn the term “proxy hijacking” now. When a loyal visitor uses your free proxy website to retrieve webpages from other sites anonymously, the web proxy script rewrites that content on the fly, so it appears to be hosted as your proxy website’s own content. What was painstakingly created on other websites becomes hijacked content on yours as soon as a proxy visitor anonymously surfs those third party sites.

If search engine bots from Google, Yahoo, MSN and others crawl your proxy server at that moment, they may index the hijacked content and associate it with your anonymous proxy instead of the original site. When the real owners and authors run a Copyscape search and find their content listed under your anonymous web proxy rather than rightfully under their own websites, they get angry, look up your proxy IP, and email your proxy hosting provider and the search engines to report you for stealing content. Your anonymous web proxy may be removed from the important search engine results, which means a great loss of proxy traffic and earnings for you. Some inexperienced hosting companies may also suspend your hosting account. This is one reason to go with specialized proxy hosting providers: they are used to handling such copyright complaints and understand that the apparent content abuse is a side effect of the scripts used for anonymous surfing.

If you are using Adsense to monetize your web proxy, note that it has an explicit policy against placing advertisements on content that does not belong exclusively to you. Some bitter, anti-proxy content owners may even try to get your Adsense account banned by reporting you as a spammer who uses duplicated content.

If you are only running a transparent proxy list site or providing free school proxy lists without actually using a proxy script, then you do not really need to read on. For everyone else: we have also heard about bypass proxy webmasters losing their Google Adsense accounts because they were not careful with their robots.txt. If you bothered to read the terms and conditions when you signed up with Adsense, you will have noticed that you are not supposed to put Adsense advertisements on websites that contain adult and naughty content. You are very sure there is no such stuff, no pictures or videos, on your site, right? What if I tell you that 80% of all web proxy users like to surf and download such naughty content? They use your public proxy to hide their IP. Not every user is there just to unblock MySpace from school. Within a day of running a web proxy, your users will have helped you auto-create enough picture and video content to rival the other naughty websites. When a Google Adsense employee checks your proxy website and finds pictures that clearly flout their rules and terms, you can say bye bye to your account and any earnings made so far. You will not get any chance to explain the situation either. Just take a look at what happens to all the careless webmasters. They end up using other advertising networks such as Adbrite and Adversal, which provide less than 1/3 of their previous Adsense earnings. By the way, when you are banned from Adsense, it is nearly for life.

The robots.txt file is a simple text file that resides in the public root directory of your website. For example, if your proxy website is http://freeproxywebsites.blogspot.com then your robots.txt can be found at http://freeproxywebsites.blogspot.com/robots.txt. This is of course an imaginary example, but you should get what I mean. The file is written using the Robots Exclusion Protocol, which tells web crawlers and robots which files and directories of a website they may access and which are off limits. Not every bot obeys robots.txt, and some rogue bots may deliberately access the files you have marked as disallowed. However, for well known bots such as those from Google, Yahoo and MSN, you can typically be assured that they will not crawl the files and directories you have disallowed, so the proxied content behind them stays out of their search results (although a disallowed URL can still occasionally be listed, without its content, if other sites link to it). The robots exclusion standard also complements the website sitemap file, which is written using the robots inclusion standard.
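
As a small illustration of where crawlers look for the file, here is a minimal Python sketch that derives the robots.txt address from any page address on a site. The function name robots_url and the page path are made up for the example; only urljoin from the standard library is used.

```python
from urllib.parse import urljoin


def robots_url(site_url):
    """Return the robots.txt location for the site that serves site_url."""
    # The leading slash makes urljoin resolve against the site root,
    # whatever page or subdirectory site_url points at.
    return urljoin(site_url, "/robots.txt")


# The page path below is hypothetical; the host is the article's imaginary example.
print(robots_url("http://freeproxywebsites.blogspot.com/some/page.html"))
```

The point of the sketch is simply that the file always sits at the root of the host, never in a subfolder of the site.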

To make things simple, just open a new text file in Notepad or whatever text editor you like and save it with the filename “robots.txt”. Next, check which proxy script you are using for your proxy websites. Each proxy script uses different files, so each needs a different robots.txt. I have given examples below for the three big open source and freely available web proxy scripts: CGI Proxy, Phproxy and Glype. If you have paid for a custom proxy script, ask the developer for the corresponding robots.txt. Just copy the text of the matching example into your robots.txt and upload (FTP) the file to the /public_html folder of your hosting account. For addon domains, FTP the robots.txt to the /public_html/someAddonsDomains.com/ folder instead. Each unblocking proxy website on your hosting server must have its own robots.txt. After uploading, check that the file is really there by opening, for example, freeproxywebsites.blogspot.com/robots.txt in Firefox, IE or any other browser.

If you are using the CGI Proxy script, you most likely have your own dedicated server. Look for nph-proxy.pl or nph-proxy.cgi in the root directory of your proxy website to confirm. The CGI Proxy robots.txt should contain

User-agent: *
Disallow: /nph-proxy.pl/
Disallow: /nph-proxy.cgi/

If you are a new proxy webmaster on a shared hosting account, there is a high chance that you are using the newer Glype proxy script. Look for browse.php in the root directory of your proxy website to confirm. The default installation of most Glype packages should already contain a robots.txt. If not, your Glype Proxy robots.txt should contain

User-agent: *
Disallow: /browse.php

Most free proxy websites to date are still running the discontinued Phproxy script. If you see an index.inc.php file in your proxy website, chances are high that you are using Phproxy. Because Phproxy is easily reconfigured, and there are many customized versions of it spreading around online, preparing its robots.txt can be a bit tricky. Remember that the purpose of the robots.txt is to prevent search engine robots from crawling and indexing content pulled in from third party sites. By default, Phproxy serves all proxified content from index.php?q=12345-some-encoded-url-1245. Notice the gibberish in the URL I have given; the URL always looks something like this. Sometimes you see surf.php?q=olniwpe… or index.php?page=rvouwqdn… and other creative variations. These come from bypass proxy webmasters who customize their Phproxy scripts to remove the easily identified footprints that tell people the site is running Phproxy. They do this to avoid detection by web filtering companies such as Websense, which correctly label such sites as web proxies or proxy avoidance sites and block school students and office workers from using them to unblock websites. Anyway, just take a test drive: load any web page through your web proxy and identify the static portion of the generated proxy URL, whether it is “index.php?q=”, “index.php?page=”, “surf.php?q=” or something else, and put it in your Phproxy robots.txt, for example

User-agent: *
Disallow: /index.php?q=*
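
If you would rather not eyeball the URLs, the test drive can be sketched in a few lines of Python. This is only an illustration under assumptions: the host proxy.example, the encoded strings and the function names static_portion and disallow_line are all made up, and the sketch assumes the encoded target is always the value of the first query parameter, as in the examples above.

```python
from urllib.parse import urlsplit


def static_portion(proxied_url):
    """Return the script path plus the name of the first query parameter."""
    parts = urlsplit(proxied_url)
    if not parts.query:
        return parts.path
    if "=" not in parts.query:
        return parts.path + "?"
    # Everything after the first "=" is the encoded target, which changes per page.
    param_name = parts.query.split("=", 1)[0]
    return f"{parts.path}?{param_name}="


def disallow_line(proxied_urls):
    """Build one Disallow line, refusing samples that do not share a static part."""
    portions = {static_portion(url) for url in proxied_urls}
    if len(portions) != 1:
        raise ValueError(f"samples disagree: {sorted(portions)}")
    return "Disallow: " + portions.pop()


# Hypothetical URLs copied from the address bar after loading two pages via the proxy.
samples = [
    "http://proxy.example/index.php?q=aHR0cDovL2V4YW1wbGUuY29t",
    "http://proxy.example/index.php?q=aHR0cDovL25ld3MuZXhhbXBsZS5vcmc",
]
print(disallow_line(samples))
```

The sketch cuts at the parameter name instead of comparing the URLs character by character, because encoded targets often start with the same characters (every encoded “http://” looks alike) and those should not end up in the rule. It prints the rule without a trailing *, since Disallow rules already match by prefix.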

If you do not know which web proxy script you are using but you know you got it for free, then you are most likely using one of the big three: CGI Proxy, Phproxy or Glype. For convenience, just lump everything above into a single robots.txt.

User-agent: *
Disallow: /browse.php
Disallow: /nph-proxy.pl/
Disallow: /nph-proxy.cgi/
Disallow: /index.php?q=*
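
Before uploading, you can sanity check a combined file like this one offline. The following is a minimal sketch using Python's standard urllib.robotparser; the host proxy.example, the sample URLs and the helper name is_blocked are assumptions for illustration. One caveat: this parser does plain prefix matching and does not interpret a * inside a path, so the sketch writes the Phproxy rule without the trailing wildcard, which matches the same URLs by prefix.

```python
from urllib.robotparser import RobotFileParser

# Combined rules for the three scripts; the Phproxy rule is written as a plain
# prefix because the standard library parser does not understand "*" in paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse.php
Disallow: /nph-proxy.pl/
Disallow: /nph-proxy.cgi/
Disallow: /index.php?q=
"""


def is_blocked(robots_txt, url, agent="Googlebot"):
    """Return True if a crawler that obeys robots_txt must not fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Hypothetical URLs: three proxied pages and the proxy's own home page.
checks = [
    "http://proxy.example/browse.php?u=abc",
    "http://proxy.example/nph-proxy.cgi/010110A/http/example.com/",
    "http://proxy.example/index.php?q=aHR0cDovL2V4YW1wbGUuY29t",
    "http://proxy.example/",
]
for url in checks:
    print(url, "blocked" if is_blocked(ROBOTS_TXT, url) else "allowed")
```

The home page should stay crawlable while every proxied URL is blocked. If your own front page shows up as blocked, the rule is too broad. Keep in mind this only mimics a well behaved crawler; it says nothing about rogue bots.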

Creating a proper robots.txt file for each of your anonymous proxy websites is an often forgotten but essential step for school proxy owners, especially those who own large free proxy networks consisting of hundreds of web proxies. Without this simple file, you may find search engine bots hitting your auto-generated content so often that it wastes your bandwidth, and you expose yourself to complaints from anti-proxy webmasters, copyright issues and loss of profits.
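
For owners of large networks, copying the file into every site by hand gets tedious. Here is a rough Python sketch of one way to do it on the server, assuming the /public_html layout with one folder per addon domain described earlier. The function name write_robots and its parameters are made up for the example, and by default it leaves any existing robots.txt alone (Glype may already ship one).

```python
from pathlib import Path

ROBOTS_TXT = """\
User-agent: *
Disallow: /browse.php
Disallow: /nph-proxy.pl/
Disallow: /nph-proxy.cgi/
Disallow: /index.php?q=
"""


def write_robots(public_html, addon_domains, content=ROBOTS_TXT, overwrite=False):
    """Write robots.txt to the main site root and to each addon domain folder.

    Returns the list of files actually written.
    """
    root = Path(public_html)
    written = []
    for folder in [root] + [root / domain for domain in addon_domains]:
        if not folder.is_dir():
            continue  # skip domains whose folder does not exist
        target = folder / "robots.txt"
        if target.exists() and not overwrite:
            continue  # keep a robots.txt that is already in place
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


# Example call (paths are illustrative):
# write_robots("/public_html", ["someAddonsDomains.com"])
```

Afterwards, still open each site's /robots.txt in a browser to confirm that the file is actually being served.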