Google open sources robots.txt parser in push for an official web crawler standard

In an effort to establish an official web crawler standard, Google has made its robots.txt parsing and matching library open source, in the hope that web developers will soon be able to agree on how web crawlers should operate online.

The C++ library powers the company's own web crawler, Googlebot, which indexes websites in accordance with the Robots Exclusion Protocol (REP). Through REP, website owners can dictate how the crawlers that visit and index their sites should behave.

By reading a plain text file called robots.txt at the root of a website, crawlers such as Googlebot know which of the site's resources they may visit and which they may index.
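To give a sense of what these rules look like, here is a short, hypothetical robots.txt file (the paths are made up for illustration):

    User-agent: Googlebot
    Disallow: /private/
    Allow: /private/annual-report.html

    User-agent: *
    Disallow: /

Read this way, Googlebot may crawl everything on the site except the /private/ directory, with a single exception carved out by the Allow line, while all other crawlers are asked to stay away entirely.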

Martijn Koster, creator of the first search engine, wrote the rules for REP 25 years ago. Since then, REP has been widely adopted by web publishers, but it has never become an official internet standard. Google is looking to change this, and it hopes to do so by making the parser it uses to decode robots.txt files open source.
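To illustrate the kind of work such a parser does, here is a heavily simplified matcher sketched in C++. It is not code from Google's open-sourced library, and the names in it are purely illustrative: it recognises only User-agent, Allow and Disallow lines and resolves conflicts with a common convention, letting the longest matching rule win, with Allow preferred on ties.

    // A simplified, illustrative robots.txt matcher -- not Google's library code.
    // It handles only User-agent, Allow and Disallow lines, matches rules by
    // plain prefix, and lets the longest matching rule win (Allow on ties).
    #include <algorithm>
    #include <cctype>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Rule { bool allow; std::string path; };

    // Collect the Allow/Disallow rules from groups addressed to `agent` or to "*".
    std::vector<Rule> ParseRules(const std::string& robots_txt, const std::string& agent) {
      std::vector<Rule> rules;
      std::istringstream in(robots_txt);
      std::string line;
      bool group_applies = false;
      while (std::getline(in, line)) {
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;        // not a "field: value" line
        std::string field = line.substr(0, colon);
        std::string value = line.substr(colon + 1);
        value.erase(0, value.find_first_not_of(" \t"));  // trim leading whitespace
        std::transform(field.begin(), field.end(), field.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        if (field == "user-agent") {
          group_applies = (value == "*" || value == agent);
        } else if (group_applies && field == "disallow" && !value.empty()) {
          rules.push_back({false, value});
        } else if (group_applies && field == "allow" && !value.empty()) {
          rules.push_back({true, value});
        }
      }
      return rules;
    }

    // The longest rule whose path is a prefix of the URL path decides; if no
    // rule matches, crawling is allowed by default.
    bool Allowed(const std::vector<Rule>& rules, const std::string& path) {
      bool allowed = true;
      size_t best = 0;
      for (const auto& rule : rules) {
        const bool matches = path.compare(0, rule.path.size(), rule.path) == 0;
        if (matches && rule.path.size() >= best) {
          if (rule.path.size() > best || rule.allow) allowed = rule.allow;
          best = rule.path.size();
        }
      }
      return allowed;
    }

    int main() {
      const std::string robots =
          "User-agent: Googlebot\n"
          "Disallow: /private/\n"
          "Allow: /private/annual-report.html\n";
      const auto rules = ParseRules(robots, "Googlebot");
      std::cout << Allowed(rules, "/private/notes.txt") << "\n";           // 0: blocked
      std::cout << Allowed(rules, "/private/annual-report.html") << "\n";  // 1: allowed
      std::cout << Allowed(rules, "/index.html") << "\n";                  // 1: allowed
    }

A production parser also has to cope with malformed lines, wildcards, character encoding issues and the many corner cases that, as the company notes below, the informal standard never pinned down.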

REP

In a blog post, Henner Zeller, Lizzi Harvey and Gary Illyes explained how REP's lack of official internet standard status has led to confusion among web developers over how to implement it, saying:

“The REP was never turned into an official Internet standard, which means that developers have interpreted the protocol somewhat differently over the years. And since its inception, the REP hasn't been updated to cover today's corner cases. This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly.”

To help make REP implementations more consistent across the web, Google is now pushing to make the REP an Internet Engineering Task Force (IETF) standard, and the search giant has published a draft proposal to support its efforts.

The proposed draft suggests extending robots.txt beyond HTTP to any URI-based transfer protocol (including FTP and CoAP), requires crawlers to parse at least the first 500 kibibytes of a robots.txt file, and sets a new maximum caching time of 24 hours.

“RFC stands for Request for Comments, and we mean it: we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right,” Zeller, Harvey and Illyes added.

Via The Register
