By John Gruber
Daisuke Wakabayashi, reporting for The New York Times:
Google and Microsoft are the only search engines that spend hundreds of millions of dollars annually to maintain a real-time map of the English-language internet. That’s in addition to the billions they’ve spent over the years to build out their indexes, according to a report this summer from Britain’s Competition and Markets Authority.
Google holds a significant leg up on Microsoft in more than market share. British competition authorities said Google’s index included about 500 billion to 600 billion web pages, compared with 100 billion to 200 billion for Microsoft.
The size of the index is valuable, of course, but I’d argue that it’s not the best comparison point. It’s an easy comparison, because you can just compare numbers and say 500–600 billion is bigger and better than 100–200 billion. But I’ll bet the overwhelming majority of searches are completely satisfied by the contents of pages indexed by Bing. It’s the quality of results that matters most. A 500-billion-page index is useless if it doesn’t surface the correct results.
What’s more interesting to me is that while there are a number of small search engines, Google and Bing are the only two comprehensive indexes. DuckDuckGo, for example, syndicates the contents of its index from Microsoft. Google has a monopoly on web search no matter how you look at the market, but there’s even less competition for indexing the web than there is for user-facing search engines. In fact, I think semantically it sort of breaks the engine in “search engine” — the term presupposes that the service showing you the results is the same service that is crawling the web to index them. That’s just not true today.
When Mr. Maril started researching how sites treated Google’s crawler, he downloaded 17 million so-called robots.txt files — essentially rules of the road posted by nearly every website laying out where crawlers can go — and found many examples where Google had greater access than competitors.
ScienceDirect, a site for peer-reviewed papers, permits only Google’s crawler to have access to links containing PDF documents. Only Google’s computers get access to listings on PBS Kids. On Alibaba.com, the U.S. site of the Chinese e-commerce giant Alibaba, only Google’s crawler is given access to pages that list products.
I don’t think any of this exclusivity is the result of nefarious deals between the websites and Google — these sites have just determined, on their own, that it makes financial sense to only permit Google to index their content.
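To make the mechanism concrete, here is a minimal sketch of how a robots.txt file can favor one crawler over all others, checked with Python’s standard `urllib.robotparser`. The rules, the `example.com` URLs, and the `allowed` helper are hypothetical and written only for illustration; they are not copied from ScienceDirect, PBS Kids, or Alibaba.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything,
# every other crawler is kept out of the product listings.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /products/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    product_page = "https://example.com/products/widget"
    for agent in ("Googlebot", "bingbot"):
        print(agent, allowed(agent, product_page))
```

Under these assumed rules, a crawler identifying as Googlebot can fetch the product page while one identifying as bingbot cannot, though both can still reach pages outside `/products/`. Compliance is voluntary on the crawler’s side; the file only states what the site would prefer.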
★ Monday, 14 December 2020