Before a new search engine can hope to make a run against Google, it has to crawl.
But indexing the web by “crawling” sites with automated software doesn’t just require scaling up to the web’s vast scope—even though doing so is a big challenge in itself. Individual sites have no obligation to welcome a new search crawler. Some instead post digital no-trespassing signs, a way to discourage automated traffic that might bog down performance.
“The web has trillions of documents,” says Vivek Raghunathan, cofounder of the ad-free, subscription-based search startup Neeva. “And the web is a lot trickier to crawl than it was a few years ago.”
An October 2020 report on digital competition by the House Judiciary Committee’s Subcommittee on Antitrust aimed a government spotlight at this situation.
“The high cost of maintaining a fresh index, and the decision by many large websites to block most crawlers, significantly limits new search engine entrants,” the report stated. “Today, the only English-language search engines that maintain their own comprehensive webpage index are Google and Bing.”
That leaves many Google competitors renting the index Microsoft maintains for its Bing search, which has 6.4% of the U.S. market—compared to Google’s 87.3%—in Statcounter’s measurements. Bing’s index works well for many queries, but sites leaning on it cede a key way to differentiate themselves.
That’s an issue for Neeva as well as two other privacy-centric search engines, DuckDuckGo and Brave. All three call on Bing for some of the results they provide to users. It’s just one ingredient rather than the entirety of their technology, but still: It would be easier to do without it if creating a new index of the web wasn’t so hard.
Robots not welcome here
Websites control automated access to their pages using standardized “robots.txt” files enumerating where crawlers may go. Crawlers can disregard these instructions, as the Internet Archive began doing in 2017 to improve its backup of the web. But sites can punish a pushy robot by blocking its access.
DuckDuckGo and Neeva pointed to Facebook’s platform as one example. Its robots.txt file takes a guest-list approach, approving Google and Bing as well as such less obvious crawlers as “Applebot,” which gathers data for Apple’s Siri and Spotlight. But it excludes all bots not cited by name.
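In general form, a guest-list robots.txt looks something like the sketch below. The crawler names and paths here are illustrative only, not Facebook’s actual file: named bots get specific rules, and a final wildcard entry turns everyone else away.

```
# Named crawlers get explicit, limited restrictions...
User-agent: Googlebot
Disallow: /ajax/

User-agent: Applebot
Disallow: /ajax/

# ...while any crawler not on the guest list is barred entirely.
User-agent: *
Disallow: /
```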
Jason Grosse, a spokesperson for Facebook’s parent firm Meta, said in an email: “Generally speaking, our robots.txt policy is not out of line with other major platforms.”
Indexing sites that don’t appreciate a new crawler’s attention can demand discretion and diplomacy.
“A lot of the work we’ve done in the last year, year and a half, is building a crawler system that is well behaved,” said Neeva’s Raghunathan. “We do things like smart algorithmic estimation of how much can we crawl this site so it looks like a rounding error.”
Sometimes, however, Neeva has to ask for help. From whom? “I’d say it’s been the first person we know, and often the first person we know is the CEO or the head of engineering.”
Brave, meanwhile, operates in a stealth mode by varying its crawler’s identification and abiding only by whatever restrictions a robots.txt file places on Google’s crawler. Josep M. Pujol, chief of search at Brave (the company founded by Mozilla cofounder Brendan Eich and better known for its privacy-focused browser), said in an email that this requires treading lightly.
“We respect the spirit of the law but not the letter,” he said. “As of today, the data centers that host our crawlers have received a very small number of complaints.”
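To make that loophole concrete, here is a minimal sketch in Python’s standard `urllib.robotparser` of how a guest-list robots.txt shuts out an unknown crawler by name while leaving most of the site open to Googlebot. The file contents and the “NewSearchBot” name are hypothetical, not drawn from any real site or from Brave’s actual crawler code.

```python
from urllib import robotparser

# Hypothetical guest-list robots.txt: named bots get narrow rules,
# everyone else is disallowed everywhere.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A new crawler honoring its own name falls under the wildcard
# rule and is shut out of the whole site...
print(parser.can_fetch("NewSearchBot", "https://example.com/page"))

# ...while the same file leaves most pages open to Googlebot,
# which is roughly the set of rules Brave says it follows instead.
print(parser.can_fetch("Googlebot", "https://example.com/page"))
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))
```

Reading the Googlebot entry rather than the wildcard is a one-line change in the `can_fetch` call; the contested part is whether a site considers that within the spirit of its policy.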
Pujol called asking individual sites’ permission impractical: “How do you scale human interaction to thousands of companies?”
Google, meanwhile, can get another leg up because its nonsearch lines of business—starting with display ads, but including services like Google Analytics—require access to sites that competitors can only request, said Zack Maril, a software engineer and founder of a search-competition group called Knuckleheads’ Club.
These other ventures, he wrote in an email, “all can benefit from Google’s search business in various ways that other competitors running only search engines simply cannot compete on.”
Search sites without Google- or Bing-level traffic also lack large-scale metrics about what sites are more or less popular. Google and Bing “can look at everything that people liked, and prioritize all the clicks from there,” says Raghunathan. “When you’re bootstrapping, it’s a lot harder.”
A report on digital competition, published in July 2020 by the U.K.’s Competition and Markets Authority, suggested requiring Google to provide some of these metrics. As DuckDuckGo communications vice president Kamyl Bazbaz approvingly phrased it, “Share a certain amount of click-and-query data that other search engines could use to level the playing field.”
Brave invites itself to a form of that sharing when it asks its users to allow “Google fallback mixing,” in which Brave sends along a query to Google and then analyzes the results to improve its index.
Even a search site that excels at providing web results will struggle to match Google’s full-spectrum information retrieval. For example, I’ve had DuckDuckGo as the default on my iPad Mini for years—but its maps results only cover driving and walking, so I still find myself turning to Apple Maps and Google Maps.
Despite the inherent challenges of competing with Google in search, the fact that new firms are still willing to try speaks to the stubbornness these upstarts will need.
“We love that there are lots of other search competitors now,” said DuckDuckGo’s Bazbaz. “It’s a market that, historically, people have been really afraid of—and for good reason—because of the way that Google has dominated it.”