The dark side of the web

Google sees only a fraction of the content that appears on the internet. Stuart Andrews finds out what's lurking in the deep web

When Google indexes so many billions of web pages that it doesn’t even bother listing the number any more, it’s hard to imagine that much lies beyond its far-reaching tentacles.

Beneath, however, lies an online world that few know exists. It’s a realm of huge, untapped reserves of valuable information containing sprawling databases, hidden websites and murky forums. It’s a world where academics and researchers might find the data required to solve some of mankind’s biggest problems, but also where criminal syndicates operate, and terrorist handbooks and child pornography are freely distributed.

At the same time, the underground web is the best hope for those who want to escape the bonds of totalitarian state censorship, and share their ideas or experiences with the outside world.

Interested? You’re not alone. The deep web and its “darknets” are a new battleground for those who want to uphold the right to privacy online, and those who feel that rights need to be sacrificed for the safety of society. The deep web is also the new frontier for those who want to rival Google in the field of search. Take a journey with us to the other side of the internet.

Deep webs, the dark web and darknets

The first thing to grasp is that, while the elements that make up this other web have aspects in common, we’re not talking about a single, unified entity. Those in the know will often talk in terms of the deep or invisible web, darknets and the dark web, and you might think these are all the same thing. In fact, they’re separate phenomena, albeit linked by common themes, properties or interests.

The deep web isn’t half as strange or sinister as it sounds. In computer-science speak, it refers to those portions of the web that, for whatever reason, have been invisible to conventional search engines such as Google.

The majority of this deep web is made up of dynamically created pages and database entries that are accessible only through manual completion of an HTML form. A smaller proportion has been accidentally or purposefully made inaccessible to Google’s crawlers, while other areas sit behind password-protected or subscription-only sites.
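To see why form-backed content slips past a crawler, consider a minimal sketch in Python. Everything in it is hypothetical: the toy site in STATIC_PAGES, the database in RECORDS, and the submit_form and crawl helpers are invented for illustration and don't describe how Google's crawlers actually work. The crawler follows hyperlinks only, so it reaches the search page but never the records that appear once somebody fills in the form.

```python
from collections import deque

# Hypothetical static pages: path -> list of paths it links to.
STATIC_PAGES = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],  # holds only a search form, with no links to any record
}

# Hypothetical database sitting behind the form on /search.
RECORDS = {
    "sky survey": "/record?id=1",
    "coastal data": "/record?id=2",
}


def submit_form(query):
    """Simulate a person typing a query into the form on /search."""
    return [path for title, path in RECORDS.items() if query in title]


def crawl(start="/"):
    """Follow hyperlinks only, breadth first, as a simple crawler would."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for link in STATIC_PAGES.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


surface = crawl()                        # pages reachable by links alone
deep = set(RECORDS.values()) - surface   # pages that need a form submission

print("Crawler found:", sorted(surface))
print("Crawler missed:", sorted(deep))
print("A form query for 'sky' returns:", submit_form("sky"))
```

In this toy example the record pages exist and are perfectly public, but nothing links to them, so a link-following crawler has no way of knowing they are there.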

Make no mistake, the deep web is huge. Michael Bergman’s pioneering 2001 study, The Deep Web: Surfacing Hidden Value, estimated that it accounted for 7,500TB of data at a time when search engines could index only 19TB, which would make the deep web almost 400 times the size of the indexed one.

Even the more conservative estimates in a 2007 paper written by Google’s Jayant Madhavan, Alon Halevy and colleagues suggest that there are more than 25 million different sources of deep web content, many of which are huge repositories.

“There is a prevailing sense in the database community that we missed the boat with the WWW,” the Google paper concluded. “The over-arching message of this paper is that a second boat is here, with staggering volumes of structured data, and that boat should be ours.”

Treasures of the deep

“There’s a lot of legitimate and valuable content in the deep web,” said Dr Juliana Freire, the leader of a University of Utah project, DeepPeep, which aims to make deep web content more accessible.

“For example, there are several scientific data sets (such as the Sloan Digital Sky Survey and the Center for Coastal Margin Observation & Prediction), documents and databases, and these are useful to society and have many important applications.”
