Disruptive Technology – The Deep Web

The term Deep Web refers to data that is accessible over the Web but not currently reached by Web crawlers. For the most part, this is information held in databases.

There’s a good – if light – article on Salon today (free daily pass required) about this issue, “In search of the deep Web.”

To say that this technology is disruptive is to put it mildly. A disruptive technology can be described (by me) as one that forces changes highly disproportionate to its appearance, and whose effect is just about impossible to gauge before the shift happens.

A classic disruptive technology is file-sharing of music. We all know the Napster/Gnutella/RIAA stories.

An even simpler example is the hyperlink: OK, this takes you to another page. Cool. So what?

So what?

This link enables today’s search engines to do what they do – they keep following links and, in some cases (such as Google), use the number of links pointing to a page to help assign its page rank. A link – a public URL – allows anyone (and that’s key) to link directly to that page.

Your friends, your enemies, your competitors, search engines, the government… they can all link to your page.

Didn’t think that had so much power, did you? Well, maybe today you do, but did you back in the Netscape 1.1 days?
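To make the link-following idea concrete, here is a minimal sketch, not how Google or any real engine actually works. It assumes a made-up four-page web (`TOY_WEB`) held in memory, follows links from a starting page, and simply counts inbound links as a crude stand-in for ranking. The names `TOY_WEB` and `crawl_and_count` are hypothetical.

```python
from collections import Counter, deque

# Hypothetical toy web: each page maps to the pages it links to.
TOY_WEB = {
    "home": ["news", "shop"],
    "news": ["home", "shop"],
    "shop": ["home"],
    "blog": ["home", "news"],  # links out, but nothing links to it
}


def crawl_and_count(web, start):
    """Follow links breadth-first from `start`.

    Returns the set of pages reached and a count of the inbound
    links seen for each page along the way.
    """
    inbound = Counter()
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in web.get(page, []):
            inbound[target] += 1       # one more link pointing at target
            if target not in seen:     # visit each page only once
                seen.add(target)
                queue.append(target)
    return seen, inbound


seen, inbound = crawl_and_count(TOY_WEB, "home")
print(sorted(seen))
print(inbound.most_common())
```

Note that `blog` is never reached: no page links to it, so a link-following crawler cannot find it. That is the deep Web problem in miniature.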

As is typical of a disruptive technology, the deep Web issue raises more questions than it answers, or can answer. For example:

  • If deep Web diving becomes possible, what happens to the business models that have grown up around the proprietary data available only on Site X (examples: Orbitz, Amazon, etc.)?
  • If the deep Web divers are the ones who can intelligently obtain and organize, well, everything, what is the role of the other sites? Think about it this way: If futureGoogle can scour the database(s) of futureEncyclopediaBritannica and futureEncyclopediaOthers, what is the need for the encyclopedia-centric sites? This is already happening to an alarming degree today, and that’s with nothing more than Google and a vast universe of ordinary Web sites; imagine the impact if the universe Google indexes contains the full Encyclopedia Britannica.
  • More importantly, if the deep Web divers are the ones who can intelligently obtain and organize, what happens when one company gets better at it than others? Does this company essentially own the Web (in the way Microsoft currently owns the desktop)? What are the ramifications of this?
  • As currently outlined, deep Web diving will require better crawlers, ones that can mimic human mouse clicks and so on so the source site will surrender its data. This raises a host of questions all by itself:
    • This will lead to a whole new class of ways to prevent such access, which may or may not affect the everyday user of, say, a stock market feed site. Right?
    • What if a company opens up its API (as Amazon and Google have done, to a degree) so the deep Web problem is settled by the companies with the data? Will deep Web diving stop, or will it continue in order to reach the companies that don’t open APIs? Will this deep diving then go to areas of, for example, Amazon that aren’t exposed by its APIs?
    • Isn’t this mimicking of human interaction with the computer a form of Turing Test? And since this is not really done well anywhere today, why would anyone expect it to work for deep Web diving in the future?

  • Privacy/security issues. There are reasons some databases are protected – they contain patients’ HIV status, Social Security numbers, lists of CIA operatives and so on. How do we differentiate between and protect these databases while leaving others accessible to deep Web diving? And if it’s possible to protect the privacy/security databases in a meaningful way (by securing them, not by passing laws to prosecute intruders after the fact), why wouldn’t most companies deploy the same technologies to protect what is near and dear to them (KFC’s 11 secret ingredients…)?
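As for the "better crawlers" mentioned in the list above, one simple reading is a crawler that fills in a site's search form instead of clicking a link. The sketch below is only an illustration under that assumption: it builds the query URLs such a crawler might request from a form that submits via GET. The site `example.com`, the field names and the helper `form_query_urls` are all hypothetical, and nothing is actually fetched.

```python
from urllib.parse import urlencode, urljoin


def form_query_urls(page_url, form_action, field_name, terms, fixed_fields=None):
    """Build one GET URL per search term for a (hypothetical) search form.

    page_url:     the page the form appears on
    form_action:  the form's action attribute, possibly relative
    field_name:   name of the text input the crawler fills in
    terms:        candidate values to try in that input
    fixed_fields: other form fields sent unchanged with every query
    """
    base = urljoin(page_url, form_action)   # resolve a relative action
    urls = []
    for term in terms:
        params = dict(fixed_fields or {})
        params[field_name] = term
        urls.append(base + "?" + urlencode(params))
    return urls


urls = form_query_urls(
    "http://example.com/catalog/",   # assumed page holding the form
    "search",                        # assumed form action
    "q",                             # assumed text field name
    ["deep web", "crawler"],
    fixed_fields={"format": "html"},
)
for url in urls:
    print(url)
```

The hard part, of course, is everything this leaves out: deciding which terms to try, handling forms that need a login or a session, and knowing when a site would rather not be queried this way at all.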

For the answers to these and other questions, check back in about five years…