Tools of the Trade

I’ve spent the last eight or so years (has it really been that long?) doing Web development, and each year I get deeper and deeper into the actual hard-core guts of development. You know the drill: from simple presentation (HTML) through dynamic scripting languages and database access to the tools that either generate parts of a site offline (not in response to a POST or GET request) or help manage and maintain the sites.

While most of my work has been with those types of Web development – and I see that staying that way – it’s sometimes daunting to see the skills needed to be even a baseline-competent Webmonkey.

  • HTML
  • CSS
  • JavaScript
  • DHTML (CSS + JavaScript)
  • At least one, preferably two scripting languages: ASP, ColdFusion, JSP, Perl/CGI, PHP
  • Today, RSS knowledge is important (which implies/requires some inkling of XML)
  • Some SQL (or forget dynamic sites, unless it’s tool based)
  • Rudimentary (at least) understanding of PhotoShop/graphics production

And this is ignoring all the tools (HomeSite, WebTrends) one may use and protocols that one needs to be at least unconsciously aware of (FTP, HTTP, HTTPS, telnet/SSH).

This is a broad range of skills: It’s a combo of writing (maybe not editorial copy, but error messages and the like), coding, graphics and integration chops.

That’s a lot for the basics.

A more hard-core programmer – say, a Unix C programmer – needs to know C, Unix and some socket stuff or what have you. Harder to learn, harder to get better at (IMHO), but – overall – a much narrower range of skills. A C programmer, for the most part, has little concern about graphics or graphic design; Web developers do.

I’m not complaining, mind you – having new stuff to learn is great. While I’ll continue to get better at even the very basic stuff (say, HTML) and still never run out of ways to improve there, it’s exciting to learn completely new technologies, such as RSS and XML.

Or SVG, CSS2, Python, Ruby, Mason…

Disruptive Technology – The Deep Web

The term Deep Web refers to the Web-accessible – but not currently Web-crawled – data out there. For the most part, this is databased information.

There’s a good – if light – article on Salon (free daily pass required) today about this issue, In search of the deep Web.

To say that this technology is disruptive is to put it mildly. A disruptive technology can be described (by me) as one that forces changes wildly disproportionate to its apparent simplicity, and whose effect is just about impossible to gauge before the shift happens.

A classic disruptive technology is file-sharing of music. We all know the Napster/GNUtella/RIAA stories.

An even simpler example is the hyperlink: OK, this takes you to another page. Cool. So what?

So what?

This link enables today’s search engines to do what they do – they keep following links and, in some cases (such as Google), use the number of links pointing to a page to assign its rank. A link – a public URL – allows anyone (and that’s key) to link directly to that page.

Your friends, your enemies, your competitors, search engines, the government…they all can link to your page.

Didn’t think that had so much power, did you? Well, maybe today you do, but did you back in the Netscape 1.1 days?

As is typical of a disruptive technology, the deep Web issue raises more questions than it answers/can answer. For example:

  • If deep Web diving becomes possible, what happens to the business models that have grown up around the proprietary data available only on Site X (examples: Orbitz, Amazon, etc)?
  • If the deep Web divers are the ones who can intelligently obtain and organize, well, everything, what is the role of the other sites? Think about it this way: If futureGoogle can scour the database(s) of futureEncyclopediaBritannica and futureEncyclopediaOthers, what is the need for the encyclopedia-centric sites? This is already happening to an alarming degree today and it’s just because of Google and a vast universe of Web sites; imagine the impact if that vast universe Google indexes contains the full Encyclopedia Britannica.
  • More importantly, if the deep Web divers are the ones who can intelligently obtain and organize, what happens when one company gets better at it than others? Does this company essentially own the Web (in the way Microsoft currently owns the desktop)? What are the ramifications of this?
  • As currently outlined, deep Web diving will require better crawlers, ones that can mimic human mouse clicks and so on so the source site will surrender its data. This raises a host of questions all by itself:
    • This will lead to a whole new class of ways to prevent such access, which may/may not impact the everyday user of, say, a stock market feed site. Right?
    • What if a company opens up its API (as Amazon and Google have done, to a degree) so the deep Web problem is settled by the companies with the data? Will deep Web diving stop, or will it continue against the companies that don’t open APIs? Will deep diving then go after areas of, for example, Amazon that aren’t exposed by its APIs?
    • Isn’t this mimicking of human interaction with the computer a form of Turing Test? And – since that isn’t really done well anywhere today – why would anyone expect it to work for deep Web diving in the future?

  • Privacy/security issues. There are reasons some databases are protected – they contain patient HIV status, Social Security numbers, lists of CIA operatives and so on. How do we differentiate between/protect these databases and leave others accessible to deep Web diving? And – if it’s possible to protect the privacy/security databases in a meaningful way (secure them, not just pass laws to prosecute intruders after the fact) – why wouldn’t any/most companies deploy the same technologies to protect what is near and dear to the company (KFC’s 11 secret ingredients…)?

For the answers to these and other questions, check back in about five years…

RSS Mess

I’ve been tweaking my RSS parser (yes, it’s home-grown) and it’s interesting – by rolling your own, you begin to appreciate what the real RSS aggregators etc. do.

RSS is a mess.

Why do some feeds include certain elements that others leave out, structure their items differently, and so on?

Right now, there is probably more exception-handling code in the parser/display code than actual engine.

While this is normal – coding is easy, error-trapping is hard/time-consuming – wasn’t the whole XML interchange (such as RSS) concept supposed to make all this easy?
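
To give a flavor of what I mean, here’s a minimal sketch of the kind of defensive parsing involved – not my actual code (it leans on PHP 5’s SimpleXML, and the fallbacks are just examples), but the shape is the same:

    <?php
    // Minimal sketch of the defensive parsing an RSS feed seems to require.
    // Element names and structure vary feed to feed, so nearly every field
    // needs a guard and a fallback.
    function parse_feed_items($xml_string) {
        $items = array();

        // Some feeds aren't even well-formed XML; bail out quietly.
        $xml = @simplexml_load_string($xml_string);
        if ($xml === false) {
            return $items;
        }

        // RSS 2.0 puts items under <channel>; a few feeds put them at the top level.
        $raw_items = isset($xml->channel->item) ? $xml->channel->item : $xml->item;

        foreach ($raw_items as $item) {
            // Title and link are usually there, but not always.
            $title = isset($item->title) ? trim((string) $item->title) : '(untitled)';
            $link  = isset($item->link)  ? trim((string) $item->link)  : '';

            // Descriptions are the worst offenders: present, absent, or stuffed with HTML.
            $desc = isset($item->description) ? strip_tags((string) $item->description) : '';

            $items[] = array('title' => $title, 'link' => $link, 'description' => $desc);
        }

        return $items;
    }
    ?>

Multiply that by dates, GUIDs and character encodings, and the exception handling adds up fast.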

Ah well, that’s how we learn…

RSS Feeds

I’ve added a page to this blog – RSS Feeds – that is not so much for anyone else as for me.

I’ve been tinkering with RSS and XML for over a year; I built the RSS feed this site offers by hand (a script parses the static index page and writes out an RSS file every five minutes).

This is another step.

Basically, I built an RSS parser on my home box that grabs some feeds that I like at certain intervals, processes same, and uploads the results to the RSS Feeds page.
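
The whole thing is roughly this shape – the feed URLs, helper names and FTP details below are placeholders, not the real ones:

    <?php
    // Rough shape of the home-box job (kicked off by cron): grab each feed,
    // process it, build a static page, push the page to the public site.
    // URLs, paths and FTP details here are placeholders.
    require_once 'rss_parser.php';   // hypothetical: holds parse_feed_items() from the earlier sketch

    $feeds = array(
        'http://example.com/feed.rss',
        'http://example.org/index.rss',
    );

    $sections = array();
    foreach ($feeds as $url) {
        $xml = @file_get_contents($url);
        if ($xml === false) {
            continue;                // a dead feed shouldn't kill the whole run
        }
        $sections[$url] = parse_feed_items($xml);
    }

    // Build the static HTML locally -- no database hits for the end user.
    $html = build_feeds_page($sections);          // hypothetical templating helper
    file_put_contents('/tmp/rss-feeds.html', $html);

    // Push the finished page to the public server.
    $conn = ftp_connect('ftp.example.com');
    ftp_login($conn, 'username', 'password');
    ftp_put($conn, '/htdocs/rss-feeds.html', '/tmp/rss-feeds.html', FTP_ASCII);
    ftp_close($conn);
    ?>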

This is just an experiment to teach me how to do all this – no, I don’t want to be (and technically can’t be) the next Technorati or what have you.

Basically, I want to learn how to use RSS feeds, process same and get results so I can, at some future date, embed a “recent headlines” area in a client’s Web site.

It’s another tool that I can wield; another way to leverage what is out there.

This feed section is strongly beta; here are the good and bad points of the section:

The Good:

  • It works! For all the caveats and so on that are listed below, it pretty much does as designed. Slick.
  • All processing happens locally and is then pushed to the remote (publicly accessible) site. No database hits or what have you for the end user.
  • It was designed with extensibility in mind: It doesn’t process one given RSS feed, but all feeds in an array – so I can keep adding to/subtracting from the list with no code changes (see the sketch after this list).
  • Using a simple JS function and CSS, I display the list of items without descriptions. A toggle is available to show/hide descriptions; defaults to no description (more headlines per inch). Note: Since JS is used, a page reload is not required. Very fast.
  • I cache feeds, so I don’t hit any site more frequently than once an hour (the parser just reprocesses the local copies). During testing, I hit Slashdot too often, and I’m now under a 72-hour ban for RSS feeds. My bad.
  • Even on this first cut, the code is documented fairly well – it’s not alpha code – and has a handful of variables that can easily be transferred to a config file of sorts to alter the output. For example, I have a constraint on the number of listings from any given site (currently defaulted to 10). If the site offers more listings than the default, the default wins. If the site offers fewer listings than the default, the site’s count wins (duh!). But little things like that are good, especially this early in the process.
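
To make the feed-array, caching and listing-cap items above concrete, here’s a rough sketch of how those pieces fit together – the names, paths and numbers are illustrative, not the actual code:

    <?php
    // Sketch of the config-driven pieces: a feed array (add/remove entries,
    // no code changes), an hourly cache so no site gets hammered, and a
    // per-site cap on listings. Names, URLs and paths are placeholders.
    $feeds = array(
        'siteA' => 'http://example.com/feed.rss',
        'siteB' => 'http://example.org/index.rss',
    );

    $cache_dir     = '/home/me/rss-cache';
    $cache_seconds = 3600;   // global refresh interval: one hour
    $max_items     = 10;     // per-site cap on listings

    function fetch_feed($name, $url, $cache_dir, $cache_seconds) {
        $cache_file = $cache_dir . '/' . $name . '.xml';

        // Reuse the local copy if it's fresh enough -- this is what keeps
        // the parser from hitting any one site more than once an hour.
        if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $cache_seconds) {
            return file_get_contents($cache_file);
        }

        $xml = @file_get_contents($url);
        if ($xml !== false) {
            file_put_contents($cache_file, $xml);
            return $xml;
        }

        // Fetch failed: fall back to a stale cached copy if there is one.
        return file_exists($cache_file) ? file_get_contents($cache_file) : false;
    }

    foreach ($feeds as $name => $url) {
        $xml = fetch_feed($name, $url, $cache_dir, $cache_seconds);
        if ($xml === false) {
            continue;
        }
        $items = parse_feed_items($xml);             // from the earlier sketch
        $items = array_slice($items, 0, $max_items); // whichever count is smaller wins
        // ... hand $items off to the page-building/upload step ...
    }
    ?>

One nice side effect of doing the cache check this way: if a fetch fails, the stale local copy can stand in instead of an empty section.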

The Bad:

  • The processing code – all combined – is too much code. This calls this, which includes that, which writes to this, which is FTP’d to there…and so on. First cut; the code works. Now the challenge is to optimize.
  • Right now, it’s built in PHP, with a shell script for the cron. Should build the entire thing in either Perl or a shell script to make it faster.
  • Major character-encoding issues – lots to learn there, but that’s part of the point: to get it as generic as possible so I can roll it out for any RSS feed and have it work.
  • I’d like to add a feature where each feed can have its own schedule – for example, I don’t care if I hit my own site more frequently than every hour. But right now, the global is one hour (I can set the global to any time), and I can’t override that value for any given feed – it’s all or none (one way it could work is sketched after this list). In the future, this will be important: Some sites will allow more frequent updates, and that should be designed into this sort of app. Why not? Worst-case scenario, I build in functionality that almost never – or just never – gets used. It’s there if needed.
  • As my feed list gets larger, I’ll probably have to create some sort of page-level navigation (drop-down form or bullet list) to take users down the page to the feed desired.
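
For that per-feed schedule, one way it could look – purely hypothetical, none of this is built yet – is an optional interval on each feed entry that overrides the global default:

    <?php
    // Hypothetical per-feed refresh intervals: each feed entry may carry an
    // 'interval' that overrides the global default; missing means use the global.
    $default_interval = 3600;   // global: one hour

    $feeds = array(
        'mysite' => array('url' => 'http://example.com/index.rss', 'interval' => 300),
        'siteB'  => array('url' => 'http://example.org/feed.rss'),  // no override: use the default
    );

    foreach ($feeds as $name => $feed) {
        $interval = isset($feed['interval']) ? $feed['interval'] : $default_interval;
        // hand $interval to the cache check instead of the single global value:
        // $xml = fetch_feed($name, $feed['url'], $cache_dir, $interval);
    }
    ?>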

But first cut. Damn good for that, I think.

Lot of tweaks needed, but this is at the 80/20 mark already (80% of the functionality with 20% of the work…)

My, How You’ve Grown!

Three stories/events that have created a fair amount of buzz around the Blogosphere lately:

OSS Software is Too Hard for Non-Geeks to Use

Written by Eric S. Raymond, this rant was inspired by Raymond’s own efforts to print from a (Linux) computer to a printer connected to another Linux box on his own network.

Hilarity (not) ensued.

His basic message was that OSS software basically sucks when it comes to helping newbies (or, as in this case, experienced Unix Jocks). If people expect Linux to work on the desktop, Wizards and so on have to work as an average user would expect.

Bill Gates Stumps for More CS Majors

The richest man in the world made a tour of major CS colleges recently, trying to drum up interest in CS. Enrollments in CS have dipped over the last few years.

Dave Winer blames the MS juggernaut for killing interest by killing competition; MS’s Scoble, of course, disagrees.

Both have interesting things to say about this issue.

Search Wars intensify

Dan Gillmor takes Yahoo to the woodshed for hiding paid inclusion in its newly released search; Jeremy Zawodny rebuts. Update: Tim Bray – someone who knows search – chimes in. Hint: He’s skeptical of Yahoo’s direction.

To me, one single thread runs through all these issues: Information Technology – which, more and more each day, is rapidly becoming inseparable from Internet Technology – is growing up.

These are all valid issues to bring up; these are all growing pains.

We’re going to see a lot more of these types of issues in the near future; some are going to be bloody battles. And not all will end well.

In a follow-up to his OSS Rant, Raymond published some letters he had received from users re: this subject.

One of the most interesting comments to this follow-up was posted by an anonymous reader:

Linux Identity Crisis

I think the whole community needs to step back for a while and determine just what exactly Linux wants to be.

This whole premise of easy-to-use yet powerful software is flawed. A powerful tool necessarily involves some training for the user…Open source has always been about power and flexibility…If you want to serve the ends of power and flexibility, you cannot also serve the end of ignorant users. No other industry in the business of making powerful tools will dispute this fact…The real problem here is that Linux no longer knows what it wants to be. It wants to conquer the world somehow – to serve best the needs of both grandmothers on AOL and researchers at physics labs.

This user basically argues for keeping Linux complex so no power is sacrificed. Agree or disagree with what he writes, it’s a compelling question.

Where does Linux want to go today?