Text vs. Not Text

One of the most powerful features supplied by Web scripting languages (and one that is useful beyond the Web) is the ability to handle text.

Move. Push. Pop. Array to string; string to array. RegEx to add/split/replace and so on.

Powerful stuff. Different levels of “powers” for different languages, but – overall – text handlers are available and impressive in most Web languages.

Things get more difficult once one hits a non-text item, such as an image.

There are a whole bunch of ways to approach this issue (depending on the language, database/not and so on), but here is one generic example: Assume one has a databased app that stores the text (headline, full article) of some online news story.

Assume, as well, that the article (or its CMS – content management system) optionally allows one image to complement the story. The image is not databased (except for a pointer) but resides in the file system.

Pretty basic.

Text-based info can be validated against a few text-based rules (trim item, must be more than 10 chars and less than 255 chars, cannot be a dupe for its column [headline, full article] and so on).
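To show how little it takes, here is a minimal Python sketch of those text rules. The function name, the error strings and the exact-match dupe test are my assumptions for illustration; a real app would run the dupe check as a database query.

```python
def validate_headline(raw, existing_headlines):
    """Return (cleaned_value, errors) for a submitted headline."""
    value = raw.strip()  # the "trim item" rule
    errors = []
    if len(value) <= 10:  # must be more than 10 chars
        errors.append("too short")
    if len(value) >= 255:  # must be less than 255 chars
        errors.append("too long")
    # Dupe check against values already stored in the same column
    # (hypothetical: passed in as a plain list here, exact match only).
    if value in existing_headlines:
        errors.append("duplicate")
    return value, errors


if __name__ == "__main__":
    print(validate_headline("  Local team wins the cup  ", ["Mayor resigns"]))
```

Three rules, three `if` statements, all working on one string that is already in hand.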

Images? Little more complex.

  • Is provided image (via upload) valid?
  • Is the image one of the acceptable image types (if there is such a list)? E.g., is the image a JPG?
  • RE: Previous point – is the test against the file extension (which a text file renamed with a “.jpg” extension would pass) or a real test as to what the file is (which would catch that “text.txt” is really a JPG)?
  • Any rules on size? Height, width, file size
  • Any rules on what to call the uploaded image? E.g., should “rose.jpg” be transformed to “[article_id].jpg” or does each image get its own directory? (Can’t have two different “rose.jpg” images in the same directory)
  • Transformations: Is the upload process supposed to resize, resample or convert (to a different format) a valid image upload? And where do the resulting images (full-sized and thumbnail, for example) get copied?

Each of the above points is relatively trivial; taken together, it’s a lot of trivial decisions (only JPG?) and conditional logic.
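As one illustration of the extension-versus-content question, here is a rough Python sketch that compares what the file name claims with what the first bytes say. JPEG files begin with the bytes FF D8 FF; the function names are hypothetical, and matching the signature only suggests a JPEG, it does not prove the whole file will decode.

```python
import os

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these three bytes


def looks_like_jpeg(data):
    """True if the raw bytes begin with the JPEG signature."""
    return data[:3] == JPEG_MAGIC


def check_upload(filename, data):
    """Compare what the extension claims with what the content says."""
    ext = os.path.splitext(filename)[1].lower()
    return {
        "extension_says_jpeg": ext in (".jpg", ".jpeg"),
        "content_says_jpeg": looks_like_jpeg(data),
    }


if __name__ == "__main__":
    # A text file wearing a ".jpg" extension.
    print(check_upload("notes.jpg", b"just some text"))
```

And that is just one bullet; each of the others needs its own small check and its own decision about what to do when it fails.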

Way messier than text handling. Yes, best handled – as I have done – with different functions and so on…but still…messier.

Full disclosure: I just finished an “update” area of a Website that permits an image upload (along with lots of text stuff). The image processing/trapping – for one image on a page that has a half-dozen text areas and various mappings (to other areas) – is roughly 20 percent of the code.

In this particular case, I was/wasn’t doing the following – this is not rocket science:

  • Uploaded image must exist (PHP)
  • Uploaded image must be a JPG (ImageMagick)
  • New image overwrites existing image (so no unique issues)
  • Standard – but messy – process to resize a given image to 1) full-sized and 2) thumbnail images on given (local) server
  • Upload those full/thumbs to remote server
  • Defaults are set for the processing of full/thumb images
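The resize step starts with plain arithmetic: work out target dimensions that fit a bounding box without distorting the image. Here is a minimal Python sketch of that calculation only; the box sizes are made-up defaults rather than the ones I used, and the actual pixel work would still be handed to something like ImageMagick.

```python
FULL_BOX = (640, 480)   # assumed default for full-sized images
THUMB_BOX = (120, 120)  # assumed default for thumbnails


def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit the box, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Never scale up: cap the factor at 1.0.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(1600, 1200, *FULL_BOX))
    print(fit_within(1600, 1200, *THUMB_BOX))
```

Capping the scale at 1.0 is a design choice so that small uploads are left alone instead of being blown up.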

OK, there are defaults set for text, as well – usually NOT NULL and length is not more than [max] characters.

Images are different. Why?

  • Different defaults – MORE defaults – need to be set (dupes, type, width, height, color depth…)
  • Text rules are usually resolved via submitted values (POST values); images require examination of uploaded (or not) file.
  • Text rules may require examination of uploaded (POST) text and database text. Images (or any uploaded files) usually require the same AND file structure examination. Example: Yes, “article_name” = x in the database; but – for an image – one has to check to see if “article_image” is a dupe or whatever in the database AND the file structure (unless storing as blobs, which has its own overhead).
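The last point can be sketched in Python as a check that has to look in two places. The `articles` table, its `article_image` column and the function name are assumptions for illustration, and SQLite stands in for whatever database the app really uses.

```python
import os
import sqlite3


def image_name_conflicts(conn, upload_dir, image_name):
    """Report where image_name is already in use: database, disk, or both."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE article_image = ? LIMIT 1",
        (image_name,),
    ).fetchone()
    return {
        "in_database": row is not None,
        "on_disk": os.path.exists(os.path.join(upload_dir, image_name)),
    }


if __name__ == "__main__":
    import tempfile

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (article_name TEXT, article_image TEXT)")
    conn.execute("INSERT INTO articles VALUES ('Garden show', 'rose.jpg')")
    with tempfile.TemporaryDirectory() as upload_dir:
        print(image_name_conflicts(conn, upload_dir, "rose.jpg"))
```

The two answers can disagree (a pointer with no file, or a file with no pointer), which is exactly the kind of case text never makes you handle.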

Sigh.

UPDATED: 3/27/2004 I cut short this entry due to fatigue yesterday. Here is what I left out:

While it’s true that you can – and should – build tools to handle the different (i.e. non-text) datatypes, that’s part of the issue: You have to build them.

For text, most languages offer a plethora of built-in tools (regEx, split, arrays and index/substring functions alone handle most of the heavy lifting) – to different degrees with different languages.

Extensive image – or other file – handling mechanisms are either missing or pale in comparison to the text tools. To a degree, this makes sense: Most data processing is text processing, and text often needs to be massaged (parsed, for example). File uploads are usually just upload, validate that the file is [such and such] a file, rename and move. There is not much massaging of the file innards; there is often much monkeying with text strings.

Yet there is a dearth of tools for image processing/validation built in to many languages (I’m not certain, but I think Java does a good job of natively allowing image access – sizes, type and so on).

For other languages – such as PHP or Perl – a (wonderful/brilliant) third-party tool – ImageMagick – allows all sorts of image manipulation/validation (see the IBM tutorial).

While third-party tools can either extend a language’s abilities – or replace native implementations with better toolsets – these tools must then be available on whatever server the code is deployed to. This can often be a huge issue.

For example, get The Suits to sign off on installing a free, open-source (yeah, no support contract/contact) program like ImageMagick on a server. Or you have your own personal site on shared hosting, and the company won’t install this tool. In either case, you could be screwed.

Native support is virtually always preferable, for the preceding reason (it’s there, guaranteed) and for performance issues. Another layer of abstraction over existing code is not the way to go, for the most part.

In addition, non-native tools – be they user-defined functions or third-party products – are not familiar to all developers, so there is that learning curve/slowed production.

For example, all PHP coders know how to grab a substring.

However, if I build an imageDiscovery() function that validates an image, passes back the image extension, sets default image HxW and so on, well, that’s custom code that the same developer will have to learn if he works on my code/site. Ditto for a third-party tool (such as ImageMagick): The same developer might be used to a different third-party tool, or none at all. Again, learning curve/slowed production.

Ah well, end of rant.