Preventing Complex Failure

Posted on September 12th, 2007 in PHP by Russ

“Complex things break in complex ways.” (Steven M. Bellovin) “The threat is complexity itself.” (same article, Andreas M. Antonopoulos).

As one webpage to email script grows to encompass an entire content management project, or one product listing grows into a storefront and shopping cart, how do we keep our simple tools from becoming complex tools that could break and require a complicated repair?

People who have read my code might notice that I rely on subroutines and functions to keep me on track. I try to break a project into numerous problems, so that each more or less discrete problem is solvable in its own right. To take as an example the form->email project I mentioned above, I’d have one function to accept the variables from the form and “clean them up” so that I could protect the system from people trying to break my form, and another function to handle the actual emailing (since I usually work in PHP, a wrapper around the mail() function).
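A minimal sketch of that split, written in Python rather than PHP purely for illustration. The names clean_fields and build_email and the form field names are hypothetical, and the cleanup shown (trimming whitespace and removing line breaks from single-line fields) is only one assumption about what “cleaning up” might involve. Actually sending the message is left out.

```python
from email.message import EmailMessage


def clean_fields(form):
    """Return a copy of the form data with whitespace trimmed and CR/LF removed.

    Removing line breaks keeps a visitor from smuggling extra mail headers
    into single-line fields such as the subject or reply address.
    """
    cleaned = {}
    for name, value in form.items():
        text = str(value).strip()
        if name != "message":  # only the body may keep its line breaks
            text = text.replace("\r", " ").replace("\n", " ")
        cleaned[name] = text
    return cleaned


def build_email(fields, recipient):
    """Wrap the mail-building step, much as a wrapper around mail() would."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Reply-To"] = fields["email"]
    msg["Subject"] = fields["subject"]
    msg.set_content(fields["message"])
    return msg


# Hypothetical form submission containing an attempted header injection.
submitted = {
    "email": " visitor@example.com ",
    "subject": "Hello\r\nBcc: victim@example.com",
    "message": "First line\nSecond line",
}
fields = clean_fields(submitted)
message = build_email(fields, "webmaster@example.com")
print(message["Subject"])
```

Each function can be tested and reused on its own, which is the point of keeping them separate.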

To me, this is what “making legos to build programs” is about, which is how I learned. Once I’ve written those functions, I can put them into a separate file and include that file as a library. The next time I have to write a similar tool, I can use those same functions. Unfortunately, that also flavors my approach to object-oriented programming, but that’s a post for a different day.

When I’m banging away at the keyboard like Hephaestus, I’ve usually got the problem in my head. So, I know the six or ten or maybe even twenty functions I’ve got going, and I can apply them as necessary. However, once the forge has cooled and someone brings back a broken lightning bolt and I must reforge it (days, weeks, even months or years later), I often have to take a few minutes to reacquaint myself with the tools I had to build for that particular project.

So if I keep my creating simple with the use of functions that break a project down to atoms, and then reassemble the atoms to finish the project, how do I keep the resulting project from being a complex group of simple things that itself will break in complex ways? It’s easiest to lump functions together into larger groups. Actually the process is reversed: I break a large problem into slightly smaller problems, then break those into smaller problems still, et cetera. What I’m describing now is recombining these problems back into the larger problems. So two or three functions that are always called together can be wrapped in a single function.
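As an illustration only, here is a small Python sketch of three “atoms” that always run together, plus one aggregate function that calls them in order. All four names and the 78-character limit are hypothetical.

```python
def trim(value):
    """Remove leading and trailing whitespace."""
    return value.strip()


def strip_line_breaks(value):
    """Replace carriage returns and line feeds with spaces."""
    return value.replace("\r", " ").replace("\n", " ")


def truncate(value, limit=78):
    """Cut the value down to at most `limit` characters."""
    return value[:limit]


def clean_header(value):
    """Aggregate: the three atoms above, always applied in this order."""
    return truncate(strip_line_breaks(trim(value)))


print(clean_header("  Hello\r\nthere  "))
```

The calling code then only has to remember one name instead of three, plus the order they run in.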

However, this doesn’t really help when I come back days or months later and stare at the project. What helps me the most is the documentation I included above each function and frequently on individual lines, as well as the documentation in the “main processing part” of the project.

Sometimes, however, a project is too big for me to load entirely into memory; WordPress is one example. It’s just too big for me to see all the pieces of it. And when I try to work on it, I can sense that I’m losing visualizations of projects I’m currently working on. This is another time when a brain->computer analogy works well: I have to swap some of the projects I have in RAM down to disk in order to look at other problems. To keep from breaking a large project catastrophically, I have to have some understanding of the complexity of the entire project, so I build my way up from the smaller functions through the aggregate functions and then look at how the aggregate functions are applied to each other. This takes a lot of processing power, and usually about the time I have got the functions sorted out, something more urgent will call my attention to real life, and I’ll lose my place.

So, how do we, we of perhaps smaller RAM, catch up and understand a larger project without it conflicting in our mind space with other projects and real life issues?

I don’t know. What are your suggestions?

RCS and Subversion

Posted on September 8th, 2007 in PHP by Russ

I’m a revision management junkie. Since I discovered it, I’ve been very pleased with Subversion. I’ve even thought about setting up some code on a public svn server, but I don’t believe I have anything of interest to the general public.

Currently, one of my main clients uses Subversion for their normal development. However, I also wind up creating more one-off scripts for system maintenance; things like “compare this database list against the files actually on the disk” which are not part of the general development branch.

So, since each of these little one-off jobs might contain one or more scripts, and since the files will change over time as the goals get tweaked a little with each one, I started using RCS for these smaller development environments. This way, I have a record of what changes I’ve made, but I don’t have the overhead of creating and using a full svn repository.

Here’s what I do:
In the directory I’m working in, I create a new directory named “RCS”. Then I ‘check in’ the file for the first time with ‘ci -u $filename’ and enter a log entry (generally “initial import”). Since someone else might come along behind me and screw something up, I mean, change my file, I check it back out with a lock and leave it checked out until next time: ‘co -l $filename’.

I’ve also started using the RCS Keywords within the file, which causes an interesting thing in my header. Since I’m trying to do “better” with regards to documenting the file, and since I’m trying to stick to the phpDocumentor style, I’m winding up using something like this:

* @version 0.7
* Revision $Revision$

Weird, huh? But I don’t think that the phpDocumentor @version keyword processing is compatible with the RCS $Revision$ keyword processing.

Generating a Third Party Sitemap

Posted on September 4th, 2007 in Search Engines by Russ

This is an Interesting Technique. I thought I’d take a look at generating these. As with anything that looks tedious, I figured I’d do it with scripting.

My first thought was to generate the list and then post it to a blog using the “post via email” function; straightforward to generate a list, then an email and send it on its way.

I use site5 for hosting this site, and have the log files placed within my ~/logs directory on the server. I downloaded the most recent (last month’s) copy of the log file and unzipped it:

$> scp arghwebworks.com:logs/arghwebworks.com-Aug-2007.gz .
$> gunzip arghwebworks.com-Aug-2007.gz

Then I opened it up and examined a few lines:

72.51.41.47 - - [31/Jul/2007:02:40:03 -0400] "GET / HTTP/1.0" 200 44785 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"
66.249.66.1 - - [31/Jul/2007:02:40:05 -0400] "GET /2006/07/17/why-ajax-sucks/?paged=3 HTTP/1.1" 200 54123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
212.247.189.113 - - [31/Jul/2007:02:41:04 -0400] "GET / HTTP/1.0" 200 44785 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"

Note that your log format may differ, so if you use this Perl script, you might have to play with the regular expression somewhat.
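For anyone adapting the pattern, here is a small Python sketch of pulling the referrer out of a line in the format shown above. The pattern and the name extract_referrer are illustrative assumptions based on the sample lines, not a statement of what every server writes.

```python
import re

# Layout assumed from the sample lines:
# host ident user [time] "request" status bytes "referrer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def extract_referrer(line):
    """Return the referrer field, or None if the line does not fit the format."""
    match = LOG_LINE.match(line)
    return match.group("referrer") if match else None


sample = (
    '66.249.66.1 - - [31/Jul/2007:02:40:05 -0400] '
    '"GET /2006/07/17/why-ajax-sucks/?paged=3 HTTP/1.1" 200 54123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
print(extract_referrer(sample))
print(extract_referrer("not a log line"))
```

Naming each field makes it easier to see which part of the expression to change when a log uses a different layout.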

I wrote this Perl script:

#!/usr/bin/perl
use strict;
use warnings;

# The log file downloaded and unzipped above.
open(my $log, '<', 'arghwebworks.com-Aug-2007')
	or die "Cannot open log file: $!";

my %referrers;
while (my $line = <$log>) {
	# Capture the quoted referrer field that follows the status and size.
	next unless $line =~ m/\[.*\] ".*" \d{1,4} \d* "(.*?)"/;
	my $testval = $1;
	$testval =~ s/^\s+|\s+$//g;

	# Skip empty referrers, "-", and hits referred from my own site.
	next if $testval eq '' || $testval eq '-';
	next if $testval =~ m{http://(?:www\.)?arghwebworks\.com};

	$referrers{$testval}++;
}
close($log);

# Sort referrers by hit count, highest first, and print the top eleven.
my $count = 0;
foreach my $food ( sort { $referrers{$b} <=> $referrers{$a} } keys %referrers ) {
	print "$food is $referrers{$food}.\n";
	last if ++$count > 10;
}

Yeah, I know, it’s crap. :) It still runs quickly enough to give me the results in a matter of microseconds.

Here’s what I get back:

http://209.85.135.104/search?q=cache:PpXaEbSr1p4J:www.arghwebworks.com/%3Fpaged%3D5+mysql-bind+patch&hl=cs&ct=clnk&cd=14&gl=cz is 22.
http://209.85.165.104/search?q=cache:JfMyYqbaZYEJ:www.arghwebworks.com/%3Fp%3D70+procmailrc+cialis&hl=en&ct=clnk&cd=1&gl=us is 7.
http://viagranwnc.blogspot.com is 7.
http://phentermine--mine.blogspot.com is 6.
http://search.live.com/results.aspx?q=search&mrt=en-us&FORM=LIVSOP is 5.
http://64.233.183.104/search?q=cache:wMrxXb6B59EJ:arghwebworks.com/cwc_tgc/+tara%40welcomingcongregations.org&hl=fr&ct=clnk&cd=3&gl=ci is 4.
http://buy65--rphentermine.blogspot.com is 4.
http://phenterminew0-q.blogspot.com is 2.
http://airline-rrr-tickets.blogspot.com is 1.
http://www.pingdom.com/tools/fpt/?url=www.arghwebworks.com is 3.
http://valium9-j.blogspot.com is 1.

I deliberately didn’t link those.

Look at this short list of URLs. Out of eleven, five of them contain the name of some drug. The top two are Google cache hits (the first from a search in Czech), and there are two other search engine queries (one of them references a client). One is an obvious airline ticket splog, and one is a tool from pingdom.com.

Totally worthless.

If I were going to pursue this method, I’d want to be able to check the “spamminess” of a link. Any suggestions out there?
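One crude starting point, sketched in Python under heavy assumptions: score a referrer by looking for known spam words in its host name or query string and for hosts on free blog services. The word list, the host suffix and the name spam_score are all made up for illustration, and a real check would need something far better than a hand-kept list.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-picked lists taken from the output above.
SPAM_WORDS = ("viagra", "phentermine", "valium", "cialis")
FREE_BLOG_SUFFIXES = (".blogspot.com",)


def spam_score(referrer):
    """Return a rough score: higher means more likely to be referrer spam."""
    parts = urlsplit(referrer.lower())
    host = parts.hostname or ""
    score = 0
    if any(word in host for word in SPAM_WORDS):
        score += 2  # spam word in the host name itself
    if any(word in parts.query for word in SPAM_WORDS):
        score += 1  # spam word in the query string
    if host.endswith(FREE_BLOG_SUFFIXES):
        score += 1  # throwaway blog on a free service
    return score


for url in (
    "http://viagranwnc.blogspot.com",
    "http://www.pingdom.com/tools/fpt/?url=www.arghwebworks.com",
):
    print(spam_score(url), url)
```

Anything scoring above some chosen threshold could then be dropped before the list is built, though picking that threshold is its own problem.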