Raging Sloth Tech

PNG Optimizer Cleanup


Tying up loose ends



So when I first started ragingsloth.com I wrote a little Python script that would go through and run PNG optimization programs on all your PNG images. (BTW, I've recently discovered a program called jpegrescan, which can be used to adapt this script to squeeze a little extra lossless compression out of JPEGs. Be warned, though, that I've found this technique makes the files work improperly with Photoshop for some reason, so if you have a website use it, but always keep a copy of your original images.) At the time my plan was to write a tutorial in Python that focused on how to think like a developer and walked through all the stages slowly. Unfortunately I didn’t revisit the article for a really long time, and now that I’ve started updating the site relatively often I want to write shorter pieces and skip over a lot of the stuff I was previously planning on covering. One of the steps I never got to with the original script was refactoring: the process where you go back and clean up old code that just doesn’t look right.

The approach I took originally was to try and instil in people the idea that they should try things when programming. You are far better off jumping into a prototype and learning from its mistakes than sitting there for hours and accomplishing nothing. Even if you come up with a brilliant design that way, you will have come to it without ever confirming your understanding of the interfaces involved. Imagine creating a beautiful architectural design in your head and then realizing that a function key to it doesn’t work the way you expected... Well, if you’d coded something as you went along you wouldn’t have gotten yourself into that mess. Remember, a real design comes from an understanding that is very hard to come by without actually coding a solution.

On the other hand, when you rush through a prototype you usually want to go back and make it pretty afterwards. I say usually because sometimes you code something for a single use and then throw it away. Be careful though, because today’s throwaway code can often become tomorrow’s mission-critical system that you have to maintain.

So let’s get started cleaning up our old script, which you can see below:

#!/usr/bin/env python

import os,traceback,sys


def optimizeImages(directory):
    "run optipng and pngout on all png files located in a directory recursively"
    path = os.path.normpath(directory)
    if not os.path.isdir(path):
        raise ValueError("Directory %s not found" % path)
    os.chdir(path)
    files = os.listdir(path)
    for file in files:
        if os.path.isdir(path+"/"+file):
            try:
                optimizeImages(path+"/"+file)
                #the recursive call changed the working directory, so change it back
                os.chdir(path)
            except Exception:
                traceback.print_exc()
        else:
            try:
                if file.split(".")[-1] == 'png':
                    print("optipnging " + path + '/' + file)
                    os.spawnlp(os.P_WAIT,'optipng','optipng','-o7','-q',file)
                    print("pngouting " + path + '/' + file)
                    os.spawnlp(os.P_WAIT,'pngout','pngout','-q',file)
            except Exception:
                traceback.print_exc()


optimizeImages(os.getcwd())
sys.exit(0)


Note that I made a quick change to the code to fix a bug in the main tutorial but reverted to the old code here so that I can take you through the process of fixing it.

Bugs!!!


Now in my own defence the script was written to do a specific task and I never encountered any problems with it as is. If it were really a throwaway script or even one I planned on using over and over I quite likely would have never had any problems with it. The thing is that this script as written can potentially never end :)

See, this script is written sort of assuming that you’re running a Unix-like OS, perhaps GNU/Linux or Mac OS X or FreeBSD. You can see that in the way I formatted the slashes for the file access with ‘/’ instead of the Windows ‘\’. The truth is I don’t even know if this script runs on Windows because I don’t have a Windows box (though I do maintain a virtual machine). I used to have plans to write a bunch of website optimization scripts, and most websites are hosted on Linux boxes, so I went specifically for those.

So anyway, for anyone who is unfamiliar, in Unix you can create what are called symbolic links (symlinks for short). Basically you create a new record for an existing file, but in another place. So for example I can save an image file in one directory on my hard drive (perhaps a folder used by an image management application) and then put a symlink in my web server folder, and the file would be accessible from both filesystem locations. This can also be done with directories, and our script above doesn’t care whether a directory is a real one or a symlink.
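If you've never played with symlinks, here's a small sketch you can run safely. It uses throwaway temporary directories standing in for the image library and the web server folder (the "library" and "webroot" names and the file contents are made up for this example), and it assumes a Unix-like system where you're allowed to create symlinks.

```python
import os
import tempfile

# Throwaway directories standing in for a photo library and a web root
# (both names are invented for this sketch).
library = tempfile.mkdtemp(prefix="library-")
webroot = tempfile.mkdtemp(prefix="webroot-")

original = os.path.join(library, "logo.png")
with open(original, "w") as handle:
    handle.write("pretend image data")

# Create a second name for the same file inside the web root.
link = os.path.join(webroot, "logo.png")
os.symlink(original, link)

print(os.path.islink(link))       # True: the new entry is a symlink
print(os.path.islink(original))   # False: the original is a regular file
# Both names resolve to the same real file on disk.
print(os.path.realpath(link) == os.path.realpath(original))  # True
with open(link) as handle:
    print(handle.read())          # pretend image data
```

Note that os.path.isdir and os.path.isfile follow symlinks, which is exactly why the script can't tell a linked directory from a real one without also asking os.path.islink.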

This is actually great if you want the script to follow symlinks (like if you use them to manage files for a website), but terrible if, say, you want to run the script over an entire hard drive, because you will hit the same files twice, or potentially an infinite number of times :) Imagine you call the script at location / and there is a symlink in your Wine directory that points to the fake Windows d:\ drive, which is of course your root directory again... which in turn contains your Wine d: drive, which contains your Wine d: drive... There you have an infinite loop, which is rather unsightly.

So given the limited nature of what this script does, this is actually the only bug I’ve ever identified. One way to fix it is to just add another condition to the if statement where we test for a directory, and that’s exactly the quick fix I shoved into place in the old tutorial to keep anyone who might download the script from looping around in infinite circles. The quick fix is to add "and not os.path.islink(path+"/"+file)" to the directory test, so that the line reads "if os.path.isdir(path+"/"+file) and not os.path.islink(path+"/"+file):".
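Here's a minimal sketch of that quick fix as a standalone function, in case you want to see it in isolation. The name list_pngs is made up for this example, it uses os.path.join instead of gluing paths together with "/", and it only collects paths rather than running optipng or pngout, so it's safe to try on any directory.

```python
import os

def list_pngs(directory):
    """Recursively collect .png paths, never descending into symlinked directories."""
    found = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        # The quick fix: only recurse into real directories, not symlinks to them.
        if os.path.isdir(full) and not os.path.islink(full):
            found.extend(list_pngs(full))
        elif os.path.isfile(full) and os.path.splitext(name)[1] == ".png":
            found.append(full)
    return found
```

A symlinked directory fails the first test and is not a file either, so it is silently skipped, which is what breaks the loop.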

If you really want symlink support, another thing you could do is maintain a list of all the directories you've already been to, resolve each symlink to its actual location, and simply skip symlinks to places you have already been. For the purposes of where I want to go with this script, however, I'm just going to assume that you're OK with not following symlinks.
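For the curious, the "remember where you've been" idea could look something like the sketch below. This is not code from the original script: walk_once is a hypothetical helper, and it uses os.path.realpath to resolve each directory to its actual location and a set to remember the ones already visited.

```python
import os

def walk_once(top):
    """Yield (path, files) for every directory under top, following symlinks
    but visiting each real directory only once."""
    visited = set()
    stack = [top]
    while stack:
        current = stack.pop()
        real = os.path.realpath(current)  # resolve symlinks to the actual location
        if real in visited:
            continue  # already been here by another route, so skip it
        visited.add(real)
        files = []
        for name in sorted(os.listdir(current)):
            full = os.path.join(current, name)
            if os.path.isdir(full):
                stack.append(full)  # real directories and symlinked ones alike
            else:
                files.append(name)
        yield current, files
```

With this approach a symlink that points back up the tree is resolved to a directory that is already in the visited set, so the walk ends instead of going around in circles.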

As we continue our cleanup, however, we'll see that the symlink bug gets fixed anyway as a consequence of our refactoring.

Free upgrades



There is always this urge when developing an application to write code. One would think this was perfectly healthy, but it can lead to writing code that doesn't really need to be written. Oftentimes we are confronted with tasks that have been encountered many, many times before. In these cases there are often prepackaged solutions that we can just plug into. There are advantages and disadvantages to this. The advantage is of course that we get a bunch of functionality for free; the disadvantage is that the solution may not fit nicely into our design, or the code we are plugging in may not be well maintained. So we need to balance our desire for total control with the potential benefits of reusing someone else's code.

Now in our case the Python standard library has a prebuilt system for doing much of what our script does. It is called walk, and it lives in the os module (in the os documentation you will need to scroll down to the walk function). With walk we don't need to worry about any of our code that deals with directory manipulation. All we need to do is figure out how to use it, then pick out the interesting files from the list it gives us and run our programs on them. Another nice thing about walk is that its developers foresaw our issue with links and let us choose whether or not to follow them through an argument (followlinks, which is off by default).

So let's take a look at how walk works. Walk doesn't actually do what we want, but creates an object that does. This object "generates" results based on what we ask for, and it is up to us to use them the way we want. The generator also keeps track of where it left off, giving us chunks of data to sort through. So let's look at some code. Try typing the following into a Python interactive shell:


>>> import os
>>> walker = os.walk("/")
>>> next(walker)


As a result you'll get a tuple with 3 values. The first is a string giving you the path to the directory you're currently "walking", the second is a list of all the directories held within it, and the third is a list of all the files (i.e. non-directories) in the current directory. For our purposes we are only really interested in the first and third values; our "walker" will take care of the second. Let's take a look at how our script turns out if we use walk instead of our own code.
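Before we get to the script, here is a small sketch that makes the three values concrete. Walking "/" gives different results on every machine, so this version builds a tiny throwaway tree first; the file and directory names are made up for the example.

```python
import os
import tempfile

# Build a small throwaway tree (names are invented for this sketch):
#   top/a.png
#   top/sub/b.png
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "sub"))
for relative in ("a.png", os.path.join("sub", "b.png")):
    open(os.path.join(top, relative), "w").close()

walker = os.walk(top)

# First result: the top directory itself.
first = next(walker)
print(first[0] == top)  # True
print(first[1])         # ['sub']
print(first[2])         # ['a.png']

# Second result: the generator remembered where it left off and moved into sub.
second = next(walker)
print(os.path.basename(second[0]))  # sub
print(second[1])                    # []
print(second[2])                    # ['b.png']
```

Calling next a third time on this tree raises StopIteration, which is how a for loop knows there is nothing left to walk.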

#!/usr/bin/env python

import os,traceback,sys

def optimizeImages(directory):
    "run optipng and pngout on all png files located in a directory recursively"
    path = os.path.normpath(directory)
    if not os.path.isdir(path):
        raise ValueError("Directory %s not found" % path)

    #we can use a for loop to continually get more data from a walk object until there is none left
    for currPath, currDirs, currFiles in os.walk(path):
        for file in currFiles:
            #no need for us to test and see if the file is a directory as walk is taking care of that for us. Also no need for recursion.
            try:
                if file.split(".")[-1] == 'png':
                    print("optipnging " + currPath + '/' + file)
                    #we need to append the filename to the path as we haven't changed the current working directory this time
                    os.spawnlp(os.P_WAIT,'optipng','optipng','-o7','-q',currPath+'/'+file)
                    print("pngouting " + currPath + '/' + file)
                    os.spawnlp(os.P_WAIT,'pngout','pngout','-q',currPath+'/'+file)
            except Exception:
                traceback.print_exc()

optimizeImages(os.getcwd())
sys.exit(0)



So as you can see, we've taken a short script and made it even shorter. We've also made use of functionality that is maintained for us by someone else. If, for example, someone were to find a faster way to do this iteration and implement it in the Python standard library, then our script would go faster without us having to do anything. We could also easily add a command line option to toggle between following and not following links without having to put in a bunch of ugly if statements to switch between modes. I've seen other similar scripts on the net that use regular expressions and various methods to quickly sort through these lists of files, but for the purposes of this script I don't see the point. The time it takes for optipng and pngout to process a file is going to be far longer than the time it takes to test for png at the end of a filename.
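As a rough idea of what that command line option might look like, here is a sketch using argparse from the standard library. The --follow-links flag name and the find_pngs, parse_args and main helpers are my own inventions for this example, not part of the script on the downloads page, and the sketch only prints the files it finds instead of running optipng and pngout.

```python
import argparse
import os

def find_pngs(directory, follow_links=False):
    """Return the sorted paths of all .png files under directory."""
    found = []
    # The flag is handed straight to os.walk, so no mode-switching if statements.
    for curr_path, curr_dirs, curr_files in os.walk(directory, followlinks=follow_links):
        for name in curr_files:
            if name.split(".")[-1] == "png":
                found.append(os.path.join(curr_path, name))
    return sorted(found)

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="List the PNG files that would be optimized.")
    parser.add_argument("directory", nargs="?", default=os.getcwd())
    # Hypothetical flag name; off by default so symlinked directories are not entered.
    parser.add_argument("--follow-links", action="store_true")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    for path in find_pngs(args.directory, args.follow_links):
        print(path)
```

Calling main() at the bottom of a script would then let you run it with or without --follow-links. Keep in mind that turning the flag on brings back the possibility of walking around a symlink loop.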

Conclusion


So this was a rather short exercise in refactoring our script (mostly due to how short it was in the first place). The improved script is up on my downloads page, and now we have a much smaller, cleaner script to use in the PNG optimizer parallelization tutorials.