Raging Sloth Tech

Optimizer - becoming more generic





So far the PNG Optimizer tutorial has been about just that. It started because I had a png optimizer program that would only operate on a single file at a time, and I decided to write a script to feed multiple files into it. Since then we've cleaned up the code and parallelized the process so that multiple processor cores can speed the whole thing up. Looking at the script now, though, it's rather limited: it identifies png files and runs the appropriate programs on them with the appropriate command line arguments. But why limit ourselves to png files? A quick look at the code shows that the heart of the script is identifying files of a particular type, so now we're going to take a step towards making the system more useful and generic. It would be awfully nice if the same script could make our jpegs a little smaller as well. For this we're going to make use of
jpegrescan, but be forewarned: do not run jpegrescan to replace your master images. jpegrescan will shrink your jpegs in a lossless fashion from an image perspective, but it will dump a lot of metadata (if you have any you care about), and I have found that images shrunk with it are not properly rendered by photoshop. So don't say I didn't warn you. In keeping with the functionality we already have for pngs, the script will shrink jpegs in place (i.e. replacing the previous file), so do not run this script on master images; run it on exported copies that you are preparing to put online or send to someone (who isn't going to edit them in photoshop).

We'll start as always by taking a look at the script as it was when we last left it.

#!/usr/bin/env python

import os,traceback
from multiprocessing import Pool

def optimizePNG(filenameAndPath):
    try:
        os.spawnlp(os.P_WAIT,'optipng','optipng','-o7','-q',filenameAndPath)
        os.spawnlp(os.P_WAIT,'pngout','pngout','-q',filenameAndPath)
        return (True,filenameAndPath,os.getpid())
    except:
        return (False,filenameAndPath,os.getpid(),traceback.format_exc())

def generatePNGListing(directory):
    "run optipng pngout on all png files located in a directory recursively"
    path = os.path.normpath(directory)
    if not os.path.isdir(path):
        raise IOError("Directory %s not found" % path)

    #we can use a for loop to continually get more data from a walk object until there is none left
    for currPath, currDirs, currFiles in os.walk(path):
        for filename in currFiles:
            #os.walk only puts real files in currFiles, so there's no need to test for directories or to recurse ourselves
            if filename.split(".")[-1].lower() == 'png':
                yield os.path.join(currPath, filename)

def main(directory):
    pool = Pool() #we'll use the default number of processes
    results = pool.imap_unordered(optimizePNG, generatePNGListing(directory))
    pool.close()
    pool.join()
    successes = []
    failures = []
    for result in results:
        if result[0]:
            successes.append(result)
        else:
            failures.append(result)
    for success in successes:
        print success[1] + " optimized by " + str(success[2])
    for failure in failures:
        print failure[1] + " failed due to exception " + failure[3] + " in process " + str(failure[2])

if __name__ == "__main__":
    main(os.getcwd())



Our png filter isn't really all that complex. We simply take the substring of the filename after the last "." and check whether it equals png: if filename.split(".")[-1].lower() == 'png'. We could easily replace this with a loop that checks multiple file extensions, but python provides us with a much better alternative in the form of the
"in" operator. It's a perfect candidate here because we have a finite list of elements that don't follow a pattern. We could of course hard code two comparisons into the if with an or operator, but using in lets us keep a list that is much easier to expand. Here's a quick standalone illustration of the idea (the filenames are invented for the example):

#!/usr/bin/env python

import os,traceback
from multiprocessing import Pool

def optimizePNG(filenameAndPath):
    try:
        #os.spawnlp(os.P_WAIT,'optipng','optipng','-o7','-q',filenameAndPath)
        #os.spawnlp(os.P_WAIT,'pngout','pngout','-q',filenameAndPath)
        return (True,filenameAndPath,os.getpid())
    except:
        return (False,filenameAndPath,os.getpid(),traceback.format_exc())

def generatePNGListing(directory):
    "run optipng pngout on all png files located in a directory recursively"
    path = os.path.normpath(directory)
    if not os.path.isdir(path):
        raise IOError("Directory %s not found" % path)

    #we can use a for loop to continually get more data from a walk object until there is none left
    for currPath, currDirs, currFiles in os.walk(path):
        for filename in currFiles:
            #os.walk only puts real files in currFiles, so there's no need to test for directories or to recurse ourselves
            if filename.split(".")[-1].lower() in ['png','jpg']:
                yield os.path.join(currPath, filename)

def main(directory):
    pool = Pool() #we'll use the default number of processes
    results = pool.imap_unordered(optimizePNG, generatePNGListing(directory))
    pool.close()
    pool.join()
    successes = []
    failures = []
    for result in results:
        if result[0]:
            successes.append(result)
        else:
            failures.append(result)
    for success in successes:
        print success[1] + " optimized by " + str(success[2])
    for failure in failures:
        print failure[1] + " failed due to exception " + failure[3] + " in process " + str(failure[2])

if __name__ == "__main__":
    main(os.getcwd())



You'll notice the first thing I did was to comment out the part of our code that actually optimizes images. We don't have any code that does anything with jpeg files yet and we don't want to run our png programs against them; it's also nice to have the script finish much faster for testing purposes. If you run this script in a folder that contains jpegs (including in sub-folders) you'll see that it finds all png and jpg files, which is exactly what we want. But our script could still be better. While they are rare, I have seen jpeg files with a "jpeg" extension rather than a "jpg" extension, and it would be nice if those were found too. Lowering the case isn't enough to catch them, as this quick check shows (the filenames are made up):
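
print "IMG_001.JPG".split(".")[-1].lower() in ['png','jpg']    #True, upper case is normalized away
print "IMG_002.jpeg".split(".")[-1].lower() in ['png','jpg']   #False, 'jpeg' is a different string than 'jpg'

At this point it is pretty easy for me to go right into the code and add "jpeg" to the list we have, but what about weeks from now? I'd have to read through the entire generatePNGListing method to find where to put things. On that note, we aren't generating a png listing anymore either... So we'll move our list somewhere where it is easier to find and rename some things to make more sense in our new context.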

#!/usr/bin/env python

import os,traceback
from multiprocessing import Pool

interestingFileTypes = ['png','jpg','jpeg']

def optimizeFile(filenameAndPath):
    "for the given file run optimization programs associated with its type"
    try:
        #os.spawnlp(os.P_WAIT,'optipng','optipng','-o7','-q',filenameAndPath)
        #os.spawnlp(os.P_WAIT,'pngout','pngout','-q',filenameAndPath)
        return (True,filenameAndPath,os.getpid())
    except:
        return (False,filenameAndPath,os.getpid(),traceback.format_exc())

def generateInterestingFileListing(directory):
    "Recursively identify all files with file extensions contained within the interestingFileTypes list"
    path = os.path.normpath(directory)
    if not os.path.isdir(path):
        raise IOError("Directory %s not found" % path)

    #we can use a for loop to continually get more data from a walk object until there is none left
    for currPath, currDirs, currFiles in os.walk(path):
        for filename in currFiles:
            #os.walk only puts real files in currFiles, so there's no need to test for directories or to recurse ourselves
            if filename.split(".")[-1].lower() in interestingFileTypes:
                yield os.path.join(currPath, filename)

def main(directory):
    "run associated optimization programs on all files within the given directory"
    pool = Pool() #we'll use the default number of processes
    results = pool.imap_unordered(optimizeFile, generateInterestingFileListing(directory))
    pool.close()
    pool.join()
    successes = []
    failures = []
    for result in results:
        if result[0]:
            successes.append(result)
        else:
            failures.append(result)
    for success in successes:
        print success[1] + " optimized by " + str(success[2])
    for failure in failures:
        print failure[1] + " failed due to exception " + failure[3] + " in process " + str(failure[2])

if __name__ == "__main__":
    main(os.getcwd())



You can see that I've moved the list we are testing against into a separate variable so that it is easily accessible without having to dig through our file identifying code. I've also given more generic names to our functions and cleaned up some docstrings that were painfully out of date :). So all that's really left is to alter our optimizeFile function to actually optimize files. We could again use some if statements to run different programs depending on the file extension, but if we later want to expand our script to handle further file types that would be a total pain. In Python everything is an object, including functions, so there is a convenient switch-statement-like construct we can use here. The basic idea is that when you want to dispatch based on a variable, you create a dictionary with the possible values as keys and handler functions as values, like so:


def handleOne(input):
    print "ONE " + str(input)

def handleSTUFF(input):
    print "STUFF " + str(input)

handlers = {1: handleOne, "STUFF": handleSTUFF}

for x in [1, 1, 1, "STUFF", "STUFF", 1]:
    handlers[x]("value")
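
Each pass through the loop uses the current value of x to pick the matching function out of the dictionary and call it, so this prints "ONE value" three times, then "STUFF value" twice, then "ONE value" once more, all without a single if statement.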



As you can see, the values you compare against don't even have to be the same type, which is very convenient and powerful. If we use this method as-is, though, we end up typing the same data more than once: we already have a list of all the file extensions we handle, and then we'd enter those same extensions as keys in our dictionary. We can eliminate the file extension list by using the dictionary.keys() method, which returns a list of all the keys of the dictionary holding our handlers. That way the data only needs to change in one place, so we don't have to worry about, say, adding an extension to the dictionary but forgetting the list and having nothing happen, or adding to the list without adding a handler and getting an exception thrown when a file has no handler. To see what keys() hands us, we can poke at the handlers dictionary from the example above:
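
print handlers.keys()              #prints [1, 'STUFF'] (in some order)
print "STUFF" in handlers.keys()   #True, so the same dictionary can drive our matching test

We are now ready for a much more mature version of our code.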

#!/usr/bin/env python

import os,traceback
from multiprocessing import Pool


def pngOptimize(filenameAndPath):
    return "PNG"

def jpegOptimize(filenameAndPath):
    return "JPEG"

#a mapping of file extensions to the functions that will act on those file types; extensions must be lower case
fileOptimizers = {'png': pngOptimize, 'jpg': jpegOptimize, 'jpeg': jpegOptimize}

def optimizeFile(filenamePathAndExt):
    "for the given file run the optimization function associated with its type"
    filenameAndPath = filenamePathAndExt[0]
    ext = filenamePathAndExt[1]
    try:
        return (True, filenameAndPath, str(os.getpid()) + " " + fileOptimizers[ext](filenameAndPath))
    except:
        return (False, filenameAndPath, os.getpid(), traceback.format_exc())

def generateInterestingFileListing(directory):
    "Recursively identify all files with file extensions that have an entry in the fileOptimizers dictionary"
    path = os.path.normpath(directory)
    if not os.path.isdir(path):
        raise IOError("Directory %s not found" % path)

    #we can use a for loop to continually get more data from a walk object until there is none left
    for currPath, currDirs, currFiles in os.walk(path):
        for filename in currFiles:
            #os.walk only puts real files in currFiles, so there's no need to test for directories or to recurse ourselves
            extension = filename.split(".")[-1].lower()
            if extension in fileOptimizers.keys():
                yield (os.path.join(currPath, filename), extension)

def main(directory):
    "run associated optimization programs on all files within the given directory"
    pool = Pool() #we'll use the default number of processes
    results = pool.imap_unordered(optimizeFile, generateInterestingFileListing(directory))
    pool.close()
    pool.join()
    successes = []
    failures = []
    for result in results:
        if result[0]:
            successes.append(result)
        else:
            failures.append(result)
    for success in successes:
        print success[1] + " optimized by " + str(success[2])
    for failure in failures:
        print failure[1] + " failed due to exception " + failure[3] + " in process " + str(failure[2])

if __name__ == "__main__":
    main(os.getcwd())
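
If you run this version in a folder of images, the output will look something like the following (the paths and process ids here are made up for illustration):

./photos/cat.png optimized by 4242 PNG
./photos/vacation/beach.jpeg optimized by 4243 JPEG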



This script won't actually do anything to your files, but as the sample output shows, it demonstrates that our code is working properly by identifying jpg, jpeg and png files and printing each one with its file type added to the end. If, for example, we also wanted to run a program on bmp files, we could just add a function that did what we wanted and add an entry for "bmp" to the fileOptimizers dictionary. Since we deal with filename case by converting extensions to lower case, any file extensions we add must be in lower case. Now before I end this tutorial and put up the final script I want to reiterate the warning about jpegrescan. jpegrescan is not lossy from an image quality perspective, but it will strip away your metadata, including any copyright notice you've added, and it will also (at least in the versions I've used) do something to the file that photoshop does not like. I've been able to edit rescanned images with
the gimp and they have worked fine in every viewer I've tested, but photoshop no longer renders them properly (at least at the time of this writing). So do not.... DO NOT run this script against your original jpegs; run it against copies of jpegs that you want smaller for some reason. On a less menacing note, you'll also need to install jpegrescan (it requires perl and, I think, a couple of perl libraries to work) and place it somewhere in the path, or adjust the spawnlp code under jpegOptimize (a sketch of a non-destructive variant follows the script below). You can download this script here.

#!/usr/bin/env python

import os,traceback
from multiprocessing import Pool


def pngOptimize(filenameAndPath):
    try:
        os.spawnlp(os.P_WAIT,'optipng','optipng','-o7','-q',filenameAndPath)
        os.spawnlp(os.P_WAIT,'pngout','pngout','-q',filenameAndPath)
        return (True,filenameAndPath,os.getpid())
    except:
        return (False,filenameAndPath,os.getpid(),traceback.format_exc())

#this will shrink jpegs in place but potentially make them unreadable by photoshop; use at your own risk, or consider changing this code
#to save to a different filename (change the last parameter of the os.spawnlp call to something like filenameAndPath + ".rescanned.jpeg")
def jpegOptimize(filenameAndPath):
    try:
        os.spawnlp(os.P_WAIT,'jpegrescan','jpegrescan',filenameAndPath,filenameAndPath)
        return (True,filenameAndPath,os.getpid())
    except:
        return (False,filenameAndPath,os.getpid(),traceback.format_exc())


#a mapping of file extensions to the functions that will act on those file types; extensions must be in lower case
fileOptimizers = {'png': pngOptimize, 'jpg': jpegOptimize, 'jpeg': jpegOptimize}

def optimizeFile(filenamePathAndExt):
    "for the given file run the optimization function associated with its type"
    filenameAndPath = filenamePathAndExt[0]
    ext = filenamePathAndExt[1]
    try:
        #look up the optimizer registered for this extension and let it report its own result tuple
        return fileOptimizers[ext](filenameAndPath)
    except:
        return (False,filenameAndPath,os.getpid(),traceback.format_exc())

def generateInterestingFileListing(directory):
    "Recursively identify all files with file extensions that have an entry in the fileOptimizers dictionary"
    path = os.path.normpath(directory)
    if not os.path.isdir(path):
        raise IOError("Directory %s not found" % path)

    #we can use a for loop to continually get more data from a walk object until there is none left
    for currPath, currDirs, currFiles in os.walk(path):
        for filename in currFiles:
            #os.walk only puts real files in currFiles, so there's no need to test for directories or to recurse ourselves
            extension = filename.split(".")[-1].lower()
            if extension in fileOptimizers.keys():
                yield (os.path.join(currPath, filename), extension)

def main(directory):
    "run associated optimization programs on all files within the given directory"
    pool = Pool() #we'll use the default number of processes
    results = pool.imap_unordered(optimizeFile, generateInterestingFileListing(directory))
    pool.close()
    pool.join()
    successes = []
    failures = []
    for result in results:
        if result[0]:
            successes.append(result)
        else:
            failures.append(result)
    for success in successes:
        print success[1] + " optimized by " + str(success[2])
    for failure in failures:
        print failure[1] + " failed due to exception " + failure[3] + " in process " + str(failure[2])

if __name__ == "__main__":
    main(os.getcwd())
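
As promised, here is a minimal sketch of the non-destructive variant suggested in the comment above jpegOptimize. The ".rescanned.jpeg" suffix is just the suggestion from that comment; pick whatever you like:

def jpegOptimizeToCopy(filenameAndPath):
    try:
        #give jpegrescan a separate output filename so the original jpeg is left untouched
        os.spawnlp(os.P_WAIT,'jpegrescan','jpegrescan',filenameAndPath,filenameAndPath + ".rescanned.jpeg")
        return (True,filenameAndPath,os.getpid())
    except:
        return (False,filenameAndPath,os.getpid(),traceback.format_exc())

Point the 'jpg' and 'jpeg' entries of fileOptimizers at this function instead of jpegOptimize and the rest of the script works unchanged. Just be aware that a second run of the script would pick up the .rescanned.jpeg copies too, since they still end in a jpeg extension.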