The optparse module: writing command-line tools the easy way

Status:Draft
Author:Michele Simionato
E-mail:michele.simionato@partecs.com
Date:May 2004

The optparse module is a powerful, flexible, extensible, easy-to-use command-line parsing library for Python. Using optparse, you can add intelligent, sophisticated handling of command-line options to your scripts with very little overhead. -- Greg Ward, optparse author

Introduction

Once upon a time, when graphic interfaces were still to be dreamed about, command-line tools were the body and the soul of all programming tools. Many years have passed since then, but some things have not changed: command-line tools are still fast, efficient, portable, easy to use and - more importantly - reliable. You can count on them. You can expect command-line scripts to work in any situation, during the installation phase, in a situation of disaster recovery, when your window manager breaks down and even in systems with severe memory/hardware constraints. When you really need them, command-line tools are always there.

Hence, it is important for a programming language - especially one that wants to be called a "scripting" language - to provide facilities to help the programmer in the task of writing command-line tools. For a long time Python support for this kind of tasks has been provided by the``getopt`` module. I have never been particularly fond of getopt, since it required a sensible amount of coding even for the parsing of simple command-lines. However, with the coming of Python 2.3 the situation has changed: thanks to the great job of Greg Ward (the author of optparse a.k.a. Optik) now the Python programmer has at her disposal (in the standard library and not as an add-on module) a fully fledged Object Oriented API for command-line arguments parsing, which makes writing Unix-style command-line tools easy, efficient and fast.

The only disadvantage of optparse is that it is sophisticated tool, which requires some time to be fully mastered. The purpose of this paper is to help the reader to rapidly get the 10% of the features of optparse that she will use in the 90% of the cases. Taking as an example a real life application - a search and replace tool - I will guide the reader through (some of) the wonders of optparse. Also, I will show some trick that will make your life with optparse much happier. This paper is intended for both Unix and Windows programmers - actually I will argue that Windows programmers need optparse even more than Unix programmers; it does not require any particular expertise to be fully appreciated.

A simple example

I will take as pedagogical example a little tool I wrote some time ago, a multiple files search and replace tool. I needed it because I am not always working under Unix, and I do not always have sed/awk or even Emacs installed, so it made sense to have this little Python script in my toolbox. It is only few lines long, it can always be modified and extended with a minimal effort, works on every platform (including my PDA) and has the advantage of being completely command-line driven: it does not require to have any graphics library installed and I can use it when I work on a remote machine via ssh.

The tool takes a bunch of files and replace a given regular expression everywhere in-place; moreover, it saves a backup copy of the original un-modified files and give the option to recover them when I want to. Of course, all of this can be done more efficiently in the Unix world with specialized tools, but those tools are written in C and they are not as easily customizable as a Python script, that you may change in real time to suit your needs. So, it makes sense to write this kind of utilities in Python, and actually many people (including myself) are actively replacing some Unix commands and bash scripts with Python scripts. In real life, I have extended a lot the minimal tool that I describe here, and I continue to tune it as needed. For instance, you can make it to work recursively and/or on remote directories.

As a final note, let me notice that I find optparse to be much more useful in the Windows world than in the Unix/Linux/Mac OS X world. The reason is that the pletora of pretty good command-line tools which are available under Unix are missing in the Windows environment, or do not have a satisfactory equivalent. Therefore, it makes sense to write a personal collection of command-line scripts for your more common task, if you need to work on many platforms and portability is an important requirement. Using Python and optparse, you may write your own scripts once and having them to run on every platform running Python, which means in practice any traditional platform and increasingly more of the non-traditional ones - Python is spreading into the embedded market too, including PDA's, cellular phones, and more.

The Unix philosophy for command-line arguments

In order to understand how optparse works, it is essential to understand the Unix philosophy about command-lines arguments. As Greg Ward puts it:

The purpose of optparse is to make it very easy to provide the most standard, obvious, straightforward, and user-friendly user interface for Unix command-line programs. The optparse philosophy is heavily influenced by the Unix and GNU toolkits ...

So, I think my Windows readers will be best served if I put here a brief summary of the Unix terminology. Old time Unix geeks may safely skip this section. Let me just notice that optparse could be extended to implement other kinds of conventions for optional argument parsing..

Here is optparse/Unix/GNU terminology: the arguments given to a command-line script - i.e. the arguments that Python stores in the list sys.argv[1:] - are classified in three groups: options, option arguments and positional arguments. Options can be distinguished since they are prefixed by a dash or a double dash; options can have arguments or not (there is at most an option argument right after each option); options without arguments are called flags. Positional arguments are what it is left in the command-line after you remove options and option arguments.

In the example of the search/replace tool, I will need two options with an argument - since I want to pass to the script a regular expression and a replacement string - and I will need a flag specifying whether or not a backup of the original files needs to be performed. Finally, I will need a number of positional arguments to store the names of the files on which the search and replace will act.

Consider - for the sake of the example - the following situations: you have a bunch of text files in the current directory containing dates in the European format DD-MM-YYYY, and that you want to convert them in the American format MM-DD-YYYY. If you are sure that all your dates are in the correct format, your can match them with a simple regular expression such as (\d\d)-(\d\d)-(\d\d\d\d) (this regular expression looks for strings composed of three groups of digits separated by dashes, with the first and second group composed by two digits and the last group composed by four digits).

In this particular example it is not so important to make a backup copy of the original files, since to revert to the original format it is enough to run the script again. So the syntax to use would be something like

$> replace.py --nobackup --regx="(\d\d)-(\d\d)-(\d\d\d\d)" \
                         --repl="\2-\1-\3" *.txt

In order to emphasize the portability, I have used a generic $> promtp, meaning that these examples work equally well on both Unix and Windows (of course on Unix I could do the same job with sed or awk, but these tools are not as flexible as a Python script).

The syntax here has the advantage of being quite clear, but the disadvantage of being quite verbose, and it is handier to use abbreviations for the name of the options. For instance, sensible abbreviations can be -x for --regx, -r for --repl and -n for --nobackup; moreover, the = sign can safely be removed. Then the previous command reads

$> replace.py -n -x"(\dd)-(\dd)-(\d\d\d\d)" -r"\2-\1-\3" *.txt

You see here the Unix convention at work: one-letter options (a.k.a. short options) are prefixed with a single dash, whereas long options are prefixed with a double dash. The advantage of the convention is that short options can be composed: for instance

$> replace.py -nx "(\dd)-(\dd)-(\d\d\d\d)" -r "\2-\1-\3" *.txt

means the same as the previous line, i.e. -nx is parsed as -n -x. You can also freely exchange the order of the options, for instance in this way:

$> replace.py -nr "\2-\1-\3" *.txt -x "(\dd)-(\dd)-(\d\d\d\d)"

This command will be parsed exactly as before, i.e. options and option arguments are not positional.

How does it work in practice?

Having stated the requirements, we may start implementing our search and replace tool. The first step, and the most important one, is to write down the documentation string, even if you will have to wait until the last section to understand why the docstring is the most important part of this script ;)

#!/usr/bin/env python
"""
Given a sequence of text files, replaces everywhere
a regular expression x with a replacement string s.

  usage: %prog files [options]
  -x, --regx=REGX: regular expression
  -r, --repl=REPL: replacement string
  -n, --nobackup: do not make backup copies
"""

On Windows the first line in unnecessary, but is good practice to have it in the Unix world.

The next step is to write down a simple search and replace routine:

import re

def replace(regx, repl, files, backup_option=True):
    rx = re.compile(regx)
    for fname in files:
        txt = file(fname, "U").read()
        if backup_option:
            print >> file(fname+".bak", "w"), txt,
        print >> file(fname, "w"), rx.sub(repl, txt),

This replace routine is entirely unsurprising, the only thing you may notice is the usage of the "U" option in the line

txt=file(fname,"U").read()

This is a new feature of Python 2.3. Text files open with the "U" option are read in "Universal" mode: this means that Python takes care for you of the newline pain, i.e. this script will work correctly everywhere, independently by the newline conventions of your operating system. The script works by reading the whole file in memory: this is bad practice, and here I am assuming that you will use this script only on short files that will fit in your memory, otherwise you should "massage" the code a bit. Also, a full fledged script would check if the file exists and can be read, and would do something in the case it is not, but I ask you to forgive me for skipping on these points, since the thing I am really interested in is the optparse module.

So, how does it work? It is quite simple, really. First you need to instantiate an argument line parser from the OptionParser class provided by optparse:

import optparse 
parser = optparse.OptionParser("usage: %prog files [options]")

The string "usage: %prog files [options]" will be used to print a customized usage message, where %prog will be replaced by the name of the script (in this case replace.py`). You may safely omit it and optparse will use a default "usage: %prog [options]" string.

Then, you tell the parser informations about which options it must recognize:

parser.add_option("-x", "--regx",
                help="regular expression")
parser.add_option("-r", "--repl",
                help="replacement string")
parser.add_option("-n", "--nobackup",
                action="store_true",
                help="do not make backup copies")

The help keyword argument is intended to document the intent of the given option; it is also used by optparse in the usage message. The action=store_true keyword argument is used to distinguish flags from options with arguments, it tells optparse to set the flag nobackup to True if -n or --nobackup is given in the command line.

Finally, you tell the parse to do its job and to parse the command line:

option, files = parser.parse_args()

The .parse_args() method returns two values: option, which is an instance of the optparse.Option class, and files, which is a list of positional arguments. The option object has attributes - called destionations in optparse terminology - corresponding to the given options. In our example, option will have the attributes option.regx, option.repl and option.nobackup.

If no options are passed to the command line, all these attributes are initialized to None, otherwise they are initialized to the argument option. In particular flag options are initialized to True if they are given, to None otherwise. So, in our example option.nobackup is True if the flag -n or --nobackup is given. The list files contains the files passed to the command line (assuming you passed the names of accessible text files in your system).

The main logic can be as simple as the following:

if not files:
    print "No files given!"
elif option.regx and option.repl:
    replace(option.regex, option.repl, files, not option.nobackup)
else:
    print "Missing options or unrecognized options."
    print __doc__ # documentation on how to use the script

A nice feature of optparse is that an help option is automatically created, so replace.py -h (or replace.py --help) will work as you may expect:

$> replace.py --help
usage: replace.py files [options]


options:
  -h, --help           show this help message and exit
  -xREGX, --regx=REGX  regular expression
  -rREPL, --repl=REPL  replacement string
  -n, --nobackup       do not make backup copies

You may programmatically print the usage message by invoking parser.print_help().

At this point you may test your script and see that it works as advertised.

How to reduce verbosity and make your life with optparse happier

The approach we followed in the previous example has a disadvantage: it involves a certain amount of verbosity/redundance. Suppose for instance we want to add the ability to restore the original file from the backup copy. Then, we have to change the script in three points: in the docstring, in the add_option list, and in the if .. elif .. else ... statement. At least one of this is redundant. One would be tempted to think that the information in the documentation string is redundant, since it is already magically provided in the help options: however, I will take the opposite view, that the information in the help options is redundant, since it is already contained in the docstring. It is a sacrilege to write a Python script without a docstring, since the docstring should always be available to automatic documentation tools such as pydoc. Since the docstring cannot be removed or shortened, the idea is to extract information from it, in such a way to avoid the boring task of writing by hand the parser.add_option lines. I implemented this idea in a cookbook recipe, by writing an optionparse module which is just a thin wrapper around optparse. The relevant code is here. The relevant functions are optionparse.parse which parses the docstring and optionparse.exit which exits the execution by displaying an usage message. To show how to use them, let me rewrite the search and replace tool (including the new restore option) in this way:

#!/usr/bin/env python
"""
Given a sequence of text files, replaces everywhere
a regular expression x with a replacement string s.

  usage: %prog files [options]
  -x, --regx=REGX: regular expression
  -r, --repl=REPL: replacement string
  -n, --nobackup: do not make backup copies
  -R, --restore: restore the original from the backup
"""
import optionparse, os, shutil, re

def replace(regx, repl, files, backup_option=True):
    rx = re.compile(regx)
    for fname in files:
        # TODO: add a test to see if the file exists and can be read
        txt = file(fname, "U").read()
        if backup_option:
            print >> file(fname+".bak", "w"), txt
        print >> file(fname, "w"), rx.sub(repl,txt)

def restore(files): # restor original files from backup files
    for fname in files:       
        if os.path.exists(fname+".bak"):
            shutil.copyfile(fname+".bak",fname)
        else:
            print "Sorry, there is no backup copy for %s" % fname

if __name__=='__main__':
    option, files = optionparse.parse(__doc__)
    # optionparse.parse parses both the docstring and the command line!
    if not files:
        optionparse.exit()
    elif option.regx and option.repl:
        replace(option.regex, option.repl, files, not option.nobackup)
    elif option.restore:
        restore(files)
    else:
        print "Missing options or unrecognized options."

The code is quite readable. Internally optionparse.parse(__doc__) works by generating an option parser from the docstring, then applying it to the command-line arguments. Working a bit more, one could also devise various tricks to avoid the redundance in the if statement (for instance using a dictionary of functions and dispatching according to the name of the given option). However this simple recipe is good enough to provide a minimal wrapper to optparse. It requires a minimum effort and works well for the most common case. For instance, the paper you are reading now has been written by using optionparse: I used it to write a simple wrapper to docutils - the standard Python tool which converts (restructured) text files to HTML pages. It is also nice to notice that internally docutils itself uses optparse to do its job, so actually this paper has been composed by using optparse twice!

Finally, you should keep in mind that this article only scratch the surface of optparse, which is quite sophisticated. For instance you can specify default values, different destinations, a store_false action and much more, even if often you don't need all this power. Still, it is handy to have the power at your disposal when you need it. The serious user of optparse is strongly encorauged to read the documentation in the standard library, which is pretty good and detailed. I will think that this article has fullfilled its function of "appetizer" to optparse, if it has stimulate the reader to study more.

References