author     michele.simionato <devnull@localhost>   2009-09-07 05:57:55 +0000
committer  michele.simionato <devnull@localhost>   2009-09-07 05:57:55 +0000
commit     a5188019a7201046af584d94f290fd0fc4711932 (patch)
tree       6b42d467e990394b3239c794173c0f352b5023f2 /artima
parent     48c0eede04f8ca49b8343e0c65e9ffb3786df0df (diff)
download   micheles-a5188019a7201046af584d94f290fd0fc4711932.tar.gz
Finally published "Records management in Python/1"
Diffstat (limited to 'artima')

 -rw-r--r--   artima/python/records1.py   165
 1 files changed, 79 insertions, 86 deletions
diff --git a/artima/python/records1.py b/artima/python/records1.py
index 1edc978..e1b15d2 100644
--- a/artima/python/records1.py
+++ b/artima/python/records1.py
@@ -1,7 +1,7 @@
 """
 Everybody has worked with records: by reading CSV files, by
-interacting with a database, by coding in a programmming language,
-etc. Records look like an old, traditional, boring topic where
+interacting with a database, by coding in a programmming language.
+Records look like an old, traditional, boring topic where
 everything has been said already. However this is not the case.
 Actually, there is still plenty to say about records: in this
 three part series I will discuss a few general techniques to read,
@@ -10,58 +10,54 @@ write and process records in modern Python. The first part
 of reading a CSV file with a number of fields which is known only at
 runt time; the second part discusses the problem of interacting with
 a database; the third and last part discusses the problem of rendering
-data in record format.
+records into XML or HTML format.

 Record vs namedtuple
 -----------------------------------------------------

-Let me begin by observing that for many years
-*there was no record type in the Python language,
-nor in the standard library*.
-This omission seems incredible, but it is true: the Python community has
-asked for records in the language from the beginning, but Guido never
-considered that request. The canonical answer was
-"in real life you always need to add
-methods to your data, so just use a custom class". I (as many other
-others) never bought that argument: I have always regarded the
-lack of support for records as a weak point of Python.
-The good news are that the situation has changed: starting from Python 2.6
-records are part of the standard library under the dismissive name of
-*named tuples*. You can use named tuples even in older versions of Python,
-simply by downloading `Raymond Hettinger`_'s recipe_ on the
-Python Cookbook::
+For many years there was no record type in the Python language, nor in
+the standard library. This is hard to believe, but true: the Python
+community has asked for records in the language from the beginning,
+but Guido never considered that request. The canonical answer was "in
+real life you always need to add methods to your data, so just use a
+custom class". The good news is that the situation has finally
+changed: starting from Python 2.6 records are part of the standard
+library under the dismissive name of *named tuples*. You can use named
+tuples even in older versions of Python, simply by downloading
+`Raymond Hettinger`_'s recipe_ on the Python Cookbook::

  $ wget http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/500261/index_txt -O namedtuple.py

 .. _recipe: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/500261
-
 .. _Raymond Hettinger: http://www.pycon.it/pycon2/schedule/speaker/raymond-hettinger
+.. _a Cookbook recipe: http://code.activestate.com/recipes/576555/
+.. _standard library documentation: http://docs.python.org/library/collections.html?highlight=namedtuple#collections.namedtuple

 The existence of named tuples has changed completely the way
 of managing records: nowadays a named tuple has become *the one obvious
-way* to implement mutable records. Mutable records
+way* to implement immutable records. Mutable records
 are much more complex and they are not available in the standard
-library, nor there are plans for their addition to the best of my
-knowledge. However, there are viable alternatives if you need mutable
-records: typically the use case is managing database records and
-that can be done with an Object Relational Mapper. Moreover, there
-are people who think that mutable records are Evil and they should
-not be part of the language. This is the dominant opinion in the
-functional programming community (SML, Haskell and, in a minor
-way, Scheme): in that languages the only way to modify a field is
-to create a new record which is a copyf of the original
-record except for the modified field. In this sense named tuples are
-functional structures and they support
-*functional update* via the ``_replace`` method; I will discuss
-this point in detail in a short while.
+library, nor there are plans for their addition, at least as far as
+I know. There are many viable alternatives if you need mutable
+records: the typical use case is managing database records and
+that can be done with an Object Relational Mapper. There is also
+`a Cookbook recipe`_ for mutable records
+which is a natural extension of the namedtuple
+recipe.
+Notice however that there are people who think that mutable records are
+*Evil*. This is the dominant opinion in the functional programming
+community: in that context the only way to modify a field is to create
+a new record which is a copy of the original record except for the
+modified field. In this sense named tuples are functional structures
+and they support *functional update* via the ``_replace`` method; I
+will discuss this point in detail in a short while.

 .. _ORM: http://en.wikipedia.org/wiki/Object-relational_mapping
 .. _test case: http://en.wikipedia.org/wiki/Test_case

-It is very easy to use named tuples; the docstring in the recipe contains many
-example and it works as a `test case`_ too: I strongly recommend to run
-``namedtuple.py`` by reading its docstring with attention.
+To use named tuples is very easy and you can just look at the
+examples in the `standard library documentation`_.
 Here I will duplicate part of what you can find in there,
 for the benefit of the lazy readers:
@@ -71,7 +67,7 @@ for the benefit of the lazy readers:
 >>> print article1
 Article(title='Records in Python', author="M. Simionato")

-``namedtuple`` is a functions working as a class factory:
+``namedtuple`` is a function working as a class factory:
 it takes in input the name of the class and the names of the fields
 - a sequence of strings or a string of space-or-comma-separated names -
 and it returns a subclass of ``tuple``.
@@ -84,7 +80,7 @@ accessible both per index *and* per name:
 "M. Simionato"

 Therefore, named tuples are much more readable than ordinary tuples:
-you write ``article1.author`` instead of``article1[1]``. Moreover,
+you write ``article1.author`` instead of ``article1[1]``. Moreover,
 the constructor accepts both a positional syntax and a keyword argument
 syntax, so that it is possible to write
@@ -96,21 +92,21 @@ of named tuples.
 You can pass all the arguments as positional
 arguments, all the arguments are keyword arguments and even some
 arguments as positional and some others as keyword arguments:

- >>> title='Records in Python'
+ >>> title = 'Records in Python'
  >>> kw = dict(author="M. Simionato")
  >>> Article(title, **kw)
  Article(title='Records in Python', author="M. Simionato")

 This "magic" has nothing to do with ``namedtuple`` per se: it is the
 standard way argument passing works in Python, even if I bet many
-people do not know that it is possible to mix the arguments, the only
-restriction being putting the keyword arguments *after* the positional
-arguments.
+people do not know that it is possible to mix the arguments. The only
+real restriction is that you must put the keyword arguments *after*
+the positional arguments.

-Another advantage is that named tuples *are* tuples, so you can use
+Another advantage is that named tuples *are* tuples, so that you can use
 them in your legacy code expecting regular tuples, and everything will
 work just fine, including tuple unpacking (i.e. ``title, author =
-article1``) and the ``*`` notation (i.e. ``f(*article1)``).
+article1``), possibly via the ``*`` notation (i.e. ``f(*article1)``).

 An additional feature with respect to traditional tuples, is that named
 tuples support functional update, as I anticipated before:
@@ -128,8 +124,8 @@ you invoke``namedtuple``.
 The readers of my series about Scheme (`The Adventures of a Pythonista
 in Schemeland`_ ) will certainly be reminded of macros. Actually,
 ``exec`` is more powerful than Scheme macros, since macros generate
 code at compilation time whereas ``exec``
-works at runtime. Th means that in order to use macro you must know
-the structure of the record before executing the programa, whereas
+works at runtime. That means that in order to use macro you must know
+the structure of the record before executing the program, whereas
 ``exec`` is able to define the record type during program execution.
 In order to do the same in Scheme you would need to use ``eval``, not
 macro.
@@ -145,9 +141,9 @@ situations (another good usage of ``exec`` is in the doctest_ module).
 Parsing CSV files
 -----------------------------------------------------------

-No more theory: let me be practical now, by showing a concrete
+No more theory: let be practical now, by showing a concrete
 example. Suppose you need to parse a CSV files with N+1 rows and M
-columns separated by commas::
+columns separated by commas, in the following format::

  field1,field2,...,fieldM
  row11,row12,...,row1M
@@ -157,50 +153,47 @@ columns separated by commas::

 The first line (the header) contains the names of the fields. The precise
 format of the CSV file is not know in advance, but only at runtime, when
-the file is read. The file may come from an Excel sheet, or from a database
-dump, or it could be a log file. Suppose one of the fields is a date
+the header is read. The file may come from an Excel sheet, or from a database
+dump, or could be a log file. Suppose one of the fields is a date
 field and that you want to extract the records between two dates, in
 order to perform a statistical analysis and to generate a report
-in different formats (another CSV file, or a HTML table to upload
-to a Web site, or a LaTeX table to be included in a scientific
+in different formats (another CSV file, or a HTML table for upload
+on a Web site, or a LaTeX table to be included in a scientific
 paper, or anything else). I am sure most of you had to do something
-like first at some point in your life.
+like first at some point.

 .. image:: log.gif

-To solve the specific problem is always easy: the difficult thing is to
-provide a general recipe. We would like to avoid to write 100 small
+To solve the specific problem is always easy: the difficult thing is
+to provide a general recipe. We would like to avoid to write 100 small
 scripts, more or less identical, but with a different management of
 the I/O depending on the specific problem. Clearly, different business
 logic will require different scripts, but at least the I/O part should
 be common. For instance, if the originally the CSV file has 5 fields,
-but then after 6 months the specs change and you need to manage a
-file with 6 fields, you don't want to be forced to change the script;
-the same if the names of the fields change. On the other hand, if
-the output must be an HTML table, that table must be able to manage
-a generic number of fields; moreover, it must be easy to change
-the output format (HTML, XML, CSV, ...) with a minimal effort, without
-changing the script, but only some configuration parameter or command
-line switch.
-Finally, and this is the most difficult part, you should not create
-a monster: if you must choose between a super-powerful framework
-able to cope with all possibilities you can imaging (and that approach
-will necessarely fail once you will have to face something you did
-non expect) and a poor man system which is however easily extensible,
-you must have the courage for humility: that means that you will have
-to apply mercyless cuts and to remove all the fancy feature you added
-in order to keep only the essential. In practice, coding in this way
-means that you will have to work two or three times more than the
-time needed to write a monster, to product much less. On the other
-hand, we all know that given a fixed amount of functionality, programmers
-should be payed inversally to the number of lines of code ;)
+but then after 6 months the specs change and you need to manage a file
+with 6 fields, you don't want to change the script; idem if the names
+of the fields change. Moreover, it must be easy to change the output
+format (HTML, XML, CSV, ...) with a minimal effort, without changing
+the script, but only some configuration parameter or command line
+switch. Finally, and this is the most difficult part, one should not
+create a monster. If the choice is between a super-powerful framework
+able to cope with all the possibilities one can imagine and a poor man
+system which is however easily extensible, one must have the courage
+for humility. The problem is that most of the times it is impossible
+to know which features are really needed, so that one must
+implement things which will be removed later. In practice, coding in
+this way you will have to work two or three times more than
+the time needed to write a monster, to produce much less. On the other
+hand, we all know that given a fixed amount of functionality,
+programmers should be payed inversally to the number of lines of code
+;)

 Anyway, stop with the philosophy and let's go back to the problem.
 Here is a possible solution:

 $$tabular_data

-By executing the script you get::
+By executing the script you will get::

  $ python tabular_data.py
  NamedTuple(title="title", author='author')
@@ -211,12 +204,12 @@ By executing the script you get::
 There are many remarkable things to notice.

 1.
-   For of all, I did follow the golden rule of the good programmer, i.e.
+   First of all, I did follow the golden rule of the smart programmer, i.e.
    I did *change the question*: even if the problem asked to read
    a CSV file, I implemented a ``get_table`` generator instead, able
    to process a generic iterable in the form header+data, where the
    header is the schema - the ordered list of the field names - and
-   data are the records. The generator retusn an iterator in the form
+   data are the records. The generator returns an iterator in the form
    header+data, where the data are actually named tuples. The greater
    generality allows a better reuse of code, and more.
@@ -230,13 +223,13 @@ There are many remarkable things to notice.
    work with a database.

 3.
-   Having changed the question from "process a CSV file" to
+   Having changed the question from "process a CSV file" into
    "convert an iterable into a namedtuple sequence" allows me
    to leave the job of reading the CSV files to the right object,
    i.e. to ``csv.reader`` which is part of the standard library and
-   that I can afford not to test (I am in the champ of people thing
-   that you should not test everything, that there are things that
-   must must be tested with high priority and other that should be
+   that I can afford not to test (in my opinion
+   you should not test everything, there are things that
+   must be tested with high priority and others that should be
    tested with low priority or even not tested at all).

 4.
@@ -263,10 +256,10 @@ There are many remarkable things to notice.
 7.
    For reasons of technical convenience I have introduced the function
    ``headtail``; it is worth mentioning that in Python 3.0 *tuple unpacking*
-   will be extended so that it will be possible to write directly
-   ``head, *tail = iterable`` instead of ``head, tail = headtail(iterable)``,
-   therefore ``headtail`` will not be needed anymore (functional programmers
-   will recognize the technique of pattern matching of cons).
+   has been extended so that it will be possible to write directly
+   ``head, *tail = iterable`` instead of ``head, tail = headtail(iterable)``
+   - functional programmers
+   will recognize the technique of pattern matching of conses.

 8.
    ``get_table`` allows to alias the names of the field, as shown by
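
Below the diff, a minimal self-contained sketch (not part of the commit) of the named tuple behaviour the article describes: access by name and by index, mixed positional and keyword construction, tuple unpacking, and functional update via ``_replace``::

 >>> # a sketch mirroring the Article example in the text, not code from the commit
 >>> from collections import namedtuple   # Python 2.6+; older versions can use Hettinger's recipe
 >>> Article = namedtuple('Article', 'title author')         # a class factory call
 >>> article1 = Article('Records in Python', author='M. Simionato')  # positional + keyword
 >>> article1.author                      # access by name
 'M. Simionato'
 >>> article1[1]                          # access by index
 'M. Simionato'
 >>> title, author = article1             # tuple unpacking still works
 >>> article1._replace(title='Records in Python/2')   # functional update: a *new* tuple
 Article(title='Records in Python/2', author='M. Simionato')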
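
The remark that ``exec`` builds the record type at runtime can be made concrete with a toy class factory; this is only a sketch of the general technique, not the code that ``namedtuple`` actually generates::

 def make_record(typename, fieldnames):
     # toy sketch of the exec technique, not namedtuple's real source:
     # build the class source as a string, then execute it at runtime
     fields = fieldnames.split()
     source = 'class %s(tuple):\n' % typename
     source += '    def __new__(cls, %s):\n' % ', '.join(fields)
     source += '        return tuple.__new__(cls, (%s,))\n' % ', '.join(fields)
     for i, name in enumerate(fields):
         source += '    %s = property(lambda self: self[%d])\n' % (name, i)
     namespace = {}
     exec source in namespace   # the record type is created only now
     return namespace[typename]

 Book = make_record('Book', 'title author')   # the schema can come from user input
 print Book('Records in Python', 'M. Simionato').author   # -> M. Simionato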
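
The diff refers to the solution only through the ``$$tabular_data`` placeholder, so the actual ``get_table`` is not visible here; the following is a guess at its shape, consistent with the output quoted above (a header plus data iterable goes in, named tuples come out, and the CSV reading is delegated to ``csv.reader`` as point 3 explains)::

 # a guessed sketch of get_table, not the article's tabular_data.py
 from collections import namedtuple

 def headtail(iterable):
     "Return the first element and an iterator on the rest (head, *tail in Python 3)"
     it = iter(iterable)
     return it.next(), it

 def get_table(iterable, name='NamedTuple'):
     "Turn a header+data iterable into an iterator of named tuples"
     header, rows = headtail(iterable)
     Row = namedtuple(name, header)
     yield Row(*header)           # the header itself comes out as a named tuple
     for row in rows:
         yield Row(*row)

 data = [['title', 'author'], ['Records in Python', 'M. Simionato']]
 for rec in get_table(data):
     print rec
 # NamedTuple(title='title', author='author')
 # NamedTuple(title='Records in Python', author='M. Simionato')
 # with a real file the iterable would be csv.reader(open('something.csv'))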