author    michele.simionato <devnull@localhost>  2009-09-07 05:57:55 +0000
committer michele.simionato <devnull@localhost>  2009-09-07 05:57:55 +0000
commit    a5188019a7201046af584d94f290fd0fc4711932 (patch)
tree      6b42d467e990394b3239c794173c0f352b5023f2 /artima
parent    48c0eede04f8ca49b8343e0c65e9ffb3786df0df (diff)
download  micheles-a5188019a7201046af584d94f290fd0fc4711932.tar.gz
Finally published "Records management in Python/1"
Diffstat (limited to 'artima')
-rw-r--r--  artima/python/records1.py  165
1 file changed, 79 insertions, 86 deletions
diff --git a/artima/python/records1.py b/artima/python/records1.py
index 1edc978..e1b15d2 100644
--- a/artima/python/records1.py
+++ b/artima/python/records1.py
@@ -1,7 +1,7 @@
"""
Everybody has worked with records: by reading CSV files, by
-interacting with a database, by coding in a programmming language,
-etc. Records look like an old, traditional, boring topic where
+interacting with a database, by coding in a programming language.
+Records look like an old, traditional, boring topic where
everything has been said already. However this is not the
case. Actually, there is still plenty to say about records: in this
three part series I will discuss a few general techniques to read,
@@ -10,58 +10,54 @@ write and process records in modern Python. The first part
of reading a CSV file with a number of fields which is known only at
run time; the second part discusses the problem of interacting with a
database; the third and last part discusses the problem of rendering
-data in record format.
+records into XML or HTML format.
Record vs namedtuple
-----------------------------------------------------
-Let me begin by observing that for many years
-*there was no record type in the Python language,
-nor in the standard library*.
-This omission seems incredible, but it is true: the Python community has
-asked for records in the language from the beginning, but Guido never
-considered that request. The canonical answer was
-"in real life you always need to add
-methods to your data, so just use a custom class". I (as many other
-others) never bought that argument: I have always regarded the
-lack of support for records as a weak point of Python.
-The good news are that the situation has changed: starting from Python 2.6
-records are part of the standard library under the dismissive name of
-*named tuples*. You can use named tuples even in older versions of Python,
-simply by downloading `Raymond Hettinger`_'s recipe_ on the
-Python Cookbook::
+For many years there was no record type in the Python language, nor in
+the standard library. This is hard to believe, but true: the Python
+community has asked for records in the language from the beginning,
+but Guido never considered that request. The canonical answer was "in
+real life you always need to add methods to your data, so just use a
+custom class". The good news is that the situation has finally
+changed: starting from Python 2.6 records are part of the standard
+library under the dismissive name of *named tuples*. You can use named
+tuples even in older versions of Python, simply by downloading
+`Raymond Hettinger`_'s recipe_ from the Python Cookbook::
$ wget http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/500261/index_txt
-O namedtuple.py
.. _recipe: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/500261
-
.. _Raymond Hettinger: http://www.pycon.it/pycon2/schedule/speaker/raymond-hettinger
+.. _a Cookbook recipe: http://code.activestate.com/recipes/576555/
+.. _standard library documentation: http://docs.python.org/library/collections.html?highlight=namedtuple#collections.namedtuple
The existence of named tuples has completely changed the way of
managing records: nowadays a named tuple has become *the one obvious
-way* to implement mutable records. Mutable records
+way* to implement immutable records. Mutable records
are much more complex and they are not available in the standard
-library, nor there are plans for their addition to the best of my
-knowledge. However, there are viable alternatives if you need mutable
-records: typically the use case is managing database records and
-that can be done with an Object Relational Mapper. Moreover, there
-are people who think that mutable records are Evil and they should
-not be part of the language. This is the dominant opinion in the
-functional programming community (SML, Haskell and, in a minor
-way, Scheme): in that languages the only way to modify a field is
-to create a new record which is a copyf of the original
-record except for the modified field. In this sense named tuples are
-functional structures and they support
-*functional update* via the ``_replace`` method; I will discuss
-this point in detail in a short while.
+library, nor are there plans for their addition, at least as far as
+I know. There are many viable alternatives if you need mutable
+records: the typical use case is managing database records and
+that can be done with an Object Relational Mapper. There is also
+`a Cookbook recipe`_ for mutable records
+which is a natural extension of the namedtuple
+recipe.
+Notice, however, that there are people who think that mutable records are
+*Evil*. This is the dominant opinion in the functional programming
+community: in that context the only way to modify a field is to create
+a new record which is a copy of the original record except for the
+modified field. In this sense named tuples are functional structures
+and they support *functional update* via the ``_replace`` method; I
+will discuss this point in detail in a short while.
.. _ORM: http://en.wikipedia.org/wiki/Object-relational_mapping
.. _test case: http://en.wikipedia.org/wiki/Test_case
-It is very easy to use named tuples; the docstring in the recipe contains many
-example and it works as a `test case`_ too: I strongly recommend to run
-``namedtuple.py`` by reading its docstring with attention.
+Using named tuples is very easy: you can just look at the
+examples in the `standard library documentation`_.
Here I will duplicate part of what you can find there,
for the benefit of lazy readers:
@@ -71,7 +67,7 @@ for the benefit of the lazy readers:
>>> print article1
Article(title='Records in Python', author='M. Simionato')
-``namedtuple`` is a functions working as a class factory:
+``namedtuple`` is a function working as a class factory:
it takes as input the name of the class and the names of the fields
- a sequence of strings or a string of space-or-comma-separated names -
and it returns a subclass of ``tuple``.
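For instance, here is a minimal sketch (the ``Article`` class and its
field names are the ones used in the surrounding examples)::

    >>> from collections import namedtuple
    >>> Article = namedtuple('Article', 'title author')
    >>> article1 = Article(title='Records in Python', author='M. Simionato')
    >>> article1
    Article(title='Records in Python', author='M. Simionato')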
@@ -84,7 +80,7 @@ accessible both per index *and* per name:
"M. Simionato"
Therefore, named tuples are much more readable than ordinary tuples:
-you write ``article1.author`` instead of``article1[1]``. Moreover,
+you write ``article1.author`` instead of ``article1[1]``. Moreover,
the constructor accepts both a positional syntax and a keyword
argument syntax, so that it is possible to write
@@ -96,21 +92,21 @@ of named tuples. You can pass all the arguments as positional
arguments, all the arguments as keyword arguments, and even
some arguments as positional and some others as keyword arguments:
- >>> title='Records in Python'
+ >>> title = 'Records in Python'
>>> kw = dict(author="M. Simionato")
>>> Article(title, **kw)
Article(title='Records in Python', author='M. Simionato')
This "magic" has nothing to do with ``namedtuple`` per se: it is the
standard way argument passing works in Python, even if I bet many
-people do not know that it is possible to mix the arguments, the only
-restriction being putting the keyword arguments *after* the positional
-arguments.
+people do not know that it is possible to mix the arguments. The only
+real restriction is that you must put the keyword arguments *after*
+the positional arguments.
-Another advantage is that named tuples *are* tuples, so you can use
+Another advantage is that named tuples *are* tuples, so that you can use
them in your legacy code expecting regular tuples, and everything will
work just fine, including tuple unpacking (i.e. ``title, author =
-article1``) and the ``*`` notation (i.e. ``f(*article1)``).
+article1``), possibly via the ``*`` notation (i.e. ``f(*article1)``).
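Here is a quick sketch (``article1`` is the named tuple defined above;
``f`` is a toy function introduced purely for illustration)::

    >>> title, author = article1     # plain tuple unpacking
    >>> author
    'M. Simionato'
    >>> def f(title, author):
    ...     return '%s by %s' % (title, author)
    >>> f(*article1)                 # the fields become positional arguments
    'Records in Python by M. Simionato'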
An additional feature with respect to traditional tuples is that
named tuples support functional update, as I anticipated before:
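For instance, a minimal sketch reusing the ``article1`` defined above
(note that ``_replace`` returns a *new* tuple and leaves the original
untouched)::

    >>> article1._replace(title='Records in Python/2')
    Article(title='Records in Python/2', author='M. Simionato')
    >>> article1
    Article(title='Records in Python', author='M. Simionato')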
@@ -128,8 +124,8 @@ you invoke ``namedtuple``. The readers of my series about Scheme
(`The Adventures of a Pythonista in Schemeland`_) will certainly be
reminded of macros. Actually, ``exec`` is more powerful than Scheme
macros, since macros generate code at compilation time whereas ``exec``
-works at runtime. Th means that in order to use macro you must know
-the structure of the record before executing the programa, whereas
+works at runtime. That means that in order to use macros you must know
+the structure of the record before executing the program, whereas
``exec`` is able to define the record type during program execution.
In order to do the same in Scheme you would need to use ``eval``, not macros.
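To make the point concrete, here is a small sketch where the field
names are only known during execution (the ``fields`` list is a made-up
stand-in for data read at run time)::

    >>> fields = ['title', 'author']  # imagine these were read from a file header
    >>> Record = namedtuple('Record', fields)
    >>> Record('Records in Python', 'M. Simionato')
    Record(title='Records in Python', author='M. Simionato')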
@@ -145,9 +141,9 @@ situations (another good usage of ``exec`` is in the doctest_ module).
Parsing CSV files
-----------------------------------------------------------
-No more theory: let me be practical now, by showing a concrete
+No more theory: let me be practical now, by showing a concrete
example. Suppose you need to parse a CSV file with N+1 rows and M
-columns separated by commas::
+columns separated by commas, in the following format::
field1,field2,...,fieldM
row11,row12,...,row1M
@@ -157,50 +153,47 @@ columns separated by commas::
The first line (the header) contains the names of the fields. The precise
format of the CSV file is not known in advance, but only at runtime, when
-the file is read. The file may come from an Excel sheet, or from a database
-dump, or it could be a log file. Suppose one of the fields is a date
+the header is read. The file may come from an Excel sheet, or from a database
+dump, or it could be a log file. Suppose one of the fields is a date
field and that you want to extract the records between two dates,
in order to perform a statistical analysis and to generate a report
-in different formats (another CSV file, or a HTML table to upload
-to a Web site, or a LaTeX table to be included in a scientific
+in different formats (another CSV file, or an HTML table to upload
+to a Web site, or a LaTeX table to be included in a scientific
paper, or anything else). I am sure most of you have had to do something
-like first at some point in your life.
+like this at some point.
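For concreteness, here is a hedged sketch of how such a file could be
read with the standard ``csv`` module (``data.csv`` is a made-up file
name)::

    import csv
    rows = csv.reader(open('data.csv', 'rb'))  # an iterator over lists of strings
    header = next(rows)                        # ['field1', 'field2', ..., 'fieldM']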
.. image:: log.gif
-To solve the specific problem is always easy: the difficult thing is to
-provide a general recipe. We would like to avoid to write 100 small
+Solving the specific problem is always easy: the difficult thing is
+to provide a general recipe. We would like to avoid writing 100 small
scripts, more or less identical, but with a different management of
the I/O depending on the specific problem. Clearly, different business
logic will require different scripts, but at least the I/O part should
be common. For instance, if originally the CSV file has 5 fields,
-but then after 6 months the specs change and you need to manage a
-file with 6 fields, you don't want to be forced to change the script;
-the same if the names of the fields change. On the other hand, if
-the output must be an HTML table, that table must be able to manage
-a generic number of fields; moreover, it must be easy to change
-the output format (HTML, XML, CSV, ...) with a minimal effort, without
-changing the script, but only some configuration parameter or command
-line switch.
-Finally, and this is the most difficult part, you should not create
-a monster: if you must choose between a super-powerful framework
-able to cope with all possibilities you can imaging (and that approach
-will necessarely fail once you will have to face something you did
-non expect) and a poor man system which is however easily extensible,
-you must have the courage for humility: that means that you will have
-to apply mercyless cuts and to remove all the fancy feature you added
-in order to keep only the essential. In practice, coding in this way
-means that you will have to work two or three times more than the
-time needed to write a monster, to product much less. On the other
-hand, we all know that given a fixed amount of functionality, programmers
-should be payed inversally to the number of lines of code ;)
+but then after 6 months the specs change and you need to manage a file
+with 6 fields, you don't want to change the script; the same applies if the names
+of the fields change. Moreover, it must be easy to change the output
+format (HTML, XML, CSV, ...) with a minimal effort, without changing
+the script, but only some configuration parameter or command line
+switch. Finally, and this is the most difficult part, one should not
+create a monster. If the choice is between a super-powerful framework
+able to cope with all the possibilities one can imagine and a poor man's
+system which is however easily extensible, one must have the courage
+to be humble. The problem is that most of the time it is impossible
+to know which features are really needed, so that one must
+implement things which will be removed later. In practice, coding in
+this way means working two or three times longer than it would take
+to write a monster, in order to produce much less. On the other
+hand, we all know that given a fixed amount of functionality,
+programmers should be paid in inverse proportion to the number of
+lines of code ;)
+Anyway, enough philosophy: let's go back to the problem.
Here is a possible solution:
$$tabular_data
-By executing the script you get::
+By executing the script you will get::
$ python tabular_data.py
NamedTuple(title='title', author='author')
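The ``$$tabular_data`` script itself is not reproduced here; still, a
minimal sketch of a ``get_table`` generator matching the remarks below
(``headtail`` included, the aliasing machinery omitted) could look
like this::

    from collections import namedtuple

    def headtail(iterable):
        "Split an iterable into its first element and an iterator over the rest"
        it = iter(iterable)
        return next(it), it

    def get_table(iterable):
        "Convert a header+data iterable into a header+named tuples iterator"
        header, data = headtail(iterable)
        Record = namedtuple('NamedTuple', header)
        yield header
        for row in data:
            yield Record(*row)

    # usage sketch: any header+data iterable works, e.g. the output of csv.reader
    import csv
    header, data = headtail(get_table(csv.reader(open('data.csv', 'rb'))))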
@@ -211,12 +204,12 @@ By executing the script you get::
There are many remarkable things to notice.
1.
- For of all, I did follow the golden rule of the good programmer, i.e.
+ First of all, I did follow the golden rule of the smart programmer, i.e.
I did *change the question*: even if the problem asked to read a
CSV file, I implemented a ``get_table`` generator instead, able
to process a generic iterable in the form header+data, where the
header is the schema - the ordered list of the field names - and
- data are the records. The generator retusn an iterator in the form
+ data are the records. The generator returns an iterator in the form
header+data, where the data are actually named tuples. The greater
generality allows better reuse of code, and more.
@@ -230,13 +223,13 @@ There are many remarkable things to notice.
work with a database.
3.
- Having changed the question from "process a CSV file" to
+ Having changed the question from "process a CSV file" to
"convert an iterable into a namedtuple sequence" allows me
to leave the job of reading the CSV files to the right object,
i.e. to ``csv.reader`` which is part of the standard library and
- that I can afford not to test (I am in the champ of people thing
- that you should not test everything, that there are things that
- must must be tested with high priority and other that should be
+ that I can afford not to test (in my opinion
+ you should not test everything; there are things that
+ must be tested with high priority and others that should be
tested with low priority or even not tested at all).
4.
@@ -263,10 +256,10 @@ There are many remarkable things to notice.
7.
For reasons of technical convenience I have introduced the function
``headtail``; it is worth mentioning that in Python 3.0 *tuple unpacking*
- will be extended so that it will be possible to write directly
- ``head, *tail = iterable`` instead of ``head, tail = headtail(iterable)``,
- therefore ``headtail`` will not be needed anymore (functional programmers
- will recognize the technique of pattern matching of cons).
+ has been extended so that it is possible to write directly
+ ``head, *tail = iterable`` instead of ``head, tail = headtail(iterable)``
+ (functional programmers
+ will recognize the technique of pattern matching on conses).
8.
``get_table`` allows you to alias the names of the fields, as shown by