author     michele.simionato <devnull@localhost>   2009-09-07 05:57:55 +0000
committer  michele.simionato <devnull@localhost>   2009-09-07 05:57:55 +0000
commit     a5188019a7201046af584d94f290fd0fc4711932 (patch)
tree       6b42d467e990394b3239c794173c0f352b5023f2 /artima
parent     48c0eede04f8ca49b8343e0c65e9ffb3786df0df (diff)
download   micheles-a5188019a7201046af584d94f290fd0fc4711932.tar.gz
Finally published "Records management in Python/1"
Diffstat (limited to 'artima')

 -rw-r--r--   artima/python/records1.py   165
 1 files changed, 79 insertions, 86 deletions
diff --git a/artima/python/records1.py b/artima/python/records1.py
index 1edc978..e1b15d2 100644
--- a/artima/python/records1.py
+++ b/artima/python/records1.py
@@ -1,7 +1,7 @@
 """
 Everybody has worked with records: by reading CSV files, by
-interacting with a database, by coding in a programmming language,
-etc. Records look like an old, traditional, boring topic where
+interacting with a database, by coding in a programmming language.
+Records look like an old, traditional, boring topic where
 everything has been said already. However this is not the case.
 Actually, there is still plenty to say about records: in this
 three part series I will discuss a few general techniques to read,
@@ -10,58 +10,54 @@ write and process records in modern Python. The first part
 of reading a CSV file with a number of fields which is known only at
 runt time; the second part discusses the problem of interacting with
 a database; the third and last part discusses the problem of rendering
-data in record format.
+records into XML or HTML format.

 Record vs namedtuple
 -----------------------------------------------------

-Let me begin by observing that for many years
-*there was no record type in the Python language,
-nor in the standard library*.
-This omission seems incredible, but it is true: the Python community has
-asked for records in the language from the beginning, but Guido never
-considered that request. The canonical answer was
-"in real life you always need to add
-methods to your data, so just use a custom class". I (as many other
-others) never bought that argument: I have always regarded the
-lack of support for records as a weak point of Python.
-The good news are that the situation has changed: starting from Python 2.6
-records are part of the standard library under the dismissive name of
-*named tuples*. You can use named tuples even in older versions of Python,
-simply by downloading `Raymond Hettinger`_'s recipe_ on the
-Python Cookbook::
+For many years there was no record type in the Python language, nor in
+the standard library. This is hard to believe, but true: the Python
+community has asked for records in the language from the beginning,
+but Guido never considered that request. The canonical answer was "in
+real life you always need to add methods to your data, so just use a
+custom class". The good news is that the situation has finally
+changed: starting from Python 2.6 records are part of the standard
+library under the dismissive name of *named tuples*. You can use named
+tuples even in older versions of Python, simply by downloading
+`Raymond Hettinger`_'s recipe_ on the Python Cookbook::

  $ wget http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/500261/index_txt -O namedtuple.py

 .. _recipe: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/500261
-
 .. _Raymond Hettinger: http://www.pycon.it/pycon2/schedule/speaker/raymond-hettinger
+.. _a Cookbook recipe: http://code.activestate.com/recipes/576555/
+.. _standard library documentation: http://docs.python.org/library/collections.html?highlight=namedtuple#collections.namedtuple

 The existence of named tuples has changed completely the way
 of managing records: nowadays a named tuple has become *the one obvious
-way* to implement mutable records. Mutable records
+way* to implement immutable records. Mutable records
 are much more complex and they are not available in the standard
-library, nor there are plans for their addition to the best of my
-knowledge. However, there are viable alternatives if you need mutable
-records: typically the use case is managing database records and
-that can be done with an Object Relational Mapper. Moreover, there
-are people who think that mutable records are Evil and they should
-not be part of the language. This is the dominant opinion in the
-functional programming community (SML, Haskell and, in a minor
-way, Scheme): in that languages the only way to modify a field is
-to create a new record which is a copyf of the original
-record except for the modified field. In this sense named tuples are
-functional structures and they support
-*functional update* via the ``_replace`` method; I will discuss
-this point in detail in a short while.
+library, nor there are plans for their addition, at least as far as
+I know. There are many viable alternatives if you need mutable
+records: the typical use case is managing database records and
+that can be done with an Object Relational Mapper. There is also
+`a Cookbook recipe`_ for mutable records
+which is a natural extension of the namedtuple
+recipe.
+Notice however that there are people who think that mutable records are
+*Evil*. This is the dominant opinion in the functional programming
+community: in that context the only way to modify a field is to create
+a new record which is a copy of the original record except for the
+modified field. In this sense named tuples are functional structures
+and they support *functional update* via the ``_replace`` method; I
+will discuss this point in detail in a short while.

 .. _ORM: http://en.wikipedia.org/wiki/Object-relational_mapping
 .. _test case: http://en.wikipedia.org/wiki/Test_case

-It is very easy to use named tuples; the docstring in the recipe contains many
-example and it works as a `test case`_ too: I strongly recommend to run
-``namedtuple.py`` by reading its docstring with attention.
+To use named tuples is very easy and you can just look at the
+examples in the `standard library documentation`_.
 Here I will duplicate part of what you can find in there,
 for the benefit of the lazy readers:
@@ -71,7 +67,7 @@ for the benefit of the lazy readers:
 >>> print article1
 Article(title='Records in Python', author="M. Simionato")

-``namedtuple`` is a functions working as a class factory:
+``namedtuple`` is a function working as a class factory:
 it takes in input the name of the class and the names of the fields
 - a sequence of strings or a string of space-or-comma-separated names -
 and it returns a subclass of ``tuple``.
@@ -84,7 +80,7 @@ accessible both per index *and* per name:
 "M. Simionato"

 Therefore, named tuples are much more readable than ordinary tuples:
-you write ``article1.author`` instead of``article1[1]``. Moreover,
+you write ``article1.author`` instead of ``article1[1]``. Moreover,
 the constructor accepts both a positional syntax and a keyword argument
 syntax, so that it is possible to write
@@ -96,21 +92,21 @@ of named tuples.
 You can pass all the arguments as positional
 arguments, all the arguments are keyword arguments and even some
 arguments as positional and some others as keyword arguments:

- >>> title='Records in Python'
+ >>> title = 'Records in Python'
  >>> kw = dict(author="M. Simionato")
  >>> Article(title, **kw)
  Article(title='Records in Python', author="M. Simionato")

 This "magic" has nothing to do with ``namedtuple`` per se: it is the
 standard way argument passing works in Python, even if I bet many
-people do not know that it is possible to mix the arguments, the only
-restriction being putting the keyword arguments *after* the positional
-arguments.
+people do not know that it is possible to mix the arguments. The only
+real restriction is that you must put the keyword arguments *after*
+the positional arguments.

-Another advantage is that named tuples *are* tuples, so you can use
+Another advantage is that named tuples *are* tuples, so that you can use
 them in your legacy code expecting regular tuples, and everything will
 work just fine, including tuple unpacking (i.e. ``title, author =
-article1``) and the ``*`` notation (i.e. ``f(*article1)``).
+article1``), possibly via the ``*`` notation (i.e. ``f(*article1)``).

 An additional feature with respect to traditional tuples, is that named
 tuples support functional update, as I anticipated before:
@@ -128,8 +124,8 @@ you invoke``namedtuple``.
 The readers of my series about Scheme (`The Adventures of a Pythonista
 in Schemeland`_ ) will certainly be reminded of macros. Actually,
 ``exec`` is more powerful than Scheme macros, since macros generate
 code at compilation time whereas ``exec``
-works at runtime. Th means that in order to use macro you must know
-the structure of the record before executing the programa, whereas
+works at runtime. That means that in order to use macro you must know
+the structure of the record before executing the program, whereas
 ``exec`` is able to define the record type during program execution.
 In order to do the same in Scheme you would need to use ``eval``, not
 macro.
@@ -145,9 +141,9 @@ situations (another good usage of ``exec`` is in the doctest_ module).
 Parsing CSV files
 -----------------------------------------------------------

-No more theory: let me be practical now, by showing a concrete
+No more theory: let be practical now, by showing a concrete
 example. Suppose you need to parse a CSV files with N+1 rows and M
-columns separated by commas::
+columns separated by commas, in the following format::

  field1,field2,...,fieldM
  row11,row12,...,row1M
@@ -157,50 +153,47 @@ columns separated by commas::

 The first line (the header) contains the names of the fields. The precise
 format of the CSV file is not know in advance, but only at runtime, when
-the file is read. The file may come from an Excel sheet, or from a database
-dump, or it could be a log file. Suppose one of the fields is a date
+the header is read. The file may come from an Excel sheet, or from a database
+dump, or could be a log file. Suppose one of the fields is a date
 field and that you want to extract the records between two dates, in
 order to perform a statistical analysis and to generate a report
-in different formats (another CSV file, or a HTML table to upload
-to a Web site, or a LaTeX table to be included in a scientific
+in different formats (another CSV file, or a HTML table for upload
+on a Web site, or a LaTeX table to be included in a scientific
 paper, or anything else). I am sure most of you had to do something
-like first at some point in your life.
+like first at some point.

 .. image:: log.gif

-To solve the specific problem is always easy: the difficult thing is to
-provide a general recipe. We would like to avoid to write 100 small
+To solve the specific problem is always easy: the difficult thing is
+to provide a general recipe. We would like to avoid to write 100 small
 scripts, more or less identical, but with a different management of
 the I/O depending on the specific problem. Clearly, different business
 logic will require different scripts, but at least the I/O part should
 be common. For instance, if the originally the CSV file has 5 fields,
-but then after 6 months the specs change and you need to manage a
-file with 6 fields, you don't want to be forced to change the script;
-the same if the names of the fields change. On the other hand, if
-the output must be an HTML table, that table must be able to manage
-a generic number of fields; moreover, it must be easy to change
-the output format (HTML, XML, CSV, ...) with a minimal effort, without
-changing the script, but only some configuration parameter or command
-line switch.
-Finally, and this is the most difficult part, you should not create
-a monster: if you must choose between a super-powerful framework
-able to cope with all possibilities you can imaging (and that approach
-will necessarely fail once you will have to face something you did
-non expect) and a poor man system which is however easily extensible,
-you must have the courage for humility: that means that you will have
-to apply mercyless cuts and to remove all the fancy feature you added
-in order to keep only the essential. In practice, coding in this way
-means that you will have to work two or three times more than the
-time needed to write a monster, to product much less. On the other
-hand, we all know that given a fixed amount of functionality, programmers
-should be payed inversally to the number of lines of code ;)
+but then after 6 months the specs change and you need to manage a file
+with 6 fields, you don't want to change the script; idem if the names
+of the fields change. Moreover, it must be easy to change the output
+format (HTML, XML, CSV, ...) with a minimal effort, without changing
+the script, but only some configuration parameter or command line
+switch. Finally, and this is the most difficult part, one should not
+create a monster. If the choice is between a super-powerful framework
+able to cope with all the possibilities one can imagine and a poor man
+system which is however easily extensible, one must have the courage
+for humility. The problem is that most of the times it is impossible
+to know which features are really needed, so that one must
+implement things which will be removed later. In practice, coding in
+this way you will have to work two or three times more than
+the time needed to write a monster, to produce much less. On the other
+hand, we all know that given a fixed amount of functionality,
+programmers should be payed inversally to the number of lines of code
+;)

 Anyway, stop with the philosophy and let's go back to the problem.
 Here is a possible solution:

 $$tabular_data

-By executing the script you get::
+By executing the script you will get::

  $ python tabular_data.py
  NamedTuple(title="title", author='author')
@@ -211,12 +204,12 @@ By executing the script you get::
 There are many remarkable things to notice.

 1.
-   For of all, I did follow the golden rule of the good programmer, i.e.
+   First of all, I did follow the golden rule of the smart programmer, i.e.
    I did *change the question*: even if the problem asked to read
    a CSV file, I implemented a ``get_table`` generator instead, able
    to process a generic iterable in the form header+data, where the
    header is the schema - the ordered list of the field names - and
-   data are the records. The generator retusn an iterator in the form
+   data are the records. The generator returns an iterator in the form
    header+data, where the data are actually named tuples. The greater
    generality allows a better reuse of code, and more.
@@ -230,13 +223,13 @@ There are many remarkable things to notice.
    work with a database.

 3.
-   Having changed the question from "process a CSV file" to
+   Having changed the question from "process a CSV file" into
    "convert an iterable into a namedtuple sequence" allows me
    to leave the job of reading the CSV files to the right object,
    i.e. to ``csv.reader`` which is part of the standard library and
-   that I can afford not to test (I am in the champ of people thing
-   that you should not test everything, that there are things that
-   must must be tested with high priority and other that should be
+   that I can afford not to test (in my opinion
+   you should not test everything, there are things that
+   must be tested with high priority and others that should be
    tested with low priority or even not tested at all).

 4.
@@ -263,10 +256,10 @@ There are many remarkable things to notice.
 7.
    For reasons of technical convenience I have introduced the function
    ``headtail``; it is worth mentioning that in Python 3.0 *tuple unpacking*
-   will be extended so that it will be possible to write directly
-   ``head, *tail = iterable`` instead of ``head, tail = headtail(iterable)``,
-   therefore ``headtail`` will not be needed anymore (functional programmers
-   will recognize the technique of pattern matching of cons).
+   has been extended so that it will be possible to write directly
+   ``head, *tail = iterable`` instead of ``head, tail = headtail(iterable)``
+   - functional programmers
+   will recognize the technique of pattern matching of conses.

 8.
    ``get_table`` allows to alias the names of the field, as shown by
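
Below the diff, a minimal self-contained sketch (not part of the commit) of the named tuple behaviour the article describes: access by name and by index, mixed positional and keyword construction, tuple unpacking, and functional update via ``_replace``::

 >>> # a sketch mirroring the Article example in the text, not code from the commit
 >>> from collections import namedtuple   # Python 2.6+; older versions can use Hettinger's recipe
 >>> Article = namedtuple('Article', 'title author')         # a class factory call
 >>> article1 = Article('Records in Python', author='M. Simionato')  # positional + keyword
 >>> article1.author                      # access by name
 'M. Simionato'
 >>> article1[1]                          # access by index
 'M. Simionato'
 >>> title, author = article1             # tuple unpacking still works
 >>> article1._replace(title='Records in Python/2')   # functional update: a *new* tuple
 Article(title='Records in Python/2', author='M. Simionato')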
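
The remark that ``exec`` builds the record type at runtime can be made concrete with a toy class factory; this is only a sketch of the general technique, not the code that ``namedtuple`` actually generates::

 def make_record(typename, fieldnames):
     # toy sketch of the exec technique, not namedtuple's real source:
     # build the class source as a string, then execute it at runtime
     fields = fieldnames.split()
     source = 'class %s(tuple):\n' % typename
     source += '    def __new__(cls, %s):\n' % ', '.join(fields)
     source += '        return tuple.__new__(cls, (%s,))\n' % ', '.join(fields)
     for i, name in enumerate(fields):
         source += '    %s = property(lambda self: self[%d])\n' % (name, i)
     namespace = {}
     exec source in namespace   # the record type is created only now
     return namespace[typename]

 Book = make_record('Book', 'title author')   # the schema can come from user input
 print Book('Records in Python', 'M. Simionato').author   # -> M. Simionato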
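
The diff refers to the solution only through the ``$$tabular_data`` placeholder, so the actual ``get_table`` is not visible here; the following is a guess at its shape, consistent with the output quoted above (a header plus data iterable goes in, named tuples come out, and the CSV reading is delegated to ``csv.reader`` as point 3 explains)::

 # a guessed sketch of get_table, not the article's tabular_data.py
 from collections import namedtuple

 def headtail(iterable):
     "Return the first element and an iterator on the rest (head, *tail in Python 3)"
     it = iter(iterable)
     return it.next(), it

 def get_table(iterable, name='NamedTuple'):
     "Turn a header+data iterable into an iterator of named tuples"
     header, rows = headtail(iterable)
     Row = namedtuple(name, header)
     yield Row(*header)           # the header itself comes out as a named tuple
     for row in rows:
         yield Row(*row)

 data = [['title', 'author'], ['Records in Python', 'M. Simionato']]
 for rec in get_table(data):
     print rec
 # NamedTuple(title='title', author='author')
 # NamedTuple(title='Records in Python', author='M. Simionato')
 # with a real file the iterable would be csv.reader(open('something.csv'))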