artima/python/records3.py



r"""\
In the first two installments of this series
I discussed how to read and process homogeneous records.
In this final installment I will discuss non-homogeneous records and
we will devise a small framework to convert text records into CSV, HTML,
XML or other formats. *En passant*, I will discuss various object-oriented
techniques and patterns.

.. figure:: http://www.phyast.pitt.edu/~micheles/python/patchwork1.jpg
 :width: 300

 Fig 1: object-oriented design

A micro-framework to convert records into text
-------------------------------------------------------------------

It is well known that I am not a framework lover, and there are certainly
many Python programmers sharing this attitude, starting from Guido.
Actually, my dislike of frameworks is proportional to their
size: I hate the mega-frameworks, I tolerate the medium-sized frameworks
and I rather like the micro-frameworks. In this installment I will define
a micro-framework to render non-homogeneous records into text. The
framework is based on the `template pattern`_: in order to define a
renderer class, the programmer inherits from a mother class
``RecordRendererABC`` and fills in the rendering methods; the framework
will then call them automatically, but without too much magic.

This approach is acceptable only when the base class is simple: it is much
less acceptable when you start from an already deep hierarchy. For me a
hierarchy is deep if there are more than two levels: if looking at
mother and children is not enough, and I am forced to look at
the grandparent classes too, the framework is already too complex.

Inheritance-based frameworks have a tendency to get out of control,
because it becomes natural to extend the hierarchy too much. In
traditional object-oriented languages it is quite natural to use
inheritance, but as I said elsewhere one should always keep in mind
that alternatives are always possible (a notable new language *without*
inheritance is Go).

Anyway, one should not fight the language she is using: in Python the
`template pattern`_ is a perfectly reasonable approach.

.. figure:: http://www.phyast.pitt.edu/~micheles/python/patchwork2.jpg

 Fig 2: the *template pattern*

Converting a non-homogeneous
record with *N* fields into text requires in general *N+1*
functions: *N* functions to convert the fields and one function to convert
the full record. It is natural to group the needed functions as methods
of a renderer class: the *N* field-associated rendering functions will
be methods converting values into strings, whereas the *(N+1)*-th function
will be a ``.render`` method converting the record of strings so obtained
into a single string. We will use a base class called ``RecordRendererABC``,
where the ABC suffix means *Abstract Base Class*.

I should point out that an Abstract Base Class in Python can provide
concrete methods to its subclasses and therefore the meaning of ABC
in Python is different than in C++/Java: a Python ABC is a mixin class,
which can provide implementation; it is not necessarily pure interface.
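
As a minimal sketch of this point (Python 3 syntax; ``ConverterABC`` and
``UpperConverter`` are hypothetical names, not part of the framework), an
abstract base class can mix one abstract method with one concrete method
that subclasses inherit unchanged:

```python
from abc import ABC, abstractmethod

class ConverterABC(ABC):
    # hypothetical base class, not part of the framework

    @abstractmethod
    def convert(self, value):
        # abstract: every concrete subclass must override this
        raise NotImplementedError

    def convert_all(self, values):
        # concrete: provided by the ABC and inherited as-is
        return [self.convert(v) for v in values]

class UpperConverter(ConverterABC):
    def convert(self, value):
        return str(value).upper()

converted = UpperConverter().convert_all(['a', 'b'])

try:
    ConverterABC()  # a class with abstract methods cannot be instantiated
except TypeError:
    abstract_blocked = True
else:
    abstract_blocked = False
```

The subclass only writes ``convert``; ``convert_all`` comes from the base
class, which is exactly what a pure interface could not offer.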

For instance, suppose we want to convert an Article record 

 ``Article = namedtuple("Article", "title author pubdate")``

into CSV format.

How do we proceed? First of all we define a suitable subclass of
``RecordRendererABC``:

$$CSVArticleRenderer

Notice that ``CSVArticleRenderer`` defines a ``.schema`` class
attribute, a namedtuple containing the names of the rendering methods.

In this example both title and author are converted by using the ``.str``
method, inherited from the base class, whereas the publication date
is converted by using the ``.isodate`` method, which is defined
directly in the ``CSVArticleRenderer`` class.
The ``.render`` method is inherited: it converts each field of the input
namedtuple into a string with the corresponding method and joins the
results, using a comma as separator.
Here is an example:

  >>> a = Article("test title", "test author", datetime(2008, 5, 15))
  >>> r = CSVArticleRenderer(a)

The ``.render`` method works as expected:

  >>> print r.render()
  test title,test author,2008-05-15

By default the separator (the ``delimiter`` class attribute) is the empty
string ``''``; subclasses can override it, which is useful for implementing
different renderers. For instance,
suppose we want to define a renderer converting the articles into HTML
format. Suppose we defined three CSS classes ``title``, ``author`` and
``pubdate`` to visualize the different fields in different ways, for
instance with different colors. We could define a renderer using the
CSS classes as follows:

$$HTMLArticleRenderer

Here is how the renderer works:

  >>> r = HTMLArticleRenderer(a)
  >>> print r.render()
  <span class="title">test title</span>
  <span class="author">test author</span>
  <span class="pubdate">2008-05-15</span>

Design notes
--------------------------------------------------------------------

Having discussed the usage of the framework, it is now time to
discuss the implementation of the base class and the reasons for the
design choices I made.
Here is the source code for ``RecordRendererABC``:

$$RecordRendererABC

Let me start from the constructor. The ``__init__`` method accepts
a single argument, a sequence with length equal to the length of the schema.
The input sequence *is not required to be a namedtuple*: there is no type check
such as ``isinstance(input, self.schema.__class__)``.
A type check here would be a design mistake, since it would restrict without
reason the field of applicability of the renderer and it would force
the users to perform type conversions without need. The only requirement for
the ``input`` object is that ``zip(self.schema, input)`` must not raise
an exception: in other words, it is enough that ``input`` has the right
length.

Actually ``zip(self.schema, input)`` would not raise an error even if
``input`` had a different length. This is potentially dangerous.
For instance, imagine that for some reason (say a programmer error)
we passed a sequence of length zero: then ``zip(self.schema, input)`` 
would silently return an empty list. Since *errors should never pass silently*,
I decided to add a check on the length: in this way if there is an error
we see it immediately, at instantiation time, and not too late,
when we start iterating on the renderer. It is always better to discover
errors early.
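
The difference can be sketched as follows (Python 3 syntax; the
``check_length`` helper is hypothetical and only mirrors the check
performed in ``__init__``):

```python
schema = ('str', 'str', 'isodate')  # three converter names
bad_input = []                      # e.g. the result of a programmer error

# zip stops at the shortest argument, so the mismatch goes unnoticed
pairs = list(zip(schema, bad_input))

def check_length(input, schema):
    # hypothetical helper: fail immediately if the lengths differ
    li, ls = len(input), len(schema)
    if li != ls:
        raise TypeError('%r has %d fields, expected %d' % (input, li, ls))

try:
    check_length(bad_input, schema)
except TypeError as exc:
    message = str(exc)
else:
    message = None
```

Here ``pairs`` is an empty list, with no hint that something went wrong,
whereas the explicit check reports the problem right away.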

On the other hand, it is best not to overdo the checks. For instance,
if ``.input`` is a list, it is theoretically possible for an evil programmer
to modify the list *after* instantiation, by adding or removing elements.
Then ``zip(self.schema, input)`` could behave in an unexpected way.
However, there is no way to protect against evil (or just bad) programmers.
Even if we replaced ``.input`` with a tuple, which is immutable, its
size could always be changed, simply by reassigning the ``.input``
attribute after instantiation.

The Python way is to limit the checks to the ones dictated by common
sense, intended to catch accidental errors which are likely to happen:
for the rest, the attitude is to trust the user. Checks motivated by
paranoia and lack of trust in the user are not to be introduced, since
in a dynamic language the user can do whatever she wants anyway.
The attitude is borrowed from the `spirit of C`_ (*trust the programmer*).
According to this maxim I decided not to add additional checks.

In special cases (for instance if you are implementing a subclass of
``RecordRendererABC`` which requires ``.input`` to be a record)
it may be sensible to introduce some additional check. For instance
you may want to ensure that ``.input`` is a record with the right
fields. However, even in this case it is best not to introduce a
type check like
``isinstance(input, self.schema.__class__)``; you can instead
use a lighter check like ``input._fields == self.schema._fields``:
in this way any object with the right fields would be accepted,
not just a namedtuple. The basic idea is to follow the
`duck typing`_ principle: don't be too picky and
accept as good anything with the needed
attributes.
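
A possible sketch of such a lighter check (Python 3 syntax; the ``Row``
class and the ``has_right_fields`` helper are hypothetical, introduced only
for illustration):

```python
from collections import namedtuple

Article = namedtuple('Article', 'title author pubdate')
schema = Article('str', 'str', 'isodate')

class Row(object):
    # not a namedtuple, but it exposes the same _fields and is a sized iterable
    _fields = ('title', 'author', 'pubdate')

    def __init__(self, *values):
        self.values = values

    def __iter__(self):
        return iter(self.values)

    def __len__(self):
        return len(self.values)

def has_right_fields(input, schema):
    # hypothetical helper: compare the field names, not the types
    return getattr(input, '_fields', None) == schema._fields

row = Row('test title', 'test author', '2008-05-15')
accepted = has_right_fields(row, schema)
plain_tuple_accepted = has_right_fields(('a', 'b', 'c'), schema)
```

``row`` passes the check even though ``isinstance(row, Article)`` is false,
while a plain tuple, which has no ``_fields``, is rejected.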

Following this logic you may want to enlarge even more the field of
acceptable objects: for instance a dictionary-like object
with the right keys could act as a substitute for a record.
We could implement such a feature by adding an ``if`` in the
``__init__`` method, introducing a special case when the input object
is a dictionary. But that would be bad programming: the point of
object-oriented programming is to avoid ``if`` statements and to replace
them with methods.
In our example, we should remember that Python provides a *classmethod*
construct, whose *raison d'etre* is exactly to manage this use case:
it allows the programmer to define alternate constructors, without
the need for complicating the ``__init__`` method. Using alternate
constructors is called the `factory method pattern`_ and it is one of
the basic techniques of OOP. The advantages are clear, especially
in terms of simplicity and ease of maintenance, but also from the
point of view of code reuse and extensibility.

.. figure:: http://www.phyast.pitt.edu/~micheles/python/patchwork3.jpg
 
 Fig 3: the *factory method pattern*

In our example dictionaries are rendered through the ``.frommap``
classmethod:

    >>> r = CSVArticleRenderer.frommap(dict(
    ...     title="test title", author="test author", 
    ...     pubdate=datetime(2008, 5, 15)))

There is also a ``.fromobj`` classmethod accepting in input any
object with a set of attributes which is a superset (proper or
improper) of the schema's attributes. This is pure *duck typing*.
If the object lacks an attribute, we will get an ``AttributeError``
at instantiation time, an absolutely clear and telling error message;
on the other hand, if the object has enough attributes, it will be
automatically converted into a namedtuple.
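
The two alternate constructors can be sketched in isolation as follows
(Python 3 syntax; ``Renderer`` is a stripped-down stand-in for
``RecordRendererABC`` and ``Post`` is a hypothetical, unrelated class):

```python
from collections import namedtuple
from datetime import datetime

Article = namedtuple('Article', 'title author pubdate')

class Renderer(object):
    # stripped-down stand-in for RecordRendererABC
    schema = Article('str', 'str', 'isodate')

    def __init__(self, input):
        self.input = input

    @classmethod
    def frommap(cls, kw):
        # alternate constructor: build the record from a mapping
        return cls(type(cls.schema)(**kw))

    @classmethod
    def fromobj(cls, obj):
        # alternate constructor: pick the needed attributes from any object
        Schema = type(cls.schema)
        return cls(Schema._make(getattr(obj, f) for f in Schema._fields))

class Post(object):
    # hypothetical class with a superset of the schema's attributes
    def __init__(self):
        self.title, self.author = 'test title', 'test author'
        self.pubdate = datetime(2008, 5, 15)
        self.tags = ['python']  # extra attribute, simply ignored

r1 = Renderer.fromobj(Post())
r2 = Renderer.frommap(dict(title='t', author='a', pubdate=None))

try:
    Renderer.fromobj(object())  # lacks the required attributes
except AttributeError:
    missing_reported = True
else:
    missing_reported = False
```

Neither constructor adds a branch to ``__init__``: each one normalizes its
argument into a namedtuple and then delegates to the ordinary constructor.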

The base class also defines the special methods ``__iter__`` and ``__len__``:
therefore each renderer instance is a sequence of fixed length and can be
passed in input to another renderer. In other words, renderers are
composable in the functional sense.

Renderers are actually homogeneous records with fields which are strings
and can be passed to the ``HtmlTable`` object defined in the previous
installment. It is trivial to convert a renderer into a list of strings:
thanks to the ``__iter__`` method, ``list(renderer)`` works as expected
(the same goes for ``tuple(renderer)`` and ``len(renderer)``). ``list``,
``tuple`` and ``len`` are actually builtin generic functions which play
well with *duck typing* and can be defined for any custom object.
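
To illustrate composability, here is a minimal sketch (Python 3 syntax;
``MiniRenderer`` is a hypothetical reduced version of the base class, with
an ``upper`` converter invented for the example) where one renderer is fed
to another:

```python
class MiniRenderer(object):
    # hypothetical reduced renderer: schema is a plain tuple of method names
    schema = ('upper', 'str')
    delimiter = ','

    def __init__(self, input):
        self.input = input

    def __iter__(self):
        for name, value in zip(self.schema, self.input):
            yield getattr(self, name)(value)

    def __len__(self):
        return len(self.schema)

    def str(self, value):
        return str(value)

    def upper(self, value):
        return str(value).upper()

    def render(self):
        return self.delimiter.join(self)

inner = MiniRenderer(['abc', 42])
as_list = list(inner)        # works thanks to __iter__
size = len(inner)            # works thanks to __len__
outer = MiniRenderer(inner)  # a renderer used as input of another renderer
rendered = outer.render()
```

Since ``inner`` yields strings and has the right length, ``outer`` treats
it like any other input sequence.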

It was good to discuss what was implemented in ``RecordRendererABC``;
it is also interesting to discuss what was *not* implemented.
In particular, I did not implement the renderers as namedtuples.
I wanted to avoid the *blob* antipattern_, where you have a class which
is everything to everybody. I wanted to keep namedtuples simple, without
adding any methods to them: renderers are logically an independent concept,
even if they can be converted into namedtuples, being iterable.

.. figure:: http://www.phyast.pitt.edu/~micheles/python/blob.jpg
 :width: 350

 Fig 4: the *blob antipattern*

I did define ``CSVArticleRenderer`` and ``HTMLArticleRenderer`` as direct
subclasses of ``RecordRendererABC``. An alternate design could have
introduced different abstract intermediate subclasses, depending on the output
format: for instance ``CSVRecordRenderer``, ``HTMLRecordRenderer``,
``XMLRecordRenderer``, etc. However I decided to follow strictly
the rule that *flat is better than nested*, and to keep the hierarchies
as short as possible.
Actually in Python 2.6+ one could define three abstract interfaces
``CSVRecordRenderer``, ``HTMLRecordRenderer`` and ``XMLRecordRenderer``
and one could register her concrete classes with such interfaces: this
can be done without using inheritance and by keeping the hierarchy flat.
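
A sketch of the registration mechanism (Python 3 syntax; both classes below
are simplified stand-ins, and the empty ``CSVRecordRenderer`` interface is
an assumption made only for illustration):

```python
from abc import ABC

class CSVRecordRenderer(ABC):
    # hypothetical abstract interface: it only tags CSV renderers
    pass

class CSVArticleRenderer(object):
    # stand-in for the concrete renderer; it does NOT inherit
    # from CSVRecordRenderer
    delimiter = ','

# register the concrete class as a "virtual subclass" of the interface
CSVRecordRenderer.register(CSVArticleRenderer)

is_subclass = issubclass(CSVArticleRenderer, CSVRecordRenderer)
is_instance = isinstance(CSVArticleRenderer(), CSVRecordRenderer)
# the interface does not appear among the real ancestors of the class
hierarchy_is_flat = CSVRecordRenderer not in CSVArticleRenderer.__mro__
```

The ``issubclass`` and ``isinstance`` checks succeed, yet the method
resolution order of the concrete class is untouched, so the real hierarchy
stays flat.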

.. _spirit of C: http://www.artima.com/cppsource/spiritofc.html
.. _duck typing: http://en.wikipedia.org/wiki/Duck_typing
.. _factory method pattern: http://en.wikipedia.org/wiki/Factory_method_pattern
.. _antipattern: http://en.wikipedia.org/wiki/Antipattern
.. _template pattern: http://en.wikipedia.org/wiki/Template_pattern
"""
import os, cgi
from datetime import datetime
from tabular_data import headtail
from collections import namedtuple

# in Python 2.6 use abstractmethod, abstractproperty instead
class notimplemented(object):
    "Descriptor raising a meaningful error message for nonoverridden attributes"
    def __init__(self, message):
        self.message = message
    def __get__(self, obj, objcls=None):
        raise NotImplementedError(self.message)

class RecordRendererABC(object):
    schema = () # a namedtuple specifying the names of the converters
    delimiter = ''
    
    @classmethod
    def frommap(cls, kw):
        return cls(cls.schema.__class__(**kw))

    @classmethod
    def fromobj(cls, obj):
        Schema = cls.schema.__class__
        nt = Schema._make(getattr(obj, field) for field in Schema._fields)
        return cls(nt)

    def __init__(self, input):
        li, ls = len(input), len(self)
        if li != ls:
            raise TypeError('%s has %d fields, expected %d' % (input, li, ls))
        self.input = input
        
    def __iter__(self):
        for convertername, value in zip(self.schema, self.input):
            yield getattr(self, convertername)(value)
    
    def __len__(self):
        return len(self.schema)

    def str(self, value):
        return str(value)

    def render(self):
        return self.delimiter.join(self)
    
Article = namedtuple("Article", "title author pubdate")

class CSVArticleRenderer(RecordRendererABC):
    schema = Article("str", "str", "isodate")
    delimiter = ','
    def isodate(self, date):
        return date.isoformat()[:10]

class HTMLArticleRenderer(RecordRendererABC):
    schema = Article(title='title', author='author', pubdate="pubdate")
    delimiter = '\n'
    def title(self, title):
        return '<span class="title">%s</span>' % cgi.escape(title)
    def author(self, author):
        return '<span class="author">%s</span>' % cgi.escape(author)
    def pubdate(self, date):
        return '<span class="pubdate">%s</span>' % date.isoformat()[:10]
    
# todo: escape the field values, e.g. with xml.sax.saxutils.escape
def to_xml(ntuple):
    name = ntuple.__class__.__name__
    xml = ['<%s>' % name]
    for i, field in enumerate(ntuple._fields):
        xml.append("<%s>%s</%s>" % (field, ntuple[i], field))
    xml.append('</%s>' % name)
    return os.linesep.join(xml)

if __name__ == '__main__':
    import doctest; doctest.testmod()
    HTMLArticleRenderer.fromobj(Article("a",'b', datetime.today()))