docs/design/parser-architecture.txt



Libcroco parser architecture
-----------------------------

Author: Dodji Seketeli <dodji@seketeli.org>

$Id$

I) Forethoughts.
===================

Libcroco's parser is a simple recursive descent parser.
The major design focus has been simplicity, reliability and
conformance.

Simplicity
-----------
We want the code to be maintainable by anyone who knows the css spec
and who knows how to code in C. Therefore, we avoid overusing
C preprocessor magic and all the tricks that tend to turn C into
a maintenance nightmare.

We also try to adhere to the gnome coding guidelines specified
at http://developer.gnome.org/doc/guides/programming-guidelines.


Reliability
-----------
No function of the libcroco library should ever crash,
whatever arguments it is given.
As a consequence, we tend to be paranoid when it comes to checking
pointer values before dereferencing them, for example.
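
As an illustration of this defensive style, here is a minimal sketch.
The names demo_status, demo_buffer and demo_buffer_peek() are hypothetical
and are not part of libcroco; the only point is that every pointer is
checked before it is dereferenced, and that a bad argument produces an
error code instead of a crash.

```c
#include <stddef.h>

/* Hypothetical status codes, for illustration only. */
enum demo_status {
    DEMO_OK = 0,
    DEMO_BAD_PARAM_ERROR,
    DEMO_END_OF_INPUT_ERROR
};

/* Hypothetical in-memory buffer with a read position. */
struct demo_buffer {
    const unsigned char *bytes;
    size_t len;
    size_t pos;
};

/*
 * Copy the byte at the current position into *a_byte without advancing.
 * Every pointer is checked before it is dereferenced, so a NULL argument
 * yields an error code rather than a crash.
 */
enum demo_status
demo_buffer_peek (const struct demo_buffer *a_this, unsigned char *a_byte)
{
    if (a_this == NULL || a_byte == NULL || a_this->bytes == NULL)
        return DEMO_BAD_PARAM_ERROR;

    if (a_this->pos >= a_this->len)
        return DEMO_END_OF_INPUT_ERROR;

    *a_byte = a_this->bytes[a_this->pos];
    return DEMO_OK;
}
```

Calling demo_buffer_peek (NULL, &c) in this sketch returns
DEMO_BAD_PARAM_ERROR rather than dereferencing the NULL pointer.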

Conformance
-----------
We try to stick to the css spec. We know this is almost impossible to achieve
given the resources we have, but we think it is a sane target to chase.

II) Overall architecture
=========================
The parser is organized around four main classes:

1/ CRInput
2/ CRTknzr (Tokenizer or lexer)
3/ CRParser
4/ CROMParser

II.1 The CRInput class
-----------------------
The CRInput class provides the abstraction of
a UTF-8 encoded character stream.

Ideally, it should abstract local data sources
(local files and in-memory buffers)
and remote data sources (sockets, URL-identified resources), but at the
moment it abstracts local data sources only.

Adding a new type of data source should be transparent to the
classes that already use CRInput. After all, that is what abstraction is about :)
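
To make the idea concrete, here is a minimal sketch of such an abstraction.
It is not the CRInput API: src_source, src_mem_state, src_mem_read_byte()
and src_count_bytes() are hypothetical names, and the sketch reads raw bytes
and ignores UTF-8 decoding. It only shows how consumer code can stay
unchanged when a new kind of data source is plugged in.

```c
#include <stddef.h>

/* Hypothetical byte-source interface; not libcroco's CRInput. */
struct src_source {
    /* Return the next byte (0..255), or -1 at end of input. */
    int (*read_byte) (void *a_state);
    void *state;
};

/* One concrete source: an in-memory buffer. */
struct src_mem_state {
    const unsigned char *bytes;
    size_t len;
    size_t pos;
};

int
src_mem_read_byte (void *a_state)
{
    struct src_mem_state *st = a_state;

    if (st == NULL || st->bytes == NULL || st->pos >= st->len)
        return -1;
    return st->bytes[st->pos++];
}

/* Consumer code sees only struct src_source, never the concrete state. */
size_t
src_count_bytes (struct src_source *a_src)
{
    size_t n = 0;

    if (a_src == NULL || a_src->read_byte == NULL)
        return 0;
    while (a_src->read_byte (a_src->state) != -1)
        n++;
    return n;
}
```

In this sketch, a file or socket source would supply its own read_byte
callback and state; src_count_bytes() would not need to change.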


II.2 The CRTknzr class
----------------------
The main job of the tokenizer (or lexer) is to
provide a get_next_token () method.
This method returns the next css token found in the input stream.
(Note that the input stream here is an instance of CRInput.)

This spares the parser from dealing with individual characters: it
works on whole tokens instead.
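
A toy version of such a method could look like the sketch below. It is
only an illustration: the tok_* names are hypothetical, the input is a
plain NUL-terminated string rather than a CRInput, and the token set is
far smaller than the one defined by the css spec.

```c
#include <ctype.h>
#include <stddef.h>

enum tok_type {
    TOK_EOF,    /* end of input */
    TOK_S,      /* run of whitespace */
    TOK_IDENT,  /* simplified: a letter, then letters, digits or '-' */
    TOK_DELIM   /* any other single character */
};

struct tok_token {
    enum tok_type type;
    const char *start;  /* first character of the token in the input */
    size_t len;         /* number of characters in the token */
};

struct tok_tknzr {
    const char *cur;    /* current position in a NUL-terminated input */
};

/* Fill *a_tok with the next token. Returns 0 on success, -1 on bad args. */
int
tok_get_next_token (struct tok_tknzr *a_this, struct tok_token *a_tok)
{
    const char *p;

    if (a_this == NULL || a_this->cur == NULL || a_tok == NULL)
        return -1;

    p = a_this->cur;
    a_tok->start = p;

    if (*p == '\0') {
        a_tok->type = TOK_EOF;
    } else if (isspace ((unsigned char) *p)) {
        while (isspace ((unsigned char) *p))
            p++;
        a_tok->type = TOK_S;
    } else if (isalpha ((unsigned char) *p)) {
        while (isalnum ((unsigned char) *p) || *p == '-')
            p++;
        a_tok->type = TOK_IDENT;
    } else {
        p++;
        a_tok->type = TOK_DELIM;
    }

    a_tok->len = (size_t) (p - a_tok->start);
    a_this->cur = p;
    return 0;
}
```

Called repeatedly on the input "h1 {", this sketch yields an identifier,
a whitespace token, a delimiter and then the end-of-input token.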

II.3 The CRParser class
-------------------------
The core of the parser.

The main job of this class is to provide a cr_parser_parse_stylesheet()
method. During the parsing (the execution of cr_parser_parse_stylesheet()),
the parser sends events to notify the application when it encounters
remarkable css constructions. This is the SAC (Simple API for CSS) model.

To achieve that task, almost every production of the css grammar
has a matching parsing function (or method) in this class.

For example, the following production named "ruleset" (specified in the
css2 spec in appendix D.1):

ruleset : selector [ ',' S* selector ]*
        '{' S* declaration [ ';' S* declaration ]* '}' S*

is "implemented" by the cr_parser_parse_ruleset () method. 

The same thing applies to the "selector" production:

selector : simple_selector [ combinator simple_selector ]*

which is implemented by the cr_parser_parse_selector() method, and so on
and so forth.
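
The sketch below illustrates this "one production, one function" layout
on a heavily simplified grammar. It is not libcroco code: the sel_* names
are hypothetical, the parser reads a NUL-terminated string directly
instead of going through a tokenizer, and simple_selector is reduced to
a bare element name.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical parser state: a cursor into a NUL-terminated string. */
struct sel_parser {
    const char *cur;
};

static void
sel_skip_spaces (struct sel_parser *a_this)
{
    while (isspace ((unsigned char) *a_this->cur))
        a_this->cur++;
}

/* simple_selector : name S*   (heavily simplified) */
static int
sel_parse_simple_selector (struct sel_parser *a_this)
{
    if (!isalpha ((unsigned char) *a_this->cur))
        return 0;
    while (isalnum ((unsigned char) *a_this->cur))
        a_this->cur++;
    sel_skip_spaces (a_this);
    return 1;
}

/* combinator : '+' S* | '>' S* | empty */
static void
sel_parse_combinator (struct sel_parser *a_this)
{
    if (*a_this->cur == '+' || *a_this->cur == '>') {
        a_this->cur++;
        sel_skip_spaces (a_this);
    }
}

/*
 * selector : simple_selector [ combinator simple_selector ]*
 * Returns the number of simple selectors matched, 0 if none.
 */
int
sel_parse_selector (struct sel_parser *a_this)
{
    int count;

    if (a_this == NULL || a_this->cur == NULL)
        return 0;
    if (!sel_parse_simple_selector (a_this))
        return 0;
    count = 1;

    for (;;) {
        const char *saved = a_this->cur;

        sel_parse_combinator (a_this);
        if (!sel_parse_simple_selector (a_this)) {
            a_this->cur = saved;  /* undo a dangling combinator */
            break;
        }
        count++;
    }
    return count;
}
```

Each function mirrors the shape of its production: the repetition
[ ... ]* becomes a loop, and an alternative that may be empty becomes a
function that is allowed to consume nothing.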

II.3.1 Structure of a parsing method.
-------------------------------------
A parsing method (e.g. cr_parser_parse_ruleset()) is there
to:

             * Try to recognize a substring of the incoming character stream
             as something that matches a given css grammar production.

             e.g.: the job of cr_parser_parse_ruleset() is to try
             to recognize whether what comes next in the input stream
             is a css2 "ruleset".

             * Build a basic abstract data structure to
             store the information encountered
             during the parsing of the current character string.

             e.g.: cr_parser_parse_declaration() has the following prototype:

             enum CRStatus
             cr_parser_parse_declaration (CRParser *a_this, GString **a_property,
                                          CRTerm **a_value) ;

             In case of successful parsing, this method returns
             (via its parameters) the property _and_ the
             value of the css2 declaration.
             Note that a css2 declaration is specified as follows:

             declaration : property ':' S* expr prio?
                          | /* empty */

             * After completion, say whether the parsing has succeeded or not.

             e.g.: cr_parser_parse_declaration() returns CR_OK if the
             parsing has succeeded, and an error code otherwise. Obviously,
             the a_property and a_value out parameters are valid if and only
             if the function return value is CR_OK.

             * If the parsing failed, leave the position in the stream unchanged.
             That is, the position in the character stream should be as if
             the parsing function had not been called at all.
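
The four points above can be sketched as follows. This is an
illustration under simplifying assumptions, not the libcroco
implementation: the decl_* names are hypothetical, the production is
reduced to "word ':' S* word", the input is a NUL-terminated string, and
a plain slice of the input stands in for GString and CRTerm.

```c
#include <ctype.h>
#include <stddef.h>

enum decl_status { DECL_OK = 0, DECL_BAD_PARAM_ERROR, DECL_PARSING_ERROR };

/* Hypothetical parser state: a cursor into a NUL-terminated string. */
struct decl_parser {
    const char *cur;
};

/* A slice of the input; stands in for GString/CRTerm in this sketch. */
struct decl_slice {
    const char *start;
    size_t len;
};

/* Match a run of letters or '-' at *a_cur; on success advance *a_cur. */
static int
decl_match_word (const char **a_cur, struct decl_slice *a_out)
{
    const char *p = *a_cur;

    while (isalpha ((unsigned char) *p) || *p == '-')
        p++;
    if (p == *a_cur)
        return 0;
    a_out->start = *a_cur;
    a_out->len = (size_t) (p - *a_cur);
    *a_cur = p;
    return 1;
}

/*
 * Simplified production:  declaration : word ':' S* word
 * The parser position is committed only when the whole production matched,
 * so a failed call leaves a_this->cur untouched.
 */
enum decl_status
decl_parse_declaration (struct decl_parser *a_this,
                        struct decl_slice *a_property,
                        struct decl_slice *a_value)
{
    const char *p;
    struct decl_slice prop, val;

    if (a_this == NULL || a_this->cur == NULL
        || a_property == NULL || a_value == NULL)
        return DECL_BAD_PARAM_ERROR;

    p = a_this->cur;  /* work on a local copy of the position */

    if (!decl_match_word (&p, &prop))
        return DECL_PARSING_ERROR;
    if (*p != ':')
        return DECL_PARSING_ERROR;
    p++;
    while (isspace ((unsigned char) *p))
        p++;
    if (!decl_match_word (&p, &val))
        return DECL_PARSING_ERROR;

    /* Success: publish the results and commit the new position. */
    *a_property = prop;
    *a_value = val;
    a_this->cur = p;
    return DECL_OK;
}
```

Working on a local copy of the position and committing it only on
success is one simple way to honour the "position unchanged on failure"
rule; saving the position on entry and restoring it on every error path
would be an equivalent alternative.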