<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" 
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd">
<chapter id="raptor-parsers">
<title>Parsers in Raptor (syntax to triples)</title>

<section id="raptor-parsers-intro">
<title>Introduction</title>

<para>This section describes the parsers that can be compiled into
Raptor and their features.  The exact set of parsers supported may vary
between builds of Raptor and can be queried at run-time with the
<link linkend="raptor-parsers-enumerate"><function>raptor_parsers_enumerate</function></link>
and
<link linkend="raptor-syntaxes-enumerate"><function>raptor_syntaxes_enumerate</function></link>
functions.</para>

<para>The optional features that may be set on parsers can also
be queried at run-time with the
<link linkend="raptor-features-enumerate"><function>raptor_features_enumerate</function></link>
function.</para>

</section>


<section id="parser-grddl">
<title>GRDDL parser (name <literal>grddl</literal>)</title>
<para>A parser for
<ulink url="http://www.w3.org/TR/2007/PR-grddl-20070716/">Gleaning Resource Descriptions from Dialects of Languages (GRDDL)</ulink>,
W3C Proposed Recommendation of 2007-07-16, which allows reading XHTML
and XML as RDF triples.  Profiles in the document declare
XSLT transforms from the XHTML or XML content into RDF/XML or another
RDF syntax, which can then be parsed.</para>

<para>The GRDDL parser is rather complex and different from the other
parsers in that it retrieves URIs, reads HTML documents (possibly
with errors), transforms the documents with XSLT and turns the result
into a single graph.  The default configuration of the GRDDL parser
also reads microformats (hcard, hcalendar) and follows &lt;link&gt;
tags that point to RDF/XML.  Parts of the GRDDL process can be
altered by configuration, as described below.
</para>

<para>The GRDDL parser defines 'base', 'Base' and 'url' XSLT parameters
with the value of the base URI to allow some XSLT sheets to work. This
set of parameters cannot be disabled.
</para>

<para>If the XSLT transform returns an empty string, no further
processing of the result is done, and a warning is generated.  The
xsl:output method is mapped to result document MIME types as follows:
'text' to text/plain, 'xml' to application/xml and 'html' to text/html.
Any result that is of type 'application/xml' or of an unknown MIME type
is assumed to be RDF/XML.
</para>
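
<para>The mapping above can be sketched in C as follows.  This is an
illustrative model of the rule as stated, not code taken from Raptor:
the function names <function>grddl_output_mime_type</function> and
<function>grddl_result_assumed_rdfxml</function> are hypothetical, and
an unknown MIME type is represented here by a NULL pointer.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map an xsl:output method name to the MIME type
 * of the transform result, following the table in the text.
 * Returns NULL for a missing or unrecognised method (unknown type). */
const char *grddl_output_mime_type(const char *method)
{
  if (method == NULL)
    return NULL;
  if (strcmp(method, "text") == 0)
    return "text/plain";
  if (strcmp(method, "xml") == 0)
    return "application/xml";
  if (strcmp(method, "html") == 0)
    return "text/html";
  return NULL;
}

/* Hypothetical helper: non-zero when a result of this MIME type is
 * assumed to be RDF/XML, that is, when it is application/xml or the
 * type is unknown (NULL in this sketch). */
int grddl_result_assumed_rdfxml(const char *mime_type)
{
  return mime_type == NULL || strcmp(mime_type, "application/xml") == 0;
}
```
]]></programlisting>

<para>With these helpers, a result produced with the 'xml' output method,
or with an unrecognised method, would be handed to the RDF/XML parser,
while 'text' and 'html' results would not be assumed to be RDF/XML.</para>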

<para>The URIs that are processed during GRDDL operations can be checked
and skipped if required using a handler set with the
<link linkend="raptor-parser-set-uri-filter"><function>raptor_parser_set_uri_filter()</function></link>
function.  If the handler returns non-0, the URI is rejected.
This uses
<link linkend="raptor-www-set-uri-filter"><function>raptor_www_set_uri_filter()</function></link>
internally.
</para>
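
<para>A minimal sketch of the decision logic such a handler might
contain is shown below.  It is written against a plain C string so that
it compiles without Raptor; a real handler is expected to receive a
Raptor URI object instead, so the adaptation is left as an assumption.
The function name <function>reject_unless_http</function> and the
policy of allowing only HTTP and HTTPS URIs are hypothetical.</para>

<programlisting><![CDATA[
```c
#include <string.h>

/* Hypothetical filter policy: allow only http:// and https:// URIs.
 * Follows the convention described in the text: a non-0 return value
 * means the URI is rejected, 0 means it may be retrieved. */
int reject_unless_http(void *user_data, const char *uri_string)
{
  (void)user_data; /* no per-filter state is needed in this sketch */

  if (uri_string == NULL)
    return 1; /* nothing to inspect: reject */

  if (strncmp(uri_string, "http://", 7) == 0 ||
      strncmp(uri_string, "https://", 8) == 0)
    return 0; /* accept */

  return 1; /* reject file:, ftp: and any other scheme */
}
```
]]></programlisting>

<para>A filter of this kind could be used to stop GRDDL processing from
reading local files referenced by a remote document.</para>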

<para>If the value of feature
<link linkend="RAPTOR-FEATURE-WWW-TIMEOUT:CAPS"><literal>RAPTOR_FEATURE_WWW_TIMEOUT</literal></link>
is set to a number &gt;0, it is used as the timeout in seconds
for retrieving URIs during GRDDL processing.
This uses
<link linkend="raptor-www-set-connection-timeout"><function>raptor_www_set_connection_timeout()</function></link>
internally.
</para>

<para>The hardcoded support for hcard and hcalendar
microformats can be disabled by setting parser feature
<link linkend="RAPTOR-FEATURE-MICROFORMATS:CAPS"><literal>RAPTOR_FEATURE_MICROFORMATS</literal></link>
to 0
or using
<link linkend="raptor-set-parser-strict"><function>raptor_set_parser_strict()</function></link>
with a value of 1.
</para>

<para>The GRDDL parser by default will try an XML parser on the
content followed by a lax HTML parser.  This can be disabled by
setting parser feature
<link linkend="RAPTOR-FEATURE-HTML-TAG-SOUP:CAPS"><literal>RAPTOR_FEATURE_HTML_TAG_SOUP</literal></link>
to 0
or using 
<link linkend="raptor-set-parser-strict"><function>raptor_set_parser_strict()</function></link>
with a value of 1.
</para>

<para>The GRDDL parser by default will try to look for an HTML
&lt;link&gt; tag that points to RDF/XML.  This can be disabled by
setting parser feature
<link linkend="RAPTOR-FEATURE-HTML-LINK:CAPS"><literal>RAPTOR_FEATURE_HTML_LINK</literal></link>
to 0
or using 
<link linkend="raptor-set-parser-strict"><function>raptor_set_parser_strict()</function></link>
with a value of 1.
</para>
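
<para>The combined effect of strict mode on the three features above
can be modelled as follows.  This is an illustrative model only: the
<literal>grddl_options</literal> structure and both functions are
hypothetical and are not part of Raptor.  What a strict value of 0 does
is not described here, so the sketch leaves the options unchanged in
that case.</para>

<programlisting><![CDATA[
```c
/* Illustrative model only: the fields mirror the three parser features
 * named in the text. */
typedef struct {
  int microformats;  /* models RAPTOR_FEATURE_MICROFORMATS */
  int html_tag_soup; /* models RAPTOR_FEATURE_HTML_TAG_SOUP */
  int html_link;     /* models RAPTOR_FEATURE_HTML_LINK */
} grddl_options;

/* Defaults as described in the text: all three behaviours enabled. */
grddl_options grddl_default_options(void)
{
  grddl_options o = { 1, 1, 1 };
  return o;
}

/* Model of strict mode with a value of 1: all three behaviours are
 * turned off. Any other value leaves the options untouched here. */
void grddl_apply_strict(grddl_options *o, int is_strict)
{
  if (is_strict == 1) {
    o->microformats = 0;
    o->html_tag_soup = 0;
    o->html_link = 0;
  }
}
```
]]></programlisting>

<para>Setting an individual feature to 0 corresponds to clearing a
single field in this model, while strict mode clears all three at
once.</para>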

</section>


<section id="parser-guess">
<title>Guess parser (name <literal>guess</literal>)</title>
<para>
This is a special parser that picks the actual parser to use based
on the content type, the content bytes or the content identifier.  The
content identifier can be either a local file name or a URI.
</para>

<para>If the protocol that delivered the content (such as HTTP)
provided a <emphasis>Content Type</emphasis> (aka MIME Type) then
this will be the primary means for identifying the content.
</para>

<para>The secondary means to identify the content are the bytes of
the content (if available), otherwise the content identifier is used,
which is the least reliable.
</para>
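
<para>The priority order described above can be sketched as follows.
This is a simplification under stated assumptions: the type
<literal>guess_source</literal> and the function
<function>guess_pick_source</function> are hypothetical, and the real
guess parser may weigh these signals together rather than picking
exactly one of them.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>

/* Hypothetical labels for the evidence used to identify the content,
 * listed from most to least reliable as described in the text. */
typedef enum {
  GUESS_BY_CONTENT_TYPE,  /* MIME type supplied by the protocol */
  GUESS_BY_CONTENT_BYTES, /* the bytes of the content itself */
  GUESS_BY_IDENTIFIER,    /* file name or URI: least reliable */
  GUESS_NO_EVIDENCE       /* nothing available to guess from */
} guess_source;

/* Pick the best available evidence: content type first, then content
 * bytes if available, otherwise the content identifier. */
guess_source guess_pick_source(const char *mime_type,
                               const unsigned char *bytes, size_t len,
                               const char *identifier)
{
  if (mime_type != NULL && mime_type[0] != '\0')
    return GUESS_BY_CONTENT_TYPE;
  if (bytes != NULL && len > 0)
    return GUESS_BY_CONTENT_BYTES;
  if (identifier != NULL && identifier[0] != '\0')
    return GUESS_BY_IDENTIFIER;
  return GUESS_NO_EVIDENCE;
}
```
]]></programlisting>

<para>For example, content read from a local file has no protocol-supplied
content type, so in this model it would be identified from its bytes,
and from its file name only when no bytes are available.</para>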

</section>


<section id="parser-ntriples">
<title>N-Triples parser (name <literal>ntriples</literal>)</title>

<para>A parser for the
<ulink url="http://www.w3.org/TR/rdf-testcases/#ntriples">N-Triples</ulink>
syntax as used by the 
<ulink url="http://www.w3.org/2001/sw/RDFCore/">W3C RDF Core working group</ulink>
for the <ulink url="http://www.w3.org/TR/rdf-testcases/">RDF Test Cases</ulink>.
</para>

</section>


<section id="parser-rdfa">
<title>RDFa parser (name <literal>rdfa</literal>)</title>
<para>
A parser for the
<ulink url="http://www.w3.org/TR/2008/CR-rdfa-syntax-20080620/">RDFa</ulink>
syntax, W3C Candidate Recommendation of 20 June 2008, which allows reading XHTML
and XML as RDF triples by interpreting attributes on elements to
describe which ones have RDF semantics.  This is implemented via
<ulink url="http://rdfa.digitalbazaar.com/librdfa/">librdfa</ulink>
linked inside Raptor, written by Manu Sporny of Digital Bazaar,
and licensed with the same license as Raptor.
</para>

<para>
This parser is beta quality and passes all but 4 of the RDFa tests as
of Raptor 1.4.18.
</para>

</section>


<section id="parser-rdfxml">
<title>RDF/XML parser - default (name <literal>rdfxml</literal>)</title>
<para>
A parser for the standard
<ulink url="http://www.w3.org/TR/rdf-syntax-grammar/">RDF/XML syntax</ulink>
as revised by the
<ulink url="http://www.w3.org/2001/sw/RDFCore/">W3C RDF Core working group</ulink>.</para>

<para>This is the default parser in Raptor.</para>

<para>Features of this parser:</para>
<itemizedlist>
<listitem><para>Fully handles the <ulink url="http://www.w3.org/TR/rdf-syntax-grammar/">RDF/XML syntax updates</ulink> for <ulink url="http://www.w3.org/TR/xmlbase/">XML Base</ulink>, <literal>xml:lang</literal>, RDF datatyping and Collections.</para></listitem>

<listitem><para>Handles all RDF vocabularies such as <ulink url="http://www.foaf-project.org/">FOAF</ulink>, <ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink>, <ulink url="http://dublincore.org/">Dublin Core</ulink>, <ulink url="http://www.w3.org/TR/owl-features/">OWL</ulink>, <ulink url="http://usefulinc.com/doap">DOAP</ulink></para></listitem>

<listitem><para>Handles <literal>rdf:resource</literal> / <literal>resource</literal> attributes</para></listitem>

<listitem><para>Uses <ulink url="http://expat.sourceforge.net/">expat</ulink> and/or (GNOME) <ulink url="http://xmlsoft.org/">libxml</ulink> XML parsers as available or required</para></listitem>

</itemizedlist>

</section>


<section id="parser-rss-tag-soup">
<title>RSS Tag Soup parser (name <literal>rss-tag-soup</literal>)</title>

<para>A parser for the multiple XML RSS formats that use elements
such as <literal>channel</literal>, <literal>item</literal>,
<literal>title</literal> and <literal>description</literal>
in different ways.
This includes support for the Atom 1.0 syndication format defined in IETF
<ulink url="http://www.ietf.org/rfc/rfc4287.txt">RFC 4287</ulink>.
</para>

<para>The parser attempts to turn the input into
<ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink>
RDF triples in the RSS 1.0 model of a syndication feed.
This includes triples for RSS Enclosures.
</para>

<para>
True <ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink>, when
it is to be used as a full RDF vocabulary, is best parsed by the
RDF/XML parser (name <literal>rdfxml</literal>).
</para>

</section>


<section id="parser-trig">
<title>TriG parser (name <literal>trig</literal>)</title>

<para>A parser for the
<ulink url="http://www.wiwiss.fu-berlin.de/suhl/bizer/TriG/Spec/">TriG - Turtle with Named Graphs</ulink>
syntax.
</para>

<para>The parser is alpha quality and may not support the entire TriG
specification.</para>

</section>


<section id="parser-turtle">
<title>Turtle Terse RDF Triple Language parser (name <literal>turtle</literal>)</title>

<para>A parser for the
<ulink url="http://www.dajobe.org/2004/01/turtle/">Turtle Terse RDF Triple Language</ulink>
syntax, designed as a useful subset of
<ulink url="http://www.w3.org/DesignIssues/Notation3">Notation 3</ulink>.
</para>

</section>


</chapter>

<!--
Local variables:
mode: sgml
sgml-parent-document: ("raptor-docs.xml" "book" "part")
End:
-->