| Commit message (Collapse) | Author | Age | Files | Lines |
| |
The 0x20 character should also be escaped as per the SPARQL reference,
and it correctly is when setting a TrackerResource IRI. However, the
fast path check for the presence of characters that should be escaped
is missing it, so IRIs whose only invalid character is a space would
be let through as valid.
Since 0x20 (whitespace) is possibly the most ubiquitous character that
should be escaped, this is a bit of an oversight.
Fixes: 33031007c ("libtracker-sparql: Escape illegal characters in IRIREF...")
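A minimal sketch of what such a fast-path check might look like, in plain C. The function name is hypothetical and the forbidden set here is a simplified reading of the IRIREF production (control characters and space, plus the `<>"{}|^`\` punctuation); it is not Tracker's actual implementation. Note the `c <= 0x20` comparison, which covers the space character the original fast path missed:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: returns true if the IRI contains a character
 * that the SPARQL IRIREF production forbids and that must be escaped.
 * 0x20 (space) is covered by the <= 0x20 check, which was the case
 * missing from the original fast path. */
static bool
iri_needs_escaping (const char *iri)
{
  for (const char *p = iri; *p; p++)
    {
      unsigned char c = (unsigned char) *p;

      /* Control characters and space are all forbidden */
      if (c <= 0x20)
        return true;

      if (strchr ("<>\"{}|^`\\", c) != NULL)
        return true;
    }

  return false;
}
```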
|
| |
|
| |
|
| |
|
|\
| |
| |
| |
| | |
libtracker-sparql: Escape illegal characters in IRIREF from TrackerResource
See merge request GNOME/tracker!536
|
| |
| | |
Currently, all IRIREFs going through SPARQL updates are validated to check that
their characters are in the expected set (https://www.w3.org/TR/sparql11-query/#rIRIREF),
while TrackerResource is pretty liberal about the characters used in a
TrackerResource identifier or IRI reference.
This disagreement has two possible outcomes:
- If a resource containing illegal characters is inserted via print_sparql_update(),
print_rdf() or similar, errors will surface when handling the SPARQL update.
- If the resource is inserted directly via TrackerBatch or update_resource(), the
validation step will be bypassed, ending up with an IRI that contains illegal
characters as per the SPARQL grammar.
In order to make TrackerResource friendly to e.g. sloppy IRI composition and avoid
these ugly situations when an illegal character sneaks in, make it escape IRIs as
defined by IRIREF in the SPARQL grammar. This way, every method of insertion
will succeed and produce the most correct output for the given input.
Also, add tests for this behavior, to ensure we escape what should be escaped.
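As an illustration, the escaping step could look roughly like the following sketch, which replaces characters outside a simplified IRIREF set with %XX escapes. The function name and the exact character set are assumptions for illustration, not the actual TrackerResource code:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the escaping step: characters outside the
 * IRIREF set are replaced with %XX escapes, so any insertion path
 * (print_sparql_update(), TrackerBatch, update_resource()) receives
 * a grammatically valid IRI. Returns a newly allocated string. */
static char *
escape_iriref (const char *iri)
{
  size_t len = strlen (iri);
  /* Worst case: every character expands to 3 bytes */
  char *out = malloc (len * 3 + 1);
  char *dst = out;

  for (const char *p = iri; *p; p++)
    {
      unsigned char c = (unsigned char) *p;

      if (c <= 0x20 || strchr ("<>\"{}|^`\\", c) != NULL)
        dst += sprintf (dst, "%%%02X", c);
      else
        *dst++ = *p;
    }

  *dst = '\0';
  return out;
}
```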
|
| | |
|
| | |
|
|/ |
|
|\
| |
| |
| |
| | |
Fix build/compiler warnings
See merge request GNOME/tracker!534
|
| |
| |
| |
| |
| |
| | |
Even though the TrackerResource helper functions don't use this type internally,
it may be set directly through tracker_resource_set_gvalue(). Handle this
additional type, since it's used in some Tracker Miners extractors.
|
| |
| |
| |
| |
| | |
For some reason this check fails, but the dependency is also unused in
this project tree. We can stop requiring its presence.
|
| |
| |
| |
| |
| | |
This generic method is available since meson 0.51 (which we already require),
while pkg.get_pkgconfig_variable is deprecated in 0.56.
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Separate ontology parsing so that we can provide distinct locations for
.ontology and .description files, avoiding parsing the base ontology twice
just for the sake of parsing the description files.
This avoids redefinition warnings from the ontology docgen tool while
generating the docs for the base ontology.
|
| |
| |
| |
| | |
Fixes a compiler warning.
|
| | |
|
| |
| |
| |
| |
| | |
We are using TrackerSparqlCursor API here while passing a TrackerDBCursor.
We should cast this subclass to the correct parent type.
|
| |
| |
| |
| |
| |
| | |
There is new API scheduled for 3.2.0 to pause/unpause messages for deferred
processing. Follow these API updates and handle them, without bumping the
dependency to that newer version yet.
|
| |
| |
| |
| |
| |
| |
| | |
Or the right way. We still use tracker_namespace_manager_get_default()
in some places to preserve backwards-compatible behavior, so these
deprecation warnings should be silenced with
G_GNUC_BEGIN/END_IGNORE_DEPRECATIONS.
|
|/ |
|
|\
| |
| |
| |
| | |
Improve performance of database updates
See merge request GNOME/tracker!532
|
| |
| |
| |
| |
| |
| |
| |
| | |
Sometimes, if the other end closes prematurely, the cursor gets cancelled
and ends up producing an "Interrupted" error. Make that error more consistent
by using G_IO_ERROR_CANCELLED, and avoid issuing a warning in those situations.
Fixes some sporadic warnings seen in the serialize test.
|
| |
| | |
In the slow paths of deleting a class, we must recursively delete
all subclasses prior to the deletion of this class. We figure out
the existing subclasses of a class for a given resource through a
query, but we have this information right there.
Use the types array, and recursively handle direct subclasses of
the class being handled for deletion, so that things cascade properly
and it mostly consists of a couple of array lookups as opposed to
a database query.
This should make these slow paths faster.
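The cascading idea can be sketched like this; the `Class` struct, its fields, and the `delete_class_cascade` name are hypothetical stand-ins for Tracker's internal types, illustrating only that the recursion walks an in-memory array rather than issuing a query per level:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: cascade deletion over an in-memory class
 * hierarchy instead of querying the database for subclasses. */
typedef struct Class Class;

struct Class
{
  const char  *name;
  Class      **subclasses;  /* NULL-terminated array, or NULL */
  bool         deleted;
};

static void
delete_class_cascade (Class *klass)
{
  /* Recursively delete direct subclasses first, so the whole
   * cascade consists of array lookups, not database queries */
  if (klass->subclasses != NULL)
    {
      for (size_t i = 0; klass->subclasses[i] != NULL; i++)
        delete_class_cascade (klass->subclasses[i]);
    }

  klass->deleted = true;
}
```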
|
| |
| |
| |
| |
| | |
The graph management update statements are rarely run, so we can avoid
caching them.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When ensuring that a resource IRI is known in the database, we optimize
for newly added resources and try to insert them without further checks,
only resorting to a query to fetch the resource ID if that insertion failed.
But we currently detect the failure by receiving a GError and clearing it.
Instead, just check the return value, so that we don't create/free errors
for every resource where that assumption fails.
Also, we failed to cache the resource ID when the second, querying step was
taken; fix that to get another nice speed improvement.
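A rough sketch of the insert-first strategy, with a toy in-memory table standing in for the database and the GError path replaced by a plain boolean return; all names here are illustrative, not Tracker's:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy in-memory stand-in for the resource table */
#define MAX_RES 16
static const char *table[MAX_RES];
static int n_rows;

/* Insert succeeds only if the IRI is not yet present */
static bool
db_try_insert_resource (const char *iri, long long *id)
{
  for (int i = 0; i < n_rows; i++)
    if (strcmp (table[i], iri) == 0)
      return false;

  table[n_rows] = iri;
  *id = ++n_rows;
  return true;
}

static long long
db_query_resource_id (const char *iri)
{
  for (int i = 0; i < n_rows; i++)
    if (strcmp (table[i], iri) == 0)
      return i + 1;
  return 0;
}

/* Hypothetical sketch: try the insert first and check the boolean
 * return value, instead of allocating and clearing a GError; fall
 * back to a query only when the resource already existed. The ID
 * would be cached on BOTH paths (caching omitted in this sketch). */
static long long
ensure_resource_id (const char *iri)
{
  long long id;

  if (!db_try_insert_resource (iri, &id))
    id = db_query_resource_id (iri);

  return id;
}
```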
|
| |
| |
| |
| |
| | |
Instead of having to pass a GError and inspect it to check for
errors.
|
| |
| |
| |
| |
| | |
We initialize nrl:modified once, there is no need to cache this statement
for anything later.
|
| |
| |
| |
| |
| |
| | |
This is quite a hot path during updates of already existing resources,
so it makes sense to keep the statement around instead of resorting to
cache lookups.
|
| |
| |
| |
| |
| |
| |
| | |
This is only used in the update machinery, so move it there and keep
the TrackerDBStatement around. This is a very hot path during updates
of already existing resources, so it makes sense to avoid the DB
interface internal caches and SQL query strings for this.
|
| |
| |
| |
| |
| |
| | |
Instead of relying on the internal TrackerDBInterface cache, use a
distinct one in the update machinery, so TrackerProperty objects
can be looked up directly without creating a SQL string.
|
| |
| |
| |
| |
| | |
This is only necessary for properties with rdfs:Resource range,
so only perform this operation for those.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently the code is exceedingly thorough and issues individual deletes
for every property related to the class being deleted. We can take some
shortcuts here and avoid querying existing values in most cases; only 2
cases remain where it is necessary to fetch the previous content:
- Properties that have TRACKER_PROPERTY_TYPE_RESOURCE, in order to correctly
change the refcounting of the resource that the property points to.
- Properties that are domain indexes in other classes, since we have to
chain up to those tables with the right value being deleted.
All other situations can do without fetching the previous values for the
property, and for single-valued properties the deletes can even be delegated
to the deletion of the row in the table representing the TrackerClass.
The result is a speedup when deleting entire resources.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Right now, FTS updates iterate the set of previously looked up
properties in order to find the FTS ones and issue the FTS update.
We want to optimize the fetching of old property values, and FTS is
orthogonal to that, so decouple the two machineries. As a replacement,
keep a list of FTS properties that were modified, so that they are
updated when flushing.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Refcount changes are very frequent, and each of them requires a
hashtable lookup+update (the refcount is the hashtable value, stored
with GINT_TO_POINTER), while each replacement also makes a new copy of
the TrackerRowid used as the hashtable key. Overall, this maintenance
is somewhat expensive to perform.
Since the buffers for triple updates are small (64 items), a hash table
does not bring much benefit in lookups or updates. Switching to a
simple unordered array for this accounting looks faster in profiling.
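The array-based accounting could look roughly like this; struct names and the fixed 64-entry size follow the commit message, but the layout is a hypothetical sketch, not Tracker's code:

```c
#include <assert.h>
#include <stddef.h>

typedef long long TrackerRowid;

typedef struct
{
  TrackerRowid id;
  int          refcount;
} RefcountEntry;

/* Hypothetical sketch: with small update buffers (64 items), a
 * linear scan over an unordered array beats a hashtable, and avoids
 * copying the rowid key on every refcount update. */
typedef struct
{
  RefcountEntry entries[64];
  size_t        n_entries;
} RefcountBuffer;

static void
refcount_buffer_add (RefcountBuffer *buf, TrackerRowid id, int delta)
{
  for (size_t i = 0; i < buf->n_entries; i++)
    {
      if (buf->entries[i].id == id)
        {
          buf->entries[i].refcount += delta;
          return;
        }
    }

  /* Not found: append a new entry, no key copy needed */
  buf->entries[buf->n_entries].id = id;
  buf->entries[buf->n_entries].refcount = delta;
  buf->n_entries++;
}
```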
|
| |
| |
| |
| |
| |
| |
| |
| | |
This small internal helper has the nice added value that it does not require
allocating new memory to iterate across triples (as opposed to
tracker_resource_get_properties and tracker_resource_get_values). Use that
so we can avoid these for the most part (the exception being rdf:type, since
we want it handled before all values).
|
| |
| |
| |
| |
| | |
We can avoid this small piece of busywork until it's needed.
May help with insertion of simple TrackerResources.
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Since the function it is being expanded from can be called
recursively, deeper iterations would pointlessly try to expand
the URI again.
Do it from the toplevel function, so it's done once for the
whole TrackerResource update.
|
| |
| |
| |
| |
| | |
We can now look up properties in either short or long URI form, so
the URI expansion can be avoided here.
|
| |
| |
| |
| |
| | |
In addition to the expanded URIs, make it possible to look up properties
by short URI.
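One way to picture this: register each property under both its expanded and its prefixed form, so either spelling resolves without URI expansion. The toy linear table and the example URIs below are hypothetical stand-ins for the real hashtable and ontology:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: each property is stored under two keys,
 * its long (expanded) URI and its short (prefixed) URI, both
 * pointing at the same property object. */
typedef struct
{
  const char *key;
  const void *property;
} Entry;

static Entry table[32];
static int   n_entries;

static void
insert_property (const char *long_uri, const char *short_uri,
                 const void *property)
{
  table[n_entries].key = long_uri;
  table[n_entries++].property = property;
  table[n_entries].key = short_uri;
  table[n_entries++].property = property;
}

static const void *
lookup_property (const char *uri)
{
  for (int i = 0; i < n_entries; i++)
    if (strcmp (table[i].key, uri) == 0)
      return table[i].property;
  return NULL;
}
```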
|
| |
| |
| |
| |
| |
| | |
We already have the data in memory, so use that instead of querying the
database for these. If the URI does not turn out to refer to a class or
property, the database lookup is still performed.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The improvement is twofold: if the same TrackerResource is referenced by
different elements in a TrackerBatch, we will skip the subsequent additions
altogether. On the other hand, the hashtable is longer-lived and not
created/freed tens of thousands of times per second.
Since the resources might be re-used in different graphs within the
same batch (this happens in tracker-miner-fs-3 for file-related content
graph data), we must be careful not to optimize those away. In that
case we simply blow away the visited resources cache on graph changes
during the processing of a batch.
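The invalidation rule can be sketched as follows; the function name, the fixed-size set, and the string-keyed graph comparison are illustrative assumptions, not Tracker's implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: a long-lived "visited" set skips resources
 * already processed in this batch, and is cleared whenever the
 * target graph changes, since the same resource may legitimately
 * need inserting into several graphs. */
#define MAX_VISITED 64

static const void *visited[MAX_VISITED];
static int         n_visited;
static const char *current_graph;

/* Returns true if the caller should process the resource */
static bool
batch_visit_resource (const char *graph, const void *resource)
{
  if (current_graph == NULL || strcmp (graph, current_graph) != 0)
    {
      /* Graph changed: blow away the visited cache */
      n_visited = 0;
      current_graph = graph;
    }

  for (int i = 0; i < n_visited; i++)
    if (visited[i] == resource)
      return false;  /* already handled in this graph */

  visited[n_visited++] = resource;
  return true;
}
```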
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When preparing update statements, we can possibly end up with multiple
coalesced single-valued property updates for the same property. In that
case we attempt to push only the last update, as it is the one that will
prevail.
But the number of properties grouped for a given graph/resource/class
tuple is not usually high. As this is a very hot path during updates,
using a hashtable to track and avoid repeatedly updating the same
property has an intrinsic cost that shows up in profiles.
Since the number of elements to iterate is small, the hashtable can be
replaced with a GList, which essentially makes these lookups disappear
from profiles.
|
| |
| | |
Currently the code relies on the TrackerDBInterface builtin caching, which
consists of lookups on the SQL string prior to statement compiling/binding.
This is a very late layer of caching, and relies on us creating the full
SQL string (i.e. basically one step away from a sqlite3_stmt) to reuse
a cached statement.
The caching strategy here can be significantly improved: we basically
need a graph|{class|property} tuple to look up a statement, and that is
something the TrackerDataLogEntry structs in the event log already
have. So we may go directly from a log entry to a statement without
intermediate string stages.
And that's precisely what this commit does. There is a new set of
hash/equal functions in order to match these log entries in the
statement MRU, so that lookups are fast, based on pointer hashes and
comparisons.
Since we need to preserve a copy of these TrackerDataLogEntry structs as
keys in the MRU cache, add copy/free functions that handle copying the
necessary data (a copy of the TrackerDataLogEntry, plus a partial copy
of the array containing property changes for this update).
This makes lookups noticeably faster, compared to the old laborious
string building.
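The hash/equal pair over log entries could look roughly like this; the struct layout and field names are illustrative guesses, not the actual TrackerDataLogEntry, and the sketch only shows why no SQL string needs building for a cache lookup:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of hashing a log entry directly, instead of
 * building the full SQL string as a cache key. */
typedef struct
{
  const void *graph;             /* interned, so pointer identity works */
  const void *class_or_property; /* likewise interned */
  int         event_type;
} LogEntry;

static size_t
log_entry_hash (const LogEntry *entry)
{
  /* Cheap pointer-based hash: no string building involved */
  size_t h = (size_t) entry->graph;
  h = h * 31 + (size_t) entry->class_or_property;
  h = h * 31 + (size_t) entry->event_type;
  return h;
}

static int
log_entry_equal (const LogEntry *a, const LogEntry *b)
{
  return a->graph == b->graph &&
         a->class_or_property == b->class_or_property &&
         a->event_type == b->event_type;
}
```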
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This API was so far internal to TrackerDBInterface, used to cache
statements for selects and updates. Since we want to add other caching
layers elsewhere, make this API public and more generic.
It is now possible to define hash/equal/destroy functions, so the cache
is not limited to SQL query strings as keys; this will be useful in
future commits. In the meantime, reimplement the select/update caches
on top of this API.
|
| |
| |
| |
| |
| |
| | |
Since those are very frequently used properties, it makes sense to have
a fast path lookup like we have for rdf:type, so that we don't have to
perform a hashtable lookup every time these properties are used.
|
| |
| | |
Currently, our caching of triples involves a number of nested structures:
- In the buffer there is a struct for each graph
- In the graph struct there is a set of changed resources
- In the resource struct there is a set of modified tables
- In the table struct there is a set of modified properties
- In the property struct there is a list of values
This incurs a maintenance cost that is higher than desired; adding and
removing elements here becomes a fair chunk of the time spent in updates,
since a number of allocations and list/hashtable updates are performed
for batches that deal with a fair amount of different resources
(i.e. most of them).
In order to improve this, use two arrays to buffer this data:
- A "properties" array that keeps individual predicate/object pairs. This
is used to store the values of properties being inserted or deleted, for
both single-valued and multi-valued properties. Each struct is "linked"
with (i.e. references) other elements in the array, so that e.g. class
updates may reference multiple properties/values being updated.
- An "update log" array, containing structs that are an event_type/graph/
subject tuple, plus optionally a link to one of the properties in the
previous array; all other properties are fetched by iterating over the
linked properties. These log entries are valid for class table updates
(i.e. single-valued properties) or multi-valued property tables.
These arrays make allocating the buffer a one-time operation (the buffer
size is fixed, and the arrays are reused during the processing of a
TrackerBatch) and insertions into the log largely O(1), as opposed to a
number of array/hashtable lookups and inserts.
But we still want to coalesce updates to the same class table (e.g. changes
to several single-valued properties in the same table); for that there is
an additional hashtable set that uses these log entries themselves as keys,
with special hash/equal functions, so lookups for prior events modifying
the same TrackerClass are also quite fast.
Overall, this makes the maintenance of this buffer less expensive in the
big picture, even though there are still some remnants of the previous
caching for graphs and resources that play less of a role.
Since this changes the ordering of updates, some tests that rely on implicit
ordering (the DESCRIBE ones) had to be adapted.
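The flattened two-array layout described above can be sketched like this; struct and field names are illustrative, the fixed sizes are arbitrary, and the append handles only a single-property event to keep the sketch short:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the flattened two-array layout replacing
 * the nested graph/resource/table/property structs. */
typedef struct
{
  const void *predicate;
  const void *object;
  int         next;  /* index of the next property in the same update,
                      * or -1: entries link to each other instead of
                      * living in per-table lists */
} PropertyEntry;

typedef struct
{
  int         event_type;      /* insert or delete */
  const void *graph;
  long long   subject;
  int         first_property;  /* head of the linked property chain */
} UpdateLogEntry;

typedef struct
{
  PropertyEntry  properties[256];  /* fixed size, reused per batch */
  size_t         n_properties;
  UpdateLogEntry log[256];
  size_t         n_log;
} UpdateBuffer;

/* Appending is O(1): bump two counters and link the property entry;
 * no per-resource or per-table allocations are involved. */
static void
update_buffer_append (UpdateBuffer *buf, int event_type,
                      const void *graph, long long subject,
                      const void *predicate, const void *object)
{
  int prop_idx = (int) buf->n_properties++;

  buf->properties[prop_idx].predicate = predicate;
  buf->properties[prop_idx].object = object;
  buf->properties[prop_idx].next = -1;

  buf->log[buf->n_log].event_type = event_type;
  buf->log[buf->n_log].graph = graph;
  buf->log[buf->n_log].subject = subject;
  buf->log[buf->n_log].first_property = prop_idx;
  buf->n_log++;
}
```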
|
| |
| |
| |
| |
| |
| | |
Instead of passing a gint pointer to the binding functions so the
parameter counter is increased there, increase the counter in the
calling code.
|
| |
| |
| |
| |
| |
| | |
The SQL update construction of both is mixed, which hinders readability.
Untangle these so the generated INSERT/UPDATE can be followed more
easily.
|
| |
| |
| |
| |
| |
| |
| |
| | |
Since graphs don't change often, we can preserve the cached information for
these, most importantly the prepared statements to update refcounts.
We now instead clear the contents of the TrackerDataUpdateBufferGraph,
preserving the things we want to preserve.
|