author | Paul McCullagh <paul.mccullagh@primebase.org> | 2009-03-26 13:18:01 +0100 |
---|---|---|
committer | Paul McCullagh <paul.mccullagh@primebase.org> | 2009-03-26 13:18:01 +0100 |
commit | a61584ad84bc94a869c6dc62eaff08d4f1f00a20 (patch) | |
tree | b25a7c94518c2cfe15b710de38015bee35d6153f /storage/pbxt | |
parent | 032ef1fa0781490606c2cf690420464ed98ace8c (diff) | |
download | mariadb-git-a61584ad84bc94a869c6dc62eaff08d4f1f00a20.tar.gz |
Added PBXT storage engine
Diffstat (limited to 'storage/pbxt')
78 files changed, 59233 insertions, 0 deletions
diff --git a/storage/pbxt/AUTHORS b/storage/pbxt/AUTHORS new file mode 100644 index 00000000000..3c5c3db6db8 --- /dev/null +++ b/storage/pbxt/AUTHORS @@ -0,0 +1,4 @@ +Paul McCullagh +paul.mccullagh@primebase.org +http://www.primebase.org +http://pbxt.blogspot.com diff --git a/storage/pbxt/ChangeLog b/storage/pbxt/ChangeLog new file mode 100644 index 00000000000..d638daa30da --- /dev/null +++ b/storage/pbxt/ChangeLog @@ -0,0 +1,687 @@ +PBXT Release Notes +================== + +------- 1.0.08 RC - Not yet released + +RN232: Merged Drizzle-specific changes into the main tree. + +RN231: Fixed a bug that caused bad performance as the number of threads increased. This occurred when the number of open table handles exceeded 'table_open_cache', and MySQL started closing open table handlers. PBXT was flushing a table when all table handlers were closed. PBXT will now only do this when the FLUSH TABLES statement is used. + +RN230: Improved efficiency of conflict resolution: Implemented a queue for threads waiting for a lock. Threads no longer poll to take a lock. If a temp lock is granted because of an update, then the thread granted the temp lock will also wait for the transaction that did the update to quit. + +RN229: Fixed bug #313391: LOAD DATA ... REPLACE broken. + +RN228: Fixed bug #341115: 'Out of memory' error (a bug in key comparison algorithm). + +RN227: Changed conflict handling to use spin locks and improve efficiency. + +RN226: Fixed bug #340316: Issue with bigint unsigned auto-increment field. + +RN225: Fixed bug #308557: UPDATE fails to match all rows in a transactional scenario. + +RN224: Fixed a deadlock which could occur during table scans. + +RN223: Index scans now use handles to cache buffers instead of making a copy of the index page. The handles are "copy-on-write". + +RN222: Fixed a bug that caused the server to hang on startup if PBXT ran out of record cache while waiting for the sweeper to complete. + +RN221: Fixed an index recovery bug. 
This occurred if the server crashed after operating in low index cache situations. + +RN220: Improved index selectivity estimation: added scanning from the end of index backwards. + +RN219: Fixed a problem: during intersected range scan not all fields were returned by engine to MySQL. + +RN218: Changed the way row locking (used by SELECT FOR UPDATE) works. Previously we locked a group of rows at once (although there were many groups). However, this caused conflicts even when the same rows were not locked. We now lock individual rows. + +RN217: Fixed bug #315564: Rollbacked inserts remain permanently in table. + +RN216: Added lock tracing. In DEBUG mode, each thread has a list of locks (semaphores, mutexes, r/w locks) that it holds. + +RN215: Fixed a bug that caused a crash during restart if an index file was flushed during recovery. + +RN214: Fixed bug #310184: Deadlock when trying to wake up transactions + +RN213: Fixed an index corruption bug on SPARC Solaris. Note this error will occur on any machine that does not use the x86 (little endian) byte order. + +------- 1.0.07 RC - 2008-12-15 + +RN212: Fixed build problems on NetBSD. + +RN211: Fixed build problems on FreeBSD. + +RN210: Fixed build problems on OpenSolaris. + +RN209: Added handling of the foreign_key_checks system flag. + +RN208: xtstat will now automatically reconnect if the connection to the server is lost. + +RN207: Foreign key references are now checked on CREATE TABLE. + +RN206: Fixed a crash if inserting into a table that has an FK that references a column that has no index on it. + +RN205: Added processing of foreign key action SET DEFAULT. + +RN204: Fixed an index recovery problem: unswept index entries were not recovered correctly + +RN203: Fixed foreign key bug: REPLACE fails with 'on delete cascade' + +RN202: Fixes and updates to tests, now all tests pass on Windows and Linux. + +RN201: Fixed ref-counting for mmapped files. 
+ +RN200: Fixed an index recovery problem: unswept index entries were not recovered correctly. + +RN199: Recovery now takes place on plug-in startup. Previously recovery occurred when the first PBXT table was accessed. + +RN198: Fixed a recovery bug that caused index entries to get out of sync with the data file. + +RN197: Improved the efficiency of group commit. + +RN196: Changed checkpointing so that it now works during idle time. Every record, row or index file flush now also contributes to the checkpoint (fuzzy checkpointing). Checkpointing is forced to complete after about 50% of the checkpoint threshold in order to ensure the correct maximum for log reading on recovery. + +RN195: Fixed scheduling bug that caused sweeper to get behind with the cleanup, which caused performance problems in high conflict situations. Foreground threads will now wait if the sweeper gets too far behind. + +RN194: Created the xtstat program which monitors the internal performance of PBXT. Run xtstat --help for more detailed information on the output. + +RN193: Implemented the pbxt.statistics virtual table. The statistics table returns information about the internal activity of the engine. This includes I/O byte counts, cache hit counts and usage, commit count, etc. + +RN192: Due to timing issues in the engine API it could happen that the client received an OK for a committed transaction before the transaction was actually committed. This problem has been fixed. + +RN191: Fixed a bug that caused a hang when conflicts occurred while reading a covering index. + +RN190: Previously the sweeper delayed deletion of transaction structures until all transactions that were running during sweeping have quit. This is now handled by the same code that fixed the bug in RN189. + +RN189: Fixed a bug that could cause a row to go missing due to a visibility issue. + +RN188: Fixed a bug which occurred when using CREATE TABLE ... AVG_ROW_LENGTH=x, and the table contained BLOBs. 
In this case, ALTER TABLE corrupted the table. + +RN187: Windows now stores paths in the location file in UNIX format by converting all '\' characters to '/'. Note that the location file is only cross-platform if the paths are relative (which is the default). + +RN186: Set version number to 1.0.07. + +------- 1.0.06 Beta 2 - 2008-11-06 + +RN185: Disabled support for INSERT DELAYED because of MySQL bug #40505 + +RN184: Implemented info(flag == HA_STATUS_AUTO) engine API call. This call returns the next value that will be assigned as auto-increment value on the table. + +RN183: Turned off streaming on Windows (see XT_STREAMING macro in sources) + +RN182: Switch code base to the latest version of BLOB streaming engine (PBMS): www.blobstreaming.org. + +RN181: Updated pbxt-test-run default parameters (--force is on, --default-storage-engine is pbxt, --base-dir is set according to config) + +RN180: PBXT can now cope with a missing .xti file (the file that contains the table indexes). This file can be regenerated using REPAIR TABLE. + +RN179: On recovery PBXT now creates a file called 'recovery-progress' in the pbxt database. The recovery percentage complete is written to this file as recovery progresses. Note that this file will not be created if no recovery is necessary or if PBXT estimates that it will read less than 10MB to do recovery. + +RN178: Fixed a problem in CHECK TABLE that caused memory corruption for fixed-size records + +RN177: Added "crash debugging". When enabled, crash debugging does the following: + - Create a core dump on Windows if the server crashes. + - Make a backup copy of the datadir directory before recovery if the server crashes. + - Keep at least 5 of the previous transaction logs. +Currently crash debugging is enabled by default. To disable, create a file called 'no-debug' in the pbxt database folder, and restart the server. 
When crash debugging is disabled by default, it can be enabled by creating a file called 'crash-debug' in the pbxt database folder. + +RN176: Fixed a bug: a lock was not released appropriately + +RN175: Fixed some debug assertions + +RN174: Fixed some of test/mysql-test tests + +RN173: Fixed a RENAME TABLE bug, that prevented index files from being properly recreated + +RN172: Added the file ./pbxt/lock-pid. This file is locked while the server is running, and contains the process ID of the server. PBXT will return an error on startup if the file is locked or the process is still running in order to prevent a second server from being started. + +RN171: Implemented the AVG_ROW_LENGTH table attribute. When set, this value determines the size of the fixed length data component of a record. Normally this size is estimated depending on the column definitions. The command CHECK TABLE dumps the current average row length to the log. This can be used to find a suitable value for AVG_ROW_LENGTH. + +RN170: Changed configure so that debug/optimize flags set for building the engine override the flags set for MySQL. If --with-debug is not specified, then the engine will use the flags set when building MySQL. If MySQL was built with --with-debug=full, then DEBUG will be defined for the engine. When building the engine, the following flags can be set: + yes - Debug symbols enabled, no optimization, DEBUG not defined. + full - Debug symbols enabled, no optimization, DEBUG defined. + only - Debug symbols enabled, MySQL flags used, DEBUG not defined. + prof - Profile code enabled, optimization on, DEBUG not defined. + no - No debug symbols, optimization on, DEBUG not defined. + +RN169: Used MySQL root Makefile instead of config.status in order to extract settings (such as CFLAGS and CXXFLAGS) for the PBXT build. + +RN168: Fixed Windows build after merging changes for Drizzle. + +RN167: Fixed "This table requires primary key" error in sql-bench. 
+ +RN166: Fixed threading problems that caused crashes in sql-bench. + +RN165: Added sql-bench to pbxt source tree. + +RN164: Ported PBXT to Drizzle. To compile for Drizzle DRIZZLED must be defined on the command line. The -drz.am and -drz.in files must be used when PBXT is embedded in Drizzle. + +RN163: Added "make test" build step. Running "make test" from the root of pbxt source tree will launch test/mysql-test/pbxt-test-run.pl with appropriate options to execute the pbxt functional test suite. On Windows where +pbxt is statically linked into mysql server binary pbxt testing works by going to test/mysql-test directory and running ./pbxt-test-run.pl with --base-dir argument pointing to a mysql source tree (mysql binaries are taken +from there) and passing the rest of usual arguments (--force --mysqld=--default-storage-engine=pbxt) + +RN162: The 'pbxt' database must now be dropped explicitly. It is automatically created when the first PBXT table is created. After that, the pbxt database can be dropped once all PBXT tables have been dropped. Dropping the pbxt database will also cause all transaction (pbxt/system directory) and data logs (pbxt/data directory) to be deleted. + +RN161: Added pbxt.location system table. This table can only be dropped when all PBXT tables have been deleted. Dropping the system table will cause all transaction (pbxt/system directory) and data logs (pbxt/data directory) to also be deleted. + +RN160: Made changes to run with MySQL 6.0.6. + +RN159: Changes to configure: added --with-plugindir=<path>, which should be used to specify the plugin directory. This means that --libdir should no longer be used. For backwards compatibility configure will still recognize this option if the path ends with 'plugin'. + +Also updated --help, to include all options, and better descriptions of the options. + +The configure options are now as follows: + +--with-mysql=<path> - (Required) It specifies the path to the MySQL source tree. 
The source should already be built. All other options will be taken from the MySQL build by default. +--with-debug=yes/no - (Optional) Specify if the engine should be built with different debug options to the MySQL source tree. +--with-plugindir=<path> - (Optional) Specify an alternative installation directory for the plugin. By default it will be installed in the plugin directory of the MySQL installation. + + +RN158: Added support for core dumps on Windows. This can be enabled by defining XT_COREDUMP. On by default at the moment. If the server crashes a file called PBXTCore00000001.dmp will be created in the data directory. This file can be opened using MS VS. + +RN157: Fixed a compile problem with tv_nsec which is not supported on all platforms. + +RN156: Updated tests to run with MySQL 5.1.28. + +RN155: Errors during cascade update of VARCHAR values with trailing spaces + +RN154: Fixed a bug: impossible to create a foreign key that referenced an ENUM or SET column + +RN153: Fixed a bug that caused the following problems: #1. Foreign keys: crash if update cascade and autocommit=0 #2. Foreign keys: crash if update cascade and multi-level recursion + +RN152: Fixed missing information about foreign keys in I_S.table_constraints and I_S.referential_constraints + +------- 1.0.05 Beta - 2008-08-30 + +RN151: "Quick config": It is now possible to configure the engine by just specifying the mysql source code tree (the --with-mysql option). The --libdir and --with-debug setting will be deduced automatically. + +RN150: Added system variable pbxt_sweeper_priority, 0 = low (default), 1 = normal (same as user threads), 2 = high. The sweeper cleans up deleted records (deleted records also result from an update). If allowed to accumulate, these records can slow searches. Higher priority for the sweeper is recommended on systems with 4 or more cores. + +RN149: Record cleanup is now initiated if a deleted record is found, and the transaction that deleted the record has ended. 
Since waking up the sweeper is an expensive operation, normally the sweeper will run every 1/10th of a second. + +RN148: Fixed a bug which caused transaction starvation (one transaction was constantly locked out) during high conflict updates. This led to cleanup of records not being done, which led to a general slowdown. + +RN147: Fixed a problem with TRUNCATE TABLE: a failed TRUNCATE TABLE could put the engine into an invalid state that later caused a crash + +RN146: Fixed a bug that caused the error: "-49: Record format unknown, either corrupted or upgrade required". + +RN145: Added pbxt_db_offline_log_function system variable, 0 = recycle logs (default), 1 = delete logs (default on Mac OS X), 2 = keep logs. + +------- 1.0.04 Alpha - 2008-08-02 + +RN144: Completed port and testing of Windows version. + +RN143: Fixed a bug which caused the free-er thread to hang. This was a result of an invalid operation ID, which was the result of the checkpointer flushing the table at the same time as a foreground thread. + +RN142: The fast RW/mutex lock can now handle nested calls. This is possible during a sequential scan. + +RN141: The normal behavior in MySQL is that an auto-increment value will be re-issued if you delete the row containing the current maximum auto-increment value and then restart the server. To prevent this you can use ALTER TABLE my_table AUTO_INCREMENT = <current-max-auto-increment> + 1, before deleting the current maximum auto-increment value. + +A new system variable, pbxt_auto_increment_mode, has been added so that this workaround is not necessary. When set to 0 (the default), auto-increment works as described above. When set to 1, the AUTO_INCREMENT value of the table is automatically adjusted to prevent previously issued auto-increment values being returned. + +However, if the server crashes, a gap of up to 100 unique values can result, because the table AUTO_INCREMENT value is incremented in steps of 100. 
+ +RN140: Index statistics are now automatically recalculated when the table row count exceeds 200. + +RN139: Fixed a bug that caused index corruption, error: "int idx_push(index_xt.cc:172) -2: Core B-tree too deep". + +RN138: Handle startup and recovery when an index is corrupted. + +RN137: Fixed a bug in the zero wait R/W lock that caused the lock to fail (the state is extremely volatile, and must be written to memory after increment). + +RN136: Fixed a bug that caused the error "int xt_pwrite_file(filesys_xt.cc:789) errno (14): Bad address". + +RN135: Fixed TRUNCATE TABLE that did not work correctly when the table contained BLOBs stored in the BLOB streaming engine (www.blobstreaming.org). + +RN134: Fixed a bug that caused duplicate rows to be returned from an index scan (using a SELECT FOR UPDATE) if a concurrent update was done. + +RN133: Optimised PBXT for multi-processor scale-up. This mostly involved using different types of locks instead of the standard pthread mutex and reader/writer locks [TODO: 0038]. + +------- 1.0.03 Alpha - 2008-05-30 + +RN132: Fixed bug when using PBXT in conjunction with the BLOB streaming engine (www.blobstreaming.org). Uploaded BLOBs could not be inserted into a table. + +RN131: Fixed wait for background processes on shutdown. Shutdown will wait a maximum of 16 seconds for each process. + +RN130: Fixed calculation of bytes to be read for recovery. + +RN129: Fixed bug in cleanup of unterminated transactions. + +RN128: The writer will now start working when one of the following is true: +- it is time for a checkpoint, +- the log cache is almost full, +- the free'er is waiting for the writer, +- there is no other activity. + +RN127: Fixed checkpoint frequency. Checkpointing is now done correctly after 'pbxt_checkpoint_frequency' bytes. + +RN126: Implemented index consistent write [TODO: 0050]. + +RN125: Implemented memory mapping for row pointer (.xtr) and handle data files (.xtd). + +RN124: Index files now use direct I/O. 
+ +------- 1.0.02 Alpha - 2008-04-25 + +RN123: Fixed compile errors with MySQL 5.1.24. + +------- 1.0.01 Alpha - 2008-03-28 + +RN122: ++++ NOTE: This version is not compatible with older versions of PBXT ++++. + +RN121: Transaction logs are now global so that multi-database statements are now possible. This also makes it possible to work with PBXT temporary tables. + +RN120: Transaction logs pre-allocated and recycled. + +RN119: Transaction log writes on 512 byte boundaries only. + +------- 1.0.00 Alpha - 2008-03-10 + +This version has alpha status because of the large number of changes done for full durability. + +RN118: ++++ NOTE: This version is incompatible with older versions of PBXT ++++. + +RN117: Documentation now available at http://www.primebase.org/documentation. + +RN116: Corrected the plug.in file so that PBXT compiles when dropped into the storage directory in the MySQL source tree. + +RN115: Compiled and tested with MySQL 5.1.23. + +RN114: Increased index block size. Minimum is now 4K. Default is 16K. + +RN113: Calculate index selectivity to return a more accurate value from records_in_range(). NOTE: FLUSH TABLES will update the index statistics, after data has been inserted or updated. + +RN112: Optimized table storage, saving 8 bytes per row. + +RN111: Optimized search on keys containing 2 or 3 not null integer values. + +RN110: Optimization: store the row ID in the index so that an index entry can be verified as current without loading the record. This is necessary to optimize an access with index coverage. + +RN109: Optimization: only load the record extended data if required. 
+ +RN108: Implemented SHOW ENGINE PBXT STATUS; + +RN107: Added the following system variables: + +pbxt_index_cache_size - The amount of memory allocated to the index cache, used only to cache index data +pbxt_record_cache_size - The amount of memory allocated to the record cache used to cache table data +pbxt_log_cache_size - The amount of memory allocated to the transaction log cache used to cache transaction log data +pbxt_log_file_threshold - The size of a transaction log before rollover, and a new log is created +pbxt_transaction_buffer_size - The size of the global transaction log buffer (the engine allocates 2 buffers of this size) +pbxt_log_buffer_size - The size of the buffer used to cache data from transaction and data logs during sequential scans, or when writing a data log +pbxt_checkpoint_frequency - The amount of data written to the transaction log before a checkpoint is performed +pbxt_data_log_threshold - The maximum size of a data log file +pbxt_garbage_threshold - The percentage of garbage in a data log file before it is compacted + +RN106: PBXT now compiles for MySQL 6.0.3. + +RN104: Updates now lock a record temporarily. This prevents most "record changed" errors, however, it makes UPDATE statements a type of "committed read". This means that you may update a different value to that which you selected in repeatable read mode. To avoid this, use SELECT FOR UPDATE if you plan to UPDATE records after reading. + +RN103: Implemented SELECT FOR UPDATE. This is implemented by turning SELECT FOR UPDATE into a type of "committed read". This means that, if you do a SELECT followed by a SELECT FOR UPDATE you can get different results, even in repeatable read mode. + +RN102: Implemented recovery of index entries. Note: indexes are not yet fully consistent. This means that an index can become corrupted due to a crash. Data, however, cannot be lost. The indices can be rebuilt using REPAIR TABLE. 
+ +RN101: Writing and flushing of a single transaction write-ahead log. + +RN100: Automatic rollover of transaction logs as they become full. + +RN99: Implementation of the transaction log cache. + +RN98: Group commit. + +RN97: Implementation of the writer thread that applies changes in the transaction log to the database. + +RN96: Implementation of the checkpointer thread that periodically flushes the database and writes a checkpoint which determines the recovery start point. + +RN95: Implementation of the free'er thread that is responsible for keeping the record cache at a preset level. + +RN94: Modifications to the record cache so that rows are stored in pages, in order to speed up sequential access. + +RN93: Implemented the recovery process which applies changes written to the log that are not in the database, on startup. + +RN92: Modification of the sweeper thread which cleans up rolled-back transactions and deleted data, to use the new transaction log format. + +RN91: Modifications to the data logs so that they use the same record structure as the transaction logs. + +RN90: The data logs are now managed "per database" in order to minimize the work done to flush and commit a transaction. + +RN89: Implementation of a file handle pool for the data logs. + +------- 0.9.91 Beta - 2007-10-30 + +RN88: The format of the URL generated by MyBS has been changed. The format of the BLOB URLs is now as follows: + +'~*' <db-name> '/' <type-char> <table-id> '-' <blob-id> '-' <access-code> '-' <server-id> + +Where <type-char> is '_' or '~'. + +Examples: ~*test/_11-128-fbd590b-0, ~*test/~1-524-3dc45b09-0 + +In other words, the character '>' has been replaced by '*', '^' has been replaced by '_' and ':' has been replaced by '~'. The reason for this is that the characters '>' and '^' are not allowed in URLs, and must be URL-encoded. The character ':' is reserved, but allowed. + +NOTE: This change makes this version incompatible with previous versions of MyBS. 
If you have a table with BLOB URLs, you can upgrade the URLs as follows: + +UPDATE blob_table SET blob_col = REPLACE(REPLACE(blob_col, '~>', '~*'), '/:', '/~'); + +Replacing '^' is not necessary because BLOB URLs with '^' should not appear in tables. + +------- 0.9.90 Beta - 2007-10-17 + +RN87: Corrected stack trace of errors passed through the BLOB streaming API. + +RN86: Added new engine API accessor functions that appeared in 5.1.21 (thanks Stewart). + +RN85: Added plug.in file. PBXT now compiles when dropped into the storage directory of the MySQL build tree. However, you have to rebuild configure. For example: + +rm -rf autom4te.cache/ +aclocal +autoconf +autoheader +automake -a +./configure --help +./configure --with-plugins=max --without-innodb --prefix=/usr/local/mysql --with-debug=full + +NOTE: ./configure --help should show that PBXT has been included. + +RN84: Fixed several problems with shutdown of PBXT in combination with MyBS. + +------- 0.9.89 Beta - 2007-08-17 + +RN83 (2007-08-21): Fixed a crash due to a compiler bug that does not like the construct *((xtWordPS *) &(v)) = (xtWordPS) (x) (macros allocr_() and alloczr_()). + +RN82: It is now possible to insert non-URL values into a LONGBLOB field. In the previous version this generated an "Invalid URL" error. Such values can be retrieved as a stream using a field reference. + +RN81: Fixed a bug that caused PBXT to crash during certain operations when MyBS was not installed. + +RN80: Set engine as capable of row-level replication, but not as statement replication. Statement replication does not work because MVCC is not serializable. + +------- 0.9.88 Beta - 2007-07-25 + +RN79: Made some corrections in order to compile with MySQL 5.1.20. + +RN78: Support for the features of the MyBS BLOB Streaming engine, version 0.5 Alpha. + +RN77: Bugfix: The server crashes during BLOB data handling. The reason is the table field structure is shared, and may not be changed. 
+ +------- 0.9.87 Beta - 2007-06-19 + +RN76: The major feature of this release is support for the BLOB Streaming Engine. The current version enables the download of specific BLOB columns via the Streaming Engine. For example: + +use test; +CREATE TABLE notes_tab ( + n_id INTEGER PRIMARY KEY, + n_text BLOB +) ENGINE=pbxt; +INSERT notes_tab VALUES (1, "This is a BLOB streaming test!"); + +The URL: + +http://localhost:8080/test/notes_tab/n_text/n_id=1 + +will return the value "This is a BLOB streaming test!" + +RN75: Bugfix: MySQL prints error: "Plugin 'PBXT' will be forced to shutdown". This error was caused by the plug-in having a reference to itself. + +RN74: Added system variables pbxt_index_cache_size and pbxt_record_cache_size. These variables can now be set on the mysqld command line (for example: --pbxt_record_cache_size=50MB). The values are also displayed by SHOW VARIABLES. + +------- 0.9.86 Beta - 2007-04-07 + +RN74: ++++ NOTE: This version is incompatible with older versions of PBXT ++++. + +In order to upgrade, install the older version of PBXT. Convert all tables to MyISAM using ALTER TABLE t1 ENGINE=MyISAM. Then install the new version of PBXT and convert back using ALTER TABLE t1 ENGINE=PBXT. + +RN73: Each table will now use a maximum of 4 data log files. This means a maximum of 7 files per table. The minimum is 3 for tables that do not have a variable field that exceeds about 40 bytes in size. This means that under Linux PBXT requires a maximum of 7 file handles per table used. Windows' lack of pread/pwrite (atomic seek and read/write) functions means it requires a file handle per file per open table handler. [TODO: 0044] + +RN72: All threads now write to the same data log file. Recovery and compaction take this fact into account. Each thread still writes its own transaction log. + +RN71: Removed all directory scans when creating and dropping table. Increased the table limit to 10000. 
+ +RN70: Changed locking to avoid a deadlock when TRUNCATE TABLE is used together with other DML. + +RN69: Procedures and functions are now considered atomic, and execute in a single transaction. + +RN68: Bug fixed: all files are now correctly flushed before commit. + +------- 0.9.85 Beta - 2007-03-15 + +RN67: Changed the implementation of the pushsr_ and allocr_ macros because "*((void **) &(v) = " caused a crash due to a compiler error on some platforms (thanks Luciano for your help on this one and RN66). + +RN66: Fixed a bug that caused PBXT to corrupt the index file when the size exceeded 4GB. [TODO: 0031] + +RN65: PBXT now runs under Windows. This source tree must be placed in the MySQL source storage directory in order to compile. Further details of how to build are in the windows-readme.txt file. [TODO: 0027] + +RN64: Improved speed of table lookup by ID after a table has been deleted. The sweeper needs to ignore these records. Scanning the directory each time was too slow. + +RN63: Added checking for repeat update of a record in a statement. + +RN62: Committed read no longer blocks due to a change made by another transaction (the XT_REPEATABLE_READ_BLOCKS define turns blocking on). + +RN61: Avoid checking for duplicates if an index is not modified by an update. + +RN60: Records updated repeatedly by a transaction are now updated in place. [TODO: 0040] + +------- 0.9.8 Beta - 2007-01-30 + +RN59: Reduced the number of file handles used to a maximum of one per file. This assumes that pread() and pwrite() allows multiple threads to use the same file handle (according to my tests, this is the case). + +RN58: Added the configure flag --with-debug=only which compiles a version of the plug-in with debug symbols that will link to a non-debug MySQL server. + +RN57: Changed error number returned on lock from 1205 (lock timeout) to 1020 (optimistic lock failure). + +RN56: Added UNIX environment variable for PBXT system parameters. 
These must be set before starting mysqld, for example: + +setenv pbxt_index_cache_size 400MB +setenv pbxt_record_cache_size "1 GB" + +Values are in bytes unless one of the following units is specified: GB, MB, Kb + +RN55: Fixed a bug which prevented VARCHAR values from being compressed correctly when stored in variable length rows. + +RN54: Fixed a bug which caused a crash when PBXT was used with MySQL 5.1.14. This bug also caused data to be corrupted on insert. + +RN53: Set query caching mode to transactional. [TODO: 0027] + +RN52: Added conditions so that the engine compiles with MySQL 5.1.14 and 5.1.13. + +------- 0.9.74 Beta - 2006-12-14 + +RN51: DELETE FROM <table>; is no longer implemented by re-creating the table. This statement now works by deleting all rows. TRUNCATE is implemented as before, by re-creating the table. + +RN50: The test scripts innodb.test and innodb-mysql.test have been modified to run with PBXT. + +RN49: [TODO: 0020] Implemented foreign keys. Functionality is identical to InnoDB with 2 exceptions: + +* Data types of referenced columns must be an exact match (e.g. you cannot mix VARCHAR and CHAR values). +* Currently an exact matching index is required on referenced columns (i.e. the index may not have more columns than the columns used in the foreign key definition). + +Also note the following: + +* It is possible to create foreign keys that reference non-existent tables or columns. An error will occur when updating a table with an incorrect foreign key declaration. +* If you alter the data-type of a column referenced by a foreign key, you need to set foreign_key_checks=0; or an error will occur. + +RN48: Fixed a bug in the implementation of indexes on ENUM and SET types. + +RN47: Fixed a bug that caused a crash when an index was placed on a BLOB column, and data was retrieved from the index directly. + +------- 0.9.73 Beta - 2006-10-31 + +RN46: Updated test scripts to run with MySQL 5.1.13. 
+ +------- 0.9.72 Beta - 2006-10-19 + +RN45: Corrected compilation errors that occurred due to a change to struct st_mysql_plugin. + +------- 0.9.71 Beta - 2006-10-04 + +RN44: Corrected compilation errors that occurred due to changes in the storage engine API. + +------- 0.9.7 Beta - 2006-09-20 + +RN43: This is the first Beta release of PrimeBase XT. It has been integrated into MySQL 4.1.21 and is available as a plug-in for MySQL 5.1.12, or later. This version has been extensively tested using mysql-test-run, on various Linux and Mac OS X platforms. + +RN42: ++++ NOTE: This version is incompatible with older versions of PBXT ++++. Files created by older versions cannot be opened by version 0.9.7. + +RN41: Renaming or deleting a table while using a name with different case to the original created name did not work. + +RN40: Fixed a bug when grouping and searching on indexed columns that contain a null. + +RN39: Fixed bugs related to trailing spaces on VARCHAR values. Values that only vary by the number of trailing spaces (for example "aa" and "aa "), are now correctly handled as identical. + +RN38: The default AUTO_INCREMENT value was not correctly preserved during ALTER TABLE. + +RN37: Created a MySQL 5.1 Plugin version of PBXT. [TODO: 0017] + +RN36: Fixed a race condition in the row cache which had the effect that inserted rows disappeared after cleanup because the cache was out of date. I was only able to reproduce this error on multi-processor machines. + +------- 0.9.6 - 2006-08-05 + +RN35: ++++ NOTE: This version is incompatible with older versions of PBXT ++++. + +The disk format of tables and log files has changed slightly in this version. As a result, files created by older versions cannot be opened by version 0.9.6. An error will be generated. If you have data you wish to preserve, first start the older version of XT and convert all tables to MyISAM. Then stop the server and remove all transaction log files (files of the form xtlog-*.xt). 
Then start the new version and convert tables back to XT. + +RN34: Implemented READ COMMITTED transaction mode. XT now supports READ COMMITTED and SERIALIZABLE transaction modes. NOTE: if the mode is set to REPEATABLE READ, SERIALIZABLE is used. If the mode is set to READ UNCOMMITTED, READ COMMITTED is used. + +RN33: The implementation of AUTO_INCREMENT on a partial index is non-standard. A unique value is generated without regard to the value of the index prefix. For example, assume we have the following table: CREATE TABLE t1 (c1 CHAR(10) not null, c2 INT not null AUTO_INCREMENT, PRIMARY KEY(c1, c2)); + +With the following contents: c1 c2 + A 8 + B 1 + +After executing the following statement: insert into t1 (c1) values ('B'); + +This is the result using PBXT: c1 c2 + A 8 + B 1 + B 9 + +The standard result would be: c1 c2 + A 8 + B 1 + B 2 + +RN32: PBXT does not permit access to multiple databases within a single transaction. For example: + +begin; +update database_1.t1 set a=10; +update database_2.t2 set d=10; +commit; + +In this case the following error is returned: 1015: Can't lock file (errno: -1) + +RN31: The implementation of COUNT(*) has changed. For efficiency, rows are not counted. The information is taken from the header of the record (.xtr) files. This information is only 100% accurate after transaction cleanup has completed, which basically means only when PBXT is idle. ANALYZE TABLE waits for all background activity to stop, so the statement may be executed before a COUNT(*) to ensure an accurate result. NOTE: Other than waiting for background processes, ANALYZE TABLE is not implemented. + +RN30: Two concurrency bugs have been fixed: a shared lock was used instead of an exclusive lock when deleting from a transaction list, and the transaction segment semaphore was not initialized. XT now runs correctly in a multi-processor environment. The test used was sysbench on a dual-processor, dual-core, AMD 64-bit machine running SUSE Linux 10.0.
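The difference described in RN33 can be sketched in a few lines of C++. This is only a model of the observable behaviour, assuming that the next value is simply one more than the current highest value; the type Row and both function names are hypothetical and do not come from the PBXT sources.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One row of the example table t1: (c1, c2).
using Row = std::pair<std::string, int>;

// Model of the PBXT behaviour in RN33 (an assumption based on the example):
// the new value is one more than the highest c2 in the whole table,
// regardless of the index prefix c1.
int next_id_ignoring_prefix(const std::vector<Row> &rows)
{
    int high = 0;
    for (const Row &row : rows)
        high = std::max(high, row.second);
    return high + 1;
}

// Model of what the ChangeLog calls the standard result: the new value is
// one more than the highest c2 among the rows with the same prefix c1.
int next_id_per_prefix(const std::vector<Row> &rows, const std::string &prefix)
{
    int high = 0;
    for (const Row &row : rows)
        if (row.first == prefix)
            high = std::max(high, row.second);
    return high + 1;
}
```

For the table contents (A, 8), (B, 1) and an insert with c1 = 'B', the first function gives 9 and the second gives 2, matching the two results listed in RN33.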
+ +RN29: PBXT compiles and runs under 64-bit Linux. [TODO: 0009] + +RN28: ./mysql-test-run --force --mysqld=--default-storage-engine=pbxt will now execute most tests successfully. Changes to the tests and the result have been documented in http://www.primebase.com/xt/download/pbxt-test-run-changes.txt. [TODO: 0004, 0019] + +RN27: Fixed a bug that caused the server to crash when using table locks and transactions. For example: LOCK TABLES, BEGIN, COMMIT, SELECT. This sequence now returns an error. The correct sequence is: + +LOCK TABLES, BEGIN, COMMIT, UNLOCK TABLES, SELECT +or +LOCK TABLES, BEGIN, COMMIT, BEGIN, SELECT, COMMIT, UNLOCK TABLES + +RN26: Fixed a concurrency problem which caused a number of threads to hang during the sysbench test - see RN30 above (bug reported by Vadim). + +RN25: Fixed a bug that caused the server to hang when ha_pbxt::create() and ha_pbxt::ha_open() were given different, but equivalent paths for a particular table. + +RN24: Fixed bug in the indexing of blob columns, for example: create table t1(name_id int, name blob, INDEX name_idx (name(5))); + +RN23: When a duplicate key error occurs in auto-commit mode, the transaction is now rolled back. + +RN22: Fixed incorrect duplicate key error. In the case of a unique key which allows NULLs, duplicates are allowed if the inserted key contains a NULL. For example: + +create table t1 (id int not null, str char(10), unique(str)); +insert into t1 values (1, null),(2, null),(3, "foo"),(4, "bar"); + +RN21: PBXT now returns the correct error code on duplicate key: 1062 instead of 1022. + +RN19: Implemented AUTO_INCREMENT on partial keys. However, the XT implementation is non-standard. Increment of partial index works, but the ID generated is incremented like a non-partial index.
For example: + +create table t1 (c1 char(10) not null, c2 int not null auto_increment, primary key(c1, c2)); +select * from t1; +c1 c2 +A 8 +B 1 + +insert into t1 (c1) values ('B'); +select * from t1; +c1 c2 +A 8 +B 1 +B 9 + +The standard result would be: +c1 c2 +A 8 +B 1 +B 2 + +RN18: Implemented TRUNCATE TABLE and DELETE FROM <table>; (i.e. a DELETE without WHERE clause). Previously DELETE FROM <table>; did not cause an error, but no rows where deleted (TRUNCATE TABLE returned an error). [TODO: 0012, 0022] + +RN17: Implemented CREATE TABLE (...) auto_increment=<value>; + +------- 0.9.51 - 2006-07-06 + +RN16: Fixed crash which could occur when creating the first table in a database (bug reported by Hakan). + +------- 0.9.5 - 2006-07-03 + +RN15: This version concludes the re-structuring of the PBXT implementation. I have made a number of major changes, including: + +- All files except the transaction logs are now associated with a particular table. All table related files begin with the name of the table. The extension indicates the function. + +- I have merged the handle and the fixed length row data for performance reasons. + +- Only the variable size component of a row is stored in the data log files. As a result the data logs can now be considered as a type of "overflow" area. + +- Memory mapped files are no longer used because it is not possible to flush changes to the disk. + +RN14: File names have the following forms: + +[table-name]-[table-id].xtr - These files contains the table row pointers. Each row pointer occupies 8 bytes and refers to a list of records. The file name also contains the table ID. This is a unique number which is used internally by XT to identify the table. + +[table-name].xtd - This file contains the fixed length data of a table. Each data item includes a handle and a record. The handle references a record in the data log file if the table contains variable length records. 
+ +[table-name].xti - This file contains the index data of the table. + +[table-name]-[log-id].xtl - This is a data log file. It contains the variable length data of the table. A table may have any number of data log files, each with a unique ID. + +xtlog-[log-id].xt - These files are the transaction logs. Log entries that specify updates reference a data file record. Each active thread has its own transaction log in order to avoid contention. + +RN13: Fixed the bug "Hang on DROP DATABASE". [TODO: 0016] + +RN12: PBXT currently only supports the "Serializable" transaction isolation level. This is the highest isolation level possible and includes the "repeatable-read" functionality [TODO: 0015]. This is implemented by giving every transaction a snapshot of the database at the point when the transaction is started. + +If the transaction tries to update a record that was updated by some other transaction after the snapshot was taken, a locked error is returned. A deadlock can occur if 2 transactions update the same record in a different order. PBXT can detect all deadlocks. + +RN11: I have implemented write buffering on the table data files. [TODO: 0013] + +RN10: The unique constraint (UNIQUE INDEX/PRIMARY KEY) is now checked correctly. [TODO: 0008] + +RN9: I have implemented a conventional B-tree algorithm for the indices (instead of the Lehman and Yao B*-link tree). Although this reduces concurrency it improves the performance of queries significantly because of the simplicity of the algorithm. Deletion is also implemented in a very simple manner. [TODO: 0007] + +RN8: PBXT now has only 2 caches [TODO: 0006]: + +The Index Cache (pbxt_index_cache_size): This is the amount of memory the PBXT storage engine uses to cache index data and row pointers. This is all the data in the files with the extensions '.xti' and '.xtr'. This cache is managed in blocks of 2K.
+ +The Record Cache (pbxt_record_cache_size): This is the amount of memory the PBXT storage engine uses to cache table row data (handles and records). This is all the data in the files with the extension '.xtd'. + +The size of the caches are determined by the values of the system variables pbxt_index_cache_size and pbxt_row_cache_size. By default these values are set to 32MB. + +RN7: Auto-increment is now implemented in memory. This is done by doing a MAX() select when a table is first opened to get the high value. After that, then high value is incremented in memory on INSERT. On UPDATE (or INSERT) the value in memory is adjusted if necessary. This method also makes it possible for rows to be inserted simultaneously on the same table. [TODO: 0005, 0014] + +RN6: ./run-all-tests --create-options=TYPE=PBXT succeeds. [TODO: 0004] + +RN5: Using sql-bench and my own Java based test I have confirmed that PBXT behaves correctly during multi-threaded access. [PARTIALY TODO: 0002] + +RN4: Load/Stability test. Using sql-bench I have tested PBXT under load over a long period of time. [PARTIALY TODO: 0001] + +------- 0.9.2 - 2006-04-01 + +RN3: Fixed a bug that cause the error "-6: Handle is out of range: [0:0]". + +RN2: Implemented SET, ENUM and YEAR data types. + +RN1: Fixed a bug in the error reporting when a table is created with a datatype that is not supported. [TODO: 0011] + + diff --git a/storage/pbxt/INSTALL b/storage/pbxt/INSTALL new file mode 100644 index 00000000000..23e5f25d0e5 --- /dev/null +++ b/storage/pbxt/INSTALL @@ -0,0 +1,236 @@ +Installation Instructions +************************* + +Copyright (C) 1994, 1995, 1996, 1999, 2000, 2001, 2002, 2004, 2005 Free +Software Foundation, Inc. + +This file is free documentation; the Free Software Foundation gives +unlimited permission to copy, distribute and modify it. + +Basic Installation +================== + +These are generic installation instructions. 
+ + The `configure' shell script attempts to guess correct values for +various system-dependent variables used during compilation. It uses +those values to create a `Makefile' in each directory of the package. +It may also create one or more `.h' files containing system-dependent +definitions. Finally, it creates a shell script `config.status' that +you can run in the future to recreate the current configuration, and a +file `config.log' containing compiler output (useful mainly for +debugging `configure'). + + It can also use an optional file (typically called `config.cache' +and enabled with `--cache-file=config.cache' or simply `-C') that saves +the results of its tests to speed up reconfiguring. (Caching is +disabled by default to prevent problems with accidental use of stale +cache files.) + + If you need to do unusual things to compile the package, please try +to figure out how `configure' could check whether to do them, and mail +diffs or instructions to the address given in the `README' so they can +be considered for the next release. If you are using the cache, and at +some point `config.cache' contains results you don't want to keep, you +may remove or edit it. + + The file `configure.ac' (or `configure.in') is used to create +`configure' by a program called `autoconf'. You only need +`configure.ac' if you want to change it or regenerate `configure' using +a newer version of `autoconf'. + +The simplest way to compile this package is: + + 1. `cd' to the directory containing the package's source code and type + `./configure' to configure the package for your system. If you're + using `csh' on an old version of System V, you might need to type + `sh ./configure' instead to prevent `csh' from trying to execute + `configure' itself. + + Running `configure' takes awhile. While running, it prints some + messages telling which features it is checking for. + + 2. Type `make' to compile the package. + + 3. 
Optionally, type `make check' to run any self-tests that come with + the package. + + 4. Type `make install' to install the programs and any data files and + documentation. + + 5. You can remove the program binaries and object files from the + source code directory by typing `make clean'. To also remove the + files that `configure' created (so you can compile the package for + a different kind of computer), type `make distclean'. There is + also a `make maintainer-clean' target, but that is intended mainly + for the package's developers. If you use it, you may have to get + all sorts of other programs in order to regenerate files that came + with the distribution. + +Compilers and Options +===================== + +Some systems require unusual options for compilation or linking that the +`configure' script does not know about. Run `./configure --help' for +details on some of the pertinent environment variables. + + You can give `configure' initial values for configuration parameters +by setting variables in the command line or in the environment. Here +is an example: + + ./configure CC=c89 CFLAGS=-O2 LIBS=-lposix + + *Note Defining Variables::, for more details. + +Compiling For Multiple Architectures +==================================== + +You can compile the package for more than one kind of computer at the +same time, by placing the object files for each architecture in their +own directory. To do this, you must use a version of `make' that +supports the `VPATH' variable, such as GNU `make'. `cd' to the +directory where you want the object files and executables to go and run +the `configure' script. `configure' automatically checks for the +source code in the directory that `configure' is in and in `..'. + + If you have to use a `make' that does not support the `VPATH' +variable, you have to compile the package for one architecture at a +time in the source code directory. 
After you have installed the +package for one architecture, use `make distclean' before reconfiguring +for another architecture. + +Installation Names +================== + +By default, `make install' installs the package's commands under +`/usr/local/bin', include files under `/usr/local/include', etc. You +can specify an installation prefix other than `/usr/local' by giving +`configure' the option `--prefix=PREFIX'. + + You can specify separate installation prefixes for +architecture-specific files and architecture-independent files. If you +pass the option `--exec-prefix=PREFIX' to `configure', the package uses +PREFIX as the prefix for installing programs and libraries. +Documentation and other data files still use the regular prefix. + + In addition, if you use an unusual directory layout you can give +options like `--bindir=DIR' to specify different values for particular +kinds of files. Run `configure --help' for a list of the directories +you can set and what kinds of files go in them. + + If the package supports it, you can cause programs to be installed +with an extra prefix or suffix on their names by giving `configure' the +option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'. + +Optional Features +================= + +Some packages pay attention to `--enable-FEATURE' options to +`configure', where FEATURE indicates an optional part of the package. +They may also pay attention to `--with-PACKAGE' options, where PACKAGE +is something like `gnu-as' or `x' (for the X Window System). The +`README' should mention any `--enable-' and `--with-' options that the +package recognizes. + + For packages that use the X Window System, `configure' can usually +find the X include and library files automatically, but if it doesn't, +you can use the `configure' options `--x-includes=DIR' and +`--x-libraries=DIR' to specify their locations. 
+ +Specifying the System Type +========================== + +There may be some features `configure' cannot figure out automatically, +but needs to determine by the type of machine the package will run on. +Usually, assuming the package is built to be run on the _same_ +architectures, `configure' can figure that out, but if it prints a +message saying it cannot guess the machine type, give it the +`--build=TYPE' option. TYPE can either be a short name for the system +type, such as `sun4', or a canonical name which has the form: + + CPU-COMPANY-SYSTEM + +where SYSTEM can have one of these forms: + + OS KERNEL-OS + + See the file `config.sub' for the possible values of each field. If +`config.sub' isn't included in this package, then this package doesn't +need to know the machine type. + + If you are _building_ compiler tools for cross-compiling, you should +use the option `--target=TYPE' to select the type of system they will +produce code for. + + If you want to _use_ a cross compiler, that generates code for a +platform different from the build platform, you should specify the +"host" platform (i.e., that on which the generated programs will +eventually be run) with `--host=TYPE'. + +Sharing Defaults +================ + +If you want to set default values for `configure' scripts to share, you +can create a site shell script called `config.site' that gives default +values for variables like `CC', `cache_file', and `prefix'. +`configure' looks for `PREFIX/share/config.site' if it exists, then +`PREFIX/etc/config.site' if it exists. Or, you can set the +`CONFIG_SITE' environment variable to the location of the site script. +A warning: not all `configure' scripts look for a site script. + +Defining Variables +================== + +Variables not defined in a site shell script can be set in the +environment passed to `configure'. However, some packages may run +configure again during the build, and the customized values of these +variables may be lost. 
In order to avoid this problem, you should set +them in the `configure' command line, using `VAR=value'. For example: + + ./configure CC=/usr/local2/bin/gcc + +causes the specified `gcc' to be used as the C compiler (unless it is +overridden in the site shell script). Here is a another example: + + /bin/bash ./configure CONFIG_SHELL=/bin/bash + +Here the `CONFIG_SHELL=/bin/bash' operand causes subsequent +configuration-related scripts to be executed by `/bin/bash'. + +`configure' Invocation +====================== + +`configure' recognizes the following options to control how it operates. + +`--help' +`-h' + Print a summary of the options to `configure', and exit. + +`--version' +`-V' + Print the version of Autoconf used to generate the `configure' + script, and exit. + +`--cache-file=FILE' + Enable the cache: use and save the results of the tests in FILE, + traditionally `config.cache'. FILE defaults to `/dev/null' to + disable caching. + +`--config-cache' +`-C' + Alias for `--cache-file=config.cache'. + +`--quiet' +`--silent' +`-q' + Do not print messages saying which checks are being made. To + suppress all normal output, redirect it to `/dev/null' (any error + messages will still be shown). + +`--srcdir=DIR' + Look for the package's source code in directory DIR. Usually + `configure' can determine that directory automatically. + +`configure' also accepts some other, not widely useful, options. Run +`configure --help' for more details. 
+ diff --git a/storage/pbxt/Makefile.am b/storage/pbxt/Makefile.am new file mode 100644 index 00000000000..a8bfde74ee3 --- /dev/null +++ b/storage/pbxt/Makefile.am @@ -0,0 +1,3 @@ +SUBDIRS = src + +EXTRA_DIST = plug.in diff --git a/storage/pbxt/NEWS b/storage/pbxt/NEWS new file mode 100644 index 00000000000..e69de29bb2d --- /dev/null +++ b/storage/pbxt/NEWS diff --git a/storage/pbxt/README b/storage/pbxt/README new file mode 100644 index 00000000000..52d7cf6c44e --- /dev/null +++ b/storage/pbxt/README @@ -0,0 +1,19 @@ +PrimeBase XT for MySQL 5.1 +========================== + +This is the PrimeBase XT (PBXT) transactional storage engine for MySQL. PBXT is "pluggable", which means that it can be loaded dynamically by MySQL at runtime. It uses a unique "write-once" update strategy and MVCC (multi-version concurrency control) to provide optimal performance over a wide range of tasks. + +This package includes the complete source code for the engine. Although this is a standalone project it must be built against a compiled version of the MySQL 5.1 source tree, because it references header files used internally by the server. + +Details about how to build PBXT under UNIX or Windows, as a standalone plug-in or as part of the MySQL source code, are described in the documentation, which is available online at: + +http://www.primebase.org/documentation + +Bug reports, questions and comments can be sent directly to me. + +Thanks for your support! + +Paul McCullagh +SNAP Innovation GmbH +paul.mccullagh@primebase.org + diff --git a/storage/pbxt/TODO b/storage/pbxt/TODO new file mode 100644 index 00000000000..b5782defb61 --- /dev/null +++ b/storage/pbxt/TODO @@ -0,0 +1,195 @@ +PBXT To-Do List +=============== + +My thanks to all who have downloaded and tested PBXT. If an issue you reported before the date below is not on this list, please e-mail me again. + +------- 2008-12-09 + +0063: The option for not using memory mapped files must be fixed.
+ +0062: Dynamic option for using memory mapping on a table (Dimitri). + +------- 2008-09-12 + +0061: Add records per key result to ha_pbxt:info() call (Mark). + +------- 2008-08-31 + +0060: Add table option to determine if a table should be memory mapped or not (also requested by Dimitri). + +0059: Add table options: + AVG_ROW_LENGTH [=] value + DATA DIRECTORY [=] 'absolute path to directory' + INDEX DIRECTORY [=] 'absolute path to directory' + MAX_ROWS [=] value + +------- 2008-03-28 + +0058: Consolidate writes when changes in the log are applied to the database. + +------- 2008-03-07 + +0057: Cluster updates onto a single page. + +0056: Add checksum to index and data pages. + +0055: When no index cache is available, the complete index must be flushed (not just single pages). + +0054: Optimize indexes by not creating indexes that are a complete sub-set of some other index. In this case we must be able to identify part of an index as unique. For example: primary key (a, b), index (a, b, c). Here we would just create index (a, b, c), and specify that the part (a, b) must be unique. Operations on (a, b) will be directed to index (a, b, c). + +0053: Check and test lock tables. + +0052: Cache data log data in the handle data cache. Must be purged when a handle data record is written. + +0051: Write data log data alternatively to the transaction log. The compactor must then compact transaction logs. + +0050: [RESOLVED: RN126] Implement consistent write for indexes. + +0049: [RESOLVED: RN114] Set the index block size to 4K, or 16K as used by InnoDB. + +0048: [RESOLVED: RN110] Add row ID to indexes. This should only be set once the row is cleaned by the sweeper. Then the row ID can be used to make a quite check if the row is the most recent version. + +------- 2007-06-19 + +0047: Test build with ./configure --with-innodb under Linux (Vadim). + +0046: [RESOLVED: RN85] Add plug.in file to enable drop in compile under Linux. + +0045: Provide libstdc++.so.6 binaries (Vadim). 
+ +0044: [RESOLVED: RN73] Limit number of file handles used per table (Brian). + +0043: XA (two-phase commit) support (Peter). + +------- 2007-03-13 + +0042: [RESOLVED: RN108] Implemement STATUS commands. + +0041: Implement index prefix compression. + +------- 2007-03-07 + +0040: [RESOLVED: RN60] Update in-place when a transaction updates the same record more than once. + +0039: Set the number and size of the segments dynamically according to the amount of memory in the cache (and the number of CPUs?) (as discussed with: Peter & Vadim). + +0038: [RESOLVED: RN133] Improve the efficiency of the locks by using atomic compare and swap (Peter & Vadim). + +0037: [RESOLVED: RN133] Instead of a global LRU list, use a LRU list for segment of the cache (Peter & Vadim). [ Note: a global list using a TAS lock and change time (so that LRU is not always updated) is most efficient]. + +0036: Add support for deferred foreign key checking (requested by: Mark). + +0035: [RESOLVED: RN71] Remove the 2000 table limit (reported by: Hakan). + +------- 2007-02-28 + +0035: [RESOLVED: RN74, RN107] Build in the PBXT system parameters (currently they must be set using environment variables. + +0034: [RESOLVED: RN117] Initial documentation (yes, it must be done!) + +0033: Make the error code returned on lock error configurable. + +0032: [RESOLVED: RN65] Create a source code pluggable version for Windows. + +0031: [RESOLVED: RN66] PBXT corrupts the index file when the size exceeds 4 GB (reported by: Luciano) + +0030: [RESOLVED: RN102] Implement pbxt_index_flush_delay. Postpones index writing in order to speed up imports. [Resolution uses that fact hat index entries that are missing are added during recovery. As a result, index flushing can be delayed.] + +0029: [RESOLVED: RN103] Implement SELECT ... FOR UPDATE (recommended by: Robin). + +------- 2007-02-14 + +0028: Implement CREATE TABLE ... DATA/INDEX DIRECTORY (suggested by: Robin). 
+ +------- 2006-12-06 + +0027: [RESOLVED: RN53] Bug in pbxt with query caching (reported by: Giuseppe) caused violation of transaction isolation. + +------- 2006-08-05 + +0026: Implement BACKUP and RESTORE table (planned for the first post release version). + +0025: Implement DISABLE/ENABLE KEYS. Works for FOREIGN KEYs, currently no plans to implement for disabling indexes. + +0024: Implement ANALYZE TABLE (planned for the first post release version). + +0023: Implement CHECK TABLE (planned for the first release candidate). + +0022: [RESOLVED: RN18] Implement TRUNCATE TABLE and DELETE FROM <table>; (i.e. a DELETE without WHERE clause). Currently this function does not cause an error, but no rows are deleted. + +------- 2006-07-06 + +0021: [RESOLVED: RN28] .../mysql-test/mysql-test-run --force --mysqld=--default-storage-engine=pbxt produces a number of errors (reported by: Hakan): As far as I can tell some failures are unnecessary but others are bugs. All need to be checked. + +------- 2006-07-03 + +0020: [RESOLVED: RN49] Implement referential integrity (planned for the first release candidate). + +------- 2006-04-01 + +0019: [RESOLVED: RN28] mysql-test-run hangs on alter table (reported by: Hakan): Running a test like ./mysql-test-run.pl --mysqld=--default-storage-engine=pbxt, hangs on ALTER TABLE. + +0018: Implement GEOMETRY data type. Note: There are currently no plans to implement this feature. + +------- 2006-03-31 + +0017: [RESOLVED: RN37] MySQL 5.x Version (reported by: Ronald, Giuseppe). + +0016: [RESOLVED: RN13] Hang on "DROP DATABASE" (reported by: Giuseppe). Load the world database (http://downloads.mysql.com/docs/world.sql) and convert all tables into PBXT. Then, the drop database command hangs. + +0015: [RESOLVED: RN12] Implement isolation level "repeatable read" (reported by: Giuseppe). Currently PBXT only supports isolation level "committed read". This means committed data can be seen no matter when it was committed. Use SELECT ...
FOR UPDATE to guarantee repeatable read, on data already read. + +0014: [RESOLVED: RN7] Two transactions cannot insert simaltaneously if they use auto_increment (reported by: Giuseppe). See also 0005. + +0013: [RESOLVED: RN11] Implement buffered write (reported by: Giuseppe): Lack of buffered write leads to bad performance in operations such as ALTER TABLE ENGINE = PBXT and INSERT ... SELECT. + +0012: [RESOLVED: RN18] TRUNCATE does not work (reported by: Giuseppe) + +0011: [RESOLVED: RN2] Load Sakila Sample Database (reported by: Ronald): ALTER TABLE film ENGINE=PBXT; fails + +0010: [RESOLVED: RN6] sql-bench (reported by: Dmitry): ./run-all-tests --create-options=TYPE=PBXT fails. + +0009: [RESOLVED: RN29] 64-bit Linux (reported by: Hakan): PBXT current does not compile under 64-bit Linux. + +------- 2006-03-16 + +0008: [RESOLVED: RN10] Enforcing the unique index constraint: + +An index declared as "unique" must return a "duplicate unique key" error when inserting a duplicate value. The difficulty part of implementing this in PBXT is that we may encounter a duplicate value that has not yet been committed. The index reading thread must then wait for the transaction to commit or abort. + +0007: [RESOLVED: RN9] Cleaning up empty index nodes: + +The Lehman and Yoa algorithm used for indexing does not describe a way of cleaning up empty index nodes on-the-fly. A search of the relevant literature for an algorithm also turns up empty handed (periodic "reorg" is mostly suggested). I have subsequently devised an algorithm that will do the job. This needs to be implemented. + +0006: [RESOLVED: RN8] Cache Balancing: + +PBXT uses a number of small caches in order to improve concurrency (rather than one large cache). A process is required to manage the amount of cache memory used as a whole. The process must distribute the overall amount of memory available for caching over the small caches, according to demand. 
+ +0005: [RESOLVED: RN7] Implement a faster auto-increment method + +Currently the auto-increment is handled by the default method used in MySQL. This is done by performing a "fetch-last" on the index for each insert to find the highest key value. This works well unless there are large number empty index nodes due to the problem described in (2) above. + +PBXT Testing To-Do List + +This is my first take on what still must be tested. My thanks to Ronald Bradford who is working on a generic testing framework that can be used to test PBXT. + +0004: [RESOLVED: RN6, RN28] MySQL Tests: + +Several tests (for mysql-test-run) written for other engines can be adapted and used to test PBXT. + +0003: [RESOLVED: RN30] Multi-processor Test: + +There is a difference between preemptive multitasking and true multitasking, which you have on a multi-processor (or dual core) machine. I don't expect any fundamental problems here, but it must be tested. + +0002: [RESOLVED: RN5, RN30, RN43] Multi-user/locking Test: + +How does the engine perform with a number of concurrent users running various transactions on a number of different tables? +This is a difficult test to write because it need to simulate a production situation. To test at least 2 or 3 machines is required. The idea is not to use too much data so that a lot of conflicts may occur. + +0001: [RESOLVED: RN4, RN43] Load/Stability Test: + +How does the engine perform under heavy load over a long period of time? How stable is the engine on power outage, etc? + +The test could use a variation of the test program written for test (3) above. At least 3 test machines would be required. The test must be modified to cause as much activity as possible. The test should monitor the performance under load. 
+ + diff --git a/storage/pbxt/plug.in b/storage/pbxt/plug.in new file mode 100644 index 00000000000..173c66697b7 --- /dev/null +++ b/storage/pbxt/plug.in @@ -0,0 +1,8 @@ +DRIZZLE_STORAGE_ENGINE(pbxt,no, [PBXT Storage Engine], + [MVCC-based transactional engine], [max,max-no-ndb]) +DRIZZLE_PLUGIN_DIRECTORY(pbxt, [storage/pbxt]) +DRIZZLE_PLUGIN_STATIC(pbxt, [src/libpbxt.a]) +DRIZZLE_PLUGIN_MANDATORY(pbxt) dnl Default +DRIZZLE_PLUGIN_ACTIONS(pbxt, [ + AC_CONFIG_FILES(storage/pbxt/src/Makefile) + ]) diff --git a/storage/pbxt/src/CMakeLists.txt b/storage/pbxt/src/CMakeLists.txt new file mode 100755 index 00000000000..4533752045c --- /dev/null +++ b/storage/pbxt/src/CMakeLists.txt @@ -0,0 +1,53 @@ +# Copyright (c) 2008 PrimeBase Technologies GmbH +# +# PrimeBase XT +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# 2006-03-22 Paul McCullagh +# +# H&G2JCtL +# +# This file is used to make the Windows version + +SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DMYSQL_SERVER") +SET(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DMYSQL_SERVER") + +SET(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -DMYSQL_SERVER -DSAFEMALLOC -DSAFE_MUTEX -DDEBUG") +SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -DMYSQL_SERVER -DSAFEMALLOC -DSAFE_MUTEX -DDEBUG") + +INCLUDE_DIRECTORIES(${CMAKE_SOURCE_DIR}/include ${CMAKE_SOURCE_DIR}/sql + ${CMAKE_SOURCE_DIR}/regex + ${CMAKE_SOURCE_DIR}/extra/yassl/include) + +SET(PBXT_SOURCES ha_pbxt.cc bsearch_xt.cc index_xt.cc strutil_xt.cc cache_xt.cc linklist_xt.cc + ccutils_xt.cc lock_xt.cc table_xt.cc database_xt.cc thread_xt.cc + datadic_xt.cc memory_xt.cc trace_xt.cc datalog_xt.cc myxt_xt.cc util_xt.cc + filesys_xt.cc pthread_xt.cc xaction_xt.cc restart_xt.cc xactlog_xt.cc + hashtab_xt.cc sortedlist_xt.cc heap_xt.cc streaming_xt.cc tabcache_xt.cc + systab_xt.cc ha_xtsys.cc discover_xt.cc + bsearch_xt.h linklist_xt.h tabcache_xt.h cache_xt.h lock_xt.h table_xt.h + ccutils_xt.h thread_xt.h database_xt.h memory_xt.h trace_xt.h + datadic_xt.h pbms.h util_xt.h datalog_xt.h myxt_xt.h xaction_xt.h + filesys_xt.h pthread_xt.h xactlog_xt.h ha_pbxt.h restart_xt.h xt_config.h + hashtab_xt.h sortedlist_xt.h xt_defs.h heap_xt.h streaming_xt.h xt_errno.h + systab_xt.h ha_xtsys.h discover_xt.h + index_xt.h strutil_xt.h) + +IF(NOT SOURCE_SUBLIBS) + ADD_LIBRARY(pbxt ${PBXT_SOURCES}) + ADD_DEPENDENCIES(pbxt GenError) +ENDIF(NOT SOURCE_SUBLIBS) + diff --git a/storage/pbxt/src/Makefile-dzl.am b/storage/pbxt/src/Makefile-dzl.am new file mode 100644 index 00000000000..8fba00d7c67 --- /dev/null +++ b/storage/pbxt/src/Makefile-dzl.am @@ -0,0 +1,49 @@ +# Used to build Makefile.in + 
+MYSQLDATAdir = $(localstatedir) +MYSQLSHAREdir = $(pkgdatadir) +MYSQLBASEdir= $(prefix) +MYSQLLIBdir= $(pkglibdir) +pkgplugindir = $(pkglibdir)/plugin + +AM_CPPFLAGS = -I$(top_srcdir)/../../ + +LIBS = + +LDADD = + +noinst_HEADERS = bsearch_xt.h cache_xt.h ccutils_xt.h database_xt.h \ + datadic_xt.h datalog_xt.h filesys_xt.h hashtab_xt.h \ + ha_pbxt.h heap_xt.h index_xt.h linklist_xt.h \ + memory_xt.h myxt_xt.h pthread_xt.h restart_xt.h \ + streaming_xt.h sortedlist_xt.h strutil_xt.h \ + tabcache_xt.h table_xt.h trace_xt.h thread_xt.h \ + util_xt.h xaction_xt.h xactlog_xt.h lock_xt.h \ + systab_xt.h ha_xtsys.h discover_xt.h \ + mybs.h xt_config.h xt_defs.h xt_errno.h +EXTRA_LTLIBRARIES = libpbxt.la + +libpbxt_la_SOURCES = bsearch_xt.cc cache_xt.cc ccutils_xt.cc database_xt.cc \ + datadic_xt.cc datalog_xt.cc filesys_xt.cc hashtab_xt.cc \ + ha_pbxt.cc heap_xt.cc index_xt.cc linklist_xt.cc \ + memory_xt.cc myxt_xt.cc pthread_xt.cc restart_xt.cc \ + streaming_xt.cc sortedlist_xt.cc strutil_xt.cc \ + tabcache_xt.cc table_xt.cc trace_xt.cc thread_xt.cc \ + systab_xt.cc ha_xtsys.cc discover_xt.cc \ + util_xt.cc xaction_xt.cc xactlog_xt.cc lock_xt.cc + +libpbxt_la_LDFLAGS = -module + +# These are the warnings Drizzle uses: +# DRIZZLE_WARNINGS = -W -Wall -Wextra -pedantic -Wundef -Wredundant-decls -Wno-strict-aliasing -Wno-long-long -Wno-unused-parameter + +libpbxt_la_CXXFLAGS = $(AM_CFLAGS) -DMYSQL_DYNAMIC_PLUGIN +libpbxt_la_CFLAGS = $(AM_CFLAGS) -DMYSQL_DYNAMIC_PLUGIN -std=c99 + +EXTRA_LIBRARIES = libpbxt.a +noinst_LIBRARIES = libpbxt.a +libpbxt_a_SOURCES = $(libpbxt_la_SOURCES) +libpbxt_a_CXXFLAGS = $(AM_CFLAGS) -DDRIZZLED -Wno-long-long +libpbxt_a_CFLAGS = $(AM_CFLAGS) -DDRIZZLED -std=c99 + +EXTRA_DIST = CMakeLists.txt diff --git a/storage/pbxt/src/Makefile.am b/storage/pbxt/src/Makefile.am new file mode 100644 index 00000000000..78db0ca787c --- /dev/null +++ b/storage/pbxt/src/Makefile.am @@ -0,0 +1,51 @@ +# Used to build Makefile.in + +INCLUDES = $(ENG_MYSQL_INC) + 
+LIBS = + +LDADD = + +plugindir = $(ENG_PLUGIN_DIR) + +noinst_HEADERS = bsearch_xt.h cache_xt.h ccutils_xt.h database_xt.h \ + datadic_xt.h datalog_xt.h filesys_xt.h hashtab_xt.h \ + ha_pbxt.h heap_xt.h index_xt.h linklist_xt.h \ + memory_xt.h myxt_xt.h pthread_xt.h restart_xt.h \ + streaming_xt.h sortedlist_xt.h strutil_xt.h \ + tabcache_xt.h table_xt.h trace_xt.h thread_xt.h \ + util_xt.h xaction_xt.h xactlog_xt.h lock_xt.h \ + systab_xt.h ha_xtsys.h discover_xt.h \ + pbms.h xt_config.h xt_defs.h xt_errno.h locklist_xt.h + +plugin_LTLIBRARIES = libpbxt.la + +libpbxt_la_SOURCES = bsearch_xt.cc cache_xt.cc ccutils_xt.cc database_xt.cc \ + datadic_xt.cc datalog_xt.cc filesys_xt.cc hashtab_xt.cc \ + ha_pbxt.cc heap_xt.cc index_xt.cc linklist_xt.cc \ + memory_xt.cc myxt_xt.cc pthread_xt.cc restart_xt.cc \ + streaming_xt.cc sortedlist_xt.cc strutil_xt.cc \ + tabcache_xt.cc table_xt.cc trace_xt.cc thread_xt.cc \ + systab_xt.cc ha_xtsys.cc discover_xt.cc \ + util_xt.cc xaction_xt.cc xactlog_xt.cc lock_xt.cc locklist_xt.cc + +libpbxt_la_LDFLAGS = -module + +# These are the warnings Drizzle uses: +# DRIZZLE_WARNINGS = -W -Wall -Wextra -pedantic -Wundef -Wredundant-decls -Wno-strict-aliasing -Wno-long-long -Wno-unused-parameter + +libpbxt_la_CXXFLAGS = $(AM_CXXFLAGS) -DMYSQL_DYNAMIC_PLUGIN +libpbxt_la_CFLAGS = $(AM_CFLAGS) -DMYSQL_DYNAMIC_PLUGIN -std=c99 + +EXTRA_LIBRARIES = libpbxt.a libxtutil.a +noinst_LIBRARIES = libpbxt.a libxtutil.a +libpbxt_a_SOURCES = $(libpbxt_la_SOURCES) +libpbxt_a_CXXFLAGS = $(AM_CXXFLAGS) $(DRIZZLE_WARNINGS) +libpbxt_a_CFLAGS = $(AM_CFLAGS) -std=c99 $(DRIZZLE_WARNINGS) + +libxtutil_a_SOURCES = strutil_xt.cc \ + trace_xt.cc +libxtutil_a_CXXFLAGS = $(AM_CXXFLAGS) +libxtutil_a_CFLAGS = $(AM_CFLAGS) + +EXTRA_DIST = CMakeLists.txt diff --git a/storage/pbxt/src/bsearch_xt.cc b/storage/pbxt/src/bsearch_xt.cc new file mode 100644 index 00000000000..539de1ae74d --- /dev/null +++ b/storage/pbxt/src/bsearch_xt.cc @@ -0,0 +1,66 @@ +/* Copyright (c) 2005 
PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2004-01-03 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <stdio.h> + +#include "bsearch_xt.h" +#include "pthread_xt.h" +#include "thread_xt.h" + +/** + * Binary search an array of 'count' items, each 'size' bytes long. If + * the item is found, this function returns a pointer to the element + * and stores the index of the element in 'idx'. + * + * If not found, NULL is returned and the index of the insert point + * of the item is stored in 'idx' (0 <= index <= count). + * + * The comparison routine 'compar' may throw an exception. + * In this case the error details will be stored in 'thread'. 
+ */ +void *xt_bsearch(XTThreadPtr thread, const void *key, register const void *base, size_t count, size_t size, size_t *idx, const void *thunk, XTCompareFunc compar) +{ + register size_t i; + register size_t guess; + register int r; + + i = 0; + while (i < count) { + guess = (i + count - 1) >> 1; + r = (compar)(thread, thunk, key, ((char *) base) + guess * size); + if (r == 0) { + *idx = guess; + return ((char *) base) + guess * size; + } + if (r < 0) + count = guess; + else + i = guess + 1; + } + + *idx = i; + return NULL; +} + diff --git a/storage/pbxt/src/bsearch_xt.h b/storage/pbxt/src/bsearch_xt.h new file mode 100644 index 00000000000..f15e28009fb --- /dev/null +++ b/storage/pbxt/src/bsearch_xt.h @@ -0,0 +1,32 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2004-01-03 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_bsearch_h__ +#define __xt_bsearch_h__ + +#include "xt_defs.h" + +struct XTThread; + +void *xt_bsearch(struct XTThread *self, const void *key, register const void *base, size_t count, size_t size, size_t *idx, const void *thunk, XTCompareFunc compar); + +#endif diff --git a/storage/pbxt/src/cache_xt.cc b/storage/pbxt/src/cache_xt.cc new file mode 100644 index 00000000000..0e15475f185 --- /dev/null +++ b/storage/pbxt/src/cache_xt.cc @@ -0,0 +1,1507 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH, Germany + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-05-24 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#ifndef XT_WIN +#include <unistd.h> +#endif + +#include <stdio.h> +#include <time.h> + +#include "pthread_xt.h" +#include "thread_xt.h" +#include "filesys_xt.h" +#include "cache_xt.h" +#include "table_xt.h" +#include "trace_xt.h" +#include "util_xt.h" + +#define XT_TIME_DIFF(start, now) (\ + ((xtWord4) (now) < (xtWord4) (start)) ? 
\ + ((xtWord4) 0XFFFFFFFF - ((xtWord4) (start) - (xtWord4) (now))) : \ + ((xtWord4) (now) - (xtWord4) (start))) + +/* + * ----------------------------------------------------------------------- + * D I S K C A C H E + */ + +#define IDX_CAC_SEGMENT_COUNT ((off_t) 1 << XT_INDEX_CACHE_SEGMENT_SHIFTS) +#define IDX_CAC_SEGMENT_MASK (IDX_CAC_SEGMENT_COUNT - 1) + +//#define IDX_USE_SPINRWLOCK +#define IDX_USE_RWMUTEX +//#define IDX_CAC_USE_PTHREAD_RW + +#ifdef IDX_CAC_USE_FASTWRLOCK +#define IDX_CAC_LOCK_TYPE XTFastRWLockRec +#define IDX_CAC_INIT_LOCK(s, i) xt_fastrwlock_init(s, &(i)->cs_lock) +#define IDX_CAC_FREE_LOCK(s, i) xt_fastrwlock_free(s, &(i)->cs_lock) +#define IDX_CAC_READ_LOCK(i, o) xt_fastrwlock_slock(&(i)->cs_lock, (o)) +#define IDX_CAC_WRITE_LOCK(i, o) xt_fastrwlock_xlock(&(i)->cs_lock, (o)) +#define IDX_CAC_UNLOCK(i, o) xt_fastrwlock_unlock(&(i)->cs_lock, (o)) +#elif defined(IDX_CAC_USE_PTHREAD_RW) +#define IDX_CAC_LOCK_TYPE xt_rwlock_type +#define IDX_CAC_INIT_LOCK(s, i) xt_init_rwlock(s, &(i)->cs_lock) +#define IDX_CAC_FREE_LOCK(s, i) xt_free_rwlock(&(i)->cs_lock) +#define IDX_CAC_READ_LOCK(i, o) xt_slock_rwlock_ns(&(i)->cs_lock) +#define IDX_CAC_WRITE_LOCK(i, o) xt_xlock_rwlock_ns(&(i)->cs_lock) +#define IDX_CAC_UNLOCK(i, o) xt_unlock_rwlock_ns(&(i)->cs_lock) +#elif defined(IDX_USE_RWMUTEX) +#define IDX_CAC_LOCK_TYPE XTRWMutexRec +#define IDX_CAC_INIT_LOCK(s, i) xt_rwmutex_init_with_autoname(s, &(i)->cs_lock) +#define IDX_CAC_FREE_LOCK(s, i) xt_rwmutex_free(s, &(i)->cs_lock) +#define IDX_CAC_READ_LOCK(i, o) xt_rwmutex_slock(&(i)->cs_lock, (o)->t_id) +#define IDX_CAC_WRITE_LOCK(i, o) xt_rwmutex_xlock(&(i)->cs_lock, (o)->t_id) +#define IDX_CAC_UNLOCK(i, o) xt_rwmutex_unlock(&(i)->cs_lock, (o)->t_id) +#endif + +#define ID_HANDLE_USE_SPINLOCK +//#define ID_HANDLE_USE_PTHREAD_RW + +#if defined(ID_HANDLE_USE_PTHREAD_RW) +#define ID_HANDLE_LOCK_TYPE xt_mutex_type +#define ID_HANDLE_INIT_LOCK(s, i) xt_init_mutex_with_autoname(s, i) +#define 
ID_HANDLE_FREE_LOCK(s, i) xt_free_mutex(i) +#define ID_HANDLE_LOCK(i) xt_lock_mutex_ns(i) +#define ID_HANDLE_UNLOCK(i) xt_unlock_mutex_ns(i) +#elif defined(ID_HANDLE_USE_SPINLOCK) +#define ID_HANDLE_LOCK_TYPE XTSpinLockRec +#define ID_HANDLE_INIT_LOCK(s, i) xt_spinlock_init_with_autoname(s, i) +#define ID_HANDLE_FREE_LOCK(s, i) xt_spinlock_free(s, i) +#define ID_HANDLE_LOCK(i) xt_spinlock_lock(i) +#define ID_HANDLE_UNLOCK(i) xt_spinlock_unlock(i) +#endif + +#define XT_HANDLE_SLOTS 37 + +/* +#ifdef DEBUG +#define XT_INIT_HANDLE_COUNT 0 +#define XT_INIT_HANDLE_BLOCKS 0 +#else +#define XT_INIT_HANDLE_COUNT 40 +#define XT_INIT_HANDLE_BLOCKS 10 +#endif +*/ + +/* A disk cache segment. The cache is divided into a number of segments + * to improve concurrency. + */ +typedef struct DcSegment { + IDX_CAC_LOCK_TYPE cs_lock; /* The cache segment lock. */ + XTIndBlockPtr *cs_hash_table; +} DcSegmentRec, *DcSegmentPtr; + +typedef struct DcHandleSlot { + ID_HANDLE_LOCK_TYPE hs_handles_lock; + XTIndHandleBlockPtr hs_free_blocks; + XTIndHandlePtr hs_free_handles; + XTIndHandlePtr hs_used_handles; +} DcHandleSlotRec, *DcHandleSlotPtr; + +typedef struct DcGlobals { + xt_mutex_type cg_lock; /* The public cache lock. */ + DcSegmentRec cg_segment[IDX_CAC_SEGMENT_COUNT]; + XTIndBlockPtr cg_blocks; +#ifdef XT_USE_DIRECT_IO_ON_INDEX + xtWord1 *cg_buffer; +#endif + XTIndBlockPtr cg_free_list; + xtWord4 cg_free_count; + xtWord4 cg_ru_now; /* A counter as described by Jim Starkey (my thanks) */ + XTIndBlockPtr cg_lru_block; + XTIndBlockPtr cg_mru_block; + xtWord4 cg_hash_size; + xtWord4 cg_block_count; + xtWord4 cg_max_free; +#ifdef DEBUG_CHECK_IND_CACHE + u_int cg_reserved_by_ots; /* Number of blocks reserved by open tables. */ + u_int cg_read_count; /* Number of blocks being read. 
*/ +#endif + + /* Index cache handles: */ + DcHandleSlotRec cg_handle_slot[XT_HANDLE_SLOTS]; +} DcGlobalsRec; + +static DcGlobalsRec ind_cac_globals; + +#ifdef XT_USE_MYSYS +#ifdef xtPublic +#undef xtPublic +#endif +#include "my_global.h" +#include "my_sys.h" +#include "keycache.h" +KEY_CACHE my_cache; +#undef pthread_rwlock_rdlock +#undef pthread_rwlock_wrlock +#undef pthread_rwlock_unlock +#undef pthread_mutex_lock +#undef pthread_mutex_unlock +#undef pthread_cond_wait +#undef pthread_cond_broadcast +#undef xt_mutex_type +#define xtPublic +#endif + +/* + * ----------------------------------------------------------------------- + * INDEX CACHE HANDLES + */ + +static XTIndHandlePtr ind_alloc_handle() +{ + XTIndHandlePtr handle; + + if (!(handle = (XTIndHandlePtr) xt_calloc_ns(sizeof(XTIndHandleRec)))) + return NULL; + xt_spinlock_init_with_autoname(NULL, &handle->ih_lock); + return handle; +} + +static void ind_free_handle(XTIndHandlePtr handle) +{ + xt_spinlock_free(NULL, &handle->ih_lock); + xt_free_ns(handle); +} + +static void ind_handle_exit(XTThreadPtr self) +{ + DcHandleSlotPtr hs; + XTIndHandlePtr handle; + XTIndHandleBlockPtr hptr; + + for (int i=0; i<XT_HANDLE_SLOTS; i++) { + hs = &ind_cac_globals.cg_handle_slot[i]; + + while (hs->hs_used_handles) { + handle = hs->hs_used_handles; + xt_ind_release_handle(handle, FALSE, self); + } + + while (hs->hs_free_blocks) { + hptr = hs->hs_free_blocks; + hs->hs_free_blocks = hptr->hb_next; + xt_free(self, hptr); + } + + while (hs->hs_free_handles) { + handle = hs->hs_free_handles; + hs->hs_free_handles = handle->ih_next; + ind_free_handle(handle); + } + + ID_HANDLE_FREE_LOCK(self, &hs->hs_handles_lock); + } +} + +static void ind_handle_init(XTThreadPtr self) +{ + DcHandleSlotPtr hs; + + for (int i=0; i<XT_HANDLE_SLOTS; i++) { + hs = &ind_cac_globals.cg_handle_slot[i]; + memset(hs, 0, sizeof(DcHandleSlotRec)); + ID_HANDLE_INIT_LOCK(self, &hs->hs_handles_lock); + } +} + +//#define CHECK_HANDLE_STRUCTS + +#ifdef 
CHECK_HANDLE_STRUCTS +static int gdummy = 0; + +static void ic_stop_here() +{ + gdummy = gdummy + 1; + printf("Nooo %d!\n", gdummy); +} + +static void ic_check_handle_structs() +{ + XTIndHandlePtr handle, phandle; + XTIndHandleBlockPtr hptr, phptr; + int count = 0; + int ctest; + + phandle = NULL; + handle = ind_cac_globals.cg_used_handles; + while (handle) { + if (handle == phandle) + ic_stop_here(); + if (handle->ih_prev != phandle) + ic_stop_here(); + if (handle->ih_cache_reference) { + ctest = handle->x.ih_cache_block->cb_handle_count; + if (ctest == 0 || ctest > 100) + ic_stop_here(); + } + else { + ctest = handle->x.ih_handle_block->hb_ref_count; + if (ctest == 0 || ctest > 100) + ic_stop_here(); + } + phandle = handle; + handle = handle->ih_next; + count++; + if (count > 1000) + ic_stop_here(); + } + + count = 0; + hptr = ind_cac_globals.cg_free_blocks; + while (hptr) { + if (hptr == phptr) + ic_stop_here(); + phptr = hptr; + hptr = hptr->hb_next; + count++; + if (count > 1000) + ic_stop_here(); + } + + count = 0; + handle = ind_cac_globals.cg_free_handles; + while (handle) { + if (handle == phandle) + ic_stop_here(); + phandle = handle; + handle = handle->ih_next; + count++; + if (count > 1000) + ic_stop_here(); + } +} +#endif + +/* + * Get a handle to the index block. + * This function is called by index scanners (readers). 
+ */ +xtPublic XTIndHandlePtr xt_ind_get_handle(XTOpenTablePtr ot, XTIndexPtr ind, XTIndReferencePtr iref) +{ + DcHandleSlotPtr hs; + XTIndHandlePtr handle; + + hs = &ind_cac_globals.cg_handle_slot[iref->ir_block->cb_address % XT_HANDLE_SLOTS]; + + ASSERT_NS(iref->ir_ulock == XT_UNLOCK_READ); + ID_HANDLE_LOCK(&hs->hs_handles_lock); +#ifdef CHECK_HANDLE_STRUCTS + ic_check_handle_structs(); +#endif + if ((handle = hs->hs_free_handles)) + hs->hs_free_handles = handle->ih_next; + else { + if (!(handle = ind_alloc_handle())) { + ID_HANDLE_UNLOCK(&hs->hs_handles_lock); + xt_ind_release(ot, ind, XT_UNLOCK_READ, iref); + return NULL; + } + } + if (hs->hs_used_handles) + hs->hs_used_handles->ih_prev = handle; + handle->ih_next = hs->hs_used_handles; + handle->ih_prev = NULL; + handle->ih_address = iref->ir_block->cb_address; + handle->ih_cache_reference = TRUE; + handle->x.ih_cache_block = iref->ir_block; + handle->ih_branch = iref->ir_branch; + /* {HANDLE-COUNT-USAGE} + * This is safe because: + * + * I have an Slock on the cache block, and I have + * at least an Slock on the index. + * So this excludes anyone who is reading + * cb_handle_count in the index. + * (all cache block writers, and a freeer). + * + * The increment is safe because I have the list + * lock, which is required by anyone else + * who increments or decrements this value. + */ + iref->ir_block->cb_handle_count++; + hs->hs_used_handles = handle; +#ifdef CHECK_HANDLE_STRUCTS + ic_check_handle_structs(); +#endif + ID_HANDLE_UNLOCK(&hs->hs_handles_lock); + xt_ind_release(ot, ind, XT_UNLOCK_READ, iref); + return handle; +} + +xtPublic void xt_ind_release_handle(XTIndHandlePtr handle, xtBool have_lock, XTThreadPtr thread) +{ + DcHandleSlotPtr hs; + XTIndBlockPtr block = NULL; + u_int hash_idx = NULL; + DcSegmentPtr seg = NULL; + XTIndBlockPtr xblock; + + /* The lock order is: + * 1. Cache segment (cs_lock) - This is only by ind_free_block()! + * 1. S/Slock cache block (cb_lock) + * 2. 
List lock (cg_handles_lock). + * 3. Handle lock (ih_lock) + */ + if (!have_lock) + xt_spinlock_lock(&handle->ih_lock); + + /* Get the lock on the cache page if required: */ + if (handle->ih_cache_reference) { + u_int file_id; + xtIndexNodeID address; + + block = handle->x.ih_cache_block; + + file_id = block->cb_file_id; + address = block->cb_address; + hash_idx = XT_NODE_ID(address) + (file_id * 223); + seg = &ind_cac_globals.cg_segment[hash_idx & IDX_CAC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_INDEX_CACHE_SEGMENT_SHIFTS) % ind_cac_globals.cg_hash_size; + } + + xt_spinlock_unlock(&handle->ih_lock); + + /* Because of the lock order, I have to release the + * handle before I get a lock on the cache block. + * + * But, by doing this, the cache block may be gone! + */ + if (block) { + IDX_CAC_READ_LOCK(seg, thread); + xblock = seg->cs_hash_table[hash_idx]; + while (xblock) { + if (block == xblock) { + /* Found the block... */ + xt_atomicrwlock_xlock(&block->cb_lock, thread->t_id); + goto block_found; + } + xblock = xblock->cb_next; + } + block = NULL; + block_found: + IDX_CAC_UNLOCK(seg, thread); + } + + hs = &ind_cac_globals.cg_handle_slot[handle->ih_address % XT_HANDLE_SLOTS]; + + ID_HANDLE_LOCK(&hs->hs_handles_lock); +#ifdef CHECK_HANDLE_STRUCTS + ic_check_handle_structs(); +#endif + + /* I don't need to lock the handle because I have locked + * the list, and no other thread can change the + * handle without first getting a lock on the list. + * + * In addition, the caller is the only owner of the + * handle, and the only thread with an independent + * reference to the handle. + * All other accesses occur via the list. + */ + + /* Remove the reference to the cache or a handle block: */ + if (handle->ih_cache_reference) { + ASSERT_NS(block == handle->x.ih_cache_block); + ASSERT_NS(block && block->cb_handle_count > 0); + /* {HANDLE-COUNT-USAGE} + * This is safe here because I have excluded + * all readers by taking an Xlock on the + * cache block. 
+ */ + block->cb_handle_count--; + } + else { + XTIndHandleBlockPtr hptr = handle->x.ih_handle_block; + + ASSERT_NS(!handle->ih_cache_reference); + ASSERT_NS(hptr->hb_ref_count > 0); + hptr->hb_ref_count--; + if (!hptr->hb_ref_count) { + /* Put it back on the free list: */ + hptr->hb_next = hs->hs_free_blocks; + hs->hs_free_blocks = hptr; + } + } + + /* Unlink the handle: */ + if (handle->ih_next) + handle->ih_next->ih_prev = handle->ih_prev; + if (handle->ih_prev) + handle->ih_prev->ih_next = handle->ih_next; + if (hs->hs_used_handles == handle) + hs->hs_used_handles = handle->ih_next; + + /* Put it on the free list: */ + handle->ih_next = hs->hs_free_handles; + hs->hs_free_handles = handle; + +#ifdef CHECK_HANDLE_STRUCTS + ic_check_handle_structs(); +#endif + ID_HANDLE_UNLOCK(&hs->hs_handles_lock); + + if (block) + xt_atomicrwlock_unlock(&block->cb_lock, TRUE); +} + +/* Call this function before a referenced cache block is modified! + * This function is called by index updaters. + */ +xtPublic xtBool xt_ind_copy_on_write(XTIndReferencePtr iref) +{ + DcHandleSlotPtr hs; + XTIndHandleBlockPtr hptr; + u_int branch_size; + XTIndHandlePtr handle; + u_int i = 0; + + hs = &ind_cac_globals.cg_handle_slot[iref->ir_block->cb_address % XT_HANDLE_SLOTS]; + + /* {HANDLE-COUNT-USAGE} + * This is only called by updaters of this index block, or + * the free which holds an Xlock on the index block. + * + * These are all mutually exclusive for the index block. 
+ */ + ASSERT_NS(iref->ir_block->cb_handle_count); + if (!iref->ir_block->cb_handle_count) + return OK; + + ID_HANDLE_LOCK(&hs->hs_handles_lock); +#ifdef CHECK_HANDLE_STRUCTS + ic_check_handle_structs(); +#endif + if ((hptr = hs->hs_free_blocks)) + hs->hs_free_blocks = hptr->hb_next; + else { + if (!(hptr = (XTIndHandleBlockPtr) xt_malloc_ns(sizeof(XTIndHandleBlockRec)))) { + ID_HANDLE_UNLOCK(&hs->hs_handles_lock); + return FAILED; + } + } + + branch_size = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(iref->ir_branch->tb_size_2)); + memcpy(&hptr->hb_branch, iref->ir_branch, branch_size); + hptr->hb_ref_count = iref->ir_block->cb_handle_count; + + handle = hs->hs_used_handles; + while (handle) { + if (handle->ih_branch == iref->ir_branch) { + i++; + xt_spinlock_lock(&handle->ih_lock); + ASSERT_NS(handle->ih_cache_reference); + handle->ih_cache_reference = FALSE; + handle->x.ih_handle_block = hptr; + handle->ih_branch = &hptr->hb_branch; + xt_spinlock_unlock(&handle->ih_lock); +#ifndef DEBUG + if (i == hptr->hb_ref_count) + break; +#endif + } + handle = handle->ih_next; + } +#ifdef DEBUG + ASSERT_NS(hptr->hb_ref_count == i); +#endif + /* {HANDLE-COUNT-USAGE} + * It is safe to modify cb_handle_count when I have the + * list lock, and I have excluded all readers! + */ + iref->ir_block->cb_handle_count = 0; +#ifdef CHECK_HANDLE_STRUCTS + ic_check_handle_structs(); +#endif + ID_HANDLE_UNLOCK(&hs->hs_handles_lock); + + return OK; +} + +xtPublic void xt_ind_lock_handle(XTIndHandlePtr handle) +{ + xt_spinlock_lock(&handle->ih_lock); +} + +xtPublic void xt_ind_unlock_handle(XTIndHandlePtr handle) +{ + xt_spinlock_unlock(&handle->ih_lock); +} + +/* + * ----------------------------------------------------------------------- + * INIT/EXIT + */ + +/* + * Initialize the disk cache. 
+ */ +xtPublic void xt_ind_init(XTThreadPtr self, size_t cache_size) +{ + XTIndBlockPtr block; + +#ifdef XT_USE_MYSYS + init_key_cache(&my_cache, 1024, cache_size, 100, 300); +#endif + /* Memory is devoted to the page data alone, I no longer count the size of the directory, + * or the page overhead: */ + ind_cac_globals.cg_block_count = cache_size / XT_INDEX_PAGE_SIZE; + ind_cac_globals.cg_hash_size = ind_cac_globals.cg_block_count / (IDX_CAC_SEGMENT_COUNT >> 1); + ind_cac_globals.cg_max_free = ind_cac_globals.cg_block_count / 10; + if (ind_cac_globals.cg_max_free < 8) + ind_cac_globals.cg_max_free = 8; + if (ind_cac_globals.cg_max_free > 128) + ind_cac_globals.cg_max_free = 128; + + try_(a) { + for (u_int i=0; i<IDX_CAC_SEGMENT_COUNT; i++) { + ind_cac_globals.cg_segment[i].cs_hash_table = (XTIndBlockPtr *) xt_calloc(self, ind_cac_globals.cg_hash_size * sizeof(XTIndBlockPtr)); + IDX_CAC_INIT_LOCK(self, &ind_cac_globals.cg_segment[i]); + } + + block = (XTIndBlockPtr) xt_malloc(self, ind_cac_globals.cg_block_count * sizeof(XTIndBlockRec)); + ind_cac_globals.cg_blocks = block; + xt_init_mutex_with_autoname(self, &ind_cac_globals.cg_lock); +#ifdef XT_USE_DIRECT_IO_ON_INDEX + xtWord1 *buffer; +#ifdef XT_WIN + size_t psize = 512; +#else + size_t psize = getpagesize(); +#endif + size_t diff; + + buffer = (xtWord1 *) xt_malloc(self, (ind_cac_globals.cg_block_count * XT_INDEX_PAGE_SIZE)); + diff = (size_t) buffer % psize; + if (diff != 0) { + xt_free(self, buffer); + buffer = (xtWord1 *) xt_malloc(self, (ind_cac_globals.cg_block_count * XT_INDEX_PAGE_SIZE) + psize); + diff = (size_t) buffer % psize; + if (diff != 0) + diff = psize - diff; + } + ind_cac_globals.cg_buffer = buffer; + buffer += diff; +#endif + + for (u_int i=0; i<ind_cac_globals.cg_block_count; i++) { + xt_atomicrwlock_init_with_autoname(self, &block->cb_lock); + block->cb_state = IDX_CAC_BLOCK_FREE; + block->cb_next = ind_cac_globals.cg_free_list; +#ifdef XT_USE_DIRECT_IO_ON_INDEX + block->cb_data = buffer; + 
buffer += XT_INDEX_PAGE_SIZE; +#endif + ind_cac_globals.cg_free_list = block; + block++; + } + ind_cac_globals.cg_free_count = ind_cac_globals.cg_block_count; +#ifdef DEBUG_CHECK_IND_CACHE + ind_cac_globals.cg_reserved_by_ots = 0; +#endif + ind_handle_init(self); + } + catch_(a) { + xt_ind_exit(self); + throw_(); + } + cont_(a); +} + +xtPublic void xt_ind_exit(XTThreadPtr self) +{ +#ifdef XT_USE_MYSYS + end_key_cache(&my_cache, 1); +#endif + for (u_int i=0; i<IDX_CAC_SEGMENT_COUNT; i++) { + if (ind_cac_globals.cg_segment[i].cs_hash_table) { + xt_free(self, ind_cac_globals.cg_segment[i].cs_hash_table); + ind_cac_globals.cg_segment[i].cs_hash_table = NULL; + IDX_CAC_FREE_LOCK(self, &ind_cac_globals.cg_segment[i]); + } + } + + if (ind_cac_globals.cg_blocks) { + xt_free(self, ind_cac_globals.cg_blocks); + ind_cac_globals.cg_blocks = NULL; + xt_free_mutex(&ind_cac_globals.cg_lock); + } +#ifdef XT_USE_DIRECT_IO_ON_INDEX + if (ind_cac_globals.cg_buffer) { + xt_free(self, ind_cac_globals.cg_buffer); + ind_cac_globals.cg_buffer = NULL; + } +#endif + ind_handle_exit(self); + + memset(&ind_cac_globals, 0, sizeof(ind_cac_globals)); +} + +xtPublic xtInt8 xt_ind_get_usage() +{ + xtInt8 size = 0; + + size = (xtInt8) (ind_cac_globals.cg_block_count - ind_cac_globals.cg_free_count) * (xtInt8) XT_INDEX_PAGE_SIZE; + return size; +} + +xtPublic xtInt8 xt_ind_get_size() +{ + xtInt8 size = 0; + + size = (xtInt8) ind_cac_globals.cg_block_count * (xtInt8) XT_INDEX_PAGE_SIZE; + return size; +} + +/* + * ----------------------------------------------------------------------- + * INDEX CHECKING + */ + +xtPublic void xt_ind_check_cache(XTIndexPtr ind) +{ + XTIndBlockPtr block; + u_int free_count, inuse_count, clean_count; + xtBool check_count = FALSE; + + if (ind == (XTIndex *) 1) { + ind = NULL; + check_count = TRUE; + } + + // Check the dirty list: + if (ind) { + u_int cnt = 0; + + block = ind->mi_dirty_list; + while (block) { + cnt++; + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_DIRTY); + 
block = block->cb_dirty_next; + } + ASSERT_NS(ind->mi_dirty_blocks == cnt); + } + + xt_lock_mutex_ns(&ind_cac_globals.cg_lock); + + // Check the free list: + free_count = 0; + block = ind_cac_globals.cg_free_list; + while (block) { + free_count++; + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_FREE); + block = block->cb_next; + } + ASSERT_NS(ind_cac_globals.cg_free_count == free_count); + + /* Check the LRU list: */ + XTIndBlockPtr list_block, plist_block; + + plist_block = NULL; + list_block = ind_cac_globals.cg_lru_block; + if (list_block) { + ASSERT_NS(ind_cac_globals.cg_mru_block != NULL); + ASSERT_NS(ind_cac_globals.cg_mru_block->cb_mr_used == NULL); + ASSERT_NS(list_block->cb_lr_used == NULL); + inuse_count = 0; + clean_count = 0; + while (list_block) { + inuse_count++; + ASSERT_NS(list_block->cb_state == IDX_CAC_BLOCK_DIRTY || list_block->cb_state == IDX_CAC_BLOCK_CLEAN); + if (list_block->cb_state == IDX_CAC_BLOCK_CLEAN) + clean_count++; + ASSERT_NS(block != list_block); + ASSERT_NS(list_block->cb_lr_used == plist_block); + plist_block = list_block; + list_block = list_block->cb_mr_used; + } + ASSERT_NS(ind_cac_globals.cg_mru_block == plist_block); + } + else { + inuse_count = 0; + clean_count = 0; + ASSERT_NS(ind_cac_globals.cg_mru_block == NULL); + } + +#ifdef DEBUG_CHECK_IND_CACHE + ASSERT_NS(free_count + inuse_count + ind_cac_globals.cg_reserved_by_ots + ind_cac_globals.cg_read_count == ind_cac_globals.cg_block_count); +#endif + xt_unlock_mutex_ns(&ind_cac_globals.cg_lock); + if (check_count) { + /* We have just flushed, check how much is now free/clean. 
*/ + if (free_count + clean_count < 10) { + /* This could be a problem: */ + printf("Cache very low!\n"); + } + } +} + +#ifdef XXXXDEBUG +static void ind_cac_check_on_dirty_list(DcSegmentPtr seg, XTIndBlockPtr block) +{ + XTIndBlockPtr list_block, plist_block; + xtBool found = FALSE; + + plist_block = NULL; + list_block = seg->cs_dirty_list[block->cb_file_id % XT_INDEX_CACHE_FILE_SLOTS]; + while (list_block) { + ASSERT_NS(list_block->cb_state == IDX_CAC_BLOCK_DIRTY); + ASSERT_NS(list_block->cb_dirty_prev == plist_block); + if (list_block == block) + found = TRUE; + plist_block = list_block; + list_block = list_block->cb_dirty_next; + } + ASSERT_NS(found); +} + +static void ind_cac_check_dirty_list(DcSegmentPtr seg, XTIndBlockPtr block) +{ + XTIndBlockPtr list_block, plist_block; + + for (u_int j=0; j<XT_INDEX_CACHE_FILE_SLOTS; j++) { + plist_block = NULL; + list_block = seg->cs_dirty_list[j]; + while (list_block) { + ASSERT_NS(list_block->cb_state == IDX_CAC_BLOCK_DIRTY); + ASSERT_NS(block != list_block); + ASSERT_NS(list_block->cb_dirty_prev == plist_block); + plist_block = list_block; + list_block = list_block->cb_dirty_next; + } + } +} + +#endif + +/* + * ----------------------------------------------------------------------- + * FREEING INDEX CACHE + */ + +/* + * This function returns TRUE if the block is freed. + * This function returns FALSE if the block cannot be found, or the + * block is not clean. + * + * We also return FALSE if we cannot copy the block to the handle + * (if this is required). This will be due to out-of-memory! 
+ */ +static xtBool ind_free_block(XTOpenTablePtr ot, XTIndBlockPtr block) +{ + XTIndBlockPtr xblock, pxblock; + u_int hash_idx; + u_int file_id; + xtIndexNodeID address; + DcSegmentPtr seg; + +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + file_id = block->cb_file_id; + address = block->cb_address; + + hash_idx = XT_NODE_ID(address) + (file_id * 223); + seg = &ind_cac_globals.cg_segment[hash_idx & IDX_CAC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_INDEX_CACHE_SEGMENT_SHIFTS) % ind_cac_globals.cg_hash_size; + + IDX_CAC_WRITE_LOCK(seg, ot->ot_thread); + + pxblock = NULL; + xblock = seg->cs_hash_table[hash_idx]; + while (xblock) { + if (block == xblock) { + /* Found the block... */ + xt_atomicrwlock_xlock(&block->cb_lock, ot->ot_thread->t_id); + if (block->cb_state != IDX_CAC_BLOCK_CLEAN) { + /* This block cannot be freed: */ + xt_atomicrwlock_unlock(&block->cb_lock, TRUE); + IDX_CAC_UNLOCK(seg, ot->ot_thread); +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return FALSE; + } + + goto free_the_block; + } + pxblock = xblock; + xblock = xblock->cb_next; + } + + IDX_CAC_UNLOCK(seg, ot->ot_thread); + + /* Not found (this can happen, if block was freed by another thread) */ +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return FALSE; + + free_the_block: + + /* If the block is referenced by a handle, then we + * have to copy the data to the handle before we + * free the page: + */ + /* {HANDLE-COUNT-USAGE} + * This access is safe because: + * + * We have an Xlock on the cache block, which excludes + * all other writers that want to change the cache block + * and also all readers of the cache block, because + * they all have at least an Slock on the cache block. 
+ */ + if (block->cb_handle_count) { + XTIndReferenceRec iref; + + iref.ir_ulock = XT_UNLOCK_WRITE; + iref.ir_block = block; + iref.ir_branch = (XTIdxBranchDPtr) block->cb_data; + if (!xt_ind_copy_on_write(&iref)) { + xt_atomicrwlock_unlock(&block->cb_lock, TRUE); + return FALSE; + } + } + + /* Block is clean, remove from the hash table: */ + if (pxblock) + pxblock->cb_next = block->cb_next; + else + seg->cs_hash_table[hash_idx] = block->cb_next; + + xt_lock_mutex_ns(&ind_cac_globals.cg_lock); + + /* Remove from the MRU list: */ + if (ind_cac_globals.cg_lru_block == block) + ind_cac_globals.cg_lru_block = block->cb_mr_used; + if (ind_cac_globals.cg_mru_block == block) + ind_cac_globals.cg_mru_block = block->cb_lr_used; + + /* Note, I am updating blocks for which I have no lock + * here. But I think this is OK because I have a lock + * for the MRU list. + */ + if (block->cb_lr_used) + block->cb_lr_used->cb_mr_used = block->cb_mr_used; + if (block->cb_mr_used) + block->cb_mr_used->cb_lr_used = block->cb_lr_used; + + /* The block is now free: */ + block->cb_next = ind_cac_globals.cg_free_list; + ind_cac_globals.cg_free_list = block; + ind_cac_globals.cg_free_count++; + block->cb_state = IDX_CAC_BLOCK_FREE; + IDX_TRACE("%d- f%x\n", (int) XT_NODE_ID(address), (int) XT_GET_DISK_2(block->cb_data)); + + /* Unlock BEFORE the block is reused! */ + xt_atomicrwlock_unlock(&block->cb_lock, TRUE); + + xt_unlock_mutex_ns(&ind_cac_globals.cg_lock); + + IDX_CAC_UNLOCK(seg, ot->ot_thread); + +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return TRUE; +} + +#define IND_CACHE_MAX_BLOCKS_TO_FREE 100 + +/* + * Return the number of blocks freed. + * + * The idea is to grab a list of blocks to free. + * The list consists of the LRU blocks that are + * clean. + * + * Free as many as possible (up to max of blocks_required) + * from the list, even if LRU position has changed + * (or we have a race if there are too few blocks). 
+ * However, if the block cannot be found, or is dirty + * we must skip it. + * + * Repeat until we find no blocks for the list, or + * we have freed 'blocks_required'. + * + * 'not_this' is a block that must not be freed because + * it is locked by the calling thread! + */ +static u_int ind_cac_free_lru_blocks(XTOpenTablePtr ot, u_int blocks_required, XTIdxBranchDPtr not_this) +{ + register DcGlobalsRec *dcg = &ind_cac_globals; + XTIndBlockPtr to_free[IND_CACHE_MAX_BLOCKS_TO_FREE]; + int count; + XTIndBlockPtr block; + u_int blocks_freed = 0; + XTIndBlockPtr locked_block; + +#ifdef XT_USE_DIRECT_IO_ON_INDEX +#error This will not work! +#endif + locked_block = (XTIndBlockPtr) ((xtWord1 *) not_this - offsetof(XTIndBlockRec, cb_data)); + + retry: + xt_lock_mutex_ns(&ind_cac_globals.cg_lock); + block = dcg->cg_lru_block; + count = 0; + while (block && count < IND_CACHE_MAX_BLOCKS_TO_FREE) { + if (block != locked_block && block->cb_state == IDX_CAC_BLOCK_CLEAN) { + to_free[count] = block; + count++; + } + block = block->cb_mr_used; + } + xt_unlock_mutex_ns(&ind_cac_globals.cg_lock); + + if (!count) + return blocks_freed; + + for (int i=0; i<count; i++) { + if (ind_free_block(ot, to_free[i])) + blocks_freed++; + if (blocks_freed >= blocks_required && + ind_cac_globals.cg_free_count >= ind_cac_globals.cg_max_free + blocks_required) + return blocks_freed; + } + + goto retry; +} + +/* + * ----------------------------------------------------------------------- + * MAIN CACHE FUNCTIONS + */ + +/* + * Fetch the block. Note, if we are about to write the block + * then there is no need to read it from disk! 
+ */ +static XTIndBlockPtr ind_cac_fetch(XTOpenTablePtr ot, xtIndexNodeID address, DcSegmentPtr *ret_seg, xtBool read_data) +{ + register XTOpenFilePtr file = ot->ot_ind_file; + register XTIndBlockPtr block, new_block; + register DcSegmentPtr seg; + register u_int hash_idx; + register DcGlobalsRec *dcg = &ind_cac_globals; + size_t red_size; + +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + /* Address, plus file ID multiplied by my favorite prime number! */ + hash_idx = XT_NODE_ID(address) + (file->fr_id * 223); + seg = &dcg->cg_segment[hash_idx & IDX_CAC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_INDEX_CACHE_SEGMENT_SHIFTS) % dcg->cg_hash_size; + + IDX_CAC_READ_LOCK(seg, ot->ot_thread); + block = seg->cs_hash_table[hash_idx]; + while (block) { + if (XT_NODE_ID(block->cb_address) == XT_NODE_ID(address) && block->cb_file_id == file->fr_id) { + ASSERT_NS(block->cb_state != IDX_CAC_BLOCK_FREE); + + /* Check how recently this page has been used: */ + if (XT_TIME_DIFF(block->cb_ru_time, dcg->cg_ru_now) > (dcg->cg_block_count >> 1)) { + xt_lock_mutex_ns(&dcg->cg_lock); + + /* Move to the front of the MRU list: */ + block->cb_ru_time = ++dcg->cg_ru_now; + if (dcg->cg_mru_block != block) { + /* Remove from the MRU list: */ + if (dcg->cg_lru_block == block) + dcg->cg_lru_block = block->cb_mr_used; + if (block->cb_lr_used) + block->cb_lr_used->cb_mr_used = block->cb_mr_used; + if (block->cb_mr_used) + block->cb_mr_used->cb_lr_used = block->cb_lr_used; + + /* Make the block the most recently used: */ + if ((block->cb_lr_used = dcg->cg_mru_block)) + dcg->cg_mru_block->cb_mr_used = block; + block->cb_mr_used = NULL; + dcg->cg_mru_block = block; + if (!dcg->cg_lru_block) + dcg->cg_lru_block = block; + } + + xt_unlock_mutex_ns(&dcg->cg_lock); + } + + *ret_seg = seg; +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + ot->ot_thread->st_statistics.st_ind_cache_hit++; + return block; + } + block = block->cb_next; + } + + /* Block not found... 
*/ + IDX_CAC_UNLOCK(seg, ot->ot_thread); + + /* Check the open table reserve list first: */ + if ((new_block = ot->ot_ind_res_bufs)) { + ot->ot_ind_res_bufs = new_block->cb_next; + ot->ot_ind_res_count--; +#ifdef DEBUG_CHECK_IND_CACHE + xt_lock_mutex_ns(&dcg->cg_lock); + dcg->cg_reserved_by_ots--; + dcg->cg_read_count++; + xt_unlock_mutex_ns(&dcg->cg_lock); +#endif + goto use_free_block; + } + + free_some_blocks: + if (!dcg->cg_free_list) { + if (!ind_cac_free_lru_blocks(ot, 1, NULL)) { + if (!dcg->cg_free_list) { + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_NO_INDEX_CACHE); +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return NULL; + } + } + } + + /* Get a free block: */ + xt_lock_mutex_ns(&dcg->cg_lock); + if (!(new_block = dcg->cg_free_list)) { + xt_unlock_mutex_ns(&dcg->cg_lock); + goto free_some_blocks; + } + ASSERT_NS(new_block->cb_state == IDX_CAC_BLOCK_FREE); + dcg->cg_free_list = new_block->cb_next; + dcg->cg_free_count--; +#ifdef DEBUG_CHECK_IND_CACHE + dcg->cg_read_count++; +#endif + xt_unlock_mutex_ns(&dcg->cg_lock); + + use_free_block: + new_block->cb_address = address; + new_block->cb_file_id = file->fr_id; + new_block->cb_state = IDX_CAC_BLOCK_CLEAN; + new_block->cb_handle_count = 0; + new_block->cp_flush_seq = 0; + new_block->cb_dirty_next = NULL; + new_block->cb_dirty_prev = NULL; + + if (read_data) { + if (!xt_pread_file(file, xt_ind_node_to_offset(ot->ot_table, address), XT_INDEX_PAGE_SIZE, 0, new_block->cb_data, &red_size, &ot->ot_thread->st_statistics.st_ind, ot->ot_thread)) { + xt_lock_mutex_ns(&dcg->cg_lock); + new_block->cb_next = dcg->cg_free_list; + dcg->cg_free_list = new_block; + dcg->cg_free_count++; +#ifdef DEBUG_CHECK_IND_CACHE + dcg->cg_read_count--; +#endif + new_block->cb_state = IDX_CAC_BLOCK_FREE; + IDX_TRACE("%d- F%x\n", (int) XT_NODE_ID(address), (int) XT_GET_DISK_2(new_block->cb_data)); + xt_unlock_mutex_ns(&dcg->cg_lock); +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return NULL; + } 
+ IDX_TRACE("%d- R%x\n", (int) XT_NODE_ID(address), (int) XT_GET_DISK_2(new_block->cb_data)); + ot->ot_thread->st_statistics.st_ind_cache_miss++; + } + else + red_size = 0; + // PMC - I don't think this is required! memset(new_block->cb_data + red_size, 0, XT_INDEX_PAGE_SIZE - red_size); + + IDX_CAC_WRITE_LOCK(seg, ot->ot_thread); + block = seg->cs_hash_table[hash_idx]; + while (block) { + if (XT_NODE_ID(block->cb_address) == XT_NODE_ID(address) && block->cb_file_id == file->fr_id) { + /* Oops, someone else was faster! */ + xt_lock_mutex_ns(&dcg->cg_lock); + new_block->cb_next = dcg->cg_free_list; + dcg->cg_free_list = new_block; + dcg->cg_free_count++; +#ifdef DEBUG_CHECK_IND_CACHE + dcg->cg_read_count--; +#endif + new_block->cb_state = IDX_CAC_BLOCK_FREE; + IDX_TRACE("%d- F%x\n", (int) XT_NODE_ID(address), (int) XT_GET_DISK_2(new_block->cb_data)); + xt_unlock_mutex_ns(&dcg->cg_lock); + goto done_ok; + } + block = block->cb_next; + } + block = new_block; + + /* Make the block the most recently used: */ + xt_lock_mutex_ns(&dcg->cg_lock); + block->cb_ru_time = ++dcg->cg_ru_now; + if ((block->cb_lr_used = dcg->cg_mru_block)) + dcg->cg_mru_block->cb_mr_used = block; + block->cb_mr_used = NULL; + dcg->cg_mru_block = block; + if (!dcg->cg_lru_block) + dcg->cg_lru_block = block; +#ifdef DEBUG_CHECK_IND_CACHE + dcg->cg_read_count--; +#endif + xt_unlock_mutex_ns(&dcg->cg_lock); + + /* Add to the hash table: */ + block->cb_next = seg->cs_hash_table[hash_idx]; + seg->cs_hash_table[hash_idx] = block; + + done_ok: + *ret_seg = seg; +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return block; +} + +static xtBool ind_cac_get(XTOpenTablePtr ot, xtIndexNodeID address, DcSegmentPtr *ret_seg, XTIndBlockPtr *ret_block) +{ + register XTOpenFilePtr file = ot->ot_ind_file; + register XTIndBlockPtr block; + register DcSegmentPtr seg; + register u_int hash_idx; + register DcGlobalsRec *dcg = &ind_cac_globals; + + hash_idx = XT_NODE_ID(address) + (file->fr_id * 223); + 
seg = &dcg->cg_segment[hash_idx & IDX_CAC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_INDEX_CACHE_SEGMENT_SHIFTS) % dcg->cg_hash_size; + + IDX_CAC_READ_LOCK(seg, ot->ot_thread); + block = seg->cs_hash_table[hash_idx]; + while (block) { + if (XT_NODE_ID(block->cb_address) == XT_NODE_ID(address) && block->cb_file_id == file->fr_id) { + ASSERT_NS(block->cb_state != IDX_CAC_BLOCK_FREE); + + *ret_seg = seg; + *ret_block = block; + return OK; + } + block = block->cb_next; + } + IDX_CAC_UNLOCK(seg, ot->ot_thread); + + /* Block not found: */ + *ret_seg = NULL; + *ret_block = NULL; + return OK; +} + +xtPublic xtBool xt_ind_write(XTOpenTablePtr ot, XTIndexPtr ind, xtIndexNodeID address, size_t size, xtWord1 *data) +{ + XTIndBlockPtr block; + DcSegmentPtr seg; + + if (!(block = ind_cac_fetch(ot, address, &seg, FALSE))) + return FAILED; + + xt_atomicrwlock_xlock(&block->cb_lock, ot->ot_thread->t_id); + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_CLEAN || block->cb_state == IDX_CAC_BLOCK_DIRTY); + memcpy(block->cb_data, data, size); + block->cp_flush_seq = ot->ot_table->tab_ind_flush_seq; + if (block->cb_state != IDX_CAC_BLOCK_DIRTY) { + TRACK_BLOCK_WRITE(offset); + xt_spinlock_lock(&ind->mi_dirty_lock); + if ((block->cb_dirty_next = ind->mi_dirty_list)) + ind->mi_dirty_list->cb_dirty_prev = block; + block->cb_dirty_prev = NULL; + ind->mi_dirty_list = block; + ind->mi_dirty_blocks++; + xt_spinlock_unlock(&ind->mi_dirty_lock); + block->cb_state = IDX_CAC_BLOCK_DIRTY; + } + xt_atomicrwlock_unlock(&block->cb_lock, TRUE); + IDX_CAC_UNLOCK(seg, ot->ot_thread); +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed++; +#endif + return OK; +} + +/* + * Update the cache, if in RAM. 
+ */ +xtPublic xtBool xt_ind_write_cache(XTOpenTablePtr ot, xtIndexNodeID address, size_t size, xtWord1 *data) +{ + XTIndBlockPtr block; + DcSegmentPtr seg; + + if (!ind_cac_get(ot, address, &seg, &block)) + return FAILED; + + if (block) { + xt_atomicrwlock_xlock(&block->cb_lock, ot->ot_thread->t_id); + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_CLEAN || block->cb_state == IDX_CAC_BLOCK_DIRTY); + memcpy(block->cb_data, data, size); + xt_atomicrwlock_unlock(&block->cb_lock, TRUE); + IDX_CAC_UNLOCK(seg, ot->ot_thread); + } + + return OK; +} + +xtPublic xtBool xt_ind_clean(XTOpenTablePtr ot, XTIndexPtr ind, xtIndexNodeID address) +{ + XTIndBlockPtr block; + DcSegmentPtr seg; + + if (!ind_cac_get(ot, address, &seg, &block)) + return FAILED; + if (block) { + xt_atomicrwlock_xlock(&block->cb_lock, ot->ot_thread->t_id); + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_CLEAN || block->cb_state == IDX_CAC_BLOCK_DIRTY); + + if (block->cb_state == IDX_CAC_BLOCK_DIRTY) { + /* Take the block off the dirty list: */ + xt_spinlock_lock(&ind->mi_dirty_lock); + if (block->cb_dirty_next) + block->cb_dirty_next->cb_dirty_prev = block->cb_dirty_prev; + if (block->cb_dirty_prev) + block->cb_dirty_prev->cb_dirty_next = block->cb_dirty_next; + if (ind->mi_dirty_list == block) + ind->mi_dirty_list = block->cb_dirty_next; + ind->mi_dirty_blocks--; + xt_spinlock_unlock(&ind->mi_dirty_lock); + block->cb_state = IDX_CAC_BLOCK_CLEAN; + } + xt_atomicrwlock_unlock(&block->cb_lock, TRUE); + + IDX_CAC_UNLOCK(seg, ot->ot_thread); + } + + return OK; +} + +xtPublic xtBool xt_ind_read_bytes(XTOpenTablePtr ot, xtIndexNodeID address, size_t size, xtWord1 *data) +{ + XTIndBlockPtr block; + DcSegmentPtr seg; + + if (!(block = ind_cac_fetch(ot, address, &seg, TRUE))) + return FAILED; + + xt_atomicrwlock_slock(&block->cb_lock); + memcpy(data, block->cb_data, size); + xt_atomicrwlock_unlock(&block->cb_lock, FALSE); + IDX_CAC_UNLOCK(seg, ot->ot_thread); + return OK; +} + +xtPublic xtBool 
xt_ind_fetch(XTOpenTablePtr ot, xtIndexNodeID address, XTPageLockType ltype, XTIndReferencePtr iref) +{ + register XTIndBlockPtr block; + DcSegmentPtr seg; + xtWord2 branch_size; + + ASSERT_NS(iref->ir_ulock == XT_UNLOCK_NONE); + if (!(block = ind_cac_fetch(ot, address, &seg, TRUE))) + return FAILED; + + branch_size = XT_GET_DISK_2(((XTIdxBranchDPtr) block->cb_data)->tb_size_2); + if (XT_GET_INDEX_BLOCK_LEN(branch_size) < 2 || XT_GET_INDEX_BLOCK_LEN(branch_size) > XT_INDEX_PAGE_SIZE) { + IDX_CAC_UNLOCK(seg, ot->ot_thread); + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_INDEX_CORRUPTED, ot->ot_table->tab_name); + return FAILED; + } + + if (ltype == XT_XLOCK_LEAF) { + if (XT_IS_NODE(branch_size)) + ltype = XT_LOCK_READ; + else + ltype = XT_LOCK_WRITE; + } + + if (ltype == XT_LOCK_WRITE) { + xt_atomicrwlock_xlock(&block->cb_lock, ot->ot_thread->t_id); + iref->ir_ulock = XT_UNLOCK_WRITE; + } + else { + xt_atomicrwlock_slock(&block->cb_lock); + iref->ir_ulock = XT_UNLOCK_READ; + } + + IDX_CAC_UNLOCK(seg, ot->ot_thread); + + /* {DIRECT-IO} + * Direct I/O requires that the buffer is 512 byte aligned. + * To do this, cb_data is turned into a pointer, instead + * of an array. 
+ * As a result, we need to pass a pointer to both the + * cache block and the cache block data: + */ + iref->ir_block = block; + iref->ir_branch = (XTIdxBranchDPtr) block->cb_data; + return OK; +} + +xtPublic xtBool xt_ind_release(XTOpenTablePtr ot, XTIndexPtr ind, XTPageUnlockType XT_UNUSED(utype), XTIndReferencePtr iref) +{ + register XTIndBlockPtr block; + + block = iref->ir_block; + + if (utype == XT_UNLOCK_R_UPDATE || utype == XT_UNLOCK_W_UPDATE) { + /* The page was updated: */ + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_CLEAN || block->cb_state == IDX_CAC_BLOCK_DIRTY); + block->cp_flush_seq = ot->ot_table->tab_ind_flush_seq; + if (block->cb_state != IDX_CAC_BLOCK_DIRTY) { + TRACK_BLOCK_WRITE(offset); + xt_spinlock_lock(&ind->mi_dirty_lock); + if ((block->cb_dirty_next = ind->mi_dirty_list)) + ind->mi_dirty_list->cb_dirty_prev = block; + block->cb_dirty_prev = NULL; + ind->mi_dirty_list = block; + ind->mi_dirty_blocks++; + xt_spinlock_unlock(&ind->mi_dirty_lock); + block->cb_state = IDX_CAC_BLOCK_DIRTY; + } + } + +#ifdef DEBUG + if (utype == XT_UNLOCK_W_UPDATE) + utype = XT_UNLOCK_WRITE; + else if (utype == XT_UNLOCK_R_UPDATE) + utype = XT_UNLOCK_READ; + ASSERT_NS(iref->ir_ulock == utype); +#endif + xt_atomicrwlock_unlock(&block->cb_lock, iref->ir_ulock == XT_UNLOCK_WRITE ? 
TRUE : FALSE); +#ifdef DEBUG + iref->ir_ulock = XT_UNLOCK_NONE; +#endif + return OK; +} + +xtPublic xtBool xt_ind_reserve(XTOpenTablePtr ot, u_int count, XTIdxBranchDPtr not_this) +{ + register XTIndBlockPtr block; + register DcGlobalsRec *dcg = &ind_cac_globals; + +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_reserved = count; + ot->ot_ind_reads = 0; +#endif +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + while (ot->ot_ind_res_count < count) { + if (!dcg->cg_free_list) { + if (!ind_cac_free_lru_blocks(ot, count - ot->ot_ind_res_count, not_this)) { + if (!dcg->cg_free_list) { + xt_ind_free_reserved(ot); + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_NO_INDEX_CACHE); +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return FAILED; + } + } + } + + /* Get a free block: */ + xt_lock_mutex_ns(&dcg->cg_lock); + while (ot->ot_ind_res_count < count && (block = dcg->cg_free_list)) { + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_FREE); + dcg->cg_free_list = block->cb_next; + dcg->cg_free_count--; + block->cb_next = ot->ot_ind_res_bufs; + ot->ot_ind_res_bufs = block; + ot->ot_ind_res_count++; +#ifdef DEBUG_CHECK_IND_CACHE + dcg->cg_reserved_by_ots++; +#endif + } + xt_unlock_mutex_ns(&dcg->cg_lock); + } +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return OK; +} + +xtPublic void xt_ind_free_reserved(XTOpenTablePtr ot) +{ +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + if (ot->ot_ind_res_bufs) { + register XTIndBlockPtr block, fblock; + register DcGlobalsRec *dcg = &ind_cac_globals; + + xt_lock_mutex_ns(&dcg->cg_lock); + block = ot->ot_ind_res_bufs; + while (block) { + fblock = block; + block = block->cb_next; + + fblock->cb_next = dcg->cg_free_list; + dcg->cg_free_list = fblock; +#ifdef DEBUG_CHECK_IND_CACHE + dcg->cg_reserved_by_ots--; +#endif + dcg->cg_free_count++; + } + xt_unlock_mutex_ns(&dcg->cg_lock); + ot->ot_ind_res_bufs = NULL; + ot->ot_ind_res_count = 0; + } +#ifdef DEBUG_CHECK_IND_CACHE + 
xt_ind_check_cache(NULL); +#endif +} + +xtPublic void xt_ind_unreserve(XTOpenTablePtr ot) +{ + if (!ind_cac_globals.cg_free_list) + xt_ind_free_reserved(ot); +} + +xtPublic void xt_load_indices(XTThreadPtr self, XTOpenTablePtr ot) +{ + register XTTableHPtr tab = ot->ot_table; + register XTIndBlockPtr block; + DcSegmentPtr seg; + xtIndexNodeID id; + + xt_lock_mutex_ns(&tab->tab_ind_flush_lock); + + for (id=1; id < XT_NODE_ID(tab->tab_ind_eof); id++) { + if (!(block = ind_cac_fetch(ot, id, &seg, TRUE))) { + xt_unlock_mutex_ns(&tab->tab_ind_flush_lock); + xt_throw(self); + } + IDX_CAC_UNLOCK(seg, ot->ot_thread); + } + + xt_unlock_mutex_ns(&tab->tab_ind_flush_lock); +} + + diff --git a/storage/pbxt/src/cache_xt.h b/storage/pbxt/src/cache_xt.h new file mode 100644 index 00000000000..d113bb2f907 --- /dev/null +++ b/storage/pbxt/src/cache_xt.h @@ -0,0 +1,148 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-05-24 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_cache_h__ +#define __xt_cache_h__ + +//#define XT_USE_MYSYS + +#include "filesys_xt.h" +#include "index_xt.h" + +struct XTOpenTable; +struct XTIdxReadBuffer; + +#ifdef DEBUG +//#define XT_USE_CACHE_DEBUG_SIZES +#endif + +#ifdef XT_USE_CACHE_DEBUG_SIZES +#define XT_INDEX_CACHE_SEGMENT_SHIFTS 1 +#else +#define XT_INDEX_CACHE_SEGMENT_SHIFTS 3 +#endif + +#define IDX_CAC_BLOCK_FREE 0 +#define IDX_CAC_BLOCK_CLEAN 1 +#define IDX_CAC_BLOCK_DIRTY 2 + +typedef enum XTPageLockType { XT_LOCK_READ, XT_LOCK_WRITE, XT_XLOCK_LEAF }; +typedef enum XTPageUnlockType { XT_UNLOCK_NONE, XT_UNLOCK_READ, XT_UNLOCK_WRITE, XT_UNLOCK_R_UPDATE, XT_UNLOCK_W_UPDATE }; + +/* A block is X locked if it is being changed or freed. + * A block is S locked if it is being read. + */ +typedef struct XTIndBlock { + xtIndexNodeID cb_address; /* The block address. */ + u_int cb_file_id; /* The file id of the block. */ + /* This is protected by cs_lock */ + struct XTIndBlock *cb_next; /* Pointer to next block on hash list, or next free block on free list. */ + /* This is protected by mi_dirty_lock */ + struct XTIndBlock *cb_dirty_next; /* Double link for dirty blocks, next pointer. */ + struct XTIndBlock *cb_dirty_prev; /* Double link for dirty blocks, previous pointer. */ + /* This is protected by cg_lock */ + xtWord4 cb_ru_time; /* If this is in the top 1/4 don't change position in MRU list. */ + struct XTIndBlock *cb_mr_used; /* More recently used blocks. */ + struct XTIndBlock *cb_lr_used; /* Less recently used blocks. */ + /* Protected by cb_lock: */ + XTAtomicRWLockRec cb_lock; + xtWord1 cb_state; /* Block status. */ + xtWord2 cb_handle_count; /* TRUE if this page is referenced by a handle. 
*/ + xtWord2 cp_flush_seq; +#ifdef XT_USE_DIRECT_IO_ON_INDEX + xtWord1 *cb_data; +#else + xtWord1 cb_data[XT_INDEX_PAGE_SIZE]; +#endif +} XTIndBlockRec, *XTIndBlockPtr; + +typedef struct XTIndReference { + XTPageUnlockType ir_ulock; + XTIndBlockPtr ir_block; + XTIdxBranchDPtr ir_branch; +} XTIndReferenceRec, *XTIndReferencePtr; + +typedef struct XTIndFreeBlock { + XTDiskValue1 if_status_1; + XTDiskValue1 if_unused1_1; + XTDiskValue2 if_unused2_2; + XTDiskValue4 if_unused3_4; + XTDiskValue8 if_next_block_8; +} XTIndFreeBlockRec, *XTIndFreeBlockPtr; + +typedef struct XTIndHandleBlock { + xtWord4 hb_ref_count; + struct XTIndHandleBlock *hb_next; + XTIdxBranchDRec hb_branch; +} XTIndHandleBlockRec, *XTIndHandleBlockPtr; + +typedef struct XTIndHandle { + struct XTIndHandle *ih_next; + struct XTIndHandle *ih_prev; + XTSpinLockRec ih_lock; + xtIndexNodeID ih_address; + xtBool ih_cache_reference; /* True if this handle references the cache. */ + union { + XTIndBlockPtr ih_cache_block; + XTIndHandleBlockPtr ih_handle_block; + } x; + XTIdxBranchDPtr ih_branch; +} XTIndHandleRec, *XTIndHandlePtr; + +void xt_ind_init(XTThreadPtr self, size_t cache_size); +void xt_ind_exit(XTThreadPtr self); + +xtInt8 xt_ind_get_usage(); +xtInt8 xt_ind_get_size(); +xtBool xt_ind_write(struct XTOpenTable *ot, XTIndexPtr ind, xtIndexNodeID offset, size_t size, xtWord1 *data); +xtBool xt_ind_write_cache(struct XTOpenTable *ot, xtIndexNodeID offset, size_t size, xtWord1 *data); +xtBool xt_ind_clean(struct XTOpenTable *ot, XTIndexPtr ind, xtIndexNodeID offset); +xtBool xt_ind_read_bytes(struct XTOpenTable *ot, xtIndexNodeID offset, size_t size, xtWord1 *data); +void xt_ind_check_cache(XTIndexPtr ind); +xtBool xt_ind_reserve(struct XTOpenTable *ot, u_int count, XTIdxBranchDPtr not_this); +void xt_ind_free_reserved(struct XTOpenTable *ot); +void xt_ind_unreserve(struct XTOpenTable *ot); +void xt_load_indices(XTThreadPtr self, struct XTOpenTable *ot); + +xtBool xt_ind_fetch(struct XTOpenTable *ot, 
xtIndexNodeID node, XTPageLockType ltype, XTIndReferencePtr iref); +xtBool xt_ind_release(struct XTOpenTable *ot, XTIndexPtr ind, XTPageUnlockType utype, XTIndReferencePtr iref); + +void xt_ind_lock_handle(XTIndHandlePtr handle); +void xt_ind_unlock_handle(XTIndHandlePtr handle); +xtBool xt_ind_copy_on_write(XTIndReferencePtr iref); + +XTIndHandlePtr xt_ind_get_handle(struct XTOpenTable *ot, XTIndexPtr ind, XTIndReferencePtr iref); +void xt_ind_release_handle(XTIndHandlePtr handle, xtBool have_lock, XTThreadPtr thread); + +#ifdef DEBUG +//#define DEBUG_CHECK_IND_CACHE +#endif + +//#define XT_TRACE_INDEX + +#ifdef XT_TRACE_INDEX +#define IDX_TRACE(x, y, z) xt_trace(x, y, z) +#else +#define IDX_TRACE(x, y, z) +#endif + +#endif diff --git a/storage/pbxt/src/ccutils_xt.cc b/storage/pbxt/src/ccutils_xt.cc new file mode 100644 index 00000000000..1d93e4c34b3 --- /dev/null +++ b/storage/pbxt/src/ccutils_xt.cc @@ -0,0 +1,69 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-05-16 Paul McCullagh + * + * H&G2JCtL + * + * C++ Utilities + */ + +#include "xt_config.h" + +#include "pthread_xt.h" +#include "ccutils_xt.h" +#include "bsearch_xt.h" + +static int ccu_compare_object(XTThreadPtr XT_UNUSED(self), register const void XT_UNUSED(*thunk), register const void *a, register const void *b) +{ + XTObject *obj_ptr = (XTObject *) b; + + return obj_ptr->compare(a); +} + +void XTListImp::append(XTThreadPtr self, XTObject *info, void *key) { + size_t idx; + + if (li_item_count == 0) + idx = 0; + else if (li_item_count == 1) { + int r; + + if ((r = li_items[0]->compare(key)) == 0) + idx = 0; + else if (r < 0) + idx = 0; + else + idx = 1; + } + else { + xt_bsearch(self, key, li_items, li_item_count, sizeof(void *), &idx, NULL, ccu_compare_object); + } + + if (!xt_realloc(NULL, (void **) &li_items, (li_item_count + 1) * sizeof(void *))) { + if (li_referenced) + info->release(self); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return; + } + memmove(&li_items[idx+1], &li_items[idx], (li_item_count-idx) * sizeof(void *)); + li_items[idx] = info; + li_item_count++; +} + + diff --git a/storage/pbxt/src/ccutils_xt.h b/storage/pbxt/src/ccutils_xt.h new file mode 100644 index 00000000000..a800073869d --- /dev/null +++ b/storage/pbxt/src/ccutils_xt.h @@ -0,0 +1,220 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-05-16 Paul McCullagh + * + * H&G2JCtL + * + * C++ Utilities + */ + +#ifndef __ccutils_xt_h__ +#define __ccutils_xt_h__ + +#include <errno.h> + +#include "xt_defs.h" +#include "thread_xt.h" + +class XTObject +{ + private: + u_int o_refcnt; + + public: + inline XTObject() { o_refcnt = 1; } + + virtual ~XTObject() { } + + inline void reference() { + o_refcnt++; + } + + inline void release(XTThreadPtr self) { + ASSERT(o_refcnt > 0); + o_refcnt--; + if (o_refcnt == 0) { + finalize(self); + delete this; + } + } + + virtual XTObject *factory(XTThreadPtr self) { + XTObject *new_obj; + + if (!(new_obj = new XTObject())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return new_obj; + } + + virtual XTObject *clone(XTThreadPtr self) { + XTObject *new_obj; + + new_obj = factory(self); + new_obj->init(self, this); + return new_obj; + } + + virtual void init(XTThreadPtr self) { (void) self; } + virtual void init(XTThreadPtr self, XTObject *obj) { (void) obj; init(self); } + virtual void finalize(XTThreadPtr self) { (void) self; } + virtual int compare(const void *key) { (void) key; return -1; } +}; + +class XTListImp +{ + protected: + bool li_referenced; + u_int li_item_count; + XTObject **li_items; + + public: + inline XTListImp() : li_referenced(true), li_item_count(0), li_items(NULL) { } + + inline void setNonReferenced() { li_referenced = false; } + + void append(XTThreadPtr self, XTObject *info) { + if (!xt_realloc(NULL, (void **) &li_items, (li_item_count + 1) * sizeof(void *))) { + if (li_referenced) + 
info->release(self); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return; + } + li_items[li_item_count] = info; + li_item_count++; + } + + void insert(XTThreadPtr self, XTObject *info, u_int i) { + if (!xt_realloc(NULL, (void **) &li_items, (li_item_count + 1) * sizeof(void *))) { + if (li_referenced) + info->release(self); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return; + } + memmove(&li_items[i+1], &li_items[i], (li_item_count-i) * sizeof(XTObject *)); + li_items[i] = info; + li_item_count++; + } + + void addToFront(XTThreadPtr self, XTObject *info) { + insert(self, info, 0); + } + + /* Will sort! */ + void append(XTThreadPtr self, XTObject *info, void *key); + + inline bool remove(XTObject *info) { + for (u_int i=0; i<li_item_count; i++) { + if (li_items[i] == info) { + li_item_count--; + memmove(&li_items[i], &li_items[i+1], (li_item_count - i) * sizeof(XTObject *)); + return true; + } + } + return false; + } + + inline bool remove(XTThreadPtr self, u_int i) { + XTObject *item; + + if (i >= li_item_count) + return false; + item = li_items[i]; + li_item_count--; + memmove(&li_items[i], &li_items[i+1], (li_item_count - i) * sizeof(void *)); + if (li_referenced) + item->release(self); + return true; + } + + inline XTObject *take(u_int i) { + XTObject *item; + + if (i >= li_item_count) + return NULL; + item = li_items[i]; + li_item_count--; + memmove(&li_items[i], &li_items[i+1], (li_item_count - i) * sizeof(void *)); + return item; + } + + inline u_int size() const { return li_item_count; } + + inline void setEmpty(XTThreadPtr self) { + if (li_items) + xt_free(self, li_items); + li_item_count = 0; + li_items = NULL; + } + + inline bool isEmpty() { return li_item_count == 0; } + + inline XTObject *itemAt(u_int i) const { + if (i >= li_item_count) + return NULL; + return li_items[i]; + } +}; + + +template <class T> class XTList : public XTListImp +{ + public: + inline XTList() : XTListImp() { } + + inline void append(XTThreadPtr self, T *a) { 
XTListImp::append(self, a); } + inline void insert(XTThreadPtr self, T *a, u_int i) { XTListImp::insert(self, a, i); } + inline void addToFront(XTThreadPtr self, T *a) { XTListImp::addToFront(self, a); } + + inline bool remove(T *a) { return XTListImp::remove(a); } + + inline bool remove(XTThreadPtr self, u_int i) { return XTListImp::remove(self, i); } + + inline T *take(u_int i) { return (T *) XTListImp::take(i); } + + inline T *itemAt(u_int i) const { return (T *) XTListImp::itemAt(i); } + + inline u_int indexOf(T *a) { + u_int i; + + for (i=0; i<size(); i++) { + if (itemAt(i) == a) + break; + } + return i; + } + + void deleteAll(XTThreadPtr self) + { + for (u_int i=0; i<size(); i++) { + if (li_referenced) + itemAt(i)->release(self); + } + setEmpty(self); + } + + void clone(XTThreadPtr self, XTListImp *list) + { + deleteAll(self); + for (u_int i=0; i<list->size(); i++) { + XTListImp::append(self, list->itemAt(i)->clone(self)); + } + } +}; + +#endif diff --git a/storage/pbxt/src/database_xt.cc b/storage/pbxt/src/database_xt.cc new file mode 100644 index 00000000000..61f479263c3 --- /dev/null +++ b/storage/pbxt/src/database_xt.cc @@ -0,0 +1,1281 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-15 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <string.h> +#include <stdio.h> + +#include "pthread_xt.h" +#include "hashtab_xt.h" +#include "filesys_xt.h" +#include "database_xt.h" +#include "memory_xt.h" +#include "heap_xt.h" +#include "datalog_xt.h" +#include "strutil_xt.h" +#include "util_xt.h" +#include "trace_xt.h" + +#ifdef DEBUG +//#define XT_TEST_XACT_OVERFLOW +#endif + +#ifndef NAME_MAX +#define NAME_MAX 128 +#endif + +/* + * ----------------------------------------------------------------------- + * GLOBALS + */ + +xtPublic xtLogOffset xt_db_log_file_threshold; +xtPublic size_t xt_db_log_buffer_size; +xtPublic size_t xt_db_transaction_buffer_size; +xtPublic size_t xt_db_checkpoint_frequency; +xtPublic off_t xt_db_data_log_threshold; +xtPublic size_t xt_db_data_file_grow_size; +xtPublic size_t xt_db_row_file_grow_size; +xtPublic int xt_db_garbage_threshold; +xtPublic int xt_db_log_file_count; +xtPublic int xt_db_auto_increment_mode; /* 0 = MySQL compatible, 1 = PrimeBase Compatible. */ +xtPublic int xt_db_offline_log_function; /* 0 = recycle logs, 1 = delete logs, 2 = keep logs */ +xtPublic int xt_db_sweeper_priority; /* 0 = low (default), 1 = normal, 2 = high */ + +xtPublic XTSortedListPtr xt_db_open_db_by_id = NULL; +xtPublic XTHashTabPtr xt_db_open_databases = NULL; +xtPublic time_t xt_db_approximate_time = 0; /* A "fast" alternative timer (not too accurate). 
*/ + +static xtDatabaseID db_next_id = 1; +static volatile XTOpenFilePtr db_lock_file = NULL; + +/* + * ----------------------------------------------------------------------- + * LOCK/UNLOCK INSTALLATION + */ + +xtPublic void xt_lock_installation(XTThreadPtr self, char *installation_path) +{ + char file_path[PATH_MAX]; + char buffer[101]; + size_t red_size; + llong pid; + xtBool cd = pbxt_crash_debug; + + xt_strcpy(PATH_MAX, file_path, installation_path); + xt_add_pbxt_file(PATH_MAX, file_path, "no-debug"); + if (xt_fs_exists(file_path)) + pbxt_crash_debug = FALSE; + xt_strcpy(PATH_MAX, file_path, installation_path); + xt_add_pbxt_file(PATH_MAX, file_path, "crash-debug"); + if (xt_fs_exists(file_path)) + pbxt_crash_debug = TRUE; + + if (pbxt_crash_debug != cd) { + if (pbxt_crash_debug) + xt_logf(XT_NT_WARNING, "Crash debugging has been turned on ('crash-debug' file exists)\n"); + else + xt_logf(XT_NT_WARNING, "Crash debugging has been turned off ('no-debug' file exists)\n"); + } + else if (pbxt_crash_debug) + xt_logf(XT_NT_WARNING, "Crash debugging is enabled\n"); + + /* Moved the lock file out of the pbxt directory so that + * it is possible to drop the pbxt database! + */ + xt_strcpy(PATH_MAX, file_path, installation_path); + xt_add_dir_char(PATH_MAX, file_path); + xt_strcat(PATH_MAX, file_path, "pbxt-lock"); + db_lock_file = xt_open_file(self, file_path, XT_FS_CREATE | XT_FS_MAKE_PATH); + + try_(a) { + if (!xt_lock_file(self, db_lock_file)) { + xt_logf(XT_NT_ERROR, "A server appears to already be running\n"); + xt_logf(XT_NT_ERROR, "The file: %s, is locked\n", file_path); + xt_throw_xterr(XT_CONTEXT, XT_ERR_SERVER_RUNNING); + } + if (!xt_pread_file(db_lock_file, 0, 100, 0, buffer, &red_size, &self->st_statistics.st_rec, self)) + xt_throw(self); + if (red_size > 0) { + buffer[red_size] = 0; +#ifdef XT_WIN + pid = (llong) _atoi64(buffer); +#else + pid = atoll(buffer); +#endif + /* Problem with this code is, after a restart + * the process ID's are reused. 
+ * If some system process grabs the proc id that + * the server had on the last run, then + * the database will not start. + if (xt_process_exists((xtProcID) pid)) { + xt_logf(XT_NT_ERROR, "A server appears to already be running, process ID: %lld\n", pid); + xt_logf(XT_NT_ERROR, "Remove the file: %s, if this is not the case\n", file_path); + xt_throw_xterr(XT_CONTEXT, XT_ERR_SERVER_RUNNING); + } + */ + xt_logf(XT_NT_INFO, "The server was not shutdown correctly, recovery required\n"); +#ifdef XT_BACKUP_BEFORE_RECOVERY + if (pbxt_crash_debug) { + /* The server was not shut down correctly. Make a backup before + * we start recovery. + */ + char extension[100]; + + for (int i=1;;i++) { + xt_strcpy(PATH_MAX, file_path, installation_path); + xt_remove_dir_char(file_path); + sprintf(extension, "-recovery-%d", i); + xt_strcat(PATH_MAX, file_path, extension); + if (!xt_fs_exists(file_path)) + break; + } + xt_logf(XT_NT_INFO, "In order to reproduce recovery errors a backup of the installation\n"); + xt_logf(XT_NT_INFO, "will be made to:\n"); + xt_logf(XT_NT_INFO, "%s\n", file_path); + xt_logf(XT_NT_INFO, "Copy in progress...\n"); + xt_fs_copy_dir(self, installation_path, file_path); + xt_logf(XT_NT_INFO, "Copy OK\n"); + } +#endif + } + + sprintf(buffer, "%lld", (llong) xt_getpid()); + xt_set_eof_file(self, db_lock_file, 0); + if (!xt_pwrite_file(db_lock_file, 0, strlen(buffer), buffer, &self->st_statistics.st_rec, self)) + xt_throw(self); + } + catch_(a) { + xt_close_file(self, db_lock_file); + db_lock_file = NULL; + xt_throw(self); + } + cont_(a); +} + +xtPublic void xt_unlock_installation(XTThreadPtr self, char *installation_path) +{ + if (db_lock_file) { + char lock_file[PATH_MAX]; + + xt_unlock_file(NULL, db_lock_file); + xt_close_file_ns(db_lock_file); + db_lock_file = NULL; + + xt_strcpy(PATH_MAX, lock_file, installation_path); + xt_add_dir_char(PATH_MAX, lock_file); + xt_strcat(PATH_MAX, lock_file, "pbxt-lock"); + xt_fs_delete(self, lock_file); + } +} + +int 
*xt_bad_pointer = 0; + +void xt_crash_me(void) +{ + if (pbxt_crash_debug) + *xt_bad_pointer = 123; +} + +/* + * ----------------------------------------------------------------------- + * INIT/EXIT DATABASE + */ + +static xtBool db_hash_comp(void *key, void *data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) data; + + return strcmp((char *) key, db->db_name) == 0; +} + +static xtHashValue db_hash(xtBool is_key, void *key_data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) key_data; + + if (is_key) + return xt_ht_hash((char *) key_data); + return xt_ht_hash(db->db_name); +} + +static xtBool db_hash_comp_ci(void *key, void *data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) data; + + return strcasecmp((char *) key, db->db_name) == 0; +} + +static xtHashValue db_hash_ci(xtBool is_key, void *key_data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) key_data; + + if (is_key) + return xt_ht_casehash((char *) key_data); + return xt_ht_casehash(db->db_name); +} + +static void db_hash_free(XTThreadPtr self, void *data) +{ + xt_heap_release(self, (XTDatabaseHPtr) data); +} + +static int db_cmp_db_id(struct XTThread XT_UNUSED(*self), register const void XT_UNUSED(*thunk), register const void *a, register const void *b) +{ + xtDatabaseID db_id = *((xtDatabaseID *) a); + XTDatabaseHPtr *db_ptr = (XTDatabaseHPtr *) b; + + if (db_id == (*db_ptr)->db_id) + return 0; + if (db_id < (*db_ptr)->db_id) + return -1; + return 1; +} + +xtPublic void xt_init_databases(XTThreadPtr self) +{ + if (pbxt_ignore_case) + xt_db_open_databases = xt_new_hashtable(self, db_hash_comp_ci, db_hash_ci, db_hash_free, TRUE, TRUE); + else + xt_db_open_databases = xt_new_hashtable(self, db_hash_comp, db_hash, db_hash_free, TRUE, TRUE); + xt_db_open_db_by_id = xt_new_sortedlist(self, sizeof(XTDatabaseHPtr), 20, 10, db_cmp_db_id, NULL, NULL, FALSE, FALSE); +} + +xtPublic void xt_stop_database_threads(XTThreadPtr self, xtBool sync) +{ + u_int len = 0; + XTDatabaseHPtr *dbptr; + XTDatabaseHPtr db = NULL; + + if 
(xt_db_open_db_by_id) + len = xt_sl_get_size(xt_db_open_db_by_id); + for (u_int i=0; i<len; i++) { + if ((dbptr = (XTDatabaseHPtr *) xt_sl_item_at(xt_db_open_db_by_id, i))) { + db = *dbptr; + if (sync) { + /* Wait for the sweeper: */ + xt_wait_for_sweeper(self, db, 16); + + /* Wait for the writer: */ + xt_wait_for_writer(self, db); + + /* Wait for the checkpointer: */ + xt_wait_for_checkpointer(self, db); + } + xt_stop_checkpointer(self, db); + xt_stop_writer(self, db); + xt_stop_sweeper(self, db); + xt_stop_compactor(self, db); + } + } +} + +xtPublic void xt_exit_databases(XTThreadPtr self) +{ + if (xt_db_open_databases) { + xt_free_hashtable(self, xt_db_open_databases); + xt_db_open_databases = NULL; + } + if (xt_db_open_db_by_id) { + xt_free_sortedlist(self, xt_db_open_db_by_id); + xt_db_open_db_by_id = NULL; + } +} + +xtPublic void xt_create_database(XTThreadPtr self, char *path) +{ + xt_fs_mkdir(self, path); +} + +static void db_finalize(XTThreadPtr self, void *x) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) x; + + xt_stop_checkpointer(self, db); + xt_stop_compactor(self, db); + xt_stop_sweeper(self, db); + xt_stop_writer(self, db); + + xt_sl_delete(self, xt_db_open_db_by_id, &db->db_id); + /* + * Important is that xt_db_pool_exit() is called + * before xt_xn_exit_db() because xt_xn_exit_db() + * frees the checkpoint information which + * may be required to shutdown the tables, which + * flushes tables, and therefore does a checkpoint. 
+ */ + /* This was the previous order of shutdown: + xt_xn_exit_db(self, db); + xt_dl_exit_db(self, db); + xt_db_pool_exit(self, db); + db->db_indlogs.ilp_exit(self); + */ + + xt_db_pool_exit(self, db); + db->db_indlogs.ilp_exit(self); + xt_dl_exit_db(self, db); + xt_xn_exit_db(self, db); + xt_tab_exit_db(self, db); + if (db->db_name) { + xt_free(self, db->db_name); + db->db_name = NULL; + } + if (db->db_main_path) { + xt_free(self, db->db_main_path); + db->db_main_path = NULL; + } +} + +static void db_onrelease(XTThreadPtr self, void XT_UNUSED(*x)) +{ + /* Signal threads waiting for exclusive use of the database: */ + if (xt_db_open_databases) // The database may already be closed. + xt_ht_signal(self, xt_db_open_databases); +} + +xtPublic void xt_add_pbxt_file(size_t size, char *path, const char *file) +{ + xt_add_dir_char(size, path); + xt_strcat(size, path, "pbxt"); + xt_add_dir_char(size, path); + xt_strcat(size, path, file); +} + +xtPublic void xt_add_location_file(size_t size, char *path) +{ + xt_add_dir_char(size, path); + xt_strcat(size, path, "pbxt"); + xt_add_dir_char(size, path); + xt_strcat(size, path, "location"); +} + +xtPublic void xt_add_pbxt_dir(size_t size, char *path) +{ + xt_add_dir_char(size, path); + xt_strcat(size, path, "pbxt"); +} + +xtPublic void xt_add_system_dir(size_t size, char *path) +{ + xt_add_dir_char(size, path); + xt_strcat(size, path, "pbxt"); + xt_add_dir_char(size, path); + xt_strcat(size, path, "system"); +} + +xtPublic void xt_add_data_dir(size_t size, char *path) +{ + xt_add_dir_char(size, path); + xt_strcat(size, path, "pbxt"); + xt_add_dir_char(size, path); + xt_strcat(size, path, "data"); +} + +/* + * I have a problem here. I cannot rely on the path given to xt_get_database() to be + * consistent. When called from ha_create_table() the path is not modified. + * However when called from ha_open() the path is first transformed by a call to + * fn_format(). I have given an example from a stack trace below. 
+ * + * In this case the odd path comes from the option: + * --tmpdir=/Users/build/Development/mysql/debug-mysql/mysql-test/var//tmp + * + * #3 0x001a3818 in ha_pbxt::create(char const*, st_table*, st_ha_create_information*) + * (this=0x2036898, table_path=0xf0060bd0 "/users/build/development/mysql/debug-my + * sql/mysql-test/var//tmp/#sql5718_1_0.frm", table_arg=0xf00601c0, + * create_info=0x2017410) at ha_pbxt.cc:2323 + * #4 0x00140d74 in ha_create_table(char const*, st_ha_create_information*, bool) + * (name=0xf0060bd0 "/users/build/development/mysql/debug-mysql/mysql-te + * st/var//tmp/#sql5718_1_0.frm", create_info=0x2017410, + * update_create_info=false) at handler.cc:1387 + * + * #4 0x0013f7a4 in handler::ha_open(char const*, int, int) (this=0x203ba98, + * name=0xf005eb70 "/users/build/development/mysql/debug-mysql/mysql-te + * st/var/tmp/#sql5718_1_1", mode=2, test_if_locked=2) at handler.cc:993 + * #5 0x000cd900 in openfrm(char const*, char const*, unsigned, unsigned, + * unsigned, st_table*) (name=0xf005f260 "/users/build/development/mys + * ql/debug-mysql/mysql-test/var//tmp/#sql5718_1_1.frm", + * alias=0xf005fb90 "#sql-5718_1", db_stat=7, prgflag=44, + * ha_open_flags=0, outparam=0x2039e18) at table.cc:771 + * + * As a result, I no longer use the entire path as the key to find a database. + * Just the last component of the path (i.e. the database name) should be + * sufficient!? + */ +xtPublic XTDatabaseHPtr xt_get_database(XTThreadPtr self, char *path, xtBool multi_path) +{ + XTDatabaseHPtr db = NULL; + char db_path[PATH_MAX]; + char db_name[NAME_MAX]; + xtBool multi_path_db = FALSE; + + /* A database may not be in use when this is called. 
*/ + ASSERT(!self->st_database); + xt_ht_lock(self, xt_db_open_databases); + pushr_(xt_ht_unlock, xt_db_open_databases); + + xt_strcpy(PATH_MAX, db_path, path); + xt_add_location_file(PATH_MAX, db_path); + if (multi_path || xt_fs_exists(db_path)) + multi_path_db = TRUE; + + xt_strcpy(PATH_MAX, db_path, path); + xt_remove_dir_char(db_path); + xt_strcpy(NAME_MAX, db_name, xt_last_directory_of_path(db_path)); + + db = (XTDatabaseHPtr) xt_ht_get(self, xt_db_open_databases, db_name); + if (!db) { + pushsr_(db, xt_heap_release, (XTDatabaseHPtr) xt_heap_new(self, sizeof(XTDatabaseRec), db_finalize)); + xt_heap_set_release_callback(self, db, db_onrelease); + db->db_id = db_next_id++; + db->db_name = xt_dup_string(self, db_name); + db->db_main_path = xt_dup_string(self, db_path); + db->db_multi_path = multi_path_db; +#ifdef XT_TEST_XACT_OVERFLOW + /* Test transaction ID overflow: */ + db->db_xn_curr_id = 0xFFFFFFFF - 30; +#endif + xt_db_pool_init(self, db); + xt_tab_init_db(self, db); + xt_dl_init_db(self, db); + + /* Initialize the index logs: */ + db->db_indlogs.ilp_init(self, db, XT_INDEX_WRITE_BUFFER_SIZE); + + xt_xn_init_db(self, db); + xt_sl_insert(self, xt_db_open_db_by_id, &db->db_id, &db); + + xt_start_sweeper(self, db); + xt_start_compactor(self, db); + xt_start_writer(self, db); + xt_start_checkpointer(self, db); + + popr_(); + xt_ht_put(self, xt_db_open_databases, db); + + /* The recovery process could attach parts of the open + * database to the thread! + */ + xt_unuse_database(self, self); + + } + xt_heap_reference(self, db); + freer_(); + + /* {INDEX-RECOV_ROWID} + * Wait for sweeper to finish processing possibly + * unswept transactions after recovery. + * This is required because during recovery for + * all index entries written the row_id is set. + * + * When the row ID is set, this means that the row + * is "clean". i.e. visible to all transactions. + * + * Obviously this is not necessarily the case for all + * index entries recovered. 
For example, + * transactions that still need to be swept may be + * rolled back. + * + * As a result, we have to wait for the sweeper + * to complete. Only then can we be sure that + * all index entries that are not visible have + * been removed. + * + * {OPEN-DB-SWEEPER-WAIT} + * This has been moved to after the release of the open + * database lock because: + * + * - We are waiting for the sweeper which may run out of + * record cache. + * - If it runs out of cache it will wait + * for the freeer thread. + * - For the freeer thread to be able to work it needs + * to open the database. + * - To open the database it needs the open database + * lock. + */ + pushr_(xt_heap_release, db); + xt_wait_for_sweeper(self, db, 0); + popr_(); + + return db; +} + +xtPublic XTDatabaseHPtr xt_get_database_by_id(XTThreadPtr self, xtDatabaseID db_id) +{ + XTDatabaseHPtr *dbptr; + XTDatabaseHPtr db = NULL; + + xt_ht_lock(self, xt_db_open_databases); + pushr_(xt_ht_unlock, xt_db_open_databases); + if ((dbptr = (XTDatabaseHPtr *) xt_sl_find(self, xt_db_open_db_by_id, &db_id))) { + db = *dbptr; + xt_heap_reference(self, db); + } + freer_(); // xt_ht_unlock(xt_db_open_databases) + return db; +} + +xtPublic void xt_check_database(XTThreadPtr self) +{ + xt_check_tables(self); + /* + xt_check_handlefiles(self, db); + */ +} + +xtPublic void xt_drop_database(XTThreadPtr self, XTDatabaseHPtr db) +{ + char path[PATH_MAX]; + char db_name[NAME_MAX]; + XTOpenDirPtr od; + char *file; + XTTablePathPtr *tp_ptr; + + xt_ht_lock(self, xt_db_open_databases); + pushr_(xt_ht_unlock, xt_db_open_databases); + + /* Shutdown the database daemons: */ + xt_stop_checkpointer(self, db); + xt_stop_sweeper(self, db); + xt_stop_compactor(self, db); + xt_stop_writer(self, db); + + /* Remove the database from the directory: */ + xt_strcpy(NAME_MAX, db_name, db->db_name); + xt_ht_del(self, xt_db_open_databases, db_name); + + /* Release the lock on the database directory: */ + freer_(); // 
xt_ht_unlock(xt_db_open_databases) + + /* Delete the transaction logs: */ + xt_xlog_delete_logs(self, db); + + /* Delete the data logs: */ + xt_dl_delete_logs(self, db); + + for (u_int i=0; i<xt_sl_get_size(db->db_table_paths); i++) { + + tp_ptr = (XTTablePathPtr *) xt_sl_item_at(db->db_table_paths, i); + + xt_strcpy(PATH_MAX, path, (*tp_ptr)->tp_path); + + /* Delete all files in the database: */ + pushsr_(od, xt_dir_close, xt_dir_open(self, path, NULL)); + while (xt_dir_next(self, od)) { + file = xt_dir_name(self, od); + if (xt_ends_with(file, ".xtr") || + xt_ends_with(file, ".xtd") || + xt_ends_with(file, ".xti") || + xt_ends_with(file, ".xt")) + { + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, file); + xt_fs_delete(self, path); + xt_remove_last_name_of_path(path); + } + } + freer_(); // xt_dir_close(od) + + } + if (!db->db_multi_path) { + xt_strcpy(PATH_MAX, path, db->db_main_path); + xt_add_pbxt_dir(PATH_MAX, path); + if (!xt_fs_rmdir(NULL, path)) + xt_log_and_clear_exception(self); + } +} + +/* + * Open/use a database. + */ +xtPublic void xt_open_database(XTThreadPtr self, char *path, xtBool multi_path) +{ + XTDatabaseHPtr db; + + /* We cannot get a database, without unusing the current + * first. The reason is that the restart process will + * partially set the current database! + */ + xt_unuse_database(self, self); + db = xt_get_database(self, path, multi_path); + pushr_(xt_heap_release, db); + xt_use_database(self, db, XT_FOR_USER); + freer_(); // xt_heap_release(self, db); +} + +/* This function can only be called if you do not already have a database in + * use. This is because to get a database pointer you are not allowed + * to have a database in use! + */ +xtPublic void xt_use_database(XTThreadPtr self, XTDatabaseHPtr db, int what_for) +{ + /* Check if a transaction is in progress. If so, + * we cannot change the database! 
+ */ + if (self->st_xact_data || self->st_database) + xt_throw_xterr(XT_CONTEXT, XT_ERR_CANNOT_CHANGE_DB); + + xt_heap_reference(self, db); + self->st_database = db; + xt_xn_init_thread(self, what_for); +} + +xtPublic void xt_unuse_database(XTThreadPtr self, XTThreadPtr other_thr) +{ + /* Abort the transaction if it belongs exclusively to this thread. */ + xt_lock_mutex(self, &other_thr->t_lock); + pushr_(xt_unlock_mutex, &other_thr->t_lock); + + xt_xn_exit_thread(other_thr); + if (other_thr->st_database) { + xt_heap_release(self, other_thr->st_database); + other_thr->st_database = NULL; + } + + freer_(); +} + +xtPublic void xt_db_init_thread(XTThreadPtr XT_UNUSED(self), XTThreadPtr XT_UNUSED(new_thread)) +{ +#ifdef XT_IMPLEMENT_NO_ACTION + memset(&new_thread->st_restrict_list, 0, sizeof(XTBasicListRec)); + new_thread->st_restrict_list.bl_item_size = sizeof(XTRestrictItemRec); +#endif +} + +xtPublic void xt_db_exit_thread(XTThreadPtr self) +{ +#ifdef XT_IMPLEMENT_NO_ACTION + xt_bl_free(NULL, &self->st_restrict_list); +#endif + xt_unuse_database(self, self); +} + +/* + * ----------------------------------------------------------------------- + * OPEN TABLE POOL + */ + +#ifdef UNUSED_CODE +static void check_free_list(XTDatabaseHPtr db) +{ + XTOpenTablePtr ot; + u_int cnt = 0; + + ot = db->db_ot_pool.otp_mr_used; + if (ot) + ASSERT_NS(!ot->ot_otp_mr_used); + ot = db->db_ot_pool.otp_lr_used; + if (ot) + ASSERT_NS(!ot->ot_otp_lr_used); + while (ot) { + cnt++; + ot = ot->ot_otp_mr_used; + } + ASSERT_NS(cnt == db->db_ot_pool.otp_total_free); +} +#endif + +xtPublic void xt_db_pool_init(XTThreadPtr self, XTDatabaseHPtr db) +{ + memset(&db->db_ot_pool, 0, sizeof(XTAllTablePoolsRec)); + xt_init_mutex_with_autoname(self, &db->db_ot_pool.opt_lock); + xt_init_cond(self, &db->db_ot_pool.opt_cond); +} + +xtPublic void xt_db_pool_exit(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTOpenTablePoolPtr table_pool, tmp; + XTOpenTablePtr ot, tmp_ot; + + 
xt_free_mutex(&db->db_ot_pool.opt_lock); + xt_free_cond(&db->db_ot_pool.opt_cond); + + for (u_int i=0; i<XT_OPEN_TABLE_POOL_HASH_SIZE; i++) { + table_pool = db->db_ot_pool.otp_hash[i]; + while (table_pool) { + tmp = table_pool->opt_next_hash; + ot = table_pool->opt_free_list; + while (ot) { + tmp_ot = ot->ot_otp_next_free; + ot->ot_thread = self; + xt_close_table(ot, TRUE, FALSE); + ot = tmp_ot; + } + xt_free(self, table_pool); + table_pool = tmp; + } + } +} + +static XTOpenTablePoolPtr db_get_open_table_pool(XTDatabaseHPtr db, xtTableID tab_id) +{ + XTOpenTablePoolPtr table_pool; + u_int hash; + + hash = tab_id % XT_OPEN_TABLE_POOL_HASH_SIZE; + table_pool = db->db_ot_pool.otp_hash[hash]; + while (table_pool) { + if (table_pool->opt_tab_id == tab_id) + return table_pool; + table_pool = table_pool->opt_next_hash; + } + + if (!(table_pool = (XTOpenTablePoolPtr) xt_malloc_ns(sizeof(XTOpenTablePoolRec)))) + return NULL; + + table_pool->opt_db = db; + table_pool->opt_tab_id = tab_id; + table_pool->opt_total_open = 0; + table_pool->opt_locked = FALSE; + table_pool->opt_flushing = 0; + table_pool->opt_free_list = NULL; + table_pool->opt_next_hash = db->db_ot_pool.otp_hash[hash]; + db->db_ot_pool.otp_hash[hash] = table_pool; + + return table_pool; +} + +static void db_free_open_table_pool(XTThreadPtr self, XTOpenTablePoolPtr table_pool) +{ + if (!table_pool->opt_locked && !table_pool->opt_flushing && !table_pool->opt_total_open) { + XTOpenTablePoolPtr ptr, pptr = NULL; + u_int hash; + + hash = table_pool->opt_tab_id % XT_OPEN_TABLE_POOL_HASH_SIZE; + ptr = table_pool->opt_db->db_ot_pool.otp_hash[hash]; + while (ptr) { + if (ptr == table_pool) + break; + pptr = ptr; + ptr = ptr->opt_next_hash; + } + + if (ptr == table_pool) { + if (pptr) + pptr->opt_next_hash = table_pool->opt_next_hash; + else + table_pool->opt_db->db_ot_pool.otp_hash[hash] = table_pool->opt_next_hash; + } + + xt_free(self, table_pool); + } +} + +static XTOpenTablePoolPtr db_lock_table_pool(XTThreadPtr 
self, XTDatabaseHPtr db, xtTableID tab_id, xtBool flush_table, xtBool wait_for_open) +{ + XTOpenTablePoolPtr table_pool; + XTOpenTablePtr ot, tmp_ot; + + xt_lock_mutex(self, &db->db_ot_pool.opt_lock); + pushr_(xt_unlock_mutex, &db->db_ot_pool.opt_lock); + + if (!(table_pool = db_get_open_table_pool(db, tab_id))) + xt_throw(self); + + /* Wait for the lock: */ + while (table_pool->opt_locked) { + xt_timed_wait_cond(self, &db->db_ot_pool.opt_cond, &db->db_ot_pool.opt_lock, 2000); + if (!(table_pool = db_get_open_table_pool(db, tab_id))) + xt_throw(self); + } + + /* Lock it: */ + table_pool->opt_locked = TRUE; + + if (flush_table) { + table_pool->opt_flushing++; + freer_(); // xt_unlock_mutex(db_ot_pool.opt_lock) + + pushr_(xt_db_unlock_table_pool, table_pool); + /* During this time, background processes can use the + * pool! + * + * May also do a flush, but this is now taken care + * of here [*10*] + */ + if ((ot = xt_db_open_pool_table(self, db, tab_id, NULL, TRUE))) { + pushr_(xt_db_return_table_to_pool, ot); + xt_sync_flush_table(self, ot); + freer_(); //xt_db_return_table_to_pool_foreground(ot); + } + + popr_(); // Discard xt_db_unlock_table_pool_no_lock(table_pool) + + xt_lock_mutex(self, &db->db_ot_pool.opt_lock); + pushr_(xt_unlock_mutex, &db->db_ot_pool.opt_lock); + table_pool->opt_flushing--; + } + + /* Free all open tables not in use: */ + ot = table_pool->opt_free_list; + table_pool->opt_free_list = NULL; + while (ot) { + tmp_ot = ot->ot_otp_next_free; + + /* Remove from MRU list: */ + if (db->db_ot_pool.otp_lr_used == ot) + db->db_ot_pool.otp_lr_used = ot->ot_otp_mr_used; + if (db->db_ot_pool.otp_mr_used == ot) + db->db_ot_pool.otp_mr_used = ot->ot_otp_lr_used; + if (ot->ot_otp_lr_used) + ot->ot_otp_lr_used->ot_otp_mr_used = ot->ot_otp_mr_used; + if (ot->ot_otp_mr_used) + ot->ot_otp_mr_used->ot_otp_lr_used = ot->ot_otp_lr_used; + + if (db->db_ot_pool.otp_lr_used) + db->db_ot_pool.otp_free_time = db->db_ot_pool.otp_lr_used->ot_otp_free_time; + + 
ASSERT_NS(db->db_ot_pool.otp_total_free > 0); + db->db_ot_pool.otp_total_free--; + + /* Close the table: */ + ASSERT(table_pool->opt_total_open > 0); + table_pool->opt_total_open--; + + ot->ot_thread = self; + xt_close_table(ot, table_pool->opt_total_open == 0, FALSE); + + /* Go to the next: */ + ot = tmp_ot; + } + + /* Wait for other to close: */ + if (wait_for_open) { + while (table_pool->opt_total_open > 0) { + xt_timed_wait_cond_ns(&db->db_ot_pool.opt_cond, &db->db_ot_pool.opt_lock, 2000); + } + } + + freer_(); // xt_unlock_mutex(db_ot_pool.opt_lock) + return table_pool; +} + +xtPublic XTOpenTablePoolPtr xt_db_lock_table_pool_by_name(XTThreadPtr self, XTDatabaseHPtr db, XTPathStrPtr tab_name, xtBool no_load, xtBool flush_table, xtBool missing_ok, xtBool wait_for_open, XTTableHPtr *ret_tab) +{ + XTOpenTablePoolPtr table_pool; + XTTableHPtr tab; + xtTableID tab_id; + + pushsr_(tab, xt_heap_release, xt_use_table(self, tab_name, no_load, missing_ok, NULL)); + if (!tab) { + freer_(); // xt_heap_release(tab) + return NULL; + } + + tab_id = tab->tab_id; + + if (ret_tab) { + *ret_tab = tab; + table_pool = db_lock_table_pool(self, db, tab_id, flush_table, wait_for_open); + popr_(); // Discard xt_heap_release(tab) + return table_pool; + } + + freer_(); // xt_heap_release(tab) + return db_lock_table_pool(self, db, tab_id, flush_table, wait_for_open); +} + +xtPublic void xt_db_wait_for_open_tables(XTThreadPtr self, XTOpenTablePoolPtr table_pool) +{ + XTDatabaseHPtr db = table_pool->opt_db; + + xt_lock_mutex(self, &db->db_ot_pool.opt_lock); + pushr_(xt_unlock_mutex, &db->db_ot_pool.opt_lock); + + /* Wait for other to close: */ + while (table_pool->opt_total_open > 0) { + xt_timed_wait_cond(self, &db->db_ot_pool.opt_cond, &db->db_ot_pool.opt_lock, 2000); + } + + freer_(); // xt_unlock_mutex(db_ot_pool.opt_lock) +} + +xtPublic void xt_db_unlock_table_pool(XTThreadPtr self, XTOpenTablePoolPtr table_pool) +{ + XTDatabaseHPtr db; + + if (!table_pool) + return; + + db = 
table_pool->opt_db; + xt_lock_mutex(self, &db->db_ot_pool.opt_lock); + pushr_(xt_unlock_mutex, &db->db_ot_pool.opt_lock); + + table_pool->opt_locked = FALSE; + xt_broadcast_cond(self, &db->db_ot_pool.opt_cond); + db_free_open_table_pool(NULL, table_pool); + + freer_(); // xt_unlock_mutex(db_ot_pool.opt_lock) +} + +xtPublic XTOpenTablePtr xt_db_open_table_using_tab(XTTableHPtr tab, XTThreadPtr thread) +{ + XTDatabaseHPtr db = tab->tab_db; + XTOpenTablePoolPtr table_pool; + XTOpenTablePtr ot; + + xt_lock_mutex_ns(&db->db_ot_pool.opt_lock); + + if (!(table_pool = db_get_open_table_pool(db, tab->tab_id))) + goto failed; + + while (table_pool->opt_locked) { + if (!xt_timed_wait_cond_ns(&db->db_ot_pool.opt_cond, &db->db_ot_pool.opt_lock, 2000)) + goto failed_1; + if (!(table_pool = db_get_open_table_pool(db, tab->tab_id))) + goto failed; + } + + if ((ot = table_pool->opt_free_list)) { + /* Remove from the free list: */ + table_pool->opt_free_list = ot->ot_otp_next_free; + + /* Remove from MRU list: */ + if (db->db_ot_pool.otp_lr_used == ot) + db->db_ot_pool.otp_lr_used = ot->ot_otp_mr_used; + if (db->db_ot_pool.otp_mr_used == ot) + db->db_ot_pool.otp_mr_used = ot->ot_otp_lr_used; + if (ot->ot_otp_lr_used) + ot->ot_otp_lr_used->ot_otp_mr_used = ot->ot_otp_mr_used; + if (ot->ot_otp_mr_used) + ot->ot_otp_mr_used->ot_otp_lr_used = ot->ot_otp_lr_used; + + if (db->db_ot_pool.otp_lr_used) + db->db_ot_pool.otp_free_time = db->db_ot_pool.otp_lr_used->ot_otp_free_time; + + ASSERT_NS(db->db_ot_pool.otp_total_free > 0); + db->db_ot_pool.otp_total_free--; + + ot->ot_thread = thread; + goto done_ok; + } + + if ((ot = xt_open_table(tab))) { + ot->ot_thread = thread; + table_pool->opt_total_open++; + } + + done_ok: + db_free_open_table_pool(NULL, table_pool); + xt_unlock_mutex_ns(&db->db_ot_pool.opt_lock); + return ot; + + failed_1: + db_free_open_table_pool(NULL, table_pool); + + failed: + xt_unlock_mutex_ns(&db->db_ot_pool.opt_lock); + return NULL; +} + +xtPublic xtBool 
xt_db_open_pool_table_ns(XTOpenTablePtr *ret_ot, XTDatabaseHPtr db, xtTableID tab_id) +{ + XTThreadPtr self = xt_get_self(); + xtBool ok = TRUE; + + try_(a) { + *ret_ot = xt_db_open_pool_table(self, db, tab_id, NULL, FALSE); + } + catch_(a) { + ok = FALSE; + } + cont_(a); + return ok; +} + +xtPublic XTOpenTablePtr xt_db_open_pool_table(XTThreadPtr self, XTDatabaseHPtr db, xtTableID tab_id, int *result, xtBool i_am_background) +{ + XTOpenTablePtr ot; + XTOpenTablePoolPtr table_pool; + int r; + XTTableHPtr tab; + + xt_lock_mutex(self, &db->db_ot_pool.opt_lock); + pushr_(xt_unlock_mutex, &db->db_ot_pool.opt_lock); + + if (!(table_pool = db_get_open_table_pool(db, tab_id))) + xt_throw(self); + + /* Background processes do not have to wait while flushing! + * + * I think I did this so that the background process would + * not hang during flushing. Exact reason currently + * unknown. + * + * This led to the situation that the checkpointer + * could flush at the same time as a user process + * which was flushing due to a rename. + * + * This led to the situation described here: [*10*], + * which is now fixed. + */ + while (table_pool->opt_locked && !(i_am_background && table_pool->opt_flushing)) { + xt_timed_wait_cond(self, &db->db_ot_pool.opt_cond, &db->db_ot_pool.opt_lock, 2000); + if (!(table_pool = db_get_open_table_pool(db, tab_id))) + xt_throw(self); + } + + /* Moved from above, because db_get_open_table_pool() may return a different + * pool on each call! 
+ */ + pushr_(db_free_open_table_pool, table_pool); + + if ((ot = table_pool->opt_free_list)) { + /* Remove from the free list: */ + table_pool->opt_free_list = ot->ot_otp_next_free; + + /* Remove from MRU list: */ + if (db->db_ot_pool.otp_lr_used == ot) + db->db_ot_pool.otp_lr_used = ot->ot_otp_mr_used; + if (db->db_ot_pool.otp_mr_used == ot) + db->db_ot_pool.otp_mr_used = ot->ot_otp_lr_used; + if (ot->ot_otp_lr_used) + ot->ot_otp_lr_used->ot_otp_mr_used = ot->ot_otp_mr_used; + if (ot->ot_otp_mr_used) + ot->ot_otp_mr_used->ot_otp_lr_used = ot->ot_otp_lr_used; + + if (db->db_ot_pool.otp_lr_used) + db->db_ot_pool.otp_free_time = db->db_ot_pool.otp_lr_used->ot_otp_free_time; + + ASSERT(db->db_ot_pool.otp_total_free > 0); + db->db_ot_pool.otp_total_free--; + + freer_(); // db_free_open_table_pool(table_pool) + freer_(); // xt_unlock_mutex(&db->db_ot_pool.opt_lock) + ot->ot_thread = self; + return ot; + } + + r = xt_use_table_by_id(self, &tab, db, tab_id); + if (result) { + if (r != XT_TAB_OK) { + *result = r; + freer_(); // db_free_open_table_pool(table_pool) + freer_(); // xt_unlock_mutex(&db->db_ot_pool.opt_lock) + return NULL; + } + } + else { + switch (r) { + case XT_TAB_NOT_FOUND: + /* The table no longer exists, ignore the change: */ + freer_(); // db_free_open_table_pool(table_pool) + freer_(); // xt_unlock_mutex(&db->db_ot_pool.opt_lock) + return NULL; + case XT_TAB_NO_DICTIONARY: + xt_throw_ulxterr(XT_CONTEXT, XT_ERR_NO_DICTIONARY, (u_long) tab_id); + case XT_TAB_POOL_CLOSED: + xt_throw_ulxterr(XT_CONTEXT, XT_ERR_TABLE_LOCKED, (u_long) tab_id); + default: + break; + } + } + + /* xt_use_table_by_id returns a referenced tab! 
*/ + pushr_(xt_heap_release, tab); + if ((ot = xt_open_table(tab))) { + ot->ot_thread = self; + table_pool->opt_total_open++; + } + freer_(); // xt_release_heap(tab) + + freer_(); // db_free_open_table_pool(table_pool) + freer_(); // xt_unlock_mutex(&db->db_ot_pool.opt_lock) + return ot; +} + +xtPublic void xt_db_return_table_to_pool(XTThreadPtr self, XTOpenTablePtr ot) +{ + if (!xt_db_return_table_to_pool_ns(ot)) + xt_throw(self); +} + +xtPublic xtBool xt_db_return_table_to_pool_ns(XTOpenTablePtr ot) +{ + XTOpenTablePoolPtr table_pool; + XTDatabaseHPtr db = ot->ot_table->tab_db; + xtBool flush_table = TRUE; + + xt_lock_mutex_ns(&db->db_ot_pool.opt_lock); + + if (!(table_pool = db_get_open_table_pool(db, ot->ot_table->tab_id))) + goto failed; + + if (table_pool->opt_locked && !table_pool->opt_flushing) { + table_pool->opt_total_open--; + /* Table will be closed below: */ + if (table_pool->opt_total_open > 0) + flush_table = FALSE; + } + else { + /* Put it on the free list: */ + db->db_ot_pool.otp_total_free++; + + ot->ot_otp_next_free = table_pool->opt_free_list; + table_pool->opt_free_list = ot; + + /* This is the time the table was freed: */ + ot->ot_otp_free_time = xt_db_approximate_time; + + /* Add to most recently used: */ + if ((ot->ot_otp_lr_used = db->db_ot_pool.otp_mr_used)) + db->db_ot_pool.otp_mr_used->ot_otp_mr_used = ot; + ot->ot_otp_mr_used = NULL; + db->db_ot_pool.otp_mr_used = ot; + if (!db->db_ot_pool.otp_lr_used) { + db->db_ot_pool.otp_lr_used = ot; + db->db_ot_pool.otp_free_time = ot->ot_otp_free_time; + } + + ot = NULL; + } + + db_free_open_table_pool(NULL, table_pool); + + if (!xt_broadcast_cond_ns(&db->db_ot_pool.opt_cond)) + goto failed; + xt_unlock_mutex_ns(&db->db_ot_pool.opt_lock); + if (ot) + xt_close_table(ot, flush_table, FALSE); + + return OK; + + failed: + xt_unlock_mutex_ns(&db->db_ot_pool.opt_lock); + if (ot) + xt_close_table(ot, TRUE, FALSE); + return FAILED; +} + +//#define TEST_FREE_OPEN_TABLES + +#ifdef DEBUG +#undef 
XT_OPEN_TABLE_FREE_TIME +#define XT_OPEN_TABLE_FREE_TIME 5 +#endif + +xtPublic void xt_db_free_unused_open_tables(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTOpenTablePoolPtr table_pool; + size_t count; + XTOpenTablePtr ot; + xtBool flush_table = TRUE; + u_int table_count; + + /* A quick check of the oldest free table: */ + if (xt_db_approximate_time < db->db_ot_pool.otp_free_time + XT_OPEN_TABLE_FREE_TIME) + return; + + table_count = db->db_table_by_id ? xt_sl_get_size(db->db_table_by_id) : 0; + count = table_count * 3; + if (count < 20) + count = 20; +#ifdef TEST_FREE_OPEN_TABLES + count = 10; +#endif + if (db->db_ot_pool.otp_total_free > count) { + XTOpenTablePtr ptr, pptr; + + count = table_count * 2; + if (count < 10) + count = 10; +#ifdef TEST_FREE_OPEN_TABLES + count = 5; +#endif + xt_lock_mutex(self, &db->db_ot_pool.opt_lock); + pushr_(xt_unlock_mutex, &db->db_ot_pool.opt_lock); + + while (db->db_ot_pool.otp_total_free > count) { + ASSERT_NS(db->db_ot_pool.otp_lr_used); + if (!(ot = db->db_ot_pool.otp_lr_used)) + break; + + /* Check how long the open table has been free: */ + if (xt_db_approximate_time < ot->ot_otp_free_time + XT_OPEN_TABLE_FREE_TIME) + break; + + ot->ot_thread = self; + + /* Remove from MRU list: */ + db->db_ot_pool.otp_lr_used = ot->ot_otp_mr_used; + if (db->db_ot_pool.otp_mr_used == ot) + db->db_ot_pool.otp_mr_used = ot->ot_otp_lr_used; + if (ot->ot_otp_lr_used) + ot->ot_otp_lr_used->ot_otp_mr_used = ot->ot_otp_mr_used; + if (ot->ot_otp_mr_used) + ot->ot_otp_mr_used->ot_otp_lr_used = ot->ot_otp_lr_used; + + if (db->db_ot_pool.otp_lr_used) + db->db_ot_pool.otp_free_time = db->db_ot_pool.otp_lr_used->ot_otp_free_time; + + ASSERT(db->db_ot_pool.otp_total_free > 0); + db->db_ot_pool.otp_total_free--; + + if (!(table_pool = db_get_open_table_pool(db, ot->ot_table->tab_id))) + xt_throw(self); + + /* Find the open table in the table pool, + * and remove it from the list: + */ + pptr = NULL; + ptr = table_pool->opt_free_list; + while (ptr) { + 
if (ptr == ot) + break; + pptr = ptr; + ptr = ptr->ot_otp_next_free; + } + + ASSERT_NS(ptr == ot); + if (ptr == ot) { + if (pptr) + pptr->ot_otp_next_free = ot->ot_otp_next_free; + else + table_pool->opt_free_list = ot->ot_otp_next_free; + } + + ASSERT_NS(table_pool->opt_total_open > 0); + table_pool->opt_total_open--; + if (table_pool->opt_total_open > 0) + flush_table = FALSE; + else + flush_table = TRUE; + + db_free_open_table_pool(self, table_pool); + + freer_(); + + /* Close the table, but not + * while holding the lock. + */ + xt_close_table(ot, flush_table, FALSE); + + xt_lock_mutex(self, &db->db_ot_pool.opt_lock); + pushr_(xt_unlock_mutex, &db->db_ot_pool.opt_lock); + } + + freer_(); + } +} diff --git a/storage/pbxt/src/database_xt.h b/storage/pbxt/src/database_xt.h new file mode 100644 index 00000000000..487f4288163 --- /dev/null +++ b/storage/pbxt/src/database_xt.h @@ -0,0 +1,247 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-15 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_database_h__ +#define __xt_database_h__ + +#include <time.h> + +#include "thread_xt.h" +#include "hashtab_xt.h" +#include "table_xt.h" +#include "sortedlist_xt.h" +#include "xaction_xt.h" +#include "heap_xt.h" +#include "xactlog_xt.h" +#include "restart_xt.h" +#include "index_xt.h" + +#ifdef DEBUG +//#define XT_USE_XACTION_DEBUG_SIZES +#endif + +#ifdef XT_USE_XACTION_DEBUG_SIZES +#define XT_DB_TABLE_POOL_SIZE 2 +#else +#define XT_DB_TABLE_POOL_SIZE 10 // The number of open tables maintained by the sweeper +#endif + +/* Turn this switch on to enable spin lock based wait-for logic: */ +#define XT_USE_SPINLOCK_WAIT_FOR + +extern xtLogOffset xt_db_log_file_threshold; +extern size_t xt_db_log_buffer_size; +extern size_t xt_db_transaction_buffer_size; +extern size_t xt_db_checkpoint_frequency; +extern off_t xt_db_data_log_threshold; +extern size_t xt_db_data_file_grow_size; +extern size_t xt_db_row_file_grow_size; +extern int xt_db_garbage_threshold; +extern int xt_db_log_file_count; +extern int xt_db_auto_increment_mode; +extern int xt_db_offline_log_function; +extern int xt_db_sweeper_priority; + +extern XTSortedListPtr xt_db_open_db_by_id; +extern XTHashTabPtr xt_db_open_databases; +extern time_t xt_db_approximate_time; + +#define XT_OPEN_TABLE_POOL_HASH_SIZE 223 + +#define XT_SW_WORK_NORMAL 0 +#define XT_SW_NO_MORE_XACT_SLOTS 1 +#define XT_SW_DIRTY_RECORD_FOUND 2 +#define XT_SW_TOO_FAR_BEHIND 3 /* The sweeper is getting too far behind, although it is working! */ + +typedef struct XTOpenTablePool { + struct XTDatabase *opt_db; + xtTableID opt_tab_id; /* The table ID. */ + u_int opt_total_open; /* Total number of open tables. 
*/ + xtBool opt_locked; /* This table is locked; open tables are freed on return to pool. */ + u_int opt_flushing; + XTOpenTablePtr opt_free_list; /* A list of free, unused open tables. */ + struct XTOpenTablePool *opt_next_hash; +} XTOpenTablePoolRec, *XTOpenTablePoolPtr; + +typedef struct XTAllTablePools { + xt_mutex_type opt_lock; /* This lock protects the open table pool. */ + xt_cond_type opt_cond; /* Used to wait for an exclusive lock on a table. */ + + u_int otp_total_free; /* This is the total number of free open tables (not in use). */ + + /* All free (unused) tables are on this list: */ + XTOpenTablePtr otp_mr_used; + XTOpenTablePtr otp_lr_used; + time_t otp_free_time; /* The free time of the LRU open table. */ + + XTOpenTablePoolPtr otp_hash[XT_OPEN_TABLE_POOL_HASH_SIZE]; +} XTAllTablePoolsRec, *XTAllTablePoolsPtr; + +typedef struct XTTablePath { + u_int tp_tab_count; /* The number of tables using this path. */ + char tp_path[XT_VAR_LENGTH]; /* The table path. */ +} XTTablePathRec, *XTTablePathPtr; + +#define XT_THREAD_BUSY 0 +#define XT_THREAD_IDLE 1 +#define XT_THREAD_INERR 2 + +typedef struct XTDatabase : public XTHeap { + char *db_name; /* The name of the database, last component of the path! */ + char *db_main_path; + xtDatabaseID db_id; + xtTableID db_curr_tab_id; /* The ID of the last table created. */ + XTHashTabPtr db_tables; + XTSortedListPtr db_table_by_id; + XTSortedListPtr db_table_paths; /* A list of table paths used by this database. */ + xtBool db_multi_path; + + /* The open table pool: */ + XTAllTablePoolsRec db_ot_pool; + + /* Transaction related stuff: */ + XTSpinLockRec db_xn_id_lock; /* Lock for next transaction ID. */ + xtXactID db_xn_curr_id; /* The ID of the last transaction started. */ + xtXactID db_xn_min_ram_id; /* The lowest ID of the transactions in memory (RAM). */ + xtXactID db_xn_to_clean_id; /* The next transaction to be cleaned (>= db_xn_min_ram_id). 
*/ + xtXactID db_xn_min_run_id; /* The lowest ID of all running transactions (not up-to-date! >= db_xn_to_clean_id) */ + xtWord4 db_xn_end_time; /* The time of the transaction end. */ + XTXactSegRec db_xn_idx[XT_XN_NO_OF_SEGMENTS]; /* Index of transactions in RAM. */ + xtWord1 *db_xn_data; /* Start of the block allocated to contain transaction data. */ + xtWord1 *db_xn_data_end; /* End of the transaction data block. */ + u_int db_stat_sweep_waits; /* STATISTICS: count the sweeper waits. */ + XTDatabaseLogRec db_xlog; /* The transaction log for this database. */ + XTXactRestartRec db_restart; /* Database recovery stuff. */ + + XTSortedListPtr db_xn_wait_for; /* The "wait-for" list, of transactions waiting for other transactions. */ + u_int db_xn_call_start; /* Start of the post wait calls. */ + XTSpinLockRec db_xn_wait_spinlock; + //xt_mutex_type db_xn_wait_lock; /* The lock associated with the wait for list. */ + //xt_cond_type db_xn_wait_cond; /* This condition is signalled when a transaction quits. */ + //u_int db_xn_wait_on_cond; /* Number of threads waiting on the condition. */ + int db_xn_wait_count; /* Number of waiting transactions. */ + u_int db_xn_total_writer_count; /* The total number of writers. */ + int db_xn_writer_count; /* The number of writer threads. */ + int db_xn_writer_wait_count; /* The number of writer threads waiting. */ + int db_xn_long_running_count; /* The number of long running writer threads. */ + + /* Sweeper stuff: */ + struct XTThread *db_sw_thread; /* The sweeper thread (cleans up transactions). */ + xt_mutex_type db_sw_lock; /* The lock associated with the sweeper. */ + xt_cond_type db_sw_cond; /* The sweeper wakeup condition. */ + u_int db_sw_check_count; + int db_sw_idle; /* BUSY/IDLE/INERR depending on the state of the sweeper. */ + int db_sw_faster; /* non-zero if the sweeper should work faster. */ + xtBool db_sw_fast; /* TRUE if the sweeper is working faster. 
*/ + + /* Writer stuff: */ + struct XTThread *db_wr_thread; /* The writer thread (write log data to the database). */ + int db_wr_idle; /* BUSY/IDLE/INERR depending on the state of the writer. */ + xtBool db_wr_faster; /* Set to TRUE if the writer should work faster. */ + xtBool db_wr_fast; /* TRUE if the writer is working faster. */ + u_int db_wr_thread_waiting; /* Count the number of threads waiting for the writer. */ + xtBool db_wr_freeer_waiting; /* TRUE if the freeer is waiting for the writer. */ + xt_mutex_type db_wr_lock; + xt_cond_type db_wr_cond; /* Writer condition when idle (must be woken by log flush!) */ + xtLogID db_wr_log_id; /* Current write log ID. */ + xtLogOffset db_wr_log_offset; /* Current write log offset. */ + xtLogID db_wr_flush_point_log_id; /* This is the point to which the writer will write (log ID). */ + xtLogOffset db_wr_flush_point_log_offset; /* This is the point to which the writer will write (log offset). */ + + /* Data log stuff: */ + XTDataLogCacheRec db_datalogs; /* The database data log stuff. */ + XTIndexLogPoolRec db_indlogs; /* Index logs used for consistent write. */ + + /* Compactor stuff: */ + struct XTThread *db_co_thread; /* The compactor thread (compacts data logs). */ + xt_mutex_type db_co_ext_lock; /* Required when extended data is moved, or removed. */ + xtBool db_co_busy; /* TRUE if the compactor is busy compacting a data log. */ + xt_mutex_type db_co_dlog_lock; /* This is the lock required to flush the compactor's data log. */ + + /* Checkpointer stuff: */ + struct XTThread *db_cp_thread; /* The checkpoint thread (flushes the database data). */ + xt_mutex_type db_cp_lock; + xt_cond_type db_cp_cond; /* Writer condition when idle (must be woken by log flush!) */ + XTCheckPointStateRec db_cp_state; /* The checkpoint state. 
*/ +} XTDatabaseRec, *XTDatabaseHPtr; /* Heap pointer */ + +#define XT_FOR_USER 0 +#define XT_FOR_COMPACTOR 1 +#define XT_FOR_SWEEPER 2 +#define XT_FOR_WRITER 3 +#define XT_FOR_CHECKPOINTER 4 + +void xt_create_database(XTThreadPtr th, char *path); +XTDatabaseHPtr xt_get_database(XTThreadPtr self, char *path, xtBool multi_path); +XTDatabaseHPtr xt_get_database_by_id(XTThreadPtr self, xtDatabaseID db_id); +void xt_drop_database(XTThreadPtr self, XTDatabaseHPtr db); +void xt_check_database(XTThreadPtr self); + +void xt_add_pbxt_file(size_t size, char *path, const char *file); +void xt_add_location_file(size_t size, char *path); +void xt_add_system_dir(size_t size, char *path); +void xt_add_data_dir(size_t size, char *path); + +void xt_use_database(XTThreadPtr self, XTDatabaseHPtr db, int what_for); +void xt_unuse_database(XTThreadPtr self, XTThreadPtr other_thr); +void xt_open_database(XTThreadPtr self, char *path, xtBool multi_path); + +void xt_lock_installation(XTThreadPtr self, char *installation_path); +void xt_unlock_installation(XTThreadPtr self, char *installation_path); +void xt_crash_me(void); + +void xt_init_databases(XTThreadPtr self); +void xt_stop_database_threads(XTThreadPtr self, xtBool sync); +void xt_exit_databases(XTThreadPtr self); + +void xt_dump_database(XTThreadPtr self, XTDatabaseHPtr db); + +void xt_db_init_thread(XTThreadPtr self, XTThreadPtr new_thread); +void xt_db_exit_thread(XTThreadPtr self); + +void xt_db_pool_init(XTThreadPtr self, struct XTDatabase *db); +void xt_db_pool_exit(XTThreadPtr self, struct XTDatabase *db); +XTOpenTablePoolPtr xt_db_lock_table_pool_by_name(XTThreadPtr self, XTDatabaseHPtr db, XTPathStrPtr name, xtBool no_load, xtBool flush_table, xtBool missing_ok, xtBool wait_for_open, XTTableHPtr *ret_tab); +void xt_db_wait_for_open_tables(XTThreadPtr self, XTOpenTablePoolPtr table_pool); +void xt_db_unlock_table_pool(struct XTThread *self, XTOpenTablePoolPtr table_pool); +XTOpenTablePtr xt_db_open_pool_table(XTThreadPtr 
self, XTDatabaseHPtr db, xtTableID tab_id, int *result, xtBool i_am_background); +XTOpenTablePtr xt_db_open_table_using_tab(XTTableHPtr tab, XTThreadPtr thread); +xtBool xt_db_open_pool_table_ns(XTOpenTablePtr *ret_ot, XTDatabaseHPtr db, xtTableID tab_id); +void xt_db_return_table_to_pool(XTThreadPtr self, XTOpenTablePtr ot); +xtBool xt_db_return_table_to_pool_ns(XTOpenTablePtr ot); +void xt_db_free_unused_open_tables(XTThreadPtr self, XTDatabaseHPtr db); + +#define XT_LONG_RUNNING_TIME 2 + +inline void xt_xlog_check_long_writer(XTThreadPtr thread) +{ + if (thread->st_xact_writer) { + if (xt_db_approximate_time - thread->st_xact_write_time > XT_LONG_RUNNING_TIME) { + if (!thread->st_xact_long_running) { + thread->st_xact_long_running = TRUE; + thread->st_database->db_xn_long_running_count++; + } + } + } +} + +#endif diff --git a/storage/pbxt/src/datadic_xt.cc b/storage/pbxt/src/datadic_xt.cc new file mode 100644 index 00000000000..c2ac186cf6f --- /dev/null +++ b/storage/pbxt/src/datadic_xt.cc @@ -0,0 +1,2875 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-05-16 Paul McCullagh + * + * H&G2JCtL + * + * Implementation of the PBXT internal data dictionary. 
+ */ + + +#include "xt_config.h" + +#include <ctype.h> +#include <errno.h> + +#ifdef DEBUG +#ifdef DRIZZLED +#include <drizzled/common_includes.h> +#else +#include "mysql_priv.h" +#endif +#endif + +#include "pthread_xt.h" +#include "datadic_xt.h" +#include "util_xt.h" +#include "database_xt.h" +#include "table_xt.h" +#include "heap_xt.h" +#include "strutil_xt.h" +#include "myxt_xt.h" +#include "hashtab_xt.h" + +/* + * ----------------------------------------------------------------------- + * Lexical analyser + */ + +#define XT_TK_EOF 0 +#define XT_TK_IDENTIFIER 1 +#define XT_TK_NUMBER 2 +#define XT_TK_STRING 3 +#define XT_TK_PUNCTUATION 4 + +#define XT_TK_RESERVER_WORDS 5 +#define XT_TK_PRIMARY 5 +#define XT_TK_UNIQUE 6 +#define XT_TK_FULLTEXT 7 +#define XT_TK_SPATIAL 8 +#define XT_TK_INDEX 9 +#define XT_TK_KEY 10 +#define XT_TK_CHECK 11 +#define XT_TK_FOREIGN 12 +#define XT_TK_COLUMN 13 +#define XT_TK_REFERENCES 14 +#define XT_TK_NOT 15 +#define XT_TK_NULL 16 +#define XT_TK_AUTO_INCREMENT 17 +#define XT_TK_COMMENT 18 +#define XT_TK_DEFAULT 19 +#define XT_TK_COLLATE 20 + +class XTToken { + public: + u_int tk_type; + char *tk_text; + size_t tk_length; + + void initCString(u_int type, char *start, char *end); + inline char charAt(u_int i) { + if (i >= tk_length) + return 0; + return toupper(tk_text[i]); + } + void expectKeyWord(XTThreadPtr self, c_char *keyword); + void expectIdentifier(XTThreadPtr self); + void expectNumber(XTThreadPtr self); + bool isKeyWord(c_char *keyword); + bool isReservedWord(); + bool isReservedWord(u_int word); + void identifyReservedWord(); + bool isEOF(); + bool isIdentifier(); + bool isNumber(); + size_t getString(char *string, size_t len); + void getTokenText(char *string, size_t len); + XTToken *clone(XTThreadPtr self); +}; + +void XTToken::initCString(u_int type, char *start, char *end) +{ + tk_type = type; + tk_text = start; + tk_length = (size_t) end - (size_t) start; +} + +bool XTToken::isKeyWord(c_char *keyword) +{ + char *str = 
tk_text; + size_t len = tk_length; + + while (len && *keyword) { + if (toupper(*keyword) != toupper(*str)) + return false; + keyword++; + str++; + len--; + } + return !len && !*keyword; +} + +bool XTToken::isReservedWord() +{ + return tk_type >= XT_TK_RESERVER_WORDS; +} + +bool XTToken::isReservedWord(u_int word) +{ + return tk_type == word; +} + +void XTToken::identifyReservedWord() +{ + if (tk_type == XT_TK_IDENTIFIER) { + switch (charAt(0)) { + case 'A': + if (isKeyWord("AUTO_INCREMENT")) + tk_type = XT_TK_AUTO_INCREMENT; + break; + case 'C': + switch (charAt(2)) { + case 'E': + if (isKeyWord("CHECK")) + tk_type = XT_TK_CHECK; + break; + case 'L': + if (isKeyWord("COLUMN")) + tk_type = XT_TK_COLUMN; + else if (isKeyWord("COLLATE")) + tk_type = XT_TK_COLLATE; + break; + case 'M': + if (isKeyWord("COMMENT")) + tk_type = XT_TK_COMMENT; + break; + } + break; + case 'D': + if (isKeyWord("DEFAULT")) + tk_type = XT_TK_DEFAULT; + break; + case 'F': + switch (charAt(1)) { + case 'O': + if (isKeyWord("FOREIGN")) + tk_type = XT_TK_FOREIGN; + break; + case 'U': + if (isKeyWord("FULLTEXT")) + tk_type = XT_TK_FULLTEXT; + break; + } + break; + case 'I': + if (isKeyWord("INDEX")) + tk_type = XT_TK_INDEX; + break; + case 'K': + if (isKeyWord("KEY")) + tk_type = XT_TK_KEY; + break; + case 'N': + switch (charAt(1)) { + case 'O': + if (isKeyWord("NOT")) + tk_type = XT_TK_NOT; + break; + case 'U': + if (isKeyWord("NULL")) + tk_type = XT_TK_NULL; + break; + } + break; + case 'P': + if (isKeyWord("PRIMARY")) + tk_type = XT_TK_PRIMARY; + break; + case 'R': + if (isKeyWord("REFERENCES")) + tk_type = XT_TK_REFERENCES; + break; + case 'S': + if (isKeyWord("SPATIAL")) + tk_type = XT_TK_SPATIAL; + break; + case 'U': + if (isKeyWord("UNIQUE")) + tk_type = XT_TK_UNIQUE; + break; + } + } +} + +bool XTToken::isEOF() +{ + return tk_type == XT_TK_EOF; +} + +bool XTToken::isIdentifier() +{ + return tk_type == XT_TK_IDENTIFIER; +} + +bool XTToken::isNumber() +{ + return tk_type == XT_TK_NUMBER; +} 
+ +/* Return actual, or required string length. */ +size_t XTToken::getString(char *dtext, size_t dsize) +{ + char *buffer = dtext; + int slen; + size_t dlen; + char *stext; + char quote; + + if ((slen = (int) tk_length) == 0) { + *dtext = 0; + return 0; + } + switch (*tk_text) { + case '\'': + case '"': + case '`': + quote = *tk_text; + stext = tk_text+1; + slen -= 2; + dlen = 0; + while (slen > 0) { + if (*stext == '\\') { + stext++; + slen--; + if (slen > 0) { + switch (*stext) { + case '\0': + *dtext = 0; + break; + case '\'': + *dtext = '\''; + break; + case '"': + *dtext = '"'; + break; + case 'b': + *dtext = '\b'; + break; + case 'n': + *dtext = '\n'; + break; + case 'r': + *dtext = '\r'; + break; + case 't': + *dtext = '\t'; + break; + case 'z': + *dtext = (char) 26; + break; + case '\\': + *dtext = '\\'; + break; + default: + *dtext = *stext; + break; + } + } + } + else if (*stext == quote) { + if (dlen < dsize) + *dtext = quote; + stext++; + slen--; + } + else { + if (dlen < dsize) + *dtext = *stext; + } + dtext++; + dlen++; + stext++; + slen--; + } + if (dlen < dsize) + buffer[dlen] = 0; + else if (dsize > 0) + buffer[dsize-1] = 0; + break; + default: + if (dsize > 0) { + dlen = dsize-1; + if ((int) dlen > slen) + dlen = slen; + memcpy(dtext, tk_text, dlen); + dtext[dlen] = 0; + } + dlen = tk_length; + break; + } + return dlen; +} + +/* Return the token as a string with ... 
in it if it is too long + */ +void XTToken::getTokenText(char *string, size_t size) +{ + if (tk_length == 0 || !tk_text) { + xt_strcpy(size, string, "EOF"); + return; + } + + size--; + if (tk_length <= size) { + memcpy(string, tk_text, tk_length); + string[tk_length] = 0; + return; + } + + size = (size - 3) / 2; + memcpy(string, tk_text, size); + memcpy(string+size, "...", 3); + memcpy(string+size+3, tk_text + tk_length - size, size); + string[size+3+size] = 0; +} + +XTToken *XTToken::clone(XTThreadPtr self) +{ + XTToken *tk; + + if (!(tk = new XTToken())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + tk->initCString(tk_type, tk_text, tk_text + tk_length); + return tk; +} + +void XTToken::expectKeyWord(XTThreadPtr self, c_char *keyword) +{ + char buffer[100]; + + if (isKeyWord(keyword)) + return; + getTokenText(buffer, 100); + xt_throw_i2xterr(XT_CONTEXT, XT_ERR_A_EXPECTED_NOT_B, keyword, buffer); +} + +void XTToken::expectIdentifier(XTThreadPtr self) +{ + char buffer[100]; + + if (isIdentifier()) + return; + getTokenText(buffer, 100); + xt_throw_i2xterr(XT_CONTEXT, XT_ERR_A_EXPECTED_NOT_B, "Identifier", buffer); +} + +void XTToken::expectNumber(XTThreadPtr self) +{ + char buffer[100]; + + if (isNumber()) + return; + getTokenText(buffer, 100); + xt_throw_i2xterr(XT_CONTEXT, XT_ERR_A_EXPECTED_NOT_B, "Value", buffer); +} + +struct charset_info_st; + +class XTTokenizer { + struct charset_info_st *tkn_charset; + char *tkn_cstring; + char *tkn_curr_pos; + XTToken *tkn_current; + bool tkn_in_comment; + + public: + + XTTokenizer(bool convert, char *cstring) { + tkn_charset = myxt_getcharset(convert); + tkn_cstring = cstring; + tkn_curr_pos = cstring; + tkn_current = NULL; + tkn_in_comment = FALSE; + } + + virtual ~XTTokenizer(void) { + if (tkn_current) + delete tkn_current; + } + + inline bool isSingleChar(int ch) + { + return ch != '$' && ch != '_' && myxt_ispunct(tkn_charset, ch); + } + + inline bool isIdentifierChar(int ch) + { + return ch && !isSingleChar(ch) && 
!myxt_isspace(tkn_charset, ch); + } + + inline bool isNumberChar(int ch, int next_ch) + { + return myxt_isdigit(tkn_charset, ch) || ((ch == '-' || ch == '+') && myxt_isdigit(tkn_charset, next_ch)); + } + + XTToken *newToken(XTThreadPtr self, u_int type, char *start, char *end); + XTToken *nextToken(XTThreadPtr self); + XTToken *nextToken(XTThreadPtr self, c_char *keyword, XTToken *tk); +}; + +void ri_free_token(XTThreadPtr self __attribute__((unused)), XTToken *tk) +{ + delete tk; +} + +XTToken *XTTokenizer::newToken(XTThreadPtr self, u_int type, char *start, char *end) +{ + if (!tkn_current) { + if (!(tkn_current = new XTToken())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + } + tkn_current->initCString(type, start, end); + if (type == XT_TK_IDENTIFIER) + tkn_current->identifyReservedWord(); + return tkn_current; +} + +XTToken *XTTokenizer::nextToken(XTThreadPtr self) +{ + char *token_start; + u_int token_type = XT_TK_PUNCTUATION; + char quote; + bool must_be_num; + + restart: + + /* Ignore space: */ + while (*tkn_curr_pos && myxt_isspace(tkn_charset, *tkn_curr_pos)) tkn_curr_pos++; + + token_start = tkn_curr_pos; + switch (*tkn_curr_pos) { + case '\0': + return newToken(self, XT_TK_EOF, NULL, NULL); + // Comment: # ... EOL + case '#': + tkn_curr_pos++; + while (*tkn_curr_pos && *tkn_curr_pos != '\n' && *tkn_curr_pos != '\r') tkn_curr_pos++; + goto restart; + case '-': + if (tkn_curr_pos[1] == '-') { + // Comment: -- ... EOL + while (*tkn_curr_pos && *tkn_curr_pos != '\n' && *tkn_curr_pos != '\r') tkn_curr_pos++; + goto restart; + } + if (myxt_isdigit(tkn_charset, tkn_curr_pos[1])) + goto is_number; + tkn_curr_pos++; + break; + case '+': + if (myxt_isdigit(tkn_charset, tkn_curr_pos[1])) + goto is_number; + tkn_curr_pos++; + break; + case '/': + tkn_curr_pos++; + if (*tkn_curr_pos == '*') { + // Comment: /* ... */ + // Look for: /*!99999 ... 
*/ version conditional statements + tkn_curr_pos++; + if (*tkn_curr_pos == '!') { + tkn_curr_pos++; + if (isdigit(*tkn_curr_pos)) { + while (isdigit(*tkn_curr_pos)) + tkn_curr_pos++; + tkn_in_comment = true; + goto restart; + } + } + + while (*tkn_curr_pos && !(*tkn_curr_pos == '*' && *(tkn_curr_pos+1) == '/')) tkn_curr_pos++; + if (*tkn_curr_pos == '*' && *(tkn_curr_pos+1) == '/') + tkn_curr_pos += 2; + goto restart; + } + break; + case '\'': + token_type = XT_TK_STRING; + goto is_string; + case '"': + case '`': + token_type = XT_TK_IDENTIFIER; + is_string: + quote = *tkn_curr_pos; + tkn_curr_pos++; + while (*tkn_curr_pos) { + if (*tkn_curr_pos == quote) { + // Doubling the quote means stay in string... + if (*(tkn_curr_pos + 1) != quote) + break; + tkn_curr_pos++; + } + tkn_curr_pos++; + } + + if (*tkn_curr_pos == quote) + tkn_curr_pos++; + break; + case '$': + goto is_identifier; + case '*': + if (tkn_in_comment) { + if (tkn_curr_pos[1] == '/') { + tkn_in_comment = false; + tkn_curr_pos += 2; + goto restart; + } + } + /* No break required! */ + default: + if (isNumberChar(tkn_curr_pos[0], tkn_curr_pos[1])) + goto is_number; + + if (isSingleChar(*tkn_curr_pos)) { + token_type = XT_TK_PUNCTUATION; + // The rest are singles... + tkn_curr_pos++; + break; + } + + is_identifier: + // Identifier (any string of characters that is not punctuation or a space: + token_type = XT_TK_IDENTIFIER; + while (isIdentifierChar(*tkn_curr_pos)) + tkn_curr_pos++; + break; + + is_number: + must_be_num = false; + token_type = XT_TK_NUMBER; + + if (*tkn_curr_pos == '-' || *tkn_curr_pos == '+') { + must_be_num = true; + tkn_curr_pos++; + } + + // Number: 9999 [ . 9999 ] [ e/E [+/-] 9999 ] + // However, 9999e or 9999E is an identifier! 
+ while (*tkn_curr_pos && myxt_isdigit(tkn_charset, *tkn_curr_pos)) tkn_curr_pos++; + + if (*tkn_curr_pos == '.') { + must_be_num = true; + tkn_curr_pos++; + while (*tkn_curr_pos && myxt_isdigit(tkn_charset, *tkn_curr_pos)) tkn_curr_pos++; + } + + if (*tkn_curr_pos == 'e' || *tkn_curr_pos == 'E') { + tkn_curr_pos++; + + if (isNumberChar(tkn_curr_pos[0], tkn_curr_pos[1])) { + must_be_num = true; + + if (*tkn_curr_pos == '-' || *tkn_curr_pos == '+') + tkn_curr_pos++; + while (*tkn_curr_pos && myxt_isdigit(tkn_charset, *tkn_curr_pos)) + tkn_curr_pos++; + } + else if (!must_be_num) + token_type = XT_TK_IDENTIFIER; + } + + if (must_be_num || !isIdentifierChar(*tkn_curr_pos)) + break; + + /* Crazy, but true. An identifier can start by looking like a number! */ + goto is_identifier; + } + + return newToken(self, token_type, token_start, tkn_curr_pos); +} + +XTToken *XTTokenizer::nextToken(XTThreadPtr self, c_char *keyword, XTToken *tk) +{ + tk->expectKeyWord(self, keyword); + return nextToken(self); +} + +/* + * ----------------------------------------------------------------------- + * Parser + */ + +/* + We must parse the following syntax. Note that the constraints + may be embedded in a CREATE TABLE/ALTER TABLE statement. + + [CONSTRAINT symbol] FOREIGN KEY [id] (index_col_name, ...) + REFERENCES tbl_name (index_col_name, ...) 
+ [ON DELETE {RESTRICT | CASCADE | SET NULL | SET DEFAULT | NO ACTION}] + [ON UPDATE {RESTRICT | CASCADE | SET NULL | SET DEFAULT | NO ACTION}] +*/ + +class XTParseTable : public XTObject { + public: + void raiseError(XTThreadPtr self, XTToken *tk, int err); + + private: + XTTokenizer *pt_tokenizer; + XTToken *pt_current; + XTStringBufferRec pt_sbuffer; + + void syntaxError(XTThreadPtr self, XTToken *tk); + + void parseIdentifier(XTThreadPtr self, char *name); + int parseKeyAction(XTThreadPtr self); + void parseCreateTable(XTThreadPtr self); + void parseAddTableItem(XTThreadPtr self); + void parseQualifiedName(XTThreadPtr self, char *name); + void parseTableName(XTThreadPtr self, bool alterTable); + void parseExpression(XTThreadPtr self, bool allow_reserved); + void parseBrackets(XTThreadPtr self); + void parseMoveColumn(XTThreadPtr self); + + /* If old_col_name is NULL, then this column is to be added, + * if old_col_name is empty (strlen() = 0) then the column + * exists, and should be modified, otherwise the column + * given is to be modified. 
+ */ + void parseColumnDefinition(XTThreadPtr self, char *old_col_name); + void parseDataType(XTThreadPtr self); + void parseReferenceDefinition(XTThreadPtr self, u_int req_cols); + void optionalIndexName(XTThreadPtr self); + void optionalIndexType(XTThreadPtr self); + u_int columnList(XTThreadPtr self, bool index_cols); + void parseAlterTable(XTThreadPtr self); + void parseCreateIndex(XTThreadPtr self); + void parseDropIndex(XTThreadPtr self); + + public: + XTParseTable() { + pt_tokenizer = NULL; + pt_current = NULL; + memset(&pt_sbuffer, 0, sizeof(XTStringBufferRec)); + } + + virtual void finalize(XTThreadPtr self __attribute__((unused))) { + if (pt_tokenizer) + delete pt_tokenizer; + xt_sb_set_size(NULL, &pt_sbuffer, 0); + } + + // Hooks to receive output from the parser: + virtual void setTableName(XTThreadPtr self __attribute__((unused)), char *name __attribute__((unused)), bool alterTable __attribute__((unused))) { + } + virtual void addColumn(XTThreadPtr self __attribute__((unused)), char *col_name __attribute__((unused)), char *old_col_name __attribute__((unused))) { + } + virtual void setDataType(XTThreadPtr self, char *cstring) { + if (cstring) + xt_free(self, cstring); + } + virtual void setNull(XTThreadPtr self __attribute__((unused)), bool nullOK __attribute__((unused))) { + } + virtual void setAutoInc(XTThreadPtr self __attribute__((unused)), bool autoInc __attribute__((unused))) { + } + + /* Add a constraint. If lastColumn is TRUE then add the constraint + * to the last column. If not, expect addListedColumn() to be called. + */ + virtual void addConstraint(XTThreadPtr self __attribute__((unused)), char *name __attribute__((unused)), u_int type __attribute__((unused)), bool lastColumn __attribute__((unused))) { + } + + /* Move the last column created. If col_name is NULL then move the column to the + * first position, else move it to the position just after the given column. 
+ */ + virtual void moveColumn(XTThreadPtr self __attribute__((unused)), char *col_name __attribute__((unused))) { + } + + virtual void dropColumn(XTThreadPtr self __attribute__((unused)), char *col_name __attribute__((unused))) { + } + + virtual void dropConstraint(XTThreadPtr self __attribute__((unused)), char *name __attribute__((unused)), u_int type __attribute__((unused))) { + } + + virtual void setIndexName(XTThreadPtr self __attribute__((unused)), char *name __attribute__((unused))) { + } + virtual void addListedColumn(XTThreadPtr self __attribute__((unused)), char *index_col_name __attribute__((unused))) { + } + virtual void setReferencedTable(XTThreadPtr self __attribute__((unused)), char *ref_table __attribute__((unused))) { + } + virtual void addReferencedColumn(XTThreadPtr self __attribute__((unused)), char *index_col_name __attribute__((unused))) { + } + virtual void setActions(XTThreadPtr self __attribute__((unused)), int on_delete __attribute__((unused)), int on_update __attribute__((unused))) { + } + + virtual void parseTable(XTThreadPtr self, bool convert, char *sql); +}; + +void XTParseTable::raiseError(XTThreadPtr self, XTToken *tk, int err) +{ + char buffer[100]; + + tk->getTokenText(buffer, 100); + xt_throw_ixterr(XT_CONTEXT, err, buffer); +} + +void XTParseTable::syntaxError(XTThreadPtr self, XTToken *tk) +{ + raiseError(self, tk, XT_ERR_SYNTAX); +} + +void XTParseTable::parseIdentifier(XTThreadPtr self, char *name) +{ + pt_current->expectIdentifier(self); + if (name) { + if (pt_current->getString(name, XT_IDENTIFIER_NAME_SIZE) >= XT_IDENTIFIER_NAME_SIZE) + raiseError(self, pt_current, XT_ERR_ID_TOO_LONG); + } + pt_current = pt_tokenizer->nextToken(self); +} + +int XTParseTable::parseKeyAction(XTThreadPtr self) +{ + XTToken *tk; + + tk = pt_tokenizer->nextToken(self); + + if (tk->isKeyWord("RESTRICT")) + return XT_KEY_ACTION_RESTRICT; + + if (tk->isKeyWord("CASCADE")) + return XT_KEY_ACTION_CASCADE; + + if (tk->isKeyWord("SET")) { + tk = 
pt_tokenizer->nextToken(self); + if (tk->isKeyWord("DEFAULT")) + return XT_KEY_ACTION_SET_DEFAULT; + tk->expectKeyWord(self, "NULL"); + return XT_KEY_ACTION_SET_NULL; + } + + if (tk->isKeyWord("NO")) { + tk = pt_tokenizer->nextToken(self); + tk->expectKeyWord(self, "ACTION"); + return XT_KEY_ACTION_NO_ACTION; + } + + syntaxError(self, tk); + return 0; +} + +void XTParseTable::parseTable(XTThreadPtr self, bool convert, char *sql) +{ + if (pt_tokenizer) + delete pt_tokenizer; + pt_tokenizer = new XTTokenizer(convert, sql); + if (!pt_tokenizer) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + pt_current = pt_tokenizer->nextToken(self); + + if (pt_current->isKeyWord("CREATE")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isKeyWord("TEMPORARY") || pt_current->isKeyWord("TABLE")) + parseCreateTable(self); + else + parseCreateIndex(self); + } + else if (pt_current->isKeyWord("ALTER")) + parseAlterTable(self); + else if (pt_current->isKeyWord("DROP")) + parseDropIndex(self); + else if (pt_current->isKeyWord("TRUNCATE")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isKeyWord("TABLE")) + pt_current = pt_tokenizer->nextToken(self); + parseTableName(self, true); + } + else if (pt_current->isKeyWord("OPTIMIZE") || pt_current->isKeyWord("REPAIR")) { + /* OPTIMIZE [LOCAL | NO_WRITE_TO_BINLOG] TABLE tbl_name [, tbl_name] ... + * + * GOTCHA: This cannot work if more than one table is specified, + * because then I cannot find the source table?! 
+ */ + pt_current = pt_tokenizer->nextToken(self); + while (!pt_current->isEOF() && !pt_current->isKeyWord("TABLE")) + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self); + parseTableName(self, true); + } + else + syntaxError(self, pt_current); +} + +void XTParseTable::parseCreateTable(XTThreadPtr self) +{ + if (pt_current->isKeyWord("TEMPORARY")) + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "TABLE", pt_current); + if (pt_current->isKeyWord("IF")) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "NOT", pt_current); + pt_current = pt_tokenizer->nextToken(self, "EXISTS", pt_current); + } + + /* Table name is optional (when loading from dictionary)! */ + if (!pt_current->isKeyWord("(")) + parseTableName(self, false); + else + setTableName(self, NULL, false); + + /* We do not support CREATE ... SELECT! */ + if (pt_current->isKeyWord("(")) { + pt_current = pt_tokenizer->nextToken(self); + // Avoid this: + // create table t3 (select group_concat(a) as a from t1 where a = 'a') union + // (select group_concat(b) as a from t1 where a = 'b'); + if (pt_current->isKeyWord("SELECT")) + return; + + /* Allow empty table definition for temporary table. 
*/ + while (!pt_current->isEOF() && !pt_current->isKeyWord(")")) { + parseAddTableItem(self); + if (!pt_current->isKeyWord(",")) + break; + pt_current = pt_tokenizer->nextToken(self); + } + pt_current = pt_tokenizer->nextToken(self, ")", pt_current); + } +} + +void XTParseTable::parseAddTableItem(XTThreadPtr self) +{ + char name[XT_IDENTIFIER_NAME_SIZE]; + + *name = 0; + if (pt_current->isKeyWord("CONSTRAINT")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isIdentifier()) + parseQualifiedName(self, name); + } + + if (pt_current->isReservedWord(XT_TK_PRIMARY)) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "KEY", pt_current); + + addConstraint(self, name, XT_DD_KEY_PRIMARY, false); + optionalIndexType(self); + + /* GOTCHA: Weird?! This syntax is used in a test: + * alter table t1 add primary key aaa(tt); + */ + if (!pt_current->isKeyWord("(")) + pt_current = pt_tokenizer->nextToken(self); + columnList(self, true); + } + else if (pt_current->isReservedWord(XT_TK_UNIQUE) || + pt_current->isReservedWord(XT_TK_FULLTEXT) || + pt_current->isReservedWord(XT_TK_SPATIAL) || + pt_current->isReservedWord(XT_TK_INDEX) || + pt_current->isReservedWord(XT_TK_KEY)) { + bool is_unique = false; + + if (pt_current->isReservedWord(XT_TK_FULLTEXT) || pt_current->isReservedWord(XT_TK_SPATIAL)) + pt_current = pt_tokenizer->nextToken(self); + else if (pt_current->isReservedWord(XT_TK_UNIQUE)) { + pt_current = pt_tokenizer->nextToken(self); + is_unique = true; + } + if (pt_current->isReservedWord(XT_TK_INDEX) || pt_current->isReservedWord(XT_TK_KEY)) + pt_current = pt_tokenizer->nextToken(self); + + addConstraint(self, name, is_unique ? 
XT_DD_INDEX_UNIQUE : XT_DD_INDEX, false); + optionalIndexName(self); + optionalIndexType(self); + columnList(self, true); + } + else if (pt_current->isReservedWord(XT_TK_CHECK)) { + pt_current = pt_tokenizer->nextToken(self); + parseExpression(self, false); + } + else if (pt_current->isReservedWord(XT_TK_FOREIGN)) { + u_int req_cols; + + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "KEY", pt_current); + + addConstraint(self, name, XT_DD_KEY_FOREIGN, false); + optionalIndexName(self); + req_cols = columnList(self, false); + /* GOTCHA: According to the MySQL manual this is optional, but without domains, + * it is required! + */ + parseReferenceDefinition(self, req_cols); + } + else if (pt_current->isKeyWord("(")) { + pt_current = pt_tokenizer->nextToken(self); + for (;;) { + parseColumnDefinition(self, NULL); + if (!pt_current->isKeyWord(",")) + break; + pt_current = pt_tokenizer->nextToken(self); + } + pt_current = pt_tokenizer->nextToken(self, ")", pt_current); + } + else { + if (pt_current->isReservedWord(XT_TK_COLUMN)) + pt_current = pt_tokenizer->nextToken(self); + parseColumnDefinition(self, NULL); + parseMoveColumn(self); + } + /* GOTCHA: Support: create table t1 (a int not null, key `a` (a) key_block_size=1024) + * and any other undocumented syntax?! 
+ */ + parseExpression(self, true); +} + +void XTParseTable::parseExpression(XTThreadPtr self, bool allow_reserved) +{ + while (!pt_current->isEOF() && !pt_current->isKeyWord(",") && + !pt_current->isKeyWord(")") && (allow_reserved || !pt_current->isReservedWord())) { + if (pt_current->isKeyWord("(")) + parseBrackets(self); + else + pt_current = pt_tokenizer->nextToken(self); + } +} + +void XTParseTable::parseBrackets(XTThreadPtr self) +{ + u_int cnt = 1; + pt_current = pt_tokenizer->nextToken(self, "(", pt_current); + while (cnt) { + if (pt_current->isEOF()) + break; + if (pt_current->isKeyWord("(")) + cnt++; + if (pt_current->isKeyWord(")")) + cnt--; + pt_current = pt_tokenizer->nextToken(self); + } +} + +void XTParseTable::parseMoveColumn(XTThreadPtr self) +{ + if (pt_current->isKeyWord("FIRST")) { + pt_current = pt_tokenizer->nextToken(self); + /* If name is NULL it means move to the front. */ + moveColumn(self, NULL); + } + else if (pt_current->isKeyWord("AFTER")) { + char name[XT_IDENTIFIER_NAME_SIZE]; + + pt_current = pt_tokenizer->nextToken(self); + parseQualifiedName(self, name); + moveColumn(self, name); + } +} + +void XTParseTable::parseQualifiedName(XTThreadPtr self, char *name) +{ + /* Should be an identifier, but I have this example: + * CREATE TABLE t1 ( comment CHAR(32) ASCII NOT NULL, koi8_ru_f CHAR(32) CHARACTER SET koi8r NOT NULL default '' ) CHARSET=latin5; + * + * COMMENT is elsewhere used as a reserved word?! + */ + if (pt_current->getString(name, XT_IDENTIFIER_NAME_SIZE) >= XT_IDENTIFIER_NAME_SIZE) + raiseError(self, pt_current, XT_ERR_ID_TOO_LONG); + pt_current = pt_tokenizer->nextToken(self); + while (pt_current->isKeyWord(".")) { + pt_current = pt_tokenizer->nextToken(self); + /* Accept anything after the DOT! 
*/ + if (pt_current->getString(name, XT_IDENTIFIER_NAME_SIZE) >= XT_IDENTIFIER_NAME_SIZE) + raiseError(self, pt_current, XT_ERR_ID_TOO_LONG); + pt_current = pt_tokenizer->nextToken(self); + } +} + +void XTParseTable::parseTableName(XTThreadPtr self, bool alterTable) +{ + char name[XT_IDENTIFIER_NAME_SIZE]; + + parseQualifiedName(self, name); + setTableName(self, name, alterTable); +} + +void XTParseTable::parseColumnDefinition(XTThreadPtr self, char *old_col_name) +{ + char col_name[XT_IDENTIFIER_NAME_SIZE]; + + // column_definition + parseQualifiedName(self, col_name); + addColumn(self, col_name, old_col_name); + parseDataType(self); + + for (;;) { + if (pt_current->isReservedWord(XT_TK_NOT)) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "NULL", pt_current); + setNull(self, false); + } + else if (pt_current->isReservedWord(XT_TK_NULL)) { + pt_current = pt_tokenizer->nextToken(self); + setNull(self, true); + } + else if (pt_current->isReservedWord(XT_TK_DEFAULT)) { + pt_current = pt_tokenizer->nextToken(self); + /* Possible here [ + | - ] <value> or [ <charset> ] <string> */ + parseExpression(self, false); + } + else if (pt_current->isReservedWord(XT_TK_AUTO_INCREMENT)) { + pt_current = pt_tokenizer->nextToken(self); + setAutoInc(self, true); + } + else if (pt_current->isReservedWord(XT_TK_UNIQUE)) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isReservedWord(XT_TK_KEY)) + pt_current = pt_tokenizer->nextToken(self); + addConstraint(self, NULL, XT_DD_INDEX_UNIQUE, true); + } + else if (pt_current->isReservedWord(XT_TK_KEY)) { + pt_current = pt_tokenizer->nextToken(self); + addConstraint(self, NULL, XT_DD_INDEX, true); + } + else if (pt_current->isReservedWord(XT_TK_PRIMARY)) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "KEY", pt_current); + addConstraint(self, NULL, XT_DD_KEY_PRIMARY, true); + } + else if (pt_current->isReservedWord(XT_TK_COMMENT)) { + 
pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self); + } + else if (pt_current->isReservedWord(XT_TK_REFERENCES)) { + addConstraint(self, NULL, XT_DD_KEY_FOREIGN, true); + parseReferenceDefinition(self, 1); + } + else if (pt_current->isReservedWord(XT_TK_CHECK)) { + pt_current = pt_tokenizer->nextToken(self); + parseExpression(self, false); + } + /* GOTCHA: Not in the documentation: + * CREATE TABLE t1 (c varchar(255) NOT NULL COLLATE utf8_general_ci, INDEX (c)) + */ + else if (pt_current->isReservedWord(XT_TK_COLLATE)) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self); + } + else + break; + } +} + +void XTParseTable::parseDataType(XTThreadPtr self) +{ + /* Not actually implemented because MySQL allows undocumented + * syntax like this: + * create table t1 (c national character varying(10)) + */ + parseExpression(self, false); + setDataType(self, NULL); +} + +void XTParseTable::optionalIndexName(XTThreadPtr self) +{ + // [index_name] + if (!pt_current->isKeyWord("USING") && !pt_current->isKeyWord("(")) { + char name[XT_IDENTIFIER_NAME_SIZE]; + + parseIdentifier(self, name); + setIndexName(self, name); + } +} + +void XTParseTable::optionalIndexType(XTThreadPtr self) +{ + // USING {BTREE | HASH} + if (pt_current->isKeyWord("USING")) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self); + } +} + +u_int XTParseTable::columnList(XTThreadPtr self, bool index_cols) +{ + char name[XT_IDENTIFIER_NAME_SIZE]; + u_int cols = 0; + + pt_current->expectKeyWord(self, "("); + do { + pt_current = pt_tokenizer->nextToken(self); + parseQualifiedName(self, name); + addListedColumn(self, name); + cols++; + if (index_cols) { + if (pt_current->isKeyWord("(")) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, ")", pt_current); + } + if (pt_current->isKeyWord("ASC")) + pt_current = 
pt_tokenizer->nextToken(self); + else if (pt_current->isKeyWord("DESC")) + pt_current = pt_tokenizer->nextToken(self); + } + } while (pt_current->isKeyWord(",")); + pt_current = pt_tokenizer->nextToken(self, ")", pt_current); + return cols; +} + +void XTParseTable::parseReferenceDefinition(XTThreadPtr self, u_int req_cols) +{ + int on_delete = XT_KEY_ACTION_DEFAULT; + int on_update = XT_KEY_ACTION_DEFAULT; + char name[XT_IDENTIFIER_NAME_SIZE]; + u_int cols = 0; + + // REFERENCES tbl_name + pt_current = pt_tokenizer->nextToken(self, "REFERENCES", pt_current); + parseQualifiedName(self, name); + setReferencedTable(self, name); + + // [ (index_col_name,...) ] + if (pt_current->isKeyWord("(")) { + pt_current->expectKeyWord(self, "("); + do { + pt_current = pt_tokenizer->nextToken(self); + parseQualifiedName(self, name); + addReferencedColumn(self, name); + cols++; + if (cols > req_cols) + raiseError(self, pt_current, XT_ERR_INCORRECT_NO_OF_COLS); + } while (pt_current->isKeyWord(",")); + if (cols != req_cols) + raiseError(self, pt_current, XT_ERR_INCORRECT_NO_OF_COLS); + pt_current = pt_tokenizer->nextToken(self, ")", pt_current); + } + else + addReferencedColumn(self, NULL); + + // [MATCH FULL | MATCH PARTIAL | MATCH SIMPLE] + if (pt_current->isKeyWord("MATCH")) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self); + } + + // [ON DELETE {RESTRICT | CASCADE | SET NULL | SET DEFAULT | NO ACTION}] + // [ON UPDATE {RESTRICT | CASCADE | SET NULL | SET DEFAULT | NO ACTION}] + while (pt_current->isKeyWord("ON")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isKeyWord("DELETE")) + on_delete = parseKeyAction(self); + else if (pt_current->isKeyWord("UPDATE")) + on_update = parseKeyAction(self); + else + syntaxError(self, pt_current); + pt_current = pt_tokenizer->nextToken(self); + } + + setActions(self, on_delete, on_update); +} + +void XTParseTable::parseAlterTable(XTThreadPtr self) +{ + char 
name[XT_IDENTIFIER_NAME_SIZE]; + + pt_current = pt_tokenizer->nextToken(self, "ALTER", pt_current); + if (pt_current->isKeyWord("IGNORE")) + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "TABLE", pt_current); + parseTableName(self, true); + for (;;) { + if (pt_current->isKeyWord("ADD")) { + pt_current = pt_tokenizer->nextToken(self); + parseAddTableItem(self); + } + else if (pt_current->isKeyWord("ALTER")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isReservedWord(XT_TK_COLUMN)) + pt_current = pt_tokenizer->nextToken(self); + pt_current->expectIdentifier(self); + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isKeyWord("SET")) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "DEFAULT", pt_current); + pt_current = pt_tokenizer->nextToken(self); + } + else if (pt_current->isKeyWord("DROP")) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "DEFAULT", pt_current); + } + } + else if (pt_current->isKeyWord("CHANGE")) { + char old_col_name[XT_IDENTIFIER_NAME_SIZE]; + + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isReservedWord(XT_TK_COLUMN)) + pt_current = pt_tokenizer->nextToken(self); + + parseQualifiedName(self, old_col_name); + parseColumnDefinition(self, old_col_name); + parseMoveColumn(self); + } + else if (pt_current->isKeyWord("MODIFY")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isReservedWord(XT_TK_COLUMN)) + pt_current = pt_tokenizer->nextToken(self); + parseColumnDefinition(self, NULL); + parseMoveColumn(self); + } + else if (pt_current->isKeyWord("DROP")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isReservedWord(XT_TK_PRIMARY)) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "KEY", pt_current); + dropConstraint(self, NULL, XT_DD_KEY_PRIMARY); + } + else if 
(pt_current->isReservedWord(XT_TK_INDEX) || pt_current->isReservedWord(XT_TK_KEY)) { + pt_current = pt_tokenizer->nextToken(self); + parseIdentifier(self, name); + dropConstraint(self, name, XT_DD_INDEX); + } + else if (pt_current->isReservedWord(XT_TK_FOREIGN)) { + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "KEY", pt_current); + parseIdentifier(self, name); + dropConstraint(self, name, XT_DD_KEY_FOREIGN); + } + else { + if (pt_current->isReservedWord(XT_TK_COLUMN)) + pt_current = pt_tokenizer->nextToken(self); + parseQualifiedName(self, name); + dropColumn(self, name); + } + } + else if (pt_current->isKeyWord("RENAME")) { + pt_current = pt_tokenizer->nextToken(self); + if (pt_current->isKeyWord("TO")) + pt_current = pt_tokenizer->nextToken(self); + parseQualifiedName(self, name); + } + else + /* Just ignore the syntax until the next , */ + parseExpression(self, true); + if (!pt_current->isKeyWord(",")) + break; + pt_current = pt_tokenizer->nextToken(self); + } +} + +void XTParseTable::parseCreateIndex(XTThreadPtr self) +{ + char name[XT_IDENTIFIER_NAME_SIZE]; + bool is_unique = false; + + if (pt_current->isReservedWord(XT_TK_UNIQUE)) { + pt_current = pt_tokenizer->nextToken(self); + is_unique = true; + } + else if (pt_current->isReservedWord(XT_TK_FULLTEXT)) + pt_current = pt_tokenizer->nextToken(self); + else if (pt_current->isReservedWord(XT_TK_SPATIAL)) + pt_current = pt_tokenizer->nextToken(self); + pt_current = pt_tokenizer->nextToken(self, "INDEX", pt_current); + parseQualifiedName(self, name); + optionalIndexType(self); + pt_current = pt_tokenizer->nextToken(self, "ON", pt_current); + parseTableName(self, true); + addConstraint(self, NULL, is_unique ? 
XT_DD_INDEX_UNIQUE : XT_DD_INDEX, false); + setIndexName(self, name); + columnList(self, true); +} + +void XTParseTable::parseDropIndex(XTThreadPtr self) +{ + char name[XT_IDENTIFIER_NAME_SIZE]; + + pt_current = pt_tokenizer->nextToken(self, "DROP", pt_current); + pt_current = pt_tokenizer->nextToken(self, "INDEX", pt_current); + parseQualifiedName(self, name); + pt_current = pt_tokenizer->nextToken(self, "ON", pt_current); + parseTableName(self, true); + dropConstraint(self, name, XT_DD_INDEX); +} + +/* + * ----------------------------------------------------------------------- + * Create/Alter table table + */ + +class XTCreateTable : public XTParseTable { + public: + bool ct_convert; + struct charset_info_st *ct_charset; + XTPathStrPtr ct_tab_path; + u_int ct_contraint_no; + XTDDTable *ct_curr_table; + XTDDColumn *ct_curr_column; + XTDDConstraint *ct_curr_constraint; + + XTCreateTable(bool convert, XTPathStrPtr tab_path) : XTParseTable() { + ct_convert = convert; + ct_charset = myxt_getcharset(convert); + ct_tab_path = tab_path; + ct_curr_table = NULL; + ct_curr_column = NULL; + ct_curr_constraint = NULL; + } + + virtual void finalize(XTThreadPtr self) { + if (ct_curr_table) + ct_curr_table->release(self); + XTParseTable::finalize(self); + } + + virtual void setTableName(XTThreadPtr self, char *name, bool alterTable); + virtual void addColumn(XTThreadPtr self, char *col_name, char *old_col_name); + virtual void addConstraint(XTThreadPtr self, char *name, u_int type, bool lastColumn); + virtual void dropConstraint(XTThreadPtr self, char *name, u_int type); + virtual void addListedColumn(XTThreadPtr self, char *index_col_name); + virtual void setReferencedTable(XTThreadPtr self, char *ref_table); + virtual void addReferencedColumn(XTThreadPtr self, char *index_col_name); + virtual void setActions(XTThreadPtr self, int on_delete, int on_update); + + virtual void parseTable(XTThreadPtr self, bool convert, char *sql); +}; + +static void 
ri_free_create_table(XTThreadPtr self, XTCreateTable *ct) +{ + if (ct) + ct->release(self); +} + +XTDDTable *xt_ri_create_table(XTThreadPtr self, bool convert, XTPathStrPtr tab_path, char *sql, XTDDTable *start_tab) +{ + XTCreateTable *ct; + XTDDTable *dd_tab; + + if (!(ct = new XTCreateTable(convert, tab_path))) { + if (start_tab) + start_tab->release(self); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + } + + ct->ct_curr_table = start_tab; + + pushr_(ri_free_create_table, ct); + + ct->parseTable(self, convert, sql); + + /* Return the table ... */ + dd_tab = ct->ct_curr_table; + ct->ct_curr_table = NULL; + + freer_(); + return dd_tab; +} + +void XTCreateTable::parseTable(XTThreadPtr self, bool convert, char *sql) +{ + u_int i; + + ct_contraint_no = 0; + XTParseTable::parseTable(self, convert, sql); + + /* Remove constraints that do not have matching columns. */ + for (i=0; i<ct_curr_table->dt_indexes.size();) { + if (!ct_curr_table->dt_indexes.itemAt(i)->attachColumns()) + ct_curr_table->dt_indexes.remove(self, i); + else + i++; + } + + for (i=0; i<ct_curr_table->dt_fkeys.size(); ) { + if (!ct_curr_table->dt_fkeys.itemAt(i)->attachColumns()) + ct_curr_table->dt_fkeys.remove(self, i); + else + i++; + } +} + +void XTCreateTable::setTableName(XTThreadPtr self, char *name, bool alterTable) +{ + char path[PATH_MAX]; + + if (!name) + return; + + xt_strcpy(PATH_MAX, path, ct_tab_path->ps_path); + xt_remove_last_name_of_path(path); + + if (ct_convert) { + char buffer[XT_IDENTIFIER_NAME_SIZE]; + size_t len; + + myxt_static_convert_identifier(self, ct_charset, name, buffer, XT_IDENTIFIER_NAME_SIZE); + len = strlen(path); + myxt_static_convert_table_name(self, buffer, &path[len], PATH_MAX - len); + } + else + xt_strcat(PATH_MAX, path, name); + + if (alterTable) { + XTTableHPtr tab; + + /* Find the table... 
*/ + pushsr_(tab, xt_heap_release, xt_use_table(self, (XTPathStrPtr) path, FALSE, TRUE, NULL)); + + /* Clone the foreign key definitions: */ + if (tab && tab->tab_dic.dic_table) { + ct_curr_table->dt_fkeys.deleteAll(self); + ct_curr_table->dt_fkeys.clone(self, &tab->tab_dic.dic_table->dt_fkeys); + for (u_int i=0; i<ct_curr_table->dt_fkeys.size(); i++) + ct_curr_table->dt_fkeys.itemAt(i)->co_table = ct_curr_table; + } + + freer_(); // xt_heap_release(tab) + } +} + +/* + * old_name is given if the column name was changed. + * NOTE that we built the table description from the current MySQL table + * description. This means that all changes to columns and + * indexes have already been applied. + * + * Our job is to now add the foreign key changes. + * This means we have to note the current column here. It is + * possible to add a FOREIGN KEY constraint directly to a column! + */ +void XTCreateTable::addColumn(XTThreadPtr self, char *new_name, char *old_name) +{ + char new_col_name[XT_IDENTIFIER_NAME_SIZE]; + + myxt_static_convert_identifier(self, ct_charset, new_name, new_col_name, XT_IDENTIFIER_NAME_SIZE); + ct_curr_column = ct_curr_table->findColumn(new_col_name); + if (old_name) { + char old_col_name[XT_IDENTIFIER_NAME_SIZE]; + + myxt_static_convert_identifier(self, ct_charset, old_name, old_col_name, XT_IDENTIFIER_NAME_SIZE); + ct_curr_table->alterColumnName(self, old_col_name, new_col_name); + } +} + +void XTCreateTable::addConstraint(XTThreadPtr self, char *name, u_int type, bool lastColumn) +{ + /* We are only interested in foreign keys! 
*/ + if (type == XT_DD_KEY_FOREIGN) { + char buffer[50]; + + if (!(ct_curr_constraint = new XTDDForeignKey())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + ct_curr_table->dt_fkeys.append(self, (XTDDForeignKey *) ct_curr_constraint); + ct_curr_constraint->co_table = ct_curr_table; + + if (name && *name) + ct_curr_constraint->co_name = myxt_convert_identifier(self, ct_charset, name); + else { + // Generate a default constraint name: + ct_contraint_no++; + sprintf(buffer, "FOREIGN_%d", ct_contraint_no); + ct_curr_constraint->co_name = xt_dup_string(self, buffer); + } + + if (lastColumn && ct_curr_column) { + /* This constraint has one column, the current column. */ + XTDDColumnRef *cref; + char *col_name = xt_dup_string(self, ct_curr_column->dc_name); + + if (!(cref = new XTDDColumnRef())) { + xt_free(self, col_name); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + } + cref->cr_col_name = col_name; + ct_curr_constraint->co_cols.append(self, cref); + } + } + else + /* Other constraints/indexes do not interest us: */ + ct_curr_constraint = NULL; +} + +void XTCreateTable::dropConstraint(XTThreadPtr self, char *name, u_int type) +{ + if (type == XT_DD_KEY_FOREIGN && name) { + u_int i; + XTDDForeignKey *fkey; + char con_name[XT_IDENTIFIER_NAME_SIZE]; + + myxt_static_convert_identifier(self, ct_charset, name, con_name, XT_IDENTIFIER_NAME_SIZE); + for (i=0; i<ct_curr_table->dt_fkeys.size(); i++) { + fkey = ct_curr_table->dt_fkeys.itemAt(i); + if (fkey->co_name && myxt_strcasecmp(con_name, fkey->co_name) == 0) { + ct_curr_table->dt_fkeys.remove(fkey); + fkey->release(self); + } + } + } +} + +void XTCreateTable::addListedColumn(XTThreadPtr self, char *index_col_name) +{ + if (ct_curr_constraint && ct_curr_constraint->co_type == XT_DD_KEY_FOREIGN) { + XTDDColumnRef *cref; + char *name = myxt_convert_identifier(self, ct_charset, index_col_name); + + if (!(cref = new XTDDColumnRef())) { + xt_free(self, name); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + } + cref->cr_col_name = name; + 
ct_curr_constraint->co_cols.append(self, cref); + } +} + +void XTCreateTable::setReferencedTable(XTThreadPtr self, char *ref_table) +{ + XTDDForeignKey *fk = (XTDDForeignKey *) ct_curr_constraint; + char path[PATH_MAX]; + + xt_strcpy(PATH_MAX, path, ct_tab_path->ps_path); + xt_remove_last_name_of_path(path); + if (ct_convert) { + char buffer[XT_IDENTIFIER_NAME_SIZE]; + size_t len; + + myxt_static_convert_identifier(self, ct_charset, ref_table, buffer, XT_IDENTIFIER_NAME_SIZE); + len = strlen(path); + myxt_static_convert_table_name(self, buffer, &path[len], PATH_MAX - len); + } + else + xt_strcat(PATH_MAX, path, ref_table); + + fk->fk_ref_tab_name = (XTPathStrPtr) xt_dup_string(self, path); +} + +/* If the referenced column is NULL, this means + * duplicate the local column list! + */ +void XTCreateTable::addReferencedColumn(XTThreadPtr self, char *index_col_name) +{ + XTDDForeignKey *fk = (XTDDForeignKey *) ct_curr_constraint; + XTDDColumnRef *cref; + char *name; + + if (index_col_name) { + name = myxt_convert_identifier(self, ct_charset, index_col_name); + if (!(cref = new XTDDColumnRef())) { + xt_free(self, name); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + } + cref->cr_col_name = name; + fk->fk_ref_cols.append(self, cref); + } + else + fk->fk_ref_cols.clone(self, &fk->co_cols); +} + +void XTCreateTable::setActions(XTThreadPtr self __attribute__((unused)), int on_delete, int on_update) +{ + XTDDForeignKey *fk = (XTDDForeignKey *) ct_curr_constraint; + + fk->fk_on_delete = on_delete; + fk->fk_on_update = on_update; +} + +/* + * ----------------------------------------------------------------------- + * Dictionary methods + */ + +void XTDDColumn::init(XTThreadPtr self, XTObject *obj) { + XTDDColumn *col = (XTDDColumn *) obj; + + XTObject::init(self, obj); + if (col->dc_name) + dc_name = xt_dup_string(self, col->dc_name); + if (col->dc_data_type) + dc_data_type = xt_dup_string(self, col->dc_data_type); + dc_null_ok = col->dc_null_ok; + dc_auto_inc = col->dc_auto_inc; 
+} + +void XTDDColumn::finalize(XTThreadPtr self) +{ + if (dc_name) + xt_free(self, dc_name); + if (dc_data_type) + xt_free(self, dc_data_type); +} + +void XTDDColumn::loadString(XTThreadPtr self, XTStringBufferPtr sb) +{ + xt_sb_concat(self, sb, "`"); + xt_sb_concat(self, sb, dc_name); + xt_sb_concat(self, sb, "` "); + if (dc_data_type) { + xt_sb_concat(self, sb, dc_data_type); + if (dc_null_ok) + xt_sb_concat(self, sb, " NULL"); + else + xt_sb_concat(self, sb, " NOT NULL"); + if (dc_auto_inc) + xt_sb_concat(self, sb, " AUTO_INCREMENT"); + } +} + +void XTDDColumnRef::init(XTThreadPtr self, XTObject *obj) +{ + XTDDColumnRef *cr = (XTDDColumnRef *) obj; + + XTObject::init(self, obj); + cr_col_name = xt_dup_string(self, cr->cr_col_name); +} + +void XTDDColumnRef::finalize(XTThreadPtr self) +{ + XTObject::finalize(self); + if (cr_col_name) { + xt_free(self, cr_col_name); + cr_col_name = NULL; + } +} + +void XTDDConstraint::init(XTThreadPtr self, XTObject *obj) +{ + XTDDConstraint *co = (XTDDConstraint *) obj; + + XTObject::init(self, obj); + co_type = co->co_type; + if (co->co_name) + co_name = xt_dup_string(self, co->co_name); + if (co->co_ind_name) + co_ind_name = xt_dup_string(self, co->co_ind_name); + co_cols.clone(self, &co->co_cols); +} + +void XTDDConstraint::loadString(XTThreadPtr self, XTStringBufferPtr sb) +{ + if (co_name) { + xt_sb_concat(self, sb, "CONSTRAINT `"); + xt_sb_concat(self, sb, co_name); + xt_sb_concat(self, sb, "` "); + } + switch (co_type) { + case XT_DD_INDEX: + xt_sb_concat(self, sb, "INDEX "); + break; + case XT_DD_INDEX_UNIQUE: + xt_sb_concat(self, sb, "UNIQUE INDEX "); + break; + case XT_DD_KEY_PRIMARY: + xt_sb_concat(self, sb, "PRIMARY KEY "); + break; + case XT_DD_KEY_FOREIGN: + xt_sb_concat(self, sb, "FOREIGN KEY "); + break; + } + if (co_ind_name) { + xt_sb_concat(self, sb, "`"); + xt_sb_concat(self, sb, co_ind_name); + xt_sb_concat(self, sb, "` "); + } + xt_sb_concat(self, sb, "(`"); + xt_sb_concat(self, sb, 
co_cols.itemAt(0)->cr_col_name); + for (u_int i=1; i<co_cols.size(); i++) { + xt_sb_concat(self, sb, "`, `"); + xt_sb_concat(self, sb, co_cols.itemAt(i)->cr_col_name); + } + xt_sb_concat(self, sb, "`)"); +} + +void XTDDConstraint::alterColumnName(XTThreadPtr self, char *from_name, char *to_name) +{ + XTDDColumnRef *col; + + for (u_int i=0; i<co_cols.size(); i++) { + col = co_cols.itemAt(i); + if (myxt_strcasecmp(col->cr_col_name, from_name) == 0) { + char *name = xt_dup_string(self, to_name); + + xt_free(self, col->cr_col_name); + col->cr_col_name = name; + break; + } + } +} + +void XTDDConstraint::getColumnList(char *buffer, size_t size) +{ + if (co_table->dt_table) { + /* Start the buffer with a copy, then append to it. */ + xt_strcpy(size, buffer, "`"); + xt_strcat(size, buffer, co_table->dt_table->tab_name->ps_path); + xt_strcat(size, buffer, "` (`"); + } + else + xt_strcpy(size, buffer, "(`"); + xt_strcat(size, buffer, co_cols.itemAt(0)->cr_col_name); + for (u_int i=1; i<co_cols.size(); i++) { + xt_strcat(size, buffer, "`, `"); + xt_strcat(size, buffer, co_cols.itemAt(i)->cr_col_name); + } + xt_strcat(size, buffer, "`)"); +} + +bool XTDDConstraint::sameColumns(XTDDConstraint *co) +{ + u_int i = 0; + + if (co_cols.size() != co->co_cols.size()) + return false; + while (i<co_cols.size()) { + if (myxt_strcasecmp(co_cols.itemAt(i)->cr_col_name, co->co_cols.itemAt(i)->cr_col_name) != 0) + return false; + i++; + } + return true; +} + +bool XTDDConstraint::attachColumns() +{ + XTDDColumn *col; + + for (u_int i=0; i<co_cols.size(); i++) { + if (!(col = co_table->findColumn(co_cols.itemAt(i)->cr_col_name))) + return false; + /* If this is a primary key, then the column becomes not-null! 
*/ + if (co_type == XT_DD_KEY_PRIMARY) + col->dc_null_ok = false; + } + return true; +} + +void XTDDTableRef::finalize(XTThreadPtr self) +{ + XTDDForeignKey *fk; + + if ((fk = tr_fkey)) { + tr_fkey = NULL; + fk->removeReference(self); + xt_heap_release(self, fk->co_table->dt_table); /* We referenced the database table, not the foreign key */ + } + XTObject::finalize(self); +} + +bool XTDDTableRef::checkReference(xtWord1 *before_buf, XTThreadPtr thread) +{ + XTIndexPtr loc_ind, ind; + xtBool no_null = TRUE; + XTOpenTablePtr ot; + XTIdxSearchKeyRec search_key; + xtXactID xn_id; + XTXactWaitRec xw; + + if (!(loc_ind = tr_fkey->getReferenceIndexPtr())) + return false; + + if (!(ind = tr_fkey->getIndexPtr())) + return false; + + search_key.sk_key_value.sv_flags = 0; + search_key.sk_key_value.sv_rec_id = 0; + search_key.sk_key_value.sv_row_id = 0; + search_key.sk_key_value.sv_key = search_key.sk_key_buf; + search_key.sk_key_value.sv_length = myxt_create_foreign_key_from_row(loc_ind, search_key.sk_key_buf, before_buf, ind, &no_null); + search_key.sk_on_key = FALSE; + + if (!no_null) + return true; + + /* Search for the key in the child (referencing) table: */ + if (!(ot = xt_db_open_table_using_tab(tr_fkey->co_table->dt_table, thread))) + goto failed; + + retry: + if (!xt_idx_search(ot, ind, &search_key)) + goto failed; + + while (ot->ot_curr_rec_id && search_key.sk_on_key) { + switch (xt_tab_maybe_committed(ot, ot->ot_curr_rec_id, &xn_id, &ot->ot_curr_row_id, &ot->ot_curr_updated)) { + case XT_MAYBE: + xw.xw_xn_id = xn_id; + if (!xt_xn_wait_for_xact(thread, &xw, NULL)) + goto failed; + goto retry; + case XT_ERR: + goto failed; + case TRUE: + /* We found a matching child: */ + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_ROW_IS_REFERENCED, tr_fkey->co_name); + goto failed; + break; + case FALSE: + if (!xt_idx_next(ot, ind, &search_key)) + goto failed; + break; + } + } + + /* No matching children, all OK: */ + xt_db_return_table_to_pool_ns(ot); + return true; + + failed: + 
xt_db_return_table_to_pool_ns(ot); + return false; +} + +/* + * A row has been deleted or updated (after_buf non-NULL), check if it is referenced by the foreign key table. + * If it is referenced, then we need to follow the specified action. + */ +bool XTDDTableRef::modifyRow(XTOpenTablePtr XT_UNUSED(ref_ot), xtWord1 *before_buf, xtWord1 *after_buf, XTThreadPtr thread) +{ + XTIndexPtr loc_ind, ind; + xtBool no_null = TRUE; + XTOpenTablePtr ot; + XTIdxSearchKeyRec search_key; + xtXactID xn_id; + int action = after_buf ? tr_fkey->fk_on_update : tr_fkey->fk_on_delete; + u_int after_key_len = 0; + xtWord1 *after_key = NULL; + XTInfoBufferRec after_info; + XTXactWaitRec xw; + + after_info.ib_free = FALSE; + + if (!(loc_ind = tr_fkey->getReferenceIndexPtr())) + return false; + + if (!(ind = tr_fkey->getIndexPtr())) + return false; + + search_key.sk_key_value.sv_flags = 0; + search_key.sk_key_value.sv_rec_id = 0; + search_key.sk_key_value.sv_row_id = 0; + search_key.sk_key_value.sv_key = search_key.sk_key_buf; + search_key.sk_key_value.sv_length = myxt_create_foreign_key_from_row(loc_ind, search_key.sk_key_buf, before_buf, ind, &no_null); + search_key.sk_on_key = FALSE; + + if (!no_null) + return true; + + if (after_buf) { + if (!(after_key = (xtWord1 *) xt_malloc_ns(XT_INDEX_MAX_KEY_SIZE))) + return false; + after_key_len = myxt_create_foreign_key_from_row(loc_ind, after_key, after_buf, ind, NULL); + + /* Check whether the key value has changed, if not, we have nothing + * to do here! 
+ */ + if (myxt_compare_key(ind, 0, search_key.sk_key_value.sv_length, + search_key.sk_key_value.sv_key, after_key) == 0) + goto success; + + } + + /* Search for the key in the child (referencing) table: */ + if (!(ot = xt_db_open_table_using_tab(tr_fkey->co_table->dt_table, thread))) + goto failed; + + retry: + if (!xt_idx_search(ot, ind, &search_key)) + goto failed_2; + + while (ot->ot_curr_rec_id && search_key.sk_on_key) { + switch (xt_tab_maybe_committed(ot, ot->ot_curr_rec_id, &xn_id, &ot->ot_curr_row_id, &ot->ot_curr_updated)) { + case XT_MAYBE: + xw.xw_xn_id = xn_id; + if (!xt_xn_wait_for_xact(thread, &xw, NULL)) + goto failed_2; + goto retry; + case XT_ERR: + goto failed_2; + case TRUE: + /* We found a matching child: */ + switch (action) { + case XT_KEY_ACTION_CASCADE: + if (after_buf) { + /* Do a cascaded update: */ + if (!xt_tab_load_record(ot, ot->ot_curr_rec_id, &after_info)) + goto failed_2; + + if (!myxt_create_row_from_key(ot, ind, after_key, after_key_len, after_info.ib_db.db_data)) + goto failed_2; + + if (!xt_tab_update_record(ot, NULL, after_info.ib_db.db_data)) { + // Change to duplicate foreign key + if (ot->ot_thread->t_exception.e_xt_err == XT_ERR_DUPLICATE_KEY) + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_DUPLICATE_FKEY, tr_fkey->co_name); + goto failed_2; + } + } + else { + /* Do a cascaded delete: */ + if (!xt_tab_delete_record(ot, NULL)) + goto failed_2; + } + break; + case XT_KEY_ACTION_SET_NULL: + if (!xt_tab_load_record(ot, ot->ot_curr_rec_id, &after_info)) + goto failed_2; + + myxt_set_null_row_from_key(ot, ind, after_info.ib_db.db_data); + + if (!xt_tab_update_record(ot, NULL, after_info.ib_db.db_data)) + goto failed_2; + break; + case XT_KEY_ACTION_SET_DEFAULT: + + if (!xt_tab_load_record(ot, ot->ot_curr_rec_id, &after_info)) + goto failed_2; + + myxt_set_default_row_from_key(ot, ind, after_info.ib_db.db_data); + + if (!xt_tab_update_record(ot, NULL, after_info.ib_db.db_data)) + goto failed_2; + + break; + case 
XT_KEY_ACTION_NO_ACTION: +#ifdef XT_IMPLEMENT_NO_ACTION + XTRestrictItemRec r; + + r.ri_tab_id = ref_ot->ot_table->tab_id; + r.ri_rec_id = ref_ot->ot_curr_rec_id; + if (!xt_bl_append(NULL, &thread->st_restrict_list, (void *) &r)) + goto failed_2; + break; +#endif + default: + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_ROW_IS_REFERENCED, tr_fkey->co_name); + goto failed_2; + } + /* Fall through to next: */ + case FALSE: + if (!xt_idx_next(ot, ind, &search_key)) + goto failed_2; + break; + } + } + + /* No matching children, all OK: */ + xt_db_return_table_to_pool_ns(ot); + + success: + xt_ib_free(NULL, &after_info); + if (after_key) + xt_free_ns(after_key); + return true; + + failed_2: + xt_db_return_table_to_pool_ns(ot); + + failed: + xt_ib_free(NULL, &after_info); + if (after_key) + xt_free_ns(after_key); + return false; +} + +void XTDDTableRef::deleteAllRows(XTThreadPtr self) +{ + XTOpenTablePtr ot; + xtInt8 row_count; + + if (!tr_fkey->getReferenceIndexPtr()) + throw_(); + + if (!tr_fkey->getIndexPtr()) + throw_(); + + if (!(ot = xt_db_open_table_using_tab(tr_fkey->co_table->dt_table, self))) + throw_(); + + row_count = ((xtInt8) ot->ot_table->tab_row_eof_id) - 1; + row_count -= (xtInt8) ot->ot_table->tab_row_fnum; + + xt_db_return_table_to_pool_ns(ot); + + if (row_count > 0) + xt_throw_ixterr(XT_CONTEXT, XT_ERR_ROW_IS_REFERENCED, tr_fkey->co_name); +} + +void XTDDIndex::init(XTThreadPtr self, XTObject *obj) +{ + XTDDConstraint::init(self, obj); +} + +XTIndexPtr XTDDIndex::getIndexPtr() +{ + if (in_index >= co_table->dt_table->tab_dic.dic_key_count) { + XTDDIndex *in; + + if (!(in = co_table->findIndex(this))) + return NULL; + in_index = in->in_index; + } + return co_table->dt_table->tab_dic.dic_keys[in_index]; +} + +void XTDDForeignKey::init(XTThreadPtr self, XTObject *obj) +{ + XTDDForeignKey *fk = (XTDDForeignKey *) obj; + + XTDDIndex::init(self, obj); + if (fk->fk_ref_tab_name) + fk_ref_tab_name = (XTPathStrPtr) xt_dup_string(self, 
fk->fk_ref_tab_name->ps_path); + fk_ref_cols.clone(self, &fk->fk_ref_cols); + fk_on_delete = fk->fk_on_delete; + fk_on_update = fk->fk_on_update; +} + +void XTDDForeignKey::finalize(XTThreadPtr self) +{ + XTDDTable *ref_tab; + + if (fk_ref_tab_name) { + xt_free(self, fk_ref_tab_name); + fk_ref_tab_name = NULL; + } + + if ((ref_tab = fk_ref_table)) { + fk_ref_table = NULL; + ref_tab->removeReference(self, this); + xt_heap_release(self, ref_tab->dt_table); /* We referenced the table, not the index! */ + } + + fk_ref_index = UINT_MAX; + + fk_ref_cols.deleteAll(self); + XTDDConstraint::finalize(self); +} + +void XTDDForeignKey::loadString(XTThreadPtr self, XTStringBufferPtr sb) +{ + XTDDConstraint::loadString(self, sb); + xt_sb_concat(self, sb, " REFERENCES `"); + xt_sb_concat(self, sb, xt_last_name_of_path(fk_ref_tab_name->ps_path)); + xt_sb_concat(self, sb, "` "); + + xt_sb_concat(self, sb, "(`"); + xt_sb_concat(self, sb, fk_ref_cols.itemAt(0)->cr_col_name); + for (u_int i=1; i<fk_ref_cols.size(); i++) { + xt_sb_concat(self, sb, "`, `"); + xt_sb_concat(self, sb, fk_ref_cols.itemAt(i)->cr_col_name); + } + xt_sb_concat(self, sb, "`)"); + + if (fk_on_delete != XT_KEY_ACTION_DEFAULT && fk_on_delete != XT_KEY_ACTION_RESTRICT) { + xt_sb_concat(self, sb, " ON DELETE "); + switch (fk_on_delete) { + case XT_KEY_ACTION_CASCADE: xt_sb_concat(self, sb, "CASCADE"); break; + case XT_KEY_ACTION_SET_NULL: xt_sb_concat(self, sb, "SET NULL"); break; + case XT_KEY_ACTION_SET_DEFAULT: xt_sb_concat(self, sb, "SET DEFAULT"); break; + case XT_KEY_ACTION_NO_ACTION: xt_sb_concat(self, sb, "NO ACTION"); break; + } + } + if (fk_on_update != XT_KEY_ACTION_DEFAULT && fk_on_update != XT_KEY_ACTION_RESTRICT) { + xt_sb_concat(self, sb, " ON UPDATE "); + switch (fk_on_update) { + case XT_KEY_ACTION_DEFAULT: xt_sb_concat(self, sb, "RESTRICT"); break; + case XT_KEY_ACTION_RESTRICT: xt_sb_concat(self, sb, "RESTRICT"); break; + case XT_KEY_ACTION_CASCADE: xt_sb_concat(self, sb, "CASCADE"); break; + case 
XT_KEY_ACTION_SET_NULL: xt_sb_concat(self, sb, "SET NULL"); break; + case XT_KEY_ACTION_SET_DEFAULT: xt_sb_concat(self, sb, "SET DEFAULT"); break; + case XT_KEY_ACTION_NO_ACTION: xt_sb_concat(self, sb, "NO ACTION"); break; + } + } +} + +void XTDDForeignKey::getReferenceList(char *buffer, size_t size) +{ + buffer[0] = '`'; + xt_strcpy(size, buffer + 1, xt_last_name_of_path(fk_ref_tab_name->ps_path)); + xt_strcat(size, buffer, "` ("); + xt_strcat(size, buffer, fk_ref_cols.itemAt(0)->cr_col_name); + for (u_int i=1; i<fk_ref_cols.size(); i++) { + xt_strcat(size, buffer, ", "); + xt_strcat(size, buffer, fk_ref_cols.itemAt(i)->cr_col_name); + } + xt_strcat(size, buffer, ")"); +} + +struct XTIndex *XTDDForeignKey::getReferenceIndexPtr() +{ + if (!fk_ref_table) { + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_REF_TABLE_NOT_FOUND, fk_ref_tab_name); + return NULL; + } + if (fk_ref_index >= fk_ref_table->dt_table->tab_dic.dic_key_count) { + XTDDIndex *in; + + if (!(in = fk_ref_table->findReferenceIndex(this))) + return NULL; + if (!checkReferencedTypes(fk_ref_table)) + return NULL; + fk_ref_index = in->in_index; + } + + return fk_ref_table->dt_table->tab_dic.dic_keys[fk_ref_index]; +} + +bool XTDDForeignKey::sameReferenceColumns(XTDDConstraint *co) +{ + u_int i = 0; + + if (fk_ref_cols.size() != co->co_cols.size()) + return false; + while (i<fk_ref_cols.size()) { + if (myxt_strcasecmp(fk_ref_cols.itemAt(i)->cr_col_name, co->co_cols.itemAt(i)->cr_col_name) != 0) + return false; + i++; + } + return OK; +} + +bool XTDDForeignKey::checkReferencedTypes(XTDDTable *dt) +{ + XTDDColumn *col, *ref_col; + XTDDEnumerableColumn *enum_col, *enum_ref_col; + + if (dt->dt_table->tab_dic.dic_tab_flags & XT_TAB_FLAGS_TEMP_TAB) { + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_FK_REF_TEMP_TABLE); + return false; + } + + for (u_int i=0; i<co_cols.size() && i<fk_ref_cols.size(); i++) { + col = co_table->findColumn(co_cols.itemAt(i)->cr_col_name); + ref_col = 
dt->findColumn(fk_ref_cols.itemAt(i)->cr_col_name); + if (!col || !ref_col) + continue; + + enum_col = col->castToEnumerable(); + enum_ref_col = ref_col->castToEnumerable(); + + if (!enum_col && !enum_ref_col && (strcmp(col->dc_data_type, ref_col->dc_data_type) == 0)) + continue; + + /* Allow match varchar(30) == varchar(40): */ + if (strncmp(col->dc_data_type, "varchar", 7) == 0 && strncmp(ref_col->dc_data_type, "varchar", 7) == 0) { + char *t1, *t2; + + t1 = col->dc_data_type + 7; + while (*t1 && (isdigit(*t1) || *t1 == '(' || *t1 == ')')) t1++; + t2 = ref_col->dc_data_type + 7; + while (*t2 && (isdigit(*t2) || *t2 == '(' || *t2 == ')')) t2++; + + if (strcmp(t1, t2) == 0) + continue; + } + + /* + * MySQL stores ENUMs as integer indexes for string values. That's why + * it is ok to have references between columns that are different ENUMs as long + * as they contain an equal number of members, so that for example a cascade update + * will not cause an invalid value to be stored in the child table. + * + * The above is also true for SETs. + * + */ + + if (enum_col && enum_ref_col && + (enum_col->enum_size == enum_ref_col->enum_size) && + (enum_col->is_enum == enum_ref_col->is_enum)) + continue; + + xt_register_tabcolerr(XT_REG_CONTEXT, XT_ERR_REF_TYPE_WRONG, fk_ref_tab_name, ref_col->dc_name); + return false; + } + return true; +} + +void XTDDForeignKey::removeReference(XTThreadPtr self) +{ + XTDDTable *ref_tab; + + xt_xlock_rwlock(self, &co_table->dt_ref_lock); + pushr_(xt_unlock_rwlock, &co_table->dt_ref_lock); + + if ((ref_tab = fk_ref_table)) { + fk_ref_table = NULL; + ref_tab->removeReference(self, this); + xt_heap_release(self, ref_tab->dt_table); /* We referenced the table, not the index! */ + } + + fk_ref_index = UINT_MAX; + + freer_(); // xt_unlock_rwlock(&co_table->dt_ref_lock); +} + +/* + * A row was inserted, check that a key exists in the referenced + * table. 
+ */ +bool XTDDForeignKey::insertRow(xtWord1 *before_buf, xtWord1 *rec_buf, XTThreadPtr thread) +{ + XTIndexPtr loc_ind, ind; + xtBool no_null = TRUE; + XTOpenTablePtr ot; + XTIdxSearchKeyRec search_key; + xtXactID xn_id; + XTXactWaitRec xw; + + /* This lock ensures that the foreign key references are not + * changed. + */ + xt_slock_rwlock_ns(&co_table->dt_ref_lock); + + if (!(loc_ind = getIndexPtr())) + goto failed; + + if (!(ind = getReferenceIndexPtr())) + goto failed; + + search_key.sk_key_value.sv_flags = 0; + search_key.sk_key_value.sv_rec_id = 0; + search_key.sk_key_value.sv_row_id = 0; + search_key.sk_key_value.sv_key = search_key.sk_key_buf; + search_key.sk_key_value.sv_length = myxt_create_foreign_key_from_row(loc_ind, search_key.sk_key_buf, rec_buf, ind, &no_null); + search_key.sk_on_key = FALSE; + + if (!no_null) + goto success; + + if (before_buf) { + u_int before_key_len; + xtWord1 before_key[XT_INDEX_MAX_KEY_SIZE]; + + /* If there is a before buffer, this insert was an update, so check + * if the key value has changed. If not, we need not do anything. + */ + before_key_len = myxt_create_foreign_key_from_row(loc_ind, before_key, before_buf, ind, NULL); + + /* Check whether the key value has changed, if not, we have nothing + * to do here! + */ + if (search_key.sk_key_value.sv_length == before_key_len && + memcmp(search_key.sk_key_buf, before_key, before_key_len) == 0) + goto success; + } + + /* Search for the key in the parent (referenced) table: */ + if (!(ot = xt_db_open_table_using_tab(fk_ref_table->dt_table, thread))) + goto failed; + + retry: + if (!xt_idx_search(ot, ind, &search_key)) + goto failed_2; + + while (ot->ot_curr_rec_id) { + if (!search_key.sk_on_key) + break; + + switch (xt_tab_maybe_committed(ot, ot->ot_curr_rec_id, &xn_id, &ot->ot_curr_row_id, &ot->ot_curr_updated)) { + case XT_MAYBE: + /* We should not get a deadlock here because the thread + * that we are waiting for should not be doing + * data definition (i.e. 
should not be trying to + * get an exclusive lock on dt_ref_lock. + */ + xw.xw_xn_id = xn_id; + if (!xt_xn_wait_for_xact(thread, &xw, NULL)) + goto failed_2; + goto retry; + case XT_ERR: + goto failed_2; + case TRUE: + /* We found a matching parent: */ + xt_db_return_table_to_pool_ns(ot); + goto success; + case FALSE: + if (!xt_idx_next(ot, ind, &search_key)) + goto failed_2; + break; + } + } + + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_NO_REFERENCED_ROW, co_name); + + failed_2: + xt_db_return_table_to_pool_ns(ot); + + failed: + xt_unlock_rwlock_ns(&co_table->dt_ref_lock); + return false; + + success: + xt_unlock_rwlock_ns(&co_table->dt_ref_lock); + return true; +} + +/* + * Convert XT_KEY_ACTION_* constants to strings + */ +const char *XTDDForeignKey::actionTypeToString(int action) +{ + switch (action) + { + case XT_KEY_ACTION_DEFAULT: + case XT_KEY_ACTION_RESTRICT: + return "RESTRICT"; + case XT_KEY_ACTION_CASCADE: + return "CASCADE"; + case XT_KEY_ACTION_SET_NULL: + return "SET NULL"; + case XT_KEY_ACTION_SET_DEFAULT: + return ""; + case XT_KEY_ACTION_NO_ACTION: + return "NO ACTION"; + } + + return ""; +} + +void XTDDTable::init(XTThreadPtr self) +{ + xt_init_rwlock_with_autoname(self, &dt_ref_lock); + dt_trefs = NULL; +} + +void XTDDTable::init(XTThreadPtr self, XTObject *obj) +{ + XTDDTable *tab = (XTDDTable *) obj; + u_int i; + + init(self); + XTObject::init(self, obj); + dt_cols.clone(self, &tab->dt_cols); + dt_indexes.clone(self, &tab->dt_indexes); + dt_fkeys.clone(self, &tab->dt_fkeys); + + for (i=0; i<dt_indexes.size(); i++) + dt_indexes.itemAt(i)->co_table = this; + for (i=0; i<dt_fkeys.size(); i++) + dt_fkeys.itemAt(i)->co_table = this; +} + +void XTDDTable::finalize(XTThreadPtr self) +{ + XTDDTableRef *ptr; + + removeReferences(self); + + dt_cols.deleteAll(self); + dt_indexes.deleteAll(self); + dt_fkeys.deleteAll(self); + + while (dt_trefs) { + ptr = dt_trefs; + dt_trefs = dt_trefs->tr_next; + ptr->release(self); + } + + xt_free_rwlock(&dt_ref_lock); 
+} + +XTDDColumn *XTDDTable::findColumn(char *name) +{ + XTDDColumn *col; + + for (u_int i=0; i<dt_cols.size(); i++) { + col = dt_cols.itemAt(i); + if (myxt_strcasecmp(name, col->dc_name) == 0) + return col; + } + return NULL; +} + +void XTDDTable::loadString(XTThreadPtr self, XTStringBufferPtr sb) +{ + u_int i; + + /* I do not specify a table name because that is known */ + xt_sb_concat(self, sb, "CREATE TABLE (\n "); + + /* We only need to save the foreign key definitions!! + for (i=0; i<dt_cols.size(); i++) { + if (i != 0) + xt_sb_concat(self, sb, ",\n "); + dt_cols.itemAt(i)->loadString(self, sb); + } + + for (i=0; i<dt_indexes.size(); i++) { + xt_sb_concat(self, sb, ",\n "); + dt_indexes.itemAt(i)->loadString(self, sb); + } + */ + + for (i=0; i<dt_fkeys.size(); i++) { + if (i != 0) + xt_sb_concat(self, sb, ",\n "); + dt_fkeys.itemAt(i)->loadString(self, sb); + } + + xt_sb_concat(self, sb, "\n)\n"); +} + +void XTDDTable::loadForeignKeyString(XTThreadPtr self, XTStringBufferPtr sb) +{ + for (u_int i=0; i<dt_fkeys.size(); i++) { + xt_sb_concat(self, sb, ",\n "); + dt_fkeys.itemAt(i)->loadString(self, sb); + } +} + +/* Change all references to the given column name to new name. */ +void XTDDTable::alterColumnName(XTThreadPtr self, char *from_name, char *to_name) +{ + u_int i; + + /* We only alter references in the foreign keys (we copied the + * other changes from MySQL). 
+ */ + for (i=0; i<dt_fkeys.size(); i++) + dt_fkeys.itemAt(i)->alterColumnName(self, from_name, to_name); +} + +void XTDDTable::attachReference(XTThreadPtr self, XTDDForeignKey *fk) +{ + XTDDTableRef *tr; + + /* Remove the reference to this FK if one exists: */ + removeReference(self, fk); + + if (!fk->checkReferencedTypes(this)) { + if (!self->st_ignore_fkeys) + throw_(); + } + + xt_xlock_rwlock(self, &dt_ref_lock); + pushr_(xt_unlock_rwlock, &dt_ref_lock); + + if (!(tr = new XTDDTableRef())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + tr->tr_fkey = fk; + tr->tr_next = dt_trefs; + dt_trefs = tr; + + /* Reference the database table of the foreign key, not the FK itself. + * Just referencing the key will not guarantee that the + * table remains valid because the FK does not reference the + * table. + */ + xt_heap_reference(self, fk->co_table->dt_table); + + freer_(); // xt_unlock_rwlock(&dt_ref_lock); +} + +/* + * Remove the reference to the given foreign key. + */ +void XTDDTable::removeReference(XTThreadPtr self, XTDDForeignKey *fk) +{ + XTDDTableRef *tr, *prev_tr = NULL; + + xt_xlock_rwlock(self, &dt_ref_lock); + pushr_(xt_unlock_rwlock, &dt_ref_lock); + + tr = dt_trefs; + while (tr) { + if (tr->tr_fkey == fk) { + if (prev_tr) + prev_tr->tr_next = tr->tr_next; + else + dt_trefs = tr->tr_next; + break; + } + prev_tr = tr; + tr = tr->tr_next; + } + freer_(); // xt_unlock_rwlock(&dt_ref_lock); + if (tr) + tr->release(self); +} + +void XTDDTable::checkForeignKeyReference(XTThreadPtr self, XTDDForeignKey *fk) +{ + XTDDColumnRef *cr; + + for (u_int i=0; i<fk->fk_ref_cols.size(); i++) { + cr = fk->fk_ref_cols.itemAt(i); + if (!findColumn(cr->cr_col_name)) + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_COLUMN_NOT_FOUND, fk->fk_ref_tab_name, cr->cr_col_name); + } +} + +void XTDDTable::attachReference(XTThreadPtr self, XTDDTable *dt) +{ + XTDDForeignKey *fk; + + for (u_int i=0; i<dt_fkeys.size(); i++) { + fk = dt_fkeys.itemAt(i); + if 
(xt_tab_compare_names(fk->fk_ref_tab_name->ps_path, dt->dt_table->tab_name->ps_path) == 0) { + fk->removeReference(self); + + dt->attachReference(self, fk); + + xt_xlock_rwlock(self, &dt_ref_lock); + pushr_(xt_unlock_rwlock, &dt_ref_lock); + /* Referenced the table, not the index! + * We do this because we know that if the table is referenced, the + * index will remain valid! + * This is because the table references the index, and only + * releases it when the table is released. The index does not + * reference the table though! + */ + xt_heap_reference(self, dt->dt_table); + fk->fk_ref_table = dt; + freer_(); // xt_unlock_rwlock(&dt_ref_lock); + } + } +} + +/* + * This function assumes the database table list is locked! + */ +void XTDDTable::attachReferences(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTDDForeignKey *fk; + XTTableHPtr tab; + XTDDTable *dt; + XTHashEnumRec tables; + + /* Search for table referenced by this table. */ + for (u_int i=0; i<dt_fkeys.size(); i++) { + fk = dt_fkeys.itemAt(i); + fk->removeReference(self); + + // if self-reference + if (xt_tab_compare_names(fk->fk_ref_tab_name->ps_path, this->dt_table->tab_name->ps_path) == 0) + fk->fk_ref_table = this; + else { + /* get pointer to the referenced table, load it if needed + * cyclic references are being handled, absent table is ignored + */ + tab = xt_use_table_no_lock(self, db, fk->fk_ref_tab_name, /*TRUE*/FALSE, /*FALSE*/TRUE, NULL, NULL); + + if (tab) { + pushr_(xt_heap_release, tab); + if ((dt = tab->tab_dic.dic_table)) { + // Add a reverse reference: + dt->attachReference(self, fk); + xt_heap_reference(self, dt->dt_table); /* Referenced the table, not the index! */ + fk->fk_ref_table = dt; + } + freer_(); // xt_heap_release(tab) + } + else if (!self->st_ignore_fkeys) { + xt_throw_taberr(XT_CONTEXT, XT_ERR_REF_TABLE_NOT_FOUND, fk->fk_ref_tab_name); + } + } + } + + /* Search for tables that reference this table. 
*/ + xt_ht_enum(self, dt_table->tab_db->db_tables, &tables); + while ((tab = (XTTableHPtr) xt_ht_next(self, &tables))) { + if (tab == this->dt_table) /* no need to re-reference itself, also this fails with "native" pthreads */ + continue; + xt_heap_reference(self, tab); + pushr_(xt_heap_release, tab); + if ((dt = tab->tab_dic.dic_table)) + dt->attachReference(self, this); + freer_(); // xt_heap_release(tab) + } +} + +void XTDDTable::removeReferences(XTThreadPtr self) +{ + XTDDForeignKey *fk; + XTDDTableRef *tr; + XTDDTable *tab; + + xt_xlock_rwlock(self, &dt_ref_lock); + pushr_(xt_unlock_rwlock, &dt_ref_lock); + + for (u_int i=0; i<dt_fkeys.size(); i++) { + fk = dt_fkeys.itemAt(i); + if ((tab = fk->fk_ref_table)) { + fk->fk_ref_table = NULL; + fk->fk_ref_index = UINT_MAX; + if (tab != this) { + /* To avoid deadlock we do not hold more than + * one lock at a time! + */ + freer_(); // xt_unlock_rwlock(&dt_ref_lock); + + tab->removeReference(self, fk); + xt_heap_release(self, tab->dt_table); /* We referenced the table, not the index! */ + + xt_xlock_rwlock(self, &dt_ref_lock); + pushr_(xt_unlock_rwlock, &dt_ref_lock); + } + } + } + + while (dt_trefs) { + tr = dt_trefs; + dt_trefs = tr->tr_next; + freer_(); // xt_unlock_rwlock(&dt_ref_lock); + tr->release(self); + xt_xlock_rwlock(self, &dt_ref_lock); + pushr_(xt_unlock_rwlock, &dt_ref_lock); + } + + freer_(); // xt_unlock_rwlock(&dt_ref_lock); +} + +void XTDDTable::checkForeignKeys(XTThreadPtr self, bool temp_table) +{ + XTDDForeignKey *fk; + + if (temp_table && dt_fkeys.size()) { + /* Temporary tables cannot have foreign keys: */ + xt_throw_xterr(XT_CONTEXT, XT_ERR_FK_ON_TEMP_TABLE); + + } + + /* Search for table referenced by this table. */ + for (u_int i=0; i<dt_fkeys.size(); i++) { + fk = dt_fkeys.itemAt(i); + + if (fk->fk_on_delete == XT_KEY_ACTION_SET_NULL || fk->fk_on_update == XT_KEY_ACTION_SET_NULL) { + /* Check that all the columns can be set to NULL! 
*/ + XTDDColumn *col; + + for (u_int j=0; j<fk->co_cols.size(); j++) { + if ((col = findColumn(fk->co_cols.itemAt(j)->cr_col_name))) { + if (!col->dc_null_ok) + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_COLUMN_IS_NOT_NULL, fk->fk_ref_tab_name, col->dc_name); + } + } + } + + // TODO: dont close table immediately so it can be possibly reused in this loop + XTTable *ref_tab; + + pushsr_(ref_tab, xt_heap_release, xt_use_table(self, fk->fk_ref_tab_name, FALSE, TRUE, NULL)); + if (ref_tab && !fk->checkReferencedTypes(ref_tab->tab_dic.dic_table)) + throw_(); + freer_(); + + /* Currently I allow foreign keys to be created on tables that do not yet exist! + pushsr_(tab, xt_heap_release, xt_use_table(self, fk->fk_ref_tab_name, FALSE FALSE)); + if ((dt = tab->tab_dic.dic_table)) + dt->checkForeignKeyReference(self, fk); + freer_(); // xt_heap_release(tab) + */ + } +} + +XTDDIndex *XTDDTable::findIndex(XTDDConstraint *co) +{ + XTDDIndex *ind; + + for (u_int i=0; i<dt_indexes.size(); i++) { + ind = dt_indexes.itemAt(i); + if (co->sameColumns(ind)) + return ind; + } + { + char buffer[XT_ERR_MSG_SIZE - 200]; + + co->getColumnList(buffer, XT_ERR_MSG_SIZE - 200); + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_NO_MATCHING_INDEX, buffer); + } + return NULL; +} + +XTDDIndex *XTDDTable::findReferenceIndex(XTDDForeignKey *fk) +{ + XTDDIndex *ind; + XTDDColumnRef *cr; + u_int i; + + for (i=0; i<dt_indexes.size(); i++) { + ind = dt_indexes.itemAt(i); + if (fk->sameReferenceColumns(ind)) + return ind; + } + + /* If the index does not exist, maybe the columns do not exist?! 
*/ + for (i=0; i<fk->fk_ref_cols.size(); i++) { + cr = fk->fk_ref_cols.itemAt(i); + if (!findColumn(cr->cr_col_name)) { + xt_register_tabcolerr(XT_REG_CONTEXT, XT_ERR_COLUMN_NOT_FOUND, fk->fk_ref_tab_name, cr->cr_col_name); + return NULL; + } + } + + { + char buffer[XT_ERR_MSG_SIZE - 200]; + + fk->getReferenceList(buffer, XT_ERR_MSG_SIZE - 200); + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_NO_MATCHING_INDEX, buffer); + } + return NULL; +} + +bool XTDDTable::insertRow(XTOpenTablePtr ot, xtWord1 *rec_ptr) +{ + bool ok = true; + XTInfoBufferRec rec_buf; + + if (ot->ot_thread->st_ignore_fkeys) + return true; + + rec_buf.ib_free = FALSE; + if (!rec_ptr) { + if (!xt_tab_load_record(ot, ot->ot_curr_rec_id, &rec_buf)) + return false; + rec_ptr = rec_buf.ib_db.db_data; + + } + for (u_int i=0; i<dt_fkeys.size(); i++) { + if (!dt_fkeys.itemAt(i)->insertRow(NULL, rec_ptr, ot->ot_thread)) { + ok = false; + break; + } + } + xt_ib_free(NULL, &rec_buf); + return ok; +} + +bool XTDDTable::checkNoAction(XTOpenTablePtr ot, xtRecordID rec_id) +{ + XTDDTableRef *tr; + bool ok = true; + XTInfoBufferRec rec_buf; + xtWord1 *rec_ptr; + + if (ot->ot_thread->st_ignore_fkeys) + return true; + + rec_buf.ib_free = FALSE; + if (!xt_tab_load_record(ot, rec_id, &rec_buf)) + return false; + rec_ptr = rec_buf.ib_db.db_data; + + xt_slock_rwlock_ns(&dt_ref_lock); + tr = dt_trefs; + while (tr) { + if (!tr->checkReference(rec_ptr, ot->ot_thread)) { + ok = false; + break; + } + tr = tr->tr_next; + } + xt_unlock_rwlock_ns(&dt_ref_lock); + xt_ib_free(NULL, &rec_buf); + return ok; +} + +bool XTDDTable::deleteRow(XTOpenTablePtr ot, xtWord1 *rec_ptr) +{ + XTDDTableRef *tr; + bool ok = true; + XTInfoBufferRec rec_buf; + + if (ot->ot_thread->st_ignore_fkeys) + return true; + + rec_buf.ib_free = FALSE; + if (!rec_ptr) { + if (!xt_tab_load_record(ot, ot->ot_curr_rec_id, &rec_buf)) + return false; + rec_ptr = rec_buf.ib_db.db_data; + + } + xt_slock_rwlock_ns(&dt_ref_lock); + tr = dt_trefs; + while (tr) { + if 
(!tr->modifyRow(ot, rec_ptr, NULL, ot->ot_thread)) { + ok = false; + break; + } + tr = tr->tr_next; + } + xt_unlock_rwlock_ns(&dt_ref_lock); + xt_ib_free(NULL, &rec_buf); + return ok; +} + +void XTDDTable::deleteAllRows(XTThreadPtr self) +{ + XTDDTableRef *tr; + + xt_slock_rwlock(self, &dt_ref_lock); + pushr_(xt_unlock_rwlock, &dt_ref_lock); + + tr = dt_trefs; + while (tr) { + tr->deleteAllRows(self); + tr = tr->tr_next; + } + + freer_(); // xt_unlock_rwlock(&dt_ref_lock); +} + +bool XTDDTable::updateRow(XTOpenTablePtr ot, xtWord1 *before, xtWord1 *after) +{ + XTDDTableRef *tr; + bool ok; + XTInfoBufferRec before_buf; + + ASSERT_NS(after); + + if (ot->ot_thread->st_ignore_fkeys) + return true; + + /* If before is NULL then this is a cascaded + * update. In this case there is no need to check + * if the column has a parent!! + */ + if (before) { + if (dt_fkeys.size() > 0) { + for (u_int i=0; i<dt_fkeys.size(); i++) { + if (!dt_fkeys.itemAt(i)->insertRow(before, after, ot->ot_thread)) + return false; + } + } + } + + ok = true; + before_buf.ib_free = FALSE; + + xt_slock_rwlock_ns(&dt_ref_lock); + if ((tr = dt_trefs)) { + if (!before) { + if (!xt_tab_load_record(ot, ot->ot_curr_rec_id, &before_buf)) + return false; + before = before_buf.ib_db.db_data; + } + + while (tr) { + if (!tr->modifyRow(ot, before, after, ot->ot_thread)) { + ok = false; + break; + } + tr = tr->tr_next; + } + } + xt_unlock_rwlock_ns(&dt_ref_lock); + + xt_ib_free(NULL, &before_buf); + return ok; +} + +xtBool XTDDTable::checkCanDrop() +{ + /* no refs or references only itself */ + return (dt_trefs == NULL) || + (dt_trefs->tr_next == NULL) && (dt_trefs->tr_fkey->co_table == this); +} diff --git a/storage/pbxt/src/datadic_xt.h b/storage/pbxt/src/datadic_xt.h new file mode 100644 index 00000000000..825914b60f3 --- /dev/null +++ b/storage/pbxt/src/datadic_xt.h @@ -0,0 +1,295 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can 
redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2004-01-03 Paul McCullagh + * + * H&G2JCtL + * + * Implementation of the PBXT internal data dictionary. + */ + +#ifndef __datadic_xt_h__ +#define __datadic_xt_h__ + +#include <stddef.h> +#include <limits.h> + +#include "ccutils_xt.h" +#include "util_xt.h" + +struct XTDatabase; +struct XTTable; +struct XTIndex; +struct XTOpenTable; +struct XTIndex; + +/* Constraint types: */ +#define XT_DD_UNKNOWN ((u_int) -1) +#define XT_DD_INDEX 0 +#define XT_DD_INDEX_UNIQUE 1 +#define XT_DD_KEY_PRIMARY 2 +#define XT_DD_KEY_FOREIGN 3 + +#define XT_KEY_ACTION_DEFAULT 0 +#define XT_KEY_ACTION_RESTRICT 1 +#define XT_KEY_ACTION_CASCADE 2 +#define XT_KEY_ACTION_SET_NULL 3 +#define XT_KEY_ACTION_SET_DEFAULT 4 +#define XT_KEY_ACTION_NO_ACTION 5 /* Like RESTRICT, but check at end of statement. 
*/ + +class XTDDEnumerableColumn; +class XTDDColumnFactory; + +class XTDDColumn : public XTObject { + +protected: + + XTDDColumn() : XTObject(), + dc_name(NULL), + dc_data_type(NULL), + dc_null_ok(true), + dc_auto_inc(false) { + } + +public: + char *dc_name; + char *dc_data_type; + bool dc_null_ok; + bool dc_auto_inc; + + virtual XTObject *factory(XTThreadPtr self) { + XTObject *new_obj; + + if (!(new_obj = new XTDDColumn())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return new_obj; + } + + virtual void init(XTThreadPtr self) { + XTObject::init(self); + } + virtual void init(XTThreadPtr self, XTObject *obj); + virtual void finalize(XTThreadPtr self); + virtual void loadString(XTThreadPtr self, XTStringBufferPtr sb); + + virtual XTDDEnumerableColumn *castToEnumerable() { + return NULL; + } + + friend class XTDDColumnFactory; +}; + +/* + * subclass for ENUMs and SETs + */ +class XTDDEnumerableColumn : public XTDDColumn { + +protected: + XTDDEnumerableColumn() : XTDDColumn(), + enum_size(0), is_enum(0) { + } + +public: + int enum_size; /* number of elements in the ENUM or SET */ + xtBool is_enum; /* TRUE if this is ENUM, FALSE if SET */ + + virtual XTObject *factory(XTThreadPtr self) { + XTObject *new_obj; + + if (!(new_obj = new XTDDEnumerableColumn())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return new_obj; + } + + virtual XTDDEnumerableColumn *castToEnumerable() { + return this; + } + + friend class XTDDColumnFactory; +}; + +class XTDDColumnRef : public XTObject { + public: + char *cr_col_name; + + XTDDColumnRef() : XTObject(), cr_col_name(NULL) { } + + virtual XTObject *factory(XTThreadPtr self) { + XTObject *new_obj; + + if (!(new_obj = new XTDDColumnRef())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return new_obj; + } + + virtual void init(XTThreadPtr self, XTObject *obj); + virtual void finalize(XTThreadPtr self); +}; + +class XTDDConstraint : public XTObject { + public: + class XTDDTable *co_table; /* The table of this constraint (non-referenced). 
*/ + u_int co_type; + char *co_name; + char *co_ind_name; + XTList<XTDDColumnRef> co_cols; + + XTDDConstraint(u_int t) : XTObject(), + co_table(NULL), + co_type(t), + co_name(NULL), + co_ind_name(NULL) { + } + + virtual void init(XTThreadPtr self, XTObject *obj); + virtual void finalize(XTThreadPtr self) { + if (co_name) + xt_free(self, co_name); + if (co_ind_name) + xt_free(self, co_ind_name); + co_cols.deleteAll(self); + XTObject::finalize(self); + } + virtual void loadString(XTThreadPtr self, XTStringBufferPtr sb); + virtual void alterColumnName(XTThreadPtr self, char *from_name, char *to_name); + void getColumnList(char *buffer, size_t size); + bool sameColumns(XTDDConstraint *co); + bool attachColumns(); +}; + +class XTDDTableRef : public XTObject { + public: + class XTDDTableRef *tr_next; /* The next reference in the list. */ + class XTDDForeignKey *tr_fkey; /* The foreign key that references this table (if not-NULL). */ + + XTDDTableRef() : XTObject(), tr_next(NULL), tr_fkey(NULL) { } + virtual void finalize(XTThreadPtr self); + bool modifyRow(struct XTOpenTable *tab, xtWord1 *before, xtWord1 *after, XTThreadPtr thread); + bool checkReference(xtWord1 *before, XTThreadPtr thread); + void deleteAllRows(XTThreadPtr self); +}; + +class XTDDIndex : public XTDDConstraint { + public: + u_int in_index; + + XTDDIndex(u_int type) : XTDDConstraint(type), in_index((u_int) -1) { } + + virtual XTObject *factory(XTThreadPtr self) { + XTObject *new_obj; + + if (!(new_obj = new XTDDIndex(XT_DD_UNKNOWN))) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return new_obj; + } + + virtual void init(XTThreadPtr self, XTObject *obj); + struct XTIndex *getIndexPtr(); +}; + +/* + * A foreign key is based on a local index. + */ +class XTDDForeignKey : public XTDDIndex { + public: + XTPathStrPtr fk_ref_tab_name; + XTDDTable *fk_ref_table; + u_int fk_ref_index; /* The index on which this foreign key references. 
*/ + XTList<XTDDColumnRef> fk_ref_cols; + int fk_on_delete; + int fk_on_update; + + XTDDForeignKey() : XTDDIndex(XT_DD_KEY_FOREIGN), + fk_ref_tab_name(NULL), + fk_ref_table(NULL), + fk_ref_index(UINT_MAX), + fk_on_delete(0), + fk_on_update(0) { + } + + virtual XTObject *factory(XTThreadPtr self) { + XTObject *new_obj; + + if (!(new_obj = new XTDDForeignKey())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return new_obj; + } + + virtual void init(XTThreadPtr self, XTObject *obj); + virtual void finalize(XTThreadPtr self); + virtual void loadString(XTThreadPtr self, XTStringBufferPtr sb); + void getReferenceList(char *buffer, size_t size); + struct XTIndex *getReferenceIndexPtr(); + bool sameReferenceColumns(XTDDConstraint *co); + bool checkReferencedTypes(XTDDTable *dt); + void removeReference(XTThreadPtr self); + bool insertRow(xtWord1 *before, xtWord1 *after, XTThreadPtr thread); + bool updateRow(xtWord1 *before, xtWord1 *after, XTThreadPtr thread); + + static const char *actionTypeToString(int action); +}; + +class XTDDTable : public XTObject { + private: + + public: + struct XTTable *dt_table; + + XTList<XTDDColumn> dt_cols; + XTList<XTDDIndex> dt_indexes; + + xt_rwlock_type dt_ref_lock; /* The lock for adding and using references. */ + XTList<XTDDForeignKey> dt_fkeys; /* The foreign keys on this table. */ + XTDDTableRef *dt_trefs; /* A list of tables that reference this table. 
*/ + + virtual XTObject *factory(XTThreadPtr self) { + XTObject *new_obj; + + if (!(new_obj = new XTDDTable())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return new_obj; + } + + virtual void init(XTThreadPtr self); + virtual void init(XTThreadPtr self, XTObject *obj); + virtual void finalize(XTThreadPtr self); + + XTDDColumn *findColumn(char *name); + void loadString(XTThreadPtr self, XTStringBufferPtr sb); + void loadForeignKeyString(XTThreadPtr self, XTStringBufferPtr sb); + void checkForeignKeyReference(XTThreadPtr self, XTDDForeignKey *fk); + void attachReferences(XTThreadPtr self, struct XTDatabase *db); + void attachReference(XTThreadPtr self, XTDDForeignKey *fk); + void alterColumnName(XTThreadPtr self, char *from_name, char *to_name); + void attachReference(XTThreadPtr self, XTDDTable *dt); + void removeReferences(XTThreadPtr self); + void removeReference(XTThreadPtr self, XTDDForeignKey *fk); + void checkForeignKeys(XTThreadPtr self, bool temp_table); + XTDDIndex *findIndex(XTDDConstraint *co); + XTDDIndex *findReferenceIndex(XTDDForeignKey *fk); + bool insertRow(struct XTOpenTable *rec_ot, xtWord1 *buffer); + bool checkNoAction(struct XTOpenTable *ot, xtRecordID rec_id); + xtBool checkCanDrop(); + bool deleteRow(struct XTOpenTable *rec_ot, xtWord1 *buffer); + void deleteAllRows(XTThreadPtr self); + bool updateRow(struct XTOpenTable *rec_ot, xtWord1 *before, xtWord1 *after); +}; + +XTDDTable *xt_ri_create_table(XTThreadPtr self, bool convert, XTPathStrPtr tab_path, char *sql, XTDDTable *my_tab); + +#endif diff --git a/storage/pbxt/src/datalog_xt.cc b/storage/pbxt/src/datalog_xt.cc new file mode 100644 index 00000000000..dc9423e7eac --- /dev/null +++ b/storage/pbxt/src/datalog_xt.cc @@ -0,0 +1,2052 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; 
either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-24 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <stdio.h> +#ifndef XT_WIN +#include <unistd.h> +#include <signal.h> +#endif +#include <stdlib.h> + +#ifndef DRIZZLED +#include "mysql_priv.h" +#endif + +#include "ha_pbxt.h" + +#include "filesys_xt.h" +#include "database_xt.h" +#include "memory_xt.h" +#include "strutil_xt.h" +#include "sortedlist_xt.h" +#include "util_xt.h" +#include "heap_xt.h" +#include "table_xt.h" +#include "trace_xt.h" +#include "myxt_xt.h" + +static void dl_wake_co_thread(XTDatabaseHPtr db); + +/* + * -------------------------------------------------------------------------------- + * SEQUENTIAL READING + */ + +xtBool XTDataSeqRead::sl_seq_init(struct XTDatabase *db, size_t buffer_size) +{ + sl_db = db; + sl_buffer_size = buffer_size; + + sl_log_file = NULL; + sl_log_eof = 0; + + sl_buf_log_offset = 0; + sl_buffer_len = 0; + sl_buffer = (xtWord1 *) xt_malloc_ns(buffer_size); + + sl_rec_log_id = 0; + sl_rec_log_offset = 0; + sl_record_len = 0; + + return sl_buffer != NULL; +} + +void XTDataSeqRead::sl_seq_exit() +{ + if (sl_log_file) { + xt_close_file_ns(sl_log_file); + sl_log_file = NULL; + } + if (sl_buffer) { + xt_free_ns(sl_buffer); + sl_buffer = NULL; + } +} + +XTOpenFilePtr XTDataSeqRead::sl_seq_open_file() +{ + return sl_log_file; +} + +void XTDataSeqRead::sl_seq_pos(xtLogID *log_id, xtLogOffset *log_offset) +{ + *log_id = sl_rec_log_id; + *log_offset = 
sl_rec_log_offset; +} + +xtBool XTDataSeqRead::sl_seq_start(xtLogID log_id, xtLogOffset log_offset, xtBool missing_ok) +{ + if (sl_rec_log_id != log_id) { + if (sl_log_file) { + xt_close_file_ns(sl_log_file); + sl_log_file = NULL; + } + + sl_rec_log_id = log_id; + sl_buf_log_offset = sl_rec_log_offset; + sl_buffer_len = 0; + + if (!sl_db->db_datalogs.dlc_open_log(&sl_log_file, log_id, missing_ok ? XT_FS_MISSING_OK : XT_FS_DEFAULT)) + return FAILED; + if (sl_log_file) + sl_log_eof = xt_seek_eof_file(NULL, sl_log_file); + } + sl_rec_log_offset = log_offset; + sl_record_len = 0; + return OK; +} + +xtBool XTDataSeqRead::sl_rnd_read(xtLogOffset log_offset, size_t size, xtWord1 *buffer, size_t *data_read, struct XTThread *thread) +{ + if (!sl_log_file) { + *data_read = 0; + return OK; + } + return xt_pread_file(sl_log_file, log_offset, size, 0, buffer, data_read, &thread->st_statistics.st_data, thread); +} + +/* + * Unlike the transaction log sequential reader, this function only returns + * the header of a record. + */ +xtBool XTDataSeqRead::sl_seq_next(XTXactLogBufferDPtr *ret_entry, xtBool verify, struct XTThread *thread) +{ + XTXactLogBufferDPtr record; + size_t tfer; + size_t len = 0; + size_t rec_offset; + size_t max_rec_len; + xtBool reread_from_buffer; + xtWord4 size; + + /* Go to the next record (sl_record_len must be initialized + * to 0 for this to work). + */ + sl_rec_log_offset += sl_record_len; + sl_record_len = 0; + + if (sl_rec_log_offset < sl_buf_log_offset || + sl_rec_log_offset >= sl_buf_log_offset + (xtLogOffset) sl_buffer_len) { + /* The current position is nowhere near the buffer, read data into the + * buffer: + */ + tfer = sl_buffer_size; + if (!sl_rnd_read(sl_rec_log_offset, tfer, sl_buffer, &tfer, thread)) + return FAILED; + sl_buf_log_offset = sl_rec_log_offset; + sl_buffer_len = tfer; + + /* Should we go to the next log? 
*/ + if (!tfer) + goto return_empty; + } + + /* The start of the record is in the buffer: */ + read_from_buffer: + rec_offset = (size_t) (sl_rec_log_offset - sl_buf_log_offset); + max_rec_len = sl_buffer_len - rec_offset; + reread_from_buffer = FALSE; + size = 0; + + /* Check the type of record: */ + record = (XTXactLogBufferDPtr) (sl_buffer + rec_offset); + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_HEADER: + if (offsetof(XTXactLogHeaderDRec, xh_size_4) + 4 > max_rec_len) { + reread_from_buffer = TRUE; + goto read_more; + } + len = XT_GET_DISK_4(record->xh.xh_size_4); + if (len > max_rec_len) { + reread_from_buffer = TRUE; + goto read_more; + } + if (verify) { + if (record->xh.xh_checksum_1 != XT_CHECKSUM_1(sl_rec_log_id)) + goto return_empty; + if (XT_LOG_HEAD_MAGIC(record, len) != XT_LOG_FILE_MAGIC) + goto return_empty; + if (len > offsetof(XTXactLogHeaderDRec, xh_log_id_4) + 4) { + if (XT_GET_DISK_4(record->xh.xh_log_id_4) != sl_rec_log_id) + goto return_empty; + } + } + break; + case XT_LOG_ENT_EXT_REC_OK: + case XT_LOG_ENT_EXT_REC_DEL: + len = offsetof(XTactExtRecEntryDRec, er_data); + if (len > max_rec_len) { + reread_from_buffer = TRUE; + goto read_more; + } + size = XT_GET_DISK_4(record->er.er_data_size_4); + if (verify) { + if (sl_rec_log_offset + (xtLogOffset) offsetof(XTactExtRecEntryDRec, er_data) + size > sl_log_eof) + goto return_empty; + } + break; + default: + ASSERT_NS(FALSE); + goto return_empty; + } + + if (len <= max_rec_len) { + /* The record is completely in the buffer: */ + sl_record_len = len+size; + *ret_entry = record; + return OK; + } + + read_more: + /* The record is partially in the buffer. 
*/ + memmove(sl_buffer, sl_buffer + rec_offset, max_rec_len); + sl_buf_log_offset += rec_offset; + sl_buffer_len = max_rec_len; + + /* Read the rest, as far as possible: */ + tfer = sl_buffer_size - max_rec_len; + if (!sl_rnd_read(sl_buf_log_offset + max_rec_len, tfer, sl_buffer + max_rec_len, &tfer, thread)) + return FAILED; + sl_buffer_len += tfer; + + if (sl_buffer_len < len) + /* A partial record is in the log, must be the end of the log: */ + goto return_empty; + + if (reread_from_buffer) + goto read_from_buffer; + + /* The record is not completely in the buffer: */ + sl_record_len = len; + *ret_entry = (XTXactLogBufferDPtr) sl_buffer; + return OK; + + return_empty: + *ret_entry = NULL; + return OK; +} + +void XTDataSeqRead::sl_seq_skip(size_t size) +{ + sl_record_len += size; +} + +void XTDataSeqRead::sl_seq_skip_to(off_t log_offset) +{ + if (log_offset >= sl_rec_log_offset) + sl_record_len = (size_t) (log_offset - sl_rec_log_offset); +} + +/* + * -------------------------------------------------------------------------------- + * STATIC UTILITIES + */ + +static xtBool dl_create_log_header(XTDataLogFilePtr data_log, XTOpenFilePtr of, XTThreadPtr thread) +{ + XTXactLogHeaderDRec header; + + /* The header was not completely written, so write a new one: */ + memset(&header, 0, sizeof(XTXactLogHeaderDRec)); + header.xh_status_1 = XT_LOG_ENT_HEADER; + header.xh_checksum_1 = XT_CHECKSUM_1(data_log->dlf_log_id); + XT_SET_DISK_4(header.xh_size_4, sizeof(XTXactLogHeaderDRec)); + XT_SET_DISK_8(header.xh_free_space_8, 0); + XT_SET_DISK_8(header.xh_file_len_8, sizeof(XTXactLogHeaderDRec)); + XT_SET_DISK_4(header.xh_log_id_4, data_log->dlf_log_id); + XT_SET_DISK_2(header.xh_version_2, XT_LOG_VERSION_NO); + XT_SET_DISK_4(header.xh_magic_4, XT_LOG_FILE_MAGIC); + if (!xt_pwrite_file(of, 0, sizeof(XTXactLogHeaderDRec), &header, &thread->st_statistics.st_data, thread)) + return FAILED; + if (!xt_flush_file(of, &thread->st_statistics.st_data, thread)) + return FAILED; + return 
OK; +} + +static xtBool dl_write_log_header(XTDataLogFilePtr data_log, XTOpenFilePtr of, xtBool flush, XTThreadPtr thread) +{ + XTXactLogHeaderDRec header; + + /* The header was not completely written, so write a new one: */ + XT_SET_DISK_8(header.xh_free_space_8, data_log->dlf_garbage_count); + XT_SET_DISK_8(header.xh_file_len_8, data_log->dlf_log_eof); + XT_SET_DISK_8(header.xh_comp_pos_8, data_log->dlf_start_offset); + + if (!xt_pwrite_file(of, offsetof(XTXactLogHeaderDRec, xh_free_space_8), 24, (xtWord1 *) &header.xh_free_space_8, &thread->st_statistics.st_data, thread)) + return FAILED; + if (flush && !xt_flush_file(of, &thread->st_statistics.st_data, thread)) + return FAILED; + return OK; +} + +static void dl_free_seq_read(XTThreadPtr self __attribute__((unused)), XTDataSeqReadPtr seq_read) +{ + seq_read->sl_seq_exit(); +} + +static void dl_recover_log(XTThreadPtr self, XTDatabaseHPtr db, XTDataLogFilePtr data_log) +{ + XTDataSeqReadRec seq_read; + XTXactLogBufferDPtr record; + + if (!seq_read.sl_seq_init(db, xt_db_log_buffer_size)) + xt_throw(self); + pushr_(dl_free_seq_read, &seq_read); + + seq_read.sl_seq_start(data_log->dlf_log_id, 0, FALSE); + + for (;;) { + if (!seq_read.sl_seq_next(&record, TRUE, self)) + xt_throw(self); + if (!record) + break; + switch (record->xh.xh_status_1) { + case XT_LOG_ENT_HEADER: + data_log->dlf_garbage_count = XT_GET_DISK_8(record->xh.xh_free_space_8); + data_log->dlf_start_offset = XT_GET_DISK_8(record->xh.xh_comp_pos_8); + seq_read.sl_seq_skip_to((off_t) XT_GET_DISK_8(record->xh.xh_file_len_8)); + break; + } + } + + if (!(data_log->dlf_log_eof = seq_read.sl_rec_log_offset)) { + data_log->dlf_log_eof = sizeof(XTXactLogHeaderDRec); + if (!dl_create_log_header(data_log, seq_read.sl_log_file, self)) + xt_throw(self); + } + if (!dl_write_log_header(data_log, seq_read.sl_log_file, TRUE, self)) + xt_throw(self); + + freer_(); // dl_free_seq_read(&seq_read) +} + +/* + * 
-------------------------------------------------------------------------------- + * D A T A L O G C AC H E + */ + +void XTDataLogCache::dls_remove_log(XTDataLogFilePtr data_log) +{ + xtLogID log_id = data_log->dlf_log_id; + + switch (data_log->dlf_state) { + case XT_DL_HAS_SPACE: + xt_sl_delete(NULL, dlc_has_space, &log_id); + break; + case XT_DL_TO_COMPACT: + xt_sl_delete(NULL, dlc_to_compact, &log_id); + break; + case XT_DL_TO_DELETE: + xt_sl_delete(NULL, dlc_to_delete, &log_id); + break; + case XT_DL_DELETED: + xt_sl_delete(NULL, dlc_deleted, &log_id); + break; + } +} + +int XTDataLogCache::dls_get_log_state(XTDataLogFilePtr data_log) +{ + if (data_log->dlf_to_much_garbage()) + return XT_DL_TO_COMPACT; + if (data_log->dlf_space_avaliable() > 0) + return XT_DL_HAS_SPACE; + return XT_DL_READ_ONLY; +} + +xtBool XTDataLogCache::dls_set_log_state(XTDataLogFilePtr data_log, int state) +{ + xtLogID log_id = data_log->dlf_log_id; + + xt_lock_mutex_ns(&dlc_lock); + if (state == XT_DL_MAY_COMPACT) { + if (data_log->dlf_state != XT_DL_UNKNOWN && + data_log->dlf_state != XT_DL_HAS_SPACE && + data_log->dlf_state != XT_DL_READ_ONLY) + goto ok; + state = XT_DL_TO_COMPACT; + } + if (state == XT_DL_UNKNOWN) + state = dls_get_log_state(data_log); + switch (state) { + case XT_DL_HAS_SPACE: + if (data_log->dlf_state != XT_DL_HAS_SPACE) { + dls_remove_log(data_log); + if (!xt_sl_insert(NULL, dlc_has_space, &log_id, &log_id)) + goto failed; + } + break; + case XT_DL_TO_COMPACT: +#ifdef DEBUG_LOG_DELETE + printf("-- set to compact: %d\n", (int) log_id); +#endif + if (data_log->dlf_state != XT_DL_TO_COMPACT) { + dls_remove_log(data_log); + if (!xt_sl_insert(NULL, dlc_to_compact, &log_id, &log_id)) + goto failed; + } + dl_wake_co_thread(dlc_db); + break; + case XT_DL_COMPACTED: +#ifdef DEBUG_LOG_DELETE + printf("-- set compacted: %d\n", (int) log_id); +#endif + if (data_log->dlf_state != state) + dls_remove_log(data_log); + break; + case XT_DL_TO_DELETE: +#ifdef DEBUG_LOG_DELETE + 
printf("-- set to delete log: %d\n", (int) log_id); +#endif + if (data_log->dlf_state != XT_DL_TO_DELETE) { + dls_remove_log(data_log); + if (!xt_sl_insert(NULL, dlc_to_delete, &log_id, &log_id)) + goto failed; + } + break; + case XT_DL_DELETED: +#ifdef DEBUG_LOG_DELETE + printf("-- set DELETED log: %d\n", (int) log_id); +#endif + if (data_log->dlf_state != XT_DL_DELETED) { + dls_remove_log(data_log); + if (!xt_sl_insert(NULL, dlc_deleted, &log_id, &log_id)) + goto failed; + } + break; + default: + if (data_log->dlf_state != state) + dls_remove_log(data_log); + break; + } + data_log->dlf_state = state; + + ok: + xt_unlock_mutex_ns(&dlc_lock); + return OK; + + failed: + xt_unlock_mutex_ns(&dlc_lock); + return FAILED; +} + +static int dl_cmp_log_id(XTThreadPtr XT_UNUSED(self), register const void XT_UNUSED(*thunk), register const void *a, register const void *b) +{ + xtLogID log_id_a = *((xtLogID *) a); + xtLogID log_id_b = *((xtLogID *) b); + + if (log_id_a == log_id_b) + return 0; + if (log_id_a < log_id_b) + return -1; + return 1; +} + +void XTDataLogCache::dlc_init(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTOpenDirPtr od; + char log_dir[PATH_MAX]; + char *file; + xtLogID log_id; + XTDataLogFilePtr data_log= NULL; + + memset(this, 0, sizeof(XTDataLogCacheRec)); + dlc_db = db; + try_(a) { + xt_init_mutex_with_autoname(self, &dlc_lock); + xt_init_cond(self, &dlc_cond); + for (u_int i=0; i<XT_DL_NO_OF_SEGMENTS; i++) { + xt_init_mutex_with_autoname(self, &dlc_segment[i].dls_lock); + xt_init_cond(self, &dlc_segment[i].dls_cond); + } + dlc_has_space = xt_new_sortedlist(self, sizeof(xtLogID), 20, 10, dl_cmp_log_id, NULL, NULL, FALSE, FALSE); + dlc_to_compact = xt_new_sortedlist(self, sizeof(xtLogID), 20, 10, dl_cmp_log_id, NULL, NULL, FALSE, FALSE); + dlc_to_delete = xt_new_sortedlist(self, sizeof(xtLogID), 20, 10, dl_cmp_log_id, NULL, NULL, FALSE, FALSE); + dlc_deleted = xt_new_sortedlist(self, sizeof(xtLogID), 20, 10, dl_cmp_log_id, NULL, NULL, FALSE, FALSE); + 
xt_init_mutex_with_autoname(self, &dlc_mru_lock); + xt_init_mutex_with_autoname(self, &dlc_head_lock); + + xt_strcpy(PATH_MAX, log_dir, dlc_db->db_main_path); + xt_add_data_dir(PATH_MAX, log_dir); + if (xt_fs_exists(log_dir)) { + pushsr_(od, xt_dir_close, xt_dir_open(self, log_dir, NULL)); + while (xt_dir_next(self, od)) { + file = xt_dir_name(self, od); + if (xt_ends_with(file, ".xt")) { + if ((log_id = (xtLogID) xt_file_name_to_id(file))) { + if (!dlc_get_data_log(&data_log, log_id, TRUE, NULL)) + xt_throw(self); + dl_recover_log(self, db, data_log); + if (!dls_set_log_state(data_log, XT_DL_UNKNOWN)) + xt_throw(self); + } + } + } + freer_(); + } + } + catch_(a) { + dlc_exit(self); + xt_throw(self); + } + cont_(a); +} + +void XTDataLogCache::dlc_exit(XTThreadPtr self) +{ + XTDataLogFilePtr data_log, tmp_data_log; + XTOpenLogFilePtr open_log, tmp_open_log; + + if (dlc_has_space) { + xt_free_sortedlist(self, dlc_has_space); + dlc_has_space = NULL; + } + if (dlc_to_compact) { + xt_free_sortedlist(self, dlc_to_compact); + dlc_to_compact = NULL; + } + if (dlc_to_delete) { + xt_free_sortedlist(self, dlc_to_delete); + dlc_to_delete = NULL; + } + if (dlc_deleted) { + xt_free_sortedlist(self, dlc_deleted); + dlc_deleted = NULL; + } + for (u_int i=0; i<XT_DL_NO_OF_SEGMENTS; i++) { + for (u_int j=0; j<XT_DL_SEG_HASH_TABLE_SIZE; j++) { + data_log = dlc_segment[i].dls_hash_table[j]; + while (data_log) { + if (data_log->dlf_log_file) { + xt_close_file_ns(data_log->dlf_log_file); + data_log->dlf_log_file = NULL; + } + + open_log = data_log->dlf_free_list; + while (open_log) { + if (open_log->odl_log_file) + xt_close_file(self, open_log->odl_log_file); + tmp_open_log = open_log; + open_log = open_log->odl_next_free; + xt_free(self, tmp_open_log); + } + tmp_data_log = data_log; + data_log = data_log->dlf_next_hash; + + xt_free(self, tmp_data_log); + } + } + xt_free_mutex(&dlc_segment[i].dls_lock); + xt_free_cond(&dlc_segment[i].dls_cond); + } + xt_free_mutex(&dlc_head_lock); + 
xt_free_mutex(&dlc_mru_lock); + xt_free_mutex(&dlc_lock); + xt_free_cond(&dlc_cond); +} + +void XTDataLogCache::dlc_name(size_t size, char *path, xtLogID log_id) +{ + char name[50]; + + sprintf(name, "dlog-%lu.xt", (u_long) log_id); + xt_strcpy(size, path, dlc_db->db_main_path); + xt_add_data_dir(size, path); + xt_add_dir_char(size, path); + xt_strcat(size, path, name); +} + +xtBool XTDataLogCache::dlc_open_log(XTOpenFilePtr *fh, xtLogID log_id, int mode) +{ + char log_path[PATH_MAX]; + + dlc_name(PATH_MAX, log_path, log_id); + return xt_open_file_ns(fh, log_path, mode); +} + +xtBool XTDataLogCache::dlc_unlock_log(XTDataLogFilePtr data_log) +{ + if (data_log->dlf_log_file) { + xt_close_file_ns(data_log->dlf_log_file); + data_log->dlf_log_file = NULL; + } + + return dls_set_log_state(data_log, XT_DL_UNKNOWN); +} + +XTDataLogFilePtr XTDataLogCache::dlc_get_log_for_writing(off_t space_required, struct XTThread *thread) +{ + xtLogID log_id, *log_id_ptr = NULL; + size_t size; + size_t idx; + XTDataLogFilePtr data_log = NULL; + + xt_lock_mutex_ns(&dlc_lock); + + /* Look for an existing log with enough space: */ + size = xt_sl_get_size(dlc_has_space); + for (idx=0; idx<size; idx++) { + log_id_ptr = (xtLogID *) xt_sl_item_at(dlc_has_space, idx); + if (!dlc_get_data_log(&data_log, *log_id_ptr, FALSE, NULL)) + goto failed; + if (data_log) { + if (data_log->dlf_space_avaliable() >= space_required) + break; + data_log = NULL; + } + else { + ASSERT_NS(FALSE); + xt_sl_delete_item_at(NULL, dlc_has_space, idx); + idx--; + size--; + } + } + + if (data_log) { + /* Found a log: */ + if (!dlc_open_log(&data_log->dlf_log_file, *log_id_ptr, XT_FS_DEFAULT)) + goto failed; + xt_sl_delete_item_at(NULL, dlc_has_space, idx); + } + else { + /* Create a new log: */ + log_id = dlc_next_log_id; + for (u_int i=0; i<XT_DL_MAX_LOG_ID; i++) { + log_id++; + if (log_id > XT_DL_MAX_LOG_ID) + log_id = 1; + if (!dlc_get_data_log(&data_log, log_id, FALSE, NULL)) + goto failed; + if (!data_log) + break; + 
} + dlc_next_log_id = log_id; + if (data_log) { + xt_register_ulxterr(XT_REG_CONTEXT, XT_ERR_LOG_MAX_EXCEEDED, (u_long) XT_DL_MAX_LOG_ID); + goto failed; + } + if (!dlc_get_data_log(&data_log, log_id, TRUE, NULL)) + goto failed; + if (!dlc_open_log(&data_log->dlf_log_file, log_id, XT_FS_CREATE | XT_FS_MAKE_PATH)) + goto failed; + data_log->dlf_log_eof = sizeof(XTXactLogHeaderDRec); + if (!dl_create_log_header(data_log, data_log->dlf_log_file, thread)) { + xt_close_file_ns(data_log->dlf_log_file); + goto failed; + } + /* By setting this late we ensure that the error + * will be repeated. + */ + dlc_next_log_id = log_id; + } + data_log->dlf_state = XT_DL_EXCLUSIVE; + + xt_unlock_mutex_ns(&dlc_lock); + return data_log; + + failed: + xt_unlock_mutex_ns(&dlc_lock); + return NULL; +} + +xtBool XTDataLogCache::dlc_get_data_log(XTDataLogFilePtr *lf, xtLogID log_id, xtBool create, XTDataLogSegPtr *ret_seg) +{ + register XTDataLogSegPtr seg; + register u_int hash_idx; + register XTDataLogFilePtr data_log; + + /* Which segment, and hash index: */ + seg = &dlc_segment[log_id & XT_DL_SEGMENT_MASK]; + hash_idx = (log_id >> XT_DL_SEGMENT_SHIFTS) % XT_DL_SEG_HASH_TABLE_SIZE; + + /* Lock the segment: */ + xt_lock_mutex_ns(&seg->dls_lock); + + /* Find the log file on the hash list: */ + data_log = seg->dls_hash_table[hash_idx]; + while (data_log) { + if (data_log->dlf_log_id == log_id) + break; + data_log = data_log->dlf_next_hash; + } + + if (!data_log && create) { + /* Create a new log file structure: */ + if (!(data_log = (XTDataLogFilePtr) xt_calloc_ns(sizeof(XTDataLogFileRec)))) + goto failed; + data_log->dlf_log_id = log_id; + data_log->dlf_next_hash = seg->dls_hash_table[hash_idx]; + seg->dls_hash_table[hash_idx] = data_log; + } + + if (ret_seg) { + /* This gives the caller the lock: */ + *ret_seg = seg; + *lf = data_log; + return OK; + } + + xt_unlock_mutex_ns(&seg->dls_lock); + *lf = data_log; + return OK; + + failed: + xt_unlock_mutex_ns(&seg->dls_lock); + return FAILED; 
+} + +/* + * If just_close is FALSE, then a log is being deleted. + * This means that the log may still be in exclusive use by + * some thread. So we just close the log! + */ +xtBool XTDataLogCache::dlc_remove_data_log(xtLogID log_id, xtBool just_close) +{ + register XTDataLogSegPtr seg; + register u_int hash_idx; + register XTDataLogFilePtr data_log; + XTOpenLogFilePtr open_log, tmp_open_log; + + /* Which segment, and hash index: */ + seg = &dlc_segment[log_id & XT_DL_SEGMENT_MASK]; + hash_idx = (log_id >> XT_DL_SEGMENT_SHIFTS) % XT_DL_SEG_HASH_TABLE_SIZE; + + /* Lock the segment: */ + retry: + xt_lock_mutex_ns(&seg->dls_lock); + + /* Find the log file on the hash list: */ + data_log = seg->dls_hash_table[hash_idx]; + while (data_log) { + if (data_log->dlf_log_id == log_id) + break; + data_log = data_log->dlf_next_hash; + } + + if (data_log) { + xt_lock_mutex_ns(&dlc_mru_lock); + + open_log = data_log->dlf_free_list; + while (open_log) { + if (open_log->odl_log_file) + xt_close_file_ns(open_log->odl_log_file); + + /* Remove from MRU list: */ + if (dlc_lru_open_log == open_log) { + dlc_lru_open_log = open_log->odl_mr_used; + ASSERT_NS(!open_log->odl_lr_used); + } + else if (open_log->odl_lr_used) + open_log->odl_lr_used->odl_mr_used = open_log->odl_mr_used; + if (dlc_mru_open_log == open_log) { + dlc_mru_open_log = open_log->odl_lr_used; + ASSERT_NS(!open_log->odl_mr_used); + } + else if (open_log->odl_mr_used) + open_log->odl_mr_used->odl_lr_used = open_log->odl_lr_used; + + data_log->dlf_open_count--; + tmp_open_log = open_log; + open_log = open_log->odl_next_free; + xt_free_ns(tmp_open_log); + } + data_log->dlf_free_list = NULL; + + xt_unlock_mutex_ns(&dlc_mru_lock); + + if (data_log->dlf_open_count) { + if (!xt_timed_wait_cond_ns(&seg->dls_cond, &seg->dls_lock, 2000)) + goto failed; + xt_unlock_mutex_ns(&seg->dls_lock); + goto retry; + } + + /* Close the exclusive file if required: */ + if (data_log->dlf_log_file) { + 
xt_close_file_ns(data_log->dlf_log_file); + data_log->dlf_log_file = NULL; + } + + if (!just_close) { + /* Remove the log from the hash list: */ + XTDataLogFilePtr ptr, pptr = NULL; + + ptr = seg->dls_hash_table[hash_idx]; + while (ptr) { + if (ptr == data_log) + break; + pptr = ptr; + ptr = ptr->dlf_next_hash; + } + + if (ptr == data_log) { + if (pptr) + pptr->dlf_next_hash = ptr->dlf_next_hash; + else + seg->dls_hash_table[hash_idx] = ptr->dlf_next_hash; + } + + xt_free_ns(data_log); + } + } + + xt_unlock_mutex_ns(&seg->dls_lock); + return OK; + + failed: + xt_unlock_mutex_ns(&seg->dls_lock); + return FAILED; +} + +xtBool XTDataLogCache::dlc_get_open_log(XTOpenLogFilePtr *ol, xtLogID log_id) +{ + register XTDataLogSegPtr seg; + register u_int hash_idx; + register XTDataLogFilePtr data_log; + register XTOpenLogFilePtr open_log; + char path[PATH_MAX]; + + /* Which segment, and hash index: */ + seg = &dlc_segment[log_id & XT_DL_SEGMENT_MASK]; + hash_idx = (log_id >> XT_DL_SEGMENT_SHIFTS) % XT_DL_SEG_HASH_TABLE_SIZE; + + /* Lock the segment: */ + xt_lock_mutex_ns(&seg->dls_lock); + + /* Find the log file on the hash list: */ + data_log = seg->dls_hash_table[hash_idx]; + while (data_log) { + if (data_log->dlf_log_id == log_id) + break; + data_log = data_log->dlf_next_hash; + } + + if (!data_log) { + /* Create a new log file structure: */ + dlc_name(PATH_MAX, path, log_id); + if (!xt_fs_exists(path)) { + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_DATA_LOG_NOT_FOUND, path); + goto failed; + } + if (!(data_log = (XTDataLogFilePtr) xt_calloc_ns(sizeof(XTDataLogFileRec)))) + goto failed; + data_log->dlf_log_id = log_id; + data_log->dlf_next_hash = seg->dls_hash_table[hash_idx]; + seg->dls_hash_table[hash_idx] = data_log; + } + + if ((open_log = data_log->dlf_free_list)) { + /* Remove from the free list: */ + if ((data_log->dlf_free_list = open_log->odl_next_free)) + data_log->dlf_free_list->odl_prev_free = NULL; + + /* This file has been most recently used: */ + if 
(XT_TIME_DIFF(open_log->odl_ru_time, dlc_ru_now) > (XT_DL_LOG_POOL_SIZE >> 1)) { + /* Move to the front of the MRU list: */ + xt_lock_mutex_ns(&dlc_mru_lock); + + open_log->odl_ru_time = ++dlc_ru_now; + if (dlc_mru_open_log != open_log) { + /* Remove from the MRU list: */ + if (dlc_lru_open_log == open_log) { + dlc_lru_open_log = open_log->odl_mr_used; + ASSERT_NS(!open_log->odl_lr_used); + } + else if (open_log->odl_lr_used) + open_log->odl_lr_used->odl_mr_used = open_log->odl_mr_used; + if (open_log->odl_mr_used) + open_log->odl_mr_used->odl_lr_used = open_log->odl_lr_used; + + /* Make the file the most recently used: */ + if ((open_log->odl_lr_used = dlc_mru_open_log)) + dlc_mru_open_log->odl_mr_used = open_log; + open_log->odl_mr_used = NULL; + dlc_mru_open_log = open_log; + if (!dlc_lru_open_log) + dlc_lru_open_log = open_log; + } + xt_unlock_mutex_ns(&dlc_mru_lock); + } + } + else { + /* Create a new open file: */ + if (!(open_log = (XTOpenLogFilePtr) xt_calloc_ns(sizeof(XTOpenLogFileRec)))) + goto failed; + dlc_name(PATH_MAX, path, log_id); + if (!xt_open_file_ns(&open_log->odl_log_file, path, XT_FS_DEFAULT)) { + xt_free_ns(open_log); + goto failed; + } + open_log->olf_log_id = log_id; + open_log->odl_data_log = data_log; + data_log->dlf_open_count++; + + /* Make the new open file the most recently used: */ + xt_lock_mutex_ns(&dlc_mru_lock); + open_log->odl_ru_time = ++dlc_ru_now; + if ((open_log->odl_lr_used = dlc_mru_open_log)) + dlc_mru_open_log->odl_mr_used = open_log; + open_log->odl_mr_used = NULL; + dlc_mru_open_log = open_log; + if (!dlc_lru_open_log) + dlc_lru_open_log = open_log; + dlc_open_count++; + xt_unlock_mutex_ns(&dlc_mru_lock); + } + + open_log->odl_in_use = TRUE; + xt_unlock_mutex_ns(&seg->dls_lock); + *ol = open_log; + + if (dlc_open_count > XT_DL_LOG_POOL_SIZE) { + u_int target = XT_DL_LOG_POOL_SIZE / 4 * 3; + xtLogID free_log_id; + + /* Remove some open files: */ + while (dlc_open_count > target) { + XTOpenLogFilePtr to_free = 
dlc_lru_open_log; + + if (!to_free || to_free->odl_in_use) + break; + + /* Dirty read the file ID: */ + free_log_id = to_free->olf_log_id; + + seg = &dlc_segment[free_log_id & XT_DL_SEGMENT_MASK]; + + /* Lock the segment: */ + xt_lock_mutex_ns(&seg->dls_lock); + + /* Lock the MRU list: */ + xt_lock_mutex_ns(&dlc_mru_lock); + + /* Check if we have the same open file: */ + if (dlc_lru_open_log == to_free && !to_free->odl_in_use) { + data_log = to_free->odl_data_log; + + /* Remove from the MRU list: */ + dlc_lru_open_log = to_free->odl_mr_used; + ASSERT_NS(!to_free->odl_lr_used); + + if (dlc_mru_open_log == to_free) { + dlc_mru_open_log = to_free->odl_lr_used; + ASSERT_NS(!to_free->odl_mr_used); + } + else if (to_free->odl_mr_used) + to_free->odl_mr_used->odl_lr_used = to_free->odl_lr_used; + + /* Remove from the free list of the file: */ + if (data_log->dlf_free_list == to_free) { + data_log->dlf_free_list = to_free->odl_next_free; + ASSERT_NS(!to_free->odl_prev_free); + } + else if (to_free->odl_prev_free) + to_free->odl_prev_free->odl_next_free = to_free->odl_next_free; + if (to_free->odl_next_free) + to_free->odl_next_free->odl_prev_free = to_free->odl_prev_free; + ASSERT_NS(data_log->dlf_open_count > 0); + data_log->dlf_open_count--; + dlc_open_count--; + } + else + to_free = NULL; + + xt_unlock_mutex_ns(&dlc_mru_lock); + xt_unlock_mutex_ns(&seg->dls_lock); + + if (to_free) { + xt_close_file_ns(to_free->odl_log_file); + xt_free_ns(to_free); + } + } + } + + return OK; + + failed: + xt_unlock_mutex_ns(&seg->dls_lock); + return FAILED; +} + +void XTDataLogCache::dlc_release_open_log(XTOpenLogFilePtr open_log) +{ + register XTDataLogSegPtr seg; + register XTDataLogFilePtr data_log = open_log->odl_data_log; + + /* Which segment, and hash index: */ + seg = &dlc_segment[open_log->olf_log_id & XT_DL_SEGMENT_MASK]; + + xt_lock_mutex_ns(&seg->dls_lock); + open_log->odl_next_free = data_log->dlf_free_list; + open_log->odl_prev_free = NULL; + if (data_log->dlf_free_list) + 
data_log->dlf_free_list->odl_prev_free = open_log; + data_log->dlf_free_list = open_log; + open_log->odl_in_use = FALSE; + + /* Wakeup any exclusive lockers: */ + if (!xt_broadcast_cond_ns(&seg->dls_cond)) + xt_log_and_clear_exception_ns(); + + xt_unlock_mutex_ns(&seg->dls_lock); +} + +/* + * -------------------------------------------------------------------------------- + * D A T A L O G F I L E + */ + +off_t XTDataLogFile::dlf_space_avaliable() +{ + if (dlf_log_eof < xt_db_data_log_threshold) + return xt_db_data_log_threshold - dlf_log_eof; + return 0; +} + +xtBool XTDataLogFile::dlf_to_much_garbage() +{ + if (!dlf_log_eof) + return FALSE; + return dlf_garbage_count * 100 / dlf_log_eof >= xt_db_garbage_threshold; +} + +/* + * -------------------------------------------------------------------------------- + * D A T A L O G B U F F E R + */ + +void XTDataLogBuffer::dlb_init(XTDatabaseHPtr db, size_t buffer_size) +{ + ASSERT_NS(!dlb_db); + ASSERT_NS(!dlb_buffer_size); + ASSERT_NS(!dlb_data_log); + ASSERT_NS(!dlb_log_buffer); + dlb_db = db; + dlb_buffer_size = buffer_size; +} + +void XTDataLogBuffer::dlb_exit(XTThreadPtr self) +{ + dlb_close_log(self); + if (dlb_log_buffer) { + xt_free(self, dlb_log_buffer); + dlb_log_buffer = NULL; + } + dlb_db = NULL; + dlb_buffer_offset = 0; + dlb_buffer_size = 0; + dlb_buffer_len = 0; + dlb_flush_required = FALSE; +#ifdef DEBUG + dlb_max_write_offset = 0; +#endif +} + +xtBool XTDataLogBuffer::dlb_close_log(XTThreadPtr thread) +{ + if (dlb_data_log) { + /* Flush and commit the data in the old log: */ + if (!dlb_flush_log(TRUE, thread)) + return FAILED; + + if (!dlb_db->db_datalogs.dlc_unlock_log(dlb_data_log)) + return FAILED; + dlb_data_log = NULL; + } + return OK; +} + +/* When I use 'thread' instead of 'self', this means + * that I will not throw an error. 
+ */ +xtBool XTDataLogBuffer::dlb_get_log_offset(xtLogID *log_id, xtLogOffset *out_offset, size_t req_size, struct XTThread *thread) +{ + /* Note, I am allowing a log to grow beyond the threshold. + * The amount depends on the maximum extended record size. + * If I don't some logs will never fill up, because of only having + * a few more bytes available. + */ + if (!dlb_data_log || dlb_data_log->dlf_space_avaliable() == 0) { + /* Release the old log: */ + if (!dlb_close_log(thread)) + return FAILED; + + if (!dlb_log_buffer) { + if (!(dlb_log_buffer = (xtWord1 *) xt_malloc_ns(dlb_buffer_size))) + return FAILED; + } + + /* I could use req_size instead of 1, but this would mean some logs + * are never filled up. + */ + if (!(dlb_data_log = dlb_db->db_datalogs.dlc_get_log_for_writing(1, thread))) + return FAILED; +#ifdef DEBUG + dlb_max_write_offset = dlb_data_log->dlf_log_eof; +#endif + } + + *log_id = dlb_data_log->dlf_log_id; + *out_offset = dlb_data_log->dlf_log_eof; + dlb_data_log->dlf_log_eof += req_size; + return OK; +} + +xtBool XTDataLogBuffer::dlb_flush_log(xtBool commit, XTThreadPtr thread) +{ + if (!dlb_data_log || !dlb_data_log->dlf_log_file) + return OK; + + if (dlb_buffer_len) { + if (!xt_pwrite_file(dlb_data_log->dlf_log_file, dlb_buffer_offset, dlb_buffer_len, dlb_log_buffer, &thread->st_statistics.st_data, thread)) + return FAILED; +#ifdef DEBUG + if (dlb_buffer_offset + (xtLogOffset) dlb_buffer_len > dlb_max_write_offset) + dlb_max_write_offset = dlb_buffer_offset + (xtLogOffset) dlb_buffer_len; +#endif + dlb_buffer_len = 0; + dlb_flush_required = TRUE; + } + + if (commit && dlb_flush_required) { +#ifdef DEBUG + /* This would normally be equal, however, in the case + * where some other thread flushes the compactors + * data log, the eof, can be greater than the + * write offset. + * + * This occurs because the flush can come between the + * dlb_get_log_offset() and dlb_write_thru_log() calls. 
+ */ + ASSERT_NS(dlb_data_log->dlf_log_eof >= dlb_max_write_offset); +#endif + if (!xt_flush_file(dlb_data_log->dlf_log_file, &thread->st_statistics.st_data, thread)) + return FAILED; + dlb_flush_required = FALSE; + } + return OK; +} + +xtBool XTDataLogBuffer::dlb_write_thru_log(xtLogID log_id __attribute__((unused)), xtLogOffset log_offset, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + ASSERT_NS(log_id == dlb_data_log->dlf_log_id); + + if (dlb_buffer_len) + dlb_flush_log(FALSE, thread); + + if (!xt_pwrite_file(dlb_data_log->dlf_log_file, log_offset, size, data, &thread->st_statistics.st_data, thread)) + return FAILED; +#ifdef DEBUG + if (log_offset + size > dlb_max_write_offset) + dlb_max_write_offset = log_offset + size; +#endif + dlb_flush_required = TRUE; + return OK; +} + +xtBool XTDataLogBuffer::dlb_append_log(xtLogID log_id __attribute__((unused)), xtLogOffset log_offset, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + ASSERT_NS(log_id == dlb_data_log->dlf_log_id); + + if (dlb_buffer_len) { + /* Should be the case, we only write by appending: */ + ASSERT_NS(dlb_buffer_offset + (xtLogOffset) dlb_buffer_len == log_offset); + /* Check if we are appending to the existing value in the buffer: */ + if (dlb_buffer_offset + (xtLogOffset) dlb_buffer_len == log_offset) { + /* Can we just append: */ + if (dlb_buffer_size >= dlb_buffer_len + size) { + memcpy(dlb_log_buffer + dlb_buffer_len, data, size); + dlb_buffer_len += size; + return OK; + } + } + dlb_flush_log(FALSE, thread); + } + + ASSERT_NS(dlb_buffer_len == 0); + + if (dlb_buffer_size >= size) { + dlb_buffer_offset = log_offset; + dlb_buffer_len = size; + memcpy(dlb_log_buffer, data, size); + return OK; + } + + /* Write directly: */ + if (!xt_pwrite_file(dlb_data_log->dlf_log_file, log_offset, size, data, &thread->st_statistics.st_data, thread)) + return FAILED; +#ifdef DEBUG + if (log_offset + size > dlb_max_write_offset) + dlb_max_write_offset = log_offset + size; +#endif + dlb_flush_required = 
TRUE; + return OK; +} + +xtBool XTDataLogBuffer::dlb_read_log(xtLogID log_id, xtLogOffset log_offset, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + size_t red_size; + XTOpenLogFilePtr open_log; + + if (dlb_data_log && log_id == dlb_data_log->dlf_log_id) { + /* Reading from the write log, I can do this quicker: */ + if (dlb_buffer_len) { + /* If it is in the buffer, then it is completely in the buffer. */ + if (log_offset >= dlb_buffer_offset) { + if (log_offset + (xtLogOffset) size <= dlb_buffer_offset + (xtLogOffset) dlb_buffer_len) { + memcpy(data, dlb_log_buffer + (log_offset - dlb_buffer_offset), size); + return OK; + } + /* Should not happen, reading past EOF: */ + ASSERT_NS(FALSE); + memset(data, 0, size); + return OK; + } + /* In the write log, but not in the buffer, + * must be completely not in the buffer, + * because only whole records are written to the + * log: + */ + ASSERT_NS(log_offset + (xtLogOffset) size <= dlb_buffer_offset); + } + return xt_pread_file(dlb_data_log->dlf_log_file, log_offset, size, size, data, NULL, &thread->st_statistics.st_data, thread); + } + + /* Read from some other log: */ + if (!dlb_db->db_datalogs.dlc_get_open_log(&open_log, log_id)) + return FAILED; + + if (!xt_pread_file(open_log->odl_log_file, log_offset, size, 0, data, &red_size, &thread->st_statistics.st_data, thread)) { + dlb_db->db_datalogs.dlc_release_open_log(open_log); + return FAILED; + } + + dlb_db->db_datalogs.dlc_release_open_log(open_log); + + if (red_size < size) + memset(data + red_size, 0, size - red_size); + + return OK; +} + +/* + * We assume that the given reference may not be valid. + * Only valid references actually cause a delete. + * Invalid references are logged, and ignored. + * + * Note this routine does not lock the compactor. + * This can lead to some incorrect calculation of the + * amount of garbage. But nothing serious I think. 
+ */ +xtBool XTDataLogBuffer::dlb_delete_log(xtLogID log_id, xtLogOffset log_offset, size_t size, xtTableID tab_id, xtRecordID rec_id, XTThreadPtr thread) +{ + XTactExtRecEntryDRec record; + xtWord1 status = XT_LOG_ENT_EXT_REC_DEL; + XTOpenLogFilePtr open_log; + xtBool to_much_garbage; + XTDataLogFilePtr data_log; + + if (!dlb_read_log(log_id, log_offset, offsetof(XTactExtRecEntryDRec, er_data), (xtWord1 *) &record, thread)) + return FAILED; + + /* Already deleted: */ + if (record.er_status_1 == XT_LOG_ENT_EXT_REC_DEL) + return OK; + + if (record.er_status_1 != XT_LOG_ENT_EXT_REC_OK || + size != XT_GET_DISK_4(record.er_data_size_4) || + tab_id != XT_GET_DISK_4(record.er_tab_id_4) || + rec_id != XT_GET_DISK_4(record.er_rec_id_4)) { + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_BAD_EXT_RECORD); + return FAILED; + } + + if (dlb_data_log && log_id == dlb_data_log->dlf_log_id) { + /* Writing to the write log, I can do this quicker: */ + if (dlb_buffer_len) { + /* If it is in the buffer, then it is completely in the buffer. 
*/ + if (log_offset >= dlb_buffer_offset) { + if (log_offset + 1 <= dlb_buffer_offset + (xtLogOffset) dlb_buffer_len) { + *(dlb_log_buffer + (log_offset - dlb_buffer_offset)) = XT_LOG_ENT_EXT_REC_DEL; + goto inc_garbage_count; + } + /* Should not happen, writing past EOF: */ + ASSERT_NS(FALSE); + return OK; + } + ASSERT_NS(log_offset + (xtLogOffset) size <= dlb_buffer_offset); + } + + if (!xt_pwrite_file(dlb_data_log->dlf_log_file, log_offset, 1, &status, &thread->st_statistics.st_data, thread)) + return FAILED; + + inc_garbage_count: + xt_lock_mutex_ns(&dlb_db->db_datalogs.dlc_head_lock); + dlb_data_log->dlf_garbage_count += offsetof(XTactExtRecEntryDRec, er_data) + size; + ASSERT_NS(dlb_data_log->dlf_garbage_count < dlb_data_log->dlf_log_eof); + if (!dl_write_log_header(dlb_data_log, dlb_data_log->dlf_log_file, FALSE, thread)) { + xt_unlock_mutex_ns(&dlb_db->db_datalogs.dlc_head_lock); + return FAILED; + } + dlb_flush_required = TRUE; + xt_unlock_mutex_ns(&dlb_db->db_datalogs.dlc_head_lock); + return OK; + } + + /* Write to some other log, open the log: */ + if (!dlb_db->db_datalogs.dlc_get_open_log(&open_log, log_id)) + return FAILED; + + /* Write the status byte: */ + if (!xt_pwrite_file(open_log->odl_log_file, log_offset, 1, &status, &thread->st_statistics.st_data, thread)) + goto failed; + + data_log = open_log->odl_data_log; + + /* Adjust the garbage level in the header. 
*/ + xt_lock_mutex_ns(&dlb_db->db_datalogs.dlc_head_lock); + data_log->dlf_garbage_count += offsetof(XTactExtRecEntryDRec, er_data) + size; + ASSERT_NS(data_log->dlf_garbage_count < data_log->dlf_log_eof); + if (!dl_write_log_header(data_log, open_log->odl_log_file, FALSE, thread)) { + xt_unlock_mutex_ns(&dlb_db->db_datalogs.dlc_head_lock); + goto failed; + } + to_much_garbage = data_log->dlf_to_much_garbage(); + xt_unlock_mutex_ns(&dlb_db->db_datalogs.dlc_head_lock); + + if (to_much_garbage && + (data_log->dlf_state == XT_DL_HAS_SPACE || data_log->dlf_state == XT_DL_READ_ONLY)) { + /* There is too much garbage, it may be compacted. */ + if (!dlb_db->db_datalogs.dls_set_log_state(data_log, XT_DL_MAY_COMPACT)) + goto failed; + } + + /* Release the open log: */ + dlb_db->db_datalogs.dlc_release_open_log(open_log); + + return OK; + + failed: + dlb_db->db_datalogs.dlc_release_open_log(open_log); + return FAILED; +} + +/* + * Delete all the extended data belonging to a particular + * table. + */ +xtPublic void xt_dl_delete_ext_data(XTThreadPtr self, XTTableHPtr tab, xtBool missing_ok __attribute__((unused)), xtBool have_table_lock) +{ + XTOpenTablePtr ot; + xtRecordID page_rec_id, offs_rec_id; + XTTabRecExtDPtr rec_buf; + xtWord4 log_over_size; + xtLogID log_id; + xtLogOffset log_offset; + xtWord1 *page_data; + + page_data = (xtWord1 *) xt_malloc(self, tab->tab_recs.tci_page_size); + pushr_(xt_free, page_data); + + /* Scan the table, and remove all exended data... */ + if (!(ot = xt_open_table(tab))) { + if (self->t_exception.e_xt_err == XT_SYSTEM_ERROR && + XT_FILE_NOT_FOUND(self->t_exception.e_sys_err)) + return; + xt_throw(self); + } + ot->ot_thread = self; + + /* {LOCK-EXT-REC} This lock is to stop the compactor changing records + * while we are doing the delete. + */ + xt_lock_mutex_ns(&tab->tab_db->db_co_ext_lock); + + page_rec_id = 1; + while (page_rec_id < tab->tab_rec_eof_id) { + /* NOTE: There is a good reason for using xt_tc_read_page(). 
+ * A deadlock can occur if using read, which can run out of + * memory, which waits for the freeer, which may need to + * open a table, which requires the db->db_tables lock, + * which is owned by the this thread, when the function + * is called from drop table. + * + * xt_tc_read_page() should work because no more changes + * should happen to the table while we are dropping it. + */ + if (!tab->tab_recs.xt_tc_read_page(ot->ot_rec_file, page_rec_id, page_data, self)) + goto failed; + + for (offs_rec_id=0; offs_rec_id<tab->tab_recs.tci_rows_per_page && page_rec_id+offs_rec_id < tab->tab_rec_eof_id; offs_rec_id++) { + rec_buf = (XTTabRecExtDPtr) (page_data + (offs_rec_id * tab->tab_recs.tci_rec_size)); + if (XT_REC_IS_EXT_DLOG(rec_buf->tr_rec_type_1)) { + log_over_size = XT_GET_DISK_4(rec_buf->re_log_dat_siz_4); + XT_GET_LOG_REF(log_id, log_offset, rec_buf); + + if (!self->st_dlog_buf.dlb_delete_log(log_id, log_offset, log_over_size, tab->tab_id, page_rec_id+offs_rec_id, self)) { + if (self->t_exception.e_xt_err != XT_ERR_BAD_EXT_RECORD && + self->t_exception.e_xt_err != XT_ERR_DATA_LOG_NOT_FOUND) + xt_log_and_clear_exception(self); + } + } + } + + page_rec_id += tab->tab_recs.tci_rows_per_page; + } + + xt_unlock_mutex_ns(&tab->tab_db->db_co_ext_lock); + + xt_close_table(ot, TRUE, have_table_lock); + + freer_(); // xt_free(page_data) + return; + + failed: + xt_unlock_mutex_ns(&tab->tab_db->db_co_ext_lock); + + xt_close_table(ot, TRUE, have_table_lock); + xt_throw(self); +} + +/* + * -------------------------------------------------------------------------------- + * GARBAGE COLLECTOR THREAD + */ + +xtPublic void xt_dl_init_db(XTThreadPtr self, XTDatabaseHPtr db) +{ + xt_init_mutex_with_autoname(self, &db->db_co_ext_lock); + xt_init_mutex_with_autoname(self, &db->db_co_dlog_lock); +} + +xtPublic void xt_dl_exit_db(XTThreadPtr self, XTDatabaseHPtr db) +{ + xt_stop_compactor(self, db); // Already done! 
+ db->db_co_thread = NULL; + xt_free_mutex(&db->db_co_ext_lock); + xt_free_mutex(&db->db_co_dlog_lock); +} + +xtPublic void xt_dl_set_to_delete(XTThreadPtr self, XTDatabaseHPtr db, xtLogID log_id) +{ + XTDataLogFilePtr data_log; + + if (!db->db_datalogs.dlc_get_data_log(&data_log, log_id, FALSE, NULL)) + xt_throw(self); + if (data_log) { + if (!db->db_datalogs.dls_set_log_state(data_log, XT_DL_TO_DELETE)) + xt_throw(self); + } +} + +xtPublic void xt_dl_log_status(XTThreadPtr self, XTDatabaseHPtr db, XTStringBufferPtr strbuf) +{ + XTSortedListPtr list; + XTDataLogFilePtr data_log; + XTDataLogSegPtr seg; + u_int no_of_logs; + xtLogID *log_id_ptr; + + list = xt_new_sortedlist(self, sizeof(xtLogID), 20, 10, dl_cmp_log_id, NULL, NULL, FALSE, FALSE); + pushr_(xt_free_sortedlist, list); + + for (u_int i=0; i<XT_DL_NO_OF_SEGMENTS; i++) { + for (u_int j=0; j<XT_DL_SEG_HASH_TABLE_SIZE; j++) { + seg = &db->db_datalogs.dlc_segment[i]; + data_log = seg->dls_hash_table[j]; + while (data_log) { + xt_sl_insert(self, list, &data_log->dlf_log_id, &data_log->dlf_log_id); + data_log = data_log->dlf_next_hash; + } + } + } + + no_of_logs = xt_sl_get_size(list); + for (u_int i=0; i<no_of_logs; i++) { + log_id_ptr = (xtLogID *) xt_sl_item_at(list, i); + if (!db->db_datalogs.dlc_get_data_log(&data_log, *log_id_ptr, FALSE, &seg)) + xt_throw(self); + if (data_log) { + xt_sb_concat(self, strbuf, "d-log: "); + xt_sb_concat_int8(self, strbuf, data_log->dlf_log_id); + xt_sb_concat(self, strbuf, " status="); + switch (data_log->dlf_state) { + case XT_DL_UNKNOWN: + xt_sb_concat(self, strbuf, "?"); + break; + case XT_DL_HAS_SPACE: + xt_sb_concat(self, strbuf, "has-space "); + break; + case XT_DL_READ_ONLY: + xt_sb_concat(self, strbuf, "read-only "); + break; + case XT_DL_TO_COMPACT: + xt_sb_concat(self, strbuf, "to-compact"); + break; + case XT_DL_COMPACTED: + xt_sb_concat(self, strbuf, "compacted "); + break; + case XT_DL_TO_DELETE: + xt_sb_concat(self, strbuf, "to-delete "); + break; + case 
XT_DL_DELETED: + xt_sb_concat(self, strbuf, "deleted "); + break; + case XT_DL_EXCLUSIVE: + xt_sb_concat(self, strbuf, "x-locked "); + break; + } + xt_sb_concat(self, strbuf, " eof="); + xt_sb_concat_int8(self, strbuf, data_log->dlf_log_eof); + xt_sb_concat(self, strbuf, " garbage="); + xt_sb_concat_int8(self, strbuf, data_log->dlf_garbage_count); + xt_sb_concat(self, strbuf, " g%="); + if (data_log->dlf_log_eof) + xt_sb_concat_int8(self, strbuf, data_log->dlf_garbage_count * 100 / data_log->dlf_log_eof); + else + xt_sb_concat(self, strbuf, "100"); + xt_sb_concat(self, strbuf, " open="); + xt_sb_concat_int8(self, strbuf, data_log->dlf_open_count); + xt_sb_concat(self, strbuf, "\n"); + } + xt_unlock_mutex_ns(&seg->dls_lock); + } + + freer_(); // xt_free_sortedlist(list) +} + +xtPublic void xt_dl_delete_logs(XTThreadPtr self, XTDatabaseHPtr db) +{ + char path[PATH_MAX]; + XTOpenDirPtr od; + char *file; + xtLogID log_id; + + xt_strcpy(PATH_MAX, path, db->db_main_path); + xt_add_data_dir(PATH_MAX, path); + if (!xt_fs_exists(path)) + return; + pushsr_(od, xt_dir_close, xt_dir_open(self, path, NULL)); + while (xt_dir_next(self, od)) { + file = xt_dir_name(self, od); + if ((log_id = (xtLogID) xt_file_name_to_id(file))) { + if (!db->db_datalogs.dlc_remove_data_log(log_id, TRUE)) + xt_log_and_clear_exception(self); + } + if (xt_ends_with(file, ".xt")) { + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, file); + xt_fs_delete(self, path); + xt_remove_last_name_of_path(path); + } + } + freer_(); // xt_dir_close(od) + + /* I no longer attach the condition: !db->db_multi_path + * to removing this directory. This is because + * the pbxt directory must now be removed explicitly + * by drop database, or by delete all the PBXT + * system tables. 
+ */ + if (!xt_fs_rmdir(NULL, path)) + xt_log_and_clear_exception(self); +} + +typedef struct XTCompactorState { + XTSeqLogReadPtr cs_seqread; + XTOpenTablePtr cs_ot; + XTDataBufferRec cs_databuf; +} XTCompactorStateRec, *XTCompactorStatePtr; + +static void dl_free_compactor_state(XTThreadPtr self, XTCompactorStatePtr cs) +{ + if (cs->cs_seqread) { + cs->cs_seqread->sl_seq_exit(); + delete cs->cs_seqread; + cs->cs_seqread = NULL; + } + if (cs->cs_ot) { + xt_db_return_table_to_pool(self, cs->cs_ot); + cs->cs_ot = NULL; + } + xt_db_set_size(self, &cs->cs_databuf, 0); +} + +static XTOpenTablePtr dl_cs_get_open_table(XTThreadPtr self, XTCompactorStatePtr cs, xtTableID tab_id) +{ + if (cs->cs_ot) { + if (cs->cs_ot->ot_table->tab_id == tab_id) + return cs->cs_ot; + + xt_db_return_table_to_pool(self, cs->cs_ot); + cs->cs_ot = NULL; + } + + if (!cs->cs_ot) { + if (!(cs->cs_ot = xt_db_open_pool_table(self, self->st_database, tab_id, NULL, TRUE))) + return NULL; + } + + return cs->cs_ot; +} + +static void dl_co_wait(XTThreadPtr self, XTDatabaseHPtr db, u_int secs) +{ + xt_lock_mutex(self, &db->db_datalogs.dlc_lock); + pushr_(xt_unlock_mutex, &db->db_datalogs.dlc_lock); + if (!self->t_quit) + xt_timed_wait_cond(self, &db->db_datalogs.dlc_cond, &db->db_datalogs.dlc_lock, secs * 1000); + freer_(); // xt_unlock_mutex(&db->db_datalogs.dlc_lock) +} + +/* + * Collect all the garbage in a file by moving all valid records + * into some other data log and updating the handles. 
+ */ +static xtBool dl_collect_garbage(XTThreadPtr self, XTDatabaseHPtr db, XTDataLogFilePtr data_log) +{ + XTXactLogBufferDPtr record; + size_t size; + xtTableID tab_id; + xtRecordID rec_id; + XTCompactorStateRec cs; + XTOpenTablePtr ot; + XTTableHPtr tab; + XTTabRecExtDRec rec_buffer; + size_t src_size; + xtLogID src_log_id; + xtLogOffset src_log_offset; + xtLogID curr_log_id; + xtLogOffset curr_log_offset; + xtLogID dest_log_id; + xtLogOffset dest_log_offset; + off_t garbage_count = 0; + + memset(&cs, 0, sizeof(XTCompactorStateRec)); + + if (!(cs.cs_seqread = new XTDataSeqRead())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + + if (!cs.cs_seqread->sl_seq_init(db, xt_db_log_buffer_size)) { + delete cs.cs_seqread; + xt_throw(self); + } + pushr_(dl_free_compactor_state, &cs); + + if (!cs.cs_seqread->sl_seq_start(data_log->dlf_log_id, data_log->dlf_start_offset, FALSE)) + xt_throw(self); + + for (;;) { + if (self->t_quit) { + /* Flush the destination log: */ + xt_lock_mutex(self, &db->db_co_dlog_lock); + pushr_(xt_unlock_mutex, &db->db_co_dlog_lock); + if (!self->st_dlog_buf.dlb_flush_log(TRUE, self)) + xt_throw(self); + freer_(); // xt_unlock_mutex(&db->db_co_dlog_lock) + + /* Flush the transaction log. 
*/ + if (!xt_xlog_flush_log(self)) + xt_throw(self); + + xt_lock_mutex_ns(&db->db_datalogs.dlc_head_lock); + data_log->dlf_garbage_count += garbage_count; + ASSERT(data_log->dlf_garbage_count < data_log->dlf_log_eof); + if (!dl_write_log_header(data_log, cs.cs_seqread->sl_seq_open_file(), TRUE, self)) { + xt_unlock_mutex_ns(&db->db_datalogs.dlc_head_lock); + xt_throw(self); + } + xt_unlock_mutex_ns(&db->db_datalogs.dlc_head_lock); + + freer_(); // dl_free_compactor_state(&cs) + return FAILED; + } + if (!cs.cs_seqread->sl_seq_next(&record, TRUE, self)) + xt_throw(self); + cs.cs_seqread->sl_seq_pos(&curr_log_id, &curr_log_offset); + if (!record) { + data_log->dlf_start_offset = curr_log_offset; + break; + } + switch (record->xh.xh_status_1) { + case XT_LOG_ENT_EXT_REC_OK: + size = XT_GET_DISK_4(record->er.er_data_size_4); + tab_id = XT_GET_DISK_4(record->er.er_tab_id_4); + rec_id = XT_GET_DISK_4(record->er.er_rec_id_4); + + if (!(ot = dl_cs_get_open_table(self, &cs, tab_id))) + break; + tab = ot->ot_table; + + /* All this is required for a valid record address: */ + if (!rec_id || rec_id >= tab->tab_rec_eof_id) + break; + + /* {LOCK-EXT-REC} It is important to prevent the compactor from modifying + * a record that has been freed (and maybe allocated again). + * + * Consider the following sequence: + * + * 1. Compactor reads the record. + * 2. The record is freed and reallocated. + * 3. The compactor updates the record. + * + * To prevent this, the compactor locks out the + * sweeper using the db_co_ext_lock lock. The db_co_ext_lock lock + * prevents a extended record from being moved and removed at the + * same time. + * + * The compactor also checks the status of the record before + * moving a record. 
+ */ + xt_lock_mutex(self, &db->db_co_ext_lock); + pushr_(xt_unlock_mutex, &db->db_co_ext_lock); + + /* Read the record: */ + if (!xt_tab_get_rec_data(ot, rec_id, offsetof(XTTabRecExtDRec, re_data), (xtWord1 *) &rec_buffer)) { + xt_log_and_clear_warning(self); + freer_(); // xt_unlock_mutex(&db->db_co_ext_lock) + break; + } + + /* [(7)] REMOVE is followed by FREE: + if (XT_REC_IS_REMOVED(rec_buffer.tr_rec_type_1) || !XT_REC_IS_EXT_DLOG(rec_buffer.tr_rec_type_1)) { + */ + if (!XT_REC_IS_EXT_DLOG(rec_buffer.tr_rec_type_1)) { + freer_(); // xt_unlock_mutex(&db->db_co_ext_lock) + break; + } + + XT_GET_LOG_REF(src_log_id, src_log_offset, &rec_buffer); + src_size = (size_t) XT_GET_DISK_4(rec_buffer.re_log_dat_siz_4); + + /* Does the record agree with the current position: */ + if (curr_log_id != src_log_id || + curr_log_offset != src_log_offset || + size != src_size) { + freer_(); // xt_unlock_mutex(&db->db_co_ext_lock) + break; + } + + size = offsetof(XTactExtRecEntryDRec, er_data) + size; + + /* Allocate space in a destination log: */ + xt_lock_mutex(self, &db->db_co_dlog_lock); + pushr_(xt_unlock_mutex, &db->db_co_dlog_lock); + if (!self->st_dlog_buf.dlb_get_log_offset(&dest_log_id, &dest_log_offset, size, self)) + xt_throw(self); + freer_(); // xt_unlock_mutex(&db->db_co_dlog_lock) + + /* This record is referenced by the data: */ + xt_db_set_size(self, &cs.cs_databuf, size); + if (!cs.cs_seqread->sl_rnd_read(src_log_offset, size, cs.cs_databuf.db_data, NULL, self)) + xt_throw(self); + + /* The problem with writing to the buffer here, is that other + * threads want to read the data! */ + xt_lock_mutex(self, &db->db_co_dlog_lock); + pushr_(xt_unlock_mutex, &db->db_co_dlog_lock); + if (!self->st_dlog_buf.dlb_write_thru_log(dest_log_id, dest_log_offset, size, cs.cs_databuf.db_data, self)) + xt_throw(self); + freer_(); // xt_unlock_mutex(&db->db_co_dlog_lock) + + /* Make sure we flush the compactor target log, before we + * flush the transaction log!! 
+ * This is done here [(8)] + */ + + XT_SET_LOG_REF(&rec_buffer, dest_log_id, dest_log_offset); + xtOpSeqNo op_seq; + if (!xt_tab_put_log_rec_data(ot, XT_LOG_ENT_REC_MOVED, 0, rec_id, 8, (xtWord1 *) &rec_buffer.re_log_id_2, &op_seq)) + xt_throw(self); + tab->tab_co_op_seq = op_seq; + + /* Only records that were actually moved, count as garbage now! + * This means, lost records, remain "lost" as far as the garbage + * count is concerned! + */ + garbage_count += size; + freer_(); // xt_unlock_mutex(&db->db_co_ext_lock) + break; + } + data_log->dlf_start_offset = curr_log_offset; + } + + /* Flush the destination log. */ + xt_lock_mutex(self, &db->db_co_dlog_lock); + pushr_(xt_unlock_mutex, &db->db_co_dlog_lock); + if (!self->st_dlog_buf.dlb_flush_log(TRUE, self)) + xt_throw(self); + freer_(); // xt_unlock_mutex(&db->db_co_dlog_lock) + + /* Flush the transaction log. */ + if (!xt_xlog_flush_log(self)) + xt_throw(self); + + /* Save state in source log header. */ + xt_lock_mutex_ns(&db->db_datalogs.dlc_head_lock); + data_log->dlf_garbage_count += garbage_count; + ASSERT(data_log->dlf_garbage_count < data_log->dlf_log_eof); + if (!dl_write_log_header(data_log, cs.cs_seqread->sl_seq_open_file(), TRUE, self)) { + xt_unlock_mutex_ns(&db->db_datalogs.dlc_head_lock); + xt_throw(self); + } + xt_unlock_mutex_ns(&db->db_datalogs.dlc_head_lock); + + /* Wait for the writer to write all the changes. + * Then we can start the delete process for the log: + * + * Note, if we do not wait, then it could be some operations are held up, + * by being out of sequence. This could cause the log to be deleted + * before all the operations have been performed (which are on a table + * basis). 
+ * + */ + for (;;) { + u_int edx; + XTTableEntryPtr tab_ptr; + xtBool wait; + + if (self->t_quit) { + freer_(); // dl_free_compactor_state(&cs) + return FAILED; + } + wait = FALSE; + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + xt_enum_tables_init(&edx); + while ((tab_ptr = xt_enum_tables_next(self, db, &edx))) { + if (tab_ptr->te_table && tab_ptr->te_table->tab_co_op_seq > tab_ptr->te_table->tab_head_op_seq) { + wait = TRUE; + break; + } + } + freer_(); // xt_ht_unlock(db->db_tables) + + if (!wait) + break; + + /* Nobody will wake me, so check again shortly! */ + dl_co_wait(self, db, 1); + } + + db->db_datalogs.dls_set_log_state(data_log, XT_DL_COMPACTED); + +#ifdef DEBUG_LOG_DELETE + printf("-- MARK FOR DELETE IN LOG: %d\n", (int) data_log->dlf_log_id); +#endif + /* Log that this log should be deleted on the next checkpoint: */ + // transaction log... + XTXactNewLogEntryDRec log_rec; + log_rec.xl_status_1 = XT_LOG_ENT_DEL_LOG; + log_rec.xl_checksum_1 = XT_CHECKSUM_1(data_log->dlf_log_id); + XT_SET_DISK_4(log_rec.xl_log_id_4, data_log->dlf_log_id); + if (!xt_xlog_log_data(self, sizeof(XTXactNewLogEntryDRec), (XTXactLogBufferDPtr) &log_rec, TRUE)) { + db->db_datalogs.dls_set_log_state(data_log, XT_DL_TO_COMPACT); + xt_throw(self); + } + + freer_(); // dl_free_compactor_state(&cs) + return OK; +} + +static void dl_co_not_busy(XTThreadPtr XT_UNUSED(self), XTDatabaseHPtr db) +{ + db->db_co_busy = FALSE; +} + +static void dl_co_main(XTThreadPtr self, xtBool once_off) +{ + XTDatabaseHPtr db = self->st_database; + xtLogID *log_id_ptr, log_id; + XTDataLogFilePtr data_log = NULL; + + xt_set_low_priority(self); + + while (!self->t_quit) { + while (!self->t_quit) { + xt_lock_mutex_ns(&db->db_datalogs.dlc_lock); + if ((log_id_ptr = (xtLogID *) xt_sl_first_item(db->db_datalogs.dlc_to_compact))) { + log_id = *log_id_ptr; + } + else + log_id = 0; + xt_unlock_mutex_ns(&db->db_datalogs.dlc_lock); + if (!log_id) + break; + if 
(!db->db_datalogs.dlc_get_data_log(&data_log, log_id, FALSE, NULL)) + xt_throw(self); + ASSERT(data_log); + if (data_log) { + db->db_co_busy = TRUE; + pushr_(dl_co_not_busy, db); + dl_collect_garbage(self, db, data_log); + freer_(); // dl_co_not_busy(db) + } + else { + xt_lock_mutex_ns(&db->db_datalogs.dlc_lock); + xt_sl_delete(self, db->db_datalogs.dlc_to_compact, &log_id); + xt_unlock_mutex_ns(&db->db_datalogs.dlc_lock); + } + } + + if (once_off) + break; + + /* Wait for a signal that a data log can be collected: */ + dl_co_wait(self, db, 120); + } +} + +static void *dl_run_co_thread(XTThreadPtr self) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) self->t_data; + int count; + void *mysql_thread; + + mysql_thread = myxt_create_thread(); + + while (!self->t_quit) { + try_(a) { + /* + * The garbage collector requires that the database + * is in use. + */ + xt_use_database(self, db, XT_FOR_COMPACTOR); + + /* This action is both safe and required: + * + * safe: releasing the database is safe because as + * long as this thread is running the database + * reference is valid, and this reference cannot + * be the only one to the database because + * otherwise this thread would not be running. + * + * required: releasing the database is necessary + * otherwise we cannot close the database + * correctly because we only shut down this + * thread when the database is closed and we + * only close the database when all references + * are removed. + */ + xt_heap_release(self, self->st_database); + + dl_co_main(self, FALSE); + } + catch_(a) { + if (!(self->t_exception.e_xt_err == XT_SIGNAL_CAUGHT && + self->t_exception.e_sys_err == SIGTERM)) + xt_log_and_clear_exception(self); + } + cont_(a); + + /* Avoid releasing the database (done above) */ + self->st_database = NULL; + xt_unuse_database(self, self); + + /* After an exception, pause before trying again... 
*/ + /* Number of seconds */ +#ifdef DEBUG + count = 10; +#else + count = 2*60; +#endif + while (!self->t_quit && count > 0) { + sleep(1); + count--; + } + } + + myxt_destroy_thread(mysql_thread, TRUE); + return NULL; +} + +static void dl_free_co_thread(XTThreadPtr self, void *data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) data; + + if (db->db_co_thread) { + xt_lock_mutex(self, &db->db_datalogs.dlc_lock); + pushr_(xt_unlock_mutex, &db->db_datalogs.dlc_lock); + db->db_co_thread = NULL; + freer_(); // xt_unlock_mutex(&db->db_datalogs.dlc_lock) + } +} + +xtPublic void xt_start_compactor(XTThreadPtr self, XTDatabaseHPtr db) +{ + char name[PATH_MAX]; + + sprintf(name, "GC-%s", xt_last_directory_of_path(db->db_main_path)); + xt_remove_dir_char(name); + db->db_co_thread = xt_create_daemon(self, name); + xt_set_thread_data(db->db_co_thread, db, dl_free_co_thread); + xt_run_thread(self, db->db_co_thread, dl_run_co_thread); +} + +static void dl_wake_co_thread(XTDatabaseHPtr db) +{ + if (!xt_signal_cond(NULL, &db->db_datalogs.dlc_cond)) + xt_log_and_clear_exception_ns(); +} + +xtPublic void xt_stop_compactor(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTThreadPtr thr_co; + + if (db->db_co_thread) { + xt_lock_mutex(self, &db->db_datalogs.dlc_lock); + pushr_(xt_unlock_mutex, &db->db_datalogs.dlc_lock); + + /* This pointer is safe as long as you have the transaction lock. */ + if ((thr_co = db->db_co_thread)) { + xtThreadID tid = thr_co->t_id; + + /* Make sure the thread quits when woken up. */ + xt_terminate_thread(self, thr_co); + + dl_wake_co_thread(db); + + freer_(); // xt_unlock_mutex(&db->db_datalogs.dlc_lock) + + /* + * This seems to kill the whole server sometimes!! + * SIGTERM is going to a different thread??! + xt_kill_thread(thread); + */ + xt_wait_for_thread(tid, FALSE); + + /* PMC - This should not be necessary to set the signal here, but in the + * debugger the handler is not called!!? 
+ thr_co->t_delayed_signal = SIGTERM; + xt_kill_thread(thread); + */ + db->db_co_thread = NULL; + } + else + freer_(); // xt_unlock_mutex(&db->db_datalogs.dlc_lock) + } +} + diff --git a/storage/pbxt/src/datalog_xt.h b/storage/pbxt/src/datalog_xt.h new file mode 100644 index 00000000000..245ebcbaeda --- /dev/null +++ b/storage/pbxt/src/datalog_xt.h @@ -0,0 +1,228 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-24 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_datalog_h__ +#define __xt_datalog_h__ + +#include "pthread_xt.h" +#include "filesys_xt.h" +#include "sortedlist_xt.h" +#include "xactlog_xt.h" +#include "util_xt.h" + +struct XTThread; +struct XTDatabase; +struct xXTDataLog; +struct XTTable; +struct XTOpenTable; + +#define XT_SET_LOG_REF(d, l, o) do { XT_SET_DISK_2((d)->re_log_id_2, l); \ + XT_SET_DISK_6((d)->re_log_offs_6, o); \ + } while (0) +#define XT_GET_LOG_REF(l, o, s) do { l = XT_GET_DISK_2((s)->re_log_id_2); \ + o = XT_GET_DISK_6((s)->re_log_offs_6); \ + } while (0) + +#ifdef DEBUG +//#define USE_DEBUG_SIZES +#endif + +#ifdef USE_DEBUG_SIZES +#define XT_DL_MAX_LOG_ID 500 +#define XT_DL_LOG_POOL_SIZE 10 +#define XT_DL_HASH_TABLE_SIZE 5 +#define XT_DL_SEGMENT_SHIFTS 1 +#else +#define 
XT_DL_MAX_LOG_ID 0x7FFF +#define XT_DL_LOG_POOL_SIZE 1000 +#define XT_DL_HASH_TABLE_SIZE 10000 +#define XT_DL_SEGMENT_SHIFTS 3 +#endif + +#define XT_DL_SEG_HASH_TABLE_SIZE (XT_DL_HASH_TABLE_SIZE / XT_DL_NO_OF_SEGMENTS) +#define XT_DL_NO_OF_SEGMENTS (1 << XT_DL_SEGMENT_SHIFTS) +#define XT_DL_SEGMENT_MASK (XT_DL_NO_OF_SEGMENTS - 1) + +typedef struct XTOpenLogFile { + xtLogID olf_log_id; + XTOpenFilePtr odl_log_file; /* The open file handle. */ + struct XTDataLogFile *odl_data_log; + + xtBool odl_in_use; + struct XTOpenLogFile *odl_next_free; /* Pointer to the next on the free list. */ + struct XTOpenLogFile *odl_prev_free; /* Pointer to the previous on the free list. */ + + xtWord4 odl_ru_time; /* If this is in the top 1/4 don't change position in MRU list. */ + struct XTOpenLogFile *odl_mr_used; /* More recently used pages. */ + struct XTOpenLogFile *odl_lr_used; /* Less recently used pages. */ +} XTOpenLogFileRec, *XTOpenLogFilePtr; + +#define XT_DL_MAY_COMPACT -1 /* This is an indication to set the state to XT_DL_TO_COMPACT. */ +#define XT_DL_UNKNOWN 0 +#define XT_DL_HAS_SPACE 1 /* The log is not yet full, and can be used for writing. */ +#define XT_DL_READ_ONLY 2 /* The log is full, and can only be read now. */ +#define XT_DL_TO_COMPACT 3 /* The log has too much garbage, and must be compacted. */ +#define XT_DL_COMPACTED 4 /* The state after compaction. */ +#define XT_DL_TO_DELETE 5 /* All references to this log have been removed, and it is to be deleted. */ +#define XT_DL_DELETED 6 /* After deletion, logs are locked until the next checkpoint. */ +#define XT_DL_EXCLUSIVE 7 /* The log is locked and being written by a thread. */ + +typedef struct XTDataLogFile { + xtLogID dlf_log_id; /* The ID of the data log. */ + int dlf_state; + struct XTDataLogFile *dlf_next_hash; /* Pointer to the next on the hash list. */ + u_int dlf_open_count; /* Number of open log files. */ + XTOpenLogFilePtr dlf_free_list; /* The open file free list. 
*/ + off_t dlf_log_eof; + off_t dlf_start_offset; /* Start offset for garbage collection. */ + off_t dlf_garbage_count; /* The amount of garbage in the log file. */ + XTOpenFilePtr dlf_log_file; /* The open file handle (if the log is in exclusive use!!). */ + + off_t dlf_space_avaliable(); + xtBool dlf_to_much_garbage(); +} XTDataLogFileRec, *XTDataLogFilePtr; + +typedef struct XTDataLogSeg { + xt_mutex_type dls_lock; /* The cache segment lock. */ + xt_cond_type dls_cond; + XTDataLogFilePtr dls_hash_table[XT_DL_SEG_HASH_TABLE_SIZE]; +} XTDataLogSegRec, *XTDataLogSegPtr; + +typedef struct XTDataLogCache { + struct XTDatabase *dlc_db; + + xt_mutex_type dlc_lock; /* The public cache lock. */ + xt_cond_type dlc_cond; /* The public cache wait condition. */ + XTSortedListPtr dlc_has_space; /* List of logs with space for more data. */ + XTSortedListPtr dlc_to_compact; /* List of logs to be compacted. */ + XTSortedListPtr dlc_to_delete; /* List of logs to be deleted at next checkpoint. */ + XTSortedListPtr dlc_deleted; /* List of logs deleted at the previous checkpoint. */ + XTDataLogSegRec dlc_segment[XT_DL_NO_OF_SEGMENTS]; + xtLogID dlc_next_log_id; /* The next log ID to be used to create a new log. */ + + xt_mutex_type dlc_mru_lock; /* The lock for the LRU list. */ + xtWord4 dlc_ru_now; + XTOpenLogFilePtr dlc_lru_open_log; + XTOpenLogFilePtr dlc_mru_open_log; + u_int dlc_open_count; /* The total open file count. */ + + xt_mutex_type dlc_head_lock; /* The lock for changing the header of shared logs. 
*/ + + void dls_remove_log(XTDataLogFilePtr data_log); + int dls_get_log_state(XTDataLogFilePtr data_log); + xtBool dls_set_log_state(XTDataLogFilePtr data_log, int state); + void dlc_init(struct XTThread *self, struct XTDatabase *db); + void dlc_exit(struct XTThread *self); + void dlc_name(size_t size, char *path, xtLogID log_id); + xtBool dlc_open_log(XTOpenFilePtr *fh, xtLogID log_id, int mode); + xtBool dlc_unlock_log(XTDataLogFilePtr data_log); + XTDataLogFilePtr dlc_get_log_for_writing(off_t space_required, struct XTThread *thread); + xtBool dlc_get_data_log(XTDataLogFilePtr *data_log, xtLogID log_id, xtBool create, XTDataLogSegPtr *ret_seg); + xtBool dlc_remove_data_log(xtLogID log_id, xtBool just_close); + xtBool dlc_get_open_log(XTOpenLogFilePtr *open_log, xtLogID log_id); + void dlc_release_open_log(XTOpenLogFilePtr open_log); +} XTDataLogCacheRec, *XTDataLogCachePtr; + +/* The data log buffer, used by a thread to write a + * data log file. + */ +typedef struct XTDataLogBuffer { + struct XTDatabase *dlb_db; + XTDataLogFilePtr dlb_data_log; /* The data log file. */ + + xtLogOffset dlb_buffer_offset; /* The offset into the log file. */ + size_t dlb_buffer_size; /* The size of the buffer. */ + size_t dlb_buffer_len; /* The amount of data in the buffer. 
*/ + xtWord1 *dlb_log_buffer; + xtBool dlb_flush_required; +#ifdef DEBUG + off_t dlb_max_write_offset; +#endif + + void dlb_init(struct XTDatabase *db, size_t buffer_size); + void dlb_exit(struct XTThread *self); + xtBool dlb_close_log(struct XTThread *thread); + xtBool dlb_get_log_offset(xtLogID *log_id, off_t *out_offset, size_t req_size, struct XTThread *thread); + xtBool dlb_flush_log(xtBool commit, struct XTThread *thread); + xtBool dlb_write_thru_log(xtLogID log_id, xtLogOffset log_offset, size_t size, xtWord1 *data, struct XTThread *thread); + xtBool dlb_append_log(xtLogID log_id, off_t out_offset, size_t size, xtWord1 *data, struct XTThread *thread); + xtBool dlb_read_log(xtLogID log_id, off_t offset, size_t size, xtWord1 *data, struct XTThread *thread); + xtBool dlb_delete_log(xtLogID log_id, off_t offset, size_t size, xtTableID tab_id, xtRecordID tab_offset, struct XTThread *thread); +} XTDataLogBufferRec, *XTDataLogBufferPtr; + +typedef struct XTSeqLogRead { + struct XTDatabase *sl_db; + + virtual ~XTSeqLogRead() { } + virtual xtBool sl_seq_init(struct XTDatabase *db, size_t buffer_size) { (void) buffer_size; sl_db = db; return OK; }; + virtual void sl_seq_exit() { }; + virtual XTOpenFilePtr sl_seq_open_file() { return NULL; }; + virtual void sl_seq_pos(xtLogID *log_id, xtLogOffset *log_offset) { (void) log_id; (void) log_offset; }; + virtual xtBool sl_seq_start(xtLogID log_id, xtLogOffset log_offset, xtBool missing_ok) { + (void) log_id; (void) log_offset; (void) missing_ok; return OK; + }; + virtual xtBool sl_rnd_read(xtLogOffset log_offset, size_t size, xtWord1 *data, size_t *read, struct XTThread *thread) { + (void) log_offset; (void) size; (void) data; (void) read; (void) thread; return OK; + }; + virtual xtBool sl_seq_next(XTXactLogBufferDPtr *entry, xtBool verify, struct XTThread *thread) { + (void) entry; (void) verify; (void) thread; return OK; + }; + virtual void sl_seq_skip(size_t size) { (void) size; } +} XTSeqLogReadRec, *XTSeqLogReadPtr; + 
+typedef struct XTDataSeqRead : public XTSeqLogRead { + XTOpenFilePtr sl_log_file; + xtLogID sl_rec_log_id; /* The current record log ID. */ + xtLogOffset sl_rec_log_offset; /* The current log read position. */ + size_t sl_record_len; /* The length of the current record. */ + xtLogOffset sl_log_eof; + + size_t sl_buffer_size; /* Size of the buffer. */ + xtLogOffset sl_buf_log_offset; /* File offset of the buffer. */ + size_t sl_buffer_len; /* Amount of data in the buffer. */ + xtWord1 *sl_buffer; + + virtual ~XTDataSeqRead() { } + virtual xtBool sl_seq_init(struct XTDatabase *db, size_t buffer_size); + virtual void sl_seq_exit(); + virtual XTOpenFilePtr sl_seq_open_file(); + virtual void sl_seq_pos(xtLogID *log_id, xtLogOffset *log_offset); + virtual xtBool sl_seq_start(xtLogID log_id, xtLogOffset log_offset, xtBool missing_ok); + virtual xtBool sl_rnd_read(xtLogOffset log_offset, size_t size, xtWord1 *data, size_t *read, struct XTThread *thread); + virtual xtBool sl_seq_next(XTXactLogBufferDPtr *entry, xtBool verify, struct XTThread *thread); + virtual void sl_seq_skip(size_t size); + virtual void sl_seq_skip_to(off_t offset); +} XTDataSeqReadRec, *XTDataSeqReadPtr; + +void xt_dl_delete_ext_data(struct XTThread *self, struct XTTable *tab, xtBool missing_ok, xtBool have_table_lock); + +void xt_start_compactor(struct XTThread *self, struct XTDatabase *db); +void xt_stop_compactor(struct XTThread *self, struct XTDatabase *db); + +void xt_dl_init_db(struct XTThread *self, struct XTDatabase *db); +void xt_dl_exit_db(struct XTThread *self, struct XTDatabase *db); +void xt_dl_set_to_delete(struct XTThread *self, struct XTDatabase *db, xtLogID log_id); +void xt_dl_log_status(struct XTThread *self, struct XTDatabase *db, XTStringBufferPtr strbuf); +void xt_dl_delete_logs(struct XTThread *self, struct XTDatabase *db); + +#endif + diff --git a/storage/pbxt/src/discover_xt.cc b/storage/pbxt/src/discover_xt.cc new file mode 100644 index 00000000000..074132d47cb --- /dev/null 
+++ b/storage/pbxt/src/discover_xt.cc @@ -0,0 +1,1383 @@ +/* Copyright (c) 2008 PrimeBase Technologies GmbH, Germany + * Derived from code Copyright (C) 2000-2004 MySQL AB + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Created by Leslie on 8/27/08. + * + */ + +#include "xt_config.h" + +#ifndef DRIZZLED +#include "mysql_priv.h" +#include "item_create.h" +#include <m_ctype.h> +#else +#include <drizzled/session.h> +#include <drizzled/server_includes.h> +#include <drizzled/sql_base.h> +#endif + +#include "strutil_xt.h" +#include "ha_pbxt.h" +#include "discover_xt.h" +#include "ha_xtsys.h" + +#ifndef DRIZZLED +#if MYSQL_VERSION_ID > 60005 +#define DOT_STR(x) x.str +#else +#define DOT_STR(x) x +#endif +#endif + +#ifndef DRIZZLED +#define LOCK_OPEN_HACK_REQUIRED +#endif // DRIZZLED + +#ifdef LOCK_OPEN_HACK_REQUIRED +/////////////////////////////// +/* + * Unfortunately I cannot use the standard mysql_create_table_no_lock() because it will lock "LOCK_open" + * which has already been locked while the server is performing table discovery. So I have added this hack + * in here to create my own version. The following macros will make the changes I need to get it to work. + * The actual function code has been copied here without changes. 
+ * + * It's almost enough to make you want to cry. :( +*/ +//----------------------------- + +#ifdef pthread_mutex_lock +#undef pthread_mutex_lock +#endif + +#ifdef pthread_mutex_unlock +#undef pthread_mutex_unlock +#endif + +#define mysql_create_table_no_lock hacked_mysql_create_table_no_lock +#define pthread_mutex_lock(l) +#define pthread_mutex_unlock(l) + +#define check_engine(t, n, c) (0) +#define set_table_default_charset(t, c, d) + +void calculate_interval_lengths(CHARSET_INFO *cs, TYPELIB *interval, + uint32 *max_length, uint32 *tot_length); + +uint build_tmptable_filename(THD* thd, char *buff, size_t bufflen); +uint build_table_filename(char *buff, size_t bufflen, const char *db, + const char *table_name, const char *ext, uint flags); + +////////////////////////////////////////////////////////// +////// START OF CUT AND PASTES FROM sql_table.cc //////// +////////////////////////////////////////////////////////// + +// sort_keys() cut and pasted directly from sql_table.cc. +static int sort_keys(KEY *a, KEY *b) +{ + ulong a_flags= a->flags, b_flags= b->flags; + + if (a_flags & HA_NOSAME) + { + if (!(b_flags & HA_NOSAME)) + return -1; + if ((a_flags ^ b_flags) & (HA_NULL_PART_KEY | HA_END_SPACE_KEY)) + { + /* Sort NOT NULL keys before other keys */ + return (a_flags & (HA_NULL_PART_KEY | HA_END_SPACE_KEY)) ? 1 : -1; + } + if (a->name == primary_key_name) + return -1; + if (b->name == primary_key_name) + return 1; + /* Sort keys not containing partial segments before others */ + if ((a_flags ^ b_flags) & HA_KEY_HAS_PART_KEY_SEG) + return (a_flags & HA_KEY_HAS_PART_KEY_SEG) ? 1 : -1; + } + else if (b_flags & HA_NOSAME) + return 1; // Prefer b + + if ((a_flags ^ b_flags) & HA_FULLTEXT) + { + return (a_flags & HA_FULLTEXT) ? 1 : -1; + } + /* + Prefer original key order. usable_key_parts contains here + the original key position. + */ + return ((a->usable_key_parts < b->usable_key_parts) ? -1 : + (a->usable_key_parts > b->usable_key_parts) ? 
1 : + 0); +} + +// check_if_keyname_exists() cut and pasted directly from sql_table.cc. +static bool +check_if_keyname_exists(const char *name, KEY *start, KEY *end) +{ + for (KEY *key=start ; key != end ; key++) + if (!my_strcasecmp(system_charset_info,name,key->name)) + return 1; + return 0; +} + +// make_unique_key_name() cut and pasted directly from sql_table.cc. +static char * +make_unique_key_name(const char *field_name,KEY *start,KEY *end) +{ + char buff[MAX_FIELD_NAME],*buff_end; + + if (!check_if_keyname_exists(field_name,start,end) && + my_strcasecmp(system_charset_info,field_name,primary_key_name)) + return (char*) field_name; // Use fieldname + buff_end=strmake(buff,field_name, sizeof(buff)-4); + + /* + Only 3 chars + '\0' left, so need to limit to 2 digit + This is ok as we can't have more than 100 keys anyway + */ + for (uint i=2 ; i< 100; i++) + { + *buff_end= '_'; + int10_to_str(i, buff_end+1, 10); + if (!check_if_keyname_exists(buff,start,end)) + return sql_strdup(buff); + } + return (char*) "not_specified"; // Should never happen +} + + +// prepare_blob_field() cut and pasted directly from sql_table.cc. +static bool prepare_blob_field(THD *thd, Create_field *sql_field) +{ + DBUG_ENTER("prepare_blob_field"); + + if (sql_field->length > MAX_FIELD_VARCHARLENGTH && + !(sql_field->flags & BLOB_FLAG)) + { + /* Convert long VARCHAR columns to TEXT or BLOB */ + char warn_buff[MYSQL_ERRMSG_SIZE]; + + if (sql_field->def || (thd->variables.sql_mode & (MODE_STRICT_TRANS_TABLES | + MODE_STRICT_ALL_TABLES))) + { + my_error(ER_TOO_BIG_FIELDLENGTH, MYF(0), sql_field->field_name, + MAX_FIELD_VARCHARLENGTH / sql_field->charset->mbmaxlen); + DBUG_RETURN(1); + } + sql_field->sql_type= MYSQL_TYPE_BLOB; + sql_field->flags|= BLOB_FLAG; + sprintf(warn_buff, ER(ER_AUTO_CONVERT), sql_field->field_name, + (sql_field->charset == &my_charset_bin) ? "VARBINARY" : "VARCHAR", + (sql_field->charset == &my_charset_bin) ? 
"BLOB" : "TEXT"); + push_warning(thd, MYSQL_ERROR::WARN_LEVEL_NOTE, ER_AUTO_CONVERT, + warn_buff); + } + + if ((sql_field->flags & BLOB_FLAG) && sql_field->length) + { + if (sql_field->sql_type == MYSQL_TYPE_BLOB) + { + /* The user has given a length to the blob column */ + sql_field->sql_type= get_blob_type_from_length(sql_field->length); + sql_field->pack_length= calc_pack_length(sql_field->sql_type, 0); + } + sql_field->length= 0; + } + DBUG_RETURN(0); +} + +////////////////////////////// +// mysql_prepare_create_table() cut and pasted directly from sql_table.cc. +static int +mysql_prepare_create_table(THD *thd, HA_CREATE_INFO *create_info, + Alter_info *alter_info, + bool tmp_table, + uint *db_options, + handler *file, KEY **key_info_buffer, + uint *key_count, int select_field_count) +{ + const char *key_name; + Create_field *sql_field,*dup_field; + uint field,null_fields,blob_columns,max_key_length; + ulong record_offset= 0; + KEY *key_info; + KEY_PART_INFO *key_part_info; + int timestamps= 0, timestamps_with_niladic= 0; + int field_no,dup_no; + int select_field_pos,auto_increment=0; + List_iterator<Create_field> it(alter_info->create_list); + List_iterator<Create_field> it2(alter_info->create_list); + uint total_uneven_bit_length= 0; + DBUG_ENTER("mysql_prepare_create_table"); + + select_field_pos= alter_info->create_list.elements - select_field_count; + null_fields=blob_columns=0; + create_info->varchar= 0; + max_key_length= file->max_key_length(); + + for (field_no=0; (sql_field=it++) ; field_no++) + { + CHARSET_INFO *save_cs; + + /* + Initialize length from its original value (number of characters), + which was set in the parser. This is necessary if we're + executing a prepared statement for the second time. + */ + sql_field->length= sql_field->char_length; + if (!sql_field->charset) + sql_field->charset= create_info->default_table_charset; + /* + table_charset is set in ALTER TABLE if we want change character set + for all varchar/char columns. 
+ But the table charset must not affect the BLOB fields, so don't + allow to change my_charset_bin to somethig else. + */ + if (create_info->table_charset && sql_field->charset != &my_charset_bin) + sql_field->charset= create_info->table_charset; + + save_cs= sql_field->charset; + if ((sql_field->flags & BINCMP_FLAG) && + !(sql_field->charset= get_charset_by_csname(sql_field->charset->csname, + MY_CS_BINSORT,MYF(0)))) + { + char tmp[64]; + strmake(strmake(tmp, save_cs->csname, sizeof(tmp)-4), + STRING_WITH_LEN("_bin")); + my_error(ER_UNKNOWN_COLLATION, MYF(0), tmp); + DBUG_RETURN(TRUE); + } + + /* + Convert the default value from client character + set into the column character set if necessary. + */ + if (sql_field->def && + save_cs != sql_field->def->collation.collation && + (sql_field->sql_type == MYSQL_TYPE_VAR_STRING || + sql_field->sql_type == MYSQL_TYPE_STRING || + sql_field->sql_type == MYSQL_TYPE_SET || + sql_field->sql_type == MYSQL_TYPE_ENUM)) + { + /* + Starting from 5.1 we work here with a copy of Create_field + created by the caller, not with the instance that was + originally created during parsing. It's OK to create + a temporary item and initialize with it a member of the + copy -- this item will be thrown away along with the copy + at the end of execution, and thus not introduce a dangling + pointer in the parsed tree of a prepared statement or a + stored procedure statement. + */ + sql_field->def= sql_field->def->safe_charset_converter(save_cs); + + if (sql_field->def == NULL) + { + /* Could not convert */ + my_error(ER_INVALID_DEFAULT, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + } + + if (sql_field->sql_type == MYSQL_TYPE_SET || + sql_field->sql_type == MYSQL_TYPE_ENUM) + { + uint32 dummy; + CHARSET_INFO *cs= sql_field->charset; + TYPELIB *interval= sql_field->interval; + + /* + Create typelib from interval_list, and if necessary + convert strings from client character set to the + column character set. 
+ */ + if (!interval) + { + /* + Create the typelib in runtime memory - we will free the + occupied memory at the same time when we free this + sql_field -- at the end of execution. + */ + interval= sql_field->interval= typelib(thd->mem_root, + sql_field->interval_list); + List_iterator<String> int_it(sql_field->interval_list); + String conv, *tmp; + char comma_buf[2]; + int comma_length= cs->cset->wc_mb(cs, ',', (uchar*) comma_buf, + (uchar*) comma_buf + + sizeof(comma_buf)); + DBUG_ASSERT(comma_length > 0); + for (uint i= 0; (tmp= int_it++); i++) + { + uint lengthsp; + if (String::needs_conversion(tmp->length(), tmp->charset(), + cs, &dummy)) + { + uint cnv_errs; + conv.copy(tmp->ptr(), tmp->length(), tmp->charset(), cs, &cnv_errs); + interval->type_names[i]= strmake_root(thd->mem_root, conv.ptr(), + conv.length()); + interval->type_lengths[i]= conv.length(); + } + + // Strip trailing spaces. + lengthsp= cs->cset->lengthsp(cs, interval->type_names[i], + interval->type_lengths[i]); + interval->type_lengths[i]= lengthsp; + ((uchar *)interval->type_names[i])[lengthsp]= '\0'; + if (sql_field->sql_type == MYSQL_TYPE_SET) + { + if (cs->coll->instr(cs, interval->type_names[i], + interval->type_lengths[i], + comma_buf, comma_length, NULL, 0)) + { + my_error(ER_ILLEGAL_VALUE_FOR_TYPE, MYF(0), "set", tmp->ptr()); + DBUG_RETURN(TRUE); + } + } + } + sql_field->interval_list.empty(); // Don't need interval_list anymore + } + + if (sql_field->sql_type == MYSQL_TYPE_SET) + { + uint32 field_length; + if (sql_field->def != NULL) + { + char *not_used; + uint not_used2; + bool not_found= 0; + String str, *def= sql_field->def->val_str(&str); + if (def == NULL) /* SQL "NULL" maps to NULL */ + { + if ((sql_field->flags & NOT_NULL_FLAG) != 0) + { + my_error(ER_INVALID_DEFAULT, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + + /* else, NULL is an allowed value */ + (void) find_set(interval, NULL, 0, + cs, &not_used, &not_used2, &not_found); + } + else /* not NULL */ + { + (void) 
find_set(interval, def->ptr(), def->length(), + cs, &not_used, &not_used2, &not_found); + } + + if (not_found) + { + my_error(ER_INVALID_DEFAULT, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + } + calculate_interval_lengths(cs, interval, &dummy, &field_length); + sql_field->length= field_length + (interval->count - 1); + } + else /* MYSQL_TYPE_ENUM */ + { + uint32 field_length; + DBUG_ASSERT(sql_field->sql_type == MYSQL_TYPE_ENUM); + if (sql_field->def != NULL) + { + String str, *def= sql_field->def->val_str(&str); + if (def == NULL) /* SQL "NULL" maps to NULL */ + { + if ((sql_field->flags & NOT_NULL_FLAG) != 0) + { + my_error(ER_INVALID_DEFAULT, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + + /* else, the defaults yield the correct length for NULLs. */ + } + else /* not NULL */ + { + def->length(cs->cset->lengthsp(cs, def->ptr(), def->length())); + if (find_type2(interval, def->ptr(), def->length(), cs) == 0) /* not found */ + { + my_error(ER_INVALID_DEFAULT, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + } + } + calculate_interval_lengths(cs, interval, &field_length, &dummy); + sql_field->length= field_length; + } + set_if_smaller(sql_field->length, MAX_FIELD_WIDTH-1); + } + + if (sql_field->sql_type == MYSQL_TYPE_BIT) + { + sql_field->pack_flag= FIELDFLAG_NUMBER; + if (file->ha_table_flags() & HA_CAN_BIT_FIELD) + total_uneven_bit_length+= sql_field->length & 7; + else + sql_field->pack_flag|= FIELDFLAG_TREAT_BIT_AS_CHAR; + } + + sql_field->create_length_to_internal_length(); + if (prepare_blob_field(thd, sql_field)) + DBUG_RETURN(TRUE); + + if (!(sql_field->flags & NOT_NULL_FLAG)) + null_fields++; + + if (check_column_name(sql_field->field_name)) + { + my_error(ER_WRONG_COLUMN_NAME, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + + /* Check if we have used the same field name before */ + for (dup_no=0; (dup_field=it2++) != sql_field; dup_no++) + { + if (my_strcasecmp(system_charset_info, + sql_field->field_name, + 
dup_field->field_name) == 0) + { + /* + If this was a CREATE ... SELECT statement, accept a field + redefinition if we are changing a field in the SELECT part + */ + if (field_no < select_field_pos || dup_no >= select_field_pos) + { + my_error(ER_DUP_FIELDNAME, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + else + { + /* Field redefined */ + sql_field->def= dup_field->def; + sql_field->sql_type= dup_field->sql_type; + sql_field->charset= (dup_field->charset ? + dup_field->charset : + create_info->default_table_charset); + sql_field->length= dup_field->char_length; + sql_field->pack_length= dup_field->pack_length; + sql_field->key_length= dup_field->key_length; + sql_field->decimals= dup_field->decimals; + sql_field->create_length_to_internal_length(); + sql_field->unireg_check= dup_field->unireg_check; + /* + We're making one field from two, the result field will have + dup_field->flags as flags. If we've incremented null_fields + because of sql_field->flags, decrement it back. 
+ */ + if (!(sql_field->flags & NOT_NULL_FLAG)) + null_fields--; + sql_field->flags= dup_field->flags; + sql_field->interval= dup_field->interval; + it2.remove(); // Remove first (create) definition + select_field_pos--; + break; + } + } + } + /* Don't pack rows in old tables if the user has requested this */ + if ((sql_field->flags & BLOB_FLAG) || + sql_field->sql_type == MYSQL_TYPE_VARCHAR && + create_info->row_type != ROW_TYPE_FIXED) + (*db_options)|= HA_OPTION_PACK_RECORD; + it2.rewind(); + } + + /* record_offset will be increased with 'length-of-null-bits' later */ + record_offset= 0; + null_fields+= total_uneven_bit_length; + + it.rewind(); + while ((sql_field=it++)) + { + DBUG_ASSERT(sql_field->charset != 0); + + if (prepare_create_field(sql_field, &blob_columns, + &timestamps, &timestamps_with_niladic, + file->ha_table_flags())) + DBUG_RETURN(TRUE); + if (sql_field->sql_type == MYSQL_TYPE_VARCHAR) + create_info->varchar= TRUE; + sql_field->offset= record_offset; + if (MTYP_TYPENR(sql_field->unireg_check) == Field::NEXT_NUMBER) + auto_increment++; + record_offset+= sql_field->pack_length; + } + if (timestamps_with_niladic > 1) + { + my_message(ER_TOO_MUCH_AUTO_TIMESTAMP_COLS, + ER(ER_TOO_MUCH_AUTO_TIMESTAMP_COLS), MYF(0)); + DBUG_RETURN(TRUE); + } + if (auto_increment > 1) + { + my_message(ER_WRONG_AUTO_KEY, ER(ER_WRONG_AUTO_KEY), MYF(0)); + DBUG_RETURN(TRUE); + } + if (auto_increment && + (file->ha_table_flags() & HA_NO_AUTO_INCREMENT)) + { + my_message(ER_TABLE_CANT_HANDLE_AUTO_INCREMENT, + ER(ER_TABLE_CANT_HANDLE_AUTO_INCREMENT), MYF(0)); + DBUG_RETURN(TRUE); + } + + if (blob_columns && (file->ha_table_flags() & HA_NO_BLOBS)) + { + my_message(ER_TABLE_CANT_HANDLE_BLOB, ER(ER_TABLE_CANT_HANDLE_BLOB), + MYF(0)); + DBUG_RETURN(TRUE); + } + + /* Create keys */ + + List_iterator<Key> key_iterator(alter_info->key_list); + List_iterator<Key> key_iterator2(alter_info->key_list); + uint key_parts=0, fk_key_count=0; + bool primary_key=0,unique_key=0; + Key *key, *key2; + 
uint tmp, key_number; + /* special marker for keys to be ignored */ + static char ignore_key[1]; + + /* Calculate number of key segements */ + *key_count= 0; + + while ((key=key_iterator++)) + { + DBUG_PRINT("info", ("key name: '%s' type: %d", key->DOT_STR(name) ? key->DOT_STR(name) : + "(none)" , key->type)); + LEX_STRING key_name_str; + if (key->type == Key::FOREIGN_KEY) + { + fk_key_count++; + Foreign_key *fk_key= (Foreign_key*) key; + if (fk_key->ref_columns.elements && + fk_key->ref_columns.elements != fk_key->columns.elements) + { + my_error(ER_WRONG_FK_DEF, MYF(0), + (fk_key->DOT_STR(name) ? fk_key->DOT_STR(name) : "foreign key without name"), + ER(ER_KEY_REF_DO_NOT_MATCH_TABLE_REF)); + DBUG_RETURN(TRUE); + } + continue; + } + (*key_count)++; + tmp=file->max_key_parts(); + if (key->columns.elements > tmp) + { + my_error(ER_TOO_MANY_KEY_PARTS,MYF(0),tmp); + DBUG_RETURN(TRUE); + } + key_name_str.str= (char*) key->DOT_STR(name); + key_name_str.length= key->DOT_STR(name) ? strlen(key->DOT_STR(name)) : 0; + if (check_string_char_length(&key_name_str, "", NAME_CHAR_LEN, + system_charset_info, 1)) + { + my_error(ER_TOO_LONG_IDENT, MYF(0), key->DOT_STR(name)); + DBUG_RETURN(TRUE); + } + key_iterator2.rewind (); + if (key->type != Key::FOREIGN_KEY) + { + while ((key2 = key_iterator2++) != key) + { + /* + foreign_key_prefix(key, key2) returns 0 if key or key2, or both, is + 'generated', and a generated key is a prefix of the other key. + Then we do not need the generated shorter key. 
+ */ + if ((key2->type != Key::FOREIGN_KEY && + key2->DOT_STR(name) != ignore_key && + !foreign_key_prefix(key, key2))) + { + /* TODO: issue warning message */ + /* mark that the generated key should be ignored */ + if (!key2->generated || + (key->generated && key->columns.elements < + key2->columns.elements)) + key->DOT_STR(name)= ignore_key; + else + { + key2->DOT_STR(name)= ignore_key; + key_parts-= key2->columns.elements; + (*key_count)--; + } + break; + } + } + } + if (key->DOT_STR(name) != ignore_key) + key_parts+=key->columns.elements; + else + (*key_count)--; + if (key->DOT_STR(name) && !tmp_table && (key->type != Key::PRIMARY) && + !my_strcasecmp(system_charset_info,key->DOT_STR(name),primary_key_name)) + { + my_error(ER_WRONG_NAME_FOR_INDEX, MYF(0), key->DOT_STR(name)); + DBUG_RETURN(TRUE); + } + } + tmp=file->max_keys(); + if (*key_count > tmp) + { + my_error(ER_TOO_MANY_KEYS,MYF(0),tmp); + DBUG_RETURN(TRUE); + } + + (*key_info_buffer)= key_info= (KEY*) sql_calloc(sizeof(KEY) * (*key_count)); + key_part_info=(KEY_PART_INFO*) sql_calloc(sizeof(KEY_PART_INFO)*key_parts); + if (!*key_info_buffer || ! 
key_part_info) + DBUG_RETURN(TRUE); // Out of memory + + key_iterator.rewind(); + key_number=0; + for (; (key=key_iterator++) ; key_number++) + { + uint key_length=0; + Key_part_spec *column; + + if (key->DOT_STR(name) == ignore_key) + { + /* ignore redundant keys */ + do + key=key_iterator++; + while (key && key->DOT_STR(name) == ignore_key); + if (!key) + break; + } + + switch (key->type) { + case Key::MULTIPLE: + key_info->flags= 0; + break; + case Key::FULLTEXT: + key_info->flags= HA_FULLTEXT; + if ((key_info->parser_name= &key->key_create_info.parser_name)->str) + key_info->flags|= HA_USES_PARSER; + else + key_info->parser_name= 0; + break; + case Key::SPATIAL: +#ifdef HAVE_SPATIAL + key_info->flags= HA_SPATIAL; + break; +#else + my_error(ER_FEATURE_DISABLED, MYF(0), + sym_group_geom.name, sym_group_geom.needed_define); + DBUG_RETURN(TRUE); +#endif + case Key::FOREIGN_KEY: + key_number--; // Skip this key + continue; + default: + key_info->flags = HA_NOSAME; + break; + } + if (key->generated) + key_info->flags|= HA_GENERATED_KEY; + + key_info->key_parts=(uint8) key->columns.elements; + key_info->key_part=key_part_info; + key_info->usable_key_parts= key_number; + key_info->algorithm= key->key_create_info.algorithm; + + if (key->type == Key::FULLTEXT) + { + if (!(file->ha_table_flags() & HA_CAN_FULLTEXT)) + { + my_message(ER_TABLE_CANT_HANDLE_FT, ER(ER_TABLE_CANT_HANDLE_FT), + MYF(0)); + DBUG_RETURN(TRUE); + } + } + /* + Make SPATIAL to be RTREE by default + SPATIAL only on BLOB or at least BINARY, this + actually should be replaced by special GEOM type + in near future when new frm file is ready + checking for proper key parts number: + */ + + /* TODO: Add proper checks if handler supports key_type and algorithm */ + if (key_info->flags & HA_SPATIAL) + { + if (!(file->ha_table_flags() & HA_CAN_RTREEKEYS)) + { + my_message(ER_TABLE_CANT_HANDLE_SPKEYS, ER(ER_TABLE_CANT_HANDLE_SPKEYS), + MYF(0)); + DBUG_RETURN(TRUE); + } + if (key_info->key_parts != 1) + { + 
my_error(ER_WRONG_ARGUMENTS, MYF(0), "SPATIAL INDEX"); + DBUG_RETURN(TRUE); + } + } + else if (key_info->algorithm == HA_KEY_ALG_RTREE) + { +#ifdef HAVE_RTREE_KEYS + if ((key_info->key_parts & 1) == 1) + { + my_error(ER_WRONG_ARGUMENTS, MYF(0), "RTREE INDEX"); + DBUG_RETURN(TRUE); + } + /* TODO: To be deleted */ + my_error(ER_NOT_SUPPORTED_YET, MYF(0), "RTREE INDEX"); + DBUG_RETURN(TRUE); +#else + my_error(ER_FEATURE_DISABLED, MYF(0), + sym_group_rtree.name, sym_group_rtree.needed_define); + DBUG_RETURN(TRUE); +#endif + } + + /* Take block size from key part or table part */ + /* + TODO: Add warning if block size changes. We can't do it here, as + this may depend on the size of the key + */ + key_info->block_size= (key->key_create_info.block_size ? + key->key_create_info.block_size : + create_info->key_block_size); + + if (key_info->block_size) + key_info->flags|= HA_USES_BLOCK_SIZE; + + List_iterator<Key_part_spec> cols(key->columns), cols2(key->columns); + CHARSET_INFO *ft_key_charset=0; // for FULLTEXT + for (uint column_nr=0 ; (column=cols++) ; column_nr++) + { + uint length; + Key_part_spec *dup_column; + + it.rewind(); + field=0; + while ((sql_field=it++) && + my_strcasecmp(system_charset_info, + column->DOT_STR(field_name), + sql_field->field_name)) + field++; + if (!sql_field) + { + my_error(ER_KEY_COLUMN_DOES_NOT_EXITS, MYF(0), column->field_name); + DBUG_RETURN(TRUE); + } + while ((dup_column= cols2++) != column) + { + if (!my_strcasecmp(system_charset_info, + column->DOT_STR(field_name), dup_column->DOT_STR(field_name))) + { + my_printf_error(ER_DUP_FIELDNAME, + ER(ER_DUP_FIELDNAME),MYF(0), + column->field_name); + DBUG_RETURN(TRUE); + } + } + cols2.rewind(); + if (key->type == Key::FULLTEXT) + { + if ((sql_field->sql_type != MYSQL_TYPE_STRING && + sql_field->sql_type != MYSQL_TYPE_VARCHAR && + !f_is_blob(sql_field->pack_flag)) || + sql_field->charset == &my_charset_bin || + sql_field->charset->mbminlen > 1 || // ucs2 doesn't work yet + (ft_key_charset 
&& sql_field->charset != ft_key_charset)) + { + my_error(ER_BAD_FT_COLUMN, MYF(0), column->field_name); + DBUG_RETURN(-1); + } + ft_key_charset=sql_field->charset; + /* + for fulltext keys keyseg length is 1 for blobs (it's ignored in ft + code anyway, and 0 (set to column width later) for char's. it has + to be correct col width for char's, as char data are not prefixed + with length (unlike blobs, where ft code takes data length from a + data prefix, ignoring column->length). + */ + column->length=test(f_is_blob(sql_field->pack_flag)); + } + else + { + column->length*= sql_field->charset->mbmaxlen; + + if (key->type == Key::SPATIAL && column->length) + { + my_error(ER_WRONG_SUB_KEY, MYF(0)); + DBUG_RETURN(TRUE); + } + + if (f_is_blob(sql_field->pack_flag) || + (f_is_geom(sql_field->pack_flag) && key->type != Key::SPATIAL)) + { + if (!(file->ha_table_flags() & HA_CAN_INDEX_BLOBS)) + { + my_error(ER_BLOB_USED_AS_KEY, MYF(0), column->field_name); + DBUG_RETURN(TRUE); + } + if (f_is_geom(sql_field->pack_flag) && sql_field->geom_type == + Field::GEOM_POINT) + column->length= 25; + if (!column->length) + { + my_error(ER_BLOB_KEY_WITHOUT_LENGTH, MYF(0), column->field_name); + DBUG_RETURN(TRUE); + } + } +#ifdef HAVE_SPATIAL + if (key->type == Key::SPATIAL) + { + if (!column->length) + { + /* + 4 is: (Xmin,Xmax,Ymin,Ymax), this is for 2D case + Lately we'll extend this code to support more dimensions + */ + column->length= 4*sizeof(double); + } + } +#endif + if (!(sql_field->flags & NOT_NULL_FLAG)) + { + if (key->type == Key::PRIMARY) + { + /* Implicitly set primary key fields to NOT NULL for ISO conf. 
*/ + sql_field->flags|= NOT_NULL_FLAG; + sql_field->pack_flag&= ~FIELDFLAG_MAYBE_NULL; + null_fields--; + } + else + { + key_info->flags|= HA_NULL_PART_KEY; + if (!(file->ha_table_flags() & HA_NULL_IN_KEY)) + { + my_error(ER_NULL_COLUMN_IN_INDEX, MYF(0), column->field_name); + DBUG_RETURN(TRUE); + } + if (key->type == Key::SPATIAL) + { + my_message(ER_SPATIAL_CANT_HAVE_NULL, + ER(ER_SPATIAL_CANT_HAVE_NULL), MYF(0)); + DBUG_RETURN(TRUE); + } + } + } + if (MTYP_TYPENR(sql_field->unireg_check) == Field::NEXT_NUMBER) + { + if (column_nr == 0 || (file->ha_table_flags() & HA_AUTO_PART_KEY)) + auto_increment--; // Field is used + } + } + + key_part_info->fieldnr= field; + key_part_info->offset= (uint16) sql_field->offset; + key_part_info->key_type=sql_field->pack_flag; + length= sql_field->key_length; + + if (column->length) + { + if (f_is_blob(sql_field->pack_flag)) + { + if ((length=column->length) > max_key_length || + length > file->max_key_part_length()) + { + length=min(max_key_length, file->max_key_part_length()); + if (key->type == Key::MULTIPLE) + { + /* not a critical problem */ + char warn_buff[MYSQL_ERRMSG_SIZE]; + my_snprintf(warn_buff, sizeof(warn_buff), ER(ER_TOO_LONG_KEY), + length); + push_warning(thd, MYSQL_ERROR::WARN_LEVEL_WARN, + ER_TOO_LONG_KEY, warn_buff); + /* Align key length to multibyte char boundary */ + length-= length % sql_field->charset->mbmaxlen; + } + else + { + my_error(ER_TOO_LONG_KEY,MYF(0),length); + DBUG_RETURN(TRUE); + } + } + } + else if (!f_is_geom(sql_field->pack_flag) && + (column->length > length || + !Field::type_can_have_key_part (sql_field->sql_type) || + ((f_is_packed(sql_field->pack_flag) || + ((file->ha_table_flags() & HA_NO_PREFIX_CHAR_KEYS) && + (key_info->flags & HA_NOSAME))) && + column->length != length))) + { + my_message(ER_WRONG_SUB_KEY, ER(ER_WRONG_SUB_KEY), MYF(0)); + DBUG_RETURN(TRUE); + } + else if (!(file->ha_table_flags() & HA_NO_PREFIX_CHAR_KEYS)) + length=column->length; + } + else if (length == 0) + { + 
my_error(ER_WRONG_KEY_COLUMN, MYF(0), column->field_name); + DBUG_RETURN(TRUE); + } + if (length > file->max_key_part_length() && key->type != Key::FULLTEXT) + { + length= file->max_key_part_length(); + if (key->type == Key::MULTIPLE) + { + /* not a critical problem */ + char warn_buff[MYSQL_ERRMSG_SIZE]; + my_snprintf(warn_buff, sizeof(warn_buff), ER(ER_TOO_LONG_KEY), + length); + push_warning(thd, MYSQL_ERROR::WARN_LEVEL_WARN, + ER_TOO_LONG_KEY, warn_buff); + /* Align key length to multibyte char boundary */ + length-= length % sql_field->charset->mbmaxlen; + } + else + { + my_error(ER_TOO_LONG_KEY,MYF(0),length); + DBUG_RETURN(TRUE); + } + } + key_part_info->length=(uint16) length; + /* Use packed keys for long strings on the first column */ + if (!((*db_options) & HA_OPTION_NO_PACK_KEYS) && + (length >= KEY_DEFAULT_PACK_LENGTH && + (sql_field->sql_type == MYSQL_TYPE_STRING || + sql_field->sql_type == MYSQL_TYPE_VARCHAR || + sql_field->pack_flag & FIELDFLAG_BLOB))) + { + if (column_nr == 0 && (sql_field->pack_flag & FIELDFLAG_BLOB) || + sql_field->sql_type == MYSQL_TYPE_VARCHAR) + key_info->flags|= HA_BINARY_PACK_KEY | HA_VAR_LENGTH_KEY; + else + key_info->flags|= HA_PACK_KEY; + } + /* Check if the key segment is partial, set the key flag accordingly */ + if (length != sql_field->key_length) + key_info->flags|= HA_KEY_HAS_PART_KEY_SEG; + + key_length+=length; + key_part_info++; + + /* Create the key name based on the first column (if not given) */ + if (column_nr == 0) + { + if (key->type == Key::PRIMARY) + { + if (primary_key) + { + my_message(ER_MULTIPLE_PRI_KEY, ER(ER_MULTIPLE_PRI_KEY), + MYF(0)); + DBUG_RETURN(TRUE); + } + key_name=primary_key_name; + primary_key=1; + } + else if (!(key_name = key->DOT_STR(name))) + key_name=make_unique_key_name(sql_field->field_name, + *key_info_buffer, key_info); + if (check_if_keyname_exists(key_name, *key_info_buffer, key_info)) + { + my_error(ER_DUP_KEYNAME, MYF(0), key_name); + DBUG_RETURN(TRUE); + } + 
key_info->name=(char*) key_name; + } + } + if (!key_info->name || check_column_name(key_info->name)) + { + my_error(ER_WRONG_NAME_FOR_INDEX, MYF(0), key_info->name); + DBUG_RETURN(TRUE); + } + if (!(key_info->flags & HA_NULL_PART_KEY)) + unique_key=1; + key_info->key_length=(uint16) key_length; + if (key_length > max_key_length && key->type != Key::FULLTEXT) + { + my_error(ER_TOO_LONG_KEY,MYF(0),max_key_length); + DBUG_RETURN(TRUE); + } + key_info++; + } + if (!unique_key && !primary_key && + (file->ha_table_flags() & HA_REQUIRE_PRIMARY_KEY)) + { + my_message(ER_REQUIRES_PRIMARY_KEY, ER(ER_REQUIRES_PRIMARY_KEY), MYF(0)); + DBUG_RETURN(TRUE); + } + if (auto_increment > 0) + { + my_message(ER_WRONG_AUTO_KEY, ER(ER_WRONG_AUTO_KEY), MYF(0)); + DBUG_RETURN(TRUE); + } + /* Sort keys in optimized order */ + my_qsort((uchar*) *key_info_buffer, *key_count, sizeof(KEY), + (qsort_cmp) sort_keys); + create_info->null_bits= null_fields; + + /* Check fields. */ + it.rewind(); + while ((sql_field=it++)) + { + Field::utype type= (Field::utype) MTYP_TYPENR(sql_field->unireg_check); + + if (thd->variables.sql_mode & MODE_NO_ZERO_DATE && + !sql_field->def && + sql_field->sql_type == MYSQL_TYPE_TIMESTAMP && + (sql_field->flags & NOT_NULL_FLAG) && + (type == Field::NONE || type == Field::TIMESTAMP_UN_FIELD)) + { + /* + An error should be reported if: + - NO_ZERO_DATE SQL mode is active; + - there is no explicit DEFAULT clause (default column value); + - this is a TIMESTAMP column; + - the column is not NULL; + - this is not the DEFAULT CURRENT_TIMESTAMP column. + + In other words, an error should be reported if + - NO_ZERO_DATE SQL mode is active; + - the column definition is equivalent to + 'column_name TIMESTAMP DEFAULT 0'. + */ + + my_error(ER_INVALID_DEFAULT, MYF(0), sql_field->field_name); + DBUG_RETURN(TRUE); + } + } + + DBUG_RETURN(FALSE); +} + +////////////////////////////// +// mysql_create_table_no_lock() cut and pasted directly from sql_table.cc. 
(I did make it static after copying it.) + +static bool mysql_create_table_no_lock(THD *thd, + const char *db, const char *table_name, + HA_CREATE_INFO *create_info, + Alter_info *alter_info, + bool internal_tmp_table, + uint select_field_count) +{ + char path[FN_REFLEN]; + uint path_length; + const char *alias; + uint db_options, key_count; + KEY *key_info_buffer; + handler *file; + bool error= TRUE; + DBUG_ENTER("mysql_create_table_no_lock"); + DBUG_PRINT("enter", ("db: '%s' table: '%s' tmp: %d", + db, table_name, internal_tmp_table)); + + + /* Check for duplicate fields and check type of table to create */ + if (!alter_info->create_list.elements) + { + my_message(ER_TABLE_MUST_HAVE_COLUMNS, ER(ER_TABLE_MUST_HAVE_COLUMNS), + MYF(0)); + DBUG_RETURN(TRUE); + } + if (check_engine(thd, table_name, create_info)) + DBUG_RETURN(TRUE); + db_options= create_info->table_options; + if (create_info->row_type == ROW_TYPE_DYNAMIC) + db_options|=HA_OPTION_PACK_RECORD; + alias= table_case_name(create_info, table_name); + + /* PMC - Done to avoid getting the partition handler by mistake! */ + if (!(file= new (thd->mem_root) ha_xtsys(pbxt_hton, NULL))) + { + mem_alloc_error(sizeof(handler)); + DBUG_RETURN(TRUE); + } + + set_table_default_charset(thd, create_info, (char*) db); + + if (mysql_prepare_create_table(thd, create_info, alter_info, + internal_tmp_table, + &db_options, file, + &key_info_buffer, &key_count, + select_field_count)) + goto err; + + /* Check if table exists */ + if (create_info->options & HA_LEX_CREATE_TMP_TABLE) + { + path_length= build_tmptable_filename(thd, path, sizeof(path)); + create_info->table_options|=HA_CREATE_DELAY_KEY_WRITE; + } + else + { + #ifdef FN_DEVCHAR + /* check if the table name contains FN_DEVCHAR when defined */ + if (strchr(alias, FN_DEVCHAR)) + { + my_error(ER_WRONG_TABLE_NAME, MYF(0), alias); + DBUG_RETURN(TRUE); + } +#endif + path_length= build_table_filename(path, sizeof(path), db, alias, reg_ext, + internal_tmp_table ? 
FN_IS_TMP : 0); + } + + /* Check if table already exists */ + if ((create_info->options & HA_LEX_CREATE_TMP_TABLE) && + find_temporary_table(thd, db, table_name)) + { + if (create_info->options & HA_LEX_CREATE_IF_NOT_EXISTS) + { + create_info->table_existed= 1; // Mark that table existed + push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE, + ER_TABLE_EXISTS_ERROR, ER(ER_TABLE_EXISTS_ERROR), + alias); + error= 0; + goto err; + } + my_error(ER_TABLE_EXISTS_ERROR, MYF(0), alias); + goto err; + } + + pthread_mutex_lock(&LOCK_open); + if (!internal_tmp_table && !(create_info->options & HA_LEX_CREATE_TMP_TABLE)) + { + if (!access(path,F_OK)) + { + if (create_info->options & HA_LEX_CREATE_IF_NOT_EXISTS) + goto warn; + my_error(ER_TABLE_EXISTS_ERROR,MYF(0),table_name); + goto unlock_and_end; + } + /* + We don't assert here, but check the result, because the table could be + in the table definition cache and at the same time the .frm could be + missing from the disk, in case of manual intervention which deletes + the .frm file. The user has to use FLUSH TABLES; to clear the cache. + Then she could create the table. This case is pretty obscure and + therefore we don't introduce a new error message only for it. + */ + if (get_cached_table_share(db, alias)) + { + my_error(ER_TABLE_EXISTS_ERROR, MYF(0), table_name); + goto unlock_and_end; + } + } + + /* + Check that table with given name does not already + exist in any storage engine. In such a case it should + be discovered and the error ER_TABLE_EXISTS_ERROR be returned + unless user specified CREATE TABLE IF NOT EXISTS + The LOCK_open mutex has been locked to make sure no + one else is attempting to discover the table. Since + it's not on disk as a frm file, no one could be using it! 
+ */ + if (!(create_info->options & HA_LEX_CREATE_TMP_TABLE)) + { + bool create_if_not_exists = + create_info->options & HA_LEX_CREATE_IF_NOT_EXISTS; + int retcode = ha_table_exists_in_engine(thd, db, table_name); + DBUG_PRINT("info", ("exists_in_engine: %u",retcode)); + switch (retcode) + { + case HA_ERR_NO_SUCH_TABLE: + /* Normal case, no table exists. We can go and create it */ + break; + case HA_ERR_TABLE_EXIST: + DBUG_PRINT("info", ("Table existed in handler")); + + if (create_if_not_exists) + goto warn; + my_error(ER_TABLE_EXISTS_ERROR,MYF(0),table_name); + goto unlock_and_end; + break; + default: + DBUG_PRINT("info", ("error: %u from storage engine", retcode)); + my_error(retcode, MYF(0),table_name); + goto unlock_and_end; + } + } + + thd_proc_info(thd, "creating table"); + create_info->table_existed= 0; // Mark that table is created + + create_info->table_options=db_options; + + path[path_length - reg_ext_length]= '\0'; // Remove .frm extension + if (rea_create_table(thd, path, db, table_name, + create_info, alter_info->create_list, + key_count, key_info_buffer, file)) + goto unlock_and_end; + + if (create_info->options & HA_LEX_CREATE_TMP_TABLE) + { + /* Open table and put in temporary table list */ +#if MYSQL_VERSION_ID > 60005 + if (!(open_temporary_table(thd, path, db, table_name, 1, OTM_OPEN))) +#else + if (!(open_temporary_table(thd, path, db, table_name, 1))) +#endif + { +#if MYSQL_VERSION_ID > 60005 + (void) rm_temporary_table(create_info->db_type, path, false); +#else + (void) rm_temporary_table(create_info->db_type, path); +#endif + goto unlock_and_end; + } + thd->thread_specific_used= TRUE; + } + + /* + Don't write statement if: + - It is an internal temporary table, + - Row-based logging is used and we are creating a temporary table, or + - The binary log is not open. + Otherwise, the statement shall be binlogged. 
+ */ + if (!internal_tmp_table && + (!thd->current_stmt_binlog_row_based || + (thd->current_stmt_binlog_row_based && + !(create_info->options & HA_LEX_CREATE_TMP_TABLE)))) + write_bin_log(thd, TRUE, thd->query, thd->query_length); + error= FALSE; +unlock_and_end: + pthread_mutex_unlock(&LOCK_open); + +err: + thd_proc_info(thd, "After create"); + delete file; + DBUG_RETURN(error); + +warn: + error= FALSE; + push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE, + ER_TABLE_EXISTS_ERROR, ER(ER_TABLE_EXISTS_ERROR), + alias); + create_info->table_existed= 1; // Mark that table existed + goto unlock_and_end; +} + +//////////////////////////////////////////////////////// +////// END OF CUT AND PASTES FROM sql_table.cc //////// +//////////////////////////////////////////////////////// + +#endif // LOCK_OPEN_HACK_REQUIRED + +//------------------------------ +int xt_create_table_frm(handlerton *hton, THD* thd, const char *db, const char *name, DT_FIELD_INFO *info, DT_KEY_INFO *keys __attribute__((unused)), xtBool skip_existing) +{ +#ifdef DRIZZLED + static const char *ext = ".dfe"; + static const int ext_len = 4; +#else + static const char *ext = ".frm"; + static const int ext_len = 4; +#endif + int err = 1; + //HA_CREATE_INFO create_info = {0}; + //Alter_info alter_info; + char field_length_buffer[12], *field_length_ptr; + LEX *save_lex= thd->lex, mylex; + + memset(&mylex.create_info, 0, sizeof(HA_CREATE_INFO)); + + thd->lex = &mylex; + lex_start(thd); + + /* setup the create info */ + mylex.create_info.db_type = hton; +#ifndef DRIZZLED + mylex.create_info.frm_only = 1; +#endif + mylex.create_info.default_table_charset = system_charset_info; + + /* setup the column info. 
*/ + while (info->field_name) { + LEX_STRING field_name, comment; + field_name.str = (char*)(info->field_name); + field_name.length = strlen(info->field_name); + + comment.str = (char*)(info->comment); + comment.length = strlen(info->comment); + + if (info->field_length) { + sprintf(field_length_buffer, "%d", info->field_length); + field_length_ptr = field_length_buffer; + } else + field_length_ptr = NULL; + +#ifdef DRIZZLED + if (add_field_to_list(thd, &field_name, info->field_type, field_length_ptr, info->field_decimal_length, + info->field_flags, + COLUMN_FORMAT_TYPE_FIXED, + NULL /*default_value*/, NULL /*on_update_value*/, &comment, NULL /*change*/, + NULL /*interval_list*/, info->field_charset, + NULL /*vcol_info*/)) +#else + if (add_field_to_list(thd, &field_name, info->field_type, field_length_ptr, info->field_decimal_length, + info->field_flags, +#if MYSQL_VERSION_ID > 60005 + HA_SM_DISK, + COLUMN_FORMAT_TYPE_FIXED, +#endif + NULL /*default_value*/, NULL /*on_update_value*/, &comment, NULL /*change*/, + NULL /*interval_list*/, info->field_charset, 0 /*uint_geom_type*/)) +#endif + goto error; + + + info++; + } + + if (skip_existing) { + size_t db_len = strlen(db); + size_t name_len = strlen(name); + size_t len = db_len + 1 + name_len + ext_len + 1; + char *path = (char *)xt_malloc_ns(len); + memcpy(path, db, db_len); + memcpy(path + db_len + 1, name, name_len); + memcpy(path + db_len + 1 + name_len, ext, ext_len); + path[db_len] = XT_DIR_CHAR; + path[len - 1] = '\0'; + xtBool exists = xt_fs_exists(path); + xt_free_ns(path); + if (exists) + goto noerror; + } + + /* Create an internal temp table */ +#ifdef DRIZZLED + if (mysql_create_table_no_lock(thd, db, name, &mylex.create_info, &mylex.alter_info, 1, 0, false)) + goto error; +#else + if (mysql_create_table_no_lock(thd, db, name, &mylex.create_info, &mylex.alter_info, 1, 0)) + goto error; +#endif + + noerror: + err = 0; + + error: + lex_end(&mylex); + thd->lex = save_lex; + return err; +} + diff --git 
a/storage/pbxt/src/discover_xt.h b/storage/pbxt/src/discover_xt.h new file mode 100644 index 00000000000..733974ad59f --- /dev/null +++ b/storage/pbxt/src/discover_xt.h @@ -0,0 +1,79 @@ +/* Copyright (c) 2008 PrimeBase Technologies GmbH, Germany + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Created by Leslie on 8/27/08. + * + */ + +#ifndef __DISCOVER_XT_H__ +#define __DISCOVER_XT_H__ + +#ifdef DRIZZLED +#include <drizzled/common.h> +#else +#include "mysql_priv.h" +#endif + +/* + * --------------------------------------------------------------- + * TABLE DISCOVERY HANDLER + */ + +typedef struct dt_field_info { + /** + This is used as column name. + */ + const char* field_name; + /** + For string-type columns, this is the maximum number of + characters. For numeric data this can be NULL. + */ + uint field_length; + + /** + For decimal columns, this is the maximum number of + digits after the decimal. For other data this can be NULL. + */ + char* field_decimal_length; + /** + This denotes data type for the column. For the most part, there seems to + be one entry in the enum for each SQL data type, although there seem to + be a number of additional entries in the enum. 
+ */ + enum enum_field_types field_type; + + /** + This is the character set for non-numeric data types including blob data. + */ + CHARSET_INFO *field_charset; + + uint field_flags; // Field attributes (maybe_null, signed, unsigned etc.) + const char* comment; +} DT_FIELD_INFO; + +typedef struct dt_key_info +{ + const char* key_name; + uint key_type; /* PRI_KEY_FLAG, UNIQUE_KEY_FLAG, MULTIPLE_KEY_FLAG */ + const char* key_columns[8]; // The size of this can be set to whatever you need. +} DT_KEY_INFO; + +int xt_create_table_frm(handlerton *hton, THD* thd, const char *db, const char *name, DT_FIELD_INFO *info, DT_KEY_INFO *keys, xtBool skip_existing); + +#endif + diff --git a/storage/pbxt/src/filesys_xt.cc b/storage/pbxt/src/filesys_xt.cc new file mode 100644 index 00000000000..5ca36cd9244 --- /dev/null +++ b/storage/pbxt/src/filesys_xt.cc @@ -0,0 +1,1697 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-12 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#ifndef XT_WIN +#include <unistd.h> +#include <dirent.h> +#include <sys/mman.h> +#endif +#include <stdio.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/types.h> +#include <ctype.h> +#include <string.h> +#include <errno.h> + +#include "strutil_xt.h" +#include "pthread_xt.h" +#include "thread_xt.h" +#include "filesys_xt.h" +#include "memory_xt.h" +#include "cache_xt.h" +#include "sortedlist_xt.h" +#include "trace_xt.h" + +#ifdef DEBUG +//#define DEBUG_PRINT_IO +//#define DEBUG_TRACE_IO +//#define DEBUG_TRACE_MAP_IO +//#define DEBUG_TRACE_FILES +#endif + +#ifdef DEBUG_TRACE_FILES +//#define PRINTF xt_ftracef +#define PRINTF xt_trace +#endif + +/* ---------------------------------------------------------------------- + * Globals + */ + +typedef struct FsGlobals { + xt_mutex_type *fsg_lock; /* The xtPublic cache lock. 
*/ + u_int fsg_current_id; + XTSortedListPtr fsg_open_files; +} FsGlobalsRec; + +static FsGlobalsRec fs_globals; + +#ifdef XT_WIN +static int fs_get_win_error() +{ + return (int) GetLastError(); +} + +xtPublic void xt_get_win_message(char *buffer, size_t size, int err) +{ + FormatMessage(FORMAT_MESSAGE_FROM_SYSTEM, NULL, err, + MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), + buffer, + size, NULL); +} +#endif + +/* ---------------------------------------------------------------------- + * Open file list + */ + +static XTFilePtr fs_new_file(XTThreadPtr self, char *file) +{ + XTFilePtr file_ptr; + + pushsr_(file_ptr, xt_free, (XTFilePtr) xt_calloc(self, sizeof(XTFileRec))); + + file_ptr->fil_path = xt_dup_string(self, file); + file_ptr->fil_id = fs_globals.fsg_current_id++; +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: allocated file: (%d) %s\n", self->t_name, (int) file_ptr->fil_id, xt_last_2_names_of_path(file_ptr->fil_path)); +#endif + if (!fs_globals.fsg_current_id) + fs_globals.fsg_current_id++; + file_ptr->fil_filedes = XT_NULL_FD; + file_ptr->fil_handle_count = 0; + + popr_(); // Discard xt_free(file_ptr) + return file_ptr; +} + +static void fs_close_fmap(XTThreadPtr self, XTFileMemMapPtr mm) +{ +#ifdef XT_WIN + if (mm->mm_start) { + FlushViewOfFile(mm->mm_start, 0); + UnmapViewOfFile(mm->mm_start); + mm->mm_start = NULL; + } + if (mm->mm_mapdes != NULL) { + CloseHandle(mm->mm_mapdes); + mm->mm_mapdes = NULL; + } +#else + if (mm->mm_start) { + msync( (char *)mm->mm_start, (size_t) mm->mm_length, MS_SYNC); + munmap((caddr_t) mm->mm_start, (size_t) mm->mm_length); + mm->mm_start = NULL; + } +#endif + xt_rwmutex_free(self, &mm->mm_lock); + xt_free(self, mm); +} + +static void fs_free_file(XTThreadPtr self, void *thunk __attribute__((unused)), void *item) +{ + XTFilePtr file_ptr = *((XTFilePtr *) item); + + if (file_ptr->fil_filedes != XT_NULL_FD) { +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: close file: (%d) %s\n", self->t_name, (int) file_ptr->fil_id, 
xt_last_2_names_of_path(file_ptr->fil_path)); +#endif +#ifdef XT_WIN + CloseHandle(file_ptr->fil_filedes); +#else + close(file_ptr->fil_filedes); +#endif + //PRINTF("close (FILE) %d %s\n", file_ptr->fil_filedes, file_ptr->fil_path); + file_ptr->fil_filedes = XT_NULL_FD; + } + + if (file_ptr->fil_memmap) { + fs_close_fmap(self, file_ptr->fil_memmap); + file_ptr->fil_memmap = NULL; + } + +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: free file: (%d) %s\n", self->t_name, (int) file_ptr->fil_id, + file_ptr->fil_path ? xt_last_2_names_of_path(file_ptr->fil_path) : "?"); +#endif + + if (!file_ptr->fil_ref_count) { + /* Flush any cache before this file is invalid: */ + if (file_ptr->fil_path) { + xt_free(self, file_ptr->fil_path); + file_ptr->fil_path = NULL; + } + + xt_free(self, file_ptr); + } +} + +static int fs_comp_file(XTThreadPtr self __attribute__((unused)), register const void *thunk __attribute__((unused)), register const void *a, register const void *b) +{ + char *file_name = (char *) a; + XTFilePtr file_ptr = *((XTFilePtr *) b); + + return strcmp(file_name, file_ptr->fil_path); +} + +static int fs_comp_file_ci(XTThreadPtr self __attribute__((unused)), register const void *thunk __attribute__((unused)), register const void *a, register const void *b) +{ + char *file_name = (char *) a; + XTFilePtr file_ptr = *((XTFilePtr *) b); + + return strcasecmp(file_name, file_ptr->fil_path); +} + +/* ---------------------------------------------------------------------- + * init & exit + */ + +xtPublic void xt_fs_init(XTThreadPtr self) +{ + fs_globals.fsg_open_files = xt_new_sortedlist(self, + sizeof(XTFilePtr), 20, 20, + pbxt_ignore_case ? 
fs_comp_file_ci : fs_comp_file, + NULL, fs_free_file, TRUE, FALSE); + fs_globals.fsg_lock = fs_globals.fsg_open_files->sl_lock; + fs_globals.fsg_current_id = 1; +} + +xtPublic void xt_fs_exit(XTThreadPtr self) +{ + if (fs_globals.fsg_open_files) { + xt_free_sortedlist(self, fs_globals.fsg_open_files); + fs_globals.fsg_open_files = NULL; + } + fs_globals.fsg_lock = NULL; + fs_globals.fsg_current_id = 0; +} + +/* ---------------------------------------------------------------------- + * File operations + */ + +static void fs_set_stats(XTThreadPtr self, char *path) +{ + char super_path[PATH_MAX]; + struct stat stats; + char *ptr; + + ptr = xt_last_name_of_path(path); + if (ptr == path) + strcpy(super_path, "."); + else { + xt_strcpy(PATH_MAX, super_path, path); + + if ((ptr = xt_last_name_of_path(super_path))) + *ptr = 0; + } + if (stat(super_path, &stats) == -1) + xt_throw_ferrno(XT_CONTEXT, errno, super_path); + + if (chmod(path, stats.st_mode) == -1) + xt_throw_ferrno(XT_CONTEXT, errno, path); + + /*chown(path, stats.st_uid, stats.st_gid);*/ +} + +xtPublic char *xt_file_path(struct XTFileRef *of) +{ + return of->fr_file->fil_path; +} + +xtBool xt_fs_exists(char *path) +{ + int err; + + err = access(path, F_OK); + if (err == -1) + return FALSE; + return TRUE; +} + +/* + * No error is generated if the file does not exist. 
+ */ +xtPublic xtBool xt_fs_delete(XTThreadPtr self, char *name) +{ +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: DELETE FILE: %s\n", xt_get_self()->t_name, xt_last_2_names_of_path(name)); +#endif +#ifdef XT_WIN + //PRINTF("delete %s\n", name); + if (!DeleteFile(name)) { + int err = fs_get_win_error(); + + if (!XT_FILE_NOT_FOUND(err)) { + xt_throw_ferrno(XT_CONTEXT, err, name); + return FAILED; + } + } +#else + if (unlink(name) == -1) { + int err = errno; + + if (err != ENOENT) { + xt_throw_ferrno(XT_CONTEXT, err, name); + return FAILED; + } + } +#endif + return OK; +} + +xtPublic xtBool xt_fs_file_not_found(int err) +{ +#ifdef XT_WIN + return XT_FILE_NOT_FOUND(err); +#else + return err == ENOENT; +#endif +} + +xtPublic void xt_fs_move(struct XTThread *self, char *from_path, char *to_path) +{ +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: MOVE FILE: %s --> %s\n", xt_get_self()->t_name, xt_last_2_names_of_path(from_path), xt_last_2_names_of_path(to_path)); +#endif +#ifdef XT_WIN + if (!MoveFile(from_path, to_path)) + xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), from_path); +#else + int err; + + if (link(from_path, to_path) == -1) { + err = errno; + xt_throw_ferrno(XT_CONTEXT, err, from_path); + } + + if (unlink(from_path) == -1) { + err = errno; + unlink(to_path); + xt_throw_ferrno(XT_CONTEXT, err, from_path); + } +#endif +} + +xtPublic xtBool xt_fs_rename(struct XTThread *self, char *from_path, char *to_path) +{ + int err; + +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: RENAME FILE: %s --> %s\n", xt_get_self()->t_name, xt_last_2_names_of_path(from_path), xt_last_2_names_of_path(to_path)); +#endif + if (rename(from_path, to_path) == -1) { + err = errno; + xt_throw_ferrno(XT_CONTEXT, err, from_path); + return FAILED; + } + return OK; +} + +xtPublic xtBool xt_fs_stat(XTThreadPtr self, char *path, off_t *size, struct timespec *mod_time) +{ +#ifdef XT_WIN + HANDLE fh; + BY_HANDLE_FILE_INFORMATION info; + SECURITY_ATTRIBUTES sa = { sizeof(SECURITY_ATTRIBUTES), 0, 0 }; + + fh = 
CreateFile( + path, + GENERIC_READ, + FILE_SHARE_READ, + &sa, + OPEN_EXISTING, + FILE_ATTRIBUTE_NORMAL, + NULL); + if (fh == INVALID_HANDLE_VALUE) { + xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), path); + return FAILED; + } + + if (!GetFileInformationByHandle(fh, &info)) { + CloseHandle(fh); + xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), path); + return FAILED; + } + + CloseHandle(fh); + if (size) + *size = (off_t) info.nFileSizeLow | (((off_t) info.nFileSizeHigh) << 32); + if (mod_time) + mod_time->tv.ft = info.ftLastWriteTime; +#else + struct stat sb; + + if (stat(path, &sb) == -1) { + xt_throw_ferrno(XT_CONTEXT, errno, path); + return FAILED; + } + if (size) + *size = sb.st_size; + if (mod_time) { + mod_time->tv_sec = sb.st_mtime; +#ifdef XT_MAC + /* This is the Mac OS X version: */ + mod_time->tv_nsec = sb.st_mtimespec.tv_nsec; +#else +#ifdef __USE_MISC + /* This is the Linux version: */ + mod_time->tv_nsec = sb.st_mtim.tv_nsec; +#else + /* Not supported? */ + mod_time->tv_nsec = 0; +#endif +#endif + } +#endif + return OK; +} + +void xt_fs_mkdir(XTThreadPtr self, char *name) +{ + char path[PATH_MAX]; + + xt_strcpy(PATH_MAX, path, name); + xt_remove_dir_char(path); + +#ifdef XT_WIN + { + SECURITY_ATTRIBUTES sa = { sizeof(SECURITY_ATTRIBUTES), 0, 0 }; + + if (!CreateDirectory(path, &sa)) + xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), path); + } +#else + if (mkdir(path, S_IRWXU | S_IRWXG | S_IRWXO) == -1) + xt_throw_ferrno(XT_CONTEXT, errno, path); + + try_(a) { + fs_set_stats(self, path); + } + catch_(a) { + xt_fs_rmdir(NULL, name); + throw_(); + } + cont_(a); +#endif +} + +void xt_fs_mkpath(XTThreadPtr self, char *path) +{ + char *ptr; + + if (xt_fs_exists(path)) + return; + + if (!(ptr = (char *) xt_last_directory_of_path((c_char *) path))) + return; + if (ptr == path) + return; + ptr--; + if (XT_IS_DIR_CHAR(*ptr)) { + *ptr = 0; + xt_fs_mkpath(self, path); + *ptr = XT_DIR_CHAR; + xt_fs_mkdir(self, path); + } +} + +xtBool xt_fs_rmdir(XTThreadPtr self, 
char *name) +{ + char path[PATH_MAX]; + + xt_strcpy(PATH_MAX, path, name); + xt_remove_dir_char(path); + +#ifdef XT_WIN + if (!RemoveDirectory(path)) { + int err = fs_get_win_error(); + + if (!XT_FILE_NOT_FOUND(err)) { + xt_throw_ferrno(XT_CONTEXT, err, path); + return FAILED; + } + } +#else + if (rmdir(path) == -1) { + int err = errno; + + if (err != ENOENT) { + xt_throw_ferrno(XT_CONTEXT, err, path); + return FAILED; + } + } +#endif + return OK; +} + +/* ---------------------------------------------------------------------- + * Open & Close operations + */ + +xtPublic XTFilePtr xt_fs_get_file(XTThreadPtr self, char *file_name) +{ + XTFilePtr file_ptr, *file_pptr; + + xt_sl_lock(self, fs_globals.fsg_open_files); + pushr_(xt_sl_unlock, fs_globals.fsg_open_files); + + if ((file_pptr = (XTFilePtr *) xt_sl_find(self, fs_globals.fsg_open_files, file_name))) + file_ptr = *file_pptr; + else { + file_ptr = fs_new_file(self, file_name); + xt_sl_insert(self, fs_globals.fsg_open_files, file_name, &file_ptr); + } + file_ptr->fil_ref_count++; + freer_(); // xt_sl_unlock(fs_globals.fsg_open_files) + return file_ptr; +} + +xtPublic void xt_fs_release_file(XTThreadPtr self, XTFilePtr file_ptr) +{ + xt_sl_lock(self, fs_globals.fsg_open_files); + pushr_(xt_sl_unlock, fs_globals.fsg_open_files); + + file_ptr->fil_ref_count--; + if (!file_ptr->fil_ref_count) { + xt_sl_delete(self, fs_globals.fsg_open_files, file_ptr->fil_path); + } + + freer_(); // xt_sl_unlock(fs_globals.fsg_open_files) +} + +static xtBool fs_open_file(XTThreadPtr self, XT_FD *fd, XTFilePtr file, int mode) +{ + int retried = FALSE; + +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: OPEN FILE: (%d) %s\n", self->t_name, (int) file->fil_id, xt_last_2_names_of_path(file->fil_path)); +#endif + retry: +#ifdef XT_WIN + SECURITY_ATTRIBUTES sa = { sizeof(SECURITY_ATTRIBUTES), 0, 0 }; + DWORD flags; + + if (mode & XT_FS_EXCLUSIVE) + flags = CREATE_NEW; + else if (mode & XT_FS_CREATE) + flags = OPEN_ALWAYS; + else + flags = 
OPEN_EXISTING; + + *fd = CreateFile( + file->fil_path, + mode & XT_FS_READONLY ? GENERIC_READ : (GENERIC_READ | GENERIC_WRITE), + FILE_SHARE_READ | FILE_SHARE_WRITE, + &sa, + flags, + FILE_FLAG_RANDOM_ACCESS, + NULL); + if (*fd == INVALID_HANDLE_VALUE) { + int err = fs_get_win_error(); + + if (!(mode & XT_FS_MISSING_OK) || !XT_FILE_NOT_FOUND(err)) { + if (!retried && (mode & XT_FS_MAKE_PATH) && XT_FILE_NOT_FOUND(err)) { + char path[PATH_MAX]; + + xt_strcpy(PATH_MAX, path, file->fil_path); + xt_remove_last_name_of_path(path); + xt_fs_mkpath(self, path); + retried = TRUE; + goto retry; + } + + xt_throw_ferrno(XT_CONTEXT, err, file->fil_path); + } + + /* File is missing, but don't throw an error. */ + return FAILED; + } + //PRINTF("open %d %s\n", *fd, file->fil_path); + return OK; +#else + int flags = 0; + + if (mode & XT_FS_READONLY) + flags = O_RDONLY; + else + flags = O_RDWR; + if (mode & XT_FS_CREATE) + flags |= O_CREAT; + if (mode & XT_FS_EXCLUSIVE) + flags |= O_EXCL; +#ifdef O_DIRECT + if (mode & XT_FS_DIRECT_IO) + flags |= O_DIRECT; +#endif + + *fd = open(file->fil_path, flags, XT_MASK); + if (*fd == -1) { + int err = errno; + + if (!(mode & XT_FS_MISSING_OK) || err != ENOENT) { + if (!retried && (mode & XT_FS_MAKE_PATH) && err == ENOENT) { + char path[PATH_MAX]; + + xt_strcpy(PATH_MAX, path, file->fil_path); + xt_remove_last_name_of_path(path); + xt_fs_mkpath(self, path); + retried = TRUE; + goto retry; + } + + xt_throw_ferrno(XT_CONTEXT, err, file->fil_path); + } + + /* File is missing, but don't throw an error. 
*/ + return FAILED; + } + ///PRINTF("open %d %s\n", *fd, file->fil_path); + return OK; +#endif +} + +xtPublic XTOpenFilePtr xt_open_file(XTThreadPtr self, char *file, int mode) +{ + XTOpenFilePtr of; + + pushsr_(of, xt_close_file, (XTOpenFilePtr) xt_calloc(self, sizeof(XTOpenFileRec))); + of->fr_file = xt_fs_get_file(self, file); + of->fr_id = of->fr_file->fil_id; + of->of_filedes = XT_NULL_FD; + +#ifdef XT_WIN + if (!fs_open_file(self, &of->of_filedes, of->fr_file, mode)) { + xt_close_file(self, of); + of = NULL; + } +#else + xtBool failed = FALSE; + + if (of->fr_file->fil_filedes == -1) { + xt_sl_lock(self, fs_globals.fsg_open_files); + pushr_(xt_sl_unlock, fs_globals.fsg_open_files); + if (of->fr_file->fil_filedes == -1) { + if (!fs_open_file(self, &of->fr_file->fil_filedes, of->fr_file, mode)) + failed = TRUE; + } + freer_(); // xt_sl_unlock(fs_globals.fsg_open_files) + } + + if (failed) { + /* Close, but after we have released the fsg_open_files lock! */ + xt_close_file(self, of); + of = NULL; + } + else + of->of_filedes = of->fr_file->fil_filedes; +#endif + + popr_(); // Discard xt_close_file(of) + return of; +} + +xtPublic XTOpenFilePtr xt_open_file_ns(char *file, int mode) +{ + XTThreadPtr self = xt_get_self(); + XTOpenFilePtr of; + + try_(a) { + of = xt_open_file(self, file, mode); + } + catch_(a) { + of = NULL; + } + cont_(a); + return of; +} + +xtPublic xtBool xt_open_file_ns(XTOpenFilePtr *fh, char *file, int mode) +{ + XTThreadPtr self = xt_get_self(); + xtBool ok = TRUE; + + try_(a) { + *fh = xt_open_file(self, file, mode); + } + catch_(a) { + ok = FALSE; + } + cont_(a); + return ok; +} + +xtPublic void xt_close_file(XTThreadPtr self, XTOpenFilePtr of) +{ + if (of->of_filedes != XT_NULL_FD) { +#ifdef XT_WIN + CloseHandle(of->of_filedes); +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: close file: (%d) %s\n", self->t_name, (int) of->fr_file->fil_id, xt_last_2_names_of_path(of->fr_file->fil_path)); +#endif +#else + if (!of->fr_file || of->of_filedes != 
of->fr_file->fil_filedes) { + close(of->of_filedes); +#ifdef DEBUG_TRACE_FILES + PRINTF("%s: close file: (%d) %s\n", self->t_name, (int) of->fr_file->fil_id, xt_last_2_names_of_path(of->fr_file->fil_path)); +#endif + } +#endif + + of->of_filedes = XT_NULL_FD; + } + + if (of->fr_file) { + xt_fs_release_file(self, of->fr_file); + of->fr_file = NULL; + } + xt_free(self, of); +} + +xtPublic xtBool xt_close_file_ns(XTOpenFilePtr of) +{ + XTThreadPtr self = xt_get_self(); + xtBool failed = FALSE; + + try_(a) { + xt_close_file(self, of); + } + catch_(a) { + failed = TRUE; + } + cont_(a); + return failed; +} + +/* ---------------------------------------------------------------------- + * I/O operations + */ + +xtPublic xtBool xt_lock_file(struct XTThread *self, XTOpenFilePtr of) +{ +#ifdef XT_WIN + if (!LockFile(of->of_filedes, 0, 0, 512, 0)) { + int err = fs_get_win_error(); + + if (err == ERROR_LOCK_VIOLATION || + err == ERROR_LOCK_FAILED) + return FAILED; + + xt_throw_ferrno(XT_CONTEXT, err, xt_file_path(of)); + return FAILED; + } + return OK; +#else + if (lockf(of->of_filedes, F_TLOCK, 0) == 0) + return OK; + if (errno == EAGAIN) + return FAILED; + xt_throw_ferrno(XT_CONTEXT, errno, xt_file_path(of)); + return FAILED; +#endif +} + +xtPublic void xt_unlock_file(struct XTThread *self, XTOpenFilePtr of) +{ +#ifdef XT_WIN + if (!UnlockFile(of->of_filedes, 0, 0, 512, 0)) { + int err = fs_get_win_error(); + + if (err != ERROR_NOT_LOCKED) + xt_throw_ferrno(XT_CONTEXT, err, xt_file_path(of)); + } +#else + if (lockf(of->of_filedes, F_ULOCK, 0) == -1) + xt_throw_ferrno(XT_CONTEXT, errno, xt_file_path(of)); +#endif +} + +static off_t fs_seek_eof(XTThreadPtr self, XT_FD fd, XTFilePtr file) +{ +#ifdef XT_WIN + DWORD result; + LARGE_INTEGER lpFileSize; + + result = SetFilePointer(fd, 0, NULL, FILE_END); + if (result == 0xFFFFFFFF) { + xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), file->fil_path); + return (off_t) -1; + } + + if (!GetFileSizeEx(fd, &lpFileSize)) { + 
xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), file->fil_path); + return (off_t) -1; + } + + return lpFileSize.QuadPart; +#else + off_t off; + + off = lseek(fd, 0, SEEK_END); + if (off == -1) { + xt_throw_ferrno(XT_CONTEXT, errno, file->fil_path); + return -1; + } + + return off; +#endif +} + +xtPublic off_t xt_seek_eof_file(XTThreadPtr self, XTOpenFilePtr of) +{ + return fs_seek_eof(self, of->of_filedes, of->fr_file); +} + +xtPublic xtBool xt_set_eof_file(XTThreadPtr self, XTOpenFilePtr of, off_t offset) +{ +#ifdef XT_WIN + LARGE_INTEGER liDistanceToMove; + + liDistanceToMove.QuadPart = offset; + if (!SetFilePointerEx(of->of_filedes, liDistanceToMove, NULL, FILE_BEGIN)) { + xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), xt_file_path(of)); + return FAILED; + } + + if (!SetEndOfFile(of->of_filedes)) { + xt_throw_ferrno(XT_CONTEXT, fs_get_win_error(), xt_file_path(of)); + return FAILED; + } +#else + if (ftruncate(of->of_filedes, offset) == -1) { + xt_throw_ferrno(XT_CONTEXT, errno, xt_file_path(of)); + return FAILED; + } +#endif + return OK; +} + +xtPublic xtBool xt_pwrite_file(XTOpenFilePtr of, off_t offset, size_t size, void *data, XTIOStatsPtr stat, XTThreadPtr XT_UNUSED(thread)) +{ +#ifdef DEBUG_PRINT_IO + PRINTF("PBXT WRITE %s offs=%d size=%d\n", of->fr_file->fil_path, (int) offset, (int) size); +#endif +#ifdef DEBUG_TRACE_IO + char timef[50]; + xtWord8 start = xt_trace_clock(); +#endif +#ifdef XT_WIN + LARGE_INTEGER liDistanceToMove; + DWORD result; + + liDistanceToMove.QuadPart = offset; + if (!SetFilePointerEx(of->of_filedes, liDistanceToMove, NULL, FILE_BEGIN)) + return xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(of)); + + if (!WriteFile(of->of_filedes, data, size, &result, NULL)) + return xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(of)); + + if (result != size) + return xt_register_ferrno(XT_REG_CONTEXT, ERROR_HANDLE_EOF, xt_file_path(of)); +#else + ssize_t write_size; + + write_size = pwrite(of->of_filedes, 
data, size, offset); + if (write_size == -1) + return xt_register_ferrno(XT_REG_CONTEXT, errno, xt_file_path(of)); + + if ((size_t) write_size != size) + return xt_register_ferrno(XT_REG_CONTEXT, ESPIPE, xt_file_path(of)); + +#endif + stat->ts_write += (u_int) size; + +#ifdef DEBUG_TRACE_IO + xt_trace("/* %s */ pbxt_file_writ(\"%s\", %lu, %lu);\n", xt_trace_clock_diff(timef, start), of->fr_file->fil_path, (u_long) offset, (u_long) size); +#endif + return OK; +} + +xtPublic xtBool xt_flush_file(XTOpenFilePtr of, XTIOStatsPtr stat, XTThreadPtr XT_UNUSED(thread)) +{ + xtWord8 s; + +#ifdef DEBUG_PRINT_IO + PRINTF("PBXT FLUSH %s\n", of->fr_file->fil_path); +#endif +#ifdef DEBUG_TRACE_IO + char timef[50]; + xtWord8 start = xt_trace_clock(); +#endif + stat->ts_flush_start = xt_trace_clock(); +#ifdef XT_WIN + if (!FlushFileBuffers(of->of_filedes)) { + xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(of)); + goto failed; + } +#else + if (fsync(of->of_filedes) == -1) { + xt_register_ferrno(XT_REG_CONTEXT, errno, xt_file_path(of)); + goto failed; + } +#endif +#ifdef DEBUG_TRACE_IO + xt_trace("/* %s */ pbxt_file_sync(\"%s\");\n", xt_trace_clock_diff(timef, start), of->fr_file->fil_path); +#endif + s = stat->ts_flush_start; + stat->ts_flush_start = 0; + stat->ts_flush_time += xt_trace_clock() - s; + stat->ts_flush++; + return OK; + + failed: + s = stat->ts_flush_start; + stat->ts_flush_start = 0; + stat->ts_flush_time += xt_trace_clock() - s; + return FAILED; +} + +xtBool xt_pread_file(XTOpenFilePtr of, off_t offset, size_t size, size_t min_size, void *data, size_t *red_size, XTIOStatsPtr stat, XTThreadPtr XT_UNUSED(thread)) +{ +#ifdef DEBUG_PRINT_IO + PRINTF("PBXT READ %s offset=%d size=%d\n", of->fr_file->fil_path, (int) offset, (int) size); +#endif +#ifdef DEBUG_TRACE_IO + char timef[50]; + xtWord8 start = xt_trace_clock(); +#endif +#ifdef XT_WIN + LARGE_INTEGER liDistanceToMove; + DWORD result; + + liDistanceToMove.QuadPart = offset; + if 
(!SetFilePointerEx(of->of_filedes, liDistanceToMove, NULL, FILE_BEGIN)) + return xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(of)); + + if (!ReadFile(of->of_filedes, data, size, &result, NULL)) + return xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(of)); + + if ((size_t) result < min_size) + return xt_register_ferrno(XT_REG_CONTEXT, ERROR_HANDLE_EOF, xt_file_path(of)); + + if (red_size) + *red_size = (size_t) result; + stat->ts_read += (u_int) result; +#else + ssize_t read_size; + + read_size = pread(of->of_filedes, data, size, offset); + if (read_size == -1) + return xt_register_ferrno(XT_REG_CONTEXT, errno, xt_file_path(of)); + + /* Throw an error if read less than the minimum: */ + if ((size_t) read_size < min_size) { +//PRINTF("PMC PBXT <-- offset:%llu, count:%lu \n", (u_llong) offset, (u_long) size); + return xt_register_ferrno(XT_REG_CONTEXT, ESPIPE, xt_file_path(of)); + } + + if (red_size) + *red_size = (size_t) read_size; + stat->ts_read += (u_int) read_size; +#endif +#ifdef DEBUG_TRACE_IO + xt_trace("/* %s */ pbxt_file_read(\"%s\", %lu, %lu);\n", xt_trace_clock_diff(timef, start), of->fr_file->fil_path, (u_long) offset, (u_long) size); +#endif + return OK; +} + +/* ---------------------------------------------------------------------- + * Directory operations + */ + +/* + * The filter may contain one '*' as wildcard. + */ +XTOpenDirPtr xt_dir_open(XTThreadPtr self, c_char *path, c_char *filter) +{ + XTOpenDirPtr od; + + pushsr_(od, xt_dir_close, (XTOpenDirPtr) xt_calloc(self, sizeof(XTOpenDirRec))); + +#ifdef XT_WIN + size_t len; + + od->od_handle = XT_NULL_FD; + + // path = path\(filter | *) + len = strlen(path) + 1 + (filter ? 
strlen(filter) : 1) + 1; + od->od_path = (char *) xt_malloc(self, len); + + strcpy(od->od_path, path); + xt_add_dir_char(len, od->od_path); + if (filter) + strcat(od->od_path, filter); + else + strcat(od->od_path, "*"); +#else + od->od_path = xt_dup_string(self, path); + + if (filter) + od->od_filter = xt_dup_string(self, filter); + + od->od_dir = opendir(path); + if (!od->od_dir) + xt_throw_ferrno(XT_CONTEXT, errno, path); +#endif + + popr_(); // Discard xt_dir_close(od) + return od; +} + +void xt_dir_close(XTThreadPtr self, XTOpenDirPtr od) +{ + if (od) { +#ifdef XT_WIN + if (od->od_handle != XT_NULL_FD) { + FindClose(od->od_handle); + od->od_handle = XT_NULL_FD; + } +#else + if (od->od_dir) { + closedir(od->od_dir); + od->od_dir = NULL; + } + if (od->od_filter) { + xt_free(self, od->od_filter); + od->od_filter = NULL; + } +#endif + if (od->od_path) { + xt_free(self, od->od_path); + od->od_path = NULL; + } + xt_free(self, od); + } +} + +#ifdef XT_WIN +xtBool xt_dir_next(XTThreadPtr self, XTOpenDirPtr od) +{ + int err = 0; + + if (od->od_handle == INVALID_HANDLE_VALUE) { + od->od_handle = FindFirstFile(od->od_path, &od->od_data); + if (od->od_handle == INVALID_HANDLE_VALUE) + err = fs_get_win_error(); + } + else { + if (!FindNextFile(od->od_handle, &od->od_data)) + err = fs_get_win_error(); + } + + if (err) { + if (err != ERROR_NO_MORE_FILES) { + if (err == ERROR_FILE_NOT_FOUND) { + char path[PATH_MAX]; + + xt_strcpy(PATH_MAX, path, od->od_path); + xt_remove_last_name_of_path(path); + if (!xt_fs_exists(path)) + xt_throw_ferrno(XT_CONTEXT, err, path); + } + else + xt_throw_ferrno(XT_CONTEXT, err, od->od_path); + } + return FAILED; + } + + return OK; +} +#else +static xtBool fs_match_filter(c_char *name, c_char *filter) +{ + while (*name && *filter) { + if (*filter == '*') { + if (filter[1] == *name) + filter++; + else + name++; + } + else { + if (*name != *filter) + return FALSE; + name++; + filter++; + } + } + if (!*name) { + if (!*filter || (*filter == '*' && 
!filter[1])) + return TRUE; + } + return FALSE; +} + +xtBool xt_dir_next(XTThreadPtr self, XTOpenDirPtr od) +{ + int err; + struct dirent *result; + + for (;;) { + err = readdir_r(od->od_dir, &od->od_entry, &result); + if (err) { + xt_throw_ferrno(XT_CONTEXT, err, od->od_path); + return FAILED; + } + if (!result) + break; + /* Filter out '.' and '..': */ + if (od->od_entry.d_name[0] == '.') { + if (od->od_entry.d_name[1] == '.') { + if (od->od_entry.d_name[2] == '\0') + continue; + } + else { + if (od->od_entry.d_name[1] == '\0') + continue; + } + } + if (!od->od_filter) + break; + if (fs_match_filter(od->od_entry.d_name, od->od_filter)) + break; + } + return result ? TRUE : FALSE; +} +#endif + +char *xt_dir_name(XTThreadPtr self __attribute__((unused)), XTOpenDirPtr od) +{ +#ifdef XT_WIN + return od->od_data.cFileName; +#else + return od->od_entry.d_name; +#endif +} + +xtBool xt_dir_is_file(XTThreadPtr self __attribute__((unused)), XTOpenDirPtr od) +{ +#ifdef XT_WIN + if (od->od_data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) + return FALSE; +#elif defined(XT_SOLARIS) + char path[PATH_MAX]; + struct stat sb; + + xt_strcpy(PATH_MAX, path, od->od_path); + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, od->od_entry.d_name); + + if (stat(path, &sb) == -1) { + xt_throw_ferrno(XT_CONTEXT, errno, path); + return FAILED; + } + + if ( sb.st_mode & S_IFDIR ) + return FALSE; +#else + if (od->od_entry.d_type & DT_DIR) + return FALSE; +#endif + return TRUE; +} + +off_t xt_dir_file_size(XTThreadPtr self, XTOpenDirPtr od) +{ +#ifdef XT_WIN + return (off_t) od->od_data.nFileSizeLow | (((off_t) od->od_data.nFileSizeHigh) << 32); +#else + char path[PATH_MAX]; + off_t size; + + xt_strcpy(PATH_MAX, path, od->od_path); + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, od->od_entry.d_name); + if (!xt_fs_stat(self, path, &size, NULL)) + return -1; + return size; +#endif +} + +/* ---------------------------------------------------------------------- + * File 
mapping operations + */ + +static xtBool fs_map_file(XTFileMemMapPtr mm, XTFilePtr file, xtBool grow) +{ + ASSERT_NS(!mm->mm_start); +#ifdef XT_WIN + /* This will grow the file to the given size: */ + mm->mm_mapdes = CreateFileMapping(file->fil_filedes, NULL, PAGE_READWRITE, (DWORD) (mm->mm_length >> 32), (DWORD) mm->mm_length, NULL); + if (mm->mm_mapdes == NULL) { + xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), file->fil_path); + return FAILED; + } + + mm->mm_start = (xtWord1 *) MapViewOfFile(mm->mm_mapdes, FILE_MAP_WRITE, 0, 0, 0); + if (!mm->mm_start) { + CloseHandle(mm->mm_mapdes); + mm->mm_mapdes = NULL; + xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), file->fil_path); + return FAILED; + } +#else + if (grow) { + char data[2]; + + if (pwrite(file->fil_filedes, data, 1, mm->mm_length - 1) == -1) { + xt_register_ferrno(XT_REG_CONTEXT, errno, file->fil_path); + return FAILED; + } + } + + /* Remap: */ + mm->mm_start = (xtWord1 *) mmap(0, (size_t) mm->mm_length, PROT_READ | PROT_WRITE, MAP_SHARED, file->fil_filedes, 0); + if (mm->mm_start == MAP_FAILED) { + mm->mm_start = NULL; + xt_register_ferrno(XT_REG_CONTEXT, errno, file->fil_path); + return FAILED; + } +#endif + return OK; +} + +xtPublic XTMapFilePtr xt_open_fmap(XTThreadPtr self, char *file, size_t grow_size) +{ + XTMapFilePtr map; + + pushsr_(map, xt_close_fmap, (XTMapFilePtr) xt_calloc(self, sizeof(XTMapFileRec))); + map->fr_file = xt_fs_get_file(self, file); + map->fr_id = map->fr_file->fil_id; + + xt_sl_lock(self, fs_globals.fsg_open_files); + pushr_(xt_sl_unlock, fs_globals.fsg_open_files); + + if (map->fr_file->fil_filedes == XT_NULL_FD) { + if (!fs_open_file(self, &map->fr_file->fil_filedes, map->fr_file, XT_FS_DEFAULT)) { + xt_close_fmap(self, map); + map = NULL; + } + } + + map->fr_file->fil_handle_count++; + + freer_(); // xt_ht_unlock(fs_globals.fsg_open_files) + + if (!map->fr_file->fil_memmap) { + xt_sl_lock(self, fs_globals.fsg_open_files); + pushr_(xt_sl_unlock, 
fs_globals.fsg_open_files); + if (!map->fr_file->fil_memmap) { + XTFileMemMapPtr mm; + + mm = (XTFileMemMapPtr) xt_calloc(self, sizeof(XTFileMemMapRec)); + pushr_(fs_close_fmap, mm); + +#ifdef XT_WIN + /* NULL is the value returned on error! */ + mm->mm_mapdes = NULL; +#endif + xt_rwmutex_init_with_autoname(self, &mm->mm_lock); + mm->mm_length = fs_seek_eof(self, map->fr_file->fil_filedes, map->fr_file); + if (sizeof(size_t) == 4 && mm->mm_length >= (off_t) 0xFFFFFFFF) + xt_throw_ixterr(XT_CONTEXT, XT_ERR_FILE_TOO_LONG, map->fr_file->fil_path); + mm->mm_grow_size = grow_size; + + if (mm->mm_length < (off_t) grow_size) { + mm->mm_length = (off_t) grow_size; + if (!fs_map_file(mm, map->fr_file, TRUE)) + xt_throw(self); + } + else { + if (!fs_map_file(mm, map->fr_file, FALSE)) + xt_throw(self); + } + + popr_(); // Discard fs_close_fmap(mm) + map->fr_file->fil_memmap = mm; + } + freer_(); // xt_ht_unlock(fs_globals.fsg_open_files) + } + map->mf_memmap = map->fr_file->fil_memmap; + + popr_(); // Discard xt_close_fmap(map) + return map; +} + +xtPublic void xt_close_fmap(XTThreadPtr self, XTMapFilePtr map) +{ + if (map->fr_file) { + xt_fs_release_file(self, map->fr_file); + + xt_sl_lock(self, fs_globals.fsg_open_files); + pushr_(xt_sl_unlock, fs_globals.fsg_open_files); + + map->fr_file->fil_handle_count--; + if (!map->fr_file->fil_handle_count) + fs_free_file(self, NULL, &map->fr_file); + + freer_(); + + map->fr_file = NULL; + + + } + map->mf_memmap = NULL; + xt_free(self, map); +} + +xtPublic xtBool xt_close_fmap_ns(XTMapFilePtr map) +{ + XTThreadPtr self = xt_get_self(); + xtBool failed = FALSE; + + try_(a) { + xt_close_fmap(self, map); + } + catch_(a) { + failed = TRUE; + } + cont_(a); + return failed; +} + +static xtBool fs_remap_file(XTMapFilePtr map, off_t offset, size_t size, XTIOStatsPtr stat) +{ + off_t new_size = 0; + XTFileMemMapPtr mm = map->mf_memmap; + xtWord8 s; + + if (offset + (off_t) size > mm->mm_length) { + /* Expand the file: */ + new_size = 
(mm->mm_length + (off_t) mm->mm_grow_size) / (off_t) mm->mm_grow_size; + new_size *= mm->mm_grow_size; + while (new_size < offset + (off_t) size) + new_size += mm->mm_grow_size; + + if (sizeof(size_t) == 4 && new_size >= (off_t) 0xFFFFFFFF) { + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_FILE_TOO_LONG, xt_file_path(map)); + return FAILED; + } + } + else if (!mm->mm_start) + new_size = mm->mm_length; + + if (new_size) { + if (mm->mm_start) { + /* Flush & unmap: */ + stat->ts_flush_start = xt_trace_clock(); +#ifdef XT_WIN + if (!FlushViewOfFile(mm->mm_start, 0)) { + xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(map)); + goto failed; + } + + if (!UnmapViewOfFile(mm->mm_start)) { + xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(map)); + goto failed; + } +#else + if (msync( (char *)mm->mm_start, (size_t) mm->mm_length, MS_SYNC) == -1) { + xt_register_ferrno(XT_REG_CONTEXT, errno, xt_file_path(map)); + goto failed; + } + + /* Unmap: */ + if (munmap((caddr_t) mm->mm_start, (size_t) mm->mm_length) == -1) { + xt_register_ferrno(XT_REG_CONTEXT, errno, xt_file_path(map)); + goto failed; + } +#endif + s = stat->ts_flush_start; + stat->ts_flush_start = 0; + stat->ts_flush_time += xt_trace_clock() - s; + stat->ts_flush++; + } + mm->mm_start = NULL; +#ifdef XT_WIN + if (!CloseHandle(mm->mm_mapdes)) + return xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(map)); + mm->mm_mapdes = NULL; +#endif + mm->mm_length = new_size; + + if (!fs_map_file(mm, map->fr_file, TRUE)) + return FAILED; + } + return OK; + + failed: + s = stat->ts_flush_start; + stat->ts_flush_start = 0; + stat->ts_flush_time += xt_trace_clock() - s; + return FAILED; +} + +xtPublic xtBool xt_pwrite_fmap(XTMapFilePtr map, off_t offset, size_t size, void *data, XTIOStatsPtr stat, XTThreadPtr thread) +{ + XTFileMemMapPtr mm = map->mf_memmap; + xtThreadID thd_id = thread->t_id; + +#ifdef DEBUG_TRACE_MAP_IO + xt_trace("/* %s */ pbxt_fmap_writ(\"%s\", %lu, 
%lu);\n", xt_trace_clock_diff(NULL), map->fr_file->fil_path, (u_long) offset, (u_long) size); +#endif + xt_rwmutex_slock(&mm->mm_lock, thd_id); + if (!mm->mm_start || offset + (off_t) size > mm->mm_length) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + + xt_rwmutex_xlock(&mm->mm_lock, thd_id); + if (!fs_remap_file(map, offset, size, stat)) + goto failed; + } + +#ifdef XT_WIN + __try + { + memcpy(mm->mm_start + offset, data, size); + } + // GetExceptionCode()== EXCEPTION_IN_PAGE_ERROR ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH + __except(EXCEPTION_EXECUTE_HANDLER) + { + xt_register_ferrno(XT_REG_CONTEXT, GetExceptionCode(), xt_file_path(map)); + goto failed; + } +#else + memcpy(mm->mm_start + offset, data, size); +#endif + + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + stat->ts_write += size; + return OK; + + failed: + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + return FAILED; +} + +xtPublic xtBool xt_pread_fmap_4(XTMapFilePtr map, off_t offset, xtWord4 *value, XTIOStatsPtr stat, XTThreadPtr thread) +{ + XTFileMemMapPtr mm = map->mf_memmap; + xtThreadID thd_id = thread->t_id; + +#ifdef DEBUG_TRACE_MAP_IO + xt_trace("/* %s */ pbxt_fmap_read_4(\"%s\", %lu, 4);\n", xt_trace_clock_diff(NULL), map->fr_file->fil_path, (u_long) offset); +#endif + xt_rwmutex_slock(&mm->mm_lock, thd_id); + if (!mm->mm_start) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + xt_rwmutex_xlock(&mm->mm_lock, thd_id); + if (!fs_remap_file(map, 0, 0, stat)) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + return FAILED; + } + } + if (offset >= mm->mm_length) + *value = 0; + else { + xtWord1 *data; + + data = mm->mm_start + offset; +#ifdef XT_WIN + __try + { + *value = XT_GET_DISK_4(data); + // GetExceptionCode()== EXCEPTION_IN_PAGE_ERROR ? 
EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH + } + __except(EXCEPTION_EXECUTE_HANDLER) + { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + return xt_register_ferrno(XT_REG_CONTEXT, GetExceptionCode(), xt_file_path(map)); + } +#else + *value = XT_GET_DISK_4(data); +#endif + } + + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + stat->ts_read += 4; + return OK; +} + +xtPublic xtBool xt_pread_fmap(XTMapFilePtr map, off_t offset, size_t size, size_t min_size, void *data, size_t *red_size, XTIOStatsPtr stat, XTThreadPtr thread) +{ + XTFileMemMapPtr mm = map->mf_memmap; + xtThreadID thd_id = thread->t_id; + size_t tfer; + +#ifdef DEBUG_TRACE_MAP_IO + xt_trace("/* %s */ pbxt_fmap_read(\"%s\", %lu, %lu);\n", xt_trace_clock_diff(NULL), map->fr_file->fil_path, (u_long) offset, (u_long) size); +#endif + /* NOTE!! The file map may already be locked, + * by a call to xt_lock_fmap_ptr()! + * + * This can occur during a sequential scan: + * xt_pread_fmap() Line 1330 + * XTTabCache::tc_read_direct() Line 361 + * XTTabCache::xt_tc_read() Line 220 + * xt_tab_get_rec_data() + * tab_visible() Line 2412 + * xt_tab_seq_next() Line 4068 + * + * And occurs during the following test: + * create table t1 ( a int not null, b int not null) ; + * --disable_query_log + * insert into t1 values (1,1),(2,2),(3,3),(4,4); + * let $1=19; + * set @d=4; + * while ($1) + * { + * eval insert into t1 select a+@d,b+@d from t1; + * eval set @d=@d*2; + * dec $1; + * } + * + * --enable_query_log + * alter table t1 add index i1(a); + * delete from t1 where a > 2000000; + * create table t2 like t1; + * insert into t2 select * from t1; + * + * As a result, the slock must be able to handle + * nested calls to lock/unlock. 
+ */ + xt_rwmutex_slock(&mm->mm_lock, thd_id); + tfer = size; + if (!mm->mm_start) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + xt_rwmutex_xlock(&mm->mm_lock, thd_id); + if (!fs_remap_file(map, 0, 0, stat)) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + return FAILED; + } + } + if (offset >= mm->mm_length) + tfer = 0; + else { + if (mm->mm_length - offset < (off_t) tfer) + tfer = (size_t) (mm->mm_length - offset); +#ifdef XT_WIN + __try + { + memcpy(data, mm->mm_start + offset, tfer); + // GetExceptionCode()== EXCEPTION_IN_PAGE_ERROR ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH + } + __except(EXCEPTION_EXECUTE_HANDLER) + { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + return xt_register_ferrno(XT_REG_CONTEXT, GetExceptionCode(), xt_file_path(map)); + } +#else + memcpy(data, mm->mm_start + offset, tfer); +#endif + } + + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + if (tfer < min_size) + return xt_register_ferrno(XT_REG_CONTEXT, ESPIPE, xt_file_path(map)); + + if (red_size) + *red_size = tfer; + stat->ts_read += tfer; + return OK; +} + +xtPublic xtBool xt_flush_fmap(XTMapFilePtr map, XTIOStatsPtr stat, XTThreadPtr thread) +{ + XTFileMemMapPtr mm = map->mf_memmap; + xtThreadID thd_id = thread->t_id; + xtWord8 s; + +#ifdef DEBUG_TRACE_MAP_IO + xt_trace("/* %s */ pbxt_fmap_sync(\"%s\");\n", xt_trace_clock_diff(NULL), map->fr_file->fil_path); +#endif + xt_rwmutex_slock(&mm->mm_lock, thd_id); + if (!mm->mm_start) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + xt_rwmutex_xlock(&mm->mm_lock, thd_id); + if (!fs_remap_file(map, 0, 0, stat)) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + return FAILED; + } + } + stat->ts_flush_start = xt_trace_clock(); +#ifdef XT_WIN + if (!FlushViewOfFile(mm->mm_start, 0)) { + xt_register_ferrno(XT_REG_CONTEXT, fs_get_win_error(), xt_file_path(map)); + goto failed; + } +#else + if (msync( (char *)mm->mm_start, (size_t) mm->mm_length, MS_SYNC) == -1) { + xt_register_ferrno(XT_REG_CONTEXT, errno, xt_file_path(map)); + goto failed; + } 
+#endif + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + s = stat->ts_flush_start; + stat->ts_flush_start = 0; + stat->ts_flush_time += xt_trace_clock() - s; + stat->ts_flush++; + return OK; + + failed: + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + s = stat->ts_flush_start; + stat->ts_flush_start = 0; + stat->ts_flush_time += xt_trace_clock() - s; + return FAILED; +} + +xtPublic xtWord1 *xt_lock_fmap_ptr(XTMapFilePtr map, off_t offset, size_t size, XTIOStatsPtr stat, XTThreadPtr XT_UNUSED(thread)) +{ + XTFileMemMapPtr mm = map->mf_memmap; + xtThreadID thd_id = thread->t_id; + + xt_rwmutex_slock(&mm->mm_lock, thd_id); + if (!mm->mm_start) { + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + xt_rwmutex_xlock(&mm->mm_lock, thd_id); + if (!fs_remap_file(map, 0, 0, stat)) + goto failed; + } + if (offset >= mm->mm_length) + goto failed; + + if (offset + (off_t) size > mm->mm_length) + stat->ts_read += (u_int) (offset + (off_t) size - mm->mm_length); + else + stat->ts_read += size; + return mm->mm_start + offset; + + failed: + xt_rwmutex_unlock(&mm->mm_lock, thd_id); + return NULL; +} + +xtPublic void xt_unlock_fmap_ptr(XTMapFilePtr map, XTThreadPtr thread) +{ + xt_rwmutex_unlock(&map->mf_memmap->mm_lock, thread->t_id); +} + +/* ---------------------------------------------------------------------- + * Copy files/directories + */ + +static void fs_copy_file(XTThreadPtr self, char *from_path, char *to_path, void *copy_buf) +{ + XTOpenFilePtr from; + XTOpenFilePtr to; + off_t offset = 0; + size_t read_size= 0; + + from = xt_open_file(self, from_path, XT_FS_READONLY); + pushr_(xt_close_file, from); + to = xt_open_file(self, to_path, XT_FS_CREATE | XT_FS_MAKE_PATH); + pushr_(xt_close_file, to); + + for (;;) { + if (!xt_pread_file(from, offset, 16*1024, 0, copy_buf, &read_size, &self->st_statistics.st_x, self)) + xt_throw(self); + if (!read_size) + break; + if (!xt_pwrite_file(to, offset, read_size, copy_buf, &self->st_statistics.st_x, self)) + xt_throw(self); + offset += (off_t) 
read_size; + } + + freer_(); + freer_(); +} + +xtPublic void xt_fs_copy_file(XTThreadPtr self, char *from_path, char *to_path) +{ + void *buffer; + + buffer = xt_malloc(self, 16*1024); + pushr_(xt_free, buffer); + fs_copy_file(self, from_path, to_path, buffer); + freer_(); +} + +static void fs_copy_dir(XTThreadPtr self, char *from_path, char *to_path, void *copy_buf) +{ + XTOpenDirPtr od; + char *file; + + xt_add_dir_char(PATH_MAX, from_path); + xt_add_dir_char(PATH_MAX, to_path); + + pushsr_(od, xt_dir_close, xt_dir_open(self, from_path, NULL)); + while (xt_dir_next(self, od)) { + file = xt_dir_name(self, od); + if (*file == '.') + continue; +#ifdef XT_WIN + if (strcmp(file, "pbxt-lock") == 0) + continue; +#endif + xt_strcat(PATH_MAX, from_path, file); + xt_strcat(PATH_MAX, to_path, file); + if (xt_dir_is_file(self, od)) + fs_copy_file(self, from_path, to_path, copy_buf); + else + fs_copy_dir(self, from_path, to_path, copy_buf); + xt_remove_last_name_of_path(from_path); + xt_remove_last_name_of_path(to_path); + } + freer_(); + + xt_remove_dir_char(from_path); + xt_remove_dir_char(to_path); +} + +xtPublic void xt_fs_copy_dir(XTThreadPtr self, const char *from, const char *to) +{ + void *buffer; + char from_path[PATH_MAX]; + char to_path[PATH_MAX]; + + xt_strcpy(PATH_MAX, from_path, from); + xt_strcpy(PATH_MAX, to_path, to); + + buffer = xt_malloc(self, 16*1024); + pushr_(xt_free, buffer); + fs_copy_dir(self, from_path, to_path, buffer); + freer_(); +} + diff --git a/storage/pbxt/src/filesys_xt.h b/storage/pbxt/src/filesys_xt.h new file mode 100644 index 00000000000..ebc4f474fc9 --- /dev/null +++ b/storage/pbxt/src/filesys_xt.h @@ -0,0 +1,167 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later 
version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-12 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_filesys_h__ +#define __xt_filesys_h__ + +#ifdef XT_WIN +#include <time.h> +#else +#include <sys/time.h> +#include <dirent.h> +#endif +#include <sys/stat.h> + +#include "xt_defs.h" +#include "lock_xt.h" + +#ifdef XT_WIN +#define XT_FILE_IN_USE(x) ((x) == ERROR_SHARING_VIOLATION) +#define XT_FILE_ACCESS_DENIED(x) ((x) == ERROR_ACCESS_DENIED || (x) == ERROR_NETWORK_ACCESS_DENIED) +#define XT_FILE_TOO_MANY_OPEN(x) ((x) == ERROR_TOO_MANY_OPEN_FILES) +#define XT_FILE_NOT_FOUND(x) ((x) == ERROR_FILE_NOT_FOUND || (x) == ERROR_PATH_NOT_FOUND) +#else +#define XT_FILE_IN_USE(x) ((x) == ETXTBSY) +#define XT_FILE_ACCESS_DENIED(x) ((x) == EACCES) +#define XT_FILE_TOO_MANY_OPEN(x) ((x) == EMFILE) +#define XT_FILE_NOT_FOUND(x) ((x) == ENOENT) +#endif + +struct XTOpenFile; + +#define XT_MASK ((S_IRUSR | S_IWUSR) | (S_IRGRP | S_IWGRP) | (S_IROTH)) + +#define XT_FS_DEFAULT 0 /* Open for read/write, error if does not exist. */ +#define XT_FS_READONLY 1 /* Open for read only (otherwise read/write). */ +#define XT_FS_CREATE 2 /* Create if the file does not exist. */ +#define XT_FS_EXCLUSIVE 4 /* Create, and generate an error if it already exists. */ +#define XT_FS_MISSING_OK 8 /* Set this flag if you don't want to throw an error if the file does not exist! */ +#define XT_FS_MAKE_PATH 16 /* Create the path if it does not exist. */ +#define XT_FS_DIRECT_IO 32 /* Use direct I/O on this file if possible (O_DIRECT). 
*/ + +xtBool xt_fs_exists(char *path); +xtBool xt_fs_delete(struct XTThread *self, char *path); +xtBool xt_fs_file_not_found(int err); +void xt_fs_mkdir(struct XTThread *self, char *path); +void xt_fs_mkpath(struct XTThread *self, char *path); +xtBool xt_fs_rmdir(struct XTThread *self, char *path); +xtBool xt_fs_stat(struct XTThread *self, char *path, off_t *size, struct timespec *mod_time); +void xt_fs_move(struct XTThread *self, char *from_path, char *to_path); +xtBool xt_fs_rename(struct XTThread *self, char *from_path, char *to_path); + +#ifdef XT_WIN +#define XT_FD HANDLE +#define XT_NULL_FD INVALID_HANDLE_VALUE +#else +#define XT_FD int +#define XT_NULL_FD (-1) +#endif + +typedef struct XTFileMemMap { + xtWord1 *mm_start; /* The in-memory start of the map. */ +#ifdef XT_WIN + HANDLE mm_mapdes; +#endif + off_t mm_length; /* The length of the file map. */ + XTRWMutexRec mm_lock; /* The file map R/W lock. */ + size_t mm_grow_size; /* The amount by which the map file is increased. */ +} XTFileMemMapRec, *XTFileMemMapPtr; + +typedef struct XTFile { + u_int fil_ref_count; /* The number of open file structure referencing this file. */ + char *fil_path; + u_int fil_id; /* This is used by the disk cache to identify a file in the hash index. */ + XT_FD fil_filedes; /* The shared file descriptor (pread and pwrite allow this), on Windows this is used only for mmapped files */ + u_int fil_handle_count; /* Number of references in the case of mmapped fil_filedes, both Windows and Unix */ + XTFileMemMapPtr fil_memmap; /* Non-null if this file is memory mapped. */ +} XTFileRec, *XTFilePtr; + +typedef struct XTFileRef { + XTFilePtr fr_file; + u_int fr_id; /* Copied from above (small optimisation). 
*/ +} XTFileRefRec, *XTFileRefPtr; + +typedef struct XTOpenFile : public XTFileRef { + XT_FD of_filedes; +} XTOpenFileRec, *XTOpenFilePtr; + +void xt_fs_init(struct XTThread *self); +void xt_fs_exit(struct XTThread *self); + +XTFilePtr xt_fs_get_file(struct XTThread *self, char *file_name); +void xt_fs_release_file(struct XTThread *self, XTFilePtr file_ptr); + +XTOpenFilePtr xt_open_file(struct XTThread *self, char *file, int mode); +XTOpenFilePtr xt_open_file_ns(char *file, int mode); +xtBool xt_open_file_ns(XTOpenFilePtr *fh, char *file, int mode); +void xt_close_file(struct XTThread *self, XTOpenFilePtr f); +xtBool xt_close_file_ns(XTOpenFilePtr f); +char *xt_file_path(struct XTFileRef *of); + +xtBool xt_lock_file(struct XTThread *self, XTOpenFilePtr of); +void xt_unlock_file(struct XTThread *self, XTOpenFilePtr of); + +off_t xt_seek_eof_file(struct XTThread *self, XTOpenFilePtr of); +xtBool xt_set_eof_file(struct XTThread *self, XTOpenFilePtr of, off_t offset); + +xtBool xt_pwrite_file(XTOpenFilePtr of, off_t offset, size_t size, void *data, struct XTIOStats *timer, struct XTThread *thread); +xtBool xt_pread_file(XTOpenFilePtr of, off_t offset, size_t size, size_t min_size, void *data, size_t *red_size, struct XTIOStats *timer, struct XTThread *thread); +xtBool xt_flush_file(XTOpenFilePtr of, struct XTIOStats *timer, struct XTThread *thread); + +typedef struct XTOpenDir { + char *od_path; +#ifdef XT_WIN + HANDLE od_handle; + WIN32_FIND_DATA od_data; +#else + char *od_filter; + struct dirent od_entry; + DIR *od_dir; +#endif +} XTOpenDirRec, *XTOpenDirPtr; + +XTOpenDirPtr xt_dir_open(struct XTThread *self, c_char *path, c_char *filter); +void xt_dir_close(struct XTThread *self, XTOpenDirPtr od); +xtBool xt_dir_next(struct XTThread *self, XTOpenDirPtr od); +char *xt_dir_name(struct XTThread *self, XTOpenDirPtr od); +xtBool xt_dir_is_file(struct XTThread *self, XTOpenDirPtr od); +off_t xt_dir_file_size(struct XTThread *self, XTOpenDirPtr od); + +typedef struct 
XTMapFile : public XTFileRef { + XTFileMemMapPtr mf_memmap; +} XTMapFileRec, *XTMapFilePtr; + +XTMapFilePtr xt_open_fmap(struct XTThread *self, char *file, size_t grow_size); +void xt_close_fmap(struct XTThread *self, XTMapFilePtr map); +xtBool xt_close_fmap_ns(XTMapFilePtr map); +xtBool xt_pwrite_fmap(XTMapFilePtr map, off_t offset, size_t size, void *data, struct XTIOStats *timer, struct XTThread *thread); +xtBool xt_pread_fmap(XTMapFilePtr map, off_t offset, size_t size, size_t min_size, void *data, size_t *red_size, struct XTIOStats *timer, struct XTThread *thread); +xtBool xt_pread_fmap_4(XTMapFilePtr map, off_t offset, xtWord4 *value, struct XTIOStats *timer, struct XTThread *thread); +xtBool xt_flush_fmap(XTMapFilePtr map, struct XTIOStats *stat, struct XTThread *thread); +xtWord1 *xt_lock_fmap_ptr(XTMapFilePtr map, off_t offset, size_t size, struct XTIOStats *timer, struct XTThread *thread); +void xt_unlock_fmap_ptr(XTMapFilePtr map, struct XTThread *thread); + +void xt_fs_copy_file(struct XTThread *self, char *from_path, char *to_path); +void xt_fs_copy_dir(struct XTThread *self, const char *from, const char *to); + +#endif + diff --git a/storage/pbxt/src/ha_pbxt.cc b/storage/pbxt/src/ha_pbxt.cc new file mode 100644 index 00000000000..f12e23414fb --- /dev/null +++ b/storage/pbxt/src/ha_pbxt.cc @@ -0,0 +1,5370 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * Derived from ha_example.h + * Copyright (C) 2003 MySQL AB + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-11-10 Paul McCullagh + * + */ + +#ifdef USE_PRAGMA_IMPLEMENTATION +#pragma implementation // gcc: Class implementation +#endif + +#include "xt_config.h" + +#if defined(XT_WIN) +#include <windows.h> +#endif + +#include <stdlib.h> +#include <time.h> + +#ifdef DRIZZLED +#include <drizzled/common.h> +#include <drizzled/plugin.h> +#include <mysys/my_alloc.h> +#include <mysys/hash.h> +#include <drizzled/field.h> +#include <drizzled/current_session.h> +#include <drizzled/data_home.h> +#include <drizzled/error.h> +#include <drizzled/table.h> +#include <drizzled/field/timestamp.h> +#include <drizzled/server_includes.h> +extern "C" char **session_query(Session *session); +#define my_strdup(a,b) strdup(a) +#else +#include "mysql_priv.h" +#include <mysql/plugin.h> +#endif + +#include "ha_pbxt.h" +#include "ha_xtsys.h" + +#include "strutil_xt.h" +#include "database_xt.h" +#include "cache_xt.h" +#include "trace_xt.h" +#include "heap_xt.h" +#include "myxt_xt.h" +#include "datadic_xt.h" +#ifdef XT_STREAMING +#include "streaming_xt.h" +#endif +#include "tabcache_xt.h" +#include "systab_xt.h" +#include "xaction_xt.h" + +#ifdef DEBUG +//#define XT_USE_SYS_PAR_DEBUG_SIZES +//#define PBXT_HANDLER_TRACE +//#define PBXT_TRACE_RETURN +//#define XT_PRINT_INDEX_OPT +//#define XT_SHOW_DUMPS_TRACE +//#define XT_UNIT_TEST +//#define LOAD_TABLE_ON_OPEN +//#define CHECK_TABLE_LOADS + +/* Enable to trace the statements executed by the engine: */ +//#define TRACE_STATEMENTS + +/* Enable to print the trace to the stdout, instead of + * to the trace log. 
+ */ +//#define PRINT_STATEMENTS +#endif + +static handler *pbxt_create_handler(handlerton *hton, TABLE_SHARE *table, MEM_ROOT *mem_root); +static int pbxt_init(void *p); +static int pbxt_end(void *p); +#ifndef DRIZZLED +static int pbxt_panic(handlerton *hton, enum ha_panic_function flag); +#endif +static void pbxt_drop_database(handlerton *hton, char *path); +static int pbxt_close_connection(handlerton *hton, THD* thd); +static int pbxt_commit(handlerton *hton, THD *thd, bool all); +static int pbxt_rollback(handlerton *hton, THD *thd, bool all); +static void ha_aquire_exclusive_use(XTThreadPtr self, XTSharePtr share, ha_pbxt *mine); +static void ha_release_exclusive_use(XTThreadPtr self, XTSharePtr share); +static void ha_close_open_tables(XTThreadPtr self, XTSharePtr share, ha_pbxt *mine); + +#ifdef TRACE_STATEMENTS + +#ifdef PRINT_STATEMENTS +#define STAT_TRACE(y, x) printf("%s: %s\n", y ? y->t_name : "-unknown-", x) +#else +#define STAT_TRACE(y, x) xt_ttraceq(y, x) +#endif + +#else + +#define STAT_TRACE(y, x) + +#endif + +#ifdef PBXT_HANDLER_TRACE +#define PBXT_ALLOW_PRINTING + +#define XT_TRACE_CALL() do { XTThreadPtr s = xt_get_self(); printf("%s %s\n", s ? s->t_name : "-unknown-", __FUNC__); } while (0) +#ifdef PBXT_TRACE_RETURN +#define XT_RETURN(x) do { printf("%d\n", (int) (x)); return (x); } while (0) +#define XT_RETURN_VOID do { printf("out\n"); return; } while (0) +#else +#define XT_RETURN(x) return (x) +#define XT_RETURN_VOID return +#endif + +#else + +#define XT_TRACE_CALL() +#define XT_RETURN(x) return (x) +#define XT_RETURN_VOID return + +#endif + +#ifdef PBXT_ALLOW_PRINTING +#define XT_PRINT0(y, x) do { XTThreadPtr s = (y); printf("%s " x, s ? s->t_name : "-unknown-"); } while (0) +#define XT_PRINT1(y, x, a) do { XTThreadPtr s = (y); printf("%s " x, s ? s->t_name : "-unknown-", a); } while (0) +#define XT_PRINT2(y, x, a, b) do { XTThreadPtr s = (y); printf("%s " x, s ? 
s->t_name : "-unknown-", a, b); } while (0) +#define XT_PRINT3(y, x, a, b, c) do { XTThreadPtr s = (y); printf("%s " x, s ? s->t_name : "-unknown-", a, b, c); } while (0) +#else +#define XT_PRINT0(y, x) +#define XT_PRINT1(y, x, a) +#define XT_PRINT2(y, x, a, b) +#define XT_PRINT3(y, x, a, b, c) +#endif + + +#define TS(x) (x)->s + +handlerton *pbxt_hton; +bool pbxt_inited = false; // Variable for checking the init state of hash +xtBool pbxt_ignore_case = true; +const char *pbxt_extensions[]= { ".xtr", ".xtd", ".xtl", ".xti", ".xt", "", NULL }; +#ifdef XT_CRASH_DEBUG +xtBool pbxt_crash_debug = TRUE; +#else +xtBool pbxt_crash_debug = FALSE; +#endif + +/* Variables for pbxt share methods */ +static xt_mutex_type pbxt_database_mutex; // Prevent a database from being opened while it is being dropped +static XTHashTabPtr pbxt_share_tables; // Hash used to track open tables +static XTDatabaseHPtr pbxt_database = NULL; // The global open database +static char *pbxt_index_cache_size; +static char *pbxt_record_cache_size; +static char *pbxt_log_cache_size; +static char *pbxt_log_file_threshold; +static char *pbxt_transaction_buffer_size; +static char *pbxt_log_buffer_size; +static char *pbxt_checkpoint_frequency; +static char *pbxt_data_log_threshold; +static char *pbxt_data_file_grow_size; +static char *pbxt_row_file_grow_size; + +#ifdef DEBUG +#define XT_SHARE_LOCK_WAIT 5000 +#else +#define XT_SHARE_LOCK_WAIT 500 +#endif + +/* + * Lock timeout in 1/1000ths of a second + */ +#define XT_SHARE_LOCK_TIMEOUT 30000 + +/* + * ----------------------------------------------------------------------- + * SYSTEM VARIABLES + * + */ + +//#define XT_FOR_TEAMDRIVE + +typedef struct HAVarParams { + const char *vp_var; /* Variable name. */ + const char *vp_def; /* Default value. */ + const char *vp_min; /* Minimum allowed value. */ + const char *vp_max4; /* Maximum allowed value on 32-bit processors. */ + const char *vp_max8; /* Maximum allowed value on 64-bit processors. 
*/ +} HAVarParamsRec, *HAVarParamsPtr; + +#ifdef XT_USE_SYS_PAR_DEBUG_SIZES +static HAVarParamsRec vp_index_cache_size = { "pbxt_index_cache_size", "32MB", "8MB", "2GB", "2000GB" }; +static HAVarParamsRec vp_record_cache_size = { "pbxt_record_cache_size", "32MB", "8MB", "2GB", "2000GB" }; +static HAVarParamsRec vp_log_cache_size = { "pbxt_log_cache_size", "16MB", "4MB", "2GB", "2000GB" }; +static HAVarParamsRec vp_checkpoint_frequency = { "pbxt_checkpoint_frequency", "28MB", "512K", "1GB", "24GB" }; +static HAVarParamsRec vp_log_file_threshold = { "pbxt_log_file_threshold", "32MB", "1MB", "2GB", "256TB" }; +static HAVarParamsRec vp_transaction_buffer_size = { "pbxt_transaction_buffer_size", "1MB", "128K", "1GB", "24GB" }; +static HAVarParamsRec vp_log_buffer_size = { "pbxt_log_buffer_size", "256K", "128K", "1GB", "24GB" }; +static HAVarParamsRec vp_data_log_threshold = { "pbxt_data_log_threshold", "400K", "400K", "2GB", "256TB" }; +static HAVarParamsRec vp_data_file_grow_size = { "pbxt_data_file_grow_size", "2MB", "128K", "1GB", "2GB" }; +static HAVarParamsRec vp_row_file_grow_size = { "pbxt_row_file_grow_size", "256K", "32K", "1GB", "2GB" }; +#define XT_DL_DEFAULT_XLOG_COUNT 3 +#define XT_DL_DEFAULT_GARBAGE_LEVEL 10 +#else +static HAVarParamsRec vp_index_cache_size = { "pbxt_index_cache_size", "32MB", "8MB", "2GB", "2000GB" }; +static HAVarParamsRec vp_record_cache_size = { "pbxt_record_cache_size", "32MB", "8MB", "2GB", "2000GB" }; +static HAVarParamsRec vp_log_cache_size = { "pbxt_log_cache_size", "16MB", "4MB", "2GB", "2000GB" }; +static HAVarParamsRec vp_checkpoint_frequency = { "pbxt_checkpoint_frequency", "28MB", "512K", "1GB", "24GB" }; +static HAVarParamsRec vp_log_file_threshold = { "pbxt_log_file_threshold", "32MB", "1MB", "2GB", "256TB" }; +static HAVarParamsRec vp_transaction_buffer_size = { "pbxt_transaction_buffer_size", "1MB", "128K", "1GB", "24GB" }; +static HAVarParamsRec vp_log_buffer_size = { "pbxt_log_buffer_size", "256K", "128K", "1GB", "24GB" 
}; +static HAVarParamsRec vp_data_log_threshold = { "pbxt_data_log_threshold", "64MB", "1MB", "2GB", "256TB" }; +static HAVarParamsRec vp_data_file_grow_size = { "pbxt_data_file_grow_size", "2MB", "128K", "1GB", "2GB" }; +static HAVarParamsRec vp_row_file_grow_size = { "pbxt_row_file_grow_size", "256K", "32K", "1GB", "2GB" }; +#define XT_DL_DEFAULT_XLOG_COUNT 3 +#define XT_DL_DEFAULT_GARBAGE_LEVEL 50 +#endif + +#define XT_AUTO_INCREMENT_DEF 0 + +#ifdef XT_MAC +#ifdef DEBUG +/* For debugging on the Mac, we check the re-use logs: */ +#define XT_OFFLINE_LOG_FUNCTION_DEF XT_RECYCLE_LOGS +#else +#define XT_OFFLINE_LOG_FUNCTION_DEF XT_DELETE_LOGS +#endif +#else +#define XT_OFFLINE_LOG_FUNCTION_DEF XT_RECYCLE_LOGS +#endif + +/* TeamDrive, uses special auto-increment, and + * we keep the logs for the moment: + */ +#ifdef XT_FOR_TEAMDRIVE +#undef XT_OFFLINE_LOG_FUNCTION_DEF +#define XT_OFFLINE_LOG_FUNCTION_DEF XT_KEEP_LOGS +//#undef XT_AUTO_INCREMENT_DEF +//#define XT_AUTO_INCREMENT_DEF 1 +#endif + +/* + * ----------------------------------------------------------------------- + * SHARED TABLE DATA + * + */ + +static xtBool ha_hash_comp(void *key, void *data) +{ + XTSharePtr share = (XTSharePtr) data; + + return strcmp((char *) key, share->sh_table_path->ps_path) == 0; +} + +static xtHashValue ha_hash(xtBool is_key, void *key_data) +{ + XTSharePtr share = (XTSharePtr) key_data; + + if (is_key) + return xt_ht_hash((char *) key_data); + return xt_ht_hash(share->sh_table_path->ps_path); +} + +static xtBool ha_hash_comp_ci(void *key, void *data) +{ + XTSharePtr share = (XTSharePtr) data; + + return strcasecmp((char *) key, share->sh_table_path->ps_path) == 0; +} + +static xtHashValue ha_hash_ci(xtBool is_key, void *key_data) +{ + XTSharePtr share = (XTSharePtr) key_data; + + if (is_key) + return xt_ht_casehash((char *) key_data); + return xt_ht_casehash(share->sh_table_path->ps_path); +} + +static void ha_open_share(XTThreadPtr self, XTShareRec *share, xtBool *tabled_opened) +{ 
+ xt_lock_mutex(self, (xt_mutex_type *) share->sh_ex_mutex); + pushr_(xt_unlock_mutex, share->sh_ex_mutex); + + if (!share->sh_table) { + share->sh_table = xt_use_table(self, share->sh_table_path, FALSE, FALSE, tabled_opened); + share->sh_dic_key_count = share->sh_table->tab_dic.dic_key_count; + share->sh_dic_keys = share->sh_table->tab_dic.dic_keys; + share->sh_recalc_selectivity = FALSE; + } + + freer_(); // xt_unlock_mutex(share->sh_ex_mutex) +} + +static void ha_close_share(XTThreadPtr self, XTShareRec *share) +{ + XTTableHPtr tab; + + if ((tab = share->sh_table)) { + /* Save this, in case the share is re-opened. */ + share->sh_min_auto_inc = tab->tab_auto_inc; + + xt_heap_release(self, tab); + share->sh_table = NULL; + } + + /* These are only references: */ + share->sh_dic_key_count = 0; + share->sh_dic_keys = NULL; +} + +static void ha_cleanup_share(XTThreadPtr self, XTSharePtr share) +{ + ha_close_share(self, share); + + if (share->sh_table_path) { + xt_free(self, share->sh_table_path); + share->sh_table_path = NULL; + } + + if (share->sh_ex_cond) { + thr_lock_delete(&share->sh_lock); + xt_delete_cond(self, (xt_cond_type *) share->sh_ex_cond); + share->sh_ex_cond = NULL; + } + + if (share->sh_ex_mutex) { + xt_delete_mutex(self, (xt_mutex_type *) share->sh_ex_mutex); + share->sh_ex_mutex = NULL; + } + + xt_free(self, share); +} + +static void ha_hash_free(XTThreadPtr self, void *data) +{ + XTSharePtr share = (XTSharePtr) data; + + ha_cleanup_share(self, share); +} + +/* + * This structure contains information that is common to all handles. + * (i.e. it is table specific). + */ +static XTSharePtr ha_get_share(XTThreadPtr self, const char *table_path, bool open_table, xtBool *tabled_opened) +{ + XTShareRec *share; + + enter_(); + xt_ht_lock(self, pbxt_share_tables); + pushr_(xt_ht_unlock, pbxt_share_tables); + + // Check if the table exists... 
+ if (!(share = (XTSharePtr) xt_ht_get(self, pbxt_share_tables, (void *) table_path))) { + share = (XTSharePtr) xt_calloc(self, sizeof(XTShareRec)); + pushr_(ha_cleanup_share, share); + + share->sh_ex_mutex = (xt_mutex_type *) xt_new_mutex(self); + share->sh_ex_cond = (xt_cond_type *) xt_new_cond(self); + + thr_lock_init(&share->sh_lock); + + share->sh_use_count = 0; + share->sh_table_path = (XTPathStrPtr) xt_dup_string(self, table_path); + + if (open_table) + ha_open_share(self, share, tabled_opened); + + popr_(); // Discard ha_cleanup_share(share); + + xt_ht_put(self, pbxt_share_tables, share); + } + + share->sh_use_count++; + freer_(); // xt_ht_unlock(pbxt_share_tables) + + return_(share); +} + +/* + * Free shared information. + */ +static void ha_unget_share(XTThreadPtr self, XTSharePtr share) +{ + xt_ht_lock(self, pbxt_share_tables); + pushr_(xt_ht_unlock, pbxt_share_tables); + + if (!--share->sh_use_count) + xt_ht_del(self, pbxt_share_tables, share->sh_table_path); + + freer_(); // xt_ht_unlock(pbxt_share_tables) +} + +static xtBool ha_unget_share_removed(XTThreadPtr self, XTSharePtr share) +{ + xtBool removed = FALSE; + + xt_ht_lock(self, pbxt_share_tables); + pushr_(xt_ht_unlock, pbxt_share_tables); + + if (!--share->sh_use_count) { + removed = TRUE; + xt_ht_del(self, pbxt_share_tables, share->sh_table_path); + } + + freer_(); // xt_ht_unlock(pbxt_share_tables) + return removed; +} + +/* + * ----------------------------------------------------------------------- + * PUBLIC FUNCTIONS + * + */ + +xtPublic void xt_ha_unlock_table(XTThreadPtr self, void *share) +{ + ha_release_exclusive_use(self, (XTSharePtr) share); + ha_unget_share(self, (XTSharePtr) share); +} + +xtPublic void xt_ha_close_global_database(XTThreadPtr self) +{ + if (pbxt_database) { + xt_heap_release(self, pbxt_database); + pbxt_database = NULL; + } +} + +/* + * Open a PBXT database given the path of a table. + * This function also returns the name of the table. 
+ * + * We use the pbxt_database_mutex to lock this + * operation to make sure it does not occur while + * some other thread is doing a "closeall". + */ +xtPublic void xt_ha_open_database_of_table(XTThreadPtr self, XTPathStrPtr table_path __attribute__((unused))) +{ +#ifdef XT_USE_GLOBAL_DB + if (!self->st_database) { + if (!pbxt_database) { + xt_open_database(self, mysql_real_data_home, TRUE); + pbxt_database = self->st_database; + xt_heap_reference(self, pbxt_database); + } + else + xt_use_database(self, pbxt_database, XT_FOR_USER); + } +#else + char db_path[PATH_MAX]; + + xt_strcpy(PATH_MAX, db_path, (char *) table_path); + xt_remove_last_name_of_path(db_path); + xt_remove_dir_char(db_path); + + if (self->st_database && xt_tab_compare_paths(self->st_database->db_name, xt_last_name_of_path(db_path)) == 0) + /* This thread already has this database open! */ + return; + + /* Auto commit before changing the database: */ + if (self->st_xact_data) { + /* PMC - This probably indicates something strange is happening: + * + * This sequence generates this error: + * + * delimiter | + * + * create temporary table t3 (id int)| + * + * create function f10() returns int + * begin + * drop temporary table if exists t3; + * create temporary table t3 (id int) engine=myisam; + * insert into t3 select id from t4; + * return (select count(*) from t3); + * end| + * + * select f10()| + * + * An error is generated because the same thread is used + * to open table t4 (at the start of the functions), and + * then to drop table t3. To drop t3 we need to + * switch the database, so we land up here! 
+ */ + xt_throw_xterr(XT_CONTEXT, XT_ERR_CANNOT_CHANGE_DB); + /* + if (!xt_xn_commit(self)) + throw_(); + */ + } + + xt_lock_mutex(self, &pbxt_database_mutex); + pushr_(xt_unlock_mutex, &pbxt_database_mutex); + xt_open_database(self, db_path, FALSE); + freer_(); // xt_unlock_mutex(&pbxt_database_mutex); +#endif +} + +xtPublic XTThreadPtr xt_ha_set_current_thread(THD *thd, XTExceptionPtr e) +{ + XTThreadPtr self; + static int ha_thread_count = 0, ha_id; + + if (!(self = (XTThreadPtr) *thd_ha_data(thd, pbxt_hton))) { +// const Security_context *sctx; + char name[120]; + char ha_id_str[50]; + + ha_id = ++ha_thread_count; + sprintf(ha_id_str, "_%d", ha_id); + xt_strcpy(120,name,"user"); // TODO: Fix this hack +/* + sctx = &thd->main_security_ctx; + + if (sctx->user) { + xt_strcpy(120, name, sctx->user); + xt_strcat(120, name, "@"); + } + else + *name = 0; + if (sctx->host) + xt_strcat(120, name, sctx->host); + else if (sctx->ip) + xt_strcat(120, name, sctx->ip); + else if (thd->proc_info) + xt_strcat(120, name, (char *) thd->proc_info); + else + xt_strcat(120, name, "system"); +*/ + xt_strcat(120, name, ha_id_str); + if (!(self = xt_create_thread(name, FALSE, TRUE, e))) + return NULL; + + self->st_xact_mode = XT_XACT_REPEATABLE_READ; + + *thd_ha_data(thd, pbxt_hton) = (void *) self; + } + return self; +} + +xtPublic void xt_ha_close_connection(THD* thd) +{ + XTThreadPtr self; + + if ((self = (XTThreadPtr) *thd_ha_data(thd, pbxt_hton))) { + *thd_ha_data(thd, pbxt_hton) = NULL; + xt_free_thread(self); + } +} + +xtPublic XTThreadPtr xt_ha_thd_to_self(THD *thd) +{ + return (XTThreadPtr) *thd_ha_data(thd, pbxt_hton); +} + +/* The first bit is 1. 
*/ +static u_int ha_get_max_bit(MY_BITMAP *map) +{ + my_bitmap_map *data_ptr = map->bitmap; + my_bitmap_map *end_ptr = map->last_word_ptr; + my_bitmap_map b; + u_int cnt = map->n_bits; + + for (; end_ptr >= data_ptr; end_ptr--) { + if ((b = *end_ptr)) { + my_bitmap_map mask; + + if (end_ptr == map->last_word_ptr && map->last_word_mask) + mask = map->last_word_mask >> 1; + else + mask = 0x80000000; + while (!(b & mask)) { + b = b << 1; + /* Should not happen, but if it does, we hang! */ + if (!b) + return map->n_bits; + cnt--; + } + return cnt; + } + if (end_ptr == map->last_word_ptr) + cnt = ((cnt-1) / 32) * 32; + else + cnt -= 32; + } + return 0; +} + +/* + * ----------------------------------------------------------------------- + * SUPPORT FUNCTIONS + * + */ + +/* + * In PBXT, as in MySQL: thread == connection. + * + * So we simply attach a PBXT thread to a MySQL thread. + */ +static XTThreadPtr ha_set_current_thread(THD *thd, int *err) +{ + XTThreadPtr self; + XTExceptionRec e; + + if (!(self = xt_ha_set_current_thread(thd, &e))) { + xt_log_exception(NULL, &e, XT_LOG_DEFAULT); + *err = e.e_xt_err; + return NULL; + } + return self; +} + +xtPublic int xt_ha_pbxt_to_mysql_error(int xt_err) +{ + switch (xt_err) { + case XT_NO_ERR: + return(0); + case XT_ERR_DUPLICATE_KEY: + return HA_ERR_FOUND_DUPP_KEY; + case XT_ERR_DEADLOCK: + return HA_ERR_LOCK_DEADLOCK; + case XT_ERR_RECORD_CHANGED: + /* If we generate HA_ERR_RECORD_CHANGED instead of HA_ERR_LOCK_WAIT_TIMEOUT + * then sysbench does not work because it does not handle this error. + */ + //return HA_ERR_LOCK_WAIT_TIMEOUT; // but HA_ERR_RECORD_CHANGED is the correct error for an optimistic lock failure. 
+ return HA_ERR_RECORD_CHANGED; + case XT_ERR_LOCK_TIMEOUT: + return HA_ERR_LOCK_WAIT_TIMEOUT; + case XT_ERR_TABLE_IN_USE: + return HA_ERR_WRONG_COMMAND; + case XT_ERR_TABLE_NOT_FOUND: + return HA_ERR_NO_SUCH_TABLE; + case XT_ERR_TABLE_EXISTS: + return HA_ERR_TABLE_EXIST; + case XT_ERR_CANNOT_CHANGE_DB: + return ER_TRG_IN_WRONG_SCHEMA; + case XT_ERR_COLUMN_NOT_FOUND: + return HA_ERR_CANNOT_ADD_FOREIGN; + case XT_ERR_NO_REFERENCED_ROW: + case XT_ERR_REF_TABLE_NOT_FOUND: + case XT_ERR_REF_TYPE_WRONG: + return HA_ERR_NO_REFERENCED_ROW; + case XT_ERR_ROW_IS_REFERENCED: + return HA_ERR_ROW_IS_REFERENCED; + case XT_ERR_COLUMN_IS_NOT_NULL: + case XT_ERR_INCORRECT_NO_OF_COLS: + case XT_ERR_FK_ON_TEMP_TABLE: + case XT_ERR_FK_REF_TEMP_TABLE: + return HA_ERR_CANNOT_ADD_FOREIGN; + case XT_ERR_DUPLICATE_FKEY: + return HA_ERR_FOREIGN_DUPLICATE_KEY; + case XT_ERR_RECORD_DELETED: + return HA_ERR_RECORD_DELETED; + } + return(-1); // Unknown error +} + +xtPublic int xt_ha_pbxt_thread_error_for_mysql(THD *thd __attribute__((unused)), const XTThreadPtr self, int ignore_dup_key) +{ + int xt_err = self->t_exception.e_xt_err; + + XT_PRINT2(self, "xt_ha_pbxt_thread_error_for_mysql xt_err=%d auto commit=%d\n", (int) xt_err, (int) self->st_auto_commit); + switch (xt_err) { + case XT_NO_ERR: + break; + case XT_ERR_DUPLICATE_KEY: + case XT_ERR_DUPLICATE_FKEY: + /* Let MySQL call rollback as and when it wants to for duplicate + * key. + * + * In addition, we are not allowed to do an auto-rollback + * inside a sub-statement (function() or procedure()) + * For example: + * + * delimiter | + * + * create table t3 (c1 char(1) primary key not null)| + * + * create function bug12379() + * returns integer + * begin + * insert into t3 values('X'); + * insert into t3 values('X'); + * return 0; + * end| + * + * --error 1062 + * select bug12379()| + * + * + * Not doing an auto-rollback should solve this problem in the + * case of duplicate key (but not in others - like deadlock)! 
+ * I don't think this situation is handled correctly by MySQL. + */ + + /* If we are in auto-commit mode (and we are not ignoring + * duplicate keys) then rollback the transaction automatically. + */ + if (!ignore_dup_key && self->st_auto_commit) + goto abort_transaction; + break; + case XT_ERR_DEADLOCK: + case XT_ERR_NO_REFERENCED_ROW: + case XT_ERR_ROW_IS_REFERENCED: + goto abort_transaction; + case XT_ERR_RECORD_CHANGED: + /* MySQL also handles the locked error. NOTE: There is no automatic + * rollback! + */ + break; + default: + xt_log_exception(self, &self->t_exception, XT_LOG_DEFAULT); + abort_transaction: + /* PMC 2006-08-30: It should be that this is not necessary! + * + * It is only necessary to call ha_rollback() if the engine + * aborts the transaction. + * + * On the other hand, I shouldn't need to rollback the + * transaction because, if I return an error, MySQL + * should do it for me. + * + * Unfortunately, when auto-commit is off, MySQL does not + * rollback automatically (for example when a deadlock + * is provoked). + * + * And when we have a multi update we cannot rely on this + * either (see comment above). + */ + if (self->st_xact_data) { + /* + * GOTCHA: + * A result of the "st_abort_trans = TRUE" below is that + * the following code results in an empty set. + * The reason is "ignore_dup_key" is not set so + * the duplicate key leads to an error which causes + * the transaction to be aborted. + * The delayed inserts are all executed in one transaction. + * + * CREATE TABLE t1 ( + * c1 INT(11) NOT NULL AUTO_INCREMENT, + * c2 INT(11) DEFAULT NULL, + * PRIMARY KEY (c1) + * ); + * SET insert_id= 14; + * INSERT DELAYED INTO t1 VALUES(NULL, 11), (NULL, 12); + * INSERT DELAYED INTO t1 VALUES(14, 91); + * INSERT DELAYED INTO t1 VALUES (NULL, 92), (NULL, 93); + * FLUSH TABLE t1; + * SELECT * FROM t1; + */ + if (self->st_lock_count == 0) { + /* No table locks, must rollback immediately + * (there will be no possibility later! 
+ */ + XT_PRINT1(self, "xt_xn_rollback xt_err=%d\n", xt_err); + if (!xt_xn_rollback(self)) + xt_log_exception(self, &self->t_exception, XT_LOG_DEFAULT); + } + else { + /* Locks are held on tables. + * Only rollback after locks are released. + */ + self->st_auto_commit = TRUE; + self->st_abort_trans = TRUE; + } +#ifdef xxxx +/* DBUG_ASSERT(thd->transaction.stmt.ha_list == NULL || + trans == &thd->transaction.stmt); in handler.cc now + * fails, and I don't know if this function can be called anymore! */ + /* Cause any other DBs to do a rollback as well... */ + if (thd) { + /* + * GOTCHA: + * This is a BUG in MySQL. I cannot rollback a transaction if + * pb_mysql_thd->in_sub_stmt! But I must....?! + */ +#ifdef MYSQL_SERVER + if (!thd->in_sub_stmt) + ha_rollback(thd); +#endif + } +#endif + } + break; + } + return xt_ha_pbxt_to_mysql_error(xt_err); +} + +static void ha_conditional_close_database(XTThreadPtr self, XTThreadPtr other_thr, void *db) +{ + if (other_thr->st_database == (XTDatabaseHPtr) db) + xt_unuse_database(self, other_thr); +} + +/* + * This is only called from drop database, so we know that + * no thread is actually using the database. This means that it + * must be safe to close the database. 
+ */ +xtPublic void xt_ha_all_threads_close_database(XTThreadPtr self, XTDatabaseHPtr db) +{ + xt_lock_mutex(self, &pbxt_database_mutex); + pushr_(xt_unlock_mutex, &pbxt_database_mutex); + xt_do_to_all_threads(self, ha_conditional_close_database, db); + freer_(); // xt_unlock_mutex(&pbxt_database_mutex); +} + +static int ha_log_pbxt_thread_error_for_mysql(int ignore_dup_key) +{ + return xt_ha_pbxt_thread_error_for_mysql(current_thd, myxt_get_self(), ignore_dup_key); +} + +/* + * ----------------------------------------------------------------------- + * STATIC HOOKS + * + */ +static xtWord8 ha_set_variable(char **value, HAVarParamsPtr vp) +{ + xtWord8 result; + xtWord8 mi, ma; + char *mm; + + if (!*value) + *value = getenv(vp->vp_var); + if (!*value) + *value = (char *) vp->vp_def; + result = xt_byte_size_to_int8(*value); + mi = (xtWord8) xt_byte_size_to_int8(vp->vp_min); + if (result < mi) { + result = mi; + *value = (char *) vp->vp_min; + } + if (sizeof(size_t) == 8) + mm = (char *) vp->vp_max8; + else + mm = (char *) vp->vp_max4; + ma = (xtWord8) xt_byte_size_to_int8(mm); + if (result > ma) { + result = ma; + *value = mm; + } + return result; +} + +static void pbxt_call_init(XTThreadPtr self) +{ + xtInt8 index_cache_size; + xtInt8 record_cache_size; + xtInt8 log_cache_size; + xtInt8 log_file_threshold; + xtInt8 transaction_buffer_size; + xtInt8 log_buffer_size; + xtInt8 checkpoint_frequency; + xtInt8 data_log_threshold; + xtInt8 data_file_grow_size; + xtInt8 row_file_grow_size; + + xt_logf(XT_NT_INFO, "PrimeBase XT (PBXT) Engine %s loaded...\n", xt_get_version()); + xt_logf(XT_NT_INFO, "Paul McCullagh, PrimeBase Technologies GmbH, http://www.primebase.org\n"); + + index_cache_size = ha_set_variable(&pbxt_index_cache_size, &vp_index_cache_size); + record_cache_size = ha_set_variable(&pbxt_record_cache_size, &vp_record_cache_size); + log_cache_size = ha_set_variable(&pbxt_log_cache_size, &vp_log_cache_size); + log_file_threshold = 
ha_set_variable(&pbxt_log_file_threshold, &vp_log_file_threshold); + transaction_buffer_size = ha_set_variable(&pbxt_transaction_buffer_size, &vp_transaction_buffer_size); + log_buffer_size = ha_set_variable(&pbxt_log_buffer_size, &vp_log_buffer_size); + checkpoint_frequency = ha_set_variable(&pbxt_checkpoint_frequency, &vp_checkpoint_frequency); + data_log_threshold = ha_set_variable(&pbxt_data_log_threshold, &vp_data_log_threshold); + data_file_grow_size = ha_set_variable(&pbxt_data_file_grow_size, &vp_data_file_grow_size); + row_file_grow_size = ha_set_variable(&pbxt_row_file_grow_size, &vp_row_file_grow_size); + + xt_db_log_file_threshold = (xtLogOffset) log_file_threshold; + xt_db_log_buffer_size = (size_t) xt_align_offset(log_buffer_size, 512); + xt_db_transaction_buffer_size = (size_t) xt_align_offset(transaction_buffer_size, 512); + xt_db_checkpoint_frequency = (size_t) checkpoint_frequency; + xt_db_data_log_threshold = (off_t) data_log_threshold; + xt_db_data_file_grow_size = (size_t) data_file_grow_size; + xt_db_row_file_grow_size = (size_t) row_file_grow_size; + + pbxt_ignore_case = lower_case_table_names != 0; + if (pbxt_ignore_case) + pbxt_share_tables = xt_new_hashtable(self, ha_hash_comp_ci, ha_hash_ci, ha_hash_free, TRUE, FALSE); + else + pbxt_share_tables = xt_new_hashtable(self, ha_hash_comp, ha_hash, ha_hash_free, TRUE, FALSE); + + xt_thread_wait_init(self); + xt_fs_init(self); + xt_lock_installation(self, mysql_real_data_home); + XTSystemTableShare::startUp(self); + xt_init_databases(self); + xt_ind_init(self, (size_t) index_cache_size); + xt_tc_init(self, (size_t) record_cache_size); + xt_xlog_init(self, (size_t) log_cache_size); +} + +static void pbxt_call_exit(XTThreadPtr self) +{ + xt_logf(XT_NT_INFO, "PrimeBase XT Engine shutdown...\n"); + +#ifdef TRACE_STATEMENTS + xt_dump_trace(); +#endif +#ifdef XT_USE_GLOBAL_DB + xt_ha_close_global_database(self); +#endif +#ifdef DEBUG + //xt_stop_database_threads(self, FALSE); + 
xt_stop_database_threads(self, TRUE); +#else + xt_stop_database_threads(self, TRUE); +#endif + /* This will tell the freeer to quit ASAP: */ + xt_quit_freeer(self); + /* We conditionally stop the freeer here, because if we are + * in startup, then the freeer will be hanging. + * {FREEER-HANG} + * + * This problem has been solved by MySQL! + */ + xt_stop_freeer(self); + xt_exit_databases(self); + XTSystemTableShare::shutDown(self); + xt_xlog_exit(self); + xt_tc_exit(self); + xt_ind_exit(self); + xt_unlock_installation(self, mysql_real_data_home); + xt_fs_exit(self); + xt_thread_wait_exit(self); + if (pbxt_share_tables) { + xt_free_hashtable(self, pbxt_share_tables); + pbxt_share_tables = NULL; + } +} + +/* + * Shutdown the PBXT sub-system. + */ +static void ha_exit(XTThreadPtr self) +{ + /* Wrap things up... */ + xt_unuse_database(self, self); /* Just in case the main thread has a database in use (for testing)? */ + /* This may cause the streaming engine to cleanup connections and + * tables belonging to this engine. This in turn may require some of + * the stuff below (like xt_create_thread() called from pbxt_close_table())! */ +#ifdef XT_STREAMING + xt_exit_streaming(); +#endif + pbxt_call_exit(self); + xt_exit_threading(self); + xt_exit_memory(); + xt_exit_logging(); + xt_p_mutex_destroy(&pbxt_database_mutex); + pbxt_inited = false; +} + +/* + * Output the PBXT status. Return FALSE on error. 
+ */ +static bool pbxt_show_status(handlerton *hton __attribute__((unused)), THD* thd, + stat_print_fn* stat_print, + enum ha_stat_type stat_type __attribute__((unused))) +{ + XTThreadPtr self; + int err = 0; + XTStringBufferRec strbuf = { 0, 0, 0 }; + bool not_ok = FALSE; + + if (!(self = ha_set_current_thread(thd, &err))) + return FALSE; + +#ifdef XT_SHOW_DUMPS_TRACE + //if (pbxt_database) + // xt_dump_xlogs(pbxt_database, 0); + xt_trace("// %s - dump\n", xt_trace_clock_diff(NULL)); + xt_dump_trace(); +#endif + + try_(a) { + myxt_get_status(self, &strbuf); + } + catch_(a) { + not_ok = TRUE; + } + cont_(a); + + if (!not_ok) { + if (stat_print(thd, "PBXT", 4, "", 0, strbuf.sb_cstring, strbuf.sb_len)) + not_ok = TRUE; + } + xt_sb_set_size(self, &strbuf, 0); + + return not_ok; +} + +/* + * Initialize the PBXT sub-system. + * + * return 1 on error, else 0. + */ +static int pbxt_init(void *p) +{ + int init_err = 0; + + XT_TRACE_CALL(); + + if (sizeof(xtWordPS) != sizeof(void *)) { + printf("PBXT: This won't work, I require that sizeof(xtWordPS) == sizeof(void *)!\n"); + XT_RETURN(1); + } + + /* GOTCHA: This will "detect" if we are loading the plug-in + * with a different --with-debug option than MySQL. + * + * In this case, you will get an error when loading the + * library that some symbol was not found. + */ + void *dummy = my_malloc(100, MYF(0)); + my_free((byte *) dummy, MYF(0)); + + if (!pbxt_inited) { + XTThreadPtr self = NULL; + + xt_p_mutex_init_with_autoname(&pbxt_database_mutex, NULL); + + pbxt_hton = (handlerton *) p; + pbxt_hton->state = SHOW_OPTION_YES; +#ifndef DRIZZLED + pbxt_hton->db_type = DB_TYPE_PBXT; // Wow! I have my own! +#endif + pbxt_hton->close_connection = pbxt_close_connection; /* close_connection, cleanup thread related data. 
*/ + pbxt_hton->commit = pbxt_commit; /* commit */ + pbxt_hton->rollback = pbxt_rollback; /* rollback */ + pbxt_hton->create = pbxt_create_handler; /* Create a new handler */ + pbxt_hton->drop_database = pbxt_drop_database; /* Drop a database */ +#ifndef DRIZZLED + pbxt_hton->panic = pbxt_panic; /* Panic call */ +#endif + pbxt_hton->show_status = pbxt_show_status; + pbxt_hton->flags = HTON_NO_FLAGS; /* HTON_CAN_RECREATE - Without this flag TRUNCATE uses delete_all_rows() */ + + if (!xt_init_logging()) /* Initialize logging */ + goto error_1; + +#ifdef XT_STREAMING + if (!xt_init_streaming()) + goto error_2; +#endif + + if (!xt_init_memory()) /* Initialize memory */ + goto error_3; + + /* +7 assumes: + * We are not using multiple databases, and: + * +1 Main thread. + * +1 Compactor thread + * +1 Writer thread + * +1 Checkpointer thread + * +1 Sweeper thread + * +1 Free'er thread + * +1 Temporary thread (e.g. TempForClose, TempForEnd) + */ + self = xt_init_threading(max_connections + 7); /* Create the main self: */ + if (!self) + goto error_4; + + pbxt_inited = true; + + try_(a) { + /* Initialize all systems */ + pbxt_call_init(self); + + /* Conditional unit test: */ +#ifdef XT_UNIT_TEST + //xt_unit_test_create_threads(self); + xt_unit_test_read_write_locks(self); + //xt_unit_test_mutex_locks(self); +#endif + + /* {OPEN-DB-SWEEPER-WAIT} + * I have to start the freeer before I open and recover the database + * because if we run out of cache while waiting for the sweeper + * we will hang! + */ + xt_start_freeer(self); + +#ifdef XT_USE_GLOBAL_DB + /* Open the global database. */ + ASSERT(!pbxt_database); + { + THD *curr_thd = current_thd; + THD *thd = curr_thd; + +#ifndef DRIZZLED + extern myxt_mutex_t LOCK_plugin; + + /* {MYSQL QUIRK} + * I have to release this lock for PBXT recovery to + * work, because it needs to open .frm files. + * So, I unlock, but during INSTALL PLUGIN this is + * risky, because we are in multi-threaded + * mode! 
+ * + * Although, as far as I can tell from the MySQL code, + * INSTALL PLUGIN should still work ok, during + * concurrent access, because we are not + * relying on pointer/memory that may be changed by + * other users. + * + * The only real problem: 2 threads try to load the same + * plugin at the same time. + */ + myxt_mutex_unlock(&LOCK_plugin); +#endif + + /* Can't do this here yet, because I need a THD! */ + try_(b) { + /* {MYSQL QUIRK} + * Sometimes we have a THD, + * sometimes we don't. + * So far, I have noticed that during INSTALL PLUGIN, + * we have one, otherwise not. + */ + if (!curr_thd) { + if (!(thd = (THD *) myxt_create_thread())) + xt_throw(self); + } + + xt_open_database(self, mysql_real_data_home, TRUE); + pbxt_database = self->st_database; + xt_heap_reference(self, pbxt_database); + } + catch_(b) { + if (!curr_thd && thd) + myxt_destroy_thread(thd, FALSE); +#ifndef DRIZZLED + myxt_mutex_lock(&LOCK_plugin); +#endif + xt_throw(self); + } + cont_(b); + + if (!curr_thd) + myxt_destroy_thread(thd, FALSE); +#ifndef DRIZZLED + myxt_mutex_lock(&LOCK_plugin); +#endif + } +#endif + } + catch_(a) { + xt_log_exception(self, &self->t_exception, XT_LOG_DEFAULT); + init_err = 1; + } + cont_(a); + + if (init_err) { + /* {FREEER-HANG} The free-er will be hung in: + #0 0x91fc6a2e in semaphore_wait_signal_trap + #1 0x91fce505 in pthread_mutex_lock + #2 0x00489633 in safe_mutex_lock at thr_mutex.c:149 + #3 0x002dfca9 in plugin_thdvar_init at sql_plugin.cc:2398 + #4 0x000d6a12 in THD::init at sql_class.cc:715 + #5 0x000de9d3 in THD::THD at sql_class.cc:597 + #6 0x000debe1 in THD::THD at sql_class.cc:631 + #7 0x00e207a4 in myxt_create_thread at myxt_xt.cc:2666 + #8 0x00e3134b in tabc_fr_run_thread at tabcache_xt.cc:982 + #9 0x00e422ca in thr_main at thread_xt.cc:1006 + #10 0x91ff7c55 in _pthread_start + #11 0x91ff7b12 in thread_start + * + * so it is not good trying to stop it here! 
+ * + * With regard to this problem, see {OPEN-DB-SWEEPER-WAIT} + * Due to this problem, I will probably have to hack + * the mutex so that the freeer can get started... + * + * NOPE! problem has gone in 6.0.9. Also not a problem in + * 5.1.29. + */ + + /* {OPEN-DB-SWEEPER-WAIT} + * I have to stop the freeer here because it was + * started before opening the database. + */ + pbxt_call_exit(self); + pbxt_inited = FALSE; + xt_exit_threading(self); + goto error_4; + } + xt_free_thread(self); + } + XT_RETURN(init_err); + + error_4: + xt_exit_memory(); + + error_3: +#ifdef XT_STREAMING + xt_exit_streaming(); + + error_2: +#endif + xt_exit_logging(); + + error_1: + xt_p_mutex_destroy(&pbxt_database_mutex); + XT_RETURN(1); +} + +static int pbxt_end(void *p __attribute__((unused))) +{ + XTThreadPtr self; + int err = 0; + + XT_TRACE_CALL(); + + if (pbxt_inited) { + XTExceptionRec e; + + /* This flag also means "shutting down". */ + pbxt_inited = FALSE; + self = xt_create_thread("TempForEnd", FALSE, TRUE, &e); + if (self) { + self->t_main = TRUE; + ha_exit(self); + } + } + + XT_RETURN(err); +} + +#ifndef DRIZZLED +static int pbxt_panic(handlerton *hton, enum ha_panic_function flag) +{ + return pbxt_end(hton); +} +#endif + +/* + * Kill the PBXT thread associated with the MySQL thread. + */ +static int pbxt_close_connection(handlerton *hton, THD* thd) +{ + XTThreadPtr self; +#ifdef XT_STREAMING + XTExceptionRec e; +#endif + + XT_TRACE_CALL(); + if ((self = (XTThreadPtr) *thd_ha_data(thd, hton))) { + *thd_ha_data(thd, hton) = NULL; + /* Required because freeing the thread could cause + * free of database which could call xt_close_file_ns()! + */ + xt_set_self(self); + xt_free_thread(self); + } +#ifdef XT_STREAMING + if (!xt_pbms_close_connection((void *) thd, &e)) + xt_log_exception(NULL, &e, XT_LOG_DEFAULT); +#endif + return 0; +} + +/* + * Currently does nothing because it was all done + * when the last PBXT table was removed from the + * database. 
+ */ +static void pbxt_drop_database(handlerton *hton __attribute__((unused)), char *path __attribute__((unused))) +{ + XT_TRACE_CALL(); +} + +/* + * NOTES ON TRANSACTIONS: + * + * 1. If self->st_lock_count == 0 then the transaction can be ended immediately. + * If not, we must wait until the last lock is released on the last handler + * to ensure that the tables are flushed before the transaction is + * committed or aborted. + * + * 2. all (below) indicates, within a BEGIN/END (i.e. auto_commit off) whether + * the statement or the entire transaction is being terminated. + * We currently ignore statement termination. + * + * 3. If in BEGIN/END we must call ha_rollback() if we abort the transaction + * internally. + */ + +/* + * Commit the PBXT transaction of the given thread. + * thd is the MySQL thread structure. + * pbxt_thr is a pointer to the PBXT thread structure. + * + */ +static int pbxt_commit(handlerton *hton, THD *thd, bool all) +{ + int err = 0; + XTThreadPtr self; + + if ((self = (XTThreadPtr) *thd_ha_data(thd, hton))) { + XT_PRINT1(self, "pbxt_commit all=%d\n", all); + + if (self->st_xact_data) { + /* There are no table locks, commit immediately in all cases + * except when this is a statement commit with an explicit + * transaction (!all && !self->st_auto_commit). 
+ * + * Note, the only reason for a rollback of a operation is + * due to an error. In this case PBXT has already + * undone the effects of the operation. + * + * However, this is not the same as statement rollback + * which can involve a number of operations. + * + * TODO: Implement statement rollback. + */ + if (all || self->st_auto_commit) { + XT_PRINT0(self, "xt_xn_rollback\n"); + if (!xt_xn_rollback(self)) + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, FALSE); + } + } + if (!all) + self->st_stat_trans = FALSE; + } + return 0; +} + +static handler *pbxt_create_handler(handlerton *hton, TABLE_SHARE *table, MEM_ROOT *mem_root) +{ + if (table && XTSystemTableShare::isSystemTable(table->path.str)) + return new (mem_root) ha_xtsys(hton, table); + else + return new (mem_root) ha_pbxt(hton, table); +} + +/* + * ----------------------------------------------------------------------- + * HANDLER LOCKING FUNCTIONS + * + * These functions are used get a lock on all handles of a particular table. 
+ * + */ + +static void ha_add_to_handler_list(XTThreadPtr self, XTSharePtr share, ha_pbxt *handler) +{ + xt_lock_mutex(self, (xt_mutex_type *) share->sh_ex_mutex); + pushr_(xt_unlock_mutex, share->sh_ex_mutex); + + handler->pb_ex_next = share->sh_handlers; + handler->pb_ex_prev = NULL; + if (share->sh_handlers) + share->sh_handlers->pb_ex_prev = handler; + share->sh_handlers = handler; + + freer_(); // xt_unlock_mutex(share->sh_ex_mutex) +} + +static void ha_remove_from_handler_list(XTThreadPtr self, XTSharePtr share, ha_pbxt *handler) +{ + xt_lock_mutex(self, (xt_mutex_type *) share->sh_ex_mutex); + pushr_(xt_unlock_mutex, share->sh_ex_mutex); + + /* Move front pointer: */ + if (share->sh_handlers == handler) + share->sh_handlers = handler->pb_ex_next; + + /* Remove from list: */ + if (handler->pb_ex_prev) + handler->pb_ex_prev->pb_ex_next = handler->pb_ex_next; + if (handler->pb_ex_next) + handler->pb_ex_next->pb_ex_prev = handler->pb_ex_prev; + + freer_(); // xt_unlock_mutex(share->sh_ex_mutex) +} + +/* + * Acquire exclusive use of a table, by waiting for all + * threads to complete use of all handlers of the table. + * At the same time we hold up all threads + * that want to use handlers belonging to the table. + * + * But we do not hold up threads that close the handlers. + */ +static void ha_aquire_exclusive_use(XTThreadPtr self, XTSharePtr share, ha_pbxt *mine) +{ + ha_pbxt *handler; + time_t end_time = time(NULL) + XT_SHARE_LOCK_TIMEOUT / 1000; + + XT_PRINT1(self, "ha_aquire_exclusive_use %s PBXT X lock\n", share->sh_table_path->ps_path); + /* GOTCHA: It is possible to hang here, if you hold + * onto the sh_ex_mutex lock, before we really + * have the exclusive lock (i.e. before all + * handlers are no longer in use). + * The reason is that reopen() is not possible + * when some other thread holds sh_ex_mutex. + * So this can prevent a thread from completing its + * use of a handler, which prevents exclusive use + * here. 
+ */ + xt_lock_mutex(self, (xt_mutex_type *) share->sh_ex_mutex); + pushr_(xt_unlock_mutex, share->sh_ex_mutex); + + /* Wait until we can get an exclusive lock: */ + while (share->sh_table_lock) { + xt_timed_wait_cond(self, (xt_cond_type *) share->sh_ex_cond, (xt_mutex_type *) share->sh_ex_mutex, XT_SHARE_LOCK_WAIT); + if (time(NULL) > end_time) { + freer_(); // xt_unlock_mutex(share->sh_ex_mutex) + xt_throw_taberr(XT_CONTEXT, XT_ERR_LOCK_TIMEOUT, share->sh_table_path); + } + } + + /* This tells readers (and other exclusive lockers) that someone has an exclusive lock. */ + share->sh_table_lock = TRUE; + + /* Wait for all open handlers use count to go to 0 */ + retry: + handler = share->sh_handlers; + while (handler) { + if (handler == mine || !handler->pb_ex_in_use) + handler = handler->pb_ex_next; + else { + /* Wait a bit, and try again: */ + xt_timed_wait_cond(self, (xt_cond_type *) share->sh_ex_cond, (xt_mutex_type *) share->sh_ex_mutex, XT_SHARE_LOCK_WAIT); + if (time(NULL) > end_time) { + freer_(); // xt_unlock_mutex(share->sh_ex_mutex) + xt_throw_taberr(XT_CONTEXT, XT_ERR_LOCK_TIMEOUT, share->sh_table_path); + } + /* Handler may have been freed, check from the beginning again: */ + goto retry; + } + } + + freer_(); // xt_unlock_mutex(share->sh_ex_mutex) +} + +/* + * If you have exclusively locked the table, you can close all handler + * open tables. + * + * Call ha_aquire_exclusive_use() to get an exclusive lock. + */ +static void ha_close_open_tables(XTThreadPtr self, XTSharePtr share, ha_pbxt *mine) +{ + ha_pbxt *handler; + + xt_lock_mutex(self, (xt_mutex_type *) share->sh_ex_mutex); + pushr_(xt_unlock_mutex, share->sh_ex_mutex); + + /* Now that we know no handler is in use, we can close all the + * open tables... 
+ */ + handler = share->sh_handlers; + while (handler) { + if (handler != mine && handler->pb_open_tab) { + xt_db_return_table_to_pool_ns(handler->pb_open_tab); + handler->pb_open_tab = NULL; + } + handler = handler->pb_ex_next; + } + + freer_(); // xt_unlock_mutex(share->sh_ex_mutex) +} + +static void ha_release_exclusive_use(XTThreadPtr self __attribute__((unused)), XTSharePtr share) +{ + XT_PRINT1(self, "ha_release_exclusive_use %s PBXT X UNLOCK\n", share->sh_table_path->ps_path); + xt_lock_mutex_ns((xt_mutex_type *) share->sh_ex_mutex); + share->sh_table_lock = FALSE; + xt_broadcast_cond_ns((xt_cond_type *) share->sh_ex_cond); + xt_unlock_mutex_ns((xt_mutex_type *) share->sh_ex_mutex); +} + +static xtBool ha_wait_for_shared_use(ha_pbxt *mine, XTSharePtr share) +{ + time_t end_time = time(NULL) + XT_SHARE_LOCK_TIMEOUT / 1000; + + XT_PRINT1(xt_get_self(), "ha_wait_for_shared_use %s share lock wait...\n", share->sh_table_path->ps_path); + mine->pb_ex_in_use = 0; + xt_lock_mutex_ns((xt_mutex_type *) share->sh_ex_mutex); + while (share->sh_table_lock) { + /* Wake up the exclusive locker (may be waiting). 
He can try to continue: */ + xt_broadcast_cond_ns((xt_cond_type *) share->sh_ex_cond); + + if (!xt_timed_wait_cond(NULL, (xt_cond_type *) share->sh_ex_cond, (xt_mutex_type *) share->sh_ex_mutex, XT_SHARE_LOCK_WAIT)) { + xt_unlock_mutex_ns((xt_mutex_type *) share->sh_ex_mutex); + return FAILED; + } + + if (time(NULL) > end_time) { + xt_unlock_mutex_ns((xt_mutex_type *) share->sh_ex_mutex); + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_LOCK_TIMEOUT, share->sh_table_path); + return FAILED; + } + } + mine->pb_ex_in_use = 1; + xt_unlock_mutex_ns((xt_mutex_type *) share->sh_ex_mutex); + return OK; +} + +xtPublic int ha_pbxt::reopen() +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + xtBool tabled_opened = FALSE; + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + try_(a) { + xt_ha_open_database_of_table(self, pb_share->sh_table_path); + + ha_open_share(self, pb_share, &tabled_opened); + + if (!(pb_open_tab = xt_db_open_table_using_tab(pb_share->sh_table, self))) + xt_throw(self); + pb_open_tab->ot_thread = self; + + if (tabled_opened) { +#ifdef LOAD_TABLE_ON_OPEN + xt_tab_load_table(self, pb_open_tab); +#else + xt_tab_load_row_pointers(self, pb_open_tab); +#endif + xt_ind_set_index_selectivity(self, pb_open_tab); + /* If the number of rows is less than 150 we will recalculate the + * selectity of the indices, as soon as the number of rows + * exceeds 200 (see [**]) + */ + pb_share->sh_recalc_selectivity = (pb_share->sh_table->tab_row_eof_id - 1 - pb_share->sh_table->tab_row_fnum) < 150; + } + + /* I am not doing this anymore because it was only required + * for DELETE FROM table;, which is now implemented + * by deleting each row. + * TRUNCATE TABLE does not preserve the counter value. 
+ */ + //init_auto_increment(pb_share->sh_min_auto_inc); + init_auto_increment(0); + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + return err; +} + +/* + * ----------------------------------------------------------------------- + * INFORMATION SCHEMA FUNCTIONS + * + */ + +int pbxt_statistics_fill_table(THD *thd, TABLE_LIST *tables, COND *cond) +{ + XTThreadPtr self; + int err = 0; + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + try_(a) { + err = myxt_statistics_fill_table(self, thd, tables, cond, system_charset_info); + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, FALSE); + } + cont_(a); + return err; +} + +ST_FIELD_INFO pbxt_statistics_fields_info[]= +{ + { "ID", 4, MYSQL_TYPE_LONG, 0, 0, "The ID of the statistic", SKIP_OPEN_TABLE}, + { "Name", 40, MYSQL_TYPE_STRING, 0, 0, "The name of the statistic", SKIP_OPEN_TABLE}, + { "Value", 8, MYSQL_TYPE_LONGLONG, 0, 0, "The accumulated value", SKIP_OPEN_TABLE}, + { 0, 0, MYSQL_TYPE_STRING, 0, 0, 0, SKIP_OPEN_TABLE} +}; + +int pbxt_init_statitics(void *p) +{ + ST_SCHEMA_TABLE *schema = (ST_SCHEMA_TABLE *) p; + schema->fields_info = pbxt_statistics_fields_info; + schema->fill_table = pbxt_statistics_fill_table; + +#if defined(XT_WIN) && defined(XT_COREDUMP) + void register_crash_filter(); + + if (pbxt_crash_debug) + register_crash_filter(); +#endif + + return 0; +} + +int pbxt_exit_statitics(void *p __attribute__((unused))) +{ + return(0); +} + +/* + * ----------------------------------------------------------------------- + * DYNAMIC HOOKS + * + */ + +ha_pbxt::ha_pbxt(handlerton *hton, TABLE_SHARE *table_arg) : handler(hton, table_arg) +{ + pb_share = NULL; + pb_open_tab = NULL; + pb_key_read = FALSE; + pb_ignore_dup_key = 0; + pb_lock_table = FALSE; + pb_table_locked = 0; + pb_ex_next = NULL; + pb_ex_prev = NULL; + pb_ex_in_use = 0; + pb_in_stat = FALSE; +} + +/* + * If frm_error() is 
called then we will use this to to find out what file extentions + * exist for the storage engine. This is also used by the default rename_table and + * delete_table method in handler.cc. + */ +const char **ha_pbxt::bas_ext() const +{ + return pbxt_extensions; +} + +/* + * Specify the caching type: HA_CACHE_TBL_NONTRANSACT, HA_CACHE_TBL_NOCACHE + * HA_CACHE_TBL_ASKTRANSACT, HA_CACHE_TBL_TRANSACT + */ +MX_UINT8_T ha_pbxt::table_cache_type() +{ + return HA_CACHE_TBL_TRANSACT; /* Use transactional query cache */ +} + +MX_TABLE_TYPES_T ha_pbxt::table_flags() const +{ + return ( + /* We need this flag because records are not packed + * into a table which means #ROWID != offset + */ + HA_REC_NOT_IN_SEQ | + /* Since PBXT caches read records itself, I believe + * this to be the case. + */ + HA_FAST_KEY_READ | + /* + * I am assuming a "key" means a unique index. + * Of course a primary key does not allow nulls. + */ + HA_NULL_IN_KEY | + /* + * This is necessary because a MySQL blob can be + * fairly small. + */ + HA_CAN_INDEX_BLOBS | + /* + * Due to transactional influences, this will be + * the case. + * Although the count is good enough for practical + * purposes! + HA_NOT_EXACT_COUNT | + */ + /* + * This basically means we have a file with the name of + * database table (which we do). + */ + HA_FILE_BASED | + /* + * Not sure what this does (but MyISAM and InnoDB have it)?! + * Could it mean that we support the handler functions. + */ + HA_CAN_SQL_HANDLER | + /* + * This is not true, we cannot insert delayed, but a + * really cannot see what's wrong with inserting normally + * when asked to insert delayed! + * And the functionallity is required to pass the alter_table + * test. + * + * Disabled because of MySQL bug #40505 + */ + /*HA_CAN_INSERT_DELAYED |*/ +#if MYSQL_VERSION_ID > 50119 + /* We can do row logging, but not statement, because + * MVCC is not serializable! + */ + HA_BINLOG_ROW_CAPABLE | +#endif + /* + * Auto-increment is allowed on a partial key. 
+ */ + HA_AUTO_PART_KEY); +} + +/* + * The following query from the DBT1 test is VERY slow + * if we do not set HA_READ_ORDER. + * The reason is that it must scan all duplicates, then + * sort. + * + * SELECT o_id, o_carrier_id, o_entry_d, o_ol_cnt + * FROM orders FORCE INDEX (o_w_id) + * WHERE o_w_id = 2 + * AND o_d_id = 1 + * AND o_c_id = 500 + * ORDER BY o_id DESC limit 1; + * + */ +#define FLAGS_ARE_READ_DYNAMICALLY + +MX_ULONG_T ha_pbxt::index_flags(uint inx __attribute__((unused)), uint part __attribute__((unused)), bool all_parts __attribute__((unused))) const +{ + /* It would be nice if the dynamic version of this function works, + * but it does not. MySQL loads this information when the table is openned, + * and then it is fixed. + * + * The problem is, I have had to remove the HA_READ_ORDER option although + * it applies to PBXT. PBXT returns entries in index order during an index + * scan in _almost_ all cases. + * + * A number of cases are demostrated here: [(11)] + * + * If involves the following conditions: + * - a SELECT FOR UPDATE, UPDATE or DELETE statement + * - an ORDER BY, or join that requires the sort order + * - another transaction which updates the index while it is being + * scanned. + * + * In this "obscure" case, the index scan may return index + * entries in the wrong order. + */ +#ifdef FLAGS_ARE_READ_DYNAMICALLY + /* If were are in an update (SELECT FOR UPDATE, UPDATE or DELETE), then + * it may be that we return the rows from an index in the wrong + * order! This is due to the fact that update reads wait for transactions + * to commit and this means that index entries may change position during + * the scan! + */ + if (pb_open_tab && pb_open_tab->ot_for_update) + return (HA_READ_NEXT | HA_READ_PREV | HA_READ_RANGE | HA_KEYREAD_ONLY); + /* If I understand HA_KEYREAD_ONLY then this means I do not + * need to fetch the record associated with an index + * key. 
+ */ + return (HA_READ_NEXT | HA_READ_PREV | HA_READ_ORDER | HA_READ_RANGE | HA_KEYREAD_ONLY); +#else + return (HA_READ_NEXT | HA_READ_PREV | HA_READ_RANGE | HA_KEYREAD_ONLY); +#endif +} + +void ha_pbxt::internal_close(THD *thd, struct XTThread *self) +{ + if (pb_share) { + xtBool removed; + XTOpenTablePtr ot; + + try_(a) { + /* This lock must be held when we remove the handler's + * open table because ha_close_open_tables() can run + * concurrently. + */ + xt_lock_mutex_ns(pb_share->sh_ex_mutex); + if ((ot = pb_open_tab)) { + pb_open_tab->ot_thread = self; + if (self->st_database != pb_open_tab->ot_table->tab_db) + xt_ha_open_database_of_table(self, pb_share->sh_table_path); + pb_open_tab = NULL; + pushr_(xt_db_return_table_to_pool, ot); + } + xt_unlock_mutex_ns(pb_share->sh_ex_mutex); + + ha_remove_from_handler_list(self, pb_share, this); + + /* Someone may be waiting for me to complete: */ + xt_broadcast_cond_ns((xt_cond_type *) pb_share->sh_ex_cond); + + removed = ha_unget_share_removed(self, pb_share); + + if (ot) { + /* Flush the table if this was the last handler: */ + /* This is not necessary but has the affect that + * FLUSH TABLES; does a checkpoint! + */ + if (removed) { + /* GOTCHA: + * This was killing performance as the number of threads increased! + * + * When MySQL runs out of table handlers because the table + * handler cache is too small, it starts to close handlers. + * (open_cache.records > table_cache_size) + * + * Which can lead to closing all handlers for a particular table. + * + * It does this while holding lock_OPEN! + * So this code below leads to a sync operation while lock_OPEN + * is held. The result is that the whole server comes to a stop. + */ + if (!thd || thd_sql_command(thd) == SQLCOM_FLUSH) // FLUSH TABLES + xt_sync_flush_table(self, ot); + } + freer_(); // xt_db_return_table_to_pool(ot); + } + } + catch_(a) { + xt_log_and_clear_exception(self); + } + cont_(a); + + pb_share = NULL; + } +} + +/* + * Used for opening tables. 
The name will be the name of the file. + * A table is opened when it needs to be opened. For instance + * when a request comes in for a select on the table (tables are not + * open and closed for each request, they are cached). + + * Called from handler.cc by handler::ha_open(). The server opens all tables by + * calling ha_open() which then calls the handler specific open(). + */ +int ha_pbxt::open(const char *table_path, int mode __attribute__((unused)), uint test_if_locked __attribute__((unused))) +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + xtBool tabled_opened = FALSE; + + ref_length = XT_RECORD_OFFS_SIZE; + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + XT_PRINT1(self, "ha_pbxt::open %s\n", table_path); + + pb_ex_in_use = 1; + try_(a) { + xt_ha_open_database_of_table(self, (XTPathStrPtr) table_path); + + pb_share = ha_get_share(self, table_path, true, &tabled_opened); + ha_add_to_handler_list(self, pb_share, this); + if (pb_share->sh_table_lock) { + if (!ha_wait_for_shared_use(this, pb_share)) + xt_throw(self); + } + + ha_open_share(self, pb_share, &tabled_opened); + + thr_lock_data_init(&pb_share->sh_lock, &pb_lock, NULL); + if (!(pb_open_tab = xt_db_open_table_using_tab(pb_share->sh_table, self))) + xt_throw(self); + pb_open_tab->ot_thread = self; + + if (tabled_opened) { +#ifdef LOAD_TABLE_ON_OPEN + xt_tab_load_table(self, pb_open_tab); +#else + xt_tab_load_row_pointers(self, pb_open_tab); +#endif + xt_ind_set_index_selectivity(self, pb_open_tab); + pb_share->sh_recalc_selectivity = (pb_share->sh_table->tab_row_eof_id - 1 - pb_share->sh_table->tab_row_fnum) < 150; + } + + init_auto_increment(0); + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + internal_close(thd, self); + } + cont_(a); + + if (!err) + info(HA_STATUS_NO_LOCK | HA_STATUS_VARIABLE | HA_STATUS_CONST); + + pb_ex_in_use = 0; + if (pb_share) { + /* Someone may be waiting for me to 
complete: */ + if (pb_share->sh_table_lock) + xt_broadcast_cond_ns((xt_cond_type *) pb_share->sh_ex_cond); + } + return err; +} + + +/* + Closes a table. We call the free_share() function to free any resources + that we have allocated in the "shared" structure. + + Called from sql_base.cc, sql_select.cc, and table.cc. + In sql_select.cc it is only used to close up temporary tables or during + the process where a temporary table is converted over to being a + myisam table. + For sql_base.cc look at close_data_tables(). +*/ +int ha_pbxt::close(void) +{ + THD *thd = current_thd; + volatile int err = 0; + volatile XTThreadPtr self; + + if (thd) + self = ha_set_current_thread(thd, (int *) &err); + else { + XTExceptionRec e; + + if (!(self = xt_create_thread("TempForClose", FALSE, TRUE, &e))) { + xt_log_exception(NULL, &e, XT_LOG_DEFAULT); + return 0; + } + } + + XT_PRINT1(self, "ha_pbxt::close %s\n", pb_share && pb_share->sh_table_path->ps_path ? pb_share->sh_table_path->ps_path : "unknown"); + + if (self) { + try_(a) { + internal_close(thd, self); + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + if (!thd) + xt_free_thread(self); + } + else + xt_log(XT_NS_CONTEXT, XT_LOG_WARNING, "Unable to release table reference\n"); + + return err; +} + +void ha_pbxt::init_auto_increment(xtWord8 min_auto_inc) +{ + XTTableHPtr tab; + xtWord8 nr = 1; + int err; + + /* Get the value of the auto-increment value by + * loading the highest value from the index... + */ + tab = pb_open_tab->ot_table; + + /* Cannot do this if the index version is bad! */ + if (tab->tab_dic.dic_disable_index) + return; + + xt_spinlock_lock(&tab->tab_ainc_lock); + if (table->found_next_number_field && !tab->tab_auto_inc) { + Field *tmp_fie = table->next_number_field; + THD *tmp_thd = table->in_use; + xtBool xn_started = FALSE; + XTThreadPtr self = pb_open_tab->ot_thread; + + /* + * A table may be opened by a thread with a running + * transaction! 
+ * Since get_auto_increment() does not do an update, + * it should be OK to use the transaction we already + * have to get the next auto-increment value. + */ + if (!self->st_xact_data) { + self->st_xact_mode = XT_XACT_REPEATABLE_READ; + self->st_ignore_fkeys = FALSE; + self->st_auto_commit = TRUE; + self->st_table_trans = FALSE; + self->st_abort_trans = FALSE; + self->st_stat_ended = FALSE; + self->st_stat_trans = FALSE; + self->st_is_update = FALSE; + if (!xt_xn_begin(self)) { + xt_spinlock_unlock(&tab->tab_ainc_lock); + xt_throw(self); + } + xn_started = TRUE; + } + + /* Setup the conditions for the next call! */ + table->in_use = current_thd; + table->next_number_field = table->found_next_number_field; + + extra(HA_EXTRA_KEYREAD); + table->mark_columns_used_by_index_no_reset(TS(table)->next_number_index, table->read_set); + column_bitmaps_signal(); + index_init(TS(table)->next_number_index, 0); + if (!TS(table)->next_number_key_offset) { + // Autoincrement at key-start + err = index_last(table->record[1]); + if (!err) + nr = (xtWord8) table->next_number_field->val_int_offset(TS(table)->rec_buff_length)+1; + } + else { + /* Do an index scan to find the largest value! */ + /* The standard method will not work because it forces + * us to lock that table! 
+ */ + xtWord8 val; + + err = index_first(table->record[1]); + while (!err) { + val = (xtWord8) table->next_number_field->val_int_offset(TS(table)->rec_buff_length)+1; + if (val > nr) + nr = val; + err = index_next(table->record[1]); + } + } + + index_end(); + extra(HA_EXTRA_NO_KEYREAD); + + tab->tab_auto_inc = nr; + if (tab->tab_auto_inc < tab->tab_dic.dic_min_auto_inc) + tab->tab_auto_inc = tab->tab_dic.dic_min_auto_inc; + if (tab->tab_auto_inc < min_auto_inc) + tab->tab_auto_inc = min_auto_inc; + + /* Restore the changed values: */ + table->next_number_field = tmp_fie; + table->in_use = tmp_thd; + + if (xn_started) + xt_xn_commit(self); + } + xt_spinlock_unlock(&tab->tab_ainc_lock); +} + +void ha_pbxt::get_auto_increment(MX_ULONGLONG_T offset, MX_ULONGLONG_T increment, + MX_ULONGLONG_T nb_desired_values __attribute__((unused)), + MX_ULONGLONG_T *first_value, + MX_ULONGLONG_T *nb_reserved_values __attribute__((unused))) +{ + register XTTableHPtr tab; + MX_ULONGLONG_T nr, nr_plus_inc; + + ASSERT_NS(pb_ex_in_use); + + tab = pb_open_tab->ot_table; + + xt_spinlock_lock(&tab->tab_ainc_lock); + nr = (MX_ULONGLONG_T) tab->tab_auto_inc; + if (nr < offset) + nr = offset; + else if (increment > 1 && ((nr - offset) % increment) != 0) + nr += increment - ((nr - offset) % increment); + nr_plus_inc = nr + increment; + if (table->next_number_field->cmp((const unsigned char *)&nr, (const unsigned char *)&nr_plus_inc) < 0) + tab->tab_auto_inc = (xtWord8) (nr_plus_inc); + else + nr = ~0; /* indicate error to the caller */ + xt_spinlock_unlock(&tab->tab_ainc_lock); + + *first_value = nr; + *nb_reserved_values = 1; +} + +/* GOTCHA: We need to use signed value here because of the test + * (from auto_increment.test): + * create table t1 (a int not null auto_increment primary key); + * insert into t1 values (NULL); + * insert into t1 values (-1); + * insert into t1 values (NULL); + */ +void ha_pbxt::set_auto_increment(Field *nr) +{ + register XTTableHPtr tab; + MX_ULONGLONG_T 
nr_int_val; + + nr_int_val = nr->val_int(); + tab = pb_open_tab->ot_table; + + if (nr->cmp((const unsigned char *)&tab->tab_auto_inc) > 0) { + xt_spinlock_lock(&tab->tab_ainc_lock); + + if (nr->cmp((const unsigned char *)&tab->tab_auto_inc) > 0) { + MX_ULONGLONG_T nr_int_val_plus_one = nr_int_val + 1; + if (nr->cmp((const unsigned char *)&nr_int_val_plus_one) < 0) + tab->tab_auto_inc = nr_int_val_plus_one; + else + tab->tab_auto_inc = nr_int_val; + } + xt_spinlock_unlock(&tab->tab_ainc_lock); + } + + if (xt_db_auto_increment_mode == 1) { + if (nr_int_val > (MX_ULONGLONG_T) tab->tab_dic.dic_min_auto_inc) { + /* Do this every 100 calls: */ +#ifdef DEBUG + tab->tab_dic.dic_min_auto_inc = nr_int_val + 5; +#else + tab->tab_dic.dic_min_auto_inc = nr_int_val + 100; +#endif + pb_open_tab->ot_thread = xt_get_self(); + if (!xt_tab_write_min_auto_inc(pb_open_tab)) + xt_log_and_clear_exception(pb_open_tab->ot_thread); + } + } +} + +/* +static void dump_buf(unsigned char *buf, int len) +{ + int i; + + for (i=0; i<len; i++) printf("%2c", buf[i] <= 127 ? buf[i] : '.'); + printf("\n"); + for (i=0; i<len; i++) printf("%02x", buf[i]); + printf("\n"); +} +*/ + +/* + * write_row() inserts a row. No extra() hint is given currently if a bulk load + * is happeneding. buf() is a byte array of data. You can use the field + * information to extract the data from the native byte array type. + * Example of this would be: + * for (Field **field=table->field ; *field ; field++) + * { + * ... + * } + + * See ha_tina.cc for an example of extracting all of the data as strings. + * ha_berekly.cc has an example of how to store it intact by "packing" it + * for ha_berkeley's own native storage type. + + * See the note for update_row() on auto_increments and timestamps. This + * case also applied to write_row(). + + * Called from item_sum.cc, item_sum.cc, sql_acl.cc, sql_insert.cc, + * sql_insert.cc, sql_select.cc, sql_table.cc, sql_udf.cc, and sql_update.cc. 
+ */ +int ha_pbxt::write_row(byte *buf) +{ + int err = 0; + + ASSERT_NS(pb_ex_in_use); + + XT_PRINT1(pb_open_tab->ot_thread, "ha_pbxt::write_row %s\n", pb_share->sh_table_path->ps_path); + XT_DISABLED_TRACE(("INSERT tx=%d val=%d\n", (int) pb_open_tab->ot_thread->st_xact_data->xd_start_xn_id, (int) XT_GET_DISK_4(&buf[1]))); + //statistic_increment(ha_write_count,&LOCK_status); + + /* GOTCHA: I have a huge problem with the transaction statement. + * It is not ALWAYS committed (I mean ha_commit_trans() is + * not always called - for example in SELECT). + * + * If I call trans_register_ha() but ha_commit_trans() is not called + * then MySQL thinks a transaction is still running (while + * I have committed the auto-transaction in ha_pbxt::external_lock()). + * + * This causes all kinds of problems, like transactions + * are killed when they should not be. + * + * To prevent this, I only inform MySQL that a transaction + * has beens started when an update is performed. I have determined that + * ha_commit_trans() is only guarenteed to be called if an update is done. + */ + if (!pb_open_tab->ot_thread->st_stat_trans) { + trans_register_ha(pb_mysql_thd, FALSE, pbxt_hton); + XT_PRINT0(pb_open_tab->ot_thread, "ha_pbxt::write_row trans_register_ha all=FALSE\n"); + pb_open_tab->ot_thread->st_stat_trans = TRUE; + } + + xt_xlog_check_long_writer(pb_open_tab->ot_thread); + + if (table->timestamp_field_type & TIMESTAMP_AUTO_SET_ON_INSERT) + table->timestamp_field->set_time(); + + if (table->next_number_field && buf == table->record[0]) { + int update_err = update_auto_increment(); + if (update_err) { + ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + return update_err; + } + set_auto_increment(table->next_number_field); + } + + if (!xt_tab_new_record(pb_open_tab, (xtWord1 *) buf)) { + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + + /* + * This is needed to allow the same row to be updated multiple times in case of bulk REPLACE. 
+ * This happens during execution of LOAD DATA...REPLACE MySQL first tries to INSERT the row + * and if it gets dup-key error it tries UPDATE, so the same row can be overwriten multiple + * times within the same statement + */ + if (err == HA_ERR_FOUND_DUPP_KEY && pb_open_tab->ot_thread->st_is_update) + pb_open_tab->ot_thread->st_update_id++; + } + + return err; +} + +#ifdef UNUSED_CODE +static int equ_bin(const byte *a, const char *b) +{ + while (*a && *b) { + if (*a != *b) + return 0; + a++; + b++; + } + return 1; +} +static void dump_bin(const byte *a_in, int offset, int len_in) +{ + const byte *a = a_in; + int len = len_in; + + a += offset; + while (len > 0) { + xt_trace("%02X", (int) *a); + a++; + len--; + } + xt_trace("=="); + a = a_in; + len = len_in; + a += offset; + while (len > 0) { + xt_trace("%c", (*a > 8 && *a < 127) ? *a : '.'); + a++; + len--; + } + xt_trace("\n"); +} +#endif + +/* + * Yes, update_row() does what you expect, it updates a row. old_data will have + * the previous row record in it, while new_data will have the newest data in + * it. Keep in mind that the server can do updates based on ordering if an ORDER BY + * clause was used. Consecutive ordering is not guarenteed. + * + * Called from sql_select.cc, sql_acl.cc, sql_update.cc, and sql_insert.cc. 
+ */ +int ha_pbxt::update_row(const byte * old_data, byte * new_data) +{ + int err = 0; + register XTThreadPtr self = pb_open_tab->ot_thread; + + ASSERT_NS(pb_ex_in_use); + + XT_PRINT1(self, "ha_pbxt::update_row %s\n", pb_share->sh_table_path->ps_path); + XT_DISABLED_TRACE(("UPDATE tx=%d val=%d\n", (int) self->st_xact_data->xd_start_xn_id, (int) XT_GET_DISK_4(&new_data[1]))); + //statistic_increment(ha_update_count,&LOCK_status); + + if (!self->st_stat_trans) { + trans_register_ha(pb_mysql_thd, FALSE, pbxt_hton); + XT_PRINT0(self, "ha_pbxt::update_row trans_register_ha all=FALSE\n"); + self->st_stat_trans = TRUE; + } + + xt_xlog_check_long_writer(self); + + if (!self->st_is_update) { + self->st_is_update = TRUE; + self->st_update_id++; + } + + if (table->timestamp_field_type & TIMESTAMP_AUTO_SET_ON_UPDATE) + table->timestamp_field->set_time(); + + /* GOTCHA: We need to check the auto-increment value on update + * because of the following test (which fails for InnoDB) - + * auto_increment.test: + * create table t1 (a int not null auto_increment primary key, val int); + * insert into t1 (val) values (1); + * update t1 set a=2 where a=1; + * insert into t1 (val) values (1); + */ + if (table->found_next_number_field && new_data == table->record[0]) { + MX_LONGLONG_T nr; + my_bitmap_map *old_map; + + old_map = mx_tmp_use_all_columns(table, table->read_set); + nr = table->found_next_number_field->val_int(); + set_auto_increment(table->found_next_number_field); + mx_tmp_restore_column_map(table, old_map); + } + + if (!xt_tab_update_record(pb_open_tab, (xtWord1 *) old_data, (xtWord1 *) new_data)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + + pb_open_tab->ot_table->tab_locks.xt_remove_temp_lock(pb_open_tab, TRUE); + + return err; +} + +/* + * This will delete a row. buf will contain a copy of the row to be deleted. + * The server will call this right after the current row has been called (from + * either a previous rnd_next() or index call). 
+ * + * Called in sql_acl.cc and sql_udf.cc to manage internal table information. + * Called in sql_delete.cc, sql_insert.cc, and sql_select.cc. In sql_select it is + * used for removing duplicates while in insert it is used for REPLACE calls. +*/ +int ha_pbxt::delete_row(const byte * buf) +{ + int err = 0; + + ASSERT_NS(pb_ex_in_use); + + XT_PRINT1(pb_open_tab->ot_thread, "ha_pbxt::delete_row %s\n", pb_share->sh_table_path->ps_path); + XT_DISABLED_TRACE(("DELETE tx=%d val=%d\n", (int) pb_open_tab->ot_thread->st_xact_data->xd_start_xn_id, (int) XT_GET_DISK_4(&buf[1]))); + //statistic_increment(ha_delete_count,&LOCK_status); + + if (!pb_open_tab->ot_thread->st_stat_trans) { + trans_register_ha(pb_mysql_thd, FALSE, pbxt_hton); + XT_PRINT0(pb_open_tab->ot_thread, "ha_pbxt::delete_row trans_register_ha all=FALSE\n"); + pb_open_tab->ot_thread->st_stat_trans = TRUE; + } + + xt_xlog_check_long_writer(pb_open_tab->ot_thread); + + if (!xt_tab_delete_record(pb_open_tab, (xtWord1 *) buf)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + + pb_open_tab->ot_table->tab_locks.xt_remove_temp_lock(pb_open_tab, TRUE); + + return err; +} + +/* + * ----------------------------------------------------------------------- + * INDEX METHODS + */ + +/* + * This looks like a hack, but actually, it is OK. + * It depends on the setup done by the super-class. It involves an extra + * range check that we need to do if a "new" record is returned during + * an index scan. + * + * A new record is returned if a row is updated (by another transaction) + * during the index scan. If an update is detected, then the scan stops + * and waits for the transaction to end. + * + * If the transaction commits, then the updated row is returned instead + * of the row it would have returned when doing a consistent read + * (repeatable read). + * + * These new records can appear out of index order, and may not even + * belong to the index range that we are concerned with. 
+ * + * Notice that there is not check for the start of the range. It appears + * that this is not necessary, MySQL seems to have no problem ignoring + * such values. + * + * A number of test have been given below which demonstrate the use + * of the function. + * + * They also demonstrate the ORDER BY problem described here: [(11)]. + * + * DROP TABLE IF EXISTS test_tab, test_tab_1, test_tab_2; + * CREATE TABLE test_tab (ID int primary key, Value int, Name varchar(20), index(Value, Name)) ENGINE=pbxt; + * INSERT test_tab values(1, 1, 'A'); + * INSERT test_tab values(2, 1, 'B'); + * INSERT test_tab values(3, 1, 'C'); + * INSERT test_tab values(4, 2, 'D'); + * INSERT test_tab values(5, 2, 'E'); + * INSERT test_tab values(6, 2, 'F'); + * INSERT test_tab values(7, 2, 'G'); + * + * select * from test_tab where value = 1 order by value, name for update; + * + * -- Test: 1 + * -- C1 + * begin; + * select * from test_tab where id = 5 for update; + * + * -- C2 + * begin; + * select * from test_tab where value = 2 order by value, name for update; + * + * -- C1 + * update test_tab set value = 3 where id = 6; + * commit; + * + * -- Test: 2 + * -- C1 + * begin; + * select * from test_tab where id = 5 for update; + * + * -- C2 + * begin; + * select * from test_tab where value >= 2 order by value, name for update; + * + * -- C1 + * update test_tab set value = 3 where id = 6; + * commit; + * + * -- Test: 3 + * -- C1 + * begin; + * select * from test_tab where id = 5 for update; + * + * -- C2 + * begin; + * select * from test_tab where value = 2 order by value, name for update; + * + * -- C1 + * update test_tab set value = 1 where id = 6; + * commit; + */ + +int ha_pbxt::xt_index_in_range(register XTOpenTablePtr ot __attribute__((unused)), register XTIndexPtr ind, + register XTIdxSearchKeyPtr search_key, xtWord1 *buf) +{ + /* If search key is given, this means we want an exact match. 
*/ + if (search_key) { + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE]; + + myxt_create_key_from_row(ind, key_buf, buf, NULL); + search_key->sk_on_key = myxt_compare_key(ind, search_key->sk_key_value.sv_flags, search_key->sk_key_value.sv_length, + search_key->sk_key_value.sv_key, key_buf) == 0; + return search_key->sk_on_key; + } + + /* Otherwise, check the end of the range. */ + if (end_range) + return compare_key(end_range) <= 0; + return 1; +} + +int ha_pbxt::xt_index_next_read(register XTOpenTablePtr ot, register XTIndexPtr ind, xtBool key_only, + register XTIdxSearchKeyPtr search_key, byte *buf) +{ + xt_xlog_check_long_writer(ot->ot_thread); + + if (key_only) { + /* We only need to read the data from the key: */ + while (ot->ot_curr_rec_id) { + if (search_key && !search_key->sk_on_key) + break; + + switch (xt_tab_visible(ot)) { + case FALSE: + if (xt_idx_next(ot, ind, search_key)) + break; + case XT_ERR: + goto failed; + case XT_NEW: + if (!xt_idx_read(ot, ind, (xtWord1 *) buf)) + goto failed; + if (xt_index_in_range(ot, ind, search_key, buf)) { + return 0; + } + if (!xt_idx_next(ot, ind, search_key)) + goto failed; + break; + case XT_RETRY: + /* We cannot start from the beginning again, if we have + * already output rows! + * And we need the orginal search key. + * + * The case in which this occurs is: + * + * T1: UPDATE tbl_file SET GlobalID = 'DBCD5C4514210200825501089884844_6M' WHERE ID = 39 + * Locks a particular row. + * + * T2: SELECT ID,Flags FROM tbl_file WHERE SpaceID = 1 AND Path = '/zi/America/' AND + * Name = 'Cuiaba' AND Flags IN ( 0,1,4,5 ) FOR UPDATE + * scans the index and stops on the lock (of the before image) above. + * + * T1 quits, the sweeper deletes the record updated by T1?! + * BUG: Cleanup should wait until T2 is complete! + * + * T2 continues, and returns XT_RETRY. + * + * At this stage T2 has already returned some rows, so it may not retry from the + * start. Instead it tries to locate the last record it tried to lock. 
+ * This record is gone (or not visible), so it finds the next one. + * + * POTENTIAL BUG: If cleanup does not wait until T2 is complete, then + * I may miss the update record, if it is moved before the index scan + * position. + */ + if (!pb_ind_row_count && search_key) { + if (!xt_idx_search(pb_open_tab, ind, search_key)) + return ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + } + else { + if (!xt_idx_research(pb_open_tab, ind)) + goto failed; + } + break; + default: + if (!xt_idx_read(ot, ind, (xtWord1 *) buf)) + goto failed; + return 0; + } + } + } + else { + while (ot->ot_curr_rec_id) { + if (search_key && !search_key->sk_on_key) + break; + + switch (xt_tab_read_record(ot, (xtWord1 *) buf)) { + case FALSE: + XT_DISABLED_TRACE(("not visi tx=%d rec=%d\n", (int) ot->ot_thread->st_xact_data->xd_start_xn_id, (int) ot->ot_curr_rec_id)); + if (xt_idx_next(ot, ind, search_key)) + break; + case XT_ERR: + goto failed; + case XT_NEW: + if (xt_index_in_range(ot, ind, search_key, buf)) + return 0; + if (!xt_idx_next(ot, ind, search_key)) + goto failed; + break; + case XT_RETRY: + if (!pb_ind_row_count && search_key) { + if (!xt_idx_search(pb_open_tab, ind, search_key)) + return ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + } + else { + if (!xt_idx_research(pb_open_tab, ind)) + goto failed; + } + break; + default: + XT_DISABLED_TRACE(("visible tx=%d rec=%d\n", (int) ot->ot_thread->st_xact_data->xd_start_xn_id, (int) ot->ot_curr_rec_id)); + return 0; + } + } + } + return HA_ERR_END_OF_FILE; + + failed: + return ha_log_pbxt_thread_error_for_mysql(FALSE); +} + +int ha_pbxt::xt_index_prev_read(XTOpenTablePtr ot, XTIndexPtr ind, xtBool key_only, + register XTIdxSearchKeyPtr search_key, byte *buf) +{ + if (key_only) { + /* We only need to read the data from the key: */ + while (ot->ot_curr_rec_id) { + if (search_key && !search_key->sk_on_key) + break; + + switch (xt_tab_visible(ot)) { + case FALSE: + if (xt_idx_prev(ot, ind, search_key)) + break; + case 
XT_ERR: + goto failed; + case XT_NEW: + if (!xt_idx_read(ot, ind, (xtWord1 *) buf)) + goto failed; + if (xt_index_in_range(ot, ind, search_key, buf)) + return 0; + if (!xt_idx_next(ot, ind, search_key)) + goto failed; + break; + case XT_RETRY: + if (!pb_ind_row_count && search_key) { + if (!xt_idx_search_prev(pb_open_tab, ind, search_key)) + return ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + } + else { + if (!xt_idx_research(pb_open_tab, ind)) + goto failed; + } + break; + default: + if (!xt_idx_read(ot, ind, (xtWord1 *) buf)) + goto failed; + return 0; + } + } + } + else { + /* We need to read the entire record: */ + while (ot->ot_curr_rec_id) { + if (search_key && !search_key->sk_on_key) + break; + + switch (xt_tab_read_record(ot, (xtWord1 *) buf)) { + case FALSE: + if (xt_idx_prev(ot, ind, search_key)) + break; + case XT_ERR: + goto failed; + case XT_NEW: + if (xt_index_in_range(ot, ind, search_key, buf)) + return 0; + if (!xt_idx_next(ot, ind, search_key)) + goto failed; + break; + case XT_RETRY: + if (!pb_ind_row_count && search_key) { + if (!xt_idx_search_prev(pb_open_tab, ind, search_key)) + return ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + } + else { + if (!xt_idx_research(pb_open_tab, ind)) + goto failed; + } + break; + default: + return 0; + } + } + } + return HA_ERR_END_OF_FILE; + + failed: + return ha_log_pbxt_thread_error_for_mysql(FALSE); +} + +int ha_pbxt::index_init(uint idx, bool sorted __attribute__((unused))) +{ + XTIndexPtr ind; + + /* select count(*) from smalltab_PBXT; + * ignores the error below, and continues to + * call index_first! 
+ */ + active_index = idx; + + if (pb_open_tab->ot_table->tab_dic.dic_disable_index) { + xt_tab_set_index_error(pb_open_tab->ot_table); + return ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + } + + /* The number of columns required: */ + if (pb_open_tab->ot_is_modify) { + pb_open_tab->ot_cols_req = table->read_set->n_bits; +#ifdef XT_PRINT_INDEX_OPT + ind = (XTIndexPtr) pb_share->sh_dic_keys[idx]; + + printf("index_init %s index %d cols req=%d/%d read_bits=%X write_bits=%X index_bits=%X\n", pb_open_tab->ot_table->tab_name->ps_path, (int) idx, pb_open_tab->ot_cols_req, pb_open_tab->ot_cols_req, (int) *table->read_set->bitmap, (int) *table->write_set->bitmap, (int) *ind->mi_col_map.bitmap); +#endif + } + else { + pb_open_tab->ot_cols_req = ha_get_max_bit(table->read_set); + + /* Check for index coverage! + * + * Given the following table: + * + * CREATE TABLE `customer` ( + * `c_id` int(11) NOT NULL DEFAULT '0', + * `c_d_id` int(11) NOT NULL DEFAULT '0', + * `c_w_id` int(11) NOT NULL DEFAULT '0', + * `c_first` varchar(16) DEFAULT NULL, + * `c_middle` char(2) DEFAULT NULL, + * `c_last` varchar(16) DEFAULT NULL, + * `c_street_1` varchar(20) DEFAULT NULL, + * `c_street_2` varchar(20) DEFAULT NULL, + * `c_city` varchar(20) DEFAULT NULL, + * `c_state` char(2) DEFAULT NULL, + * `c_zip` varchar(9) DEFAULT NULL, + * `c_phone` varchar(16) DEFAULT NULL, + * `c_since` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, + * `c_credit` char(2) DEFAULT NULL, + * `c_credit_lim` decimal(24,12) DEFAULT NULL, + * `c_discount` double DEFAULT NULL, + * `c_balance` decimal(24,12) DEFAULT NULL, + * `c_ytd_payment` decimal(24,12) DEFAULT NULL, + * `c_payment_cnt` double DEFAULT NULL, + * `c_delivery_cnt` double DEFAULT NULL, + * `c_data` text, + * PRIMARY KEY (`c_w_id`,`c_d_id`,`c_id`), + * KEY `c_w_id` (`c_w_id`,`c_d_id`,`c_last`,`c_first`,`c_id`) + * ) ENGINE=PBXT; + * + * MySQL does not recognize index coverage on the followin select: + * + * SELECT c_id 
FROM customer WHERE c_w_id = 3 AND c_d_id = 8 AND + * c_last = 'EINGATIONANTI' ORDER BY c_first ASC LIMIT 1; + * + * TODO: Find out why this is necessary, MyISAM does not + * seem to have this problem! + */ + ind = (XTIndexPtr) pb_share->sh_dic_keys[idx]; + if (bitmap_is_subset(table->read_set, &ind->mi_col_map)) + pb_key_read = TRUE; +#ifdef XT_PRINT_INDEX_OPT + printf("index_init %s index %d cols req=%d/%d read_bits=%X write_bits=%X index_bits=%X converage=%d\n", pb_open_tab->ot_table->tab_name->ps_path, (int) idx, pb_open_tab->ot_cols_req, table->read_set->n_bits, (int) *table->read_set->bitmap, (int) *table->write_set->bitmap, (int) *ind->mi_col_map.bitmap, (int) (bitmap_is_subset(table->read_set, &ind->mi_col_map) != 0)); +#endif + } + + xt_xlog_check_long_writer(pb_open_tab->ot_thread); + + pb_open_tab->ot_thread->st_statistics.st_scan_index++; + return 0; +} + +int ha_pbxt::index_end() +{ + int err = 0; + + XT_TRACE_CALL(); + + XTThreadPtr thread = pb_open_tab->ot_thread; + + /* + * the assertion below is not always held, because the sometimes handler is unlocked + * before this function is called + */ + /*ASSERT_NS(pb_ex_in_use);*/ + + if (pb_open_tab->ot_ind_rhandle) { + xt_ind_release_handle(pb_open_tab->ot_ind_rhandle, FALSE, thread); + pb_open_tab->ot_ind_rhandle = NULL; + } + + /* + * make permanent the lock for the last scanned row + */ + if (pb_open_tab) + pb_open_tab->ot_table->tab_locks.xt_make_lock_permanent(pb_open_tab, &thread->st_lock_list); + + xt_xlog_check_long_writer(thread); + + active_index = MAX_KEY; + XT_RETURN(err); +} + +#ifdef XT_TRACK_RETURNED_ROWS +void ha_start_scan(XTOpenTablePtr ot, u_int index) +{ + xt_ttracef(ot->ot_thread, "SCAN %d:%d\n", (int) ot->ot_table->tab_id, (int) index); + ot->ot_rows_ret_curr = 0; + for (u_int i=0; i<ot->ot_rows_ret_max; i++) + ot->ot_rows_returned[i] = 0; +} + +void ha_return_row(XTOpenTablePtr ot, u_int index) +{ + xt_ttracef(ot->ot_thread, "%d:%d ROW=%d:%d\n", + (int) ot->ot_table->tab_id, (int) 
index, (int) ot->ot_curr_row_id, (int) ot->ot_curr_rec_id); + ot->ot_rows_ret_curr++; + if (ot->ot_curr_row_id >= ot->ot_rows_ret_max) { + if (!xt_realloc_ns((void **) &ot->ot_rows_returned, (ot->ot_curr_row_id+1) * sizeof(xtRecordID))) + ASSERT_NS(FALSE); + memset(&ot->ot_rows_returned[ot->ot_rows_ret_max], 0, (ot->ot_curr_row_id+1 - ot->ot_rows_ret_max) * sizeof(xtRecordID)); + ot->ot_rows_ret_max = ot->ot_curr_row_id+1; + } + if (!ot->ot_curr_row_id || !ot->ot_curr_rec_id || ot->ot_rows_returned[ot->ot_curr_row_id]) { + char *sql = *thd_query(current_thd); + + xt_ttracef(ot->ot_thread, "DUP %d:%d %s\n", + (int) ot->ot_table->tab_id, (int) index, *thd_query(current_thd)); + xt_dump_trace(); + printf("ERROR: row=%d rec=%d newr=%d, already returned!\n", (int) ot->ot_curr_row_id, (int) ot->ot_rows_returned[ot->ot_curr_row_id], (int) ot->ot_curr_rec_id); + printf("ERROR: %s\n", sql); +#ifdef XT_WIN + FatalAppExit(0, "Debug Me!"); +#endif + } + else + ot->ot_rows_returned[ot->ot_curr_row_id] = ot->ot_curr_rec_id; +} +#endif + +int ha_pbxt::index_read_xt(byte * buf, uint idx, const byte *key, uint key_len __attribute__((unused)), enum ha_rkey_function find_flag __attribute__((unused))) +{ + int err = 0; + XTIndexPtr ind; + int prefix = 0; + XTIdxSearchKeyRec search_key; + +#ifdef XT_TRACK_RETURNED_ROWS + ha_start_scan(pb_open_tab, idx); +#endif + + /* This call starts a search on this handler! 
*/ + pb_ind_row_count = 0; + + ASSERT_NS(pb_ex_in_use); + + XT_PRINT1(pb_open_tab->ot_thread, "ha_pbxt::index_read_xt %s\n", pb_share->sh_table_path->ps_path); + XT_DISABLED_TRACE(("search tx=%d val=%d update=%d\n", (int) pb_open_tab->ot_thread->st_xact_data->xd_start_xn_id, (int) XT_GET_DISK_4(key), pb_modified)); + ind = (XTIndexPtr) pb_share->sh_dic_keys[idx]; + + switch (find_flag) { + case HA_READ_PREFIX_LAST: + case HA_READ_PREFIX_LAST_OR_PREV: + prefix = SEARCH_PREFIX; + case HA_READ_BEFORE_KEY: + case HA_READ_KEY_OR_PREV: // I assume you want to be positioned on the last entry in the key duplicate list!! + xt_idx_prep_key(ind, &search_key, ((find_flag == HA_READ_BEFORE_KEY) ? 0 : XT_SEARCH_AFTER_KEY) | prefix, (xtWord1 *) key, (size_t) key_len); + if (!xt_idx_search_prev(pb_open_tab, ind, &search_key)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else + err = xt_index_prev_read(pb_open_tab, ind, pb_key_read, + (find_flag == HA_READ_PREFIX_LAST) ? &search_key : NULL, buf); + break; + case HA_READ_PREFIX: + prefix = SEARCH_PREFIX; + case HA_READ_KEY_EXACT: + case HA_READ_KEY_OR_NEXT: + case HA_READ_AFTER_KEY: + default: + xt_idx_prep_key(ind, &search_key, ((find_flag == HA_READ_AFTER_KEY) ? XT_SEARCH_AFTER_KEY : 0) | prefix, (xtWord1 *) key, key_len); + if (!xt_idx_search(pb_open_tab, ind, &search_key)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else + err = xt_index_next_read(pb_open_tab, ind, pb_key_read, + (find_flag == HA_READ_KEY_EXACT || find_flag == HA_READ_PREFIX) ? 
&search_key : NULL, buf); + break; + } + + pb_ind_row_count++; +#ifdef XT_TRACK_RETURNED_ROWS + if (!err) + ha_return_row(pb_open_tab, idx); +#endif + XT_DISABLED_TRACE(("search tx=%d val=%d err=%d\n", (int) pb_open_tab->ot_thread->st_xact_data->xd_start_xn_id, (int) XT_GET_DISK_4(key), err)); + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + return err; +} + +/* + * Positions an index cursor to the index specified in the handle. Fetches the + * row if available. If the key value is null, begin at the first key of the + * index. + */ +int ha_pbxt::index_read(byte * buf, const byte * key, uint key_len __attribute__((unused)), enum ha_rkey_function find_flag __attribute__((unused))) +{ + //statistic_increment(ha_read_key_count,&LOCK_status); + return index_read_xt(buf, active_index, key, key_len, find_flag); +} + +int ha_pbxt::index_read_idx(byte * buf, uint idx, const byte *key, uint key_len __attribute__((unused)), enum ha_rkey_function find_flag __attribute__((unused))) +{ + //statistic_increment(ha_read_key_count,&LOCK_status); + return index_read_xt(buf, idx, key, key_len, find_flag); +} + +int ha_pbxt::index_read_last(byte * buf, const byte * key, uint key_len) +{ + //statistic_increment(ha_read_key_count,&LOCK_status); + return index_read_xt(buf, active_index, key, key_len, HA_READ_PREFIX_LAST); +} + +/* + * Used to read forward through the index. 
+ */ +int ha_pbxt::index_next(byte * buf) +{ + int err = 0; + XTIndexPtr ind; + + XT_TRACE_CALL(); + //statistic_increment(ha_read_next_count,&LOCK_status); + ASSERT_NS(pb_ex_in_use); + + ind = (XTIndexPtr) pb_share->sh_dic_keys[active_index]; + + if (!xt_idx_next(pb_open_tab, ind, NULL)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else + err = xt_index_next_read(pb_open_tab, ind, pb_key_read, NULL, buf); + + pb_ind_row_count++; +#ifdef XT_TRACK_RETURNED_ROWS + if (!err) + ha_return_row(pb_open_tab, active_index); +#endif + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + XT_RETURN(err); +} + +/* + * I have implemented this because there is currently a + * bug in handler::index_next_same(). + * + * drop table if exists t1; + * CREATE TABLE t1 (a int, b int, primary key(a,b)) + * PARTITION BY KEY(b,a) PARTITIONS 2; + * insert into t1 values (0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6); + * select * from t1 where a = 4; + * + */ +int ha_pbxt::index_next_same(byte * buf, const byte *key, uint length) +{ + int err = 0; + XTIndexPtr ind; + XTIdxSearchKeyRec search_key; + + XT_TRACE_CALL(); + //statistic_increment(ha_read_next_count,&LOCK_status); + ASSERT_NS(pb_ex_in_use); + + ind = (XTIndexPtr) pb_share->sh_dic_keys[active_index]; + + search_key.sk_key_value.sv_flags = HA_READ_KEY_EXACT; + search_key.sk_key_value.sv_rec_id = 0; + search_key.sk_key_value.sv_row_id = 0; + search_key.sk_key_value.sv_key = search_key.sk_key_buf; + search_key.sk_key_value.sv_length = myxt_create_key_from_key(ind, search_key.sk_key_buf, (xtWord1 *) key, (u_int) length); + search_key.sk_on_key = TRUE; + + if (!xt_idx_next(pb_open_tab, ind, &search_key)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else + err = xt_index_next_read(pb_open_tab, ind, pb_key_read, &search_key, buf); + + pb_ind_row_count++; +#ifdef XT_TRACK_RETURNED_ROWS + if (!err) + 
ha_return_row(pb_open_tab, active_index); +#endif + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + XT_RETURN(err); +} + +/* + * Used to read backwards through the index. + */ +int ha_pbxt::index_prev(byte * buf) +{ + int err = 0; + XTIndexPtr ind; + + XT_TRACE_CALL(); + //statistic_increment(ha_read_prev_count,&LOCK_status); + ASSERT_NS(pb_ex_in_use); + + ind = (XTIndexPtr) pb_share->sh_dic_keys[active_index]; + + if (!xt_idx_prev(pb_open_tab, ind, NULL)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else + err = xt_index_prev_read(pb_open_tab, ind, pb_key_read, NULL, buf); + + pb_ind_row_count++; +#ifdef XT_TRACK_RETURNED_ROWS + if (!err) + ha_return_row(pb_open_tab, active_index); +#endif + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + XT_RETURN(err); +} + +/* + * index_first() asks for the first key in the index. 
+ */ +int ha_pbxt::index_first(byte * buf) +{ + int err = 0; + XTIndexPtr ind; + XTIdxSearchKeyRec search_key; + + XT_TRACE_CALL(); + //statistic_increment(ha_read_first_count,&LOCK_status); + ASSERT_NS(pb_ex_in_use); + +#ifdef XT_TRACK_RETURNED_ROWS + ha_start_scan(pb_open_tab, active_index); +#endif + pb_ind_row_count = 0; + + ind = (XTIndexPtr) pb_share->sh_dic_keys[active_index]; + + xt_idx_prep_key(ind, &search_key, XT_SEARCH_FIRST_FLAG, NULL, 0); + if (!xt_idx_search(pb_open_tab, ind, &search_key)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else + err = xt_index_next_read(pb_open_tab, ind, pb_key_read, NULL, buf); + + pb_ind_row_count++; +#ifdef XT_TRACK_RETURNED_ROWS + if (!err) + ha_return_row(pb_open_tab, active_index); +#endif + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + XT_RETURN(err); +} + +/* + * index_last() asks for the last key in the index. + */ +int ha_pbxt::index_last(byte * buf) +{ + int err = 0; + XTIndexPtr ind; + XTIdxSearchKeyRec search_key; + + XT_TRACE_CALL(); + //statistic_increment(ha_read_last_count,&LOCK_status); + ASSERT_NS(pb_ex_in_use); + +#ifdef XT_TRACK_RETURNED_ROWS + ha_start_scan(pb_open_tab, active_index); +#endif + pb_ind_row_count = 0; + + ind = (XTIndexPtr) pb_share->sh_dic_keys[active_index]; + + xt_idx_prep_key(ind, &search_key, XT_SEARCH_AFTER_LAST_FLAG, NULL, 0); + if (!xt_idx_search_prev(pb_open_tab, ind, &search_key)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else + err = xt_index_prev_read(pb_open_tab, ind, pb_key_read, NULL, buf); + + pb_ind_row_count++; +#ifdef XT_TRACK_RETURNED_ROWS + if (!err) + ha_return_row(pb_open_tab, active_index); +#endif + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + XT_RETURN(err); +} + +/* + * 
----------------------------------------------------------------------- + * RANDOM/SEQUENTIAL READ METHODS + */ + +/* + * rnd_init() is called when the system wants the storage engine to do a table + * scan. + * See the example in the introduction at the top of this file to see when + * rnd_init() is called. + * + * Called from filesort.cc, records.cc, sql_handler.cc, sql_select.cc, sql_table.cc, + * and sql_update.cc. + */ +int ha_pbxt::rnd_init(bool scan) +{ + int err = 0; + + XT_PRINT1(pb_open_tab->ot_thread, "ha_pbxt::rnd_init %s\n", pb_share->sh_table_path->ps_path); + XT_DISABLED_TRACE(("seq scan tx=%d\n", (int) pb_open_tab->ot_thread->st_xact_data->xd_start_xn_id)); + + /* The number of columns required: */ + if (pb_open_tab->ot_is_modify) + pb_open_tab->ot_cols_req = table->read_set->n_bits; + else { + pb_open_tab->ot_cols_req = ha_get_max_bit(table->read_set); + + /* + * In the case of queries like SELECT COUNT(*) FROM t, + * table->read_set is empty. On the other hand, ot_cols_req == 0 can be treated + * as "all columns" by some internal code (see e.g. myxt_load_row), + * which makes such queries very inefficient for records with an + * extended part. Setting the column count to 1 makes sure that the + * extended part will not be accessed in most cases. 
+ */ + + if (pb_open_tab->ot_cols_req == 0) + pb_open_tab->ot_cols_req = 1; + } + + ASSERT_NS(pb_ex_in_use); + if (scan) { + if (!xt_tab_seq_init(pb_open_tab)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + } + else + xt_tab_seq_reset(pb_open_tab); + + xt_xlog_check_long_writer(pb_open_tab->ot_thread); + + return err; +} + +int ha_pbxt::rnd_end() +{ + XT_TRACE_CALL(); + + /* + * make permanent the lock for the last scanned row + */ + XTThreadPtr thread = pb_open_tab->ot_thread; + if (pb_open_tab) + pb_open_tab->ot_table->tab_locks.xt_make_lock_permanent(pb_open_tab, &thread->st_lock_list); + + xt_xlog_check_long_writer(thread); + + xt_tab_seq_exit(pb_open_tab); + XT_RETURN(0); +} + +/* + * This is called for each row of the table scan. When you run out of records + * you should return HA_ERR_END_OF_FILE. Fill buff up with the row information. + * The Field structure for the table is the key to getting data into buf + * in a manner that will allow the server to understand it. + * + * Called from filesort.cc, records.cc, sql_handler.cc, sql_select.cc, sql_table.cc, + * and sql_update.cc. + */ +int ha_pbxt::rnd_next(byte *buf) +{ + int err = 0; + xtBool eof; + + XT_TRACE_CALL(); + ASSERT_NS(pb_ex_in_use); + //statistic_increment(ha_read_rnd_next_count, &LOCK_status); + xt_xlog_check_long_writer(pb_open_tab->ot_thread); + + if (!xt_tab_seq_next(pb_open_tab, (xtWord1 *) buf, &eof)) + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + else if (eof) + err = HA_ERR_END_OF_FILE; + + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + XT_RETURN(err); +} + +/* + * position() is called after each call to rnd_next() if the data needs + * to be ordered. You can do something like the following to store + * the position: + * ha_store_ptr(ref, ref_length, current_position); + * + * The server uses ref to store data. 
ref_length in the above case is + * the size needed to store current_position. ref is just a byte array + * that the server will maintain. If you are using offsets to mark rows, then + * current_position should be the offset. If it is a primary key like in + * BDB, then it needs to be a primary key. + * + * Called from filesort.cc, sql_select.cc, sql_delete.cc and sql_update.cc. + */ +void ha_pbxt::position(const byte *record __attribute__((unused))) +{ + XT_TRACE_CALL(); + ASSERT_NS(pb_ex_in_use); + /* + * I changed this from using little endian to big endian. + * + * The reason is that the pointers are sometimes sorted. + * When they are sorted, a binary compare is used. + * A binary compare sorts big endian values correctly! + * + * Take the following example: + * + * create table t1 (a int, b text); + * insert into t1 values (1, 'aa'), (1, 'bb'), (1, 'cc'); + * select group_concat(b) from t1 group by a; + * + * With little endian pointers the result is: + * aa,bb,cc + * + * With big-endian pointers the result is: + * aa,cc,bb + * + */ + (void) ASSERT_NS(XT_RECORD_OFFS_SIZE == 4); + mi_int4store((xtWord1 *) ref, pb_open_tab->ot_curr_rec_id); + XT_RETURN_VOID; +} + +/* + * Given the #ROWID, retrieve the record. + * + * Called from filesort.cc, records.cc, sql_insert.cc, sql_select.cc, and sql_update.cc. 
+ */ +int ha_pbxt::rnd_pos(byte * buf, byte *pos) +{ + int err = 0; + + XT_TRACE_CALL(); + ASSERT_NS(pb_ex_in_use); + //statistic_increment(ha_read_rnd_count, &LOCK_status); + XT_PRINT1(pb_open_tab->ot_thread, "ha_pbxt::rnd_pos %s\n", pb_share->sh_table_path->ps_path); + + pb_open_tab->ot_curr_rec_id = mi_uint4korr((xtWord1 *) pos); + switch (xt_tab_dirty_read_record(pb_open_tab, (xtWord1 *) buf)) { + case FALSE: + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + break; + default: + break; + } + + if (err) + table->status = STATUS_NOT_FOUND; + else { + pb_open_tab->ot_thread->st_statistics.st_row_select++; + table->status = 0; + } + XT_RETURN(err); +} + +/* + * ----------------------------------------------------------------------- + * INFO METHODS + */ + +/* + ::info() is used to return information to the optimizer. + Currently this table handler doesn't implement most of the fields + really needed. SHOW also makes use of this data + Another note, you will probably want to have the following in your + code: + if (records < 2) + records = 2; + The reason is that the server will optimize for cases of only a single + record. If in a table scan you don't know the number of records + it will probably be better to set records to two so you can return + as many records as you need. + Along with records a few more variables you may wish to set are: + records + deleted + data_file_length + index_file_length + delete_length + check_time + Take a look at the public variables in handler.h for more information. 
+ + Called in: + filesort.cc + ha_heap.cc + item_sum.cc + opt_sum.cc + sql_delete.cc + sql_delete.cc + sql_derived.cc + sql_select.cc + sql_select.cc + sql_select.cc + sql_select.cc + sql_select.cc + sql_show.cc + sql_show.cc + sql_show.cc + sql_show.cc + sql_table.cc + sql_union.cc + sql_update.cc + +*/ +#if MYSQL_VERSION_ID < 50114 +void ha_pbxt::info(uint flag) +#else +int ha_pbxt::info(uint flag) +#endif +{ + XTOpenTablePtr ot; + int in_use; + + XT_TRACE_CALL(); + + if (!(in_use = pb_ex_in_use)) { + pb_ex_in_use = 1; + if (pb_share && pb_share->sh_table_lock) { + /* If some thread has an exclusive lock, then + * we wait for the lock to be removed: + */ +#if MYSQL_VERSION_ID < 50114 + ha_wait_for_shared_use(this, pb_share); + pb_ex_in_use = 1; +#else + if (!ha_wait_for_shared_use(this, pb_share)) + return ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); +#endif + } + } + + if ((ot = pb_open_tab)) { + if (flag & HA_STATUS_VARIABLE) { + stats.deleted = ot->ot_table->tab_row_fnum; + stats.records = (ha_rows) (ot->ot_table->tab_row_eof_id - 1 - stats.deleted); + stats.data_file_length = ot->ot_table->tab_rec_eof_id; + stats.index_file_length = xt_ind_node_to_offset(ot->ot_table, ot->ot_table->tab_ind_eof); + stats.delete_length = ot->ot_table->tab_rec_fnum * ot->ot_rec_size; + //check_time = info.check_time; + stats.mean_rec_length = ot->ot_rec_size; + } + + if (flag & HA_STATUS_CONST) { + ha_rows rec_per_key; + XTIndexPtr ind; + TABLE_SHARE *share= TS(table); + + stats.max_data_file_length = 0x00FFFFFF; + stats.max_index_file_length = 0x00FFFFFF; + //stats.create_time = info.create_time; + ref_length = XT_RECORD_OFFS_SIZE; + //share->db_options_in_use = info.options; + stats.block_size = XT_INDEX_PAGE_SIZE; + + if (share->tmp_table == NO_TMP_TABLE) +#if MYSQL_VERSION_ID > 60005 +#define WHICH_MUTEX LOCK_ha_data +#else +#define WHICH_MUTEX mutex +#endif + +#ifdef SAFE_MUTEX + +#if MYSQL_VERSION_ID < 60000 +#if MYSQL_VERSION_ID < 50123 + 
safe_mutex_lock(&share->mutex,__FILE__,__LINE__); +#else + safe_mutex_lock(&share->mutex,0,__FILE__,__LINE__); +#endif +#else +#if MYSQL_VERSION_ID < 60004 + safe_mutex_lock(&share->mutex,__FILE__,__LINE__); +#else + safe_mutex_lock(&share->WHICH_MUTEX,0,__FILE__,__LINE__); +#endif +#endif + +#else // SAFE_MUTEX + +#ifdef MY_PTHREAD_FASTMUTEX + my_pthread_fastmutex_lock(&share->WHICH_MUTEX); +#else + pthread_mutex_lock(&share->WHICH_MUTEX); +#endif + +#endif // SAFE_MUTEX + share->keys_in_use.set_prefix(share->keys); + //share->keys_in_use.intersect_extended(info.key_map); + share->keys_for_keyread.intersect(share->keys_in_use); + //share->db_record_offset = info.record_offset; + for (u_int i = 0; i < share->keys; i++) { + ind = pb_share->sh_dic_keys[i]; + + rec_per_key = 0; + if (ind->mi_seg_count == 1 && (ind->mi_flags & HA_NOSAME)) + rec_per_key = 1; + else { + + } + for (u_int j = 0; j < table->key_info[i].key_parts; j++) + table->key_info[i].rec_per_key[j] = (ulong) rec_per_key; + } + if (share->tmp_table == NO_TMP_TABLE) +#ifdef SAFE_MUTEX + safe_mutex_unlock(&share->WHICH_MUTEX,__FILE__,__LINE__); +#else +#ifdef MY_PTHREAD_FASTMUTEX + pthread_mutex_unlock(&share->WHICH_MUTEX.mutex); +#else + pthread_mutex_unlock(&share->WHICH_MUTEX); +#endif +#endif + /* + Set data_file_name and index_file_name to point at the symlink value + if table is symlinked (Ie; Real name is not same as generated name) + */ + /* + data_file_name = index_file_name = 0; + fn_format(name_buff, file->filename, "", MI_NAME_DEXT, 2); + if (strcmp(name_buff, info.data_file_name)) + data_file_name = info.data_file_name; + strmov(fn_ext(name_buff), MI_NAME_IEXT); + if (strcmp(name_buff, info.index_file_name)) + index_file_name = info.index_file_name; + */ + } + + if (flag & HA_STATUS_ERRKEY) + errkey = ot->ot_err_index_no; + + if (flag & HA_STATUS_AUTO) + stats.auto_increment_value = (ulonglong) ot->ot_table->tab_auto_inc; + } + else + errkey = (uint) -1; + + if (!in_use) { + pb_ex_in_use = 0; 
+ if (pb_share) { + /* Someone may be waiting for me to complete: */ + if (pb_share->sh_table_lock) + xt_broadcast_cond_ns((xt_cond_type *) pb_share->sh_ex_cond); + } + } +#if MYSQL_VERSION_ID < 50114 + XT_RETURN_VOID; +#else + XT_RETURN(0); +#endif +} + +/* + * extra() is called whenever the server wishes to send a hint to + * the storage engine. The myisam engine implements the most hints. + * ha_innodb.cc has the most exhaustive list of these hints. + */ +int ha_pbxt::extra(enum ha_extra_function operation) +{ + int err = 0; + + XT_PRINT2(xt_get_self(), "ha_pbxt::extra %s operation=%d\n", pb_share->sh_table_path->ps_path, operation); + + switch (operation) { + case HA_EXTRA_RESET_STATE: + pb_key_read = FALSE; + pb_ignore_dup_key = 0; + /* As far as I can tell, this function is called for + * every table at the end of a statement. + * + * So, during a LOCK TABLES ... UNLOCK TABLES, I use + * this to find the end of a statement. + * start_stmt() indicates the start of a statement, + * and is also called once for each table in the + * statement. + * + * So the statement boundary is indicated by + * self->st_stat_count == 0 + * + * GOTCHA: I cannot end the transaction here! + * I must end it in start_stmt(). + * The reason is because there are situations + * where this would end a transaction that + * was begin by external_lock(). + * + * An example of this is when a function + * is called when doing CREATE TABLE SELECT. + */ + if (pb_in_stat) { + /* NOTE: pb_in_stat is just used to avoid getting + * self, if it is not necessary!! 
+ */ + XTThreadPtr self; + + pb_in_stat = FALSE; + + if (!(self = ha_set_current_thread(pb_mysql_thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + if (self->st_stat_count > 0) { + self->st_stat_count--; + if (self->st_stat_count == 0) + self->st_stat_ended = TRUE; + } + + /* This is the end of a statement, I can turn any locks into permanent locks now: */ + if (pb_open_tab) + pb_open_tab->ot_table->tab_locks.xt_make_lock_permanent(pb_open_tab, &self->st_lock_list); + } + break; + case HA_EXTRA_KEYREAD: + /* This means we do not need to read the entire record. */ + pb_key_read = TRUE; + break; + case HA_EXTRA_NO_KEYREAD: + pb_key_read = FALSE; + break; + case HA_EXTRA_IGNORE_DUP_KEY: + /* NOTE!!! Calls to extra(HA_EXTRA_IGNORE_DUP_KEY) can be nested! + * In fact, the calls are from different threads, so + * strictly speaking I should protect this variable!! + * Here is the sequence that produces the duplicate call: + * + * drop table if exists t1; + * CREATE TABLE t1 (x int not null, y int, primary key (x)) engine=pbxt; + * insert into t1 values (1, 3), (4, 1); + * replace DELAYED into t1 (x, y) VALUES (4, 2); + * select * from t1 order by x; + * + */ + pb_ignore_dup_key++; + break; + case HA_EXTRA_NO_IGNORE_DUP_KEY: + pb_ignore_dup_key--; + break; + case HA_EXTRA_KEYREAD_PRESERVE_FIELDS: + /* MySQL needs all fields */ + pb_key_read = FALSE; + break; + default: + break; + } + + return err; +} + + +/* + * Deprecated and likely to be removed in the future. Storage engines normally + * just make a call like: + * ha_pbxt::extra(HA_EXTRA_RESET); + * to handle it. + */ +int ha_pbxt::reset(void) +{ + XT_TRACE_CALL(); + extra(HA_EXTRA_RESET_STATE); + XT_RETURN(0); +} + +void ha_pbxt::unlock_row() +{ + XT_TRACE_CALL(); + if (pb_open_tab) + pb_open_tab->ot_table->tab_locks.xt_remove_temp_lock(pb_open_tab, FALSE); +} + +/* + * Used to delete all rows in a table. 
Both for cases of truncate and + * for cases where the optimizer realizes that all rows will be + * removed as a result of a SQL statement. + * + * Called from item_sum.cc by Item_func_group_concat::clear(), + * Item_sum_count_distinct::clear(), and Item_func_group_concat::clear(). + * Called from sql_delete.cc by mysql_delete(). + * Called from sql_select.cc by JOIN::reinit(). + * Called from sql_union.cc by st_select_lex_unit::exec(). + */ +int ha_pbxt::delete_all_rows() +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + XTDDTable *tab_def = NULL; + char path[PATH_MAX]; + + XT_TRACE_CALL(); + + if (thd_sql_command(thd) != SQLCOM_TRUNCATE) { + /* Just like InnoDB we only handle TRUNCATE TABLE + * by recreating the table. + * DELETE FROM t must be handled by deleting + * each row because it may be part of a transaction, + * and there may be foreign key actions. + */ + XT_RETURN (my_errno = HA_ERR_WRONG_COMMAND); + } + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + try_(a) { + XTDictionaryRec dic; + + memset(&dic, 0, sizeof(dic)); + + dic = pb_share->sh_table->tab_dic; + xt_strcpy(PATH_MAX, path, pb_share->sh_table->tab_name->ps_path); + + if ((tab_def = dic.dic_table)) + tab_def->reference(); + + if (!(thd_test_options(thd,OPTION_NO_FOREIGN_KEY_CHECKS))) + tab_def->deleteAllRows(self); + + /* We should have a table lock! */ + //ASSERT(pb_lock_table); + if (!pb_table_locked) { + ha_aquire_exclusive_use(self, pb_share, this); + pushr_(ha_release_exclusive_use, pb_share); + } + ha_close_open_tables(self, pb_share, NULL); + + /* This is required in the case of delete_all_rows, because we must + * ensure that the handlers no longer reference the old + * table, so that it will not be used again. The table + * must be re-openned, because the ID has changed! + * + * 0.9.86+ Must check if this is still necessary. 
+ * + * the ha_close_share(self, pb_share) call was moved from above + * (before tab_def = dic.dic_table), because of a crash. + * Test case: + * + * set storage_engine = pbxt; + * create table t1 (s1 int primary key); + * insert into t1 values (1); + * create table t2 (s1 int, foreign key (s1) references t1 (s1)); + * insert into t2 values (1); + * truncate table t1; -- this should fail because of FK constraint + * alter table t1 engine = myisam; -- this caused crash + * + */ + ha_close_share(self, pb_share); + + xt_create_table(self, (XTPathStrPtr) path, &dic); + if (!pb_table_locked) + freer_(); // ha_release_exclusive_use(pb_share) + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + if (tab_def) + tab_def->release(self); + + XT_RETURN(err); +} + +/* + * TODO: Implement! + * Assuming a key (a,b,c) + * + * rec_per_key[0] = SELECT COUNT(*)/COUNT(DISTINCT a) FROM t; + * rec_per_key[1] = SELECT COUNT(*)/COUNT(DISTINCT a,b) FROM t; + * rec_per_key[2] = SELECT COUNT(*)/COUNT(DISTINCT a,b,c) FROM t; + * + * After this is implemented, the selectivity can serve as + * a quick estimate of records_in_range(). + * + * After you have done this, you need to redo the index_merge* + * tests. Restore the standard result to check if we + * now agree with the MyISAM strategy. + * + */ +int ha_pbxt::analyze(THD *thd __attribute__((unused)), HA_CHECK_OPT *check_opt __attribute__((unused))) +{ + int err = 0; + XTDatabaseHPtr db; + xtXactID my_xn_id; + xtXactID clean_xn_id = 0; + uint cnt = 10; + + XT_TRACE_CALL(); + + if (!pb_open_tab) { + if ((err = reopen())) + XT_RETURN(err); + } + + /* Wait until the sweeper is no longer busy! + * If you want an accurate count(*) value, then call + * ANALYZE TABLE first. This function waits until the + * sweeper has completed. + */ + db = pb_open_tab->ot_table->tab_db; + + /* + * Wait until everything is cleaned up before this transaction. 
+ * But this will only work if we quit our transaction! + * + * GOTCHA: When a PBXT table is partitioned, then analyze() is + * called for each component. The first calls xt_xn_commit(). + * All following calls have no transaction!: + * + * CREATE TABLE t1 (a int) + * PARTITION BY LIST (a) + * (PARTITION x1 VALUES IN (10), PARTITION x2 VALUES IN (20)); + * + * analyze table t1; + * + */ + if (pb_open_tab->ot_thread && pb_open_tab->ot_thread->st_xact_data) { + my_xn_id = pb_open_tab->ot_thread->st_xact_data->xd_start_xn_id; + XT_PRINT0(xt_get_self(), "xt_xn_commit\n"); + xt_xn_commit(pb_open_tab->ot_thread); + } + else + my_xn_id = db->db_xn_to_clean_id; + + while ((!db->db_sw_idle || xt_xn_is_before(db->db_xn_to_clean_id, my_xn_id)) && !thd_killed(thd)) { + xt_busy_wait(); + + /* + * It is possible that the sweeper gets stuck because + * it has no dictionary information! + * As in the example below. + * + * create table t4 ( + * pk_col int auto_increment primary key, a1 char(64), a2 char(64), b char(16), c char(16) not null, d char(16), dummy char(64) default ' ' + * ) engine=pbxt; + * + * insert into t4 (a1, a2, b, c, d, dummy) select * from t1; + * + * create index idx12672_0 on t4 (a1); + * create index idx12672_1 on t4 (a1,a2,b,c); + * create index idx12672_2 on t4 (a1,a2,b); + * analyze table t1; + */ + if (db->db_sw_idle) { + /* This will make sure we don't wait forever: */ + if (clean_xn_id != db->db_xn_to_clean_id) { + clean_xn_id = db->db_xn_to_clean_id; + cnt = 10; + } + else { + cnt--; + if (!cnt) + break; + } + xt_wakeup_sweeper(db); + } + } + + XT_RETURN(err); +} + +int ha_pbxt::repair(THD *thd __attribute__((unused)), HA_CHECK_OPT *check_opt __attribute__((unused))) +{ + return(HA_ADMIN_TRY_ALTER); +} + +/* + * This is mapped to "ALTER TABLE tablename TYPE=PBXT", which rebuilds + * the table in MySQL. 
+ */ +int ha_pbxt::optimize(THD *thd __attribute__((unused)), HA_CHECK_OPT *check_opt __attribute__((unused))) +{ + return(HA_ADMIN_TRY_ALTER); +} + +#ifdef DEBUG +extern int pbxt_mysql_trace_on; +#endif + +int ha_pbxt::check(THD* thd, HA_CHECK_OPT* check_opt __attribute__((unused))) +{ + int err = 0; + XTThreadPtr self; + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + if (self->st_lock_count) + ASSERT(self->st_xact_data); + + if (!pb_table_locked) { + ha_aquire_exclusive_use(self, pb_share, this); + pushr_(ha_release_exclusive_use, pb_share); + } + +#ifdef CHECK_TABLE_LOADS + xt_tab_load_table(self, pb_open_tab); +#endif + xt_check_table(self, pb_open_tab); + + if (!pb_table_locked) + freer_(); // ha_release_exclusive_use(pb_share) + + //pbxt_mysql_trace_on = TRUE; + return 0; +} + +/* + * This function is called: + * For each table in LOCK TABLES, + * OR + * For each table in a statement. + * + * It is called with F_UNLCK: + * in UNLOCK TABLES + * OR + * at the end of a statement. + * + */ +xtPublic int ha_pbxt::external_lock(THD *thd, int lock_type) +{ + int err = 0; + XTThreadPtr self; + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + /* F_UNLCK is set when this function is called at end + * of statement or UNLOCK TABLES + */ + if (lock_type == F_UNLCK) { + /* This is not TRUE if external_lock() FAILED! + * Can we rely on external_unlock being called when + * external_lock() fails? Currently yes, but it does + * not make sense! + ASSERT_NS(pb_ex_in_use); + */ + + XT_PRINT1(self, "ha_pbxt::EXTERNAL_LOCK %s lock_type=UNLOCK\n", pb_share->sh_table_path->ps_path); + + /* Make any temporary locks on this table permanent. 
+ * + * This is required here because of the following example: + * create table t1 (a int NOT NULL, b int, primary key (a)); + * create table t2 (a int NOT NULL, b int, primary key (a)); + * insert into t1 values (0, 10),(1, 11),(2, 12); + * insert into t2 values (1, 21),(2, 22),(3, 23); + * update t1 set b= (select b from t2 where t1.a = t2.a); + * update t1 set b= (select b from t2 where t1.a = t2.a); + * select * from t1; + * drop table t1, t2; + * + */ + + /* GOTCHA! It's weird, but, if this function returns an error + * on lock, then UNLOCK is called?! + * This should not be done, because if lock fails, it should be + * assumed that no UNLOCK is required. + * Basically, I have to assume that some code will presume this, + * although the function lock_external() calls unlock, even + * when lock fails. + * The result is, that my lock count can go wrong. So I could + * change the lock method, and increment the lock count, even + * if it fails. However, the consequences are more serious, + * if some code decides not to call UNLOCK after lock fails. + * The result is that I would have a permanently too high lock + * count and nothing will work. + * So instead, I handle the fact that I might get too many unlocks + * here. + */ + if (self->st_lock_count > 0) + self->st_lock_count--; + if (!self->st_lock_count) { + /* This section handles "auto-commit"... */ + +#ifdef XT_IMPLEMENT_NO_ACTION + /* {NO-ACTION-BUG} + * This is required here because it marks the end of a statement. + * If we are in a non-auto-commit mode, then we cannot + * wait for st_is_update to be set by the beginning of a new transaction. + */ + if (self->st_restrict_list.bl_count) { + if (!xt_tab_restrict_rows(&self->st_restrict_list, self)) + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } +#endif + + if (self->st_xact_data) { + if (self->st_auto_commit) { + /* + * Normally I could assume that if the transaction + * has not been aborted by now, then it should be committed. 
+ * + * Unfortunately, this is not the case! + * + * create table t1 (id int primary key) engine = pbxt; + * create table t2 (id int) engine = pbxt; + * + * insert into t1 values ( 1 ) ; + * insert into t1 values ( 2 ) ; + * insert into t2 values ( 1 ) ; + * insert into t2 values ( 2 ) ; + * + * --This statement returns an error and calls ha_autocommit_or_rollback(): + * update t1 set t1.id=1 where t1.id=2; + * + * --This statement returns no error and calls ha_autocommit_or_rollback(): + * update t1,t2 set t1.id=3, t2.id=3 where t1.id=2 and t2.id = t1.id; + * + * --But this statement returns an error and does not call ha_autocommit_or_rollback(): + * update t1,t2 set t1.id=1, t2.id=1 where t1.id=3 and t2.id = t1.id; + * + * The result is, I cannot rely on ha_autocommit_or_rollback() being called :( + * So I have to abort myself here... + */ + if (pb_open_tab) + pb_open_tab->ot_table->tab_locks.xt_make_lock_permanent(pb_open_tab, &self->st_lock_list); + + if (self->st_abort_trans) { + XT_PRINT0(self, "xt_xn_rollback in unlock\n"); + if (!xt_xn_rollback(self)) + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + else { + XT_PRINT0(self, "xt_xn_commit in unlock\n"); + if (!xt_xn_commit(self)) + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + } + } + + /* If the previous statement was "for update", then set the visibility + * so that non- for update SELECTs will see what the for update select + * (or update statement) just saw. 
+ */ + if (pb_open_tab) { + if (pb_open_tab->ot_for_update) + self->st_visible_time = self->st_database->db_xn_end_time; + + if (pb_share->sh_recalc_selectivity) { + if ((pb_share->sh_table->tab_row_eof_id - 1 - pb_share->sh_table->tab_row_fnum) >= 200) { + /* [**] */ + pb_share->sh_recalc_selectivity = FALSE; + xt_ind_set_index_selectivity(self, pb_open_tab); + pb_share->sh_recalc_selectivity = (pb_share->sh_table->tab_row_eof_id - 1 - pb_share->sh_table->tab_row_fnum) < 150; + } + } + } + + if (self->st_stat_modify) + self->st_statistics.st_stat_write++; + else + self->st_statistics.st_stat_read++; + self->st_stat_modify = FALSE; + } + + if (pb_table_locked) { + pb_table_locked--; + if (!pb_table_locked) + ha_release_exclusive_use(self, pb_share); + } + + /* No longer in use: */ + pb_ex_in_use = 0; + /* Someone may be waiting for me to complete: */ + if (pb_share->sh_table_lock) + xt_broadcast_cond_ns((xt_cond_type *) pb_share->sh_ex_cond); + } + else { + XT_PRINT2(self, "ha_pbxt::EXTERNAL_LOCK %s lock_type=%d\n", pb_share->sh_table_path->ps_path, lock_type); + + if (pb_lock_table) { + + pb_ex_in_use = 1; + try_(a) { + if (!pb_table_locked) + ha_aquire_exclusive_use(self, pb_share, this); + pb_table_locked++; + + ha_close_open_tables(self, pb_share, this); + + if (!pb_share->sh_table) { + xt_ha_open_database_of_table(self, pb_share->sh_table_path); + + ha_open_share(self, pb_share, NULL); + } + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + pb_ex_in_use = 0; + goto complete; + } + cont_(a); + } + else { + pb_ex_in_use = 1; + if (pb_share->sh_table_lock && !pb_table_locked) { + /* If some thread has an exclusive lock, then + * we wait for the lock to be removed: + */ + if (!ha_wait_for_shared_use(this, pb_share)) { + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + goto complete; + } + } + + if (!pb_open_tab) { + if ((err = reopen())) { + pb_ex_in_use = 0; + goto complete; + } + } + + /* Set the current 
thread for this open table: */ + pb_open_tab->ot_thread = self; + + /* If this is a set, then it is in UPDATE/DELETE TABLE ... + * or SELECT ... FOR UPDATE + */ + pb_open_tab->ot_is_modify = FALSE; + if ((pb_open_tab->ot_for_update = (lock_type == F_WRLCK))) { + switch ((int) thd_sql_command(thd)) { + case SQLCOM_UPDATE: + case SQLCOM_UPDATE_MULTI: + case SQLCOM_DELETE: + case SQLCOM_DELETE_MULTI: + case SQLCOM_REPLACE: + case SQLCOM_REPLACE_SELECT: + case SQLCOM_INSERT: + case SQLCOM_INSERT_SELECT: + pb_open_tab->ot_is_modify = TRUE; + self->st_stat_modify = TRUE; + break; + case SQLCOM_CREATE_TABLE: + case SQLCOM_CREATE_INDEX: + case SQLCOM_ALTER_TABLE: + case SQLCOM_TRUNCATE: + case SQLCOM_DROP_TABLE: + case SQLCOM_DROP_INDEX: + case SQLCOM_LOAD: + case SQLCOM_REPAIR: + case SQLCOM_OPTIMIZE: + self->st_stat_modify = TRUE; + break; + } + } + + if (pb_open_tab->ot_is_modify && pb_open_tab->ot_table->tab_dic.dic_disable_index) { + xt_tab_set_index_error(pb_open_tab->ot_table); + err = ha_log_pbxt_thread_error_for_mysql(pb_ignore_dup_key); + goto complete; + } + } + + /* Record the associated MySQL thread: */ + pb_mysql_thd = thd; + + if (self->st_database != pb_share->sh_table->tab_db) { + try_(b) { + /* PBXT does not permit multiple databases us one statement, + * or in a single transaction! + * + * Example query: + * + * update mysqltest_1.t1, mysqltest_2.t2 set a=10,d=10; + */ + if (self->st_lock_count > 0) + xt_throw_xterr(XT_CONTEXT, XT_ERR_MULTIPLE_DATABASES); + + xt_ha_open_database_of_table(self, pb_share->sh_table_path); + } + catch_(b) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + pb_ex_in_use = 0; + goto complete; + } + cont_(b); + } + + /* See (***) */ + self->st_is_update = FALSE; + + /* Auto begin a transaction (if one is not already running): */ + if (!self->st_xact_data) { + /* Transaction mode numbers must be identical! 
*/ + (void) ASSERT_NS(ISO_READ_UNCOMMITTED == XT_XACT_UNCOMMITTED_READ); + (void) ASSERT_NS(ISO_SERIALIZABLE == XT_XACT_SERIALIZABLE); + + self->st_xact_mode = thd_tx_isolation(thd) <= ISO_READ_COMMITTED ? XT_XACT_COMMITTED_READ : XT_XACT_REPEATABLE_READ; + self->st_ignore_fkeys = (thd_test_options(thd,OPTION_NO_FOREIGN_KEY_CHECKS)) != 0; + self->st_auto_commit = (thd_test_options(thd, (OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) == 0; + self->st_table_trans = thd_sql_command(thd) == SQLCOM_LOCK_TABLES; + self->st_abort_trans = FALSE; + self->st_stat_ended = FALSE; + self->st_stat_trans = FALSE; + XT_PRINT0(self, "xt_xn_begin\n"); + if (!xt_xn_begin(self)) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + pb_ex_in_use = 0; + goto complete; + } + + /* + * (**) GOTCHA: trans_register_ha() is not mentioned in the documentation. + * It must be called to inform MySQL that we have a transaction (see start_stmt). + * + * Here are some tests that confirm whether things are done correctly: + * + * drop table if exists t1, t2; + * create table t1 (c1 int); + * insert t1 values (1); + * select * from t1; + * rename table t1 to t2; + * + * rename will generate an error if MySQL thinks a transaction is + * still running. + * + * create table t1 (a text character set utf8, b text character set latin1); + * insert t1 values (0x4F736E616272C3BC636B, 0x4BF66C6E); + * select * from t1; + * --exec $MYSQL_DUMP --tab=$MYSQLTEST_VARDIR/tmp/ test + * --exec $MYSQL test < $MYSQLTEST_VARDIR/tmp/t1.sql + * --exec $MYSQL_IMPORT test $MYSQLTEST_VARDIR/tmp/t1.txt + * select * from t1; + * + * This test forces a begin transaction in start_stmt() + * + * drop tables if exists t1; + * create table t1 (c1 int); + * lock tables t1 write; + * insert t1 values (1); + * insert t1 values (2); + * unlock tables; + * + * The second select will return an empty result if + * MySQL is not informed that a transaction is running (auto-commit + * in external_lock comes too late)! 
+ * + */ + if (!self->st_auto_commit) { + trans_register_ha(thd, TRUE, pbxt_hton); + XT_PRINT0(self, "ha_pbxt::external_lock trans_register_ha all=TRUE\n"); + } + } + + if (lock_type == F_WRLCK || self->st_xact_mode < XT_XACT_REPEATABLE_READ) + self->st_visible_time = self->st_database->db_xn_end_time; + +#ifdef TRACE_STATEMENTS + if (self->st_lock_count == 0) + STAT_TRACE(self, *thd_query(thd)); +#endif + self->st_lock_count++; + } + + complete: + return err; +} + +/* + * This function is called for each table in a statement + * after LOCK TABLES has been used. + * + * Currently I only use this function to set the + * current thread of the table handle. + * + * GOTCHA: The prototype of start_stmt() has changed + * from version 4.1 to 5.1! + */ +int ha_pbxt::start_stmt(THD *thd, thr_lock_type lock_type) +{ + int err = 0; + XTThreadPtr self; + + ASSERT_NS(pb_ex_in_use); + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + XT_PRINT2(self, "ha_pbxt::start_stmt %s lock_type=%d\n", pb_share->sh_table_path->ps_path, (int) lock_type); + + if (!pb_open_tab) { + if ((err = reopen())) + goto complete; + } + + ASSERT_NS(pb_open_tab->ot_thread == self); + ASSERT_NS(thd == pb_mysql_thd); + ASSERT_NS(self->st_database == pb_open_tab->ot_table->tab_db); + + if (self->st_stat_ended) { + self->st_stat_ended = FALSE; + self->st_stat_trans = FALSE; + +#ifdef XT_IMPLEMENT_NO_ACTION + if (self->st_restrict_list.bl_count) { + if (!xt_tab_restrict_rows(&self->st_restrict_list, self)) { + err = xt_ha_pbxt_thread_error_for_mysql(pb_mysql_thd, self, pb_ignore_dup_key); + } + } +#endif + + /* This section handles "auto-commit"... 
*/ + if (self->st_xact_data && self->st_auto_commit && self->st_table_trans) { + if (self->st_abort_trans) { + XT_PRINT0(self, "xt_xn_rollback\n"); + if (!xt_xn_rollback(self)) + err = xt_ha_pbxt_thread_error_for_mysql(pb_mysql_thd, self, pb_ignore_dup_key); + } + else { + XT_PRINT0(self, "xt_xn_commit\n"); + if (!xt_xn_commit(self)) + err = xt_ha_pbxt_thread_error_for_mysql(pb_mysql_thd, self, pb_ignore_dup_key); + } + } + + if (self->st_stat_modify) + self->st_statistics.st_stat_write++; + else + self->st_statistics.st_stat_read++; + self->st_stat_modify = FALSE; + + /* If the previous statement was "for update", then set the visibility + * so that non- for update SELECTs will see what the for update select + * (or update statement) just saw. + */ + if (pb_open_tab->ot_for_update) + self->st_visible_time = self->st_database->db_xn_end_time; + } + + pb_open_tab->ot_for_update = + (lock_type != TL_READ && + lock_type != TL_READ_WITH_SHARED_LOCKS && + lock_type != TL_READ_HIGH_PRIORITY && + lock_type != TL_READ_NO_INSERT); + pb_open_tab->ot_is_modify = FALSE; + if (pb_open_tab->ot_for_update) { + switch ((int) thd_sql_command(thd)) { + case SQLCOM_UPDATE: + case SQLCOM_UPDATE_MULTI: + case SQLCOM_DELETE: + case SQLCOM_DELETE_MULTI: + case SQLCOM_REPLACE: + case SQLCOM_REPLACE_SELECT: + case SQLCOM_INSERT: + case SQLCOM_INSERT_SELECT: + pb_open_tab->ot_is_modify = TRUE; + self->st_stat_modify = TRUE; + break; + case SQLCOM_CREATE_TABLE: + case SQLCOM_CREATE_INDEX: + case SQLCOM_ALTER_TABLE: + case SQLCOM_TRUNCATE: + case SQLCOM_DROP_TABLE: + case SQLCOM_DROP_INDEX: + case SQLCOM_LOAD: + case SQLCOM_REPAIR: + case SQLCOM_OPTIMIZE: + self->st_stat_modify = TRUE; + break; + } + } + + + /* (***) This is required at this level! + * No matter how often it is called, it is still the start of a + * statement. We need to make sure that statements are NOT mistaken + * for a different type of statement. 
+ * + * Here is an example: + * select * from t1 where data = getcount("bar") + * + * If the procedure getcount() addresses another table. + * then open and close of the statements in getcount() + * are nested within an open close of the select t1 + * statement. + */ + self->st_is_update = FALSE; + + /* See comment (**) */ + if (!self->st_xact_data) { + self->st_xact_mode = thd_tx_isolation(thd) <= ISO_READ_COMMITTED ? XT_XACT_COMMITTED_READ : XT_XACT_REPEATABLE_READ; + self->st_ignore_fkeys = (thd_test_options(thd, OPTION_NO_FOREIGN_KEY_CHECKS)) != 0; + self->st_auto_commit = (thd_test_options(thd,(OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) == 0; + /* self->st_table_trans = not set here! */ + self->st_abort_trans = FALSE; + self->st_stat_ended = FALSE; + self->st_stat_trans = FALSE; + XT_PRINT0(self, "xt_xn_begin\n"); + if (!xt_xn_begin(self)) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + goto complete; + } + if (!self->st_auto_commit) { + trans_register_ha(thd, TRUE, pbxt_hton); + XT_PRINT0(self, "ha_pbxt::start_stmt trans_register_ha all=TRUE\n"); + } + } + + if (pb_open_tab->ot_for_update || self->st_xact_mode < XT_XACT_REPEATABLE_READ) + self->st_visible_time = self->st_database->db_xn_end_time; + + pb_in_stat = TRUE; + + self->st_stat_count++; + + complete: + return err; +} + +/* + * The idea with handler::store_lock() is the following: + * + * The statement decided which locks we should need for the table + * for updates/deletes/inserts we get WRITE locks, for SELECT... we get + * read locks. + * + * Before adding the lock into the table lock handler (see thr_lock.c) + * mysqld calls store lock with the requested locks. Store lock can now + * modify a write lock to a read lock (or some other lock), ignore the + * lock (if we don't want to use MySQL table locks at all) or add locks + * for many tables (like we do when we are using a MERGE handler). + * + * When releasing locks, store_lock() are also called. 
In this case one + * usually doesn't have to do anything. + * + * In some exceptional cases MySQL may send a request for a TL_IGNORE; + * This means that we are requesting the same lock as last time and this + * should also be ignored. (This may happen when someone does a flush + * table when we have opened a part of the tables, in which case mysqld + * closes and reopens the tables and tries to get the same locks at last + * time). In the future we will probably try to remove this. + * + * Called from lock.cc by get_lock_data(). + */ +THR_LOCK_DATA **ha_pbxt::store_lock(THD *thd, THR_LOCK_DATA **to, enum thr_lock_type lock_type) +{ + if (lock_type != TL_IGNORE && pb_lock.type == TL_UNLOCK) { + /* Set to TRUE for operations that require a table lock: */ + switch (thd_sql_command(thd)) { + case SQLCOM_TRUNCATE: + /* GOTCHA: + * The problem is, if I do not do this, then + * TRUNCATE TABLE deadlocks with a normal update of the table! + * The reason is: + * + * external_lock() is called before MySQL actually locks the + * table. In external_lock(), the table is shared locked, + * by indicating that the handler is in use. + * + * Then later, in delete_all_rows(), a exclusive lock must be + * obtained. If an UPDATE or INSERT has also gained a shared + * lock in the meantime, then TRUNCATE TABLE hangs. + * + * By setting pb_lock_table we indicate that an exclusive lock + * should be gained in external_lock(). + * + * This is the locking behaviour: + * + * TRUNCATE TABLE: + * XT SHARE LOCK (mysql_lock_tables calls external_lock) + * MySQL WRITE LOCK (mysql_lock_tables) + * ... + * XT EXCLUSIVE LOCK (delete_all_rows) + * + * INSERT: + * XT SHARED LOCK (mysql_lock_tables calls external_lock) + * MySQL WRITE_ALLOW_WRITE LOCK (mysql_lock_tables) + * + * If the locking for INSERT is done in the ... phase + * above, then we have a deadlock because + * WRITE_ALLOW_WRITE conflicts with WRITE. 
+ * + * Making TRUNCATE TABLE take a WRITE_ALLOW_WRITE LOCK, will + * not solve the problem because then 2 TRUNCATE TABLES + * can deadlock due to lock escalation. + * + * What may work is if MySQL were to lock BEFORE calling + * external_lock()! + * + * However, using this method, TRUNCATE TABLE does deadlock + * with other operations such as ALTER TABLE! + * + * This is handled with a lock timeout. Assuming + * TRUNCATE TABLE will be mixed with DML this is the + * best solution! + */ + pb_lock_table = TRUE; + break; + default: + pb_lock_table = FALSE; + break; + } + +#ifdef PBXT_HANDLER_TRACE + pb_lock.type = lock_type; +#endif + /* GOTCHA: Before it was OK to weaken the lock after just checking + * that !thd->in_lock_tables. However, when starting a procedure, MySQL + * simulates a LOCK TABLES statement. + * + * So we need to be more specific here, and check what the actual statement + * type. Before doing this I got a deadlock (undetected) on the following test. + * However, now we get a failed assertion in ha_rollback_trans(): + * TODO: Check this with InnoDB! + * + * DBUG_ASSERT(0); + * my_error(ER_COMMIT_NOT_ALLOWED_IN_SF_OR_TRG, MYF(0)); + * + * drop table if exists t3; + * create table t3 (a smallint primary key) engine=pbxt; + * insert into t3 (a) values (40); + * insert into t3 (a) values (50); + * + * delimiter | + * + * drop function if exists t3_update| + * + * create function t3_update() returns int + * begin + * insert into t3 values (10); + * return 100; + * end| + * + * delimiter ; + * + * CONN 1: + * + * begin; + * update t3 set a = 5 where a = 50; + * + * CONN 2: + * + * begin; + * update t3 set a = 4 where a = 40; + * + * CONN 1: + * + * update t3 set a = 4 where a = 40; // Hangs waiting CONN 2. + * + * CONN 2: + * + * select t3_update(); // Hangs waiting for table lock. 
+ * + */ + if ((lock_type >= TL_WRITE_CONCURRENT_INSERT && lock_type <= TL_WRITE) && + !(thd_in_lock_tables(thd) && thd_sql_command(thd) == SQLCOM_LOCK_TABLES) && + !thd_tablespace_op(thd) && + thd_sql_command(thd) != SQLCOM_TRUNCATE && + thd_sql_command(thd) != SQLCOM_OPTIMIZE && + thd_sql_command(thd) != SQLCOM_CREATE_TABLE) { + lock_type = TL_WRITE_ALLOW_WRITE; + } + + /* In queries of type INSERT INTO t1 SELECT ... FROM t2 ... + * MySQL would use the lock TL_READ_NO_INSERT on t2, and that + * would conflict with TL_WRITE_ALLOW_WRITE, blocking all inserts + * to t2. Convert the lock to a normal read lock to allow + * concurrent inserts to t2. + * + * (This one from InnoDB) + + * Stewart: removed SQLCOM_CALL, not sure of implications. + */ + if (lock_type == TL_READ_NO_INSERT && + (!thd_in_lock_tables(thd) +#ifndef DRIZZLED + || thd_sql_command(thd) == SQLCOM_CALL +#endif + )) + { + lock_type = TL_READ; + } + + XT_PRINT3(xt_get_self(), "ha_pbxt::store_lock %s %d->%d\n", pb_share->sh_table_path->ps_path, pb_lock.type, lock_type); + pb_lock.type = lock_type; + } +#ifdef PBXT_HANDLER_TRACE + else { + XT_PRINT3(xt_get_self(), "ha_pbxt::store_lock %s %d->%d (ignore/unlock)\n", pb_share->sh_table_path->ps_path, lock_type, lock_type); + } +#endif + *to++= &pb_lock; + return to; +} + +/* + * Used to delete a table. By the time delete_table() has been called all + * opened references to this table will have been closed (and your globally + * shared references released. The variable name will just be the name of + * the table. You will need to remove any files you have created at this point. + * + * Called from handler.cc by delete_table and ha_create_table(). Only used + * during create if the table_flag HA_DROP_BEFORE_CREATE was specified for + * the storage engine. 
+*/ +int ha_pbxt::delete_table(const char *table_path) +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + XTSharePtr share; + + if (XTSystemTableShare::isSystemTable(table_path)) + return delete_system_table(table_path); + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + self->st_ignore_fkeys = (thd_test_options(thd, OPTION_NO_FOREIGN_KEY_CHECKS)) != 0; + + STAT_TRACE(self, *thd_query(thd)); + XT_PRINT1(self, "ha_pbxt::delete_table %s\n", table_path); + + try_(a) { + xt_ha_open_database_of_table(self, (XTPathStrPtr) table_path); + + ASSERT(xt_get_self() == self); + try_(b) { + /* NOTE: MySQL does not drop a table by first locking it! + * We also cannot use pb_share because the handler used + * to delete a table is not openned correctly. + */ + share = ha_get_share(self, table_path, false, NULL); + pushr_(ha_unget_share, share); + ha_aquire_exclusive_use(self, share, NULL); + pushr_(ha_release_exclusive_use, share); + ha_close_open_tables(self, share, NULL); + + xt_drop_table(self, (XTPathStrPtr) table_path); + + freer_(); // ha_release_exclusive_use(share) + freer_(); // ha_unget_share(share) + } + catch_(b) { + /* If the table does not exist, just log the error and continue... */ + if (self->t_exception.e_xt_err == XT_ERR_TABLE_NOT_FOUND) + xt_log_and_clear_exception(self); + else + throw_(); + } + cont_(b); + + /* + * If there are no more PBXT tables in the database, we + * "drop the database", which deletes all PBXT resources + * in the database. + */ + /* We now only drop the pbxt system data, + * when the PBXT database is dropped. 
+ */ +#ifndef XT_USE_GLOBAL_DB + if (!xt_table_exists(self->st_database)) { + xt_ha_all_threads_close_database(self, self->st_database); + xt_drop_database(self, self->st_database); + xt_unuse_database(self, self); + xt_ha_close_global_database(self); + } +#endif + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + return err; +} + +int ha_pbxt::delete_system_table(const char *table_path) +{ + THD *thd = current_thd; + XTExceptionRec e; + int err = 0; + XTThreadPtr self; + + if (!(self = xt_ha_set_current_thread(thd, &e))) + return xt_ha_pbxt_to_mysql_error(e.e_xt_err); + + try_(a) { + xt_ha_open_database_of_table(self, (XTPathStrPtr) table_path); + + if (xt_table_exists(self->st_database)) + xt_throw_xterr(XT_CONTEXT, XT_ERR_PBXT_TABLE_EXISTS); + + XTSystemTableShare::setSystemTableDeleted(table_path); + + if (!XTSystemTableShare::doesSystemTableExist()) { + xt_ha_all_threads_close_database(self, self->st_database); + xt_drop_database(self, self->st_database); + xt_unuse_database(self, self); + xt_ha_close_global_database(self); + } + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, FALSE); + } + cont_(a); + + return err; +} + +/* + * Renames a table from one name to another from alter table call. + * This function can be used to move a table from one database to + * another. 
+ */ +int ha_pbxt::rename_table(const char *from, const char *to) +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + XTSharePtr share; + XTDatabaseHPtr to_db; + + XT_TRACE_CALL(); + + if (XTSystemTableShare::isSystemTable(from)) + return rename_system_table(from, to); + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + XT_PRINT2(self, "ha_pbxt::rename_table %s -> %s\n", from, to); + + try_(a) { + xt_ha_open_database_of_table(self, (XTPathStrPtr) to); + to_db = self->st_database; + + xt_ha_open_database_of_table(self, (XTPathStrPtr) from); + + if (self->st_database != to_db) + xt_throw_xterr(XT_CONTEXT, XT_ERR_CANNOT_CHANGE_DB); + + /* + * NOTE: MySQL does not lock before calling rename table! + * + * We cannot use pb_share because rename_table() is + * called without correctly initializing + * the handler! + */ + share = ha_get_share(self, from, true, NULL); + pushr_(ha_unget_share, share); + ha_aquire_exclusive_use(self, share, NULL); + pushr_(ha_release_exclusive_use, share); + ha_close_open_tables(self, share, NULL); + + self->st_ignore_fkeys = (thd_test_options(thd, OPTION_NO_FOREIGN_KEY_CHECKS)) != 0; + xt_rename_table(self, (XTPathStrPtr) from, (XTPathStrPtr) to); + + freer_(); // ha_release_exclusive_use(share) + freer_(); // ha_unget_share(share) + +#ifdef XT_STREAMING + /* PBMS remove the table? */ + xt_pbms_rename_table(from, to); +#endif + /* + * If there are no more PBXT tables in the database, we + * "drop the database", which deletes all PBXT resources + * in the database. + */ +#ifdef XT_USE_GLOBAL_DB + /* We now only drop the pbxt system data, + * when the PBXT database is dropped. 
+ */ + if (!xt_table_exists(self->st_database)) { + xt_ha_all_threads_close_database(self, self->st_database); + xt_drop_database(self, self->st_database); + } +#endif + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + XT_RETURN(err); +} + +int ha_pbxt::rename_system_table(const char *from __attribute__((unused)), const char *to __attribute__((unused))) +{ + return ER_NOT_SUPPORTED_YET; +} + +uint ha_pbxt::max_supported_key_length() const +{ + return XT_INDEX_MAX_KEY_SIZE; +} + +uint ha_pbxt::max_supported_key_part_length() const +{ + /* There is a little overhead in order to fit! */ + return XT_INDEX_MAX_KEY_SIZE-4; +} + +/* + * Called in test_quick_select to determine if indexes should be used. + * + * As far as I can tell, time is measured in "disk reads". So the + * calculation below means the system reads about 38 rows per read. + * + * For example a sequential scan uses a read buffer which reads a + * number of rows at once, or a sequential scan can make use + * of the cache (so it needs to read less). + */ +double ha_pbxt::scan_time() +{ + double result = (double) (stats.records + stats.deleted) / 38.0 + 2; + return result; +} + +/* + * The next method will never be called if you do not implement indexes. + */ +double ha_pbxt::read_time(uint index __attribute__((unused)), uint ranges, ha_rows rows) +{ + double result = rows2double(ranges+rows); + return result; +} + +/* + * Given a starting key, and an ending key estimate the number of rows that + * will exist between the two. end_key may be empty, in which case determine + * if start_key matches any rows. + * + * Called from opt_range.cc by check_quick_keys().
+ * + */ +ha_rows ha_pbxt::records_in_range(uint inx, key_range *min_key, key_range *max_key) +{ + XTIndexPtr ind; + key_part_map keypart_map; + u_int segment = 0; + ha_rows result; + + if (min_key) + keypart_map = min_key->keypart_map; + else if (max_key) + keypart_map = max_key->keypart_map; + else + return 1; + ind = (XTIndexPtr) pb_share->sh_dic_keys[inx]; + + while (keypart_map & 1) { + segment++; + keypart_map = keypart_map >> 1; + } + + if (segment < 1 || segment > ind->mi_seg_count) + result = 1; + else + result = ind->mi_seg[segment-1].is_recs_in_range; +#ifdef XT_PRINT_INDEX_OPT + printf("records_in_range %s index %d cols req=%d/%d read_bits=%X write_bits=%X index_bits=%X --> %d\n", pb_open_tab->ot_table->tab_name->ps_path, (int) inx, segment, ind->mi_seg_count, (int) *table->read_set->bitmap, (int) *table->write_set->bitmap, (int) *ind->mi_col_map.bitmap, (int) result); +#endif + return result; +} + +/* + * create() is called to create a table/database. The variable name will have the name + * of the table. When create() is called you do not need to worry about opening + * the table. Also, the FRM file will have already been created so adjusting + * create_info will not do you any good. You can overwrite the frm file at this + * point if you wish to change the table definition, but there are no methods + * currently provided for doing that. + + * Called from handler.cc by ha_create_table().
+*/ +int ha_pbxt::create(const char *table_path, TABLE *table_arg, HA_CREATE_INFO *create_info) +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + XTDDTable *tab_def = NULL; + XTDictionaryRec dic; + + memset(&dic, 0, sizeof(dic)); + + XT_TRACE_CALL(); + + if (!(self = ha_set_current_thread(thd, &err))) + return xt_ha_pbxt_to_mysql_error(err); + + STAT_TRACE(self, *thd_query(thd)); + XT_PRINT1(self, "ha_pbxt::create %s\n", table_path); + + try_(a) { + xt_ha_open_database_of_table(self, (XTPathStrPtr) table_path); + + for (uint i=0; i<TS(table_arg)->keys; i++) { + if (table_arg->key_info[i].key_length > XT_INDEX_MAX_KEY_SIZE) + xt_throw_sulxterr(XT_CONTEXT, XT_ERR_KEY_TOO_LARGE, table_arg->key_info[i].name, (u_long) XT_INDEX_MAX_KEY_SIZE); + } + + /* ($) auto_increment_value will be zero if + * AUTO_INCREMENT is not used. Otherwise + * Query was ALTER TABLE ... AUTO_INCREMENT = x; or + * CREATE TABLE ... AUTO_INCREMENT = x; + */ + tab_def = xt_ri_create_table(self, true, (XTPathStrPtr) table_path, *thd_query(thd), myxt_create_table_from_table(self, table_arg)); + tab_def->checkForeignKeys(self, create_info->options & HA_LEX_CREATE_TMP_TABLE); + + dic.dic_table = tab_def; + dic.dic_my_table = table_arg; + dic.dic_tab_flags = (create_info->options & HA_LEX_CREATE_TMP_TABLE) ? XT_TAB_FLAGS_TEMP_TAB : 0; + dic.dic_min_auto_inc = (xtWord8) create_info->auto_increment_value; /* ($) */ + dic.dic_def_ave_row_size = (xtWord8) table_arg->s->avg_row_length; + myxt_setup_dictionary(self, &dic); + + /* + * We used to ignore the value of foreign_key_checks flag and allowed creation + * of tables with "hanging" references. Now we validate FKs if foreign_key_checks != 0 + */ + self->st_ignore_fkeys = (thd_test_options(thd, OPTION_NO_FOREIGN_KEY_CHECKS)) != 0; + + /* + * Previously I set delete_if_exists=TRUE because + * CREATE TABLE was being used to TRUNCATE. + * This was due to the flag HTON_CAN_RECREATE. 
+ * Now I could set delete_if_exists=FALSE, but + * leaving it TRUE should not cause any problems. + */ + xt_create_table(self, (XTPathStrPtr) table_path, &dic); + } + catch_(a) { + if (tab_def) + tab_def->finalize(self); + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + /* Free the dictionary, but not 'table_arg'! */ + dic.dic_my_table = NULL; + myxt_free_dictionary(self, &dic); + + XT_RETURN(err); +} + +void ha_pbxt::update_create_info(HA_CREATE_INFO *create_info) +{ + XTOpenTablePtr ot; + + if ((ot = pb_open_tab)) { + if (!(create_info->used_fields & HA_CREATE_USED_AUTO)) { + /* Fill in the minimum auto-increment value! */ + create_info->auto_increment_value = ot->ot_table->tab_dic.dic_min_auto_inc; + } + } +} + +char *ha_pbxt::get_foreign_key_create_info() +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + XTStringBufferRec tab_def = { 0, 0, 0 }; + + if (!(self = ha_set_current_thread(thd, &err))) { + xt_ha_pbxt_to_mysql_error(err); + return NULL; + } + + if (!pb_open_tab) { + if ((err = reopen())) + return NULL; + } + + if (!pb_open_tab->ot_table->tab_dic.dic_table) + return NULL; + + try_(a) { + pb_open_tab->ot_table->tab_dic.dic_table->loadForeignKeyString(self, &tab_def); + } + catch_(a) { + xt_sb_set_size(self, &tab_def, 0); + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + return tab_def.sb_cstring; +} + +void ha_pbxt::free_foreign_key_create_info(char* str) +{ + xt_free(NULL, str); +} + +bool ha_pbxt::get_error_message(int error __attribute__((unused)), String *buf) +{ + THD *thd = current_thd; + int err = 0; + XTThreadPtr self; + + if (!(self = ha_set_current_thread(thd, &err))) + return FALSE; + + if (!self->t_exception.e_xt_err) + return FALSE; + + buf->copy(self->t_exception.e_err_msg, strlen(self->t_exception.e_err_msg), system_charset_info); + return TRUE; +} + +/* + * get info about FKs of the currently open table + * used in + * 1. 
REPLACE; is > 0 if table is referred by a FOREIGN KEY + * 2. INFORMATION_SCHEMA tables: TABLE_CONSTRAINTS, REFERENTIAL_CONSTRAINTS + * Return value: as of 5.1.24 it's ignored + */ + +int ha_pbxt::get_foreign_key_list(THD *thd, List<FOREIGN_KEY_INFO> *f_key_list) +{ + int err = 0; + XTThreadPtr self; + const char *action; + + if (!(self = ha_set_current_thread(thd, &err))) { + return xt_ha_pbxt_to_mysql_error(err); + } + + try_(a) { + XTDDTable *table_dic = pb_open_tab->ot_table->tab_dic.dic_table; + + if (table_dic == NULL) + xt_throw_errno(XT_CONTEXT, XT_ERR_NO_DICTIONARY); + + for (int i = 0, sz = table_dic->dt_fkeys.size(); i < sz; i++) { + FOREIGN_KEY_INFO *fk_info= new // assumed that C++ exceptions are disabled + (thd_alloc(thd, sizeof(FOREIGN_KEY_INFO))) FOREIGN_KEY_INFO; + + if (fk_info == NULL) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + + XTDDForeignKey *fk = table_dic->dt_fkeys.itemAt(i); + + const char *path = fk->fk_ref_tab_name->ps_path; + const char *ref_tbl_name = path + strlen(path); + + while (ref_tbl_name != path && !XT_IS_DIR_CHAR(*ref_tbl_name)) + ref_tbl_name--; + + const char * ref_db_name = ref_tbl_name - 1; + + while (ref_db_name != path && !XT_IS_DIR_CHAR(*ref_db_name)) + ref_db_name--; + + ref_tbl_name++; + ref_db_name++; + + fk_info->forein_id = thd_make_lex_string(thd, 0, + fk->co_name, (uint) strlen(fk->co_name), 1); + + fk_info->referenced_db = thd_make_lex_string(thd, 0, + ref_db_name, (uint) (ref_tbl_name - ref_db_name - 1), 1); + + fk_info->referenced_table = thd_make_lex_string(thd, 0, + ref_tbl_name, (uint) strlen(ref_tbl_name), 1); + + fk_info->referenced_key_name = NULL; + + XTIndex *ix = fk->getReferenceIndexPtr(); + if (ix == NULL) /* can be NULL if another thread changes referenced table at the moment */ + continue; + + XTDDTable *ref_table = fk->fk_ref_table; + + // might be a self-reference + if ((ref_table == NULL) + && (xt_tab_compare_names(path, table_dic->dt_table->tab_name->ps_path) == 0)) { + ref_table = table_dic; + 
} + + if (ref_table != NULL) { + const XTList<XTDDIndex>& ix_list = ref_table->dt_indexes; + for (int j = 0, sz2 = ix_list.size(); j < sz2; j++) { + XTDDIndex *ddix = ix_list.itemAt(j); + if (ddix->in_index == ix->mi_index_no) { + const char *ix_name = + ddix->co_name ? ddix->co_name : ddix->co_ind_name; + fk_info->referenced_key_name = thd_make_lex_string(thd, 0, + ix_name, (uint) strlen(ix_name), 1); + break; + } + } + } + + action = XTDDForeignKey::actionTypeToString(fk->fk_on_delete); + fk_info->delete_method = thd_make_lex_string(thd, 0, + action, (uint) strlen(action), 1); + action = XTDDForeignKey::actionTypeToString(fk->fk_on_update); + fk_info->update_method = thd_make_lex_string(thd, 0, + action, (uint) strlen(action), 1); + + const XTList<XTDDColumnRef>& cols = fk->co_cols; + for (int j = 0, sz2 = cols.size(); j < sz2; j++) { + XTDDColumnRef *col_ref= cols.itemAt(j); + fk_info->foreign_fields.push_back(thd_make_lex_string(thd, 0, + col_ref->cr_col_name, (uint) strlen(col_ref->cr_col_name), 1)); + } + + const XTList<XTDDColumnRef>& ref_cols = fk->fk_ref_cols; + for (int j = 0, sz2 = ref_cols.size(); j < sz2; j++) { + XTDDColumnRef *col_ref= ref_cols.itemAt(j); + fk_info->referenced_fields.push_back(thd_make_lex_string(thd, 0, + col_ref->cr_col_name, (uint) strlen(col_ref->cr_col_name), 1)); + } + + f_key_list->push_back(fk_info); + } + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, pb_ignore_dup_key); + } + cont_(a); + + return err; +} + +uint ha_pbxt::referenced_by_foreign_key() +{ + XTDDTable *table_dic = pb_open_tab->ot_table->tab_dic.dic_table; + + if (!table_dic) + return 0; + /* Check the list of referencing tables: */ + return table_dic->dt_trefs ? 
1 : 0; +} + + +struct st_mysql_sys_var +{ + MYSQL_PLUGIN_VAR_HEADER; +}; + +#if MYSQL_VERSION_ID < 60000 +#if MYSQL_VERSION_ID >= 50124 +#define USE_CONST_SAVE +#endif +#else +#if MYSQL_VERSION_ID >= 60005 +#define USE_CONST_SAVE +#endif +#endif + +#ifdef USE_CONST_SAVE +static void pbxt_record_cache_size_func(THD *thd __attribute__((unused)), struct st_mysql_sys_var *var, void *tgt, const void *save) +#else +static void pbxt_record_cache_size_func(THD *thd __attribute__((unused)), struct st_mysql_sys_var *var, void *tgt, void *save) +#endif +{ + xtInt8 record_cache_size; + + char *old= *(char **) tgt; + *(char **)tgt= *(char **) save; + if (var->flags & PLUGIN_VAR_MEMALLOC) + { + *(char **)tgt= my_strdup(*(char **) save, MYF(0)); + my_free(old, MYF(0)); + } + record_cache_size = ha_set_variable(&pbxt_record_cache_size, &vp_record_cache_size); + xt_tc_set_cache_size((size_t) record_cache_size); +#ifdef DEBUG + char buffer[200]; + + sprintf(buffer, "pbxt_record_cache_size=%llu\n", (u_llong) record_cache_size); + xt_logf(XT_NT_INFO, buffer); +#endif +} + +#ifndef DRIZZLED +struct st_mysql_storage_engine pbxt_storage_engine = { + MYSQL_HANDLERTON_INTERFACE_VERSION +}; +static st_mysql_information_schema pbxt_statitics = { + MYSQL_INFORMATION_SCHEMA_INTERFACE_VERSION +}; +#endif + +#if MYSQL_VERSION_ID >= 50118 +static MYSQL_SYSVAR_STR(index_cache_size, pbxt_index_cache_size, + PLUGIN_VAR_READONLY, + "The amount of memory allocated to the index cache, used only to cache index data.", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_STR(record_cache_size, pbxt_record_cache_size, + PLUGIN_VAR_READONLY, // PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_MEMALLOC, + "The amount of memory allocated to the record cache used to cache table data.", + NULL, pbxt_record_cache_size_func, NULL); + +static MYSQL_SYSVAR_STR(log_cache_size, pbxt_log_cache_size, + PLUGIN_VAR_READONLY, + "The amount of memory allocated to the transaction log cache used to cache transaction log data.", + NULL, NULL, NULL); 
+ +static MYSQL_SYSVAR_STR(log_file_threshold, pbxt_log_file_threshold, + PLUGIN_VAR_READONLY, + "The size of a transaction log before rollover, and a new log is created.", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_STR(transaction_buffer_size, pbxt_transaction_buffer_size, + PLUGIN_VAR_READONLY, + "The size of the global transaction log buffer (the engine allocates 2 buffers of this size).", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_STR(log_buffer_size, pbxt_log_buffer_size, + PLUGIN_VAR_READONLY, + "The size of the buffer used to cache data from transaction and data logs during sequential scans, or when writing a data log.", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_STR(checkpoint_frequency, pbxt_checkpoint_frequency, + PLUGIN_VAR_READONLY, + "The size of the transaction data buffer which is allocate by each thread.", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_STR(data_log_threshold, pbxt_data_log_threshold, + PLUGIN_VAR_READONLY, + "The maximum size of a data log file.", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_STR(data_file_grow_size, pbxt_data_file_grow_size, + PLUGIN_VAR_READONLY, + "The amount by which the handle data files (.xtd) grow.", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_STR(row_file_grow_size, pbxt_row_file_grow_size, + PLUGIN_VAR_READONLY, + "The amount by which the row pointer files (.xtr) grow.", + NULL, NULL, NULL); + +static MYSQL_SYSVAR_INT(garbage_threshold, xt_db_garbage_threshold, + PLUGIN_VAR_OPCMDARG, + "The percentage of garbage in a repository file before it is compacted.", + NULL, NULL, XT_DL_DEFAULT_GARBAGE_LEVEL, 0, 100, 1); + +static MYSQL_SYSVAR_INT(log_file_count, xt_db_log_file_count, + PLUGIN_VAR_OPCMDARG, + "The minimum number of transaction logs used.", + NULL, NULL, XT_DL_DEFAULT_XLOG_COUNT, 1, 20000, 1); + +static MYSQL_SYSVAR_INT(auto_increment_mode, xt_db_auto_increment_mode, + PLUGIN_VAR_OPCMDARG, + "The auto-increment mode, 0 = MySQL standard (default), 1 = previous ID's never reused.", + NULL, NULL, 
XT_AUTO_INCREMENT_DEF, 0, 1, 1); + +/* {RN145} */ +static MYSQL_SYSVAR_INT(offline_log_function, xt_db_offline_log_function, + PLUGIN_VAR_OPCMDARG, + "Determines what happens to transaction logs when the are moved offline, 0 = recycle logs (default), 1 = delete logs (default on Mac OS X), 2 = keep logs.", + NULL, NULL, XT_OFFLINE_LOG_FUNCTION_DEF, 0, 2, 1); + +/* {RN150} */ +static MYSQL_SYSVAR_INT(sweeper_priority, xt_db_sweeper_priority, + PLUGIN_VAR_OPCMDARG, + "Determines the priority of the background sweeper process, 0 = low (default), 1 = normal (same as user threads), 2 = high.", + NULL, NULL, XT_PRIORITY_LOW, XT_PRIORITY_LOW, XT_PRIORITY_HIGH, 1); + +static struct st_mysql_sys_var* pbxt_system_variables[] = { + MYSQL_SYSVAR(index_cache_size), + MYSQL_SYSVAR(record_cache_size), + MYSQL_SYSVAR(log_cache_size), + MYSQL_SYSVAR(log_file_threshold), + MYSQL_SYSVAR(transaction_buffer_size), + MYSQL_SYSVAR(log_buffer_size), + MYSQL_SYSVAR(checkpoint_frequency), + MYSQL_SYSVAR(data_log_threshold), + MYSQL_SYSVAR(data_file_grow_size), + MYSQL_SYSVAR(row_file_grow_size), + MYSQL_SYSVAR(garbage_threshold), + MYSQL_SYSVAR(log_file_count), + MYSQL_SYSVAR(auto_increment_mode), + MYSQL_SYSVAR(offline_log_function), + MYSQL_SYSVAR(sweeper_priority), + NULL +}; +#endif + +#ifdef DRIZZLED +drizzle_declare_plugin(pbxt) +#else +mysql_declare_plugin(pbxt) +#endif +{ + MYSQL_STORAGE_ENGINE_PLUGIN, +#ifndef DRIZZLED + &pbxt_storage_engine, +#endif + "PBXT", +#ifdef DRIZZLED + "1.0", +#endif + "Paul McCullagh, PrimeBase Technologies GmbH", + "High performance, multi-versioning transactional engine", + PLUGIN_LICENSE_GPL, + pbxt_init, /* Plugin Init */ + pbxt_end, /* Plugin Deinit */ +#ifndef DRIZZLED + 0x0001 /* 0.1 */, +#endif + NULL, /* status variables */ +#if MYSQL_VERSION_ID >= 50118 + pbxt_system_variables, /* system variables */ +#else + NULL, +#endif + NULL /* config options */ +}, +{ + MYSQL_INFORMATION_SCHEMA_PLUGIN, +#ifndef DRIZZLED + &pbxt_statitics, +#endif + 
"PBXT_STATISTICS", +#ifdef DRIZZLED + "1.0", +#endif + "Paul McCullagh, PrimeBase Technologies GmbH", + "PBXT internal system statitics", + PLUGIN_LICENSE_GPL, + pbxt_init_statitics, /* plugin init */ + pbxt_exit_statitics, /* plugin deinit */ +#ifndef DRIZZLED + 0x0005, +#endif + NULL, /* status variables */ + NULL, /* system variables */ + NULL /* config options */ +} +#ifdef DRIZZLED +drizzle_declare_plugin_end; +#else +mysql_declare_plugin_end; +#endif + +#if defined(XT_WIN) && defined(XT_COREDUMP) + +/* + * WINDOWS CORE DUMP SUPPORT + * + * MySQL supports core dumping on Windows with --core-file command line option. + * However it creates dumps with the MiniDumpNormal option which saves only stack traces. + * + * We instead (or in addition) create dumps with MiniDumpWithoutOptionalData option + * which saves all available information. To enable core dumping enable XT_COREDUMP + * at compile time. + * In addition, pbxt_crash_debug must be set to TRUE which is the case if XT_CRASH_DEBUG + * is defined. + * This switch is also controlled by creating a file called "no-debug" or "crash-debug" + * in the pbxt database directory. 
+ */ + +typedef enum _MINIDUMP_TYPE { + MiniDumpNormal = 0x0000, + MiniDumpWithDataSegs = 0x0001, + MiniDumpWithFullMemory = 0x0002, + MiniDumpWithHandleData = 0x0004, + MiniDumpFilterMemory = 0x0008, + MiniDumpScanMemory = 0x0010, + MiniDumpWithUnloadedModules = 0x0020, + MiniDumpWithIndirectlyReferencedMemory = 0x0040, + MiniDumpFilterModulePaths = 0x0080, + MiniDumpWithProcessThreadData = 0x0100, + MiniDumpWithPrivateReadWriteMemory = 0x0200, +} MINIDUMP_TYPE; + +typedef struct _MINIDUMP_EXCEPTION_INFORMATION { + DWORD ThreadId; + PEXCEPTION_POINTERS ExceptionPointers; + BOOL ClientPointers; +} MINIDUMP_EXCEPTION_INFORMATION, *PMINIDUMP_EXCEPTION_INFORMATION; + +typedef BOOL (WINAPI *MINIDUMPWRITEDUMP)( + HANDLE hProcess, + DWORD dwPid, + HANDLE hFile, + MINIDUMP_TYPE DumpType, + void *ExceptionParam, + void *UserStreamParam, + void *CallbackParam + ); + +char base_path[_MAX_PATH] = {0}; +char dump_path[_MAX_PATH] = {0}; + +void core_dump(struct _EXCEPTION_POINTERS *pExceptionInfo) +{ + SECURITY_ATTRIBUTES sa = { sizeof(SECURITY_ATTRIBUTES), 0, 0 }; + int i; + HMODULE hDll = NULL; + HANDLE hFile; + MINIDUMPWRITEDUMP pDump; + char *end_ptr = base_path; + + MINIDUMP_EXCEPTION_INFORMATION ExInfo, *ExInfoPtr = NULL; + + if (pExceptionInfo) { + ExInfo.ThreadId = GetCurrentThreadId(); + ExInfo.ExceptionPointers = pExceptionInfo; + ExInfo.ClientPointers = NULL; + ExInfoPtr = &ExInfo; + } + + end_ptr = base_path + strlen(base_path); + + strcat(base_path, "DBGHELP.DLL" ); + hDll = LoadLibrary(base_path); + *end_ptr = 0; + if (hDll==NULL) { + int err; + err = HRESULT_CODE(GetLastError()); + hDll = LoadLibrary( "DBGHELP.DLL" ); + if (hDll==NULL) { + err = HRESULT_CODE(GetLastError()); + return; + } + } + + pDump = (MINIDUMPWRITEDUMP)GetProcAddress( hDll, "MiniDumpWriteDump" ); + if (!pDump) { + int err; + err = HRESULT_CODE(GetLastError()); + return; + } + + for (i = 1; i < INT_MAX; i++) { + sprintf(dump_path, "%sPBXTCore%08d.dmp", base_path, i); + hFile = CreateFile( 
dump_path, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, CREATE_NEW, + FILE_ATTRIBUTE_NORMAL, NULL ); + + if ( hFile != INVALID_HANDLE_VALUE ) + break; + + if (HRESULT_CODE(GetLastError()) == ERROR_FILE_EXISTS ) + continue; + + return; + } + + // write the dump + BOOL bOK = pDump( GetCurrentProcess(), GetCurrentProcessId(), hFile, + MiniDumpWithPrivateReadWriteMemory, ExInfoPtr, NULL, NULL ); + + CloseHandle(hFile); +} + +LONG crash_filter( struct _EXCEPTION_POINTERS *pExceptionInfo ) +{ + core_dump(pExceptionInfo); + return EXCEPTION_EXECUTE_HANDLER; +} + +void register_crash_filter() +{ + SetUnhandledExceptionFilter( (LPTOP_LEVEL_EXCEPTION_FILTER) crash_filter ); +} + +#endif // XT_WIN && XT_COREDUMP diff --git a/storage/pbxt/src/ha_pbxt.h b/storage/pbxt/src/ha_pbxt.h new file mode 100644 index 00000000000..6f6a194de12 --- /dev/null +++ b/storage/pbxt/src/ha_pbxt.h @@ -0,0 +1,318 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * Derived from ha_example.h + * Copyright (C) 2003 MySQL AB + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-11-10 Paul McCullagh + * + */ +#ifndef __ha_pbxt_h__ +#define __ha_pbxt_h__ + +#ifdef DRIZZLED +#include <drizzled/common.h> +#include <drizzled/handler.h> +#include <drizzled/handlerton.h> +#include <mysys/thr_lock.h> +#else +#include "mysql_priv.h" +#endif + +#include "xt_defs.h" +#include "table_xt.h" + +#ifdef USE_PRAGMA_INTERFACE +#pragma interface /* gcc class implementation */ +#endif + +#if MYSQL_VERSION_ID <= 50120 +#define thd_killed(t) (t)->killed +#endif + +#if MYSQL_VERSION_ID >= 50120 +#define byte uchar +#endif + +class ha_pbxt; + +extern handlerton *pbxt_hton; + +/* + * XTShareRec is a structure that will be shared among all open handlers. + */ +typedef struct XTShare { + XTPathStrPtr sh_table_path; + uint sh_use_count; + + XTTableHPtr sh_table; /* This is an XTTableHPtr, a reference to the XT internal table handle. */ + + uint sh_dic_key_count; + XTIndexPtr *sh_dic_keys; /* A reference to the XT internal index list. */ + xtBool sh_recalc_selectivity; /* This is set to TRUE if we have < 100 rows when the table is opened. */ + + /* We use a trick here to get an exclusive lock + * on a table. The trick avoids having to use a + * semaphore if a thread does not want + * exclusive use. + */ + xt_mutex_type *sh_ex_mutex; + xt_cond_type *sh_ex_cond; + xtBool sh_table_lock; /* Set to TRUE if a lock on the table is held. */ + ha_pbxt *sh_handlers; /* Doubly linked list of handlers for a particular table. */ + xtWord8 sh_min_auto_inc; /* Used to propagate the current auto-inc over a DELETE FROM + * (does not work if the server shuts down in between!).
+ */ + + THR_LOCK sh_lock; /* MySQL lock */ +} XTShareRec, *XTSharePtr; + +/* + * Class definition for the storage engine + */ +class ha_pbxt: public handler +{ + public: + XTSharePtr pb_share; /* Shared table info */ + + XTOpenTablePtr pb_open_tab; /* This is an XTOpenTablePtr (a reference to the XT internal table handle)! */ + + xtBool pb_key_read; /* No need to retrieve the entire row, index values are sufficient. */ + int pb_ignore_dup_key; + u_int pb_ind_row_count; + + THR_LOCK_DATA pb_lock; /* MySQL lock */ + + ha_pbxt *pb_ex_next; /* Doubly linked list of handlers for a particular table. */ + ha_pbxt *pb_ex_prev; + + xtBool pb_lock_table; /* The operation requires a table lock. */ + int pb_table_locked; /* TRUE if this handler holds the table lock. */ + int pb_ex_in_use; /* Set to 1 while the handler is in use. */ + + THD *pb_mysql_thd; /* A pointer to the MySQL thread. */ + xtBool pb_in_stat; /* TRUE if start_stmt() was issued */ + + ha_pbxt(handlerton *hton, TABLE_SHARE *table_arg); + + virtual ~ha_pbxt() { } + + /* The name that will be used for display purposes */ + const char *table_type() const { return "PBXT"; } + + /* + * The name of the index type that will be used for display. + * Don't implement this method unless you really have indexes. + */ + const char *index_type(uint inx) { (void) inx; return "BTREE"; } + + const char **bas_ext() const; + + MX_UINT8_T table_cache_type(); + + /* + * This is a list of flags that says what the storage engine + * implements. The current table flags are documented in + * handler.h + */ + MX_TABLE_TYPES_T table_flags() const; + + /* + * part is the key part to check. The first key part is 0. + * If all_parts is set, MySQL wants to know the flags for the combined + * index up to and including 'part'. + */ + MX_ULONG_T index_flags(uint inx, uint part, bool all_parts) const; + + /* + * unireg.cc will call the following to make sure that the storage engine can + * handle the data it is about to send.
+ * + * Return *real* limits of your storage engine here. MySQL will do + * min(your_limits, MySQL_limits) automatically + * + * Theoretically PBXT supports any number of key parts, etc. + * Practically this is not true of course. + */ + uint max_supported_record_length() const { return UINT_MAX; } + uint max_supported_keys() const { return 512; } + uint max_supported_key_parts() const { return 128; } + uint max_supported_key_length() const; + uint max_supported_key_part_length() const; + + double scan_time(); + + double read_time(uint index, uint ranges, ha_rows rows); + + bool has_transactions() { return 1; } + + /* + * Everything below are methods that we implement in ha_pbxt.cc. + */ + void internal_close(THD *thd, struct XTThread *self); + int open(const char *name, int mode, uint test_if_locked); // required + int reopen(void); + int close(void); // required + + void init_auto_increment(xtWord8 min_auto_inc); + void get_auto_increment(MX_ULONGLONG_T offset, MX_ULONGLONG_T increment, + MX_ULONGLONG_T nb_desired_values, + MX_ULONGLONG_T *first_value, + MX_ULONGLONG_T *nb_reserved_values); + void set_auto_increment(Field *nr); + + int write_row(byte * buf); + int update_row(const byte * old_data, byte * new_data); + int delete_row(const byte * buf); + + /* Index access functions: */ + int xt_index_in_range(register XTOpenTablePtr ot, register XTIndexPtr ind, register XTIdxSearchKeyPtr search_key, byte *buf); + int xt_index_next_read(register XTOpenTablePtr ot, register XTIndexPtr ind, xtBool key_only, register XTIdxSearchKeyPtr search_key, byte *buf); + int xt_index_prev_read(XTOpenTablePtr ot, XTIndexPtr ind, xtBool key_only, register XTIdxSearchKeyPtr search_key, byte *buf); + int index_init(uint idx, bool sorted); + int index_end(); + int index_read(byte * buf, const byte * key, + uint key_len, enum ha_rkey_function find_flag); + int index_read_idx(byte * buf, uint idx, const byte * key, + uint key_len, enum ha_rkey_function find_flag); + int 
index_read_xt(byte * buf, uint idx, const byte * key, + uint key_len, enum ha_rkey_function find_flag); + int index_next(byte * buf); + int index_next_same(byte * buf, const byte *key, uint length); + int index_prev(byte * buf); + int index_first(byte * buf); + int index_last(byte * buf); + int index_read_last(byte * buf, const byte * key, uint key_len); + + /* Sequential scan functions: */ + int rnd_init(bool scan); //required + int rnd_end(); + int rnd_next(byte *buf); //required + int rnd_pos(byte * buf, byte *pos); //required + void position(const byte *record); //required +#if MYSQL_VERSION_ID < 50114 + void info(uint); +#else + int info(uint); +#endif + + int extra(enum ha_extra_function operation); + int reset(void); + int external_lock(THD *thd, int lock_type); //required + int start_stmt(THD *thd, thr_lock_type lock_type); + void unlock_row(); + int delete_all_rows(void); + int repair(THD* thd, HA_CHECK_OPT* check_opt); + int analyze(THD* thd, HA_CHECK_OPT* check_opt); + int optimize(THD* thd, HA_CHECK_OPT* check_opt); + int check(THD* thd, HA_CHECK_OPT* check_opt); + ha_rows records_in_range(uint inx, key_range *min_key, key_range *max_key); + int delete_table(const char *from); + int delete_system_table(const char *table_path); + int rename_table(const char * from, const char * to); + int rename_system_table(const char * from, const char * to); + int create(const char *name, TABLE *form, HA_CREATE_INFO *create_info); //required + void update_create_info(HA_CREATE_INFO *create_info); + + THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to, enum thr_lock_type lock_type); //required + + /* Foreign key support: */ + //bool is_fk_defined_on_table_or_index(uint index); + char* get_foreign_key_create_info(); + int get_foreign_key_list(THD *thd, List<FOREIGN_KEY_INFO> *f_key_list); + //bool can_switch_engines(); + uint referenced_by_foreign_key(); + void free_foreign_key_create_info(char* str); + + virtual bool get_error_message(int error, String *buf); +}; 
+ +/* From ha_pbxt.cc: */ +#define XT_TAB_NAME_WITH_EXT_SIZE XT_TABLE_NAME_SIZE+4 + +class THD; +struct XTThread; +struct XTDatabase; + +void xt_ha_unlock_table(struct XTThread *self, void *share); +void xt_ha_close_global_database(XTThreadPtr self); +void xt_ha_open_database_of_table(struct XTThread *self, XTPathStrPtr table_path); +struct XTThread *xt_ha_set_current_thread(THD *thd, XTExceptionPtr e); +void xt_ha_close_connection(THD* thd); +struct XTThread *xt_ha_thd_to_self(THD* thd); +int xt_ha_pbxt_to_mysql_error(int xt_err); +int xt_ha_pbxt_thread_error_for_mysql(THD *thd, const XTThreadPtr self, int ignore_dup_key); +void xt_ha_all_threads_close_database(XTThreadPtr self, XTDatabase *db); + +/* + * These hooks are supposed to only be used by InnoDB: + */ +#ifndef DRIZZLED +#ifdef INNODB_COMPATIBILITY_HOOKS +extern "C" struct charset_info_st *thd_charset(MYSQL_THD thd); +extern "C" char **thd_query(MYSQL_THD thd); +extern "C" int thd_slave_thread(const MYSQL_THD thd); +extern "C" int thd_non_transactional_update(const MYSQL_THD thd); +extern "C" int thd_binlog_format(const MYSQL_THD thd); +extern "C" void thd_mark_transaction_to_rollback(MYSQL_THD thd, bool all); +#else +#define thd_charset(t) (t)->charset() +#define thd_query(t) &(t)->query +#define thd_slave_thread(t) (t)->slave_thread +#define thd_non_transactional_update(t) (t)->transaction.all.modified_non_trans_table +#define thd_binlog_format(t) (t)->variables.binlog_format +#define thd_mark_transaction_to_rollback(t) mark_transaction_to_rollback(t, all) +#endif // INNODB_COMPATIBILITY_HOOKS */ +#endif /* !DRIZZLED */ + +/* How to lock MySQL mutexes! 
*/ +#ifdef SAFE_MUTEX + +#if MYSQL_VERSION_ID < 60000 +#if MYSQL_VERSION_ID < 50123 +#define myxt_mutex_lock(x) safe_mutex_lock(x,__FILE__,__LINE__) +#else +#define myxt_mutex_lock(x) safe_mutex_lock(x,0,__FILE__,__LINE__) +#endif +#else +#if MYSQL_VERSION_ID < 60004 +#define myxt_mutex_lock(x) safe_mutex_lock(x,__FILE__,__LINE__) +#else +#define myxt_mutex_lock(x) safe_mutex_lock(x,0,__FILE__,__LINE__) +#endif +#endif + +#define myxt_mutex_t safe_mutex_t +#define myxt_mutex_unlock(x) safe_mutex_unlock(x,__FILE__,__LINE__) + +#else // SAFE_MUTEX + +#ifdef MY_PTHREAD_FASTMUTEX +#define myxt_mutex_lock(x) my_pthread_fastmutex_lock(x) +#define myxt_mutex_t my_pthread_fastmutex_t +#define myxt_mutex_unlock(x) pthread_mutex_unlock(&(x)->mutex) +#else +#define myxt_mutex_lock(x) pthread_mutex_lock(x) +#define myxt_mutex_t pthread_mutex_t +#define myxt_mutex_unlock(x) pthread_mutex_unlock(x) +#endif + +#endif // SAFE_MUTEX + +#endif + diff --git a/storage/pbxt/src/ha_xtsys.cc b/storage/pbxt/src/ha_xtsys.cc new file mode 100644 index 00000000000..1c76d13379a --- /dev/null +++ b/storage/pbxt/src/ha_xtsys.cc @@ -0,0 +1,252 @@ +/* Copyright (c) 2008 PrimeBase Technologies GmbH, Germany + * + * PrimeBase Media Stream for MySQL + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Paul McCullagh + * + * 2007-05-20 + * + * H&G2JCtL + * + * Table handler. + * + */ + +#ifdef USE_PRAGMA_IMPLEMENTATION +#pragma implementation // gcc: Class implementation +#endif + +#include "xt_config.h" + +#include <stdlib.h> +#include <time.h> + +#ifdef DRIZZLED +#include <drizzled/server_includes.h> +#endif + +#include "ha_xtsys.h" +#include "ha_pbxt.h" + +#include "strutil_xt.h" +#include "database_xt.h" +#include "discover_xt.h" +#include "systab_xt.h" +#include "xt_defs.h" + +/* Note: mysql_priv.h messes with new, which caused a crash. */ +#ifdef new +#undef new +#endif + +/* + * --------------------------------------------------------------- + * HANDLER INTERFACE + */ + +ha_xtsys::ha_xtsys(handlerton *hton, TABLE_SHARE *table_arg): +handler(hton, table_arg), +ha_open_tab(NULL) +{ + init(); +} + +static const char *ha_pbms_exts[] = { + "", + NullS +}; + +const char **ha_xtsys::bas_ext() const +{ + return ha_pbms_exts; +} + +int ha_xtsys::open(const char *table_path, int mode __attribute__((unused)), uint test_if_locked __attribute__((unused))) +{ + THD *thd = current_thd; + XTExceptionRec e; + XTThreadPtr self; + int err = 0; + + if (!(self = xt_ha_set_current_thread(thd, &e))) + return xt_ha_pbxt_to_mysql_error(e.e_xt_err); + + try_(a) { + xt_ha_open_database_of_table(self, (XTPathStrPtr) table_path); + + ha_open_tab = XTSystemTableShare::openSystemTable(self, table_path, table); + thr_lock_data_init(ha_open_tab->ost_share->sts_my_lock, &ha_lock, NULL); + ref_length = ha_open_tab->getRefLen(); + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, FALSE); + if (ha_open_tab) { + ha_open_tab->release(self); + ha_open_tab = NULL; + } + } + cont_(a); + + return err; +} + +int ha_xtsys::close(void) +{ + THD *thd = current_thd; 
+ XTExceptionRec e; + volatile XTThreadPtr self = NULL; + int err = 0; + + if (thd) + self = xt_ha_set_current_thread(thd, &e); + else { + if (!(self = xt_create_thread("TempForClose", FALSE, TRUE, &e))) { + xt_log_exception(NULL, &e, XT_LOG_DEFAULT); + return 0; + } + } + + if (self) { + try_(a) { + if (ha_open_tab) { + ha_open_tab->release(self); + ha_open_tab = NULL; + } + } + catch_(a) { + err = xt_ha_pbxt_thread_error_for_mysql(thd, self, FALSE); + } + cont_(a); + + if (!thd) + xt_free_thread(self); + } + else + xt_log(XT_NS_CONTEXT, XT_LOG_WARNING, "Unable to release table reference\n"); + + return err; +} + +int ha_xtsys::rnd_init(bool scan __attribute__((unused))) +{ + int err = 0; + + if (!ha_open_tab->seqScanInit()) + err = xt_ha_pbxt_thread_error_for_mysql(current_thd, xt_get_self(), FALSE); + + return err; +} + +int ha_xtsys::rnd_next(byte *buf) +{ + bool eof; + int err = 0; + + if (!ha_open_tab->seqScanNext((char *) buf, &eof)) { + if (eof) + err = HA_ERR_END_OF_FILE; + else + err = xt_ha_pbxt_thread_error_for_mysql(current_thd, xt_get_self(), FALSE); + } + + return err; +} + +void ha_xtsys::position(const byte *record) +{ + xtWord4 rec_id; + rec_id = ha_open_tab->seqScanPos((xtWord1 *) record); + mi_int4store((xtWord1 *) ref, rec_id); +} + +int ha_xtsys::rnd_pos(byte * buf, byte *pos) +{ + int err = 0; + xtWord4 rec_id; + + rec_id = mi_uint4korr((xtWord1 *) pos); + if (!ha_open_tab->seqScanRead(rec_id, (char *) buf)) + err = xt_ha_pbxt_thread_error_for_mysql(current_thd, xt_get_self(), FALSE); + + return err; +} + +int ha_xtsys::info(uint flag __attribute__((unused))) +{ + return 0; +} + +int ha_xtsys::external_lock(THD *thd, int lock_type) +{ + XTExceptionRec e; + XTThreadPtr self; + int err = 0; + bool ok; + + if (!(self = xt_ha_set_current_thread(thd, &e))) + return xt_ha_pbxt_to_mysql_error(e.e_xt_err); + + if (lock_type == F_UNLCK) + ok = ha_open_tab->unuse(); + else + ok = ha_open_tab->use(); + + if (!ok) + err = 
xt_ha_pbxt_thread_error_for_mysql(current_thd, xt_get_self(), FALSE); + + return err; +} + +THR_LOCK_DATA **ha_xtsys::store_lock(THD *thd __attribute__((unused)), THR_LOCK_DATA **to, enum thr_lock_type lock_type) +{ + if (lock_type != TL_IGNORE && ha_lock.type == TL_UNLOCK) + ha_lock.type = lock_type; + *to++ = &ha_lock; + return to; +} + +/* Note: ha_pbxt::delete_system_table is called instead. */ +int ha_xtsys::delete_table(const char *table_path __attribute__((unused))) +{ + /* Should never be called */ + return 0; +} + +int ha_xtsys::create(const char *name __attribute__((unused)), TABLE *table_arg __attribute__((unused)), HA_CREATE_INFO *create_info __attribute__((unused))) +{ + /* Allow the table to be created. + * This is required after a dump is restored. + */ + return 0; +} + +bool ha_xtsys::get_error_message(int error __attribute__((unused)), String *buf) +{ + THD *thd = current_thd; + XTExceptionRec e; + XTThreadPtr self; + + if (!(self = xt_ha_set_current_thread(thd, &e))) + return FALSE; + + if (!self->t_exception.e_xt_err) + return FALSE; + + buf->copy(self->t_exception.e_err_msg, strlen(self->t_exception.e_err_msg), system_charset_info); + return TRUE; +} + diff --git a/storage/pbxt/src/ha_xtsys.h b/storage/pbxt/src/ha_xtsys.h new file mode 100644 index 00000000000..66a4b5a5dfa --- /dev/null +++ b/storage/pbxt/src/ha_xtsys.h @@ -0,0 +1,95 @@ +/* Copyright (c) 2008 PrimeBase Technologies GmbH, Germany + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Paul McCullagh + * + * 2007-05-20 + * + * H&G2JCtL + * + * PBXT System Table handler. + * + */ +#ifndef __HA_XTSYS_H__ +#define __HA_XTSYS_H__ + +#ifdef DRIZZLED +#include <drizzled/common.h> +#include <drizzled/handler.h> +#include <drizzled/current_session.h> +#else +#include "mysql_priv.h" +#endif + +#include "xt_defs.h" + +#ifdef USE_PRAGMA_INTERFACE +#pragma interface /* gcc class implementation */ +#endif + +#if MYSQL_VERSION_ID >= 50120 +#define byte uchar +#endif + +class XTOpenSystemTable; + +class ha_xtsys: public handler +{ + THR_LOCK_DATA ha_lock; ///< MySQL lock + XTOpenSystemTable *ha_open_tab; + +public: + ha_xtsys(handlerton *hton, TABLE_SHARE *table_arg); + ~ha_xtsys() { } + + const char *table_type() const { return "PBXT"; } + + const char *index_type(uint inx __attribute__((unused))) { + return "NONE"; + } + + const char **bas_ext() const; + + MX_TABLE_TYPES_T table_flags() const { + return HA_BINLOG_ROW_CAPABLE | HA_BINLOG_STMT_CAPABLE; + } + + MX_ULONG_T index_flags(uint inx __attribute__((unused)), uint part __attribute__((unused)), bool all_parts __attribute__((unused))) const { + return (HA_READ_NEXT | HA_READ_PREV | HA_READ_RANGE | HA_KEYREAD_ONLY); + } + uint max_supported_keys() const { return 512; } + uint max_supported_key_part_length() const { return 1024; } + + int open(const char *name, int mode, uint test_if_locked); + int close(void); + int rnd_init(bool scan); + int rnd_next(byte *buf); + int rnd_pos(byte * buf, byte *pos); + void position(const byte *record); + int info(uint); + + int external_lock(THD *thd, int lock_type); + int delete_table(const char *from); + int create(const char *name, TABLE *form, HA_CREATE_INFO *create_info); + + THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to, enum thr_lock_type 
lock_type); + bool get_error_message(int error, String *buf); +}; + +#endif + diff --git a/storage/pbxt/src/hashtab_xt.cc b/storage/pbxt/src/hashtab_xt.cc new file mode 100644 index 00000000000..3708f071ac5 --- /dev/null +++ b/storage/pbxt/src/hashtab_xt.cc @@ -0,0 +1,264 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-15 Paul McCullagh + * + */ + +#include "xt_config.h" + +#include <ctype.h> + +#include "pthread_xt.h" +#include "heap_xt.h" +#include "thread_xt.h" +#include "hashtab_xt.h" + +XTHashTabPtr xt_new_hashtable(XTThreadPtr self, XTHTCompareFunc comp_func, XTHTHashFunc hash_func, XTHTFreeFunc free_func, xtBool with_lock, xtBool with_cond) +{ + XTHashTabPtr ht; + xtHashValue tab_size = 223; + + ht = (XTHashTabPtr) xt_calloc(self, offsetof(XTHashTabRec, ht_items) + (sizeof(XTHashItemPtr) * tab_size)); + ht->ht_comp_func = comp_func; + ht->ht_hash_func = hash_func; + ht->ht_free_func = free_func; + ht->ht_tab_size = tab_size; + + if (with_lock || with_cond) { + ht->ht_lock = (xt_mutex_type *) xt_calloc(self, sizeof(xt_mutex_type)); + try_(a) { + xt_init_mutex_with_autoname(self, ht->ht_lock); + } + catch_(a) { + xt_free(self, ht->ht_lock); + xt_free(self, ht); + throw_(); + } + cont_(a); + } + + if 
(with_cond) { + ht->ht_cond = (xt_cond_type *) xt_calloc(self, sizeof(xt_cond_type)); + try_(b) { + xt_init_cond(self, ht->ht_cond); + } + catch_(b) { + xt_free(self, ht->ht_cond); + ht->ht_cond = NULL; + xt_free_hashtable(self, ht); + throw_(); + } + cont_(b); + } + + return ht; +} + +void xt_free_hashtable(XTThreadPtr self, XTHashTabPtr ht) +{ + xtHashValue i; + XTHashItemPtr item, tmp_item; + + if (ht->ht_lock) + xt_lock_mutex(self, ht->ht_lock); + for (i=0; i<ht->ht_tab_size; i++) { + item = ht->ht_items[i]; + while (item) { + if (ht->ht_free_func) + (*ht->ht_free_func)(self, item->hi_data); + tmp_item = item; + item = item->hi_next; + xt_free(self, tmp_item); + } + } + if (ht->ht_lock) + xt_unlock_mutex(self, ht->ht_lock); + if (ht->ht_lock) { + xt_free_mutex(ht->ht_lock); + xt_free(self, ht->ht_lock); + } + if (ht->ht_cond) { + xt_free_cond(ht->ht_cond); + xt_free(self, ht->ht_cond); + } + xt_free(self, ht); +} + +xtPublic void xt_ht_put(XTThreadPtr self, XTHashTabPtr ht, void *data) +{ + XTHashItemPtr item = NULL; + xtHashValue h; + + pushr_(ht->ht_free_func, data); + h = (*ht->ht_hash_func)(FALSE, data); + item = (XTHashItemPtr) xt_malloc(self, sizeof(XTHashItemRec)); + item->hi_data = data; + item->hi_hash = h; + item->hi_next = ht->ht_items[h % ht->ht_tab_size]; + ht->ht_items[h % ht->ht_tab_size] = item; + popr_(); +} + +xtPublic void *xt_ht_get(XTThreadPtr self __attribute__((unused)), XTHashTabPtr ht, void *key) +{ + XTHashItemPtr item; + xtHashValue h; + void *data = NULL; + + h = (*ht->ht_hash_func)(TRUE, key); + + item = ht->ht_items[h % ht->ht_tab_size]; + while (item) { + if (item->hi_hash == h && (*ht->ht_comp_func)(key, item->hi_data)) { + data = item->hi_data; + break; + } + item = item->hi_next; + } + + return data; +} + +xtPublic xtBool xt_ht_del(XTThreadPtr self, XTHashTabPtr ht, void *key) +{ + XTHashItemPtr item, pitem = NULL; + xtHashValue h; + xtBool found = FALSE; + + h = (*ht->ht_hash_func)(TRUE, key); + + item = ht->ht_items[h % 
ht->ht_tab_size]; + while (item) { + if (item->hi_hash == h && (*ht->ht_comp_func)(key, item->hi_data)) { + void *data; + + found = TRUE; + data = item->hi_data; + + /* Unlink the item: */ + if (pitem) + pitem->hi_next = item->hi_next; + else + ht->ht_items[h % ht->ht_tab_size] = item->hi_next; + + /* Free the item: */ + xt_free(self, item); + + /* Free the data */ + if (ht->ht_free_func) + (*ht->ht_free_func)(self, data); + break; + } + pitem = item; + item = item->hi_next; + } + + return found; +} + +xtPublic xtHashValue xt_ht_hash(char *s) +{ + register char *p; + register xtHashValue h = 0, g; + + p = s; + while (*p) { + h = (h << 4) + *p; + /* Assignment intended here! */ + if ((g = h & 0xF0000000)) { + h = h ^ (g >> 24); + h = h ^ g; + } + p++; + } + return h; +} + +/* + * The case-insensitive version of the hash... + */ +xtPublic xtHashValue xt_ht_casehash(char *s) +{ + register char *p; + register xtHashValue h = 0, g; + + p = s; + while (*p) { + h = (h << 4) + tolower(*p); + /* Assignment intended here! 
*/ + if ((g = h & 0xF0000000)) { + h = h ^ (g >> 24); + h = h ^ g; + } + p++; + } + return h; +} + +xtPublic xtBool xt_ht_lock(XTThreadPtr self, XTHashTabPtr ht) +{ + if (ht->ht_lock) + return xt_lock_mutex(self, ht->ht_lock); + return TRUE; +} + +xtPublic void xt_ht_unlock(XTThreadPtr self, XTHashTabPtr ht) +{ + if (ht->ht_lock) + xt_unlock_mutex(self, ht->ht_lock); +} + +xtPublic void xt_ht_wait(XTThreadPtr self, XTHashTabPtr ht) +{ + xt_wait_cond(self, ht->ht_cond, ht->ht_lock); +} + +xtPublic void xt_ht_timed_wait(XTThreadPtr self, XTHashTabPtr ht, u_long milli_sec) +{ + xt_timed_wait_cond(self, ht->ht_cond, ht->ht_lock, milli_sec); +} + +xtPublic void xt_ht_signal(XTThreadPtr self, XTHashTabPtr ht) +{ + xt_signal_cond(self, ht->ht_cond); +} + +xtPublic void xt_ht_enum(struct XTThread *self __attribute__((unused)), XTHashTabPtr ht, XTHashEnumPtr en) +{ + en->he_i = 0; + en->he_item = NULL; + en->he_ht = ht; +} + +xtPublic void *xt_ht_next(struct XTThread *self __attribute__((unused)), XTHashEnumPtr en) +{ + if (en->he_item) { + en->he_item = en->he_item->hi_next; + if (en->he_item) + return en->he_item->hi_data; + en->he_i++; + } + while (en->he_i < en->he_ht->ht_tab_size) { + if ((en->he_item = en->he_ht->ht_items[en->he_i])) + return en->he_item->hi_data; + en->he_i++; + } + return NULL; +} + diff --git a/storage/pbxt/src/hashtab_xt.h b/storage/pbxt/src/hashtab_xt.h new file mode 100644 index 00000000000..d6085c4288d --- /dev/null +++ b/storage/pbxt/src/hashtab_xt.h @@ -0,0 +1,78 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-15 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_hashtab_h__ +#define __xt_hashtab_h__ + +#include "xt_defs.h" + +struct XTThread; + +#define xtHashValue u_int + +typedef xtBool (*XTHTCompareFunc)(void *key, void *data); +typedef xtHashValue (*XTHTHashFunc)(xtBool is_key, void *key_data); +typedef void (*XTHTFreeFunc)(struct XTThread *self, void *item); + +typedef struct XTHashItem { + struct XTHashItem *hi_next; + xtHashValue hi_hash; + void *hi_data; +} XTHashItemRec, *XTHashItemPtr; + +typedef struct XTHashTab { + XTHTCompareFunc ht_comp_func; + XTHTHashFunc ht_hash_func; + XTHTFreeFunc ht_free_func; + xt_mutex_type *ht_lock; + xt_cond_type *ht_cond; + + xtHashValue ht_tab_size; + XTHashItemPtr ht_items[XT_VAR_LENGTH]; +} XTHashTabRec, *XTHashTabPtr; + +typedef struct XTHashEnum { + u_int he_i; + XTHashItemPtr he_item; + XTHashTabPtr he_ht; +} XTHashEnumRec, *XTHashEnumPtr; + +XTHashTabPtr xt_new_hashtable(struct XTThread *self, XTHTCompareFunc comp_func, XTHTHashFunc hash_func, XTHTFreeFunc free_func, xtBool with_lock, xtBool with_cond); +void xt_free_hashtable(struct XTThread *self, XTHashTabPtr ht); + +void xt_ht_put(struct XTThread *self, XTHashTabPtr ht, void *data); +void *xt_ht_get(struct XTThread *self, XTHashTabPtr ht, void *key); +xtBool xt_ht_del(struct XTThread *self, XTHashTabPtr ht, void *key); + +xtHashValue xt_ht_hash(char *s); +xtHashValue xt_ht_casehash(char *s); + +xtBool xt_ht_lock(struct XTThread *self, XTHashTabPtr ht); +void xt_ht_unlock(struct XTThread *self, 
XTHashTabPtr ht); +void xt_ht_wait(struct XTThread *self, XTHashTabPtr ht); +void xt_ht_timed_wait(struct XTThread *self, XTHashTabPtr ht, u_long milli_sec); +void xt_ht_signal(struct XTThread *self, XTHashTabPtr ht); + +void xt_ht_enum(struct XTThread *self, XTHashTabPtr ht, XTHashEnumPtr en); +void *xt_ht_next(struct XTThread *self, XTHashEnumPtr en); + +#endif diff --git a/storage/pbxt/src/heap_xt.cc b/storage/pbxt/src/heap_xt.cc new file mode 100644 index 00000000000..54756060942 --- /dev/null +++ b/storage/pbxt/src/heap_xt.cc @@ -0,0 +1,129 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-10 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include "pthread_xt.h" +#include "heap_xt.h" +#include "thread_xt.h" + +#ifdef xt_heap_new +#undef xt_heap_new +#endif + +#ifdef DEBUG +xtPublic XTHeapPtr xt_mm_heap_new(XTThreadPtr self, size_t size, XTFinalizeFunc finalize, u_int line, c_char *file, xtBool track) +#else +xtPublic XTHeapPtr xt_heap_new(XTThreadPtr self, size_t size, XTFinalizeFunc finalize) +#endif +{ + volatile XTHeapPtr hp; + +#ifdef DEBUG + hp = (XTHeapPtr) xt_mm_calloc(self, size, line, file); + hp->h_track = track; + if (track) + printf("HEAP: +1 1 %s:%d\n", file, (int) line); +#else + hp = (XTHeapPtr) xt_calloc(self, size); +#endif + if (!hp) + return NULL; + + try_(a) { + xt_spinlock_init_with_autoname(self, &hp->h_lock); + } + catch_(a) { + xt_free(self, hp); + throw_(); + } + cont_(a); + + hp->h_ref_count = 1; + hp->h_finalize = finalize; + hp->h_onrelease = NULL; + return hp; +} + +xtPublic void xt_check_heap(XTThreadPtr self __attribute__((unused)), XTHeapPtr hp __attribute__((unused))) +{ +#ifdef DEBUG + xt_mm_malloc_size(self, hp); +#endif +} + +#ifdef DEBUG +xtPublic void xt_mm_heap_reference(XTThreadPtr self, XTHeapPtr hp, u_int line, c_char *file) +#else +xtPublic void xt_heap_reference(XTThreadPtr, XTHeapPtr hp) +#endif +{ + xt_spinlock_lock(&hp->h_lock); + hp->h_ref_count++; +#ifdef DEBUG + if (hp->h_track) + printf("HEAP: +1 %2d %s:%d\n", (int) hp->h_ref_count, file, (int) line); +#endif + xt_spinlock_unlock(&hp->h_lock); +} + +xtPublic void xt_heap_release(XTThreadPtr self, XTHeapPtr hp) +{ + if (!hp) + return; +#ifdef DEBUG + xt_spinlock_lock(&hp->h_lock); + ASSERT(hp->h_ref_count != 0); + xt_spinlock_unlock(&hp->h_lock); +#endif + xt_spinlock_lock(&hp->h_lock); + if 
(hp->h_onrelease) + (*hp->h_onrelease)(self, hp); + if (hp->h_ref_count > 0) { +#ifdef DEBUG + if (hp->h_track) + printf("HEAP: -1 %2d\n", (int) hp->h_ref_count); +#endif + hp->h_ref_count--; + if (hp->h_ref_count == 0) { + if (hp->h_finalize) + (*hp->h_finalize)(self, hp); + xt_spinlock_unlock(&hp->h_lock); + xt_free(self, hp); + return; + } + } + xt_spinlock_unlock(&hp->h_lock); +} + +xtPublic void xt_heap_set_release_callback(XTThreadPtr self __attribute__((unused)), XTHeapPtr hp, XTFinalizeFunc onrelease) +{ + hp->h_onrelease = onrelease; +} + +xtPublic u_int xt_heap_get_ref_count(struct XTThread *self __attribute__((unused)), XTHeapPtr hp) +{ + return hp->h_ref_count; +} + + diff --git a/storage/pbxt/src/heap_xt.h b/storage/pbxt/src/heap_xt.h new file mode 100644 index 00000000000..afad132e1e3 --- /dev/null +++ b/storage/pbxt/src/heap_xt.h @@ -0,0 +1,68 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-10 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_heap_h__ +#define __xt_heap_h__ + +#include "xt_defs.h" +#include "lock_xt.h" + +struct XTThread; + +/* + * Heap memory has a reference count, and a lock for shared access. + * It also has a finalize routine which is called before the memory is + * freed. 
+ */ +typedef void (*XTFinalizeFunc)(struct XTThread *self, void *heap_ptr); + +typedef struct XTHeap { + XTSpinLockRec h_lock; /* Prevent concurrent access to the heap memory: */ + u_int h_ref_count; /* So we know when to free (EVERY pointer reference MUST be counted). */ + XTFinalizeFunc h_finalize; /* If non-NULL, call before freeing. */ + XTFinalizeFunc h_onrelease; /* If non-NULL, call on release. */ +#ifdef DEBUG + xtBool h_track; +#endif +} XTHeapRec, *XTHeapPtr; + +/* Returns with reference count = 1 */ +XTHeapPtr xt_heap_new(struct XTThread *self, size_t size, XTFinalizeFunc finalize); +XTHeapPtr xt_mm_heap_new(struct XTThread *self, size_t size, XTFinalizeFunc finalize, u_int line, c_char *file, xtBool track); + +void xt_heap_set_release_callback(struct XTThread *self, XTHeapPtr mem, XTFinalizeFunc onrelease); + +void xt_heap_reference(struct XTThread *self, XTHeapPtr mem); +void xt_mm_heap_reference(struct XTThread *self, XTHeapPtr hp, u_int line, c_char *file); + +void xt_heap_release(struct XTThread *self, XTHeapPtr mem); +u_int xt_heap_get_ref_count(struct XTThread *self, XTHeapPtr mem); + +void xt_check_heap(struct XTThread *self, XTHeapPtr mem); + +#ifdef DEBUG +#define xt_heap_new(t, s, f) xt_mm_heap_new(t, s, f, __LINE__, __FILE__, FALSE) +#define xt_heap_new_track(t, s, f) xt_mm_heap_new(t, s, f, __LINE__, __FILE__, TRUE) +#define xt_heap_reference(t, s) xt_mm_heap_reference(t, s, __LINE__, __FILE__) +#endif + +#endif diff --git a/storage/pbxt/src/index_xt.cc b/storage/pbxt/src/index_xt.cc new file mode 100644 index 00000000000..9cd9a966f74 --- /dev/null +++ b/storage/pbxt/src/index_xt.cc @@ -0,0 +1,3854 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-09-30 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <string.h> +#include <stdio.h> +#include <stddef.h> +#ifndef XT_WIN +#include <strings.h> +#endif + +#ifdef DRIZZLED +#include <drizzled/base.h> +#else +#include "mysql_priv.h" +#endif + +#include "pthread_xt.h" +#include "memory_xt.h" +#include "index_xt.h" +#include "heap_xt.h" +#include "database_xt.h" +#include "strutil_xt.h" +#include "cache_xt.h" +#include "myxt_xt.h" +#include "trace_xt.h" +#include "table_xt.h" + +#ifdef DEBUG +#define MAX_SEARCH_DEPTH 32 +//#define CHECK_AND_PRINT +//#define CHECK_NODE_REFERENCE +//#define TRACE_FLUSH +//#define CHECK_PRINTS_RECORD_REFERENCES +#else +#define MAX_SEARCH_DEPTH 100 +#endif + +#define IND_FLUSH_BUFFER_SIZE 200 + +typedef struct IdxStackItem { + XTIdxItemRec i_pos; + xtIndexNodeID i_branch; +} IdxStackItemRec, *IdxStackItemPtr; + +typedef struct IdxBranchStack { + int s_top; + IdxStackItemRec s_elements[MAX_SEARCH_DEPTH]; +} IdxBranchStackRec, *IdxBranchStackPtr; + +#ifdef DEBUG +#ifdef TEST_CODE +static void idx_check_on_key(XTOpenTablePtr ot); +#endif +static u_int idx_check_index(XTOpenTablePtr ot, XTIndexPtr ind, xtBool with_lock); +#endif + +static xtBool idx_insert_node(XTOpenTablePtr ot, XTIndexPtr ind, IdxBranchStackPtr stack, XTIdxKeyValuePtr key_value, xtIndexNodeID branch); + +#ifdef XT_TRACK_INDEX_UPDATES + +static xtBool ind_track_write(struct XTOpenTable *ot, struct XTIndex *ind, xtIndexNodeID offset, size_t size, xtWord1 *data) +{ + 
ot->ot_ind_reads++; + return xt_ind_write(ot, ind, offset, size, data); +} + +#define XT_IND_WRITE ind_track_write + +#else + +#define XT_IND_WRITE xt_ind_write + +#endif + + +#ifdef CHECK_NODE_REFERENCE +#define IDX_GET_NODE_REF(t, x, o) idx_get_node_ref(t, x, o) +#else +#define IDX_GET_NODE_REF(t, x, o) XT_GET_NODE_REF(t, (x) - (o)) +#endif + +/* + * ----------------------------------------------------------------------- + * DEBUG ACTIVITY + */ + +//#define TRACK_ACTIVITY + +#ifdef TRACK_ACTIVITY +#define TRACK_MAX_BLOCKS 2000 + +typedef struct TrackBlock { + xtWord1 exists; + char *activity; +} TrackBlockRec, *TrackBlockPtr; + +TrackBlockRec blocks[TRACK_MAX_BLOCKS]; + +xtPublic void track_work(u_int block, char *what) +{ + int len = 0, len2; + + ASSERT_NS(block > 0 && block <= TRACK_MAX_BLOCKS); + block--; + if (blocks[block].activity) + len = strlen(blocks[block].activity); + len2 = strlen(what); + xt_realloc_ns((void **) &blocks[block].activity, len + len2 + 1); + memcpy(blocks[block].activity + len, what, len2 + 1); +} + +static void track_block_exists(xtIndexNodeID block) +{ + if (XT_NODE_ID(block) > 0 && XT_NODE_ID(block) <= TRACK_MAX_BLOCKS) + blocks[XT_NODE_ID(block)-1].exists = TRUE; +} + +static void track_reset_missing() +{ + for (u_int i=0; i<TRACK_MAX_BLOCKS; i++) + blocks[i].exists = FALSE; +} + +static void track_dump_missing(xtIndexNodeID eof_block) +{ + for (u_int i=0; i<XT_NODE_ID(eof_block)-1; i++) { + if (!blocks[i].exists) + printf("block missing = %04d %s\n", i+1, blocks[i].activity); + } +} + +static void track_dump_all(u_int max_block) +{ + for (u_int i=0; i<max_block; i++) { + if (blocks[i].exists) + printf(" %04d %s\n", i+1, blocks[i].activity); + else + printf("-%04d %s\n", i+1, blocks[i].activity); + } +} + +#endif + +xtPublic void xt_ind_track_dump_block(XTTableHPtr tab __attribute__((unused)), xtIndexNodeID address __attribute__((unused))) +{ +#ifdef TRACK_ACTIVITY + u_int i = XT_NODE_ID(address)-1; + + printf("BLOCK %04d %s\n", 
i+1, blocks[i].activity); +#endif +} + +#ifdef CHECK_NODE_REFERENCE +static xtIndexNodeID idx_get_node_ref(XTTableHPtr tab, xtWord1 *ref, u_int node_ref_size) +{ + xtIndexNodeID node; + + /* Node is invalid by default: */ + XT_NODE_ID(node) = 0xFFFFEEEE; + if (node_ref_size) { + ref -= node_ref_size; + node = XT_RET_NODE_ID(XT_GET_DISK_4(ref)); + if (node >= tab->tab_ind_eof) { + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_INDEX_CORRUPTED, tab->tab_name); + } + } + return node; +} +#endif + +/* + * ----------------------------------------------------------------------- + * Stack functions + */ + +static void idx_newstack(IdxBranchStackPtr stack) +{ + stack->s_top = 0; +} + +static xtBool idx_push(IdxBranchStackPtr stack, xtIndexNodeID n, XTIdxItemPtr pos) +{ + if (stack->s_top == MAX_SEARCH_DEPTH) { + xt_register_error(XT_REG_CONTEXT, XT_ERR_STACK_OVERFLOW, 0, "Index node stack overflow"); + return FAILED; + } + stack->s_elements[stack->s_top].i_branch = n; + if (pos) + stack->s_elements[stack->s_top].i_pos = *pos; + stack->s_top++; + return OK; +} + +static IdxStackItemPtr idx_pop(IdxBranchStackPtr stack) +{ + if (stack->s_top == 0) + return NULL; + stack->s_top--; + return &stack->s_elements[stack->s_top]; +} + +static IdxStackItemPtr idx_top(IdxBranchStackPtr stack) +{ + if (stack->s_top == 0) + return NULL; + return &stack->s_elements[stack->s_top-1]; +} + +/* + * ----------------------------------------------------------------------- + * Allocation of nodes + */ + +static xtBool idx_new_branch(XTOpenTablePtr ot, XTIndexPtr ind, xtIndexNodeID *address) +{ + register XTTableHPtr tab; + xtIndexNodeID wrote_pos; + XTIndFreeBlockRec free_block; + XTIndFreeListPtr list_ptr; + + tab = ot->ot_table; + + //ASSERT_NS(XT_INDEX_HAVE_XLOCK(ind, ot)); + if (ind->mi_free_list && ind->mi_free_list->fl_free_count) { + ind->mi_free_list->fl_free_count--; + *address = ind->mi_free_list->fl_page_id[ind->mi_free_list->fl_free_count]; + TRACK_BLOCK_ALLOC(*address); + return OK; + } + 
+ xt_lock_mutex_ns(&tab->tab_ind_lock); + + /* Check the cached free list: */ + while ((list_ptr = tab->tab_ind_free_list)) { + if (list_ptr->fl_start < list_ptr->fl_free_count) { + wrote_pos = list_ptr->fl_page_id[list_ptr->fl_start]; + list_ptr->fl_start++; + xt_unlock_mutex_ns(&tab->tab_ind_lock); + *address = wrote_pos; + TRACK_BLOCK_ALLOC(wrote_pos); + return OK; + } + tab->tab_ind_free_list = list_ptr->fl_next_list; + xt_free_ns(list_ptr); + } + + if ((XT_NODE_ID(wrote_pos) = XT_NODE_ID(tab->tab_ind_free))) { + /* Use the block on the free list: */ + if (!xt_ind_read_bytes(ot, wrote_pos, sizeof(XTIndFreeBlockRec), (xtWord1 *) &free_block)) + goto failed; + XT_NODE_ID(tab->tab_ind_free) = (xtIndexNodeID) XT_GET_DISK_8(free_block.if_next_block_8); + xt_unlock_mutex_ns(&tab->tab_ind_lock); + *address = wrote_pos; + TRACK_BLOCK_ALLOC(wrote_pos); + return OK; + } + + /* PMC - Don't allow overflow! */ + if (XT_NODE_ID(tab->tab_ind_eof) >= 0xFFFFFFF) { + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_INDEX_FILE_TO_LARGE, xt_file_path(ot->ot_ind_file)); + goto failed; + } + *address = tab->tab_ind_eof; + XT_NODE_ID(tab->tab_ind_eof)++; + xt_unlock_mutex_ns(&tab->tab_ind_lock); + TRACK_BLOCK_ALLOC(*address); + return OK; + + failed: + xt_unlock_mutex_ns(&tab->tab_ind_lock); + return FAILED; +} + +/* Add the block to the private free list of the index. + * On flush, this list will be transferred to the global list. 
+ */ +static xtBool idx_free_branch(XTOpenTablePtr ot, XTIndexPtr ind, xtIndexNodeID node_id) +{ + register u_int count; + register u_int i; + register u_int guess; + + TRACK_BLOCK_FREE(node_id); + //ASSERT_NS(XT_INDEX_HAVE_XLOCK(ind, ot)); + if (!ind->mi_free_list) { + count = 0; + if (!(ind->mi_free_list = (XTIndFreeListPtr) xt_calloc_ns(offsetof(XTIndFreeListRec, fl_page_id) + 10 * sizeof(xtIndexNodeID)))) + return FAILED; + } + else { + count = ind->mi_free_list->fl_free_count; + if (!xt_realloc_ns((void **) &ind->mi_free_list, offsetof(XTIndFreeListRec, fl_page_id) + (count + 1) * sizeof(xtIndexNodeID))) + return FAILED; + } + + i = 0; + while (i < count) { + guess = (i + count - 1) >> 1; + if (XT_NODE_ID(node_id) == XT_NODE_ID(ind->mi_free_list->fl_page_id[guess])) { + // Should not happen... + ASSERT_NS(FALSE); + return OK; + } + if (XT_NODE_ID(node_id) < XT_NODE_ID(ind->mi_free_list->fl_page_id[guess])) + count = guess; + else + i = guess + 1; + } + + /* Insert at position i */ + memmove(ind->mi_free_list->fl_page_id + i + 1, ind->mi_free_list->fl_page_id + i, (ind->mi_free_list->fl_free_count - i) * sizeof(xtIndexNodeID)); + ind->mi_free_list->fl_page_id[i] = node_id; + ind->mi_free_list->fl_free_count++; + + /* Set the cache page to clean: */ + return xt_ind_clean(ot, ind, node_id); +} + +/* + * ----------------------------------------------------------------------- + * Simple compare functions + */ + +xtPublic int xt_compare_2_int4(XTIndexPtr ind __attribute__((unused)), uint key_length, xtWord1 *key_value, xtWord1 *b_value) +{ + int r; + + ASSERT_NS(key_length == 4 || key_length == 8); + r = (xtInt4) XT_GET_DISK_4(key_value) - (xtInt4) XT_GET_DISK_4(b_value); + if (r == 0 && key_length > 4) { + key_value += 4; + b_value += 4; + r = (xtInt4) XT_GET_DISK_4(key_value) - (xtInt4) XT_GET_DISK_4(b_value); + } + return r; +} + +xtPublic int xt_compare_3_int4(XTIndexPtr ind __attribute__((unused)), uint key_length, xtWord1 *key_value, xtWord1 *b_value) +{ + int 
r; + + ASSERT_NS(key_length == 4 || key_length == 8 || key_length == 12); + r = (xtInt4) XT_GET_DISK_4(key_value) - (xtInt4) XT_GET_DISK_4(b_value); + if (r == 0 && key_length > 4) { + key_value += 4; + b_value += 4; + r = (xtInt4) XT_GET_DISK_4(key_value) - (xtInt4) XT_GET_DISK_4(b_value); + if (r == 0 && key_length > 8) { + key_value += 4; + b_value += 4; + r = (xtInt4) XT_GET_DISK_4(key_value) - (xtInt4) XT_GET_DISK_4(b_value); + } + } + return r; +} + +/* + * ----------------------------------------------------------------------- + * Tree branch scanning (searching nodes and leaves) + */ + +xtPublic void xt_scan_branch_single(struct XTTable *tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result) +{ + XT_NODE_TEMP; + u_int branch_size; + u_int node_ref_size; + u_int full_item_size; + int search_flags; + register xtWord1 *base; + register u_int i; + register xtWord1 *bitem; + + branch_size = XT_GET_DISK_2(branch->tb_size_2); + node_ref_size = XT_IS_NODE(branch_size) ? 
XT_NODE_REF_SIZE : 0; + + result->sr_found = FALSE; + result->sr_duplicate = FALSE; + result->sr_item.i_total_size = XT_GET_BRANCH_DATA_SIZE(branch_size); + ASSERT_NS((int) result->sr_item.i_total_size >= 0 && result->sr_item.i_total_size <= XT_INDEX_PAGE_SIZE-2); + + result->sr_item.i_item_size = ind->mi_key_size + XT_RECORD_REF_SIZE; + full_item_size = result->sr_item.i_item_size + node_ref_size; + result->sr_item.i_node_ref_size = node_ref_size; + + search_flags = value->sv_flags; + base = branch->tb_data + node_ref_size; + if (search_flags & XT_SEARCH_FIRST_FLAG) + i = 0; + else if (search_flags & XT_SEARCH_AFTER_LAST_FLAG) + i = (result->sr_item.i_total_size - node_ref_size) / full_item_size; + else { + register u_int guess; + register u_int count; + register xtInt4 r; + xtRecordID key_record; + + key_record = value->sv_rec_id; + count = (result->sr_item.i_total_size - node_ref_size) / full_item_size; + + ASSERT_NS(ind); + i = 0; + while (i < count) { + guess = (i + count - 1) >> 1; + + bitem = base + guess * full_item_size; + + switch (ind->mi_single_type) { + case HA_KEYTYPE_LONG_INT: { + register xtInt4 a, b; + + a = XT_GET_DISK_4(value->sv_key); + b = XT_GET_DISK_4(bitem); + r = (a < b) ? -1 : (a == b ? 0 : 1); + break; + } + case HA_KEYTYPE_ULONG_INT: { + register xtWord4 a, b; + + a = XT_GET_DISK_4(value->sv_key); + b = XT_GET_DISK_4(bitem); + r = (a < b) ? -1 : (a == b ? 0 : 1); + break; + } + default: + /* Should not happen: */ + r = 1; + break; + } + if (r == 0) { + if (search_flags & XT_SEARCH_WHOLE_KEY) { + xtRecordID item_record; + xtRowID row_id; + + xt_get_record_ref(bitem + ind->mi_key_size, &item_record, &row_id); + + /* This should not happen because we should never + * try to insert the same record twice into the + * index! 
+ */ + result->sr_duplicate = TRUE; + if (key_record == item_record) { + result->sr_found = TRUE; + result->sr_rec_id = item_record; + result->sr_row_id = row_id; + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); + result->sr_item.i_item_offset = node_ref_size + guess * full_item_size; + return; + } + if (key_record < item_record) + r = -1; + else + r = 1; + } + else { + result->sr_found = TRUE; + /* -1 causes a search to the beginning of the duplicate list of keys. + * 1 causes a search to just after the key. + */ + if (search_flags & XT_SEARCH_AFTER_KEY) + r = 1; + else + r = -1; + } + } + + if (r < 0) + count = guess; + else + i = guess + 1; + } + } + + bitem = base + i * full_item_size; + xt_get_res_record_ref(bitem + ind->mi_key_size, result); + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); /* Only valid if this is a node. */ + result->sr_item.i_item_offset = node_ref_size + i * full_item_size; +} + +/* + * We use a special binary search here. It basically assumes that the values + * in the index are not unique. + * + * Even if they are unique, when we search for part of a key, then it is + * effectively the case. + * + * So in the situation where we find duplicates in the index we usually + * want to position ourselves at the beginning of the duplicate list. + * + * Alternatively a search can find the position just after a given key. + * + * To achieve this we make the following modifications: + * - The result of the comparison is always 1 or -1. We only stop + * the search early in the case of an exact match when inserting (but this + * should not happen anyway). + * - The search never actually fails, but sets 'found' to TRUE if it + * sees the search key in the index. + * + * If the search value exists in the index we know that + * this method will take us to the first occurrence of the key in the + * index (in the case of -1) or to the first value after + * the search key in the case of 1. 
+ */ +xtPublic void xt_scan_branch_fix(struct XTTable *tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result) +{ + XT_NODE_TEMP; + u_int branch_size; + u_int node_ref_size; + u_int full_item_size; + int search_flags; + xtWord1 *base; + register u_int i; + xtWord1 *bitem; + + branch_size = XT_GET_DISK_2(branch->tb_size_2); + node_ref_size = XT_IS_NODE(branch_size) ? XT_NODE_REF_SIZE : 0; + + result->sr_found = FALSE; + result->sr_duplicate = FALSE; + result->sr_item.i_total_size = XT_GET_BRANCH_DATA_SIZE(branch_size); + ASSERT_NS((int) result->sr_item.i_total_size >= 0 && result->sr_item.i_total_size <= XT_INDEX_PAGE_SIZE-2); + + result->sr_item.i_item_size = ind->mi_key_size + XT_RECORD_REF_SIZE; + full_item_size = result->sr_item.i_item_size + node_ref_size; + result->sr_item.i_node_ref_size = node_ref_size; + + search_flags = value->sv_flags; + base = branch->tb_data + node_ref_size; + if (search_flags & XT_SEARCH_FIRST_FLAG) + i = 0; + else if (search_flags & XT_SEARCH_AFTER_LAST_FLAG) + i = (result->sr_item.i_total_size - node_ref_size) / full_item_size; + else { + register u_int guess; + register u_int count; + xtRecordID key_record; + int r; + + key_record = value->sv_rec_id; + count = (result->sr_item.i_total_size - node_ref_size) / full_item_size; + + ASSERT_NS(ind); + i = 0; + while (i < count) { + guess = (i + count - 1) >> 1; + + bitem = base + guess * full_item_size; + + r = myxt_compare_key(ind, search_flags, value->sv_length, value->sv_key, bitem); + + if (r == 0) { + if (search_flags & XT_SEARCH_WHOLE_KEY) { + xtRecordID item_record; + xtRowID row_id; + + xt_get_record_ref(bitem + ind->mi_key_size, &item_record, &row_id); + + /* This should not happen because we should never + * try to insert the same record twice into the + * index! 
+ */ + result->sr_duplicate = TRUE; + if (key_record == item_record) { + result->sr_found = TRUE; + result->sr_rec_id = item_record; + result->sr_row_id = row_id; + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); + result->sr_item.i_item_offset = node_ref_size + guess * full_item_size; + return; + } + if (key_record < item_record) + r = -1; + else + r = 1; + } + else { + result->sr_found = TRUE; + /* -1 causes a search to the beginning of the duplicate list of keys. + * 1 causes a search to just after the key. + */ + if (search_flags & XT_SEARCH_AFTER_KEY) + r = 1; + else + r = -1; + } + } + + if (r < 0) + count = guess; + else + i = guess + 1; + } + } + + bitem = base + i * full_item_size; + xt_get_res_record_ref(bitem + ind->mi_key_size, result); + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); /* Only valid if this is a node. */ + result->sr_item.i_item_offset = node_ref_size + i * full_item_size; +} + +xtPublic void xt_scan_branch_fix_simple(struct XTTable *tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result) +{ + XT_NODE_TEMP; + u_int branch_size; + u_int node_ref_size; + u_int full_item_size; + int search_flags; + xtWord1 *base; + register u_int i; + xtWord1 *bitem; + + branch_size = XT_GET_DISK_2(branch->tb_size_2); + node_ref_size = XT_IS_NODE(branch_size) ? 
XT_NODE_REF_SIZE : 0; + + result->sr_found = FALSE; + result->sr_duplicate = FALSE; + result->sr_item.i_total_size = XT_GET_BRANCH_DATA_SIZE(branch_size); + ASSERT_NS((int) result->sr_item.i_total_size >= 0 && result->sr_item.i_total_size <= XT_INDEX_PAGE_SIZE-2); + + result->sr_item.i_item_size = ind->mi_key_size + XT_RECORD_REF_SIZE; + full_item_size = result->sr_item.i_item_size + node_ref_size; + result->sr_item.i_node_ref_size = node_ref_size; + + search_flags = value->sv_flags; + base = branch->tb_data + node_ref_size; + if (search_flags & XT_SEARCH_FIRST_FLAG) + i = 0; + else if (search_flags & XT_SEARCH_AFTER_LAST_FLAG) + i = (result->sr_item.i_total_size - node_ref_size) / full_item_size; + else { + register u_int guess; + register u_int count; + xtRecordID key_record; + int r; + + key_record = value->sv_rec_id; + count = (result->sr_item.i_total_size - node_ref_size) / full_item_size; + + ASSERT_NS(ind); + i = 0; + while (i < count) { + guess = (i + count - 1) >> 1; + + bitem = base + guess * full_item_size; + + r = ind->mi_simple_comp_key(ind, value->sv_length, value->sv_key, bitem); + + if (r == 0) { + if (search_flags & XT_SEARCH_WHOLE_KEY) { + xtRecordID item_record; + xtRowID row_id; + + xt_get_record_ref(bitem + ind->mi_key_size, &item_record, &row_id); + + /* This should not happen because we should never + * try to insert the same record twice into the + * index! + */ + result->sr_duplicate = TRUE; + if (key_record == item_record) { + result->sr_found = TRUE; + result->sr_rec_id = item_record; + result->sr_row_id = row_id; + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); + result->sr_item.i_item_offset = node_ref_size + guess * full_item_size; + return; + } + if (key_record < item_record) + r = -1; + else + r = 1; + } + else { + result->sr_found = TRUE; + /* -1 causes a search to the beginning of the duplicate list of keys. + * 1 causes a search to just after the key. 
+ */ + if (search_flags & XT_SEARCH_AFTER_KEY) + r = 1; + else + r = -1; + } + } + + if (r < 0) + count = guess; + else + i = guess + 1; + } + } + + bitem = base + i * full_item_size; + xt_get_res_record_ref(bitem + ind->mi_key_size, result); + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); /* Only valid if this is a node. */ + result->sr_item.i_item_offset = node_ref_size + i * full_item_size; +} + +/* + * Variable length key values are stored as a sorted list. Since each list item has a variable length, we + * must scan the list sequentially in order to find a key. + */ +xtPublic void xt_scan_branch_var(struct XTTable *tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result) +{ + XT_NODE_TEMP; + u_int branch_size; + u_int node_ref_size; + int search_flags; + xtWord1 *base; + xtWord1 *bitem; + u_int ilen; + xtWord1 *bend; + + branch_size = XT_GET_DISK_2(branch->tb_size_2); + node_ref_size = XT_IS_NODE(branch_size) ? 
XT_NODE_REF_SIZE : 0; + + result->sr_found = FALSE; + result->sr_duplicate = FALSE; + result->sr_item.i_total_size = XT_GET_BRANCH_DATA_SIZE(branch_size); + ASSERT_NS((int) result->sr_item.i_total_size >= 0 && result->sr_item.i_total_size <= XT_INDEX_PAGE_SIZE-2); + + result->sr_item.i_node_ref_size = node_ref_size; + + search_flags = value->sv_flags; + base = branch->tb_data + node_ref_size; + bitem = base; + bend = &branch->tb_data[result->sr_item.i_total_size]; + ilen = 0; + if (bitem >= bend) + goto done_ok; + + if (search_flags & XT_SEARCH_FIRST_FLAG) + ilen = myxt_get_key_length(ind, bitem); + else if (search_flags & XT_SEARCH_AFTER_LAST_FLAG) { + bitem = bend; + ilen = 0; + } + else { + xtRecordID key_record; + int r; + + key_record = value->sv_rec_id; + + ASSERT_NS(ind); + while (bitem < bend) { + ilen = myxt_get_key_length(ind, bitem); + r = myxt_compare_key(ind, search_flags, value->sv_length, value->sv_key, bitem); + if (r == 0) { + if (search_flags & XT_SEARCH_WHOLE_KEY) { + xtRecordID item_record; + xtRowID row_id; + + xt_get_record_ref(bitem + ilen, &item_record, &row_id); + + /* This should not happen because we should never + * try to insert the same record twice into the + * index! + */ + result->sr_duplicate = TRUE; + if (key_record == item_record) { + result->sr_found = TRUE; + result->sr_item.i_item_size = ilen + XT_RECORD_REF_SIZE; + result->sr_rec_id = item_record; + result->sr_row_id = row_id; + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); + result->sr_item.i_item_offset = bitem - branch->tb_data; + return; + } + if (key_record < item_record) + r = -1; + else + r = 1; + } + else { + result->sr_found = TRUE; + /* -1 causes a search to the beginning of the duplicate list of keys. + * 1 causes a search to just after the key. 
+ */ + if (search_flags & XT_SEARCH_AFTER_KEY) + r = 1; + else + r = -1; + } + } + if (r <= 0) + break; + bitem += ilen + XT_RECORD_REF_SIZE + node_ref_size; + } + } + + done_ok: + result->sr_item.i_item_size = ilen + XT_RECORD_REF_SIZE; + xt_get_res_record_ref(bitem + ilen, result); + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, node_ref_size); /* Only valid if this is a node. */ + result->sr_item.i_item_offset = bitem - branch->tb_data; +} + +/* Go to the next item in the node. */ +static void idx_next_branch_item(XTTableHPtr tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultRec *result) +{ + XT_NODE_TEMP; + xtWord1 *bitem; + u_int ilen; + + result->sr_item.i_item_offset += result->sr_item.i_item_size + result->sr_item.i_node_ref_size; + bitem = branch->tb_data + result->sr_item.i_item_offset; + if (ind->mi_fix_key) + ilen = result->sr_item.i_item_size; + else { + ilen = myxt_get_key_length(ind, bitem) + XT_RECORD_REF_SIZE; + result->sr_item.i_item_size = ilen; + } + xt_get_res_record_ref(bitem + ilen - XT_RECORD_REF_SIZE, result); /* (Only valid if i_item_offset < i_total_size) */ + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, result->sr_item.i_node_ref_size); +} + +xtPublic void xt_prev_branch_item_fix(XTTableHPtr tab __attribute__((unused)), XTIndexPtr ind __attribute__((unused)), XTIdxBranchDPtr branch, register XTIdxResultRec *result) +{ + XT_NODE_TEMP; + ASSERT_NS(result->sr_item.i_item_offset >= result->sr_item.i_item_size + result->sr_item.i_node_ref_size + result->sr_item.i_node_ref_size); + result->sr_item.i_item_offset -= (result->sr_item.i_item_size + result->sr_item.i_node_ref_size); + xt_get_res_record_ref(branch->tb_data + result->sr_item.i_item_offset + result->sr_item.i_item_size - XT_RECORD_REF_SIZE, result); /* (Only valid if i_item_offset < i_total_size) */ + result->sr_branch = IDX_GET_NODE_REF(tab, branch->tb_data + result->sr_item.i_item_offset, result->sr_item.i_node_ref_size); +} + +xtPublic 
void xt_prev_branch_item_var(XTTableHPtr tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultRec *result) +{ + XT_NODE_TEMP; + xtWord1 *bitem; + xtWord1 *bend; + u_int ilen; + + bitem = branch->tb_data + result->sr_item.i_node_ref_size; + bend = &branch->tb_data[result->sr_item.i_item_offset]; + for (;;) { + ilen = myxt_get_key_length(ind, bitem); + if (bitem + ilen + XT_RECORD_REF_SIZE + result->sr_item.i_node_ref_size >= bend) + break; + bitem += ilen + XT_RECORD_REF_SIZE + result->sr_item.i_node_ref_size; + } + + result->sr_item.i_item_size = ilen + XT_RECORD_REF_SIZE; + xt_get_res_record_ref(bitem + ilen, result); /* (Only valid if i_item_offset < i_total_size) */ + result->sr_branch = IDX_GET_NODE_REF(tab, bitem, result->sr_item.i_node_ref_size); + result->sr_item.i_item_offset = bitem - branch->tb_data; +} + +static void idx_first_branch_item(XTTableHPtr tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultPtr result) +{ + XT_NODE_TEMP; + u_int branch_size; + u_int node_ref_size; + u_int key_data_size; + + branch_size = XT_GET_DISK_2(branch->tb_size_2); + node_ref_size = XT_IS_NODE(branch_size) ? 
XT_NODE_REF_SIZE : 0; + + result->sr_found = FALSE; + result->sr_duplicate = FALSE; + result->sr_item.i_total_size = XT_GET_BRANCH_DATA_SIZE(branch_size); + ASSERT_NS((int) result->sr_item.i_total_size >= 0 && result->sr_item.i_total_size <= XT_INDEX_PAGE_SIZE-2); + + if (ind->mi_fix_key) + key_data_size = ind->mi_key_size; + else { + xtWord1 *bitem; + + bitem = branch->tb_data + node_ref_size; + if (bitem < &branch->tb_data[result->sr_item.i_total_size]) + key_data_size = myxt_get_key_length(ind, bitem); + else + key_data_size = 0; + } + + result->sr_item.i_item_size = key_data_size + XT_RECORD_REF_SIZE; + result->sr_item.i_node_ref_size = node_ref_size; + + xt_get_res_record_ref(branch->tb_data + node_ref_size + key_data_size, result); + result->sr_branch = IDX_GET_NODE_REF(tab, branch->tb_data + node_ref_size, node_ref_size); /* Only valid if this is a node. */ + result->sr_item.i_item_offset = node_ref_size; +} + +/* + * Last means different things for leaf or node! + */ +xtPublic void xt_last_branch_item_fix(XTTableHPtr tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultPtr result) +{ + XT_NODE_TEMP; + u_int branch_size; + u_int node_ref_size; + + branch_size = XT_GET_DISK_2(branch->tb_size_2); + node_ref_size = XT_IS_NODE(branch_size) ? 
XT_NODE_REF_SIZE : 0; + + result->sr_found = FALSE; + result->sr_duplicate = FALSE; + result->sr_item.i_total_size = XT_GET_BRANCH_DATA_SIZE(branch_size); + ASSERT_NS((int) result->sr_item.i_total_size >= 0 && result->sr_item.i_total_size <= XT_INDEX_PAGE_SIZE-2); + + result->sr_item.i_item_size = ind->mi_key_size + XT_RECORD_REF_SIZE; + result->sr_item.i_node_ref_size = node_ref_size; + + if (node_ref_size) { + result->sr_item.i_item_offset = result->sr_item.i_total_size; + result->sr_branch = IDX_GET_NODE_REF(tab, branch->tb_data + result->sr_item.i_item_offset, node_ref_size); + } + else { + if (result->sr_item.i_total_size) { + result->sr_item.i_item_offset = result->sr_item.i_total_size - result->sr_item.i_item_size; + xt_get_res_record_ref(branch->tb_data + result->sr_item.i_item_offset + ind->mi_key_size, result); + } + else + /* Leaf is empty: */ + result->sr_item.i_item_offset = 0; + } +} + +xtPublic void xt_last_branch_item_var(XTTableHPtr tab __attribute__((unused)), XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultPtr result) +{ + XT_NODE_TEMP; + u_int branch_size; + u_int node_ref_size; + + branch_size = XT_GET_DISK_2(branch->tb_size_2); + node_ref_size = XT_IS_NODE(branch_size) ? 
XT_NODE_REF_SIZE : 0; + + result->sr_found = FALSE; + result->sr_duplicate = FALSE; + result->sr_item.i_total_size = XT_GET_BRANCH_DATA_SIZE(branch_size); + ASSERT_NS((int) result->sr_item.i_total_size >= 0 && result->sr_item.i_total_size <= XT_INDEX_PAGE_SIZE-2); + + result->sr_item.i_node_ref_size = node_ref_size; + + if (node_ref_size) { + result->sr_item.i_item_offset = result->sr_item.i_total_size; + result->sr_branch = IDX_GET_NODE_REF(tab, branch->tb_data + result->sr_item.i_item_offset, node_ref_size); + result->sr_item.i_item_size = 0; + } + else { + if (result->sr_item.i_total_size) { + xtWord1 *bitem; + u_int ilen; + xtWord1 *bend; + + bitem = branch->tb_data + node_ref_size;; + bend = &branch->tb_data[result->sr_item.i_total_size]; + ilen = 0; + if (bitem < bend) { + for (;;) { + ilen = myxt_get_key_length(ind, bitem); + if (bitem + ilen + XT_RECORD_REF_SIZE + node_ref_size >= bend) + break; + bitem += ilen + XT_RECORD_REF_SIZE + node_ref_size; + } + } + + result->sr_item.i_item_offset = bitem - branch->tb_data; + xt_get_res_record_ref(bitem + ilen, result); + result->sr_item.i_item_size = ilen + XT_RECORD_REF_SIZE; + } + else { + /* Leaf is empty: */ + result->sr_item.i_item_offset = 0; + result->sr_item.i_item_size = 0; + } + } +} + +/* + * Remove an item and save to disk. + */ +static xtBool idx_remove_branch_item_right(XTOpenTablePtr ot, XTIndexPtr ind, xtIndexNodeID, XTIndReferencePtr iref, register XTIdxItemPtr item) +{ + register XTIdxBranchDPtr branch = iref->ir_branch; + u_int size = item->i_item_size + item->i_node_ref_size; + + /* {HANDLE-COUNT-USAGE} + * This access is safe because we have the right to update + * the page, so no other thread can modify the page. + * + * This means: + * We either have an Xlock on the index, or we have + * an Xlock on the cache block. 
+ */ + if (iref->ir_block->cb_handle_count) { + if (!xt_ind_copy_on_write(iref)) + return FAILED; + } + /* Remove the item and the node reference to the right of the item: */ + memmove(&branch->tb_data[item->i_item_offset], + &branch->tb_data[item->i_item_offset + size], + item->i_total_size - item->i_item_offset - size); + item->i_total_size -= size; + XT_SET_DISK_2(branch->tb_size_2, XT_MAKE_BRANCH_SIZE(item->i_total_size, item->i_node_ref_size)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(address), (int) XT_GET_DISK_2(branch->tb_size_2)); + xt_ind_release(ot, ind, item->i_node_ref_size ? XT_UNLOCK_R_UPDATE : XT_UNLOCK_W_UPDATE, iref); + return OK; +} + +static xtBool idx_remove_branch_item_left(XTOpenTablePtr ot, XTIndexPtr ind, xtIndexNodeID, XTIndReferencePtr iref, register XTIdxItemPtr item) +{ + register XTIdxBranchDPtr branch = iref->ir_branch; + u_int size = item->i_item_size + item->i_node_ref_size; + + if (iref->ir_block->cb_handle_count) { + if (!xt_ind_copy_on_write(iref)) + return FAILED; + } + /* Remove the node reference to the left of the item: */ + memmove(&branch->tb_data[item->i_item_offset - item->i_node_ref_size], + &branch->tb_data[item->i_item_offset + item->i_item_size], + item->i_total_size - item->i_item_offset - item->i_item_size); + item->i_total_size -= size; + XT_SET_DISK_2(branch->tb_size_2, XT_MAKE_BRANCH_SIZE(item->i_total_size, item->i_node_ref_size)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(address), (int) XT_GET_DISK_2(branch->tb_size_2)); + xt_ind_release(ot, ind, item->i_node_ref_size ? 
XT_UNLOCK_R_UPDATE : XT_UNLOCK_W_UPDATE, iref); + return OK; +} + +static void idx_insert_leaf_item(XTIndexPtr ind __attribute__((unused)), XTIdxBranchDPtr leaf, XTIdxKeyValuePtr value, XTIdxResultPtr result) +{ + xtWord1 *item; + + /* This will ensure we do not overwrite the end of the buffer: */ + ASSERT_NS(value->sv_length <= XT_INDEX_MAX_KEY_SIZE); + memmove(&leaf->tb_data[result->sr_item.i_item_offset + value->sv_length + XT_RECORD_REF_SIZE], + &leaf->tb_data[result->sr_item.i_item_offset], + result->sr_item.i_total_size - result->sr_item.i_item_offset); + item = &leaf->tb_data[result->sr_item.i_item_offset]; + memcpy(item, value->sv_key, value->sv_length); + xt_set_val_record_ref(item + value->sv_length, value); + result->sr_item.i_total_size += value->sv_length + XT_RECORD_REF_SIZE; + XT_SET_DISK_2(leaf->tb_size_2, XT_MAKE_LEAF_SIZE(result->sr_item.i_total_size)); +} + +static void idx_insert_node_item(XTTableHPtr tab __attribute__((unused)), XTIndexPtr ind __attribute__((unused)), XTIdxBranchDPtr leaf, XTIdxKeyValuePtr value, XTIdxResultPtr result, xtIndexNodeID branch) +{ + xtWord1 *item; + + /* This will ensure we do not overwrite the end of the buffer: */ + ASSERT_NS(value->sv_length <= XT_INDEX_MAX_KEY_SIZE); + memmove(&leaf->tb_data[result->sr_item.i_item_offset + value->sv_length + XT_RECORD_REF_SIZE + result->sr_item.i_node_ref_size], + &leaf->tb_data[result->sr_item.i_item_offset], + result->sr_item.i_total_size - result->sr_item.i_item_offset); + item = &leaf->tb_data[result->sr_item.i_item_offset]; + memcpy(item, value->sv_key, value->sv_length); + xt_set_val_record_ref(item + value->sv_length, value); + XT_SET_NODE_REF(tab, item + value->sv_length + XT_RECORD_REF_SIZE, branch); + result->sr_item.i_total_size += value->sv_length + XT_RECORD_REF_SIZE + result->sr_item.i_node_ref_size; + XT_SET_DISK_2(leaf->tb_size_2, XT_MAKE_NODE_SIZE(result->sr_item.i_total_size)); +} + +static void idx_get_middle_branch_item(XTIndexPtr ind, XTIdxBranchDPtr 
branch, XTIdxKeyValuePtr value, XTIdxResultPtr result) +{ + xtWord1 *bitem; + + if (ind->mi_fix_key) { + u_int full_item_size = result->sr_item.i_item_size + result->sr_item.i_node_ref_size; + + result->sr_item.i_item_offset = ((result->sr_item.i_total_size - result->sr_item.i_node_ref_size) + / full_item_size / 2 * full_item_size) + result->sr_item.i_node_ref_size; + + bitem = &branch->tb_data[result->sr_item.i_item_offset]; + value->sv_flags = XT_SEARCH_WHOLE_KEY; + value->sv_length = result->sr_item.i_item_size - XT_RECORD_REF_SIZE; + xt_get_record_ref(bitem + value->sv_length, &value->sv_rec_id, &value->sv_row_id); + memcpy(value->sv_key, bitem, value->sv_length); + } + else { + u_int node_ref_size; + u_int ilen; + xtWord1 *bend; + + node_ref_size = result->sr_item.i_node_ref_size; + bitem = branch->tb_data + node_ref_size;; + bend = &branch->tb_data[(result->sr_item.i_total_size - node_ref_size) / 2 + node_ref_size]; + ilen = 0; + if (bitem < bend) { + for (;;) { + ilen = myxt_get_key_length(ind, bitem); + if (bitem + ilen + XT_RECORD_REF_SIZE + node_ref_size >= bend) + break; + bitem += ilen + XT_RECORD_REF_SIZE + node_ref_size; + } + } + + result->sr_item.i_item_offset = bitem - branch->tb_data; + result->sr_item.i_item_size = ilen + XT_RECORD_REF_SIZE; + + value->sv_flags = XT_SEARCH_WHOLE_KEY; + value->sv_length = ilen; + xt_get_record_ref(bitem + ilen, &value->sv_rec_id, &value->sv_row_id); + memcpy(value->sv_key, bitem, value->sv_length); + } +} + +static size_t idx_write_branch_item(XTIndexPtr ind __attribute__((unused)), xtWord1 *item, XTIdxKeyValuePtr value) +{ + memcpy(item, value->sv_key, value->sv_length); + xt_set_val_record_ref(item + value->sv_length, value); + return value->sv_length + XT_RECORD_REF_SIZE; +} + +static xtBool idx_replace_node_key(XTOpenTablePtr ot, XTIndexPtr ind, IdxStackItemPtr item, IdxBranchStackPtr stack, u_int item_size, xtWord1 *item_buf) +{ + XTIndReferenceRec iref; + xtIndexNodeID new_branch; + XTIdxResultRec result; + 
xtIndexNodeID current = item->i_branch; + u_int new_size; + XTIdxBranchDPtr new_branch_ptr; + XTIdxKeyValueRec key_value; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE]; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + if (!xt_ind_fetch(ot, current, XT_LOCK_WRITE, &iref)) + return FAILED; + if (iref.ir_block->cb_handle_count) { + if (!xt_ind_copy_on_write(&iref)) + goto failed_1; + } + memmove(&iref.ir_branch->tb_data[item->i_pos.i_item_offset + item_size], + &iref.ir_branch->tb_data[item->i_pos.i_item_offset + item->i_pos.i_item_size], + item->i_pos.i_total_size - item->i_pos.i_item_offset - item->i_pos.i_item_size); + memcpy(&iref.ir_branch->tb_data[item->i_pos.i_item_offset], + item_buf, item_size); + item->i_pos.i_total_size = item->i_pos.i_total_size + item_size - item->i_pos.i_item_size; + XT_SET_DISK_2(iref.ir_branch->tb_size_2, XT_MAKE_NODE_SIZE(item->i_pos.i_total_size)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(iref.ir_branch->tb_size_2)); + + if (item->i_pos.i_total_size <= XT_INDEX_PAGE_DATA_SIZE) + return xt_ind_release(ot, ind, XT_UNLOCK_W_UPDATE, &iref); + + /* The node has overflowed!! */ + result.sr_item = item->i_pos; + + /* Adjust the stack (we want the parents of the delete node): */ + for (;;) { + if (idx_pop(stack) == item) + break; + } + + /* We assume that value can be overwritten (which is the case) */ + key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + key_value.sv_key = key_buf; + idx_get_middle_branch_item(ind, iref.ir_branch, &key_value, &result); + + if (!idx_new_branch(ot, ind, &new_branch)) + goto failed_1; + + /* Split the node: */ + new_size = result.sr_item.i_total_size - result.sr_item.i_item_offset - result.sr_item.i_item_size; + // TODO: Are 2 buffers now required? 
+ new_branch_ptr = (XTIdxBranchDPtr) &ot->ot_ind_wbuf.tb_data[XT_INDEX_PAGE_DATA_SIZE]; + memmove(new_branch_ptr->tb_data, &iref.ir_branch->tb_data[result.sr_item.i_item_offset + result.sr_item.i_item_size], new_size); + + XT_SET_DISK_2(new_branch_ptr->tb_size_2, XT_MAKE_NODE_SIZE(new_size)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(new_branch), (int) XT_GET_DISK_2(new_branch_ptr->tb_size_2)); + if (!xt_ind_write(ot, ind, new_branch, offsetof(XTIdxBranchDRec, tb_data) + new_size, (xtWord1 *) new_branch_ptr)) + goto failed_2; + + /* Change the size of the old branch: */ + XT_SET_DISK_2(iref.ir_branch->tb_size_2, XT_MAKE_NODE_SIZE(result.sr_item.i_item_offset)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(iref.ir_branch->tb_size_2)); + + xt_ind_release(ot, ind, XT_UNLOCK_W_UPDATE, &iref); + + /* Insert the new branch into the parent node, using the new middle key value: */ + if (!idx_insert_node(ot, ind, stack, &key_value, new_branch)) { + /* + * TODO: Mark the index as corrupt. + * This should not fail because everything has been + * preallocated. + * However, if it does fail the index + * will be corrupt. + * I could modify and release the branch above, + * after this point. + * But that would mean holding the lock longer, + * and also may not help because idx_insert_node() + * is recursive. + */ + idx_free_branch(ot, ind, new_branch); + return FAILED; + } + + return OK; + + failed_2: + idx_free_branch(ot, ind, new_branch); + + failed_1: + xt_ind_release(ot, ind, XT_UNLOCK_WRITE, &iref); + + return FAILED; +} + +/*ot_ind_wbuf + * ----------------------------------------------------------------------- + * Standard b-tree insert + */ + +/* + * Insert the given branch into the node on the top of the stack. If the stack + * is empty we need to add a new root. 
+ */ +static xtBool idx_insert_node(XTOpenTablePtr ot, XTIndexPtr ind, IdxBranchStackPtr stack, XTIdxKeyValuePtr key_value, xtIndexNodeID branch) +{ + IdxStackItemPtr stack_item; + xtIndexNodeID new_branch; + size_t size; + xtIndexNodeID current; + XTIndReferenceRec iref; + XTIdxResultRec result; + u_int new_size; + XTIdxBranchDPtr new_branch_ptr; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + /* Insert a new branch (key, data)... */ + if (!(stack_item = idx_pop(stack))) { + xtWord1 *ditem; + + /* New root */ + if (!idx_new_branch(ot, ind, &new_branch)) + goto failed; + + ditem = ot->ot_ind_wbuf.tb_data; + XT_SET_NODE_REF(ot->ot_table, ditem, ind->mi_root); + ditem += XT_NODE_REF_SIZE; + ditem += idx_write_branch_item(ind, ditem, key_value); + XT_SET_NODE_REF(ot->ot_table, ditem, branch); + ditem += XT_NODE_REF_SIZE; + size = ditem - ot->ot_ind_wbuf.tb_data; + XT_SET_DISK_2(ot->ot_ind_wbuf.tb_size_2, XT_MAKE_NODE_SIZE(size)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(new_branch), (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + if (!xt_ind_write(ot, ind, new_branch, offsetof(XTIdxBranchDRec, tb_data) + size, (xtWord1 *) &ot->ot_ind_wbuf)) + goto failed_2; + ind->mi_root = new_branch; + goto done_ok; + } + + current = stack_item->i_branch; + /* This read does not count (towards ot_ind_reads), because we are only + * counting each loaded page once. We assume that the page is in + * cache, and will remain in cache when we read again below for the + * purpose of update. 
+ */ + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + ASSERT_NS(XT_IS_NODE(XT_GET_DISK_2(iref.ir_branch->tb_size_2))); + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, key_value, &result); + + if (result.sr_item.i_total_size + key_value->sv_length + XT_RECORD_REF_SIZE + result.sr_item.i_node_ref_size <= XT_INDEX_PAGE_DATA_SIZE) { + if (iref.ir_block->cb_handle_count) { + if (!xt_ind_copy_on_write(&iref)) + goto failed_1; + } + idx_insert_node_item(ot->ot_table, ind, iref.ir_branch, key_value, &result, branch); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + ASSERT_NS(result.sr_item.i_total_size <= XT_INDEX_PAGE_DATA_SIZE); + xt_ind_release(ot, ind, XT_UNLOCK_R_UPDATE, &iref); + goto done_ok; + } + + memcpy(&ot->ot_ind_wbuf, iref.ir_branch, offsetof(XTIdxBranchDRec, tb_data) + result.sr_item.i_total_size); + idx_insert_node_item(ot->ot_table, ind, &ot->ot_ind_wbuf, key_value, &result, branch); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + ASSERT_NS(result.sr_item.i_total_size > XT_INDEX_PAGE_DATA_SIZE); + + /* We assume that value can be overwritten (which is the case) */ + idx_get_middle_branch_item(ind, &ot->ot_ind_wbuf, key_value, &result); + + if (!idx_new_branch(ot, ind, &new_branch)) + goto failed_1; + + /* Split the node: */ + new_size = result.sr_item.i_total_size - result.sr_item.i_item_offset - result.sr_item.i_item_size; + new_branch_ptr = (XTIdxBranchDPtr) &ot->ot_ind_wbuf.tb_data[XT_INDEX_PAGE_DATA_SIZE]; + memmove(new_branch_ptr->tb_data, &ot->ot_ind_wbuf.tb_data[result.sr_item.i_item_offset + result.sr_item.i_item_size], new_size); + + XT_SET_DISK_2(new_branch_ptr->tb_size_2, XT_MAKE_NODE_SIZE(new_size)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(new_branch), (int) XT_GET_DISK_2(new_branch_ptr->tb_size_2)); + if (!xt_ind_write(ot, ind, new_branch, offsetof(XTIdxBranchDRec, tb_data) + new_size, (xtWord1 *) 
new_branch_ptr)) + goto failed_2; + + /* Change the size of the old branch: */ + XT_SET_DISK_2(ot->ot_ind_wbuf.tb_size_2, XT_MAKE_NODE_SIZE(result.sr_item.i_item_offset)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + if (iref.ir_block->cb_handle_count) { + if (!xt_ind_copy_on_write(&iref)) + goto failed_2; + } + memcpy(iref.ir_branch, &ot->ot_ind_wbuf, offsetof(XTIdxBranchDRec, tb_data) + result.sr_item.i_item_offset); + xt_ind_release(ot, ind, XT_UNLOCK_R_UPDATE, &iref); + + /* Insert the new branch into the parent node, using the new middle key value: */ + if (!idx_insert_node(ot, ind, stack, key_value, new_branch)) { + // Index may be inconsistant now... + idx_free_branch(ot, ind, new_branch); + goto failed; + } + + done_ok: + return OK; + + failed_2: + idx_free_branch(ot, ind, new_branch); + + failed_1: + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + + failed: + return FAILED; +} + +static xtBool idx_out_of_memory_failure(XTOpenTablePtr ot) +{ +#ifdef XT_TRACK_INDEX_UPDATES + /* If the index has been changed when we run out of memory, we + * will corrupt the index! + */ + ASSERT_NS(ot->ot_ind_changed == 0); +#endif + if (ot->ot_thread->t_exception.e_xt_err == XT_ERR_NO_INDEX_CACHE) { + /* Flush index and retry! */ + xt_clear_exception(ot->ot_thread); + if (!xt_flush_indices(ot, NULL, FALSE)) + return FAILED; + return TRUE; + } + return FALSE; +} + +/* + * Check all the duplicate variation in an index. + * If one of them is visible, then we have a duplicate key + * error. + * + * GOTCHA: This routine must use the write index buffer! 
+ */ +static xtBool idx_check_duplicates(XTOpenTablePtr ot, XTIndexPtr ind, XTIdxKeyValuePtr key_value) +{ + IdxBranchStackRec stack; + xtIndexNodeID current; + XTIndReferenceRec iref; + XTIdxResultRec result; + xtBool on_key = FALSE; + xtXactID xn_id; + int save_flags; + XTXactWaitRec xw; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + retry: + idx_newstack(&stack); + + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) + return OK; + + save_flags = key_value->sv_flags; + key_value->sv_flags = 0; + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) { + key_value->sv_flags = save_flags; + return FAILED; + } + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, key_value, &result); + if (result.sr_found) + /* If we have found the key in a node: */ + on_key = TRUE; + if (!result.sr_item.i_node_ref_size) + break; + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) { + key_value->sv_flags = save_flags; + return FAILED; + } + current = result.sr_branch; + } + + key_value->sv_flags = save_flags; + + if (!on_key) { + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + return OK; + } + + for (;;) { + if (result.sr_item.i_item_offset == result.sr_item.i_total_size) { + IdxStackItemPtr node; + + /* We are at the end of a leaf node. + * Go up the stack to find the start position of the next key. + * If we find none, then we are the end of the index. + */ + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + while ((node = idx_pop(&stack))) { + if (node->i_pos.i_item_offset < node->i_pos.i_total_size) { + current = node->i_branch; + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + return FAILED; + xt_get_res_record_ref(&iref.ir_branch->tb_data[node->i_pos.i_item_offset + node->i_pos.i_item_size - XT_RECORD_REF_SIZE], &result); + result.sr_item = node->i_pos; + goto check_value; + } + } + break; + } + + check_value: + /* Quit the loop if the key is no longer matched! 
*/ + if (myxt_compare_key(ind, 0, key_value->sv_length, key_value->sv_key, &iref.ir_branch->tb_data[result.sr_item.i_item_offset]) != 0) { + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + break; + } + + switch (xt_tab_maybe_committed(ot, result.sr_rec_id, &xn_id, NULL, NULL)) { + case XT_MAYBE: + /* Record is not committed, wait for the transaction. */ + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + XT_INDEX_UNLOCK(ind, ot); + xw.xw_xn_id = xn_id; + if (!xt_xn_wait_for_xact(ot->ot_thread, &xw, NULL)) { + XT_INDEX_WRITE_LOCK(ind, ot); + return FAILED; + } + XT_INDEX_WRITE_LOCK(ind, ot); + goto retry; + case XT_ERR: + /* Error while reading... */ + goto failed; + case TRUE: + /* Record is committed or belongs to me, duplicate key: */ + XT_DEBUG_TRACE(("DUPLICATE KEY tx=%d rec=%d\n", (int) ot->ot_thread->st_xact_data->xd_start_xn_id, (int) result.sr_rec_id)); + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_DUPLICATE_KEY); + goto failed; + case FALSE: + /* Record is deleted or rolled-back: */ + break; + } + + idx_next_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + + if (result.sr_item.i_node_ref_size) { + /* Go down to the bottom: */ + while (XT_NODE_ID(current)) { + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + return FAILED; + current = result.sr_branch; + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + return FAILED; + idx_first_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + if (!result.sr_item.i_node_ref_size) + break; + } + } + } + + return OK; + + failed: + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + return FAILED; +} + +/* + * Insert a value into the given index. Return FALSE if an error occurs. 
+ */ +xtPublic xtBool xt_idx_insert(XTOpenTablePtr ot, XTIndexPtr ind, xtRowID row_id, xtRecordID rec_id, xtWord1 *rec_buf, xtWord1 *bef_buf, xtBool allow_dups) +{ + XTIdxKeyValueRec key_value; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE]; + IdxBranchStackRec stack; + xtIndexNodeID current; + XTIndReferenceRec iref; + xtIndexNodeID new_branch; + XTIdxBranchDPtr new_branch_ptr; + size_t size; + XTIdxResultRec result; + size_t new_size; + xtBool check_for_dups = ind->mi_flags & (HA_UNIQUE_CHECK | HA_NOSAME) && !allow_dups; + xtBool lock_structure = FALSE; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif +#ifdef CHECK_AND_PRINT + //idx_check_index(ot, ind, TRUE); +#endif + + retry_after_oom: +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed = 0; +#endif + key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + key_value.sv_rec_id = rec_id; + key_value.sv_row_id = row_id; /* Should always be zero on insert (will be update by sweeper later). + * Non-zero only during recovery, assuming that sweeper will process such records right after recovery. + */ + key_value.sv_key = key_buf; + key_value.sv_length = myxt_create_key_from_row(ind, key_buf, rec_buf, &check_for_dups); + + if (bef_buf && check_for_dups) { + /* If we have a before image, and we are required to check for duplicates. + * then compare the before image key with the after image key. + */ + xtWord1 bef_key_buf[XT_INDEX_MAX_KEY_SIZE]; + u_int len; + xtBool has_no_null = TRUE; + + len = myxt_create_key_from_row(ind, bef_key_buf, bef_buf, &has_no_null); + if (has_no_null) { + /* If the before key has no null values, then compare with the after key value. + * We only have to check for duplicates if the key has changed! 
+ */ + check_for_dups = myxt_compare_key(ind, 0, len, bef_key_buf, key_buf) != 0; + } + } + + /* The index appears to have no root: */ + if (!XT_NODE_ID(ind->mi_root)) + lock_structure = TRUE; + + lock_and_retry: + idx_newstack(&stack); + + /* A write lock is only required if we are going to change the + * strcuture of the index! + */ + if (lock_structure) + XT_INDEX_WRITE_LOCK(ind, ot); + else + XT_INDEX_READ_LOCK(ind, ot); + + retry: + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) { + /* Index is empty, create a new one: */ + ASSERT_NS(lock_structure); + if (!xt_ind_reserve(ot, 1, NULL)) + goto failed; + if (!idx_new_branch(ot, ind, &new_branch)) + goto failed; + size = idx_write_branch_item(ind, ot->ot_ind_wbuf.tb_data, &key_value); + XT_SET_DISK_2(ot->ot_ind_wbuf.tb_size_2, XT_MAKE_LEAF_SIZE(size)); + IDX_TRACE("%d-> %x\n", (int) new_branch, (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + if (!xt_ind_write(ot, ind, new_branch, offsetof(XTIdxBranchDRec, tb_data) + size, (xtWord1 *) &ot->ot_ind_wbuf)) + goto failed_2; + ind->mi_root = new_branch; + goto done_ok; + } + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_XLOCK_LEAF, &iref)) + goto failed; + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, &key_value, &result); + if (result.sr_duplicate) { + if (check_for_dups) { + /* Duplicates are not allowed, at least one has been + * found... + */ + + /* Leaf nodes (i_node_ref_size == 0) are write locked, + * non-leaf nodes are read locked. + */ + xt_ind_release(ot, ind, result.sr_item.i_node_ref_size ? XT_UNLOCK_READ : XT_UNLOCK_WRITE, &iref); + + if (!idx_check_duplicates(ot, ind, &key_value)) + goto failed; + /* We have checked all the "duplicate" variations. None of them are + * relevant. So this will cause a correct insert. + */ + check_for_dups = FALSE; + idx_newstack(&stack); + goto retry; + } + } + if (result.sr_found) { + /* Node found, can happen during recovery of indexes! 
*/ + XTPageUnlockType utype; + + if (!result.sr_row_id && row_id) { + /* {INDEX-RECOV_ROWID} Set the row-id + * during recovery, even if the index entry + * is not committed. + * It will be removed later by the sweeper. + */ + size_t offset; + xtWord1 *data; + + offset = + /* This is the offset of the reference in the item we found: */ + result.sr_item.i_item_offset + result.sr_item.i_item_size - XT_RECORD_REF_SIZE + + /* This is the offset of the row id in the reference: */ + 4; + data = &iref.ir_branch->tb_data[offset]; + + /* This update does not change the structure of page, so we do it without + * copying the page before we write. + */ + XT_SET_DISK_4(data, row_id); + utype = result.sr_item.i_node_ref_size ? XT_UNLOCK_R_UPDATE : XT_UNLOCK_W_UPDATE; + } + else + utype = result.sr_item.i_node_ref_size ? XT_UNLOCK_READ : XT_UNLOCK_WRITE; + xt_ind_release(ot, ind, utype, &iref); + goto done_ok; + } + /* Stop when we get to a leaf: */ + if (!result.sr_item.i_node_ref_size) + break; + xt_ind_release(ot, ind, result.sr_item.i_node_ref_size ? XT_UNLOCK_READ : XT_UNLOCK_WRITE, &iref); + if (!idx_push(&stack, current, NULL)) + goto failed; + current = result.sr_branch; + } + ASSERT_NS(XT_NODE_ID(current)); + + /* Must be a leaf!: */ + ASSERT_NS(!result.sr_item.i_node_ref_size); + + if (result.sr_item.i_total_size + key_value.sv_length + XT_RECORD_REF_SIZE <= XT_INDEX_PAGE_DATA_SIZE) { + if (iref.ir_block->cb_handle_count) { + if (!xt_ind_copy_on_write(&iref)) + goto failed_1; + } + idx_insert_leaf_item(ind, iref.ir_branch, &key_value, &result); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + ASSERT_NS(result.sr_item.i_total_size <= XT_INDEX_PAGE_DATA_SIZE); + xt_ind_release(ot, ind, XT_UNLOCK_W_UPDATE, &iref); + goto done_ok; + } + + /* Key does not fit. Must split the node. 
+ * Make sure we have a structural lock: + */ + if (!lock_structure) { + xt_ind_release(ot, ind, XT_UNLOCK_WRITE, &iref); + XT_INDEX_UNLOCK(ind, ot); + lock_structure = TRUE; + goto lock_and_retry; + } + + memcpy(&ot->ot_ind_wbuf, iref.ir_branch, offsetof(XTIdxBranchDRec, tb_data) + result.sr_item.i_total_size); + idx_insert_leaf_item(ind, &ot->ot_ind_wbuf, &key_value, &result); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + ASSERT_NS(result.sr_item.i_total_size > XT_INDEX_PAGE_DATA_SIZE); + + /* This is the number of potential writes. In other words, the total number + * of blocks that may be accessed. + * + * Note that this assume if a block is read and written soon after that the block + * will not be freed in-between (a safe assumption?) + */ + if (!xt_ind_reserve(ot, stack.s_top * 2 + 3, iref.ir_branch)) + goto failed_1; + + /* Key does not fit, must split... */ + idx_get_middle_branch_item(ind, &ot->ot_ind_wbuf, &key_value, &result); + + if (!idx_new_branch(ot, ind, &new_branch)) + goto failed_1; + + /* Copy and write the rest of the data to the new node: */ + new_size = result.sr_item.i_total_size - result.sr_item.i_item_offset - result.sr_item.i_item_size; + new_branch_ptr = (XTIdxBranchDPtr) &ot->ot_ind_wbuf.tb_data[XT_INDEX_PAGE_DATA_SIZE]; + memmove(new_branch_ptr->tb_data, &ot->ot_ind_wbuf.tb_data[result.sr_item.i_item_offset + result.sr_item.i_item_size], new_size); + + XT_SET_DISK_2(new_branch_ptr->tb_size_2, XT_MAKE_LEAF_SIZE(new_size)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(new_branch), (int) XT_GET_DISK_2(new_branch_ptr->tb_size_2)); + if (!xt_ind_write(ot, ind, new_branch, offsetof(XTIdxBranchDRec, tb_data) + new_size, (xtWord1 *) new_branch_ptr)) + goto failed_2; + + /* Modify the first node: */ + XT_SET_DISK_2(ot->ot_ind_wbuf.tb_size_2, XT_MAKE_LEAF_SIZE(result.sr_item.i_item_offset)); + IDX_TRACE("%d-> %x\n", (int) XT_NODE_ID(current), (int) XT_GET_DISK_2(ot->ot_ind_wbuf.tb_size_2)); + 
+ if (iref.ir_block->cb_handle_count) { + if (!xt_ind_copy_on_write(&iref)) + goto failed_2; + } + memcpy(iref.ir_branch, &ot->ot_ind_wbuf, offsetof(XTIdxBranchDRec, tb_data) + result.sr_item.i_item_offset); + xt_ind_release(ot, ind, XT_UNLOCK_W_UPDATE, &iref); + + /* Insert the new branch into the parent node, using the new middle key value: */ + if (!idx_insert_node(ot, ind, &stack, &key_value, new_branch)) { + // Index may be inconsistant now... + idx_free_branch(ot, ind, new_branch); + goto failed; + } + +#ifdef XT_TRACK_INDEX_UPDATES + ASSERT_NS(ot->ot_ind_reserved >= ot->ot_ind_reads); +#endif + + done_ok: + XT_INDEX_UNLOCK(ind, ot); + +#ifdef DEBUG + //printf("INSERT OK\n"); + //idx_check_index(ot, ind, TRUE); +#endif + xt_ind_unreserve(ot); + return OK; + + failed_2: + idx_free_branch(ot, ind, new_branch); + + failed_1: + xt_ind_release(ot, ind, XT_UNLOCK_WRITE, &iref); + + failed: + XT_INDEX_UNLOCK(ind, ot); + if (idx_out_of_memory_failure(ot)) + goto retry_after_oom; + +#ifdef DEBUG + //printf("INSERT FAILED\n"); + //idx_check_index(ot, ind, TRUE); +#endif + xt_ind_unreserve(ot); + return FAILED; +} + +static xtBool idx_delete(XTOpenTablePtr ot, XTIndexPtr ind, XTIdxKeyValuePtr key_value) +{ + IdxBranchStackRec stack; + xtIndexNodeID current; + XTIndReferenceRec iref; + XTIdxResultRec result; + IdxStackItemPtr delete_node = NULL; + IdxStackItemPtr current_top = NULL; + xtBool lock_structure = FALSE; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + /* The index appears to have no root: */ + if (!XT_NODE_ID(ind->mi_root)) + lock_structure = TRUE; + + lock_and_retry: + idx_newstack(&stack); + + if (lock_structure) + XT_INDEX_WRITE_LOCK(ind, ot); + else + XT_INDEX_READ_LOCK(ind, ot); + + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) + goto done_ok; + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_XLOCK_LEAF, &iref)) + goto failed; + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, key_value, &result); + if 
(!result.sr_item.i_node_ref_size) { + /* A leaf... */ + if (result.sr_found) { + if (!idx_remove_branch_item_right(ot, ind, current, &iref, &result.sr_item)) + goto failed; + } + else + xt_ind_release(ot, ind, XT_UNLOCK_WRITE, &iref); + goto done_ok; + } + if (!idx_push(&stack, current, &result.sr_item)) { + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + goto failed; + } + if (result.sr_found) + /* If we have found the key in a node: */ + break; + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + current = result.sr_branch; + } + + /* Must be a non-leaf!: */ + ASSERT_NS(result.sr_item.i_node_ref_size); + + /* We will have to remove the key from a non-leaf node, + * which means we are changing the structure of the index. + * Make sure we have a structural lock: + */ + if (!lock_structure) { + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + XT_INDEX_UNLOCK(ind, ot); + lock_structure = TRUE; + goto lock_and_retry; + } + + /* This is the item we will have to replace: */ + delete_node = idx_top(&stack); + + /* Follow the branch after this item: */ + idx_next_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + ASSERT_NS(XT_NODE_ID(current)); + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + + /* Go down the left-hand side until we reach a leaf: */ + while (XT_NODE_ID(current)) { + current = result.sr_branch; + if (!xt_ind_fetch(ot, current, XT_XLOCK_LEAF, &iref)) + goto failed; + idx_first_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + if (!result.sr_item.i_node_ref_size) + break; + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + goto failed; + } + + ASSERT_NS(XT_NODE_ID(current)); + ASSERT_NS(!result.sr_item.i_node_ref_size); + + if (!xt_ind_reserve(ot, stack.s_top + 2, iref.ir_branch)) { + xt_ind_release(ot, ind, XT_UNLOCK_WRITE, &iref); + goto failed; + } + + /* Crawl back up the stack trace, looking for a key + * that can be used to replace the deleted key. 
+ * + * Any empty nodes on the way up can be removed! + */ + if (result.sr_item.i_total_size > 0) { + /* There is a key in the leaf, extract it, and put it in the node: */ + memcpy(key_value->sv_key, &iref.ir_branch->tb_data[result.sr_item.i_item_offset], result.sr_item.i_item_size); + /* This call also frees the iref.ir_branch page! */ + if (!idx_remove_branch_item_right(ot, ind, current, &iref, &result.sr_item)) + goto failed; + if (!idx_replace_node_key(ot, ind, delete_node, &stack, result.sr_item.i_item_size, key_value->sv_key)) + goto failed; + goto done_ok_2; + } + + xt_ind_release(ot, ind, XT_UNLOCK_WRITE, &iref); + + for (;;) { + /* The current node/leaf is empty, remove it: */ + idx_free_branch(ot, ind, current); + + current_top = idx_pop(&stack); + current = current_top->i_branch; + if (!xt_ind_fetch(ot, current, XT_XLOCK_LEAF, &iref)) + goto failed; + + if (current_top == delete_node) { + /* All children have been removed. Delete the key and done: */ + if (!idx_remove_branch_item_right(ot, ind, current, &iref, &current_top->i_pos)) + goto failed; + goto done_ok_2; + } + + if (current_top->i_pos.i_total_size > current_top->i_pos.i_node_ref_size) { + /* Save the key: */ + memcpy(key_value->sv_key, &iref.ir_branch->tb_data[current_top->i_pos.i_item_offset], current_top->i_pos.i_item_size); + /* This function also frees the cache page: */ + if (!idx_remove_branch_item_left(ot, ind, current, &iref, &current_top->i_pos)) + goto failed; + if (!idx_replace_node_key(ot, ind, delete_node, &stack, current_top->i_pos.i_item_size, key_value->sv_key)) + goto failed; + goto done_ok_2; + } + xt_ind_release(ot, ind, current_top->i_pos.i_node_ref_size ? 
XT_UNLOCK_READ : XT_UNLOCK_WRITE, &iref); + } + + + done_ok_2: +#ifdef XT_TRACK_INDEX_UPDATES + ASSERT_NS(ot->ot_ind_reserved >= ot->ot_ind_reads); +#endif + + done_ok: + XT_INDEX_UNLOCK(ind, ot); + +#ifdef DEBUG + //printf("DELETE OK\n"); + //idx_check_index(ot, ind, TRUE); +#endif + xt_ind_unreserve(ot); + return OK; + + failed: + XT_INDEX_UNLOCK(ind, ot); + xt_ind_unreserve(ot); + return FAILED; +} + +xtPublic xtBool xt_idx_delete(XTOpenTablePtr ot, XTIndexPtr ind, xtRecordID rec_id, xtWord1 *rec_buf) +{ + XTIdxKeyValueRec key_value; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE + XT_MAX_RECORD_REF_SIZE]; + + retry_after_oom: +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed = 0; +#endif + + key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + key_value.sv_rec_id = rec_id; + key_value.sv_row_id = 0; + key_value.sv_key = key_buf; + key_value.sv_length = myxt_create_key_from_row(ind, key_buf, rec_buf, NULL); + + if (!idx_delete(ot, ind, &key_value)) { + if (idx_out_of_memory_failure(ot)) + goto retry_after_oom; + return FAILED; + } + return OK; +} + +xtPublic xtBool xt_idx_update_row_id(XTOpenTablePtr ot, XTIndexPtr ind, xtRecordID rec_id, xtRowID row_id, xtWord1 *rec_buf) +{ + xtIndexNodeID current; + XTIndReferenceRec iref; + XTIdxResultRec result; + XTIdxKeyValueRec key_value; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE + XT_MAX_RECORD_REF_SIZE]; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif +#ifdef CHECK_AND_PRINT + idx_check_index(ot, ind, TRUE); +#endif + retry_after_oom: +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed = 0; +#endif + key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + key_value.sv_rec_id = rec_id; + key_value.sv_row_id = 0; + key_value.sv_key = key_buf; + key_value.sv_length = myxt_create_key_from_row(ind, key_buf, rec_buf, NULL); + + /* NOTE: Only a read lock is required for this!! + * + * 09.05.2008 - This has changed because the dirty list now + * hangs on the index. And the dirty list may be updated + * by any change of the index. 
+ * However, the advantage is that I should be able to read + * lock in the first phase of the flush. + * + * 18.02.2009 - This has changed again. + * I am now using a read lock, because this update does not + * require a structural change. In fact, it does not even + * need a WRITE LOCK on the page affected, because there + * is only ONE thread that can do this (the sweeper). + * + * This has the advantage that the sweeper (which uses this + * function, causes less conflicts. + * + * However, it does mean that the dirty list must be otherwise + * protected (which it now is be a spin lock - mi_dirty_lock). + * + * It also has the dissadvantage that I am going to have to + * take an xlock in the first phase of the flush. + */ + XT_INDEX_READ_LOCK(ind, ot); + + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) + goto done_ok; + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, &key_value, &result); + if (result.sr_found || !result.sr_item.i_node_ref_size) + break; + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + current = result.sr_branch; + } + + if (result.sr_found) { + size_t offset; + xtWord1 *data; + + offset = + /* This is the offset of the reference in the item we found: */ + result.sr_item.i_item_offset + result.sr_item.i_item_size - XT_RECORD_REF_SIZE + + /* This is the offset of the row id in the reference: */ + 4; + data = &iref.ir_branch->tb_data[offset]; + + /* This update does not change the structure of page, so we do it without + * copying the page before we write. + * + * TODO: Check that concurrent reads can handle this! + * assuming the write is not atomic. 
+ */ + XT_SET_DISK_4(data, row_id); + xt_ind_release(ot, ind, XT_UNLOCK_R_UPDATE, &iref); + } + else + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + + done_ok: + XT_INDEX_UNLOCK(ind, ot); + +#ifdef DEBUG + //idx_check_index(ot, ind, TRUE); + //idx_check_on_key(ot); +#endif + return OK; + + failed: + XT_INDEX_UNLOCK(ind, ot); + if (idx_out_of_memory_failure(ot)) + goto retry_after_oom; + return FAILED; +} + +xtPublic void xt_idx_prep_key(XTIndexPtr ind, register XTIdxSearchKeyPtr search_key, int flags, xtWord1 *in_key_buf, size_t in_key_length) +{ + search_key->sk_key_value.sv_flags = flags; + search_key->sk_key_value.sv_rec_id = 0; + search_key->sk_key_value.sv_row_id = 0; + search_key->sk_key_value.sv_key = search_key->sk_key_buf; + search_key->sk_key_value.sv_length = myxt_create_key_from_key(ind, search_key->sk_key_buf, in_key_buf, in_key_length); + search_key->sk_on_key = FALSE; +} + +xtPublic xtBool xt_idx_research(XTOpenTablePtr ot, XTIndexPtr ind) +{ + XTIdxSearchKeyRec search_key; + + xt_ind_lock_handle(ot->ot_ind_rhandle); + search_key.sk_key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + xt_get_record_ref(&ot->ot_ind_rhandle->ih_branch->tb_data[ot->ot_ind_state.i_item_offset + ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE], + &search_key.sk_key_value.sv_rec_id, &search_key.sk_key_value.sv_row_id); + search_key.sk_key_value.sv_key = search_key.sk_key_buf; + search_key.sk_key_value.sv_length = ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE; + search_key.sk_on_key = FALSE; + memcpy(search_key.sk_key_buf, &ot->ot_ind_rhandle->ih_branch->tb_data[ot->ot_ind_state.i_item_offset], search_key.sk_key_value.sv_length); + xt_ind_unlock_handle(ot->ot_ind_rhandle); + return xt_idx_search(ot, ind, &search_key); +} + +/* + * Search for a given key and position the current pointer on the first + * key in the list of duplicates. If the key is not found the current + * pointer is placed at the first position after the key. 
+ */ +xtPublic xtBool xt_idx_search(XTOpenTablePtr ot, XTIndexPtr ind, register XTIdxSearchKeyPtr search_key) +{ + IdxBranchStackRec stack; + xtIndexNodeID current; + XTIndReferenceRec iref; + XTIdxResultRec result; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + if (ot->ot_ind_rhandle) { + xt_ind_release_handle(ot->ot_ind_rhandle, FALSE, ot->ot_thread); + ot->ot_ind_rhandle = NULL; + } +#ifdef DEBUG + //idx_check_index(ot, ind, TRUE); +#endif + + /* Calling from recovery, this is not the case. + * But the index read does not require a transaction! + * Only insert requires this to check for duplicates. + if (!ot->ot_thread->st_xact_data) { + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_NO_TRANSACTION); + return FAILED; + } + */ + + retry_after_oom: +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed = 0; +#endif + idx_newstack(&stack); + + ot->ot_curr_rec_id = 0; + ot->ot_curr_row_id = 0; + + XT_INDEX_READ_LOCK(ind, ot); + + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) + goto done_ok; + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, &search_key->sk_key_value, &result); + if (result.sr_found) + /* If we have found the key in a node: */ + search_key->sk_on_key = TRUE; + if (!result.sr_item.i_node_ref_size) + break; + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + goto failed; + current = result.sr_branch; + } + + if (result.sr_item.i_item_offset == result.sr_item.i_total_size) { + IdxStackItemPtr node; + + /* We are at the end of a leaf node. + * Go up the stack to find the start position of the next key. + * If we find none, then we are the end of the index. 
+ */ + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + while ((node = idx_pop(&stack))) { + if (node->i_pos.i_item_offset < node->i_pos.i_total_size) { + xtRecordID rec_id; + + if (!xt_ind_fetch(ot, node->i_branch, XT_LOCK_READ, &iref)) + goto failed; + xt_get_record_ref(&iref.ir_branch->tb_data[node->i_pos.i_item_offset + node->i_pos.i_item_size - XT_RECORD_REF_SIZE], &rec_id, &ot->ot_curr_row_id); + ot->ot_curr_rec_id = rec_id; + ot->ot_ind_state = node->i_pos; + + /* Convert the pointer to a handle which can be used in later operations: */ + ASSERT_NS(!ot->ot_ind_rhandle); + if (!(ot->ot_ind_rhandle = xt_ind_get_handle(ot, ind, &iref))) + goto failed; + /* Keep the node for next operations: */ + /* + branch_size = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(iref.ir_branch->tb_size_2)); + memcpy(&ot->ot_ind_rbuf, iref.ir_branch, branch_size); + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + */ + break; + } + } + } + else { + ot->ot_curr_rec_id = result.sr_rec_id; + ot->ot_curr_row_id = result.sr_row_id; + ot->ot_ind_state = result.sr_item; + + /* Convert the pointer to a handle which can be used in later operations: */ + ASSERT_NS(!ot->ot_ind_rhandle); + if (!(ot->ot_ind_rhandle = xt_ind_get_handle(ot, ind, &iref))) + goto failed; + /* Keep the node for next operations: */ + /* + branch_size = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(iref.ir_branch->tb_size_2)); + memcpy(&ot->ot_ind_rbuf, iref.ir_branch, branch_size); + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + */ + } + + done_ok: + XT_INDEX_UNLOCK(ind, ot); + +#ifdef DEBUG + //idx_check_index(ot, ind, TRUE); + //idx_check_on_key(ot); +#endif + ASSERT_NS(iref.ir_ulock == XT_UNLOCK_NONE); + return OK; + + failed: + XT_INDEX_UNLOCK(ind, ot); + if (idx_out_of_memory_failure(ot)) + goto retry_after_oom; + ASSERT_NS(iref.ir_ulock == XT_UNLOCK_NONE); + return FAILED; +} + +xtPublic xtBool xt_idx_search_prev(XTOpenTablePtr ot, XTIndexPtr ind, register XTIdxSearchKeyPtr search_key) +{ + IdxBranchStackRec stack; + 
xtIndexNodeID current; + XTIndReferenceRec iref; + XTIdxResultRec result; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + if (ot->ot_ind_rhandle) { + xt_ind_release_handle(ot->ot_ind_rhandle, FALSE, ot->ot_thread); + ot->ot_ind_rhandle = NULL; + } +#ifdef DEBUG + //idx_check_index(ot, ind, TRUE); +#endif + + /* see the comment above in xt_idx_search */ + /* + if (!ot->ot_thread->st_xact_data) { + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_NO_TRANSACTION); + return FAILED; + } + */ + + retry_after_oom: +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed = 0; +#endif + idx_newstack(&stack); + + ot->ot_curr_rec_id = 0; + ot->ot_curr_row_id = 0; + + XT_INDEX_READ_LOCK(ind, ot); + + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) + goto done_ok; + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, &search_key->sk_key_value, &result); + if (result.sr_found) + /* If we have found the key in a node: */ + search_key->sk_on_key = TRUE; + if (!result.sr_item.i_node_ref_size) + break; + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + goto failed; + current = result.sr_branch; + } + + if (result.sr_item.i_item_offset == 0) { + IdxStackItemPtr node; + + /* We are at the end of a leaf node. + * Go up the stack to find the start poition of the next key. + * If we find none, then we are the end of the index. + */ + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + while ((node = idx_pop(&stack))) { + if (node->i_pos.i_item_offset > node->i_pos.i_node_ref_size) { + if (!xt_ind_fetch(ot, node->i_branch, XT_LOCK_READ, &iref)) + goto failed; + result.sr_item = node->i_pos; + ind->mi_prev_item(ot->ot_table, ind, iref.ir_branch, &result); + goto record_found; + } + } + goto done_ok; + } + + /* We must just step once to the left in this leaf node... 
*/ + ind->mi_prev_item(ot->ot_table, ind, iref.ir_branch, &result); + + record_found: + ot->ot_curr_rec_id = result.sr_rec_id; + ot->ot_curr_row_id = result.sr_row_id; + ot->ot_ind_state = result.sr_item; + + /* Convert to handle for later operations: */ + ASSERT_NS(!ot->ot_ind_rhandle); + if (!(ot->ot_ind_rhandle = xt_ind_get_handle(ot, ind, &iref))) + goto failed; + /* Keep a copy of the node for previous operations... */ + /* + u_int branch_size; + + branch_size = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(iref.ir_branch->tb_size_2)); + memcpy(&ot->ot_ind_rbuf, iref.ir_branch, branch_size); + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + */ + + done_ok: + XT_INDEX_UNLOCK(ind, ot); + +#ifdef DEBUG + //idx_check_index(ot, ind, TRUE); + //idx_check_on_key(ot); +#endif + return OK; + + failed: + XT_INDEX_UNLOCK(ind, ot); + if (idx_out_of_memory_failure(ot)) + goto retry_after_oom; + return FAILED; +} + +/* + * Copy the current index value to the record. + */ +xtPublic xtBool xt_idx_read(XTOpenTablePtr ot, XTIndexPtr ind, xtWord1 *rec_buf) +{ + xtWord1 *bitem; + +#ifdef DEBUG + //idx_check_on_key(ot); +#endif + xt_ind_lock_handle(ot->ot_ind_rhandle); + bitem = ot->ot_ind_rhandle->ih_branch->tb_data + ot->ot_ind_state.i_item_offset; + myxt_create_row_from_key(ot, ind, bitem, ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE, rec_buf); + xt_ind_unlock_handle(ot->ot_ind_rhandle); + return OK; +} + +xtPublic xtBool xt_idx_next(register XTOpenTablePtr ot, register XTIndexPtr ind, register XTIdxSearchKeyPtr search_key) +{ + XTIdxKeyValueRec key_value; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE]; + XTIdxResultRec result; + IdxBranchStackRec stack; + xtIndexNodeID current; + XTIndReferenceRec iref; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + ASSERT_NS(ot->ot_ind_rhandle); + xt_ind_lock_handle(ot->ot_ind_rhandle); + if (!ot->ot_ind_state.i_node_ref_size && + ot->ot_ind_state.i_item_offset < ot->ot_ind_state.i_total_size && + ot->ot_ind_rhandle->ih_cache_reference) 
{ + key_value.sv_key = &ot->ot_ind_rhandle->ih_branch->tb_data[ot->ot_ind_state.i_item_offset]; + key_value.sv_length = ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE; + + result.sr_item = ot->ot_ind_state; + idx_next_branch_item(ot->ot_table, ind, ot->ot_ind_rhandle->ih_branch, &result); + if (result.sr_item.i_item_offset < result.sr_item.i_total_size) { + /* Still on key? */ + if (search_key && search_key->sk_on_key) { + search_key->sk_on_key = myxt_compare_key(ind, search_key->sk_key_value.sv_flags, search_key->sk_key_value.sv_length, + search_key->sk_key_value.sv_key, &ot->ot_ind_rhandle->ih_branch->tb_data[result.sr_item.i_item_offset]) == 0; + } + xt_ind_unlock_handle(ot->ot_ind_rhandle); + goto checked_on_key; + } + } + + key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + xt_get_record_ref(&ot->ot_ind_rhandle->ih_branch->tb_data[ot->ot_ind_state.i_item_offset + ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE], &key_value.sv_rec_id, &key_value.sv_row_id); + key_value.sv_key = key_buf; + key_value.sv_length = ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE; + memcpy(key_buf, &ot->ot_ind_rhandle->ih_branch->tb_data[ot->ot_ind_state.i_item_offset], key_value.sv_length); + xt_ind_release_handle(ot->ot_ind_rhandle, TRUE, ot->ot_thread); + ot->ot_ind_rhandle = NULL; + + retry_after_oom: +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed = 0; +#endif + idx_newstack(&stack); + + XT_INDEX_READ_LOCK(ind, ot); + + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) { + XT_INDEX_UNLOCK(ind, ot); + return OK; + } + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, &key_value, &result); + if (result.sr_item.i_node_ref_size) { + if (result.sr_found) { + /* If we have found the key in a node: */ + idx_next_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + + /* Go down to the bottom: */ + while (XT_NODE_ID(current)) { + xt_ind_release(ot, ind, 
XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + goto failed; + current = result.sr_branch; + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + idx_first_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + if (!result.sr_item.i_node_ref_size) + break; + } + + /* If the leaf is not empty, then we are done... */ + break; + } + } + else { + /* We have reached the leaf. */ + if (result.sr_found) + /* If we have found the key in a leaf: */ + idx_next_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + /* If we did not find the key (although we should have), our + * position is automatically the next one. + */ + break; + } + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + goto failed; + current = result.sr_branch; + } + + /* Check the current position in a leaf: */ + if (result.sr_item.i_item_offset == result.sr_item.i_total_size) { + /* At the end: */ + IdxStackItemPtr node; + + /* We are at the end of a leaf node. + * Go up the stack to find the start position of the next key. + * If we find none, then we are at the end of the index. 
+ */ + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + while ((node = idx_pop(&stack))) { + if (node->i_pos.i_item_offset < node->i_pos.i_total_size) { + if (!xt_ind_fetch(ot, node->i_branch, XT_LOCK_READ, &iref)) + goto failed; + result.sr_item = node->i_pos; + xt_get_res_record_ref(&iref.ir_branch->tb_data[result.sr_item.i_item_offset + result.sr_item.i_item_size - XT_RECORD_REF_SIZE], &result); + goto unlock_check_on_key; + } + } + + /* No more keys: */ + if (search_key) + search_key->sk_on_key = FALSE; + ot->ot_curr_rec_id = 0; + ot->ot_curr_row_id = 0; + XT_INDEX_UNLOCK(ind, ot); + return OK; + } + + unlock_check_on_key: + + ASSERT_NS(!ot->ot_ind_rhandle); + if (!(ot->ot_ind_rhandle = xt_ind_get_handle(ot, ind, &iref))) + goto failed; + /* + u_int branch_size; + + branch_size = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(iref.ir_branch->tb_size_2)); + memcpy(&ot->ot_ind_rbuf, iref.ir_branch, branch_size); + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + */ + + XT_INDEX_UNLOCK(ind, ot); + + /* Still on key? */ + if (search_key && search_key->sk_on_key) { + /* GOTCHA: As a short-cut I was using a length compare + * and a memcmp() here to check whether we are still on + * the original search key. + * This does not work because it does not take into account + * trailing spaces (which are ignored in comparison). + * So lengths can be different, but values still equal. + * + * NOTE: We have to use the original search flags for + * this compare. 
+ */ + xt_ind_lock_handle(ot->ot_ind_rhandle); + search_key->sk_on_key = myxt_compare_key(ind, search_key->sk_key_value.sv_flags, search_key->sk_key_value.sv_length, + search_key->sk_key_value.sv_key, &ot->ot_ind_rhandle->ih_branch->tb_data[result.sr_item.i_item_offset]) == 0; + xt_ind_unlock_handle(ot->ot_ind_rhandle); + } + + checked_on_key: + ot->ot_curr_rec_id = result.sr_rec_id; + ot->ot_curr_row_id = result.sr_row_id; + ot->ot_ind_state = result.sr_item; + + return OK; + + failed: + XT_INDEX_UNLOCK(ind, ot); + if (idx_out_of_memory_failure(ot)) + goto retry_after_oom; + return FAILED; +} + +xtPublic xtBool xt_idx_prev(register XTOpenTablePtr ot, register XTIndexPtr ind, register XTIdxSearchKeyPtr search_key) +{ + XTIdxKeyValueRec key_value; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE]; + XTIdxResultRec result; + IdxBranchStackRec stack; + xtIndexNodeID current; + XTIndReferenceRec iref; + IdxStackItemPtr node; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + ASSERT_NS(ot->ot_ind_rhandle); + xt_ind_lock_handle(ot->ot_ind_rhandle); + if (!ot->ot_ind_state.i_node_ref_size && ot->ot_ind_state.i_item_offset > 0) { + key_value.sv_key = &ot->ot_ind_rhandle->ih_branch->tb_data[ot->ot_ind_state.i_item_offset]; + key_value.sv_length = ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE; + + result.sr_item = ot->ot_ind_state; + ind->mi_prev_item(ot->ot_table, ind, ot->ot_ind_rhandle->ih_branch, &result); + + if (search_key && search_key->sk_on_key) { + search_key->sk_on_key = myxt_compare_key(ind, search_key->sk_key_value.sv_flags, search_key->sk_key_value.sv_length, + search_key->sk_key_value.sv_key, &ot->ot_ind_rhandle->ih_branch->tb_data[result.sr_item.i_item_offset]) == 0; + } + + xt_ind_unlock_handle(ot->ot_ind_rhandle); + goto checked_on_key; + } + + key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + key_value.sv_rec_id = ot->ot_curr_rec_id; + key_value.sv_row_id = 0; + key_value.sv_key = key_buf; + key_value.sv_length = ot->ot_ind_state.i_item_size - 
XT_RECORD_REF_SIZE; + memcpy(key_buf, &ot->ot_ind_rhandle->ih_branch->tb_data[ot->ot_ind_state.i_item_offset], key_value.sv_length); + xt_ind_release_handle(ot->ot_ind_rhandle, TRUE, ot->ot_thread); + ot->ot_ind_rhandle = NULL; + + retry_after_oom: +#ifdef XT_TRACK_INDEX_UPDATES + ot->ot_ind_changed = 0; +#endif + idx_newstack(&stack); + + XT_INDEX_READ_LOCK(ind, ot); + + if (!(XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) { + XT_INDEX_UNLOCK(ind, ot); + return OK; + } + + while (XT_NODE_ID(current)) { + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + ind->mi_scan_branch(ot->ot_table, ind, iref.ir_branch, &key_value, &result); + if (result.sr_item.i_node_ref_size) { + if (result.sr_found) { + /* If we have found the key in a node: */ + + /* Go down to the bottom: */ + while (XT_NODE_ID(current)) { + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + goto failed; + current = result.sr_branch; + if (!xt_ind_fetch(ot, current, XT_LOCK_READ, &iref)) + goto failed; + ind->mi_last_item(ot->ot_table, ind, iref.ir_branch, &result); + if (!result.sr_item.i_node_ref_size) + break; + } + + /* Is the leaf not empty, then we are done... */ + if (result.sr_item.i_total_size == 0) + break; + goto unlock_check_on_key; + } + } + else { + /* We have reached the leaf. + * Whether we found the key or not, we have + * to move one to the left. + */ + if (result.sr_item.i_item_offset == 0) + break; + ind->mi_prev_item(ot->ot_table, ind, iref.ir_branch, &result); + goto unlock_check_on_key; + } + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + if (!idx_push(&stack, current, &result.sr_item)) + goto failed; + current = result.sr_branch; + } + + /* We are at the start of a leaf node. + * Go up the stack to find the start poition of the next key. + * If we find none, then we are the end of the index. 
+ */ + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + while ((node = idx_pop(&stack))) { + if (node->i_pos.i_item_offset > node->i_pos.i_node_ref_size) { + if (!xt_ind_fetch(ot, node->i_branch, XT_LOCK_READ, &iref)) + goto failed; + result.sr_item = node->i_pos; + ind->mi_prev_item(ot->ot_table, ind, iref.ir_branch, &result); + goto unlock_check_on_key; + } + } + + /* No more keys: */ + if (search_key) + search_key->sk_on_key = FALSE; + ot->ot_curr_rec_id = 0; + ot->ot_curr_row_id = 0; + + XT_INDEX_UNLOCK(ind, ot); + return OK; + + unlock_check_on_key: + ASSERT_NS(!ot->ot_ind_rhandle); + if (!(ot->ot_ind_rhandle = xt_ind_get_handle(ot, ind, &iref))) + goto failed; + /* + u_int branch_size; + + branch_size = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(iref.ir_branch->tb_size_2)); + memcpy(&ot->ot_ind_rbuf, iref.ir_branch, branch_size); + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + */ + + XT_INDEX_UNLOCK(ind, ot); + + /* Still on key? */ + if (search_key && search_key->sk_on_key) { + xt_ind_lock_handle(ot->ot_ind_rhandle); + search_key->sk_on_key = myxt_compare_key(ind, search_key->sk_key_value.sv_flags, search_key->sk_key_value.sv_length, + search_key->sk_key_value.sv_key, &ot->ot_ind_rhandle->ih_branch->tb_data[result.sr_item.i_item_offset]) == 0; + xt_ind_unlock_handle(ot->ot_ind_rhandle); + } + + checked_on_key: + ot->ot_curr_rec_id = result.sr_rec_id; + ot->ot_curr_row_id = result.sr_row_id; + ot->ot_ind_state = result.sr_item; + return OK; + + failed: + XT_INDEX_UNLOCK(ind, ot); + if (idx_out_of_memory_failure(ot)) + goto retry_after_oom; + return FAILED; +} + +/* Return TRUE if the record matches the current index search! 
*/ +xtPublic xtBool xt_idx_match_search(register XTOpenTablePtr ot __attribute__((unused)), register XTIndexPtr ind, register XTIdxSearchKeyPtr search_key, xtWord1 *buf, int mode) +{ + int r; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE]; + + myxt_create_key_from_row(ind, key_buf, (xtWord1 *) buf, NULL); + r = myxt_compare_key(ind, search_key->sk_key_value.sv_flags, search_key->sk_key_value.sv_length, search_key->sk_key_value.sv_key, key_buf); + switch (mode) { + case XT_S_MODE_MATCH: + return r == 0; + case XT_S_MODE_NEXT: + return r <= 0; + case XT_S_MODE_PREV: + return r >= 0; + } + return FALSE; +} + +static void idx_set_index_selectivity(XTThreadPtr self __attribute__((unused)), XTOpenTablePtr ot, XTIndexPtr ind) +{ + static const xtRecordID MAX_RECORDS = 100; + + XTIdxSearchKeyRec search_key; + XTIndexSegPtr key_seg; + u_int select_count[2] = {0, 0}; + xtWord1 key_buf[XT_INDEX_MAX_KEY_SIZE]; + u_int key_len; + xtWord1 *next_key_buf; + u_int next_key_len; + u_int curr_len; + u_int diff; + u_int j, i; + /* these 2 vars are used to check the overlapping if we have < 200 records */ + xtRecordID last_rec = 0; /* last record accounted in this iteration */ + xtRecordID last_iter_rec = 0; /* last record accounted in the previous iteration */ + + xtBool (* xt_idx_iterator[2])( + register struct XTOpenTable *ot, register struct XTIndex *ind, register XTIdxSearchKeyPtr search_key) = { + + xt_idx_next, + xt_idx_prev + }; + + xtBool (* xt_idx_begin[2])( + struct XTOpenTable *ot, struct XTIndex *ind, register XTIdxSearchKeyPtr search_key) = { + + xt_idx_search, + xt_idx_search_prev + }; + + ind->mi_select_total = 0; + key_seg = ind->mi_seg; + for (i=0; i < ind->mi_seg_count; key_seg++, i++) { + key_seg->is_selectivity = 1; + key_seg->is_recs_in_range = 1; + } + + for (j=0; j < 2; j++) { + xt_idx_prep_key(ind, &search_key, j == 0 ? 
XT_SEARCH_FIRST_FLAG : XT_SEARCH_AFTER_LAST_FLAG, NULL, 0); + if (!(xt_idx_begin[j])(ot, ind, &search_key)) + goto failed; + + /* Initialize the buffer with the first valid index entry: */ + while (!select_count[j] && ot->ot_curr_rec_id != last_iter_rec) { + if (ot->ot_curr_row_id) { + select_count[j]++; + last_rec = ot->ot_curr_rec_id; + + key_len = ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE; + /* Lock the handle while copying the key out of the branch: */ + xt_ind_lock_handle(ot->ot_ind_rhandle); + memcpy(key_buf, ot->ot_ind_rhandle->ih_branch->tb_data + ot->ot_ind_state.i_item_offset, key_len); + xt_ind_unlock_handle(ot->ot_ind_rhandle); + } + if (!(xt_idx_iterator[j])(ot, ind, &search_key)) + goto failed_1; + } + + while (select_count[j] < MAX_RECORDS && ot->ot_curr_rec_id != last_iter_rec) { + /* Check if the index entry is committed: */ + if (ot->ot_curr_row_id) { + xt_ind_lock_handle(ot->ot_ind_rhandle); + select_count[j]++; + last_rec = ot->ot_curr_rec_id; + + next_key_len = ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE; + next_key_buf = ot->ot_ind_rhandle->ih_branch->tb_data + ot->ot_ind_state.i_item_offset; + + curr_len = 0; + diff = FALSE; + key_seg = ind->mi_seg; + for (i=0; i < ind->mi_seg_count; key_seg++, i++) { + curr_len += myxt_key_seg_length(key_seg, curr_len, key_buf); + if (!diff && myxt_compare_key(ind, 0, curr_len, key_buf, next_key_buf) != 0) + diff = i+1; + if (diff) + key_seg->is_selectivity++; + } + + /* Store the key for the next comparison: */ + key_len = next_key_len; + memcpy(key_buf, next_key_buf, key_len); + xt_ind_unlock_handle(ot->ot_ind_rhandle); + } + + if (!(xt_idx_iterator[j])(ot, ind, &search_key)) + goto failed_1; + } + + last_iter_rec = last_rec; + + if (ot->ot_ind_rhandle) { + xt_ind_release_handle(ot->ot_ind_rhandle, FALSE, self); + ot->ot_ind_rhandle = NULL; + } + } + + u_int select_total; + + select_total = select_count[0] + select_count[1]; + if (select_total) { + u_int recs; + + ind->mi_select_total = select_total; + key_seg = ind->mi_seg; + for (i=0; i < 
ind->mi_seg_count; key_seg++, i++) { + recs = (u_int) ((double) select_total / (double) key_seg->is_selectivity + (double) 0.5); + key_seg->is_recs_in_range = recs ? recs : 1; + } + } + return; + + failed_1: + xt_ind_release_handle(ot->ot_ind_rhandle, FALSE, self); + ot->ot_ind_rhandle = NULL; + + failed: + ot->ot_table->tab_dic.dic_disable_index = XT_INDEX_CORRUPTED; + xt_log_and_clear_exception_ns(); + return; +} + +xtPublic void xt_ind_set_index_selectivity(XTThreadPtr self, XTOpenTablePtr ot) +{ + XTTableHPtr tab = ot->ot_table; + XTIndexPtr *ind; + u_int i; + + if (!tab->tab_dic.dic_disable_index) { + for (i=0, ind=tab->tab_dic.dic_keys; i<tab->tab_dic.dic_key_count; i++, ind++) + idx_set_index_selectivity(self, ot, *ind); + } +} + +/* + * ----------------------------------------------------------------------- + * Print a b-tree + */ + +#ifdef TEST_CODE +static void idx_check_on_key(XTOpenTablePtr ot) +{ + u_int offs = ot->ot_ind_state.i_item_offset + ot->ot_ind_state.i_item_size - XT_RECORD_REF_SIZE; + xtRecordID rec_id; + xtRowID row_id; + + if (ot->ot_curr_rec_id && ot->ot_ind_state.i_item_offset < ot->ot_ind_state.i_total_size) { + xt_get_record_ref(&ot->ot_ind_rbuf.tb_data[offs], &rec_id, &row_id); + + ASSERT_NS(rec_id == ot->ot_curr_rec_id); + } +} +#endif + +static void idx_check_space(int depth) +{ + for (int i=0; i<depth; i++) + printf(". 
"); +} + +static u_int idx_check_node(XTOpenTablePtr ot, XTIndexPtr ind, int depth, xtIndexNodeID node) +{ + XTIdxResultRec result; + u_int block_count = 1; + XTIndReferenceRec iref; + +#ifdef DEBUG + iref.ir_ulock = XT_UNLOCK_NONE; +#endif + ASSERT_NS(XT_NODE_ID(node) <= XT_NODE_ID(ot->ot_table->tab_ind_eof)); + if (!xt_ind_fetch(ot, node, XT_LOCK_READ, &iref)) + return 0; + + idx_first_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + ASSERT_NS(result.sr_item.i_total_size + offsetof(XTIdxBranchDRec, tb_data) <= XT_INDEX_PAGE_SIZE); + if (result.sr_item.i_node_ref_size) { + idx_check_space(depth); + printf("%04d -->\n", (int) XT_NODE_ID(result.sr_branch)); +#ifdef TRACK_ACTIVITY + track_block_exists(result.sr_branch); +#endif + block_count += idx_check_node(ot, ind, depth+1, result.sr_branch); + } + + while (result.sr_item.i_item_offset < result.sr_item.i_total_size) { +#ifdef CHECK_PRINTS_RECORD_REFERENCES + idx_check_space(depth); + if (result.sr_item.i_item_size == 12) { + /* Assume this is a NOT-NULL INT!: */ + xtWord4 val = XT_GET_DISK_4(&iref.ir_branch->tb_data[result.sr_item.i_item_offset]); + printf("(%6d) ", (int) val); + } + printf("rec=%d row=%d ", (int) result.sr_rec_id, (int) result.sr_row_id); + printf("\n"); +#endif + idx_next_branch_item(ot->ot_table, ind, iref.ir_branch, &result); + if (result.sr_item.i_node_ref_size) { + idx_check_space(depth); + printf("%04d -->\n", (int) XT_NODE_ID(result.sr_branch)); +#ifdef TRACK_ACTIVITY + track_block_exists(result.sr_branch); +#endif + block_count += idx_check_node(ot, ind, depth+1, result.sr_branch); + } + } + + xt_ind_release(ot, ind, XT_UNLOCK_READ, &iref); + return block_count; +} + +static u_int idx_check_index(XTOpenTablePtr ot, XTIndexPtr ind, xtBool with_lock) +{ + xtIndexNodeID current; + u_int block_count = 0; + u_int i; + + if (with_lock) + XT_INDEX_WRITE_LOCK(ind, ot); + + printf("INDEX (%d) %04d ---------------------------------------\n", (int) ind->mi_index_no, (int) 
XT_NODE_ID(ind->mi_root)); + if ((XT_NODE_ID(current) = XT_NODE_ID(ind->mi_root))) { +#ifdef TRACK_ACTIVITY + track_block_exists(ind->mi_root); +#endif + block_count = idx_check_node(ot, ind, 0, current); + } + + if (ind->mi_free_list && ind->mi_free_list->fl_free_count) { + printf("INDEX (%d) FREE ---------------------------------------", (int) ind->mi_index_no); + ASSERT_NS(ind->mi_free_list->fl_start == 0); + for (i=0; i<ind->mi_free_list->fl_free_count; i++) { + if ((i % 40) == 0) + printf("\n"); + block_count++; +#ifdef TRACK_ACTIVITY + track_block_exists(ind->mi_free_list->fl_page_id[i]); +#endif + printf("%2d ", (int) XT_NODE_ID(ind->mi_free_list->fl_page_id[i])); + } + if ((i % 40) != 0) + printf("\n"); + } + + if (with_lock) + XT_INDEX_UNLOCK(ind, ot); + return block_count; + +} + +xtPublic void xt_check_indices(XTOpenTablePtr ot) +{ + register XTTableHPtr tab = ot->ot_table; + XTIndexPtr *ind; + xtIndexNodeID current; + XTIndFreeBlockRec free_block; + u_int ind_count, block_count = 0; + u_int free_count = 0; + u_int i, j; + + xt_lock_mutex_ns(&tab->tab_ind_flush_lock); + printf("CHECK INDICES %s ==============================\n", tab->tab_name->ps_path); +#ifdef TRACK_ACTIVITY + track_reset_missing(); +#endif + + ind = tab->tab_dic.dic_keys; + for (u_int k=0; k<tab->tab_dic.dic_key_count; k++, ind++) { + ind_count = idx_check_index(ot, *ind, TRUE); + block_count += ind_count; + } + + xt_lock_mutex_ns(&tab->tab_ind_lock); + printf("\nFREE: ---------------------------------------\n"); + if (tab->tab_ind_free_list) { + XTIndFreeListPtr ptr; + + ptr = tab->tab_ind_free_list; + while (ptr) { + printf("Memory List:"); + i = 0; + for (j=ptr->fl_start; j<ptr->fl_free_count; j++, i++) { + if ((i % 40) == 0) + printf("\n"); + free_count++; +#ifdef TRACK_ACTIVITY + track_block_exists(ptr->fl_page_id[j]); +#endif + printf("%2d ", (int) XT_NODE_ID(ptr->fl_page_id[j])); + } + if ((i % 40) != 0) + printf("\n"); + ptr = ptr->fl_next_list; + } + } + + current = 
tab->tab_ind_free; + if (XT_NODE_ID(current)) { + u_int k = 0; + printf("Disk List:"); + while (XT_NODE_ID(current)) { + if ((k % 40) == 0) + printf("\n"); + free_count++; +#ifdef TRACK_ACTIVITY + track_block_exists(current); +#endif + printf("%d ", (int) XT_NODE_ID(current)); + if (!xt_ind_read_bytes(ot, current, sizeof(XTIndFreeBlockRec), (xtWord1 *) &free_block)) { + xt_log_and_clear_exception_ns(); + break; + } + XT_NODE_ID(current) = (xtIndexNodeID) XT_GET_DISK_8(free_block.if_next_block_8); + k++; + } + if ((k % 40) != 0) + printf("\n"); + } + printf("\n-----------------------------\n"); + printf("used blocks %d + free blocks %d = %d\n", block_count, free_count, block_count + free_count); + printf("EOF = %"PRIu64", total blocks = %d\n", (xtWord8) xt_ind_node_to_offset(tab, tab->tab_ind_eof), (int) (XT_NODE_ID(tab->tab_ind_eof) - 1)); + printf("-----------------------------\n"); + xt_unlock_mutex_ns(&tab->tab_ind_lock); +#ifdef TRACK_ACTIVITY + track_dump_missing(tab->tab_ind_eof); + printf("===================================================\n"); + track_dump_all((u_int) (XT_NODE_ID(tab->tab_ind_eof) - 1)); +#endif + printf("===================================================\n"); + xt_unlock_mutex_ns(&tab->tab_ind_flush_lock); +} + +/* + * ----------------------------------------------------------------------- + * Index consistant flush + */ + +static xtBool idx_flush_dirty_list(XTIndexLogPtr il, XTOpenTablePtr ot, u_int *flush_count, XTIndBlockPtr *flush_list) +{ + for (u_int i=0; i<*flush_count; i++) + il->il_write_block(ot, flush_list[i]); + *flush_count = 0; + return OK; +} + +static xtBool ind_add_to_dirty_list(XTIndexLogPtr il, XTOpenTablePtr ot, u_int *flush_count, XTIndBlockPtr *flush_list, XTIndBlockPtr block) +{ + register u_int count; + register u_int i; + register u_int guess; + + if (*flush_count == IND_FLUSH_BUFFER_SIZE) { + if (!idx_flush_dirty_list(il, ot, flush_count, flush_list)) + return FAILED; + } + + count = *flush_count; + i = 0; + 
while (i < count) { + guess = (i + count - 1) >> 1; + if (XT_NODE_ID(block->cb_address) == XT_NODE_ID(flush_list[guess]->cb_address)) { + // Should not happen... + ASSERT_NS(FALSE); + return OK; + } + if (XT_NODE_ID(block->cb_address) < XT_NODE_ID(flush_list[guess]->cb_address)) + count = guess; + else + i = guess + 1; + } + + /* Insert at position i */ + memmove(flush_list + i + 1, flush_list + i, (*flush_count - i) * sizeof(XTIndBlockPtr)); + flush_list[i] = block; + *flush_count = *flush_count + 1; + return OK; +} + +xtPublic xtBool xt_flush_indices(XTOpenTablePtr ot, off_t *bytes_flushed, xtBool have_table_lock) +{ + register XTTableHPtr tab = ot->ot_table; + XTIndexLogPtr il; + XTIndexPtr *indp; + XTIndexPtr ind; + u_int i, j; + xtBool wrote_something = FALSE; + u_int flush_count = 0; + XTIndBlockPtr flush_list[IND_FLUSH_BUFFER_SIZE]; + XTIndBlockPtr block, fblock; + xtWord1 *data; + xtIndexNodeID ind_free; + xtBool something_to_free = FALSE; + xtIndexNodeID last_address, next_address; + xtWord2 curr_flush_seq; + XTIndFreeListPtr list_ptr; + u_int dirty_blocks; + XTCheckPointTablePtr cp_tab; + XTCheckPointStatePtr cp = NULL; + + if (!xt_begin_checkpoint(tab->tab_db, have_table_lock, ot->ot_thread)) + return FAILED; + +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + xt_lock_mutex_ns(&tab->tab_ind_flush_lock); + + if (!tab->tab_db->db_indlogs.ilp_get_log(&il, ot->ot_thread)) + goto failed_3; + + il->il_reset(tab->tab_id); + if (!il->il_write_byte(ot, XT_DT_FREE_LIST)) + goto failed_2; + if (!il->il_write_word4(ot, tab->tab_id)) + goto failed_2; + if (!il->il_write_word4(ot, 0)) + goto failed_2; + + /* Lock all: */ + dirty_blocks = 0; + indp = tab->tab_dic.dic_keys; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + XT_INDEX_WRITE_LOCK(ind, ot); + if (ind->mi_free_list && ind->mi_free_list->fl_free_count) + something_to_free = TRUE; + dirty_blocks += ind->mi_dirty_blocks; + } + // 128 dirty blocks == 2MB +#ifdef 
TRACE_FLUSH + printf("FLUSH index %d %s\n", (int) dirty_blocks * XT_INDEX_PAGE_SIZE, tab->tab_name->ps_path); + fflush(stdout); +#endif + if (bytes_flushed) + *bytes_flushed += (dirty_blocks * XT_INDEX_PAGE_SIZE); + + curr_flush_seq = tab->tab_ind_flush_seq; + tab->tab_ind_flush_seq++; + + /* Write the dirty pages: */ + indp = tab->tab_dic.dic_keys; + data = tab->tab_index_head->tp_data; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + xt_spinlock_lock(&ind->mi_dirty_lock); + if ((block = ind->mi_dirty_list)) { + wrote_something = TRUE; + while (block) { + ASSERT_NS(block->cb_state == IDX_CAC_BLOCK_DIRTY); + ASSERT_NS(block->cp_flush_seq == curr_flush_seq); + if (!ind_add_to_dirty_list(il, ot, &flush_count, flush_list, block)) + goto failed; + block = block->cb_dirty_next; + } + } + xt_spinlock_unlock(&ind->mi_dirty_lock); + XT_SET_NODE_REF(tab, data, ind->mi_root); + data += XT_NODE_REF_SIZE; + } + + /* Flush the dirty blocks: */ + if (!idx_flush_dirty_list(il, ot, &flush_count, flush_list)) + goto failed; + + xt_lock_mutex_ns(&tab->tab_ind_lock); + + /* Write the free list: */ + if (something_to_free) { + union { + xtWord1 buffer[XT_BLOCK_SIZE_FOR_DIRECT_IO]; + XTIndFreeBlockRec free_block; + } x; + memset(x.buffer, 0, sizeof(XTIndFreeBlockRec)); + + /* The old start of the free list: */ + XT_NODE_ID(ind_free) = 0; + while ((list_ptr = tab->tab_ind_free_list)) { + if (list_ptr->fl_start < list_ptr->fl_free_count) { + ind_free = list_ptr->fl_page_id[list_ptr->fl_start]; + break; + } + tab->tab_ind_free_list = list_ptr->fl_next_list; + xt_free_ns(list_ptr); + } + if (!XT_NODE_ID(ind_free)) + ind_free = tab->tab_ind_free; + + if (!il->il_write_byte(ot, XT_DT_FREE_LIST)) + goto failed; + indp = tab->tab_dic.dic_keys; + XT_NODE_ID(last_address) = 0; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + //ASSERT_NS(XT_INDEX_HAVE_XLOCK(ind, ot)); + if (ind->mi_free_list && ind->mi_free_list->fl_free_count) { + for (j=0; 
j<ind->mi_free_list->fl_free_count; j++) { + next_address = ind->mi_free_list->fl_page_id[j]; + if (!il->il_write_word4(ot, XT_NODE_ID(ind->mi_free_list->fl_page_id[j]))) + goto failed; + if (XT_NODE_ID(last_address)) { + XT_SET_DISK_8(x.free_block.if_next_block_8, XT_NODE_ID(next_address)); + if (!xt_ind_write_cache(ot, last_address, 8, x.buffer)) + goto failed; + } + last_address = next_address; + } + } + } + if (!il->il_write_word4(ot, XT_NODE_ID(ind_free))) + goto failed; + if (XT_NODE_ID(last_address)) { + XT_SET_DISK_8(x.free_block.if_next_block_8, XT_NODE_ID(tab->tab_ind_free)); + if (!xt_ind_write_cache(ot, last_address, 8, x.buffer)) + goto failed; + } + if (!il->il_write_word4(ot, 0xFFFFFFFF)) + goto failed; + } + + /* + * Add the free list caches to the global free list cache. + * Added backwards to match the write order. + */ + indp = tab->tab_dic.dic_keys + tab->tab_dic.dic_key_count-1; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp--) { + ind = *indp; + //ASSERT_NS(XT_INDEX_HAVE_XLOCK(ind, ot)); + if (ind->mi_free_list) { + wrote_something = TRUE; + ind->mi_free_list->fl_next_list = tab->tab_ind_free_list; + tab->tab_ind_free_list = ind->mi_free_list; + } + ind->mi_free_list = NULL; + } + + /* + * The new start of the free list is the first + * item on the table free list: + */ + XT_NODE_ID(ind_free) = 0; + while ((list_ptr = tab->tab_ind_free_list)) { + if (list_ptr->fl_start < list_ptr->fl_free_count) { + ind_free = list_ptr->fl_page_id[list_ptr->fl_start]; + break; + } + tab->tab_ind_free_list = list_ptr->fl_next_list; + xt_free_ns(list_ptr); + } + if (!XT_NODE_ID(ind_free)) + ind_free = tab->tab_ind_free; + xt_unlock_mutex_ns(&tab->tab_ind_lock); + + XT_SET_DISK_6(tab->tab_index_head->tp_ind_eof_6, XT_NODE_ID(tab->tab_ind_eof)); + XT_SET_DISK_6(tab->tab_index_head->tp_ind_free_6, XT_NODE_ID(ind_free)); + + if (!il->il_write_header(ot, XT_INDEX_HEAD_SIZE, (xtWord1 *) tab->tab_index_head)) + goto failed; + + indp = tab->tab_dic.dic_keys; + for 
(i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + XT_INDEX_UNLOCK(ind, ot); + } + + if (wrote_something) { + /* Flush the log before we flush the index. + * + * The reason is, we must make sure that changes that + * will be in the index are already in the transaction + * log. + * + * Only then are we able to undo those changes on + * recovery. + * + * Simple example: + * CREATE TABLE t1 (s1 INT PRIMARY KEY); + * INSERT INTO t1 VALUES (1); + * + * BEGIN; + * INSERT INTO t1 VALUES (2); + * + * --- INDEX IS FLUSHED HERE --- + * + * --- SERVER CRASH HERE --- + * + * + * The INSERT VALUES (2) has been written + * to the log, but not flushed. + * But the index has been updated. + * If the index is flushed it will contain + * the entry for record with s1=2. + * + * This entry must be removed on recovery. + * + * To prevent this situation I flush the log + * here. + */ + if (!(tab->tab_dic.dic_tab_flags & XT_TAB_FLAGS_TEMP_TAB)) { + if (!xt_xlog_flush_log(ot->ot_thread)) + goto failed_2; + if (!il->il_flush(ot)) + goto failed_2; + } + + if (!il->il_apply_log(ot)) + goto failed_2; + + indp = tab->tab_dic.dic_keys; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + XT_INDEX_WRITE_LOCK(ind, ot); + } + + /* Free up flushed pages: */ + indp = tab->tab_dic.dic_keys; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + xt_spinlock_lock(&ind->mi_dirty_lock); + if ((block = ind->mi_dirty_list)) { + while (block) { + fblock = block; + block = block->cb_dirty_next; + ASSERT_NS(fblock->cb_state == IDX_CAC_BLOCK_DIRTY); + if (fblock->cp_flush_seq == curr_flush_seq) { + /* Take the block off the dirty list: */ + if (fblock->cb_dirty_next) + fblock->cb_dirty_next->cb_dirty_prev = fblock->cb_dirty_prev; + if (fblock->cb_dirty_prev) + fblock->cb_dirty_prev->cb_dirty_next = fblock->cb_dirty_next; + if (ind->mi_dirty_list == fblock) + ind->mi_dirty_list = fblock->cb_dirty_next; + ind->mi_dirty_blocks--; + fblock->cb_state = 
IDX_CAC_BLOCK_CLEAN; + } + } + } + xt_spinlock_unlock(&ind->mi_dirty_lock); + } + + indp = tab->tab_dic.dic_keys; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + XT_INDEX_UNLOCK(ind, ot); + } + } + + il->il_release(); + + /* Mark this table as index flushed: */ + cp = &tab->tab_db->db_cp_state; + xt_lock_mutex_ns(&cp->cp_state_lock); + if (cp->cp_running) { + cp_tab = (XTCheckPointTablePtr) xt_sl_find(NULL, cp->cp_table_ids, &tab->tab_id); + if (cp_tab && (cp_tab->cpt_flushed & XT_CPT_ALL_FLUSHED) != XT_CPT_ALL_FLUSHED) { + cp_tab->cpt_flushed |= XT_CPT_INDEX_FLUSHED; + if ((cp_tab->cpt_flushed & XT_CPT_ALL_FLUSHED) == XT_CPT_ALL_FLUSHED) { + ASSERT_NS(cp->cp_flush_count < xt_sl_get_size(cp->cp_table_ids)); + cp->cp_flush_count++; + } + } + } + xt_unlock_mutex_ns(&cp->cp_state_lock); + + xt_unlock_mutex_ns(&tab->tab_ind_flush_lock); +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache((XTIndex *) 1); +#endif +#ifdef TRACE_FLUSH + printf("FLUSH --end-- %s\n", tab->tab_name->ps_path); + fflush(stdout); +#endif + if (!xt_end_checkpoint(tab->tab_db, ot->ot_thread, NULL)) + return FAILED; + return OK; + + failed: + indp = tab->tab_dic.dic_keys; + for (i=0; i<tab->tab_dic.dic_key_count; i++, indp++) { + ind = *indp; + XT_INDEX_UNLOCK(ind, ot); + } + + failed_2: + il->il_release(); + + failed_3: + xt_unlock_mutex_ns(&tab->tab_ind_flush_lock); +#ifdef DEBUG_CHECK_IND_CACHE + xt_ind_check_cache(NULL); +#endif + return FAILED; +} + +void XTIndexLogPool::ilp_init(struct XTThread *self, struct XTDatabase *db, size_t log_buffer_size) +{ + char path[PATH_MAX]; + XTOpenDirPtr od; + xtLogID log_id; + char *file; + XTIndexLogPtr il = NULL; + XTOpenTablePtr ot = NULL; + + ilp_db = db; + ilp_log_buffer_size = log_buffer_size; + xt_init_mutex_with_autoname(self, &ilp_lock); + + xt_strcpy(PATH_MAX, path, db->db_main_path); + xt_add_system_dir(PATH_MAX, path); + if (xt_fs_exists(path)) { + pushsr_(od, xt_dir_close, xt_dir_open(self, path, NULL)); + while 
(xt_dir_next(self, od)) { + file = xt_dir_name(self, od); + if (xt_starts_with(file, "ilog")) { + if ((log_id = (xtLogID) xt_file_name_to_id(file))) { + if (!ilp_open_log(&il, log_id, FALSE, self)) + goto failed; + if (il->il_tab_id && il->il_log_eof) { + if (!il->il_open_table(&ot)) + goto failed; + if (ot) { + if (!il->il_apply_log(ot)) + goto failed; + ot->ot_thread = self; + il->il_close_table(ot); + } + } + il->il_close(TRUE); + } + } + } + freer_(); // xt_dir_close(od) + } + return; + + failed: + if (ot && il) + il->il_close_table(ot); + if (il) + il->il_close(FALSE); + xt_throw(self); +} + +void XTIndexLogPool::ilp_close(struct XTThread *self __attribute__((unused)), xtBool lock) +{ + XTIndexLogPtr il; + + if (lock) + xt_lock_mutex_ns(&ilp_lock); + while ((il = ilp_log_pool)) { + ilp_log_pool = il->il_next_in_pool; + il_pool_count--; + il->il_close(TRUE); + } + if (lock) + xt_unlock_mutex_ns(&ilp_lock); +} + +void XTIndexLogPool::ilp_exit(struct XTThread *self) +{ + ilp_close(self, FALSE); + ASSERT_NS(il_pool_count == 0); + xt_free_mutex(&ilp_lock); +} + +void XTIndexLogPool::ilp_name(size_t size, char *path, xtLogID log_id) +{ + char name[50]; + + sprintf(name, "ilog-%lu.xt", (u_long) log_id); + xt_strcpy(size, path, ilp_db->db_main_path); + xt_add_system_dir(size, path); + xt_add_dir_char(size, path); + xt_strcat(size, path, name); +} + +xtBool XTIndexLogPool::ilp_open_log(XTIndexLogPtr *ret_il, xtLogID log_id, xtBool excl, XTThreadPtr thread) +{ + char log_path[PATH_MAX]; + XTIndexLogPtr il; + XTIndLogHeadDRec log_head; + size_t read_size; + + ilp_name(PATH_MAX, log_path, log_id); + if (!(il = (XTIndexLogPtr) xt_calloc_ns(sizeof(XTIndexLogRec)))) + return FAILED; + il->il_log_id = log_id; + il->il_pool = this; + + /* Writes will be rounded up to the nearest direct write block size (see [+]), + * so make sure we have space in the buffer for that: + */ + if (!(il->il_buffer = (xtWord1 *) xt_malloc_ns(ilp_log_buffer_size + XT_BLOCK_SIZE_FOR_DIRECT_IO))) + 
goto failed; + il->il_buffer_size = ilp_log_buffer_size; + + if (!(il->il_of = xt_open_file_ns(log_path, (excl ? XT_FS_EXCLUSIVE : 0) | XT_FS_CREATE | XT_FS_MAKE_PATH))) + goto failed; + + if (!xt_pread_file(il->il_of, 0, sizeof(XTIndLogHeadDRec), 0, &log_head, &read_size, &thread->st_statistics.st_ilog, thread)) + goto failed; + + if (read_size == sizeof(XTIndLogHeadDRec)) { + il->il_tab_id = XT_GET_DISK_4(log_head.ilh_tab_id_4); + il->il_log_eof = XT_GET_DISK_4(log_head.ilh_log_eof_4); + } + else { + il->il_tab_id = 0; + il->il_log_eof = 0; + } + + *ret_il = il; + return OK; + + failed: + il->il_close(FALSE); + return FAILED; +} + +xtBool XTIndexLogPool::ilp_get_log(XTIndexLogPtr *ret_il, XTThreadPtr thread) +{ + XTIndexLogPtr il; + xtLogID log_id = 0; + + xt_lock_mutex_ns(&ilp_lock); + if ((il = ilp_log_pool)) { + ilp_log_pool = il->il_next_in_pool; + il_pool_count--; + } + else { + ilp_next_log_id++; + log_id = ilp_next_log_id; + } + xt_unlock_mutex_ns(&ilp_lock); + if (!il) { + if (!ilp_open_log(&il, log_id, TRUE, thread)) + return FAILED; + } + *ret_il= il; + return OK; +} + +void XTIndexLogPool::ilp_release_log(XTIndexLogPtr il) +{ + xt_lock_mutex_ns(&ilp_lock); + if (il_pool_count == 5) + il->il_close(TRUE); + else { + il_pool_count++; + il->il_next_in_pool = ilp_log_pool; + ilp_log_pool = il; + } + xt_unlock_mutex_ns(&ilp_lock); +} + +void XTIndexLog::il_reset(xtTableID tab_id) +{ + il_tab_id = tab_id; + il_log_eof = 0; + il_buffer_len = 0; + il_buffer_offset = 0; +} + +void XTIndexLog::il_close(xtBool delete_it) +{ + xtLogID log_id = il_log_id; + + if (il_of) { + xt_close_file_ns(il_of); + il_of = NULL; + } + + if (delete_it && log_id) { + char log_path[PATH_MAX]; + + il_pool->ilp_name(PATH_MAX, log_path, log_id); + xt_fs_delete(NULL, log_path); + } + + if (il_buffer) { + xt_free_ns(il_buffer); + il_buffer = NULL; + } + + xt_free_ns(this); +} + + +void XTIndexLog::il_release() +{ + il_pool->ilp_db->db_indlogs.ilp_release_log(this); +} + +xtBool 
XTIndexLog::il_require_space(size_t bytes, XTThreadPtr thread) +{ + if (il_buffer_len + bytes > il_buffer_size) { + if (!xt_pwrite_file(il_of, il_buffer_offset, il_buffer_len, il_buffer, &thread->st_statistics.st_ilog, thread)) + return FAILED; + il_buffer_offset += il_buffer_len; + il_buffer_len = 0; + } + + return OK; +} + +xtBool XTIndexLog::il_write_byte(struct XTOpenTable *ot __attribute__((unused)), xtWord1 byte) +{ + if (!il_require_space(1, ot->ot_thread)) + return FAILED; + *(il_buffer + il_buffer_len) = byte; + il_buffer_len++; + return OK; +} + +xtBool XTIndexLog::il_write_word4(struct XTOpenTable *ot __attribute__((unused)), xtWord4 value) +{ + xtWord1 *buffer; + + if (!il_require_space(4, ot->ot_thread)) + return FAILED; + buffer = il_buffer + il_buffer_len; + XT_SET_DISK_4(buffer, value); + il_buffer_len += 4; + return OK; +} + +xtBool XTIndexLog::il_write_block(struct XTOpenTable *ot __attribute__((unused)), XTIndBlockPtr block) +{ + XTIndPageDataDPtr page_data; + xtIndexNodeID node_id; + XTIdxBranchDPtr node; + u_int block_len; + + node_id = block->cb_address; + node = (XTIdxBranchDPtr) block->cb_data; + block_len = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(node->tb_size_2)); + + if (!il_require_space(offsetof(XTIndPageDataDRec, ild_data) + block_len, ot->ot_thread)) + return FAILED; + + ASSERT_NS(offsetof(XTIndPageDataDRec, ild_data) + XT_INDEX_PAGE_SIZE <= il_buffer_size); + + page_data = (XTIndPageDataDPtr) (il_buffer + il_buffer_len); + TRACK_BLOCK_TO_FLUSH(node_id); + page_data->ild_data_type = XT_DT_INDEX_PAGE; + XT_SET_DISK_4(page_data->ild_page_id_4, XT_NODE_ID(node_id)); + memcpy(page_data->ild_data, block->cb_data, block_len); + + il_buffer_len += offsetof(XTIndPageDataDRec, ild_data) + block_len; + + return OK; +} + +xtBool XTIndexLog::il_write_header(struct XTOpenTable *ot __attribute__((unused)), size_t head_size, xtWord1 *head_buf) +{ + XTIndHeadDataDPtr head_data; + + if (!il_require_space(offsetof(XTIndHeadDataDRec, ilh_data) + head_size, 
ot->ot_thread)) + return FAILED; + + head_data = (XTIndHeadDataDPtr) (il_buffer + il_buffer_len); + head_data->ilh_data_type = XT_DT_HEADER; + XT_SET_DISK_2(head_data->ilh_head_size_2, head_size); + memcpy(head_data->ilh_data, head_buf, head_size); + + il_buffer_len += offsetof(XTIndHeadDataDRec, ilh_data) + head_size; + + return OK; +} + +xtBool XTIndexLog::il_flush(struct XTOpenTable *ot) +{ + XTIndLogHeadDRec log_head; + xtTableID tab_id = ot->ot_table->tab_id; + + if (il_buffer_len) { + if (!xt_pwrite_file(il_of, il_buffer_offset, il_buffer_len, il_buffer, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + il_buffer_offset += il_buffer_len; + il_buffer_len = 0; + } + + if (il_log_eof != il_buffer_offset) { + log_head.ilh_data_type = XT_DT_LOG_HEAD; + XT_SET_DISK_4(log_head.ilh_tab_id_4, tab_id); + XT_SET_DISK_4(log_head.ilh_log_eof_4, il_buffer_offset); + + if (!xt_flush_file(il_of, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + + if (!xt_pwrite_file(il_of, 0, sizeof(XTIndLogHeadDRec), (xtWord1 *) &log_head, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + + if (!xt_flush_file(il_of, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + + il_tab_id = tab_id; + il_log_eof = il_buffer_offset; + } + return OK; +} + +xtBool XTIndexLog::il_apply_log(struct XTOpenTable *ot) +{ + XT_NODE_TEMP; + register XTTableHPtr tab = ot->ot_table; + off_t offset; + size_t pos; + xtWord1 *buffer; + off_t address; + xtIndexNodeID node_id; + size_t req_size = 0; + XTIndLogHeadDRec log_head; + + offset = 0; + while (offset < il_log_eof) { + if (offset < il_buffer_offset || + offset >= il_buffer_offset + (off_t) il_buffer_len) { + il_buffer_len = il_buffer_size; + if (il_log_eof - offset < (off_t) il_buffer_len) + il_buffer_len = (size_t) (il_log_eof - offset); + + /* Corrupt log?! 
*/ + if (il_buffer_len < req_size) { + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_INDEX_LOG_CORRUPT, xt_file_path(il_of)); + xt_log_and_clear_exception_ns(); + return OK; + } + if (!xt_pread_file(il_of, offset, il_buffer_len, il_buffer_len, il_buffer, NULL, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + il_buffer_offset = offset; + } + pos = (size_t) (offset - il_buffer_offset); + ASSERT_NS(pos < il_buffer_len); + buffer = il_buffer + pos; + switch (*buffer) { + case XT_DT_LOG_HEAD: + req_size = sizeof(XTIndLogHeadDRec); + if (il_buffer_len - pos < req_size) { + il_buffer_len = 0; + continue; + } + offset += req_size; + req_size = 0; + break; + case XT_DT_INDEX_PAGE: + XTIndPageDataDPtr page_data; + XTIdxBranchDPtr node; + u_int block_len; + size_t size; + + req_size = offsetof(XTIndPageDataDRec, ild_data) + 2; + if (il_buffer_len - pos < req_size) { + il_buffer_len = 0; + continue; + } + page_data = (XTIndPageDataDPtr) buffer; + node_id = XT_RET_NODE_ID(XT_GET_DISK_4(page_data->ild_page_id_4)); + node = (XTIdxBranchDPtr) page_data->ild_data; + block_len = XT_GET_INDEX_BLOCK_LEN(XT_GET_DISK_2(node->tb_size_2)); + if (block_len < 2 || block_len > XT_INDEX_PAGE_SIZE) { + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_INDEX_CORRUPTED, tab->tab_name); + return FAILED; + } + + req_size = offsetof(XTIndPageDataDRec, ild_data) + block_len; + if (il_buffer_len - pos < req_size) { + il_buffer_len = 0; + continue; + } + + TRACK_BLOCK_FLUSH_N(node_id); + address = xt_ind_node_to_offset(tab, node_id); + /* [+] Round up the block size. Space has been provided. 
*/ + size = (((block_len - 1) / XT_BLOCK_SIZE_FOR_DIRECT_IO) + 1) * XT_BLOCK_SIZE_FOR_DIRECT_IO; + IDX_TRACE("%d- W%x\n", (int) XT_NODE_ID(node_id), (int) XT_GET_DISK_2(page_data->ild_data)); + ASSERT_NS(size > 0 && size <= XT_INDEX_PAGE_SIZE); + if (!xt_pwrite_file(ot->ot_ind_file, address, size, page_data->ild_data, &ot->ot_thread->st_statistics.st_ind, ot->ot_thread)) + return FAILED; + + offset += req_size; + req_size = 0; + break; + case XT_DT_FREE_LIST: + xtWord4 block, nblock; + union { + xtWord1 buffer[XT_BLOCK_SIZE_FOR_DIRECT_IO]; + XTIndFreeBlockRec free_block; + } x; + off_t aoff; + + memset(x.buffer, 0, sizeof(XTIndFreeBlockRec)); + + pos++; + offset++; + + for (;;) { + req_size = 8; + if (il_buffer_len - pos < req_size) { + il_buffer_len = il_buffer_size; + if (il_log_eof - offset < (off_t) il_buffer_len) + il_buffer_len = (size_t) (il_log_eof - offset); + /* Corrupt log?! */ + if (il_buffer_len < req_size) { + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_INDEX_LOG_CORRUPT, xt_file_path(il_of)); + xt_log_and_clear_exception_ns(); + return OK; + } + if (!xt_pread_file(il_of, offset, il_buffer_len, il_buffer_len, il_buffer, NULL, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + pos = 0; + } + block = XT_GET_DISK_4(il_buffer + pos); + nblock = XT_GET_DISK_4(il_buffer + pos + 4); + if (nblock == 0xFFFFFFFF) + break; + aoff = xt_ind_node_to_offset(tab, XT_RET_NODE_ID(block)); + XT_SET_DISK_8(x.free_block.if_next_block_8, nblock); + IDX_TRACE("%d- *%x\n", (int) block, (int) XT_GET_DISK_2(x.buffer)); + if (!xt_pwrite_file(ot->ot_ind_file, aoff, XT_BLOCK_SIZE_FOR_DIRECT_IO, x.buffer, &ot->ot_thread->st_statistics.st_ind, ot->ot_thread)) + return FAILED; + pos += 4; + offset += 4; + } + + offset += 8; + req_size = 0; + break; + case XT_DT_HEADER: + XTIndHeadDataDPtr head_data; + size_t len; + + req_size = offsetof(XTIndHeadDataDRec, ilh_data); + if (il_buffer_len - pos < req_size) { + il_buffer_len = 0; + continue; + } + head_data = 
(XTIndHeadDataDPtr) buffer; + len = XT_GET_DISK_2(head_data->ilh_head_size_2); + + req_size = offsetof(XTIndHeadDataDRec, ilh_data) + len; + if (il_buffer_len - pos < req_size) { + il_buffer_len = 0; + continue; + } + + if (!xt_pwrite_file(ot->ot_ind_file, 0, len, head_data->ilh_data, &ot->ot_thread->st_statistics.st_ind, ot->ot_thread)) + return FAILED; + + offset += req_size; + req_size = 0; + break; + default: + xt_register_ixterr(XT_REG_CONTEXT, XT_ERR_INDEX_LOG_CORRUPT, xt_file_path(il_of)); + xt_log_and_clear_exception_ns(); + return OK; + } + } + + if (!xt_flush_file(ot->ot_ind_file, &ot->ot_thread->st_statistics.st_ind, ot->ot_thread)) + return FAILED; + + log_head.ilh_data_type = XT_DT_LOG_HEAD; + XT_SET_DISK_4(log_head.ilh_tab_id_4, il_tab_id); + XT_SET_DISK_4(log_head.ilh_log_eof_4, 0); + + if (!xt_pwrite_file(il_of, 0, sizeof(XTIndLogHeadDRec), (xtWord1 *) &log_head, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + + if (!(tab->tab_dic.dic_tab_flags & XT_TAB_FLAGS_TEMP_TAB)) { + if (!xt_flush_file(il_of, &ot->ot_thread->st_statistics.st_ilog, ot->ot_thread)) + return FAILED; + } + return OK; +} + +xtBool XTIndexLog::il_open_table(struct XTOpenTable **ot) +{ + return xt_db_open_pool_table_ns(ot, il_pool->ilp_db, il_tab_id); +} + +void XTIndexLog::il_close_table(struct XTOpenTable *ot) +{ + xt_db_return_table_to_pool_ns(ot); +} + + diff --git a/storage/pbxt/src/index_xt.h b/storage/pbxt/src/index_xt.h new file mode 100644 index 00000000000..2a4b3750815 --- /dev/null +++ b/storage/pbxt/src/index_xt.h @@ -0,0 +1,508 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-09-30 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_index_h__ +#define __xt_index_h__ + +#ifdef DRIZZLED +#include <mysys/my_bitmap.h> +#else +#include <mysql_version.h> +#include <my_bitmap.h> +#endif + +#include "thread_xt.h" +#include "linklist_xt.h" +#include "datalog_xt.h" +#include "datadic_xt.h" +//#include "cache_xt.h" + +#ifndef MYSQL_VERSION_ID +#error MYSQL_VERSION_ID must be defined! +#endif + +struct XTDictionary; +STRUCT_TABLE; +struct XTTable; +struct XTOpenTable; +struct XTIndex; +struct XTIndBlock; +struct XTTable; +class Field; + +/* + * INDEX ROLLBACK + * + * When a transaction is rolled back, the index entries are not + * garbage collected!! Instead, the index entries are deleted + * when the data record is garbage collected. + * + * When an index record is written, and this record replaces + * some other record (i.e. a node is updated), the new record + * references its predecessor. + * + * On cleanup (rollback or commit), the predecessor records + * are garbage collected. + * + * NOTE: It is possible to lose memory if a crash occurs during + * index modification. This can occur if a node is split and + * we crash between writing the 2 new records. + * + */ + +/* + * These flags influence the way the compare and search + * routines function. + * + * The low-order 16 bits are reserved for the caller + * (i.e. MySQL specific stuff).
+ */ +#define XT_SEARCH_WHOLE_KEY 0x10000000 /* This flag is used to search for an insertion point, or to find + * a particular slot that has already been inserted into the + * index. The compare includes the handle of the variation. + */ +#define XT_SEARCH_AFTER_KEY 0x20000000 /* This flag searches for the position just after the given key. + * Even if the key is not found, success is possible if there + * is a value in the index that would be after the search key. + * + * If this flag is not set then we search for the first + * occurrence of the key in the index. If not found we + * take the position just after the search key. + */ +#define XT_SEARCH_FIRST_FLAG 0x40000000 /* Use this flag to find the first position in the index. + * When set, the actual key value is ignored. + */ +#define XT_SEARCH_AFTER_LAST_FLAG 0x80000000 /* Search out the position after the last in the index. + * When set, the actual key value is ignored. + */ + +#define XT_INDEX_MAX_KEY_SIZE_MAX 2048 /* These are allocated on the stack, so this is the maximum! */ + +#define XT_INDEX_MAX_KEY_SIZE ((XT_INDEX_PAGE_SIZE >> 1) > XT_INDEX_MAX_KEY_SIZE_MAX ? XT_INDEX_MAX_KEY_SIZE_MAX : (XT_INDEX_PAGE_SIZE >> 1)) + +#define XT_IS_NODE_BIT 0x8000 + +#define XT_IS_NODE(x) ((x) & XT_IS_NODE_BIT) + +#define XT_NODE_REF_SIZE 4 +#define XT_GET_NODE_REF(t, x) XT_RET_NODE_ID(XT_GET_DISK_4(x)) +#define XT_SET_NODE_REF(t, x, y) XT_SET_DISK_4((x), XT_NODE_ID(y)) + +#define XT_MAX_RECORD_REF_SIZE 8 + +#define XT_INDEX_PAGE_DATA_SIZE XT_INDEX_PAGE_SIZE - 2 /* NOTE: 2 == offsetof(XTIdxBranchDRec, tb_data) */ + +#define XT_MAKE_LEAF_SIZE(x) ((x) + offsetof(XTIdxBranchDRec, tb_data)) + +#define XT_MAKE_NODE_SIZE(x) (((x) + offsetof(XTIdxBranchDRec, tb_data)) | XT_IS_NODE_BIT) + +#define XT_MAKE_BRANCH_SIZE(x, y) (((x) + offsetof(XTIdxBranchDRec, tb_data)) | ((y) ?
XT_IS_NODE_BIT : 0)) + +#define XT_GET_INDEX_BLOCK_LEN(x) ((x) & 0x7FFF) + +#define XT_GET_BRANCH_DATA_SIZE(x) (XT_GET_INDEX_BLOCK_LEN(x) - offsetof(XTIdxBranchDRec, tb_data)) + +typedef struct XTIndexHead { + XTDiskValue4 tp_format_offset_4; /* The offset of the format part of the header. */ + + XTDiskValue4 tp_header_size_4; /* The size of the header. */ + XTDiskValue6 tp_not_used_6; + + XTDiskValue6 tp_ind_eof_6; + XTDiskValue6 tp_ind_free_6; + + /* The index roots follow. Each is if_node_ref_size_1 bytes in size. */ + xtWord1 tp_data[XT_VAR_LENGTH]; +} XTIndexHeadDRec, *XTIndexHeadDPtr; + +typedef struct XTIndexFormat { + XTDiskValue4 if_format_size_4; /* The size of this structure (index format). */ + XTDiskValue2 if_tab_version_2; /* The table version number. */ + XTDiskValue2 if_ind_version_2; /* The index version number. */ + XTDiskValue1 if_node_ref_size_1; /* The size of index node references in the indexes (default 4 bytes). */ + XTDiskValue1 if_rec_ref_size_1; /* The size of record references in the indexes (default 4 bytes). */ + XTDiskValue4 if_page_size_4; +} XTIndexFormatDRec, *XTIndexFormatDPtr; + +typedef struct XTIdxBranch { + XTDiskValue2 tb_size_2; /* No of bytes used below. */ + + /* We need enough space for 2 buffers when splitting! */ + xtWord1 tb_data[XT_INDEX_PAGE_DATA_SIZE]; +} XTIdxBranchDRec, *XTIdxBranchDPtr; + +typedef struct XTIdxItem { + u_int i_total_size; /* Size of the data in the searched branch (excludes 2 byte header). */ + u_int i_item_size; /* Size of the item at this position. */ + u_int i_node_ref_size; + u_int i_item_offset; /* Item offset. */ +} XTIdxItemRec, *XTIdxItemPtr; + +typedef struct XTIdxResult { + xtBool sr_found; /* TRUE if the key was found. */ + xtBool sr_duplicate; /* TRUE if a duplicate was found. */ + xtRecordID sr_rec_id; /* Reference to the record of the found key. */ + xtRowID sr_row_id; + xtIndexNodeID sr_branch; /* Branch to follow when searching a node.
*/ + XTIdxItemRec sr_item; +} XTIdxResultRec, *XTIdxResultPtr; + +typedef struct XTIdxKeyValue { + int sv_flags; + xtRecordID sv_rec_id; + xtRowID sv_row_id; + u_int sv_length; + xtWord1 *sv_key; +} XTIdxKeyValueRec, *XTIdxKeyValuePtr; + +typedef struct XTIdxSearchKey { + xtBool sk_on_key; /* TRUE if we are positioned on the search key. */ + XTIdxKeyValueRec sk_key_value; /* The value of the search key. */ + xtWord1 sk_key_buf[XT_INDEX_MAX_KEY_SIZE]; +} XTIdxSearchKeyRec, *XTIdxSearchKeyPtr; + +typedef void (*XTScanBranchFunc)(struct XTTable *tab, struct XTIndex *ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result); +typedef void (*XTPrevItemFunc)(struct XTTable *tab, struct XTIndex *ind, XTIdxBranchDPtr branch, register XTIdxResultRec *result); +typedef void (*XTLastItemFunc)(struct XTTable *tab, struct XTIndex *ind, XTIdxBranchDPtr branch, register XTIdxResultRec *result); + +typedef int (*XTSimpleCompFunc)(struct XTIndex *ind, u_int key_length, xtWord1 *key_value, xtWord1 *b_value); + +struct charset_info_st; + +typedef struct XTIndexSeg /* Key-portion */ +{ + u_int col_idx; /* The table column index of this component. */ + u_int is_recs_in_range; /* Value returned by records_in_range(). */ + u_int is_selectivity; /* The number of unique values per mi_select_total. */ + xtWord1 type; /* Type of key (for sort) */ + xtWord1 language; + xtWord1 null_bit; /* bitmask to test for NULL */ + xtWord1 bit_start,bit_end; /* if bit field */ + xtWord1 bit_pos,bit_length; /* (not used in 4.1) */ + xtWord2 flag; + xtWord2 length; /* Keylength */ + xtWord4 start; /* Start of key in record */ + xtWord4 null_pos; /* position to NULL indicator */ + MX_CONST_CHARSET_INFO *charset; +} XTIndexSegRec, *XTIndexSegPtr; + +typedef struct XTIndFreeList { + struct XTIndFreeList *fl_next_list; /* List of free pages for this index. */ + u_int fl_start; /* Start for allocating from the front of the list. 
*/ + u_int fl_free_count; /* Total items in the free list. */ + xtIndexNodeID fl_page_id[XT_VAR_LENGTH]; /* List of page IDs of the free pages. */ +} XTIndFreeListRec, *XTIndFreeListPtr; + +/* + * XT_INDEX_USE_PTHREAD_RW: + * The standard pthread RW lock is currently the fastest for INSERTs + * in 32 threads on smalltab: runTest(SMALL_INSERT_TEST, 32, dbUrl) + */ +/* + * XT_INDEX_USE_RW_MUTEX: + * But the RW mutex is a close second, if not just as fast. + * If it is at least as fast, then it is better because read lock + * overhead is then zero. + * + * If definitely does get in the way of the + */ +/* XT_INDEX_USE_PTHREAD_RW: + * But this is clearly better on Linux. 216682 instead of 169259 + * payment transactions (DBT2 in non-conflict transactions, + * using only the customer table). + * + * 27.2.2009: + * The story continues. I have now fixed a bug in RW MUTEX that + * may have been slowing things down (see {RACE-WR_MUTEX}). + * + * So we will need to test "customer payment" again. + * + * 3.3.2009 + * Latest tests show that RW mutex is slightly faster: + * 127460 to 123574 payment transactions.
+ */ +#define XT_INDEX_USE_RW_MUTEX +//#define XT_INDEX_USE_PTHREAD_RW + +#ifdef XT_INDEX_USE_FASTWRLOCK +#define XT_INDEX_LOCK_TYPE XTFastRWLockRec +#define XT_INDEX_INIT_LOCK(s, i) xt_fastrwlock_init(s, &(i)->mi_rwlock) +#define XT_INDEX_FREE_LOCK(s, i) xt_fastrwlock_free(s, &(i)->mi_rwlock) +#define XT_INDEX_READ_LOCK(i, o) xt_fastrwlock_slock(&(i)->mi_rwlock, (o)->ot_thread) +#define XT_INDEX_WRITE_LOCK(i, o) xt_fastrwlock_xlock(&(i)->mi_rwlock, (o)->ot_thread) +#define XT_INDEX_UNLOCK(i, o) xt_fastrwlock_unlock(&(i)->mi_rwlock, (o)->ot_thread) +#define XT_INDEX_HAVE_XLOCK(i, o) TRUE +#elif defined(XT_INDEX_USE_PTHREAD_RW) +#define XT_INDEX_LOCK_TYPE xt_rwlock_type +#define XT_INDEX_INIT_LOCK(s, i) xt_init_rwlock_with_autoname(s, &(i)->mi_rwlock) +#define XT_INDEX_FREE_LOCK(s, i) xt_free_rwlock(&(i)->mi_rwlock) +#define XT_INDEX_READ_LOCK(i, o) xt_slock_rwlock_ns(&(i)->mi_rwlock) +#define XT_INDEX_WRITE_LOCK(i, o) xt_xlock_rwlock_ns(&(i)->mi_rwlock) +#define XT_INDEX_UNLOCK(i, o) xt_unlock_rwlock_ns(&(i)->mi_rwlock) +#define XT_INDEX_HAVE_XLOCK(i, o) TRUE +#else // XT_INDEX_USE_RW_MUTEX +#define XT_INDEX_LOCK_TYPE XTRWMutexRec +#define XT_INDEX_INIT_LOCK(s, i) xt_rwmutex_init_with_autoname(s, &(i)->mi_rwlock) +#define XT_INDEX_FREE_LOCK(s, i) xt_rwmutex_free(s, &(i)->mi_rwlock) +#define XT_INDEX_READ_LOCK(i, o) xt_rwmutex_slock(&(i)->mi_rwlock, (o)->ot_thread->t_id) +#define XT_INDEX_WRITE_LOCK(i, o) xt_rwmutex_xlock(&(i)->mi_rwlock, (o)->ot_thread->t_id) +#define XT_INDEX_UNLOCK(i, o) xt_rwmutex_unlock(&(i)->mi_rwlock, (o)->ot_thread->t_id) +#define XT_INDEX_HAVE_XLOCK(i, o) ((i)->mi_rwlock.xs_xlocker == (o)->ot_thread->t_id) +#endif + +/* The R/W lock on the index is used as follows: + * Read Lock - used for operations on the index that are not of a structural nature. + * This includes any read operation and update operations that change an index + * node. + * Write lock - used to change the structure of the index. This includes adding + * and deleting pages. 
+ */ +typedef struct XTIndex { + u_int mi_index_no; /* The index number (used by MySQL). */ + xt_mutex_type mi_flush_lock; /* Lock the index during flushing. */ + + /* Protected by the mi_rwlock lock: */ + XT_INDEX_LOCK_TYPE mi_rwlock; /* This lock protects the structure of the index. + * Read lock - structure may not change, but pages may change. + * Write lock - structure of index may be changed. + */ + xtIndexNodeID mi_root; /* The index root node. */ + XTIndFreeListPtr mi_free_list; /* List of free pages for this index. */ + + /* Protected by the mi_dirty_lock: */ + XTSpinLockRec mi_dirty_lock; /* Spin lock protecting the dirty & free lists. */ + struct XTIndBlock *mi_dirty_list; /* List of dirty pages for this index. */ + u_int mi_dirty_blocks; /* Count of the dirty blocks. */ + + /* Index constants: */ + u_int mi_flags; + u_int mi_key_size; + xtBool mi_low_byte_first; + xtBool mi_fix_key; + u_int mi_single_type; /* Used when the index contains a single field. */ + u_int mi_select_total; + XTScanBranchFunc mi_scan_branch; + XTPrevItemFunc mi_prev_item; + XTLastItemFunc mi_last_item; + XTSimpleCompFunc mi_simple_comp_key; + MY_BITMAP mi_col_map; /* Bit-map of columns in the index. */ + u_int mi_subset_of; /* Indicates if this index is a complete subset of some other index. */ + u_int mi_seg_count; + XTIndexSegRec mi_seg[200]; +} XTIndexRec, *XTIndexPtr; + +#define XT_INDEX_OK 0 +#define XT_INDEX_TOO_OLD 1 +#define XT_INDEX_TOO_NEW 2 +#define XT_INDEX_BAD_BLOCK 3 +#define XT_INDEX_CORRUPTED 4 +#define XT_INDEX_MISSING 5 + +typedef void (*XTFreeDicFunc)(struct XTThread *self, struct XTDictionary *dic); + +typedef struct XTDictionary { + XTDDTable *dic_table; /* XT table information. */ + + /* Table binary information. */ + u_int dic_buf_size; /* This is the size of the MySQL row. */ + u_int dic_rec_size; /* This is the size of the handle data file record. */ + xtBool dic_rec_fixed; /* TRUE if the record has a fixed length size.
*/ + u_int dic_tab_flags; /* Table flags XT_TAB_FLAGS_* */ + xtWord8 dic_min_auto_inc; /* The minimum auto-increment value. */ + xtWord8 dic_min_row_size; + xtWord8 dic_max_row_size; + xtWord8 dic_ave_row_size; + xtWord8 dic_def_ave_row_size; /* Defined row size set by the user. */ + u_int dic_no_of_cols; /* Number of columns. */ + u_int dic_fix_col_count; /* The number of columns always in the fixed part of an extended record. */ + u_int dic_ind_cols_req; /* The number of columns required to build all indexes. */ + xtWord8 dic_ind_rec_len; /* Length of the record part that is needed for all index columns! */ + + /* BLOB columns: */ + u_int dic_blob_cols_req; /* The number of columns required to load all LONGBLOB columns. */ + u_int dic_blob_count; + Field **dic_blob_cols; + + /* MySQL related information. NULL when no tables are open from MySQL side! */ + u_int dic_disable_index; /* Non-zero if the index cannot be used. */ + u_int dic_index_ver; /* The version of the index. */ + u_int dic_key_count; + XTIndexPtr *dic_keys; /* MySQL/PBXT key description */ + STRUCT_TABLE *dic_my_table; /* MySQL table */ +} XTDictionaryRec, *XTDictionaryPtr; + +#define XT_DT_LOG_HEAD 0 +#define XT_DT_INDEX_PAGE 1 +#define XT_DT_FREE_LIST 2 +#define XT_DT_HEADER 3 + +typedef struct XTIndLogHead { + xtWord1 ilh_data_type; /* XT_DT_LOG_HEAD */ + XTDiskValue4 ilh_tab_id_4; + XTDiskValue4 ilh_log_eof_4; /* The entire size of the log (0 if invalid!) */ +} XTIndLogHeadDRec, *XTIndLogHeadDPtr; + +typedef struct XTIndPageData { + xtWord1 ild_data_type; + XTDiskValue4 ild_page_id_4; + xtWord1 ild_data[XT_VAR_LENGTH]; +} XTIndPageDataDRec, *XTIndPageDataDPtr; + +typedef struct XTIndHeadData { + xtWord1 ilh_data_type; + XTDiskValue2 ilh_head_size_2; + xtWord1 ilh_data[XT_VAR_LENGTH]; +} XTIndHeadDataDRec, *XTIndHeadDataDPtr; + +typedef struct XTIndexLog { + struct XTIndexLogPool *il_pool; + struct XTIndexLog *il_next_in_pool; + + xtLogID il_log_id; /* The ID of the data log.
*/ + XTOpenFilePtr il_of; + size_t il_buffer_size; + xtWord1 *il_buffer; + + xtTableID il_tab_id; + off_t il_log_eof; + size_t il_buffer_len; + off_t il_buffer_offset; + + + void il_reset(xtTableID tab_id); + void il_close(xtBool delete_it); + void il_release(); + + xtBool il_write_byte(struct XTOpenTable *ot, xtWord1 val); + xtBool il_write_word4(struct XTOpenTable *ot, xtWord4 value); + xtBool il_write_block(struct XTOpenTable *ot, struct XTIndBlock *block); + xtBool il_write_free_list(struct XTOpenTable *ot, u_int free_count, XTIndFreeListPtr free_list); + xtBool il_require_space(size_t bytes, XTThreadPtr thread); + xtBool il_write_header(struct XTOpenTable *ot, size_t head_size, xtWord1 *head_data); + xtBool il_flush(struct XTOpenTable *ot); + xtBool il_apply_log(struct XTOpenTable *ot); + + xtBool il_open_table(struct XTOpenTable **ot); + void il_close_table(struct XTOpenTable *ot); +} XTIndexLogRec, *XTIndexLogPtr; + +typedef struct XTIndexLogPool { + struct XTDatabase *ilp_db; + size_t ilp_log_buffer_size; + u_int il_pool_count; + XTIndexLogPtr ilp_log_pool; + xt_mutex_type ilp_lock; /* The public pool lock. 
*/ + xtLogID ilp_next_log_id; + + void ilp_init(struct XTThread *self, struct XTDatabase *db, size_t log_buffer_size); + void ilp_close(struct XTThread *self, xtBool lock); + void ilp_exit(struct XTThread *self); + void ilp_name(size_t size, char *path, xtLogID log_id); + + xtBool ilp_open_log(XTIndexLogPtr *il, xtLogID log_id, xtBool excl, XTThreadPtr thread); + + xtBool ilp_get_log(XTIndexLogPtr *il, XTThreadPtr thread); + void ilp_release_log(XTIndexLogPtr il); +} XTIndexLogPoolRec, *XTIndexLogPoolPtr; + +/* A record reference consists of a record ID and a row ID: */ +inline void xt_get_record_ref(register xtWord1 *item, xtRecordID *rec_id, xtRowID *row_id) { + *rec_id = XT_GET_DISK_4(item); + item += 4; + *row_id = XT_GET_DISK_4(item); +} + +inline void xt_get_res_record_ref(register xtWord1 *item, register XTIdxResultRec *result) { + result->sr_rec_id = XT_GET_DISK_4(item); + item += 4; + result->sr_row_id = XT_GET_DISK_4(item); +} + +inline void xt_set_record_ref(register xtWord1 *item, xtRecordID rec_id, xtRowID row_id) { + XT_SET_DISK_4(item, rec_id); + item += 4; + XT_SET_DISK_4(item, row_id); +} + +inline void xt_set_val_record_ref(register xtWord1 *item, register XTIdxKeyValuePtr value) { + XT_SET_DISK_4(item, value->sv_rec_id); + item += 4; + XT_SET_DISK_4(item, value->sv_row_id); +} + +xtBool xt_idx_insert(struct XTOpenTable *ot, struct XTIndex *ind, xtRowID row_id, xtRecordID rec_id, xtWord1 *rec_buf, xtWord1 *bef_buf, xtBool allow_dups); +xtBool xt_idx_delete(struct XTOpenTable *ot, struct XTIndex *ind, xtRecordID rec_id, xtWord1 *rec_buf); +xtBool xt_idx_update_row_id(struct XTOpenTable *ot, struct XTIndex *ind, xtRecordID rec_id, xtRowID row_id, xtWord1 *rec_buf); +void xt_idx_prep_key(struct XTIndex *ind, register XTIdxSearchKeyPtr search_key, int flags, xtWord1 *in_key_buf, size_t in_key_length); +xtBool xt_idx_research(struct XTOpenTable *ot, struct XTIndex *ind); +xtBool xt_idx_search(struct XTOpenTable *ot, struct XTIndex *ind, register 
XTIdxSearchKeyPtr search_key); +xtBool xt_idx_search_prev(struct XTOpenTable *ot, struct XTIndex *ind, register XTIdxSearchKeyPtr search_key); +xtBool xt_idx_next(register struct XTOpenTable *ot, register struct XTIndex *ind, register XTIdxSearchKeyPtr search_key); +xtBool xt_idx_prev(register struct XTOpenTable *ot, register struct XTIndex *ind, register XTIdxSearchKeyPtr search_key); +xtBool xt_idx_read(struct XTOpenTable *ot, struct XTIndex *ind, xtWord1 *rec_buf); +void xt_ind_set_index_selectivity(XTThreadPtr self, struct XTOpenTable *ot); +void xt_check_indices(struct XTOpenTable *ot); +xtBool xt_flush_indices(struct XTOpenTable *ot, off_t *bytes_flushed, xtBool have_table_lock); +void xt_ind_track_dump_block(struct XTTable *tab, xtIndexNodeID address); + +#define XT_S_MODE_MATCH 0 +#define XT_S_MODE_NEXT 1 +#define XT_S_MODE_PREV 2 +xtBool xt_idx_match_search(struct XTOpenTable *ot, struct XTIndex *ind, register XTIdxSearchKeyPtr search_key, xtWord1 *buf, int mode); + +int xt_compare_2_int4(XTIndexPtr ind, uint key_length, xtWord1 *key_value, xtWord1 *b_value); +int xt_compare_3_int4(XTIndexPtr ind, uint key_length, xtWord1 *key_value, xtWord1 *b_value); +void xt_scan_branch_single(struct XTTable *tab, XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result); +void xt_scan_branch_fix(struct XTTable *tab, XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result); +void xt_scan_branch_fix_simple(struct XTTable *tab, XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result); +void xt_scan_branch_var(struct XTTable *tab, XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxKeyValuePtr value, register XTIdxResultRec *result); + +void xt_prev_branch_item_fix(struct XTTable *tab, XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultRec *result); +void xt_prev_branch_item_var(struct XTTable *tab, XTIndexPtr ind, 
XTIdxBranchDPtr branch, register XTIdxResultRec *result); + +void xt_last_branch_item_fix(struct XTTable *tab, XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultPtr result); +void xt_last_branch_item_var(struct XTTable *tab, XTIndexPtr ind, XTIdxBranchDPtr branch, register XTIdxResultPtr result); + +//#define TRACK_ACTIVITY +#ifdef TRACK_ACTIVITY + +#define TRACK_BLOCK_ALLOC(x) track_work(xt_ind_offset_to_node(tab, x), "A") +#define TRACK_BLOCK_FREE(x) track_work(xt_ind_offset_to_node(ot->ot_table, x), "-") +#define TRACK_BLOCK_SPLIT(x) track_work(xt_ind_offset_to_node(ot->ot_table, x), "/") +#define TRACK_BLOCK_WRITE(x) track_work(xt_ind_offset_to_node(ot->ot_table, x), "w") +#define TRACK_BLOCK_FLUSH_N(x) track_work(x, "F") +#define TRACK_BLOCK_TO_FLUSH(x) track_work(x, "f") + +xtPublic void track_work(u_int block, char *what); +#else + +#define TRACK_BLOCK_ALLOC(x) +#define TRACK_BLOCK_FREE(x) +#define TRACK_BLOCK_SPLIT(x) +#define TRACK_BLOCK_WRITE(x) +#define TRACK_BLOCK_FLUSH_N(x) +#define TRACK_BLOCK_TO_FLUSH(x) + +#endif + +#endif + diff --git a/storage/pbxt/src/linklist_xt.cc b/storage/pbxt/src/linklist_xt.cc new file mode 100644 index 00000000000..de5fc6170ce --- /dev/null +++ b/storage/pbxt/src/linklist_xt.cc @@ -0,0 +1,224 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-03 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include "pthread_xt.h" +#include "linklist_xt.h" +#include "thread_xt.h" +#include "memory_xt.h" + +xtPublic XTLinkedListPtr xt_new_linkedlist(struct XTThread *self, void *thunk, XTFreeFunc free_func, xtBool with_lock) +{ + XTLinkedListPtr ll; + + ll = (XTLinkedListPtr) xt_calloc(self, sizeof(XTLinkedListRec)); + try_(a) { + if (with_lock) { + ll->ll_lock = (xt_mutex_type *) xt_calloc(self, sizeof(xt_mutex_type)); + try_(b) { + xt_init_mutex_with_autoname(self, ll->ll_lock); + } + catch_(b) { + xt_free(self, ll->ll_lock); + ll->ll_lock = NULL; + throw_(); + } + cont_(b); + ll->ll_cond = (xt_cond_type *) xt_calloc(self, sizeof(xt_cond_type)); + try_(c) { + xt_init_cond(self, ll->ll_cond); + } + catch_(c) { + xt_free(self, ll->ll_cond); + ll->ll_cond = NULL; + throw_(); + } + cont_(c); + } + ll->ll_thunk = thunk; + ll->ll_free_func = free_func; + } + catch_(a) { + xt_free_linkedlist(self, ll); + throw_(); + } + cont_(a); + return ll; +} + +xtPublic void xt_free_linkedlist(XTThreadPtr self, XTLinkedListPtr ll) +{ + if (ll->ll_lock) + xt_lock_mutex(self, ll->ll_lock); + while (ll->ll_items) + xt_ll_remove(self, ll, ll->ll_items, FALSE); + if (ll->ll_lock) + xt_unlock_mutex(self, ll->ll_lock); + if (ll->ll_lock) { + xt_free_mutex(ll->ll_lock); + xt_free(self, ll->ll_lock); + } + if (ll->ll_cond) { + xt_free_cond(ll->ll_cond); + xt_free(self, ll->ll_cond); + } + xt_free(self, ll); +} + +xtPublic void xt_ll_add(XTThreadPtr self, XTLinkedListPtr ll, XTLinkedItemPtr li, xtBool lock) +{ + if (lock && ll->ll_lock) + xt_lock_mutex(self, ll->ll_lock); + li->li_next = ll->ll_items; + li->li_prev = NULL; + if (ll->ll_items) + ll->ll_items->li_prev = li; + ll->ll_items = li; + 
ll->ll_item_count++; + if (lock && ll->ll_lock) + xt_unlock_mutex(self, ll->ll_lock); +} + +xtPublic XTLinkedItemPtr xt_ll_first_item(XTThreadPtr XT_UNUSED(self), XTLinkedListPtr ll) +{ + return ll ? ll->ll_items : NULL; +} + +xtPublic XTLinkedItemPtr xt_ll_next_item(XTThreadPtr XT_UNUSED(self), XTLinkedItemPtr item) +{ + return item->li_next; +} + +xtPublic xtBool xt_ll_exists(XTThreadPtr self, XTLinkedListPtr ll, XTLinkedItemPtr li, xtBool lock) +{ + XTLinkedItemPtr ptr; + + if (lock && ll->ll_lock) + xt_lock_mutex(self, ll->ll_lock); + + ptr = ll->ll_items; + + for (ptr = ll->ll_items; ptr && (ptr != li); ptr = ptr->li_next){} + + if (lock && ll->ll_lock) + xt_unlock_mutex(self, ll->ll_lock); + + return (ptr == li); +} + +xtPublic void xt_ll_remove(XTThreadPtr self, XTLinkedListPtr ll, XTLinkedItemPtr li, xtBool lock) +{ + if (lock && ll->ll_lock) + xt_lock_mutex(self, ll->ll_lock); + + /* Move front pointer: */ + if (ll->ll_items == li) + ll->ll_items = li->li_next; + + /* Remove from list: */ + if (li->li_prev) + li->li_prev->li_next = li->li_next; + if (li->li_next) + li->li_next->li_prev = li->li_prev; + + ll->ll_item_count--; + if (ll->ll_free_func) + (*ll->ll_free_func)(self, ll->ll_thunk, li); + + /* Signal one less: */ + if (ll->ll_cond) + xt_signal_cond(self, ll->ll_cond); + + if (lock && ll->ll_lock) + xt_unlock_mutex(self, ll->ll_lock); +} + +xtPublic void xt_ll_lock(XTThreadPtr self, XTLinkedListPtr ll) +{ + if (ll->ll_lock) + xt_lock_mutex(self, ll->ll_lock); +} + +xtPublic void xt_ll_unlock(XTThreadPtr self, XTLinkedListPtr ll) +{ + if (ll->ll_lock) + xt_unlock_mutex(self, ll->ll_lock); +} + +xtPublic void xt_ll_wait_till_empty(XTThreadPtr self, XTLinkedListPtr ll) +{ + xt_lock_mutex(self, ll->ll_lock); + pushr_(xt_unlock_mutex, ll->ll_lock); + for (;;) { + if (ll->ll_item_count == 0) + break; + xt_wait_cond(self, ll->ll_cond, ll->ll_lock); + } + freer_(); // xt_unlock_mutex(ll->ll_lock) +} + +xtPublic u_int xt_ll_get_size(XTLinkedListPtr ll) +{ + 
return ll->ll_item_count; +} + +xtPublic void xt_init_linkedqueue(XTThreadPtr XT_UNUSED(self), XTLinkedQueuePtr lq) +{ + lq->lq_count = 0; + lq->lq_front = NULL; + lq->lq_back = NULL; +} + +xtPublic void xt_exit_linkedqueue(XTThreadPtr XT_UNUSED(self), XTLinkedQueuePtr lq) +{ + lq->lq_count = 0; + lq->lq_front = NULL; + lq->lq_back = NULL; +} + +xtPublic void xt_lq_add(XTThreadPtr XT_UNUSED(self), XTLinkedQueuePtr lq, XTLinkedQItemPtr qi) +{ + lq->lq_count++; + qi->qi_next = NULL; + if (!lq->lq_front) + lq->lq_front = qi; + if (lq->lq_back) + lq->lq_back->qi_next = qi; + lq->lq_back = qi; +} + +xtPublic XTLinkedQItemPtr xt_lq_remove(XTThreadPtr XT_UNUSED(self), XTLinkedQueuePtr lq) +{ + XTLinkedQItemPtr qi = NULL; + + if (lq->lq_front) { + qi = lq->lq_front; + lq->lq_front = qi->qi_next; + if (!lq->lq_front) + lq->lq_back = NULL; + qi->qi_next = NULL; + } + return qi; +} + diff --git a/storage/pbxt/src/linklist_xt.h b/storage/pbxt/src/linklist_xt.h new file mode 100644 index 00000000000..1e33f71a421 --- /dev/null +++ b/storage/pbxt/src/linklist_xt.h @@ -0,0 +1,77 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-03 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_linklist_h__ +#define __xt_linklist_h__ + +#include "xt_defs.h" + +struct XTThread; + +typedef struct XTLinkedItem { + struct XTLinkedItem *li_prev; + struct XTLinkedItem *li_next; +} XTLinkedItemRec, *XTLinkedItemPtr; + +typedef struct XTLinkedList { + xt_mutex_type *ll_lock; + xt_cond_type *ll_cond; /* Condition for wait for empty. */ + void *ll_thunk; + XTFreeFunc ll_free_func; + u_int ll_item_count; + XTLinkedItemPtr ll_items; +} XTLinkedListRec, *XTLinkedListPtr; + +XTLinkedListPtr xt_new_linkedlist(struct XTThread *self, void *thunk, XTFreeFunc free_func, xtBool with_lock); +void xt_free_linkedlist(struct XTThread *self, XTLinkedListPtr ll); + +void xt_ll_add(struct XTThread *self, XTLinkedListPtr ll, XTLinkedItemPtr li, xtBool lock); +void xt_ll_remove(struct XTThread *self, XTLinkedListPtr ll, XTLinkedItemPtr li, xtBool lock); +xtBool xt_ll_exists(struct XTThread *self, XTLinkedListPtr ll, XTLinkedItemPtr li, xtBool lock); + +void xt_ll_lock(struct XTThread *self, XTLinkedListPtr ll); +void xt_ll_unlock(struct XTThread *self, XTLinkedListPtr ll); + +void xt_ll_wait_till_empty(struct XTThread *self, XTLinkedListPtr ll); + +XTLinkedItemPtr xt_ll_first_item(struct XTThread *self, XTLinkedListPtr ll); +XTLinkedItemPtr xt_ll_next_item(struct XTThread *self, XTLinkedItemPtr item); +u_int xt_ll_get_size(XTLinkedListPtr ll); + +typedef struct XTLinkedQItem { + struct XTLinkedQItem *qi_next; +} XTLinkedQItemRec, *XTLinkedQItemPtr; + +typedef struct XTLinkedQueue { + size_t lq_count; + XTLinkedQItemPtr lq_front; + XTLinkedQItemPtr lq_back; +} XTLinkedQueueRec, *XTLinkedQueuePtr; + +void xt_init_linkedqueue(struct XTThread *self, XTLinkedQueuePtr lq); +void xt_exit_linkedqueue(struct 
XTThread *self, XTLinkedQueuePtr lq); + +void xt_lq_add(struct XTThread *self, XTLinkedQueuePtr lq, XTLinkedQItemPtr qi); +XTLinkedQItemPtr xt_lq_remove(struct XTThread *self, XTLinkedQueuePtr lq); + +#endif + diff --git a/storage/pbxt/src/lock_xt.cc b/storage/pbxt/src/lock_xt.cc new file mode 100644 index 00000000000..42f1df18276 --- /dev/null +++ b/storage/pbxt/src/lock_xt.cc @@ -0,0 +1,2478 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2008-01-24 Paul McCullagh + * + * Row lock functions. + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <stdio.h> + +#include "lock_xt.h" +#include "thread_xt.h" +#include "table_xt.h" +#include "xaction_xt.h" +#include "database_xt.h" +#include "trace_xt.h" + +#ifdef DEBUG +//#define XT_TRACE_LOCKS +//#define CHECK_ROWLOCK_GROUP_CONSISTENCY +#endif + +/* + * ----------------------------------------------------------------------- + * ROW LOCKS, LIST BASED + */ +#ifdef XT_USE_LIST_BASED_ROW_LOCKS + +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY +/* + * Requires a spin-lock on group->lg_lock! 
+ */ +static void check_rowlock_group(XTLockGroupPtr group) +{ + XTThreadPtr self = xt_get_self(); + + char *crash = NULL; + + if (group->lg_lock.spl_locker != self) + *crash = 1; + + if (group->lg_list_in_use > group->lg_list_size) + *crash = 1; + + xtRowID prev_row = 0; + XTLockItemPtr item = group->lg_list; + + for (int i = 0; i < group->lg_list_in_use; i++, item++) { + + if (!item->li_thread_id) + *crash = 1; + + if(!xt_thr_array[item->li_thread_id]->st_xact_data) + *crash = 1; + + if(item->li_count > XT_TEMP_LOCK_BYTES) + *crash = 1; + + // rows per thread must obey the row_id > prev_row_id + prev_count*group_size rule + if (prev_row >= item->li_row_id) + *crash = 1; + + // calculate the new prev. row + if (item->li_count < XT_TEMP_LOCK_BYTES) + prev_row = item->li_row_id + (item->li_count - 1) * XT_ROW_LOCK_GROUP_COUNT; + else + prev_row = item->li_row_id; + } +} +#endif + +static int xlock_cmp_row_ids(XTThreadPtr XT_UNUSED(self), register const void *XT_UNUSED(thunk), register const void *a, register const void *b) +{ + xtRowID row_id = *((xtTableID *) a); + XTLockItemPtr item = (XTLockItemPtr) b; + + if (row_id < item->li_row_id) + return -1; + if (row_id > item->li_row_id) + return 1; + return 0; +} + +void XTRowLockList::xt_remove_all_locks(struct XTDatabase *, XTThreadPtr thread) +{ +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "remove all locks\n"); +#endif + if (!bl_count) + return; + + xtThreadID thd_id; + XTPermRowLockPtr plock; +#ifndef XT_USE_TABLE_REF + XTOpenTablePtr pot = NULL; +#endif + + thd_id = thread->t_id; + plock = (XTPermRowLockPtr) bl_data; + for (u_int i=0; i<bl_count; i++) { +#ifdef XT_USE_TABLE_REF + XTTableHPtr tab = plock->pr_table; +#else + if (!xt_db_open_pool_table_ns(&pot, db, plock->pr_tab_id)) { + /* Should not happen, but just in case, we just don't + * remove the lock. We will probably end up with a deadlock + * somewhere. 
+ */ + xt_log_and_clear_exception_ns(); + } + else { +#endif + for (int j=0; j<XT_ROW_LOCK_GROUP_COUNT; j++) { + if (plock->pr_group[j]) { + /* Go through group j and compact. */ +#ifndef XT_USE_TABLE_REF + XTTableHPtr tab = pot->ot_table; +#endif + XTLockGroupPtr group; + XTLockItemPtr copy; + XTLockItemPtr item; + int new_count; + + group = &tab->tab_locks.rl_groups[j]; + xt_spinlock_lock(&group->lg_lock); + copy = group->lg_list; + item = group->lg_list; + new_count = 0; + for (size_t k=0; k<group->lg_list_in_use; k++) { + if (item->li_thread_id != thd_id) { + if (copy != item) { + copy->li_row_id = item->li_row_id; + copy->li_count = item->li_count; + copy->li_thread_id = item->li_thread_id; + } + new_count++; + copy++; + } +#ifdef XT_TRACE_LOCKS + else { + if (item->li_count == XT_TEMP_LOCK_BYTES) + xt_ttracef(xt_get_self(), "remove group %d lock row_id=%d TEMP\n", j, (int) item->li_row_id); + else + xt_ttracef(xt_get_self(), "remove group %d locks row_id=%d (%d)\n", j, (int) item->li_row_id, (int) item->li_count); + } +#endif + item++; + } + group->lg_list_in_use = new_count; +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY + check_rowlock_group(group); +#endif + if (group->lg_wait_queue) + tab->tab_locks.rl_grant_locks(group, thread); + + xt_spinlock_unlock(&group->lg_lock); + + xt_xn_wakeup_thread_list(thread); + } + } +#ifdef XT_USE_TABLE_REF + xt_heap_release(NULL, plock->pr_table); +#else + xt_db_return_table_to_pool_ns(pot); + } +#endif + plock++; + } + bl_count = 0; +} + +#ifdef DEBUG_LOCK_QUEUE +int *dummy_ptr = 0; + +void XTRowLocks::rl_check(XTLockWaitPtr no_lw) +{ + XTLockGroupPtr group; + XTLockWaitPtr lw, lw_prev; + + for (int i=0; i<XT_ROW_LOCK_GROUP_COUNT; i++) { + group = &rl_groups[i]; + xt_spinlock_lock(&group->lg_lock); + + lw = group->lg_wait_queue; + lw_prev = NULL; + while (lw) { + if (lw == no_lw) + *dummy_ptr = 1; + if (lw->lw_prev != lw_prev) + *dummy_ptr = 2; + lw_prev = lw; + lw = lw->lw_next; + } + xt_spinlock_unlock(&group->lg_lock); + } 
+} +#endif + +xtBool XTRowLocks::rl_lock_row(XTLockGroupPtr group, XTLockWaitPtr lw, XTRowLockListPtr, int *result) +{ + XTLockItemPtr item; + size_t index; + xtRowID row_id = lw->lw_row_id; + +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY + check_rowlock_group(group); +#endif + if (group->lg_list_size == group->lg_list_in_use) { + if (!xt_realloc_ns((void **) &group->lg_list, (group->lg_list_size + 2) * sizeof(XTLockItemRec))) + return FAILED; + group->lg_list_size += 2; + } + item = (XTLockItemPtr) xt_bsearch(NULL, &row_id, group->lg_list, group->lg_list_in_use, sizeof(XTLockItemRec), &index, NULL, xlock_cmp_row_ids); + + /* There's no item with this ID, but there could be an item with a range that covers this row */ + if (!item && group->lg_list_in_use) { + if (index > 0) { + int count; + + item = group->lg_list + index - 1; + + count = item->li_count; + if (item->li_count == XT_TEMP_LOCK_BYTES) + count = 1; + + if (row_id >= item->li_row_id + count * XT_ROW_LOCK_GROUP_COUNT) + item = NULL; + } + } + + if (item) { + /* Item already exists. */ + if (item->li_thread_id == lw->lw_thread->t_id) { + /* Already have a permanent lock: */ + *result = XT_NO_LOCK; + lw->lw_curr_lock = XT_NO_LOCK; + return OK; + } + /* {REMOVE-LOCKS} + * This must be valid because a thread must remove + * the locks before it frees its st_xact_data structure, + * xt_thr_array entry must also be valid, because + * transaction must be ended before the thread is + * killed. + */ + *result = item->li_count == XT_TEMP_LOCK_BYTES ? 
XT_TEMP_LOCK : XT_PERM_LOCK; + lw->lw_xn_id = xt_thr_array[item->li_thread_id]->st_xact_data->xd_start_xn_id; + lw->lw_curr_lock = *result; + return OK; + } + + /* Add the lock: */ + XT_MEMMOVE(group->lg_list, &group->lg_list[index+1], + &group->lg_list[index], (group->lg_list_in_use - index) * sizeof(XTLockItemRec)); + group->lg_list[index].li_row_id = row_id; + group->lg_list[index].li_count = XT_TEMP_LOCK_BYTES; + group->lg_list[index].li_thread_id = lw->lw_thread->t_id; + group->lg_list_in_use++; + +#ifdef XT_TRACE_LOCKS + xt_ttracef(ot->ot_thread, "set temp lock row=%d setby=%s\n", (int) row_id, xt_get_self()->t_name); +#endif +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY + check_rowlock_group(group); +#endif + *result = XT_NO_LOCK; + lw->lw_ot->ot_temp_row_lock = row_id; + lw->lw_curr_lock = XT_NO_LOCK; + return OK; +} + +void XTRowLocks::rl_grant_locks(XTLockGroupPtr group, XTThreadPtr thread) +{ + XTLockWaitPtr lw, lw_next, lw_prev; + int result; + xtThreadID lw_thd_id; + + thread->st_thread_list_count = 0; + lw = group->lg_wait_queue; + while (lw) { + lw_next = lw->lw_next; + lw_prev = lw->lw_prev; + lw_thd_id = lw->lw_thread->t_id; + /* NOTE: after lw_curr_lock is changed, lw may no longer be referenced + * by this function!!! + */ + if (!rl_lock_row(group, lw, &lw->lw_thread->st_lock_list, &result)) { + /* We transfer the error to the other thread! 
*/ + XTThreadPtr self = xt_get_self(); + + result = XT_LOCK_ERR; + memcpy(&lw->lw_thread->t_exception, &self->t_exception, sizeof(XTExceptionRec)); + lw->lw_curr_lock = XT_LOCK_ERR; + } + if (result == XT_NO_LOCK || result == XT_LOCK_ERR) { + /* Remove from the wait queue: */ + if (lw_next) + lw_next->lw_prev = lw_prev; + if (lw_prev) + lw_prev->lw_next = lw_next; + if (group->lg_wait_queue == lw) + group->lg_wait_queue = lw_next; + if (group->lg_wait_queue_end == lw) + group->lg_wait_queue_end = lw_prev; + if (result == XT_NO_LOCK) { + /* Add to the thread list: */ + if (thread->st_thread_list_count == thread->st_thread_list_size) { + if (!xt_realloc_ns((void **) &thread->st_thread_list, (thread->st_thread_list_size+1) * sizeof(xtThreadID))) { + xt_xn_wakeup_thread(lw_thd_id); + goto done; + } + thread->st_thread_list_size++; + } + thread->st_thread_list[thread->st_thread_list_count] = lw_thd_id; + thread->st_thread_list_count++; + done:; + } + } + lw = lw_next; + } +} + +//#define QUEUE_ORDER_FIFO + +/* Try to lock a row. + * This function returns: + * XT_NO_LOCK on success. + * XT_TEMP_LOCK if there is a temporary lock on the row. + * XT_PERM_LOCK if there is a permanent lock on the row. + * XT_FAILED if an error occurred. + * + * If there is a lock on this row, the transaction ID of the + * locker is also returned. + * + * The caller must wait if the row is locked. If the lock is + * permanent, then the caller must wait for the transaction to + * terminate. If the lock is temporary, then the caller must + * wait for the transaction to signal that the lock has been + * released.
+ */ +xtBool XTRowLocks::xt_set_temp_lock(XTOpenTablePtr ot, XTLockWaitPtr lw, XTRowLockListPtr lock_list) +{ + XTLockGroupPtr group; + int result; + + if (ot->ot_temp_row_lock) { + /* Check if we don't already have this temp lock: */ + if (ot->ot_temp_row_lock == lw->lw_row_id) { + lw->lw_curr_lock = XT_NO_LOCK; + return OK; + } + + xt_make_lock_permanent(ot, lock_list); + } + + /* Add a temporary lock. */ + group = &rl_groups[lw->lw_row_id % XT_ROW_LOCK_GROUP_COUNT]; + xt_spinlock_lock(&group->lg_lock); + + if (!rl_lock_row(group, lw, lock_list, &result)) { + xt_spinlock_unlock(&group->lg_lock); + return FAILED; + } + + if (result != XT_NO_LOCK) { + /* Add the thread to the end of the thread queue: */ +#ifdef QUEUE_ORDER_FIFO + if (group->lg_wait_queue_end) { + group->lg_wait_queue_end->lw_next = lw; + lw->lw_prev = group->lg_wait_queue_end; + } + else { + group->lg_wait_queue = lw; + lw->lw_prev = NULL; + } + lw->lw_next = NULL; + group->lg_wait_queue_end = lw; +#else + XTLockWaitPtr pos = group->lg_wait_queue_end; + xtXactID xn_id = ot->ot_thread->st_xact_data->xd_start_xn_id; + + while (pos) { + if (pos->lw_thread->st_xact_data->xd_start_xn_id < xn_id) + break; + pos = pos->lw_prev; + } + if (pos) { + lw->lw_prev = pos; + lw->lw_next = pos->lw_next; + if (pos->lw_next) + pos->lw_next->lw_prev = lw; + else + group->lg_wait_queue_end = lw; + pos->lw_next = lw; + } + else { + /* Front of the queue: */ + lw->lw_prev = NULL; + lw->lw_next = group->lg_wait_queue; + if (group->lg_wait_queue) + group->lg_wait_queue->lw_prev = lw; + else + group->lg_wait_queue_end = lw; + group->lg_wait_queue = lw; + } +#endif + } + + xt_spinlock_unlock(&group->lg_lock); + return OK; +} + +/* + * Remove a temporary lock. + * + * If updated is set to TRUE this means that the row was updated. + * This means that any thread waiting on the temporary lock will + * also have to wait for the transaction to quit before + * continuing.
+ * + * If the thread were to continue it would just hang again because + * it will discover that the transaction has updated the row. + * + * So the 'updated' flag is an optimisation which prevents the + * thread from making an unnecessary retry. + */ +void XTRowLocks::xt_remove_temp_lock(XTOpenTablePtr ot, xtBool updated) +{ + xtRowID row_id; + XTLockGroupPtr group; + XTLockItemPtr item; + size_t index; + xtBool lock_granted = FALSE; + xtThreadID locking_thread_id = 0; + + if (!(row_id = ot->ot_temp_row_lock)) + return; + + group = &rl_groups[row_id % XT_ROW_LOCK_GROUP_COUNT]; + xt_spinlock_lock(&group->lg_lock); +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY + check_rowlock_group(group); +#endif + +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "remove temp lock %d\n", (int) row_id); +#endif + item = (XTLockItemPtr) xt_bsearch(NULL, &row_id, group->lg_list, group->lg_list_in_use, sizeof(XTLockItemRec), &index, NULL, xlock_cmp_row_ids); + if (item) { + /* Item exists. */ + if (item->li_thread_id == ot->ot_thread->t_id && + item->li_count == XT_TEMP_LOCK_BYTES) { + XTLockWaitPtr lw; + + /* First check if there is some thread waiting to take over this lock: */ + lw = group->lg_wait_queue; + while (lw) { + if (lw->lw_row_id == row_id) { + lock_granted = TRUE; + break; + } + lw = lw->lw_next; + } + + if (lock_granted) { + /* Grant the lock just released... */ + XTLockWaitPtr lw_next, lw_prev; + xtXactID locking_xact_id; + + /* Store this info, lw will soon be untouchable! */ + lw_next = lw->lw_next; + lw_prev = lw->lw_prev; + locking_xact_id = lw->lw_thread->st_xact_data->xd_start_xn_id; + locking_thread_id = lw->lw_thread->t_id; + + /* Lock has moved from one thread to the next.
+ * change the thread holding this lock: + */ + item->li_thread_id = locking_thread_id; + + /* Remove from the wait queue: */ + if (lw_next) + lw_next->lw_prev = lw_prev; + if (lw_prev) + lw_prev->lw_next = lw_next; + if (group->lg_wait_queue == lw) + group->lg_wait_queue = lw_next; + if (group->lg_wait_queue_end == lw) + group->lg_wait_queue_end = lw_prev; + + /* If the thread that released the lock updated the + * row then we will have to wait for the transaction + * to terminate: + */ + if (updated) { + lw->lw_row_updated = TRUE; + lw->lw_updating_xn_id = ot->ot_thread->st_xact_data->xd_start_xn_id; + } + + /* The thread has the lock now: */ + lw->lw_ot->ot_temp_row_lock = row_id; + lw->lw_curr_lock = XT_NO_LOCK; + + /* Everyone after this that is waiting for the same lock is + * now waiting for a different transaction: + */ + lw = lw_next; + while (lw) { + if (lw->lw_row_id == row_id) { + lw->lw_xn_id = locking_xact_id; + lw->lw_curr_lock = XT_TEMP_LOCK; + } + lw = lw->lw_next; + } + } + else { + /* Remove the lock: */ + XT_MEMMOVE(group->lg_list, &group->lg_list[index], + &group->lg_list[index+1], (group->lg_list_in_use - index - 1) * sizeof(XTLockItemRec)); + group->lg_list_in_use--; + } + } + } +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY + check_rowlock_group(group); +#endif + xt_spinlock_unlock(&group->lg_lock); + + ot->ot_temp_row_lock = 0; + if (lock_granted) + xt_xn_wakeup_thread(locking_thread_id); +} + +xtBool XTRowLocks::xt_make_lock_permanent(XTOpenTablePtr ot, XTRowLockListPtr lock_list) +{ + xtRowID row_id; + XTLockGroupPtr group; + XTLockItemPtr item; + size_t index; + + if (!(row_id = ot->ot_temp_row_lock)) + return OK; + +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "make lock perm %d\n", (int) ot->ot_temp_row_lock); +#endif + + /* Add to the lock list: */ + XTPermRowLockPtr locks = (XTPermRowLockPtr) lock_list->bl_data; + for (unsigned i=0; i<lock_list->bl_count; i++) { +#ifdef XT_USE_TABLE_REF + if (locks->pr_table == ot->ot_table) { +#else +
if (locks->pr_tab_id == ot->ot_table->tab_id) { +#endif + locks->pr_group[row_id % XT_ROW_LOCK_GROUP_COUNT] = 1; + goto done; + } + locks++; + } + + /* Add new to lock list: */ + { + XTPermRowLockRec perm_lock; + +#ifdef XT_USE_TABLE_REF + perm_lock.pr_table = ot->ot_table; + xt_heap_reference(NULL, perm_lock.pr_table); +#else + perm_lock.pr_tab_id = ot->ot_table->tab_id; +#endif + memset(perm_lock.pr_group, 0, XT_ROW_LOCK_GROUP_COUNT); + perm_lock.pr_group[row_id % XT_ROW_LOCK_GROUP_COUNT] = 1; + if (!xt_bl_append(NULL, lock_list, &perm_lock)) { + xt_remove_temp_lock(ot, FALSE); + return FAILED; + } + } + + done: + group = &rl_groups[row_id % XT_ROW_LOCK_GROUP_COUNT]; + xt_spinlock_lock(&group->lg_lock); + + item = (XTLockItemPtr) xt_bsearch(NULL, &row_id, group->lg_list, group->lg_list_in_use, sizeof(XTLockItemRec), &index, NULL, xlock_cmp_row_ids); + ASSERT_NS(item); +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY + check_rowlock_group(group); +#endif + if (item) { + /* Lock exists (it should!). 
*/ + if (item->li_thread_id == ot->ot_thread->t_id && + item->li_count == XT_TEMP_LOCK_BYTES) { + if (index > 0 && + group->lg_list[index-1].li_thread_id == ot->ot_thread->t_id && + group->lg_list[index-1].li_count < XT_TEMP_LOCK_BYTES-2 && + group->lg_list[index-1].li_row_id == row_id - (XT_ROW_LOCK_GROUP_COUNT * group->lg_list[index-1].li_count)) { + group->lg_list[index-1].li_count++; + /* Combine with the left: */ + if (index + 1 < group->lg_list_in_use && + group->lg_list[index+1].li_thread_id == ot->ot_thread->t_id && + group->lg_list[index+1].li_count != XT_TEMP_LOCK_BYTES && + group->lg_list[index+1].li_row_id == row_id + XT_ROW_LOCK_GROUP_COUNT) { + /* And combine with the right */ + u_int left = group->lg_list[index-1].li_count + group->lg_list[index+1].li_count; + u_int right; + + if (left > XT_TEMP_LOCK_BYTES-1) { + right = left - (XT_TEMP_LOCK_BYTES-1); + left = XT_TEMP_LOCK_BYTES-1; + } + else + right = 0; + + group->lg_list[index-1].li_count = left; + if (right) { + /* There is something left over on the right: */ + group->lg_list[index+1].li_count = right; + group->lg_list[index+1].li_row_id = group->lg_list[index-1].li_row_id + left * XT_ROW_LOCK_GROUP_COUNT; + XT_MEMMOVE(group->lg_list, &group->lg_list[index], + &group->lg_list[index+1], (group->lg_list_in_use - index - 1) * sizeof(XTLockItemRec)); + group->lg_list_in_use--; + } + else { + XT_MEMMOVE(group->lg_list, &group->lg_list[index], + &group->lg_list[index+2], (group->lg_list_in_use - index - 2) * sizeof(XTLockItemRec)); + group->lg_list_in_use -= 2; + } + } + else { + XT_MEMMOVE(group->lg_list, &group->lg_list[index], + &group->lg_list[index+1], (group->lg_list_in_use - index - 1) * sizeof(XTLockItemRec)); + group->lg_list_in_use--; + } + } + else if (index + 1 < group->lg_list_in_use && + group->lg_list[index+1].li_thread_id == ot->ot_thread->t_id && + group->lg_list[index+1].li_count < XT_TEMP_LOCK_BYTES-2 && + group->lg_list[index+1].li_row_id == row_id + XT_ROW_LOCK_GROUP_COUNT) { + /* 
Combine with the right: */ + group->lg_list[index+1].li_count++; + group->lg_list[index+1].li_row_id = row_id; + XT_MEMMOVE(group->lg_list, &group->lg_list[index], + &group->lg_list[index+1], (group->lg_list_in_use - index - 1) * sizeof(XTLockItemRec)); + group->lg_list_in_use--; + } + else + group->lg_list[index].li_count = 1; + } + } +#ifdef CHECK_ROWLOCK_GROUP_CONSISTENCY + check_rowlock_group(group); +#endif + xt_spinlock_unlock(&group->lg_lock); + + ot->ot_temp_row_lock = 0; + return OK; +} + +xtBool xt_init_row_locks(XTRowLocksPtr rl) +{ + for (int i=0; i<XT_ROW_LOCK_GROUP_COUNT; i++) { + xt_spinlock_init_with_autoname(NULL, &rl->rl_groups[i].lg_lock); + rl->rl_groups[i].lg_wait_queue = NULL; + rl->rl_groups[i].lg_list_size = 0; + rl->rl_groups[i].lg_list_in_use = 0; + rl->rl_groups[i].lg_list = NULL; + } + return OK; +} + +void xt_exit_row_locks(XTRowLocksPtr rl __attribute__((unused))) +{ + for (int i=0; i<XT_ROW_LOCK_GROUP_COUNT; i++) { + xt_spinlock_free(NULL, &rl->rl_groups[i].lg_lock); + rl->rl_groups[i].lg_wait_queue = NULL; + rl->rl_groups[i].lg_list_size = 0; + rl->rl_groups[i].lg_list_in_use = 0; + if (rl->rl_groups[i].lg_list) { + xt_free_ns(rl->rl_groups[i].lg_list); + rl->rl_groups[i].lg_list = NULL; + } + } +} + +/* + * ----------------------------------------------------------------------- + * ROW LOCKS, HASH BASED + */ +#else // XT_USE_LIST_BASED_ROW_LOCKS + +void XTRowLockList::old_xt_remove_all_locks(struct XTDatabase *db, xtThreadID thd_id) +{ +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "remove all locks\n"); +#endif + if (!bl_count) + return; + + int pgroup; + xtTableID ptab_id; + XTPermRowLockPtr plock; + XTOpenTablePtr pot = NULL; + + plock = (XTPermRowLockPtr) &bl_data[bl_count * bl_item_size]; + for (u_int i=0; i<bl_count; i++) { + plock--; + pgroup = plock->pr_group; + ptab_id = plock->pr_tab_id; + if (pot) { + if (pot->ot_table->tab_id == ptab_id) + goto remove_lock; + xt_db_return_table_to_pool_ns(pot); + pot = NULL; + } + + 
if (!xt_db_open_pool_table_ns(&pot, db, ptab_id)) { + /* Should not happen, but just in case, we just don't + * remove the lock. We will probably end up with a deadlock + * somewhere. + */ + xt_log_and_clear_exception_ns(); + goto skip_remove_lock; + } + if (!pot) + /* Can happen if the table has been dropped: */ + goto skip_remove_lock; + + remove_lock: +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "remove lock group=%d\n", pgroup); +#endif + pot->ot_table->tab_locks.tab_row_locks[pgroup] = NULL; + pot->ot_table->tab_locks.tab_lock_perm[pgroup] = 0; + skip_remove_lock:; + } + bl_count = 0; + + if (pot) + xt_db_return_table_to_pool_ns(pot); +} + +/* Try to lock a row. + * This function returns: + * XT_NO_LOCK on success. + * XT_TEMP_LOCK if there is a temporary lock on the row. + * XT_PERM_LOCK if there is a permanent lock on the row. + * + * If there is a lock on this row, the transaction ID of the + * locker is also returned. + * + * The caller must wait if the row is locked. If the lock is + * permanent, then the caller must wait for the transaction to + * terminate. If the lock is temporary, then the caller must + * wait for the transaction to signal that the lock has been + * released. + */ +int XTRowLocks::old_xt_set_temp_lock(XTOpenTablePtr ot, xtRowID row, xtXactID *xn_id, XTRowLockListPtr lock_list) +{ + int group; + XTXactDataPtr xact, my_xact; + + if (ot->ot_temp_row_lock) { + /* Check if we don't already have this temp lock: */ + if (ot->ot_temp_row_lock == row) { + gl->lw_curr_lock = XT_NO_LOCK; + return XT_NO_LOCK; + } + + xt_make_lock_permanent(ot, lock_list); + } + + my_xact = ot->ot_thread->st_xact_data; + group = row % XT_ROW_LOCK_COUNT; + if ((xact = tab_row_locks[group])) { + if (xact == my_xact) + return XT_NO_LOCK; + *xn_id = xact->xd_start_xn_id; + return tab_lock_perm[group] ?
XT_PERM_LOCK : XT_TEMP_LOCK; + } + + tab_row_locks[row % XT_ROW_LOCK_COUNT] = my_xact; + +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "set temp lock %d group=%d for %s\n", (int) row, (int) row % XT_ROW_LOCK_COUNT, ot->ot_thread->t_name); +#endif + ot->ot_temp_row_lock = row; + return XT_NO_LOCK; +} + +/* Just check if there is a lock on the row. + * This function returns: + * XT_NO_LOCK if there is no lock. + * XT_TEMP_LOCK if there is a temporary lock on the row. + * XT_PERM_LOCK if a lock is a permanent lock in the row. + */ +int XTRowLocks::old_xt_is_locked(struct XTOpenTable *ot, xtRowID row, xtXactID *xn_id) +{ + int group; + XTXactDataPtr xact; + + group = row % XT_ROW_LOCK_COUNT; + if ((xact = tab_row_locks[group])) { + if (xact == ot->ot_thread->st_xact_data) + return XT_NO_LOCK; + *xn_id = xact->xd_start_xn_id; + if (tab_lock_perm[group]) + return XT_PERM_LOCK; + return XT_TEMP_LOCK; + } + return XT_NO_LOCK; +} + +void XTRowLocks::old_xt_remove_temp_lock(XTOpenTablePtr ot) +{ + int group; + XTXactDataPtr xact, my_xact; + + if (!ot->ot_temp_row_lock) + return; + + my_xact = ot->ot_thread->st_xact_data; + group = ot->ot_temp_row_lock % XT_ROW_LOCK_COUNT; +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "remove temp lock %d group=%d\n", (int) ot->ot_temp_row_lock, (int) ot->ot_temp_row_lock % XT_ROW_LOCK_COUNT); +#endif + ot->ot_temp_row_lock = 0; + if ((xact = tab_row_locks[group])) { + if (xact == my_xact) + tab_row_locks[group] = NULL; + } + + if (ot->ot_table->tab_db->db_xn_wait_count) + xt_xn_wakeup_transactions(ot->ot_table->tab_db, ot->ot_thread); +} + +xtBool XTRowLocks::old_xt_make_lock_permanent(XTOpenTablePtr ot, XTRowLockListPtr lock_list) +{ + int group; + + if (!ot->ot_temp_row_lock) + return OK; + +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "make lock perm %d group=%d\n", (int) ot->ot_temp_row_lock, (int) ot->ot_temp_row_lock % XT_ROW_LOCK_COUNT); +#endif + /* Check if the lock is already permanent: */ + group = 
ot->ot_temp_row_lock % XT_ROW_LOCK_COUNT; + if (!tab_lock_perm[group]) { + XTPermRowLockRec plock; + + plock.pr_tab_id = ot->ot_table->tab_id; + plock.pr_group = group; + if (!xt_bl_append(NULL, lock_list, &plock)) { + xt_remove_temp_lock(ot); + return FAILED; + } + tab_lock_perm[group] = 1; + } + + ot->ot_temp_row_lock = 0; + return OK; +} + +/* Release this lock, and all locks gained after this lock + * on this table. + * + * The locks are only released temporarily. They will be regained + * below using regain locks. + * + * Returns: + * XT_NO_LOCK if no lock is released. + * XT_PERM_LOCK if a lock is released. + * + * Note that only permanent locks can be released in this way. + * So if the thread has a temporary lock, it will first be made + * permanent. + * + * {RELEASING-LOCKS} + * The idea of releasing locks comes from the fact that each + * lock locks a group of records. + * So if T1 has a lock (e.g. when doing SELECT FOR UPDATE), + * and then encounters an updated record x + * from T2, and it must wait for T2, it first releases the + * lock, just in case T2 tries to gain a lock on another + * record y in the same group, which will cause it to + * wait on T1. + * + * However, there are several problems with releasing + * locks. + * - It can cause a "live-lock", where another transaction + * keeps getting in first. + * - It may not solve the problem in all cases because + * the SELECT FOR UPDATE has locked other record groups + * before it encountered record x. + * - Further problems occur when locks are granted by + * callback: + * T1 waits for T2, because it has a lock on record x + * T2 releases the lock because it must wait for T3 + * T1 is granted the lock (but does not know about this yet) + * T2 tries to regain lock (after T3 quits) and + * must wait for T1 - DEADLOCK + * + * In general, it does not make sense to release locks + * when they can be granted again by a callback. + * + * TODO: 2 possible solutions: + * - Do not lock groups, lock rows. 
+ * UPDATE INTENTION ROW LOCK + * - Use multiple lock types: + * UPDATE INTENTION LOCK (required first) + * SHARED UPDATE LOCK (used by INSERT or DELETE) + * EXCLUSIVE UPDATE LOCK (used by SELECT FOR UPDATE) + * + * Temporary solution. Do not release any locks. +int XTRowLocks::xt_release_locks(struct XTOpenTable *ot, xtRowID row, XTRowLockListPtr lock_list) + */ + +/* + * Regain a lock previously held. This function regains locks + * released by xt_release_locks(). + * + * It will return lock_type and xn_id if the row is locked, and therefore + * regain cannot continue. In this case, the caller must wait. + * It returns XT_NO_LOCK if there are no more locks to be regained. + * + * Locks are always regained in the order in which they were originally + * taken. +xtBool XTRowLocks::xt_regain_locks(struct XTOpenTable *ot, int *lock_type, xtXactID *xn_id, XTRowLockListPtr lock_list) + */ + +xtBool old_xt_init_row_locks(XTRowLocksPtr rl) +{ + memset(rl->tab_lock_perm, 0, XT_ROW_LOCK_COUNT); + memset(rl->tab_row_locks, 0, XT_ROW_LOCK_COUNT * sizeof(XTXactDataPtr)); + return OK; +} + +void old_xt_exit_row_locks(XTRowLocksPtr rl __attribute__((unused))) +{ +} + +#endif // XT_USE_LIST_BASED_ROW_LOCKS + +xtPublic xtBool xt_init_row_lock_list(XTRowLockListPtr lock_list) +{ + lock_list->bl_item_size = sizeof(XTPermRowLockRec); + lock_list->bl_size = 0; + lock_list->bl_count = 0; + lock_list->bl_data = NULL; + return OK; +} + +xtPublic void xt_exit_row_lock_list(XTRowLockListPtr lock_list) +{ + xt_bl_set_size(NULL, lock_list, 0); +} + +/* + * ----------------------------------------------------------------------- + * SPECIAL EXCLUSIVE/SHARED (XS) LOCK + */ + +#define XT_GET1(x) *(x) +#define XT_SET4(x, y) xt_atomic_set4(x, y) +#define XT_GET4(x) xt_atomic_get4(x) + +#ifdef XT_THREAD_LOCK_INFO +xtPublic void xt_rwmutex_init(struct XTThread *self, XTRWMutexPtr xsl, const char *n) +#else +xtPublic void xt_rwmutex_init(XTThreadPtr self, XTRWMutexPtr xsl) +#endif +{ +#ifdef DEBUG + 
xsl->xs_lock_thread = 0; + xsl->xs_inited = 12345; +#endif + xt_init_mutex_with_autoname(self, &xsl->xs_lock); + xt_init_cond(self, &xsl->xs_cond); + XT_SET4(&xsl->xs_state, 0); + xsl->xs_xlocker = 0; + /* Must be aligned! */ + ASSERT(xt_thr_maximum_threads == xt_align_size(xt_thr_maximum_threads, XT_XS_LOCK_ALIGN)); + xsl->x.xs_rlock = (xtWord1 *) xt_calloc(self, xt_thr_maximum_threads); +#ifdef XT_THREAD_LOCK_INFO + xsl->xs_name = n; + xt_thread_lock_info_init(&xsl->xs_lock_info, xsl); +#endif +} + +xtPublic void xt_rwmutex_free(XTThreadPtr self, XTRWMutexPtr xsl) +{ +#ifdef DEBUG + ASSERT(!xsl->xs_lock_thread); + ASSERT(xsl->xs_inited == 12345); + xsl->xs_inited = 0; +#endif + if (xsl->x.xs_rlock) + xt_free(self, (void *) xsl->x.xs_rlock); + xt_free_mutex(&xsl->xs_lock); + xt_free_cond(&xsl->xs_cond); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&xsl->xs_lock_info); +#endif +} + +xtPublic xtBool xt_rwmutex_xlock(XTRWMutexPtr xsl, xtThreadID thd_id) +{ +#ifdef DEBUG + ASSERT_NS(xsl->xs_inited == 12345); +#endif + ASSERT_NS(xt_get_self()->t_id == thd_id); + xt_lock_mutex_ns(&xsl->xs_lock); + ASSERT_NS(xsl->x.xs_rlock[thd_id] == XT_NO_LOCK); + + /* Wait for exclusive locker: */ + while (xsl->xs_xlocker) { + if (!xt_timed_wait_cond_ns(&xsl->xs_cond, &xsl->xs_lock, 10000)) { + xt_unlock_mutex_ns(&xsl->xs_lock); + return FAILED; + } + } + + /* I am the locker (set state before locker!): */ + XT_SET4(&xsl->xs_state, 0); + xsl->xs_xlocker = thd_id; + + /* Wait for all the read lockers: */ + while (xsl->xs_state < xt_thr_current_max_threads) { + while (xsl->x.xs_rlock[xsl->xs_state]) { + /* {RACE-WR_MUTEX} + * Just in case of this, we keep the wait time down! + */ + if (!xt_timed_wait_cond_ns(&xsl->xs_cond, &xsl->xs_lock, 10)) { + XT_SET4(&xsl->xs_state, 0); + xsl->xs_xlocker = 0; + xt_unlock_mutex_ns(&xsl->xs_lock); + return FAILED; + } + } + /* State can be incremented in parallel by a reader + * thread! 
+ */ + XT_SET4(&xsl->xs_state, xsl->xs_state + 1); + } + + /* I have waited for all: */ + XT_SET4(&xsl->xs_state, xt_thr_maximum_threads); + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&xsl->xs_lock_info); +#endif + + return OK; +} + +xtPublic xtBool xt_rwmutex_slock(XTRWMutexPtr xsl, xtThreadID thd_id) +{ +#ifdef DEBUG + ASSERT_NS(xsl->xs_inited == 12345); +#endif + ASSERT_NS(xt_get_self()->t_id == thd_id); + + xt_flushed_inc1(&xsl->x.xs_rlock[thd_id]); + + if (xsl->x.xs_rlock[thd_id] > 1) + return OK; + + /* Check if there could be an X locker: */ + if (xsl->xs_xlocker) { + /* There is an X locker. + * If xs_state < thd_id then the X locker will wait for me. + * So I should not wait! + */ + if (xsl->xs_state >= thd_id) { + /* If xsl->xs_state >= thd_id, then the locker has already + * checked me, and I will have to wait. + * + * Otherwise, xs_state <= thd_id, which means the + * X locker has not checked me, and will still wait for me (or + * is already waiting for me). In this case, I will have to + * take the mutex to make sure exactly how far he + * is with the checking. + */ + xt_lock_mutex_ns(&xsl->xs_lock); + while (xsl->xs_state > thd_id && xsl->xs_xlocker) { + if (!xt_timed_wait_cond_ns(&xsl->xs_cond, &xsl->xs_lock, 10000)) { + xt_unlock_mutex_ns(&xsl->xs_lock); + xsl->x.xs_rlock[thd_id]--; + return FAILED; + } + } + xt_unlock_mutex_ns(&xsl->xs_lock); + } + } + + /* There is no exclusive locker, so we have the read lock: */ + ASSERT_NS(xsl->xs_state != xt_thr_maximum_threads); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&xsl->xs_lock_info); +#endif + return OK; +} + +xtPublic xtBool xt_rwmutex_unlock(XTRWMutexPtr xsl, xtThreadID thd_id) +{ +#ifdef DEBUG + ASSERT_NS(xsl->xs_inited == 12345); +#endif + ASSERT_NS(xt_get_self()->t_id == thd_id); + if (xsl->xs_xlocker == thd_id) { + /* I have an X lock. 
*/ + ASSERT_NS(xsl->x.xs_rlock[thd_id] == XT_NO_LOCK); + ASSERT_NS(xsl->xs_state == xt_thr_maximum_threads); + XT_SET4(&xsl->xs_state, 0); + xsl->xs_xlocker = 0; + xt_unlock_mutex_ns(&xsl->xs_lock); + /* Wake up any other X or shared lockers: */ + if (!xt_broadcast_cond_ns(&xsl->xs_cond)) + return FAILED; + } + else { + /* I have a shared lock: */ + ASSERT_NS(xsl->x.xs_rlock[thd_id] > 0); + ASSERT_NS(xsl->xs_state != xt_thr_maximum_threads); /* TODO: PMC - HOW can this fail?! - but it does? */ + if (xsl->x.xs_rlock[thd_id] > 1) + xsl->x.xs_rlock[thd_id]--; + else { + /* {RACE-WR_MUTEX}. + * A BUG FIX: + * + * Previously I was checking "xsl->xs_xlocker" after + * decrementing the READ lock. + * + * This resulted in a race condition that could cause the + * unlocking reader to hang in xt_lock_mutex_ns(). + * This was because the X locker grabbed the mutex (xs_lock) + * but did not wait for the reader. + * + * The result was that the reader had to wait in UNLOCK + * until the X locker did an unlock! + * + * This only became obvious when it caused a deadlock (because + * the reader was waiting for the locker, which it should not + * have been, of course). + */ + if (xsl->xs_xlocker) { + xt_lock_mutex_ns(&xsl->xs_lock); + if (xsl->xs_xlocker && xsl->xs_state == thd_id) { + /* If the X locker is waiting for me, + * then allow him to continue. + */ + if (!xt_broadcast_cond_ns(&xsl->xs_cond)) { + xt_unlock_mutex_ns(&xsl->xs_lock); + return FAILED; + } + } + xt_flushed_dec1(&xsl->x.xs_rlock[thd_id]); + xt_unlock_mutex_ns(&xsl->xs_lock); + } + else + /* {RACE-WR_MUTEX} + * There is a race condition between the check above, and the + * decrement here. + * + * However, if I check xsl->xs_xlocker afterwards, and then + * try to get the lock xs_lock, I could hang for the duration + * of the X lock. 
+ */ + xt_flushed_dec1(&xsl->x.xs_rlock[thd_id]); + } + } +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&xsl->xs_lock_info); +#endif + return OK; +} + +/* + * ----------------------------------------------------------------------- + * SPIN LOCK + */ + +#ifdef XT_THREAD_LOCK_INFO +xtPublic void xt_spinlock_init(XTThreadPtr self __attribute__((unused)), XTSpinLockPtr spl, const char *n) +#else +xtPublic void xt_spinlock_init(XTThreadPtr self __attribute__((unused)), XTSpinLockPtr spl) +#endif +{ + spl->spl_lock = 0; +#ifdef XT_SPL_DEFAULT + xt_init_mutex(self, &spl->spl_mutex); +#endif +#ifdef DEBUG + spl->spl_locker = 0; +#endif +#ifdef XT_THREAD_LOCK_INFO + spl->spl_name = n; + xt_thread_lock_info_init(&spl->spl_lock_info, spl); +#endif +} + +xtPublic void xt_spinlock_free(XTThreadPtr self __attribute__((unused)), XTSpinLockPtr spl __attribute__((unused))) +{ +#ifdef XT_SPL_DEFAULT + xt_free_mutex(&spl->spl_mutex); +#endif +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&spl->spl_lock_info); +#endif +} + +xtPublic xtBool xt_spinlock_spin(XTSpinLockPtr spl) +{ + volatile xtWord4 *lck = &spl->spl_lock; + + for (;;) { + for (int i=0; i<10; i++) { + /* Check the lock variable: */ + if (!*lck) { + /* Try to get the lock: */ + if (!xt_spinlock_set(spl)) + return OK; + } + } + + /* Go to "sleep" */ + xt_critical_wait(); + } + + return OK; +} + +#ifdef DEBUG +xtPublic void xt_spinlock_set_thread(XTSpinLockPtr spl) +{ + spl->spl_locker = xt_get_self(); +} +#endif + +/* + * ----------------------------------------------------------------------- + * FAST LOCK + */ + +#ifdef XT_THREAD_LOCK_INFO +xtPublic void xt_fastlock_init(XTThreadPtr self, XTFastLockPtr fal, const char *n) +#else +xtPublic void xt_fastlock_init(XTThreadPtr self, XTFastLockPtr fal) +#endif +{ + xt_spinlock_init_with_autoname(self, &fal->fal_spinlock); + xt_spinlock_init_with_autoname(self, &fal->fal_wait_lock); + for (u_int i=0; i<XT_FAST_LOCK_MAX_WAIT; i++) + fal->fal_wait_list[i] 
= NULL; + fal->fal_wait_count = 0; + fal->fal_wait_wakeup = 0; + fal->fal_wait_alloc = 0; +#ifdef XT_THREAD_LOCK_INFO + fal->fal_name = n; + xt_thread_lock_info_init(&fal->fal_lock_info, fal); +#endif +} + +xtPublic void xt_fastlock_free(XTThreadPtr self, XTFastLockPtr fal) +{ + xt_spinlock_free(self, &fal->fal_spinlock); + xt_spinlock_free(self, &fal->fal_wait_lock); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&fal->fal_lock_info); +#endif +} + +xtPublic xtBool xt_fastlock_spin(XTFastLockPtr fal, XTThreadPtr thread) +{ + volatile xtWord4 *lck = &fal->fal_spinlock.spl_lock; + + do { + for (int i=0; i<10; i++) { + /* Check the lock variable: */ + if (!*lck) { + /* Try to get the lock: */ + if (!xt_spinlock_set(&fal->fal_spinlock)) { + fal->fal_locker = thread; + return OK; + } + } + } + + for (int i=0; i<10; i++) { + xt_critical_wait(); + if (!*lck) { + /* Try to get the lock: */ + if (!xt_spinlock_set(&fal->fal_spinlock)) { + fal->fal_locker = thread; + return OK; + } + } + } + + /* Wait for a wakeup */ + xt_spinlock_lock(&fal->fal_wait_lock); + if (fal->fal_wait_count == XT_FAST_LOCK_MAX_WAIT) { + xt_register_ulxterr(XT_REG_CONTEXT, XT_ERR_TOO_MANY_WAITERS, (u_long) XT_FAST_LOCK_MAX_WAIT+1); + xt_spinlock_unlock(&fal->fal_wait_lock); + return FAILED; + } + while (fal->fal_wait_list[fal->fal_wait_alloc]) + fal->fal_wait_alloc = (fal->fal_wait_alloc + 1) % XT_FAST_LOCK_MAX_WAIT; + fal->fal_wait_list[fal->fal_wait_alloc] = thread; + fal->fal_wait_alloc = (fal->fal_wait_alloc + 1) % XT_FAST_LOCK_MAX_WAIT; + fal->fal_wait_count++; + xt_lock_thread(thread); + xt_spinlock_unlock(&fal->fal_wait_lock); + if (!xt_wait_thread(thread)) { + xt_unlock_thread(thread); + if (fal->fal_locker == thread) + xt_fastlock_unlock(fal, thread); + return FAILED; + } + xt_unlock_thread(thread); + } while (fal->fal_locker != thread); + return OK; +} + +/* Wake up one of the waiters. 
*/ +xtPublic void xt_fastlock_wakeup(XTFastLockPtr fal) +{ + xt_spinlock_lock(&fal->fal_wait_lock); + if (fal->fal_wait_count) { + XTThreadPtr thread; + + /* Find a waiting thread, and give it the exclusive lock: */ + while (!fal->fal_wait_list[fal->fal_wait_wakeup]) + fal->fal_wait_wakeup = (fal->fal_wait_wakeup + 1) % XT_FAST_LOCK_MAX_WAIT; + thread = fal->fal_wait_list[fal->fal_wait_wakeup]; + fal->fal_wait_list[fal->fal_wait_wakeup] = NULL; + fal->fal_wait_wakeup = (fal->fal_wait_wakeup + 1) % XT_FAST_LOCK_MAX_WAIT; + fal->fal_wait_count--; + fal->fal_locker = thread; + + xt_lock_thread(thread); + xt_spinlock_unlock(&fal->fal_wait_lock); + xt_signal_thread(thread); + xt_unlock_thread(thread); + } + else { + xt_spinlock_unlock(&fal->fal_wait_lock); + fal->fal_locker = NULL; + xt_spinlock_reset(&fal->fal_spinlock); + } +} + +/* + * ----------------------------------------------------------------------- + * READ/WRITE SPIN LOCK + */ + +#ifdef XT_THREAD_LOCK_INFO +xtPublic void xt_spinrwlock_init(struct XTThread *self, XTSpinRWLockPtr srw, const char *name) +#else +xtPublic void xt_spinrwlock_init(struct XTThread *self, XTSpinRWLockPtr srw) +#endif +{ + xt_spinlock_init_with_autoname(self, &srw->srw_lock); + xt_spinlock_init_with_autoname(self, &srw->srw_state_lock); + srw->srw_state = 0; + srw->srw_xlocker = 0; + /* Must be aligned! 
*/ + ASSERT(xt_thr_maximum_threads == xt_align_size(xt_thr_maximum_threads, XT_XS_LOCK_ALIGN)); + srw->x.srw_rlock = (xtWord1 *) xt_calloc(self, xt_thr_maximum_threads); +#ifdef XT_THREAD_LOCK_INFO + srw->srw_name = name; + xt_thread_lock_info_init(&srw->srw_lock_info, srw); +#endif +} + +xtPublic void xt_spinrwlock_free(struct XTThread *self, XTSpinRWLockPtr srw) +{ + if (srw->x.srw_rlock) + xt_free(self, (void *) srw->x.srw_rlock); + xt_spinlock_free(self, &srw->srw_lock); + xt_spinlock_free(self, &srw->srw_state_lock); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&srw->srw_lock_info); +#endif +} + +xtPublic xtBool xt_spinrwlock_xlock(XTSpinRWLockPtr srw, xtThreadID thd_id) +{ + xt_spinlock_lock(&srw->srw_lock); + ASSERT_NS(srw->x.srw_rlock[thd_id] == XT_NO_LOCK); + + xt_spinlock_lock(&srw->srw_state_lock); + + /* Set the state before xlocker (dirty read!) */ + srw->srw_state = 0; + + /* I am the locker: */ + srw->srw_xlocker = thd_id; + + /* Wait for all the read lockers: */ + while (srw->srw_state < xt_thr_current_max_threads) { + while (srw->x.srw_rlock[srw->srw_state]) { + xt_spinlock_unlock(&srw->srw_state_lock); + /* Wait for this reader, during this time, the reader + * himself, may increment the state. */ + xt_critical_wait(); + xt_spinlock_lock(&srw->srw_state_lock); + } + /* State can be incremented in parallel by a reader + * thread! + */ + srw->srw_state++; + } + + /* I have waited for all: */ + srw->srw_state = xt_thr_maximum_threads; + + xt_spinlock_unlock(&srw->srw_state_lock); + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&srw->srw_lock_info); +#endif + + return OK; +} + +xtPublic xtBool xt_spinrwlock_slock(XTSpinRWLockPtr srw, xtThreadID thd_id) +{ + ASSERT_NS(srw->x.srw_rlock[thd_id] == XT_NO_LOCK); + srw->x.srw_rlock[thd_id] = XT_WANT_LOCK; + /* Check if there could be an X locker: */ + if (srw->srw_xlocker) { + /* There is an X locker. + * If srw_state < thd_id then the X locker will wait for me. 
+ * So I should not wait! + */ + if (srw->srw_state >= thd_id) { + /* If srw->srw_state >= thd_id, then the locker may have, or + * has already checked me, and I will have to wait. + * + * Otherwise, srw_state <= thd_id, which means the + * X locker has not checked me, and will still wait for me (or + * is already waiting for me). In this case, I will have to + * take the mutex to make sure exactly how far he + * is with the checking. + */ + xt_spinlock_lock(&srw->srw_state_lock); + while (srw->srw_state > thd_id && srw->srw_xlocker) { + xt_spinlock_unlock(&srw->srw_state_lock); + xt_critical_wait(); + xt_spinlock_lock(&srw->srw_state_lock); + } + xt_spinlock_unlock(&srw->srw_state_lock); + } + } + /* There is no exclusive locker, so we have the read lock: */ + srw->x.srw_rlock[thd_id] = XT_HAVE_LOCK; + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&srw->srw_lock_info); +#endif + + return OK; +} + +xtPublic xtBool xt_spinrwlock_unlock(XTSpinRWLockPtr srw, xtThreadID thd_id) +{ + if (srw->srw_xlocker == thd_id) { + /* I have an X lock. */ + ASSERT_NS(srw->srw_state == xt_thr_maximum_threads); + srw->srw_state = 0; + srw->srw_xlocker = 0; + xt_spinlock_unlock(&srw->srw_lock); + } + else { + /* I have a shared lock: */ + ASSERT_NS(srw->x.srw_rlock[thd_id] == XT_HAVE_LOCK); + ASSERT_NS(srw->srw_state != xt_thr_maximum_threads); + srw->x.srw_rlock[thd_id] = XT_NO_LOCK; + if (srw->srw_xlocker && srw->srw_state == thd_id) { + xt_spinlock_lock(&srw->srw_state_lock); + if (srw->srw_xlocker && srw->srw_state == thd_id) { + /* If the X locker is waiting for me, + * then allow him to continue. 
+ */ + srw->srw_state = thd_id+1; + } + xt_spinlock_unlock(&srw->srw_state_lock); + } + } + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&srw->srw_lock_info); +#endif + + return OK; +} + +/* + * ----------------------------------------------------------------------- + * FAST READ/WRITE LOCK (BASED ON FAST MUTEX) + */ + +#ifdef XT_THREAD_LOCK_INFO +xtPublic void xt_fastrwlock_init(struct XTThread *self, XTFastRWLockPtr frw, const char *n) +#else +xtPublic void xt_fastrwlock_init(struct XTThread *self, XTFastRWLockPtr frw) +#endif +{ + xt_fastlock_init_with_autoname(self, &frw->frw_lock); + frw->frw_xlocker = NULL; + xt_spinlock_init_with_autoname(self, &frw->frw_state_lock); + frw->frw_state = 0; + frw->frw_read_waiters = 0; + /* Must be aligned! */ + ASSERT(xt_thr_maximum_threads == xt_align_size(xt_thr_maximum_threads, XT_XS_LOCK_ALIGN)); + frw->x.frw_rlock = (xtWord1 *) xt_calloc(self, xt_thr_maximum_threads); +#ifdef XT_THREAD_LOCK_INFO + frw->frw_name = n; + xt_thread_lock_info_init(&frw->frw_lock_info, frw); +#endif +} + +xtPublic void xt_fastrwlock_free(struct XTThread *self, XTFastRWLockPtr frw) +{ + if (frw->x.frw_rlock) + xt_free(self, (void *) frw->x.frw_rlock); + xt_fastlock_free(self, &frw->frw_lock); + xt_spinlock_free(self, &frw->frw_state_lock); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&frw->frw_lock_info); +#endif +} + +xtPublic xtBool xt_fastrwlock_xlock(XTFastRWLockPtr frw, struct XTThread *thread) +{ + xt_fastlock_lock(&frw->frw_lock, thread); + ASSERT_NS(frw->x.frw_rlock[thread->t_id] == XT_NO_LOCK); + + xt_spinlock_lock(&frw->frw_state_lock); + + /* Set the state before xlocker (dirty read!) */ + frw->frw_state = 0; + + /* I am the locker: */ + frw->frw_xlocker = thread; + + /* Wait for all the read lockers: */ + while (frw->frw_state < xt_thr_current_max_threads) { + while (frw->x.frw_rlock[frw->frw_state]) { + xt_lock_thread(thread); + xt_spinlock_unlock(&frw->frw_state_lock); + /* Wait for this reader. 
We rely on the reader to free + * us from this wait! */ + if (!xt_wait_thread(thread)) { + xt_unlock_thread(thread); + frw->frw_state = 0; + frw->frw_xlocker = NULL; + xt_fastlock_unlock(&frw->frw_lock, thread); + return FAILED; + } + xt_unlock_thread(thread); + xt_spinlock_lock(&frw->frw_state_lock); + } + /* State can be incremented in parallel by a reader + * thread! + */ + frw->frw_state++; + } + + /* I have waited for all: */ + frw->frw_state = xt_thr_maximum_threads; + + xt_spinlock_unlock(&frw->frw_state_lock); + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&frw->frw_lock_info); +#endif + + return OK; +} + +xtPublic xtBool xt_fastrwlock_slock(XTFastRWLockPtr frw, struct XTThread *thread) +{ + xtThreadID thd_id = thread->t_id; + + ASSERT_NS(frw->x.frw_rlock[thd_id] == XT_NO_LOCK); + frw->x.frw_rlock[thd_id] = XT_WANT_LOCK; + /* Check if there could be an X locker: */ + if (frw->frw_xlocker) { + /* There is an X locker. + * If frw_state < thd_id then the X locker will wait for me. + * So I should not wait! + */ + if (frw->frw_state >= thd_id) { + /* If frw->frw_state >= thd_id, then the locker may have, or + * has already checked me, and I will have to wait. + * + * Otherwise, frw_state <= thd_id, which means the + * X locker has not checked me, and will still wait for me (or + * is already waiting for me). In this case, I will have to + * take the mutex to make sure exactly how far he + * is with the checking. 
+ */ + xt_spinlock_lock(&frw->frw_state_lock); + frw->frw_read_waiters++; + frw->x.frw_rlock[thd_id] = XT_WAITING; + while (frw->frw_state > thd_id && frw->frw_xlocker) { + xt_lock_thread(thread); + xt_spinlock_unlock(&frw->frw_state_lock); + if (!xt_wait_thread(thread)) { + xt_unlock_thread(thread); + xt_spinlock_lock(&frw->frw_state_lock); + frw->frw_read_waiters--; + frw->x.frw_rlock[thd_id] = XT_NO_LOCK; + xt_spinlock_unlock(&frw->frw_state_lock); + return FAILED; + } + xt_unlock_thread(thread); + xt_spinlock_lock(&frw->frw_state_lock); + } + frw->x.frw_rlock[thd_id] = XT_HAVE_LOCK; + frw->frw_read_waiters--; + xt_spinlock_unlock(&frw->frw_state_lock); + return OK; + } + } + /* There is no exclusive locker, so we have the read lock: */ + frw->x.frw_rlock[thd_id] = XT_HAVE_LOCK; + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&frw->frw_lock_info); +#endif + + return OK; +} + +xtPublic xtBool xt_fastrwlock_unlock(XTFastRWLockPtr frw, struct XTThread *thread) +{ + xtThreadID thd_id = thread->t_id; + + if (frw->frw_xlocker == thread) { + /* I have an X lock. 
*/ + ASSERT_NS(frw->frw_state == xt_thr_maximum_threads); + frw->frw_state = 0; + frw->frw_xlocker = NULL; + + /* Wake up all read waiters: */ + if (frw->frw_read_waiters) { + xt_spinlock_lock(&frw->frw_state_lock); + if (frw->frw_read_waiters) { + XTThreadPtr target; + + for (u_int i=0; i<xt_thr_current_max_threads; i++) { + if (frw->x.frw_rlock[i] == XT_WAITING) { + if ((target = xt_thr_array[i])) { + xt_lock_thread(target); + xt_signal_thread(target); + xt_unlock_thread(target); + } + } + } + } + xt_spinlock_unlock(&frw->frw_state_lock); + } + xt_fastlock_unlock(&frw->frw_lock, thread); + } + else { + /* I have a shared lock: */ + ASSERT_NS(frw->x.frw_rlock[thd_id] == XT_HAVE_LOCK); + ASSERT_NS(frw->frw_state != xt_thr_maximum_threads); + frw->x.frw_rlock[thd_id] = XT_NO_LOCK; + if (frw->frw_xlocker && frw->frw_state == thd_id) { + xt_spinlock_lock(&frw->frw_state_lock); + if (frw->frw_xlocker && frw->frw_state == thd_id) { + /* If the X locker is waiting for me, + * then allow him to continue. 
+ */ + frw->frw_state = thd_id+1; + /* Wake him up: */ + xt_lock_thread(frw->frw_xlocker); + xt_signal_thread(frw->frw_xlocker); + xt_unlock_thread(frw->frw_xlocker); + } + xt_spinlock_unlock(&frw->frw_state_lock); + } + } + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&frw->frw_lock_info); +#endif + + return OK; +} + +/* + * ----------------------------------------------------------------------- + * ATOMIC READ/WRITE LOCK (BASED ON ATOMIC OPERATIONS) + */ + +#ifdef XT_THREAD_LOCK_INFO +xtPublic void xt_atomicrwlock_init(struct XTThread XT_UNUSED(*self), XTAtomicRWLockPtr arw, const char *n) +#else +xtPublic void xt_atomicrwlock_init(struct XTThread XT_UNUSED(*self), XTAtomicRWLockPtr arw) +#endif +{ + arw->arw_reader_count = 0; + arw->arw_xlock_set = 0; +#ifdef XT_THREAD_LOCK_INFO + arw->arw_name = n; + xt_thread_lock_info_init(&arw->arw_lock_info, arw); +#endif +} + +xtPublic void xt_atomicrwlock_free(struct XTThread *, XTAtomicRWLockPtr XT_UNUSED(arw)) +{ +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&arw->arw_lock_info); +#endif +} + +xtPublic xtBool xt_atomicrwlock_xlock(XTAtomicRWLockPtr arw, xtThreadID XT_UNUSED(thr_id)) +{ + register xtWord2 set; + + /* First get an exclusive lock: */ + for (;;) { + set = xt_atomic_tas2(&arw->arw_xlock_set, 1); + if (!set) + break; + xt_yield(); + } + + /* Wait for the remaining readers: */ + while (arw->arw_reader_count) + xt_yield(); + +#ifdef DEBUG + arw->arw_locker = thr_id; +#endif + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&arw->arw_lock_info); +#endif + return OK; +} + +xtPublic xtBool xt_atomicrwlock_slock(XTAtomicRWLockPtr arw) +{ + register xtWord2 set; + + /* First get an exclusive lock: */ + for (;;) { + set = xt_atomic_tas2(&arw->arw_xlock_set, 1); + if (!set) + break; + xt_yield(); + } + + /* Add a reader: */ + xt_atomic_inc2(&arw->arw_reader_count); + + /* Release the xlock: */ + arw->arw_xlock_set = 0; + +#ifdef XT_THREAD_LOCK_INFO + 
xt_thread_lock_info_add_owner(&arw->arw_lock_info); +#endif + return OK; +} + +xtPublic xtBool xt_atomicrwlock_unlock(XTAtomicRWLockPtr arw, xtBool xlocked) +{ + if (xlocked) + arw->arw_xlock_set = 0; + else + xt_atomic_dec2(&arw->arw_reader_count); + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&arw->arw_lock_info); +#endif +#ifdef DEBUG + arw->arw_locker = 0; +#endif + + return OK; +} + +/* + * ----------------------------------------------------------------------- + * UNIT TESTS + */ + +#define JOB_MEMCPY 1 +#define JOB_SLEEP 2 +#define JOB_PRINT 3 +#define JOB_INCREMENT 4 +#define JOB_SNOOZE 5 + +#define LOCK_PTHREAD_RW 1 +#define LOCK_PTHREAD_MUTEX 2 +#define LOCK_FASTRW 3 +#define LOCK_SPINLOCK 4 +#define LOCK_FASTLOCK 5 +#define LOCK_SPINRWLOCK 6 +#define LOCK_FASTRWLOCK 7 +#define LOCK_ATOMICRWLOCK 8 + +typedef struct XSLockTest { + u_int xs_interations; + xtBool xs_which_lock; + xtBool xs_which_job; + xtBool xs_debug_print; + XTRWMutexRec xs_lock; + xt_rwlock_type xs_plock; + XTSpinLockRec xs_spinlock; + xt_mutex_type xs_mutex; + XTFastLockRec xs_fastlock; + XTSpinRWLockRec xs_spinrwlock; + XTFastRWLockRec xs_fastrwlock; + XTAtomicRWLockRec xs_atomicrwlock; + int xs_progress; + xtWord4 xs_inc; +} XSLockTestRec, *XSLockTestPtr; + +static void lck_free_thread_data(XTThreadPtr self __attribute__((unused)), void *data __attribute__((unused))) +{ +} + +static void lck_do_job(XTThreadPtr self, int job, XSLockTestPtr data) +{ + char b1[2048], b2[2048]; + + switch (job) { + case JOB_MEMCPY: + memcpy(b1, b2, 2048); + data->xs_inc++; + break; + case JOB_SLEEP: + xt_sleep_milli_second(1); + data->xs_inc++; + break; + case JOB_PRINT: + printf("- %s got lock\n", self->t_name); + xt_sleep_milli_second(10); + data->xs_inc++; + break; + case JOB_INCREMENT: + data->xs_inc++; + break; + case JOB_SNOOZE: + xt_sleep_milli_second(10); + data->xs_inc++; + break; + } +} + +#if 0 +static void *lck_run_dumper(XTThreadPtr self) +{ + int state = 0; + + while 
(state != 1) { + sleep(1); + if (state == 2) { + xt_dump_trace(); + state = 0; + } + } +} +#endif + +static void *lck_run_reader(XTThreadPtr self) +{ + XSLockTestRec *data = (XSLockTestRec *) self->t_data; + + if (data->xs_debug_print) + printf("- %s start\n", self->t_name); + for (u_int i=0; i<data->xs_interations; i++) { + if (data->xs_progress && ((i+1) % data->xs_progress) == 0) + printf("- %s %d\n", self->t_name, i+1); + if (data->xs_which_lock == LOCK_PTHREAD_RW) { + xt_slock_rwlock_ns(&data->xs_plock); + lck_do_job(self, data->xs_which_job, data); + xt_unlock_rwlock_ns(&data->xs_plock); + } + else if (data->xs_which_lock == LOCK_FASTRW) { + xt_rwmutex_slock(&data->xs_lock, self->t_id); + lck_do_job(self, data->xs_which_job, data); + xt_rwmutex_unlock(&data->xs_lock, self->t_id); + } + else if (data->xs_which_lock == LOCK_SPINRWLOCK) { + xt_spinrwlock_slock(&data->xs_spinrwlock, self->t_id); + lck_do_job(self, data->xs_which_job, data); + xt_spinrwlock_unlock(&data->xs_spinrwlock, self->t_id); + } + else if (data->xs_which_lock == LOCK_FASTRWLOCK) { + xt_fastrwlock_slock(&data->xs_fastrwlock, self); + lck_do_job(self, data->xs_which_job, data); + xt_fastrwlock_unlock(&data->xs_fastrwlock, self); + } + else if (data->xs_which_lock == LOCK_ATOMICRWLOCK) { + xt_atomicrwlock_slock(&data->xs_atomicrwlock); + lck_do_job(self, data->xs_which_job, data); + xt_atomicrwlock_unlock(&data->xs_atomicrwlock, FALSE); + } + else + ASSERT(FALSE); + } + if (data->xs_debug_print) + printf("- %s stop\n", self->t_name); + return NULL; +} + +static void *lck_run_writer(XTThreadPtr self) +{ + XSLockTestRec *data = (XSLockTestRec *) self->t_data; + + if (data->xs_debug_print) + printf("- %s start\n", self->t_name); + for (u_int i=0; i<data->xs_interations; i++) { + if (data->xs_progress && ((i+1) % data->xs_progress) == 0) + printf("- %s %d\n", self->t_name, i+1); + if (data->xs_which_lock == LOCK_PTHREAD_RW) { + xt_xlock_rwlock_ns(&data->xs_plock); + lck_do_job(self, 
data->xs_which_job, data); + xt_unlock_rwlock_ns(&data->xs_plock); + } + else if (data->xs_which_lock == LOCK_FASTRW) { + xt_rwmutex_xlock(&data->xs_lock, self->t_id); + lck_do_job(self, data->xs_which_job, data); + xt_rwmutex_unlock(&data->xs_lock, self->t_id); + } + else if (data->xs_which_lock == LOCK_SPINRWLOCK) { + xt_spinrwlock_xlock(&data->xs_spinrwlock, self->t_id); + lck_do_job(self, data->xs_which_job, data); + xt_spinrwlock_unlock(&data->xs_spinrwlock, self->t_id); + } + else if (data->xs_which_lock == LOCK_FASTRWLOCK) { + xt_fastrwlock_xlock(&data->xs_fastrwlock, self); + lck_do_job(self, data->xs_which_job, data); + xt_fastrwlock_unlock(&data->xs_fastrwlock, self); + } + else if (data->xs_which_lock == LOCK_ATOMICRWLOCK) { + xt_atomicrwlock_xlock(&data->xs_atomicrwlock, self->t_id); + lck_do_job(self, data->xs_which_job, data); + xt_atomicrwlock_unlock(&data->xs_atomicrwlock, TRUE); + } + else + ASSERT(FALSE); + } + if (data->xs_debug_print) + printf("- %s stop\n", self->t_name); + return NULL; +} + +static void lck_print_test(XSLockTestRec *data) +{ + switch (data->xs_which_lock) { + case LOCK_PTHREAD_RW: + printf("pthread read/write"); + break; + case LOCK_PTHREAD_MUTEX: + printf("pthread mutex"); + break; + case LOCK_FASTRW: + printf("fast read/write mutex"); + break; + case LOCK_SPINLOCK: + printf("spin mutex"); + break; + case LOCK_FASTLOCK: + printf("fast mutex"); + break; + case LOCK_SPINRWLOCK: + printf("spin read/write lock"); + break; + case LOCK_FASTRWLOCK: + printf("fast read/write lock"); + break; + case LOCK_ATOMICRWLOCK: + printf("atomic read/write lock"); + break; + } + + switch (data->xs_which_job) { + case JOB_MEMCPY: + printf(" MEMCPY 2K"); + break; + case JOB_SLEEP: + printf(" SLEEP 1/1000s"); + break; + case JOB_PRINT: + printf(" PRINT DEBUG"); + break; + case JOB_INCREMENT: + printf(" INCREMENT"); + break; + case JOB_SNOOZE: + printf(" SLEEP 1/100s"); + break; + } + + printf(" %d interations", data->xs_interations); +} + +static 
void *lck_run_mutex_locker(XTThreadPtr self) +{ + XSLockTestRec *data = (XSLockTestRec *) self->t_data; + + if (data->xs_debug_print) + printf("- %s start\n", self->t_name); + for (u_int i=0; i<data->xs_interations; i++) { + if (data->xs_progress && ((i+1) % data->xs_progress) == 0) + printf("- %s %d\n", self->t_name, i+1); + if (data->xs_which_lock == LOCK_PTHREAD_MUTEX) { + xt_lock_mutex_ns(&data->xs_mutex); + lck_do_job(self, data->xs_which_job, data); + xt_unlock_mutex_ns(&data->xs_mutex); + } + else if (data->xs_which_lock == LOCK_SPINLOCK) { + xt_spinlock_lock(&data->xs_spinlock); + lck_do_job(self, data->xs_which_job, data); + xt_spinlock_unlock(&data->xs_spinlock); + } + else if (data->xs_which_lock == LOCK_FASTLOCK) { + xt_fastlock_lock(&data->xs_fastlock, self); + lck_do_job(self, data->xs_which_job, data); + xt_fastlock_unlock(&data->xs_fastlock, self); + } + else + ASSERT(FALSE); + } + if (data->xs_debug_print) + printf("- %s stop\n", self->t_name); + return NULL; +} + +typedef struct LockThread { + xtThreadID id; + XTThreadPtr ptr; +} LockThreadRec, *LockThreadPtr; + +static void lck_reader_writer_test(XTThreadPtr self, XSLockTestRec *data, int reader_cnt, int writer_cnt) +{ + xtWord8 start; + LockThreadPtr threads; + int thread_cnt = reader_cnt + writer_cnt; + char buffer[40]; + + //XTThreadPtr dumper = xt_create_daemon(self, "DUMPER"); + //xt_run_thread(self, dumper, lck_run_dumper); + + printf("READ/WRITE TEST: "); + lck_print_test(data); + printf(", %d readers, %d writers\n", reader_cnt, writer_cnt); + threads = (LockThreadPtr) xt_malloc(self, thread_cnt * sizeof(LockThreadRec)); + + for (int i=0; i<thread_cnt; i++) { + sprintf(buffer, "%s%d", i < reader_cnt ? 
"READER-" : "WRITER-", i+1); + threads[i].ptr = xt_create_daemon(self, buffer); + threads[i].id = threads[i].ptr->t_id; + xt_set_thread_data(threads[i].ptr, data, lck_free_thread_data); + } + + start = xt_trace_clock(); + for (int i=0; i<reader_cnt; i++) + xt_run_thread(self, threads[i].ptr, lck_run_reader); + for (int i=reader_cnt; i<thread_cnt; i++) + xt_run_thread(self, threads[i].ptr, lck_run_writer); + + for (int i=0; i<thread_cnt; i++) + xt_wait_for_thread(threads[i].id, TRUE); + printf("----- %d reader, %d writer time=%s\n", reader_cnt, writer_cnt, xt_trace_clock_diff(buffer, start)); + + xt_free(self, threads); + printf("TEST RESULT = %d\n", data->xs_inc); + + //xt_wait_for_thread(dumper, TRUE); +} + +static void lck_mutex_lock_test(XTThreadPtr self, XSLockTestRec *data, int thread_cnt) +{ + xtWord8 start; + LockThreadPtr threads; + char buffer[40]; + + printf("LOCK MUTEX TEST: "); + lck_print_test(data); + printf(", %d threads\n", thread_cnt); + threads = (LockThreadPtr) xt_malloc(self, thread_cnt * sizeof(LockThreadRec)); + + for (int i=0; i<thread_cnt; i++) { + sprintf(buffer, "THREAD%d", i+1); + threads[i].ptr = xt_create_daemon(self, buffer); + threads[i].id = threads[i].ptr->t_id; + xt_set_thread_data(threads[i].ptr, data, lck_free_thread_data); + } + + start = xt_trace_clock(); + for (int i=0; i<thread_cnt; i++) + xt_run_thread(self, threads[i].ptr, lck_run_mutex_locker); + + for (int i=0; i<thread_cnt; i++) + xt_wait_for_thread(threads[i].id, TRUE); + printf("----- %d threads time=%s\n", thread_cnt, xt_trace_clock_diff(buffer, start)); + + xt_free(self, threads); + printf("TEST RESULT = %d\n", data->xs_inc); +} + +xtPublic void xt_unit_test_read_write_locks(XTThreadPtr self) +{ + XSLockTestRec data; + + memset(&data, 0, sizeof(data)); + + printf("TEST: xt_unit_test_read_write_locks\n"); + xt_rwmutex_init_with_autoname(self, &data.xs_lock); + xt_init_rwlock_with_autoname(self, &data.xs_plock); + xt_spinrwlock_init_with_autoname(self, 
&data.xs_spinrwlock); + xt_fastrwlock_init_with_autoname(self, &data.xs_fastrwlock); + xt_atomicrwlock_init_with_autoname(self, &data.xs_atomicrwlock); + + /** + data.xs_interations = 10; + data.xs_which_lock = LOCK_FASTRW; // LOCK_PTHREAD_RW, LOCK_FASTRW, LOCK_SPINRWLOCK, LOCK_FASTRWLOCK + data.xs_which_job = JOB_PRINT; + data.xs_debug_print = TRUE; + data.xs_progress = 0; + lck_reader_writer_test(self, &data, 4, 0); + lck_reader_writer_test(self, &data, 0, 2); + lck_reader_writer_test(self, &data, 1, 1); + lck_reader_writer_test(self, &data, 4, 2); + **/ + + /** + data.xs_interations = 4000; + data.xs_which_lock = LOCK_FASTRW; // LOCK_PTHREAD_RW, LOCK_FASTRW, LOCK_SPINRWLOCK, LOCK_FASTRWLOCK + data.xs_which_job = JOB_SLEEP; + data.xs_debug_print = TRUE; + data.xs_progress = 200; + lck_reader_writer_test(self, &data, 4, 0); + lck_reader_writer_test(self, &data, 0, 2); + lck_reader_writer_test(self, &data, 1, 1); + lck_reader_writer_test(self, &data, 4, 2); + **/ + + /**/ + data.xs_interations = 1000000; + data.xs_which_lock = LOCK_FASTRW; // LOCK_PTHREAD_RW, LOCK_FASTRW, LOCK_SPINRWLOCK, LOCK_FASTRWLOCK, LOCK_ATOMICRWLOCK + data.xs_which_job = JOB_INCREMENT; + data.xs_debug_print = FALSE; + data.xs_progress = 0; + lck_reader_writer_test(self, &data, 10, 0); + /**/ + + /** + data.xs_interations = 10000; + data.xs_which_lock = LOCK_FASTRW; // LOCK_PTHREAD_RW, LOCK_FASTRW, LOCK_SPINRWLOCK, LOCK_FASTRWLOCK + data.xs_which_job = JOB_MEMCPY; + data.xs_debug_print = FALSE; + data.xs_progress = 0; + lck_reader_writer_test(self, &data, 10, 5); + **/ + + /** + data.xs_interations = 1000; + data.xs_which_lock = LOCK_FASTRW; // LOCK_PTHREAD_RW, LOCK_FASTRW, LOCK_SPINRWLOCK, LOCK_FASTRWLOCK + data.xs_which_job = JOB_SLEEP; + data.xs_debug_print = FALSE; + data.xs_progress = 0; + lck_reader_writer_test(self, &data, 10, 5); + **/ + + xt_rwmutex_free(self, &data.xs_lock); + xt_free_rwlock(&data.xs_plock); + xt_spinrwlock_free(self, &data.xs_spinrwlock); + xt_fastrwlock_free(self, 
&data.xs_fastrwlock); +} + +xtPublic void xt_unit_test_mutex_locks(XTThreadPtr self) +{ + XSLockTestRec data; + + memset(&data, 0, sizeof(data)); + + printf("TEST: xt_unit_test_mutex_locks\n"); + xt_spinlock_init_with_autoname(self, &data.xs_spinlock); + xt_fastlock_init_with_autoname(self, &data.xs_fastlock); + xt_init_mutex_with_autoname(self, &data.xs_mutex); + + /**/ + data.xs_interations = 10; + data.xs_which_lock = LOCK_SPINLOCK; // LOCK_SPINLOCK, LOCK_PTHREAD_MUTEX, LOCK_FASTLOCK + data.xs_which_job = JOB_PRINT; + data.xs_debug_print = TRUE; + data.xs_progress = 0; + data.xs_inc = 0; + lck_mutex_lock_test(self, &data, 2); + /**/ + + /**/ + data.xs_interations = 100000; + data.xs_which_lock = LOCK_SPINLOCK; // LOCK_SPINLOCK, LOCK_PTHREAD_MUTEX, LOCK_FASTLOCK + data.xs_which_job = JOB_INCREMENT; + data.xs_debug_print = FALSE; + data.xs_progress = 0; + data.xs_inc = 0; + lck_mutex_lock_test(self, &data, 10); + /**/ + + /**/ + data.xs_interations = 10000; + data.xs_which_lock = LOCK_SPINLOCK; // LOCK_SPINLOCK, LOCK_PTHREAD_MUTEX, LOCK_FASTLOCK + data.xs_which_job = JOB_MEMCPY; + data.xs_debug_print = FALSE; + data.xs_progress = 0; + data.xs_inc = 0; + lck_mutex_lock_test(self, &data, 10); + /**/ + + /**/ + data.xs_interations = 1000; + data.xs_which_lock = LOCK_FASTLOCK; // LOCK_SPINLOCK, LOCK_PTHREAD_MUTEX, LOCK_FASTLOCK + data.xs_which_job = JOB_SLEEP; + data.xs_debug_print = FALSE; + data.xs_progress = 0; + data.xs_inc = 0; + lck_mutex_lock_test(self, &data, 10); + /**/ + + /**/ + data.xs_interations = 100; + data.xs_which_lock = LOCK_FASTLOCK; // LOCK_SPINLOCK, LOCK_PTHREAD_MUTEX, LOCK_FASTLOCK + data.xs_which_job = JOB_SNOOZE; + data.xs_debug_print = FALSE; + data.xs_progress = 0; + data.xs_inc = 0; + lck_mutex_lock_test(self, &data, 10); + /**/ + + xt_spinlock_free(self, &data.xs_spinlock); + xt_fastlock_free(self, &data.xs_fastlock); + xt_free_mutex(&data.xs_mutex); +} + +xtPublic void xt_unit_test_create_threads(XTThreadPtr self) +{ + XTThreadPtr 
threads[10]; + + printf("TEST: xt_unit_test_create_threads\n"); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + /* Create some threads: */ + threads[0] = xt_create_daemon(self, "test0"); + printf("thread = %d\n", threads[0]->t_id); + threads[1] = xt_create_daemon(self, "test1"); + printf("thread = %d\n", threads[1]->t_id); + threads[2] = xt_create_daemon(self, "test2"); + printf("thread = %d\n", threads[2]->t_id); + threads[3] = xt_create_daemon(self, "test3"); + printf("thread = %d\n", threads[3]->t_id); + threads[4] = xt_create_daemon(self, "test4"); + printf("thread = %d\n", threads[4]->t_id); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + /* Max stays the same: */ + xt_free_thread(threads[3]); + xt_free_thread(threads[2]); + xt_free_thread(threads[1]); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + /* Fill in the gaps: */ + threads[1] = xt_create_daemon(self, "test1"); + printf("thread = %d\n", threads[1]->t_id); + threads[2] = xt_create_daemon(self, "test2"); + printf("thread = %d\n", threads[2]->t_id); + threads[3] = xt_create_daemon(self, "test3"); + printf("thread = %d\n", threads[3]->t_id); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + /* And add one: */ + threads[5] = xt_create_daemon(self, "test5"); + printf("thread = %d\n", threads[5]->t_id); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + /* Max stays the same: */ + xt_free_thread(threads[3]); + xt_free_thread(threads[2]); + xt_free_thread(threads[1]); + xt_free_thread(threads[4]); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + /* Recalculate the max: */ + xt_free_thread(threads[5]); + printf("current max 
threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + /* Fill in the gaps: */ + threads[1] = xt_create_daemon(self, "test1"); + printf("thread = %d\n", threads[1]->t_id); + threads[2] = xt_create_daemon(self, "test2"); + printf("thread = %d\n", threads[2]->t_id); + threads[3] = xt_create_daemon(self, "test3"); + printf("thread = %d\n", threads[3]->t_id); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); + + xt_free_thread(threads[3]); + xt_free_thread(threads[2]); + xt_free_thread(threads[1]); + xt_free_thread(threads[0]); + printf("current max threads = %d, in use = %d\n", xt_thr_current_max_threads, xt_thr_current_thread_count); +} + +#ifdef UNUSED_CODE +int XTRowLocks::xt_release_locks(struct XTOpenTable *ot, xtRowID row, XTRowLockListPtr lock_list) +{ + if (ot->ot_temp_row_lock) + xt_make_lock_permanent(ot, lock_list); + + if (!lock_list->bl_count) + return XT_NO_LOCK; + + int group, pgroup; + XTXactDataPtr xact; + xtTableID tab_id, ptab_id; + XTPermRowLockPtr plock; + XTOpenTablePtr pot = NULL; + XTRowLocksPtr row_locks; + + /* Do I have the lock? */ + group = row % XT_ROW_LOCK_COUNT; + if (!(xact = tab_row_locks[group])) + /* There is no lock: */ + return XT_NO_LOCK; + + if (xact != ot->ot_thread->st_xact_data) + /* There is a lock but it does not belong to me! 
*/ + return XT_NO_LOCK; + + tab_id = ot->ot_table->tab_id; + plock = (XTPermRowLockPtr) &lock_list->bl_data[lock_list->bl_count * lock_list->bl_item_size]; + lock_list->rll_release_point = lock_list->bl_count; + for (u_int i=0; i<lock_list->bl_count; i++) { + plock--; + + pgroup = plock->pr_group; + ptab_id = plock->pr_tab_id; + + if (ptab_id == tab_id) + row_locks = this; + else { + if (pot) { + if (pot->ot_table->tab_id == ptab_id) + goto remove_lock; + xt_db_return_table_to_pool_ns(pot); + pot = NULL; + } + + if (!xt_db_open_pool_table_ns(&pot, ot->ot_table->tab_db, tab_id)) { + /* Should not happen, but just in case, we just don't + * remove the lock. We will probably end up with a deadlock + * somewhere. + */ + xt_log_and_clear_exception_ns(); + goto skip_remove_lock; + } + if (!pot) + /* Can happen if the table has been dropped: */ + goto skip_remove_lock; + + remove_lock: + row_locks = &pot->ot_table->tab_locks; + } + +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "release lock group=%d\n", pgroup); +#endif + row_locks->tab_row_locks[pgroup] = NULL; + row_locks->tab_lock_perm[pgroup] = 0; + skip_remove_lock:; + + lock_list->rll_release_point--; + if (tab_id == ptab_id && group == pgroup) + break; + } + + if (pot) + xt_db_return_table_to_pool_ns(pot); + return XT_PERM_LOCK; +} + +xtBool XTRowLocks::xt_regain_locks(struct XTOpenTable *ot, int *lock_type, xtXactID *xn_id, XTRowLockListPtr lock_list) +{ + int group; + XTXactDataPtr xact, my_xact; + XTPermRowLockPtr plock; + xtTableID tab_id; + XTOpenTablePtr pot = NULL; + XTRowLocksPtr row_locks = NULL; + XTTableHPtr tab = NULL; + + for (u_int i=lock_list->rll_release_point; i<lock_list->bl_count; i++) { + plock = (XTPermRowLockPtr) &lock_list->bl_data[i * lock_list->bl_item_size]; + + my_xact = ot->ot_thread->st_xact_data; + group = plock->pr_group; + tab_id = plock->pr_tab_id; + + if (tab_id == ot->ot_table->tab_id) { + row_locks = this; + tab = ot->ot_table; + } + else { + if (pot) { + if (tab_id == 
pot->ot_table->tab_id) + goto gain_lock; + xt_db_return_table_to_pool_ns(pot); + pot = NULL; + } + + if (!xt_db_open_pool_table_ns(&pot, ot->ot_table->tab_db, tab_id)) + return FAILED; + if (!pot) + goto no_gain_lock; + + gain_lock: + tab = pot->ot_table; + row_locks = &tab->tab_locks; + no_gain_lock:; + } + +#ifdef XT_TRACE_LOCKS + xt_ttracef(xt_get_self(), "regain lock group=%d\n", group); +#endif + XT_TAB_ROW_WRITE_LOCK(&tab->tab_row_rwlock[group % XT_ROW_RWLOCKS], ot->ot_thread); + if ((xact = row_locks->tab_row_locks[group])) { + if (xact != my_xact) { + *xn_id = xact->xd_start_xn_id; + *lock_type = row_locks->tab_lock_perm[group] ? XT_PERM_LOCK : XT_TEMP_LOCK; + goto done; + } + } + else + row_locks->tab_row_locks[group] = my_xact; + row_locks->tab_lock_perm[group] = 1; + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[group % XT_ROW_RWLOCKS], ot->ot_thread); + lock_list->rll_release_point++; + } + *lock_type = XT_NO_LOCK; + return OK; + + done: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[group % XT_ROW_RWLOCKS], ot->ot_thread); + return OK; +} + +#endif diff --git a/storage/pbxt/src/lock_xt.h b/storage/pbxt/src/lock_xt.h new file mode 100644 index 00000000000..53fe7023eef --- /dev/null +++ b/storage/pbxt/src/lock_xt.h @@ -0,0 +1,698 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2008-01-24 Paul McCullagh + * + * Row lock functions. + * + * H&G2JCtL + */ +#ifndef __xt_lock_h__ +#define __xt_lock_h__ + +#include "xt_defs.h" +#include "util_xt.h" +#include "locklist_xt.h" +#include "pthread_xt.h" + +struct XTThread; +struct XTDatabase; +struct XTOpenTable; +struct XTXactData; +struct XTTable; + +/* Possibilities are 2 = align 4 or 3 = align 8 */ +#define XT_XS_LOCK_SHIFT 2 +#define XT_XS_LOCK_ALIGN (1 << XT_XS_LOCK_SHIFT) + +/* This lock is fast for reads but slow for writes. + * Use this lock in situations where you have 99% reads, + * and then some potentially long writes. + */ +typedef struct XTRWMutex { +#ifdef DEBUG + struct XTThread *xs_lock_thread; + u_int xs_inited; +#endif +#ifdef XT_THREAD_LOCK_INFO + XTThreadLockInfoRec xs_lock_info; + const char *xs_name; +#endif + xt_mutex_type xs_lock; + xt_cond_type xs_cond; + volatile xtWord4 xs_state; + volatile xtThreadID xs_xlocker; + union { +#if XT_XS_LOCK_ALIGN == 4 + volatile xtWord4 *xs_rlock_align; +#else + volatile xtWord8 *xs_rlock_align; +#endif + volatile xtWord1 *xs_rlock; + } x; +} XTRWMutexRec, *XTRWMutexPtr; + +#ifdef XT_THREAD_LOCK_INFO +#define xt_rwmutex_init_with_autoname(a,b) xt_rwmutex_init(a,b,LOCKLIST_ARG_SUFFIX(b)) +void xt_rwmutex_init(struct XTThread *self, XTRWMutexPtr xsl, const char *name); +#else +#define xt_rwmutex_init_with_autoname(a,b) xt_rwmutex_init(a,b) +void xt_rwmutex_init(struct XTThread *self, XTRWMutexPtr xsl); +#endif +void xt_rwmutex_free(struct XTThread *self, XTRWMutexPtr xsl); +xtBool xt_rwmutex_xlock(XTRWMutexPtr xsl, xtThreadID thd_id); +xtBool xt_rwmutex_slock(XTRWMutexPtr xsl, xtThreadID thd_id); +xtBool xt_rwmutex_unlock(XTRWMutexPtr xsl, xtThreadID thd_id); + +#ifdef XT_WIN +#define XT_SPL_WIN32_ASM +#else +#if 
defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__)) +#define XT_SPL_GNUC_X86 +#else +#define XT_SPL_DEFAULT +#endif +#endif + +#ifdef XT_SOLARIS +/* Use Sun atomic operations library + * http://docs.sun.com/app/docs/doc/816-5168/atomic-ops-3c?a=view + */ +#define XT_SPL_SOLARIS_LIB +#endif + +#ifdef XT_SPL_SOLARIS_LIB +#include <atomic.h> +#endif + +typedef struct XTSpinLock { + volatile xtWord4 spl_lock; +#ifdef XT_SPL_DEFAULT + xt_mutex_type spl_mutex; +#endif +#ifdef DEBUG + struct XTThread *spl_locker; +#endif +#ifdef XT_THREAD_LOCK_INFO + XTThreadLockInfoRec spl_lock_info; + const char *spl_name; +#endif +} XTSpinLockRec, *XTSpinLockPtr; + +#ifdef XT_THREAD_LOCK_INFO +#define xt_spinlock_init_with_autoname(a,b) xt_spinlock_init(a,b,LOCKLIST_ARG_SUFFIX(b)) +void xt_spinlock_init(struct XTThread *self, XTSpinLockPtr sp, const char *name); +#else +#define xt_spinlock_init_with_autoname(a,b) xt_spinlock_init(a,b) +void xt_spinlock_init(struct XTThread *self, XTSpinLockPtr sp); +#endif +void xt_spinlock_free(struct XTThread *self, XTSpinLockPtr sp); +xtBool xt_spinlock_spin(XTSpinLockPtr spl); +#ifdef DEBUG +void xt_spinlock_set_thread(XTSpinLockPtr spl); +#endif + +/* + * This macro is to remind me where it was safe + * to use a read lock! + */ +#define xt_lck_slock xt_spinlock_lock + +/* I call these operations flushed because the result + * is written atomically. + * But the operations themselves are not atomic! 
+ */ +inline void xt_flushed_inc1(volatile xtWord1 *mptr) +{ +#ifdef XT_SPL_WIN32_ASM + __asm MOV ECX, mptr + __asm MOV DL, BYTE PTR [ECX] + __asm INC DL + __asm XCHG DL, BYTE PTR [ECX] +#elif defined(XT_SPL_GNUC_X86) + xtWord1 val; + + asm volatile ("movb %1,%0" : "=r" (val) : "m" (*mptr) : "memory"); + val++; + asm volatile ("xchgb %1,%0" : "=r" (val) : "m" (*mptr), "0" (val) : "memory"); +#elif defined(XT_SPL_SOLARIS_LIB) + atomic_inc_8(mptr); +#else + (*mptr)++; +#endif +} + +inline xtWord1 xt_flushed_dec1(volatile xtWord1 *mptr) +{ + xtWord1 val; + +#ifdef XT_SPL_WIN32_ASM + __asm MOV ECX, mptr + __asm MOV DL, BYTE PTR [ECX] + __asm DEC DL + __asm MOV val, DL + __asm XCHG DL, BYTE PTR [ECX] +#elif defined(XT_SPL_GNUC_X86) + xtWord1 val2; + + asm volatile ("movb %1, %0" : "=r" (val) : "m" (*mptr) : "memory"); + val--; + asm volatile ("xchgb %1,%0" : "=r" (val2) : "m" (*mptr), "0" (val) : "memory"); + /* Should work, but compiler makes a mistake? + * asm volatile ("xchgb %1, %0" : : "r" (val), "m" (*mptr) : "memory"); + */ +#elif defined(XT_SPL_SOLARIS_LIB) + val = atomic_dec_8_nv(mptr); +#else + val = --(*mptr); +#endif + return val; +} + +inline void xt_atomic_inc2(volatile xtWord2 *mptr) +{ +#ifdef XT_SPL_WIN32_ASM + __asm LOCK INC WORD PTR mptr +#elif defined(XT_SPL_GNUC_X86) + asm volatile ("lock; incw %0" : : "m" (*mptr) : "memory"); +#elif defined(__GNUC__) + __sync_fetch_and_add(mptr, 1); +#elif defined(XT_SPL_SOLARIS_LIB) + atomic_inc_16_nv(mptr); +#else + (*mptr)++; +#endif +} + +inline void xt_atomic_dec2(volatile xtWord2 *mptr) +{ +#ifdef XT_SPL_WIN32_ASM + __asm LOCK DEC WORD PTR mptr +#elif defined(XT_SPL_GNUC_X86) + asm volatile ("lock; decw %0" : : "m" (*mptr) : "memory"); +#elif defined(__GNUC__) + __sync_fetch_and_sub(mptr, 1); +#elif defined(XT_SPL_SOLARIS_LIB) + atomic_dec_16_nv(mptr); +#else + --(*mptr); +#endif +} + +/* Atomic test and set 2 byte word! 
*/ +inline xtWord2 xt_atomic_tas2(volatile xtWord2 *mptr, xtWord2 val) +{ +#ifdef XT_SPL_WIN32_ASM + __asm MOV ECX, mptr + __asm MOV DX, val + __asm XCHG DX, WORD PTR [ECX] + __asm MOV val, DX +#elif defined(XT_SPL_GNUC_X86) + asm volatile ("xchgw %1,%0" : "=r" (val) : "m" (*mptr), "0" (val) : "memory"); +#elif defined(XT_SPL_SOLARIS_LIB) + val = atomic_swap_16(mptr, val); +#else + /* Yikes! */ + xtWord2 nval = val; + + val = *mptr; + *mptr = nval; +#endif + return val; +} + +inline void xt_atomic_set4(volatile xtWord4 *mptr, xtWord4 val) +{ +#ifdef XT_SPL_WIN32_ASM + __asm MOV ECX, mptr + __asm MOV EDX, val + __asm XCHG EDX, DWORD PTR [ECX] + //__asm MOV DWORD PTR [ECX], EDX +#elif defined(XT_SPL_GNUC_X86) + asm volatile ("xchgl %1,%0" : "=r" (val) : "m" (*mptr), "0" (val) : "memory"); + //asm volatile ("movl %0,%1" : "=r" (val) : "m" (*mptr) : "memory"); +#elif defined(XT_SPL_SOLARIS_LIB) + atomic_swap_32(mptr, val); +#else + *mptr = val; +#endif +} + +inline xtWord4 xt_atomic_get4(volatile xtWord4 *mptr) +{ + xtWord4 val; + +#ifdef XT_SPL_WIN32_ASM + __asm MOV ECX, mptr + __asm MOV EDX, DWORD PTR [ECX] + __asm MOV val, EDX +#elif defined(XT_SPL_GNUC_X86) + asm volatile ("movl %1,%0" : "=r" (val) : "m" (*mptr) : "memory"); +#else + val = *mptr; +#endif + return val; +} + +/* Code for test and set is derived from code by Larry Zhou and + * Google: http://code.google.com/p/google-perftools + */ +inline xtWord4 xt_spinlock_set(XTSpinLockPtr spl) +{ + xtWord4 prv; + volatile xtWord4 *lck; + + lck = &spl->spl_lock; +#ifdef XT_SPL_WIN32_ASM + __asm MOV ECX, lck + __asm MOV EDX, 1 + __asm XCHG EDX, DWORD PTR [ECX] + __asm MOV prv, EDX +#elif defined(XT_SPL_GNUC_X86) + prv = 1; + asm volatile ("xchgl %1,%0" : "=r" (prv) : "m" (*lck), "0" (prv) : "memory"); +#elif defined(XT_SPL_SOLARIS_LIB) + prv = atomic_swap_32(lck, 1); +#else + /* The default implementation just uses a mutex, and + * does not spin! 
*/ + xt_lock_mutex_ns(&spl->spl_mutex); + /* We have the lock */ + *lck = 1; + prv = 0; +#endif +#ifdef DEBUG + if (!prv) + xt_spinlock_set_thread(spl); +#endif + return prv; +} + +inline xtWord4 xt_spinlock_reset(XTSpinLockPtr spl) +{ + xtWord4 prv; + volatile xtWord4 *lck; + +#ifdef DEBUG + spl->spl_locker = NULL; +#endif + lck = &spl->spl_lock; +#ifdef XT_SPL_WIN32_ASM + __asm MOV ECX, lck + __asm MOV EDX, 0 + __asm XCHG EDX, DWORD PTR [ECX] + __asm MOV prv, EDX +#elif defined(XT_SPL_GNUC_X86) + prv = 0; + asm volatile ("xchgl %1,%0" : "=r" (prv) : "m" (*lck), "0" (prv) : "memory"); +#elif defined(XT_SPL_SOLARIS_LIB) + prv = atomic_swap_32(lck, 0); +#else + *lck = 0; + xt_unlock_mutex_ns(&spl->spl_mutex); + prv = 1; +#endif + return prv; +} + +/* + * Return FALSE, and register an error on failure. + */ +inline xtBool xt_spinlock_lock(XTSpinLockPtr spl) +{ + if (!xt_spinlock_set(spl)) { +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&spl->spl_lock_info); +#endif + return OK; + } +#ifdef XT_THREAD_LOCK_INFO + xtBool spin_result = xt_spinlock_spin(spl); + if (spin_result) + xt_thread_lock_info_add_owner(&spl->spl_lock_info); + return spin_result; +#else + return xt_spinlock_spin(spl); +#endif +} + +inline void xt_spinlock_unlock(XTSpinLockPtr spl) +{ + xt_spinlock_reset(spl); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&spl->spl_lock_info); +#endif +} + +void xt_unit_test_read_write_locks(struct XTThread *self); +void xt_unit_test_mutex_locks(struct XTThread *self); +void xt_unit_test_create_threads(struct XTThread *self); + +#define XT_FAST_LOCK_MAX_WAIT 100 + +typedef struct XTFastLock { + XTSpinLockRec fal_spinlock; + struct XTThread *fal_locker; + + XTSpinLockRec fal_wait_lock; + u_int fal_wait_count; + u_int fal_wait_wakeup; + u_int fal_wait_alloc; + struct XTThread *fal_wait_list[XT_FAST_LOCK_MAX_WAIT]; +#ifdef XT_THREAD_LOCK_INFO + XTThreadLockInfoRec fal_lock_info; + const char *fal_name; +#endif +} XTFastLockRec, 
*XTFastLockPtr; + +#ifdef XT_THREAD_LOCK_INFO +#define xt_fastlock_init_with_autoname(a,b) xt_fastlock_init(a,b,LOCKLIST_ARG_SUFFIX(b)) +void xt_fastlock_init(struct XTThread *self, XTFastLockPtr spl, const char *name); +#else +#define xt_fastlock_init_with_autoname(a,b) xt_fastlock_init(a,b) +void xt_fastlock_init(struct XTThread *self, XTFastLockPtr spl); +#endif +void xt_fastlock_free(struct XTThread *self, XTFastLockPtr spl); +void xt_fastlock_wakeup(XTFastLockPtr spl); +xtBool xt_fastlock_spin(XTFastLockPtr spl, struct XTThread *thread); + +inline xtBool xt_fastlock_lock(XTFastLockPtr fal, struct XTThread *thread) +{ + if (!xt_spinlock_set(&fal->fal_spinlock)) { + fal->fal_locker = thread; +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&fal->fal_lock_info); +#endif + return OK; + } +#ifdef XT_THREAD_LOCK_INFO + xtBool spin_result = xt_fastlock_spin(fal, thread); + if (spin_result) + xt_thread_lock_info_add_owner(&fal->fal_lock_info); + return spin_result; +#else + return xt_fastlock_spin(fal, thread); +#endif +} + +inline void xt_fastlock_unlock(XTFastLockPtr fal, struct XTThread *thread __attribute__((unused))) +{ + if (fal->fal_wait_count) + xt_fastlock_wakeup(fal); + else { + fal->fal_locker = NULL; + xt_spinlock_reset(&fal->fal_spinlock); + } +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&fal->fal_lock_info); +#endif +} + +typedef struct XTSpinRWLock { + XTSpinLockRec srw_lock; + volatile xtThreadID srw_xlocker; + XTSpinLockRec srw_state_lock; + volatile u_int srw_state; + union { +#if XT_XS_LOCK_ALIGN == 4 + volatile xtWord4 *srw_rlock_align; +#else + volatile xtWord8 *srw_rlock_align; +#endif + volatile xtWord1 *srw_rlock; + } x; + +#ifdef XT_THREAD_LOCK_INFO + XTThreadLockInfoRec srw_lock_info; + const char *srw_name; +#endif + +} XTSpinRWLockRec, *XTSpinRWLockPtr; + +#ifdef XT_THREAD_LOCK_INFO +#define xt_spinrwlock_init_with_autoname(a,b) xt_spinrwlock_init(a,b,LOCKLIST_ARG_SUFFIX(b)) +void xt_spinrwlock_init(struct 
XTThread *self, XTSpinRWLockPtr xsl, const char *name); +#else +#define xt_spinrwlock_init_with_autoname(a,b) xt_spinrwlock_init(a,b) +void xt_spinrwlock_init(struct XTThread *self, XTSpinRWLockPtr xsl); +#endif +void xt_spinrwlock_free(struct XTThread *self, XTSpinRWLockPtr xsl); +xtBool xt_spinrwlock_xlock(XTSpinRWLockPtr xsl, xtThreadID thd_id); +xtBool xt_spinrwlock_slock(XTSpinRWLockPtr xsl, xtThreadID thd_id); +xtBool xt_spinrwlock_unlock(XTSpinRWLockPtr xsl, xtThreadID thd_id); + +typedef struct XTFastRWLock { + XTFastLockRec frw_lock; + struct XTThread *frw_xlocker; + XTSpinLockRec frw_state_lock; + volatile u_int frw_state; + u_int frw_read_waiters; + union { +#if XT_XS_LOCK_ALIGN == 4 + volatile xtWord4 *frw_rlock_align; +#else + volatile xtWord8 *frw_rlock_align; +#endif + volatile xtWord1 *frw_rlock; + } x; + +#ifdef XT_THREAD_LOCK_INFO + XTThreadLockInfoRec frw_lock_info; + const char *frw_name; +#endif + +} XTFastRWLockRec, *XTFastRWLockPtr; + +#ifdef XT_THREAD_LOCK_INFO +#define xt_fastrwlock_init_with_autoname(a,b) xt_fastrwlock_init(a,b,LOCKLIST_ARG_SUFFIX(b)) +void xt_fastrwlock_init(struct XTThread *self, XTFastRWLockPtr frw, const char *name); +#else +#define xt_fastrwlock_init_with_autoname(a,b) xt_fastrwlock_init(a,b) +void xt_fastrwlock_init(struct XTThread *self, XTFastRWLockPtr frw); +#endif + +void xt_fastrwlock_free(struct XTThread *self, XTFastRWLockPtr frw); +xtBool xt_fastrwlock_xlock(XTFastRWLockPtr frw, struct XTThread *thread); +xtBool xt_fastrwlock_slock(XTFastRWLockPtr frw, struct XTThread *thread); +xtBool xt_fastrwlock_unlock(XTFastRWLockPtr frw, struct XTThread *thread); + +typedef struct XTAtomicRWLock { + volatile xtWord2 arw_reader_count; + volatile xtWord2 arw_xlock_set; + +#ifdef XT_THREAD_LOCK_INFO + XTThreadLockInfoRec arw_lock_info; + const char *arw_name; +#endif +#ifdef DEBUG + xtThreadID arw_locker; +#endif +} XTAtomicRWLockRec, *XTAtomicRWLockPtr; + +#ifdef XT_THREAD_LOCK_INFO +#define 
xt_atomicrwlock_init_with_autoname(a,b) xt_atomicrwlock_init(a,b,LOCKLIST_ARG_SUFFIX(b)) +void xt_atomicrwlock_init(struct XTThread *self, XTAtomicRWLockPtr xsl, const char *name); +#else +#define xt_atomicrwlock_init_with_autoname(a,b) xt_atomicrwlock_init(a,b) +void xt_atomicrwlock_init(struct XTThread *self, XTAtomicRWLockPtr xsl); +#endif +void xt_atomicrwlock_free(struct XTThread *self, XTAtomicRWLockPtr xsl); +xtBool xt_atomicrwlock_xlock(XTAtomicRWLockPtr xsl, xtThreadID thr_id); +xtBool xt_atomicrwlock_slock(XTAtomicRWLockPtr xsl); +xtBool xt_atomicrwlock_unlock(XTAtomicRWLockPtr xsl, xtBool xlocked); + +/* + * ----------------------------------------------------------------------- + * ROW LOCKS + */ + +/* + * [(9)] + * + * These are permanent row locks. They are set on rows for 2 reasons: + * + * 1. To lock a row that is being updated. The row is locked + * when it is read, until the point that it is updated. If the row + * is not updated, the lock is removed. + * This prevents an update coming in between, which would cause an error + * on the first thread. + * + * 2. The locks are used to implement SELECT FOR UPDATE. + */ + +/* + * A lock that is set in order to perform an update is a temporary lock. + * This lock will be removed once the update of the record is done. + * The objective is to prevent some other thread from changing the + * record between the time the record is read and updated. This is to + * prevent unnecessary "Record was updated" errors. + * + * A permanent lock is set by a SELECT FOR UPDATE. These locks are + * held until the end of the transaction. + * + * However, a SELECT FOR UPDATE will pop its lock stack before + * waiting for a transaction that has updated a record. + * This is to prevent the deadlock that can occur because a + * SELECT FOR UPDATE locks groups of records (I mean in general the + * locks used are group locks). + * + * This means a SELECT FOR UPDATE can get ahead of an UPDATE as far as + * locking is concerned. 
Example: + * + * Records 1, 2 and 3 are in group A. + * + * T1: UPDATES record 2. + * T2: SELECT FOR UPDATE record 1, which locks group A. + * T2: SELECT FOR UPDATE record 2, which must wait for T1. + * T1: UPDATES record 3, which must wait because of group lock A. + * + * To avoid deadlock, T2 releases its group lock A before waiting for + * record 2. It then regains the lock after waiting for record 2. + * + * (NOTE: Locks are no longer released. Please check this comment: + * {RELEASING-LOCKS} in lock_xt.cc. ) + * + * However, releasing the group A lock means first releasing all locks gained + * after the group A lock. + * + * For example: a thread locks groups: A, B and C. To release group B + * lock the thread must release C as well. Afterwards, it must gain + * B and C again, in that order. This is to ensure that the lock + * order is NOT changed! + * + */ +#define XT_LOCK_ERR -1 +#define XT_NO_LOCK 0 +#define XT_TEMP_LOCK 1 /* A temporary lock */ +#define XT_PERM_LOCK 2 /* A permanent lock */ + +typedef struct XTRowLockList : public XTBasicList { + void xt_remove_all_locks(struct XTDatabase *db, struct XTThread *thread); +} XTRowLockListRec, *XTRowLockListPtr; + +#define XT_USE_LIST_BASED_ROW_LOCKS + +#ifdef XT_USE_LIST_BASED_ROW_LOCKS +/* + * This method stores each lock, and avoids conflicts. + * But it is a bit more expensive in time. 
+ */ + +#ifdef DEBUG +#define XT_TEMP_LOCK_BYTES 10 +#define XT_ROW_LOCK_GROUP_COUNT 5 +#else +#define XT_TEMP_LOCK_BYTES 0xFFFF +#define XT_ROW_LOCK_GROUP_COUNT 23 +#endif + +typedef struct XTLockWait { + /* Information about the lock to be acquired: */ + struct XTThread *lw_thread; + struct XTOpenTable *lw_ot; + xtRowID lw_row_id; + + /* This is the lock currently held, and the transaction ID: */ + int lw_curr_lock; + xtXactID lw_xn_id; + + /* This is information about the updating transaction: */ + xtBool lw_row_updated; + xtXactID lw_updating_xn_id; + + /* Pointers for the lock list: */ + struct XTLockWait *lw_next; + struct XTLockWait *lw_prev; +} XTLockWaitRec, *XTLockWaitPtr; + +typedef struct XTLockItem { + xtRowID li_row_id; /* The row list is sorted by this value. */ + xtWord2 li_count; /* The number of consecutive rows locked. FFFF means a temporary lock. */ + xtWord2 li_thread_id; /* The thread that holds this lock. */ +} XTLockItemRec, *XTLockItemPtr; + +typedef struct XTLockGroup { + XTSpinLockRec lg_lock; /* A lock for the list. */ + XTLockWaitPtr lg_wait_queue; /* A queue of threads waiting for a lock in this group. */ + XTLockWaitPtr lg_wait_queue_end; /* The end of the thread queue. */ + size_t lg_list_size; /* The size of the list. */ + size_t lg_list_in_use; /* Number of slots on the list in use. */ + XTLockItemPtr lg_list; /* List of locks. 
*/ +} XTLockGroupRec, *XTLockGroupPtr; + +struct XTLockWait; + +typedef struct XTRowLocks { + XTLockGroupRec rl_groups[XT_ROW_LOCK_GROUP_COUNT]; + + xtBool xt_set_temp_lock(struct XTOpenTable *ot, XTLockWaitPtr lw, XTRowLockListPtr lock_list); + void xt_remove_temp_lock(struct XTOpenTable *ot, xtBool updated); + xtBool xt_make_lock_permanent(struct XTOpenTable *ot, XTRowLockListPtr lock_list); + + xtBool rl_lock_row(XTLockGroupPtr group, XTLockWaitPtr lw, XTRowLockListPtr lock_list, int *result); + void rl_grant_locks(XTLockGroupPtr group, struct XTThread *thread); +#ifdef DEBUG_LOCK_QUEUE + void rl_check(XTLockWaitPtr lw); +#endif +} XTRowLocksRec, *XTRowLocksPtr; + +#define XT_USE_TABLE_REF + +typedef struct XTPermRowLock { +#ifdef XT_USE_TABLE_REF + struct XTTable *pr_table; +#else + xtTableID pr_tab_id; +#endif + xtWord1 pr_group[XT_ROW_LOCK_GROUP_COUNT]; +} XTPermRowLockRec, *XTPermRowLockPtr; + +#else // XT_USE_LIST_BASED_ROW_LOCKS + +/* Hash based row locking. This method can report conflicts, even + * when there are none. + */ +typedef struct XTRowLocks { + xtWord1 tab_lock_perm[XT_ROW_LOCK_COUNT]; /* Byte set to 1 for permanent locks. */ + struct XTXactData *tab_row_locks[XT_ROW_LOCK_COUNT]; /* The transactions that have locked the specific rows. 
+ */ + + int xt_set_temp_lock(struct XTOpenTable *ot, xtRowID row, xtXactID *xn_id, XTRowLockListPtr lock_list); + void xt_remove_temp_lock(struct XTOpenTable *ot); + xtBool xt_make_lock_permanent(struct XTOpenTable *ot, XTRowLockListPtr lock_list); + int xt_is_locked(struct XTOpenTable *ot, xtRowID row, xtXactID *xn_id); +} XTRowLocksRec, *XTRowLocksPtr; + +typedef struct XTPermRowLock { + xtTableID pr_tab_id; + xtWord4 pr_group; +} XTPermRowLockRec, *XTPermRowLockPtr; + +#endif // XT_USE_LIST_BASED_ROW_LOCKS + +xtBool xt_init_row_locks(XTRowLocksPtr rl); +void xt_exit_row_locks(XTRowLocksPtr rl); + +xtBool xt_init_row_lock_list(XTRowLockListPtr rl); +void xt_exit_row_lock_list(XTRowLockListPtr rl); + +#define XT_NO_LOCK 0 +#define XT_WANT_LOCK 1 +#define XT_HAVE_LOCK 2 +#define XT_WAITING 3 + +#endif diff --git a/storage/pbxt/src/locklist_xt.cc b/storage/pbxt/src/locklist_xt.cc new file mode 100644 index 00000000000..0a3df584ba6 --- /dev/null +++ b/storage/pbxt/src/locklist_xt.cc @@ -0,0 +1,184 @@ +/* Copyright (c) 2009 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2009-01-20 Vladimir Kolesnikov + * + * H&G2JCtL + */ + +#include "xt_config.h" +#include "locklist_xt.h" + +#ifdef XT_THREAD_LOCK_INFO +#include "pthread_xt.h" +#include "thread_xt.h" +#include "trace_xt.h" + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTSpinLock *lock) +{ + ptr->li_spin_lock = lock; + ptr->li_lock_type = XTThreadLockInfo::SPIN_LOCK; +} + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTRWMutex *lock) +{ + ptr->li_rw_mutex = lock; + ptr->li_lock_type = XTThreadLockInfo::RW_MUTEX; +} + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTFastLock *lock) +{ + ptr->li_fast_lock = lock; + ptr->li_lock_type = XTThreadLockInfo::FAST_LOCK; +} + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, xt_mutex_struct *lock) +{ + ptr->li_mutex = lock; + ptr->li_lock_type = XTThreadLockInfo::MUTEX; +} + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, xt_rwlock_struct *lock) +{ + ptr->li_rwlock = lock; + ptr->li_lock_type = XTThreadLockInfo::RW_LOCK; +} + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTFastRWLock *lock) +{ + ptr->li_fast_rwlock = lock; + ptr->li_lock_type = XTThreadLockInfo::FAST_RW_LOCK; +} + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTSpinRWLock *lock) +{ + ptr->li_spin_rwlock = lock; + ptr->li_lock_type = XTThreadLockInfo::SPIN_RW_LOCK; +} + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTAtomicRWLock *lock) +{ + ptr->li_atomic_rwlock = lock; + ptr->li_lock_type = XTThreadLockInfo::ATOMIC_RW_LOCK; +} + +void xt_thread_lock_info_free(XTThreadLockInfoPtr ptr) +{ + /* TODO: check to see if it's present in a thread's list */ +} + +void xt_thread_lock_info_add_owner (XTThreadLockInfoPtr ptr) +{ + XTThread *self = xt_get_self(); + + if (!self) + return; + + if 
(self->st_thread_lock_count < XT_THREAD_LOCK_INFO_MAX_COUNT) { + self->st_thread_lock_list[self->st_thread_lock_count] = ptr; + self->st_thread_lock_count++; + } +} + +void xt_thread_lock_info_release_owner (XTThreadLockInfoPtr ptr) +{ + XTThread *self = xt_get_self(); + + if (!self) + return; + + for (int i = self->st_thread_lock_count - 1; i >= 0; i--) { + if (self->st_thread_lock_list[i] == ptr) { + self->st_thread_lock_count--; + /* Source and destination overlap, so memmove() is required here: */ + memmove(self->st_thread_lock_list + i, + self->st_thread_lock_list + i + 1, + (self->st_thread_lock_count - i)*sizeof(XTThreadLockInfoPtr)); + self->st_thread_lock_list[self->st_thread_lock_count] = NULL; + break; + } + } +} + +void xt_trace_thread_locks(XTThread *self) +{ + if (!self) + return; + + xt_ttracef(self, "thread lock list (first in list added first): "); + + if (!self->st_thread_lock_count) { + xt_trace(" <empty>\n"); + return; + } + + xt_trace("\n"); + + int count = min(self->st_thread_lock_count, XT_THREAD_LOCK_INFO_MAX_COUNT); + + for(int i = 0; i < count; i++) { + + const char *lock_type = NULL; + const char *lock_name = NULL; + + XTThreadLockInfoPtr li = self->st_thread_lock_list[i]; + + switch(li->li_lock_type) { + case XTThreadLockInfo::SPIN_LOCK: + lock_type = "XTSpinLock"; + lock_name = li->li_spin_lock->spl_name; + break; + case XTThreadLockInfo::RW_MUTEX: + lock_type = "XTRWMutex"; + lock_name = li->li_rw_mutex->xs_name; + break; + case XTThreadLockInfo::MUTEX: + lock_type = "xt_mutex_struct"; +#ifdef XT_WIN + lock_name = li->li_mutex->mt_name; +#else + lock_name = li->li_mutex->mu_name; +#endif + break; + case XTThreadLockInfo::RW_LOCK: + lock_type = "xt_rwlock_struct"; + lock_name = li->li_rwlock->rw_name; + break; + case XTThreadLockInfo::FAST_LOCK: + lock_type = "XTFastLock"; + lock_name = li->li_fast_lock->fal_name; + break; + case XTThreadLockInfo::FAST_RW_LOCK: + lock_type = "XTFastRWLock"; + lock_name = li->li_fast_rwlock->frw_name; + 
break; + case XTThreadLockInfo::SPIN_RW_LOCK: + lock_type = 
"XTSpinRWLock"; + lock_name = li->li_spin_rwlock->srw_name; + break; + case XTThreadLockInfo::ATOMIC_RW_LOCK: + lock_type = "XTAtomicRWLock"; + lock_name = li->li_atomic_rwlock->arw_name; + break; + } + + /* Print the index of this lock in the list, not the total count: */ + xt_ttracef(self, " #lock#%d: type: %s name: %s \n", i, lock_type, lock_name); + } +} + +#endif + diff --git a/storage/pbxt/src/locklist_xt.h b/storage/pbxt/src/locklist_xt.h new file mode 100644 index 00000000000..f0f16009ca1 --- /dev/null +++ b/storage/pbxt/src/locklist_xt.h @@ -0,0 +1,97 @@ +/* Copyright (c) 2009 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2009-01-20 Vladimir Kolesnikov + * + * H&G2JCtL + */ + +#ifndef __xt_locklist_h__ +#define __xt_locklist_h__ + +#ifdef DEBUG +//#define XT_THREAD_LOCK_INFO +#ifndef XT_WIN +/* We need DEBUG_LOCKING in order to enable pthread function wrappers */ +#define DEBUG_LOCKING +#endif +#endif + +#include "xt_defs.h" + +struct XTThread; +struct XTSpinLock; +struct XTRWMutex; +struct xt_mutex_struct; +struct xt_rwlock_struct; +struct XTFastLock; +struct XTFastRWLock; +struct XTSpinRWLock; +struct XTAtomicRWLock; + +#ifdef XT_THREAD_LOCK_INFO + +#define XT_THREAD_LOCK_INFO_MAX_COUNT 50 + +#ifdef XT_WIN +#define LOCKLIST_ARG_SUFFIX(name) #name " in " __FUNCTION__ "() at " __FILE__ ":" QUOTE(__LINE__) +#else +#define LOCKLIST_ARG_SUFFIX(name) #name " in " QUOTE(__PRETTY_FUNCTION__) "() at " QUOTE(__FILE__) ":" QUOTE(__LINE__) +#endif + +/* + * An instance of XTThreadLockInfo class keeps information about a lock kept by a thread. + * There's a list of XTThreadLockInfo instances per thread. An instance can be included + * into several thread lists in case of shared locks. 
+ */ +typedef struct XTThreadLockInfo { + + enum LockType { SPIN_LOCK, RW_MUTEX, MUTEX, RW_LOCK, FAST_LOCK, FAST_RW_LOCK, SPIN_RW_LOCK, ATOMIC_RW_LOCK }; + + LockType li_lock_type; + + union { + XTSpinLock *li_spin_lock; // SPIN_LOCK + XTRWMutex *li_rw_mutex; // RW_MUTEX + XTFastLock *li_fast_lock; // FAST_LOCK + XTFastRWLock *li_fast_rwlock; // FAST_RW_LOCK + XTSpinRWLock *li_spin_rwlock; // SPIN_RW_LOCK + XTAtomicRWLock *li_atomic_rwlock; // ATOMIC_RW_LOCK + xt_mutex_struct *li_mutex; // MUTEX + xt_rwlock_struct *li_rwlock; // RW_LOCK + }; +} +XTThreadLockInfoRec, *XTThreadLockInfoPtr; + +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTSpinLock *lock); +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTRWMutex *lock); +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTFastLock *lock); +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTFastRWLock *lock); +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTSpinRWLock *lock); +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, XTAtomicRWLock *lock); +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, xt_mutex_struct *lock); +void xt_thread_lock_info_init(XTThreadLockInfoPtr ptr, xt_rwlock_struct *lock); +void xt_thread_lock_info_free(XTThreadLockInfoPtr ptr); + +void xt_thread_lock_info_add_owner (XTThreadLockInfoPtr ptr); +void xt_thread_lock_info_release_owner (XTThreadLockInfoPtr ptr); + +void xt_trace_thread_locks(XTThread *self); + +#endif // XT_THREAD_LOCK_INFO +#endif // __xt_locklist_h__ diff --git a/storage/pbxt/src/memory_xt.cc b/storage/pbxt/src/memory_xt.cc new file mode 100644 index 00000000000..6da8673cfcc --- /dev/null +++ b/storage/pbxt/src/memory_xt.cc @@ -0,0 +1,1138 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, 
or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-04 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <stdio.h> +#include <errno.h> +#include <stdlib.h> +#include <string.h> + +#include "pthread_xt.h" +#include "thread_xt.h" +#include "strutil_xt.h" +#include "trace_xt.h" + +#ifdef DEBUG +#define RECORD_MM +#endif + +#ifdef DEBUG + +#undef xt_malloc +#undef xt_calloc +#undef xt_realloc +#undef xt_free +#undef xt_pfree + +#undef xt_malloc_ns +#undef xt_calloc_ns +#undef xt_realloc_ns +#undef xt_free_ns + +void *xt_malloc(XTThreadPtr self, size_t size); +void *xt_calloc(XTThreadPtr self, size_t size); +xtBool xt_realloc(XTThreadPtr self, void **ptr, size_t size); +void xt_free(XTThreadPtr self, void *ptr); +void xt_pfree(XTThreadPtr self, void **ptr); + +void *xt_malloc_ns(size_t size); +void *xt_calloc_ns(size_t size); +xtBool xt_realloc_ns(void **ptr, size_t size); +void xt_free_ns(void *ptr); + +#define ADD_TOTAL_ALLOCS 4000 + +#define SHIFT_RIGHT(ptr, n) memmove(((char *) (ptr)) + sizeof(MissingMemoryRec), (ptr), (long) (n) * sizeof(MissingMemoryRec)) +#define SHIFT_LEFT(ptr, n) memmove((ptr), ((char *) (ptr)) + sizeof(MissingMemoryRec), (long) (n) * sizeof(MissingMemoryRec)) + +#define STACK_TRACE_DEPTH 4 + +typedef struct MissingMemory { + void *mm_ptr; + xtWord4 id; + xtWord2 line_nr; + xtWord2 trace_count; + c_char *mm_file; + c_char *mm_func[STACK_TRACE_DEPTH]; +} MissingMemoryRec, *MissingMemoryPtr; + +static MissingMemoryRec *mm_addresses = NULL; +static long 
mm_nr_in_use = 0L; +static long mm_total_allocated = 0L; +static xtWord4 mm_alloc_count = 0; +static xt_mutex_type mm_mutex; + +#ifdef RECORD_MM +static long mm_find_pointer(void *ptr); +#endif + +#endif + +/* + * ----------------------------------------------------------------------- + * STANDARD SYSTEM BASED MEMORY ALLOCATION + */ + +xtPublic void *xt_malloc(XTThreadPtr self, size_t size) +{ + void *ptr; + + if (!(ptr = malloc(size))) { + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return NULL; + } + return ptr; +} + +xtPublic xtBool xt_realloc(XTThreadPtr self, void **ptr, size_t size) +{ + void *new_ptr; + + if (!(new_ptr = realloc(*ptr, size))) { + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return FAILED; + } + *ptr = new_ptr; + return OK; +} + +xtPublic void xt_free(XTThreadPtr self __attribute__((unused)), void *ptr) +{ + free(ptr); +} + +xtPublic void *xt_calloc(XTThreadPtr self, size_t size) +{ + void *ptr; + + if ((ptr = xt_malloc(self, size))) + memset(ptr, 0, size); + return ptr; +} + +#undef xt_pfree + +xtPublic void xt_pfree(XTThreadPtr self, void **ptr) +{ + if (*ptr) { + void *p = *ptr; + + *ptr = NULL; + xt_free(self, p); + } +} + +/* + * ----------------------------------------------------------------------- + * SYSTEM MEMORY ALLOCATION WITHOUT A THREAD + */ + +xtPublic void *xt_malloc_ns(size_t size) +{ + void *ptr; + + if (!(ptr = malloc(size))) { + xt_register_errno(XT_REG_CONTEXT, XT_ENOMEM); + return NULL; + } + return ptr; +} + +xtPublic void *xt_calloc_ns(size_t size) +{ + void *ptr; + + if (!(ptr = malloc(size))) { + xt_register_errno(XT_REG_CONTEXT, XT_ENOMEM); + return NULL; + } + memset(ptr, 0, size); + return ptr; +} + +xtPublic xtBool xt_realloc_ns(void **ptr, size_t size) +{ + void *new_ptr; + + if (!(new_ptr = realloc(*ptr, size))) + return xt_register_errno(XT_REG_CONTEXT, XT_ENOMEM); + *ptr = new_ptr; + return OK; +} + +xtPublic void xt_free_ns(void *ptr) +{ + free(ptr); +} + +#ifdef DEBUG + +/* + * 
----------------------------------------------------------------------- + * MEMORY SEARCHING CODE + */ + +#define MM_THROW_ASSERTION(str) mm_throw_assertion(self, __FUNC__, __FILE__, __LINE__, str) + +static void mm_throw_assertion(XTThreadPtr self, c_char *func, c_char *file, u_int line, c_char *str) +{ + printf("***** MM:FATAL %s\n", str); + xt_throw_assertion(self, func, file, line, str); +} + +/* + * ----------------------------------------------------------------------- + * MEMORY SEARCHING CODE + */ + +static int mm_debug_ik_inc; +static int mm_debug_ik_dec; +static int mm_debug_ik_no; + +/* + * Call this function where the missing memory + * is referenced. + */ +xtPublic void mm_trace_inc(XTThreadPtr self, XTMMTraceRefPtr tr) +{ + int i; + +#ifdef RECORD_MM + if (xt_lock_mutex(self, &mm_mutex)) { + long mm; + + mm = mm_find_pointer(tr); + if (mm >= 0) + mm_addresses[mm].trace_count = 1; + xt_unlock_mutex(self, &mm_mutex); + } +#endif + mm_debug_ik_inc++; + if (tr->mm_pos < XT_MM_STACK_TRACE-1) { + tr->mm_trace[tr->mm_pos++] = self->t_name[0] == 'S' ? XT_MM_TRACE_SW_INC : XT_MM_TRACE_INC; + for (i=1; i<=XT_MM_TRACE_DEPTH; i++) { + if (self->t_call_top-i < 0) + break; + if (tr->mm_pos < XT_MM_STACK_TRACE-1) { + tr->mm_line[tr->mm_pos] = self->t_call_stack[self->t_call_top-i].cs_line; + tr->mm_trace[tr->mm_pos++] = self->t_call_stack[self->t_call_top-i].cs_func; + } + else if (tr->mm_pos < XT_MM_STACK_TRACE) + tr->mm_trace[tr->mm_pos++] = XT_MM_TRACE_ERROR; + } + } + else if (tr->mm_pos < XT_MM_STACK_TRACE) + tr->mm_trace[tr->mm_pos++] = XT_MM_TRACE_ERROR; +} + +xtPublic void mm_trace_dec(XTThreadPtr self, XTMMTraceRefPtr tr) +{ + int i; + +#ifdef RECORD_MM + if (xt_lock_mutex(self, &mm_mutex)) { + long mm; + + mm = mm_find_pointer(tr); + if (mm >= 0) + mm_addresses[mm].trace_count = 1; + xt_unlock_mutex(self, &mm_mutex); + } +#endif + mm_debug_ik_dec++; + if (tr->mm_pos < XT_MM_STACK_TRACE-1) { + tr->mm_trace[tr->mm_pos++] = self->t_name[0] == 'S' ? 
XT_MM_TRACE_SW_DEC : XT_MM_TRACE_DEC; + for (i=1; i<=XT_MM_TRACE_DEPTH; i++) { + if (self->t_call_top-i < 0) + break; + if (tr->mm_pos < XT_MM_STACK_TRACE-1) { + tr->mm_line[tr->mm_pos] = self->t_call_stack[self->t_call_top-i].cs_line; + tr->mm_trace[tr->mm_pos++] = self->t_call_stack[self->t_call_top-i].cs_func; + } + else if (tr->mm_pos < XT_MM_STACK_TRACE) + tr->mm_trace[tr->mm_pos++] = XT_MM_TRACE_ERROR; + } + } + else if (tr->mm_pos < XT_MM_STACK_TRACE) + tr->mm_trace[tr->mm_pos++] = XT_MM_TRACE_ERROR; +} + +xtPublic void mm_trace_init(XTThreadPtr self, XTMMTraceRefPtr tr) +{ + mm_debug_ik_no++; + tr->mm_id = (u_int) mm_debug_ik_no; + tr->mm_pos = 0; + mm_trace_inc(self, tr); +} + +xtPublic void mm_trace_print(XTMMTraceRefPtr tr) +{ + int i, cnt = 0; + + for (i=0; i<tr->mm_pos; i++) { + if (tr->mm_trace[i] == XT_MM_TRACE_INC) { + if (i > 0) + printf("\n"); + cnt++; + printf("INC (%d) ", cnt); + } + else if (tr->mm_trace[i] == XT_MM_TRACE_SW_INC) { + /* An increment by the sweeper: count up, like XT_MM_TRACE_INC. */ + if (i > 0) + printf("\n"); + cnt++; + printf("SW-INC (%d) ", cnt); + } + else if (tr->mm_trace[i] == XT_MM_TRACE_DEC) { + if (i > 0) + printf("\n"); + printf("DEC (%d) ", cnt); + cnt--; + } + else if (tr->mm_trace[i] == XT_MM_TRACE_SW_DEC) { + if (i > 0) + printf("\n"); + printf("SW-DEC (%d) ", cnt); + cnt--; + } + else if (tr->mm_trace[i] == XT_MM_TRACE_ERROR) { + if (i > 0) + printf("\n"); + printf("ERROR: Space out"); + } + else + printf("%s(%d) ", tr->mm_trace[i], (int) tr->mm_line[i]); + } + printf("\n"); +} + +/* Call this function on exit, when you know the memory is missing. */ +static void mm_debug_trace_count(XTMMTraceRefPtr tr) +{ + printf("MM Trace ID: %d\n", tr->mm_id); + mm_trace_print(tr); +} + +/* This gives the sum of allocations, etc. 
*/ +static void mm_debug_trace_sum(void) +{ + if (mm_debug_ik_no) { + printf("MM Trace INC: %d\n", mm_debug_ik_inc); + printf("MM Trace DEC: %d\n", mm_debug_ik_dec); + printf("MM Trace ALL: %d\n", mm_debug_ik_no); + } +} + +/* + * ----------------------------------------------------------------------- + * DEBUG MEMORY ALLOCATION AND HEAP CHECKING + */ + +#ifdef RECORD_MM +static long mm_find_pointer(void *ptr) +{ + register long i, n, guess; + + i = 0; + n = mm_nr_in_use; + while (i < n) { + guess = (i + n - 1) >> 1; + if (ptr == mm_addresses[guess].mm_ptr) + return(guess); + if (ptr < mm_addresses[guess].mm_ptr) + n = guess; + else + i = guess + 1; + } + return(-1); +} + +static long mm_add_pointer(void *ptr, u_int id) +{ +#pragma unused(id) + register int i, n, guess; + + if (mm_nr_in_use == mm_total_allocated) { + /* Not enough space, add more: */ + MissingMemoryRec *new_addresses; + + new_addresses = (MissingMemoryRec *) xt_calloc_ns(sizeof(MissingMemoryRec) * (mm_total_allocated + ADD_TOTAL_ALLOCS)); + if (!new_addresses) + return(-1); + + if (mm_addresses) { + memcpy(new_addresses, mm_addresses, sizeof(MissingMemoryRec) * mm_total_allocated); + free(mm_addresses); + } + + mm_addresses = new_addresses; + mm_total_allocated += ADD_TOTAL_ALLOCS; + } + + i = 0; + n = mm_nr_in_use; + while (i < n) { + guess = (i + n - 1) >> 1; + if (ptr < mm_addresses[guess].mm_ptr) + n = guess; + else + i = guess + 1; + } + + SHIFT_RIGHT(&mm_addresses[i], mm_nr_in_use - i); + mm_nr_in_use++; + mm_addresses[i].mm_ptr = ptr; + return(i); +} + +xtPublic char *mm_watch_point = 0; + +static long mm_remove_pointer(void *ptr) +{ + register int i, n, guess; + + if (mm_watch_point == ptr) + printf("Hit watch point!\n"); + + i = 0; + n = mm_nr_in_use; + while (i < n) { + guess = (i + n - 1) >> 1; + if (ptr == mm_addresses[guess].mm_ptr) + goto remove; + if (ptr < mm_addresses[guess].mm_ptr) + n = guess; + else + i = guess + 1; + } + return(-1); + + remove: + /* Decrease the number of sets, 
and shift left: */ + mm_nr_in_use--; + SHIFT_LEFT(&mm_addresses[guess], mm_nr_in_use - guess); + return(guess); +} + +static void mm_add_core_ptr(XTThreadPtr self, void *ptr, u_int id, u_int line, c_char *file_name) +{ + long mm; + + mm = mm_add_pointer(ptr, id); + if (mm < 0) { + MM_THROW_ASSERTION("MM ERROR: Cannot allocate table big enough!"); + return; + } + + /* Record the pointer: */ + if (mm_alloc_count >= 4115 && mm_alloc_count <= 4130) { + if (id) + mm_addresses[mm].id = id; + else + mm_addresses[mm].id = mm_alloc_count++; + } + else { + if (id) + mm_addresses[mm].id = id; + else + mm_addresses[mm].id = mm_alloc_count++; + } + mm_addresses[mm].mm_ptr = ptr; + mm_addresses[mm].line_nr = (ushort) line; + if (file_name) + mm_addresses[mm].mm_file = file_name; + else + mm_addresses[mm].mm_file = "?"; + if (self) { + for (int i=1; i<=STACK_TRACE_DEPTH; i++) { + if (self->t_call_top-i >= 0) + mm_addresses[mm].mm_func[i-1] = self->t_call_stack[self->t_call_top-i].cs_func; + else + mm_addresses[mm].mm_func[i-1] = NULL; + } + } + else { + for (int i=0; i<STACK_TRACE_DEPTH; i++) + mm_addresses[mm].mm_func[i] = NULL; + } +} + +static void mm_remove_core_ptr(void *ptr) +{ + XTThreadPtr self = NULL; + long mm; + + mm = mm_remove_pointer(ptr); + if (mm < 0) { + MM_THROW_ASSERTION("Pointer not allocated"); + return; + } +} + +static void mm_throw_assertion(MissingMemoryPtr mm_ptr, void *p, c_char *message); + +static long mm_find_core_ptr(void *ptr) +{ + long mm; + + mm = mm_find_pointer(ptr); + if (mm < 0) + mm_throw_assertion(NULL, ptr, "Pointer not allocated"); + return(mm); +} + +static void mm_replace_core_ptr(long i, void *ptr) +{ + XTThreadPtr self = NULL; + MissingMemoryRec tmp = mm_addresses[i]; + long mm; + + mm_remove_pointer(mm_addresses[i].mm_ptr); + mm = mm_add_pointer(ptr, mm_addresses[i].id); + if (mm < 0) { + MM_THROW_ASSERTION("Cannot allocate table big enough!"); + return; + } + mm_addresses[mm] = tmp; + mm_addresses[mm].mm_ptr = ptr; +} +#endif + 
+static void mm_throw_assertion(MissingMemoryPtr mm_ptr, void *p, c_char *message) +{ + XTThreadPtr self = NULL; + char str[200]; + + if (mm_ptr) { + sprintf(str, "MM: %08lX (#%ld) %s:%d %s", + (unsigned long) mm_ptr->mm_ptr, + (long) mm_ptr->id, + xt_last_name_of_path(mm_ptr->mm_file), + (int) mm_ptr->line_nr, + message); + } + else + sprintf(str, "MM: %08lX %s", (unsigned long) p, message); + MM_THROW_ASSERTION(str); +} + +/* + * ----------------------------------------------------------------------- + * MISSING MEMORY PUBLIC ROUTINES + */ + +#define MEM_DEBUG_HDR_SIZE offsetof(MemoryDebugRec, data) +#define MEM_TRAILER_SIZE 2 +#define MEM_HEADER 0x01010101 +#define MEM_FREED 0x03030303 +#define MEM_TRAILER_BYTE 0x02 +#define MEM_FREED_BYTE 0x03 + +typedef struct MemoryDebug { + xtWord4 check; + xtWord4 size; + char data[200]; +} MemoryDebugRec, *MemoryDebugPtr; + +static size_t mm_checkmem(XTThreadPtr self, MissingMemoryPtr mm_ptr, void *p, xtBool freeme) +{ + unsigned char *ptr = (unsigned char *) p - MEM_DEBUG_HDR_SIZE; + MemoryDebugPtr debug_ptr = (MemoryDebugPtr) ptr; + size_t size = debug_ptr->size; + long a_value; /* Added to simplify debugging. 
*/ + + if (!ASSERT(p)) + return(0); + if (!ASSERT(((long) p & 1L) == 0)) + return(0); + a_value = MEM_FREED; + if (debug_ptr->check == MEM_FREED) { + mm_throw_assertion(mm_ptr, p, "Pointer already freed 'debug_ptr->check != MEM_FREED'"); + return(0); + } + a_value = MEM_HEADER; + if (debug_ptr->check != MEM_HEADER) { + mm_throw_assertion(mm_ptr, p, "Header not valid 'debug_ptr->check != MEM_HEADER'"); + return(0); + } + a_value = MEM_TRAILER_BYTE; + if (!(*((unsigned char *) ptr + size + MEM_DEBUG_HDR_SIZE) == MEM_TRAILER_BYTE && + *((unsigned char *) ptr + size + MEM_DEBUG_HDR_SIZE + 1L) == MEM_TRAILER_BYTE)) { + mm_throw_assertion(mm_ptr, p, "Trailer overwritten"); + return(0); + } + + if (freeme) { + debug_ptr->check = MEM_FREED; + *((unsigned char *) ptr + size + MEM_DEBUG_HDR_SIZE) = MEM_FREED_BYTE; + *((unsigned char *) ptr + size + MEM_DEBUG_HDR_SIZE + 1L) = MEM_FREED_BYTE; + + memset(((unsigned char *) ptr) + MEM_DEBUG_HDR_SIZE, 0xF5, size); + xt_free(self, ptr); + } + + return size; +} + +xtBool xt_mm_scan_core(void) +{ + long mm; + + if (!mm_addresses) + return TRUE; + + if (!xt_lock_mutex(NULL, &mm_mutex)) + return TRUE; + + for (mm=0; mm<mm_nr_in_use; mm++) { + mm_checkmem(NULL, &mm_addresses[mm], mm_addresses[mm].mm_ptr, FALSE); + } + + xt_unlock_mutex(NULL, &mm_mutex); + return TRUE; +} + +void xt_mm_memmove(void *block, void *dest, void *source, size_t size) +{ + if (block) { + MemoryDebugPtr debug_ptr = (MemoryDebugPtr) ((char *) block - MEM_DEBUG_HDR_SIZE); + +#ifdef RECORD_MM + if (xt_lock_mutex(NULL, &mm_mutex)) { + mm_find_core_ptr(block); + xt_unlock_mutex(NULL, &mm_mutex); + } +#endif + mm_checkmem(NULL, NULL, block, FALSE); + + if (dest < block || (char *) dest > (char *) block + debug_ptr->size) + mm_throw_assertion(NULL, block, "Destination not in block"); + if ((char *) dest + size > (char *) block + debug_ptr->size) + mm_throw_assertion(NULL, block, "Copy will overwrite memory"); + } + + memmove(dest, source, size); +} + +void 
xt_mm_memcpy(void *block, void *dest, void *source, size_t size) +{ + if (block) { + MemoryDebugPtr debug_ptr = (MemoryDebugPtr) ((char *) block - MEM_DEBUG_HDR_SIZE); + +#ifdef RECORD_MM + if (xt_lock_mutex(NULL, &mm_mutex)) { + mm_find_core_ptr(block); + xt_unlock_mutex(NULL, &mm_mutex); + } +#endif + mm_checkmem(NULL, NULL, block, FALSE); + + if (dest < block || (char *) dest > (char *) block + debug_ptr->size) + mm_throw_assertion(NULL, block, "Destination not in block"); + if ((char *) dest + size > (char *) block + debug_ptr->size) + mm_throw_assertion(NULL, block, "Copy will overwrite memory"); + } + + memcpy(dest, source, size); +} + +void xt_mm_memset(void *block, void *dest, int value, size_t size) +{ + if (block) { + MemoryDebugPtr debug_ptr = (MemoryDebugPtr) ((char *) block - MEM_DEBUG_HDR_SIZE); + +#ifdef RECORD_MM + if (xt_lock_mutex(NULL, &mm_mutex)) { + mm_find_core_ptr(block); + xt_unlock_mutex(NULL, &mm_mutex); + } +#endif + mm_checkmem(NULL, NULL, block, FALSE); + + if (dest < block || (char *) dest > (char *) block + debug_ptr->size) + mm_throw_assertion(NULL, block, "Destination not in block"); + if ((char *) dest + size > (char *) block + debug_ptr->size) + mm_throw_assertion(NULL, block, "Copy will overwrite memory"); + } + + memset(dest, value, size); +} + +void *xt_mm_malloc(XTThreadPtr self, size_t size, u_int line __attribute__((unused)), c_char *file __attribute__((unused))) +{ + unsigned char *p; + + if (size > (600*1024*1024)) + mm_throw_assertion(NULL, NULL, "Very large block allocated - maybe an error"); + p = (unsigned char *) xt_malloc(self, size + MEM_DEBUG_HDR_SIZE + MEM_TRAILER_SIZE); + if (!p) + return NULL; + + memset(p, 0x55, size + MEM_DEBUG_HDR_SIZE + MEM_TRAILER_SIZE); + + ((MemoryDebugPtr) p)->check = MEM_HEADER; + ((MemoryDebugPtr) p)->size = size; + *(p + size + MEM_DEBUG_HDR_SIZE) = MEM_TRAILER_BYTE; + *(p + size + MEM_DEBUG_HDR_SIZE + 1L) = MEM_TRAILER_BYTE; + +#ifdef RECORD_MM + xt_lock_mutex(self, &mm_mutex); + 
mm_add_core_ptr(self, p + MEM_DEBUG_HDR_SIZE, 0, line, file); + xt_unlock_mutex(self, &mm_mutex); +#endif + + return p + MEM_DEBUG_HDR_SIZE; +} + +void *xt_mm_calloc(XTThreadPtr self, size_t size, u_int line __attribute__((unused)), c_char *file __attribute__((unused))) +{ + unsigned char *p; + + if (size > (500*1024*1024)) + mm_throw_assertion(NULL, NULL, "Very large block allocated - maybe an error"); + p = (unsigned char *) xt_calloc(self, size + MEM_DEBUG_HDR_SIZE + MEM_TRAILER_SIZE); + if (!p) + return NULL; + + ((MemoryDebugPtr) p)->check = MEM_HEADER; + ((MemoryDebugPtr) p)->size = size; + *(p + size + MEM_DEBUG_HDR_SIZE) = MEM_TRAILER_BYTE; + *(p + size + MEM_DEBUG_HDR_SIZE + 1L) = MEM_TRAILER_BYTE; + +#ifdef RECORD_MM + xt_lock_mutex(self, &mm_mutex); + mm_add_core_ptr(self, p + MEM_DEBUG_HDR_SIZE, 0, line, file); + xt_unlock_mutex(self, &mm_mutex); +#endif + + return p + MEM_DEBUG_HDR_SIZE; +} + +xtBool xt_mm_sys_realloc(XTThreadPtr self, void **ptr, size_t newsize, u_int line, c_char *file) +{ + return xt_mm_realloc(self, ptr, newsize, line, file); +} + +xtBool xt_mm_realloc(XTThreadPtr self, void **ptr, size_t newsize, u_int line, c_char *file) +{ + unsigned char *oldptr = (unsigned char *) *ptr; + size_t size; +#ifdef RECORD_MM + long mm; +#endif + unsigned char *pnew; + + if (!oldptr) { + *ptr = xt_mm_malloc(self, newsize, line, file); + return *ptr ? TRUE : FALSE; + } + +#ifdef RECORD_MM + xt_lock_mutex(self, &mm_mutex); + if ((mm = mm_find_core_ptr(oldptr)) < 0) { + xt_unlock_mutex(self, &mm_mutex); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return FAILED; + } + xt_unlock_mutex(self, &mm_mutex); +#endif + + oldptr = oldptr - MEM_DEBUG_HDR_SIZE; + size = ((MemoryDebugPtr) oldptr)->size; + + ASSERT(((MemoryDebugPtr) oldptr)->check == MEM_HEADER); + ASSERT(*((unsigned char *) oldptr + size + MEM_DEBUG_HDR_SIZE) == MEM_TRAILER_BYTE && + *((unsigned char *) oldptr + size + MEM_DEBUG_HDR_SIZE + 1L) == MEM_TRAILER_BYTE); + + /* Realloc always moves! 
*/ + pnew = (unsigned char *) xt_malloc(self, newsize + MEM_DEBUG_HDR_SIZE + MEM_TRAILER_SIZE); + if (!pnew) { + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return FAILED; + } + + if (newsize > size) { + memcpy(((MemoryDebugPtr) pnew)->data, ((MemoryDebugPtr) oldptr)->data, size); + memset(((MemoryDebugPtr) pnew)->data + size, 0x55, newsize - size); + } + else + memcpy(((MemoryDebugPtr) pnew)->data, ((MemoryDebugPtr) oldptr)->data, newsize); + + ((MemoryDebugPtr) pnew)->check = MEM_HEADER; + ((MemoryDebugPtr) pnew)->size = newsize; + *(pnew + newsize + MEM_DEBUG_HDR_SIZE) = MEM_TRAILER_BYTE; + *(pnew + newsize + MEM_DEBUG_HDR_SIZE + 1L) = MEM_TRAILER_BYTE; + +#ifdef RECORD_MM + xt_lock_mutex(self, &mm_mutex); + if ((mm = mm_find_core_ptr(oldptr + MEM_DEBUG_HDR_SIZE)) < 0) { + xt_unlock_mutex(self, &mm_mutex); + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return FAILED; + } + mm_replace_core_ptr(mm, pnew + MEM_DEBUG_HDR_SIZE); + xt_unlock_mutex(self, &mm_mutex); +#endif + + memset(oldptr, 0x55, size + MEM_DEBUG_HDR_SIZE + MEM_TRAILER_SIZE); + xt_free(self, oldptr); + + *ptr = pnew + MEM_DEBUG_HDR_SIZE; + return OK; +} + +void xt_mm_free(XTThreadPtr self, void *ptr) +{ +#ifdef RECORD_MM + if (xt_lock_mutex(self, &mm_mutex)) { + mm_remove_core_ptr(ptr); + xt_unlock_mutex(self, &mm_mutex); + } +#endif + mm_checkmem(self, NULL, ptr, TRUE); +} + +void xt_mm_pfree(XTThreadPtr self, void **ptr) +{ + if (*ptr) { + void *p = *ptr; + + *ptr = NULL; + xt_mm_free(self, p); + } +} + +size_t xt_mm_malloc_size(XTThreadPtr self, void *ptr) +{ + size_t size = 0; + +#ifdef RECORD_MM + if (xt_lock_mutex(self, &mm_mutex)) { + mm_find_core_ptr(ptr); + xt_unlock_mutex(self, &mm_mutex); + } +#endif + size = mm_checkmem(self, NULL, ptr, FALSE); + return size; +} + +void xt_mm_check_ptr(XTThreadPtr self, void *ptr) +{ + mm_checkmem(self, NULL, ptr, FALSE); +} +#endif + +/* + * ----------------------------------------------------------------------- + * INIT/EXIT MEMORY + */ + +xtPublic xtBool 
xt_init_memory(void) +{ +#ifdef DEBUG + XTThreadPtr self = NULL; + + if (!xt_init_mutex_with_autoname(NULL, &mm_mutex)) + return FALSE; + + mm_addresses = (MissingMemoryRec *) malloc(sizeof(MissingMemoryRec) * ADD_TOTAL_ALLOCS); + if (!mm_addresses) { + MM_THROW_ASSERTION("MM ERROR: Insuffient memory to allocate MM table"); + xt_free_mutex(&mm_mutex); + return FALSE; + } + + memset(mm_addresses, 0, sizeof(MissingMemoryRec) * ADD_TOTAL_ALLOCS); + mm_total_allocated = ADD_TOTAL_ALLOCS; + mm_nr_in_use = 0L; + mm_alloc_count = 0L; +#endif + return TRUE; +} + +xtPublic void debug_ik_count(void *value); +xtPublic void debug_ik_sum(void); + +xtPublic void xt_exit_memory(void) +{ +#ifdef DEBUG + long mm; + int i; + + if (!mm_addresses) + return; + + xt_lock_mutex(NULL, &mm_mutex); + for (mm=0; mm<mm_nr_in_use; mm++) { + MissingMemoryPtr mm_ptr = &mm_addresses[mm]; + + xt_logf(XT_NS_CONTEXT, XT_LOG_FATAL, "MM: %p (#%ld) %s:%d Not freed\n", + mm_ptr->mm_ptr, + (long) mm_ptr->id, + xt_last_name_of_path(mm_ptr->mm_file), + (int) mm_ptr->line_nr); + for (i=0; i<STACK_TRACE_DEPTH; i++) { + if (mm_ptr->mm_func[i]) + xt_logf(XT_NS_CONTEXT, XT_LOG_FATAL, "MM: %s\n", mm_ptr->mm_func[i]); + } + /* + * Assumes we place out tracing function in the first + * position!! 
+ */ + if (mm_ptr->trace_count) + mm_debug_trace_count((XTMMTraceRefPtr) mm_ptr->mm_ptr); + } + mm_debug_trace_sum(); + free(mm_addresses); + mm_addresses = NULL; + mm_nr_in_use = 0L; + mm_total_allocated = 0L; + mm_alloc_count = 0L; + xt_unlock_mutex(NULL, &mm_mutex); + + xt_free_mutex(&mm_mutex); +#endif +} + +/* + * ----------------------------------------------------------------------- + * MEMORY ALLOCATION UTILITIES + */ + +#ifdef DEBUG +char *xt_mm_dup_string(XTThreadPtr self, c_char *str, u_int line, c_char *file) +#else +char *xt_dup_string(XTThreadPtr self, c_char *str) +#endif +{ + size_t len; + char *new_str; + + if (!str) + return NULL; + len = strlen(str); +#ifdef DEBUG + new_str = (char *) xt_mm_malloc(self, len + 1, line, file); +#else + new_str = (char *) xt_malloc(self, len + 1); +#endif + if (new_str) + strcpy(new_str, str); + return new_str; +} + +xtPublic char *xt_long_to_str(XTThreadPtr self, long v) +{ + char str[50]; + + sprintf(str, "%lu", v); + return xt_dup_string(self, str); +} + +char *xt_dup_nstr(XTThreadPtr self, c_char *str, int start, size_t len) +{ + char *new_str = (char *) xt_malloc(self, len + 1); + + if (new_str) { + memcpy(new_str, str + start, len); + new_str[len] = 0; + } + return new_str; +} + +/* + * ----------------------------------------------------------------------- + * LIGHT WEIGHT CHECK FUNCTIONS + * Timing related memory management problems my not like the memset + * or other heavy checking. Try this... 
+ */ + +#ifdef LIGHT_WEIGHT_CHECKS +xtPublic void *xt_malloc(XTThreadPtr self, size_t size) +{ + char *ptr; + + if (!(ptr = (char *) malloc(size+8))) { + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return NULL; + } + *((xtWord4 *) ptr) = size; + *((xtWord4 *) (ptr + size + 4)) = 0x7E7EFEFE; + return ptr+4; +} + +xtPublic void xt_check_ptr(void *ptr) +{ + char *old_ptr; + xtWord4 size; + + old_ptr = (char *) ptr; + old_ptr -= 4; + size = *((xtWord4 *) old_ptr); + if (size == 0xDEADBEAF || *((xtWord4 *) (old_ptr + size + 4)) != 0x7E7EFEFE) { + char *dummy = NULL; + + xt_dump_trace(); + *dummy = 40; + } +} + +xtPublic xtBool xt_realloc(XTThreadPtr self, void **ptr, size_t size) +{ + char *old_ptr; + char *new_ptr; + + if ((old_ptr = (char *) *ptr)) { + void check_for_file(char *my_ptr, xtWord4 len); + + xt_check_ptr(old_ptr); + check_for_file((char *) old_ptr, *((xtWord4 *) (old_ptr - 4))); + if (!(new_ptr = (char *) realloc(old_ptr - 4, size+8))) { + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + return FAILED; + } + *((xtWord4 *) new_ptr) = size; + *((xtWord4 *) (new_ptr + size + 4)) = 0x7E7EFEFE; + *ptr = new_ptr+4; + return OK; + } + *ptr = xt_malloc(self, size); + return *ptr != NULL; +} + +xtPublic void xt_free(XTThreadPtr self __attribute__((unused)), void *ptr) +{ + char *old_ptr; + xtWord4 size; + void check_for_file(char *my_ptr, xtWord4 len); + + old_ptr = (char *) ptr; + old_ptr -= 4; + size = *((xtWord4 *) old_ptr); + if (size == 0xDEADBEAF || *((xtWord4 *) (old_ptr + size + 4)) != 0x7E7EFEFE) { + char *dummy = NULL; + + xt_dump_trace(); + *dummy = 41; + } + check_for_file((char *) ptr, size); + *((xtWord4 *) old_ptr) = 0xDEADBEAF; + *((xtWord4 *) (old_ptr + size)) = 0xEFEFDFDF; + *((xtWord4 *) (old_ptr + size + 4)) = 0x1F1F1F1F; + //memset(old_ptr, 0xEF, size+4); + free(old_ptr); +} + +xtPublic void *xt_calloc(XTThreadPtr self, size_t size) +{ + void *ptr; + + if ((ptr = xt_malloc(self, size))) + memset(ptr, 0, size); + return ptr; +} + +#undef xt_pfree + 
+xtPublic void xt_pfree(XTThreadPtr self, void **ptr) +{ + if (*ptr) { + void *p = *ptr; + + *ptr = NULL; + xt_free(self, p); + } +} + +xtPublic void *xt_malloc_ns(size_t size) +{ + char *ptr; + + if (!(ptr = (char *) malloc(size+8))) { + xt_register_errno(XT_REG_CONTEXT, XT_ENOMEM); + return NULL; + } + *((xtWord4 *) ptr) = size; + *((xtWord4 *) (ptr + size + 4)) = 0x7E7EFEFE; + return ptr+4; +} + +xtPublic void *xt_calloc_ns(size_t size) +{ + char *ptr; + + if (!(ptr = (char *) malloc(size+8))) { + xt_register_errno(XT_REG_CONTEXT, XT_ENOMEM); + return NULL; + } + *((xtWord4 *) ptr) = size; + *((xtWord4 *) (ptr + size + 4)) = 0x7E7EFEFE; + memset(ptr+4, 0, size); + return ptr+4; +} + +xtPublic xtBool xt_realloc_ns(void **ptr, size_t size) +{ + char *old_ptr; + char *new_ptr; + + if ((old_ptr = (char *) *ptr)) { + void check_for_file(char *my_ptr, xtWord4 len); + + xt_check_ptr(old_ptr); + check_for_file((char *) old_ptr, *((xtWord4 *) (old_ptr - 4))); + if (!(new_ptr = (char *) realloc(old_ptr - 4, size+8))) + return xt_register_errno(XT_REG_CONTEXT, XT_ENOMEM); + *((xtWord4 *) new_ptr) = size; + *((xtWord4 *) (new_ptr + size + 4)) = 0x7E7EFEFE; + *ptr = new_ptr+4; + return OK; + } + *ptr = xt_malloc_ns(size); + return *ptr != NULL; +} + +xtPublic void xt_free_ns(void *ptr) +{ + char *old_ptr; + xtWord4 size; + void check_for_file(char *my_ptr, xtWord4 len); + + old_ptr = (char *) ptr; + old_ptr -= 4; + size = *((xtWord4 *) old_ptr); + if (size == 0xDEADBEAF || *((xtWord4 *) (old_ptr + size + 4)) != 0x7E7EFEFE) { + char *dummy = NULL; + + xt_dump_trace(); + *dummy = 42; + } + check_for_file((char *) ptr, size); + *((xtWord4 *) old_ptr) = 0xDEADBEAF; + *((xtWord4 *) (old_ptr + size)) = 0xEFEFDFDF; + *((xtWord4 *) (old_ptr + size + 4)) = 0x1F1F1F1F; + //memset(old_ptr, 0xEE, size+4); + free(old_ptr); +} +#endif + diff --git a/storage/pbxt/src/memory_xt.h b/storage/pbxt/src/memory_xt.h new file mode 100644 index 00000000000..3b4150df185 --- /dev/null +++ 
b/storage/pbxt/src/memory_xt.h @@ -0,0 +1,126 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-04 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_memory_h__ +#define __xt_memory_h__ + +#include <string.h> + +#include "xt_defs.h" + +struct XTThread; + +#ifdef DEBUG + +#define XT_MM_STACK_TRACE 200 +#define XT_MM_TRACE_DEPTH 4 +#define XT_MM_TRACE_INC ((char *) 1) +#define XT_MM_TRACE_DEC ((char *) 2) +#define XT_MM_TRACE_SW_INC ((char *) 1) +#define XT_MM_TRACE_SW_DEC ((char *) 2) +#define XT_MM_TRACE_ERROR ((char *) 3) + +typedef struct XTMMTraceRef { + int mm_pos; + u_int mm_id; + u_int mm_line[XT_MM_STACK_TRACE]; + c_char *mm_trace[XT_MM_STACK_TRACE]; +} XTMMTraceRefRec, *XTMMTraceRefPtr; + +#define XT_MM_TRACE_INIT(x) (x)->mm_pos = 0 + +extern char *mm_watch_point; + +#define XT_MEMMOVE(b, d, s, l) xt_mm_memmove(b, d, s, l) +#define XT_MEMCPY(b, d, s, l) xt_mm_memcpy(b, d, s, l) +#define XT_MEMSET(b, d, v, l) xt_mm_memset(b, d, v, l) + +#define xt_malloc(t, s) xt_mm_malloc(t, s, __LINE__, __FILE__) +#define xt_calloc(t, s) xt_mm_calloc(t, s, __LINE__, __FILE__) +#define xt_realloc(t, p, s) xt_mm_realloc(t, p, s, __LINE__, __FILE__) +#define xt_free xt_mm_free +#define xt_pfree xt_mm_pfree + +#define 
xt_malloc_ns(s) xt_mm_malloc(NULL, s, __LINE__, __FILE__) +#define xt_calloc_ns(s) xt_mm_calloc(NULL, s, __LINE__, __FILE__) +#define xt_realloc_ns(p, s) xt_mm_sys_realloc(NULL, p, s, __LINE__, __FILE__) +#define xt_free_ns(p) xt_mm_free(NULL, p) + +void xt_mm_memmove(void *block, void *dest, void *source, size_t size); +void xt_mm_memcpy(void *block, void *dest, void *source, size_t size); +void xt_mm_memset(void *block, void *dest, int value, size_t size); + +void *xt_mm_malloc(struct XTThread *self, size_t size, u_int line, const char *file); +void *xt_mm_calloc(struct XTThread *self, size_t size, u_int line, const char *file); +xtBool xt_mm_realloc(struct XTThread *self, void **ptr, size_t size, u_int line, const char *file); +void xt_mm_free(struct XTThread *self, void *ptr); +void xt_mm_pfree(struct XTThread *self, void **ptr); +size_t xt_mm_malloc_size(struct XTThread *self, void *ptr); +void xt_mm_check_ptr(struct XTThread *self, void *ptr); +xtBool xt_mm_sys_realloc(struct XTThread *self, void **ptr, size_t newsize, u_int line, const char *file); + +#ifndef XT_SCAN_CORE_DEFINED +#define XT_SCAN_CORE_DEFINED +xtBool xt_mm_scan_core(void); +#endif + +void mm_trace_inc(struct XTThread *self, XTMMTraceRefPtr tr); +void mm_trace_dec(struct XTThread *self, XTMMTraceRefPtr tr); +void mm_trace_init(struct XTThread *self, XTMMTraceRefPtr tr); +void mm_trace_print(XTMMTraceRefPtr tr); + +#else + +#define XT_MEMMOVE(b, d, s, l) memmove(d, s, l) +#define XT_MEMCPY(b, d, s, l) memcpy(d, s, l) +#define XT_MEMSET(b, d, v, l) memset(d, v, l) + +void *xt_malloc(struct XTThread *self, size_t size); +void *xt_calloc(struct XTThread *self, size_t size); +xtBool xt_realloc(struct XTThread *self, void **ptr, size_t size); +void xt_free(struct XTThread *self, void *ptr); +void xt_pfree(struct XTThread *self, void **ptr); + +void *xt_malloc_ns(size_t size); +void *xt_calloc_ns(size_t size); +xtBool xt_realloc_ns(void **ptr, size_t size); +void xt_free_ns(void *ptr); + +#define 
xt_pfree(t, p) xt_pfree(t, (void **) p) + +#endif + +#ifdef DEBUG +#define xt_dup_string(t, s) xt_mm_dup_string(t, s, __LINE__, __FILE__) + +char *xt_mm_dup_string(struct XTThread *self, const char *path, u_int line, const char *file); +#else +char *xt_dup_string(struct XTThread *self, const char *path); +#endif + +char *xt_long_to_str(struct XTThread *self, long v); +char *xt_dup_nstr(struct XTThread *self, const char *str, int start, size_t len); + +xtBool xt_init_memory(void); +void xt_exit_memory(void); + +#endif diff --git a/storage/pbxt/src/myxt_xt.cc b/storage/pbxt/src/myxt_xt.cc new file mode 100644 index 00000000000..a8ecbc31cd8 --- /dev/null +++ b/storage/pbxt/src/myxt_xt.cc @@ -0,0 +1,3209 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-05-16 Paul McCullagh + * + * H&G2JCtL + * + * These functions implement the parts of PBXT which must conform to the + * key and row format used by MySQL. 
+ */ + +#include "xt_config.h" + +#ifdef DRIZZLED +#include <drizzled/server_includes.h> +#include <drizzled/plugin.h> +#include <drizzled/show.h> +#include <drizzled/field/blob.h> +#include <drizzled/field/enum.h> +#include <drizzled/field/varstring.h> +#include <drizzled/current_session.h> +#include <drizzled/sql_lex.h> +#include <drizzled/session.h> +extern "C" struct charset_info_st *session_charset(Session *session); +extern pthread_key_t THR_Session; +#else +#include "mysql_priv.h" +#include <mysql/plugin.h> +#endif + +#ifdef HAVE_ISNAN +#include <math.h> +#endif + +#include "ha_pbxt.h" + +#include "myxt_xt.h" +#include "strutil_xt.h" +#include "database_xt.h" +#ifdef XT_STREAMING +#include "streaming_xt.h" +#endif +#include "cache_xt.h" +#include "datalog_xt.h" + +#define SLAP_DEBUG + +#ifdef DRIZZLED +#define swap_variables(TYPE, a, b) \ + do { \ + TYPE dummy; \ + dummy= a; \ + a= b; \ + b= dummy; \ + } while (0) + + +#define CMP_NUM(a,b) (((a) < (b)) ? -1 : ((a) == (b)) ? 0 : 1) +#else +#define get_rec_bits(bit_ptr, bit_ofs, bit_len) \ + (((((uint16) (bit_ptr)[1] << 8) | (uint16) (bit_ptr)[0]) >> (bit_ofs)) & \ + ((1 << (bit_len)) - 1)) +#endif + +#define FIX_LENGTH(cs, pos, length, char_length) \ + do { \ + if ((length) > char_length) \ + char_length= my_charpos(cs, pos, pos+length, char_length); \ + set_if_smaller(char_length,length); \ + } while(0) + +#ifdef store_key_length_inc +#undef store_key_length_inc +#endif +#define store_key_length_inc(key,length) \ +{ if ((length) < 255) \ + { *(key)++=(length); } \ + else \ + { *(key)=255; mi_int2store((key)+1,(length)); (key)+=3; } \ +} + +#define set_rec_bits(bits, bit_ptr, bit_ofs, bit_len) \ +{ \ + (bit_ptr)[0]= ((bit_ptr)[0] & ~(((1 << (bit_len)) - 1) << (bit_ofs))) | \ + ((bits) << (bit_ofs)); \ + if ((bit_ofs) + (bit_len) > 8) \ + (bit_ptr)[1]= ((bit_ptr)[1] & ~((1 << ((bit_len) - 8 + (bit_ofs))) - 1)) | \ + ((bits) >> (8 - (bit_ofs))); \ +} + +#define clr_rec_bits(bit_ptr, bit_ofs, bit_len) \ + 
set_rec_bits(0, bit_ptr, bit_ofs, bit_len) + +static ulong my_calc_blob_length(uint length, xtWord1 *pos) +{ + switch (length) { + case 1: + return (uint) (uchar) *pos; + case 2: + return (uint) uint2korr(pos); + case 3: + return uint3korr(pos); + case 4: + return uint4korr(pos); + default: + break; + } + return 0; /* Impossible */ +} + +static void my_store_blob_length(byte *pos,uint pack_length,uint length) +{ + switch (pack_length) { + case 1: + *pos= (uchar) length; + break; + case 2: + int2store(pos,length); + break; + case 3: + int3store(pos,length); + break; + case 4: + int4store(pos,length); + default: + break; + } + return; +} + +static int my_compare_text(MX_CONST_CHARSET_INFO *charset_info, uchar *a, uint a_length, + uchar *b, uint b_length, my_bool part_key, + my_bool skip_end_space __attribute__((unused))) +{ + if (!part_key) + /* The last parameter is diff_if_only_endspace_difference, which means + * that end spaces are not ignored. We actually always want + * to ignore end spaces! + */ + return charset_info->coll->strnncollsp(charset_info, a, a_length, + b, b_length, /*(my_bool)!skip_end_space*/0); + return charset_info->coll->strnncoll(charset_info, a, a_length, + b, b_length, part_key); +} + +/* + * ----------------------------------------------------------------------- + * Create a key + */ + +/* + * Derived from _mi_pack_key() + */ +xtPublic u_int myxt_create_key_from_key(XTIndexPtr ind, xtWord1 *key, xtWord1 *old, u_int k_length) +{ + xtWord1 *start_key = key; + XTIndexSegRec *keyseg = ind->mi_seg; + + for (u_int i=0; i<ind->mi_seg_count && (int) k_length > 0; i++, old += keyseg->length, keyseg++) + { + enum ha_base_keytype type = (enum ha_base_keytype) keyseg->type; + u_int length = keyseg->length < k_length ? 
keyseg->length : k_length; + u_int char_length; + xtWord1 *pos; + MX_CONST_CHARSET_INFO *cs = keyseg->charset; + + if (keyseg->null_bit) { + k_length--; + if (!(*key++ = (xtWord1) 1 - *old++)) { /* Copy null marker */ + k_length -= length; + if (keyseg->flag & (HA_VAR_LENGTH_PART | HA_BLOB_PART)) { + k_length -= 2; /* Skip length */ + old += 2; + } + continue; /* Found NULL */ + } + } + char_length= (cs && cs->mbmaxlen > 1) ? length/cs->mbmaxlen : length; + pos = old; + if (keyseg->flag & HA_SPACE_PACK) { + uchar *end = pos + length; + if (type != HA_KEYTYPE_NUM) { + while (end > pos && end[-1] == ' ') + end--; + } + else { + while (pos < end && pos[0] == ' ') + pos++; + } + k_length -= length; + length = (u_int) (end-pos); + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key, char_length); + memcpy((byte*) key,pos,(size_t) char_length); + key += char_length; + continue; + } + if (keyseg->flag & (HA_VAR_LENGTH_PART | HA_BLOB_PART)) { + /* Length of key-part used with mi_rkey() always 2 */ + u_int tmp_length = uint2korr(pos); + k_length -= 2 + length; + pos += 2; + set_if_smaller(length, tmp_length); /* Safety */ + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key,char_length); + old +=2; /* Skip length */ + memcpy((char *) key, pos, (size_t) char_length); + key += char_length; + continue; + } + if (keyseg->flag & HA_SWAP_KEY) + { /* Numerical column */ + pos+=length; + k_length-=length; + while (length--) { + *key++ = *--pos; + } + continue; + } + FIX_LENGTH(cs, pos, length, char_length); + memcpy((byte*) key, pos, char_length); + if (length > char_length) + cs->cset->fill(cs, (char *) (key + char_length), length - char_length, ' '); + key += length; + k_length -= length; + } + + return (u_int) (key - start_key); +} + +/* Derived from _mi_make_key */ +xtPublic u_int myxt_create_key_from_row(XTIndexPtr ind, xtWord1 *key, xtWord1 *record, xtBool *no_duplicate) +{ + register XTIndexSegRec *keyseg = ind->mi_seg; + xtWord1 *pos; + 
xtWord1 *end; + xtWord1 *start; + + start = key; + for (u_int i=0; i<ind->mi_seg_count; i++, keyseg++) + { + enum ha_base_keytype type = (enum ha_base_keytype) keyseg->type; + u_int length = keyseg->length; + u_int char_length; + MX_CONST_CHARSET_INFO *cs = keyseg->charset; + + if (keyseg->null_bit) { + if (record[keyseg->null_pos] & keyseg->null_bit) { + *key++ = 0; /* NULL in key */ + + /* The point is, if a key contains a NULL value + * the duplicate checking must be disabled. + * This is because a NULL value is not considered + * equal to any other value. + */ + if (no_duplicate) + *no_duplicate = FALSE; + continue; + } + *key++ = 1; /* Not NULL */ + } + + char_length= ((cs && cs->mbmaxlen > 1) ? length/cs->mbmaxlen : length); + + pos = record + keyseg->start; + if (type == HA_KEYTYPE_BIT) + { + if (keyseg->bit_length) + { + uchar bits = get_rec_bits((uchar*) record + keyseg->bit_pos, + keyseg->bit_start, keyseg->bit_length); + *key++ = bits; + length--; + } + memcpy((byte*) key, pos, length); + key+= length; + continue; + } + if (keyseg->flag & HA_SPACE_PACK) + { + end = pos + length; + if (type != HA_KEYTYPE_NUM) { + while (end > pos && end[-1] == ' ') + end--; + } + else { + while (pos < end && pos[0] == ' ') + pos++; + } + length = (u_int) (end-pos); + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key,char_length); + memcpy((byte*) key,(byte*) pos,(size_t) char_length); + key += char_length; + continue; + } + if (keyseg->flag & HA_VAR_LENGTH_PART) { + uint pack_length= (keyseg->bit_start == 1 ? 1 : 2); + uint tmp_length= (pack_length == 1 ? 
(uint) *(uchar*) pos : + uint2korr(pos)); + pos += pack_length; /* Skip VARCHAR length */ + set_if_smaller(length,tmp_length); + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key,char_length); + memcpy((byte*) key,(byte*) pos,(size_t) char_length); + key += char_length; + continue; + } + if (keyseg->flag & HA_BLOB_PART) + { + u_int tmp_length = my_calc_blob_length(keyseg->bit_start, pos); + memcpy((byte*) &pos,pos+keyseg->bit_start,sizeof(char*)); + set_if_smaller(length,tmp_length); + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key,char_length); + memcpy((byte*) key,(byte*) pos,(size_t) char_length); + key+= char_length; + continue; + } + if (keyseg->flag & HA_SWAP_KEY) + { /* Numerical column */ +#ifdef HAVE_ISNAN + if (type == HA_KEYTYPE_FLOAT) + { + float nr; + float4get(nr,pos); + if (isnan(nr)) + { + /* Replace NAN with zero */ + bzero(key,length); + key+=length; + continue; + } + } + else if (type == HA_KEYTYPE_DOUBLE) { + double nr; + + float8get(nr,pos); + if (isnan(nr)) { + bzero(key,length); + key+=length; + continue; + } + } +#endif + pos+=length; + while (length--) { + *key++ = *--pos; + } + continue; + } + FIX_LENGTH(cs, pos, length, char_length); + memcpy((byte*) key, pos, char_length); + if (length > char_length) + cs->cset->fill(cs, (char *) key + char_length, length - char_length, ' '); + key += length; + } + + return ind->mi_fix_key ? 
ind->mi_key_size : (u_int) (key - start); /* Return keylength */ +} + +xtPublic u_int myxt_create_foreign_key_from_row(XTIndexPtr ind, xtWord1 *key, xtWord1 *record, XTIndexPtr fkey_ind, xtBool *no_null) +{ + register XTIndexSegRec *keyseg = ind->mi_seg; + register XTIndexSegRec *fkey_keyseg = fkey_ind->mi_seg; + xtWord1 *pos; + xtWord1 *end; + xtWord1 *start; + + start = key; + for (u_int i=0; i<ind->mi_seg_count; i++, keyseg++, fkey_keyseg++) + { + enum ha_base_keytype type = (enum ha_base_keytype) keyseg->type; + u_int length = keyseg->length; + u_int char_length; + MX_CONST_CHARSET_INFO *cs = keyseg->charset; + xtBool is_null = FALSE; + + if (keyseg->null_bit) { + if (record[keyseg->null_pos] & keyseg->null_bit) { + is_null = TRUE; + if (no_null) + *no_null = FALSE; + } + } + + if (fkey_keyseg->null_bit) { + if (is_null) { + *key++ = 0; /* NULL in key */ + + /* The point is, if a key contains a NULL value + * the duplicate checking must be disabled. + * This is because a NULL value is not considered + * equal to any other value. + */ + continue; + } + *key++ = 1; /* Not NULL */ + } + + char_length= ((cs && cs->mbmaxlen > 1) ? length/cs->mbmaxlen : length); + + pos = record + keyseg->start; + if (type == HA_KEYTYPE_BIT) + { + if (keyseg->bit_length) + { + uchar bits = get_rec_bits((uchar*) record + keyseg->bit_pos, + keyseg->bit_start, keyseg->bit_length); + *key++ = bits; + length--; + } + memcpy((byte*) key, pos, length); + key+= length; + continue; + } + if (keyseg->flag & HA_SPACE_PACK) + { + end = pos + length; + if (type != HA_KEYTYPE_NUM) { + while (end > pos && end[-1] == ' ') + end--; + } + else { + while (pos < end && pos[0] == ' ') + pos++; + } + length = (u_int) (end-pos); + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key,char_length); + memcpy((byte*) key,(byte*) pos,(size_t) char_length); + key += char_length; + continue; + } + if (keyseg->flag & HA_VAR_LENGTH_PART) { + uint pack_length= (keyseg->bit_start == 1 ? 
1 : 2); + uint tmp_length= (pack_length == 1 ? (uint) *(uchar*) pos : + uint2korr(pos)); + pos += pack_length; /* Skip VARCHAR length */ + set_if_smaller(length,tmp_length); + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key,char_length); + memcpy((byte*) key,(byte*) pos,(size_t) char_length); + key += char_length; + continue; + } + if (keyseg->flag & HA_BLOB_PART) + { + u_int tmp_length = my_calc_blob_length(keyseg->bit_start, pos); + memcpy((byte*) &pos,pos+keyseg->bit_start,sizeof(char*)); + set_if_smaller(length,tmp_length); + FIX_LENGTH(cs, pos, length, char_length); + store_key_length_inc(key,char_length); + memcpy((byte*) key,(byte*) pos,(size_t) char_length); + key+= char_length; + continue; + } + if (keyseg->flag & HA_SWAP_KEY) + { /* Numerical column */ +#ifdef HAVE_ISNAN + if (type == HA_KEYTYPE_FLOAT) + { + float nr; + float4get(nr,pos); + if (isnan(nr)) + { + /* Replace NAN with zero */ + bzero(key,length); + key+=length; + continue; + } + } + else if (type == HA_KEYTYPE_DOUBLE) { + double nr; + + float8get(nr,pos); + if (isnan(nr)) { + bzero(key,length); + key+=length; + continue; + } + } +#endif + pos+=length; + while (length--) { + *key++ = *--pos; + } + continue; + } + FIX_LENGTH(cs, pos, length, char_length); + memcpy((byte*) key, pos, char_length); + if (length > char_length) + cs->cset->fill(cs, (char *) key + char_length, length - char_length, ' '); + key += length; + } + + return fkey_ind->mi_fix_key ? fkey_ind->mi_key_size : (u_int) (key - start); /* Return keylength */ +} + +/* I may be overcautious here, but can I assume that + * null_ptr refers to my buffer. If I cannot, then I + * cannot use the set_notnull() method. 
+ */ +static void mx_set_notnull_in_record(Field *field, char *record) +{ + if (field->null_ptr) + record[(uint) (field->null_ptr - (uchar *) field->table->record[0])] &= (uchar) ~field->null_bit; +} + +static xtBool mx_is_null_in_record(Field *field, char *record) +{ + if (field->null_ptr) { + if (record[(uint) (field->null_ptr - (uchar *) field->table->record[0])] & (uchar) field->null_bit) + return TRUE; + } + return FALSE; +} + +/* + * PBXT uses a completely different disk format to MySQL so I need a + * method that just returns the byte length and + * pointer to the data in a row. + */ +static char *mx_get_length_and_data(Field *field, char *dest, xtWord4 *len) +{ + char *from; + +#if MYSQL_VERSION_ID < 50114 + from = dest + field->offset(); +#else + from = dest + field->offset(field->table->record[0]); +#endif + switch (field->real_type()) { +#ifndef DRIZZLED + case MYSQL_TYPE_TINY_BLOB: + case MYSQL_TYPE_MEDIUM_BLOB: + case MYSQL_TYPE_LONG_BLOB: +#endif + case MYSQL_TYPE_BLOB: { + /* TODO - Check: this was the original comment: I must set + * *data to non-NULL value, *data == 0, means SQL NULL value. + */ + char *data; + + /* GOTCHA: There is no way this can work! field is shared + * between threads. + char *save = field->ptr; + + field->ptr = (char *) from; + ((Field_blob *) field)->get_ptr(&data); + field->ptr = save; // Restore org row pointer + */ + + xtWord4 packlength = ((Field_blob *) field)->pack_length() - field->table->s->blob_ptr_size; + memcpy(&data, ((char *) from)+packlength, sizeof(char*)); + + *len = ((Field_blob *) field)->get_length((byte *) from); + return data; + } +#ifndef DRIZZLED + case MYSQL_TYPE_STRING: + /* To write this function you would think Field_string::pack + * would serve as a good example, but as far as I can tell + * it has a bug: the test from[length-1] == ' ' assumes + * 1-byte chars. + * + * But this is not relevant because I believe lengthsp + * will give me the correct answer! 
+ */ + *len = field->charset()->cset->lengthsp(field->charset(), from, field->field_length); + return from; + case MYSQL_TYPE_VAR_STRING: { + uint length=uint2korr(from); + + *len = length; + return from+HA_KEY_BLOB_LENGTH; + } +#endif + case MYSQL_TYPE_VARCHAR: { + uint length; + + if (((Field_varstring *) field)->length_bytes == 1) + length = *((unsigned char *) from); + else + length = uint2korr(from); + + *len = length; + return from+((Field_varstring *) field)->length_bytes; + } +#ifndef DRIZZLED + case MYSQL_TYPE_DECIMAL: + case MYSQL_TYPE_TINY: + case MYSQL_TYPE_SHORT: + case MYSQL_TYPE_LONG: + case MYSQL_TYPE_FLOAT: + case MYSQL_TYPE_DOUBLE: + case MYSQL_TYPE_NULL: + case MYSQL_TYPE_TIMESTAMP: + case MYSQL_TYPE_LONGLONG: + case MYSQL_TYPE_INT24: + case MYSQL_TYPE_DATE: + case MYSQL_TYPE_TIME: + case MYSQL_TYPE_DATETIME: + case MYSQL_TYPE_YEAR: + case MYSQL_TYPE_NEWDATE: + case MYSQL_TYPE_BIT: + case MYSQL_TYPE_NEWDECIMAL: + case MYSQL_TYPE_ENUM: + case MYSQL_TYPE_SET: + case MYSQL_TYPE_GEOMETRY: +#else + case DRIZZLE_TYPE_TINY: + case DRIZZLE_TYPE_LONG: + case DRIZZLE_TYPE_DOUBLE: + case DRIZZLE_TYPE_NULL: + case DRIZZLE_TYPE_TIMESTAMP: + case DRIZZLE_TYPE_LONGLONG: + case DRIZZLE_TYPE_DATETIME: + case DRIZZLE_TYPE_DATE: + case DRIZZLE_TYPE_NEWDECIMAL: + case DRIZZLE_TYPE_ENUM: + case DRIZZLE_TYPE_VIRTUAL: +#endif + break; + } + + *len = field->pack_length(); + return from; +} + +/* + * Set the length and data value of a field. + * + * If input data is NULL this is a NULL value. In this case + * we assume the null bit has been set and prepared + * the field as follows: + * + * According to the InnoDB implementation, we need + * to zero out the field data... + * "MySQL seems to assume the field for an SQL NULL + * value is set to zero or space. Not taking this into + * account caused seg faults with NULL BLOB fields, and + * bug number 154 in the MySQL bug database: GROUP BY + * and DISTINCT could treat NULL values inequal". 
+ */ +static void mx_set_length_and_data(Field *field, char *dest, xtWord4 len, char *data) +{ + char *from; + +#if MYSQL_VERSION_ID < 50114 + from = dest + field->offset(); +#else + from = dest + field->offset(field->table->record[0]); +#endif + switch (field->real_type()) { +#ifndef DRIZZLED + case MYSQL_TYPE_TINY_BLOB: + case MYSQL_TYPE_MEDIUM_BLOB: + case MYSQL_TYPE_LONG_BLOB: +#endif + case MYSQL_TYPE_BLOB: { + /* GOTCHA: There is no way that this can work. + * field is shared, because table is shared! + char *save = field->ptr; + + field->ptr = (char *) from; + ((Field_blob *) field)->set_ptr(len, data); + field->ptr = save; // Restore org row pointer + */ + xtWord4 packlength = ((Field_blob *) field)->pack_length() - field->table->s->blob_ptr_size; + + ((Field_blob *) field)->store_length((byte *) from, packlength, len); + memcpy_fixed(((char *) from)+packlength, &data, sizeof(char*)); + + if (data) + mx_set_notnull_in_record(field, dest); + return; + } +#ifndef DRIZZLED + case MYSQL_TYPE_STRING: + if (data) { + mx_set_notnull_in_record(field, dest); + memcpy(from, data, len); + } + else + len = 0; + + /* And I think that fill will do this for me... 
*/ + field->charset()->cset->fill(field->charset(), from + len, field->field_length - len, ' '); + return; + case MYSQL_TYPE_VAR_STRING: + int2store(from, len); + if (data) { + mx_set_notnull_in_record(field, dest); + memcpy(from+HA_KEY_BLOB_LENGTH, data, len); + } + return; +#endif + case MYSQL_TYPE_VARCHAR: + if (((Field_varstring *) field)->length_bytes == 1) + *((unsigned char *) from) = (unsigned char) len; + else + int2store(from, len); + if (data) { + mx_set_notnull_in_record(field, dest); + memcpy(from+((Field_varstring *) field)->length_bytes, data, len); + } + return; +#ifndef DRIZZLED + case MYSQL_TYPE_DECIMAL: + case MYSQL_TYPE_TINY: + case MYSQL_TYPE_SHORT: + case MYSQL_TYPE_LONG: + case MYSQL_TYPE_FLOAT: + case MYSQL_TYPE_DOUBLE: + case MYSQL_TYPE_NULL: + case MYSQL_TYPE_TIMESTAMP: + case MYSQL_TYPE_LONGLONG: + case MYSQL_TYPE_INT24: + case MYSQL_TYPE_DATE: + case MYSQL_TYPE_TIME: + case MYSQL_TYPE_DATETIME: + case MYSQL_TYPE_YEAR: + case MYSQL_TYPE_NEWDATE: + case MYSQL_TYPE_BIT: + case MYSQL_TYPE_NEWDECIMAL: + case MYSQL_TYPE_ENUM: + case MYSQL_TYPE_SET: + case MYSQL_TYPE_GEOMETRY: +#else + case DRIZZLE_TYPE_TINY: + case DRIZZLE_TYPE_LONG: + case DRIZZLE_TYPE_DOUBLE: + case DRIZZLE_TYPE_NULL: + case DRIZZLE_TYPE_TIMESTAMP: + case DRIZZLE_TYPE_LONGLONG: + case DRIZZLE_TYPE_DATETIME: + case DRIZZLE_TYPE_DATE: + case DRIZZLE_TYPE_NEWDECIMAL: + case DRIZZLE_TYPE_ENUM: + case DRIZZLE_TYPE_VIRTUAL: +#endif + break; + } + + if (data) { + mx_set_notnull_in_record(field, dest); + memcpy(from, data, len); + } + else + bzero(from, field->pack_length()); +} + +xtPublic void myxt_set_null_row_from_key(XTOpenTablePtr ot __attribute__((unused)), XTIndexPtr ind, xtWord1 *record) +{ + register XTIndexSegRec *keyseg = ind->mi_seg; + + for (u_int i=0; i<ind->mi_seg_count; i++, keyseg++) { + ASSERT_NS(keyseg->null_bit); + record[keyseg->null_pos] |= keyseg->null_bit; + } +} + +xtPublic void myxt_set_default_row_from_key(XTOpenTablePtr ot, XTIndexPtr ind, xtWord1 
*record) +{ + XTTableHPtr tab = ot->ot_table; + TABLE *table = tab->tab_dic.dic_my_table; + XTIndexSegRec *keyseg = ind->mi_seg; + + xt_lock_mutex_ns(&tab->tab_dic_field_lock); + + for (u_int i=0; i<ind->mi_seg_count; i++, keyseg++) { + + u_int col_idx = keyseg->col_idx; + Field *field = table->field[col_idx]; + byte *field_save = field->ptr; + + field->ptr = table->s->default_values + keyseg->start; + memcpy(record + keyseg->start, field->ptr, field->pack_length()); + record[keyseg->null_pos] &= ~keyseg->null_bit; + record[keyseg->null_pos] |= table->s->default_values[keyseg->null_pos] & keyseg->null_bit; + + field->ptr = field_save; + } + + xt_unlock_mutex_ns(&tab->tab_dic_field_lock); +} + +/* Derived from _mi_put_key_in_record */ +xtPublic xtBool myxt_create_row_from_key(XTOpenTablePtr ot __attribute__((unused)), XTIndexPtr ind, xtWord1 *b_value, u_int key_len, xtWord1 *dest_buff) +{ + byte *record = (byte *) dest_buff; + register byte *key; + byte *pos,*key_end; + register XTIndexSegRec *keyseg = ind->mi_seg; + + /* GOTCHA: When selecting from multiple + * indexes the key values are "merged" into the + * same buffer!! + * This means that this function must not affect + * the value of any other feilds. 
+ * + * I was setting all to NULL: + memset(dest_buff, 0xFF, table->s->null_bytes); + */ + key = (byte *) b_value; + key_end = key + key_len; + for (u_int i=0; i<ind->mi_seg_count; i++, keyseg++) { + if (keyseg->null_bit) { + if (!*key++) + { + record[keyseg->null_pos] |= keyseg->null_bit; + continue; + } + record[keyseg->null_pos] &= ~keyseg->null_bit; + } + if (keyseg->type == HA_KEYTYPE_BIT) + { + uint length = keyseg->length; + + if (keyseg->bit_length) + { + uchar bits= *key++; + set_rec_bits(bits, record + keyseg->bit_pos, keyseg->bit_start, + keyseg->bit_length); + length--; + } + else + { + clr_rec_bits(record + keyseg->bit_pos, keyseg->bit_start, + keyseg->bit_length); + } + memcpy(record + keyseg->start, (byte*) key, length); + key+= length; + continue; + } + if (keyseg->flag & HA_SPACE_PACK) + { + uint length; + get_key_length(length,key); +#ifdef CHECK_KEYS + if (length > keyseg->length || key+length > key_end) + goto err; +#endif + pos = record+keyseg->start; + if (keyseg->type != (int) HA_KEYTYPE_NUM) + { + memcpy(pos,key,(size_t) length); + bfill(pos+length,keyseg->length-length,' '); + } + else + { + bfill(pos,keyseg->length-length,' '); + memcpy(pos+keyseg->length-length,key,(size_t) length); + } + key+=length; + continue; + } + + if (keyseg->flag & HA_VAR_LENGTH_PART) + { + uint length; + get_key_length(length,key); +#ifdef CHECK_KEYS + if (length > keyseg->length || key+length > key_end) + goto err; +#endif + /* Store key length */ + if (keyseg->bit_start == 1) + *(uchar*) (record+keyseg->start)= (uchar) length; + else + int2store(record+keyseg->start, length); + /* And key data */ + memcpy(record+keyseg->start + keyseg->bit_start, (byte*) key, length); + key+= length; + } + else if (keyseg->flag & HA_BLOB_PART) + { + uint length; + get_key_length(length,key); +#ifdef CHECK_KEYS + if (length > keyseg->length || key+length > key_end) + goto err; +#endif + /* key is a pointer into ot_ind_rbuf, which should be + * safe until we move to the next 
index item! + */ + byte *key_ptr = key; // Cannot take the address of a register variable! + memcpy(record+keyseg->start+keyseg->bit_start, + (char*) &key_ptr,sizeof(char*)); + + my_store_blob_length(record+keyseg->start, + (uint) keyseg->bit_start,length); + key+=length; + } + else if (keyseg->flag & HA_SWAP_KEY) + { + byte *to= record+keyseg->start+keyseg->length; + byte *end= key+keyseg->length; +#ifdef CHECK_KEYS + if (end > key_end) + goto err; +#endif + do { + *--to= *key++; + } while (key != end); + continue; + } + else + { +#ifdef CHECK_KEYS + if (key+keyseg->length > key_end) + goto err; +#endif + memcpy(record+keyseg->start,(byte*) key, + (size_t) keyseg->length); + key+= keyseg->length; + } + + } + return OK; + +#ifdef CHECK_KEYS + err: +#endif + return FAILED; /* Crashed row */ +} + +/* + * ----------------------------------------------------------------------- + * Compare keys + */ + +static int my_compare_bin(uchar *a, uint a_length, uchar *b, uint b_length, + my_bool part_key, my_bool skip_end_space) +{ + uint length= min(a_length,b_length); + uchar *end= a+ length; + int flag; + + while (a < end) + if ((flag= (int) *a++ - (int) *b++)) + return flag; + if (part_key && b_length < a_length) + return 0; + if (skip_end_space && a_length != b_length) + { + int swap= 1; + /* + We are using space compression. We have to check if the longer key + has a next character < ' ', in which case it's less than the shorter + key that has an implicit space afterwards. + + This code is identical to the one in + strings/ctype-simple.c:my_strnncollsp_simple + */ + if (a_length < b_length) + { + /* put longer key in a */ + a_length= b_length; + a= b; + swap= -1; /* swap sign of result */ + } + for (end= a + a_length-length; a < end ; a++) + { + if (*a != ' ') + return (*a < ' ') ? 
-swap : swap; + } + return 0; + } + return (int) (a_length-b_length); +} + +xtPublic u_int myxt_get_key_length(XTIndexPtr ind, xtWord1 *key_buf) +{ + register XTIndexSegRec *keyseg = ind->mi_seg; + register uchar *key_data = (uchar *) key_buf; + uint seg_len; + uint pack_len; + + for (u_int i=0; i<ind->mi_seg_count; i++, keyseg++) { + /* Handle NULL part */ + if (keyseg->null_bit) { + if (!*key_data++) + continue; + } + + switch ((enum ha_base_keytype) keyseg->type) { + case HA_KEYTYPE_TEXT: /* Ascii; Key is converted */ + if (keyseg->flag & HA_SPACE_PACK) { + get_key_pack_length(seg_len, pack_len, key_data); + } + else + seg_len = keyseg->length; + key_data += seg_len; + break; + case HA_KEYTYPE_BINARY: + if (keyseg->flag & HA_SPACE_PACK) { + get_key_pack_length(seg_len, pack_len, key_data); + } + else + seg_len = keyseg->length; + key_data += seg_len; + break; + case HA_KEYTYPE_VARTEXT1: + case HA_KEYTYPE_VARTEXT2: + get_key_pack_length(seg_len, pack_len, key_data); + key_data += seg_len; + break; + case HA_KEYTYPE_VARBINARY1: + case HA_KEYTYPE_VARBINARY2: + get_key_pack_length(seg_len, pack_len, key_data); + key_data += seg_len; + break; + case HA_KEYTYPE_NUM: { + /* Numeric key */ + if (keyseg->flag & HA_SPACE_PACK) + seg_len = *key_data++; + else + seg_len = keyseg->length; + key_data += seg_len; + break; + } + case HA_KEYTYPE_INT8: + case HA_KEYTYPE_SHORT_INT: + case HA_KEYTYPE_USHORT_INT: + case HA_KEYTYPE_LONG_INT: + case HA_KEYTYPE_ULONG_INT: + case HA_KEYTYPE_INT24: + case HA_KEYTYPE_UINT24: + case HA_KEYTYPE_FLOAT: + case HA_KEYTYPE_DOUBLE: + case HA_KEYTYPE_LONGLONG: + case HA_KEYTYPE_ULONGLONG: + case HA_KEYTYPE_BIT: + key_data += keyseg->length; + break; + case HA_KEYTYPE_END: + goto end; + } + } + + end: + return (xtWord1 *) key_data - key_buf; +} + +/* Derived from ha_key_cmp */ +xtPublic int myxt_compare_key(XTIndexPtr ind, int search_flags, uint key_length, xtWord1 *key_value, xtWord1 *b_value) +{ + register XTIndexSegRec *keyseg = ind->mi_seg; + 
int flag; + register uchar *a = (uchar *) key_value; + uint a_length; + register uchar *b = (uchar *) b_value; + uint b_length; + uint next_key_length; + uchar *end; + uint piks; + uint pack_len; + + for (uint i=0; i < ind->mi_seg_count && (int) key_length > 0; key_length = next_key_length, keyseg++, i++) { + piks = !(keyseg->flag & HA_NO_SORT); + + /* Handle NULL part */ + if (keyseg->null_bit) { + /* 1 is not null, 0 is null */ + int b_not_null = (int) *b++; + + key_length--; + if ((int) *a != b_not_null && piks) + { + flag = (int) *a - b_not_null; + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + } + if (!*a++) { + /* If key was NULL */ + if (search_flags == (SEARCH_FIND | SEARCH_UPDATE)) + search_flags = SEARCH_SAME; /* Allow duplicate keys */ + else if (search_flags & SEARCH_NULL_ARE_NOT_EQUAL) + { + /* + * This is only used from mi_check() to calculate cardinality. + * It can't be used when searching for a key as this would cause + * compare of (a,b) and (b,a) to return the same value. + */ + return -1; + } + /* PMC - I don't know why I had next_key_length = key_length - keyseg->length; + * This was my comment: even when null we have the complete length + * + * The truth is, a NULL only takes up one byte in the key, and this has already + * been subtracted. + */ + next_key_length = key_length; + continue; /* To next key part */ + } + } + + /* Both components are not null... 
*/ + if (keyseg->length < key_length) { + end = a + keyseg->length; + next_key_length = key_length - keyseg->length; + } + else { + end = a + key_length; + next_key_length = 0; + } + + switch ((enum ha_base_keytype) keyseg->type) { + case HA_KEYTYPE_TEXT: /* Ascii; Key is converted */ + if (keyseg->flag & HA_SPACE_PACK) { + get_key_pack_length(a_length, pack_len, a); + next_key_length = key_length - a_length - pack_len; + get_key_pack_length(b_length, pack_len, b); + + if (piks && (flag = my_compare_text(keyseg->charset, a, a_length, b, b_length, + (my_bool) ((search_flags & SEARCH_PREFIX) && next_key_length <= 0), + (my_bool)!(search_flags & SEARCH_PREFIX)))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a += a_length; + } + else { + a_length = (uint) (end - a); + b_length = keyseg->length; + if (piks && (flag = my_compare_text(keyseg->charset, a, a_length, b, b_length, + (my_bool) ((search_flags & SEARCH_PREFIX) && next_key_length <= 0), + (my_bool)!(search_flags & SEARCH_PREFIX)))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + } + b += b_length; + break; + case HA_KEYTYPE_BINARY: + if (keyseg->flag & HA_SPACE_PACK) { + get_key_pack_length(a_length, pack_len, a); + next_key_length = key_length - a_length - pack_len; + get_key_pack_length(b_length, pack_len, b); + + if (piks && (flag = my_compare_bin(a, a_length, b, b_length, + (my_bool) ((search_flags & SEARCH_PREFIX) && next_key_length <= 0), 1))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + } + else { + a_length = keyseg->length; + b_length = keyseg->length; + if (piks && (flag = my_compare_bin(a, a_length, b, b_length, + (my_bool) ((search_flags & SEARCH_PREFIX) && next_key_length <= 0), 0))) + return ((keyseg->flag & HA_REVERSE_SORT) ? 
-flag : flag); + } + a += a_length; + b += b_length; + break; + case HA_KEYTYPE_VARTEXT1: + case HA_KEYTYPE_VARTEXT2: + { + get_key_pack_length(a_length, pack_len, a); + next_key_length = key_length - a_length - pack_len; + get_key_pack_length(b_length, pack_len, b); + + if (piks && (flag = my_compare_text(keyseg->charset, a, a_length, b, b_length, + (my_bool) ((search_flags & SEARCH_PREFIX) && next_key_length <= 0), + (my_bool) ((search_flags & (SEARCH_FIND | SEARCH_UPDATE)) == SEARCH_FIND)))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a += a_length; + b += b_length; + break; + } + case HA_KEYTYPE_VARBINARY1: + case HA_KEYTYPE_VARBINARY2: + { + get_key_pack_length(a_length, pack_len, a); + next_key_length = key_length - a_length - pack_len; + get_key_pack_length(b_length, pack_len, b); + + if (piks && (flag=my_compare_bin(a, a_length, b, b_length, + (my_bool) ((search_flags & SEARCH_PREFIX) && next_key_length <= 0), 0))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a += a_length; + b += b_length; + break; + } + case HA_KEYTYPE_INT8: + { + int i_1 = (int) *((signed char *) a); + int i_2 = (int) *((signed char *) b); + if (piks && (flag = CMP_NUM(i_1,i_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_SHORT_INT: { + int16 s_1 = sint2korr(a); + int16 s_2 = sint2korr(b); + if (piks && (flag = CMP_NUM(s_1, s_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_USHORT_INT: { + uint16 us_1= sint2korr(a); + uint16 us_2= sint2korr(b); + if (piks && (flag = CMP_NUM(us_1, us_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_LONG_INT: { + int32 l_1 = sint4korr(a); + int32 l_2 = sint4korr(b); + if (piks && (flag = CMP_NUM(l_1, l_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? 
-flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_ULONG_INT: { + uint32 u_1 = sint4korr(a); + uint32 u_2 = sint4korr(b); + if (piks && (flag = CMP_NUM(u_1, u_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_INT24: { + int32 l_1 = sint3korr(a); + int32 l_2 = sint3korr(b); + if (piks && (flag = CMP_NUM(l_1, l_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_UINT24: { + int32 l_1 = uint3korr(a); + int32 l_2 = uint3korr(b); + if (piks && (flag = CMP_NUM(l_1, l_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_FLOAT: { + float f_1, f_2; + + float4get(f_1, a); + float4get(f_2, b); + /* + * The following may give a compiler warning about floating point + * comparison not being safe, but this is ok in this context as + * we are basically doing sorting + */ + if (piks && (flag = CMP_NUM(f_1, f_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_DOUBLE: { + double d_1, d_2; + + float8get(d_1, a); + float8get(d_2, b); + /* + * The following may give a compiler warning about floating point + * comparison not being safe, but this is ok in this context as + * we are basically doing sorting + */ + if (piks && (flag = CMP_NUM(d_1, d_2))) + return ((keyseg->flag & HA_REVERSE_SORT) ? 
-flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_NUM: { + /* Numeric key */ + if (keyseg->flag & HA_SPACE_PACK) { + a_length = *a++; + end = a + a_length; + next_key_length = key_length - a_length - 1; + b_length = *b++; + } + else { + a_length = (int) (end - a); + b_length = keyseg->length; + } + + /* remove pre space from keys */ + for ( ; a_length && *a == ' ' ; a++, a_length--) ; + for ( ; b_length && *b == ' ' ; b++, b_length--) ; + + if (keyseg->flag & HA_REVERSE_SORT) { + swap_variables(uchar *, a, b); + swap_variables(uint, a_length, b_length); + } + + if (piks) { + if (*a == '-') { + if (*b != '-') + return -1; + a++; b++; + swap_variables(uchar *, a, b); + swap_variables(uint, a_length, b_length); + a_length--; b_length--; + } + else if (*b == '-') + return 1; + while (a_length && (*a == '+' || *a == '0')) { + a++; a_length--; + } + + while (b_length && (*b == '+' || *b == '0')) { + b++; b_length--; + } + + if (a_length != b_length) + return (a_length < b_length) ? -1 : 1; + while (b_length) { + if (*a++ != *b++) + return ((int) a[-1] - (int) b[-1]); + b_length--; + } + } + a = end; + b += b_length; + break; + } +#ifdef HAVE_LONG_LONG + case HA_KEYTYPE_LONGLONG: { + longlong ll_a = sint8korr(a); + longlong ll_b = sint8korr(b); + if (piks && (flag = CMP_NUM(ll_a,ll_b))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } + case HA_KEYTYPE_ULONGLONG: { + ulonglong ll_a = uint8korr(a); + ulonglong ll_b = uint8korr(b); + if (piks && (flag = CMP_NUM(ll_a,ll_b))) + return ((keyseg->flag & HA_REVERSE_SORT) ? -flag : flag); + a = end; + b += keyseg->length; + break; + } +#endif + case HA_KEYTYPE_BIT: + /* TODO: What here? 
*/ + break; + case HA_KEYTYPE_END: /* Ready */ + goto end; + } + } + + end: + return 0; +} + +xtPublic u_int myxt_key_seg_length(XTIndexSegRec *keyseg, u_int key_offset, xtWord1 *key_value) +{ + register xtWord1 *a = (xtWord1 *) key_value + key_offset; + u_int a_length; + u_int has_null = 0; + u_int key_length = 0; + u_int pack_len; + + /* Handle NULL part */ + if (keyseg->null_bit) { + has_null++; + /* If the value is null, then it only requires one byte: */ + if (!*a++) + return has_null; + } + + key_length = has_null + keyseg->length; + + switch ((enum ha_base_keytype) keyseg->type) { + case HA_KEYTYPE_TEXT: /* Ascii; Key is converted */ + if (keyseg->flag & HA_SPACE_PACK) { + get_key_pack_length(a_length, pack_len, a); + key_length = has_null + a_length + pack_len; + } + break; + case HA_KEYTYPE_BINARY: + if (keyseg->flag & HA_SPACE_PACK) { + get_key_pack_length(a_length, pack_len, a); + key_length = has_null + a_length + pack_len; + } + break; + case HA_KEYTYPE_VARTEXT1: + case HA_KEYTYPE_VARTEXT2: + case HA_KEYTYPE_VARBINARY1: + case HA_KEYTYPE_VARBINARY2: { + get_key_pack_length(a_length, pack_len, a); + key_length = has_null + a_length + pack_len; + break; + } + case HA_KEYTYPE_INT8: + case HA_KEYTYPE_SHORT_INT: + case HA_KEYTYPE_USHORT_INT: + case HA_KEYTYPE_LONG_INT: + case HA_KEYTYPE_ULONG_INT: + case HA_KEYTYPE_INT24: + case HA_KEYTYPE_UINT24: + case HA_KEYTYPE_FLOAT: + case HA_KEYTYPE_DOUBLE: + break; + case HA_KEYTYPE_NUM: { + /* Numeric key */ + if (keyseg->flag & HA_SPACE_PACK) { + a_length = *a++; + key_length = has_null + a_length + 1; + } + break; + } +#ifdef HAVE_LONG_LONG + case HA_KEYTYPE_LONGLONG: + case HA_KEYTYPE_ULONGLONG: + break; +#endif + case HA_KEYTYPE_BIT: + /* TODO: What here? 
*/ + break; + case HA_KEYTYPE_END: /* Ready */ + break; + } + + return key_length; +} + +/* + * ----------------------------------------------------------------------- + * Load and store rows + */ + +xtPublic xtWord4 myxt_store_row_length(XTOpenTablePtr ot, char *rec_buff) +{ + TABLE *table = ot->ot_table->tab_dic.dic_my_table; + char *sdata; + xtWord4 dlen; + xtWord4 item_size; + xtWord4 row_size = 0; + + for (Field **field=table->field ; *field ; field++) { + if ((*field)->is_null_in_record((const uchar *) rec_buff)) { + sdata = NULL; + dlen = 0; + item_size = 1; + } + else { + sdata = mx_get_length_and_data(*field, rec_buff, &dlen); + if (!dlen) { + /* Empty, but not null (blobs may return NULL, when + * length is 0. + */ + sdata = rec_buff; // Any valid pointer will do + item_size = 1 + dlen; + } + else if (dlen <= 240) + item_size = 1 + dlen; + else if (dlen <= 0xFFFF) + item_size = 3 + dlen; + else if (dlen <= 0xFFFFFF) + item_size = 4 + dlen; + else + item_size = 5 + dlen; + } + + row_size += item_size; + } + return row_size; +} + +static xtWord4 mx_store_row(XTOpenTablePtr ot, xtWord4 row_size, char *rec_buff) +{ + TABLE *table = ot->ot_table->tab_dic.dic_my_table; + char *sdata; + xtWord4 dlen; + xtWord4 item_size; + + for (Field **field=table->field ; *field ; field++) { + if ((*field)->is_null_in_record((const uchar *) rec_buff)) { + sdata = NULL; + dlen = 0; + item_size = 1; + } + else { + sdata = mx_get_length_and_data(*field, rec_buff, &dlen); + if (!dlen) { + /* Empty, but not null (blobs may return NULL, when + * length is 0. 
+ */ + sdata = rec_buff; // Any valid pointer will do + item_size = 1 + dlen; + } + else if (dlen <= 240) + item_size = 1 + dlen; + else if (dlen <= 0xFFFF) + item_size = 3 + dlen; + else if (dlen <= 0xFFFFFF) + item_size = 4 + dlen; + else + item_size = 5 + dlen; + } + + if (row_size + item_size > ot->ot_row_wbuf_size) { + if (!xt_realloc_ns((void **) &ot->ot_row_wbuffer, row_size + item_size)) + return 0; + ot->ot_row_wbuf_size = row_size + item_size; + } + + if (!sdata) + ot->ot_row_wbuffer[row_size] = 255; + else if (dlen <= 240) { + ot->ot_row_wbuffer[row_size] = (unsigned char) dlen; + memcpy(&ot->ot_row_wbuffer[row_size+1], sdata, dlen); + } + else if (dlen <= 0xFFFF) { + ot->ot_row_wbuffer[row_size] = 254; + XT_SET_DISK_2(&ot->ot_row_wbuffer[row_size+1], dlen); + memcpy(&ot->ot_row_wbuffer[row_size+3], sdata, dlen); + } + else if (dlen <= 0xFFFFFF) { + ot->ot_row_wbuffer[row_size] = 253; + XT_SET_DISK_3(&ot->ot_row_wbuffer[row_size+1], dlen); + memcpy(&ot->ot_row_wbuffer[row_size+4], sdata, dlen); + } + else { + ot->ot_row_wbuffer[row_size] = 252; + XT_SET_DISK_4(&ot->ot_row_wbuffer[row_size+1], dlen); + memcpy(&ot->ot_row_wbuffer[row_size+5], sdata, dlen); + } + + row_size += item_size; + } + return row_size; +} + +/* Count the number and size of whole columns in the given buffer. 
*/ +xtPublic size_t myxt_load_row_length(XTOpenTablePtr ot, size_t buffer_size, xtWord1 *source_buf, u_int *ret_col_cnt) +{ + u_int col_cnt; + xtWord4 len; + size_t size = 0; + u_int i; + + col_cnt = ot->ot_table->tab_dic.dic_no_of_cols; + if (ret_col_cnt) + col_cnt = *ret_col_cnt; + for (i=0; i<col_cnt; i++) { + if (size + 1 > buffer_size) + goto done; + switch (*source_buf) { + case 255: // Indicate NULL value + size++; + source_buf++; + break; + case 254: // 2 bytes length + if (size + 3 > buffer_size) + goto done; + len = XT_GET_DISK_2(source_buf + 1); + if (size + 3 + len > buffer_size) + goto done; + size += 3 + len; + source_buf += 3 + len; + break; + case 253: // 3 bytes length + if (size + 4 > buffer_size) + goto done; + len = XT_GET_DISK_3(source_buf + 1); + if (size + 4 + len > buffer_size) + goto done; + size += 4 + len; + source_buf += 4 + len; + break; + case 252: // 4 bytes length + if (size + 5 > buffer_size) + goto done; + len = XT_GET_DISK_4(source_buf + 1); + if (size + 5 + len > buffer_size) + goto done; + size += 5 + len; + source_buf += 5 + len; + break; + default: // Length byte + len = *source_buf; + if (size + 1 + len > buffer_size) + goto done; + size += 1 + len; + source_buf += 1 + len; + break; + } + } + + done: + if (ret_col_cnt) + *ret_col_cnt = i; + return size; +} + +/* Unload from PBXT variable length format to the MySQL row format. */ +xtPublic xtBool myxt_load_row(XTOpenTablePtr ot, xtWord1 *source_buf, xtWord1 *dest_buff, u_int col_cnt) +{ + TABLE *table; + xtWord4 len; + Field *curr_field; + xtBool is_null; + u_int i = 0; + + if (!(table = ot->ot_table->tab_dic.dic_my_table)) { + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_NO_DICTIONARY, ot->ot_table->tab_name); + return FAILED; + } + + /* According to the InnoDB implementation: + * "MySQL assumes that all columns + * have the SQL NULL bit set unless it + * is a nullable column with a non-NULL value". 
+ */ + memset(dest_buff, 0xFF, table->s->null_bytes); + for (Field **field=table->field ; *field && (!col_cnt || i<col_cnt); field++, i++) { + curr_field = *field; + is_null = FALSE; + switch (*source_buf) { + case 255: // Indicate NULL value + is_null = TRUE; + len = 0; + source_buf++; + break; + case 254: // 2 bytes length + len = XT_GET_DISK_2(source_buf + 1); + source_buf += 3; + break; + case 253: // 3 bytes length + len = XT_GET_DISK_3(source_buf + 1); + source_buf += 4; + break; + case 252: // 4 bytes length + len = XT_GET_DISK_4(source_buf + 1); + source_buf += 5; + break; + default: // Length byte + if (*source_buf > 240) { + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_BAD_RECORD_FORMAT); + return FAILED; + } + len = *source_buf; + source_buf++; + break; + } + + if (is_null) + mx_set_length_and_data(curr_field, (char *) dest_buff, 0, NULL); + else + mx_set_length_and_data(curr_field, (char *) dest_buff, len, (char *) source_buf); + + source_buf += len; + } + return OK; +} + +xtPublic xtBool myxt_find_column(XTOpenTablePtr ot, u_int *col_idx, const char *col_name) +{ + TABLE *table = ot->ot_table->tab_dic.dic_my_table; + u_int i=0; + + for (Field **field=table->field; *field; field++, i++) { + if (!my_strcasecmp(system_charset_info, (*field)->field_name, col_name)) { + *col_idx = i; + return OK; + } + } + return FALSE; +} + +xtPublic void myxt_get_column_name(XTOpenTablePtr ot, u_int col_idx, u_int len, char *col_name) +{ + TABLE *table = ot->ot_table->tab_dic.dic_my_table; + Field *field; + + field = table->field[col_idx]; + xt_strcpy(len, col_name, field->field_name); +} + +xtPublic void myxt_get_column_as_string(XTOpenTablePtr ot, char *buffer, u_int col_idx, u_int len, char *value) +{ + XTTableHPtr tab = ot->ot_table; + XTThreadPtr self = ot->ot_thread; + TABLE *table = tab->tab_dic.dic_my_table; + Field *field = table->field[col_idx]; + char buf_val[MAX_FIELD_WIDTH]; + String val(buf_val, sizeof(buf_val), &my_charset_bin); + + if 
(mx_is_null_in_record(field, buffer)) + xt_strcpy(len, value, "NULL"); + else { + byte *save; + + /* Required by store() - or an assertion will fail: */ + if (table->read_set) + bitmap_set_bit(table->read_set, col_idx); + + save = field->ptr; + xt_lock_mutex(self, &tab->tab_dic_field_lock); + pushr_(xt_unlock_mutex, &tab->tab_dic_field_lock); +#if MYSQL_VERSION_ID < 50114 + field->ptr = (byte *) buffer + field->offset(); +#else + field->ptr = (byte *) buffer + field->offset(field->table->record[0]); +#endif + field->val_str(&val); + field->ptr = save; // Restore org row pointer + freer_(); // xt_unlock_mutex(&tab->tab_dic_field_lock) + xt_strcpy(len, value, val.c_ptr()); + } +} + +xtPublic xtBool myxt_set_column(XTOpenTablePtr ot, char *buffer, u_int col_idx, const char *value, u_int len) +{ + XTTableHPtr tab = ot->ot_table; + XTThreadPtr self = ot->ot_thread; + TABLE *table = tab->tab_dic.dic_my_table; + Field *field = table->field[col_idx]; + byte *save; + int error; + + /* Required by store() - or an assertion will fail: */ + if (table->write_set) + bitmap_set_bit(table->write_set, col_idx); + + mx_set_notnull_in_record(field, buffer); + + save = field->ptr; + xt_lock_mutex(self, &tab->tab_dic_field_lock); + pushr_(xt_unlock_mutex, &tab->tab_dic_field_lock); +#if MYSQL_VERSION_ID < 50114 + field->ptr = (byte *) buffer + field->offset(); +#else + field->ptr = (byte *) buffer + field->offset(field->table->record[0]); +#endif + error = field->store(value, len, &my_charset_utf8_general_ci); + field->ptr = save; // Restore org row pointer + freer_(); // xt_unlock_mutex(&tab->tab_dic_field_lock) + return error ? 
FAILED : OK; +} + +xtPublic void myxt_get_column_data(XTOpenTablePtr ot, char *buffer, u_int col_idx, char **value, size_t *len) +{ + TABLE *table = ot->ot_table->tab_dic.dic_my_table; + Field *field = table->field[col_idx]; + char *sdata; + xtWord4 dlen; + + sdata = mx_get_length_and_data(field, buffer, &dlen); + *value = sdata; + *len = dlen; +} + +xtPublic xtBool myxt_store_row(XTOpenTablePtr ot, XTTabRecInfoPtr rec_info, char *rec_buff) +{ + if (ot->ot_rec_fixed) { + rec_info->ri_fix_rec_buf = (XTTabRecFixDPtr) ot->ot_row_wbuffer; + rec_info->ri_rec_buf_size = ot->ot_rec_size; + rec_info->ri_ext_rec = NULL; + + rec_info->ri_fix_rec_buf->tr_rec_type_1 = XT_TAB_STATUS_FIXED; + memcpy(rec_info->ri_fix_rec_buf->rf_data, rec_buff, ot->ot_rec_size - XT_REC_FIX_HEADER_SIZE); + } + else { + xtWord4 row_size; + + if (!(row_size = mx_store_row(ot, XT_REC_EXT_HEADER_SIZE, rec_buff))) + return FAILED; + if (row_size - XT_REC_FIX_EXT_HEADER_DIFF <= ot->ot_rec_size) { + rec_info->ri_fix_rec_buf = (XTTabRecFixDPtr) &ot->ot_row_wbuffer[XT_REC_FIX_EXT_HEADER_DIFF]; + rec_info->ri_rec_buf_size = row_size - XT_REC_FIX_EXT_HEADER_DIFF; + rec_info->ri_ext_rec = NULL; + + rec_info->ri_fix_rec_buf->tr_rec_type_1 = XT_TAB_STATUS_VARIABLE; + } + else { + rec_info->ri_fix_rec_buf = (XTTabRecFixDPtr) ot->ot_row_wbuffer; + rec_info->ri_rec_buf_size = ot->ot_rec_size; + rec_info->ri_ext_rec = (XTTabRecExtDPtr) ot->ot_row_wbuffer; + rec_info->ri_log_data_size = row_size - ot->ot_rec_size; + rec_info->ri_log_buf = (XTactExtRecEntryDPtr) &ot->ot_row_wbuffer[ot->ot_rec_size - offsetof(XTactExtRecEntryDRec, er_data)]; + + rec_info->ri_ext_rec->tr_rec_type_1 = XT_TAB_STATUS_EXT_DLOG; + XT_SET_DISK_4(rec_info->ri_ext_rec->re_log_dat_siz_4, rec_info->ri_log_data_size); + } + } + return OK; +} + +static void mx_print_string(uchar *s, uint count) +{ + while (count > 0) { + if (s[count - 1] != ' ') + break; + count--; + } + printf("\""); + for (u_int i=0; i<count; i++, s++) + printf("%c", *s); + 
printf("\""); +} + +xtPublic void myxt_print_key(XTIndexPtr ind, xtWord1 *key_value) +{ + register XTIndexSegRec *keyseg = ind->mi_seg; + register uchar *b = (uchar *) key_value; + uint b_length; + uint pack_len; + + for (u_int i = 0; i < ind->mi_seg_count; i++, keyseg++) { + if (i!=0) + printf(" "); + if (keyseg->null_bit) { + if (!*b++) { + printf("NULL"); + continue; + } + } + switch ((enum ha_base_keytype) keyseg->type) { + case HA_KEYTYPE_TEXT: /* Ascii; Key is converted */ + if (keyseg->flag & HA_SPACE_PACK) { + get_key_pack_length(b_length, pack_len, b); + } + else + b_length = keyseg->length; + mx_print_string(b, b_length); + b += b_length; + break; + case HA_KEYTYPE_LONG_INT: { + int32 l_2 = sint4korr(b); + b += keyseg->length; + printf("%ld", (long) l_2); + break; + } + case HA_KEYTYPE_ULONG_INT: { + xtWord4 u_2 = sint4korr(b); + b += keyseg->length; + printf("%lu", (u_long) u_2); + break; + } + default: + break; + } + } +} + +/* + * ----------------------------------------------------------------------- + * MySQL Data Dictionary + */ + +#define TS(x) (x)->s + +static void my_close_table(TABLE *table) +{ +#ifndef DRIZZLED + closefrm(table, 1); // TODO: Q, why did Stewart remove this? +#endif + xt_free_ns(table); +} + +/* + * This function returns NULL if the table cannot be opened + * because this is not a MySQL thread. + */ +static TABLE *my_open_table(XTThreadPtr self, XTDatabaseHPtr db __attribute__((unused)), XTPathStrPtr tab_path) +{ + THD *thd = current_thd; + char path_buffer[PATH_MAX]; + char *table_name; + char database_name[XT_IDENTIFIER_NAME_SIZE]; + char *ptr; + size_t size; + char *buffer, *path, *db_name, *name; + TABLE_SHARE *share; + int error; + TABLE *table; + + /* If we have no MySQL thread, then we cannot open this table! + * What this means is the thread is probably the sweeper or the + * compactor. 
+ */ + if (!thd) + return NULL; + + /* GOTCHA: Check if the table name is a partition, + * if so we need to remove the partition + * extension, in order for this to work! + * + * Reason: the parts of a partition table do not + * have .frm files!! + */ + xt_strcpy(PATH_MAX, path_buffer, tab_path->ps_path); + table_name = xt_last_name_of_path(path_buffer); + if ((ptr = strstr(table_name, "#P#"))) + *ptr = 0; + + xt_2nd_last_name_of_path(XT_IDENTIFIER_NAME_SIZE, database_name, path_buffer); + + size = sizeof(TABLE) + sizeof(TABLE_SHARE) + + strlen(path_buffer) + 1 + + strlen(database_name) + 1 + strlen(table_name) + 1; + if (!(buffer = (char *) xt_malloc(self, size))) + return NULL; + table = (TABLE *) buffer; + buffer += sizeof(TABLE); + share = (TABLE_SHARE *) buffer; + buffer += sizeof(TABLE_SHARE); + + path = buffer; + strcpy(path, path_buffer); + buffer += strlen(path_buffer) + 1; + db_name = buffer; + strcpy(db_name, database_name); + buffer += strlen(database_name) + 1; + name = buffer; + strcpy(name, table_name); + + /* Required to call 'open_table_from_share'! 
*/ + LEX *old_lex, new_lex; + + old_lex = thd->lex; + thd->lex = &new_lex; + new_lex.current_select= NULL; + lex_start(thd); + +#if MYSQL_VERSION_ID < 60000 +#if MYSQL_VERSION_ID < 50123 + init_tmp_table_share(share, db_name, 0, name, path); +#else + init_tmp_table_share(thd, share, db_name, 0, name, path); +#endif +#else +#if MYSQL_VERSION_ID < 60004 + init_tmp_table_share(share, db_name, 0, name, path); +#else + init_tmp_table_share(thd, share, db_name, 0, name, path); +#endif +#endif + + if ((error = open_table_def(thd, share, 0))) { + xt_free(self, table); + lex_end(&new_lex); + thd->lex = old_lex; + xt_throw_ulxterr(XT_CONTEXT, XT_ERR_LOADING_MYSQL_DIC, (u_long) error); + return NULL; + } + +#if MYSQL_VERSION_ID >= 60003 + if ((error = open_table_from_share(thd, share, "", 0, (uint) READ_ALL, 0, table, OTM_OPEN))) +#else + if ((error = open_table_from_share(thd, share, "", 0, (uint) READ_ALL, 0, table, FALSE))) +#endif + { + xt_free(self, table); + lex_end(&new_lex); + thd->lex = old_lex; + xt_throw_ulxterr(XT_CONTEXT, XT_ERR_LOADING_MYSQL_DIC, (u_long) error); + return NULL; + } + + lex_end(&new_lex); + thd->lex = old_lex; + + /* GOTCHA: I am the plug-in!!! Therefore, I should not hold + * a reference to myself. By holding this reference I prevent + * plugin_shutdown() and reap_plugins() in sql_plugin.cc + * from doing their job on shutdown! 
+ */ + plugin_unlock(NULL, table->s->db_plugin); + table->s->db_plugin = NULL; + return table; +} + +/* +static bool my_match_index(XTDDIndex *ind, KEY *index) +{ + KEY_PART_INFO *key_part; + KEY_PART_INFO *key_part_end; + u_int j; + XTDDColumnRef *cref; + + if (index->key_parts != ind->co_cols.size()) + return false; + + j=0; + key_part_end = index->key_part + index->key_parts; + for (key_part = index->key_part; key_part != key_part_end; key_part++, j++) { + if (!(cref = ind->co_cols.itemAt(j))) + return false; + if (myxt_strcasecmp(cref->cr_col_name, (char *) key_part->field->field_name) != 0) + return false; + } + + if (ind->co_type == XT_DD_KEY_PRIMARY) { + if (!(index->flags & HA_NOSAME)) + return false; + } + else { + if (ind->co_type == XT_DD_INDEX_UNIQUE) { + if (!(index->flags & HA_NOSAME)) + return false; + } + if (ind->co_ind_name) { + if (myxt_strcasecmp(ind->co_ind_name, index->name) != 0) + return false; + } + } + + return true; +} + +static XTDDIndex *my_find_index(XTDDTable *dd_tab, KEY *index) +{ + XTDDIndex *ind; + + for (u_int i=0; i<dd_tab->dt_indexes.size(); i++) + { + ind = dd_tab->dt_indexes.itemAt(i); + if (my_match_index(ind, index)) + return ind; + } + return NULL; +} +*/ + +static void my_deref_index_data(struct XTThread *self, XTIndexPtr mi) +{ + enter_(); + /* The dirty list of cache pages should be empty here! 
*/ + ASSERT(!mi->mi_dirty_list); + ASSERT(!mi->mi_free_list); + + xt_free_mutex(&mi->mi_flush_lock); + xt_spinlock_free(self, &mi->mi_dirty_lock); + XT_INDEX_FREE_LOCK(self, mi); + myxt_bitmap_free(self, &mi->mi_col_map); + if (mi->mi_free_list) + xt_free(self, mi->mi_free_list); + + xt_free(self, mi); + exit_(); +} + +static xtBool my_is_not_null_int4(XTIndexSegPtr seg) +{ + return (seg->type == HA_KEYTYPE_LONG_INT && !(seg->flag & HA_NULL_PART)); +} + +/* Derived from ha_myisam::create and mi_create */ +static XTIndexPtr my_create_index(XTThreadPtr self, TABLE *table_arg, u_int idx, KEY *index) +{ + XTIndexPtr ind; + KEY_PART_INFO *key_part; + KEY_PART_INFO *key_part_end; + XTIndexSegRec *seg; + Field *field; + enum ha_base_keytype type; + uint options = 0; + u_int key_length = 0; + xtBool partial_field; + + enter_(); + + pushsr_(ind, my_deref_index_data, (XTIndexPtr) xt_calloc(self, offsetof(XTIndexRec, mi_seg) + sizeof(XTIndexSegRec) * index->key_parts)); + + XT_INDEX_INIT_LOCK(self, ind); + xt_init_mutex_with_autoname(self, &ind->mi_flush_lock); + xt_spinlock_init_with_autoname(self, &ind->mi_dirty_lock); + ind->mi_index_no = idx; + ind->mi_flags = (index->flags & (HA_NOSAME | HA_NULL_ARE_EQUAL | HA_UNIQUE_CHECK)); + ind->mi_low_byte_first = TS(table_arg)->db_low_byte_first; + ind->mi_fix_key = TRUE; + ind->mi_select_total = 0; + ind->mi_subset_of = 0; + myxt_bitmap_init(self, &ind->mi_col_map, TS(table_arg)->fields); + + ind->mi_seg_count = (uint) index->key_parts; + key_part_end = index->key_part + index->key_parts; + seg = ind->mi_seg; + for (key_part = index->key_part; key_part != key_part_end; key_part++, seg++) { + partial_field = FALSE; + field = key_part->field; + + type = field->key_type(); + seg->flag = key_part->key_part_flag; + + if (options & HA_OPTION_PACK_KEYS || + (index->flags & (HA_PACK_KEY | HA_BINARY_PACK_KEY | HA_SPACE_PACK_USED))) + { + if (key_part->length > 8 && (type == HA_KEYTYPE_TEXT || type == HA_KEYTYPE_NUM || + (type == 
HA_KEYTYPE_BINARY && !field->zero_pack()))) + { + /* No blobs here */ + if (key_part == index->key_part) + ind->mi_flags |= HA_PACK_KEY; +#ifndef DRIZZLED + if (!(field->flags & ZEROFILL_FLAG) && + (field->type() == MYSQL_TYPE_STRING || + field->type() == MYSQL_TYPE_VAR_STRING || + ((int) (key_part->length - field->decimals())) >= 4)) + seg->flag |= HA_SPACE_PACK; +#endif + } + } + + seg->col_idx = field->field_index; + seg->is_recs_in_range = 1; + seg->is_selectivity = 1; + seg->type = (int) type; + seg->start = key_part->offset; + seg->length = key_part->length; + seg->bit_start = seg->bit_end = 0; + seg->bit_length = seg->bit_pos = 0; + seg->charset = field->charset(); + + if (field->null_ptr) { + key_length++; + seg->flag |= HA_NULL_PART; + seg->null_bit = field->null_bit; + seg->null_pos = (uint) (field->null_ptr - (uchar*) table_arg->record[0]); + } + else { + seg->null_bit = 0; + seg->null_pos = 0; + } + + if (field->real_type() == MYSQL_TYPE_ENUM +#ifndef DRIZZLED + || field->real_type() == MYSQL_TYPE_SET +#endif + ) { + /* These values are not indexed as strings!! + * The index will not be built correctly if this value is non-NULL. 
+ */ + seg->charset = NULL; + } + + if (field->type() == MYSQL_TYPE_BLOB +#ifndef DRIZZLED + || field->type() == MYSQL_TYPE_GEOMETRY +#endif + ) { + seg->flag |= HA_BLOB_PART; + /* save number of bytes used to pack length */ + seg->bit_start = (uint) (field->pack_length() - TS(table_arg)->blob_ptr_size); + } +#ifndef DRIZZLED + else if (field->type() == MYSQL_TYPE_BIT) { + seg->bit_length = ((Field_bit *) field)->bit_len; + seg->bit_start = ((Field_bit *) field)->bit_ofs; + seg->bit_pos = (uint) (((Field_bit *) field)->bit_ptr - (uchar*) table_arg->record[0]); + } +#else + /* Drizzle uses HA_KEYTYPE_ULONG_INT keys for enums > 1 byte, which is not consistent with MySQL, so we fix it here */ + else if (field->type() == MYSQL_TYPE_ENUM) { + switch (seg->length) { + case 2: + seg->type = HA_KEYTYPE_USHORT_INT; + break; + case 3: + seg->type = HA_KEYTYPE_UINT24; + break; + } + } +#endif + + switch (seg->type) { + case HA_KEYTYPE_VARTEXT1: + case HA_KEYTYPE_VARTEXT2: + case HA_KEYTYPE_VARBINARY1: + case HA_KEYTYPE_VARBINARY2: + if (!(seg->flag & HA_BLOB_PART)) { + /* Make a flag that this is a VARCHAR */ + seg->flag |= HA_VAR_LENGTH_PART; + /* Store in bit_start number of bytes used to pack the length */ + seg->bit_start = ((seg->type == HA_KEYTYPE_VARTEXT1 || seg->type == HA_KEYTYPE_VARBINARY1) ? 1 : 2); + } + break; + } + + /* All packed fields start with a length (1 or 3 bytes): */ + if (seg->flag & (HA_VAR_LENGTH_PART | HA_BLOB_PART | HA_SPACE_PACK)) { + key_length++; /* At least one length byte */ + if (seg->length >= 255) /* prefix may be 3 bytes */ + key_length +=2; + } + + key_length += seg->length; + if (seg->length > 40) + ind->mi_fix_key = FALSE; + + /* Determine if only part of the field is in the key: + * This is important for index coverage! + * Note, BLOB fields are never retrieved from + * an index! 
+ */ + if (field->type() == MYSQL_TYPE_BLOB) + partial_field = TRUE; + else if (field->real_type() == MYSQL_TYPE_VARCHAR // For varbinary type +#ifndef DRIZZLED + || field->real_type() == MYSQL_TYPE_VAR_STRING // For varbinary type + || field->real_type() == MYSQL_TYPE_STRING // For binary type +#endif + ) + { + Field *tab_field = table_arg->field[key_part->fieldnr-1]; + u_int field_len = tab_field->key_length(); + + if (key_part->length != field_len) + partial_field = TRUE; + } + + /* NOTE: do not set if the field is only partially in the index!!! */ + if (!partial_field) + bitmap_fast_test_and_set(&ind->mi_col_map, field->field_index); + } + + if (key_length > XT_INDEX_MAX_KEY_SIZE) + xt_throw_sulxterr(XT_CONTEXT, XT_ERR_KEY_TOO_LARGE, index->name, (u_long) XT_INDEX_MAX_KEY_SIZE); + + /* This is the maximum size of the index on disk: */ + ind->mi_key_size = key_length; + + if (ind->mi_fix_key) { + /* Special case for not-NULL 4 byte int value: */ + switch (ind->mi_seg_count) { + case 1: + ind->mi_single_type = ind->mi_seg[0].type; + if (ind->mi_seg[0].type == HA_KEYTYPE_LONG_INT || + ind->mi_seg[0].type == HA_KEYTYPE_ULONG_INT) { + if (!(ind->mi_seg[0].flag & HA_NULL_PART)) + ind->mi_scan_branch = xt_scan_branch_single; + } + break; + case 2: + if (my_is_not_null_int4(&ind->mi_seg[0]) && + my_is_not_null_int4(&ind->mi_seg[1])) { + ind->mi_scan_branch = xt_scan_branch_fix_simple; + ind->mi_simple_comp_key = xt_compare_2_int4; + } + break; + case 3: + if (my_is_not_null_int4(&ind->mi_seg[0]) && + my_is_not_null_int4(&ind->mi_seg[1]) && + my_is_not_null_int4(&ind->mi_seg[2])) { + ind->mi_scan_branch = xt_scan_branch_fix_simple; + ind->mi_simple_comp_key = xt_compare_3_int4; + } + break; + } + if (!ind->mi_scan_branch) + ind->mi_scan_branch = xt_scan_branch_fix; + ind->mi_prev_item = xt_prev_branch_item_fix; + ind->mi_last_item = xt_last_branch_item_fix; + } + else { + ind->mi_scan_branch = xt_scan_branch_var; + ind->mi_prev_item = xt_prev_branch_item_var; + 
ind->mi_last_item = xt_last_branch_item_var; + } + + XT_NODE_ID(ind->mi_root) = 0; + + popr_(); // Discard my_deref_index_data(ind) + + return_(ind); +} + +xtPublic void myxt_setup_dictionary(XTThreadPtr self, XTDictionaryPtr dic) +{ + TABLE *my_tab = dic->dic_my_table; + u_int field_count; + u_int var_field_count = 0; + xtBool blob_field_count = 0; + xtWord8 min_data_size = 0; + xtWord8 max_data_size = 0; + xtWord8 ave_data_size = 0; + xtWord8 min_row_size = 0; + xtWord8 max_row_size = 0; + xtWord8 ave_row_size = 0; + xtWord8 max_ave_row_size = 0; + u_int dic_rec_size; + xtBool dic_rec_fixed; + Field *curr_field; + Field **field; + +#ifdef SLAP_DEBUG + xtBool slap_debug = FALSE; + + /* DBG, checks optimized average row size of mysqlslap: */ + if (strcmp(my_tab->s->db.str, "mysqlslap") == 0 && strcmp(my_tab->s->table_name.str, "t1") == 0) + slap_debug = TRUE; +#endif + + /* How many columns are required for all indexes. */ + KEY *index; + KEY_PART_INFO *key_part; + KEY_PART_INFO *key_part_end; + + dic->dic_ind_cols_req = 0; + for (uint i=0; i<TS(my_tab)->keys; i++) { + index = &my_tab->key_info[i]; + + key_part_end = index->key_part + index->key_parts; + for (key_part = index->key_part; key_part != key_part_end; key_part++) { + curr_field = key_part->field; + + if ((u_int) curr_field->field_index+1 > dic->dic_ind_cols_req) + dic->dic_ind_cols_req = curr_field->field_index+1; + } + } + + /* We will work out how many columns are required for all blobs: */ + dic->dic_blob_cols_req = 0; + field_count = 0; + for (field=my_tab->field; (curr_field = *field); field++) { + field_count++; + min_data_size = curr_field->key_length(); + max_data_size = curr_field->key_length(); + enum_field_types tno = curr_field->type(); + + max_ave_row_size = 128; + if (tno == MYSQL_TYPE_BLOB) { + blob_field_count++; + min_data_size = 0; + max_data_size = ((Field_blob *) curr_field)->max_data_length(); + /* Set the average length higher for BLOBs: */ + if (max_data_size == 0xFFFF) + 
max_ave_row_size = 192; + else if (max_data_size == 0xFFFFFF) + max_ave_row_size = 256; + else if (max_data_size == 0xFFFFFFFF) { + max_ave_row_size = 384; + if ((u_int) curr_field->field_index+1 > dic->dic_blob_cols_req) + dic->dic_blob_cols_req = curr_field->field_index+1; + dic->dic_blob_count++; + xt_realloc(self, (void **) &dic->dic_blob_cols, sizeof(Field *) * dic->dic_blob_count); + dic->dic_blob_cols[dic->dic_blob_count-1] = curr_field; + } + /*DBG*//* Hack for test! */ + if (strcmp(curr_field->field_name, "c_data") == 0) + max_ave_row_size = 500; + } + else if (tno == MYSQL_TYPE_VARCHAR +#ifndef DRIZZLED + || tno == MYSQL_TYPE_VAR_STRING +#endif + ) { + /* GOTCHA: MYSQL_TYPE_VAR_STRING does not exist as MYSQL_TYPE_VARCHAR define, but + * is used when creating a table with + * VARCHAR() + */ + min_data_size = 0; +#ifdef SLAP_DEBUG + if (slap_debug) + min_data_size = 128; +#endif + } + + if (max_data_size == min_data_size) + ave_data_size = max_data_size; + else { + var_field_count++; + /* Take the average as 25% of the maximum: */ + ave_data_size = max_data_size / 4; + if (ave_data_size < 40) + ave_data_size = 40; + else if (ave_data_size > max_ave_row_size) + ave_data_size = max_ave_row_size; + if (ave_data_size > max_data_size) + ave_data_size = max_data_size; + } + + /* Add space for the length indicators: */ + if (min_data_size <= 240) + min_row_size += 1 + min_data_size; + else if (min_data_size <= 0xFFFF) + min_row_size += 3 + min_data_size; + else if (min_data_size <= 0xFFFFFF) + min_row_size += 4 + min_data_size; + else + min_row_size += 5 + min_data_size; + + if (max_data_size <= 240) + max_row_size += 1 + max_data_size; + else if (max_data_size <= 0xFFFF) + max_row_size += 3 + max_data_size; + else if (max_data_size <= 0xFFFFFF) + max_row_size += 4 + max_data_size; + else + max_row_size += 5 + max_data_size; + + if (ave_data_size <= 240) + ave_row_size += 1 + ave_data_size; + else /* Should not be more than this! 
*/ + ave_row_size += 3 + ave_data_size; + + /* This is the length of the record required for all indexes: */ + if (field_count + 1 == dic->dic_ind_cols_req) + dic->dic_ind_rec_len = max_data_size; + } + + dic->dic_min_row_size = min_row_size; + dic->dic_max_row_size = max_row_size; + dic->dic_ave_row_size = ave_row_size; + dic->dic_no_of_cols = field_count; + + if (dic->dic_def_ave_row_size) { + /* The average row size has been set: */ + dic_rec_size = offsetof(XTTabRecFix, rf_data) + TS(my_tab)->reclength; + + if (dic->dic_def_ave_row_size >= (xtWord8) TS(my_tab)->reclength && + dic_rec_size <= XT_TAB_MAX_FIX_REC_LENGTH && + (ave_row_size + ave_row_size / 10 >= max_row_size || + dic_rec_size < XT_TAB_MIN_VAR_REC_LENGTH) && + !blob_field_count) { + dic_rec_fixed = TRUE; + } + else { + xtWord8 new_rec_size; + + dic_rec_fixed = FALSE; + if (dic->dic_def_ave_row_size > max_row_size) + new_rec_size = offsetof(XTTabRecFix, rf_data) + max_row_size; + else + new_rec_size = offsetof(XTTabRecFix, rf_data) + dic->dic_def_ave_row_size; + + /* The maximum record size is 64K for explicit AVG_ROW_LENGTH! */ + if (new_rec_size > XT_TAB_MAX_FIX_REC_LENGTH_SPEC) + new_rec_size = XT_TAB_MAX_FIX_REC_LENGTH_SPEC; + + dic_rec_size = (u_int) new_rec_size; + } + } + else { + /* If the average size is within 10% of the maximum size, then + * we handle these rows as fixed size rows. + * Fixed size rows use the internal MySQL format. 
+ */ + dic_rec_size = offsetof(XTTabRecFix, rf_data) + TS(my_tab)->reclength; + /* Fixed length records must be less than 16K in size, + * have an average size which is very close to the maximum size or + * be less than a minimum size, + * and not contain any BLOBs: + */ + if (dic_rec_size <= XT_TAB_MAX_FIX_REC_LENGTH && + (ave_row_size + ave_row_size / 10 >= max_row_size || + dic_rec_size < XT_TAB_MIN_VAR_REC_LENGTH) && + !blob_field_count) { + dic_rec_fixed = TRUE; + } + else { + dic_rec_fixed = FALSE; + /* Note I add offsetof(XTTabRecFix, rf_data) instead of + * offsetof(XTTabRecExt, re_data) here! + * The reason is that we want to include the average size + * record in the fixed data part. To do this we only need to + * calculate a fixed header size, because in the cases in which + * it fits, we will only be using a fixed header! + */ + dic_rec_size = (u_int) (offsetof(XTTabRecFix, rf_data) + ave_row_size); + /* The maximum record size (16K for autorow sizing)! */ + if (dic_rec_size > XT_TAB_MAX_FIX_REC_LENGTH) + dic_rec_size = XT_TAB_MAX_FIX_REC_LENGTH; + } + } + + if (!dic->dic_rec_size) { + dic->dic_rec_size = dic_rec_size; + dic->dic_rec_fixed = dic_rec_fixed; + } + else { + /* This just confirms that our original calculation on + * create table agrees with the current calculation. + * (i.e. if non-zero values were loaded from the table). + * + * It may be that the criteria for calculating the data record size + * and whether to use a fixed or variable record have changed, + * but we need to stick to the current physical layout of the + * table. + * + * Note that this can occur in rename table when the + * method of calculation has changed. + * + * On rename, the format of the table does not change, so we + * will not take the calculated values. + */ + //ASSERT(dic->dic_rec_size == dic_rec_size); + //ASSERT(dic->dic_rec_fixed == dic_rec_fixed); + } + + if (dic_rec_fixed) { + /* Recalculate the length of the record required to address all + * index columns! 
+ */ + if (field_count == dic->dic_ind_cols_req) + dic->dic_ind_rec_len = TS(my_tab)->reclength; + else { + field=my_tab->field; + + curr_field = field[dic->dic_ind_cols_req]; +#if MYSQL_VERSION_ID < 50114 + dic->dic_ind_rec_len = curr_field->offset(); +#else + dic->dic_ind_rec_len = curr_field->offset(curr_field->table->record[0]); +#endif + } + } + + /* We now calculate how many of the first columns in the row + * will definitely fit into the buffer, when the record is + * of type extended. + * + * In this way we can figure out if we need to load the extended + * record at all. + */ + dic->dic_fix_col_count = 0; + if (!dic_rec_fixed) { + xtWord8 max_rec_size = offsetof(XTTabRecExt, re_data); + + for (Field **f=my_tab->field; (curr_field = *f); f++) { + max_data_size = curr_field->key_length(); + enum_field_types tno = curr_field->type(); + if (tno == MYSQL_TYPE_BLOB) + max_data_size = ((Field_blob *) curr_field)->max_data_length(); + if (max_data_size <= 240) + max_rec_size += 1 + max_data_size; + else if (max_data_size <= 0xFFFF) + max_rec_size += 3 + max_data_size; + else if (max_data_size <= 0xFFFFFF) + max_rec_size += 4 + max_data_size; + else + max_rec_size += 5 + max_data_size; + if (max_rec_size > (xtWord8) dic_rec_size) + break; + dic->dic_fix_col_count++; + } + ASSERT(dic->dic_fix_col_count < dic->dic_no_of_cols); + } + + dic->dic_key_count = TS(my_tab)->keys; + dic->dic_buf_size = TS(my_tab)->rec_buff_length; +} + +static u_int my_get_best_superset(XTThreadPtr self __attribute__((unused)), XTDictionaryPtr dic, XTIndexPtr ind) +{ + XTIndexPtr super_ind; + u_int super = 0; + u_int super_seg_count = ind->mi_seg_count; + + for (u_int i=0; i<dic->dic_key_count; i++) { + super_ind = dic->dic_keys[i]; + if (ind->mi_index_no != super_ind->mi_index_no && + super_seg_count < super_ind->mi_seg_count) { + for (u_int j=0; j<ind->mi_seg_count; j++) { + if (ind->mi_seg[j].col_idx != super_ind->mi_seg[j].col_idx) + goto next; + } + super_seg_count = 
super_ind->mi_seg_count; + super = i+1; + next:; + } + } + return super; +} + +/* + * Return FAILED if the MySQL dictionary is not available. + */ +xtPublic xtBool myxt_load_dictionary(XTThreadPtr self, XTDictionaryPtr dic, XTDatabaseHPtr db, XTPathStrPtr tab_path) +{ + TABLE *my_tab; + + if (!(my_tab = my_open_table(self, db, tab_path))) + return FAILED; + dic->dic_my_table = my_tab; + dic->dic_def_ave_row_size = (xtWord8) my_tab->s->avg_row_length; + myxt_setup_dictionary(self, dic); + dic->dic_keys = (XTIndexPtr *) xt_calloc(self, sizeof(XTIndexPtr) * TS(my_tab)->keys); + for (uint i=0; i<TS(my_tab)->keys; i++) + dic->dic_keys[i] = my_create_index(self, my_tab, i, &my_tab->key_info[i]); + + /* Check if any key is a subset of another: */ + for (u_int i=0; i<dic->dic_key_count; i++) + dic->dic_keys[i]->mi_subset_of = my_get_best_superset(self, dic, dic->dic_keys[i]); + + return OK; +} + +xtPublic void myxt_free_dictionary(XTThreadPtr self, XTDictionaryPtr dic) +{ + if (dic->dic_table) { + dic->dic_table->release(self); + dic->dic_table = NULL; + } + + if (dic->dic_my_table) { + my_close_table(dic->dic_my_table); + dic->dic_my_table = NULL; + } + + if (dic->dic_blob_cols) { + xt_free(self, dic->dic_blob_cols); + dic->dic_blob_cols = NULL; + } + dic->dic_blob_count = 0; + + /* If we have opened a table, then this data is freed with the dictionary: */ + if (dic->dic_keys) { + for (uint i=0; i<dic->dic_key_count; i++) { + if (dic->dic_keys[i]) + my_deref_index_data(self, (XTIndexPtr) dic->dic_keys[i]); + } + xt_free(self, dic->dic_keys); + dic->dic_key_count = 0; + dic->dic_keys = NULL; + } +} + +xtPublic void myxt_move_dictionary(XTDictionaryPtr dic, XTDictionaryPtr source_dic) +{ + dic->dic_my_table = source_dic->dic_my_table; + source_dic->dic_my_table = NULL; + + if (!dic->dic_rec_size) { + dic->dic_rec_size = source_dic->dic_rec_size; + dic->dic_rec_fixed = source_dic->dic_rec_fixed; + } + else { + /* This just confirms that our original calculation on + * create 
table agrees with the current calculation. + * (i.e. if non-zero values were loaded from the table). + * + * It may be that the criteria for calculating the data record size + * and whether to use a fixed or variable record have changed, + * but we need to stick to the current physical layout of the + * table. + */ + ASSERT_NS(dic->dic_rec_size == source_dic->dic_rec_size); + ASSERT_NS(dic->dic_rec_fixed == source_dic->dic_rec_fixed); + } + + dic->dic_tab_flags = source_dic->dic_tab_flags; + dic->dic_blob_cols_req = source_dic->dic_blob_cols_req; + dic->dic_blob_count = source_dic->dic_blob_count; + dic->dic_blob_cols = source_dic->dic_blob_cols; + source_dic->dic_blob_cols = NULL; + + dic->dic_buf_size = source_dic->dic_buf_size; + dic->dic_key_count = source_dic->dic_key_count; + dic->dic_keys = source_dic->dic_keys; + + /* Set this to zero, because later xt_flush_tables() may be called. + * This can occur when using the BLOB streaming engine, + * in command ALTER TABLE x ENGINE = PBXT; + */ + source_dic->dic_key_count = 0; + source_dic->dic_keys = NULL; + + dic->dic_min_row_size = source_dic->dic_min_row_size; + dic->dic_max_row_size = source_dic->dic_max_row_size; + dic->dic_ave_row_size = source_dic->dic_ave_row_size; + dic->dic_def_ave_row_size = source_dic->dic_def_ave_row_size; + + dic->dic_no_of_cols = source_dic->dic_no_of_cols; + dic->dic_fix_col_count = source_dic->dic_fix_col_count; + dic->dic_ind_cols_req = source_dic->dic_ind_cols_req; + dic->dic_ind_rec_len = source_dic->dic_ind_rec_len; +} + +static void my_free_dd_table(XTThreadPtr self, XTDDTable *dd_tab) +{ + if (dd_tab) + dd_tab->release(self); +} + +static void ha_create_dd_index(XTThreadPtr self, XTDDIndex *ind, KEY *key) +{ + KEY_PART_INFO *key_part; + KEY_PART_INFO *key_part_end; + XTDDColumnRef *cref; + + if (strcmp(key->name, "PRIMARY") == 0) + ind->co_type = XT_DD_KEY_PRIMARY; + else if (key->flags & HA_NOSAME) + ind->co_type = XT_DD_INDEX_UNIQUE; + else + ind->co_type = XT_DD_INDEX; + + if 
(ind->co_type == XT_DD_KEY_PRIMARY) + ind->co_name = xt_dup_string(self, key->name); + else + ind->co_ind_name = xt_dup_string(self, key->name); + + key_part_end = key->key_part + key->key_parts; + for (key_part = key->key_part; key_part != key_part_end; key_part++) { + if (!(cref = new XTDDColumnRef())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + ind->co_cols.append(self, cref); + cref->cr_col_name = xt_dup_string(self, (char *) key_part->field->field_name); + } +} + +static char *my_type_to_string(XTThreadPtr self, Field *field, TABLE *my_tab __attribute__((unused))) +{ + char buffer[MAX_FIELD_WIDTH + 400], *ptr; + String type((char *) buffer, sizeof(buffer), system_charset_info); + + /* GOTCHA: + * - Above sets the string length to the same as the buffer, + * so we must set the length to zero. + * - The result is not necessarily zero terminated. + * - We cannot assume that the input buffer is the one + * we get back (for example text field). + */ + type.length(0); + field->sql_type(type); + ptr = type.c_ptr(); + if (ptr != buffer) + xt_strcpy(sizeof(buffer), buffer, ptr); + + if (field->has_charset()) { + /* Always include the charset so that we can compare types + * for FK/PK relations. 
+ */ + xt_strcat(sizeof(buffer), buffer, " CHARACTER SET "); + xt_strcat(sizeof(buffer), buffer, (char *) field->charset()->csname); + + /* For string types dump collation name only if + * collation is not primary for the given charset + */ + if (!(field->charset()->state & MY_CS_PRIMARY)) { + xt_strcat(sizeof(buffer), buffer, " COLLATE "); + xt_strcat(sizeof(buffer), buffer, (char *) field->charset()->name); + } + } + + return xt_dup_string(self, buffer); // type.length() +} + +xtPublic XTDDTable *myxt_create_table_from_table(XTThreadPtr self, TABLE *my_tab) +{ + XTDDTable *dd_tab; + Field *curr_field; + XTDDColumn *col; + XTDDIndex *ind; + + if (!(dd_tab = new XTDDTable())) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + dd_tab->init(self); + pushr_(my_free_dd_table, dd_tab); + + for (Field **field=my_tab->field; (curr_field = *field); field++) { + col = XTDDColumnFactory::createFromMySQLField(self, my_tab, curr_field); + dd_tab->dt_cols.append(self, col); + } + + for (uint i=0; i<TS(my_tab)->keys; i++) { + if (!(ind = (XTDDIndex *) new XTDDIndex(XT_DD_UNKNOWN))) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + dd_tab->dt_indexes.append(self, ind); + ind->co_table = dd_tab; + ind->in_index = i; + ha_create_dd_index(self, ind, &my_tab->key_info[i]); + } + + popr_(); // my_free_dd_table(dd_tab) + return dd_tab; +} + +/* + * ----------------------------------------------------------------------- + * MySQL CHARACTER UTILITIES + */ + +xtPublic void myxt_static_convert_identifier(XTThreadPtr self __attribute__((unused)), MX_CHARSET_INFO *cs, char *from, char *to, size_t to_len) +{ + uint errors; + + /* + * Bug#4417 + * Check that identifiers and strings are not converted + * when the client character set is binary. 
+ */ + if (cs == &my_charset_utf8_general_ci || cs == &my_charset_bin) + xt_strcpy(to_len, to, from); + else + strconvert(cs, from, &my_charset_utf8_general_ci, to, to_len, &errors); +} + +// cs == current_thd->charset() +xtPublic char *myxt_convert_identifier(XTThreadPtr self, MX_CHARSET_INFO *cs, char *from) +{ + uint errors; + u_int len; + char *to; + + if (cs == &my_charset_utf8_general_ci || cs == &my_charset_bin) + to = xt_dup_string(self, from); + else { + len = strlen(from) * 3 + 1; + to = (char *) xt_malloc(self, len); + strconvert(cs, from, &my_charset_utf8_general_ci, to, len, &errors); + } + return to; +} + +xtPublic char *myxt_convert_table_name(XTThreadPtr self, char *from) +{ + u_int len; + char *to; + + len = strlen(from) * 5 + 1; + to = (char *) xt_malloc(self, len); + tablename_to_filename(from, to, len); + return to; +} + +xtPublic void myxt_static_convert_table_name(XTThreadPtr self __attribute__((unused)), char *from, char *to, size_t to_len) +{ + tablename_to_filename(from, to, to_len); +} + +xtPublic int myxt_strcasecmp(char * a, char *b) +{ + return my_strcasecmp(&my_charset_utf8_general_ci, a, b); +} + +xtPublic int myxt_isspace(MX_CHARSET_INFO *cs, char a) +{ + return my_isspace(cs, a); +} + +xtPublic int myxt_ispunct(MX_CHARSET_INFO *cs, char a) +{ + return my_ispunct(cs, a); +} + +xtPublic int myxt_isdigit(MX_CHARSET_INFO *cs, char a) +{ + return my_isdigit(cs, a); +} + +xtPublic MX_CHARSET_INFO *myxt_getcharset(bool convert) +{ + if (convert) { + THD *thd = current_thd; + + if (thd) + return thd_charset(thd); + } + return &my_charset_utf8_general_ci; +} + +#ifdef XT_STREAMING +xtPublic xtBool myxt_use_blobs(XTOpenTablePtr ot, void **ret_pbms_table, xtWord1 *rec_buf) +{ + void *pbms_table; + XTTable *tab = ot->ot_table; + u_int idx = 0; + Field *field; + char *blob_ref; + xtWord4 len; + char in_url[PBMS_BLOB_URL_SIZE]; + char *out_url; + + if (!xt_pbms_open_table(&pbms_table, tab->tab_name->ps_path)) + return FAILED; + + for (idx=0; 
idx<tab->tab_dic.dic_blob_count; idx++) { + field = tab->tab_dic.dic_blob_cols[idx]; + if ((blob_ref = mx_get_length_and_data(field, (char *) rec_buf, &len)) && len) { + xt_strncpy(PBMS_BLOB_URL_SIZE, in_url, blob_ref, len); + + if (!xt_pbms_use_blob(pbms_table, &out_url, in_url, field->field_index)) { + xt_pbms_close_table(pbms_table); + return FAILED; + } + + if (out_url) { + len = strlen(out_url); + mx_set_length_and_data(field, (char *) rec_buf, len, out_url); + } + } + } + *ret_pbms_table = pbms_table; + return OK; +} + +xtPublic void myxt_unuse_blobs(XTOpenTablePtr ot __attribute__((unused)), void *pbms_table) +{ + xt_pbms_close_table(pbms_table); +} + +xtPublic xtBool myxt_retain_blobs(XTOpenTablePtr ot __attribute__((unused)), void *pbms_table, xtRecordID rec_id) +{ + xtBool ok; + PBMSEngineRefRec eng_ref; + + memset(&eng_ref, 0, sizeof(PBMSEngineRefRec)); + XT_SET_DISK_8(eng_ref.er_data, rec_id); + ok = xt_pbms_retain_blobs(pbms_table, &eng_ref); + xt_pbms_close_table(pbms_table); + return ok; +} + +xtPublic void myxt_release_blobs(XTOpenTablePtr ot, xtWord1 *rec_buf, xtRecordID rec_id) +{ + void *pbms_table; + XTTable *tab = ot->ot_table; + u_int idx = 0; + Field *field; + char *blob_ref; + xtWord4 len; + char in_url[PBMS_BLOB_URL_SIZE]; + PBMSEngineRefRec eng_ref; + + memset(&eng_ref, 0, sizeof(PBMSEngineRefRec)); + XT_SET_DISK_8(eng_ref.er_data, rec_id); + + if (!xt_pbms_open_table(&pbms_table, tab->tab_name->ps_path)) + return; + + for (idx=0; idx<tab->tab_dic.dic_blob_count; idx++) { + field = tab->tab_dic.dic_blob_cols[idx]; + if ((blob_ref = mx_get_length_and_data(field, (char *) rec_buf, &len)) && len) { + xt_strncpy(PBMS_BLOB_URL_SIZE, in_url, blob_ref, len); + + xt_pbms_release_blob(pbms_table, in_url, field->field_index, &eng_ref); + } + } + + xt_pbms_close_table(pbms_table); +} +#endif // XT_STREAMING + +xtPublic void *myxt_create_thread() +{ + THD *new_thd; + + if (my_thread_init()) { + xt_register_error(XT_REG_CONTEXT, XT_ERR_MYSQL_ERROR, 0, 
"Unable to initialize MySQL threading"); + return NULL; + } + + if (!(new_thd = new THD())) { + my_thread_end(); + xt_register_error(XT_REG_CONTEXT, XT_ERR_MYSQL_ERROR, 0, "Unable to create MySQL thread (THD)"); + return NULL; + } + + new_thd->thread_stack = (char *) &new_thd; + new_thd->store_globals(); + lex_start(new_thd); + + return (void *) new_thd; +} + +xtPublic void myxt_destroy_thread(void *thread, xtBool end_threads) +{ + THD *thd = (THD *) thread; + +#if MYSQL_VERSION_ID > 60005 + /* PMC - This is a HACK! It is required because + * MySQL shuts down MDL before shutting down the + * plug-ins. + */ + if (!pbxt_inited) + mdl_init(); + close_thread_tables(thd); + if (!pbxt_inited) + mdl_destroy(); +#else + close_thread_tables(thd); +#endif + + delete thd; + + /* Remember that we don't have a THD */ + my_pthread_setspecific_ptr(THR_THD, 0); + + if (end_threads) + my_thread_end(); +} + +xtPublic XTThreadPtr myxt_get_self() +{ + THD *thd; + + if ((thd = current_thd)) + return xt_ha_thd_to_self(thd); + return NULL; +} + +/* + * ----------------------------------------------------------------------- + * INFORMATION SCHEMA FUNCTIONS + * + */ + +static int mx_put_record(THD *thd, TABLE *table) +{ + return schema_table_store_record(thd, table); +} + +#ifdef UNUSED_CODE +static void mx_put_int(TABLE *table, int column, int value) +{ + table->field[column]->store(value, false); +} + +static void mx_put_real8(TABLE *table, int column, xtReal8 value) +{ + table->field[column]->store(value); +} + +static void mx_put_string(TABLE *table, int column, const char *string, u_int len, charset_info_st *charset) +{ + table->field[column]->store(string, len, charset); +} +#endif + +static void mx_put_u_llong(TABLE *table, int column, u_llong value) +{ + table->field[column]->store(value, false); +} + +static void mx_put_string(TABLE *table, int column, const char *string, charset_info_st *charset) +{ + table->field[column]->store(string, strlen(string), charset); +} + +xtPublic 
int myxt_statistics_fill_table(XTThreadPtr self, void *th, void *ta, void *, MX_CONST void *ch) +{ + THD *thd = (THD *) th; + TABLE_LIST *tables = (TABLE_LIST *) ta; + charset_info_st *charset = (charset_info_st *) ch; + TABLE *table = (TABLE *) tables->table; + int err = 0; + int col; + const char *stat_name; + u_llong stat_value; + XTStatisticsRec statistics; + + xt_gather_statistics(&statistics); + for (u_int rec_id=0; !err && rec_id<XT_STAT_CURRENT_MAX; rec_id++) { + stat_name = xt_get_stat_meta_data(rec_id)->sm_name; + stat_value = xt_get_statistic(&statistics, self->st_database, rec_id); + + col=0; + mx_put_u_llong(table, col++, rec_id+1); + mx_put_string(table, col++, stat_name, charset); + mx_put_u_llong(table, col++, stat_value); + err = mx_put_record(thd, table); + } + + return err; +} + +xtPublic void myxt_get_status(XTThreadPtr self, XTStringBufferPtr strbuf) +{ + char string[200]; + + xt_sb_concat(self, strbuf, "\n"); + xt_get_now(string, 200); + xt_sb_concat(self, strbuf, string); + xt_sb_concat(self, strbuf, " PBXT "); + xt_sb_concat(self, strbuf, xt_get_version()); + xt_sb_concat(self, strbuf, " STATUS OUTPUT"); + xt_sb_concat(self, strbuf, "\n"); + + xt_sb_concat(self, strbuf, "Record cache usage: "); + xt_sb_concat_int8(self, strbuf, xt_tc_get_usage()); + xt_sb_concat(self, strbuf, "\n"); + xt_sb_concat(self, strbuf, "Record cache size: "); + xt_sb_concat_int8(self, strbuf, xt_tc_get_size()); + xt_sb_concat(self, strbuf, "\n"); + xt_sb_concat(self, strbuf, "Record cache high: "); + xt_sb_concat_int8(self, strbuf, xt_tc_get_high()); + xt_sb_concat(self, strbuf, "\n"); + xt_sb_concat(self, strbuf, "Index cache usage: "); + xt_sb_concat_int8(self, strbuf, xt_ind_get_usage()); + xt_sb_concat(self, strbuf, "\n"); + xt_sb_concat(self, strbuf, "Index cache size: "); + xt_sb_concat_int8(self, strbuf, xt_ind_get_size()); + xt_sb_concat(self, strbuf, "\n"); + xt_sb_concat(self, strbuf, "Log cache usage: "); + xt_sb_concat_int8(self, strbuf, 
xt_xlog_get_usage()); + xt_sb_concat(self, strbuf, "\n"); + xt_sb_concat(self, strbuf, "Log cache size: "); + xt_sb_concat_int8(self, strbuf, xt_xlog_get_size()); + xt_sb_concat(self, strbuf, "\n"); + + xt_ht_lock(self, xt_db_open_databases); + pushr_(xt_ht_unlock, xt_db_open_databases); + + XTDatabaseHPtr *dbptr; + size_t len = xt_sl_get_size(xt_db_open_db_by_id); + + if (len > 0) { + xt_sb_concat(self, strbuf, "Data log files:\n"); + for (u_int i=0; i<len; i++) { + dbptr = (XTDatabaseHPtr *) xt_sl_item_at(xt_db_open_db_by_id, i); + +#ifndef XT_USE_GLOBAL_DB + xt_sb_concat(self, strbuf, "Database: "); + xt_sb_concat(self, strbuf, (*dbptr)->db_name); + xt_sb_concat(self, strbuf, "\n"); +#endif + xt_dl_log_status(self, *dbptr, strbuf); + } + } + else + xt_sb_concat(self, strbuf, "No data logs in use\n"); + + freer_(); // xt_ht_unlock(xt_db_open_databases) +} + +/* + * ----------------------------------------------------------------------- + * MySQL Bit Maps + */ + +xtPublic void myxt_bitmap_init(XTThreadPtr self, MY_BITMAP *map, u_int n_bits) +{ + my_bitmap_map *buf; + uint size_in_bytes = (((n_bits) + 31) / 32) * 4; + + buf = (my_bitmap_map *) xt_malloc(self, size_in_bytes); + map->bitmap= buf; + map->n_bits= n_bits; + create_last_word_mask(map); + bitmap_clear_all(map); +} + +xtPublic void myxt_bitmap_free(XTThreadPtr self, MY_BITMAP *map) +{ + if (map->bitmap) { + xt_free(self, map->bitmap); + map->bitmap = NULL; + } +} + +/* + * ----------------------------------------------------------------------- + * XTDDColumnFactory methods + */ + +XTDDColumn *XTDDColumnFactory::createFromMySQLField(XTThread *self, TABLE *my_tab, Field *field) +{ + XTDDEnumerableColumn *en_col; + XTDDColumn *col; + xtBool is_enum = FALSE; + + switch(field->real_type()) { + case MYSQL_TYPE_ENUM: + is_enum = TRUE; + /* fallthrough */ + +#ifndef DRIZZLED + case MYSQL_TYPE_SET: +#endif + col = en_col = new XTDDEnumerableColumn(); + if (!col) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + 
col->init(self); + en_col->enum_size = ((Field_enum *)field)->typelib->count; + en_col->is_enum = is_enum; + break; + + default: + col = new XTDDColumn(); + if (!col) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + col->init(self); + } + + col->dc_name = xt_dup_string(self, (char *) field->field_name); + col->dc_data_type = my_type_to_string(self, field, my_tab); + col->dc_null_ok = field->null_ptr != NULL; + + return col; +} + diff --git a/storage/pbxt/src/myxt_xt.h b/storage/pbxt/src/myxt_xt.h new file mode 100644 index 00000000000..4d33431088e --- /dev/null +++ b/storage/pbxt/src/myxt_xt.h @@ -0,0 +1,104 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-05-16 Paul McCullagh + * + * H&G2JCtL + * + * These functions implement the parts of PBXT which must conform to the + * key and row format used by MySQL. + */ + +#ifndef __xt_myxt_h__ +#define __xt_myxt_h__ + +#include "xt_defs.h" +#include "table_xt.h" +#include "datadic_xt.h" + +#ifndef MYSQL_VERSION_ID +#error MYSQL_VERSION_ID must be defined! 
+#endif + +struct XTDictionary; +struct XTDatabase; +STRUCT_TABLE; +struct charset_info_st; + +u_int myxt_create_key_from_key(XTIndexPtr ind, xtWord1 *key, xtWord1 *old, u_int k_length); +u_int myxt_create_key_from_row(XTIndexPtr ind, xtWord1 *key, xtWord1 *record, xtBool *no_duplicate); +u_int myxt_create_foreign_key_from_row(XTIndexPtr ind, xtWord1 *key, xtWord1 *record, XTIndexPtr fkey_ind, xtBool *no_null); +u_int myxt_get_key_length(XTIndexPtr ind, xtWord1 *b_value); +int myxt_compare_key(XTIndexPtr ind, int search_flags, uint key_length, xtWord1 *key_value, xtWord1 *b_value); +u_int myxt_key_seg_length(XTIndexSegRec *keyseg, u_int key_offset, xtWord1 *key_value); +xtBool myxt_create_row_from_key(XTOpenTablePtr ot, XTIndexPtr ind, xtWord1 *key, u_int key_len, xtWord1 *record); +void myxt_set_null_row_from_key(XTOpenTablePtr ot, XTIndexPtr ind, xtWord1 *record); +void myxt_set_default_row_from_key(XTOpenTablePtr ot, XTIndexPtr ind, xtWord1 *record); +void myxt_print_key(XTIndexPtr ind, xtWord1 *key_value); + +xtWord4 myxt_store_row_length(XTOpenTablePtr ot, char *rec_buff); +xtBool myxt_store_row(XTOpenTablePtr ot, XTTabRecInfoPtr rec_info, char *rec_buff); +size_t myxt_load_row_length(XTOpenTablePtr ot, size_t buffer_size, xtWord1 *source_buf, u_int *ret_col_cnt); +xtBool myxt_load_row(XTOpenTablePtr ot, xtWord1 *source_buf, xtWord1 *dest_buff, u_int col_cnt); +xtBool myxt_find_column(XTOpenTablePtr ot, u_int *col_idx, const char *col_name); +void myxt_get_column_name(XTOpenTablePtr ot, u_int col_idx, u_int len, char *col_name); +void myxt_get_column_as_string(XTOpenTablePtr ot, char *buffer, u_int col_idx, u_int len, char *value); +xtBool myxt_set_column(XTOpenTablePtr ot, char *buffer, u_int col_idx, const char *value, u_int len); +void myxt_get_column_data(XTOpenTablePtr ot, char *buffer, u_int col_idx, char **value, size_t *len); + +void myxt_setup_dictionary(XTThreadPtr self, XTDictionary *dic); +xtBool myxt_load_dictionary(XTThreadPtr self, struct 
XTDictionary *dic, struct XTDatabase *db, XTPathStrPtr tab_path); +void myxt_free_dictionary(XTThreadPtr self, XTDictionary *dic); +void myxt_move_dictionary(XTDictionaryPtr dic, XTDictionaryPtr source_dic); +XTDDTable *myxt_create_table_from_table(XTThreadPtr self, STRUCT_TABLE *my_tab); + +void myxt_static_convert_identifier(XTThreadPtr self, struct charset_info_st *cs, char *from, char *to, size_t to_len); +char *myxt_convert_identifier(XTThreadPtr self, struct charset_info_st *cs, char *from); +void myxt_static_convert_table_name(XTThreadPtr self, char *from, char *to, size_t to_len); +char *myxt_convert_table_name(XTThreadPtr self, char *from); +int myxt_strcasecmp(char * a, char *b); +int myxt_isspace(struct charset_info_st *cs, char a); +int myxt_ispunct(struct charset_info_st *cs, char a); +int myxt_isdigit(struct charset_info_st *cs, char a); + +struct charset_info_st *myxt_getcharset(bool convert); + +#ifdef XT_STREAMING +xtBool myxt_use_blobs(XTOpenTablePtr ot, void **ret_pbms_table, xtWord1 *rec_buf); +void myxt_unuse_blobs(XTOpenTablePtr ot, void *pbms_table); +xtBool myxt_retain_blobs(XTOpenTablePtr ot, void *pbms_table, xtRecordID record); +void myxt_release_blobs(XTOpenTablePtr ot, xtWord1 *rec_buf, xtRecordID record); +#endif + +void *myxt_create_thread(); +void myxt_destroy_thread(void *thread, xtBool end_threads); +XTThreadPtr myxt_get_self(); + +int myxt_statistics_fill_table(XTThreadPtr self, void *th, void *ta, void *co, MX_CONST void *ch); +void myxt_get_status(XTThreadPtr self, XTStringBufferPtr strbuf); + +void myxt_bitmap_init(XTThreadPtr self, MY_BITMAP *map, u_int n_bits); +void myxt_bitmap_free(XTThreadPtr self, MY_BITMAP *map); + +class XTDDColumnFactory +{ +public: + static XTDDColumn *createFromMySQLField(XTThread *self, STRUCT_TABLE *, Field *); +}; + +#endif diff --git a/storage/pbxt/src/pbms.h b/storage/pbxt/src/pbms.h new file mode 100644 index 00000000000..1c1f4e71c04 --- /dev/null +++ b/storage/pbxt/src/pbms.h @@ -0,0 +1,866 @@ 
+/* Copyright (c) 2007 PrimeBase Technologies GmbH + * + * PrimeBase Media Stream for MySQL + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Paul McCullagh + * H&G2JCtL + * + * 2007-06-01 + * + * This file contains the BLOB streaming interface engines that + * are streaming enabled. + * + */ +#ifndef __streaming_unx_h__ +#define __streaming_unx_h__ + +#include <stdio.h> +#include <sys/types.h> +#include <unistd.h> +#include <stdlib.h> +#include <fcntl.h> +#include <string.h> +#include <dirent.h> +#include <signal.h> +#include <ctype.h> + +#ifdef USE_PRAGMA_INTERFACE +#pragma interface /* gcc class implementation */ +#endif + +#define MS_SHARED_MEMORY_MAGIC 0x7E9A120C +#define MS_ENGINE_VERSION 1 +#define MS_CALLBACK_VERSION 1 +#define MS_SHARED_MEMORY_VERSION 1 +#define MS_ENGINE_LIST_SIZE 80 +#define MS_TEMP_FILE_PREFIX "pbms_temp_" +#define MS_TEMP_FILE_PREFIX "pbms_temp_" + +#define MS_RESULT_MESSAGE_SIZE 300 +#define MS_RESULT_STACK_SIZE 200 + +#define MS_BLOB_HANDLE_SIZE 300 + +#define SH_MASK ((S_IRUSR | S_IWUSR) | (S_IRGRP | S_IWGRP) | (S_IROTH)) + +#define MS_OK 0 +#define MS_ERR_ENGINE 1 /* Internal engine error. */ +#define MS_ERR_UNKNOWN_TABLE 2 /* Returned if the engine cannot open the given table. */ +#define MS_ERR_NOT_FOUND 3 /* The BLOB cannot be found. 
*/ +#define MS_ERR_TABLE_LOCKED 4 /* Table is currently locked. */ +#define MS_ERR_INCORRECT_URL 5 +#define MS_ERR_AUTH_FAILED 6 +#define MS_ERR_NOT_IMPLEMENTED 7 +#define MS_ERR_UNKNOWN_DB 8 +#define MS_ERR_REMOVING_REPO 9 +#define MS_ERR_DATABASE_DELETED 10 + +#define MS_LOCK_NONE 0 +#define MS_LOCK_READONLY 1 +#define MS_LOCK_READ_WRITE 2 + +#define MS_XACT_NONE 0 +#define MS_XACT_BEGIN 1 +#define MS_XACT_COMMIT 2 +#define MS_XACT_ROLLBACK 3 + +#define PBMS_ENGINE_REF_LEN 8 +#define PBMS_BLOB_URL_SIZE 200 + +#define PBMS_FIELD_COL_SIZE 128 +#define PBMS_FIELD_COND_SIZE 300 + + +typedef struct PBMSBlobID { + u_int64_t bi_blob_size; + u_int64_t bi_blob_id; // or repo file offset if type = REPO + u_int32_t bi_tab_id; // or repo ID if type = REPO + u_int32_t bi_auth_code; + u_int32_t bi_blob_type; +} PBMSBlobIDRec, *PBMSBlobIDPtr; + +typedef struct PBMSResultRec { + int mr_code; /* Engine specific error code. */ + char mr_message[MS_RESULT_MESSAGE_SIZE]; /* Error message, required if non-zero return code. */ + char mr_stack[MS_RESULT_STACK_SIZE]; /* Trace information about where the error occurred. */ +} PBMSResultRec, *PBMSResultPtr; + +typedef struct PBMSEngineRefRec { + unsigned char er_data[PBMS_ENGINE_REF_LEN]; +} PBMSEngineRefRec, *PBMSEngineRefPtr; + +typedef struct PBMSBlobURL { + char bu_data[PBMS_BLOB_URL_SIZE]; +} PBMSBlobURLRec, *PBMSBlobURLPtr; + +typedef struct PBMSFieldRef { + char fr_column[PBMS_FIELD_COL_SIZE]; + char fr_cond[PBMS_FIELD_COND_SIZE]; +} PBMSFieldRefRec, *PBMSFieldRefPtr; +/* + * The engine must free its resources for the given thread. + */ +typedef void (*MSCloseConnFunc)(void *thd); + +/* Before access BLOBs of a table, the streaming engine will open the table. + * Open tables are managed as a pool by the streaming engine. + * When a request is received, the streaming engine will ask all + * registered engine to open the table. The engine must return a NULL + * open_table pointer if it does not handle the table. 
+ * A callback allows an engine to request all open tables to be + * closed by the streaming engine. + */ +typedef int (*MSOpenTableFunc)(void *thd, const char *table_url, void **open_table, PBMSResultPtr result); +typedef void (*MSCloseTableFunc)(void *thd, void *open_table); + +/* + * When the streaming engine wants to use an open table handle from the + * pool, it calls the lock table function. + */ +typedef int (*MSLockTableFunc)(void *thd, int *xact, void *open_table, int lock_type, PBMSResultPtr result); +typedef int (*MSUnlockTableFunc)(void *thd, int xact, void *open_table, PBMSResultPtr result); + +/* This function is used to locate and send a BLOB on the given stream. + */ +typedef int (*MSSendBLOBFunc)(void *thd, void *open_table, const char *blob_column, const char *blob_url, void *stream, PBMSResultPtr result); + +/* + * Lookup and engine reference, and return readable text. + */ +typedef int (*MSLookupRefFunc)(void *thd, void *open_table, unsigned short col_index, PBMSEngineRefPtr eng_ref, PBMSFieldRefPtr feild_ref, PBMSResultPtr result); + +typedef struct PBMSEngineRec { + int ms_version; /* MS_ENGINE_VERSION */ + int ms_index; /* The index into the engine list. */ + int ms_removing; /* TRUE (1) if the engine is being removed. */ + const char *ms_engine_name; + void *ms_engine_info; + MSCloseConnFunc ms_close_conn; + MSOpenTableFunc ms_open_table; + MSCloseTableFunc ms_close_table; + MSLockTableFunc ms_lock_table; + MSUnlockTableFunc ms_unlock_table; + MSSendBLOBFunc ms_send_blob; + MSLookupRefFunc ms_lookup_ref; +} PBMSEngineRec, *PBMSEnginePtr; + +/* + * This function should never be called directly, it is called + * by deregisterEngine() below. 
+ */ +typedef void (*ECDeregisterdFunc)(PBMSEnginePtr engine); + +typedef void (*ECTableCloseAllFunc)(const char *table_url); + +typedef int (*ECSetContentLenFunc)(void *stream, off_t len, PBMSResultPtr result); + +typedef int (*ECWriteHeadFunc)(void *stream, PBMSResultPtr result); + +typedef int (*ECWriteStreamFunc)(void *stream, void *buffer, size_t len, PBMSResultPtr result); + +/* + * The engine should call this function from + * its own close connection function! + */ +typedef int (*ECCloseConnFunc)(void *thd, PBMSResultPtr result); + +/* + * Call this function before retaining or releasing BLOBs in a row. + */ +typedef int (*ECOpenTableFunc)(void **open_table, char *table_path, PBMSResultPtr result); + +/* + * Call this function when the operation is complete. + */ +typedef void (*ECCloseTableFunc)(void *open_table); + +/* + * Call this function for each BLOB to be retained. When a BLOB is used, the + * URL may be changed. The returned URL is valid as long as the the + * table is open. + * + * The returned URL must be inserted into the row in place of the given + * URL. + */ +typedef int (*ECUseBlobFunc)(void *open_table, char **ret_blob_url, char *blob_url, unsigned short col_index, PBMSResultPtr result); + +/* + * Reference Blobs that has been uploaded to the streaming engine. + * + * All BLOBs specified by the use blob function are retained by + * this function. + * + * The engine reference is a (unaligned) 8 byte value which + * identifies the row that the BLOBs are in. + */ +typedef int (*ECRetainBlobsFunc)(void *open_table, PBMSEngineRefPtr eng_ref, PBMSResultPtr result); + +/* + * If a row containing a BLOB is deleted, then the BLOBs in the + * row must be released. + * + * Note: if a table is dropped, all the BLOBs referenced by the + * table are automatically released. 
+ */ +typedef int (*ECReleaseBlobFunc)(void *open_table, char *blob_url, unsigned short col_index, PBMSEngineRefPtr eng_ref, PBMSResultPtr result); + +typedef int (*ECDropTable)(const char *table_path, PBMSResultPtr result); + +typedef int (*ECRenameTable)(const char *from_table, const char *to_table, PBMSResultPtr result); + +typedef struct PBMSCallbacksRec { + int cb_version; /* MS_CALLBACK_VERSION */ + ECDeregisterdFunc cb_deregister; + ECTableCloseAllFunc cb_table_close_all; + ECSetContentLenFunc cb_set_cont_len; + ECWriteHeadFunc cb_write_head; + ECWriteStreamFunc cb_write_stream; + ECCloseConnFunc cb_close_conn; + ECOpenTableFunc cb_open_table; + ECCloseTableFunc cb_close_table; + ECUseBlobFunc cb_use_blob; + ECRetainBlobsFunc cb_retain_blobs; + ECReleaseBlobFunc cb_release_blob; + ECDropTable cb_drop_table; + ECRenameTable cb_rename_table; +} PBMSCallbacksRec, *PBMSCallbacksPtr; + +typedef struct PBMSSharedMemoryRec { + int sm_magic; /* MS_SHARED_MEMORY_MAGIC */ + int sm_version; /* MS_SHARED_MEMORY_VERSION */ + volatile int sm_shutdown_lock; /* "Cheap" lock for shutdown! */ + PBMSCallbacksPtr sm_callbacks; + int sm_reserved1[20]; + void *sm_reserved2[20]; + int sm_list_size; + int sm_list_len; + PBMSEnginePtr sm_engine_list[MS_ENGINE_LIST_SIZE]; +} PBMSSharedMemoryRec, *PBMSSharedMemoryPtr; + +#ifndef PBMS_API +#ifndef PBMS_CLIENT_API +Please define he value of PBMS_API +#endif +#else + +class PBMS_API +{ +private: + const char *temp_prefix[3]; + +public: + PBMS_API(): sharedMemory(NULL) { + int i = 0; + temp_prefix[i++] = MS_TEMP_FILE_PREFIX; +#ifdef MS_TEMP_FILE_PREFIX + temp_prefix[i++] = MS_TEMP_FILE_PREFIX; +#endif + temp_prefix[i++] = NULL; + + } + + ~PBMS_API() { } + + /* + * Register the engine with the Stream Engine. 
+ */ + int registerEngine(PBMSEnginePtr engine, PBMSResultPtr result) { + int err; + + deleteTempFiles(); + + if ((err = getSharedMemory(true, result))) + return err; + + for (int i=0; i<sharedMemory->sm_list_size; i++) { + if (!sharedMemory->sm_engine_list[i]) { + sharedMemory->sm_engine_list[i] = engine; + engine->ms_index = i; + if (i >= sharedMemory->sm_list_len) + sharedMemory->sm_list_len = i+1; + return MS_OK; + } + } + + result->mr_code = 15010; + strcpy(MS_RESULT_MESSAGE_SIZE, result->mr_message, "Too many BLOB streaming engines already registered"); + *result->mr_stack = 0; + return MS_ERR_ENGINE; + } + + void lock() { + while (sharedMemory->sm_shutdown_lock) + usleep(10000); + sharedMemory->sm_shutdown_lock++; + while (sharedMemory->sm_shutdown_lock != 1) { + usleep(random() % 10000); + sharedMemory->sm_shutdown_lock--; + usleep(10000); + sharedMemory->sm_shutdown_lock++; + } + } + + void unlock() { + sharedMemory->sm_shutdown_lock--; + } + + void deregisterEngine(PBMSEnginePtr engine) { + PBMSResultRec result; + int err; + + if ((err = getSharedMemory(true, &result))) + return; + + lock(); + + bool empty = true; + for (int i=0; i<sharedMemory->sm_list_len; i++) { + if (sharedMemory->sm_engine_list[i]) { + if (sharedMemory->sm_engine_list[i] == engine) { + if (sharedMemory->sm_callbacks) + sharedMemory->sm_callbacks->cb_deregister(engine); + sharedMemory->sm_engine_list[i] = NULL; + } + else + empty = false; + } + } + + unlock(); + + if (empty) { + char temp_file[100]; + + sharedMemory->sm_magic = 0; + free(sharedMemory); + sharedMemory = NULL; + const char **prefix = temp_prefix; + while (*prefix) { + getTempFileName(temp_file, *prefix, getpid()); + unlink(temp_file); + prefix++; + } + } + } + + void closeAllTables(const char *table_url) + { + PBMSResultRec result; + int err; + + if ((err = getSharedMemory(true, &result))) + return; + + if (sharedMemory->sm_callbacks) + sharedMemory->sm_callbacks->cb_table_close_all(table_url); + } + + int 
setContentLength(void *stream, off_t len, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + return sharedMemory->sm_callbacks->cb_set_cont_len(stream, len, result); + } + + int writeHead(void *stream, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + return sharedMemory->sm_callbacks->cb_write_head(stream, result); + } + + int writeStream(void *stream, void *buffer, size_t len, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + return sharedMemory->sm_callbacks->cb_write_stream(stream, buffer, len, result); + } + + int closeConn(void *thd, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + if (!sharedMemory->sm_callbacks) + return MS_OK; + + return sharedMemory->sm_callbacks->cb_close_conn(thd, result); + } + + int openTable(void **open_table, char *table_path, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + if (!sharedMemory->sm_callbacks) { + *open_table = NULL; + return MS_OK; + } + + return sharedMemory->sm_callbacks->cb_open_table(open_table, table_path, result); + } + + int closeTable(void *open_table, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + if (sharedMemory->sm_callbacks && open_table) + sharedMemory->sm_callbacks->cb_close_table(open_table); + return MS_OK; + } + + int couldBeURL(char *blob_url) + /* ~*test/~1-150-2b5e0a7-0[*<blob size>][.ext] */ + /* ~*test/_1-150-2b5e0a7-0[*<blob size>][.ext] */ + { + char *ptr; + size_t len; + bool have_blob_size = false; + + if (blob_url) { + if ((len = strlen(blob_url))) { + /* Too short: */ + if (len <= 10) + return 0; + + /* Required prefix: */ + /* NOTE: ~> is deprecated v0.5.4+, now use ~* */ + if (*blob_url != '~' || (*(blob_url + 1) != '>' && *(blob_url + 1) != '*')) + return 0; + + ptr = blob_url 
+ len - 1; + + /* Allow for an optional extension: */ + if (!isdigit(*ptr)) { + while (ptr > blob_url && *ptr != '/' && *ptr != '.') + ptr--; + if (ptr == blob_url || *ptr != '.') + return 0; + if (ptr == blob_url || !isdigit(*ptr)) + return 0; + } + + // field 1: server id OR blob size + do_again: + while (ptr > blob_url && isdigit(*ptr)) + ptr--; + + if (ptr != blob_url && *ptr == '*' && !have_blob_size) { + ptr--; + have_blob_size = true; + goto do_again; + } + + if (ptr == blob_url || *ptr != '-') + return 0; + + + // field 2: Authoration code + ptr--; + if (!isxdigit(*ptr)) + return 0; + + while (ptr > blob_url && isxdigit(*ptr)) + ptr--; + + if (ptr == blob_url || *ptr != '-') + return 0; + + // field 3:offset + ptr--; + if (!isxdigit(*ptr)) + return 0; + + while (ptr > blob_url && isdigit(*ptr)) + ptr--; + + if (ptr == blob_url || *ptr != '-') + return 0; + + + // field 4:Table id + ptr--; + if (!isdigit(*ptr)) + return 0; + + while (ptr > blob_url && isdigit(*ptr)) + ptr--; + + /* NOTE: ^ and : are deprecated v0.5.4+, now use ! 
and ~ */ + if (ptr == blob_url || (*ptr != '^' && *ptr != ':' && *ptr != '_' && *ptr != '~')) + return 0; + ptr--; + + if (ptr == blob_url || *ptr != '/') + return 0; + ptr--; + if (ptr == blob_url) + return 0; + return 1; + } + } + return 0; + } + + int useBlob(void *open_table, char **ret_blob_url, char *blob_url, unsigned short col_index, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + if (!couldBeURL(blob_url)) { + *ret_blob_url = NULL; + return MS_OK; + } + + if (!sharedMemory->sm_callbacks) { + result->mr_code = MS_ERR_INCORRECT_URL; + strcpy(MS_RESULT_MESSAGE_SIZE, result->mr_message, "BLOB streaming engine (PBMS) not installed"); + *result->mr_stack = 0; + return MS_ERR_INCORRECT_URL; + } + + return sharedMemory->sm_callbacks->cb_use_blob(open_table, ret_blob_url, blob_url, col_index, result); + } + + int retainBlobs(void *open_table, PBMSEngineRefPtr eng_ref, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + if (!sharedMemory->sm_callbacks) + return MS_OK; + + return sharedMemory->sm_callbacks->cb_retain_blobs(open_table, eng_ref, result); + } + + int releaseBlob(void *open_table, char *blob_url, unsigned short col_index, PBMSEngineRefPtr eng_ref, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + if (!sharedMemory->sm_callbacks) + return MS_OK; + + if (!couldBeURL(blob_url)) + return MS_OK; + + return sharedMemory->sm_callbacks->cb_release_blob(open_table, blob_url, col_index, eng_ref, result); + } + + int dropTable(const char *table_path, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, result))) + return err; + + if (!sharedMemory->sm_callbacks) + return MS_OK; + + return sharedMemory->sm_callbacks->cb_drop_table(table_path, result); + } + + int renameTable(const char *from_table, const char *to_table, PBMSResultPtr result) + { + int err; + + if ((err = getSharedMemory(true, 
result))) + return err; + + if (!sharedMemory->sm_callbacks) + return MS_OK; + + return sharedMemory->sm_callbacks->cb_rename_table(from_table, to_table, result); + } + + volatile PBMSSharedMemoryPtr sharedMemory; + +private: + int getSharedMemory(bool create, PBMSResultPtr result) + { + int tmp_f; + int r; + char temp_file[100]; + const char **prefix = temp_prefix; + void *tmp_p = NULL; + + if (sharedMemory) + return MS_OK; + + while (*prefix) { + getTempFileName(temp_file, *prefix, getpid()); + tmp_f = open(temp_file, O_RDWR | (create ? O_CREAT : 0), SH_MASK); + if (tmp_f == -1) + return setOSResult(errno, "open", temp_file, result); + + r = lseek(tmp_f, 0, SEEK_SET); + if (r == -1) { + close(tmp_f); + return setOSResult(errno, "lseek", temp_file, result); + } + ssize_t tfer; + char buffer[100]; + + tfer = read(tmp_f, buffer, 100); + if (tfer == -1) { + close(tmp_f); + return setOSResult(errno, "read", temp_file, result); + } + + buffer[tfer] = 0; + sscanf(buffer, "%p", &tmp_p); + sharedMemory = (PBMSSharedMemoryPtr) tmp_p; + if (!sharedMemory || sharedMemory->sm_magic != MS_SHARED_MEMORY_MAGIC) { + if (!create) + return MS_OK; + + sharedMemory = (PBMSSharedMemoryPtr) calloc(1, sizeof(PBMSSharedMemoryRec)); + sharedMemory->sm_magic = MS_SHARED_MEMORY_MAGIC; + sharedMemory->sm_version = MS_SHARED_MEMORY_VERSION; + sharedMemory->sm_list_size = MS_ENGINE_LIST_SIZE; + + r = lseek(tmp_f, 0, SEEK_SET); + if (r == -1) { + close(tmp_f); + return setOSResult(errno, "fseek", temp_file, result); + } + + sprintf(buffer, "%p", (void *) sharedMemory); + tfer = write(tmp_f, buffer, strlen(buffer)); + if (tfer != (ssize_t) strlen(buffer)) { + close(tmp_f); + return setOSResult(errno, "write", temp_file, result); + } + r = fsync(tmp_f); + if (r == -1) { + close(tmp_f); + return setOSResult(errno, "fsync", temp_file, result); + } + } + else if (sharedMemory->sm_version != MS_SHARED_MEMORY_VERSION) { + close(tmp_f); + result->mr_code = -1000; + *result->mr_stack = 0; + 
strcpy(MS_RESULT_MESSAGE_SIZE, result->mr_message, "Shared memory version: "); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, sharedMemory->sm_version); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, ", does not match engine shared memory version: "); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, MS_SHARED_MEMORY_VERSION); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, "."); + return MS_ERR_ENGINE; + } + close(tmp_f); + + // For backward compatability we need to create the old versions but we only need to read the current version. + if (create) + prefix++; + else + break; + } + return MS_OK; + } + + void strcpy(size_t size, char *to, const char *from) + { + if (size > 0) { + size--; + while (*from && size--) + *to++ = *from++; + *to = 0; + } + } + + void strcat(size_t size, char *to, const char *from) + { + while (*to && size--) to++; + strcpy(size, to, from); + } + + void strcat(size_t size, char *to, int val) + { + char buffer[100]; + + sprintf(buffer, "%d", val); + strcat(size, to, buffer); + } + + int setOSResult(int err, const char *func, char *file, PBMSResultPtr result) { + char *msg; + + result->mr_code = err; + *result->mr_stack = 0; + strcpy(MS_RESULT_MESSAGE_SIZE, result->mr_message, "System call "); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, func); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, "() failed on "); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, file); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, ": "); + +#ifdef XT_WIN + if (FormatMessage(FORMAT_MESSAGE_FROM_SYSTEM, NULL, err, 0, iMessage + strlen(iMessage), MS_RESULT_MESSAGE_SIZE - strlen(iMessage), NULL)) { + char *ptr; + + ptr = &iMessage[strlen(iMessage)]; + while (ptr-1 > err_msg) { + if (*(ptr-1) != '\n' && *(ptr-1) != '\r' && *(ptr-1) != '.') + break; + ptr--; + } + *ptr = 0; + + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, " ("); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, err); + strcat(MS_RESULT_MESSAGE_SIZE, 
result->mr_message, ")"); + return MS_ERR_ENGINE; + } +#endif + + msg = strerror(err); + if (msg) { + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, msg); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, " ("); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, err); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, ")"); + } + else { + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, "Unknown OS error code "); + strcat(MS_RESULT_MESSAGE_SIZE, result->mr_message, err); + } + + return MS_ERR_ENGINE; + } + + void getTempFileName(char *temp_file, const char * prefix, int pid) + { + sprintf(temp_file, "/tmp/%s%d", prefix, pid); + } + + bool startsWith(const char *cstr, const char *w_cstr) + { + while (*cstr && *w_cstr) { + if (*cstr != *w_cstr) + return false; + cstr++; + w_cstr++; + } + return *cstr || !*w_cstr; + } + + void deleteTempFiles() + { + struct dirent entry; + struct dirent *result; + DIR *odir; + int err; + char temp_file[100]; + + if (!(odir = opendir("/tmp/"))) + return; + err = readdir_r(odir, &entry, &result); + while (!err && result) { + const char **prefix = temp_prefix; + + while (*prefix) { + if (startsWith(entry.d_name, *prefix)) { + int pid = atoi(entry.d_name + strlen(*prefix)); + + /* If the process does not exist: */ + if (kill(pid, 0) == -1 && errno == ESRCH) { + getTempFileName(temp_file, *prefix, pid); + unlink(temp_file); + } + } + prefix++; + } + + err = readdir_r(odir, &entry, &result); + } + closedir(odir); + } +}; +#endif // PBMS_API + +/* + * The following is a low level API for accessing blobs directly. + */ + + +/* + * Any threads using the direct blob access API must first register them selves with the + * blob streaming engine before using the blob access functions. This is done by calling + * PBMSInitBlobStreamingThread(). Call PBMSDeinitBlobStreamingThread() after the thread is + * done using the direct blob access API + */ + +/* +* PBMSInitBlobStreamingThread(): Returns a pointer to a blob streaming thread. 
+*/ +extern void *PBMSInitBlobStreamingThread(char *thread_name, PBMSResultPtr result); +extern void PBMSDeinitBlobStreamingThread(void *v_bs_thread); + +/* +* PBMSGetError():Gets the last error reported by a blob streaming thread. +*/ +extern void PBMSGetError(void *v_bs_thread, PBMSResultPtr result); + +/* +* PBMSCreateBlob():Creates a new blob in the database of the given size. cont_type can be NULL. +*/ +extern bool PBMSCreateBlob(PBMSBlobIDPtr blob_id, char *database_name, char *cont_type, u_int64_t size); + +/* +* PBMSWriteBlob():Write the data to the blob in one or more chunks. The total size of all the chuncks of +* data written to the blob must match the size specified when the blob was created. +*/ +extern bool PBMSWriteBlob(PBMSBlobIDPtr blob_id, char *database_name, char *data, size_t size, size_t offset); + +/* +* PBMSReadBlob():Read the blob data out of the blob in one or more chunks. +*/ +extern bool PBMSReadBlob(PBMSBlobIDPtr blob_id, char *database_name, char *buffer, size_t *size, size_t offset); + +/* +* PBMSIDToURL():Convert a blob id to a blob URL. The 'url' buffer must be atleast PBMS_BLOB_URL_SIZE bytes in size. +*/ +extern bool PBMSIDToURL(PBMSBlobIDPtr blob_id, char *database_name, char *url); + +/* +* PBMSIDToURL():Convert a blob URL to a blob ID. +*/ +extern bool PBMSURLToID(char *url, PBMSBlobIDPtr blob_id); + +#endif diff --git a/storage/pbxt/src/pthread_xt.cc b/storage/pbxt/src/pthread_xt.cc new file mode 100755 index 00000000000..0a9f4da2074 --- /dev/null +++ b/storage/pbxt/src/pthread_xt.cc @@ -0,0 +1,712 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-03-22 Paul McCullagh + * + * H&G2JCtL + * + * This file contains windows specific code + */ + +#include "xt_config.h" + +#ifdef XT_WIN +#include <my_pthread.h> +#else +#include <sys/resource.h> +#endif +#include <errno.h> +#include <limits.h> +#include <string.h> + +#include "pthread_xt.h" +#include "thread_xt.h" + +#ifdef XT_WIN + +void xt_p_init_threading(void) +{ +} + +int xt_p_set_normal_priority(pthread_t thr) +{ + if (!SetThreadPriority (thr, THREAD_PRIORITY_NORMAL)) + return GetLastError(); + return 0; +} + +int xt_p_set_low_priority(pthread_t thr) +{ + if (!SetThreadPriority (thr, THREAD_PRIORITY_LOWEST)) + return GetLastError(); + return 0; +} + +int xt_p_set_high_priority(pthread_t thr) +{ + if (!SetThreadPriority (thr, THREAD_PRIORITY_HIGHEST)) + return GetLastError(); + return 0; +} + +#define XT_RWLOCK_MAGIC 0x78AC390E + +#ifdef XT_THREAD_LOCK_INFO +int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr, const char *n) +#else +int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr) +#endif +{ + InitializeCriticalSection(&mutex->mt_cs); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_init(&mutex->mt_lock_info, mutex); + mutex->mt_name = n; +#endif + return 0; +} + +int xt_p_mutex_destroy(xt_mutex_type *mutex) +{ + DeleteCriticalSection(&mutex->mt_cs); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&mutex->mt_lock_info); +#endif + return 0; +} + +int xt_p_mutex_lock(xt_mutex_type *mx) +{ + EnterCriticalSection(&mx->mt_cs); +#ifdef 
XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&mx->mt_lock_info); +#endif + return 0; +} + +int xt_p_mutex_unlock(xt_mutex_type *mx) +{ + LeaveCriticalSection(&mx->mt_cs); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&mx->mt_lock_info); +#endif + return 0; +} + +int xt_p_mutex_trylock(xt_mutex_type *mutex) +{ +#if(_WIN32_WINNT >= 0x0400) + /* NOTE: MySQL bug! was using?! + * pthread_mutex_trylock(A) (WaitForSingleObject((A), 0) == WAIT_TIMEOUT) + */ + if (TryEnterCriticalSection(&mutex->mt_cs)) { +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&mutex->mt_lock_info); +#endif + return 0; + } + return WAIT_TIMEOUT; +#else + EnterCriticalSection(&mutex->mt_cs); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&mutex->mt_lock_info); +#endif + return 0; +#endif +} + +#ifdef XT_THREAD_LOCK_INFO +int xt_p_rwlock_init(xt_rwlock_type *rwl, const pthread_condattr_t *attr, const char *n) +#else +int xt_p_rwlock_init(xt_rwlock_type *rwl, const pthread_condattr_t *attr) +#endif +{ + int result; + + if (rwl == NULL) + return ERROR_BAD_ARGUMENTS; + + rwl->rw_sh_count = 0; + rwl->rw_ex_count = 0; + rwl->rw_sh_complete_count = 0; + + result = xt_p_mutex_init_with_autoname(&rwl->rw_ex_lock, NULL); + if (result != 0) + goto failed; + + result = xt_p_mutex_init_with_autoname(&rwl->rw_sh_lock, NULL); + if (result != 0) + goto failed_2; + + result = pthread_cond_init(&rwl->rw_sh_cond, NULL); + if (result != 0) + goto failed_3; + + rwl->rw_magic = XT_RWLOCK_MAGIC; +#ifdef XT_THREAD_LOCK_INFO + rwl->rw_name = n; + xt_thread_lock_info_init(&rwl->rw_lock_info, rwl); +#endif + return 0; + + failed_3: + (void) xt_p_mutex_destroy(&rwl->rw_sh_lock); + + failed_2: + (void) xt_p_mutex_destroy(&rwl->rw_ex_lock); + + failed: + return result; +} + +int xt_p_rwlock_destroy(xt_rwlock_type *rwl) +{ + int result = 0, result1 = 0, result2 = 0; + + if (rwl == NULL) + return ERROR_BAD_ARGUMENTS; + + if (rwl->rw_magic != XT_RWLOCK_MAGIC) + return 
ERROR_BAD_ARGUMENTS; + + if ((result = xt_p_mutex_lock(&rwl->rw_ex_lock)) != 0) + return result; + + if ((result = xt_p_mutex_lock(&rwl->rw_sh_lock)) != 0) { + (void) xt_p_mutex_unlock(&rwl->rw_ex_lock); + return result; + } + + /* + * Check whether any threads own/wait for the lock (wait for ex.access); + * report "BUSY" if so. + */ + if (rwl->rw_ex_count > 0 || rwl->rw_sh_count > rwl->rw_sh_complete_count) { + result = xt_p_mutex_unlock(&rwl->rw_sh_lock); + result1 = xt_p_mutex_unlock(&rwl->rw_ex_lock); + result2 = ERROR_BUSY; + } + else { + rwl->rw_magic = 0; + + if ((result = xt_p_mutex_unlock(&rwl->rw_sh_lock)) != 0) + { + xt_p_mutex_unlock(&rwl->rw_ex_lock); + return result; + } + + if ((result = xt_p_mutex_unlock(&rwl->rw_ex_lock)) != 0) + return result; + + result = pthread_cond_destroy(&rwl->rw_sh_cond); + result1 = xt_p_mutex_destroy(&rwl->rw_sh_lock); + result2 = xt_p_mutex_destroy(&rwl->rw_ex_lock); + } + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&rwl->rw_lock_info); +#endif + + return (result != 0) ? result : ((result1 != 0) ? 
result1 : result2); +} + + +int xt_p_rwlock_rdlock(xt_rwlock_type *rwl) +{ + int result; + + if (rwl == NULL) + return ERROR_BAD_ARGUMENTS; + + if (rwl->rw_magic != XT_RWLOCK_MAGIC) + return ERROR_BAD_ARGUMENTS; + + if ((result = xt_p_mutex_lock(&rwl->rw_ex_lock)) != 0) + return result; + + if (++rwl->rw_sh_count == INT_MAX) { + if ((result = xt_p_mutex_lock(&rwl->rw_sh_lock)) != 0) + { + (void) xt_p_mutex_unlock(&rwl->rw_ex_lock); + return result; + } + + rwl->rw_sh_count -= rwl->rw_sh_complete_count; + rwl->rw_sh_complete_count = 0; + + if ((result = xt_p_mutex_unlock(&rwl->rw_sh_lock)) != 0) + { + (void) xt_p_mutex_unlock(&rwl->rw_ex_lock); + return result; + } + } + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&rwl->rw_lock_info); +#endif + + return (xt_p_mutex_unlock (&(rwl->rw_ex_lock))); +} + +int xt_p_rwlock_wrlock(xt_rwlock_type *rwl) +{ + int result; + + if (rwl == NULL) + return ERROR_BAD_ARGUMENTS; + + if (rwl->rw_magic != XT_RWLOCK_MAGIC) + return ERROR_BAD_ARGUMENTS; + + if ((result = xt_p_mutex_lock (&rwl->rw_ex_lock)) != 0) + return result; + + if ((result = xt_p_mutex_lock (&rwl->rw_sh_lock)) != 0) { + (void) xt_p_mutex_unlock (&rwl->rw_ex_lock); + return result; + } + + if (rwl->rw_ex_count == 0) { + if (rwl->rw_sh_complete_count > 0) { + rwl->rw_sh_count -= rwl->rw_sh_complete_count; + rwl->rw_sh_complete_count = 0; + } + + if (rwl->rw_sh_count > 0) { + rwl->rw_sh_complete_count = -rwl->rw_sh_count; + + do { + result = pthread_cond_wait (&rwl->rw_sh_cond, &rwl->rw_sh_lock.mt_cs); + } + while (result == 0 && rwl->rw_sh_complete_count < 0); + + if (result == 0) + rwl->rw_sh_count = 0; + } + } + + if (result == 0) + rwl->rw_ex_count++; + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&rwl->rw_lock_info); +#endif + + return result; +} + +int xt_p_rwlock_unlock(xt_rwlock_type *rwl) +{ + int result, result1; + + if (rwl == NULL) + return (ERROR_BAD_ARGUMENTS); + + if (rwl->rw_magic != XT_RWLOCK_MAGIC) + return 
ERROR_BAD_ARGUMENTS; + + if (rwl->rw_ex_count == 0) { + if ((result = xt_p_mutex_lock(&rwl->rw_sh_lock)) != 0) + return result; + + if (++rwl->rw_sh_complete_count == 0) + result = pthread_cond_signal(&rwl->rw_sh_cond); + + result1 = xt_p_mutex_unlock(&rwl->rw_sh_lock); + } + else { + rwl->rw_ex_count--; + + result = xt_p_mutex_unlock(&rwl->rw_sh_lock); + result1 = xt_p_mutex_unlock(&rwl->rw_ex_lock); + } + +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&rwl->rw_lock_info); +#endif + + return ((result != 0) ? result : result1); +} + +int xt_p_cond_wait(xt_cond_type *cond, xt_mutex_type *mutex) +{ + return xt_p_cond_timedwait(cond, mutex, NULL); +} + +int xt_p_cond_timedwait(xt_cond_type *cond, xt_mutex_type *mt, struct timespec *abstime) +{ + pthread_mutex_t *mutex = &mt->mt_cs; + int result; + long timeout; + union ft64 now; + + if (abstime != NULL) { + GetSystemTimeAsFileTime(&now.ft); + + timeout = (long)((abstime->tv.i64 - now.i64) / 10000); + if (timeout < 0) + timeout = 0L; + if (timeout > abstime->max_timeout_msec) + timeout = abstime->max_timeout_msec; + } + else + timeout= INFINITE; + + WaitForSingleObject(cond->broadcast_block_event, INFINITE); + + EnterCriticalSection(&cond->lock_waiting); + cond->waiting++; + LeaveCriticalSection(&cond->lock_waiting); + + LeaveCriticalSection(mutex); + + result= WaitForMultipleObjects(2, cond->events, FALSE, timeout); + + EnterCriticalSection(&cond->lock_waiting); + cond->waiting--; + + if (cond->waiting == 0) { + /* The last waiter must reset the broadcast + * state (whether there was a broadcast or not)! + */ + ResetEvent(cond->events[xt_cond_type::BROADCAST]); + SetEvent(cond->broadcast_block_event); + } + LeaveCriticalSection(&cond->lock_waiting); + + EnterCriticalSection(mutex); + + return result == WAIT_TIMEOUT ? ETIMEDOUT : 0; +} + +int xt_p_join(pthread_t thread, void **value) +{ + switch (WaitForSingleObject(thread, INFINITE)) { + case WAIT_OBJECT_0: + case WAIT_TIMEOUT: + /* Don't do this! 
According to the Win docs: + * _endthread automatically closes the thread handle + * (whereas _endthreadex does not). Therefore, when using + * _beginthread and _endthread, do not explicitly close the + * thread handle by calling the Win32 CloseHandle API. + CloseHandle(thread); + */ + break; + case WAIT_FAILED: + return GetLastError(); + } + return 0; +} + +#else // XT_WIN + +#ifdef __darwin__ +#define POLICY SCHED_RR +#else +#define POLICY pth_policy +#endif + +static int pth_policy; +static int pth_normal_priority; +static int pth_min_priority; +static int pth_max_priority; + +/* Return zero if the priority was set OK, + * else errno. + */ +static int pth_set_priority(pthread_t thread, int priority) +{ + struct sched_param sp; + + memset(&sp, 0, sizeof(struct sched_param)); + sp.sched_priority = priority; + return pthread_setschedparam(thread, POLICY, &sp); +} + +static void pth_get_priority_limits(void) +{ + XTThreadPtr self = NULL; + struct sched_param sp; + int err; + int start; + + /* Save original priority: */ + err = pthread_getschedparam(pthread_self(), &pth_policy, &sp); + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return; + } + pth_normal_priority = sp.sched_priority; + + start = sp.sched_priority; + +#ifdef XT_FREEBSD + pth_min_priority = sched_get_priority_min(sched_getscheduler(0)); + pth_max_priority = sched_get_priority_max(sched_getscheduler(0)); +#else + /* Search for the minimum priority: */ + pth_min_priority = start; + for (;;) { + /* 2007-03-01: Corrected, pth_set_priority returns the error code + * (thanks to Hakan for pointing out this bug!) 
+ */ + if (pth_set_priority(pthread_self(), pth_min_priority-1) != 0) + break; + pth_min_priority--; + } + + /* Search for the maximum priority: */ + pth_max_priority = start; + for (;;) { + if (pth_set_priority(pthread_self(), pth_max_priority+1) != 0) + break; + pth_max_priority++; + } + + /* Restore original priority: */ + pthread_setschedparam(pthread_self(), pth_policy, &sp); +#endif +} + +xtPublic void xt_p_init_threading(void) +{ + pth_get_priority_limits(); +} + +xtPublic int xt_p_set_low_priority(pthread_t thr) +{ + if (pth_min_priority == pth_max_priority) { + /* Under Linux the priority of normal (non-runtime) + * threads is set using the standard methods + * for setting process priority. + */ + + /* We could set who == 0 because it should have the same effect + * as using the PID. + */ + + /* -20 = highest, 20 = lowest */ + if (setpriority(PRIO_PROCESS, getpid(), 20) == -1) + return errno; + return 0; + } + return pth_set_priority(thr, pth_min_priority); +} + +xtPublic int xt_p_set_normal_priority(pthread_t thr) +{ + if (pth_min_priority == pth_max_priority) { + if (setpriority(PRIO_PROCESS, getpid(), 0) == -1) + return errno; + return 0; + } + return pth_set_priority(thr, pth_normal_priority); +} + +xtPublic int xt_p_set_high_priority(pthread_t thr) +{ + if (pth_min_priority == pth_max_priority) { + if (setpriority(PRIO_PROCESS, getpid(), -20) == -1) + return errno; + return 0; + } + return pth_set_priority(thr, pth_max_priority); +} + +#ifdef DEBUG_LOCKING + +xtPublic int xt_p_mutex_lock(xt_mutex_type *mutex, u_int line, const char *file) +{ + XTThreadPtr self = xt_get_self(); + int r; + + ASSERT_NS(mutex->mu_init == 12345); + r = pthread_mutex_lock(&mutex->mu_plock); + if (r == 0) { + if (mutex->mu_trace) + printf("==LOCK mutex %d %s:%d\n", (int) mutex->mu_trace, file, (int) line); + ASSERT_NS(!mutex->mu_locker); + mutex->mu_locker = self; + mutex->mu_line = line; + mutex->mu_file = file; + } +#ifdef XT_THREAD_LOCK_INFO + 
xt_thread_lock_info_add_owner(&mutex->mu_lock_info); +#endif + return r; +} + +xtPublic int xt_p_mutex_unlock(xt_mutex_type *mutex) +{ + XTThreadPtr self = xt_get_self(); + + ASSERT_NS(mutex->mu_init == 12345); + ASSERT_NS(mutex->mu_locker == self); + mutex->mu_locker = NULL; + if (mutex->mu_trace) + printf("UNLOCK mutex %d\n", (int) mutex->mu_trace); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&mutex->mu_lock_info); +#endif + return pthread_mutex_unlock(&mutex->mu_plock); +} + +xtPublic int xt_p_mutex_destroy(xt_mutex_type *mutex) +{ + ASSERT_NS(mutex->mu_init == 12345); + mutex->mu_init = 89898; +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&mutex->mu_lock_info); +#endif + return pthread_mutex_destroy(&mutex->mu_plock); +} + +xtPublic int xt_p_mutex_trylock(xt_mutex_type *mutex) +{ + XTThreadPtr self = xt_get_self(); + int r; + + ASSERT_NS(mutex->mu_init == 12345); + r = pthread_mutex_trylock(&mutex->mu_plock); + if (r == 0) { + ASSERT_NS(!mutex->mu_locker); + mutex->mu_locker = self; +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&mutex->mu_lock_info); +#endif + } + return r; +} + +#ifdef XT_THREAD_LOCK_INFO +xtPublic int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr, const char *n) +#else +xtPublic int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr) +#endif +{ + mutex->mu_init = 12345; + mutex->mu_trace = FALSE; + mutex->mu_locker = NULL; +#ifdef XT_THREAD_LOCK_INFO + mutex->mu_name = n; + xt_thread_lock_info_init(&mutex->mu_lock_info, mutex); +#endif + return pthread_mutex_init(&mutex->mu_plock, attr); +} + +xtPublic int xt_p_cond_wait(xt_cond_type *cond, xt_mutex_type *mutex) +{ + XTThreadPtr self = xt_get_self(); + int r; + + ASSERT_NS(mutex->mu_init == 12345); + ASSERT_NS(mutex->mu_locker == self); + mutex->mu_locker = NULL; + r = pthread_cond_wait(cond, &mutex->mu_plock); + ASSERT_NS(!mutex->mu_locker); + mutex->mu_locker = self; + return r; +} + +xtPublic int 
xt_p_cond_timedwait(xt_cond_type *cond, xt_mutex_type *mutex, const struct timespec *abstime) +{ + XTThreadPtr self = xt_get_self(); + int r; + + ASSERT_NS(mutex->mu_init == 12345); + ASSERT_NS(mutex->mu_locker == self); + mutex->mu_locker = NULL; + r = pthread_cond_timedwait(cond, &mutex->mu_plock, abstime); + ASSERT_NS(!mutex->mu_locker); + mutex->mu_locker = self; + return r; +} + +xtPublic int xt_p_rwlock_rdlock(xt_rwlock_type *rwlock) +{ + int r; + + ASSERT_NS(rwlock->rw_init == 67890); + r = pthread_rwlock_rdlock(&rwlock->rw_plock); +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&rwlock->rw_lock_info); +#endif + return r; +} + +xtPublic int xt_p_rwlock_wrlock(xt_rwlock_type *rwlock) +{ + XTThreadPtr self = xt_get_self(); + int r; + + ASSERT_NS(rwlock->rw_init == 67890); + r = pthread_rwlock_wrlock(&rwlock->rw_plock); + if (r == 0) { + ASSERT_NS(!rwlock->rw_locker); + rwlock->rw_locker = self; + } +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_add_owner(&rwlock->rw_lock_info); +#endif + return r; +} + +xtPublic int xt_p_rwlock_unlock(xt_rwlock_type *rwlock) +{ + XTThreadPtr self = xt_get_self(); + + ASSERT_NS(rwlock->rw_init == 67890); + if (rwlock->rw_locker) { + ASSERT_NS(rwlock->rw_locker == self); + rwlock->rw_locker = NULL; + } +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_release_owner(&rwlock->rw_lock_info); +#endif + return pthread_rwlock_unlock(&rwlock->rw_plock); +} + +xtPublic int xt_p_rwlock_destroy(xt_rwlock_type *rwlock) +{ + ASSERT_NS(rwlock->rw_init == 67890); + rwlock->rw_init = 0; +#ifdef XT_THREAD_LOCK_INFO + xt_thread_lock_info_free(&rwlock->rw_lock_info); +#endif + return pthread_rwlock_destroy(&rwlock->rw_plock); +} + +#ifdef XT_THREAD_LOCK_INFO +xtPublic int xt_p_rwlock_init(xt_rwlock_type *rwlock, const pthread_rwlockattr_t *attr, const char *n) +#else +xtPublic int xt_p_rwlock_init(xt_rwlock_type *rwlock, const pthread_rwlockattr_t *attr) +#endif +{ + rwlock->rw_init = 67890; + rwlock->rw_readers = 0; + 
rwlock->rw_locker = NULL; +#ifdef XT_THREAD_LOCK_INFO + rwlock->rw_name = n; + xt_thread_lock_info_init(&rwlock->rw_lock_info, rwlock); +#endif + return pthread_rwlock_init(&rwlock->rw_plock, attr); +} + +#endif // DEBUG_LOCKING + +#endif // XT_WIN + diff --git a/storage/pbxt/src/pthread_xt.h b/storage/pbxt/src/pthread_xt.h new file mode 100755 index 00000000000..d8ef1a85d41 --- /dev/null +++ b/storage/pbxt/src/pthread_xt.h @@ -0,0 +1,292 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-03-22 Paul McCullagh + * + * H&G2JCtL + * + * This file contains windows specific code. 
+ */ + +#ifndef __win_xt_h__ +#define __win_xt_h__ + +#ifdef XT_WIN +#include <windef.h> +#include <my_pthread.h> +#else +#include <pthread.h> +#endif + +#include "locklist_xt.h" + +#ifdef DEBUG +//#define DEBUG_LOCKING +#endif + +#define xt_cond_struct _opaque_pthread_cond_t +#define xt_cond_type pthread_cond_t + +#define xt_cond_wait pthread_cond_wait +#define xt_cond_wakeall pthread_cond_broadcast + +#ifdef __cplusplus +extern "C" { +#endif +void xt_p_init_threading(void); +int xt_p_set_normal_priority(pthread_t thr); +int xt_p_set_low_priority(pthread_t thr); +int xt_p_set_high_priority(pthread_t thr); +#ifdef __cplusplus +} +#endif + +#ifdef XT_WIN + +#ifdef __cplusplus +extern "C" { +#endif + +typedef LPVOID pthread_key_t; + +typedef struct xt_mutex_struct { + CRITICAL_SECTION mt_cs; +#ifdef XT_THREAD_LOCK_INFO + const char *mt_name; + XTThreadLockInfoRec mt_lock_info; +#endif +} xt_mutex_type; + +typedef struct xt_rwlock_struct { + xt_mutex_type rw_ex_lock; + xt_mutex_type rw_sh_lock; + pthread_cond_t rw_sh_cond; + int rw_sh_count; + int rw_ex_count; + int rw_sh_complete_count; + int rw_magic; +#ifdef XT_THREAD_LOCK_INFO + const char *rw_name; + XTThreadLockInfoRec rw_lock_info; +#endif +} xt_rwlock_type; + +#ifdef XT_THREAD_LOCK_INFO +int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr, const char *name); +#else +int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr); +#endif +int xt_p_mutex_destroy(xt_mutex_type *mutex); +int xt_p_mutex_lock(xt_mutex_type *mx); +int xt_p_mutex_unlock(xt_mutex_type *mx); +int xt_p_mutex_trylock(xt_mutex_type *mutex); + +#ifdef XT_THREAD_LOCK_INFO +int xt_p_rwlock_init(xt_rwlock_type *rwlock, const pthread_condattr_t *attr, const char *name); +#else +int xt_p_rwlock_init(xt_rwlock_type *rwlock, const pthread_condattr_t *attr); +#endif +int xt_p_rwlock_destroy(xt_rwlock_type *rwlock); +int xt_p_rwlock_rdlock(xt_rwlock_type *mx); +int xt_p_rwlock_wrlock(xt_rwlock_type *mx); +int 
xt_p_rwlock_unlock(xt_rwlock_type *mx); + +int xt_p_cond_wait(xt_cond_type *cond, xt_mutex_type *mutex); +int xt_p_cond_timedwait(xt_cond_type *cond, xt_mutex_type *mutex, struct timespec *abstime); + +int xt_p_join(pthread_t thread, void **value); + +#ifdef __cplusplus +} +#endif + +#ifdef XT_THREAD_LOCK_INFO +#define xt_p_rwlock_init_with_name(a,b,c) xt_p_rwlock_init(a,b,c) +#define xt_p_rwlock_init_with_autoname(a,b) xt_p_rwlock_init_with_name(a,b,LOCKLIST_ARG_SUFFIX(a)) +#else +#define xt_p_rwlock_init_with_name(a,b,c) xt_p_rwlock_init(a,b) +#define xt_p_rwlock_init_with_autoname(a,b) xt_p_rwlock_init(a,b) +#endif + +#define xt_slock_rwlock_ns xt_p_rwlock_rdlock +#define xt_xlock_rwlock_ns xt_p_rwlock_wrlock +#define xt_unlock_rwlock_ns xt_p_rwlock_unlock + +#ifdef XT_THREAD_LOCK_INFO +#define xt_p_mutex_init_with_name(a,b,c) xt_p_mutex_init(a,b,c) +#define xt_p_mutex_init_with_autoname(a,b) xt_p_mutex_init_with_name(a,b,LOCKLIST_ARG_SUFFIX(a)) +#else +#define xt_p_mutex_init_with_name(a,b,c) xt_p_mutex_init(a,b) +#define xt_p_mutex_init_with_autoname(a,b) xt_p_mutex_init(a,b) +#endif +#define xt_lock_mutex_ns xt_p_mutex_lock +#define xt_unlock_mutex_ns xt_p_mutex_unlock +#define xt_mutex_trylock xt_p_mutex_trylock + +#else // XT_WIN + +/* Hands off! 
*/ +#ifdef pthread_mutex_t +#undef pthread_mutex_t +#endif +#ifdef pthread_rwlock_t +#undef pthread_rwlock_t +#endif +#ifdef pthread_mutex_init +#undef pthread_mutex_init +#endif +#ifdef pthread_mutex_destroy +#undef pthread_mutex_destroy +#endif +#ifdef pthread_mutex_lock +#undef pthread_mutex_lock +#endif +#ifdef pthread_mutex_unlock +#undef pthread_mutex_unlock +#endif +#ifdef pthread_cond_wait +#undef pthread_cond_wait +#endif +#ifdef pthread_cond_broadcast +#undef pthread_cond_broadcast +#endif +#ifdef pthread_mutex_trylock +#undef pthread_mutex_trylock +#endif + +/* + * ----------------------------------------------------------------------- + * Redefinition of pthread locking, for debugging + */ + +struct XTThread; + + +#ifdef XT_THREAD_LOCK_INFO + +#define xt_p_mutex_init_with_name(a,b,c) xt_p_mutex_init(a,b,c) +#define xt_p_mutex_init_with_autoname(a,b) xt_p_mutex_init_with_name(a,b,LOCKLIST_ARG_SUFFIX(a)) + +#define xt_p_rwlock_init_with_name(a,b,c) xt_p_rwlock_init(a,b,c) +#define xt_p_rwlock_init_with_autoname(a,b) xt_p_rwlock_init_with_name(a,b,LOCKLIST_ARG_SUFFIX(a)) + +#else + +#define xt_p_mutex_init_with_name(a,b,c) xt_p_mutex_init(a,b) +#define xt_p_mutex_init_with_autoname(a,b) xt_p_mutex_init(a,b) + +#define xt_p_rwlock_init_with_name(a,b,c) xt_p_rwlock_init(a,b) +#define xt_p_rwlock_init_with_autoname(a,b) xt_p_rwlock_init(a,b) + +#endif + +#ifdef DEBUG_LOCKING + +#ifdef __cplusplus +extern "C" { +#endif + +typedef struct xt_mutex_struct { + unsigned short mu_init; + unsigned short mu_trace; + unsigned int mu_line; + const char *mu_file; + struct XTThread *mu_locker; + pthread_mutex_t mu_plock; +#ifdef XT_THREAD_LOCK_INFO + const char *mu_name; + XTThreadLockInfoRec mu_lock_info; +#endif +} xt_mutex_type; + +typedef struct xt_rwlock_struct { + u_int rw_init; + volatile u_int rw_readers; + struct XTThread *rw_locker; + pthread_rwlock_t rw_plock; +#ifdef XT_THREAD_LOCK_INFO + const char *rw_name; + XTThreadLockInfoRec rw_lock_info; 
+#endif +} xt_rwlock_type; + +int xt_p_rwlock_rdlock(xt_rwlock_type *mx); +int xt_p_rwlock_wrlock(xt_rwlock_type *mx); +int xt_p_rwlock_unlock(xt_rwlock_type *mx); + +int xt_p_mutex_lock(xt_mutex_type *mx, u_int line, const char *file); +int xt_p_mutex_unlock(xt_mutex_type *mx); +int xt_p_mutex_trylock(xt_mutex_type *mutex); +int xt_p_mutex_destroy(xt_mutex_type *mutex); +#ifdef XT_THREAD_LOCK_INFO +int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr, const char *name); +#else +int xt_p_mutex_init(xt_mutex_type *mutex, const pthread_mutexattr_t *attr); +#endif +int xt_p_rwlock_destroy(xt_rwlock_type * rwlock); +#ifdef XT_THREAD_LOCK_INFO +int xt_p_rwlock_init(xt_rwlock_type *rwlock, const pthread_rwlockattr_t *attr, const char *name); +#else +int xt_p_rwlock_init(xt_rwlock_type *rwlock, const pthread_rwlockattr_t *attr); +#endif +int xt_p_cond_wait(xt_cond_type *cond, xt_mutex_type *mutex); +int xt_p_cond_timedwait(xt_cond_type *cond, xt_mutex_type *mutex, const struct timespec *abstime); + +#ifdef __cplusplus +} +#endif + +#define xt_slock_rwlock_ns xt_p_rwlock_rdlock +#define xt_xlock_rwlock_ns xt_p_rwlock_wrlock +#define xt_unlock_rwlock_ns xt_p_rwlock_unlock + +#define xt_lock_mutex_ns(x) xt_p_mutex_lock(x, __LINE__, __FILE__) +#define xt_unlock_mutex_ns xt_p_mutex_unlock +#define xt_mutex_trylock xt_p_mutex_trylock + +#else // DEBUG_LOCKING + +#define xt_rwlock_struct _opaque_pthread_rwlock_t +#define xt_mutex_struct _opaque_pthread_mutex_t + +#define xt_rwlock_type pthread_rwlock_t +#define xt_mutex_type pthread_mutex_t + +#define xt_slock_rwlock_ns pthread_rwlock_rdlock +#define xt_xlock_rwlock_ns pthread_rwlock_wrlock +#define xt_unlock_rwlock_ns pthread_rwlock_unlock + +#define xt_lock_mutex_ns pthread_mutex_lock +#define xt_unlock_mutex_ns pthread_mutex_unlock +#define xt_mutex_trylock pthread_mutex_trylock + +#define xt_p_mutex_trylock pthread_mutex_trylock +#define xt_p_mutex_destroy pthread_mutex_destroy +#define xt_p_mutex_init 
pthread_mutex_init +#define xt_p_rwlock_destroy pthread_rwlock_destroy +#define xt_p_rwlock_init pthread_rwlock_init +#define xt_p_cond_wait pthread_cond_wait +#define xt_p_cond_timedwait pthread_cond_timedwait + +#endif // DEBUG_LOCKING + +#define xt_p_join pthread_join + +#endif // XT_WIN + +#endif diff --git a/storage/pbxt/src/restart_xt.cc b/storage/pbxt/src/restart_xt.cc new file mode 100644 index 00000000000..3bf03e5fb8c --- /dev/null +++ b/storage/pbxt/src/restart_xt.cc @@ -0,0 +1,3203 @@ +/* Copyright (c) 2007 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2007-11-12 Paul McCullagh + * + * H&G2JCtL + * + * Restart and write data to the database. 
+ */ + +#include "xt_config.h" + +#include <signal.h> +#include <time.h> + +#ifndef DRIZZLED +#include "mysql_priv.h" +#endif + +#include "ha_pbxt.h" + +#include "xactlog_xt.h" +#include "database_xt.h" +#include "util_xt.h" +#include "strutil_xt.h" +#include "filesys_xt.h" +#include "restart_xt.h" +#include "myxt_xt.h" +#include "trace_xt.h" + +#ifdef DEBUG +//#define DEBUG_PRINT +//#define DEBUG_KEEP_LOGS +//#define PRINT_LOG_ON_RECOVERY +//#define TRACE_RECORD_DATA +//#define SKIP_STARTUP_CHECKPOINT +//#define NEVER_CHECKPOINT +//#define TRACE_CHECKPOINT +#endif + +#define PRINTF printf +//#define PRINTF xt_ftracef +//#define PRINTF xt_trace + +void xt_print_bytes(xtWord1 *buf, u_int len) +{ + for (u_int i=0; i<len; i++) { + PRINTF("%02x ", (u_int) *buf); + buf++; + } +} + +void xt_print_log_record(xtLogID log, xtLogOffset offset, XTXactLogBufferDPtr record) +{ + const char *type = NULL; + const char *rec_type = NULL; + xtOpSeqNo op_no = 0; + xtTableID tab_id = 0; + xtRowID row_id = 0; + xtRecordID rec_id = 0; + xtBool xn_set = FALSE; + xtXactID xn_id = 0; + char buffer[200]; + XTTabRecExtDPtr rec_buf; + XTTabRecExtDPtr ext_rec; + XTTabRecFixDPtr fix_rec; + u_int rec_len; + xtLogID log_id = 0; + xtLogOffset log_offset = 0; + + rec_buf = NULL; + ext_rec = NULL; + fix_rec = NULL; + rec_len = 0; + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_DELETE_BG: + op_no = XT_GET_DISK_4(record->xu.xu_op_seq_4); + tab_id = XT_GET_DISK_4(record->xu.xu_tab_id_4); + rec_id = XT_GET_DISK_4(record->xu.xu_rec_id_4); + xn_id = XT_GET_DISK_4(record->xu.xu_xact_id_4); + row_id = XT_GET_DISK_4(record->xu.xu_row_id_4); + rec_len = XT_GET_DISK_2(record->xu.xu_size_2); + xn_set = TRUE; + type="rec"; + rec_buf = (XTTabRecExtDPtr) &record->xu.xu_rec_type_1; + ext_rec = (XTTabRecExtDPtr) &record->xu.xu_rec_type_1; + if 
(XT_REC_IS_EXT_DLOG(ext_rec->tr_rec_type_1)) { + log_id = XT_GET_DISK_2(ext_rec->re_log_id_2); + log_offset = XT_GET_DISK_6(ext_rec->re_log_offs_6); + } + else { + ext_rec = NULL; + fix_rec = (XTTabRecFixDPtr) &record->xu.xu_rec_type_1; + } + break; + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE_FL_BG: + op_no = XT_GET_DISK_4(record->xf.xf_op_seq_4); + tab_id = XT_GET_DISK_4(record->xf.xf_tab_id_4); + rec_id = XT_GET_DISK_4(record->xf.xf_rec_id_4); + xn_id = XT_GET_DISK_4(record->xf.xf_xact_id_4); + row_id = XT_GET_DISK_4(record->xf.xf_row_id_4); + rec_len = XT_GET_DISK_2(record->xf.xf_size_2); + xn_set = TRUE; + type="rec"; + rec_buf = (XTTabRecExtDPtr) &record->xf.xf_rec_type_1; + ext_rec = (XTTabRecExtDPtr) &record->xf.xf_rec_type_1; + if (XT_REC_IS_EXT_DLOG(ext_rec->tr_rec_type_1)) { + log_id = XT_GET_DISK_2(ext_rec->re_log_id_2); + log_offset = XT_GET_DISK_6(ext_rec->re_log_offs_6); + } + else { + ext_rec = NULL; + fix_rec = (XTTabRecFixDPtr) &record->xf.xf_rec_type_1; + } + break; + case XT_LOG_ENT_REC_FREED: + case XT_LOG_ENT_REC_REMOVED: + case XT_LOG_ENT_REC_REMOVED_EXT: + op_no = XT_GET_DISK_4(record->fr.fr_op_seq_4); + tab_id = XT_GET_DISK_4(record->fr.fr_tab_id_4); + rec_id = XT_GET_DISK_4(record->fr.fr_rec_id_4); + xn_id = XT_GET_DISK_4(record->fr.fr_xact_id_4); + xn_set = TRUE; + type="rec"; + break; + case XT_LOG_ENT_REC_REMOVED_BI: + op_no = XT_GET_DISK_4(record->rb.rb_op_seq_4); + tab_id = XT_GET_DISK_4(record->rb.rb_tab_id_4); + rec_id = XT_GET_DISK_4(record->rb.rb_rec_id_4); + xn_id = XT_GET_DISK_4(record->rb.rb_xact_id_4); + row_id = XT_GET_DISK_4(record->rb.rb_row_id_4); + rec_len = XT_GET_DISK_2(record->rb.rb_size_2); + xn_set = TRUE; + type="rec"; + rec_buf = (XTTabRecExtDPtr) &record->rb.rb_rec_type_1; + ext_rec = (XTTabRecExtDPtr) &record->rb.rb_rec_type_1; + if (XT_REC_IS_EXT_DLOG(record->rb.rb_rec_type_1)) { + 
log_id = XT_GET_DISK_2(ext_rec->re_log_id_2); + log_offset = XT_GET_DISK_6(ext_rec->re_log_offs_6); + } + else { + ext_rec = NULL; + fix_rec = (XTTabRecFixDPtr) &record->rb.rb_rec_type_1; + } + break; + case XT_LOG_ENT_REC_MOVED: + op_no = XT_GET_DISK_4(record->xw.xw_op_seq_4); + tab_id = XT_GET_DISK_4(record->xw.xw_tab_id_4); + rec_id = XT_GET_DISK_4(record->xw.xw_rec_id_4); + log_id = XT_GET_DISK_2(&record->xw.xw_rec_type_1); // This is actually correct + log_offset = XT_GET_DISK_6(record->xw.xw_next_rec_id_4); // This is actually correct! + type="rec"; + break; + case XT_LOG_ENT_REC_CLEANED: + case XT_LOG_ENT_REC_CLEANED_1: + case XT_LOG_ENT_REC_UNLINKED: + op_no = XT_GET_DISK_4(record->xw.xw_op_seq_4); + tab_id = XT_GET_DISK_4(record->xw.xw_tab_id_4); + rec_id = XT_GET_DISK_4(record->xw.xw_rec_id_4); + type="rec"; + break; + case XT_LOG_ENT_ROW_NEW: + case XT_LOG_ENT_ROW_NEW_FL: + case XT_LOG_ENT_ROW_ADD_REC: + case XT_LOG_ENT_ROW_SET: + case XT_LOG_ENT_ROW_FREED: + op_no = XT_GET_DISK_4(record->xa.xa_op_seq_4); + tab_id = XT_GET_DISK_4(record->xa.xa_tab_id_4); + rec_id = XT_GET_DISK_4(record->xa.xa_row_id_4); + type="row"; + break; + case XT_LOG_ENT_NO_OP: + op_no = XT_GET_DISK_4(record->no.no_op_seq_4); + tab_id = XT_GET_DISK_4(record->no.no_tab_id_4); + type="-"; + break; + case XT_LOG_ENT_END_OF_LOG: + break; + } + + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_HEADER: + rec_type = "HEADER"; + break; + case XT_LOG_ENT_NEW_LOG: + rec_type = "NEW LOG"; + break; + case XT_LOG_ENT_DEL_LOG: + sprintf(buffer, "DEL LOG log=%d ", (int) XT_GET_DISK_4(record->xl.xl_log_id_4)); + rec_type = buffer; + break; + case XT_LOG_ENT_NEW_TAB: + rec_type = "NEW TABLE"; + break; + case XT_LOG_ENT_COMMIT: + rec_type = "COMMIT"; + xn_id = XT_GET_DISK_4(record->xe.xe_xact_id_4); + xn_set = TRUE; + break; + case XT_LOG_ENT_ABORT: + rec_type = "ABORT"; + xn_id = XT_GET_DISK_4(record->xe.xe_xact_id_4); + xn_set = TRUE; + break; + case XT_LOG_ENT_CLEANUP: + rec_type = "CLEANUP"; 
+ xn_id = XT_GET_DISK_4(record->xc.xc_xact_id_4); + xn_set = TRUE; + break; + case XT_LOG_ENT_REC_MODIFIED: + rec_type = "MODIFIED"; + break; + case XT_LOG_ENT_UPDATE: + rec_type = "UPDATE"; + break; + case XT_LOG_ENT_UPDATE_FL: + rec_type = "UPDATE-FL"; + break; + case XT_LOG_ENT_INSERT: + rec_type = "INSERT"; + break; + case XT_LOG_ENT_INSERT_FL: + rec_type = "INSERT-FL"; + break; + case XT_LOG_ENT_DELETE: + rec_type = "DELETE"; + break; + case XT_LOG_ENT_DELETE_FL: + rec_type = "DELETE-FL"; + break; + case XT_LOG_ENT_UPDATE_BG: + rec_type = "UPDATE-BG"; + break; + case XT_LOG_ENT_UPDATE_FL_BG: + rec_type = "UPDATE-FL-BG"; + break; + case XT_LOG_ENT_INSERT_BG: + rec_type = "INSERT-BG"; + break; + case XT_LOG_ENT_INSERT_FL_BG: + rec_type = "INSERT-FL-BG"; + break; + case XT_LOG_ENT_DELETE_BG: + rec_type = "DELETE-BG"; + break; + case XT_LOG_ENT_DELETE_FL_BG: + rec_type = "DELETE-FL-BG"; + break; + case XT_LOG_ENT_REC_FREED: + rec_type = "FREE REC"; + break; + case XT_LOG_ENT_REC_REMOVED: + rec_type = "REMOVED REC"; + break; + case XT_LOG_ENT_REC_REMOVED_EXT: + rec_type = "REMOVED-X REC"; + break; + case XT_LOG_ENT_REC_REMOVED_BI: + rec_type = "REMOVED-BI REC"; + break; + case XT_LOG_ENT_REC_MOVED: + rec_type = "MOVED REC"; + break; + case XT_LOG_ENT_REC_CLEANED: + rec_type = "CLEAN REC"; + break; + case XT_LOG_ENT_REC_CLEANED_1: + rec_type = "CLEAN REC-1"; + break; + case XT_LOG_ENT_REC_UNLINKED: + rec_type = "UNLINK REC"; + break; + case XT_LOG_ENT_ROW_NEW: + rec_type = "NEW ROW"; + break; + case XT_LOG_ENT_ROW_NEW_FL: + rec_type = "NEW ROW-FL"; + break; + case XT_LOG_ENT_ROW_ADD_REC: + rec_type = "REC ADD ROW"; + break; + case XT_LOG_ENT_ROW_SET: + rec_type = "SET ROW"; + break; + case XT_LOG_ENT_ROW_FREED: + rec_type = "FREE ROW"; + break; + case XT_LOG_ENT_OP_SYNC: + rec_type = "OP SYNC"; + break; + case XT_LOG_ENT_NO_OP: + rec_type = "NO OP"; + break; + case XT_LOG_ENT_END_OF_LOG: + rec_type = "END OF LOG"; + break; + } + + if (log) + PRINTF("log=%d 
offset=%d ", (int) log, (int) offset); + PRINTF("%s ", rec_type); + if (type) + PRINTF("op=%lu tab=%lu %s=%lu ", (u_long) op_no, (u_long) tab_id, type, (u_long) rec_id); + if (row_id) + PRINTF("row=%lu ", (u_long) row_id); + if (log_id) + PRINTF("log=%lu offset=%lu ", (u_long) log_id, (u_long) log_offset); + if (xn_set) + PRINTF("xact=%lu ", (u_long) xn_id); + +#ifdef TRACE_RECORD_DATA + if (rec_buf) { + switch (rec_buf->tr_rec_type_1 & XT_TAB_STATUS_MASK) { + case XT_TAB_STATUS_FREED: + PRINTF("FREE"); + break; + case XT_TAB_STATUS_DELETE: + PRINTF("DELE"); + break; + case XT_TAB_STATUS_FIXED: + PRINTF("FIX-"); + break; + case XT_TAB_STATUS_VARIABLE: + PRINTF("VAR-"); + break; + case XT_TAB_STATUS_EXT_DLOG: + PRINTF("EXT-"); + break; + } + if (rec_buf->tr_rec_type_1 & XT_TAB_STATUS_CLEANED_BIT) + PRINTF("C"); + else + PRINTF(" "); + } + if (ext_rec) { + rec_len -= offsetof(XTTabRecExtDRec, re_data); + xt_print_bytes((xtWord1 *) ext_rec, offsetof(XTTabRecExtDRec, re_data)); + PRINTF("| "); + if (rec_len > 20) + rec_len = 20; + xt_print_bytes(ext_rec->re_data, rec_len); + } + if (fix_rec) { + rec_len -= offsetof(XTTabRecFixDRec, rf_data); + xt_print_bytes((xtWord1 *) fix_rec, offsetof(XTTabRecFixDRec, rf_data)); + PRINTF("| "); + if (rec_len > 20) + rec_len = 20; + xt_print_bytes(fix_rec->rf_data, rec_len); + } +#endif + + PRINTF("\n"); +} + +#ifdef DEBUG_PRINT +void check_rows(void) +{ + static XTOpenFilePtr of = NULL; + + if (!of) + of = xt_open_file_ns("./test/test_tab-1.xtr", XT_FS_DEFAULT); + if (of) { + size_t size = (size_t) xt_seek_eof_file(NULL, of); + xtWord8 *buffer = (xtWord8 *) xt_malloc_ns(size); + xt_pread_file(of, 0, size, size, buffer, NULL); + for (size_t i=0; i<size/8; i++) { + if (!buffer[i]) + printf("%d is NULL\n", (int) i); + } + } +} + +#endif + +/* ---------------------------------------------------------------------- + * APPLYING CHANGES IN SEQUENCE + */ + +typedef struct XTOperation { + xtOpSeqNo or_op_seq; + xtWord4 or_op_len; + xtLogID 
or_log_id; + xtLogOffset or_log_offset; +} XTOperationRec, *XTOperationPtr; + +static int xres_cmp_op_seq(struct XTThread *self __attribute__((unused)), register const void *thunk __attribute__((unused)), register const void *a, register const void *b) +{ + xtOpSeqNo lf_op_seq = *((xtOpSeqNo *) a); + XTOperationPtr lf_ptr = (XTOperationPtr) b; + + if (lf_op_seq == lf_ptr->or_op_seq) + return 0; + if (XTTableSeq::xt_op_is_before(lf_op_seq, lf_ptr->or_op_seq)) + return -1; + return 1; +} + +xtPublic void xt_xres_init_tab(XTThreadPtr self, XTTableHPtr tab) +{ + tab->tab_op_list = xt_new_sortedlist(self, sizeof(XTOperationRec), 20, 1000, xres_cmp_op_seq, NULL, NULL, TRUE, FALSE); +} + +xtPublic void xt_xres_exit_tab(XTThreadPtr self, XTTableHPtr tab) +{ + if (tab->tab_op_list) { + xt_free_sortedlist(self, tab->tab_op_list); + tab->tab_op_list = NULL; + } +} + +static xtBool xres_open_table(XTThreadPtr self, XTWriterStatePtr ws, xtTableID tab_id) +{ + XTOpenTablePtr ot; + + if ((ot = ws->ws_ot)) { + if (ot->ot_table->tab_id == tab_id) + return OK; + xt_db_return_table_to_pool(self, ot); + ws->ws_ot = NULL; + } + + if (ws->ws_tab_gone == tab_id) + return FAILED; + if ((ws->ws_ot = xt_db_open_pool_table(self, ws->ws_db, tab_id, NULL, TRUE))) { + XTTableHPtr tab; + + tab = ws->ws_ot->ot_table; + if (!tab->tab_ind_rec_log_id) { + /* Should not happen... */ + tab->tab_ind_rec_log_id = ws->ws_ind_rec_log_id; + tab->tab_ind_rec_log_offset = ws->ws_ind_rec_log_offset; + } + return OK; + } + ws->ws_tab_gone = tab_id; + return FAILED; +} + +/* {INDEX-RECOV_ROWID} + * Add missing index entries during recovery. + * Set the row ID even if the index entry + * is not committed. It will be removed later by + * the sweeper. 
+ */ +static xtBool xres_add_index_entries(XTOpenTablePtr ot, xtRowID row_id, xtRecordID rec_id, xtWord1 *rec_data) +{ + XTTableHPtr tab = ot->ot_table; + u_int idx_cnt; + XTIndexPtr *ind; + //XTIdxSearchKeyRec key; + + if (tab->tab_dic.dic_disable_index) + return OK; + + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + /* + key.sk_on_key = FALSE; + key.sk_key_value.sv_flags = XT_SEARCH_WHOLE_KEY; + key.sk_key_value.sv_rec_id = rec_offset; + key.sk_key_value.sv_key = key.sk_key_buf; + key.sk_key_value.sv_length = myxt_create_key_from_row(*ind, key.sk_key_buf, rec_data, NULL); + if (!xt_idx_search(ot, *ind, &key)) { + ot->ot_err_index_no = (*ind)->mi_index_no; + return FAILED; + } + if (!key.sk_on_key) { + } + */ + if (!xt_idx_insert(ot, *ind, row_id, rec_id, rec_data, NULL, TRUE)) { + /* Check the error, certain errors are recoverable! */ + XTThreadPtr self = xt_get_self(); + + if (self->t_exception.e_xt_err == XT_SYSTEM_ERROR && + (XT_FILE_IN_USE(self->t_exception.e_sys_err) || + XT_FILE_ACCESS_DENIED(self->t_exception.e_sys_err) || + XT_FILE_TOO_MANY_OPEN(self->t_exception.e_sys_err) || + self->t_exception.e_sys_err == XT_ENOMEM)) { + ot->ot_err_index_no = (*ind)->mi_index_no; + return FAILED; + } + + /* TODO: Write something to the index header to indicate that + * it is corrupted. 
+ */ + tab->tab_dic.dic_disable_index = XT_INDEX_CORRUPTED; + xt_log_and_clear_exception_ns(); + return OK; + } + } + return OK; +} + +static void xres_remove_index_entries(XTOpenTablePtr ot, xtRecordID rec_id, xtWord1 *rec_data) +{ + XTTableHPtr tab = ot->ot_table; + u_int idx_cnt; + XTIndexPtr *ind; + + if (tab->tab_dic.dic_disable_index) + return; + + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + if (!xt_idx_delete(ot, *ind, rec_id, rec_data)) + xt_log_and_clear_exception_ns(); + } +} + +static xtWord1 *xres_load_record(XTThreadPtr self, XTOpenTablePtr ot, xtRecordID rec_id, xtWord1 *data, size_t red_size, XTInfoBufferPtr rec_buf, u_int cols_req) +{ + XTTableHPtr tab = ot->ot_table; + xtWord1 *rec_data; + + rec_data = ot->ot_row_rbuffer; + + ASSERT(red_size <= ot->ot_row_rbuf_size); + ASSERT(tab->tab_dic.dic_rec_size <= ot->ot_row_rbuf_size); + if (data) { + if (rec_data != data) + memcpy(rec_data, data, red_size); + } + else { + /* It can be that less than 'dic_rec_size' was written for + * variable length type records. + * If this is the last record in the file, then we will read + * less than actual record size. 
+ */ + if (!XT_PREAD_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), tab->tab_dic.dic_rec_size, 0, rec_data, &red_size, &self->st_statistics.st_rec, self)) + goto failed; + + if (red_size < sizeof(XTTabRecHeadDRec)) + return NULL; + } + + if (XT_REC_IS_FIXED(rec_data[0])) + rec_data = ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE; + else { + if (!xt_ib_alloc(NULL, rec_buf, tab->tab_dic.dic_buf_size)) + goto failed; + if (XT_REC_IS_VARIABLE(rec_data[0])) { + if (!myxt_load_row(ot, rec_data + XT_REC_FIX_HEADER_SIZE, rec_buf->ib_db.db_data, cols_req)) + goto failed; + } + else if (XT_REC_IS_EXT_DLOG(rec_data[0])) { + if (red_size < XT_REC_EXT_HEADER_SIZE) + return NULL; + + ASSERT(cols_req); + if (cols_req && cols_req <= tab->tab_dic.dic_fix_col_count) { + if (!myxt_load_row(ot, rec_data + XT_REC_EXT_HEADER_SIZE, rec_buf->ib_db.db_data, cols_req)) + goto failed; + } + else { + if (!xt_tab_load_ext_data(ot, rec_id, rec_buf->ib_db.db_data, cols_req)) + goto failed; + } + } + else + /* This is possible, the record has already been cleaned up. */ + return NULL; + rec_data = rec_buf->ib_db.db_data; + } + + return rec_data; + + failed: + /* Running out of memory should not be ignored. */ + if (self->t_exception.e_xt_err == XT_SYSTEM_ERROR && + self->t_exception.e_sys_err == XT_ENOMEM) + xt_throw(self); + xt_log_and_clear_exception_ns(); + return NULL; +} + +/* + * Apply a change from the log. + * + * This function is basically very straightforward, were it not + * for the option to apply operations out of sequence. + * (i.e. in_sequence == FALSE) + * + * If operations are applied in sequence, then they can be + * applied blindly. The update operation is just executed as + * it was logged. + * + * If the changes are not in sequence, then some operations are missing, + * however, the operations that are present are in the correct order. + * + * This can only happen at the end of recovery!!! 
+ * After we have applied all operations in the log we may be + * left with some operations that have not been applied + * because operations were logged out of sequence. + * + * The application of these operations therefore has to take into + * account the current state of the database. + * They are then applied in a manner that maintains the + * database consistency. + * + * For example, a record that is freed is freed by placing it + * on the current free list. Part of the data logged for the + * operation is ignored. Namely: the "next block" pointer + * that was originally written into the freed record. + */ +static void xres_apply_change(XTThreadPtr self, XTOpenTablePtr ot, XTXactLogBufferDPtr record, xtBool in_sequence, xtBool check_index, XTInfoBufferPtr rec_buf) +{ + XTTableHPtr tab = ot->ot_table; + size_t len; + xtRecordID rec_id; + xtRefID free_ref_id; + XTTabRecFreeDRec free_rec; + xtRowID row_id; + XTTabRowRefDRec row_buf; + XTTabRecHeadDRec rec_head; + size_t tfer; + xtRecordID link_rec_id, prev_link_rec_id; + xtWord1 *rec_data = NULL; + XTTabRecFreeDPtr free_data; + + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_DELETE_BG: + rec_id = XT_GET_DISK_4(record->xu.xu_rec_id_4); + len = (size_t) XT_GET_DISK_2(record->xu.xu_size_2); + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), len, (xtWord1 *) &record->xu.xu_rec_type_1, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += len; + + if (check_index && ot->ot_table->tab_dic.dic_key_count) { + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_DELETE_BG: + break; + case XT_LOG_ENT_REC_MODIFIED: + if ((rec_data = xres_load_record(self, ot, rec_id, NULL, 0, rec_buf, tab->tab_dic.dic_ind_cols_req))) + xres_remove_index_entries(ot, rec_id, 
rec_data); + /* No break required: */ + default: + if ((rec_data = xres_load_record(self, ot, rec_id, &record->xu.xu_rec_type_1, len, rec_buf, tab->tab_dic.dic_ind_cols_req))) { + row_id = XT_GET_DISK_4(record->xu.xu_row_id_4); + if (!xres_add_index_entries(ot, row_id, rec_id, rec_data)) + xt_throw(self); + } + break; + } + } + + if (!in_sequence) { + /* A record has been allocated from the EOF, but out of sequence. + * This could leave a gap where other records were allocated + * from the EOF, but those operations have been lost! + * We compensate for this by adding all blocks between + * to the free list. + */ + free_rec.rf_rec_type_1 = XT_TAB_STATUS_FREED; + free_rec.rf_not_used_1 = 0; + while (tab->tab_head_rec_eof_id < rec_id) { + XT_SET_DISK_4(free_rec.rf_next_rec_id_4, tab->tab_head_rec_free_id); + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, tab->tab_head_rec_eof_id, sizeof(XTTabRecFreeDRec), (xtWord1 *) &free_rec, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRecFreeDRec); + tab->tab_head_rec_free_id = tab->tab_head_rec_eof_id; + tab->tab_head_rec_eof_id++; + } + } + if (tab->tab_head_rec_eof_id < rec_id + 1) + tab->tab_head_rec_eof_id = rec_id + 1; + tab->tab_flush_pending = TRUE; + break; + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE_FL_BG: + rec_id = XT_GET_DISK_4(record->xf.xf_rec_id_4); + len = (size_t) XT_GET_DISK_2(record->xf.xf_size_2); + free_ref_id = XT_GET_DISK_4(record->xf.xf_free_rec_id_4); + + if (check_index && + record->xf.xf_status_1 != XT_LOG_ENT_DELETE_FL && + record->xf.xf_status_1 != XT_LOG_ENT_DELETE_FL_BG) { + if ((rec_data = xres_load_record(self, ot, rec_id, &record->xf.xf_rec_type_1, len, rec_buf, tab->tab_dic.dic_ind_cols_req))) { + row_id = XT_GET_DISK_4(record->xf.xf_row_id_4); + if (!xres_add_index_entries(ot, row_id, rec_id, rec_data)) + 
xt_throw(self); + } + } + + if (!in_sequence) { + /* This record was allocated from the free list. + * Because this operation is out of sequence, there + * could have been other allocations from the + * free list before this, that have gone missing. + * For this reason we have to search the current + * free list and remove the record. + */ + link_rec_id = tab->tab_head_rec_free_id; + prev_link_rec_id = 0; + while (link_rec_id) { + if (!XT_PREAD_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, link_rec_id), sizeof(XTTabRecFreeDRec), sizeof(XTTabRecFreeDRec), (xtWord1 *) &free_rec, NULL, &self->st_statistics.st_rec, self)) + xt_throw(self); + if (link_rec_id == rec_id) + break; + prev_link_rec_id = link_rec_id; + link_rec_id = XT_GET_DISK_4(free_rec.rf_next_rec_id_4); + } + if (link_rec_id == rec_id) { + /* The block was found on the free list. + * remove it: */ + if (prev_link_rec_id) { + /* We write the record from position 'link_rec_id' into + * position 'prev_link_rec_id'. This unlinks 'link_rec_id'! + */ + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, prev_link_rec_id), sizeof(XTTabRecFreeDRec), (xtWord1 *) &free_rec, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRecFreeDRec); + free_ref_id = tab->tab_head_rec_free_id; + } + else + /* The block is at the front of the list: */ + free_ref_id = XT_GET_DISK_4(free_rec.rf_next_rec_id_4); + } + else { + /* Not found on the free list? 
*/ + if (tab->tab_head_rec_eof_id < rec_id + 1) + tab->tab_head_rec_eof_id = rec_id + 1; + goto write_mod_data; + } + } + if (tab->tab_head_rec_eof_id < rec_id + 1) + tab->tab_head_rec_eof_id = rec_id + 1; + tab->tab_head_rec_free_id = free_ref_id; + tab->tab_head_rec_fnum--; + write_mod_data: + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), len, (xtWord1 *) &record->xf.xf_rec_type_1, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += len; + tab->tab_flush_pending = TRUE; + break; + case XT_LOG_ENT_REC_REMOVED: + case XT_LOG_ENT_REC_REMOVED_EXT: { + xtBool record_loaded; + XTTabRecExtDPtr ext_rec; + size_t red_size; + xtWord4 log_over_size = 0; + xtLogID data_log_id = 0; + xtLogOffset data_log_offset = 0; + u_int cols_required = 0; + + rec_id = XT_GET_DISK_4(record->fr.fr_rec_id_4); + free_data = (XTTabRecFreeDPtr) &record->fr.fr_rec_type_1; + + /* This is a short-cut, it does not require loading the record: */ + if (!check_index && !tab->tab_dic.dic_blob_count && record->fr.fr_status_1 != XT_LOG_ENT_REC_REMOVED_EXT) + goto do_rec_freed; + + ext_rec = (XTTabRecExtDPtr) ot->ot_row_rbuffer; + + if (!XT_PREAD_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), tab->tab_dic.dic_rec_size, 0, ext_rec, &red_size, &self->st_statistics.st_rec, self)) { + xt_log_and_clear_exception_ns(); + goto do_rec_freed; + } + + if (red_size < sizeof(XTTabRecHeadDRec)) + goto do_rec_freed; + + /* Check that the record is the same as the one originally removed. + * This can be different if recovery is repeated. 
+ * For example: + * + * log=21 offset=6304472 REMOVED-X REC op=360616 tab=7 rec=25874 + * log=21 offset=6309230 UPDATE-FL op=360618 tab=7 rec=25874 row=26667 log=1 offset=26503077 xact=209 + * log=21 offset=6317500 CLEAN REC op=360631 tab=7 rec=25874 + * + * If this recovery sequence is repeated, then the REMOVED-X will free the + * extended record belonging to the update that came afterwards! + * + * Additional situation to consider: + * + * - A record "x" is created, and index entries created. + * - A checkpoint is made. + * - Record "x" is deleted due to UPDATE. + * - The index entries are removed, but the index is not + * flushed. + * - This deletion is written to disk by the writer. + * So we have the situation that the remove is on disk, + * but the index changes have not been made. + * + * In this case, skipping to "do_rec_freed" is incorrect. + */ + if (record->fr.fr_stat_id_1 != ext_rec->tr_stat_id_1 || + XT_GET_DISK_4(record->fr.fr_xact_id_4) != XT_GET_DISK_4(ext_rec->tr_xact_id_4)) + goto dont_remove_x_record; + + if (record->xl.xl_status_1 == XT_LOG_ENT_REC_REMOVED_EXT) { + if (!XT_REC_IS_EXT_DLOG(ext_rec->tr_rec_type_1)) + goto dont_remove_x_record; + if (red_size < offsetof(XTTabRecExtDRec, re_data)) + goto dont_remove_x_record; + + /* Save this for later (can be overwritten by xres_load_record(): */ + data_log_id = XT_GET_DISK_2(ext_rec->re_log_id_2); + data_log_offset = XT_GET_DISK_6(ext_rec->re_log_offs_6); + log_over_size = XT_GET_DISK_4(ext_rec->re_log_dat_siz_4); + } + dont_remove_x_record: + + record_loaded = FALSE; + + if (check_index) { + cols_required = tab->tab_dic.dic_ind_cols_req; + if (tab->tab_dic.dic_blob_cols_req > cols_required) + cols_required = tab->tab_dic.dic_blob_cols_req; + if (!(rec_data = xres_load_record(self, ot, rec_id, ot->ot_row_rbuffer, red_size, rec_buf, cols_required))) + goto do_rec_freed; + record_loaded = TRUE; + xres_remove_index_entries(ot, rec_id, rec_data); + } + + if (tab->tab_dic.dic_blob_count) { + if 
(!record_loaded) { + if (tab->tab_dic.dic_blob_cols_req > cols_required) + cols_required = tab->tab_dic.dic_blob_cols_req; + if (!(rec_data = xres_load_record(self, ot, rec_id, ot->ot_row_rbuffer, red_size, rec_buf, cols_required))) + /* [(7)] REMOVE is followed by FREE: + goto get_rec_offset; + */ + goto do_rec_freed; + record_loaded = TRUE; + } +#ifdef XT_STREAMING + myxt_release_blobs(ot, rec_data, rec_id); +#endif + } + + if (record->xl.xl_status_1 == XT_LOG_ENT_REC_REMOVED_EXT) { + /* Note: dlb_delete_log() may be repeated, but should handle this: + * + * Example: + * log=5 offset=213334 CLEAN REC op=28175 tab=1 rec=317428 + * ... + * log=6 offset=321063 REMOVED-X REC op=33878 tab=1 rec=317428 + * + * When this sequence is repeated during recovery, then CLEAN REC + * will reset the status byte of the record so that it + * comes back to here! + * + * The check for zero is probably not required here. + */ + if (data_log_id && data_log_offset && log_over_size) { + if (!ot->ot_thread->st_dlog_buf.dlb_delete_log(data_log_id, data_log_offset, log_over_size, tab->tab_id, rec_id, self)) { + if (ot->ot_thread->t_exception.e_xt_err != XT_ERR_BAD_EXT_RECORD && + ot->ot_thread->t_exception.e_xt_err != XT_ERR_DATA_LOG_NOT_FOUND) + xt_log_and_clear_exception_ns(); + } + } + } + + goto do_rec_freed; + } + case XT_LOG_ENT_REC_REMOVED_BI: { + /* + * For deletion we need the complete before image because of the following problem. 
+ * + * DROP TABLE IF EXISTS t1; + * CREATE TABLE t1 (ID int primary key auto_increment, value int, index (value)) engine=pbxt; + * + * insert t1(value) values(50); + * + * -- CHECKPOINT -- + * + * update t1 set value = 60; + * + * -- PAUSE -- + * + * update t1 set value = 70; + * + * -- CRASH -- + * + * select value from t1; + * select * from t1; + * + * 081203 12:11:46 [Note] PBXT: Recovering from 1-148, bytes to read: 33554284 + * log=1 offset=148 UPDATE-BG op=5 tab=1 rec=2 row=1 xact=3 + * log=1 offset=188 REC ADD ROW op=6 tab=1 row=1 + * log=1 offset=206 COMMIT xact=3 + * log=1 offset=216 REMOVED REC op=7 tab=1 rec=1 xact=2 + * log=1 offset=241 CLEAN REC op=8 tab=1 rec=2 + * log=1 offset=261 CLEANUP xact=3 + * log=1 offset=267 UPDATE-FL-BG op=9 tab=1 rec=1 row=1 xact=4 + * log=1 offset=311 REC ADD ROW op=10 tab=1 row=1 + * log=1 offset=329 COMMIT xact=4 + * log=1 offset=339 REMOVED REC op=11 tab=1 rec=2 xact=3 + * log=1 offset=364 CLEAN REC op=12 tab=1 rec=1 + * log=1 offset=384 CLEANUP xact=4 + * 081203 12:12:15 [Note] PBXT: Recovering complete at 1-390, bytes read: 33554284 + * + * mysql> select value from t1; + * +-------+ + * | value | + * +-------+ + * | 50 | + * | 70 | + * +-------+ + * 2 rows in set (55.99 sec) + * + * mysql> select * from t1; + * +----+-------+ + * | ID | value | + * +----+-------+ + * | 1 | 70 | + * +----+-------+ + * 1 row in set (0.00 sec) + */ + XTTabRecExtDPtr ext_rec; + xtWord4 log_over_size = 0; + xtLogID data_log_id = 0; + xtLogOffset data_log_offset = 0; + u_int cols_required = 0; + xtBool record_loaded; + size_t rec_size; + + rec_id = XT_GET_DISK_4(record->rb.rb_rec_id_4); + rec_size = XT_GET_DISK_2(record->rb.rb_size_2); + + ext_rec = (XTTabRecExtDPtr) &record->rb.rb_rec_type_1; + + if (XT_REC_IS_EXT_DLOG(record->rb.rb_rec_type_1)) { + /* Save this for later (can be overwritten by xres_load_record(): */ + data_log_id = XT_GET_DISK_2(ext_rec->re_log_id_2); + data_log_offset = XT_GET_DISK_6(ext_rec->re_log_offs_6); + 
log_over_size = XT_GET_DISK_4(ext_rec->re_log_dat_siz_4); + } + + record_loaded = FALSE; + + if (check_index) { + cols_required = tab->tab_dic.dic_ind_cols_req; +#ifdef XT_STREAMING + if (tab->tab_dic.dic_blob_cols_req > cols_required) + cols_required = tab->tab_dic.dic_blob_cols_req; +#endif + if (!(rec_data = xres_load_record(self, ot, rec_id, &record->rb.rb_rec_type_1, rec_size, rec_buf, cols_required))) + goto go_on_to_free; + record_loaded = TRUE; + xres_remove_index_entries(ot, rec_id, rec_data); + } + +#ifdef XT_STREAMING + if (tab->tab_dic.dic_blob_count) { + if (!record_loaded) { + cols_required = tab->tab_dic.dic_blob_cols_req; + if (!(rec_data = xres_load_record(self, ot, rec_id, &record->rb.rb_rec_type_1, rec_size, rec_buf, cols_required))) + /* [(7)] REMOVE is followed by FREE: + goto get_rec_offset; + */ + goto go_on_to_free; + record_loaded = TRUE; + } + myxt_release_blobs(ot, rec_data, rec_id); + } +#endif + + if (data_log_id && data_log_offset && log_over_size) { + if (!ot->ot_thread->st_dlog_buf.dlb_delete_log(data_log_id, data_log_offset, log_over_size, tab->tab_id, rec_id, self)) { + if (ot->ot_thread->t_exception.e_xt_err != XT_ERR_BAD_EXT_RECORD && + ot->ot_thread->t_exception.e_xt_err != XT_ERR_DATA_LOG_NOT_FOUND) + xt_log_and_clear_exception_ns(); + } + } + + go_on_to_free: + /* Use the new record type: */ + record->rb.rb_rec_type_1 = record->rb.rb_new_rec_type_1; + free_data = (XTTabRecFreeDPtr) &record->rb.rb_rec_type_1; + goto do_rec_freed; + } + case XT_LOG_ENT_REC_FREED: + rec_id = XT_GET_DISK_4(record->fr.fr_rec_id_4); + free_data = (XTTabRecFreeDPtr) &record->fr.fr_rec_type_1; + do_rec_freed: + if (!in_sequence) { + size_t red_size; + + /* Free the record. + * We place the record on front of the current + * free list. + * + * However, before we do this, we remove the record + * from its row list, if the record is on a row list. 
+ * + * We do this here, because the normal removal + * from the row list uses the operations: + * + * XT_LOG_ENT_REC_UNLINKED, XT_LOG_ENT_ROW_SET and + * XT_LOG_ENT_ROW_FREED. + * + * When operations are performed out of sequence, + * these operations are ignored for the purpose + * of removing the record from the row. + */ + if (!XT_PREAD_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), sizeof(XTTabRecHeadDRec), sizeof(XTTabRecHeadDRec), (xtWord1 *) &rec_head, NULL, &self->st_statistics.st_rec, self)) + xt_throw(self); + /* The record is already free: */ + if (XT_REC_IS_FREE(rec_head.tr_rec_type_1)) + goto free_done; + row_id = XT_GET_DISK_4(rec_head.tr_row_id_4); + + /* Search the row for this record: */ + if (!XT_PREAD_RR_FILE(ot->ot_row_file, xt_row_id_to_row_offset(tab, row_id), sizeof(XTTabRowRefDRec), sizeof(XTTabRowRefDRec), (xtWord1 *) &row_buf, NULL, &self->st_statistics.st_rec, self)) + xt_throw(self); + link_rec_id = XT_GET_DISK_4(row_buf.rr_ref_id_4); + prev_link_rec_id = 0; + while (link_rec_id) { + if (!XT_PREAD_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, link_rec_id), sizeof(XTTabRecHeadDRec), 0, (xtWord1 *) &rec_head, &red_size, &self->st_statistics.st_rec, self)) { + xt_log_and_clear_exception(self); + break; + } + if (red_size < sizeof(XTTabRecHeadDRec)) + break; + if (link_rec_id == rec_id) + break; + if (XT_GET_DISK_4(rec_head.tr_row_id_4) != row_id) + break; + switch (rec_head.tr_rec_type_1 & XT_TAB_STATUS_MASK) { + case XT_TAB_STATUS_FREED: + break; + case XT_TAB_STATUS_DELETE: + case XT_TAB_STATUS_FIXED: + case XT_TAB_STATUS_VARIABLE: + case XT_TAB_STATUS_EXT_DLOG: + break; + default: + ASSERT(FALSE); + goto exit_loop; + } + if (rec_head.tr_rec_type_1 & ~(XT_TAB_STATUS_CLEANED_BIT | XT_TAB_STATUS_MASK)) { + ASSERT(FALSE); + break; + } + prev_link_rec_id = link_rec_id; + link_rec_id = XT_GET_DISK_4(rec_head.tr_prev_rec_id_4); + } + + exit_loop: + if (link_rec_id == rec_id) { + /* The record was found on the row 
list, remove it: */ + if (prev_link_rec_id) { + /* We write the previous variation pointer from position 'link_rec_id' into + * variation pointer of the 'prev_link_rec_id' record. This unlinks 'link_rec_id'! + */ + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, prev_link_rec_id) + offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4), XT_RECORD_ID_SIZE, (xtWord1 *) &rec_head.tr_prev_rec_id_4, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += XT_RECORD_ID_SIZE; + } + else { + /* The record is at the front of the row list: */ + xtRefID ref_id = XT_GET_DISK_4(rec_head.tr_prev_rec_id_4); + XT_SET_DISK_4(row_buf.rr_ref_id_4, ref_id); + if (!XT_PWRITE_RR_FILE(ot->ot_row_file, xt_row_id_to_row_offset(tab, row_id), sizeof(XTTabRowRefDRec), (xtWord1 *) &row_buf, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRowRefDRec); + } + } + + /* Now we free the record, by placing it at the front of + * the free list: + */ + XT_SET_DISK_4(free_data->rf_next_rec_id_4, tab->tab_head_rec_free_id); + } + tab->tab_head_rec_free_id = rec_id; + tab->tab_head_rec_fnum++; + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), sizeof(XTTabRecFreeDRec), (xtWord1 *) free_data, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRecFreeDRec); + tab->tab_flush_pending = TRUE; + free_done: + break; + case XT_LOG_ENT_REC_MOVED: + len = 8; + rec_id = XT_GET_DISK_4(record->xw.xw_rec_id_4); + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id) + offsetof(XTTabRecExtDRec, re_log_id_2), len, (xtWord1 *) &record->xw.xw_rec_type_1, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += len; + tab->tab_flush_pending = TRUE; + break; + case XT_LOG_ENT_REC_CLEANED: + len = offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + 
XT_RECORD_ID_SIZE; + goto get_rec_offset; + case XT_LOG_ENT_REC_CLEANED_1: + len = 1; + goto get_rec_offset; + case XT_LOG_ENT_REC_UNLINKED: + if (!in_sequence) { + /* Unlink the record. + * This is done when the record is freed. + */ + break; + } + len = offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE; + get_rec_offset: + rec_id = XT_GET_DISK_4(record->xw.xw_rec_id_4); + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), len, (xtWord1 *) &record->xw.xw_rec_type_1, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += len; + tab->tab_flush_pending = TRUE; + break; + case XT_LOG_ENT_ROW_NEW: + len = offsetof(XTactRowAddedEntryDRec, xa_free_list_4); + row_id = XT_GET_DISK_4(record->xa.xa_row_id_4); + if (!in_sequence) { + /* A row was allocated from the EOF. Because operations are missing, + * the blocks between the current EOF and the new EOF need to be + * placed on the free list! + */ + while (tab->tab_head_row_eof_id < row_id) { + XT_SET_DISK_4(row_buf.rr_ref_id_4, tab->tab_head_row_free_id); + if (!XT_PWRITE_RR_FILE(ot->ot_row_file, xt_row_id_to_row_offset(tab, tab->tab_head_row_eof_id), sizeof(XTTabRowRefDRec), (xtWord1 *) &row_buf, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRowRefDRec); + tab->tab_head_row_free_id = tab->tab_head_row_eof_id; + tab->tab_head_row_eof_id++; + } + } + if (tab->tab_head_row_eof_id < row_id + 1) + tab->tab_head_row_eof_id = row_id + 1; + tab->tab_flush_pending = TRUE; + break; + case XT_LOG_ENT_ROW_NEW_FL: + len = sizeof(XTactRowAddedEntryDRec); + row_id = XT_GET_DISK_4(record->xa.xa_row_id_4); + free_ref_id = XT_GET_DISK_4(record->xa.xa_free_list_4); + if (!in_sequence) { + size_t red_size; + /* The record was taken from the free list. + * If the operations were in sequence, then this would be + * the front of the free list now. 
+ * However, because operations are missing, it may no + * longer be the front of the free list! + * Search and remove: + */ + link_rec_id = tab->tab_head_row_free_id; + prev_link_rec_id = 0; + while (link_rec_id) { + if (!XT_PREAD_RR_FILE(ot->ot_row_file, xt_row_id_to_row_offset(tab, link_rec_id), sizeof(XTTabRowRefDRec), 0, (xtWord1 *) &row_buf, &red_size, &self->st_statistics.st_rec, self)) { + xt_log_and_clear_exception(self); + break; + } + if (red_size < sizeof(XTTabRowRefDRec)) + break; + if (link_rec_id == row_id) + break; + prev_link_rec_id = link_rec_id; + link_rec_id = XT_GET_DISK_4(row_buf.rr_ref_id_4); + } + if (link_rec_id == row_id) { + /* The block was found on the free list, remove it: */ + if (prev_link_rec_id) { + /* We write the record from position 'link_rec_id' into + * position 'prev_link_rec_id'. This unlinks 'link_rec_id'! + */ + if (!XT_PWRITE_RR_FILE(ot->ot_row_file, xt_row_id_to_row_offset(tab, prev_link_rec_id), sizeof(XTTabRowRefDRec), (xtWord1 *) &row_buf, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRowRefDRec); + free_ref_id = tab->tab_head_row_free_id; + } + else + /* The block is at the front of the free list: */ + free_ref_id = XT_GET_DISK_4(row_buf.rr_ref_id_4); + } + else { + /* Not found? */ + if (tab->tab_head_row_eof_id < row_id + 1) + tab->tab_head_row_eof_id = row_id + 1; + break; + } + + } + if (tab->tab_head_row_eof_id < row_id + 1) + tab->tab_head_row_eof_id = row_id + 1; + tab->tab_head_row_free_id = free_ref_id; + tab->tab_head_row_fnum--; + tab->tab_flush_pending = TRUE; + break; + case XT_LOG_ENT_ROW_FREED: + row_id = XT_GET_DISK_4(record->wr.wr_row_id_4); + if (!in_sequence) { + /* Free the row. + * Since this operation is being performed out of sequence, we + * must assume that some other free and allocation operations + * must be missing. + * For this reason, we add the row to the front of the + * existing free list. 
+ */ + XT_SET_DISK_4(record->wr.wr_ref_id_4, tab->tab_head_row_free_id); + } + tab->tab_head_row_free_id = row_id; + tab->tab_head_row_fnum++; + goto write_row_data; + case XT_LOG_ENT_ROW_ADD_REC: + row_id = XT_GET_DISK_4(record->wr.wr_row_id_4); + if (!in_sequence) { + if (!XT_PREAD_RR_FILE(ot->ot_row_file, xt_row_id_to_row_offset(tab, row_id), sizeof(XTTabRowRefDRec), 0, (xtWord1 *) &row_buf, &tfer, &self->st_statistics.st_rec, self)) + xt_throw(self); + if (tfer == sizeof(XTTabRowRefDRec)) { + /* Add a record to the front of the row. + * This is easy, but we have to make sure that the next + * pointer in the record is correct. + */ + rec_id = XT_GET_DISK_4(record->wr.wr_ref_id_4); + if (!XT_PREAD_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), sizeof(XTTabRecHeadDRec), 0, (xtWord1 *) &rec_head, &tfer, &self->st_statistics.st_rec, self)) + xt_throw(self); + if (tfer == sizeof(XTTabRecHeadDRec) && XT_GET_DISK_4(rec_head.tr_row_id_4) == row_id) { + /* This is now the correct next pointer: */ + xtRecordID next_ref_id = XT_GET_DISK_4(row_buf.rr_ref_id_4); + if (XT_GET_DISK_4(rec_head.tr_prev_rec_id_4) != next_ref_id && + rec_id != next_ref_id) { + XT_SET_DISK_4(rec_head.tr_prev_rec_id_4, next_ref_id); + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, rec_id), sizeof(XTTabRecHeadDRec), (xtWord1 *) &rec_head, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRecHeadDRec); + } + } + } + + } + goto write_row_data; + case XT_LOG_ENT_ROW_SET: + if (!in_sequence) + /* This operation is ignored when out of sequence! + * The operation is used to remove a record from a row. + * This is done automatically when the record is freed. 
+ */ + break; + row_id = XT_GET_DISK_4(record->wr.wr_row_id_4); + write_row_data: + ASSERT_NS(XT_GET_DISK_4(record->wr.wr_ref_id_4) < tab->tab_head_rec_eof_id); + if (!XT_PWRITE_RR_FILE(ot->ot_row_file, xt_row_id_to_row_offset(tab, row_id), sizeof(XTTabRowRefDRec), (xtWord1 *) &record->wr.wr_ref_id_4, &ot->ot_thread->st_statistics.st_rec, self)) + xt_throw(self); + tab->tab_bytes_to_flush += sizeof(XTTabRowRefDRec); + if (tab->tab_head_row_eof_id < row_id + 1) + tab->tab_head_row_eof_id = row_id + 1; + tab->tab_flush_pending = TRUE; + break; + case XT_LOG_ENT_NO_OP: + case XT_LOG_ENT_END_OF_LOG: + break; + } +} + +/* + * Apply all operations that have been buffered + * for a particular table. + * Operations are buffered if they are + * read from the log out of sequence. + * + * In this case we buffer, and wait for the + * out of sequence operations to arrive. + * + * When the server is running, this will always be + * the case. A delay occurs while a transaction + * fills its private log buffer. 
+ */ +static void xres_apply_operations(XTThreadPtr self, XTWriterStatePtr ws, xtBool in_sequence) +{ + XTTableHPtr tab = ws->ws_ot->ot_table; + u_int i = 0; + XTOperationPtr op; + xtBool check_index; + +// XTDatabaseHPtr db, XTOpenTablePtr ot, XTXactSeqReadPtr sr, XTDataBufferPtr databuf + xt_sl_lock(self, tab->tab_op_list); + for (;;) { + op = (XTOperationPtr) xt_sl_item_at(tab->tab_op_list, i); + if (!op) + break; + if (in_sequence && tab->tab_head_op_seq+1 != op->or_op_seq) + break; + xt_db_set_size(self, &ws->ws_databuf, (size_t) op->or_op_len); + if (!ws->ws_db->db_xlog.xlog_rnd_read(&ws->ws_seqread, op->or_log_id, op->or_log_offset, (size_t) op->or_op_len, ws->ws_databuf.db_data, NULL, self)) + xt_throw(self); + check_index = ws->ws_in_recover && xt_comp_log_pos(op->or_log_id, op->or_log_offset, ws->ws_ind_rec_log_id, ws->ws_ind_rec_log_offset) >= 0; + xres_apply_change(self, ws->ws_ot, (XTXactLogBufferDPtr) ws->ws_databuf.db_data, in_sequence, check_index, &ws->ws_rec_buf); + tab->tab_head_op_seq = op->or_op_seq; + if (tab->tab_wr_wake_freeer) { + if (!XTTableSeq::xt_op_is_before(tab->tab_head_op_seq, tab->tab_wake_freeer_op)) + xt_wr_wake_freeer(self); + } + i++; + } + xt_sl_remove_from_front(self, tab->tab_op_list, i); + xt_sl_unlock(self, tab->tab_op_list); +} + +/* Check for operations still remaining on tables. + * These operations are applied even though operations + * in sequence are missing. + */ +xtBool xres_sync_operations(XTThreadPtr self, XTDatabaseHPtr db, XTWriterStatePtr ws) +{ + u_int edx; + XTTableEntryPtr te_ptr; + XTTableHPtr tab; + xtBool op_synced = FALSE; + + xt_enum_tables_init(&edx); + while ((te_ptr = xt_enum_tables_next(self, db, &edx))) { + /* Dirty read of tab_op_list OK, here because this is the + * only thread that updates the list! 
+ */ + if ((tab = te_ptr->te_table)) { + if (xt_sl_get_size(tab->tab_op_list)) { + op_synced = TRUE; + if (xres_open_table(self, ws, te_ptr->te_tab_id)) + xres_apply_operations(self, ws, FALSE); + } + + /* Update the pointer cache: */ + tab->tab_seq.xt_op_seq_set(self, tab->tab_head_op_seq+1); + tab->tab_row_eof_id = tab->tab_head_row_eof_id; + tab->tab_row_free_id = tab->tab_head_row_free_id; + tab->tab_row_fnum = tab->tab_head_row_fnum; + tab->tab_rec_eof_id = tab->tab_head_rec_eof_id; + tab->tab_rec_free_id = tab->tab_head_rec_free_id; + tab->tab_rec_fnum = tab->tab_head_rec_fnum; + } + } + return op_synced; +} + +/* + * Operations from the log are applied in sequence order. + * If the operations are out of sequence, they are buffered + * until the missing operations appear. + * + * NOTE: No lock is required because there should only be + * one thread that does this! + */ +xtPublic void xt_xres_apply_in_order(XTThreadPtr self, XTWriterStatePtr ws, xtLogID log_id, xtLogOffset log_offset, XTXactLogBufferDPtr record) +{ + xtOpSeqNo op_seq; + xtTableID tab_id; + size_t len; + xtBool check_index; + +// XTDatabaseHPtr db, XTOpenTablePtr *ot, XTXactSeqReadPtr sr, XTDataBufferPtr databuf + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_DELETE_BG: + len = offsetof(XTactUpdateEntryDRec, xu_rec_type_1) + (size_t) XT_GET_DISK_2(record->xu.xu_size_2); + op_seq = XT_GET_DISK_4(record->xu.xu_op_seq_4); + tab_id = XT_GET_DISK_4(record->xu.xu_tab_id_4); + break; + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE_FL_BG: + len = offsetof(XTactUpdateFLEntryDRec, xf_rec_type_1) + (size_t) XT_GET_DISK_2(record->xf.xf_size_2); + op_seq = XT_GET_DISK_4(record->xf.xf_op_seq_4); + tab_id = 
XT_GET_DISK_4(record->xf.xf_tab_id_4); + break; + case XT_LOG_ENT_REC_FREED: + case XT_LOG_ENT_REC_REMOVED: + case XT_LOG_ENT_REC_REMOVED_EXT: + /* [(7)] REMOVE is now an extended version of FREE! */ + len = offsetof(XTactFreeRecEntryDRec, fr_rec_type_1) + sizeof(XTTabRecFreeDRec); + goto fixed_len_data; + case XT_LOG_ENT_REC_REMOVED_BI: + len = offsetof(XTactRemoveBIEntryDRec, rb_rec_type_1) + (size_t) XT_GET_DISK_2(record->rb.rb_size_2); + op_seq = XT_GET_DISK_4(record->rb.rb_op_seq_4); + tab_id = XT_GET_DISK_4(record->rb.rb_tab_id_4); + break; + case XT_LOG_ENT_REC_MOVED: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + 8; + goto fixed_len_data; + case XT_LOG_ENT_REC_CLEANED: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE; + goto fixed_len_data; + case XT_LOG_ENT_REC_CLEANED_1: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + 1; + goto fixed_len_data; + case XT_LOG_ENT_REC_UNLINKED: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE; + fixed_len_data: + op_seq = XT_GET_DISK_4(record->xw.xw_op_seq_4); + tab_id = XT_GET_DISK_4(record->xw.xw_tab_id_4); + break; + case XT_LOG_ENT_ROW_NEW: + len = sizeof(XTactRowAddedEntryDRec) - 4; + goto new_row; + case XT_LOG_ENT_ROW_NEW_FL: + len = sizeof(XTactRowAddedEntryDRec); + new_row: + op_seq = XT_GET_DISK_4(record->xa.xa_op_seq_4); + tab_id = XT_GET_DISK_4(record->xa.xa_tab_id_4); + break; + case XT_LOG_ENT_ROW_ADD_REC: + case XT_LOG_ENT_ROW_SET: + case XT_LOG_ENT_ROW_FREED: + len = offsetof(XTactWriteRowEntryDRec, wr_ref_id_4) + sizeof(XTTabRowRefDRec); + op_seq = XT_GET_DISK_4(record->wr.wr_op_seq_4); + tab_id = XT_GET_DISK_4(record->wr.wr_tab_id_4); + break; + case XT_LOG_ENT_NO_OP: + case XT_LOG_ENT_END_OF_LOG: + return; + default: + return; + } + + if (!xres_open_table(self, ws, tab_id)) + return; + + XTTableHPtr tab = ws->ws_ot->ot_table; + + /* NOTE: + 
* + * During normal operation this is actually given. + * + * During recovery, it only applies to the record/row files. + * The index file is flushed independently, and changes may + * have been applied to the index (due to a call to flush index, + * which comes as a result of out of memory) that have not been + * applied to the record/row files. + * + * As a result we need to do the index checks that apply to this + * change. + * + * At the moment, I will just do everything, which should not + * hurt! + * + * This error can be repeated by running the test + * runTest(OUT_OF_CACHE_UPDATE_TEST, 32, OUT_OF_CACHE_UPDATE_TEST_UPDATE_COUNT, OUT_OF_CACHE_UPDATE_TEST_SET_SIZE) + * and crashing after a while. + * + * Do this by setting not_this to NULL. This will cause the test to + * hang after a while. After a restart the indexes are corrupt if the + * ws->ws_in_recover condition is not present here. + */ + if (ws->ws_in_recover) { + if (!tab->tab_recovery_done) { + /* op_seq <= tab_head_op_seq + 1: */ + ASSERT(XTTableSeq::xt_op_is_before(op_seq, tab->tab_head_op_seq+2)); + if (XTTableSeq::xt_op_is_before(op_seq-1, tab->tab_head_op_seq)) + /* Adjust the operation sequence number: */ + tab->tab_head_op_seq = op_seq-1; + tab->tab_recovery_done = TRUE; + } + } + + if (!XTTableSeq::xt_op_is_before(tab->tab_head_op_seq, op_seq)) + return; + + if (tab->tab_head_op_seq+1 == op_seq) { + /* I could use tab_ind_rec_log_id, but this may be a problem, if + * recovery does not recover up to the last committed transaction. + */ + check_index = ws->ws_in_recover && xt_comp_log_pos(log_id, log_offset, ws->ws_ind_rec_log_id, ws->ws_ind_rec_log_offset) >= 0; + xres_apply_change(self, ws->ws_ot, record, TRUE, check_index, &ws->ws_rec_buf); + tab->tab_head_op_seq = op_seq; + if (tab->tab_wr_wake_freeer) { + if (!XTTableSeq::xt_op_is_before(tab->tab_head_op_seq, tab->tab_wake_freeer_op)) + xt_wr_wake_freeer(self); + } + + /* Apply any operations in the list that now follow on... 
+ * NOTE: the tab_op_list only has to be locked for modification. + * This is because only one thread ever changes the list + * (on startup and the writer), but the checkpoint thread + * reads it. + */ + XTOperationPtr op; + if ((op = (XTOperationPtr) xt_sl_first_item(tab->tab_op_list))) { + if (tab->tab_head_op_seq+1 == op->or_op_seq) { + xres_apply_operations(self, ws, TRUE); + } + } + } + else { + /* Add the operation to the list: */ + XTOperationRec op; + + op.or_op_seq = op_seq; + op.or_op_len = len; + op.or_log_id = log_id; + op.or_log_offset = log_offset; + xt_sl_lock(self, tab->tab_op_list); + xt_sl_insert(self, tab->tab_op_list, &op_seq, &op); + ASSERT(tab->tab_op_list->sl_usage_count < 1000000); + xt_sl_unlock(self, tab->tab_op_list); + } +} + +/* ---------------------------------------------------------------------- + * CHECKPOINTING FUNCTIONALITY + */ + +static xtBool xres_delete_data_log(XTDatabaseHPtr db, xtLogID log_id) +{ + XTDataLogFilePtr data_log; + char path[PATH_MAX]; + + db->db_datalogs.dlc_name(PATH_MAX, path, log_id); + + if (!db->db_datalogs.dlc_remove_data_log(log_id, TRUE)) + return FAILED; + + if (xt_fs_exists(path)) { +#ifdef DEBUG_LOG_DELETE + printf("-- delete log: %s\n", path); +#endif + if (!xt_fs_delete(NULL, path)) + return FAILED; + } + /* The log was deleted: */ + if (!db->db_datalogs.dlc_get_data_log(&data_log, log_id, TRUE, NULL)) + return FAILED; + if (data_log) { + if (!db->db_datalogs.dls_set_log_state(data_log, XT_DL_DELETED)) + return FAILED; + } + return OK; +} + +static int xres_comp_flush_tabs(XTThreadPtr self __attribute__((unused)), register const void *thunk __attribute__((unused)), register const void *a, register const void *b) +{ + xtTableID tab_id = *((xtTableID *) a); + XTCheckPointTablePtr cp_tab = (XTCheckPointTablePtr) b; + + if (tab_id < cp_tab->cpt_tab_id) + return -1; + if (tab_id > cp_tab->cpt_tab_id) + return 1; + return 0; +} + +static void xres_init_checkpoint_state(XTThreadPtr self, XTCheckPointStatePtr 
cp) +{ + xt_init_mutex_with_autoname(self, &cp->cp_state_lock); +} + +static void xres_free_checkpoint_state(XTThreadPtr self, XTCheckPointStatePtr cp) +{ + xt_free_mutex(&cp->cp_state_lock); + if (cp->cp_table_ids) { + xt_free_sortedlist(self, cp->cp_table_ids); + cp->cp_table_ids = NULL; + } +} + +/* + * Remove the deleted logs so that they can be re-used. + * This is only possible after a checkpoint has been + * written that does _not_ include these logs as logs + * to be deleted! + */ +static xtBool xres_remove_data_logs(XTDatabaseHPtr db) +{ + u_int no_of_logs = xt_sl_get_size(db->db_datalogs.dlc_deleted); + xtLogID *log_id_ptr; + + for (u_int i=0; i<no_of_logs; i++) { + log_id_ptr = (xtLogID *) xt_sl_item_at(db->db_datalogs.dlc_deleted, i); + if (!db->db_datalogs.dlc_remove_data_log(*log_id_ptr, FALSE)) + return FAILED; + } + xt_sl_set_size(db->db_datalogs.dlc_deleted, 0); + return OK; +} + +/* ---------------------------------------------------------------------- + * INIT & EXIT + */ + +xtPublic void xt_xres_init(XTThreadPtr self, XTDatabaseHPtr db) +{ + xtLogID max_log_id; + + xt_init_mutex_with_autoname(self, &db->db_cp_lock); + xt_init_cond(self, &db->db_cp_cond); + + xres_init_checkpoint_state(self, &db->db_cp_state); + db->db_restart.xres_init(self, db, &db->db_wr_log_id, &db->db_wr_log_offset, &max_log_id); + + /* It is also the position where transactions will start writing the + * log: + */ + if (!db->db_xlog.xlog_set_write_offset(db->db_wr_log_id, db->db_wr_log_offset, max_log_id, self)) + xt_throw(self); +} + +xtPublic void xt_xres_exit(XTThreadPtr self, XTDatabaseHPtr db) +{ + db->db_restart.xres_exit(self); + xres_free_checkpoint_state(self, &db->db_cp_state); + xt_free_mutex(&db->db_cp_lock); + xt_free_cond(&db->db_cp_cond); +} + +/* ---------------------------------------------------------------------- + * RESTART FUNCTIONALITY + */ + +/* + * Restart the database. 
This function loads the restart position, and + * applies all changes in the logs, until the end of the log, or + * a corrupted record is found. + * + * The restart position is the position in the log where we know that + * all the changes up to that point have been flushed to the + * database. + * + * This is called the checkpoint position. The checkpoint position + * is written alternately to 2 restart files. + * + * To make a checkpoint: + * Get the current log writer log offset. + * For each table: + * Get the log offset of the next operation on the table, if an + * operation is queued for the table. + * Flush that table, and the operation sequence to the table. + * For each unclean transaction: + * Get the log offset of the beginning of the transaction. + * Write the lowest of all log offsets to the restart file! + */ + +void XTXactRestart::xres_init(XTThreadPtr self, XTDatabaseHPtr db, xtLogID *log_id, xtLogOffset *log_offset, xtLogID *max_log_id) +{ + char path[PATH_MAX]; + XTOpenFilePtr of = NULL; + XTXlogCheckpointDPtr res_1_buffer = NULL; + XTXlogCheckpointDPtr res_2_buffer = NULL; + XTXlogCheckpointDPtr use_buffer; + xtLogID ind_rec_log_id = 0; + xtLogOffset ind_rec_log_offset = 0; + + enter_(); + xres_db = db; + + ASSERT(!self->st_database); + /* The following call stack: + * XTDatabaseLog::xlog_flush_pending() + * XTDatabaseLog::xlog_flush() + * xt_xlog_flush_log() + * xt_flush_indices() + * idx_out_of_memory_failure() + * xt_idx_delete() + * xres_remove_index_entries() + * xres_apply_change() + * xt_xres_apply_in_order() + * XTXactRestart::xres_restart() + * XTXactRestart::xres_init() + * Leads to st_database being used! + */ + self->st_database = db; + +#ifdef SKIP_STARTUP_CHECKPOINT + /* When debugging, we do not checkpoint immediately, just in case + * we detect a problem during recovery. + */ + xres_cp_required = FALSE; +#else + xres_cp_required = TRUE; +#endif + xres_cp_number = 0; + try_(a) { + + /* Figure out which restart file to use. 
+ */ + xres_name(PATH_MAX, path, 1); + if ((of = xt_open_file(self, path, XT_FS_MISSING_OK))) { + size_t res_1_size; + + res_1_size = (size_t) xt_seek_eof_file(self, of); + res_1_buffer = (XTXlogCheckpointDPtr) xt_malloc(self, res_1_size); + if (!xt_pread_file(of, 0, res_1_size, res_1_size, res_1_buffer, NULL, &self->st_statistics.st_x, self)) + xt_throw(self); + xt_close_file(self, of); + of = NULL; + if (!xres_check_checksum(res_1_buffer, res_1_size)) { + xt_free(self, res_1_buffer); + res_1_buffer = NULL; + } + } + + xres_name(PATH_MAX, path, 2); + if ((of = xt_open_file(self, path, XT_FS_MISSING_OK))) { + size_t res_2_size; + + res_2_size = (size_t) xt_seek_eof_file(self, of); + res_2_buffer = (XTXlogCheckpointDPtr) xt_malloc(self, res_2_size); + if (!xt_pread_file(of, 0, res_2_size, res_2_size, res_2_buffer, NULL, &self->st_statistics.st_x, self)) + xt_throw(self); + xt_close_file(self, of); + of = NULL; + if (!xres_check_checksum(res_2_buffer, res_2_size)) { + xt_free(self, res_2_buffer); + res_2_buffer = NULL; + } + } + + if (res_1_buffer && res_2_buffer) { + if (xt_comp_log_pos( + XT_GET_DISK_4(res_1_buffer->xcp_log_id_4), + XT_GET_DISK_6(res_1_buffer->xcp_log_offs_6), + XT_GET_DISK_4(res_2_buffer->xcp_log_id_4), + XT_GET_DISK_6(res_2_buffer->xcp_log_offs_6)) > 0) { + /* The first log is further along than the second: */ + xt_free(self, res_2_buffer); + res_2_buffer = NULL; + } + else { + if (XT_GET_DISK_6(res_1_buffer->xcp_chkpnt_no_6) > + XT_GET_DISK_6(res_2_buffer->xcp_chkpnt_no_6)) { + xt_free(self, res_2_buffer); + res_2_buffer = NULL; + } + else { + xt_free(self, res_1_buffer); + res_1_buffer = NULL; + } + } + } + + if (res_1_buffer) { + use_buffer = res_1_buffer; + xres_next_res_no = 2; + } + else { + use_buffer = res_2_buffer; + xres_next_res_no = 1; + } + + /* Read the checkpoint data: */ + if (use_buffer) { + u_int no_of_logs; + xtLogID xt_log_id; + xtTableID xt_tab_id; + + xres_cp_number = XT_GET_DISK_6(use_buffer->xcp_chkpnt_no_6); + 
xres_cp_log_id = XT_GET_DISK_4(use_buffer->xcp_log_id_4); + xres_cp_log_offset = XT_GET_DISK_6(use_buffer->xcp_log_offs_6); + xt_tab_id = XT_GET_DISK_4(use_buffer->xcp_tab_id_4); + if (xt_tab_id > db->db_curr_tab_id) + db->db_curr_tab_id = xt_tab_id; + db->db_xn_curr_id = XT_GET_DISK_4(use_buffer->xcp_xact_id_4); + ind_rec_log_id = XT_GET_DISK_4(use_buffer->xcp_ind_rec_log_id_4); + ind_rec_log_offset = XT_GET_DISK_6(use_buffer->xcp_ind_rec_log_offs_6); + no_of_logs = XT_GET_DISK_2(use_buffer->xcp_log_count_2); + +#ifdef DEBUG_PRINT + printf("CHECKPOINT log=%d offset=%d ", (int) xres_cp_log_id, (int) xres_cp_log_offset); + if (no_of_logs) + printf("DELETED LOGS: "); +#endif + + /* Logs that are deleted are locked until _after_ the next + * checkpoint. + * + * To prevent the following problem from occurring: + * - Recovery is performed, and log X is deleted + * - After deletion, a log is free for re-use. + * New data is written to log X. + * - Server crashes. + * - Recovery is performed from previous checkpoint, + * and log X is deleted again. + * + * To lock the logs they are placed on the deleted list. + * After the next checkpoint, all logs on this list + * will be removed. + */ + for (u_int i=0; i<no_of_logs; i++) { + xt_log_id = (xtLogID) XT_GET_DISK_2(use_buffer->xcp_del_log[i]); +#ifdef DEBUG_PRINT + if (i != 0) + printf(", "); + printf("%d", (int) xt_log_id); +#endif +#ifdef DEBUG_KEEP_LOGS + xt_dl_set_to_delete(self, db, xt_log_id); +#else + if (!xres_delete_data_log(db, xt_log_id)) + xt_throw(self); +#endif + } + +#ifdef DEBUG_PRINT + printf("\n"); +#endif + } + else { + /* Try to determine the correct start point. 
*/ + xres_cp_number = 0; + xres_cp_log_id = xt_xlog_get_min_log(self, db); + xres_cp_log_offset = 0; + ind_rec_log_id = xres_cp_log_id; + ind_rec_log_offset = xres_cp_log_offset; + +#ifdef DEBUG_PRINT + printf("CHECKPOINT log=1 offset=0\n"); +#endif + } + + if (res_1_buffer) { + xt_free(self, res_1_buffer); + res_1_buffer = NULL; + } + if (res_2_buffer) { + xt_free(self, res_2_buffer); + res_2_buffer = NULL; + } + + if (!xres_restart(self, log_id, log_offset, ind_rec_log_id, ind_rec_log_offset, max_log_id)) + xt_throw(self); + } + catch_(a) { + self->st_database = NULL; + if (of) + xt_close_file(self, of); + if (res_1_buffer) + xt_free(self, res_1_buffer); + if (res_2_buffer) + xt_free(self, res_2_buffer); + xres_exit(self); + throw_(); + } + cont_(a); + self->st_database = NULL; + + exit_(); +} + +void XTXactRestart::xres_exit(XTThreadPtr self __attribute__((unused))) +{ +} + +void XTXactRestart::xres_name(size_t size, char *path, xtLogID log_id) +{ + char name[50]; + + sprintf(name, "restart-%lu.xt", (u_long) log_id); + xt_strcpy(size, path, xres_db->db_main_path); + xt_add_system_dir(size, path); + xt_add_dir_char(size, path); + xt_strcat(size, path, name); +} + +xtBool XTXactRestart::xres_check_checksum(XTXlogCheckpointDPtr buffer, size_t size) +{ + size_t head_size; + + /* The minimum size: */ + if (size < offsetof(XTXlogCheckpointDRec, xcp_head_size_4) + 4) + return FAILED; + + /* Check the sizes: */ + head_size = XT_GET_DISK_4(buffer->xcp_head_size_4); + if (size < head_size) + return FAILED; + + if (XT_GET_DISK_2(buffer->xcp_checksum_2) != xt_get_checksum(((xtWord1 *) buffer) + 2, size - 2, 1)) + return FAILED; + + if (XT_GET_DISK_2(buffer->xcp_version_2) != XT_CHECKPOINT_VERSION) + return FAILED; + + return OK; +} + +void XTXactRestart::xres_recover_progress(XTThreadPtr self, XTOpenFilePtr *of, int perc) +{ +#ifdef XT_USE_GLOBAL_DB + if (!perc) { + char file_path[PATH_MAX]; + + xt_strcpy(PATH_MAX, file_path, xres_db->db_main_path); + 
xt_add_pbxt_file(PATH_MAX, file_path, "recovery-progress"); + *of = xt_open_file(self, file_path, XT_FS_CREATE | XT_FS_MAKE_PATH); + xt_set_eof_file(self, *of, 0); + } + + if (perc > 100) { + char file_path[PATH_MAX]; + + if (*of) { + xt_close_file(self, *of); + *of = NULL; + } + xt_strcpy(PATH_MAX, file_path, xres_db->db_main_path); + xt_add_pbxt_file(PATH_MAX, file_path, "recovery-progress"); + if (xt_fs_exists(file_path)) + xt_fs_delete(self, file_path); + } + else { + char number[40]; + + sprintf(number, "%d", perc); + if (!xt_pwrite_file(*of, 0, strlen(number), number, &self->st_statistics.st_x, self)) + xt_throw(self); + if (!xt_flush_file(*of, &self->st_statistics.st_x, self)) + xt_throw(self); + } +#endif +} + +xtBool XTXactRestart::xres_restart(XTThreadPtr self, xtLogID *log_id, xtLogOffset *log_offset, xtLogID ind_rec_log_id, xtLogOffset ind_rec_log_offset, xtLogID *max_log_id) +{ + xtBool ok = TRUE; + XTDatabaseHPtr db = xres_db; + XTXactLogBufferDPtr record; + xtXactID xn_id; + XTXactDataPtr xact; + xtTableID tab_id; + XTWriterStateRec ws; + off_t bytes_read = 0; + off_t bytes_to_read; + volatile xtBool print_progress = FALSE; + volatile off_t perc_size = 0, next_goal = 0; + int perc_complete = 1; + XTOpenFilePtr progress_file = NULL; + xtBool min_ram_xn_id_set = FALSE; + u_int log_count; + + memset(&ws, 0, sizeof(ws)); + + ws.ws_db = db; + ws.ws_in_recover = TRUE; + ws.ws_ind_rec_log_id = ind_rec_log_id; + ws.ws_ind_rec_log_offset = ind_rec_log_offset; + + /* Initialize the data log buffer (required if extended data is + * referenced). + * Note: this buffer is freed later. It is part of the thread + * "open database" state, and this means that a thread + * may not have another database open (in use) when + * it calls this function. 
+ */ + self->st_dlog_buf.dlb_init(db, xt_db_log_buffer_size); + + if (!db->db_xlog.xlog_seq_init(&ws.ws_seqread, xt_db_log_buffer_size, TRUE)) + return FAILED; + + bytes_to_read = xres_bytes_to_read(self, db, &log_count, max_log_id); + /* Don't print anything about recovering an empty database: */ + if (bytes_to_read != 0) + xt_logf(XT_NT_INFO, "PBXT: Recovering from %lu-%llu, bytes to read: %llu\n", (u_long) xres_cp_log_id, (u_llong) xres_cp_log_offset, (u_llong) bytes_to_read); + if (bytes_to_read >= 10*1024*1024) { + print_progress = TRUE; + perc_size = bytes_to_read / 100; + next_goal = perc_size; + xres_recover_progress(self, &progress_file, 0); + } + + if (!db->db_xlog.xlog_seq_start(&ws.ws_seqread, xres_cp_log_id, xres_cp_log_offset, FALSE)) { + ok = FALSE; + goto failed; + } + + try_(a) { + for (;;) { + if (!db->db_xlog.xlog_seq_next(&ws.ws_seqread, &record, TRUE, self)) { + ok = FALSE; + break; + } + /* Increment before. If record is NULL then xseq_record_len will be zero, + * UNLESS the last record was of type XT_LOG_ENT_END_OF_LOG + * which fills the log to align to block of size 512. + */ + bytes_read += ws.ws_seqread.xseq_record_len; + if (!record) + break; +#ifdef PRINT_LOG_ON_RECOVERY + xt_print_log_record(ws.ws_seqread.xseq_rec_log_id, ws.ws_seqread.xseq_rec_log_offset, record); +#endif + if (print_progress && bytes_read > next_goal) { + if (((perc_complete - 1) % 25) == 0) + xt_logf(XT_NT_INFO, "PBXT: "); + if ((perc_complete % 25) == 0) + xt_logf(XT_NT_INFO, "%2d\n", (int) perc_complete); + else + xt_logf(XT_NT_INFO, "%2d ", (int) perc_complete); + xt_log_flush(self); + xres_recover_progress(self, &progress_file, perc_complete); + next_goal += perc_size; + perc_complete++; + } + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_HEADER: + break; + case XT_LOG_ENT_NEW_LOG: { + /* Adjust the bytes read for the fact that logs are written + * on 512 byte boundaries. 
+ */ + off_t offs, eof = ws.ws_seqread.xseq_log_eof; + + offs = ws.ws_seqread.xseq_rec_log_offset + ws.ws_seqread.xseq_record_len; + if (eof > offs) + bytes_read += eof - offs; + if (!db->db_xlog.xlog_seq_start(&ws.ws_seqread, XT_GET_DISK_4(record->xl.xl_log_id_4), 0, TRUE)) + xt_throw(self); + break; + } + case XT_LOG_ENT_NEW_TAB: + tab_id = XT_GET_DISK_4(record->xt.xt_tab_id_4); + if (tab_id > db->db_curr_tab_id) + db->db_curr_tab_id = tab_id; + break; + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_DELETE_BG: + xn_id = XT_GET_DISK_4(record->xu.xu_xact_id_4); + goto start_xact; + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE_FL_BG: + xn_id = XT_GET_DISK_4(record->xf.xf_xact_id_4); + start_xact: + if (xt_xn_is_before(db->db_xn_curr_id, xn_id)) + db->db_xn_curr_id = xn_id; + + if (!(xact = xt_xn_add_old_xact(db, xn_id, self))) + xt_throw(self); + + xact->xd_begin_log = ws.ws_seqread.xseq_rec_log_id; + xact->xd_begin_offset = ws.ws_seqread.xseq_rec_log_offset; + + xact->xd_end_xn_id = xn_id; + xact->xd_end_time = db->db_xn_end_time; + xact->xd_flags = (XT_XN_XAC_LOGGED | XT_XN_XAC_ENDED | XT_XN_XAC_RECOVERED | XT_XN_XAC_SWEEP); + + /* This may affect the "minimum RAM transaction": */ + if (!min_ram_xn_id_set || xt_xn_is_before(xn_id, db->db_xn_min_ram_id)) { + min_ram_xn_id_set = TRUE; + db->db_xn_min_ram_id = xn_id; + } + xt_xres_apply_in_order(self, &ws, ws.ws_seqread.xseq_rec_log_id, ws.ws_seqread.xseq_rec_log_offset, record); + break; + case XT_LOG_ENT_COMMIT: + case XT_LOG_ENT_ABORT: + xn_id = XT_GET_DISK_4(record->xe.xe_xact_id_4); + if ((xact = xt_xn_get_xact(db, xn_id, self))) { + xact->xd_end_xn_id = xn_id; + xact->xd_flags |= XT_XN_XAC_ENDED | XT_XN_XAC_SWEEP; + xact->xd_flags &= ~XT_XN_XAC_RECOVERED; // We can expect an end record on cleanup! 
+ if (record->xl.xl_status_1 == XT_LOG_ENT_COMMIT) + xact->xd_flags |= XT_XN_XAC_COMMITTED; + } + break; + case XT_LOG_ENT_CLEANUP: + /* The transaction was cleaned up: */ + xn_id = XT_GET_DISK_4(record->xc.xc_xact_id_4); + xt_xn_delete_xact(db, xn_id, self); + break; + case XT_LOG_ENT_OP_SYNC: + xres_sync_operations(self, db, &ws); + break; + case XT_LOG_ENT_DEL_LOG: + xtLogID rec_log_id; + + rec_log_id = XT_GET_DISK_4(record->xl.xl_log_id_4); + xt_dl_set_to_delete(self, db, rec_log_id); + break; + default: + xt_xres_apply_in_order(self, &ws, ws.ws_seqread.xseq_rec_log_id, ws.ws_seqread.xseq_rec_log_offset, record); + break; + } + } + + if (xres_sync_operations(self, db, &ws)) { + XTactOpSyncEntryDRec op_sync; + time_t now = time(NULL); + + op_sync.os_status_1 = XT_LOG_ENT_OP_SYNC; + op_sync.os_checksum_1 = XT_CHECKSUM_1(now) ^ XT_CHECKSUM_1(ws.ws_seqread.xseq_rec_log_id); + XT_SET_DISK_4(op_sync.os_time_4, (xtWord4) now); + /* TODO: If this is done, check to see that + * the bytes written here are read back by the writer. + * This is in order to be in sync with 'xl_log_bytes_written'. + * i.e. 
xl_log_bytes_written == xl_log_bytes_read + */ + if (!db->db_xlog.xlog_write_thru(&ws.ws_seqread, sizeof(XTactOpSyncEntryDRec), (xtWord1 *) &op_sync, self)) + xt_throw(self); + } + } + catch_(a) { + ok = FALSE; + } + cont_(a); + + if (ok) { + if (print_progress) { + while (perc_complete <= 100) { + if (((perc_complete - 1) % 25) == 0) + xt_logf(XT_NT_INFO, "PBXT: "); + if ((perc_complete % 25) == 0) + xt_logf(XT_NT_INFO, "%2d\n", (int) perc_complete); + else + xt_logf(XT_NT_INFO, "%2d ", (int) perc_complete); + xt_log_flush(self); + xres_recover_progress(self, &progress_file, perc_complete); + perc_complete++; + } + } + if (bytes_to_read != 0) + xt_logf(XT_NT_INFO, "PBXT: Recovering complete at %lu-%llu, bytes read: %llu\n", (u_long) ws.ws_seqread.xseq_rec_log_id, (u_llong) ws.ws_seqread.xseq_rec_log_offset, (u_llong) bytes_read); + + *log_id = ws.ws_seqread.xseq_rec_log_id; + *log_offset = ws.ws_seqread.xseq_rec_log_offset; + + if (!min_ram_xn_id_set) + /* This is true because if no transaction was placed in RAM then + * the next transaction in RAM will have the next ID: */ + db->db_xn_min_ram_id = db->db_xn_curr_id + 1; + } + + failed: + xt_free_writer_state(self, &ws); + self->st_dlog_buf.dlb_exit(self); + xres_recover_progress(self, &progress_file, 101); + return ok; +} + +xtBool XTXactRestart::xres_is_checkpoint_pending(xtLogID curr_log_id, xtLogOffset curr_log_offset) +{ + return xt_bytes_since_last_checkpoint(xres_db, curr_log_id, curr_log_offset) >= xt_db_checkpoint_frequency / 2; +} + +/* + * Calculate the bytes to be read for recovery. + * This is only an estimate of the number of bytes that + * will be read. 
+ */ +off_t XTXactRestart::xres_bytes_to_read(XTThreadPtr self, XTDatabaseHPtr db, u_int *log_count, xtLogID *max_log_id) +{ + off_t to_read = 0, eof; + xtLogID log_id = xres_cp_log_id; + char log_path[PATH_MAX]; + XTOpenFilePtr of; + XTXactLogHeaderDRec log_head; + size_t head_size; + size_t red_size; + + *max_log_id = log_id; + *log_count = 0; + for (;;) { + db->db_xlog.xlog_name(PATH_MAX, log_path, log_id); + of = NULL; + if (!xt_open_file_ns(&of, log_path, XT_FS_MISSING_OK)) + xt_throw(self); + if (!of) + break; + pushr_(xt_close_file, of); + + /* Check the first record of the log, to see if it is valid. */ + if (!xt_pread_file(of, 0, sizeof(XTXactLogHeaderDRec), 0, (xtWord1 *) &log_head, &red_size, &self->st_statistics.st_xlog, self)) + xt_throw(self); + /* The minimum size (old log size): */ + if (red_size < XT_MIN_LOG_HEAD_SIZE) + goto done; + head_size = XT_GET_DISK_4(log_head.xh_size_4); + if (log_head.xh_status_1 != XT_LOG_ENT_HEADER) + goto done; + if (log_head.xh_checksum_1 != XT_CHECKSUM_1(log_id)) + goto done; + if (XT_LOG_HEAD_MAGIC(&log_head, head_size) != XT_LOG_FILE_MAGIC) + goto done; + if (head_size > offsetof(XTXactLogHeaderDRec, xh_log_id_4) + 4) { + if (XT_GET_DISK_4(log_head.xh_log_id_4) != log_id) + goto done; + } + if (head_size > offsetof(XTXactLogHeaderDRec, xh_version_2) + 4) { + if (XT_GET_DISK_2(log_head.xh_version_2) > XT_LOG_VERSION_NO) + xt_throw_ulxterr(XT_CONTEXT, XT_ERR_NEW_TYPE_OF_XLOG, (u_long) log_id); + } + + eof = xt_seek_eof_file(self, of); + freer_(); // xt_close_file(of) + if (log_id == xres_cp_log_id) + to_read += (eof - xres_cp_log_offset); + else + to_read += eof; + (*log_count)++; + *max_log_id = log_id; + log_id++; + } + return to_read; + + done: + freer_(); // xt_close_file(of) + return to_read; +} + + +/* ---------------------------------------------------------------------- + * C H E C K P O I N T P R O C E S S + */ + +typedef enum XTFileType { + XT_FT_RECROW_FILE, + XT_FT_INDEX_FILE +} XTFileType; + +typedef 
struct XTDirtyFile { + xtTableID df_tab_id; + XTFileType df_file_type; +} XTDirtyFileRec, *XTDirtyFilePtr; + +#define XT_MAX_FLUSH_FILES 200 +#define XT_FLUSH_THRESHOLD (2 * 1024 * 1024) + +/* Sort files to be flushed. */ +#ifdef USE_LATER +static void xres_cp_flush_files(XTThreadPtr self, XTDatabaseHPtr db) +{ + u_int edx; + XTTableEntryPtr te; + XTDirtyFileRec flush_list[XT_MAX_FLUSH_FILES]; + u_int file_count = 0; + XTIndexPtr *iptr; + u_int dirty_blocks; + XTOpenTablePtr ot; + XTTableHPtr tab; + + retry: + xt_enum_tables_init(&edx); + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + while (file_count < XT_MAX_FLUSH_FILES && + (te = xt_enum_tables_next(self, db, &edx))) { + if ((tab = te->te_table)) { + if (tab->tab_bytes_to_flush >= XT_FLUSH_THRESHOLD) { + flush_list[file_count].df_tab_id = te->te_tab_id; + flush_list[file_count].df_file_type = XT_FT_RECROW_FILE; + file_count++; + } + if (file_count == XT_MAX_FLUSH_FILES) + break; + iptr = tab->tab_dic.dic_keys; + dirty_blocks = 0; + for (u_int i=0;i<tab->tab_dic.dic_key_count; i++) { + dirty_blocks += (*iptr)->mi_dirty_blocks; + iptr++; + } + if ((dirty_blocks * XT_INDEX_PAGE_SIZE) >= XT_FLUSH_THRESHOLD) { + flush_list[file_count].df_tab_id = te->te_tab_id; + flush_list[file_count].df_file_type = XT_FT_INDEX_FILE; + file_count++; + } + } + } + freer_(); // xt_ht_unlock(db->db_tables) + + for (u_int i=0;i<file_count && !self->t_quit; i++) { + /* We want to flush about once a second: */ + xt_sleep_milli_second(400); + if ((ot = xt_db_open_pool_table(self, db, flush_list[i].df_tab_id, NULL, TRUE))) { + pushr_(xt_db_return_table_to_pool, ot); + + if (flush_list[i].df_file_type == XT_FT_RECROW_FILE) { + if (!xt_flush_record_row(ot, NULL)) + xt_throw(self); + } + else { + if (!xt_flush_indices(ot, NULL)) + xt_throw(self); + } + + freer_(); // xt_db_return_table_to_pool(ot) + } + } + + if (file_count == 100) + goto retry; +} +#endif + +#ifdef xxx +void 
XTXactRestart::xres_checkpoint_pending(xtLogID log_id, xtLogOffset log_offset) +{ +#ifdef TRACE_CHECKPOINT_ACTIVITY + xtBool tmp = xres_cp_pending; +#endif + xres_cp_pending = xres_is_checkpoint_pending(log_id, log_offset); +#ifdef TRACE_CHECKPOINT_ACTIVITY + if (tmp) { + if (!xres_cp_pending) + printf("%s xres_cp_pending = FALSE\n", xt_get_self()->t_name); + } + else { + if (xres_cp_pending) + printf("%s xres_cp_pending = TRUE\n", xt_get_self()->t_name); + } +#endif +} + + + xres_checkpoint_pending(); + + if (!xres_cp_required && + !xres_cp_pending && + xt_sl_get_size(db->db_datalogs.dlc_to_delete) == 0 && + xt_sl_get_size(db->db_datalogs.dlc_deleted) == 0) + return FALSE; +#endif + +#ifdef NEVER_CHECKPOINT +xtBool no_checkpoint = TRUE; +#endif + +#define XT_CHECKPOINT_IF_NO_ACTIVITY 0 +#define XT_CHECKPOINT_PAUSE_IF_ACTIVITY 1 +#define XT_CHECKPOINT_NO_PAUSE 2 + +/* + * This function performs table flush, as long as the system is idle. + */ +static xtBool xres_cp_checkpoint(XTThreadPtr self, XTDatabaseHPtr db, u_int curr_writer_total, xtBool force_checkpoint) +{ + XTCheckPointStatePtr cp = &db->db_cp_state; + XTOpenTablePtr ot; + XTCheckPointTablePtr to_flush_ptr; + XTCheckPointTableRec to_flush; + u_int table_count = 0; + xtBool checkpoint_done; + off_t bytes_flushed = 0; + int check_type; + +#ifdef NEVER_CHECKPOINT + if (no_checkpoint) + return FALSE; +#endif + if (force_checkpoint) { + if (db->db_restart.xres_cp_required) + check_type = XT_CHECKPOINT_NO_PAUSE; + else + check_type = XT_CHECKPOINT_PAUSE_IF_ACTIVITY; + } + else + check_type = XT_CHECKPOINT_IF_NO_ACTIVITY; + + to_flush.cpt_tab_id = 0; + to_flush.cpt_flushed = 0; + + /* Start a checkpoint: */ + if (!xt_begin_checkpoint(db, FALSE, self)) + xt_throw(self); + + while (!self->t_quit) { + xt_lock_mutex_ns(&cp->cp_state_lock); + table_count = 0; + if (cp->cp_table_ids) + table_count = xt_sl_get_size(cp->cp_table_ids); + if (!cp->cp_running || cp->cp_flush_count >= table_count) { + 
xt_unlock_mutex_ns(&cp->cp_state_lock); + break; + } + if (cp->cp_next_to_flush > table_count) + cp->cp_next_to_flush = 0; + + to_flush_ptr = (XTCheckPointTablePtr) xt_sl_item_at(cp->cp_table_ids, cp->cp_next_to_flush); + if (to_flush_ptr) + to_flush = *to_flush_ptr; + xt_unlock_mutex_ns(&cp->cp_state_lock); + + if (to_flush_ptr) { + if ((ot = xt_db_open_pool_table(self, db, to_flush.cpt_tab_id, NULL, TRUE))) { + pushr_(xt_db_return_table_to_pool, ot); + + if (!(to_flush.cpt_flushed & XT_CPT_REC_ROW_FLUSHED)) { + if (!xt_flush_record_row(ot, &bytes_flushed, FALSE)) + xt_throw(self); + } + + xt_lock_mutex_ns(&cp->cp_state_lock); + to_flush_ptr = NULL; + if (cp->cp_running) + to_flush_ptr = (XTCheckPointTablePtr) xt_sl_item_at(cp->cp_table_ids, cp->cp_next_to_flush); + if (to_flush_ptr) + to_flush = *to_flush_ptr; + xt_unlock_mutex_ns(&cp->cp_state_lock); + + if (to_flush_ptr && !self->t_quit) { + if (!(to_flush.cpt_flushed & XT_CPT_INDEX_FLUSHED)) { + switch (check_type) { + case XT_CHECKPOINT_IF_NO_ACTIVITY: + if (bytes_flushed > 0 && curr_writer_total != db->db_xn_total_writer_count) { + freer_(); // xt_db_return_table_to_pool(ot) + goto end_checkpoint; + } + break; + case XT_CHECKPOINT_PAUSE_IF_ACTIVITY: + if (bytes_flushed > 2 * 1024 * 1024 && curr_writer_total != db->db_xn_total_writer_count) { + curr_writer_total = db->db_xn_total_writer_count; + bytes_flushed = 0; + xt_sleep_milli_second(400); + } + break; + case XT_CHECKPOINT_NO_PAUSE: + break; + } + + if (!self->t_quit) { + if (!xt_flush_indices(ot, &bytes_flushed, FALSE)) + xt_throw(self); + to_flush.cpt_flushed |= XT_CPT_INDEX_FLUSHED; + } + } + } + + freer_(); // xt_db_return_table_to_pool(ot) + } + + if ((to_flush.cpt_flushed & XT_CPT_ALL_FLUSHED) == XT_CPT_ALL_FLUSHED) + cp->cp_next_to_flush++; + } + else + cp->cp_next_to_flush++; + + if (self->t_quit) + break; + + switch (check_type) { + case XT_CHECKPOINT_IF_NO_ACTIVITY: + if (bytes_flushed > 0 && curr_writer_total != db->db_xn_total_writer_count) + 
goto end_checkpoint; + break; + case XT_CHECKPOINT_PAUSE_IF_ACTIVITY: + if (bytes_flushed > 2 * 1024 * 1024 && curr_writer_total != db->db_xn_total_writer_count) { + curr_writer_total = db->db_xn_total_writer_count; + bytes_flushed = 0; + xt_sleep_milli_second(400); + } + break; + case XT_CHECKPOINT_NO_PAUSE: + break; + } + } + + end_checkpoint: + if (!xt_end_checkpoint(db, self, &checkpoint_done)) + xt_throw(self); + return checkpoint_done; +} + + +/* Wait for the log writer to tell us to do something. + */ +static void xres_cp_wait_for_log_writer(XTThreadPtr self, XTDatabaseHPtr db, u_long milli_secs) +{ + xt_lock_mutex(self, &db->db_cp_lock); + pushr_(xt_unlock_mutex, &db->db_cp_lock); + if (!self->t_quit) + xt_timed_wait_cond(self, &db->db_cp_cond, &db->db_cp_lock, milli_secs); + freer_(); // xt_unlock_mutex(&db->db_cp_lock) +} + +/* + * This is the way checkpoint works: + * + * To write a checkpoint we need to flush all tables in + * the database. + * + * Before flushing the first table we get the checkpoint + * log position. + * + * After flushing all files we write the checkpoint + * log position. + */ +static void xres_cp_main(XTThreadPtr self) +{ + XTDatabaseHPtr db = self->st_database; + u_int curr_writer_total; + time_t now; + + xt_set_low_priority(self); + + + while (!self->t_quit) { + /* Wait 2 seconds: */ + curr_writer_total = db->db_xn_total_writer_count; + xt_db_approximate_time = time(NULL); + now = xt_db_approximate_time; + while (!self->t_quit && xt_db_approximate_time < now + 2 && !db->db_restart.xres_cp_required) { + xres_cp_wait_for_log_writer(self, db, 400); + xt_db_approximate_time = time(NULL); + xt_db_free_unused_open_tables(self, db); + } + + if (self->t_quit) + break; + + if (curr_writer_total == db->db_xn_total_writer_count) + /* No activity in 2 seconds: */ + xres_cp_checkpoint(self, db, curr_writer_total, FALSE); + else { + /* The server is busy, check if we need to + * write a checkpoint anyway... 
+ */ + if (db->db_restart.xres_cp_required || + db->db_restart.xres_is_checkpoint_pending(db->db_xlog.xl_write_log_id, db->db_xlog.xl_write_log_offset)) { + /* Flush tables, until the checkpoint is complete. */ + xres_cp_checkpoint(self, db, curr_writer_total, TRUE); + } + } + + if (curr_writer_total == db->db_xn_total_writer_count) { + /* We did a checkpoint, and still, nothing has + * happened.... + * + * Wait for something to happen: + */ + xtLogID log_id; + xtLogOffset log_offset; + + while (!self->t_quit && curr_writer_total == db->db_xn_total_writer_count) { + /* The writer position: */ + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + log_id = db->db_wr_log_id; + log_offset = db->db_wr_log_offset; + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + /* This condition means we could checkpoint: */ + if (!(xt_sl_get_size(db->db_datalogs.dlc_to_delete) == 0 && + xt_sl_get_size(db->db_datalogs.dlc_deleted) == 0 && + xt_comp_log_pos(log_id, log_offset, db->db_restart.xres_cp_log_id, db->db_restart.xres_cp_log_offset) <= 0)) + break; + + xres_cp_wait_for_log_writer(self, db, 400); + xt_db_approximate_time = time(NULL); + xt_db_free_unused_open_tables(self, db); + } + } + } +} + +static void *xres_cp_run_thread(XTThreadPtr self) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) self->t_data; + int count; + void *mysql_thread; + + mysql_thread = myxt_create_thread(); + + while (!self->t_quit) { + try_(a) { + /* + * The garbage collector requires that the database + * is in use because. + */ + xt_use_database(self, db, XT_FOR_CHECKPOINTER); + + /* This action is both safe and required (see details elsewhere) */ + xt_heap_release(self, self->st_database); + + xres_cp_main(self); + } + catch_(a) { + /* This error is "normal"! 
*/ + if (self->t_exception.e_xt_err != XT_ERR_NO_DICTIONARY && + !(self->t_exception.e_xt_err == XT_SIGNAL_CAUGHT && + self->t_exception.e_sys_err == SIGTERM)) + xt_log_and_clear_exception(self); + } + cont_(a); + + /* Avoid releasing the database (done above) */ + self->st_database = NULL; + xt_unuse_database(self, self); + + /* After an exception, pause before trying again... */ + /* Number of seconds */ + count = 60; + while (!self->t_quit && count > 0) { + sleep(1); + count--; + } + } + + myxt_destroy_thread(mysql_thread, TRUE); + return NULL; +} + +static void xres_cp_free_thread(XTThreadPtr self, void *data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) data; + + if (db->db_cp_thread) { + xt_lock_mutex(self, &db->db_cp_lock); + pushr_(xt_unlock_mutex, &db->db_cp_lock); + db->db_cp_thread = NULL; + freer_(); // xt_unlock_mutex(&db->db_cp_lock) + } +} + +/* Start a checkpoint, if none has been started. */ +xtPublic xtBool xt_begin_checkpoint(XTDatabaseHPtr db, xtBool have_table_lock, XTThreadPtr thread) +{ + XTCheckPointStatePtr cp = &db->db_cp_state; + xtLogID log_id; + xtLogOffset log_offset; + xtLogID ind_rec_log_id; + xtLogOffset ind_rec_log_offset; + u_int edx; + XTTableEntryPtr te_ptr; + XTTableHPtr tab; + XTOperationPtr op; + XTCheckPointTableRec cpt; + XTSortedListPtr tables = NULL; + + /* First check if a checkpoint is already running: */ + xt_lock_mutex_ns(&cp->cp_state_lock); + if (cp->cp_running) { + xt_unlock_mutex_ns(&cp->cp_state_lock); + return OK; + } + if (cp->cp_table_ids) { + xt_free_sortedlist(NULL, cp->cp_table_ids); + cp->cp_table_ids = NULL; + } + xt_unlock_mutex_ns(&cp->cp_state_lock); + + /* Flush the log before we continue. This is to ensure that + * before we write a checkpoint, that the changes + * done by the sweeper and the compactor, have been + * applied. + * + * Note, the sweeper does not flush the log, so this is + * necessary! + * + * --- I have removed this flush. 
It is actually just a + * minor optimisation, which pushes the flush position + * below ahead. + * + * Note that the writer position used for the checkpoint + * _will_ be behind the current log flush position. + * + * This is because the writer cannot apply log changes + * until they are flushed. + */ + /* This is an alternative to the above. + if (!xt_xlog_flush_log(self)) + xt_throw(self); + */ + xt_lock_mutex_ns(&db->db_wr_lock); + + /* The theoretical maximum restart log position is the + * position of the writer thread: + */ + log_id = db->db_wr_log_id; + log_offset = db->db_wr_log_offset; + + ind_rec_log_id = db->db_xlog.xl_flush_log_id; + ind_rec_log_offset = db->db_xlog.xl_flush_log_offset; + + xt_unlock_mutex_ns(&db->db_wr_lock); + + /* Go through all the transactions, and find + * the lowest log start position of all the transactions. + */ + for (u_int i=0; i<XT_XN_NO_OF_SEGMENTS; i++) { + XTXactSegPtr seg; + + seg = &db->db_xn_idx[i]; + XT_XACT_WRITE_LOCK(&seg->xs_tab_lock, self); + for (u_int j=0; j<XT_XN_HASH_TABLE_SIZE; j++) { + XTXactDataPtr xact; + + xact = seg->xs_table[j]; + while (xact) { + /* If the transaction is logged, but not cleaned: */ + if ((xact->xd_flags & (XT_XN_XAC_LOGGED | XT_XN_XAC_CLEANED)) == XT_XN_XAC_LOGGED) { + if (xt_comp_log_pos(log_id, log_offset, xact->xd_begin_log, xact->xd_begin_offset) > 0) { + log_id = xact->xd_begin_log; + log_offset = xact->xd_begin_offset; + } + } + xact = xact->xd_next_xact; + } + } + XT_XACT_UNLOCK(&seg->xs_tab_lock, self); + } + +#ifdef TRACE_CHECKPOINT + printf("BEGIN CHECKPOINT %d-%llu\n", (int) log_id, (u_llong) log_offset); +#endif + /* Go through all tables, and find the lowest log position. + * The log position stored by each table shows the position of + * the next operation that still needs to be applied. + * + * This comes from the list of operations which are + * queued for the table. + * + * This function also builds a list of tables! 
+ */ + + if (!(tables = xt_new_sortedlist_ns(sizeof(XTCheckPointTableRec), 20, xres_comp_flush_tabs, NULL, NULL))) + return FAILED; + + xt_enum_tables_init(&edx); + if (!have_table_lock) + xt_ht_lock(NULL, db->db_tables); + while ((te_ptr = xt_enum_tables_next(NULL, db, &edx))) { + if ((tab = te_ptr->te_table)) { + xt_sl_lock_ns(tab->tab_op_list, thread); + if ((op = (XTOperationPtr) xt_sl_first_item(tab->tab_op_list))) { + if (xt_comp_log_pos(log_id, log_offset, op->or_log_id, op->or_log_offset) > 0) { + log_id = op->or_log_id; + log_offset = op->or_log_offset; + } + } + xt_sl_unlock(NULL, tab->tab_op_list); + cpt.cpt_flushed = 0; + cpt.cpt_tab_id = tab->tab_id; +#ifdef TRACE_CHECKPOINT + printf("to flush: %d %s\n", (int) tab->tab_id, tab->tab_name->ps_path); +#endif + if (!xt_sl_insert(NULL, tables, &tab->tab_id, &cpt)) { + if (!have_table_lock) + xt_ht_unlock(NULL, db->db_tables); + xt_free_sortedlist(NULL, tables); + return FAILED; + } + } + } + if (!have_table_lock) + xt_ht_unlock(NULL, db->db_tables); + + xt_lock_mutex_ns(&cp->cp_state_lock); + /* If there is a table list, then someone was faster than me! */ + if (!cp->cp_running && log_id && log_offset) { + cp->cp_running = TRUE; + cp->cp_log_id = log_id; + cp->cp_log_offset = log_offset; + + cp->cp_ind_rec_log_id = ind_rec_log_id; + cp->cp_ind_rec_log_offset = ind_rec_log_offset; + + cp->cp_flush_count = 0; + cp->cp_next_to_flush = 0; + cp->cp_table_ids = tables; + } + else + xt_free_sortedlist(NULL, tables); + xt_unlock_mutex_ns(&cp->cp_state_lock); + + /* At this point, log flushing can begin... 
*/ + return OK; +} + +/* End a checkpoint, if a checkpoint has been started, + * and all checkpoint tables have been flushed + */ +xtPublic xtBool xt_end_checkpoint(XTDatabaseHPtr db, XTThreadPtr thread, xtBool *checkpoint_done) +{ + XTCheckPointStatePtr cp = &db->db_cp_state; + XTXlogCheckpointDPtr cp_buf = NULL; + char path[PATH_MAX]; + XTOpenFilePtr of; + u_int table_count; + size_t chk_size = 0; + u_int no_of_logs = 0; + +#ifdef NEVER_CHECKPOINT + return OK; +#endif + /* Lock the checkpoint state so that only one thread can do this! */ + xt_lock_mutex_ns(&cp->cp_state_lock); + if (!cp->cp_running) + goto checkpoint_done; + + table_count = 0; + if (cp->cp_table_ids) + table_count = xt_sl_get_size(cp->cp_table_ids); + if (cp->cp_flush_count < table_count) { + /* Checkpoint is not done, yet! */ + xt_unlock_mutex_ns(&cp->cp_state_lock); + if (checkpoint_done) + *checkpoint_done = FALSE; + return OK; + } + + /* Check if anything has changed since the last checkpoint, + * if not, there is no need to write a new checkpoint! + */ + if (xt_sl_get_size(db->db_datalogs.dlc_to_delete) == 0 && + xt_sl_get_size(db->db_datalogs.dlc_deleted) == 0 && + xt_comp_log_pos(cp->cp_log_id, cp->cp_log_offset, db->db_restart.xres_cp_log_id, db->db_restart.xres_cp_log_offset) <= 0) { + /* A checkpoint is required if the size of the deleted + * list is not zero. The reason is, I cannot remove the + * logs from the deleted list BEFORE a checkpoint has been + * done which does NOT include these logs. + * + * Even though the logs have already been deleted, they + * remain on the deleted list to ensure that they are NOT + * reused during this time, until the next checkpoint. + * + * This is done because if they are used, then on restart + * they would be deleted! 
+ */ +#ifdef TRACE_CHECKPOINT + printf("--- END CHECKPOINT - no write\n"); +#endif + goto checkpoint_done; + } + +#ifdef TRACE_CHECKPOINT + printf("--- END CHECKPOINT - write start point\n"); +#endif + xt_lock_mutex_ns(&db->db_datalogs.dlc_lock); + + no_of_logs = xt_sl_get_size(db->db_datalogs.dlc_to_delete); + chk_size = offsetof(XTXlogCheckpointDRec, xcp_del_log) + no_of_logs * 2; + xtLogID *log_id_ptr; + + if (!(cp_buf = (XTXlogCheckpointDPtr) xt_malloc_ns(chk_size))) { + xt_unlock_mutex_ns(&db->db_datalogs.dlc_lock); + goto failed_0; + } + + /* Increment the checkpoint number. This value is used if 2 checkpoint have the + * same log number. In this case checkpoints may differ in the log files + * that should be deleted. Here it is important to use the most recent + * log file! + */ + db->db_restart.xres_cp_number++; + + /* Create the checkpoint record: */ + XT_SET_DISK_4(cp_buf->xcp_head_size_4, chk_size); + XT_SET_DISK_2(cp_buf->xcp_version_2, XT_CHECKPOINT_VERSION); + XT_SET_DISK_6(cp_buf->xcp_chkpnt_no_6, db->db_restart.xres_cp_number); + XT_SET_DISK_4(cp_buf->xcp_log_id_4, cp->cp_log_id); + XT_SET_DISK_6(cp_buf->xcp_log_offs_6, cp->cp_log_offset); + XT_SET_DISK_4(cp_buf->xcp_tab_id_4, db->db_curr_tab_id); + XT_SET_DISK_4(cp_buf->xcp_xact_id_4, db->db_xn_curr_id); + XT_SET_DISK_4(cp_buf->xcp_ind_rec_log_id_4, cp->cp_ind_rec_log_id); + XT_SET_DISK_6(cp_buf->xcp_ind_rec_log_offs_6, cp->cp_ind_rec_log_offset); + XT_SET_DISK_2(cp_buf->xcp_log_count_2, no_of_logs); + + for (u_int i=0; i<no_of_logs; i++) { + log_id_ptr = (xtLogID *) xt_sl_item_at(db->db_datalogs.dlc_to_delete, i); + XT_SET_DISK_2(cp_buf->xcp_del_log[i], (xtWord2) *log_id_ptr); + } + + XT_SET_DISK_2(cp_buf->xcp_checksum_2, xt_get_checksum(((xtWord1 *) cp_buf) + 2, chk_size - 2, 1)); + + xt_unlock_mutex_ns(&db->db_datalogs.dlc_lock); + + /* Write the checkpoint: */ + db->db_restart.xres_name(PATH_MAX, path, db->db_restart.xres_next_res_no); + if (!(of = xt_open_file_ns(path, XT_FS_CREATE | 
XT_FS_MAKE_PATH))) + goto failed_1; + + if (!xt_set_eof_file(NULL, of, 0)) + goto failed_2; + if (!xt_pwrite_file(of, 0, chk_size, (xtWord1 *) cp_buf, &thread->st_statistics.st_x, thread)) + goto failed_2; + if (!xt_flush_file(of, &thread->st_statistics.st_x, thread)) + goto failed_2; + + xt_close_file_ns(of); + + /* Next time write the other restart file: */ + db->db_restart.xres_next_res_no = (db->db_restart.xres_next_res_no % 2) + 1; + db->db_restart.xres_cp_log_id = cp->cp_log_id; + db->db_restart.xres_cp_log_offset = cp->cp_log_offset; + db->db_restart.xres_cp_required = FALSE; + + /* + * Remove all the data logs that were deleted on the + * last checkpoint: + */ + if (!xres_remove_data_logs(db)) + goto failed_0; + +#ifndef DEBUG_KEEP_LOGS + /* After checkpoint, we can delete transaction logs that will no longer be required + * for recovery... + */ + if (cp->cp_log_id > 1) { + xtLogID current_log_id = cp->cp_log_id; + xtLogID del_log_id; + +#ifdef XT_NUMBER_OF_LOGS_TO_SAVE + if (pbxt_crash_debug) { + /* To save the logs, we just consider them in use: */ + if (current_log_id > XT_NUMBER_OF_LOGS_TO_SAVE) + current_log_id -= XT_NUMBER_OF_LOGS_TO_SAVE; + else + current_log_id = 1; + } +#endif + + del_log_id = current_log_id - 1; + + while (del_log_id > 0) { + db->db_xlog.xlog_name(PATH_MAX, path, del_log_id); + if (!xt_fs_exists(path)) + break; + del_log_id--; + } + + /* This was the lowest log ID that existed: */ + del_log_id++; + + /* Delete all logs that still exist, that come before + * the current log: + * + * Do this from least to greatest to ensure no "holes" appear. + */ + while (del_log_id < current_log_id) { + switch (db->db_xlog.xlog_delete_log(del_log_id, thread)) { + case OK: + break; + case FAILED: + goto exit_loop; + case XT_ERR: + goto failed_0; + } + del_log_id++; + } + exit_loop:; + } + + /* And we can delete data logs in the list, and place them + * on the deleted list. 
+ */ + xtLogID log_id; + for (u_int i=0; i<no_of_logs; i++) { + log_id = (xtLogID) XT_GET_DISK_2(cp_buf->xcp_del_log[i]); + if (!xres_delete_data_log(db, log_id)) + goto failed_0; + } +#endif + + xt_free_ns(cp_buf); + cp_buf = NULL; + + checkpoint_done: + cp->cp_running = FALSE; + if (cp->cp_table_ids) { + xt_free_sortedlist(NULL, cp->cp_table_ids); + cp->cp_table_ids = NULL; + } + cp->cp_flush_count = 0; + cp->cp_next_to_flush = 0; + db->db_restart.xres_cp_required = FALSE; + xt_unlock_mutex_ns(&cp->cp_state_lock); + if (checkpoint_done) + *checkpoint_done = TRUE; + return OK; + + failed_2: + xt_close_file_ns(of); + + failed_1: + /* cp_buf is freed at failed_0 below; freeing it here as well would be a double free. */ + + failed_0: + if (cp_buf) + xt_free_ns(cp_buf); + xt_unlock_mutex_ns(&cp->cp_state_lock); + return FAILED; +} + +xtPublic xtWord8 xt_bytes_since_last_checkpoint(XTDatabaseHPtr db, xtLogID curr_log_id, xtLogOffset curr_log_offset) +{ + xtLogID log_id; + xtLogOffset log_offset; + size_t byte_count = 0; + + log_id = db->db_restart.xres_cp_log_id; + log_offset = db->db_restart.xres_cp_log_offset; + + /* Assume the logs have the threshold: */ + if (log_id < curr_log_id) { + if (log_offset < xt_db_log_file_threshold) + byte_count = (size_t) (xt_db_log_file_threshold - log_offset); + log_offset = 0; + log_id++; + } + while (log_id < curr_log_id) { + byte_count += (size_t) xt_db_log_file_threshold; + log_id++; + } + if (log_offset < curr_log_offset) + byte_count += (size_t) (curr_log_offset - log_offset); + + return byte_count; +} + +xtPublic void xt_start_checkpointer(XTThreadPtr self, XTDatabaseHPtr db) +{ + char name[PATH_MAX]; + + sprintf(name, "CP-%s", xt_last_directory_of_path(db->db_main_path)); + xt_remove_dir_char(name); + db->db_cp_thread = xt_create_daemon(self, name); + xt_set_thread_data(db->db_cp_thread, db, xres_cp_free_thread); + xt_run_thread(self, db->db_cp_thread, xres_cp_run_thread); +} + +xtPublic void xt_wait_for_checkpointer(XTThreadPtr self, XTDatabaseHPtr db) +{ + time_t then, now; + xtBool message = 
FALSE; + xtLogID log_id; + xtLogOffset log_offset; + + if (db->db_cp_thread) { + then = time(NULL); + for (;;) { + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + log_id = db->db_wr_log_id; + log_offset = db->db_wr_log_offset; + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + if (xt_sl_get_size(db->db_datalogs.dlc_to_delete) == 0 && + xt_sl_get_size(db->db_datalogs.dlc_deleted) == 0 && + xt_comp_log_pos(log_id, log_offset, db->db_restart.xres_cp_log_id, db->db_restart.xres_cp_log_offset) <= 0) + break; + + /* Do a final checkpoint before shutdown: */ + db->db_restart.xres_cp_required = TRUE; + + xt_lock_mutex(self, &db->db_cp_lock); + pushr_(xt_unlock_mutex, &db->db_cp_lock); + if (!xt_broadcast_cond_ns(&db->db_cp_cond)) { + xt_log_and_clear_exception_ns(); + break; + } + freer_(); // xt_unlock_mutex(&db->db_cp_lock) + + xt_sleep_milli_second(10); + + now = time(NULL); + if (now >= then + 16) { + xt_logf(XT_NT_INFO, "Aborting wait for '%s' checkpointer\n", db->db_name); + message = FALSE; + break; + } + if (now >= then + 2) { + if (!message) { + message = TRUE; + xt_logf(XT_NT_INFO, "Waiting for '%s' checkpointer...\n", db->db_name); + } + } + } + + if (message) + xt_logf(XT_NT_INFO, "Checkpointer '%s' done.\n", db->db_name); + } +} + +xtPublic void xt_stop_checkpointer(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTThreadPtr thr_wr; + + if (db->db_cp_thread) { + xt_lock_mutex(self, &db->db_cp_lock); + pushr_(xt_unlock_mutex, &db->db_cp_lock); + + /* This pointer is safe as long as you have the transaction lock. */ + if ((thr_wr = db->db_cp_thread)) { + xtThreadID tid = thr_wr->t_id; + + /* Make sure the thread quits when woken up. 
*/ + xt_terminate_thread(self, thr_wr); + + xt_wake_checkpointer(self, db); + + freer_(); // xt_unlock_mutex(&db->db_cp_lock) + + /* + * GOTCHA: This is a weird thing but the SIGTERM directed + * at a particular thread (in this case the sweeper) was + * being caught by a different thread and killing the server + * sometimes. Disconcerting. + * (this may only be a problem on Mac OS X) + xt_kill_thread(thread); + */ + xt_wait_for_thread(tid, FALSE); + + /* PMC - This should not be necessary to set the signal here, but in the + * debugger the handler is not called!!? + thr_wr->t_delayed_signal = SIGTERM; + xt_kill_thread(thread); + */ + db->db_cp_thread = NULL; + } + else + freer_(); // xt_unlock_mutex(&db->db_cp_lock) + } +} + +xtPublic void xt_wake_checkpointer(XTThreadPtr self, XTDatabaseHPtr db) +{ + if (!xt_broadcast_cond_ns(&db->db_cp_cond)) + xt_log_and_clear_exception(self); +} + +xtPublic void xt_free_writer_state(struct XTThread *self, XTWriterStatePtr ws) +{ + if (ws->ws_db) + ws->ws_db->db_xlog.xlog_seq_exit(&ws->ws_seqread); + xt_db_set_size(self, &ws->ws_databuf, 0); + xt_ib_free(self, &ws->ws_rec_buf); + if (ws->ws_ot) { + xt_db_return_table_to_pool(self, ws->ws_ot); + ws->ws_ot = NULL; + } +} + +xtPublic void xt_dump_xlogs(XTDatabaseHPtr db, xtLogID start_log) +{ + XTXactSeqReadRec seq; + XTXactLogBufferDPtr record; + xtLogID log_id = db->db_restart.xres_cp_log_id; + char log_path[PATH_MAX]; + XTThreadPtr thread = xt_get_self(); + + /* Find the first log that still exists:*/ + for (;;) { + log_id--; + db->db_xlog.xlog_name(PATH_MAX, log_path, log_id); + if (!xt_fs_exists(log_path)) + break; + } + log_id++; + + if (!db->db_xlog.xlog_seq_init(&seq, xt_db_log_buffer_size, FALSE)) + return; + + if (log_id < start_log) + log_id = start_log; + + for (;;) { + db->db_xlog.xlog_name(PATH_MAX, log_path, log_id); + if (!xt_fs_exists(log_path)) + break; + + if (!db->db_xlog.xlog_seq_start(&seq, log_id, 0, FALSE)) + goto done; + + PRINTF("---------- DUMP LOG %d\n", 
(int) log_id); + for (;;) { + if (!db->db_xlog.xlog_seq_next(&seq, &record, TRUE, thread)) { + PRINTF("---------- DUMP LOG %d ERROR\n", (int) log_id); + xt_log_and_clear_exception_ns(); + break; + } + if (!record) { + PRINTF("---------- DUMP LOG %d DONE\n", (int) log_id); + break; + } + xt_print_log_record(seq.xseq_rec_log_id, seq.xseq_rec_log_offset, record); + } + + log_id++; + } + + done: + db->db_xlog.xlog_seq_exit(&seq); +} diff --git a/storage/pbxt/src/restart_xt.h b/storage/pbxt/src/restart_xt.h new file mode 100644 index 00000000000..259b3cbda90 --- /dev/null +++ b/storage/pbxt/src/restart_xt.h @@ -0,0 +1,134 @@ +/* Copyright (c) 2007 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2007-11-12 Paul McCullagh + * + * H&G2JCtL + * + * Restart and write data to the database. 
+ */ + +#ifndef __restart_xt_h__ +#define __restart_xt_h__ + +#include "pthread_xt.h" +#include "filesys_xt.h" +#include "sortedlist_xt.h" +#include "util_xt.h" +#include "xactlog_xt.h" + +struct XTThread; +struct XTOpenTable; +struct XTDatabase; +struct XTTable; + +typedef struct XTWriterState { + struct XTDatabase *ws_db; + xtBool ws_in_recover; + xtLogID ws_ind_rec_log_id; + xtLogOffset ws_ind_rec_log_offset; + XTXactSeqReadRec ws_seqread; + XTDataBufferRec ws_databuf; + XTInfoBufferRec ws_rec_buf; + xtTableID ws_tab_gone; /* Cache the ID of the last table that does not exist. */ + struct XTOpenTable *ws_ot; +} XTWriterStateRec, *XTWriterStatePtr; + +#define XT_CHECKPOINT_VERSION 1 + +typedef struct XTXlogCheckpoint { + XTDiskValue2 xcp_checksum_2; /* The checksum of the all checkpoint data. */ + XTDiskValue4 xcp_head_size_4; + XTDiskValue2 xcp_version_2; /* The version of the checkpoint record. */ + XTDiskValue6 xcp_chkpnt_no_6; /* Incremented for each checkpoint. */ + XTDiskValue4 xcp_log_id_4; /* The restart log ID. */ + XTDiskValue6 xcp_log_offs_6; /* The restart log offset. */ + XTDiskValue4 xcp_tab_id_4; /* The current high table ID. */ + XTDiskValue4 xcp_xact_id_4; /* The current high transaction ID. */ + XTDiskValue4 xcp_ind_rec_log_id_4; /* The index recovery log ID. */ + XTDiskValue6 xcp_ind_rec_log_offs_6; /* The index recovery log offset. */ + XTDiskValue2 xcp_log_count_2; /* Number of logs to be deleted in the area below. */ + XTDiskValue2 xcp_del_log[XT_VAR_LENGTH]; +} XTXlogCheckpointDRec, *XTXlogCheckpointDPtr; + +typedef struct XTXactRestart { + struct XTDatabase *xres_db; + int xres_next_res_no; /* The next restart file to be written. */ + xtLogID xres_cp_log_id; /* Log number of the last checkpoint. */ + xtLogOffset xres_cp_log_offset; /* Log offset of the last checkpoint */ + xtBool xres_cp_required; /* Checkpoint required (startup and shutdown). 
*/ + xtWord8 xres_cp_number; /* The checkpoint number (used to decide which is the latest checkpoint). */ + +public: + void xres_init(struct XTThread *self, struct XTDatabase *db, xtLogID *log_id, xtLogOffset *log_offset, xtLogID *max_log_id); + void xres_exit(struct XTThread *self); + xtBool xres_is_checkpoint_pending(xtLogID log_id, xtLogOffset log_offset); + void xres_checkpoint_pending(xtLogID log_id, xtLogOffset log_offset); + xtBool xres_checkpoint(struct XTThread *self); + void xres_name(size_t size, char *path, xtLogID log_id); + +private: + xtBool xres_check_checksum(XTXlogCheckpointDPtr buffer, size_t size); + void xres_recover_progress(XTThreadPtr self, XTOpenFilePtr *of, int perc); + xtBool xres_restart(struct XTThread *self, xtLogID *log_id, xtLogOffset *log_offset, xtLogID ind_rec_log_id, off_t ind_rec_log_offset, xtLogID *max_log_id); + off_t xres_bytes_to_read(struct XTThread *self, struct XTDatabase *db, u_int *log_count, xtLogID *max_log_id); +} XTXactRestartRec, *XTXactRestartPtr; + +typedef struct XTCheckPointState { + xt_mutex_type cp_state_lock; /* Lock and the entire checkpoint state. */ + xtBool cp_running; /* TRUE if a checkpoint is running. */ + xtLogID cp_log_id; + xtLogOffset cp_log_offset; + xtLogID cp_ind_rec_log_id; + xtLogOffset cp_ind_rec_log_offset; + XTSortedListPtr cp_table_ids; /* List of tables to be flushed for the checkpoint. */ + u_int cp_flush_count; /* The number of tables flushed. */ + u_int cp_next_to_flush; /* The next table to be flushed. 
*/ +} XTCheckPointStateRec, *XTCheckPointStatePtr; + +#define XT_CPT_NONE_FLUSHED 0 +#define XT_CPT_REC_ROW_FLUSHED 1 +#define XT_CPT_INDEX_FLUSHED 2 +#define XT_CPT_ALL_FLUSHED (XT_CPT_REC_ROW_FLUSHED | XT_CPT_INDEX_FLUSHED) + +typedef struct XTCheckPointTable { + u_int cpt_flushed; + xtTableID cpt_tab_id; +} XTCheckPointTableRec, *XTCheckPointTablePtr; + +void xt_xres_init(struct XTThread *self, struct XTDatabase *db); +void xt_xres_exit(struct XTThread *self, struct XTDatabase *db); + +void xt_xres_init_tab(struct XTThread *self, struct XTTable *tab); +void xt_xres_exit_tab(struct XTThread *self, struct XTTable *tab); + +void xt_xres_apply_in_order(struct XTThread *self, XTWriterStatePtr ws, xtLogID log_id, xtLogOffset log_offset, XTXactLogBufferDPtr record); + +xtBool xt_begin_checkpoint(struct XTDatabase *db, xtBool have_table_lock, struct XTThread *thread); +xtBool xt_end_checkpoint(struct XTDatabase *db, struct XTThread *thread, xtBool *checkpoint_done); +void xt_start_checkpointer(struct XTThread *self, struct XTDatabase *db); +void xt_wait_for_checkpointer(struct XTThread *self, struct XTDatabase *db); +void xt_stop_checkpointer(struct XTThread *self, struct XTDatabase *db); +void xt_wake_checkpointer(struct XTThread *self, struct XTDatabase *db); +void xt_free_writer_state(struct XTThread *self, XTWriterStatePtr ws); +xtWord8 xt_bytes_since_last_checkpoint(struct XTDatabase *db, xtLogID curr_log_id, xtLogOffset curr_log_offset); + +void xt_print_log_record(xtLogID log, off_t offset, XTXactLogBufferDPtr record); +void xt_dump_xlogs(struct XTDatabase *db, xtLogID start_log); + +#endif diff --git a/storage/pbxt/src/sortedlist_xt.cc b/storage/pbxt/src/sortedlist_xt.cc new file mode 100644 index 00000000000..b4c525dbb22 --- /dev/null +++ b/storage/pbxt/src/sortedlist_xt.cc @@ -0,0 +1,352 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms 
of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-04 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include "pthread_xt.h" +#include "thread_xt.h" +#include "sortedlist_xt.h" + +XTSortedListPtr xt_new_sortedlist_ns(u_int item_size, u_int grow_size, XTCompareFunc comp_func, void *thunk, XTFreeFunc free_func) +{ + XTSortedListPtr sl; + + if (!(sl = (XTSortedListPtr) xt_calloc_ns(sizeof(XTSortedListRec)))) + return NULL; + sl->sl_item_size = item_size; + sl->sl_grow_size = grow_size; + sl->sl_comp_func = comp_func; + sl->sl_thunk = thunk; + sl->sl_free_func = free_func; + sl->sl_current_size = 0; + return sl; +} + +XTSortedListPtr xt_new_sortedlist(XTThreadPtr self, u_int item_size, u_int initial_size, u_int grow_size, XTCompareFunc comp_func, void *thunk, XTFreeFunc free_func, xtBool with_lock, xtBool with_cond) +{ + XTSortedListPtr sl; + + sl = (XTSortedListPtr) xt_calloc(self, sizeof(XTSortedListRec)); + xt_init_sortedlist(self, sl, item_size, initial_size, grow_size, comp_func, thunk, free_func, with_lock, with_cond); + return sl; +} + +xtPublic void xt_init_sortedlist(XTThreadPtr self, XTSortedListPtr sl, u_int item_size, u_int initial_size, u_int grow_size, XTCompareFunc comp_func, void *thunk, XTFreeFunc free_func, xtBool with_lock, xtBool with_cond) +{ + sl->sl_item_size = item_size; + sl->sl_grow_size = grow_size; + sl->sl_comp_func = comp_func; + 
sl->sl_thunk = thunk; + sl->sl_free_func = free_func; + sl->sl_current_size = initial_size; + + if (initial_size) { + try_(a) { + sl->sl_data = (char *) xt_malloc(self, initial_size * item_size); + } + catch_(a) { + xt_free(self, sl); + throw_(); + } + cont_(a); + } + + if (with_lock || with_cond) { + sl->sl_lock = (xt_mutex_type *) xt_calloc(self, sizeof(xt_mutex_type)); + try_(b) { + xt_init_mutex_with_autoname(self, sl->sl_lock); + } + catch_(b) { + xt_free(self, sl->sl_lock); + sl->sl_lock = NULL; + xt_free_sortedlist(self, sl); + throw_(); + } + cont_(b); + } + + if (with_cond) { + sl->sl_cond = (xt_cond_type *) xt_calloc(self, sizeof(xt_cond_type)); + try_(c) { + xt_init_cond(self, sl->sl_cond); + } + catch_(c) { + xt_free(self, sl->sl_cond); + sl->sl_cond = NULL; + xt_free_sortedlist(self, sl); + throw_(); + } + cont_(c); + } +} + +xtPublic void xt_empty_sortedlist(XTThreadPtr self, XTSortedListPtr sl) +{ + if (sl->sl_lock) + xt_lock_mutex(self, sl->sl_lock); + if (sl->sl_data) { + while (sl->sl_usage_count > 0) { + sl->sl_usage_count--; + if (sl->sl_free_func) + (*sl->sl_free_func)(self, sl->sl_thunk, &sl->sl_data[sl->sl_usage_count * sl->sl_item_size]); + } + } + if (sl->sl_lock) + xt_unlock_mutex(self, sl->sl_lock); +} + +xtPublic void xt_free_sortedlist(XTThreadPtr self, XTSortedListPtr sl) +{ + xt_empty_sortedlist(self, sl); + if (sl->sl_data) { + xt_free(self, sl->sl_data); + sl->sl_data = NULL; + } + if (sl->sl_lock) { + xt_free_mutex(sl->sl_lock); + xt_free(self, sl->sl_lock); + } + if (sl->sl_cond) { + xt_free_cond(sl->sl_cond); + xt_free(self, sl->sl_cond); + } + xt_free(self, sl); +} + +xtPublic void *xt_sl_find(XTThreadPtr self, XTSortedListPtr sl, void *key) +{ + void *result; + size_t idx; + + if (sl->sl_usage_count == 0) + return NULL; + else if (sl->sl_usage_count == 1) { + if ((*sl->sl_comp_func)(self, sl->sl_thunk, key, sl->sl_data) == 0) + return sl->sl_data; + return NULL; + } + result = xt_bsearch(self, key, sl->sl_data, 
sl->sl_usage_count, sl->sl_item_size, &idx, sl->sl_thunk, sl->sl_comp_func); + return result; +} + +/* + * Returns: + * 1 = Value inserted. + * 2 = Value not inserted, already in the list. + * 0 = An error occurred. + */ +xtPublic int xt_sl_insert(XTThreadPtr self, XTSortedListPtr sl, void *key, void *data) +{ + size_t idx; + + if (sl->sl_usage_count == 0) + idx = 0; + else if (sl->sl_usage_count == 1) { + int r; + + if ((r = (*sl->sl_comp_func)(self, sl->sl_thunk, key, sl->sl_data)) == 0) { + if (sl->sl_free_func) + (*sl->sl_free_func)(self, sl->sl_thunk, data); + return 2; + } + if (r < 0) + idx = 0; + else + idx = 1; + } + else { + if (xt_bsearch(self, key, sl->sl_data, sl->sl_usage_count, sl->sl_item_size, &idx, sl->sl_thunk, sl->sl_comp_func)) { + if (sl->sl_free_func) + (*sl->sl_free_func)(self, sl->sl_thunk, data); + return 2; + } + } + if (sl->sl_usage_count == sl->sl_current_size) { + if (!xt_realloc_ns((void **) &sl->sl_data, (sl->sl_current_size + sl->sl_grow_size) * sl->sl_item_size)) { + if (sl->sl_free_func) + (*sl->sl_free_func)(self, sl->sl_thunk, data); + if (self) + xt_throw(self); + return 0; + } + sl->sl_current_size = sl->sl_current_size + sl->sl_grow_size; + } + XT_MEMMOVE(sl->sl_data, &sl->sl_data[(idx+1) * sl->sl_item_size], &sl->sl_data[idx * sl->sl_item_size], (sl->sl_usage_count-idx) * sl->sl_item_size); + XT_MEMCPY(sl->sl_data, &sl->sl_data[idx * sl->sl_item_size], data, sl->sl_item_size); + sl->sl_usage_count++; + return 1; +} + +xtPublic xtBool xt_sl_delete(XTThreadPtr self, XTSortedListPtr sl, void *key) +{ + void *result; + size_t idx; + + if (sl->sl_usage_count == 0) + return FALSE; + if (sl->sl_usage_count == 1) { + if ((*sl->sl_comp_func)(self, sl->sl_thunk, key, sl->sl_data) != 0) + return FALSE; + idx = 0; + result = sl->sl_data; + } + else { + if (!(result = xt_bsearch(self, key, sl->sl_data, sl->sl_usage_count, sl->sl_item_size, &idx, sl->sl_thunk, sl->sl_comp_func))) + return FALSE; + } + if (sl->sl_free_func) + 
(*sl->sl_free_func)(self, sl->sl_thunk, result); + sl->sl_usage_count--; + XT_MEMMOVE(sl->sl_data, &sl->sl_data[idx * sl->sl_item_size], &sl->sl_data[(idx+1) * sl->sl_item_size], (sl->sl_usage_count-idx) * sl->sl_item_size); + return TRUE; +} + +xtPublic void xt_sl_delete_item_at(struct XTThread *self, XTSortedListPtr sl, size_t idx) +{ + void *result; + + if (idx >= sl->sl_usage_count) + return; + result = &sl->sl_data[idx * sl->sl_item_size]; + if (sl->sl_free_func) + (*sl->sl_free_func)(self, sl->sl_thunk, result); + sl->sl_usage_count--; + XT_MEMMOVE(sl->sl_data, &sl->sl_data[idx * sl->sl_item_size], &sl->sl_data[(idx+1) * sl->sl_item_size], (sl->sl_usage_count-idx) * sl->sl_item_size); +} + +xtPublic void xt_sl_remove_from_front(struct XTThread *self __attribute__((unused)), XTSortedListPtr sl, size_t items) +{ + if (sl->sl_usage_count <= items) + xt_sl_set_size(sl, 0); + else { + XT_MEMMOVE(sl->sl_data, sl->sl_data, &sl->sl_data[items * sl->sl_item_size], (sl->sl_usage_count-items) * sl->sl_item_size); + sl->sl_usage_count -= items; + } +} + +xtPublic void xt_sl_delete_from_info(XTThreadPtr self, XTSortedListInfoPtr li_undo) +{ + xt_sl_delete(self, li_undo->li_sl, li_undo->li_key); +} + +xtPublic size_t xt_sl_get_size(XTSortedListPtr sl) +{ + return sl->sl_usage_count; +} + +xtPublic void xt_sl_set_size(XTSortedListPtr sl, size_t new_size) +{ + sl->sl_usage_count = new_size; + if (sl->sl_usage_count + sl->sl_grow_size <= sl->sl_current_size) { + size_t curr_size; + + curr_size = sl->sl_usage_count; + if (curr_size < sl->sl_grow_size) + curr_size = sl->sl_grow_size; + + if (xt_realloc(NULL, (void **) &sl->sl_data, curr_size * sl->sl_item_size)) + sl->sl_current_size = curr_size; + } +} + +xtPublic void *xt_sl_item_at(XTSortedListPtr sl, size_t idx) +{ + if (idx < sl->sl_usage_count) + return &sl->sl_data[idx * sl->sl_item_size]; + return NULL; +} + +xtPublic void *xt_sl_last_item(XTSortedListPtr sl) +{ + if (sl->sl_usage_count > 0) + return xt_sl_item_at(sl, 
sl->sl_usage_count - 1); + return NULL; +} + +xtPublic void *xt_sl_first_item(XTSortedListPtr sl) +{ + if (sl->sl_usage_count > 0) + return xt_sl_item_at(sl, 0); + return NULL; +} + +xtPublic xtBool xt_sl_lock(XTThreadPtr self, XTSortedListPtr sl) +{ + xtBool r = OK; + + if (sl->sl_locker != self) + r = xt_lock_mutex(self, sl->sl_lock); + if (r) { + sl->sl_locker = self; + sl->sl_lock_count++; + } + return r; +} + +xtPublic void xt_sl_unlock(XTThreadPtr self, XTSortedListPtr sl) +{ + ASSERT(!self || sl->sl_locker == self); + ASSERT(sl->sl_lock_count > 0); + + sl->sl_lock_count--; + if (!sl->sl_lock_count) { + sl->sl_locker = NULL; + xt_unlock_mutex(self, sl->sl_lock); + } +} + +xtPublic void xt_sl_lock_ns(XTSortedListPtr sl, XTThreadPtr thread) +{ + if (sl->sl_locker != thread) + xt_lock_mutex_ns(sl->sl_lock); + sl->sl_locker = thread; + sl->sl_lock_count++; +} + +xtPublic void xt_sl_unlock_ns(XTSortedListPtr sl) +{ + ASSERT_NS(!sl->sl_locker || sl->sl_locker == xt_get_self()); + ASSERT_NS(sl->sl_lock_count > 0); + + sl->sl_lock_count--; + if (!sl->sl_lock_count) { + sl->sl_locker = NULL; + xt_unlock_mutex_ns(sl->sl_lock); + } +} + +xtPublic void xt_sl_wait(XTThreadPtr self, XTSortedListPtr sl) +{ + xt_wait_cond(self, sl->sl_cond, sl->sl_lock); +} + +xtPublic xtBool xt_sl_signal(XTThreadPtr self, XTSortedListPtr sl) +{ + return xt_signal_cond(self, sl->sl_cond); +} + +xtPublic void xt_sl_broadcast(XTThreadPtr self, XTSortedListPtr sl) +{ + xt_broadcast_cond(self, sl->sl_cond); +} + diff --git a/storage/pbxt/src/sortedlist_xt.h b/storage/pbxt/src/sortedlist_xt.h new file mode 100644 index 00000000000..cf3066981fe --- /dev/null +++ b/storage/pbxt/src/sortedlist_xt.h @@ -0,0 +1,79 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, 
or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-04 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_sortedlist_h__ +#define __xt_sortedlist_h__ + +#include "pthread_xt.h" +#include "bsearch_xt.h" + +struct XTThread; + +typedef struct XTSortedList { + u_int sl_item_size; + u_int sl_grow_size; + XTCompareFunc sl_comp_func; + void *sl_thunk; + XTFreeFunc sl_free_func; + xt_mutex_type *sl_lock; + struct XTThread *sl_locker; + u_int sl_lock_count; + xt_cond_type *sl_cond; + + u_int sl_current_size; + u_int sl_usage_count; + char *sl_data; +} XTSortedListRec, *XTSortedListPtr; + +typedef struct XTSortedListInfo { + XTSortedListPtr li_sl; + void *li_key; +} XTSortedListInfoRec, *XTSortedListInfoPtr; + +XTSortedListPtr xt_new_sortedlist(struct XTThread *self, u_int item_size, u_int initial_size, u_int grow_size, XTCompareFunc comp_func, void *thunk, XTFreeFunc free_func, xtBool with_lock, xtBool with_cond); +void xt_init_sortedlist(struct XTThread *self, XTSortedListPtr sl, u_int item_size, u_int initial_size, u_int grow_size, XTCompareFunc comp_func, void *thunk, XTFreeFunc free_func, xtBool with_lock, xtBool with_cond); +void xt_free_sortedlist(struct XTThread *self, XTSortedListPtr ld); +void xt_empty_sortedlist(struct XTThread *self, XTSortedListPtr sl); +XTSortedListPtr xt_new_sortedlist_ns(u_int item_size, u_int grow_size, XTCompareFunc comp_func, void *thunk, XTFreeFunc free_func); + +xtBool xt_sl_insert(struct XTThread *self, XTSortedListPtr sl, void *key, void *data); +void *xt_sl_find(struct 
XTThread *self, XTSortedListPtr sl, void *key); +xtBool xt_sl_delete(struct XTThread *self, XTSortedListPtr sl, void *key); +void xt_sl_delete_item_at(struct XTThread *self, XTSortedListPtr sl, size_t i); +void xt_sl_remove_from_front(struct XTThread *self, XTSortedListPtr sl, size_t items); +void xt_sl_delete_from_info(struct XTThread *self, XTSortedListInfoPtr li); +size_t xt_sl_get_size(XTSortedListPtr sl); +void xt_sl_set_size(XTSortedListPtr sl, size_t new_size); +void *xt_sl_item_at(XTSortedListPtr sl, size_t i); +void *xt_sl_last_item(XTSortedListPtr sl); +void *xt_sl_first_item(XTSortedListPtr sl); + +xtBool xt_sl_lock(struct XTThread *self, XTSortedListPtr sl); +void xt_sl_unlock(struct XTThread *self, XTSortedListPtr sl); +void xt_sl_lock_ns(XTSortedListPtr sl, struct XTThread *thread); +void xt_sl_unlock_ns(XTSortedListPtr sl); + +void xt_sl_wait(struct XTThread *self, XTSortedListPtr sl); +xtBool xt_sl_signal(struct XTThread *self, XTSortedListPtr sl); +void xt_sl_broadcast(struct XTThread *self, XTSortedListPtr sl); + +#endif diff --git a/storage/pbxt/src/streaming_xt.cc b/storage/pbxt/src/streaming_xt.cc new file mode 100755 index 00000000000..2ce263c7b31 --- /dev/null +++ b/storage/pbxt/src/streaming_xt.cc @@ -0,0 +1,623 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-06-07 Paul McCullagh + * + * H&G2JCtL + * + * This file contains PBXT streaming interface. + */ + +#include "xt_config.h" + +#ifdef XT_STREAMING +#include "ha_pbxt.h" + +#include "thread_xt.h" +#include "strutil_xt.h" +#include "table_xt.h" +#include "myxt_xt.h" +#include "xaction_xt.h" +#include "database_xt.h" +#include "streaming_xt.h" + +extern PBMSEngineRec pbxt_engine; + +static PBMS_API pbxt_streaming; + +/* ---------------------------------------------------------------------- + * INIT & EXIT + */ + +xtPublic xtBool xt_init_streaming(void) +{ + XTThreadPtr self = NULL; + int err; + PBMSResultRec result; + + if ((err = pbxt_streaming.registerEngine(&pbxt_engine, &result))) { + xt_logf(XT_CONTEXT, XT_LOG_ERROR, "%s\n", result.mr_message); + return FAILED; + } + return OK; +} + +xtPublic void xt_exit_streaming(void) +{ + pbxt_streaming.deregisterEngine(&pbxt_engine); +} + +/* ---------------------------------------------------------------------- + * UTILITY FUNCTIONS + */ + +static void str_result_to_exception(XTExceptionPtr e, int r, PBMSResultPtr result) +{ + char *str, *end_str; + + e->e_xt_err = r; + e->e_sys_err = result->mr_code; + xt_strcpy(XT_ERR_MSG_SIZE, e->e_err_msg, result->mr_message); + + e->e_source_line = 0; + str = result->mr_stack; + if ((end_str = strchr(str, '('))) { + xt_strcpy_term(XT_MAX_FUNC_NAME_SIZE, e->e_func_name, str, '('); + str = end_str+1; + if ((end_str = strchr(str, ':'))) { + xt_strcpy_term(XT_SOURCE_FILE_NAME_SIZE, e->e_source_file, str, ':'); + str = end_str+1; + if ((end_str = strchr(str, ')'))) { + char number[40]; + + xt_strcpy_term(40, number, str, ')'); + e->e_source_line = atol(number); + str = end_str+1; + if (*str == '\n') + str++; + } + } + } + + if (e->e_source_line == 0) { + *e->e_func_name 
= 0; + *e->e_source_file = 0; + xt_strcpy(XT_ERR_MSG_SIZE, e->e_catch_trace, result->mr_stack); + } + else + xt_strcpy(XT_ERR_MSG_SIZE, e->e_catch_trace, str); +} + +static void str_exception_to_result(XTExceptionPtr e, PBMSResultPtr result) +{ + int len; + + if (e->e_sys_err) + result->mr_code = e->e_sys_err; + else + result->mr_code = e->e_xt_err; + xt_strcpy(MS_RESULT_MESSAGE_SIZE, result->mr_message, e->e_err_msg); + xt_strcpy(MS_RESULT_STACK_SIZE, result->mr_stack, e->e_func_name); + xt_strcat(MS_RESULT_STACK_SIZE, result->mr_stack, "("); + xt_strcat(MS_RESULT_STACK_SIZE, result->mr_stack, e->e_source_file); + xt_strcat(MS_RESULT_STACK_SIZE, result->mr_stack, ":"); + xt_strcati(MS_RESULT_STACK_SIZE, result->mr_stack, (int) e->e_source_line); + xt_strcat(MS_RESULT_STACK_SIZE, result->mr_stack, ")"); + len = strlen(result->mr_stack); + if (strncmp(result->mr_stack, e->e_catch_trace, len) == 0) + xt_strcat(MS_RESULT_STACK_SIZE, result->mr_stack, e->e_catch_trace + len); + else { + xt_strcat(MS_RESULT_STACK_SIZE, result->mr_stack, "\n"); + xt_strcat(MS_RESULT_STACK_SIZE, result->mr_stack, e->e_catch_trace); + } +} + +static XTIndexPtr str_find_index(XTTableHPtr tab, u_int *col_list, u_int col_cnt) +{ + u_int i, j; + XTIndexPtr *ind; /* MySQL/PBXT key description */ + + ind = tab->tab_dic.dic_keys; + for (i=0; i<tab->tab_dic.dic_key_count; i++) { + if ((*ind)->mi_seg_count == col_cnt) { + for (j=0; j<(*ind)->mi_seg_count; j++) { + if ((*ind)->mi_seg[j].col_idx != col_list[j]) + goto loop; + } + return *ind; + } + + loop: + ind++; + } + return NULL; +} + +static XTThreadPtr str_set_current_thread(THD *thd, PBMSResultPtr result) +{ + XTThreadPtr self; + XTExceptionRec e; + + if (!(self = xt_ha_set_current_thread(thd, &e))) { + str_exception_to_result(&e, result); + return NULL; + } + return self; +} + +/* ---------------------------------------------------------------------- + * BLOB STREAMING INTERFACE + */ + +static void pbxt_close_conn(void *thread) +{ + 
xt_ha_close_connection((THD *) thread); +} + +static int pbxt_open_table(void *thread, const char *table_url, void **open_table, PBMSResultPtr result) +{ + THD *thd = (THD *) thread; + XTThreadPtr self; + XTTableHPtr tab = NULL; + XTOpenTablePtr ot = NULL; + int err = MS_OK; + + if (!(self = str_set_current_thread(thd, result))) + return MS_ERR_ENGINE; + + try_(a) { + xt_ha_open_database_of_table(self, (XTPathStrPtr) table_url); + if (!(tab = xt_use_table(self, (XTPathStrPtr) table_url, FALSE, TRUE, NULL))) { + err = MS_ERR_UNKNOWN_TABLE; + goto done; + } + if (!(ot = xt_open_table(tab))) + throw_(); + ot->ot_thread = self; + done:; + } + catch_(a) { + str_exception_to_result(&self->t_exception, result); + err = MS_ERR_ENGINE; + } + cont_(a); + if (tab) + xt_heap_release(self, tab); + *open_table = ot; + return err; +} + +static void pbxt_close_table(void *thread, void *open_table_ptr) +{ + THD *thd = (THD *) thread; + volatile XTThreadPtr self; + XTOpenTablePtr ot = (XTOpenTablePtr) open_table_ptr; + XTExceptionRec e; + + if (thd) { + if (!(self = xt_ha_set_current_thread(thd, &e))) { + xt_log_exception(NULL, &e, XT_LOG_DEFAULT); + return; + } + } + else { + if (!(self = xt_create_thread("TempForClose", FALSE, TRUE, &e))) { + xt_log_exception(NULL, &e, XT_LOG_DEFAULT); + return; + } + } + + ot->ot_thread = self; + try_(a) { + xt_close_table(ot, TRUE, FALSE); + } + catch_(a) { + xt_log_and_clear_exception(self); + } + cont_(a); + if (!thd) + xt_free_thread(self); +} + +static int pbxt_lock_table(void *thread, int *xact, void *open_table, int lock_type, PBMSResultPtr result) +{ + THD *thd = (THD *) thread; + XTThreadPtr self; + XTOpenTablePtr ot = (XTOpenTablePtr) open_table; + int err = MS_OK; + + if (!(self = str_set_current_thread(thd, result))) + return MS_ERR_ENGINE; + + if (lock_type != MS_LOCK_NONE) { + try_(a) { + xt_ha_open_database_of_table(self, ot->ot_table->tab_name); + ot->ot_thread = self; + } + catch_(a) { + 
str_exception_to_result(&self->t_exception, result); + err = MS_ERR_ENGINE; + } + cont_(a); + } + + if (!err && *xact == MS_XACT_BEGIN) { + if (self->st_xact_data) + *xact = MS_XACT_NONE; + else { + if (xt_xn_begin(self)) { + *xact = MS_XACT_COMMIT; + } + else { + str_exception_to_result(&self->t_exception, result); + err = MS_ERR_ENGINE; + } + } + } + + return err; +} + +static int pbxt_unlock_table(void *thread, int xact, void *open_table __attribute__((unused)), PBMSResultPtr result) +{ + THD *thd = (THD *) thread; + XTThreadPtr self = xt_ha_thd_to_self(thd); + int err = MS_OK; + + if (xact == MS_XACT_COMMIT) { + if (!xt_xn_commit(self)) { + str_exception_to_result(&self->t_exception, result); + err = MS_ERR_ENGINE; + } + } + else if (xact == MS_XACT_ROLLBACK) { + xt_xn_rollback(self); + } + + return err; +} + +static int pbxt_send_blob(void *thread, void *open_table, const char *blob_column, const char *blob_url_p, void *stream, PBMSResultPtr result) +{ + THD *thd = (THD *) thread; + XTThreadPtr self = xt_ha_thd_to_self(thd); + XTOpenTablePtr ot = (XTOpenTablePtr) open_table; + int err = MS_OK; + u_int blob_col_idx, col_idx; + char col_name[XT_IDENTIFIER_NAME_SIZE]; + XTStringBufferRec value; + u_int col_list[XT_MAX_COLS_PER_INDEX]; + u_int col_cnt; + char col_names[XT_ERR_MSG_SIZE - 200]; + XTIdxSearchKeyRec search_key; + XTIndexPtr ind; + char *blob_data; + size_t blob_len; + const char *blob_url = blob_url_p; + + memset(&value, 0, sizeof(value)); + + *col_names = 0; + + ot->ot_thread = self; + try_(a) { + if (ot->ot_row_wbuf_size < ot->ot_table->tab_dic.dic_buf_size) { + xt_realloc(self, (void **) &ot->ot_row_wbuffer, ot->ot_table->tab_dic.dic_buf_size); + ot->ot_row_wbuf_size = ot->ot_table->tab_dic.dic_buf_size; + } + + xt_strcpy_url(XT_IDENTIFIER_NAME_SIZE, col_name, blob_column); + if (!myxt_find_column(ot, &blob_col_idx, col_name)) + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_COLUMN_NOT_FOUND, ot->ot_table->tab_name, blob_column); + + /* Prepare a row for 
the condition: */ + const char *ptr; + + col_cnt = 0; + while (*blob_url) { + ptr = xt_strchr(blob_url, '='); + xt_strncpy_url(XT_IDENTIFIER_NAME_SIZE, col_name, blob_url, (size_t) (ptr - blob_url)); + if (!myxt_find_column(ot, &col_idx, col_name)) + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_COLUMN_NOT_FOUND, ot->ot_table->tab_name, col_name); + if (*col_names) + xt_strcat(sizeof(col_names), col_names, ", "); + xt_strcat(sizeof(col_names), col_names, col_name); + blob_url = ptr; + if (*blob_url == '=') + blob_url++; + ptr = xt_strchr(blob_url, '&'); + value.sb_len = 0; + xt_sb_concat_url_len(self, &value, blob_url, (size_t) (ptr - blob_url)); + blob_url = ptr; + if (*blob_url == '&') + blob_url++; + if (!myxt_set_column(ot, (char *) ot->ot_row_rbuffer, col_idx, value.sb_cstring, value.sb_len)) + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_CONVERSION, ot->ot_table->tab_name, col_name); + if (col_cnt < XT_MAX_COLS_PER_INDEX) { + col_list[col_cnt] = col_idx; + col_cnt++; + } + } + + /* Find a matching index: */ + if (!(ind = str_find_index(ot->ot_table, col_list, col_cnt))) + xt_throw_ixterr(XT_CONTEXT, XT_ERR_NO_MATCHING_INDEX, col_names); + + search_key.sk_key_value.sv_flags = 0; + search_key.sk_key_value.sv_rec_id = 0; + search_key.sk_key_value.sv_row_id = 0; + search_key.sk_key_value.sv_key = search_key.sk_key_buf; + search_key.sk_key_value.sv_length = myxt_create_key_from_row(ind, search_key.sk_key_buf, ot->ot_row_rbuffer, NULL); + search_key.sk_on_key = FALSE; + + if (!xt_idx_search(ot, ind, &search_key)) + xt_throw(self); + + if (!ot->ot_curr_rec_id) + xt_throw_taberr(XT_CONTEXT, XT_ERR_NO_ROWS, ot->ot_table->tab_name); + + while (ot->ot_curr_rec_id) { + if (!search_key.sk_on_key) + xt_throw_taberr(XT_CONTEXT, XT_ERR_NO_ROWS, ot->ot_table->tab_name); + + retry: + /* X TODO - Check if the write buffer is big enough here! 
*/ + switch (xt_tab_read_record(ot, ot->ot_row_wbuffer)) { + case FALSE: + if (xt_idx_next(ot, ind, &search_key)) + break; + case XT_ERR: + xt_throw(self); + case XT_NEW: + if (xt_idx_match_search(ot, ind, &search_key, ot->ot_row_wbuffer, XT_S_MODE_MATCH)) + goto success; + if (!xt_idx_next(ot, ind, &search_key)) + xt_throw(self); + break; + case XT_RETRY: + goto retry; + default: + goto success; + } + } + + success: + myxt_get_column_data(ot, (char *) ot->ot_row_wbuffer, blob_col_idx, &blob_data, &blob_len); + + /* + * Write the content length, then write the HTTP + * header, and then the content. + */ + err = pbxt_streaming.setContentLength(stream, blob_len, result); + if (!err) + err = pbxt_streaming.writeHead(stream, result); + if (!err) + err = pbxt_streaming.writeStream(stream, (void *) blob_data, blob_len, result); + } + catch_(a) { + str_exception_to_result(&self->t_exception, result); + if (result->mr_code == XT_ERR_NO_ROWS) + err = MS_ERR_NOT_FOUND; + else + err = MS_ERR_ENGINE; + } + cont_(a); + xt_sb_set_size(NULL, &value, 0); + return err; +} + +int pbxt_lookup_ref(void *thread, void *open_table, unsigned short col_index, PBMSEngineRefPtr eng_ref, PBMSFieldRefPtr field_ref, PBMSResultPtr result) +{ + THD *thd = (THD *) thread; + XTThreadPtr self = xt_ha_thd_to_self(thd); + XTOpenTablePtr ot = (XTOpenTablePtr) open_table; + int err = MS_OK; + u_int i, len; + char *data; + XTIndexPtr ind = NULL; + + ot->ot_thread = self; + if (ot->ot_row_wbuf_size < ot->ot_table->tab_dic.dic_buf_size) { + xt_realloc(self, (void **) &ot->ot_row_wbuffer, ot->ot_table->tab_dic.dic_buf_size); + ot->ot_row_wbuf_size = ot->ot_table->tab_dic.dic_buf_size; + } + + ot->ot_curr_rec_id = (xtRecordID) XT_GET_DISK_8(eng_ref->er_data); + switch (xt_tab_dirty_read_record(ot, ot->ot_row_wbuffer)) { + case FALSE: + err = MS_ERR_ENGINE; + break; + default: + break; + } + + if (err) { + str_exception_to_result(&self->t_exception, result); + goto exit; + } + + myxt_get_column_name(ot, 
col_index, PBMS_FIELD_COL_SIZE, field_ref->fr_column); + + for (i=0; i<ot->ot_table->tab_dic.dic_key_count; i++) { + ind = ot->ot_table->tab_dic.dic_keys[i]; + if (ind->mi_flags & (HA_UNIQUE_CHECK | HA_NOSAME)) + break; + } + + if (ind) { + len = 0; + data = field_ref->fr_cond; + for (i=0; i<ind->mi_seg_count; i++) { + if (i > 0) { + xt_strcat(PBMS_FIELD_COND_SIZE, data, "&"); + len = strlen(data); + } + myxt_get_column_name(ot, ind->mi_seg[i].col_idx, PBMS_FIELD_COND_SIZE - len, data + len); + len = strlen(data); + xt_strcat(PBMS_FIELD_COND_SIZE, data, "="); + len = strlen(data); + myxt_get_column_as_string(ot, (char *) ot->ot_row_wbuffer, ind->mi_seg[i].col_idx, PBMS_FIELD_COND_SIZE - len, data + len); + len = strlen(data); + } + } + else + xt_strcpy(PBMS_FIELD_COND_SIZE, field_ref->fr_cond, "*no unique key*"); + + exit: + return err; +} + +PBMSEngineRec pbxt_engine = { + MS_ENGINE_VERSION, + 0, + FALSE, + "PBXT", + NULL, + pbxt_close_conn, + pbxt_open_table, + pbxt_close_table, + pbxt_lock_table, + pbxt_unlock_table, + pbxt_send_blob, + pbxt_lookup_ref +}; + +/* ---------------------------------------------------------------------- + * CALL IN FUNCTIONS + */ + +xtPublic void xt_pbms_close_all_tables(const char *table_url) +{ + pbxt_streaming.closeAllTables(table_url); +} + +xtPublic xtBool xt_pbms_close_connection(void *thd, XTExceptionPtr e) +{ + PBMSResultRec result; + int err; + + err = pbxt_streaming.closeConn(thd, &result); + if (err) { + str_result_to_exception(e, err, &result); + return FAILED; + } + return OK; +} + +xtPublic xtBool xt_pbms_open_table(void **open_table, char *table_path) +{ + PBMSResultRec result; + int err; + + err = pbxt_streaming.openTable(open_table, table_path, &result); + if (err) { + XTThreadPtr thread = xt_get_self(); + + str_result_to_exception(&thread->t_exception, err, &result); + return FAILED; + } + return OK; +} + +xtPublic void xt_pbms_close_table(void *open_table) +{ + PBMSResultRec result; + int err; + + err = 
pbxt_streaming.closeTable(open_table, &result); + if (err) { + XTThreadPtr thread = xt_get_self(); + + str_result_to_exception(&thread->t_exception, err, &result); + xt_log_exception(thread, &thread->t_exception, XT_LOG_DEFAULT); + } +} + +xtPublic xtBool xt_pbms_use_blob(void *open_table, char **ret_blob_url, char *blob_url, unsigned short col_index) +{ + PBMSResultRec result; + int err; + + err = pbxt_streaming.useBlob(open_table, ret_blob_url, blob_url, col_index, &result); + if (err) { + XTThreadPtr thread = xt_get_self(); + + str_result_to_exception(&thread->t_exception, err, &result); + return FAILED; + } + return OK; +} + +xtPublic xtBool xt_pbms_retain_blobs(void *open_table, PBMSEngineRefPtr eng_ref) +{ + PBMSResultRec result; + int err; + + err = pbxt_streaming.retainBlobs(open_table, eng_ref, &result); + if (err) { + XTThreadPtr thread = xt_get_self(); + + str_result_to_exception(&thread->t_exception, err, &result); + return FAILED; + } + return OK; +} + +xtPublic void xt_pbms_release_blob(void *open_table, char *blob_url, unsigned short col_index, PBMSEngineRefPtr eng_ref) +{ + PBMSResultRec result; + int err; + + err = pbxt_streaming.releaseBlob(open_table, blob_url, col_index, eng_ref, &result); + if (err) { + XTThreadPtr thread = xt_get_self(); + + str_result_to_exception(&thread->t_exception, err, &result); + xt_log_exception(thread, &thread->t_exception, XT_LOG_DEFAULT); + } +} + +xtPublic void xt_pbms_drop_table(const char *table_path) +{ + PBMSResultRec result; + int err; + + err = pbxt_streaming.dropTable(table_path, &result); + if (err) { + XTThreadPtr thread = xt_get_self(); + + str_result_to_exception(&thread->t_exception, err, &result); + xt_log_exception(thread, &thread->t_exception, XT_LOG_DEFAULT); + } +} + +xtPublic void xt_pbms_rename_table(const char *from_table, const char *to_table) +{ + PBMSResultRec result; + int err; + + err = pbxt_streaming.renameTable(from_table, to_table, &result); + if (err) { + XTThreadPtr thread = 
xt_get_self(); + + str_result_to_exception(&thread->t_exception, err, &result); + xt_log_exception(thread, &thread->t_exception, XT_LOG_DEFAULT); + } +} + +#endif // XT_STREAMING diff --git a/storage/pbxt/src/streaming_xt.h b/storage/pbxt/src/streaming_xt.h new file mode 100755 index 00000000000..6fe36822383 --- /dev/null +++ b/storage/pbxt/src/streaming_xt.h @@ -0,0 +1,46 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-06-07 Paul McCullagh + * + * H&G2JCtL + * + * This file contains PBXT streaming interface. 
+ */ + +#ifndef __streaming_xt_h__ +#define __streaming_xt_h__ + +#include "xt_defs.h" +#define PBMS_API pbms_api_PBXT +#include "pbms.h" + +xtBool xt_init_streaming(void); +void xt_exit_streaming(void); + +void xt_pbms_close_all_tables(const char *table_url); +xtBool xt_pbms_close_connection(void *thd, XTExceptionPtr e); +xtBool xt_pbms_open_table(void **open_table, char *table_path); +void xt_pbms_close_table(void *open_table); +xtBool xt_pbms_use_blob(void *open_table, char **ret_blob_url, char *blob_url, unsigned short col_index); +xtBool xt_pbms_retain_blobs(void *open_table, PBMSEngineRefPtr eng_ref); +void xt_pbms_release_blob(void *open_table, char *blob_url, unsigned short col_index, PBMSEngineRefPtr eng_ref); +void xt_pbms_drop_table(const char *table_path); +void xt_pbms_rename_table(const char *from_table, const char *to_table); + +#endif diff --git a/storage/pbxt/src/strutil_xt.cc b/storage/pbxt/src/strutil_xt.cc new file mode 100644 index 00000000000..60e45c455d1 --- /dev/null +++ b/storage/pbxt/src/strutil_xt.cc @@ -0,0 +1,567 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-03 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <stdio.h> +#include <string.h> +#include <ctype.h> + +#include "strutil_xt.h" + +xtPublic void xt_strcpy(size_t size, char *to, c_char *from) +{ + if (size > 0) { + size--; + while (*from && size--) + *to++ = *from++; + *to = 0; + } +} + +xtPublic void xt_strncpy(size_t size, char *to, c_char *from, size_t len_from) +{ + if (size > 0) { + size--; + while (len_from-- && size--) + *to++ = *from++; + *to = 0; + } +} + +xtPublic void xt_strcpy_term(size_t size, char *to, c_char *from, char term) +{ + if (size > 0) { + size--; + while (*from && *from != term && size--) + *to++ = *from++; + *to = 0; + } +} + +xtPublic void xt_strcat_term(size_t size, char *to, c_char *from, char term) +{ + while (*to && size--) to++; + if (size > 0) { + size--; + while (*from && *from != term && size--) + *to++ = *from++; + *to = 0; + } +} + +xtPublic void xt_strcat(size_t size, char *to, c_char *from) +{ + while (*to && size--) to++; + xt_strcpy(size, to, from); +} + +xtPublic void xt_strcati(size_t size, char *to, int i) +{ + char buffer[50]; + + sprintf(buffer, "%d", i); + xt_strcat(size, to, buffer); +} + +xtPublic xtBool xt_ends_with(c_char *str, c_char *sub) +{ + unsigned long len = strlen(str); + + if (len >= strlen(sub)) + return strcmp(&str[len-strlen(sub)], sub) == 0; + return FALSE; +} + +xtPublic xtBool xt_starts_with(c_char *str, c_char *sub) +{ + return (strstr(str, sub) == str); +} + +/* This function returns "" if the path ends with a dir char */ +xtPublic void xt_2nd_last_name_of_path(size_t size, char *dest, c_char *path) +{ + size_t len; + c_char *ptr, *pend; + + len = strlen(path); + if (!len) { + *dest = 0; + return; + } + ptr = path + len - 1; + while 
(ptr != path && !XT_IS_DIR_CHAR(*ptr)) + ptr--; + if (!XT_IS_DIR_CHAR(*ptr)) { + *dest = 0; + return; + } + pend = ptr; + ptr--; + while (ptr != path && !XT_IS_DIR_CHAR(*ptr)) + ptr--; + if (XT_IS_DIR_CHAR(*ptr)) + ptr++; + len = (size_t) (pend - ptr); + if (len > size-1) + len = size-1; + memcpy(dest, ptr, len); + dest[len] = 0; +} + +/* This function returns "" if the path ends with a dir char */ +xtPublic char *xt_last_name_of_path(c_char *path) +{ + size_t length; + c_char *ptr; + + length = strlen(path); + if (!length) + return (char *) path; + ptr = path + length - 1; + while (ptr != path && !XT_IS_DIR_CHAR(*ptr)) ptr--; + if (XT_IS_DIR_CHAR(*ptr)) ptr++; + return (char *) ptr; +} + +xtPublic char *xt_last_2_names_of_path(c_char *path) +{ + size_t length; + c_char *ptr; + + length = strlen(path); + if (!length) + return (char *) path; + ptr = path + length - 1; + while (ptr != path && !XT_IS_DIR_CHAR(*ptr)) ptr--; + if (XT_IS_DIR_CHAR(*ptr)) { + ptr--; + while (ptr != path && !XT_IS_DIR_CHAR(*ptr)) ptr--; + if (XT_IS_DIR_CHAR(*ptr)) + ptr++; + } + return (char *) ptr; +} + +xtPublic c_char *xt_last_directory_of_path(c_char *path) +/* This function returns the last name component, even if the path ends with a dir char */ +{ + size_t length; + c_char *ptr; + + length = strlen(path); + if (!length) + return(path); + ptr = path + length - 1; + /* Path may end with multiple slashes: */ + while (ptr != path && XT_IS_DIR_CHAR(*ptr)) + ptr--; + while (ptr != path && !XT_IS_DIR_CHAR(*ptr)) + ptr--; + if (XT_IS_DIR_CHAR(*ptr)) ptr++; + return(ptr); +} + +xtPublic char *xt_find_extension(c_char *file_name) +{ + c_char *ptr; + + for (ptr = file_name + strlen(file_name) - 1; ptr >= file_name; ptr--) { + if (XT_IS_DIR_CHAR(*ptr)) + break; + if (*ptr == '.') + return (char *) (ptr + 1); + } + return NULL; +} + +xtPublic void xt_remove_extension(char *file_name) +{ + char *ptr = xt_find_extension(file_name); + + if (ptr) + *(ptr - 1) = 0; +} + +xtPublic xtBool 
xt_is_extension(c_char *file_name, c_char *ext) +{ + char *ptr; + + if (!(ptr = xt_find_extension(file_name))) + return FALSE; + return strcmp(ptr, ext) == 0; +} + +/* + * Remove trailing directory delimiters, if any (if the directory name consists of a single + * character, that character is not removed). + */ +xtPublic xtBool xt_remove_dir_char(char *dir_name) +{ + size_t length; + xtBool removed = FALSE; + + length = strlen(dir_name); + while (length > 1 && XT_IS_DIR_CHAR(dir_name[length - 1])) { + dir_name[length - 1] = '\0'; + length--; + removed = TRUE; + } + return removed; +} + +xtPublic void xt_remove_last_name_of_path(char *path) +{ + char *ptr; + + if ((ptr = xt_last_name_of_path(path))) + *ptr = 0; +} + +xtBool xt_add_dir_char(size_t max, char *path) +{ + size_t slen = strlen(path); + + if (slen >= max) + return FALSE; + + if (slen == 0) { + /* If no path is given we are in the current working directory; under UNIX we must + * NOT add a directory delimiter character: + */ + return FALSE; + } + + if (!XT_IS_DIR_CHAR(path[slen - 1])) { + path[slen] = XT_DIR_CHAR; + path[slen + 1] = '\0'; + return TRUE; + } + return FALSE; +} + +xtPublic xtInt8 xt_str_to_int8(c_char *ptr, xtBool *overflow) +{ + xtInt8 value = 0; + + if (overflow) + *overflow = FALSE; + while (*ptr == '0') ptr++; + if (!*ptr) + value = (xtInt8) 0; + else { + sscanf(ptr, "%"PRId64, &value); + if (!value && overflow) + *overflow = TRUE; + } + return value; +} + +xtPublic void xt_int8_to_str(xtInt8 value, char *string) +{ + sprintf(string, "%"PRId64, value); +} + +xtPublic void xt_double_to_str(double value, int scale, char *string) +{ + char *ptr; + + sprintf(string, "%.*f", scale, value); + ptr = string + strlen(string) - 1; + + if (strchr(string, '.') && (*ptr == '0' || *ptr == '.')) { + while (ptr-1 > string && *(ptr-1) == '0') ptr--; + if (ptr-1 > string && *(ptr-1) == '.') ptr--; + *ptr = 0; + } +} + +/* + * This function understands size suffixes such as GB, MB and KB. 
+ */ +xtPublic xtInt8 xt_byte_size_to_int8(c_char *ptr) +{ + char number[101], *num_ptr; + xtInt8 size; + + while (*ptr && isspace(*ptr)) + ptr++; + + num_ptr = number; + while (*ptr && isdigit(*ptr)) { + if (num_ptr < number+100) { + *num_ptr = *ptr; + num_ptr++; + } + ptr++; + } + *num_ptr = 0; + size = xt_str_to_int8(number, NULL); + + while (*ptr && isspace(*ptr)) + ptr++; + + /* Each case deliberately falls through, so the size is multiplied by 1024 once per unit step. */ + switch (toupper(*ptr)) { + case 'P': + size *= 1024; + case 'T': + size *= 1024; + case 'G': + size *= 1024; + case 'M': + size *= 1024; + case 'K': + size *= 1024; + break; + } + + return size; +} + +xtPublic void xt_int8_to_byte_size(xtInt8 value, char *string) +{ + double v; + c_char *unit; + char val_str[100]; + + if (value >= (xtInt8) (1024 * 1024 * 1024)) { + v = (double) value / (double) (1024 * 1024 * 1024); + unit = "GB"; + } + else if (value >= (xtInt8) (1024 * 1024)) { + v = (double) value / (double) (1024 * 1024); + unit = "MB"; + } + else if (value >= (xtInt8) 1024) { + v = (double) value / (double) (1024); + unit = "KB"; + } + else { + v = (double) value; + unit = "bytes"; + } + + xt_double_to_str(v, 2, val_str); + sprintf(string, "%s %s (%"PRId64" bytes)", val_str, unit, value); +} + +xtPublic c_char *xt_get_version(void) +{ + return "1.0.08 RC"; +} + +/* Copy and URL decode! */ +xtPublic void xt_strcpy_url(size_t size, char *to, c_char *from) +{ + if (size > 0) { + size--; + while (*from && size--) { + if (*from == '%' && isxdigit(*(from+1)) && isxdigit(*(from+2))) { + unsigned char a = xt_hex_digit(*(from+1)); + unsigned char b = xt_hex_digit(*(from+2)); + *to++ = a << 4 | b; + from += 3; + } + else + *to++ = *from++; + } + *to = 0; + } +} + +/* Copy and URL decode! 
*/ +xtPublic void xt_strncpy_url(size_t size, char *to, c_char *from, size_t len_from) +{ + if (size > 0) { + size--; + while (len_from-- && size--) { + if (*from == '%' && len_from >= 2 && isxdigit(*(from+1)) && isxdigit(*(from+2))) { + unsigned char a = xt_hex_digit(*(from+1)); + unsigned char b = xt_hex_digit(*(from+2)); + *to++ = a << 4 | b; + from += 3; + /* The escape consumed two more source characters than the loop counted. */ + len_from -= 2; + } + else + *to++ = *from++; + } + *to = 0; + } +} + +/* Returns a pointer to the end of the string if nothing found! */ +const char *xt_strchr(const char *str, char ch) +{ + while (*str && *str != ch) str++; + return str; +} + +unsigned char xt_hex_digit(char ch) +{ + if (isdigit(ch)) + return((unsigned char) ch - (unsigned char) '0'); + + ch = toupper(ch); + if (ch >= 'A' && ch <= 'F') + return((unsigned char) ch - (unsigned char) 'A' + (unsigned char) 10); + + return((unsigned char) 0); +} + +#ifdef XT_WIN +xtPublic void xt_win_dialog(char *message) +{ + MessageBoxA(NULL, message, "Debug Me!", MB_ICONWARNING | MB_OK); +} +#endif + +/* + * --------------- SYSTEM STATISTICS ------------------ + */ + +static char su_t_unit[10] = "usec"; +/* + * Note: times are returned in microseconds, but the display in xtstat is currently + * in milliseconds. 
+ */ +static XTStatMetaDataRec pbxt_stat_meta_data[XT_STAT_MAXIMUM] = { + { XT_STAT_TIME_CURRENT, "Current Time", "time", "curr", XT_STAT_DATE, + "The current time in seconds" }, + { XT_STAT_TIME_PASSED, "Time Since Last Call", "time", su_t_unit, XT_STAT_ACCUMULATIVE | XT_STAT_TIME_VALUE, + "Time passed in %sseconds since last statistics call" }, + + { XT_STAT_COMMITS, "Commit Count", "xact", "commt", XT_STAT_ACCUMULATIVE, + "Number of transactions committed" }, + { XT_STAT_ROLLBACKS, "Rollback Count", "xact", "rollb", XT_STAT_ACCUMULATIVE, + "Number of transactions rolled back" }, + { XT_STAT_WAIT_FOR_XACT, "Wait for Xact Count", "xact", "waits", XT_STAT_ACCUMULATIVE, + "Number of times waited for another transaction" }, + { XT_STAT_XACT_TO_CLEAN, "Dirty Xact Count", "xact", "dirty", 0, + "Number of transactions still to be cleaned up" }, + + { XT_STAT_STAT_READS, "Read Statements", "stat", "read", XT_STAT_ACCUMULATIVE, + "Number of SELECT statements" }, + { XT_STAT_STAT_WRITES, "Write Statements", "stat", "write", XT_STAT_ACCUMULATIVE, + "Number of UPDATE/INSERT/DELETE statements" }, + + { XT_STAT_REC_BYTES_IN, "Record Bytes Read", "rec", "in", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes read from the record/row files" }, + { XT_STAT_REC_BYTES_OUT, "Record Bytes Written", "rec", "out", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes written to the record/row files" }, + { XT_STAT_REC_SYNC_COUNT, "Record File Flushes", "rec", "syncs", XT_STAT_ACCUMULATIVE | XT_STAT_COMBO_FIELD, + "Number of flushes to record/row files" }, + { XT_STAT_REC_SYNC_TIME, "Record Flush Time", "rec", su_t_unit, XT_STAT_ACCUMULATIVE | XT_STAT_TIME_VALUE | XT_STAT_COMBO_FIELD_2, + "The time in %sseconds to flush record/row files" }, + { XT_STAT_REC_CACHE_HIT, "Record Cache Hits", "rec", "hits", XT_STAT_ACCUMULATIVE, + "Hits when accessing the record cache" }, + { XT_STAT_REC_CACHE_MISS, "Record Cache Misses", "rec", "miss", XT_STAT_ACCUMULATIVE, + "Misses when accessing the 
record cache" }, + { XT_STAT_REC_CACHE_FREES, "Record Cache Frees", "rec", "frees", XT_STAT_ACCUMULATIVE, + "Number of record cache pages freed" }, + { XT_STAT_REC_CACHE_USAGE, "Record Cache Usage", "rec", "%use", XT_STAT_PERCENTAGE, + "Percentage of record cache in use" }, + + { XT_STAT_IND_BYTES_IN, "Index Bytes Read", "ind", "in", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes read from the index files" }, + { XT_STAT_IND_BYTES_OUT, "Index Bytes Written", "ind", "out", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes written to the index files" }, + { XT_STAT_IND_SYNC_COUNT, "Index File Flushes", "ind", "syncs", XT_STAT_ACCUMULATIVE | XT_STAT_COMBO_FIELD, + "Number of flushes to index files" }, + { XT_STAT_IND_SYNC_TIME, "Index Flush Time", "ind", su_t_unit, XT_STAT_ACCUMULATIVE | XT_STAT_TIME_VALUE | XT_STAT_COMBO_FIELD_2, + "The time in %sseconds to flush index files" }, + { XT_STAT_IND_CACHE_HIT, "Index Cache Hits", "ind", "hits", XT_STAT_ACCUMULATIVE, + "Hits when accessing the index cache" }, + { XT_STAT_IND_CACHE_MISS, "Index Cache Misses", "ind", "miss", XT_STAT_ACCUMULATIVE, + "Misses when accessing the index cache" }, + { XT_STAT_IND_CACHE_USAGE, "Index Cache Usage", "ind", "%use", XT_STAT_PERCENTAGE, + "Percentage of index cache used" }, + { XT_STAT_ILOG_BYTES_IN, "Index Log Bytes In", "ilog", "in", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes read from the index log files" }, + { XT_STAT_ILOG_BYTES_OUT, "Index Log Bytes Out", "ilog", "out", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes written to the index log files" }, + { XT_STAT_ILOG_SYNC_COUNT, "Index Log File Syncs", "ilog", "syncs", XT_STAT_ACCUMULATIVE | XT_STAT_COMBO_FIELD, + "Number of flushes to index log files" }, + { XT_STAT_ILOG_SYNC_TIME, "Index Log Sync Time", "ilog", su_t_unit, XT_STAT_ACCUMULATIVE | XT_STAT_TIME_VALUE | XT_STAT_COMBO_FIELD_2, + "The time in %sseconds to flush index log files" }, + + { XT_STAT_XLOG_BYTES_IN, "Xact Log Bytes In", "xlog", "in", 
XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes read from the transaction log files" }, + { XT_STAT_XLOG_BYTES_OUT, "Xact Log Bytes Out", "xlog", "out", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes written to the transaction log files" }, + { XT_STAT_XLOG_SYNC_COUNT, "Xact Log File Syncs", "xlog", "syncs", XT_STAT_ACCUMULATIVE, + "Number of flushes to transaction log files" }, + { XT_STAT_XLOG_SYNC_TIME, "Xact Log Sync Time", "xlog", su_t_unit, XT_STAT_ACCUMULATIVE | XT_STAT_TIME_VALUE, + "The time in %sseconds to flush transaction log files" }, + { XT_STAT_XLOG_CACHE_HIT, "Xact Log Cache Hits", "xlog", "hits", XT_STAT_ACCUMULATIVE, + "Hits when accessing the transaction log cache" }, + { XT_STAT_XLOG_CACHE_MISS, "Xact Log Cache Misses","xlog", "miss", XT_STAT_ACCUMULATIVE, + "Misses when accessing the transaction log cache" }, + { XT_STAT_XLOG_CACHE_USAGE, "Xact Log Cache Usage", "xlog", "%use", XT_STAT_PERCENTAGE, + "Percentage of transaction log cache used" }, + + { XT_STAT_DATA_BYTES_IN, "Data Log Bytes In", "data", "in", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes read from the data log files" }, + { XT_STAT_DATA_BYTES_OUT, "Data Log Bytes Out", "data", "out", XT_STAT_ACCUMULATIVE | XT_STAT_BYTE_COUNT, + "Bytes written to the data log files" }, + { XT_STAT_DATA_SYNC_COUNT, "Data Log File Syncs", "data", "syncs", XT_STAT_ACCUMULATIVE, + "Number of flushes to data log files" }, + { XT_STAT_DATA_SYNC_TIME, "Data Log Sync Time", "data", su_t_unit, XT_STAT_ACCUMULATIVE | XT_STAT_TIME_VALUE, + "The time in %sseconds to flush data log files" }, + + { XT_STAT_BYTES_TO_CHKPNT, "Bytes to Checkpoint", "to", "chkpt", XT_STAT_BYTE_COUNT, + "Bytes written to the log since the last checkpoint" }, + { XT_STAT_LOG_BYTES_TO_WRITE, "Log Bytes to Write", "to", "write", XT_STAT_BYTE_COUNT, + "Bytes written to the log, still to be written to the database" }, + { XT_STAT_BYTES_TO_SWEEP, "Log Bytes to Sweep", "to", "sweep", XT_STAT_BYTE_COUNT, + "Bytes written 
to the log, still to be read by the sweeper" }, + { XT_STAT_SWEEPER_WAITS, "Sweeper Wait on Xact", "sweep", "waits", XT_STAT_ACCUMULATIVE, + "Attempts to cleanup a transaction" }, + + { XT_STAT_SCAN_INDEX, "Index Scan Count", "scan", "index", XT_STAT_ACCUMULATIVE, + "Number of index scans" }, + { XT_STAT_SCAN_TABLE, "Table Scan Count", "scan", "table", XT_STAT_ACCUMULATIVE, + "Number of table scans" }, + { XT_STAT_ROW_SELECT, "Select Row Count", "row", "sel", XT_STAT_ACCUMULATIVE, + "Number of rows selected" }, + { XT_STAT_ROW_INSERT, "Insert Row Count", "row", "ins", XT_STAT_ACCUMULATIVE, + "Number of rows inserted" }, + { XT_STAT_ROW_UPDATE, "Update Row Count", "row", "upd", XT_STAT_ACCUMULATIVE, + "Number of rows updated" }, + { XT_STAT_ROW_DELETE, "Delete Row Count", "row", "del", XT_STAT_ACCUMULATIVE, + "Number of rows deleted" }, + + { XT_STAT_RETRY_INDEX_SCAN, "Index Scan Retries", "retry", "iscan", XT_STAT_ACCUMULATIVE, + "Index scans restarted because of locked record" }, + { XT_STAT_REREAD_REC_LIST, "Record List Rereads", "retry", "rlist", XT_STAT_ACCUMULATIVE, + "Record list rescanned due to lock" } +}; + +xtPublic XTStatMetaDataPtr xt_get_stat_meta_data(int i) +{ + return &pbxt_stat_meta_data[i]; +} + +xtPublic void xt_set_time_unit(const char *u) +{ + xt_strcpy(10, su_t_unit, u); +} + diff --git a/storage/pbxt/src/strutil_xt.h b/storage/pbxt/src/strutil_xt.h new file mode 100644 index 00000000000..62067e0b671 --- /dev/null +++ b/storage/pbxt/src/strutil_xt.h @@ -0,0 +1,164 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-03 Paul McCullagh + * + * H&G2JCtL + */ + +#ifndef __xt_strutil_h__ +#define __xt_strutil_h__ + +#include <string.h> + +#include "xt_defs.h" + +#ifdef XT_WIN +#define XT_DIR_CHAR '\\' +#define XT_IS_DIR_CHAR(c) ((c) == '/' || (c) == '\\') +#else +#define XT_DIR_CHAR '/' +#define XT_IS_DIR_CHAR(c) ((c) == '/') +#endif + +#define MAX_INT8_STRING_SIZE 100 + +void xt_strcpy(size_t size, char *to, c_char *from); +void xt_strncpy(size_t size, char *to, c_char *from, size_t len_from); +void xt_strcat(size_t size, char *to, c_char *from); +void xt_strcati(size_t size, char *to, int i); +void xt_strcpy_term(size_t size, char *to, c_char *from, char term); +void xt_strcat_term(size_t size, char *to, c_char *from, char term); + +xtBool xt_ends_with(c_char *str, c_char *sub); +xtBool xt_starts_with(c_char *str, c_char *sub); + +char *xt_last_2_names_of_path(c_char *path); +char *xt_last_name_of_path(c_char *path); +void xt_2nd_last_name_of_path(size_t size, char *dest, c_char *path); +c_char *xt_last_directory_of_path(c_char *path); +xtBool xt_remove_dir_char(char *dir_name); +xtBool xt_add_dir_char(size_t max, char *path); +void xt_remove_last_name_of_path(char *path); +char *xt_find_extension(c_char *file_name); +void xt_remove_extension(char *file_name); +xtBool xt_is_extension(c_char *file_name, c_char *ext); + +xtInt8 xt_str_to_int8(c_char *ptr, xtBool *overflow); +void xt_int8_to_str(xtInt8 value, char *string); +void xt_double_to_str(double value, int scale, char *string); + +xtInt8 
xt_byte_size_to_int8(c_char *ptr); +void xt_int8_to_byte_size(xtInt8 value, char *string); + +c_char *xt_get_version(void); + +void xt_strcpy_url(size_t size, char *to, c_char *from); +void xt_strncpy_url(size_t size, char *to, c_char *from, size_t len_from); + +const char *xt_strchr(const char *str, char ch); +unsigned char xt_hex_digit(char ch); + +#define XT_STAT_TIME_CURRENT 0 +#define XT_STAT_TIME_PASSED 1 + +#define XT_STAT_COMMITS 2 +#define XT_STAT_ROLLBACKS 3 +#define XT_STAT_WAIT_FOR_XACT 4 +#define XT_STAT_XACT_TO_CLEAN 5 + +#define XT_STAT_STAT_READS 6 +#define XT_STAT_STAT_WRITES 7 + +#define XT_STAT_REC_BYTES_IN 8 +#define XT_STAT_REC_BYTES_OUT 9 +#define XT_STAT_REC_SYNC_COUNT 10 +#define XT_STAT_REC_SYNC_TIME 11 +#define XT_STAT_REC_CACHE_HIT 12 +#define XT_STAT_REC_CACHE_MISS 13 +#define XT_STAT_REC_CACHE_FREES 14 +#define XT_STAT_REC_CACHE_USAGE 15 + +#define XT_STAT_IND_BYTES_IN 16 +#define XT_STAT_IND_BYTES_OUT 17 +#define XT_STAT_IND_SYNC_COUNT 18 +#define XT_STAT_IND_SYNC_TIME 19 +#define XT_STAT_IND_CACHE_HIT 20 +#define XT_STAT_IND_CACHE_MISS 21 +#define XT_STAT_IND_CACHE_USAGE 22 +#define XT_STAT_ILOG_BYTES_IN 23 +#define XT_STAT_ILOG_BYTES_OUT 24 +#define XT_STAT_ILOG_SYNC_COUNT 25 +#define XT_STAT_ILOG_SYNC_TIME 26 + +#define XT_STAT_XLOG_BYTES_IN 27 +#define XT_STAT_XLOG_BYTES_OUT 28 +#define XT_STAT_XLOG_SYNC_COUNT 29 +#define XT_STAT_XLOG_SYNC_TIME 30 +#define XT_STAT_XLOG_CACHE_HIT 31 +#define XT_STAT_XLOG_CACHE_MISS 32 +#define XT_STAT_XLOG_CACHE_USAGE 33 + +#define XT_STAT_DATA_BYTES_IN 34 +#define XT_STAT_DATA_BYTES_OUT 35 +#define XT_STAT_DATA_SYNC_COUNT 36 +#define XT_STAT_DATA_SYNC_TIME 37 + +#define XT_STAT_BYTES_TO_CHKPNT 38 +#define XT_STAT_LOG_BYTES_TO_WRITE 39 +#define XT_STAT_BYTES_TO_SWEEP 40 +#define XT_STAT_SWEEPER_WAITS 41 + +#define XT_STAT_SCAN_INDEX 42 +#define XT_STAT_SCAN_TABLE 43 +#define XT_STAT_ROW_SELECT 44 +#define XT_STAT_ROW_INSERT 45 +#define XT_STAT_ROW_UPDATE 46 +#define XT_STAT_ROW_DELETE 47 + +#define 
XT_STAT_CURRENT_MAX 48 + +#define XT_STAT_RETRY_INDEX_SCAN 48 +#define XT_STAT_REREAD_REC_LIST 49 +#define XT_STAT_MAXIMUM 50 + +#define XT_STAT_ACCUMULATIVE 1 +#define XT_STAT_BYTE_COUNT 2 +#define XT_STAT_PERCENTAGE 4 +#define XT_STAT_COMBO_FIELD 8 /* Field is short, 2 chars instead of 5. */ +#define XT_STAT_COMBO_FIELD_2 16 /* Field is short, 2 chars instead of 5. */ +#define XT_STAT_TIME_VALUE 32 +#define XT_STAT_DATE 64 + +typedef struct XTStatMetaData { + int sm_id; + const char *sm_name; + const char *sm_short_line_1; + const char *sm_short_line_2; + int sm_flags; + const char *sm_description; +} XTStatMetaDataRec, *XTStatMetaDataPtr; + +XTStatMetaDataPtr xt_get_stat_meta_data(int i); +void xt_set_time_unit(const char *u); + +#ifdef XT_WIN +void xt_win_dialog(char *message); +#endif + +#endif diff --git a/storage/pbxt/src/systab_xt.cc b/storage/pbxt/src/systab_xt.cc new file mode 100644 index 00000000000..73ecc7a07cb --- /dev/null +++ b/storage/pbxt/src/systab_xt.cc @@ -0,0 +1,656 @@ +/* Copyright (c) 2008 PrimeBase Technologies GmbH, Germany + * + * PrimeBase Media Stream for MySQL + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Paul McCullagh + * + * 2007-07-18 + * + * H&G2JCtL + * + * System tables. 
+ * + */ + +#include "xt_config.h" + +#include <stdlib.h> +#include <time.h> +#ifdef DRIZZLED +#include <drizzled/server_includes.h> +#include <drizzled/current_session.h> +#endif + +#include "ha_pbxt.h" +#include "systab_xt.h" +#include "discover_xt.h" +#include "table_xt.h" +#include "strutil_xt.h" +#include "database_xt.h" +#include "trace_xt.h" + +#if MYSQL_VERSION_ID >= 50120 +#define byte uchar +#endif + +/* + * ------------------------------------------------------------------------- + * SYSTEM TABLE DEFINITIONS + */ + +//-------------------------------- +static DT_FIELD_INFO xt_location_info[] = +{ + { "Path", 128, NULL, MYSQL_TYPE_VARCHAR, (CHARSET_INFO *) system_charset_info, 0, "The location of PBXT tables"}, + { "Table_count", 0, NULL, MYSQL_TYPE_LONGLONG, NULL, NOT_NULL_FLAG, "The number of PBXT tables in this location"}, + { NULL, 0, NULL, MYSQL_TYPE_STRING, NULL, 0, NULL} +}; + +static DT_FIELD_INFO xt_statistics_info[] = +{ + { "ID", 0, NULL, MYSQL_TYPE_LONG, NULL, NOT_NULL_FLAG, "The ID of the statistic"}, + { "Name", 40, NULL, MYSQL_TYPE_VARCHAR, (CHARSET_INFO *) system_charset_info, 0, "The name of the statistic"}, + { "Value", 0, NULL, MYSQL_TYPE_LONGLONG, NULL, NOT_NULL_FLAG, "The accumulated value"}, + { NULL, 0, NULL, MYSQL_TYPE_STRING, NULL, 0, NULL} +}; + +/* +static DT_FIELD_INFO xt_reference_info[] = +{ + {"Table_name", 128, NULL, MYSQL_TYPE_STRING, system_charset_info, NOT_NULL_FLAG, "The name of the referencing table"}, + {"Blob_id", NULL, NULL, MYSQL_TYPE_LONGLONG, NULL, NOT_NULL_FLAG, "The BLOB reference number - part of the BLOB URL"}, + {"Column_name", 50, NULL, MYSQL_TYPE_STRING, system_charset_info, NOT_NULL_FLAG, "The column name of the referencing field"}, + {"Row_condition", 50, NULL, MYSQL_TYPE_VARCHAR, system_charset_info, 0, "This condition identifies the row in the table"}, + {"Blob_url", 50, NULL, MYSQL_TYPE_VARCHAR, system_charset_info, NOT_NULL_FLAG, "The BLOB URL for HTTP GET access"}, + {"Repository_id", NULL, NULL, 
MYSQL_TYPE_LONG, NULL, NOT_NULL_FLAG, "The repository file number of the BLOB"}, + {"Repo_blob_offset",NULL, NULL, MYSQL_TYPE_LONGLONG, NULL, NOT_NULL_FLAG, "The offset in the repository file"}, + {"Blob_size", NULL, NULL, MYSQL_TYPE_LONGLONG, NULL, NOT_NULL_FLAG, "The size of the BLOB in bytes"}, + {"Deletion_time", NULL, NULL, MYSQL_TYPE_TIMESTAMP, NULL, 0, "The time the BLOB was deleted"}, + {"Remove_in", NULL, NULL, MYSQL_TYPE_LONG, NULL, 0, "The number of seconds before the reference/BLOB is removed permanently"}, + {"Temp_log_id", NULL, NULL, MYSQL_TYPE_LONG, NULL, 0, "Temporary log number of the referencing deletion entry"}, + {"Temp_log_offset", NULL, NULL, MYSQL_TYPE_LONGLONG, NULL, 0, "Temporary log offset of the referencing deletion entry"}, + {NULL, NULL, NULL, MYSQL_TYPE_STRING, NULL, 0, NULL} +}; +*/ + +#define XT_SYSTAB_INVALID 0 +#define XT_SYSTAB_LOCATION_ID 1 +#define XT_SYSTAB_STATISTICS_ID 2 + +static THR_LOCK sys_location_lock; +static THR_LOCK sys_statistics_lock; +static xtBool sys_lock_inited = FALSE; + +static XTSystemTableShareRec xt_internal_tables[] = +{ + { XT_SYSTAB_LOCATION_ID, "pbxt.location", &sys_location_lock, xt_location_info, NULL, FALSE}, + { XT_SYSTAB_STATISTICS_ID, "pbxt.statistics", &sys_statistics_lock, xt_statistics_info, NULL, FALSE}, + { XT_SYSTAB_INVALID, NULL, NULL, NULL, NULL, FALSE} +}; + + +/* +static int pbms_discover_handler(handlerton *hton, THD* thd, const char *db, const char *name, uchar **frmblob, size_t *frmlen) +{ + int err = 1, i = 0; + MY_STAT stat_info; + + // Check that the database exists! + if ((!db) || ! 
my_stat(db,&stat_info,MYF(0))) + return err; + + while (pbms_internal_tables[i].name) { + if (!strcasecmp(name, pbms_internal_tables[i].name)) { + err = ms_create_table_frm(hton, thd, db, name, pbms_internal_tables[i].info, pbms_internal_tables[i].keys, frmblob, frmlen); + break; + } + i++; + } + + return err; +} +*/ + +/* + * ------------------------------------------------------------------------- + * MYSQL UTILITIES + */ + +void xt_my_set_notnull_in_record(Field *field, char *record) +{ + if (field->null_ptr) + record[(uint) (field->null_ptr - (uchar *) field->table->record[0])] &= (uchar) ~field->null_bit; +} + +/* + * ------------------------------------------------------------------------- + * OPEN SYSTEM TABLES + */ + +XTOpenSystemTable::XTOpenSystemTable(XTThreadPtr self, XTDatabaseHPtr db, XTSystemTableShare *share, TABLE *table): +XTObject() +{ + ost_share = share; + ost_my_table = table; + ost_db = db; + xt_heap_reference(self, db); +} + +XTOpenSystemTable::~XTOpenSystemTable() +{ + XTSystemTableShare::releaseSystemTable(this); +} + +/* + * ------------------------------------------------------------------------- + * LOCATION TABLE + */ + +XTLocationTable::XTLocationTable(XTThreadPtr self, XTDatabaseHPtr db, XTSystemTableShare *share, TABLE *table): +XTOpenSystemTable(self, db, share, table) +{ +} + +XTLocationTable::~XTLocationTable() +{ + unuse(); +} + +bool XTLocationTable::use() +{ + return true; +} + +bool XTLocationTable::unuse() +{ + return true; +} + + +bool XTLocationTable::seqScanInit() +{ + lt_index = 0; + return true; +} + +bool XTLocationTable::seqScanNext(char *buf, bool *eof) +{ + bool ok = true; + + *eof = false; + + xt_ht_lock(NULL, ost_db->db_tables); + if (lt_index >= xt_sl_get_size(ost_db->db_table_paths)) { + ok = false; + *eof = true; + goto done; + } + loadRow(buf, lt_index); + lt_index++; + + done: + xt_ht_unlock(NULL, ost_db->db_tables); + return ok; +#ifdef xxx + csWord4 last_access; + csWord4 last_ref; + csWord4 creation_time; 
+ csWord4 access_code; + csWord2 cont_type; + size_t ref_size; + csWord2 head_size; + csWord8 blob_size; + uint32 len; + Field *curr_field; + byte *save; + MY_BITMAP *save_write_set; + + last_access = CS_GET_DISK_4(blob->rb_last_access_4); + last_ref = CS_GET_DISK_4(blob->rb_last_ref_4); + creation_time = CS_GET_DISK_4(blob->rb_create_time_4); + cont_type = CS_GET_DISK_2(blob->rb_cont_type_2); + ref_size = CS_GET_DISK_1(blob->rb_ref_size_1); + head_size = CS_GET_DISK_2(blob->rb_head_size_2); + blob_size = CS_GET_DISK_6(blob->rb_blob_size_6); + access_code = CS_GET_DISK_4(blob->rb_auth_code_4); + + /* ASSERT_COLUMN_MARKED_FOR_WRITE is failing when + * I use store()!?? + * But I want to use it! :( + */ + save_write_set = table->write_set; + table->write_set = NULL; + + memset(buf, 0xFF, table->s->null_bytes); + for (Field **field=table->field ; *field ; field++) { + curr_field = *field; + + save = curr_field->ptr; +#if MYSQL_VERSION_ID < 50114 + curr_field->ptr = (byte *) buf + curr_field->offset(); +#else + curr_field->ptr = (byte *) buf + curr_field->offset(curr_field->table->record[0]); +#endif + switch (curr_field->field_name[0]) { + case 'A': + ASSERT(strcmp(curr_field->field_name, "Access_code") == 0); + curr_field->store(access_code, true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'R': + switch (curr_field->field_name[6]) { + case 't': + // Repository_id INT + ASSERT(strcmp(curr_field->field_name, "Repository_id") == 0); + curr_field->store(iRepoFile->myRepo->getRepoID(), true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'l': + // Repo_blob_offset BIGINT + ASSERT(strcmp(curr_field->field_name, "Repo_blob_offset") == 0); + curr_field->store(iRepoOffset, true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + } + break; + case 'B': + switch (curr_field->field_name[5]) { + case 's': + // Blob_size BIGINT + ASSERT(strcmp(curr_field->field_name, "Blob_size") == 0); + curr_field->store(blob_size, true); + 
xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'd': + // Blob_data LONGBLOB + ASSERT(strcmp(curr_field->field_name, "Blob_data") == 0); + if (blob_size <= 0xFFFFFFF) { + iBlobBuffer->setLength((u_int) blob_size); + len = iRepoFile->read(iBlobBuffer->getBuffer(0), iRepoOffset + head_size, (size_t) blob_size, 0); + ((Field_blob *) curr_field)->set_ptr(len, (byte *) iBlobBuffer->getBuffer(0)); + xt_my_set_notnull_in_record(curr_field, buf); + } + break; + } + break; + case 'H': + // Head_size SMALLINT UNSIGNED + ASSERT(strcmp(curr_field->field_name, "Head_size") == 0); + curr_field->store(head_size, true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'C': + switch (curr_field->field_name[1]) { + case 'r': + // Creation_time TIMESTAMP + ASSERT(strcmp(curr_field->field_name, "Creation_time") == 0); + curr_field->store(ms_my_1970_to_mysql_time(creation_time), true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'o': + // Content_type CHAR(128) + ASSERT(strcmp(curr_field->field_name, "Content_type") == 0); + CSString *cont_type_str = ost_share->mySysDatabase->getContentType(cont_type); + if (cont_type_str) { + curr_field->store(cont_type_str->getCString(), cont_type_str->length(), &my_charset_utf8_general_ci); + cont_type_str->release(); + xt_my_set_notnull_in_record(curr_field, buf); + } + break; + } + break; + case 'L': + switch (curr_field->field_name[5]) { + case 'r': + // Last_ref_time TIMESTAMP + ASSERT(strcmp(curr_field->field_name, "Last_ref_time") == 0); + curr_field->store(ms_my_1970_to_mysql_time(last_ref), true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'a': + // Last_access_time TIMESTAMP + ASSERT(strcmp(curr_field->field_name, "Last_access_time") == 0); + curr_field->store(ms_my_1970_to_mysql_time(last_access), true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + } + break; + } + curr_field->ptr = save; + } + + table->write_set = save_write_set; + return true; +#endif + 
return false; +} + +void XTLocationTable::loadRow(char *buf, xtWord4 row_id) +{ + TABLE *table = ost_my_table; + Field *curr_field; + XTTablePathPtr tp_ptr; + byte *save; + MY_BITMAP *save_write_set; + + /* ASSERT_COLUMN_MARKED_FOR_WRITE is failing when + * I use store()!?? + * But I want to use it! :( + */ + save_write_set = table->write_set; + table->write_set = NULL; + + memset(buf, 0xFF, table->s->null_bytes); + + tp_ptr = *((XTTablePathPtr *) xt_sl_item_at(ost_db->db_table_paths, row_id)); + + for (Field **field=table->field ; *field ; field++) { + curr_field = *field; + + save = curr_field->ptr; +#if MYSQL_VERSION_ID < 50114 + curr_field->ptr = (byte *) buf + curr_field->offset(); +#else + curr_field->ptr = (byte *) buf + curr_field->offset(curr_field->table->record[0]); +#endif + switch (curr_field->field_name[0]) { + case 'P': + // Path VARCHAR(128) + ASSERT_NS(strcmp(curr_field->field_name, "Path") == 0); + curr_field->store(tp_ptr->tp_path, strlen(tp_ptr->tp_path), &my_charset_utf8_general_ci); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'T': + // Table_count INT + ASSERT_NS(strcmp(curr_field->field_name, "Table_count") == 0); + curr_field->store(tp_ptr->tp_tab_count, true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + } + curr_field->ptr = save; + } + table->write_set = save_write_set; +} + +xtWord4 XTLocationTable::seqScanPos(xtWord1 *buf __attribute__((unused))) +{ + return lt_index-1; +} + +bool XTLocationTable::seqScanRead(xtWord4 rec_id, char *buf) +{ + loadRow(buf, rec_id); + return true; +} + +/* + * ------------------------------------------------------------------------- + * STATISTICS TABLE + */ + +XTStatisticsTable::XTStatisticsTable(XTThreadPtr self, XTDatabaseHPtr db, XTSystemTableShare *share, TABLE *table): +XTOpenSystemTable(self, db, share, table) +{ +} + +XTStatisticsTable::~XTStatisticsTable() +{ + unuse(); +} + +bool XTStatisticsTable::use() +{ + return true; +} + +bool XTStatisticsTable::unuse() +{ + 
return true; +} + + +bool XTStatisticsTable::seqScanInit() +{ + tt_index = 0; + xt_gather_statistics(&tt_statistics); + return true; +} + +bool XTStatisticsTable::seqScanNext(char *buf, bool *eof) +{ + bool ok = true; + + *eof = false; + + if (tt_index >= XT_STAT_CURRENT_MAX) { + ok = false; + *eof = true; + goto done; + } + loadRow(buf, tt_index); + tt_index++; + + done: + return ok; +} + +void XTStatisticsTable::loadRow(char *buf, xtWord4 rec_id) +{ + TABLE *table = ost_my_table; + MY_BITMAP *save_write_set; + Field *curr_field; + byte *save; + const char *stat_name; + u_llong stat_value; + + /* ASSERT_COLUMN_MARKED_FOR_WRITE is failing when + * I use store()!?? + * But I want to use it! :( + */ + save_write_set = table->write_set; + table->write_set = NULL; + + memset(buf, 0xFF, table->s->null_bytes); + + stat_name = xt_get_stat_meta_data(rec_id)->sm_name; + stat_value = xt_get_statistic(&tt_statistics, ost_db, rec_id); + + for (Field **field=table->field ; *field ; field++) { + curr_field = *field; + + save = curr_field->ptr; +#if MYSQL_VERSION_ID < 50114 + curr_field->ptr = (byte *) buf + curr_field->offset(); +#else + curr_field->ptr = (byte *) buf + curr_field->offset(curr_field->table->record[0]); +#endif + switch (curr_field->field_name[0]) { + case 'I': + // ID INT + ASSERT_NS(strcmp(curr_field->field_name, "ID") == 0); + curr_field->store(rec_id+1, true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'N': + // Name VARCHAR(40) + ASSERT_NS(strcmp(curr_field->field_name, "Name") == 0); + curr_field->store(stat_name, strlen(stat_name), &my_charset_utf8_general_ci); + xt_my_set_notnull_in_record(curr_field, buf); + break; + case 'V': + // Value BIGINT + ASSERT_NS(strcmp(curr_field->field_name, "Value") == 0); + curr_field->store(stat_value, true); + xt_my_set_notnull_in_record(curr_field, buf); + break; + } + curr_field->ptr = save; + } + table->write_set = save_write_set; +} + +xtWord4 XTStatisticsTable::seqScanPos(xtWord1 *buf 
__attribute__((unused))) +{ + return tt_index-1; +} + +bool XTStatisticsTable::seqScanRead(xtWord4 rec_id, char *buf) +{ + loadRow(buf, rec_id); + return true; +} + +/* + * ------------------------------------------------------------------------- + * SYSTEM TABLE SHARES + */ + +void st_path_to_table_name(size_t size, char *buffer, const char *path) +{ + char *str; + + xt_strcpy(size, buffer, xt_last_2_names_of_path(path)); + xt_remove_extension(buffer); + if ((str = strchr(buffer, '\\'))) + *str = '.'; + if ((str = strchr(buffer, '/'))) + *str = '.'; +} + +void XTSystemTableShare::startUp(XTThreadPtr self __attribute__((unused))) +{ + thr_lock_init(&sys_location_lock); + thr_lock_init(&sys_statistics_lock); + sys_lock_inited = TRUE; +} + +void XTSystemTableShare::shutDown(XTThreadPtr self __attribute__((unused))) +{ + if (sys_lock_inited) { + thr_lock_delete(&sys_location_lock); + thr_lock_delete(&sys_statistics_lock); + sys_lock_inited = FALSE; + } +} + +bool XTSystemTableShare::isSystemTable(const char *table_path) +{ + int i = 0; + char tab_name[100]; + + st_path_to_table_name(100, tab_name, table_path); + while (xt_internal_tables[i].sts_path) { + if (strcasecmp(tab_name, xt_internal_tables[i].sts_path) == 0) + return true; + i++; + } + return false; +} + +void XTSystemTableShare::setSystemTableDeleted(const char *table_path) +{ + int i = 0; + char tab_name[100]; + + st_path_to_table_name(100, tab_name, table_path); + while (xt_internal_tables[i].sts_path) { + if (strcasecmp(tab_name, xt_internal_tables[i].sts_path) == 0) { + xt_internal_tables[i].sts_exists = FALSE; + break; + } + i++; + } +} + +bool XTSystemTableShare::doesSystemTableExist() +{ + int i = 0; + + while (xt_internal_tables[i].sts_path) { + if (xt_internal_tables[i].sts_exists) + return true; + i++; + } + return false; +} + +void XTSystemTableShare::createSystemTables(XTThreadPtr self __attribute__((unused)), XTDatabaseHPtr db __attribute__((unused))) +{ + int i = 0; + + while 
(xt_internal_tables[i].sts_path) { + if (!xt_create_table_frm(pbxt_hton, + current_thd, "pbxt", + strchr(xt_internal_tables[i].sts_path, '.') + 1, + xt_internal_tables[i].sts_info, + xt_internal_tables[i].sts_keys, + TRUE /*do not recreate*/)) + xt_internal_tables[i].sts_exists = TRUE; + i++; + } +} + +XTOpenSystemTable *XTSystemTableShare::openSystemTable(XTThreadPtr self, const char *table_path, TABLE *table) +{ + XTSystemTableShare *share; + XTOpenSystemTable *otab = NULL; + int i = 0; + char tab_name[100]; + + st_path_to_table_name(100, tab_name, table_path); + while (xt_internal_tables[i].sts_path) { + if (strcasecmp(tab_name, xt_internal_tables[i].sts_path) == 0) { + share = &xt_internal_tables[i]; + goto found; + } + i++; + } + return NULL; + + found: + share->sts_exists = TRUE; + switch (share->sts_id) { + case XT_SYSTAB_LOCATION_ID: + if (!(otab = new XTLocationTable(self, self->st_database, share, table))) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + break; + case XT_SYSTAB_STATISTICS_ID: + if (!(otab = new XTStatisticsTable(self, self->st_database, share, table))) + xt_throw_errno(XT_CONTEXT, XT_ENOMEM); + break; + default: + xt_throw_taberr(XT_CONTEXT, XT_ERR_TABLE_NOT_FOUND, (XTPathStrPtr) table_path); + break; + } + + return otab; +} + +void XTSystemTableShare::releaseSystemTable(XTOpenSystemTable *tab) +{ + if (tab->ost_db) { + XTThreadPtr self = xt_get_self(); + + try_(a) { + xt_heap_release(self, tab->ost_db); + } + catch_(a) { + } + cont_(a); + tab->ost_db = NULL; + } +} diff --git a/storage/pbxt/src/systab_xt.h b/storage/pbxt/src/systab_xt.h new file mode 100644 index 00000000000..e64bcc816f4 --- /dev/null +++ b/storage/pbxt/src/systab_xt.h @@ -0,0 +1,155 @@ +/* Copyright (c) 2008 PrimeBase Technologies GmbH, Germany + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the 
License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Paul McCullagh + * + * 2007-07-18 + * + * H&G2JCtL + * + * PBXT System tables. + * + */ + +/* + +DROP TABLE IF EXISTS pbms_repository; +CREATE TABLE pbms_repository ( + Repository_id INT COMMENT 'The repository file number', + Repo_blob_offset BIGINT COMMENT 'The offset of the BLOB in the repository file', + Blob_size BIGINT COMMENT 'The size of the BLOB in bytes', + Head_size SMALLINT UNSIGNED COMMENT 'The size of the BLOB header - precedes the BLOB data', + Access_code INT COMMENT 'The 4-byte authorisation code required to access the BLOB - part of the BLOB URL', + Creation_time TIMESTAMP COMMENT 'The time the BLOB was created', + Last_ref_time TIMESTAMP COMMENT 'The last time the BLOB was referenced', + Last_access_time TIMESTAMP COMMENT 'The last time the BLOB was accessed (read)', + Content_type CHAR(128) COMMENT 'The content type of the BLOB - returned by HTTP GET calls', + Blob_data LONGBLOB COMMENT 'The data of this BLOB' +) ENGINE=PBMS; + + PRIMARY KEY (Repository_id, Repo_blob_offset) + +DROP TABLE IF EXISTS pbms_reference; +CREATE TABLE pbms_reference ( + Table_name CHAR(64) COMMENT 'The name of the referencing table', + Blob_id BIGINT COMMENT 'The BLOB reference number - part of the BLOB URL', + Column_name CHAR(64) COMMENT 'The column name of the referencing field', + Row_condition VARCHAR(255) COMMENT 'This condition identifies the row in the table', + Blob_url VARCHAR(200) COMMENT 'The BLOB URL for HTTP GET access', + Repository_id
INT COMMENT 'The repository file number of the BLOB', + Repo_blob_offset BIGINT COMMENT 'The offset in the repository file', + Blob_size BIGINT COMMENT 'The size of the BLOB in bytes', + Deletion_time TIMESTAMP COMMENT 'The time the BLOB was deleted', + Remove_in INT COMMENT 'The number of seconds before the reference/BLOB is removed permanently', + Temp_log_id INT COMMENT 'Temporary log number of the referencing deletion entry', + Temp_log_offset BIGINT COMMENT 'Temporary log offset of the referencing deletion entry' +) ENGINE=PBMS; + + PRIMARY KEY (Table_name, Blob_id, Column_name, Condition) +*/ + +#ifndef __SYSTAB_XT_H__ +#define __SYSTAB_XT_H__ + +#include "ccutils_xt.h" +#include "discover_xt.h" +#include "thread_xt.h" + +struct XTSystemTableShare; +struct XTDatabase; + +class XTOpenSystemTable : public XTObject { +public: + XTSystemTableShare *ost_share; + TABLE *ost_my_table; + struct XTDatabase *ost_db; + + XTOpenSystemTable(XTThreadPtr self, struct XTDatabase *db, XTSystemTableShare *share, TABLE *table); + virtual ~XTOpenSystemTable(); + + virtual bool use() { return true; } + virtual bool unuse() { return true; } + virtual bool seqScanInit() { return true; } + virtual bool seqScanNext(char *buf __attribute__((unused)), bool *eof) { + *eof = true; + return false; + } + virtual int getRefLen() { return 4; } + virtual xtWord4 seqScanPos(xtWord1 *buf __attribute__((unused))) { + return 0; + } + virtual bool seqScanRead(xtWord4 rec_id __attribute__((unused)), char *buf __attribute__((unused))) { + return true; + } + +private: +}; + +class XTLocationTable : public XTOpenSystemTable { + u_int lt_index; + +public: + XTLocationTable(XTThreadPtr self, struct XTDatabase *db, XTSystemTableShare *share, TABLE *table); + virtual ~XTLocationTable(); + + virtual bool use(); + virtual bool unuse(); + virtual bool seqScanInit(); + virtual bool seqScanNext(char *buf, bool *eof); + virtual void loadRow(char *buf, xtWord4 row_id); + virtual xtWord4 seqScanPos(xtWord1 *buf);
+ virtual bool seqScanRead(xtWord4 rec_id, char *buf); +}; + +class XTStatisticsTable : public XTOpenSystemTable { + u_int tt_index; + XTStatisticsRec tt_statistics; + +public: + XTStatisticsTable(XTThreadPtr self, struct XTDatabase *db, XTSystemTableShare *share, TABLE *table); + virtual ~XTStatisticsTable(); + + virtual bool use(); + virtual bool unuse(); + virtual bool seqScanInit(); + virtual bool seqScanNext(char *buf, bool *eof); + virtual void loadRow(char *buf, xtWord4 row_id); + virtual xtWord4 seqScanPos(xtWord1 *buf); + virtual bool seqScanRead(xtWord4 rec_id, char *buf); +}; + +typedef struct XTSystemTableShare { + u_int sts_id; + const char *sts_path; + THR_LOCK *sts_my_lock; + DT_FIELD_INFO *sts_info; + DT_KEY_INFO *sts_keys; + xtBool sts_exists; + + static void startUp(XTThreadPtr self); + static void shutDown(XTThreadPtr self); + + static bool isSystemTable(const char *table_path); + static void setSystemTableDeleted(const char *table_path); + static bool doesSystemTableExist(); + static void createSystemTables(XTThreadPtr self, struct XTDatabase *db); + static XTOpenSystemTable *openSystemTable(XTThreadPtr self, const char *table_path, TABLE *table); + static void releaseSystemTable(XTOpenSystemTable *tab); +} XTSystemTableShareRec, *XTSystemTableSharePtr; + +#endif diff --git a/storage/pbxt/src/tabcache_xt.cc b/storage/pbxt/src/tabcache_xt.cc new file mode 100644 index 00000000000..b9f9ccd37e1 --- /dev/null +++ b/storage/pbxt/src/tabcache_xt.cc @@ -0,0 +1,1254 @@ +/* Copyright (c) 2007 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2007-10-30 Paul McCullagh + * + * H&G2JCtL + * + * The new table cache. Caches all non-index data. This includes the data + * files and the row pointer files. + */ + +#include "xt_config.h" + +#include <signal.h> + +#include "pthread_xt.h" +#include "tabcache_xt.h" +#include "table_xt.h" +#include "database_xt.h" +#include "trace_xt.h" +#include "myxt_xt.h" + +xtPublic XTTabCacheMemRec xt_tab_cache; + +static void tabc_fr_wait_for_cache(XTThreadPtr self, u_int msecs); + +xtPublic void xt_tc_set_cache_size(size_t cache_size) +{ + xt_tab_cache.tcm_cache_size = cache_size; + xt_tab_cache.tcm_low_level = cache_size / 4 * 3; // Current 75% + xt_tab_cache.tcm_high_level = cache_size / 100 * 95; // Current 95% +} + +/* + * Initialize the disk cache. + */ +xtPublic void xt_tc_init(XTThreadPtr self, size_t cache_size) +{ + xt_tc_set_cache_size(cache_size); + + xt_tab_cache.tcm_approx_page_count = cache_size / sizeof(XTTabCachePageRec); + /* Determine the size of the hash table. + * The size is set to 2* the number of pages! 
+ */ + xt_tab_cache.tcm_hash_size = (xt_tab_cache.tcm_approx_page_count * 2) / XT_TC_SEGMENT_COUNT; + + try_(a) { + for (u_int i=0; i<XT_TC_SEGMENT_COUNT; i++) { + xt_tab_cache.tcm_segment[i].tcs_cache_in_use = 0; + xt_tab_cache.tcm_segment[i].tcs_hash_table = (XTTabCachePagePtr *) xt_calloc(self, xt_tab_cache.tcm_hash_size * sizeof(XTTabCachePagePtr)); + xt_rwmutex_init_with_autoname(self, &xt_tab_cache.tcm_segment[i].tcs_lock); + } + + xt_init_mutex_with_autoname(self, &xt_tab_cache.tcm_lock); + xt_init_cond(self, &xt_tab_cache.tcm_cond); + xt_init_mutex_with_autoname(self, &xt_tab_cache.tcm_freeer_lock); + xt_init_cond(self, &xt_tab_cache.tcm_freeer_cond); + } + catch_(a) { + xt_tc_exit(self); + throw_(); + } + cont_(a); +} + +xtPublic void xt_tc_exit(XTThreadPtr self) +{ + for (u_int i=0; i<XT_TC_SEGMENT_COUNT; i++) { + if (xt_tab_cache.tcm_segment[i].tcs_hash_table) { + if (xt_tab_cache.tcm_segment[i].tcs_cache_in_use) { + XTTabCachePagePtr page, tmp_page; + + for (size_t j=0; j<xt_tab_cache.tcm_hash_size; j++) { + page = xt_tab_cache.tcm_segment[i].tcs_hash_table[j]; + while (page) { + tmp_page = page; + page = page->tcp_next; + xt_free(self, tmp_page); + } + } + } + + xt_free(self, xt_tab_cache.tcm_segment[i].tcs_hash_table); + xt_tab_cache.tcm_segment[i].tcs_hash_table = NULL; + xt_rwmutex_free(self, &xt_tab_cache.tcm_segment[i].tcs_lock); + } + } + + xt_free_mutex(&xt_tab_cache.tcm_lock); + xt_free_cond(&xt_tab_cache.tcm_cond); + xt_free_mutex(&xt_tab_cache.tcm_freeer_lock); + xt_free_cond(&xt_tab_cache.tcm_freeer_cond); +} + +xtPublic xtInt8 xt_tc_get_usage() +{ + xtInt8 size = 0; + + for (u_int i=0; i<XT_TC_SEGMENT_COUNT; i++) { + size += xt_tab_cache.tcm_segment[i].tcs_cache_in_use; + } + return size; +} + +xtPublic xtInt8 xt_tc_get_size() +{ + return (xtInt8) xt_tab_cache.tcm_cache_size; +} + +xtPublic xtInt8 xt_tc_get_high() +{ + return (xtInt8) xt_tab_cache.tcm_cache_high; +} + +#ifdef DEBUG +xtPublic void xt_check_table_cache(XTTableHPtr tab) +{ + 
XTTabCachePagePtr page, ppage; + + xt_lock_mutex_ns(&xt_tab_cache.tcm_lock); + ppage = NULL; + page = xt_tab_cache.tcm_lru_page; + while (page) { + if (tab) { + if (page->tcp_db_id == tab->tab_db->db_id && page->tcp_tab_id == tab->tab_id) { + ASSERT_NS(!XTTableSeq::xt_op_is_before(tab->tab_seq.ts_next_seq, page->tcp_op_seq)); + } + } + ASSERT_NS(page->tcp_lr_used == ppage); + ppage = page; + page = page->tcp_mr_used; + } + ASSERT_NS(xt_tab_cache.tcm_mru_page == ppage); + xt_unlock_mutex_ns(&xt_tab_cache.tcm_lock); +} +#endif + +void XTTabCache::xt_tc_setup(XTTableHPtr tab, size_t head_size, size_t rec_size) +{ + tci_table = tab; + tci_header_size = head_size; + tci_rec_size = rec_size; + tci_rows_per_page = (XT_TC_PAGE_SIZE / rec_size) + 1; + if (tci_rows_per_page < 2) + tci_rows_per_page = 2; + tci_page_size = tci_rows_per_page * rec_size; +} + +/* + * This function assumes that we never write past the boundary of a page. + * This should be the case, because we should never write more than + * a row, and there are only whole rows on a page. + */ +xtBool XTTabCache::xt_tc_write(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, size_t inc, size_t size, xtWord1 *data, xtOpSeqNo *op_seq, xtBool read, XTThreadPtr thread) +{ + size_t offset; + XTTabCachePagePtr page; + XTTabCacheSegPtr seg; + + /* + retry: + */ + if (!tc_fetch(file, ref_id, &seg, &page, &offset, read, thread)) + return FAILED; + /* Don't write while there is a read lock on the page, + * which can happen during a sequential scan... + * + * This will have to be OK. + * I cannot wait for the lock because a thread locks + * itself out when updating during a sequential scan. + * + * However, I don't think this is a problem, because + * the only records that are changed, are records + * containing uncommitted data. Such records should + * be ignored by a sequential scan. As long as + * we don't crash due to reading half written + * data! 
+ * + if (page->tcp_lock_count) { + if (!xt_timed_wait_cond_ns(&seg->tcs_cond, &seg->tcs_lock, 100)) { + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + return FAILED; + } + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + // The page may have disappeared from the cache while we were sleeping! + goto retry; + } + */ + + ASSERT_NS(offset + inc + 4 <= tci_page_size); + memcpy(page->tcp_data + offset + inc, data, size); + /* GOTCHA, this was "op_seq > page->tcp_op_seq", however + * this does not handle overflow! + if (XTTableSeq::xt_op_is_before(page->tcp_op_seq, op_seq)) + page->tcp_op_seq = op_seq; + */ + + page->tcp_dirty = TRUE; + ASSERT_NS(page->tcp_db_id == tci_table->tab_db->db_id && page->tcp_tab_id == tci_table->tab_id); + *op_seq = tci_table->tab_seq.ts_set_op_seq(page); + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + return OK; +} + +/* + * This is a special version of write which is used to set the "clean" bit. + * The alternative would be to read the record first, but this + * is much quicker! + * + * This function also checks if xn_id, row_id and other data match (the checks + * are similar to xn_sw_cleanup_done) before modifying the record, otherwise it + * assumes that the record was already updated earlier and we must not set it to + * clean. + * + * If the record was not modified the function returns FALSE. + * + * The function has a self pointer and can throw an exception.
+ */ +xtBool XTTabCache::xt_tc_write_cond(XTThreadPtr self, XT_ROW_REC_FILE_PTR file, xtRefID ref_id, xtWord1 new_type, xtOpSeqNo *op_seq, + xtXactID xn_id, xtRowID row_id, u_int stat_id, u_int rec_type) +{ + size_t offset; + XTTabCachePagePtr page; + XTTabCacheSegPtr seg; + XTTabRecHeadDPtr rec_head; + + if (!tc_fetch(file, ref_id, &seg, &page, &offset, TRUE, self)) + xt_throw(self); + + ASSERT(offset + 1 <= tci_page_size); + + rec_head = (XTTabRecHeadDPtr)(page->tcp_data + offset); + + /* Transaction must match: */ + if (XT_GET_DISK_4(rec_head->tr_xact_id_4) != xn_id) + goto no_change; + + /* Record header must match expected value from + * log or clean has been done, or is not required. + * + * For example, it is not required if a record + * has been overwritten in a transaction. + */ + if (rec_head->tr_rec_type_1 != rec_type || + rec_head->tr_stat_id_1 != stat_id) + goto no_change; + + /* Row must match: */ + if (XT_GET_DISK_4(rec_head->tr_row_id_4) != row_id) + goto no_change; + + *(page->tcp_data + offset) = new_type; + + page->tcp_dirty = TRUE; + ASSERT(page->tcp_db_id == tci_table->tab_db->db_id && page->tcp_tab_id == tci_table->tab_id); + *op_seq = tci_table->tab_seq.ts_set_op_seq(page); + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + return TRUE; + + no_change: + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + return FALSE; +} + +xtBool XTTabCache::xt_tc_read(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + return tc_read_direct(file, ref_id, size, data, thread); +} + +xtBool XTTabCache::xt_tc_read_4(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, xtWord4 *value, XTThreadPtr thread) +{ + register u_int page_idx; + register XTTabCachePagePtr page; + register XTTabCacheSegPtr seg; + register u_int hash_idx; + register XTTabCacheMemPtr dcg = &xt_tab_cache; + off_t address; + + ASSERT_NS(ref_id); + ref_id--; + page_idx = ref_id / this->tci_rows_per_page; + address = (off_t) ref_id * (off_t) this->tci_rec_size + 
(off_t) this->tci_header_size; + + hash_idx = page_idx + (file->fr_id * 223); + seg = &dcg->tcm_segment[hash_idx & XT_TC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_TC_SEGMENT_SHIFTS) % dcg->tcm_hash_size; + + xt_rwmutex_slock(&seg->tcs_lock, thread->t_id); + page = seg->tcs_hash_table[hash_idx]; + while (page) { + if (page->tcp_page_idx == page_idx && page->tcp_file_id == file->fr_id) { + size_t offset; + xtWord1 *buffer; + + offset = (ref_id % this->tci_rows_per_page) * this->tci_rec_size; + ASSERT_NS(offset + 4 <= this->tci_page_size); + buffer = page->tcp_data + offset; + *value = XT_GET_DISK_4(buffer); + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + return OK; + } + page = page->tcp_next; + } + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + +#ifdef XT_USE_ROW_REC_MMAP_FILES + return xt_pread_fmap_4(file, address, value, &thread->st_statistics.st_rec, thread); +#else + xtWord1 data[4]; + + if (!XT_PREAD_RR_FILE(file, address, 4, 4, data, NULL, &thread->st_statistics.st_rec, thread)) + return FAILED; + *value = XT_GET_DISK_4(data); + return OK; +#endif +} + +xtBool XTTabCache::xt_tc_get_page(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, XTTabCachePagePtr *ret_page, size_t *offset, XTThreadPtr thread) +{ + XTTabCachePagePtr page; + XTTabCacheSegPtr seg; + +#ifdef XT_SEQ_SCAN_FROM_MEMORY + if (!tc_fetch_direct(file, ref_id, &seg, &page, offset, thread)) + return FAILED; + if (!seg) { + *ret_page = NULL; + return OK; + } +#else + if (!tc_fetch(file, ref_id, &seg, &page, offset, TRUE, thread)) + return FAILED; +#endif + page->tcp_lock_count++; + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + *ret_page = page; + return OK; +} + +void XTTabCache::xt_tc_release_page(XT_ROW_REC_FILE_PTR file __attribute__((unused)), XTTabCachePagePtr page, XTThreadPtr thread) +{ + XTTabCacheSegPtr seg; + + seg = &xt_tab_cache.tcm_segment[page->tcp_seg]; + xt_rwmutex_xlock(&seg->tcs_lock, thread->t_id); + +#ifdef DEBUG + XTTabCachePagePtr lpage, ppage; + + ppage = NULL; + lpage 
= seg->tcs_hash_table[page->tcp_hash_idx]; + while (lpage) { + if (lpage->tcp_page_idx == page->tcp_page_idx && + lpage->tcp_file_id == page->tcp_file_id) + break; + ppage = lpage; + lpage = lpage->tcp_next; + } + + ASSERT_NS(page == lpage); + ASSERT_NS(page->tcp_lock_count > 0); +#endif + + if (page->tcp_lock_count > 0) + page->tcp_lock_count--; + + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); +} + +xtBool XTTabCache::xt_tc_read_page(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, xtWord1 *data, XTThreadPtr thread) +{ + return tc_read_direct(file, ref_id, this->tci_page_size, data, thread); +} + +/* Read row and record files directly. + * This bypasses the cache when reading, which means + * we rely on the OS for caching. + * This probably only makes sense when these files + * are memory mapped. + */ +xtBool XTTabCache::tc_read_direct(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + register u_int page_idx; + register XTTabCachePagePtr page; + register XTTabCacheSegPtr seg; + register u_int hash_idx; + register XTTabCacheMemPtr dcg = &xt_tab_cache; + size_t red_size; + off_t address; + + ASSERT_NS(ref_id); + ref_id--; + page_idx = ref_id / this->tci_rows_per_page; + address = (off_t) ref_id * (off_t) this->tci_rec_size + (off_t) this->tci_header_size; + + hash_idx = page_idx + (file->fr_id * 223); + seg = &dcg->tcm_segment[hash_idx & XT_TC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_TC_SEGMENT_SHIFTS) % dcg->tcm_hash_size; + + xt_rwmutex_slock(&seg->tcs_lock, thread->t_id); + page = seg->tcs_hash_table[hash_idx]; + while (page) { + if (page->tcp_page_idx == page_idx && page->tcp_file_id == file->fr_id) { + size_t offset; + + offset = (ref_id % this->tci_rows_per_page) * this->tci_rec_size; + ASSERT_NS(offset + size <= this->tci_page_size); + memcpy(data, page->tcp_data + offset, size); + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + return OK; + } + page = page->tcp_next; + } + xt_rwmutex_unlock(&seg->tcs_lock,
thread->t_id); + if (!XT_PREAD_RR_FILE(file, address, size, 0, data, &red_size, &thread->st_statistics.st_rec, thread)) + return FAILED; + memset(data + red_size, 0, size - red_size); + return OK; +} + +xtBool XTTabCache::tc_fetch_direct(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, XTTabCacheSegPtr *ret_seg, XTTabCachePagePtr *ret_page, size_t *offset, XTThreadPtr thread) +{ + register u_int page_idx; + register XTTabCachePagePtr page; + register XTTabCacheSegPtr seg; + register u_int hash_idx; + register XTTabCacheMemPtr dcg = &xt_tab_cache; + + ASSERT_NS(ref_id); + ref_id--; + page_idx = ref_id / this->tci_rows_per_page; + *offset = (ref_id % this->tci_rows_per_page) * this->tci_rec_size; + + hash_idx = page_idx + (file->fr_id * 223); + seg = &dcg->tcm_segment[hash_idx & XT_TC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_TC_SEGMENT_SHIFTS) % dcg->tcm_hash_size; + + xt_rwmutex_xlock(&seg->tcs_lock, thread->t_id); + page = seg->tcs_hash_table[hash_idx]; + while (page) { + if (page->tcp_page_idx == page_idx && page->tcp_file_id == file->fr_id) { + *ret_seg = seg; + *ret_page = page; + return OK; + } + page = page->tcp_next; + } + xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + *ret_seg = NULL; + *ret_page = NULL; + return OK; +} + +/* + * Note, this function may return an exclusive, or a shared lock. + * If the page is in cache it will return a shared lock of the segment. + * If the page was just added to the cache it will return an + * exclusive lock. 
+ */ +xtBool XTTabCache::tc_fetch(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, XTTabCacheSegPtr *ret_seg, XTTabCachePagePtr *ret_page, size_t *offset, xtBool read, XTThreadPtr thread) +{ + register u_int page_idx; + register XTTabCachePagePtr page, new_page; + register XTTabCacheSegPtr seg; + register u_int hash_idx; + register XTTabCacheMemPtr dcg = &xt_tab_cache; + size_t red_size; + off_t address; + + ASSERT_NS(ref_id); + ref_id--; + page_idx = ref_id / this->tci_rows_per_page; + address = (off_t) page_idx * (off_t) this->tci_page_size + (off_t) this->tci_header_size; + *offset = (ref_id % this->tci_rows_per_page) * this->tci_rec_size; + + hash_idx = page_idx + (file->fr_id * 223); + seg = &dcg->tcm_segment[hash_idx & XT_TC_SEGMENT_MASK]; + hash_idx = (hash_idx >> XT_TC_SEGMENT_SHIFTS) % dcg->tcm_hash_size; + + xt_rwmutex_slock(&seg->tcs_lock, thread->t_id); + page = seg->tcs_hash_table[hash_idx]; + while (page) { + if (page->tcp_page_idx == page_idx && page->tcp_file_id == file->fr_id) { + /* This page has been most recently used: */ + if (XT_TIME_DIFF(page->tcp_ru_time, dcg->tcm_ru_now) > (dcg->tcm_approx_page_count >> 1)) { + /* Move to the front of the MRU list: */ + xt_lock_mutex_ns(&dcg->tcm_lock); + + page->tcp_ru_time = ++dcg->tcm_ru_now; + if (dcg->tcm_mru_page != page) { + /* Remove from the MRU list: */ + if (dcg->tcm_lru_page == page) + dcg->tcm_lru_page = page->tcp_mr_used; + if (page->tcp_lr_used) + page->tcp_lr_used->tcp_mr_used = page->tcp_mr_used; + if (page->tcp_mr_used) + page->tcp_mr_used->tcp_lr_used = page->tcp_lr_used; + + /* Make the page the most recently used: */ + if ((page->tcp_lr_used = dcg->tcm_mru_page)) + dcg->tcm_mru_page->tcp_mr_used = page; + page->tcp_mr_used = NULL; + dcg->tcm_mru_page = page; + if (!dcg->tcm_lru_page) + dcg->tcm_lru_page = page; + } + xt_unlock_mutex_ns(&dcg->tcm_lock); + } + *ret_seg = seg; + *ret_page = page; + thread->st_statistics.st_rec_cache_hit++; + return OK; + } + page = page->tcp_next; + } + 
xt_rwmutex_unlock(&seg->tcs_lock, thread->t_id); + + /* Page not found, allocate a new page: */ + size_t page_size = offsetof(XTTabCachePageRec, tcp_data) + this->tci_page_size; + if (!(new_page = (XTTabCachePagePtr) xt_malloc_ns(page_size))) + return FAILED; + /* Increment cache used. */ + seg->tcs_cache_in_use += page_size; + + /* Check the level of the cache: */ + size_t cache_used = 0; + for (int i=0; i<XT_TC_SEGMENT_COUNT; i++) + cache_used += dcg->tcm_segment[i].tcs_cache_in_use; + + if (cache_used > dcg->tcm_cache_high) + dcg->tcm_cache_high = cache_used; + + if (cache_used > dcg->tcm_cache_size) { + XTThreadPtr self; + time_t now; + + /* Wait for the cache level to go down. + * If this happens, then the freeer is not working fast + * enough! + */ + + /* But before I do this, I must flush my own log because: + * - The freeer might be waiting for a page to be cleaned. + * - The page can only be cleaned once it has been written to + * the database. + * - The writer cannot write the page data until it has been + * flushed to the log. + * - The log won't be flushed, unless this thread does it. + * So there could be a deadlock if I don't flush the log! + */ + if ((self = xt_get_self())) { + if (!xt_xlog_flush_log(self)) + goto failed; + } + + /* Wait for the free'er thread: */ + xt_lock_mutex_ns(&dcg->tcm_freeer_lock); + now = time(NULL); + do { + /* I have set the timeout to 2 here because of the following situation: + * 1. Transaction allocates an op seq + * 2. Transaction goes to update cache, but must wait for + * cache to be freed (after this, the op would be written to + * the log). + * 3. The free'er wants to free cache, but is waiting for the writer. + * 4. The writer cannot continue because an op seq is missing! + * So the writer is waiting for the transaction thread to write + * the op seq. + * - So we have a deadlock situation. + * - However, this situation can only occur if there is not enough + * cache.
+ * The timeout helps, but will not solve the problem, unless we + * ignore cache level here, after a while, and just continue. + */ + + /* Wake freeer before we go to sleep: */ + if (!dcg->tcm_freeer_busy) { + if (!xt_broadcast_cond_ns(&dcg->tcm_freeer_cond)) + xt_log_and_clear_exception_ns(); + } + + dcg->tcm_threads_waiting++; +#ifdef DEBUG + if (!xt_timed_wait_cond_ns(&dcg->tcm_freeer_cond, &dcg->tcm_freeer_lock, 30000)) { + dcg->tcm_threads_waiting--; + break; + } +#else + if (!xt_timed_wait_cond_ns(&dcg->tcm_freeer_cond, &dcg->tcm_freeer_lock, 1000)) { + dcg->tcm_threads_waiting--; + break; + } +#endif + dcg->tcm_threads_waiting--; + + cache_used = 0; + for (int i=0; i<XT_TC_SEGMENT_COUNT; i++) + cache_used += dcg->tcm_segment[i].tcs_cache_in_use; + + if (cache_used <= dcg->tcm_high_level) + break; + /* + * If there is too little cache we can get stuck here. + * The problem is that seq numbers are allocated before fetching a + * record to be updated. + * + * It can happen that we end up waiting for that seq number + * to be written to the log before we can continue here. + * + * This happens as follows: + * 1. This thread waits for the freeer. + * 2. The freeer cannot free a page because it has not been + * written by the writer. + * 3. The writer cannot continue because it is waiting + * for a missing sequence number. + * 4. The missing sequence number is the one allocated + * before we entered this function! + * + * So don't wait for more than 5 seconds here! + */ + } + while (time(NULL) < now + 5); + xt_unlock_mutex_ns(&dcg->tcm_freeer_lock); + } + else if (cache_used > dcg->tcm_high_level) { + /* Wake up the freeer because the cache level + * is higher than the high level. + */ + if (!dcg->tcm_freeer_busy) { + xt_lock_mutex_ns(&xt_tab_cache.tcm_freeer_lock); + if (!xt_broadcast_cond_ns(&xt_tab_cache.tcm_freeer_cond)) + xt_log_and_clear_exception_ns(); + xt_unlock_mutex_ns(&xt_tab_cache.tcm_freeer_lock); + } + } + + /* Read the page into memory....
*/ + new_page->tcp_dirty = FALSE; + new_page->tcp_seg = (xtWord1) ((page_idx + (file->fr_id * 223)) & XT_TC_SEGMENT_MASK); + new_page->tcp_lock_count = 0; + new_page->tcp_hash_idx = hash_idx; + new_page->tcp_page_idx = page_idx; + new_page->tcp_file_id = file->fr_id; + new_page->tcp_db_id = this->tci_table->tab_db->db_id; + new_page->tcp_tab_id = this->tci_table->tab_id; + new_page->tcp_data_size = this->tci_page_size; + new_page->tcp_op_seq = 0; // Value not used because not dirty + + if (read) { + if (!XT_PREAD_RR_FILE(file, address, this->tci_page_size, 0, new_page->tcp_data, &red_size, &thread->st_statistics.st_rec, thread)) + goto failed; + } + +#ifdef XT_MEMSET_UNUSED_SPACE + /* Removing this is an optimization. It should not be required + * to clear the unused space in the page. + */ + memset(new_page->tcp_data + red_size, 0, this->tci_page_size - red_size); +#endif + + /* Add the page to the cache! */ + xt_rwmutex_xlock(&seg->tcs_lock, thread->t_id); + page = seg->tcs_hash_table[hash_idx]; + while (page) { + if (page->tcp_page_idx == page_idx && page->tcp_file_id == file->fr_id) { + /* Oops, someone else was faster! 
*/ + xt_free_ns(new_page); + goto done_ok; + } + page = page->tcp_next; + } + page = new_page; + + /* Make the page the most recently used: */ + xt_lock_mutex_ns(&dcg->tcm_lock); + page->tcp_ru_time = ++dcg->tcm_ru_now; + if ((page->tcp_lr_used = dcg->tcm_mru_page)) + dcg->tcm_mru_page->tcp_mr_used = page; + page->tcp_mr_used = NULL; + dcg->tcm_mru_page = page; + if (!dcg->tcm_lru_page) + dcg->tcm_lru_page = page; + xt_unlock_mutex_ns(&dcg->tcm_lock); + + /* Add the page to the hash table: */ + page->tcp_next = seg->tcs_hash_table[hash_idx]; + seg->tcs_hash_table[hash_idx] = page; + + done_ok: + *ret_seg = seg; + *ret_page = page; +#ifdef DEBUG_CHECK_CACHE + //XT_TC_check_cache(); +#endif + thread->st_statistics.st_rec_cache_miss++; + return OK; + + failed: + xt_free_ns(new_page); + return FAILED; +} + + +/* ---------------------------------------------------------------------- + * OPERATION SEQUENCE + */ + +xtBool XTTableSeq::ts_log_no_op(XTThreadPtr thread, xtTableID tab_id, xtOpSeqNo op_seq) +{ + XTactNoOpEntryDRec ent_rec; + xtWord4 sum = (xtWord4) tab_id ^ (xtWord4) op_seq; + + ent_rec.no_status_1 = XT_LOG_ENT_NO_OP; + ent_rec.no_checksum_1 = XT_CHECKSUM_1(sum); + XT_SET_DISK_4(ent_rec.no_tab_id_4, tab_id); + XT_SET_DISK_4(ent_rec.no_op_seq_4, op_seq); + /* TODO - If this also fails we have a problem. + * From this point on we should actually not generate + * any more op IDs. The problem is that + * some will be missing, so the writer will not + * be able to continue.
+ */ + return xt_xlog_log_data(thread, sizeof(XTactNoOpEntryDRec), (XTXactLogBufferDPtr) &ent_rec, FALSE); +} + +#ifdef XT_NOT_INLINE +xtOpSeqNo XTTableSeq::ts_set_op_seq(XTTabCachePagePtr page) +{ + xtOpSeqNo seq; + + xt_lock_mutex_ns(&ts_ns_lock); + page->tcp_op_seq = seq = ts_next_seq++; + xt_unlock_mutex_ns(&ts_ns_lock); + return seq; +} + +xtOpSeqNo XTTableSeq::ts_get_op_seq() +{ + xtOpSeqNo seq; + + xt_lock_mutex_ns(&ts_ns_lock); + seq = ts_next_seq++; + xt_unlock_mutex_ns(&ts_ns_lock); + return seq; +} +#endif + +#ifdef XT_NOT_INLINE +/* + * Return TRUE if the current sequence is before the + * target (then) sequence number. This function + * takes into account overflow. Overflow is detected + * by checking the difference between the 2 values. + * If the difference is very large, then we + * assume overflow. + */ +xtBool XTTableSeq::xt_op_is_before(register xtOpSeqNo now, register xtOpSeqNo then) +{ + ASSERT_NS(sizeof(xtOpSeqNo) == 4); + /* The now time is being incremented. + * If it is after the then time (which is static), then + * it is not before! + */ + if (now >= then) { + if ((now - then) > (xtOpSeqNo) 0xFFFFFFFF/2) + return TRUE; + return FALSE; + } + + /* If it appears to be before, we still have to check + * for overflow. If the gap is bigger than half of + * the MAX value, then we can assume it has wrapped around + * because we know that no then can be so far in the + * future! + */ + if ((then - now) > (xtOpSeqNo) 0xFFFFFFFF/2) + return FALSE; + return TRUE; +} +#endif + + +/* ---------------------------------------------------------------------- + * F R E E E R P R O C E S S + */ + +/* + * Used by the writer to wake the freeer.
+ */ +xtPublic void xt_wr_wake_freeer(XTThreadPtr self) +{ + xt_lock_mutex(self, &xt_tab_cache.tcm_freeer_lock); + pushr_(xt_unlock_mutex, &xt_tab_cache.tcm_freeer_lock); + if (!xt_broadcast_cond_ns(&xt_tab_cache.tcm_freeer_cond)) + xt_log_and_clear_exception_ns(); + freer_(); // xt_unlock_mutex(&xt_tab_cache.tcm_freeer_lock) +} + +/* Wait for a transaction to quit: */ +static void tabc_fr_wait_for_cache(XTThreadPtr self, u_int msecs) +{ + if (!self->t_quit) + xt_timed_wait_cond(NULL, &xt_tab_cache.tcm_freeer_cond, &xt_tab_cache.tcm_freeer_lock, msecs); +} + +typedef struct TCResource { + XTOpenTablePtr tc_ot; +} TCResourceRec, *TCResourcePtr; + +static void tabc_free_fr_resources(XTThreadPtr self, TCResourcePtr tc) +{ + if (tc->tc_ot) { + xt_db_return_table_to_pool(self, tc->tc_ot); + tc->tc_ot = NULL; + } +} + +static XTTableHPtr tabc_get_table(XTThreadPtr self, TCResourcePtr tc, xtDatabaseID db_id, xtTableID tab_id) +{ + XTTableHPtr tab; + XTDatabaseHPtr db; + + if (tc->tc_ot) { + tab = tc->tc_ot->ot_table; + if (tab->tab_id == tab_id && tab->tab_db->db_id == db_id) + return tab; + + xt_db_return_table_to_pool(self, tc->tc_ot); + tc->tc_ot = NULL; + } + + if (!tc->tc_ot) { + if (!(db = xt_get_database_by_id(self, db_id))) + return NULL; + + pushr_(xt_heap_release, db); + tc->tc_ot = xt_db_open_pool_table(self, db, tab_id, NULL, TRUE); + freer_(); // xt_heap_release(db); + if (!tc->tc_ot) + return NULL; + } + + return tc->tc_ot->ot_table; +} + +/* + * Free the given page, or the least recently used page. + * Return the amount of bytes freed. 
+ */ +static size_t tabc_free_page(XTThreadPtr self, TCResourcePtr tc) +{ + register XTTabCacheMemPtr dcg = &xt_tab_cache; + XTTableHPtr tab = NULL; + XTTabCachePagePtr page, lpage, ppage; + XTTabCacheSegPtr seg; + u_int page_cnt; + xtBool was_dirty; + +#ifdef DEBUG_CHECK_CACHE + //XT_TC_check_cache(); +#endif + dcg->tcm_free_try_count = 0; + + retry: + /* Note, handling the page is safe because + * there is only one free'er thread which + * can remove pages from the cache! + */ + page_cnt = 0; + if (!(page = dcg->tcm_lru_page)) { + dcg->tcm_free_try_count = 0; + return 0; + } + + retry_2: + if ((was_dirty = page->tcp_dirty)) { + /* Do all this stuff without a lock, because to + * have a lock while doing this is too expensive! + */ + + /* Wait for the page to be cleaned. */ + tab = tabc_get_table(self, tc, page->tcp_db_id, page->tcp_tab_id); + } + + seg = &dcg->tcm_segment[page->tcp_seg]; + xt_rwmutex_xlock(&seg->tcs_lock, self->t_id); + + if (page->tcp_dirty) { + if (!was_dirty) { + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + goto retry_2; + } + + if (tab) { + ASSERT(!XTTableSeq::xt_op_is_before(tab->tab_seq.ts_next_seq, page->tcp_op_seq+1)); + /* This should never happen. However, it has been occurring, + * during multi_update test on Windows. + * In particular it occurs after rename of a table, during ALTER. + * As if the table was not flushed before the rename!? + * To guard against an infinite loop below, I will just continue here. + */ + if (XTTableSeq::xt_op_is_before(tab->tab_seq.ts_next_seq, page->tcp_op_seq+1)) + goto go_on; + /* OK, we have the table, now we check where the current + * sequence number is. + */ + if (XTTableSeq::xt_op_is_before(tab->tab_head_op_seq, page->tcp_op_seq)) { + XTDatabaseHPtr db = tab->tab_db; + + rewait: + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + + /* Flush the log, in case this is holding up the + * writer! 
+ */ + if (!db->db_xlog.xlog_flush(self)) { + dcg->tcm_free_try_count = 0; + xt_throw(self); + } + + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + + /* The freeer is now waiting: */ + db->db_wr_freeer_waiting = TRUE; + + /* If the writer is idle, wake it up. + * The writer will commit the changes to the database + * which will allow the freeer to free up the cache. + */ + if (db->db_wr_idle) { + if (!xt_broadcast_cond_ns(&db->db_wr_cond)) + xt_log_and_clear_exception_ns(); + } + + /* Go to sleep on the writer's condition. + * The writer will wake the free'er before it goes to + * sleep! + */ + tab->tab_wake_freeer_op = page->tcp_op_seq; + tab->tab_wr_wake_freeer = TRUE; + if (!xt_timed_wait_cond_ns(&db->db_wr_cond, &db->db_wr_lock, 30000)) { + tab->tab_wr_wake_freeer = FALSE; + db->db_wr_freeer_waiting = FALSE; + xt_throw(self); + } + tab->tab_wr_wake_freeer = FALSE; + db->db_wr_freeer_waiting = FALSE; + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + xt_rwmutex_xlock(&seg->tcs_lock, self->t_id); + if (XTTableSeq::xt_op_is_before(tab->tab_head_op_seq, page->tcp_op_seq)) + goto rewait; + } + go_on:; + } + } + + /* Wait if the page is being read or locked. */ + if (page->tcp_lock_count) { + /* (1) If the page is being read, then we should not free + * it immediately. + * (2) If a page is locked, the locker may be waiting + * for the freeer to free some cache - this + * causes a deadlock. + * + * Therefore, we move on, and try to free another page... + */ + if (page_cnt < (dcg->tcm_approx_page_count >> 1)) { + /* Page has not changed MRU position, and we + * have looked at less than half of the pages. + * Go to the next page... 
+ */ + if ((page = page->tcp_mr_used)) { + page_cnt++; + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + goto retry_2; + } + } + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + dcg->tcm_free_try_count++; + + /* Starting to spin, free the threads: */ + if (dcg->tcm_threads_waiting) { + if (!xt_broadcast_cond_ns(&dcg->tcm_freeer_cond)) + xt_log_and_clear_exception_ns(); + } + goto retry; + } + + /* Page is clean, remove from the hash table: */ + + /* Find the page on the list: */ + u_int page_idx = page->tcp_page_idx; + u_int file_id = page->tcp_file_id; + + ppage = NULL; + lpage = seg->tcs_hash_table[page->tcp_hash_idx]; + while (lpage) { + if (lpage->tcp_page_idx == page_idx && lpage->tcp_file_id == file_id) + break; + ppage = lpage; + lpage = lpage->tcp_next; + } + + if (page == lpage) { + /* Should be the case! */ + if (ppage) + ppage->tcp_next = page->tcp_next; + else + seg->tcs_hash_table[page->tcp_hash_idx] = page->tcp_next; + } +#ifdef DEBUG + else + ASSERT_NS(FALSE); +#endif + + /* Remove from the MRU list: */ + xt_lock_mutex_ns(&dcg->tcm_lock); + if (dcg->tcm_lru_page == page) + dcg->tcm_lru_page = page->tcp_mr_used; + if (dcg->tcm_mru_page == page) + dcg->tcm_mru_page = page->tcp_lr_used; + if (page->tcp_lr_used) + page->tcp_lr_used->tcp_mr_used = page->tcp_mr_used; + if (page->tcp_mr_used) + page->tcp_mr_used->tcp_lr_used = page->tcp_lr_used; + xt_unlock_mutex_ns(&dcg->tcm_lock); + + /* Free the page: */ + size_t freed_space = offsetof(XTTabCachePageRec, tcp_data) + page->tcp_data_size; + seg->tcs_cache_in_use -= freed_space; + xt_free_ns(page); + + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + self->st_statistics.st_rec_cache_frees++; + dcg->tcm_free_try_count = 0; + return freed_space; +} + +static void tabc_fr_main(XTThreadPtr self) +{ + register XTTabCacheMemPtr dcg = &xt_tab_cache; + TCResourceRec tc = { 0 }; + + xt_set_low_priority(self); + dcg->tcm_freeer_busy = TRUE; + + while (!self->t_quit) { + size_t cache_used, freed; + + 
pushr_(tabc_free_fr_resources, &tc); + + while (!self->t_quit) { + /* Total up the cache memory used: */ + cache_used = 0; + for (int i=0; i<XT_TC_SEGMENT_COUNT; i++) + cache_used += dcg->tcm_segment[i].tcs_cache_in_use; + if (cache_used > dcg->tcm_cache_high) { + dcg->tcm_cache_high = cache_used; + } + + /* Check if the cache usage is over 95%: */ + if (self->t_quit || cache_used < dcg->tcm_high_level) + break; + + /* Reduce cache to the 75% level: */ + while (!self->t_quit && cache_used > dcg->tcm_low_level) { + freed = tabc_free_page(self, &tc); + cache_used -= freed; + if (cache_used <= dcg->tcm_high_level) { + /* Wakeup any threads that are waiting for some cache to be + * freed. + */ + if (dcg->tcm_threads_waiting) { + if (!xt_broadcast_cond_ns(&dcg->tcm_freeer_cond)) + xt_log_and_clear_exception_ns(); + } + } + } + } + + freer_(); // tabc_free_fr_resources(&tc) + + xt_lock_mutex(self, &dcg->tcm_freeer_lock); + pushr_(xt_unlock_mutex, &dcg->tcm_freeer_lock); + + if (dcg->tcm_threads_waiting) { + /* Wake threads before we go to sleep: */ + if (!xt_broadcast_cond_ns(&dcg->tcm_freeer_cond)) + xt_log_and_clear_exception_ns(); + } + + /* Wait for a thread that allocates data to signal + * that the cache level has exceeded the upper limit: + */ + xt_db_approximate_time = time(NULL); + dcg->tcm_freeer_busy = FALSE; + tabc_fr_wait_for_cache(self, 500); + //tabc_fr_wait_for_cache(self, 30*1000); + dcg->tcm_freeer_busy = TRUE; + xt_db_approximate_time = time(NULL); + freer_(); // xt_unlock_mutex(&dcg->tcm_freeer_lock) + } +} + +static void *tabc_fr_run_thread(XTThreadPtr self) +{ + int count; + void *mysql_thread; + + mysql_thread = myxt_create_thread(); + + while (!self->t_quit) { + try_(a) { + tabc_fr_main(self); + } + catch_(a) { + /* This error is "normal"! 
*/ + if (!(self->t_exception.e_xt_err == XT_SIGNAL_CAUGHT && + self->t_exception.e_sys_err == SIGTERM)) + xt_log_and_clear_exception(self); + } + cont_(a); + + /* After an exception, pause before trying again... */ + /* Number of seconds */ +#ifdef DEBUG + count = 10; +#else + count = 2*60; +#endif + while (!self->t_quit && count > 0) { + xt_db_approximate_time = xt_trace_clock(); + sleep(1); + count--; + } + } + + myxt_destroy_thread(mysql_thread, TRUE); + return NULL; +} + +static void tabc_fr_free_thread(XTThreadPtr self, void *data __attribute__((unused))) +{ + if (xt_tab_cache.tcm_freeer_thread) { + xt_lock_mutex(self, &xt_tab_cache.tcm_freeer_lock); + pushr_(xt_unlock_mutex, &xt_tab_cache.tcm_freeer_lock); + xt_tab_cache.tcm_freeer_thread = NULL; + freer_(); // xt_unlock_mutex(&xt_tab_cache.tcm_freeer_lock) + } +} + +xtPublic void xt_start_freeer(XTThreadPtr self) +{ + xt_tab_cache.tcm_freeer_thread = xt_create_daemon(self, "free-er"); + xt_set_thread_data(xt_tab_cache.tcm_freeer_thread, NULL, tabc_fr_free_thread); + xt_run_thread(self, xt_tab_cache.tcm_freeer_thread, tabc_fr_run_thread); +} + +xtPublic void xt_quit_freeer(XTThreadPtr self) +{ + if (xt_tab_cache.tcm_freeer_thread) { + xt_lock_mutex(self, &xt_tab_cache.tcm_freeer_lock); + pushr_(xt_unlock_mutex, &xt_tab_cache.tcm_freeer_lock); + xt_terminate_thread(self, xt_tab_cache.tcm_freeer_thread); + freer_(); // xt_unlock_mutex(&xt_tab_cache.tcm_freeer_lock) + } +} + +xtPublic void xt_stop_freeer(XTThreadPtr self) +{ + XTThreadPtr thr_fr; + + if (xt_tab_cache.tcm_freeer_thread) { + xt_lock_mutex(self, &xt_tab_cache.tcm_freeer_lock); + pushr_(xt_unlock_mutex, &xt_tab_cache.tcm_freeer_lock); + + /* This pointer is safe as long as you have the transaction lock. */ + if ((thr_fr = xt_tab_cache.tcm_freeer_thread)) { + xtThreadID tid = thr_fr->t_id; + + /* Make sure the thread quits when woken up. 
*/ + xt_terminate_thread(self, thr_fr); + + /* Wake the freeer to get it to quit: */ + if (!xt_broadcast_cond_ns(&xt_tab_cache.tcm_freeer_cond)) + xt_log_and_clear_exception_ns(); + + freer_(); // xt_unlock_mutex(&xt_tab_cache.tcm_freeer_lock) + + /* + * GOTCHA: This is a weird thing but the SIGTERM directed + * at a particular thread (in this case the sweeper) was + * being caught by a different thread and killing the server + * sometimes. Disconcerting. + * (this may only be a problem on Mac OS X) + xt_kill_thread(thread); + */ + xt_wait_for_thread(tid, FALSE); + + /* PMC - This should not be necessary to set the signal here, but in the + * debugger the handler is not called!!? + thr_fr->t_delayed_signal = SIGTERM; + xt_kill_thread(thread); + */ + xt_tab_cache.tcm_freeer_thread = NULL; + } + else + freer_(); // xt_unlock_mutex(&xt_tab_cache.tcm_freeer_lock) + } +} + +xtPublic void xt_load_pages(XTThreadPtr self, XTOpenTablePtr ot) +{ + XTTableHPtr tab = ot->ot_table; + xtRecordID rec_id; + XTTabCachePagePtr page; + XTTabCacheSegPtr seg; + size_t poffset; + + rec_id = 1; + while (rec_id<tab->tab_row_eof_id) { + if (!tab->tab_rows.tc_fetch(ot->ot_row_file, rec_id, &seg, &page, &poffset, TRUE, self)) + xt_throw(self); + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + rec_id += tab->tab_rows.tci_rows_per_page; + } + + rec_id = 1; + while (rec_id<tab->tab_rec_eof_id) { + if (!tab->tab_recs.tc_fetch(ot->ot_rec_file, rec_id, &seg, &page, &poffset, TRUE, self)) + xt_throw(self); + xt_rwmutex_unlock(&seg->tcs_lock, self->t_id); + rec_id += tab->tab_recs.tci_rows_per_page; + } +} + + diff --git a/storage/pbxt/src/tabcache_xt.h b/storage/pbxt/src/tabcache_xt.h new file mode 100644 index 00000000000..694244835b4 --- /dev/null +++ b/storage/pbxt/src/tabcache_xt.h @@ -0,0 +1,250 @@ +/* Copyright (c) 2007 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public 
License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2007-10-31 Paul McCullagh + * + * H&G2JCtL + * + * The new table cache. Caches all non-index data. This includes the data + * files and the row pointer files. + */ +#ifndef __tabcache_h__ +#define __tabcache_h__ + +struct XTTable; +struct XTOpenTable; +struct XTTabCache; + +#include "thread_xt.h" +#include "filesys_xt.h" +#include "lock_xt.h" + +#ifdef DEBUG +//#define XT_USE_CACHE_DEBUG_SIZES +//#define XT_NOT_INLINE +#endif + +#ifdef XT_USE_CACHE_DEBUG_SIZES + +#define XT_TC_PAGE_SIZE (4*1024) +#define XT_TC_SEGMENT_SHIFTS 1 + +#else + +#define XT_TC_PAGE_SIZE (32*1024) +#define XT_TC_SEGMENT_SHIFTS 3 + +#endif + +#define XT_TIME_DIFF(start, now) (\ + ((xtWord4) (now) < (xtWord4) (start)) ? \ + ((xtWord4) 0XFFFFFFFF - ((xtWord4) (start) - (xtWord4) (now))) : \ + ((xtWord4) (now) - (xtWord4) (start))) + +#define XT_TC_SEGMENT_COUNT ((off_t) 1 << XT_TC_SEGMENT_SHIFTS) +#define XT_TC_SEGMENT_MASK (XT_TC_SEGMENT_COUNT - 1) + +typedef struct XTTabCachePage { + xtWord1 tcp_dirty; /* TRUE if the page is dirty. */ + xtWord1 tcp_seg; /* Segment number of the page. */ + u_int tcp_lock_count; /* Number of read locks on this page. */ + u_int tcp_hash_idx; /* The hash index of the page. */ + u_int tcp_page_idx; /* The page address. */ + u_int tcp_file_id; /* The file id of the page. */ + xtDatabaseID tcp_db_id; /* The ID of the database. 
*/ + xtTableID tcp_tab_id; /* The ID of the table of this cache page. */ + xtWord4 tcp_data_size; /* Size of the data on this page. */ + xtOpSeqNo tcp_op_seq; /* The operation sequence number (dirty pages have an operation sequence) */ + xtWord4 tcp_ru_time; /* If this is in the top 1/4 don't change position in MRU list. */ + struct XTTabCachePage *tcp_next; /* Pointer to next page on hash list, or next free page on free list. */ + struct XTTabCachePage *tcp_mr_used; /* More recently used pages. */ + struct XTTabCachePage *tcp_lr_used; /* Less recently used pages. */ + xtWord1 tcp_data[XT_TC_PAGE_SIZE]; /* This is actually tci_page_size! */ +} XTTabCachePageRec, *XTTabCachePagePtr; + +/* + * Each table has a "table operation sequence". This sequence is incremented by + * each operation on the table. Each operation in the log is tagged by a + * sequence number. + * + * The writer threads re-order operations in the log, and write the operations + * to the database in sequence. + * + * It is safe to free a cache page when the sequence number of the cache page, + * is less than or equal to the written sequence number. + */ +typedef struct XTTableSeq { + xtOpSeqNo ts_next_seq; /* The next sequence number for operations on the table. */ + xt_mutex_type ts_ns_lock; /* Lock for the next sequence number. */ + + xtBool ts_log_no_op(XTThreadPtr thread, xtTableID tab_id, xtOpSeqNo op_seq); + + /* Return the next operation sequence number. 
*/ +#ifdef XT_NOT_INLINE + xtOpSeqNo ts_set_op_seq(XTTabCachePagePtr page); + + xtOpSeqNo ts_get_op_seq(); +#else + xtOpSeqNo ts_set_op_seq(XTTabCachePagePtr page) + { + xtOpSeqNo seq; + + xt_lock_mutex_ns(&ts_ns_lock); + page->tcp_op_seq = seq = ts_next_seq++; + xt_unlock_mutex_ns(&ts_ns_lock); + return seq; + } + + xtOpSeqNo ts_get_op_seq() + { + xtOpSeqNo seq; + + xt_lock_mutex_ns(&ts_ns_lock); + seq = ts_next_seq++; + xt_unlock_mutex_ns(&ts_ns_lock); + return seq; + } +#endif + + void xt_op_seq_init(XTThreadPtr self) { + xt_init_mutex_with_autoname(self, &ts_ns_lock); + } + + void xt_op_seq_set(XTThreadPtr self __attribute__((unused)), xtOpSeqNo n) { + ts_next_seq = n; + } + + void xt_op_seq_exit(XTThreadPtr self __attribute__((unused))) { + xt_free_mutex(&ts_ns_lock); + } + +#ifdef XT_NOT_INLINE + static xtBool xt_op_is_before(register xtOpSeqNo now, register xtOpSeqNo then); +#else + static inline xtBool xt_op_is_before(register xtOpSeqNo now, register xtOpSeqNo then) + { + if (now >= then) { + if ((now - then) > (xtOpSeqNo) 0xFFFFFFFF/2) + return TRUE; + return FALSE; + } + if ((then - now) > (xtOpSeqNo) 0xFFFFFFFF/2) + return FALSE; + return TRUE; + } +#endif +} XTTableSeqRec, *XTTableSeqPtr; + +/* A disk cache segment. The cache is divided into a number of segments + * to improve concurrency. + */ +typedef struct XTTabCacheSeg { + XTRWMutexRec tcs_lock; /* The cache segment read/write lock. */ + //xt_cond_type tcs_cond; + XTTabCachePagePtr *tcs_hash_table; + size_t tcs_cache_in_use; +} XTTabCacheSegRec, *XTTabCacheSegPtr; + +/* + * The free'er thread has a list of tables to be purged from the cache. + * If a table is in the list then it is not allowed to fetch a cache page from + * that table. + * The free'er thread goes through all the cache, and removes + * all cache pages for any table in the purge list. + * When a table has been purged it signals any threads waiting for the + * purge to complete (this is usually due to a drop table). 
+ */ +typedef struct XTTabCachePurge { + int tcp_state; /* The state of the purge. */ + XTTableSeqPtr tcp_tab_seq; /* Identifies the table to be purged from cache. */ +} XTTabCachePurgeRec, *XTTabCachePurgePtr; + +typedef struct XTTabCacheMem { + xt_mutex_type tcm_lock; /* The public cache lock. */ + xt_cond_type tcm_cond; /* The public cache wait condition. */ + XTTabCacheSegRec tcm_segment[XT_TC_SEGMENT_COUNT]; + XTTabCachePagePtr tcm_lru_page; + XTTabCachePagePtr tcm_mru_page; + xtWord4 tcm_ru_now; + size_t tcm_approx_page_count; + size_t tcm_hash_size; + u_int tcm_writer_thread_count; + size_t tcm_cache_size; + size_t tcm_cache_high; /* The high water level of cache allocation. */ + size_t tcm_low_level; /* This is the level to which the freeer will free, once it starts working. */ + size_t tcm_high_level; /* This is the level at which the freeer will start to work (to avoid waiting)! */ + + /* The free'er thread: */ + struct XTThread *tcm_freeer_thread; /* The freeer thread . */ + xt_mutex_type tcm_freeer_lock; /* The public cache lock. */ + xt_cond_type tcm_freeer_cond; /* The public cache wait condition. */ + u_int tcm_purge_list_len; /* The length of the purge list. */ + XTTabCachePurgePtr tcm_purge_list; /* Non-NULL if a table is to be purged. */ + u_int tcm_threads_waiting; /* Count of the number of threads waiting for the freeer. */ + xtBool tcm_freeer_busy; + u_int tcm_free_try_count; +} XTTabCacheMemRec, *XTTabCacheMemPtr; + +/* + * This structure contains the information about a particular table + * for the cache. Each table has its own page size, row size + * and rows per page. 
+ * Tables also have + */ +typedef struct XTTabCache { + struct XTTable *tci_table; + size_t tci_header_size; + size_t tci_page_size; + size_t tci_rec_size; + size_t tci_rows_per_page; + +public: + void xt_tc_setup(struct XTTable *tab, size_t head_size, size_t row_size); + xtBool xt_tc_write(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, size_t offset, size_t size, xtWord1 *data, xtOpSeqNo *op_seq, xtBool read, XTThreadPtr thread); + xtBool xt_tc_write_cond(XTThreadPtr self, XT_ROW_REC_FILE_PTR file, xtRefID ref_id, xtWord1 new_type, xtOpSeqNo *op_seq, xtXactID xn_id, xtRowID row_id, u_int stat_id, u_int rec_type); + xtBool xt_tc_read(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, size_t size, xtWord1 *data, XTThreadPtr thread); + xtBool xt_tc_read_4(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, xtWord4 *data, XTThreadPtr thread); + xtBool xt_tc_read_page(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, xtWord1 *data, XTThreadPtr thread); + xtBool xt_tc_get_page(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, XTTabCachePagePtr *page, size_t *offset, XTThreadPtr thread); + void xt_tc_release_page(XT_ROW_REC_FILE_PTR file, XTTabCachePagePtr page, XTThreadPtr thread); + xtBool tc_fetch(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, XTTabCacheSegPtr *ret_seg, XTTabCachePagePtr *ret_page, size_t *offset, xtBool read, XTThreadPtr thread); + +private: + xtBool tc_read_direct(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, size_t size, xtWord1 *data, XTThreadPtr thread); + xtBool tc_fetch_direct(XT_ROW_REC_FILE_PTR file, xtRefID ref_id, XTTabCacheSegPtr *ret_seg, XTTabCachePagePtr *ret_page, size_t *offset, XTThreadPtr thread); +} XTTabCacheRec, *XTTabCachePtr; + +extern XTTabCacheMemRec xt_tab_cache; + +void xt_tc_init(XTThreadPtr self, size_t cache_size); +void xt_tc_exit(XTThreadPtr self); +void xt_tc_set_cache_size(size_t cache_size); +xtInt8 xt_tc_get_usage(); +xtInt8 xt_tc_get_size(); +xtInt8 xt_tc_get_high(); +void xt_load_pages(XTThreadPtr self, struct XTOpenTable *ot); +#ifdef DEBUG +void 
xt_check_table_cache(struct XTTable *tab); +#endif + +void xt_quit_freeer(XTThreadPtr self); +void xt_stop_freeer(XTThreadPtr self); +void xt_start_freeer(XTThreadPtr self); +void xt_wr_wake_freeer(XTThreadPtr self); + +#endif diff --git a/storage/pbxt/src/table_xt.cc b/storage/pbxt/src/table_xt.cc new file mode 100644 index 00000000000..57021f89b66 --- /dev/null +++ b/storage/pbxt/src/table_xt.cc @@ -0,0 +1,5014 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-08 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <string.h> +#include <stdio.h> +#ifndef XT_WIN +#include <strings.h> +#endif +#include <ctype.h> +#include <time.h> + +#ifdef DRIZZLED +#include <drizzled/common.h> +#include <mysys/thr_lock.h> +#include <drizzled/dtcollation.h> +#include <drizzled/handlerton.h> +#else +#include "mysql_priv.h" +#endif + +#include "table_xt.h" +#include "database_xt.h" +#include "heap_xt.h" +#include "strutil_xt.h" +#include "myxt_xt.h" +#include "cache_xt.h" +#include "trace_xt.h" +#ifdef XT_STREAMING +#include "streaming_xt.h" +#endif +#include "index_xt.h" +#include "restart_xt.h" +#include "systab_xt.h" + +#ifdef DEBUG +//#define TRACE_VARIATIONS +//#define TRACE_VARIATIONS_IN_DUP_CHECK +//#define DUMP_CHECK_TABLE +//#define CHECK_INDEX_ON_CHECK_TABLE +//#define TRACE_TABLE_IDS +//#define TRACE_FLUSH +//#define TRACE_CREATE_TABLES +#endif + +#define CHECK_TABLE_STATS + +#ifdef TRACE_TABLE_IDS +//#define PRINTF xt_ftracef +#define PRINTF xt_trace +#endif + +/* + * ----------------------------------------------------------------------- + * Internal structures + */ + +#define XT_MAX_TABLE_FILE_NAME_SIZE (XT_TABLE_NAME_SIZE+6+40) + +/* + * ----------------------------------------------------------------------- + * Compare paths: + */ + +/* GOTCHA! The problem: + * + * The server uses names like: "./test/my_tab", + * the BLOB streaming engine uses: "test/my_tab" + * which leads to the same table being loaded twice. 
+ */ +xtPublic int xt_tab_compare_paths(char *n1, char *n2) +{ + n1 = xt_last_2_names_of_path(n1); + n2 = xt_last_2_names_of_path(n2); + if (pbxt_ignore_case) + return strcasecmp(n1, n2); + return strcmp(n1, n2); +} + +/* + * This function compares only the last 2 components of + * the path because table names must differ in this area. + */ +xtPublic int xt_tab_compare_names(const char *n1, const char *n2) +{ + n1 = xt_last_2_names_of_path(n1); + n2 = xt_last_2_names_of_path(n2); + if (pbxt_ignore_case) + return strcasecmp(n1, n2); + return strcmp(n1, n2); +} + +/* + * ----------------------------------------------------------------------- + * Private utilities + */ + +static xtBool tab_list_comp(void *key, void *data) +{ + XTTableHPtr tab = (XTTableHPtr) data; + + return strcmp(xt_last_2_names_of_path((char *) key), xt_last_2_names_of_path(tab->tab_name->ps_path)) == 0; +} + +static xtHashValue tab_list_hash(xtBool is_key, void *key_data) +{ + XTTableHPtr tab = (XTTableHPtr) key_data; + + if (is_key) + return xt_ht_hash(xt_last_2_names_of_path((char *) key_data)); + return xt_ht_hash(xt_last_2_names_of_path(tab->tab_name->ps_path)); +} + +static xtBool tab_list_comp_ci(void *key, void *data) +{ + XTTableHPtr tab = (XTTableHPtr) data; + + return strcasecmp(xt_last_2_names_of_path((char *) key), xt_last_2_names_of_path(tab->tab_name->ps_path)) == 0; +} + +static xtHashValue tab_list_hash_ci(xtBool is_key, void *key_data) +{ + XTTableHPtr tab = (XTTableHPtr) key_data; + + if (is_key) + return xt_ht_casehash(xt_last_2_names_of_path((char *) key_data)); + return xt_ht_casehash(xt_last_2_names_of_path(tab->tab_name->ps_path)); +} + +static void tab_list_free(XTThreadPtr self, void *data) +{ + XTTableHPtr tab = (XTTableHPtr) data; + XTDatabaseHPtr db = tab->tab_db; + XTTableEntryPtr te_ptr; + + /* Remove the reference from the ID list, when the table is + * removed from the name list: + */ + if ((te_ptr = (XTTableEntryPtr) xt_sl_find(self, db->db_table_by_id, 
&tab->tab_id))) + te_ptr->te_table = NULL; + + if (tab->tab_dic.dic_table) + tab->tab_dic.dic_table->removeReferences(self); + xt_heap_release(self, tab); +} + +static void tab_close_mapped_files(XTThreadPtr self, XTTableHPtr tab) +{ + if (tab->tab_rec_file) { + xt_fs_release_file(self, tab->tab_rec_file); + tab->tab_rec_file = NULL; + } + if (tab->tab_row_file) { + xt_fs_release_file(self, tab->tab_row_file); + tab->tab_row_file = NULL; + } +} + +static void tab_finalize(XTThreadPtr self, void *x) +{ + XTTableHPtr tab = (XTTableHPtr) x; + + xt_exit_row_locks(&tab->tab_locks); + + xt_xres_exit_tab(self, tab); + + if (tab->tab_ind_free_list) { + XTIndFreeListPtr list, flist; + + list = tab->tab_ind_free_list; + while (list) { + flist = list; + list = list->fl_next_list; + xt_free(self, flist); + } + tab->tab_ind_free_list = NULL; + } + + if (tab->tab_ind_file) { + xt_fs_release_file(self, tab->tab_ind_file); + tab->tab_ind_file = NULL; + } + tab_close_mapped_files(self, tab); + + if (tab->tab_index_head) { + xt_free(self, tab->tab_index_head); + tab->tab_index_head = NULL; + } + +#ifdef TRACE_TABLE_IDS + PRINTF("%s: free TABLE: db=%d tab=%d %s\n", self->t_name, (int) tab->tab_db ? tab->tab_db->db_id : 0, (int) tab->tab_id, + tab->tab_name ? 
xt_last_2_names_of_path(tab->tab_name->ps_path) : "?"); +#endif + if (tab->tab_name) { + xt_free(self, tab->tab_name); + tab->tab_name = NULL; + } + myxt_free_dictionary(self, &tab->tab_dic); + if (tab->tab_free_locks) { + tab->tab_seq.xt_op_seq_exit(self); + xt_spinlock_free(self, &tab->tab_ainc_lock); + xt_free_mutex(&tab->tab_rec_flush_lock); + xt_free_mutex(&tab->tab_ind_flush_lock); + xt_free_mutex(&tab->tab_dic_field_lock); + xt_free_mutex(&tab->tab_row_lock); + xt_free_mutex(&tab->tab_ind_lock); + xt_free_mutex(&tab->tab_rec_lock); + for (u_int i=0; i<XT_ROW_RWLOCKS; i++) + XT_TAB_ROW_FREE_LOCK(self, &tab->tab_row_rwlock[i]); + } +} + +static void tab_onrelease(XTThreadPtr self, void *x) +{ + XTTableHPtr tab = (XTTableHPtr) x; + + /* Signal threads waiting for exclusive use of the table: */ + if (tab->tab_db->db_tables) + xt_ht_signal(self, tab->tab_db->db_tables); +} + +/* + * ----------------------------------------------------------------------- + * PUBLIC METHODS + */ + +/* + * This function sets the table name to "", if the file + * does not belong to XT. + */ +xtPublic char *xt_tab_file_to_name(size_t size, char *tab_name, char *file_name) +{ + char *cptr; + size_t len; + + file_name = xt_last_name_of_path(file_name); + cptr = file_name + strlen(file_name) - 1; + while (cptr > file_name && *cptr != '.') + cptr--; + if (cptr > file_name && *cptr == '.') { + if (strcmp(cptr, ".xtl") == 0 || strcmp(cptr, ".xtr") == 0) { + cptr--; + while (cptr > file_name && isdigit(*cptr)) + cptr--; + } + else { + const char **ext = pbxt_extensions; + + while (*ext) { + if (strcmp(cptr, *ext) == 0) + goto ret_name; + ext++; + } + cptr = file_name; + } + } + + ret_name: + len = cptr - file_name; + if (len > size-1) + len = size-1; + + memcpy(tab_name, file_name, len); + tab_name[len] = 0; + + /* Return a pointer to what was removed! 
*/ + return file_name + len; +} + +static void tab_get_row_file_name(char *table_name, char *name, xtTableID tab_id) +{ + sprintf(table_name, "%s-%lu.xtr", name, (u_long) tab_id); +} + +static void tab_get_data_file_name(char *table_name, char *name, xtTableID tab_id __attribute__((unused))) +{ + sprintf(table_name, "%s.xtd", name); +} + +static void tab_get_index_file_name(char *table_name, char *name, xtTableID tab_id __attribute__((unused))) +{ + sprintf(table_name, "%s.xti", name); +} + +static void tab_free_by_id(XTThreadPtr self __attribute__((unused)), void *thunk __attribute__((unused)), void *item) +{ + XTTableEntryPtr te_ptr = (XTTableEntryPtr) item; + + if (te_ptr->te_tab_name) { + xt_free(self, te_ptr->te_tab_name); + te_ptr->te_tab_name = NULL; + } + te_ptr->te_tab_id = 0; + te_ptr->te_table = NULL; +} + +static int tab_comp_by_id(XTThreadPtr self __attribute__((unused)), register const void *thunk __attribute__((unused)), register const void *a, register const void *b) +{ + xtTableID te_id = *((xtTableID *) a); + XTTableEntryPtr te_ptr = (XTTableEntryPtr) b; + + if (te_id < te_ptr->te_tab_id) + return -1; + if (te_id == te_ptr->te_tab_id) + return 0; + return 1; +} + +static void tab_free_path(XTThreadPtr self __attribute__((unused)), void *thunk __attribute__((unused)), void *item) +{ + XTTablePathPtr tp_ptr = *((XTTablePathPtr *) item); + + xt_free(self, tp_ptr); +} + +static int tab_comp_path(XTThreadPtr self __attribute__((unused)), register const void *thunk __attribute__((unused)), register const void *a, register const void *b) +{ + char *path = (char *) a; + XTTablePathPtr tp_ptr = *((XTTablePathPtr *) b); + + return xt_tab_compare_paths(path, tp_ptr->tp_path); +} + +xtPublic void xt_describe_tables_init(XTThreadPtr self, XTDatabaseHPtr db, XTTableDescPtr td) +{ + td->td_db = db; + td->td_path_idx = 0; + if (td->td_path_idx < xt_sl_get_size(db->db_table_paths)) { + XTTablePathPtr *tp_ptr; + + tp_ptr = (XTTablePathPtr *) 
xt_sl_item_at(db->db_table_paths, td->td_path_idx); + td->td_tab_path = *tp_ptr; + td->td_open_dir = xt_dir_open(self, td->td_tab_path->tp_path, "*.xtr"); + } + else + td->td_open_dir = NULL; +} + +xtPublic xtBool xt_describe_tables_next(XTThreadPtr self, XTTableDescPtr td) +{ + char *tab_name; + xtBool r = FALSE; + + enter_(); + retry: + if (!td->td_open_dir) + return_(FALSE); + try_(a) { + r = xt_dir_next(self, td->td_open_dir); + } + catch_(a) { + xt_describe_tables_exit(self, td); + throw_(); + } + cont_(a); + if (!r) { + XTTablePathPtr *tp_ptr; + + if (td->td_path_idx+1 >= xt_sl_get_size(td->td_db->db_table_paths)) + return_(FALSE); + + if (td->td_open_dir) + xt_dir_close(NULL, td->td_open_dir); + td->td_open_dir = NULL; + + td->td_path_idx++; + tp_ptr = (XTTablePathPtr *) xt_sl_item_at(td->td_db->db_table_paths, td->td_path_idx); + td->td_tab_path = *tp_ptr; + td->td_open_dir = xt_dir_open(self, td->td_tab_path->tp_path, "*.xtr"); + goto retry; + } + + tab_name = xt_dir_name(self, td->td_open_dir); + td->td_file_name = tab_name; + td->td_tab_id = (xtTableID) xt_file_name_to_id(tab_name); + xt_tab_file_to_name(XT_TABLE_NAME_SIZE, td->td_tab_name, tab_name); + return_(TRUE); +} + +xtPublic void xt_describe_tables_exit(XTThreadPtr self __attribute__((unused)), XTTableDescPtr td) +{ + if (td->td_open_dir) + xt_dir_close(NULL, td->td_open_dir); + td->td_open_dir = NULL; + td->td_tab_path = NULL; +} + +xtPublic void xt_tab_init_db(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTTableDescRec desc; + XTTableEntryRec te_tab; + XTTablePathPtr db_path; + int len; + + enter_(); + pushr_(xt_tab_exit_db, db); + if (pbxt_ignore_case) + db->db_tables = xt_new_hashtable(self, tab_list_comp_ci, tab_list_hash_ci, tab_list_free, TRUE, TRUE); + else + db->db_tables = xt_new_hashtable(self, tab_list_comp, tab_list_hash, tab_list_free, TRUE, TRUE); + db->db_table_by_id = xt_new_sortedlist(self, sizeof(XTTableEntryRec), 20, 20, tab_comp_by_id, db, tab_free_by_id, FALSE, FALSE); + 
db->db_table_paths = xt_new_sortedlist(self, sizeof(XTTablePathPtr), 20, 20, tab_comp_path, db, tab_free_path, FALSE, FALSE); + + if (db->db_multi_path) { + XTOpenFilePtr of; + char *buffer, *ptr, *path; + char pbuf[PATH_MAX]; + + xt_strcpy(PATH_MAX, pbuf, db->db_main_path); + xt_add_location_file(PATH_MAX, pbuf); + if (xt_fs_exists(pbuf)) { + of = xt_open_file(self, pbuf, XT_FS_DEFAULT); + pushr_(xt_close_file, of); + len = (int) xt_seek_eof_file(self, of); + buffer = (char *) xt_malloc(self, len + 1); + pushr_(xt_free, buffer); + if (!xt_pread_file(of, 0, len, len, buffer, NULL, &self->st_statistics.st_x, self)) + xt_throw(self); + buffer[len] = 0; + ptr = buffer; + while (*ptr) { + /* Ignore preceding space: */ + while (*ptr && isspace(*ptr)) + ptr++; + path = ptr; + while (*ptr && *ptr != '\n' && *ptr != '\r') { +#ifdef XT_WIN + /* Undo the conversion below: */ + if (*ptr == '/') + *ptr = '\\'; +#endif + ptr++; + } + if (*path != '#' && ptr > path) { + len = (int) (ptr - path); + db_path = (XTTablePathPtr) xt_malloc(self, offsetof(XTTablePathRec, tp_path) + len + 1); + db_path->tp_tab_count = 0; + memcpy(db_path->tp_path, path, len); + db_path->tp_path[len] = 0; + xt_sl_insert(self, db->db_table_paths, db_path->tp_path, &db_path); + } + ptr++; + } + freer_(); // xt_free(buffer) + freer_(); // xt_close_file(of) + } + } + else { + len = (int) strlen(db->db_main_path); + db_path = (XTTablePathPtr) xt_malloc(self, offsetof(XTTablePathRec, tp_path) + len + 1); + db_path->tp_tab_count = 0; + strcpy(db_path->tp_path, db->db_main_path); + xt_sl_insert(self, db->db_table_paths, db_path->tp_path, &db_path); + } + + xt_describe_tables_init(self, db, &desc); + pushr_(xt_describe_tables_exit, &desc); + while (xt_describe_tables_next(self, &desc)) { + te_tab.te_tab_id = desc.td_tab_id; + + if (te_tab.te_tab_id > db->db_curr_tab_id) + db->db_curr_tab_id = te_tab.te_tab_id; + + te_tab.te_tab_name = xt_dup_string(self, desc.td_tab_name); + te_tab.te_tab_path = 
desc.td_tab_path; + desc.td_tab_path->tp_tab_count++; + te_tab.te_table = NULL; + xt_sl_insert(self, db->db_table_by_id, &desc.td_tab_id, &te_tab); + } + freer_(); // xt_describe_tables_exit(&desc) + + popr_(); // Discard xt_tab_exit_db(db) + exit_(); +} + +static void tab_save_table_paths(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTTablePathPtr *tp_ptr; + XTStringBufferRec buffer; + XTOpenFilePtr of; + char path[PATH_MAX]; + + memset(&buffer, 0, sizeof(buffer)); + + xt_strcpy(PATH_MAX, path, db->db_main_path); + xt_add_location_file(PATH_MAX, path); + + if (xt_sl_get_size(db->db_table_paths)) { + pushr_(xt_sb_free, &buffer); + for (u_int i=0; i<xt_sl_get_size(db->db_table_paths); i++) { + tp_ptr = (XTTablePathPtr *) xt_sl_item_at(db->db_table_paths, i); + xt_sb_concat(self, &buffer, (*tp_ptr)->tp_path); + xt_sb_concat(self, &buffer, "\n"); + } + +#ifdef XT_WIN + /* To make the location file cross-platform (at least + * as long as relative paths are used) we replace all '\' + * with '/': */ + char *ptr; + + ptr = buffer.sb_cstring; + while (*ptr) { + if (*ptr == '\\') + *ptr = '/'; + ptr++; + } +#endif + + of = xt_open_file(self, path, XT_FS_CREATE | XT_FS_MAKE_PATH); + pushr_(xt_close_file, of); + if (!xt_pwrite_file(of, 0, strlen(buffer.sb_cstring), buffer.sb_cstring, &self->st_statistics.st_x, self)) + xt_throw(self); + xt_set_eof_file(self, of, strlen(buffer.sb_cstring)); + freer_(); // xt_close_file(of) + + freer_(); // xt_sb_free(&buffer); + } + else + xt_fs_delete(NULL, path); +} + +static XTTablePathPtr tab_get_table_path(XTThreadPtr self, XTDatabaseHPtr db, XTPathStrPtr tab_name, xtBool save_it) +{ + XTTablePathPtr *tp, tab_path; + char path[PATH_MAX]; + + xt_strcpy(PATH_MAX, path, tab_name->ps_path); + xt_remove_last_name_of_path(path); + xt_remove_dir_char(path); + tp = (XTTablePathPtr *) xt_sl_find(self, db->db_table_paths, path); + if (tp) + tab_path = *tp; + else { + int len = (int) strlen(path); + + tab_path = (XTTablePathPtr) xt_malloc(self, 
offsetof(XTTablePathRec, tp_path) + len + 1); + tab_path->tp_tab_count = 0; + memcpy(tab_path->tp_path, path, len); + tab_path->tp_path[len] = 0; + xt_sl_insert(self, db->db_table_paths, tab_path->tp_path, &tab_path); + if (save_it) { + tab_save_table_paths(self, db); + if (xt_sl_get_size(db->db_table_paths) == 1) { + XTSystemTableShare::createSystemTables(self, db); + } + } + } + tab_path->tp_tab_count++; + return tab_path; +} + +static void tab_remove_table_path(XTThreadPtr self, XTDatabaseHPtr db, XTTablePathPtr tab_path) +{ + if (tab_path->tp_tab_count > 0) { + tab_path->tp_tab_count--; + if (tab_path->tp_tab_count == 0) { + xt_sl_delete(self, db->db_table_paths, tab_path->tp_path); + tab_save_table_paths(self, db); + } + } +} + +static void tab_free_table_path(XTThreadPtr self, XTTablePathPtr tab_path) +{ + XTDatabaseHPtr db = self->st_database; + + tab_remove_table_path(self, db, tab_path); +} + +xtPublic void xt_tab_exit_db(XTThreadPtr self, XTDatabaseHPtr db) +{ + if (db->db_tables) { + xt_free_hashtable(self, db->db_tables); + db->db_tables = NULL; + } + if (db->db_table_by_id) { + xt_free_sortedlist(self, db->db_table_by_id); + db->db_table_by_id = NULL; + } + if (db->db_table_paths) { + xt_free_sortedlist(self, db->db_table_paths); + db->db_table_paths = NULL; + } +} + +static void tab_check_table(XTThreadPtr self __attribute__((unused)), XTTableHPtr tab __attribute__((unused))) +{ + enter_(); + exit_(); +} + +xtPublic void xt_check_tables(XTThreadPtr self) +{ + u_int edx; + XTTableEntryPtr te_ptr; + volatile XTTableHPtr tab; + char path[PATH_MAX]; + + enter_(); + xt_logf(XT_INFO, "Check %s: Table...\n", self->st_database->db_main_path); + xt_enum_tables_init(&edx); + try_(a) { + for (;;) { + xt_ht_lock(self, self->st_database->db_tables); + pushr_(xt_ht_unlock, self->st_database->db_tables); + te_ptr = xt_enum_tables_next(self, self->st_database, &edx); + freer_(); // xt_ht_unlock(db->db_tables) + if (!te_ptr) + break; + xt_strcpy(PATH_MAX, path, 
te_ptr->te_tab_path->tp_path); + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, te_ptr->te_tab_name); + tab = xt_use_table(self, (XTPathStrPtr) path, FALSE, FALSE, NULL); + tab_check_table(self, tab); + xt_heap_release(self, tab); + tab = NULL; + } + } + catch_(a) { + if (tab) + xt_heap_release(self, tab); + throw_(); + } + cont_(a); + exit_(); +} + +xtPublic xtBool xt_table_exists(XTDatabaseHPtr db) +{ + return xt_sl_get_size(db->db_table_by_id) > 0; +} + +/* + * Enumerate all tables in the current database. + */ + +xtPublic void xt_enum_tables_init(u_int *edx) +{ + *edx = 0; +} + +xtPublic XTTableEntryPtr xt_enum_tables_next(XTThreadPtr self __attribute__((unused)), XTDatabaseHPtr db, u_int *edx) +{ + XTTableEntryPtr en_ptr; + + if (*edx >= xt_sl_get_size(db->db_table_by_id)) + return NULL; + en_ptr = (XTTableEntryPtr) xt_sl_item_at(db->db_table_by_id, *edx); + (*edx)++; + return en_ptr; +} + +xtPublic void xt_enum_files_of_tables_init(XTPathStrPtr tab_name, xtTableID tab_id, XTFilesOfTablePtr ft) +{ + ft->ft_state = 0; + ft->ft_tab_name = tab_name; + ft->ft_tab_id = tab_id; +} + +xtPublic xtBool xt_enum_files_of_tables_next(XTFilesOfTablePtr ft) +{ + char file_name[XT_MAX_TABLE_FILE_NAME_SIZE]; + + retry: + switch (ft->ft_state) { + case 0: + tab_get_row_file_name(file_name, xt_last_name_of_path(ft->ft_tab_name->ps_path), ft->ft_tab_id); + break; + case 1: + tab_get_data_file_name(file_name, xt_last_name_of_path(ft->ft_tab_name->ps_path), ft->ft_tab_id); + break; + case 2: + tab_get_index_file_name(file_name, xt_last_name_of_path(ft->ft_tab_name->ps_path), ft->ft_tab_id); + break; + default: + return FAILED; + } + + ft->ft_state++; + xt_strcpy(PATH_MAX, ft->ft_file_path, ft->ft_tab_name->ps_path); + xt_remove_last_name_of_path(ft->ft_file_path); + xt_strcat(PATH_MAX, ft->ft_file_path, file_name); + if (!xt_fs_exists(ft->ft_file_path)) + goto retry; + + return TRUE; +} + +static xtBool tab_find_table(XTThreadPtr self, XTDatabaseHPtr db, XTPathStrPtr 
name, xtTableID *tab_id) +{ + u_int edx; + XTTableEntryPtr te_ptr; + char path[PATH_MAX]; + + xt_enum_tables_init(&edx); + while ((te_ptr = xt_enum_tables_next(self, db, &edx))) { + xt_strcpy(PATH_MAX, path, te_ptr->te_tab_path->tp_path); + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, te_ptr->te_tab_name); + if (xt_tab_compare_names(path, name->ps_path) == 0) { + *tab_id = te_ptr->te_tab_id; + return TRUE; + } + } + return FALSE; +} + +xtPublic void xt_tab_set_index_error(XTTableHPtr tab) +{ + switch (tab->tab_dic.dic_disable_index) { + case XT_INDEX_OK: + break; + case XT_INDEX_TOO_OLD: + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_INDEX_OLD_VERSION, tab->tab_name); + break; + case XT_INDEX_TOO_NEW: + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_INDEX_NEW_VERSION, tab->tab_name); + break; + case XT_INDEX_BAD_BLOCK: + char number[40]; + + sprintf(number, "%d", (int) tab->tab_index_page_size); + xt_register_i2xterr(XT_REG_CONTEXT, XT_ERR_BAD_IND_BLOCK_SIZE, xt_last_name_of_path(tab->tab_name->ps_path), number); + break; + case XT_INDEX_CORRUPTED: + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_INDEX_CORRUPTED, tab->tab_name); + break; + case XT_INDEX_MISSING: + xt_register_taberr(XT_REG_CONTEXT, XT_ERR_INDEX_MISSING, tab->tab_name); + break; + } +} + +static void tab_load_index_header(XTThreadPtr self, XTTableHPtr tab, XTOpenFilePtr file, XTPathStrPtr table_name) +{ + XT_NODE_TEMP; + XTIndexPtr *ind; + xtWord1 *data; + XTIndexFormatDPtr index_fmt; + + /* Load the pointers: */ + if (tab->tab_index_head) + xt_free_ns(tab->tab_index_head); + tab->tab_index_head = (XTIndexHeadDPtr) xt_calloc(self, XT_INDEX_HEAD_SIZE); + + if (file) { + if (!xt_pread_file(file, 0, XT_INDEX_HEAD_SIZE, 0, tab->tab_index_head, NULL, &self->st_statistics.st_ind, self)) + xt_throw(self); + + tab->tab_index_format_offset = XT_GET_DISK_4(tab->tab_index_head->tp_format_offset_4); + index_fmt = (XTIndexFormatDPtr) (((xtWord1 *) tab->tab_index_head) + tab->tab_index_format_offset); + + /* 
If the table version is less than or equal to an incompatible (unsupported + * version), or greater than the current version, then we cannot open this table + */ + if (XT_GET_DISK_2(index_fmt->if_tab_version_2) <= XT_TAB_INCOMPATIBLE_VERSION || + XT_GET_DISK_2(index_fmt->if_tab_version_2) > XT_TAB_CURRENT_VERSION) { + switch (XT_GET_DISK_2(index_fmt->if_tab_version_2)) { + case 4: + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_UPGRADE_TABLE, table_name, "0.9.91 Beta"); + break; + case 3: + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_UPGRADE_TABLE, table_name, "0.9.85 Beta"); + break; + default: + xt_throw_taberr(XT_CONTEXT, XT_ERR_BAD_TABLE_VERSION, table_name); + break; + } + return; + } + + tab->tab_dic.dic_index_ver = XT_GET_DISK_2(index_fmt->if_ind_version_2); + tab->tab_dic.dic_disable_index = XT_INDEX_OK; + + if (tab->tab_dic.dic_index_ver == 1) { + tab->tab_index_header_size = 1024 * 16; + tab->tab_index_page_size = 1024 * 16; + } + else { + tab->tab_index_header_size = XT_GET_DISK_4(tab->tab_index_head->tp_header_size_4); + tab->tab_index_page_size = XT_GET_DISK_4(index_fmt->if_page_size_4); + } + + /* Incorrect version of index is handled by allowing a sequential scan, but no index access. + * Recovery with the wrong index type will not recover the indexes, a REPAIR TABLE + * will be required! 
+ */ + if (tab->tab_dic.dic_index_ver != XT_IND_CURRENT_VERSION) { + if (tab->tab_dic.dic_index_ver < XT_IND_CURRENT_VERSION) + tab->tab_dic.dic_disable_index = XT_INDEX_TOO_OLD; + else + tab->tab_dic.dic_disable_index = XT_INDEX_TOO_NEW; + } + else if (tab->tab_index_page_size != XT_INDEX_PAGE_SIZE) + tab->tab_dic.dic_disable_index = XT_INDEX_BAD_BLOCK; + } + else { + memset(tab->tab_index_head, 0, XT_INDEX_HEAD_SIZE); + tab->tab_dic.dic_disable_index = XT_INDEX_MISSING; + tab->tab_index_header_size = XT_INDEX_HEAD_SIZE; + tab->tab_index_page_size = XT_INDEX_PAGE_SIZE; + tab->tab_dic.dic_index_ver = 0; + tab->tab_index_format_offset = 0; + } + + + if (tab->tab_dic.dic_disable_index) { + xt_tab_set_index_error(tab); + xt_log_and_clear_exception_ns(); + } + + if (tab->tab_dic.dic_disable_index) { + /* Reset, as if we have empty indexes. + * Flush will wipe things out, of course. + * REPAIR TABLE will be required... + */ + XT_NODE_ID(tab->tab_ind_eof) = 1; + XT_NODE_ID(tab->tab_ind_free) = 0; + + ind = tab->tab_dic.dic_keys; + for (u_int i=0; i<tab->tab_dic.dic_key_count; i++, ind++) + XT_NODE_ID((*ind)->mi_root) = 0; + } + else { + XT_NODE_ID(tab->tab_ind_eof) = (xtIndexNodeID) XT_GET_DISK_6(tab->tab_index_head->tp_ind_eof_6); + XT_NODE_ID(tab->tab_ind_free) = (xtIndexNodeID) XT_GET_DISK_6(tab->tab_index_head->tp_ind_free_6); + + data = tab->tab_index_head->tp_data; + ind = tab->tab_dic.dic_keys; + for (u_int i=0; i<tab->tab_dic.dic_key_count; i++, ind++) { + (*ind)->mi_root = XT_GET_NODE_REF(tab, data); + data += XT_NODE_REF_SIZE; + } + } +} + +static void tab_load_table_format(XTThreadPtr self, XTOpenFilePtr file, XTPathStrPtr table_name, size_t *ret_format_offset, size_t *ret_head_size, XTDictionaryPtr dic) +{ + XTDiskValue4 size_buf; + size_t head_size; + XTTableFormatDRec tab_fmt; + size_t fmt_size; + + if (!xt_pread_file(file, 0, 4, 4, &size_buf, NULL, &self->st_statistics.st_rec, self)) + xt_throw(self); + + head_size = XT_GET_DISK_4(size_buf); + 
*ret_format_offset = head_size; + + /* Load the table format information: */ + if (!xt_pread_file(file, head_size, offsetof(XTTableFormatDRec, tf_definition), offsetof(XTTableFormatDRec, tf_tab_version_2) + 2, &tab_fmt, NULL, &self->st_statistics.st_rec, self)) + xt_throw(self); + + /* If the table version is less than or equal to an incompatible (unsupported + * version), or greater than the current version, then we cannot open this table + */ + if (XT_GET_DISK_2(tab_fmt.tf_tab_version_2) <= XT_TAB_INCOMPATIBLE_VERSION || + XT_GET_DISK_2(tab_fmt.tf_tab_version_2) > XT_TAB_CURRENT_VERSION) { + switch (XT_GET_DISK_2(tab_fmt.tf_tab_version_2)) { + case 4: + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_UPGRADE_TABLE, table_name, "0.9.91 Beta"); + break; + case 3: + xt_throw_tabcolerr(XT_CONTEXT, XT_ERR_UPGRADE_TABLE, table_name, "0.9.85 Beta"); + break; + default: + xt_throw_taberr(XT_CONTEXT, XT_ERR_BAD_TABLE_VERSION, table_name); + break; + } + return; + } + + fmt_size = XT_GET_DISK_4(tab_fmt.tf_format_size_4); + *ret_head_size = XT_GET_DISK_4(tab_fmt.tf_tab_head_size_4); + dic->dic_rec_size = XT_GET_DISK_4(tab_fmt.tf_rec_size_4); + dic->dic_rec_fixed = XT_GET_DISK_1(tab_fmt.tf_rec_fixed_1); + dic->dic_tab_flags = XT_GET_DISK_2(tab_fmt.tf_tab_flags_2); + dic->dic_min_auto_inc = XT_GET_DISK_8(tab_fmt.tf_min_auto_inc_8); + if (fmt_size > offsetof(XTTableFormatDRec, tf_definition)) { + size_t def_size = fmt_size - offsetof(XTTableFormatDRec, tf_definition); + char *def_sql; + + pushsr_(def_sql, xt_free, (char *) xt_malloc(self, def_size)); + if (!xt_pread_file(file, head_size+offsetof(XTTableFormatDRec, tf_definition), def_size, def_size, def_sql, NULL, &self->st_statistics.st_rec, self)) + xt_throw(self); + dic->dic_table = xt_ri_create_table(self, false, table_name, def_sql, myxt_create_table_from_table(self, dic->dic_my_table)); + freer_(); // xt_free(def_sql) + } + else + dic->dic_table = myxt_create_table_from_table(self, dic->dic_my_table); +} + +static void 
tab_load_table_header(XTThreadPtr self, XTTableHPtr tab, XTOpenFilePtr file) +{ + XTTableHeadDRec rec_head; + + if (!xt_pread_file(file, 0, sizeof(XTTableHeadDRec), sizeof(XTTableHeadDRec), (xtWord1 *) &rec_head, NULL, &self->st_statistics.st_rec, self)) + xt_throw(self); + + tab->tab_head_op_seq = XT_GET_DISK_4(rec_head.th_op_seq_4); + tab->tab_head_row_free_id = (xtRowID) XT_GET_DISK_6(rec_head.th_row_free_6); + tab->tab_head_row_eof_id = (xtRowID) XT_GET_DISK_6(rec_head.th_row_eof_6); + tab->tab_head_row_fnum = (xtWord4) XT_GET_DISK_6(rec_head.th_row_fnum_6); + tab->tab_head_rec_free_id = (xtRecordID) XT_GET_DISK_6(rec_head.th_rec_free_6); + tab->tab_head_rec_eof_id = (xtRecordID) XT_GET_DISK_6(rec_head.th_rec_eof_6); + tab->tab_head_rec_fnum = (xtWord4) XT_GET_DISK_6(rec_head.th_rec_fnum_6); +} + +xtPublic void xt_tab_store_header(XTOpenTablePtr ot, XTTableHeadDPtr rec_head) +{ + XTTableHPtr tab = ot->ot_table; + + XT_SET_DISK_4(rec_head->th_op_seq_4, tab->tab_head_op_seq); + XT_SET_DISK_6(rec_head->th_row_free_6, tab->tab_head_row_free_id); + XT_SET_DISK_6(rec_head->th_row_eof_6, tab->tab_head_row_eof_id); + XT_SET_DISK_6(rec_head->th_row_fnum_6, tab->tab_head_row_fnum); + XT_SET_DISK_6(rec_head->th_rec_free_6, tab->tab_head_rec_free_id); + XT_SET_DISK_6(rec_head->th_rec_eof_6, tab->tab_head_rec_eof_id); + XT_SET_DISK_6(rec_head->th_rec_fnum_6, tab->tab_head_rec_fnum); +} + +xtPublic xtBool xt_tab_write_header(XTOpenTablePtr ot, XTTableHeadDPtr rec_head, struct XTThread *thread) +{ + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, offsetof(XTTableHeadDRec, th_op_seq_4), 40, (xtWord1 *) rec_head->th_op_seq_4, &thread->st_statistics.st_rec, thread)) + return FAILED; + if (!XT_FLUSH_RR_FILE(ot->ot_rec_file, &thread->st_statistics.st_rec, thread)) + return FAILED; + return OK; +} + +xtPublic xtBool xt_tab_write_min_auto_inc(XTOpenTablePtr ot) +{ + xtWord1 value[8]; + off_t offset; + + XT_SET_DISK_8(value, ot->ot_table->tab_dic.dic_min_auto_inc); + offset = 
ot->ot_table->tab_table_format_offset + offsetof(XTTableFormatDRec, tf_min_auto_inc_8); + if (!XT_PWRITE_RR_FILE(ot->ot_rec_file, offset, 8, value, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + return FAILED; + if (!XT_FLUSH_RR_FILE(ot->ot_rec_file, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + return FAILED; + return OK; +} + +/* a helper function to remove table from the open tables hash on exception + * used in tab_new_handle() below + */ +static void xt_del_from_db_tables_ht(XTThreadPtr self, XTTableHPtr tab) +{ + XTTableEntryPtr te_ptr; + XTDatabaseHPtr db = tab->tab_db; + xtTableID tab_id = tab->tab_id; + + /* Oops! should use tab->tab_name, instead of tab! */ + xt_ht_del(self, db->db_tables, tab->tab_name); + + /* Remove the reference from the ID list, when a table is + * removed from the table name list: + */ + if ((te_ptr = (XTTableEntryPtr) xt_sl_find(self, db->db_table_by_id, &tab_id))) + te_ptr->te_table = NULL; +} + +/* + * Create a new table handle (i.e. open a table). + * Return NULL if the table is missing, and it is OK for the table + * to be missing. 
+ */ +static int tab_new_handle(XTThreadPtr self, XTTableHPtr *r_tab, XTDatabaseHPtr db, xtTableID tab_id, XTPathStrPtr tab_path, xtBool missing_ok, XTDictionaryPtr dic) +{ + char path[PATH_MAX]; + XTTableHPtr tab; + char file_name[XT_MAX_TABLE_FILE_NAME_SIZE]; + XTOpenFilePtr of_rec, of_ind; + XTTableEntryPtr te_ptr; + size_t tab_format_offset; + size_t tab_head_size; + + enter_(); + + tab = (XTTableHPtr) xt_heap_new(self, sizeof(XTTableHRec), tab_finalize); + pushr_(xt_heap_release, tab); + + tab->tab_name = (XTPathStrPtr) xt_dup_string(self, tab_path->ps_path); + tab->tab_db = db; + tab->tab_id = tab_id; +#ifdef TRACE_TABLE_IDS + PRINTF("%s: allocated TABLE: db=%d tab=%d %s\n", self->t_name, (int) db->db_id, (int) tab->tab_id, xt_last_2_names_of_path(tab->tab_name->ps_path)); +#endif + + if (dic) { + myxt_move_dictionary(&tab->tab_dic, dic); + myxt_setup_dictionary(self, &tab->tab_dic); + } + else { + if (!myxt_load_dictionary(self, &tab->tab_dic, db, tab_path)) { + freer_(); // xt_heap_release(tab) + return_(XT_TAB_NO_DICTIONARY); + } + } + + tab->tab_seq.xt_op_seq_init(self); + xt_spinlock_init_with_autoname(self, &tab->tab_ainc_lock); + xt_init_mutex_with_autoname(self, &tab->tab_rec_flush_lock); + xt_init_mutex_with_autoname(self, &tab->tab_ind_flush_lock); + xt_init_mutex_with_autoname(self, &tab->tab_dic_field_lock); + xt_init_mutex_with_autoname(self, &tab->tab_row_lock); + xt_init_mutex_with_autoname(self, &tab->tab_ind_lock); + xt_init_mutex_with_autoname(self, &tab->tab_rec_lock); + for (u_int i=0; i<XT_ROW_RWLOCKS; i++) + XT_TAB_ROW_INIT_LOCK(self, &tab->tab_row_rwlock[i]); + tab->tab_free_locks = TRUE; + + xt_strcpy(PATH_MAX, path, tab_path->ps_path); + xt_remove_last_name_of_path(path); + tab_get_row_file_name(file_name, xt_last_name_of_path(tab_path->ps_path), tab_id); + xt_strcat(PATH_MAX, path, file_name); + tab->tab_row_file = xt_fs_get_file(self, path); + + xt_remove_last_name_of_path(path); + tab_get_data_file_name(file_name, 
xt_last_name_of_path(tab_path->ps_path), tab_id); + xt_strcat(PATH_MAX, path, file_name); + tab->tab_rec_file = xt_fs_get_file(self, path); + + xt_remove_last_name_of_path(path); + tab_get_index_file_name(file_name, xt_last_name_of_path(tab_path->ps_path), tab_id); + xt_strcat(PATH_MAX, path, file_name); + tab->tab_ind_file = xt_fs_get_file(self, path); + + of_ind = xt_open_file(self, tab->tab_ind_file->fil_path, XT_FS_MISSING_OK); + if (of_ind) { + pushr_(xt_close_file, of_ind); + tab_load_index_header(self, tab, of_ind, tab_path); + freer_(); // xt_close_file(of_ind) + } + else + tab_load_index_header(self, tab, of_ind, tab_path); + + of_rec = xt_open_file(self, tab->tab_rec_file->fil_path, missing_ok ? XT_FS_MISSING_OK : XT_FS_DEFAULT); + if (!of_rec) { + freer_(); // xt_heap_release(tab) + return_(XT_TAB_NOT_FOUND); + } + pushr_(xt_close_file, of_rec); + tab_load_table_format(self, of_rec, tab_path, &tab_format_offset, &tab_head_size, &tab->tab_dic); + tab->tab_table_format_offset = tab_format_offset; + tab->tab_table_head_size = tab_head_size; + tab->tab_dic.dic_table->dt_table = tab; + tab_load_table_header(self, tab, of_rec); + freer_(); // xt_close_file(of_rec) + + tab->tab_seq.xt_op_seq_set(self, tab->tab_head_op_seq+1); + tab->tab_row_eof_id = tab->tab_head_row_eof_id; + tab->tab_row_free_id = tab->tab_head_row_free_id; + tab->tab_row_fnum = tab->tab_head_row_fnum; + tab->tab_rec_eof_id = tab->tab_head_rec_eof_id; + tab->tab_rec_free_id = tab->tab_head_rec_free_id; + tab->tab_rec_fnum = tab->tab_head_rec_fnum; + + tab->tab_rows.xt_tc_setup(tab, sizeof(XTTabRowHeadDRec), sizeof(XTTabRowRefDRec)); + tab->tab_recs.xt_tc_setup(tab, tab_head_size, tab->tab_dic.dic_rec_size); + + xt_xres_init_tab(self, tab); + + if (!xt_init_row_locks(&tab->tab_locks)) + xt_throw(self); + + xt_heap_set_release_callback(self, tab, tab_onrelease); + + popr_(); // Discard xt_heap_release(tab) + + xt_ht_put(self, db->db_tables, tab); + + /* Add a reference to the ID list, when a 
table is + * added to the table name list: + */ + if ((te_ptr = (XTTableEntryPtr) xt_sl_find(self, db->db_table_by_id, &tab->tab_id))) + te_ptr->te_table = tab; + + /* Moved from after xt_init_row_locks() above, so that calling + * xt_use_table_no_lock() with no_load == FALSE from attachReferences() + * will work if we have cyclic foreign key references. + */ + if (tab->tab_dic.dic_table) { + pushr_(xt_del_from_db_tables_ht, tab); + tab->tab_dic.dic_table->attachReferences(self, db); + popr_(); + } + + *r_tab = tab; + return_(XT_TAB_OK); +} + + +/* + * Get a reference to a table in the current database. The table reference is valid, + * as long as the thread is using the database!!! + */ +xtPublic XTTableHPtr xt_use_table_no_lock(XTThreadPtr self, XTDatabaseHPtr db, XTPathStrPtr name, xtBool no_load, xtBool missing_ok, XTDictionaryPtr dic, xtBool *opened) +{ + XTTableHPtr tab; + + if (!db) + xt_throw_xterr(XT_CONTEXT, XT_ERR_NO_DATABASE_IN_USE); + + tab = (XTTableHPtr) xt_ht_get(self, db->db_tables, name); + if (!tab && !no_load) { + xtTableID tab_id = 0; + + if (!tab_find_table(self, db, name, &tab_id)) { + if (missing_ok) + return NULL; + xt_throw_taberr(XT_CONTEXT, XT_ERR_TABLE_NOT_FOUND, name); + } + + if (tab_new_handle(self, &tab, db, tab_id, name, FALSE, dic) == XT_TAB_NO_DICTIONARY) + xt_throw_taberr(XT_CONTEXT, XT_ERR_NO_DICTIONARY, name); + + if (opened) + *opened = TRUE; + } + + if (tab) + xt_heap_reference(self, tab); + + return tab; +} + +static void tab_close_table(XTOpenTablePtr ot) +{ + xt_ind_free_reserved(ot); + + if (ot->ot_rec_file) { + XT_CLOSE_RR_FILE_NS(ot->ot_rec_file); + ot->ot_rec_file = NULL; + + } + if (ot->ot_ind_file) { + xt_close_file_ns(ot->ot_ind_file); + ot->ot_ind_file = NULL; + + } + if (ot->ot_row_file) { + XT_CLOSE_RR_FILE_NS(ot->ot_row_file); + ot->ot_row_file = NULL; + + } + if (ot->ot_table) { + xt_heap_release(xt_get_self(), ot->ot_table); + ot->ot_table = NULL; + } + if (ot->ot_ind_rhandle) { + 
xt_ind_release_handle(ot->ot_ind_rhandle, FALSE, ot->ot_thread); + ot->ot_ind_rhandle = NULL; + } + if (ot->ot_row_rbuffer) { + xt_free_ns(ot->ot_row_rbuffer); + ot->ot_row_rbuf_size = 0; + ot->ot_row_rbuffer = NULL; + } + if (ot->ot_row_wbuffer) { + xt_free_ns(ot->ot_row_wbuffer); + ot->ot_row_wbuf_size = 0; + ot->ot_row_wbuffer = NULL; + } +#ifdef XT_TRACK_RETURNED_ROWS + if (ot->ot_rows_returned) { + xt_free_ns(ot->ot_rows_returned); + ot->ot_rows_returned = NULL; + } + ot->ot_rows_ret_curr = 0; + ot->ot_rows_ret_max = 0; +#endif + xt_free(NULL, ot); +} + +/* + * This function locks a particular table by locking the table directory + * and waiting for all open table handles to close. + * + * Things are a bit complicated because the sweeper must be turned off before + * the table directory is locked. + */ +static XTOpenTablePoolPtr tab_lock_table(XTThreadPtr self, XTPathStrPtr name, xtBool no_load, xtBool flush_table, xtBool missing_ok, XTTableHPtr *tab) +{ + XTOpenTablePoolPtr table_pool; + XTDatabaseHPtr db = self->st_database; + + enter_(); + /* Lock the table, and close all references: */ + pushsr_(table_pool, xt_db_unlock_table_pool, xt_db_lock_table_pool_by_name(self, db, name, no_load, flush_table, missing_ok, FALSE, tab)); + if (!table_pool) { + freer_(); // xt_db_unlock_table_pool(db) + return_(NULL); + } + +#ifdef XT_STREAMING + /* Tell PBMS to close all open tables of this sort: */ + xt_pbms_close_all_tables(name->ps_path); +#endif + + /* Wait for all open tables to close: */ + xt_db_wait_for_open_tables(self, table_pool); + + popr_(); // Discard xt_db_unlock_table_pool(table_pool) + return_(table_pool); +} + +static void tab_delete_table_files(XTThreadPtr self, XTPathStrPtr tab_name, xtTableID tab_id) +{ + XTFilesOfTableRec ft; + + xt_enum_files_of_tables_init(tab_name, tab_id, &ft); + while (xt_enum_files_of_tables_next(&ft)) { + if (!xt_fs_delete(NULL, ft.ft_file_path)) + xt_log_and_clear_exception(self); + } +} + +xtPublic void 
xt_create_table(XTThreadPtr self, XTPathStrPtr name, XTDictionaryPtr dic) +{ + char table_name[XT_MAX_TABLE_FILE_NAME_SIZE]; + char path[PATH_MAX]; + XTDatabaseHPtr db = self->st_database; + XTOpenTablePoolPtr table_pool; + XTTableHPtr tab; + XTTableHPtr old_tab = 0; + xtTableID old_tab_id = 0; + xtTableID tab_id = 0; + XTTabRowHeadDRec row_head; + XTTableHeadDRec rec_head; + XTTableFormatDRec table_fmt; + XTIndexFormatDPtr index_fmt; + XTStringBufferRec tab_def = { 0, 0, 0 }; + XTTableEntryRec te_tab; + XTSortedListInfoRec li_undo; + +#ifdef TRACE_CREATE_TABLES + printf("CREATE %s\n", name->ps_path); +#endif + enter_(); + if (strlen(xt_last_name_of_path(name->ps_path)) > XT_TABLE_NAME_SIZE-1) + xt_throw_taberr(XT_CONTEXT, XT_ERR_NAME_TOO_LONG, name); + if (!db) + xt_throw_xterr(XT_CONTEXT, XT_ERR_NO_DATABASE_IN_USE); + + /* Lock to prevent table list change during creation. */ + table_pool = tab_lock_table(self, name, FALSE, TRUE, TRUE, &old_tab); + pushr_(xt_db_unlock_table_pool, table_pool); + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + + /* This must be done before we remove the old table + * from the directory, or we will not be able + * to find the table, which could be required + * for TRUNCATE! + */ + if (xt_sl_get_size(db->db_table_by_id) >= XT_MAX_TABLES) + xt_throw_ulxterr(XT_CONTEXT, XT_ERR_TOO_MANY_TABLES, (u_long) XT_MAX_TABLES); + + tab_id = db->db_curr_tab_id + 1; + + if (old_tab) { + old_tab_id = old_tab->tab_id; + xt_dl_delete_ext_data(self, old_tab, FALSE, TRUE); + xt_heap_release(self, old_tab); + + /* For the Windows version this must be done before we + * start to delete the underlying files! + */ + tab_close_mapped_files(self, old_tab); + + tab_delete_table_files(self, name, old_tab_id); + + /* Remove the PBMS table: */ + ASSERT(xt_get_self() == self); +#ifdef XT_STREAMING + xt_pbms_drop_table(name->ps_path); +#endif + + /* Remove the table from the directory. 
It will get a new + * ID so the handle in the directory will no longer be valid. + */ + xt_ht_del(self, db->db_tables, name); + } + + /* Add the table to the directory, we'll remove on error! */ + li_undo.li_sl = db->db_table_by_id; + li_undo.li_key = &tab_id; + te_tab.te_tab_id = tab_id; + te_tab.te_tab_name = xt_dup_string(self, xt_last_name_of_path(name->ps_path)); + te_tab.te_tab_path = tab_get_table_path(self, db, name, TRUE); + te_tab.te_table = NULL; + xt_sl_insert(self, db->db_table_by_id, &tab_id, &te_tab); + pushr_(xt_sl_delete_from_info, &li_undo); + + *path = 0; + try_(a) { + XTOpenFilePtr of_row, of_rec, of_ind; + off_t eof; + size_t def_len = 0; + + tab = (XTTableHPtr) xt_heap_new(self, sizeof(XTTableHRec), tab_finalize); + pushr_(xt_heap_release, tab); + + /* The length of the foreign key definition: */ + if (dic->dic_table) { + dic->dic_table->loadString(self, &tab_def); + def_len = tab_def.sb_len + 1; + } + + tab->tab_head_op_seq = 0; +#ifdef DEBUG + //tab->tab_head_op_seq = 0xFFFFFFFF - 12; +#endif + + /* ------- ROW FILE: */ + xt_strcpy(PATH_MAX, path, name->ps_path); + xt_remove_last_name_of_path(path); + tab_get_row_file_name(table_name, xt_last_name_of_path(name->ps_path), tab_id); + xt_strcat(PATH_MAX, path, table_name); + + of_row = xt_open_file(self, path, XT_FS_CREATE | XT_FS_EXCLUSIVE); + pushr_(xt_close_file, of_row); + XT_SET_DISK_4(row_head.rh_magic_4, XT_TAB_ROW_MAGIC); + if (!xt_pwrite_file(of_row, 0, sizeof(row_head), &row_head, &self->st_statistics.st_rec, self)) + xt_throw(self); + freer_(); // xt_close_file(of_row) + + (void) ASSERT(sizeof(XTTabRowHeadDRec) == sizeof(XTTabRowRefDRec)); + (void) ASSERT(sizeof(XTTabRowRefDRec) == 1 << XT_TAB_ROW_SHIFTS); + + tab->tab_row_eof_id = 1; + tab->tab_row_free_id = 0; + tab->tab_row_fnum = 0; + + tab->tab_head_row_eof_id = 1; + tab->tab_head_row_free_id = 0; + tab->tab_head_row_fnum = 0; + + /* ------------ DATA FILE: */ + xt_remove_last_name_of_path(path); + 
tab_get_data_file_name(table_name, xt_last_name_of_path(name->ps_path), tab_id); + xt_strcat(PATH_MAX, path, table_name); + of_rec = xt_open_file(self, path, XT_FS_CREATE | XT_FS_EXCLUSIVE); + pushr_(xt_close_file, of_rec); + + /* Calculate the offset of the first record in the data handle file. */ + eof = sizeof(XTTableHeadDRec) + offsetof(XTTableFormatDRec, tf_definition) + def_len + XT_FORMAT_DEF_SPACE; + eof = (eof + 1024 - 1) / 1024 * 1024; // Round to a value divisible by 1024 + + tab->tab_table_format_offset = sizeof(XTTableHeadDRec); + tab->tab_table_head_size = (size_t) eof; + + tab->tab_rec_eof_id = 1; // This is the first record ID! + tab->tab_rec_free_id = 0; + tab->tab_rec_fnum = 0; + + tab->tab_head_rec_eof_id = 1; // The first record ID + tab->tab_head_rec_free_id = 0; + tab->tab_head_rec_fnum = 0; + + tab->tab_dic.dic_rec_size = dic->dic_rec_size; + tab->tab_dic.dic_rec_fixed = dic->dic_rec_fixed; + tab->tab_dic.dic_tab_flags = dic->dic_tab_flags; + tab->tab_dic.dic_min_auto_inc = dic->dic_min_auto_inc; + tab->tab_dic.dic_def_ave_row_size = dic->dic_def_ave_row_size; + + XT_SET_DISK_4(rec_head.th_head_size_4, sizeof(XTTableHeadDRec)); + XT_SET_DISK_4(rec_head.th_op_seq_4, tab->tab_head_op_seq); + XT_SET_DISK_6(rec_head.th_row_free_6, tab->tab_head_row_free_id); + XT_SET_DISK_6(rec_head.th_row_eof_6, tab->tab_head_row_eof_id); + XT_SET_DISK_6(rec_head.th_row_fnum_6, tab->tab_head_row_fnum); + XT_SET_DISK_6(rec_head.th_rec_free_6, tab->tab_head_rec_free_id); + XT_SET_DISK_6(rec_head.th_rec_eof_6, tab->tab_head_rec_eof_id); + XT_SET_DISK_6(rec_head.th_rec_fnum_6, tab->tab_head_rec_fnum); + + if (!xt_pwrite_file(of_rec, 0, sizeof(XTTableHeadDRec), &rec_head, &self->st_statistics.st_rec, self)) + xt_throw(self); + + /* Store the table format: */ + memset(&table_fmt, 0, offsetof(XTTableFormatDRec, tf_definition)); + XT_SET_DISK_4(table_fmt.tf_format_size_4, offsetof(XTTableFormatDRec, tf_definition) + def_len); + 
XT_SET_DISK_4(table_fmt.tf_tab_head_size_4, eof); + XT_SET_DISK_2(table_fmt.tf_tab_version_2, XT_TAB_CURRENT_VERSION); + XT_SET_DISK_4(table_fmt.tf_rec_size_4, tab->tab_dic.dic_rec_size); + XT_SET_DISK_1(table_fmt.tf_rec_fixed_1, tab->tab_dic.dic_rec_fixed); + XT_SET_DISK_2(table_fmt.tf_tab_flags_2, tab->tab_dic.dic_tab_flags); + XT_SET_DISK_8(table_fmt.tf_min_auto_inc_8, tab->tab_dic.dic_min_auto_inc); + + if (!xt_pwrite_file(of_rec, sizeof(XTTableHeadDRec), offsetof(XTTableFormatDRec, tf_definition), &table_fmt, &self->st_statistics.st_rec, self)) + xt_throw(self); + if (def_len) { + if (!xt_pwrite_file(of_rec, sizeof(XTTableHeadDRec) + offsetof(XTTableFormatDRec, tf_definition), def_len, tab_def.sb_cstring, &self->st_statistics.st_rec, self)) + xt_throw(self); + } + + freer_(); // xt_close_file(of_rec) + + /* ----------- INDEX FILE: */ + xt_remove_last_name_of_path(path); + tab_get_index_file_name(table_name, xt_last_name_of_path(name->ps_path), tab_id); + xt_strcat(PATH_MAX, path, table_name); + of_ind = xt_open_file(self, path, XT_FS_CREATE | XT_FS_EXCLUSIVE); + pushr_(xt_close_file, of_ind); + + /* This is the size of the index header: */ + tab->tab_index_format_offset = offsetof(XTIndexHeadDRec, tp_data) + dic->dic_key_count * XT_NODE_REF_SIZE; + if (!(tab->tab_index_head = (XTIndexHeadDPtr) xt_calloc_ns(XT_INDEX_HEAD_SIZE))) + xt_throw(self); + + XT_NODE_ID(tab->tab_ind_eof) = 1; + XT_NODE_ID(tab->tab_ind_free) = 0; + + XT_SET_DISK_4(tab->tab_index_head->tp_header_size_4, XT_INDEX_HEAD_SIZE); + XT_SET_DISK_4(tab->tab_index_head->tp_format_offset_4, tab->tab_index_format_offset); + XT_SET_DISK_6(tab->tab_index_head->tp_ind_eof_6, XT_NODE_ID(tab->tab_ind_eof)); + XT_SET_DISK_6(tab->tab_index_head->tp_ind_free_6, XT_NODE_ID(tab->tab_ind_free)); + + /* Store the index format: */ + index_fmt = (XTIndexFormatDPtr) (((xtWord1 *) tab->tab_index_head) + tab->tab_index_format_offset); + XT_SET_DISK_4(index_fmt->if_format_size_4, sizeof(XTIndexFormatDRec)); + 
XT_SET_DISK_2(index_fmt->if_tab_version_2, XT_TAB_CURRENT_VERSION); + XT_SET_DISK_2(index_fmt->if_ind_version_2, XT_IND_CURRENT_VERSION); + XT_SET_DISK_1(index_fmt->if_node_ref_size_1, XT_NODE_REF_SIZE); + XT_SET_DISK_1(index_fmt->if_rec_ref_size_1, XT_RECORD_REF_SIZE); + XT_SET_DISK_4(index_fmt->if_page_size_4, XT_INDEX_PAGE_SIZE); + + /* Save the header: */ + if (!xt_pwrite_file(of_ind, 0, XT_INDEX_HEAD_SIZE, tab->tab_index_head, &self->st_statistics.st_ind, self)) + xt_throw(self); + + freer_(); // xt_close_file(of_ind) + + /* ------------ */ + /* Log the new table ID! */ + db->db_curr_tab_id = tab_id; + if (!xt_xn_log_tab_id(self, tab_id)) { + db->db_curr_tab_id = tab_id - 1; + xt_throw(self); + } + + freer_(); // xt_heap_release(tab) + + /* {LOAD-FOR-FKS} + * 2008-12-10: Note, there is another problem, example: + * set storage_engine = pbxt; + * + * CREATE TABLE t1 (s1 INT PRIMARY KEY, s2 INT); + * CREATE TABLE t2 (s1 INT PRIMARY KEY, FOREIGN KEY (s1) REFERENCES t1 (s1) ON UPDATE CASCADE); + * CREATE TABLE t3 (s1 INT PRIMARY KEY, FOREIGN KEY (s1) REFERENCES t2 (s1) ON UPDATE CASCADE); + * + * DROP TABLE IF EXISTS t2,t1; + * CREATE TABLE t1 (s1 ENUM('a','b') PRIMARY KEY); + * CREATE TABLE t2 (s1 ENUM('A','B'), FOREIGN KEY (s1) REFERENCES t1 (s1)); + * + * DROP TABLE IF EXISTS t2,t1; + * + * In the example above. The second create t2 does not fail, although t3 references it, + * and the data types do not match. + * + * The main problem is that this error comes on DROP TABLE IF EXISTS t2! Which prevents + * the table from being dropped - not good. + * + * So my idea here is to open the table, and if it fails, then the create table fails + * as well. 
+ */ + if (!old_tab_id) { + tab = xt_use_table_no_lock(self, db, name, FALSE, FALSE, NULL, NULL); + xt_heap_release(self, tab); + } + } + catch_(a) { + /* Creation failed, delete the table files: */ + if (*path) + tab_delete_table_files(self, name, tab_id); + xt_sb_set_size(self, &tab_def, 0); + throw_(); + } + cont_(a); + + xt_sb_set_size(self, &tab_def, 0); + + if (old_tab_id) { + try_(b) { + XTTableEntryPtr te_ptr; + + if ((te_ptr = (XTTableEntryPtr) xt_sl_find(self, db->db_table_by_id, &old_tab_id))) { + tab_remove_table_path(self, db, te_ptr->te_tab_path); + xt_sl_delete(self, db->db_table_by_id, &old_tab_id); + } + + /* Same purpose as above {LOAD-FOR-FKS} (although this should work, + * because this is a TRUNCATE TABLE). + */ + tab = xt_use_table_no_lock(self, db, name, FALSE, FALSE, NULL, NULL); + xt_heap_release(self, tab); + } + catch_(b) { + /* Log this error, but do not return it, because + * it just involves the cleanup of the old table, + * the new table has been successfully created. + */ + xt_log_and_clear_exception(self); + } + cont_(b); + } + + popr_(); // Discard xt_sl_delete_from_info(&li_undo) + + freer_(); // xt_ht_unlock(db->db_tables) + freer_(); // xt_db_unlock_table_pool(table_pool) + + /* I open the table here, because I cannot rely on MySQL to do + * it after a create. This is normally OK, but with foreign keys + * tables can be referenced and then they are not opened + * before use. In this example, the INSERT opens t2, but t1 is + * not opened after the create. As a result the foreign key + * reference is not resolved. 
+ * + * drop table t1, t2; + * CREATE TABLE t1 + * ( + * id INT PRIMARY KEY + * ) ENGINE=pbxt; + * + * CREATE TABLE t2 + * ( + * v INT, + * CONSTRAINT c1 FOREIGN KEY (v) REFERENCES t1(id) + * ) ENGINE=pbxt; + * + * --error 1452 + * INSERT INTO t2 VALUES(2); + */ + /* this code is not needed anymore as we open tables referred by FKs as necessary during checks + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + tab = xt_use_table_no_lock(self, db, name, FALSE, FALSE, NULL, NULL); + freer_(); // xt_ht_unlock(db->db_tables) + xt_heap_release(self, tab); + * CHANGED see {LOAD-FOR-FKS} above. + */ + + exit_(); +} + +xtPublic void xt_drop_table(XTThreadPtr self, XTPathStrPtr tab_name) +{ + XTDatabaseHPtr db = self->st_database; + XTOpenTablePoolPtr table_pool; + XTTableHPtr tab; + xtTableID tab_id = 0; + xtBool can_drop = TRUE; + + enter_(); + +#ifdef TRACE_CREATE_TABLES + printf("DROP %s\n", tab_name->ps_path); +#endif + + table_pool = tab_lock_table(self, tab_name, FALSE, TRUE, TRUE, &tab); + pushr_(xt_db_unlock_table_pool, table_pool); + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + + if (table_pool) { + tab_id = tab->tab_id; /* tab is not null if returned table_pool is not null */ + /* check if other tables refer this */ + if (!self->st_ignore_fkeys) + can_drop = tab->tab_dic.dic_table->checkCanDrop(); + } + + if (can_drop) { + if (tab_id) { + XTTableEntryPtr te_ptr; + + xt_dl_delete_ext_data(self, tab, FALSE, TRUE); + xt_heap_release(self, tab); + + /* For the Windows version this must be done before we + * start to delete the underlying files! 
+ */ + tab_close_mapped_files(self, tab); + + tab_delete_table_files(self, tab_name, tab_id); + + ASSERT(xt_get_self() == self); +#ifdef XT_STREAMING + xt_pbms_drop_table(tab_name->ps_path); +#endif + if ((te_ptr = (XTTableEntryPtr) xt_sl_find(self, db->db_table_by_id, &tab_id))) { + tab_remove_table_path(self, db, te_ptr->te_tab_path); + xt_sl_delete(self, db->db_table_by_id, &tab_id); + } + } + + xt_ht_del(self, db->db_tables, tab_name); + } + else { /* cannot drop table because of FK dependencies */ + xt_throw_xterr(XT_CONTEXT, XT_ERR_ROW_IS_REFERENCED); + } + + freer_(); // xt_ht_unlock(db->db_tables) + freer_(); // xt_db_unlock_table_pool(table_pool) + exit_(); +} + +xtPublic void xt_check_table(XTThreadPtr self, XTOpenTablePtr ot) +{ + XTTableHPtr tab = ot->ot_table; + xtRecordID prec_id; + XTTabRecExtDPtr rec_buf = (XTTabRecExtDPtr) ot->ot_row_rbuffer; + XTactExtRecEntryDRec ext_rec; + size_t log_size; + xtLogID log_id; + xtLogOffset log_offset; + xtRecordID rec_id; + xtRecordID prev_rec_id; + xtXactID xn_id; + xtRowID row_id; + u_llong free_rec_count = 0, free_count2 = 0; + u_llong delete_rec_count = 0; + u_llong alloc_rec_count = 0; + u_llong alloc_rec_bytes = 0; + size_t rec_size; + size_t row_size; + +#if defined(DUMP_CHECK_TABLE) || defined(CHECK_TABLE_STATS) + printf("\nCHECK TABLE: %s\n", tab->tab_name->ps_path); +#endif + + xt_lock_mutex(self, &tab->tab_db->db_co_ext_lock); + pushr_(xt_unlock_mutex, &tab->tab_db->db_co_ext_lock); + + xt_lock_mutex(self, &tab->tab_rec_lock); + pushr_(xt_unlock_mutex, &tab->tab_rec_lock); + +#ifdef CHECK_TABLE_STATS + printf("Record buffer size = %lu\n", (u_long) tab->tab_dic.dic_buf_size); + printf("Handle data record size = %lu\n", (u_long) tab->tab_dic.dic_rec_size); + printf("Min/max header size = %d/%d\n", (int) offsetof(XTTabRecFix, rf_data), tab->tab_dic.dic_rec_fixed ? 
(int) offsetof(XTTabRecFix, rf_data) : (int) offsetof(XTTabRecExtDRec, re_data)); + if (tab->tab_dic.dic_def_ave_row_size) + printf("Maximum fixed size = %lu\n", (u_long) XT_TAB_MAX_FIX_REC_LENGTH_SPEC); + else + printf("Maximum fixed size = %lu\n", (u_long) XT_TAB_MAX_FIX_REC_LENGTH); + printf("Minimum variable size = %lu\n", (u_long) XT_TAB_MIN_VAR_REC_LENGTH); + + printf("Min/avg/max record size = %llu/%llu/%llu\n", (u_llong) tab->tab_dic.dic_min_row_size, (u_llong) tab->tab_dic.dic_ave_row_size, (u_llong) tab->tab_dic.dic_max_row_size); + if (tab->tab_dic.dic_def_ave_row_size) + printf("Avg row len set for tab = %lu\n", (u_long) tab->tab_dic.dic_def_ave_row_size); + else + printf("Avg row len set for tab = not specified\n"); + printf("Rows fixed length = %s\n", tab->tab_dic.dic_rec_fixed ? "YES" : "NO"); + if (tab->tab_dic.dic_tab_flags & XT_TAB_FLAGS_TEMP_TAB) + printf("Table type = TEMP\n"); + printf("Minimum auto-increment = %llu\n", (u_llong) tab->tab_dic.dic_min_auto_inc); + printf("Number of columns = %lu\n", (u_long) tab->tab_dic.dic_no_of_cols); + printf("Number of fixed columns = %lu\n", (u_long) tab->tab_dic.dic_fix_col_count); + printf("Columns req. for index = %lu\n", (u_long) tab->tab_dic.dic_ind_cols_req); + if (tab->tab_dic.dic_ind_rec_len) + printf("Rec len req. for index = %llu\n", (u_llong) tab->tab_dic.dic_ind_rec_len); + printf("Columns req. 
for blobs = %lu\n", (u_long) tab->tab_dic.dic_blob_cols_req); + printf("Number of blob columns = %lu\n", (u_long) tab->tab_dic.dic_blob_count); + printf("Number of indices = %lu\n", (u_long) tab->tab_dic.dic_key_count); +#endif + +#ifdef DUMP_CHECK_TABLE + printf("Records:-\n"); + printf("Free list: %llu (%llu)\n", (u_llong) tab->tab_rec_free_id, (u_llong) tab->tab_rec_fnum); + printf("EOF: %llu\n", (u_llong) tab->tab_rec_eof_id); +#endif + + rec_size = XT_REC_EXT_HEADER_SIZE; + if (rec_size > tab->tab_recs.tci_rec_size) + rec_size = tab->tab_recs.tci_rec_size; + rec_id = 1; + while (rec_id < tab->tab_rec_eof_id) { + if (!xt_tab_get_rec_data(ot, rec_id, tab->tab_dic.dic_rec_size, ot->ot_row_rbuffer)) + xt_throw(self); + +#ifdef DUMP_CHECK_TABLE + printf("%-4llu ", (u_llong) rec_id); +#endif + switch (rec_buf->tr_rec_type_1 & XT_TAB_STATUS_MASK) { + case XT_TAB_STATUS_FREED: +#ifdef DUMP_CHECK_TABLE + printf("======== "); +#endif + free_rec_count++; + break; + case XT_TAB_STATUS_DELETE: +#ifdef DUMP_CHECK_TABLE + printf("delete "); +#endif + delete_rec_count++; + break; + case XT_TAB_STATUS_FIXED: +#ifdef DUMP_CHECK_TABLE + printf("record-F "); +#endif + alloc_rec_count++; + row_size = myxt_store_row_length(ot, (char *) ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE); + alloc_rec_bytes += row_size; + break; + case XT_TAB_STATUS_VARIABLE: +#ifdef DUMP_CHECK_TABLE + printf("record-V "); +#endif + alloc_rec_count++; + row_size = myxt_load_row_length(ot, tab->tab_dic.dic_rec_size, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, NULL); + alloc_rec_bytes += row_size; + break; + case XT_TAB_STATUS_EXT_DLOG: +#ifdef DUMP_CHECK_TABLE + printf("record-X "); +#endif + alloc_rec_count++; + row_size = XT_GET_DISK_4(rec_buf->re_log_dat_siz_4) + ot->ot_rec_size - XT_REC_EXT_HEADER_SIZE; + alloc_rec_bytes += row_size; + break; + } +#ifdef DUMP_CHECK_TABLE + if (rec_buf->tr_rec_type_1 & XT_TAB_STATUS_CLEANED_BIT) + printf("C"); + else + printf(" "); +#endif + prev_rec_id = 
XT_GET_DISK_4(rec_buf->tr_prev_rec_id_4); + xn_id = XT_GET_DISK_4(rec_buf->tr_xact_id_4); + row_id = XT_GET_DISK_4(rec_buf->tr_row_id_4); + switch (rec_buf->tr_rec_type_1 & XT_TAB_STATUS_MASK) { + case XT_TAB_STATUS_FREED: +#ifdef DUMP_CHECK_TABLE + printf(" prev=%-3llu (xact=%-3llu row=%lu)\n", (u_llong) prev_rec_id, (u_llong) xn_id, (u_long) row_id); +#endif + break; + case XT_TAB_STATUS_EXT_DLOG: +#ifdef DUMP_CHECK_TABLE + printf(" prev=%-3llu xact=%-3llu row=%lu Xlog=%lu Xoff=%llu Xsiz=%lu\n", (u_llong) prev_rec_id, (u_llong) xn_id, (u_long) row_id, (u_long) XT_GET_DISK_2(rec_buf->re_log_id_2), (u_llong) XT_GET_DISK_6(rec_buf->re_log_offs_6), (u_long) XT_GET_DISK_4(rec_buf->re_log_dat_siz_4)); +#endif + + log_size = XT_GET_DISK_4(rec_buf->re_log_dat_siz_4); + XT_GET_LOG_REF(log_id, log_offset, rec_buf); + if (!self->st_dlog_buf.dlb_read_log(log_id, log_offset, offsetof(XTactExtRecEntryDRec, er_data), (xtWord1 *) &ext_rec, self)) + xt_log_and_clear_exception(self); + else { + size_t log_size2; + xtTableID curr_tab_id; + xtRecordID curr_rec_id; + + log_size2 = XT_GET_DISK_4(ext_rec.er_data_size_4); + curr_tab_id = XT_GET_DISK_4(ext_rec.er_tab_id_4); + curr_rec_id = XT_GET_DISK_4(ext_rec.er_rec_id_4); + if (log_size2 != log_size || curr_tab_id != tab->tab_id || curr_rec_id != rec_id) { + xt_logf(XT_INFO, "Table %s: record %llu, extended record %lu:%llu not valid\n", tab->tab_name, (u_llong) rec_id, (u_long) log_id, (u_llong) log_offset); + } + } + break; + default: +#ifdef DUMP_CHECK_TABLE + printf(" prev=%-3llu xact=%-3llu row=%lu\n", (u_llong) prev_rec_id, (u_llong) xn_id, (u_long) row_id); +#endif + break; + } + rec_id++; + } + +#ifdef CHECK_TABLE_STATS + printf("Fixed length rec. len. = %llu\n", (u_llong) tab->tab_dic.dic_rec_size - XT_REC_FIX_HEADER_SIZE); + if (alloc_rec_count) + printf("Average comp. rec. len. 
= %llu\n", (u_llong) ((double) alloc_rec_bytes / (double) alloc_rec_count + (double) 0.5)); + printf("Free record count = %llu\n", (u_llong) free_rec_count); + printf("Deleted record count = %llu\n", (u_llong) delete_rec_count); + printf("Allocated record count = %llu\n", (u_llong) alloc_rec_count); +#endif + if (tab->tab_rec_fnum != free_rec_count) + xt_logf(XT_INFO, "Table %s: incorrect number of free blocks, %llu, should be: %llu\n", tab->tab_name, (u_llong) free_rec_count, (u_llong) tab->tab_rec_fnum); + + /* Checking the free list: */ + prec_id = 0; + rec_id = tab->tab_rec_free_id; + while (rec_id) { + if (rec_id >= tab->tab_rec_eof_id) { + xt_logf(XT_INFO, "Table %s: invalid reference on free list: %llu, ", tab->tab_name, (u_llong) rec_id); + if (prec_id) + xt_logf(XT_INFO, "reference by: %llu\n", (u_llong) prec_id); + else + xt_logf(XT_INFO, "reference by list head pointer\n"); + break; + } + if (!xt_tab_get_rec_data(ot, rec_id, XT_REC_FIX_HEADER_SIZE, (xtWord1 *) rec_buf)) { + xt_log_and_clear_exception(self); + break; + } + if ((rec_buf->tr_rec_type_1 & XT_TAB_STATUS_MASK) != XT_TAB_STATUS_FREED) + xt_logf(XT_INFO, "Table %s: record, %llu, on free list is not free\n", tab->tab_name, (u_llong) rec_id); + free_count2++; + prec_id = rec_id; + rec_id = XT_GET_DISK_4(rec_buf->tr_prev_rec_id_4); + } + if (free_count2 < free_rec_count) + xt_logf(XT_INFO, "Table %s: not all free blocks (%llu) on free list: %llu\n", tab->tab_name, (u_llong) free_rec_count, (u_llong) free_count2); + + freer_(); // xt_unlock_mutex_ns(&tab->tab_rec_lock); + + xtRefID ref_id; + + xt_lock_mutex(self, &tab->tab_row_lock); + pushr_(xt_unlock_mutex, &tab->tab_row_lock); + +#ifdef DUMP_CHECK_TABLE + printf("Rows:-\n"); + printf("Free list: %llu (%llu)\n", (u_llong) tab->tab_row_free_id, (u_llong) tab->tab_row_fnum); + printf("EOF: %llu\n", (u_llong) tab->tab_row_eof_id); +#endif + + rec_id = 1; + while (rec_id < tab->tab_row_eof_id) { + if (!tab->tab_rows.xt_tc_read_4(ot->ot_row_file, 
rec_id, &ref_id, self)) + xt_throw(self); +#ifdef DUMP_CHECK_TABLE + printf("%-3llu ", (u_llong) rec_id); +#endif +#ifdef DUMP_CHECK_TABLE + if (ref_id == 0) + printf("====== 0\n"); + else + printf("in use %llu\n", (u_llong) ref_id); +#endif + rec_id++; + } + + freer_(); // xt_unlock_mutex(&tab->tab_row_lock); + +#ifdef CHECK_INDEX_ON_CHECK_TABLE + xt_check_indices(ot); +#endif + freer_(); // xt_unlock_mutex(&tab->tab_db->db_co_ext_lock); +} + +xtPublic void xt_rename_table(XTThreadPtr self, XTPathStrPtr old_name, XTPathStrPtr new_name) +{ + XTDatabaseHPtr db = self->st_database; + XTOpenTablePoolPtr table_pool; + XTTableHPtr tab; + char table_name[XT_MAX_TABLE_FILE_NAME_SIZE]; + char *postfix; + XTFilesOfTableRec ft; + XTDictionaryRec dic; + xtTableID tab_id; + XTTableEntryPtr te_ptr; + char *te_new_name; + XTTablePathPtr te_new_path; + XTTablePathPtr te_old_path; + char to_path[PATH_MAX]; + + memset(&dic, 0, sizeof(dic)); + +#ifdef TRACE_CREATE_TABLES + printf("RENAME %s --> %s\n", old_name->ps_path, new_name->ps_path); +#endif + if (strlen(xt_last_name_of_path(new_name->ps_path)) > XT_TABLE_NAME_SIZE-1) + xt_throw_taberr(XT_CONTEXT, XT_ERR_NAME_TOO_LONG, new_name); + + /* MySQL renames the table while it is in use. Here is + * the sequence: + * + * OPEN tab1 + * CREATE tmp_tab + * OPEN tmp_tab + * COPY tab1 -> tmp_tab + * CLOSE tmp_tab + * RENAME tab1 -> tmp2_tab + * RENAME tmp_tab -> tab1 + * CLOSE tab1 (tmp2_tab) + * DELETE tmp2_tab + * OPEN tab1 + * + * Since the table is open when it is renamed, I cannot + * get exclusive use of the table for this operation. + * + * So instead we just make sure that the sweeper is not + * using the table. 
+ */ + table_pool = tab_lock_table(self, old_name, FALSE, TRUE, FALSE, &tab); + pushr_(xt_db_unlock_table_pool, table_pool); + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + tab_id = tab->tab_id; + myxt_move_dictionary(&dic, &tab->tab_dic); + pushr_(myxt_free_dictionary, &dic); + + /* Unmap the memory mapped table files: + * For windows this must be done before we + * can rename the files. + */ + tab_close_mapped_files(self, tab); + + xt_heap_release(self, tab); + + /* Create the new name and path: */ + te_new_name = xt_dup_string(self, xt_last_name_of_path(new_name->ps_path)); + pushr_(xt_free, te_new_name); + te_new_path = tab_get_table_path(self, db, new_name, FALSE); + pushr_(tab_free_table_path, te_new_path); + + te_ptr = (XTTableEntryPtr) xt_sl_find(self, db->db_table_by_id, &tab_id); + + /* Remove the table from the Database directory: */ + xt_ht_del(self, db->db_tables, old_name); + + xt_enum_files_of_tables_init(old_name, tab_id, &ft); + while (xt_enum_files_of_tables_next(&ft)) { + postfix = xt_tab_file_to_name(XT_MAX_TABLE_FILE_NAME_SIZE, table_name, ft.ft_file_path); + + xt_strcpy(PATH_MAX, to_path, new_name->ps_path); + xt_strcat(PATH_MAX, to_path, postfix); + + if (!xt_fs_rename(NULL, ft.ft_file_path, to_path)) + xt_log_and_clear_exception(self); + } + + /* Switch the table name and path: */ + xt_free(self, te_ptr->te_tab_name); + te_ptr->te_tab_name = te_new_name; + te_old_path = te_ptr->te_tab_path; + te_ptr->te_tab_path = te_new_path; + tab_remove_table_path(self, db, te_old_path); + + popr_(); // Discard tab_free_table_path(te_new_path); + popr_(); // Discard xt_free(te_new_name); + + tab = xt_use_table_no_lock(self, db, new_name, FALSE, FALSE, &dic, NULL); + xt_heap_release(self, tab); + + freer_(); // myxt_free_dictionary(&dic) + freer_(); // xt_ht_unlock(db->db_tables) + freer_(); // xt_db_unlock_table_pool(table_pool) +} + +xtPublic XTTableHPtr xt_use_table(XTThreadPtr self, XTPathStrPtr name, xtBool no_load, xtBool 
missing_ok, xtBool *opened) +{ + XTTableHPtr tab; + XTDatabaseHPtr db = self->st_database; + + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + tab = xt_use_table_no_lock(self, db, name, no_load, missing_ok, NULL, opened); + freer_(); + return tab; +} + +xtPublic void xt_sync_flush_table(XTThreadPtr self, XTOpenTablePtr ot) +{ + XTTableHPtr tab = ot->ot_table; + XTDatabaseHPtr db = tab->tab_db; + + /* Wakeup the sweeper: + * We want the sweeper to check if there is anything to do, + * so we must wake it up. + * Once it has done all it can, it will go back to sleep. + * This should be good enough. + * + * NOTE: In all cases, we do not wait if the sweeper is in + * error state. + */ + if (db->db_sw_idle) { + u_int check_count = db->db_sw_check_count; + + for (;;) { + xt_wakeup_sweeper(db); + if (!db->db_sw_thread || db->db_sw_idle != XT_THREAD_IDLE || check_count != db->db_sw_check_count) + break; + xt_sleep_milli_second(10); + } + } + + /* Wait for the sweeper to become idle: */ + xt_lock_mutex(self, &db->db_sw_lock); + pushr_(xt_unlock_mutex, &db->db_sw_lock); + while (db->db_sw_thread && !db->db_sw_idle) { + xt_timed_wait_cond(self, &db->db_sw_cond, &db->db_sw_lock, 10); + } + freer_(); // xt_unlock_mutex(&db->db_sw_lock) + + /* Wait for the writer to write out all operations on the table: + * We also do not wait for the writer if it is in + * error state. + */ + while (db->db_wr_thread && + db->db_wr_idle != XT_THREAD_INERR && + XTTableSeq::xt_op_is_before(tab->tab_head_op_seq+1, tab->tab_seq.ts_next_seq)) { + /* Flush the log, in case this is holding up the + * writer! + */ + if (!db->db_xlog.xlog_flush(self)) + xt_throw(self); + + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + db->db_wr_thread_waiting++; + /* + * Wake the writer if it is sleeping. In order to + * flush a table we must wait for the writer to complete + * committing all the changes in the table to the database. 
+ */ + if (db->db_wr_idle) { + if (!xt_broadcast_cond_ns(&db->db_wr_cond)) + xt_log_and_clear_exception_ns(); + } + + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + xt_sleep_milli_second(10); + + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + db->db_wr_thread_waiting--; + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + } + + xt_flush_table(self, ot); +} + +xtPublic xtBool xt_flush_record_row(XTOpenTablePtr ot, off_t *bytes_flushed, xtBool have_table_lock) +{ + XTTableHeadDRec rec_head; + XTTableHPtr tab = ot->ot_table; + off_t to_flush; + XTCheckPointTablePtr cp_tab; + XTCheckPointStatePtr cp = NULL; + + if (!xt_begin_checkpoint(tab->tab_db, have_table_lock, ot->ot_thread)) + return FAILED; + + xt_lock_mutex_ns(&tab->tab_rec_flush_lock); + + ASSERT_NS(ot->ot_thread == xt_get_self()); + /* Make sure that the table recovery point, in + * particular the operation ID is recorded + * before all other flush activity! + * + * This is because only operations after the + * recovery point in the header are applied + * to the table on recovery. + * + * So the operation ID is recorded before the + * flush activity, and written after all is done. 
+ */ + xt_tab_store_header(ot, &rec_head); + +#ifdef TRACE_FLUSH + printf("FLUSH rec/row %d %s\n", (int) tab->tab_bytes_to_flush, tab->tab_name->ps_path); + fflush(stdout); +#endif + /* Write the table header: */ + if (tab->tab_flush_pending) { + tab->tab_flush_pending = FALSE; + // Want to see how much was to be flushed in the debugger: + to_flush = tab->tab_bytes_to_flush; + tab->tab_bytes_to_flush = 0; + if (bytes_flushed) + *bytes_flushed += to_flush; + /* Flush the table data: */ + if (!(tab->tab_dic.dic_tab_flags & XT_TAB_FLAGS_TEMP_TAB)) { + if (!XT_FLUSH_RR_FILE(ot->ot_rec_file, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread) || + !XT_FLUSH_RR_FILE(ot->ot_row_file, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) { + tab->tab_flush_pending = TRUE; + goto failed; + } + } + + /* The header includes the operation number which + * must be written AFTER all other data, + * because operations will not be applied again. + */ + if (!xt_tab_write_header(ot, &rec_head, ot->ot_thread)) { + tab->tab_flush_pending = TRUE; + goto failed; + } + } + + /* Flush the auto-increment: */ + if (xt_db_auto_increment_mode == 1) { + if (tab->tab_auto_inc != tab->tab_dic.dic_min_auto_inc) { + tab->tab_dic.dic_min_auto_inc = tab->tab_auto_inc; + if (!xt_tab_write_min_auto_inc(ot)) + goto failed; + } + } + + /* Mark this table as record/row flushed: */ + cp = &tab->tab_db->db_cp_state; + xt_lock_mutex_ns(&cp->cp_state_lock); + if (cp->cp_running) { + cp_tab = (XTCheckPointTablePtr) xt_sl_find(NULL, cp->cp_table_ids, &tab->tab_id); + if (cp_tab && (cp_tab->cpt_flushed & XT_CPT_ALL_FLUSHED) != XT_CPT_ALL_FLUSHED) { + cp_tab->cpt_flushed |= XT_CPT_REC_ROW_FLUSHED; + if ((cp_tab->cpt_flushed & XT_CPT_ALL_FLUSHED) == XT_CPT_ALL_FLUSHED) { + ASSERT_NS(cp->cp_flush_count < xt_sl_get_size(cp->cp_table_ids)); + cp->cp_flush_count++; + } + } + } + xt_unlock_mutex_ns(&cp->cp_state_lock); + +#ifdef TRACE_FLUSH + printf("FLUSH --end-- %s\n", tab->tab_name->ps_path); + fflush(stdout); 
+#endif + xt_unlock_mutex_ns(&tab->tab_rec_flush_lock); + + if (!xt_end_checkpoint(tab->tab_db, ot->ot_thread, NULL)) + return FAILED; + return OK; + + failed: + xt_unlock_mutex_ns(&tab->tab_rec_flush_lock); + return FAILED; +} + +xtPublic void xt_flush_table(XTThreadPtr self, XTOpenTablePtr ot) +{ + /* GOTCHA [*10*]: This bug was difficult to find. + * It occurred on Windows in the multi_update + * test, sometimes. + * + * What happens is the checkpointer starts to + * flush the table, and gets to the + * XT_FLUSH_RR_FILE part. + * + * Then a rename occurs, and the user thread + * flushes the table, and goes through and + * writes the table header, with the most + * recent table operation (the last operation + * that occurred). + * + * The checkpointer then completes and + * also writes the header, but with old + * values (as read in xt_tab_store_header()). + * + * Then the user thread continues, and + * reopens the table after rename. + * On reopen, it reads the old value from the header, + * and sets the current operation number. + * + * Now there is a problem in the table cache, + * because some cache pages have operation numbers + * that are greater than current operation + * number! + * + * This later led to the free-er hanging while + * it waited for an operation to be + * written to the disk that never would be. + * This is because a page can only be freed when + * the head operation number has passed the + * page operation number. + * + * Which indicates that the page has been written + * to disk. + */ + + if (!xt_flush_record_row(ot, NULL, FALSE)) + xt_throw(self); + + /* This was before the table data flush, + * (after xt_tab_store_header() above), + * but I don't think it makes any difference. + * Because in the checkpointer it was at this + * position. 
+ */ + if (!xt_flush_indices(ot, NULL, FALSE)) + xt_throw(self); + +} + +xtPublic XTOpenTablePtr tab_open_table(XTTableHPtr tab) +{ + volatile XTOpenTablePtr ot; + XTThreadPtr self; + + if (!(ot = (XTOpenTablePtr) xt_malloc_ns(sizeof(XTOpenTableRec)))) + return NULL; + memset(ot, 0, offsetof(XTOpenTableRec, ot_ind_wbuf)); + + self = xt_get_self(); + try_(a) { + xt_heap_reference(self, tab); + ot->ot_table = tab; +#ifdef XT_USE_ROW_REC_MMAP_FILES + ot->ot_row_file = xt_open_fmap(self, ot->ot_table->tab_row_file->fil_path, xt_db_row_file_grow_size); + ot->ot_rec_file = xt_open_fmap(self, ot->ot_table->tab_rec_file->fil_path, xt_db_data_file_grow_size); +#else + ot->ot_row_file = xt_open_file(self, ot->ot_table->tab_row_file->fil_path, XT_FS_DEFAULT); + ot->ot_rec_file = xt_open_file(self, ot->ot_table->tab_rec_file->fil_path, XT_FS_DEFAULT); +#endif +#ifdef XT_USE_DIRECT_IO_ON_INDEX + ot->ot_ind_file = xt_open_file(self, ot->ot_table->tab_ind_file->fil_path, XT_FS_MISSING_OK | XT_FS_DIRECT_IO); +#else + ot->ot_ind_file = xt_open_file(self, ot->ot_table->tab_ind_file->fil_path, XT_FS_MISSING_OK); +#endif + } + catch_(a) { + ; + } + cont_(a); + + if (!ot->ot_table || !ot->ot_row_file || !ot->ot_rec_file) + goto failed; + + if (!(ot->ot_row_rbuffer = (xtWord1 *) xt_malloc_ns(ot->ot_table->tab_dic.dic_rec_size))) + goto failed; + ot->ot_row_rbuf_size = ot->ot_table->tab_dic.dic_rec_size; + if (!(ot->ot_row_wbuffer = (xtWord1 *) xt_malloc_ns(ot->ot_table->tab_dic.dic_rec_size))) + goto failed; + ot->ot_row_wbuf_size = ot->ot_table->tab_dic.dic_rec_size; + + /* Cache this stuff to speed access a bit: */ + ot->ot_rec_fixed = ot->ot_table->tab_dic.dic_rec_fixed; + ot->ot_rec_size = ot->ot_table->tab_dic.dic_rec_size; + + return ot; + + failed: + tab_close_table(ot); + return NULL; +} + +xtPublic XTOpenTablePtr xt_open_table(XTTableHPtr tab) +{ + return tab_open_table(tab); +} + +xtPublic void xt_close_table(XTOpenTablePtr ot, xtBool flush, xtBool have_table_lock) +{ + if 
(flush) { + if (!xt_flush_record_row(ot, NULL, have_table_lock)) + xt_log_and_clear_exception_ns(); + + if (!xt_flush_indices(ot, NULL, have_table_lock)) + xt_log_and_clear_exception_ns(); + } + tab_close_table(ot); +} + +xtPublic int xt_use_table_by_id(XTThreadPtr self, XTTableHPtr *r_tab, XTDatabaseHPtr db, xtTableID tab_id) +{ + XTTableEntryPtr te_ptr; + XTTableHPtr tab = NULL; + int r = XT_TAB_OK; + char path[PATH_MAX]; + + if (!db) + xt_throw_xterr(XT_CONTEXT, XT_ERR_NO_DATABASE_IN_USE); + xt_ht_lock(self, db->db_tables); + pushr_(xt_ht_unlock, db->db_tables); + + te_ptr = (XTTableEntryPtr) xt_sl_find(self, db->db_table_by_id, &tab_id); + if (te_ptr) { + if (!(tab = te_ptr->te_table)) { + /* Open the table: */ + xt_strcpy(PATH_MAX, path, te_ptr->te_tab_path->tp_path); + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, te_ptr->te_tab_name); + r = tab_new_handle(self, &tab, db, tab_id, (XTPathStrPtr) path, TRUE, NULL); + } + } + else + r = XT_TAB_NOT_FOUND; + + if (tab) + xt_heap_reference(self, tab); + *r_tab = tab; + + freer_(); // xt_ht_unlock(db->db_tables) + return r; +} + +/* The fixed part of the record is already in the row buffer. + * This function loads the extended part, expanding the row + * buffer if necessary. 
+ */ +xtPublic xtBool xt_tab_load_ext_data(XTOpenTablePtr ot, xtRecordID load_rec_id, xtWord1 *buffer, u_int cols_req) +{ + size_t log_size; + xtLogID log_id; + xtLogOffset log_offset; + xtWord1 save_buffer[offsetof(XTactExtRecEntryDRec, er_data)]; + xtBool retried = FALSE; + XTactExtRecEntryDPtr ext_data_ptr; + size_t log_size2; + xtTableID curr_tab_id; + xtRecordID curr_rec_id; + + log_size = XT_GET_DISK_4(((XTTabRecExtDPtr) ot->ot_row_rbuffer)->re_log_dat_siz_4); + XT_GET_LOG_REF(log_id, log_offset, (XTTabRecExtDPtr) ot->ot_row_rbuffer); + + if (ot->ot_rec_size + log_size > ot->ot_row_rbuf_size) { + if (!xt_realloc_ns((void **) &ot->ot_row_rbuffer, ot->ot_rec_size + log_size)) + return FAILED; + ot->ot_row_rbuf_size = ot->ot_rec_size + log_size; + } + + /* Read the extended part first: */ + ext_data_ptr = (XTactExtRecEntryDPtr) (ot->ot_row_rbuffer + ot->ot_rec_size - offsetof(XTactExtRecEntryDRec, er_data)); + + /* Save the data which the header will overwrite: */ + memcpy(save_buffer, ext_data_ptr, offsetof(XTactExtRecEntryDRec, er_data)); + + reread: + if (!ot->ot_thread->st_dlog_buf.dlb_read_log(log_id, log_offset, offsetof(XTactExtRecEntryDRec, er_data) + log_size, (xtWord1 *) ext_data_ptr, ot->ot_thread)) + goto retry_read; + + log_size2 = XT_GET_DISK_4(ext_data_ptr->er_data_size_4); + curr_tab_id = XT_GET_DISK_4(ext_data_ptr->er_tab_id_4); + curr_rec_id = XT_GET_DISK_4(ext_data_ptr->er_rec_id_4); + + if (log_size2 != log_size || curr_tab_id != ot->ot_table->tab_id || curr_rec_id != load_rec_id) { + /* [(3)] This can happen in the following circumstances: + * - A new record is created, but the data log is not + * flushed. + * - The server quits. + * - On restart the transaction is rolled back, but the data record + * was not written, so later a new record could be written at this + * location. + * - Later the sweeper tries to cleanup this record, and finds + * that a different record has been written at this position. 
+ * + * NOTE: Index entries can only be written to disk for records + * that have been committed to the disk, because uncommitted + * records may not exist in order to remove the index entry + * on cleanup. + */ + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_BAD_EXT_RECORD); + goto retry_read; + } + + /* Restore the saved area: */ + memcpy(ext_data_ptr, save_buffer, offsetof(XTactExtRecEntryDRec, er_data)); + + if (retried) + xt_unlock_mutex_ns(&ot->ot_table->tab_db->db_co_ext_lock); + return myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_EXT_HEADER_SIZE, buffer, cols_req); + + retry_read: + if (!retried) { + /* (1) It may be that reading the log fails because the garbage collector + * has moved the record since we determined the location. + * We handle this here, by re-reading the data the garbage collector + * would have updated. + * + * (2) It may also happen that a new record is just being updated or + * inserted. It is possible that the handle part of the record + * has been written, but not yet the overflow. + * This means that repeating the read attempt could work. + * + * (3) The extended data has been written by another handler and not yet + * flushed. This should not happen because only committed extended + * records are read, and all data should be flushed before + * commit! + * + * NOTE: (2) above is not a problem when versioning is working + * correctly. In this case, we should never try to read the extended + * part of an uncommitted record (belonging to some other thread/ + * transaction). 
+ */ + XTTabRecExtDRec rec_buf; + + xt_lock_mutex_ns(&ot->ot_table->tab_db->db_co_ext_lock); + retried = TRUE; + + if (!xt_tab_get_rec_data(ot, load_rec_id, XT_REC_EXT_HEADER_SIZE, (xtWord1 *) &rec_buf)) + goto failed; + + XT_GET_LOG_REF(log_id, log_offset, &rec_buf); + goto reread; + } + + failed: + if (retried) + xt_unlock_mutex_ns(&ot->ot_table->tab_db->db_co_ext_lock); + return FAILED; +} + +xtPublic xtBool xt_tab_put_rec_data(XTOpenTablePtr ot, xtRecordID rec_id, size_t size, xtWord1 *buffer, xtOpSeqNo *op_seq) +{ + register XTTableHPtr tab = ot->ot_table; + + ASSERT_NS(rec_id); + + return tab->tab_recs.xt_tc_write(ot->ot_rec_file, rec_id, 0, size, buffer, op_seq, TRUE, ot->ot_thread); +} + +xtPublic xtBool xt_tab_put_log_op_rec_data(XTOpenTablePtr ot, u_int status, xtRecordID free_rec_id, xtRecordID rec_id, size_t size, xtWord1 *buffer) +{ + register XTTableHPtr tab = ot->ot_table; + xtOpSeqNo op_seq; + + ASSERT_NS(rec_id); + + if (status == XT_LOG_ENT_REC_MOVED) { + if (!tab->tab_recs.xt_tc_write(ot->ot_rec_file, rec_id, offsetof(XTTabRecExtDRec, re_log_id_2), size, buffer, &op_seq, TRUE, ot->ot_thread)) + return FAILED; + } +#ifdef DEBUG + else if (status == XT_LOG_ENT_REC_CLEANED_1) { + ASSERT_NS(0); // shouldn't be used anymore + } +#endif + else { + if (!tab->tab_recs.xt_tc_write(ot->ot_rec_file, rec_id, 0, size, buffer, &op_seq, TRUE, ot->ot_thread)) + return FAILED; + } + + return xt_xlog_modify_table(ot, status, op_seq, free_rec_id, rec_id, size, buffer); +} + +xtPublic xtBool xt_tab_put_log_rec_data(XTOpenTablePtr ot, u_int status, xtRecordID free_rec_id, xtRecordID rec_id, size_t size, xtWord1 *buffer, xtOpSeqNo *op_seq) +{ + register XTTableHPtr tab = ot->ot_table; + + ASSERT_NS(rec_id); + + if (status == XT_LOG_ENT_REC_MOVED) { + if (!tab->tab_recs.xt_tc_write(ot->ot_rec_file, rec_id, offsetof(XTTabRecExtDRec, re_log_id_2), size, buffer, op_seq, TRUE, ot->ot_thread)) + return FAILED; + } + else { + if (!tab->tab_recs.xt_tc_write(ot->ot_rec_file, 
rec_id, 0, size, buffer, op_seq, TRUE, ot->ot_thread)) + return FAILED; + } + + return xt_xlog_modify_table(ot, status, *op_seq, free_rec_id, rec_id, size, buffer); +} + +xtPublic xtBool xt_tab_get_rec_data(XTOpenTablePtr ot, xtRecordID rec_id, size_t size, xtWord1 *buffer) +{ + register XTTableHPtr tab = ot->ot_table; + + ASSERT_NS(rec_id); + + return tab->tab_recs.xt_tc_read(ot->ot_rec_file, rec_id, (size_t) size, buffer, ot->ot_thread); +} + +/* + * Note: this function grants locks even to transactions that + * are not specifically waiting for this transaction. + * This is required, because all threads waiting for + * a lock should be considered "equal". In other words, + * they should not have to wait for the "right" transaction + * before they get the lock, or it will turn into a + * race to wait for the correct transaction. + * + * A transaction T1 can end up waiting for the wrong transaction + * T2, because T2 has released the lock, and given it to T3. + * Of course, T1 will wake up soon and realize this, but + * it is a matter of timing. + * + * The main point is that T2 has released the lock because + * it has ended (see {RELEASING-LOCKS} for more details) + * and therefore, there is no danger of it claiming the + * lock again, which can lead to a deadlock if T1 is + * given the lock instead of T3 in the example above. + * Then, if T2 tries to regain the lock before T1 + * realizes that it has the lock. + */ +//static xtBool tab_get_lock_after_wait(XTThreadPtr thread, XTLockWaitPtr lw) +//{ +// register XTTableHPtr tab = lw->lw_ot->ot_table; + + /* {ROW-LIST-LOCK} + * I don't believe this lock is required. If it is, please explain why!! + * XT_TAB_ROW_READ_LOCK(&tab->tab_row_rwlock[gl->lw_row_id % XT_ROW_RWLOCKS], thread); + * + * With the old row lock implementation a XT_TAB_ROW_WRITE_LOCK was required because + * the row locking did not have its own locks. + * The new list locking has its own locks. 
I was using XT_TAB_ROW_READ_LOCK, + * but I don't think this is required. + */ +// return tab->tab_locks.xt_set_temp_lock(lw->lw_ot, lw, &lw->lw_thread->st_lock_list); +//} + +/* + * NOTE: Previously this function did not gain the row lock. + * If this change is a problem, please document why! + * The previous implementation did wait until no lock was on the + * row. + * + * I am thinking that it is simply a good idea to grab the lock, + * instead of waiting for no lock, before the retry. But it could + * result in locking more than required! + */ +static xtBool tab_wait_for_update(register XTOpenTablePtr ot, xtRowID row_id, xtXactID xn_id, XTThreadPtr thread) +{ + XTLockWaitRec lw; + XTXactWaitRec xw; + xtBool ok; + + xw.xw_xn_id = xn_id; + + lw.lw_thread = thread; + lw.lw_ot = ot; + lw.lw_row_id = row_id; + lw.lw_row_updated = FALSE; + + /* First try to get the lock: */ + if (!ot->ot_table->tab_locks.xt_set_temp_lock(ot, &lw, &thread->st_lock_list)) + return FAILED; + if (lw.lw_curr_lock != XT_NO_LOCK) + /* Wait for the lock, then the transaction: */ + ok = xt_xn_wait_for_xact(thread, &xw, &lw); + else + /* Just wait for the transaction: */ + ok = xt_xn_wait_for_xact(thread, &xw, NULL); + +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + return ok; +} + +/* {WAIT-FOR} + * XT_OLD - The record is old. No longer visible because there is a + * newer committed record before it in the record list. + * This is a special case of FALSE (the record is not visible). + * (see {WAIT-FOR} for details). + * It is significant because if we find too many of these when + * searching for records, then we have reason to believe the + * sweeper is far behind. This can happen in a test like this: + * runTest(INCREMENT_TEST, 2, INCREMENT_TEST_UPDATE_COUNT); + * What happens is T1 detects an updated row by T2, + * but T2 has not committed yet. + * It waits for T2. T2 commits and updates again before T1 + * can update. 
+ * + * Of course if we got a lock on the row when T2 quits, then + * this would not happen! + */ + +/* + * Is a record visible? + * Returns TRUE, FALSE, XT_ERR. + * + * TRUE - The record is visible. + * FALSE - The record is not visible. + * XT_ERR - An exception (error) occurred. + * XT_NEW - The most recent variation of this row has been returned + * and is to be used instead of the input! + * XT_REREAD - Re-read the record, and try again. + * + * Basically, a record is visible if it was committed on or before + * the transaction's "visible time" (st_visible_time), and there + * are no other visible records before this record in the + * variation chain for the record. + * + * This holds in general, but you don't always get to see the + * visible record (as defined in this sense). + * + * On any kind of update (SELECT FOR UPDATE, UPDATE or DELETE), you + * get to see the most recent variation of the row! + * + * So on update, this function will wait if necessary for a recent + * update to be committed. + * + * So an update is a kind of "committed read" with a wait for + * uncommitted records. + * + * The result: + * - INSERTS may not be seen by the update read, depending on when + * they occur. + * - Records may be returned in non-index order. + * - New records returned must be checked again by an index scan + * to make sure they conform to the condition! + * + * CREATE TABLE test_tab (ID int primary key, Value int, Name varchar(20), + * index(Value, Name)) ENGINE=pbxt; + * INSERT test_tab values(4, 2, 'D'); + * INSERT test_tab values(5, 2, 'E'); + * INSERT test_tab values(6, 2, 'F'); + * INSERT test_tab values(7, 2, 'G'); + * + * -- C1 + * begin; + * select * from test_tab where id = 6 for update; + * -- C2 + * begin; + * select * from test_tab where value = 2 order by value, name for update; + * -- C1 + * update test_tab set Name = 'A' where id = 7; + * commit; + * -- C2 + * Result order D, E, F, A. + * + * But Jim does it like this, so it should be OK. 
+ */ +static int tab_visible(register XTOpenTablePtr ot, XTTabRecHeadDPtr rec_head, xtRecordID *new_rec_id) +{ + XTThreadPtr thread = ot->ot_thread; + xtXactID xn_id; + XTTabRecHeadDRec var_head; + xtRowID row_id; + xtRecordID var_rec_id; + register XTTableHPtr tab; + xtBool wait = FALSE; + xtXactID wait_xn_id = 0; +#ifdef TRACE_VARIATIONS + char t_buf[500]; + int len; +#endif + int result = TRUE; + xtBool rec_clean; + xtRecordID invalid_rec; + + retry: + /* It can be that between the time that I read the index, + * and the time that I try to access the + * record, that the record is removed by + * the sweeper! + */ + if (XT_REC_NOT_VALID(rec_head->tr_rec_type_1)) + return FALSE; + + row_id = XT_GET_DISK_4(rec_head->tr_row_id_4); + + /* This can happen if the row has been removed, and + * reused: + */ + if (ot->ot_curr_row_id && row_id != ot->ot_curr_row_id) + return FALSE; + +#ifdef TRACE_VARIATIONS + len = sprintf(t_buf, "row=%d rec=%d ", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + if (!(rec_clean = XT_REC_IS_CLEAN(rec_head->tr_rec_type_1))) { + /* The record is not clean, which means it has not been swept. + * So we have to check if it is visible. + */ + xn_id = XT_GET_DISK_4(rec_head->tr_xact_id_4); + switch (xt_xn_status(ot, xn_id, ot->ot_curr_rec_id)) { + case XT_XN_VISIBLE: + break; + case XT_XN_NOT_VISIBLE: + if (ot->ot_for_update) { + /* It is visible, only if it is an insert, + * which means if has no previous variation. + * Note, if an insert is updated, the record + * should be overwritten (TODO - check this). 
+ */ + var_rec_id = XT_GET_DISK_4(rec_head->tr_prev_rec_id_4); + if (!var_rec_id) + break; +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, "OTHER COMMIT (OVERWRITTEN) T%d\n", (int) xn_id); + xt_ttracef(thread, "%s", t_buf); +#endif + } +#ifdef TRACE_VARIATIONS + else { + if (len <= 450) + len += sprintf(t_buf+len, "OTHER COMMIT T%d\n", (int) xn_id); + xt_ttracef(thread, "%s", t_buf); + } +#endif + /* {WAKE-SW} + * The record is not visible, although it has been committed. + * Clean the transaction ASAP. + */ + ot->ot_table->tab_db->db_sw_faster |= XT_SW_DIRTY_RECORD_FOUND; + return FALSE; + case XT_XN_ABORTED: + /* {WAKE-SW} + * Reading an aborted record, this transaction + * must be cleaned up ASAP! + */ + ot->ot_table->tab_db->db_sw_faster |= XT_SW_DIRTY_RECORD_FOUND; +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, "ABORTED T%d\n", (int) xn_id); + xt_ttracef(thread, "%s", t_buf); +#endif + return FALSE; + case XT_XN_MY_UPDATE: + /* This is a record written by this transaction. */ + if (thread->st_is_update) { + /* Check that it was not written by the current update statement: */ + if (XT_STAT_ID_MASK(thread->st_update_id) == rec_head->tr_stat_id_1) { +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, "MY UPDATE IN THIS STATEMENT T%d\n", (int) xn_id); + xt_ttracef(thread, "%s", t_buf); +#endif + return FALSE; + } + } + ot->ot_curr_row_id = row_id; + ot->ot_curr_updated = TRUE; + if (!(xt_tab_get_row(ot, row_id, &var_rec_id))) + return XT_ERR; + /* It is visible if it is at the front of the list. + * An update can end up not being at the front of the list + * if it is deleted afterwards! 
+ */ +#ifdef TRACE_VARIATIONS + if (len <= 450) { + if (var_rec_id == ot->ot_curr_rec_id) + len += sprintf(t_buf+len, "MY UPDATE T%d\n", (int) xn_id); + else + len += sprintf(t_buf+len, "MY UPDATE (OVERWRITTEN) T%d\n", (int) xn_id); + } + xt_ttracef(thread, "%s", t_buf); +#endif + return var_rec_id == ot->ot_curr_rec_id; + case XT_XN_OTHER_UPDATE: + if (ot->ot_for_update) { + /* If this is an insert, we are interested! + * Updated values are handled below. This is because + * the changed (new) records returned below are always + * followed (in the version chain) by the record + * we would have returned (if nothing had changed). + * + * As a result, we only return records here which have + * no "history". + */ + var_rec_id = XT_GET_DISK_4(rec_head->tr_prev_rec_id_4); + if (!var_rec_id) { +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, "OTHER INSERT (WAIT FOR) T%d\n", (int) xn_id); + xt_ttracef(thread, "%s", t_buf); +#endif + if (!tab_wait_for_update(ot, row_id, xn_id, thread)) + return XT_ERR; + if (!xt_tab_get_rec_data(ot, ot->ot_curr_rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &var_head)) + return XT_ERR; + rec_head = &var_head; + goto retry; + } + } +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, "OTHER UPDATE T%d\n", (int) xn_id); + xt_ttracef(thread, "%s", t_buf); +#endif + return FALSE; + case XT_XN_REREAD: +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, "REREAD?! T%d\n", (int) xn_id); + xt_ttracef(thread, "%s", t_buf); +#endif + return XT_REREAD; + } + } + + /* Follow the variation chain until we come to this record. + * If it is not the first visible variation then + * it is not visible at all. If it in not found on the + * variation chain, it is also not visible. + */ + tab = ot->ot_table; + + retry_2: + +#ifdef XT_USE_LIST_BASED_ROW_LOCKS + /* The list based row locks used there own locks, so + * it is not necessary to get a write lock here. 
+ */ + XT_TAB_ROW_READ_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); +#else + if (ot->ot_for_update) + XT_TAB_ROW_WRITE_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); + else + XT_TAB_ROW_READ_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); +#endif + + invalid_rec = 0; + retry_3: + if (!(xt_tab_get_row(ot, row_id, &var_rec_id))) + goto failed; +#ifdef TRACE_VARIATIONS + len += sprintf(t_buf+len, "ROW=%d", (int) row_id); +#endif + while (var_rec_id != ot->ot_curr_rec_id) { + if (!var_rec_id) { +#ifdef TRACE_VARIATIONS + xt_ttracef(thread, "row=%d rec=%d NOT VISI not found in list\n", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + goto not_found; + } + if (!xt_tab_get_rec_data(ot, var_rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &var_head)) + goto failed; +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, " -> %d(%d)", (int) var_rec_id, (int) var_head.tr_rec_type_1); +#endif + /* All clean records are visible, by all transactions: */ + if (XT_REC_IS_CLEAN(var_head.tr_rec_type_1)) { +#ifdef TRACE_VARIATIONS + xt_ttracef(thread, "row=%d rec=%d NOT VISI clean rec found\n", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + goto not_found; + } + if (XT_REC_IS_FREE(var_head.tr_rec_type_1)) { +#ifdef TRACE_VARIATIONS + xt_ttracef(thread, "row=%d rec=%d NOT VISI free rec found?!\n", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + /* + * After an analysis we came to conclusion that this situation is + * possible and valid. It can happen if index scan and row deletion + * go in parallel: + * + * Client Thread Sweeper + * ------------- ------- + * 1. start index scan, lock the index file. + * 2. start row deletion, wait for index lock + * 3. unlock the index file, start search for + * the valid version of the record + * 4. delete the row, mark record as freed, + * but not yet cleaned by sweeper + * 5. 
observe the record being freed + * + * after these steps we can get here, if the record was marked as free after + * the tab_visible was entered by the scanning thread. + * + */ + if (invalid_rec != var_rec_id) { + /* This was "var_rec_id = invalid_rec", caused an infinite loop (bug #310184!) */ + invalid_rec = var_rec_id; + goto retry_3; + } + /* Assume end of list. */ + goto not_found; + } + + /* This can happen if the row has been removed, and + * reused: + */ + if (row_id != XT_GET_DISK_4(var_head.tr_row_id_4)) + goto not_found; + + xn_id = XT_GET_DISK_4(var_head.tr_xact_id_4); + /* This variation is visibleif committed before this + * transaction started, or updated by this transaction. + * + * We now know that this is the valid variation for + * this record (for this table) for this transaction! + * This will not change, unless the transaction + * updates the record (again). + * + * So we can store this information as a hint, if + * we see other variations belonging to this record, + * then we can ignore them immediately! + */ + switch (xt_xn_status(ot, xn_id, var_rec_id)) { + case XT_XN_VISIBLE: + /* {WAKE-SW} + * We have encountered a record that has been overwritten, if the + * record has not been cleaned, then the sweeper is too far + * behind! + */ + if (!rec_clean) + ot->ot_table->tab_db->db_sw_faster |= XT_SW_DIRTY_RECORD_FOUND; +#ifdef TRACE_VARIATIONS + xt_ttracef(thread, "row=%d rec=%d NOT VISI committed rec found\n", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + goto not_found; + case XT_XN_NOT_VISIBLE: + if (ot->ot_for_update) { + /* Substitute this record for the one we + * are reading!! + */ + if (result == TRUE) { + if (XT_REC_IS_DELETE(var_head.tr_rec_type_1)) + result = FALSE; + else { + *new_rec_id = var_rec_id; + result = XT_NEW; + } + } + } + break; + case XT_XN_ABORTED: + /* Ignore the record, it will be removed. 
*/ + break; + case XT_XN_MY_UPDATE: +#ifdef TRACE_VARIATIONS + xt_ttracef(thread, "row=%d rec=%d NOT VISI my update found\n", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + goto not_found; + case XT_XN_OTHER_UPDATE: + /* Wait for this update to commit or abort: */ + if (!wait) { + wait = TRUE; + wait_xn_id = xn_id; + } +#ifdef TRACE_VARIATIONS + if (len <= 450) + len += sprintf(t_buf+len, "-T%d", (int) wait_xn_id); +#endif + break; + case XT_XN_REREAD: + if (invalid_rec != var_rec_id) { + invalid_rec = var_rec_id; + goto retry_3; + } + /* Assume end of list. */ +#ifdef XT_CRASH_DEBUG + /* Should not happen! */ + xt_crash_me(); +#endif + goto not_found; + } + var_rec_id = XT_GET_DISK_4(var_head.tr_prev_rec_id_4); + } +#ifdef TRACE_VARIATIONS + if (len <= 450) + sprintf(t_buf+len, " -> %d(%d)\n", (int) var_rec_id, (int) rec_head->tr_rec_type_1); + else + sprintf(t_buf+len, " ...\n"); + //xt_ttracef(thread, "%s", t_buf); +#endif + + if (ot->ot_for_update) { + xtBool ok; + XTLockWaitRec lw; + + if (wait) { + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); +#ifdef TRACE_VARIATIONS + xt_ttracef(thread, "T%d WAIT FOR T%d (will retry)\n", (int) thread->st_xact_data->xd_start_xn_id, (int) wait_xn_id); +#endif + if (!tab_wait_for_update(ot, row_id, wait_xn_id, thread)) + return XT_ERR; + wait = FALSE; + wait_xn_id = 0; + /* + * Retry in order to try to avoid missing + * any records that we should see in FOR UPDATE + * mode. + * + * We also want to take another look at the record + * we just tried to read. + * + * If it has been updated, then a new record has + * been created. This will be detected when we + * try to read it again, and XT_NEW will be returned. 
+ */ + thread->st_statistics.st_retry_index_scan++; + return XT_RETRY; + } + + /* {ROW-LIST-LOCK} */ + lw.lw_thread = thread; + lw.lw_ot = ot; + lw.lw_row_id = row_id; + lw.lw_row_updated = FALSE; + ok = tab->tab_locks.xt_set_temp_lock(ot, &lw, &thread->st_lock_list); + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); + if (!ok) { +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + return XT_ERR; + } + if (lw.lw_curr_lock != XT_NO_LOCK) { +#ifdef TRACE_VARIATIONS + xt_ttracef(thread, "T%d WAIT FOR LOCK(%D) T%d\n", (int) thread->st_xact_data->xd_start_xn_id, (int) lock_type, (int) xn_id); +#endif + if (!xt_xn_wait_for_xact(thread, NULL, &lw)) { +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + return XT_ERR; + } +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif +#ifdef TRACE_VARIATIONS + len = sprintf(t_buf, "(retry): row=%d rec=%d ", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + /* GOTCHA! + * Reset the result before we go down the list again, to make sure we + * get the latest record!! 
+ */ + result = TRUE; + thread->st_statistics.st_reread_record_list++; + goto retry_2; + } +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + } + else { + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); + } + +#ifdef TRACE_VARIATIONS + if (result == XT_NEW) + xt_ttracef(thread, "row=%d rec=%d RETURN NEW %d\n", (int) row_id, (int) ot->ot_curr_rec_id, (int) *new_rec_id); + else if (result) + xt_ttracef(thread, "row=%d rec=%d VISIBLE\n", (int) row_id, (int) ot->ot_curr_rec_id); + else + xt_ttracef(thread, "row=%d rec=%d RETURN NOT VISIBLE (NEW)\n", (int) row_id, (int) ot->ot_curr_rec_id); +#endif + + ot->ot_curr_row_id = row_id; + ot->ot_curr_updated = FALSE; + return result; + + not_found: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); + return FALSE; + + failed: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], thread); + return XT_ERR; +} + +/* + * Return TRUE if the record has been read, and is visible. + * Return FALSE if the record is not visible. + * Return XT_ERR if an error occurs. + */ +xtPublic int xt_tab_visible(XTOpenTablePtr ot) +{ + xtRowID row_id; + XTTabRecHeadDRec rec_head; + xtRecordID new_rec_id; + xtBool read_again = FALSE; + int r; + + if ((row_id = ot->ot_curr_row_id)) { + /* Fast track, do a quick check. + * Row ID is only set if this record has been committed, + * (and swept). + * Check if it is the first on the list! + */ + xtRecordID var_rec_id; + + retry: + if (!(xt_tab_get_row(ot, row_id, &var_rec_id))) + return XT_ERR; + if (ot->ot_curr_rec_id == var_rec_id) { + /* Looks good.. 
*/ + if (ot->ot_for_update) { + XTThreadPtr thread = ot->ot_thread; + XTTableHPtr tab = ot->ot_table; + XTLockWaitRec lw; + + /* {ROW-LIST-LOCK} */ + lw.lw_thread = thread; + lw.lw_ot = ot; + lw.lw_row_id = row_id; + lw.lw_row_updated = FALSE; + if (!tab->tab_locks.xt_set_temp_lock(ot, &lw, &thread->st_lock_list)) { +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + return XT_ERR; + } + if (lw.lw_curr_lock != XT_NO_LOCK) { + if (!xt_xn_wait_for_xact(thread, NULL, &lw)) { +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + return XT_ERR; + } +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + goto retry; + } +#ifdef DEBUG_LOCK_QUEUE + ot->ot_table->tab_locks.rl_check(&lw); +#endif + } + return TRUE; + } + } + + reread: + if (!xt_tab_get_rec_data(ot, ot->ot_curr_rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &rec_head)) + return XT_ERR; + + switch ((r = tab_visible(ot, &rec_head, &new_rec_id))) { + case XT_NEW: + ot->ot_curr_rec_id = new_rec_id; + break; + case XT_REREAD: + /* Avoid infinite loop: */ + if (read_again) { + /* Should not happen! */ +#ifdef XT_CRASH_DEBUG + /* Generate a core dump! */ + xt_crash_me(); +#endif + return FALSE; + } + read_again = TRUE; + goto reread; + default: + break; + } + return r; +} + +/* + * Read a record, and return one of the following: + * TRUE - the record has been read, and is visible. + * FALSE - the record is not visible. + * XT_ERR - an error occurs. + * XT_NEW - Means the expected record has been changed. + * When doing an index scan, the conditions must be checked again! 
+ */ +xtPublic int xt_tab_read_record(register XTOpenTablePtr ot, xtWord1 *buffer) +{ + register XTTableHPtr tab = ot->ot_table; + size_t rec_size = tab->tab_dic.dic_rec_size; + xtRecordID new_rec_id; + int result; + xtBool read_again = FALSE; + + if (!(ot->ot_thread->st_xact_data)) { + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_NO_TRANSACTION); + return XT_ERR; + } + + reread: + if (!xt_tab_get_rec_data(ot, ot->ot_curr_rec_id, rec_size, ot->ot_row_rbuffer)) + return XT_ERR; + + switch (tab_visible(ot, (XTTabRecHeadDPtr) ot->ot_row_rbuffer, &new_rec_id)) { + case FALSE: + return FALSE; + case XT_ERR: + return XT_ERR; + case XT_NEW: + if (!xt_tab_get_rec_data(ot, new_rec_id, rec_size, ot->ot_row_rbuffer)) + return XT_ERR; + ot->ot_curr_rec_id = new_rec_id; + result = XT_NEW; + break; + case XT_RETRY: + return XT_RETRY; + case XT_REREAD: + /* Avoid infinite loop: */ + if (read_again) { + /* Should not happen! */ +#ifdef XT_CRASH_DEBUG + /* Generate a core dump! */ + xt_crash_me(); +#endif + return FALSE; + } + read_again = TRUE; + goto reread; + default: + result = OK; + break; + } + + if (ot->ot_rec_fixed) + memcpy(buffer, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, rec_size - XT_REC_FIX_HEADER_SIZE); + else if (ot->ot_row_rbuffer[0] == XT_TAB_STATUS_VARIABLE || ot->ot_row_rbuffer[0] == XT_TAB_STATUS_VAR_CLEAN) { + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, buffer, ot->ot_cols_req)) + return XT_ERR; + } + else { + u_int cols_req = ot->ot_cols_req; + + ASSERT_NS(cols_req); + if (cols_req && cols_req <= tab->tab_dic.dic_fix_col_count) { + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_EXT_HEADER_SIZE, buffer, cols_req)) + return XT_ERR; + } + else { + if (!xt_tab_load_ext_data(ot, ot->ot_curr_rec_id, buffer, cols_req)) + return XT_ERR; + } + } + + return result; +} + +/* + * Returns: + * + * TRUE/OK - record was read. + * FALSE/FAILED - An error occurred. 
+ */ +xtPublic int xt_tab_dirty_read_record(register XTOpenTablePtr ot, xtWord1 *buffer) +{ + register XTTableHPtr tab = ot->ot_table; + size_t rec_size = tab->tab_dic.dic_rec_size; + + if (!xt_tab_get_rec_data(ot, ot->ot_curr_rec_id, rec_size, ot->ot_row_rbuffer)) + return FAILED; + + if (XT_REC_NOT_VALID(ot->ot_row_rbuffer[0])) { + /* Should not happen! */ + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_RECORD_DELETED); + return FAILED; + } + + ot->ot_curr_row_id = XT_GET_DISK_4(((XTTabRecHeadDPtr) ot->ot_row_rbuffer)->tr_row_id_4); + ot->ot_curr_updated = + (XT_GET_DISK_4(((XTTabRecHeadDPtr) ot->ot_row_rbuffer)->tr_xact_id_4) == ot->ot_thread->st_xact_data->xd_start_xn_id); + + if (ot->ot_rec_fixed) + memcpy(buffer, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, rec_size - XT_REC_FIX_HEADER_SIZE); + else if (ot->ot_row_rbuffer[0] == XT_TAB_STATUS_VARIABLE || ot->ot_row_rbuffer[0] == XT_TAB_STATUS_VAR_CLEAN) { + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, buffer, ot->ot_cols_req)) + return FAILED; + } + else { + u_int cols_req = ot->ot_cols_req; + + ASSERT_NS(cols_req); + if (cols_req && cols_req <= tab->tab_dic.dic_fix_col_count) { + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_EXT_HEADER_SIZE, buffer, cols_req)) + return FAILED; + } + else { + if (!xt_tab_load_ext_data(ot, ot->ot_curr_rec_id, buffer, cols_req)) + return FAILED; + } + } + + return OK; +} + +/* + * Pull the entire row pointer file into memory. 
+ */ +xtPublic void xt_tab_load_row_pointers(XTThreadPtr self, XTOpenTablePtr ot) +{ + XTTableHPtr tab = ot->ot_table; + xtRecordID eof_rec_id = tab->tab_row_eof_id; + xtInt8 usage; + xtWord1 *buffer = NULL; + + /* Check if there is enough cache: */ + usage = xt_tc_get_usage(); + if (xt_tc_get_high() > usage) + usage = xt_tc_get_high(); + if (usage + ((xtInt8) eof_rec_id * (xtInt8) tab->tab_rows.tci_rec_size) < xt_tc_get_size()) { + xtRecordID rec_id; + size_t poffset, tfer; + off_t offset, end_offset; + XTTabCachePagePtr page; + + end_offset = xt_row_id_to_row_offset(tab, eof_rec_id); + rec_id = 1; + while (rec_id < eof_rec_id) { + if (!tab->tab_rows.xt_tc_get_page(ot->ot_row_file, rec_id, &page, &poffset, self)) + xt_throw(self); + if (page) + tab->tab_rows.xt_tc_release_page(ot->ot_row_file, page, self); + else { + xtWord1 *buff_ptr; + + if (!buffer) + buffer = (xtWord1 *) xt_malloc(self, tab->tab_rows.tci_page_size); + offset = xt_row_id_to_row_offset(tab, rec_id); + tfer = tab->tab_rows.tci_page_size; + if (offset + (off_t) tfer > end_offset) + tfer = (size_t) (end_offset - offset); + XT_LOCK_MEMORY_PTR(buff_ptr, ot->ot_row_file, offset, tfer, &self->st_statistics.st_rec, self); + if (buff_ptr) { + memcpy(buffer, buff_ptr, tfer); + XT_UNLOCK_MEMORY_PTR(ot->ot_row_file, self); + } + } + rec_id += tab->tab_rows.tci_rows_per_page; + } + if (buffer) + xt_free(self, buffer); + } +} + +xtPublic void xt_tab_load_table(XTThreadPtr self, XTOpenTablePtr ot) +{ + xt_load_pages(self, ot); + xt_load_indices(self, ot); +} + +xtPublic xtBool xt_tab_load_record(register XTOpenTablePtr ot, xtRecordID rec_id, XTInfoBufferPtr rec_buf) +{ + register XTTableHPtr tab = ot->ot_table; + size_t rec_size = tab->tab_dic.dic_rec_size; + + if (!xt_tab_get_rec_data(ot, rec_id, rec_size, ot->ot_row_rbuffer)) + return FAILED; + + if (XT_REC_NOT_VALID(ot->ot_row_rbuffer[0])) { + /* Should not happen! 
*/ + XTThreadPtr self = ot->ot_thread; + + xt_log(XT_WARNING, "Recently updated record invalid\n"); + return OK; + } + + ot->ot_curr_row_id = XT_GET_DISK_4(((XTTabRecHeadDPtr) ot->ot_row_rbuffer)->tr_row_id_4); + ot->ot_curr_updated = + (XT_GET_DISK_4(((XTTabRecHeadDPtr) ot->ot_row_rbuffer)->tr_xact_id_4) == ot->ot_thread->st_xact_data->xd_start_xn_id); + + if (ot->ot_rec_fixed) { + size_t size = rec_size - XT_REC_FIX_HEADER_SIZE; + if (!xt_ib_alloc(NULL, rec_buf, size)) + return FAILED; + memcpy(rec_buf->ib_db.db_data, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, size); + } + else { + if (!xt_ib_alloc(NULL, rec_buf, tab->tab_dic.dic_buf_size)) + return FAILED; + if (ot->ot_row_rbuffer[0] == XT_TAB_STATUS_VARIABLE || ot->ot_row_rbuffer[0] == XT_TAB_STATUS_VAR_CLEAN) { + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, rec_buf->ib_db.db_data, ot->ot_cols_req)) + return FAILED; + } + else { + u_int cols_req = ot->ot_cols_req; + + ASSERT_NS(cols_req); + if (cols_req && cols_req <= tab->tab_dic.dic_fix_col_count) { + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_EXT_HEADER_SIZE, rec_buf->ib_db.db_data, cols_req)) + return FAILED; + } + else { + if (!xt_tab_load_ext_data(ot, ot->ot_curr_rec_id, rec_buf->ib_db.db_data, cols_req)) + return FAILED; + } + } + } + + return OK; +} + +xtPublic xtBool xt_tab_free_row(XTOpenTablePtr ot, XTTableHPtr tab, xtRowID row_id) +{ + XTTabRowRefDRec free_row; + xtRowID prev_row; + xtOpSeqNo op_seq; + + ASSERT_NS(row_id); // Cannot free the header! 
+ + xt_lock_mutex_ns(&tab->tab_row_lock); + prev_row = tab->tab_row_free_id; + XT_SET_DISK_4(free_row.rr_ref_id_4, prev_row); + if (!tab->tab_rows.xt_tc_write(ot->ot_row_file, row_id, 0, sizeof(XTTabRowRefDRec), (xtWord1 *) &free_row, &op_seq, TRUE, ot->ot_thread)) { + xt_unlock_mutex_ns(&tab->tab_row_lock); + return FAILED; + } + tab->tab_row_free_id = row_id; + tab->tab_row_fnum++; + xt_unlock_mutex_ns(&tab->tab_row_lock); + + if (!xt_xlog_modify_table(ot, XT_LOG_ENT_ROW_FREED, op_seq, 0, row_id, sizeof(XTTabRowRefDRec), (xtWord1 *) &free_row)) + return FAILED; + + return OK; +} + +static void tab_free_ext_record_on_fail(XTOpenTablePtr ot, xtRecordID rec_id, XTTabRecExtDPtr ext_rec, xtBool log_err) +{ + xtWord4 log_over_size = XT_GET_DISK_4(ext_rec->re_log_dat_siz_4); + xtLogID log_id; + xtLogOffset log_offset; + + XT_GET_LOG_REF(log_id, log_offset, ext_rec); + + if (!ot->ot_thread->st_dlog_buf.dlb_delete_log(log_id, log_offset, log_over_size, ot->ot_table->tab_id, rec_id, ot->ot_thread)) { + if (log_err) + xt_log_and_clear_exception_ns(); + } +} + +static void tab_save_exception(XTExceptionPtr e) +{ + XTThreadPtr self = xt_get_self(); + + *e = self->t_exception; +} + +static void tab_restore_exception(XTExceptionPtr e) +{ + XTThreadPtr self = xt_get_self(); + + self->t_exception = *e; +} + +/* + * This function assumes that a record may be partially written. + * It removes all associated data and references to the record. + * + * This function return XT_ERR if an error occurs. + * TRUE if the record has been removed, and may be freed. + * FALSE if the record has already been freed. 
+ * + */ +xtPublic int xt_tab_remove_record(XTOpenTablePtr ot, xtRecordID rec_id, xtWord1 *rec_data, xtRecordID *prev_var_id, xtBool clean_delete, xtRowID row_id, xtXactID xn_id __attribute__((unused))) +{ + register XTTableHPtr tab = ot->ot_table; + size_t rec_size; + xtWord1 old_rec_type; + u_int cols_req; + u_int cols_in_buffer; + + *prev_var_id = 0; + + if (!rec_id) + return FALSE; + + /* + * NOTE: This function uses the read buffer. This should be OK because + * the function is only called by the sweeper. The read buffer + * is REQUIRED because of the call to xt_tab_load_ext_data()!!! + */ + rec_size = tab->tab_dic.dic_rec_size; + if (!xt_tab_get_rec_data(ot, rec_id, rec_size, ot->ot_row_rbuffer)) + return XT_ERR; + old_rec_type = ot->ot_row_rbuffer[0]; + + /* Check of the record has not already been freed: */ + if (XT_REC_IS_FREE(old_rec_type)) + return FALSE; + + /* This record must belong to the given row: */ + if (XT_GET_DISK_4(((XTTabRecExtDPtr) ot->ot_row_rbuffer)->tr_row_id_4) != row_id) + return FALSE; + + /* The transaction ID of the record must be BEFORE or equal to the given + * transaction ID. + * + * No, this does not always hold. Because we wait for updates now, + * a "younger" transaction can update before an older + * transaction. + * Commit order determined the actual order in which the transactions + * should be replicated. This is determined by the log number of + * the commit record! + if (db->db_xn_curr_id(xn_id, XT_GET_DISK_4(((XTTabRecExtDPtr) ot->ot_row_rbuffer)->tr_xact_id_4))) + return FALSE; + */ + + *prev_var_id = XT_GET_DISK_4(((XTTabRecExtDPtr) ot->ot_row_rbuffer)->tr_prev_rec_id_4); + + if (tab->tab_dic.dic_key_count) { + XTIndexPtr *ind; + + switch (old_rec_type) { + case XT_TAB_STATUS_DELETE: + case XT_TAB_STATUS_DEL_CLEAN: + rec_size = sizeof(XTTabRecHeadDRec); + goto set_removed; + case XT_TAB_STATUS_FIXED: + case XT_TAB_STATUS_FIX_CLEAN: + /* We know that for a fixed length record, + * dic_ind_rec_len <= dic_rec_size! 
*/ + rec_size = (size_t) tab->tab_dic.dic_ind_rec_len + XT_REC_FIX_HEADER_SIZE; + rec_data = ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE; + break; + case XT_TAB_STATUS_VARIABLE: + case XT_TAB_STATUS_VAR_CLEAN: + cols_req = tab->tab_dic.dic_ind_cols_req; + + cols_in_buffer = cols_req; + rec_size = myxt_load_row_length(ot, rec_size - XT_REC_FIX_HEADER_SIZE, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, &cols_in_buffer); + if (cols_in_buffer < cols_req) + rec_size = tab->tab_dic.dic_rec_size; + else + rec_size += XT_REC_FIX_HEADER_SIZE; + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, rec_data, cols_req)) { + xt_log_and_clear_exception_ns(); + goto set_removed; + } + break; + case XT_TAB_STATUS_EXT_DLOG: + case XT_TAB_STATUS_EXT_CLEAN: + cols_req = tab->tab_dic.dic_ind_cols_req; + + ASSERT_NS(cols_req); + cols_in_buffer = cols_req; + rec_size = myxt_load_row_length(ot, rec_size - XT_REC_EXT_HEADER_SIZE, ot->ot_row_rbuffer + XT_REC_EXT_HEADER_SIZE, &cols_in_buffer); + if (cols_in_buffer < cols_req) { + rec_size = tab->tab_dic.dic_rec_size; + if (!xt_tab_load_ext_data(ot, rec_id, rec_data, cols_req)) { + /* This is actually quite possible after recovery, see [(3)] */ + if (ot->ot_thread->t_exception.e_xt_err != XT_ERR_BAD_EXT_RECORD && + ot->ot_thread->t_exception.e_xt_err != XT_ERR_DATA_LOG_NOT_FOUND) + xt_log_and_clear_exception_ns(); + goto set_removed; + } + } + else { + /* All the records we require are in the buffer... */ + rec_size += XT_REC_EXT_HEADER_SIZE; + if (!myxt_load_row(ot, ot->ot_row_rbuffer + XT_REC_EXT_HEADER_SIZE, rec_data, cols_req)) { + xt_log_and_clear_exception_ns(); + goto set_removed; + } + } + break; + default: + break; + } + + /* Could this be the case?: This change may only be flushed after the + * operation below has been flushed to the log. + * + * No, remove records are never "undone". The sweeper will delete + * the record again if it does not land in the log. 
+ * + * The fact that the index entries have already been removed is not + * a problem. + */ + if (!tab->tab_dic.dic_disable_index) { + ind = tab->tab_dic.dic_keys; + for (u_int i=0; i<tab->tab_dic.dic_key_count; i++, ind++) { + if (!xt_idx_delete(ot, *ind, rec_id, rec_data)) + xt_log_and_clear_exception_ns(); + } + } + } + else { + /* No indices: */ + switch (old_rec_type) { + case XT_TAB_STATUS_DELETE: + case XT_TAB_STATUS_DEL_CLEAN: + rec_size = XT_REC_FIX_HEADER_SIZE; + break; + case XT_TAB_STATUS_FIXED: + case XT_TAB_STATUS_FIX_CLEAN: + case XT_TAB_STATUS_VARIABLE: + case XT_TAB_STATUS_VAR_CLEAN: + rec_size = XT_REC_FIX_HEADER_SIZE; + break; + case XT_TAB_STATUS_EXT_DLOG: + case XT_TAB_STATUS_EXT_CLEAN: + rec_size = XT_REC_EXT_HEADER_SIZE; + break; + } + } + +#ifdef XT_STREAMING + if (tab->tab_dic.dic_blob_count) { + /* If the record contains any LONGBLOB then check how much + * space we need. + */ + size_t blob_size; + + switch (old_rec_type) { + case XT_TAB_STATUS_DELETE: + case XT_TAB_STATUS_DEL_CLEAN: + break; + case XT_TAB_STATUS_FIXED: + case XT_TAB_STATUS_FIX_CLEAN: + /* Should not be the case, record with LONGBLOB can never be fixed! 
*/ + break; + case XT_TAB_STATUS_VARIABLE: + case XT_TAB_STATUS_VAR_CLEAN: + cols_req = tab->tab_dic.dic_blob_cols_req; + cols_in_buffer = cols_req; + blob_size = myxt_load_row_length(ot, rec_size - XT_REC_FIX_HEADER_SIZE, ot->ot_row_rbuffer + XT_REC_FIX_HEADER_SIZE, &cols_in_buffer); + if (cols_in_buffer < cols_req) + blob_size = tab->tab_dic.dic_rec_size; + else + blob_size += XT_REC_FIX_HEADER_SIZE; + if (blob_size > rec_size) + rec_size = blob_size; + break; + case XT_TAB_STATUS_EXT_DLOG: + case XT_TAB_STATUS_EXT_CLEAN: + cols_req = tab->tab_dic.dic_blob_cols_req; + cols_in_buffer = cols_req; + blob_size = myxt_load_row_length(ot, rec_size - XT_REC_EXT_HEADER_SIZE, ot->ot_row_rbuffer + XT_REC_EXT_HEADER_SIZE, &cols_in_buffer); + if (cols_in_buffer < cols_req) + blob_size = tab->tab_dic.dic_rec_size; + else + blob_size += XT_REC_EXT_HEADER_SIZE; + if (blob_size > rec_size) + rec_size = blob_size; + break; + } + } +#endif + + set_removed: + if (XT_REC_IS_EXT_DLOG(old_rec_type)) { + /* {LOCK-EXT-REC} Lock, and read again to make sure that the + * compactor does not change this record, while + * we are removing it! */ + xt_lock_mutex_ns(&tab->tab_db->db_co_ext_lock); + if (!xt_tab_get_rec_data(ot, rec_id, XT_REC_EXT_HEADER_SIZE, ot->ot_row_rbuffer)) { + xt_unlock_mutex_ns(&tab->tab_db->db_co_ext_lock); + return FAILED; + } + xt_unlock_mutex_ns(&tab->tab_db->db_co_ext_lock); + + } + + xtOpSeqNo op_seq; + XTTabRecFreeDPtr free_rec = (XTTabRecFreeDPtr) ot->ot_row_rbuffer; + xtRecordID prev_rec_id; + + /* A record is "clean" deleted if the record was + * XT_TAB_STATUS_DELETE which was comitted. + * This makes sure that the record will still invalidate + * following records in a row. + * + * Example: + * + * 1. INSERT A ROW, then DELETE it, assume the sweeper is delayed. + * + * We now have the sequence row X --> del rec A --> valid rec B. + * + * 2. A SELECT can still find B. Assume it now goes to check + * if the record is valid, it reads row X, and gets A. + * + * 3. 
Now the sweeper gets control and removes X, A and B. + * It frees A with the clean bit. + * + * 4. Now the SELECT gets control and reads A. Normally a freed record + * would be ignored, and it would go onto B, which would then + * be considered valid (note, even after the free, the next + * pointer is not affected). + * + * However, because the clean bit has been set, it will stop at A + * and consider B invalid (which is the desired result). + * + * NOTE: We assume it is not possible for A to be allocated and refer + * to B, because B is freed before A. This means that B may refer to + * A after the next allocation. + */ + + xtWord1 new_rec_type = XT_TAB_STATUS_FREED | (clean_delete ? XT_TAB_STATUS_CLEANED_BIT : 0); + + xt_lock_mutex_ns(&tab->tab_rec_lock); + free_rec->rf_rec_type_1 = new_rec_type; + prev_rec_id = tab->tab_rec_free_id; + XT_SET_DISK_4(free_rec->rf_next_rec_id_4, prev_rec_id); + if (!xt_tab_put_rec_data(ot, rec_id, sizeof(XTTabRecFreeDRec), ot->ot_row_rbuffer, &op_seq)) { + xt_unlock_mutex_ns(&tab->tab_rec_lock); + return FAILED; + } + tab->tab_rec_free_id = rec_id; + ASSERT_NS(tab->tab_rec_free_id < tab->tab_rec_eof_id); + tab->tab_rec_fnum++; + xt_unlock_mutex_ns(&tab->tab_rec_lock); + + free_rec->rf_rec_type_1 = old_rec_type; + return xt_xlog_modify_table(ot, XT_LOG_ENT_REC_REMOVED_BI, op_seq, (xtRecordID) new_rec_type, rec_id, rec_size, ot->ot_row_rbuffer); +} + +static xtRowID tab_new_row(XTOpenTablePtr ot, XTTableHPtr tab) +{ + xtRowID row_id; + xtOpSeqNo op_seq; + xtRowID next_row_id = 0; + u_int status; + + xt_lock_mutex_ns(&tab->tab_row_lock); + if ((row_id = tab->tab_row_free_id)) { + status = XT_LOG_ENT_ROW_NEW_FL; + + if (!tab->tab_rows.xt_tc_read_4(ot->ot_row_file, row_id, &next_row_id, ot->ot_thread)) { + xt_unlock_mutex_ns(&tab->tab_row_lock); + return 0; + } + tab->tab_row_free_id = next_row_id; + tab->tab_row_fnum--; + } + else { + status = XT_LOG_ENT_ROW_NEW; + row_id = tab->tab_row_eof_id; + if (row_id == 0xFFFFFFFF) { + 
xt_unlock_mutex_ns(&tab->tab_row_lock); + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_MAX_ROW_COUNT); + return 0; + } + if (((row_id - 1) % tab->tab_rows.tci_rows_per_page) == 0) { + /* By fetching the page now, we avoid reading it later... */ + XTTabCachePagePtr page; + XTTabCacheSegPtr seg; + size_t poffset; + + if (!tab->tab_rows.tc_fetch(ot->ot_row_file, row_id, &seg, &page, &poffset, FALSE, ot->ot_thread)) { + xt_unlock_mutex_ns(&tab->tab_row_lock); + return 0; + } + xt_rwmutex_unlock(&seg->tcs_lock, ot->ot_thread->t_id); + } + tab->tab_row_eof_id++; + } + op_seq = tab->tab_seq.ts_get_op_seq(); + xt_unlock_mutex_ns(&tab->tab_row_lock); + + if (!xt_xlog_modify_table(ot, status, op_seq, next_row_id, row_id, 0, NULL)) + return 0; + + XT_DISABLED_TRACE(("new row tx=%d row=%d\n", (int) ot->ot_thread->st_xact_data->xd_start_xn_id, (int) row_id)); + ASSERT_NS(row_id); + return row_id; +} + +xtPublic xtBool xt_tab_get_row(register XTOpenTablePtr ot, xtRowID row_id, xtRecordID *var_rec_id) +{ + register XTTableHPtr tab = ot->ot_table; + + (void) ASSERT_NS(sizeof(XTTabRowRefDRec) == 4); + + if (!tab->tab_rows.xt_tc_read_4(ot->ot_row_file, row_id, var_rec_id, ot->ot_thread)) + return FAILED; + return OK; +} + +xtPublic xtBool xt_tab_set_row(XTOpenTablePtr ot, u_int status, xtRowID row_id, xtRecordID var_rec_id) +{ + register XTTableHPtr tab = ot->ot_table; + XTTabRowRefDRec row_buf; + xtOpSeqNo op_seq; + + ASSERT_NS(var_rec_id < tab->tab_rec_eof_id); + XT_SET_DISK_4(row_buf.rr_ref_id_4, var_rec_id); + + if (!tab->tab_rows.xt_tc_write(ot->ot_row_file, row_id, 0, sizeof(XTTabRowRefDRec), (xtWord1 *) &row_buf, &op_seq, TRUE, ot->ot_thread)) + return FAILED; + + return xt_xlog_modify_table(ot, status, op_seq, 0, row_id, sizeof(XTTabRowRefDRec), (xtWord1 *) &row_buf); +} + +xtPublic xtBool xt_tab_free_record(XTOpenTablePtr ot, u_int status, xtRecordID rec_id, xtBool clean_delete) +{ + register XTTableHPtr tab = ot->ot_table; + XTTabRecHeadDRec rec_head; + XTactFreeRecEntryDRec 
free_rec; + xtRecordID prev_rec_id; + + /* Don't free the record if it is already free! */ + if (!xt_tab_get_rec_data(ot, rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &rec_head)) + return FAILED; + + if (!XT_REC_IS_FREE(rec_head.tr_rec_type_1)) { + xtOpSeqNo op_seq; + + /* This information will be used to determine if the resources of the record + * should be removed. + */ + free_rec.fr_stat_id_1 = rec_head.tr_stat_id_1; + XT_COPY_DISK_4(free_rec.fr_xact_id_4, rec_head.tr_xact_id_4); + + /* A record is "clean" deleted if the record was + * XT_TAB_STATUS_DELETE which was comitted. + * This makes sure that the record will still invalidate + * following records in a row. + * + * Example: + * + * 1. INSERT A ROW, then DELETE it, assume the sweeper is delayed. + * + * We now have the sequence row X --> del rec A --> valid rec B. + * + * 2. A SELECT can still find B. Assume it now goes to check + * if the record is valid, ti reads row X, and gets A. + * + * 3. Now the sweeper gets control and removes X, A and B. + * It frees A with the clean bit. + * + * 4. Now the SELECT gets control and reads A. Normally a freed record + * would be ignored, and it would go onto B, which would then + * be considered valid (note, even after the free, the next + * pointer is not affected). + * + * However, because the clean bit has been set, it will stop at A + * and consider B invalid (which is the desired result). + * + * NOTE: We assume it is not possible for A to be allocated and refer + * to B, because B is freed before A. This means that B may refer to + * A after the next allocation. + */ + + (void) ASSERT_NS(sizeof(XTTabRecFreeDRec) == sizeof(XTactFreeRecEntryDRec) - offsetof(XTactFreeRecEntryDRec, fr_rec_type_1)); + free_rec.fr_rec_type_1 = XT_TAB_STATUS_FREED | (clean_delete ? 
XT_TAB_STATUS_CLEANED_BIT : 0); + free_rec.fr_not_used_1 = 0; + + xt_lock_mutex_ns(&tab->tab_rec_lock); + prev_rec_id = tab->tab_rec_free_id; + XT_SET_DISK_4(free_rec.fr_next_rec_id_4, prev_rec_id); + if (!xt_tab_put_rec_data(ot, rec_id, sizeof(XTTabRecFreeDRec), &free_rec.fr_rec_type_1, &op_seq)) { + xt_unlock_mutex_ns(&tab->tab_rec_lock); + return FAILED; + } + tab->tab_rec_free_id = rec_id; + ASSERT_NS(tab->tab_rec_free_id < tab->tab_rec_eof_id); + tab->tab_rec_fnum++; + xt_unlock_mutex_ns(&tab->tab_rec_lock); + + if (!xt_xlog_modify_table(ot, status, op_seq, rec_id, rec_id, sizeof(XTactFreeRecEntryDRec) - offsetof(XTactFreeRecEntryDRec, fr_stat_id_1), &free_rec.fr_stat_id_1)) + return FAILED; + } + return OK; +} + +static void tab_free_row_on_fail(XTOpenTablePtr ot, XTTableHPtr tab, xtRowID row_id) +{ + XTExceptionRec e; + + tab_save_exception(&e); + xt_tab_free_row(ot, tab, row_id); + tab_restore_exception(&e); +} + +static xtBool tab_add_record(XTOpenTablePtr ot, XTTabRecInfoPtr rec_info, u_int status) +{ + register XTTableHPtr tab = ot->ot_table; + XTThreadPtr thread = ot->ot_thread; + xtRecordID rec_id; + xtLogID log_id; + xtLogOffset log_offset; + xtOpSeqNo op_seq; + xtRecordID next_rec_id = 0; + + if (rec_info->ri_ext_rec) { + /* Determine where the overflow will go... 
*/ + if (!thread->st_dlog_buf.dlb_get_log_offset(&log_id, &log_offset, rec_info->ri_log_data_size + offsetof(XTactExtRecEntryDRec, er_data), ot->ot_thread)) + return FAILED; + XT_SET_LOG_REF(rec_info->ri_ext_rec, log_id, log_offset); + } + + /* Write the record to disk: */ + xt_lock_mutex_ns(&tab->tab_rec_lock); + if ((rec_id = tab->tab_rec_free_id)) { + XTTabRecFreeDRec free_block; + + ASSERT_NS(rec_id < tab->tab_rec_eof_id); + if (!xt_tab_get_rec_data(ot, rec_id, sizeof(XTTabRecFreeDRec), (xtWord1 *) &free_block)) { + xt_unlock_mutex_ns(&tab->tab_rec_lock); + return FAILED; + } + next_rec_id = XT_GET_DISK_4(free_block.rf_next_rec_id_4); + tab->tab_rec_free_id = next_rec_id; + + tab->tab_rec_fnum--; + + /* XT_LOG_ENT_UPDATE --> XT_LOG_ENT_UPDATE_FL */ + /* XT_LOG_ENT_INSERT --> XT_LOG_ENT_INSERT_FL */ + /* XT_LOG_ENT_DELETE --> XT_LOG_ENT_DELETE_FL */ + status += 2; + + if (!xt_tab_put_rec_data(ot, rec_id, rec_info->ri_rec_buf_size, (xtWord1 *) rec_info->ri_fix_rec_buf, &op_seq)) { + xt_unlock_mutex_ns(&tab->tab_rec_lock); + return FAILED; + } + } + else { + xtBool read; + + rec_id = tab->tab_rec_eof_id; + tab->tab_rec_eof_id++; + + /* If we are writing to a new page (at the EOF) + * then we do not need to read the page from the + * file because it is new. + * + * Note that this only works because we are holding + * a lock on the record file. 
+ */ + read = ((rec_id - 1) % tab->tab_recs.tci_rows_per_page) != 0; + + if (!tab->tab_recs.xt_tc_write(ot->ot_rec_file, rec_id, 0, rec_info->ri_rec_buf_size, (xtWord1 *) rec_info->ri_fix_rec_buf, &op_seq, read, ot->ot_thread)) { + xt_unlock_mutex_ns(&tab->tab_rec_lock); + return FAILED; + } + } + xt_unlock_mutex_ns(&tab->tab_rec_lock); + + if (!xt_xlog_modify_table(ot, status, op_seq, next_rec_id, rec_id, rec_info->ri_rec_buf_size, (xtWord1 *) rec_info->ri_fix_rec_buf)) + return FAILED; + + if (rec_info->ri_ext_rec) { + /* Write the log buffer overflow: */ + rec_info->ri_log_buf->er_status_1 = XT_LOG_ENT_EXT_REC_OK; + XT_SET_DISK_4(rec_info->ri_log_buf->er_data_size_4, rec_info->ri_log_data_size); + XT_SET_DISK_4(rec_info->ri_log_buf->er_tab_id_4, tab->tab_id); + XT_SET_DISK_4(rec_info->ri_log_buf->er_rec_id_4, rec_id); + if (!thread->st_dlog_buf.dlb_append_log(log_id, log_offset, offsetof(XTactExtRecEntryDRec, er_data) + rec_info->ri_log_data_size, (xtWord1 *) rec_info->ri_log_buf, ot->ot_thread)) { + /* Failed to write the overflow, free the record allocated above: */ + return FAILED; + } + } + + XT_DISABLED_TRACE(("new rec tx=%d val=%d\n", (int) thread->st_xact_data->xd_start_xn_id, (int) rec_id)); + rec_info->ri_rec_id = rec_id; + return OK; +} + +static void tab_delete_record_on_fail(XTOpenTablePtr ot, xtRowID row_id, xtRecordID rec_id, XTTabRecHeadDPtr row_ptr, xtWord1 *rec_data, u_int key_count) +{ + XTExceptionRec e; + xtBool log_err = TRUE; + XTTabRecInfoRec rec_info; + + tab_save_exception(&e); + + if (e.e_xt_err == XT_ERR_DUPLICATE_KEY || + e.e_xt_err == XT_ERR_DUPLICATE_FKEY) { + /* If the error does not cause rollback, then we will ignore the + * error if an error occurs in the UNDO! 
+ */ + log_err = FALSE; + tab_restore_exception(&e); + } + if (key_count) { + XTIndexPtr *ind; + + ind = ot->ot_table->tab_dic.dic_keys; + for (u_int i=0; i<key_count; i++, ind++) { + if (!xt_idx_delete(ot, *ind, rec_id, rec_data)) { + if (log_err) + xt_log_and_clear_exception_ns(); + } + } + } + + if (row_ptr->tr_rec_type_1 == XT_TAB_STATUS_EXT_DLOG || row_ptr->tr_rec_type_1 == XT_TAB_STATUS_EXT_CLEAN) + tab_free_ext_record_on_fail(ot, rec_id, (XTTabRecExtDPtr) row_ptr, log_err); + + rec_info.ri_fix_rec_buf = (XTTabRecFixDPtr) ot->ot_row_wbuffer; + rec_info.ri_rec_buf_size = offsetof(XTTabRecFixDRec, rf_data); + rec_info.ri_ext_rec = NULL; + rec_info.ri_fix_rec_buf->tr_rec_type_1 = XT_TAB_STATUS_DELETE; + rec_info.ri_fix_rec_buf->tr_stat_id_1 = 0; + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_row_id_4, row_id); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_prev_rec_id_4, rec_id); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_xact_id_4, ot->ot_thread->st_xact_data->xd_start_xn_id); + + if (!tab_add_record(ot, &rec_info, XT_LOG_ENT_DELETE)) + goto failed; + + if (!xt_tab_set_row(ot, XT_LOG_ENT_ROW_ADD_REC, row_id, rec_info.ri_rec_id)) + goto failed; + + if (log_err) + tab_restore_exception(&e); + return; + + failed: + if (log_err) + xt_log_and_clear_exception_ns(); + else + tab_restore_exception(&e); +} + +/* + * Wait until all the variations between the start of the chain, and + * the given record have been rolled-back. + * If any is committed, register a locked error, and return FAILED. 
+ */ +static xtBool tab_wait_for_rollback(XTOpenTablePtr ot, xtRowID row_id, xtRecordID commit_rec_id) +{ + register XTTableHPtr tab = ot->ot_table; + xtRecordID var_rec_id; + XTTabRecHeadDRec var_head; + xtXactID xn_id; + xtRecordID invalid_rec = 0; + XTXactWaitRec xw; + + retry: + if (!xt_tab_get_row(ot, row_id, &var_rec_id)) + return FAILED; + + while (var_rec_id != commit_rec_id) { + if (!var_rec_id) + goto locked; + if (!xt_tab_get_rec_data(ot, var_rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &var_head)) + return FAILED; + if (XT_REC_IS_CLEAN(var_head.tr_rec_type_1)) + goto locked; + if (XT_REC_IS_FREE(var_head.tr_rec_type_1)) + /* Should not happen: */ + goto record_invalid; + xn_id = XT_GET_DISK_4(var_head.tr_xact_id_4); + switch (xt_xn_status(ot, xn_id, var_rec_id)) { + case XT_XN_VISIBLE: + case XT_XN_NOT_VISIBLE: + goto locked; + case XT_XN_ABORTED: + /* Ignore the record, it will be removed. */ + break; + case XT_XN_MY_UPDATE: + /* Should not happen: */ + goto locked; + case XT_XN_OTHER_UPDATE: + /* Wait for the transaction to commit or rollback: */ + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + xw.xw_xn_id = xn_id; + if (!xt_xn_wait_for_xact(ot->ot_thread, &xw, NULL)) { + XT_TAB_ROW_WRITE_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + return FAILED; + } + XT_TAB_ROW_WRITE_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + goto retry; + case XT_XN_REREAD: + goto record_invalid; + } + var_rec_id = XT_GET_DISK_4(var_head.tr_prev_rec_id_4); + } + return OK; + + locked: + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_RECORD_CHANGED); + return FAILED; + + record_invalid: + /* Prevent an infinite loop due to a bad record: + * remember the bad record and retry once. */ + if (invalid_rec != var_rec_id) { + invalid_rec = var_rec_id; + goto retry; + } + /* The record is invalid, it will be "overwritten"... */ +#ifdef XT_CRASH_DEBUG + /* Should not happen!
*/ + xt_crash_me(); +#endif + return OK; +} + +/* Check if a record may be visible: + * Return TRUE if the record may be visible now. + * Return XT_MAYBE if the record may be visible in the future (set out_xn_id). + * Return FALSE if the record is not valid (freed or is a delete record). + * Return XT_ERR if an error occurred. + */ +xtPublic int xt_tab_maybe_committed(XTOpenTablePtr ot, xtRecordID rec_id, xtXactID *out_xn_id, xtRowID *out_rowid, xtBool *out_updated) +{ + XTTabRecHeadDRec rec_head; + xtXactID rec_xn_id = 0; + xtBool wait = FALSE; + xtXactID wait_xn_id = 0; + xtRowID row_id; + xtRecordID var_rec_id; + xtXactID xn_id; + register XTTableHPtr tab; +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + char t_buf[500]; + int len; + char *t_type = "C"; +#endif + xtRecordID invalid_rec = 0; + + reread: + if (!xt_tab_get_rec_data(ot, rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &rec_head)) + return XT_ERR; + + if (XT_REC_NOT_VALID(rec_head.tr_rec_type_1)) + return FALSE; + + if (!XT_REC_IS_CLEAN(rec_head.tr_rec_type_1)) { + rec_xn_id = XT_GET_DISK_4(rec_head.tr_xact_id_4); + switch (xt_xn_status(ot, rec_xn_id, rec_id)) { + case XT_XN_VISIBLE: +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + t_type="V"; +#endif + break; + case XT_XN_NOT_VISIBLE: +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + t_type="NV"; +#endif + break; + case XT_XN_ABORTED: + return FALSE; + case XT_XN_MY_UPDATE: +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + t_type="My-Upd"; +#endif + break; + case XT_XN_OTHER_UPDATE: +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + t_type="Wait"; +#endif + wait = TRUE; + wait_xn_id = rec_xn_id; + break; + case XT_XN_REREAD: +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + t_type="Re-read"; +#endif + /* Avoid infinite loop: */ + if (invalid_rec == rec_id) { + /* Should not happen! */ +#ifdef XT_CRASH_DEBUG + /* Generate a core dump! */ + xt_crash_me(); +#endif + return FALSE; + } + invalid_rec = rec_id; + goto reread; + } + } + + /* Follow the variation chain until we come to this record.
+ * If it is not the first visible variation then + * it is not visible at all. If it is not found on the + * variation chain, it is also not visible. + */ + row_id = XT_GET_DISK_4(rec_head.tr_row_id_4); + + tab = ot->ot_table; + XT_TAB_ROW_READ_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + + invalid_rec = 0; + retry: + if (!(xt_tab_get_row(ot, row_id, &var_rec_id))) + goto failed; +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + len = sprintf(t_buf, "dup row=%d", (int) row_id); +#endif + while (var_rec_id != rec_id) { + if (!var_rec_id) + goto not_found; +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + if (len <= 450) + len += sprintf(t_buf+len, " -> %d", (int) var_rec_id); +#endif + if (!xt_tab_get_rec_data(ot, var_rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &rec_head)) + goto failed; + /* All clean records are visible, by all transactions: */ + if (XT_REC_IS_CLEAN(rec_head.tr_rec_type_1)) + goto not_found; + + if (XT_REC_IS_FREE(rec_head.tr_rec_type_1)) { + /* Should not happen: remember the bad record and retry once. */ + if (invalid_rec != var_rec_id) { + invalid_rec = var_rec_id; + goto retry; + } + /* Assume end of list. */ +#ifdef XT_CRASH_DEBUG + /* Should not happen! */ + xt_crash_me(); +#endif + goto not_found; + } + + xn_id = XT_GET_DISK_4(rec_head.tr_xact_id_4); + switch (xt_xn_status(ot, xn_id, var_rec_id)) { + case XT_XN_VISIBLE: + case XT_XN_NOT_VISIBLE: + goto not_found; + case XT_XN_ABORTED: + /* Ignore the record, it will be removed.
*/ +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + if (len <= 450) + len += sprintf(t_buf+len, "(T%d-A)", (int) xn_id); +#endif + break; + case XT_XN_MY_UPDATE: + goto not_found; + case XT_XN_OTHER_UPDATE: +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + if (len <= 450) + len += sprintf(t_buf+len, "(T%d-wait)", (int) xn_id); +#endif + /* Wait for this update to commit or abort: */ + if (!wait) { + wait = TRUE; + wait_xn_id = xn_id; + } + break; + case XT_XN_REREAD: + /* Remember the bad record and retry once: */ + if (invalid_rec != var_rec_id) { + invalid_rec = var_rec_id; + goto retry; + } + /* Assume end of list. */ +#ifdef XT_CRASH_DEBUG + /* Should not happen! */ + xt_crash_me(); +#endif + goto not_found; + } + var_rec_id = XT_GET_DISK_4(rec_head.tr_prev_rec_id_4); + } +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + if (len <= 450) + sprintf(t_buf+len, " -> %d(T%d-%s)\n", (int) var_rec_id, (int) rec_xn_id, t_type); + else + sprintf(t_buf+len, " ...(T%d-%s)\n", (int) rec_xn_id, t_type); +#endif + + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + if (wait) { + *out_xn_id = wait_xn_id; + return XT_MAYBE; + } +#ifdef TRACE_VARIATIONS_IN_DUP_CHECK + xt_ttracef(thread, "%s", t_buf); +#endif + if (out_rowid) { + *out_rowid = row_id; + *out_updated = (rec_xn_id == ot->ot_thread->st_xact_data->xd_start_xn_id); + } + return TRUE; + + not_found: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + return FALSE; + + failed: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + return XT_ERR; +} + +xtPublic xtBool xt_tab_new_record(XTOpenTablePtr ot, xtWord1 *rec_buf) +{ + register XTTableHPtr tab = ot->ot_table; + register XTThreadPtr self = ot->ot_thread; + XTTabRecInfoRec rec_info; + xtRowID row_id; + u_int idx_cnt = 0; + XTIndexPtr *ind; +#ifdef XT_STREAMING + void *pbms_table; + + /* PBMS: Reference BLOBs!?
*/ + if (tab->tab_dic.dic_blob_count) { + if (!myxt_use_blobs(ot, &pbms_table, rec_buf)) + return FAILED; + } +#endif + + if (!myxt_store_row(ot, &rec_info, (char *) rec_buf)) + goto failed_0; + + /* Get a new row ID: */ + if (!(row_id = tab_new_row(ot, tab))) + goto failed_0; + + rec_info.ri_fix_rec_buf->tr_stat_id_1 = self->st_update_id; + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_row_id_4, row_id); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_prev_rec_id_4, 0); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_xact_id_4, self->st_xact_data->xd_start_xn_id); + + /* Note, it is important that this record is written BEFORE the row + * due to the problem distributed here [(5)] + */ + if (!tab_add_record(ot, &rec_info, XT_LOG_ENT_INSERT)) + goto failed_1; + +#ifdef TRACE_VARIATIONS + xt_ttracef(self, "insert: row=%d rec=%d T%d\n", (int) row_id, (int) rec_info.ri_rec_id, (int) self->st_xact_data->xd_start_xn_id); +#endif + if (!xt_tab_set_row(ot, XT_LOG_ENT_ROW_ADD_REC, row_id, rec_info.ri_rec_id)) + goto failed_1; + XT_DISABLED_TRACE(("set new tx=%d row=%d rec=%d\n", (int) self->st_xact_data->xd_start_xn_id, (int) row_id, (int) rec_info.ri_rec_id)); + + /* Add the index references: */ + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + if (!xt_idx_insert(ot, *ind, 0, rec_info.ri_rec_id, rec_buf, NULL, FALSE)) { + ot->ot_err_index_no = (*ind)->mi_index_no; + goto failed_2; + } + } + +#ifdef XT_STREAMING + /* Reference the BLOBs in the row: */ + if (tab->tab_dic.dic_blob_count) { + if (!myxt_retain_blobs(ot, pbms_table, rec_info.ri_rec_id)) { + pbms_table = NULL; + goto failed_2; + } + pbms_table = NULL; + } +#endif + + /* Do the foreign key stuff: */ + if (ot->ot_table->tab_dic.dic_table->dt_fkeys.size() > 0) { + if (!ot->ot_table->tab_dic.dic_table->insertRow(ot, rec_buf)) + goto failed_2; + } + + self->st_statistics.st_row_insert++; + return OK; + + failed_2: + /* Once the row has been inserted, it is to late to remove it! 
+ * Now all we can do is delete it! + */ + tab_delete_record_on_fail(ot, row_id, rec_info.ri_rec_id, (XTTabRecHeadDPtr) rec_info.ri_fix_rec_buf, rec_buf, idx_cnt); + goto failed_0; + + failed_1: + tab_free_row_on_fail(ot, tab, row_id); + + failed_0: +#ifdef XT_STREAMING + if (tab->tab_dic.dic_blob_count && pbms_table) + myxt_unuse_blobs(ot, pbms_table); +#endif + return FAILED; +} + +/* We cannot remove a change we have made to a row while a transaction + * is running, so we have to undo what we have done by + * overwriting the record we just created with + * the before image! + */ +static xtBool tab_overwrite_record_on_fail(XTOpenTablePtr ot, XTTabRecInfoPtr rec_info, xtWord1 *before_buf, xtWord1 *after_buf, u_int idx_cnt) +{ + register XTTableHPtr tab = ot->ot_table; + XTTabRecHeadDRec prev_rec_head; + u_int i; + XTIndexPtr *ind; + XTThreadPtr thread = ot->ot_thread; + xtLogID log_id; + xtLogOffset log_offset; + xtRecordID rec_id = rec_info->ri_rec_id; + + /* Remove the new extended record: */ + if (rec_info->ri_ext_rec) + tab_free_ext_record_on_fail(ot, rec_id, (XTTabRecExtDPtr) rec_info->ri_fix_rec_buf, TRUE); + + /* Undo index entries of the new record: */ + if (after_buf) { + for (i=0, ind=tab->tab_dic.dic_keys; i<idx_cnt; i++, ind++) { + if (!xt_idx_delete(ot, *ind, rec_id, after_buf)) + return FAILED; + } + } + + memcpy(&prev_rec_head, rec_info->ri_fix_rec_buf, sizeof(XTTabRecHeadDRec)); + + if (!before_buf) { + /* Can happen if the delete was called from some cascaded action. + * And this is better than a crash... + * + * TODO: to make sure the change will not be applied in case the + * transaction will be commited, we'd need to add a log entry to + * restore the record like it's done for top-level operation. In + * order to do this we'd need to read the before-image of the + * record before modifying it. 
+ */ + if (!ot->ot_thread->t_exception.e_xt_err) + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_NO_BEFORE_IMAGE); + return FAILED; + } + + /* Restore the previous record! */ + if (!myxt_store_row(ot, rec_info, (char *) before_buf)) + return FAILED; + + memcpy(rec_info->ri_fix_rec_buf, &prev_rec_head, sizeof(XTTabRecHeadDRec)); + + if (rec_info->ri_ext_rec) { + /* Determine where the overflow will go... */ + if (!thread->st_dlog_buf.dlb_get_log_offset(&log_id, &log_offset, rec_info->ri_log_data_size + offsetof(XTactExtRecEntryDRec, er_data), ot->ot_thread)) + return FAILED; + XT_SET_LOG_REF(rec_info->ri_ext_rec, log_id, log_offset); + } + + if (!xt_tab_put_log_op_rec_data(ot, XT_LOG_ENT_REC_MODIFIED, 0, rec_id, rec_info->ri_rec_buf_size, (xtWord1 *) rec_info->ri_fix_rec_buf)) + return FAILED; + + if (rec_info->ri_ext_rec) { + /* Write the log buffer overflow: */ + rec_info->ri_log_buf->er_status_1 = XT_LOG_ENT_EXT_REC_OK; + XT_SET_DISK_4(rec_info->ri_log_buf->er_data_size_4, rec_info->ri_log_data_size); + XT_SET_DISK_4(rec_info->ri_log_buf->er_tab_id_4, tab->tab_id); + XT_SET_DISK_4(rec_info->ri_log_buf->er_rec_id_4, rec_id); + if (!thread->st_dlog_buf.dlb_append_log(log_id, log_offset, offsetof(XTactExtRecEntryDRec, er_data) + rec_info->ri_log_data_size, (xtWord1 *) rec_info->ri_log_buf, ot->ot_thread)) + return FAILED; + } + + /* Put the index entries back: */ + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + if (!xt_idx_insert(ot, *ind, 0, rec_id, before_buf, after_buf, TRUE)) + /* Incomplete restore, there will be a rollback... */ + return FAILED; + } + + return OK; +} + +/* + * GOTCHA: + * If a transaction updates the same record over again, we should update + * in place. This prevents producing unnecessary variations! 
+ */ +static xtBool tab_overwrite_record(XTOpenTablePtr ot, xtWord1 *before_buf, xtWord1 *after_buf) +{ + register XTTableHPtr tab = ot->ot_table; + xtRowID row_id = ot->ot_curr_row_id; + register XTThreadPtr self = ot->ot_thread; + xtRecordID rec_id = ot->ot_curr_rec_id; + XTTabRecExtDRec prev_rec_head; + XTTabRecInfoRec rec_info; + u_int idx_cnt = 0, i; + XTIndexPtr *ind; + xtLogID log_id; + xtLogOffset log_offset; + xtBool prev_ext_rec; + +#ifdef XT_STREAMING + void *pbms_table; + + if (tab->tab_dic.dic_blob_count) { + if (!myxt_use_blobs(ot, &pbms_table, after_buf)) + return FAILED; + } +#endif + + if (!myxt_store_row(ot, &rec_info, (char *) after_buf)) + goto failed_0; + + /* Read before we overwrite! */ + if (!xt_tab_get_rec_data(ot, rec_id, XT_REC_EXT_HEADER_SIZE, (xtWord1 *) &prev_rec_head)) + goto failed_0; + + prev_ext_rec = prev_rec_head.tr_rec_type_1 & XT_TAB_STATUS_EXT_DLOG; + + if (rec_info.ri_ext_rec) { + /* Determine where the overflow will go... */ + if (!self->st_dlog_buf.dlb_get_log_offset(&log_id, &log_offset, offsetof(XTactExtRecEntryDRec, er_data) + rec_info.ri_log_data_size, ot->ot_thread)) + goto failed_0; + XT_SET_LOG_REF(rec_info.ri_ext_rec, log_id, log_offset); + } + + rec_info.ri_fix_rec_buf->tr_stat_id_1 = self->st_update_id; + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_row_id_4, row_id); + XT_COPY_DISK_4(rec_info.ri_fix_rec_buf->tr_prev_rec_id_4, prev_rec_head.tr_prev_rec_id_4); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_xact_id_4, self->st_xact_data->xd_start_xn_id); + + /* Remove the index references, that have changed: */ + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + if (!xt_idx_delete(ot, *ind, rec_id, before_buf)) { + goto failed_0; + } + } + +#ifdef TRACE_VARIATIONS + xt_ttracef(self, "overwrite: row=%d rec=%d T%d\n", (int) row_id, (int) rec_id, (int) self->st_xact_data->xd_start_xn_id); +#endif + /* Overwrite the record: */ + if (!xt_tab_put_log_op_rec_data(ot, 
XT_LOG_ENT_REC_MODIFIED, 0, rec_id, rec_info.ri_rec_buf_size, (xtWord1 *) rec_info.ri_fix_rec_buf)) + goto failed_0; + + if (rec_info.ri_ext_rec) { + /* Write the log buffer overflow: */ + rec_info.ri_log_buf->er_status_1 = XT_LOG_ENT_EXT_REC_OK; + XT_SET_DISK_4(rec_info.ri_log_buf->er_data_size_4, rec_info.ri_log_data_size); + XT_SET_DISK_4(rec_info.ri_log_buf->er_tab_id_4, tab->tab_id); + XT_SET_DISK_4(rec_info.ri_log_buf->er_rec_id_4, rec_id); + if (!self->st_dlog_buf.dlb_append_log(log_id, log_offset, offsetof(XTactExtRecEntryDRec, er_data) + rec_info.ri_log_data_size, (xtWord1 *) rec_info.ri_log_buf, ot->ot_thread)) + goto failed_1; + } + + /* Add the index references that have changed: */ + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + if (!xt_idx_insert(ot, *ind, 0, rec_id, after_buf, before_buf, FALSE)) { + ot->ot_err_index_no = (*ind)->mi_index_no; + goto failed_2; + } + } + + /* Do the foreign key stuff: */ + if (ot->ot_table->tab_dic.dic_table->dt_trefs || ot->ot_table->tab_dic.dic_table->dt_fkeys.size() > 0) { + if (!ot->ot_table->tab_dic.dic_table->updateRow(ot, before_buf, after_buf)) + goto failed_2; + } + + /* Delete the previous overflow area: */ + if (prev_ext_rec) + tab_free_ext_record_on_fail(ot, rec_id, &prev_rec_head, TRUE); + +#ifdef XT_STREAMING + if (tab->tab_dic.dic_blob_count) { + /* Retain the BLOBs new record: */ + if (!myxt_retain_blobs(ot, pbms_table, rec_id)) + return FAILED; + /* Release the BLOBs in the old record: */ + myxt_release_blobs(ot, before_buf, rec_id); + } +#endif + + return OK; + + failed_2: + /* Remove the new extended record: */ + if (rec_info.ri_ext_rec) + tab_free_ext_record_on_fail(ot, rec_id, (XTTabRecExtDPtr) rec_info.ri_fix_rec_buf, TRUE); + + /* Restore the previous record! 
*/ + /* Undo index entries: */ + for (i=0, ind=tab->tab_dic.dic_keys; i<idx_cnt; i++, ind++) { + if (!xt_idx_delete(ot, *ind, rec_id, after_buf)) + goto failed_1; + } + + /* Restore the record: */ + if (!myxt_store_row(ot, &rec_info, (char *) before_buf)) + goto failed_1; + + if (rec_info.ri_ext_rec) + memcpy(rec_info.ri_fix_rec_buf, &prev_rec_head, XT_REC_EXT_HEADER_SIZE); + else + memcpy(rec_info.ri_fix_rec_buf, &prev_rec_head, sizeof(XTTabRecHeadDRec)); + + if (!xt_tab_put_log_op_rec_data(ot, XT_LOG_ENT_REC_MODIFIED, 0, rec_id, rec_info.ri_rec_buf_size, (xtWord1 *) rec_info.ri_fix_rec_buf)) + goto failed_1; + + /* Put the index entries back: */ + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + if (!xt_idx_insert(ot, *ind, 0, rec_id, before_buf, after_buf, TRUE)) + /* Incomplete restore, there will be a rollback... */ + goto failed_0; + } + + /* The previous record has now been restored. */ + goto failed_0; + + failed_1: + /* The old record is overwritten, I must free the previous extended record: */ + if (prev_ext_rec) + tab_free_ext_record_on_fail(ot, rec_id, &prev_rec_head, TRUE); + + failed_0: +#ifdef XT_STREAMING + /* Unuse the BLOBs of the new record: */ + if (tab->tab_dic.dic_blob_count && pbms_table) + myxt_unuse_blobs(ot, pbms_table); +#endif + return FAILED; +} + +xtPublic xtBool xt_tab_update_record(XTOpenTablePtr ot, xtWord1 *before_buf, xtWord1 *after_buf) +{ + register XTTableHPtr tab; + xtRowID row_id; + register XTThreadPtr self; + xtRecordID curr_var_rec_id; + XTTabRecInfoRec rec_info; + u_int idx_cnt = 0; + XTIndexPtr *ind; + +#ifdef XT_STREAMING + void *pbms_table; +#endif + + /* + * Originally only the flag ot->ot_curr_updated was checked, and if it was on, then + * tab_overwrite_record() was called, but this caused crashes in some cases like: + * + * set @@autocommit = 0; + * create table t1 (s1 int primary key); + * create table t2 (s1 int primary key, foreign key (s1) references t1 (s1) 
on update cascade); + * insert into t1 values (1); + * insert into t2 values (1); + * update t1 set s1 = 1; + * + * The last update led to a crash on the t2 cascade update because the before_buf argument is NULL + * in the call below. It is NULL only during a cascade update of a child table. In that case we + * cannot pass the before_buf value from XTDDTableRef::modifyRow as the before_buf is the original + * row for the parent (t1) table and it would be used to update any existing indexes + * in the child table which would be wrong of course. + * + * An alternative solution would be to copy the after_info in XTDDTableRef::modifyRow(): + * + * ... + * if (!xt_tab_load_record(ot, ot->ot_curr_rec_id, &after_info)) + * goto failed_2; + * ... + * + * Here xt_tab_load_record() loads the original row, so we can copy it from there, but in + * that case we'd need to allocate a new (possibly up to 65536 bytes long) buffer, which makes + * the optimization questionable. + * + */ + if (ot->ot_curr_updated && before_buf) + /* This record has already been updated by this transaction. + * Do the update in place! + */ + return tab_overwrite_record(ot, before_buf, after_buf); + + tab = ot->ot_table; + row_id = ot->ot_curr_row_id; + self = ot->ot_thread; + +#ifdef XT_STREAMING + /* PBMS: Reference BLOBs!? 
*/ + if (tab->tab_dic.dic_blob_count) { + if (!myxt_use_blobs(ot, &pbms_table, after_buf)) + return FAILED; + } +#endif + + if (!myxt_store_row(ot, &rec_info, (char *) after_buf)) + goto failed_0; + + rec_info.ri_fix_rec_buf->tr_stat_id_1 = self->st_update_id; + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_row_id_4, row_id); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_prev_rec_id_4, ot->ot_curr_rec_id); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_xact_id_4, self->st_xact_data->xd_start_xn_id); + + /* Create the new record: */ + if (!tab_add_record(ot, &rec_info, XT_LOG_ENT_UPDATE)) + goto failed_0; + + /* Link the new variation into the list: */ + XT_TAB_ROW_WRITE_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + + if (!xt_tab_get_row(ot, row_id, &curr_var_rec_id)) + goto failed_1; + + if (curr_var_rec_id != ot->ot_curr_rec_id) { + /* If the transaction does not rollback, I will get an + * exception here: + */ + if (!tab_wait_for_rollback(ot, row_id, ot->ot_curr_rec_id)) + goto failed_1; + /* [(4)] This is the situation when we overwrite the + * reference to curr_var_rec_id! + * When curr_var_rec_id is cleaned up by the sweeper, the + * sweeper will notice that the record is no longer in + * the row list. 
+ */ + } + +#ifdef TRACE_VARIATIONS + xt_ttracef(self, "update: row=%d rec=%d T%d\n", (int) row_id, (int) rec_info.ri_rec_id, (int) self->st_xact_data->xd_start_xn_id); +#endif + if (!xt_tab_set_row(ot, XT_LOG_ENT_ROW_ADD_REC, row_id, rec_info.ri_rec_id)) + goto failed_1; + XT_DISABLED_TRACE(("set upd tx=%d row=%d rec=%d\n", (int) self->st_xact_data->xd_start_xn_id, (int) row_id, (int) rec_info.ri_rec_id)); + + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + + /* Add the index references: */ + for (idx_cnt=0, ind=tab->tab_dic.dic_keys; idx_cnt<tab->tab_dic.dic_key_count; idx_cnt++, ind++) { + if (!xt_idx_insert(ot, *ind, 0, rec_info.ri_rec_id, after_buf, before_buf, FALSE)) { + ot->ot_err_index_no = (*ind)->mi_index_no; + goto failed_2; + } + } + +#ifdef XT_STREAMING + /* Reference the BLOBs in the row: */ + if (tab->tab_dic.dic_blob_count) { + if (!myxt_retain_blobs(ot, pbms_table, rec_info.ri_rec_id)) { + pbms_table = NULL; + goto failed_2; + } + pbms_table = NULL; + } +#endif + + if (ot->ot_table->tab_dic.dic_table->dt_trefs || ot->ot_table->tab_dic.dic_table->dt_fkeys.size() > 0) { + if (!ot->ot_table->tab_dic.dic_table->updateRow(ot, before_buf, after_buf)) + goto failed_2; + } + + ot->ot_thread->st_statistics.st_row_update++; + return OK; + + failed_2: + tab_overwrite_record_on_fail(ot, &rec_info, before_buf, after_buf, idx_cnt); + goto failed_0; + + failed_1: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + + failed_0: +#ifdef XT_STREAMING + if (tab->tab_dic.dic_blob_count && pbms_table) + myxt_unuse_blobs(ot, pbms_table); +#endif + return FAILED; +} + +xtPublic xtBool xt_tab_delete_record(XTOpenTablePtr ot, xtWord1 *rec_buf) +{ + register XTTableHPtr tab = ot->ot_table; + xtRowID row_id = ot->ot_curr_row_id; + xtRecordID curr_var_rec_id; + XTTabRecInfoRec rec_info; + + /* Setup a delete record: */ + rec_info.ri_fix_rec_buf = (XTTabRecFixDPtr) ot->ot_row_wbuffer; + 
rec_info.ri_rec_buf_size = offsetof(XTTabRecFixDRec, rf_data); + rec_info.ri_ext_rec = NULL; + rec_info.ri_fix_rec_buf->tr_rec_type_1 = XT_TAB_STATUS_DELETE; + rec_info.ri_fix_rec_buf->tr_stat_id_1 = 0; + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_row_id_4, row_id); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_prev_rec_id_4, ot->ot_curr_rec_id); + XT_SET_DISK_4(rec_info.ri_fix_rec_buf->tr_xact_id_4, ot->ot_thread->st_xact_data->xd_start_xn_id); + + if (!tab_add_record(ot, &rec_info, XT_LOG_ENT_DELETE)) + return FAILED; + + XT_TAB_ROW_WRITE_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + + if (!xt_tab_get_row(ot, row_id, &curr_var_rec_id)) + goto failed_1; + + if (curr_var_rec_id != ot->ot_curr_rec_id) { + if (!tab_wait_for_rollback(ot, row_id, ot->ot_curr_rec_id)) + goto failed_1; + } + +#ifdef TRACE_VARIATIONS + xt_ttracef(ot->ot_thread, "update: row=%d rec=%d T%d\n", (int) row_id, (int) rec_info.ri_rec_id, (int) ot->ot_thread->st_xact_data->xd_start_xn_id); +#endif + if (!xt_tab_set_row(ot, XT_LOG_ENT_ROW_ADD_REC, row_id, rec_info.ri_rec_id)) + goto failed_1; + XT_DISABLED_TRACE(("del row tx=%d row=%d rec=%d\n", (int) ot->ot_thread->st_xact_data->xd_start_xn_id, (int) row_id, (int) rec_info.ri_rec_id)); + + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + + if (ot->ot_table->tab_dic.dic_table->dt_trefs) { + if (!ot->ot_table->tab_dic.dic_table->deleteRow(ot, rec_buf)) + goto failed_2; + } + + ot->ot_thread->st_statistics.st_row_delete++; + return OK; + + failed_2: + tab_overwrite_record_on_fail(ot, &rec_info, rec_buf, NULL, 0); + return FAILED; + + failed_1: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], ot->ot_thread); + return FAILED; +} + +xtPublic xtBool xt_tab_restrict_rows(XTBasicListPtr list, XTThreadPtr thread) +{ + u_int i; + XTRestrictItemPtr item; + XTOpenTablePtr pot = NULL; + XTDatabaseHPtr db = thread->st_database; + xtBool ok = TRUE; + + for (i=0; i<list->bl_count; i++) { + 
item = (XTRestrictItemPtr) xt_bl_item_at(list, i); + if (item) + if (pot) { + if (pot->ot_table->tab_id == item->ri_tab_id) + goto check_action; + xt_db_return_table_to_pool_ns(pot); + pot = NULL; + } + + if (!xt_db_open_pool_table_ns(&pot, db, item->ri_tab_id)) { + /* Should not happen, but just in case, we just don't + * remove the lock. We will probably end up with a deadlock + * somewhere. + */ + xt_log_and_clear_exception_ns(); + goto skip_check_action; + } + if (!pot) + /* Can happen if the table has been dropped: */ + goto skip_check_action; + + check_action: + if (!pot->ot_table->tab_dic.dic_table->checkNoAction(pot, item->ri_rec_id)) { + ok = FALSE; + break; + } + skip_check_action:; + } + + if (pot) + xt_db_return_table_to_pool_ns(pot); + xt_bl_free(NULL, list); + return ok; +} + + +xtPublic xtBool xt_tab_seq_init(XTOpenTablePtr ot) +{ + register XTTableHPtr tab = ot->ot_table; + + ot->ot_seq_page = NULL; + ot->ot_on_page = FALSE; + ot->ot_seq_offset = 0; + + ot->ot_curr_rec_id = 0; // 0 is an invalid position! + ot->ot_curr_row_id = 0; // 0 is an invalid row ID! + ot->ot_curr_updated = FALSE; + + /* We note the current EOF before we start a sequential scan. + * It is basically possible to update the same record more than + * once because an updated record creates a new record which + * has a new position which may be in the area that is + * still to be scanned. + * + * By noting the EOF before we start a sequential scan we + * reduce the possibility of this. + * + * However, the possibility still remains, but it should + * not be a problem because a record is not modified + * if there is nothing to change, which is the case + * if the record has already been changed! + * + * NOTE (2008-01-29) There is no longer a problem with updating a + * record twice because records are marked by an update. + * + * [(10)] I have changed this (see below). I now check the + * current EOF of the table. 
+ * + * The reason is that committed read must be able to see the + * changes that occur during the table scan. + */ + ot->ot_seq_eof_id = tab->tab_rec_eof_id; + + if (!ot->ot_thread->st_xact_data) { + /* MySQL ignores this error, so we + * set up the sequential scan so that it will + * deliver nothing! + */ + ot->ot_seq_rec_id = ot->ot_seq_eof_id; + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_NO_TRANSACTION); + return FAILED; + } + + ot->ot_seq_rec_id = 1; + ot->ot_thread->st_statistics.st_scan_table++; + return OK; +} + +xtPublic void xt_tab_seq_reset(XTOpenTablePtr ot) +{ + ot->ot_seq_rec_id = 0; + ot->ot_seq_eof_id = 0; + ot->ot_seq_page = NULL; + ot->ot_on_page = FALSE; + ot->ot_seq_offset = 0; +} + +xtPublic void xt_tab_seq_exit(XTOpenTablePtr ot) +{ + register XTTableHPtr tab = ot->ot_table; + + if (ot->ot_seq_page) { + tab->tab_recs.xt_tc_release_page(ot->ot_rec_file, ot->ot_seq_page, ot->ot_thread); + ot->ot_seq_page = NULL; + } + ot->ot_on_page = FALSE; +} + +xtPublic xtBool xt_tab_seq_next(XTOpenTablePtr ot, xtWord1 *buffer, xtBool *eof) +{ + register XTTableHPtr tab = ot->ot_table; + register size_t rec_size = tab->tab_dic.dic_rec_size; + xtWord1 *buff_ptr; + xtRecordID new_rec_id; + xtBool ptr_locked; + xtRecordID invalid_rec = 0; + XTTabRecHeadDRec rec_head; + + next_page: + if (!ot->ot_on_page) { + if (!(ot->ot_on_page = tab->tab_recs.xt_tc_get_page(ot->ot_rec_file, ot->ot_seq_rec_id, &ot->ot_seq_page, &ot->ot_seq_offset, ot->ot_thread))) + return FAILED; + } + + next_record: + /* [(10)] The current EOF is used: */ + if (ot->ot_seq_rec_id >= ot->ot_seq_eof_id) { + *eof = TRUE; + return OK; + } + + if (ot->ot_seq_offset >= tab->tab_recs.tci_page_size) { + if (ot->ot_seq_page) { + tab->tab_recs.xt_tc_release_page(ot->ot_rec_file, ot->ot_seq_page, ot->ot_thread); + ot->ot_seq_page = NULL; + } + ot->ot_on_page = FALSE; + goto next_page; + } + + if (ot->ot_seq_page) { + ptr_locked = FALSE; + buff_ptr = ot->ot_seq_page->tcp_data + ot->ot_seq_offset; + } + 
else { + size_t red_size; + + ptr_locked = TRUE; + if (!xt_pread_fmap(ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, ot->ot_seq_rec_id), sizeof(XTTabRecHeadDRec), sizeof(XTTabRecHeadDRec), &rec_head, &red_size, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread)) + return FAILED; + buff_ptr = (xtWord1 *) &rec_head; + } + + /* This is the current record: */ + ot->ot_curr_rec_id = ot->ot_seq_rec_id; + ot->ot_curr_row_id = 0; + + /* Move to the next record: */ + ot->ot_seq_rec_id++; + ot->ot_seq_offset += rec_size; + + retry: + switch (tab_visible(ot, (XTTabRecHeadDPtr) buff_ptr, &new_rec_id)) { + case FALSE: + goto next_record; + case XT_ERR: + goto failed; + case XT_NEW: + ptr_locked = FALSE; + buff_ptr = ot->ot_row_rbuffer; + if (!xt_tab_get_rec_data(ot, new_rec_id, rec_size, ot->ot_row_rbuffer)) + return XT_ERR; + ot->ot_curr_rec_id = new_rec_id; + break; + case XT_RETRY: + goto retry; + case XT_REREAD: + if (invalid_rec != ot->ot_curr_rec_id) { + /* Don't re-read for the same record twice: */ + invalid_rec = ot->ot_curr_rec_id; + + /* Undo move to next: */ + ot->ot_seq_rec_id--; + ot->ot_seq_offset -= rec_size; + + /* Prepare to reread the page: */ + if (ot->ot_seq_page) { + tab->tab_recs.xt_tc_release_page(ot->ot_rec_file, ot->ot_seq_page, ot->ot_thread); + ot->ot_seq_page = NULL; + } + ot->ot_on_page = FALSE; + goto next_page; + } +#ifdef XT_CRASH_DEBUG + /* Should not happen! */ + xt_crash_me(); +#endif + /* Continue, and skip the record... 
*/ + invalid_rec = 0; + goto next_record; + default: + if (ptr_locked) + XT_LOCK_MEMORY_PTR(buff_ptr, ot->ot_rec_file, xt_rec_id_to_rec_offset(tab, ot->ot_curr_rec_id), tab->tab_rows.tci_page_size, &ot->ot_thread->st_statistics.st_rec, ot->ot_thread); + break; + } + + switch (*buff_ptr) { + case XT_TAB_STATUS_FIXED: + case XT_TAB_STATUS_FIX_CLEAN: + memcpy(buffer, buff_ptr + XT_REC_FIX_HEADER_SIZE, rec_size - XT_REC_FIX_HEADER_SIZE); + break; + case XT_TAB_STATUS_VARIABLE: + case XT_TAB_STATUS_VAR_CLEAN: + if (!myxt_load_row(ot, buff_ptr + XT_REC_FIX_HEADER_SIZE, buffer, ot->ot_cols_req)) + goto failed_1; + break; + case XT_TAB_STATUS_EXT_DLOG: + case XT_TAB_STATUS_EXT_CLEAN: { + u_int cols_req = ot->ot_cols_req; + + ASSERT_NS(cols_req); + if (cols_req && cols_req <= tab->tab_dic.dic_fix_col_count) { + if (!myxt_load_row(ot, buff_ptr + XT_REC_EXT_HEADER_SIZE, buffer, cols_req)) + goto failed_1; + } + else { + if (buff_ptr != ot->ot_row_rbuffer) + memcpy(ot->ot_row_rbuffer, buff_ptr, rec_size); + if (!xt_tab_load_ext_data(ot, ot->ot_curr_rec_id, buffer, cols_req)) + goto failed_1; + } + break; + } + } + if (ptr_locked) + XT_UNLOCK_MEMORY_PTR(ot->ot_rec_file, ot->ot_thread); + + *eof = FALSE; + return OK; + + failed_1: + if (ptr_locked) + XT_UNLOCK_MEMORY_PTR(ot->ot_rec_file, ot->ot_thread); + + failed: + return FAILED; +} + diff --git a/storage/pbxt/src/table_xt.h b/storage/pbxt/src/table_xt.h new file mode 100644 index 00000000000..5ce284e4122 --- /dev/null +++ b/storage/pbxt/src/table_xt.h @@ -0,0 +1,599 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-08 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_table_h__ +#define __xt_table_h__ + +#include <time.h> + +#include "datalog_xt.h" +#include "filesys_xt.h" +#include "hashtab_xt.h" +#include "index_xt.h" +#include "cache_xt.h" +#include "util_xt.h" +#include "heap_xt.h" +#include "tabcache_xt.h" +#include "xactlog_xt.h" +#include "lock_xt.h" + +struct XTDatabase; +struct XTThread; +struct XTCache; +struct XTOpenTable; +struct XTTablePath; + +#define XT_TAB_INCOMPATIBLE_VERSION 4 +#define XT_TAB_CURRENT_VERSION 5 + +#define XT_IND_CURRENT_VERSION 3 + +#define XT_HEAD_BUFFER_SIZE 1024 + +#ifdef DEBUG +//#define XT_TRACK_INDEX_UPDATES +//#define XT_TRACK_RETURNED_ROWS +#endif + +/* + * NOTE: Records may only be freed (placed on the free list), after + * all currently running transactions have ended. + * The reason is, running transactions may have references in memory + * to these records (a sequential scan has a large buffer). + * If the records are freed they may be re-used. This will + * cause problems because the references will then refer to + * new data. + * + * As a result, deleted records are first placed in the + * REMOVED state. Later, when transactions have quit, they + * are freed. + */ +#define XT_TAB_STATUS_FREED 0x00 /* On the free list. */ +#define XT_TAB_STATUS_DELETE 0x01 /* A transactional delete record (an "update" that indicates a delete). */ +#define XT_TAB_STATUS_FIXED 0x02 +#define XT_TAB_STATUS_VARIABLE 0x03 /* Uses one block, but has the variable format. 
*/ +#define XT_TAB_STATUS_EXT_DLOG 0x04 /* Variable format, and the trailing part of the record in the data log. */ +#define XT_TAB_STATUS_EXT_HDATA 0x05 /* Variable format, and the trailing part of the record in the handle data file. */ +#define XT_TAB_STATUS_DATA 0x06 /* A block of data with a next pointer (5 bytes overhead). */ +#define XT_TAB_STATUS_END_DATA 0x07 /* A block of data without an end pointer (1 byte overhead). */ +#define XT_TAB_STATUS_MASK 0x0F + +#define XT_TAB_STATUS_DEL_CLEAN (XT_TAB_STATUS_DELETE | XT_TAB_STATUS_CLEANED_BIT) +#define XT_TAB_STATUS_FIX_CLEAN (XT_TAB_STATUS_FIXED | XT_TAB_STATUS_CLEANED_BIT) +#define XT_TAB_STATUS_VAR_CLEAN (XT_TAB_STATUS_VARIABLE | XT_TAB_STATUS_CLEANED_BIT) +#define XT_TAB_STATUS_EXT_CLEAN (XT_TAB_STATUS_EXT_DLOG | XT_TAB_STATUS_CLEANED_BIT) + +#define XT_TAB_STATUS_CLEANED_BIT 0x80 /* This bit is set when the record is cleaned and committed. */ + +#define XT_REC_IS_CLEAN(x) ((x) & XT_TAB_STATUS_CLEANED_BIT) +#define XT_REC_IS_FREE(x) (((x) & XT_TAB_STATUS_MASK) == XT_TAB_STATUS_FREED) +#define XT_REC_IS_DELETE(x) (((x) & XT_TAB_STATUS_MASK) == XT_TAB_STATUS_DELETE) +#define XT_REC_IS_FIXED(x) (((x) & XT_TAB_STATUS_MASK) == XT_TAB_STATUS_FIXED) +#define XT_REC_IS_VARIABLE(x) (((x) & XT_TAB_STATUS_MASK) == XT_TAB_STATUS_VARIABLE) +#define XT_REC_IS_EXT_DLOG(x) (((x) & XT_TAB_STATUS_MASK) == XT_TAB_STATUS_EXT_DLOG) +#define XT_REC_IS_EXT_HDATA(x) (((x) & XT_TAB_STATUS_MASK) == XT_TAB_STATUS_EXT_HDATA) +#define XT_REC_NOT_VALID(x) (XT_REC_IS_FREE(x) || XT_REC_IS_DELETE(x)) + +/* Results for xt_use_table_by_id(): */ +#define XT_TAB_OK 0 +#define XT_TAB_NOT_FOUND 1 +#define XT_TAB_NO_DICTIONARY 2 +#define XT_TAB_POOL_CLOSED 3 /* Cannot open table at the moment, the pool is closed. 
*/ +#define XT_TAB_FAILED 4 + +#define XT_TAB_ROW_USE_RW_MUTEX + +#ifdef XT_TAB_ROW_USE_FASTWRLOCK +#define XT_TAB_ROW_LOCK_TYPE XTFastRWLockRec +#define XT_TAB_ROW_INIT_LOCK(s, i) xt_fastrwlock_init(s, i) +#define XT_TAB_ROW_FREE_LOCK(s, i) xt_fastrwlock_free(s, i) +#define XT_TAB_ROW_READ_LOCK(i, s) xt_fastrwlock_slock(i, s) +#define XT_TAB_ROW_WRITE_LOCK(i, s) xt_fastrwlock_xlock(i, s) +#define XT_TAB_ROW_UNLOCK(i, s) xt_fastrwlock_unlock(i, s) +#elif defined(XT_TAB_ROW_USE_PTHREAD_RW) +#define XT_TAB_ROW_LOCK_TYPE xt_rwlock_type +#define XT_TAB_ROW_INIT_LOCK(s, i) xt_init_rwlock(s, i) +#define XT_TAB_ROW_FREE_LOCK(s, i) xt_free_rwlock(i) +#define XT_TAB_ROW_READ_LOCK(i, s) xt_slock_rwlock_ns(i) +#define XT_TAB_ROW_WRITE_LOCK(i, s) xt_xlock_rwlock_ns(i) +#define XT_TAB_ROW_UNLOCK(i, s) xt_unlock_rwlock_ns(i) +#elif defined(XT_TAB_ROW_USE_RW_MUTEX) +#define XT_TAB_ROW_LOCK_TYPE XTRWMutexRec +#define XT_TAB_ROW_INIT_LOCK(s, i) xt_rwmutex_init_with_autoname(s, i) +#define XT_TAB_ROW_FREE_LOCK(s, i) xt_rwmutex_free(s, i) +#define XT_TAB_ROW_READ_LOCK(i, s) xt_rwmutex_slock(i, (s)->t_id) +#define XT_TAB_ROW_WRITE_LOCK(i, s) xt_rwmutex_xlock(i, (s)->t_id) +#define XT_TAB_ROW_UNLOCK(i, s) xt_rwmutex_unlock(i, (s)->t_id) +#else +#define XT_TAB_ROW_LOCK_TYPE XTSpinLockRec +#define XT_TAB_ROW_INIT_LOCK(s, i) xt_spinlock_init(s, i) +#define XT_TAB_ROW_FREE_LOCK(s, i) xt_spinlock_free(s, i) +#define XT_TAB_ROW_READ_LOCK(i, s) xt_spinlock_lock(i) +#define XT_TAB_ROW_WRITE_LOCK(i, s) xt_spinlock_lock(i) +#define XT_TAB_ROW_UNLOCK(i, s) xt_spinlock_unlock(i) +#endif + +/* ------- TABLE DATA FILE ------- */ + +#define XT_TAB_DATA_MAGIC 0x1234ABCD + +#define XT_FORMAT_DEF_SPACE 512 + +#define XT_TAB_FLAGS_TEMP_TAB 1 + +/* + * This header ensures that no record in the data file has the offset 0. + */ +typedef struct XTTableHead { + XTDiskValue4 th_head_size_4; /* The size of the table header. 
*/ + XTDiskValue4 th_op_seq_4; + XTDiskValue6 th_row_free_6; + XTDiskValue6 th_row_eof_6; + XTDiskValue6 th_row_fnum_6; + XTDiskValue6 th_rec_free_6; + XTDiskValue6 th_rec_eof_6; + XTDiskValue6 th_rec_fnum_6; +} XTTableHeadDRec, *XTTableHeadDPtr; + +typedef struct XTTableFormat { + XTDiskValue4 tf_format_size_4; /* The size of this structure (table format). */ + XTDiskValue4 tf_tab_head_size_4; /* The offset of the first record in the data handle file. */ + XTDiskValue2 tf_tab_version_2; /* The table version number. */ + XTDiskValue2 tf_tab_flags_2; /* Table flags XT_TAB_FLAGS_* */ + XTDiskValue4 tf_rec_size_4; /* The maximum size of records in the table. */ + XTDiskValue1 tf_rec_fixed_1; /* Set to 1 if this table contains fixed length records. */ + XTDiskValue1 tf_reserved_1; /* - */ + XTDiskValue8 tf_min_auto_inc_8; /* This is the minimum auto-increment value. */ + xtWord1 tf_reserved[64]; /* Reserved, set to 0. */ + char tf_definition[XT_VAR_LENGTH]; /* A cstring, currently it only contains the foreign key information. */ +} XTTableFormatDRec, *XTTableFormatDPtr; + +#define XT_STAT_ID_MASK(x) ((x) & (u_int) 0x000000FF) + +/* A record that fits completely in the data file record */ +typedef struct XTTabRecHead { + xtWord1 tr_rec_type_1; + xtWord1 tr_stat_id_1; + xtDiskRecordID4 tr_prev_rec_id_4; /* The previous variation of this record. */ + XTDiskValue4 tr_xact_id_4; /* The transaction ID. */ + XTDiskValue4 tr_row_id_4; /* The row ID of this record. */ +} XTTabRecHeadDRec, *XTTabRecHeadDPtr; + +typedef struct XTTabRecFix { + xtWord1 tr_rec_type_1; /* XT_TAB_STATUS_FREED, XT_TAB_STATUS_DELETE, + * XT_TAB_STATUS_FIXED, XT_TAB_STATUS_VARIABLE */ + xtWord1 tr_stat_id_1; + xtDiskRecordID4 tr_prev_rec_id_4; /* The previous variation of this record. */ + XTDiskValue4 tr_xact_id_4; /* The transaction ID. */ + XTDiskValue4 tr_row_id_4; /* The row ID of this record. */ + xtWord1 rf_data[XT_VAR_LENGTH]; /* NOTE: This data is in RAW MySQL format. 
*/ +} XTTabRecFixDRec, *XTTabRecFixDPtr; + +/* An extended record that overflows into the log file: */ +typedef struct XTTabRecExt { + xtWord1 tr_rec_type_1; /* XT_TAB_STATUS_EXT_DLOG */ + xtWord1 tr_stat_id_1; + xtDiskRecordID4 tr_prev_rec_id_4; /* The previous variation of this record. */ + XTDiskValue4 tr_xact_id_4; /* The transaction ID. */ + XTDiskValue4 tr_row_id_4; /* The row ID of this record. */ + XTDiskValue2 re_log_id_2; /* Reference to overflow area, log ID */ + XTDiskValue6 re_log_offs_6; /* Reference to the overflow area, log offset */ + XTDiskValue4 re_log_dat_siz_4; /* Size of the overflow data. */ + xtWord1 re_data[XT_VAR_LENGTH]; /* This data is in packed PBXT format. */ +} XTTabRecExtDRec, *XTTabRecExtDPtr; + +typedef struct XTTabRecExtHdat { + xtWord1 tr_rec_type_1; /* XT_TAB_STATUS_EXT_HDATA */ + xtWord1 tr_stat_id_1; + xtDiskRecordID4 tr_prev_rec_id_4; /* The previous variation of this record. */ + XTDiskValue4 tr_xact_id_4; /* The transaction ID. */ + XTDiskValue4 tr_row_id_4; /* The row ID of this record. */ + XTDiskValue4 eh_blk_rec_id_4; /* The record ID of the next block. */ + XTDiskValue2 eh_blk_siz_2; /* The total size of the data in the trailing blocks */ + xtWord1 eh_data[XT_VAR_LENGTH]; /* This data is in packed PBXT format. */ +} XTTabRecExtHdatDRec, *XTTabRecExtHdatDPtr; + +typedef struct XTTabRecData { + xtWord1 tr_rec_type_1; /* XT_TAB_STATUS_DATA */ + XTDiskValue4 rd_blk_rec_id_4; /* The record ID of the next block. */ + xtWord1 rd_data[XT_VAR_LENGTH]; /* This data is in packed PBXT format. */ +} XTTabRecDataDRec, *XTTabRecDataDPtr; + +typedef struct XTTabRecEndDat { + xtWord1 tr_rec_type_1; /* XT_TAB_STATUS_END_DATA */ + xtWord1 ed_data[XT_VAR_LENGTH]; /* This data is in packed PBXT format. 
*/ +} XTTabRecEndDatDRec, *XTTabRecEndDatDPtr; + +#define XT_REC_FIX_HEADER_SIZE sizeof(XTTabRecHeadDRec) +#define XT_REC_EXT_HEADER_SIZE offsetof(XTTabRecExtDRec, re_data) +#define XT_REC_FIX_EXT_HEADER_DIFF (XT_REC_EXT_HEADER_SIZE - XT_REC_FIX_HEADER_SIZE) + +typedef struct XTTabRecFree { + xtWord1 rf_rec_type_1; + xtWord1 rf_not_used_1; + xtDiskRecordID4 rf_next_rec_id_4; /* The next block on the free list. */ +} XTTabRecFreeDRec, *XTTabRecFreeDPtr; + +typedef struct XTTabRecInfo { + XTTabRecFixDPtr ri_fix_rec_buf; /* This references the start of the buffer (set for all types of records) */ + XTTabRecExtDPtr ri_ext_rec; /* This is only set for extended records. */ + xtWord4 ri_rec_buf_size; + XTactExtRecEntryDPtr ri_log_buf; + xtWord4 ri_log_data_size; /* This size of the data in the log record. */ + xtRecordID ri_rec_id; /* The record ID. */ +} XTTabRecInfoRec, *XTTabRecInfoPtr; + +/* ------- TABLE ROW FILE ------- */ + +#define XT_TAB_ROW_SHIFTS 2 +#define XT_TAB_ROW_MAGIC 0x4567CDEF +//#define XT_TAB_ROW_FREE 0 +//#define XT_TAB_ROW_IN_USE 1 + +/* + * NOTE: The shift count assumes the size of a table row + * reference is 8 bytes (XT_TAB_ROW_SHIFTS) + */ +typedef struct XTTabRowRef { + XTDiskValue4 rr_ref_id_4; /* 4-byte reference, could be a RowID or a RecordID + * If this row is free, then it is a RowID, which + * references the next free row. + * If it is in use, then it is a RecordID which + * points to the first record in the variation + * list for the row. + */ +} XTTabRowRefDRec, *XTTabRowRefDPtr; + +/* + * This is the header for the row file. 
The size MUST be + * the same as sizeof(XTTabRowRefDRec) + */ +typedef struct XTTabRowHead { + XTDiskValue4 rh_magic_4; +} XTTabRowHeadDRec, *XTTabRowHeadDPtr; + +/* ------- TABLE & OPEN TABLES & TABLE LISTING ------- */ + +typedef struct XTTable : public XTHeap { + struct XTDatabase *tab_db; /* Heap pointer */ + XTPathStrPtr tab_name; + xtBool tab_free_locks; + xtTableID tab_id; + + xtWord8 tab_auto_inc; /* The next value to be issued as an auto-increment value. */ + XTSpinLockRec tab_ainc_lock; /* Lock for the auto-increment counter. */ + + size_t tab_index_format_offset; + size_t tab_index_header_size; + size_t tab_index_page_size; + u_int tab_index_block_shifts; + XTIndexHeadDPtr tab_index_head; + size_t tab_table_format_offset; + size_t tab_table_head_size; + XTDictionaryRec tab_dic; + xt_mutex_type tab_dic_field_lock; /* Lock for setting field->ptr!. */ + + XTRowLocksRec tab_locks; /* The locks held on this table. */ + + XTTableSeqRec tab_seq; /* The table operation sequence. */ + XTTabCacheRec tab_rows; + XTTabCacheRec tab_recs; + + /* Used to apply operations to the database in order. */ + XTSortedListPtr tab_op_list; /* The operation list. Operations to be applied. */ + /* Values that belong in the header when flushed! */ + xtBool tab_flush_pending; /* TRUE if the table needs to be flushed */ + xtBool tab_recovery_done; /* TRUE if the table has been recovered */ + off_t tab_bytes_to_flush; /* Number of bytes of the record/row files to flush. */ + + xtOpSeqNo tab_head_op_seq; /* The number of the operation last applied to the database. */ + xtRowID tab_head_row_free_id; + xtRowID tab_head_row_eof_id; + xtWord4 tab_head_row_fnum; + xtRecordID tab_head_rec_free_id; + xtRecordID tab_head_rec_eof_id; + xtWord4 tab_head_rec_fnum; + + xtOpSeqNo tab_co_op_seq; /* The operation last applied by the compactor. */ + + xtBool tab_wr_wake_freeer; /* Set to TRUE if the writer must wake the freeer. 
*/ + xtOpSeqNo tab_wake_freeer_op; /* Set to the sequence number the freeer is waiting for. */ + + XTFilePtr tab_row_file; + xtRowID tab_row_eof_id; /* Indicates the EOF of the table row file. */ + xtRowID tab_row_free_id; /* The start of the free list in the table row file. */ + xtWord4 tab_row_fnum; /* The count of the number of free rows on the free list. */ + xt_mutex_type tab_row_lock; /* Lock for updating the EOF and free list. */ + XT_TAB_ROW_LOCK_TYPE tab_row_rwlock[XT_ROW_RWLOCKS]; /* Used to lock a row during update. */ + + xt_mutex_type tab_rec_flush_lock; /* Required while the record/row files are being flushed. */ + XTFilePtr tab_rec_file; + xtRecordID tab_rec_eof_id; /* This value can only grow. */ + xtRecordID tab_rec_free_id; + xtWord4 tab_rec_fnum; /* The count of the number of free rows on the free list. */ + xt_mutex_type tab_rec_lock; /* Lock for the free list. */ + + xt_mutex_type tab_ind_flush_lock; /* Required while the index file is being flushed. */ + xtLogID tab_ind_rec_log_id; /* The point before which index entries have been written. */ + xtLogOffset tab_ind_rec_log_offset; /* The log offset of the write point. */ + XTFilePtr tab_ind_file; + xtIndexNodeID tab_ind_eof; /* This value can only grow. */ + xtIndexNodeID tab_ind_free; /* The start of the free page list of the index. */ + XTIndFreeListPtr tab_ind_free_list; /* A cache of the free list (if exists, don't go to disk!) */ + xt_mutex_type tab_ind_lock; /* Lock for reading and writing the index free list. */ + xtWord2 tab_ind_flush_seq; +} XTTableHRec, *XTTableHPtr; /* Heap pointer */ + +/* Used for an in-memory list of the tables, ordered by ID. */ +typedef struct XTTableEntry { + xtTableID te_tab_id; + char *te_tab_name; + struct XTTablePath *te_tab_path; + XTTableHPtr te_table; +} XTTableEntryRec, *XTTableEntryPtr; + +typedef struct XTOpenTable { + struct XTThread *ot_thread; /* The thread currently using this open table. */ + XTTableHPtr ot_table; /* PBXT table information. 
*/ + + struct XTOpenTable *ot_otp_next_free; /* Next free open table in the open table pool. */ + struct XTOpenTable *ot_otp_mr_used; + struct XTOpenTable *ot_otp_lr_used; + time_t ot_otp_free_time; /* The time this table was placed on the free list. */ + + //struct XTOpenTable *ot_pool_next; /* Next pointer for open table pool. */ + + XT_ROW_REC_FILE_PTR ot_rec_file; + XT_ROW_REC_FILE_PTR ot_row_file; + XTOpenFilePtr ot_ind_file; + u_int ot_err_index_no; /* The number of the index on which the last error occurred */ + + xtBool ot_rec_fixed; /* Cached from table for quick access. */ + size_t ot_rec_size; /* Cached from table for quick access. */ + + char ot_error_key[XT_IDENTIFIER_NAME_SIZE]; + xtBool ot_for_update; /* True if reading FOR UPDATE. */ + xtBool ot_is_modify; /* True if UPDATE or DELETE. */ + xtRowID ot_temp_row_lock; /* The temporary row lock set on this table. */ + u_int ot_cols_req; /* The number of columns required from the table. */ + + /* GOTCHA: Separate buffers for reading and writing rows because + * of blob references, to this buffer, as in this test: + * + * drop table if exists t1; + * CREATE TABLE t1 (id MEDIUMINT NOT NULL, b1 BIT(8), vc TEXT, + * bc CHAR(255), d DECIMAL(10,4) DEFAULT 0, + * f FLOAT DEFAULT 0, total BIGINT UNSIGNED, + * y YEAR, t DATE) + * PARTITION BY RANGE (YEAR(t)) + * (PARTITION p1 VALUES LESS THAN (2005), + * PARTITION p2 VALUES LESS THAN MAXVALUE); + * + * INSERT INTO t1 VALUES(412,1,'eTesting MySQL databases is a cool ', + * 'EEEMust make it bug free for the customer', + * 654321.4321,15.21,0,1965,"2005-11-14"); + * + * UPDATE t1 SET b1 = 0, t="2006-02-22" WHERE id = 412; + * + */ + size_t ot_row_rbuf_size; /* The current size of the read row buffer (resized dynamically). */ + xtWord1 *ot_row_rbuffer; /* The row buffer for reading rows. */ + size_t ot_row_wbuf_size; /* The current size of the write row buffer (resized dynamically). */ + xtWord1 *ot_row_wbuffer; /* The row buffer for writing rows. 
*/ + + /* Details of the current record: */ + xtRecordID ot_curr_rec_id; /* The offset of the current record. */ + xtRowID ot_curr_row_id; /* The row ID of the current record. */ + xtBool ot_curr_updated; /* TRUE if the current record was updated by the current transaction. */ + + XTIndBlockPtr ot_ind_res_bufs; /* A list of reserved index buffers. */ + u_int ot_ind_res_count; /* The number of reserved buffers. */ +#ifdef XT_TRACK_INDEX_UPDATES + u_int ot_ind_changed; + u_int ot_ind_reserved; + u_int ot_ind_reads; +#endif +#ifdef XT_TRACK_RETURNED_ROWS + u_int ot_rows_ret_max; + u_int ot_rows_ret_curr; + xtRecordID *ot_rows_returned; +#endif + /* GOTCHA: Separate buffers for reading and writing the index are required + * because MySQL sometimes scans and updates an index with the same + * table handler. + */ + XTIdxItemRec ot_ind_state; /* Decribes the state of the index buffer. */ + XTIndHandlePtr ot_ind_rhandle; /* This handle references a block which is being used in a sequential scan. */ + //XTIdxBranchDRec ot_ind_rbuf; /* The index read buffer. */ + XTIdxBranchDRec ot_ind_wbuf; /* Buffer for the current index node for writing. */ + xtWord1 ot_ind_wbuf2[XT_INDEX_PAGE_SIZE]; /* Overflow for the write buffer when a node is too big. */ + + /* Note: the fields below ot_ind_rbuf are not zero'ed out on creation + * of this structure! + */ + xtRecordID ot_seq_rec_id; /* Current position of a sequential scan. */ + xtRecordID ot_seq_eof_id; /* The EOF at the start of the sequential scan. */ + XTTabCachePagePtr ot_seq_page; /* If ot_seq_buffer is non-NULL, then a page has been locked! */ + xtBool ot_on_page; + size_t ot_seq_offset; /* Offset on the current page. 
*/ +} XTOpenTableRec, *XTOpenTablePtr; + +#define XT_DATABASE_NAME_SIZE XT_IDENTIFIER_NAME_SIZE + +typedef struct XTTableDesc { + char td_tab_name[XT_TABLE_NAME_SIZE+4]; // 4 extra for DEL# (tables being deleted) + xtTableID td_tab_id; + char *td_file_name; + + struct XTDatabase *td_db; + struct XTTablePath *td_tab_path; // The path of the table. + u_int td_path_idx; + XTOpenDirPtr td_open_dir; +} XTTableDescRec, *XTTableDescPtr; + + +typedef struct XTFilesOfTable { + int ft_state; + XTPathStrPtr ft_tab_name; + xtTableID ft_tab_id; + char ft_file_path[PATH_MAX]; +} XTFilesOfTableRec, *XTFilesOfTablePtr; + +typedef struct XTRestrictItem { + xtTableID ri_tab_id; + xtRecordID ri_rec_id; +} XTRestrictItemRec, *XTRestrictItemPtr; + +int xt_tab_compare_names(const char *n1, const char *n2); +int xt_tab_compare_paths(char *n1, char *n2); +void xt_tab_init_db(struct XTThread *self, struct XTDatabase *db); +void xt_tab_exit_db(struct XTThread *self, struct XTDatabase *db); +void xt_check_tables(struct XTThread *self); + +char *xt_tab_file_to_name(size_t size, char *tab_name, char *file_name); + +void xt_create_table(struct XTThread *self, XTPathStrPtr name, XTDictionaryPtr dic); +XTTableHPtr xt_use_table(struct XTThread *self, XTPathStrPtr name, xtBool no_load, xtBool missing_ok, xtBool *opened); +void xt_sync_flush_table(struct XTThread *self, XTOpenTablePtr ot); +xtBool xt_flush_record_row(XTOpenTablePtr ot, off_t *bytes_flushed, xtBool have_table_loc); +void xt_flush_table(struct XTThread *self, XTOpenTablePtr ot); +XTTableHPtr xt_use_table_no_lock(XTThreadPtr self, struct XTDatabase *db, XTPathStrPtr name, xtBool no_load, xtBool missing_ok, XTDictionaryPtr dic, xtBool *opened); +int xt_use_table_by_id(struct XTThread *self, XTTableHPtr *tab, struct XTDatabase *db, xtTableID tab_id); +XTOpenTablePtr xt_open_table(XTTableHPtr tab); +void xt_close_table(XTOpenTablePtr ot, xtBool flush, xtBool have_table_lock); +void xt_drop_table(struct XTThread *self, XTPathStrPtr name); 
+void xt_check_table(XTThreadPtr self, XTOpenTablePtr tab); +void xt_rename_table(struct XTThread *self, XTPathStrPtr old_name, XTPathStrPtr new_name); + +void xt_describe_tables_init(struct XTThread *self, struct XTDatabase *db, XTTableDescPtr td); +xtBool xt_describe_tables_next(struct XTThread *self, XTTableDescPtr td); +void xt_describe_tables_exit(struct XTThread *self, XTTableDescPtr td); + +xtBool xt_table_exists(struct XTDatabase *db); + +void xt_enum_tables_init(u_int *edx); +XTTableEntryPtr xt_enum_tables_next(struct XTThread *self, struct XTDatabase *db, u_int *edx); + +void xt_enum_files_of_tables_init(struct XTDatabase *db, char *tab_name, xtTableID tab_id, XTFilesOfTablePtr ft); +xtBool xt_enum_files_of_tables_next(XTFilesOfTablePtr ft); + +xtBool xt_tab_seq_init(XTOpenTablePtr ot); +void xt_tab_seq_reset(XTOpenTablePtr ot); +void xt_tab_seq_exit(XTOpenTablePtr ot); +xtBool xt_tab_seq_next(XTOpenTablePtr ot, xtWord1 *buffer, xtBool *eof); + +xtBool xt_tab_new_record(XTOpenTablePtr ot, xtWord1 *buffer); +xtBool xt_tab_delete_record(XTOpenTablePtr ot, xtWord1 *buffer); +xtBool xt_tab_restrict_rows(XTBasicListPtr list, struct XTThread *thread); +xtBool xt_tab_update_record(XTOpenTablePtr ot, xtWord1 *before_buf, xtWord1 *after_buf); +int xt_tab_visible(XTOpenTablePtr ot); +int xt_tab_read_record(register XTOpenTablePtr ot, xtWord1 *buffer); +int xt_tab_dirty_read_record(register XTOpenTablePtr ot, xtWord1 *buffer); +void xt_tab_load_row_pointers(XTThreadPtr self, XTOpenTablePtr ot); +void xt_tab_load_table(struct XTThread *self, XTOpenTablePtr ot); +xtBool xt_tab_load_record(register XTOpenTablePtr ot, xtRecordID rec_id, XTInfoBufferPtr rec_buf); +int xt_tab_remove_record(XTOpenTablePtr ot, xtRecordID rec_id, xtWord1 *rec_data, xtRecordID *prev_var_rec_id, xtBool clean_delete, xtRowID row_id, xtXactID xn_id); +int xt_tab_maybe_committed(XTOpenTablePtr ot, xtRecordID rec_id, xtXactID *xn_id, xtRowID *out_rowid, xtBool *out_updated); +xtBool 
xt_tab_free_record(XTOpenTablePtr ot, u_int status, xtRecordID rec_id, xtBool clean_delete); +void xt_tab_store_header(XTOpenTablePtr ot, XTTableHeadDPtr rec_head); +xtBool xt_tab_write_header(XTOpenTablePtr ot, XTTableHeadDPtr rec_head, struct XTThread *thread); +xtBool xt_tab_write_min_auto_inc(XTOpenTablePtr ot); + +xtBool xt_tab_get_row(register XTOpenTablePtr ot, xtRowID row_id, xtRecordID *var_rec_id); +xtBool xt_tab_set_row(XTOpenTablePtr ot, u_int status, xtRowID row_id, xtRecordID var_rec_id); +xtBool xt_tab_free_row(XTOpenTablePtr ot, XTTableHPtr tab, xtRowID row_id); + +xtBool xt_tab_load_ext_data(XTOpenTablePtr ot, xtRecordID load_rec_id, xtWord1 *buffer, u_int cols_req); +xtBool xt_tab_put_rec_data(XTOpenTablePtr ot, xtRecordID rec_id, size_t size, xtWord1 *buffer, xtOpSeqNo *op_seq); +xtBool xt_tab_put_eof_rec_data(XTOpenTablePtr ot, xtRecordID rec_id, size_t size, xtWord1 *buffer, xtOpSeqNo *op_seq); +xtBool xt_tab_put_log_op_rec_data(XTOpenTablePtr ot, u_int status, xtRecordID free_rec_id, xtRecordID rec_id, size_t size, xtWord1 *buffer); +xtBool xt_tab_put_log_rec_data(XTOpenTablePtr ot, u_int status, xtRecordID free_rec_id, xtRecordID rec_id, size_t size, xtWord1 *buffer, xtOpSeqNo *op_seq); +xtBool xt_tab_get_rec_data(register XTOpenTablePtr ot, xtRecordID rec_id, size_t size, xtWord1 *buffer); +void xt_tab_set_index_error(XTTableHPtr tab); + +inline off_t xt_row_id_to_row_offset(register XTTableHPtr tab, xtRowID row_id) +{ + return (off_t) tab->tab_rows.tci_header_size + (off_t) (row_id - 1) * (off_t) tab->tab_rows.tci_rec_size; +} + +inline xtRowID xt_row_offset_row_id(register XTTableHPtr tab, off_t rec_offs) +{ +#ifdef DEBUG + if (((rec_offs - (off_t) tab->tab_rows.tci_header_size) % (off_t) tab->tab_rows.tci_rec_size) != 0) { + printf("ERROR! 
Not a valid record offset!\n"); + } +#endif + return (xtRowID) ((rec_offs - (off_t) tab->tab_rows.tci_header_size) / (off_t) tab->tab_rows.tci_rec_size) + 1; +} + +inline off_t xt_rec_id_to_rec_offset(register XTTableHPtr tab, xtRefID ref_id) +{ + if (!ref_id) + return (off_t) 0; + return (off_t) tab->tab_recs.tci_header_size + (off_t) (ref_id-1) * (off_t) tab->tab_recs.tci_rec_size; +} + +inline xtRefID xt_rec_offset_rec_id(register XTTableHPtr tab, off_t ref_offs) +{ + if (!ref_offs) + return (xtRefID) 0; +#ifdef DEBUG + if (((ref_offs - (off_t) tab->tab_recs.tci_header_size) % (off_t) tab->tab_recs.tci_rec_size) != 0) { + printf("ERROR! Not a valid record offset!\n"); + } +#endif + + return (xtRefID) ((ref_offs - (off_t) tab->tab_recs.tci_header_size) / (off_t) tab->tab_recs.tci_rec_size)+1; +} + +inline off_t xt_ind_node_to_offset(register XTTableHPtr tab, xtIndexNodeID node_id) +{ + if (!XT_NODE_ID(node_id)) + return (off_t) 0; + return (off_t) tab->tab_index_header_size + (off_t) (XT_NODE_ID(node_id)-1) * (off_t) tab->tab_index_page_size; +} + +inline xtIndexNodeID xt_ind_offset_to_node(register XTTableHPtr tab, off_t ind_offs) +{ + XT_NODE_TEMP; + + if (!ind_offs) + return XT_RET_NODE_ID(0); +#ifdef DEBUG + if (((ind_offs - (off_t) tab->tab_index_header_size) % (off_t) tab->tab_index_page_size) != 0) { + printf("ERROR! 
Not a valid index offset!\n"); + } +#endif + + return XT_RET_NODE_ID(((ind_offs - (off_t) tab->tab_index_header_size) / (off_t) tab->tab_index_page_size)+1); +} + +#define XT_RESIZE_ROW_BUFFER(thr, rb, size) \ + do { \ + if (rb->rb_size < size) { \ + xt_realloc(thr, (void **) &rb->x.rb_buffer, size); \ + rb->rb_size = size; \ + } \ + } \ + while (0) + +#endif diff --git a/storage/pbxt/src/thread_xt.cc b/storage/pbxt/src/thread_xt.cc new file mode 100644 index 00000000000..1fba81511ad --- /dev/null +++ b/storage/pbxt/src/thread_xt.cc @@ -0,0 +1,2291 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-03 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#ifndef XT_WIN +#include <unistd.h> +#include <sys/time.h> +#include <sys/resource.h> +#endif +#include <time.h> +#include <stdarg.h> +#include <signal.h> +#include <stdlib.h> +#include <ctype.h> +#include <errno.h> + +#include "xt_defs.h" +#include "strutil_xt.h" +#include "pthread_xt.h" +#include "thread_xt.h" +#include "memory_xt.h" +#include "sortedlist_xt.h" +#include "trace_xt.h" +#include "myxt_xt.h" +#include "database_xt.h" + +void xt_db_init_thread(XTThreadPtr self, XTThreadPtr new_thread); +void xt_db_exit_thread(XTThreadPtr self); + +static void thr_accumulate_statistics(XTThreadPtr self); + +/* + * ----------------------------------------------------------------------- + * THREAD GLOBALS + */ + +xtPublic u_int xt_thr_maximum_threads; +xtPublic u_int xt_thr_current_thread_count; +xtPublic u_int xt_thr_current_max_threads; + +/* This structure is a double linked list of thread, with a wait + * condition on it. + */ +static XTLinkedListPtr thr_list; + +/* This structure maps thread ID's to thread pointers. 
*/ +xtPublic XTThreadPtr *xt_thr_array; +static xt_mutex_type thr_array_lock; + +/* Global accumulated statistics: */ +static XTStatisticsRec thr_statistics; + +/* + * ----------------------------------------------------------------------- + * Error logging + */ + +static xt_mutex_type log_mutex; +static int log_level = 0; +static FILE *log_file = NULL; +static xtBool log_newline = TRUE; + +xtPublic xtBool xt_init_logging(void) +{ + int err; + + log_file = stdout; + log_level = XT_LOG_TRACE; + err = xt_p_mutex_init_with_autoname(&log_mutex, NULL); + if (err) { + xt_log_errno(XT_NS_CONTEXT, err); + log_file = NULL; + log_level = 0; + return FALSE; + } + if (!xt_init_trace()) { + xt_exit_logging(); + return FALSE; + } + return TRUE; +} + +xtPublic void xt_exit_logging(void) +{ + if (log_file) { + xt_free_mutex(&log_mutex); + log_file = NULL; + } + xt_exit_trace(); +} + +xtPublic void xt_get_now(char *buffer, size_t len) +{ + time_t ticks; + struct tm ltime; + + ticks = time(NULL); + if (ticks == (time_t) -1) { +#ifdef XT_WIN + printf(buffer, "** error %d getting time **", errno); +#else + snprintf(buffer, len, "** error %d getting time **", errno); +#endif + return; + } + localtime_r(&ticks, &ltime); + strftime(buffer, len, "%y%m%d %H:%M:%S", &ltime); +} + +static void thr_log_newline(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level) +{ + c_char *level_str; + char time_str[200]; + char thr_name[XT_THR_NAME_SIZE+3]; + + xt_get_now(time_str, 200); + if (self && *self->t_name) { + xt_strcpy(XT_THR_NAME_SIZE+3, thr_name, " "); + xt_strcat(XT_THR_NAME_SIZE+3, thr_name, self->t_name); + } + else + thr_name[0] = 0; + switch (level) { + case XT_LOG_FATAL: level_str = " [Fatal]"; break; + case XT_LOG_ERROR: level_str = " [Error]"; break; + case XT_LOG_WARNING: level_str = " [Warning]"; break; + case XT_LOG_INFO: level_str = " [Note]"; break; + case XT_LOG_TRACE: level_str = " [Trace]"; break; + default: level_str = " "; break; + } + if (func && *func && *func 
!= '-') { + char func_name[XT_MAX_FUNC_NAME_SIZE]; + + xt_strcpy_term(XT_MAX_FUNC_NAME_SIZE, func_name, func, '('); + if (file && *file) + fprintf(log_file, "%s%s%s %s(%s:%d) ", time_str, level_str, thr_name, func_name, xt_last_name_of_path(file), line); + else + fprintf(log_file, "%s%s%s %s() ", time_str, level_str, thr_name, func_name); + } + else { + if (file && *file) + fprintf(log_file, "%s%s%s [%s:%d] ", time_str, level_str, thr_name, xt_last_name_of_path(file), line); + else + fprintf(log_file, "%s%s%s ", time_str, level_str, thr_name); + } +} + +#ifdef XT_WIN +/* Windows uses printf()!! */ +#define DEFAULT_LOG_BUFFER_SIZE 2000 +#else +#ifdef DEBUG +#define DEFAULT_LOG_BUFFER_SIZE 10 +#else +#define DEFAULT_LOG_BUFFER_SIZE 2000 +#endif +#endif + +void xt_log_flush(XTThreadPtr self __attribute__((unused))) +{ + fflush(log_file); +} + +/* + * Log the given formatted string information to the log file. + * Before each new line, this function writes the + * log header, which includes the time, log level, + * and source file and line number (optional). 
+ */ +static void thr_log_va(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, c_char *fmt, va_list ap) +{ + char buffer[DEFAULT_LOG_BUFFER_SIZE]; + char *log_string = NULL; + + if (level > log_level) + return; + + xt_lock_mutex_ns(&log_mutex); + +#ifdef XT_WIN + vsprintf(buffer, fmt, ap); + log_string = buffer; +#else +#if !defined(va_copy) || defined(XT_SOLARIS) + int len; + + len = vsnprintf(buffer, DEFAULT_LOG_BUFFER_SIZE-1, fmt, ap); + if (len > DEFAULT_LOG_BUFFER_SIZE-1) + len = DEFAULT_LOG_BUFFER_SIZE-1; + buffer[len] = 0; + log_string = buffer; +#else + /* Use the buffer, unless it is too small */ + va_list ap2; + + va_copy(ap2, ap); + if (vsnprintf(buffer, DEFAULT_LOG_BUFFER_SIZE, fmt, ap) >= DEFAULT_LOG_BUFFER_SIZE) { + if (vasprintf(&log_string, fmt, ap2) == -1) + log_string = NULL; + } + else + log_string = buffer; +#endif +#endif + + if (log_string) { + char *str, *str_end, tmp_ch; + + str = log_string; + while (*str) { + if (log_newline) { + thr_log_newline(self, func, file, line, level); + log_newline = FALSE; + } + str_end = strchr(str, '\n'); + if (str_end) { + str_end++; + tmp_ch = *str_end; + *str_end = 0; + log_newline = TRUE; + } + else { + str_end = str + strlen(str); + tmp_ch = 0; + } + fprintf(log_file, "%s", str); + fflush(log_file); + *str_end = tmp_ch; + str = str_end; + } + + if (log_string != buffer) + free(log_string); + } + + xt_unlock_mutex_ns(&log_mutex); +} + +xtPublic void xt_logf(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, c_char *fmt, ...) 
+{ + va_list ap; + + va_start(ap, fmt); + thr_log_va(self, func, file, line, level, fmt, ap); + va_end(ap); +} + +xtPublic void xt_log(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, c_char *string) +{ + xt_logf(self, func, file, line, level, "%s", string); +} + +static int thr_log_error_va(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, int xt_err, int sys_err, c_char *fmt, va_list ap) +{ + int default_level; + char xt_err_string[50]; + + *xt_err_string = 0; + switch (xt_err) { + case XT_ASSERTION_FAILURE: + strcpy(xt_err_string, "Assertion"); + default_level = XT_LOG_FATAL; + break; + case XT_SYSTEM_ERROR: + strcpy(xt_err_string, "errno"); + default_level = XT_LOG_ERROR; + break; + case XT_SIGNAL_CAUGHT: + strcpy(xt_err_string, "Signal"); + default_level = XT_LOG_ERROR; + break; + default: + sprintf(xt_err_string, "%d", xt_err); + default_level = XT_LOG_ERROR; + break; + } + if (level == XT_LOG_DEFAULT) + level = default_level; + + if (*xt_err_string) { + if (sys_err) + xt_logf(self, func, file, line, level, "%s (%d): ", xt_err_string, sys_err); + else + xt_logf(self, func, file, line, level, "%s: ", xt_err_string); + } + thr_log_va(self, func, file, line, level, fmt, ap); + xt_logf(self, func, file, line, level, "\n"); + return level; +} + +/* The function returns the actual log level used. */ +xtPublic int xt_log_errorf(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, int xt_err, int sys_err, c_char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + level = thr_log_error_va(self, func, file, line, level, xt_err, sys_err, fmt, ap); + va_end(ap); + return level; +} + +/* The function returns the actual log level used. 
*/ +xtPublic int xt_log_error(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, int xt_err, int sys_err, c_char *string) +{ + return xt_log_errorf(self, func, file, line, level, xt_err, sys_err, "%s", string); +} + +xtPublic void xt_log_exception(XTThreadPtr self, XTExceptionPtr e, int level) +{ + level = xt_log_error( + self, + e->e_func_name, + e->e_source_file, + e->e_source_line, + level, + e->e_xt_err, + e->e_sys_err, + e->e_err_msg); + /* Dump the catch trace: */ + if (*e->e_catch_trace) + xt_logf(self, NULL, NULL, 0, level, "%s", e->e_catch_trace); +} + +xtPublic void xt_log_and_clear_exception(XTThreadPtr self) +{ + xt_log_exception(self, &self->t_exception, XT_LOG_DEFAULT); + xt_clear_exception(self); +} + +xtPublic void xt_log_and_clear_exception_ns(void) +{ + xt_log_and_clear_exception(xt_get_self()); +} + +xtPublic void xt_log_and_clear_warning(XTThreadPtr self) +{ + xt_log_exception(self, &self->t_exception, XT_LOG_WARNING); + xt_clear_exception(self); +} + +xtPublic void xt_log_and_clear_warning_ns(void) +{ + xt_log_and_clear_warning(xt_get_self()); +} + +/* + * ----------------------------------------------------------------------- + * Exceptions + */ + +static void thr_add_catch_trace(XTExceptionPtr e, c_char *func, c_char *file, u_int line) +{ + if (func && *func && *func != '-') { + xt_strcat_term(XT_CATCH_TRACE_SIZE, e->e_catch_trace, func, '('); + xt_strcat(XT_CATCH_TRACE_SIZE, e->e_catch_trace, "("); + } + if (file && *file) { + xt_strcat(XT_CATCH_TRACE_SIZE, e->e_catch_trace, xt_last_name_of_path(file)); + if (line) { + char buffer[40]; + + sprintf(buffer, "%u", line); + xt_strcat(XT_CATCH_TRACE_SIZE, e->e_catch_trace, ":"); + xt_strcat(XT_CATCH_TRACE_SIZE, e->e_catch_trace, buffer); + } + } + if (func && *func && *func != '-') + xt_strcat(XT_CATCH_TRACE_SIZE, e->e_catch_trace, ")"); + xt_strcat(XT_CATCH_TRACE_SIZE, e->e_catch_trace, "\n"); +} + +static void thr_save_error_va(XTExceptionPtr e, XTThreadPtr self, xtBool 
throw_it, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *fmt, va_list ap) +{ + int i; + + if (!e) + return; + + e->e_xt_err = xt_err; + e->e_sys_err = sys_err; + vsnprintf(e->e_err_msg, XT_ERR_MSG_SIZE, fmt, ap); + + /* Make the first character of the message upper case: */ + if (isalpha(e->e_err_msg[0]) && islower(e->e_err_msg[0])) + e->e_err_msg[0] = (char) toupper(e->e_err_msg[0]); + + if (func && *func && *func != '-') + xt_strcpy_term(XT_MAX_FUNC_NAME_SIZE, e->e_func_name, func, '('); + else + *e->e_func_name = 0; + if (file && *file) { + xt_strcpy(XT_SOURCE_FILE_NAME_SIZE, e->e_source_file, xt_last_name_of_path(file)); + e->e_source_line = line; + } + else { + *e->e_source_file = 0; + e->e_source_line = 0; + } + *e->e_catch_trace = 0; + + if (!self) + return; + + /* Create a stack trace for this exception: */ + thr_add_catch_trace(e, func, file, line); + for (i=self->t_call_top-1; i>=0; i--) + thr_add_catch_trace(e, self->t_call_stack[i].cs_func, self->t_call_stack[i].cs_file, self->t_call_stack[i].cs_line); + + if (throw_it) + xt_throw(self); +} + +/* + * ----------------------------------------------------------------------- + * THROWING EXCEPTIONS + */ + +/* If we have to allocate resources and then hold them temporarily during which + * time an exception could occur, then these functions provide a holding + * place for the data, which will be freed in the case of an exception. + * + * Note: the free functions could themselves allocate resources. + * To make sure all things work out we only remove the resource from + * the stack when it is freed. 
+ */ +static void thr_free_resources(XTThreadPtr self, XTResourcePtr top) +{ + XTResourcePtr rp; + XTThreadFreeFunc free_func; + + if (!top) + return; + while (self->t_res_top > top) { + /* Pop the top resource: */ + rp = (XTResourcePtr) (((char *) self->t_res_top) - self->t_res_top->r_prev_size); + + /* Free the resource: */ + if (rp->r_free_func) { + free_func = rp->r_free_func; + rp->r_free_func = NULL; + free_func(self, rp->r_data); + } + + self->t_res_top = rp; + } +} + +xtPublic void xt_bug(XTThreadPtr self __attribute__((unused))) +{ + static int *bug_ptr = NULL; + + bug_ptr = NULL; +} + +/* + * This function is called when an exception is caught. + * It restores the function call top and frees + * any resource allocated by lower levels. + */ +xtPublic void xt_caught(XTThreadPtr self) +{ + /* Restore the call top: */ + self->t_call_top = self->t_jmp_env[self->t_jmp_depth].jb_call_top; + + /* Free the temporary data that would otherwise be lost + * This should do nothing, because we actually free things on throw + * (see below). + */ + thr_free_resources(self, self->t_jmp_env[self->t_jmp_depth].jb_res_top); +} + +/* Throw an already registered error: */ +xtPublic void xt_throw(XTThreadPtr self) +{ + if (self) { + ASSERT_NS(self->t_exception.e_xt_err); + if (self->t_jmp_depth > 0 && self->t_jmp_depth <= XT_MAX_JMP) { + /* As recommended by Barry: free the resources before the stack is invalid! */ + thr_free_resources(self, self->t_jmp_env[self->t_jmp_depth-1].jb_res_top); + + /* Then do the longjmp: */ + longjmp(self->t_jmp_env[self->t_jmp_depth-1].jb_buffer, 1); + } + } + + /* + * We cannot throw an error, because it will not be caught. + * This means there is no try ... catch block above. + * In this case, we just return. + * The calling functions must handle errors... 
+ xt_caught(self); + xt_log(XT_CONTEXT, XT_LOG_FATAL, "Uncaught exception\n"); + xt_exit_thread(self, NULL); + */ +} + +xtPublic void xt_throwf(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *fmt, ...) +{ + va_list ap; + XTThreadPtr thread = self ? self : xt_get_self(); + + va_start(ap, fmt); + thr_save_error_va(thread ? &thread->t_exception : NULL, thread, self ? TRUE : FALSE, func, file, line, xt_err, sys_err, fmt, ap); + va_end(ap); +} + +xtPublic void xt_throw_error(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *msg) +{ + xt_throwf(self, func, file, line, xt_err, sys_err, "%s", msg); +} + +#define XT_SYS_ERR_SIZE 300 + +static c_char *thr_get_sys_error(int err, char *err_msg __attribute__((unused))) +{ +#ifdef XT_WIN + char *ptr; + + if (!FormatMessage(FORMAT_MESSAGE_FROM_SYSTEM, NULL, + err, 0, err_msg, XT_SYS_ERR_SIZE, NULL)) { + return strerror(err); + } + + ptr = &err_msg[strlen(err_msg)]; + while (ptr-1 > err_msg) { + if (*(ptr-1) != '\n' && *(ptr-1) != '\r' && *(ptr-1) != '.') + break; + ptr--; + } + *ptr = 0; +return err_msg; +#else + return strerror(err); +#endif +} + +static c_char *thr_get_err_string(int xt_err) +{ + c_char *str; + + switch (xt_err) { + case XT_ERR_STACK_OVERFLOW: str = "Stack overflow"; break; + case XT_ERR_JUMP_OVERFLOW: str = "Jump overflow"; break; + case XT_ERR_TABLE_EXISTS: str = "Table `%s` already exists"; break; + case XT_ERR_NAME_TOO_LONG: str = "Name '%s' is too long"; break; + case XT_ERR_TABLE_NOT_FOUND: str = "Table `%s` not found"; break; + case XT_ERR_SESSION_NOT_FOUND: str = "Session %s not found"; break; + case XT_ERR_BAD_ADDRESS: str = "Incorrect address '%s'"; break; + case XT_ERR_UNKNOWN_SERVICE: str = "Unknown service '%s'"; break; + case XT_ERR_UNKNOWN_HOST: str = "Host '%s' not found"; break; + case XT_ERR_TOKEN_EXPECTED: str = "%s expected in place of %s"; break; + case XT_ERR_PROPERTY_REQUIRED: str = "Property '%s' 
required"; break; + case XT_ERR_DEADLOCK: str = "Deadlock, transaction aborted"; break; + case XT_ERR_CANNOT_CHANGE_DB: str = "Cannot change database while transaction is in progress"; break; + case XT_ERR_ILLEGAL_CHAR: str = "Illegal character: '%s'"; break; + case XT_ERR_UNTERMINATED_STRING:str = "Unterminated string: %s"; break; + case XT_ERR_SYNTAX: str = "Syntax error near %s"; break; + case XT_ERR_ILLEGAL_INSTRUCTION:str = "Illegal instruction"; break; + case XT_ERR_OUT_OF_BOUNDS: str = "Memory reference out of bounds"; break; + case XT_ERR_STACK_UNDERFLOW: str = "Stack underflow"; break; + case XT_ERR_TYPE_MISMATCH: str = "Type mismatch"; break; + case XT_ERR_ILLEGAL_TYPE: str = "Illegal type for operator"; break; + case XT_ERR_ID_TOO_LONG: str = "Identifier too long: %s"; break; + case XT_ERR_TYPE_OVERFLOW: str = "Type overflow: %s"; break; + case XT_ERR_TABLE_IN_USE: str = "Table `%s` in use"; break; + case XT_ERR_NO_DATABASE_IN_USE: str = "No database in use"; break; + case XT_ERR_CANNOT_RESOLVE_TYPE:str = "Cannot resolve type with ID: %s"; break; + case XT_ERR_BAD_INDEX_DESC: str = "Unsupported index description: %s"; break; + case XT_ERR_WRONG_NO_OF_VALUES: str = "Incorrect number of values"; break; + case XT_ERR_CANNOT_OUTPUT_VALUE:str = "Cannot output given type"; break; + case XT_ERR_COLUMN_NOT_FOUND: str = "Column `%s.%s` not found"; break; + case XT_ERR_NOT_IMPLEMENTED: str = "Not implemented"; break; + case XT_ERR_UNEXPECTED_EOS: str = "Connection unexpectedly lost"; break; + case XT_ERR_BAD_TOKEN: str = "Incorrect binary token"; break; + case XT_ERR_RES_STACK_OVERFLOW: str = "Internal error: resource stack overflow"; break; + case XT_ERR_BAD_INDEX_TYPE: str = "Unsupported index type: %s"; break; + case XT_ERR_INDEX_EXISTS: str = "Index '%s' already exists"; break; + case XT_ERR_INDEX_STRUC_EXISTS: str = "Index '%s' has an identical structure"; break; + case XT_ERR_INDEX_NOT_FOUND: str = "Index '%s' not found"; break; + case XT_ERR_INDEX_CORRUPT: 
str = "Cannot read index '%s'"; break; + case XT_ERR_TYPE_NOT_SUPPORTED: str = "Data type %s not supported"; break; + case XT_ERR_BAD_TABLE_VERSION: str = "Table `%s` version not supported, upgrade required"; break; + case XT_ERR_BAD_RECORD_FORMAT: str = "Record format unknown, either corrupted or upgrade required"; break; + case XT_ERR_BAD_EXT_RECORD: str = "Extended record part does not match reference"; break; + case XT_ERR_RECORD_CHANGED: str = "Record already updated, transaction aborted"; break; + case XT_ERR_XLOG_WAS_CORRUPTED: str = "Corrupted transaction log has been truncated"; break; + case XT_ERR_DUPLICATE_KEY: str = "Duplicate unique key"; break; + case XT_ERR_NO_DICTIONARY: str = "Table `%s` has not yet been opened by MySQL"; break; + case XT_ERR_TOO_MANY_TABLES: str = "Limit of %s tables per database exceeded"; break; + case XT_ERR_KEY_TOO_LARGE: str = "Index '%s' exceeds the key size limit of %s"; break; + case XT_ERR_MULTIPLE_DATABASES: str = "Multiple database in a single transaction is not permitted"; break; + case XT_ERR_NO_TRANSACTION: str = "Internal error: no transaction running"; break; + case XT_ERR_A_EXPECTED_NOT_B: str = "%s expected in place of %s"; break; + case XT_ERR_NO_MATCHING_INDEX: str = "Matching index required for '%s'"; break; + case XT_ERR_TABLE_LOCKED: str = "Table `%s` locked"; break; + case XT_ERR_NO_REFERENCED_ROW: str = "Constraint: `%s`"; break; // "Foreign key '%s', referenced row does not exist" + case XT_ERR_ROW_IS_REFERENCED: str = "Constraint: `%s`"; break; // "Foreign key '%s', has a referencing row" + case XT_ERR_BAD_DICTIONARY: str = "Internal dictionary does not match MySQL dictionary"; break; + case XT_ERR_LOADING_MYSQL_DIC: str = "Error %s loading MySQL .frm file"; break; + case XT_ERR_COLUMN_IS_NOT_NULL: str = "Column `%s` is NOT NULL"; break; + case XT_ERR_INCORRECT_NO_OF_COLS: str = "Incorrect number of columns near %s"; break; + case XT_ERR_FK_ON_TEMP_TABLE: str = "Cannot create foreign key on temporary 
table"; break; + case XT_ERR_REF_TABLE_NOT_FOUND: str = "Referenced table `%s` not found"; break; + case XT_ERR_REF_TYPE_WRONG: str = "Incorrect data type on referenced column `%s`"; break; + case XT_ERR_DUPLICATE_FKEY: str = "Duplicate unique foreign key, contraint: %s"; break; + case XT_ERR_INDEX_FILE_TO_LARGE: str = "Index file has grown too large: %s"; break; + case XT_ERR_UPGRADE_TABLE: str = "Table `%s` must be upgraded from PBXT version %s"; break; + case XT_ERR_INDEX_NEW_VERSION: str = "Table `%s` index created by a newer version, upgrade required"; break; + case XT_ERR_LOCK_TIMEOUT: str = "Lock timeout on table `%s`"; break; + case XT_ERR_CONVERSION: str = "Error converting value for column `%s.%s`"; break; + case XT_ERR_NO_ROWS: str = "No matching row found in table `%s`"; break; + case XT_ERR_DATA_LOG_NOT_FOUND: str = "Data log not found: '%s'"; break; + case XT_ERR_LOG_MAX_EXCEEDED: str = "Maximum log count, %s, exceeded"; break; + case XT_ERR_MAX_ROW_COUNT: str = "Maximum row count reached"; break; + case XT_ERR_FILE_TOO_LONG: str = "File cannot be mapped, too large: '%s'"; break; + case XT_ERR_BAD_IND_BLOCK_SIZE: str = "Table `%s`, incorrect index block size: %s"; break; + case XT_ERR_INDEX_CORRUPTED: str = "Table `%s` index is corrupted, REPAIR TABLE required"; break; + case XT_ERR_NO_INDEX_CACHE: str = "Not enough index cache memory to handle concurrent updates"; break; + case XT_ERR_INDEX_LOG_CORRUPT: str = "Index log corrupt: '%s'"; break; + case XT_ERR_TOO_MANY_THREADS: str = "Too many threads: %s, increase max_connections"; break; + case XT_ERR_TOO_MANY_WAITERS: str = "Too many waiting threads: %s"; break; + case XT_ERR_INDEX_OLD_VERSION: str = "Table `%s` index created by an older version, REPAIR TABLE required"; break; + case XT_ERR_PBXT_TABLE_EXISTS: str = "System table cannot be dropped because PBXT table still exists"; break; + case XT_ERR_SERVER_RUNNING: str = "A server is possibly already running"; break; + case XT_ERR_INDEX_MISSING: str 
= "Index file of table '%s' is missing"; break; + case XT_ERR_RECORD_DELETED: str = "Record was deleted"; break; + case XT_ERR_NEW_TYPE_OF_XLOG: str = "Transaction log %s, is using a newer format, upgrade required"; break; + case XT_ERR_NO_BEFORE_IMAGE: str = "Internal error: no before image"; break; + case XT_ERR_FK_REF_TEMP_TABLE: str = "Foreign key may not reference temporary table"; break; + default: str = "Unknown XT error"; break; + } + return str; +} + +xtPublic void xt_throw_i2xterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, c_char *item, c_char *item2) +{ + xt_throwf(self, func, file, line, xt_err, 0, thr_get_err_string(xt_err), item, item2); +} + +xtPublic void xt_throw_ixterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, c_char *item) +{ + xt_throw_i2xterr(self, func, file, line, xt_err, item, NULL); +} + +xtPublic void xt_throw_tabcolerr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item, c_char *item2) +{ + char buffer[XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + 3]; + + xt_2nd_last_name_of_path(sizeof(buffer), buffer, tab_item->ps_path); + xt_strcat(sizeof(buffer), buffer, "."); + xt_strcat(sizeof(buffer), buffer, xt_last_name_of_path(tab_item->ps_path)); + + xt_throw_i2xterr(self, func, file, line, xt_err, buffer, item2); +} + +xtPublic void xt_throw_taberr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item) +{ + char buffer[XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + 3]; + + xt_2nd_last_name_of_path(sizeof(buffer), buffer, tab_item->ps_path); + xt_strcat(sizeof(buffer), buffer, "."); + xt_strcat(sizeof(buffer), buffer, xt_last_name_of_path(tab_item->ps_path)); + + xt_throw_ixterr(self, func, file, line, xt_err, buffer); +} + +xtPublic void xt_throw_ulxterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, u_long value) +{ + char 
buffer[100]; + + sprintf(buffer, "%lu", value); + xt_throw_ixterr(self, func, file, line, xt_err, buffer); +} + +xtPublic void xt_throw_sulxterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, c_char *item, u_long value) +{ + char buffer[100]; + + sprintf(buffer, "%lu", value); + xt_throw_i2xterr(self, func, file, line, xt_err, item, buffer); +} + +xtPublic void xt_throw_xterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err) +{ + xt_throw_ixterr(self, func, file, line, xt_err, NULL); +} + +xtPublic void xt_throw_errno(XTThreadPtr self, c_char *func, c_char *file, u_int line, int err) +{ + char err_msg[XT_SYS_ERR_SIZE]; + + xt_throw_error(self, func, file, line, XT_SYSTEM_ERROR, err, thr_get_sys_error(err, err_msg)); +} + +xtPublic void xt_throw_ferrno(XTThreadPtr self, c_char *func, c_char *file, u_int line, int err, c_char *path) +{ + char err_msg[XT_SYS_ERR_SIZE]; + + xt_throwf(self, func, file, line, XT_SYSTEM_ERROR, err, "%s: '%s'", thr_get_sys_error(err, err_msg), path); +} + +xtPublic void xt_throw_assertion(XTThreadPtr self, c_char *func, c_char *file, u_int line, c_char *str) +{ + xt_throw_error(self, func, file, line, XT_ASSERTION_FAILURE, 0, str); +} + +static void xt_log_assertion(XTThreadPtr self, c_char *func, c_char *file, u_int line, c_char *str) +{ + xt_log_error(self, func, file, line, XT_LOG_DEFAULT, XT_ASSERTION_FAILURE, 0, str); +} + +xtPublic void xt_throw_signal(XTThreadPtr self, c_char *func, c_char *file, u_int line, int sig) +{ +#ifdef XT_WIN + char buffer[100]; + + sprintf(buffer, "Signal #%d", sig); + xt_throw_error(self, func, file, line, XT_SIGNAL_CAUGHT, sig, buffer); +#else + xt_throw_error(self, func, file, line, XT_SIGNAL_CAUGHT, sig, strsignal(sig)); +#endif +} + +/* + * ----------------------------------------------------------------------- + * REGISTERING EXCEPTIONS + */ + +xtPublic void xt_registerf(c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *fmt, ...) 
+{ + va_list ap; + XTThreadPtr thread = xt_get_self(); + + va_start(ap, fmt); + thr_save_error_va(thread ? &thread->t_exception : NULL, thread, FALSE, func, file, line, xt_err, sys_err, fmt, ap); + va_end(ap); +} + +xtPublic void xt_register_i2xterr(c_char *func, c_char *file, u_int line, int xt_err, c_char *item, c_char *item2) +{ + xt_registerf(func, file, line, xt_err, 0, thr_get_err_string(xt_err), item, item2); +} + +xtPublic void xt_register_ixterr(c_char *func, c_char *file, u_int line, int xt_err, c_char *item) +{ + xt_register_i2xterr(func, file, line, xt_err, item, NULL); +} + +xtPublic void xt_register_tabcolerr(c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item, c_char *item2) +{ + char buffer[XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + 3]; + + xt_2nd_last_name_of_path(sizeof(buffer), buffer, tab_item->ps_path); + xt_strcat(sizeof(buffer), buffer, "."); + xt_strcpy(sizeof(buffer), buffer, xt_last_name_of_path(tab_item->ps_path)); + xt_strcat(sizeof(buffer), buffer, "."); + xt_strcat(sizeof(buffer), buffer, item2); + + xt_register_ixterr(func, file, line, xt_err, buffer); +} + +xtPublic void xt_register_taberr(c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item) +{ + char buffer[XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + XT_IDENTIFIER_NAME_SIZE + 3]; + + xt_2nd_last_name_of_path(sizeof(buffer), buffer, tab_item->ps_path); + xt_strcat(sizeof(buffer), buffer, "."); + xt_strcpy(sizeof(buffer), buffer, xt_last_name_of_path(tab_item->ps_path)); + + xt_register_ixterr(func, file, line, xt_err, buffer); +} + +xtPublic void xt_register_ulxterr(c_char *func, c_char *file, u_int line, int xt_err, u_long value) +{ + char buffer[100]; + + sprintf(buffer, "%lu", value); + xt_register_ixterr(func, file, line, xt_err, buffer); +} + +xtPublic xtBool xt_register_ferrno(c_char *func, c_char *file, u_int line, int err, c_char *path) +{ + char err_msg[XT_SYS_ERR_SIZE]; + + 
xt_registerf(func, file, line, XT_SYSTEM_ERROR, err, "%s: '%s'", thr_get_sys_error(err, err_msg), path); + return FAILED; +} + +xtPublic void xt_register_error(c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *msg) +{ + xt_registerf(func, file, line, xt_err, sys_err, "%s", msg); +} + +xtPublic xtBool xt_register_errno(c_char *func, c_char *file, u_int line, int err) +{ + char err_msg[XT_SYS_ERR_SIZE]; + + xt_register_error(func, file, line, XT_SYSTEM_ERROR, err, thr_get_sys_error(err, err_msg)); + return FAILED; +} + +xtPublic void xt_register_xterr(c_char *func, c_char *file, u_int line, int xt_err) +{ + xt_register_error(func, file, line, xt_err, 0, thr_get_err_string(xt_err)); +} + +/* + * ----------------------------------------------------------------------- + * CREATING EXCEPTIONS + */ + +xtPublic void xt_exceptionf(XTExceptionPtr e, XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + thr_save_error_va(e, self, FALSE, func, file, line, xt_err, sys_err, fmt, ap); + va_end(ap); +} + +xtPublic void xt_exception_error(XTExceptionPtr e, XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *msg) +{ + xt_exceptionf(e, self, func, file, line, xt_err, sys_err, "%s", msg); +} + +xtPublic xtBool xt_exception_errno(XTExceptionPtr e, XTThreadPtr self, c_char *func, c_char *file, u_int line, int err) +{ + char err_msg[XT_SYS_ERR_SIZE]; + + xt_exception_error(e, self, func, file, line, XT_SYSTEM_ERROR, err, thr_get_sys_error(err, err_msg)); + return FAILED; +} + +/* + * ----------------------------------------------------------------------- + * LOG ERRORS + */ + +xtPublic void xt_log_errno(XTThreadPtr self, c_char *func, c_char *file, u_int line, int err) +{ + XTExceptionRec e; + + xt_exception_errno(&e, self, func, file, line, err); + xt_log_exception(self, &e, XT_LOG_DEFAULT); +} + +/* + * 
----------------------------------------------------------------------- + * Assertions and failures (one breakpoint for all failures) + */ + +xtPublic xtBool xt_assert(XTThreadPtr self __attribute__((unused)), c_char *expr, c_char *func, c_char *file, u_int line) +{ +#ifdef DEBUG + //xt_set_fflush(TRUE); + //xt_dump_trace(); + printf("%s(%s:%d) %s\n", func, file, (int) line, expr); +#ifdef XT_WIN + FatalAppExit(0, "Assertion Failed!"); +#endif +#else + xt_throw_assertion(self, func, file, line, expr); +#endif + return FALSE; +} + +xtPublic xtBool xt_assume(XTThreadPtr self, c_char *expr, c_char *func, c_char *file, u_int line) +{ + xt_log_assertion(self, func, file, line, expr); + return FALSE; +} + +/* + * ----------------------------------------------------------------------- + * Create and destroy threads + */ + +typedef struct ThreadData { + xtBool td_started; + XTThreadPtr td_thr; + void *(*td_start_routine)(XTThreadPtr self); +} ThreadDataRec, *ThreadDataPtr; + +#ifdef XT_WIN +pthread_key(void *, thr_key); +#else +static pthread_key_t thr_key; +#endif + +#ifdef HANDLE_SIGNALS +static void thr_ignore_signal(int sig) +{ +#pragma unused(sig) +} + +static void thr_throw_signal(int sig) +{ + XTThreadPtr self; + + self = xt_get_self(); + + if (self->t_main) { + /* The main thread will pass on a signal to all threads: */ + xt_signal_all_threads(self, sig); + if (sig != SIGTERM) { + if (self->t_disable_interrupts) { + self->t_delayed_signal = sig; + self->t_disable_interrupts = FALSE; /* Prevent infinite loop */ + } + else { + self->t_delayed_signal = 0; + xt_throw_signal(self, "thr_throw_signal", NULL, 0, sig); + } + } + } + else { + if (self->t_disable_interrupts) { + self->t_delayed_signal = sig; + self->t_disable_interrupts = FALSE; /* Prevent infinite loop */ + } + else { + self->t_delayed_signal = 0; + xt_throw_signal(self, "thr_throw_signal", NULL, 0, sig); + } + } +} + +static xtBool thr_setup_signals(void) +{ + struct sigaction action; + +
sigemptyset(&action.sa_mask); + action.sa_flags = 0; + action.sa_handler = thr_ignore_signal; + + if (sigaction(SIGPIPE, &action, NULL) == -1) + goto error_occurred; + if (sigaction(SIGHUP, &action, NULL) == -1) + goto error_occurred; + + action.sa_handler = thr_throw_signal; + + if (sigaction(SIGQUIT, &action, NULL) == -1) + goto error_occurred; + if (sigaction(SIGTERM, &action, NULL) == -1) + goto error_occurred; +#ifndef DEBUG + if (sigaction(SIGILL, &action, NULL) == -1) + goto error_occurred; + if (sigaction(SIGBUS, &action, NULL) == -1) + goto error_occurred; + if (sigaction(SIGSEGV, &action, NULL) == -1) + goto error_occurred; +#endif + return TRUE; + + error_occurred: + xt_log_errno(XT_NS_CONTEXT, errno); + return FALSE; +} +#endif + +static void *thr_main(void *data) +{ + ThreadDataPtr td = (ThreadDataPtr) data; + XTThreadPtr self = td->td_thr; + void *(*start_routine)(XTThreadPtr); + void *return_data; + + enter_(); + self->t_pthread = pthread_self(); + start_routine = td->td_start_routine; + return_data = NULL; + +#ifdef HANDLE_SIGNALS + if (!thr_setup_signals()) + return NULL; +#endif + + try_(a) { + if (!xt_set_key(thr_key, self, &self->t_exception)) + throw_(); + td->td_started = TRUE; + return_data = (*start_routine)(self); + } + catch_(a) { + xt_log_and_clear_exception(self); + } + cont_(a); + + outer_(); + xt_free_thread(self); + return return_data; +} + +static void thr_free_data(XTThreadPtr self) +{ + if (self->t_free_data) { + (*self->t_free_data)(self, self->t_data); + self->t_data = NULL; + } +} + +xtPublic void xt_set_thread_data(XTThreadPtr self, void *data, XTThreadFreeFunc free_func) +{ + thr_free_data(self); + self->t_free_data = free_func; + self->t_data = data; +} + +static void thr_exit(XTThreadPtr self) +{ + /* Free the thread temporary data. */ + thr_free_resources(self, (XTResourcePtr) self->x.t_res_stack); + xt_db_exit_thread(self); + thr_free_data(self); /* Free custom user data. 
*/ + + if (self->t_id > 0) { + ASSERT(self->t_id < xt_thr_current_max_threads); + xt_lock_mutex(self, &thr_array_lock); + pushr_(xt_unlock_mutex, &thr_array_lock); + thr_accumulate_statistics(self); + xt_thr_array[self->t_id] = NULL; + xt_thr_current_thread_count--; + if (self->t_id+1 == xt_thr_current_max_threads) { + /* We can reduce the current maximum, + * this makes operations that scan the array faster! + */ + u_int i; + + i = self->t_id; + for(;;) { + if (xt_thr_array[i]) + break; + if (!i) + break; + i--; + } + xt_thr_current_max_threads = i+1; + } + freer_(); // xt_unlock_mutex(&thr_array_lock) + } + + xt_free_cond(&self->t_cond); + xt_free_mutex(&self->t_lock); + + self->st_thread_list_count = 0; + self->st_thread_list_size = 0; + if (self->st_thread_list) { + xt_free_ns(self->st_thread_list); + self->st_thread_list = NULL; + } +} + +static void thr_init(XTThreadPtr self, XTThreadPtr new_thread) +{ + new_thread->t_res_top = (XTResourcePtr) new_thread->x.t_res_stack; + + new_thread->st_thread_list_count = 0; + new_thread->st_thread_list_size = 0; + new_thread->st_thread_list = NULL; + try_(a) { + xt_init_cond(self, &new_thread->t_cond); + xt_init_mutex_with_autoname(self, &new_thread->t_lock); + + xt_lock_mutex(self, &thr_array_lock); + pushr_(xt_unlock_mutex, &thr_array_lock); + + ASSERT(xt_thr_current_thread_count <= xt_thr_current_max_threads); + ASSERT(xt_thr_current_max_threads <= xt_thr_maximum_threads); + if (xt_thr_current_thread_count == xt_thr_maximum_threads) + xt_throw_ulxterr(XT_CONTEXT, XT_ERR_TOO_MANY_THREADS, (u_long) xt_thr_maximum_threads+1); + if (xt_thr_current_thread_count == xt_thr_current_max_threads) { + new_thread->t_id = xt_thr_current_thread_count; + xt_thr_array[new_thread->t_id] = new_thread; + xt_thr_current_max_threads++; + } + else { + /* There must be a free slot: */ + for (u_int i=0; i<xt_thr_current_max_threads; i++) { + if (!xt_thr_array[i]) { + new_thread->t_id = i; + xt_thr_array[i] = new_thread; + break; + } + } + } + 
xt_thr_current_thread_count++; + freer_(); // xt_unlock_mutex(&thr_array_lock) + + xt_db_init_thread(self, new_thread); + } + catch_(a) { + thr_exit(new_thread); + throw_(); + } + cont_(a); + +} + +/* + * The caller of this function automatically becomes the main thread. + */ +xtPublic XTThreadPtr xt_init_threading(u_int max_threads) +{ + volatile XTThreadPtr self = NULL; + XTExceptionRec e; + int err; + + /* Align the number of threads: */ + xt_thr_maximum_threads = xt_align_size(max_threads, XT_XS_LOCK_ALIGN); + +#ifdef HANDLE_SIGNALS + if (!thr_setup_signals()) + return NULL; +#endif + + xt_p_init_threading(); + + err = pthread_key_create(&thr_key, NULL); + if (err) { + xt_log_errno(XT_NS_CONTEXT, err); + return NULL; + } + + if ((err = xt_p_mutex_init_with_autoname(&thr_array_lock, NULL))) { + xt_log_errno(XT_NS_CONTEXT, err); + goto failed; + } + + if (!(xt_thr_array = (XTThreadPtr *) malloc(xt_thr_maximum_threads * sizeof(XTThreadPtr)))) { + xt_log_errno(XT_NS_CONTEXT, XT_ENOMEM); + goto failed; + } + + xt_thr_array[0] = (XTThreadPtr) 1; // Dummy, not used + xt_thr_current_thread_count = 1; + xt_thr_current_max_threads = 1; + + /* Create the main thread: */ + self = xt_create_thread("MainThread", TRUE, FALSE, &e); + if (!self) { + xt_log_exception(NULL, &e, XT_LOG_DEFAULT); + goto failed; + } + + try_(a) { + XTThreadPtr thread = self; + thr_list = xt_new_linkedlist(thread, NULL, NULL, TRUE); + } + catch_(a) { + XTThreadPtr thread = self; + xt_log_and_clear_exception(thread); + xt_exit_threading(thread); + } + cont_(a); + + return self; + + failed: + xt_exit_threading(NULL); + return NULL; +} + +xtPublic void xt_exit_threading(XTThreadPtr self) +{ + if (thr_list) { + xt_free_linkedlist(self, thr_list); + thr_list = NULL; + } + + /* This should be the main thread! 
*/ + if (self) { + ASSERT(self->t_main); + xt_free_thread(self); + } + + if (xt_thr_array) { + free(xt_thr_array); + xt_thr_array = NULL; + xt_free_mutex(&thr_array_lock); + } + + xt_thr_current_thread_count = 0; + xt_thr_current_max_threads = 0; + + /* I no longer delete 'thr_key' because + * functions that call xt_get_self() after this + * point will get junk back if we delete + * thr_key. In particular the XT_THREAD_LOCK_INFO + * code fails + if (thr_key) { + pthread_key_delete(thr_key); + thr_key = (pthread_key_t) 0; + } + */ +} + +xtPublic void xt_wait_for_all_threads(XTThreadPtr self) +{ + if (thr_list) + xt_ll_wait_till_empty(self, thr_list); +} + +/* + * Call this function in a busy wait loop! + * Use it for wait loops that are not + * time critical. + */ +xtPublic void xt_busy_wait(void) +{ +#ifdef XT_WIN + Sleep(1); +#else + usleep(10); +#endif +} + +xtPublic void xt_critical_wait(void) +{ + /* NOTE: On Mac xt_busy_wait() works better than xt_yield() + */ +#if defined(XT_MAC) || defined(XT_WIN) + xt_busy_wait(); +#else + xt_yield(); +#endif +} + + +/* + * Use this for loops that are time critical. + * Time critical means we need to get going + * as soon as possible!
+ */ +xtPublic void xt_yield(void) +{ +#ifdef XT_WIN + Sleep(0); +#elif defined(XT_MAC) || defined(XT_SOLARIS) + usleep(0); +#elif defined(XT_NETBSD) + sched_yield(); +#else + pthread_yield(); +#endif +} + +xtPublic void xt_sleep_milli_second(u_int t) +{ +#ifdef XT_WIN + Sleep(t); +#else + usleep(t * 1000); +#endif +} + +xtPublic void xt_signal_all_threads(XTThreadPtr self, int sig) +{ + XTLinkedItemPtr li; + XTThreadPtr sig_thr; + + xt_ll_lock(self, thr_list); + try_(a) { + li = thr_list->ll_items; + while (li) { + sig_thr = (XTThreadPtr) li; + if (sig_thr != self) + pthread_kill(sig_thr->t_pthread, sig); + li = li->li_next; + } + } + catch_(a) { + xt_ll_unlock(self, thr_list); + throw_(); + } + cont_(a); + xt_ll_unlock(self, thr_list); +} + +/* + * Apply the given function to all threads except self! + */ +xtPublic void xt_do_to_all_threads(XTThreadPtr self, void (*do_func_ptr)(XTThreadPtr self, XTThreadPtr to_thr, void *thunk), void *thunk) +{ + XTLinkedItemPtr li; + XTThreadPtr to_thr; + + xt_ll_lock(self, thr_list); + pushr_(xt_ll_unlock, thr_list); + + li = thr_list->ll_items; + while (li) { + to_thr = (XTThreadPtr) li; + if (to_thr != self) + (*do_func_ptr)(self, to_thr, thunk); + li = li->li_next; + } + + freer_(); // xt_ll_unlock(thr_list) +} + +xtPublic XTThreadPtr xt_get_self(void) +{ + XTThreadPtr self; + + /* First check if the handler has the data: */ + if ((self = myxt_get_self())) + return self; + /* Then it must be a background process, and the + * thread info is stored in the local key: */ + return (XTThreadPtr) xt_get_key(thr_key); +} + +xtPublic void xt_set_self(XTThreadPtr self) +{ + xt_set_key(thr_key, self, NULL); +} + +xtPublic void xt_clear_exception(XTThreadPtr thread) +{ + thread->t_exception.e_xt_err = 0; + thread->t_exception.e_sys_err = 0; + *thread->t_exception.e_err_msg = 0; + *thread->t_exception.e_func_name = 0; + *thread->t_exception.e_source_file = 0; + thread->t_exception.e_source_line = 0; + *thread->t_exception.e_catch_trace = 
0; +} + +/* + * Create a thread without requiring thread to do it (as in xt_create_daemon()). + * + * This function returns NULL on error. + */ +xtPublic XTThreadPtr xt_create_thread(c_char *name, xtBool main_thread, xtBool user_thread, XTExceptionPtr e) +{ + volatile XTThreadPtr self; + + self = (XTThreadPtr) xt_calloc_ns(sizeof(XTThreadRec)); + if (!self) { + xt_exception_errno(e, XT_CONTEXT, ENOMEM); + return NULL; + } + + if (!xt_set_key(thr_key, self, e)) { + xt_free_ns(self); + return NULL; + } + + xt_strcpy(XT_THR_NAME_SIZE, self->t_name, name); + self->t_main = main_thread; + self->t_daemon = FALSE; + + try_(a) { + thr_init(self, self); + } + catch_(a) { + *e = self->t_exception; + xt_set_key(thr_key, NULL, NULL); + xt_free_ns(self); + self = NULL; + } + cont_(a); + + if (self && user_thread) { + /* Add non-temporary threads to the thread list. */ + try_(b) { + xt_ll_add(self, thr_list, &self->t_links, TRUE); + } + catch_(b) { + *e = self->t_exception; + xt_free_thread(self); + self = NULL; + } + cont_(b); + } + + return self; +} + +/* + * Create a daemon thread. + */ +xtPublic XTThreadPtr xt_create_daemon(XTThreadPtr self, c_char *name) +{ + XTThreadPtr new_thread; + + /* NOTE: thr_key will be set when this thread start running. */ + + new_thread = (XTThreadPtr) xt_calloc(self, sizeof(XTThreadRec)); + xt_strcpy(XT_THR_NAME_SIZE, new_thread->t_name, name); + new_thread->t_main = FALSE; + new_thread->t_daemon = TRUE; + + try_(a) { + thr_init(self, new_thread); + } + catch_(a) { + xt_free(self, new_thread); + throw_(); + } + cont_(a); + return new_thread; +} + +void xt_free_thread(XTThreadPtr self) +{ + thr_exit(self); + if (!self->t_daemon && thr_list) + xt_ll_remove(self, thr_list, &self->t_links, TRUE); + /* Note, if I move this before thr_exit() then self = xt_get_self(); will fail in + * xt_close_file_ns() which is called by xt_unuse_database()! 
+ */ + if (thr_key) { + xt_set_key(thr_key, NULL, NULL); + } + xt_free_ns(self); +} + +xtPublic pthread_t xt_run_thread(XTThreadPtr self, XTThreadPtr child, void *(*start_routine)(XTThreadPtr)) +{ + ThreadDataRec data; + int err; + pthread_t child_thread; + + enter_(); + + // 'data' can be on the stack because we are waiting for the thread to start + // before exiting the function. + data.td_started = FALSE; + data.td_thr = child; + data.td_start_routine = start_routine; +#ifdef XT_WIN + { + pthread_attr_t attr = { 0, 0, 0 }; + + attr.priority = THREAD_PRIORITY_NORMAL; + err = pthread_create(&child_thread, &attr, thr_main, &data); + } +#else + err = pthread_create(&child_thread, NULL, thr_main, &data); +#endif + if (err) { + xt_free_thread(child); + xt_throw_errno(XT_CONTEXT, err); + } + while (!data.td_started) { + /* Check that the child thread is still alive: */ + if (pthread_kill(child_thread, 0)) + break; + xt_busy_wait(); + } + return_(child_thread); +} + +xtPublic void xt_exit_thread(XTThreadPtr self, void *result) +{ + xt_free_thread(self); + pthread_exit(result); +} + +xtPublic void *xt_wait_for_thread(xtThreadID tid, xtBool ignore_error) +{ + int err; + void *value_ptr = NULL; + xtBool ok = FALSE; + XTThreadPtr thread; + pthread_t t1 = 0; + + xt_lock_mutex_ns(&thr_array_lock); + if (tid < xt_thr_maximum_threads) { + if ((thread = xt_thr_array[tid])) { + t1 = thread->t_pthread; + ok = TRUE; + } + } + xt_unlock_mutex_ns(&thr_array_lock); + if (ok) { + err = xt_p_join(t1, &value_ptr); + if (err && !ignore_error) + xt_log_errno(XT_NS_CONTEXT, err); + } + return value_ptr; +} + +/* + * Kill the given thread, and wait for it to terminate. + * This function just returns if the thread is already dead.
+ */ +xtPublic void xt_kill_thread(pthread_t t1) +{ + int err; + void *value_ptr = NULL; + + err = pthread_kill(t1, SIGTERM); + if (err) + return; + err = xt_p_join(t1, &value_ptr); + if (err) + xt_log_errno(XT_NS_CONTEXT, err); +} + +/* + * ----------------------------------------------------------------------- + * Read/write locking + */ + +#ifdef XT_THREAD_LOCK_INFO +xtPublic xtBool xt_init_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock, const char *name) +#else +xtPublic xtBool xt_init_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock) +#endif +{ + int err; + +#ifdef XT_THREAD_LOCK_INFO + err = xt_p_rwlock_init_with_name(rwlock, NULL, name); +#else + err = xt_p_rwlock_init(rwlock, NULL); +#endif + + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return FAILED; + } + return OK; +} + +xtPublic void xt_free_rwlock(xt_rwlock_type *rwlock) +{ + int err; + + for (;;) { + err = xt_p_rwlock_destroy(rwlock); + if (err != XT_EBUSY) + break; + xt_busy_wait(); + } + /* PMC - xt_xn_exit_db() is called even when xt_xn_init_db() is not fully completed! + * This generates a lot of log entries. But I have no desire to only call + * free for those articles that I have init'ed! 
+ if (err) + xt_log_errno(XT_NS_CONTEXT, err); + */ +} + +xtPublic xt_rwlock_type *xt_slock_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock) +{ + int err; + + for (;;) { + err = xt_slock_rwlock_ns(rwlock); + if (err != XT_EAGAIN) + break; + xt_busy_wait(); + } + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return NULL; + } + return rwlock; +} + +xtPublic xt_rwlock_type *xt_xlock_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock) +{ + int err; + + for (;;) { + err = xt_xlock_rwlock_ns(rwlock); + if (err != XT_EAGAIN) + break; + xt_busy_wait(); + } + + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return NULL; + } + return rwlock; +} + +xtPublic void xt_unlock_rwlock(XTThreadPtr XT_UNUSED(self), xt_rwlock_type *rwlock) +{ + int err; + + err = xt_unlock_rwlock_ns(rwlock); + if (err) + xt_log_errno(XT_NS_CONTEXT, err); +} + +/* + * ----------------------------------------------------------------------- + * Mutex locking + */ + +xtPublic xt_mutex_type *xt_new_mutex(XTThreadPtr self) +{ + xt_mutex_type *mx; + + if (!(mx = (xt_mutex_type *) xt_calloc(self, sizeof(xt_mutex_type)))) + return NULL; + pushr_(xt_free, mx); + if (!xt_init_mutex_with_autoname(self, mx)) { + freer_(); + return NULL; + } + popr_(); + return mx; +} + +xtPublic void xt_delete_mutex(XTThreadPtr self, xt_mutex_type *mx) +{ + if (mx) { + xt_free_mutex(mx); + xt_free(self, mx); + } +} + +#ifdef XT_THREAD_LOCK_INFO +xtPublic xtBool xt_init_mutex(XTThreadPtr self, xt_mutex_type *mx, const char *name) +#else +xtPublic xtBool xt_init_mutex(XTThreadPtr self, xt_mutex_type *mx) +#endif +{ + int err; + + err = xt_p_mutex_init_with_name(mx, NULL, name); + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return FALSE; + } + return TRUE; +} + +void xt_free_mutex(xt_mutex_type *mx) +{ + int err; + + for (;;) { + err = xt_p_mutex_destroy(mx); + if (err != XT_EBUSY) + break; + xt_busy_wait(); + } + /* PMC - xt_xn_exit_db() is called even when xt_xn_init_db() is not fully completed! 
+ if (err) + xt_log_errno(XT_NS_CONTEXT, err); + */ +} + +xtPublic xtBool xt_lock_mutex(XTThreadPtr self, xt_mutex_type *mx) +{ + int err; + + for (;;) { + err = xt_lock_mutex_ns(mx); + if (err != XT_EAGAIN) + break; + xt_busy_wait(); + } + + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return FALSE; + } + return TRUE; +} + +xtPublic void xt_unlock_mutex(XTThreadPtr self, xt_mutex_type *mx) +{ + int err; + + err = xt_unlock_mutex_ns(mx); + if (err) + xt_throw_errno(XT_CONTEXT, err); +} + +xtPublic xtBool xt_set_key(pthread_key_t key, const void *value, XTExceptionPtr e) +{ +#ifdef XT_WIN + my_pthread_setspecific_ptr(thr_key, (void *) value); +#else + int err; + + err = pthread_setspecific(key, value); + if (err) { + if (e) + xt_exception_errno(e, XT_NS_CONTEXT, err); + return FALSE; + } +#endif + return TRUE; +} + +xtPublic void *xt_get_key(pthread_key_t key) +{ +#ifdef XT_WIN + return my_pthread_getspecific_ptr(void *, thr_key); +#else + return pthread_getspecific(key); +#endif +} + +xtPublic xt_cond_type *xt_new_cond(XTThreadPtr self) +{ + xt_cond_type *cond; + + if (!(cond = (xt_cond_type *) xt_calloc(self, sizeof(xt_cond_type)))) + return NULL; + pushr_(xt_free, cond); + if (!xt_init_cond(self, cond)) { + freer_(); + return NULL; + } + popr_(); + return cond; +} + +xtPublic void xt_delete_cond(XTThreadPtr self, xt_cond_type *cond) +{ + if (cond) { + xt_free_cond(cond); + xt_free(self, cond); + } +} + +xtPublic xtBool xt_init_cond(XTThreadPtr self, xt_cond_type *cond) +{ + int err; + + err = pthread_cond_init(cond, NULL); + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return FALSE; + } + return TRUE; +} + +xtPublic void xt_free_cond(xt_cond_type *cond) +{ + int err; + + for (;;) { + err = pthread_cond_destroy(cond); + if (err != XT_EBUSY) + break; + xt_busy_wait(); + } + /* PMC - xt_xn_exit_db() is called even when xt_xn_init_db() is not fully completed! 
+ if (err) + xt_log_errno(XT_NS_CONTEXT, err); + */ +} + +xtPublic xtBool xt_throw_delayed_signal(XTThreadPtr self, c_char *func, c_char *file, u_int line) +{ + XTThreadPtr me = self ? self : xt_get_self(); + + if (me->t_delayed_signal) { + int sig = me->t_delayed_signal; + + me->t_delayed_signal = 0; + xt_throw_signal(self, func, file, line, sig); + return FAILED; + } + return OK; +} + +xtPublic xtBool xt_wait_cond(XTThreadPtr self, xt_cond_type *cond, xt_mutex_type *mutex) +{ + int err; + XTThreadPtr me = self ? self : xt_get_self(); + + /* PMC - In my tests, if I throw an exception from within the wait + * the condition and the mutex remain locked. + */ + me->t_disable_interrupts = TRUE; + err = xt_p_cond_wait(cond, mutex); + me->t_disable_interrupts = FALSE; + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return FALSE; + } + if (me->t_delayed_signal) { + xt_throw_delayed_signal(XT_CONTEXT); + return FALSE; + } + return TRUE; +} + +xtPublic xtBool xt_suspend(XTThreadPtr thread) +{ + xtBool ok; + + // You can only suspend yourself. 
+ ASSERT_NS(pthread_equal(thread->t_pthread, pthread_self())); + + xt_lock_mutex_ns(&thread->t_lock); + ok = xt_wait_cond(NULL, &thread->t_cond, &thread->t_lock); + xt_unlock_mutex_ns(&thread->t_lock); + return ok; +} + +xtPublic xtBool xt_unsuspend(XTThreadPtr target) +{ + return xt_broadcast_cond_ns(&target->t_cond); +} + +xtPublic void xt_lock_thread(XTThreadPtr thread) +{ + xt_lock_mutex_ns(&thread->t_lock); +} + +xtPublic void xt_unlock_thread(XTThreadPtr thread) +{ + xt_unlock_mutex_ns(&thread->t_lock); +} + +xtPublic xtBool xt_wait_thread(XTThreadPtr thread) +{ + return xt_wait_cond(NULL, &thread->t_cond, &thread->t_lock); +} + +xtPublic void xt_signal_thread(XTThreadPtr target) +{ + xt_broadcast_cond_ns(&target->t_cond); +} + +xtPublic void xt_terminate_thread(XTThreadPtr self __attribute__((unused)), XTThreadPtr target) +{ + target->t_quit = TRUE; + target->t_delayed_signal = SIGTERM; +} + +xtPublic xtProcID xt_getpid() +{ +#ifdef XT_WIN + return GetCurrentProcessId(); +#else + return getpid(); +#endif +} + +xtPublic xtBool xt_process_exists(xtProcID pid) +{ + xtBool found; + +#ifdef XT_WIN + HANDLE h; + DWORD code; + + found = FALSE; + h = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid); + if (h) { + if (GetExitCodeProcess(h, &code)) { + if (code == STILL_ACTIVE) + found = TRUE; + } + CloseHandle(h); + } + else { + int err; + + err = HRESULT_CODE(GetLastError()); + if (err != ERROR_INVALID_PARAMETER) + found = TRUE; + } +#else + found = TRUE; + if (kill(pid, 0) == -1) { + if (errno == ESRCH) + found = FALSE; + } +#endif + return found; +} + +xtPublic xtBool xt_timed_wait_cond(XTThreadPtr self, xt_cond_type *cond, xt_mutex_type *mutex, u_long milli_sec) +{ + int err; + struct timespec abstime; + XTThreadPtr me = self ? self : xt_get_self(); + +#ifdef XT_WIN + union ft64 now; + + GetSystemTimeAsFileTime(&now.ft); + + /* System time is measured in 100ns units. 
+ * This calculation will be reversed by the Windows implementation + * of pthread_cond_timedwait(), in order to extract the + * milli-second timeout! + */ + abstime.tv.i64 = now.i64 + (milli_sec * 10000); + + abstime.max_timeout_msec = milli_sec; +#else + struct timeval now; + u_llong micro_sec; + + /* Get the current time in microseconds: */ + gettimeofday(&now, NULL); + micro_sec = (u_llong) now.tv_sec * (u_llong) 1000000 + (u_llong) now.tv_usec; + + /* Add the timeout which is in milli seconds */ + micro_sec += (u_llong) milli_sec * (u_llong) 1000; + + /* Setup the end time, which is in nano-seconds. */ + abstime.tv_sec = (long) (micro_sec / 1000000); /* seconds */ + abstime.tv_nsec = (long) ((micro_sec % 1000000) * 1000); /* and nanoseconds */ +#endif + + me->t_disable_interrupts = TRUE; + err = xt_p_cond_timedwait(cond, mutex, &abstime); + me->t_disable_interrupts = FALSE; + if (err && err != ETIMEDOUT) { + xt_throw_errno(XT_CONTEXT, err); + return FALSE; + } + if (me->t_delayed_signal) { + xt_throw_delayed_signal(XT_CONTEXT); + return FALSE; + } + return TRUE; +} + +xtPublic xtBool xt_signal_cond(XTThreadPtr self, xt_cond_type *cond) +{ + int err; + + err = pthread_cond_signal(cond); + if (err) { + xt_throw_errno(XT_CONTEXT, err); + return FAILED; + } + return OK; +} + +xtPublic void xt_broadcast_cond(XTThreadPtr self, xt_cond_type *cond) +{ + int err; + + err = pthread_cond_broadcast(cond); + if (err) + xt_throw_errno(XT_CONTEXT, err); +} + +xtPublic xtBool xt_broadcast_cond_ns(xt_cond_type *cond) +{ + int err; + + err = pthread_cond_broadcast(cond); + if (err) { + xt_register_errno(XT_REG_CONTEXT, err); + return FAILED; + } + return OK; +} + +static int prof_setjmp_count = 0; + +xtPublic int prof_setjmp(void) +{ + prof_setjmp_count++; + return 0; +} + +xtPublic void xt_set_low_priority(XTThreadPtr self) +{ + int err = xt_p_set_low_priority(self->t_pthread); + if (err) { + self = NULL; /* Will cause logging, instead of throwing exception */ + 
xt_throw_errno(XT_CONTEXT, err); + } +} + +xtPublic void xt_set_normal_priority(XTThreadPtr self) +{ + int err = xt_p_set_normal_priority(self->t_pthread); + if (err) { + self = NULL; /* Will cause logging, instead of throwing exception */ + xt_throw_errno(XT_CONTEXT, err); + } +} + +xtPublic void xt_set_high_priority(XTThreadPtr self) +{ + int err = xt_p_set_high_priority(self->t_pthread); + if (err) { + self = NULL; /* Will cause logging, instead of throwing exception */ + xt_throw_errno(XT_CONTEXT, err); + } +} + +xtPublic void xt_set_priority(XTThreadPtr self, int priority) +{ + if (priority < XT_PRIORITY_NORMAL) + xt_set_low_priority(self); + else if (priority > XT_PRIORITY_NORMAL) + xt_set_high_priority(self); + else + xt_set_normal_priority(self); +} + +/* + * ----------------------------------------------------------------------- + * STATISTICS + */ + +xtPublic void xt_gather_statistics(XTStatisticsPtr stats) +{ + XTThreadPtr *thr; + xtWord8 s; + + xt_lock_mutex_ns(&thr_array_lock); + *stats = thr_statistics; + // Ignore index 0, it is not used! 
+ thr = &xt_thr_array[1]; + for (u_int i=1; i<xt_thr_current_max_threads; i++) { + if (*thr) { + stats->st_commits += (*thr)->st_statistics.st_commits; + stats->st_rollbacks += (*thr)->st_statistics.st_rollbacks; + stats->st_stat_read += (*thr)->st_statistics.st_stat_read; + stats->st_stat_write += (*thr)->st_statistics.st_stat_write; + + XT_ADD_STATS(stats->st_rec, (*thr)->st_statistics.st_rec); + if ((s = (*thr)->st_statistics.st_rec.ts_flush_start)) + stats->st_rec.ts_flush_time += xt_trace_clock() - s; + stats->st_rec_cache_hit += (*thr)->st_statistics.st_rec_cache_hit; + stats->st_rec_cache_miss += (*thr)->st_statistics.st_rec_cache_miss; + stats->st_rec_cache_frees += (*thr)->st_statistics.st_rec_cache_frees; + + XT_ADD_STATS(stats->st_ind, (*thr)->st_statistics.st_ind); + if ((s = (*thr)->st_statistics.st_ind.ts_flush_start)) + stats->st_ind.ts_flush_time += xt_trace_clock() - s; + stats->st_ind_cache_hit += (*thr)->st_statistics.st_ind_cache_hit; + stats->st_ind_cache_miss += (*thr)->st_statistics.st_ind_cache_miss; + XT_ADD_STATS(stats->st_ilog, (*thr)->st_statistics.st_ilog); + + XT_ADD_STATS(stats->st_xlog, (*thr)->st_statistics.st_xlog); + if ((s = (*thr)->st_statistics.st_xlog.ts_flush_start)) + stats->st_xlog.ts_flush_time += xt_trace_clock() - s; + stats->st_xlog_cache_hit += (*thr)->st_statistics.st_xlog_cache_hit; + stats->st_xlog_cache_miss += (*thr)->st_statistics.st_xlog_cache_miss; + + XT_ADD_STATS(stats->st_data, (*thr)->st_statistics.st_data); + if ((s = (*thr)->st_statistics.st_data.ts_flush_start)) + stats->st_data.ts_flush_time += xt_trace_clock() - s; + + stats->st_scan_index += (*thr)->st_statistics.st_scan_index; + stats->st_scan_table += (*thr)->st_statistics.st_scan_table; + stats->st_row_select += (*thr)->st_statistics.st_row_select; + stats->st_row_insert += (*thr)->st_statistics.st_row_insert; + stats->st_row_update += (*thr)->st_statistics.st_row_update; + stats->st_row_delete += (*thr)->st_statistics.st_row_delete; + + 
stats->st_wait_for_xact += (*thr)->st_statistics.st_wait_for_xact; + stats->st_retry_index_scan += (*thr)->st_statistics.st_retry_index_scan; + stats->st_reread_record_list += (*thr)->st_statistics.st_reread_record_list; + } + thr++; + } + xt_unlock_mutex_ns(&thr_array_lock); +} + +static void thr_accumulate_statistics(XTThreadPtr self) +{ + thr_statistics.st_commits += self->st_statistics.st_commits; + thr_statistics.st_rollbacks += self->st_statistics.st_rollbacks; + thr_statistics.st_stat_read += self->st_statistics.st_stat_read; + thr_statistics.st_stat_write += self->st_statistics.st_stat_write; + + XT_ADD_STATS(thr_statistics.st_rec, self->st_statistics.st_rec); + thr_statistics.st_rec_cache_hit += self->st_statistics.st_rec_cache_hit; + thr_statistics.st_rec_cache_miss += self->st_statistics.st_rec_cache_miss; + thr_statistics.st_rec_cache_frees += self->st_statistics.st_rec_cache_frees; + + XT_ADD_STATS(thr_statistics.st_ind, self->st_statistics.st_ind); + thr_statistics.st_ind_cache_hit += self->st_statistics.st_ind_cache_hit; + thr_statistics.st_ind_cache_miss += self->st_statistics.st_ind_cache_miss; + XT_ADD_STATS(thr_statistics.st_ilog, self->st_statistics.st_ilog); + + XT_ADD_STATS(thr_statistics.st_xlog, self->st_statistics.st_xlog); + thr_statistics.st_xlog_cache_hit += self->st_statistics.st_xlog_cache_hit; + thr_statistics.st_xlog_cache_miss += self->st_statistics.st_xlog_cache_miss; + + XT_ADD_STATS(thr_statistics.st_data, self->st_statistics.st_data); + + thr_statistics.st_scan_index += self->st_statistics.st_scan_index; + thr_statistics.st_scan_table += self->st_statistics.st_scan_table; + thr_statistics.st_row_select += self->st_statistics.st_row_select; + thr_statistics.st_row_insert += self->st_statistics.st_row_insert; + thr_statistics.st_row_update += self->st_statistics.st_row_update; + thr_statistics.st_row_delete += self->st_statistics.st_row_delete; + + thr_statistics.st_wait_for_xact += self->st_statistics.st_wait_for_xact; + 
thr_statistics.st_retry_index_scan += self->st_statistics.st_retry_index_scan; + thr_statistics.st_reread_record_list += self->st_statistics.st_reread_record_list; +} + +xtPublic u_llong xt_get_statistic(XTStatisticsPtr stats, XTDatabaseHPtr db, u_int rec_id) +{ + u_llong stat_value; + + switch (rec_id) { + case XT_STAT_TIME_CURRENT: + stat_value = (u_llong) time(NULL); + break; + case XT_STAT_TIME_PASSED: + stat_value = (u_llong) xt_trace_clock(); + break; + case XT_STAT_COMMITS: + stat_value = stats->st_commits; + break; + case XT_STAT_ROLLBACKS: + stat_value = stats->st_rollbacks; + break; + case XT_STAT_STAT_READS: + stat_value = stats->st_stat_read; + break; + case XT_STAT_STAT_WRITES: + stat_value = stats->st_stat_write; + break; + + case XT_STAT_REC_BYTES_IN: + stat_value = stats->st_rec.ts_read; + break; + case XT_STAT_REC_BYTES_OUT: + stat_value = stats->st_rec.ts_write; + break; + case XT_STAT_REC_SYNC_COUNT: + stat_value = stats->st_rec.ts_flush; + break; + case XT_STAT_REC_SYNC_TIME: + stat_value = stats->st_rec.ts_flush_time; + break; + case XT_STAT_REC_CACHE_HIT: + stat_value = stats->st_rec_cache_hit; + break; + case XT_STAT_REC_CACHE_MISS: + stat_value = stats->st_rec_cache_miss; + break; + case XT_STAT_REC_CACHE_FREES: + stat_value = stats->st_rec_cache_frees; + break; + case XT_STAT_REC_CACHE_USAGE: + stat_value = (u_llong) xt_tc_get_usage(); + break; + + case XT_STAT_IND_BYTES_IN: + stat_value = stats->st_ind.ts_read; + break; + case XT_STAT_IND_BYTES_OUT: + stat_value = stats->st_ind.ts_write; + break; + case XT_STAT_IND_SYNC_COUNT: + stat_value = stats->st_ind.ts_flush; + break; + case XT_STAT_IND_SYNC_TIME: + stat_value = stats->st_ind.ts_flush_time; + break; + case XT_STAT_IND_CACHE_HIT: + stat_value = stats->st_ind_cache_hit; + break; + case XT_STAT_IND_CACHE_MISS: + stat_value = stats->st_ind_cache_miss; + break; + case XT_STAT_IND_CACHE_USAGE: + stat_value = (u_llong) xt_ind_get_usage(); + break; + case XT_STAT_ILOG_BYTES_IN: + stat_value 
= stats->st_ilog.ts_read; + break; + case XT_STAT_ILOG_BYTES_OUT: + stat_value = stats->st_ilog.ts_write; + break; + case XT_STAT_ILOG_SYNC_COUNT: + stat_value = stats->st_ilog.ts_flush; + break; + case XT_STAT_ILOG_SYNC_TIME: + stat_value = stats->st_ilog.ts_flush_time; + break; + + case XT_STAT_XLOG_BYTES_IN: + stat_value = stats->st_xlog.ts_read; + break; + case XT_STAT_XLOG_BYTES_OUT: + stat_value = stats->st_xlog.ts_write; + break; + case XT_STAT_XLOG_SYNC_COUNT: + stat_value = stats->st_xlog.ts_flush; + break; + case XT_STAT_XLOG_SYNC_TIME: + stat_value = stats->st_xlog.ts_flush_time; + break; + case XT_STAT_XLOG_CACHE_HIT: + stat_value = stats->st_xlog_cache_hit; + break; + case XT_STAT_XLOG_CACHE_MISS: + stat_value = stats->st_xlog_cache_miss; + break; + case XT_STAT_XLOG_CACHE_USAGE: + stat_value = (u_llong) xt_xlog_get_usage(); + break; + + case XT_STAT_DATA_BYTES_IN: + stat_value = stats->st_data.ts_read; + break; + case XT_STAT_DATA_BYTES_OUT: + stat_value = stats->st_data.ts_write; + break; + case XT_STAT_DATA_SYNC_COUNT: + stat_value = stats->st_data.ts_flush; + break; + case XT_STAT_DATA_SYNC_TIME: + stat_value = stats->st_data.ts_flush_time; + break; + + case XT_STAT_BYTES_TO_CHKPNT: + stat_value = db ? xt_bytes_since_last_checkpoint(db, db->db_xlog.xl_write_log_id, db->db_xlog.xl_write_log_offset) : 0; + break; + case XT_STAT_LOG_BYTES_TO_WRITE: + stat_value = db ? db->db_xlog.xl_log_bytes_written - db->db_xlog.xl_log_bytes_read : 0;//db->db_xlog.xlog_bytes_to_write(); + break; + case XT_STAT_BYTES_TO_SWEEP: + /* This stat is potentially very expensive: */ + stat_value = db ? xt_xn_bytes_to_sweep(db, xt_get_self()) : 0; + break; + case XT_STAT_WAIT_FOR_XACT: + stat_value = stats->st_wait_for_xact; + break; + case XT_STAT_XACT_TO_CLEAN: + stat_value = db ? db->db_xn_curr_id + 1 - db->db_xn_to_clean_id : 0; + break; + case XT_STAT_SWEEPER_WAITS: + stat_value = db ? 
db->db_stat_sweep_waits : 0; + break; + + case XT_STAT_SCAN_INDEX: + stat_value = stats->st_scan_index; + break; + case XT_STAT_SCAN_TABLE: + stat_value = stats->st_scan_table; + break; + case XT_STAT_ROW_SELECT: + stat_value = stats->st_row_select; + break; + case XT_STAT_ROW_INSERT: + stat_value = stats->st_row_insert; + break; + case XT_STAT_ROW_UPDATE: + stat_value = stats->st_row_update; + break; + case XT_STAT_ROW_DELETE: + stat_value = stats->st_row_delete; + break; + + case XT_STAT_RETRY_INDEX_SCAN: + stat_value = stats->st_retry_index_scan; + break; + case XT_STAT_REREAD_REC_LIST: + stat_value = stats->st_reread_record_list; + break; + default: + stat_value = 0; + break; + } + return stat_value; +} diff --git a/storage/pbxt/src/thread_xt.h b/storage/pbxt/src/thread_xt.h new file mode 100644 index 00000000000..4344c5335b9 --- /dev/null +++ b/storage/pbxt/src/thread_xt.h @@ -0,0 +1,675 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-01-03 Paul McCullagh + * + * H&G2JCtL + */ + +#ifndef __xt_thread_h__ +#define __xt_thread_h__ + +#include <stdio.h> +#ifndef XT_WIN +#include <sys/param.h> +#endif +#include <setjmp.h> + +#include "xt_defs.h" +#include "xt_errno.h" +#include "linklist_xt.h" +#include "memory_xt.h" +#include "xactlog_xt.h" +#include "datalog_xt.h" +#include "lock_xt.h" +#include "locklist_xt.h" + +/* + * ----------------------------------------------------------------------- + * Macros and defines + */ + +#ifdef XT_WIN +#define __FUNC__ __FUNCTION__ +#elif defined(XT_SOLARIS) +#define __FUNC__ "__func__" +#else +#define __FUNC__ __PRETTY_FUNCTION__ +#endif + +#define XT_ERR_MSG_SIZE (PATH_MAX + 200) + +#ifdef DEBUG +#define ASSERT(expr) ((expr) ? TRUE : xt_assert(self, #expr, __FUNC__, __FILE__, __LINE__)) +#else +#define ASSERT(expr) ((void) 0) +#endif + +#ifdef DEBUG +#define ASSUME(expr) ((expr) ? TRUE : xt_assume(self, #expr, __FUNC__, __FILE__, __LINE__)) +#else +#define ASSUME(expr) ((void) 0) +#endif + +#ifdef DEBUG +#define ASSERT_NS(expr) ((expr) ? 
TRUE : xt_assert(NULL, #expr, __FUNC__, __FILE__, __LINE__)) +#else +#define ASSERT_NS(expr) ((void) 0) +#endif + +#define XT_THROW_ASSERTION(str) xt_throw_assertion(self, __FUNC__, __FILE__, __LINE__, str) + +/* Log levels */ +#define XT_LOG_DEFAULT -1 +#define XT_LOG_PROTOCOL 0 +#define XT_LOG_FATAL 1 +#define XT_LOG_ERROR 2 +#define XT_LOG_WARNING 3 +#define XT_LOG_INFO 4 +#define XT_LOG_TRACE 5 + +#define XT_PROTOCOL self, "", NULL, 0, XT_LOG_PROTOCOL +#define XT_WARNING self, "", NULL, 0, XT_LOG_WARNING +#define XT_INFO self, "", NULL, 0, XT_LOG_INFO +#define XT_ERROR self, "", NULL, 0, XT_LOG_ERROR +#define XT_TRACE self, "", NULL, 0, XT_LOG_TRACE + +#define XT_NT_PROTOCOL NULL, "", NULL, 0, XT_LOG_PROTOCOL +#define XT_NT_WARNING NULL, "", NULL, 0, XT_LOG_WARNING +#define XT_NT_INFO NULL, "", NULL, 0, XT_LOG_INFO +#define XT_NT_ERROR NULL, "", NULL, 0, XT_LOG_ERROR +#define XT_NT_TRACE NULL, "", NULL, 0, XT_LOG_TRACE + +#define XT_ERROR_CONTEXT(func) self, __FUNC__, __FILE__, __LINE__, XT_LOG_ERROR + +/* Thread types */ +#define XT_THREAD_MAIN 0 +#define XT_THREAD_WORKER 1 + +/* Thread Priorities: */ +#define XT_PRIORITY_LOW 0 +#define XT_PRIORITY_NORMAL 1 +#define XT_PRIORITY_HIGH 2 + +#define XT_CONTEXT self, __FUNC__, __FILE__, __LINE__ +#define XT_NS_CONTEXT NULL, __FUNC__, __FILE__, __LINE__ +#define XT_REG_CONTEXT __FUNC__, __FILE__, __LINE__ + +#define XT_MAX_JMP 20 +#define XT_MAX_CALL_STACK 100 /* The number of functions recorded by enter_() and exit() */ +#define XT_RES_STACK_SIZE 4000 /* The size of the stack resource stack in bytes. */ +#define XT_MAX_RESOURCE_USAGE 5 /* The maximum number of temp slots used per routine. 
*/ +#define XT_CATCH_TRACE_SIZE 1024 +#define XT_MAX_FUNC_NAME_SIZE 120 +#define XT_SOURCE_FILE_NAME_SIZE 40 +#define XT_THR_NAME_SIZE 80 + +#ifdef XT_THREAD_LOCK_INFO +#define xt_init_rwlock_with_autoname(a,b) xt_init_rwlock(a,b,LOCKLIST_ARG_SUFFIX(b)) +#else +#define xt_init_rwlock_with_autoname(a,b) xt_init_rwlock(a,b) +#endif + +typedef struct XTException { + int e_xt_err; /* The XT error number (ALWAYS non-zero on error, else zero) */ + int e_sys_err; /* The system error number (0 if none) */ + char e_err_msg[XT_ERR_MSG_SIZE]; /* The error message text (0 terminated string) */ + char e_func_name[XT_MAX_FUNC_NAME_SIZE]; /* The name of the function in which the exception occurred */ + char e_source_file[XT_SOURCE_FILE_NAME_SIZE]; /* The source file in which the exception was thrown */ + u_int e_source_line; /* The source code line number on which the exception was thrown */ + char e_catch_trace[XT_CATCH_TRACE_SIZE]; /* A string of the catch trace. */ +} XTExceptionRec, *XTExceptionPtr; + +struct XTThread; +struct XTSortedList; +struct XTXactLog; +struct XTXactData; +struct XTDatabase; + +typedef void (*XTThreadFreeFunc)(struct XTThread *self, void *data); + +typedef struct XTResourceArgs { + void *ra_p1; + xtWord4 ra_p2; +} XTResourceArgsRec, *XTResourceArgsPtr; + +/* This structure represents a temporary resource on the resource stack. + * Resources are automatically freed if an exception occurs. + */ +typedef struct XTResource { + xtWord4 r_prev_size; /* The size of the previous resource on the stack (must be first!) */ + void *r_data; /* A pointer to the resource data (this may be on the resource stack) */ + XTThreadFreeFunc r_free_func; /* The function used to free the resource. 
*/ +} XTResourceRec, *XTResourcePtr; + +typedef struct XTJumpBuf { + XTResourcePtr jb_res_top; + int jb_call_top; + jmp_buf jb_buffer; +} XTJumpBufRec, *XTJumpBufPtr; + +typedef struct XTCallStack { + c_char *cs_func; + c_char *cs_file; + u_int cs_line; +} XTCallStackRec, *XTCallStackPtr; + +typedef struct XTIOStats { + u_int ts_read; /* The number of bytes read. */ + u_int ts_write; /* The number of bytes written. */ + xtWord8 ts_flush_time; /* The accumulated flush time. */ + xtWord8 ts_flush_start; /* Start time, non-zero if a timer is running. */ + u_int ts_flush; /* The number of flush operations. */ +} XTIOStatsRec, *XTIOStatsPtr; + +#define XT_ADD_STATS(x, y) { \ + (x).ts_read += (y).ts_read; \ + (x).ts_write += (y).ts_write; \ + (x).ts_flush_time += (y).ts_flush_time; \ + (x).ts_flush += (y).ts_flush; \ +} + +typedef struct XTStatistics { + u_int st_commits; + u_int st_rollbacks; + u_int st_stat_read; + u_int st_stat_write; + + XTIOStatsRec st_rec; + u_int st_rec_cache_hit; + u_int st_rec_cache_miss; + u_int st_rec_cache_frees; + + XTIOStatsRec st_ind; + u_int st_ind_cache_hit; + u_int st_ind_cache_miss; + XTIOStatsRec st_ilog; + + XTIOStatsRec st_xlog; + u_int st_xlog_cache_hit; + u_int st_xlog_cache_miss; + + XTIOStatsRec st_data; + + XTIOStatsRec st_x; + + u_int st_scan_index; + u_int st_scan_table; + u_int st_row_select; + u_int st_row_insert; + u_int st_row_update; + u_int st_row_delete; + + u_int st_wait_for_xact; + u_int st_retry_index_scan; + u_int st_reread_record_list; + XTIOStatsRec st_ind_flush_time; +} XTStatisticsRec, *XTStatisticsPtr; + +/* + * PBXT supports COMMITTED READ and REPEATABLE READ. + * + * As Jim says, multi-versioning cannot implement SERIALIZABLE. Basically + * you need locking to do this. Although phantom reads do not occur with + * MVCC, it is still not serializable. 
+ * + * This can be seen from the following example: + * + * T1: INSERT t1 VALUE (1, 1); + * T2: INSERT t1 VALUE (2, 2); + * T1: UPDATE t1 SET b = 3 WHERE a IN (1, 2); + * T2: UPDATE t1 SET b = 4 WHERE a IN (1, 2); + * Serialized result (T1, T2) or (T2, T1): + * a b or a b + * 1 4 1 3 + * 2 4 2 3 + * Non-serialized (MVCC) result: + * a b + * 1 3 + * 2 4 + */ +#define XT_XACT_UNCOMMITTED_READ 0 +#define XT_XACT_COMMITTED_READ 1 +#define XT_XACT_REPEATABLE_READ 2 /* Guarantees rows already read will not change. */ +#define XT_XACT_SERIALIZABLE 3 + +typedef struct XTThread { + XTLinkedItemRec t_links; /* Required to be a member of a double-linked list. */ + + char t_name[XT_THR_NAME_SIZE]; /* The name of the thread. */ + xtBool t_main; /* TRUE if this is the main (initial) thread */ + xtBool t_quit; /* TRUE if this thread should stop running. */ + xtBool t_daemon; /* TRUE if this thread is a daemon. */ + xtThreadID t_id; /* The thread ID (0=main), index into thread array. */ + pthread_t t_pthread; /* The pthread associated with xt thread */ + xtBool t_disable_interrupts; /* TRUE if interrupts are disabled. */ + int t_delayed_signal; /* Throw this signal as soon as you can! */ + + void *t_data; /* Data passed to the thread. */ + XTThreadFreeFunc t_free_data; /* Routine used to free the thread data */ + + int t_call_top; /* A pointer to the top of the call stack. */ + XTCallStackRec t_call_stack[XT_MAX_CALL_STACK];/* Records the function under execution (to be output on error). */ + + XTResourcePtr t_res_top; /* The top of the resource stack (reference next free space). */ + union { + char t_res_stack[XT_RES_STACK_SIZE]; /* Temporary data to be freed if an exception occurs. */ + xtWord4 t_align_res_stack; + } x; + + int t_jmp_depth; /* The current jump depth */ + XTJumpBufRec t_jmp_env[XT_MAX_JMP]; /* The process environment to be restored on exception */ + XTExceptionRec t_exception; /* The exception details. 
*/ + + xt_cond_type t_cond; /* The pthread condition used for suspending the thread. */ + xt_mutex_type t_lock; /* Thread lock, used for operations on a thread that may be done by other threads. + * for example xt_unuse_database(). + */ + + /* Application specific data: */ + struct XTDatabase *st_database; /* The database in use by the thread. */ + u_int st_lock_count; /* We count the number of locks MySQL has set in order to know when they are all released. */ + u_int st_stat_count; /* start statement count. */ + struct XTXactData *st_xact_data; /* The transaction data, not NULL if the transaction performs an update. */ + xtBool st_xact_writer; /* TRUE if the transaction has written something to the log. */ + time_t st_xact_write_time; /* Approximate first write time (uses xt_db_approximate_time). */ + xtBool st_xact_long_running; /* TRUE if this is a long running writer transaction. */ + xtWord4 st_visible_time; /* Transactions committed before this time are visible. */ + XTDataLogBufferRec st_dlog_buf; + + int st_xact_mode; /* The transaction mode. */ + xtBool st_ignore_fkeys; /* TRUE if we must ignore foreign keys. */ + xtBool st_auto_commit; /* TRUE if this is an auto-commit transaction. */ + xtBool st_table_trans; /* TRUE if the transaction is a result of LOCK TABLES. */ + xtBool st_abort_trans; /* TRUE if the transaction should be aborted. */ + xtBool st_stat_ended; /* TRUE if the statement was ended. */ + xtBool st_stat_trans; /* TRUE if a statement transaction is running (started on UPDATE). */ + xtBool st_stat_modify; /* TRUE if the statement is an INSERT/UPDATE/DELETE */ +#ifdef XT_IMPLEMENT_NO_ACTION + XTBasicListRec st_restrict_list; /* These records have been deleted and should have no reference. */ +#endif + /* Local thread list. */ + u_int st_thread_list_count; + u_int st_thread_list_size; + xtThreadID *st_thread_list; + + /* Used to prevent a record from being updated twice in one statement. 
*/ + xtBool st_is_update; /* TRUE if this is an UPDATE statement. */ + u_int st_update_id; /* The update statement ID. */ + + XTRowLockListRec st_lock_list; /* The thread row lock list (drop locks on transaction end). */ + XTStatisticsRec st_statistics; /* Accumulated statistics for this thread. */ +#ifdef XT_THREAD_LOCK_INFO + /* list of locks (spins, mutexes, etc) that this thread currently holds (debugging) */ + XTThreadLockInfoPtr st_thread_lock_list[XT_THREAD_LOCK_INFO_MAX_COUNT]; + int st_thread_lock_count; +#endif +} XTThreadRec, *XTThreadPtr; + +/* + * ----------------------------------------------------------------------- + * Call stack + */ + +#define XT_INIT_CHECK_STACK char xt_chk_buffer[512]; memset(xt_chk_buffer, 0xFE, 512); +#define XT_RE_CHECK_STACK memset(xt_chk_buffer, 0xFE, 512); + +/* + * This macro must be placed at the start of every function. + * It records the current context so that we can + * dump a type of stack trace later if necessary. + * + * It also sets up the current thread pointer 'self'. + */ +#ifdef DEBUG +#define XT_STACK_TRACE +#endif + +/* + * These macros generate a stack trace which can be used + * to locate an error on exception. + */ +#ifdef XT_STACK_TRACE + +/* + * Place this call at the top of a function, + * after the declaration of local variables, and + * before the first code is executed. + */ +#define enter_() int xt_frame = self->t_call_top++; \ + do { \ + if (xt_frame < XT_MAX_CALL_STACK) { \ + self->t_call_stack[xt_frame].cs_func = __FUNC__; \ + self->t_call_stack[xt_frame].cs_file = __FILE__; \ + self->t_call_stack[xt_frame].cs_line = __LINE__; \ + } \ + } while (0) + +#define outer_() self->t_call_top = xt_frame; + +/* + * On exit from a function, either exit_() or + * return_() must be called. 
+ */ +#define exit_() do { \ + outer_(); \ + return; \ + } while (0) + +#define return_(x) do { \ + outer_(); \ + return(x); \ + } while (0) + +#define returnc_(x, typ) do { \ + typ rv; \ + rv = (x); \ + outer_(); \ + return(rv); \ + } while (0) + +/* + * Sets the line number before a call to get a better + * stack trace; + */ +#define call_(x) do { self->t_call_stack[xt_frame].cs_line = __LINE__; x; } while (0) + +#else +#define enter_() +#define outer_() +#define exit_() return; +#define return_(x) return (x) +#define returnc_(x, typ) return (x) +#define call_(x) x +#endif + +/* + * ----------------------------------------------------------------------- + * Throwing and catching + */ + +int prof_setjmp(void); + +#define TX_CHK_JMP() if ((self)->t_jmp_depth < 0 || (self)->t_jmp_depth >= XT_MAX_JMP) xt_throw_xterr(self, __FUNC__, __FILE__, __LINE__, XT_ERR_JUMP_OVERFLOW) +#ifdef PROFILE +#define profile_setjmp prof_setjmp() +#else +#define profile_setjmp +#endif + +#define try_(n) TX_CHK_JMP(); \ + (self)->t_jmp_env[(self)->t_jmp_depth].jb_res_top = (self)->t_res_top; \ + (self)->t_jmp_env[(self)->t_jmp_depth].jb_call_top = (self)->t_call_top; \ + (self)->t_jmp_depth++; profile_setjmp; if (setjmp((self)->t_jmp_env[(self)->t_jmp_depth-1].jb_buffer)) goto catch_##n; +#define catch_(n) (self)->t_jmp_depth--; goto cont_##n; catch_##n: (self)->t_jmp_depth--; xt_caught(self); +#define cont_(n) cont_##n: +#define throw_() xt_throw(self) + +/* + * ----------------------------------------------------------------------- + * Resource stack + */ + +//#define DEBUG_RESOURCE_STACK + +#ifdef DEBUG_RESOURCE_STACK +#define CHECK_RS if ((char *) (self)->t_res_top < (self)->x.t_res_stack) xt_bug(self); +#define CHECK_NS_RS { XTThreadPtr self = xt_get_self(); CHECK_RS; } +#else +#define CHECK_RS remove this! +#define CHECK_NS_RS remove this! +#endif + +/* + * Allocate a resource on the resource stack. The resource will be freed + * automatically if an exception occurs. 
Before exiting the current + * procedure you must free the resource using popr_() or freer_(). + * v = value to be set to the resource, + * f = function which frees the resource, + * s = the size of the resource, + */ + +/* GOTCHA: My experience is that constructs such as *((xtWordPS *) &(v)) = (xtWordPS) (x) + * cause optimised versions to crash?! + */ +#define allocr_(v, f, s, t) do { \ + if (((char *) (self)->t_res_top) > (self)->x.t_res_stack + XT_RES_STACK_SIZE - sizeof(XTResourceRec) + (s) + 4) \ + xt_throw_xterr(self, __FUNC__, __FILE__, __LINE__, XT_ERR_RES_STACK_OVERFLOW); \ + v = (t) (((char *) (self)->t_res_top) + sizeof(XTResourceRec)); \ + (self)->t_res_top->r_data = (v); \ + (self)->t_res_top->r_free_func = (XTThreadFreeFunc) (f); \ + (self)->t_res_top = (XTResourcePtr) (((char *) (self)->t_res_top) + sizeof(XTResourceRec) + (s)); \ + (self)->t_res_top->r_prev_size = sizeof(XTResourceRec) + (s); \ + } while (0) + +#define alloczr_(v, f, s, t) do { allocr_(v, f, s, t); \ + memset(v, 0, s); } while (0) + +/* Push and set a resource: + * v = value to be set to the resource, + * f = function which frees the resource, + * r = the resource, + * NOTE: the expression (r) must come first because it may contain + * calls which use the resource stack!! + */ +#define pushsr_(v, f, r) do { \ + if (((char *) (self)->t_res_top) > (self)->x.t_res_stack + XT_RES_STACK_SIZE - sizeof(XTResourceRec) + 4) \ + xt_throw_xterr(self, __FUNC__, __FILE__, __LINE__, XT_ERR_RES_STACK_OVERFLOW); \ + v = (r); \ + (self)->t_res_top->r_data = (v); \ + (self)->t_res_top->r_free_func = (XTThreadFreeFunc) (f); \ + (self)->t_res_top = (XTResourcePtr) (((char *) (self)->t_res_top) + sizeof(XTResourceRec)); \ + (self)->t_res_top->r_prev_size = sizeof(XTResourceRec); \ + } while (0) + +/* Push a resource. In the event of an exception it will be freed + * by the free routine. 
+ * f = function which frees the resource, + * r = a pointer to the resource, + */ +#define pushr_(f, r) do { \ + if (((char *) (self)->t_res_top) > (self)->x.t_res_stack + XT_RES_STACK_SIZE - sizeof(XTResourceRec) + 4) \ + xt_throw_xterr(self, __FUNC__, __FILE__, __LINE__, XT_ERR_RES_STACK_OVERFLOW); \ + (self)->t_res_top->r_data = (r); \ + (self)->t_res_top->r_free_func = (XTThreadFreeFunc) (f); \ + (self)->t_res_top = (XTResourcePtr) (((char *) (self)->t_res_top) + sizeof(XTResourceRec)); \ + (self)->t_res_top->r_prev_size = sizeof(XTResourceRec); \ + } while (0) + +/* Pop a resource without freeing it: */ +#ifdef DEBUG_RESOURCE_STACK +#define popr_() do { \ + (self)->t_res_top = (XTResourcePtr) (((char *) (self)->t_res_top) - (self)->t_res_top->r_prev_size); \ + if ((char *) (self)->t_res_top < (self)->x.t_res_stack) \ + xt_bug(self); \ + } while (0) +#else +#define popr_() do { (self)->t_res_top = (XTResourcePtr) (((char *) (self)->t_res_top) - (self)->t_res_top->r_prev_size); } while (0) +#endif + +#define setr_(r) do { ((XTResourcePtr) (((char *) (self)->t_res_top) - (self)->t_res_top->r_prev_size))->r_data = (r); } while (0) + +/* Pop and free a resource: */ +#ifdef DEBUG_RESOURCE_STACK +#define freer_() do { \ + register XTResourcePtr rp; \ + rp = (XTResourcePtr) (((char *) (self)->t_res_top) - (self)->t_res_top->r_prev_size); \ + if ((char *) rp < (self)->x.t_res_stack) \ + xt_bug(self); \ + (rp->r_free_func)((self), rp->r_data); \ + (self)->t_res_top = rp; \ + } while (0) +#else +#define freer_() do { \ + register XTResourcePtr rp; \ + rp = (XTResourcePtr) (((char *) (self)->t_res_top) - (self)->t_res_top->r_prev_size); \ + (rp->r_free_func)((self), rp->r_data); \ + (self)->t_res_top = rp; \ + } while (0) +#endif + +/* + * ----------------------------------------------------------------------- + * Thread globals + */ + +extern u_int xt_thr_maximum_threads; +extern u_int xt_thr_current_thread_count; +extern u_int xt_thr_current_max_threads; +extern struct 
XTThread **xt_thr_array; + +/* + * ----------------------------------------------------------------------- + * Function prototypes + */ + +void xt_get_now(char *buffer, size_t len); +xtBool xt_init_logging(void); +void xt_exit_logging(void); +void xt_log_flush(XTThreadPtr self); +void xt_logf(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, c_char *fmt, ...); +void xt_log(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, c_char *string); +int xt_log_errorf(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, int xt_err, int sys_err, c_char *fmt, ...); +int xt_log_error(XTThreadPtr self, c_char *func, c_char *file, u_int line, int level, int xt_err, int sys_err, c_char *string); +void xt_log_exception(XTThreadPtr self, XTExceptionPtr e, int level); +void xt_clear_exception(XTThreadPtr self); +void xt_log_and_clear_exception(XTThreadPtr self); +void xt_log_and_clear_exception_ns(void); +void xt_log_and_clear_warning(XTThreadPtr self); +void xt_log_and_clear_warning_ns(void); + +void xt_bug(XTThreadPtr self); +void xt_caught(XTThreadPtr self); +void xt_throw(XTThreadPtr self); +void xt_throwf(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *format, ...); +void xt_throw_error(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *message); +void xt_throw_i2xterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, c_char *item, c_char *item2); +void xt_throw_ixterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, c_char *item); +void xt_throw_tabcolerr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item, c_char *item2); +void xt_throw_taberr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item); +void xt_throw_ulxterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, u_long value); +void 
xt_throw_sulxterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, c_char *item, u_long value); +void xt_throw_xterr(XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err); +void xt_throw_errno(XTThreadPtr self, c_char *func, c_char *file, u_int line, int err_no); +void xt_throw_ferrno(XTThreadPtr self, c_char *func, c_char *file, u_int line, int err_no, c_char *path); +void xt_throw_assertion(XTThreadPtr self, c_char *func, c_char *file, u_int line, c_char *str); +void xt_throw_signal(XTThreadPtr self, c_char *func, c_char *file, u_int line, int sig); +xtBool xt_throw_delayed_signal(XTThreadPtr self, c_char *func, c_char *file, u_int line); + +void xt_registerf(c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *fmt, ...); +void xt_register_i2xterr(c_char *func, c_char *file, u_int line, int xt_err, c_char *item, c_char *item2); +void xt_register_ixterr(c_char *func, c_char *file, u_int line, int xt_err, c_char *item); +void xt_register_tabcolerr(c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item, c_char *item2); +void xt_register_taberr(c_char *func, c_char *file, u_int line, int xt_err, XTPathStrPtr tab_item); +void xt_register_ulxterr(c_char *func, c_char *file, u_int line, int xt_err, u_long value); +xtBool xt_register_ferrno(c_char *func, c_char *file, u_int line, int err, c_char *path); +void xt_register_error(c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *msg); +xtBool xt_register_errno(c_char *func, c_char *file, u_int line, int err); +void xt_register_xterr(c_char *func, c_char *file, u_int line, int xt_err); + +void xt_exceptionf(XTExceptionPtr e, XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *fmt, ...); +void xt_exception_error(XTExceptionPtr e, XTThreadPtr self, c_char *func, c_char *file, u_int line, int xt_err, int sys_err, c_char *msg); +xtBool xt_exception_errno(XTExceptionPtr e, XTThreadPtr self, 
c_char *func, c_char *file, u_int line, int err); + +void xt_log_errno(XTThreadPtr self, c_char *func, c_char *file, u_int line, int err); + +xtBool xt_assert(XTThreadPtr self, c_char *expr, c_char *func, c_char *file, u_int line); +xtBool xt_assume(XTThreadPtr self, c_char *expr, c_char *func, c_char *file, u_int line); + +XTThreadPtr xt_init_threading(u_int max_threads); +void xt_exit_threading(XTThreadPtr self); + +XTThreadPtr xt_create_thread(c_char *name, xtBool main_thread, xtBool temp_thread, XTExceptionPtr e); +XTThreadPtr xt_create_daemon(XTThreadPtr parent, c_char *name); +void xt_free_thread(XTThreadPtr self); +void xt_set_thread_data(XTThreadPtr self, void *data, XTThreadFreeFunc free_func); +pthread_t xt_run_thread(XTThreadPtr parent, XTThreadPtr child, void *(*start_routine)(XTThreadPtr)); +void xt_exit_thread(XTThreadPtr self, void *result); +void *xt_wait_for_thread(xtThreadID tid, xtBool ignore_error); +void xt_signal_all_threads(XTThreadPtr self, int sig); +void xt_do_to_all_threads(XTThreadPtr self, void (*do_func_ptr)(XTThreadPtr self, XTThreadPtr to_thr, void *thunk), void *thunk); +void xt_kill_thread(pthread_t t1); +XTThreadPtr xt_get_self(void); +void xt_set_self(XTThreadPtr self); +void xt_wait_for_all_threads(XTThreadPtr self); +void xt_busy_wait(void); +void xt_critical_wait(void); +void xt_yield(void); +void xt_sleep_milli_second(u_int t); +xtBool xt_suspend(XTThreadPtr self); +xtBool xt_unsuspend(XTThreadPtr self, XTThreadPtr target); +void xt_lock_thread(XTThreadPtr thread); +void xt_unlock_thread(XTThreadPtr thread); +xtBool xt_wait_thread(XTThreadPtr thread); +void xt_signal_thread(XTThreadPtr target); +void xt_terminate_thread(XTThreadPtr self, XTThreadPtr target); +xtProcID xt_getpid(); +xtBool xt_process_exists(xtProcID pid); + +#ifdef XT_THREAD_LOCK_INFO +#define xt_init_rwlock_with_autoname(a,b) xt_init_rwlock(a,b,LOCKLIST_ARG_SUFFIX(b)) +xtBool xt_init_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock, const char *name); +#else 
+#define xt_init_rwlock_with_autoname(a,b) xt_init_rwlock(a,b) +xtBool xt_init_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock); +#endif + +void xt_free_rwlock(xt_rwlock_type *rwlock); +xt_rwlock_type *xt_slock_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock); +xt_rwlock_type *xt_xlock_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock); +void xt_unlock_rwlock(XTThreadPtr self, xt_rwlock_type *rwlock); + +xt_mutex_type *xt_new_mutex(XTThreadPtr self); +void xt_delete_mutex(XTThreadPtr self, xt_mutex_type *mx); +#ifdef XT_THREAD_LOCK_INFO +#define xt_init_mutex_with_autoname(a,b) xt_init_mutex(a,b,LOCKLIST_ARG_SUFFIX(b)) +xtBool xt_init_mutex(XTThreadPtr self, xt_mutex_type *mx, const char *name); +#else +#define xt_init_mutex_with_autoname(a,b) xt_init_mutex(a,b) +xtBool xt_init_mutex(XTThreadPtr self, xt_mutex_type *mx); +#endif +void xt_free_mutex(xt_mutex_type *mx); +xtBool xt_lock_mutex(XTThreadPtr self, xt_mutex_type *mx); +void xt_unlock_mutex(XTThreadPtr self, xt_mutex_type *mx); + +pthread_cond_t *xt_new_cond(XTThreadPtr self); +void xt_delete_cond(XTThreadPtr self, pthread_cond_t *cond); + +xtBool xt_init_cond(XTThreadPtr self, pthread_cond_t *cond); +void xt_free_cond(pthread_cond_t *cond); +xtBool xt_wait_cond(XTThreadPtr self, pthread_cond_t *cond, xt_mutex_type *mutex); +xtBool xt_timed_wait_cond(XTThreadPtr self, pthread_cond_t *cond, xt_mutex_type *mutex, u_long milli_sec); +xtBool xt_signal_cond(XTThreadPtr self, pthread_cond_t *cond); +void xt_broadcast_cond(XTThreadPtr self, pthread_cond_t *cond); +xtBool xt_broadcast_cond_ns(xt_cond_type *cond); + +xtBool xt_set_key(pthread_key_t key, const void *value, XTExceptionPtr e); +void *xt_get_key(pthread_key_t key); + +void xt_set_low_priority(XTThreadPtr self); +void xt_set_normal_priority(XTThreadPtr self); +void xt_set_high_priority(XTThreadPtr self); +void xt_set_priority(XTThreadPtr self, int priority); + +void xt_gather_statistics(XTStatisticsPtr stats); +u_llong xt_get_statistic(XTStatisticsPtr 
stats, struct XTDatabase *db, u_int rec_id); + +#define xt_timed_wait_cond_ns(a, b, c) xt_timed_wait_cond(NULL, a, b, c) + +#endif + diff --git a/storage/pbxt/src/trace_xt.cc b/storage/pbxt/src/trace_xt.cc new file mode 100644 index 00000000000..e5881cb6d12 --- /dev/null +++ b/storage/pbxt/src/trace_xt.cc @@ -0,0 +1,345 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-07 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <stdio.h> +#include <stdarg.h> +#include <errno.h> +#include <stdlib.h> +#include <time.h> + +#include "trace_xt.h" +#include "pthread_xt.h" +#include "thread_xt.h" + +#ifdef DEBUG +//#define PRINT_TRACE +//#define RESET_AFTER_DUMP +#endif + +static xtBool trace_initialized = FALSE; +static xt_mutex_type trace_mutex; +static size_t trace_log_size; +static size_t trace_log_offset; +static size_t trace_log_end; +static char *trace_log_buffer; +static u_long trace_stat_count; +static FILE *trace_dump_file; +static xtBool trace_flush_dump = FALSE; + +#define DEFAULT_TRACE_LOG_SIZE (40*1024*1204) +#define MAX_PRINT_LEN 2000 + +xtPublic xtBool xt_init_trace(void) +{ + int err; + + err = xt_p_mutex_init_with_autoname(&trace_mutex, NULL); + if (err) { + 
xt_log_errno(XT_NS_CONTEXT, err); + trace_initialized = FALSE; + return FALSE; + } + trace_initialized = TRUE; + trace_log_buffer = (char *) malloc(DEFAULT_TRACE_LOG_SIZE+1); + if (!trace_log_buffer) { + xt_log_errno(XT_NS_CONTEXT, ENOMEM); + xt_exit_trace(); + return FALSE; + } + trace_log_size = DEFAULT_TRACE_LOG_SIZE; + trace_log_offset = 0; + trace_log_end = 0; + trace_stat_count = 0; + return TRUE; +} + +xtPublic void xt_exit_trace(void) +{ + if (trace_initialized) { +#ifdef DEBUG + xt_dump_trace(); +#endif + xt_free_mutex(&trace_mutex); + trace_initialized = FALSE; + if (trace_log_buffer) + free(trace_log_buffer); + trace_log_buffer = NULL; + trace_log_size = 0; + trace_log_offset = 0; + trace_log_end = 0; + trace_stat_count = 0; + } + if (trace_dump_file) { + fclose(trace_dump_file); + trace_dump_file = NULL; + } +} + +xtPublic void xt_print_trace(void) +{ + if (trace_log_offset) { + xt_lock_mutex_ns(&trace_mutex); + if (trace_log_end > trace_log_offset+1) { + trace_log_buffer[trace_log_end] = 0; + printf("%s", trace_log_buffer + trace_log_offset + 1); + } + trace_log_buffer[trace_log_offset] = 0; + printf("%s", trace_log_buffer); + trace_log_offset = 0; + trace_log_end = 0; + xt_unlock_mutex_ns(&trace_mutex); + } +} + +xtPublic void xt_dump_trace(void) +{ + FILE *fp; + + if (trace_log_offset) { + fp = fopen("pbxt.log", "w"); + + xt_lock_mutex_ns(&trace_mutex); + if (fp) { + if (trace_log_end > trace_log_offset+1) { + trace_log_buffer[trace_log_end] = 0; + fprintf(fp, "%s", trace_log_buffer + trace_log_offset + 1); + } + trace_log_buffer[trace_log_offset] = 0; + fprintf(fp, "%s", trace_log_buffer); + fclose(fp); + } + +#ifdef RESET_AFTER_DUMP + trace_log_offset = 0; + trace_log_end = 0; + trace_stat_count = 0; +#endif + xt_unlock_mutex_ns(&trace_mutex); + } + + if (trace_dump_file) { + xt_lock_mutex_ns(&trace_mutex); + if (trace_dump_file) { + fflush(trace_dump_file); + fclose(trace_dump_file); + trace_dump_file = NULL; + } + 
xt_unlock_mutex_ns(&trace_mutex); + } +} + +xtPublic void xt_trace(const char *fmt, ...) +{ + va_list ap; + size_t len; + + va_start(ap, fmt); + xt_lock_mutex_ns(&trace_mutex); + + if (trace_log_offset + MAX_PRINT_LEN > trace_log_size) { + /* Start at the beginning of the buffer again: */ + trace_log_end = trace_log_offset; + trace_log_offset = 0; + } + + len = (size_t) vsnprintf(trace_log_buffer + trace_log_offset, trace_log_size - trace_log_offset, fmt, ap); + trace_log_offset += len; + + xt_unlock_mutex_ns(&trace_mutex); + va_end(ap); + +#ifdef PRINT_TRACE + xt_print_trace(); +#endif +} + +xtPublic void xt_ttracef(XTThreadPtr self, char *fmt, ...) +{ + va_list ap; + size_t len; + + va_start(ap, fmt); + xt_lock_mutex_ns(&trace_mutex); + + if (trace_log_offset + MAX_PRINT_LEN > trace_log_size) { + trace_log_end = trace_log_offset; + trace_log_offset = 0; + } + + trace_stat_count++; + len = (size_t) sprintf(trace_log_buffer + trace_log_offset, "%lu %s: ", trace_stat_count, self->t_name); + trace_log_offset += len; + len = (size_t) vsnprintf(trace_log_buffer + trace_log_offset, trace_log_size - trace_log_offset, fmt, ap); + trace_log_offset += len; + + xt_unlock_mutex_ns(&trace_mutex); + va_end(ap); + +#ifdef PRINT_TRACE + xt_print_trace(); +#endif +} + +xtPublic void xt_ttraceq(XTThreadPtr self, char *query) +{ + size_t qlen = strlen(query), tlen; + char *ptr, *qptr; + + if (!self) + self = xt_get_self(); + + xt_lock_mutex_ns(&trace_mutex); + + if (trace_log_offset + qlen + 100 >= trace_log_size) { + /* Start at the beginning of the buffer again: */ + trace_log_end = trace_log_offset; + trace_log_offset = 0; + } + + trace_stat_count++; + tlen = (size_t) sprintf(trace_log_buffer + trace_log_offset, "%lu %s: ", trace_stat_count, self->t_name); + trace_log_offset += tlen; + + ptr = trace_log_buffer + trace_log_offset; + qlen = 0; + qptr = query; + while (*qptr) { + if (*qptr == '\n' || *qptr == '\r') + *ptr = ' '; + else + *ptr = *qptr; + if (*qptr == '\n' || *qptr == 
'\r' || *qptr == ' ') { + qptr++; + while (*qptr == '\n' || *qptr == '\r' || *qptr == ' ') + qptr++; + } + else + qptr++; + ptr++; + qlen++; + } + + trace_log_offset += qlen; + *(trace_log_buffer + trace_log_offset) = '\n'; + *(trace_log_buffer + trace_log_offset + 1) = '\0'; + trace_log_offset++; + + xt_unlock_mutex_ns(&trace_mutex); + +#ifdef PRINT_TRACE + xt_print_trace(); +#endif +} + +/* + * Returns the time in microseconds. + * (1/1000000 of a second) + */ +xtPublic xtWord8 xt_trace_clock(void) +{ + static xtWord8 trace_start_clock = 0; + xtWord8 now; + +#ifdef XT_WIN + now = ((xtWord8) GetTickCount()) * (xtWord8) 1000; +#else + struct timeval tv; + + gettimeofday(&tv, NULL); + now = (xtWord8) tv.tv_sec * (xtWord8) 1000000 + tv.tv_usec; +#endif + if (trace_start_clock) + return now - trace_start_clock; + trace_start_clock = now; + return 0; +} + +xtPublic char *xt_trace_clock_str(char *ptr) +{ + static char buffer[50]; + xtWord8 now = xt_trace_clock(); + + if (!ptr) + ptr = buffer; + + sprintf(ptr, "%d.%06d", (int) (now / (xtWord8) 1000000), (int) (now % (xtWord8) 1000000)); + return ptr; +} + +xtPublic char *xt_trace_clock_diff(char *ptr) +{ + static xtWord8 trace_last_clock = 0; + static char buffer[50]; + xtWord8 now = xt_trace_clock(); + + if (!ptr) + ptr = buffer; + + sprintf(ptr, "%d.%06d (%d)", (int) (now / (xtWord8) 1000000), (int) (now % (xtWord8) 1000000), (int) (now - trace_last_clock)); + trace_last_clock = now; + return ptr; +} + +xtPublic char *xt_trace_clock_diff(char *ptr, xtWord8 start_time) +{ + xtWord8 now = xt_trace_clock(); + + sprintf(ptr, "%d.%06d (%d)", (int) (now / (xtWord8) 1000000), (int) (now % (xtWord8) 1000000), (int) (now - start_time)); + return ptr; +} + + +xtPublic void xt_set_fflush(xtBool on) +{ + trace_flush_dump = on; +} + +xtPublic void xt_ftracef(char *fmt, ...) 
+{ + va_list ap; + + va_start(ap, fmt); + xt_lock_mutex_ns(&trace_mutex); + + if (!trace_dump_file) { + char buffer[100]; + + for (int i=1; ;i++) { + sprintf(buffer, "pbxt-dump-%d.log", i); + if (!xt_fs_exists(buffer)) { + trace_dump_file = fopen(buffer, "w"); + break; + } + } + } + + vfprintf(trace_dump_file, fmt, ap); + if (trace_flush_dump) + fflush(trace_dump_file); + + xt_unlock_mutex_ns(&trace_mutex); + va_end(ap); +} + diff --git a/storage/pbxt/src/trace_xt.h b/storage/pbxt/src/trace_xt.h new file mode 100644 index 00000000000..44cfa9945f1 --- /dev/null +++ b/storage/pbxt/src/trace_xt.h @@ -0,0 +1,49 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-02-07 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_trace_h__ +#define __xt_trace_h__ + +#include "xt_defs.h" + +xtBool xt_init_trace(void); +void xt_exit_trace(void); +void xt_dump_trace(void); +void xt_print_trace(void); + +void xt_trace(const char *fmt, ...); +void xt_ttraceq(struct XTThread *self, char *query); +void xt_ttracef(struct XTThread *self, char *fmt, ...); +xtWord8 xt_trace_clock(void); +char *xt_trace_clock_str(char *ptr); +char *xt_trace_clock_diff(char *ptr); +char *xt_trace_clock_diff(char *ptr, xtWord8 start_time); +void xt_set_fflush(xtBool on); +void xt_ftracef(char *fmt, ...); + +#define XT_DEBUG_TRACE(x) +#define XT_DISABLED_TRACE(x) +#ifdef DEBUG +//#define PBXT_HANDLER_TRACE +#endif + +#endif diff --git a/storage/pbxt/src/util_xt.cc b/storage/pbxt/src/util_xt.cc new file mode 100644 index 00000000000..6e1db1f5f73 --- /dev/null +++ b/storage/pbxt/src/util_xt.cc @@ -0,0 +1,414 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2004-01-03 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <stdio.h> +#include <time.h> +#include <ctype.h> +#ifndef XT_WIN +#include <sys/param.h> +#endif + +#include "util_xt.h" +#include "strutil_xt.h" +#include "memory_xt.h" + +xtPublic int xt_comp_log_pos(xtLogID id1, off_t off1, xtLogID id2, off_t off2) +{ + if (id1 < id2) + return -1; + if (id1 > id2) + return 1; + if (off1 < off2) + return -1; + if (off1 > off2) + return 1; + return 0; +} + +/* + * This function returns the current time in microseconds since + * 00:00:00 UTC, January 1, 1970. + * Currently it is accurate to the second :( + */ +xtPublic xtWord8 xt_time_now(void) +{ + xtWord8 ms; + + ms = (xtWord8) time(NULL); + ms *= 1000000; + return ms; +} + +xtPublic void xt_free_nothing(struct XTThread XT_UNUSED(*thr), void XT_UNUSED(*x)) +{ +} + +/* + * A file name has the form: + * <text>-<number>[.<ext>] + * This function returns the number part as a + * u_long. + */ +xtPublic xtWord4 xt_file_name_to_id(char *file_name) +{ + u_long value = 0; + + if (file_name) { + char *num = file_name + strlen(file_name) - 1; + + while (num >= file_name && *num != '-') + num--; + num++; + if (isdigit(*num)) + sscanf(num, "%lu", &value); + } + return (xtWord4) value; +} + +/* + * now is moving forward. then is a static time in the + * future. What is the time difference? + * + * These variables can overflow. + */ +xtPublic int xt_time_difference(register xtWord4 now, register xtWord4 then) +{ + /* now is after then, so the now time has passed + * then. So we return a negative difference. + */ + if (now >= then) { + /* now has gone past then. If the difference is + * great, then we assume an overflow, and reverse! 
+ */ + if ((now - then) > (xtWord4) 0xFFFFFFFF/2) + return (int) (0xFFFFFFFF - (now - then)); + + return (int) now - (int) then; + } + /* If now is before then, we check the difference. + * If the difference is very large, then we assume + * that now has gone past then, and overflowed. + */ + if ((then - now) > (xtWord4) 0xFFFFFFFF/2) + return - (int) (0xFFFFFFFF - (then - now)); + return then - now; +} + +xtPublic xtWord2 xt_get_checksum(xtWord1 *data, size_t len, u_int interval) +{ + register xtWord4 sum = 0, g; + xtWord1 *chk; + + chk = data + len - 1; + while (chk > data) { + sum = (sum << 4) + *chk; + if ((g = sum & 0xF0000000)) { + sum = sum ^ (g >> 24); + sum = sum ^ g; + } + chk -= interval; + } + return (xtWord2) (sum ^ (sum >> 16)); +} + +xtPublic xtWord1 xt_get_checksum1(xtWord1 *data, size_t len) +{ + register xtWord4 sum = 0, g; + xtWord1 *chk; + + chk = data + len - 1; + while (chk > data) { + sum = (sum << 4) + *chk; + if ((g = sum & 0xF0000000)) { + sum = sum ^ (g >> 24); + sum = sum ^ g; + } + chk--; + } + return (xtWord1) (sum ^ (sum >> 24) ^ (sum >> 16) ^ (sum >> 8)); +} + +/* + * --------------- Data Buffer ------------------ + */ + +xtPublic xtBool xt_db_set_size(struct XTThread *self, XTDataBufferPtr dbuf, size_t size) +{ + if (dbuf->db_size < size) { + if (!xt_realloc(self, (void **) &dbuf->db_data, size)) + return FAILED; + dbuf->db_size = size; + } + else if (!size) { + if (dbuf->db_data) + xt_free(self, dbuf->db_data); + dbuf->db_data = NULL; + dbuf->db_size = 0; + } + return OK; +} + +/* + * --------------- Info Buffer ------------------ + */ + +xtPublic xtBool xt_ib_alloc(struct XTThread *self, XTInfoBufferPtr ib, size_t size) +{ + if (!ib->ib_free) { + ib->ib_db.db_size = 0; + ib->ib_db.db_data = NULL; + } + if (size <= ib->ib_db.db_size) + return OK; + + if (size <= XT_IB_DEFAULT_SIZE) { + ib->ib_db.db_size = XT_IB_DEFAULT_SIZE; + ib->ib_db.db_data = ib->ib_data; + return OK; + } + + if (ib->ib_db.db_data == ib->ib_data) { + 
ib->ib_db.db_size = 0; + ib->ib_db.db_data = NULL; + } + + ib->ib_free = TRUE; + return xt_db_set_size(self, &ib->ib_db, size); +} + +void xt_ib_free(struct XTThread *self, XTInfoBufferPtr ib) +{ + if (ib->ib_free) { + xt_db_set_size(self, &ib->ib_db, 0); + ib->ib_free = FALSE; + } +} + +/* + * --------------- Basic List ------------------ + */ + +xtPublic xtBool xt_bl_set_size(struct XTThread *self, XTBasicListPtr bl, size_t size) +{ + if (bl->bl_size < size) { + if (!xt_realloc(self, (void **) &bl->bl_data, size * bl->bl_item_size)) + return FAILED; + bl->bl_size = size; + } + else if (!size) { + if (bl->bl_data) + xt_free(self, bl->bl_data); + bl->bl_data = NULL; + bl->bl_size = 0; + bl->bl_count = 0; + } + return OK; +} + +xtPublic xtBool xt_bl_dup(struct XTThread *self, XTBasicListPtr from_bl, XTBasicListPtr to_bl) +{ + to_bl->bl_item_size = from_bl->bl_item_size; + to_bl->bl_size = 0; + to_bl->bl_count = from_bl->bl_count; + to_bl->bl_data = NULL; + if (!xt_bl_set_size(self, to_bl, from_bl->bl_count)) + return FAILED; + memcpy(to_bl->bl_data, from_bl->bl_data, to_bl->bl_count * to_bl->bl_item_size); + return OK; +} + +xtPublic xtBool xt_bl_append(struct XTThread *self, XTBasicListPtr bl, void *value) +{ + if (bl->bl_count == bl->bl_size) { + if (!xt_bl_set_size(self, bl, bl->bl_count+1)) + return FAILED; + } + memcpy(&bl->bl_data[bl->bl_count * bl->bl_item_size], value, bl->bl_item_size); + bl->bl_count++; + return OK; +} + +xtPublic void *xt_bl_last_item(XTBasicListPtr bl) +{ + if (!bl->bl_count) + return NULL; + return &bl->bl_data[(bl->bl_count-1) * bl->bl_item_size]; +} + +xtPublic void *xt_bl_item_at(XTBasicListPtr bl, u_int i) +{ + if (i >= bl->bl_count) + return NULL; + return &bl->bl_data[i * bl->bl_item_size]; +} + +xtPublic void xt_bl_free(struct XTThread *self, XTBasicListPtr wl) +{ + xt_bl_set_size(self, wl, 0); +} + +/* + * --------------- Basic Queue ------------------ + */ + +xtPublic xtBool xt_bq_set_size(struct XTThread *self, XTBasicQueuePtr 
bq, size_t size) +{ + if (bq->bq_size < size) { + if (!xt_realloc(self, (void **) &bq->bq_data, size * bq->bq_item_size)) + return FAILED; + bq->bq_size = size; + } + else if (!size) { + if (bq->bq_data) + xt_free(self, bq->bq_data); + bq->bq_data = NULL; + bq->bq_size = 0; + bq->bq_front = 0; + bq->bq_back = 0; + } + return OK; +} + +xtPublic void *xt_bq_get(XTBasicQueuePtr bq) +{ + if (bq->bq_back == bq->bq_front) + return NULL; + return &bq->bq_data[bq->bq_back * bq->bq_item_size]; +} + +xtPublic void xt_bq_next(XTBasicQueuePtr bq) +{ + if (bq->bq_back < bq->bq_front) { + bq->bq_back++; + if (bq->bq_front == bq->bq_back) { + bq->bq_front = 0; + bq->bq_back = 0; + } + } +} + +xtPublic xtBool xt_bq_add(struct XTThread *self, XTBasicQueuePtr bq, void *value) +{ + if (bq->bq_front == bq->bq_size) { + if (bq->bq_back >= bq->bq_max_waste) { + bq->bq_front -= bq->bq_back; + memmove(bq->bq_data, &bq->bq_data[bq->bq_back * bq->bq_item_size], bq->bq_front * bq->bq_item_size); + bq->bq_back = 0; + } + else { + if (!xt_bq_set_size(self, bq, bq->bq_front+bq->bq_item_inc)) + return FAILED; + } + } + memcpy(&bq->bq_data[bq->bq_front * bq->bq_item_size], value, bq->bq_item_size); + bq->bq_front++; + return OK; +} + +xtPublic void xt_sb_free(struct XTThread *self, XTStringBufferPtr dbuf) +{ + xt_sb_set_size(self, dbuf, 0); +} + +xtPublic xtBool xt_sb_set_size(struct XTThread *self, XTStringBufferPtr dbuf, size_t size) +{ + if (dbuf->sb_size < size) { + if (!xt_realloc(self, (void **) &dbuf->sb_cstring, size)) + return FAILED; + dbuf->sb_size = size; + } + else if (!size) { + if (dbuf->sb_cstring) + xt_free(self, dbuf->sb_cstring); + dbuf->sb_cstring = NULL; + dbuf->sb_size = 0; + dbuf->sb_len = 0; + } + return OK; +} + +xtPublic xtBool xt_sb_concat_len(struct XTThread *self, XTStringBufferPtr dbuf, c_char *str, size_t len) +{ + if (!xt_sb_set_size(self, dbuf, dbuf->sb_len + len + 1)) + return FAILED; + memcpy(dbuf->sb_cstring + dbuf->sb_len, str, len); + dbuf->sb_len += len; + 
dbuf->sb_cstring[dbuf->sb_len] = 0; + return OK; +} + +xtPublic xtBool xt_sb_concat(struct XTThread *self, XTStringBufferPtr dbuf, c_char *str) +{ + return xt_sb_concat_len(self, dbuf, str, strlen(str)); +} + +xtPublic xtBool xt_sb_concat_char(struct XTThread *self, XTStringBufferPtr dbuf, int ch) +{ + if (!xt_sb_set_size(self, dbuf, dbuf->sb_len + 1 + 1)) + return FAILED; + dbuf->sb_cstring[dbuf->sb_len] = (char) ch; + dbuf->sb_len++; + dbuf->sb_cstring[dbuf->sb_len] = 0; + return OK; +} + +xtPublic xtBool xt_sb_concat_int8(struct XTThread *self, XTStringBufferPtr dbuf, xtInt8 val) +{ + char buffer[200]; + + sprintf(buffer, "%"PRId64, val); + return xt_sb_concat(self, dbuf, buffer); +} + +xtPublic char *xt_sb_take_cstring(XTStringBufferPtr sbuf) +{ + char *str = sbuf->sb_cstring; + + sbuf->sb_cstring = NULL; + sbuf->sb_size = 0; + sbuf->sb_len = 0; + return str; +} + +xtPublic xtBool xt_sb_concat_url_len(struct XTThread *self, XTStringBufferPtr dbuf, c_char *from, size_t len_from) +{ + if (!xt_sb_set_size(self, dbuf, dbuf->sb_len + len_from + 1)) + return FAILED; + while (len_from--) { + if (*from == '%' && len_from >= 2 && isxdigit(*(from+1)) && isxdigit(*(from+2))) { + unsigned char a = xt_hex_digit(*(from+1)); + unsigned char b = xt_hex_digit(*(from+2)); + dbuf->sb_cstring[dbuf->sb_len] = a << 4 | b; + from += 3; + } + else + dbuf->sb_cstring[dbuf->sb_len] = *from++; + dbuf->sb_len++; + } + dbuf->sb_cstring[dbuf->sb_len] = 0; + return OK; +} + + diff --git a/storage/pbxt/src/util_xt.h b/storage/pbxt/src/util_xt.h new file mode 100644 index 00000000000..bb5003b10fb --- /dev/null +++ b/storage/pbxt/src/util_xt.h @@ -0,0 +1,123 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2004-01-03 Paul McCullagh + * + * H&G2JCtL + */ + +#ifndef __xt_xtutil_h__ +#define __xt_xtutil_h__ + +#include <stddef.h> + +#include "xt_defs.h" + +#define XT_CHECKSUM_1(sum) ((xtWord1) ((sum) ^ ((sum) >> 24) ^ ((sum) >> 16) ^ ((sum) >> 8))) +#define XT_CHECKSUM_2(sum) ((xtWord2) ((sum) ^ ((sum) >> 16))) +#define XT_CHECKSUM4_8(sum) ((xtWord4) (sum) ^ (xtWord4) ((sum) >> 32)) + +int xt_comp_log_pos(xtLogID id1, off_t off1, xtLogID id2, off_t off2); +xtWord8 xt_time_now(void); +void xt_free_nothing(struct XTThread *self, void *x); +xtWord4 xt_file_name_to_id(char *file_name); +xtBool xt_time_difference(register xtWord4 now, register xtWord4 then); +xtWord2 xt_get_checksum(xtWord1 *data, size_t len, u_int interval); +xtWord1 xt_get_checksum1(xtWord1 *data, size_t len); + +typedef struct XTDataBuffer { + size_t db_size; + xtWord1 *db_data; +} XTDataBufferRec, *XTDataBufferPtr; + +xtBool xt_db_set_size(struct XTThread *self, XTDataBufferPtr db, size_t size); + +#define XT_IB_DEFAULT_SIZE 512 + +typedef struct XTInfoBuffer { + xtBool ib_free; + XTDataBufferRec ib_db; + xtWord1 ib_data[XT_IB_DEFAULT_SIZE]; +} XTInfoBufferRec, *XTInfoBufferPtr; + +xtBool xt_ib_alloc(struct XTThread *self, XTInfoBufferPtr ib, size_t size); +void xt_ib_free(struct XTThread *self, XTInfoBufferPtr ib); + +typedef struct XTBasicList { + u_int bl_item_size; + u_int bl_size; + u_int bl_count; + xtWord1 *bl_data; +} XTBasicListRec, *XTBasicListPtr; + +xtBool xt_bl_set_size(struct XTThread *self, XTBasicListPtr wl, size_t size); 
+xtBool xt_bl_dup(struct XTThread *self, XTBasicListPtr from_bl, XTBasicListPtr to_bl); +xtBool xt_bl_append(struct XTThread *self, XTBasicListPtr wl, void *value); +void *xt_bl_last_item(XTBasicListPtr wl); +void *xt_bl_item_at(XTBasicListPtr wl, u_int i); +void xt_bl_free(struct XTThread *self, XTBasicListPtr wl); + +typedef struct XTBasicQueue { + u_int bq_item_size; + u_int bq_max_waste; + u_int bq_item_inc; + u_int bq_size; + u_int bq_front; + u_int bq_back; + xtWord1 *bq_data; +} XTBasicQueueRec, *XTBasicQueuePtr; + +xtBool xt_bq_set_size(struct XTThread *self, XTBasicQueuePtr wq, size_t size); +void *xt_bq_get(XTBasicQueuePtr wq); +void xt_bq_next(XTBasicQueuePtr wq); +xtBool xt_bq_add(struct XTThread *self, XTBasicQueuePtr wl, void *value); + +typedef struct XTStringBuffer { + size_t sb_size; + size_t sb_len; + char *sb_cstring; +} XTStringBufferRec, *XTStringBufferPtr; + +void xt_sb_free(struct XTThread *self, XTStringBufferPtr db); +xtBool xt_sb_set_size(struct XTThread *self, XTStringBufferPtr db, size_t size); +xtBool xt_sb_concat_len(struct XTThread *self, XTStringBufferPtr dbuf, c_char *str, size_t len); +xtBool xt_sb_concat(struct XTThread *self, XTStringBufferPtr dbuf, c_char *str); +xtBool xt_sb_concat_char(struct XTThread *self, XTStringBufferPtr dbuf, int ch); +xtBool xt_sb_concat_int8(struct XTThread *self, XTStringBufferPtr dbuf, xtInt8 val); +char *xt_sb_take_cstring(XTStringBufferPtr dbuf); +xtBool xt_sb_concat_url_len(struct XTThread *self, XTStringBufferPtr dbuf, c_char *str, size_t len); + +static inline size_t xt_align_size(size_t size, size_t align) +{ + register size_t diff = size % align; + + if (diff) + return size + align - diff; + return size; +} + +static inline off_t xt_align_offset(off_t size, size_t align) +{ + register off_t diff = size % (off_t) align; + + if (diff) + return size + align - diff; + return size; +} + +#endif diff --git a/storage/pbxt/src/win_inttypes.h b/storage/pbxt/src/win_inttypes.h new file mode 100644 index 
00000000000..c8561939e54 --- /dev/null +++ b/storage/pbxt/src/win_inttypes.h @@ -0,0 +1,259 @@ +/* Copyright (C) 1997-2001, 2004, 2007 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA + 02111-1307 USA. */ + +/* + * ISO C99: 7.8 Format conversion of integer types <inttypes.h> + */ + +/* + * This is a reduced version of the original Linux inttypes.h file. + */ + +#ifndef _INTTYPES_H +#define _INTTYPES_H 1 + +/* The ISO C99 standard specifies that these macros must only be + defined if explicitly requested. */ +#if !defined __cplusplus || defined __STDC_FORMAT_MACROS + +# if __WORDSIZE == 64 +# define __PRI64_PREFIX "l" +# define __PRIPTR_PREFIX "l" +# else +# define __PRI64_PREFIX "ll" +# define __PRIPTR_PREFIX +# endif + +/* Macros for printing format specifiers. */ + +/* Decimal notation. 
*/ +# define PRId8 "d" +# define PRId16 "d" +# define PRId32 "d" +# define PRId64 __PRI64_PREFIX "d" + +# define PRIdLEAST8 "d" +# define PRIdLEAST16 "d" +# define PRIdLEAST32 "d" +# define PRIdLEAST64 __PRI64_PREFIX "d" + +# define PRIdFAST8 "d" +# define PRIdFAST16 __PRIPTR_PREFIX "d" +# define PRIdFAST32 __PRIPTR_PREFIX "d" +# define PRIdFAST64 __PRI64_PREFIX "d" + + +# define PRIi8 "i" +# define PRIi16 "i" +# define PRIi32 "i" +# define PRIi64 __PRI64_PREFIX "i" + +# define PRIiLEAST8 "i" +# define PRIiLEAST16 "i" +# define PRIiLEAST32 "i" +# define PRIiLEAST64 __PRI64_PREFIX "i" + +# define PRIiFAST8 "i" +# define PRIiFAST16 __PRIPTR_PREFIX "i" +# define PRIiFAST32 __PRIPTR_PREFIX "i" +# define PRIiFAST64 __PRI64_PREFIX "i" + +/* Octal notation. */ +# define PRIo8 "o" +# define PRIo16 "o" +# define PRIo32 "o" +# define PRIo64 __PRI64_PREFIX "o" + +# define PRIoLEAST8 "o" +# define PRIoLEAST16 "o" +# define PRIoLEAST32 "o" +# define PRIoLEAST64 __PRI64_PREFIX "o" + +# define PRIoFAST8 "o" +# define PRIoFAST16 __PRIPTR_PREFIX "o" +# define PRIoFAST32 __PRIPTR_PREFIX "o" +# define PRIoFAST64 __PRI64_PREFIX "o" + +/* Unsigned integers. */ +# define PRIu8 "u" +# define PRIu16 "u" +# define PRIu32 "u" +# define PRIu64 __PRI64_PREFIX "u" + +# define PRIuLEAST8 "u" +# define PRIuLEAST16 "u" +# define PRIuLEAST32 "u" +# define PRIuLEAST64 __PRI64_PREFIX "u" + +# define PRIuFAST8 "u" +# define PRIuFAST16 __PRIPTR_PREFIX "u" +# define PRIuFAST32 __PRIPTR_PREFIX "u" +# define PRIuFAST64 __PRI64_PREFIX "u" + +/* lowercase hexadecimal notation. */ +# define PRIx8 "x" +# define PRIx16 "x" +# define PRIx32 "x" +# define PRIx64 __PRI64_PREFIX "x" + +# define PRIxLEAST8 "x" +# define PRIxLEAST16 "x" +# define PRIxLEAST32 "x" +# define PRIxLEAST64 __PRI64_PREFIX "x" + +# define PRIxFAST8 "x" +# define PRIxFAST16 __PRIPTR_PREFIX "x" +# define PRIxFAST32 __PRIPTR_PREFIX "x" +# define PRIxFAST64 __PRI64_PREFIX "x" + +/* UPPERCASE hexadecimal notation. 
*/ +# define PRIX8 "X" +# define PRIX16 "X" +# define PRIX32 "X" +# define PRIX64 __PRI64_PREFIX "X" + +# define PRIXLEAST8 "X" +# define PRIXLEAST16 "X" +# define PRIXLEAST32 "X" +# define PRIXLEAST64 __PRI64_PREFIX "X" + +# define PRIXFAST8 "X" +# define PRIXFAST16 __PRIPTR_PREFIX "X" +# define PRIXFAST32 __PRIPTR_PREFIX "X" +# define PRIXFAST64 __PRI64_PREFIX "X" + + +/* Macros for printing `intmax_t' and `uintmax_t'. */ +# define PRIdMAX __PRI64_PREFIX "d" +# define PRIiMAX __PRI64_PREFIX "i" +# define PRIoMAX __PRI64_PREFIX "o" +# define PRIuMAX __PRI64_PREFIX "u" +# define PRIxMAX __PRI64_PREFIX "x" +# define PRIXMAX __PRI64_PREFIX "X" + + +/* Macros for printing `intptr_t' and `uintptr_t'. */ +# define PRIdPTR __PRIPTR_PREFIX "d" +# define PRIiPTR __PRIPTR_PREFIX "i" +# define PRIoPTR __PRIPTR_PREFIX "o" +# define PRIuPTR __PRIPTR_PREFIX "u" +# define PRIxPTR __PRIPTR_PREFIX "x" +# define PRIXPTR __PRIPTR_PREFIX "X" + + +/* Macros for scanning format specifiers. */ + +/* Signed decimal notation. */ +# define SCNd8 "hhd" +# define SCNd16 "hd" +# define SCNd32 "d" +# define SCNd64 __PRI64_PREFIX "d" + +# define SCNdLEAST8 "hhd" +# define SCNdLEAST16 "hd" +# define SCNdLEAST32 "d" +# define SCNdLEAST64 __PRI64_PREFIX "d" + +# define SCNdFAST8 "hhd" +# define SCNdFAST16 __PRIPTR_PREFIX "d" +# define SCNdFAST32 __PRIPTR_PREFIX "d" +# define SCNdFAST64 __PRI64_PREFIX "d" + +/* Signed decimal notation. */ +# define SCNi8 "hhi" +# define SCNi16 "hi" +# define SCNi32 "i" +# define SCNi64 __PRI64_PREFIX "i" + +# define SCNiLEAST8 "hhi" +# define SCNiLEAST16 "hi" +# define SCNiLEAST32 "i" +# define SCNiLEAST64 __PRI64_PREFIX "i" + +# define SCNiFAST8 "hhi" +# define SCNiFAST16 __PRIPTR_PREFIX "i" +# define SCNiFAST32 __PRIPTR_PREFIX "i" +# define SCNiFAST64 __PRI64_PREFIX "i" + +/* Unsigned decimal notation. 
*/ +# define SCNu8 "hhu" +# define SCNu16 "hu" +# define SCNu32 "u" +# define SCNu64 __PRI64_PREFIX "u" + +# define SCNuLEAST8 "hhu" +# define SCNuLEAST16 "hu" +# define SCNuLEAST32 "u" +# define SCNuLEAST64 __PRI64_PREFIX "u" + +# define SCNuFAST8 "hhu" +# define SCNuFAST16 __PRIPTR_PREFIX "u" +# define SCNuFAST32 __PRIPTR_PREFIX "u" +# define SCNuFAST64 __PRI64_PREFIX "u" + +/* Octal notation. */ +# define SCNo8 "hho" +# define SCNo16 "ho" +# define SCNo32 "o" +# define SCNo64 __PRI64_PREFIX "o" + +# define SCNoLEAST8 "hho" +# define SCNoLEAST16 "ho" +# define SCNoLEAST32 "o" +# define SCNoLEAST64 __PRI64_PREFIX "o" + +# define SCNoFAST8 "hho" +# define SCNoFAST16 __PRIPTR_PREFIX "o" +# define SCNoFAST32 __PRIPTR_PREFIX "o" +# define SCNoFAST64 __PRI64_PREFIX "o" + +/* Hexadecimal notation. */ +# define SCNx8 "hhx" +# define SCNx16 "hx" +# define SCNx32 "x" +# define SCNx64 __PRI64_PREFIX "x" + +# define SCNxLEAST8 "hhx" +# define SCNxLEAST16 "hx" +# define SCNxLEAST32 "x" +# define SCNxLEAST64 __PRI64_PREFIX "x" + +# define SCNxFAST8 "hhx" +# define SCNxFAST16 __PRIPTR_PREFIX "x" +# define SCNxFAST32 __PRIPTR_PREFIX "x" +# define SCNxFAST64 __PRI64_PREFIX "x" + + +/* Macros for scanning `intmax_t' and `uintmax_t'. */ +# define SCNdMAX __PRI64_PREFIX "d" +# define SCNiMAX __PRI64_PREFIX "i" +# define SCNoMAX __PRI64_PREFIX "o" +# define SCNuMAX __PRI64_PREFIX "u" +# define SCNxMAX __PRI64_PREFIX "x" + +/* Macros for scanning `intptr_t' and `uintptr_t'. 
*/ +# define SCNdPTR __PRIPTR_PREFIX "d" +# define SCNiPTR __PRIPTR_PREFIX "i" +# define SCNoPTR __PRIPTR_PREFIX "o" +# define SCNuPTR __PRIPTR_PREFIX "u" +# define SCNxPTR __PRIPTR_PREFIX "x" + +#endif /* C++ && format macros */ + + +#endif /* inttypes.h */ diff --git a/storage/pbxt/src/xaction_xt.cc b/storage/pbxt/src/xaction_xt.cc new file mode 100644 index 00000000000..14f62db373d --- /dev/null +++ b/storage/pbxt/src/xaction_xt.cc @@ -0,0 +1,2682 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-04-10 Paul McCullagh + * + * H&G2JCtL + */ + +#include "xt_config.h" + +#include <time.h> +#include <signal.h> + +#include "xaction_xt.h" +#include "database_xt.h" +#include "strutil_xt.h" +#include "heap_xt.h" +#include "trace_xt.h" +#include "myxt_xt.h" +#include "tabcache_xt.h" + +#ifdef DEBUG +//#define TRACE_WAIT_FOR +//#define TRACE_VARIATIONS +//#define TRACE_SWEEPER_ACTIVITY + +/* Enable to trace the statements executed by the engine: */ +//#define TRACE_STATEMENTS +#endif + +#if defined(TRACE_STATEMENTS) || defined(TRACE_VARIATIONS) +#define TRACE_TRANSACTION +#endif + +static void xn_sw_wait_for_xact(XTThreadPtr self, XTDatabaseHPtr db, u_int hsecs); +static xtBool xn_get_xact_details(XTDatabaseHPtr db, xtXactID xn_id, XTThreadPtr thread __attribute__((unused)), int *flags, xtXactID *start, xtXactID *end, xtThreadID *thd_id); +static xtBool xn_get_xact_pointer(XTDatabaseHPtr db, xtXactID xn_id, XTXactDataPtr *xact_ptr); + +/* ============================================================================================== */ + +typedef struct XNSWRecItem { + xtTableID ri_tab_id; + xtRecordID ri_rec_id; +} XNSWRecItemRec, *XNSWRecItemPtr; + +typedef struct XNSWToFreeItem { + xtTableID ri_tab_id; /* If non-zero, then this is the table of the data record to be freed. + * If zero, then the transaction below must be freed. + */ + union { + xtRecordID ri_rec_id; + xtXactID ri_xn_id; + } x; + xtXactID ri_wait_xn_id; /* Wait for this transaction to be cleaned (or being cleaned up) + * before freeing this resource. 
*/ +} XNSWToFreeItemRec, *XNSWToFreeItemPtr; + +/* ---------------------------------------------------------------------- + * TRANSACTION/THREAD WAIT LIST + */ + +typedef struct XNWaitThread { + /* The wait condition of the thread. */ + xt_mutex_type wt_lock; + xt_cond_type wt_cond; + + /* The list of threads waiting for this thread. */ + XTSpinLockRec wt_wait_list_lock; + u_int wt_wait_list_count; + u_int wt_wait_list_size; + xtThreadID *wt_wait_list; +} XNWaitThreadRec, *XNWaitThreadPtr; + +static XNWaitThreadPtr xn_wait_thread_array; + +xtPublic void xt_thread_wait_init(XTThreadPtr self) +{ + xn_wait_thread_array = (XNWaitThreadPtr) xt_calloc(self, xt_thr_maximum_threads * sizeof(XNWaitThreadRec)); + for (u_int i=0; i<xt_thr_maximum_threads; i++) { + xt_init_mutex_with_autoname(self, &xn_wait_thread_array[i].wt_lock); + xt_init_cond(self, &xn_wait_thread_array[i].wt_cond); + xn_wait_thread_array[i].wt_wait_list = NULL; + xn_wait_thread_array[i].wt_wait_list_count = 0; + xn_wait_thread_array[i].wt_wait_list_size = 0; + xt_spinlock_init_with_autoname(self, &xn_wait_thread_array[i].wt_wait_list_lock); + } +} + +xtPublic void xt_thread_wait_exit(XTThreadPtr self) +{ + if (xn_wait_thread_array) { + for (u_int i=0; i<xt_thr_maximum_threads; i++) { + xt_free_mutex(&xn_wait_thread_array[i].wt_lock); + xt_free_cond(&xn_wait_thread_array[i].wt_cond); + if (xn_wait_thread_array[i].wt_wait_list) + xt_free(self, xn_wait_thread_array[i].wt_wait_list); + xt_spinlock_free(self, &xn_wait_thread_array[i].wt_wait_list_lock); + } + xt_free(self, xn_wait_thread_array); + } +} + +static xtBool xn_wait_for_thread(xtThreadID waiting_id, xtThreadID wait_for_id) +{ + XNWaitThreadPtr wt; + + wt = &xn_wait_thread_array[wait_for_id]; + xt_spinlock_lock(&wt->wt_wait_list_lock); + if (wt->wt_wait_list_count == wt->wt_wait_list_size) { + if (!xt_realloc_ns((void **) &wt->wt_wait_list, (wt->wt_wait_list_size+1) * sizeof(xtThreadID))) + return FAILED; + wt->wt_wait_list_size++; + } + for (u_int 
i=0; i<wt->wt_wait_list_count; i++) { + if (wt->wt_wait_list[i] == waiting_id) + goto done; + } + wt->wt_wait_list[wt->wt_wait_list_count] = waiting_id; + wt->wt_wait_list_count++; + done: + xt_spinlock_unlock(&wt->wt_wait_list_lock); + return OK; +} + +xtPublic void xt_xn_wakeup_thread(xtThreadID thd_id) +{ + XNWaitThreadPtr target_wt; + + target_wt = &xn_wait_thread_array[thd_id]; + xt_lock_mutex_ns(&target_wt->wt_lock); + xt_broadcast_cond_ns(&target_wt->wt_cond); + xt_unlock_mutex_ns(&target_wt->wt_lock); +} + +xtPublic void xt_xn_wakeup_thread_list(XTThreadPtr thread) +{ + XNWaitThreadPtr target_wt; + + for (u_int i=0; i<thread->st_thread_list_count; i++) { + target_wt = &xn_wait_thread_array[thread->st_thread_list[i]]; + xt_lock_mutex_ns(&target_wt->wt_lock); + xt_broadcast_cond_ns(&target_wt->wt_cond); + xt_unlock_mutex_ns(&target_wt->wt_lock); + } + thread->st_thread_list_count = 0; +} + +xtPublic void xt_xn_wakeup_waiting_threads(XTThreadPtr thread) +{ + XNWaitThreadPtr wt; + XNWaitThreadPtr target_wt; + + wt = &xn_wait_thread_array[thread->t_id]; + if (!wt->wt_wait_list_count) + return; + + xt_spinlock_lock(&wt->wt_wait_list_lock); + if (thread->st_thread_list_size < wt->wt_wait_list_count) { + if (!xt_realloc_ns((void **) &thread->st_thread_list, wt->wt_wait_list_count * sizeof(xtThreadID))) + goto failed; + thread->st_thread_list_size = wt->wt_wait_list_count; + } + memcpy(thread->st_thread_list, wt->wt_wait_list, wt->wt_wait_list_count * sizeof(xtThreadID)); + thread->st_thread_list_count = wt->wt_wait_list_count; + wt->wt_wait_list_count = 0; + xt_spinlock_unlock(&wt->wt_wait_list_lock); + + xt_xn_wakeup_thread_list(thread); + return; + + failed: + for (u_int i=0; i<wt->wt_wait_list_count; i++) { + target_wt = &xn_wait_thread_array[wt->wt_wait_list[i]]; + xt_lock_mutex_ns(&target_wt->wt_lock); + xt_broadcast_cond_ns(&target_wt->wt_cond); + xt_unlock_mutex_ns(&target_wt->wt_lock); + } + wt->wt_wait_list_count = 0; + 
xt_spinlock_unlock(&wt->wt_wait_list_lock); +} + +/* ---------------------------------------------------------------------- + * WAIT FOR TRANSACTIONS + */ + +typedef struct XNWaitFor { + xtXactID wf_waiting_xn_id; /* The transaction of the waiting thread. */ + xtXactID wf_for_me_xn_id; /* The transaction we are waiting for. */ +} XNWaitForRec, *XNWaitForPtr; + +static int xn_compare_wait_for(XTThreadPtr XT_UNUSED(self), register const void XT_UNUSED(*thunk), register const void *a, register const void *b) +{ + xtXactID *x = (xtXactID *) a; + XNWaitForPtr y = (XNWaitForPtr) b; + + if (*x == y->wf_waiting_xn_id) + return 0; + if (xt_xn_is_before(*x, y->wf_waiting_xn_id)) + return -1; + return 1; +} + +static void xn_free_wait_for(XTThreadPtr XT_UNUSED(self), void XT_UNUSED(*thunk), void XT_UNUSED(*item)) +{ +} + +/* + * A deadlock occurs when a transaction is waiting for itself! + * For example A is waiting for B which is waiting for A. + * By repeatedly scanning the wait_for list we can find out if a + * transaction is waiting for itself. + */ +static xtBool xn_detect_deadlock(XTDatabaseHPtr db, xtXactID waiting, xtXactID for_me) +{ + XNWaitForPtr wf; + + for (;;) { + if (waiting == for_me) { +#ifdef TRACE_WAIT_FOR + for (u_int i=0; i<xt_sl_get_size(db->db_xn_wait_for); i++) { + wf = (XNWaitForPtr) xt_sl_item_at(db->db_xn_wait_for, i); + xt_trace("T%lu --> T%lu\n", (u_long) wf->wf_waiting_xn_id, (u_long) wf->wf_for_me_xn_id); + } + xt_ttracef(xt_get_self(), "DEADLOCK\n"); + xt_dump_trace(); +#endif + xt_register_xterr(XT_REG_CONTEXT, XT_ERR_DEADLOCK); + return TRUE; + } + if (!(wf = (XNWaitForPtr) xt_sl_find(NULL, db->db_xn_wait_for, &for_me))) + break; + for_me = wf->wf_for_me_xn_id; + } + return FALSE; +} + +#ifdef XT_USE_SPINLOCK_WAIT_FOR + +#if defined(XT_MAC) || defined(XT_WIN) +#define WAIT_SPIN_COUNT 10 +#else +#define WAIT_SPIN_COUNT 50 +#endif + +/* Should not be required, but we wait for a second, + * just in case the wakeup is missed! 
+ */ +#ifdef DEBUG +#define WAIT_FOR_XACT_TIME 30000 +#else +#define WAIT_FOR_XACT_TIME 1000 +#endif + +static xtBool xn_add_to_wait_for(XTDatabaseHPtr db, XNWaitForPtr wf, XTThreadPtr thread) +{ + /* If we are waiting for a transaction to end, + * put this thread on the wait list... + * + * As long as the temporary lock is removed + * or turned into a permanent lock before + * a thread waits again, all should be OK! + */ + xt_spinlock_lock(&db->db_xn_wait_spinlock); + +#ifdef TRACE_WAIT_FOR + xt_ttracef(thread, "T%lu -wait-> T%lu\n", (u_long) thread->st_xact_data->xd_start_xn_id, (u_long) wait_xn_id); +#endif + /* Check for a deadlock: */ + if (xn_detect_deadlock(db, wf->wf_waiting_xn_id, wf->wf_for_me_xn_id)) + goto failed; + + /* We will wait for this transaction... */ + db->db_xn_wait_count++; + if (thread->st_xact_writer) + db->db_xn_writer_wait_count++; + + if (!xt_sl_insert(NULL, db->db_xn_wait_for, &wf->wf_waiting_xn_id, wf)) { + db->db_xn_wait_count--; + goto failed; + } + + xt_spinlock_unlock(&db->db_xn_wait_spinlock); + return OK; + + failed: + xt_spinlock_unlock(&db->db_xn_wait_spinlock); + return FAILED; +} + +inline void xn_remove_from_wait_for(XTDatabaseHPtr db, XNWaitForPtr wf, XTThreadPtr thread) +{ + xt_spinlock_lock(&db->db_xn_wait_spinlock); + + xt_sl_delete(NULL, db->db_xn_wait_for, &wf->wf_waiting_xn_id); + db->db_xn_wait_count--; + if (thread->st_xact_writer) + db->db_xn_writer_wait_count--; + +#ifdef TRACE_WAIT_FOR + xt_ttracef(thread, "T%lu -wait-> T%lu FAILED\n", (u_long) thread->st_xact_data->xd_start_xn_id, (u_long) wait_xn_id); +#endif + xt_spinlock_unlock(&db->db_xn_wait_spinlock); +} + +//*BUG*/u_int spin_high; +/* Wait for a transaction to terminate or a lock to be granted. + * + * If term_req is TRUE, then the termination of the transaction is required + * before continuing. + * + * If pw_func is set then this function will not return before this call has + * succeeded. + * + * This function returns FAILED on error. 
+ */ +xtPublic xtBool xt_xn_wait_for_xact(XTThreadPtr thread, XTXactWaitPtr xw, XTLockWaitPtr lw) +{ + XTDatabaseHPtr db = thread->st_database; + XNWaitForRec wf; + int flags = 0; + xtXactID start = 0; + XTXactDataPtr wait_xact_ptr; + xtBool on_wait_list = FALSE; + XTXactWaitRec xw_new; + u_int loop_count = 0; + XNWaitThreadPtr my_wt; + + ASSERT_NS(thread->st_xact_data); + thread->st_statistics.st_wait_for_xact++; + + wf.wf_waiting_xn_id = thread->st_xact_data->xd_start_xn_id; + + if (lw) { + xtXactID locking_xn_id; + + wait_for_locker: + locking_xn_id = lw->lw_xn_id; + wf.wf_for_me_xn_id = lw->lw_xn_id; + if (!xn_add_to_wait_for(db, &wf, thread)) + return FAILED; + + while (loop_count < WAIT_SPIN_COUNT) { + loop_count++; + + switch (lw->lw_curr_lock) { + case XT_LOCK_ERR: + xn_remove_from_wait_for(db, &wf, thread); + return FAILED; + case XT_NO_LOCK: + /* Got the lock: */ + /* Check if we must also wait for the transaction: */ + if (lw->lw_row_updated) { + /* This will override the xw passed in. + * The reason is, because we are actually waiting + * for a lock, and the lock owner may have changed + * while we were waiting for the lock. + */ + xw_new.xw_xn_id = lw->lw_updating_xn_id; + xw = &xw_new; + } + if (xw) { + if (wf.wf_for_me_xn_id == xw->xw_xn_id) + on_wait_list = TRUE; + else + xn_remove_from_wait_for(db, &wf, thread); + goto wait_for_xact; + } + xn_remove_from_wait_for(db, &wf, thread); +//*BUG*/if (loop_count > spin_high) { spin_high = loop_count; printf("spin high = %d\n", spin_high); } + return OK; + case XT_TEMP_LOCK: + case XT_PERM_LOCK: + if (locking_xn_id != lw->lw_xn_id) { + /* Change the transaction that we are waiting for: */ + xn_remove_from_wait_for(db, &wf, thread); + goto wait_for_locker; + } + break; + } + + xt_critical_wait(); + } + + + /* The non-spinning version... 
*/ + wait_for_locker_no_spin: + my_wt = &xn_wait_thread_array[thread->t_id]; + xt_lock_mutex_ns(&my_wt->wt_lock); + + for (;;) { + switch (lw->lw_curr_lock) { + case XT_LOCK_ERR: + xt_unlock_mutex_ns(&my_wt->wt_lock); + xn_remove_from_wait_for(db, &wf, thread); + return FAILED; + case XT_NO_LOCK: + xt_unlock_mutex_ns(&my_wt->wt_lock); + if (lw->lw_row_updated) { + xw_new.xw_xn_id = lw->lw_updating_xn_id; + xw = &xw_new; + } + if (xw) { + if (wf.wf_for_me_xn_id == xw->xw_xn_id) + on_wait_list = TRUE; + else + xn_remove_from_wait_for(db, &wf, thread); + goto wait_for_xact; + } + xn_remove_from_wait_for(db, &wf, thread); +//*BUG*/if (loop_count > spin_high) { spin_high = loop_count; printf("spin high = %d\n", spin_high); } + return OK; + case XT_TEMP_LOCK: + case XT_PERM_LOCK: + if (locking_xn_id != lw->lw_xn_id) { + /* Change the transaction that we are waiting for: */ + xt_unlock_mutex_ns(&my_wt->wt_lock); + xn_remove_from_wait_for(db, &wf, thread); + locking_xn_id = lw->lw_xn_id; + wf.wf_for_me_xn_id = lw->lw_xn_id; + if (!xn_add_to_wait_for(db, &wf, thread)) + return FAILED; + goto wait_for_locker_no_spin; + } + break; + } + + xt_timed_wait_cond_ns(&my_wt->wt_cond, &my_wt->wt_lock, WAIT_FOR_XACT_TIME); + } + + xt_unlock_mutex_ns(&my_wt->wt_lock); + } + + if (xw) { + xtThreadID tn_thd_id; + + wait_for_xact: + wf.wf_for_me_xn_id = xw->xw_xn_id; + + if (!xn_get_xact_pointer(db, xw->xw_xn_id, &wait_xact_ptr)) + /* The transaction was not found... */ + goto wait_done; + + if (wait_xact_ptr) { + /* This is a dirty read, but it should work! */ + flags = wait_xact_ptr->xd_flags; + start = wait_xact_ptr->xd_start_xn_id; + tn_thd_id = wait_xact_ptr->xd_thread_id; + } + else { + tn_thd_id = 0; + if (!xn_get_xact_details(db, xw->xw_xn_id, thread, &flags, &start, NULL, &tn_thd_id)) + flags = XT_XN_XAC_ENDED | XT_XN_XAC_SWEEP; + } + + if ((flags & XT_XN_XAC_ENDED) || start != xw->xw_xn_id) + /* The transaction has terminated! 
*/ + goto wait_done; + + /* Tell the thread we are waiting for it: */ + xn_wait_for_thread(thread->t_id, tn_thd_id); + + if (!on_wait_list) { + if (!xn_add_to_wait_for(db, &wf, thread)) + return FAILED; + on_wait_list = TRUE; + } + + /* The spinning version: */ + while (loop_count < WAIT_SPIN_COUNT) { + loop_count++; + + xt_critical_wait(); + + if (wait_xact_ptr) { + /* This is a dirty read, but it should work! */ + flags = wait_xact_ptr->xd_flags; + start = wait_xact_ptr->xd_start_xn_id; + } + else { + if (!xn_get_xact_details(db, xw->xw_xn_id, thread, &flags, &start, NULL, NULL)) + flags = XT_XN_XAC_ENDED | XT_XN_XAC_SWEEP; + } + + if ((flags & XT_XN_XAC_ENDED) || start != xw->xw_xn_id) + /* The transaction has terminated! */ + goto wait_done; + } + + /* The non-spinning version: + * + * I believe I can avoid missing the wakeup signal + * by locking before we check if the transaction + * is still running. + * + * Even though db->db_xn_wait_on_cond is a "dirty read". + * + * The reason is, before the signal is sent the + * lock is also acquired. This is not possible until + * this thread is safely sleeping. + */ + my_wt = &xn_wait_thread_array[thread->t_id]; + xt_lock_mutex_ns(&my_wt->wt_lock); + + for (;;) { + if (wait_xact_ptr) { + /* This is a dirty read, but it should work! */ + flags = wait_xact_ptr->xd_flags; + start = wait_xact_ptr->xd_start_xn_id; + } + else { + if (!xn_get_xact_details(db, xw->xw_xn_id, thread, &flags, &start, NULL, NULL)) + flags = XT_XN_XAC_ENDED | XT_XN_XAC_SWEEP; + } + + if ((flags & XT_XN_XAC_ENDED) || start != xw->xw_xn_id) + /* The transaction has terminated! 
*/ + break; + + xt_timed_wait_cond_ns(&my_wt->wt_cond, &my_wt->wt_lock, WAIT_FOR_XACT_TIME); + } + + xt_unlock_mutex_ns(&my_wt->wt_lock); + + wait_done: + if (on_wait_list) + xn_remove_from_wait_for(db, &wf, thread); + } + +//*BUG*/if (loop_count > spin_high) { spin_high = loop_count; printf("spin high = %d\n", spin_high); } + return OK; +} + +#else // XT_USE_SPINLOCK_WAIT_FOR +/* + * The given thread must wait for the specified transaction to terminate. This + * function places the transaction of the thread on a list of waiting threads. + * + * Before waiting we make a check for deadlocks. A deadlock occurs + * if waiting would introduce a cycle. + */ +xtPublic xtBool old_xt_xn_wait_for_xact(XTThreadPtr thread, xtXactID xn_id, xtBool will_retry, XTLockWaitFuncPtr pw_func, XTLockWaitPtr pw_data) +{ + XTDatabaseHPtr db = thread->st_database; + XNWaitForRec wf; + int flags = 0; + xtXactID start = 0; + + ASSERT_NS(thread->st_xact_data); + + thread->st_statistics.st_wait_for_xact++; + wf.wf_waiting_xn_id = thread->st_xact_data->xd_start_xn_id; + wf.wf_for_me_xn_id = xn_id; + wf.wf_thread_id = thread->t_id; + + xt_lock_mutex_ns(&db->db_xn_wait_lock); + +#ifdef TRACE_WAIT_FOR + xt_ttracef(thread, "T%lu -wait-> T%lu\n", (u_long) thread->st_xact_data->xd_start_xn_id, (u_long) xn_id); +#endif + for (;;) { + if (!xn_get_xact_details(db, xn_id, thread, &flags, &start, NULL, NULL)) + break; + + /* This is a dirty read, but it should work! */ + if ((flags & XT_XN_XAC_ENDED) || start != xn_id) + break; + + if (xn_detect_deadlock(db, wf.wf_waiting_xn_id, wf.wf_for_me_xn_id)) + goto failed; + + /* We will wait for this transaction... 
*/ + db->db_xn_wait_count++; + if (thread->st_xact_writer) + db->db_xn_writer_wait_count++; + + if (!xt_sl_insert(NULL, db->db_xn_wait_for, &wf.wf_waiting_xn_id, &wf)) { + db->db_xn_wait_count--; + goto failed; + } + + if (!xn_get_xact_details(db, xn_id, thread, &flags, &start, NULL, NULL)) { + xt_sl_delete(NULL, db->db_xn_wait_for, &wf.wf_waiting_xn_id); + db->db_xn_wait_count--; + if (thread->st_xact_writer) + db->db_xn_writer_wait_count--; + break; + } + + if ((flags & XT_XN_XAC_ENDED) || start != xn_id) { + xt_sl_delete(NULL, db->db_xn_wait_for, &wf.wf_waiting_xn_id); + db->db_xn_wait_count--; + if (thread->st_xact_writer) + db->db_xn_writer_wait_count--; + break; + } + + db->db_xn_post_wait[thread->t_id].pw_call_me = pw_func; + db->db_xn_post_wait[thread->t_id].pw_thread = thread; + db->db_xn_post_wait[thread->t_id].pw_data = pw_data; + + /* Timed wait because it is possible that transaction quits before + * we go to sleep. + */ + if (!xt_timed_wait_cond(NULL, &db->db_xn_wait_cond, &db->db_xn_wait_lock, 2 * 1000)) { + xt_sl_delete(NULL, db->db_xn_wait_for, &wf.wf_waiting_xn_id); + db->db_xn_wait_count--; + if (thread->st_xact_writer) + db->db_xn_writer_wait_count--; + goto failed; + } + + db->db_xn_post_wait[thread->t_id].pw_call_me = NULL; + xt_sl_delete(NULL, db->db_xn_wait_for, &wf.wf_waiting_xn_id); + db->db_xn_wait_count--; + if (thread->st_xact_writer) + db->db_xn_writer_wait_count--; + + if (will_retry) + break; + } + +#ifdef TRACE_WAIT_FOR + xt_ttracef(thread, "T%lu -wait-> T%lu DONE\n", (u_long) thread->st_xact_data->xd_start_xn_id, (u_long) xn_id); +#endif + xt_unlock_mutex_ns(&db->db_xn_wait_lock); + return OK; + + failed: +#ifdef TRACE_WAIT_FOR + xt_ttracef(self, "T%lu -wait-> T%lu FAILED\n", (u_long) self->st_xact_data->xd_start_xn_id, (u_long) xn_id); +#endif + xt_unlock_mutex_ns(&db->db_xn_wait_lock); + return FAILED; +} + +xtPublic void old_xt_xn_wakeup_transactions(XTDatabaseHPtr db, XTThreadPtr thread) +{ + u_int len; + XNWaitForPtr wf; + + 
xt_lock_mutex_ns(&db->db_xn_wait_lock); + /* The idea here is to release the oldest transactions + * first. Although this may not be completely fair + * it has the advantage that older transactions are + * encouraged to complete first. + * + * I have found the following problem with this test: + * runTest(INCREMENT_TEST, 16, INCREMENT_TEST_UPDATE_COUNT); + * with a bit of bad luck a transaction can be starved. + * This results in the sweeper stalling because it is + * waiting for an old transaction to quit so that + * it can continue. + * + * Because the sweeper is waiting, the number of + * versions of the record to be updated + * begins to increase. In the above test over + * 1600 transactions remain uncleaned. + * + * This means that there are 1600 versions of the + * row which must be scanned to find the most + * recent version. + */ + if ((len = (u_int) xt_sl_get_size(db->db_xn_wait_for))) { + for (u_int i=0; i<len; i++) { + wf = (XNWaitForPtr) xt_sl_item_at(db->db_xn_wait_for, i); + if (db->db_xn_post_wait[wf->wf_thread_id].pw_call_me) { + if (db->db_xn_post_wait[wf->wf_thread_id].pw_call_me(thread, &db->db_xn_post_wait[wf->wf_thread_id])) + db->db_xn_post_wait[wf->wf_thread_id].pw_call_me = NULL; + } + } + if (!xt_broadcast_cond_ns(&db->db_xn_wait_cond)) + xt_log_and_clear_exception_ns(); + } + ASSERT_NS(db->db_xn_wait_count == len); + xt_unlock_mutex_ns(&db->db_xn_wait_lock); +} +#endif // XT_USE_SPINLOCK_WAIT_FOR + +/* ---------------------------------------------------------------------- + * Utilities + */ + +//#define HIGH_X +#ifdef HIGH_X +u_long tot_alloced; +u_long high_alloced; +u_long not_clean_max; +u_long in_ram_max; +#endif + +static void xn_free_xact(XTDatabaseHPtr db, XTXactSegPtr seg, XTXactDataPtr xact) +{ +#ifdef HIGH_X + tot_alloced--; +#endif + /* This indicates the structure is free: */ + xact->xd_start_xn_id = 0; + if ((xtWord1 *) xact >= db->db_xn_data && (xtWord1 *) xact < db->db_xn_data_end) { + /* Put it in the free list: */ + 
xact->xd_next_xact = seg->xs_free_list; + seg->xs_free_list = xact; + return; + } + xt_free_ns(xact); +} + +/* + * GOTCHA: The value db->db_xn_curr_id may be a bit larger + * than the actual transaction created because there is + * a gap between the issue of the transaction ID + * and the creation of a memory structure. + * (indicated here: {GAP-INC-ADD-XACT}) + * + * This function returns the actual current transaction ID. + * This is the number of the last transaction actually + * created in memory. + * + * This means that if you call xt_xn_get_xact() with any + * number less than or equal to this value, not finding + * the transaction means it has already ended! + */ +xtPublic xtXactID xt_xn_get_curr_id(XTDatabaseHPtr db) +{ + int i; + xtXactID curr_xn_id; + register XTXactSegPtr seg = db->db_xn_idx; + + /* Find the highest transaction ID actually created... */ + curr_xn_id = seg->xs_last_xn_id; + seg++; + for (i=1; i<XT_XN_NO_OF_SEGMENTS; i++, seg++) { + if (xt_xn_is_before(curr_xn_id, seg->xs_last_xn_id)) + curr_xn_id = seg->xs_last_xn_id; + } + return curr_xn_id; +} + +xtPublic XTXactDataPtr xt_xn_add_old_xact(XTDatabaseHPtr db, xtXactID xn_id, XTThreadPtr thread __attribute__((unused))) +{ + register XTXactDataPtr xact; + register XTXactSegPtr seg; + register XTXactDataPtr *hash; + + seg = &db->db_xn_idx[xn_id & XT_XN_SEGMENT_MASK]; + XT_XACT_WRITE_LOCK(&seg->xs_tab_lock, thread); + hash = &seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE]; + xact = *hash; + while (xact) { + if (xact->xd_start_xn_id == xn_id) + goto done_ok; + xact = xact->xd_next_xact; + } + + if ((xact = seg->xs_free_list)) + seg->xs_free_list = xact->xd_next_xact; + else { + /* We have used up all the free transaction slots, + * the sweeper should work faster to free them + * up... 
+ */ + db->db_sw_faster |= XT_SW_NO_MORE_XACT_SLOTS; + if (!(xact = (XTXactDataPtr) xt_malloc_ns(sizeof(XTXactDataRec)))) { + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return NULL; + } + } + + xact->xd_next_xact = *hash; + *hash = xact; + + xact->xd_start_xn_id = xn_id; + xact->xd_end_xn_id = 0; + xact->xd_end_time = 0; + xact->xd_begin_log = 0; + xact->xd_flags = 0; + + /* Get the largest transaction id. */ + if (xt_xn_is_before(seg->xs_last_xn_id, xn_id)) + seg->xs_last_xn_id = xn_id; + + done_ok: + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); +#ifdef HIGH_X + tot_alloced++; + if (tot_alloced > high_alloced) + high_alloced = tot_alloced; +#endif + return xact; +} + +static XTXactDataPtr xn_add_new_xact(XTDatabaseHPtr db, xtXactID xn_id, XTThreadPtr thread __attribute__((unused))) +{ + register XTXactDataPtr xact; + register XTXactSegPtr seg; + register XTXactDataPtr *hash; + + seg = &db->db_xn_idx[xn_id & XT_XN_SEGMENT_MASK]; + XT_XACT_WRITE_LOCK(&seg->xs_tab_lock, thread); + hash = &seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE]; + + if ((xact = seg->xs_free_list)) + seg->xs_free_list = xact->xd_next_xact; + else { + /* We have used up all the free transaction slots, + * the sweeper should work faster to free them + * up... 
+ */ + db->db_sw_faster |= XT_SW_NO_MORE_XACT_SLOTS; + if (!(xact = (XTXactDataPtr) xt_malloc_ns(sizeof(XTXactDataRec)))) { + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return NULL; + } + } + + xact->xd_next_xact = *hash; + *hash = xact; + + xact->xd_thread_id = thread->t_id; + xact->xd_start_xn_id = xn_id; + xact->xd_end_xn_id = 0; + xact->xd_end_time = 0; + xact->xd_begin_log = 0; + xact->xd_flags = 0; + + seg->xs_last_xn_id = xn_id; + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); +#ifdef HIGH_X + tot_alloced++; + if (tot_alloced > high_alloced) + high_alloced = tot_alloced; +#endif + return xact; +} + +static xtBool xn_get_xact_details(XTDatabaseHPtr db, xtXactID xn_id, XTThreadPtr thread __attribute__((unused)), int *flags, xtXactID *start, xtWord4 *end, xtThreadID *thd_id) +{ + register XTXactSegPtr seg; + register XTXactDataPtr xact; + xtBool found = FALSE; + + seg = &db->db_xn_idx[xn_id & XT_XN_SEGMENT_MASK]; + XT_XACT_READ_LOCK(&seg->xs_tab_lock, thread); + xact = seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE]; + while (xact) { + if (xact->xd_start_xn_id == xn_id) { + found = TRUE; + if (flags) + *flags = xact->xd_flags; + if (start) + *start = xact->xd_start_xn_id; + if (end) + *end = xact->xd_end_time; + if (thd_id) + *thd_id = xact->xd_thread_id; + break; + } + xact = xact->xd_next_xact; + } + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return found; +} + +static xtBool xn_get_xact_pointer(XTDatabaseHPtr db, xtXactID xn_id, XTXactDataPtr *xact_ptr) +{ + register XTXactSegPtr seg; + register XTXactDataPtr xact; + xtBool found = FALSE; + + *xact_ptr = NULL; + seg = &db->db_xn_idx[xn_id & XT_XN_SEGMENT_MASK]; + XT_XACT_READ_LOCK(&seg->xs_tab_lock, thread); + xact = seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE]; + while (xact) { + if (xact->xd_start_xn_id == xn_id) { + found = TRUE; + /* We only return pointers to transaction structures that are permanently + * allocated! 
+ */ + if ((xtWord1 *) xact >= db->db_xn_data && (xtWord1 *) xact < db->db_xn_data_end) + *xact_ptr = xact; + break; + } + xact = xact->xd_next_xact; + } + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return found; +} + +static xtBool xn_get_xact_start(XTDatabaseHPtr db, xtXactID xn_id, XTThreadPtr thread __attribute__((unused)), xtLogID *log_id, xtLogOffset *log_offset) +{ + register XTXactSegPtr seg; + register XTXactDataPtr xact; + xtBool found = FALSE; + + seg = &db->db_xn_idx[xn_id & XT_XN_SEGMENT_MASK]; + XT_XACT_READ_LOCK(&seg->xs_tab_lock, thread); + xact = seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE]; + while (xact) { + if (xact->xd_start_xn_id == xn_id) { + found = TRUE; + *log_id = xact->xd_begin_log; + *log_offset = xact->xd_begin_offset; + break; + } + xact = xact->xd_next_xact; + } + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return found; +} + +/* NOTE: this function may only be used by the sweeper or the recovery process. */ +xtPublic XTXactDataPtr xt_xn_get_xact(XTDatabaseHPtr db, xtXactID xn_id, XTThreadPtr thread __attribute__((unused))) +{ + register XTXactSegPtr seg; + register XTXactDataPtr xact; + + seg = &db->db_xn_idx[xn_id & XT_XN_SEGMENT_MASK]; + XT_XACT_READ_LOCK(&seg->xs_tab_lock, thread); + xact = seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE]; + while (xact) { + if (xact->xd_start_xn_id == xn_id) + break; + xact = xact->xd_next_xact; + } + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return xact; +} + +/* + * Delete a transaction, return TRUE if the transaction + * was found. 
+ */ +xtPublic xtBool xt_xn_delete_xact(XTDatabaseHPtr db, xtXactID xn_id, XTThreadPtr thread __attribute__((unused))) +{ + XTXactDataPtr xact, pxact = NULL; + XTXactSegPtr seg; + + seg = &db->db_xn_idx[xn_id & XT_XN_SEGMENT_MASK]; + XT_XACT_WRITE_LOCK(&seg->xs_tab_lock, thread); + xact = seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE]; + while (xact) { + if (xact->xd_start_xn_id == xn_id) { + if (pxact) + pxact->xd_next_xact = xact->xd_next_xact; + else + seg->xs_table[(xn_id >> XT_XN_SEGMENT_SHIFTS) % XT_XN_HASH_TABLE_SIZE] = xact->xd_next_xact; + xn_free_xact(db, seg, xact); + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return TRUE; + } + pxact = xact; + xact = xact->xd_next_xact; + } + XT_XACT_UNLOCK(&seg->xs_tab_lock, thread); + return FALSE; +} + +//#define DEBUG_RAM_LIST +#ifdef DEBUG_RAM_LIST + +#define DEBUG_RAM_LIST_SIZE 80 + +int check_ram_init_count = 0; +xt_rwlock_type check_ram_lock; +xtXactID check_ram_trns[DEBUG_RAM_LIST_SIZE]; +int check_ram_dummy; + +static void check_ram_init(void) +{ + if (check_ram_init_count == 0) + xt_init_rwlock(NULL, &check_ram_lock); + check_ram_init_count++; +} + +static void check_ram_free(void) +{ + check_ram_init_count--; + if (check_ram_init_count == 0) + xt_free_rwlock(&check_ram_lock); +} + +static void check_ram_min_id(XTDatabaseHPtr db) +{ + int i; + + xt_slock_rwlock_ns(&check_ram_lock); + for (i=0; i<DEBUG_RAM_LIST_SIZE; i++) { + if (check_ram_trns[i] && xt_xn_is_before(check_ram_trns[i], db->db_xn_min_ram_id)) { + /* This should never happen! 
*/ + XTXactDataPtr x_ptr; + + check_ram_dummy = 0; + for (i=0; i<DEBUG_RAM_LIST_SIZE; i++) { + if (check_ram_trns[i]) { + x_ptr = xt_xn_get_xact(db, check_ram_trns[i]); + check_ram_dummy = 1; + } + } + break; + } + } + xt_unlock_rwlock_ns(&check_ram_lock); +} + +static void check_ram_add(xtXactID xn_id) +{ + int i; + + xt_xlock_rwlock_ns(&check_ram_lock); + for (i=0; i<DEBUG_RAM_LIST_SIZE; i++) { + if (!check_ram_trns[i]) { + check_ram_trns[i] = xn_id; + xt_unlock_rwlock_ns(&check_ram_lock); + return; + } + } + xt_unlock_rwlock_ns(&check_ram_lock); + printf("DEBUG --- List too small\n"); +} + +static void check_ram_del(xtXactID xn_id) +{ + int i; + + xt_xlock_rwlock_ns(&check_ram_lock); + for (i=0; i<DEBUG_RAM_LIST_SIZE; i++) { + if (check_ram_trns[i] == xn_id) { + check_ram_trns[i] = 0; + xt_unlock_rwlock_ns(&check_ram_lock); + return; + } + } + xt_unlock_rwlock_ns(&check_ram_lock); +} +#endif + +/* ---------------------------------------------------------------------- + * Init and Exit + */ + +xtPublic void xt_xn_init_db(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTXactDataPtr xact; + XTXactSegPtr seg; + +#ifdef DEBUG_RAM_LIST + check_ram_init(); +#endif + xt_spinlock_init_with_autoname(self, &db->db_xn_id_lock); + xt_spinlock_init_with_autoname(self, &db->db_xn_wait_spinlock); + //xt_init_mutex_with_autoname(self, &db->db_xn_wait_lock); + //xt_init_cond(self, &db->db_xn_wait_cond); + xt_init_mutex_with_autoname(self, &db->db_sw_lock); + xt_init_cond(self, &db->db_sw_cond); + xt_init_mutex_with_autoname(self, &db->db_wr_lock); + xt_init_cond(self, &db->db_wr_cond); + + /* Pre-allocate transaction data structures: */ + db->db_xn_data = (xtWord1 *) xt_malloc(self, sizeof(XTXactDataRec) * XT_XN_DATA_ALLOC_COUNT * XT_XN_NO_OF_SEGMENTS); + db->db_xn_data_end = db->db_xn_data + sizeof(XTXactDataRec) * XT_XN_DATA_ALLOC_COUNT * XT_XN_NO_OF_SEGMENTS; + xact = (XTXactDataPtr) db->db_xn_data; + for (u_int i=0; i<XT_XN_NO_OF_SEGMENTS; i++) { + seg = &db->db_xn_idx[i]; + 
XT_XACT_INIT_LOCK(self, &seg->xs_tab_lock); + for (u_int j=0; j<XT_XN_DATA_ALLOC_COUNT; j++) { + xact->xd_next_xact = seg->xs_free_list; + seg->xs_free_list = xact; + xact++; + } + } + + /* Initialize the data logs: */ + db->db_datalogs.dlc_init(self, db); + + /* Setup the transaction log: */ + db->db_xlog.xlog_setup(self, db, (off_t) xt_db_log_file_threshold, xt_db_transaction_buffer_size, xt_db_log_file_count); + + db->db_xn_end_time = 1; + + /* Initializing the restart file, also does + * recovery. This returns the log position after recovery. + * + * This is the log position where the writer thread will + * begin. The writer thread writes changes to the database that + * have been flushed to the log. + */ + xt_xres_init(self, db); + + /* Initialize the "last transaction in memory", by default + * this is the current transaction ID, which is the ID + * of the last transaction. + */ + for (u_int i=0; i<XT_XN_NO_OF_SEGMENTS; i++) { + seg = &db->db_xn_idx[i]; + XT_XACT_INIT_LOCK(self, &seg->xs_tab_lock); + seg->xs_last_xn_id = db->db_xn_curr_id; + } + + /* + * The next transaction to clean is the lowest transaction + * in memory: + */ + db->db_xn_to_clean_id = db->db_xn_min_ram_id; + + /* + * No transactions are running, so the minimum transaction + * ID is the next one to run: + */ + db->db_xn_min_run_id = db->db_xn_curr_id + 1; + + db->db_xn_wait_for = xt_new_sortedlist(self, sizeof(XNWaitForRec), 100, 50, xn_compare_wait_for, db, xn_free_wait_for, FALSE, FALSE); +} + +xtPublic void xt_xn_exit_db(XTThreadPtr self, XTDatabaseHPtr db) +{ +#ifdef HIGH_X + printf("=========> MOST TXs CURR ALLOC: %lu\n", tot_alloced); + printf("=========> MOST TXs HIGH ALLOC: %lu\n", high_alloced); + printf("=========> MAX TXs NOT CLEAN: %lu\n", not_clean_max); + printf("=========> MAX TXs IN RAM: %lu\n", in_ram_max); +#endif + + xt_stop_sweeper(self, db); // Should be done already! + xt_stop_writer(self, db); // Should be done already! 
+ + xt_xres_exit(self, db); + db->db_xlog.xlog_exit(self); + + db->db_datalogs.dlc_exit(self); + + for (u_int i=0; i<XT_XN_NO_OF_SEGMENTS; i++) { + XTXactSegPtr seg; + + seg = &db->db_xn_idx[i]; + for (u_int j=0; j<XT_XN_HASH_TABLE_SIZE; j++) { + XTXactDataPtr xact, nxact; + + xact = seg->xs_table[j]; + while (xact) { + nxact = xact->xd_next_xact; + xn_free_xact(db, seg, xact); + xact = nxact; + } + } + XT_XACT_FREE_LOCK(self, &seg->xs_tab_lock); + } + if (db->db_xn_wait_for) { + xt_free_sortedlist(self, db->db_xn_wait_for); + db->db_xn_wait_for = NULL; + } + if (db->db_xn_data) { + xt_free(self, db->db_xn_data); + db->db_xn_data = NULL; + db->db_xn_data_end = NULL; + } + + xt_free_cond(&db->db_wr_cond); + xt_free_mutex(&db->db_wr_lock); + xt_free_cond(&db->db_sw_cond); + xt_free_mutex(&db->db_sw_lock); + //xt_free_cond(&db->db_xn_wait_cond); + //xt_free_mutex(&db->db_xn_wait_lock); + xt_spinlock_free(self, &db->db_xn_wait_spinlock); + xt_spinlock_free(self, &db->db_xn_id_lock); +#ifdef DEBUG_RAM_LIST + check_ram_free(); +#endif +} + +xtPublic void xt_xn_init_thread(XTThreadPtr self, int what_for) +{ + ASSERT(self->st_database); + + if (!xt_init_row_lock_list(&self->st_lock_list)) + xt_throw(self); + switch (what_for) { + case XT_FOR_COMPACTOR: + self->st_dlog_buf.dlb_init(self->st_database, xt_db_log_buffer_size); + break; + case XT_FOR_WRITER: + /* The writer does not need a transaction buffer. 
*/ + self->st_dlog_buf.dlb_init(self->st_database, 0); + break; + case XT_FOR_SWEEPER: + self->st_dlog_buf.dlb_init(self->st_database, 0); + break; + case XT_FOR_USER: + self->st_dlog_buf.dlb_init(self->st_database, xt_db_log_buffer_size); + break; + } +} + +xtPublic void xt_xn_exit_thread(XTThreadPtr self) +{ + if (self->st_xact_data) + xt_xn_rollback(self); + self->st_dlog_buf.dlb_exit(self); + xt_exit_row_lock_list(&self->st_lock_list); +} + +/* ---------------------------------------------------------------------- + * Begin and End Transactions + */ + +xtPublic xtBool xt_xn_begin(XTThreadPtr self) +{ + XTDatabaseHPtr db = self->st_database; + xtXactID xn_id; + + ASSERT(!self->st_xact_data); + + xt_spinlock_lock(&db->db_xn_id_lock); + xn_id = ++db->db_xn_curr_id; + xt_spinlock_unlock(&db->db_xn_id_lock); + +#ifdef HIGH_X + if (xt_xn_is_before(not_clean_max, xn_id - db->db_xn_to_clean_id)) + not_clean_max = xn_id - db->db_xn_to_clean_id; + if (xt_xn_is_before(in_ram_max, xn_id - db->db_xn_min_ram_id)) + in_ram_max = xn_id - db->db_xn_min_ram_id; +#endif + /* {GAP-INC-ADD-XACT} This is the gap between incrementing the ID, + * and creating the transaction in memory. + * See xt_xn_get_curr_id(). 
+ */ + + if (!(self->st_xact_data = xn_add_new_xact(db, xn_id, self))) + return FAILED; + self->st_xact_writer = FALSE; + + /* All transactions that committed before or at this time + * are visible to this one: */ + self->st_visible_time = db->db_xn_end_time; + +#ifdef TRACE_TRANSACTION + xt_ttracef(self, "BEGIN T%lu\n", (u_long) self->st_xact_data->xd_start_xn_id); +#endif + return OK; +} + +static xtBool xn_end_xact(XTThreadPtr thread, u_int status) +{ + XTXactDataPtr xact; + xtBool ok = TRUE; + + ASSERT_NS(thread->st_xact_data); + if ((xact = thread->st_xact_data)) { + XTDatabaseHPtr db = thread->st_database; + xtXactID xn_id = xact->xd_start_xn_id; + xtBool writer; + + if ((writer = thread->st_xact_writer)) { + /* The transaction wrote something: */ + XTXactEndEntryDRec entry; + xtWord4 sum; + + sum = XT_CHECKSUM4_XACT(xn_id) ^ XT_CHECKSUM4_XACT(0); + entry.xe_status_1 = status; + entry.xe_checksum_1 = XT_CHECKSUM_1(sum); + XT_SET_DISK_4(entry.xe_xact_id_4, xn_id); + XT_SET_DISK_4(entry.xe_not_used_4, 0); + +#ifdef XT_IMPLEMENT_NO_ACTION + /* This will check any restricts that have been delayed to the end of the statement. */ + if (thread->st_restrict_list.bl_count) { + if (!xt_tab_restrict_rows(&thread->st_restrict_list, thread)) { + ok = FALSE; + status = XT_LOG_ENT_ABORT; + } + } +#endif + + /* Flush the data log: */ + if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) { + ok = FALSE; + status = XT_LOG_ENT_ABORT; + } + + /* Write and flush the transaction log: */ + if (!xt_xlog_log_data(thread, sizeof(XTXactEndEntryDRec), (XTXactLogBufferDPtr) &entry, TRUE)) { + ok = FALSE; + status = XT_LOG_ENT_ABORT; + /* Make sure this is done, if we failed to log + * the transaction end! + */ + if (thread->st_xact_writer) { + /* Adjust this in case of error, but don't forget + * to lock! 
+ */ + xt_spinlock_lock(&db->db_xlog.xl_buffer_lock); + db->db_xn_writer_count--; + thread->st_xact_writer = FALSE; + if (thread->st_xact_long_running) { + db->db_xn_long_running_count--; + thread->st_xact_long_running = FALSE; + } + xt_spinlock_unlock(&db->db_xlog.xl_buffer_lock); + } + } + + /* Setting this flag completes the transaction. + * Do this before we release the locks, because + * the unlocked transactions expect the + * transaction they are waiting for to be + * gone! + */ + xact->xd_end_time = ++db->db_xn_end_time; + if (status == XT_LOG_ENT_COMMIT) { + thread->st_statistics.st_commits++; + xact->xd_flags |= (XT_XN_XAC_COMMITTED | XT_XN_XAC_ENDED); + } + else { + thread->st_statistics.st_rollbacks++; + xact->xd_flags |= XT_XN_XAC_ENDED; + } + + /* {REMOVE-LOCKS} Drop locks if you have any: */ + thread->st_lock_list.xt_remove_all_locks(db, thread); + + /* Do this afterwards to make sure the sweeper + * does not start cleaning up this transaction + * before any transactions that were waiting for + * this transaction have completed! + */ + xact->xd_end_xn_id = db->db_xn_curr_id; + + /* Now you can sweep! 
*/ + xact->xd_flags |= XT_XN_XAC_SWEEP; + } + else { + /* Read-only transactions can be removed immediately */ + xact->xd_end_time = ++db->db_xn_end_time; + xact->xd_flags |= (XT_XN_XAC_COMMITTED | XT_XN_XAC_ENDED); + + /* Drop locks if you have any: */ + thread->st_lock_list.xt_remove_all_locks(db, thread); + + xact->xd_end_xn_id = db->db_xn_curr_id; + + xact->xd_flags |= XT_XN_XAC_SWEEP; + + if (xt_xn_delete_xact(db, xn_id, thread)) { + if (db->db_xn_min_ram_id == xn_id) + db->db_xn_min_ram_id = xn_id+1; + } + } + +#ifdef TRACE_TRANSACTION + if (status == XT_LOG_ENT_COMMIT) + xt_ttracef(thread, "COMMIT T%lu\n", (u_long) xn_id); + else + xt_ttracef(thread, "ABORT T%lu\n", (u_long) xn_id); +#endif + + if (db->db_xn_min_run_id == xn_id) + db->db_xn_min_run_id = xn_id+1; + + thread->st_xact_data = NULL; + + xt_xn_wakeup_waiting_threads(thread); + + /* {WAKE-SW} Waking the sweeper + * is no longer unconditional. + * (see all comments to {WAKE-SW}) + * + * We now wake the sweeper if it is + * supposed to work faster. + * + * There are now 2 cases: + * - We run out of transaction slots. + * - We encounter old index entries. + * + * The following test: + * runTest(INCREMENT_TEST, 16, INCREMENT_TEST_UPDATE_COUNT); + * has extreme problems with sweeping every 1/10s + * because a huge number of index entries accumulate + * that need to be cleaned. + * + * New code detects this case. + */ + if (db->db_sw_faster) + xt_wakeup_sweeper(db); + + /* Don't get too far ahead of the sweeper! 
*/ + if (writer) { + if ((db->db_sw_faster & XT_SW_TOO_FAR_BEHIND) != 0) { + xtWord8 then = xt_trace_clock() + (xtWord8) 20000; + + for (;;) { + xt_critical_wait(); + if (db->db_sw_faster & XT_SW_TOO_FAR_BEHIND) + break; + if (xt_trace_clock() >= then) + break; + } + } + } + } + return ok; +} + +xtPublic xtBool xt_xn_commit(XTThreadPtr thread) +{ + return xn_end_xact(thread, XT_LOG_ENT_COMMIT); +} + +xtPublic xtBool xt_xn_rollback(XTThreadPtr thread) +{ + return xn_end_xact(thread, XT_LOG_ENT_ABORT); +} + +xtPublic xtBool xt_xn_log_tab_id(XTThreadPtr self, xtTableID tab_id) +{ + XTXactNewTabEntryDRec entry; + + entry.xt_status_1 = XT_LOG_ENT_NEW_TAB; + entry.xt_checksum_1 = XT_CHECKSUM_1(tab_id); + XT_SET_DISK_4(entry.xt_tab_id_4, tab_id); + return xt_xlog_log_data(self, sizeof(XTXactNewTabEntryDRec), (XTXactLogBufferDPtr) &entry, TRUE); +} + +xtPublic int xt_xn_status(XTOpenTablePtr ot, xtXactID xn_id, xtRecordID XT_UNUSED(rec_id)) +{ + register XTThreadPtr self = ot->ot_thread; + int flags; + xtWord4 end; + +#ifdef DRIZZLED + /* Conditional waste of time! + * Drizzle has strict warnings. + * I know this is not necessary! + */ + flags = 0; + end = 0; +#endif + if (xn_id == self->st_xact_data->xd_start_xn_id) + return XT_XN_MY_UPDATE; + if (xt_xn_is_before(xn_id, self->st_database->db_xn_min_ram_id) || + !xn_get_xact_details(self->st_database, xn_id, ot->ot_thread, &flags, NULL, &end, NULL)) { + /* Not in RAM, rollback done: */ +//*DBG*/xt_dump_xlogs(self->st_database, 0); +//*DBG*/xt_check_table(self, ot); +//*DBG*/xt_dump_trace(); + /* {XACT-NOT-IN-RAM} + * This should never happen (CHANGED see below)! + * + * Because if the transaction is no longer in RAM, then it has been + * cleaned up. So the record should be marked as clean, or not + * exist. + * + * After sweeping, we wait for all transactions to quit that were + * running at the time of cleanup before removing the transaction record. 
+ * (see {XACT-NOT-IN-RAM}) + * + * If this were not the case, then we could be here because: + * - The user transaction (T2) reads record x and notes that the record + * has not been cleaned (CLEAN bit not set). + * + * - The sweeper is busy sweeping the transaction (T1) that created + * record x. + * The SW sets the CLEAN bit on record x, and then schedules T1 for + * deletion. + * + * Now T1 should not be deleted before T2 quits. If it does happen + * then we land up here. + * + * THIS CAN NOW HAPPEN! + * + * First of all, a MYSTERY: + * This did happen, despite the description above! The reason why + * is left as an exercise to the reader (really, I don't know why!) + * + * This has forced me to add code to handle the situation. This + * is done by re-reading the record that is being checked by this + * function. After re-reading, the record should either be + * invalid (free) or clean (CLEAN bit set). + * + * If this is the case, then we will not land up here + * again. + * + * Because we are only here because the record was valid but not + * clean (you can confirm this by looking at the code that + * calls this function). + */ + return XT_XN_REREAD; + } + if (!(flags & XT_XN_XAC_ENDED)) + /* Transaction not ended, may be visible. */ + return XT_XN_OTHER_UPDATE; + /* Visible if the transaction was committed: */ + if (flags & XT_XN_XAC_COMMITTED) { + if (!xt_xn_is_before(self->st_visible_time, end)) // was self->st_visible_time >= xact->xd_end_time + return XT_XN_VISIBLE; + return XT_XN_NOT_VISIBLE; + } + return XT_XN_ABORTED; +} + +xtPublic xtWord8 xt_xn_bytes_to_sweep(XTDatabaseHPtr db, XTThreadPtr thread) +{ + xtXactID xn_id; + xtXactID curr_xn_id; + xtLogID xn_log_id = 0; + xtLogOffset xn_log_offset = 0; + xtLogID x_log_id = 0; + xtLogOffset x_log_offset = 0; + xtLogID log_id; + xtLogOffset log_offset; + xtWord8 byte_count = 0; + + xn_id = db->db_xn_to_clean_id; + curr_xn_id = xt_xn_get_curr_id(db); + // Limit the number of transactions checked! 
+ for (int i=0; i<1000; i++) { + if (xt_xn_is_before(curr_xn_id, xn_id)) + break; + if (xn_get_xact_start(db, xn_id, thread, &x_log_id, &x_log_offset)) { + if (xn_log_id) { + if (xt_comp_log_pos(x_log_id, x_log_offset, xn_log_id, xn_log_offset) < 0) { + xn_log_id = x_log_id; + xn_log_offset = x_log_offset; + } + } + else { + xn_log_id = x_log_id; + xn_log_offset = x_log_offset; + } + } + xn_id++; + } + if (!xn_log_id) + return 0; + + /* Assume the logs have the threshold: */ + log_id = db->db_xlog.xl_write_log_id; + log_offset = db->db_xlog.xl_write_log_offset; + if (xn_log_id < log_id) { + if (xn_log_offset < xt_db_log_file_threshold) + byte_count = (size_t) (xt_db_log_file_threshold - xn_log_offset); + xn_log_offset = 0; + xn_log_id++; + } + while (xn_log_id < log_id) { + byte_count += (size_t) xt_db_log_file_threshold; + xn_log_id++; + } + if (xn_log_offset < log_offset) + byte_count += (size_t) (log_offset - xn_log_offset); + + return byte_count; +} + +/* ---------------------------------------------------------------------- + * S W E E P E R P R O C E S S + */ + +typedef struct XNSweeperState { + XTDatabaseHPtr ss_db; + XTXactSeqReadRec ss_seqread; + XTDataBufferRec ss_databuf; + u_int ss_call_cnt; + XTBasicQueueRec ss_to_free; + xtBool ss_flush_pending; + XTOpenTablePtr ss_ot; +} XNSweeperStateRec, *XNSweeperStatePtr; + +static XTOpenTablePtr xn_sw_get_open_table(XTThreadPtr self, XNSweeperStatePtr ss, xtTableID tab_id, int *r) +{ + if (ss->ss_ot) { + if (ss->ss_ot->ot_table->tab_id == tab_id) + return ss->ss_ot; + + xt_db_return_table_to_pool(self, ss->ss_ot); + ss->ss_ot = NULL; + } + + if (!ss->ss_ot) { + if (!(ss->ss_ot = xt_db_open_pool_table(self, ss->ss_db, tab_id, r, TRUE))) + return NULL; + } + + return ss->ss_ot; +} + +static void xn_sw_close_open_table(XTThreadPtr self, XNSweeperStatePtr ss) +{ + if (ss->ss_ot) { + xt_db_return_table_to_pool(self, ss->ss_ot); + ss->ss_ot = NULL; + } +} + +/* + * A thread can set a bit in db_sw_faster to make + * the 
sweeper go faster. + */ +static void xn_sw_could_go_faster(XTThreadPtr self, XTDatabaseHPtr db) +{ + if (db->db_sw_faster) { + if (!db->db_sw_fast) { + xt_set_priority(self, xt_db_sweeper_priority+1); + db->db_sw_fast = TRUE; + } + } +} + +static void xn_sw_go_slower(XTThreadPtr self, XTDatabaseHPtr db) +{ + if (db->db_sw_fast) { + xt_set_priority(self, xt_db_sweeper_priority); + db->db_sw_fast = FALSE; + } + db->db_sw_faster = XT_SW_WORK_NORMAL; +} + +/* Add a record to the "to free" queue. We note the current + * transaction at the time this is done. The record will + * only be freed once this transaction terminated, together + * with all transactions that started before it! + * + * The reason for this is that a sequential scan or some + * other operation may read a committed record which is no longer + * valid because it is no longer the latest variation (the first + * variation reachable from the row pointer). + * + * In this case, the sweeper will free the variation. + * If the variation is re-used and committed before + * the sequential scan or read completes, and by some + * fluke is used by the same record as previously, + * the system will think the record is valid + * again. + * + * Without re-reading the record the sequential + * scan or other read will find it on the variation list, and + * return the record data as if valid! + * + * ------------ 2008-01-03 + * + * An example of this is: + * + * Assume we have 3 records. + * The 3rd record is deleted, and committed. + * Before cleanup can be performed + * a sequential scan takes a copy of the records. + * + * Now assume a new insert is done before + * the sequential scan gets to the 3rd record. + * + * The insert allocates the 3rd row and 3rd record + * again. + * + * Now, when the sequential scan gets to the old copy of the 3rd record, + * this is valid because the row points to this record again. + * + * HOWEVER! 
I have now changed the sequential scan so that it accesses + * the records from the cache, without making a copy. + * + * This means that this problem cannot occur because the sequential scan + * always reads the current data from the cache. + * + * There is also no race condition (although no lock is taken), because + * the record is written before the row (see here [(5)]). + * + * This means that the row does not point to the record before the + * record has been modified. + * + * Once the record has been modified then the sequential scan will see + * that the record belongs to a new transaction. + * + * If the row pointer was set before the record was updated then a race + * condition would exist when the sequential scan reads the record + * after the insert has updated the row pointer but before it has + * changed the record. + * + * AS A RESULT: + * + * I believe I can remove the delayed free record! + * + * This means I can combine the REMOVE and FREE operations. + * + * This is good because this takes care of the problem + * that records are lost when: + * + * The server crashes when the delayed free list still has items on it. + * AND + * The transaction that freed the records has been cleaned, and this + * fact has been committed to the log. + * + * So I have removed the delay here: [(6)] + * + * ------------ 2008-12-03 + * + * This code to delay removal of records was finally removed (see above) + */ + +/* + * As above, but instead a transaction is added to the "to free" queue. + * + * It is important that transactions remain in memory until all + * currently running transactions have ended. This is because + * sequential and index scans have copies of old data. + * + * In the old data a record may not be indicated as cleaned. Such + * a record is considered invalid if the transaction is not in RAM. + * + * GOTCHA: + * + * And this problem is demonstrated by the following example + * which was derived from flush_table.test. 
+ * + * Each handler command below is a separate transaction. + * However the buffer is loaded by 'read first'. + * Depending on when cleanup occurs, records can disappear + * in some of the next commands. + * + * 2 solutions for the test. Use begin ... commit around + * handler open ... close. Or use analyze table t1 before + * open. analyze table waits for the sweeper to complete! + * + * create table dummy(table_id char(20) primary key); + * let $1=100; + * while ($1) + * { + * drop table if exists t1; + * create table t1(table_id char(20) primary key); + * insert into t1 values ('Record-01'); + * insert into t1 values ('Record-02'); + * insert into t1 values ('Record-03'); + * insert into t1 values ('Record-04'); + * insert into t1 values ('Record-05'); + * handler t1 open; + * handler t1 read first limit 1; + * handler t1 read next limit 1; + * handler t1 read next limit 1; + * handler t1 read next limit 1; + * handler t1 close; + * commit; + * dec $1; + * } + * + */ +#ifdef MUST_DELAY_REMOVE +static void xn_sw_add_xact_to_free(XTThreadPtr self, XNSweeperStatePtr ss, xtXactID xn_id) +{ + XNSWToFreeItemRec free_item; + + if ((ss->ss_to_free.bq_front - ss->ss_to_free.bq_back) >= XT_TN_MAX_TO_FREE) { + /* If the queue is full, try to free some items: + * We use the call count to avoid doing this every time, + * when the queue overflows! 
+ */ + if ((ss->ss_call_cnt % XT_TN_MAX_TO_FREE_CHECK) == 0) + /* GOTCHA: This call was not locking the sweeper, + * this could cause failure, of course: + */ + xn_sw_service_to_free(self, ss, TRUE); + ss->ss_call_cnt++; + } + + free_item.ri_wait_xn_id = ss->ss_db->db_xn_curr_id; + free_item.ri_tab_id = 0; + free_item.x.ri_xn_id = xn_id; + + xt_bq_add(self, &ss->ss_to_free, &free_item); +} +#endif + +static void xt_sw_delete_variations(XTThreadPtr self, XNSweeperStatePtr ss, XTOpenTablePtr ot, xtRecordID rec_id, xtRowID row_id, xtXactID xn_id) +{ + xtRecordID prev_var_rec_id; + + while (rec_id) { + switch (xt_tab_remove_record(ot, rec_id, ss->ss_databuf.db_data, &prev_var_rec_id, FALSE, row_id, xn_id)) { + case XT_ERR: + throw_(); + return; + case TRUE: + break; + } + rec_id = prev_var_rec_id; + } +} + +static void xt_sw_delete_variation(XTThreadPtr self, XNSweeperStatePtr ss, XTOpenTablePtr ot, xtRecordID rec_id, xtBool clean_delete, xtRowID row_id, xtXactID xn_id) +{ + xtRecordID prev_var_rec_id; + + switch (xt_tab_remove_record(ot, rec_id, ss->ss_databuf.db_data, &prev_var_rec_id, clean_delete, row_id, xn_id)) { + case XT_ERR: + throw_(); + return; + case TRUE: + break; + case FALSE: + break; + } +} + +/* Set rec_type to this value in order to force cleanup, without + * a check. + */ +#define XN_FORCE_CLEANUP XT_TAB_STATUS_FREED + +/* + * Read the record to be cleaned. Return TRUE if the cleanup has already been done. 
+ */ +static xtBool xn_sw_cleanup_done(XTThreadPtr self, XTOpenTablePtr ot, xtRecordID rec_id, xtXactID xn_id, u_int rec_type, u_int stat_id, xtRowID row_id, XTTabRecHeadDPtr rec_head) +{ + if (!xt_tab_get_rec_data(ot, rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) rec_head)) + throw_(); + + if (rec_type == XN_FORCE_CLEANUP) { + if (XT_REC_IS_FREE(rec_head->tr_rec_type_1)) + return TRUE; + } + else { + /* Transaction must match: */ + if (XT_GET_DISK_4(rec_head->tr_xact_id_4) != xn_id) + return TRUE; + + /* Record header must match expected value from + * log or clean has been done, or is not required. + * + * For example, it is not required if a record + * has been overwritten in a transaction. + */ + if (rec_head->tr_rec_type_1 != rec_type || + rec_head->tr_stat_id_1 != stat_id) + return TRUE; + + /* Row must match: */ + if (XT_GET_DISK_4(rec_head->tr_row_id_4) != row_id) + return TRUE; + } + + return FALSE; +} + +static void xn_sw_clean_indices(XTThreadPtr self __attribute__((unused)), XTOpenTablePtr ot, xtRecordID rec_id, xtRowID row_id, xtWord1 *rec_data, xtWord1 *rec_buffer) +{ + XTTableHPtr tab = ot->ot_table; + u_int cols_req; + XTIndexPtr *ind; + + if (!tab->tab_dic.dic_key_count) + return; + + cols_req = tab->tab_dic.dic_ind_cols_req; + if (XT_REC_IS_FIXED(rec_data[0])) + rec_buffer = rec_data + XT_REC_FIX_HEADER_SIZE; + else { + if (XT_REC_IS_VARIABLE(rec_data[0])) { + if (!myxt_load_row(ot, rec_data + XT_REC_FIX_HEADER_SIZE, rec_buffer, cols_req)) + goto failed; + } + else if (XT_REC_IS_EXT_DLOG(rec_data[0])) { + ASSERT(cols_req); + if (cols_req && cols_req <= tab->tab_dic.dic_fix_col_count) { + if (!myxt_load_row(ot, rec_data + XT_REC_EXT_HEADER_SIZE, rec_buffer, cols_req)) + goto failed; + } + else { + if (rec_data != ot->ot_row_rbuffer) + memcpy(ot->ot_row_rbuffer, rec_data, tab->tab_dic.dic_rec_size); + if (!xt_tab_load_ext_data(ot, rec_id, rec_buffer, cols_req)) + goto failed; + } + } + else + /* This is possible, the record has already been cleaned 
up. */ + return; + } + + ind = tab->tab_dic.dic_keys; + for (u_int i=0; i<tab->tab_dic.dic_key_count; i++, ind++) { + if (!xt_idx_update_row_id(ot, *ind, rec_id, row_id, rec_buffer)) + xt_log_and_clear_exception_ns(); + } + return; + + failed: + xt_log_and_clear_exception_ns(); +} + +/* + * Return TRUE if the cleanup was done. FAILED if cleanup could not be done + * because dictionary information is not available. + */ +static xtBool xn_sw_cleanup_variation(XTThreadPtr self, XNSweeperStatePtr ss, XTXactDataPtr xact, xtTableID tab_id, xtRecordID rec_id, u_int status, u_int rec_type, u_int stat_id, xtRowID row_id, xtWord1 *rec_buf) +{ + XTOpenTablePtr ot; + XTTableHPtr tab; + XTTabRecHeadDRec rec_head; + xtRecordID after_rec_id; + xtXactID xn_id; + int r; + + if (!(ot = xn_sw_get_open_table(self, ss, tab_id, &r))) { + /* The table no longer exists, consider cleanup done: */ + switch (r) { + case XT_TAB_NOT_FOUND: + break; + case XT_TAB_NO_DICTIONARY: + case XT_TAB_POOL_CLOSED: + return FALSE; + } + return TRUE; + } + + tab = ot->ot_table; + + /* Make sure the buffer is large enough! */ + xt_db_set_size(self, &ss->ss_databuf, (size_t) tab->tab_dic.dic_buf_size); + + xn_id = xact->xd_start_xn_id; + if (xact->xd_flags & XT_XN_XAC_COMMITTED) { + /* The transaction has been committed. Clean the record and + * remove variations no longer in use. 
+ */ + switch (status) { + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_UPDATE_FL_BG: + if (xn_sw_cleanup_done(self, ot, rec_id, xn_id, rec_type, stat_id, row_id, &rec_head)) + goto done_ok; + after_rec_id = XT_GET_DISK_4(rec_head.tr_prev_rec_id_4); + xt_sw_delete_variations(self, ss, ot, after_rec_id, row_id, xn_id); + rec_head.tr_rec_type_1 |= XT_TAB_STATUS_CLEANED_BIT; + XT_SET_NULL_DISK_4(rec_head.tr_prev_rec_id_4); + if (!xt_tab_put_log_op_rec_data(ot, XT_LOG_ENT_REC_CLEANED, 0, rec_id, offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE, (xtWord1 *) &rec_head)) + throw_(); + xn_sw_clean_indices(self, ot, rec_id, row_id, rec_buf, ss->ss_databuf.db_data); + break; + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_INSERT_FL_BG: { + /* POTENTIAL BUG 1: + * + * DROP TABLE IF EXISTS t1; + * CREATE TABLE t1 ( id int, name varchar(300)) engine=pbxt; + * + * begin; + * insert t1(id, name) values(1, "aaa"); + * update t1 set name=REPEAT('A', 300) where id = 1; + * commit; + * flush tables; + * select * from t1; + * + * Because the type of record changes, from VARIABLE to + * EXTENDED, the cleanup needs to take this into account. + * + * The new status value which is written here + * depends on the first write to the record. + * However, the second write changes the record status. + * + * Previously we used an OR function to write the bit and + * return the byte value of the result. + * + * The write function now checks the record to be written + * to make sure it matches the record that needs to be + * cleaned. So OR'ing the bit is no longer required. + * + * POTENTIAL BUG 2: + * + * We have changed this to fix the following bug: + * + * T1 starts + * T2 starts + * T2 inserts record 100 in row 50 + * T2 commits + * T1 updates row 50 and adds record 101 + * + * The sweeper does cleanup in order T1, T2, ... 
+ * + * The sweeper cleans T1 by removing record 100 from the + * row 50 variation list. + * This means that record 100 is free. + * + * The sweeper cleans T2 by marking record 100 as clean. + * !BUG! record 100 has already been freed! + * + * To avoid this we have to check a record before + * cleaning (as we do above for update in xn_sw_cleanup_done()) + * We check that the record is, in fact, the exact + * record that was inserted. + * + * This is now done by xt_tc_write_cond(). + */ + xtOpSeqNo op_seq; + + rec_head.tr_rec_type_1 = rec_type | XT_TAB_STATUS_CLEANED_BIT; + if(!tab->tab_recs.xt_tc_write_cond(self, ot->ot_rec_file, rec_id, rec_head.tr_rec_type_1, &op_seq, xn_id, row_id, stat_id, rec_type)) + /* The record was not updated by xt_tc_write_cond(), and does not need to be. */ + break; + if (!xt_xlog_modify_table(ot, XT_LOG_ENT_REC_CLEANED_1, op_seq, 0, rec_id, 1, &rec_head.tr_rec_type_1)) + throw_(); + xn_sw_clean_indices(self, ot, rec_id, row_id, rec_buf, ss->ss_databuf.db_data); + break; + } + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_DELETE_BG: + case XT_LOG_ENT_DELETE_FL_BG: + if (xn_sw_cleanup_done(self, ot, rec_id, xn_id, rec_type, stat_id, row_id, &rec_head)) + goto done_ok; + after_rec_id = XT_GET_DISK_4(rec_head.tr_prev_rec_id_4); + xt_sw_delete_variations(self, ss, ot, after_rec_id, row_id, xn_id); + xt_sw_delete_variation(self, ss, ot, rec_id, TRUE, row_id, xn_id); + if (row_id) { + if (!xt_tab_free_row(ot, tab, row_id)) + throw_(); + } + break; + } + } + else { + /* The transaction has been aborted. Remove the variation from the + * variation list. If this means the list is empty, then remove + * the record as well. 
+ */ + xtRecordID first_rec_id, next_rec_id, prev_rec_id; + XTTabRecHeadDRec prev_rec_head; + + if (xn_sw_cleanup_done(self, ot, rec_id, xn_id, rec_type, stat_id, row_id, &rec_head)) + goto done_ok; + + if (!row_id) + row_id = XT_GET_DISK_4(rec_head.tr_row_id_4); + after_rec_id = XT_GET_DISK_4(rec_head.tr_prev_rec_id_4); + if (!row_id) + goto unlink_done; + + /* Now remove the record from the variation list, + * (if it is still on the list). + */ + XT_TAB_ROW_WRITE_LOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], self); + + /* Find the variation before the variation we wish to remove: */ + if (!(xt_tab_get_row(ot, row_id, &first_rec_id))) + goto failed; + prev_rec_id = 0; + next_rec_id = first_rec_id; + while (next_rec_id != rec_id) { + if (!next_rec_id) { + /* The record was not found in the list (we are done) */ + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], self); + goto unlink_done; + } + if (!xt_tab_get_rec_data(ot, next_rec_id, sizeof(XTTabRecHeadDRec), (xtWord1 *) &prev_rec_head)) { + xt_log_and_clear_exception(self); + break; + } + prev_rec_id = next_rec_id; + next_rec_id = XT_GET_DISK_4(prev_rec_head.tr_prev_rec_id_4); + } + + if (next_rec_id == rec_id) { + /* The record was found on the list: */ + if (prev_rec_id) { + /* Unlink the deleted variation: + * I have found the following sequence: + * + * 17933 in use 1906112 + * 1906112 delete xact=2901 row=17933 prev=2419240 + * 2419240 delete xact=2899 row=17933 prev=2153360 + * 2153360 record-X C xact=2599 row=17933 prev=0 Xlog=151 Xoff=16824 Xsiz=100 + * + * Despite the following facts which should prevent chains from + * forming: + * + * --- Only one transaction can modify a row + * at any one time. So it is not possible for a new change + * to be linked onto an uncommitted change. + * + * --- Transactions that modify the same row + * twice do not allocate a new record for each change. + * + * -- A change that has been + * rolled back will not be linked onto. 
Instead + * the new transaction will link to the last + * committed record. + * + * So if the sweeper is slow in doing its job + * we can have the situation that a number of records + * can refer to the last committed record of the + * row. + * + * Only one will be referenced by the row pointer. + * + * The others will all have been rolled back. + * This occurs over here: [(4)] + */ + XT_SET_DISK_4(prev_rec_head.tr_prev_rec_id_4, after_rec_id); + if (!xt_tab_put_log_op_rec_data(ot, XT_LOG_ENT_REC_UNLINKED, 0, prev_rec_id, offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE, (xtWord1 *) &prev_rec_head)) + goto failed; + } + else { + /* Variation to be removed at the front of the list. */ + ASSERT(rec_id == first_rec_id); + if (after_rec_id) { + /* Unlink the deleted variation, from the front of the list: */ + if (!xt_tab_set_row(ot, XT_LOG_ENT_ROW_SET, row_id, after_rec_id)) + goto failed; + } + else { + /* No more variations, remove the row: */ + if (!xt_tab_free_row(ot, tab, row_id)) + goto failed; + } + } + } + + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], self); + + /* Note: even when not found on the row list, the record must still + * be freed. + * + * There might be an exception to this, but there are very definite + * cases where this is required, for example when an unreferenced + * record is found and added to the clean up list xn_add_cu_record(). + */ + + unlink_done: + /* Delete the extended record and index entries: + * + * NOTE! This must be done after we have released the row lock. Because + * a thread that does a duplicate check locks the index, and then + * checks whether a row is valid, and can deadlock with + * code that locks a row, then an index! + * + * However, this should all be OK, because the variation has been removed from the + * row variation list at this stage, and now just needs to be deleted. 
+ */ + xt_sw_delete_variation(self, ss, ot, rec_id, FALSE, row_id, xn_id); + } + + done_ok: + return OK; + + failed: + XT_TAB_ROW_UNLOCK(&tab->tab_row_rwlock[row_id % XT_ROW_RWLOCKS], self); + throw_(); + return FAILED; +} + +/* Go through all updated records of a transaction and clean up. + * This means, if the transaction was aborted, then all the variations written + * by the transaction must be removed. + * If the transaction was committed then we remove older variations. + * If a delete was committed this can lead to the row being removed. + * + * After a transaction has been cleaned it can be removed from RAM. + * If this was the last transaction in a log, and the log has reached + * threshold, and the log is no longer in exclusive use, then the log + * can be deleted. + * + * This function returns OK if the transaction was cleaned up, FALSE + * if a retry is required. Otherwise an error is thrown. + */ +static xtBool xn_sw_cleanup_xact(XTThreadPtr self, XNSweeperStatePtr ss, XTXactDataPtr xact) +{ + XTDatabaseHPtr db = ss->ss_db; + XTXactLogBufferDPtr record; + xtTableID tab_id; + xtRecordID rec_id; + xtXactID xn_id; + xtRowID row_id; + + if (!db->db_xlog.xlog_seq_start(&ss->ss_seqread, xact->xd_begin_log, xact->xd_begin_offset, FALSE)) + xt_throw(self); + + for (;;) { + if (self->t_quit) + return FAILED; + + xn_sw_could_go_faster(self, db); + + if (!db->db_xlog.xlog_seq_next(&ss->ss_seqread, &record, FALSE, self)) + xt_throw(self); + if (!record) { + /* Recovered transactions are considered cleaned when we + * reach the end of the transaction log. + * This is required, because transactions that do + * not have a commit (or rollback) record, because they were + * running when the server last went down, will otherwise not + * have the cleanup completed!! 
+ */ + ASSERT(xact->xd_flags & XT_XN_XAC_RECOVERED); + if (!(xact->xd_flags & XT_XN_XAC_RECOVERED)) + return FAILED; + goto cleanup_done; + } + switch (record->xh.xh_status_1) { + case XT_LOG_ENT_NEW_LOG: + if (!db->db_xlog.xlog_seq_start(&ss->ss_seqread, XT_GET_DISK_4(record->xl.xl_log_id_4), 0, FALSE)) + xt_throw(self); + break; + case XT_LOG_ENT_COMMIT: + case XT_LOG_ENT_ABORT: + xn_id = XT_GET_DISK_4(record->xe.xe_xact_id_4); + if (xn_id == xact->xd_start_xn_id) + goto cleanup_done; + break; + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_DELETE_BG: + xn_id = XT_GET_DISK_4(record->xu.xu_xact_id_4); + if (xn_id != xact->xd_start_xn_id) + break; + tab_id = XT_GET_DISK_4(record->xu.xu_tab_id_4); + rec_id = XT_GET_DISK_4(record->xu.xu_rec_id_4); + row_id = XT_GET_DISK_4(record->xu.xu_row_id_4); + if (!xn_sw_cleanup_variation(self, ss, xact, tab_id, rec_id, record->xu.xu_status_1, record->xu.xu_rec_type_1, record->xu.xu_stat_id_1, row_id, &record->xu.xu_rec_type_1)) + return FAILED; + break; + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE_FL_BG: + xn_id = XT_GET_DISK_4(record->xf.xf_xact_id_4); + if (xn_id != xact->xd_start_xn_id) + break; + tab_id = XT_GET_DISK_4(record->xf.xf_tab_id_4); + rec_id = XT_GET_DISK_4(record->xf.xf_rec_id_4); + row_id = XT_GET_DISK_4(record->xf.xf_row_id_4); + if (!xn_sw_cleanup_variation(self, ss, xact, tab_id, rec_id, record->xf.xf_status_1, record->xf.xf_rec_type_1, record->xf.xf_stat_id_1, row_id, &record->xf.xf_rec_type_1)) + return FAILED; + break; + default: + break; + } + } + + cleanup_done: + /* Write the log to indicate the transaction has been cleaned: */ + XTXactCleanupEntryDRec cu; + + cu.xc_status_1 = XT_LOG_ENT_CLEANUP; + cu.xc_checksum_1 = 
XT_CHECKSUM_1(XT_CHECKSUM4_XACT(xact->xd_start_xn_id)); + XT_SET_DISK_4(cu.xc_xact_id_4, xact->xd_start_xn_id); + + if (!xt_xlog_log_data(self, sizeof(XTXactCleanupEntryDRec), (XTXactLogBufferDPtr) &cu, FALSE)) + return FAILED; + + ss->ss_flush_pending = TRUE; + + xact->xd_flags |= XT_XN_XAC_CLEANED; + ASSERT(db->db_xn_to_clean_id == xact->xd_start_xn_id); +#ifdef MUST_DELAY_REMOVE + xn_sw_add_xact_to_free(self, ss, xact->xd_start_xn_id); +#else + xn_id = xact->xd_start_xn_id; + if (xt_xn_delete_xact(db, xn_id, self)) { + /* Recalculate the minimum memory transaction: */ + ASSERT(!xt_xn_is_before(xn_id, db->db_xn_min_ram_id)); + + if (db->db_xn_min_ram_id == xn_id) { + db->db_xn_min_ram_id = xn_id+1; + } + else { + xtXactID xn_curr_xn_id = xt_xn_get_curr_id(db); + + while (!xt_xn_is_before(xn_curr_xn_id, db->db_xn_min_ram_id)) { // was db->db_xn_min_ram_id <= xn_curr_xn_id + /* db_xn_min_ram_id may be changed, by some other process! */ + xn_id = db->db_xn_min_ram_id; + if (xn_get_xact_details(db, xn_id, self, NULL, NULL, NULL, NULL)) + break; + db->db_xn_min_ram_id = xn_id+1; + } + } + } +#endif + + return OK; +} + +static void xn_free_sw_state(XTThreadPtr self, XNSweeperStatePtr ss) +{ + xn_sw_close_open_table(self, ss); + if (ss->ss_db) + ss->ss_db->db_xlog.xlog_seq_exit(&ss->ss_seqread); + xt_db_set_size(self, &ss->ss_databuf, 0); + xt_bq_set_size(self, &ss->ss_to_free, 0); +} + +static void xn_sw_main(XTThreadPtr self) +{ + XTDatabaseHPtr db = self->st_database; + XNSweeperStatePtr ss; + XTXactDataPtr xact, xact2; + time_t idle_start = 0; + xtXactID curr_id; + + xt_set_priority(self, xt_db_sweeper_priority); + + alloczr_(ss, xn_free_sw_state, sizeof(XNSweeperStateRec), XNSweeperStatePtr); + ss->ss_db = db; + + if (!db->db_xlog.xlog_seq_init(&ss->ss_seqread, xt_db_log_buffer_size, FALSE)) + xt_throw(self); + + ss->ss_to_free.bq_item_size = sizeof(XNSWToFreeItemRec); + ss->ss_to_free.bq_max_waste = XT_TN_MAX_TO_FREE_WASTE; + ss->ss_to_free.bq_item_inc = 
XT_TN_MAX_TO_FREE_INC; + ss->ss_call_cnt = 0; + ss->ss_flush_pending = FALSE; + + while (!self->t_quit) { + while (!self->t_quit) { + /* We are just about to check the condition for sleeping, + * so if the condition for sleeping holds, then we will + * exit the loop and sleep. + * + * We will then sleep if nobody sets the flag before we + * actually do sleep! + */ + curr_id = xt_xn_get_curr_id(db); + if (xt_xn_is_before(curr_id, db->db_xn_to_clean_id)) { + db->db_sw_faster &= ~XT_SW_TOO_FAR_BEHIND; + break; + } + /* {TUNING} How far do we allow the sweeper to get behind? + * The higher this is, the higher burst performance can + * be. But too high and the sweeper falls out of reading the + * transaction log cache, and also starts to spread + * changes around in index and data blocks that are no + * longer hot. + */ + if (curr_id - db->db_xn_to_clean_id > 250) + db->db_sw_faster |= XT_SW_TOO_FAR_BEHIND; + else + db->db_sw_faster &= ~XT_SW_TOO_FAR_BEHIND; + xn_sw_could_go_faster(self, db); + idle_start = 0; + + if ((xact = xt_xn_get_xact(db, db->db_xn_to_clean_id, self))) { + xtXactID xn_id; + + if (!(xact->xd_flags & XT_XN_XAC_SWEEP)) + /* Transaction has not yet ended, so it is not ready to sweep. */ + goto sleep; + + /* Check if we can clean up the transaction. + * We do this by checking to see if there is any running + * transaction which started before the end of this transaction. + */ + xn_id = xact->xd_start_xn_id; + while (xt_xn_is_before(xn_id, xact->xd_end_xn_id)) { + xn_id++; + if ((xact2 = xt_xn_get_xact(db, xn_id, self))) { + if (!(xact2->xd_flags & XT_XN_XAC_ENDED)) { + /* A transaction was started before the end of + * the transaction we wish to sweep, and this + * transaction has not committed, so we have to + * wait. + */ + db->db_stat_sweep_waits++; + goto sleep; + } + } + } + + /* Can clean up the transaction, and move to the next. 
*/ + if (xact->xd_flags & XT_XN_XAC_LOGGED) { +#ifdef TRACE_SWEEPER_ACTIVITY + printf("SWEEPER: cleanup %d\n", (int) xact->xd_start_xn_id); +#endif + if (!xn_sw_cleanup_xact(self, ss, xact)) { + /* We failed to clean (try again later)... */ +#ifdef TRACE_SWEEPER_ACTIVITY + printf("SWEEPER: cleanup retry...\n", (int) xact->xd_start_xn_id); +#endif + goto sleep; + } +#ifdef TRACE_SWEEPER_ACTIVITY + printf("SWEEPER: cleanup DONE\n", (int) xact->xd_start_xn_id); +#endif + } + else { + /* This was a read-only transaction, it is safe to + * just remove the transaction structure from memory. + * (should not be necessary because RO transactions + * do this themselves): + */ + if (xt_xn_delete_xact(db, db->db_xn_to_clean_id, self)) { + if (db->db_xn_min_ram_id == db->db_xn_to_clean_id) + db->db_xn_min_ram_id = db->db_xn_to_clean_id+1; + } + } + } + + /* Move on to clean the next: */ + db->db_xn_to_clean_id++; + } + + sleep: + + xn_sw_close_open_table(self, ss); + + xn_sw_go_slower(self, db); + + /* Shrink the free list, if it is empty, and larger than + * the default: + */ + if (ss->ss_to_free.bq_size > XT_TN_MAX_TO_FREE) { + if (ss->ss_to_free.bq_front == 0 && ss->ss_to_free.bq_back == 0) + xt_bq_set_size(self, &ss->ss_to_free, XT_TN_MAX_TO_FREE); + } + + /* Windows: close the log file that we have open for reading, if we + * read past the end of the log on the last transaction. + * This makes sure that the log is closed when the checkpointer + * tries to remove or rename it!! + */ + if (ss->ss_seqread.xseq_log_file) { + if (ss->ss_seqread.xseq_rec_log_id != ss->ss_seqread.xseq_log_id) + db->db_xlog.xlog_seq_close(&ss->ss_seqread); + } + + if (ss->ss_flush_pending) { + /* Flush pending means we have written something to the log. + * + * If so, we flush the log so that the writer will also do + * its work! + * + * This will lead to the freer continuing if it is waiting. 
+ */ + + time_t now = time(NULL); + if (idle_start) { + /* By default, we wait for 2 seconds idle time, then + * we flush the log. + */ + if (now >= idle_start + 2) { + if (!xt_xlog_flush_log(self)) + xt_throw(self); + ss->ss_flush_pending = FALSE; + } + } + else + idle_start = now; + } + + /* {WAKE-SW} Waking up the sweeper is very expensive! + * Cost is 3% of execution time on the test: + * runTest(SMALL_SELECT_TEST, 2, 100000) + * + * On the other hand, polling every 1/10 second + * is cheap, because the check for transactions + * ready for cleanup is very quick. + * + * So this is the preferred method. + */ + xn_sw_wait_for_xact(self, db, 10); + } + + if (ss->ss_flush_pending) { + xt_xlog_flush_log(self); + ss->ss_flush_pending = FALSE; + } + + freer_(); // xn_free_sw_state(ss) +} + +static void *xn_sw_run_thread(XTThreadPtr self) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) self->t_data; + int count; + void *mysql_thread; + + mysql_thread = myxt_create_thread(); + + while (!self->t_quit) { + try_(a) { + /* + * The garbage collector requires that the database + * is in use. + */ + xt_use_database(self, db, XT_FOR_SWEEPER); + + /* This action is both safe and required: + * + * safe: releasing the database is safe because as + * long as this thread is running the database + * reference is valid, and this reference cannot + * be the only one to the database because + * otherwise this thread would not be running. + * + * required: releasing the database is necessary + * otherwise we cannot close the database + * correctly because we only shut down this + * thread when the database is closed and we + * only close the database when all references + * are removed. + */ + xt_heap_release(self, self->st_database); + + xn_sw_main(self); + } + catch_(a) { + /* This error is "normal"! 
*/ + if (self->t_exception.e_xt_err != XT_ERR_NO_DICTIONARY && + !(self->t_exception.e_xt_err == XT_SIGNAL_CAUGHT && + self->t_exception.e_sys_err == SIGTERM)) + xt_log_and_clear_exception(self); + } + cont_(a); + + /* Avoid releasing the database (done above) */ + self->st_database = NULL; + xt_unuse_database(self, self); + + /* After an exception, pause before trying again... */ + /* Number of seconds */ +#ifdef DEBUG + count = 10; +#else + count = 2*60; +#endif + db->db_sw_idle = XT_THREAD_INERR; + while (!self->t_quit && count > 0) { + sleep(1); + count--; + } + db->db_sw_idle = XT_THREAD_BUSY; + } + + myxt_destroy_thread(mysql_thread, TRUE); + return NULL; +} + +static void xn_sw_free_thread(XTThreadPtr self, void *data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) data; + + if (db->db_sw_thread) { + xt_lock_mutex(self, &db->db_sw_lock); + pushr_(xt_unlock_mutex, &db->db_sw_lock); + db->db_sw_thread = NULL; + freer_(); // xt_unlock_mutex(&db->db_sw_lock) + } +} + +/* Wait for a transaction to quit: */ +static void xn_sw_wait_for_xact(XTThreadPtr self, XTDatabaseHPtr db, u_int hsecs) +{ + xt_lock_mutex(self, &db->db_sw_lock); + pushr_(xt_unlock_mutex, &db->db_sw_lock); + db->db_sw_idle = XT_THREAD_IDLE; + if (!self->t_quit && !db->db_sw_faster) + xt_timed_wait_cond(self, &db->db_sw_cond, &db->db_sw_lock, hsecs * 10); + db->db_sw_idle = XT_THREAD_BUSY; + db->db_sw_check_count++; + freer_(); // xt_unlock_mutex(&db->db_sw_lock) +} + +xtPublic void xt_start_sweeper(XTThreadPtr self, XTDatabaseHPtr db) +{ + char name[PATH_MAX]; + + sprintf(name, "SW-%s", xt_last_directory_of_path(db->db_main_path)); + xt_remove_dir_char(name); + db->db_sw_thread = xt_create_daemon(self, name); + xt_set_thread_data(db->db_sw_thread, db, xn_sw_free_thread); + xt_run_thread(self, db->db_sw_thread, xn_sw_run_thread); +} + +xtPublic void xt_wait_for_sweeper(XTThreadPtr self, XTDatabaseHPtr db, int abort_time) +{ + time_t then, now; + xtBool message = FALSE; + + if (db->db_sw_thread) { + 
then = time(NULL); + while (!xt_xn_is_before(xt_xn_get_curr_id(db), db->db_xn_to_clean_id)) { // was db->db_xn_to_clean_id <= xt_xn_get_curr_id(db) + xt_lock_mutex(self, &db->db_sw_lock); + pushr_(xt_unlock_mutex, &db->db_sw_lock); + xt_wakeup_sweeper(db); + freer_(); // xt_unlock_mutex(&db->db_sw_lock) + xt_sleep_milli_second(10); + now = time(NULL); + if (abort_time && now >= then + abort_time) { + xt_logf(XT_NT_INFO, "Aborting wait for '%s' sweeper\n", db->db_name); + message = FALSE; + break; + } + if (now >= then + 2) { + if (!message) { + message = TRUE; + xt_logf(XT_NT_INFO, "Waiting for '%s' sweeper...\n", db->db_name); + } + } + } + + if (message) + xt_logf(XT_NT_INFO, "Sweeper '%s' done.\n", db->db_name); + } +} + +xtPublic void xt_stop_sweeper(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTThreadPtr thr_sw; + + if (db->db_sw_thread) { + xt_lock_mutex(self, &db->db_sw_lock); + pushr_(xt_unlock_mutex, &db->db_sw_lock); + + /* This pointer is safe as long as you have the transaction lock. */ + if ((thr_sw = db->db_sw_thread)) { + xtThreadID tid = thr_sw->t_id; + + /* Make sure the thread quits when woken up. */ + xt_terminate_thread(self, thr_sw); + + xt_wakeup_sweeper(db); + + freer_(); // xt_unlock_mutex(&db->db_sw_lock) + + /* + * GOTCHA: This is a weird thing but the SIGTERM directed + * at a particular thread (in this case the sweeper) was + * being caught by a different thread and killing the server + * sometimes. Disconcerting. + * (this may only be a problem on Mac OS X) + xt_kill_thread(thread); + */ + xt_wait_for_thread(tid, FALSE); + + /* PMC - It should not be necessary to set the signal here, but in the + * debugger the handler is not called!!? + thr_sw->t_delayed_signal = SIGTERM; + xt_kill_thread(thread); + */ + db->db_sw_thread = NULL; + } + else + freer_(); // xt_unlock_mutex(&db->db_sw_lock) + } +} + +xtPublic void xt_wakeup_sweeper(XTDatabaseHPtr db) +{ + /* This flag makes the gap for the race condition + * very small. 
+ * + * However, this possibility still remains because + * we do not lock the mutex db_sw_lock here. + * + * The reason is that it is too expensive. + * + * In the event that the wakeup is missed the sleeper + * wait will time out eventually. + */ + if (db->db_sw_idle) { + if (!xt_broadcast_cond_ns(&db->db_sw_cond)) + xt_log_and_clear_exception_ns(); + } +} diff --git a/storage/pbxt/src/xaction_xt.h b/storage/pbxt/src/xaction_xt.h new file mode 100644 index 00000000000..9a651fc2532 --- /dev/null +++ b/storage/pbxt/src/xaction_xt.h @@ -0,0 +1,184 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2005-04-10 Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_xaction_h__ +#define __xt_xaction_h__ + +#include "filesys_xt.h" +#include "lock_xt.h" + +struct XTThread; +struct XTDatabase; +struct XTOpenTable; + +#ifdef DEBUG +//#define XT_USE_XACTION_DEBUG_SIZES +#endif + +#ifdef XT_USE_XACTION_DEBUG_SIZES + +#define XT_XN_DATA_ALLOC_COUNT 400 +#define XT_XN_SEGMENT_SHIFTS 1 +#define XT_XN_HASH_TABLE_SIZE 31 +#define XT_TN_NUMBER_INCREMENT 20 +#define XT_TN_MAX_TO_FREE 20 +#define XT_TN_MAX_TO_FREE_WASTE 3 +#define XT_TN_MAX_TO_FREE_CHECK 3 +#define XT_TN_MAX_TO_FREE_INC 3 + +#else + +#define XT_XN_DATA_ALLOC_COUNT 1250 // Number of pre-allocated transaction data structures per segment +#define XT_XN_SEGMENT_SHIFTS 5 // (32) +#define XT_XN_HASH_TABLE_SIZE 1279 // This is a prime number! +#define XT_TN_NUMBER_INCREMENT 100 // The increment of the transaction number on restart +#define XT_TN_MAX_TO_FREE 800 // The maximum size of the "to free" list +#define XT_TN_MAX_TO_FREE_WASTE 400 +#define XT_TN_MAX_TO_FREE_CHECK 100 // Once we have exceeded the limit, we only try in intervals +#define XT_TN_MAX_TO_FREE_INC 100 + +#endif + +#define XT_XN_NO_OF_SEGMENTS (1 << XT_XN_SEGMENT_SHIFTS) +#define XT_XN_SEGMENT_MASK (XT_XN_NO_OF_SEGMENTS - 1) + +#define XT_XN_XAC_LOGGED 1 +#define XT_XN_XAC_ENDED 2 /* The transaction has ended. */ +#define XT_XN_XAC_COMMITTED 4 /* The transaction was committed. */ +#define XT_XN_XAC_CLEANED 8 /* The transaction has been cleaned. */ +#define XT_XN_XAC_RECOVERED 16 /* This transaction was detected on recovery. */ +#define XT_XN_XAC_SWEEP 32 /* End ID has been set, OK to sweep. */ + +#define XT_XN_VISIBLE 0 /* The transaction is committed, and the record is visible. 
*/ +#define XT_XN_NOT_VISIBLE 1 /* The transaction is committed, but not visible. */ +#define XT_XN_ABORTED 2 /* Transaction was aborted. */ +#define XT_XN_MY_UPDATE 3 /* The record was updated by me. */ +#define XT_XN_OTHER_UPDATE 4 /* The record was updated by someone else. */ +#define XT_XN_REREAD 5 /* The transaction is no longer in RAM, status is unknown, retry. */ + +typedef struct XTXactData { + xtXactID xd_start_xn_id; /* Note: may be zero! */ + xtXactID xd_end_xn_id; /* Note: may be zero! */ + + /* The begin position: */ + xtLogID xd_begin_log; /* Non-zero if begin has been logged. */ + xtLogOffset xd_begin_offset; + int xd_flags; + xtWord4 xd_end_time; + xtThreadID xd_thread_id; + + /* A transaction may be indexed twice in the hash table. + * Once on the start sequence number, and once on the + * end sequence number. + */ + struct XTXactData *xd_next_xact; /* Next pointer in the hash table, also used by the free list. */ + +} XTXactDataRec, *XTXactDataPtr; + +#define XT_XACT_USE_SPINLOCK + +#ifdef XT_XACT_USE_FASTWRLOCK +#define XT_XACT_LOCK_TYPE XTFastRWLockRec +#define XT_XACT_INIT_LOCK(s, i) xt_fastrwlock_init(s, i) +#define XT_XACT_FREE_LOCK(s, i) xt_fastrwlock_free(s, i) +#define XT_XACT_READ_LOCK(i, s) xt_fastrwlock_slock(i, s) +#define XT_XACT_WRITE_LOCK(i, s) xt_fastrwlock_xlock(i, s) +#define XT_XACT_UNLOCK(i, s) xt_fastrwlock_unlock(i, s) +#elif defined(XT_XACT_USE_PTHREAD_RW) +#define XT_XACT_LOCK_TYPE xt_rwlock_type +#define XT_XACT_INIT_LOCK(s, i) xt_init_rwlock(s, i) +#define XT_XACT_FREE_LOCK(s, i) xt_free_rwlock(i) +#define XT_XACT_READ_LOCK(i, s) xt_slock_rwlock_ns(i) +#define XT_XACT_WRITE_LOCK(i, s) xt_xlock_rwlock_ns(i) +#define XT_XACT_UNLOCK(i, s) xt_unlock_rwlock_ns(i) +#elif defined(XT_XACT_USE_RW_MUTEX) +#define XT_XACT_LOCK_TYPE XTRWMutexRec +#define XT_XACT_INIT_LOCK(s, i) xt_rwmutex_init(s, i) +#define XT_XACT_FREE_LOCK(s, i) xt_rwmutex_free(s, i) +#define XT_XACT_READ_LOCK(i, s) xt_rwmutex_slock(i, (s)->t_id) +#define 
XT_XACT_WRITE_LOCK(i, s) xt_rwmutex_xlock(i, (s)->t_id) +#define XT_XACT_UNLOCK(i, s) xt_rwmutex_unlock(i, (s)->t_id) +#else +#define XT_XACT_LOCK_TYPE XTSpinLockRec +#define XT_XACT_INIT_LOCK(s, i) xt_spinlock_init_with_autoname(s, i) +#define XT_XACT_FREE_LOCK(s, i) xt_spinlock_free(s, i) +#define XT_XACT_READ_LOCK(i, s) xt_spinlock_lock(i) +#define XT_XACT_WRITE_LOCK(i, s) xt_spinlock_lock(i) +#define XT_XACT_UNLOCK(i, s) xt_spinlock_unlock(i) +#endif + +/* We store the transactions in a number of segments, each + * segment has a hash table. + */ +typedef struct XTXactSeg { + XT_XACT_LOCK_TYPE xs_tab_lock; /* Lock for hash table. */ + xtXactID xs_last_xn_id; /* The last transaction ID added. */ + XTXactDataPtr xs_free_list; /* List of transaction data structures. */ + XTXactDataPtr xs_table[XT_XN_HASH_TABLE_SIZE]; /* Hash table containing the transaction data structures. */ +} XTXactSegRec, *XTXactSegPtr; + +typedef struct XTXactWait { + xtXactID xw_xn_id; +} XTXactWaitRec, *XTXactWaitPtr; + +void xt_thread_wait_init(struct XTThread *self); +void xt_thread_wait_exit(struct XTThread *self); + +void xt_xn_init_db(struct XTThread *self, struct XTDatabase *db); +void xt_xn_exit_db(struct XTThread *self, struct XTDatabase *db); +void xt_start_sweeper(struct XTThread *self, struct XTDatabase *db); +void xt_wait_for_sweeper(struct XTThread *self, struct XTDatabase *db, int abort_time); +void xt_stop_sweeper(struct XTThread *self, struct XTDatabase *db); + +void xt_xn_init_thread(struct XTThread *self, int what_for); +void xt_xn_exit_thread(struct XTThread *self); +void xt_wakeup_sweeper(struct XTDatabase *db); + +xtBool xt_xn_begin(struct XTThread *self); +xtBool xt_xn_commit(struct XTThread *self); +xtBool xt_xn_rollback(struct XTThread *self); +xtBool xt_xn_log_tab_id(struct XTThread *self, xtTableID tab_id); +int xt_xn_status(struct XTOpenTable *ot, xtXactID xn_id, xtRecordID rec_id); +xtBool xt_xn_wait_for_xact(struct XTThread *self, XTXactWaitPtr xw, struct 
XTLockWait *lw); +void xt_xn_wakeup_waiting_threads(struct XTThread *thread); +void xt_xn_wakeup_thread_list(struct XTThread *thread); +void xt_xn_wakeup_thread(xtThreadID thd_id); +xtXactID xt_xn_get_curr_id(struct XTDatabase *db); +xtWord8 xt_xn_bytes_to_sweep(struct XTDatabase *db, struct XTThread *thread); + +XTXactDataPtr xt_xn_add_old_xact(struct XTDatabase *db, xtXactID xn_id, struct XTThread *thread); +XTXactDataPtr xt_xn_get_xact(struct XTDatabase *db, xtXactID xn_id, struct XTThread *thread); +xtBool xt_xn_delete_xact(struct XTDatabase *db, xtXactID xn_id, struct XTThread *thread); + +inline xtBool xt_xn_is_before(register xtXactID now, register xtXactID then) +{ + if (now >= then) { + if ((now - then) > (xtXactID) 0xFFFFFFFF/2) + return TRUE; + return FALSE; + } + if ((then - now) > (xtXactID) 0xFFFFFFFF/2) + return FALSE; + return TRUE; +} + +#endif diff --git a/storage/pbxt/src/xactlog_xt.cc b/storage/pbxt/src/xactlog_xt.cc new file mode 100644 index 00000000000..82c0d85b770 --- /dev/null +++ b/storage/pbxt/src/xactlog_xt.cc @@ -0,0 +1,2853 @@ +/* Copyright (c) 2007 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2007-10-30 Paul McCullagh + * + * H&G2JCtL + * + * The transaction log contains all operations on the data handle + * and row pointer files of a table. + * + * The transaction log does not contain operations on index data. + */ + +#include "xt_config.h" + +#include <signal.h> + +#include "xactlog_xt.h" +#include "database_xt.h" +#include "util_xt.h" +#include "strutil_xt.h" +#include "filesys_xt.h" +#include "myxt_xt.h" +#include "trace_xt.h" + +#ifdef DEBUG +//#define PRINT_TABLE_MODIFICATIONS +//#define TRACE_WRITER_ACTIVITY +#endif +#ifndef XT_WIN +#ifndef XT_MAC +#define PREWRITE_LOG_COMPLETELY +#endif +#endif + +static void xlog_wr_log_written(XTDatabaseHPtr db); + +/* + * ----------------------------------------------------------------------- + * T R A N S A C T I O L O G C A C H E + */ + +static XTXLogCacheRec xt_xlog_cache; + +/* + * Initialize the disk cache. + */ +xtPublic void xt_xlog_init(XTThreadPtr self, size_t cache_size) +{ + XTXLogBlockPtr block; + + /* + * This is required to ensure that the block + * works! 
+ */ + + /* Determine the number of block that will fit into the given memory: */ + /* + xt_xlog_cache.xlc_hash_size = (cache_size / (XLC_SEGMENT_COUNT * sizeof(XTXLogBlockPtr) + sizeof(XTXLogBlockRec))) / (XLC_SEGMENT_COUNT >> 1); + xt_xlog_cache.xlc_block_count = (cache_size - (XLC_SEGMENT_COUNT * xt_xlog_cache.xlc_hash_size * sizeof(XTXLogBlockPtr))) / sizeof(XTXLogBlockRec); + */ + /* Do not count the size of the cache directory towards the cache size: */ + xt_xlog_cache.xlc_block_count = cache_size / sizeof(XTXLogBlockRec); + xt_xlog_cache.xlc_upper_limit = ((xtWord8) xt_xlog_cache.xlc_block_count * (xtWord8) XT_XLC_BLOCK_SIZE * (xtWord8) 3) / (xtWord8) 4; + xt_xlog_cache.xlc_hash_size = xt_xlog_cache.xlc_block_count / (XLC_SEGMENT_COUNT >> 1); + if (!xt_xlog_cache.xlc_hash_size) + xt_xlog_cache.xlc_hash_size = 1; + + try_(a) { + for (u_int i=0; i<XLC_SEGMENT_COUNT; i++) { + xt_xlog_cache.xlc_segment[i].lcs_hash_table = (XTXLogBlockPtr *) xt_calloc(self, xt_xlog_cache.xlc_hash_size * sizeof(XTXLogBlockPtr)); + xt_init_mutex_with_autoname(self, &xt_xlog_cache.xlc_segment[i].lcs_lock); + xt_init_cond(self, &xt_xlog_cache.xlc_segment[i].lcs_cond); + } + + block = (XTXLogBlockPtr) xt_malloc(self, xt_xlog_cache.xlc_block_count * sizeof(XTXLogBlockRec)); + xt_xlog_cache.xlc_blocks = block; + xt_xlog_cache.xlc_blocks_end = (XTXLogBlockPtr) ((char *) block + (xt_xlog_cache.xlc_block_count * sizeof(XTXLogBlockRec))); + xt_xlog_cache.xlc_next_to_free = block; + xt_init_mutex_with_autoname(self, &xt_xlog_cache.xlc_lock); + xt_init_cond(self, &xt_xlog_cache.xlc_cond); + + for (u_int i=0; i<xt_xlog_cache.xlc_block_count; i++) { + block->xlb_address = 0; + block->xlb_log_id = 0; + block->xlb_state = XLC_BLOCK_FREE; + block++; + } + xt_xlog_cache.xlc_free_count = xt_xlog_cache.xlc_block_count; + } + catch_(a) { + xt_xlog_exit(self); + throw_(); + } + cont_(a); +} + +xtPublic void xt_xlog_exit(XTThreadPtr self) +{ + for (u_int i=0; i<XLC_SEGMENT_COUNT; i++) { + if 
(xt_xlog_cache.xlc_segment[i].lcs_hash_table) { + xt_free(self, xt_xlog_cache.xlc_segment[i].lcs_hash_table); + xt_xlog_cache.xlc_segment[i].lcs_hash_table = NULL; + xt_free_mutex(&xt_xlog_cache.xlc_segment[i].lcs_lock); + xt_free_cond(&xt_xlog_cache.xlc_segment[i].lcs_cond); + } + } + + if (xt_xlog_cache.xlc_blocks) { + xt_free(self, xt_xlog_cache.xlc_blocks); + xt_xlog_cache.xlc_blocks = NULL; + xt_free_mutex(&xt_xlog_cache.xlc_lock); + xt_free_cond(&xt_xlog_cache.xlc_cond); + } + memset(&xt_xlog_cache, 0, sizeof(xt_xlog_cache)); +} + +xtPublic xtInt8 xt_xlog_get_usage() +{ + xtInt8 size; + + size = (xtInt8) (xt_xlog_cache.xlc_block_count - xt_xlog_cache.xlc_free_count) * sizeof(XTXLogBlockRec); + return size; +} + +xtPublic xtInt8 xt_xlog_get_size() +{ + xtInt8 size; + + size = (xtInt8) xt_xlog_cache.xlc_block_count * sizeof(XTXLogBlockRec); + return size; +} + +xtPublic xtLogID xt_xlog_get_min_log(XTThreadPtr self, XTDatabaseHPtr db) +{ + char path[PATH_MAX]; + XTOpenDirPtr od; + char *file; + xtLogID log_id, min_log = 0; + + xt_strcpy(PATH_MAX, path, db->db_main_path); + xt_add_system_dir(PATH_MAX, path); + if (xt_fs_exists(path)) { + pushsr_(od, xt_dir_close, xt_dir_open(self, path, NULL)); + while (xt_dir_next(self, od)) { + file = xt_dir_name(self, od); + if (xt_starts_with(file, "xlog")) { + if ((log_id = (xtLogID) xt_file_name_to_id(file))) { + if (!min_log || log_id < min_log) + min_log = log_id; + } + } + } + freer_(); // xt_dir_close(od) + } + if (!min_log) + return 1; + return min_log; +} + +xtPublic void xt_xlog_delete_logs(XTThreadPtr self, XTDatabaseHPtr db) +{ + char path[PATH_MAX]; + XTOpenDirPtr od; + char *file; + + /* Close all the index logs before we delete them: */ + db->db_indlogs.ilp_close(self, TRUE); + + /* Close the transaction logs too: */ + db->db_xlog.xlog_close(self); + + xt_strcpy(PATH_MAX, path, db->db_main_path); + xt_add_system_dir(PATH_MAX, path); + if (!xt_fs_exists(path)) + return; + pushsr_(od, xt_dir_close, 
xt_dir_open(self, path, NULL)); + while (xt_dir_next(self, od)) { + file = xt_dir_name(self, od); + if (xt_ends_with(file, ".xt")) { + xt_add_dir_char(PATH_MAX, path); + xt_strcat(PATH_MAX, path, file); + xt_fs_delete(self, path); + xt_remove_last_name_of_path(path); + } + } + freer_(); // xt_dir_close(od) + + /* I no longer attach the condition: !db->db_multi_path + * to removing this directory. This is because + * the pbxt directory must now be removed explicitly + * by drop database, or by deleting all the PBXT + * system tables. + */ + if (!xt_fs_rmdir(NULL, path)) + xt_log_and_clear_exception(self); +} + +#ifdef DEBUG_CHECK_CACHE +static void xt_xlog_check_cache(void) +{ + XTXLogBlockPtr block, pblock; + u_int used_count; + u_int free_count; + + // Check the LRU list: + used_count = 0; + pblock = NULL; + block = xt_xlog_cache.xlc_lru_block; + while (block) { + used_count++; + ASSERT_NS(block->xlb_state != XLC_BLOCK_FREE); + ASSERT_NS(block->xlb_lr_used == pblock); + pblock = block; + block = block->xlb_mr_used; + } + ASSERT_NS(xt_xlog_cache.xlc_mru_block == pblock); + ASSERT_NS(xt_xlog_cache.xlc_free_count + used_count == xt_xlog_cache.xlc_block_count); + + // Check the free list: + free_count = 0; + block = xt_xlog_cache.xlc_free_list; + while (block) { + free_count++; + ASSERT_NS(block->xlb_state == XLC_BLOCK_FREE); + block = block->xlb_next; + } + ASSERT_NS(xt_xlog_cache.xlc_free_count == free_count); +} +#endif + +#ifdef FOR_DEBUG +static void xlog_check_lru_list(XTXLogBlockPtr block) +{ + XTXLogBlockPtr list_block, plist_block; + + plist_block = NULL; + list_block = xt_xlog_cache.xlc_lru_block; + while (list_block) { + ASSERT_NS(block != list_block); + ASSERT_NS(list_block->xlb_lr_used == plist_block); + plist_block = list_block; + list_block = list_block->xlb_mr_used; + } + ASSERT_NS(xt_xlog_cache.xlc_mru_block == plist_block); +} +#endif + +/* + * Log cache blocks are used and freed on a round-robin basis. 
+ * In addition, only data read by restart, and data transferred + * from the transaction log are stored in the transaction log cache. + * + * This ensures that the transaction log cache contains the most + * recently written log data. + * + * If the sweeper gets behind due to a long running transaction + * then it falls out of the log cache, and must read from + * the log files directly. + * + * This data read is no longer cached as it was previously. + * This has the advantage that it does not disturb the writer + * thread, which would otherwise hit the cache. + * + * If transactions are not too long, it should be possible + * to keep the sweeper in the log cache. + */ +static xtBool xlog_free_block(XTXLogBlockPtr to_free) +{ + XTXLogBlockPtr block, pblock; + xtLogID log_id; + off_t address; + XTXLogCacheSegPtr seg; + u_int hash_idx; + + retry: + log_id = to_free->xlb_log_id; + address = to_free->xlb_address; + + seg = &xt_xlog_cache.xlc_segment[((u_int) address >> XT_XLC_BLOCK_SHIFTS) & XLC_SEGMENT_MASK]; + hash_idx = (((u_int) (address >> (XT_XLC_SEGMENT_SHIFTS + XT_XLC_BLOCK_SHIFTS))) ^ (log_id << 16)) % xt_xlog_cache.xlc_hash_size; + + xt_lock_mutex_ns(&seg->lcs_lock); + if (to_free->xlb_state == XLC_BLOCK_FREE) + goto done_ok; + if (to_free->xlb_log_id != log_id || to_free->xlb_address != address) { + xt_unlock_mutex_ns(&seg->lcs_lock); + goto retry; + } + + pblock = NULL; + block = seg->lcs_hash_table[hash_idx]; + while (block) { + if (block->xlb_address == address && block->xlb_log_id == log_id) { + ASSERT_NS(block == to_free); + ASSERT_NS(block->xlb_state != XLC_BLOCK_FREE); + + /* Wait if the block is being read: */ + if (block->xlb_state == XLC_BLOCK_READING) { + /* Wait for the block to be read, then try again. 
*/ + if (!xt_timed_wait_cond_ns(&seg->lcs_cond, &seg->lcs_lock, 100)) + goto failed; + xt_unlock_mutex_ns(&seg->lcs_lock); + goto retry; + } + + goto free_the_block; + } + pblock = block; + block = block->xlb_next; + } + + /* We did not find the block, someone else freed it... */ + xt_unlock_mutex_ns(&seg->lcs_lock); + goto retry; + + free_the_block: + ASSERT_NS(block->xlb_state == XLC_BLOCK_CLEAN); + + /* Remove from the hash table: */ + if (pblock) + pblock->xlb_next = block->xlb_next; + else + seg->lcs_hash_table[hash_idx] = block->xlb_next; + + /* Free the block: */ + xt_xlog_cache.xlc_free_count++; + block->xlb_state = XLC_BLOCK_FREE; + + done_ok: + xt_unlock_mutex_ns(&seg->lcs_lock); + return OK; + + failed: + xt_unlock_mutex_ns(&seg->lcs_lock); + return FAILED; +} + +#define XT_FETCH_READ 0 +#define XT_FETCH_BLANK 1 +#define XT_FETCH_TEST 2 + +static xtBool xlog_fetch_block(XTXLogBlockPtr *ret_block, XTOpenFilePtr file, xtLogID log_id, off_t address, XTXLogCacheSegPtr *ret_seg, int fetch_type, XTThreadPtr thread) +{ + register XTXLogBlockPtr block; + register XTXLogCacheSegPtr seg; + register u_int hash_idx; + register XTXLogCacheRec *dcg = &xt_xlog_cache; + size_t red_size; + + /* Make sure we have a free block ready (to avoid unlock below): */ + if (fetch_type != XT_FETCH_TEST && dcg->xlc_next_to_free->xlb_state != XLC_BLOCK_FREE) { + if (!xlog_free_block(dcg->xlc_next_to_free)) + return FAILED; + } + + seg = &dcg->xlc_segment[((u_int) address >> XT_XLC_BLOCK_SHIFTS) & XLC_SEGMENT_MASK]; + hash_idx = (((u_int) (address >> (XT_XLC_SEGMENT_SHIFTS + XT_XLC_BLOCK_SHIFTS))) ^ (log_id << 16)) % dcg->xlc_hash_size; + + xt_lock_mutex_ns(&seg->lcs_lock); + retry: + block = seg->lcs_hash_table[hash_idx]; + while (block) { + if (block->xlb_address == address && block->xlb_log_id == log_id) { + ASSERT_NS(block->xlb_state != XLC_BLOCK_FREE); + + /* + * Wait if the block is being read. 
+ */ + if (block->xlb_state == XLC_BLOCK_READING) { + if (!xt_timed_wait_cond_ns(&seg->lcs_cond, &seg->lcs_lock, 100)) { + xt_unlock_mutex_ns(&seg->lcs_lock); + return FAILED; + } + goto retry; + } + + *ret_seg = seg; + *ret_block = block; + thread->st_statistics.st_xlog_cache_hit++; + return OK; + } + block = block->xlb_next; + } + + if (fetch_type == XT_FETCH_TEST) { + xt_unlock_mutex_ns(&seg->lcs_lock); + *ret_seg = NULL; + *ret_block = NULL; + thread->st_statistics.st_xlog_cache_miss++; + return OK; + } + + /* Block not found: */ + get_free_block: + if (dcg->xlc_next_to_free->xlb_state != XLC_BLOCK_FREE) { + xt_unlock_mutex_ns(&seg->lcs_lock); + if (!xlog_free_block(dcg->xlc_next_to_free)) + return FAILED; + xt_lock_mutex_ns(&seg->lcs_lock); + } + + xt_lock_mutex_ns(&dcg->xlc_lock); + block = dcg->xlc_next_to_free; + if (block->xlb_state != XLC_BLOCK_FREE) { + xt_unlock_mutex_ns(&dcg->xlc_lock); + goto get_free_block; + } + dcg->xlc_next_to_free++; + if (dcg->xlc_next_to_free == dcg->xlc_blocks_end) + dcg->xlc_next_to_free = dcg->xlc_blocks; + dcg->xlc_free_count--; + + if (fetch_type == XT_FETCH_READ) { + block->xlb_address = address; + block->xlb_log_id = log_id; + block->xlb_state = XLC_BLOCK_READING; + + xt_unlock_mutex_ns(&dcg->xlc_lock); + + /* Add the block to the hash table: */ + block->xlb_next = seg->lcs_hash_table[hash_idx]; + seg->lcs_hash_table[hash_idx] = block; + + /* Read the block into memory: */ + xt_unlock_mutex_ns(&seg->lcs_lock); + + if (!xt_pread_file(file, address, XT_XLC_BLOCK_SIZE, 0, block->xlb_data, &red_size, &thread->st_statistics.st_xlog, thread)) + return FAILED; + memset(block->xlb_data + red_size, 0, XT_XLC_BLOCK_SIZE - red_size); + thread->st_statistics.st_xlog_cache_miss++; + + xt_lock_mutex_ns(&seg->lcs_lock); + block->xlb_state = XLC_BLOCK_CLEAN; + xt_cond_wakeall(&seg->lcs_cond); + } + else { + block->xlb_address = address; + block->xlb_log_id = log_id; + block->xlb_state = XLC_BLOCK_CLEAN; + memset(block->xlb_data, 0, 
XT_XLC_BLOCK_SIZE); + + xt_unlock_mutex_ns(&dcg->xlc_lock); + + /* Add the block to the hash table: */ + block->xlb_next = seg->lcs_hash_table[hash_idx]; + seg->lcs_hash_table[hash_idx] = block; + } + + *ret_seg = seg; + *ret_block = block; +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + return OK; +} + +static xtBool xlog_transfer_to_cache(XTOpenFilePtr file, xtLogID log_id, off_t offset, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + off_t address; + XTXLogBlockPtr block; + XTXLogCacheSegPtr seg; + size_t boff; + size_t tfer; + xtBool read_block = FALSE; + +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + /* We have to read the first block, if we are + * not at the beginning of the file: + */ + if (offset) + read_block = TRUE; + address = offset & ~XT_XLC_BLOCK_MASK; + + boff = (size_t) (offset - address); + tfer = XT_XLC_BLOCK_SIZE - boff; + if (tfer > size) + tfer = size; + while (size > 0) { + if (!xlog_fetch_block(&block, file, log_id, address, &seg, read_block ? XT_FETCH_READ : XT_FETCH_BLANK, thread)) { +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + return FAILED; + } + ASSERT_NS(block && block->xlb_state == XLC_BLOCK_CLEAN); + memcpy(block->xlb_data + boff, data, tfer); + xt_unlock_mutex_ns(&seg->lcs_lock); + size -= tfer; + data += tfer; + + /* Following blocks need not be read + * because we always transfer to the + * end of the file! 
+ */ + read_block = FALSE; + address += XT_XLC_BLOCK_SIZE; + + boff = 0; + tfer = size; + if (tfer > XT_XLC_BLOCK_SIZE) + tfer = XT_XLC_BLOCK_SIZE; + } +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + return OK; +} + +static xtBool xt_xlog_read(XTOpenFilePtr file, xtLogID log_id, off_t offset, size_t size, xtWord1 *data, xtBool load_cache, XTThreadPtr thread) +{ + off_t address; + XTXLogBlockPtr block; + XTXLogCacheSegPtr seg; + size_t boff; + size_t tfer; + +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + address = offset & ~XT_XLC_BLOCK_MASK; + boff = (size_t) (offset - address); + tfer = XT_XLC_BLOCK_SIZE - boff; + if (tfer > size) + tfer = size; + while (size > 0) { + if (!xlog_fetch_block(&block, file, log_id, address, &seg, load_cache ? XT_FETCH_READ : XT_FETCH_TEST, thread)) + return FAILED; + if (!block) { + size_t red_size; + + if (!xt_pread_file(file, address + boff, size, 0, data, &red_size, &thread->st_statistics.st_xlog, thread)) + return FAILED; + memset(data + red_size, 0, size - red_size); + return OK; + } + memcpy(data, block->xlb_data + boff, tfer); + xt_unlock_mutex_ns(&seg->lcs_lock); + size -= tfer; + data += tfer; + address += XT_XLC_BLOCK_SIZE; + boff = 0; + tfer = size; + if (tfer > XT_XLC_BLOCK_SIZE) + tfer = XT_XLC_BLOCK_SIZE; + } +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + return OK; +} + +static xtBool xt_xlog_write(XTOpenFilePtr file, xtLogID log_id, off_t offset, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + if (!xt_pwrite_file(file, offset, size, data, &thread->st_statistics.st_xlog, thread)) + return FAILED; + return xlog_transfer_to_cache(file, log_id, offset, size, data, thread); +} + +/* + * ----------------------------------------------------------------------- + * D A T A B A S E T R A N S A C T I O N L O G S + */ + +void XTDatabaseLog::xlog_setup(XTThreadPtr self, XTDatabaseHPtr db, off_t inp_log_file_size, size_t transaction_buffer_size, int log_count) +{ + volatile off_t 
log_file_size = inp_log_file_size; + size_t log_size; + + try_(a) { + memset(this, 0, sizeof(XTDatabaseLogRec)); + + if (log_count <= 1) + log_count = 1; + else if (log_count > 1000000) + log_count = 1000000; + + xl_db = db; + + xl_log_file_threshold = xt_align_offset(log_file_size, 1024); + xl_log_file_count = log_count; + xl_size_of_buffers = transaction_buffer_size; + + xt_init_mutex_with_autoname(self, &xl_write_lock); + xt_init_cond(self, &xl_write_cond); + xt_writing = 0; + xl_log_id = 0; + xl_log_file = 0; + + xt_spinlock_init_with_autoname(self, &xl_buffer_lock); + + /* Note that we allocate a little bit more for each buffer + * in order to make sure that we can write a trailing record + * to the log buffer. + */ + log_size = transaction_buffer_size + sizeof(XTXactNewLogEntryDRec); + + /* Add in order to round the buffer to an integral of 512 */ + if (log_size % 512) + log_size += (512 - (log_size % 512)); + + xl_write_log_id = 0; + xl_write_log_offset = 0; + xl_write_buf_pos = 0; + xl_write_buf_pos_start = 0; + xl_write_buffer = (xtWord1 *) xt_malloc(self, log_size); + xl_write_done = TRUE; + + xl_append_log_id = 0; + xl_append_log_offset = 0; + xl_append_buf_pos = 0; + xl_append_buf_pos_start = 0; + xl_append_buffer = (xtWord1 *) xt_malloc(self, log_size); + + xl_last_flush_time = 10; + xl_flush_log_id = 0; + xl_flush_log_offset = 0; + } + catch_(a) { + xlog_exit(self); + throw_(); + } + cont_(a); +} + +xtBool XTDatabaseLog::xlog_set_write_offset(xtLogID log_id, xtLogOffset log_offset, xtLogID max_log_id, XTThreadPtr thread) +{ + xl_max_log_id = max_log_id; + + xl_write_log_id = log_id; + xl_write_log_offset = log_offset; + xl_write_buf_pos = 0; + xl_write_buf_pos_start = 0; + xl_write_done = TRUE; + + xl_append_log_id = log_id; + xl_append_log_offset = log_offset; + if (log_offset == 0) { + XTXactLogHeaderDPtr log_head; + + log_head = (XTXactLogHeaderDPtr) xl_append_buffer; + memset(log_head, 0, sizeof(XTXactLogHeaderDRec)); + log_head->xh_status_1 = 
XT_LOG_ENT_HEADER; + log_head->xh_checksum_1 = XT_CHECKSUM_1(log_id); + XT_SET_DISK_4(log_head->xh_size_4, sizeof(XTXactLogHeaderDRec)); + XT_SET_DISK_4(log_head->xh_log_id_4, log_id); + XT_SET_DISK_2(log_head->xh_version_2, XT_LOG_VERSION_NO); + XT_SET_DISK_4(log_head->xh_magic_4, XT_LOG_FILE_MAGIC); + xl_append_buf_pos = sizeof(XTXactLogHeaderDRec); + xl_append_buf_pos_start = 0; + } + else { + /* Start the log buffer at a block boundary: */ + size_t buf_pos = (size_t) (log_offset % 512); + + xl_append_buf_pos = buf_pos; + xl_append_buf_pos_start = buf_pos; + xl_append_log_offset = log_offset - buf_pos; + + if (!xlog_open_log(log_id, log_offset, thread)) + return FAILED; + + if (!xt_pread_file(xl_log_file, xl_append_log_offset, buf_pos, buf_pos, xl_append_buffer, NULL, &thread->st_statistics.st_xlog, thread)) + return FAILED; + } + + xl_flush_log_id = log_id; + xl_flush_log_offset = log_offset; + return OK; +} + +void XTDatabaseLog::xlog_close(XTThreadPtr self) +{ + if (xl_log_file) { + xt_close_file(self, xl_log_file); + xl_log_file = NULL; + } +} + +void XTDatabaseLog::xlog_exit(XTThreadPtr self) +{ + xt_spinlock_free(self, &xl_buffer_lock); + xt_free_mutex(&xl_write_lock); + xt_free_cond(&xl_write_cond); + xlog_close(self); + if (xl_write_buffer) { + xt_free(self, xl_write_buffer); + xl_write_buffer = NULL; + } + if (xl_append_buffer) { + xt_free(self, xl_append_buffer); + xl_append_buffer = NULL; + } +} + +#define WR_NO_SPACE 1 +#define WR_FLUSH 2 + +xtBool XTDatabaseLog::xlog_flush(XTThreadPtr thread) +{ + if (!xlog_flush_pending()) + return OK; + return xlog_append(thread, 0, NULL, 0, NULL, TRUE, NULL, NULL); +} + +xtBool XTDatabaseLog::xlog_flush_pending() +{ + xtLogID req_flush_log_id; + xtLogOffset req_flush_log_offset; + + xt_lck_slock(&xl_buffer_lock); + req_flush_log_id = xl_append_log_id; + req_flush_log_offset = xl_append_log_offset + xl_append_buf_pos; + if (xt_comp_log_pos(req_flush_log_id, req_flush_log_offset, xl_flush_log_id, 
xl_flush_log_offset) <= 0) { + xt_spinlock_unlock(&xl_buffer_lock); + return FALSE; + } + xt_spinlock_unlock(&xl_buffer_lock); + return TRUE; +} + +/* + * Write data to the end of the log buffer. + * + * commit is set to true if the caller also requires + * the log to be flushed, after writing the data. + * + * This function returns the log ID and offset of + * the data write position. + */ +xtBool XTDatabaseLog::xlog_append(XTThreadPtr thread, size_t size1, xtWord1 *data1, size_t size2, xtWord1 *data2, xtBool commit, xtLogID *log_id, xtLogOffset *log_offset) +{ + int write_reason = 0; + xtLogID req_flush_log_id; + xtLogOffset req_flush_log_offset; + size_t part_size; + xtWord8 flush_time; + + if (!size1) { + /* Just flush the buffer... */ + xt_lck_slock(&xl_buffer_lock); + write_reason = WR_FLUSH; + req_flush_log_id = xl_append_log_id; + req_flush_log_offset = xl_append_log_offset + xl_append_buf_pos; + xt_spinlock_unlock(&xl_buffer_lock); + goto write_log_to_file; + } + else { + req_flush_log_id = 0; + req_flush_log_offset = 0; + } + + /* + * This is a dirty read, which will send us to the + * best starting position: + * + * If there is space, now, then there is probably + * still enough space, after we have locked the + * buffer for writing. + */ + if (xl_append_buf_pos + size1 + size2 <= xl_size_of_buffers) + goto copy_to_log_buffer; + + /* + * There is not enough space in the append buffer. + * So we need to write the log, until there is space. + */ + write_reason = WR_NO_SPACE; + + write_log_to_file: + if (write_reason) { + /* We need to write for one of 2 reasons: not + * enough space in the buffer, or a flush + * is required. + */ + + /* + * The objective of the following code is to + * pick one writer, out of all threads. + * The rest will wait for the writer. + */ + xtBool i_am_writer; + + if (write_reason == WR_FLUSH) { + /* Before we flush, check if we should wait for running + * transactions that may commit shortly. 
+ */ + if (xl_db->db_xn_writer_count - xl_db->db_xn_writer_wait_count - xl_db->db_xn_long_running_count > 0 && xl_last_flush_time) { + /* Wait for about as long as the last flush took, + * the idea is to saturate the disk with flushing...: */ + xtWord8 then = xt_trace_clock() + (xtWord8) xl_last_flush_time; + + for (;;) { + xt_critical_wait(); + /* If a thread leaves this loop because time is up, or + * a thread manages to flush so fast that this thread + * sleeps during this time, then it could be that + * the required flush occurs before other conditions + * of this loop are met! + * + * So we check here to make sure that the log has not been + * flushed as we require: + */ + if (xt_comp_log_pos(req_flush_log_id, req_flush_log_offset, xl_flush_log_id, xl_flush_log_offset) <= 0) { + ASSERT_NS(xt_comp_log_pos(xl_write_log_id, xl_write_log_offset, xl_append_log_id, xl_append_log_offset) <= 0); + return OK; + } + + if (xl_db->db_xn_writer_count - xl_db->db_xn_writer_wait_count - xl_db->db_xn_long_running_count > 0) + break; + if (xt_trace_clock() >= then) + break; + } + } + } + + i_am_writer = FALSE; + xt_lock_mutex_ns(&xl_write_lock); + if (xt_writing) { + if (!xt_timed_wait_cond_ns(&xl_write_cond, &xl_write_lock, 500)) { + xt_unlock_mutex_ns(&xl_write_lock); + return FALSE; + } + } + else { + xt_writing = TRUE; + i_am_writer = TRUE; + } + xt_unlock_mutex_ns(&xl_write_lock); + + if (!i_am_writer) { + /* If I am not the writer, then I just waited for the + * writer. So it may be that my requirements have now + * been met! + */ + if (write_reason == WR_FLUSH) { + /* If the reason was to flush, then + * check the last flush sequence, maybe it is past + * our required sequence. + */ + if (xt_comp_log_pos(req_flush_log_id, req_flush_log_offset, xl_flush_log_id, xl_flush_log_offset) <= 0) { + /* The required flush position of the log is before + * or equal to the actual flush position. This means the conditions + * for this thread have been satisfied (via group commit). 
+ * Nothing more to do! + */ + ASSERT_NS(xt_comp_log_pos(xl_write_log_id, xl_write_log_offset, xl_append_log_id, xl_append_log_offset) <= 0); + return OK; + } + goto write_log_to_file; + } + + /* It may be that there is now space in the append buffer: */ + if (xl_append_buf_pos + size1 + size2 <= xl_size_of_buffers) + goto copy_to_log_buffer; + + goto write_log_to_file; + } + + /* I am the writer, check the conditions, again: */ + if (write_reason == WR_FLUSH) { + /* The writer wants the log to be flushed to a particular point: */ + if (xt_comp_log_pos(req_flush_log_id, req_flush_log_offset, xl_flush_log_id, xl_flush_log_offset) <= 0) { + /* The writers required flush position is before or equal + * to the actual position, so the writer is done... + */ + xt_writing = FALSE; + xt_cond_wakeall(&xl_write_cond); + ASSERT_NS(xt_comp_log_pos(xl_write_log_id, xl_write_log_offset, xl_append_log_id, xl_append_log_offset) <= 0); + return OK; + } + /* Not flushed, but what about written? */ + if (xt_comp_log_pos(req_flush_log_id, req_flush_log_offset, xl_write_log_id, xl_write_log_offset + (xl_write_done ? xl_write_buf_pos : 0)) <= 0) { + /* The write position is after or equal to the required flush + * position. This means that all we have to do is flush + * to satisfy the writers condition. + */ + xtBool ok = TRUE; + + if (xl_log_id != xl_write_log_id) + ok = xlog_open_log(xl_write_log_id, xl_write_log_offset + (xl_write_done ? xl_write_buf_pos : 0), thread); + + if (ok) { + if (xl_db->db_co_busy) { + /* [(8)] Flush the compactor log. 
*/ + xt_lock_mutex_ns(&xl_db->db_co_dlog_lock); + ok = xl_db->db_co_thread->st_dlog_buf.dlb_flush_log(TRUE, thread); + xt_unlock_mutex_ns(&xl_db->db_co_dlog_lock); + } + } + + if (ok) { + flush_time = thread->st_statistics.st_xlog.ts_flush_time; + if ((ok = xt_flush_file(xl_log_file, &thread->st_statistics.st_xlog, thread))) { + xl_last_flush_time = (u_int) (thread->st_statistics.st_xlog.ts_flush_time - flush_time); + xl_log_bytes_flushed = xl_log_bytes_written; + + xt_lock_mutex_ns(&xl_db->db_wr_lock); + xl_flush_log_id = xl_write_log_id; + xl_flush_log_offset = xl_write_log_offset + (xl_write_done ? xl_write_buf_pos : 0); + /* + * We have written data to the log, wake the writer to commit + * the data to the database. + */ + xlog_wr_log_written(xl_db); + xt_unlock_mutex_ns(&xl_db->db_wr_lock); + } + } + xt_writing = FALSE; + xt_cond_wakeall(&xl_write_cond); + ASSERT_NS(xt_comp_log_pos(xl_write_log_id, xl_write_log_offset, xl_append_log_id, xl_append_log_offset) <= 0); + return ok; + } + } + else { + /* If there is space in the buffer, then we can go on + * to copy our data into the buffer: + */ + if (xl_append_buf_pos + size1 + size2 <= xl_size_of_buffers) { + xt_writing = FALSE; + xt_cond_wakeall(&xl_write_cond); + goto copy_to_log_buffer; + } + } + + rewrite: + /* If the current write buffer has been written, then + * switch the logs. Otherwise we must retry the existing + * write buffer. + */ + if (xl_write_done) { + /* This means that the current write buffer has been written, + * i.e. it is empty! 
+ */ + xt_spinlock_lock(&xl_buffer_lock); + xtWord1 *tmp_buffer = xl_write_buffer; + + /* The write position is now the append position: */ + xl_write_log_id = xl_append_log_id; + xl_write_log_offset = xl_append_log_offset; + xl_write_buf_pos = xl_append_buf_pos; + xl_write_buf_pos_start = xl_append_buf_pos_start; + xl_write_buffer = xl_append_buffer; + xl_write_done = FALSE; + + /* We have to maintain 512 byte alignment: */ + ASSERT_NS((xl_write_log_offset % 512) == 0); + part_size = xl_write_buf_pos % 512; + if (part_size != 0) + memcpy(tmp_buffer, xl_write_buffer + xl_write_buf_pos - part_size, part_size); + + /* The new append position will be after the + * current append position: + */ + xl_append_log_offset += xl_append_buf_pos - part_size; + xl_append_buf_pos = part_size; + xl_append_buf_pos_start = part_size; + xl_append_buffer = tmp_buffer; // The old write buffer (which is empty) + + /* + * If the append offset exceeds the log threshold, then + * we set the append buffer to a new log file: + * + * NOTE: This algorithm will cause the log to be overwritten by a maximum + * of the log buffer size! + */ + if (xl_append_log_offset >= xl_log_file_threshold) { + XTXactNewLogEntryDPtr log_tail; + XTXactLogHeaderDPtr log_head; + + xl_append_log_id++; + + /* Write the final record to the old log. + * There is enough space for this because we allocate the + * buffer a little bigger than required. + */ + log_tail = (XTXactNewLogEntryDPtr) (xl_write_buffer + xl_write_buf_pos); + log_tail->xl_status_1 = XT_LOG_ENT_NEW_LOG; + log_tail->xl_checksum_1 = XT_CHECKSUM_1(xl_append_log_id) ^ XT_CHECKSUM_1(xl_write_log_id); + XT_SET_DISK_4(log_tail->xl_log_id_4, xl_append_log_id); + xl_write_buf_pos += sizeof(XTXactNewLogEntryDRec); + + /* We add the header to the next log. 
*/ + log_head = (XTXactLogHeaderDPtr) xl_append_buffer; + memset(log_head, 0, sizeof(XTXactLogHeaderDRec)); + log_head->xh_status_1 = XT_LOG_ENT_HEADER; + log_head->xh_checksum_1 = XT_CHECKSUM_1(xl_append_log_id); + XT_SET_DISK_4(log_head->xh_size_4, sizeof(XTXactLogHeaderDRec)); + XT_SET_DISK_4(log_head->xh_log_id_4, xl_append_log_id); + XT_SET_DISK_2(log_head->xh_version_2, XT_LOG_VERSION_NO); + XT_SET_DISK_4(log_head->xh_magic_4, XT_LOG_FILE_MAGIC); + + xl_append_log_offset = 0; + xl_append_buf_pos = sizeof(XTXactLogHeaderDRec); + xl_append_buf_pos_start = 0; + } + xt_spinlock_unlock(&xl_buffer_lock); + /* We have completed the switch. The append buffer is empty, and + * other threads can begin to write to it. + * + * Meanwhile, this thread will write the write buffer... + */ + } + + /* Make sure we have the correct log open: */ + if (xl_log_id != xl_write_log_id) { + if (!xlog_open_log(xl_write_log_id, xl_write_log_offset, thread)) + goto write_failed; + } + + /* Write the buffer. */ + /* Always write an integral number of 512 byte blocks: */ + ASSERT_NS((xl_write_log_offset % 512) == 0); + if ((part_size = xl_write_buf_pos % 512)) { + part_size = 512 - part_size; + xl_write_buffer[xl_write_buf_pos] = XT_LOG_ENT_END_OF_LOG; + if (!xt_pwrite_file(xl_log_file, xl_write_log_offset, xl_write_buf_pos+part_size, xl_write_buffer, &thread->st_statistics.st_xlog, thread)) + goto write_failed; + } + else { + if (!xt_pwrite_file(xl_log_file, xl_write_log_offset, xl_write_buf_pos, xl_write_buffer, &thread->st_statistics.st_xlog, thread)) + goto write_failed; + } + + /* This part has not been written: */ + part_size = xl_write_buf_pos - xl_write_buf_pos_start; + + /* We have written the data to the log, transfer + * the buffer data into the cache. 
*/ + if (!xlog_transfer_to_cache(xl_log_file, xl_log_id, xl_write_log_offset+xl_write_buf_pos_start, part_size, xl_write_buffer+xl_write_buf_pos_start, thread)) + goto write_failed; + + xl_write_done = TRUE; + xl_log_bytes_written += part_size; + + if (write_reason == WR_FLUSH) { + if (xl_db->db_co_busy) { + /* [(8)] Flush the compactor log. */ + xt_lock_mutex_ns(&xl_db->db_co_dlog_lock); + if (!xl_db->db_co_thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) { + xt_unlock_mutex_ns(&xl_db->db_co_dlog_lock); + goto write_failed; + } + xt_unlock_mutex_ns(&xl_db->db_co_dlog_lock); + } + + /* And flush if required: */ + flush_time = thread->st_statistics.st_xlog.ts_flush_time; + if (!xt_flush_file(xl_log_file, &thread->st_statistics.st_xlog, thread)) + goto write_failed; + xl_last_flush_time = (u_int) (thread->st_statistics.st_xlog.ts_flush_time - flush_time); + + xl_log_bytes_flushed = xl_log_bytes_written; + + xt_lock_mutex_ns(&xl_db->db_wr_lock); + xl_flush_log_id = xl_write_log_id; + xl_flush_log_offset = xl_write_log_offset + xl_write_buf_pos; + /* + * We have written data to the log, wake the writer to commit + * the data to the database. + */ + xlog_wr_log_written(xl_db); + xt_unlock_mutex_ns(&xl_db->db_wr_lock); + + /* Check that the required flush condition has arrived. */ + if (xt_comp_log_pos(req_flush_log_id, req_flush_log_offset, xl_flush_log_id, xl_flush_log_offset) > 0) + /* The required position is still after the current flush + * position, continue writing: */ + goto rewrite; + + xt_writing = FALSE; + xt_cond_wakeall(&xl_write_cond); + ASSERT_NS(xt_comp_log_pos(xl_write_log_id, xl_write_log_offset, xl_append_log_id, xl_append_log_offset) <= 0); + return OK; + } + else + xlog_wr_log_written(xl_db); + + /* + * Check that the buffer is now available, otherwise, + * switch and write again! 
+ */ + if (xl_append_buf_pos + size1 + size2 > xl_size_of_buffers) + goto rewrite; + + xt_writing = FALSE; + xt_cond_wakeall(&xl_write_cond); + } + + copy_to_log_buffer: + xt_spinlock_lock(&xl_buffer_lock); + /* Now we have to check again. The check above was a dirty read! + */ + if (xl_append_buf_pos + size1 + size2 > xl_size_of_buffers) { + xt_spinlock_unlock(&xl_buffer_lock); + /* Not enough space, write the buffer, and return here. */ + write_reason = WR_NO_SPACE; + goto write_log_to_file; + } + + memcpy(xl_append_buffer + xl_append_buf_pos, data1, size1); + if (size2) + memcpy(xl_append_buffer + xl_append_buf_pos + size1, data2, size2); + /* Add the log ID to the checksum! + * This is required because log files are re-used, and we don't + * want the records to be valid when the log is re-used. + */ + register XTXactLogBufferDPtr record; + + /* + * Adjust db_xn_writer_count here. It is protected by + * xl_buffer_lock. + */ + record = (XTXactLogBufferDPtr) (xl_append_buffer + xl_append_buf_pos); + switch (record->xh.xh_status_1) { + case XT_LOG_ENT_HEADER: + case XT_LOG_ENT_END_OF_LOG: + break; + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_DELETE_BG: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_DELETE_FL_BG: + xtWord2 sum; + + sum = XT_GET_DISK_2(record->xu.xu_checksum_2) ^ XT_CHECKSUM_2(xl_append_log_id); + XT_SET_DISK_2(record->xu.xu_checksum_2, sum); + + if (!thread->st_xact_writer) { + thread->st_xact_writer = TRUE; + thread->st_xact_write_time = xt_db_approximate_time; + xl_db->db_xn_writer_count++; + xl_db->db_xn_total_writer_count++; + } + break; + case XT_LOG_ENT_ROW_NEW: + case XT_LOG_ENT_ROW_NEW_FL: + record->xl.xl_checksum_1 ^= XT_CHECKSUM_1(xl_append_log_id); + + if (!thread->st_xact_writer) 
{ + thread->st_xact_writer = TRUE; + thread->st_xact_write_time = xt_db_approximate_time; + xl_db->db_xn_writer_count++; + xl_db->db_xn_total_writer_count++; + } + break; + case XT_LOG_ENT_COMMIT: + case XT_LOG_ENT_ABORT: + ASSERT_NS(thread->st_xact_writer); + ASSERT_NS(xl_db->db_xn_writer_count > 0); + if (thread->st_xact_writer) { + xl_db->db_xn_writer_count--; + thread->st_xact_writer = FALSE; + if (thread->st_xact_long_running) { + xl_db->db_xn_long_running_count--; + thread->st_xact_long_running = FALSE; + } + } + /* No break required! */ + default: + record->xl.xl_checksum_1 ^= XT_CHECKSUM_1(xl_append_log_id); + break; + } +#ifdef DEBUG + ASSERT_NS(xlog_verify(record, size1 + size2, xl_append_log_id)); +#endif + if (log_id) + *log_id = xl_append_log_id; + if (log_offset) + *log_offset = xl_append_log_offset + xl_append_buf_pos; + xl_append_buf_pos += size1 + size2; + if (commit) { + write_reason = WR_FLUSH; + req_flush_log_id = xl_append_log_id; + req_flush_log_offset = xl_append_log_offset + xl_append_buf_pos; + xt_spinlock_unlock(&xl_buffer_lock); + goto write_log_to_file; + } + + // Failed sometime when outside the spinlock! + ASSERT_NS(xt_comp_log_pos(xl_write_log_id, xl_write_log_offset, xl_append_log_id, xl_append_log_offset + xl_append_buf_pos) <= 0); + xt_spinlock_unlock(&xl_buffer_lock); + + return OK; + + write_failed: + xt_writing = FALSE; + xt_cond_wakeall(&xl_write_cond); + return FAILED; +} + +/* + * This function does not always delete the log. It may just rename a + * log to a new log which it will need. + * This speeds things up: + * + * - No need to pre-allocate the new log. + * - Log data is already flushed (i.e. disk blocks allocated) + * - Log is already in OS cache. + * + * However, it means that I need to checksum things differently + * on each log to make sure I do not treat an old record + * as valid! 
+ * + * Return OK, FAILED or XT_ERR + */ +int XTDatabaseLog::xlog_delete_log(xtLogID del_log_id, XTThreadPtr thread) +{ + char path[PATH_MAX]; + + if (xl_max_log_id < xl_write_log_id) + xl_max_log_id = xl_write_log_id; + + xlog_name(PATH_MAX, path, del_log_id); + + if (xt_db_offline_log_function == XT_RECYCLE_LOGS) { + char new_path[PATH_MAX]; + xtLogID new_log_id; + xtBool ok; + + /* Make sure that the total logs is less than or equal to the log file count + * (plus dynamic component). + */ + while (xl_max_log_id - del_log_id + 1 <= (xl_log_file_count + xt_log_file_dyn_count) && + /* And the number of logs after the current log (including the current log) + * must be less or equal to the log file count. */ + xl_max_log_id - xl_write_log_id + 1 <= xl_log_file_count) { + new_log_id = xl_max_log_id+1; + xlog_name(PATH_MAX, new_path, new_log_id); + ok = xt_fs_rename(NULL, path, new_path); + if (ok) { + xl_max_log_id = new_log_id; + goto done; + } + if (!xt_fs_exists(new_path)) { + /* Try again later: */ + if (thread->t_exception.e_xt_err == XT_SYSTEM_ERROR && + XT_FILE_IN_USE(thread->t_exception.e_sys_err)) + return FAILED; + + return XT_ERR; + } + xl_max_log_id = new_log_id; + } + } + + if (xt_db_offline_log_function != XT_KEEP_LOGS) { + if (!xt_fs_delete(NULL, path)) { + if (thread->t_exception.e_xt_err == XT_SYSTEM_ERROR && + XT_FILE_IN_USE(thread->t_exception.e_sys_err)) + return FAILED; + + return XT_ERR; + } + } + + done: + return OK; +} + +/* PRIVATE FUNCTIONS */ +xtBool XTDatabaseLog::xlog_open_log(xtLogID log_id, off_t curr_write_pos, XTThreadPtr thread) +{ + char log_path[PATH_MAX]; + off_t eof; + + if (xl_log_id == log_id) + return OK; + + if (xl_log_file) { + if (!xt_flush_file(xl_log_file, &thread->st_statistics.st_xlog, thread)) + return FAILED; + xt_close_file_ns(xl_log_file); + xl_log_file = NULL; + xl_log_id = 0; + } + + xlog_name(PATH_MAX, log_path, log_id); + if (!(xl_log_file = xt_open_file_ns(log_path, XT_FS_CREATE | XT_FS_MAKE_PATH))) + return 
FAILED; + /* Allocate space until the required size: */ + if (curr_write_pos < xl_log_file_threshold) { + eof = xt_seek_eof_file(NULL, xl_log_file); + if (eof == 0) { + /* A new file (bad), we need a greater file count: */ + xt_log_file_dyn_count++; + xt_log_file_dyn_dec = 4; + } + else { + /* An existing file (good): */ + if (xt_log_file_dyn_count > 0) { + if (xt_log_file_dyn_dec > 0) + xt_log_file_dyn_dec--; + else + xt_log_file_dyn_count--; + } + } + if (eof < xl_log_file_threshold) { + char buffer[2048]; + size_t tfer; + + memset(buffer, 0, 2048); + + curr_write_pos = xt_align_offset(curr_write_pos, 512); +#ifdef PREWRITE_LOG_COMPLETELY + while (curr_write_pos < xl_log_file_threshold) { + tfer = 2048; + if ((off_t) tfer > xl_log_file_threshold - curr_write_pos) + tfer = (size_t) (xl_log_file_threshold - curr_write_pos); + if (curr_write_pos == 0) + *buffer = XT_LOG_ENT_END_OF_LOG; + if (!xt_pwrite_file(xl_log_file, curr_write_pos, tfer, buffer, &thread->st_statistics.st_xlog, thread)) + return FAILED; + *buffer = 0; + curr_write_pos += tfer; + } +#else + if (curr_write_pos < xl_log_file_threshold) { + tfer = 2048; + + if (curr_write_pos < xl_log_file_threshold - 2048) + curr_write_pos = xl_log_file_threshold - 2048; + if ((off_t) tfer > xl_log_file_threshold - curr_write_pos) + tfer = (size_t) (xl_log_file_threshold - curr_write_pos); + if (!xt_pwrite_file(xl_log_file, curr_write_pos, tfer, buffer, &thread->st_statistics.st_xlog, thread)) + return FAILED; + } +#endif + } + else if (eof > xl_log_file_threshold + (128 * 1024 * 1024)) { + if (!xt_set_eof_file(NULL, xl_log_file, xl_log_file_threshold)) + return FAILED; + } + } + xl_log_id = log_id; + return OK; +} + +void XTDatabaseLog::xlog_name(size_t size, char *path, xtLogID log_id) +{ + char name[50]; + + sprintf(name, "xlog-%lu.xt", (u_long) log_id); + xt_strcpy(size, path, xl_db->db_main_path); + xt_add_system_dir(size, path); + xt_add_dir_char(size, path); + xt_strcat(size, path, name); +} + +/* + * 
----------------------------------------------------------------------- + * T H R E A D T R A N S A C T I O N B U F F E R + */ + +xtPublic xtBool xt_xlog_flush_log(XTThreadPtr thread) +{ + return thread->st_database->db_xlog.xlog_flush(thread); +} + +xtPublic xtBool xt_xlog_log_data(XTThreadPtr thread, size_t size, XTXactLogBufferDPtr log_entry, xtBool commit) +{ + return thread->st_database->db_xlog.xlog_append(thread, size, (xtWord1 *) log_entry, 0, NULL, commit, NULL, NULL); +} + +/* Allocate a record from the free list. */ +xtPublic xtBool xt_xlog_modify_table(struct XTOpenTable *ot, u_int status, xtOpSeqNo op_seq, xtRecordID free_rec_id, xtRecordID rec_id, size_t size, xtWord1 *data) +{ + XTXactLogBufferDRec log_entry; + XTThreadPtr thread = ot->ot_thread; + XTTableHPtr tab = ot->ot_table; + size_t len; + xtWord4 sum = 0; + int check_size = 1; + XTXactDataPtr xact = NULL; + + switch (status) { + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_DELETE: + check_size = 2; + XT_SET_DISK_4(log_entry.xu.xu_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.xu.xu_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xu.xu_rec_id_4, rec_id); + XT_SET_DISK_2(log_entry.xu.xu_size_2, size); + len = offsetof(XTactUpdateEntryDRec, xu_rec_type_1); + if (!(thread->st_xact_data->xd_flags & XT_XN_XAC_LOGGED)) { + /* Add _BG: */ + status++; + xact = thread->st_xact_data; + xact->xd_flags |= XT_XN_XAC_LOGGED; + } + break; + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_DELETE_FL: + check_size = 2; + XT_SET_DISK_4(log_entry.xf.xf_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.xf.xf_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xf.xf_rec_id_4, rec_id); + XT_SET_DISK_2(log_entry.xf.xf_size_2, size); + XT_SET_DISK_4(log_entry.xf.xf_free_rec_id_4, free_rec_id); + sum ^= XT_CHECKSUM4_REC(free_rec_id); + len = offsetof(XTactUpdateFLEntryDRec, xf_rec_type_1); + if (!(thread->st_xact_data->xd_flags & 
XT_XN_XAC_LOGGED)) { + /* Add _BG: */ + status++; + xact = thread->st_xact_data; + xact->xd_flags |= XT_XN_XAC_LOGGED; + } + break; + case XT_LOG_ENT_REC_FREED: + case XT_LOG_ENT_REC_REMOVED: + case XT_LOG_ENT_REC_REMOVED_EXT: + ASSERT_NS(size == 1 + XT_XACT_ID_SIZE + sizeof(XTTabRecFreeDRec)); + XT_SET_DISK_4(log_entry.fr.fr_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.fr.fr_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.fr.fr_rec_id_4, rec_id); + len = offsetof(XTactFreeRecEntryDRec, fr_stat_id_1); + break; + case XT_LOG_ENT_REC_REMOVED_BI: + check_size = 2; + XT_SET_DISK_4(log_entry.rb.rb_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.rb.rb_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.rb.rb_rec_id_4, rec_id); + XT_SET_DISK_2(log_entry.rb.rb_size_2, size); + log_entry.rb.rb_new_rec_type_1 = (xtWord1) free_rec_id; + sum ^= XT_CHECKSUM4_REC(free_rec_id); + len = offsetof(XTactRemoveBIEntryDRec, rb_rec_type_1); + break; + case XT_LOG_ENT_REC_MOVED: + ASSERT_NS(size == 8); + XT_SET_DISK_4(log_entry.xw.xw_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.xw.xw_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xw.xw_rec_id_4, rec_id); + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1); + break; + case XT_LOG_ENT_REC_CLEANED: + ASSERT_NS(size == offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE); + XT_SET_DISK_4(log_entry.xw.xw_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.xw.xw_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xw.xw_rec_id_4, rec_id); + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1); + break; + case XT_LOG_ENT_REC_CLEANED_1: + ASSERT_NS(size == 1); + XT_SET_DISK_4(log_entry.xw.xw_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.xw.xw_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xw.xw_rec_id_4, rec_id); + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1); + break; + case XT_LOG_ENT_REC_UNLINKED: + ASSERT_NS(size == offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE); + XT_SET_DISK_4(log_entry.xw.xw_op_seq_4, 
op_seq); + XT_SET_DISK_4(log_entry.xw.xw_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xw.xw_rec_id_4, rec_id); + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1); + break; + case XT_LOG_ENT_ROW_NEW: + ASSERT_NS(size == 0); + XT_SET_DISK_4(log_entry.xa.xa_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.xa.xa_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xa.xa_row_id_4, rec_id); + len = offsetof(XTactRowAddedEntryDRec, xa_row_id_4) + XT_ROW_ID_SIZE; + break; + case XT_LOG_ENT_ROW_NEW_FL: + ASSERT_NS(size == 0); + XT_SET_DISK_4(log_entry.xa.xa_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.xa.xa_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.xa.xa_row_id_4, rec_id); + XT_SET_DISK_4(log_entry.xa.xa_free_list_4, free_rec_id); + sum ^= XT_CHECKSUM4_REC(free_rec_id); + len = offsetof(XTactRowAddedEntryDRec, xa_free_list_4) + XT_ROW_ID_SIZE; + break; + case XT_LOG_ENT_ROW_ADD_REC: + case XT_LOG_ENT_ROW_SET: + case XT_LOG_ENT_ROW_FREED: + ASSERT_NS(size == sizeof(XTTabRowRefDRec)); + XT_SET_DISK_4(log_entry.wr.wr_op_seq_4, op_seq); + XT_SET_DISK_4(log_entry.wr.wr_tab_id_4, tab->tab_id); + XT_SET_DISK_4(log_entry.wr.wr_row_id_4, rec_id); + len = offsetof(XTactWriteRowEntryDRec, wr_ref_id_4); + break; + default: + ASSERT_NS(FALSE); + len = 0; + break; + } + + xtWord1 *dptr = data; + xtWord4 g; + + sum ^= op_seq ^ (tab->tab_id << 8) ^ XT_CHECKSUM4_REC(rec_id); + if ((g = sum & 0xF0000000)) { + sum = sum ^ (g >> 24); + sum = sum ^ g; + } + for (u_int i=0; i<(u_int) size; i++) { + sum = (sum << 4) + *dptr; + if ((g = sum & 0xF0000000)) { + sum = sum ^ (g >> 24); + sum = sum ^ g; + } + dptr++; + } + + log_entry.xh.xh_status_1 = status; + if (check_size == 1) { + log_entry.xh.xh_checksum_1 = XT_CHECKSUM_1(sum); + } + else { + xtWord2 c; + + c = XT_CHECKSUM_2(sum); + XT_SET_DISK_2(log_entry.xu.xu_checksum_2, c); + } +#ifdef PRINT_TABLE_MODIFICATIONS + xt_print_log_record(0, 0, &log_entry); +#endif + if (xact) + return thread->st_database->db_xlog.xlog_append(thread, 
len, (xtWord1 *) &log_entry, size, data, FALSE, &xact->xd_begin_log, &xact->xd_begin_offset); + + return thread->st_database->db_xlog.xlog_append(thread, len, (xtWord1 *) &log_entry, size, data, FALSE, NULL, NULL); +} + +/* + * ----------------------------------------------------------------------- + * S E Q U E N T I A L L O G R E A D I N G + */ + +/* + * Use the log buffer for sequentially reading the log. + */ +xtBool XTDatabaseLog::xlog_seq_init(XTXactSeqReadPtr seq, size_t buffer_size, xtBool load_cache) +{ + seq->xseq_buffer_size = buffer_size; + seq->xseq_load_cache = load_cache; + + seq->xseq_log_id = 0; + seq->xseq_log_file = NULL; + seq->xseq_log_eof = 0; + + seq->xseq_buf_log_offset = 0; + seq->xseq_buffer_len = 0; + seq->xseq_buffer = (xtWord1 *) xt_malloc_ns(buffer_size); + + seq->xseq_rec_log_id = 0; + seq->xseq_rec_log_offset = 0; + seq->xseq_record_len = 0; + + return seq->xseq_buffer != NULL; +} + +void XTDatabaseLog::xlog_seq_exit(XTXactSeqReadPtr seq) +{ + xlog_seq_close(seq); + if (seq->xseq_buffer) { + xt_free_ns(seq->xseq_buffer); + seq->xseq_buffer = NULL; + } +} + +void XTDatabaseLog::xlog_seq_close(XTXactSeqReadPtr seq) +{ + if (seq->xseq_log_file) { + xt_close_file_ns(seq->xseq_log_file); + seq->xseq_log_file = NULL; + } + seq->xseq_log_id = 0; + seq->xseq_log_eof = 0; +} + +xtBool XTDatabaseLog::xlog_seq_start(XTXactSeqReadPtr seq, xtLogID log_id, xtLogOffset log_offset, xtBool missing_ok __attribute__((unused))) +{ + if (seq->xseq_rec_log_id != log_id) { + seq->xseq_rec_log_id = log_id; + seq->xseq_buf_log_offset = seq->xseq_rec_log_offset; + seq->xseq_buffer_len = 0; + } + + /* Windows version: this will help to switch + * to the new log file. + * Due to reading from the log buffers, this was + * not always done! 
+ */ + if (seq->xseq_log_id != log_id) { + if (seq->xseq_log_file) { + xt_close_file_ns(seq->xseq_log_file); + seq->xseq_log_file = NULL; + } + } + seq->xseq_rec_log_offset = log_offset; + seq->xseq_record_len = 0; + return OK; +} + +size_t XTDatabaseLog::xlog_bytes_to_write() +{ + xtLogID log_id; + xtLogOffset log_offset; + xtLogID to_log_id; + xtLogOffset to_log_offset; + size_t byte_count = 0; + + log_id = xl_db->db_wr_log_id; + log_offset = xl_db->db_wr_log_offset; + to_log_id = xl_db->db_xlog.xl_flush_log_id; + to_log_offset = xl_db->db_xlog.xl_flush_log_offset; + + /* Assume the logs have the threshold: */ + if (log_id < to_log_id) { + if (log_offset < xt_db_log_file_threshold) + byte_count = (size_t) (xt_db_log_file_threshold - log_offset); + log_offset = 0; + log_id++; + } + while (log_id < to_log_id) { + byte_count += (size_t) xt_db_log_file_threshold; + log_id++; + } + if (log_offset < to_log_offset) + byte_count += (size_t) (to_log_offset - log_offset); + + return byte_count; +} + +xtBool XTDatabaseLog::xlog_read_from_cache(XTXactSeqReadPtr seq, xtLogID log_id, xtLogOffset log_offset, size_t size, off_t eof, xtWord1 *buffer, size_t *data_read, XTThreadPtr thread) +{ + /* xseq_log_file could be NULL because xseq_log_id is not set + * to zero when xseq_log_file is set to NULL! + * This bug caused a crash in TeamDrive. 
+ */ + if (seq->xseq_log_id != log_id || !seq->xseq_log_file) { + char path[PATH_MAX]; + + if (seq->xseq_log_file) { + xt_close_file_ns(seq->xseq_log_file); + seq->xseq_log_file = NULL; + } + + xlog_name(PATH_MAX, path, log_id); + if (!xt_open_file_ns(&seq->xseq_log_file, path, XT_FS_MISSING_OK)) + return FAILED; + if (!seq->xseq_log_file) { + if (data_read) + *data_read = 0; + return OK; + } + seq->xseq_log_id = log_id; + seq->xseq_log_eof = 0; + } + + if (!eof) { + if (!seq->xseq_log_eof) + seq->xseq_log_eof = xt_seek_eof_file(NULL, seq->xseq_log_file); + eof = seq->xseq_log_eof; + } + + if (log_offset >= eof) { + if (data_read) + *data_read = 0; + return OK; + } + + if ((off_t) size > eof - log_offset) + size = (size_t) (eof - log_offset); + + if (data_read) + *data_read = size; + return xt_xlog_read(seq->xseq_log_file, seq->xseq_log_id, log_offset, size, buffer, seq->xseq_load_cache, thread); +} + +xtBool XTDatabaseLog::xlog_rnd_read(XTXactSeqReadPtr seq, xtLogID log_id, xtLogOffset log_offset, size_t size, xtWord1 *buffer, size_t *data_read, XTThreadPtr thread) +{ + /* Fast track to reading from cache: */ + if (log_id < xl_write_log_id) + return xlog_read_from_cache(seq, log_id, log_offset, size, 0, buffer, data_read, thread); + + if (log_id == xl_write_log_id && log_offset + (xtLogOffset) size <= xl_write_log_offset) + return xlog_read_from_cache(seq, log_id, log_offset, size, xl_write_log_offset, buffer, data_read, thread); + + /* May be in the log write or append buffer: */ + xt_lck_slock(&xl_buffer_lock); + + if (log_id < xl_write_log_id) { + xt_spinlock_unlock(&xl_buffer_lock); + return xlog_read_from_cache(seq, log_id, log_offset, size, 0, buffer, data_read, thread); + } + + /* Check the write buffer: */ + if (log_id == xl_write_log_id) { + if (log_offset + (xtLogOffset) size <= xl_write_log_offset) { + xt_spinlock_unlock(&xl_buffer_lock); + return xlog_read_from_cache(seq, log_id, log_offset, size, xl_write_log_offset, buffer, data_read, thread); + } + 
+ if (log_offset < xl_write_log_offset + (xtLogOffset) xl_write_buf_pos) { + /* Reading partially from the write buffer: */ + if (log_offset >= xl_write_log_offset) { + /* Completely in the buffer. */ + off_t offset = log_offset - xl_write_log_offset; + + if (size > xl_write_buf_pos - offset) + size = (size_t) (xl_write_buf_pos - offset); + + memcpy(buffer, xl_write_buffer + offset, size); + if (data_read) + *data_read = size; + goto unlock_and_return; + } + + /* End part in the buffer: */ + size_t tfer; + + /* The amount that will be taken from the cache: */ + tfer = (size_t) (xl_write_log_offset - log_offset); + + size -= tfer; + if (size > xl_write_buf_pos) + size = xl_write_buf_pos; + + memcpy(buffer + tfer, xl_write_buffer, size); + + xt_spinlock_unlock(&xl_buffer_lock); + + /* Read the first part from the cache: */ + if (data_read) + *data_read = tfer + size; + return xlog_read_from_cache(seq, log_id, log_offset, tfer, log_offset + tfer, buffer, NULL, thread); + } + } + + /* Check the append buffer: */ + if (log_id == xl_append_log_id) { + if (log_offset >= xl_append_log_offset && log_offset < xl_append_log_offset + (xtLogOffset) xl_append_buf_pos) { + /* It is in the append buffer: */ + size_t offset = (size_t) (log_offset - xl_append_log_offset); + + if (size > xl_append_buf_pos - offset) + size = xl_append_buf_pos - offset; + + memcpy(buffer, xl_append_buffer + offset, size); + if (data_read) + *data_read = size; + goto unlock_and_return; + } + } + + if (xl_append_log_id == 0) { + /* This catches the case that + * the log has not yet been initialized + * for writing. 
+ */ + xt_spinlock_unlock(&xl_buffer_lock); + return xlog_read_from_cache(seq, log_id, log_offset, size, 0, buffer, data_read, thread); + } + + if (data_read) + *data_read = 0; + + unlock_and_return: + xt_spinlock_unlock(&xl_buffer_lock); + return OK; +} + +xtBool XTDatabaseLog::xlog_write_thru(XTXactSeqReadPtr seq, size_t size, xtWord1 *data, XTThreadPtr thread) +{ + if (!xt_xlog_write(seq->xseq_log_file, seq->xseq_log_id, seq->xseq_rec_log_offset, size, data, thread)) + return FALSE; + xl_log_bytes_written += size; + seq->xseq_rec_log_offset += size; + return TRUE; +} + +xtBool XTDatabaseLog::xlog_verify(XTXactLogBufferDPtr record, size_t rec_size, xtLogID log_id) +{ + xtWord4 sum = 0; + xtOpSeqNo op_seq; + xtTableID tab_id; + xtRecordID rec_id, free_rec_id; + int check_size = 1; + xtWord1 *dptr; + + switch (record->xh.xh_status_1) { + case XT_LOG_ENT_HEADER: + if (record->xh.xh_checksum_1 != XT_CHECKSUM_1(log_id)) + return FALSE; + if (XT_LOG_HEAD_MAGIC(record, rec_size) != XT_LOG_FILE_MAGIC) + return FALSE; + if (rec_size >= offsetof(XTXactLogHeaderDRec, xh_log_id_4) + 4) { + if (XT_GET_DISK_4(record->xh.xh_log_id_4) != log_id) + return FALSE; + } + return TRUE; + case XT_LOG_ENT_NEW_LOG: + case XT_LOG_ENT_DEL_LOG: + return record->xl.xl_checksum_1 == (XT_CHECKSUM_1(XT_GET_DISK_4(record->xl.xl_log_id_4)) ^ XT_CHECKSUM_1(log_id)); + case XT_LOG_ENT_NEW_TAB: + return record->xl.xl_checksum_1 == (XT_CHECKSUM_1(XT_GET_DISK_4(record->xt.xt_tab_id_4)) ^ XT_CHECKSUM_1(log_id)); + case XT_LOG_ENT_COMMIT: + case XT_LOG_ENT_ABORT: + sum = XT_CHECKSUM4_XACT(XT_GET_DISK_4(record->xe.xe_xact_id_4)) ^ XT_CHECKSUM4_XACT(XT_GET_DISK_4(record->xe.xe_not_used_4)); + return record->xe.xe_checksum_1 == (XT_CHECKSUM_1(sum) ^ XT_CHECKSUM_1(log_id)); + case XT_LOG_ENT_CLEANUP: + sum = XT_CHECKSUM4_XACT(XT_GET_DISK_4(record->xc.xc_xact_id_4)); + return record->xc.xc_checksum_1 == (XT_CHECKSUM_1(sum) ^ XT_CHECKSUM_1(log_id)); + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + 
case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_DELETE_BG: + check_size = 2; + op_seq = XT_GET_DISK_4(record->xu.xu_op_seq_4); + tab_id = XT_GET_DISK_4(record->xu.xu_tab_id_4); + rec_id = XT_GET_DISK_4(record->xu.xu_rec_id_4); + dptr = &record->xu.xu_rec_type_1; + rec_size -= offsetof(XTactUpdateEntryDRec, xu_rec_type_1); + break; + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE_FL_BG: + check_size = 2; + op_seq = XT_GET_DISK_4(record->xf.xf_op_seq_4); + tab_id = XT_GET_DISK_4(record->xf.xf_tab_id_4); + rec_id = XT_GET_DISK_4(record->xf.xf_rec_id_4); + free_rec_id = XT_GET_DISK_4(record->xf.xf_free_rec_id_4); + sum ^= XT_CHECKSUM4_REC(free_rec_id); + dptr = &record->xf.xf_rec_type_1; + rec_size -= offsetof(XTactUpdateFLEntryDRec, xf_rec_type_1); + break; + case XT_LOG_ENT_REC_FREED: + case XT_LOG_ENT_REC_REMOVED: + case XT_LOG_ENT_REC_REMOVED_EXT: + op_seq = XT_GET_DISK_4(record->fr.fr_op_seq_4); + tab_id = XT_GET_DISK_4(record->fr.fr_tab_id_4); + rec_id = XT_GET_DISK_4(record->fr.fr_rec_id_4); + dptr = &record->fr.fr_stat_id_1; + rec_size -= offsetof(XTactFreeRecEntryDRec, fr_stat_id_1); + break; + case XT_LOG_ENT_REC_REMOVED_BI: + check_size = 2; + op_seq = XT_GET_DISK_4(record->rb.rb_op_seq_4); + tab_id = XT_GET_DISK_4(record->rb.rb_tab_id_4); + rec_id = XT_GET_DISK_4(record->rb.rb_rec_id_4); + free_rec_id = (xtWord4) record->rb.rb_new_rec_type_1; + sum ^= XT_CHECKSUM4_REC(free_rec_id); + dptr = &record->rb.rb_rec_type_1; + rec_size -= offsetof(XTactRemoveBIEntryDRec, rb_rec_type_1); + break; + case XT_LOG_ENT_REC_MOVED: + case XT_LOG_ENT_REC_CLEANED: + case XT_LOG_ENT_REC_CLEANED_1: + case XT_LOG_ENT_REC_UNLINKED: + op_seq = XT_GET_DISK_4(record->xw.xw_op_seq_4); + tab_id = XT_GET_DISK_4(record->xw.xw_tab_id_4); + rec_id = 
XT_GET_DISK_4(record->xw.xw_rec_id_4); + dptr = &record->xw.xw_rec_type_1; + rec_size -= offsetof(XTactWriteRecEntryDRec, xw_rec_type_1); + break; + case XT_LOG_ENT_ROW_NEW: + case XT_LOG_ENT_ROW_NEW_FL: + op_seq = XT_GET_DISK_4(record->xa.xa_op_seq_4); + tab_id = XT_GET_DISK_4(record->xa.xa_tab_id_4); + rec_id = XT_GET_DISK_4(record->xa.xa_row_id_4); + if (record->xh.xh_status_1 == XT_LOG_ENT_ROW_NEW) { + dptr = (xtWord1 *) record + offsetof(XTactRowAddedEntryDRec, xa_free_list_4); + rec_size -= offsetof(XTactRowAddedEntryDRec, xa_free_list_4); + } + else { + free_rec_id = XT_GET_DISK_4(record->xa.xa_free_list_4); + sum ^= XT_CHECKSUM4_REC(free_rec_id); + dptr = (xtWord1 *) record + sizeof(XTactRowAddedEntryDRec); + rec_size -= sizeof(XTactRowAddedEntryDRec); + } + break; + case XT_LOG_ENT_ROW_ADD_REC: + case XT_LOG_ENT_ROW_SET: + case XT_LOG_ENT_ROW_FREED: + op_seq = XT_GET_DISK_4(record->wr.wr_op_seq_4); + tab_id = XT_GET_DISK_4(record->wr.wr_tab_id_4); + rec_id = XT_GET_DISK_4(record->wr.wr_row_id_4); + dptr = (xtWord1 *) &record->wr.wr_ref_id_4; + rec_size -= offsetof(XTactWriteRowEntryDRec, wr_ref_id_4); + break; + case XT_LOG_ENT_OP_SYNC: + return record->xl.xl_checksum_1 == (XT_CHECKSUM_1(XT_GET_DISK_4(record->os.os_time_4)) ^ XT_CHECKSUM_1(log_id)); + case XT_LOG_ENT_NO_OP: + sum = XT_GET_DISK_4(record->no.no_tab_id_4) ^ XT_GET_DISK_4(record->no.no_op_seq_4); + return record->xe.xe_checksum_1 == (XT_CHECKSUM_1(sum) ^ XT_CHECKSUM_1(log_id)); + case XT_LOG_ENT_END_OF_LOG: + return FALSE; + default: + ASSERT_NS(FALSE); + return FALSE; + } + + xtWord4 g; + + sum ^= (xtWord4) op_seq ^ ((xtWord4) tab_id << 8) ^ XT_CHECKSUM4_REC(rec_id); + + if ((g = sum & 0xF0000000)) { + sum = sum ^ (g >> 24); + sum = sum ^ g; + } + for (u_int i=0; i<(u_int) rec_size; i++) { + sum = (sum << 4) + *dptr; + if ((g = sum & 0xF0000000)) { + sum = sum ^ (g >> 24); + sum = sum ^ g; + } + dptr++; + } + + if (check_size == 1) { + if (record->xh.xh_checksum_1 != (XT_CHECKSUM_1(sum) ^ 
XT_CHECKSUM_1(log_id))) { + return FAILED; + } + } + else { + if (XT_GET_DISK_2(record->xu.xu_checksum_2) != (XT_CHECKSUM_2(sum) ^ XT_CHECKSUM_2(log_id))) { + return FAILED; + } + } + return TRUE; +} + +xtBool XTDatabaseLog::xlog_seq_next(XTXactSeqReadPtr seq, XTXactLogBufferDPtr *ret_entry, xtBool verify, XTThreadPtr thread) +{ + XTXactLogBufferDPtr record; + size_t tfer; + size_t len; + size_t rec_offset; + size_t max_rec_len; + size_t size; + u_int check_size = 1; + + /* Go to the next record (xseq_record_len must be initialized + * to 0 for this to work). + */ + seq->xseq_rec_log_offset += seq->xseq_record_len; + seq->xseq_record_len = 0; + + if (seq->xseq_rec_log_offset < seq->xseq_buf_log_offset || + seq->xseq_rec_log_offset >= seq->xseq_buf_log_offset + (xtLogOffset) seq->xseq_buffer_len) { + /* The current position is nowhere near the buffer, read data into the + * buffer: + */ + tfer = seq->xseq_buffer_size; + if (!xlog_rnd_read(seq, seq->xseq_rec_log_id, seq->xseq_rec_log_offset, tfer, seq->xseq_buffer, &tfer, thread)) + return FAILED; + seq->xseq_buf_log_offset = seq->xseq_rec_log_offset; + seq->xseq_buffer_len = tfer; + + /* Should we go to the next log? 
*/ + if (!tfer) { + goto return_empty; + } + } + + /* The start of the record is in the buffer: */ + read_from_buffer: + rec_offset = (size_t) (seq->xseq_rec_log_offset - seq->xseq_buf_log_offset); + max_rec_len = seq->xseq_buffer_len - rec_offset; + size = 0; + + /* Check the type of record: */ + record = (XTXactLogBufferDPtr) (seq->xseq_buffer + rec_offset); + switch (record->xh.xh_status_1) { + case XT_LOG_ENT_HEADER: + len = sizeof(XTXactLogHeaderDRec); + break; + case XT_LOG_ENT_NEW_LOG: + case XT_LOG_ENT_DEL_LOG: + len = sizeof(XTXactNewLogEntryDRec); + break; + case XT_LOG_ENT_NEW_TAB: + len = sizeof(XTXactNewTabEntryDRec); + break; + case XT_LOG_ENT_COMMIT: + case XT_LOG_ENT_ABORT: + len = sizeof(XTXactEndEntryDRec); + break; + case XT_LOG_ENT_CLEANUP: + len = sizeof(XTXactCleanupEntryDRec); + break; + case XT_LOG_ENT_REC_MODIFIED: + case XT_LOG_ENT_UPDATE: + case XT_LOG_ENT_INSERT: + case XT_LOG_ENT_DELETE: + case XT_LOG_ENT_UPDATE_BG: + case XT_LOG_ENT_INSERT_BG: + case XT_LOG_ENT_DELETE_BG: + check_size = 2; + len = offsetof(XTactUpdateEntryDRec, xu_rec_type_1); + if (len > max_rec_len) + /* The size is not in the buffer: */ + goto read_more; + len += (size_t) XT_GET_DISK_2(record->xu.xu_size_2); + break; + case XT_LOG_ENT_UPDATE_FL: + case XT_LOG_ENT_INSERT_FL: + case XT_LOG_ENT_DELETE_FL: + case XT_LOG_ENT_UPDATE_FL_BG: + case XT_LOG_ENT_INSERT_FL_BG: + case XT_LOG_ENT_DELETE_FL_BG: + check_size = 2; + len = offsetof(XTactUpdateFLEntryDRec, xf_rec_type_1); + if (len > max_rec_len) + /* The size is not in the buffer: */ + goto read_more; + len += (size_t) XT_GET_DISK_2(record->xf.xf_size_2); + break; + case XT_LOG_ENT_REC_FREED: + case XT_LOG_ENT_REC_REMOVED: + case XT_LOG_ENT_REC_REMOVED_EXT: + /* [(7)] REMOVE is now an extended version of FREE! 
*/ + len = offsetof(XTactFreeRecEntryDRec, fr_rec_type_1) + sizeof(XTTabRecFreeDRec); + break; + case XT_LOG_ENT_REC_REMOVED_BI: + check_size = 2; + len = offsetof(XTactRemoveBIEntryDRec, rb_rec_type_1); + if (len > max_rec_len) + /* The size is not in the buffer: */ + goto read_more; + len += (size_t) XT_GET_DISK_2(record->rb.rb_size_2); + break; + case XT_LOG_ENT_REC_MOVED: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + 8; + break; + case XT_LOG_ENT_REC_CLEANED: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE; + break; + case XT_LOG_ENT_REC_CLEANED_1: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + 1; + break; + case XT_LOG_ENT_REC_UNLINKED: + len = offsetof(XTactWriteRecEntryDRec, xw_rec_type_1) + offsetof(XTTabRecHeadDRec, tr_prev_rec_id_4) + XT_RECORD_ID_SIZE; + break; + case XT_LOG_ENT_ROW_NEW: + len = offsetof(XTactRowAddedEntryDRec, xa_row_id_4) + XT_ROW_ID_SIZE; + break; + case XT_LOG_ENT_ROW_NEW_FL: + len = offsetof(XTactRowAddedEntryDRec, xa_free_list_4) + XT_ROW_ID_SIZE; + break; + case XT_LOG_ENT_ROW_ADD_REC: + case XT_LOG_ENT_ROW_SET: + case XT_LOG_ENT_ROW_FREED: + len = offsetof(XTactWriteRowEntryDRec, wr_ref_id_4) + XT_REF_ID_SIZE; + break; + case XT_LOG_ENT_OP_SYNC: + len = sizeof(XTactOpSyncEntryDRec); + break; + case XT_LOG_ENT_NO_OP: + len = sizeof(XTactNoOpEntryDRec); + break; + case XT_LOG_ENT_END_OF_LOG: { + off_t eof = seq->xseq_log_eof, adjust; + + if (eof > seq->xseq_rec_log_offset) { + adjust = eof - seq->xseq_rec_log_offset; + + seq->xseq_record_len = (size_t) adjust; + } + goto return_empty; + } + default: + ASSERT_NS(FALSE); + seq->xseq_record_len = 0; + goto return_empty; + } + + ASSERT_NS(len <= seq->xseq_buffer_size); + if (len <= max_rec_len) { + if (verify) { + if (!xlog_verify(record, len, seq->xseq_rec_log_id)) { + goto return_empty; + } + } + + /* The record is completely in the buffer: */ + seq->xseq_record_len = len; + *ret_entry = 
record; + return OK; + } + + /* The record is partially in the buffer. */ + memmove(seq->xseq_buffer, seq->xseq_buffer + rec_offset, max_rec_len); + seq->xseq_buf_log_offset += rec_offset; + seq->xseq_buffer_len = max_rec_len; + + /* Read the rest, as far as possible: */ + tfer = seq->xseq_buffer_size - max_rec_len; + if (!xlog_rnd_read(seq, seq->xseq_rec_log_id, seq->xseq_buf_log_offset + max_rec_len, tfer, seq->xseq_buffer + max_rec_len, &tfer, thread)) + return FAILED; + seq->xseq_buffer_len += tfer; + + if (seq->xseq_buffer_len < len) { + /* A partial record is in the log, must be the end of the log: */ + goto return_empty; + } + + /* The record is not completely in the buffer: */ + seq->xseq_record_len = len; + *ret_entry = (XTXactLogBufferDPtr) seq->xseq_buffer; + return OK; + + read_more: + ASSERT_NS(len <= seq->xseq_buffer_size); + memmove(seq->xseq_buffer, seq->xseq_buffer + rec_offset, max_rec_len); + seq->xseq_buf_log_offset += rec_offset; + seq->xseq_buffer_len = max_rec_len; + + /* Read the rest, as far as possible: */ + tfer = seq->xseq_buffer_size - max_rec_len; + if (!xlog_rnd_read(seq, seq->xseq_rec_log_id, seq->xseq_buf_log_offset + max_rec_len, tfer, seq->xseq_buffer + max_rec_len, &tfer, thread)) + return FAILED; + seq->xseq_buffer_len += tfer; + + if (seq->xseq_buffer_len < len + size) { + /* We did not get as much as we need, return an empty record: */ + goto return_empty; + } + + goto read_from_buffer; + + return_empty: + *ret_entry = NULL; + return OK; +} + +void XTDatabaseLog::xlog_seq_skip(XTXactSeqReadPtr seq, size_t size) +{ + seq->xseq_record_len += size; +} + +/* ---------------------------------------------------------------------- + * W R I T E R P R O C E S S + */ + +/* + * The log has been written. Wake the writer to commit the + * data to disk, if the transaction log cache is full. + * + * Data may not be written to the database until it has been + * flushed to the log. 
+ * + * This is because there is no way to undo changes to the + * database. + * + * However, I have discovered that writing constantly in the + * background can disturb the I/O in the foreground. + * + * So we can delay the writing of the database. But we should + * not delay it longer than we have transaction log cache. + * + * If so, the data that we need will fall out of the cache + * and we will have to read it again. + */ +static void xlog_wr_log_written(XTDatabaseHPtr db) +{ + if (db->db_wr_idle) { + xtWord8 cached_bytes; + + /* Determine if the cached log data is about to fall out of the cache. */ + cached_bytes = db->db_xlog.xl_log_bytes_written - db->db_xlog.xl_log_bytes_read; + /* The limit is 75%: */ + if (cached_bytes >= xt_xlog_cache.xlc_upper_limit) { + if (!xt_broadcast_cond_ns(&db->db_wr_cond)) + xt_log_and_clear_exception_ns(); + } + } +} + +#define XT_MORE_TO_WRITE 1 +#define XT_FREER_WAITING 2 +#define XT_NO_ACTIVITY 3 +#define XT_LOG_CACHE_FULL 4 +#define XT_CHECKPOINT_REQ 5 +#define XT_THREAD_WAITING 6 +#define XT_TIME_TO_WRITE 7 + +/* + * Wait for a transaction to quit, i.e. the log to be flushed. + */ +static void xlog_wr_wait_for_log_flush(XTThreadPtr self, XTDatabaseHPtr db) +{ + xtXactID last_xn_id; + xtWord8 cached_bytes; + int reason = XT_MORE_TO_WRITE; + +#ifdef TRACE_WRITER_ACTIVITY + printf("WRITER --- DONE\n"); +#endif + + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + + /* + * Wake the freeer if it is waiting for this writer, before + * we go to sleep! + */ + if (db->db_wr_freeer_waiting) { + if (!xt_broadcast_cond_ns(&db->db_wr_cond)) + xt_log_and_clear_exception_ns(); + } + + if (db->db_wr_flush_point_log_id == db->db_xlog.xl_flush_log_id && + db->db_wr_flush_point_log_offset == db->db_xlog.xl_flush_log_offset) { + /* Wake the checkpointer to flush the indexes: + * PMC 15.05.2008 - Not doing this anymore! 
+ xt_wake_checkpointer(self, db); + */ + + /* Sleep as long as the flush point has not changed, from the last + * target flush point. + */ + while (!self->t_quit && + db->db_wr_flush_point_log_id == db->db_xlog.xl_flush_log_id && + db->db_wr_flush_point_log_offset == db->db_xlog.xl_flush_log_offset && + reason != XT_LOG_CACHE_FULL && + reason != XT_TIME_TO_WRITE && + reason != XT_CHECKPOINT_REQ) { + + /* + * Sleep as long as there is no reason to write any more... + */ + while (!self->t_quit) { + last_xn_id = db->db_xn_curr_id; + db->db_wr_idle = XT_THREAD_IDLE; + xt_timed_wait_cond(self, &db->db_wr_cond, &db->db_wr_lock, 500); + db->db_wr_idle = XT_THREAD_BUSY; + /* These are the reasons for doing work: */ + /* The free'er thread is waiting for the writer: */ + if (db->db_wr_freeer_waiting) { + reason = XT_FREER_WAITING; + break; + } + /* Some thread is waiting for the writer: */ + if (db->db_wr_thread_waiting) { + reason = XT_THREAD_WAITING; + break; + } + /* Check if the cache will soon overflow... */ + ASSERT(db->db_xlog.xl_log_bytes_written >= db->db_xlog.xl_log_bytes_read); + ASSERT(db->db_xlog.xl_log_bytes_written >= db->db_xlog.xl_log_bytes_flushed); + /* Sanity check: */ + ASSERT(db->db_xlog.xl_log_bytes_written < db->db_xlog.xl_log_bytes_read + 500000000); + /* This is the amount of data still to be written: */ + cached_bytes = db->db_xlog.xl_log_bytes_written - db->db_xlog.xl_log_bytes_read; + /* The limit is 75%: */ + if (cached_bytes >= xt_xlog_cache.xlc_upper_limit) { + reason = XT_LOG_CACHE_FULL; + break; + } + + /* TODO: Create a system variable which specifies the write frequency. 
*//* + if (cached_bytes >= (12 * 1024 * 1024)) { + reason = XT_TIME_TO_WRITE; + break; + } + */ + + /* Check if we are holding up a checkpoint: */ + if (db->db_restart.xres_cp_required || + db->db_restart.xres_is_checkpoint_pending(db->db_xlog.xl_write_log_id, db->db_xlog.xl_write_log_offset)) { + /* Enough data has been flushed for a checkpoint: */ + if (!db->db_restart.xres_is_checkpoint_pending(db->db_wr_log_id, db->db_wr_log_offset)) { + /* But not enough data has been written for a checkpoint: */ + reason = XT_CHECKPOINT_REQ; + break; + } + } + /* There is no activity, if the current ID has not changed during + * the wait, and the sweeper has nothing to do, and the checkpointer. + */ + if (db->db_xn_curr_id == last_xn_id && + xt_xn_is_before(xt_xn_get_curr_id(db), db->db_xn_to_clean_id) && // db->db_xn_curr_id < db->db_xn_to_clean_id + !db->db_restart.xres_is_checkpoint_pending(db->db_xlog.xl_write_log_id, db->db_xlog.xl_write_log_offset)) { + /* There seems to be no activity at the moment. + * this might be a good time to write the log data. 
+ */ + reason = XT_NO_ACTIVITY; + break; + } + } + } + } + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + if (reason == XT_LOG_CACHE_FULL || reason == XT_TIME_TO_WRITE || reason == XT_CHECKPOINT_REQ) { + /* Make sure that we have something to write: */ + if (db->db_xlog.xlog_bytes_to_write() < 2 * 1204 * 1024) + xt_xlog_flush_log(self); + } + +#ifdef TRACE_WRITER_ACTIVITY + switch (reason) { + case XT_MORE_TO_WRITE: printf("WRITER --- still more to write...\n"); break; + case XT_FREER_WAITING: printf("WRITER --- free'er waiting for writer...\n"); break; + case XT_NO_ACTIVITY: printf("WRITER --- no activity...\n"); break; + case XT_LOG_CACHE_FULL: printf("WRITER --- running out of log cache...\n"); break; + case XT_CHECKPOINT_REQ: printf("WRITER --- enough flushed for a checkpoint...\n"); break; + case XT_THREAD_WAITING: printf("WRITER --- thread waiting for writer...\n"); break; + case XT_TIME_TO_WRITE: printf("WRITER --- limit of 12MB reached, time to write...\n"); break; + } +#endif +} + +static void xlog_wr_could_go_faster(XTThreadPtr self, XTDatabaseHPtr db) +{ + if (db->db_wr_faster) { + if (!db->db_wr_fast) { + xt_set_normal_priority(self); + db->db_wr_fast = TRUE; + } + db->db_wr_faster = FALSE; + } +} + +static void xlog_wr_could_go_slower(XTThreadPtr self, XTDatabaseHPtr db) +{ + if (db->db_wr_fast && !db->db_wr_faster) { + xt_set_low_priority(self); + db->db_wr_fast = FALSE; + } +} + +static void xlog_wr_main(XTThreadPtr self) +{ + XTDatabaseHPtr db = self->st_database; + XTWriterStatePtr ws; + XTXactLogBufferDPtr record; + + xt_set_low_priority(self); + + alloczr_(ws, xt_free_writer_state, sizeof(XTWriterStateRec), XTWriterStatePtr); + ws->ws_db = db; + ws->ws_in_recover = FALSE; + + if (!db->db_xlog.xlog_seq_init(&ws->ws_seqread, xt_db_log_buffer_size, FALSE)) + xt_throw(self); + + if (!db->db_xlog.xlog_seq_start(&ws->ws_seqread, db->db_wr_log_id, db->db_wr_log_offset, FALSE)) + xt_throw(self); + + while (!self->t_quit) { + while (!self->t_quit) { + 
/* Determine the point to which we can write. + * This is the current log flush point! + */ + xt_lock_mutex_ns(&db->db_wr_lock); + db->db_wr_flush_point_log_id = db->db_xlog.xl_flush_log_id; + db->db_wr_flush_point_log_offset = db->db_xlog.xl_flush_log_offset; + xt_unlock_mutex_ns(&db->db_wr_lock); + + if (xt_comp_log_pos(db->db_wr_log_id, db->db_wr_log_offset, db->db_wr_flush_point_log_id, db->db_wr_flush_point_log_offset) >= 0) { + break; + } + + while (!self->t_quit) { + xlog_wr_could_go_faster(self, db); + + /* This is the restart position: */ + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + db->db_wr_log_id = ws->ws_seqread.xseq_rec_log_id; + db->db_wr_log_offset = ws->ws_seqread.xseq_rec_log_offset + ws->ws_seqread.xseq_record_len; + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + if (xt_comp_log_pos(db->db_wr_log_id, db->db_wr_log_offset, db->db_wr_flush_point_log_id, db->db_wr_flush_point_log_offset) >= 0) { + break; + } + + /* Apply all changes that have been flushed to the log, to the + * database. 
+ */ + if (!db->db_xlog.xlog_seq_next(&ws->ws_seqread, &record, FALSE, self)) + xt_throw(self); + if (!record) { + break; + } + /* Count the number of bytes read from the log: */ + db->db_xlog.xl_log_bytes_read += ws->ws_seqread.xseq_record_len; + + switch (record->xl.xl_status_1) { + case XT_LOG_ENT_HEADER: + break; + case XT_LOG_ENT_NEW_LOG: + if (!db->db_xlog.xlog_seq_start(&ws->ws_seqread, XT_GET_DISK_4(record->xl.xl_log_id_4), 0, TRUE)) + xt_throw(self); + break; + case XT_LOG_ENT_NEW_TAB: + case XT_LOG_ENT_COMMIT: + case XT_LOG_ENT_ABORT: + case XT_LOG_ENT_CLEANUP: + case XT_LOG_ENT_OP_SYNC: + break; + case XT_LOG_ENT_DEL_LOG: + xtLogID log_id; + + log_id = XT_GET_DISK_4(record->xl.xl_log_id_4); + xt_dl_set_to_delete(self, db, log_id); + break; + default: + xt_xres_apply_in_order(self, ws, ws->ws_seqread.xseq_rec_log_id, ws->ws_seqread.xseq_rec_log_offset, record); + break; + } + } + } + + if (ws->ws_ot) { + xt_db_return_table_to_pool(self, ws->ws_ot); + ws->ws_ot = NULL; + } + + xlog_wr_could_go_slower(self, db); + + /* Note, we delay writing the database for a maximum of + * 2 seconds. + */ + xlog_wr_wait_for_log_flush(self, db); + } + + freer_(); // xt_free_writer_state(ss) +} + +static void *xlog_wr_run_thread(XTThreadPtr self) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) self->t_data; + int count; + void *mysql_thread; + + mysql_thread = myxt_create_thread(); + + while (!self->t_quit) { + try_(a) { + /* + * The garbage collector requires that the database + * is in use because. + */ + xt_use_database(self, db, XT_FOR_WRITER); + + /* This action is both safe and required (see details elsewhere) */ + xt_heap_release(self, self->st_database); + + xlog_wr_main(self); + } + catch_(a) { + /* This error is "normal"! 
*/ + if (self->t_exception.e_xt_err != XT_ERR_NO_DICTIONARY && + !(self->t_exception.e_xt_err == XT_SIGNAL_CAUGHT && + self->t_exception.e_sys_err == SIGTERM)) + xt_log_and_clear_exception(self); + } + cont_(a); + + /* Avoid releasing the database (done above) */ + self->st_database = NULL; + xt_unuse_database(self, self); + + /* After an exception, pause before trying again... */ + /* Number of seconds */ +#ifdef DEBUG + count = 10; +#else + count = 2*60; +#endif + db->db_wr_idle = XT_THREAD_INERR; + while (!self->t_quit && count > 0) { + sleep(1); + count--; + } + db->db_wr_idle = XT_THREAD_BUSY; + } + + myxt_destroy_thread(mysql_thread, TRUE); + return NULL; +} + +static void xlog_wr_free_thread(XTThreadPtr self, void *data) +{ + XTDatabaseHPtr db = (XTDatabaseHPtr) data; + + if (db->db_wr_thread) { + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + db->db_wr_thread = NULL; + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + } +} + +xtPublic void xt_start_writer(XTThreadPtr self, XTDatabaseHPtr db) +{ + char name[PATH_MAX]; + + sprintf(name, "WR-%s", xt_last_directory_of_path(db->db_main_path)); + xt_remove_dir_char(name); + db->db_wr_thread = xt_create_daemon(self, name); + xt_set_thread_data(db->db_wr_thread, db, xlog_wr_free_thread); + xt_run_thread(self, db->db_wr_thread, xlog_wr_run_thread); +} + +/* + * This function is called on database shutdown. + * We will wait a certain amount of time for the writer to + * complete its work. + * If it takes too long we will abort! 
+ */ +xtPublic void xt_wait_for_writer(XTThreadPtr self, XTDatabaseHPtr db) +{ + time_t then, now; + xtBool message = FALSE; + + if (db->db_wr_thread) { + then = time(NULL); + while (xt_comp_log_pos(db->db_wr_log_id, db->db_wr_log_offset, db->db_wr_flush_point_log_id, db->db_wr_flush_point_log_offset) < 0) { + + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + db->db_wr_thread_waiting++; + /* Wake the writer so that it can complete its work. */ + if (db->db_wr_idle) { + if (!xt_broadcast_cond_ns(&db->db_wr_cond)) + xt_log_and_clear_exception_ns(); + } + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + xt_sleep_milli_second(10); + + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + db->db_wr_thread_waiting--; + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + now = time(NULL); + if (now >= then + 16) { + xt_logf(XT_NT_INFO, "Aborting wait for '%s' writer\n", db->db_name); + message = FALSE; + break; + } + if (now >= then + 2) { + if (!message) { + message = TRUE; + xt_logf(XT_NT_INFO, "Waiting for '%s' writer...\n", db->db_name); + } + } + } + + if (message) + xt_logf(XT_NT_INFO, "Writer '%s' done.\n", db->db_name); + } +} + +xtPublic void xt_stop_writer(XTThreadPtr self, XTDatabaseHPtr db) +{ + XTThreadPtr thr_wr; + + if (db->db_wr_thread) { + xt_lock_mutex(self, &db->db_wr_lock); + pushr_(xt_unlock_mutex, &db->db_wr_lock); + + /* This pointer is safe as long as you have the transaction lock. */ + if ((thr_wr = db->db_wr_thread)) { + xtThreadID tid = thr_wr->t_id; + + /* Make sure the thread quits when woken up. */ + xt_terminate_thread(self, thr_wr); + + /* Wake the writer thread so that it will quit: */ + xt_broadcast_cond(self, &db->db_wr_cond); + + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + + /* + * GOTCHA: This is a weird thing but the SIGTERM directed + * at a particular thread (in this case the sweeper) was + * being caught by a different thread and killing the server + * sometimes. 
Disconcerting. + * (this may only be a problem on Mac OS X) + xt_kill_thread(thread); + */ + xt_wait_for_thread(tid, FALSE); + + /* PMC - This should not be necessary to set the signal here, but in the + * debugger the handler is not called!!? + thr_wr->t_delayed_signal = SIGTERM; + xt_kill_thread(thread); + */ + db->db_wr_thread = NULL; + } + else + freer_(); // xt_unlock_mutex(&db->db_wr_lock) + } +} + +#ifdef NOT_USED +static void xlog_add_to_flush_buffer(u_int flush_count, XTXLogBlockPtr *flush_buffer, XTXLogBlockPtr block) +{ + register u_int count = flush_count; + register u_int i; + register u_int guess; + register xtInt8 r; + + i = 0; + while (i < count) { + guess = (i + count - 1) >> 1; + r = (xtInt8) block->xlb_address - (xtInt8) flush_buffer[guess]->xlb_address; + if (r == 0) { + // Should not happen... + ASSERT_NS(FALSE); + return; + } + if (r < (xtInt8) 0) + count = guess; + else + i = guess + 1; + } + + /* Insert at position i */ + memmove(flush_buffer + i + 1, flush_buffer + i, (flush_count - i) * sizeof(XTXLogBlockPtr)); + flush_buffer[i] = block; +} + +static XTXLogBlockPtr xlog_find_block(XTOpenFilePtr file, xtLogID log_id, off_t address, XTXLogCacheSegPtr *ret_seg) +{ + register XTXLogCacheSegPtr seg; + register XTXLogBlockPtr block; + register u_int hash_idx; + register XTXLogCacheRec *dcg = &xt_xlog_cache; + + seg = &dcg->xlc_segment[((u_int) address >> XT_XLC_BLOCK_SHIFTS) & XLC_SEGMENT_MASK]; + hash_idx = (((u_int) (address >> (XT_XLC_SEGMENT_SHIFTS + XT_XLC_BLOCK_SHIFTS))) ^ (log_id << 16)) % dcg->xlc_hash_size; + + xt_lock_mutex_ns(&seg->lcs_lock); + retry: + block = seg->lcs_hash_table[hash_idx]; + while (block) { + if (block->xlb_address == address && block->xlb_log_id == log_id) { + ASSERT_NS(block->xlb_state != XLC_BLOCK_FREE); + + /* Wait if the block is being read or written. + * If we will just read the data, then we don't care + * if the buffer is being written. 
+ */ + if (block->xlb_state == XLC_BLOCK_READING) { + if (!xt_timed_wait_cond_ns(&seg->lcs_cond, &seg->lcs_lock, 100)) + break; + goto retry; + } + + *ret_seg = seg; + return block; + } + block = block->xlb_next; + } + + /* Block not found: */ + xt_unlock_mutex_ns(&seg->lcs_lock); + return NULL; +} + +static int xlog_cmp_log_files(struct XTThread *self, register const void *thunk, register const void *a, register const void *b) +{ +#pragma unused(self, thunk) + xtLogID lf_id = *((xtLogID *) a); + XTXactLogFilePtr lf_ptr = (XTXactLogFilePtr) b; + + if (lf_id < lf_ptr->lf_log_id) + return -1; + if (lf_id == lf_ptr->lf_log_id) + return 0; + return 1; +} + +#endif + + +#ifdef OLD_CODE +static xtBool xlog_free_lru_blocks() +{ + XTXLogBlockPtr block, pblock; + xtWord4 ru_time; + xtLogID log_id; + off_t address; + //off_t hash; + XTXLogCacheSegPtr seg; + u_int hash_idx; + xtBool have_global_lock = FALSE; + +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + retry: + if (!(block = xt_xlog_cache.xlc_lru_block)) + return OK; + + ru_time = block->xlb_ru_time; + log_id = block->xlb_log_id; + address = block->xlb_address; + + /* + hash = (address >> XT_XLC_BLOCK_SHIFTS) ^ ((off_t) log_id << 28); + seg = &xt_xlog_cache.xlc_segment[hash & XLC_SEGMENT_MASK]; + hash_idx = (hash >> XT_XLC_SEGMENT_SHIFTS) % xt_xlog_cache.xlc_hash_size; + */ + seg = &xt_xlog_cache.xlc_segment[((u_int) address >> XT_XLC_BLOCK_SHIFTS) & XLC_SEGMENT_MASK]; + hash_idx = (((u_int) (address >> (XT_XLC_SEGMENT_SHIFTS + XT_XLC_BLOCK_SHIFTS))) ^ (log_id << 16)) % xt_xlog_cache.xlc_hash_size; + + xt_lock_mutex_ns(&seg->lcs_lock); + + free_more: + pblock = NULL; + block = seg->lcs_hash_table[hash_idx]; + while (block) { + if (block->xlb_address == address && block->xlb_log_id == log_id) { + ASSERT_NS(block->xlb_state != XLC_BLOCK_FREE); + + /* Try again if the block has been used in the meantime: */ + if (ru_time != block->xlb_ru_time) { + if (have_global_lock) + /* Having this lock means we have 
already freed at least one block so + * don't bother to free more if we are having trouble. + */ + goto done_ok; + + /* If the recently used time has changed, then the + * block is probably no longer the LR used. + */ + xt_unlock_mutex_ns(&seg->lcs_lock); + goto retry; + } + + /* Wait if the block is being read: */ + if (block->xlb_state == XLC_BLOCK_READING) { + if (have_global_lock) + goto done_ok; + + /* Wait for the block to be read, then try again. */ + if (!xt_timed_wait_cond_ns(&seg->lcs_cond, &seg->lcs_lock, 100)) + goto failed; + xt_unlock_mutex_ns(&seg->lcs_lock); + goto retry; + } + + goto free_the_block; + } + pblock = block; + block = block->xlb_next; + } + + if (have_global_lock) { + xt_unlock_mutex_ns(&xt_xlog_cache.xlc_lock); + have_global_lock = FALSE; + } + + /* We did not find the block, someone else freed it... */ + xt_unlock_mutex_ns(&seg->lcs_lock); + goto retry; + + free_the_block: + ASSERT_NS(block->xlb_state == XLC_BLOCK_CLEAN); + + /* Remove from the hash table: */ + if (pblock) + pblock->xlb_next = block->xlb_next; + else + seg->lcs_hash_table[hash_idx] = block->xlb_next; + + /* Now free the block */ + if (!have_global_lock) { + xt_lock_mutex_ns(&xt_xlog_cache.xlc_lock); + have_global_lock = TRUE; + } + + /* Remove from the MRU list: */ + if (xt_xlog_cache.xlc_lru_block == block) + xt_xlog_cache.xlc_lru_block = block->xlb_mr_used; + if (xt_xlog_cache.xlc_mru_block == block) + xt_xlog_cache.xlc_mru_block = block->xlb_lr_used; + if (block->xlb_lr_used) + block->xlb_lr_used->xlb_mr_used = block->xlb_mr_used; + if (block->xlb_mr_used) + block->xlb_mr_used->xlb_lr_used = block->xlb_lr_used; + + /* Put the block on the free list: */ + block->xlb_next = xt_xlog_cache.xlc_free_list; + xt_xlog_cache.xlc_free_list = block; + xt_xlog_cache.xlc_free_count++; + block->xlb_state = XLC_BLOCK_FREE; + + if (xt_xlog_cache.xlc_free_count < XT_XLC_MAX_FREE_COUNT) { + /* Now that we have all the locks, try to free some more in this segment: */ + block = 
block->xlb_mr_used; + for (u_int i=0; block && i<XLC_SEGMENT_COUNT; i++) { + ru_time = block->xlb_ru_time; + log_id = block->xlb_log_id; + address = block->xlb_address; + + if (seg == &xt_xlog_cache.xlc_segment[((u_int) address >> XT_XLC_BLOCK_SHIFTS) & XLC_SEGMENT_MASK]) { + hash_idx = (((u_int) (address >> (XT_XLC_SEGMENT_SHIFTS + XT_XLC_BLOCK_SHIFTS))) ^ (log_id << 16)) % xt_xlog_cache.xlc_hash_size; + goto free_more; + } + block = block->xlb_mr_used; + } + } + + done_ok: + xt_unlock_mutex_ns(&xt_xlog_cache.xlc_lock); + xt_unlock_mutex_ns(&seg->lcs_lock); +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + return OK; + + failed: + xt_unlock_mutex_ns(&seg->lcs_lock); +#ifdef DEBUG_CHECK_CACHE + //xt_xlog_check_cache(); +#endif + return FAILED; +} + +#endif diff --git a/storage/pbxt/src/xactlog_xt.h b/storage/pbxt/src/xactlog_xt.h new file mode 100644 index 00000000000..391b646b53f --- /dev/null +++ b/storage/pbxt/src/xactlog_xt.h @@ -0,0 +1,460 @@ +/* Copyright (c) 2007 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2007-10-31 Paul McCullagh + * + * H&G2JCtL + * + * The new table cache. Caches all non-index data. This includes the data + * files and the row pointer files. 
+ */ + +#ifndef __xactlog_xt_h__ +#define __xactlog_xt_h__ + +#include "pthread_xt.h" +#include "filesys_xt.h" +#include "sortedlist_xt.h" + +struct XTThread; +struct XTOpenTable; +struct XTDatabase; + +#ifdef DEBUG +//#define XT_USE_CACHE_DEBUG_SIZES +#endif + +#ifdef XT_USE_CACHE_DEBUG_SIZES +#define XT_XLC_BLOCK_SHIFTS 5 +#define XT_XLC_FILE_SLOTS 7 +#define XT_XLC_SEGMENT_SHIFTS 1 +#define XT_XLC_MAX_FLUSH_SEG_COUNT 10 +#define XT_XLC_MAX_FREE_COUNT 10 +#else +/* Block size is determined by the number of shifts 1 << 15 = 32K */ +#define XT_XLC_BLOCK_SHIFTS 15 +#define XT_XLC_FILE_SLOTS 71 +/* The number of segments are determined by the segment shifts 1 << 3 = 8 */ +#define XT_XLC_SEGMENT_SHIFTS 3 +#define XT_XLC_MAX_FLUSH_SEG_COUNT 250 +#define XT_XLC_MAX_FREE_COUNT 100 +#endif + +#define XT_XLC_BLOCK_SIZE (1 << XT_XLC_BLOCK_SHIFTS) +#define XT_XLC_BLOCK_MASK (XT_XLC_BLOCK_SIZE - 1) + +#define XT_TIME_DIFF(start, now) (\ + ((xtWord4) (now) < (xtWord4) (start)) ? \ + ((xtWord4) 0XFFFFFFFF - ((xtWord4) (start) - (xtWord4) (now))) : \ + ((xtWord4) (now) - (xtWord4) (start))) + +#define XLC_SEGMENT_COUNT ((off_t) 1 << XT_XLC_SEGMENT_SHIFTS) +#define XLC_SEGMENT_MASK (XLC_SEGMENT_COUNT - 1) +#define XLC_MAX_FLUSH_COUNT (XT_XLC_MAX_FLUSH_SEG_COUNT * XLC_SEGMENT_COUNT) + +#define XLC_BLOCK_FREE 0 +#define XLC_BLOCK_READING 1 +#define XLC_BLOCK_CLEAN 2 + +#define XT_RECYCLE_LOGS 0 +#define XT_DELETE_LOGS 1 +#define XT_KEEP_LOGS 2 + +/* LOG CACHE ---------------------------------------------------- */ + +typedef struct XTXLogBlock { + off_t xlb_address; /* The block address. */ + xtLogID xlb_log_id; /* The log id of the block. */ + xtWord4 xlb_state; /* Block status. */ + struct XTXLogBlock *xlb_next; /* Pointer to next block on hash list, or next free block on free list. */ + xtWord1 xlb_data[XT_XLC_BLOCK_SIZE]; +} XTXLogBlockRec, *XTXLogBlockPtr; + +/* A disk cache segment. The cache is divided into a number of segments + * to improve concurrency. 
+ */ +typedef struct XTXLogCacheSeg { + xt_mutex_type lcs_lock; /* The cache segment lock. */ + xt_cond_type lcs_cond; + XTXLogBlockPtr *lcs_hash_table; +} XTXLogCacheSegRec, *XTXLogCacheSegPtr; + +typedef struct XTXLogCache { + xt_mutex_type xlc_lock; /* The public cache lock. */ + xt_cond_type xlc_cond; /* The public cache wait condition. */ + XTXLogCacheSegRec xlc_segment[XLC_SEGMENT_COUNT]; + XTXLogBlockPtr xlc_blocks; + XTXLogBlockPtr xlc_blocks_end; + XTXLogBlockPtr xlc_next_to_free; + xtWord4 xlc_free_count; + xtWord4 xlc_hash_size; + xtWord4 xlc_block_count; + xtWord8 xlc_upper_limit; +} XTXLogCacheRec; + +/* LOG ENTRIES ---------------------------------------------------- */ + +#define XT_LOG_ENT_EOF 0 +#define XT_LOG_ENT_HEADER 1 +#define XT_LOG_ENT_NEW_LOG 2 /* Move to the next log! NOTE!! May not appear in a group!! */ +#define XT_LOG_ENT_DEL_LOG 3 /* Delete the given transaction/data log. */ +#define XT_LOG_ENT_NEW_TAB 4 /* This record indicates a new table was created. */ + +#define XT_LOG_ENT_COMMIT 5 /* Transaction was committed. */ +#define XT_LOG_ENT_ABORT 6 /* Transaction was aborted. */ +#define XT_LOG_ENT_CLEANUP 7 /* Written after a cleanup. */ + +#define XT_LOG_ENT_REC_MODIFIED 8 /* This records has been modified by the transaction. */ +#define XT_LOG_ENT_UPDATE 9 +#define XT_LOG_ENT_UPDATE_BG 10 +#define XT_LOG_ENT_UPDATE_FL 11 +#define XT_LOG_ENT_UPDATE_FL_BG 12 +#define XT_LOG_ENT_INSERT 13 +#define XT_LOG_ENT_INSERT_BG 14 +#define XT_LOG_ENT_INSERT_FL 15 +#define XT_LOG_ENT_INSERT_FL_BG 16 +#define XT_LOG_ENT_DELETE 17 +#define XT_LOG_ENT_DELETE_BG 18 +#define XT_LOG_ENT_DELETE_FL 19 +#define XT_LOG_ENT_DELETE_FL_BG 20 + +#define XT_LOG_ENT_REC_FREED 21 /* This record has been placed in the free list. */ +#define XT_LOG_ENT_REC_REMOVED 22 /* Free record and dependecies: index references, blob references. */ +#define XT_LOG_ENT_REC_REMOVED_EXT 23 /* Free record and dependecies: index references, extended data, blob references. 
*/ +#define XT_LOG_ENT_REC_REMOVED_BI 38 /* Free record and dependencies: includes before image of record, for freeing index, etc. */ + +#define XT_LOG_ENT_REC_MOVED 24 /* The record has been moved by the compactor. */ +#define XT_LOG_ENT_REC_CLEANED 25 /* This record has been cleaned by the sweeper. */ +#define XT_LOG_ENT_REC_CLEANED_1 26 /* This record has been cleaned by the sweeper (short form). */ +#define XT_LOG_ENT_REC_UNLINKED 27 /* The record after this record is unlinked from the variation list. */ + +#define XT_LOG_ENT_ROW_NEW 28 /* Row allocated from the EOF. */ +#define XT_LOG_ENT_ROW_NEW_FL 29 /* Row allocated from the free list. */ +#define XT_LOG_ENT_ROW_ADD_REC 30 /* Record added to the row. */ +#define XT_LOG_ENT_ROW_SET 31 +#define XT_LOG_ENT_ROW_FREED 32 + +#define XT_LOG_ENT_OP_SYNC 33 /* Operations synchronised. */ +#define XT_LOG_ENT_EXT_REC_OK 34 /* An extended record */ +#define XT_LOG_ENT_EXT_REC_DEL 35 /* A deleted extended record */ + +#define XT_LOG_ENT_NO_OP 36 /* If write to the database fails, we still try to log the + * op code, in an attempt to continue, if writing to log + * still works. + */ +#define XT_LOG_ENT_END_OF_LOG 37 /* This is a record that indicates the end of the log, and + * fills to the end of a 512 byte block. + */ + +#define XT_LOG_FILE_MAGIC 0xAE88FE12 +#define XT_LOG_VERSION_NO 1 + +typedef struct XTXactLogHeader { + xtWord1 xh_status_1; /* XT_LOG_ENT_HEADER */ + xtWord1 xh_checksum_1; + XTDiskValue4 xh_size_4; /* Must be set to sizeof(XTXactLogHeaderDRec). */ + XTDiskValue8 xh_free_space_8; /* The accumulated free space in this file. */ + XTDiskValue8 xh_file_len_8; /* The last confirmed correct file length (always set on close). */ + XTDiskValue8 xh_comp_pos_8; /* Compaction position (XT_DL_STATUS_CO_SOURCE only). 
*/ + xtWord1 xh_comp_stat_1; /* The compaction status XT_DL_STATUS_CO_SOURCE/XT_DL_STATUS_CO_TARGET */ + XTDiskValue4 xh_log_id_4; + XTDiskValue4 xh_version_2; /* XT_LOG_VERSION_NO */ + XTDiskValue4 xh_magic_4; /* MUST always be at the end of the structure!! */ +} XTXactLogHeaderDRec, *XTXactLogHeaderDPtr; + +/* This is the original log head size (don't change): */ +#define XT_MIN_LOG_HEAD_SIZE (offsetof(XTXactLogHeaderDRec, xh_log_id_4) + 4) +#define XT_LOG_HEAD_MAGIC(b, l) XT_GET_DISK_4(((xtWord1 *) (b)) + (l) - 4) + +typedef struct XTXactNewLogEntry { + xtWord1 xl_status_1; /* XT_LOG_ENT_NEW_LOG, XT_LOG_ENT_DEL_LOG */ + xtWord1 xl_checksum_1; + XTDiskValue4 xl_log_id_4; /* Store the current table ID. */ +} XTXactNewLogEntryDRec, *XTXactNewLogEntryDPtr; + +typedef struct XTXactNewTabEntry { + xtWord1 xt_status_1; /* XT_LOG_ENT_NEW_TAB */ + xtWord1 xt_checksum_1; + XTDiskValue4 xt_tab_id_4; /* Store the current table ID. */ +} XTXactNewTabEntryDRec, *XTXactNewTabEntryDPtr; + +/* This record must appear in a transaction group, and therefore has no transaction ID: */ +typedef struct XTXactEndEntry { + xtWord1 xe_status_1; /* XT_LOG_ENT_COMMIT, XT_LOG_ENT_ABORT */ + xtWord1 xe_checksum_1; + XTDiskValue4 xe_xact_id_4; /* The transaction. */ + XTDiskValue4 xe_not_used_4; /* Was the end sequence number (no longer used - v1.0.04+), set to zero). */ +} XTXactEndEntryDRec, *XTXactEndEntryDPtr; + +typedef struct XTXactCleanupEntry { + xtWord1 xc_status_1; /* XT_LOG_ENT_CLEANUP */ + xtWord1 xc_checksum_1; + XTDiskValue4 xc_xact_id_4; /* The transaction that was cleaned up. */ +} XTXactCleanupEntryDRec, *XTXactCleanupEntryDPtr; + +typedef struct XTactUpdateEntry { + xtWord1 xu_status_1; /* XT_LOG_ENT_REC_MODIFIED, XT_LOG_ENT_UPDATE, XT_LOG_ENT_INSERT, XT_LOG_ENT_DELETE */ + /* XT_LOG_ENT_UPDATE_BG, XT_LOG_ENT_INSERT_BG, XT_LOG_ENT_DELETE_BG */ + XTDiskValue2 xu_checksum_2; + XTDiskValue4 xu_op_seq_4; /* Operation sequence number. 
*/ + XTDiskValue4 xu_tab_id_4; /* Table ID of the record. */ + xtDiskRecordID4 xu_rec_id_4; /* Offset of the new updated record. */ + XTDiskValue2 xu_size_2; /* Size of the record data. */ + /* This is the start of the actual record data: */ + xtWord1 xu_rec_type_1; /* Type of the record. */ + xtWord1 xu_stat_id_1; + xtDiskRecordID4 xu_prev_rec_id_4; /* The previous variation of this record. */ + XTDiskValue4 xu_xact_id_4; /* The transaction ID. */ + XTDiskValue4 xu_row_id_4; /* The row ID of this record. */ +} XTactUpdateEntryDRec, *XTactUpdateEntryDPtr; + +typedef struct XTactUpdateFLEntry { + xtWord1 xf_status_1; /* XT_LOG_ENT_UPDATE_FL, XT_LOG_ENT_INSERT_FL, XT_LOG_ENT_DELETE_FL */ + /* XT_LOG_ENT_UPDATE_FL_BG, XT_LOG_ENT_INSERT_FL_BG, XT_LOG_ENT_DELETE_FL_BG */ + XTDiskValue2 xf_checksum_2; + XTDiskValue4 xf_op_seq_4; /* Operation sequence number. */ + XTDiskValue4 xf_tab_id_4; /* Table ID of the record. */ + xtDiskRecordID4 xf_rec_id_4; /* Offset of the new updated record. */ + XTDiskValue2 xf_size_2; /* Size of the record data. */ + xtDiskRecordID4 xf_free_rec_id_4; /* Update to the free list. */ + /* This is the start of the actual record data: */ + xtWord1 xf_rec_type_1; /* Type of the record. */ + xtWord1 xf_stat_id_1; + xtDiskRecordID4 xf_prev_rec_id_4; /* The previous variation of this record. */ + XTDiskValue4 xf_xact_id_4; /* The transaction ID. */ + XTDiskValue4 xf_row_id_4; /* The row ID of this record. */ +} XTactUpdateFLEntryDRec, *XTactUpdateFLEntryDPtr; + +typedef struct XTactFreeRecEntry { + xtWord1 fr_status_1; /* XT_LOG_ENT_REC_REMOVED, XT_LOG_ENT_REC_REMOVED_EXT, XT_LOG_ENT_REC_FREED */ + xtWord1 fr_checksum_1; + XTDiskValue4 fr_op_seq_4; /* Operation sequence number. */ + XTDiskValue4 fr_tab_id_4; /* Table ID of the record. */ + xtDiskRecordID4 fr_rec_id_4; /* Offset of the new written record. 
*/ + /* This data confirms the record state for release of + * attached resources (extended records, indexes and blobs) + */ + xtWord1 fr_stat_id_1; /* The statement ID of the record. */ + XTDiskValue4 fr_xact_id_4; /* The transaction ID of the record. */ + /* This is the start of the actual record data: */ + xtWord1 fr_rec_type_1; + xtWord1 fr_not_used_1; + xtDiskRecordID4 fr_next_rec_id_4; /* The next block on the free list. */ +} XTactFreeRecEntryDRec, *XTactFreeRecEntryDPtr; + +typedef struct XTactRemoveBIEntry { + xtWord1 rb_status_1; /* XT_LOG_ENT_REC_REMOVED_BI */ + XTDiskValue2 rb_checksum_2; + XTDiskValue4 rb_op_seq_4; /* Operation sequence number. */ + XTDiskValue4 rb_tab_id_4; /* Table ID of the record. */ + xtDiskRecordID4 rb_rec_id_4; /* Offset of the new written record. */ + XTDiskValue2 rb_size_2; /* Size of the record data. */ + + xtWord1 rb_new_rec_type_1; /* New type of the record (needed for below). */ + + /* This is the start of the record data, with some fields overwritten for the free: */ + xtWord1 rb_rec_type_1; /* Type of the record. */ + xtWord1 rb_stat_id_1; + xtDiskRecordID4 rb_next_rec_id_4; /* The next block on the free list (overwritten). */ + XTDiskValue4 rb_xact_id_4; /* The transaction ID. */ + XTDiskValue4 rb_row_id_4; /* The row ID of this record. */ +} XTactRemoveBIEntryDRec, *XTactRemoveBIEntryDPtr; + +typedef struct XTactWriteRecEntry { + xtWord1 xw_status_1; /* XT_LOG_ENT_REC_MOVED, XT_LOG_ENT_REC_CLEANED, XT_LOG_ENT_REC_CLEANED_1, + * XT_LOG_ENT_REC_UNLINKED */ + xtWord1 xw_checksum_1; + XTDiskValue4 xw_op_seq_4; /* Operation sequence number. */ + XTDiskValue4 xw_tab_id_4; /* Table ID of the record. */ + xtDiskRecordID4 xw_rec_id_4; /* Offset of the new written record. */ + /* This is the start of the actual record data: */ + xtWord1 xw_rec_type_1; + xtWord1 xw_stat_id_1; + xtDiskRecordID4 xw_next_rec_id_4; /* The next block on the free list. 
*/ +} XTactWriteRecEntryDRec, *XTactWriteRecEntryDPtr; + +typedef struct XTactRowAddedEntry { + xtWord1 xa_status_1; /* XT_LOG_ENT_ROW_NEW or XT_LOG_ENT_ROW_NEW_FL */ + xtWord1 xa_checksum_1; + XTDiskValue4 xa_op_seq_4; /* Operation sequence number. */ + XTDiskValue4 xa_tab_id_4; /* Table ID of the record. */ + XTDiskValue4 xa_row_id_4; /* The row ID of the row allocated. */ + XTDiskValue4 xa_free_list_4; /* Change to the free list (ONLY for XT_LOG_ENT_ROW_NEW_FL). */ +} XTactRowAddedEntryDRec, *XTactRowAddedEntryDPtr; + +typedef struct XTactWriteRowEntry { + xtWord1 wr_status_1; /* XT_LOG_ENT_ROW_ADD_REC, XT_LOG_ENT_ROW_SET, XT_LOG_ENT_ROW_FREED */ + xtWord1 wr_checksum_1; + XTDiskValue4 wr_op_seq_4; /* Operation sequence number. */ + XTDiskValue4 wr_tab_id_4; /* Table ID of the record. */ + XTDiskValue4 wr_row_id_4; /* Row ID of the row that was modified. */ + /* This is the start of the actual record data: */ + XTDiskValue4 wr_ref_id_4; /* The row reference data. */ +} XTactWriteRowEntryDRec, *XTactWriteRowEntryDPtr; + +typedef struct XTactOpSyncEntry { + xtWord1 os_status_1; /* XT_LOG_ENT_OP_SYNC */ + xtWord1 os_checksum_1; + XTDiskValue4 os_time_4; /* Time of the restart. */ +} XTactOpSyncEntryDRec, *XTactOpSyncEntryDPtr; + +typedef struct XTactNoOpEntry { + xtWord1 no_status_1; /* XT_LOG_ENT_NO_OP */ + xtWord1 no_checksum_1; + XTDiskValue4 no_op_seq_4; /* Operation sequence number. */ + XTDiskValue4 no_tab_id_4; /* Table ID of the record. */ +} XTactNoOpEntryDRec, *XTactNoOpEntryDPtr; + +typedef struct XTactExtRecEntry { + xtWord1 er_status_1; /* XT_LOG_ENT_EXT_REC_OK, XT_LOG_ENT_EXT_REC_DEL */ + XTDiskValue4 er_data_size_4; /* Size of this record data area only. */ + XTDiskValue4 er_tab_id_4; /* The table referencing this extended record. */ + xtDiskRecordID4 er_rec_id_4; /* The ID of the reference record. 
*/ + xtWord1 er_data[XT_VAR_LENGTH]; +} XTactExtRecEntryDRec, *XTactExtRecEntryDPtr; + +typedef union XTXactLogBuffer { + XTXactLogHeaderDRec xh; + XTXactNewLogEntryDRec xl; + XTXactNewTabEntryDRec xt; + XTXactEndEntryDRec xe; + XTXactCleanupEntryDRec xc; + XTactUpdateEntryDRec xu; + XTactUpdateFLEntryDRec xf; + XTactFreeRecEntryDRec fr; + XTactRemoveBIEntryDRec rb; + XTactWriteRecEntryDRec xw; + XTactRowAddedEntryDRec xa; + XTactWriteRowEntryDRec wr; + XTactOpSyncEntryDRec os; + XTactExtRecEntryDRec er; + XTactNoOpEntryDRec no; +} XTXactLogBufferDRec, *XTXactLogBufferDPtr; + +/* ---------------------------------------- */ + +typedef struct XTXactSeqRead { + size_t xseq_buffer_size; /* Size of the buffer. */ + xtBool xseq_load_cache; /* TRUE if reads should load the cache! */ + + xtLogID xseq_log_id; + XTOpenFilePtr xseq_log_file; + off_t xseq_log_eof; + + xtLogOffset xseq_buf_log_offset; /* File offset of the buffer. */ + size_t xseq_buffer_len; /* Amount of data in the buffer. */ + xtWord1 *xseq_buffer; + + xtLogID xseq_rec_log_id; /* The current record log ID. */ + xtLogOffset xseq_rec_log_offset; /* The current log read position. */ + size_t xseq_record_len; /* The length of the current record. */ +} XTXactSeqReadRec, *XTXactSeqReadPtr; + +typedef struct XTXactLogFile { + xtLogID lf_log_id; + off_t lr_file_len; /* The log file size (0 means this is the last log) */ +} XTXactLogFileRec, *XTXactLogFilePtr; + +/* + * The transaction log. Each database has one. + */ +typedef struct XTDatabaseLog { + struct XTDatabase *xl_db; + + off_t xl_log_file_threshold; + u_int xl_log_file_count; /* Number of logs to use (>= 1). */ + u_int xt_log_file_dyn_count; /* A dynamic value to add to log file count. */ + u_int xt_log_file_dyn_dec; /* Used to descide when to decrement the dynamic count. */ + size_t xl_size_of_buffers; /* The size of both log buffers. */ + xtWord8 xl_log_bytes_written; /* The total number of bytes written to the log, after recovery. 
*/ + xtWord8 xl_log_bytes_flushed; /* The total number of bytes flushed to the log, after recovery. */ + xtWord8 xl_log_bytes_read; /* The total number of log bytes read, after recovery. */ + + u_int xl_last_flush_time; /* Last flush time in micro-seconds. */ + + /* The writer log buffer: */ + xt_mutex_type xl_write_lock; + xt_cond_type xl_write_cond; + xtBool xt_writing; /* TRUE if a thread is writing. */ + xtLogID xl_log_id; /* The number of the write log. */ + XTOpenFilePtr xl_log_file; /* The open write log. */ + + XTSpinLockRec xl_buffer_lock; /* This locks both the write and the append log buffers. */ + + xtLogID xl_max_log_id; /* The ID of the highest log on disk. */ + + xtLogID xl_write_log_id; /* This is the log ID were the write data will go. */ + xtLogOffset xl_write_log_offset; /* The file offset of the write log. */ + size_t xl_write_buf_pos; + size_t xl_write_buf_pos_start; + xtWord1 *xl_write_buffer; + xtBool xl_write_done; /* TRUE if the write buffer has been written! */ + + xtLogID xl_append_log_id; /* This is the log ID were the append data will go. */ + xtLogOffset xl_append_log_offset; /* The file offset in the log were the append data will go. */ + size_t xl_append_buf_pos; /* The amount of data in the append buffer. */ + size_t xl_append_buf_pos_start; /* The amount of data in the append buffer already written. */ + xtWord1 *xl_append_buffer; + + xtLogID xl_flush_log_id; /* The last log flushed. */ + xtLogOffset xl_flush_log_offset; /* The position in the log flushed. 
*/ + + void xlog_setup(struct XTThread *self, struct XTDatabase *db, off_t log_file_size, size_t transaction_buffer_size, int log_count); + xtBool xlog_set_write_offset(xtLogID log_id, xtLogOffset log_offset, xtLogID max_log_id, struct XTThread *thread); + void xlog_close(struct XTThread *self); + void xlog_exit(struct XTThread *self); + void xlog_name(size_t size, char *path, xtLogID log_id); + int xlog_delete_log(xtLogID del_log_id, struct XTThread *thread); + + xtBool xlog_append(struct XTThread *thread, size_t size1, xtWord1 *data1, size_t size2, xtWord1 *data2, xtBool commit, xtLogID *log_id, xtLogOffset *log_offset); + xtBool xlog_flush(struct XTThread *thread); + xtBool xlog_flush_pending(); + + xtBool xlog_seq_init(XTXactSeqReadPtr seq, size_t buffer_size, xtBool load_cache); + void xlog_seq_exit(XTXactSeqReadPtr seq); + void xlog_seq_close(XTXactSeqReadPtr seq); + xtBool xlog_seq_start(XTXactSeqReadPtr seq, xtLogID log_id, xtLogOffset log_offset, xtBool missing_ok); + xtBool xlog_rnd_read(XTXactSeqReadPtr seq, xtLogID log_id, xtLogOffset log_offset, size_t size, xtWord1 *data, size_t *read, struct XTThread *thread); + size_t xlog_bytes_to_write(); + xtBool xlog_read_from_cache(XTXactSeqReadPtr seq, xtLogID log_id, xtLogOffset log_offset, size_t size, off_t eof, xtWord1 *buffer, size_t *data_read, struct XTThread *thread); + xtBool xlog_write_thru(XTXactSeqReadPtr seq, size_t size, xtWord1 *data, struct XTThread *thread); + xtBool xlog_verify(XTXactLogBufferDPtr record, size_t rec_size, xtLogID log_id); + xtBool xlog_seq_next(XTXactSeqReadPtr seq, XTXactLogBufferDPtr *entry, xtBool verify, struct XTThread *thread); + void xlog_seq_skip(XTXactSeqReadPtr seq, size_t size); + +private: + xtBool xlog_open_log(xtLogID log_id, off_t curr_eof, struct XTThread *thread); +} XTDatabaseLogRec, *XTDatabaseLogPtr; + +xtBool xt_xlog_flush_log(struct XTThread *thread); +xtBool xt_xlog_log_data(struct XTThread *thread, size_t len, XTXactLogBufferDPtr log_entry, xtBool 
commit); +xtBool xt_xlog_modify_table(struct XTOpenTable *ot, u_int status, xtOpSeqNo op_seq, xtRecordID free_list, xtRecordID address, size_t size, xtWord1 *data); + +void xt_xlog_init(struct XTThread *self, size_t cache_size); +void xt_xlog_exit(struct XTThread *self); +xtInt8 xt_xlog_get_usage(); +xtInt8 xt_xlog_get_size(); +xtLogID xt_xlog_get_min_log(struct XTThread *self, struct XTDatabase *db); +void xt_xlog_delete_logs(struct XTThread *self, struct XTDatabase *db); + +void xt_start_writer(struct XTThread *self, struct XTDatabase *db); +void xt_wait_for_writer(struct XTThread *self, struct XTDatabase *db); +void xt_stop_writer(struct XTThread *self, struct XTDatabase *db); + +#endif + diff --git a/storage/pbxt/src/xt_config.h b/storage/pbxt/src/xt_config.h new file mode 100644 index 00000000000..6571ebdaebe --- /dev/null +++ b/storage/pbxt/src/xt_config.h @@ -0,0 +1,99 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * 2006-03-22 Paul McCullagh + * + * H&G2JCtL + * + * This header file should be included in every source, before all other + * headers. 
+ * + * In particular: BEFORE THE SYSTEM HEADERS + */ + +#ifndef __xt_config_h__ +#define __xt_config_h__ + +#define MYSQL_SERVER 1 + +#ifdef DRIZZLED +#include "drizzled/global.h" +const int max_connections = 500; +#else +#include <mysql_version.h> +#include "my_global.h" +#endif + +/* + * This enables everything that GNU can do. The macro is actually + * recommended for new programs. + */ +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +/* + * Make sure we use the thread safe version of the library. + */ +#define _THREAD_SAFE + +/* + * This causes things to be defined like stuff in inttypes.h + * which is used in printf() + */ +#ifndef __STDC_FORMAT_MACROS +#define __STDC_FORMAT_MACROS +#endif + +/* + * This define is not required by Linux because the _GNU_SOURCE + * definition includes POSIX compliance. But I need it for + * Mac OS X. + */ +//#define _POSIX_C_SOURCE 2 +//#define _ANSI_SOURCE + +#ifdef __APPLE__ +#define XT_MAC +#endif + +#if defined(MSDOS) || defined(__WIN__) +#define XT_WIN +#endif + +#ifdef XT_WIN +#ifdef _DEBUG +#define DEBUG +#endif // _DEBUG +#else +#define XT_STREAMING +#endif + +#ifdef __FreeBSD__ +#define XT_FREEBSD +#endif + +#ifdef __NetBSD__ +#define XT_NETBSD +#endif + +#ifdef __sun +#define XT_SOLARIS +#endif + +#endif diff --git a/storage/pbxt/src/xt_defs.h b/storage/pbxt/src/xt_defs.h new file mode 100644 index 00000000000..16981ddc672 --- /dev/null +++ b/storage/pbxt/src/xt_defs.h @@ -0,0 +1,782 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Author: Paul McCullagh + * + * H&G2JCtL + */ +#ifndef __xt_defs_h__ +#define __xt_defs_h__ + +#ifdef XT_WIN +#include "win_inttypes.h" +#else +#include <inttypes.h> +#endif +#include <sys/types.h> +#include <assert.h> +#include <stddef.h> +#include <string.h> + +//#include "pthread_xt.h" + +#ifdef DEBUG +//#define DEBUG_LOG_DELETE +#endif + +/* the following macros are used to quote compile-time numeric + * constants into strings, e.g. __LINE__ + */ +#define _QUOTE(x) #x +#define QUOTE(x) _QUOTE(x) + +/* ---------------------------------------------------------------------- + * CRASH DEBUGGING + */ + +/* Define this if crash debug should be on by default: + * pbxt_crash_debug set to TRUE by default. + * It can be turned off by creating a file called 'no-debug' + * in the pbxt database. + * It can be turned on by defining the file 'crash-debug' + * in the pbxt database. + */ +//#define XT_CRASH_DEBUG + +/* These are the things crash debug will do: */ +/* Create a core dump (windows only): */ +#define XT_COREDUMP + +/* Backup the datadir before recovery after a crash: */ +//#define XT_BACKUP_BEFORE_RECOVERY + +/* Keep this number of transaction logs around + * for analysis after a crash. 
+ */ +#define XT_NUMBER_OF_LOGS_TO_SAVE 5 + +/* ---------------------------------------------------------------------- + * GENERIC GLOBAL TYPES + */ + +#ifdef XT_WIN + +#define xtInt1 __int8 +#define xtInt2 __int16 +#define xtInt4 __int32 +#define xtInt8 __int64 + +#define xtWord1 unsigned __int8 +#define xtWord2 unsigned __int16 +#define xtWord4 unsigned __int32 +#define xtWord8 unsigned __int64 + +#ifndef PATH_MAX +#define PATH_MAX MAX_PATH +#endif +#ifndef NAME_MAX +#define NAME_MAX MAX_PATH +#endif + +/* XT actually assumes that off_t is 8 bytes: */ +#define off_t xtWord8 + +#else // XT_WIN + +#define xtInt1 int8_t +#define xtInt2 int16_t +#define xtInt4 int32_t +#define xtInt8 int64_t + +#ifdef XT_SOLARIS +#define u_int8_t uint8_t +#define u_int16_t uint16_t +#define u_int32_t uint32_t +#define u_int64_t uint64_t +#endif + +#define xtWord1 u_int8_t +#define xtWord2 u_int16_t +#define xtWord4 u_int32_t +#define xtWord8 u_int64_t + +#endif // XT_WIN + +/* A pointer sized word value: */ +#define xtWordPS ptrdiff_t + +#define XT_MAX_INT_1 ((xtInt1) 0x7F) +#define XT_MIN_INT_1 ((xtInt1) 0x80) +#define XT_MAX_INT_2 ((xtInt2) 0x7FFF) +#define XT_MIN_INT_2 ((xtInt2) 0x8000) +#define XT_MAX_INT_4 ((xtInt4) 0x7FFFFFFF) +#define XT_MIN_INT_4 ((xtInt4) 0x80000000) + +#define xtReal4 float +#define xtReal8 double + +#ifndef u_int +#define u_int unsigned int /* Assumed at least 4 bytes long! */ +#define u_long unsigned long /* Assumed at least 4 bytes long! */ +#endif +#define llong long long /* Assumed at least 8 bytes long! */ +#define u_llong unsigned long long /* Assumed at least 8 bytes long! 
*/ + +#define c_char const char + +#ifndef NULL +#define NULL 0 +#endif + +#define xtPublic + +#define xtBool int +#ifndef TRUE +#define TRUE 1 +#endif +#ifndef FALSE +#define FALSE 0 +#endif + +/* Additional return codes: */ +#define XT_MAYBE 2 +#define XT_ERR -1 +#define XT_NEW -2 +#define XT_RETRY -3 +#define XT_REREAD -4 + +#ifdef OK +#undef OK +#endif +#define OK TRUE + +#ifdef FAILED +#undef FAILED +#endif +#define FAILED FALSE + +typedef xtWord1 XTDiskValue1[1]; +typedef xtWord1 XTDiskValue2[2]; +typedef xtWord1 XTDiskValue3[3]; +typedef xtWord1 XTDiskValue4[4]; +typedef xtWord1 XTDiskValue6[6]; +typedef xtWord1 XTDiskValue8[8]; + +#ifdef DEBUG +#define XT_VAR_LENGTH 100 +#else +#define XT_VAR_LENGTH 1 +#endif + +typedef struct XTPathStr { + char ps_path[XT_VAR_LENGTH]; +} *XTPathStrPtr; + +#define XT_UNUSED(x) x __attribute__((__unused__)) + +/* ---------------------------------------------------------------------- + * MAIN CONSTANTS + */ + +/* + * Define if there should only be one database per server instance: + */ +#define XT_USE_GLOBAL_DB + +/* + * The rollover size is the write limit of a log file. + * After this size is reached, a thread will start a + * new log. + * + * However, logs can grow much larger than this size. + * The reason is that a single transaction + * may not span more than one data log file. + * + * This means the log rollover size is actually a + * minimum size. + */ + +#ifdef DEBUG +//#define XT_USE_GLOBAL_DEBUG_SIZES +#endif + +/* + * I believe the MySQL limit is 16. This limit is currently only used for + * BLOB streaming. + */ +#define XT_MAX_COLS_PER_INDEX 32 + +/* + * The maximum number of tables that can be created in a PBXT + * database. The amount is based on the fact that XT creates + * about 5 files per table in the database, and also + * uses directory listing to find tables.
+ */ +#define XT_MAX_TABLES 10000 + +/* + * When the amount of garbage in the file is greater than the + * garbage threshold, the compactor is activated. + */ +#define XT_GARBAGE_THRESHOLD ((double) 50.0) + +/* A record that does not contain blobs will be handled as a fixed + * length record if its maximum size is less than this amount, + * regardless of the size of the VARCHAR fields it contains. + */ +#define XT_TAB_MIN_VAR_REC_LENGTH 320 + +/* No record in the data handle file may exceed this size: */ +#define XT_TAB_MAX_FIX_REC_LENGTH (16 * 1024) + +/* No record in the data handle file may exceed this size, if + * AVG_ROW_LENGTH is set. + */ +#define XT_TAB_MAX_FIX_REC_LENGTH_SPEC (64 * 1024) + +/* + * Determines the page size of the indexes. The value is given + * in shifts of 1 to the left (e.g. 1 << 11 == 2048, + * 1 << 12 == 4096). + * + * PMC: Note the performance of sysbench is better with 11 + * than with 12. + * + * InnoDB uses 16K pages: + * 1 << 14 == 16384. + */ +#define XT_INDEX_PAGE_SHIFTS 14 + +/* The number of RW locks used to scatter locks on the rows + * of a table. The locks are only held for a short time during which + * the row list is scanned. + * + * For more details see [(9)]. + */ +#define XT_ROW_RWLOCKS 223 + +/* + * This is the number of row lock "slots" per table. + * Row locks are taken on UPDATE/DELETE or SELECT FOR UPDATE. + */ +#define XT_ROW_LOCK_COUNT (XT_ROW_RWLOCKS * 91) + +/* + * The size of the index write buffer. Must be at least as large as the + * largest index page, plus overhead. + */ +#define XT_INDEX_WRITE_BUFFER_SIZE (1024 * 1024) + +/* This is the time in seconds that an open table in the open + * table pool must be on the free list before it + * is actually freed from the pool. + * + * This is to reduce the effect of MySQL running with a very low + * table cache size, which causes tables to be opened and + * closed very rapidly.
+ */ +#define XT_OPEN_TABLE_FREE_TIME 30 + +#ifdef XT_USE_GLOBAL_DEBUG_SIZES +/* + * DEBUG SIZES! + * Reduce the thresholds to make things happen faster. + */ + +//#undef XT_ROW_RWLOCKS +//#define XT_ROW_RWLOCKS 2 + +//#undef XT_TAB_MIN_VAR_REC_LENGTH +//#define XT_TAB_MIN_VAR_REC_LENGTH 20 + +//#undef XT_ROW_LOCK_COUNT +//#define XT_ROW_LOCK_COUNT (XT_ROW_RWLOCKS * 2) + +//#undef XT_INDEX_PAGE_SHIFTS +//#define XT_INDEX_PAGE_SHIFTS 12 + +//#undef XT_INDEX_WRITE_BUFFER_SIZE +//#define XT_INDEX_WRITE_BUFFER_SIZE (40 * 1024) + +#endif + +/* Define this in order to use memory mapped files: */ +#define XT_USE_ROW_REC_MMAP_FILES + +/* Define this in order to use direct I/O on index files: */ +/* NOTE: DO NOT ENABLE! + * {DIRECT-IO} + * It currently does not work, because of changes to the index + * cache. + */ +//#define XT_USE_DIRECT_IO_ON_INDEX + +#ifdef XT_USE_ROW_REC_MMAP_FILES + +#define XT_SEQ_SCAN_FROM_MEMORY +#define XT_ROW_REC_FILE_PTR XTMapFilePtr +#define XT_PWRITE_RR_FILE xt_pwrite_fmap +#define XT_PREAD_RR_FILE xt_pread_fmap +#define XT_FLUSH_RR_FILE xt_flush_fmap +#define XT_CLOSE_RR_FILE_NS xt_close_fmap_ns + +#else + +#define XT_ROW_REC_FILE_PTR XTOpenFilePtr +#define XT_PWRITE_RR_FILE xt_pwrite_file +#define XT_PREAD_RR_FILE xt_pread_file +#define XT_FLUSH_RR_FILE xt_flush_file +#define XT_CLOSE_RR_FILE_NS xt_close_file_ns + +#endif + +#ifdef XT_SEQ_SCAN_FROM_MEMORY +#define XT_LOCK_MEMORY_PTR(x, f, a, s, v, c) do { x = xt_lock_fmap_ptr(f, a, s, v, c); } while (0) +#define XT_UNLOCK_MEMORY_PTR(f, v) xt_unlock_fmap_ptr(f, v); +#else +#define XT_LOCK_MEMORY_PTR(x, f, a, s, v, c) +#define XT_UNLOCK_MEMORY_PTR(f, v) +#endif + +/* {NO-ACTION-BUG} + * Define this to implement NO ACTION correctly + * NOTE: this does not work currently because of a bug + * in MySQL + * + * The bug prevents returning an error in external_lock() + * on statement end. In this case an assertion fails.
+ * + * set storage_engine = pbxt; + * DROP TABLE IF EXISTS t4,t3,t2,t1; + * CREATE TABLE t1 (s1 INT PRIMARY KEY); + * CREATE TABLE t2 (s1 INT PRIMARY KEY, FOREIGN KEY (s1) REFERENCES t1 (s1) ON DELETE NO ACTION); + * + * INSERT INTO t1 VALUES (1); + * INSERT INTO t2 VALUES (1); + * + * begin; + * INSERT INTO t1 VALUES (2); + * DELETE FROM t1 where s1 = 1; + * <-- Assertion fails here because this DELETE returns + * an error from external_lock() + */ +//#define XT_IMPLEMENT_NO_ACTION + +/* ---------------------------------------------------------------------- + * GLOBAL CONSTANTS + */ + +#define XT_INDEX_PAGE_SIZE (1 << XT_INDEX_PAGE_SHIFTS) +#define XT_INDEX_PAGE_MASK (XT_INDEX_PAGE_SIZE - 1) + +/* The index file uses direct I/O. This is the minimum block + * size that can be used when doing direct I/O. + */ +#define XT_BLOCK_SIZE_FOR_DIRECT_IO 512 + +/* + * The header is currently a fixed size, so the information must + * fit in this block! + * + * This must also be a multiple of XT_INDEX_MIN_BLOCK_SIZE + */ +#define XT_INDEX_HEAD_SIZE (XT_BLOCK_SIZE_FOR_DIRECT_IO * 8) // 4K + +#define XT_IDENTIFIER_CHAR_COUNT 64 + +#define XT_IDENTIFIER_NAME_SIZE ((XT_IDENTIFIER_CHAR_COUNT * 3) + 1) // The identifier length as UTF-8 +#define XT_TABLE_NAME_SIZE ((XT_IDENTIFIER_CHAR_COUNT * 5) + 1) // The maximum length of a file name that has been normalized + +#define XT_ADD_PTR(p, l) ((void *) ((char *) (p) + (l))) + +/* ---------------------------------------------------------------------- + * BYTE ORDER + */ + +/* + * Byte order on the disk is little endian! This is the byte order of the i386. + * Little endian byte order starts with the least significant byte. + * + * The reason for choosing this byte order for the disk is 2-fold: + * Firstly the i386 is the cheapest and fastest platform today. + * Secondly the i386, unlike RISC chips (with big endian byte order), can address + * memory that is not aligned!
+ * + * Since the disk image of PrimeBase XT is not aligned, the second point + * is significant. A RISC chip needs to access it byte-wise, so we might as + * well do the byte swapping at the same time. + * + * The macros below are of 4 general types: + * + * GET/SET - Get and set 1,2,4,8 byte values (short, int, long, etc). + * Values are swapped only on big endian platforms. This makes these + * functions very efficient on little-endian platforms. + * + * COPY - Transfer data without swapping regardless of platform. This + * function is a bit more efficient on little-endian platforms + * because alignment is not an issue. + * + * MOVE - Similar to get and set, but deals with memory instead + * of values. Since no swapping is done on little-endian platforms + * this function is identical to COPY on little-endian platforms. + * + * SWAP - Transfer and swap data regardless of the platform type. + * Alignment is not assumed. + * + * The DISK component of the macro names indicates that alignment of + * the value cannot be assumed. + * + */ +#if BYTE_ORDER == BIG_ENDIAN +/* The native order of the machine is big endian. Since the native + * disk order of XT is little endian, all data to and from disk + * must be swapped.
+ */ +#define XT_SET_DISK_1(d, s) ((d)[0] = (xtWord1) (s)) + +#define XT_SET_DISK_2(d, s) do { (d)[0] = (xtWord1) (((xtWord2) (s)) & 0xFF); (d)[1] = (xtWord1) ((((xtWord2) (s)) >> 8 ) & 0xFF); } while (0) + +#define XT_SET_DISK_3(d, s) do { (d)[0] = (xtWord1) (((xtWord4) (s)) & 0xFF); (d)[1] = (xtWord1) ((((xtWord4) (s)) >> 8 ) & 0xFF); \ + (d)[2] = (xtWord1) ((((xtWord4) (s)) >> 16) & 0xFF); } while (0) + +#define XT_SET_DISK_4(d, s) do { (d)[0] = (xtWord1) (((xtWord4) (s)) & 0xFF); (d)[1] = (xtWord1) ((((xtWord4) (s)) >> 8 ) & 0xFF); \ + (d)[2] = (xtWord1) ((((xtWord4) (s)) >> 16) & 0xFF); (d)[3] = (xtWord1) ((((xtWord4) (s)) >> 24) & 0xFF); } while (0) + +#define XT_SET_DISK_6(d, s) do { (d)[0] = (xtWord1) (((xtWord8) (s)) & 0xFF); (d)[1] = (xtWord1) ((((xtWord8) (s)) >> 8 ) & 0xFF); \ + (d)[2] = (xtWord1) ((((xtWord8) (s)) >> 16) & 0xFF); (d)[3] = (xtWord1) ((((xtWord8) (s)) >> 24) & 0xFF); \ + (d)[4] = (xtWord1) ((((xtWord8) (s)) >> 32) & 0xFF); (d)[5] = (xtWord1) ((((xtWord8) (s)) >> 40) & 0xFF); } while (0) + +#define XT_SET_DISK_8(d, s) do { (d)[0] = (xtWord1) (((xtWord8) (s)) & 0xFF); (d)[1] = (xtWord1) ((((xtWord8) (s)) >> 8 ) & 0xFF); \ + (d)[2] = (xtWord1) ((((xtWord8) (s)) >> 16) & 0xFF); (d)[3] = (xtWord1) ((((xtWord8) (s)) >> 24) & 0xFF); \ + (d)[4] = (xtWord1) ((((xtWord8) (s)) >> 32) & 0xFF); (d)[5] = (xtWord1) ((((xtWord8) (s)) >> 40) & 0xFF); \ + (d)[6] = (xtWord1) ((((xtWord8) (s)) >> 48) & 0xFF); (d)[7] = (xtWord1) ((((xtWord8) (s)) >> 56) & 0xFF); } while (0) + +#define XT_GET_DISK_1(s) ((s)[0]) + +#define XT_GET_DISK_2(s) ((xtWord2) (((xtWord2) (s)[0]) | (((xtWord2) (s)[1]) << 8))) + +#define XT_GET_DISK_3(s) ((xtWord4) (((xtWord4) (s)[0]) | (((xtWord4) (s)[1]) << 8) | (((xtWord4) (s)[2]) << 16))) + +#define XT_GET_DISK_4(s) (((xtWord4) (s)[0]) | (((xtWord4) (s)[1]) << 8 ) | \ + (((xtWord4) (s)[2]) << 16) | (((xtWord4) (s)[3]) << 24)) + +#define XT_GET_DISK_6(s) (((xtWord8) (s)[0]) | (((xtWord8) (s)[1]) << 8 ) | \ + (((xtWord8) (s)[2]) << 16) 
| (((xtWord8) (s)[3]) << 24) | \ + (((xtWord8) (s)[4]) << 32) | (((xtWord8) (s)[5]) << 40)) + +#define XT_GET_DISK_8(s) (((xtWord8) (s)[0]) | (((xtWord8) (s)[1]) << 8 ) | \ + (((xtWord8) (s)[2]) << 16) | (((xtWord8) (s)[3]) << 24) | \ + (((xtWord8) (s)[4]) << 32) | (((xtWord8) (s)[5]) << 40) | \ + (((xtWord8) (s)[6]) << 48) | (((xtWord8) (s)[7]) << 56)) + +/* Move will copy memory, and swap the bytes on a big endian machine. + * On a little endian machine it is the same as COPY. + */ +#define XT_MOVE_DISK_1(d, s) ((d)[0] = (s)[0]) +#define XT_MOVE_DISK_2(d, s) do { (d)[0] = (s)[1]; (d)[1] = (s)[0]; } while (0) +#define XT_MOVE_DISK_3(d, s) do { (d)[0] = (s)[2]; (d)[1] = (s)[1]; (d)[2] = (s)[0]; } while (0) +#define XT_MOVE_DISK_4(d, s) do { (d)[0] = (s)[3]; (d)[1] = (s)[2]; (d)[2] = (s)[1]; (d)[3] = (s)[0]; } while (0) +#define XT_MOVE_DISK_8(d, s) do { (d)[0] = (s)[7]; (d)[1] = (s)[6]; \ + (d)[2] = (s)[5]; (d)[3] = (s)[4]; \ + (d)[4] = (s)[3]; (d)[5] = (s)[2]; \ + (d)[6] = (s)[1]; (d)[7] = (s)[0]; } while (0) + +/* + * Copy just copies the number of bytes assuming the data is not aligned.
+ */ +#define XT_COPY_DISK_1(d, s) (d)[0] = s +#define XT_COPY_DISK_2(d, s) do { (d)[0] = (s)[0]; (d)[1] = (s)[1]; } while (0) +#define XT_COPY_DISK_3(d, s) do { (d)[0] = (s)[0]; (d)[1] = (s)[1]; (d)[2] = (s)[2]; } while (0) +#define XT_COPY_DISK_4(d, s) do { (d)[0] = (s)[0]; (d)[1] = (s)[1]; (d)[2] = (s)[2]; (d)[3] = (s)[3]; } while (0) +#define XT_COPY_DISK_6(d, s) memcpy(&((d)[0]), &((s)[0]), 6) +#define XT_COPY_DISK_8(d, s) memcpy(&((d)[0]), &((s)[0]), 8) +#define XT_COPY_DISK_10(d, s) memcpy(&((d)[0]), &((s)[0]), 10) + +#define XT_SET_NULL_DISK_1(d) XT_SET_DISK_1(d, 0) +#define XT_SET_NULL_DISK_2(d) do { (d)[0] = 0; (d)[1] = 0; } while (0) +#define XT_SET_NULL_DISK_4(d) do { (d)[0] = 0; (d)[1] = 0; (d)[2] = 0; (d)[3] = 0; } while (0) +#define XT_SET_NULL_DISK_6(d) do { (d)[0] = 0; (d)[1] = 0; (d)[2] = 0; (d)[3] = 0; (d)[4] = 0; (d)[5] = 0; } while (0) +#define XT_SET_NULL_DISK_8(d) do { (d)[0] = 0; (d)[1] = 0; (d)[2] = 0; (d)[3] = 0; (d)[4] = 0; (d)[5] = 0; (d)[6] = 0; (d)[7] = 0; } while (0) + +#define XT_IS_NULL_DISK_1(d) (!(XT_GET_DISK_1(d))) +#define XT_IS_NULL_DISK_4(d) (!(d)[0] && !(d)[1] && !(d)[2] && !(d)[3]) +#define XT_IS_NULL_DISK_8(d) (!(d)[0] && !(d)[1] && !(d)[2] && !(d)[3] && !(d)[4] && !(d)[5] && !(d)[6] && !(d)[7]) + +#define XT_EQ_DISK_4(d, s) ((d)[0] == (s)[0] && (d)[1] == (s)[1] && (d)[2] == (s)[2] && (d)[3] == (s)[3]) +#define XT_EQ_DISK_8(d, s) ((d)[0] == (s)[0] && (d)[1] == (s)[1] && (d)[2] == (s)[2] && (d)[3] == (s)[3] && \ + (d)[4] == (s)[4] && (d)[5] == (s)[5] && (d)[6] == (s)[6] && (d)[7] == (s)[7]) + +#define XT_IS_FF_DISK_4(d) ((d)[0] == 0xFF && (d)[1] == 0xFF && (d)[2] == 0xFF && (d)[3] == 0xFF) +#else +/* + * The native order of the machine is little endian. This means the data to + * and from disk need not be swapped. In addition to this, since + * the i386 can access non-aligned memory we are not required to + * handle the data byte-for-byte.
+ */ +#define XT_SET_DISK_1(d, s) ((d)[0] = (xtWord1) (s)) +#define XT_SET_DISK_2(d, s) (*((xtWord2 *) &((d)[0])) = (xtWord2) (s)) +#define XT_SET_DISK_3(d, s) do { (*((xtWord2 *) &((d)[0])) = (xtWord2) (s)); *((xtWord1 *) &((d)[2])) = (xtWord1) (((xtWord4) (s)) >> 16); } while (0) +#define XT_SET_DISK_4(d, s) (*((xtWord4 *) &((d)[0])) = (xtWord4) (s)) +#define XT_SET_DISK_6(d, s) do { *((xtWord4 *) &((d)[0])) = (xtWord4) (s); *((xtWord2 *) &((d)[4])) = (xtWord2) (((xtWord8) (s)) >> 32); } while (0) +#define XT_SET_DISK_8(d, s) (*((xtWord8 *) &((d)[0])) = (xtWord8) (s)) + +#define XT_GET_DISK_1(s) ((s)[0]) +#define XT_GET_DISK_2(s) *((xtWord2 *) &((s)[0])) +#define XT_GET_DISK_3(s) ((xtWord4) *((xtWord2 *) &((s)[0])) | (((xtWord4) *((xtWord1 *) &((s)[2]))) << 16)) +#define XT_GET_DISK_4(s) *((xtWord4 *) &((s)[0])) +#define XT_GET_DISK_6(s) ((xtWord8) *((xtWord4 *) &((s)[0])) | (((xtWord8) *((xtWord2 *) &((s)[4]))) << 32)) +#define XT_GET_DISK_8(s) *((xtWord8 *) &((s)[0])) + +#define XT_MOVE_DISK_1(d, s) ((d)[0] = (s)[0]) +#define XT_MOVE_DISK_2(d, s) XT_COPY_DISK_2(d, s) +#define XT_MOVE_DISK_3(d, s) XT_COPY_DISK_3(d, s) +#define XT_MOVE_DISK_4(d, s) XT_COPY_DISK_4(d, s) +#define XT_MOVE_DISK_8(d, s) XT_COPY_DISK_8(d, s) + +#define XT_COPY_DISK_1(d, s) (d)[0] = s +#define XT_COPY_DISK_2(d, s) (*((xtWord2 *) &((d)[0])) = (*((xtWord2 *) &((s)[0])))) +#define XT_COPY_DISK_3(d, s) do { *((xtWord2 *) &((d)[0])) = *((xtWord2 *) &((s)[0])); (d)[2] = (s)[2]; } while (0) +#define XT_COPY_DISK_4(d, s) (*((xtWord4 *) &((d)[0])) = (*((xtWord4 *) &((s)[0])))) +#define XT_COPY_DISK_6(d, s) do { *((xtWord4 *) &((d)[0])) = *((xtWord4 *) &((s)[0])); *((xtWord2 *) &((d)[4])) = *((xtWord2 *) &((s)[4])); } while (0) +#define XT_COPY_DISK_8(d, s) (*((xtWord8 *) &(d[0])) = (*((xtWord8 *) &((s)[0])))) +#define XT_COPY_DISK_10(d, s) memcpy(&((d)[0]), &((s)[0]), 10) + +#define XT_SET_NULL_DISK_1(d) XT_SET_DISK_1(d, 0) +#define XT_SET_NULL_DISK_2(d) XT_SET_DISK_2(d, 0) +#define 
XT_SET_NULL_DISK_3(d) XT_SET_DISK_3(d, 0) +#define XT_SET_NULL_DISK_4(d) XT_SET_DISK_4(d, 0L) +#define XT_SET_NULL_DISK_6(d) XT_SET_DISK_6(d, 0LL) +#define XT_SET_NULL_DISK_8(d) XT_SET_DISK_8(d, 0LL) + +#define XT_IS_NULL_DISK_1(d) (!(XT_GET_DISK_1(d))) +#define XT_IS_NULL_DISK_2(d) (!(XT_GET_DISK_2(d))) +#define XT_IS_NULL_DISK_3(d) (!(XT_GET_DISK_3(d))) +#define XT_IS_NULL_DISK_4(d) (!(XT_GET_DISK_4(d))) +#define XT_IS_NULL_DISK_8(d) (!(XT_GET_DISK_8(d))) + +#define XT_EQ_DISK_4(d, s) (XT_GET_DISK_4(d) == XT_GET_DISK_4(s)) +#define XT_EQ_DISK_8(d, s) (XT_GET_DISK_8(d) == XT_GET_DISK_8(s)) + +#define XT_IS_FF_DISK_4(d) (XT_GET_DISK_4(d) == 0xFFFFFFFF) +#endif + +#define XT_CMP_DISK_4(a, b) ((xtInt4) XT_GET_DISK_4(a) - (xtInt4) XT_GET_DISK_4(b)) +#define XT_CMP_DISK_8(d, s) memcmp(&((d)[0]), &((s)[0]), 8) +//#define XT_CMP_DISK_8(d, s) (XT_CMP_DISK_4((d).h_number_4, (s).h_number_4) == 0 ? XT_CMP_DISK_4((d).h_file_4, (s).h_file_4) : XT_CMP_DISK_4((d).h_number_4, (s).h_number_4)) + +#define XT_SWAP_DISK_2(d, s) do { (d)[0] = (s)[1]; (d)[1] = (s)[0]; } while (0) +#define XT_SWAP_DISK_3(d, s) do { (d)[0] = (s)[2]; (d)[1] = (s)[1]; (d)[2] = (s)[0]; } while (0) +#define XT_SWAP_DISK_4(d, s) do { (d)[0] = (s)[3]; (d)[1] = (s)[2]; (d)[2] = (s)[1]; (d)[3] = (s)[0]; } while (0) +#define XT_SWAP_DISK_8(d, s) do { (d)[0] = (s)[7]; (d)[1] = (s)[6]; (d)[2] = (s)[5]; (d)[3] = (s)[4]; \ + (d)[4] = (s)[3]; (d)[5] = (s)[2]; (d)[6] = (s)[1]; (d)[7] = (s)[0]; } while (0) + +/* ---------------------------------------------------------------------- + * GLOBAL APPLICATION TYPES & MACROS + */ + +struct XTThread; + +typedef void (*XTFreeFunc)(struct XTThread *self, void *thunk, void *item); +typedef int (*XTCompareFunc)(struct XTThread *self, register const void *thunk, register const void *a, register const void *b); + +/* Log ID and offset: */ +#define xtLogID xtWord4 +#define xtLogOffset off_t + +#define xtDatabaseID xtWord4 +#define xtTableID xtWord4 +#define xtOpSeqNo xtWord4 +#define 
xtXactID xtWord4 +#define xtThreadID xtWord4 + +#ifdef DEBUG +//#define XT_USE_NODE_ID_STRUCT +#endif + +#ifdef XT_USE_NODE_ID_STRUCT +typedef struct xtIndexNodeID { + xtWord4 x; +} xtIndexNodeID; +#define XT_NODE_TEMP xtWord4 xt_node_temp +#define XT_NODE_ID(a) (a).x +#define XT_RET_NODE_ID(a) *((xtIndexNodeID *) &(xt_node_temp = (a))) +#else +#define XT_NODE_TEMP +#define xtIndexNodeID xtWord4 +#define XT_NODE_ID(a) a +#define XT_RET_NODE_ID(a) ((xtIndexNodeID) (a)) +#endif + +/* Row, Record ID and Record offsets: */ +#define xtRowID xtWord4 +#define xtRecordID xtWord4 /* NOTE: Record offset == header-size + record-id * record-size! */ +#define xtRefID xtWord4 /* Must be big enough to contain a xtRowID and a xtRecordID! */ +#define xtRecOffset off_t +#define xtDiskRecordID4 XTDiskValue4 +#ifdef XT_WIN +#define xtProcID DWORD +#else +#define xtProcID pid_t +#endif + +#define XT_ROW_ID_SIZE 4 +#define XT_RECORD_ID_SIZE 4 +#define XT_REF_ID_SIZE 4 /* max(XT_ROW_ID_SIZE, XT_RECORD_ID_SIZE) */ +#define XT_RECORD_OFFS_SIZE 4 +#define XT_RECORD_REF_SIZE (XT_RECORD_ID_SIZE + XT_ROW_ID_SIZE) +#define XT_CHECKSUM4_REC(x) (x) + +#define XT_XACT_ID_SIZE 4 +#define XT_CHECKSUM4_XACT(x) (x) + +/* ---------------------------------------------------------------------- + * GLOBAL VARIABLES + */ + +extern bool pbxt_inited; +extern xtBool pbxt_ignore_case; +extern const char *pbxt_extensions[]; +extern xtBool pbxt_crash_debug; + + +/* ---------------------------------------------------------------------- + * DRIZZLE MAPPINGS VARIABLES + */ + +#ifdef DRIZZLED +/* Drizzle is stuck at this level: */ +#define MYSQL_VERSION_ID 60005 + +#define TABLE_LIST TableList +#define TABLE Table +#define THD Session +#define MYSQL_THD Session * +#define THR_THD THR_Session +#define STRUCT_TABLE class Table + +#define MYSQL_TYPE_STRING DRIZZLE_TYPE_VARCHAR +#define MYSQL_TYPE_VARCHAR DRIZZLE_TYPE_VARCHAR +#define MYSQL_TYPE_LONGLONG DRIZZLE_TYPE_LONGLONG +#define MYSQL_TYPE_BLOB DRIZZLE_TYPE_BLOB 
+#define MYSQL_TYPE_ENUM DRIZZLE_TYPE_ENUM +#define MYSQL_TYPE_LONG DRIZZLE_TYPE_LONG +#define MYSQL_PLUGIN_VAR_HEADER DRIZZLE_PLUGIN_VAR_HEADER +#define MYSQL_SYSVAR_STR DRIZZLE_SYSVAR_STR +#define MYSQL_SYSVAR_INT DRIZZLE_SYSVAR_INT +#define MYSQL_SYSVAR DRIZZLE_SYSVAR +#define MYSQL_STORAGE_ENGINE_PLUGIN DRIZZLE_STORAGE_ENGINE_PLUGIN +#define MYSQL_INFORMATION_SCHEMA_PLUGIN DRIZZLE_INFORMATION_SCHEMA_PLUGIN +#define memcpy_fixed memcpy +#define bfill(m, len, ch) memset(m, ch, len) + +#define mx_tmp_use_all_columns(x, y) (x)->use_all_columns(y) +#define mx_tmp_restore_column_map(x, y) (x)->restore_column_map(y) + +#define MX_TABLE_TYPES_T handler::Table_flags +#define MX_UINT8_T uint8_t +#define MX_ULONG_T uint32_t +#define MX_ULONGLONG_T uint64_t +#define MX_LONGLONG_T uint64_t +#define MX_CHARSET_INFO struct charset_info_st +#define MX_CONST_CHARSET_INFO const struct charset_info_st +#define MX_CONST const +#define my_bool bool +#define int16 int16_t +#define int32 int32_t +#define uint16 uint16_t +#define uint32 uint32_t +#define uchar unsigned char +#define longlong int64_t +#define ulonglong uint64_t + +#define HAVE_LONG_LONG + +#define my_malloc(x, y) malloc(x) +#define my_free(x, y) free(x) + +#define HA_CAN_SQL_HANDLER 0 +#define HA_CAN_INSERT_DELAYED 0 + +#define max cmax +#define min cmin + +#define NullS NULL + +#define thd_charset session_charset +#define thd_query session_query +#define thd_slave_thread session_slave_thread +#define thd_non_transactional_update session_non_transactional_update +#define thd_binlog_format session_binlog_format +#define thd_mark_transaction_to_rollback session_mark_transaction_to_rollback +#define thd_ha_data session_ha_data +#define current_thd current_session +#define thd_sql_command session_sql_command +#define thd_test_options session_test_options +#define thd_killed session_killed +#define thd_tx_isolation session_tx_isolation +#define thd_in_lock_tables session_in_lock_tables +#define thd_tablespace_op 
session_tablespace_op +#define thd_alloc session_alloc +#define thd_make_lex_string session_make_lex_string + +#define my_pthread_setspecific_ptr(T, V) pthread_setspecific(T, (void*) (V)) + +#define mysql_real_data_home drizzle_real_data_home + +#define mi_int4store(T,A) { uint32_t def_temp= (uint32_t) (A);\ + ((unsigned char*) (T))[3]= (unsigned char) (def_temp);\ + ((unsigned char*) (T))[2]= (unsigned char) (def_temp >> 8);\ + ((unsigned char*) (T))[1]= (unsigned char) (def_temp >> 16);\ + ((unsigned char*) (T))[0]= (unsigned char) (def_temp >> 24); } + +#define mi_uint4korr(A) ((uint32_t) (((uint32_t) (((const unsigned char*) (A))[3])) +\ + (((uint32_t) (((const unsigned char*) (A))[2])) << 8) +\ + (((uint32_t) (((const unsigned char*) (A))[1])) << 16) +\ + (((uint32_t) (((const unsigned char*) (A))[0])) << 24))) + +#else // DRIZZLED +/* The MySQL case: */ +#if MYSQL_VERSION_ID >= 60008 +#define STRUCT_TABLE struct TABLE +#else +#define STRUCT_TABLE struct st_table +#endif + +#define mx_tmp_use_all_columns dbug_tmp_use_all_columns +#define mx_tmp_restore_column_map(x, y) dbug_tmp_restore_column_map((x)->read_set, y) + +#define MX_TABLE_TYPES_T ulonglong +#define MX_UINT8_T uint8 +#define MX_ULONG_T ulong +#define MX_ULONGLONG_T ulonglong +#define MX_LONGLONG_T longlong +#define MX_CHARSET_INFO CHARSET_INFO +#define MX_CONST_CHARSET_INFO struct charset_info_st +#define MX_CONST + +#endif // DRIZZLED + +#ifndef XT_SCAN_CORE_DEFINED +#define XT_SCAN_CORE_DEFINED +xtBool xt_mm_scan_core(void); +#endif + +//#define DEBUG_LOCK_QUEUE + +#endif diff --git a/storage/pbxt/src/xt_errno.h b/storage/pbxt/src/xt_errno.h new file mode 100644 index 00000000000..4d74589efe3 --- /dev/null +++ b/storage/pbxt/src/xt_errno.h @@ -0,0 +1,129 @@ +/* Copyright (c) 2005 PrimeBase Technologies GmbH + * + * PrimeBase XT + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software 
Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Author: Paul McCullagh + * + * H&G2JCtL + */ + +#define XT_NO_ERR 0 +#define XT_SYSTEM_ERROR -1 +#define XT_ERR_STACK_OVERFLOW -2 +#define XT_ASSERTION_FAILURE -3 +#define XT_SIGNAL_CAUGHT -4 +#define XT_ERR_JUMP_OVERFLOW -5 +#define XT_ERR_BAD_HANDLE -6 +#define XT_ERR_TABLE_EXISTS -7 +#define XT_ERR_NAME_TOO_LONG -8 +#define XT_ERR_TABLE_NOT_FOUND -9 +#define XT_ERR_SESSION_NOT_FOUND -10 +#define XT_ERR_BAD_ADDRESS -11 +#define XT_ERR_UNKNOWN_SERVICE -12 +#define XT_ERR_UNKNOWN_HOST -13 +#define XT_ERR_TOKEN_EXPECTED -14 +#define XT_ERR_PROPERTY_REQUIRED -15 +#define XT_ERR_BAD_XACTION -16 +#define XT_ERR_INVALID_SLOT -17 +#define XT_ERR_DEADLOCK -18 +#define XT_ERR_CANNOT_CHANGE_DB -19 +#define XT_ERR_ILLEGAL_CHAR -20 +#define XT_ERR_UNTERMINATED_STRING -21 +#define XT_ERR_SYNTAX -22 +#define XT_ERR_ILLEGAL_INSTRUCTION -23 +#define XT_ERR_OUT_OF_BOUNDS -24 +#define XT_ERR_STACK_UNDERFLOW -25 +#define XT_ERR_TYPE_MISMATCH -26 +#define XT_ERR_ILLEGAL_TYPE -27 +#define XT_ERR_ID_TOO_LONG -28 +#define XT_ERR_TYPE_OVERFLOW -29 +#define XT_ERR_TABLE_IN_USE -30 +#define XT_ERR_NO_DATABASE_IN_USE -31 +#define XT_ERR_CANNOT_RESOLVE_TYPE -32 +#define XT_ERR_BAD_INDEX_DESC -33 +#define XT_ERR_WRONG_NO_OF_VALUES -34 +#define XT_ERR_CANNOT_OUTPUT_VALUE -35 +#define XT_ERR_COLUMN_NOT_FOUND -36 +#define XT_ERR_NOT_IMPLEMENTED -37 +#define XT_ERR_UNEXPECTED_EOS -38 +#define XT_ERR_BAD_TOKEN -39 +#define 
XT_ERR_RES_STACK_OVERFLOW -40 +#define XT_ERR_BAD_INDEX_TYPE -41 +#define XT_ERR_INDEX_EXISTS -42 +#define XT_ERR_INDEX_STRUC_EXISTS -43 +#define XT_ERR_INDEX_NOT_FOUND -44 +#define XT_ERR_INDEX_CORRUPT -45 +#define XT_ERR_DUPLICATE_KEY -46 +#define XT_ERR_TYPE_NOT_SUPPORTED -47 +#define XT_ERR_BAD_TABLE_VERSION -48 +#define XT_ERR_BAD_RECORD_FORMAT -49 +#define XT_ERR_BAD_EXT_RECORD -50 +#define XT_ERR_RECORD_CHANGED -51 // Record has already been updated by some other transaction +#define XT_ERR_XLOG_WAS_CORRUPTED -52 +#define XT_ERR_NO_DICTIONARY -53 +#define XT_ERR_TOO_MANY_TABLES -54 // Maximum number of tables exceeded +#define XT_ERR_KEY_TOO_LARGE -55 // Maximum size of an index key exceeded +#define XT_ERR_MULTIPLE_DATABASES -56 +#define XT_ERR_NO_TRANSACTION -57 +#define XT_ERR_A_EXPECTED_NOT_B -58 +#define XT_ERR_NO_MATCHING_INDEX -59 +#define XT_ERR_TABLE_LOCKED -60 +#define XT_ERR_NO_REFERENCED_ROW -61 +#define XT_ERR_BAD_DICTIONARY -62 +#define XT_ERR_LOADING_MYSQL_DIC -63 +#define XT_ERR_ROW_IS_REFERENCED -64 +#define XT_ERR_COLUMN_IS_NOT_NULL -65 +#define XT_ERR_INCORRECT_NO_OF_COLS -66 +#define XT_ERR_FK_ON_TEMP_TABLE -67 +#define XT_ERR_REF_TABLE_NOT_FOUND -68 +#define XT_ERR_REF_TYPE_WRONG -69 +#define XT_ERR_DUPLICATE_FKEY -70 +#define XT_ERR_INDEX_FILE_TO_LARGE -71 +#define XT_ERR_UPGRADE_TABLE -72 +#define XT_ERR_INDEX_NEW_VERSION -73 +#define XT_ERR_LOCK_TIMEOUT -74 +#define XT_ERR_CONVERSION -75 +#define XT_ERR_NO_ROWS -76 +#define XT_ERR_MYSQL_ERROR -77 +#define XT_ERR_DATA_LOG_NOT_FOUND -78 +#define XT_ERR_LOG_MAX_EXCEEDED -79 +#define XT_ERR_MAX_ROW_COUNT -80 +#define XT_ERR_FILE_TOO_LONG -81 +#define XT_ERR_BAD_IND_BLOCK_SIZE -82 +#define XT_ERR_INDEX_CORRUPTED -83 +#define XT_ERR_NO_INDEX_CACHE -84 +#define XT_ERR_INDEX_LOG_CORRUPT -85 +#define XT_ERR_TOO_MANY_THREADS -86 +#define XT_ERR_TOO_MANY_WAITERS -87 +#define XT_ERR_INDEX_OLD_VERSION -88 +#define XT_ERR_PBXT_TABLE_EXISTS -89 +#define XT_ERR_SERVER_RUNNING -90 +#define
XT_ERR_INDEX_MISSING -91 +#define XT_ERR_RECORD_DELETED -92 +#define XT_ERR_NEW_TYPE_OF_XLOG -93 +#define XT_ERR_NO_BEFORE_IMAGE -94 +#define XT_ERR_FK_REF_TEMP_TABLE -95 + +#ifdef XT_WIN +#define XT_ENOMEM ERROR_NOT_ENOUGH_MEMORY +#define XT_EAGAIN ERROR_RETRY +#define XT_EBUSY ERROR_BUSY +#else +#define XT_ENOMEM ENOMEM +#define XT_EAGAIN EAGAIN +#define XT_EBUSY EBUSY +#endif |
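The BYTE ORDER comment in xt_defs.h above describes an on-disk format that is little endian and not aligned. A minimal sketch of that idea in portable C follows. It is an illustration only: the names put_disk_4, get_disk_4, put_disk_6 and get_disk_6 are hypothetical and are not part of PBXT. The functions do byte-wise what the big-endian branch of XT_SET_DISK_4 / XT_GET_DISK_4 and XT_SET_DISK_6 / XT_GET_DISK_6 does with macros, so they should behave the same on any host byte order and at any buffer offset.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of PBXT. */

/* Store a 32-bit value as 4 little-endian bytes, least significant
 * byte first. Writing one byte at a time avoids any alignment
 * requirement on the destination. */
static void put_disk_4(uint8_t *d, uint32_t v)
{
	d[0] = (uint8_t) (v & 0xFF);
	d[1] = (uint8_t) ((v >> 8) & 0xFF);
	d[2] = (uint8_t) ((v >> 16) & 0xFF);
	d[3] = (uint8_t) ((v >> 24) & 0xFF);
}

/* Read 4 little-endian bytes back into a host-order 32-bit value. */
static uint32_t get_disk_4(const uint8_t *s)
{
	return (uint32_t) s[0] |
		((uint32_t) s[1] << 8) |
		((uint32_t) s[2] << 16) |
		((uint32_t) s[3] << 24);
}

/* Store the low 48 bits of a value as 6 little-endian bytes. */
static void put_disk_6(uint8_t *d, uint64_t v)
{
	int i;

	for (i = 0; i < 6; i++)
		d[i] = (uint8_t) ((v >> (8 * i)) & 0xFF);
}

/* Read 6 little-endian bytes back into a host-order 64-bit value. */
static uint64_t get_disk_6(const uint8_t *s)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 6; i++)
		v |= ((uint64_t) s[i]) << (8 * i);
	return v;
}
```

Compared with the pointer-cast variants in the little-endian branch, the byte-wise form trades some speed for independence from host alignment and aliasing rules, which is the trade-off the original comment describes for RISC platforms.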