author | unknown <guilhem@gbichot3.local> | 2007-06-22 14:49:37 +0200 |
---|---|---|
committer | unknown <guilhem@gbichot3.local> | 2007-06-22 14:49:37 +0200 |
commit | 1a96259191b193b353387cbb70d7567009e3b247 (patch) | |
tree | 27f19470e270f1546d4eb9ac1eaf51ff23ec8a08 /storage/maria | |
parent | fd9bd5802932b08da7484c54445ae14ee4e25385 (diff) | |
- WL#3239 "log CREATE TABLE in Maria"
- WL#3240 "log DROP TABLE in Maria"
- similarly, log RENAME TABLE, REPAIR/OPTIMIZE TABLE, and
DELETE no_WHERE_clause (== the DELETE which just truncates the files)
- create_rename_lsn added to MARIA_SHARE's state
- all these operations (except DROP TABLE) also update the table's
create_rename_lsn, which is needed for the correctness of
Recovery (see function comment of _ma_repair_write_log_record()
in ma_check.c)
- write a COMMIT record when a transaction commits.
- don't log REDOs/UNDOs if this is an internal temporary table
like inside ALTER TABLE (I expect this to be a big win). There was
already no logging for user-created "CREATE TEMPORARY" tables.
- don't fsync files/directories if the table is not transactional
- in translog_write_record(), autogenerate a 2-byte-id for the table
and log the "id->name" pair (LOGREC_FILE_ID); log
LOGREC_LONG_TRANSACTION_ID; automatically store
the table's 2-byte-id in any log record.
- preparations for Checkpoint: translog_get_horizon(); pausing Checkpoint
when some dirty pages are unknown; capturing trn->rec_lsn,
trn->first_undo_lsn for Checkpoint and log's low-water-mark computing.
- assertions, comments.
storage/maria/Makefile.am:
more files to build
storage/maria/ha_maria.cc:
- logging a REPAIR log record if REPAIR/OPTIMIZE was successful.
- ha_maria::data_file_type does not have to be set in every info()
call, just do it once in open().
- if caller said that transactionality can be disabled (like if
caller is ALTER TABLE) i.e. thd->transaction.on==FALSE, then we
temporarily disable transactionality of the table in external_lock();
that will ensure that no REDOs/UNDOs are logged for this possibly
massive write operation (they are not needed, as if any write fails,
the table will be dropped). We re-enable in external_lock(F_UNLCK),
which in ALTER TABLE happens before the tmp table replaces the original
one (which is good, as thus the final table will have a REDO RENAME
and a correct create_rename_lsn).
- when we commit we also have to write a log record, so
trnman_commit_trn() calls become ma_commit() calls
- at the end of the engine's initialization, we are potentially entering a
dangerous multi-threaded world (clients are going to be accepted), so
some mutex-ownership assertions become enforceable; to enable them
we set maria_multi_threaded=TRUE (see ma_control_file.c)
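The save/restore of the transactional flag around external_lock() can be
sketched as follows. This is only an illustration of the flag handling
described above, not the ha_maria code: the struct and function names are
hypothetical, and locking, transactions and logging are left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for MARIA_SHARE and ha_maria. */
struct share_sketch   { int transactional; };
struct handler_sketch { struct share_sketch *s; int save_transactional; };

/* open(): remember the table's real transactionality once. */
static void sketch_open(struct handler_sketch *h, struct share_sketch *s)
{
  h->s= s;
  h->save_transactional= s->transactional;
}

/* external_lock(lock): if the caller allows it, stop logging REDOs/UNDOs. */
static void sketch_lock(struct handler_sketch *h, int thd_transaction_on)
{
  if (!h->save_transactional)
    return;                       /* never transactional: nothing to do */
  if (!thd_transaction_on)
    h->s->transactional= 0;       /* temporarily non-transactional */
}

/* external_lock(F_UNLCK): restore the original value. */
static void sketch_unlock(struct handler_sketch *h)
{
  if (!h->save_transactional)
    return;
  h->s->transactional= h->save_transactional;
  assert(h->s->transactional);    /* transactional again after unlock */
}
```

Because the flag is restored at unlock time, anything done after the unlock
(such as the rename in ALTER TABLE) sees a transactional table again.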
storage/maria/ha_maria.h:
new member ha_maria::save_transactional (see also ha_maria.cc)
storage/maria/ma_blockrec.c:
- fixing comments according to discussion with Monty
- if a table is transactional but temporarily non-transactional
(like in ALTER TABLE), we need to give a sensible LSN to the pages
(and, if we give 0, pagecache asserts).
- translog_write_record() now takes care of storing the share's
2-byte-id in the log record
storage/maria/ma_blockrec.h:
fixing comment according to discussion with Monty
storage/maria/ma_check.c:
When REPAIR/OPTIMIZE modify the data/index file, if this is a
transactional table, they must sync it; if they remove files or rename
files, they must sync the directory, so that everything is durable.
This is just applying to REPAIR/OPTIMIZE the logic already implemented
in CREATE/DROP/RENAME a few months ago.
Adding a function to write a LOGREC_REPAIR_TABLE at end of
REPAIR/OPTIMIZE (called only by ha_maria, not by maria_chk), and
to update the table's create_rename_lsn.
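How create_rename_lsn protects a rebuilt table during Recovery can be
sketched like this. It is a simplified model under stated assumptions: the
names are hypothetical, a REDO record is reduced to its LSN, and the
comparison (apply only records strictly newer than create_rename_lsn) is an
assumption derived from the function comment, not a copy of the Recovery
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified REDO record: only its LSN matters here. */
struct redo_sketch { uint64_t lsn; };

/* Assumed rule: apply a REDO only if it is newer than create_rename_lsn. */
static int redo_applies(uint64_t record_lsn, uint64_t create_rename_lsn)
{
  return record_lsn > create_rename_lsn;
}

/* Count how many records of a log Recovery would apply to the table. */
static size_t count_applied(const struct redo_sketch *log, size_t n,
                            uint64_t create_rename_lsn)
{
  size_t applied= 0, i;
  assert(log != NULL || n == 0);
  for (i= 0; i < n; i++)
    if (redo_applies(log[i].lsn, create_rename_lsn))
      applied++;
  return applied;
}
```

With REDOs at LSN 10, 20 and 30 and a REPAIR that set create_rename_lsn to
25, only the record at 30 would be applied in this model.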
storage/maria/ma_close.c:
sync the index file before the table leaves maria_open_list and so
becomes unknown to Checkpoint (fix for a future bug)
storage/maria/ma_control_file.c:
ensuring that if Maria is running in multi-threaded mode, anybody
wanting to write to the control file and update
last_checkpoint_lsn/last_logno owns the log's lock.
storage/maria/ma_control_file.h:
see ma_control_file.c
storage/maria/ma_create.c:
when creating a table:
- sync it and its directory only if this is a transactional table
and there is a log (no point in syncing in maria_chk)
- decouple the two uses of linkname/linkname_ptr (for index file and
for data file) into more variables, as we need to know all links
until the moment we write the LOGREC_CREATE_TABLE.
- set share.data_file_type early so that _ma_initialize_data_file()
knows it (Monty's bugfix so that a table always has at least a bitmap
page when it is created; so data-file is not 0 bytes anymore).
- log a LOGREC_CREATE_TABLE; it contains the bytes which we have
just written to the index file's header. Update table's
create_rename_lsn.
- syncing of kfile had been broken in a previous merge; correcting it
- syncing of dfile is now needed as it's not empty anymore
- in _ma_initialize_data_file(), use the share's block_size and not the
global one. This change has no functional effect (both variables are
equal); I just find it more future-proof to use the share-bound variable
rather than the global one.
storage/maria/ma_delete_all.c:
log a LOGREC_DELETE_ALL record when doing ma_delete_all_rows();
update create_rename_lsn then.
storage/maria/ma_delete_table.c:
- logging LOGREC_DROP_TABLE; knowing whether this is needed requires
knowing whether the table is transactional, which requires opening the
table.
- we need to sync directories only if the table is transactional
storage/maria/ma_extra.c:
questions
storage/maria/ma_init.c:
when maria_end() is called, engine is not multithreaded
storage/maria/ma_loghandler.c:
- translog_inited has to be visible to ma_create() (see how it is used
in ma_create())
- checkpoint record will be a single record, not three
- no REDO for TRUNCATE (TRUNCATE calls ma_create() internally so will
log a REDO_CREATE)
- adding REDO for DELETE no_WHERE_clause (fast DELETE of all rows by
truncating the files), REPAIR.
- MY_WAIT_IF_FULL to wait and retry if a log write hits a full disk
- in translog_write_record(), if MARIA_SHARE does not yet have a
2-byte-id, generate one for it and log LOGREC_FILE_ID; automatically
store this short id into log records.
- in translog_write_record(), if transaction has not logged its
long trid, log LOGREC_LONG_TRANSACTION_ID.
- For Checkpoint, we need to know the current end-of-log: adding
translog_get_horizon().
- For Control File, adding an assertion that the thread owns the
log's lock (control file is protected by this lock)
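The lazy assignment of a 2-byte id in translog_write_record() can be
pictured with the sketch below. Everything in it is hypothetical (struct
names, counters standing in for real log records); it only models the rule
that a share without an id gets one, and that the "id->name" pair is logged
once, before the first record that uses the id.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; id 0 means "no short id assigned yet". */
struct share_sketch { const char *name; uint16_t id; };
struct log_sketch
{
  unsigned file_id_records;   /* stands in for LOGREC_FILE_ID records */
  unsigned other_records;     /* all other records written */
  uint16_t next_id;           /* next short id to hand out */
};

static void sketch_log_init(struct log_sketch *log)
{
  log->file_id_records= 0;
  log->other_records= 0;
  log->next_id= 1;
}

/* Write one record for 'share'; returns the short id stored in it. */
static uint16_t sketch_write_record(struct log_sketch *log,
                                    struct share_sketch *share)
{
  if (share->id == 0)
  {
    assert(log->next_id != 0);   /* this sketch ignores id exhaustion */
    share->id= log->next_id++;
    log->file_id_records++;      /* log the "id->name" pair once */
  }
  log->other_records++;          /* the record itself carries share->id */
  return share->id;
}
```

Callers never store the id themselves in this model, which mirrors the
removal of the fileid_store() calls from ma_blockrec.c.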
storage/maria/ma_loghandler.h:
Changes in log records (see ma_loghandler.c).
new prototypes, new functions.
storage/maria/ma_loghandler_lsn.h:
adding a type LSN_WITH_FLAGS especially for TRN::first_undo_lsn,
where the most significant byte is used for flags.
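Keeping flags in the most significant byte of an LSN-sized value could look
like the sketch below. The type, macro and function names are hypothetical,
and it assumes a 64-bit value whose low 7 bytes hold the LSN; the real
definitions live in ma_loghandler_lsn.h and trnman.h and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; assumes a 64-bit value with a 7-byte LSN part. */
typedef uint64_t lsn_with_flags_sketch;

#define SKETCH_FLAGS_SHIFT   56
#define SKETCH_LSN_MASK      ((UINT64_C(1) << SKETCH_FLAGS_SHIFT) - 1)
#define SKETCH_LOGGED_LONG_ID 0x01   /* illustrative flag bit */

/* Return the LSN part, dropping the flag byte. */
static uint64_t sketch_lsn_value(lsn_with_flags_sketch v)
{
  return v & SKETCH_LSN_MASK;
}

/* Return the flag byte. */
static uint8_t sketch_lsn_flags(lsn_with_flags_sketch v)
{
  return (uint8_t) (v >> SKETCH_FLAGS_SHIFT);
}

/* Set flag bits without touching the LSN part. */
static lsn_with_flags_sketch sketch_set_flags(lsn_with_flags_sketch v,
                                              uint8_t flags)
{
  return v | ((uint64_t) flags << SKETCH_FLAGS_SHIFT);
}

/* Replace the LSN part while keeping the flags. */
static lsn_with_flags_sketch sketch_set_lsn(lsn_with_flags_sketch v,
                                            uint64_t lsn)
{
  assert((lsn & ~SKETCH_LSN_MASK) == 0);   /* LSN must fit in 7 bytes */
  return (v & ~SKETCH_LSN_MASK) | lsn;
}
```

Any code comparing such values as LSNs has to mask the flag byte first,
which is presumably why a distinct type is useful.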
storage/maria/ma_open.c:
storing the create_rename_lsn in the index file's header (in the
state, precisely) and retrieving it from there.
storage/maria/ma_pagecache.c:
- my set_if_bigger was wrong, correcting it
- if the first_in_switch list is not empty, it means that
changed_blocks misses some dirty pages, so Checkpoint cannot run and
needs to wait. A variable missing_blocks_in_changed_list is added to
tell that (should it be named missing_blocks_in_changed_blocks?)
- pagecache_collect_changed_blocks_with_lsn() now also tells the
minimum rec_lsn (needed for low-water mark computation).
storage/maria/ma_pagecache.h:
see ma_pagecache.c
storage/maria/ma_panic.c:
comment
storage/maria/ma_range.c:
comment
storage/maria/ma_rename.c:
- logging LOGREC_RENAME_TABLE; knowing whether this is needed requires
knowing whether the table is transactional, which requires opening the
table.
- update create_rename_lsn
- we need to sync directories only if the table is transactional
storage/maria/ma_static.c:
comment
storage/maria/ma_test_all.sh:
- tip for Valgrind-ing ma_test_all
- do "export maria_path=somepath" before calling ma_test_all,
if you want to run ma_test_all out of storage/maria (useful
to have parallel runs, like one normal and one Valgrind, they
must not use the same tables so need to run in different directories)
storage/maria/maria_def.h:
- state now contains, in memory and on disk, the create_rename_lsn
- share now contains a 2-byte-id
storage/maria/trnman.c:
preparations for Checkpoint: capture trn->rec_lsn, trn->first_undo_lsn;
minimum first_undo_lsn needed to know log's low-water-mark
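One plausible way to combine these minimums into a low-water mark is
sketched below. This is an assumption for illustration, not the Checkpoint
code (which this commit only prepares): the function names are
hypothetical, 0 is taken to mean "LSN not set", flag bytes are assumed to be
already masked off, and the mark starts at the log horizon when nothing
older is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lower 'mark' to the smallest set (non-zero) LSN found in the array. */
static uint64_t sketch_lower_to_min(uint64_t mark, const uint64_t *lsns,
                                    size_t n)
{
  size_t i;
  assert(lsns != NULL || n == 0);
  for (i= 0; i < n; i++)
    if (lsns[i] != 0 && lsns[i] < mark)
      mark= lsns[i];
  return mark;
}

/*
  Assumed rule: the log is needed from the oldest of
  - the minimum rec_lsn of dirty pages, and
  - the minimum first_undo_lsn of active transactions,
  or from the horizon if neither exists.
*/
static uint64_t sketch_low_water_mark(uint64_t horizon,
                                      const uint64_t *page_rec_lsn,
                                      size_t n_pages,
                                      const uint64_t *trn_first_undo_lsn,
                                      size_t n_trns)
{
  uint64_t mark= horizon;
  mark= sketch_lower_to_min(mark, page_rec_lsn, n_pages);
  mark= sketch_lower_to_min(mark, trn_first_undo_lsn, n_trns);
  return mark;
}
```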
storage/maria/trnman.h:
using most significant byte of first_undo_lsn to hold miscellaneous
flags, for now TRANSACTION_LOGGED_LONG_ID.
dummy_transaction_object is already declared in ma_static.c.
storage/maria/trnman_public.h:
dummy_transaction_object was declared in all files including
trnman_public.h, while in fact it's a single object.
new prototype
storage/maria/unittest/ma_test_loghandler-t.c:
update for new prototype
storage/maria/unittest/ma_test_loghandler_multigroup-t.c:
update for new prototype
storage/maria/unittest/ma_test_loghandler_multithread-t.c:
update for new prototype
storage/maria/unittest/ma_test_loghandler_pagecache-t.c:
update for new prototype
storage/maria/ma_commit.c:
function which wraps:
- writing a LOGREC_COMMIT record (== commit on disk)
- calling trnman_commit_trn() (== commit in memory)
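The shape of such a wrapper can be sketched as below. It is a model with
hypothetical names and stubbed-out helpers, assuming the COMMIT record is
written before the in-memory commit and that a transaction which logged no
UNDO is simply rolled back; it is not the body of ma_commit().

```c
#include <assert.h>
#include <stdint.h>

enum sketch_state { SKETCH_ACTIVE, SKETCH_COMMITTED, SKETCH_ROLLED_BACK };

/* Hypothetical stand-in for TRN. */
struct trn_sketch { uint64_t undo_lsn; enum sketch_state state; };

static int      sketch_fail_log_write= 0;   /* simulate a disk error */
static unsigned sketch_commit_records= 0;   /* COMMIT records "written" */

/* Stub for writing the COMMIT record: 0 on success, 1 on error. */
static int sketch_write_commit_record(void)
{
  if (sketch_fail_log_write)
    return 1;
  sketch_commit_records++;
  return 0;
}

/* Returns 0 on success, 1 on error. */
static int sketch_commit(struct trn_sketch *trn)
{
  assert(trn->state == SKETCH_ACTIVE);
  if (trn->undo_lsn == 0)            /* no work done: rollback is cheaper */
  {
    trn->state= SKETCH_ROLLED_BACK;
    return 0;
  }
  if (sketch_write_commit_record())  /* commit on disk first */
    return 1;
  trn->state= SKETCH_COMMITTED;      /* then commit in memory */
  return 0;
}
```

If the log write fails, the transaction stays active in this model and the
caller sees the error.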
storage/maria/ma_commit.h:
new header file
.tree-is-private:
this file is now needed to keep our tree private. When 5.1 is merged
into mysql-maria, we can abandon our maria-specific post-commit
trigger; .tree-is-private will take care of keeping commit mails
private. Don't push this file to public trees.
Diffstat (limited to 'storage/maria')
35 files changed, 1407 insertions, 632 deletions
diff --git a/storage/maria/Makefile.am b/storage/maria/Makefile.am index 9d8ab704541..fbb25584910 100644 --- a/storage/maria/Makefile.am +++ b/storage/maria/Makefile.am @@ -54,7 +54,8 @@ noinst_HEADERS = maria_def.h ma_rt_index.h ma_rt_key.h ma_rt_mbr.h \ ma_sp_defs.h ma_fulltext.h ma_ftdefs.h ma_ft_test1.h \ ma_ft_eval.h trnman.h lockman.h tablockman.h \ ma_control_file.h ha_maria.h ma_blockrec.h \ - ma_loghandler.h ma_loghandler_lsn.h ma_pagecache.h + ma_loghandler.h ma_loghandler_lsn.h ma_pagecache.h \ + ma_commit.h ma_test1_DEPENDENCIES= $(LIBRARIES) ma_test1_LDADD= @CLIENT_EXTRA_LDFLAGS@ libmaria.a \ $(top_builddir)/storage/myisam/libmyisam.a \ @@ -112,7 +113,8 @@ libmaria_a_SOURCES = ma_init.c ma_open.c ma_extra.c ma_info.c ma_rkey.c \ ha_maria.cc trnman.c lockman.c tablockman.c \ ma_rt_index.c ma_rt_key.c ma_rt_mbr.c ma_rt_split.c \ ma_sp_key.c ma_control_file.c ma_loghandler.c \ - ma_pagecache.c ma_pagecaches.c + ma_pagecache.c ma_pagecaches.c \ + ma_commit.c CLEANFILES = test?.MA? FT?.MA? isam.log ma_test_all ma_rt_test.MA? sp_test.MA? 
SUFFIXES = .sh diff --git a/storage/maria/ha_maria.cc b/storage/maria/ha_maria.cc index 288366675a7..e05f97a384d 100644 --- a/storage/maria/ha_maria.cc +++ b/storage/maria/ha_maria.cc @@ -30,6 +30,7 @@ #include "maria_def.h" #include "ma_rt_index.h" #include "ma_blockrec.h" +#include "ma_commit.h" #define MARIA_CANNOT_ROLLBACK HA_NO_TRANSACTIONS #ifdef MARIA_CANNOT_ROLLBACK @@ -690,7 +691,8 @@ int ha_maria::open(const char *name, int mode, uint test_if_locked) info(HA_STATUS_NO_LOCK | HA_STATUS_VARIABLE | HA_STATUS_CONST); if (!(test_if_locked & HA_OPEN_WAIT_IF_LOCKED)) VOID(maria_extra(file, HA_EXTRA_WAIT_LOCK, 0)); - if (file->s->data_file_type != STATIC_RECORD) + save_transactional= file->s->base.transactional; + if ((data_file_type= file->s->data_file_type) != STATIC_RECORD) int_table_flags |= HA_REC_NOT_IN_SEQ; if (file->s->options & (HA_OPTION_CHECKSUM | HA_OPTION_COMPRESS_RECORD)) int_table_flags |= HA_HAS_CHECKSUM; @@ -1178,6 +1180,8 @@ int ha_maria::repair(THD *thd, HA_CHECK &param, bool do_optimize) llstr(rows, llbuff), llstr(file->state->records, llbuff2)); } + if (!error) + error= _ma_repair_write_log_record(&param, file); } else { @@ -1806,7 +1810,6 @@ int ha_maria::info(uint flag) MY_APPEND_EXT | MY_UNPACK_FILENAME); if (strcmp(name_buff, maria_info.index_file_name)) index_file_name=maria_info.index_file_name; - data_file_type= maria_info.data_file_type; } if (flag & HA_STATUS_ERRKEY) { @@ -1860,7 +1863,7 @@ int ha_maria::external_lock(THD *thd, int lock_type) { TRN *trn= THD_TRN; DBUG_ENTER("ha_maria::external_lock"); - if (!file->s->base.transactional) + if (!save_transactional) goto skip_transaction; if (!trn && lock_type != F_UNLCK) /* no transaction yet - open it now */ { @@ -1884,6 +1887,19 @@ int ha_maria::external_lock(THD *thd, int lock_type) trans_register_ha(thd, FALSE, maria_hton); trnman_new_statement(trn); } + if (!thd->transaction.on) + { + /* + No need to log REDOs/UNDOs. 
If this is an internal temporary table + which will be renamed to a permanent table (like in ALTER TABLE), + the rename happens after unlocking so will be durable (and the table + will get its create_rename_lsn). + Note: if we wanted to enable users to have an old backup and apply + tons of archived logs to roll-forward, we could then not disable + REDOs/UNDOs in this case. + */ + file->s->base.transactional= FALSE; + } } else { @@ -1894,7 +1910,8 @@ int ha_maria::external_lock(THD *thd, int lock_type) { /* autocommit ? rollback a transaction */ #ifdef MARIA_CANNOT_ROLLBACK - trnman_commit_trn(trn); + if (ma_commit(trn)) + DBUG_RETURN(1); THD_TRN= 0; #else if (!(thd->options & (OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) @@ -1906,6 +1923,7 @@ int ha_maria::external_lock(THD *thd, int lock_type) #endif } } + file->s->base.transactional= save_transactional; } skip_transaction: DBUG_RETURN(maria_lock_database(file, !table->s->tmp_table ? @@ -1916,7 +1934,7 @@ skip_transaction: int ha_maria::start_stmt(THD *thd, thr_lock_type lock_type) { TRN *trn= THD_TRN; - if (file->s->base.transactional) + if (save_transactional) { DBUG_ASSERT(trn); // this may be called only after external_lock() DBUG_ASSERT(trnman_has_locked_tables(trn)); @@ -2186,8 +2204,7 @@ static int maria_commit(handlerton *hton __attribute__ ((unused)), DBUG_RETURN(0); // end of statement DBUG_PRINT("info", ("THD_TRN set to 0x0")); THD_TRN= 0; - DBUG_RETURN(trnman_commit_trn(trn) ? 
- HA_ERR_OUT_OF_MEM : 0); // end of transaction + DBUG_RETURN(ma_commit(trn)); // end of transaction } @@ -2212,6 +2229,7 @@ static int maria_rollback(handlerton *hton __attribute__ ((unused)), static int ha_maria_init(void *p) { + int res; maria_hton= (handlerton *)p; maria_hton->state= SHOW_OPTION_YES; maria_hton->db_type= DB_TYPE_MARIA; @@ -2223,14 +2241,16 @@ static int ha_maria_init(void *p) maria_hton->flags= HTON_CAN_RECREATE | HTON_SUPPORT_LOG_TABLES; bzero(maria_log_pagecache, sizeof(*maria_log_pagecache)); maria_data_root= mysql_real_data_home; - return (test(maria_init() || ma_control_file_create_or_open() || - (init_pagecache(maria_log_pagecache, - TRANSLOG_PAGECACHE_SIZE, 0, 0, - TRANSLOG_PAGE_SIZE) == 0) || - translog_init(maria_data_root, TRANSLOG_FILE_SIZE, - MYSQL_VERSION_ID, server_id, maria_log_pagecache, - TRANSLOG_DEFAULT_FLAGS) || - trnman_init())); + res= maria_init() || ma_control_file_create_or_open() || + (init_pagecache(maria_log_pagecache, + TRANSLOG_PAGECACHE_SIZE, 0, 0, + TRANSLOG_PAGE_SIZE) == 0) || + translog_init(maria_data_root, TRANSLOG_FILE_SIZE, + MYSQL_VERSION_ID, server_id, maria_log_pagecache, + TRANSLOG_DEFAULT_FLAGS) || + trnman_init(); + maria_multi_threaded= TRUE; + return res; } diff --git a/storage/maria/ha_maria.h b/storage/maria/ha_maria.h index dd0a9594ef3..a2f6b190657 100644 --- a/storage/maria/ha_maria.h +++ b/storage/maria/ha_maria.h @@ -39,6 +39,11 @@ class ha_maria :public handler char *data_file_name, *index_file_name; enum data_file_type data_file_type; bool can_enable_indexes; + /** + @brief for temporarily disabling table's transactionality + (if THD::transaction::on is false), remember the original value here + */ + bool save_transactional; int repair(THD * thd, HA_CHECK &param, bool optimize); public: diff --git a/storage/maria/ma_blockrec.c b/storage/maria/ma_blockrec.c index 39769507887..d2512f1e025 100644 --- a/storage/maria/ma_blockrec.c +++ b/storage/maria/ma_blockrec.c @@ -171,11 +174,14 @@ started and 
we can then delete TRANSID and VER_PTR from the row to gain more space. - If a row is deleted in Maria, we change TRANSID to current transid and - change VER_PTR to point to the undo record for the delete. The undo - record must contain the original TRANSID, so that another transaction - can use this to check if they should use the found row or go to the - previous row pointed to by the VER_PTR in the undo row. + If a row is deleted in Maria, we change TRANSID to the deleting + transaction's id, change VER_PTR to point to the undo record for the delete, + and add DELETE_TRANSID (the id of the transaction which last + inserted/updated the row before its deletion). DELETE_TRANSID allows an old + transaction to avoid reading the log to know if it can see the last version + before delete (in other words it reduces the probability of having to follow + VER_PTR). TODO: depending on a compilation option, evaluate the performance + impact of not storing DELETE_TRANSID (which would make the row smaller). Description of the different parts: @@ -391,7 +394,12 @@ my_bool _ma_once_end_block_record(MARIA_SHARE *share) share->temporary ? FLUSH_IGNORE_CHANGED : FLUSH_RELEASE)) res= 1; - if (my_close(share->bitmap.file.file, MYF(MY_WME))) + /* + File must be synced as it is going out of the maria_open_list and so + becoming unknown to Checkpoint. 
+ */ + if (my_sync(share->bitmap.file.file, MYF(MY_WME)) || + my_close(share->bitmap.file.file, MYF(MY_WME))) res= 1; /* Trivial assignment to guard against multiple invocations @@ -400,6 +408,8 @@ my_bool _ma_once_end_block_record(MARIA_SHARE *share) */ share->bitmap.file.file= -1; } + if (share->id != 0) + translog_deassign_id_from_share(share); return res; } @@ -573,7 +583,14 @@ void _ma_unpin_all_pages(MARIA_HA *info, LSN undo_lsn) DBUG_ASSERT(undo_lsn != 0 || !info->s->base.transactional); if (!info->s->base.transactional) - undo_lsn= 0; /* Avoid assert in key cache */ + { + /* + If this is a transactional table but with transactionality temporarily + disabled (like in ALTER TABLE) we need to give a sensible LSN to pages + and not 0. If this is not a transactional table it will reduce to 0. + */ + undo_lsn= info->s->state.create_rename_lsn; + } while (pinned_page-- != page_link) pagecache_unlock_by_link(info->s->pagecache, pinned_page->link, @@ -1133,7 +1150,6 @@ static my_bool write_tail(MARIA_HA *info, LSN lsn; /* Log REDO changes of tail page */ - fileid_store(log_data, info->dfile.file); page_store(log_data+ FILEID_STORE_SIZE, block->page); dirpos_store(log_data+ FILEID_STORE_SIZE + PAGE_STORE_SIZE, row_pos.rownr); @@ -1143,7 +1159,8 @@ static my_bool write_tail(MARIA_HA *info, log_array[TRANSLOG_INTERNAL_PARTS + 1].length= length; if (translog_write_record(&lsn, LOGREC_REDO_INSERT_ROW_TAIL, info->trn, share, sizeof(log_data) + length, - TRANSLOG_INTERNAL_PARTS + 2, log_array)) + TRANSLOG_INTERNAL_PARTS + 2, log_array, + log_data)) DBUG_RETURN(1); } @@ -1388,7 +1405,6 @@ static my_bool free_full_pages(MARIA_HA *info, MARIA_ROW *row) size_t extents_length= row->extents_count * ROW_EXTENT_SIZE; DBUG_ENTER("free_full_pages"); - fileid_store(log_data, info->dfile.file); pagerange_store(log_data + FILEID_STORE_SIZE, row->extents_count); log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char*) log_data; @@ -1397,7 +1413,8 @@ static my_bool free_full_pages(MARIA_HA 
*info, MARIA_ROW *row) log_array[TRANSLOG_INTERNAL_PARTS + 1].length= extents_length; if (translog_write_record(&lsn, LOGREC_REDO_PURGE_BLOCKS, info->trn, info->s, sizeof(log_data) + extents_length, - TRANSLOG_INTERNAL_PARTS + 2, log_array)) + TRANSLOG_INTERNAL_PARTS + 2, log_array, + log_data)) DBUG_RETURN(1); DBUG_RETURN (_ma_bitmap_free_full_pages(info, row->extents, @@ -1431,7 +1448,6 @@ static my_bool free_full_page_range(MARIA_HA *info, ulonglong page, uint count) { LSN lsn; DBUG_ASSERT(info->trn->rec_lsn); - fileid_store(log_data, info->dfile.file); pagerange_store(log_data + FILEID_STORE_SIZE, 1); int5store(log_data + FILEID_STORE_SIZE + PAGERANGE_STORE_SIZE, page); @@ -1442,7 +1458,8 @@ static my_bool free_full_page_range(MARIA_HA *info, ulonglong page, uint count) if (translog_write_record(&lsn, LOGREC_REDO_PURGE_BLOCKS, info->trn, info->s, sizeof(log_data), - TRANSLOG_INTERNAL_PARTS + 1, log_array)) + TRANSLOG_INTERNAL_PARTS + 1, log_array, + log_data)) res= 1; } @@ -1455,24 +1472,25 @@ static my_bool free_full_page_range(MARIA_HA *info, ulonglong page, uint count) } -/* - Write a record to a (set of) pages +/** + @brief Write a record to a (set of) pages - SYNOPSIS - write_block_record() - info Maria handler - old_record Orignal record in case of update; NULL in case of insert - record Record we should write - row Statistics about record (calculated by calc_record_size()) - map_blocks On which pages the record should be stored - row_pos Position on head page where to put head part of record + @param info Maria handler + @param old_record Original record in case of update; NULL in case of + insert + @param record Record we should write + @param row Statistics about record (calculated by + calc_record_size()) + @param map_blocks On which pages the record should be stored + @param row_pos Position on head page where to put head part of + record - NOTES - On return all pinned pages are released. + @note + On return all pinned pages are released. 
- RETURN - 0 ok - 1 error + @return Operation status + @retval 0 OK + @retval 1 Error */ static my_bool write_block_record(MARIA_HA *info, @@ -1940,7 +1958,6 @@ static my_bool write_block_record(MARIA_HA *info, size_t data_length= (size_t) (data - row_pos->data); /* Log REDO changes of head page */ - fileid_store(log_data, info->dfile.file); page_store(log_data+ FILEID_STORE_SIZE, head_block->page); dirpos_store(log_data+ FILEID_STORE_SIZE + PAGE_STORE_SIZE, row_pos->rownr); @@ -1950,7 +1967,8 @@ static my_bool write_block_record(MARIA_HA *info, log_array[TRANSLOG_INTERNAL_PARTS + 1].length= data_length; if (translog_write_record(&lsn, LOGREC_REDO_INSERT_ROW_HEAD, info->trn, share, sizeof(log_data) + data_length, - TRANSLOG_INTERNAL_PARTS + 2, log_array)) + TRANSLOG_INTERNAL_PARTS + 2, log_array, + log_data)) goto disk_err; } @@ -2010,7 +2028,6 @@ static my_bool write_block_record(MARIA_HA *info, NullS)) goto disk_err; } - fileid_store(log_data, info->dfile.file); log_pos= log_data + FILEID_STORE_SIZE; log_array_pos= log_array+ TRANSLOG_INTERNAL_PARTS+1; @@ -2068,7 +2085,7 @@ static my_bool write_block_record(MARIA_HA *info, error= translog_write_record(&lsn, LOGREC_REDO_INSERT_ROW_BLOBS, info->trn, share, log_entry_length, (uint) (log_array_pos - log_array), - log_array); + log_array, log_data); if (log_array != tmp_log_array) my_free((gptr) log_array, MYF(0)); if (error) @@ -2084,7 +2101,6 @@ static my_bool write_block_record(MARIA_HA *info, /* LOGREC_UNDO_ROW_INSERT & LOGREC_UNDO_ROW_INSERT share same header */ lsn_store(log_data, info->trn->undo_lsn); - fileid_store(log_data + LSN_STORE_SIZE, info->dfile.file); page_store(log_data+ LSN_STORE_SIZE + FILEID_STORE_SIZE, head_block->page); dirpos_store(log_data+ LSN_STORE_SIZE + FILEID_STORE_SIZE + @@ -2099,7 +2115,8 @@ static my_bool write_block_record(MARIA_HA *info, /* Write UNDO log record for the INSERT */ if (translog_write_record(&lsn, LOGREC_UNDO_ROW_INSERT, info->trn, share, sizeof(log_data), - 
TRANSLOG_INTERNAL_PARTS + 1, log_array)) + TRANSLOG_INTERNAL_PARTS + 1, log_array, + log_data + LSN_STORE_SIZE)) goto disk_err; } else @@ -2114,7 +2131,7 @@ static my_bool write_block_record(MARIA_HA *info, if (translog_write_record(&lsn, LOGREC_UNDO_ROW_UPDATE, info->trn, share, sizeof(log_data) + row_length, TRANSLOG_INTERNAL_PARTS + 1 + row_parts_count, - log_array)) + log_array, log_data + LSN_STORE_SIZE)) goto disk_err; } } @@ -2164,6 +2181,15 @@ crashed: my_errno= HA_ERR_WRONG_IN_RECORD; disk_err: + /** + @todo RECOVERY we are going to let dirty pages go to disk while we have + logged UNDO, this violates WAL. If we have not written any full pages, + all dirty pages are pinned so we could just delete them from the + pagecache. Moreover, we have written some REDOs without a closing UNDO, + it's possible that a next operation by this transaction succeeds and then + Recovery would glue the "orphan REDOs" to the succeeded operation and + execute the failed REDOs. + */ /* Unpin all pinned pages to not cause problems for disk cache */ _ma_unpin_all_pages(info, 0); @@ -2229,20 +2255,18 @@ my_bool _ma_write_block_record(MARIA_HA *info __attribute__ ((unused)), } -/* - Remove row written by _ma_write_block_record +/** + @brief Remove row written by _ma_write_block_record() - SYNOPSIS - _ma_abort_write_block_record() - info Maria handler + @param info Maria handler - INFORMATION - This is called in case we got a duplicate unique key while - writing keys. + @note + This is called in case we got a duplicate unique key while + writing keys. - RETURN - 0 ok - 1 error + @return Operation status + @retval 0 OK + @retval 1 Error */ my_bool _ma_write_abort_block_record(MARIA_HA *info) @@ -2288,16 +2312,19 @@ my_bool _ma_write_abort_block_record(MARIA_HA *info) really undo a failed insert. Note that this UNDO will cause recover to ignore the LOGREC_UNDO_ROW_INSERT that is the previous entry in the UNDO chain. 
- We will soon change that: we will here execute the UNDO records - generated while we were trying to write the row; this will log some CLRs - which will replace this LOGREC_UNDO_PURGE. RECOVERY TODO BUG. + */ + /** + @todo RECOVERY BUG + We will soon change that: we will here execute the UNDO records + generated while we were trying to write the row; this will log some + CLRs which will replace this LOGREC_UNDO_PURGE. */ lsn_store(log_data, info->trn->undo_lsn); log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char*) log_data; log_array[TRANSLOG_INTERNAL_PARTS + 0].length= sizeof(log_data); if (translog_write_record(&lsn, LOGREC_UNDO_ROW_PURGE, - info->trn, info->s, sizeof(log_data), - TRANSLOG_INTERNAL_PARTS + 1, log_array)) + info->trn, NULL, sizeof(log_data), + TRANSLOG_INTERNAL_PARTS + 1, log_array, NULL)) res= 1; } _ma_unpin_all_pages(info, info->trn->undo_lsn); @@ -2514,7 +2541,6 @@ static my_bool delete_head_or_tail(MARIA_HA *info, DBUG_ASSERT(share->pagecache->block_size == block_size); /* Log REDO data */ - fileid_store(log_data, info->dfile.file); page_store(log_data+ FILEID_STORE_SIZE, page); dirpos_store(log_data+ FILEID_STORE_SIZE + PAGE_STORE_SIZE, record_number); @@ -2524,7 +2550,8 @@ static my_bool delete_head_or_tail(MARIA_HA *info, if (translog_write_record(&lsn, (head ? 
LOGREC_REDO_PURGE_ROW_HEAD : LOGREC_REDO_PURGE_ROW_TAIL), info->trn, share, sizeof(log_data), - TRANSLOG_INTERNAL_PARTS + 1, log_array)) + TRANSLOG_INTERNAL_PARTS + 1, log_array, + log_data)) DBUG_RETURN(1); if (pagecache_write(share->pagecache, &info->dfile, page, 0, @@ -2545,7 +2572,6 @@ static my_bool delete_head_or_tail(MARIA_HA *info, PAGE_STORE_SIZE + PAGERANGE_STORE_SIZE]; LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 1]; - fileid_store(log_data, info->dfile.file); pagerange_store(log_data + FILEID_STORE_SIZE, 1); page_store(log_data+ FILEID_STORE_SIZE + PAGERANGE_STORE_SIZE, page); pagerange_store(log_data + FILEID_STORE_SIZE + PAGERANGE_STORE_SIZE + @@ -2554,7 +2580,8 @@ static my_bool delete_head_or_tail(MARIA_HA *info, log_array[TRANSLOG_INTERNAL_PARTS + 0].length= sizeof(log_data); if (translog_write_record(&lsn, LOGREC_REDO_PURGE_BLOCKS, info->trn, share, sizeof(log_data), - TRANSLOG_INTERNAL_PARTS + 1, log_array)) + TRANSLOG_INTERNAL_PARTS + 1, log_array, + log_data)) DBUG_RETURN(1); DBUG_ASSERT(empty_space >= info->s->bitmap.sizes[0]); } @@ -2631,7 +2658,6 @@ my_bool _ma_delete_block_record(MARIA_HA *info, const byte *record) /* Write UNDO record */ lsn_store(log_data, info->trn->undo_lsn); - fileid_store(log_data+ LSN_STORE_SIZE, info->dfile.file); page_store(log_data+ LSN_STORE_SIZE + FILEID_STORE_SIZE, page); dirpos_store(log_data+ LSN_STORE_SIZE + FILEID_STORE_SIZE + PAGE_STORE_SIZE, record_number); @@ -2645,7 +2671,7 @@ my_bool _ma_delete_block_record(MARIA_HA *info, const byte *record) if (translog_write_record(&lsn, LOGREC_UNDO_ROW_DELETE, info->trn, info->s, sizeof(log_data) + row_length, TRANSLOG_INTERNAL_PARTS + 1 + row_parts_count, - info->log_row_parts)) + info->log_row_parts, log_data + LSN_STORE_SIZE)) goto err; } diff --git a/storage/maria/ma_blockrec.h b/storage/maria/ma_blockrec.h index f45250ff39c..819d1c2e4d2 100644 --- a/storage/maria/ma_blockrec.h +++ b/storage/maria/ma_blockrec.h @@ -96,7 +96,7 @@ enum en_page_type { 
UNALLOCATED_PAGE, HEAD_PAGE, TAIL_PAGE, BLOB_PAGE, MAX_PAGE_ /******* defines that affects allocation (density) of data *******/ /* - If the tail part (from the main block or a blob) uses more than 75 % of + If the tail part (from the main block or a blob) would use more than 75 % of the size of page, store the tail on a full page instead of a shared tail page. */ diff --git a/storage/maria/ma_check.c b/storage/maria/ma_check.c index 8f10c98d0ee..0fc2b77304d 100644 --- a/storage/maria/ma_check.c +++ b/storage/maria/ma_check.c @@ -53,6 +53,7 @@ #endif #include "ma_rt_index.h" #include "ma_blockrec.h" +#include "trnman_public.h" /* Functions defined in this file */ @@ -2132,11 +2133,15 @@ err: /* Replace the actual file with the temporary file */ if (new_file >= 0) { + myf sync_dir= (share->base.transactional && !share->temporary) ? + MY_SYNC_DIR : 0; my_close(new_file,MYF(0)); info->dfile.file= new_file= -1; if (maria_change_to_newfile(share->data_file_name,MARIA_NAME_DEXT, - DATA_TMP_EXT, (param->testflag & T_BACKUP_DATA ? - MYF(MY_REDEL_MAKE_BACKUP): MYF(0))) || + DATA_TMP_EXT, + MYF((param->testflag & T_BACKUP_DATA ? + MY_REDEL_MAKE_BACKUP : 0) | + sync_dir)) || _ma_open_datafile(info,share,-1)) got_error=1; } @@ -2328,6 +2333,8 @@ int maria_sort_index(HA_CHECK *param, register MARIA_HA *info, my_string name) int old_lock; MARIA_SHARE *share=info->s; MARIA_STATE_INFO old_state; + myf sync_dir= (share->base.transactional && !share->temporary) ? 
+ MY_SYNC_DIR : 0; DBUG_ENTER("maria_sort_index"); /* cannot sort index files with R-tree indexes */ @@ -2388,7 +2395,7 @@ int maria_sort_index(HA_CHECK *param, register MARIA_HA *info, my_string name) share->kfile.file = -1; VOID(my_close(new_file,MYF(MY_WME))); if (maria_change_to_newfile(share->index_file_name, MARIA_NAME_IEXT, - INDEX_TMP_EXT, MYF(0)) || + INDEX_TMP_EXT, sync_dir) || _ma_open_keyfile(share)) goto err2; info->lock_type= F_UNLCK; /* Force maria_readinfo to lock */ @@ -2604,6 +2611,8 @@ int maria_repair_by_sort(HA_CHECK *param, register MARIA_HA *info, char llbuff[22]; MARIA_SORT_INFO sort_info; ulonglong key_map=share->state.key_map; + myf sync_dir= (share->base.transactional && !share->temporary) ? + MY_SYNC_DIR : 0; DBUG_ENTER("maria_repair_by_sort"); start_records=info->state->records; @@ -2922,8 +2931,9 @@ err: info->dfile.file= new_file= -1; if (maria_change_to_newfile(share->data_file_name,MARIA_NAME_DEXT, DATA_TMP_EXT, - (param->testflag & T_BACKUP_DATA ? - MYF(MY_REDEL_MAKE_BACKUP): MYF(0))) || + MYF((param->testflag & T_BACKUP_DATA ? + MY_REDEL_MAKE_BACKUP : 0) | + sync_dir)) || _ma_open_datafile(info,share,-1)) got_error=1; } @@ -3022,6 +3032,8 @@ int maria_repair_parallel(HA_CHECK *param, register MARIA_HA *info, MARIA_SORT_INFO sort_info; ulonglong key_map=share->state.key_map; pthread_attr_t thr_attr; + myf sync_dir= (share->base.transactional && !share->temporary) ? + MY_SYNC_DIR : 0; DBUG_ENTER("maria_repair_parallel"); start_records=info->state->records; @@ -3445,8 +3457,9 @@ err: info->dfile.file= new_file= -1; if (maria_change_to_newfile(share->data_file_name,MARIA_NAME_DEXT, DATA_TMP_EXT, - (param->testflag & T_BACKUP_DATA ? - MYF(MY_REDEL_MAKE_BACKUP): MYF(0))) || + MYF((param->testflag & T_BACKUP_DATA ? 
+ MY_REDEL_MAKE_BACKUP : 0) | + sync_dir)) || _ma_open_datafile(info,share,-1)) got_error=1; } @@ -5135,3 +5148,64 @@ static void restore_data_file_type(MARIA_SHARE *share) share->data_file_type= share->state.header.data_file_type= share->pack.header_length= 0; } + + +/** + @brief Writes a LOGREC_REPAIR_TABLE record and updates create_rename_lsn + + REPAIR/OPTIMIZE have replaced the data/index file with a new file + and so, in this scenario: + @verbatim + CHECKPOINT - REDO_INSERT - COMMIT - ... - REPAIR - ... - crash + @endverbatim + we do not want Recovery to apply the REDO_INSERT to the table, as it would + then possibly wrongly extend the table. By updating create_rename_lsn at + the end of REPAIR, we know that REDO_INSERT will be skipped. + + @param param description of the REPAIR operation + @param info table + + @return Operation status + @retval 0 ok + @retval 1 error (disk problem) +*/ + +int _ma_repair_write_log_record(const HA_CHECK *param, MARIA_HA *info) +{ + MARIA_SHARE *share= info->s; + /* Only called from ha_maria.cc, not maria_check, so translog is inited */ + if (share->base.transactional && !share->temporary) + { + /* For now this record is only informative */ + LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 1]; + uchar log_data[LSN_STORE_SIZE]; + compile_time_assert(LSN_STORE_SIZE >= (FILEID_STORE_SIZE + 4)); + log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char*) log_data; + log_array[TRANSLOG_INTERNAL_PARTS + 0].length= FILEID_STORE_SIZE + 4; + /* + testflag gives an idea of what REPAIR did (in particular T_QUICK + or not: did it touch the data file or not?). 
+ */
+ int4store(log_data + FILEID_STORE_SIZE, param->testflag);
+ if (unlikely(translog_write_record(&share->state.create_rename_lsn,
+ LOGREC_REDO_REPAIR_TABLE,
+ &dummy_transaction_object, share,
+ log_array[TRANSLOG_INTERNAL_PARTS +
+ 0].length,
+ sizeof(log_array)/sizeof(log_array[0]),
+ log_array, log_data)))
+ return 1;
+ /*
+ But this piece is really needed, to have the new table's content durable
+ and to not apply old REDOs to the new table. The table's existence was
+ made durable earlier (MY_SYNC_DIR passed to maria_change_to_newfile()).
+ */
+ lsn_store(log_data, share->state.create_rename_lsn);
+ DBUG_ASSERT(info->dfile.file >= 0);
+ DBUG_ASSERT(share->kfile.file >= 0);
+ return (my_pwrite(share->kfile.file, log_data, sizeof(log_data),
+ sizeof(share->state.header) + 2, MYF(MY_NABP)) ||
+ _ma_sync_table_files(info));
+ }
+ return 0;
+}
diff --git a/storage/maria/ma_close.c b/storage/maria/ma_close.c
index dc60ce8aa83..34c1bfb4d6d 100644
--- a/storage/maria/ma_close.c
+++ b/storage/maria/ma_close.c
@@ -57,14 +57,6 @@ int maria_close(register MARIA_HA *info)
 info->opt_flag&= ~(READ_CACHE_USED | WRITE_CACHE_USED);
 }
 flag= !--share->reopen;
- /*
- RECOVERY TODO:
- If "flag" is TRUE, in the line below we are going to make the table
- unknown to future checkpoints, so it needs to have fsync'ed itself
- entirely (bitmap, pages, etc) at this point.
- The flushing is currently done a few lines further (which is ok, as we
- still hold THR_LOCK_maria), but syncing is missing.
- */
 maria_open_list=list_delete(maria_open_list,&info->open_list);
 pthread_mutex_unlock(&share->intern_lock);
@@ -82,7 +74,12 @@ int maria_close(register MARIA_HA *info)
 FLUSH_IGNORE_CHANGED :
 FLUSH_RELEASE)))
 error= my_errno;
-
+ /*
+ File must be synced as it is going out of the maria_open_list and so
+ becoming unknown to Checkpoint.
+ */
+ if (my_sync(share->kfile.file, MYF(MY_WME)))
+ error= my_errno;
 /*
 If we are crashed, we can safely flush the current state as it will
 not change the crashed state.
diff --git a/storage/maria/ma_commit.c b/storage/maria/ma_commit.c
new file mode 100644
index 00000000000..88aaee0509f
--- /dev/null
+++ b/storage/maria/ma_commit.c
@@ -0,0 +1,71 @@
+/* Copyright (C) 2007 MySQL AB
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; version 2 of the License.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
+
+#include "maria_def.h"
+#include "trnman.h"
+
+/**
+ @brief writes a COMMIT record to log and commits transaction in memory
+
+ @param trn transaction
+
+ @return Operation status
+ @retval 0 ok
+ @retval 1 error (disk error or out of memory)
+*/
+
+int ma_commit(TRN *trn)
+{
+ if (trn->undo_lsn == 0) /* no work done, rollback (cheaper than commit) */
+ return trnman_rollback_trn(trn);
+ /*
+ - if COMMIT record is written before trnman_commit_trn():
+ if Checkpoint comes in the middle it will see trn is not committed,
+ then if crash, Recovery might roll back trn (if min(rec_lsn) is after
+ COMMIT record) and this is not an issue as
+ * transaction's updates were not made visible to other transactions
+ * "commit ok" was not sent to client
+ Alternatively, Recovery might commit trn (if min(rec_lsn) is before COMMIT
+ record), which is ok too. All in all it means that "trn committed" is not
+ 100% equal to "COMMIT record written".
+ - if COMMIT record is written after trnman_commit_trn():
+ if crash happens between the two, trn will be rolled back which is an
+ issue (transaction's updates were made visible to other transactions).
+ So we need to go the first way.
+ */
+ /**
+ @todo RECOVERY share's state is written to disk only in
+ maria_lock_database(), so COMMIT record is not the last record of the
+ transaction! It is probably an issue. Recovery of the state is a problem
+ not yet solved.
+ */
+ LSN commit_lsn;
+ LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS];
+ /*
+ We do not store "thd->transaction.xid_state.xid" for now, it will be
+ needed only when we support XA.
+ */
+ return
+ translog_write_record(&commit_lsn, LOGREC_COMMIT,
+ trn, NULL, 0,
+ sizeof(log_array)/sizeof(log_array[0]),
+ log_array, NULL) ||
+ translog_flush(commit_lsn) || trnman_commit_trn(trn);
+ /*
+ Note: if trnman_commit_trn() fails above, we have already
+ written the COMMIT record, so Checkpoint and Recovery will see the
+ transaction as committed.
+ */
+}
diff --git a/storage/maria/ma_commit.h b/storage/maria/ma_commit.h
new file mode 100644
index 00000000000..2c57c73fd7a
--- /dev/null
+++ b/storage/maria/ma_commit.h
@@ -0,0 +1,18 @@
+/* Copyright (C) 2007 MySQL AB
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; version 2 of the License.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
+
+C_MODE_START
+int ma_commit(TRN *trn);
+C_MODE_END
diff --git a/storage/maria/ma_control_file.c b/storage/maria/ma_control_file.c
index f53da8a5881..db5440dc873 100644
--- a/storage/maria/ma_control_file.c
+++ b/storage/maria/ma_control_file.c
@@ -50,6 +50,13 @@
 LSN last_checkpoint_lsn;
 uint32 last_logno;
+/**
+ @brief If log's lock should be asserted when writing to control file.
+
+ Can be re-used by any function which needs to be thread-safe except when
+ it is called at startup.
+*/
+my_bool maria_multi_threaded= FALSE;
 /*
 Control file is less then 512 bytes (a disk sector),
@@ -203,6 +210,8 @@ err:
 the last_checkpoint_lsn and last_logno global variables.
 Called when we have created a new log (after syncing this log's creation)
 and when we have written a checkpoint (after syncing this log record).
+ Variables last_checkpoint_lsn and last_logno must be protected by caller
+ using log's lock, unless this function is called at startup.
 SYNOPSIS
 ma_control_file_write_and_force()
@@ -233,12 +242,14 @@ int ma_control_file_write_and_force(const LSN checkpoint_lsn, uint32 logno,
 DBUG_ENTER("ma_control_file_write_and_force");
 DBUG_ASSERT(control_file_fd >= 0); /* must be open */
+#ifndef DBUG_OFF
+ if (maria_multi_threaded)
+ translog_lock_assert_owner();
+#endif
 memcpy(buffer + CONTROL_FILE_MAGIC_STRING_OFFSET,
 CONTROL_FILE_MAGIC_STRING, CONTROL_FILE_MAGIC_STRING_SIZE);
- /* TODO: you need some protection to be able to read last_* global vars */
-
 if (objs_to_write == CONTROL_FILE_UPDATE_ONLY_LSN)
 update_checkpoint_lsn= TRUE;
 else if (objs_to_write == CONTROL_FILE_UPDATE_ONLY_LOGNO)
@@ -270,7 +281,6 @@ int ma_control_file_write_and_force(const LSN checkpoint_lsn, uint32 logno,
 my_sync(control_file_fd, MYF(MY_WME)))
 DBUG_RETURN(1);
- /* TODO: you need some protection to be able to write last_* global vars */
 if (update_checkpoint_lsn)
 last_checkpoint_lsn= checkpoint_lsn;
 if (update_logno)
diff --git a/storage/maria/ma_control_file.h b/storage/maria/ma_control_file.h
index 4728d719b2f..c974838684b 100644
--- a/storage/maria/ma_control_file.h
+++ b/storage/maria/ma_control_file.h
@@ -43,6 +43,8 @@ extern LSN last_checkpoint_lsn;
 */
 extern uint32 last_logno;
+extern my_bool maria_multi_threaded;
+
 typedef enum enum_control_file_error {
 CONTROL_FILE_OK= 0,
 CONTROL_FILE_TOO_SMALL,
diff --git a/storage/maria/ma_create.c b/storage/maria/ma_create.c
index d8660dd41cb..53e15deb74b 100644
--- a/storage/maria/ma_create.c
+++ b/storage/maria/ma_create.c
@@ -19,6 +19,7 @@
 #include "ma_sp_defs.h"
 #include <my_bit.h>
 #include "ma_blockrec.h"
+#include "trnman_public.h"
 #if defined(MSDOS) || defined(__WIN__)
 #ifdef __WIN__
@@ -51,7 +52,8 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 unique_key_parts,fulltext_keys,offset, not_block_record_extra_length;
 uint max_field_lengths, extra_header_size;
 ulong reclength, real_reclength,min_pack_length;
- char filename[FN_REFLEN],linkname[FN_REFLEN], *linkname_ptr;
+ char filename[FN_REFLEN], dlinkname[FN_REFLEN], *dlinkname_ptr= NULL,
+ klinkname[FN_REFLEN], *klinkname_ptr= NULL;
 ulong pack_reclength;
 ulonglong tot_length,max_rows, tmp;
 enum en_fieldtype type;
@@ -62,11 +64,12 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 HA_KEYSEG *keyseg,tmp_keyseg;
 MARIA_COLUMNDEF *column, *end_column;
 ulong *rec_per_key_part;
- my_off_t key_root[HA_MAX_POSSIBLE_KEY];
+ my_off_t key_root[HA_MAX_POSSIBLE_KEY], kfile_size_before_extension;
 MARIA_CREATE_INFO tmp_create_info;
 my_bool tmp_table= FALSE; /* cache for presence of HA_OPTION_TMP_TABLE */
 my_bool forced_packed;
- myf sync_dir= MY_SYNC_DIR;
+ myf sync_dir= 0;
+ uchar *log_data= NULL;
 DBUG_ENTER("maria_create");
 DBUG_PRINT("enter", ("keys: %u columns: %u uniques: %u flags: %u",
 keys, columns, uniques, flags));
@@ -250,8 +253,9 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 if (flags & HA_CREATE_TMP_TABLE)
 {
 options|= HA_OPTION_TMP_TABLE;
+ tmp_table= TRUE;
 create_mode|= O_EXCL | O_NOFOLLOW;
- /* temp tables are not crash-safe (dropped at restart) */
+ /* "CREATE TEMPORARY" tables are not crash-safe (dropped at restart) */
 ci->transactional= FALSE;
 }
 share.base.null_bytes= ci->null_bytes;
@@ -624,6 +628,7 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 share.state.dellink = HA_OFFSET_ERROR;
 share.state.first_bitmap_with_space= 0;
+ share.state.create_rename_lsn= 0;
 share.state.process= (ulong) getpid();
 share.state.unique= (ulong) 0;
 share.state.update_count=(ulong) 0;
@@ -671,11 +676,15 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 #endif
 /* max_data_file_length and max_key_file_length are recalculated on open */
- if (options & HA_OPTION_TMP_TABLE)
- {
- tmp_table= TRUE;
- sync_dir= 0;
+ if (tmp_table)
 share.base.max_data_file_length= (my_off_t) ci->data_file_length;
+ else if (ci->transactional && translog_inited)
+ {
+ /*
+ we have checked translog_inited above, because maria_chk may call us
+ (via maria_recreate_table()) and it does not have a log.
+ */
+ sync_dir= MY_SYNC_DIR;
 }
 if (datafile_type == BLOCK_RECORD)
@@ -712,9 +721,9 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 MY_UNPACK_FILENAME |
 (have_iext ? MY_REPLACE_EXT : MY_APPEND_EXT));
 }
- fn_format(linkname, name, "", MARIA_NAME_IEXT,
+ fn_format(klinkname, name, "", MARIA_NAME_IEXT,
 MY_UNPACK_FILENAME|MY_APPEND_EXT);
- linkname_ptr=linkname;
+ klinkname_ptr= klinkname;
 /*
 Don't create the table if the link or file exists to ensure that one
 doesn't accidently destroy another table.
@@ -730,7 +739,6 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 (MY_UNPACK_FILENAME |
 (flags & HA_DONT_TOUCH_DATA) ? MY_RETURN_REAL_PATH : 0) |
 MY_APPEND_EXT);
- linkname_ptr=0;
 /*
 Replace the current file.
 Don't sync dir now if the data file has the same path.
@@ -753,7 +761,7 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 goto err;
 }
- if ((file= my_create_with_symlink(linkname_ptr, filename, 0, create_mode,
+ if ((file= my_create_with_symlink(klinkname_ptr, filename, 0, create_mode,
 MYF(MY_WME|create_flag))) < 0)
 goto err;
 errpos=1;
@@ -780,24 +788,24 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 MY_UNPACK_FILENAME |
 (have_dext ? MY_REPLACE_EXT : MY_APPEND_EXT));
 }
- fn_format(linkname, name, "",MARIA_NAME_DEXT,
+ fn_format(dlinkname, name, "",MARIA_NAME_DEXT,
 MY_UNPACK_FILENAME | MY_APPEND_EXT);
- linkname_ptr=linkname;
+ dlinkname_ptr= dlinkname;
 create_flag=0;
 }
 else
 {
 fn_format(filename,name,"", MARIA_NAME_DEXT,
 MY_UNPACK_FILENAME | MY_APPEND_EXT);
- linkname_ptr=0;
 create_flag=MY_DELETE_OLD;
 }
 if ((dfile=
- my_create_with_symlink(linkname_ptr, filename, 0, create_mode,
+ my_create_with_symlink(dlinkname_ptr, filename, 0, create_mode,
 MYF(MY_WME | create_flag | sync_dir))) < 0)
 goto err;
 errpos=3;
+ share.data_file_type= datafile_type;
 if (_ma_initialize_data_file(dfile, &share))
 goto err;
 }
@@ -925,14 +933,82 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 goto err;
 }
+ if ((kfile_size_before_extension= my_tell(file,MYF(0))) == MY_FILEPOS_ERROR)
+ goto err;
 #ifndef DBUG_OFF
- if ((uint) my_tell(file,MYF(0)) != info_length)
+ if (kfile_size_before_extension != info_length)
+ DBUG_PRINT("warning",("info_length: %u != used_length: %u",
+ info_length, (uint)kfile_size_before_extension));
+#endif
+
+ if (sync_dir)
 {
- uint pos= (uint) my_tell(file,MYF(0));
- DBUG_PRINT("warning",("info_length: %d != used_length: %d",
- info_length, pos));
+ /*
+ we log the first bytes and then the size to which we extend; this is
+ not log 1 KB of mostly zeroes if this is a small table.
+ */
+ char empty_string[]= "";
+ LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 3];
+ uint total_rec_length= 0;
+ uint i;
+ log_array[TRANSLOG_INTERNAL_PARTS + 0].length= 1 + 2 +
+ kfile_size_before_extension;
+ /* we are needing maybe 64 kB, so don't use the stack */
+ log_data= my_malloc(log_array[TRANSLOG_INTERNAL_PARTS + 0].length, MYF(0));
+ if ((log_data == NULL) ||
+ my_pread(file, 1 + 2 + log_data, kfile_size_before_extension,
+ 0, MYF(MY_NABP)))
+ goto err_no_lock;
+ /*
+ remember if the data file was created or not, to know if Recovery can
+ do it or not, in the future
+ */
+ log_data[0]= test(flags & HA_DONT_TOUCH_DATA);
+ int2store(log_data + 1, kfile_size_before_extension);
+ log_array[TRANSLOG_INTERNAL_PARTS + 0].str= log_data;
+ /* symlink description is also needed for re-creation by Recovery: */
+ log_array[TRANSLOG_INTERNAL_PARTS + 1].str=
+ dlinkname_ptr ? dlinkname : empty_string;
+ log_array[TRANSLOG_INTERNAL_PARTS + 1].length=
+ strlen(log_array[TRANSLOG_INTERNAL_PARTS + 1].str);
+ log_array[TRANSLOG_INTERNAL_PARTS + 2].str=
+ klinkname_ptr ? klinkname : empty_string;
+ log_array[TRANSLOG_INTERNAL_PARTS + 2].length=
+ strlen(log_array[TRANSLOG_INTERNAL_PARTS + 2].str);
+ for (i= TRANSLOG_INTERNAL_PARTS;
+ i < (sizeof(log_array)/sizeof(log_array[0])); i++)
+ total_rec_length+= log_array[i].length;
+ /*
+ For this record to be of any use for Recovery, we need the upper
+ MySQL layer to be crash-safe, which it is not now (that would require
+ work using the ddl_log of sql/sql_table.cc); when it is, we should
+ reconsider the moment of writing this log record (before or after op,
+ under THR_LOCK_maria or not...), how to use it in Recovery, and force
+ the log. For now this record is just informative.
+ Note that in case of TRUNCATE TABLE we also come here.
+ When in CREATE/TRUNCATE (or DROP or RENAME or REPAIR) we have not called
+ external_lock(), so have no TRN. It does not matter, as all these
+ operations are non-transactional and sync their files.
+ */
+ if (unlikely(translog_write_record(&share.state.create_rename_lsn,
+ LOGREC_REDO_CREATE_TABLE,
+ &dummy_transaction_object, NULL,
+ total_rec_length,
+ sizeof(log_array)/sizeof(log_array[0]),
+ log_array, NULL)))
+ goto err_no_lock;
+ /*
+ store LSN into file, needed for Recovery to not be confused if a
+ DROP+CREATE happened (applying REDOs to the wrong table).
+ If such direct my_pwrite() to a fixed offset is too "hackish", I can
+ call ma_state_info_write() again but it will be less efficient.
+ */
+ lsn_store(log_data, share.state.create_rename_lsn);
+ if (my_pwrite(file, log_data, LSN_STORE_SIZE,
+ sizeof(share.state.header) + 2, MYF(MY_NABP)))
+ goto err_no_lock;
+ my_free(log_data, MYF(0));
 }
-#endif
 /* Enlarge files */
 DBUG_PRINT("info", ("enlarge to keystart: %lu",
@@ -940,38 +1016,25 @@ int maria_create(const char *name, enum data_file_type datafile_type,
 if (my_chsize(file,(ulong) share.base.keystart,0,MYF(0)))
 goto err;
+ if (sync_dir && my_sync(file, MYF(0)))
+ goto err;
+
 if (! (flags & HA_DONT_TOUCH_DATA))
 {
 #ifdef USE_RELOC
 if (my_chsize(dfile,share.base.min_pack_length*ci->reloc_rows,0,MYF(0)))
 goto err;
- if (!tmp_table && my_sync(file, MYF(0)))
- goto err;
 #endif
- /* if !USE_RELOC, there was no write to the file, no need to sync it */
 errpos=2;
- if (my_close(dfile,MYF(0)))
+ if ((sync_dir && my_sync(dfile, MYF(0))) || my_close(dfile,MYF(0)))
 goto err;
 }
- errpos=0;
 pthread_mutex_unlock(&THR_LOCK_maria);
 res= 0;
+ my_free((char*) rec_per_key_part,MYF(0));
+ errpos=0;
 if (my_close(file,MYF(0)))
 res= my_errno;
- /*
- RECOVERY TODO
- Write a log record describing the CREATE operation (just the file
- names, link names, and the full header's content).
- For this record to be of any use for Recovery, we need the upper
- MySQL layer to be crash-safe, which it is not now (that would require work
- using the ddl_log of sql/sql_table.cc); when is is, we should reconsider
- the moment of writing this log record (before or after op, under
- THR_LOCK_maria or not...), how to use it in Recovery, and force the log.
- For now this record is just informative.
- If operation failed earlier, we clean up in "err:" and the MySQL layer
- will clean up the frm, so we needn't write anything to the log.
- */
- my_free((char*) rec_per_key_part,MYF(0));
 DBUG_RETURN(res);
 err:
@@ -996,6 +1059,7 @@ err_no_lock:
 MY_UNPACK_FILENAME | MY_APPEND_EXT),
 sync_dir);
 }
+ my_free(log_data, MYF(MY_ALLOW_ZERO_PTR));
 my_free((char*) rec_per_key_part, MYF(0));
 DBUG_RETURN(my_errno=save_errno); /* return the fatal errno */
 }
@@ -1086,9 +1150,9 @@ int _ma_initialize_data_file(File dfile, MARIA_SHARE *share)
 {
 if (share->data_file_type == BLOCK_RECORD)
 {
- if (my_chsize(dfile, maria_block_size, 0, MYF(MY_WME)))
+ if (my_chsize(dfile, share->base.block_size, 0, MYF(MY_WME)))
 return 1;
- share->state.state.data_file_length= maria_block_size;
+ share->state.state.data_file_length= share->base.block_size;
 _ma_bitmap_delete_all(share);
 }
 return 0;
diff --git a/storage/maria/ma_delete_all.c b/storage/maria/ma_delete_all.c
index 2d85b347662..7286f540aa1 100644
--- a/storage/maria/ma_delete_all.c
+++ b/storage/maria/ma_delete_all.c
@@ -17,21 +17,38 @@
 /* This clears the status information and truncates files */
 #include "maria_def.h"
+#include "trnman_public.h"
+
+/**
+ @brief deletes all rows from a table
+
+ @param info Maria handler
+
+ @return Operation status
+ @retval 0 ok
+ @retval 1 error
+*/
 int maria_delete_all_rows(MARIA_HA *info)
 {
 uint i;
 MARIA_SHARE *share=info->s;
 MARIA_STATE_INFO *state=&share->state;
+ my_bool log_record;
 DBUG_ENTER("maria_delete_all_rows");
 if (share->options & HA_OPTION_READ_ONLY_DATA)
 {
 DBUG_RETURN(my_errno=EACCES);
 }
- /* LOCK TODO take X-lock on table here */
+ /**
+ @todo LOCK take X-lock on table here.
+ When we have versioning, if some other thread is looking at this table,
+ we cannot shrink the file like this.
+ */
 if (_ma_readinfo(info,F_WRLCK,1))
 DBUG_RETURN(my_errno);
+ log_record= share->base.transactional && !share->temporary;
 if (_ma_mark_file_changed(info))
 goto err;
@@ -54,27 +71,13 @@ int maria_delete_all_rows(MARIA_HA *info)
 */
 flush_pagecache_blocks(share->pagecache, &share->kfile,
 FLUSH_IGNORE_CHANGED);
- /*
- RECOVERY TODO Log the two chsize and header modifications and force the
- log. So that if crash between the two chsize, we finish the work at
- Recovery. For this scenario:
- "TRUNCATE TABLE t1; DROP TABLE t1; RENAME TABLE t2 to t1; crash;"
- Recovery mustn't truncate the new t1, so the log records of TRUNCATE
- should be applied only if t1 exists and its ZeroDirtyPagesLSN is smaller
- than the records'. See more comments below.
- */
 if (my_chsize(info->dfile.file, 0, 0, MYF(MY_WME)) ||
 my_chsize(share->kfile.file, share->base.keystart, 0, MYF(MY_WME)) )
 goto err;
- if (_ma_initialize_data_file(info->dfile.file, info->s))
+ if (_ma_initialize_data_file(info->dfile.file, share))
 goto err;
- /*
- RECOVERY TODO Consider updating ZeroDirtyPagesLSN here. It is
- not a necessity (it is one only in RENAME commands) but an optional
- optimization which will allow some REDO skipping at Recovery.
- */
 VOID(_ma_writeinfo(info,WRITEINFO_UPDATE_KEYFILE));
 #ifdef HAVE_MMAP
 /* Resize mmaped area */
@@ -82,24 +85,48 @@ int maria_delete_all_rows(MARIA_HA *info)
 _ma_remap_file(info, (my_off_t)0);
 rw_unlock(&info->s->mmap_lock);
 #endif
- /*
- RECOVERY TODO Until we have the TRUNCATE log record and take it into
- account for log-low-water-mark calculation and use it in Recovery, we need
- to sync.
- */
- if (_ma_sync_table_files(info))
- goto err;
+ if (log_record)
+ {
+ /* For now this record is only informative */
+ LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 1];
+ uchar log_data[LSN_STORE_SIZE];
+ log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char*) log_data;
+ log_array[TRANSLOG_INTERNAL_PARTS + 0].length= FILEID_STORE_SIZE;
+ if (unlikely(translog_write_record(&share->state.create_rename_lsn,
+ LOGREC_REDO_DELETE_ALL,
+ info->trn, share, 0,
+ sizeof(log_array)/sizeof(log_array[0]),
+ log_array, log_data)))
+ goto err;
+ /*
+ store LSN into file. It is an optimization so that all old REDOs for
+ this table are ignored (scenario: checkpoint, INSERT1s, DELETE ALL;
+ INSERT2s, crash: then Recovery can skip INSERT1s). It also allows us to
+ ignore the present record at Recovery.
+ Note that storing the LSN could not be done by _ma_writeinfo() above as
+ the table is locked at this moment. So we need to do it by ourselves.
+ */
+ lsn_store(log_data, share->state.create_rename_lsn);
+ if (my_pwrite(share->kfile.file, log_data, sizeof(log_data),
+ sizeof(share->state.header) + 2, MYF(MY_NABP)) ||
+ _ma_sync_table_files(info))
+ goto err;
+ /**
+ @todo RECOVERY Until we take into account the log record above
+ for log-low-water-mark calculation and use it in Recovery, we need
+ to sync above.
+ */
+ }
 allow_break(); /* Allow SIGHUP & SIGINT */
 DBUG_RETURN(0);
 err:
 {
 int save_errno=my_errno;
- /* RECOVERY TODO log the header modifications */
 VOID(_ma_writeinfo(info,WRITEINFO_UPDATE_KEYFILE));
 info->update|=HA_STATE_WRITTEN; /* Buffer changed */
- /* RECOVERY TODO until we log above we have to sync */
- if (_ma_sync_table_files(info) && !save_errno)
+ /** @todo RECOVERY until we use the log record above we have to sync */
+ if (log_record &&_ma_sync_table_files(info) && !save_errno)
 save_errno= my_errno;
 allow_break(); /* Allow SIGHUP & SIGINT */
 DBUG_RETURN(my_errno=save_errno);
diff --git a/storage/maria/ma_delete_table.c b/storage/maria/ma_delete_table.c
index aafe7a1dee9..990714043bf 100644
--- a/storage/maria/ma_delete_table.c
+++ b/storage/maria/ma_delete_table.c
@@ -13,11 +13,18 @@
 along with this program; if not, write to the Free Software
 Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
-/*
- deletes a table
-*/
-
 #include "ma_fulltext.h"
+#include "trnman_public.h"
+
+/**
+ @brief drops (deletes) a table
+
+ @param name table's name
+
+ @return Operation status
+ @retval 0 ok
+ @retval 1 error
+*/
 int maria_delete_table(const char *name)
 {
@@ -25,56 +32,78 @@ int maria_delete_table(const char *name)
 #ifdef USE_RAID
 uint raid_type=0,raid_chunks=0;
 #endif
+ MARIA_HA *info;
+ myf sync_dir;
 DBUG_ENTER("maria_delete_table");
 #ifdef EXTRA_DEBUG
 _ma_check_table_is_closed(name,"delete");
 #endif
- /* LOCK TODO take X-lock on table here */
+ /** @todo LOCK take X-lock on table */
+ /*
+ We need to know if this table is transactional.
+ When built with RAID support, we also need to determine if this table
+ makes use of the raid feature. If yes, we need to remove all raid
+ chunks. This is done with my_raid_delete(). Unfortunately it is
+ necessary to open the table just to check this. We use
+ 'open_for_repair' to be able to open even a crashed table. If even
+ this open fails, we assume no raid configuration for this table
+ and try to remove the normal data file only. This may however
+ leave the raid chunks behind.
+ */
+ if (!(info= maria_open(name, O_RDONLY, HA_OPEN_FOR_REPAIR)))
+ {
 #ifdef USE_RAID
+ raid_type= 0;
+#endif
+ sync_dir= 0;
+ }
+ else
 {
- MARIA_HA *info;
- /*
- When built with RAID support, we need to determine if this table
- makes use of the raid feature. If yes, we need to remove all raid
- chunks. This is done with my_raid_delete(). Unfortunately it is
- necessary to open the table just to check this. We use
- 'open_for_repair' to be able to open even a crashed table. If even
- this open fails, we assume no raid configuration for this table
- and try to remove the normal data file only. This may however
- leave the raid chunks behind.
- */
- if (!(info= maria_open(name, O_RDONLY, HA_OPEN_FOR_REPAIR)))
- raid_type= 0;
- else
- {
- raid_type= info->s->base.raid_type;
- raid_chunks= info->s->base.raid_chunks;
- maria_close(info);
- }
+#ifdef USE_RAID
+ raid_type= info->s->base.raid_type;
+ raid_chunks= info->s->base.raid_chunks;
+#endif
+ sync_dir= (info->s->base.transactional && !info->s->temporary) ?
+ MY_SYNC_DIR : 0;
+ maria_close(info);
 }
+#ifdef USE_RAID
 #ifdef EXTRA_DEBUG
 _ma_check_table_is_closed(name,"delete");
 #endif
 #endif /* USE_RAID */
+ if (sync_dir)
+ {
+ /*
+ For this log record to be of any use for Recovery, we need the upper
+ MySQL layer to be crash-safe in DDLs; when it is we should reconsider
+ the moment of writing this log record, how to use it in Recovery, and
+ force the log. For now this record is only informative.
+ */
+ LSN lsn;
+ LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 1];
+ log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char *)name;
+ log_array[TRANSLOG_INTERNAL_PARTS + 0].length= strlen(name);
+ if (unlikely(translog_write_record(&lsn, LOGREC_REDO_DROP_TABLE,
+ &dummy_transaction_object, NULL,
+ log_array[TRANSLOG_INTERNAL_PARTS +
+ 0].length,
+ sizeof(log_array)/sizeof(log_array[0]),
+ log_array, NULL)))
+ DBUG_RETURN(1);
+ }
+
 fn_format(from,name,"",MARIA_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
- /*
- RECOVERY TODO log the two deletes below.
- Then do the file deletions.
- For this log record to be of any use for Recovery, we need the upper MySQL
- layer to be crash-safe in DDLs; when it is we should reconsider the moment
- of writing this log record, how to use it in Recovery, and force the log.
- For now this record is only informative.
- */
- if (my_delete_with_symlink(from, MYF(MY_WME | MY_SYNC_DIR)))
+ if (my_delete_with_symlink(from, MYF(MY_WME | sync_dir)))
 DBUG_RETURN(my_errno);
 fn_format(from,name,"",MARIA_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
 #ifdef USE_RAID
 if (raid_type)
- DBUG_RETURN(my_raid_delete(from, raid_chunks, MYF(MY_WME | MY_SYNC_DIR)) ?
+ DBUG_RETURN(my_raid_delete(from, raid_chunks, MYF(MY_WME | sync_dir)) ?
 my_errno : 0);
 #endif
- DBUG_RETURN(my_delete_with_symlink(from, MYF(MY_WME | MY_SYNC_DIR)) ?
+ DBUG_RETURN(my_delete_with_symlink(from, MYF(MY_WME | sync_dir)) ?
 my_errno : 0);
 }
diff --git a/storage/maria/ma_extra.c b/storage/maria/ma_extra.c
index d6a0d2f4441..61eba165412 100644
--- a/storage/maria/ma_extra.c
+++ b/storage/maria/ma_extra.c
@@ -21,21 +21,20 @@
 static void maria_extra_keyflag(MARIA_HA *info,
 enum ha_extra_function function);
+/**
+ @brief Set options and buffers to optimize table handling
-/*
- Set options and buffers to optimize table handling
+ @param name table's name
+ @param info open table
+ @param function operation
+ @param extra_arg Pointer to extra argument (normally pointer to
+ ulong); used when function is one of:
+ HA_EXTRA_WRITE_CACHE
+ HA_EXTRA_CACHE
- SYNOPSIS
- maria_extra()
- info open table
- function operation
- extra_arg Pointer to extra argument (normally pointer to ulong)
- Used when function is one of:
- HA_EXTRA_WRITE_CACHE
- HA_EXTRA_CACHE
- RETURN VALUES
- 0 ok
- # error
+ @return Operation status
+ @retval 0 ok
+ @retval !=0 error
 */
 int maria_extra(MARIA_HA *info, enum ha_extra_function function,
@@ -265,14 +264,24 @@ int maria_extra(MARIA_HA *info, enum ha_extra_function function,
 pthread_mutex_unlock(&THR_LOCK_maria);
 break;
 case HA_EXTRA_PREPARE_FOR_DELETE:
+ /* QQ: suggest to rename it to "PREPARE_FOR_DROP" */
 pthread_mutex_lock(&THR_LOCK_maria);
 share->last_version= 0L; /* Impossible version */
 #ifdef __WIN__
 /* Close the isam and data files as Win32 can't drop an open table */
 pthread_mutex_lock(&share->intern_lock);
+ /*
+ If this is Windows we remove blocks from pagecache. If not Windows we
+ don't do it, so these pages stay in the pagecache? So they may later be
+ flushed to a wrong file?
+ Or is it that this flush_pagecache_blocks() never finds any blocks? Then
+ why do we do it on Windows?
+ Don't we wait for all instances to be closed before dropping the table?
+ Do we ever do something useful here?
+ BUG?
+ */
 if (flush_pagecache_blocks(share->pagecache, &share->kfile,
- (function == HA_EXTRA_FORCE_REOPEN ?
- FLUSH_RELEASE : FLUSH_IGNORE_CHANGED)))
+ FLUSH_IGNORE_CHANGED))
 {
 error=my_errno;
 share->changed=1;
@@ -292,9 +301,11 @@ int maria_extra(MARIA_HA *info, enum ha_extra_function function,
 info->lock_type = F_UNLCK;
 }
 if (share->kfile.file >= 0)
+ {
 _ma_decrement_open_count(info);
- if (share->kfile.file >= 0 && my_close(share->kfile,MYF(0)))
- error=my_errno;
+ if (my_close(share->kfile,MYF(0)))
+ error=my_errno;
+ }
 {
 LIST *list_element ;
 for (list_element=maria_open_list ;
@@ -304,6 +315,9 @@ int maria_extra(MARIA_HA *info, enum ha_extra_function function,
 MARIA_HA *tmpinfo=(MARIA_HA*) list_element->data;
 if (tmpinfo->s == info->s)
 {
+ /**
+ @todo RECOVERY BUG: flush of bitmap and sync of dfile are missing
+ */
 if (tmpinfo->dfile.file >= 0 &&
 my_close(tmpinfo->dfile.file, MYF(0)))
 error = my_errno;
diff --git a/storage/maria/ma_init.c b/storage/maria/ma_init.c
index ac4826a721d..8042c6d9873 100644
--- a/storage/maria/ma_init.c
+++ b/storage/maria/ma_init.c
@@ -53,7 +53,7 @@ void maria_end(void)
 {
 if (maria_inited)
 {
- maria_inited= FALSE;
+ maria_inited= maria_multi_threaded= FALSE;
 ft_free_stopwords();
 trnman_destroy();
 translog_destroy();
diff --git a/storage/maria/ma_loghandler.c b/storage/maria/ma_loghandler.c
index 474f50e1e2c..9ed1d4b9d93 100644
--- a/storage/maria/ma_loghandler.c
+++ b/storage/maria/ma_loghandler.c
@@ -17,6 +17,14 @@
 #include "ma_blockrec.h"
 #include "trnman.h"
+/**
+ @file
+ @brief Module which writes and reads to a transaction log
+
+ @todo LOG: in functions where the log's lock is required, a
+ translog_assert_owner() could be added.
+*/
+
 /* number of opened log files in the pagecache (should be at least 2) */
 #define OPENED_FILES_NUM 3
@@ -166,7 +174,7 @@ static struct st_translog_descriptor log_descriptor;
 /* Marker for end of log */
 static byte end_of_log= 0;
-static my_bool translog_inited;
+my_bool translog_inited= 0;
 /* record classes */
 enum record_class
@@ -218,7 +226,7 @@ struct st_log_record_type_descriptor
 uint16 read_header_len;
 /* HOOK for writing the record called before lock */
 prewrite_rec_hook prewrite_hook;
- /* HOOK for writing the record called when LSN is known */
+ /* HOOK for writing the record called when LSN is known, inside lock */
 inwrite_rec_hook inwrite_hook;
 /* HOOK for reading headers */
 read_rec_hook read_hook;
@@ -230,6 +238,13 @@ struct st_log_record_type_descriptor
 };
+#include <my_atomic.h>
+/* an array that maps id of a MARIA_SHARE to this MARIA_SHARE */
+static MARIA_SHARE **id_to_share= NULL;
+#define SHARE_ID_MAX 65535 /* array's size */
+/* lock for id_to_share */
+static my_atomic_rwlock_t LOCK_id_to_share;
+
 static my_bool write_hook_for_redo(enum translog_record_type type,
 TRN *trn, LSN *lsn,
 struct st_translog_parts *parts);
@@ -291,7 +306,9 @@ static LOG_DESC INIT_LOGREC_REDO_INSERT_ROW_HEAD=
 write_hook_for_redo, NULL, 0};
 static LOG_DESC INIT_LOGREC_REDO_INSERT_ROW_TAIL=
-{LOGRECTYPE_VARIABLE_LENGTH, 0, 8, NULL, NULL, NULL, 0};
+{LOGRECTYPE_VARIABLE_LENGTH, 0,
+ FILEID_STORE_SIZE + PAGE_STORE_SIZE + DIRPOS_STORE_SIZE, NULL,
+ write_hook_for_redo, NULL, 0};
 static LOG_DESC INIT_LOGREC_REDO_INSERT_ROW_BLOB=
 {LOGRECTYPE_VARIABLE_LENGTH, 0, 8, NULL, write_hook_for_redo, NULL, 0};
@@ -376,15 +393,9 @@ static LOG_DESC INIT_LOGREC_COMMIT=
 static LOG_DESC INIT_LOGREC_COMMIT_WITH_UNDO_PURGE=
 {LOGRECTYPE_PSEUDOFIXEDLENGTH, 5, 5, NULL, NULL, NULL, 1};
-static LOG_DESC INIT_LOGREC_CHECKPOINT_PAGE=
-{LOGRECTYPE_VARIABLE_LENGTH, 0, 6, NULL, NULL, NULL, 0};
-
-static LOG_DESC INIT_LOGREC_CHECKPOINT_TRAN=
+static LOG_DESC INIT_LOGREC_CHECKPOINT=
{LOGRECTYPE_VARIABLE_LENGTH, 0, 0, NULL, NULL, NULL, 0}; -static LOG_DESC INIT_LOGREC_CHECKPOINT_TABL= -{LOGRECTYPE_VARIABLE_LENGTH, 0, 8, NULL, NULL, NULL, 0}; - static LOG_DESC INIT_LOGREC_REDO_CREATE_TABLE= {LOGRECTYPE_VARIABLE_LENGTH, 0, 0, NULL, NULL, NULL, 0}; @@ -394,8 +405,13 @@ static LOG_DESC INIT_LOGREC_REDO_RENAME_TABLE= static LOG_DESC INIT_LOGREC_REDO_DROP_TABLE= {LOGRECTYPE_VARIABLE_LENGTH, 0, 0, NULL, NULL, NULL, 0}; -static LOG_DESC INIT_LOGREC_REDO_TRUNCATE_TABLE= -{LOGRECTYPE_VARIABLE_LENGTH, 0, 0, NULL, NULL, NULL, 0}; +static LOG_DESC INIT_LOGREC_REDO_DELETE_ALL= +{LOGRECTYPE_FIXEDLENGTH, FILEID_STORE_SIZE, FILEID_STORE_SIZE, + NULL, NULL, NULL, 0}; + +static LOG_DESC INIT_LOGREC_REDO_REPAIR_TABLE= +{LOGRECTYPE_FIXEDLENGTH, FILEID_STORE_SIZE + 4, FILEID_STORE_SIZE + 4, + NULL, NULL, NULL, 0}; static LOG_DESC INIT_LOGREC_FILE_ID= {LOGRECTYPE_VARIABLE_LENGTH, 0, 4, NULL, NULL, NULL, 0}; @@ -403,6 +419,7 @@ static LOG_DESC INIT_LOGREC_FILE_ID= static LOG_DESC INIT_LOGREC_LONG_TRANSACTION_ID= {LOGRECTYPE_FIXEDLENGTH, 6, 6, NULL, NULL, NULL, 0}; +const myf log_write_flags= MY_WME | MY_NABP | MY_WAIT_IF_FULL; static void loghandler_init() { @@ -454,20 +471,18 @@ static void loghandler_init() INIT_LOGREC_COMMIT; log_record_type_descriptor[LOGREC_COMMIT_WITH_UNDO_PURGE]= INIT_LOGREC_COMMIT_WITH_UNDO_PURGE; - log_record_type_descriptor[LOGREC_CHECKPOINT_PAGE]= - INIT_LOGREC_CHECKPOINT_PAGE; - log_record_type_descriptor[LOGREC_CHECKPOINT_TRAN]= - INIT_LOGREC_CHECKPOINT_TRAN; - log_record_type_descriptor[LOGREC_CHECKPOINT_TABL]= - INIT_LOGREC_CHECKPOINT_TABL; + log_record_type_descriptor[LOGREC_CHECKPOINT]= + INIT_LOGREC_CHECKPOINT; log_record_type_descriptor[LOGREC_REDO_CREATE_TABLE]= INIT_LOGREC_REDO_CREATE_TABLE; log_record_type_descriptor[LOGREC_REDO_RENAME_TABLE]= INIT_LOGREC_REDO_RENAME_TABLE; log_record_type_descriptor[LOGREC_REDO_DROP_TABLE]= INIT_LOGREC_REDO_DROP_TABLE; - log_record_type_descriptor[LOGREC_REDO_TRUNCATE_TABLE]= - 
INIT_LOGREC_REDO_TRUNCATE_TABLE; + log_record_type_descriptor[LOGREC_REDO_DELETE_ALL]= + INIT_LOGREC_REDO_DELETE_ALL; + log_record_type_descriptor[LOGREC_REDO_REPAIR_TABLE]= + INIT_LOGREC_REDO_REPAIR_TABLE; log_record_type_descriptor[LOGREC_FILE_ID]= INIT_LOGREC_FILE_ID; log_record_type_descriptor[LOGREC_LONG_TRANSACTION_ID]= @@ -554,6 +569,7 @@ static File open_logfile_by_number_no_cache(uint32 file_no) DBUG_ENTER("open_logfile_by_number_no_cache"); /* TODO: add O_DIRECT to open flags (when buffer is aligned) */ + /* TODO: use my_create() */ if ((file= my_open(translog_filename_by_fileno(file_no, path), O_CREAT | O_BINARY | O_RDWR, MYF(MY_WME))) < 0) @@ -615,7 +631,7 @@ static my_bool translog_write_file_header() bzero(page, sizeof(page_buff) - (page- page_buff)); DBUG_RETURN(my_pwrite(log_descriptor.log_file_num[0], page_buff, - sizeof(page_buff), 0, MYF(MY_WME | MY_NABP)) != 0); + sizeof(page_buff), 0, log_write_flags) != 0); } @@ -1222,7 +1238,7 @@ static my_bool translog_buffer_next(TRANSLOG_ADDRESS *horizon, /* - Set max LSN send to file + Set max LSN sent to file SYNOPSIS translog_set_sent_to_file() @@ -1512,7 +1528,7 @@ static my_bool translog_buffer_flush(struct st_translog_buffer *buffer) } if (my_pwrite(buffer->file, (char*) buffer->buffer, buffer->size, LSN_OFFSET(buffer->offset), - MYF(MY_WME | MY_NABP))) + log_write_flags)) { UNRECOVERABLE_ERROR(("Can't write buffer (%lu,0x%lx) size %lu " "to the disk (%d)", @@ -2230,7 +2246,16 @@ my_bool translog_init(const char *directory, */ log_descriptor.flushed--; /* offset decreased */ log_descriptor.sent_to_file--; /* offset decreased */ - + /* + Log records will refer to a MARIA_SHARE by a unique 2-byte id; set up + structures for generating 2-byte ids: + */ + my_atomic_rwlock_init(&LOCK_id_to_share); + id_to_share= (MARIA_SHARE **) my_malloc(SHARE_ID_MAX*sizeof(MARIA_SHARE*), + MYF(MY_WME|MY_ZEROFILL)); + if (unlikely(!id_to_share)) + DBUG_RETURN(1); + id_to_share--; /* min id is 1 */ translog_inited= 1; 
DBUG_RETURN(0); } @@ -2303,6 +2328,8 @@ void translog_destroy() } pthread_mutex_destroy(&log_descriptor.sent_to_file_lock); my_close(log_descriptor.directory_fd, MYF(MY_WME)); + my_atomic_rwlock_destroy(&LOCK_id_to_share); + my_free((gptr)(id_to_share + 1), MYF(MY_ALLOW_ZERO_PTR)); translog_inited= 0; } DBUG_VOID_RETURN; @@ -2362,6 +2389,14 @@ static inline my_bool translog_unlock() } +#define translog_buffer_lock_assert_owner(B) \ + safe_mutex_assert_owner(&B->mutex); +void translog_lock_assert_owner() +{ + translog_buffer_lock_assert_owner(log_descriptor.bc.buffer); +} + + /* Start new page @@ -4154,26 +4189,30 @@ err: } -/* - Write the log record - - SYNOPSIS - translog_write_record() - lsn LSN of the record will be written here - type the log record type - trn Transaction structure pointer for hooks by - record log type, for short_id - share MARIA_SHARE of table or NULL - rec_len record length or 0 (count it) - part_no number of parts or 0 (count it) - parts_data zero ended (in case of number of parts is 0) - array of LEX_STRINGs (parts), first - TRANSLOG_INTERNAL_PARTS positions in the log - should be unused (need for loghandler) - - RETURN - 0 OK - 1 Error +/** + @brief Writes the log record + + If share has no 2-byte-id yet, gives an id to the share and logs + LOGREC_FILE_ID. If transaction has not logged LOGREC_LONG_TRANSACTION_ID + yet, logs it. 
+ + @param lsn LSN of the record will be written here + @param type the log record type + @param trn Transaction structure pointer for hooks by + record log type, for short_id + @param share MARIA_SHARE of table or NULL + @param rec_len record length or 0 (count it) + @param part_no number of parts or 0 (count it) + @param parts_data zero ended (in case of number of parts is 0) + array of LEX_STRINGs (parts), first + TRANSLOG_INTERNAL_PARTS positions in the log + should be unused (need for loghandler) + @param store_share_id if share!=NULL then share's id will automatically + be stored in the two first bytes pointed (so + pointer is assumed to be !=NULL) + @return Operation status + @retval 0 OK + @retval 1 Error */ my_bool translog_write_record(LSN *lsn, @@ -4181,7 +4220,8 @@ my_bool translog_write_record(LSN *lsn, TRN *trn, struct st_maria_share *share, translog_size_t rec_len, uint part_no, - LEX_STRING *parts_data) + LEX_STRING *parts_data, + uchar *store_share_id) { struct st_translog_parts parts; LEX_STRING *part; @@ -4191,10 +4231,41 @@ my_bool translog_write_record(LSN *lsn, DBUG_PRINT("enter", ("type: %u ShortTrID: %u", (uint) type, (uint)short_trid)); - if (share && !share->base.transactional) + if (share) { - DBUG_PRINT("info", ("It is not transactional table")); - DBUG_RETURN(0); + if (!share->base.transactional) + { + DBUG_PRINT("info", ("It is not transactional table")); + DBUG_RETURN(0); + } + if (unlikely(share->id == 0)) + { + /* + First log write for this MARIA_SHARE; give it a short id. + When the lock manager is enabled and needs a short id, it should be + assigned in the lock manager (because row locks will be taken before + log records are written; for example SELECT FOR UPDATE takes locks but + writes no log record. 
+ */ + if (unlikely(translog_assign_id_to_share(share, trn))) + DBUG_RETURN(1); + } + fileid_store(store_share_id, share->id); + } + if (unlikely(!(trn->first_undo_lsn & TRANSACTION_LOGGED_LONG_ID))) + { + LSN lsn; + LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 1]; + uchar log_data[6]; + int6store(log_data, trn->trid); + log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char*) log_data; + log_array[TRANSLOG_INTERNAL_PARTS + 0].length= sizeof(log_data); + trn->first_undo_lsn|= TRANSACTION_LOGGED_LONG_ID; /* no recursion */ + if (unlikely(translog_write_record(&lsn, LOGREC_LONG_TRANSACTION_ID, + trn, NULL, sizeof(log_data), + sizeof(log_array)/sizeof(log_array[0]), + log_array, NULL))) + DBUG_RETURN(1); } parts.parts= parts_data; @@ -4375,20 +4446,19 @@ void translog_free_record_header(TRANSLOG_HEADER_BUFFER *buff) } -/* - Set current horizon in the scanner data structure +/** + @brief Returns the current horizon at the end of the current log - SYNOPSIS - translog_scanner_set_horizon() - scanner Information about current chunk during scanning + @return Horizon */ -static void translog_scanner_set_horizon(struct st_translog_scanner_data - *scanner) +TRANSLOG_ADDRESS translog_get_horizon() { + TRANSLOG_ADDRESS res; translog_lock(); - scanner->horizon= log_descriptor.horizon; + res= log_descriptor.horizon; translog_unlock(); + return res; } @@ -4446,7 +4516,7 @@ my_bool translog_init_scanner(LSN lsn, scanner->fixed_horizon= fixed_horizon; - translog_scanner_set_horizon(scanner); + scanner->horizon= translog_get_horizon(); DBUG_PRINT("info", ("horizon: (0x%lu,0x%lx)", (ulong) LSN_FILE_NO(scanner->horizon), (ulong) LSN_OFFSET(scanner->horizon))); @@ -4499,7 +4569,7 @@ static my_bool translog_scanner_eol(TRANSLOG_SCANNER_DATA *scanner) DBUG_PRINT("info", ("Horizon is fixed and reached")); DBUG_RETURN(1); } - translog_scanner_set_horizon(scanner); + scanner->horizon= translog_get_horizon(); DBUG_PRINT("info", ("Horizon is re-read, EOL: %d", scanner->horizon <= 
(scanner->page_addr + @@ -5368,17 +5438,31 @@ static void translog_force_current_buffer_to_finish() } -/* - Flush the log up to given LSN (included) - - SYNOPSIS - translog_flush() - lsn log record serial number up to which (inclusive) - the log have to be flushed - - RETURN - 0 OK - 1 Error +/** + @brief Flush the log up to given LSN (included) + + @param lsn log record serial number up to which (inclusive) + the log has to be flushed + + @return Operation status + @retval 0 OK + @retval 1 Error + + @todo LOG: when a log write fails, we should not write to this log anymore + (if we add more log records to this log they will be unreadable: we will hit + the broken log record): all translog_flush() should be made to fail (because + translog_flush() is when a a transaction wants something durable and we + cannot make anything durable as log is corrupted). For that, a "my_bool + st_translog_descriptor::write_error" could be set to 1 when a + translog_write_record() or translog_flush() fails, and translog_flush() + would test this var (and translog_write_record() could also test this var if + it wants, though it's not absolutely needed). + Then, either shut Maria down immediately, or switch to a new log (but if we + get write error after write error, that would create too many logs). + A popular open-source transactional engine intentionally crashes as soon as + a log flush fails (we however don't want to crash the entire mysqld, but + stopping all engine's operations immediately would make sense). + Same applies to translog_write_record(). 
*/ my_bool translog_flush(LSN lsn) @@ -5469,24 +5553,55 @@ my_bool translog_flush(LSN lsn) /* We sync file when we are closing it => do nothing if file closed */ } log_descriptor.flushed= sent_to_file; + /** @todo LOG decide if syncing of directory is needed */ rc|= my_sync(log_descriptor.directory_fd, MYF(MY_WME)); translog_unlock(); DBUG_RETURN(rc); } +/** + @brief Sets transaction's rec_lsn if needed + + A transaction sometimes writes a REDO even before the page is in the + pagecache (example: brand new head or tail pages; full pages). So, if + Checkpoint happens just after the REDO write, it needs to know that the + REDO phase must start before this REDO. Scanning the pagecache cannot + tell that as the page is not in the cache. So, transaction sets its rec_lsn + to the REDO's LSN or somewhere before, and Checkpoint reads the + transaction's rec_lsn. + + @todo move it to a separate file + + @return Operation status, always 0 (success) +*/ + static my_bool write_hook_for_redo(enum translog_record_type type __attribute__ ((unused)), TRN *trn, LSN *lsn, struct st_translog_parts *parts __attribute__ ((unused))) { + /* + If the hook stays so simple, it would be faster to pass + !trn->rec_lsn ? trn->rec_lsn : some_dummy_lsn + to translog_write_record(), like Monty did in his original code, and not + have a hook. For now we keep it like this. 
+ */ if (trn->rec_lsn == 0) trn->rec_lsn= *lsn; return 0; } +/** + @brief Sets transaction's undo_lsn, first_undo_lsn if needed + + @todo move it to a separate file + + @return Operation status, always 0 (success) +*/ + static my_bool write_hook_for_undo(enum translog_record_type type __attribute__ ((unused)), TRN *trn, LSN *lsn, @@ -5494,11 +5609,109 @@ static my_bool write_hook_for_undo(enum translog_record_type type __attribute__ ((unused))) { trn->undo_lsn= *lsn; - if (trn->first_undo_lsn == 0) - trn->first_undo_lsn= *lsn; + if (unlikely(LSN_WITH_FLAGS_TO_LSN(trn->first_undo_lsn) == 0)) + trn->first_undo_lsn= + trn->undo_lsn | LSN_WITH_FLAGS_TO_FLAGS(trn->first_undo_lsn); return 0; /* when we implement purging, we will specialize this hook: UNDO_PURGE records will additionally set trn->undo_purge_lsn */ } + + +/** + @brief Gives a 2-byte-id to MARIA_SHARE and logs this fact + + If a MARIA_SHARE does not yet have a 2-byte-id (unique over all currently + open MARIA_SHAREs), give it one and record this assignment in the log + (LOGREC_FILE_ID log record). + + @param share table + @param trn calling transaction + + @return Operation status + @retval 0 OK + @retval 1 Error + + @note Can be called even if share already has an id (then will do nothing) +*/ + +int translog_assign_id_to_share(MARIA_SHARE *share, TRN *trn) +{ + /* + If you give an id to a non-BLOCK_RECORD table, you also need to release + this id somewhere. Then you can change the assertion. + */ + DBUG_ASSERT(share->data_file_type == BLOCK_RECORD); + /* re-check under mutex to avoid having 2 ids for the same share */ + pthread_mutex_lock(&share->intern_lock); + if (likely(share->id == 0)) + { + /* Inspired by set_short_trid() of trnman.c */ + int i= share->kfile.file % SHARE_ID_MAX + 1; + my_atomic_rwlock_wrlock(&LOCK_id_to_share); + /** + @todo RECOVERY BUG: if all slots are used, and we're using rwlocks + above, we will never exit the loop. To be discussed with Serg. 
+ */ + for ( ; ; i= i % SHARE_ID_MAX + 1) /* the range is [1..SHARE_ID_MAX] */ + { + void *tmp= NULL; + if (id_to_share[i] == NULL && + my_atomic_casptr((void **)&id_to_share[i], &tmp, share)) + break; + } + my_atomic_rwlock_wrunlock(&LOCK_id_to_share); + share->id= (uint16)i; + DBUG_PRINT("info", ("id_to_share: 0x%lx -> %u", (ulong)share, i)); + LSN lsn; + LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 2]; + uchar log_data[FILEID_STORE_SIZE]; + log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char*) log_data; + log_array[TRANSLOG_INTERNAL_PARTS + 0].length= sizeof(log_data); + /* + open_file_name is an unresolved name (symlinks are not resolved, datadir + is not realpath-ed, etc) which is good: the log can be moved to another + directory and continue working. + */ + log_array[TRANSLOG_INTERNAL_PARTS + 1].str= share->open_file_name; + /** + @todo if we had the name's length in MARIA_SHARE we could avoid this + strlen() + */ + log_array[TRANSLOG_INTERNAL_PARTS + 1].length= + strlen(share->open_file_name); + if (unlikely(translog_write_record(&lsn, LOGREC_FILE_ID, trn, share, + sizeof(log_data) + + log_array[TRANSLOG_INTERNAL_PARTS + + 1].length, + sizeof(log_array)/sizeof(log_array[0]), + log_array, log_data))) + return 1; + } + pthread_mutex_unlock(&share->intern_lock); + return 0; +} + + +/** + @brief Recycles a MARIA_SHARE's short id. + + @param share table + + @note Must be called only if share has an id (i.e. id != 0) +*/ + +void translog_deassign_id_from_share(MARIA_SHARE *share) +{ + DBUG_PRINT("info", ("id_to_share: 0x%lx id %u -> 0", + (ulong)share, share->id)); + /* + We don't need any mutex as we are called only when closing the last + instance of the table: no writes can be happening. 
+ */ + my_atomic_rwlock_rdlock(&LOCK_id_to_share); + my_atomic_storeptr((void **)&id_to_share[share->id], 0); + my_atomic_rwlock_rdunlock(&LOCK_id_to_share); +} diff --git a/storage/maria/ma_loghandler.h b/storage/maria/ma_loghandler.h index e9872e7bfb7..0a160a9bc53 100644 --- a/storage/maria/ma_loghandler.h +++ b/storage/maria/ma_loghandler.h @@ -86,13 +86,12 @@ enum translog_record_type LOGREC_PREPARE_WITH_UNDO_PURGE, LOGREC_COMMIT, LOGREC_COMMIT_WITH_UNDO_PURGE, - LOGREC_CHECKPOINT_PAGE, - LOGREC_CHECKPOINT_TRAN, - LOGREC_CHECKPOINT_TABL, + LOGREC_CHECKPOINT, LOGREC_REDO_CREATE_TABLE, LOGREC_REDO_RENAME_TABLE, LOGREC_REDO_DROP_TABLE, - LOGREC_REDO_TRUNCATE_TABLE, + LOGREC_REDO_DELETE_ALL, + LOGREC_REDO_REPAIR_TABLE, LOGREC_FILE_ID, LOGREC_LONG_TRANSACTION_ID, LOGREC_RESERVED_FUTURE_EXTENSION= 63 @@ -181,9 +180,7 @@ struct st_translog_reader_data }; struct st_transaction; -#ifdef __cplusplus -extern "C" { -#endif +C_MODE_START /* Records types for unittests */ #define LOGREC_FIXED_RECORD_0LSN_EXAMPLE 1 @@ -199,13 +196,12 @@ extern my_bool translog_init(const char *directory, uint32 log_file_max_size, uint32 server_version, uint32 server_id, PAGECACHE *pagecache, uint flags); -extern my_bool translog_write_record(LSN *lsn, - enum translog_record_type type, - struct st_transaction *trn, - struct st_maria_share *share, - translog_size_t rec_len, - uint part_no, - LEX_STRING *parts_data); +extern my_bool +translog_write_record(LSN *lsn, enum translog_record_type type, + struct st_transaction *trn, + struct st_maria_share *share, + translog_size_t rec_len, uint part_no, + LEX_STRING *parts_data, uchar *store_share_id); extern void translog_destroy(); @@ -232,7 +228,10 @@ extern translog_size_t translog_read_next_record_header(TRANSLOG_SCANNER_DATA *scanner, TRANSLOG_HEADER_BUFFER *buff); -#ifdef __cplusplus -} -#endif - +extern void translog_lock_assert_owner(); +extern TRANSLOG_ADDRESS translog_get_horizon(); +extern int translog_assign_id_to_share(struct 
st_maria_share *share, + struct st_transaction *trn); +extern void translog_deassign_id_from_share(struct st_maria_share *share); +extern my_bool translog_inited; +C_MODE_END diff --git a/storage/maria/ma_loghandler_lsn.h b/storage/maria/ma_loghandler_lsn.h index 1789d3ce61b..c641337e8ba 100644 --- a/storage/maria/ma_loghandler_lsn.h +++ b/storage/maria/ma_loghandler_lsn.h @@ -35,7 +35,7 @@ typedef TRANSLOG_ADDRESS LSN; /* checks LSN */ #define LSN_VALID(L) DBUG_ASSERT((L) >= 0 && (L) < (uint64)0xFFFFFFFFFFFFFFLL) -/* size of stored LSN on a disk */ +/* size of stored LSN on a disk, don't change it! */ #define LSN_STORE_SIZE 7 /* Puts LSN into buffer (dst) */ @@ -53,4 +53,12 @@ typedef TRANSLOG_ADDRESS LSN; #define LSN_REPLACE_OFFSET(L, S) (LSN_FINE_NO_PART(L) | (S)) +/* + an 8-byte type whose most significant byte is used for "flags"; 7 + other bytes are a LSN. +*/ +typedef LSN LSN_WITH_FLAGS; +#define LSN_WITH_FLAGS_TO_LSN(x) (x & ULL(0x00FFFFFFFFFFFFFF)) +#define LSN_WITH_FLAGS_TO_FLAGS(x) (x & ULL(0xFF00000000000000)) + #endif diff --git a/storage/maria/ma_open.c b/storage/maria/ma_open.c index b8ce6d123e7..4e72adf3b7e 100644 --- a/storage/maria/ma_open.c +++ b/storage/maria/ma_open.c @@ -919,12 +919,23 @@ static void setup_key_functions(register MARIA_KEYDEF *keyinfo) } -/* - Function to save and store the header in the index file (.MYI) +/** + @brief Function to save and store the header in the index file (.MYI) + + @param file descriptor of the index file to write + @param state state information to write to the file + @param pWrite bitmap (determines the amount of information to + write, and if my_write() or my_pwrite() should be + used) + + @return Operation status + @retval 0 OK + @retval 1 Error */ uint _ma_state_info_write(File file, MARIA_STATE_INFO *state, uint pWrite) { + /** @todo RECOVERY write it only at checkpoint time */ uchar buff[MARIA_STATE_INFO_SIZE + MARIA_STATE_EXTRA_SIZE]; uchar *ptr=buff; uint i, keys= (uint) state->header.keys; @@ 
-935,6 +946,11 @@ uint _ma_state_info_write(File file, MARIA_STATE_INFO *state, uint pWrite) /* open_count must be first because of _ma_mark_file_changed ! */ mi_int2store(ptr,state->open_count); ptr+= 2; + /* + if you change the offset of this LSN inside the file, fix + ma_create + ma_rename + ma_delete_all + backward-compatibility. + */ + lsn_store(ptr, state->create_rename_lsn); ptr+= LSN_STORE_SIZE; *ptr++= (uchar)state->changed; *ptr++= state->sortkey; mi_rowstore(ptr,state->state.records); ptr+= 8; @@ -959,6 +975,7 @@ uint _ma_state_info_write(File file, MARIA_STATE_INFO *state, uint pWrite) { mi_sizestore(ptr,state->key_root[i]); ptr+= 8; } + /** @todo RECOVERY key_del is a problem for recovery */ mi_sizestore(ptr,state->key_del); ptr+= 8; if (pWrite & 2) /* From maria_chk */ { @@ -994,6 +1011,7 @@ byte *_ma_state_info_read(byte *ptr, MARIA_STATE_INFO *state) key_parts= mi_uint2korr(state->header.key_parts); state->open_count = mi_uint2korr(ptr); ptr+= 2; + state->create_rename_lsn= lsn_korr(ptr); ptr+= LSN_STORE_SIZE; state->changed= (my_bool) *ptr++; state->sortkey= (uint) *ptr++; state->state.records= mi_rowkorr(ptr); ptr+= 8; diff --git a/storage/maria/ma_pagecache.c b/storage/maria/ma_pagecache.c index 18c36fcfbd1..ae42f702b0a 100755 --- a/storage/maria/ma_pagecache.c +++ b/storage/maria/ma_pagecache.c @@ -114,6 +114,11 @@ /* TODO: put it to my_static.c */ my_bool my_disable_flush_pagecache_blocks= 0; +/** + when flushing pages of a file, it can happen that we take some dirty blocks + out of changed_blocks[]; Checkpoint must not run at this moment. 
+*/ +uint changed_blocks_is_incomplete= 0; #define STRUCT_PTR(TYPE, MEMBER, a) \ (TYPE *) ((char *) (a) - offsetof(TYPE, MEMBER)) @@ -308,7 +313,7 @@ struct st_pagecache_block_link enum pagecache_page_type type; /* type of the block */ uint hits_left; /* number of hits left until promotion */ ulonglong last_hit_time; /* timestamp of the last hit */ - LSN rec_lsn; /* LSN when first became dirty */ + LSN rec_lsn; /**< LSN when first became dirty */ KEYCACHE_CONDVAR *condvar; /* condition variable for 'no readers' event */ }; @@ -2523,7 +2528,8 @@ void pagecache_unlock(PAGECACHE *pagecache, { DBUG_ASSERT(lock == PAGECACHE_LOCK_WRITE_UNLOCK); DBUG_ASSERT(pin == PAGECACHE_UNPIN); - set_if_bigger(block->rec_lsn, first_REDO_LSN_for_page); + if (block->rec_lsn == 0) + block->rec_lsn= first_REDO_LSN_for_page; } if (lsn != 0) { @@ -2685,7 +2691,8 @@ void pagecache_unlock_by_link(PAGECACHE *pagecache, DBUG_ASSERT(lock == PAGECACHE_LOCK_WRITE_UNLOCK || lock == PAGECACHE_LOCK_READ_UNLOCK); DBUG_ASSERT(pin == PAGECACHE_UNPIN); - set_if_bigger(block->rec_lsn, first_REDO_LSN_for_page); + if (block->rec_lsn == 0) + block->rec_lsn= first_REDO_LSN_for_page; } if (lsn != 0) { @@ -3279,8 +3286,8 @@ restart: if (need_lock_change) { /* - RECOVERY TODO BUG We are doing an unlock here, so need to give the - page its rec_lsn + We don't set rec_lsn of the block; this is ok as for the + Maria-block-record's pages, we always keep pages pinned here. 
*/ if (make_lock_and_pin(pagecache, block, write_lock_change_table[lock].unlock_lock, @@ -3500,22 +3507,21 @@ static int flush_cached_blocks(PAGECACHE *pagecache, } -/* - flush all key blocks for a file to disk, but don't do any mutex locks +/** + @brief flush all key blocks for a file to disk but don't do any mutex locks - flush_pagecache_blocks_int() - pagecache pointer to a key cache data structure - file handler for the file to flush to - flush_type type of the flush + @param pagecache pointer to a pagecache data structure + @param file handler for the file to flush to + @param flush_type type of the flush - NOTES - This function doesn't do any mutex locks because it needs to be called - both from flush_pagecache_blocks and flush_all_key_blocks (the later one - does the mutex lock in the resize_pagecache() function). + @note + This function doesn't do any mutex locks because it needs to be called + both from flush_pagecache_blocks and flush_all_key_blocks (the later one + does the mutex lock in the resize_pagecache() function). - RETURN - 0 ok - 1 error + @return Operation status + @retval 0 OK + @retval 1 Error */ static int flush_pagecache_blocks_int(PAGECACHE *pagecache, @@ -3547,6 +3553,7 @@ static int flush_pagecache_blocks_int(PAGECACHE *pagecache, #if defined(PAGECACHE_DEBUG) uint cnt= 0; #endif + uint8 changed_blocks_is_incomplete_incremented= 0; if (type != FLUSH_IGNORE_CHANGED) { @@ -3636,16 +3643,23 @@ restart: else { /* Link the block into a list of blocks 'in switch' */ - /* - RECOVERY TODO BUG this unlink_changed() is a serious problem for - Maria's Checkpoint: it removes a page from the list of dirty - pages, while it's still dirty. A solution is to abandon - first_in_switch, just wait for this page to be - flushed by somebody else, and loop. 
TODO: check all places - where we remove a page from the list of dirty pages - */ unlink_changed(block); link_changed(block, &first_in_switch); + /* + We have just removed a page from the list of dirty pages + ("changed_blocks") though it's still dirty (the flush by another + thread has not yet happened). Checkpoint will miss the page and so + must be blocked until that flush has happened. + */ + /** + @todo RECOVERY: check all places where we remove a page from the + list of dirty pages + */ + if (unlikely(!changed_blocks_is_incomplete_incremented)) + { + changed_blocks_is_incomplete_incremented= 1; + changed_blocks_is_incomplete++; + } } } } @@ -3683,6 +3697,8 @@ restart: KEYCACHE_DBUG_ASSERT(cnt <= pagecache->blocks_used); #endif } + changed_blocks_is_incomplete-= + changed_blocks_is_incomplete_incremented; /* The following happens very seldom */ if (! (type == FLUSH_KEEP || type == FLUSH_FORCE_WRITE)) { @@ -3789,51 +3805,56 @@ int reset_pagecache_counters(const char *name, PAGECACHE *pagecache) } -/* - Allocates a buffer and stores in it some information about all dirty pages - of type PAGECACHE_LSN_PAGE. - - SYNOPSIS - pagecache_collect_changed_blocks_with_lsn() - pagecache pointer to the page cache - str (OUT) pointer to a LEX_STRING where the allocated buffer, and - its size, will be put - max_lsn (OUT) pointer to a LSN where the maximum rec_lsn of all - relevant dirty pages will be put - - DESCRIPTION - Does the allocation because the caller cannot know the size itself. - Memory freeing is to be done by the caller (if the "str" member of the - LEX_STRING is not NULL). - Ignores all pages of another type than PAGECACHE_LSN_PAGE, because they - are not interesting for a checkpoint record. - The caller has the intention of doing checkpoints. - - RETURN - 0 on success - 1 on error +/** + @brief Allocates a buffer and stores in it some info about all dirty pages + + Does the allocation because the caller cannot know the size itself. 
+ Memory freeing is to be done by the caller (if the "str" member of the + LEX_STRING is not NULL). + Ignores all pages of another type than PAGECACHE_LSN_PAGE, because they + are not interesting for a checkpoint record. + The caller has the intention of doing checkpoints. + + @param pagecache pointer to the page cache + @param[out] str pointer to where the allocated buffer, and + its size, will be put + @param[out] min_rec_lsn pointer to where the minimum rec_lsn of all + relevant dirty pages will be put + @param[out] max_rec_lsn pointer to where the maximum rec_lsn of all + relevant dirty pages will be put + @return Operation status + @retval 0 OK + @retval 1 Error */ + my_bool pagecache_collect_changed_blocks_with_lsn(PAGECACHE *pagecache, LEX_STRING *str, - LSN *max_lsn) + LSN *min_rec_lsn, + LSN *max_rec_lsn) { my_bool error= 0; ulong stored_list_size= 0; uint file_hash; char *ptr; + LSN minimum_rec_lsn= ULONGLONG_MAX, maximum_rec_lsn= 0; DBUG_ENTER("pagecache_collect_changed_blocks_with_LSN"); - *max_lsn= 0; DBUG_ASSERT(NULL == str->str); /* We lock the entire cache but will be quick, just reading/writing a few MBs of memory at most. - When we enter here, we must be sure that no "first_in_switch" situation - is happening or will happen (either we have to get rid of - first_in_switch in the code or, first_in_switch has to increment a - "danger" counter for this function to know it has to wait). TODO. */ pagecache_pthread_mutex_lock(&pagecache->cache_lock); + while (changed_blocks_is_incomplete > 0) + { + /* + Some pages are more recent in memory than on disk (=dirty) and are not + in "changed_blocks" so we cannot know them. Wait. 
+ */ + pagecache_pthread_mutex_unlock(&pagecache->cache_lock); + sleep(1); + pagecache_pthread_mutex_lock(&pagecache->cache_lock); + } /* Count how many dirty pages are interesting */ for (file_hash= 0; file_hash < PAGECACHE_CHANGED_BLOCKS_HASH; file_hash++) @@ -3851,35 +3872,15 @@ my_bool pagecache_collect_changed_blocks_with_lsn(PAGECACHE *pagecache, DBUG_ASSERT(block->status & PCBLOCK_CHANGED); if (block->type != PAGECACHE_LSN_PAGE) continue; /* no need to store it */ - /* - In the current pagecache, rec_lsn is not set correctly: - 1) it is set on pagecache_unlock(), too late (a page is dirty - (PCBLOCK_CHANGED) since the first pagecache_write()). So in this - scenario: - thread1: thread2: - write_REDO - pagecache_write() checkpoint : reclsn not known - pagecache_unlock(sets rec_lsn) - commit - crash, - at recovery we will wrongly skip the REDO. It also affects the - low-water mark's computation. - 2) sometimes the unlocking can be an implicit action of - pagecache_write(), without any call to pagecache_unlock(), then - rec_lsn is not set. - 1) and 2) are critical problems. - TODO: fix this when Monty has explained how he writes BLOB pages. 
- */ - if (block->rec_lsn == 0) - { - DBUG_ASSERT(0); - goto err; - } stored_list_size++; } } - str->length= 8+(4+4+8)*stored_list_size; + str->length= 8 + /* number of dirty pages */ + (4 + /* file */ + 4 + /* pageno */ + LSN_STORE_SIZE /* rec_lsn */ + ) * stored_list_size; if (NULL == (str->str= my_malloc(str->length, MYF(MY_WME)))) goto err; ptr= str->str; @@ -3896,19 +3897,27 @@ my_bool pagecache_collect_changed_blocks_with_lsn(PAGECACHE *pagecache, { if (block->type != PAGECACHE_LSN_PAGE) continue; /* no need to store it in the checkpoint record */ - DBUG_ASSERT((4 == sizeof(block->hash_link->file.file))); - DBUG_ASSERT((4 == sizeof(block->hash_link->pageno))); + compile_time_assert((4 == sizeof(block->hash_link->file.file))); + compile_time_assert((4 == sizeof(block->hash_link->pageno))); int4store(ptr, block->hash_link->file.file); ptr+= 4; int4store(ptr, block->hash_link->pageno); ptr+= 4; - int8store(ptr, (ulonglong) block->rec_lsn); - ptr+= 8; - set_if_bigger(*max_lsn, block->rec_lsn); + lsn_store(ptr, block->rec_lsn); + ptr+= LSN_STORE_SIZE; + if (block->rec_lsn != 0) + { + if (cmp_translog_addr(block->rec_lsn, minimum_rec_lsn) < 0) + minimum_rec_lsn= block->rec_lsn; + if (cmp_translog_addr(block->rec_lsn, maximum_rec_lsn) > 0) + maximum_rec_lsn= block->rec_lsn; + } /* otherwise, some trn->rec_lsn should hold the info */ } } end: pagecache_pthread_mutex_unlock(&pagecache->cache_lock); + *min_rec_lsn= minimum_rec_lsn; + *max_rec_lsn= maximum_rec_lsn; DBUG_RETURN(error); err: diff --git a/storage/maria/ma_pagecache.h b/storage/maria/ma_pagecache.h index ef14cd48cef..478f71161eb 100644 --- a/storage/maria/ma_pagecache.h +++ b/storage/maria/ma_pagecache.h @@ -239,6 +239,7 @@ extern my_bool pagecache_delete_pages(PAGECACHE *pagecache, extern void end_pagecache(PAGECACHE *keycache, my_bool cleanup); extern my_bool pagecache_collect_changed_blocks_with_lsn(PAGECACHE *pagecache, LEX_STRING *str, + LSN *min_lsn, LSN *max_lsn); extern int 
reset_pagecache_counters(const char *name, PAGECACHE *pagecache); diff --git a/storage/maria/ma_panic.c b/storage/maria/ma_panic.c index b74403e6eb2..0394f630343 100644 --- a/storage/maria/ma_panic.c +++ b/storage/maria/ma_panic.c @@ -52,7 +52,12 @@ int maria_panic(enum ha_panic_function flag) info=(MARIA_HA*) list_element->data; switch (flag) { case HA_PANIC_CLOSE: - pthread_mutex_unlock(&THR_LOCK_maria); /* Not exactly right... */ + /* + If bad luck (if some tables would be used now, which normally does not + happen in MySQL), as we release the mutex, the list may change and so + we may crash. + */ + pthread_mutex_unlock(&THR_LOCK_maria); if (maria_close(info)) error=my_errno; pthread_mutex_lock(&THR_LOCK_maria); diff --git a/storage/maria/ma_range.c b/storage/maria/ma_range.c index f91a61259d7..b359868e8e4 100644 --- a/storage/maria/ma_range.c +++ b/storage/maria/ma_range.c @@ -29,25 +29,22 @@ static uint _ma_keynr(MARIA_HA *info, MARIA_KEYDEF *keyinfo, byte *page, byte *keypos, uint *ret_max_key); -/* - Estimate how many records there is in a given range +/** + @brief Estimate how many records there is in a given range - SYNOPSIS - maria_records_in_range() - info MARIA handler - inx Index to use - min_key Min key. Is = 0 if no min range - max_key Max key. Is = 0 if no max range + @param info MARIA handler + @param inx Index to use + @param min_key Min key. Is = 0 if no min range + @param max_key Max key. 
Is = 0 if no max range - NOTES - We should ONLY return 0 if there is no rows in range + @note + We should ONLY return 0 if there is no rows in range - RETURN - HA_POS_ERROR error (or we can't estimate number of rows) - number Estimated number of rows + @return Estimated number of rows or error + @retval HA_POS_ERROR error (or we can't estimate number of rows) + @retval number Estimated number of rows */ - ha_rows maria_records_in_range(MARIA_HA *info, int inx, key_range *min_key, key_range *max_key) { @@ -115,6 +112,13 @@ ha_rows maria_records_in_range(MARIA_HA *info, int inx, key_range *min_key, rw_unlock(&info->s->key_root_lock[inx]); fast_ma_writeinfo(info); + /** + @todo LOCK + If res==0 (no rows), if we need to guarantee repeatability of the search, + we will need to set a next-key lock in this statement. + Also SELECT COUNT(*)... + */ + DBUG_PRINT("info",("records: %ld",(ulong) (res))); DBUG_RETURN(res); } diff --git a/storage/maria/ma_rename.c b/storage/maria/ma_rename.c index a80bbcd398f..5224698c614 100644 --- a/storage/maria/ma_rename.c +++ b/storage/maria/ma_rename.c @@ -18,6 +18,18 @@ */ #include "ma_fulltext.h" +#include "trnman_public.h" + +/** + @brief renames a table + + @param old_name current name of table + @param new_name table should be renamed to this name + + @return Operation status + @retval 0 OK + @retval !=0 Error +*/ int maria_rename(const char *old_name, const char *new_name) { @@ -26,22 +38,73 @@ int maria_rename(const char *old_name, const char *new_name) #ifdef USE_RAID uint raid_type=0,raid_chunks=0; #endif + MARIA_HA *info; + MARIA_SHARE *share; + myf sync_dir; DBUG_ENTER("maria_rename"); #ifdef EXTRA_DEBUG _ma_check_table_is_closed(old_name,"rename old_table"); _ma_check_table_is_closed(new_name,"rename new table2"); #endif - /* LOCK TODO take X-lock on table here */ + /** @todo LOCK take X-lock on table */ + if (!(info= maria_open(old_name, O_RDWR, HA_OPEN_FOR_REPAIR))) + DBUG_RETURN(my_errno); + share= info->s; #ifdef USE_RAID + 
raid_type = share->base.raid_type; + raid_chunks = share->base.raid_chunks; +#endif + + sync_dir= (share->base.transactional && !share->temporary) ? + MY_SYNC_DIR : 0; + if (sync_dir) { - MARIA_HA *info; - if (!(info=maria_open(old_name, O_RDONLY, 0))) - DBUG_RETURN(my_errno); - raid_type = info->s->base.raid_type; - raid_chunks = info->s->base.raid_chunks; - maria_close(info); + uchar log_data[LSN_STORE_SIZE]; + LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 3]; + uint old_name_len= strlen(old_name), new_name_len= strlen(new_name); + int2store(log_data, old_name_len); + int2store(log_data + 2, new_name_len); + log_array[TRANSLOG_INTERNAL_PARTS + 0].str= log_data; + log_array[TRANSLOG_INTERNAL_PARTS + 0].length= 2 + 2; + log_array[TRANSLOG_INTERNAL_PARTS + 1].str= (char *)old_name; + log_array[TRANSLOG_INTERNAL_PARTS + 1].length= old_name_len; + log_array[TRANSLOG_INTERNAL_PARTS + 2].str= (char *)new_name; + log_array[TRANSLOG_INTERNAL_PARTS + 2].length= new_name_len; + /* + For this record to be of any use for Recovery, we need the upper + MySQL layer to be crash-safe, which it is not now (that would require + work using the ddl_log of sql/sql_table.cc); when it is, we should + reconsider the moment of writing this log record (before or after op, + under THR_LOCK_maria or not...), how to use it in Recovery, and force + the log. For now this record is just informative. + */ + if (unlikely(translog_write_record(&share->state.create_rename_lsn, + LOGREC_REDO_RENAME_TABLE, + &dummy_transaction_object, NULL, + 2 + 2 + old_name_len + new_name_len, + sizeof(log_array)/sizeof(log_array[0]), + log_array, NULL))) + { + maria_close(info); + DBUG_RETURN(1); + } + /* + store LSN into file, needed for Recovery to not be confused if a + RENAME happened (applying REDOs to the wrong table). 
+ */ + lsn_store(log_data, share->state.create_rename_lsn); + if (my_pwrite(share->kfile.file, log_data, sizeof(log_data), + sizeof(share->state.header) + 2, MYF(MY_NABP)) || + my_sync(share->kfile.file, MYF(MY_WME))) + { + maria_close(info); + DBUG_RETURN(1); + } } + + maria_close(info); +#ifdef USE_RAID #ifdef EXTRA_DEBUG _ma_check_table_is_closed(old_name,"rename raidcheck"); #endif @@ -49,29 +112,18 @@ int maria_rename(const char *old_name, const char *new_name) fn_format(from,old_name,"",MARIA_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT); fn_format(to,new_name,"",MARIA_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT); - /* - RECOVERY TODO log the two renames below. Update - ZeroDirtyPagesLSN of the table on disk (=> sync the files), this is - needed so that Recovery does not pick a wrong table. - Then do the file renames. - For this log record to be of any use for Recovery, we need the upper MySQL - layer to be crash-safe in DDLs; when it is we should reconsider the moment - of writing this log record, how to use it in Recovery, and force the log. - For now this record is only informative. But ZeroDirtyPagesLSN is - critically needed! 
- */ - if (my_rename_with_symlink(from, to, MYF(MY_WME | MY_SYNC_DIR))) + if (my_rename_with_symlink(from, to, MYF(MY_WME | sync_dir))) DBUG_RETURN(my_errno); fn_format(from,old_name,"",MARIA_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT); fn_format(to,new_name,"",MARIA_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT); #ifdef USE_RAID if (raid_type) data_file_rename_error= my_raid_rename(from, to, raid_chunks, - MYF(MY_WME | MY_SYNC_DIR)); + MYF(MY_WME | sync_dir)); else #endif data_file_rename_error= - my_rename_with_symlink(from, to, MYF(MY_WME | MY_SYNC_DIR)); + my_rename_with_symlink(from, to, MYF(MY_WME | sync_dir)); if (data_file_rename_error) { /* @@ -81,7 +133,7 @@ int maria_rename(const char *old_name, const char *new_name) data_file_rename_error= my_errno; fn_format(from, old_name, "", MARIA_NAME_IEXT, MYF(MY_UNPACK_FILENAME|MY_APPEND_EXT)); fn_format(to, new_name, "", MARIA_NAME_IEXT, MYF(MY_UNPACK_FILENAME|MY_APPEND_EXT)); - my_rename_with_symlink(to, from, MYF(MY_WME | MY_SYNC_DIR)); + my_rename_with_symlink(to, from, MYF(MY_WME | sync_dir)); } DBUG_RETURN(data_file_rename_error); diff --git a/storage/maria/ma_static.c b/storage/maria/ma_static.c index c77f3f512fd..16bf0eca935 100644 --- a/storage/maria/ma_static.c +++ b/storage/maria/ma_static.c @@ -47,7 +47,13 @@ PAGECACHE *maria_pagecache= &maria_pagecache_var; PAGECACHE maria_log_pagecache_var; PAGECACHE *maria_log_pagecache= &maria_log_pagecache_var; -/* For using maria externally */ +/** + @brief when transactionality does not matter we can use this transaction + + Used in external programs like ma_test*, and also internally inside + libmaria when there is no transaction around and the operation isn't + transactional (CREATE/DROP/RENAME/OPTIMIZE/REPAIR). 
+*/ TRN dummy_transaction_object; /* Enough for comparing if number is zero */ diff --git a/storage/maria/ma_test_all.sh b/storage/maria/ma_test_all.sh index 8ee326a9c69..76b6c32913f 100755 --- a/storage/maria/ma_test_all.sh +++ b/storage/maria/ma_test_all.sh @@ -3,10 +3,16 @@ # Execute some simple basic test on MyISAM libary to check if things # works at all. +# If you want to run this in Valgrind, you should use --trace-children=yes, +# so that it detects problems in ma_test* and not in the shell script valgrind="valgrind --alignment=8 --leak-check=yes" silent="-s" suffix="" #set -x -v -e +if [ -z "$maria_path" ] +then + maria_path="." +fi run_tests() { @@ -14,139 +20,139 @@ run_tests() # # First some simple tests # - ./ma_test1$suffix $silent $row_type - ./maria_chk$suffix -se test1 - ./ma_test1$suffix $silent -N $row_type - ./maria_chk$suffix -se test1 - ./ma_test1$suffix $silent -P --checksum $row_type - ./maria_chk$suffix -se test1 - ./ma_test1$suffix $silent -P -N $row_type - ./maria_chk$suffix -se test1 - ./ma_test1$suffix $silent -B -N -R2 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -k 480 --unique $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -N -R1 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -p $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -p -N --unique $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -p -N --key_length=127 --checksum $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -p -N --key_length=128 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -p --key_length=480 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -B $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -B --key_length=64 --unique $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -B -k 480 --checksum $row_type - ./maria_chk$suffix -sm test1 - 
./ma_test1$suffix $silent -a -B -k 480 -N --unique --checksum $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -m $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -m -P --unique --checksum $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -m -P --key_length=480 --key_cache $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -m -p $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -w --unique $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -w --key_length=64 --checksum $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -w -N --key_length=480 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -w --key_length=480 --checksum $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -b -N $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -a -b --key_length=480 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent -p -B --key_length=480 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent --checksum --unique $row_type - ./maria_chk$suffix -se test1 - ./ma_test1$suffix $silent --unique $row_type - ./maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent $row_type + $maria_path/maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent -N $row_type + $maria_path/maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent -P --checksum $row_type + $maria_path/maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent -P -N $row_type + $maria_path/maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent -B -N -R2 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -k 480 --unique $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -N -R1 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -p 
$row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -p -N --unique $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -p -N --key_length=127 --checksum $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -p -N --key_length=128 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -p --key_length=480 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -B $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -B --key_length=64 --unique $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -B -k 480 --checksum $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -B -k 480 -N --unique --checksum $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -m $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -m -P --unique --checksum $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -m -P --key_length=480 --key_cache $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -m -p $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -w --unique $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -w --key_length=64 --checksum $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -w -N --key_length=480 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -w --key_length=480 --checksum $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -b -N $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -a -b 
--key_length=480 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent -p -B --key_length=480 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent --checksum --unique $row_type + $maria_path/maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent --unique $row_type + $maria_path/maria_chk$suffix -se test1 - ./ma_test1$suffix $silent --key_multiple -N -S $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent --key_multiple -a -p --key_length=480 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent --key_multiple -a -B --key_length=480 $row_type - ./maria_chk$suffix -sm test1 - ./ma_test1$suffix $silent --key_multiple -P -S $row_type - ./maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent --key_multiple -N -S $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent --key_multiple -a -p --key_length=480 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent --key_multiple -a -B --key_length=480 $row_type + $maria_path/maria_chk$suffix -sm test1 + $maria_path/ma_test1$suffix $silent --key_multiple -P -S $row_type + $maria_path/maria_chk$suffix -sm test1 - ./maria_pack$suffix --force -s test1 - ./maria_chk$suffix -ess test1 + $maria_path/maria_pack$suffix --force -s test1 + $maria_path/maria_chk$suffix -ess test1 - ./ma_test2$suffix $silent -L -K -W -P $row_type - ./maria_chk$suffix -sm test2 - ./ma_test2$suffix $silent -L -K -W -P -A $row_type - ./maria_chk$suffix -sm test2 - ./ma_test2$suffix $silent -L -K -P -R3 -m50 -b1000000 $row_type - ./maria_chk$suffix -sm test2 - ./ma_test2$suffix $silent -L -B $row_type - ./maria_chk$suffix -sm test2 - ./ma_test2$suffix $silent -D -B -c $row_type - ./maria_chk$suffix -sm test2 - ./ma_test2$suffix $silent -m10000 -e4096 -K $row_type - ./maria_chk$suffix -sm test2 - ./ma_test2$suffix $silent -m10000 -e8192 -K $row_type - ./maria_chk$suffix -sm 
test2 - ./ma_test2$suffix $silent -m10000 -e16384 -E16384 -K -L $row_type - ./maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -L -K -W -P $row_type + $maria_path/maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -L -K -W -P -A $row_type + $maria_path/maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -L -K -P -R3 -m50 -b1000000 $row_type + $maria_path/maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -L -B $row_type + $maria_path/maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -D -B -c $row_type + $maria_path/maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -m10000 -e4096 -K $row_type + $maria_path/maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -m10000 -e8192 -K $row_type + $maria_path/maria_chk$suffix -sm test2 + $maria_path/ma_test2$suffix $silent -m10000 -e16384 -E16384 -K -L $row_type + $maria_path/maria_chk$suffix -sm test2 } run_repair_tests() { row_type=$1 - ./ma_test1$suffix $silent --checksum $row_type - ./maria_chk$suffix -se test1 - ./maria_chk$suffix -rs test1 - ./maria_chk$suffix -se test1 - ./maria_chk$suffix -rqs test1 - ./maria_chk$suffix -se test1 - ./maria_chk$suffix -rs --correct-checksum test1 - ./maria_chk$suffix -se test1 - ./maria_chk$suffix -rqs --correct-checksum test1 - ./maria_chk$suffix -se test1 - ./maria_chk$suffix -ros --correct-checksum test1 - ./maria_chk$suffix -se test1 - ./maria_chk$suffix -rqos --correct-checksum test1 - ./maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent --checksum $row_type + $maria_path/maria_chk$suffix -se test1 + $maria_path/maria_chk$suffix -rs test1 + $maria_path/maria_chk$suffix -se test1 + $maria_path/maria_chk$suffix -rqs test1 + $maria_path/maria_chk$suffix -se test1 + $maria_path/maria_chk$suffix -rs --correct-checksum test1 + $maria_path/maria_chk$suffix -se test1 + $maria_path/maria_chk$suffix -rqs --correct-checksum test1 + $maria_path/maria_chk$suffix -se test1 + 
$maria_path/maria_chk$suffix -ros --correct-checksum test1 + $maria_path/maria_chk$suffix -se test1 + $maria_path/maria_chk$suffix -rqos --correct-checksum test1 + $maria_path/maria_chk$suffix -se test1 } run_pack_tests() { row_type=$1 # check of maria_pack / maria_chk - ./ma_test1$suffix $silent --checksum $row_type - ./maria_pack$suffix --force -s test1 - ./maria_chk$suffix -ess test1 - ./maria_chk$suffix -rqs test1 - ./maria_chk$suffix -es test1 - ./maria_chk$suffix -rs test1 - ./maria_chk$suffix -es test1 - ./maria_chk$suffix -rus test1 - ./maria_chk$suffix -es test1 + $maria_path/ma_test1$suffix $silent --checksum $row_type + $maria_path/maria_pack$suffix --force -s test1 + $maria_path/maria_chk$suffix -ess test1 + $maria_path/maria_chk$suffix -rqs test1 + $maria_path/maria_chk$suffix -es test1 + $maria_path/maria_chk$suffix -rs test1 + $maria_path/maria_chk$suffix -es test1 + $maria_path/maria_chk$suffix -rus test1 + $maria_path/maria_chk$suffix -es test1 - ./ma_test1$suffix $silent --checksum -S $row_type - ./maria_chk$suffix -se test1 - ./maria_chk$suffix -ros test1 - ./maria_chk$suffix -rqs test1 - ./maria_chk$suffix -se test1 + $maria_path/ma_test1$suffix $silent --checksum -S $row_type + $maria_path/maria_chk$suffix -se test1 + $maria_path/maria_chk$suffix -ros test1 + $maria_path/maria_chk$suffix -rqs test1 + $maria_path/maria_chk$suffix -se test1 - ./maria_pack$suffix --force -s test1 - ./maria_chk$suffix -rqs test1 - ./maria_chk$suffix -es test1 - ./maria_chk$suffix -rus test1 - ./maria_chk$suffix -es test1 + $maria_path/maria_pack$suffix --force -s test1 + $maria_path/maria_chk$suffix -rqs test1 + $maria_path/maria_chk$suffix -es test1 + $maria_path/maria_chk$suffix -rus test1 + $maria_path/maria_chk$suffix -es test1 } echo "Running tests with dynamic row format" @@ -169,27 +175,27 @@ run_tests "-M -T" # Tests that gives warnings # -./ma_test2$suffix $silent -L -K -W -P -S -R1 -m500 -./maria_chk$suffix -sm test2 +$maria_path/ma_test2$suffix $silent 
-L -K -W -P -S -R1 -m500 +$maria_path/maria_chk$suffix -sm test2 echo "ma_test2$suffix $silent -L -K -R1 -m2000 ; Should give error 135" -./ma_test2$suffix $silent -L -K -R1 -m2000 -echo "./maria_chk$suffix -sm test2 will warn that 'Datafile is almost full'" -./maria_chk$suffix -sm test2 -./maria_chk$suffix -ssm test2 +$maria_path/ma_test2$suffix $silent -L -K -R1 -m2000 +echo "$maria_path/maria_chk$suffix -sm test2 will warn that 'Datafile is almost full'" +$maria_path/maria_chk$suffix -sm test2 +$maria_path/maria_chk$suffix -ssm test2 # # Some timing tests # -time ./ma_test2$suffix $silent -time ./ma_test2$suffix $silent -S -time ./ma_test2$suffix $silent -M -time ./ma_test2$suffix $silent -B -time ./ma_test2$suffix $silent -L -time ./ma_test2$suffix $silent -K -time ./ma_test2$suffix $silent -K -B -time ./ma_test2$suffix $silent -L -B -time ./ma_test2$suffix $silent -L -K -B -time ./ma_test2$suffix $silent -L -K -W -B -time ./ma_test2$suffix $silent -L -K -W -B -S -time ./ma_test2$suffix $silent -L -K -W -B -M -time ./ma_test2$suffix $silent -D -K -W -B -S +time $maria_path/ma_test2$suffix $silent +time $maria_path/ma_test2$suffix $silent -S +time $maria_path/ma_test2$suffix $silent -M +time $maria_path/ma_test2$suffix $silent -B +time $maria_path/ma_test2$suffix $silent -L +time $maria_path/ma_test2$suffix $silent -K +time $maria_path/ma_test2$suffix $silent -K -B +time $maria_path/ma_test2$suffix $silent -L -B +time $maria_path/ma_test2$suffix $silent -L -K -B +time $maria_path/ma_test2$suffix $silent -L -K -W -B +time $maria_path/ma_test2$suffix $silent -L -K -W -B -S +time $maria_path/ma_test2$suffix $silent -L -K -W -B -M +time $maria_path/ma_test2$suffix $silent -D -K -W -B -S diff --git a/storage/maria/maria_def.h b/storage/maria/maria_def.h index d9e31e800c4..740808c7bbe 100644 --- a/storage/maria/maria_def.h +++ b/storage/maria/maria_def.h @@ -93,6 +93,7 @@ typedef struct st_maria_state_info uint sortkey; /* sorted by this key (not used) */ uint 
open_count; uint8 changed; /* Changed since mariachk */ + LSN create_rename_lsn; /**< LSN when table was last created/renamed */ /* the following isn't saved on disk */ uint state_diff_length; /* Should be 0 */ @@ -101,7 +102,8 @@ typedef struct st_maria_state_info } MARIA_STATE_INFO; -#define MARIA_STATE_INFO_SIZE (24 + 4 + 11*8 + 4*4 + 8 + 3*4 + 5*8) +#define MARIA_STATE_INFO_SIZE \ + (24 + LSN_STORE_SIZE + 4 + 11*8 + 4*4 + 8 + 3*4 + 5*8) #define MARIA_STATE_KEY_SIZE 8 #define MARIA_STATE_KEYBLOCK_SIZE 8 #define MARIA_STATE_KEYSEG_SIZE 4 @@ -229,6 +231,7 @@ typedef struct st_maria_share PAGECACHE *pagecache; /* ref to the current key cache */ MARIA_DECODE_TREE *decode_trees; uint16 *decode_tables; + uint16 id; /**< 2-byte id by which log records refer to the table */ /* Called the first time the table instance is opened */ my_bool (*once_init)(struct st_maria_share *, File); /* Called when the last instance of the table is closed */ @@ -889,6 +892,7 @@ volatile int *_ma_killed_ptr(HA_CHECK *param); void _ma_check_print_error _VARARGS((HA_CHECK *param, const char *fmt, ...)); void _ma_check_print_warning _VARARGS((HA_CHECK *param, const char *fmt, ...)); void _ma_check_print_info _VARARGS((HA_CHECK *param, const char *fmt, ...)); +int _ma_repair_write_log_record(const HA_CHECK *param, MARIA_HA *info); C_MODE_END int _ma_flush_pending_blocks(MARIA_SORT_PARAM *param); diff --git a/storage/maria/trnman.c b/storage/maria/trnman.c index d6b35f071ea..83249ab328f 100644 --- a/storage/maria/trnman.c +++ b/storage/maria/trnman.c @@ -52,6 +52,7 @@ static my_atomic_rwlock_t LOCK_short_trid_to_trn, LOCK_pool; /* Simple interface functions + QQ: if they stay so simple, should we make them inline? 
*/ uint trnman_increment_locked_tables(TRN *trn) @@ -343,6 +344,9 @@ int trnman_end_trn(TRN *trn, my_bool commit) LF_PINS *pins= trn->pins; DBUG_ENTER("trnman_end_trn"); + DBUG_ASSERT(trn->rec_lsn == 0); + /* if a rollback, all UNDO records should have been executed */ + DBUG_ASSERT(commit || trn->undo_lsn == 0); DBUG_PRINT("info", ("pthread_mutex_lock LOCK_trn_list")); pthread_mutex_lock(&LOCK_trn_list); @@ -379,8 +383,6 @@ int trnman_end_trn(TRN *trn, my_bool commit) /* if transaction is committed and it was not the only active transaction - add it to the committed list (which is used for read-from relation) - TODO check in the condition below that a transaction have made some - changes, was not read-only. Something like '&& UndoLSN != 0' */ if (commit && active_list_min.next != &active_list_max) { @@ -390,6 +392,19 @@ int trnman_end_trn(TRN *trn, my_bool commit) trnman_committed_transactions++; res= lf_hash_insert(&trid_to_committed_trn, pins, &trn); + /* + By going on with life is res<0, we let other threads block on + our rows (because they will never see us committed in + trid_to_committed_trn) until they timeout. Though correct, this is not a + good situation: + - if connection reconnects and wants to check if its rows have been + committed, it will not be able to do that (it will just lock on them) so + connection stays permanently in doubt + - internal structures trid_to_committed_trn and committed_list are + desynchronized. + So we should take Maria down immediately, the two problems being + automatically solved at restart. + */ DBUG_ASSERT(res <= 0); } if (res) @@ -526,71 +541,133 @@ void trnman_rollback_statement(TRN *trn __attribute__ ((unused))) } -/* - Allocates two buffers and stores in them some information about transactions - of the active list (into the first buffer) and of the committed list (into - the second buffer). 
- - SYNOPSIS - trnman_collect_transactions() - str_act (OUT) pointer to a LEX_STRING where the allocated buffer, and - its size, will be put - str_com (OUT) pointer to a LEX_STRING where the allocated buffer, and - its size, will be put +/** + @brief Allocates buffers and stores in them some info about transactions + Does the allocation because the caller cannot know the size itself. + Memory freeing is to be done by the caller (if the "str" member of the + LEX_STRING is not NULL). + The caller has the intention of doing checkpoints. - DESCRIPTION - Does the allocation because the caller cannot know the size itself. - Memory freeing is to be done by the caller (if the "str" member of the - LEX_STRING is not NULL). - The caller has the intention of doing checkpoints. + @param[out] str_act pointer to where the allocated buffer, + and its size, will be put; buffer will be filled + with info about active transactions + @param[out] str_com pointer to where the allocated buffer, + and its size, will be put; buffer will be filled + with info about committed transactions + @param[out] min_first_undo_lsn pointer to where the minimum + first_undo_lsn of all transactions will be put - RETURN - 0 on success - 1 on error + @return Operation status + @retval 0 OK + @retval 1 Error */ -my_bool trnman_collect_transactions(LEX_STRING *str_act, LEX_STRING *str_com) + +my_bool trnman_collect_transactions(LEX_STRING *str_act, LEX_STRING *str_com, + LSN *min_rec_lsn, LSN *min_first_undo_lsn) { my_bool error; TRN *trn; char *ptr; + uint stored_transactions= 0; + LSN minimum_rec_lsn= ULONGLONG_MAX, minimum_first_undo_lsn= ULONGLONG_MAX; DBUG_ENTER("trnman_collect_transactions"); DBUG_ASSERT((NULL == str_act->str) && (NULL == str_com->str)); + /* validate the use of read_non_atomic() in general: */ + compile_time_assert((sizeof(LSN) == 8) && (sizeof(LSN_WITH_FLAGS) == 8)); + DBUG_PRINT("info", ("pthread_mutex_lock LOCK_trn_list")); pthread_mutex_lock(&LOCK_trn_list); - str_act->length= 
8+(6+2+7+7+7)*trnman_active_transactions; - str_com->length= 8+(6+7+7)*trnman_committed_transactions; + str_act->length= 2 + /* number of active transactions */ + LSN_STORE_SIZE + /* minimum of their rec_lsn */ + (6 + /* long id */ + 2 + /* short id */ + LSN_STORE_SIZE + /* undo_lsn */ +#ifdef MARIA_VERSIONING /* not enabled yet */ + LSN_STORE_SIZE + /* undo_purge_lsn */ +#endif + LSN_STORE_SIZE /* first_undo_lsn */ + ) * trnman_active_transactions; + str_com->length= 8 + /* number of committed transactions */ + (6 + /* long id */ +#ifdef MARIA_VERSIONING /* not enabled yet */ + LSN_STORE_SIZE + /* undo_purge_lsn */ +#endif + LSN_STORE_SIZE /* first_undo_lsn */ + ) * trnman_committed_transactions; if ((NULL == (str_act->str= my_malloc(str_act->length, MYF(MY_WME)))) || (NULL == (str_com->str= my_malloc(str_com->length, MYF(MY_WME))))) goto err; /* First, the active transactions */ - ptr= str_act->str; - int8store(ptr, (ulonglong)trnman_active_transactions); - ptr+= 8; + ptr= str_act->str + 2 + LSN_STORE_SIZE; for (trn= active_list_min.next; trn != &active_list_max; trn= trn->next) { /* - trns with a short trid of 0 are not initialized; Recovery will recognize - this and ignore them. - State is not needed for now (only when we supported prepared trns). - For LSNs, Sanja will soon push lsn7store. + trns with a short trid of 0 are not even initialized, we can ignore + them. trns with undo_lsn==0 have done no writes, we can ignore them + too. XID not needed now. */ + uint sid; + LSN rec_lsn, undo_lsn, first_undo_lsn; + if ((sid= trn->short_id) == 0) + { + /* + Not even inited, has done nothing. Or it is the + dummy_transaction_object, which does only non-transactional + immediate-sync operations (CREATE/DROP/RENAME/REPAIR TABLE), and so + can be forgotten for Checkpoint. 
+ */ + continue; + } +#ifndef MARIA_CHECKPOINT +/* + in the checkpoint patch (not yet ready) we will have a real implementation + of lsn_read_non_atomic(); for now it's not needed +*/ +#define lsn_read_non_atomic(A) (A) +#endif + /* needed for low-water mark calculation */ + if (((rec_lsn= lsn_read_non_atomic(trn->rec_lsn)) > 0) && + (cmp_translog_addr(rec_lsn, minimum_rec_lsn) < 0)) + minimum_rec_lsn= rec_lsn; + /* + trn may have logged REDOs but not yet UNDO, that's why we read rec_lsn + before deciding to ignore if undo_lsn==0. + */ + if ((undo_lsn= trn->undo_lsn) == 0) /* trn can be forgotten */ + continue; + stored_transactions++; int6store(ptr, trn->trid); ptr+= 6; - int2store(ptr, trn->short_id); + int2store(ptr, sid); ptr+= 2; - /* needed for rollback */ - /* lsn7store(ptr, trn->undo_lsn); */ - ptr+= 7; - /* needed for purge */ - /* lsn7store(ptr, trn->undo_purge_lsn); */ - ptr+= 7; + lsn_store(ptr, undo_lsn); /* needed for rollback */ + ptr+= LSN_STORE_SIZE; +#ifdef MARIA_VERSIONING /* not enabled yet */ + /* to know where purging should start (last delete of this trn) */ + lsn_store(ptr, trn->undo_purge_lsn); + ptr+= LSN_STORE_SIZE; +#endif /* needed for low-water mark calculation */ - /* lsn7store(ptr, read_non_atomic(&trn->first_undo_lsn)); */ - ptr+= 7; + if (((first_undo_lsn= lsn_read_non_atomic(trn->first_undo_lsn)) > 0) && + (cmp_translog_addr(first_undo_lsn, minimum_first_undo_lsn) < 0)) + minimum_first_undo_lsn= first_undo_lsn; + lsn_store(ptr, first_undo_lsn); + ptr+= LSN_STORE_SIZE; + /** + @todo RECOVERY: add a comment explaining why we can dirtily read some + vars, inspired by the text of "assumption 8" in WL#3072 + */ } + str_act->length= ptr - str_act->str; /* as we maybe over-estimated */ + ptr= str_act->str; + int2store(ptr, stored_transactions); + ptr+= 2; + /* this LSN influences how REDOs for any page can be ignored by Recovery */ + lsn_store(ptr, minimum_rec_lsn); + /* one day there will also be a list of prepared transactions */ /* do 
     the same for committed ones */
   ptr= str_com->str;
   int8store(ptr, (ulonglong)trnman_committed_transactions);
@@ -598,18 +675,26 @@ my_bool trnman_collect_transactions(LEX_STRING *str_act, LEX_STRING *str_com)
   for (trn= committed_list_min.next; trn != &committed_list_max;
        trn= trn->next)
   {
+    LSN first_undo_lsn;
     int6store(ptr, trn->trid);
     ptr+= 6;
-    /* mi_int7store(ptr, trn->undo_purge_lsn); */
-    ptr+= 7;
-    /* mi_int7store(ptr, read_non_atomic(&trn->first_undo_lsn)); */
-    ptr+= 7;
+#ifdef MARIA_VERSIONING /* not enabled yet */
+    lsn_store(ptr, trn->undo_purge_lsn);
+    ptr+= LSN_STORE_SIZE;
+#endif
+    first_undo_lsn= LSN_WITH_FLAGS_TO_LSN(trn->first_undo_lsn);
+    if (cmp_translog_addr(first_undo_lsn, minimum_first_undo_lsn) < 0)
+      minimum_first_undo_lsn= first_undo_lsn;
+    lsn_store(ptr, first_undo_lsn);
+    ptr+= LSN_STORE_SIZE;
   }
   /*
     TODO: if we see there exists no transaction (active and committed)
     we can tell the lock-free structures to do some freeing (my_free()).
   */
   error= 0;
+  *min_rec_lsn= minimum_rec_lsn;
+  *min_first_undo_lsn= minimum_first_undo_lsn;
   goto end;
 err:
   error= 1;
diff --git a/storage/maria/trnman.h b/storage/maria/trnman.h
index 1e1550efb46..1a4423f2a11 100644
--- a/storage/maria/trnman.h
+++ b/storage/maria/trnman.h
@@ -45,12 +45,13 @@ struct st_transaction
   LF_PINS *pins;
   TrID trid, min_read_from, commit_trid;
   TRN *next, *prev;
-  LSN rec_lsn, undo_lsn, first_undo_lsn;
+  LSN rec_lsn, undo_lsn;
+  LSN_WITH_FLAGS first_undo_lsn;
   uint locked_tables;
   /* Note! if locks.loid is 0, trn is NOT initialized */
 };
 
-TRN dummy_transaction_object;
+#define TRANSACTION_LOGGED_LONG_ID ULL(0x8000000000000000)
 
 C_MODE_END
 
diff --git a/storage/maria/trnman_public.h b/storage/maria/trnman_public.h
index 4b3f8acb4b3..3e0a21c26a6 100644
--- a/storage/maria/trnman_public.h
+++ b/storage/maria/trnman_public.h
@@ -20,6 +20,8 @@
   to include my_atomic.h in C++ code.
 */
 
+#include "ma_loghandler_lsn.h"
+
 C_MODE_START
 typedef uint64 TrID; /* our TrID is 6 bytes */
 typedef struct st_transaction TRN;
@@ -27,6 +29,7 @@ typedef struct st_transaction TRN;
 #define SHORT_TRID_MAX 65535
 
 extern uint trnman_active_transactions, trnman_allocated_transactions;
+extern TRN dummy_transaction_object;
 
 int trnman_init(void);
 void trnman_destroy(void);
@@ -39,7 +42,9 @@ void trnman_free_trn(TRN *trn);
 int trnman_can_read_from(TRN *trn, TrID trid);
 void trnman_new_statement(TRN *trn);
 void trnman_rollback_statement(TRN *trn);
-my_bool trnman_collect_transactions(LEX_STRING *str_act, LEX_STRING *str_com);
+my_bool trnman_collect_transactions(LEX_STRING *str_act, LEX_STRING *str_com,
+                                    LSN *min_rec_lsn,
+                                    LSN *min_first_undo_lsn);
 
 uint trnman_increment_locked_tables(TRN *trn);
 uint trnman_decrement_locked_tables(TRN *trn);
diff --git a/storage/maria/unittest/ma_test_loghandler-t.c b/storage/maria/unittest/ma_test_loghandler-t.c
index f05d58a784f..e31136d52ec 100644
--- a/storage/maria/unittest/ma_test_loghandler-t.c
+++ b/storage/maria/unittest/ma_test_loghandler-t.c
@@ -196,7 +196,7 @@ int main(int argc __attribute__((unused)), char *argv[])
   if (translog_write_record(&lsn,
                             LOGREC_FIXED_RECORD_0LSN_EXAMPLE,
                             trn, NULL,
-                            6, TRANSLOG_INTERNAL_PARTS + 1, parts))
+                            6, TRANSLOG_INTERNAL_PARTS + 1, parts, NULL))
   {
     fprintf(stderr, "Can't write record #%lu\n", (ulong) 0);
     translog_destroy();
@@ -218,7 +218,7 @@ int main(int argc __attribute__((unused)), char *argv[])
       parts[TRANSLOG_INTERNAL_PARTS + 1].str= NULL;
       parts[TRANSLOG_INTERNAL_PARTS + 1].length= 0;
       if (translog_write_record(&lsn, LOGREC_FIXED_RECORD_1LSN_EXAMPLE,
-                                trn, NULL, LSN_STORE_SIZE, 0, parts))
+                                trn, NULL, LSN_STORE_SIZE, 0, parts, NULL))
       {
         fprintf(stderr, "1 Can't write reference defore record #%lu\n",
                 (ulong) i);
@@ -238,7 +238,7 @@ int main(int argc __attribute__((unused)), char *argv[])
       if (translog_write_record(&lsn,
                                 LOGREC_VARIABLE_RECORD_1LSN_EXAMPLE,
                                 trn, NULL, 0, TRANSLOG_INTERNAL_PARTS + 2,
-                                parts))
+                                parts, NULL))
       {
         fprintf(stderr, "1 Can't write var reference defore record #%lu\n",
                 (ulong) i);
@@ -257,7 +257,7 @@ int main(int argc __attribute__((unused)), char *argv[])
       if (translog_write_record(&lsn,
                                 LOGREC_FIXED_RECORD_2LSN_EXAMPLE,
                                 trn, NULL,
-                                23, TRANSLOG_INTERNAL_PARTS + 1, parts))
+                                23, TRANSLOG_INTERNAL_PARTS + 1, parts, NULL))
       {
         fprintf(stderr, "0 Can't write reference defore record #%lu\n",
                 (ulong) i);
@@ -277,7 +277,7 @@ int main(int argc __attribute__((unused)), char *argv[])
       if (translog_write_record(&lsn,
                                 LOGREC_VARIABLE_RECORD_2LSN_EXAMPLE,
                                 trn, NULL, 14 + rec_len,
-                                TRANSLOG_INTERNAL_PARTS + 2, parts))
+                                TRANSLOG_INTERNAL_PARTS + 2, parts, NULL))
       {
         fprintf(stderr, "0 Can't write var reference defore record #%lu\n",
                 (ulong) i);
@@ -294,7 +294,7 @@ int main(int argc __attribute__((unused)), char *argv[])
                               LOGREC_FIXED_RECORD_0LSN_EXAMPLE,
                               trn, NULL, 6,
                               TRANSLOG_INTERNAL_PARTS + 1,
-                              parts))
+                              parts, NULL))
     {
       fprintf(stderr, "Can't write record #%lu\n", (ulong) i);
       translog_destroy();
@@ -313,7 +313,7 @@ int main(int argc __attribute__((unused)), char *argv[])
                               LOGREC_VARIABLE_RECORD_0LSN_EXAMPLE,
                               trn, NULL, rec_len,
                               TRANSLOG_INTERNAL_PARTS + 1,
-                              parts))
+                              parts, NULL))
     {
       fprintf(stderr, "Can't write variable record #%lu\n", (ulong) i);
       translog_destroy();
diff --git a/storage/maria/unittest/ma_test_loghandler_multigroup-t.c b/storage/maria/unittest/ma_test_loghandler_multigroup-t.c
index 9ed57da8fec..1281ee425d8 100644
--- a/storage/maria/unittest/ma_test_loghandler_multigroup-t.c
+++ b/storage/maria/unittest/ma_test_loghandler_multigroup-t.c
@@ -192,7 +192,7 @@ int main(int argc __attribute__((unused)), char *argv[])
   trn->short_id= 0;
   if (translog_write_record(&lsn, LOGREC_FIXED_RECORD_0LSN_EXAMPLE,
                             trn, NULL,
-                            6, TRANSLOG_INTERNAL_PARTS + 1, parts))
+                            6, TRANSLOG_INTERNAL_PARTS + 1, parts, NULL))
   {
     fprintf(stderr, "Can't write record #%lu\n", (ulong) 0);
     translog_destroy();
@@ -214,7 +214,7 @@ int main(int argc __attribute__((unused)), char *argv[])
                                 LOGREC_FIXED_RECORD_1LSN_EXAMPLE,
                                 trn, NULL,
                                 LSN_STORE_SIZE,
-                                TRANSLOG_INTERNAL_PARTS + 1, parts))
+                                TRANSLOG_INTERNAL_PARTS + 1, parts, NULL))
       {
         fprintf(stderr, "1 Can't write reference before record #%lu\n",
                 (ulong) i);
@@ -234,7 +234,7 @@ int main(int argc __attribute__((unused)), char *argv[])
                                 LOGREC_VARIABLE_RECORD_1LSN_EXAMPLE,
                                 trn, NULL, LSN_STORE_SIZE + rec_len,
                                 TRANSLOG_INTERNAL_PARTS + 2,
-                                parts))
+                                parts, NULL))
       {
         fprintf(stderr, "1 Can't write var reference before record #%lu\n",
                 (ulong) i);
@@ -255,7 +255,7 @@ int main(int argc __attribute__((unused)), char *argv[])
                                 LOGREC_FIXED_RECORD_2LSN_EXAMPLE,
                                 trn, NULL, 23,
                                 TRANSLOG_INTERNAL_PARTS + 1,
-                                parts))
+                                parts, NULL))
       {
         fprintf(stderr, "0 Can't write reference before record #%lu\n",
                 (ulong) i);
@@ -276,7 +276,7 @@ int main(int argc __attribute__((unused)), char *argv[])
                                 LOGREC_VARIABLE_RECORD_2LSN_EXAMPLE,
                                 trn, NULL, LSN_STORE_SIZE * 2 + rec_len,
                                 TRANSLOG_INTERNAL_PARTS + 2,
-                                parts))
+                                parts, NULL))
       {
         fprintf(stderr, "0 Can't write var reference before record #%lu\n",
                 (ulong) i);
@@ -293,7 +293,7 @@ int main(int argc __attribute__((unused)), char *argv[])
       if (translog_write_record(&lsn,
                                 LOGREC_FIXED_RECORD_0LSN_EXAMPLE,
                                 trn, NULL, 6,
-                                TRANSLOG_INTERNAL_PARTS + 1, parts))
+                                TRANSLOG_INTERNAL_PARTS + 1, parts, NULL))
       {
         fprintf(stderr, "Can't write record #%lu\n", (ulong) i);
         translog_destroy();
@@ -311,7 +311,7 @@ int main(int argc __attribute__((unused)), char *argv[])
       if (translog_write_record(&lsn,
                                 LOGREC_VARIABLE_RECORD_0LSN_EXAMPLE,
                                 trn, NULL, rec_len,
-                                TRANSLOG_INTERNAL_PARTS + 1, parts))
+                                TRANSLOG_INTERNAL_PARTS + 1, parts, NULL))
       {
         fprintf(stderr, "Can't write variable record #%lu\n", (ulong) i);
         translog_destroy();
diff --git a/storage/maria/unittest/ma_test_loghandler_multithread-t.c b/storage/maria/unittest/ma_test_loghandler_multithread-t.c
index 688c1ec33be..ff966160acc 100644
--- a/storage/maria/unittest/ma_test_loghandler_multithread-t.c
+++ b/storage/maria/unittest/ma_test_loghandler_multithread-t.c
@@ -137,7 +137,7 @@ void writer(int num)
     if (translog_write_record(&lsn, LOGREC_FIXED_RECORD_0LSN_EXAMPLE,
                               &trn, NULL, 6,
                               TRANSLOG_INTERNAL_PARTS + 1,
-                              parts))
+                              parts, NULL))
     {
       fprintf(stderr, "Can't write LOGREC_FIXED_RECORD_0LSN_EXAMPLE record #%lu "
               "thread %i\n", (ulong) i, num);
@@ -154,7 +154,7 @@ void writer(int num)
                               LOGREC_VARIABLE_RECORD_0LSN_EXAMPLE,
                               &trn, NULL, len,
                               TRANSLOG_INTERNAL_PARTS + 1,
-                              parts))
+                              parts, NULL))
     {
       fprintf(stderr, "Can't write variable record #%lu\n", (ulong) i);
       translog_destroy();
@@ -303,7 +303,7 @@ int main(int argc __attribute__((unused)),
                             LOGREC_FIXED_RECORD_0LSN_EXAMPLE,
                             &dummy_transaction_object, NULL, 6,
                             TRANSLOG_INTERNAL_PARTS + 1,
-                            parts))
+                            parts, NULL))
   {
     fprintf(stderr, "Can't write the first record\n");
     translog_destroy();
diff --git a/storage/maria/unittest/ma_test_loghandler_pagecache-t.c b/storage/maria/unittest/ma_test_loghandler_pagecache-t.c
index b43f0cfa98c..35e05f9c997 100644
--- a/storage/maria/unittest/ma_test_loghandler_pagecache-t.c
+++ b/storage/maria/unittest/ma_test_loghandler_pagecache-t.c
@@ -94,7 +94,7 @@ int main(int argc __attribute__((unused)), char *argv[])
                             LOGREC_FIXED_RECORD_0LSN_EXAMPLE,
                             &dummy_transaction_object, NULL, 6,
                             TRANSLOG_INTERNAL_PARTS + 1,
-                            parts))
+                            parts, NULL))
   {
     fprintf(stderr, "Can't write record #%lu\n", (ulong) 0);
     translog_destroy(); |
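Note on the trnman.c hunk: the committed-transactions loop strips the flag bits from each `first_undo_lsn` and keeps a running minimum, which is what the new `min_first_undo_lsn` out-parameter reports. The standalone sketch below illustrates only that pattern. Everything prefixed `sketch_`/`SKETCH_` is hypothetical: the mask value, the "maximum" starting value, and the plain `<` comparison are assumptions standing in for `LSN_WITH_FLAGS_TO_LSN()` and `cmp_translog_addr()`, whose definitions are not part of this diff. Only the `0x8000000000000000` flag value is taken from `TRANSACTION_LOGGED_LONG_ID` above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for LSN; the real type is declared elsewhere. */
typedef uint64_t sketch_lsn;

/* Same value as TRANSACTION_LOGGED_LONG_ID in trnman.h. */
#define SKETCH_LOGGED_LONG_ID UINT64_C(0x8000000000000000)
/* ASSUMPTION: flags occupy the top byte, the LSN the low 7 bytes. */
#define SKETCH_LSN_MASK       UINT64_C(0x00FFFFFFFFFFFFFF)
/* ASSUMPTION: starting value that compares above every real LSN. */
#define SKETCH_LSN_MAX        SKETCH_LSN_MASK

/* Drop the flag bits, keeping only the LSN part. */
static sketch_lsn sketch_strip_flags(uint64_t lsn_with_flags)
{
  return lsn_with_flags & SKETCH_LSN_MASK;
}

/*
  Return the smallest flag-stripped LSN in the array, or SKETCH_LSN_MAX
  if the array is empty. Mirrors the shape of the loop in the hunk:
  strip flags, compare with the running minimum, keep the smaller one.
*/
static sketch_lsn sketch_min_first_undo_lsn(const uint64_t *first_undo_lsns,
                                            size_t count)
{
  sketch_lsn minimum= SKETCH_LSN_MAX;
  size_t i;
  for (i= 0; i < count; i++)
  {
    sketch_lsn lsn= sketch_strip_flags(first_undo_lsns[i]);
    assert((lsn & ~SKETCH_LSN_MASK) == 0); /* no flag bits survive */
    if (lsn < minimum)
      minimum= lsn;
  }
  return minimum;
}
```

Without the masking step, a transaction whose flag bit is set would compare as a huge value and could never become the minimum, which is presumably why the hunk converts before comparing.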