Using the Reliable Notification Service

Background

There are two CORBA services defined by the OMG to support the Supplier/Consumer design pattern. This pattern allows messages (known as Events in this context) to be generated by one or more suppliers and delivered to one or more consumers without requiring that the suppliers and consumers have any knowledge of each other.

The Event Service provides a basic implementation of this pattern, and the Notification service extends this basic service to support a rich variety of optional features.

Reliability and Persistence

One of the optional features of the Notification service is Reliability. By default the Event Service and the Notification service provide best-effort support for event delivery. If things go wrong (program crashes, communications failures, etc.), events may be lost without notice.

There are some circumstances in which losing events is not acceptable. The Notification service may be used for these situations if it is configured for reliable operation. Reliable operation is not available in the Event Service. Reliable operation means information is saved persistently (usually on a disk file) and used to recover from the various failures that might otherwise lead to loss of data.

There are two separate, but related, issues that need to be addressed to provide reliable event delivery: topology persistence and event persistence.

To provide topology persistence, sometimes called connection persistence, the Notification service must keep track of what clients (Suppliers and Consumers) have connected to the Notification service and what options have been specified to control the delivery of events.

To provide event persistence the Notification service tracks each event in persistent storage to be sure it is delivered to every consumer that should receive it.

There may be situations in which topology persistence is all that is necessary -- it may be acceptable to lose events during a failure as long as the system is restored to normal operation afterward. Event persistence, on the other hand, can only be supported if topology persistence is also being used. It doesn't help to keep track of events if the system is unable to find the consumers to which the events should be delivered.

Two separate issues must be addressed as part of setting up the Notification service for reliable operation. At the system administration level the Notification service must be configured for topology persistence and possibly for event persistence. At the application level, programs that operate as consumers and suppliers must set the appropriate parameters to enable reliable operation, and must cooperate with the reconnection process that occurs during topology recovery.

Configuring Notification Service Reliability

Service Configurator Changes

Runtime configuration of the Notification Service is supported through the service configurator file. This file is normally named svc.conf; however, the -ORBSvcConf command line option allows an alternate service configuration file to be specified.

Service configuration changes to support Notification Service reliability include a new option on the existing Notify_Default_Event_Manager_Objects_Factory service configuration command, and two new service configuration commands.

Notify_Default_Event_Manager_Objects_Factory option: -AllowReconnect

Certain recovery cases require that a Consumer be able to reconnect to an existing proxy object in the Notification Service in order to receive all events delivered by that proxy object. This behavior is a departure from the OMG Specification which mandates that the Notification Service should throw an "Already Connected" exception when a consumer attempts to connect to a proxy that was previously used by another Consumer.

A new option, -AllowReconnect, is available for the existing Notify_Default_Event_Manager_Objects_Factory command to support this requirement. As an example of its use, the following line configures the Notification Service for multi-threaded operation supporting reconnection:

static Notify_Default_Event_Manager_Objects_Factory "-DispatchingThreads 2
      -SourceThreads 2 -AllowReconnect"

Configuring Connection (Topology) Reliability

The support for persistent topology is actually a configurable strategy. TAO includes an XML Topology Persistence Strategy that uses an XML file for persistent storage, but it is designed to allow other strategies to be developed. For example, if topology information should be stored in a relational database, it is possible to develop a persistent topology strategy to do so. The details of doing this are beyond the scope of this document.

This document describes how to configure the XML topology persistence included with TAO.

An example of the service configuration command to configure the XML strategy is:

dynamic Topology_Factory Service_Object* TAO_CosNotification_Persist:_make_XML_Topology_Factory() "-base_path ./reconnect_test"

The first part of this line, dynamic Topology_Factory Service_Object* TAO_CosNotification_Persist:_make_XML_Topology_Factory(), should be given exactly as shown. For details on this syntax, see chapter 17 of the TAO Developer's Guide.

The quoted string at the end of the line contains arguments for the configured strategy. The arguments recognized by the XML topology strategy implemented in this project are:

-v
-base_path file_path
-backup_count count
-save_base_path file_path
-load_base_path file_path
-no_timestamp

Topology_Factory Option: -v

To help diagnose and/or document svc.conf settings, the "-v" option causes the options for the Topology_Factory to be displayed as they are interpreted.

Topology_Factory Option: -base_path file_path

The argument for this option is a fully qualified path name, without an extension, for the XML file in which topology information is saved. Three extensions will be appended to this name: .new, .xml, and .000.

Saved topology information will be written to the file_path.new file. A file with a .new extension is not necessarily complete and will not be used to restore the topology.

When the .new file is complete, the previous file_path.000 (if any) will be deleted, the previous file_path.xml (if any) will be renamed as file_path.000, and the file_path.new file will be renamed as file_path.xml. The assumption is that a file system rename operation is atomic. If this assumption holds, then at any time the file file_path.xml (if it exists) contains the most recent complete save. If file_path.xml does not exist, then file_path.000 contains the most recent complete save. If neither of these files exists, the saved topology information is not available.
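The lookup order described above can be sketched in a few lines of C++. This is only an illustration of the rule, not code taken from TAO; the function name latest_complete_save and the use of an empty string to signal "nothing available" are assumptions made for this example.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns true if the named file can be opened for reading.
static bool file_exists(const std::string& path)
{
  return std::ifstream(path.c_str()).good();
}

// Hypothetical helper (not part of TAO): given a -base_path value, return the
// file holding the most recent complete save, or "" if none is available.
std::string latest_complete_save(const std::string& base_path)
{
  assert(!base_path.empty());
  const std::string current = base_path + ".xml";
  const std::string backup = base_path + ".000";

  if (file_exists(current)) return current;  // normal case
  if (file_exists(backup)) return backup;    // failure between the two renames
  return std::string();                      // a lone .new file is never trusted
}
```

Note that a file_path.new file is deliberately ignored: it may have been only partially written when the failure occurred.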

Topology_Factory Option: -backup_count count

This option modifies the behavior described in the preceding section to allow additional backup copies of the topology file to be retained. The default value, 1, means that only the file_path.000 file will be kept. If a higher number is specified, then older versions will be kept. Rather than deleting file_path.000, the system will rename it to be file_path.001. Older versions will be named file_path.002, file_path.003, and so on.

Under normal circumstances only one backup file is required -- in fact these additional backup files will not be used to restore the topology. However, setting this number to a larger value lets the system keep a brief history of topology changes. Since the XML files are roughly human-readable, this can be used as a diagnostic tool for problems related to Notification Service topology.
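To make the rotation concrete, here is a minimal C++ sketch of the rename sequence as this document describes it. It is not the TAO implementation; the names backup_name and commit_save are hypothetical, and the three-digit zero-padded extension is inferred from the .000 and .001 examples above.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Builds "<base>.000", "<base>.001", ... (three digits, zero padded).
static std::string backup_name(const std::string& base, unsigned n)
{
  char ext[16];
  std::snprintf(ext, sizeof ext, ".%03u", n);
  return base + ext;
}

// Hypothetical helper (not TAO code): promote <base>.new to <base>.xml,
// keeping up to backup_count older versions, with .000 being the newest.
// Returns false if there is no .new file or the final rename fails.
bool commit_save(const std::string& base, unsigned backup_count)
{
  assert(backup_count >= 1);
  const std::string fresh = base + ".new";
  if (!std::ifstream(fresh.c_str()).good()) return false;

  // Drop the oldest backup; it is fine if it does not exist yet.
  std::remove(backup_name(base, backup_count - 1).c_str());

  // Shift the remaining backups up by one, oldest first.
  for (unsigned i = backup_count - 1; i > 0; --i)
    std::rename(backup_name(base, i - 1).c_str(), backup_name(base, i).c_str());

  // The previous current file (if any) becomes the newest backup.
  std::rename((base + ".xml").c_str(), backup_name(base, 0).c_str());

  // Finally, publish the completed save.
  return std::rename(fresh.c_str(), (base + ".xml").c_str()) == 0;
}
```

With backup_count set to 1 the loop body never runs, which reproduces the default behavior: file_path.000 is deleted and replaced by the previous file_path.xml.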

Topology_Factory Options: -save_base_path file_path and -load_base_path file_path

These options are alternatives to the -base_path option. They allow the file from which topology information is loaded at Notification Service startup time to be different from the file to which this information is saved as the system runs.

These options are mostly used for developer testing, but a system administrator may find an interesting use for them -- possibly involving script files that rename the XML files during recovery from a Notification Service failure.

Topology_Factory Option: -no_timestamp

The XML files include a timestamp to indicate when the information was saved. The timestamp is for information only and is not needed for correct functioning of the topology persistence. This option suppresses that timestamp. Doing so makes it possible to compare XML files using a program like diff to see if the files represent the same topology.

This option is intended primarily for testing the persistent topology implementation.

Configuring Event Reliability

A new service configuration object, "Event_Persistence", can be configured in the service configuration file to enable and configure Event Reliability. An example of the line needed to configure the Event_Persistence object is:

dynamic Event_Persistence Service_Object* TAO_CosNotification_Persist:_make_Standard_Event_Persistence() "-v -file_path ./event_persist.db"

If this line does not appear in svc.conf, then event reliability will not be supported. QoS parameters for reliable event delivery will be silently ignored when Event Reliability is not configured. Event reliability also requires topology reliability, so if this line appears there must also be a "Topology_Factory" line in the file. If not, the Notification Service will fail to start up.

The beginning of this line, up to and including the parentheses, should appear exactly as shown. For details on this syntax, see chapter 17 of the TAO Developer's Guide. The quoted string at the end of the line contains options for Event_Persistence.

Event_Persistence Option: -v

This option and any option that appears after this option will be written to the log (normally the console) as it is processed. This is intended to help diagnose and document the Event Persistence settings. The default is to configure Event Persistence silently.

Event_Persistence Option: -file_path path

This option gives the completely qualified name for the file in which persistent event information will be stored. The file should be configured on a reliable device that supports synchronized writes (i.e., writes that flush the operating system's write cache). A device that is suitable for storing a reliable database would be appropriate for storing this file. The file will be subject to a relatively high number of small (single block) write requests, but very few, if any, read requests. If the file does not exist, then a new file will be created. If the file does exist, and if topology is successfully loaded, the events from this file will be reloaded and redelivered automatically. This is a required option. There is no default value.

Event_Persistence Option: -block_size n

This option gives the block size in bytes for the device on which the event reliability file is stored. For both performance and reliability reasons it is important that the value matches the physical characteristics of the device. The default value is 512.

Application Programming Changes to Support Reliability

When it is configured as described above, the Notification service supports reliable connectivity and/or event delivery. Actually achieving such reliability, however, requires cooperation from the Notification service clients (Suppliers and Consumers).

There are a number of failure possibilities and different recovery techniques are needed to handle them. The simplest case is when a client fails and is restarted.

The Notification service will have maintained the connection points (Supplier and Consumer Admins, Proxy Consumers, Proxy Suppliers, etc.). As each of these connections was established, an ID was returned by the Notification service. An application that wishes to be reconnected after a failure should save a persistent copy of these IDs. For example, it could write the IDs to a file, then read them back from the file after restarting. Using these IDs, the application can reconnect to the existing connection points in the Notification service. The reconnection to the Proxy objects will only work if the Notification service has been configured with the -AllowReconnect option described above, but otherwise this process is fairly straightforward.
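Saving and restoring the IDs might look like the following C++ sketch. The SavedIds structure, the file layout, and the function names are all hypothetical, and plain integers stand in for the ID types returned by the Notification service; the CORBA calls that obtain the IDs and later use them to reconnect are omitted.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical record of the IDs a client received while connecting.
// Plain integers stand in for the ID types returned by the service.
struct SavedIds
{
  long channel_id;
  long admin_id;
  long proxy_id;
};

// Write the IDs to a temporary file, then rename it into place so that a
// failure during the write does not damage an earlier saved copy.
// (std::rename replaces an existing file on POSIX systems; other
// platforms may behave differently.)
bool save_ids(const std::string& path, const SavedIds& ids)
{
  assert(!path.empty());
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp.c_str(), std::ios::trunc);
    out << ids.channel_id << ' ' << ids.admin_id << ' ' << ids.proxy_id << '\n';
    out.flush();
    if (!out) return false;
  }
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Returns true if a complete record was read back. The contents of ids
// are only meaningful when true is returned.
bool load_ids(const std::string& path, SavedIds& ids)
{
  std::ifstream in(path.c_str());
  return static_cast<bool>(in >> ids.channel_id >> ids.admin_id >> ids.proxy_id);
}
```

A restarted client would call load_ids first. If it returns false, there is nothing to reconnect to and the client connects from scratch, saving the new IDs as it goes.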

As soon as a supplier has reconnected, it can resume sending events. As soon as a consumer has reconnected, persistent events (if any) and new events will start to arrive.

Notice that the identity of a consumer or supplier is determined by these saved IDs. This is true even if the restarted client is running on a completely different machine from the original client.

The case of the Notification service itself failing and then being restarted on the same or a different machine is somewhat more complicated. The Notification service wasn't designed to initiate a connection to a client. It must wait for the client to reconnect before it can start accepting or delivering events. The difficulty is in having the client know when to initiate the reconnection, and where the Notification service is running in case it was necessary to move it to a new machine due to the failure.

Reconnection Registry

The reconnection registry provides an answer to the question of how the client knows where and when to reconnect to the Notification service. This TAO-specific interface is implemented by the EventChannelFactory in the reliable Notification Service. Clients can narrow the EventChannelFactory object reference to a Reconnection Registry interface, then register a Reconnection Callback object that will be notified when the Notification service has restarted and is ready for reconnections. The EventChannelFactory passes its own object reference to the Reconnection Callback object to inform the client where the Notification service is now running.

The interfaces involved are defined in the NotifyExt.idl file (in $TAO_ROOT/orbsvcs/orbsvcs) and are shown here:

  /**
   * \brief An interface which gets registered with a ReconnectionRegistry.
   *
   * A supplier or consumer must implement this interface in order to
   * allow the Notification Service to attempt to reconnect to it after
   * a failure.  The supplier or consumer must register its instance of
   * this interface with the ReconnectionRegistry.
   */
  interface ReconnectionCallback
  {
    /// Perform operations to reconnect to the Notification Service
    /// after a failure.
    void reconnect (in Object new_connection);

    /// Check to see if the ReconnectionCallback is alive
    boolean is_alive ();
  };

  /**
   * \brief An interface that handles registration of suppliers and consumers.
   *
   * This registry should be implemented by an EventChannelFactory and
   * will call the appropriate reconnect methods for all ReconnectionCallback
   * objects registered with it.
   */
  interface ReconnectionRegistry
  {
    typedef unsigned long ReconnectionID;
    ReconnectionID register_callback(in ReconnectionCallback reconnection);

    void unregister_callback (in ReconnectionID id);

    /// Check to see if the ReconnectionRegistry is alive
    boolean is_alive ();
  };
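
The interaction between the two interfaces can be modeled in plain C++ without an ORB. The sketch below is only a model of the pattern: the class names are made up, a std::string stands in for the CORBA object reference, and the notify_restart operation (including its choice to skip callbacks that report they are not alive) is an assumption of this example rather than part of the IDL. Real clients would instead implement the servant generated from NotifyExt.idl.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in for ReconnectionCallback. A std::string takes the place of the
// CORBA object reference that the real reconnect() receives.
class Callback
{
public:
  virtual ~Callback() {}
  virtual void reconnect(const std::string& new_connection) = 0;
  virtual bool is_alive() = 0;
};

// Stand-in for ReconnectionRegistry, as seen from the service side.
class Registry
{
public:
  typedef unsigned long ReconnectionID;

  Registry() : next_id_(1) {}

  ReconnectionID register_callback(Callback* callback)
  {
    assert(callback != 0);
    const ReconnectionID id = next_id_++;
    callbacks_[id] = callback;
    return id;
  }

  void unregister_callback(ReconnectionID id) { callbacks_.erase(id); }

  // Hypothetical: invoked once the restarted service is ready. Tells each
  // registered client where the service now runs and returns how many
  // clients were notified.
  std::size_t notify_restart(const std::string& new_location)
  {
    std::size_t notified = 0;
    std::map<ReconnectionID, Callback*>::iterator it;
    for (it = callbacks_.begin(); it != callbacks_.end(); ++it) {
      if (!it->second->is_alive()) continue;
      it->second->reconnect(new_location);
      ++notified;
    }
    return notified;
  }

private:
  ReconnectionID next_id_;
  std::map<ReconnectionID, Callback*> callbacks_;
};

// Example client that records every location it is told to reconnect to.
class RecordingClient : public Callback
{
public:
  std::vector<std::string> locations;
  void reconnect(const std::string& new_connection) { locations.push_back(new_connection); }
  bool is_alive() { return true; }
};
```

In a real client, reconnect() is the place to use the saved IDs to locate the existing admin and proxy objects through the new factory reference and connect to them again.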

Using Event Reliability

Configuring the Notification service for reliable event delivery is necessary, but not sufficient, to enable reliable handling of events. The application code in either the client or the server must configure the EventChannel through which the events are delivered to operate in the reliable mode. This is done by setting the QoSProperties named "ConnectionReliability" and "EventReliability" to the value "persistent" -- either at the time the channel is created or at a later time using the set_qos method.

Once a channel has been configured for reliable operation, persistence can be disabled on an event-by-event basis using QoSProperties of the event itself. This could be done, for example, to avoid the overhead of persistently storing events for which reliability is not needed.

The supplier sends events to the EventChannel using a push() method. For persistent events, this call will not return to the supplier until the Notification service is prepared to guarantee event delivery.

Application code in the Consumer should be written with the knowledge that events are guaranteed to be delivered, but during recovery from a failure there is a possibility that an event may arrive more than once. This could happen, for example, if the event was in the process of being delivered at the time the failure occurred and the failure prevents the Notification service from determining whether the delivery completed successfully. To meet its commitment that every event will be delivered, the Notification service will retry the delivery in this case, which may result in a duplicate event.

As long as this situation is understood at the time the application is designed, it should be possible for the application to handle this situation.
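
One common way for a consumer to tolerate redelivery is to remember recently seen events and skip repeats. The C++ sketch below assumes that the supplier application places a unique identifier in every event (for example a supplier name plus a sequence number); that identifier, the DuplicateFilter class, and its capacity parameter are all assumptions of this example and are not features described in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <string>

// Remembers the most recently seen event IDs so that a redelivered event
// can be recognized and skipped. The ID is assumed to be assigned by the
// supplier application and carried inside each event.
class DuplicateFilter
{
public:
  explicit DuplicateFilter(std::size_t capacity) : capacity_(capacity)
  {
    assert(capacity > 0);
  }

  // Returns true the first time an ID is seen, false for a repeat.
  bool first_delivery(const std::string& event_id)
  {
    if (!seen_.insert(event_id).second) return false;  // already seen
    order_.push_back(event_id);
    if (order_.size() > capacity_) {  // forget the oldest remembered ID
      seen_.erase(order_.front());
      order_.pop_front();
    }
    return true;
  }

private:
  std::size_t capacity_;
  std::set<std::string> seen_;
  std::deque<std::string> order_;
};
```

The consumer's push handler would call first_delivery and process the event only when it returns true. Because this filter lives in memory, a consumer that must also survive its own restart would need to persist the remembered IDs, or make its event processing idempotent instead.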