CheckMate
The CheckMate Task Execution Framework is a lightweight, extensible system within the EleanorAI Framework for managing long-running tasks, with support for checkpointing, rollbacks, and retries. It provides a structured way to execute sequences of stages, which makes complex operations more reliable and fault tolerant. The EleanorAI Framework uses CheckMate to run the complex, multi-step tasks commonly encountered in compute-heavy AI operations. Key features include:
- Checkpointing: Save the state of each stage to allow tasks to resume from the last successful point in case of failures or interruptions.
- Rollbacks: Revert changes made by stages if errors occur, maintaining data integrity throughout the task execution.
- Retry Policies: Configure custom retry strategies for stages, specifying how many times and under what conditions a stage should retry upon failure.
- Pluggable Persistence Interface: Abstract the storage layer, allowing for different persistence backends, including future support for object storage systems.
- Persistent State Management: Maintain comprehensive state information for tasks and stages, enabling monitoring and precise control over the execution flow.
The CheckMate implementation centers around the `CheckMateEngine`, which serves as the core orchestrator of task execution. It manages the lifecycle of tasks, handles state transitions, and coordinates the execution of stages according to defined retry policies. Within this framework, tasks are composed of multiple stages, each representing an individual unit of work. These stages inherit from `BaseStage` and implement the `execute_stage` method, where the business logic is defined.
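The relationship between stages and the orchestrator can be illustrated with a minimal, self-contained sketch. The `BaseStage` class below is a simplified stand-in rather than the framework's real class, the `execute_stage` signature (a dict in, a dict out) is an assumption, and `MiniEngine` is a hypothetical toy that only mimics the idea of checkpointing after each stage.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class BaseStage:
    """Simplified stand-in for CheckMate's BaseStage (signature is assumed)."""

    name = "base"

    def execute_stage(self, state: Dict[str, object]) -> Dict[str, object]:
        raise NotImplementedError


class LoadNumbers(BaseStage):
    name = "load"

    def execute_stage(self, state):
        # Business logic lives here: produce data for later stages.
        return {**state, "numbers": [1, 2, 3]}


class SumNumbers(BaseStage):
    name = "sum"

    def execute_stage(self, state):
        return {**state, "total": sum(state["numbers"])}


@dataclass
class MiniEngine:
    """Hypothetical toy orchestrator: runs stages in order, checkpointing each."""

    stages: List[BaseStage]
    checkpoint: Dict[str, object] = field(
        default_factory=lambda: {"done": [], "state": {}}
    )

    def run(self) -> Dict[str, object]:
        state = dict(self.checkpoint["state"])
        for stage in self.stages:
            if stage.name in self.checkpoint["done"]:
                continue  # completed in an earlier run, so skip it
            state = stage.execute_stage(state)
            # Record progress after every successful stage.
            self.checkpoint = {
                "done": self.checkpoint["done"] + [stage.name],
                "state": dict(state),
            }
        return state


engine = MiniEngine([LoadNumbers(), SumNumbers()])
result = engine.run()
print(result["total"])
```

Because the checkpoint records which stages have finished, calling `run()` again on the same engine skips completed stages and returns the saved state.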
To ensure consistency and prevent invalid operations, the framework uses a state machine, specifically the `_StageStateMachine`, to manage valid state transitions for each stage. Concurrency control is provided by `BindLock`, which prevents concurrent modifications to task state files and so preserves thread safety and data integrity. CheckMate uses the `TaskIO` interface for persistence; it manages the reading and writing of task and stage states, which allows flexible integration with various storage backends.
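To show what guarding stage transitions with a state machine might look like, here is a small sketch. The status names, the transition table, and the class `StageStateMachineSketch` are all assumptions made for illustration; the actual states and rules used by `_StageStateMachine` are not described here.

```python
from enum import Enum


class StageStatus(Enum):
    # Hypothetical status names, chosen for illustration only.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLED_BACK = "rolled_back"


# Each status maps to the set of statuses it may legally move to.
ALLOWED = {
    StageStatus.PENDING: {StageStatus.RUNNING},
    StageStatus.RUNNING: {StageStatus.COMPLETED, StageStatus.FAILED},
    StageStatus.FAILED: {StageStatus.RUNNING, StageStatus.ROLLED_BACK},
    StageStatus.ROLLED_BACK: {StageStatus.RUNNING},
    StageStatus.COMPLETED: set(),
}


class InvalidTransition(Exception):
    """Raised when a transition is not in the ALLOWED table."""


class StageStateMachineSketch:
    def __init__(self) -> None:
        self.status = StageStatus.PENDING

    def transition(self, new_status: StageStatus) -> None:
        if new_status not in ALLOWED[self.status]:
            raise InvalidTransition(f"{self.status.name} -> {new_status.name}")
        self.status = new_status


machine = StageStateMachineSketch()
machine.transition(StageStatus.RUNNING)
machine.transition(StageStatus.COMPLETED)

try:
    machine.transition(StageStatus.RUNNING)  # a completed stage cannot restart
    rejected = False
except InvalidTransition:
    rejected = True
```

Keeping the rules in a single table makes invalid operations fail loudly at the point of the transition instead of corrupting persisted state later.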
Error handling is a critical aspect of the design, with mechanisms in place to capture and manage exceptions during stage execution, allowing for retries or rollbacks based on the defined policies.
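A retry policy that bounds the number of attempts and restricts which exceptions are retryable could be sketched as follows. `RetryPolicySketch` and `run_with_retries` are hypothetical names; CheckMate's real retry policy classes and their parameters may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Type


@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical policy: how many attempts, and which errors are retryable."""

    max_attempts: int = 3
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,)


def run_with_retries(
    action: Callable[[], str], policy: RetryPolicySketch
) -> Tuple[str, int]:
    """Run `action`, retrying on the policy's exception types.

    Returns the action's result together with the number of attempts used.
    Exceptions outside `policy.retry_on` propagate immediately.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return action(), attempt
        except policy.retry_on:
            if attempt >= policy.max_attempts:
                raise  # attempts exhausted: let the caller decide on rollback


calls: Dict[str, int] = {"n": 0}


def flaky() -> str:
    # Fails twice with a retryable error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome, attempts = run_with_retries(flaky, RetryPolicySketch(max_attempts=5))
```

Once the attempts are exhausted the exception is re-raised, which is the point where an orchestrator would typically decide whether to roll the stage back.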
CheckMate tasks are namespace agnostic.
Implementing Stages
- All custom stages share a common state model, which is updated as each stage executes.
- State models need to subclass `BaseStageState`.
- Attributes in stage state must be declared in the state model and set to `None` by default. This is very important because the CheckMate stage execution algorithm uses Pydantic default values to determine whether current state attributes need to be overridden.
- Stages must check for required attributes before execution and must raise a `MissingStateException` if any required attributes are missing.
- Best practice: start each stage with a new RDBMS session and use `session_context` to manage service context at each stage.
- Best practice: stages should interact with the service layer rather than calling the DAOs directly or invoking self-hosted API calls.
- Best practice: there is no need to add logging or timing log messages; stage execution history is tracked by `CheckMateEngine` automatically and written to the persistence store.
- It is generally considered an anti-pattern to persist raw ORM objects in the stage state. Instead, persist the minimum amount of information required to re-create the object in the future. This helps prevent future developers from assuming RDBMS session binding when restoring state.
- Naming convention: use the module naming convention `NAME_cm.py` for the module that contains the subclassed `BaseStage` and `BaseStageState` classes. Inside this module, define a `cm_factory_NAME` function that returns a pre-configured instance of `CheckMate`.
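The state rules above (every attribute defaults to `None`, and stages validate their required inputs) can be sketched without the framework. This example uses standard-library dataclasses as a stand-in for a Pydantic model, and `MissingStateException` is redefined locally as a placeholder. `IngestState`, `merge_state`, and `require` are hypothetical names, and the merge logic is only one plausible reading of how default values could signal which attributes to override.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


class MissingStateException(Exception):
    """Local placeholder for the framework exception of the same name."""


@dataclass(frozen=True)
class IngestState:  # in CheckMate this would subclass BaseStageState
    # Every attribute is declared up front and defaults to None.
    source_path: Optional[str] = None
    row_count: Optional[int] = None


def merge_state(current: IngestState, update: IngestState) -> IngestState:
    """Copy only the attributes that `update` actually set (non-None)."""
    changes = {
        f.name: getattr(update, f.name)
        for f in fields(update)
        if getattr(update, f.name) is not None
    }
    return replace(current, **changes)


def require(state: IngestState, *names: str) -> None:
    """Raise if any required attribute is still at its None default."""
    missing = [n for n in names if getattr(state, n) is None]
    if missing:
        raise MissingStateException(
            f"missing required state: {', '.join(missing)}"
        )


state = IngestState(source_path="data.csv")
state = merge_state(state, IngestState(row_count=42))
require(state, "source_path", "row_count")  # passes: both values are set
```

Because unset attributes stay at `None`, a stage's partial update never erases values written by earlier stages, and a missing prerequisite is reported before any work starts.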
Rollbacks
- Only implement `rollback` operations when necessary; for example, many operations will either be idempotent or will be rolled back automatically by the RDBMS session management framework on error.
- Rollbacks need to be implemented such that, on error, the stage can re-run from a clean state.
- Typically, rollbacks are implemented in the first and/or last stages in the pipeline.
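As an illustration of a rollback that leaves a stage re-runnable from a clean state, the sketch below writes a file, fails partway through, and removes the partial output before retrying. `ExportStageSketch` is a hypothetical class that does not use the real `BaseStage` API, and the argument-free `execute_stage` and `rollback` signatures are assumptions.

```python
import tempfile
from pathlib import Path


class ExportStageSketch:
    """Hypothetical stage with a side effect that is not undone automatically."""

    def __init__(self, out_dir: Path, fail_once: bool = True) -> None:
        self.out_file = out_dir / "export.txt"
        self.fail_once = fail_once

    def execute_stage(self) -> None:
        self.out_file.write_text("partial")
        if self.fail_once:
            self.fail_once = False
            raise RuntimeError("simulated crash after a partial write")
        self.out_file.write_text("complete")

    def rollback(self) -> None:
        # Remove partial output so the next attempt starts clean.
        # missing_ok requires Python 3.8 or newer.
        self.out_file.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    stage = ExportStageSketch(Path(tmp))
    try:
        stage.execute_stage()
    except RuntimeError:
        stage.rollback()
    cleaned = not stage.out_file.exists()

    stage.execute_stage()  # second attempt runs against a clean directory
    final_text = stage.out_file.read_text()
```

A file write is used here because it is a side effect that RDBMS session management would not revert, which is the kind of case where an explicit rollback is warranted.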