How to do Massive, Slow Database Operations Efficiently With Event-Driven Recursion

Scenario: You need to perform a very heavy scripted database operation, on a very large number of times.

To do this all at once, would: (1) slow down your instance, and (2) take longer than the maximum allowable transaction time/sql count/etc. and (3) other bad stuff.
Each operation takes between 10 and 60 seconds because of the additional business logic that needs to run.
You can't optimize the operations, they're simply slow, and there's a lot of them.

Example: You need to reclassify 100,000 CIs from one class to another (a very heavy operation on a large number of records).
How do you handle this?

I’ve run into this scenario a lot, and in every team I’ve been on (which is a fair number), the go-to answer, is to run a scheduled script which does the operation on a batch (some specific number) of records at a certain interval.

But imagine if you have the job run every 10 minutes, and it deletes 20 records per run.
You can imagine a scenario where the instance or scheduled job is particularly slow due to uncontrollable circumstances (such as the volume of integrations hitting the instance at that particular moment, or a discovery run happening simultaneously with the scheduled script). In this case, the job could take some amount of time longer than the average interval between jobs, in which case the jobs would begin to "pile up" and result in a sort of traffic jam that would both be inefficient, and be a massive drain on instance performance.

So the way to do this safely, would be to figure out what is the longest amount of time you could imagine one "batch" taking, and then setting the scheduled job interval to a little bit more than that amount of time multiplied by the number of records per batch. For example, if you do a 20-record batch and each record takes takes between 10 and 60 seconds, you might want to run one batch every 12 or so minutes.

However, if you need to update 100k records at a rate of 1 every 1.2 minutes, that is 120,000 minutes, or 83 days (24 hours per day).

So, for an operation like this, how can we make it the most efficient that it can be, without bogging down our instance?
The solution: Event-driven recursion.

Using a scheduled job with a gap between each job prevents the “pile-up” effect, which is good, and it gives anything else which might be piling up in the queue (such as a discovery run) to have some time to run. But, since each job might take a different amount of time, we have to be overly conservative with our intervals.
However, with event-driven recursion, we can set it up so that each batch triggers the next! This would prevent the pile-up, and send the next batch to the “back of the line” in terms of the scheduling queue. The result would be that the batches would run slower when the instance is busy, but much quicker when the queue is empty, preventing the pile-up effect, and preventing bogging down the instance, both without sacrificing much in terms of the performance of our actual operations (which we’d like to get done before Jules Verne’s round-the-world journey finishes).


How is it done?

Event-driven recursion consists of two records:

  1. An Event registry

  2. A Script Action (to be triggered by the event)

  3. A background script, run once, to trigger the event.


Step 1: The Event

Step one is pretty simple. Just go to the event Event Registry [sysevent_register] table, and create a new event record. This is a very simple record; all you need is a Name, and a Description of the event and what should trigger it (in the Fired by field).

Step 2: The Script Action

Next, head over to the Script Action [sysevent_script_action] table, and create a new record. This is the script that will execute each time the event is fired. It’s also the script that’s going to be firing the event.
Give the script a Name, set the order to 1000, and set the Event name field to whatever event you created in Step 1. Finally, add your script into the Script field.

Your script will have to follow a few basic guidelines, but here is a basic template you can use:

function eventWrapper(limit, currentNumber) {
    //Make sure we have proper values
    limit = Number(limit) > 0 ? Number(limit) : 10; //Default: 10
    currentNumber = currentNumber > 0 ? currentNumber : 0; //Default: 0
    var gr = new GlideRecord('table_name');
    gr.setLimit(limit);
    gr.addQuery('some_query');
    gr.query();
    if (!gr.hasNext()) {
        gs.log('EDR: [Script name here] completed with ' + currentNumber + ' total records processed.');
        return; //No more records to process. Halt function, and loop.
    }
    while (gr.next()) {
        /*
         * Do something to a batch of records here, based on the limit parameter
         */
        currentNumber++;
    }
    //Log a progress update
    gs.log('EDR: [Script name here] processed ' + currentNumber + ' records so far.');
    //Trigger the event again
    gs.eventQueue('event.name', gr, limit, currentNumber);
}

eventWrapper(event.parm1, event.parm2);

Step 3: The Background Script

The last thing you’ve got to do, is kick off this recursive process.
Simply copy the code in your Script Action, and on the last line, replace event.parm1 and event.parm2 with the desired batch size and the number 0 respectively:

eventWrapper(25, 0);

And that’s it! The script will continue to re-execute itself until it completes processing all records.