Configuring a Processor: Scheduling Tab

  1. The first configuration option is the Scheduling Strategy. 



    There are three possible options for scheduling components:

    Timer driven: This is the default mode. The Processor will be scheduled to run on a regular interval. The interval at which the Processor is run is defined by the ‘Run schedule’ option (see below).

    Event driven: When this mode is selected, the Processor will be triggered to run by an event, and that event occurs when FlowFiles enter Connections feeding this Processor. This mode is currently considered experimental and is not supported by all Processors. When this mode is selected, the ‘Run schedule’ option is not configurable, as the Processor is not triggered to run periodically but as the result of an event. Additionally, this is the only mode for which the ‘Concurrent tasks’ option can be set to 0. In this case, the number of threads is limited only by the size of the Event-Driven Thread Pool that the administrator has configured.

    CRON driven: When using the CRON driven scheduling mode, the Processor is scheduled to run periodically, similar to the Timer driven scheduling mode. However, the CRON driven mode provides significantly more flexibility at the expense of increasing the complexity of the configuration. The CRON driven scheduling value is a string of six required fields and one optional field, each separated by a space. These fields are:

    Field              Valid values
    Seconds            0-59
    Minutes            0-59
    Hours              0-23
    Day of Month       1-31
    Month              1-12 or JAN-DEC
    Day of Week        1-7 or SUN-SAT
    Year (optional)    empty, 1970-2099

    You typically specify values one of the following ways:

    • Number: Specify one or more valid values. You can enter more than one value using a comma-separated list.

    • Range: Specify a range using the <number>-<number> syntax.

    • Increment: Specify an increment using <start value>/<increment> syntax. For example, in the Minutes field, 0/15 indicates the minutes 0, 15, 30, and 45.

    You should also be aware of several valid special characters:

    • * - Indicates that all values are valid for that field.

    • ? - Indicates that no specific value is specified. This special character is valid in the Day of Month and Day of Week fields.

    • L - You can append L to one of the Day of Week values to specify the last occurrence of that day in the month. For example, 1L indicates the last Sunday of the month.
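
    The Number, Range, and Increment forms above can be illustrated with a small Python sketch. This is not how the scheduler parses expressions; expand_field is a hypothetical helper written only for this example, and it deliberately ignores month and day names and the ? and L characters.

```python
def expand_field(expr, low, high):
    """Expand one numeric CRON field into the sorted list of values it matches.

    Illustrative only: handles '*', comma-separated numbers, ranges (a-b),
    and increments (start/step). Names such as JAN or SUN and the '?' and
    'L' characters are not handled.
    """
    values = set()
    for part in expr.split(","):
        if part == "*":
            values.update(range(low, high + 1))
        elif "/" in part:
            start, step = part.split("/")
            first = low if start == "*" else int(start)
            values.update(range(first, high + 1, int(step)))
        elif "-" in part:
            first, last = (int(n) for n in part.split("-"))
            values.update(range(first, last + 1))
        else:
            values.add(int(part))
    # Reject anything outside the valid range for this field.
    bad = sorted(v for v in values if not low <= v <= high)
    if bad:
        raise ValueError(f"{expr!r} has values outside {low}-{high}: {bad}")
    return sorted(values)


MINUTES = (0, 59)  # valid range for the Minutes field, from the table above
print(expand_field("0/15", *MINUTES))   # [0, 15, 30, 45]
print(expand_field("10-12", *MINUTES))  # [10, 11, 12]
print(expand_field("30,5", *MINUTES))   # [5, 30]
```

    Read against the field order above, a full expression such as '0 0/15 * * * ?' would therefore select second 0 of minutes 0, 15, 30, and 45 of every hour.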


  2. The second configuration option is ‘Concurrent tasks’:



    This controls how many threads the Processor will use, or in other words, how many FlowFiles should be processed by this Processor at the same time. Increasing this value will typically allow the Processor to handle more data in the same amount of time, but it does so by using system resources that are then unavailable to other Processors. This field is available for most Processors, but some types of Processors can only be scheduled with a single concurrent task.
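
    As a rough analogy only (this is not the Processor scheduling code), the effect of this setting resembles the size of a thread pool: at most that many items are in progress at once. The sketch below uses Python's standard ThreadPoolExecutor; run_with_concurrent_tasks and the upper-casing "work" are hypothetical stand-ins.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_concurrent_tasks(flowfiles, concurrent_tasks):
    """Process items on a pool of `concurrent_tasks` threads.

    Returns (results, peak), where peak is the largest number of items
    that were being processed at the same moment.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def process(item):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # stand-in for real work on one FlowFile
        with lock:
            in_flight -= 1
        return item.upper()

    # The pool size plays the role of the 'Concurrent tasks' value.
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        results = list(pool.map(process, flowfiles))
    return results, peak


results, peak = run_with_concurrent_tasks(["a", "b", "c", "d", "e", "f"], concurrent_tasks=2)
print(results, "peak in flight:", peak)
```

    The peak never exceeds the configured value, which is also why raising it takes threads away from everything else sharing the same machine.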

  3. “Run schedule” dictates how often the Processor should be scheduled to run.



    The valid values for this field depend on the selected Scheduling Strategy (see above). If using the Event driven Scheduling Strategy, this field is not available. When using the Timer driven Scheduling Strategy, this value is a time duration specified by a number followed by a time unit, such as '10 secs' or '5 mins'. When using the CRON driven Scheduling Strategy, this value is the CRON string described under that strategy above.
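
    To make the "number followed by a time unit" format concrete, here is a minimal Python sketch that converts such a string to seconds. It is not the application's own parser: parse_duration is a hypothetical helper, and the unit table is an assumption for illustration, since only 'secs' and 'mins' appear in the text above.

```python
import re

# Assumed unit spellings for illustration; only 'secs' and 'mins' come from the guide.
UNIT_SECONDS = {
    "sec": 1, "secs": 1,
    "min": 60, "mins": 60,
    "hr": 3600, "hrs": 3600,
}


def parse_duration(text):
    """Convert a string such as '10 secs' into a whole number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"expected '<number> <unit>', got {text!r}")
    amount, unit = match.groups()
    try:
        factor = UNIT_SECONDS[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown time unit: {unit!r}") from None
    return int(amount) * factor


print(parse_duration("10 secs"))  # 10
print(parse_duration("5 mins"))   # 300
```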

    Note: When configured for clustering, an Execution setting will be available. This setting determines which node(s) the Processor will be scheduled to execute on. Selecting All Nodes will result in this Processor being scheduled on every node in the cluster. Selecting Primary Node will result in this Processor being scheduled on the Primary Node only.


  4. The right-hand side of the tab contains a slider for choosing the ‘Run duration.’ This controls how long the Processor should be scheduled to run each time that it is triggered. The left-hand side of the slider is marked ‘Lower latency’ and the right-hand side is marked ‘Higher throughput.’ When a Processor finishes running, it must update the repository in order to transfer the FlowFiles to the next Connection. Updating the repository is expensive, so the more work that can be done before each update, the more work the Processor can handle (higher throughput). However, the next Processor cannot start processing those FlowFiles until the previous Processor updates the repository, so the latency (the time required to process a FlowFile from beginning to end) will be longer. The slider therefore provides a spectrum along which the DFM can choose to favor lower latency or higher throughput.
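
    The trade-off can be seen with a toy calculation. The Python sketch below assumes a fixed cost per FlowFile and a fixed cost per repository update; both numbers and the batch_stats function are hypothetical, chosen only to show the shape of the curve, and do not describe measured behavior.

```python
def batch_stats(run_duration_ms, work_ms_per_flowfile, commit_ms):
    """Toy model of one Processor: return (FlowFiles per second, worst wait in ms).

    Assumes each FlowFile costs a fixed amount of work and each repository
    update costs a fixed amount of time, paid once per run.
    """
    # A run always handles at least one FlowFile, even with a run duration of 0.
    batch = max(1, run_duration_ms // work_ms_per_flowfile)
    cycle_ms = batch * work_ms_per_flowfile + commit_ms
    throughput = batch / cycle_ms * 1000
    # The first FlowFile of a run is not handed on until the update completes.
    worst_wait_ms = cycle_ms
    return throughput, worst_wait_ms


for run_duration in (0, 25, 100, 1000):
    rate, wait = batch_stats(run_duration, work_ms_per_flowfile=1, commit_ms=10)
    print(f"{run_duration:>5} ms run: {rate:7.1f} FlowFiles/s, up to {wait} ms before handoff")
```

    Under these assumptions, longer runs spread the update cost over more FlowFiles, so throughput climbs toward the per-FlowFile limit while the wait before handoff grows with the run length.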