
Fix Job Timeout

Authority
Posts: 20

I have a CCS script that removes port-security commands from all ports on a Cisco 4500 switch. The job ran for 13 minutes, successfully removing the port-security commands from each port. At 13 minutes into the job it quit and logged the message:

 

*** Job Failed Due to Timeout ***

 

Is there a parameter that kills a long-running job? Or is there a timeout value I can set in a CCS script to keep the job running longer?

Re: Fix Job Timeout

Adviser
Posts: 53

Hi,

 

This is probably way more detail than you are looking for, but the following describes in detail how timeouts work in the CCS and Perl job engines:

 

Thanks,

- Chris

 

6.8.7 CCS Script Timeouts

 

CCS scripts can contain three types of timeout: Script-Timeout, Action-Timeout and Trigger-Timeout. Script-Timeout specifies the per-command timeout for each command in the script (i.e. Trigger-Commands and Action-Commands). If not specified, Script-Timeout defaults to 60 seconds. Action-Timeout and Trigger-Timeout are similar, except that they increase or decrease the per-command timeout only for the commands in the given Action or Trigger block. All timeouts are specified in seconds.
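
As a rough illustration of where the per-block timeouts would go, here is a sketch of one Action block and one Trigger block. It assumes Action-Timeout and Trigger-Timeout are written in the same "Name: value" form as Script-Timeout and sit alongside the other attributes of the block they apply to. The block names, the $ifName variable and the values 120 and 30 are made up for the example.

```
Action:
Example Action

# Assumed syntax: per-command timeout (seconds) for this Action block only
Action-Timeout: 120

Action-Commands:
show ip int brief

Output-Triggers:
Example Trigger

################
Trigger:
Example Trigger

# Assumed syntax: per-command timeout (seconds) for this Trigger block only
Trigger-Timeout: 30

Trigger-Variables:
$ifName /\S+/

Trigger-Template:
[[$ifName]]\s+unassigned

Trigger-Commands:
show run interface $ifName
```

Commands in any block without its own timeout would still fall back to Script-Timeout, or to the 60 second default if that is not set either.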

 

A script watchdog process exists to prevent/catch runaway scripts. The maximum amount of time the watchdog process will allow the script to run is determined by examining the script execution flow while taking the Script-Timeout, Action-Timeout and Trigger-Timeout into consideration. This timeout is restricted to the range of 5 to 120 minutes. Note, this algorithm may break down in situations where there is a lot of Trigger looping over Action-Commands/Trigger-Commands output. This is because when examining the script execution flow, it is impossible to take into consideration the dynamic output/processing that the script will encounter at runtime. In such cases, the Script-Timeout, Action-Timeout and/or Trigger-Timeout can be increased to compensate for this.

 

6.8.7 Perl Script Timeouts

 

Perl scripts can contain one type of timeout, Script-Timeout. Script-Timeout specifies the script watchdog process timeout (explained above). If not specified, Script-Timeout defaults to 5 minutes. This timeout is restricted to the range of 5 to 120 minutes. The per-command timeout for each command in the script is fixed at 60 seconds. All timeouts are specified in seconds.

 

Changes to CCS Script Timeouts in 6.9.2

 

The upper bound of the Script-Timeout range has been increased to 240 minutes.

 

Changes to Perl Script Timeouts in 6.9.2

 

The per-command timeout for each command in the script can now be overridden by using the NetMRI_Easy.pm send_async_command method. Additionally, the upper bound of the Script-Timeout range has been increased to 240 minutes.

 

Re: Fix Job Timeout

Authority
Posts: 20

I am running 7.0.3 code. Do you have any examples of how to set the Script-Timeout, Action-Timeout and Trigger-Timeout values?

 

My job does a lot of Trigger looping over Action-Commands/Trigger-Commands output, so it sounds like I need to increase the timeouts. I am not clear on how or where to set these timeout values.

 

Here is my script below:

 

================================================

 

Script-Filter:
$Vendor eq "Cisco" and $Type in ["Switch","Switch-Router"] and $sysDescr like /IOS/

################

Action:
Find Interfaces

Action-Commands:
SET: $UpdateMade = "no"
sho ip int brief

Output-Triggers:
Process Interfaces

################
Trigger:
Process Interfaces

Trigger-Description:
Find valid interfaces to check.

Trigger-Variables:
$IntName /(\w+\d+(\/\d{1,2}|\/\d{1,2}\/\d+|\/\d{1,2}\.\d+|\/\d{1,2}\:\d+)?|\w+-\w+\d{1,3})/

Trigger-Template:
[[$intName]]\s+unassigned

Trigger-Commands: {$UpdateMade eq "no"}
show run interface $intName
SET:$cmdsRemoved = "no"

Trigger-Commands: {$UpdateMade eq "yes"}
do show run interface $intName
SET:$cmdsRemoved = "no"

Output-Triggers:
ParseOutput
################
Trigger:
ParseOutput

Trigger-Variables:
$cmd /switchport\sport-security\smaximum|switchport\sport-security\sviolation\srestrict|switchport\sport-security\saging\stime|switchport\sport-security\saging\stype|switchport\sport-security/

Trigger-Template:
[[$cmd]]

Trigger-Filter:
$cmd like /port-security/

Trigger-Commands: {$UpdateMade eq "no"}
config t

# Only remove the commands 1 time, not for each match of "port-security"

Trigger-Commands: {$cmdsRemoved eq "no"}
int $intName
no switchport port-security
no switchport port-security maximum
no switchport port-security violation restrict
no switchport port-security aging time
no switchport port-security aging type inactivity
exit
SET:$UpdateMade = "yes"
SET:$cmdsRemoved = "yes"

 


########
Action:
End and Write Memory

Action-Description:
End and Write Memory only if we entered config mode.

Action-Commands: {$UpdateMade eq "yes"}
end
write mem
SET:$UpdateMade = "no"

 

 

Re: Fix Job Timeout

Adviser
Posts: 53

Sure, simply add "Script-Timeout: 240" (or some other number of your choice) above or below the Script-Filter attribute. Just FYI: 13 minutes is a long time for a script to run against a single device. It is likely due to one of the Trigger-Variables regular expressions. Complex regular expressions can cause the job engine to use lots of CPU and prolong the script execution time.
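
For example, using the Script-Filter from the script posted earlier in this thread, the top of the script might then look like this. This is only a sketch: 240 is the example value from above, and placing the line above Script-Filter rather than below it is an arbitrary choice between the two options.

```
# Per-command timeout in seconds (defaults to 60 when omitted)
Script-Timeout: 240

Script-Filter:
$Vendor eq "Cisco" and $Type in ["Switch","Switch-Router"] and $sysDescr like /IOS/

################

Action:
Find Interfaces
```

The rest of the script stays as it is.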

 

Thanks,

- Chris

 

 

Re: Fix Job Timeout

Expert
Posts: 263

Of course, Chris is the expert on this, but I can't imagine how any amount of looping through interfaces in the CCS engine could hit the timeout.  In the session log, could you see how far it progressed just prior to the timeout?  The other thing that can cause a timeout is if a command that is sent to a device results in an error message such that the expected CLI prompt is not returned.  That can cause the CCS Expect parser to time out looking for one of the expected prompt types.

Re: Fix Job Timeout

Authority
Posts: 20

I was about 67% of the way through the 240 interfaces on the switch when it stopped. There were no errors; it just stopped. I modified the timeouts today and reran on 6 switches without issue. I am not sure how a timeout value of 240 seconds will help with a switch that was running for more than 13 minutes.

 

I plan to run this script on a group of 150 switches, ranging from 48 to 240 ports each. Each switch will take a different amount of time. Watching the script run, you can see it takes about 5-6 seconds per port (session log real-time output). At 6 seconds per port, a switch with 240 ports should take approximately 24 minutes to finish.

 

My question:  How would a Script-Timeout of 240 help this situation?

 

2nd Question:  How many switches does it run in parallel if I have a group of 150 switches?

Re: Fix Job Timeout

Adviser
Posts: 53

The Script-Timeout specifies the per-command timeout for each command in the script (i.e. Trigger-Commands and Action-Commands). If not specified, Script-Timeout defaults to 60 seconds.

 

So, for each command, such as:

 

int $intName
no switchport port-security
no switchport port-security maximum

 

The job engine was allowing 60 seconds per command before (because no Script-Timeout was specified and the default is 60 seconds), whereas with the addition of "Script-Timeout: 240" it is now allowing 240 seconds per command. The per-command timeout and the number of commands in the script are used to calculate the overall time the script is allowed to run, so we have effectively increased this time by 4x.

 

Regarding the number of jobs that run in parallel, I don't have the specifics at the moment, but I can tell you that the number depends on the platform (e.g. a 1400 model may run 10 in parallel, whereas a 4000 model may run 40 in parallel). Again, don't quote me on those numbers.

 

Thanks,

- Chris

 

 

Re: Fix Job Timeout

Authority
Posts: 20

Thanks Chris. So it appears my job may have quit prematurely because of excessive CPU from complex regex searches while other processes were running on the appliance. We have a 2200 NetMRI appliance. I was watching the job run from the real-time session log. There was never a point when a specific command hung for more than 5-6 seconds, so I never approached a 60-second timeout. So it appears that timer increases will not help my situation.

 

I have a job running tomorrow on 8-10 switches and then on 150 switches Monday. I will report back on the success or failure of these jobs.

 

Thank you for your help.

Re: Fix Job Timeout

Adviser
Posts: 53

 

I think increasing the Script-Timeout will definitely help in this situation. What may be confusing here is that there are multiple timeouts at play. First, there is the per-command timeout (i.e. how long the job engine will wait for each individual command to complete). Second, there is a script watchdog timeout. The script watchdog is used to monitor how long the entire script can run. If the script watchdog timeout is exceeded, the script is killed. This feature is used to prevent "run away" scripts that could be caused by an infinite loop or inefficient regular expressions which take up excessive CPU and thus time. The script watchdog timeout is computed by summing all of the per-command timeouts in the script (it actually isn't this simple, but this is the basic concept). I believe it is the script watchdog timeout that is causing the job to be killed in this situation. Increasing the per-command timeout by 4x (by setting "Script-Timeout: 240") will effectively increase the script watchdog timeout by 4x, which should help.

 

Thanks,

- Chris

 

Re: Fix Job Timeout

Authority
Posts: 20

Thanks Chris. The script ran on 15 switches last night and completed without error. Monday we will be running this on a device group with approximately 160 switches. Fingers crossed....

Re: Fix Job Timeout

Authority
Posts: 20

Chris,

 

The job ran on 169 switches this morning without error or any timeout issues. I believe the original timeout may have been an anomaly, possibly caused by a complex regular expression as you stated above. Either way, I am happy with the results. It saved us a ton of time being able to run a script to remove 5 commands from every port on 169 switches.

 

Thank you for your help.

Re: Fix Job Timeout

Adviser
Posts: 53

You're welcome. I'm glad it worked. In your situation, it was the large number of interfaces that was causing the following to come into play (pasted from above): "Note, this algorithm may break down in situations where there is a lot of Trigger looping over Action-Commands/Trigger-Commands output. This is because when examining the script execution flow, it is impossible to take into consideration the dynamic output/processing that the script will encounter at runtime. In such cases, the Script-Timeout, Action-Timeout and/or Trigger-Timeout can be increased to compensate for this."

 

Thanks,

- Chris

 

 
