We are using Eggplant Manager to run sets of tests. One of our tests, after we added a very long-running script, started failing, and the Manager reports the status "Timed out".
HOWEVER! The script is not actually timing out (in the sense that it is running longer than the allotted time). Here's how I know:
-The test has a timeout value of about 12 hours. The test hits the "Processing" stage after about 8 hours, and stays there for 3-4 hours until timing out.
-There are no results in EM, and the results folder is completely empty. In a legitimate timeout, there are results up to the point of timeout.
-At the end of each script, we write some stats data to CSV for later parsing. This final step completes successfully in every script in the test, several hours before timeout.
Now, here's the tricky part. In troubleshooting, we thought maybe there was a limit of test cases per test. We counted 260ish test cases in this test (across several scripts), and speculated that there might be a cap at 255. We commented out some of the test cases and the test passed! However, this theory was quickly proven wrong when we realized we had another test that had over 500 test cases.
Now, you're probably guessing that one of the test cases I commented out was hitting an infinite loop or something to cause timeout. This is not the case! This is not a "normal" timeout for reasons listed above, and furthermore, the script finishes in the expected time when run by itself. As a final check, I added another script to the test (which was working with some cases commented out) and we saw the exact same behavior: EM reports timeout, but there are no results for tests before timeout, test sticks in "Processing" for several hours, results folder is completely empty.
So I checked the EM log and found it mostly unhelpful. There are a few lines that might hint at what's going on:
ERROR !!!!! failed to retrieve file list from http://127.0.0.1:3100/list_files_for_path
ERROR: Net::ReadTimeout - Net::ReadTimeout
[several lines of jruby stuff. looks vaguely like a stack trace, but knowing nothing about jruby, I'm not sure. Some references to "block in init_worker" and "block in request". Nothing jumps out as an explanation to the problem]
DEBUG: --- copy files from execution (class type): Array
DEBUG: **** Download complete!
ERROR: Test run server (Local Host) had an error retrieving test results from agent.
DEBUG: ---agent execution complete execution server completed...
These lines appear about 2.5 minutes after the test completes; the test then sits in the "Processing" state for several hours before timing out. The logs are on an off-network, air-gapped machine, so I can't copy/paste more, but if it would be helpful, I can find a way to get more of the log.
The symptoms really look like there is a limit on test cases, but we have a test with hundreds more that has never given us a problem. ANY hints on where I can start looking to fix this would be HUGELY appreciated.
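In case it helps with getting more of the log off the air-gapped machine, here is a rough sketch of how I could cut it down to just the ERROR lines plus a little surrounding context. It assumes each log entry starts with a level word such as ERROR or DEBUG, as in the excerpt above; the function name and the context size are made up for illustration, and the real log format may differ.

```python
import re

# Assumption: an error entry begins with the word ERROR (e.g. "ERROR:" or
# "ERROR !!!!!"), as in the excerpt above. Adjust if the real log differs.
ERROR_RE = re.compile(r"^\s*ERROR\b")


def error_excerpts(lines, context=2):
    """Return the ERROR lines plus `context` lines before and after each one.

    `lines` is a list of log lines; the original order is preserved and
    overlapping context windows are merged so no line appears twice.
    """
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_RE.match(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]
```

To use it, I would read the log with something like `open(path).read().splitlines()` (where `path` is wherever the EM log lives on that machine) and transcribe only what `error_excerpts` returns.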