At my job, we run many MS SQL jobs for crunching data. And on rare occasionally, we'll see jobs that would get stuck in its running state for unknown reasons. Since we do not have external softwares to monitor the SQL jobs and I do not wish to check the job results daily, I created a solution that would allow us to at least know whether a job has exceeded its expected run time. That way, I can just go and check that specific job and if I feel that it's indeed stuck, I can stop and restart it.
My solution was another SQL job that would monitor the specified job. :) I know what you are thinking, what if the job that monitors the other job hangs? Well, I do worry about that as well but I'm hoping these monitor jobs are simple enough that they will not hang (the jobs that got stuck have been complex and usually long running.)
How does my solution work? I created a stored procedure that takes in name of the job to monitor and the expected duration as parameters. As a first step of the job that I want to monitor, I would call this stored procedure and a new job would get created. The newly created job's only task is to check whether the job specified in the parameter is still running after the specified duration starting from the time that it was created. And if the specified job exceeds the specified duration, I send an email out to notify the developers in our team.
When I was searching for a solution on the internet to our job hanging problem, I didn't find too many things. That's why I thought it might be helpful if I post my source code out there for other people to use.
The source code is on github.com. A link is also provided on the clearcubic.com projects page.
No comments:
Post a Comment