The root cause of this issue was slow response times from a third-party application we integrate with. These slow responses caused some requests to back up and disrupted service for other endpoints, because our infrastructure tied up resources that should have been processing web requests. Rather than promptly processing requests, threads were waiting on third-party APIs to return results, which caused requests to queue.
While we did have caching on these third-party API requests, and reasonable timeouts set, those timeouts were not aggressive enough to keep the application working when we were also experiencing moderate traffic.
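To illustrate why a timeout that looks reasonable can still be too long under moderate traffic, here is a rough back-of-the-envelope sketch based on Little's law (average concurrency = arrival rate × time in system). The pool size, request rate, and timeout values are made-up numbers for illustration only, not measurements from this incident.

```python
def worst_case_busy_threads(requests_per_second: float, timeout_seconds: float) -> float:
    """Little's law: threads held = arrival rate x time each request holds a thread.

    Uses the timeout as the worst-case hold time, i.e. every call to the
    third party is slow enough to run all the way to the timeout.
    """
    return requests_per_second * timeout_seconds


def pool_saturates(requests_per_second: float, timeout_seconds: float, pool_size: int) -> bool:
    """True if slow third-party calls alone could occupy the whole thread pool."""
    return worst_case_busy_threads(requests_per_second, timeout_seconds) >= pool_size


if __name__ == "__main__":
    POOL_SIZE = 100  # hypothetical number of web worker threads
    RATE = 20.0      # hypothetical requests/second that call the third party

    for timeout in (10.0, 2.0):  # hypothetical "reasonable" vs. "aggressive" timeout
        busy = worst_case_busy_threads(RATE, timeout)
        print(f"timeout={timeout}s busy={busy:.0f} saturated={pool_saturates(RATE, timeout, POOL_SIZE)}")
```

With these illustrative numbers, the longer timeout lets slow calls hold more threads than the pool has, while the shorter one leaves capacity for other endpoints.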
We’ve now updated the platform with more aggressive timeout settings, so it gives up more quickly when certain third-party CRMs are excessively slow. This should make the application more resilient to this type of slowdown and request queueing.
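As a minimal sketch of the general pattern (not our actual implementation), the snippet below gives up on a slow third-party call after a short timeout and falls back to a previously cached value. The function names, the `fetch` callable, the 2-second default, and the fake CRM clients are all hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Dict, Optional


def get_crm_record(
    key: str,
    fetch: Callable[..., str],
    cache: Dict[str, str],
    timeout_s: float = 2.0,  # hypothetical value; tune to your own traffic
) -> Optional[str]:
    """Fetch a record, giving up after timeout_s and falling back to the cache."""
    try:
        # Assumption: `fetch` enforces the timeout itself and raises
        # TimeoutError when the third party does not answer in time.
        value = fetch(key, timeout=timeout_s)
    except TimeoutError:
        # Free this thread quickly: serve a possibly stale cached value,
        # or None if nothing has been cached for this key.
        return cache.get(key)
    cache[key] = value  # refresh the cache on success
    return value


# Hypothetical stand-ins for a real CRM client, used only for demonstration.
def fake_fast_crm(key: str, timeout: float) -> str:
    return "fresh-" + key


def fake_slow_crm(key: str, timeout: float) -> str:
    raise TimeoutError(f"no response within {timeout}s")


if __name__ == "__main__":
    cache: Dict[str, str] = {}
    print(get_crm_record("acct-1", fake_fast_crm, cache))  # populates the cache
    print(get_crm_record("acct-1", fake_slow_crm, cache))  # served from the cache
    print(get_crm_record("acct-2", fake_slow_crm, cache))  # nothing cached yet
```

The trade-off in a design like this is that callers may see stale or missing data during a third-party slowdown, in exchange for web threads staying available for other endpoints.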