We're back online and everything should be working as normal.
Feb 29, 20:02-20:15 UTC
Error pages across platform
We fixed the issues causing sporadic error pages throughout the platform. We'll continue to monitor performance, but the site should be fully operational again.
Feb 19, 18:16-19:21 UTC
[Scheduled] Database Upgrade
We've completed the database update with less than one minute of customer-facing disruption, and the platform is now operating normally.
Feb 9, 13:00-13:26 UTC
This issue has been resolved. The intermittent outages were caused by malicious users of a customer site attempting to use the platform to post commercial content. We're continuing to monitor the situation, but the platform should be fully operational again.
Jan 11, 17:48-18:15 UTC
DNS propagation should now be complete and the platform should be operating normally.
This outage was caused by a routine maintenance operation whose cascading consequences we did not anticipate when applying the change. In response, we've updated our procedures to require explicit sign-off on production changes before they are applied, in addition to tests in our staging environment.
We've also discovered that a class of maintenance operations that we had previously believed to be low risk can actually cause persistent outages through side effects we had not fully anticipated. We're updating our procedures to review these operations much more carefully and to develop strategies for making such changes without downtime.
Jan 5, 16:09-16:43 UTC
We experienced a brief outage caused by routine maintenance this morning.
We changed the configuration of our load balancers to add a new, larger block of IP addresses to support new customers and other growth of the platform. This change propagated through the infrastructure in a way that we had not anticipated and caused some old IP address ranges to be blocked before the new ones were completely in service.
To prevent this from recurring, we've updated our maintenance procedures to perform this operation in multiple steps, which will allow for zero downtime.
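For readers curious what a multi-step change of this kind can look like, here is a minimal sketch, not our actual tooling. It assumes the allowed address ranges can be modeled as a simple list, and the function names, the `new_in_service` flag, and the example ranges are all hypothetical. The idea is that old ranges stay allowed until the new ones are confirmed to be in service, and are retired only afterwards.

```python
import ipaddress


def allowed_ranges(old_ranges, new_ranges, new_in_service):
    """Return the ranges that should be allowed at the current stage.

    Hypothetical helper: until the new ranges are confirmed in service,
    both old and new ranges stay allowed, so nothing is blocked early.
    """
    old = [ipaddress.ip_network(r) for r in old_ranges]
    new = [ipaddress.ip_network(r) for r in new_ranges]
    if not new_in_service:
        # Steps 1 and 2: add the new block alongside the old one and wait.
        return old + new
    # Step 3: the new block is serving traffic, so the old one can be retired.
    return new


def is_allowed(address, ranges):
    """Check whether a single address falls inside any allowed range."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in ranges)


# Example ranges are placeholders, not real platform addresses.
old = ["203.0.113.0/24"]
new = ["10.20.0.0/16"]

during_rollout = allowed_ranges(old, new, new_in_service=False)
after_rollout = allowed_ranges(old, new, new_in_service=True)
```

In this sketch, an address in the old block remains allowed for the whole rollout and is dropped only in the final step, which is the ordering that the single-step change did not preserve.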
Jan 5, 15:45 UTC
CVE-2015-0235 GHOST: glibc gethostbyname buffer overflow
All systems have now been patched to resolve the GHOST glibc vulnerability. This resulted in less than 1 minute of downtime as the last of our machines were rebooted.
Technical information here: http://www.securityfocus.com/archive/1/534555
Jan 29, 14:11 UTC
503 Errors on Customer Sites
While performing routine maintenance, our infrastructure provider encountered a problem that caused the platform to become unresponsive. This issue has been corrected, though we are scheduling a maintenance window for tomorrow, Tuesday, January 27th, at 4 pm EST to make further improvements to the underlying infrastructure to prevent this issue from happening again.
The underlying cause was related to our HAProxy instances experiencing a "split brain", where both machines thought they were the primary machine that should be serving requests.
This is the same issue that caused an outage on December 15th, and we expect that once the infrastructure changes are done tomorrow, it will be definitively resolved.
Jan 26, 18:37-21:45 UTC
Unplanned Connectivity Outage
Our infrastructure provider has notified us that they experienced network instability following this morning's planned network maintenance window, which caused a second outage:
Jan 11, 19:16 UTC