Data Center Issues

Report 20:32 GMT: We appear to be having a strange network issue in the Los Angeles Datacenter. We are working on it and will update as soon as we know something.

Report 22:42 GMT: Data center fully functional.

Report 23:13 GMT: At approximately 12:55PM PST (GMT -8:00) we noticed a routing abnormality. One of our data center floors was fully operational while the other was partially inaccessible. On the floor that was having issues (the 11th floor space), many clients could connect just fine, while others, including myself, could not. We immediately had people working to identify the issue. Fifteen minutes later we posted on the forum, since the growing number of tickets made it clear the issue was widespread. We monitor all parts of our network, both internally and externally, as well as our servers, so we know when things happen.
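
As an illustration only, the kind of internal/external monitoring mentioned above can be as simple as an outside probe that repeatedly tries to reach each public-facing host and alerts when a connection fails. The hosts, ports, and thresholds below are hypothetical; the post does not describe the actual monitoring setup.

    import socket
    import time

    HOSTS = [("server1.example.com", 80), ("server2.example.com", 80)]  # hypothetical targets
    TIMEOUT = 5      # seconds before a connection attempt counts as a failure
    INTERVAL = 60    # seconds between polling rounds

    def reachable(host, port):
        """Return True if a TCP connection to host:port succeeds within TIMEOUT."""
        try:
            with socket.create_connection((host, port), timeout=TIMEOUT):
                return True
        except OSError:
            return False

    while True:
        for host, port in HOSTS:
            if not reachable(host, port):
                # In a real deployment this would page an on-call engineer or open a ticket.
                print(f"{time.strftime('%H:%M:%S')} ALERT: {host}:{port} unreachable")
        time.sleep(INTERVAL)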

We then began a two-pronged approach to determine what the issue was: we looked into network changes (i.e., config changes on the switches) as well as possible hardware problems. At 1:30PM we concluded this was a hardware issue; we believed that a distribution switch (one that feeds the switches customers are connected to) was dying. Rich was on site, and I asked him to run a battery of tests. After he ran the tests, which included consoling into the distribution switches, we determined that the switch was operating correctly and began checking for any code changes. At 2:15PM we grabbed a standby distribution switch (which we keep on hand for these cases). We were then checking the code and routing tables of the distribution switches and the core network switches.
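
A rough sketch of the kind of cross-switch comparison this stage involves: given routing-table summaries captured from each switch, flag any switch whose route count diverges sharply from its peers, which points at a device-specific fault rather than a shared configuration problem. The file names and the "Total routes: N" output format below are hypothetical stand-ins for whatever the actual switches report.

    import re

    SNAPSHOTS = {                      # hypothetical capture files
        "dist-switch-1": "dist1_route_summary.txt",
        "dist-switch-2": "dist2_route_summary.txt",
        "core-switch-1": "core1_route_summary.txt",
    }

    def route_count(path):
        """Extract a 'Total routes: N' style line from a captured summary (assumed format)."""
        with open(path) as f:
            match = re.search(r"Total routes:\s*(\d+)", f.read())
        return int(match.group(1)) if match else 0

    counts = {name: route_count(path) for name, path in SNAPSHOTS.items()}
    average = sum(counts.values()) / len(counts)

    for name, count in counts.items():
        # A count far below the peer average suggests the switch is not learning
        # or installing routes, i.e. a hardware or device-local problem.
        if count < 0.5 * average:
            print(f"{name}: {count} routes (peers average {average:.0f}) -- investigate hardware")
        else:
            print(f"{name}: {count} routes -- looks consistent with peers")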

At 3:00PM, Ryan (our main network guru) logged into our core switches and determined that the hardware routing table was full, so the switch couldn't install the 11th floor routes, including the ARP entries, into its memory. He then filtered out all routes, and 5 minutes later everything came back online. Once that was done, we waited 5 more minutes, then rebooted the core network switch and installed a limit of 239k routes on the table to prevent the same issue from ever happening again.
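
The 239k figure comes from the post; a minimal sketch of the preventive check it implies would be to poll hardware forwarding-table usage and alert well before it reaches that limit. The polling function here is a stub standing in for whatever SNMP or CLI query the platform actually provides.

    ROUTE_LIMIT = 239_000          # route limit installed on the core switch (from the post)
    WARN_THRESHOLD = 0.80          # alert at 80% of the configured limit

    def current_route_count():
        """Hypothetical stub: return the number of routes installed in hardware."""
        return 195_000             # placeholder value for illustration

    used = current_route_count()
    utilization = used / ROUTE_LIMIT

    if utilization >= WARN_THRESHOLD:
        print(f"WARNING: hardware routing table at {utilization:.0%} of the {ROUTE_LIMIT} route limit")
    else:
        print(f"Hardware routing table at {utilization:.0%} of the configured limit")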

Note: The outage is not reflected in the Alertra report because not all access to the data center was cut off. Alertra's server was able to reach our server during the episode, which means that not all web visitors were cut off.

captainccs

April 13, 2007