23-12-2013 12:34 PM - last edited on 06-01-2014 03:57 PM by marionh
Hi everyone,
Following the outage last week, I wanted to give you an update on what happened that evening and on how we are focused on continually improving the 48 network experience.
As I mentioned previously, we have an ongoing investment programme on our platform and are coming to the end of a significant phase of that programme. There are still some final elements to be completed in January and February next year, and unfortunately these are the elements that caused the problem last week. They also added to the length of time the system took to recover full service.
In accordance with best practice we had implemented a change freeze on the network in December, and we were confident that the work we had completed over 2013 had us in a good place over the Christmas period and into January, when we would recommence the upgrade programme. This turned out not to be the case, and it was the specific nodes next on the list for an upgrade that exacerbated the issues. It is particularly disappointing that you have experienced another outage like this given that we have invested a lot of time and money in the network throughout 2013, but we are confident that we will see the benefit of these upgrades in 2014.
With regard to the event itself, our systems first started reporting errors at 5:43pm on Wednesday 19th December. The errors indicated a fault with the hardware that manages the throughput of traffic across our core platform. This immediately affected calls, texts and data. It also brought down our website services, so you could not buy memberships or add-ons online either.
The issue was escalated immediately to our service providers. Due to the nature of the issue, our engineers could not rectify it remotely and were sent to the data centre; the data centre's own engineers began investigating before our engineers arrived. The faulty hardware was fixed at around 8pm, which restored the website services, but unfortunately the rest of the network took a lot longer to recover. Bringing it back was the key focus until we saw full service resume at 1:25am that night. Our engineers worked hard throughout the evening to bring all the necessary systems back online as quickly as possible; unfortunately this process can take some time due to the complexities involved in this work. Our engineers then remained onsite for the rest of the night to ensure the systems were stable.
We have now put in place a higher level of monitoring on the hardware impacted by this issue. We have already made some configuration changes to the hardware that manages the throughput of traffic across the platform and have replaced the specific hardware that was showing the errors.
We are committed to providing a better service and are confident that the final changes we will be making early next year will deliver a significantly improved and more robust network for our customers.
The purpose of this level of detail is to explain what happened, not to excuse it. We know how frustrating and annoying these outages are, and we can't apologise enough for them.
Regards,
Techie Kev
03-01-2014 12:13 AM
Hi
For what it's worth...
It's my belief that the only 48 representative on this thread is Techie Kev. Other than that, it's just good old community forum helpers dotted here and there!
(That's the reason for the response with the mention of kudos)
🙂