The Infocomm Development Authority of Singapore (IDA) has recently concluded its investigation into the SingTel Internet disruption incident. More than six months have passed since Wednesday, 9 October 2013, when a fire broke out at SingTel’s Bukit Panjang Exchange facility, damaging a number of Internet cables and disrupting Internet services across the island for more than a week. At the time, the disruption sent a shockwave of discontent through affected Internet users and even part of the general public, who vented their anger on social media in a frenzy that lasted for weeks.
Even the Minister for Communication & Information, Dr Yaacob, had to jump into the fray in the thick of the crisis, calling the outage a “major incident” and saying he shared the frustrations of those affected by it. He also sought to reassure the public, declaring that the regulatory authority in charge, the IDA, was “on the top of the Situation” and was focusing on ensuring the resilience and reliability of the Internet system in Singapore. As reported in the news at the time, the IDA immediately launched a “full investigation” into the incident to find out what had happened, declaring that it would “take appropriate actions to prevent such incident from happening again”.
This fire and its impact on Internet users quite naturally raised questions about the level of preparedness and the actual capabilities of telcos such as SingTel to manage critical disruption threats effectively. The concern was magnified by the fact that the same fire also affected the operations of other telcos, namely M1 and StarHub, raising additional concerns about potential vulnerabilities in Singapore’s Internet network infrastructure.
Apparently aiming to alleviate these concerns quickly, the IDA also issued a statement, reported in the Straits Times online on the 24th October, declaring in substance that there was no need to worry as “Singapore has resilient telecommunication networks”. The Authority further assured the public that “enough back-up is in place to cope with events such as the recent October 9 fire”.
However, considering the scale of the disruption that had just hit the Internet in Singapore, and with no proper explanation at the time of what had happened, why and how, statements of this kind sounded more like an attempt to spin a way out of a difficult situation than the result of a careful analysis of it. Nevertheless, they helped to calm things down. As Internet access had by then been fully restored to all users, the online chatter started to subside, and after a few more weeks the matter dropped completely from the news headlines, much to the relief of all the parties under scrutiny. Such is the fast-paced news cycle of modern societies, always obsessed with the latest thing, where newer stories replace older ones even when the problems they highlighted have not necessarily been resolved. There is indeed a natural human tendency to move on once an accident appears to be over. And in this case, since Internet access had been restored to everyone and a credible and respected regulatory authority, the IDA, had told the public that it was taking charge and investigating the incident, what was there to worry about?
However, from a risk and business continuity management perspective, we should not sleep easy thinking that managing this kind of situation is somebody else’s responsibility, as this incident left us with too many unanswered questions and a heightened sense of vulnerability. We must attempt to answer those questions if we want to reduce the probability of the same accident, or an even worse one, happening again in the future.
Indeed, asking questions is the beginning of wisdom, as it ensures adequate scrutiny and opens the door to learning and improvement. Hence, to ensure a proper understanding of this incident, it is necessary to ask some very simple and reasonable questions, which I would list as follows:
1 – QUESTIONS about the CAUSES OF THE INCIDENT: How did the fire start in the cable room despite all the preventive controls we should assume were in place? What exactly were those controls? Why did they apparently not work?
2 – QUESTIONS about the LEVEL OF PREPAREDNESS & MANAGEMENT OF THE INCIDENT: How did SingTel prepare for and manage the actual fire incident and the resulting Internet disruptions? What controls were in place to minimize the impact of the disruptions on users? Did they work according to the plan that we assume was developed to ensure effective crisis and business continuity management in SingTel’s operations?
3 – QUESTIONS about the LONG-TERM CONSEQUENCES OF THE INCIDENT: What are the long-term negative consequences of a disruption on the scale of the ‘October 9 fire’ for SingTel, the other telcos, the IDA and the general public? What does it indicate about the vulnerability of the Internet infrastructure in Singapore?
4 – QUESTIONS about the TOLERABLE LEVEL OF LOSS EXPOSURE: What level of disruption can be tolerated in the Singapore context, i.e. in terms of the number of users affected and the length of the disruption, before the economic and political costs are considered too high?
5 – QUESTIONS about the AREAS IN NEED OF IMPROVEMENT: Are the negative consequences identified in question 3 acceptable to the users (general public and corporations), or do SingTel, the other telcos and the IDA need to do more to reduce them in the future?
As already mentioned, the IDA released its final investigation report on the SingTel fire incident earlier this month, on the 7th May 2014. You can consult a summary of the report on the IDA website at the following link: SingTel Fire – IDA Report.
This report provides some answers to a number of the questions listed above and by analyzing this case I will now attempt to fill the gaps:
1 – CAUSES OF THE INCIDENT – ANALYSIS:
The fire started at 2.16pm in a cable room in SingTel’s Bukit Panjang Exchange facility. According to the IDA investigation report, it was due to a combination of an “outdated” cable maintenance system, safety lapses and human errors. The direct trigger of the fire was the careless use of an unauthorized blowtorch by a SingTel employee doing maintenance work. As reported in the Straits Times, “It sparked a slow burning fire that went undetected as a result of further safety lapses and human errors”.
H.W. Heinrich’s Theory of Accident Causation can help us better understand the factors that potentially lead to accidents. In short, Heinrich explained that accidents are usually the result of a chain of events that includes unsafe act(s) and unsafe condition(s). Safety efforts should focus first on identifying these factors, and second on eliminating or controlling specific unsafe acts in process flows and/or mitigating unsafe conditions by changing the process or even the overall system. Unsafe acts leading to accidents are often the result of system design flaws, ineffective training and/or complacency.
Complacency sets in when, owing to lax supervision and conflicting priorities, SOPs and safety practices are not properly enforced, allowing staff to start cutting corners to get the job done, as in the SingTel fire case with the use of an unauthorized blowtorch.
The IDA report also helps us pinpoint the “outdated” cable maintenance system as an unsafe condition/constraint leading to unsafe acts, i.e. the use of a blowtorch. The report explains that cable maintenance requires regularly removing the sealant protecting the exit points for cables in order to carry out the work. Heat is necessary to shrink the sealant, hence the use of a blowtorch. The use of an open flame is by definition an unsafe act that requires precautionary measures, both preventive and corrective, such as proper equipment, training and supervision of workers, fire suppression systems, and so on.
The IDA report could not be clearer when it said that SingTel failed in that area by not enforcing its standard operating procedures and work safety practices that would have prevented the fire.
2 – PREPAREDNESS & MANAGEMENT OF THE INCIDENT – ANALYSIS:
The only TRUE test of the effectiveness of a crisis and business continuity management plan is a CRISIS, or rather the capability to navigate smoothly through a real incident or crisis by mitigating its potentially negative consequences.
The question is: did SingTel pass the real crisis test, considering how the telco prepared for and managed this relatively minor incident?
The IDA report has finally answered that question clearly:
“IDA has found that SingTel (as well as CityNet and OpenNet) had NOT fulfilled their respective obligations, to provide sufficiently resilient telecommunication systems and services, and to restore services to affected end users as quickly as possible when the service disruptions occurred.”
Let’s look at each point one by one:
- ) Failure to provide sufficiently resilient telecommunication systems and services. The IDA report seems to imply that the way the telecommunication system and infrastructure have been built makes them vulnerable to serious outage threats. What exactly the issues are is not clear from the information available. Is it too many bottlenecks and single points of failure? Not enough backups and/or redundancy? Too many interdependencies between supposedly independent telco operators?
- ) Failure to restore services to affected end users as quickly as possible when the service disruptions occurred. While the IDA recognized that SingTel had a business continuity management plan in place, the plan proved insufficient to deal effectively with service outages such as the one caused by this fire. As a consequence of its failures, SingTel has been fined S$6 million. While this amount is almost negligible in comparison with SingTel’s earnings, it is the largest penalty imposed on a telco operator so far and hence, in some ways, highlights the seriousness of the offense.
- ) I would also add a third point to the list of SingTel’s failures in this incident: SingTel managed the crisis communication part of the crisis quite badly as well. The sequence of events makes this painfully obvious. While people across Singapore were wondering why they could not access the Internet and were desperately looking for information, it took SingTel too many hours after the fire to acknowledge the problem and inform its customers that it was responsible for the disruption. And once it started communicating, SingTel made all the basic communication mistakes, such as making time commitments on service restoration that it could not fulfill and thereafter declaring victory too soon, claiming that all services were restored when many homeowners still had problems connecting and ended up even more infuriated by SingTel’s “mission accomplished” communication strategy.
To sum up, having a formal crisis and business continuity management plan is not enough. Organizations need to test their plans repeatedly and in earnest, using various scenarios to train staff, identify potential weaknesses and then improve. You want reasonable assurance that your plans will actually work on D-day.
3 – CONSEQUENCES OF THE INCIDENT – ANALYSIS:
While the fire was put out quite quickly, within 20 minutes, it still damaged a large number of Internet cables, immediately disrupting Internet services across the island and affecting a significant number of households and business customers, including some important banking services. Other Internet service providers (ISPs), M1 and StarHub, were also affected. In total, about 270,000 telecom and broadcast subscribers were affected.
The economic cost to users of not being able to use Internet services has apparently not been fully evaluated. But as SingTel prioritized the more sensitive business customers for faster recovery, and as most Internet services were restored within a few days, we can speculate that the overall economic cost for subscribers was relatively low.
As for the impact on the telco operators (I will focus on SingTel here), while we do not have all the information, we can expect SingTel to suffer some direct and indirect, short-term and possibly longer-term NEGATIVE consequences in the following areas:
- Property Damage Cost: The direct cost associated with the damaged facility and cables, including the cost of repairing the cables, probably shared with CityNet and OpenNet. We would also expect it to be at least partially covered by insurance.
- Crisis Management Cost: All the costs associated with putting in place the crisis and business continuity management plans, including all the resources and management attention necessary to deal with the actual incident.
- Liability Costs: While short in duration, the outage affected a significant number of businesses, including in banking, exposing SingTel to potential claims for compensation from these customers. The amount of liability costs will depend on the SLA (service level agreement) terms and conditions used in those contracts to address such outages.
- Fines for breach of SRC: Under the Telecommunication Service Resiliency Code (SRC), operators are required to ensure that the design of their networks and services is resilient to service outages and, when outages do occur, to restore services expeditiously. For comparison, M1 was fined S$1.5 million by the IDA for a three-day outage affecting 250,000 of its customers. With regard to the SingTel fire, the IDA declared that SingTel, CityNet and OpenNet were in breach of the SRC. As a consequence, the IDA has imposed a financial penalty of S$6 million on SingTel.
- Reputational Damages: As already mentioned, the disruption of Internet access resulted in a massive wave of discontent among affected Internet users and even part of the general public, who vented their anger on social media in a frenzy that lasted for weeks. The perceived mismanagement of crisis communication by SingTel during the incident magnified the negative sentiment. Angry customers went on to dig out unrelated past stories and problems about SingTel services, and SingTel’s Facebook page was flooded with negative comments from frustrated users affected by the outage. Months later the fever has gone, but the complacency, negligence and mismanagement exposed during the crisis and confirmed in the IDA investigation report have certainly left a severe dent in SingTel’s reputation.
- Strategic & Business Costs: The reputational damage has certainly undermined the credibility of SingTel as a responsible and well-managed telco operator. As the incident exposed a combination of negligence, complacency and mismanagement on the part of SingTel, it could have, and certainly already has had, negative ripple effects undermining SingTel’s position in other areas, for example the sale of the OpenNet consortium to a business trust (Net Link) owned by SingTel. The other telcos had opposed the sale, and the incident gave them additional ammunition to argue that, to ensure effective competition and proper risk management, a dominant player such as SingTel should not be given control over the Internet infrastructure in Singapore. The sale of OpenNet to Net Link was finally completed last November, but as part of the sale agreement SingTel is required to sell down its stake in the trust by 2018. On the business side, the impact should be limited this time, especially as the fire highlighted the interdependencies between the telco operators, which actually share the same cables and are at least partially in the same “boat” when it comes to vulnerability to disruptions.
4 – TOLERABLE LEVEL OF LOSS EXPOSURE:
The general public and business customers certainly understand that it is neither realistic nor economical to prevent every threat from materializing and causing disruptions; hence there is a certain probability of disruption that can and must be tolerated. However, when the IDA initially stated, as reported in the Straits Times online on the 24th October, that “enough back-up is in place to cope with events such as the recent October 9 fire”, it may have given the impression that the level of disruption caused by this fire, i.e. 270,000 households and businesses affected for 2 to 7 days, constituted an acceptable disruption scenario. The final IDA investigation report corrects that impression, as it clearly states that:
1) The telecommunication systems and services were not resilient enough, and
2) The restoration of services to affected end users was too slow.
With businesses and consumers increasingly dependent on Internet communications 24 hours a day, 7 days a week to carry out their activities, SingTel and the other telco operators are expected to do better, much better; otherwise the costs of future accidents will be much more problematic.
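To get a feel for the scale implied by the figures above, the rough arithmetic can be sketched in a few lines of Python. This is only an illustration: it takes the roughly 270,000 affected subscribers and the 2 to 7 day outage range quoted in this article, and uses "subscriber-days" as a simple exposure measure. That measure, and the assumption that every subscriber was cut off for the full duration, are mine and not the IDA's; the real figure would be lower since services were restored progressively.

```python
# Rough exposure estimate for the October 9 fire, using figures quoted in this article.
# "Subscriber-days" is an illustrative measure chosen here, not one used by the IDA.

AFFECTED_SUBSCRIBERS = 270_000  # approximate total reported
MIN_OUTAGE_DAYS = 2             # shortest outage duration quoted
MAX_OUTAGE_DAYS = 7             # longest outage duration quoted


def subscriber_days(subscribers: int, outage_days: int) -> int:
    """Upper-bound exposure, assuming every subscriber is down for the whole period."""
    return subscribers * outage_days


low = subscriber_days(AFFECTED_SUBSCRIBERS, MIN_OUTAGE_DAYS)
high = subscriber_days(AFFECTED_SUBSCRIBERS, MAX_OUTAGE_DAYS)
print(f"Exposure range: {low:,} to {high:,} subscriber-days")
```

Even as a crude upper bound, a range of this kind gives a baseline against which a tolerable level of disruption could be discussed.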
5 – AREAS IN NEED OF IMPROVEMENT:
I often warn our clients that no matter how “strong” a risk, crisis and business continuity management system looks “on paper”, and even when it has been validated against proper reference standards (such as ISO 31000:2009 and ISO 22301:2012), risk management is much more than just designing a formal control system and “ticking the boxes” on various compliance requirements.
The SingTel incident reminds us that discovering the shortcomings of your crisis and business continuity plan only when the real event strikes is not a good thing. At the very least it will be highly embarrassing for the organization affected, as in this case, but we should not forget that it could also prove fatal for an organization when more serious threats or scenarios materialize.
As reported in The Straits Times on the 7th May 2014, Mr Leong Keng Thai, IDA deputy chief executive, left no doubt when he declared that the “blaze could have been avoided” with proper preventive measures. The IDA report went further, stating that SingTel’s management of the crisis was insufficient. So there is definitely room for improvement.
The IDA has already instructed SingTel to take immediate corrective action on some specific points and, as a result, SingTel has for example promised to switch to alternative sealing methods that do not require heat and to ensure proper training and effective supervision of workers carrying out cable maintenance work. But there are other issues related to the resilience of Singapore’s telecommunication system and infrastructure that should be put under the microscope.
We should not take this lightly, as the SingTel fire could be the symptom of more deeply rooted vulnerabilities in Singapore’s telecommunication system and infrastructure. Indeed, if a relatively minor incident, i.e. a small fire put out quite rapidly, can affect about 270,000 households and businesses, including the banking system and even other telcos, we may wonder what would happen if more serious threats were to materialize, such as a natural peril event, a major accident resulting from human error or a terrorist attack at a cable room facility. In short, strong operational systems and infrastructure and effective crisis preparedness are not a luxury; they are essential for organizational sustainability.
To conclude, the IDA said in its statement that it has already embarked on a review of the resilience of all parts of Singapore’s infocomm infrastructure, and that the review is expected to be completed in the second half of this year. We should be on the lookout for that report.