"Y2K 2.0"? The leap second that drove a slew of web services crazy


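Adding one second may not sound like much, but the platforms underneath many web services (for example, the Linux operating system and many Java application platforms) turned out to be unable to handle the extra second, so a nightmare reminiscent of Y2K, the millennium bug, played out at midnight GMT. Some services that synchronize their clocks over the Network Time Protocol (NTP) also did not know how to cope with the additional second.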

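Problems like this have relatively little effect on a standalone machine: at worst you reset the clock by hand, or the next automatic time sync corrects it. For cloud systems, though, precisely coordinated time is an indispensable ingredient, so when clocks disagree, all sorts of baffling problems surface. That is why some online media have dubbed the incident Y2K 2.0.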

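Quite a few operators have already reported outages, but some prepared in advance and came through unscathed. Google, for example, modified its NTP servers to adjust each update by a few milliseconds (ms), so that its systems gradually absorb the "time difference".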

After it went into effect tonight, half the internet — including Reddit, FourSquare, Yelp, LinkedIn, Gawker, StumbleUpon, and more — came crashing down. The outages were mostly (thankfully) brief. Here's how it happened.

Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks.

Very large-scale distributed systems, like ours, demand that time be well-synchronized and expect that time always moves forwards. Computers traditionally accommodate leap seconds by setting their clock backwards by one second at the very end of the day. But this “repeated” second can be a problem. For example, what happens to write operations that happen during that second? Does email that comes in during that second get stored correctly? What about all the unforeseen problems that may come up with the massive number of systems and servers that we run? Our systems are engineered for data integrity, and some will refuse to work if their time is sufficiently “wrong.” We saw some of our clustered systems stop accepting work on a small scale during the leap second in 2005, and while it didn’t affect the site or any of our data, we wanted to fix such issues once and for all.
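
To make the "repeated second" concrete, here is a minimal Python sketch. It is not taken from any of the quoted sources. It assumes a hypothetical wall clock that is simply set back by one second at a chosen instant; the function name `stepped_clock_ms` and all millisecond values are made up for illustration.

```python
def stepped_clock_ms(elapsed_ms: int, step_at_ms: int) -> int:
    """Hypothetical wall clock that is set back by 1000 ms once, at step_at_ms.

    elapsed_ms is the true time elapsed; the return value is what the
    stepped clock would show at that moment.
    """
    if elapsed_ms < step_at_ms:
        return elapsed_ms
    return elapsed_ms - 1000


# Assumed instant (in true elapsed milliseconds) at which the clock steps back.
STEP_AT_MS = 101_000

# Three writes arrive in this real order.
arrivals = [100_200, 100_900, 101_100]

# Timestamps the stepped clock would assign to them.
stamps = [stepped_clock_ms(t, STEP_AT_MS) for t in arrivals]

# The last write receives a timestamp earlier than the first two, so
# sorting by timestamp no longer reproduces the arrival order.
order_by_stamp = sorted(range(len(arrivals)), key=lambda i: stamps[i])

print(stamps)
print(order_by_stamp)
```

Any system that orders writes by timestamp would place the third write first in this toy scenario, which is the kind of ambiguity the quote describes.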

....The solution we came up with came to be known as the “leap smear.” We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens.
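
The idea of spreading the extra second over a window can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses a linear ramp and a 20-hour window, neither of which is given in the quote above, and the names `smear_offset` and `smeared_clock` are hypothetical rather than anything from Google's NTP servers.

```python
# Assumed smear window length in seconds (not from the quoted source).
WINDOW_S = 20 * 3600.0


def smear_offset(elapsed_s: float, leap_end_s: float, window_s: float = WINDOW_S) -> float:
    """Portion of the extra second (0.0 to 1.0) absorbed so far.

    elapsed_s counts true elapsed seconds, including the leap second.
    leap_end_s is the instant at which the leap second has fully passed.
    The linear ramp is an assumption made for this sketch.
    """
    start_s = leap_end_s - window_s
    if elapsed_s <= start_s:
        return 0.0
    if elapsed_s >= leap_end_s:
        return 1.0
    return (elapsed_s - start_s) / window_s


def smeared_clock(elapsed_s: float, leap_end_s: float, window_s: float = WINDOW_S) -> float:
    """Time a smearing server would report: true elapsed time minus the absorbed part."""
    return elapsed_s - smear_offset(elapsed_s, leap_end_s, window_s)


# Sample the start, the middle, and the end of the window.
leap_end = 100_000.0
for t in (leap_end - WINDOW_S, leap_end - WINDOW_S / 2, leap_end):
    print(t, smear_offset(t, leap_end), smeared_clock(t, leap_end))
```

Because the reported clock in this sketch only ever runs slightly slow and never jumps back, no second is repeated and time keeps moving forwards throughout the window.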

