Tumblr has been down for more than 12 hours due to an issue with
their database cluster. Here is the comment I left on GigaOm.com:
This is the freshest lesson for entrepreneurs and startups:
- Learn to value your data
- Implement a high availability plan
- Plan a disaster recovery strategy
“Tumblr likely has the resources to recover…”
I really hope that holds true, but remember: data is the only
irreplaceable asset of an organization. Once it’s gone, it’s
gone.
When I was handling the disaster at Fotolog (massive database
corruption when our SAN crashed), I couldn’t find any company or
consulting firm ready to handle the situation and help with data
recovery. It was a miracle that I came across the concept of DUDE
(Data …
“Funny how Amazon doesn't use S3 to store any assets for
amazon.com” (tweet by @gruber)
Amazon's S3 suffered a major outage today, knocking many
websites offline. The S3 outage started at approximately 12:00 PM EST,
and when I last checked at 11:11 PM EST, Smugmug, a popular
photo hosting site that uses S3 extensively, was still
down.
- S3 down for more than 7 hours
- S3 outage, 7 hours and counting
- S3 down again
- Amazon failure downs …
Disasters are inevitable. Despite all its investments in redundant
power, ThePlanet (formerly EV1 and RackShack) had to shut down the
backup generators at its H1 data center on the instructions of the
fire crew. This happened after a short in a faulty transformer led
to an explosion that knocked down one of the walls, ultimately
bringing 9,000 servers down. Luckily, no one was injured.
This goes to show that having redundant power and backup generators
does not make a data center immune to disaster. IIRC, ThePlanet's
last disaster was blamed on backup generators not kicking in
properly.
Although the servers themselves were not damaged, I wonder how many
MyISAM repairs will need to be run once they come back
online.
- …