Semi-sync replication is a plugin
available for MySQL which allows you to create more durable
replication topologies. For instance, you can ensure that in
the event of a master crash, at least one of your replicas
has all transactions currently written to the master, so that when
you promote, you know you're not missing any data.
That's a huge simplification.
What's the downside? Write speed. If a transaction on
your master has to wait until a replica acknowledges it has that
transaction, then there is going to be some delay. Not only
that, but your network latency between the two points matters a
lot. If you want greater durability, the cost is
performance.
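
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a commit under semi-sync costs roughly the local commit time plus one network round trip to the replica plus the time the replica takes to acknowledge. The function names and the millisecond figures are made up for illustration, not measurements from any real setup.

```python
def semi_sync_commit_ms(local_commit_ms, network_rtt_ms, replica_ack_ms=0.0):
    """Rough commit latency when the master waits for one replica to acknowledge.

    This is a simplified model (an assumption, not how MySQL accounts for time):
    local commit work + one round trip + whatever the replica spends before acking.
    """
    return local_commit_ms + network_rtt_ms + replica_ack_ms


def max_serial_commits_per_sec(commit_ms):
    """Upper bound on back-to-back commits from a single connection."""
    return 1000.0 / commit_ms


# Hypothetical latencies, for illustration only.
scenarios = {
    "no semi-sync": semi_sync_commit_ms(1.0, 0.0),
    "replica in same datacenter": semi_sync_commit_ms(1.0, 0.5),
    "replica in another region": semi_sync_commit_ms(1.0, 40.0),
}

for name, ms in scenarios.items():
    rate = max_serial_commits_per_sec(ms)
    print(f"{name}: {ms:.1f} ms per commit, about {rate:.0f} serial commits/sec")
```

The bound only applies to a single connection committing one transaction after another; concurrent sessions overlap their waits, so total throughput usually suffers less than per-transaction latency does.
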
It's important to note that the master doesn't wait until the
replica actually runs the transaction on the …
When you're testing out a new version of MySQL in a
non-production environment, there is a temptation to go wild and
turn on all kinds of new features. Especially if you're
reading the changelogs or the manual and scanning through
options. You want to start with the most reasonable set of
defaults, right? Maybe you're even doing benchmarks to
optimize performance using all the new bells and whistles.
Resist the temptation! If your goal is to upgrade your
production environment then what you really want is to isolate
changes. You want to perform the upgrade with little to
no impact. Then you can start turning on
features or making changes one by one.
Why? Anytime you're doing a major upgrade to something as
fundamental as your core RDBMS, there are many ways things can go
wrong. Performance regressions & incompatible changes,
client/server incompatibilities …
I believe in automation as much as possible, and I'm always
working to make the day-to-day tasks of operations as smooth as
possible. I also try not to be afraid to take good tools
and make them better.
Here in Database Ops at Box, we use pt-kill running as a service
to constantly monitor our servers and help protect against
long-running queries. But our thresholds are pretty generous,
and in some cases it's possible for unforeseen circumstances to
cause enough queries to storm the database such that we can have
problems before any of them hit the threshold for "busy time."
Ditto for idle connections.
The response is that someone has to be available to manually run
another copy of pt-kill with much lower thresholds to clear out
these thundering herds. But what if we could let pt-kill
handle the "normal" mode and still protect us from
herds?
That's what we've done by adding a …
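
The excerpt cuts off before describing the actual change, so the following is not the feature that was added to pt-kill. It is only a minimal sketch of the general two-mode idea: keep a generous busy-time threshold for normal operation, and switch to a much lower one when enough queries pile up at once. The names `Query`, `select_victims`, `normal_busy_time`, `herd_busy_time`, and `herd_size` are hypothetical and are not pt-kill options.

```python
from dataclasses import dataclass


@dataclass
class Query:
    id: int
    busy_seconds: float  # how long the query has been running


def select_victims(queries, normal_busy_time=60.0, herd_busy_time=5.0, herd_size=50):
    """Pick queries to kill using two thresholds (hypothetical logic).

    Normal mode: kill only queries running at least normal_busy_time seconds.
    Herd mode:   if at least herd_size queries have already been running for
                 herd_busy_time seconds, kill all of those instead.
    """
    herd = [q for q in queries if q.busy_seconds >= herd_busy_time]
    if len(herd) >= herd_size:
        return herd
    return [q for q in queries if q.busy_seconds >= normal_busy_time]


# A handful of moderately slow queries: normal mode leaves them alone.
quiet = [Query(i, 10.0) for i in range(5)]
print(len(select_victims(quiet)))

# The same kind of query arriving as a storm: herd mode clears them out.
storm = [Query(i, 10.0) for i in range(200)]
print(len(select_victims(storm)))
```

The design choice here is that the low threshold never applies on its own: a single ten-second query is untouched, and only a crowd of them triggers the aggressive mode.
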