*edit* I want to point out that this test was done on a single
database server which used MySQL partitioning. This is a
demonstration of how Shard-Query can improve performance in
non-sharded databases too.*edit*.
Over the weekend I spent a lot of time improving my new
Shard-Query tool (code.google.com/p/shard-query) and the
improvements can equate to big performance gains on partitioned
data sets versus executing the query directly on MySQL.
I'll explain this graph below, but lower is better (response
time) and Shard-Query is the red line.
MySQL understands that queries which access data in only certain
partitions don't have to read the rest of the table. This
partition elimination works well, but MySQL left a big
optimization out of partitioning: …
I created a new tool this week:
http://code.google.com/p/shard-query
As the name Shard-Query suggests, the goal of the tool is to run
a query over multiple shards, and to return the combined results
together as a unified query. It uses Gearman to ask each server
for a set of rows and then runs the query over the combined set.
This isn't a new idea, however, Shard-Query is different than
other Gearman examples I've seen, because it supports
aggregation.
It does this by doing some basic query rewriting based on the
input query.
Take this query for example:
select c2, sum(s0.c1), max(c1) from t1 as s0 join t1 using (c1,c2) where c2 = 98818 group by c2;
The tool will split this up into two queries.
This first query will be sent to each shard. Notice that …
Back in November 2009 I was working on a project to port Scribd.com code base to Rails 2.2 and noticed that some old plugins we were using in 2.1 were abandoned by their authors. Some of them were just removed from the code base, but one needed a replacement – that was an old plugin called acts_as_readonlyable that helped us to distribute our queries among a cluster of MySQL slaves. There were some alternatives but we didn’t like them for one or another reasons so we’ve decided to go with creating our own ActiveRecord plugin, that would help us scale our databases out. That’s the story behind the first release of DbCharmer.
Today, six months after the first release of …
[Read more]
The history tells that a single RDBMS node cannot handle tons of
traffics on web system which come from all over the world, no
matter how the database is tuned. MySQL has implemented a
master/slave style replication built-in for long time, and it has
enabled web applications to handle traffics using a scale-out
strategy. Having many slaves has been suitable for web sites
where most of traffics are reads. Thus, MySQL's master/slave
replication has been used on many web sites, and is being used
still.
However, when a site grow large, amount of traffic may exceed the
replication's capacity. In such a case, people may use memcached.
It's an in-memory, very fast and well-known KVS, key value store,
and its read throughput is far better than MySQL. It's been used
as a cache for web applications to store 'hot' data with MySQL as
a back-end storage, as it can reduce read requests to MySQL
dramatically.
While 1:N replication …
This Thursday (November 26th, 14:00 UTC), Giuseppe
Maxia will present the Spider Storage Engine. This session was
originally scheduled for October 15th but had to be postponed for
technical reasons.
Here's from the abstract: Everybody needs sharding. Which is not easy to maintain. Being tied to the application layer, sharding is hard to export and to interact with. The Spider storage engine, a plugin for MySQL 5.1 and later, solves the problem in a transparent way. It is an extension of partitioning. Using this engine, the user can deal transparently with multiple backends in the server layer. This means that the data is accessible from any application without code changes. This lecture …
[Read more]
This Thursday (November 26th, 14:00 UTC), Giuseppe
Maxia will present the Spider Storage Engine. This session was
originally scheduled for October 15th but had to be postponed for
technical reasons.
Here's from the abstract: Everybody needs sharding. Which is not easy to maintain. Being tied to the application layer, sharding is hard to export and to interact with. The Spider storage engine, a plugin for MySQL 5.1 and later, solves the problem in a transparent way. It is an extension of partitioning. Using this engine, the user can deal transparently with multiple backends in the server layer. This means that the data is accessible from any application without code changes. This lecture …
[Read more]
This Thursday (November 26th, 14:00 UTC), Giuseppe
Maxia will present the Spider Storage Engine. This session was
originally scheduled for October 15th but had to be postponed for
technical reasons.
Here's from the abstract: Everybody needs sharding. Which is not easy to maintain. Being tied to the application layer, sharding is hard to export and to interact with. The Spider storage engine, a plugin for MySQL 5.1 and later, solves the problem in a transparent way. It is an extension of partitioning. Using this engine, the user can deal transparently with multiple backends in the server layer. This means that the data is accessible from any application without code changes. This …
[Read more]I wrote a post a while back that said why you don't want to shard. In that post that I tried to explain that hardware advances such as 128G of RAM being so cheap is changing the point at which you need to shard, and that the (often omitted) operational issues created by sharding can be painful.
What I didn't mention was that if you've established that you will need to eventually shard, is it better to just get it out of the way early? My answer is almost always no. That is to say I disagree with a statement I've been hearing recently; "shard early, shard often". Here's why:
- There's an order of magnitude better performance that can be gained by focusing on query/index/schema optimization. The gains from sharding are usually much lower.
- If you shard …
|
The Spider storage engine should be already known to the community. Its version 2.5 has recently been released, with new features, the most important of which is that you can execute remote SQL statements in the backend servers. The method is quite simple. Together with Spider, you also get an UDF that executes SQL code in a remote server. You send a query with parameters saying how to connect to the server, and check the result (1 for success, 0 for failure). If the SQL involves a SELECT, the result can be sent to a temporary table. Simple and effective. |
In addition to the Spider engine, Kentoku SHIBA has also created
the …
Mike Hogan, CEO of ScaleDB spoke at the Boston MySQL User Group in September 2009:
ScaleDB is a storage engine for MySQL that delivers shared-disk clustering. It has been described as the Oracle RAC of MySQL. Using ScaleDB, you can scale your cluster by simply adding nodes, without partitioning your data. Each node has full read/write capability, eliminating the need for slaves, while delivering cluster-level load balancing. ScaleDB is looking for additional beta testers, there is a sign up at http://www.scaledb.com.
Slides are online (and downloadable) at http://www.slideshare.net/Sheeri/scale-db-preso-for-boston-my-sql-meetup-92009
Watch the video online at …
[Read more]