In this part we are going to analyze a somewhat more complex query than before. This query is a 6-way join:
SELECT
    supp_nation,
    cust_nation,
    l_year,
    SUM(volume) AS revenue
FROM
    (
        SELECT
            n1.n_name AS supp_nation,
            n2.n_name AS cust_nation,
            EXTRACT(year FROM l_shipdate) AS l_year,
            l_extendedprice * (1 - l_discount) AS …
In this part we will discuss how NDB batch handling works. Execution of complex SQL queries means that more rows can be delivered than the receiver is capable of receiving at once. This means that we need to create a data flow from the producer, where the data resides, to the query executor in the MySQL Server.
The MySQL Server uses a record into which the storage engine has to copy each result row. This means that the storage of batches of rows is taken care of by the storage engine.
When NDB performs a range scan it will decide on the possible parallelism before the scan is started. The NDB API has to allocate enough memory to ensure that we have memory prepared to receive the rows as they arrive in a flow of result rows from the data nodes. It is possible to set a batch size of hundreds and even thousands of rows for a query.
The …
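As a concrete illustration of tuning this flow from the SQL side, here is a minimal sketch assuming the ndb_batch_size system variable of MySQL Cluster, which caps the byte size of each result batch (the value and the table below are illustrative, and whether the variable is settable per session or only globally depends on your version):

-- ndb_batch_size caps NDB batch size in bytes (default 32768);
-- 65536 here is illustrative, not a recommendation.
SET GLOBAL ndb_batch_size = 65536;
-- A range scan on an NDB table (t1 and k are hypothetical) then
-- streams its result rows back in batches bounded by this size.
SELECT * FROM t1 WHERE k BETWEEN 1000 AND 5000;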
In the previous part we showed how NDB will parallelise a simple 2-way join query from TPC-H. In this part we will describe how the pushdown of joins to a storage engine works in the MySQL Server.
First, a quick introduction to how a SQL engine handles a query.
The query normally goes through 5 different phases:
1) Receive query on the client connection
2) Query parsing
3) Query optimisation
4) Query execution
5) Send result of query on client connection
The result of 1) is a text string that contains the SQL query to execute. In this simplistic view of the SQL engine we will ignore such things as prepared statements and other features that make the model more complex.
The text string is parsed by 2) into a data structure that represents the query as objects that match concepts in the SQL engine.
Query …
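To make the pushdown concrete, here is a hedged sketch of how one would observe it in MySQL Cluster (the table and column names are illustrative, the exact EXPLAIN wording varies by version, and the ndb_join_pushdown optimizer_switch flag is assumed per MySQL Cluster 7.2 and later):

-- Enable join pushdown to the NDB storage engine.
SET optimizer_switch = 'ndb_join_pushdown=on';
-- For a pushed join, EXPLAIN's Extra column reports something like
-- "Child of 't1' in pushed join@1" for the pushed tables.
EXPLAIN SELECT * FROM t1 JOIN t2 ON t2.fk = t1.pk;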
For a very long time – ever since multiple CPU cores became commonly available – I have dreamed about MySQL having the ability to execute queries in parallel. This feature was lacking from MySQL, and I wrote a lot of posts on how to emulate parallel queries in MySQL using different methods: from a simple parallel bash script, to Apache Spark, to using ClickHouse together with MySQL. I have watched parallelism coming to PostgreSQL, to new databases like TiDB, to …
[Read more]
Parallel query execution is my favorite non-existent feature in MySQL. In all versions of MySQL – at least at the time of writing – when you run a single query it will run in one thread, effectively utilizing one CPU core only. Multiple queries run at the same time will use different threads and will utilize more than one CPU core.
On multi-core machines – which make up the majority of hardware nowadays – and in the cloud, we have multiple cores available for use. With faster disks (i.e. SSDs) we can’t utilize the full potential of IOPS with just one thread.
AWS Aurora (based on MySQL 5.6) now has a version that supports parallelism for SELECT queries (utilizing the read capacity of the storage nodes underneath the Aurora cluster). In this article, we will look at how this can improve reporting/analytical query performance in MySQL. I will compare AWS Aurora with MySQL …
[Read more]
There is often a need to run queries in the background in MySQL. This is generally accomplished using a message queue (like Gearman), by using extensions to a client (PHP has such extensions), or by using MariaDB's native async query interface (C client only, I think).
While such solutions work well, they don't work for, say, a Go client, or for the mysql command-line client itself.
I present "async"; part of the Swanhart Toolkit (http://github.com/greenlion/swanhart-tools). Async
is a stored procedure and event based solution for asynchronous
queries.
It consists of:
- A queue table to hold the SQL to run, the state of execution, error messages, etc.
- A settings table that controls the number of parallel threads to use for executing queries
- A stored routine …
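To give a feel for the queue-table approach, here is a minimal illustrative sketch of the idea (not the actual Swanhart Toolkit schema; the table and column names here are hypothetical):

-- Hypothetical queue table; the real toolkit's schema differs.
CREATE TABLE async_queue (
  id        BIGINT AUTO_INCREMENT PRIMARY KEY,
  sql_text  TEXT NOT NULL,
  state     ENUM('waiting','running','done','error') NOT NULL DEFAULT 'waiting',
  errmsg    TEXT NULL,
  queued_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- A caller enqueues a query and polls for completion; a background
-- event picks up 'waiting' rows and executes them.
INSERT INTO async_queue (sql_text) VALUES ('SELECT COUNT(*) FROM big_table');
SELECT state, errmsg FROM async_queue WHERE id = LAST_INSERT_ID();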
Many of my recent engagements have been all around strategies to implement real-time big data analytics: computing the hardware cost of extending a single-table collection with MariaDB and the Parallel Query feature found in the Spider storage engine, versus offloading to columnar MPP storage like InfiniDB or Vertica.
As of today, Parallel Query is only available in releases of MariaDB Spider supported by Spiral Arms. The most efficient way to use Parallel Query with Spider is on GROUP BY and COUNT queries that use a single Spider table. In such cases the Spider engine will execute query pushdown, AKA map-reduce.
Spider gets multiple levels of parallel execution for a single partitioned table.
The first level is per backend server: the way to actually tell Spider to scan different backends concurrently is to set spider_sts_bg_mode=1.
The other level is per …
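Here is a hedged sketch of what such a Spider setup can look like (server names, table layout, and credentials are illustrative, and the exact CREATE SERVER and partition COMMENT syntax should be checked against your Spider release):

-- Register the backend servers (hosts and credentials are hypothetical).
CREATE SERVER backend1 FOREIGN DATA WRAPPER mysql
  OPTIONS (HOST '10.0.0.1', DATABASE 'sales', USER 'spider', PASSWORD 'secret', PORT 3306);
CREATE SERVER backend2 FOREIGN DATA WRAPPER mysql
  OPTIONS (HOST '10.0.0.2', DATABASE 'sales', USER 'spider', PASSWORD 'secret', PORT 3306);
-- A Spider table with one partition per backend.
CREATE TABLE sales_spider (
  id INT NOT NULL,
  amount DECIMAL(10,2),
  PRIMARY KEY (id)
) ENGINE=SPIDER
  COMMENT='wrapper "mysql", table "sales"'
  PARTITION BY KEY (id) (
    PARTITION p0 COMMENT = 'srv "backend1"',
    PARTITION p1 COMMENT = 'srv "backend2"'
  );
-- Per the post: scan the backends concurrently, then run an
-- aggregate that Spider can push down to the backends.
SET spider_sts_bg_mode = 1;
SELECT COUNT(*) FROM sales_spider;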
While Shard-Query can work over multiple nodes, this blog post focuses on using Shard-Query with a single node. Shard-Query can add parallelism to queries which use partitioned tables. Very large tables can often be partitioned fairly easily. Shard-Query can leverage partitioning to add parallelism, because each partition can be queried independently. Because MySQL 5.6 supports the partition hint, Shard-Query can add parallelism to any partitioning method (even subpartitioning) on 5.6, but it is limited to the RANGE/LIST partitioning methods on earlier versions.
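A quick sketch of the mechanism (the table and partition layout are illustrative): MySQL 5.6's partition selection syntax lets a query be restricted to one partition, so each partition can be scanned by a separate worker and the partial results combined.

-- A RANGE-partitioned fact table (layout is illustrative).
CREATE TABLE lineorder (
  lo_orderdate DATE NOT NULL,
  lo_revenue   BIGINT NOT NULL
) PARTITION BY RANGE (YEAR(lo_orderdate)) (
  PARTITION p1992 VALUES LESS THAN (1993),
  PARTITION p1993 VALUES LESS THAN (1994),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- Shard-Query can run one query per partition, in parallel,
-- and then combine the partial aggregates:
SELECT SUM(lo_revenue) FROM lineorder PARTITION (p1992);
SELECT SUM(lo_revenue) FROM lineorder PARTITION (p1993);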
The output from Shard-Query shown here is from the command-line client, but you can use MySQL Proxy to communicate with Shard-Query too.
In the examples I am going to use the schema from the Star Schema Benchmark. I generated data for scale factor 10, which means about 6GB of data in the largest table. I am going to show a few different queries, and …
[Read more]