We usually receive and see some questions regarding the charset levels in MySQL, especially after the deprecation of utf8mb3 and the new default uf8mb4. If you understand how the charset works on MySQL but have some questions regarding this change, please check out Migrating to utf8mb4: Things to Consider by Sveta Smirnova. Some of the […]
The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades.
Migration to utf8mb4 has many advantages including:
- It can store more symbols, including emojis
- It has new collations for Asian languages
- It is faster than utf8mb3
Still, you may wonder how migration affects your existing data. This blog covers multiple aspects of it.
Storage Requirements
As the name suggests, the maximum number of bytes that one character can take with character set utf8mb4 is four bytes. This is larger than the requirements for utf8mb3 which takes three bytes and many other MySQL character sets.
Fortunately, utf8mb3 is a subset of …
[Read more]In this blog, I will provide answers to the Q & A for the Troubleshooting Issues with MySQL Character Sets webinar.
First, I want to thank everybody for attending the March 9 MySQL character sets troubleshooting webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:
Q: We’ve had some issues converting tables from
utf8
to
utf8mb4
. Our issue was that the collation we wanted to use –
…[Read more]In this post, I’ll discuss how MySQL 5.7 handles UTF8MB4 and the load data infile.
Many of my clients have told me that they do not like using the LOAD DATA INFILE statement and prefer to manually parse and load the data. The main reason they do it is issues with the character sets, specifically UTF8MB4 and the load data infile. This was surprising to me as nowadays everyone uses UTF8. MySQL 5.7 (as well as 5.6) has full support for UTF8MB4, which should fix any remaining issues (i.e., you can now load new emoji, like ).
Last week I was investigating an interesting case where we were loading data and got the following error:
mysql -e 'select version()' +-----------+ | version() | +-----------+ | 5.7.12 | +-----------+ $ mysql -vvv testdb < load_data.sql ERROR 1300 (HY000) at line 1: Invalid …[Read more]
In Python it is easily possible to findout the name of a Unicode
character and findout some properties about that character. The
module which does that is called unicodedata
.
An example:
>>> import unicodedata
>>> unicodedata.name('☺')
'WHITE SMILING FACE'
This module uses the data as released in the UnicodeData.txt file from the unicode.org website.
So if UnicodeData.txt is a 'database', then we should be able to import it into MySQL and use it!
I wrote a small Python script to automate this. The basic steps are:
- Download UnicodeData.txt
- Create a unicodedata.ucd table
- Use
LOAD DATA LOCAL INFILE
to load the data
This isn't difficult especially because the file doesn't have the actual characters in it. It is …
[Read more]