I've fixed this, and want to share what I did.
The issue was one of mixed encoding. Some fields contained data encoded properly as UTF-8; others contained data encoded as something else, probably ISO-8859-1. When imported to a new UTF-8 database, this was causing truncation.
My steps to solve this:
- Copy the original database,
wordpress
, to a new database,wordpress2
. Make sure the collation ofwordpress2
is set to UTF-8. - Follow any one of the steps above to convert the tables and columns of
wordpress2
to UTF-8. This will cause truncation. - For each row on
wordpress
containing entries where the data when converted to UTF-8 was not the same as the data when converted to ASCII, update the corresponding row inwordpress2
with thewordpress
data, converted to UTF-8. A sample script is below.
The script:
UPDATE wordpress3.wp_options wp3
INNER JOIN wordpress.wp_options wp ON (wp.option_id = wp3.option_id
AND convert(wp.option_value using utf8) != convert(wp.option_value using ascii))
SET wp3.option_value = convert(wp.option_value using utf8);
A more knowledgeable friend wrote a series of scripts for me that queried information_schema
to locate all the columns containing entries where convert(value using utf8) != convert(value using ascii)
, and to then generate versions of the script above for them.
Results:
It worked! On my new database, I can save Japanese without it turning into question marks (because the character set has successfully been set to UTF-8), and all the mis-encoded fields that were causing data truncation have been fixed.
There are some posts that still contain misencoded characters, but since I can find almost all of those strings by searching for â, Ã, Â, or æ, I can just go in and manually replace them. I'm not troubled by that.