The MySQL soap opera

UnConeD: drupal passes a bytestream consisting of utf-8 data to mysql if mysql is set to a plain 8-bit character then it will be treated as such. if it is set to utf-8, then it is treated as utf-8. but regardless, it is stored in the same fashion

It makes sense. In fact if you convert the tables and columns from latin1 to utf8, the content does not change. In latin1 you see three characters representing one utf8 character. But if you convert the db in utf8 (without changing the content of the fields), the database assumes that the three characters are then three utf8 characters, representing EIGHT latin1 characters. The content remains the same. This is why he says that they are stored in the same way.

My database started more than a year ago without the character encoding. But it always contained utf8 data stored in the classic iso/latin mode. When mysql got upgraded, all my database was automatically considered as latin. But the content in the tables was latin1 representing utf8. And, consequently, the data was sent out properly by Drupal as utf8.

Morb: UnConeD: alright. another stupid question. in mysql 4.1, all the tables are defaulted to latin1 for a charset. likewise, when they are dump’d, the table defs are latin1. that shouldn’t have any corruptive effect on utf-8 data?

UnConeD: morbus: no because latin-1 means, each character is a byte. so the utf-8 bytestream enters and leaves as is. it just doesn’t get interpreted correctly. the only place where it has an effect is on column data lengths

Here you ask the wrong question, this is the point. He is right because he assumes the database is in latin1 and the dump is again in latin1. So the utf8 data is still preserved because always coded as latin1, so retained.

The point is that the curruption isn’t because the utf8 data is converted to latin1 as you assume in your question. But the exact opposite.

The corruption I have in the database is the result of this fucking option that I see if I do a mysqldump -?

default-character-set utf8

I believe that mysqldump, by default, reads my latin1 data (three latin characters for one utf8), taking it as utf8 (three utf8 characters) and dumping it in a ISO-8859-1 (latin1) file.

Basically:
In the web page I have one character of utf8: ( ’ )
In the database that single character is encoded as three latin1 characters: ( ’ )
As I dump the database, mysqldump defaults to utf8. This means that he believes my database data is ALREADY in utf8 ( “ ) so three utf8 characters to dump in a latin1 .sql file. Resulting in this following crap: ( ’ )-latin equal to ( ’ )-utf8.

Morbus: UnConeD: without –default-character-set=”latin1″ on the mysqldump, the dump is corrupted. WITH that –d-c-s though, the curly quotes are fine. where “dump is corrupted” means “the curly quotes are messed up”

UnConeD: cut and paste from a mysqldump WITHOUT -d-c-s: in their 20’s and 30’s. and with -d-c-s: in their 20’s and 30’

I WISH I had that type of dump. In my latin1 database those three characters in the dump are what correctly represent one utf8 character. So it’s all good.

The point is that it’s not what I have in my dumps. Without the latin1 option my dumps show: ( ’ )

UnConeD: morbus: but that’s just the literal utf-8 bytes, what encoding are you viewing it in?

Morbus: no idea.

He is saying that it’s the latin1 encoding of the utf8 character. So if you are viewing the dump in latin, the dump is correct. The point is that I was seeing ( ’ ) in utf8 already. And a friggin ( ’ ) in the latin dump.

UnConeD: there’s your problem. you’re not using a utf-8 locale most likely.

Morbus: so, without the -d-c-s, if I reimported those characters into a fresh db, they’d still be fine?

If the dump you showed him was set in latin1 (so three characters for one utf8) he is correct to say that the dump was working properly. But that’s not what happened in my corruption problem.

Morbus: UnConeD: another question. So, the reason I’m seeing the garbled curly quotes is my shell encoding, fine. So, if I import the what-appears-to-be-garbled stuff, it’d still import correctly? What about if I dumped with the default-character-set (so that it looked fine in my shell) and then reimported? would that too still be fine?

UnConeD: morbus: no because you’d lose all non-latin1 characters

Morbus: on which? the –default-char-set?

UnConeD: yeah the overriding one

Bad. He is wrong here. Because my goddamn database is in latin1. If I export a latin1 database as latin1 there is no conflict. What I override is a wrongly defaulted mysqldump.

Could you forward him this mail to understand if what I’m saying is correct?


Thanks to both, really.

Posted in: Uncategorized | Tagged:

Leave a Reply