Comments Page - Unicode shenanigans: Martine Ã©crit en UTF-8

Unicode shenanigans: Martine Ã©crit en UTF-8 (blog.poisson.chat)
Submitted by 082349872349872 9 months ago

bazzargh 9 months ago
I read this and wondered,
"what Planet software is Planet Haskell using that it's doing these odd things - oh, it's Venus? https://github.com/rubys/venus ... wow that's old... and doesn't that use something like filesystem storage for the feeds and couldn't this happen if you stored the xml with no character set specified and the parser messed it up on the way back in? Which since it's ending up with a windows encoding...wait, why am I remembering any of this....oh. checks credits for Venus, recordscratch yes, that's me, wondering why I fixed venus to run on windows 18 years ago"
https://github.com/rubys/venus/commit/210781768705d20dc3cbe6...
Sorry about that. As I recall, at the time I was the only person using Venus on Windows, and I was just running a hacked-up version on a local machine. There are some conditionals in there about whether it uses libxml2 or not (in modern Python, it wouldn't need to). That call doesn't take a charset parameter, and my guess is the problems begin as soon as libxml2 tries to parse the file on disk. I think my own version was frankensteined to the extent of using a SQLite back end so I didn't have to deal with Windows files any more.
The author of Venus, Sam Ruby, is on here (rubys) but it looks like he hasn't checked in for a long time; last I saw he was over at fly.io.
Oh, and the even funnier part of all this is that, back when this was written, Sam's blog was THE go-to place to look up the list of Mojibake mistakes you'd made...
https://intertwingly.net/stories/2004/04/14/i18n.html#Cleani...
netsharc 9 months ago
When UTF-8 wasn't universal (geez, I'm old... It was still this century though!), a page I found to figure out what was going on when I encountered Mojibake was http://www.jeppesn.dk/utf-8.html . Amazingly, it hasn't suffered linkrot and is still online.
And I fully agree with footnote 2, why is an extended Unicode character being used in place of the apostrophe? https://tedclancy.wordpress.com/2015/06/03/which-unicode-cha...
- LegionMammal978 9 months ago
  Because the apostrophe as used in English is a punctuation mark, not a letter, and especially not a modifier letter. The author argues that any name ought to be matched by anything in \w, and we should avoid a punctuation character for that reason, but he doesn't mention other punctuation marks like the hyphen that also commonly occur in names.
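A minimal sketch of the \w point, assuming Python's re module with str patterns (other regex engines may classify these characters differently). The sample names are hypothetical and differ only in the apostrophe-like character; a hyphenated name is included for comparison.

```python
import re

# Hypothetical sample names; only the punctuation character differs.
names = {
    "U+0027 APOSTROPHE": "O'Brien",
    "U+2019 RIGHT SINGLE QUOTATION MARK": "O\u2019Brien",
    "U+02BC MODIFIER LETTER APOSTROPHE": "O\u02bcBrien",
    "U+002D HYPHEN-MINUS": "Jean-Luc",
}

# For str patterns, \w matches Unicode letters, digits and the underscore.
# U+02BC is a modifier *letter*, so it is the only one of these that counts.
results = {
    label: bool(re.fullmatch(r"\w+", name)) for label, name in names.items()
}

for label, matched in results.items():
    verdict = "matches" if matched else "does not match"
    print(f"{label}: {verdict} \\w+")
```

Only the U+02BC spelling is matched as a single word; the ASCII apostrophe, U+2019 and the hyphen all split the name.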
- throw0101b 9 months ago
  > And I fully agree with footnote 2, why is an extended Unicode character being used in place of the apostrophe?
  Some reasons:
  > The Unicode character ’ (U+2019 right single quotation mark) is used for both a typographic apostrophe and a single right (closing) quotation mark.[1] This is due to the many fonts and character sets (such as CP1252) that unified the characters into a single code point, and the difficulty of software distinguishing which character is intended by a user's typing.[2] There are arguments that the typographic apostrophe should be a different code point, U+02BC modifier letter apostrophe.[3][better source needed]
  > The straight apostrophe ' (the "ASCII apostrophe", U+0027 ' apostrophe) is even more ambiguous, as it could also be intended as a left or right quotation mark, or a prime symbol.
  * https://en.wikipedia.org/wiki/Right_single_quotation_mark
  * https://en.wikipedia.org/wiki/Apostrophe#Unicode
Dwedit 9 months ago
Wikipedia used to have this picture for an illustration of Mojibake: https://dic.academic.ru/pictures/wiki/files/76/Letter_to_Rus... A very good job from the postal employees who corrected it.
andai 9 months ago
I often run into this when I do stuff in Python and forget to add encoding="utf-8" to open(). I think they're finally changing this to the default.
Actually I ran into a separate issue on Windows where Python will automatically replace the line endings depending on the OS. So I had to specify newline='\n' as an argument to open() or it would alter the newlines to Windows format in the output.
(My fault for not running it in WSL, I guess.)
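For reference, a small sketch of the two open() arguments mentioned above. The file name is a hypothetical placeholder and the file lives in a temporary directory. As far as I know, PEP 686 is what plans to make UTF-8 the default (targeting Python 3.15), but passing encoding= explicitly works on every version.

```python
import os
import tempfile

text = "Martine écrit en UTF-8\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "martine.txt")  # hypothetical file name

    # encoding="utf-8": do not depend on the locale's default encoding.
    # newline="\n": write "\n" as-is instead of translating it to os.linesep
    # (which would be "\r\n" on Windows).
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

    # Read the raw bytes back to see exactly what landed on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)
```

The bytes on disk should be the same on Windows and elsewhere: the é is stored as the two bytes C3 A9 and the line ends with a bare LF.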
kevin_thibedeau 9 months ago
> What’s
Unicode has only one apostrophe. The same one as ASCII apostrophe. The problem here is people using right single quote as a "fancy" apostrophe when they should be using a font that renders apostrophe in the desired way. I have to fix this junk all the time in CD-text and Musicbrainz metadata.
- uasi 9 months ago
  The Unicode Standard states that U+2019 RIGHT SINGLE QUOTATION MARK is the preferred apostrophe character.
  0027 APOSTROPHE [snip] * 2019 is preferred for apostrophe
  https://unicode.org/Public/UNIDATA/NamesList.txt
  zinekeller 9 months ago
  Sadly, I don't think that pedantic people will accept Unicode's notes, the same way that they will still use the Ohm symbol (which is just there for CJK compatibility) even if Unicode explicitly stated that the correct approach is to use the capital omega symbol.
  (Or even the dreaded case of using the Philippine Peso sign (which is simply named in Unicode as the PESO SIGN) for other pesos, which is sometimes encountered in some apps! At least this is outright wrong that it is corrected immediately.)
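The Ohm case can be checked with Python's unicodedata module. A short sketch: U+2126 OHM SIGN is canonically equivalent to U+03A9 GREEK CAPITAL LETTER OMEGA, so normalization replaces it with the omega.

```python
import unicodedata

ohm_sign = "\u2126"       # OHM SIGN
capital_omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

print(unicodedata.name(ohm_sign))
print(unicodedata.name(capital_omega))

# The two are different code points, so a plain comparison fails...
print(ohm_sign == capital_omega)

# ...but NFC normalization maps the Ohm sign onto the capital omega.
normalized = unicodedata.normalize("NFC", ohm_sign)
print(normalized == capital_omega)
```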
- zahlman 9 months ago
  First: no, that reasoning doesn't generalize nearly well enough. Unicode has many quotation marks, and " is just one of them. “ and ” are perfectly acceptable to use to wrap double-quoted text and you can't say they "should" use ordinary " and a proper font, because a font doesn't know which one is the beginning and which is the end of the quoted text. (Hence why those characters even exist).
  Second: no, Unicode definitely considers other characters to be apostrophes:
  >>> import unicodedata
  >>> unicodedata.name(chr(700)) # https://en.wikipedia.org/wiki/Modifier_letter_apostrophe
  'MODIFIER LETTER APOSTROPHE'
  >>> unicodedata.name(chr(1370)) # https://en.wikipedia.org/wiki/Armenian_(Unicode_block)
  'ARMENIAN APOSTROPHE'
  >>> unicodedata.name(chr(2036)) # https://en.wikipedia.org/wiki/N%27Ko_script
  'NKO HIGH TONE APOSTROPHE'
  >>> unicodedata.name(chr(2037))
  'NKO LOW TONE APOSTROPHE'
  >>> unicodedata.name(chr(65287)) # https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
  'FULLWIDTH APOSTROPHE'
  >>> unicodedata.name(chr(917543)) # https://en.wikipedia.org/wiki/Tags_(Unicode_block)
  'TAG APOSTROPHE'
  There's even a "double apostrophe" which is definitely not any kind of quotation mark, semantically (https://en.wikipedia.org/wiki/Modifier_letter_double_apostro...) .
  For that matter, there's a "Greek question mark" (https://en.wikipedia.org/wiki/Question_mark#Greek_question_m...) which is semantically equivalent to ? for Greek text, but rendered identically to ; in most fonts. And there are some CJK ideographs (https://en.wikipedia.org/wiki/Ghost_characters) which represent completely fake characters not used in any writing (including in any other East Asian language), which were included in an old Japanese character standard by mistake and then copied into Unicode.
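A quick sketch of how some of these look-alikes behave under normalization, using Python's unicodedata. The Greek question mark is canonically equivalent to the semicolon, so NFC already folds it away. The fullwidth apostrophe only folds to the ASCII one under the compatibility form NFKC, and the modifier letter apostrophe survives both.

```python
import unicodedata

greek_question_mark = "\u037e"   # GREEK QUESTION MARK
fullwidth_apostrophe = "\uff07"  # FULLWIDTH APOSTROPHE
modifier_apostrophe = "\u02bc"   # MODIFIER LETTER APOSTROPHE

# Canonical normalization: the Greek question mark becomes a plain semicolon.
nfc_question = unicodedata.normalize("NFC", greek_question_mark)

# The fullwidth apostrophe is untouched by NFC but folded by NFKC.
nfc_fullwidth = unicodedata.normalize("NFC", fullwidth_apostrophe)
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth_apostrophe)

# The modifier letter apostrophe has no decomposition at all.
nfkc_modifier = unicodedata.normalize("NFKC", modifier_apostrophe)

print(nfc_question == ";")
print(nfc_fullwidth == fullwidth_apostrophe, nfkc_fullwidth == "'")
print(nfkc_modifier == modifier_apostrophe)
```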
  "Unicode only has one" is the start of a lot of false claims. It is very much "designed by committee".
  eviks 9 months ago
  First: you don't need to generalize it, it's true in this specific case, and this fact doesn't depend on the ability to generalize
  Second: no, there is only a single apostrophe fit for this context; the other ones are used for different purposes
  keybored 9 months ago
  We can follow and make all sorts of rules. What you cannot do:
  - Unicode has only one apostrophe (the ASCII one literally named that)
  When Unicode says
  - 2019 is preferred for apostrophe
  You can’t appeal to Unicode and then ignore them in the next breath.
  eviks 9 months ago
  You should take this argument back to the OP or zahlman re. whether referencing Unicode implies agreement with any nonsense in the standard's comments (such as rejecting the literal "apostrophe" to be used as an apostrophe).
  I was responding to the list of symbols named "apostrophe", where zahlman also seems to follow the consistent logic of only listing apostrophes, not quotes
  keybored 9 months ago
  > You should take this argument back to the OP or zahlman re. whether referencing Unicode implies agreement with any nonsense in the standard's comments (such as rejecting the literal "apostrophe" to be used as an apostrophe).
  You: No, that’s completely true what they said there
  You: Wait, you should take that up with them…
  Just don’t speak out of your mouth with “Unicode” when you aren’t prepared to get it thrown back in your face? Doesn’t seem difficult.
  > I was responding to the list of symbols named "apostrophe", where zahlman also seems to follow the consistent logic of only listing apostrophes, not quotes
  Oh wow they’re named apostrophe? How great. I’ll start using this “start of header” character in my emails, that is probably so fit for purpose.
  eviks 9 months ago
  But what does seem difficult is for you to use words for understanding instead of throwing them in people's faces.
  It is true what they said within the context of the conversation where the standard notes are either explicitly or implicitly rejected.
  > that is probably so fit for purpose.
  Or probably not. Unlike in this case where apostrophe is definitely fit for purpose
  keybored 9 months ago
  I don’t know why I phrased that so aggressively. Sorry.
- eadmund 9 months ago
  > Unicode has only one apostrophe. The same one as ASCII apostrophe. The problem here is people using right single quote as a "fancy" apostrophe when they should be using a font that renders apostrophe in the desired way.
  No, the ASCII ‘quotes’ are inch and foot markers. Relying on a font to render the inch and foot markers as quotes is a … unique … approach.
  > I have to fix this junk all the time in CD-text and Musicbrainz metadata.
  Wait, are you the guy whose Musicbrainz titles are constantly replacing my carefully- and properly-punctuated ones every time I sync? I beg you to reconsider.
  jimrandomh 9 months ago
  What input method are you using such that this is even possible? Nearly all English speakers are using keyboards with a single apostrophe key which inserts \x27, and could not insert any of the other quote characters even if they wanted to. As a result, nearly all extant English-language text uses \x27 for both apostrophes and single quotes, and all this Unicode prescriptivism is describing a convention that is clearly not the one that English actually follows.
  eadmund 9 months ago
  > What input method are you using such that this is even possible?
  ibus. In my X settings the physical Caps Lock key is turned into Compose, and then I can easily type ‘ with Compose <' (and vice-versa), ’ with Compose >' (ditto), “ with Compose <" (ditto), ” with Compose >" (ditto), ß with Compose ss, þ with Compose th, Þ with Compose TH, æ with Compose ae, Æ with Compose AE, … with Compose .., — with Compose ---, – with Compose --. and so forth.
  https://github.com/kragen/xcompose offers an excellent XCompose file with over a thousand wonderful Compose mappings. Highly recommended!
  eviks 9 months ago
  > the ASCII ‘quotes’ are inch and foot markers.
  There are special symbols for those. There is no other symbol for the apostrophe
  > Relying on a font to render the inch and foot markers as quotes is a … unique … approach
  No, you would rely on a font to render those dedicated inch symbols as inch markers
  Besides you can only do the display substitution with context awareness, so that ' as inches won't be changed to quotes, but ' in titles will be
- throw0101b 9 months ago
  > Unicode has only one apostrophe.
  Two code points have "apostrophe" in their name, U+0027 and U+02BC:
  * https://en.wikipedia.org/wiki/Apostrophe#Unicode
  There is also a third which is often used, U+2019:
  > The Unicode character ’ (U+2019 right single quotation mark) is used for both a typographic apostrophe and a single right (closing) quotation mark.[1] This is due to the many fonts and character sets (such as CP1252) that unified the characters into a single code point, and the difficulty of software distinguishing which character is intended by a user's typing.[2] There are arguments that the typographic apostrophe should be a different code point, U+02BC modifier letter apostrophe.[3][better source needed]
  > The straight apostrophe ' (the "ASCII apostrophe", U+0027 ' apostrophe) is even more ambiguous, as it could also be intended as a left or right quotation mark, or a prime symbol.
  * https://en.wikipedia.org/wiki/Right_single_quotation_mark
numpad0 9 months ago
ÁÉÍÓÚ is always Windows-1252. Windows has fully supported UTF-8 for decades, but for compatibility it still defaults to the regional encoding of the current system locale unless told otherwise. Therefore, an app that does not explicitly specify UTF-8 will still butcher the text when it runs on Windows and handles char[] or String, even on Windows 11.
IIRC it's a combination of some new_API_2.final() Win32 API and compilation options for C++, C#, and Java. Microsoft briefly tried to switch the system locale to UTF-8 on Windows 10, but I think they've since given up on it.
Other platforms such as Electron, Android, iOS shouldn't have this problem; those should be UTF-8 native.
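For a rough way to see which default a given machine hands out, here is a sketch in Python rather than the C++/C#/Java mentioned above. It assumes CPython, where text-mode open() without encoding= falls back to this locale value unless UTF-8 mode is enabled.

```python
import locale
import sys

# Encoding that text-mode open() falls back to when no encoding= is given.
# With UTF-8 mode enabled, this reports UTF-8 regardless of the locale.
default_encoding = locale.getpreferredencoding(False)

print("locale default encoding:", default_encoding)
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
```

On a Western European Windows install this would typically print cp1252, and on most Linux or macOS systems UTF-8, but the output depends entirely on the local configuration.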
thrtythreeforty 9 months ago
Can someone explain the French meme to me? I feel like I've missed several references. Obviously it's a joke about UTF-8 and the characters in the title, but I don't get it.
- alricb 9 months ago
  Martine [1] is a series of French-language children's books. They have simple titles like "Martine à la ferme" (Martine at the farm) or "Martine fait du camping" (Martine goes camping).
  "Martine Ã©crit en UTF-8" is what you get if you interpret the UTF-8 string "Martine écrit en UTF-8" (Martine writes in UTF-8) as Latin-1. It's not as common as it once was, but that kind of encoding issue used to be encountered fairly often by French speakers on the Web.
  [1]: https://en.wikipedia.org/wiki/Martine_(character)
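The round trip described above can be reproduced in a few lines of Python. This is only an illustration of the mechanism, not the code path of any particular feed aggregator. Decoding as cp1252 instead of Latin-1 gives the same result for this particular string.

```python
original = "Martine écrit en UTF-8"

# Encode correctly as UTF-8, then decode with the wrong codec.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# As long as nothing was lost, reversing the two steps repairs the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The é is two bytes in UTF-8 (C3 A9), and Latin-1 reads each byte as its own character, which is where the Ã© pair comes from.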
  082349872349872 9 months ago
  Pedantry: the image seems to have been taken from an older cover of "Martine vive la rentrÃ©e"
  https://www.casterman.com/Jeunesse/Catalogue/vive-la-rentree...
  EDIT: for a short course in contemporary French slang, one could do worse than https://www.google.com/search?q=martine+parodie&udm=2
  (my favourite being "Martine and her parodies")