• SleepyMyroslav 3 hours ago

In gamedev there is a simple rule: don't try to do any of that.

If it is text the game needs to show to the user, then every version of the text that is needed is a translated text. The programmer will never know whether the context or locale will require word-order changes or anything more complicated. Just trust the translation team.

If text is coming from the user, then change the design until there is no need to 'convert'. There are major issues just in showing users back what they entered, because the fonts used for editing and for display could be different. That is not even mentioning RTL and other issues.

Once people learn about localization, questions like why a programming language does not do this 'simple text operation' are just a newcomer detector. :)

    • fluoridation 2 hours ago

>Once people learn about localization, questions like why a programming language does not do this 'simple text operation' are just a newcomer detector. :)

      I think you are purposefully misinterpreting the question. They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.

      What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?

      • squeaky-clean 2 hours ago

        > They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.

        I'm confused now. The article specifically mentions issues with UTF-16 and UTF-32 unicode characters outside the basic multilingual plane (BMP).

        • fluoridation an hour ago

I'm referring to the people who call case conversion in general "a simple text operation". Say you have an std::string and you want to make it lower case. If you assume it contains just ASCII, that's a simpler operation than if you assume it contains UTF-8, but C++ doesn't provide a single function that does either of them. A person can rightly complain that the former is basic functionality the language should include; personally, I would agree. And you could say "wow, doesn't this person realize that case conversion in Unicode is actually complicated? They must be really inexperienced." It could be that the other person really doesn't know about Unicode, or it could be that you and they are thinking about entirely different problems and you're being judgemental a bit too eagerly.

          • squeaky-clean an hour ago

For ASCII in C++, isn't there std::tolower / std::toupper? If you're not dealing with unsigned char types there isn't a simple case conversion function, but that's for a good reason, as the article lays out.

            • fluoridation 35 minutes ago

              Those functions take and return single characters. What's missing is functions that operate on strings. You can use them in combination with std::transform(), but as the article points out, even if you're just dealing with ASCII you can easily do it wrong. I've been using C++ for over 20 years and I didn't know tolower() and toupper() were non-addressable. There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.

      • zahlman 2 hours ago

>If text is coming from the user, then change the design until there is no need to 'convert'

        In games, you can possibly get away with this. Most other people need to worry about things like string collation (locale-aware sorting) for user-supplied text.

      • blenderob 12 hours ago

It is because of issues like this that I gave up on C++. There are so many ways to do something and every way is freaking wrong!

        An acceptable solution is given at the end of the article:

        > If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. I do need more standard library functions that solve these ordinary real-world programming problems.

        • bayindirh 12 hours ago

I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times to being Unicode-aware in a single swoop.

          On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.

Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size by now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of what the "new tricks" should be, so more features are added on top of its already impressive and very long list of features and capabilities.

          You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.

So, C++ is doing fine. It's not that they omitted Unicode during the design phase; Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.

          • zahlman 3 hours ago

>You just can't transform anything developed in "ancient" times to being Unicode-aware in a single swoop.

            Even for Python it took well over a decade, and people still complain about the fact that they don't get to treat byte-sequences transparently as text any more - as if they want to wrestle with the `basestring` supertype, getting `UnicodeDecodeError` from an encoding operation or vice-versa, trying to guess the encoding of someone else's data instead of expecting it to be decoded on the other side....

            But in C++ (and in C), you have the additional problem that the 8-bit integer type was named for the concept of a character of text, even though it clearly cannot actually represent any such thing. (Not to mention the whole bit about `char` being a separate type from both `signed char` and `unsigned char`, without defined signedness.)

            • pornel 12 hours ago

              Being developed in, and having to stay compatible with, ancient times is a real problem of C++.

              The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.

              Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.

              • cm2187 11 hours ago

That explains why there are two functions, one for ASCII and one for Unicode. That doesn't explain why the Unicode functions are hard to use (per the article).

                • BoringTimesGang 10 hours ago

                  Because human language is hard to boil down to a simple computing model and the problem is underdefined, based on naive assumptions.

                  Or perhaps I should say naïve.

                  • cm2187 3 hours ago

                    Well pretty much every other more recent language solved that problem.

                    • kccqzy 3 hours ago

                      Almost no programming language, perhaps other than Swift, solved that problem. Just use the article's examples as test cases. It's just as wrong as the C++ version in the article, except it's wrong with nicer syntax.

                      • zahlman 3 hours ago

Python's strings have uppercase, lowercase and case-folding methods that don't choke on this. Python doesn't use UTF-16 internally (it can use UCS-2 for strings whose code points fit in that range; while a string might store code points from the surrogate-pair range, they're never interpreted as surrogate pairs, but instead as an error encoding so that e.g. invalid UTF-8 can be round-tripped), so it never has to worry about surrogate pairs, and it knows a few things about text casing:

                            >>> 'ß'.upper()
                            'SS'
                            >>> 'ß'.lower()
                            'ß'
                            >>> 'ß'.casefold()
                            'ss'
                        
                        There are a lot of really complicated tasks for Unicode strings. String casing isn't really one of them.

                        (No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)

                        • kccqzy 2 hours ago

                          Still breaks on, for example, Turkish i vs İ. It's impossible to do correctly without language information.

                          > (No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)

                          Yes that's my point. Because in typical languages strings don't store language metadata, this is impossible to do correctly in general.

                        • tedunangst 2 hours ago

                          But that's wrong. The upper case for ß is ẞ.

                          • cm2187 an hour ago

                            C#'s "ToUpper" takes an optional CultureInfo argument if you want to play around with how to treat different languages. Again, solved problem decades ago.

                            • IncreasePosts an hour ago

                              That was only adopted in Germany like 7 years ago!

                              • kccqzy 10 minutes ago

                                Well languages and conventions change. The € sign was added not that long ago and it was somewhat painful. The Chinese language uses a single character to refer to chemical elements so when IUPAC names new elements they will invent new characters. Etc.

                • ectospheno 3 hours ago

                  > Any tool which is old enough will have a thousand ways to do something.

                  Only because of the strange desire of programmers to never stop. Not every program is a never ending story. Most are short stories their authors bludgeon into a novel.

                  Programming languages bloat into stupidity for the same reason. Nothing is ever removed. Programmers need editors.

                  • fluoridation 2 hours ago

                    So how do you design a language that accommodates both the people who need a codebase to be stable for decades and the people who want the bleeding edge all the time, backwards compatibility be damned?

                    • the_gorilla 2 hours ago

                      You don't. Any language that tries to do both turns into an unusable abomination like C++. Good languages are stable and the bleeding edge is just the "new thing" and not necessarily better than the old thing.

                      • fluoridation 2 hours ago

                        C++ doesn't try to do that. It aims to remain as backwards compatible as possible, which is what the GP is complaining about.

                  • relaxing 12 hours ago

                    It’s been 30 years. Unicode predates C++98. Java saw the writing on the wall. There’s no excuse.

                    • bayindirh 11 hours ago

                      > There’s no excuse.

I politely disagree. None of the programming languages that integrated Unicode early were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.

C++ has a far larger target area compared to other programming languages. There are widely used libraries which compile correctly on PDP-11s, even though they are updated constantly.

You can't just say "I'll make everything Unicode-aware, backwards compatibility be damned, eh".

                      • blenderob 11 hours ago

                        But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.

                        But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.

                        • bayindirh 11 hours ago

                          > Converting one Unicode string to another is a purely in-memory, in-CPU operation.

...but it's a complex operation. This is largely what libICU is for. You can't just look up a single table to convert one string to another, the way you can with the ASCII table or any other simple encoding.

Germans have their ß to SS (or to capital ẞ, depending on the year), Turkish has the ı/I and i/İ pairs, and tons of other languages have other rules.

Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I have reported, or how many workarounds I have implemented in my systems.

Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system that I read you even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).

                          • SAI_Peregrinus 7 hours ago

> Unicode is such a complicated system that I read you even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).

                            Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.

                            • bayindirh 7 hours ago

                              I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I know Emoji chaining for skin color, etc.).

                              So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)

                              • fluoridation 2 hours ago

                                Cuneiform codepoints are 17 bits long. If you're using UTF-16 you'll need two code units to represent a character.

                            • blenderob 11 hours ago

                              Thanks for the reply! Really appreciate the time you have taken to write down a thoughtful reply.

                      • gpderetta an hour ago

                        Java ended up picking UCS-2 and getting screwed.

                      • akira2501 3 hours ago

                        > libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.

                        Isn't that mostly just from tables derived from the Unicode standard?

                      • pistoleer 12 hours ago

                        > There are so many ways to do something and every way is freaking wrong!

                        That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.

                        • pornel 12 hours ago

                          JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).

                          Swift has been developed in the modern times, and it's able to tackle Unicode properly, e.g. makes distinction between codepoints and grapheme clusters, and steers users away from random-access indexing and having a single (incorrect) notion of a string length.

                        • Muromec 11 hours ago

                          Well, the only time you can do str lower where unicode locale awareness will be a problem is when you do it on the user input, like names.

How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.

                          • pjmlp 7 hours ago

Because it is a fight to put anything into an ISO-managed language, and only the strongest persevere long enough to make it happen.

                            Regardless of what ISO language we are talking about.

                            • gpderetta an hour ago

Yes, significantly smaller libraries had a hard time getting into the standard. Getting the equivalent of ICU in would be almost impossible. And good luck keeping it up to date.

                            • BoringTimesGang 12 hours ago

>It is because of issues like this that I gave up on C++. There are so many ways to do something and every way is freaking wrong!

                              These are mostly unicode or linguistics problems.

                              • tralarpa 12 hours ago

The fact that the standard library works against you doesn't help (tolower takes an int, but only kind of works (sometimes) correctly on unsigned char values, and even a wchar_t is implicitly converted to int).

                                • BoringTimesGang 12 hours ago

tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefixed headers to use it, which already hints 'here be dragons'.

                                  But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.

                            • appointment 12 hours ago

The key takeaway here is that you can't correctly process a string if you don't know what language it's in. That includes variants of the same language with different rules, e.g. en-US and en-GB, or es-MX and es-ES.

                              If you are handling multilingual text the locale is mandatory metadata.

                              • zarzavat 12 hours ago

                                Different parts of a string can be in different languages too[1].

                                The lowercase of "DON'T FUSS ABOUT FUSSBALL" is "don't fuss about fußball". Unless you're in Switzerland.

                                [1] https://en.wikipedia.org/wiki/Code-switching

                                • schoen 12 hours ago

                                  Probably "don't fuss about Fußball" for the same reasons, right?

                                  • thiht 6 hours ago

                                    I thought the German language deprecated the use of ß years ago, no? I learned German for a year and that's what the teacher told us, but maybe it's not the whole story

                                    • 47282847 5 hours ago

                                      Incorrect. ẞ is still a thing.

                                      • CamperBob2 3 hours ago

                                        Going by what you and the grandparent wrote, it's not just a thing, but two different things: ẞ ß

                                        It is probably time for an Esperanto advocate to show up and set us all straight.

                                        • D-Coder 9 minutes ago

                                          Pri kio vi parolas? En Esperanto, unu letero egalas unu sonon.

                                          What are you talking about? In Esperanto, one letter equals one sound.

                                • vardump 12 hours ago

                                  As always, Raymond is right. (And as usually, I could guess it's him before even clicking the link.)

That said, 99% of the time when doing an upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.

                                  For the remaining 1%, there's ICU library. Just like Raymond Chen mentioned.

                                  • crazygringo 22 minutes ago

> That said, 99% of the time when doing an upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.

                                    I think it's more the exact opposite.

                                    The only times I'm dealing with 7-bit ASCII is for internal identifiers like variable names or API endpoints. Which is a lot of the time, but I can't ever think of when I've needed my code to change their case. It might literally be never.

                                    On the other hand, needing to switch between upper, lower, and title case happens all the time, always with people's names and article titles and product makes and whatnot. Which are never in ASCII because this isn't 1990.

                                    • sebstefan 11 hours ago

                                      Yes please, keep making software that mangles my actual last name at every step of the way. 99% of the world loves it when you only care about the USA.

                                      • Muromec 11 hours ago

                                        If it needs to uppercase names it probably interfaces with something forsaken like Sabre/Amadeus that only understands ASCII anyway.

                                        The real problem is accepting non-ASCII input from user where you later assume it's ASCII-only and safe to bitfuck around.

                                        • sebstefan 10 hours ago

                                          From experience anything banking adjacent will usually fuck it up as well

                                          For some reason they have a hard-on for putting last names in capital letters and they still have systems in place that use ASCII

                                          • Muromec 7 hours ago

                                            If it uses ASCII anyway, what's the problem then? Don't accept non-ASCII user input.

                                            • sebstefan 7 hours ago

                                              First off: And exclude 70% of the world?

                                              Usually they'll accept it, but some parts of the backend are still running code from the 60's.

So you get your name rendered properly on the web interface and in most core features, but one day you wander off the beaten path by, say, requesting some insurance contract, and you'll see your name at the top with some characters mangled, depending on what your name's like. Mine is just accented Latin characters, so it usually drops the accents; not sure how it would work if your name were in an entirely different alphabet.

                                              • Muromec an hour ago

                                                >First off: And exclude 70% of the world?

                                                Guess what, I'm part of this 70% and I also work in a bank and I know exactly how.

                                                Not a single letter in my name (any of them) can be represented with ASCII. When it is represented in UTF-8, most of the people who have to see it can't read it anyway.

                                                So my identity document issued by the country which doesn't use Latin alphabet includes ASCII-representation of my name in addition to canonical form in Ukrainian Cyrillic. That ASCII-rendering is happily accepted by all kinds of systems that only speak ASCII.

                                                People still can't pronounce it and it got misspelled like yesterday when dictated over the phone.

                                                Now regarding the accents, it's illegal to not support them per GDPR (as per case law, discussed here few years ago).

                                          • InfamousRece 9 hours ago

                                            Some systems are still using EBCDIC.

                                        • fhars 12 hours ago

No, when you are doing string manipulation you are almost never interested in just the seven-bit ASCII range, as there is almost no language that can be written using just that.

                                          • vardump 12 hours ago

                                            > as there is almost no language that can be written using just that.

                                            99% of use cases I've seen have nothing to do with human language.

The 1% human-language case needs to be handled properly using a proper Unicode library.

                                            Your mileage (percentages) may vary depending on your job.

                                            • kergonath 11 hours ago

                                              Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…

                                              In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.

                                              ASCII is mostly fine within your programs like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously. Otherwise, anything that comes from a user or gets out of the program needs to behave.

                                              • vardump 11 hours ago

                                                I said use a Unicode library if input data is actual human language. Which names and addresses are.

                                                99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)

                                                • kergonath 8 hours ago

                                                  I am really not sure about this 99%. A lot of programs deal with quite a lot of user-provided data, which you don’t control.

                                                • Muromec 11 hours ago

Who still tries to lowercase/uppercase names, and why? Please tell them to stop.

                                                  • kergonath 8 hours ago

                                                    Hell if I know. I don’t know what kind of abomination e-commerce websites run on their backend, I just see the consequences.

                                                • 9dev 12 hours ago

                                                  It's funny how software developers live in bubbles so much. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must accommodate for proper encoding, for example; that includes pretty much every line-of-business app out there, which is a lot of code.

                                                  • elpocko 11 hours ago

                                                    Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.

                                                    • inexcf 12 hours ago

                                                      Why do you need upper- or lowercase conversion in cases that have nothing to do with human language?

                                                      • vardump 11 hours ago

                                                        Here's an example. Hypothetically say you want to build an HTML parser.

                                                        You might encounter tags like <html>, <HTML>, <Html>, etc., but you want to perform a hash table lookup.

                                                        So first you're going to normalize to either lower- or uppercase.

                                                        • ARandumGuy 3 hours ago

                                                          Converting string case is almost never something you want to do for text that's displayed to the end user, but there are many situations where you need to do it internally. Generally when the spec is case insensitive, but you still need to verify or organize things using string comparison.

                                                          • inexcf 9 hours ago

Ah, I see, we disagree on what is "human language". An abbreviation like HTML and its different capitalisations sounds a lot like a feature of human language to me.

                                                            • recursive 3 hours ago

                                                              Is this a serious argument? Humans don't directly use HTML to communicate with each other. It's a document markup language rendered by user agents, developed against a specification.

                                                              • tannhaeuser 15 minutes ago

Markup languages, and SGML in particular, absolutely are designed for digital text communication by humans and to be written using plain text editors; it's kind of the entire point of avoiding binary data constructs.

And to GP, SGML/HTML actually has a facility to define uppercasing rules beyond ASCII, namely the LCNMSTRT, UCNMSTRT, LCNMCHAR, UCNMCHAR options in the SYNTAX NAMING section of the SGML declaration, introduced in the "Extended Naming Rules" revision of ISO 8879 (the SGML standard, cf. https://sgmljs.net/docs/sgmlrefman.html). Like basically everything else on this level, these rules are still used by HTML 5 to this date; in particular, while element names can contain arbitrary characters, only those in the IRV (ASCII) get case-folded for canonicalization.

                                                            • Muromec 11 hours ago

                                                              But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.

                                                        • daemin 12 hours ago

                                                          I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

                                                          The other normal cases of string usage are file paths and user interface, and the needed operations can be done with simple string functions; even in UTF-8 encoding, the characters you care about are in the ASCII range. With file paths, the manipulations you're most often doing are path-based, so you only care about the '/', '\', ':', and '.' ASCII characters. With user interface elements you're likely to be using them as just static data and only substituting values into placeholders when necessary.
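
                                                          The path case can be sketched like this, assuming UTF-8 (or any ASCII-superset) byte strings; `replace_extension` is an illustrative helper, not a standard function:

```cpp
#include <string>

// Replace a file extension by scanning only for the ASCII '.' and
// separator bytes. Safe on UTF-8 paths because UTF-8 continuation
// bytes are always >= 0x80 and can never collide with '.', '/' or '\'.
std::string replace_extension(const std::string& path,
                              const std::string& new_ext) {
    std::size_t sep = path.find_last_of("/\\");
    std::size_t dot = path.find_last_of('.');
    // A '.' before the last separator belongs to a directory name.
    if (dot == std::string::npos ||
        (sep != std::string::npos && dot < sep))
        return path + new_ext;          // no extension present
    return path.substr(0, dot) + new_ext;
}
```

                                                          Everything between the separator bytes passes through untouched, which is exactly the "leave the middle alone" discipline described above.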

                                                          • pistoleer 12 hours ago

                                                            > I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

                                                            Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure this capitalization is correct and not "Van Den Berg". There is no built-in standard library function in any language that does any of that. It's so much larger than ASCII and even larger than Unicode.

                                                            Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.

                                                            • daemin 11 hours ago

                                                              I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.

                                                              Further, I'd say that if you're creating text to present to the user, then the most common operation is replacement of some field in pre-defined text.

                                                              In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.
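
                                                              As a sketch of that design (the names are hypothetical): correctly-cased variants are computed once at data entry, and rendering only ever splices them into templates:

```cpp
#include <string>

// Toy placeholder substitution in the spirit of fmt: pre-composed,
// correctly-cased strings are stored at data entry and only spliced
// into templates later, so no case mapping happens at render time.
std::string render(std::string tmpl, const std::string& key,
                   const std::string& value) {
    const std::string token = "{" + key + "}";
    for (std::size_t pos = tmpl.find(token); pos != std::string::npos;
         pos = tmpl.find(token, pos + value.size())) {
        tmpl.replace(pos, token.size(), value);
    }
    return tmpl;
}
```

                                                              Skipping past the inserted value on each iteration also keeps the loop from re-matching a token that happens to appear inside the substituted text.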

                                                              • pistoleer 10 hours ago

                                                                I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.

                                                                And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".

                                                            • heisenzombie 12 hours ago

                                                              File paths? I think filesystem paths are generally “bags of bytes” that the OS might interpret as UTF-16 (Windows) or UTF-8 (macOS, Linux).

                                                              For example: https://en.m.wikipedia.org/wiki/Program_Files#Localization

                                                              • vardump 12 hours ago

                                                                File paths are scary. The last I checked (which is admittedly a while ago), Windows didn't for example care about correct UTF-16 surrogate pairs at all, it'd happily accept invalid UTF-16 strings.

                                                                So use standard string processing libraries on path names at your own peril.

                                                                It's a good idea to consider file paths as a bag of bytes.
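
                                                                A well-formedness check makes the hazard concrete. This sketch treats any high surrogate not followed by a low one (or any lone low surrogate) as invalid, which UTF-16 strings coming from the filesystem can and do fail:

```cpp
#include <string>

// Well-formedness check for UTF-16: every high surrogate (D800-DBFF)
// must be immediately followed by a low surrogate (DC00-DFFF), and a
// low surrogate may never appear on its own. Windows path names are
// not required to satisfy this.
bool is_valid_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {           // high surrogate
            if (i + 1 >= s.size()) return false;    // truncated pair
            char16_t next = s[++i];
            if (next < 0xDC00 || next > 0xDFFF) return false;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {    // lone low surrogate
            return false;
        }
    }
    return true;
}
```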

                                                                • netsharc 11 hours ago

                                                                  IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.

                                                                  I think I once hex-edited the FA-table to change a filename to have a lowercase name (or maybe it was disk corruption), trying to delete that file didn't work because it would be trying to delete "FOO", and couldn't find it because the file was named "FOo".

                                                                  • Someone 11 hours ago

                                                                    > It's a good idea to consider file paths as a bag of bytes

                                                                    (Nitpick: sequence of bytes)

                                                                    Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)

                                                                    Unix, from the start, claimed file names were byte sequences, yet assumed many of those to encode ASCII.

                                                                    That’s part of why Plan 9 made the choice “names may contain any printable character (that is, any character outside hexadecimal 00-1F and 80-9F)” (https://9fans.github.io/plan9port/man/man9/intro.html)

                                                                    • daemin 11 hours ago

                                                                      That's what I mean, you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation that you generally need to do is to append a path, remove a path, change extension, things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as is (with some exceptions using OS API specific functions as needed).

                                                                  • BoringTimesGang 12 hours ago

                                                                    Now double all of that effort, so you can get it to work with Windows' UTF-16 wstrings.

                                                              • flareback 2 hours ago

                                                                He gave 4 examples of how it's done incorrectly, but zero actual examples of doing it correctly.

                                                                • TheGeminon 44 minutes ago

                                                                  > Okay, so those are the problems. What’s the solution?

                                                                  > If you need to perform a case mapping on a string, you can use LCMap­String­Ex with LCMAP_LOWERCASE or LCMAP_UPPERCASE, possibly with other flags like LCMAP_LINGUISTIC_CASING. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

                                                                  • commandlinefan an hour ago

                                                                        for (int i = 0; i < strlen(s); i++) {
                                                                            s[i] ^= 0x20;
                                                                        }
                                                                    • calibas 40 minutes ago

                                                                      Thank you for this universal approach. I can now toggle capitalization on/off for any character, instead of just being limited to alphabetic ones!

                                                                      Jokes aside, I was kinda hoping for a good answer that doesn't rely on a Windows API or an external library, but I'm not sure there is one. It's a rather complex problem when you account for more than just ASCII and the English language.
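
                                                                      For the ASCII-only half of that wish, a correct no-library version really is short; the XOR joke above just needs a range check. A sketch, which deliberately does nothing for anything beyond ASCII:

```cpp
#include <string>

// Correct ASCII-only uppercasing with no library calls: the 0x20
// trick is applied only to bytes that are actually a-z, so every
// other byte (including UTF-8 continuation bytes, which are >= 0x80)
// passes through intact.
std::string ascii_to_upper(std::string s) {
    for (char& c : s) {
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>(c & ~0x20);
    }
    return s;
}
```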

                                                                      • vardump an hour ago

                                                                        Surely you meant:

                                                                          s[i] &= ~0x20;
                                                                        
                                                                        We're talking about converting to upper case after all! As an added benefit, every space character (0x20) is now a NUL byte!
                                                                    • PhilipRoman 11 hours ago

                                                                      Thought this was going to be about and-not-ing bytes with 0x20. Wrong for most inputs but sure as hell faster than anything else.

                                                                      • cyxxon 12 hours ago

                                                                        Small nitpick: the example "LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE" is slightly wrong, it seems to me, as we now actually have an uppercase version of that, so it should uppercase to "Latin Capital Letter Sharp S" (U+1E9E). The double-S thing is still widely used, though.

                                                                        • mkayokay 12 hours ago

                                                                          Duden mentions this: "When capital letters are used, SS traditionally stands for ß. In some typefaces there is also a corresponding capital letter; its use is optional ‹§ 25 E3›."

                                                                          But isn't it also dependent on the available glyphs in the font used? So e.g. it needs to be ensured that U+1E9E exists?

                                                                          • Rygian 2 hours ago

                                                                            The footnote #3 in the article (called as part of your quote) covers the different ways to uppercase ß with more detail.

                                                                            • Muromec 11 hours ago

                                                                              But what if you need to uppercase a historical record in a vital records registry from the 1950s that was OCRed last week? Now you need to be not just locale-aware; your locale should be versioned too.

                                                                              • pjmlp 7 hours ago

                                                                                Lowering case is even better, because a Swiss user would expect the two-character sequence “SS” to be converted into “ss” and not “ß”.

                                                                                And thus we add country specific locale to the party.

                                                                              • PoignardAzur 18 minutes ago

                                                                                So I'm going to be that guy and say it:

                                                                                Man, I'm happy we don't need to deal with this crap in Rust, and we can just use String::to_lowercase. Not having to worry about things makes coding fun.

                                                                                • HPsquared 2 hours ago

                                                                                  I thought this was going to be about adding or subtracting 32. Old school.

                                                                                  • high_na_euv 9 hours ago

                                                                                    In C++, basic things are hard

                                                                                    • johnnyjeans an hour ago

                                                                                      nothing about working with locales, or text in general, is basic. we were decades into working with digital computers before we moved past switchboards and LEDs. don't take for granted just how high of a perch upon the shoulders of giants you have. that's exactly how the mistakes in the blog post get made.

                                                                                      • onemoresoop 7 hours ago

                                                                                        It's subjective but I find C++ extremely ugly.

                                                                                      • the_gorilla 2 hours ago

                                                                                        Why are some functions addressable in C++ and others not? Seems like a pointless design oversight.

                                                                                        • ahartmetz 12 hours ago

                                                                                          ...and that is why you use QString if you are using the Qt framework. QString is a string class that actually does what you want when used in the obvious way. It probably helps that it was mostly created by people with "ASCII+" native languages. Or with customers that expect not exceedingly dumb behavior. The methods are called QString::toUpper() and QString::toLower() and take only the implicit "this" argument, unlike Win32 LCMapStringEx() which takes 5-8 arguments...

                                                                                          • cannam 11 hours ago

                                                                                            QString::toUpper/toLower are not locale-aware (https://doc.qt.io/qt-6/qstring.html#toLower)

                                                                                            Qt does have a locale-aware equivalent (QLocale::toUpper/toLower) which calls out to ICU if available. Otherwise it falls back to the QString functions, so you have to be confident about how your build is configured. Whether it works or not has very little to do with the design of QString.

                                                                                            • ahartmetz 9 hours ago

                                                                                              I don't see a problem with that. You can have it done locale-aware or not and "not" seems like a sane default. QString will uppercase 'ü' to 'Ü' just fine without locale-awareness whereas std::string doesn't handle non-ASCII according to the article. The cases where locale matters are probably very rare and the result will probably be reasonable anyway.

                                                                                            • aetherspawn an hour ago

                                                                                              I will admit I don’t love the Qt licensing model, but most things in Qt just work as they are supposed to, and on every platform too.

                                                                                              • vardump 12 hours ago

                                                                                                You just want a banana, but you also get the gorilla. And the jungle.

                                                                                              • serbuvlad 11 hours ago

                                                                                                The real insights here are that strings in C++ suck and UTF-16 is extremely unintuitive.

                                                                                                • criddell 4 hours ago

                                                                                                  Strings in C++ standard library do suck (and C++ is my favorite language).

                                                                                                  As for UTF-16, well, I don't know that UTF-8 is a whole lot more intuitive:

                                                                                                  > And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.

                                                                                                  • recursive 3 hours ago

                                                                                                    UTF-16 has all the complexity of UTF-8 plus surrogate pairs.

                                                                                                    • zahlman 2 hours ago

                                                                                                      Surrogate pairs aren't more complex than UTF-8's scheme for determining the number of bytes used to represent a code point. (Arguably the logic is slightly simpler.) But the important point is that UTF-16 pretends to be a constant-length encoding while actually having the surrogate-pair loophole - that's because it's a hack on top of UCS-2 (which originally worked well enough for Microsoft to get married to; but then the BMP turned out not to be enough code points). UTF-8 is clearly designed from scratch to be a multi-byte encoding (and, while the standard now makes the corresponding sequences illegal, the scheme was designed to be able to support much higher code points - up to 2^42 if we extend the logic all the way; hypothetical 6-byte sequences starting with values FC or FD would neatly map up to 2^31).
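
                                                                                                      Both decoding steps are small enough to compare side by side; a sketch, not a full validating decoder:

```cpp
#include <cstdint>

// UTF-8: sequence length is read off the lead byte's high bits.
// Returns 0 for a continuation byte or a lead byte that is illegal
// in modern UTF-8 (the old 5/6-byte FC/FD forms fall in here too).
int utf8_sequence_length(uint8_t lead) {
    if (lead < 0x80) return 1;          // 0xxxxxxx
    if (lead < 0xC0) return 0;          // 10xxxxxx: continuation, not a lead
    if (lead < 0xE0) return 2;          // 110xxxxx
    if (lead < 0xF0) return 3;          // 1110xxxx
    if (lead < 0xF8) return 4;          // 11110xxx
    return 0;                           // F8-FF: never valid
}

// UTF-16: a surrogate pair contributes 10 bits from each half,
// offset by 0x10000 -- arguably simpler arithmetic than the above.
uint32_t utf16_decode_pair(uint16_t high, uint16_t low) {
    return 0x10000 +
           (static_cast<uint32_t>(high - 0xD800) << 10) +
           (low - 0xDC00);
}
```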