• skybrian 5 hours ago

    Apparently "documents have reasonably short paragraphs" should be added to "falsehoods programmers believe about text."

    • AlienRobot an hour ago

      Somewhere, a programmer created a 4096 character buffer and sought the next '\n' only to be defeated by Tibetan.
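To make that failure mode concrete, here is a minimal sketch in Python. It is not taken from any real program: `find_newline_fixed` and `iter_paragraphs` are hypothetical names, and the 4096-character buffer size simply mirrors the number in the comment above. The first function gives up when no newline falls inside one buffer; the second keeps accumulating chunks until it finds one.

```python
import io


def find_newline_fixed(stream, bufsize=4096):
    """Naive approach: read one buffer and look for '\n' in it.

    Returns the first line, or None when the line is longer than the buffer.
    """
    buf = stream.read(bufsize)
    idx = buf.find("\n")
    return buf[:idx] if idx != -1 else None


def iter_paragraphs(stream, bufsize=4096):
    """Yield newline-separated paragraphs of any length, reading in chunks."""
    pending = []  # pieces of the paragraph currently being assembled
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        # Every piece except the last one was terminated by a newline.
        *complete, tail = chunk.split("\n")
        for piece in complete:
            pending.append(piece)
            yield "".join(pending)
            pending = []
        pending.append(tail)
    last = "".join(pending)
    if last:
        yield last


if __name__ == "__main__":
    # One 10,000-character paragraph of a Tibetan letter, then a short one.
    text = "\u0f40" * 10000 + "\n" + "short"
    print(find_newline_fixed(io.StringIO(text)))
    print([len(p) for p in iter_paragraphs(io.StringIO(text))])
```

The chunked version still uses memory proportional to the longest paragraph, so it only avoids the hard failure, not the cost of a very long paragraph.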

      • pbronez 4 hours ago

        I never thought about this element of cross language structure before. Text direction, diacritics, punctuation, sure - but I always assumed that chunking was universal. Turns out no:

        “the typographical notion of the paragraph does not really exist in a Tibetan text the way it does in European languages. As a result, Tibetan texts often need to be processed as a long stream of uninterrupted text with no forced line breaks, sometimes over hundreds or thousands of pages.”

        • teractiveodular 2 hours ago

          The same applies to old Chinese, and in fact most ancient languages. Latin and Greek were originally written in scriptio continua, meaning no punctuation or spacing:

          https://en.wikipedia.org/wiki/Scriptio_continua

          • crazygringo 2 hours ago

            Tens of pages, sure.

            But hundreds? Thousands?

            Do they not have the concepts of headers? Sections? Chapters?

            Both in non-fiction and fiction, there are a lot more means of content separation than just paragraphs.

            • DougMerritt 2 hours ago

              "Continued on next scroll"

              • Obscurity4340 2 hours ago

                It's all about the scrolls, man

        • heydenberk 3 hours ago

          Jim Woolsey, a hippie and early-ish computer hacker from New Hope, Pennsylvania, was an important and early force in the digitization of the Tibetan language. This interview[0] with him from 1993 is a fascinating time capsule, and interesting in its own right. He was a family friend and I always admired his singular commitment to this important and underappreciated work.

          [0] https://www.mcall.com/1993/10/08/new-hope-man-computer-guru-...

          • hyperhello 7 hours ago

            This has been in the works for a while. There is an old HyperCard stack to teach Tibetan pronunciation (with 16bit sound) you can try: https://hcsimulator.com/Learn-Tibetan

            • fsckboy 4 hours ago

              the only vowel is AH ?

              • shanekandy 3 hours ago

                In text, the singular vowels are built on the ah syllable with modifying marks.

                • cosignal 4 hours ago

                  The site seems incomplete. Tibetan does have 5 vowels, and it looks like the non-intrinsic vowels are written at the bottom section of the view, but I can't get them to work. I assume the intention would be that you click one of the other vowels to toggle it, but it no worky.

                  • hyperhello 4 hours ago

                    I don't know who created it, or if it was part of a larger proto-Duolingo language product.

              • java-man 7 hours ago

                I want to know the details of how they achieved it (the support for super-long paragraphs, or rather, the absence thereof).

                Does anyone know?

                • l1n 7 hours ago

                  https://gerrit.libreoffice.org/c/core/+/172801

                  Pretty short change for reducing O(n^2) impact with a cache.

                  This change includes the following scalability improvements for documents containing extremely large paragraphs:

                  - Reduces the size of layout contexts to account for LF control chars.

                  - Due to typical access patterns while laying out paragraphs, VCL was making O(n^2) calls to vcl::ScriptRun::next(). VCL now uses an existing global LRU cache for script runs, avoiding much of this overhead.
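A rough illustration of the second point, in Python rather than LibreOffice's C++. This is not the VCL code: `coarse_script`, `compute_script_runs`, `layout` and the cache size are all hypothetical, and the script detection is a toy that only distinguishes the Tibetan block (U+0F00 to U+0FFF) from everything else. It only shows the general shape of the problem, where each laid-out line re-derives script runs for the whole paragraph, and how an LRU cache keyed on the text removes the repeated scans.

```python
from functools import lru_cache

stats = {"scans": 0}  # counts full-paragraph scans, for illustration only


def coarse_script(ch):
    """Toy classifier (assumption): Tibetan block vs. everything else."""
    return "Tibetan" if "\u0f00" <= ch <= "\u0fff" else "Other"


def compute_script_runs(text):
    """Split text into (start, end, script) runs. Cost is O(len(text))."""
    stats["scans"] += 1
    runs = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or coarse_script(text[i]) != coarse_script(text[start]):
            runs.append((start, i, coarse_script(text[start])))
            start = i
    return tuple(runs)


# Same function behind a small LRU cache keyed on the paragraph text.
cached_script_runs = lru_cache(maxsize=32)(compute_script_runs)


def layout(text, line_width, get_runs):
    """Break text into fixed-width lines; each line asks for the script runs."""
    lines = []
    for start in range(0, len(text), line_width):
        end = min(start + line_width, len(text))
        runs = get_runs(text)  # uncached: a full rescan for every line
        # Clip the paragraph-level runs to the current line.
        lines.append([(max(s, start), min(e, end), sc)
                      for s, e, sc in runs if s < end and e > start])
    return lines


if __name__ == "__main__":
    paragraph = "\u0f40" * 100000 + "abc"
    for name, fn in (("uncached", compute_script_runs),
                     ("cached", cached_script_runs)):
        stats["scans"] = 0
        layout(paragraph, 1000, fn)
        print(name, "full scans:", stats["scans"])
```

Without the cache the number of full scans grows with the number of lines, so total work is roughly quadratic in paragraph length; with it, the paragraph is scanned once per distinct text.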

              • xmly 3 hours ago

                There are over 50 Tibetic languages, which one do you choose?

                • teractiveodular 2 hours ago

                  There is only one (modern) written Tibetan script.

                • wslh 7 hours ago

                  With all due respect, the innovation side of Tibetans is also appreciated in "The Nine Billion Names of God" [1].

                  [1] <https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God>

                  • dymk 6 hours ago

                    Unsong takes inspiration from this as well -

                    https://unsongbook.com/

                    • asimovfan 6 hours ago

                      I don't know how it is phrased in the book itself, but in Tibetan Buddhism there is no god. And their innovation is far beyond this book (at least the plot summary on Wikipedia).

                      • wslh 5 hours ago

                        If I were a Tibetan Buddhist, I might say we were just having some fun with Arthur C. Clarke's imagination.

                      • sol2070 5 hours ago

                        Classic!

                      • einpoklum 3 hours ago

                        Hey everyone, I'm Eyal, a LibreOffice project volunteer who does a lot of QA regarding Right-to-Left and Complex-Text-Layout scripts (= written languages). I want to thank thunderbong3 for posting a link to that post - and heartily thank Jonathan Clark, the new RTL-CTL-CJK-focused developer at The Document Foundation, who implemented the performance improvement for Tibetan.

                        Most bugs we encounter and report in LibreOffice are more general, and aren't script specific (e.g. code which forgets that the content may be right-to-left resulting in wrong behavior in those cases); and a lot of the script-specific bugs are about the most popular script, which is Arabic (that is also used for Farsi, Urdu, Javanese etc.)

                        But we do have some issues regarding less-commonly-used scripts, like Tibetan or Mongolian. Here:

                        https://bugs.documentfoundation.org/show_bug.cgi?id=115607

                        is the meta-bug which tracks issues with: Mongolian, Tibetan, Uyghur, Zhuang, Kazak, Xibo, Dai, Yi, Miao, Jingpo, Lisu, Lahu, Wa, etc.

                        We don't know if there are really very few issues specific to those languages (which is quite possible), or whether it's just that they're not used so much and the users aren't motivated enough to file bugs.

                        Still, as Jonathan's recent fix demonstrates, there is certainly the interest to address them when developer-time-resources become available.

                        I would like to encourage everyone who cares about these scripts, and "document editing fairness" across countries and cultures, to consider:

                        1. Try using LibreOffice with such languages which you know at least a little bit of - and if you find any bugs, file them at our BugZilla: https://bugs.documentfoundation.org/

                        2. Consider supporting The Document Foundation, which manages the LibreOffice project, financially:

                        https://www.libreoffice.org/donate/

                        We are one of the larger FOSS projects in the world, with tens of millions of regular users (if not > 100 million) and a board of trustees with members from dozens of countries; but we don't have large corporations investing money or time in the project. While a few commercial companies do contribute to LibreOffice (like Collabora and Allotropia), many fundamental issues are not close enough to their customers' needs, which is why it was decided to hire Jonathan directly to give RTL-CTL-CJK support a boost. Individual user donations are what enable this work.