• notpublic 6 months ago

    Instead of doing a diff, I'm curious whether normalized compression distance (NCD) [1] would yield a better result. It is a very simple algorithm:

    to compare two images, i1 and i2

      l1  = length(gzip(i1))
      l2  = length(gzip(i2))
      l12 = length(gzip(concatenate(i1, i2)))
    
      ncd = (l12 - min(l1, l2))/max(l1, l2)
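
    The pseudocode above translates almost directly to Python. A minimal sketch, assuming both images are already loaded as raw bytes; the helper names `compressed_len` and `ncd` are made up for illustration, and gzip is only one possible compressor:

```python
import gzip


def compressed_len(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input."""
    return len(gzip.compress(data))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    la = compressed_len(a)
    lb = compressed_len(b)
    lab = compressed_len(a + b)  # compress the concatenation
    return (lab - min(la, lb)) / max(la, lb)
```

    Values near 0 suggest the inputs are similar; values near 1 suggest they share little. Two caveats: DEFLATE only looks back 32 KiB, so redundancy between large inputs may go unnoticed, and already-compressed formats such as PNG or JPEG would probably need decoding to raw pixels first.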
    
    Here is a nice article where I found out about this long ago.

    https://yieldthought.com/post/95722882055/machine-learning-t...

    From the article:

    "Basically it states that the degree of similarity between two objects can be approximated by the degree to which you can better compress them by concatenating them into one object rather than compressing them individually."

    [1] https://en.wikipedia.org/wiki/Normalized_compression_distanc...

    • johnisgood 6 months ago

      Oh interesting, I remember comparing images before, I think I was doing a diff as well, so I suppose this would have worked? Nice to know! They were very small images though.

      It probably would have added compression overhead, though, which in my case would have been detrimental.

      • notpublic 6 months ago

        Do try it. We use it for text search in one of our apps, and it works remarkably well: basically, to find which chunks contain the given text. Since the text can span multiple chunks, a simple string search will not work.
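
        One way such a chunk search could look, as a rough sketch only: the app's real implementation is not shown in this thread, and `rank_chunks` and its return shape are hypothetical. It scores every chunk against the query with the NCD formula from the parent comment and sorts by distance:

```python
import gzip


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance, using gzip as the compressor."""
    la = len(gzip.compress(a))
    lb = len(gzip.compress(b))
    lab = len(gzip.compress(a + b))
    return (lab - min(la, lb)) / max(la, lb)


def rank_chunks(query: str, chunks: list[str]) -> list[tuple[float, int]]:
    """Return (distance, chunk_index) pairs, most similar chunk first.

    Hypothetical helper: a lower distance means the chunk shares more
    content with the query.
    """
    q = query.encode("utf-8")
    scored = [(ncd(q, chunk.encode("utf-8")), i) for i, chunk in enumerate(chunks)]
    return sorted(scored)
```

        A chunk holding only part of the query should still compress better alongside it than an unrelated chunk, which is presumably why this copes with text that spans chunk boundaries. Very short queries leave the compressor little to work with, so the ranking is likely to get noisy there.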

      • varjag 6 months ago
        • superdisk 6 months ago

          I just restarted the webserver. It's running on OpenBSD HTTPd + MediaWiki + SQLite, and keeping it up has been a perpetual thorn in my side. Oh well. I need to figure out some alternative setup probably.

          • j45 6 months ago

            Modify your DNS to put cloudflare or bunny in front of it and you'll be good. Don't stop self-hosting :)

            • zoezoezoezoe 6 months ago

              self-hosting means freedom, never stop self-hosting

            • MonkeyClub 6 months ago

              Is your VPS on OpenBSD.Amsterdam by any chance? (The 46.23.. address seems familiar.)

              • superdisk 6 months ago

                Yep, that's it. The host is (for the most part) fine, but there's either some problem with httpd or the PHP worker pool where it just dies after some number of requests.

                • MonkeyClub 6 months ago

                  Hi, neighbor! (I'm on server 7.)

                  The service is indeed great, Mischa does an excellent job.

                  Yeah PHP on httpd can be flaky, I'd wish for a lighter solution for wikis.

          • xenonite 6 months ago
            • rcarmo 6 months ago

              Holy cow.

              • kanwisher 6 months ago

                honestly this would be better with an AI model

                • secondplacetho 6 months ago

                  ML is the second best answer to everything, and very rarely the first best answer.

                  Of course it'd be better than something that is intentionally limiting itself. But that says nothing.

                  • teruakohatu 6 months ago

                    > honestly this would be better with an AI model

                    In the article, the author tried Tesseract, which uses ML and has some neural network models, and also tried ChatGPT.

                    I have come to the same conclusion as the author when doing OCR that needed 100% accuracy.

                    When you know the font and spacing, and the layout is fixed, old-school statistical analysis of the pixels works a treat.
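
                    To illustrate the general idea (this is not the article author's method, which is not reproduced in this thread): with a known font, each glyph can be compared pixel by pixel against stored templates. The 3x5 bitmaps and the names `TEMPLATES`, `pixel_distance` and `classify` below are invented for the sketch:

```python
# Hypothetical 3x5 bitmaps for a fixed font: '#' is ink, '.' is background.
TEMPLATES = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "7": ("###", "..#", ".#.", ".#.", ".#."),
}


def pixel_distance(a, b) -> int:
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def classify(glyph) -> str:
    """Return the template character with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


# A "7" with one stray pixel in the bottom-left corner.
noisy_seven = ("###", "..#", ".#.", ".#.", "##.")
print(classify(noisy_seven))
```

                    Because the layout is fixed, the glyph cells can be cut out at known offsets, so no segmentation or learned model is needed; a distance threshold could flag cells that match no template well.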

                    • register 6 months ago

                      Completely second that. This is my experience as well.

                      • Vampiero 6 months ago

                        You can generalize that to anything: when you know the problem domain so well, why the hell are you using ChatGPT to solve any problem within it? Use the most specialized tool for the job or you're just wasting CPU and memory (and electricity, and money, and time). Same goes for a neural net trained on every possible character set. If you know the font and character size in advance it's way overkill.

                        It's a bit more effort to set up since you actually have to set it up. But at least it's done right.

                      • curt15 6 months ago

                        By "AI model" do you mean neural nets? "AI" and "ML" are just buzzwords that convey no real meaning about the underlying mathematics. The underlying models could be something as basic as linear or logistic regression, which, depending on the application, could actually be more appropriate than full-blown neural nets.