• llm_trw 9 months ago

    Right, so there's more to it than I initially thought, but it's still hopelessly data-constrained. They’re hoping you could magically obtain all the necessary data from images and videos recorded by the phone when you remember to use the camera.

    From my experience building a meeting-minutes AI tool for myself, I've found that audio carries far more semantic information than video, and we're lacking most of the model capabilities needed to make audio useful, like speaker diarization. For video you need object detection, and not just the 100 or so categories YOLO or DETR cover; you need to build a hierarchy of objects, in addition to OCR running continuously on every frame.
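
    Just to make the per-frame side concrete, here's a minimal sketch of "detection + OCR on every frame" using off-the-shelf stand-ins I picked for illustration (DETR via Hugging Face transformers, Tesseract for OCR); note the detector only knows the ~90 COCO classes, which is exactly the limitation I mean:

        # Rough per-frame sketch: closed-vocabulary detection + OCR.
        # Assumes `pip install transformers pillow pytesseract` and a local Tesseract install.
        from PIL import Image
        import pytesseract
        from transformers import pipeline

        detector = pipeline("object-detection", model="facebook/detr-resnet-50")

        def describe_frame(path: str) -> dict:
            frame = Image.open(path)
            objects = [d["label"] for d in detector(frame) if d["score"] > 0.7]
            text = pytesseract.image_to_string(frame)
            return {"frame": path, "objects": objects, "text": text.strip()}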

    Once the raw data collection is done, you somehow need to integrate it into a RAG system that can retrieve all of this in a meaningful way to feed to an LLM, with a context length far beyond anything currently available.
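
    The retrieval side itself is conceptually just embed-and-search over those per-frame descriptions and transcripts; it's the volume and the context length that hurt. A toy sketch, assuming sentence-transformers (the snippets are made up):

        # Toy retrieval sketch: embed snippets once, cosine-search them at query time.
        # Assumes `pip install sentence-transformers numpy`.
        import numpy as np
        from sentence_transformers import SentenceTransformer

        embedder = SentenceTransformer("all-MiniLM-L6-v2")
        snippets = ["09:14 whiteboard: Q3 roadmap", "09:20 Alice: ship the beta Friday"]  # hypothetical
        vectors = embedder.encode(snippets, normalize_embeddings=True)

        def retrieve(query: str, k: int = 2) -> list[str]:
            q = embedder.encode([query], normalize_embeddings=True)[0]
            scores = vectors @ q  # cosine similarity, since everything is normalized
            return [snippets[i] for i in np.argsort(-scores)[:k]]

        print(retrieve("when is the beta shipping?"))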

    All in all, just for inference you'd need more compute than you'd find in the average supercomputer today. Give it 20 years and multiple always-on cameras and microphones attached to you, and this will be as simple as running a local 8B LLM is today.

    • nareshshah139 9 months ago

      Nah, you can do all of this with a simple Phi-3.5 Instruct + SAM 2, both of which fit on an Nvidia Jetson Orin 64 GB.

      We do this at scale in factories/warehouses, describing everything that happens in them, like:

      - Idle time
      - Safety incidents
      - Process following across frames
      - Breaks
      - Misplacement of items
      - Counting items placed/picked/assembled across frames

      • nareshshah139 9 months ago

        SAM 2/Grounding DINO + Phi-3.5 Instruct (Vision) give you an essentially unlimited vocabulary.
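
        Roughly what the open-vocabulary detection step looks like with the Hugging Face port of Grounding DINO (a sketch, not production code; the prompt labels are just examples):

            # Open-vocabulary detection sketch: prompt with arbitrary text labels
            # instead of a fixed class list.
            import torch
            from PIL import Image
            from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

            model_id = "IDEA-Research/grounding-dino-tiny"
            processor = AutoProcessor.from_pretrained(model_id)
            model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

            image = Image.open("frame_000123.jpg")  # hypothetical warehouse frame
            labels = "a worker. a forklift. a pallet. a safety vest."  # lowercase, dot-separated

            inputs = processor(images=image, text=labels, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs)

            results = processor.post_process_grounded_object_detection(
                outputs, inputs.input_ids,
                box_threshold=0.35, text_threshold=0.25,
                target_sizes=[image.size[::-1]],
            )[0]
            for label, box in zip(results["labels"], results["boxes"]):
                print(label, [round(x) for x in box.tolist()])

        The boxes then go to SAM 2 for masks, and the crops/frames go to Phi-3.5 Vision for descriptions.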

        • nareshshah139 9 months ago

          If you want audio transcription, just add Distil-Whisper to the mix.
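
          Something along these lines (a sketch; the checkpoint is the public Distil-Whisper one, not necessarily what you'd ship):

              # Minimal transcription sketch with Distil-Whisper via transformers.
              # Assumes `pip install transformers` plus ffmpeg for audio decoding.
              from transformers import pipeline

              asr = pipeline(
                  "automatic-speech-recognition",
                  model="distil-whisper/distil-large-v3",
                  chunk_length_s=30,       # chunk long recordings
                  return_timestamps=True,  # rough timestamps per chunk
              )
              print(asr("meeting_audio.wav")["text"])  # hypothetical file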

        • llm_trw 9 months ago

          You should perhaps use that system to reread my post. Maybe it can explain it to you too.

      • greatgib 9 months ago

        With this, my feeling is that nowadays people are publishing "research papers" with almost nothing as content, like something that would otherwise be a basic blog article or a small implementation project.

        • creer 9 months ago

          Unfortunately not new at all. What's new is the richness of the presentation work, for sure. And the self-promotion on HN. And publishing the code should make this far, far more useful than the equivalents of the past, which never did... Oh wait, no, the code isn't there - only the link.

        • kak3a 9 months ago

          Well executed and could be very useful. Shouldn't be too hard for Apple to implement this for its Photos album with its Apple Intelligence.

          • [deleted] 9 months ago
            • dotancohen 9 months ago

              Is this another AI comment or karma farm? Well executed? Did you read the fine article?

            • [deleted] 9 months ago
              • m463 9 months ago

                What photos I took in a national park contained a bird?

                https://xkcd.com/1425/

                • thebeardisred 9 months ago

                  And yet my ability to annotate useful information in popular platforms is still (at best) free-form tags or written descriptions.

                  • godelski 9 months ago

                    Neat. But EXIF data is incredibly unreliable. It doesn't look like they're explicitly using it, but things just get noisier if you use it implicitly. Obviously this doesn't make the work useless, but it's something to consider.

                    If you aren't familiar with this issue, you can demonstrate it quite easily with an Android phone. Grab some random photo, edit the EXIF data so that it says the photo was taken sometime in the past, and then upload it to Google Photos. Or, even lazier: turn off backups, take a photo, wait a day or two, and turn backups back on. Photos will add new EXIF data and log the date taken as the upload date. It doesn't overwrite the existing data, so you just end up with multiple conflicting dates. What's a bit surprising to me is that this even happens with Pixel phones, where you use the official camera app, and even the fucking name of the picture contains metadata specifying when the photo was taken. Kinda crazy when you think about it.
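
                    If you want to reproduce the backdating step, it's a couple of lines with something like piexif (a sketch; the file name is made up):

                        # Backdate a photo's EXIF capture time, then upload it and watch
                        # the photo library disagree with itself about when it was taken.
                        import piexif

                        exif = piexif.load("random_photo.jpg")  # hypothetical file
                        exif["Exif"][piexif.ExifIFD.DateTimeOriginal] = b"2015:06:01 12:00:00"
                        exif["0th"][piexif.ImageIFD.DateTime] = b"2015:06:01 12:00:00"
                        piexif.insert(piexif.dump(exif), "random_photo.jpg")  # writes in place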

                    You could solve a lot of this stuff to a decent degree of accuracy with some regexing and a bit of logic, but you'll never cover all the bases. Or you could throw AI at it and press your luck (I hope that if anyone does this, they don't just train by saying a specific key is correct; you've got to teach it that there are context clues. Humans do this pretty well because of context, and hey, even the photo can have context clues like a calendar in the background. You've got to pick that up from photos that had their EXIF data wiped). It's a surprisingly hard problem.
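
                    For the regex-and-logic route, this is roughly the 80% version (a sketch; assumes a recent Pillow and Pixel-style filenames, and the helper name is made up):

                        # Best-effort capture time: trust the filename first, then EXIF, else give up.
                        import re
                        from datetime import datetime
                        from pathlib import Path

                        from PIL import Image, ExifTags  # recent Pillow for ExifTags.Base / ExifTags.IFD

                        # Pixel-style names embed the capture time, e.g. PXL_20240131_142359123.jpg
                        FILENAME_DATE = re.compile(r"(?:PXL|IMG)[_-](\d{8})[_-](\d{6})")

                        def guess_capture_time(path: Path) -> datetime | None:
                            m = FILENAME_DATE.search(path.name)
                            if m:
                                return datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")
                            exif = Image.open(path).getexif()
                            # DateTimeOriginal lives in the Exif sub-IFD; fall back to the top-level DateTime.
                            sub = exif.get_ifd(ExifTags.IFD.Exif)
                            raw = sub.get(ExifTags.Base.DateTimeOriginal) or exif.get(ExifTags.Base.DateTime)
                            return datetime.strptime(raw, "%Y:%m:%d %H:%M:%S") if raw else None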

                    I'm bringing this up not to dismiss the paper[0], but rather to highlight how even tiny corners of a problem we might take for granted are far more complex than expected (even if 80% of the time the complexity is footgunning. Actually, especially when that's the case!). Also, if anyone works on Google Photos... come on guys. Really? Can someone explain why this choice was made (in the situation of a Google phone + a Google app, so the full pipeline is controlled by Google)? There's got to be something I'm missing.

                    [0] My problem with the paper is that it reads more like an ad aimed at technical users than research that clearly demonstrates the results are due specifically to the method they created and not to other factors. There's a whole "productization" of ML papers, but that's a different issue. ML is fuzzy, and we can be critical, encourage papers to be better, say how we'd get higher confidence in the results, AND still be happy to see the work. Not all research needs to be the same, and demonstrating that things are possible is a type of research. Just low confidence. But yeah, this does look more like an engineering project than hypothesis testing.

                    • hprotagonist 9 months ago

                      but, how good is it at sorting my memes

                      • bigiain 9 months ago

                        Not so much memes, but I find searching by location and/or date very useful in my iPhone photos, and that doesn't work nearly as well for screenshots. I don't think there's a decent technological answer to that, though; I'd have to add the sort of metadata Photos relies on from EXIF data to everything I capture as a screenshot for it to be any use.