• llm_trw 9 months ago

    Right, so there's more to it than I initially thought, but it's still hopelessly data-constrained. They’re hoping you could magically obtain all the necessary data from images and videos recorded by the phone when you remember to use the camera.

    From my experience building a meeting-minutes AI tool for myself, I've found that audio carries far more semantic information than video, and we're lacking most of the model capabilities needed to make audio useful, like speaker diarization. For video you need object detection, and not just the 100 or so categories YOLO or DETR cover; you need to build a hierarchy of objects, in addition to OCR running continuously on every frame.
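
    Just to make the per-frame side concrete, here's a minimal sketch of "detection + OCR on every frame" using off-the-shelf stand-ins I picked for illustration (DETR via Hugging Face transformers, Tesseract for OCR); note the detector only knows the ~90 COCO classes, which is exactly the limitation I mean:

        # Rough per-frame sketch: closed-vocabulary detection + OCR.
        # Assumes `pip install transformers pillow pytesseract` and a local Tesseract install.
        from PIL import Image
        import pytesseract
        from transformers import pipeline

        detector = pipeline("object-detection", model="facebook/detr-resnet-50")

        def describe_frame(path: str) -> dict:
            frame = Image.open(path)
            objects = [d["label"] for d in detector(frame) if d["score"] > 0.7]
            text = pytesseract.image_to_string(frame)
            return {"frame": path, "objects": objects, "text": text.strip()}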

    Once the raw data collection is done, you somehow need to integrate it into a RAG system that can retrieve all of this in a meaningful way to feed to an LLM, with a context length far beyond anything currently available.
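
    The retrieval side itself is conceptually just embed-and-search over those per-frame descriptions and transcripts; it's the volume and the context length that hurt. A toy sketch, assuming sentence-transformers (the snippets are made up):

        # Toy retrieval sketch: embed snippets once, cosine-search them at query time.
        # Assumes `pip install sentence-transformers numpy`.
        import numpy as np
        from sentence_transformers import SentenceTransformer

        embedder = SentenceTransformer("all-MiniLM-L6-v2")
        snippets = ["09:14 whiteboard: Q3 roadmap", "09:20 Alice: ship the beta Friday"]  # hypothetical
        vectors = embedder.encode(snippets, normalize_embeddings=True)

        def retrieve(query: str, k: int = 2) -> list[str]:
            q = embedder.encode([query], normalize_embeddings=True)[0]
            scores = vectors @ q  # cosine similarity, since everything is normalized
            return [snippets[i] for i in np.argsort(-scores)[:k]]

        print(retrieve("when is the beta shipping?"))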

    All in all, just for inference you'd need more compute than you'd find in the average supercomputer today. Give it 20 years and multiple always-on cameras and microphones attached to you, and this will be as simple as running a local 8B LLM is today.

    • nareshshah139 9 months ago

      Nah, you can do all of this with a simple Phi-3.5 Instruct + SAM 2, both of which fit on an Nvidia Jetson Orin 64 GB.

      We do this at scale in factories/warehouses, describing everything that happens in them, like:

      - Idle time
      - Safety incidents
      - Process following across frames
      - Breaks
      - Misplacement of items
      - Counting items placed/picked/assembled across frames

      • nareshshah139 9 months ago

        SAM 2/Grounding DINO + Phi-3.5 Instruct (Vision) give you an essentially unlimited vocabulary.
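
        Roughly what the open-vocabulary detection step looks like with the Hugging Face port of Grounding DINO (a sketch, not production code; the prompt labels are just examples):

            # Open-vocabulary detection sketch: prompt with arbitrary text labels
            # instead of a fixed class list.
            import torch
            from PIL import Image
            from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

            model_id = "IDEA-Research/grounding-dino-tiny"
            processor = AutoProcessor.from_pretrained(model_id)
            model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

            image = Image.open("frame_000123.jpg")  # hypothetical warehouse frame
            labels = "a worker. a forklift. a pallet. a safety vest."  # lowercase, dot-separated

            inputs = processor(images=image, text=labels, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs)

            results = processor.post_process_grounded_object_detection(
                outputs, inputs.input_ids,
                box_threshold=0.35, text_threshold=0.25,
                target_sizes=[image.size[::-1]],
            )[0]
            for label, box in zip(results["labels"], results["boxes"]):
                print(label, [round(x) for x in box.tolist()])

        The boxes then go to SAM 2 for masks, and the crops/frames go to Phi-3.5 Vision for descriptions.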

        • nareshshah139 9 months ago

          If you want audio transcription, just add Distil-Whisper to the mix.
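
          Something along these lines (a sketch; the checkpoint is the public Distil-Whisper one, not necessarily what you'd ship):

              # Minimal transcription sketch with Distil-Whisper via transformers.
              # Assumes `pip install transformers` plus ffmpeg for audio decoding.
              from transformers import pipeline

              asr = pipeline(
                  "automatic-speech-recognition",
                  model="distil-whisper/distil-large-v3",
                  chunk_length_s=30,       # chunk long recordings
                  return_timestamps=True,  # rough timestamps per chunk
              )
              print(asr("meeting_audio.wav")["text"])  # hypothetical file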

        • llm_trw 9 months ago

          You should perhaps use that system to reread my post. Maybe it can explain it to you too.

      • greatgib 9 months ago

        With this, my feeling is that nowadays people are publishing "research papers" with almost nothing as content, like something that would otherwise be a basic blog article or a small implementation project.

        • creer 9 months ago

          Unfortunately not new at all. What's new is the richness of the presentation work, for sure. And the self-promotion on HN. And publishing the code should make this far, far more useful than the equivalents of the past, which never did... Oh wait, no, the code isn't there - only the link.

        • kak3a 9 months ago

          Well executed and could be very useful. Shouldn't be too hard for Apple to implement this for its Photos album with its Apple Intelligence.

          • [deleted] 9 months ago
            • dotancohen 9 months ago

              Is this another AI comment or karma farm? Well executed? Did you read the fine article?

            • [deleted] 9 months ago
              • m463 9 months ago

                What photos I took in a national park contained a bird?

                https://xkcd.com/1425/

                • thebeardisred 9 months ago

                  And yet my ability to annotate useful information in popular platforms is still (at best) free-form tags or written descriptions.

                  • godelski 9 months ago

                    Neat. But EXIF data is incredibly unreliable. It doesn't look like they're explicitly using it, but things just get noisier if you use it implicitly. Obviously this doesn't make the work useless, but it's something to consider.

                    If you aren't familiar with this issue, you can demonstrate it quite easily with an Android phone. Grab some random photo, edit the EXIF data so that it says the photo was taken sometime in the past, and then upload it to Google Photos. Or, even lazier: turn off backups, take a photo, wait a day or two, and turn backups back on. Photos will add new EXIF data and log the date taken as the upload date. It doesn't overwrite the existing data, so you just end up with multiple conflicting dates. What's a bit surprising to me is that this even happens with Pixel phones, where you use the official camera app, and even the fucking name of the picture contains metadata specifying when the photo was taken. Kinda crazy when you think about it.
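
                    If you want to reproduce the backdating step, it's a couple of lines with something like piexif (a sketch; the file name is made up):

                        # Backdate a photo's EXIF capture time, then upload it and watch
                        # the photo library disagree with itself about when it was taken.
                        import piexif

                        exif = piexif.load("random_photo.jpg")  # hypothetical file
                        exif["Exif"][piexif.ExifIFD.DateTimeOriginal] = b"2015:06:01 12:00:00"
                        exif["0th"][piexif.ImageIFD.DateTime] = b"2015:06:01 12:00:00"
                        piexif.insert(piexif.dump(exif), "random_photo.jpg")  # writes in place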

                    You could solve a lot of this stuff to a decent degree of accuracy with some regexing and a bit of logic, but you'll never cover all the bases. Or you could throw AI at it and press your luck (I hope that if anyone does this, they don't just train by saying a specific key is correct; you've got to teach it that there are context clues. Humans do this pretty well because of context, and hey, even the photo can have context clues like a calendar in the background. You've got to pick that up from photos that had their EXIF data wiped). It's a surprisingly hard problem.
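
                    For the regex-and-logic route, this is roughly the 80% version (a sketch; assumes a recent Pillow and Pixel-style filenames, and the helper name is made up):

                        # Best-effort capture time: trust the filename first, then EXIF, else give up.
                        import re
                        from datetime import datetime
                        from pathlib import Path

                        from PIL import Image, ExifTags  # recent Pillow for ExifTags.Base / ExifTags.IFD

                        # Pixel-style names embed the capture time, e.g. PXL_20240131_142359123.jpg
                        FILENAME_DATE = re.compile(r"(?:PXL|IMG)[_-](\d{8})[_-](\d{6})")

                        def guess_capture_time(path: Path) -> datetime | None:
                            m = FILENAME_DATE.search(path.name)
                            if m:
                                return datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")
                            exif = Image.open(path).getexif()
                            # DateTimeOriginal lives in the Exif sub-IFD; fall back to the top-level DateTime.
                            sub = exif.get_ifd(ExifTags.IFD.Exif)
                            raw = sub.get(ExifTags.Base.DateTimeOriginal) or exif.get(ExifTags.Base.DateTime)
                            return datetime.strptime(raw, "%Y:%m:%d %H:%M:%S") if raw else None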

                    I'm bringing this up not to dismiss the paper[0], but rather to highlight how even tiny corners of a problem we might take for granted are far more complex than expected (even if 80% of the time the complexity is footgunning. Actually, especially when that's the case!). Also, if anyone works on Google Photos... come on guys. Really? Can someone explain why this choice was made (in the situation of a Google phone + a Google app, so the full pipeline is controlled by Google)? There's got to be something I'm missing.

                    [0] My problem with the paper is that it reads more like an ad aimed at technical users than research that clearly demonstrates the results are due specifically to the method they created and not to other factors. There's a whole "productization" of ML papers, but that's a different issue. ML is fuzzy, and we can be critical, encourage papers to be better, say how we'd get higher confidence in the results, AND still be happy to see the work. Not all research needs to be the same, and demonstrating that things are possible is a type of research. Just low confidence. But yeah, this does look more like an engineering project than hypothesis testing.

                    • hprotagonist 9 months ago

                      but, how good is it at sorting my memes

                      • bigiain 9 months ago

                        Not so much memes, but I find searching by location and/or date very useful in my iPhone photos, and that doesn't work nearly as well for screenshots. I don't think there's a decent technological answer to that, though; I'd have to add the sort of metadata Photos relies on from EXIF data to everything I capture as a screenshot for it to be any use.