• sashank_1509 18 hours ago

This has been my experience. Foundation models have completely changed the game of ML. Previously, companies might have needed to hire ML engineers familiar with ML training, architectures, etc. to get mediocre results. Now companies can just hire a regular software engineer familiar with foundation model APIs and get excellent results. In some ways it is sad, but in other ways the result you get is so much better than what we achieved before.

My example was an image segmentation model. I managed to create a dataset of 100,000+ images and was training UNets and other advanced models on it. I always reached a good validation loss, but my data was simply not diverse enough, and I faced a lot of issues in actual deployment, where the data distribution kept changing on a day-to-day basis. Then I tried DINOv2 from Meta, fine-tuned it on 4 images, and it solved the problem, handling all the variations in lighting etc. with far higher accuracy than I had ever achieved. It makes sense: DINOv2 was trained on 100M+ images; I would never be able to compete with that.

In this case, the company still needed my expertise, because Meta just released the weights, so someone had to set up the fine-tuning pipeline. But I can imagine a fine-tuning API like OpenAI's requiring no expertise beyond simple coding. If AI results depend on scale, it naturally follows that only a few well-funded companies will build AI that actually works, and everyone else will just use their models. The only way this trend reverses is if compute becomes so cheap and ubiquitous that everyone can achieve the necessary scale.
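
For readers curious what such a pipeline involves, here is a minimal sketch of the common frozen-backbone recipe, using the public torch.hub DINOv2 weights and a linear head on the patch tokens. This is an illustration under those assumptions, not the commenter's actual setup:

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone + a small head trained on a handful of labeled images.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 2                                       # e.g. background vs. object
head = nn.Conv2d(384, num_classes, kernel_size=1)     # 384 = ViT-S/14 feature dim
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def patch_logits(images):
    # images: (B, 3, H, W) with H and W multiples of the 14px patch size
    B, _, H, W = images.shape
    feats = backbone.forward_features(images)["x_norm_patchtokens"]  # (B, H/14 * W/14, 384)
    feats = feats.permute(0, 2, 1).reshape(B, 384, H // 14, W // 14)
    return head(feats)  # coarse per-patch logits; upsample for a full-resolution mask

# One training step on a tiny labeled batch (random tensors stand in for real data).
images = torch.randn(2, 3, 224, 224)
masks = torch.randint(0, num_classes, (2, 16, 16))    # per-patch labels for 224x224 inputs
loss = criterion(patch_logits(images), masks)
loss.backward()
optimizer.step()
print(loss.item())
```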

    • pmontra 17 hours ago

> The only way this trend reverses is if compute becomes so cheap and ubiquitous that everyone can achieve the necessary scale.

We would still need the 100M+ images with accurate labels. That work could be performed collectively and open-sourced, but it would also need to be maintained, etc. I don't think it will be easy.

      • goldemerald 16 hours ago

DINOv2 is a self-supervised model. It learns both a high-quality global image representation and local representations, with no labels. It's becoming strikingly clear that foundation models are the go-to choice for the common data types: natural images, text, video, and audio. The labels are effectively free; the hard part now is extracting quality from massive datasets.
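
For reference, pulling such a global embedding out of DINOv2 takes only a few lines via torch.hub. A sketch, with the image path as a placeholder:

```python
import torch
from PIL import Image
from torchvision import transforms

# Small DINOv2 backbone from Meta's hub; trained with no labels at all.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),   # 224 is a multiple of the 14px patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = model(img)        # global representation, shape (1, 384)
print(embedding.shape)
```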

        • EGreg 9 hours ago

          The other way it can reverse is discovering better methods to train models, or fine-tune existing ones with LoRA or whatever.

How did Chinese companies do it? Or is it a fabricated claim? https://slashdot.org/story/24/12/27/0420235/chinese-firm-tra...

        • NegatioN 13 hours ago

I haven't compared image models in a long while, so I don't know the relevant performance metrics. But even a few years ago, you would usually use a pretrained model and then fine-tune it on your own dataset. So those models would also have "seen millions of images", and not just your 100k.

This shift away from needing ML engineers is not so much about the models as it is about easy API access for fine-tuning a model, it seems to me?

          Of course it's great that the models have advanced and become better, and more robust though.

          • isoprophlex 15 hours ago

            This was exactly my experience being the ML engineer on a predictive maintenance project. We detected broken traffic signs in video feeds from trucks; first you segment, then you classify.

            Simply yeeting every "object of interest" into DINOv2 and running any cheap classifier on that was a game changer.

            • ac2u 10 hours ago

Could you elaborate? I thought DINO took images and output segmented objects? Or do you mean that your first step was something like a YOLO model to get bounding boxes, and you are just using DINO to segment to make the classification part easier?

              • isoprophlex 4 hours ago

We did get bboxes from YOLO to identify "here is a traffic sign", "here is a traffic light", etc. Then we cropped out these objects of interest and took DINOv2 embeddings of the crops.

We didn't use it to create segmentations (there are YOLO models that do that, so if you need a segmentation you can get it in one pass); we used it just to get a single vector representing each crop.

Our goal was not only to know "this is a traffic sign", but also to do multilabel classification like "has graffiti", "has deformations", "shows discoloration", etc. If you store those embeddings it becomes pretty trivial (and hella fast) to hand them off to a bunch of data scientists so they can let loose every classifier in sklearn on them. See [1] for a substantially similar example.

                [1] https://blog.roboflow.com/how-to-classify-images-with-dinov2
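
A rough sketch of that last step, assuming the crop embeddings have already been computed and saved; the file names and label set are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

# X: one DINOv2 embedding per crop, e.g. shape (n_crops, 384) for ViT-S/14.
# Y: one binary column per label, e.g. [has_graffiti, has_deformation, shows_discoloration].
X = np.load("crop_embeddings.npy")
Y = np.load("crop_labels.npy")

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Any cheap classifier works on top of frozen embeddings; logistic regression
# wrapped for multilabel output is a reasonable baseline.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)

# Subset accuracy: fraction of crops where every label is predicted correctly.
print(clf.score(X_test, Y_test))
```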

                • ac2u 2 hours ago

                  Understood. Thanks for taking the time to elaborate.

            • IanCal 9 hours ago

Things like DINO, GroundingDINO, SAM (and whatever the latest versions of those are) are incredible. I think the progress in this field has been overlooked amid all the attention on LLMs; they're less end-user friendly, but they're so good compared to what I remember working with.

I was able to turn around a segmentation-and-classification demo in almost no time, because they gave me quick segmentation from a text description, and then I trained a YOLO model on the results.
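
The second half of that workflow can be a few lines with the ultralytics package. A sketch, assuming the text-prompted step has already written out masks in YOLO segmentation format with a matching dataset.yaml:

```python
from ultralytics import YOLO

# Start from a small pretrained segmentation checkpoint and fine-tune it on the
# dataset auto-labeled by the text-prompted GroundingDINO/SAM step.
model = YOLO("yolov8n-seg.pt")
model.train(data="dataset.yaml", epochs=50, imgsz=640)

# The resulting model is fast enough for a quick demo.
results = model("some_test_image.jpg")
```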

              • bboygravity 17 hours ago

                Could DINO or some other model be used to identify fillable form fields in webforms and/or PDF forms and/or desktop apps?

                Or does it likely just work on real world photos and cartoons and stuff?

              • pj_mukh 17 hours ago

Just 4 images?! Damn. I've had to use at least hundreds. I guess it depends on the complexity of the segmentation.

              • Imnimo a day ago

It's tough to judge without seeing examples of the targets and the user photos, but I'm curious whether this could be done with just old-school SIFT. If it really is exactly the same image in the corpus and on the wall, does a neural embedding model really buy you a lot? A small number of high-confidence tie points seems like it'd be all you need, but it probably depends a lot on just how challenging the user photos are.
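
For anyone curious what that baseline looks like, a minimal sketch with OpenCV's SIFT and Lowe's ratio test; the file names are placeholders:

```python
import cv2

reference = cv2.imread("reference_car.jpg", cv2.IMREAD_GRAYSCALE)
photo = cv2.imread("user_photo.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(reference, None)
kp_photo, des_photo = sift.detectAndCompute(photo, None)

# Match descriptors and keep only matches that pass Lowe's ratio test.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des_ref, des_photo, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Score each reference image this way and pick the one with the most tie points.
print(f"{len(good)} high-confidence tie points")
```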

                • Morizero a day ago

                  I find a lot of applied AI use-cases to be "same as this other method, but more expensive".

                  • miki123211 12 hours ago

It's often vastly more expensive at inference time, but vastly cheaper and faster to train / set up.

                    Many LLM use cases could be solved by a much smaller, specialized model and/or a bunch of if statements or regexes, but training the specialized model and coming up with the if statements requires programmer time, an ML engineer, human labelers, an eval pipeline, ml ops expertise to set up the GPUs etc.

With an LLM, you spend 10 minutes integrating with the OpenAI API, which is something any programmer can do, and you get results that are "good enough".

                    If you're extremely cash-poor, time-rich and have the right expertise, making your own model makes sense. Otherwise, human time is more valuable than computer time.
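
The kind of 10-minute integration being described is roughly this; a sketch with the openai Python package, where the model name, labels, and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(text: str) -> str:
    """Zero-shot classification that would otherwise need a trained model or a pile of regexes."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the message as one of: complaint, question, praise. "
                                          "Reply with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("My package arrived two weeks late and the box was crushed."))
```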

                    • Terr_ a day ago

                      Better to spend $100 in op-ex money than spend $1 in cap-ex money reading a journal paper, especially if it lets you tell investors "AI." :p

                      • mattnewton a day ago

                        Your engineers cost <$1/hr and understand journal papers?

                        • Terr_ 18 hours ago

                          The 100-vs-1 is a ratio.

                      • relativ575 20 hours ago

                        Use cases such as?

                        • Morizero 19 hours ago

I'm in an AI-focused education research group, and most "smart/personalized tutors" on the market have processes and outcomes similar to paper flashcards.

                        • kjkjadksj 18 hours ago

That was happening even when they were still calling it machine learning in the papers, and long before that too. It's how some people reliably get papers out, for better or worse: find a known phenomenon with existing published methods, rerun the same dataset with the new method of the day, show there's some agreement between the old "gold standard" and your method, and boom, a new paper for your CV on $hotnewmethod that you can now land jobs with. Never mind that no one will cite it; that's not the point here.

                        • relativ575 20 hours ago

                          From TFA:

                          > LLMs and the platforms powering them are quickly becoming one-stop shops for any ML-related tasks. From my perspective, the real revolution is not the chat ability or the knowledge embedded in these models, but rather the versatility they bring in a single system.

Why use another piece of software if an LLM is good enough?

                          • comex 18 hours ago

                            Performance. A museum visitor may not have a good internet connection, so any solution that involves uploading a photo to a server will probably be (much) slower than client-side detection. There’s a thin line between a magical experience and an annoying gimmick. Making people wait for something to load is a sure way to cross that line.

                            Also privacy. Do museum visitors know their camera data is being sent to the United States? Is that even legal (without consent) where the museum is located? Yes, visitors are supposed to be pointing their phone at a wall, but I suspect there will often be other people in view.

                            • titzer 20 hours ago

                              Cost. Same reason you don't deliver UPS packages with B-2 bombers.

                              • msp26 19 hours ago

LLM inference is cheap and will continue to get cheaper. More traditional methods take up far more of an engineer's time (which also costs money).

If a project has a low enough lifetime volume of inputs, I'm not wasting my time labelling data and training a model. That time could be better spent working on something else. As long as the evaluation is thorough, it doesn't matter. But I still like doing some labelling manually to get a feel for the problem space.

                          • suriya-ganesh 15 hours ago

This tracks with my experience. We built a complex processing pipeline for an NLP classification, search, and comprehension task, using a vector database of proprietary data, etc.

We ran a benchmark of our system against a plain LLM call, and the LLM performed much better for much less, in terms of dev time, complexity, and compute. It's an incredible time to be working in this space, watching traditional problems get eaten away by new paradigms.

                            • hackerdood 3 hours ago

                              Very neat explanation of solving these kinds of unique challenges, especially given how similar the illustrations were.

One question I had: knowing how difficult it was to train the model on only the base images, and given that the client didn't have time to photograph them, did you consider flying someone out to the museum for a couple of days to photograph each illustration from several angles under the actual lighting throughout the day? Or hiring a photographer near the museum to do it? It seems like a round-trip ticket plus a couple of nights in a hotel could have saved a lot of headache and provided more images to turn into synthetic training data. Even if you still had to resort to using 4o as a tiebreaker, you might only have to present two candidates, since the third might have a much lower similarity score than the second. Good write-up either way.

                              • JayShower a day ago

An alternative solution that would require less ML heavy lifting but a little more upfront programming: it sounds like the cars are arranged in a grid on the wall. Maybe it would be possible to narrow down which car the user took a photo of by also looking at the surrounding cars in the photo, and hardcoding into the system the position of each car relative to the others? That could potentially be done locally very quickly (maybe even at QR-code speed) versus doing an embedding + LLM.

The con of this approach is that it requires maintenance if they ever decide to change the illustration positions.

                                • armchairhacker 12 hours ago

Put each painting in an artsy frame whose edges each carry a different, colorful pattern. When the user photographs the painting, they'll capture all (or at least most) of the frame, and distinguishing the frames is easy.

                                  • arkh 10 hours ago

                                    > artsy frame

Embed a QR code or simply a barcode somewhere and you're done. Maybe hide it like a watermark so it doesn't show to the naked eye; detecting it with a Fourier transform in the app wouldn't require a network connection or much processing power.
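
A toy sketch of the frequency-domain idea with NumPy; the coordinates and strength are arbitrary, and a real watermark would also have to survive camera capture, perspective, and compression, which this does not attempt:

```python
import numpy as np

# Mid-frequency coordinates where the watermark peaks are hidden (arbitrary choice).
COORDS = [(40, 60), (55, 35)]

def embed(gray: np.ndarray, strength: float = 2e5) -> np.ndarray:
    """Add symmetric peaks to the 2D spectrum so the result stays real-valued."""
    F = np.fft.fft2(gray.astype(np.float64))
    for u, v in COORDS:
        F[u, v] += strength
        F[-u, -v] += strength  # conjugate-symmetric position
    return np.real(np.fft.ifft2(F))

def detect(gray: np.ndarray, factor: float = 5.0) -> bool:
    """Watermark is 'present' if the known spectral bins clearly stand out."""
    mag = np.abs(np.fft.fft2(gray.astype(np.float64)))
    background = np.median(mag)
    return all(mag[u, v] > factor * background for u, v in COORDS)

img = np.random.rand(256, 256) * 255    # stand-in for a grayscale illustration
print(detect(img), detect(embed(img)))  # expected: False True
```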

                                    • ndileas 8 hours ago

                                      the article does mention that the client rejected a similar approach. steganography seems like a bad choice for a museum setting where you don't own the images.

                                    • jaffa2 8 hours ago

This seems the way to go… it's only 350 images.

                                  • wongarsu a day ago

Interesting approach to a very interesting challenge, given how close the images supposedly are.

With the limited training data they have, I'm surprised they don't mention any attempts at synthetic training data. Make (or buy) a couple of museum scenes in Blender, hang one of the images there, render images from a lot of angles; repeat for more scenes, lighting conditions, and all 350 images. Should be easy to script. Then train YOLO on those images, or if that still fails, use their embedding approach with those training images.

                                    • brody_hamer a day ago

                                      They did.

                                      > “ To address this limitation, we turned to data augmentation, artificially creating new versions of each image by modifying colors, adding noise, applying distortion, or rotating images. By the end, we had generated 600 augmented images per car.”

                                      • wongarsu 20 hours ago

Those are pretty standard. A standard YOLO training run applies more transformations than that, and there are ready-made modules that do the same in Keras and PyTorch (for their MobileNet and VGG16). I'm not sure anyone trains a serious vision model without that kind of data augmentation.

                                        What I am talking about is that they want to recognize scenes containing the images, but only have the images as training data. They have a good idea what those scenes will look like. Going there to take actual training pictures was evidently not viable, but generating approximations of them might have been.
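
As a reference point for the "ready-made modules" mentioned above, standard on-the-fly augmentation in torchvision is just a few lines; the exact transforms and parameters here are illustrative:

```python
from torchvision import transforms

# Typical photometric and geometric augmentations applied on the fly during training,
# similar to what a standard YOLO training run does out of the box.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.05),
    transforms.RandomRotation(degrees=15),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),
    transforms.GaussianBlur(kernel_size=5),
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # called once per sample, every epoch
```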

                                    • lynguist 6 hours ago

                                      Huh I think this YouTube short is the same topic: https://youtube.com/shorts/DA_-6296G5o?si=BLKcSP2Q1jAaca9K

                                      Finding new geoglyphs from known examples.

                                      • olup 4 days ago

First time for me posting this kind of story - I thought it would make an interesting case study on solving a hard computer vision problem with a crafty product engineering team.

                                        • caioariede a day ago

Just some small feedback… I switched to reader mode because the font used is very challenging for me to read.

                                          • littlestymaar a day ago

Also, having a blog post about image detection that doesn't show a single picture in the whole post was quite frustrating.

                                            • Oarch a day ago

Especially given the detailed description, surely the author could just generate a similar image.

                                              • bl4ckneon a day ago

I was just thinking that. I spent a few minutes trying to have ChatGPT generate some images with DALL-E 3. Flux would probably be better for getting all the specific details, but yeah.

                                          • yannis 21 hours ago

Thanks for sharing. Interesting approach. As other commenters mentioned, the article could do well with some hypothetical images. Maybe in a follow-up blog post? Also, since you mention your company's name, you're missing a marketing opportunity by not providing a link.

                                            • idkman_oops 13 hours ago

Can you tell me what font this is?

                                              • martin_a 11 hours ago

                                                One of these:

                                                > ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,"Liberation Mono","Courier New",monospace

                                                • nmeofthestate 5 hours ago

                                                  The single-character-width 'fi' ligature is quite jarring in a mono-spaced font.

                                              • yuvalr1 10 hours ago

A completely different approach that doesn't require heavy AI would be an app on the user's phone that does this:

                                                1. Measure the distance from the wall (standard image processing)

                                                2. Use the rotations of the gyro sensors on the phone to conclude which car is being looked at

                                                I wonder if this could be as accurate though
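
A rough sketch of the geometry, assuming the phone's position in front of the wall is known and the cars sit on a regular grid; the cell sizes and angles are made up:

```python
import math

# Wall layout assumptions: cars arranged in a grid of fixed-size cells,
# with (0, 0) being the cell straight ahead of the viewer.
CELL_WIDTH_M = 0.5
CELL_HEIGHT_M = 0.4

def cell_from_pose(distance_m: float, yaw_deg: float, pitch_deg: float) -> tuple[int, int]:
    """Map camera distance + orientation to a grid cell on the wall."""
    x_offset = distance_m * math.tan(math.radians(yaw_deg))    # left/right on the wall
    y_offset = distance_m * math.tan(math.radians(pitch_deg))  # up/down on the wall
    col = round(x_offset / CELL_WIDTH_M)
    row = round(y_offset / CELL_HEIGHT_M)
    return row, col

# e.g. standing 2 m from the wall, looking 14 degrees to the right and 6 degrees up
print(cell_from_pose(2.0, 14.0, 6.0))  # -> (1, 1) with these made-up cell sizes
```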

                                                • mrbombastic 10 hours ago

You could definitely cheat a little with something like this, or with geofences, but that requires the photos to stay in the same place, or the museum to update things whenever they move.

                                                • the_duke 14 hours ago

                                                  Side question: is there any good model that allows for image similarity detection across a large image set, that can be incrementally augmented with new images?

                                                  You'd somehow have to generate an embedding for each image, I presume.

                                                  • ResearchAtPlay 7 hours ago

                                                    Yes, you could implement image similarity search using embeddings: Create embeddings for the entire image set, save the embeddings in a database, and add embeddings incrementally as new images come in. To search for a similar image, create the embedding for the image that you are looking for and compute the cosine similarity between that embedding and the embeddings in your database. The closer the cosine similarity is to 1.0 the more similar the images.

                                                    For choosing a model, the article mentions the AWS Titan multimodal model, but you’d have to pay for API access to create the embeddings. Alternatively, self-hosting the CLIP model [0] to create embeddings would avoid API costs.
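
A minimal sketch of that workflow with the openai/CLIP package; the model choice and file names are illustrative:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(path: str) -> torch.Tensor:
    """Return a unit-length CLIP embedding for one image."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image)
    return features / features.norm(dim=-1, keepdim=True)

# Index: stack embeddings for the whole set; append rows as new images arrive.
index = torch.cat([embed(p) for p in ["cat1.jpg", "cat2.jpg", "car1.jpg"]])

# Query: cosine similarity is just a dot product on normalized vectors.
query = embed("query.jpg")
scores = (index @ query.T).squeeze(1)
print("best match:", scores.argmax().item(), scores)
```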

                                                    Follow-up question: Would the embeddings from the llama3.2-vision models be of higher quality (contain more information) than the original CLIP model?

                                                    The llama vision models use CLIP under the hood, but they add a projection head to align with the text model and the CLIP weights are mutated during alignment training, so I assume the llama vision embeddings would be of higher quality, but I don’t know for sure. Does anybody know?

                                                    (I would love to test this quality myself but Ollama does not yet support creating image embeddings from the llama vision models - a feature request with several upvotes has been opened [1].)

                                                    [0] https://github.com/openai/CLIP

                                                    [1] https://github.com/ollama/ollama/issues/5304

                                                    • jonathan-adly 7 hours ago

So, there is a whole world of vision-based RAG/search.

                                                      We have a good open-source repo here with a ColPali implementation: https://github.com/tjmlabs/ColiVara

                                                      • ResearchAtPlay 6 hours ago

                                                        Thanks for the link to the ColPali implementation - interesting! I am specifically interested in evaluation benchmarks for different image embedding models.

                                                        I see the ColiVara-Eval repo in your link. If I understand correctly, ColQwen2 is the current leader followed closely by ColPali when applying those models for RAG with documents.

                                                        But how do those models compare to each other and to the llama3.2-vision embeddings when applied to, for example, sentiment analysis for photos? Do benchmarks like that exist?

                                                        • jonathan-adly 6 hours ago

                                                          The “equivalent” here would be Jina-Clip (architecture-wise), not necessarily performance.

The ColPali paper (1) does a good job explaining why you don't really want to use vision embeddings directly, and why you are much better off optimizing for RAG with a ColPali-like setup. Basically, a plain vision embedding is not optimized for textual understanding: it works if you are searching for the word "bird" and images of birds, but it doesn't work well for pulling up a document that is a paper about birds.

                                                          1. https://arxiv.org/abs/2407.01449

                                                          • ResearchAtPlay 3 hours ago

                                                            Makes sense. My main takeaway from the ColPali paper (and your comments) is that ColPali works best for document RAG, whereas vision model embeddings are best used for image similarity search or sentiment analysis. So to answer my own question: The best model to use depends on the application.

                                                    • lyu07282 5 hours ago

                                                      Beyond using off-the-shelf embeddings, if you want to teach a model what "similar" means exactly, that's metric learning: https://paperswithcode.com/task/metric-learning
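
A bare-bones illustration of the idea with PyTorch's built-in triplet loss; the embedding network and data here are placeholders, and real setups add things like hard-negative mining:

```python
import torch
import torch.nn as nn

# Tiny stand-in embedding network; in practice this would be a pretrained vision backbone.
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(embed.parameters(), lr=1e-4)

# anchor/positive are pairs the task defines as "similar", negative is "dissimilar".
anchor = torch.randn(8, 3, 64, 64)
positive = torch.randn(8, 3, 64, 64)
negative = torch.randn(8, 3, 64, 64)

# Pull anchor and positive together, push the negative away by at least the margin.
loss = loss_fn(embed(anchor), embed(positive), embed(negative))
loss.backward()
optimizer.step()
print(loss.item())
```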

                                                    • kredd a day ago

A bit tangential, but I think we will see a good chunk of small teams building competing products in different software business segments, just by doubling down on productivity and offering a cheaper option thanks to less operational overhead (read: fewer paid engineers). I can think of at least two businesses that could be undercut on cost if a team can automate a good chunk of the work.

                                                      • qeternity a day ago

> I can think of at least two businesses that could be undercut on cost if a team can automate a good chunk of the work.

                                                        And which would those be?

                                                        • kredd a day ago

We both know I didn't write it down, in the hope that I'll act on it at some point in the near future and because I want to avoid tipping off my imaginary competitors. Even though, in reality, I will ponder it for another week or two, give up without actually getting anything done, and then regret never trying :)

                                                          • qeternity 13 hours ago

                                                            We both know I was hoping you'd tell me anyway :)

                                                          • satvikpendem a day ago

                                                            Job applications, recruiter outreach and initial screening calls. I heard of an AI interviewer via voice chat on a reddit thread recently.

                                                            • rad_gruchalski a day ago

                                                              „AI” talking to an „AI”. What a time to be alive.

                                                        • saint_yossarian a day ago

                                                          I mean, cool tech, but why not just print a QR code next to each illustration?

                                                          • olup 15 hours ago

Poster here. We would have loved that, and it was one of our first proposals - a QR code or some kind of marker. However, the client is understandably very protective of the aesthetics of their wall, which is a central element of their scenography. We would have pushed for it again as a last resort, but we would probably have lost the contract.

                                                            • psandor 7 hours ago

                                                              This is completely offtopic, but I would bet it was a government-funded museum. A reasonable institution would have worked with you to find an acceptable compromise, something much easier to implement with a small sacrifice of aesthetics.

                                                              Anyway, great work, and thank you for taking the time to share it!

                                                              • achierius 6 hours ago

                                                                Really? I would much less expect a government museum to be particular about aesthetics. Privately run museums/collections/exhibitions on the other hand tend to have very finicky owners -- after all, they're putting up their own money to achieve their vision, and so of course they tend to not want to compromise on how it might look.

                                                            • JayShower a day ago

                                                              Sounds like the client cared a lot about the user experience being smooth (they declined the solution of presenting the user with the narrowed-down choices of which car they took a picture of), and I think adding a bunch of QR codes to this aesthetic wall of car illustrations would not align with that goal.

                                                              • urbandw311er a day ago

                                                                This feels like one of those “NASA spent millions developing a space pen, Russians took a pencil” moments.

                                                              • cuu508 17 hours ago

                                                                Or just a human-readable label with the model and year on it. Visitors would not need to mess with gadgets to read the labels which would be a huge usability win.

                                                                • nnnnico a day ago

                                                                  just in: using gpt4o to read QRs

                                                                  • nthingtohide 15 hours ago

                                                                    Or just geo tag the room itself.

                                                                  • rldjbpin 18 hours ago

Reads to me like 95% of the work was "conventional AI" applied to the problem, and then using an LLM at the end works like a lucky three-faced die.

When "embeddings" are used to perform the closeness test, you are using a pretrained computer vision model behind the scenes. It is doing the vast majority of the work, filtering hundreds of images down to a handful.

A visual LLM works on textual descriptions that seem far too close for similar images. Regardless, more power to the team for finding something that works for them.

                                                                    • og_kalu 10 hours ago

> A visual LLM works on textual descriptions

                                                                      SOTA V-LLMs do not work on textual descriptions.

                                                                    • vessenes a day ago

Thanks for the “bitter lesson” news from the front lines. Curious: did you experiment with 4o as the sole pipeline? And of course, as I think you mention, it would be interesting to know whether, say, Llama 8B could do a similar job as well.

                                                                      Congrats on shipping.

                                                                      • numba888 a day ago

They don't self-host the models, neither the embedding model nor the last-step LLM. Taking the low load into account, self-hosting would likely be more expensive. If so, why not use the best models?

                                                                      • gunalx a day ago

Cool real-life use case. I don't think LLMs usually get applied sensibly where they should be, and I'm glad a generic kNN model was also used - it simplifies costs and is just more suitable.

                                                                        • TZubiri a day ago

Calling an LLM and a CV model by the same name to give the appearance of AGI is a pet peeve of mine.

And anyone who's not OpenAI buying into this naming convention is just unpaid propaganda.

                                                                          • throwaway314155 19 hours ago

                                                                            How would you prefer people talk about it? "Multimodal LLM"? My understanding is the vision portion is indeed wired directly to (and trained alongside) the language portion.

                                                                            > give the appearance of agi

Can you point out where specifically they're doing this? Best I can tell, they give a decent summary of the effectiveness of multimodal LLMs with support for vision, and then talk about using it to solve an incredibly narrow task. The only diction I could see that hints at "AGI" is when they describe the versatility of this approach; but how could you possibly argue against that? It's objectively more versatile (if also wasteful and more expensive).

                                                                            • TZubiri 14 hours ago

I looked into the documentation and API, and it seems you are right: it is genuinely part of the GPT model. Of course, we cannot confirm that without source code.

My understanding was that a traditional CV library was effectively producing image-to-text before passing it to the LLM. But the more I think about it, even that method would involve training for image detection to the point where objects are recognized as images, not as tokens.

So the GPT product is no longer an LLM, or purely text-based.

Can't say much for sure at this point with closed source; we will probably see the competition catch up eventually and have more info then. At which point OpenAI will eventually release the img2txt component separately and dispense with the mysticism and AGI pretension.

My guess is that this is a separate image-to-text model (or image+text model) that is bolted onto the main LLM.

I don't think text is just another modality; it will probably always be the core.

I don't have a source for something this strategic and subjective, I just have a finger on the pulse: their robot demo that does laundry, their consistent talk about AGI, their mention of power-seeking in the docs, their attempt to raise trillions for chip factories, the transition to for-profit. They are under huge pressure to be THE monopoly, and their risk is that GPT turns out to be a text-based local maximum and that intelligence is not a Sapir-Whorfian phenomenon.

P.S.: early docs from 2023 refer to the img2txt submodel as GPT-4V; that's what we should call the submodule, in my opinion (if it is in fact the same piece of tech).

                                                                          • schappim a day ago

                                                                            I would love to see the prompt / image data sent to GPT-4o!

                                                                            • olup 14 hours ago

On the prompt side, it's very simple and can probably be done in a variety of ways. What we did was prepare a prompt with multiple "user" messages. The first one gives the instruction:

> You are given a reference and three candidates. Which one of the candidates do you think is a match for the reference? Only output its identifier, or a code when none is found.

                                                                              Not exactly that but something along those lines.

                                                                              Then one "user" message per car (reference + candidates) with image + text indicating the type (reference or candidate) and an identifier (can be as simple as the index for the candidates).

                                                                            • babyent 18 hours ago

This was a fun read. I'm not an AI expert by any means. I'm also ESL. Please bear with me.

The inaccuracy threshold seems fine for a museum, but in enterprise operations inaccuracy can mean lost revenue or, worse, lost trust and future business.

I'm struggling with some more advanced AI use cases in my collaborative work platform. I use AI (LLMs) for things like summarization, communication, and finding information using embeddings. However, sometimes it is completely wrong.

To test this, I spent a few days (doing something unrelated) building up a recipes database and then trying to query it for things like "I want to make a quick and easy drink". I ran the data through classification and other steps to get the data as clean as I could. The results would still include fries or some other food when I asked for drinks.

So I have to ask: what the heck am I doing wrong? Again, for things like sending messages and reminders, coming up with descriptions, and finding old messages that match some input - no problem.

But when I have data that I'm augmenting with additional information (trying to attach information that may be missing but can be deduced from what's available) to enable richer workflows, I keep getting bitten in the butt. I feel like if I can figure this out, I can provide way more value.

                                                                              Not sure if what I said makes sense.

                                                                              • numba888 32 minutes ago

                                                                                > Not sure if what I said makes sense.

Not sure either. But here is the lesson from this and other sources: to improve the output, use a multistep approach. Get the first answer (one or more) and pass it through a second verification step or steps, like "for this *, is this * relevant?", or "is it correct, does it solve the problem", etc. Then select the answer with the best scores from those filters. You see, it's very similar to the original post: get the first candidates, then filter.

                                                                              • GaggiX a day ago

                                                                                Is there a reason to choose VGG16 over more modern models?

                                                                                • rmiaouh 14 hours ago

                                                                                  Yes, there is a reason. VGG16 is a lightweight model that is very cost-effective to self-host. Initially, as the article mentions, we didn’t have access to large, cost-efficient models (like AWS Titan) capable of generating image embeddings. As a result, we opted for VGG16, which is efficient, delivers good performance, and can run on CPUs with just 4GB of RAM. This makes it ideal for small-scale setups, such as VPS instances costing around €10/month.
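
For reference, a sketch of extracting such embeddings with Keras; it assumes TensorFlow is installed, and the image path is a placeholder:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

# Drop the classification head and global-average-pool the conv features
# to get a fixed-size embedding vector (512 dims) per image.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("car_illustration.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

embedding = model.predict(x)   # shape (1, 512); runs fine on a small CPU-only box
print(embedding.shape)
```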

                                                                                • gazchop a day ago

                                                                                  I hear a lot of qualitative speak but nothing quantitative.