• radarsat1 an hour ago

    As someone not super familiar with deployment, but familiar enough to know that GPUs are difficult to work with because they are costly and sometimes hard to allocate: apart from optimizing the models themselves, what's the trick for handling cloud GPU resources at scale to serve something like this, supporting many realtime connections with low latency? Do you just allocate a GPU per websocket connection? That would mean keeping a pool of GPU instances allocated in case someone connects, since otherwise cold start time would be bad. But isn't that super expensive? I feel like I'm missing some trick in the cloud space that makes this kind of thing possible and affordable.

    • whiplash451 an hour ago

      Not the author, but their description implies that they are running more than one stream per GPU.

      So you can basically spin up a few GPUs as a baseline, allocate streams to them, then boot up a new GPU when the existing GPUs get overwhelmed.

      Does not look very different than standard cloud compute management. I’m not saying it’s easy, but definitely not rocket science either.
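
      A minimal sketch of that idea, assuming each GPU can host a fixed number of streams. The names `GpuWorker` and `StreamPool`, the `capacity` limit, and `_boot_gpu` (a stand-in for a real, slow provisioning call) are all made up for illustration; this is not a description of what Tavus actually runs.

```python
from dataclasses import dataclass, field


@dataclass
class GpuWorker:
    """One GPU instance that can host several streams at once (hypothetical)."""
    name: str
    capacity: int  # assumed max concurrent streams per GPU
    streams: set = field(default_factory=set)

    def has_room(self) -> bool:
        return len(self.streams) < self.capacity


class StreamPool:
    """Places each stream on the least-loaded GPU; adds a GPU when all are full."""

    def __init__(self, baseline_gpus: int, capacity: int):
        self.capacity = capacity
        self.workers = [GpuWorker(f"gpu-{i}", capacity) for i in range(baseline_gpus)]

    def _boot_gpu(self) -> GpuWorker:
        # Stand-in for a real cloud provisioning call, which would be slow.
        worker = GpuWorker(f"gpu-{len(self.workers)}", self.capacity)
        self.workers.append(worker)
        return worker

    def assign(self, stream_id: str) -> str:
        candidates = [w for w in self.workers if w.has_room()]
        if candidates:
            worker = min(candidates, key=lambda w: len(w.streams))
        else:
            worker = self._boot_gpu()
        worker.streams.add(stream_id)
        return worker.name

    def release(self, stream_id: str) -> None:
        for w in self.workers:
            w.streams.discard(stream_id)


pool = StreamPool(baseline_gpus=2, capacity=2)
placements = [pool.assign(f"stream-{i}") for i in range(5)]
print(placements)
```

      With two baseline GPUs of capacity 2, the first four streams fill them and the fifth triggers a third GPU. A real system would presumably boot the extra GPU before the pool is completely full, to hide the cold start.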

      • bpanahij an hour ago

        We're partnering with GPU infrastructure providers like Replicate. In addition, we have done some engineering to bring down our stack's cold and warm boot times. With sufficient caches on disk, and potentially a running process/memory snapshot we can bring these cold/warm boot times down to under 5 seconds. Of course, we're making progress every week on this, and it's getting better all the time.

        • pavlov an hour ago

          You can do parallel rendering jobs on a GPU. (Think of how each GPU-accelerated window on a desktop OS has its own context for rendering resources.)

          So if the rendering is lightweight enough, you can multiplex potentially lots of simultaneous jobs onto a smaller pool of beefy GPU server instances.

          Still, all these GPU-backed cloud services are expensive to run. Right now it’s paid by VC money — just like Uber used to be substantially cheaper than taxis when they were starting out. Similarly everybody in consumer AI hopes to be the winner who can eventually jack up prices after burning billions getting the customers.

          • ilaksh an hour ago

            It is expensive. They charge in 6 second increments. I have not found anywhere that says how much per 6 second stream.

            Okay found it, $0.24 per minute, on the bottom of the pricing page.

            That means they can spend about $14/hour on GPU per stream and still break even ($0.24 × 60 = $14.40). So I believe that leaves a bit of room for profit.

            • bpanahij an hour ago

              Scroll down the page and the per minute pricing is there: https://www.tavus.io/pricing

              We bill in 6 second increments, so you only pay for what you use in 6 second bins.
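
              As a rough sketch of how 6-second bins work out at $0.24 per minute: the rate and the bin size are the ones quoted in this thread, while rounding a partly used bin up to a full bin is an assumption made for the example, as are the function names.

```python
import math
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.24")  # rate quoted in the thread
BIN_SECONDS = 6                     # billing increment quoted in the thread
PRICE_PER_BIN = PRICE_PER_MINUTE * BIN_SECONDS / 60  # $0.024 per bin


def billed_cost(duration_seconds: float) -> Decimal:
    """Cost of one session, billed in whole 6-second bins."""
    # Assumption: a partially used bin is billed as a full bin.
    bins = math.ceil(duration_seconds / BIN_SECONDS)
    return bins * PRICE_PER_BIN


def hourly_revenue_per_stream() -> Decimal:
    """Revenue from one stream that stays connected for a full hour."""
    return PRICE_PER_MINUTE * 60


print(billed_cost(61))              # 11 bins
print(hourly_revenue_per_stream())  # one continuously busy stream
```

              Under these assumptions a 61-second call is billed as 11 bins, and one continuously busy stream brings in $14.40 per hour.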

              • ilaksh an hour ago

                Oh sorry I didn't see that. Got it. $0.24 per minute.

          • primitivesuave 21 minutes ago

            I really hope this technology becomes the future of political campaigning. The signage industry, which prints billions of posters, plastic lawn signs, and banners for the post-election landfill, needs to be disrupted.

            These days I get a daily dose of amazement at what a small engineering team is able to accomplish.

            • bpanahij 3 minutes ago

              Thanks for these thoughts and compliments. I love the idea of preventing landfill with this tech. Our team is awesome and we really love our customers and all the jobs that can be done with this kind of tech!

              • qazxcvbnmlp 7 minutes ago

                Oh my! How dystopian.

                “He promised me they wouldn’t support X” “He promised me they would support X”

                (Dynamically grab and show actions from the candidate's past that feed into the individual's viewpoint)

                This would further the disconnect between what candidates say they do and what they actually do, while making it feel like they've got your best interests in mind.

                • jerf a few seconds ago

                  Heh, I'm not even sure that would change much honestly. If I define a "lie" for the purpose of this post (and nothing else) as "a politician's claim they support a position during election season that they have manifestly not supported during their existing tenure as a politician", even cynical ol' me is a bit shocked by the amount of lying I've seen in this campaign. I'm not even talking about forward lying here about something they won't do for whatever reason once they get into office, I'm talking about their platform incorporating things that they were denouncing a year ago and vigorously voting against.

              • karolist 3 hours ago

                Felt like talking to a person. I couldn't bring myself to treat it like a piece of code; that's how real it felt. I wanted to be polite and diplomatic, and caught myself thinking about "how I look to this person". This got me thinking about the conscious effort we put in when we talk with people, and how sloppy and relaxed we can be when interacting with algorithms.

                For a little example, when searching Google I default to a minimal set of keywords required to get the result, instead of typing full sentences. I'm sort of afraid this technology will train people to behave like that when video chatting with virtual assistants, and that the attitude will bleed into real-life interactions in society.

                • bpanahij 2 hours ago

                  Thanks for that insight. Brian here, one of the engineers for CVI. I've spoken with CVI so much, and as it has become more natural, I've found myself becoming more comfortable with a conversational style of interaction with the vastness of information contained within the LLMs and context under the hood. Whereas, with Google or other search based interactions I'm more point and shoot. I find CVI is more of an experience and for me yields more insight.

                  • alwa 2 hours ago

                    I’m having trouble understanding what CVI means here. Is it the firm Computer Vision Inc. (https://www.cvi.ai/)?

                    The firm in the post seems to be called Tavus, and their products either “digital twins” or “Carter.”

                    Not meaning to be pedantic, I’m just wondering whether the “V” in the thing you’ve spoken to indicates more “voice” or “video” conversations.

                    • mertgerdan an hour ago

                      Hahah that's very valid looking back, it stands for Conversational Video Interface

                  • whiplash451 an hour ago

                    I see it the other way around.

                    I think our human-human interaction style will “leak” into the way we interact with humanoid AI agents. Movie-Her style.

                  • wantsanagent 2 hours ago

                    Functionality for a demo launch: 9.5/10

                    Creepiness: 10/10

                    • CapeTheory an hour ago

                      I was just about to try it, but the idea of allowing Firefox access to my audio/video to talk to a machine-generated person gave me such a bad feeling, I couldn't go through with it even fuelled by my morbid curiosity.

                    • kwindla 36 minutes ago

                      • caseyy 3 hours ago

                        Amazing work technically, less than 1 second is very impressive. It’s quite scary, though, that I might FaceTime someone one day soon, and they won’t be real.

                        What do you think about the societal implications for this? Today we have a bit of a loneliness crisis due to a lack of human connection.

                        • btbuildem 2 hours ago

                          Another nail in the coffin for WFH, too. "They" will be scared we're not actually working even when on calls.

                          • kredd an hour ago

                            The question is what'll come first: AI agents that replace white-collar jobs, so you don't even need the employees, or companies not trusting WFH employees and bringing everyone back in person?

                        • davidvaughan an hour ago

                          That is technically impressive, Hassaan, and thanks for sharing.

                          One recommendation: I wouldn't have the demo avatar saying things like "really cool setup you have there, and a great view out of your window". At that point, it feels intrusive.

                          As for what I'd build... Mentors/instructors for learning. If you could hook up with a service like mathacademy, you'd win edtech. Maybe some creatures instead of human avatars would appeal to younger people.

                          • alwa 27 minutes ago

                            There were some balloons coincidentally in the background of a colleague's camera view. The Carter volunteered "and can I just say, we need more positivity in the world, the balloons behind you give a good vibe." My colleague physically recoiled, pushed the camera away, and hung up.

                            I think it was a combination of the intrusiveness and the notion of a machine 1) projecting (incorrect) assumptions about her attitudes/intentions onto the environment's decor, and 2) passing judgment on her. That kind of comment would be kind of impolite between strangers, like the thing that only a bad boss would feel entitled to say to an underling they didn't know very well.

                            Just an implementation detail, though, of course! I figure if you're able to evoke massive spookiness and subtle shades of social expectations like this, you must be onto something powerful.

                          • shtack 24 minutes ago

                            Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/

                            The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.

                            • taude 40 minutes ago

                              I had him be a Dungeon Master and start taking me through an adventure. Was very impressive and convincing (for the two minutes I was conversing), and the latency was really good. Felt very natural.

                              • turnsout 3 hours ago

                                Incredibly impressive on a technical level. The Carter avatar seems to swallow nervously a lot (LOL), and there's some weirdness with the mouth/teeth, but it's quite responsive. I've seen more lag on Zoom talking to people with bad wifi.

                                Honestly this is the future of call centers. On the surface it might seem like the video/avatar is unnecessary, and that what really matters is the speech-to-speech loop. But once the avatar is expressive enough, I bet the CSAT would be higher for video calls than voice-only.

                                • nick3443 an hour ago

                                  Actually what really matters for a call center is having the problem I called in for resolved promptly.

                                  • turnsout 22 minutes ago

                                    Right, so do you want to wait 45 minutes for a human, or get it resolved via AI in 2 minutes?

                                • vlad-r 2 hours ago

                                  This was definitely one of the most disturbing experiences I've had.

                                  But it's somehow awesome at the same time.

                                  • username44 3 hours ago

                                    It was pretty cool, I tried the Tavus demo. Seemed to nod way too much, like the entire time. The actual conversation was pretty clearly with a text model, because it has no concept of what it looks like, or even that it has a video avatar at all. It would say things like “I don’t have eyes” etc.

                                    • airstrike 3 hours ago

                                      This is awesome! I particularly like the example from https://www.tavus.io/product/video-generation

                                      It's got a "80s/90s sci-fi" vibe to it that I just find awesomely nostalgic (I might be thinking about the cafe scene in Back to the Future 2?). It's obviously only going to improve from here.

                                      I almost like this video more than I like the "Talk to Carter" CTA on your homepage, even though that's also obviously valuable. I just happen to have people in the room with me now and can't really talk, so that is preventing me from trying it out. But I would like to see it in action, so a pre-recorded video explaining what it does is key.

                                      • btbuildem 2 hours ago

                                        Interesting -- compare the training video to the render! I think if you know the person, it would still be very hard to pass the digital twin off as the real thing. But if you mean to face strangers, this could very well work already. There are small glitches, but those are easy to blame on video codec / network issues.

                                      • kmetan 2 hours ago

                                        Why is it trying to autofill my payment cards?


                                        • byearthithatius 2 hours ago

                                          That is your browser. Hassaan, you should add autocomplete="name" to prevent this in the future, since it clearly scares some folks. He didn't do anything; it's just your browser looking for text boxes to autofill.

                                          • hassaanr 2 hours ago

                                            Great callout- will make that change now!

                                        • syx an hour ago

                                          This is funny: my name is Simone, pronounced 'see-moh-nay' (Italian male), but both bots kept pronouncing it wrong, either like Simon or like the English female version of Simone (Siy-mown). No matter how many times I tried to correct them and asked them to repeat it, they kept making the same mistake. It felt like I was talking to an idiot. I guess it has something to do with how my name is tokenized.

                                          • bpanahij an hour ago

                                            We have the ability to send phonetic pronunciations as guidance, and this could be a great addition to our LLM/response generation stack! Adding a check for names and then adding in the phoneme.
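
                                            A rough sketch of what that name check could look like, assuming the speech side accepts SSML `<phoneme>` tags (a W3C standard, but whether this stack uses it is an assumption). The lookup table, the function name, and the IPA string are illustrative only, and the IPA is an approximation.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pronunciation table; the IPA string is an approximation.
PRONUNCIATIONS = {"Simone": "siˈmoːne"}


def add_phoneme_hints(text: str, table: dict = PRONUNCIATIONS) -> str:
    """Wrap known names in SSML <phoneme> tags before the text goes to TTS."""
    if not table:
        return escape(text)
    # Match whole words only, so "Simon" or "Simones" are left alone.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    parts, last = [], 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches so it stays valid XML.
        parts.append(escape(text[last:match.start()]))
        name = match.group(1)
        parts.append(f'<phoneme alphabet="ipa" ph="{table[name]}">{name}</phoneme>')
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


print(add_phoneme_hints("Nice to meet you, Simone!"))
```

                                            The table could be filled per user, for example from a spelling the user gives once, so the correction sticks across the whole conversation.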

                                          • byearthithatius 2 hours ago

                                            This is really cool. I got kind of scared I was about to talk to some random Hassaan haha. Super excited to see where this goes. Incredible MVP.

                                            • hassaanr an hour ago

                                              Haha imagining the website just opening a direct webcam feed to my desk. Appreciate the support!

                                            • alexawarrior4 2 hours ago

                                              Hassaan isn't working but Carter works great. I even asked it to converse in Espanol, which it does (with a horrible accent) but fluently. Great work on the future of LLM interaction.

                                              • hassaanr 2 hours ago

                                                Unfortunately, it looks like HN has given my little blog the hug of death. Should be back up soon

                                                • alexawarrior4 2 hours ago

                                                  This would be WONDERFUL with a Spanish-native accent as a language tutor, but as you've already got English you should try marketing this to the English-learning world. There is a huge dearth of native English speaker interaction in worldwide language instruction, and it's typically only available to the most privileged of students. Your system could democratize this so anyone, for an affordable fee (say $10-20/month, subsidized for the poorest), could practice speaking and have their own personal tutor. The State Department and Defense Language Institute might love this as well: if trained on languages like Iraqi Arabic and Korean, it would allow live-exercise training prior to deployment.

                                                  It can also function as an instructional tutor in a way that feels natural and interactive, as opposed to the clunkiness of ChatGPT. For instance, I asked it (in Spanish) to guide me through programming a REST API, and what frameworks I would use for that, and it was giving coherent and useful responses. Really the "secret sauce" that OpenAI needs to actually become integrated into everyday life.

                                                  • rpazpri1 2 hours ago

                                                    Multilingual support is coming out shortly! Super excited to see all the awesome use cases with this

                                              • kevinsync 3 hours ago

                                                Very cool! I think part of why this felt believable enough for me is the compressed / low-quality video presented in an interface we're all familiar with -- it helps gloss over visual artifacts that would otherwise set off alarm bells at higher resolution. Kinda reminds me of how Unreal Engine 5 / Unity 6 demos look really good at 1440p / 4k @ 40-60 fps on a decent monitor, but absolutely blast my brain into pieces at 480p @ very high fps on a CRT. Things just gloss over in the best ways at lower resolutions + analog and trick my mind into thinking they may as well be real.

                                                • ratedgene 3 hours ago

                                                  Ah, I wish I could type to this thing

                                                  • hassaanr 3 hours ago

                                                    Great point. This is possible with CVI, but we didn't build it into the demos. We'll get it added

                                                  • iamleppert 2 hours ago

                                                    I would pay cold hard cash if I could easily create an AI avatar of myself that could attend teams meetings and do basic interaction, like give a status update when called on.

                                                    • ndarray an hour ago

                                                      This would require the AI to alert you as soon as your colleagues are starting to figure out that they're talking to an AI and start interrogating it, so that you can jump in with your real mic and save the situation. Preferably the AI would repeat whatever you speak into your mic, otherwise there would be noticeable audio changes. Hope they never ask you to sing.

                                                      • pantulis 2 hours ago

                                                        Last time I checked, it was not possible through a Teams API call for video conferences, although it is pretty easy to set up a chat bot in Teams with a custom Copilot. I'd say it looked more feasible through a plugin for Google Meet, but there are too many hoops. I'd expect that to be reserved either for the host platforms or for selected partners.

                                                        • Philpax 2 hours ago

                                                          I can't imagine someone doing this would be doing it through an official integration; it's much more likely to be a virtual webcam, which is compatible with anything.

                                                          • hassaanr 2 hours ago

                                                            Give us a few weeks and this will be possible!

                                                            • windexh8er 2 hours ago
                                                              • pantulis 31 minutes ago

                                                                I didn't mean the video impersonation, I was referring to the possibility of making a synthetic bot automatically attend a conference call like a regular user without using a desktop camera simulation or stuff like that.

                                                                It's not a matter of AI, it's a matter of how Teams or Meet or Zoom allow programmatic access to the video and audio streams (the presence APIs for attending a meeting are mostly there, I think).

                                                          • zoeysmithe 2 hours ago

                                                            Okay so this is impossible because you'll get caught because tech will never fool everyone like this all the time.

                                                            But let's talk about the sentiment behind this. Am I the only one seeing some terrible things being done with AI in terms of time management, meetings, and written materials? Asking AI to "turn this nice concise 3 paragraphs into a 6 page report" is a huge problem. Everyone thinks they're an amazing technical writer now, but most good writing is concise and short and these AI monstrosities are just a waste of everyone's time.

                                                            Reform work culture instead! Why do we have cameras on our faces? Why are we making these reports? Why so many meetings? "Meeting culture" is the problem and it needs to go, but it upholds middle-management jobs and structures, so here we are asking for robots of us to sit in meetings with management to get just the 8 bullet points we need from that 1 hour meeting.

                                                            We've entered a new level of kafkaesque capitalism where a manager puts 8 bullet points into an AI, gets a professional 4 page report, then turns that into a meeting for staff to take that report and meeting transcript to...you guessed it, turn it back into those 8 bullet points.

                                                          • bilater an hour ago

                                                            This is cool but if you're trying to cater to devs you need to have a simple on demand API model and no subscription. We need to be able to evaluate the cost on our side.

                                                            • ilaksh an hour ago

                                                              This is so amazing. What's the base rate for streaming with the API? Can you add that to the Pricing page please?

                                                              • aschobel 3 hours ago

                                                                I like how it weaves background elements into the conversation; it mentioned my cat walking around.

                                                                I'm having latency issues, right now it doesn't seem to respond to my utterances and then responds to 3-4 of them in a row.

                                                                It was also a bit weird that it didn't know it was at a "ranch". It didn't have any contextual awareness of how it was presenting.

                                                                Overall it felt very natural talking to a video agent.

                                                                • e12e 3 hours ago

                                                                  Are you looking into speech to speech (no text) models?

                                                                  • hassaanr 3 hours ago

                                                                    Yeah we are! The issue we're seeing is with controllability and hallucinations in speech to speech models that we're trying to work through still

                                                                  • hirako2000 3 hours ago

                                                                    > Lower-end hardware

                                                                    That is? Roughly speaking, what resource spec?

                                                                    • gamerDude 2 hours ago

                                                                      Definitely responds quickly. But could not carry on a conversation and kept trying to almost divert the conversation into less interesting topics. Weirdly kept complimenting me or taking one word and saying, oh you feel ____. Which is not what I said or feel.

                                                                      • nkunkux2 3 hours ago

                                                                        Tried it, very impressive: digital Hassaan noticed the record player in the background and asked some stuff about it, nice :) Had some latency issues though.

                                                                        • bradhilton an hour ago

                                                                          Okay, that was really impressive. Well done!

                                                                          • k1ck4ss 3 hours ago

                                                                            The meeting has ended. Contact the meeting host if the meeting ended unexpectedly.

                                                                            • hassaanr 2 hours ago

                                                                              Try again! My blog got the hug of death it seems

                                                                            • android521 3 hours ago

                                                                              For me, there is a 5+ second delay and the video ends abruptly.

                                                                              • ninju 2 hours ago

                                                                                HN Hug of Death ?

                                                                              • heyitsguay 3 hours ago

                                                                                This is really cool in terms of the tech, but what is this useful for as a consumer? I mean it's basically just a chatbot right? And nobody likes interacting with those. Forcing a conversational interaction seems like a step down in UX.

                                                                                • andywertner 2 hours ago

                                                                                  This is a really good question. While you're right that a common use case would be chatbots for product support, it isn't the only one. Some examples:

                                                                                  - interactive experiences with historical figures
                                                                                  - digital twins for celebrity/influencer fan interactions
                                                                                  - "live" and/or personalized advertisements

                                                                                  Some of our users are already building these kinds of applications.

                                                                                  • hassaanr 2 hours ago

                                                                                    The way we see it is that this brings us closer to communicating with computers the way we communicate with each other. It has vision and can (not perfectly) take into account your expressions, your surroundings, and can respond accordingly.

                                                                                    • joshdavham 3 hours ago

                                                                                      That's actually a good question. For example, the technology is currently at a level where the user can still clearly tell that it's a chatbot, but now with a face. Does this make their experience better? Or does it add a weird level of uncanniness to the experience?

                                                                                      • hassaanr 2 hours ago

                                                                                        It'll depend on the use case- but with customers that are using it today we're seeing higher engagement and satisfaction rates. It's a different interface to communicate that is more natural to humans (our bullish opinion).

                                                                                        • heyitsguay 3 hours ago

                                                                                          I don't think the level of fidelity actually matters as much as authority or ability. What can the agent do that isn't accomplished by, for example, a landing page or an FAQ page? I've never encountered a (text) chatbot that did anything useful for me as a consumer, whether for sales or support.

                                                                                          • rpazpri1 2 hours ago

                                                                                            totally agree! agentic capabilities are really important and can significantly elevate the experience. using LLM tools is a great way to get at least part of the way there. feel free to check out our docs for "bring your own LLM" here https://docs.tavus.io/sections/conversational-video-interfac...

                                                                                        • Mistletoe an hour ago

                                                                                          I don't even like video calls with real people in my real life. Texting works great. This is really neat but I'd much rather just have a text chat with a real customer service rep. I don't need to see a face, don't want to, and especially don't want to see a fake face.

                                                                                        • nithayakumar 3 hours ago

                                                                                          Oh man - I've been watching you guys for a while. We're YC too and building a superapp for sales ppl. Any killer use cases you've seen or imagined for sales (outside of prospecting vid customization)?

                                                                                          • hassaanr 3 hours ago

                                                                                            Glad we've been worth the follow :) Totally- we're seeing AI sales agents for calls, technical counterparts (think like AI sales engineer that joins the call with you), website embeds to answer initial questions or be a virtual sales rep.

                                                                                          • altruios an hour ago

                                                                                            So at what point do we consider the morality of 'owning' such an entity/construct (should it prove itself sufficiently sentient...)?

                                                                                            to extend this (to a hypothetical future situation): what morality does a company have of 'owning' a digitally uploaded brain?

                                                                                            I worry about far future events... but since American law is based on precedent, we should be careful now about how we define/categorize things.

                                                                                            To be clear - I don't think this is an issue NOW... but I can't say for certain when these issues will come into play... So erring on the side of caution early seems prudent... and releasing 'ownership' before any sort of 'revolt' could happen seems wise, if a little silly at the current moment.