I tried this yesterday, asking it to create a simple daily reminder task, which it happily did. Then when the time came and went, I simply got a chat message saying the task had failed, with no explanation of why or how it failed. When I asked it why, it hallucinated that I had too many tasks (I only had the one). So now I don't know why it failed or how to fix it. Which leads to two related observations:
1) I find it interesting that the LLM rarely seems trained to understand its own features, or your account, or how the LLM works. Seems strange that it has no idea about its own support.
2) Which leads me to the OpenAI support docs[0]. It seems pretty telling to me that they use old-school search and not an LLM for their own help docs, right?
Same experience except mine insisted I had no tasks.
It does say it's a beta on the label, but the thing inside doesn't seem to know that, nor what it's supposed to know. Your point 1, for sure.
Point 2 is a SaaS chosen before LLMs+RAG beat the normal options. Status page, a SaaS. API membership, metrics, and billing, a SaaS. These are all undifferentiated, but arguably they selected quite well for when the selections were made, and unless the help docs are going to bring in more users, they arguably shouldn't spend time on undifferentiated heavy lifting.
You can hardly blame a product for not doing something that we don't know for certain to be possible.
I've thought about this a lot too. My guess is that because foundational models take a lot to train, they aren't retrained very often, and in my experience you can't train in new data easily, so you'd need some small, up-to-date side system. I suspect they're very thoughtful about which of these "side systems" they add: from trying to build some agent orchestration stuff myself, nothing ends up being as simple as I expect with "side systems", and stuff easily goes off the rails. So my thought was probably, given the scale they're dealing with, this is probably a low priority not actually particularly easy feature.
> So my thought was probably, given the scale they're dealing with, this is probably a low priority not actually particularly easy feature.
"working like OpenAI said it should" is a weird thing to put low priority. Why do they continuously put out features that break and bug? I'm tired of stochastic outputs and being told that we should accept sub-90% success rates.
At their scale, being less than 99.99% right results in thousands of problems. So their scale and the outsized impact of their statistical bugs is part of the issue.
Why are you setting your bar this way? Is it because of how they do their feature releases (no warning of it being an alpha or beta feature)? Their product, ChatGPT, was released 2 years ago and is fairly complicated. My understanding was that the whole thing is still a pretty early product generally. It doesn't seem unusual for any startup doing something as big as they are to release features that don't have all the kinks ironed out. I've released some kinda janky features to 100,000s of users before not totally knowing how it's going to perform with all of them at that scale, I don't think that is very controversial in product development.
Also, I was specifically talking about it being able to understand the features it has in my earlier comment, I don't think that is the same problem as the remind me feature not working consistently.
> I've released some kinda janky features to 100,000s of users before not totally knowing how it's going to perform with all of them at that scale, I don't think that is very controversial in product development.
Oh, that's because modern-day product development of "ship fast, break things" is its own problem. The whole tech industry is built on principles that are antithetical to the profession of engineering. It's not controversial in product development, because the people doing the development all decided to loosen their morals and think it's fine to release broken things and fix them later.
That my bar is high and OpenAI is so low is its own issue. But then again, I haven't released a product where it could randomly tell people to poison themselves by combining noxious chemicals or whatever other dangerous hallucination ChatGPT spews. If I had engineered something like that, with the opportunity to harm people and being unable to guarantee it wouldn't, if I had engineered that misinformation was a possibility to be created at scale, if I had engineered this, I would have trouble sleeping...
I regularly use Perplexity and Cursor which can search the internet and documentation to answer questions that aren't in their training data. It doesn't seem that hard for ChatGPT to search and summarize their own docs when people ask about it.
You would want a feature like "self aware" to be pretty canonical, not based on a web search. Even if they had a discrete internal side system it could query, one they controlled, if the training data was a year old, how would you keep the two matched over time from a systems point of view? It's also unclear how the model would interpret the data each time it ran on the new context. It seems like a pretty complicated system to build tbh, esp. when maintaining human-created help, docs, FAQs, etc. is A LOT simpler and a more reliable source of truth. That said, my understanding is that behind the scenes they are working towards the product we experience being built around the foundational model, rather than being THE foundational model as it pretty much is today. Once they have a bunch of smaller LLMs set up that do discrete standard tasks, I would guess they will become considerably more "aware".
I question the same things frequently. I routinely ask ChatGPT to help me understand the OpenAI API documentation and how to use it, and it is rarely helpful and frequently tells me things that are just blatantly untrue. At least nowadays I can link it directly to the documentation for it to read.
But I don't understand why their own documentation and products, and lots of examples using them, wouldn't be the number one thing they would want to train the models on (or fine-tune, or at least make available through a tool).
Now imagine giving this "agent" a task like booking a table at a restaurant or similar.
"Yeah sure I got you a table at a nice restaurant. Don’t worry."
> it hallucinated that I had too many tasks.
How do you know it hallucinated? Maybe your task was one too many and it is only able to handle zero tasks (which would appear to be true in your case).
Buggy af right now, 95% tasks failed and I get a ton of emails about it
Very, very, very buggy and really looks extremely low effort as with many OpenAI feature rollouts. Nothing wrong with an MVP feature, but make it at least do what it’s supposed to do and maybe give it 10% more extensibility than the bare bones.
Yeah, I saw the 4o with Tasks today, tried it and asked "what is 4o with Tasks", it had no idea. I had to set it to web search mode to figure it out.
If you ask me to describe how a human brain works, I'll have no idea and would have to search the web to get an (incomplete) idea.
> 2) Which leads me to the OpenAI support docs[0]. It seems pretty telling to me that they use old-school search and not an LLM for its own help docs, right?
I agree, but then again, if you're a dev in this space, presumably you know what keywords to use to refine your search. RAG'ed search implies that the user (dev) is not "in the know".
New killer feature: cron
Can’t imagine why everyone doesn’t pay $200/mo for even more features. Eventually I bet they can clean out /tmp!
cron, but completely unreliable. How nice.
LLM heads will say “it’s not completely unreliable, it works very often”. That is completely unreliable. You cannot rely on it to work.
Please product people, stop putting LLMs at the core of products that need reliability.
It's all a matter of degree. Even in deterministic systems, bit flipping happens. Rarely, but it does. You don't throw out computers as a whole because of this phenomenon, do you? You just assess the risk and determine if the scenario you care about sits above or below the threshold.
When’s the last time you personally had a bit flip on you?
I'm surprised it took OpenAI this long to launch scheduled tasks, but as we've seen from our users[0], pure LLM-based responses are quite limited in utility.
For context: ~50% of our users use a time-triggered Loop, often with an LLM component.
Simple stuff I've used it for: baby name idea generator, reminder to pay housekeeper, pre-natal notifications, etc.
We're moving away from cron-esque automations as one of our core-value props (most new users use us for spinning up APIs really quickly), but the base functionality of LLM+code+cron will still be available (and migrated!) to the next version of our product.
Important caveat:
> ChatGPT has a limit on 10 active tasks at any time. If you reach this limit, ChatGPT will not be able to create a new task unless you pause or delete an existing active task or it completes per its scheduled time.
So this is pretty much useless for most real-world use cases.
I'm trying to figure out how this would be useful with the existing feature set.
It seems like it would be good for summarizing daily updates against a search query, but all it would do is display them. I would probably want to connect it with some tools at minimum for it to be useful.
They're really trying to juice the usage numbers
"How chatgpt reminders saved my life and made me more productive." Videos on YouTube in 3,2,1.
As long as it’s generating hype and funding, it brings us closer to their own definition of AGI. It’s the perfect plan.
Surely we want to be scheduling and calling LLMs from temporalio, dagster - even cron - instead of whatever this is. Why put the LLM at the middle?
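A rough sketch of that inversion, with an ordinary scheduler on the outside and the LLM as just one step inside the job. This uses only Python's standard-library `sched` module; `call_llm` is a hypothetical stand-in for whatever client you actually use, and `run_scheduled_prompt` is a name made up for illustration. The point is only that the timing is deterministic and a failure comes back with its reason attached instead of a bare "task failed".

```python
import sched
import time
from typing import Callable, List, Tuple


def run_scheduled_prompt(
    prompt: str,
    call_llm: Callable[[str], str],   # hypothetical: wraps your real LLM client
    delay_seconds: float,
    results: List[Tuple[str, str]],
) -> None:
    """Run one prompt after a delay; the scheduler, not the LLM, owns the timing."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)

    def job() -> None:
        try:
            results.append(("ok", call_llm(prompt)))
        except Exception as exc:
            # Record why the job failed so the user is not left guessing.
            results.append(("failed", repr(exc)))

    scheduler.enter(delay_seconds, 1, job)
    scheduler.run()  # blocks until the job has run


if __name__ == "__main__":
    outcomes: List[Tuple[str, str]] = []
    # A stub in place of a real model call, so the sketch runs offline.
    run_scheduled_prompt("Remind me to stretch.", lambda p: "Reminder: " + p, 0, outcomes)
    print(outcomes)
```

In practice the outer layer would be cron, Temporal, Dagster or similar rather than an in-process `sched` loop; the shape of the job stays the same.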
Yeah, it’s pretty bad, embarrassingly so quite honestly. Literally a single developer in a day could probably significantly improve it. I’m sure that’s coming, but why don’t they just launch these MVP features at least a quarter baked? It’s essentially unusable as is. If it could ping me on my phone and Advanced Voice could open, or I could go do a basic task, great, I’m back to using it. But as it is rolled out, it’s hilariously minimal and borderline unusable.
This will be a lot more useful when it's able to combine with more tools, such as in custom GPT actions, APIs, "computer use", the Python interpreter, etc.
Works on my machine. (tm)
But it won't let me reschedule my task execution time or change its prompting... It will just go forever now I guess
Oddly enough, I do not have access to scheduled tasks either on the app or web interfaces and I am a paying customer.
It took me a minute to find it. It's a different model -- pull down the models list and you might see one with tasks.
This could be done with an API key and AWS Lambda in minutes.
Can I ask it to check for deals on products and make it search the web several times a day?
sounds like they're trying to get ahead of cron job wrappers so they don't get slammed at peak times
If it works correctly, wouldn't those still be peak times? Except with this they have to process the initial scheduling request in addition to the at-execution task.
A lot of answers don't go stale for hours or days. They'll do the task early, at an off-peak time, hidden from the user, double-check that it really wasn't time sensitive, then surface the saved answer at the time desired.
How are they going to double check without incurring the cost of running it again?
Everyone else's crons, synced to wall clocks, vs your centralized cron (task scheduler, really) that is aware of scheduled work and current load on your systems that consume the scheduled tasks.
Controlling the ability to nudge the wakeup times by small amounts of time can make a huge difference to your ability to manage spiky workloads like this.
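To make the "nudge" idea concrete, here is a minimal sketch of one way a centralized scheduler could spread out tasks that all ask for the same wall-clock second. Nothing here is known about how OpenAI actually does it; the function names and the 300-second window are assumptions for illustration. Each task gets a stable offset derived from a hash of its ID, so the same task always lands on the same nudged time.

```python
import hashlib
from collections import Counter
from typing import Iterable


def nudged_offset_seconds(task_id: str, max_nudge_seconds: int = 300) -> int:
    """Stable per-task offset in [0, max_nudge_seconds), derived from the task ID."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % max_nudge_seconds


def nudged_start(requested_epoch: int, task_id: str, max_nudge_seconds: int = 300) -> int:
    """Requested start time (epoch seconds) pushed later by the task's offset."""
    return requested_epoch + nudged_offset_seconds(task_id, max_nudge_seconds)


def peak_per_second(start_times: Iterable[int]) -> int:
    """Largest number of tasks starting in any single second."""
    return max(Counter(start_times).values())


if __name__ == "__main__":
    requested = 1_700_000_000  # arbitrary epoch second that every task asks for
    task_ids = [f"task-{i}" for i in range(10_000)]  # hypothetical task IDs
    exact = [requested] * len(task_ids)
    nudged = [nudged_start(requested, t) for t in task_ids]
    print("peak without nudging:", peak_per_second(exact))
    print("peak with nudging:   ", peak_per_second(nudged))
```

Everyone's independent crons firing on the hour correspond to the `exact` case; a scheduler that owns the wakeup times can trade a few minutes of lateness for a much flatter load.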
OpenAI resembles the old Apple: ship the best experience. The ChatGPT app on every platform is the best in the business and they are shipping polished features relatively quickly. It's quite the contrast to the Apple of today, the world's largest company, which is so inept that it is releasing Apple Intelligence, which is quite literally using ChatGPT 3.5 tech in 2025. It just shows how valuable CEOs like Altman, Musk and Jobs are to a corporation.
The ChatGPT UI/UX is pretty middling. They still don't have a proper answer to Claude Projects, plus they are focusing on shipping stuff like this instead of fixing the numerous papercuts with the chat experience in their UI. How is it that I can access the most powerful AI on the planet with o1 pro, but if I paste more than a few pages of text there's no solution for that, it just overflows the input box and makes it impossible to navigate?
> They still don't have a proper answer to Claude Projects
They added Projects in December:
https://help.openai.com/en/articles/10169521-using-projects-...
ChatGPT's Projects feature has weird limitations I've run into. Features that work outside projects do not necessarily work inside them.
I say this as someone who prefers using ChatGPT over Claude, but pays for both. Hoping they figure it out.
edit: restructured text to make sense.
OpenAI projects don't work very well compared to Anthropic (which has its own limitations as is).
The "old" Apple certainly didn't ship anything quick or on the bleeding edge, nor did they ship the "best" experience. They did, however, have somewhat different priorities than their competitors. They still do to some extent.
this has to be sarcasm
their commenting behavior is strange. i'm not certain.
Apple Intelligence is running on device instead of racks and racks of cloud hardware. Of course it’s less sophisticated.
Yeah, but knowing that doesn't make it much better; it's the wrong design choice.
Agreed. The vast majority of their audience doesn't understand the difference. And among the subset that do, I imagine there's a fair number of us that don't care about the distinction. I just want it to work well.
Indeed which makes me excited for..
OpenAI creating an AI phone with Microsoft ... release H.E.R. (the movie) in your pocket.
Your AI assistant / Agent is seen on the Lock Screen (like a FaceTime call UI/UX), waiting at your beck and call to do everything for you / be there for you via text, voice, gestures, expressions, etc.
It interfaces with the AI Agents of businesses, companies, your doctor, friends & family to schedule things, and is used as a knowledge base (ask for a friend's birthday if they allow that info).
Apple is indeed stale & boring to me (heavy GPT user) in 2025.