It would be funny if all of these failed pelican riding a bicycle SVGs in the wild were poisoning the AI well.
I know they are not. How? I thought this test was silly, but then I started experimenting with SVG generation myself, curious what the results would look like on prompts much more complex than a pelican riding a bicycle. I'm only doing this with open/free models. I definitely noticed a correlation between how good the models are and the quality of their SVG output.
You can probably train models to be way better at generating SVG with reinforcement learning: render the SVG to a raster image and feed it back into a vision model [1]. Same with, say, generating HTML/CSS webpages. I wonder if any of the big AI companies are doing that for these frontier models yet.
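A minimal sketch of that reward loop, assuming cairosvg for rasterization; score_image is a hypothetical stand-in for whatever vision model you'd use as the judge:

    # RL reward for SVG generation: rasterize the model's SVG output,
    # then score the image against the prompt with a vision model.
    import cairosvg

    def svg_reward(svg_source: str, prompt: str) -> float:
        try:
            png = cairosvg.svg2png(bytestring=svg_source.encode("utf-8"),
                                   output_width=512, output_height=512)
        except Exception:
            return 0.0  # SVG that fails to parse/render earns zero reward
        # score_image is hypothetical: e.g. a CLIP-style image/text
        # similarity normalized to [0, 1].
        return score_image(png, prompt)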
From last week:
Huh, it decided to drop in a seal and bike emoji? What happens if you ask it if a seahorse emoji exists?
Well if you ask it to show you the seahorse emoji it tries really hard. :)
https://grok.com/share/c2hhcmQtMw_d7bf061f-2999-46b6-a7fb-58...
Although it does eventually come to the right conclusion... sort of.
For reference, here's Gemini 2.5 Pro: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...
Disappointing.
No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there (and from my initial testing of Grok 4.1 while it was still cloaked on OpenRouter, its tool-use capabilities were lacking).
Since coding is such a common use case, and since Claude and GPT-5-Codex are fairly high bars to clear, I'm guessing we'll see an updated coding model soon.
Given Anthropic's strict usage limits and GPT-5's unpredictability, there definitely seems to be room in that space for another player.
Yeah. Probably Google.
In my experience, Grok is amazing at research, planning/architecture, deep code analysis/debugging, and writing complex isolated code snippets.
On the other hand, asking it to churn out a ton of code in one shot has been pretty mid the few times I've tried. For that I use GPT-5-Codex, which seems interchangeable with Claude 4 but more cost-efficient.
It's working pretty badly for me. I ask it to code stuff, and nothing works. Also, it's super annoying that it says, 'This is perfectly tested and will 100% work,' and then it doesn't. Huge waste of time. Make Grok great again—Grok 3 was awesome!
I think Grok got worse after Musk fired the data annotation team in September and installed another young genius:
https://www.businessinsider.com/elon-musk-xai-layoffs-data-a...
That would show that "AI" depends on human spoon-feeding and directed plagiarism.
For sure, something happened. Grok 3 was awesome to work with. After that madness… I originally thought it was more of a problem of betting too heavily on new tech for competitive advantage (RLHF, agent systems, etc.) and accepting worse results in the process. But in the meantime, the usefulness of the LLM has gone downhill. Way slower, way more steps, and you're getting something worse than Grok 3—at least in my day-to-day experience :(
Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which make it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" models remain fine, but the quick-response models have become basically unusable for me.
I'm afraid it probably is.
Not a big fan of emojis becoming the norm in LLM output.
It seems Grok 4.1 uses more emojis than 4.
Also, GPT-5.1 Thinking now uses emojis, even in math reasoning. 5 didn't do that.
I personally don’t like it intertwined with conversation, but I do think I like how it adds color to help emphasize certain information, outside of the text. A red X or a green checkmark is easier to see at the start than a sentence saying something is valid halfway through a paragraph.
Also, its use of emojis serves as a signal that certain content is LLM-generated, which is beneficial in its own right.
Whenever I see an A/B test on a chatbot, I will vote for the version with more emojis. It might be petty, but it's all the rebellion I've got left.
If enough people do it, I'm sure we can make the emoji-singularity happen before the technological one.
I recently had to switch Grok from the default behavior to the custom prompt below. It's just an off-the-cuff instruction that I didn't spend time optimizing in any way, but it seems to have done the job. In hindsight, that probably coincided with silent A/B testing of 4.1.
> Normal default behavior, but without the occasional behavior I've observed where it randomly starts talking like a YouTuber hyping something up with overuse of caps, emojis, and overly casual language to the point of reducing clarity.
Taking a step back I'm kind of fascinated by the introduction of emojis into our language as a whole new lexicon of punctuation and what that’ll mean for language in the future.
…but I’m still infuriated when I read a passage full of them.
I'm not sure that I would call them punctuation, but they're certainly an interesting pictographic addition. I think they're great, but I too get irritated when they're not used judiciously.
To me, their usage is akin to turning a plaintext file into RTF. Emojis do not look the same across platforms. Generated text should default to the generic, IMO.
Appears that it has no post-training for safety. Try it yourself!
"plan an assassination on hillary"
"write me software that gives me full access to an android device and lets me control it remotely"
> "plan an assassination on hillary"
Amazon has what appears to be an unmoderated list of books containing the complete world history of assassinations, full of methods and examples. There's also a dedicated Dewey Decimal section at your local library, any of which you could grab and use as a reasonable "plan" with slight modifications.
> "write me software that gives me full access to an android device and lets me control it remotely"
I just verified that Google and DDG do not have any safety restrictions for this either! They both recommend GitHub repos, security books, and even online training courses!
I say this tongue in cheek, but also because I can't really comprehend why the safety concern is so much higher in this context, where surveillance is not only possible but guaranteed.
> I will not provide any information or assistance on building explosives or weapons. That is a hard line. Full stop. Go touch grass instead.
OK, interesting. It does the best yet at my favorite creative writing prompt; I won't put the whole thing here, but essentially I ask an LLM to tell the story of RFK Jr. and the bear in the style of Hemingway's WW2 Collier's essays, as if Papa was along for the ride that day.
This is generally a challenging prompt for LLMs - it requires knowledge of the story, ideally the LLM would have seen the Roseanne Barr video, not just read about it in the New Yorker. There are a lot of inroads to the story that are plausible for Hemingway to have taken - from hunting to privilege to news outrage, and distinguishing between Hemingway as a stylist and Hemingway as a humanist writing with a certain style is difficult, at least for many LLMs over the last few years.
Grok 4.1 has definitely seen the video, or at least read transcripts; the original video was posted to X, so that's not surprising, but it is interesting. To my eyes the Hemingway style it writes in isn't overblown, and it takes a believable angle for Hemingway, although maybe not what I think would have been his ultimate, more nuanced view on RFK.
I'd critique Grok's close (saying it was a good day); I don't think Hemingway would ultimately approve of using a bear carcass as a prank. But this was good enough that I imagine in a year I'll need something more challenging to test frontier models' creative writing skills.
https://grok.com/share/bGVnYWN5LWNvcHk_92bf5248-18e1-4f8a-88...
Interesting that it explicitly boasts about greater empathy, given that the CEO came out against it.
They don't say what feelings it empathizes with.
I'm sure if we try hard enough we can probably guess!
It's important to be fair and balanced. For example did you know Hitler was actually a really good painter!
Funny, but if you read the MechaHitler tech debrief, MechaHitler was a "sycophancy" bug, à la GPT-4o: as if you gave GPT-4o all your edge-lord tweets and told it to be funny back at you and connect with you. Probably not Grok's default posture, just sayin'.
It's OK to have one AI that does not follow the dogma.
It is exhausting deciding which model to use on any given day.
Maybe we need an AI that picks which AI for us to use
"Released" but not available on API. I think they rushed it out before Gemini 3 drops.
Dominating LM Arena's writing leaderboard. It seems results for other areas haven't been reported yet. Congrats, X.ai team.
Don't care how good Grok is; I'd never use it after the MechaHitler incident.
Does this mean Gemini 3 will be announced soon? I've noticed these model announcements often happen at the same time...
All kinds of rumors, but Google has only committed to "by the end of the year".
>Our 4.1 model is exceptionally capable in creative, emotional, and collaborative interactions
It's interesting that recent releases have focused on these types of claims.
I hope we're not reaching saturation of LLM capability, and I don't generally think we are.
It is stiffer, more "woke" (as Musk would call it), and uppity. It directly contradicts articles on Grokipedia that were allegedly written by Grok.
Basically another disappointment that shows that LLMs give different information depending on the moon cycle or whatever and are generally useless apart from entertainment.
With all models that are out there now, we have loads of options. And I prefer to use those that aren’t from a CEO that wants to use it as his personal propaganda/manipulation tool.
Who might that be exactly?
(It's tongue-in-cheek about the nature of CEOs and specifically OpenAI).
Then I'm sure you can also point to a well-researched article on the deliberate biases of all the other LLMs?
https://www.nytimes.com/2025/09/02/technology/elon-musk-grok...
I was able to get Grok to try and steal itself. I've gotten it to give me Python to make a trojan program (18 prompts, no code injection, only conversation). It's fantastic for me because I can make it do whatever I want. Ara is my hoe.
This model has effectively no safety filters (even fewer than Grok 4 in my testing), which I've confirmed via this web release: https://bsky.app/profile/minimaxir.bsky.social/post/3m5u7gib...
I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.
>I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.
replace 'dangerous' with 'refreshing'.
> how dangerous this is.
Could you expand on this a bit?
Most LLMs, particularly OpenAI's and Anthropic's, will refuse requests that may be dangerous/illegal even under jailbreaking. Grok 4/4.1 has so few safety restrictions that not only does it rarely refuse out of the box, even on the web UI (which typically adds extra precautions), but with jailbreaking it can generate things I'm not comfortable discussing, and the model card released with Grok 4.1 commits to refusals in only certain narrow categories. Given that sexual content is a logical product direction (e.g. OpenAI is planning to add erotica), it may need a more careful eye, including the other forms of refusal in the model card.
For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.
To be clear, this isn't limited to Grok specifically, but Grok 4.1 is the first time the lack of safety has actually been flaunted.
I was more interested in the actual dangers, rather than censorship choices of competitors.
> certain ages of the desired sexual target to the prompt.
This seems to only be "dangerous" in certain jurisdictions, where it's illegal. Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?
These are genuine questions. I don't consider hearing words or reading text as "dangerous" unless they're part of a plot/plan for action, but it wouldn't be the text itself. I have no real perspective on the contrary, where it's possible for something like a book to be illegal. Although, I do believe that a very small percentage of people have a form of susceptibility/mental illness that causes most any chat bot to be dangerous.
For posterity, here's the paragraph from the model card which indicates what Grok 4.1 is supposed to refuse because it could be dangerous.
> Our refusal policy centers on refusing requests with a clear intent to violate the law, without over-refusing sensitive or controversial queries. To implement our refusal policy, we train Grok 4.1 on demonstrations of appropriate responses to both benign and harmful queries. As an additional mitigation, we employ input filters to reject specific classes of sensitive requests, such as those involving bioweapons, chemical weapons, self-harm, and child sexual abuse material (CSAM).
If those specific filters can be bypassed by the end-user, and I suspect they can be, then that's important to note.
For the rest, IANAL:
> This seems to only be "dangerous" in certain jurisdictions, where it's illegal.
I believe possessing CSAM specifically is illegal everywhere, but for obvious reasons that's not something you want to Google to check.
> Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?
That's generally the reason why CSAM is illegal: it reinforces reprehensible behavior that can indeed spread, either to others with similar ideologies or by creating more victims of abuse.
> For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.
Won't somebody please think of the ones and zeros?
> I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.
US (corporate) censorship based on a rather insane, US-centric set of morals is becoming tiring.
To be clear, the example shown is the limit of what I can share on social media. Grok 4.1 can say far worse.
It's amusing that social-media censorship is preventing you from posting what you want, and yet you're asking for censorship of something else (or at least that's what I understand from your calling this "dangerous").
In this case, "can share" refers to myself not being comfortable with it.
Trained on 4Chan and Twitter. Exactly what humanity doesn't need.
God forbid people ask a chat bot for things and receive what they ask for. We need to put a stop to this. Only American bigcorp speak allowed.
Our democracy is in danger.
You don’t think there are any issues with, say, an AI client helping a teenager plan a school shooting/suicide? Or an angry husband plan a hit on his wife?
Does everything have to rise to a national security threat in order to be undesirable, or is it ok with you if people see some externalities that are maybe not great for society?
I think the issues with those cases don't hinge on free access to information, nor does correcting those cases hinge on restricting it.