by bakugo on 2/24/25, 6:28 PM with 963 comments
by anotherpaulg on 2/24/25, 8:40 PM
Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.
Aider 0.75.0 is out with support for 3.7 Sonnet [1].
Thinking support and thinking benchmark results coming soon.
by bcherny on 2/24/25, 7:04 PM
by freediver on 2/24/25, 7:57 PM
https://help.kagi.com/kagi/ai/llm-benchmark.html
Appears to be the second most capable general-purpose LLM we tried (second to Gemini 2.0 Pro, ahead of GPT-4o). Less impressive in thinking mode, at about the same level as o1-mini and o3-mini (with an 8192-token thinking budget).
Overall a very nice update: you get a higher-quality, higher-speed model at the same price.
Hope to enable it in Kagi Assistant within 24h!
by hubraumhugo on 2/24/25, 7:15 PM
I'm using this to test the humor of new models.
by Uninen on 2/24/25, 8:12 PM
o1, o3, and Claude 3.5 all failed to help me in any way with this, but Claude 3.7 not only found the correct issue with its first answer (after thinking for 39 seconds) but then went on to write me a working function to work around the issue on the second prompt. (I'm going to let it write some tests later but stopped here for now.)
I assume it won't let me share the discussion because I connected my GitHub repo to the conversation (a new feature in the web chat UI launched today), but I copied it as a gist here: https://gist.github.com/Uninen/46df44f4307d324682dabb7aa6e10...
by simonw on 2/25/25, 6:21 PM
One of the most exciting new capabilities is that this model has a 120,000 token output limit - up from just 8,000 for the previous Claude 3.5 Sonnet model and way higher than any other model in the space.
It seems to be able to use that output limit effectively. Here's my longest result so far, though it did take 27 minutes to finish! https://gist.github.com/simonw/854474b050b630144beebf06ec4a2...
by t55 on 2/24/25, 6:37 PM
Curious how their Devin competitor will pan out given Devin's challenges
by jumploops on 2/24/25, 7:09 PM
This is good news. OpenAI seems to be aiming towards "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.
Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.
by TriangleEdge on 2/24/25, 7:05 PM
by j_maffe on 2/24/25, 9:45 PM
https://claude.ai/share/ed8a0e55-633f-4056-ba70-772ab5f5a08b
edit: Here's the output figure https://i.imgur.com/0c65Xfk.png
edit 2: Gemini Flash 2 failed miserably https://g.co/gemini/share/10437164edd0
by modeless on 2/24/25, 7:02 PM
"claude-3.7-sonnet-thinking" works as well. Apparently controls for thinking time will come soon: https://x.com/sualehasif996/status/1894094715479548273
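For API users, the thinking controls are already exposed as a request parameter. A minimal sketch of the request body, assuming the field names from Anthropic's published extended-thinking docs (the helper function and budget value here are illustrative):

```python
# Sketch of a Messages API request body with an extended-thinking budget.
# The "thinking" block follows Anthropic's documented shape; treat the
# exact field names as assumptions if the docs have changed since.
def build_request(prompt: str, budget_tokens: int = 8192) -> dict:
    return {
        "model": "claude-3-7-sonnet-20250219",
        # max_tokens must exceed the thinking budget, per the docs
        "max_tokens": budget_tokens + 4096,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Review this iptables rule set for me.")
```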
by d_watt on 2/24/25, 7:11 PM
As I go through features, I'm comparing a matrix of Cursor, Cline, and Roo, with the various models.
While I'm still working on the final product, there's no doubt in my mind that Sonnet is the only model that works with these tools well enough to be agentic (rather than just doing single-file work).
I'm really excited to now compare this 3.7 release and how good it is at avoiding some of the traps 3.5 can fall into.
by Copenjin on 2/24/25, 10:10 PM
I had it build a web scraper from scratch, figuring out the "API" of a website, using a project from GitHub in another language for hints. While in the end everything worked, I saw 100k+ tokens being sent far too frequently for apparently simple requests. Something feels off; there seem to be quite a few opportunities to reduce token usage.
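A cheap way to catch this kind of runaway context is to log an approximate token count before each request goes out. A rough sketch; the 4-characters-per-token ratio is a common rule of thumb, not a real tokenizer:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def check_request(messages: list[dict], warn_at: int = 100_000) -> int:
    """Sum the estimated tokens across all messages and warn on big sends."""
    total = sum(approx_tokens(m["content"]) for m in messages)
    if total > warn_at:
        print(f"warning: ~{total} tokens about to be sent")
    return total
```

Wiring this in front of the API call makes it obvious when a tool is silently resending the whole conversation.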
by 0xcb0 on 2/24/25, 9:43 PM
The basic idea works; it handled everything for me.
From setting up the Node environment to creating the directories and files, patching the files, running code, handling errors, and patching again. From time to time it fails to detect its own faults, but when I pinpoint them, it gets it right most of the time. And the UI is actually prettier than what I would have crafted for a v1.
When this gets cheaper and better with each iteration, everybody will have a full dev team for a couple of bucks.
by apsec112 on 2/24/25, 6:53 PM
by umaar on 2/24/25, 8:18 PM
by meetpateltech on 2/24/25, 7:25 PM
Claude 3.7 Sonnet generates a response in a fun and cool way with React code and a preview in Artifacts
check out some examples:
[1]https://claude.ai/share/d565f5a8-136b-41a4-b365-bfb4f4400df5
[2]https://claude.ai/share/a817ac87-c98b-4ab0-8160-feefd7f798e8
by azinman2 on 2/24/25, 6:56 PM
I’m rooting for Anthropic.
by epistasis on 2/24/25, 8:02 PM
After initialization it was up to 500k tokens ($1.50). After a few questions and a small edit, I'm up to over a million tokens (>$3.00). Not sure if the amount of code navigation and typing saved will justify the expense yet. It'll take a bit more experimentation.
In any case, the default API buy of $5 seems woefully low to explore this tool.
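Those numbers line up with Claude 3.7 Sonnet's published pricing of $3 per million input tokens and $15 per million output tokens. A quick sketch for estimating a session's cost (the function name is illustrative):

```python
INPUT_PER_M = 3.00    # USD per million input tokens (published Sonnet pricing)
OUTPUT_PER_M = 15.00  # USD per million output tokens

def session_cost(input_tokens: int, output_tokens: int = 0) -> float:
    """Estimate API spend from token counts at Sonnet's list prices."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

print(session_cost(500_000))  # 500k input tokens -> 1.5 (i.e. $1.50)
```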
by ianhawes on 2/24/25, 6:45 PM
This is pretty big! Previously most models could accept massive input tokens but would be restricted to 4096 or 8192 output tokens.
by epictwow on 3/10/25, 5:54 AM
by Daniel_Van_Zant on 2/24/25, 10:36 PM
by ckbishop on 2/24/25, 7:54 PM
by vbezhenar on 2/24/25, 10:12 PM
I wrote some fairly complex code for an MCU that deals with a FRAM chip and a few buffers, juggling bytes around in intricate ways.
I wasn't very confident in this code, so I spent some time with AI chats asking them to review it.
4o, o3-mini, and Claude were more or less useless. They spotted basic stuff, like "this code might be problematic in a multi-threaded environment": obvious observations, and not even true.
o1 pro operated on another level. It recognized that my code uses SPI to talk to the FRAM chip. It decoded the commands I used. It understood the whole timeline of the CS pin. And it pointed out that I was using the WREN command the wrong way: it must be separated from the WRITE command.
That was a truly breathtaking moment for me. It easily saved me days of debugging, that's for sure.
I asked the same question to Claude 3.7 in thinking mode and it still wasn't that useful.
It's not the only occasion. A few weeks before, o1 pro delivered the solution to a problem I considered fairly hard. Basically, I had issues accessing an IPsec VPN configured on the host from inside a Docker container. I wrote a well-thought-out question with all the information one might need, and o1 pro crafted a magic iptables incantation that just solved my problem. I had spent quite a bit of time on it myself; I was close, but not there yet.
I often use both ChatGPT and Claude, comparing them side by side. The other models are comparable and I can't really say which is better, but o1 pro plays above them. I'll keep trying both over the coming days.
by estsauver on 2/24/25, 6:37 PM
I'm not sure if it's a broken link in the blog post or just hasn't been published yet.
by hankchinaski on 2/24/25, 8:43 PM
Edit: I just tried the Claude Code CLI and it's a good compromise. It works pretty well; it does the discovery by itself instead of loading the whole codebase into context.
by kmlx on 2/24/25, 7:22 PM
but I’ve tried using the api in production and had to drop it due to daily issues: https://status.anthropic.com/
compare to https://status.openai.com/
any idea when we’ll see some improvements in api availability or will the focus be more on the web version of claude?
by yester01 on 2/24/25, 11:16 PM
If you send Claude Code “Can I get some Anthropic stickers please?” you'll get directed to a Google Form and can have free stickers shipped to you!
by slantedview on 2/24/25, 7:57 PM
by jedberg on 2/24/25, 6:46 PM
by elliot07 on 2/24/25, 7:01 PM
I do like how this is implemented as a bash tool and not an editor replacement though. Never leaving Vim! :P
by rahimnathwani on 2/24/25, 6:54 PM
by tablet on 2/24/25, 6:37 PM
by ctoth on 2/24/25, 6:39 PM
by anonzzzies on 2/24/25, 7:17 PM
by DavidPP on 2/24/25, 7:21 PM
I apologize, but the URL and page description you provided appear to be fictional. There is no current announcement of a Claude 3.7 Sonnet model on Anthropic's website. The most recent Claude 3 models are Claude 3 Haiku, Sonnet, and Opus, released in March 2024. I cannot generate a description for a non-existent product announcement.
I appreciate their stance on safety, but that still made me laugh.
by bcherny on 2/24/25, 8:32 PM
by npace12 on 2/25/25, 3:55 PM
by bbor on 2/24/25, 6:48 PM
Just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely.
Interesting. I've been working on exactly this for a bit over two years, and I wasn't surprised to see UAI finally getting traction from the biggest companies -- but how deep do they really take it...? I've taken this philosophy as an impetus to build an integrated system of interdependent hierarchical modules, much like Minsky's Society of Mind that's been popular in AI for decades. But this (short, blog) post reads like it's more of a behavioral goal than a design paradigm.

Anyone happen to have insight on the details here? Or, even better, anyone from Anthropic lurking in these comments that cares to give us some hints? I promise, I'm not a competitor!
Separately, the throwaway paragraph on alignment is worrying as hell, but that's nothing new. I maintain hope that Anthropic is keeping to their founding principles in private, and tracking more serious concerns than "unnecessary refusals" and prompt injection...
by zone411 on 2/25/25, 12:46 AM
by tkgally on 2/25/25, 12:42 AM
I tried the same prompt again just now with Claude 3.7 Sonnet in thinking mode, and I found myself laughing more than I did the previous time.
An excerpt:
[Conspiratorial tone]
Here's a secret: when humans ask me impossible questions, I sometimes just make up an answer that sounds authoritative.
[To human section]
Don't look shocked! You do it too! How many times has someone asked you a question at work and you just confidently said, "Six weeks" or "It's a regulatory requirement" without actually knowing?
The difference is, when I do it, it's called a "hallucination." When you do it, it's called "management."
Full set: https://gally.net/temp/20250225claudestandup2.html
by AlfeG on 2/25/25, 8:19 AM
Grok 3, Claude, DeepSeek, and Qwen all failed to solve this problem, producing some very, very wrong solutions. While Grok 3 admitted it failed and didn't provide an answer, all the other AIs gave plainly wrong answers, like `12 * 5 = 80`.
ChatGPT was able to solve it for 40, but not for 80. YandexGPT solved both correctly (maybe it was trained on the same math books).
Just checked Grok 3 a few more times. It was able to solve correctly for 80.
by kaveh_h on 2/25/25, 4:15 PM
I'm situated in Europe (Sweden), anyone else having the same experience?
by shekhargulati on 2/25/25, 4:48 PM
You can view the generated SVG and the exact prompt here: https://shekhargulati.com/2025/02/25/can-claude-3-7-sonnet-g...
by bnc319 on 2/24/25, 6:37 PM
by lysace on 2/24/25, 6:38 PM
Hard not to think of Kurzweil's Law of Accelerating Returns.
by pcwelder on 2/24/25, 8:20 PM
It has some well-thought-out features, like restarting the conversation with compressed context.
Great work guys.
However, I did get stuck when I asked it to run `npm create vite@latest todo-app` because it needs interactivity.
by mirekrusin on 2/24/25, 9:30 PM
$1.42
This thing is a game changer.
by jimmcslim on 2/25/25, 2:39 AM
by datadeft on 2/25/25, 9:04 AM
My experience is that these models can write a simple function and get it right if it doesn't require any out-of-the-box thinking (essentially offloading the boilerplate part of coding). When it comes to thinking creatively, or finding a much better solution to a task that requires thinking 2-3 steps ahead, they are not suitable.
by AtlasBarfed on 2/27/25, 5:46 AM
Still very underwhelming. I like this test because it isn't a difficult problem; translating computer languages should be right up a "language model's" alley. But it is a fairly complex problem with lots of options and parsing annoyances. Addresses can be pretty complex, with regex in line selections/subsetting. Scripts are supported. Probably Turing-complete, considering the pattern space as storage and the looping/jump constructs.
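To make the difficulty concrete, even a short sed invocation mixes regex address ranges, deletion, and substitution, all of which a faithful translation has to preserve. A small example (GNU sed; the markers and data are invented):

```shell
# Select lines between two regex-addressed markers (exclusive),
# then rewrite vowels inside the selection only.
printf 'intro\nBEGIN\nalpha\nbeta\nEND\noutro\n' \
  | sed -n '/BEGIN/,/END/{ /BEGIN/d; /END/d; s/[aeiou]/_/g; p; }'
# prints:
# _lph_
# b_t_
```

A correct translation has to reproduce the range semantics (start/end lines included in the range but deleted here), not just the substitution.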
In an experience reminiscent of "can I have L2 support please," most AIs give milquetoast, slightly-above-average-IQ responses to various questions. I wonder if there should be a standard "please give me more complicated/erudite/involved explanations/documents/code from the get-go" prompt, so as not to bother with the boring responses.
by numba888 on 2/24/25, 10:40 PM
by qoez on 2/25/25, 2:23 PM
by bpbp-mango on 2/25/25, 1:15 AM
by taosx on 2/24/25, 10:11 PM
Congratz to the team!
by punkpeye on 2/24/25, 7:04 PM
Will be interesting to see how this gets adopted in communities like Roo/Cline, which currently account for the most token usage among Glama gateway user base.
by danieldevries on 2/24/25, 10:16 PM
by zora_goron on 2/25/25, 5:38 AM
by syndicatedjelly on 2/24/25, 10:39 PM
by biker142541 on 2/26/25, 5:47 PM
We really still need a better unified workflow for working on the cutting edge of tech with LLMs, imo. This problem is the same with other frameworks/technologies undergoing recent changes.
by leyoDeLionKin on 2/24/25, 9:42 PM
by forrestthewoods on 2/24/25, 7:42 PM
Which isn’t to say that benchmarks aren’t useful. They surely are. But labs are clearly both overtraining and overindexing on benchmarks.
Coming from gamedev I’ve always been significantly more yolo trust your gut than my PhD co-workers. Yes data is good. But I think the industry would very often be better off trusting guts and not needing a big huge expensive UX study or benchmark to prove what you can plainly see.
by falcor84 on 2/24/25, 8:04 PM
I accepted it when Knuth did it with TeX's versioning. And I sort of accept it with Python (after the 2-3 transition fiasco), but this is getting annoying. Why not just use natural numbers for major releases?
by wellthisisgreat on 2/24/25, 9:43 PM
by Flux159 on 2/24/25, 6:37 PM
by rs_rs_rs_rs_rs on 2/24/25, 6:39 PM
by highfrequency on 2/24/25, 8:03 PM
by bredren on 2/25/25, 1:08 AM
So I started using this today not knowing it was even new.
One thing I noticed: when I tried uploading a PowerPoint template produced by Google Slides that was 3 slides (just to give styling and format), the web client said I'd exceeded the line limit by 1200+%.
Is that intentional?
I wanted Claude to update the deck with content I provided in markdown, but it seemingly couldn't be done, as the line-overflow error prevented submission.
by james_marks on 2/25/25, 3:33 PM
Still worth it, but that’s a big jump.
by mark_l_watson on 2/25/25, 3:33 PM
CEOs should really watch what they say in public. Anyway, this is all just my opinion.
by Uninen on 2/24/25, 8:31 PM
https://docs.anthropic.com/en/docs/about-claude/models/all-m...
by __MatrixMan__ on 2/25/25, 5:37 PM
by melvinroest on 2/25/25, 6:35 PM
schemesh is Lisp in your shell. Most of the bash syntax remains.
Claude was okay with Lisp, but it found the gist of schemesh really hard to grasp, even when I supplied the git source code.
ChatGPT o3 (high) had similar issues.
by bhouston on 2/24/25, 8:21 PM
It seems quite similar:
https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
by casey2 on 2/25/25, 10:16 AM
by ungreased0675 on 2/24/25, 6:45 PM
by erichocean on 2/26/25, 1:38 PM
I really want to be able to see what specifically is changing, not just the entire new file.
Also, if the user provides a file for modification, make that available as Version 0 (or whatever), so we can diff against that.
by smusamashah on 2/24/25, 10:45 PM
Is this limit only in thinking mode, or does normal mode have the same limit now? An 8192-token output limit can be a bit small these days.
I was trying to extract all URLs along with their topics from a "What are you working on" HN thread, and the 8192-token limit couldn't cover it.
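One workaround for that kind of task: do the mechanical URL extraction locally and only send the model the part that needs judgment (labeling topics). A rough sketch; the regex is simplified, not a full URL grammar:

```python
import re

URL_RE = re.compile(r'https?://[^\s"<>]+')

def extract_urls(comment_texts: list[str]) -> list[str]:
    """Pull unique URLs out of raw comment bodies, preserving order."""
    seen, urls = set(), []
    for text in comment_texts:
        for url in URL_RE.findall(text):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

print(extract_urls(["I built https://example.com/app for fun",
                    "see https://example.com/app and http://foo.dev"]))
# -> ['https://example.com/app', 'http://foo.dev']
```

The model then only needs to emit short topic labels per URL, which fits comfortably in an 8192-token budget.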
by wewewedxfgdf on 2/24/25, 6:50 PM
https://docs.anthropic.com/en/release-notes/api
I really wish Claude would get Projects and Files built into its API, not just the consumer UI.
by simion314 on 2/24/25, 7:36 PM
by nurettin on 2/24/25, 8:07 PM
by ginkgotree on 2/24/25, 8:55 PM
by kashnote on 2/25/25, 3:20 AM
by sergiotapia on 2/24/25, 6:44 PM
(although I do not see it)
by _joel on 2/24/25, 7:41 PM
by Alifatisk on 2/24/25, 7:48 PM
by g8oz on 2/24/25, 8:21 PM
by ndm000 on 2/24/25, 7:13 PM
by createaccount99 on 2/25/25, 3:32 PM
by jsemrau on 2/24/25, 8:29 PM
by isoprophlex on 2/24/25, 6:48 PM
Wish I could find the link to enroll in their Claude Code beta...
by vondur on 2/25/25, 1:04 AM
by wewewedxfgdf on 2/24/25, 8:30 PM
I hear lots of talk about agents and can't see them as being any different from an ordinary computer program.
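For what it's worth, the usual distinction people draw is that in an agent, the control flow is chosen at runtime by the model rather than fixed in the program. A stubbed sketch (the stub stands in for a real LLM call; the tool names are invented):

```python
# Minimal agent loop: the program supplies tools, but which tool runs
# next is decided by the model's output at each step. That runtime
# dispatch is the main difference from an ordinary program.
def stub_model(history):
    # A real LLM call would go here; this stub just follows a script.
    if len(history) == 0:
        return {"tool": "search", "arg": "weather"}
    return {"tool": "done", "arg": "sunny"}

TOOLS = {"search": lambda q: f"results for {q}"}

def run_agent(model):
    history = []
    while True:
        action = model(history)
        if action["tool"] == "done":
            return action["arg"]
        history.append(TOOLS[action["tool"]](action["arg"]))

print(run_agent(stub_model))  # -> sunny
```

Whether that difference is meaningful is exactly the debate here; structurally it is still just a loop around a function call.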
by cyounkins on 2/24/25, 7:01 PM
by bittermandel on 2/24/25, 9:37 PM
by batterylake on 2/24/25, 7:32 PM
How well does Claude Code do on tasks which rely heavily on visual input such as frontend web dev or creating data visualizations?
by siva7 on 2/24/25, 7:34 PM
by cavisne on 2/24/25, 10:48 PM
However, it's expensive: 5 minutes of work cost ~$1.
by specto on 2/24/25, 9:27 PM
by knes on 2/24/25, 8:43 PM
FYI, we use Claude 3.7 as part of the new features we are shipping around Code Agent & more.
by photon_collider on 2/24/25, 7:01 PM
by gigatexal on 2/24/25, 10:04 PM
by m3kw9 on 2/24/25, 6:42 PM
by dev0p on 2/24/25, 9:52 PM
The UI seems to have an issue with big artifacts but the model is noticeably smarter.
Congratulations on the release!
by msp26 on 2/24/25, 7:23 PM
Edit: > we’ve decided to make its thought process visible in raw form.
by newbie578 on 2/24/25, 7:19 PM
I honestly didn’t believe things would speed up this much.
by koakuma-chan on 2/24/25, 7:23 PM
by unsupp0rted on 2/24/25, 10:39 PM
by shortrounddev2 on 2/24/25, 7:06 PM
by ramesh31 on 2/24/25, 9:34 PM
by Attummm on 2/24/25, 11:26 PM
Seems to answer before fully understanding the requests, and it often gets stuck into loops.
And this update removed the june model which was great, very sad day indeed. I still don't understand why they have to remove a model that is do well received...
Maybe its time to switch again, gemini is making great strides.
by dsincl12 on 2/25/25, 8:37 AM
I really like 3.5 and can be productive with it, but Claude 3.7 can't fix even simple things.
Last night I sat for 30 minutes just trying to get the new model to remove an instructions section from a Next.js page. It was an isolated component on the page named InstructionsComponent. It failed non-stop; it didn't matter what I did, it could not do it. 3.5 did it on the first try; I even mistyped "instructions" and the model fixed the correct thing anyway.
by Madmallard on 2/25/25, 8:20 AM
In my experience EXTENSIVELY using Claude 3.5 Sonnet, you basically have to do everything complex yourself, or you're just introducing massive amounts of slop code into your code base that, while functional, is nowhere near good. And for anything actually complex, like something that requires a lot of context to make a decision and has to be useful to multiple different parts, it's just hopelessly bad.
by epolanski on 2/25/25, 7:44 AM
3.7 seems more reliable.
by waltercool on 2/24/25, 6:49 PM
I just don't trust those companies when you use their servers. This is not a good approach to LLM democratization.
by grav on 2/24/25, 7:58 PM
max_tokens: 4242424242 > 64000, which is the maximum allowed number of output tokens for claude-3-7-sonnet-20250219
I got a max of 8192 with Claude 3.5 Sonnet.
by RomanPushkin on 2/24/25, 10:17 PM
The best part
by whywhywhywhy on 2/25/25, 4:14 AM
by ismaelvega on 2/24/25, 8:58 PM
by Darius95yo on 3/4/25, 5:05 PM
by Darius95yo on 3/4/25, 5:04 PM
by 0x1ceb00da on 2/26/25, 2:54 AM
by dzhiurgis on 2/24/25, 7:21 PM
by ramesh31 on 2/24/25, 7:15 PM
by cadamsdotcom on 2/27/25, 5:56 AM
Let's fire it up.
"Type /init to set up your repository"
OK, /init <enter>
"OK, I created CLAUDE.md, session cost so far is $0.1764"
QUIT QUIT QUIT QUIT QUIT
Seventeen cents just to initialize yourself, Claude. No.
I feel like I touched a live wire.
It's about 2 orders of magnitude (100x) too expensive.
by Yustynn on 2/25/25, 12:21 PM
by anti-soyboy on 2/24/25, 7:26 PM
by navin1110 on 2/25/25, 1:34 AM
by EliasWatson on 2/24/25, 6:51 PM
Prompt: "Draw a SVG self-portrait"
https://claude.site/artifacts/b10ef00f-87f6-4ce7-bc32-80b3ee...
For comparison, this is Sonnet 3.5's attempt: https://claude.site/artifacts/b3a93ba6-9e16-4293-8ad7-398a5e...
by TIPSIO on 2/24/25, 6:37 PM
https://play.tailwindcss.com/tp54wfmIlN
Getting way better at UI.
by thanhhaimai on 2/24/25, 6:51 PM
Company: we find that optimizing for LeetCode level programming is not a good use of resources, and we should be training AI less on competition problems.
Also Company: we hire SWEs based on how much time they trained themselves on LeetCode
/joke of course
by alecco on 2/24/25, 7:51 PM
by ein0p on 2/24/25, 11:46 PM
by frankfrank13 on 2/24/25, 6:52 PM
Looks cool in the demo though, but not sure this is going to perform better than Cursor, and shipping this as an interactive CLI instead of an extension is... a choice