by reilly3000 on 3/31/24, 5:32 PM with 149 comments
by JCM9 on 3/31/24, 5:48 PM
The legal angles here are also super interesting. There's a growing body of scenarios where companies are held accountable for the goofs of their AI "assistants." Thus we're likely heading for some comical train wrecks as companies that don't properly vet this stuff set themselves up for some expensive disasters (e.g. the AI assistant doing things that get the company into trouble).
I’m bullish on the tech, but bearish on the ability of folks to deploy it at scale without making a big expensive mess.
by tuckerconnelly on 3/31/24, 6:03 PM
* I was able to make a simple AI agent that could control my Spotify account, and make playlists based on its world knowledge (rather than Spotify recommendation algos), which was really cool. I used it pretty frequently to guide Spotify into my music tastes, and would say I got value out of it.
* GPT-4 actually worked quite well; GPT-3.5 worked maybe 80% of the time. Mixtral did not work at all, even after the hacks/workarounds needed to get function-calling working in the first place (a sketch of the kind of tool setup involved follows this list).
* It was very slow and VERY expensive. Needing CoT was a limitation. Could easily rack up $30/day just testing it.
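A minimal sketch of the kind of function-calling setup such a Spotify agent involves, using the OpenAI tools API (the tool name, schema, and prompt are illustrative assumptions, not the commenter's actual code):

```python
# Sketch: expose a playlist-creation tool to the model; the returned
# tool-call arguments would then be passed to the Spotify Web API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "create_playlist",  # hypothetical tool name
        "description": "Create a Spotify playlist from a list of track names.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Playlist title"},
                "tracks": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Tracks chosen from the model's world knowledge",
                },
            },
            "required": ["name", "tracks"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Make me a playlist of 90s trip-hop deep cuts."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```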
My overall takeaway: it's too early: too expensive, too slow, too unreliable. Unless you somehow have a breakthrough with a custom model.
From the marketing side, people just don't "get it." I've since niched down, and it's very, very promising from a business perspective.
[1] https://konos.ai
by kennethologist on 3/31/24, 5:52 PM
- GPU bill is $200/month for 21,000 designs per month, or about 1¢ per render (no character training like Photo AI, which helps costs)
- Hosted on a shared VPS with my other sites @ $500/mo, but percentage-wise Interior AI is ~$50 of that
= $250/month in costs
It makes about $45,000 in MRR, so $44,750 is pure profit! It is 100% run by AI robots, no people
I lead the robots and do product dev but only when necessary"
by airstrike on 3/31/24, 5:57 PM
It's true that I have to wrestle with them a lot to get them to do what I want for more complex tasks... so they are great for certain tasks and terrible for others. But when I'm in Xcode, I dearly miss VS Code because of Copilot autocomplete, which I guess is an indication that it adds some value.
One unexpected synergy has been how good GPT-4 is at explaining why my Rust code is so bad, thanks to the very verbose compiler messages and the availability of high-quality training data (i.e. the great Rust code in the wild), despite GPT-4 not always being great at writing new Rust code from a blank file.
Part of me thinks in the future this loop is going to be a bit more automated, with an LLM in the mix... similar to how LSPs are "obvious" and ubiquitous these days
On an unrelated note, I also wrote a small python script for translating my Xcode project's localizable strings into ~10 different languages with some carefully constructed instructions and error checking (basically some simple JSON validation before OpenAI offered JSON as a response type). I only speak ~2 of the target languages, and only 1 natively, but from a quick review the translations seemed mostly fine. Definitely a solid starting point
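A minimal sketch of such a translation script, assuming the strings have been exported to JSON (the prompt, file names, and validation are illustrative, not the commenter's actual code):

```python
# Sketch: translate localizable strings via the chat API, then validate
# that the reply is well-formed JSON with the same keys before saving.
import json
from openai import OpenAI

client = OpenAI()
LANGUAGES = ["de", "fr", "es", "it", "ja"]  # hypothetical target list

def translate(strings: dict, lang: str) -> dict:
    prompt = (
        f"Translate the JSON values below into {lang}. "
        "Return only a JSON object with the same keys. "
        "Preserve placeholders like %@ and %d exactly.\n\n"
        + json.dumps(strings, ensure_ascii=False)
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    translated = json.loads(reply)  # the "simple JSON validation" step
    assert translated.keys() == strings.keys(), "model dropped or added keys"
    return translated

strings = json.load(open("Localizable.json"))
for lang in LANGUAGES:
    with open(f"{lang}.json", "w") as f:
        json.dump(translate(strings, lang), f, ensure_ascii=False, indent=2)
```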
by varunshenoy on 4/1/24, 3:07 AM
I've used Devin a few times (see: https://x.com/varunshenoy_/status/1767591341289250961?s=20), and while it's far from perfect, it's by far the best I've seen. It doesn't get stuck in loops, and it keeps trying new things until it succeeds. Devin feels like a fairly competent high school intern.
Interestingly, Devin seems better suited as an entry-level analyst than a software engineer. We've been using it internally to scrape and structure real estate listings. Their stack for web RPA and browser automation works _really_ well. And it makes sense why this is important: if you want to have a successful agent, you need to provide it with good tools. Again, it's not flawless, but it gives me hope for the future of AI agents.
by neilv on 3/31/24, 6:23 PM
Don't put it in charge of paying bills.
Do put it in charge of making SEO content sites, conducting mass automated scam interactions, generating bulk code where the company tolerates incompetence, making stock art for blog posts that don't need to look professional, handling customer service for accounts you don't care about, etc.
by trzy on 3/31/24, 5:51 PM
by spxneo on 3/31/24, 6:18 PM
I'm also seeing an explosion in the number of comments advertising AI tools on anything remotely related to AI topics. It makes me think we are headed for a major correction.
by nottorp on 3/31/24, 6:06 PM
I use LLMs as a glorified search engine. That was better than web search at some point; I'm not sure the publicly available LLMs are that good any more. Lately, Gemini seems extremely worried about not offending anyone instead of giving me results.
At least it's still useful for 'give me the template code for starting an XXX' ...
by danenania on 3/31/24, 6:28 PM
https://github.com/plandex-ai/plandex
It's working quite well though I am still ironing out some kinks (PRs welcome btw).
I think the key to agents that really work is understanding the limitations of the models and working around them rather than trying to do everything with the LLM.
In the context of software development, imo we are currently at the stage of developer-AI symbiosis and probably will be for some time. We aren't yet at the stage where it makes sense to try to get an agent to code and debug complex tasks end-to-end. Trying to do this is a recipe for burning lots of tokens and spending more time than it would take to build the thing yourself. But if you follow the 80/20 rule and get the AI to do the bulk of the work, intervening frequently to keep it on track and then polishing up the final product manually at the end, huge productivity gains are definitely in reach.
by jacob019 on 3/31/24, 7:12 PM
There is a good amount of research going into combining LLMs with RL for decision making, it is a powerful combination. LLMs help with high level reasoning and goal setting, and of course provide a smooth interface for interacting with humans and with other agents. LLMs also contain much of the collective knowledge of humanity, which is very useful for training agents to do things. If you want to train a robot to make a sandwich it's helpful to know things, like what is a sandwich, and that it is necessary to move, get bread, etc.
These feedback loop LLM agent projects are kind of misguided IMO. AI agents are real and useful and progressing fast, but we need to combine more tools than just LLM to build effective systems.
Personally, I am using LLMs quite effectively for ecommerce: classifying messages, drafting responses, handling simple requests like order cancellation. All kinds of glue stuff that used to be painful is now automated and easy. I could go on.
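A rough sketch of the message-classification piece of such a pipeline (the categories, prompt, and model are illustrative assumptions):

```python
# Sketch: classify an incoming customer message into a fixed label set,
# falling back to "other" on any unexpected output.
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["order_cancellation", "shipping_question", "refund_request", "other"]

def classify(message: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Classify the customer message into exactly one of: "
                        + ", ".join(CATEGORIES) + ". Reply with the label only."},
            {"role": "user", "content": message},
        ],
        temperature=0,
    ).choices[0].message.content.strip()
    return reply if reply in CATEGORIES else "other"

print(classify("Please cancel order #1234, I ordered by mistake."))
```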
by wkirby on 3/31/24, 5:58 PM
That line of reasoning has held true across basically every project we’ve touched that tried to incorporate LLMs into a core workflow.
by hubraumhugo on 3/31/24, 6:03 PM
- Navigation: Detect navigation elements and handle actions like pagination or infinite scroll automatically.
- Network Analysis: Identify desired data within network calls.
- Data transformation: Clean and map the data into the desired format. Finetuned small, performant LLMs are great at this task, with high reliability (a sketch of this step follows the list).
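A minimal sketch of that transformation step (the target schema and model stand-in are illustrative; the comment describes finetuned small models):

```python
# Sketch: have a small LLM map a messy scraped record onto a fixed
# JSON schema, using JSON-mode output for reliability.
import json
from openai import OpenAI

client = OpenAI()
TARGET_SCHEMA = {"title": "string", "price_eur": "number", "in_stock": "boolean"}

def transform(raw_record: dict) -> dict:
    prompt = (
        "Map this scraped record onto the target schema. "
        "Return only JSON matching the schema.\n"
        f"Schema: {json.dumps(TARGET_SCHEMA)}\n"
        f"Record: {json.dumps(raw_record, ensure_ascii=False)}"
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for a finetuned small model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(reply.choices[0].message.content)

print(transform({"name": "Lamp - oak", "cost": "€49,00", "avail": "ja"}))
```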
The main challenge:
We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
The integration of tightly constrained agents with traditional engineering methods effectively solved this issue for us.
by edshiro on 4/7/24, 6:32 AM
The fact that you can get vastly different outcomes for similar runs (even while using Claude 3 Opus with tool/function calling) can drive you insane. I read somewhere down in this thread that one way to mitigate these problems is by implementing a robust state machine. I reckon this can help, but I also believe that somehow leveraging memory from previous runs could be useful too. It's not fully clear in my mind how to go about doing this.
I'm still very excited about the space though. It's a great place to be and I love the energy but also measured enthusiasm from everyone who is trying to push the boundaries of what is possible with agents.
I'm currently also tinkering with my own Python AI Agent library to further my understanding of how they work: https://github.com/kenshiro-o/nagato-ai . I don't expect it to become the standard but it's good fun and a great learning opportunity for me :).
by habitue on 3/31/24, 8:23 PM
From what I've seen, current LLMs "diverge" when put into a loop. They seem to reason acceptably in small chunks, but when you string the chunks together, they go off the rails and don't recover.
Can you slap another layer of LLM on top to explicitly recover? People have tried this, it seems like nobody has figured out the error correction needed to get it to converge well.
My personal opinion is that this is the measure of whether we have AGI or not. When LLM-in-a-loop converges, self-corrects, etc, then we're there.
It's likely all current agent code out there is just fine, and when you plug in a smart enough LLM it'll just work.
by arretevad on 3/31/24, 6:43 PM
by nunodonato on 3/31/24, 6:25 PM
by suchintan on 3/31/24, 5:53 PM
I've seen a lot of success come from AI sales agents, just doing basic SDR style work
We're having some success automating manual workflows for companies at Skyvern, but we've only begun to scratch the surface.
I suspect that this will play out a lot like the iPhone era -- first few years will be a lot of discovery and iteration, then things will kick into superdrive and you'll see major shifts in user behavior
by digitcatphd on 3/31/24, 6:53 PM
I actually don’t think we will need agents in the future, I think one model will be able to morph itself or just delegate copies of itself like MoE for actions.
It just seems extremely unlikely to me that foundation models won't get exponentially smarter over the next few years, or that they won't be able to do this.
by geor9e on 3/31/24, 9:47 PM
If most people's only experience with AI is the chat.openai.com interface, then yeah, I can see why it seems like too much hassle to most people. The trick is to figure out your long prompts ahead of time and hardcode each one into an HTTP request in something else (Tasker, BetterTouchTool, Alfred, Apple Shortcuts, etc.). For me, I have dozens of long prompts to do exactly what I want, assigned to wakewords, hotkeys, and trigger words on my mac/watch/phone. Another key thing is I use FAST models, i.e. Groq, not GPT-4. Latency makes AI too much hassle. For example:
1. Instant (<1 second end-to-end) answers, in just a few words, to voice questions spoken into my watch any time I hold the side button
2. Summarize long articles and YouTube videos before I decide to spend time on them
3. Add quick code snippets in plain English with hotkeys or voice
4. Get the main arguments for and against something just to frame it
... stuff like that. If it would make your life easier for an AI to save you 1 second per task, why not.
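A rough sketch of the pattern: one hardcoded prompt per script, bound to a hotkey or wakeword by an external tool (the prompt and model choice are illustrative):

```python
# Sketch: a tiny script a hotkey tool can invoke; Groq is used for the
# low latency the comment emphasizes.
import sys
from groq import Groq

client = Groq()
PROMPT = "Answer in ten words or fewer, no preamble: "  # hardcoded long prompt goes here

reply = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[{"role": "user", "content": PROMPT + " ".join(sys.argv[1:])}],
)
# The hotkey tool pipes this to a notification, clipboard, or speech.
print(reply.choices[0].message.content)
```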
by bsenftner on 3/31/24, 6:05 PM
First, we know these AIs are trained with data from the general Internet, and that data is vast.
Second, the general Internet contains owner manuals and support forums for practically every active product there is, globally. These are every possible product too: physical products, virtual products like software or music, and experience products like travel or education. Between the owner’s manuals and the support forums for these products there is extremely deep knowledge about the purpose, use and troubleshooting of these products.
Third, one cannot just ask an LLM direct, deep questions about some random product and expect a deep-knowledge answer. One has to first create the context within the LLM that activates the area(s) of deep knowledge from which you want your answers to arise. This requires long-form prompts that create the expert you want; once that expert is active in the LLM's context, you ask it questions and receive the deep-knowledge answers desired.
Fourth, one can create an LLM agent that helps a person create the LLM agent they want, the LLM agent can help generate new agents, and dependency chains between different agents are not difficult at all, including information exchange between groups of agents collaborating on shared replies to requests.
And last, all that deep information about using pretty much every piece of software there is can be tapped with careful prompting to create the context of an expert user of that software, and experts such as these can become plugins and drivers for that software. It's at our fingertips...!
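A minimal sketch of the "create the expert first" pattern the comment describes (the product and prompt text are invented for illustration):

```python
# Sketch: a long-form system prompt primes the expert context; questions
# are only asked once that context is active.
from openai import OpenAI

client = OpenAI()

expert_prompt = """You are a senior support engineer for the AcmeWidget X200
(a hypothetical product). You have internalized its owner's manual and the
common fixes discussed on its support forums. Diagnose step by step, naming
the relevant control or setting, and ask for the hardware revision whenever
it changes the answer."""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": expert_prompt},
        {"role": "user",
         "content": "The X200 powers on but the display stays blank. Where do I start?"},
    ],
)
print(response.choices[0].message.content)
```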
by jokethrowaway on 3/31/24, 6:18 PM
Sure, we can do that, but do users want that?
I don't want to chat, talk or interact with people, I want the most efficient ui possible for the task at hand. When I do chat with someone is because some businesses are crap at automating and I need a human to fix something. Even then I don't want a robot that can't do anything.
The only exception I can think of is tutoring but then I'd really question the validity of the answers. RAG is pretty cool in that regard because it can point at the original paragraph being used to answer the question.
That might be useful to someone but that's not my favourite way of learning.
Give me a summary of the content, give me the content, Ctrl+F and I'm good to go.
For low stakes things like gaming where the agent messing up would just be a fun bug, I think it can be great.
Looking forward to automatically generated side quests based on my actions, and NPCs who get pissed if I put a box on their head and hire mercenaries if I murder their families.
by mattew on 3/31/24, 5:45 PM
That said, I’m very bullish on agents overall though and expect that once they get their assistants behaving a bit more predictably we will see some cool stuff.
It’s really quite magical to see one of these think through how to solve a problem, use custom tools that you implement to help solve it, and come to a solution.
by furyofantares on 3/31/24, 6:12 PM
Trying to get more inference value per-prompt is a good thing. Starting by trying to get it to do long-chain tasks per-prompt makes no sense.
I'm a huge fan of LLMs for productivity, but even small tasks often require multiple prompts of build-up or fix-up. We should work toward getting those done in a single prompt more often, then work toward slightly larger tasks etc.
Plugins and GPTs are both attempts at getting more/better inference per prompt. There is some progress there, but it's pretty limited. There's also plenty of people building task-specific tools that get better results than someone using the chat interface, due to a lot of prompt work.
So there is incremental progress happening, but it's been fairly slow. The fact that it's this much work to get incrementally more inference value per prompt makes it very hard to imagine anyone closing the whole loop immediately with an agent.
by harrisoned on 3/31/24, 6:11 PM
I also have been experimenting with it to replace the intent-classifier part of Google's Dialogflow. We use it at work for our chatbot. Earlier we used Watson, and it was amazing, but it became very expensive. Dialogflow is cheap, but it is as inaccurate with complex natural language as it is cheap.
Mixtral (8x7B) has proved extremely accurate at identifying intents with consistent JSON output, given a short context, so I assume a simple 7B model would do the job. I still don't know if it is financially worth it, but it's something I'm going to try if I can't fix Dialogflow's intents. But in no way would the model's output directly interface with a client. That's asking for trouble.
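A sketch of what such an intent classifier might look like, assuming Mixtral is served behind an OpenAI-compatible endpoint (the endpoint, intent list, and prompt are illustrative):

```python
# Sketch: constrained intent detection with JSON-only output, validated
# before anything downstream (never a client) sees it.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local Mixtral
INTENTS = ["billing", "cancel_service", "technical_issue", "speak_to_human"]

def detect_intent(message: str) -> dict:
    system = (
        "Classify the message. Respond only with JSON like "
        '{"intent": "<label>", "confidence": <0.0-1.0>} '
        "where <label> is one of: " + ", ".join(INTENTS)
    )
    reply = client.chat.completions.create(
        model="mixtral-8x7b-instruct",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": message}],
        temperature=0,
    ).choices[0].message.content
    result = json.loads(reply)          # raises on malformed output
    assert result["intent"] in INTENTS  # reject invented labels
    return result

print(detect_intent("My internet has been down since this morning."))
```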
by anotherpaulg on 3/31/24, 7:23 PM
Instead, I've found the "pair programming chat" UX to be the most effective for AI coding. With aider, you collaborate with the AI on coding, asking for a sequence of bite sized changes to your code base. At each step, you can offer direction or corrections if you're unhappy with how the AI interpreted your request. If you need to, you can also just jump in and make certain code changes yourself within your normal IDE/editor. Aider will notice your edits and make sure the AI is aware as your chat continues.
by KennyBlanken on 3/31/24, 6:40 PM
NVIDIA estimates they'll ship up to 2M H100 GPUs. They have a TDP of about 300-400W each. Assume that because of their high cost, their utilization is very high. Assume cooling adds another two-thirds or so, call it 200W. Be generous and throw out all the overhead from the host computers, storage, power distribution, UPS systems, and networking.
2M * 600W = 1.2GW.
Let's say you only operated them during the daytime and wanted to do so from solar power. You'd need between ten and twenty square miles worth of solar panels to do so.
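Reproducing that back-of-envelope math (all inputs are the comment's own assumptions; roughly 15W of daytime output per square foot of panel is a common rule of thumb):

```python
# Sketch: the comment's power and land-area arithmetic, made explicit.
gpus = 2_000_000
watts_each = 400 + 200          # ~400W TDP plus ~200W assumed for cooling
total_gw = gpus * watts_each / 1e9
print(total_gw)                 # 1.2 GW

sqft_per_sq_mile = 27_878_400
panel_watts_per_sqft = 15       # rough daytime output per sq ft of panel
panel_sq_miles = total_gw * 1e9 / panel_watts_per_sqft / sqft_per_sq_mile
print(panel_sq_miles)           # ~2.9 sq mi of raw panel area; with row
                                # spacing and derating, 10-20 sq mi overall
```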
by bobosha on 4/1/24, 11:09 AM
by timacles on 3/31/24, 6:28 PM
Anything more complex just turns into an irritating back-and-forth game; when I finally arrive at the solution, I feel like I wasted my time not getting practical experience, but rather gaming a magic 8-ball into giving me what I wanted.
It just doesn't feel satisfying to me to use them anymore. I don't deny that they improve my productivity, but it's at the cost of enjoying what I do. I was never able to enter that feeling of zen flow while using LLMs regularly.
by dudus on 3/31/24, 6:02 PM
I feel like these architectures built on top of last gen LLMs are mostly useless now.
The current-gen jump was significant enough that a complex chain of thought with RAG on last-gen models is usually surpassed by zero-shot prompting on the current gen.
So instead of spending time and money building that, it's better to focus on zero-shot performance and keep your models updated to the latest version.
Feeding LLM outputs into other LLM inputs will, IMHO, just increase the bias. Initially I expected mixing and matching different models to avoid this, but that didn't work as well as I expected.
It depends a lot on your application honestly.
by wslh on 3/31/24, 5:49 PM
Another one is listening to many social media posts (e.g. on Twitter) to sense if there is a business opportunity. SDRs scan the results in a Slack channel manually, acting on these signals.
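A rough sketch of the signal-to-Slack step (the webhook URL, scoring prompt, and threshold are illustrative assumptions):

```python
# Sketch: score a post for opportunity signal, then surface strong hits
# in a Slack channel via an incoming webhook for SDRs to scan.
import requests
from openai import OpenAI

client = OpenAI()
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def score_post(text: str) -> float:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Rate from 0 to 10 how strongly this post signals "
                              f"a sales opportunity. Reply with the number only.\n\n{text}"}],
        temperature=0,
    ).choices[0].message.content
    return float(reply)

post = "Anyone know a good tool for monitoring competitor pricing?"
if score_post(post) >= 7:
    requests.post(SLACK_WEBHOOK, json={"text": f"Possible opportunity: {post}"})
```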
Finally, this is not a full workflow, but we built this [1] as one piece of our work.
by precompute on 3/31/24, 7:13 PM
by hhh on 3/31/24, 5:47 PM
by dave333 on 3/31/24, 6:51 PM
by neural_thing on 3/31/24, 7:04 PM
by nijfranck on 3/31/24, 8:02 PM
by warthog on 3/31/24, 6:51 PM
We use agents in a workflow to do this in bulk. The problem is it takes a long time, but at least it saves time at the end of the day and saves you from manually visiting a list of 100 different domains to find a piece of information.
by paradite on 3/31/24, 5:57 PM
Specifically on using AI for coding: I wrote about different levels of AI coding, from L1 to L5. We are still at the L2/L3 stage for mature, production-ready tech; agents are L4/L5:
by stavros on 3/31/24, 6:02 PM
All that's left is for someone to bundle it all up into a nice package, and we'll be in the future.
by erru on 4/1/24, 9:13 AM
by vood on 3/31/24, 6:08 PM
These agents aren't super smart: just a few PDFs for context plus a few-sentence system prompt.
I do get what I want in 80% of use cases (not measured, just a feeling).
by ed on 3/31/24, 6:10 PM
We’re definitely in the “wait” phase of the wait calculation. Everyone is expecting GPT5/q* to change things but really we won’t know until we see it.
by aubanel on 3/31/24, 7:54 PM
That said, I believe the current best models are still not good enough - but let's wait a few months.
by Its_Padar on 3/31/24, 5:57 PM
by nothacking_ on 3/31/24, 6:38 PM
LLMs are perfect for this: super flashy, with a ton of hype. In reality, LLMs are really bad at most applications; they are a solution in search of a problem.
by tomrod on 3/31/24, 6:03 PM
by manojlds on 3/31/24, 5:45 PM
by psalmadek on 3/31/24, 5:55 PM
by babelfish on 3/31/24, 5:51 PM
by moomoo11 on 3/31/24, 6:45 PM
by nonrandomstring on 3/31/24, 6:31 PM
It doesn't matter that you think it's the coolest and most amazing technology in history. It may be. So what?
It doesn't matter that experts from every part of industry are yelling that "this is the future", that the march of this tech is "inevitable". They need to believe that, for their own reasons.
It doesn't matter that academics from Yale, Harvard and MIT are publishing a dozen new papers on it every week. For the most part, their horizon ends at the campus gate.
It doesn't matter that investors are clamouring to give you money and inviting you to soirees to woo you because your project has the latest buzzwords in the name. Investors have to invest in something.
And it doesn't matter if market research people are telling you that the latent demand and growth opportunity is huge. People tell them what they want to hear.
The real test - and I wish I had known this when I was twenty - is do ordinary people on the London Omnibus want it? Not my inner ego projection. Not my wishful thinking. Not what "the numbers" say. Go and ask them.
My experience right now, from asking people (for a show I make), is that people are shit scared of AI, and if they don't hold a visceral distaste for it, they've an ambivalence that's about as stable as nitro-glycerine on a hot day. I know that may be a difficult thing to hear as a business person.
If you are harbouring in your heart any remnant of the idea that you can create demand, that they will "see the light" and once they have a taste will be back for more, or that by will and power they can be made, regulated and peer pressured into accepting your "vision", then you'd be wise to gently let go of those thoughts.
by salomon812 on 3/31/24, 5:53 PM
by chancemehmu on 3/31/24, 6:12 PM
by toomanyrichies on 3/31/24, 6:41 PM
If I had asked StackOverflow the same question, it would have been quickly closed as being not broadly applicable enough (since this `sed` command is quite specific to its use case). After ChatGPT broke the code apart for me, I was able to ask StackOverflow a series of more discrete, more broadly-applicable questions and get a human answer.
TL;DR- I quite like ChatGPT as a search engine when "you don't know what you don't know", and getting unblocked means being pointed in the right direction.
1. https://www.richie.codes/shell
2. https://github.com/rbenv/rbenv/blob/e8b7a27ee67a5751b899215b...
by iAkashPaul on 3/31/24, 6:08 PM
by swagatkonchada on 3/31/24, 6:53 PM
by causalmodels on 3/31/24, 7:29 PM
The first issue is that agents have extremely high failure rates. Agents really don't have the capacity to learn from either success or failure, since their internal state is fixed after training. If you ask an agent to repeatedly do some task, it has a chance of failing every single time. We have been able to largely mitigate this by modeling agentic software as a state machine. At every step we have the model choose the inputs to the state machine and then we record them. We then 'compile' the resulting state-transition table down into a program that we can execute deterministically. This isn't totally foolproof, since the world state can change between program runs, so we have methods that allow the LLM to make slight modifications to the program as needed. The idea here is that agents should never have to solve the same problem twice. The cool thing about this approach is that smarter models make the entire system work better. If you have a particularly complex task, you can call out to gpt-4-turbo or claude-3-opus to map out the correct action sequence and then fall back to less complex models like Mistral 7B.
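A minimal sketch of the record-then-compile idea (the production system is surely more involved; the names here are invented for illustration):

```python
# Sketch: record the transitions an LLM chooses for a task, then replay
# them deterministically so the same problem is never solved twice.
import json

class AgentRecorder:
    def __init__(self):
        self.transitions = []  # the state-transition table

    def step(self, state: str, llm_action: dict):
        # Called each time the model picks the next input to the machine.
        self.transitions.append({"state": state, "action": llm_action})

    def compile(self, path: str):
        # "Compile" the table into a replayable program (here, a trace file).
        with open(path, "w") as f:
            json.dump(self.transitions, f, indent=2)

def replay(path: str, execute):
    # Happy path: no LLM calls at all. If the world state has drifted and
    # a step fails, fall back to the LLM to patch that transition and
    # re-compile.
    for t in json.load(open(path)):
        execute(t["state"], t["action"])
```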
The second issue is that almost all software is designed for people, not LLMs. What is intuitive for human users may not be intuitive for non-human users. We're focused on making agents reliably interact with the internet, so I'll use web pages as an example. Web pages contain tons of visually encoded information in things like the layout hierarchy, images, etc. But most LLMs rely on purely text inputs. You can try exposing the underlying HTML or the DOM to the model, but this doesn't work so well in practice. We get around this by treating LLMs as if they were visually impaired users. We give them a purely text interface by using ARIA trees. This interface is much more compact than either the DOM or HTML, so responses come back faster and cost way less.
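A sketch of that "visually impaired user" interface, using Playwright's accessibility snapshot as a stand-in for an ARIA tree (the production pipeline presumably differs):

```python
# Sketch: render a page's accessibility tree as a compact text outline,
# which is what the LLM sees instead of raw HTML or the DOM.
from playwright.sync_api import sync_playwright

def aria_outline(node, depth=0, lines=None):
    lines = [] if lines is None else lines
    lines.append("  " * depth + f'{node["role"]} "{node.get("name", "")}"')
    for child in node.get("children", []):
        aria_outline(child, depth + 1, lines)
    return lines

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")
    tree = page.accessibility.snapshot()
    print("\n".join(aria_outline(tree)))
```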
The third issue I see with people building agents is they go after the wrong class of problem. I meet a lot of people who want to use agents for big ticket items such as planning an entire trip + doing all the booking. The cost of a trip can run into the thousands of dollars and be a nightmare to undo if something goes wrong. You really don't want to throw agents at this kind of problem, at least not yet, because the downside to failure is so high. Users generally want expensive things to be done well and agents can't do that yet.
However, there are a ton of things I would like someone to do for me that would cost less than five dollars of someone's time, where the stakes for things going wrong are low. My go-to example is making reservations. I really don't want to spend the time sorting through the hundreds of nearby restaurants. I just want to give something the general parameters of what I'm looking for and have reservations show up in my inbox. These are the kinds of tasks that agents are going to accelerate.
[1] https://github.com/hdresearch/hdr-browser [2] https://hdr.is
by kwinkunks on 3/31/24, 5:56 PM
- The pythonrepl or llm-math agent not being used when it should be and the agent returning a wrong or approximate answer.
- The wikipedia and web-browsing agents doing spurious research in an attempt to answer a question I did not ask (essentially hallucinating a question).
- Agents getting stuck in a loop of asking the same question over and over until they time out.
- The model not believing an answer it gets from an agent (eg using a Python function to get today's date and not believing the answer because "The date is in the future").
When you layer all this on top of the usual challenges of writing prompts (plus, with Python functions, writing the docstring so the agent knows when to call them), wrong answers, hallucination, etc., etc., I'm unconvinced. But maybe I'm doing it wrong!
by jhawleypeters on 3/31/24, 6:17 PM
In my experience, you need to keep a human in the loop. This implies that you can't get the technology to scale, but I'm optimistic because LLMs have rapidly gotten better at following directions while I've been using them over the last six months.
Summarization is probably the clearest strength of LLMs over a human. With ever-growing context windows, summarizing books in one shot becomes feasible. Most books can be summarized in one sentence, though the most useful, information-dense ones cannot.
I had Gemini 1.5 Pro summarize an old book titled Natural Hormonal Enhancement yesterday. Having just read the book myself, I found the result acceptable.
https://hawleypeters.com/summary-of-natural-hormonal-enhance...
For information-dense books, it seems clear to me that chatting with the book is the way to go. I think there's promise to build a competent agent for this kind of use case. Imagine gathering 15 papers and then chatting about their contents with an agent with queries like:
What's the consensus? Where do these papers diverge in their conclusions? Please translate this passage into plain English.
I haven't done this myself, but I have a hard time imagining such an agent being useless. Perhaps this is a failure of imagination on my part.
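For what it's worth, a minimal sketch of such a "chat with the papers" setup using embeddings (model names, chunking, and scoring are illustrative assumptions):

```python
# Sketch: embed paper chunks once, then answer questions from only the
# most relevant excerpts.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    out = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in out.data])

# Stand-ins for paragraphs extracted from the 15 papers:
chunks = ["Paper A finds X improves Y under condition Z...",
          "Paper B reports no effect of X on Y in a larger sample..."]
chunk_vecs = embed(chunks)

def ask(question: str) -> str:
    q = embed([question])[0]
    scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(chunks[i] for i in np.argsort(scores)[-8:])
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Using only these excerpts:\n{context}\n\nQuestion: {question}"}],
    ).choices[0].message.content

print(ask("Where do these papers diverge in their conclusions?"))
```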
The brightest spot in my experimentation is [Cursor](https://cursor.sh). It's good for little dev tasks like refactoring a small block of code and chatting about how to use vim. I imagine it'd be able to talk about how to set up various configs, particularly if you @ the documentation, a feature that it supports, including [adding documentation](https://docs.cursor.sh/features/custom-docs).
Edit: I think a lot of disappointment comes from these kinds of tools not being AGI, or a replacement for a human that does some repetitive task. They magnify the power of somebody that's already curious and driven. They still empower lazy, disengaged users, but with goals like doing the bare minimum, and avoiding work altogether, these tools cannot help one accomplish much of use.
by ruined on 3/31/24, 5:56 PM
by Zetobal on 3/31/24, 6:08 PM
by tracer4201 on 3/31/24, 5:51 PM
It’s actually quite awful. It’s obvious the text is LLM generated because of the verbose, generic writing style. It communicates clearly but without substance. Not gonna lie, I secretly judge these people.
by jaxomlotus on 3/31/24, 6:07 PM