by gregpr07 on 11/5/24, 3:51 PM with 72 comments
I made Browser-Use, an open-source tool that lets LLMs (any model supported by LangChain) execute tasks directly in the browser using only function calling.
It allows you to build agents that interact with web elements using natural language prompts. We created a layer that simplifies website interaction for LLMs by extracting XPaths and interactive elements like buttons and input fields (and other fancy things). This enables you to design custom web automation and scraping functions without manually inspecting pages in DevTools.
Hasn't this been done a lot of times? Good question. As a general SaaS tool, yes, but I think a lot of people are going to try to build their own web automation agents from scratch. The idea is to provide the groundwork as a library for the hard part, so that not everyone has to repeat these steps:
- parse HTML in an LLM-friendly way (clickable items + screenshots)
- provide nice function calls for everything inside the browser
- create reusable agent classes
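To make the first bullet concrete, here is a minimal sketch of the general idea, not Browser-Use's actual implementation. It uses only Python's standard library to list interactive elements with an index and a simple positional XPath that an LLM could refer to. The names `InteractiveElementExtractor` and `describe_for_llm`, the set of tags treated as interactive, and the output format are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "interactive" for this sketch.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# HTML void elements never get a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class InteractiveElementExtractor(HTMLParser):
    """Collects interactive elements together with a positional XPath."""

    def __init__(self):
        super().__init__()
        self.path = []            # stack of (tag, position) for open elements
        self.child_counts = [{}]  # per-depth counters of sibling tags seen so far
        self.elements = []        # collected interactive elements
        self._collecting = []     # stack of (depth, element index) gathering text

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        steps = self.path + [(tag, counts[tag])]
        xpath = "/" + "/".join(f"{t}[{i}]" for t, i in steps)
        is_void = tag in VOID_TAGS
        if tag in INTERACTIVE_TAGS:
            self.elements.append(
                {"tag": tag, "xpath": xpath, "attrs": dict(attrs), "text": ""}
            )
            if not is_void:
                self._collecting.append((len(self.path), len(self.elements) - 1))
        if not is_void:
            self.path.append((tag, counts[tag]))
            self.child_counts.append({})

    def handle_endtag(self, tag):
        # Ignore void or mismatched closing tags; a real parser needs recovery logic.
        if tag in VOID_TAGS or not self.path or self.path[-1][0] != tag:
            return
        self.path.pop()
        self.child_counts.pop()
        if self._collecting and self._collecting[-1][0] == len(self.path):
            self._collecting.pop()

    def handle_data(self, data):
        # Attach visible text to the innermost open interactive element.
        if self._collecting:
            element = self.elements[self._collecting[-1][1]]
            element["text"] = (element["text"] + " " + data.strip()).strip()


def describe_for_llm(html):
    """Return one indexed line per interactive element."""
    parser = InteractiveElementExtractor()
    parser.feed(html)
    lines = []
    for index, el in enumerate(parser.elements):
        label = el["text"] or el["attrs"].get("placeholder") or el["attrs"].get("name") or ""
        lines.append(f"[{index}] <{el['tag']}> {label!r} xpath={el['xpath']}")
    return "\n".join(lines)


if __name__ == "__main__":
    page = (
        "<html><body><form><input name='q' placeholder='Search'>"
        "<button>Go</button></form><a href='/about'>About</a></body></html>"
    )
    print(describe_for_llm(page))
```

The indexed lines give the model something short to point at ("click element 1") while the XPath stays available for the code that performs the action. A real page also needs visibility checks, iframes, and dynamically rendered content, none of which this sketch handles.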
What is this NOT? An all-knowing AI agent that can solve all your problems.
The vision: create repeatable tasks on the web just by prompting your agent and not care about the hows.
To better showcase the power of text extraction we made a few demos such as:
- Applying for multiple software engineering jobs in San Francisco
- Opening new tabs to search for images of Albert Einstein, Oprah Winfrey, and Steve Jobs
- Finding the cheapest one-way flight from London to Kyrgyzstan for December 25th
I’d be interested in feedback on how this tool fits into your automation workflows. Try it out and let me know how it performs on your end.
We are Gregor & Magnus and we built this in 5 days.
by firejake308 on 11/5/24, 5:31 PM
by theredsix on 11/5/24, 6:24 PM
* Cerebellum (Typescript): https://github.com/theredsix/cerebellum
* Skyvern: https://github.com/Skyvern-AI/skyvern
Disclaimer: I am the author of Cerebellum
by gitgud on 11/6/24, 3:07 AM
agent = Agent(
task='Go to hackernews on show hn and give me top 10 post titels, their points and hours. Calculate for each the ratio of points per hour.',
llm=ChatOpenAI(model='gpt-4o'),
)
await agent.run()
Passing prompts to an LLM agent... waiting for the black box to run and do something...
by maggreenWAI on 11/5/24, 8:58 PM
Do you think (1) websites will release more API functions for agents to interact with, or (2) tools like this will transform the UI into functions callable by agents, perhaps even caching all inferred functions for websites in a third-party service?
by G_o_D on 11/6/24, 12:27 AM
by bravura on 11/5/24, 5:25 PM
a) There were a test / eval suite to determine which model works best for what. It could be divided into a training suite and test suite. (Training tasks can be used for training, test tasks only for evaluation.) Possibly a combination of unit tests against known xpaths, and integration tests that are multi-step and end in a measurable result. I know the web is constantly changing, so I'm not 100% sure how this should work.
b) There were some sort of wiki, or perhaps another repo or discussion board, of community-generated prompt recipes for particular actions.
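The train/test split described in (a) could be sketched roughly as follows. Everything here is hypothetical: the `Task` shape, `split_tasks`, `success_rate`, and the `run_agent` callable stand in for whatever harness a real suite would use, and exact string matching is only one possible way to define a "measurable result".

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    prompt: str    # natural language instruction given to the agent
    expected: str  # the measurable end result the task should produce


def split_tasks(
    tasks: Sequence[Task], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[Task], List[Task]]:
    """Deterministically split tasks into (train, test) sets."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def success_rate(run_agent: Callable[[str], str], tasks: Sequence[Task]) -> float:
    """Fraction of tasks whose agent output exactly matches the expected result."""
    if not tasks:
        return 0.0
    passed = sum(1 for task in tasks if run_agent(task.prompt).strip() == task.expected)
    return passed / len(tasks)
```

Held-out test tasks would only ever be passed to `success_rate`, never used for prompt tuning or training, which keeps the comparison between models honest.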
by Oras on 11/5/24, 6:01 PM
I also saw one doing Captcha solving with Selenium [1].
I will keep an eye on your development, good luck!
[0] https://www.multion.ai/
[1] https://github.com/VRSEN/agency-swarm
by soham123 on 11/5/24, 6:13 PM
Compatible with any LLM and agentic framework
by rahimnathwani on 11/5/24, 5:44 PM
by coreyp_1 on 11/5/24, 5:17 PM
I see in the readme that it claims that it is MIT licensed, but there is no actual license file or information in any of the source files that I could find.
by daft_pink on 11/5/24, 6:04 PM
by KaoruAoiShiho on 11/5/24, 11:53 PM
by fragmede on 11/5/24, 8:17 PM
by DeathArrow on 11/6/24, 8:32 AM
Can it use a headless browser?
by WillAdams on 11/5/24, 5:17 PM
I'd give my interest in Hell for a way to have a script plug data into a Java app.
by ReD_CoDE on 11/8/24, 1:08 PM