from Hacker News

Show HN: I wrote an open-source browser alternative for Computer Use for any LLM

by gregpr07 on 11/5/24, 3:51 PM with 72 comments

Hey HN,

I made Browser-Use, an open-source tool that lets any LangChain-supported LLM execute tasks directly in the browser using just function calling.

It allows you to build agents that interact with web elements using natural language prompts. We created a layer that simplifies website interaction for LLMs by extracting XPaths and interactive elements like buttons and input fields (and other fancy things). This enables you to design custom web automation and scraping functions without manual inspection through DevTools.

Hasn't this been done many times? Good question. As a general SaaS tool, yes, but I think a lot of people are going to try to build their own web automation agents from scratch, so the idea is to provide the groundwork (a library) for the hard parts so that not everyone has to repeat these steps:

- parse HTML in an LLM-friendly way (clickable items + screenshots)

- provide nice function calls for everything inside the browser

- create reusable agent classes
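
To make the first step concrete, here is a minimal sketch of what "LLM-friendly" parsing could look like: pull out only the interactive elements and give each one a numeric index the model can refer to. This is not Browser-Use's actual code. The names `InteractiveElementParser`, `describe_page`, `INTERACTIVE_TAGS`, and the sample page are all made up for illustration, and it only uses Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumption, not Browser-Use's real list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Interactive tags that never have a closing tag, so they cannot contain text.
VOID_TAGS = {"input"}


class InteractiveElementParser(HTMLParser):
    """Collects interactive elements in document order and indexes them."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per interactive element
        self._open = []     # stack of elements still collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        element = {
            "index": len(self.elements),
            "tag": tag,
            "attrs": dict(attrs),
            "text": "",
        }
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_endtag(self, tag):
        # Stop collecting text once the innermost open element closes.
        if self._open and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text outside any interactive element is ignored.
        if self._open:
            self._open[-1]["text"] += data


def describe_page(html):
    """Return one line per interactive element, e.g. '[1] <button> Search'."""
    parser = InteractiveElementParser()
    parser.feed(html)
    parser.close()
    lines = []
    for el in parser.elements:
        # Prefer visible text, then fall back to placeholder or name.
        label = (
            " ".join(el["text"].split())
            or el["attrs"].get("placeholder")
            or el["attrs"].get("name")
            or ""
        )
        lines.append(f'[{el["index"]}] <{el["tag"]}> {label}'.rstrip())
    return "\n".join(lines)


# Hypothetical sample page, for illustration only.
SAMPLE = """
<form>
  <input name="q" placeholder="Search flights">
  <button type="submit">Search</button>
</form>
<a href="/deals">Cheap deals</a>
"""

print(describe_page(SAMPLE))
```

On the sample page this prints three lines, `[0] <input> Search flights`, `[1] <button> Search`, and `[2] <a> Cheap deals`. A model could then answer with something like "click 1", and the agent would map that index back to the real element. A real implementation would also have to deal with visibility, iframes, and dynamically rendered content, which this sketch ignores.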

What is this NOT? An all-knowing AI agent that can solve all your problems.

The vision: create repeatable tasks on the web just by prompting your agent, without caring about the how.

To better showcase the power of text extraction we made a few demos such as:

- Applying for multiple software engineering jobs in San Francisco

- Opening new tabs to search for images of Albert Einstein, Oprah Winfrey, and Steve Jobs

- Finding the cheapest one-way flight from London to Kyrgyzstan for December 25th

I’d be interested in feedback on how this tool fits into your automation workflows. Try it out and let me know how it performs on your end.

We are Gregor & Magnus and we built this in 5 days.

  • by firejake308 on 11/5/24, 5:31 PM

    Is it decided then that screenshots are better input for LLMs than HTML, or is that still an active area of investigation? I see that y'all elected for a mostly screenshot-based approach here, wondering if that was based on evidence or just a working theory.
  • by theredsix on 11/5/24, 6:24 PM

    Awesome project, starred! Here are some other projects for agentic browser interactions:

    * Cerebellum (Typescript): https://github.com/theredsix/cerebellum

    * Skyvern: https://github.com/Skyvern-AI/skyvern

    Disclaimer: I am the author of Cerebellum

  • by gitgud on 11/6/24, 3:07 AM

    It's impressive, but to me it seems like the saddest development experience...

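        import asyncio
        from browser_use import Agent
        from langchain_openai import ChatOpenAI

        async def main():
            agent = Agent(
                task='Go to hackernews on show hn and give me top 10 post titles, their points and hours. Calculate for each the ratio of points per hour.',
                llm=ChatOpenAI(model='gpt-4o'),
            )
            await agent.run()

        asyncio.run(main())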
    
    Passing prompts to an LLM agent... waiting for the black box to run and do something...
  • by maggreenWAI on 11/5/24, 8:58 PM

    Let's say in 1 year, more agents than humans interact with the web.

    Do you think (1) websites will release more API functions for agents to interact with, or (2) tools like this will transform the UI into functions callable by agents, perhaps even caching all the inferred functions for websites in a third-party service?

  • by G_o_D on 11/6/24, 12:27 AM

    This is called screen scraping: text rendered on the screen is scraped, whether in a browser, on Windows, or on an Android screen. That's how software like AutoHotkey does automation. A Windows or Android screen can be dumped into hierarchical XML, along with the x/y coordinates of its UI elements and the text they contain, which can then be used to click, scroll, and scrape text.
  • by bravura on 11/5/24, 5:25 PM

    It would be amazing if:

    a) There were a test / eval suite to determine which model works best for what. It could be divided into a training suite and test suite. (Training tasks can be used for training, test tasks only for evaluation.) Possibly a combination of unit tests against known xpaths, and integration tests that are multi-step and end in a measurable result. I know the web is constantly changing, so I'm not 100% sure how this should work.

    b) There were some sort of wiki, or perhaps another repo or discussion board, of community-generated prompt recipes for particular actions.

  • by Oras on 11/5/24, 6:01 PM

    This looks interesting. I am really impressed with MultiOn [0], and I tried to make something similar, but it's quite challenging doing it with a Chrome extension.

    I also saw one doing Captcha solving with Selenium [1].

    I will keep an eye on your development, good luck!

    [0] https://www.multion.ai/

    [1] https://github.com/VRSEN/agency-swarm

  • by soham123 on 11/5/24, 6:13 PM

    I have built something similar at https://github.com/ComposioHQ/composio/tree/master/python/co...

    Compatible with any LLM and agentic framework.

  • by rahimnathwani on 11/5/24, 5:44 PM

    In case anyone else was looking for the functions available to the LLM: https://github.com/gregpr07/browser-use/blob/68a3227c8bc97fe...
  • by coreyp_1 on 11/5/24, 5:17 PM

    This looks really interesting. The first hurdle, though, that prevents me from experimenting with this on my job is the lack of a license.

    I see in the readme that it claims that it is MIT licensed, but there is no actual license file or information in any of the source files that I could find.

  • by daft_pink on 11/5/24, 6:04 PM

    I was really excited about the original Claude Computer Use until I watched the YouTube videos and saw it was only running in a Docker container. I wish I could run something like this on a real machine.
  • by KaoruAoiShiho on 11/5/24, 11:53 PM

    Maybe you could build a database of which sites/pages work best with HTML vs. screenshots, and then choose HTML where possible to save on token cost and improve latency.
  • by fragmede on 11/5/24, 8:17 PM

    I want this to have cron, so I can ask it to check with my local parking agency every day or every 12 hours whether I have a parking ticket, and raise a warning if I do. Or check with the county jail to see whether someone is still there. Or check the price of a product on Amazon every hour and warn me when it changes (aka camelcamelcamel, but local). Or search Craigslist/Zillow/Facebook Marketplace for items until one shows up. Etc.
  • by DeathArrow on 11/6/24, 8:32 AM

    >This enables you to design custom web automation and scraping functions without manual inspection through DevTools

    Can it use a headless browser?

  • by WillAdams on 11/5/24, 5:17 PM

    Does it work with COM objects/Java applications?

    I'd give my interest in Hell for a way to have a script plug data into a Java app.

  • by ReD_CoDE on 11/8/24, 1:08 PM

    Many web developers use Playwright and Puppeteer, so why Selenium?