from Hacker News

Act-1: Transformer for Actions

by thesephist on 9/14/22, 8:24 PM with 77 comments

  • by visarga on 9/14/22, 10:56 PM

    Related - GPT-3 with a Python interpreter can solve many tasks. It is also a language model + a computer, but on a different level.

    https://mobile.twitter.com/sergeykarayev/status/156937788144...
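    A minimal sketch of the language-model-plus-interpreter pattern the linked tweet describes: the model writes code, the host executes it and returns the result. The `ask_llm` function here is a hard-coded stand-in for a real GPT-3 API call, not anyone's actual implementation.

    ```python
    # Sketch of the "LLM + Python interpreter" loop. `ask_llm` is a
    # stub standing in for a real GPT-3 call; a real system would send
    # the task as a prompt and get generated code back.

    def ask_llm(task: str) -> str:
        # Hard-coded "model output" for illustration only.
        return "result = sum(n * n for n in range(1, 11))"

    def solve(task: str):
        code = ask_llm(task)
        namespace = {}
        exec(code, namespace)   # run the model-written code
        return namespace["result"]

    print(solve("sum of squares of 1..10"))  # 385
    ```

    The interesting part is the division of labor: the model only has to produce correct code, and the interpreter does the arithmetic it is bad at.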

  • by frenchie4111 on 9/15/22, 12:21 AM

    One thing I have noticed with heavy use of Copilot/DALL-E is that it's great at getting you most of the way there. But one big thing it's not great at is repeatability. If I'm relying on something like ACT-1 to do data entry in Salesforce, I need it to do the same thing every time, even if the context is slightly different or I phrase the request slightly differently. How well will it be able to do that?

    Also this is very very cool, I love copilot, I hope I get to use this thing very soon.

  • by tasdfqwer0897 on 9/14/22, 8:34 PM

    Hey, I helped make this! Happy to answer any questions.
  • by leetrout on 9/15/22, 3:01 AM

    So many folks standardizing on Swagger/OpenAPI opens the door to training on structured API definitions... this never occurred to me before.
  • by mrits on 9/14/22, 9:37 PM

    Natural language interfaces are very limited and certainly not the next generation of computing. Granularity of functionality and composable input will always be more efficient as long as the original source is a human. I think the natural language part of your product is the least interesting and certainly not the most impressive.
  • by codekansas on 9/14/22, 9:08 PM

    This is incredible :) Will you be releasing more information about how the system was designed / how data was collected / how actions are executed?
  • by joaquincabezas on 9/15/22, 8:26 AM

    I remember a long time ago, reading about Semantic Web and intelligent agents and dreaming of a Natural Language interface for planning journeys…

    “I want to travel from Seville to Berlin next October, avoiding weekends, for a two or three nights stay in a hotel by the river. Direct flights preferred.”

  • by blind666 on 9/14/22, 11:01 PM

    If this scales up, it can be thought of as "actionable Google search", and if taken to the extreme, it has the potential to make the internet query-able, for better or worse.
  • by colemannugent on 9/14/22, 10:00 PM

    So here's the main problem I see with this:

    >Anyone who can articulate their ideas in language can implement them

    I'd be shocked if even 10% of the users who can't navigate a GUI could accurately describe what they want the software to do. To the user who doesn't know they can use Ctrl-Z to undo, the first half dozen times the AI mangles their inherited spreadsheet might be enough to put them off the idea.

  • by holoduke on 9/14/22, 9:14 PM

    How does the AI alter its model during a process? I thought the weights are pretrained and not altered once used in a real-life app.
  • by anigbrowl on 9/14/22, 10:15 PM

      - OK here's my email
      - Please select all pictures of taxis to prove you are not a robot
      ಥ_ಥ
    
    Seriously though, the potential is good. I see several things they're doing right that have the potential to distinguish them from competing offerings.
  • by bluecoconut on 9/14/22, 10:25 PM

    Wow! Love it, this is the most exciting thing I've seen in a while. I'm working on something similar, and it's so great to see others who seem to get it and are chasing generalization in AI systems!

    A few questions:

    1. I'm curious if you're representing the task-operations using RL techniques (as many personal assistant systems seem to be) or if this is entirely a seq2seq transformer style model for predicting actions?

    2. Assumption: due to the scaling limits of transformers, I assume that this is not working directly on the image data of a screen, and is instead working off of DOM trees; (2a) is this the case? and (2b) if so, are you using a purely linear tokenization of the tree, or something closer to Evoformer (AlphaFold style) to combine graph neural nets and transformers?

    3. Have you noticed that learning actions and representations of one application transfers well to new applications? or is the quality of the model heavily dependent on app domain?

    I noticed multiple references to data applications (Excel, tableau, etc.). My challenge is that large language models and AI systems in general are about to hit a wall in the data domain because they fundamentally don't understand data [1] [2], which will ultimately limit the quality of these capabilities.

    I am personally tackling this problem directly. I'm trying to provide more coherent data-aware operations in these systems by building a "foundation model" for tabular data that connects to LLMs (think RETRO-style lookups of embeddings representing columns of data). I have been prototyping conversational AI systems (mostly Q/A oriented), and have recently been moving towards task-oriented operations (right now, transparently, just SQL executors).

    There seem to be good representations of DOM trees/visual object models that you all are working with to take reasonable actions; however, I assume these are limited in scale (N^2 and all), so I am wondering if you have any opinions on how to extend these systems for data, especially as the windowed context grows (e.g. an Excel sheet with 100k+ rows)?

    [1] https://arxiv.org/abs/2106.03253 "Tabular Data: Deep Learning is Not All You Need"

    [2] https://arxiv.org/abs/2110.01889 "In summary, we think that a fundamental reorientation of the domain may be necessary. For now, the question of whether the use of current deep learning techniques is beneficial for tabular data can generally be answered in the negative"
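    For question 2b above, here is one way the "purely linear tokenization" option could look: a depth-first walk that flattens a tiny DOM tree into a token sequence. This is purely illustrative; Adept has not said how ACT-1 actually encodes pages, and the token format is invented.

    ```python
    # Illustrative depth-first linearization of a toy DOM tree into a
    # flat token sequence (the "purely linear tokenization" option).
    # Nodes are (tag, attrs, children) tuples; strings are text nodes.

    def linearize(node):
        tag, attrs, children = node
        tokens = [f"<{tag}>"] + [f"{k}={v}" for k, v in attrs.items()]
        for child in children:
            if isinstance(child, str):
                tokens.append(child)          # text node
            else:
                tokens.extend(linearize(child))
        tokens.append(f"</{tag}>")
        return tokens

    dom = ("form", {"id": "login"}, [
        ("input", {"name": "email"}, []),
        ("button", {}, ["Sign in"]),
    ])
    print(linearize(dom))
    ```

    The N^2 concern mentioned above applies directly here: a page (or a 100k-row sheet) linearized this way quickly exceeds the attention window, which is why graph-structured alternatives are worth asking about.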

  • by atemerev on 9/15/22, 8:03 AM

    Many people worry that these things will take over our jobs. Worry not! Imagine how much work will be needed to fix things when these models screw up, and how much we will charge per hour.
  • by skybrian on 9/14/22, 10:38 PM

    Wow, what a great way to make a mess online! I can see spammers using it, but who's going to trust this with access to any accounts they care about?
  • by rajnathani on 9/17/22, 6:08 AM

    The founders of this company are the main authors of the Transformer architecture: https://techcrunch.com/2022/04/26/2304039/
  • by d--b on 9/15/22, 6:56 AM

    “Open the pod bay doors, Act-1”
  • by lee101 on 9/15/22, 12:30 AM

    Awesome. I wonder if the app is recording what we do so it can replicate it; if not, maybe it should have a training mode where we tell it what we're doing and then do it, so it can learn.

    I feel like some of this could one day be built using a shared model that understands HTML, JavaScript, etc. with a few example prompts. Or maybe something that understands intent plus a browser-automation language like Selenium; if not, then some custom input/output language plus training, as Adept alludes to.
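    The "intent plus a browser-automation language" idea could be sketched as a tiny compiler from high-level actions to Selenium-style commands. The action vocabulary and the emitted command strings here are made up for illustration; a real system would have a model produce the action list from a natural-language request.

    ```python
    # Toy compiler from high-level "intent" actions to Selenium-style
    # command strings. The action schema and output format are invented
    # for illustration, not taken from any real product.

    def compile_intent(actions):
        commands = []
        for action in actions:
            kind = action["do"]
            if kind == "click":
                commands.append(
                    f'driver.find_element(By.CSS_SELECTOR, "{action["target"]}").click()'
                )
            elif kind == "type":
                commands.append(
                    f'driver.find_element(By.CSS_SELECTOR, "{action["target"]}").send_keys("{action["text"]}")'
                )
            else:
                raise ValueError(f"unknown action: {kind}")
        return commands

    plan = [
        {"do": "type", "target": "#search", "text": "flights to Berlin"},
        {"do": "click", "target": "button.submit"},
    ]
    for line in compile_intent(plan):
        print(line)
    ```

    Emitting an explicit intermediate language like this would also give you the repeatability that raw end-to-end generation lacks, since the same plan replays identically.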

    If you're interested in building something like this, also check out https://text-generator.io, which already pulls down links and images to analyse in order to generate better text, so it has a lot of the required parts.

  • by FeepingCreature on 9/15/22, 1:13 AM

    Act-1, please fix the hideous low contrast on the adept.ai website.
  • by i_am_toaster on 9/14/22, 10:14 PM

    I look forward to seeing the progress made on this in the future, but at this time I don’t see any potential in this product.
  • by midislack on 9/14/22, 11:49 PM

    We gotta stop this lazy "X for Y" marketing crap. Seriously, if your product is just "X for Y" it doesn't even sound like a good pitch.