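Agent workflows can be roughly divided into two types: fixed, static workflows, and dynamic workflows in which the agent decides its own steps.
Both of the problems above can in fact be addressed with self-reflection from past experience. The question then becomes how to obtain past experience, how to distill it into reusable lessons, and how to apply those lessons in new reasoning. This chapter introduces three approaches in which the model explores, learns, and summarizes experience on its own: AppAgent, Trial and Error (STE), and AutoGuide.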
- AppAgent: Multimodal Agents as Smartphone Users
- https://github.com/mnotgod96/AppAgent
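AppAgent is an agent from Tencent's lab that interacts autonomously with Android phones. The overall design is similar to WebVoyager, which we covered in the previous chapter: it uses a multimodal LLM together with SoM (Set-of-Mark) page-element segmentation to decide, at each step, which on-screen elements the model should interact with. For the self-learning part, the paper has the model explore each app on its own beforehand and build a usage manual from that exploration, which helps the model understand how each app works and raises the task completion rate at inference time. The paper evaluates on 9 Android apps; some of the test tasks are shown below.
So how can a model generate an app manual on its own? Much as a person learning a new tool keeps revising their understanding of it through trial and error, the model explores in the same way. The paper first generates a set of task instructions for the app; for each instruction, the model then explores the app autonomously. At each step, the model's input includes: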
self_explore_task_template = """You are an agent that is trained to complete certain tasks on a smartphone. You will be
given a screenshot of a smartphone app. The interactive UI elements on the screenshot are labeled with numeric tags
starting from 1.
You can call the following functions to interact with those labeled elements to control the smartphone:
1. tap(element: int)
2. text(text_input: str)
3. long_press(element: int)
4. swipe(element: int, direction: str, dist: str)
The task you need to complete is to . Your past actions to proceed with this task are summarized as
Now, given the following labeled screenshot, you need to think and call the function needed to proceed with the task.
Your output should include three parts in the given format:
You can only take one action at a time, so please directly call the function."""
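The four functions listed in this prompt imply that the model's reply has to be parsed back into a structured action before anything can be executed on the phone. A minimal sketch of such a parser follows. `parse_action` and the `ARITY` table are hypothetical names for illustration, not AppAgent's own code, and the sketch assumes the model emits a plain positional call such as `tap(5)`.

```python
import ast

# Number of positional arguments each function in the prompt expects.
ARITY = {"tap": 1, "text": 1, "long_press": 1, "swipe": 3}


def parse_action(call: str) -> tuple[str, list]:
    """Parse a string like 'swipe(3, "up", "medium")' into (name, args)."""
    node = ast.parse(call.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a function call: {call!r}")
    name = node.func.id
    if name not in ARITY:
        raise ValueError(f"unknown function: {name}")
    if node.keywords or len(node.args) != ARITY[name]:
        raise ValueError(f"{name} expects {ARITY[name]} positional argument(s)")
    # literal_eval only accepts literals, so no model-written code is executed.
    args = [ast.literal_eval(arg) for arg in node.args]
    return name, args
```

Validating against a whitelist keeps a malformed or unexpected reply from reaching the device layer.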
tap_doc_template = """I will give you the screenshot of a mobile app before and after tapping the UI element labeled
with the number on the screen. The numeric tag of each element is located at the center of the element.
Tapping this UI element is a necessary part of proceeding with a larger task, which is to . Your task is to
describe the functionality of the UI element concisely in one or two sentences. Notice that your description of the UI
element should focus on the general function. For example, if the UI element is used to navigate to the chat window
with John, your description should not include the name of the specific person. Just say: "Tapping this area will
navigate the user to the chat window". Never include the numeric tag of the UI element in your description. You can use
pronouns such as "the UI element" to refer to the element."""
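One way to picture the resulting manual is a per-element store of short function descriptions that gets rendered back into the prompt at inference time. The sketch below assumes each UI element has some stable identifier. `ElementDocs`, `record`, and `render` are hypothetical names, and this is not a description of how AppAgent actually persists its documentation.

```python
from collections import defaultdict


class ElementDocs:
    """Hypothetical in-memory manual: element id -> action -> description."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, str]] = defaultdict(dict)

    def record(self, element_id: str, action: str, description: str) -> None:
        # A later exploration overwrites the earlier description for the same action.
        self._docs[element_id][action] = description.strip()

    def render(self, element_ids: list[str]) -> str:
        """Render the docs of the elements visible on the current screen."""
        lines = []
        for eid in element_ids:
            # .get avoids creating empty entries for undocumented elements.
            for action, desc in sorted(self._docs.get(eid, {}).items()):
                lines.append(f"- {eid} ({action}): {desc}")
        return "\n".join(lines)
```

Elements that were never explored simply contribute nothing to the rendered text, so the prompt only grows with what the model has actually learned.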
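The paper shows that the manual built from this autonomous exploration greatly improves task completion accuracy, coming close to manuals built from human demonstrations (Watching Demos) and to hand-written manuals (Manually Crafted).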
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
- https://github.com/microsoft/simulated-trial-and-error
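Where AppAgent above helps the model teach itself front-end interaction, STE, proposed by Microsoft, targets back-end API interaction. The model learns how to call APIs through multiple rounds of API interaction beforehand, and the exploration results are then used, via in-context learning (ICL) or SFT, to help the model use the APIs to complete tasks.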
Your task is to answer the user's query as best you can. You have access to the following tools which you can use via API call to help with your response:
Now you have the chance to explore the available APIs. You can do this by 1) synthesizing some natural user query that calling the API could help, and 2) trying to respond to the user query with the help of the APIs. Here, you can focus on queries that only require calling the API once.
Now, first input your synthesized user query. You should make the query natural - for example, try to avoid using the provided API descriptions or API names in the query, as the user does not know what APIs you have access to. Also try to make the query as specific as possible. Input just the user query alone; do NOT solve the query for now.
User Query:
Now, try to respond to the query using the available APIs.
The format you use the API is by specifying 1) Action: the API function name you'd like to call 2) Action Input: the input parameters of the API call in a json string format. The result of the API call will be returned starting with "Observation:". Remember that you should only perform a SINGLE action at a time, do NOT return a list of multiple actions.
1) the only values that should follow "Action:" are: {api_names}
2) use the following json string format for the API arguments:
Action Input:
"key_1": "value_1",
"key_n": "value_n",
Remember to ALWAYS use the following format:
Thought: you should always think about what to do next
Action: the API function name
Action Input: the input parameters of the API call in json string format
Observation: the return result of the API call. This is what I will provide you with; you do not need to repeat it in your response.
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the response to the user query
Begin! Remember that your response should never start with "Observation:" since that is what I will provide you with. Once you have enough information, please immediately use \nThought: I now know the final answer\nFinal Answer:
User Query (the same you just synthesized): {query}
Now you know a bit more about the API. You can synthesize another user query to explore the API a bit further and consolidate your understanding of the API, based on things that you discovered about this API. Again, just input the user query alone; do NOT solve the query for now.
User Query:
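The Thought/Action/Action Input format in this prompt is designed to be machine-parsed. Below is a minimal sketch of such a parser, assuming the model wraps its arguments in a single valid JSON object. `parse_step` is a hypothetical helper and is not taken from the STE repository.

```python
import json
import re


def parse_step(reply: str) -> dict:
    """Return either {'final': str} or {'action': str, 'input': dict}."""
    final = re.search(r"Final Answer:\s*(.*)", reply, re.DOTALL)
    if final:
        return {"final": final.group(1).strip()}
    # "Action:" does not match the "Action Input:" line, so this finds the API name.
    action = re.search(r"Action:\s*(.+)", reply)
    raw_input = re.search(r"Action Input:\s*(\{.*\})", reply, re.DOTALL)
    if not action or not raw_input:
        raise ValueError("reply does not follow the expected format")
    return {
        "action": action.group(1).strip(),
        "input": json.loads(raw_input.group(1)),
    }
```

The caller would execute the parsed API call, append the result as an `Observation:` line, and query the model again until a final answer is returned.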
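Here we focus only on the ICL variant, because it generalizes better and extends more quickly to new tools and scenarios. Unlike AppAgent above, ICL here does not use a tool manual generated from exploration; it directly uses the model's past tool calls, much like worked examples. When the user asks a new question, the query embedding (SentenceBert) is used to retrieve the 15 most similar queries from the exploration phase, together with the model's final API call results for them, as in-context examples for tool reasoning.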
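The retrieval step can be sketched as a nearest-neighbor lookup over stored query embeddings. The sketch below is only an illustration under stated assumptions: embeddings are taken to be precomputed lists of floats (the embedding model itself is out of scope), cosine similarity is assumed as the metric, and `cosine` and `retrieve_examples` are hypothetical names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_examples(query_vec: list[float], memory: list[tuple], k: int = 15) -> list:
    """memory holds (embedding, example) pairs collected during exploration."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [example for _, example in ranked[:k]]
```

Each returned example would hold an explored query together with its final API call, and the selected examples are placed in the prompt ahead of the new user query.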
- AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents
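To build and use guidelines, AutoGuide has three core modules: state summarization, guideline extraction, and guideline retrieval. The paper designs different state summarization and extraction prompts for different agent scenarios. Here we again use the WebArena dataset from the web agent setting covered in the previous chapter as the example, and go through the two modules in turn.
As an example, in the task below the two action trajectories diverge at Action1, so the observations and actions before Action1 are used as the input (current trajectory) for state summarization. The resulting state should be "You are on the List of forum Page".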
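Finding the input for state summarization amounts to taking the longest common prefix of the two trajectories. A minimal sketch, assuming each trajectory is a flat list of `("obs", ...)` and `("action", ...)` events; this representation and the name `shared_prefix` are hypothetical, chosen only for illustration.

```python
def shared_prefix(traj_a: list[tuple], traj_b: list[tuple]) -> list[tuple]:
    """Return the events both trajectories share before they diverge."""
    prefix = []
    for step_a, step_b in zip(traj_a, traj_b):
        if step_a != step_b:
            break  # first point of divergence, e.g. a different action
        prefix.append(step_a)
    return prefix
```

The returned prefix plays the role of the current trajectory handed to the summarization prompt, and the two differing continuations are what the guideline is later extracted from.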
Continuing the same example, for the state "You are on the List of forum Page", the prompt above yields the guideline quoted at the end of this section.
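With the states and per-state guidelines obtained above, each step at inference time first summarizes the current state with the State Summarization module, then looks up a similar state among the stored state guidelines, using the same LLM prompt as the state deduplication and merging step above. All guidelines attached to the matched state are then retrieved. If there are too many guidelines, the prompt below is used to filter them down to the Top-K, and the next thought and action are inferred from those Top-K guidelines.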
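The inference-time flow described above can be sketched as a small selection function. The two LLM-backed steps are passed in as callables so that the sketch stays runnable: `match_state` stands in for the state-matching prompt and `rank` for the guideline-filtering prompt. Both, along with `select_guidelines` itself, are hypothetical names rather than AutoGuide's actual interface.

```python
from typing import Callable, Optional


def select_guidelines(
    current_state: str,
    guidelines_by_state: dict[str, list[str]],
    match_state: Callable[[str, list[str]], Optional[str]],
    rank: Callable[[str, list[str]], list[str]],
    top_k: int = 3,
) -> list[str]:
    """Pick at most top_k guidelines for the summarized current state."""
    # Step 1: map the current state onto a stored state (an LLM call in practice).
    matched = match_state(current_state, list(guidelines_by_state))
    if matched is None:
        return []
    candidates = guidelines_by_state.get(matched, [])
    # Step 2: only invoke the ranking step when there are too many guidelines.
    if len(candidates) <= top_k:
        return candidates
    return rank(current_state, candidates)[:top_k]
```

Returning an empty list when no stored state matches lets the agent fall back to reasoning without guidelines instead of applying ones meant for a different page.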
"if you want to navigate to a specific forum, you can click on the link that exactly matches the forum name you are looking for."