# Chapter 1: The Agent - Your Browser Assistant's Brain

Welcome to the `Browser Use` tutorial! We're excited to help you learn how to automate web tasks using the power of Large Language Models (LLMs).

Imagine you want to perform a simple task, like searching Google for "cute cat pictures" and clicking on the very first image result. For a human, this is easy! You open your browser, type in the search, look at the results, and click.

But how do you tell a computer program to do this? It needs to understand the goal, look at the webpage like a human does, decide what to click or type next, and then actually perform those actions. This is where the **Agent** comes in.

## What Problem Does the Agent Solve?

The Agent is the core orchestrator, the "brain" or "project manager" of your browser automation task. It connects all the different pieces needed to achieve your goal. Without the Agent, you'd have a bunch of tools (like a browser controller and an LLM) but no central coordinator telling them what to do and when.

The Agent solves the problem of turning a high-level goal (like "find cat pictures") into concrete actions on a webpage, using intelligence to adapt to what it "sees" in the browser.

## Meet the Agent: Your Project Manager

Think of the `Agent` like a project manager overseeing a complex task. It doesn't do *all* the work itself, but it coordinates specialists:

1. **Receives the Task:** You give the Agent the overall goal (e.g., "Search Google for 'cute cat pictures' and click the first image result.").
2. **Consults the Planner (LLM):** The Agent shows the current state of the webpage (using the [BrowserContext](03_browsercontext.md)) to a Large Language Model (LLM). It asks, "Here's the goal, and here's what the webpage looks like right now. What should be the very next step?" The LLM acts as a smart planner, suggesting actions like "type 'cute cat pictures' into the search bar" or "click the element with index 5". We'll learn more about how we instruct the LLM in the [System Prompt](02_system_prompt.md) chapter.
3. **Manages History:** The Agent keeps track of everything that has happened so far – the actions taken, the results, and the state of the browser at each step. This "memory" is managed by the [Message Manager](06_message_manager.md) and helps the LLM make better decisions.
4. **Instructs the Doer (Controller):** Once the LLM suggests an action (like "click element 5"), the Agent tells the [Action Controller & Registry](05_action_controller___registry.md) to actually perform that specific action within the browser.
5. **Observes the Results (BrowserContext):** After the Controller acts, the Agent uses the [BrowserContext](03_browsercontext.md) again to see the new state of the webpage (e.g., the Google search results page).
6. **Repeats:** The Agent repeats steps 2-5, continuously consulting the LLM, instructing the Controller, and observing the results, until the original task is complete or it reaches a stopping point (sketched in pseudocode below).

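Put as pure pseudocode (not the real `browser_use` implementation, which we'll see later in this chapter), the loop the Agent coordinates looks roughly like this; every name below is illustrative:

```python
# Conceptual sketch only - the names are illustrative, not the real browser_use API.
async def agent_loop(task, llm, browser_context, controller, max_steps=15):
    history = []
    for _ in range(max_steps):
        state = await browser_context.get_state()                 # observe the current page
        plan = await llm.plan_next_actions(task, state, history)  # consult the planner
        results = await controller.execute(plan.actions)          # act in the browser
        history.append((plan, results))                           # remember what happened
        if plan.task_is_done:                                     # planner signals completion
            break
    return history
```
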
## Using the Agent: A Simple Example

Let's see how you might use the Agent in Python code. Don't worry about understanding every detail yet; focus on the main idea. We're setting up the Agent with our task and the necessary components.

```python
# --- Simplified Example ---
# We need to import the necessary parts from the browser_use library
from browser_use import Agent, Browser, BrowserConfig, BrowserContext, BrowserContextConfig, Controller

# Assume 'my_llm' is your configured Large Language Model (e.g., from OpenAI, Anthropic)
from my_llm_setup import my_llm  # Placeholder for your specific LLM setup

# 1. Define the task for the Agent
my_task = "Go to google.com, search for 'cute cat pictures', and click the first image result."

# 2. Basic browser configuration (we'll learn more later)
browser_config = BrowserConfig()         # Default settings
context_config = BrowserContextConfig()  # Default settings

# 3. Initialize the components the Agent needs
# The Browser manages the underlying browser application
browser = Browser(config=browser_config)
# The Controller knows *how* to perform actions like 'click' or 'type'
controller = Controller()

async def main():
    # The BrowserContext represents a single browser tab/window environment.
    # It uses the Browser and its configuration.
    async with BrowserContext(browser=browser, config=context_config) as browser_context:

        # 4. Create the Agent instance!
        agent = Agent(
            task=my_task,
            llm=my_llm,                       # The "brain" - the Language Model
            browser_context=browser_context,  # The "eyes" - interacts with the browser tab
            controller=controller,            # The "hands" - executes actions
            # Many other settings can be configured here!
        )

        print(f"Agent created. Starting task: {my_task}")

        # 5. Run the Agent! This starts the loop.
        # It will keep taking steps until the task is done or it hits the limit.
        history = await agent.run(max_steps=15)  # Limit steps for safety

        # 6. Check the result
        if history.is_done() and history.is_successful():
            print("✅ Agent finished the task successfully!")
            print(f"Final message from agent: {history.final_result()}")
        else:
            print("⚠️ Agent stopped. Maybe max_steps reached or task wasn't completed successfully.")

    # The 'async with' block automatically cleans up the browser_context
    await browser.close()  # Close the browser application

# Run the asynchronous function
import asyncio
asyncio.run(main())
```

**What happens when you run this?**

1. An `Agent` object is created with your task, the LLM, the browser context, and the controller.
2. Calling `agent.run(max_steps=15)` starts the main loop.
3. The Agent gets the initial state of the browser (likely a blank page).
4. It asks the LLM what to do. The LLM might say "Go to google.com".
5. The Agent tells the Controller to execute the "go to URL" action.
6. The browser navigates to Google.
7. The Agent gets the new state (Google's homepage).
8. It asks the LLM again. The LLM says "Type 'cute cat pictures' into the search bar".
9. The Agent tells the Controller to type the text.
10. This continues step-by-step: pressing Enter, seeing results, asking the LLM, clicking the image.
11. Eventually, the LLM will hopefully tell the Agent the task is "done".
12. `agent.run()` finishes and returns the `history` object containing details of what happened.

## How it Works Under the Hood: The Agent Loop

Let's visualize the process with a simple diagram:

```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant LLM
    participant Controller
    participant BC as BrowserContext

    User->>Agent: Start task("Search Google for cats...")
    Note over Agent: Agent Loop Starts
    Agent->>BC: Get current state (e.g., blank page)
    BC-->>Agent: Current Page State
    Agent->>LLM: What's next? (Task + State + History)
    LLM-->>Agent: Plan: [Action: Type 'cute cat pictures', Action: Press Enter]
    Agent->>Controller: Execute: type_text(...)
    Controller->>BC: Perform type action
    Agent->>Controller: Execute: press_keys('Enter')
    Controller->>BC: Perform press action
    Agent->>BC: Get new state (search results page)
    BC-->>Agent: New Page State
    Agent->>LLM: What's next? (Task + New State + History)
    LLM-->>Agent: Plan: [Action: click_element(index=5)]
    Agent->>Controller: Execute: click_element(index=5)
    Controller->>BC: Perform click action
    Note over Agent: Loop continues until done...
    LLM-->>Agent: Plan: [Action: done(success=True, text='Found cat picture!')]
    Agent->>Controller: Execute: done(...)
    Controller-->>Agent: ActionResult (is_done=True)
    Note over Agent: Agent Loop Ends
    Agent->>User: Return History (Task Complete)
```

The core of the `Agent` lives in the `agent/service.py` file. The `Agent` class manages the overall process.

1. **Initialization (`__init__`)**: When you create an `Agent`, it sets up its internal state, stores the task, the LLM, the controller, and prepares the [Message Manager](06_message_manager.md) to keep track of the conversation history. It also figures out the best way to talk to the specific LLM you provided.

```python
# --- File: agent/service.py (Simplified __init__) ---
class Agent:
    def __init__(
        self,
        task: str,
        llm: BaseChatModel,
        browser_context: BrowserContext,
        controller: Controller,
        # ... other settings like use_vision, max_failures, etc.
        **kwargs
    ):
        self.task = task
        self.llm = llm
        self.browser_context = browser_context
        self.controller = controller
        self.settings = AgentSettings(**kwargs)  # Store various settings
        self.state = AgentState()  # Internal state (step count, failures, etc.)

        # Setup message manager for history, using the task and system prompt
        self._message_manager = MessageManager(
            task=self.task,
            system_message=self.settings.system_prompt_class(...).get_system_message(),
            settings=MessageManagerSettings(...)
            # ... more setup ...
        )
        # ... other initializations ...
        logger.info("Agent initialized.")
```


2. **Running the Task (`run`)**: The `run` method orchestrates the main loop. It calls the `step` method repeatedly until the task is marked as done, an error occurs, or `max_steps` is reached.

```python
# --- File: agent/service.py (Simplified run method) ---
class Agent:
    # ... (init) ...

    async def run(self, max_steps: int = 100) -> AgentHistoryList:
        self._log_agent_run()  # Log start event
        try:
            for step_num in range(max_steps):
                if self.state.stopped or self.state.consecutive_failures >= self.settings.max_failures:
                    break  # Stop conditions

                # Wait if paused
                while self.state.paused:
                    await asyncio.sleep(0.2)

                step_info = AgentStepInfo(step_number=step_num, max_steps=max_steps)
                await self.step(step_info)  # <<< Execute one step of the loop

                if self.state.history.is_done():
                    await self.log_completion()  # Log success/failure
                    break  # Exit loop if agent signaled 'done'
            else:
                logger.info("Max steps reached.")  # Ran out of steps
        finally:
            # ... (cleanup, telemetry, potentially save history/gif) ...
            pass

        return self.state.history  # Return the recorded history
```

3. **Taking a Step (`step`)**: This is the heart of the loop. In each step, the Agent:

* Gets the current browser state (`browser_context.get_state()`).
* Adds this state to the history via the `_message_manager`.
* Asks the LLM for the next action (`get_next_action()`).
* Tells the `Controller` to execute the action(s) (`multi_act()`).
* Records the outcome in the history.
* Handles any errors that might occur.

```python
# --- File: agent/service.py (Simplified step method) ---
class Agent:
    # ... (init, run) ...

    async def step(self, step_info: Optional[AgentStepInfo] = None) -> None:
        logger.info(f"📍 Step {self.state.n_steps}")
        state = None
        model_output = None
        result: list[ActionResult] = []

        try:
            # 1. Get current state from the browser
            state = await self.browser_context.get_state()  # Uses BrowserContext

            # 2. Add state (+ previous result) to message history for LLM context
            self._message_manager.add_state_message(state, self.state.last_result, ...)

            # 3. Get LLM's decision on the next action(s)
            input_messages = self._message_manager.get_messages()
            model_output = await self.get_next_action(input_messages)  # Calls the LLM

            self.state.n_steps += 1  # Increment step counter

            # 4. Execute the action(s) using the Controller
            result = await self.multi_act(model_output.action)  # Uses Controller
            self.state.last_result = result  # Store result for next step's context

            # 5. Record step details (actions, results, state snapshot)
            self._make_history_item(model_output, state, result, ...)

            self.state.consecutive_failures = 0  # Reset failure count on success

        except Exception as e:
            # Handle errors, increment failure count, maybe retry later
            result = await self._handle_step_error(e)
            self.state.last_result = result
        # ... (finally block for logging/telemetry) ...
```

## Conclusion

You've now met the `Agent`, the central coordinator in `Browser Use`. You learned that it acts like a project manager, taking your high-level task, consulting an LLM for step-by-step planning, managing the history, and instructing a `Controller` to perform actions within a `BrowserContext`.

The Agent's effectiveness heavily relies on how well we instruct the LLM planner. In the next chapter, we'll dive into exactly that: crafting the **System Prompt** to guide the LLM's behavior.

[Next Chapter: System Prompt](02_system_prompt.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
# Chapter 2: The System Prompt - Setting the Rules for Your AI Assistant

In [Chapter 1: The Agent](01_agent.md), we met the `Agent`, our project manager for automating browser tasks. We saw it consults a Large Language Model (LLM) – the "planner" – to decide the next steps based on the current state of the webpage. But how does the Agent tell the LLM *how* it should think, behave, and respond? Just giving it the task isn't enough!

Imagine hiring a new assistant. You wouldn't just say, "Organize my files!" You'd give them specific instructions: "Please sort the files alphabetically by client name, put them in the blue folders, and give me a summary list when you're done." Without these rules, the assistant might do something completely different!

The **System Prompt** solves this exact problem for our LLM. It's the set of core instructions and rules we give the LLM at the very beginning, telling it exactly how to act as a browser automation assistant and, crucially, how to format its responses so the `Agent` can understand them.

## What is the System Prompt? The AI's Rulebook

Think of the System Prompt like the AI assistant's fundamental operating manual, its "Prime Directive," or the rules of a board game. It defines:

1. **Persona:** "You are an AI agent designed to automate browser tasks."
2. **Goal:** "Your goal is to accomplish the ultimate task..."
3. **Input:** How to understand the information it receives about the webpage ([DOM Representation](04_dom_representation.md)).
4. **Capabilities:** What actions it can take ([Action Controller & Registry](05_action_controller___registry.md)).
5. **Limitations:** What it *shouldn't* do (e.g., hallucinate actions).
6. **Response Format:** The *exact* structure (JSON format) its thoughts and planned actions must follow.

Without this rulebook, the LLM might just chat casually, give vague suggestions, or produce output in a format the `Agent` code can't parse. The System Prompt ensures the LLM behaves like the specialized tool we need.

## Why is the Response Format So Important?

This is a critical point. The `Agent` code isn't a human reading the LLM's response. It's a program expecting data in a very specific structure. The System Prompt tells the LLM to *always* respond in a JSON format that looks something like this (simplified):

```json
{
  "current_state": {
    "evaluation_previous_goal": "Success - Found the search bar.",
    "memory": "On google.com main page. Need to search for cats.",
    "next_goal": "Type 'cute cat pictures' into the search bar."
  },
  "action": [
    {
      "input_text": {
        "index": 5, // The index of the search bar element
        "text": "cute cat pictures"
      }
    },
    {
      "press_keys": {
        "keys": "Enter" // Press the Enter key
      }
    }
  ]
}
```

The `Agent` can easily read this JSON:

* It understands the LLM's thoughts (`current_state`).
* It sees the exact `action` list the LLM wants to perform.
* It passes these actions (like `input_text` or `press_keys`) to the [Action Controller & Registry](05_action_controller___registry.md) to execute them in the browser.

If the LLM responded with just "Okay, I'll type 'cute cat pictures' into the search bar and press Enter," the `Agent` wouldn't know *which* element index corresponds to the search bar or exactly which actions to call. The strict JSON format is essential for automation.

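To make the contrast concrete, here is a minimal, self-contained sketch of the kind of parsing this strict format makes possible. It uses plain `json` for illustration only; the real `Agent` performs stricter, model-based validation of the LLM output:

```python
import json

# A response shaped like the System Prompt demands (real JSON allows no comments).
raw_response = '''
{
  "current_state": {
    "evaluation_previous_goal": "Success - Found the search bar.",
    "memory": "On google.com main page. Need to search for cats.",
    "next_goal": "Type 'cute cat pictures' into the search bar."
  },
  "action": [
    {"input_text": {"index": 5, "text": "cute cat pictures"}},
    {"press_keys": {"keys": "Enter"}}
  ]
}
'''

parsed = json.loads(raw_response)
print("LLM's next goal:", parsed["current_state"]["next_goal"])

# Each action is a one-key dictionary: {action_name: parameters}
for action in parsed["action"]:
    (name, params), = action.items()
    print(f"Would ask the Controller to run '{name}' with {params}")
```

A free-form reply like "Okay, I'll type it in" would fail `json.loads` (or any schema check) immediately, which is exactly why the rulebook insists on this structure.
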
## A Peek Inside the Rulebook (`system_prompt.md`)

The actual instructions live in a text file within the `Browser Use` library: `browser_use/agent/system_prompt.md`. It's quite detailed, but here's a tiny snippet focusing on the response format rule:

```markdown
# Response Rules
1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:
   {{"current_state": {{"evaluation_previous_goal": "...",
                        "memory": "...",
                        "next_goal": "..."}},
    "action": [{{"one_action_name": {{...}}}}, ...]}}

2. ACTIONS: You can specify multiple actions in the list... Use maximum {{max_actions}} actions...
```

*(This is heavily simplified! The real file has many more rules about element interaction, error handling, task completion, etc.)*

This file clearly defines the JSON structure (`current_state` and `action`) and other crucial behaviors required from the LLM.

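The doubled braces in the snippet are deliberate: the template is later filled in with Python's `str.format`, where `{{` and `}}` escape to literal braces while single-brace placeholders such as `{max_actions}` get substituted. A tiny standalone example (the template text here is made up, not the real file):

```python
# Illustrative template, not the contents of system_prompt.md.
template = 'Always answer with JSON like {{"action": [...]}} and use at most {max_actions} actions.'

print(template.format(max_actions=10))
# Output: Always answer with JSON like {"action": [...]} and use at most 10 actions.
```
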
## How the Agent Uses the System Prompt

The `Agent` uses a helper class called `SystemPrompt` (found in `agent/prompts.py`) to manage these rules. Here's the flow:

1. **Loading:** When you create an `Agent`, it internally creates a `SystemPrompt` object. This object reads the rules from the `system_prompt.md` file.
2. **Formatting:** The `SystemPrompt` object formats these rules into a special `SystemMessage` object that LLMs understand as foundational instructions.
3. **Conversation Start:** This `SystemMessage` is given to the [Message Manager](06_message_manager.md), which keeps track of the conversation history with the LLM. The `SystemMessage` becomes the *very first message*, setting the context for all future interactions in that session.

Think of it like starting a meeting: the first thing you do is state the agenda and rules (System Prompt), and then the discussion (LLM interaction) follows based on that foundation.

Let's look at a simplified view of the `SystemPrompt` class loading the rules:

```python
# --- File: agent/prompts.py (Simplified) ---
import importlib.resources  # Helps find files within the installed library

from langchain_core.messages import SystemMessage  # Special message type for LLMs


class SystemPrompt:
    def __init__(self, action_description: str, max_actions_per_step: int = 10):
        # We ignore these details for now
        self.default_action_description = action_description
        self.max_actions_per_step = max_actions_per_step
        self._load_prompt_template()  # <--- Loads the rules file

    def _load_prompt_template(self) -> None:
        """Load the prompt rules from the system_prompt.md file."""
        try:
            # Finds the 'system_prompt.md' file inside the browser_use package
            filepath = importlib.resources.files('browser_use.agent').joinpath('system_prompt.md')
            with filepath.open('r') as f:
                self.prompt_template = f.read()  # Read the text content
            print("System Prompt template loaded successfully!")
        except Exception as e:
            print(f"Error loading system prompt: {e}")
            self.prompt_template = "Error: Could not load prompt."  # Fallback

    def get_system_message(self) -> SystemMessage:
        """Format the loaded rules into a message for the LLM."""
        # Replace placeholders like {{max_actions}} with actual values
        prompt = self.prompt_template.format(max_actions=self.max_actions_per_step)
        # Wrap the final rules text in a SystemMessage object
        return SystemMessage(content=prompt)


# --- How it plugs into Agent creation (Conceptual) ---
# from browser_use import Agent, SystemPrompt
# from my_llm_setup import my_llm  # Your LLM
# ... other setup ...

# When you create an Agent:
# agent = Agent(
#     task="Find cat pictures",
#     llm=my_llm,
#     browser_context=...,
#     controller=...,
#     # The Agent's __init__ method does something like this internally:
#     # system_prompt_obj = SystemPrompt(action_description="...", max_actions_per_step=10)
#     # system_message_for_llm = system_prompt_obj.get_system_message()
#     # This system_message_for_llm is then passed to the Message Manager.
# )
```

This code shows how the `SystemPrompt` class finds and reads the `system_prompt.md` file and prepares the instructions as a `SystemMessage` ready for the LLM conversation.

## Under the Hood: Initialization and Conversation Flow

Let's visualize how the System Prompt fits into the Agent's setup and interaction loop:

```mermaid
sequenceDiagram
    participant User
    participant Agent_Init as Agent Initialization
    participant SP as SystemPrompt Class
    participant MM as Message Manager
    participant Agent_Run as Agent Run Loop
    participant LLM

    User->>Agent_Init: Create Agent(task, llm, ...)
    Note over Agent_Init: Agent needs the rules!
    Agent_Init->>SP: Create SystemPrompt(...)
    SP->>SP: _load_prompt_template() reads system_prompt.md
    SP-->>Agent_Init: SystemPrompt instance
    Agent_Init->>SP: get_system_message()
    SP-->>Agent_Init: system_message (The Formatted Rules)
    Note over Agent_Init: Pass rules to conversation manager
    Agent_Init->>MM: Initialize MessageManager(task, system_message)
    MM->>MM: Store system_message as message #1
    MM-->>Agent_Init: MessageManager instance ready
    Agent_Init-->>User: Agent created and ready

    User->>Agent_Run: agent.run() starts the task
    Note over Agent_Run: Agent needs context for LLM
    Agent_Run->>MM: get_messages()
    MM-->>Agent_Run: [system_message, user_message(state), ...]
    Note over Agent_Run: Send rules + current state to LLM
    Agent_Run->>LLM: Ask for next action (Input includes rules)
    LLM-->>Agent_Run: JSON response (LLM followed rules!)
    Agent_Run->>MM: add_model_output(...)
    Note over Agent_Run: Loop continues...
```

Internally, the `Agent`'s initialization code (`__init__` in `agent/service.py`) explicitly creates the `SystemPrompt` and passes its output to the `MessageManager`:

```python
# --- File: agent/service.py (Simplified Agent __init__) ---
# ... other imports ...
from browser_use.agent.prompts import SystemPrompt  # Import the class
from browser_use.agent.message_manager.service import MessageManager, MessageManagerSettings


class Agent:
    def __init__(
        self,
        task: str,
        llm: BaseChatModel,
        browser_context: BrowserContext,
        controller: Controller,
        system_prompt_class: Type[SystemPrompt] = SystemPrompt,  # Allows customizing the prompt class
        max_actions_per_step: int = 10,
        # ... other parameters ...
        **kwargs
    ):
        self.task = task
        self.llm = llm
        # ... store other components ...

        # Get the list of available actions from the controller
        self.available_actions = controller.registry.get_prompt_description()

        # 1. Create the SystemPrompt instance using the provided class
        system_prompt_instance = system_prompt_class(
            action_description=self.available_actions,
            max_actions_per_step=max_actions_per_step,
        )

        # 2. Get the formatted SystemMessage (the rules)
        system_message = system_prompt_instance.get_system_message()

        # 3. Initialize the Message Manager with the task and the rules
        self._message_manager = MessageManager(
            task=self.task,
            system_message=system_message,  # <--- Pass the rules here!
            settings=MessageManagerSettings(...)
            # ... other message manager setup ...
        )
        # ... rest of initialization ...
        logger.info("Agent initialized with System Prompt.")
```

When the `Agent` runs its loop (`agent.run()` calls `agent.step()`), it asks the `MessageManager` for the current conversation history (`self._message_manager.get_messages()`). The `MessageManager` always ensures that the `SystemMessage` (containing the rules) is the very first item in that history list sent to the LLM.
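
As a rough mental model (not the real `MessageManager` code, and with heavily abbreviated message contents), the list sent to the LLM each step looks something like this:

```python
from langchain_core.messages import HumanMessage, SystemMessage

# Hypothetical, heavily simplified conversation history.
messages = [
    SystemMessage(content="You are an AI agent designed to automate browser tasks. ..."),
    HumanMessage(content="Your ultimate task is: Search Google for 'cute cat pictures'..."),
    HumanMessage(content="Current page: https://www.google.com\n[5]<input aria-label='Search'> ..."),
    # ... one state/result message per step gets appended here ...
]

# However long the history grows, index 0 stays the SystemMessage,
# so every LLM call is grounded in the same rulebook.
assert isinstance(messages[0], SystemMessage)
```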

## Conclusion

The System Prompt is the essential rulebook that governs the LLM's behavior within the `Browser Use` framework. It tells the LLM how to interpret the browser state, what actions it can take, and most importantly, dictates the exact JSON format for its responses. This structured communication is key to enabling the `Agent` to reliably understand the LLM's plan and execute browser automation tasks.

Without a clear System Prompt, the LLM would be like an untrained assistant – potentially intelligent, but unable to follow the specific procedures needed for the job.

Now that we understand how the `Agent` gets its fundamental instructions, how does it actually perceive the webpage it's supposed to interact with? In the next chapter, we'll explore the component responsible for representing the browser's state: the [BrowserContext](03_browsercontext.md).

[Next Chapter: BrowserContext](03_browsercontext.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
# Chapter 3: BrowserContext - The Agent's Isolated Workspace

In the [previous chapter](02_system_prompt.md), we learned how the `System Prompt` acts as the rulebook for the AI assistant (LLM) that guides our `Agent`. We know the Agent uses the LLM to decide *what* to do next based on the current situation in the browser.

But *where* does the Agent actually "see" the webpage and perform its actions? How does it keep track of the current website address (URL), the page content, and things like cookies, all while staying focused on its specific task without getting mixed up with your other browsing?

This is where the **BrowserContext** comes in.

## What Problem Does BrowserContext Solve?

Imagine you ask your `Agent` to log into a specific online shopping website and check your order status. You might already be logged into that same website in your regular browser window with your personal account.

If the Agent just used your main browser window, it might:

1. Get confused by your existing login.
2. Accidentally use your personal cookies or saved passwords.
3. Interfere with other tabs you have open.

We need a way to give the Agent its *own*, clean, separate browsing environment for each task. It needs an isolated "workspace" where it can open websites, log in, click buttons, and manage its own cookies without affecting anything else.

The `BrowserContext` solves this by representing a single, isolated browser session.

## Meet the BrowserContext: Your Agent's Private Browser Window

Think of a `BrowserContext` like opening a brand new **Incognito Window** or creating a **separate User Profile** in your web browser (like Chrome or Firefox).

* **It's Isolated:** What happens in one `BrowserContext` doesn't affect others or your main browser session. It has its own cookies, its own history (for that session), and its own set of tabs.
* **It Manages State:** It keeps track of everything important about the current web session the Agent is working on:
    * The current URL.
    * Which tabs are open within its "window".
    * Cookies specific to that session.
    * The structure and content of the current webpage (the DOM - Document Object Model, which we'll explore in the [next chapter](04_dom_representation.md)).
* **It's the Agent's Viewport:** The `Agent` looks through the `BrowserContext` to "see" the current state of the webpage. When the Agent decides to perform an action (like clicking a button), it tells the [Action Controller](05_action_controller___registry.md) to perform it *within* that specific `BrowserContext`.

Essentially, the `BrowserContext` is like a dedicated, clean desk or workspace given to the Agent for its specific job.

## Using the BrowserContext

Before we can have an isolated session (`BrowserContext`), we first need the main browser application itself. This is handled by the `Browser` class. Think of `Browser` as the entire Chrome or Firefox application installed on your computer, while `BrowserContext` is just one window or profile within that application.

Here's a simplified example of how you might set up a `Browser` and then create a `BrowserContext` to navigate to a page:

```python
import asyncio

# Import necessary classes
from browser_use import Browser, BrowserConfig, BrowserContext, BrowserContextConfig


async def main():
    # 1. Configure the main browser application (optional, defaults are usually fine)
    browser_config = BrowserConfig(headless=False)  # Show the browser window

    # 2. Create the main Browser instance
    # This might launch a browser application in the background (or connect to one)
    browser = Browser(config=browser_config)
    print("Browser application instance created.")

    # 3. Configure the specific session/window (optional)
    context_config = BrowserContextConfig(
        user_agent="MyCoolAgent/1.0",            # Example: Set a custom user agent
        cookies_file="my_session_cookies.json",  # Example: Save/load cookies
    )

    # 4. Create the isolated BrowserContext (like opening an incognito window)
    # We use 'async with' to ensure it cleans up automatically afterwards
    async with browser.new_context(config=context_config) as browser_context:
        print(f"BrowserContext created (ID: {browser_context.context_id}).")

        # 5. Use the context to interact with the browser session
        start_url = "https://example.com"
        print(f"Navigating to: {start_url}")
        await browser_context.navigate_to(start_url)

        # 6. Get information *from* the context
        current_state = await browser_context.get_state()  # Get current page info
        print(f"Current page title: {current_state.title}")
        print(f"Current page URL: {current_state.url}")

        # The Agent would use this 'browser_context' object to see the page
        # and tell the Controller to perform actions within it.

    print("BrowserContext closed automatically.")

    # 7. Close the main browser application when done
    await browser.close()
    print("Browser application closed.")


# Run the asynchronous code
asyncio.run(main())
```

**What happens here?**

1. We set up a `BrowserConfig` (telling it *not* to run headless so we can see the window).
2. We create a `Browser` instance, which represents the overall browser program.
3. We create a `BrowserContextConfig` to specify settings for our isolated session (like a custom user agent or where to save cookies).
4. Crucially, `browser.new_context(...)` creates our isolated session. The `async with` block ensures this session is properly closed later.
5. We use methods *on the `browser_context` object* like `navigate_to()` to control *this specific session*.
6. We use `browser_context.get_state()` to get information about the current page within *this session*. The `Agent` heavily relies on this method.
7. After the `async with` block finishes, the `browser_context` is closed (like closing the incognito window), and finally, we close the main `browser` application.

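Because each context is isolated, a single `Browser` can host several of them side by side without sharing cookies, logins, or tabs. Here is a hedged sketch of that idea, reusing only the calls shown above and assuming `new_context()` also works with its default configuration:

```python
import asyncio

from browser_use import Browser, BrowserConfig


async def main():
    browser = Browser(config=BrowserConfig(headless=True))

    # Two separate "incognito windows" from the same browser application.
    async with browser.new_context() as first_context, browser.new_context() as second_context:
        await first_context.navigate_to("https://example.com")
        await second_context.navigate_to("https://example.org")

        first_state = await first_context.get_state()
        second_state = await second_context.get_state()

        # Each context tracks its own URL, cookies and tabs independently.
        print("First context is on:", first_state.url)
        print("Second context is on:", second_state.url)

    await browser.close()


asyncio.run(main())
```
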
## How it Works Under the Hood

When the `Agent` needs to understand the current situation to decide the next step, it asks the `BrowserContext` for the latest state using the `get_state()` method. What happens then?

1. **Wait for Stability:** The `BrowserContext` first waits for the webpage to finish loading and for network activity to settle down (`_wait_for_page_and_frames_load`). This prevents the Agent from acting on an incomplete page.
2. **Analyze the Page:** It then uses the [DOM Representation](04_dom_representation.md) service (`DomService`) to analyze the current HTML structure of the page. This service figures out which elements are visible, interactive (buttons, links, input fields), and where they are.
3. **Capture Visuals:** It often takes a screenshot of the current view (`take_screenshot`). This can be helpful for advanced agents or debugging.
4. **Gather Metadata:** It gets the current URL, page title, and information about any other tabs open *within this context*.
5. **Package the State:** All this information (DOM structure, URL, title, screenshot, etc.) is bundled into a `BrowserState` object.
6. **Return to Agent:** The `BrowserContext` returns this `BrowserState` object to the `Agent`. The Agent then uses this information (often sending it to the LLM) to plan its next action.

Here's a simplified diagram of the `get_state()` process:

```mermaid
sequenceDiagram
    participant Agent
    participant BC as BrowserContext
    participant PlaywrightPage as Underlying Browser Page
    participant DomService as DOM Service

    Agent->>BC: get_state()
    Note over BC: Wait for page to be ready...
    BC->>PlaywrightPage: Ensure page/network is stable
    PlaywrightPage-->>BC: Page is ready
    Note over BC: Analyze the page content...
    BC->>DomService: Get simplified DOM structure + interactive elements
    DomService-->>BC: DOMState (element tree, etc.)
    Note over BC: Get visuals and metadata...
    BC->>PlaywrightPage: Take screenshot()
    PlaywrightPage-->>BC: Screenshot data
    BC->>PlaywrightPage: Get URL, Title
    PlaywrightPage-->>BC: URL, Title data
    Note over BC: Combine everything...
    BC->>BC: Create BrowserState object
    BC-->>Agent: Return BrowserState
```

Let's look at some simplified code snippets from the library.

The `BrowserContext` is initialized (`__init__` in `browser/context.py`) with its configuration and a reference to the main `Browser` instance that created it.

```python
# --- File: browser/context.py (Simplified __init__) ---
import uuid
# ... other imports ...

if TYPE_CHECKING:
    from browser_use.browser.browser import Browser  # Link to the Browser class


@dataclass
class BrowserContextConfig:  # Configuration settings
    # ... various settings like user_agent, cookies_file, window_size ...
    pass


@dataclass
class BrowserSession:  # Holds the actual Playwright context
    context: PlaywrightBrowserContext  # The underlying Playwright object
    cached_state: Optional[BrowserState] = None  # Stores the last known state


class BrowserContext:
    def __init__(
        self,
        browser: 'Browser',  # Reference to the main Browser instance
        config: BrowserContextConfig = BrowserContextConfig(),
        # ... other optional state ...
    ):
        self.context_id = str(uuid.uuid4())  # Unique ID for this session
        self.config = config    # Store the configuration
        self.browser = browser  # Store the reference to the parent Browser

        # The actual Playwright session is created later, when needed
        self.session: BrowserSession | None = None
        logger.debug(f"BrowserContext object created (ID: {self.context_id}). Session not yet initialized.")

    # The 'async with' statement calls __aenter__ which initializes the session
    async def __aenter__(self):
        await self._initialize_session()  # Creates the actual browser window/tab
        return self

    async def _initialize_session(self):
        # ... (complex setup code happens here) ...
        # Gets the main Playwright browser from self.browser
        playwright_browser = await self.browser.get_playwright_browser()
        # Creates the isolated Playwright context (like the incognito window)
        context = await self._create_context(playwright_browser)
        # Creates the BrowserSession to hold the context and state
        self.session = BrowserSession(context=context, cached_state=None)
        logger.debug(f"BrowserContext session initialized (ID: {self.context_id}).")
        # ... (sets up the initial page) ...
        return self.session

    # ... other methods like navigate_to, close, etc. ...
```

The `get_state` method orchestrates fetching the current information from the browser session.

```python
# --- File: browser/context.py (Simplified get_state and helpers) ---
# ... other imports ...
from browser_use.dom.service import DomService      # Imports the DOM analyzer
from browser_use.browser.views import BrowserState  # Imports the state structure


class BrowserContext:
    # ... (init, aenter, etc.) ...

    async def get_state(self) -> BrowserState:
        """Get the current state of the browser session."""
        logger.debug(f"Getting state for context {self.context_id}...")
        # 1. Make sure the page is loaded and stable
        await self._wait_for_page_and_frames_load()

        # 2. Get the actual Playwright session object
        session = await self.get_session()

        # 3. Update the state (this does the heavy lifting)
        session.cached_state = await self._update_state()
        logger.debug(f"State update complete for {self.context_id}.")

        # 4. Optionally save cookies if configured
        if self.config.cookies_file:
            asyncio.create_task(self.save_cookies())

        return session.cached_state

    async def _wait_for_page_and_frames_load(self, timeout_overwrite: float | None = None):
        """Ensures page is fully loaded before continuing."""
        # ... (complex logic to wait for network idle, minimum times) ...
        page = await self.get_current_page()
        await page.wait_for_load_state('load', timeout=5000)  # Simplified wait
        logger.debug("Page load/network stability checks passed.")
        await asyncio.sleep(self.config.minimum_wait_page_load_time)  # Ensure minimum wait

    async def _update_state(self) -> BrowserState:
        """Fetches all info and builds the BrowserState."""
        session = await self.get_session()
        page = await self.get_current_page()  # Get the active Playwright page object

        try:
            # Use DomService to analyze the page content
            dom_service = DomService(page)
            # Get the simplified DOM tree and interactive elements map
            content_info = await dom_service.get_clickable_elements(
                highlight_elements=self.config.highlight_elements,
                # ... other DOM options ...
            )

            # Take a screenshot
            screenshot_b64 = await self.take_screenshot()

            # Get URL, Title, Tabs, Scroll info etc.
            url = page.url
            title = await page.title()
            tabs = await self.get_tabs_info()
            pixels_above, pixels_below = await self.get_scroll_info(page)

            # Create the BrowserState object
            browser_state = BrowserState(
                element_tree=content_info.element_tree,
                selector_map=content_info.selector_map,
                url=url,
                title=title,
                tabs=tabs,
                screenshot=screenshot_b64,
                pixels_above=pixels_above,
                pixels_below=pixels_below,
            )
            return browser_state

        except Exception as e:
            logger.error(f'Failed to update state: {str(e)}')
            # Maybe return old state or raise error
            raise BrowserError("Failed to get browser state") from e

    async def take_screenshot(self, full_page: bool = False) -> str:
        """Takes a screenshot and returns base64 encoded string."""
        page = await self.get_current_page()
        screenshot_bytes = await page.screenshot(full_page=full_page, animations='disabled')
        return base64.b64encode(screenshot_bytes).decode('utf-8')

    # ... many other helper methods (_get_current_page, get_tabs_info, etc.) ...
```

This shows how `BrowserContext` acts as a manager for a specific browser session, using underlying tools (like Playwright and `DomService`) to gather the necessary information (`BrowserState`) that the `Agent` needs to operate.

## Conclusion

The `BrowserContext` is a fundamental concept in `Browser Use`. It provides the necessary **isolated environment** for the `Agent` to perform its tasks, much like an incognito window or a separate browser profile. It manages the session's state (URL, cookies, tabs, page content) and provides the `Agent` with a snapshot of the current situation via the `get_state()` method.

Understanding the `BrowserContext` helps clarify *where* the Agent works. Now, how does the Agent actually understand the *content* of the webpage within that context? How is the complex structure of a webpage represented in a way the Agent (and the LLM) can understand?

In the next chapter, we'll dive into exactly that: the [DOM Representation](04_dom_representation.md).

[Next Chapter: DOM Representation](04_dom_representation.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
# Chapter 4: DOM Representation - Mapping the Webpage

In the [previous chapter](03_browsercontext.md), we learned about the `BrowserContext`, the Agent's private workspace for browsing. We saw that the Agent uses `browser_context.get_state()` to get a snapshot of the current webpage. But how does the Agent actually *understand* the content of that snapshot?

Imagine you're looking at the Google homepage. You instantly recognize the logo, the search bar, and the buttons. But a computer program just sees a wall of code (HTML). How can our `Agent` figure out: "This rectangular box is the search bar I need to type into," or "This specific image link is the first result I should click"?

This is the problem solved by **DOM Representation**.

## What Problem Does DOM Representation Solve?

Webpages are built using HTML (HyperText Markup Language), which describes the structure and content. Your browser reads this HTML and creates an internal, structured representation called the **Document Object Model (DOM)**. It's like the browser builds a detailed blueprint or an outline from the HTML instructions.

However, this raw DOM blueprint is incredibly complex and contains lots of information irrelevant to our Agent's task. The Agent doesn't need to know about every single tiny visual detail; it needs a *simplified map* focused on what's important for interaction:

1. **What elements are on the page?** (buttons, links, input fields, text)
2. **Are they visible to a user?** (Hidden elements shouldn't be interacted with)
3. **Are they interactive?** (Can you click it? Can you type in it?)
4. **How can the Agent refer to them?** (We need a simple way to say "click *this* button")

DOM Representation solves the problem of translating the complex, raw DOM blueprint into a simplified, structured map that highlights the interactive "landmarks" and pathways the Agent can use.

## Meet `DomService`: The Map Maker

The component responsible for creating this map is the `DomService`. Think of it as a cartographer specializing in webpages.

When the `Agent` (via the `BrowserContext`) asks for the current state of the page, the `BrowserContext` employs the `DomService` to analyze the page's live DOM.

Here's what the `DomService` does:

1. **Examines the Live Page:** It looks at the current structure rendered in the browser tab, not just the initial HTML source code (because JavaScript can change the page after it loads).
2. **Identifies Elements:** It finds all the meaningful elements like buttons, links, input fields, and text blocks.
3. **Checks Properties:** For each element, it determines crucial properties:
    * **Visibility:** Is it actually displayed on the screen?
    * **Interactivity:** Is it something a user can click, type into, or otherwise interact with?
    * **Position:** Where is it located (roughly)?
4. **Assigns Interaction Indices:** This is key! For elements deemed interactive and visible, `DomService` assigns a unique number, called a `highlight_index` (like `[5]`, `[12]`, etc.). This gives the Agent and the LLM a simple, unambiguous way to refer to specific elements.
5. **Builds a Structured Tree:** It organizes this information into a simplified tree structure (`element_tree`) that reflects the page layout but is much easier to process than the full DOM.
6. **Creates an Index Map:** It generates a `selector_map`, which is like an index in a book, mapping each `highlight_index` directly to its corresponding element node in the tree.

The final output is a `DOMState` object containing the simplified `element_tree` and the handy `selector_map`. This `DOMState` is then included in the `BrowserState` that `BrowserContext.get_state()` returns to the Agent.

## The Output: `DOMState` - The Agent's Map

The `DOMState` object produced by `DomService` has two main parts:

1. **`element_tree`:** This is the root of our simplified map, represented as a `DOMElementNode` object (defined in `dom/views.py`). Each node in the tree can be either an element (`DOMElementNode`) or a piece of text (`DOMTextNode`). `DOMElementNode`s contain information like the tag name (`<button>`, `<input>`), attributes (`aria-label="Search"`), visibility, interactivity, and importantly, the `highlight_index` if applicable. The tree structure helps understand the page layout (e.g., this button is inside that section).

    *Conceptual Example Tree:*

    ```
    <body> [no index]
     |-- <div> [no index]
     |    |-- <input aria-label="Search"> [highlight_index: 5]
     |    +-- <button> [highlight_index: 6]
     |         +-- "Google Search" (TextNode)
     +-- <a href="/images"> [highlight_index: 7]
          +-- "Images" (TextNode)
    ```

2. **`selector_map`:** This is a Python dictionary that acts as a quick lookup. It maps the integer `highlight_index` directly to the corresponding `DOMElementNode` object in the `element_tree`.

    *Conceptual Example Map:*

    ```python
    {
        5: <DOMElementNode tag_name='input', attributes={'aria-label': 'Search'}, ...>,
        6: <DOMElementNode tag_name='button', ...>,
        7: <DOMElementNode tag_name='a', attributes={'href': '/images'}, ...>
    }
    ```

This `selector_map` is incredibly useful because when the LLM decides "click element 5", the Agent can instantly find the correct `DOMElementNode` using `selector_map[5]` and tell the [Action Controller & Registry](05_action_controller___registry.md) exactly which element to interact with.

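To make the lookup concrete, here is a tiny, self-contained sketch. The dataclass below is illustrative only; it is *not* the real `DOMElementNode` from `dom/views.py`, which carries many more fields:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeElementNode:
    # Illustrative stand-in for the real DOMElementNode
    tag_name: str
    attributes: dict = field(default_factory=dict)
    highlight_index: Optional[int] = None


# A miniature selector_map like the conceptual example above
selector_map = {
    5: FakeElementNode("input", {"aria-label": "Search"}, 5),
    6: FakeElementNode("button", {}, 6),
    7: FakeElementNode("a", {"href": "/images"}, 7),
}

# When the LLM answers with {"input_text": {"index": 5, "text": "cute cats"}},
# the Agent can resolve that index straight to a concrete element:
target = selector_map[5]
print(f"Typing into <{target.tag_name}> with attributes {target.attributes}")
```
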
## How the Agent Uses the Map

The `Agent` takes the `DOMState` (usually simplifying the `element_tree` further into a text representation) and includes it in the information sent to the LLM. Remember the JSON response format from [Chapter 2](02_system_prompt.md)? The LLM uses the `highlight_index` from this map to specify actions:

```json
// LLM might receive a simplified text view like:
// "[5]<input aria-label='Search'>\n[6]<button>Google Search</button>\n[7]<a>Images</a>"

// And respond with:
{
  "current_state": {
    "evaluation_previous_goal": "...",
    "memory": "On Google homepage, need to search for cats.",
    "next_goal": "Type 'cute cats' into the search bar [5]."
  },
  "action": [
    {
      "input_text": {
        "index": 5, // <-- Uses the highlight_index from the DOM map!
        "text": "cute cats"
      }
    }
    // ... maybe press Enter action ...
  ]
}
```

## Code Example: Seeing the Map

We don't usually interact with `DomService` directly. Instead, we get its output via the `BrowserContext`. Let's revisit the example from Chapter 3 and see where the DOM representation fits:

```python
import asyncio

from browser_use import Browser, BrowserConfig, BrowserContext, BrowserContextConfig


async def main():
    browser_config = BrowserConfig(headless=False)
    browser = Browser(config=browser_config)
    context_config = BrowserContextConfig()

    async with browser.new_context(config=context_config) as browser_context:
        # Navigate to a page (e.g., Google)
        await browser_context.navigate_to("https://www.google.com")

        print("Getting current page state...")
        # This call uses DomService internally to generate the DOM representation
        current_state = await browser_context.get_state()

        print(f"\nCurrent Page URL: {current_state.url}")
        print(f"Current Page Title: {current_state.title}")

        # Accessing the DOM Representation parts within the BrowserState
        print("\n--- DOM Representation Details ---")
        # The element_tree is the root node of our simplified DOM map
        if current_state.element_tree:
            print(f"Root element tag of simplified tree: <{current_state.element_tree.tag_name}>")
        else:
            print("Element tree is empty.")

        # The selector_map provides direct access to interactive elements by index
        if current_state.selector_map:
            print(f"Number of interactive elements found: {len(current_state.selector_map)}")

            # Let's try to find the element the LLM might call [5] (often the search bar)
            example_index = 5  # Note: Indices can change depending on the page!
            if example_index in current_state.selector_map:
                element_node = current_state.selector_map[example_index]
                print(f"Element [{example_index}]: Tag=<{element_node.tag_name}>, Attributes={element_node.attributes}")
                # The Agent uses this node reference to perform actions
            else:
                print(f"Element [{example_index}] not found in the selector map for this page state.")
        else:
            print("No interactive elements found (selector map is empty).")

        # The Agent would typically convert element_tree into a compact text format
        # (using methods like element_tree.clickable_elements_to_string())
        # to send to the LLM along with the task instructions.

    print("\nBrowserContext closed.")
    await browser.close()
    print("Browser closed.")


# Run the asynchronous code
asyncio.run(main())
```

**What happens here?**

1. We set up the `Browser` and `BrowserContext`.
2. We navigate to Google.
3. `browser_context.get_state()` is called. **Internally**, this triggers the `DomService`.
4. `DomService` analyzes the Google page, finds interactive elements (like the search bar, buttons), assigns them `highlight_index` numbers, and builds the `element_tree` and `selector_map`.
5. This `DOMState` (containing the tree and map) is packaged into the `BrowserState` object returned by `get_state()`.
6. Our code then accesses `current_state.element_tree` and `current_state.selector_map` to peek at the map created by `DomService`.
7. We demonstrate looking up an element using its potential index (`selector_map[5]`).

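The comment at the end of the example mentions `element_tree.clickable_elements_to_string()`. That is the step that turns the tree into the compact, indexed text the LLM actually reads. A hedged sketch of that usage, assuming a `current_state` obtained as above (the output format shown in the comments is simplified):

```python
# Flatten the simplified DOM tree into an LLM-friendly text listing, roughly:
#
#   [5]<input aria-label='Search'>
#   [6]<button>Google Search</button>
#   [7]<a>Images</a>
#
# This text (plus the task and history) is what gets embedded into the next LLM message.
elements_text = current_state.element_tree.clickable_elements_to_string()
print(elements_text)
```
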
## How It Works Under the Hood: `DomService` in Action

Let's trace the flow when `BrowserContext.get_state()` is called:

```mermaid
sequenceDiagram
    participant Agent
    participant BC as BrowserContext
    participant DomService
    participant PlaywrightPage as Browser Page (JS Env)
    participant buildDomTree_js as buildDomTree.js

    Agent->>BC: get_state()
    Note over BC: Needs to analyze the page content
    BC->>DomService: get_clickable_elements(...)
    Note over DomService: Needs to run analysis script in browser
    DomService->>PlaywrightPage: evaluate(js_code='buildDomTree.js', args={...})
    Note over PlaywrightPage: Execute JavaScript code
    PlaywrightPage->>buildDomTree_js: Run analysis function
    Note over buildDomTree_js: Analyzes live DOM, finds visible & interactive elements, assigns highlight_index
    buildDomTree_js-->>PlaywrightPage: Return structured data (nodes, indices, map)
    PlaywrightPage-->>DomService: Return JS execution result (JSON-like data)
    Note over DomService: Process the raw data from JS
    DomService->>DomService: _construct_dom_tree(result)
    Note over DomService: Builds Python DOMElementNode tree and selector_map
    DomService-->>BC: Return DOMState (element_tree, selector_map)
    Note over BC: Combine DOMState with URL, title, screenshot etc.
    BC->>BC: Create BrowserState object
    BC-->>Agent: Return BrowserState (containing DOM map)
```

**Key Code Points:**

1. **`BrowserContext` calls `DomService`:** Inside `browser/context.py`, the `_update_state` method (called by `get_state`) initializes and uses the `DomService`:

```python
# --- File: browser/context.py (Simplified _update_state) ---
from browser_use.dom.service import DomService # Import the service
from browser_use.browser.views import BrowserState

class BrowserContext:
    # ... other methods ...
    async def _update_state(self) -> BrowserState:
        page = await self.get_current_page() # Get the active Playwright page object
        # ... error handling ...
        try:
            # 1. Create DomService instance for the current page
            dom_service = DomService(page)

            # 2. Call DomService to get the DOM map (DOMState)
            content_info = await dom_service.get_clickable_elements(
                highlight_elements=self.config.highlight_elements,
                viewport_expansion=self.config.viewport_expansion,
                # ... other options ...
            )

            # 3. Get other info (screenshot, URL, title etc.)
            screenshot_b64 = await self.take_screenshot()
            url = page.url
            title = await page.title()
            # ... gather more state ...

            # 4. Package everything into BrowserState
            browser_state = BrowserState(
                element_tree=content_info.element_tree, # <--- From DomService
                selector_map=content_info.selector_map, # <--- From DomService
                url=url,
                title=title,
                screenshot=screenshot_b64,
                # ... other state info ...
            )
            return browser_state
        except Exception as e:
            logger.error(f'Failed to update state: {str(e)}')
            raise # Or handle error
```

2. **`DomService` runs JavaScript:** Inside `dom/service.py`, the `_build_dom_tree` method executes the JavaScript code stored in `buildDomTree.js` within the browser page's context.

```python
# --- File: dom/service.py (Simplified _build_dom_tree) ---
import logging
from importlib import resources
# ... other imports ...

logger = logging.getLogger(__name__)

class DomService:
    def __init__(self, page: 'Page'):
        self.page = page
        # Load the JavaScript code from the file when DomService is created
        self.js_code = resources.read_text('browser_use.dom', 'buildDomTree.js')
        # ...

    async def _build_dom_tree(
        self, highlight_elements: bool, focus_element: int, viewport_expansion: int
    ) -> tuple[DOMElementNode, SelectorMap]:

        # Prepare arguments for the JavaScript function
        args = {
            'doHighlightElements': highlight_elements,
            'focusHighlightIndex': focus_element,
            'viewportExpansion': viewport_expansion,
            'debugMode': logger.getEffectiveLevel() == logging.DEBUG,
        }

        try:
            # Execute the JavaScript code in the browser page!
            # The JS code analyzes the live DOM and returns a structured result.
            eval_page = await self.page.evaluate(self.js_code, args)
        except Exception as e:
            logger.error('Error evaluating JavaScript: %s', e)
            raise

        # ... (optional debug logging) ...

        # Parse the result from JavaScript into Python objects
        return await self._construct_dom_tree(eval_page)

    async def _construct_dom_tree(self, eval_page: dict) -> tuple[DOMElementNode, SelectorMap]:
        # ... (logic to parse js_node_map from eval_page) ...
        # ... (loops through nodes, creates DOMElementNode/DOMTextNode objects) ...
        # ... (builds the tree structure by linking parents/children) ...
        # ... (populates the selector_map dictionary) ...
        # This uses the structures defined in dom/views.py
        # ...
        root_node = ... # Parsed root DOMElementNode
        selector_map = ... # Populated dictionary {index: DOMElementNode}
        return root_node, selector_map

    # ... other methods like get_clickable_elements ...
```

3. **`buildDomTree.js` (Conceptual):** This JavaScript file (located at `dom/buildDomTree.js` in the library) is the core map-making logic that runs *inside the browser*. It traverses the live DOM, checks element visibility and interactivity using browser APIs (like `element.getBoundingClientRect()`, `window.getComputedStyle()`, `document.elementFromPoint()`), assigns the `highlight_index`, and packages the results into a structured format that the Python `DomService` can understand. *We don't need to understand the JS code itself, just its purpose.*

4. **Python Data Structures (`DOMElementNode`, `DOMTextNode`):** The results from the JavaScript are parsed into Python objects defined in `dom/views.py`. These dataclasses (`DOMElementNode`, `DOMTextNode`) hold the information about each mapped element or text segment.

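The following is a rough, abridged sketch of what those dataclasses look like. The real definitions in `dom/views.py` carry more fields and helper methods, so treat the exact field list here as illustrative rather than authoritative.

```python
# --- Conceptual sketch of dom/views.py (abridged; field list illustrative) ---
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DOMTextNode:
    text: str                 # The visible text content of this node
    is_visible: bool = True

@dataclass
class DOMElementNode:
    tag_name: str             # e.g. "button", "input", "a"
    xpath: str                # Where the element lives in the page
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)   # Child DOMElementNode / DOMTextNode objects
    is_visible: bool = True
    is_interactive: bool = False
    highlight_index: Optional[int] = None           # The number the LLM refers to, e.g. [5]
```
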
## Conclusion

DOM Representation, primarily handled by the `DomService`, is crucial for bridging the gap between the complex reality of a webpage (the DOM) and the Agent/LLM's need for a simplified, actionable understanding. By creating a structured `element_tree` and an indexed `selector_map`, it provides a clear map of interactive landmarks on the page, identified by simple `highlight_index` numbers.

This map allows the LLM to make specific plans like "type into element [5]" or "click element [12]", which the Agent can then reliably translate into concrete actions.

Now that we understand how the Agent sees the page, how does it actually *perform* those actions like clicking or typing? In the next chapter, we'll explore the component responsible for executing the LLM's plan: the [Action Controller & Registry](05_action_controller___registry.md).

[Next Chapter: Action Controller & Registry](05_action_controller___registry.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/Browser Use/05_action_controller___registry.md

# Chapter 5: Action Controller & Registry - The Agent's Hands and Toolbox

In the [previous chapter](04_dom_representation.md), we saw how the `DomService` creates a simplified map (`DOMState`) of the webpage, allowing the Agent and its LLM planner to identify interactive elements like buttons and input fields using unique numbers (`highlight_index`). The LLM uses this map to decide *what* specific action to take next, like "click element [5]" or "type 'hello world' into element [12]".

But how does the program actually *do* that? How does the abstract idea "click element [5]" turn into a real click inside the browser window managed by the [BrowserContext](03_browsercontext.md)?

This is where the **Action Controller** and **Action Registry** come into play. They are the "hands" and "toolbox" that execute the Agent's decisions.

## What Problem Do They Solve?

Imagine you have a detailed instruction manual (the LLM's plan) for building a model car. The manual tells you exactly which piece to pick up (`index=5`) and what to do with it ("click" or "attach"). However, you still need:

1. **A Toolbox:** A collection of all the tools you might need (screwdriver, glue, pliers). You need to know what tools are available.
2. **A Mechanic:** Someone (or you!) who can read the instruction ("Use the screwdriver on screw #5"), select the correct tool from the toolbox, and skillfully use it on the specified part.

Without the toolbox and the mechanic, the instruction manual is useless.

Similarly, the `Browser Use` Agent needs:
1. **Action Registry (The Toolbox):** A defined list of all possible actions the Agent can perform (e.g., `click_element`, `input_text`, `scroll_down`, `go_to_url`, `done`). This registry also holds details about each action, like what parameters it needs (e.g., `click_element` needs an `index`).
2. **Action Controller (The Mechanic):** A component that takes the specific action requested by the LLM (e.g., "execute `click_element` with `index=5`"), finds the corresponding function (the "tool") in the Registry, ensures the request is valid, and then executes that function using the [BrowserContext](03_browsercontext.md) (the "car").

The Controller and Registry solve the problem of translating the LLM's high-level plan into concrete, executable browser operations in a structured and reliable way.

## Meet the Toolbox and the Mechanic

Let's break down these two closely related concepts:

### 1. Action Registry: The Toolbox (`controller/registry/service.py`)

Think of the `Registry` as a carefully organized toolbox. Each drawer is labeled with the name of a tool (an action like `click_element`), and inside, you find the tool itself (the actual code function) along with its instructions (description and required parameters).

* **Catalog of Actions:** It holds a dictionary where keys are action names (strings like `"click_element"`) and values are `RegisteredAction` objects containing:
    * The action's `name`.
    * A `description` (for humans and the LLM).
    * The actual Python `function` to call.
    * A `param_model` (a Pydantic model defining required parameters like `index` or `text`).
* **Informs the LLM:** The `Registry` can generate a description of all available actions and their parameters. This description is given to the LLM (as part of the [System Prompt](02_system_prompt.md)) so it knows exactly what "tools" it's allowed to ask the Agent to use.

### 2. Action Controller: The Mechanic (`controller/service.py`)

The `Controller` is the skilled mechanic who uses the tools from the Registry.

* **Receives Instructions:** It gets the action request from the Agent. This request typically comes in the form of an `ActionModel` object, which represents the LLM's JSON output (e.g., `{"click_element": {"index": 5}}`).
* **Selects the Tool:** It looks at the `ActionModel`, identifies the action name (`"click_element"`), and retrieves the corresponding `RegisteredAction` from the `Registry`.
* **Validates Parameters:** It uses the action's `param_model` (e.g., `ClickElementAction`) to check if the provided parameters (`{"index": 5}`) are correct.
* **Executes the Action:** It calls the actual Python function associated with the action (e.g., the `click_element` function), passing it the validated parameters and the necessary `BrowserContext` (so the function knows *which* browser tab to act upon).
* **Reports the Result:** The action function performs the task (e.g., clicking the element) and returns an `ActionResult` object, indicating whether it succeeded, failed, or produced some output. The Controller passes this result back to the Agent.

## Using the Controller: Executing an Action

In the Agent's main loop ([Chapter 1: Agent](01_agent.md)), after the LLM provides its plan as an `ActionModel`, the Agent simply hands this model over to the `Controller` to execute it.

```python
# --- Simplified Agent step calling the Controller ---
# Assume 'llm_response_model' is the ActionModel object parsed from LLM's JSON
# Assume 'self.controller' is the Controller instance
# Assume 'self.browser_context' is the current BrowserContext

# ... inside the Agent's step method ...

try:
    # Agent tells the Controller: "Execute this action!"
    action_result: ActionResult = await self.controller.act(
        action=llm_response_model, # The LLM's chosen action and parameters
        browser_context=self.browser_context # The browser tab to act within
        # Other context like LLMs for extraction might be passed too
    )

    # Agent receives the result from the Controller
    print(f"Action executed. Result: {action_result.extracted_content}")
    if action_result.is_done:
        print("Task marked as done by the action!")
    if action_result.error:
        print(f"Action encountered an error: {action_result.error}")

    # Agent records this result in the history ([Message Manager](06_message_manager.md))
    # ...

except Exception as e:
    print(f"Failed to execute action: {e}")
    # Handle the error
```

**What happens here?**

1. The Agent has received `llm_response_model` (e.g., representing `{"click_element": {"index": 5}}`).
2. It calls `self.controller.act()`, passing the action model and the active `browser_context`.
3. The `controller.act()` method handles looking up the `"click_element"` function in the `Registry`, validating the `index` parameter, and calling the function to perform the click within the `browser_context`.
4. The `click_element` function executes (interacting with the browser via `BrowserContext` methods).
5. It returns an `ActionResult` (e.g., `ActionResult(extracted_content="Clicked button with index 5")`).
6. The Agent receives this `action_result` and proceeds.

## How it Works Under the Hood: The Execution Flow

Let's trace the journey of an action request from the Agent to the browser click:

```mermaid
sequenceDiagram
    participant Agent
    participant Controller
    participant Registry
    participant ClickFunc as click_element Function
    participant BC as BrowserContext

    Note over Agent: LLM decided: click_element(index=5)
    Agent->>Controller: act(action={"click_element": {"index": 5}}, browser_context=BC)
    Note over Controller: Identify action and params
    Controller->>Controller: action_name = "click_element", params = {"index": 5}
    Note over Controller: Ask Registry for the tool
    Controller->>Registry: Get action definition for "click_element"
    Registry-->>Controller: Return RegisteredAction(name="click_element", function=ClickFunc, param_model=ClickElementAction, ...)
    Note over Controller: Validate params using param_model
    Controller->>Controller: ClickElementAction(index=5) # Validation OK
    Note over Controller: Execute the function
    Controller->>ClickFunc: ClickFunc(params=ClickElementAction(index=5), browser=BC)
    Note over ClickFunc: Perform the click via BrowserContext
    ClickFunc->>BC: Find element with index 5
    BC-->>ClickFunc: Element reference
    ClickFunc->>BC: Execute click on element
    BC-->>ClickFunc: Click successful
    ClickFunc-->>Controller: Return ActionResult(extracted_content="Clicked button...")
    Controller-->>Agent: Return ActionResult
```

This diagram shows the Controller orchestrating the process: receiving the request, consulting the Registry, validating, calling the specific action function, and returning the result.

## Diving Deeper into the Code

Let's peek at simplified versions of the key files.

### 1. Registering Actions (`controller/registry/service.py`)

Actions are typically registered using a decorator `@registry.action`.

```python
# --- File: controller/registry/service.py (Simplified Registry) ---
from typing import Any, Callable, Type
from pydantic import BaseModel
# Assume ActionModel, RegisteredAction are defined in views.py

class Registry:
    def __init__(self, exclude_actions: list[str] = []):
        self.registry: dict[str, RegisteredAction] = {}
        self.exclude_actions = exclude_actions
        # ... other initializations ...

    def _create_param_model(self, function: Callable) -> Type[BaseModel]:
        """Creates a Pydantic model from function signature (simplified)"""
        # ... (Inspects function signature to build a model) ...
        # Example: for func(index: int, text: str), creates a model
        # class func_parameters(ActionModel):
        #     index: int
        #     text: str
        # return func_parameters
        pass # Placeholder for complex logic

    def action(
        self,
        description: str,
        param_model: Type[BaseModel] | None = None,
    ):
        """Decorator for registering actions"""
        def decorator(func: Callable):
            if func.__name__ in self.exclude_actions: return func # Skip excluded

            # If no specific param_model provided, try to generate one
            actual_param_model = param_model # Or self._create_param_model(func) if needed

            # Ensure function is awaitable (async)
            wrapped_func = func # Assume func is already async for simplicity

            action = RegisteredAction(
                name=func.__name__,
                description=description,
                function=wrapped_func,
                param_model=actual_param_model,
            )
            self.registry[func.__name__] = action # Add to the toolbox!
            print(f"Action '{func.__name__}' registered.")
            return func
        return decorator

    def get_prompt_description(self) -> str:
        """Get a description of all actions for the prompt (simplified)"""
        descriptions = []
        for action in self.registry.values():
            # Format description for LLM (e.g., "click_element: Click element {index: {'type': 'integer'}}")
            descriptions.append(f"{action.name}: {action.description} {action.param_model.schema()}")
        return "\n".join(descriptions)

    async def execute_action(self, action_name: str, params: dict, browser, **kwargs) -> Any:
        """Execute a registered action (simplified)"""
        if action_name not in self.registry:
            raise ValueError(f"Action {action_name} not found")

        action = self.registry[action_name]
        try:
            # Validate params using the registered Pydantic model
            validated_params = action.param_model(**params)

            # Call the actual action function with validated params and browser context
            # Assumes function takes validated_params model and browser
            result = await action.function(validated_params, browser=browser, **kwargs)
            return result
        except Exception as e:
            raise RuntimeError(f"Error executing {action_name}: {e}") from e

```

This shows how the `@registry.action` decorator takes a function, its description, and parameter model, and stores them in the `registry` dictionary. `execute_action` is the core method used by the `Controller` to run a specific action.

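As a quick illustration of the decorator in use, here is a hypothetical custom action being added to the toolbox. The action itself (`save_note`) and its parameter model are invented for this sketch; only the registration pattern follows the simplified `Registry` shown above.

```python
from pydantic import BaseModel
# Assume the simplified Registry class from above is in scope.

class SaveNoteAction(BaseModel):
    text: str  # The note the LLM wants to record

registry = Registry()

@registry.action("Save a short note for later", param_model=SaveNoteAction)
async def save_note(params: SaveNoteAction, browser=None, **kwargs):
    # A real action would typically use the browser context; this one just
    # returns a string, which Controller.act wraps into an ActionResult.
    return f"Saved note: {params.text}"

# The new tool now sits in registry.registry next to click_element, input_text, etc.,
# and will be included in get_prompt_description() for the LLM.
```
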
### 2. Defining Action Parameters (`controller/views.py`)

Each action often has its own Pydantic model to define its expected parameters.

```python
# --- File: controller/views.py (Simplified Action Parameter Models) ---
from pydantic import BaseModel
from typing import Optional

# Example parameter model for the 'click_element' action
class ClickElementAction(BaseModel):
    index: int # The highlight_index of the element to click
    xpath: Optional[str] = None # Optional hint (usually index is enough)

# Example parameter model for the 'input_text' action
class InputTextAction(BaseModel):
    index: int # The highlight_index of the input field
    text: str # The text to type
    xpath: Optional[str] = None # Optional hint

# Example parameter model for the 'done' action (task completion)
class DoneAction(BaseModel):
    text: str # A final message or result
    success: bool # Was the overall task successful?

# ... other action models like GoToUrlAction, ScrollAction etc. ...
```

These models ensure that when the Controller receives parameters like `{"index": 5}`, it can validate that `index` is indeed an integer as required by `ClickElementAction`.

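To see that validation in action, here is a small sketch that feeds both good and bad parameters into `ClickElementAction` (assumed to be in scope from the snippet above). The exact wording of Pydantic's error details may vary between versions.

```python
from pydantic import ValidationError
# Assume ClickElementAction from the snippet above is in scope.

# Valid parameters: the model accepts them and exposes typed attributes.
ok = ClickElementAction(index=5)
print(ok.index)  # -> 5

# Invalid parameters: 'index' is missing, so Pydantic refuses to build the model
# before the Controller ever tries to click anything.
try:
    ClickElementAction(xpath="//button[@id='submit']")
except ValidationError as e:
    print("Rejected:", e.errors()[0]["loc"])  # -> ('index',)
```
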
### 3. The Controller Service (`controller/service.py`)

The `Controller` class ties everything together. It initializes the `Registry` and registers the default browser actions. Its main job is the `act` method.

```python
# --- File: controller/service.py (Simplified Controller) ---
import logging
from browser_use.agent.views import ActionModel, ActionResult # Input/Output types
from browser_use.browser.context import BrowserContext # Needed by actions
from browser_use.controller.registry.service import Registry # The toolbox
from browser_use.controller.views import ClickElementAction, InputTextAction, DoneAction # Param models

logger = logging.getLogger(__name__)

class Controller:
    def __init__(self, exclude_actions: list[str] = []):
        self.registry = Registry(exclude_actions=exclude_actions) # Initialize the toolbox

        # --- Register Default Actions ---
        # (Registration happens when Controller is created)

        @self.registry.action("Click element", param_model=ClickElementAction)
        async def click_element(params: ClickElementAction, browser: BrowserContext):
            logger.info(f"Attempting to click element index {params.index}")
            # --- Actual click logic using browser object ---
            element_node = await browser.get_dom_element_by_index(params.index)
            await browser._click_element_node(element_node) # Internal browser method
            # ---
            msg = f"🖱️ Clicked element with index {params.index}"
            return ActionResult(extracted_content=msg, include_in_memory=True)

        @self.registry.action("Input text into an element", param_model=InputTextAction)
        async def input_text(params: InputTextAction, browser: BrowserContext):
            logger.info(f"Attempting to type into element index {params.index}")
            # --- Actual typing logic using browser object ---
            element_node = await browser.get_dom_element_by_index(params.index)
            await browser._input_text_element_node(element_node, params.text) # Internal method
            # ---
            msg = f"⌨️ Input text into index {params.index}"
            return ActionResult(extracted_content=msg, include_in_memory=True)

        @self.registry.action("Complete task", param_model=DoneAction)
        async def done(params: DoneAction):
            logger.info(f"Task completion requested. Success: {params.success}")
            return ActionResult(is_done=True, success=params.success, extracted_content=params.text)

        # ... registration for scroll_down, go_to_url, etc. ...

    async def act(
        self,
        action: ActionModel, # The ActionModel from the LLM
        browser_context: BrowserContext, # The context to act within
        **kwargs # Other potential context (LLMs, etc.)
    ) -> ActionResult:
        """Execute an action defined in the ActionModel"""
        try:
            # ActionModel might look like: ActionModel(click_element=ClickElementAction(index=5))
            # model_dump gets {'click_element': {'index': 5}}
            action_data = action.model_dump(exclude_unset=True)

            for action_name, params in action_data.items():
                if params is not None:
                    logger.debug(f"Executing action: {action_name} with params: {params}")
                    # Call the registry's execute method
                    result = await self.registry.execute_action(
                        action_name=action_name,
                        params=params,
                        browser=browser_context, # Pass the essential context
                        **kwargs # Pass any other context needed by actions
                    )

                    # Ensure result is ActionResult or convert it
                    if isinstance(result, ActionResult): return result
                    if isinstance(result, str): return ActionResult(extracted_content=result)
                    return ActionResult() # Default empty result if action returned None

            logger.warning("ActionModel had no action to execute.")
            return ActionResult(error="No action specified in the model")

        except Exception as e:
            logger.error(f"Error during controller.act: {e}", exc_info=True)
            return ActionResult(error=str(e)) # Return error in ActionResult
```

The `Controller` registers all the standard browser actions during initialization. The `act` method then dynamically finds and executes the requested action using the `Registry`.

## Conclusion

The **Action Registry** acts as the definitive catalog or "toolbox" of all operations the `Browser Use` Agent can perform. The **Action Controller** is the "mechanic" that interprets the LLM's plan, selects the appropriate tool from the Registry, and executes it within the specified [BrowserContext](03_browsercontext.md).

Together, they provide a robust and extensible way to translate high-level instructions into low-level browser interactions, forming the crucial link between the Agent's "brain" (LLM planner) and its "hands" (browser manipulation).

Now that we know how actions are chosen and executed, how does the Agent keep track of the conversation with the LLM, including the history of states observed and actions taken? We'll explore this in the next chapter on the [Message Manager](06_message_manager.md).

[Next Chapter: Message Manager](06_message_manager.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/Browser Use/06_message_manager.md

# Chapter 6: Message Manager - Keeping the Conversation Straight

In the [previous chapter](05_action_controller___registry.md), we learned how the `Action Controller` and `Registry` act as the Agent's "hands" and "toolbox", executing the specific actions decided by the LLM planner. But how does the LLM get all the information it needs to make those decisions in the first place? How does the Agent keep track of the ongoing conversation, including what it "saw" on the page and what happened after each action?

Imagine you're having a long, multi-step discussion with an assistant about a complex task. If the assistant has a poor memory, they might forget earlier instructions, the current status, or previous results, making it impossible to proceed correctly. LLMs face a similar challenge: they need the conversation history for context, but they have a limited memory (called the "context window").

This is the problem the **Message Manager** solves.

## What Problem Does the Message Manager Solve?

The `Agent` needs to have a conversation with the LLM. This conversation isn't just chat; it includes:

1. **Initial Instructions:** The core rules from the [System Prompt](02_system_prompt.md).
2. **The Task:** The overall goal the Agent needs to achieve.
3. **Observations:** What the Agent currently "sees" in the browser ([BrowserContext](03_browsercontext.md) state, including the [DOM Representation](04_dom_representation.md)).
4. **Action Results:** What happened after the last action was performed ([Action Controller & Registry](05_action_controller___registry.md)).
5. **LLM's Plan:** The sequence of actions the LLM decided on.

The Message Manager solves several key problems:

* **Organizes History:** It structures the conversation chronologically, keeping track of who said what (System, User/Agent State, AI/LLM Plan).
* **Formats Messages:** It ensures the browser state, action results, and even images are formatted correctly so the LLM can understand them.
* **Tracks Size:** It keeps count of the "tokens" (roughly, words or parts of words) used in the conversation history.
* **Manages Limits:** It helps prevent the conversation history from exceeding the LLM's context window limit, potentially by trimming the conversation if it gets too long.

Think of the `MessageManager` as a meticulous secretary for the Agent-LLM conversation. It takes clear, concise notes, presents the current situation accurately, and ensures the conversation doesn't ramble on for too long, keeping everything within the LLM's "attention span".

## Meet the Message Manager: The Conversation Secretary

The `MessageManager` (found in `agent/message_manager/service.py`) is responsible for managing the list of messages that are sent to the LLM in each step.

Here are its main jobs:

1. **Initialization:** When the `Agent` starts, the `MessageManager` is created. It immediately adds the foundational messages:
    * The `SystemMessage` containing the rules from the [System Prompt](02_system_prompt.md).
    * A `HumanMessage` stating the overall `task`.
    * Other initial setup messages (like examples or sensitive data placeholders).
2. **Adding Browser State:** Before asking the LLM what to do next, the `Agent` gets the current `BrowserState`. It then tells the `MessageManager` to add this information as a `HumanMessage`. This message includes the simplified DOM map, the current URL, and potentially a screenshot (if `use_vision` is enabled). It also includes the results (`ActionResult`) from the *previous* step, so the LLM knows what happened last.
3. **Adding LLM Output:** After the LLM responds with its plan (`AgentOutput`), the `Agent` tells the `MessageManager` to add this plan as an `AIMessage`. This typically includes the LLM's reasoning and the list of actions to perform.
4. **Adding Action Results (Indirectly):** The results from the `Controller.act` call (`ActionResult`) aren't added as separate messages *after* the action. Instead, they are included in the *next* `HumanMessage` that contains the browser state (see step 2). This keeps the context tight: "Here's the current page, and here's what happened right before we got here."
5. **Providing Messages to LLM:** When the `Agent` is ready to call the LLM, it asks the `MessageManager` for the current conversation history (`get_messages()`).
6. **Token Management:** Every time a message is added, the `MessageManager` calculates how many tokens it adds (`_count_tokens`) and updates the total. If the total exceeds the limit (`max_input_tokens`), it might trigger a truncation strategy (`cut_messages`) to shorten the history, usually by first dropping the screenshot from the most recent browser-state message and then trimming its text.

## How the Agent Uses the Message Manager

Let's revisit the simplified `Agent.step` method from [Chapter 1](01_agent.md) and highlight the `MessageManager` interactions (using `self._message_manager`):

```python
# --- File: agent/service.py (Simplified step method - Highlighting MessageManager) ---
class Agent:
    # ... (init, run) ...
    async def step(self, step_info: Optional[AgentStepInfo] = None) -> None:
        logger.info(f"📍 Step {self.state.n_steps}")
        state = None
        model_output = None
        result: list[ActionResult] = []

        try:
            # 1. Get current state from the browser
            state = await self.browser_context.get_state() # Uses BrowserContext

            # 2. Add state + PREVIOUS result to message history via MessageManager
            # 'self.state.last_result' holds the outcome of the *previous* step's action
            self._message_manager.add_state_message(
                state,
                self.state.last_result, # Result from previous action
                step_info,
                self.settings.use_vision # Tell it whether to include image
            )

            # 3. Get the complete, formatted message history for the LLM
            input_messages = self._message_manager.get_messages()

            # 4. Get LLM's decision on the next action(s)
            model_output = await self.get_next_action(input_messages) # Calls the LLM

            # --- Agent increments step counter ---
            self.state.n_steps += 1

            # 5. Remove the potentially large state message before adding the compact AI response
            # (This is an optimization mentioned in the provided code)
            self._message_manager._remove_last_state_message()

            # 6. Add the LLM's response (the plan) to the history
            self._message_manager.add_model_output(model_output)

            # 7. Execute the action(s) using the Controller
            result = await self.multi_act(model_output.action) # Uses Controller

            # 8. Store the result of THIS action. It will be used in the *next* step's
            # call to self._message_manager.add_state_message()
            self.state.last_result = result

            # ... (Record step details, handle success/failure) ...

        except Exception as e:
            # Handle errors...
            result = await self._handle_step_error(e)
            self.state.last_result = result
        # ... (finally block) ...
```

This flow shows the cycle: add state/previous result -> get messages -> call LLM -> add LLM response -> execute action -> store result for *next* state message.

## How it Works Under the Hood: Managing the Flow

Let's visualize the key interactions during one step of the Agent loop involving the `MessageManager`:

```mermaid
sequenceDiagram
    participant Agent
    participant BC as BrowserContext
    participant MM as MessageManager
    participant LLM
    participant Controller

    Note over Agent: Start of step
    Agent->>BC: get_state()
    BC-->>Agent: Current BrowserState (DOM map, URL, screenshot?)
    Note over Agent: Have BrowserState and `last_result` from previous step
    Agent->>MM: add_state_message(BrowserState, last_result)
    MM->>MM: Format state/result into HumanMessage (with text/image)
    MM->>MM: Calculate tokens for new message
    MM->>MM: Add HumanMessage to internal history list
    MM->>MM: Update total token count
    MM->>MM: Check token limit, potentially call cut_messages()
    Note over Agent: Ready to ask LLM
    Agent->>MM: get_messages()
    MM-->>Agent: Return List[BaseMessage] (System, Task, State1, Plan1, State2...)
    Agent->>LLM: Invoke LLM with message list
    LLM-->>Agent: LLM Response (AgentOutput containing plan)
    Note over Agent: Got LLM's plan
    Agent->>MM: _remove_last_state_message() # Optimization
    MM->>MM: Remove last (large) HumanMessage from list
    Agent->>MM: add_model_output(AgentOutput)
    MM->>MM: Format plan into AIMessage (with tool calls)
    MM->>MM: Calculate tokens for AIMessage
    MM->>MM: Add AIMessage to internal history list
    MM->>MM: Update total token count
    Note over Agent: Ready to execute plan
    Agent->>Controller: multi_act(AgentOutput.action)
    Controller-->>Agent: List[ActionResult] (Result of this step's actions)
    Agent->>Agent: Store ActionResult in `self.state.last_result` (for next step)
    Note over Agent: End of step
```

This shows how `MessageManager` sits between the Agent, the Browser State, and the LLM, managing the history list and token counts.

## Diving Deeper into the Code (`agent/message_manager/service.py`)

Let's look at simplified versions of key methods in `MessageManager`.

**1. Initialization (`__init__` and `_init_messages`)**

When the `Agent` creates the `MessageManager`, it passes the task and the already-formatted `SystemMessage`.

```python
# --- File: agent/message_manager/service.py (Simplified __init__) ---
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage, ToolMessage
# ... other imports ...
from browser_use.agent.views import MessageManagerState # Internal state storage
from browser_use.agent.message_manager.views import MessageMetadata, ManagedMessage # Message wrapper

class MessageManager:
    def __init__(
        self,
        task: str,
        system_message: SystemMessage, # Received from Agent
        settings: MessageManagerSettings = MessageManagerSettings(),
        state: MessageManagerState = MessageManagerState(), # Stores history
    ):
        self.task = task
        self.settings = settings # Max tokens, image settings, etc.
        self.state = state # Holds the 'history' object
        self.system_prompt = system_message

        # Only initialize if history is empty (e.g., not resuming from saved state)
        if len(self.state.history.messages) == 0:
            self._init_messages()

    def _init_messages(self) -> None:
        """Add the initial fixed messages to the history."""
        # Add the main system prompt (rules)
        self._add_message_with_tokens(self.system_prompt)

        # Add the user's task
        task_message = HumanMessage(
            content=f'Your ultimate task is: """{self.task}"""...'
        )
        self._add_message_with_tokens(task_message)

        # Add other setup messages (context, sensitive data info, examples)
        # ... (simplified - see full code for details) ...

        # Example: Add a placeholder for where the main history begins
        placeholder_message = HumanMessage(content='[Your task history memory starts here]')
        self._add_message_with_tokens(placeholder_message)
```

This sets up the foundational context for the LLM.

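As a rough illustration (the wording of each entry is abbreviated and invented here), the list returned by `get_messages()` right before the very first LLM call might look like this:

```python
# Illustrative only: roles and shortened contents of the managed history
# after _init_messages() plus the first add_state_message() call.
# Real entries are LangChain message objects with much longer content.
history_sketch = [
    ("system", "You are an AI agent designed to automate browser tasks... (rules)"),
    ("human",  'Your ultimate task is: """<the user task>"""...'),
    ("human",  "[Your task history memory starts here]"),
    ("human",  "Current url: https://example.com ... [5]<button>Submit</button> ... (state + previous result)"),
]
for role, summary in history_sketch:
    print(f"{role:>6}: {summary}")
```

After the LLM answers, its plan is appended as an `AIMessage` (plus an empty `ToolMessage`), and the cycle repeats on the next step.
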
**2. Adding Browser State (`add_state_message`)**

This method takes the current `BrowserState` and the previous `ActionResult`, formats them into a `HumanMessage` (potentially multi-modal with image and text parts), and adds it to the history.

```python
# --- File: agent/message_manager/service.py (Simplified add_state_message) ---
# ... imports ...
from browser_use.browser.views import BrowserState
from browser_use.agent.views import ActionResult, AgentStepInfo
from browser_use.agent.prompts import AgentMessagePrompt # Helper to format state

class MessageManager:
    # ... (init) ...

    def add_state_message(
        self,
        state: BrowserState, # The current view of the browser
        result: Optional[List[ActionResult]] = None, # Result from *previous* action
        step_info: Optional[AgentStepInfo] = None,
        use_vision=True, # Flag to include screenshot
    ) -> None:
        """Add browser state and previous result as a human message."""

        # Add any 'memory' messages from the previous result first (if any)
        if result:
            for r in result:
                if r.include_in_memory and (r.extracted_content or r.error):
                    content = f"Action result: {r.extracted_content}" if r.extracted_content else f"Action error: {r.error}"
                    msg = HumanMessage(content=content)
                    self._add_message_with_tokens(msg)
                    result = None # Don't include again in the main state message

        # Use a helper class to format the BrowserState (+ optional remaining result)
        # into the correct message structure (text + optional image)
        state_prompt = AgentMessagePrompt(
            state,
            result, # Pass any remaining result info
            include_attributes=self.settings.include_attributes,
            step_info=step_info,
        )
        # Get the formatted message (could be complex list for vision)
        state_message = state_prompt.get_user_message(use_vision)

        # Add the formatted message (with token calculation) to history
        self._add_message_with_tokens(state_message)

```

**3. Adding Model Output (`add_model_output`)**

This takes the LLM's plan (`AgentOutput`) and formats it as an `AIMessage` with the specific "tool calls" structure that many models expect.

```python
# --- File: agent/message_manager/service.py (Simplified add_model_output) ---
# ... imports ...
from browser_use.agent.views import AgentOutput

class MessageManager:
    # ... (init, add_state_message) ...

    def add_model_output(self, model_output: AgentOutput) -> None:
        """Add model output (the plan) as an AI message with tool calls."""
        # Format the output according to OpenAI's tool calling standard
        tool_calls = [
            {
                'name': 'AgentOutput', # The 'tool' name
                'args': model_output.model_dump(mode='json', exclude_unset=True), # The LLM's JSON output
                'id': str(self.state.tool_id), # Unique ID for the call
                'type': 'tool_call',
            }
        ]

        # Create the AIMessage containing the tool calls
        msg = AIMessage(
            content='', # Content is often empty when using tool calls
            tool_calls=tool_calls,
        )

        # Add it to history
        self._add_message_with_tokens(msg)

        # Add a corresponding empty ToolMessage (required by some models)
        self.add_tool_message(content='') # Content depends on tool execution result

    def add_tool_message(self, content: str) -> None:
        """Add tool message to history (often confirms tool call receipt/result)"""
        # ToolMessage links back to the AIMessage's tool_call_id
        msg = ToolMessage(content=content, tool_call_id=str(self.state.tool_id))
        self.state.tool_id += 1 # Increment for next potential tool call
        self._add_message_with_tokens(msg)
```

**4. Adding Messages and Counting Tokens (`_add_message_with_tokens`, `_count_tokens`)**

This is the core function called by others to add any message to the history, ensuring token counts are tracked.

```python
# --- File: agent/message_manager/service.py (Simplified _add_message_with_tokens) ---
# ... imports ...
from langchain_core.messages import BaseMessage
from browser_use.agent.message_manager.views import MessageMetadata, ManagedMessage

class MessageManager:
    # ... (other methods) ...

    def _add_message_with_tokens(self, message: BaseMessage, position: int | None = None) -> None:
        """Internal helper to add any message with its token count metadata."""

        # 1. Optionally filter sensitive data (replace actual data with placeholders)
        # if self.settings.sensitive_data:
        #     message = self._filter_sensitive_data(message) # Simplified

        # 2. Count the tokens in the message
        token_count = self._count_tokens(message)

        # 3. Create metadata object
        metadata = MessageMetadata(tokens=token_count)

        # 4. Add the message and its metadata to the history list
        # (self.state.history is a MessageHistory object)
        self.state.history.add_message(message, metadata, position)
        # Note: self.state.history.add_message also updates the total token count

        # 5. Check if history exceeds token limit and truncate if needed
        self.cut_messages() # Check and potentially trim history

    def _count_tokens(self, message: BaseMessage) -> int:
        """Estimate tokens in a message."""
        tokens = 0
        if isinstance(message.content, list): # Multi-modal (text + image)
            for item in message.content:
                if isinstance(item, dict) and 'image_url' in item:
                    # Add fixed cost for images
                    tokens += self.settings.image_tokens
                elif isinstance(item, dict) and 'text' in item:
                    # Estimate tokens based on text length
                    tokens += len(item['text']) // self.settings.estimated_characters_per_token
        elif isinstance(message.content, str): # Text message
            text = message.content
            if hasattr(message, 'tool_calls'): # Add tokens for tool call structure
                text += str(getattr(message, 'tool_calls', ''))
            tokens += len(text) // self.settings.estimated_characters_per_token

        return tokens

    def cut_messages(self):
        """Trim messages if total tokens exceed the limit."""
        # Calculate how many tokens we are over the limit
        diff = self.state.history.current_tokens - self.settings.max_input_tokens
        if diff <= 0:
            return # We are within limits

        logger.debug(f"Token limit exceeded by {diff}. Trimming history.")

        # Strategy:
        # 1. Try removing the image from the *last* (most recent) state message if present.
        #    (Code logic finds the last message, checks content list, removes image item, updates counts)
        # ... (Simplified - see full code for image removal logic) ...

        # 2. If still over limit after image removal (or no image was present),
        #    trim text content from the *end* of the last state message.
        #    Calculate proportion to remove, shorten string, create new message.
        # ... (Simplified - see full code for text trimming logic) ...

        # Ensure we don't get stuck if trimming isn't enough (raise error)
        if self.state.history.current_tokens > self.settings.max_input_tokens:
            raise ValueError("Max token limit reached even after trimming.")

```

This shows the basic mechanics of adding messages, calculating their approximate size, and applying strategies to keep the history within the LLM's context window limit.

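To make the estimation concrete, here is a tiny worked example using the same heuristic. The two constants (roughly 3 characters per token and a flat cost of 800 tokens per screenshot) are assumptions for illustration and may differ from the library's actual settings.

```python
# Rough token estimate for one multi-modal state message, mirroring the
# heuristic in _count_tokens above. The constants are illustrative defaults.
ESTIMATED_CHARACTERS_PER_TOKEN = 3   # assumption
IMAGE_TOKENS = 800                   # assumption: flat cost per screenshot

state_text = "Current url: https://example.com\n[5]<button>Submit</button>\n" * 100
text_tokens = len(state_text) // ESTIMATED_CHARACTERS_PER_TOKEN
total_tokens = text_tokens + IMAGE_TOKENS  # text part + one screenshot

print(f"~{text_tokens} text tokens + {IMAGE_TOKENS} image tokens = ~{total_tokens} total")
# If this pushed the history over max_input_tokens, cut_messages() would first
# drop the screenshot (saving IMAGE_TOKENS) and then trim text if still needed.
```
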
## Conclusion

The `MessageManager` is the Agent's conversation secretary. It meticulously records the dialogue between the Agent (reporting browser state and action results) and the LLM (providing analysis and action plans), starting from the initial `System Prompt` and task definition.

Crucially, it formats these messages correctly, tracks the conversation's size using token counts, and implements strategies to keep the history concise enough for the LLM's limited context window. Without the `MessageManager`, the Agent would quickly lose track of the conversation, and the LLM wouldn't have the necessary context to guide the browser effectively.

Many of the objects managed and passed around by the `MessageManager`, like `BrowserState`, `ActionResult`, and `AgentOutput`, are defined as specific data structures. In the next chapter, we'll take a closer look at these important **Data Structures (Views)**.

[Next Chapter: Data Structures (Views)](07_data_structures__views_.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/Browser Use/07_data_structures__views_.md

# Chapter 7: Data Structures (Views) - The Project's Blueprints

In the [previous chapter](06_message_manager.md), we saw how the `MessageManager` acts like a secretary, carefully organizing the conversation between the [Agent](01_agent.md) and the LLM. It manages different pieces of information – the browser's current state, the LLM's plan, the results of actions, and more.

But how do all these different components – the Agent, the LLM parser, the [BrowserContext](03_browsercontext.md), the [Action Controller & Registry](05_action_controller___registry.md), and the [Message Manager](06_message_manager.md) – ensure they understand each other perfectly? If the LLM gives a plan in one format, and the Controller expects it in another, things will break!

Imagine trying to build furniture using instructions written in a language you don't fully understand, or trying to fill out a form where every section uses a different layout. It would be confusing and error-prone. We need a shared, consistent language and format.

This is where **Data Structures (Views)** come in. They act as the official blueprints or standardized forms for all the important information passed around within the `Browser Use` project.

## What Problem Do Data Structures Solve?

In a complex system like `Browser Use`, many components need to exchange data:

* The [BrowserContext](03_browsercontext.md) needs to package up the current state of the webpage.
* The [Agent](01_agent.md) needs to understand the LLM's multi-step plan.
* The [Action Controller & Registry](05_action_controller___registry.md) needs to know exactly which action to perform and with what specific parameters (like which element index to click).
* The Controller needs to report back the result of an action in a predictable way.

Without a standard format for each piece of data, you might encounter problems like:

* Misinterpreting data (e.g., is `5` an element index or a quantity?).
* Missing required information.
* Inconsistent naming (`element_id` vs `index` vs `element_number`).
* Difficulty debugging when data looks different every time.

Data Structures (Views) solve this by defining **strict, consistent blueprints** for the data. Everyone agrees to use these blueprints, ensuring smooth communication and preventing errors.

## Meet Pydantic: The Blueprint Maker and Checker

In `Browser Use`, these blueprints are primarily defined using a popular Python library called **Pydantic**.

Think of Pydantic like a combination of:

1. **A Blueprint Designer:** It provides an easy way to define the structure of your data using standard Python type hints (like `str` for text, `int` for whole numbers, `bool` for True/False, `list` for lists).
2. **A Quality Inspector:** When data comes in (e.g., from the LLM or from an action's result), Pydantic automatically checks if it matches the blueprint. Does it have all the required fields? Are the data types correct? If not, Pydantic raises an error, stopping bad data before it causes problems later.

These Pydantic models (our blueprints) are often stored in files named `views.py` within different component directories (like `agent/views.py`, `browser/views.py`), which is why we sometimes call them "Views".

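Before looking at the project's own blueprints, here is a tiny self-contained sketch of both roles. The `Profile` model is invented purely for this illustration; it is not part of `Browser Use`.

```python
from pydantic import BaseModel, ValidationError

# Blueprint designer: declare the shape of the data with plain type hints.
class Profile(BaseModel):
    name: str
    age: int

# Quality inspector: well-formed data passes...
print(Profile(name="Ada", age=36))  # name='Ada' age=36

# ...while malformed data is rejected before it can cause trouble downstream.
try:
    Profile(name="Ada", age="not a number")
except ValidationError as e:
    print("Rejected:", e.errors()[0]["msg"])
```
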
## Key Blueprints in `Browser Use`

Let's look at some of the most important data structures used in the project. Don't worry about memorizing every detail; focus on *what kind* of information each blueprint holds and *who* uses it.

*(Note: These are simplified representations. The actual models might have more fields or features.)*

### 1. `BrowserState` (from `browser/views.py`)

* **Purpose:** Represents a complete snapshot of the browser's state at a specific moment.
* **Blueprint Contents (Simplified):**
    * `url`: The current web address (string).
    * `title`: The title of the webpage (string).
    * `element_tree`: The simplified map of the webpage content (from [DOM Representation](04_dom_representation.md)).
    * `selector_map`: The lookup map for interactive elements (from [DOM Representation](04_dom_representation.md)).
    * `screenshot`: An optional image of the page (string, base64 encoded).
    * `tabs`: Information about other open tabs in this context (list).
* **Who Uses It:**
    * Created by: [BrowserContext](03_browsercontext.md) (`get_state()` method).
    * Used by: [Agent](01_agent.md) (to see the current situation), [Message Manager](06_message_manager.md) (to store in history).

```python
# --- Conceptual Pydantic Model ---
# File: browser/views.py (Simplified Example)
from pydantic import BaseModel
from typing import Optional, List, Dict # For type hints
# Assume DOMElementNode and TabInfo are defined elsewhere

class BrowserState(BaseModel):
    url: str
    title: str
    element_tree: Optional[object] # Simplified: Actual type is DOMElementNode
    selector_map: Optional[Dict[int, object]] # Simplified: Actual type is SelectorMap
    screenshot: Optional[str] = None # Optional field
    tabs: List[object] = [] # Simplified: Actual type is TabInfo

# Pydantic ensures that when a BrowserState is created,
# 'url' and 'title' MUST be provided as strings.
```

### 2. `ActionModel` (from `controller/registry/views.py`)

* **Purpose:** Represents a *single* specific action the LLM wants to perform, including its parameters. This model is often created *dynamically* based on the actions available in the [Action Controller & Registry](05_action_controller___registry.md).
* **Blueprint Contents (Example for `click_element`):**
    * `index`: The `highlight_index` of the element to click (integer).
    * `xpath`: An optional hint about the element's location (string).
* **Blueprint Contents (Example for `input_text`):**
    * `index`: The `highlight_index` of the input field (integer).
    * `text`: The text to type (string).
* **Who Uses It:**
    * Defined by/Registered in: [Action Controller & Registry](05_action_controller___registry.md).
    * Created based on: LLM output (often part of `AgentOutput`).
    * Used by: [Action Controller & Registry](05_action_controller___registry.md) (to validate parameters and know what function to call).

```python
# --- Conceptual Pydantic Models ---
# File: controller/views.py (Simplified Examples)
from pydantic import BaseModel
from typing import Optional

class ClickElementAction(BaseModel):
    index: int
    xpath: Optional[str] = None # Optional hint

class InputTextAction(BaseModel):
    index: int
    text: str
    xpath: Optional[str] = None # Optional hint

# Base model that dynamically holds ONE of the above actions
class ActionModel(BaseModel):
    # Pydantic allows models like this where only one field is expected
    # e.g., ActionModel(click_element=ClickElementAction(index=5))
    # or ActionModel(input_text=InputTextAction(index=12, text="hello"))
    click_element: Optional[ClickElementAction] = None
    input_text: Optional[InputTextAction] = None
    # ... fields for other possible actions (scroll, done, etc.) ...
    pass # More complex logic handles ensuring only one action is present
```

### 3. `AgentOutput` (from `agent/views.py`)

* **Purpose:** Represents the complete plan received from the LLM after it analyzes the current state. This is the structure the [System Prompt](02_system_prompt.md) tells the LLM to follow.
* **Blueprint Contents (Simplified):**
    * `current_state`: The LLM's thoughts/reasoning (a nested structure, often called `AgentBrain`).
    * `action`: A *list* of one or more `ActionModel` objects representing the steps the LLM wants to take.
* **Who Uses It:**
    * Created by: The [Agent](01_agent.md) parses the LLM's raw JSON output into this structure.
    * Used by: [Agent](01_agent.md) (to understand the plan), [Message Manager](06_message_manager.md) (to store the plan in history), [Action Controller & Registry](05_action_controller___registry.md) (reads the `action` list).

```python
# --- Conceptual Pydantic Model ---
# File: agent/views.py (Simplified Example)
from pydantic import BaseModel
from typing import List
# Assume ActionModel and AgentBrain are defined elsewhere

class AgentOutput(BaseModel):
    current_state: object # Simplified: Actual type is AgentBrain
    action: List[ActionModel] # A list of actions to execute

# Pydantic ensures the LLM output MUST have 'current_state' and 'action',
# and that 'action' MUST be a list containing valid ActionModel objects.
```

### 4. `ActionResult` (from `agent/views.py`)
|
||||
|
||||
* **Purpose:** Represents the outcome after the [Action Controller & Registry](05_action_controller___registry.md) attempts to execute a single action.
|
||||
* **Blueprint Contents (Simplified):**
|
||||
* `is_done`: Did this action signal the end of the overall task? (boolean, optional).
|
||||
* `success`: If done, was the task successful overall? (boolean, optional).
|
||||
* `extracted_content`: Any text result from the action (e.g., "Clicked button X") (string, optional).
|
||||
* `error`: Any error message if the action failed (string, optional).
|
||||
* `include_in_memory`: Should this result be explicitly shown to the LLM next time? (boolean).
|
||||
* **Who Uses It:**
|
||||
* Created by: Functions within the [Action Controller & Registry](05_action_controller___registry.md) (like `click_element`).
|
||||
* Used by: [Agent](01_agent.md) (to check status, record results), [Message Manager](06_message_manager.md) (includes info in the next state message sent to LLM).
|
||||
|
||||
```python
|
||||
# --- Conceptual Pydantic Model ---
|
||||
# File: agent/views.py (Simplified Example)
|
||||
from pydantic import BaseModel
|
||||
from typing import Optional
|
||||
|
||||
class ActionResult(BaseModel):
|
||||
is_done: Optional[bool] = False
|
||||
success: Optional[bool] = None
|
||||
extracted_content: Optional[str] = None
|
||||
error: Optional[str] = None
|
||||
include_in_memory: bool = False # Default to False
|
||||
|
||||
# Pydantic helps ensure results are consistently structured.
|
||||
# For example, 'is_done' must be True or False if provided.
|
||||
```
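
To see how this blueprint is typically filled in, here is an illustrative sketch (not the library's actual handler code) of a result being produced by an action and then inspected by the Agent:

```python
# --- Illustrative only: how a result might be produced and consumed ---

# A controller function reports what happened:
result = ActionResult(
    extracted_content="Clicked button with index 5",
    include_in_memory=True,  # worth showing to the LLM next step
)

# ...and a failing one:
failed = ActionResult(error="Element with index 99 not found")

# The Agent can then branch on the outcome:
for r in (result, failed):
    if r.error:
        print("Action failed:", r.error)
    elif r.is_done:
        print("Task finished, success =", r.success)
    else:
        print("Action ok:", r.extracted_content)
```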

## The Power of Blueprints: Ensuring Consistency

Using Pydantic models for these data structures provides a huge benefit: **automatic validation**.

Imagine the LLM sends back a plan, but it forgets to include the `index` for a `click_element` action.

```json
// Bad LLM Response (Missing 'index')
{
  "current_state": { ... },
  "action": [
    {
      "click_element": {
        "xpath": "//button[@id='submit']" // 'index' is missing!
      }
    }
  ]
}
```

When the [Agent](01_agent.md) tries to parse this JSON into the `AgentOutput` Pydantic model, Pydantic will immediately notice that the `index` field (which is required by the `ClickElementAction` blueprint) is missing. It will raise a `ValidationError`.

```python
# --- Conceptual Agent Code ---
import pydantic

# Assume AgentOutput is the Pydantic model defined earlier
# Assume 'llm_json_response' contains the bad JSON from above

try:
    # Try to create the AgentOutput object from the LLM's response
    llm_plan = AgentOutput.model_validate_json(llm_json_response)
    # If validation succeeds, proceed...
    print("LLM Plan Validated:", llm_plan)
except pydantic.ValidationError as e:
    # Pydantic catches the error!
    print("Validation Error: The LLM response didn't match the blueprint!")
    print(e)
    # The Agent can now handle this error gracefully,
    # maybe asking the LLM to try again, instead of crashing later.
```

This automatic checking catches errors early, preventing the [Action Controller & Registry](05_action_controller___registry.md) from receiving incomplete instructions and making the whole system much more robust and easier to debug. It enforces the "contract" between different components.

## Under the Hood: Simple Classes

These data structures are simply Python classes, mostly inheriting from `pydantic.BaseModel` or defined using Python's built-in `dataclass`. They don't contain complex logic themselves; their main job is to define the *shape* and *type* of the data. You'll find their definitions scattered across the various `views.py` files within the project's component directories (like `agent/`, `browser/`, `controller/`, `dom/`).

Think of them as the official vocabulary and grammar rules that all the components agree to use when communicating.
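
Because they hold data rather than behaviour, any of these views can be dumped straight to a dictionary or JSON string, which is what makes them convenient to record in history and in logs. A minimal sketch using the `ActionResult` blueprint from above:

```python
# --- Illustrative only: views are plain data, so they serialize trivially ---
result = ActionResult(extracted_content="Clicked button with index 5", include_in_memory=True)

print(result.model_dump())       # a plain dict: {'is_done': False, 'success': None, ...}
print(result.model_dump_json())  # the same data as a JSON string, ready for a log entry
```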

## Conclusion

Data Structures (Views), primarily defined using Pydantic models, are the essential blueprints that ensure consistent and reliable communication within the `Browser Use` project. They act like standardized forms for `BrowserState`, `AgentOutput`, `ActionModel`, and `ActionResult`, making sure every component knows exactly what kind of data to expect and how to interpret it.

By defining these clear structures and leveraging Pydantic's automatic validation, `Browser Use` prevents misunderstandings between components, catches errors early, and makes the overall system more robust and maintainable. These standardized structures also make it easier to log and understand what's happening in the system.

Speaking of logging and understanding the system's behavior, how can we monitor the Agent's performance and gather data for improvement? In the next and final chapter, we'll explore the [Telemetry Service](08_telemetry_service.md).

[Next Chapter: Telemetry Service](08_telemetry_service.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
290
docs/Browser Use/08_telemetry_service.md
Normal file
@@ -0,0 +1,290 @@
# Chapter 8: Telemetry Service - Helping Improve the Project (Optional)

In the [previous chapter](07_data_structures__views_.md), we explored the essential blueprints (`Data Structures (Views)`) that keep communication clear and consistent between all the parts of `Browser Use`. We saw how components like the [Agent](01_agent.md) and the [Action Controller & Registry](05_action_controller___registry.md) use these blueprints to exchange information reliably.

Now, let's think about the project itself. How do the developers who build `Browser Use` know if it's working well for users? How do they find out about common errors or which features are most popular, so they can make the tool better?

## What Problem Does the Telemetry Service Solve?

Imagine you released a new tool, like `Browser Use`. You want it to be helpful, but you don't know how people are actually using it. Are they running into unexpected errors? Are certain actions (like clicking vs. scrolling) causing problems? Is the performance okay? Without some feedback, it's hard to know where to focus improvements.

One way to get feedback is through bug reports or feature requests, but that only captures a small fraction of user experiences. We need a way to get a broader, anonymous picture of how the tool is performing "in the wild."

The **Telemetry Service** solves this by providing an *optional* and *anonymous* way to send basic usage statistics back to the project developers. Think of it like an anonymous suggestion box or an automatic crash report that doesn't include any personal information.

**Crucially:** This service is designed to protect user privacy. It doesn't collect website content, personal data, or anything sensitive. It only sends anonymous statistics about the tool's operation, and **it can be completely disabled**.

## Meet `ProductTelemetry`: The Anonymous Reporter

The component responsible for this is the `ProductTelemetry` service, found in `telemetry/service.py`.

* **Collects Usage Data:** It gathers anonymized information about events like:
  * When an [Agent](01_agent.md) starts or finishes a run.
  * Details about each step the Agent takes (like which actions were used).
  * Errors encountered during agent runs.
  * Which actions are defined in the [Action Controller & Registry](05_action_controller___registry.md).
* **Anonymizes Data:** It uses a randomly generated user ID (stored locally, not linked to you) to group events from the same installation without knowing *who* the user is.
* **Sends Data:** It sends this anonymous data to a secure third-party service (PostHog) used by the developers to analyze trends and identify potential issues.
* **Optional:** You can easily turn it off.

## How is Telemetry Used? (Mostly Automatic)

You usually don't interact with the `ProductTelemetry` service directly. Instead, other components like the `Agent` and `Controller` automatically call it at key moments.

**Example: Agent Run Start/End**

When you create an `Agent` and call `agent.run()`, the Agent automatically notifies the Telemetry Service.

```python
# --- File: agent/service.py (Simplified Agent run method) ---
class Agent:
    # ... (other methods) ...

    # Agent has a telemetry object initialized in __init__:
    # self.telemetry = ProductTelemetry()

    async def run(self, max_steps: int = 100) -> AgentHistoryList:
        # ---> Tell Telemetry: Agent run is starting <---
        self._log_agent_run()  # This includes a telemetry.capture() call

        try:
            # ... (main agent loop runs here) ...
            for step_num in range(max_steps):
                # ... (agent takes steps) ...
                if self.state.history.is_done():
                    break
                # ...
        finally:
            # ---> Tell Telemetry: Agent run is ending <---
            self.telemetry.capture(
                AgentEndTelemetryEvent(  # Uses a specific data structure
                    agent_id=self.state.agent_id,
                    is_done=self.state.history.is_done(),
                    success=self.state.history.is_successful(),
                    # ... other anonymous stats ...
                )
            )
            # ... (cleanup browser etc.) ...

        return self.state.history
```

**Explanation:**

1. When the `Agent` is created, it gets an instance of `ProductTelemetry`.
2. Inside the `run` method, before the main loop starts, `_log_agent_run()` is called, which internally uses `self.telemetry.capture()` to send an `AgentRunTelemetryEvent`.
3. After the loop finishes (or an error occurs), the `finally` block ensures that another `self.telemetry.capture()` call is made, this time sending an `AgentEndTelemetryEvent` with summary statistics about the run.

Similarly, the `Agent.step` method captures an `AgentStepTelemetryEvent`, and the `Controller`'s `Registry` captures a `ControllerRegisteredFunctionsTelemetryEvent` when it's initialized. This happens automatically in the background if telemetry is enabled.
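
The pattern is always the same: build an event object and hand it to `capture()`. As a rough sketch of what the per-step notification inside `Agent.step` could look like (the exact fields of `AgentStepTelemetryEvent` aren't shown in this tutorial, so treat the arguments below as placeholders):

```python
# --- Conceptual sketch only; field names are placeholders, not the real event schema ---
self.telemetry.capture(
    AgentStepTelemetryEvent(
        agent_id=self.state.agent_id,  # same anonymous agent ID as the run events
        step=step_num,                 # placeholder: which step just finished
        actions=["click_element"],     # placeholder: names of the actions attempted
    )
)
```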

## How to Disable Telemetry

If you prefer not to send any anonymous usage data, you can easily disable the Telemetry Service.

Set the environment variable `ANONYMIZED_TELEMETRY` to `False`.

How you set environment variables depends on your operating system:

* **Linux/macOS (in terminal):**

  ```bash
  export ANONYMIZED_TELEMETRY=False
  # Now run your Python script in the same terminal
  python your_agent_script.py
  ```

* **Windows (Command Prompt):**

  ```cmd
  set ANONYMIZED_TELEMETRY=False
  python your_agent_script.py
  ```

* **Windows (PowerShell):**

  ```powershell
  $env:ANONYMIZED_TELEMETRY="False"
  python your_agent_script.py
  ```

* **In Python Code (using the `os` module, *before* importing `browser_use`):**

  ```python
  import os
  os.environ['ANONYMIZED_TELEMETRY'] = 'False'

  # Now import and use browser_use
  from browser_use import Agent  # ... other imports
  # ... rest of your script ...
  ```

If this environment variable is set to `False`, the `ProductTelemetry` service will be initialized in a disabled state, and no data will be collected or sent.

## How It Works Under the Hood: Sending Anonymous Data

When telemetry is enabled and an event occurs (like `agent.run()` starting):

1. **Component Calls Capture:** The `Agent` (or `Controller`) calls `telemetry.capture(event_data)`.
2. **Telemetry Service Checks:** The `ProductTelemetry` service checks if it's enabled. If not, it does nothing.
3. **Get User ID:** It retrieves or generates a unique, anonymous user ID. This is typically a random UUID (like `a1b2c3d4-e5f6-7890-abcd-ef1234567890`) stored in a hidden file on your computer (`~/.cache/browser_use/telemetry_user_id`). This ID helps group events from the same installation without identifying the actual user.
4. **Send to PostHog:** It sends the event data (structured using event models like `AgentRunTelemetryEvent`) along with the anonymous user ID to PostHog, a third-party service specialized in product analytics.
5. **Analysis:** Developers can then look at aggregated, anonymous trends in PostHog (e.g., "What percentage of agent runs finish successfully?", "What are the most common errors?") to understand usage patterns and prioritize improvements.

Here's a simplified diagram:

```mermaid
sequenceDiagram
    participant Agent
    participant TelemetrySvc as ProductTelemetry
    participant LocalFile as ~/.cache/.../user_id
    participant PostHog

    Agent->>TelemetrySvc: capture(AgentRunEvent)
    Note over TelemetrySvc: Telemetry Enabled? Yes.
    TelemetrySvc->>LocalFile: Read existing User ID (or create new)
    LocalFile-->>TelemetrySvc: Anonymous User ID (UUID)
    Note over TelemetrySvc: Package Event + User ID
    TelemetrySvc->>PostHog: Send(EventData, UserID)
    PostHog-->>TelemetrySvc: Acknowledgment (Optional)
```

Let's look at the simplified code involved.

**1. Initializing Telemetry (`telemetry/service.py`)**

The service checks the environment variable during initialization.

```python
# --- File: telemetry/service.py (Simplified __init__) ---
import os
import uuid
import logging
from pathlib import Path

from posthog import Posthog  # The library for the external service

from browser_use.utils import singleton

logger = logging.getLogger(__name__)


@singleton  # Ensures only one instance exists
class ProductTelemetry:
    USER_ID_PATH = str(Path.home() / '.cache' / 'browser_use' / 'telemetry_user_id')
    # ... (API key constants) ...
    _curr_user_id = None

    def __init__(self) -> None:
        # Check the environment variable
        telemetry_disabled = os.getenv('ANONYMIZED_TELEMETRY', 'true').lower() == 'false'

        if telemetry_disabled:
            self._posthog_client = None  # Telemetry is off
            logger.debug('Telemetry disabled by environment variable.')
        else:
            # Initialize the PostHog client if enabled
            self._posthog_client = Posthog(...)
            logger.info('Anonymized telemetry enabled.')  # Inform the user
            # Optionally silence PostHog's own logs
            # ...

    # ... (other methods) ...
```

**2. Capturing an Event (`telemetry/service.py`)**

The `capture` method sends the data if the client is active.

```python
# --- File: telemetry/service.py (Simplified capture) ---
# Assume BaseTelemetryEvent is the base class for all telemetry events
from browser_use.telemetry.views import BaseTelemetryEvent


class ProductTelemetry:
    # ... (init) ...

    def capture(self, event: BaseTelemetryEvent) -> None:
        # Do nothing if telemetry is disabled
        if self._posthog_client is None:
            return

        try:
            # Get the anonymous user ID (lazy loaded)
            anon_user_id = self.user_id

            # Send the event name and its properties (as a dictionary)
            self._posthog_client.capture(
                distinct_id=anon_user_id,
                event=event.name,  # e.g., "agent_run"
                properties=event.properties,  # Data from the event model
            )
            logger.debug(f'Telemetry event captured: {event.name}')
        except Exception as e:
            # Don't crash the main application if telemetry fails
            logger.error(f'Failed to send telemetry event {event.name}: {e}')

    @property
    def user_id(self) -> str:
        """Gets or creates the anonymous user ID."""
        if self._curr_user_id:
            return self._curr_user_id

        try:
            # Check if the ID file exists
            id_file = Path(self.USER_ID_PATH)
            if not id_file.exists():
                # Create directory and generate a new UUID if it doesn't exist
                id_file.parent.mkdir(parents=True, exist_ok=True)
                new_user_id = str(uuid.uuid4())
                id_file.write_text(new_user_id)
                self._curr_user_id = new_user_id
            else:
                # Read the existing UUID from the file
                self._curr_user_id = id_file.read_text().strip()
        except Exception:
            # Fallback if file access fails
            self._curr_user_id = 'UNKNOWN_USER_ID'
        return self._curr_user_id
```

**3. Event Data Structures (`telemetry/views.py`)**

Like other components, Telemetry defines structured event models (simple Python dataclasses here) that describe the data being sent.

```python
# --- File: telemetry/views.py (Simplified Event Example) ---
from dataclasses import dataclass, asdict
from typing import Any, Dict


# Base class for all telemetry events (conceptual)
@dataclass
class BaseTelemetryEvent:
    @property
    def name(self) -> str:
        raise NotImplementedError

    @property
    def properties(self) -> Dict[str, Any]:
        # Helper to convert the dataclass fields to a dictionary
        return {k: v for k, v in asdict(self).items() if k != 'name'}


# Specific event for when an agent run starts
@dataclass
class AgentRunTelemetryEvent(BaseTelemetryEvent):
    agent_id: str  # Anonymous ID for the specific agent instance
    use_vision: bool  # Was vision enabled?
    task: str  # The task description (anonymized/hashed in practice)
    model_name: str  # Name of the LLM used
    chat_model_library: str  # Library used for the LLM (e.g., ChatOpenAI)
    version: str  # browser-use version
    source: str  # How browser-use was installed (e.g., pip, git)
    name: str = 'agent_run'  # The event name sent to PostHog


# ... other event models like AgentEndTelemetryEvent, AgentStepTelemetryEvent ...
```

These structures ensure the data sent to PostHog is consistent and well-defined.
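
To see what such an event actually contains, you can instantiate the simplified dataclass above and inspect it; the values below are made-up examples, not real telemetry:

```python
# --- Illustrative only: building an event from the simplified dataclass above ---
event = AgentRunTelemetryEvent(
    agent_id='e0b4c2aa-1234-4f7e-9c3a-0d8f5b6a7c2d',  # example anonymous ID
    use_vision=True,
    task='search for cute cat pictures',  # example task text
    model_name='gpt-4o',                  # example model name
    chat_model_library='ChatOpenAI',
    version='0.1.0',                      # example version string
    source='pip',
)

print(event.name)        # 'agent_run'
print(event.properties)  # every field except 'name', as a plain dictionary
```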

## Conclusion

The **Telemetry Service** (`ProductTelemetry`) provides an optional and privacy-conscious way for the `Browser Use` project to gather anonymous feedback about how the tool is being used. It automatically captures events like agent runs, steps, and errors, sending anonymized statistics to developers via PostHog.

This feedback loop is vital for identifying common issues, understanding feature usage, and ultimately improving the `Browser Use` library for everyone. Remember, you have full control and can easily disable this service by setting the `ANONYMIZED_TELEMETRY=False` environment variable.

This chapter concludes our tour of the core components within the `Browser Use` project. You've learned about the [Agent](01_agent.md), the guiding [System Prompt](02_system_prompt.md), the isolated [BrowserContext](03_browsercontext.md), the webpage map ([DOM Representation](04_dom_representation.md)), the action execution engine ([Action Controller & Registry](05_action_controller___registry.md)), the conversation tracker ([Message Manager](06_message_manager.md)), the data blueprints ([Data Structures (Views)](07_data_structures__views_.md)), and now the optional feedback mechanism ([Telemetry Service](08_telemetry_service.md)). We hope this gives you a solid foundation for understanding and using `Browser Use`!

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
53
docs/Browser Use/index.md
Normal file
@@ -0,0 +1,53 @@
# Tutorial: Browser Use

**Browser Use** is a project that allows an *AI agent* to control a web browser and perform tasks automatically.
Think of it like an AI assistant that can browse websites, fill forms, click buttons, and extract information based on your instructions. It uses a Large Language Model (LLM) as its "brain" to decide what actions to take on a webpage to complete a given *task*. The project manages the browser session, understands the page structure (DOM), and communicates back and forth with the LLM.

**Source Repository:** [https://github.com/browser-use/browser-use/tree/3076ba0e83f30b45971af58fe2aeff64472da812/browser_use](https://github.com/browser-use/browser-use/tree/3076ba0e83f30b45971af58fe2aeff64472da812/browser_use)

```mermaid
flowchart TD
    A0["Agent"]
    A1["BrowserContext"]
    A2["Action Controller & Registry"]
    A3["DOM Representation"]
    A4["Message Manager"]
    A5["System Prompt"]
    A6["Data Structures (Views)"]
    A7["Telemetry Service"]
    A0 -- "Gets state from" --> A1
    A0 -- "Uses to execute actions" --> A2
    A0 -- "Uses for LLM communication" --> A4
    A0 -- "Gets instructions from" --> A5
    A0 -- "Uses/Produces data formats" --> A6
    A0 -- "Logs events to" --> A7
    A1 -- "Gets DOM structure via" --> A3
    A1 -- "Provides BrowserState" --> A6
    A2 -- "Executes actions on" --> A1
    A2 -- "Defines/Uses ActionModel/Ac..." --> A6
    A2 -- "Logs registered functions to" --> A7
    A3 -- "Provides structure to" --> A1
    A3 -- "Uses DOM structures" --> A6
    A4 -- "Provides messages to" --> A0
    A4 -- "Initializes with" --> A5
    A4 -- "Formats data using" --> A6
    A5 -- "Defines structure for Agent..." --> A6
    A7 -- "Receives events from" --> A0
```

## Chapters

1. [Agent](01_agent.md)
2. [System Prompt](02_system_prompt.md)
3. [BrowserContext](03_browsercontext.md)
4. [DOM Representation](04_dom_representation.md)
5. [Action Controller & Registry](05_action_controller___registry.md)
6. [Message Manager](06_message_manager.md)
7. [Data Structures (Views)](07_data_structures__views_.md)
8. [Telemetry Service](08_telemetry_service.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)