mirror of
https://github.com/aljazceru/Tutorial-Codebase-Knowledge.git
synced 2025-12-20 16:04:20 +01:00
init push
This commit is contained in:
316
docs/Browser Use/04_dom_representation.md
Normal file
316
docs/Browser Use/04_dom_representation.md
Normal file
@@ -0,0 +1,316 @@
|
||||
# Chapter 4: DOM Representation - Mapping the Webpage
|
||||
|
||||
In the [previous chapter](03_browsercontext.md), we learned about the `BrowserContext`, the Agent's private workspace for browsing. We saw that the Agent uses `browser_context.get_state()` to get a snapshot of the current webpage. But how does the Agent actually *understand* the content of that snapshot?
|
||||
|
||||
Imagine you're looking at the Google homepage. You instantly recognize the logo, the search bar, and the buttons. But a computer program just sees a wall of code (HTML). How can our `Agent` figure out: "This rectangular box is the search bar I need to type into," or "This specific image link is the first result I should click"?
|
||||
|
||||
This is the problem solved by **DOM Representation**.
|
||||
|
||||
## What Problem Does DOM Representation Solve?
|
||||
|
||||
Webpages are built using HTML (HyperText Markup Language), which describes the structure and content. Your browser reads this HTML and creates an internal, structured representation called the **Document Object Model (DOM)**. It's like the browser builds a detailed blueprint or an outline from the HTML instructions.
|
||||
|
||||
However, this raw DOM blueprint is incredibly complex and contains lots of information irrelevant to our Agent's task. The Agent doesn't need to know about every single tiny visual detail; it needs a *simplified map* focused on what's important for interaction:
|
||||
|
||||
1. **What elements are on the page?** (buttons, links, input fields, text)
|
||||
2. **Are they visible to a user?** (Hidden elements shouldn't be interacted with)
|
||||
3. **Are they interactive?** (Can you click it? Can you type in it?)
|
||||
4. **How can the Agent refer to them?** (We need a simple way to say "click *this* button")
|
||||
|
||||
DOM Representation solves the problem of translating the complex, raw DOM blueprint into a simplified, structured map that highlights the interactive "landmarks" and pathways the Agent can use.
|
||||
|
||||
## Meet `DomService`: The Map Maker
|
||||
|
||||
The component responsible for creating this map is the `DomService`. Think of it as a cartographer specializing in webpages.
|
||||
|
||||
When the `Agent` (via the `BrowserContext`) asks for the current state of the page, the `BrowserContext` employs the `DomService` to analyze the page's live DOM.
|
||||
|
||||
Here's what the `DomService` does:
|
||||
|
||||
1. **Examines the Live Page:** It looks at the current structure rendered in the browser tab, not just the initial HTML source code (because JavaScript can change the page after it loads).
|
||||
2. **Identifies Elements:** It finds all the meaningful elements like buttons, links, input fields, and text blocks.
|
||||
3. **Checks Properties:** For each element, it determines crucial properties:
|
||||
* **Visibility:** Is it actually displayed on the screen?
|
||||
* **Interactivity:** Is it something a user can click, type into, or otherwise interact with?
|
||||
* **Position:** Where is it located (roughly)?
|
||||
4. **Assigns Interaction Indices:** This is key! For elements deemed interactive and visible, `DomService` assigns a unique number, called a `highlight_index` (like `[5]`, `[12]`, etc.). This gives the Agent and the LLM a simple, unambiguous way to refer to specific elements.
|
||||
5. **Builds a Structured Tree:** It organizes this information into a simplified tree structure (`element_tree`) that reflects the page layout but is much easier to process than the full DOM.
|
||||
6. **Creates an Index Map:** It generates a `selector_map`, which is like an index in a book, mapping each `highlight_index` directly to its corresponding element node in the tree.
|
||||
|
||||
The final output is a `DOMState` object containing the simplified `element_tree` and the handy `selector_map`. This `DOMState` is then included in the `BrowserState` that `BrowserContext.get_state()` returns to the Agent.
|
||||
|
||||
## The Output: `DOMState` - The Agent's Map
|
||||
|
||||
The `DOMState` object produced by `DomService` has two main parts:
|
||||
|
||||
1. **`element_tree`:** This is the root of our simplified map, represented as a `DOMElementNode` object (defined in `dom/views.py`). Each node in the tree can be either an element (`DOMElementNode`) or a piece of text (`DOMTextNode`). `DOMElementNode`s contain information like the tag name (`<button>`, `<input>`), attributes (`aria-label="Search"`), visibility, interactivity, and importantly, the `highlight_index` if applicable. The tree structure helps understand the page layout (e.g., this button is inside that section).
|
||||
|
||||
*Conceptual Example Tree:*
|
||||
```
|
||||
<body> [no index]
|
||||
|-- <div> [no index]
|
||||
| |-- <input aria-label="Search"> [highlight_index: 5]
|
||||
| +-- <button> [highlight_index: 6]
|
||||
| +-- "Google Search" (TextNode)
|
||||
+-- <a> href="/images"> [highlight_index: 7]
|
||||
+-- "Images" (TextNode)
|
||||
```
|
||||
|
||||
2. **`selector_map`:** This is a Python dictionary that acts as a quick lookup. It maps the integer `highlight_index` directly to the corresponding `DOMElementNode` object in the `element_tree`.
|
||||
|
||||
*Conceptual Example Map:*
|
||||
```python
|
||||
{
|
||||
5: <DOMElementNode tag_name='input', attributes={'aria-label':'Search'}, ...>,
|
||||
6: <DOMElementNode tag_name='button', ...>,
|
||||
7: <DOMElementNode tag_name='a', attributes={'href':'/images'}, ...>
|
||||
}
|
||||
```
|
||||
|
||||
This `selector_map` is incredibly useful because when the LLM decides "click element 5", the Agent can instantly find the correct `DOMElementNode` using `selector_map[5]` and tell the [Action Controller & Registry](05_action_controller___registry.md) exactly which element to interact with.
|
||||
|
||||
## How the Agent Uses the Map
|
||||
|
||||
The `Agent` takes the `DOMState` (usually simplifying the `element_tree` further into a text representation) and includes it in the information sent to the LLM. Remember the JSON response format from [Chapter 2](02_system_prompt.md)? The LLM uses the `highlight_index` from this map to specify actions:
|
||||
|
||||
```json
|
||||
// LLM might receive a simplified text view like:
|
||||
// "[5]<input aria-label='Search'>\n[6]<button>Google Search</button>\n[7]<a>Images</a>"
|
||||
|
||||
// And respond with:
|
||||
{
|
||||
"current_state": {
|
||||
"evaluation_previous_goal": "...",
|
||||
"memory": "On Google homepage, need to search for cats.",
|
||||
"next_goal": "Type 'cute cats' into the search bar [5]."
|
||||
},
|
||||
"action": [
|
||||
{
|
||||
"input_text": {
|
||||
"index": 5, // <-- Uses the highlight_index from the DOM map!
|
||||
"text": "cute cats"
|
||||
}
|
||||
}
|
||||
// ... maybe press Enter action ...
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Code Example: Seeing the Map
|
||||
|
||||
We don't usually interact with `DomService` directly. Instead, we get its output via the `BrowserContext`. Let's revisit the example from Chapter 3 and see where the DOM representation fits:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from browser_use import Browser, BrowserConfig, BrowserContext, BrowserContextConfig
|
||||
|
||||
async def main():
|
||||
browser_config = BrowserConfig(headless=False)
|
||||
browser = Browser(config=browser_config)
|
||||
context_config = BrowserContextConfig()
|
||||
|
||||
async with browser.new_context(config=context_config) as browser_context:
|
||||
# Navigate to a page (e.g., Google)
|
||||
await browser_context.navigate_to("https://www.google.com")
|
||||
|
||||
print("Getting current page state...")
|
||||
# This call uses DomService internally to generate the DOM representation
|
||||
current_state = await browser_context.get_state()
|
||||
|
||||
print(f"\nCurrent Page URL: {current_state.url}")
|
||||
print(f"Current Page Title: {current_state.title}")
|
||||
|
||||
# Accessing the DOM Representation parts within the BrowserState
|
||||
print("\n--- DOM Representation Details ---")
|
||||
# The element_tree is the root node of our simplified DOM map
|
||||
if current_state.element_tree:
|
||||
print(f"Root element tag of simplified tree: <{current_state.element_tree.tag_name}>")
|
||||
else:
|
||||
print("Element tree is empty.")
|
||||
|
||||
# The selector_map provides direct access to interactive elements by index
|
||||
if current_state.selector_map:
|
||||
print(f"Number of interactive elements found: {len(current_state.selector_map)}")
|
||||
|
||||
# Let's try to find the element the LLM might call [5] (often the search bar)
|
||||
example_index = 5 # Note: Indices can change depending on the page!
|
||||
if example_index in current_state.selector_map:
|
||||
element_node = current_state.selector_map[example_index]
|
||||
print(f"Element [{example_index}]: Tag=<{element_node.tag_name}>, Attributes={element_node.attributes}")
|
||||
# The Agent uses this node reference to perform actions
|
||||
else:
|
||||
print(f"Element [{example_index}] not found in the selector map for this page state.")
|
||||
else:
|
||||
print("No interactive elements found (selector map is empty).")
|
||||
|
||||
# The Agent would typically convert element_tree into a compact text format
|
||||
# (using methods like element_tree.clickable_elements_to_string())
|
||||
# to send to the LLM along with the task instructions.
|
||||
|
||||
print("\nBrowserContext closed.")
|
||||
await browser.close()
|
||||
print("Browser closed.")
|
||||
|
||||
# Run the asynchronous code
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**What happens here?**
|
||||
|
||||
1. We set up the `Browser` and `BrowserContext`.
|
||||
2. We navigate to Google.
|
||||
3. `browser_context.get_state()` is called. **Internally**, this triggers the `DomService`.
|
||||
4. `DomService` analyzes the Google page, finds interactive elements (like the search bar, buttons), assigns them `highlight_index` numbers, and builds the `element_tree` and `selector_map`.
|
||||
5. This `DOMState` (containing the tree and map) is packaged into the `BrowserState` object returned by `get_state()`.
|
||||
6. Our code then accesses `current_state.element_tree` and `current_state.selector_map` to peek at the map created by `DomService`.
|
||||
7. We demonstrate looking up an element using its potential index (`selector_map[5]`).
|
||||
|
||||
## How It Works Under the Hood: `DomService` in Action
|
||||
|
||||
Let's trace the flow when `BrowserContext.get_state()` is called:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Agent
|
||||
participant BC as BrowserContext
|
||||
participant DomService
|
||||
participant PlaywrightPage as Browser Page (JS Env)
|
||||
participant buildDomTree_js as buildDomTree.js
|
||||
|
||||
Agent->>BC: get_state()
|
||||
Note over BC: Needs to analyze the page content
|
||||
BC->>DomService: get_clickable_elements(...)
|
||||
Note over DomService: Needs to run analysis script in browser
|
||||
DomService->>PlaywrightPage: evaluate(js_code='buildDomTree.js', args={...})
|
||||
Note over PlaywrightPage: Execute JavaScript code
|
||||
PlaywrightPage->>buildDomTree_js: Run analysis function
|
||||
Note over buildDomTree_js: Analyzes live DOM, finds visible & interactive elements, assigns highlight_index
|
||||
buildDomTree_js-->>PlaywrightPage: Return structured data (nodes, indices, map)
|
||||
PlaywrightPage-->>DomService: Return JS execution result (JSON-like data)
|
||||
Note over DomService: Process the raw data from JS
|
||||
DomService->>DomService: _construct_dom_tree(result)
|
||||
Note over DomService: Builds Python DOMElementNode tree and selector_map
|
||||
DomService-->>BC: Return DOMState (element_tree, selector_map)
|
||||
Note over BC: Combine DOMState with URL, title, screenshot etc.
|
||||
BC->>BC: Create BrowserState object
|
||||
BC-->>Agent: Return BrowserState (containing DOM map)
|
||||
```
|
||||
|
||||
**Key Code Points:**
|
||||
|
||||
1. **`BrowserContext` calls `DomService`:** Inside `browser/context.py`, the `_update_state` method (called by `get_state`) initializes and uses the `DomService`:
|
||||
|
||||
```python
|
||||
# --- File: browser/context.py (Simplified _update_state) ---
|
||||
from browser_use.dom.service import DomService # Import the service
|
||||
from browser_use.browser.views import BrowserState
|
||||
|
||||
class BrowserContext:
|
||||
# ... other methods ...
|
||||
async def _update_state(self) -> BrowserState:
|
||||
page = await self.get_current_page() # Get the active Playwright page object
|
||||
# ... error handling ...
|
||||
try:
|
||||
# 1. Create DomService instance for the current page
|
||||
dom_service = DomService(page)
|
||||
|
||||
# 2. Call DomService to get the DOM map (DOMState)
|
||||
content_info = await dom_service.get_clickable_elements(
|
||||
highlight_elements=self.config.highlight_elements,
|
||||
viewport_expansion=self.config.viewport_expansion,
|
||||
# ... other options ...
|
||||
)
|
||||
|
||||
# 3. Get other info (screenshot, URL, title etc.)
|
||||
screenshot_b64 = await self.take_screenshot()
|
||||
url = page.url
|
||||
title = await page.title()
|
||||
# ... gather more state ...
|
||||
|
||||
# 4. Package everything into BrowserState
|
||||
browser_state = BrowserState(
|
||||
element_tree=content_info.element_tree, # <--- From DomService
|
||||
selector_map=content_info.selector_map, # <--- From DomService
|
||||
url=url,
|
||||
title=title,
|
||||
screenshot=screenshot_b64,
|
||||
# ... other state info ...
|
||||
)
|
||||
return browser_state
|
||||
except Exception as e:
|
||||
logger.error(f'Failed to update state: {str(e)}')
|
||||
raise # Or handle error
|
||||
```
|
||||
|
||||
2. **`DomService` runs JavaScript:** Inside `dom/service.py`, the `_build_dom_tree` method executes the JavaScript code stored in `buildDomTree.js` within the browser page's context.
|
||||
|
||||
```python
|
||||
# --- File: dom/service.py (Simplified _build_dom_tree) ---
|
||||
import logging
|
||||
from importlib import resources
|
||||
# ... other imports ...
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class DomService:
|
||||
def __init__(self, page: 'Page'):
|
||||
self.page = page
|
||||
# Load the JavaScript code from the file when DomService is created
|
||||
self.js_code = resources.read_text('browser_use.dom', 'buildDomTree.js')
|
||||
# ...
|
||||
|
||||
async def _build_dom_tree(
|
||||
self, highlight_elements: bool, focus_element: int, viewport_expansion: int
|
||||
) -> tuple[DOMElementNode, SelectorMap]:
|
||||
|
||||
# Prepare arguments for the JavaScript function
|
||||
args = {
|
||||
'doHighlightElements': highlight_elements,
|
||||
'focusHighlightIndex': focus_element,
|
||||
'viewportExpansion': viewport_expansion,
|
||||
'debugMode': logger.getEffectiveLevel() == logging.DEBUG,
|
||||
}
|
||||
|
||||
try:
|
||||
# Execute the JavaScript code in the browser page!
|
||||
# The JS code analyzes the live DOM and returns a structured result.
|
||||
eval_page = await self.page.evaluate(self.js_code, args)
|
||||
except Exception as e:
|
||||
logger.error('Error evaluating JavaScript: %s', e)
|
||||
raise
|
||||
|
||||
# ... (optional debug logging) ...
|
||||
|
||||
# Parse the result from JavaScript into Python objects
|
||||
return await self._construct_dom_tree(eval_page)
|
||||
|
||||
async def _construct_dom_tree(self, eval_page: dict) -> tuple[DOMElementNode, SelectorMap]:
|
||||
# ... (logic to parse js_node_map from eval_page) ...
|
||||
# ... (loops through nodes, creates DOMElementNode/DOMTextNode objects) ...
|
||||
# ... (builds the tree structure by linking parents/children) ...
|
||||
# ... (populates the selector_map dictionary) ...
|
||||
# This uses the structures defined in dom/views.py
|
||||
# ...
|
||||
root_node = ... # Parsed root DOMElementNode
|
||||
selector_map = ... # Populated dictionary {index: DOMElementNode}
|
||||
return root_node, selector_map
|
||||
# ... other methods like get_clickable_elements ...
|
||||
```
|
||||
|
||||
3. **`buildDomTree.js` (Conceptual):** This JavaScript file (located at `dom/buildDomTree.js` in the library) is the core map-making logic that runs *inside the browser*. It traverses the live DOM, checks element visibility and interactivity using browser APIs (like `element.getBoundingClientRect()`, `window.getComputedStyle()`, `document.elementFromPoint()`), assigns the `highlight_index`, and packages the results into a structured format that the Python `DomService` can understand. *We don't need to understand the JS code itself, just its purpose.*
|
||||
|
||||
4. **Python Data Structures (`DOMElementNode`, `DOMTextNode`):** The results from the JavaScript are parsed into Python objects defined in `dom/views.py`. These dataclasses (`DOMElementNode`, `DOMTextNode`) hold the information about each mapped element or text segment.
|
||||
|
||||
## Conclusion
|
||||
|
||||
DOM Representation, primarily handled by the `DomService`, is crucial for bridging the gap between the complex reality of a webpage (the DOM) and the Agent/LLM's need for a simplified, actionable understanding. By creating a structured `element_tree` and an indexed `selector_map`, it provides a clear map of interactive landmarks on the page, identified by simple `highlight_index` numbers.
|
||||
|
||||
This map allows the LLM to make specific plans like "type into element [5]" or "click element [12]", which the Agent can then reliably translate into concrete actions.
|
||||
|
||||
Now that we understand how the Agent sees the page, how does it actually *perform* those actions like clicking or typing? In the next chapter, we'll explore the component responsible for executing the LLM's plan: the [Action Controller & Registry](05_action_controller___registry.md).
|
||||
|
||||
[Next Chapter: Action Controller & Registry](05_action_controller___registry.md)
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
Reference in New Issue
Block a user