init push

This commit is contained in:
zachary62
2025-04-04 13:03:54 -04:00
parent e62ee2cb13
commit 2ebad5e5f2
160 changed files with 2 additions and 0 deletions

docs/OpenManus/01_llm.md Normal file

@@ -0,0 +1,202 @@
# Chapter 1: The LLM - Your Agent's Brainpower
Welcome to the OpenManus tutorial! We're thrilled to have you on board. Let's start with the absolute core of any intelligent agent: the "brain" that does the thinking and understanding. In OpenManus, this brainpower comes from something called a **Large Language Model (LLM)**, and we interact with it using our `LLM` class.
## What's the Big Deal with LLMs?
Imagine you have access to an incredibly smart expert who understands language, can reason, write, summarize, and even generate creative ideas. That's kind of what an LLM (like GPT-4, Claude, or Llama) is! These are massive AI models trained on vast amounts of text and data, making them capable of understanding and generating human-like text.
They are the engine that drives the "intelligence" in AI applications like chatbots, writing assistants, and, of course, the agents you'll build with OpenManus.
## Why Do We Need an `LLM` Class?
Okay, so LLMs are powerful. Can't our agent just talk directly to them?
Well, it's a bit more complicated than a casual chat. Talking to these big AI models usually involves:
1. **Complex APIs:** Each LLM provider (like OpenAI, Anthropic, Google, AWS) has its own specific way (an API or Application Programming Interface) to send requests and get responses. It's like needing different phone numbers and dialing procedures for different experts.
2. **API Keys:** You need secret keys to prove you're allowed to use the service (and get billed for it!). Managing these securely is important.
3. **Formatting:** You need to structure your questions (prompts) and conversation history in a very specific format the LLM understands.
4. **Errors & Retries:** Sometimes network connections hiccup, or the LLM service is busy. You need a way to handle these errors gracefully, maybe by trying again.
5. **Tracking Usage (Tokens):** Using these powerful models costs money, often based on how much text you send and receive (measured in "tokens"). You need to keep track of this.
Doing all this *every time* an agent needs to think would be repetitive and messy!
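To make that concrete, here is roughly what a single "raw" request to one provider's SDK looks like without any wrapper (a sketch using the `openai` package; the model name and key handling are placeholders, and retries, token counting, and error handling are all still missing):
```python
import asyncio
from openai import AsyncOpenAI

async def raw_call():
    client = AsyncOpenAI(api_key="sk-...")  # you must store and protect this secret yourself
    response = await client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": "What is the capital of France?"}],
    )
    print(response.choices[0].message.content)

asyncio.run(raw_call())
```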
**This is where the `LLM` class comes in.** Think of it as a super-helpful **translator and network manager** rolled into one.
* It knows how to talk to different LLM APIs.
* It securely handles your API keys (using settings from the [Configuration](07_configuration__config_.md)).
* It formats your messages correctly.
* It automatically retries if there's a temporary glitch.
* It helps count the "tokens" used.
It hides all that complexity, giving your agent a simple way to "ask" the LLM something.
**Use Case:** Let's say we want our agent to simply answer the question: "What is the capital of France?" The `LLM` class will handle all the background work to get that answer from the actual AI model.
## How Do Agents Use the `LLM` Class?
In OpenManus, agents (which we'll learn more about in [Chapter 3: BaseAgent](03_baseagent.md)) have an `llm` component built-in. Usually, you don't even need to create it manually; the agent does it for you when it starts up, using settings from your configuration file (`config/config.toml`).
The primary way an agent uses the `LLM` class is through its `ask` method.
Let's look at a simplified example of how you might use the `LLM` class directly (though usually, your agent handles this):
```python
# Import necessary classes
from app.llm import LLM
from app.schema import Message
import asyncio  # Needed to run asynchronous code

# Assume configuration is already loaded (API keys, model name, etc.)
# Create an instance of the LLM class (using default settings)
llm_interface = LLM()

# Prepare the question as a list of messages
# (We'll learn more about Messages in Chapter 2)
conversation = [
    Message.user_message("What is the capital of France?")
]

# Define an async function to ask the question
async def ask_question():
    print("Asking the LLM...")
    # Use the 'ask' method to send the conversation
    response = await llm_interface.ask(messages=conversation)
    print(f"LLM Response: {response}")

# Run the async function
asyncio.run(ask_question())
```
**Explanation:**
1. We import the `LLM` class and the `Message` class (more on `Message` in the [next chapter](02_message___memory.md)).
2. We create `llm_interface = LLM()`. This sets up our connection to the LLM using settings found in the configuration.
3. We create a `conversation` list containing our question, formatted as a `Message` object. The `LLM` class needs the input in this list-of-messages format.
4. We call `await llm_interface.ask(messages=conversation)`. This is the core action! We send our message list to the LLM via our interface. The `await` keyword is used because communicating over the network takes time, so we wait for the response asynchronously.
5. The `ask` method returns the LLM's text response as a string.
**Example Output (might vary slightly):**
```
Asking the LLM...
LLM Response: The capital of France is Paris.
```
See? We just asked a question and got an answer, without worrying about API keys, JSON formatting, or network errors! The `LLM` class handled it all.
There's also a more advanced method called `ask_tool`, which allows the LLM to use specific [Tools](04_tool___toolcollection.md), but we'll cover that later. For now, `ask` is the main way to get text responses.
## Under the Hood: What Happens When You `ask`?
Let's peek behind the curtain. When your agent calls `llm.ask(...)`, several things happen in sequence:
1. **Format Messages:** The `LLM` class takes your list of `Message` objects and converts them into the exact dictionary format the specific LLM API (like OpenAI's or AWS Bedrock's) expects. This might involve adding special tags or structuring image data if needed (`llm.py: format_messages`).
2. **Count Tokens:** It calculates roughly how many "tokens" your input messages will use (`llm.py: count_message_tokens`).
3. **Check Limits:** It checks if sending this request would exceed any configured token limits (`llm.py: check_token_limit`). If it does, it raises a specific `TokenLimitExceeded` error *before* making the expensive API call.
4. **Send Request:** It sends the formatted messages and other parameters (like the desired model, `max_tokens`) to the LLM's API endpoint over the internet (`llm.py: client.chat.completions.create` or similar for AWS Bedrock in `bedrock.py`).
5. **Handle Glitches (Retry):** If the API call fails due to a temporary issue (like a network timeout or the service being momentarily busy), the `LLM` class automatically waits a bit and tries again, up to a few times (thanks to the `@retry` decorator in `llm.py`).
6. **Receive Response:** Once successful, it receives the response from the LLM API.
7. **Extract Answer:** It pulls out the actual text content from the API response.
8. **Update Counts:** It records the number of input tokens used and the number of tokens in the received response (`llm.py: update_token_count`).
9. **Return Result:** Finally, it returns the LLM's text answer back to your agent.
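Step 2 deserves a small illustration: a "token" is a chunk of text as the model sees it, and libraries such as `tiktoken` can count them. A minimal sketch (the encoding name is just an example; the real `count_message_tokens` also adds per-message overhead and handles images):
```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by many OpenAI chat models

text = "What is the capital of France?"
tokens = encoding.encode(text)

print(len(tokens))   # a handful of tokens for this short question
print(tokens[:5])    # the first few integer token IDs
```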
Here's a simplified diagram showing the flow:
```mermaid
sequenceDiagram
participant Agent
participant LLMClass as LLM Class (app/llm.py)
participant TokenCounter as Token Counter (app/llm.py)
participant OpenAIClient as OpenAI/Bedrock Client (app/llm.py, app/bedrock.py)
participant LLM_API as Actual LLM API (e.g., OpenAI, AWS Bedrock)
Agent->>+LLMClass: ask(messages)
LLMClass->>LLMClass: format_messages(messages)
LLMClass->>+TokenCounter: count_message_tokens(formatted_messages)
TokenCounter-->>-LLMClass: input_token_count
LLMClass->>LLMClass: check_token_limit(input_token_count)
Note over LLMClass: If limit exceeded, raise Error.
LLMClass->>+OpenAIClient: create_completion(formatted_messages, model, ...)
Note right of OpenAIClient: Handles retries on network errors etc.
OpenAIClient->>+LLM_API: Send HTTP Request
LLM_API-->>-OpenAIClient: Receive HTTP Response
OpenAIClient-->>-LLMClass: completion_response
LLMClass->>LLMClass: extract_content(completion_response)
LLMClass->>+TokenCounter: update_token_count(input_tokens, completion_tokens)
TokenCounter-->>-LLMClass:
LLMClass-->>-Agent: llm_answer (string)
```
Let's look at a tiny piece of the `ask` method in `app/llm.py` to see the retry mechanism:
```python
# Simplified snippet from app/llm.py
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_exception_type
from openai import OpenAIError
# ... other imports ...

class LLM:
    # ... other methods like __init__, format_messages ...

    @retry(  # This decorator handles retries!
        wait=wait_random_exponential(min=1, max=60),  # Wait 1-60s between tries
        stop=stop_after_attempt(6),  # Give up after 6 tries
        retry=retry_if_exception_type((OpenAIError, Exception)),  # Retry on these errors
    )
    async def ask(
        self,
        messages: List[Union[dict, Message]],
        # ... other parameters ...
    ) -> str:
        try:
            # 1. Format messages (simplified)
            formatted_msgs = self.format_messages(messages)

            # 2. Count tokens & check limits (simplified)
            input_tokens = self.count_message_tokens(formatted_msgs)
            if not self.check_token_limit(input_tokens):
                raise TokenLimitExceeded(...)  # Special error, not retried

            # 3. Prepare API call parameters (simplified)
            params = {"model": self.model, "messages": formatted_msgs}  # ... plus other settings ...

            # 4. Make the actual API call (simplified)
            response = await self.client.chat.completions.create(**params)

            # 5. Process response & update tokens (simplified)
            answer = response.choices[0].message.content
            self.update_token_count(response.usage.prompt_tokens, ...)
            return answer
        except TokenLimitExceeded:
            raise  # Don't retry token limits
        except Exception as e:
            logger.error(f"LLM ask failed: {e}")
            raise  # Let the @retry decorator handle retrying other errors
```
**Explanation:**
* The `@retry(...)` part *above* the `async def ask(...)` line is key. It tells Python: "If the code inside this `ask` function fails with certain errors (like `OpenAIError`), wait a bit and try running it again, up to 6 times."
* Inside the `try...except` block, the code performs the steps we discussed: format, count, check, call the API (`self.client.chat.completions.create`), and process the result.
* Crucially, it catches the `TokenLimitExceeded` error separately and re-raises it immediately: we *don't* want to retry if we already know we've run out of tokens!
* Other errors will be caught by the final `except Exception`, logged, and re-raised, allowing the `@retry` mechanism to decide whether to try again.
This shows how the `LLM` class uses libraries like `tenacity` to add resilience without cluttering the main logic of your agent.
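If you want to see the retry pattern in isolation, here is a minimal, self-contained sketch of the same `tenacity` idea (the flaky function and its failure rate are invented purely for illustration):
```python
import random
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(
    wait=wait_random_exponential(min=1, max=10),  # back off 1-10s between attempts
    stop=stop_after_attempt(3),                   # give up after 3 attempts
)
def flaky_call() -> str:
    """Pretend network call that fails most of the time."""
    if random.random() < 0.7:
        raise ConnectionError("temporary glitch")
    return "success"

print(flaky_call())  # retries transparently; if all 3 attempts fail, tenacity raises RetryError
```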
## Wrapping Up Chapter 1
You've learned about the core "brain", the Large Language Model (LLM), and why we need the `LLM` class in OpenManus to interact with it smoothly. This class acts as a vital interface, handling API complexities, errors, and token counting, and providing your agents with simple `ask` (and `ask_tool`) methods.
Now that we understand how to communicate with the LLM, we need a way to structure the conversation, keeping track of who said what. That's where Messages and Memory come in.
Let's move on to [Chapter 2: Message / Memory](02_message___memory.md) to explore how we represent and store conversations for our agents.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,320 @@
# Chapter 2: Message / Memory - Remembering the Conversation
In [Chapter 1: The LLM - Your Agent's Brainpower](01_llm.md), we learned how our agent uses the `LLM` class to access its "thinking" capabilities. But just like humans, an agent needs to remember what was said earlier in a conversation to make sense of new requests and respond appropriately.
Imagine asking a friend: "What was the first thing I asked you?". If they have no memory, they can't answer! Agents face the same problem. They need a way to store the conversation history.
This is where `Message` and `Memory` come in.
## What Problem Do They Solve?
Think about a simple chat:
1. **You:** "What's the weather like in London?"
2. **Agent:** "It's currently cloudy and 15°C in London."
3. **You:** "What about Paris?"
For the agent to answer your *second* question ("What about Paris?"), it needs to remember that the *topic* of the conversation is "weather". Without remembering the first question, the second question is meaningless.
`Message` and `Memory` provide the structure to:
1. Represent each individual turn (like your question or the agent's answer) clearly.
2. Store these turns in order, creating a log of the conversation.
## The Key Concepts: Message and Memory
Let's break these down:
### 1. Message: A Single Turn in the Chat
A `Message` object is like a single speech bubble in a chat interface. It represents one specific thing said by someone (or something) at a particular point in the conversation.
Every `Message` has two main ingredients:
* **`role`**: *Who* sent this message? This is crucial for the LLM to understand the flow. Common roles are:
    * `user`: A message from the end-user interacting with the agent. (e.g., "What's the weather?")
    * `assistant`: A message *from* the agent/LLM. (e.g., "The weather is sunny.")
    * `system`: An initial instruction to guide the agent's overall behavior. (e.g., "You are a helpful weather assistant.")
    * `tool`: The output or result from a [Tool / ToolCollection](04_tool___toolcollection.md) that the agent used. (e.g., The raw data returned by a weather API tool).
* **`content`**: *What* was said? This is the actual text of the message. (e.g., "What's the weather like in London?")
There are also optional parts for more advanced uses, like `tool_calls` (when the assistant decides to use a tool) or `base64_image` (if an image is included in the message), but `role` and `content` are the basics.
### 2. Memory: The Conversation Log
The `Memory` object is simply a container, like a list or a notebook, that holds a sequence of `Message` objects.
* It keeps track of the entire conversation history (or at least the recent parts).
* It stores messages in the order they occurred.
* Agents look at the `Memory` before deciding what to do next, giving them context.
Think of `Memory` as the agent's short-term memory for the current interaction.
## How Do We Use Them?
Let's see how you'd typically work with `Message` and `Memory` in OpenManus (often, the agent framework handles some of this automatically, but it's good to understand the pieces).
**1. Creating Messages:**
The `Message` class in `app/schema.py` provides handy shortcuts to create messages with the correct role:
```python
# Import the Message class
from app.schema import Message
# Create a message from the user
user_q = Message.user_message("What's the capital of France?")
# Create a message from the assistant (agent's response)
assistant_a = Message.assistant_message("The capital of France is Paris.")
# Create a system instruction
system_instruction = Message.system_message("You are a helpful geography expert.")
print(f"User Message: Role='{user_q.role}', Content='{user_q.content}'")
print(f"Assistant Message: Role='{assistant_a.role}', Content='{assistant_a.content}'")
```
**Explanation:**
* We import `Message` from `app/schema.py`.
* `Message.user_message("...")` creates a `Message` object with `role` set to `user`.
* `Message.assistant_message("...")` creates one with `role` set to `assistant`.
* `Message.system_message("...")` creates one with `role` set to `system`.
* Each of these returns a `Message` object containing the role and the text content you provided.
**Example Output:**
```
User Message: Role='user', Content='What's the capital of France?'
Assistant Message: Role='assistant', Content='The capital of France is Paris.'
```
**2. Storing Messages in Memory:**
The `Memory` class (`app/schema.py`) holds these messages. Agents usually have a `memory` attribute.
```python
# Import Memory and Message
from app.schema import Message, Memory

# Create a Memory instance
conversation_memory = Memory()

# Add messages to the memory
conversation_memory.add_message(
    Message.system_message("You are a helpful geography expert.")
)
conversation_memory.add_message(
    Message.user_message("What's the capital of France?")
)
conversation_memory.add_message(
    Message.assistant_message("The capital of France is Paris.")
)
conversation_memory.add_message(
    Message.user_message("What about Spain?")
)

# See the messages stored
print(f"Number of messages in memory: {len(conversation_memory.messages)}")
# Print the last message
print(f"Last message: {conversation_memory.messages[-1].to_dict()}")
```
**Explanation:**
* We import `Memory` and `Message`.
* `conversation_memory = Memory()` creates an empty memory store.
* `conversation_memory.add_message(...)` adds a `Message` object to the end of the internal list.
* `conversation_memory.messages` gives you access to the list of `Message` objects currently stored.
* `message.to_dict()` converts a `Message` object into a simple dictionary format, which is often needed for APIs.
**Example Output:**
```
Number of messages in memory: 4
Last message: {'role': 'user', 'content': 'What about Spain?'}
```
**3. Using Memory for Context:**
Now, how does the agent use this? Before calling the [LLM](01_llm.md) to figure out the answer to "What about Spain?", the agent would grab the messages from its `Memory`.
```python
# (Continuing from previous example)
# Agent prepares to ask the LLM
messages_for_llm = conversation_memory.to_dict_list()

print("Messages being sent to LLM for context:")
for msg in messages_for_llm:
    print(f"- {msg}")

# Simplified: Agent would now pass 'messages_for_llm' to llm.ask(...)
# response = await agent.llm.ask(messages=messages_for_llm)
# print(f"LLM would likely respond about the capital of Spain, e.g., 'The capital of Spain is Madrid.'")
```
**Explanation:**
* `conversation_memory.to_dict_list()` converts all stored `Message` objects into the list-of-dictionaries format that the `llm.ask` method expects (as we saw in Chapter 1).
* By sending this *entire history*, the LLM sees:
    1. Its instructions ("You are a helpful geography expert.")
    2. The first question ("What's the capital of France?")
    3. Its previous answer ("The capital of France is Paris.")
    4. The *new* question ("What about Spain?")
* With this context, the LLM can correctly infer that "What about Spain?" means "What is the capital of Spain?".
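Concretely, the list passed to `llm.ask` for this conversation looks roughly like the snippet below (each dictionary comes from `Message.to_dict()`, so extra optional fields may appear for more advanced messages):
```python
messages_for_llm = [
    {"role": "system", "content": "You are a helpful geography expert."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What about Spain?"},
]
```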
## Under the Hood: How It Works
`Memory` is conceptually simple. It's primarily a wrapper around a standard Python list, ensuring messages are stored correctly and providing convenient methods.
Here's a simplified flow of how an agent uses memory:
```mermaid
sequenceDiagram
participant User
participant Agent as BaseAgent (app/agent/base.py)
participant Mem as Memory (app/schema.py)
participant LLM as LLM Class (app/llm.py)
participant LLM_API as Actual LLM API
User->>+Agent: Sends message ("What about Spain?")
Agent->>+Mem: update_memory(role="user", content="What about Spain?")
Mem->>Mem: Adds Message(role='user', ...) to internal list
Mem-->>-Agent: Memory updated
Agent->>Agent: Needs to generate response
Agent->>+Mem: Get all messages (memory.messages)
Mem-->>-Agent: Returns list of Message objects
Agent->>Agent: Formats messages to dict list (memory.to_dict_list())
Agent->>+LLM: ask(messages=formatted_list)
LLM->>LLM_API: Sends request with history
LLM_API-->>LLM: Receives response ("The capital is Madrid.")
LLM-->>-Agent: Returns text response
Agent->>+Mem: update_memory(role="assistant", content="The capital is Madrid.")
Mem->>Mem: Adds Message(role='assistant', ...) to internal list
Mem-->>-Agent: Memory updated
Agent->>-User: Sends response ("The capital is Madrid.")
```
**Code Glimpse:**
Let's look at the core parts in `app/schema.py`:
```python
# Simplified snippet from app/schema.py
from typing import List, Optional
from pydantic import BaseModel, Field

# (Role enum and other definitions are here)

class Message(BaseModel):
    role: str  # Simplified: In reality uses ROLE_TYPE Literal
    content: Optional[str] = None
    # ... other optional fields like tool_calls, name, etc.

    def to_dict(self) -> dict:
        # Creates a dictionary representation, skipping None values
        message_dict = {"role": self.role}
        if self.content is not None:
            message_dict["content"] = self.content
        # ... add other fields if they exist ...
        return message_dict

    @classmethod
    def user_message(cls, content: str) -> "Message":
        return cls(role="user", content=content)

    @classmethod
    def assistant_message(cls, content: Optional[str]) -> "Message":
        return cls(role="assistant", content=content)

    # ... other classmethods like system_message, tool_message ...

class Memory(BaseModel):
    messages: List[Message] = Field(default_factory=list)
    max_messages: int = 100  # Example limit

    def add_message(self, message: Message) -> None:
        """Add a single message to the list."""
        self.messages.append(message)
        # Optional: Trim old messages if limit exceeded
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages :]

    def to_dict_list(self) -> List[dict]:
        """Convert all stored messages to dictionaries."""
        return [msg.to_dict() for msg in self.messages]

    # ... other methods like clear(), get_recent_messages() ...
```
**Explanation:**
* The `Message` class uses Pydantic `BaseModel` for structure and validation. It clearly defines `role` and `content`. The classmethods (`user_message`, etc.) are just convenient ways to create instances with the role pre-filled. `to_dict` prepares it for API calls.
* The `Memory` class also uses `BaseModel`. Its main part is `messages: List[Message]`, which holds the conversation history. `add_message` simply appends to this list (and optionally trims it). `to_dict_list` iterates through the stored messages and converts each one using its `to_dict` method.
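Based on the `max_messages` field shown above, you can see the trimming behaviour with a tiny experiment (a sketch; 100 is just the default from the snippet, so we lower it here to make the effect visible):
```python
from app.schema import Memory, Message

small_memory = Memory(max_messages=2)  # keep only the two most recent messages

small_memory.add_message(Message.user_message("first"))
small_memory.add_message(Message.assistant_message("second"))
small_memory.add_message(Message.user_message("third"))  # "first" gets trimmed away

print([msg.content for msg in small_memory.messages])
# Expected: ['second', 'third']
```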
And here's how an agent might use its memory attribute (simplified from `app/agent/base.py`):
```python
# Simplified conceptual snippet inspired by app/agent/base.py
from app.schema import Memory, Message, ROLE_TYPE  # Simplified imports
from app.llm import LLM

class SimplifiedAgent:
    def __init__(self):
        self.memory = Memory()  # Agent holds a Memory instance
        self.llm = LLM()  # Agent has access to the LLM

    def add_user_input(self, text: str):
        """Adds user input to memory."""
        user_msg = Message.user_message(text)
        self.memory.add_message(user_msg)
        print(f"Agent Memory Updated with: {user_msg.to_dict()}")

    async def generate_response(self) -> str:
        """Generates a response based on memory."""
        print("Agent consulting memory...")
        messages_for_llm = self.memory.to_dict_list()
        print(f"Sending {len(messages_for_llm)} messages to LLM...")

        # The actual call to the LLM
        response_text = await self.llm.ask(messages=messages_for_llm)

        # Add assistant response to memory
        assistant_msg = Message.assistant_message(response_text)
        self.memory.add_message(assistant_msg)
        print(f"Agent Memory Updated with: {assistant_msg.to_dict()}")
        return response_text

# Example Usage (needs async context)
# agent = SimplifiedAgent()
# agent.add_user_input("What is the capital of France?")
# response = await agent.generate_response()  # Gets "Paris"
# agent.add_user_input("What about Spain?")
# response2 = await agent.generate_response()  # Gets "Madrid"
```
**Explanation:**
* The agent has `self.memory`.
* When input arrives (`add_user_input`), it creates a `Message` and adds it using `self.memory.add_message`.
* When generating a response (`generate_response`), it retrieves the history using `self.memory.to_dict_list()` and passes it to `self.llm.ask`.
* It then adds the LLM's response back into memory as an `assistant` message.
## Wrapping Up Chapter 2
You've now learned about `Message` (a single conversational turn with a role and content) and `Memory` (the ordered list storing these messages). Together, they provide the crucial context agents need to understand conversations and respond coherently. They act as the agent's short-term memory or chat log.
We have the brain ([LLM](01_llm.md)) and the memory (`Message`/`Memory`). Now we need something to orchestrate the process to receive input, consult memory, use the LLM, potentially use tools, and manage its state. That's the job of the Agent itself.
Let's move on to [Chapter 3: BaseAgent](03_baseagent.md) to see how agents are structured and how they use these core components.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,228 @@
# Chapter 3: BaseAgent - The Agent Blueprint
In the previous chapters, we learned about the "brain" ([Chapter 1: The LLM](01_llm.md)) that powers our agents and how they remember conversations using [Chapter 2: Message / Memory](02_message___memory.md). Now, let's talk about the agent itself!
Imagine you want to build different kinds of digital helpers: one that can browse the web, one that can write code, and maybe one that just answers questions. While they have different jobs, they probably share some basic features, right? They all need a name, a way to remember things, a way to know if they are busy or waiting, and a process to follow when doing their work.
## What Problem Does `BaseAgent` Solve?
Building every agent from scratch, defining these common features over and over again, would be tedious and error-prone. It's like designing a completely new car frame, engine, and wheels every time you want to build a new car model (a sports car, a truck, a sedan). It's inefficient!
This is where `BaseAgent` comes in. Think of it as the **master blueprint** or the standard **chassis and engine design** for *all* agents in OpenManus.
**Use Case:** Let's say we want to create a simple "EchoAgent" that just repeats back whatever the user says. Even this simple agent needs:
* A name (e.g., "EchoBot").
* Memory to store what the user said.
* A state (is it idle, or is it working on echoing?).
* A way to run and perform its simple "echo" task.
Instead of defining all these basics for EchoAgent, and then again for a "WeatherAgent", and again for a "CodeWriterAgent", we define them *once* in `BaseAgent`.
## Key Concepts: The Building Blocks of an Agent
`BaseAgent` (`app/agent/base.py`) defines the fundamental properties and abilities that *all* agents built using OpenManus must have. It ensures consistency and saves us from repeating code. Here are the essential parts:
1. **`name` (str):** A unique name to identify the agent (e.g., "browser_agent", "code_writer").
2. **`description` (Optional[str]):** A short explanation of what the agent does.
3. **`state` (AgentState):** The agent's current status. Is it doing nothing (`IDLE`), actively working (`RUNNING`), finished its task (`FINISHED`), or encountered a problem (`ERROR`)?
4. **`memory` (Memory):** An instance of the `Memory` class we learned about in [Chapter 2: Message / Memory](02_message___memory.md). This is where the agent stores the conversation history (`Message` objects).
5. **`llm` (LLM):** An instance of the `LLM` class from [Chapter 1: The LLM - Your Agent's Brainpower](01_llm.md). This gives the agent access to the language model for "thinking".
6. **`run()` method:** The main function you call to start the agent's work. It manages the overall process, like changing the state to `RUNNING` and repeatedly calling the `step()` method.
7. **`step()` method:** This is the crucial part! `BaseAgent` defines *that* agents must have a `step` method, but it doesn't say *what* the step does. It's marked as `abstract`, meaning **each specific agent type (like our EchoAgent or a BrowserAgent) must provide its own implementation of `step()`**. This method defines the actual work the agent performs in a single cycle.
8. **`max_steps` (int):** A safety limit on how many `step` cycles the agent can run before stopping automatically. This prevents agents from running forever if they get stuck.
Think of it like this:
* `BaseAgent` provides the car chassis (`name`, `state`), the engine (`llm`), the fuel tank (`memory`), and the ignition key (`run()`).
* The `step()` method is like the specific driving instructions (turn left, accelerate, brake) that make a sports car drive differently from a truck, even though they share the same basic parts.
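For reference, the `AgentState` values mentioned above live in `app/schema.py` as a small enum. A simplified sketch (names taken from the states listed in this chapter; other states and details omitted):
```python
from enum import Enum

class AgentState(str, Enum):
    """The execution status of an agent."""
    IDLE = "IDLE"          # doing nothing, ready to accept work
    RUNNING = "RUNNING"    # actively working through steps
    FINISHED = "FINISHED"  # task completed
    ERROR = "ERROR"        # something went wrong
```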
## How Do We Use `BaseAgent`?
You typically don't use `BaseAgent` directly. It's an **abstract** class, meaning it's a template, not a finished product. You **build upon it** by creating new classes that *inherit* from `BaseAgent`.
Let's imagine creating our simple `EchoAgent`:
```python
# Conceptual Example - Not runnable code, just for illustration
# Import BaseAgent and necessary components
from app.agent.base import BaseAgent
from app.schema import AgentState, Message

class EchoAgent(BaseAgent):  # Inherits from BaseAgent!
    """A simple agent that echoes the last user message."""

    name: str = "EchoBot"
    description: str = "Repeats the last thing the user said."

    # THIS IS THE IMPORTANT PART - We implement the abstract 'step' method
    async def step(self) -> str:
        """Perform one step: find the last user message and echo it."""
        last_user_message = None
        # Look backwards through memory to find the last user message
        for msg in reversed(self.memory.messages):
            if msg.role == "user":
                last_user_message = msg
                break

        if last_user_message and last_user_message.content:
            echo_content = f"You said: {last_user_message.content}"
            # Add the echo response to memory as an 'assistant' message
            self.update_memory("assistant", echo_content)
            # Setting the state to FINISHED tells run() to stop looping
            # (Simplified: a real agent might need more complex logic)
            self.state = AgentState.FINISHED  # Indicate task is done
            return echo_content  # Return the result of this step
        else:
            self.state = AgentState.FINISHED  # Nothing to echo, finish
            return "I didn't hear anything from the user to echo."

# How you might conceptually use it:
# echo_bot = EchoAgent()
# # Add a user message to its memory
# echo_bot.update_memory("user", "Hello there!")
# # Start the agent's run loop
# result = await echo_bot.run()
# print(result)  # Output would contain: "Step 1: You said: Hello there!"
```
**Explanation:**
1. `class EchoAgent(BaseAgent):` - We declare that `EchoAgent` is a *type of* `BaseAgent`. It automatically gets all the standard parts like `name`, `memory`, `llm`, `state`, and the `run()` method.
2. We provide a specific `name` and `description`.
3. Crucially, we define `async def step(self) -> str:`. This is *our* specific logic for the `EchoAgent`. In this case, it looks through the `memory` (inherited from `BaseAgent`), finds the last user message, and prepares an echo response.
4. It uses `self.update_memory(...)` (a helper method provided by `BaseAgent`) to add its response to the memory.
5. It sets its `self.state` to `FINISHED` to signal that its job is done after this one step.
6. The `run()` method (which we didn't have to write, it's inherited from `BaseAgent`) would handle starting the process, calling our `step()` method, and returning the final result.
This way, we only had to focus on the unique part, the echoing logic inside `step()`, while `BaseAgent` handled the common structure. More complex agents like `BrowserAgent` or `ToolCallAgent` (found in `app/agent/`) follow the same principle but have much more sophisticated `step()` methods, often involving thinking with the [LLM](01_llm.md) and using [Tools](04_tool___toolcollection.md).
## Under the Hood: The `run()` Loop
What actually happens when you call `agent.run()`? The `BaseAgent` provides a standard execution loop:
1. **Check State:** It makes sure the agent is `IDLE` before starting. You can't run an agent that's already running or has finished.
2. **Set State:** It changes the agent's state to `RUNNING`. It uses a safety mechanism (`state_context`) to ensure the state is handled correctly, even if errors occur.
3. **Initialize:** If you provided an initial request (e.g., `agent.run("What's the weather?")`), it adds that as the first `user` message to the `memory`.
4. **Loop:** It enters a loop that continues as long as:
    * The agent hasn't reached its `max_steps` limit.
    * The agent's state is still `RUNNING` (i.e., it hasn't set itself to `FINISHED` or `ERROR` inside its `step()` method).
5. **Increment Step Counter:** It increases `current_step`.
6. **Execute `step()`:** This is where it calls the specific `step()` method implemented by the subclass (like our `EchoAgent.step()`). **This is the core of the agent's unique behavior.**
7. **Record Result:** It stores the string returned by `step()`.
8. **Repeat:** It goes back to step 4 until the loop condition is false.
9. **Finalize:** Once the loop finishes (either `max_steps` reached or state changed to `FINISHED`/`ERROR`), it sets the state back to `IDLE` (unless it ended in `ERROR`).
10. **Return Results:** It returns a string summarizing the results from all the steps.
Here's a simplified diagram showing the flow:
```mermaid
sequenceDiagram
participant User
participant MyAgent as MySpecificAgent (e.g., EchoAgent)
participant BaseRun as BaseAgent.run()
participant MyStep as MySpecificAgent.step()
User->>+MyAgent: Calls run("Initial Request")
MyAgent->>+BaseRun: run("Initial Request")
BaseRun->>BaseRun: Check state (must be IDLE)
BaseRun->>MyAgent: Set state = RUNNING
BaseRun->>MyAgent: Add "Initial Request" to memory
Note over BaseRun, MyStep: Loop starts (while step < max_steps AND state == RUNNING)
loop Execution Loop
BaseRun->>BaseRun: Increment current_step
BaseRun->>+MyStep: Calls step()
MyStep->>MyStep: Executes specific logic (e.g., reads memory, calls LLM, adds response to memory)
MyStep->>MyAgent: Maybe sets state = FINISHED
MyStep-->>-BaseRun: Returns step_result (string)
BaseRun->>BaseRun: Record step_result
BaseRun->>BaseRun: Check loop condition (step < max_steps AND state == RUNNING?)
end
Note over BaseRun: Loop ends
BaseRun->>MyAgent: Set state = IDLE (or keep ERROR)
BaseRun-->>-MyAgent: Returns combined results
MyAgent-->>-User: Returns final result string
```
## Code Glimpse: Inside `app/agent/base.py`
Let's peek at the `BaseAgent` definition itself.
```python
# Simplified snippet from app/agent/base.py
from abc import ABC, abstractmethod  # Needed for abstract classes/methods
from typing import Optional

from pydantic import BaseModel, Field

from app.llm import LLM
from app.schema import AgentState, Memory, Message

class BaseAgent(BaseModel, ABC):  # Inherits from Pydantic's BaseModel and ABC
    """Abstract base class for managing agent state and execution."""

    # Core attributes defined here
    name: str = Field(..., description="Unique name")
    description: Optional[str] = Field(None)
    state: AgentState = Field(default=AgentState.IDLE)
    memory: Memory = Field(default_factory=Memory)  # Gets a Memory instance
    llm: LLM = Field(default_factory=LLM)  # Gets an LLM instance
    max_steps: int = Field(default=10)
    current_step: int = Field(default=0)

    # ... other config and helper methods like update_memory ...

    async def run(self, request: Optional[str] = None) -> str:
        """Execute the agent's main loop asynchronously."""
        if self.state != AgentState.IDLE:
            raise RuntimeError("Agent not IDLE")
        if request:
            self.update_memory("user", request)  # Add initial request

        results = []
        # Simplified: using a context manager for state changes
        # async with self.state_context(AgentState.RUNNING):
        self.state = AgentState.RUNNING
        try:
            while (self.current_step < self.max_steps and self.state == AgentState.RUNNING):
                self.current_step += 1
                # ====> THE CORE CALL <====
                step_result = await self.step()  # Calls the subclass's step method
                results.append(f"Step {self.current_step}: {step_result}")
                # (Simplified: actual code has more checks)
        finally:
            # Reset state after loop finishes or if error occurs
            if self.state != AgentState.ERROR:
                self.state = AgentState.IDLE
        return "\n".join(results)

    @abstractmethod  # Marks this method as needing implementation by subclasses
    async def step(self) -> str:
        """Execute a single step in the agent's workflow. Must be implemented by subclasses."""
        pass  # BaseAgent provides no implementation for step()

    def update_memory(self, role: str, content: str) -> None:  # ... other params elided ...
        """Helper to add messages to self.memory easily."""
        # ... implementation uses Message.user_message etc. ...
        self.memory.add_message(...)
```
**Explanation:**
* `class BaseAgent(BaseModel, ABC):` declares it as both a Pydantic model (for data validation) and an Abstract Base Class.
* Fields like `name`, `state`, `memory`, `llm`, `max_steps` are defined. `default_factory=Memory` means each agent gets its own fresh `Memory` instance when created.
* The `run()` method contains the loop logic we discussed, crucially calling `await self.step()`.
* `@abstractmethod` above `async def step(self) -> str:` signals that any class inheriting from `BaseAgent` *must* provide its own version of the `step` method. `BaseAgent` itself just puts `pass` (do nothing) there.
* Helper methods like `update_memory` are provided for convenience.
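One practical consequence of `@abstractmethod` is that Python refuses to instantiate a class that forgets to implement `step()`. A minimal, self-contained sketch of that behaviour (using plain `abc`, independent of the OpenManus classes):
```python
from abc import ABC, abstractmethod

class DemoBaseAgent(ABC):
    @abstractmethod
    async def step(self) -> str:
        """Subclasses must implement one unit of work."""

class IncompleteAgent(DemoBaseAgent):
    pass  # forgot to implement step()

try:
    IncompleteAgent()
except TypeError as error:
    print(error)  # complains that the abstract method 'step' is not implemented
```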
## Wrapping Up Chapter 3
We've learned about `BaseAgent`, the fundamental blueprint for all agents in OpenManus. It provides the common structure (`name`, `state`, `memory`, `llm`) and the core execution loop (`run()`), freeing us to focus on the unique logic of each agent by implementing the `step()` method. It acts as the chassis upon which specialized agents are built.
Now that we have the agent structure, how do agents gain specific skills beyond just talking to the LLM? How can they browse the web, run code, or interact with files? They use **Tools**!
Let's move on to [Chapter 4: Tool / ToolCollection](04_tool___toolcollection.md) to explore how we give agents capabilities to interact with the world.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,309 @@
# Chapter 4: Tool / ToolCollection - Giving Your Agent Skills
In [Chapter 3: BaseAgent - The Agent Blueprint](03_baseagent.md), we learned how `BaseAgent` provides the standard structure for our agents, including a brain ([LLM](01_llm.md)) and memory ([Message / Memory](02_message___memory.md)). But what if we want our agent to do more than just *think* and *remember*? What if we want it to *act* in the world like searching the web, running code, or editing files?
This is where **Tools** come in!
## What Problem Do They Solve?
Imagine an agent trying to answer the question: "What's the weather like in Tokyo *right now*?"
The agent's LLM brain has a lot of general knowledge, but it doesn't have *real-time* access to the internet. It can't check the current weather. It needs a specific **capability** or **skill** to do that.
Similarly, if you ask an agent to "Write a python script that prints 'hello world' and save it to a file named `hello.py`," the agent needs the ability to:
1. Understand the request (using its LLM).
2. Write the code (using its LLM).
3. Actually *execute* code to create and write to a file.
Steps 1 and 2 are handled by the LLM, but step 3 requires interacting with the computer's file system, something the LLM can't do directly.
**Tools** give agents these specific, actionable skills. A `ToolCollection` organizes these skills so the agent knows what it can do.
**Use Case:** Let's build towards an agent that can:
1. Search the web for today's date.
2. Tell the user the date.
This agent needs a "Web Search" tool.
## Key Concepts: Tools and Toolboxes
Let's break down the two main ideas:
### 1. `BaseTool`: The Blueprint for a Skill
Think of `BaseTool` (`app/tool/base.py`) as the *template* or *design specification* for any tool. It doesn't *do* anything itself, but it defines what every tool needs to have:
* **`name` (str):** A short, descriptive name for the tool (e.g., `web_search`, `file_writer`, `code_runner`). This is how the agent (or LLM) identifies the tool.
* **`description` (str):** A clear explanation of what the tool does, what it's good for, and when to use it. This is crucial for the LLM to decide *which* tool to use for a given task.
* **`parameters` (dict):** A definition of the inputs the tool expects. For example, a `web_search` tool needs a `query` input, and a `file_writer` needs a `path` and `content`. This is defined using a standard format called JSON Schema.
* **`execute` method:** An **abstract** method. This means `BaseTool` says "every tool *must* have an execute method", but each specific tool needs to provide its *own* instructions for how to actually perform the action.
You almost never use `BaseTool` directly. You use it as a starting point to create *actual*, usable tools.
### 2. Concrete Tools: The Actual Skills
These are specific classes that *inherit* from `BaseTool` and provide the real implementation for the `execute` method. OpenManus comes with several pre-built tools:
* **`WebSearch` (`app/tool/web_search.py`):** Searches the web using engines like Google, Bing, etc.
* **`Bash` (`app/tool/bash.py`):** Executes shell commands (like `ls`, `pwd`, `python script.py`).
* **`StrReplaceEditor` (`app/tool/str_replace_editor.py`):** Views, creates, and edits files by replacing text.
* **`BrowserUseTool` (`app/tool/browser_use_tool.py`):** Interacts with web pages like a user (clicking, filling forms, etc.).
* **`Terminate` (`app/tool/terminate.py`):** A special tool used by agents to signal they have finished their task.
Each of these defines its specific `name`, `description`, `parameters`, and implements the `execute` method to perform its unique action.
### 3. `ToolCollection`: The Agent's Toolbox
Think of a handyman. They don't just carry one tool; they have a toolbox filled with hammers, screwdrivers, wrenches, etc.
A `ToolCollection` (`app/tool/tool_collection.py`) is like that toolbox for an agent.
* It holds a list of specific tool instances (like `WebSearch`, `Bash`).
* It allows the agent (and its LLM) to see all the available tools and their descriptions.
* It provides a way to execute a specific tool by its name.
When an agent needs to perform an action, its LLM can look at the `ToolCollection`, read the descriptions of the available tools, choose the best one for the job, figure out the necessary inputs based on the tool's `parameters`, and then ask the `ToolCollection` to execute that tool with those inputs.
## How Do We Use Them?
Let's see how we can equip an agent with a simple tool. We'll create a basic "EchoTool" first.
**1. Creating a Concrete Tool (Inheriting from `BaseTool`):**
```python
# Import the necessary base class
from app.tool.base import BaseTool, ToolResult

# Define our simple tool
class EchoTool(BaseTool):
    """A simple tool that echoes the input text."""

    name: str = "echo_message"
    description: str = "Repeats back the text provided in the 'message' parameter."
    parameters: dict = {
        "type": "object",
        "properties": {
            "message": {
                "type": "string",
                "description": "The text to be echoed back.",
            },
        },
        "required": ["message"],  # Tells the LLM 'message' must be provided
    }

    # Implement the actual action
    async def execute(self, message: str) -> ToolResult:
        """Takes a message and returns it."""
        print(f"EchoTool executing with message: '{message}'")
        # ToolResult is a standard way to return tool output
        return ToolResult(output=f"You said: {message}")

# Create an instance of our tool
echo_tool_instance = EchoTool()
print(f"Tool Name: {echo_tool_instance.name}")
print(f"Tool Description: {echo_tool_instance.description}")
```
**Explanation:**
* We import `BaseTool` and `ToolResult` (a standard object for wrapping tool outputs).
* `class EchoTool(BaseTool):` declares that our `EchoTool` *is a type of* `BaseTool`.
* We define the `name`, `description`, and `parameters` according to the `BaseTool` template. The `parameters` structure tells the LLM what input is expected (`message` as a string) and that it's required.
* We implement `async def execute(self, message: str) -> ToolResult:`. This is the *specific* logic for our tool. It takes the `message` input and returns it wrapped in a `ToolResult`.
**Example Output:**
```
Tool Name: echo_message
Tool Description: Repeats back the text provided in the 'message' parameter.
```
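Since `execute` is just an async method, you can also try the tool on its own before putting it into a collection (a usage sketch based on the `EchoTool` instance created above):
```python
import asyncio

async def try_echo():
    result = await echo_tool_instance.execute(message="Hello, tools!")
    print(result.output)  # -> "You said: Hello, tools!"

asyncio.run(try_echo())
```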
**2. Creating a ToolCollection:**
Now, let's put our `EchoTool` and the built-in `WebSearch` tool into a toolbox.
```python
# Import ToolCollection and the tools we want
from app.tool import ToolCollection, WebSearch
# Assume EchoTool class is defined as above
# from your_module import EchoTool # Or wherever EchoTool is defined
# Create instances of the tools
echo_tool = EchoTool()
web_search_tool = WebSearch() # Uses default settings
# Create a ToolCollection containing these tools
my_toolbox = ToolCollection(echo_tool, web_search_tool)
# See the names of the tools in the collection
tool_names = [tool.name for tool in my_toolbox]
print(f"Tools in the toolbox: {tool_names}")
# Get the parameters needed for the LLM
tool_params_for_llm = my_toolbox.to_params()
print(f"\nParameters for LLM (showing first tool):")
import json
print(json.dumps(tool_params_for_llm[0], indent=2))
```
**Explanation:**
* We import `ToolCollection` and the specific tools (`WebSearch`, `EchoTool`).
* We create instances of the tools we need.
* `my_toolbox = ToolCollection(echo_tool, web_search_tool)` creates the collection, holding our tool instances.
* We can access the tools inside using `my_toolbox.tools` or iterate over `my_toolbox`.
* `my_toolbox.to_params()` is a crucial method. It formats the `name`, `description`, and `parameters` of *all* tools in the collection into a list of dictionaries. This specific format is exactly what the agent's [LLM](01_llm.md) needs (when using its `ask_tool` method) to understand which tools are available and how to use them.
**Example Output:**
```
Tools in the toolbox: ['echo_message', 'web_search']
Parameters for LLM (showing first tool):
{
"type": "function",
"function": {
"name": "echo_message",
"description": "Repeats back the text provided in the 'message' parameter.",
"parameters": {
"type": "object",
"properties": {
"message": {
"type": "string",
"description": "The text to be echoed back."
}
},
"required": [
"message"
]
}
}
}
```
**3. Agent Using the ToolCollection:**
Now, how does an agent like `ToolCallAgent` (a specific type of [BaseAgent](03_baseagent.md)) use this?
Conceptually (the real agent code is more complex):
1. The agent is configured with a `ToolCollection` (like `my_toolbox`).
2. When the agent needs to figure out the next step, it calls its LLM's `ask_tool` method.
3. It passes the conversation history ([Message / Memory](02_message___memory.md)) AND the output of `my_toolbox.to_params()` to the LLM.
4. The LLM looks at the conversation and the list of available tools (from `to_params()`). It reads the `description` of each tool to understand what it does.
5. If the LLM decides a tool is needed (e.g., the user asked "What's today's date?", the LLM sees the `web_search` tool is available and appropriate), it will generate a special response indicating:
    * The `name` of the tool to use (e.g., `"web_search"`).
    * The `arguments` (inputs) for the tool, based on its `parameters` (e.g., `{"query": "today's date"}`).
6. The agent receives this response from the LLM.
7. The agent then uses the `ToolCollection`'s `execute` method: `await my_toolbox.execute(name="web_search", tool_input={"query": "today's date"})`.
8. The `ToolCollection` finds the `WebSearch` tool instance in its internal `tool_map` and calls *its* `execute` method with the provided input.
9. The `WebSearch` tool runs, performs the actual web search, and returns the results (as a `ToolResult` or similar).
10. The agent takes this result, formats it as a `tool` message, adds it to its memory, and continues its thinking process (often asking the LLM again, now with the tool's result as context).
The `ToolCollection` acts as the crucial bridge between the LLM's *decision* to use a tool and the *actual execution* of that tool's code.
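To make that decide-then-execute loop concrete, here is a rough sketch of its shape. The method names (`ask_tool`, `to_params`, `execute`, `Message.tool_message`) come from the description above, but the exact fields on the LLM response and the `tool_message` arguments shown here follow the OpenAI-style tool-call structure and are illustrative; the real `ToolCallAgent` adds error handling, special tools like `Terminate`, and more bookkeeping:
```python
import json
from app.schema import Message

# (inside an async method of a tool-calling agent)
# 1. Ask the LLM, advertising the available tools
response = await agent.llm.ask_tool(
    messages=agent.memory.to_dict_list(),  # conversation so far
    tools=my_toolbox.to_params(),          # names, descriptions, parameters of each tool
)

# 2. If the LLM chose one or more tools, run them via the collection
for tool_call in (response.tool_calls or []):
    name = tool_call.function.name                    # e.g. "web_search"
    args = json.loads(tool_call.function.arguments)   # e.g. {"query": "today's date"}
    result = await my_toolbox.execute(name=name, tool_input=args)

    # 3. Record the tool output so the LLM can see it on the next turn
    agent.memory.add_message(
        Message.tool_message(content=str(result), tool_call_id=tool_call.id, name=name)
    )
```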
## Under the Hood: How `ToolCollection.execute` Works
Let's trace the flow when an agent asks its `ToolCollection` to run a tool:
```mermaid
sequenceDiagram
participant Agent as ToolCallAgent
participant LLM as LLM (Deciding Step)
participant Toolbox as ToolCollection
participant SpecificTool as e.g., WebSearch Tool
Agent->>+LLM: ask_tool(messages, tools=Toolbox.to_params())
LLM->>LLM: Analyzes messages & available tools
LLM-->>-Agent: Response indicating tool call: name='web_search', arguments={'query': '...'}
Agent->>+Toolbox: execute(name='web_search', tool_input={'query': '...'})
Toolbox->>Toolbox: Look up 'web_search' in internal tool_map
Note right of Toolbox: Finds the WebSearch instance
Toolbox->>+SpecificTool: Calls execute(**tool_input) on the found tool
SpecificTool->>SpecificTool: Performs actual web search action
SpecificTool-->>-Toolbox: Returns ToolResult (output="...", error=None)
Toolbox-->>-Agent: Returns the ToolResult
Agent->>Agent: Processes the result (adds to memory, etc.)
```
**Code Glimpse:**
Let's look at the `ToolCollection` itself in `app/tool/tool_collection.py`:
```python
# Simplified snippet from app/tool/tool_collection.py
from typing import Any, Dict, List, Tuple

from app.tool.base import BaseTool, ToolResult, ToolFailure
from app.exceptions import ToolError

class ToolCollection:
    # ... (Config class) ...

    tools: Tuple[BaseTool, ...]  # Holds the tool instances
    tool_map: Dict[str, BaseTool]  # Maps name to tool instance for quick lookup

    def __init__(self, *tools: BaseTool):
        """Initializes with a sequence of tools."""
        self.tools = tools
        # Create the map for easy lookup by name
        self.tool_map = {tool.name: tool for tool in tools}

    def to_params(self) -> List[Dict[str, Any]]:
        """Formats tools for the LLM API."""
        # Calls the 'to_param()' method on each tool
        return [tool.to_param() for tool in self.tools]

    async def execute(
        self, *, name: str, tool_input: Dict[str, Any] = None
    ) -> ToolResult:
        """Finds a tool by name and executes it."""
        # 1. Find the tool instance using the name
        tool = self.tool_map.get(name)
        if not tool:
            # Return a standard failure result if tool not found
            return ToolFailure(error=f"Tool {name} is invalid")

        # 2. Execute the tool's specific method
        try:
            # The 'tool(**tool_input)' calls the tool instance's __call__ method,
            # which in BaseTool, calls the tool's 'execute' method.
            # The ** unpacks the dictionary into keyword arguments.
            result = await tool(**(tool_input or {}))
            # Ensure the result is a ToolResult (or subclass)
            return result if isinstance(result, ToolResult) else ToolResult(output=str(result))
        except ToolError as e:
            # Handle errors specific to tools
            return ToolFailure(error=e.message)
        except Exception as e:
            # Handle unexpected errors during execution
            return ToolFailure(error=f"Unexpected error executing tool {name}: {e}")

    # ... other methods like add_tool, __iter__ ...
```
**Explanation:**
* The `__init__` method takes tool instances and stores them in `self.tools` (a tuple) and `self.tool_map` (a dictionary mapping name to instance).
* `to_params` iterates through `self.tools` and calls each tool's `to_param()` method (defined in `BaseTool`) to get the LLM-compatible format.
* `execute` is the core method used by agents:
* It uses `self.tool_map.get(name)` to quickly find the correct tool instance based on the requested name.
* If found, it calls `await tool(**(tool_input or {}))`. The `**` unpacks the `tool_input` dictionary into keyword arguments for the tool's `execute` method (e.g., `message="hello"` for our `EchoTool`, or `query="today's date"` for `WebSearch`).
* It wraps the execution in `try...except` blocks to catch errors and return a standardized `ToolFailure` result if anything goes wrong.
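As a quick sanity check, you can drive `execute` by hand with the toolbox built earlier in this chapter (a usage sketch; the second call shows the standardized failure path for an unknown tool name):
```python
import asyncio

async def demo():
    # Known tool: dispatches to EchoTool.execute(message=...)
    ok = await my_toolbox.execute(name="echo_message", tool_input={"message": "hi"})
    print(ok.output)  # -> "You said: hi"

    # Unknown tool: returns a ToolFailure instead of raising
    bad = await my_toolbox.execute(name="does_not_exist")
    print(bad.error)  # -> "Tool does_not_exist is invalid"

asyncio.run(demo())
```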
## Wrapping Up Chapter 4
We've learned how **Tools** give agents specific skills beyond basic language understanding.
* `BaseTool` is the abstract blueprint defining a tool's `name`, `description`, and expected `parameters`.
* Concrete tools (like `WebSearch`, `Bash`, or our custom `EchoTool`) inherit from `BaseTool` and implement the actual `execute` logic.
* `ToolCollection` acts as the agent's toolbox, holding various tools and providing methods (`to_params`, `execute`) for the agent (often guided by its [LLM](01_llm.md)) to discover and use these capabilities.
With tools, agents can interact with external systems, run code, access real-time data, and perform complex actions, making them much more powerful.
But how do we coordinate multiple agents, potentially using different tools, to work together on a larger task? That's where Flows come in.
Let's move on to [Chapter 5: BaseFlow](05_baseflow.md) to see how we orchestrate complex workflows involving multiple agents and steps.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,361 @@
# Chapter 5: BaseFlow - Managing Multi-Step Projects
In [Chapter 4: Tool / ToolCollection](04_tool___toolcollection.md), we saw how to give agents specific skills like web searching or running code using Tools. Now, imagine you have a task that requires multiple steps, maybe even using different skills (tools) or agents along the way. How do you coordinate this complex work?
That's where **Flows** come in!
## What Problem Does `BaseFlow` Solve?
Think about a simple agent, maybe one equipped with a web search tool. You could ask it, "What's the capital of France?" and it could use its tool and answer "Paris." That's a single-step task.
But what if you ask something more complex, like: "Research the pros and cons of electric cars and then write a short blog post summarizing them."
This isn't a single action. It involves:
1. **Planning:** Figuring out the steps needed (e.g., search for pros, search for cons, structure blog post, write draft, review draft).
2. **Executing Step 1:** Using a web search tool to find pros.
3. **Executing Step 2:** Using a web search tool to find cons.
4. **Executing Step 3:** Maybe using the [LLM](01_llm.md) brain to outline the blog post.
5. **Executing Step 4:** Using the LLM to write the post based on the research and outline.
6. **Executing Step 5:** Perhaps a final review step.
A single [BaseAgent](03_baseagent.md) *might* be able to handle this if it's very sophisticated, but it's often clearer and more manageable to have a dedicated **orchestrator** or **project manager** overseeing the process.
**This is the job of a `Flow`.** Specifically, `BaseFlow` is the blueprint for these orchestrators. It defines a structure that can manage multiple agents and coordinate their work to achieve a larger goal according to a specific strategy (like following a pre-defined plan).
**Use Case:** Let's stick with our "Research and Write" task. We need something to manage the overall process: first the research, then the writing. A `PlanningFlow` (a specific type of Flow built on `BaseFlow`) is perfect for this. It will first create a plan (like the steps above) and then execute each step, potentially assigning different steps to different specialized agents if needed.
## Key Concepts: Flow, Agents, and Strategy
1. **`BaseFlow` (`app/flow/base.py`):**
   * This is the **abstract blueprint** for all flows. Think of it as the job description for a project manager: it says a manager needs to know their team (agents) and have a way to run the project (`execute` method), but it doesn't dictate *how* they manage.
   * It mainly holds a dictionary of available `agents` that can be used within the flow.
   * You don't use `BaseFlow` directly; you use specific implementations.
2. **Concrete Flows (e.g., `PlanningFlow` in `app/flow/planning.py`):**
   * These are the **specific strategies** for managing the project. They *inherit* from `BaseFlow`.
   * `PlanningFlow` is a key example. Its strategy is:
     1. Receive the overall goal.
     2. Use an LLM and a special `PlanningTool` to break the goal down into a sequence of steps (the "plan").
     3. Execute each step in the plan, one by one, usually by calling the `run()` method of an appropriate [BaseAgent](03_baseagent.md).
     4. Track the status of each step (e.g., not started, in progress, completed).
3. **Agents within the Flow:**
   * These are the "workers" or "specialists" managed by the flow.
   * A flow holds one or more [BaseAgent](03_baseagent.md) instances.
   * In a `PlanningFlow`, one agent might be designated as the primary agent (often responsible for helping create the plan), while others (or maybe the same one) act as "executors" for the plan steps. The flow decides which agent is best suited for each step.
Think of it like building a house:
* `BaseFlow` is the concept of a "General Contractor".
* `PlanningFlow` is a specific *type* of General Contractor who always starts by creating a detailed architectural plan and then hires specialists for each phase.
* The `agents` are the specialists: the plumber, the electrician, the carpenter, etc.
* The overall goal ("Build a house") is given to the `PlanningFlow` (Contractor).
* The `PlanningFlow` creates the plan (foundation, framing, plumbing, electrical...).
* The `PlanningFlow` then calls the appropriate `agent` (specialist) for each step in the plan.
## How Do We Use Flows?
You typically use a `FlowFactory` to create a specific type of flow, providing it with the agents it needs.
Let's set up a simple `PlanningFlow` with one agent called "Manus" (which is a general-purpose agent in OpenManus).
```python
# Import necessary classes
from app.agent.manus import Manus # A capable agent
from app.flow.flow_factory import FlowFactory, FlowType
import asyncio # Needed for async execution
# 1. Create the agent(s) we want the flow to manage
# We can give agents specific keys (names) within the flow
agents_for_flow = {
"research_writer": Manus() # Use Manus agent for all tasks
}
# 2. Create the flow using the factory
# We specify the type (PLANNING) and provide the agents
planning_flow_instance = FlowFactory.create_flow(
flow_type=FlowType.PLANNING,
agents=agents_for_flow,
# Optional: specify which agent is primary (if not first)
# primary_agent_key="research_writer"
)
print(f"Created a {type(planning_flow_instance).__name__}")
print(f"Primary agent: {planning_flow_instance.primary_agent.name}")
# 3. Define the overall goal for the flow
overall_goal = "Research the main benefits of solar power and write a short summary."
# Define an async function to run the flow
async def run_the_flow():
print(f"\nExecuting flow with goal: '{overall_goal}'")
# 4. Execute the flow with the goal
final_result = await planning_flow_instance.execute(overall_goal)
print("\n--- Flow Execution Finished ---")
print(f"Final Result:\n{final_result}")
# Run the async function
# asyncio.run(run_the_flow()) # Uncomment to run
```
**Explanation:**
1. We import the agent we want to use (`Manus`) and the `FlowFactory` plus `FlowType`.
2. We create a dictionary `agents_for_flow` mapping a key ("research\_writer") to an instance of our `Manus` agent. This tells the flow which workers are available.
3. We use `FlowFactory.create_flow()` specifying `FlowType.PLANNING` and passing our `agents_for_flow`. The factory handles constructing the `PlanningFlow` object correctly.
4. We define the high-level task (`overall_goal`).
5. We call `await planning_flow_instance.execute(overall_goal)`. This is where the magic happens! The `PlanningFlow` takes over.
**Expected Outcome (High Level):**
When you run this (if uncommented), you won't just get an immediate answer. You'll likely see output indicating:
* A plan is being created (e.g., Step 1: Search for benefits, Step 2: Synthesize findings, Step 3: Write summary).
* The agent ("research\_writer") starting to execute Step 1. This might involve output from the agent using its web search tool.
* The agent moving on to Step 2, then Step 3, potentially showing LLM thinking or writing output.
* Finally, the `execute` call will return a string containing the results of the steps and possibly a final summary generated by the flow or the agent.
The `PlanningFlow` manages this entire multi-step process automatically based on the initial goal.
## Under the Hood: How `PlanningFlow.execute` Works
Let's peek behind the curtain of the `PlanningFlow`'s `execute` method. What happens when you call it?
**High-Level Walkthrough:**
1. **Receive Goal:** The `execute` method gets the `input_text` (our overall goal).
2. **Create Plan (`_create_initial_plan`):**
* It constructs messages for the [LLM](01_llm.md), including a system message asking it to act as a planner.
* It tells the LLM about the `PlanningTool` (a special [Tool](04_tool___toolcollection.md) designed for creating and managing plans).
* It calls the LLM's `ask_tool` method, essentially asking: "Please use the PlanningTool to create a plan for this goal: *{input\_text}*".
* The `PlanningTool` (when called by the LLM) stores the generated steps (e.g., ["Search benefits", "Write summary"]) associated with a unique `plan_id`.
3. **Execution Loop:** The flow enters a loop to execute the plan steps.
* **Get Next Step (`_get_current_step_info`):** It checks the stored plan (using the `PlanningTool`) to find the first step that isn't marked as "completed". It gets the step's text and index.
* **Check for Completion:** If no non-completed steps are found, the plan is finished! The loop breaks.
* **Select Executor (`get_executor`):** It determines which agent should perform the current step. In our simple example, it will always select our "research\_writer" agent. More complex flows could choose based on step type (e.g., a "[CODE]" step might go to a coding agent).
* **Execute Step (`_execute_step`):**
* It prepares a prompt for the selected executor agent, including the current plan status and the specific instruction for the current step (e.g., "You are working on step 0: 'Search benefits'. Please execute this step.").
* It calls the executor agent's `run()` method with this prompt: `await executor.run(step_prompt)`. The agent then does its work (which might involve using its own tools, memory, and LLM).
* It gets the result back from the agent's `run()`.
* **Mark Step Complete (`_mark_step_completed`):** It tells the `PlanningTool` to update the status of the current step to "completed".
* **Loop:** Go back to find the next step.
4. **Finalize (`_finalize_plan`):** Once the loop finishes, it might generate a final summary of the completed plan (potentially using the LLM again).
5. **Return Result:** The accumulated results from executing all the steps are returned as a string.
**Sequence Diagram:**
Here's a simplified view of the process:
```mermaid
sequenceDiagram
participant User
participant PF as PlanningFlow
participant LLM_Planner as LLM (for Planning)
participant PlanTool as PlanningTool
participant Executor as Executor Agent (e.g., Manus)
participant AgentLLM as Agent's LLM (for Execution)
User->>+PF: execute("Research & Summarize Solar Power")
PF->>+LLM_Planner: ask_tool("Create plan...", tools=[PlanTool])
LLM_Planner->>+PlanTool: execute(command='create', steps=['Search', 'Summarize'], ...)
PlanTool-->>-LLM_Planner: Plan created (ID: plan_123)
LLM_Planner-->>-PF: Plan created successfully
Note over PF: Start Execution Loop
loop Plan Steps
PF->>+PlanTool: get_next_step(plan_id='plan_123')
PlanTool-->>-PF: Step 0: "Search"
PF->>PF: Select Executor (Manus)
PF->>+Executor: run("Execute step 0: 'Search'...")
Executor->>+AgentLLM: ask/ask_tool (e.g., use web search)
AgentLLM-->>-Executor: Search results
Executor-->>-PF: Step 0 result ("Found benefits X, Y, Z...")
PF->>+PlanTool: mark_step(plan_id='plan_123', step=0, status='completed')
PlanTool-->>-PF: Step marked
PF->>+PlanTool: get_next_step(plan_id='plan_123')
PlanTool-->>-PF: Step 1: "Summarize"
PF->>PF: Select Executor (Manus)
PF->>+Executor: run("Execute step 1: 'Summarize'...")
Executor->>+AgentLLM: ask("Summarize: X, Y, Z...")
AgentLLM-->>-Executor: Summary text
Executor-->>-PF: Step 1 result ("Solar power benefits include...")
PF->>+PlanTool: mark_step(plan_id='plan_123', step=1, status='completed')
PlanTool-->>-PF: Step marked
PF->>+PlanTool: get_next_step(plan_id='plan_123')
PlanTool-->>-PF: No more steps
end
Note over PF: End Execution Loop
PF->>PF: Finalize (optional summary)
PF-->>-User: Final combined result string
```
**Code Glimpse:**
Let's look at simplified snippets from the flow files.
* **`app/flow/base.py`:** The blueprint just holds agents.
```python
# Simplified snippet from app/flow/base.py
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Union
from pydantic import BaseModel
from app.agent.base import BaseAgent
class BaseFlow(BaseModel, ABC):
"""Base class for execution flows supporting multiple agents"""
agents: Dict[str, BaseAgent] # Holds the agents
primary_agent_key: Optional[str] = None # Key for the main agent
# ... __init__ handles setting up the agents dictionary ...
@property
def primary_agent(self) -> Optional[BaseAgent]:
"""Get the primary agent for the flow"""
return self.agents.get(self.primary_agent_key)
@abstractmethod # Subclasses MUST implement execute
async def execute(self, input_text: str) -> str:
"""Execute the flow with given input"""
pass
```
* **`app/flow/flow_factory.py`:** Creates the specific flow.
```python
# Simplified snippet from app/flow/flow_factory.py
from enum import Enum
from app.agent.base import BaseAgent
from app.flow.base import BaseFlow
from app.flow.planning import PlanningFlow # Import specific flows
class FlowType(str, Enum):
PLANNING = "planning" # Add other flow types here
class FlowFactory:
@staticmethod
def create_flow(flow_type: FlowType, agents, **kwargs) -> BaseFlow:
flows = { # Maps type enum to the actual class
FlowType.PLANNING: PlanningFlow,
}
flow_class = flows.get(flow_type)
if not flow_class:
raise ValueError(f"Unknown flow type: {flow_type}")
# Creates an instance of PlanningFlow(agents, **kwargs)
return flow_class(agents, **kwargs)
```
* **`app/flow/planning.py`:** The core planning and execution logic.
```python
# Simplified snippets from app/flow/planning.py
from pydantic import Field # Needed for the planning_tool field below
from app.flow.base import BaseFlow
from app.tool import PlanningTool
from app.agent.base import BaseAgent
from app.schema import Message # Assuming Message is imported
from app.logger import logger # Used for progress logging below
class PlanningFlow(BaseFlow):
planning_tool: PlanningTool = Field(default_factory=PlanningTool)
# ... other fields like llm, active_plan_id ...
async def execute(self, input_text: str) -> str:
"""Execute the planning flow with agents."""
# 1. Create the plan if input is provided
if input_text:
await self._create_initial_plan(input_text)
# Check if plan exists...
result_accumulator = ""
while True:
# 2. Get the next step to execute
step_index, step_info = await self._get_current_step_info()
# 3. Exit if no more steps
if step_index is None:
result_accumulator += await self._finalize_plan()
break
# 4. Get the agent to execute the step
executor_agent = self.get_executor(step_info.get("type"))
# 5. Execute the step using the agent
step_result = await self._execute_step(executor_agent, step_info)
result_accumulator += step_result + "\n"
# Mark step as completed (done inside _execute_step or here)
# await self._mark_step_completed(step_index) # Simplified
# Maybe check if agent finished early...
return result_accumulator
async def _create_initial_plan(self, request: str):
"""Uses LLM and PlanningTool to create the plan."""
logger.info(f"Creating plan for: {request}")
system_msg = Message.system_message("You are a planner...")
user_msg = Message.user_message(f"Create a plan for: {request}")
# Ask LLM to use the planning tool
response = await self.llm.ask_tool(
messages=[user_msg],
system_msgs=[system_msg],
tools=[self.planning_tool.to_param()], # Provide the tool spec
# Force LLM to use a tool (often planning tool)
# tool_choice=ToolChoice.AUTO # Or specify planning tool name
)
# Process LLM response to execute the planning tool call
# Simplified: Assume LLM calls planning_tool.execute(...)
# to store the plan steps.
# ... logic to handle response and tool execution ...
logger.info("Plan created.")
async def _execute_step(self, executor: BaseAgent, step_info: dict) -> str:
"""Execute a single step using the executor agent."""
step_text = step_info.get("text", "Current step")
plan_status = await self._get_plan_text() # Get current plan state
# Construct prompt for the agent
step_prompt = f"Current Plan:\n{plan_status}\n\nYour Task:\nExecute step: {step_text}"
# Call the agent's run method!
step_result = await executor.run(step_prompt)
# Mark step completed after execution
await self._mark_step_completed()
return step_result
async def _mark_step_completed(self):
"""Update the planning tool state for the current step."""
if self.current_step_index is not None:
await self.planning_tool.execute(
command="mark_step",
plan_id=self.active_plan_id,
step_index=self.current_step_index,
step_status="completed" # Simplified status
)
logger.info(f"Step {self.current_step_index} marked complete.")
# ... other helper methods like _get_current_step_info, get_executor ...
```
**Explanation of Snippets:**
* `BaseFlow` defines the `agents` dictionary and the abstract `execute` method.
* `FlowFactory` looks at the requested `FlowType` and returns an instance of the corresponding class (`PlanningFlow`).
* `PlanningFlow.execute` orchestrates the overall process: create plan, loop through steps, get executor, execute step via `agent.run()`, mark complete.
* `_create_initial_plan` shows interaction with the [LLM](01_llm.md) and the `PlanningTool` to generate the initial steps.
* `_execute_step` shows how the flow prepares a prompt and then delegates the actual work for a specific step to an agent by calling `executor.run()`.
* `_mark_step_completed` updates the plan state using the `PlanningTool`.
## Wrapping Up Chapter 5
We've seen that `BaseFlow` provides a way to manage complex, multi-step tasks that might involve multiple agents or tools. It acts as an orchestrator or project manager. We focused on `PlanningFlow`, a specific strategy where a plan is created first, and then each step is executed sequentially by designated agents. This allows OpenManus to tackle much larger and more complex goals than a single agent could handle alone.
So far, we've covered the core components: LLMs, Memory, Agents, Tools, and Flows. But how do we define the structure of data that these components pass around, like the format of tool parameters or agent configurations? That's where schemas come in.
Let's move on to [Chapter 6: Schema](06_schema.md) to understand how OpenManus defines and validates data structures.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 6: Schema - The Official Data Forms
In [Chapter 5: BaseFlow](05_baseflow.md), we saw how Flows act like project managers, coordinating different [Agents](03_baseagent.md) and [Tools](04_tool___toolcollection.md) to complete complex tasks. But for all these different parts (Flows, Agents, LLMs, Tools) to work together smoothly, they need to speak the same language and use the same formats when exchanging information.
Imagine a busy office where everyone fills out forms for requests, reports, and messages. If everyone uses their *own* unique form layout, it quickly becomes chaotic! Someone might forget a required field, use the wrong data type (like writing "yesterday" instead of a specific date), or mislabel information. It would be incredibly hard to process anything efficiently.
This is where **Schemas** come into play in OpenManus.
## What Problem Does Schema Solve?
In our digital "office" (the OpenManus application), various components need to pass data back and forth:
* The User sends a request (a message).
* The Agent stores this message in its [Memory](02_message___memory.md).
* The Agent might ask the [LLM](01_llm.md) for help, sending the conversation history.
* The LLM might decide to use a [Tool](04_tool___toolcollection.md), sending back instructions on which tool and what inputs to use.
* The Tool executes and sends back its results.
* The Agent updates its status (e.g., from `RUNNING` to `FINISHED`).
Without a standard way to structure all this information, we'd face problems:
* **Inconsistency:** One part might expect a user message to have a `sender` field, while another expects a `role` field.
* **Errors:** A Tool might expect a number as input but receive text, causing it to crash.
* **Confusion:** It would be hard for developers (and the system itself!) to know exactly what information is contained in a piece of data.
* **Maintenance Nightmares:** Changing how data is structured in one place could break many other parts unexpectedly.
**Schemas solve this by defining the official "forms" or "templates" for all the important data structures used in OpenManus.** Think of them as the agreed-upon standard formats that everyone must use.
**Use Case:** When the LLM decides the agent should use the `web_search` tool with the query "latest AI news", it doesn't just send back a vague text string. It needs to send structured data that clearly says:
1. "I want to call a tool."
2. "The tool's name is `web_search`."
3. "The input parameter `query` should be set to `latest AI news`."
A schema defines exactly how this "tool call request" should look, ensuring the Agent understands it correctly.
## Key Concepts: Standard Templates via Pydantic
1. **Schema as Templates:** At its core, a schema is a formal definition of a data structure. It specifies:
* What pieces of information (fields) must be included (e.g., a `Message` must have a `role`).
* What type each piece of information should be (e.g., `role` must be text, `current_step` in an Agent must be a number).
* Which fields are optional and which are required.
* Sometimes, default values or specific allowed values (e.g., `role` must be one of "user", "assistant", "system", or "tool").
2. **Pydantic: The Schema Engine:** OpenManus uses a popular Python library called **Pydantic** to define and enforce these schemas. You don't need to be a Pydantic expert, but understanding its role is helpful. Pydantic lets us define these data structures using simple Python classes. When data is loaded into these classes, Pydantic automatically:
* **Validates** the data: Checks if all required fields are present and if the data types are correct. If not, it raises an error *before* the bad data can cause problems elsewhere.
* **Provides Auto-completion and Clarity:** Because the structure is clearly defined in code, developers get better auto-completion hints in their editors, making the code easier to write and understand.
Think of Pydantic as the strict office manager who checks every form submitted, ensuring it's filled out correctly according to the official template before passing it on.
## How Do We Use Schemas? (Examples)
Schemas are defined throughout the OpenManus codebase, primarily as Pydantic models. You've already encountered some! Let's look at a few key examples found mostly in `app/schema.py` and `app/tool/base.py`.
**1. `Message` (from `app/schema.py`): The Chat Bubble**
We saw this in [Chapter 2: Message / Memory](02_message___memory.md). It defines the structure for a single turn in a conversation.
```python
# Simplified Pydantic model from app/schema.py
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
# Define allowed roles
ROLE_TYPE = Literal["system", "user", "assistant", "tool"]
class Message(BaseModel):
role: ROLE_TYPE = Field(...) # '...' means this field is required
content: Optional[str] = Field(default=None) # Optional text content
# ... other optional fields like tool_calls, name, tool_call_id ...
# Class methods like user_message, assistant_message are here...
```
**Explanation:**
* This Pydantic class `Message` defines the "form" for a message.
* `role: ROLE_TYPE = Field(...)` means every message *must* have a `role`, and its value must be one of the strings defined in `ROLE_TYPE`. Pydantic enforces this.
* `content: Optional[str] = Field(default=None)` means a message *can* have text `content`, but it's optional. If not provided, it defaults to `None`.
* Pydantic ensures that if you try to create a `Message` object without a valid `role`, or with `content` that isn't a string, you'll get an error immediately.
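For example, creating a message and inspecting it might look like this (a small sketch; `user_message` is one of the convenience constructors mentioned above):
```python
from app.schema import Message

# Build a message via the convenience constructor...
greeting = Message.user_message("What is the capital of France?")

# ...or construct it directly; Pydantic validates the fields on creation
same_greeting = Message(role="user", content="What is the capital of France?")

print(greeting.role)          # "user"
print(greeting.model_dump())  # {'role': 'user', 'content': 'What is the capital of France?', ...}
```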
**2. `ToolCall` and `Function` (from `app/schema.py`): The Tool Request Form**
When the LLM tells the agent to use a tool, it sends back data structured according to the `ToolCall` schema.
```python
# Simplified Pydantic models from app/schema.py
from pydantic import BaseModel
class Function(BaseModel):
name: str # The name of the tool/function to call
arguments: str # The input arguments as a JSON string
class ToolCall(BaseModel):
id: str # A unique ID for this specific call
type: str = "function" # Currently always "function"
function: Function # Embeds the Function details above
```
**Explanation:**
* The `Function` schema defines that we need the `name` of the tool (as text) and its `arguments` (also as text, expected to be formatted as JSON).
* The `ToolCall` schema includes a unique `id`, the `type` (always "function" for now), and embeds the `Function` data.
* This ensures that whenever the agent receives a tool call instruction from the LLM, it knows exactly where to find the tool's name and arguments, preventing guesswork and errors.
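To make the earlier use case concrete, here is a small sketch of how that web-search request could be represented and unpacked (the `id` value is made up for illustration):
```python
import json
from app.schema import Function, ToolCall

# "Use web_search with query='latest AI news'" as structured data
call = ToolCall(
    id="call_001",  # illustrative ID; real IDs come from the LLM provider
    function=Function(
        name="web_search",
        arguments=json.dumps({"query": "latest AI news"}),
    ),
)

# The agent can read the request without any guesswork
tool_name = call.function.name                   # "web_search"
tool_args = json.loads(call.function.arguments)  # {"query": "latest AI news"}
print(tool_name, tool_args)
```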
**3. `AgentState` (from `app/schema.py`): The Agent Status Report**
We saw this in [Chapter 3: BaseAgent](03_baseagent.md). It standardizes how we represent the agent's current status.
```python
# Simplified definition from app/schema.py
from enum import Enum
class AgentState(str, Enum):
"""Agent execution states"""
IDLE = "IDLE"
RUNNING = "RUNNING"
FINISHED = "FINISHED"
ERROR = "ERROR"
```
**Explanation:**
* This uses Python's `Enum` (Enumeration) type, which is automatically compatible with Pydantic.
* It defines a fixed set of allowed values for the agent's state. An agent's state *must* be one of these four strings.
* This prevents typos (like "Runing" or "Idle") and makes it easy to check the agent's status reliably.
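A tiny sketch of how this enum is used in practice:
```python
from app.schema import AgentState

state = AgentState.IDLE
print(state.value)  # "IDLE"

state = AgentState.RUNNING  # transitions are just assignments of enum members
if state == AgentState.RUNNING:
    print("The agent is currently working on a task.")

# AgentState("Runing") would raise a ValueError -- typos are caught immediately
```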
**4. `ToolResult` (from `app/tool/base.py`): The Tool Output Form**
When a [Tool](04_tool___toolcollection.md) finishes its job, it needs to report back its findings in a standard way.
```python
# Simplified Pydantic model from app/tool/base.py
from pydantic import BaseModel, Field
from typing import Any, Optional
class ToolResult(BaseModel):
"""Represents the result of a tool execution."""
output: Any = Field(default=None) # The main result data
error: Optional[str] = Field(default=None) # Error message, if any
# ... other optional fields like base64_image, system message ...
class Config:
arbitrary_types_allowed = True # Allows 'Any' type for output
```
**Explanation:**
* Defines a standard structure for *any* tool's output.
* It includes an `output` field for the successful result (which can be of `Any` type, allowing flexibility for different tools) and an optional `error` field to report problems.
* Specific tools might *inherit* from `ToolResult` to add more specific fields, like `SearchResult` adding `url`, `title`, etc. (see `app/tool/web_search.py`). Using `ToolResult` as a base ensures all tool outputs have a consistent minimum structure.
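A quick sketch of the success and failure cases (field names match the model above):
```python
from app.tool.base import ToolResult

# A successful execution: only 'output' is filled in
ok = ToolResult(output="Paris is the capital of France.")
print(ok.output, ok.error)          # Paris is the capital of France. None

# A failed execution: only 'error' is filled in
failed = ToolResult(error="Network timeout while searching.")
print(failed.output, failed.error)  # None Network timeout while searching.
```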
## Under the Hood: Pydantic Validation
The real power of using Pydantic for schemas comes from its automatic data validation. Let's illustrate with a simplified `Message` example.
Imagine you have this Pydantic model:
```python
# Standalone Example (Illustrative)
from pydantic import BaseModel, ValidationError
from typing import Literal
ROLE_TYPE = Literal["user", "assistant"] # Only allow these roles
class SimpleMessage(BaseModel):
role: ROLE_TYPE
content: str
```
Now, let's see what happens when we try to create instances:
```python
# --- Valid Data ---
try:
msg1 = SimpleMessage(role="user", content="Hello there!")
print("msg1 created successfully:", msg1.model_dump()) # .model_dump() shows dict
except ValidationError as e:
print("Error creating msg1:", e)
# --- Missing Required Field ('content') ---
try:
msg2 = SimpleMessage(role="assistant")
print("msg2 created successfully:", msg2.model_dump())
except ValidationError as e:
print("\nError creating msg2:")
print(e) # Pydantic gives a detailed error
# --- Invalid Role ---
try:
msg3 = SimpleMessage(role="system", content="System message") # 'system' is not allowed
print("msg3 created successfully:", msg3.model_dump())
except ValidationError as e:
print("\nError creating msg3:")
print(e) # Pydantic catches the wrong role
# --- Wrong Data Type for 'content' ---
try:
msg4 = SimpleMessage(role="user", content=123) # content should be string
print("msg4 created successfully:", msg4.model_dump())
except ValidationError as e:
print("\nError creating msg4:")
print(e) # Pydantic catches the type error
```
**Example Output:**
```
msg1 created successfully: {'role': 'user', 'content': 'Hello there!'}
Error creating msg2:
1 validation error for SimpleMessage
content
Field required [type=missing, input_value={'role': 'assistant'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.7/v/missing
Error creating msg3:
1 validation error for SimpleMessage
role
Input should be 'user' or 'assistant' [type=literal_error, input_value='system', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/literal_error
Error creating msg4:
1 validation error for SimpleMessage
content
Input should be a valid string [type=string_type, input_value=123, input_type=int]
For further information visit https://errors.pydantic.dev/2.7/v/string_type
```
**Explanation:**
* When the data matches the schema (`msg1`), the object is created successfully.
* When data is missing (`msg2`), has an invalid value (`msg3`), or the wrong type (`msg4`), Pydantic automatically raises a `ValidationError`.
* The error message clearly explains *what* is wrong and *where*.
This validation happens automatically whenever data is loaded into these Pydantic models within OpenManus, catching errors early and ensuring data consistency across the entire application. You mostly find these schema definitions in `app/schema.py`, but also within specific tool files (like `app/tool/base.py`, `app/tool/web_search.py`) for their specific results.
## Wrapping Up Chapter 6
You've learned that **Schemas** are like official data templates or forms used throughout OpenManus. They define the expected structure for important data like messages, tool calls, agent states, and tool results. By using the **Pydantic** library, OpenManus automatically **validates** data against these schemas, ensuring consistency, preventing errors, and making the whole system more reliable and easier to understand. They are the backbone of structured communication between different components.
We've now covered most of the core functional building blocks of OpenManus. But how do we configure things like which LLM model to use, API keys, or which tools an agent should have? That's handled by the Configuration system.
Let's move on to [Chapter 7: Configuration (Config)](07_configuration__config_.md) to see how we manage settings and secrets for our agents and flows.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 7: Configuration (Config)
Welcome to Chapter 7! In [Chapter 6: Schema](06_schema.md), we learned how OpenManus uses schemas to define the structure of data passed between different components, like official forms ensuring everyone fills them out correctly.
Now, think about setting up a new application. You often need to tell it *how* to behave.
* Which AI model should it use?
* What's the secret key to access that AI?
* Should it run code in a restricted "sandbox" environment?
* Which search engine should it prefer?
These are all **settings** or **configurations**. This chapter explores how OpenManus manages these settings using the `Config` system.
## What Problem Does Config Solve?
Imagine you're building a simple app that uses an AI service. You need an API key to access it. Where do you put this key?
* **Option 1: Hardcode it directly in the code.**
```python
# Bad idea! Don't do this!
api_key = "MY_SUPER_SECRET_API_KEY_12345"
# ... rest of the code uses api_key ...
```
This is a terrible idea! Your secret key is exposed in the code. Sharing the code means sharing your secret. Changing the key means editing the code. What if multiple parts of the code need the key? You'd have it scattered everywhere!
* **Option 2: Use a Configuration System.**
Keep all settings in a separate, easy-to-read file. The application reads this file when it starts and makes the settings available wherever they're needed.
OpenManus uses Option 2. It keeps settings in a file named `config.toml` and uses a special `Config` object to manage them.
**Use Case:** Let's say we want our [LLM](01_llm.md) component to use the "gpt-4o" model and a specific API key. Instead of writing "gpt-4o" and the key directly into the `LLM` class code, the `LLM` class will *ask* the `Config` system: "What model should I use?" and "What's the API key?". The `Config` system provides the answers it read from `config.toml`.
## Key Concepts: The Settings File and Manager
### 1. The Settings File (`config.toml`)
This is a simple text file located in the `config/` directory of your OpenManus project. It uses the TOML format (Tom's Obvious, Minimal Language), which is designed to be easy for humans to read.
It contains sections for different parts of the application. Here's a highly simplified snippet:
```toml
# config/config.toml (Simplified Example)
[llm] # Settings for the Large Language Model
model = "gpt-4o"
api_key = "YOUR_OPENAI_API_KEY_HERE" # Replace with your actual key
base_url = "https://api.openai.com/v1"
api_type = "openai"
[sandbox] # Settings for the code execution sandbox
use_sandbox = true
image = "python:3.12-slim"
memory_limit = "256m"
[search_config] # Settings for web search
engine = "DuckDuckGo"
[browser_config] # Settings for the browser tool
headless = false
```
**Explanation:**
* `[llm]`, `[sandbox]`, etc., define sections.
* `model = "gpt-4o"` assigns the value `"gpt-4o"` to the `model` setting within the `llm` section.
* `api_key = "YOUR_..."` stores your secret key (you should put your real key here and **never** share this file publicly if it contains secrets!).
* `use_sandbox = true` sets a boolean (true/false) value.
This file acts as the central "control panel" list for the application's behavior.
### 2. The Settings Manager (`Config` class in `app/config.py`)
Okay, we have the settings file. How does the application *use* it?
OpenManus has a special Python class called `Config` (defined in `app/config.py`). Think of this class as the **Settings Manager**. Its job is:
1. **Read the File:** When the application starts, the `Config` manager reads the `config.toml` file.
2. **Parse and Store:** It understands the TOML format and stores the settings internally, often using the Pydantic [Schemas](06_schema.md) we learned about (like `LLMSettings`, `SandboxSettings`) to validate the data.
3. **Provide Access:** It offers a way for any other part of the application to easily ask for a specific setting (e.g., "Give me the LLM model name").
### 3. The Singleton Pattern: One Manager to Rule Them All
The `Config` class uses a special design pattern called a **Singleton**. This sounds fancy, but the idea is simple: **There is only ever *one* instance (object) of the `Config` manager in the entire application.**
*Analogy:* Think of the principal's office in a school. There's only one principal's office. If any teacher or student needs official school-wide information (like the date of the next holiday), they go to that single, central office. They don't each have their own separate, potentially conflicting, information source.
The `Config` object is like that principal's office. When any part of OpenManus (like the [LLM](01_llm.md) class or the [DockerSandbox](08_dockersandbox.md) class) needs a setting, it asks the *same*, single `Config` instance. This ensures everyone is using the same configuration values that were loaded at the start.
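If the pattern is new to you, here is a minimal, generic singleton sketch (deliberately simplified, and not the actual `Config` class, which we'll look at later in this chapter):
```python
class PrincipalsOffice:
    _instance = None  # the one and only instance lives here

    def __new__(cls):
        # Create the object only the first time; afterwards, reuse it
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

office_a = PrincipalsOffice()
office_b = PrincipalsOffice()
print(office_a is office_b)  # True -- both names point to the same object
```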
## How Do We Use It? (Accessing Settings)
Because `Config` is a singleton, accessing settings is straightforward. You import the pre-created instance and ask for the setting you need.
The single instance is created automatically when `app/config.py` is first loaded and is made available as `config`.
```python
# Example of how another part of the code might use the config
from app.config import config # Import the singleton instance
# Access LLM settings
default_llm_settings = config.llm.get("default") # Get the 'default' LLM config
if default_llm_settings:
model_name = default_llm_settings.model
api_key = default_llm_settings.api_key
print(f"LLM Model: {model_name}")
# Don't print the API key in real code! This is just for illustration.
# print(f"LLM API Key: {api_key[:4]}...{api_key[-4:]}")
# Access Sandbox settings
use_sandbox_flag = config.sandbox.use_sandbox
sandbox_image = config.sandbox.image
print(f"Use Sandbox: {use_sandbox_flag}")
print(f"Sandbox Image: {sandbox_image}")
# Access Search settings (check if it exists)
if config.search_config:
search_engine = config.search_config.engine
print(f"Preferred Search Engine: {search_engine}")
# Access Browser settings (check if it exists)
if config.browser_config:
run_headless = config.browser_config.headless
print(f"Run Browser Headless: {run_headless}")
```
**Explanation:**
1. `from app.config import config`: We import the single, shared `config` object.
2. `config.llm`: Accesses the dictionary of all LLM configurations read from the `[llm]` sections in `config.toml`. We use `.get("default")` to get the settings specifically for the LLM named "default".
3. `default_llm_settings.model`: Accesses the `model` attribute of the `LLMSettings` object. Pydantic ensures this attribute exists and is the correct type.
4. `config.sandbox.use_sandbox`: Directly accesses the `use_sandbox` attribute within the `sandbox` settings object (`SandboxSettings`).
5. We check if `config.search_config` and `config.browser_config` exist before accessing them, as they might be optional sections in the `config.toml` file.
**Use Case Example: How `LLM` Gets Its Settings**
Let's revisit our use case. When an `LLM` object is created (often inside a [BaseAgent](03_baseagent.md)), its initialization code (`__init__`) looks something like this (simplified):
```python
# Simplified snippet from app/llm.py __init__ method
from app.config import config, LLMSettings # Import config and the schema
from typing import Optional
class LLM:
# ... other methods ...
def __init__(self, config_name: str = "default", llm_config: Optional[LLMSettings] = None):
# If specific llm_config isn't provided, get it from the global config
if llm_config is None:
# Ask the global 'config' object for the settings
# corresponding to 'config_name' (e.g., "default")
llm_settings = config.llm.get(config_name)
if not llm_settings: # Handle case where the name doesn't exist
llm_settings = config.llm.get("default") # Fallback to default
else: # Use the provided config if given
llm_settings = llm_config
# Store the settings read from the config object
self.model = llm_settings.model
self.api_key = llm_settings.api_key
self.base_url = llm_settings.base_url
# ... store other settings like max_tokens, temperature ...
print(f"LLM initialized with model: {self.model}")
# Initialize the actual API client using these settings
# self.client = AsyncOpenAI(api_key=self.api_key, base_url=self.base_url)
# ... rest of initialization ...
```
**Explanation:**
* The `LLM` class imports the global `config` object.
* In its `__init__`, it uses `config.llm.get(config_name)` to retrieve the specific settings (like `model`, `api_key`) it needs.
* It then uses these retrieved values to configure itself and the underlying API client.
This way, the `LLM` class doesn't need the actual values hardcoded inside it. It just asks the central `Config` manager. If you want to change the model or API key, you only need to update `config.toml` and restart the application!
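A minimal usage sketch based on the constructor above (the "vision" name is hypothetical; passing it only makes sense if your `config.toml` defines an LLM section with that name, otherwise the code falls back to "default"):
```python
from app.llm import LLM

default_llm = LLM()                     # reads config.llm["default"]
vision_llm = LLM(config_name="vision")  # hypothetical name; falls back to
                                        # "default" if it isn't configured
print(default_llm.model)
```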
## Under the Hood: Loading and Providing Settings
What happens when the application starts and the `config` object is first used?
1. **First Access:** The first time code tries to `import config` from `app.config`, Python runs the code in `app.config.py`.
2. **Singleton Check:** The `Config` class's special `__new__` method checks if an instance (`_instance`) already exists. If not, it creates a new one. If it *does* exist, it just returns the existing one. This ensures only one instance is ever made.
3. **Initialization (`__init__`):** The `__init__` method (run only once for the single instance) calls `_load_initial_config`.
4. **Find File (`_get_config_path`):** It looks for `config/config.toml`. If that doesn't exist, it looks for `config/config.example.toml` as a fallback.
5. **Read File (`_load_config`):** It opens the found `.toml` file and uses the standard `tomllib` library to read its contents into a Python dictionary (a tiny standalone sketch of this step appears after this list).
6. **Parse & Validate:** `_load_initial_config` takes this raw dictionary and carefully organizes it, using Pydantic models (`LLMSettings`, `SandboxSettings`, `BrowserSettings`, `SearchSettings`, `MCPSettings`, all defined in `app/config.py`) to structure and *validate* the settings. For example, it creates `LLMSettings` objects for each entry under `[llm]`. If a required setting is missing or has the wrong type (e.g., `max_tokens` is text instead of a number), Pydantic will raise an error here, stopping the app from starting with bad configuration.
7. **Store Internally:** The validated settings (now nicely structured Pydantic objects) are stored within the `Config` instance (in `self._config`).
8. **Ready for Use:** The `config` instance is now ready. Subsequent accesses simply return the stored, validated settings via properties like `config.llm`, `config.sandbox`, etc.
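Step 5 boils down to a few lines of standard-library code. A standalone sketch (the path is illustrative):
```python
import tomllib  # part of the standard library since Python 3.11

# tomllib requires the file to be opened in binary mode
with open("config/config.toml", "rb") as f:
    raw_config = tomllib.load(f)  # returns a plain nested dictionary

print(raw_config["llm"]["model"])            # e.g. "gpt-4o"
print(raw_config["sandbox"]["use_sandbox"])  # e.g. True
```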
**Sequence Diagram:**
```mermaid
sequenceDiagram
participant App as Application Start
participant CfgMod as app/config.py
participant Config as Config Singleton Object
participant TOML as config.toml File
participant Parser as TOML Parser & Pydantic
participant OtherMod as e.g., app/llm.py
App->>+CfgMod: import config
Note over CfgMod: First time loading module
CfgMod->>+Config: Config() called (implicitly via `config = Config()`)
Config->>Config: __new__ checks if _instance exists (it doesn't)
Config->>Config: Creates new Config instance (_instance)
Config->>Config: Calls __init__ (only runs once)
Config->>Config: _load_initial_config()
Config->>Config: _get_config_path() -> finds path
Config->>+TOML: Opens file
TOML-->>-Config: Returns file content
Config->>+Parser: Parses TOML content into dict
Parser-->>-Config: Returns raw_config dict
Config->>+Parser: Validates dict using Pydantic models (LLMSettings etc.)
Parser-->>-Config: Returns validated AppConfig object
Config->>Config: Stores validated config internally
Config-->>-CfgMod: Returns the single instance
CfgMod-->>-App: Provides `config` instance
App->>+OtherMod: Code runs (e.g., `LLM()`)
OtherMod->>+Config: Accesses property (e.g., `config.llm`)
Config-->>-OtherMod: Returns stored settings (e.g., Dict[str, LLMSettings])
```
**Code Glimpse (`app/config.py`):**
Let's look at the key parts:
```python
# Simplified snippet from app/config.py
import threading
import tomllib
from pathlib import Path
from pydantic import BaseModel, Field
# ... other imports like typing ...
# --- Pydantic Models for Settings ---
class LLMSettings(BaseModel): # Defines structure for [llm] section
model: str
api_key: str
# ... other fields like base_url, max_tokens, api_type ...
class SandboxSettings(BaseModel): # Defines structure for [sandbox] section
use_sandbox: bool
image: str
# ... other fields like memory_limit, timeout ...
# ... Similar models for BrowserSettings, SearchSettings, MCPSettings ...
class AppConfig(BaseModel): # Holds all validated settings together
llm: Dict[str, LLMSettings]
sandbox: Optional[SandboxSettings]
browser_config: Optional[BrowserSettings]
search_config: Optional[SearchSettings]
mcp_config: Optional[MCPSettings]
# --- The Singleton Config Class ---
class Config:
_instance = None
_lock = threading.Lock() # Ensures thread-safety during creation
_initialized = False
def __new__(cls): # Controls instance creation (Singleton part 1)
if cls._instance is None:
with cls._lock:
if cls._instance is None:
cls._instance = super().__new__(cls)
return cls._instance
def __init__(self): # Initializes the instance (runs only once)
if not self._initialized:
with self._lock:
if not self._initialized:
self._config: Optional[AppConfig] = None # Where settings are stored
self._load_initial_config() # Load from file
self._initialized = True
def _load_config(self) -> dict: # Reads the TOML file
config_path = self._get_config_path() # Finds config.toml or example
with config_path.open("rb") as f:
return tomllib.load(f) # Parses TOML into a dictionary
def _load_initial_config(self): # Parses dict and validates with Pydantic
raw_config = self._load_config()
# ... (logic to handle defaults and structure the raw_config dict) ...
# ... (creates LLMSettings, SandboxSettings etc. from raw_config) ...
# Validate the final structured dict using AppConfig
self._config = AppConfig(**structured_config_dict)
# --- Properties to Access Settings ---
@property
def llm(self) -> Dict[str, LLMSettings]:
# Provides easy access like 'config.llm'
return self._config.llm
@property
def sandbox(self) -> SandboxSettings:
# Provides easy access like 'config.sandbox'
return self._config.sandbox
# ... Properties for browser_config, search_config, mcp_config ...
# --- Create the Singleton Instance ---
# This line runs when the module is imported, creating the single instance.
config = Config()
```
**Explanation:**
* The Pydantic models (`LLMSettings`, `SandboxSettings`, `AppConfig`) define the expected structure and types for the settings read from `config.toml`.
* The `Config` class uses `__new__` and `_lock` to implement the singleton pattern, ensuring only one instance.
* `__init__` calls `_load_initial_config` only once.
* `_load_initial_config` reads the TOML file and uses the Pydantic models (within `AppConfig`) to parse and validate the settings, storing the result in `self._config`.
* `@property` decorators provide clean access (e.g., `config.llm`) to the stored settings.
* `config = Config()` at the end creates the actual singleton instance that gets imported elsewhere.
## Wrapping Up Chapter 7
We've learned that the `Config` system is OpenManus's way of managing application settings. It reads configurations from the `config.toml` file at startup, validates them using Pydantic [Schemas](06_schema.md), and makes them available throughout the application via a single, shared `config` object (using the singleton pattern). This keeps settings separate from code, making the application more flexible, secure, and easier to manage.
Many components rely on these configurations. For instance, when an agent needs to execute code safely, it might use a `DockerSandbox`. The settings for this sandbox, like which Docker image to use or how much memory to allow, are read directly from the configuration we just discussed.
Let's move on to [Chapter 8: DockerSandbox](08_dockersandbox.md) to see how OpenManus provides a secure environment for running code generated by agents, using settings managed by our `Config` system.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 8: DockerSandbox - A Safe Play Area for Code
Welcome to Chapter 8! In [Chapter 7: Configuration (Config)](07_configuration__config_.md), we learned how OpenManus manages settings using the `config.toml` file and the `Config` object. We saw settings for the [LLM](01_llm.md), search tools, and something called `[sandbox]`. Now, let's dive into what that sandbox is!
## What Problem Does `DockerSandbox` Solve?
Imagine our agent, powered by a smart [LLM](01_llm.md), needs to test a piece of code it just wrote, or run a shell command to check something on the system. For example, the user asks: "Write a Python script that calculates 2 plus 2 and run it."
The agent might generate the code `print(2 + 2)`. But where should it run this code?
Running code generated by an AI, especially one connected to the internet, directly on your own computer is **risky**! What if the AI accidentally (or if tricked) generates harmful code like `delete_all_my_files()`? That would be disastrous!
We need a safe, isolated place to run potentially untrusted commands or code: a place where, even if something goes wrong, our main system isn't affected.
**This is exactly what the `DockerSandbox` provides.** Think of it as a **secure laboratory sandbox** or a disposable, locked room. Inside this room, the agent can perform potentially messy or dangerous experiments (like running code) without any risk to the outside environment (your computer).
**Use Case:** Our agent needs to execute the Python code `print(2 + 2)`. Instead of running it directly, it will ask the `DockerSandbox` to run it inside a secure container. The sandbox will execute the code, capture the output ("4"), and report it back, all without giving the code access to the host machine's files or settings.
## Key Concepts: Secure Execution with Docker
1. **Isolation via Docker:** `DockerSandbox` uses **Docker containers** to achieve isolation. Docker is a technology that allows packaging applications and their dependencies into lightweight, self-contained units called containers. Crucially, these containers run isolated from the host system and each other. They have their own restricted view of files, network, and processes. It's like giving the code its own mini-computer to run on, completely separate from yours.
2. **The Sandbox Container:** When needed, the `DockerSandbox` system creates a specific Docker container based on settings in your `config.toml`. This container is the actual "sandbox" environment.
3. **Lifecycle Management:** The `DockerSandbox` system handles the entire life of the container:
* **Creation:** Starting up a fresh container when needed.
* **Command Execution:** Running commands (like `python script.py` or `ls`) inside the container.
* **File Transfers:** Safely copying files into or out of the container if needed (e.g., putting a script file in, getting a result file out).
* **Cleanup:** Stopping and removing the container automatically when it's no longer needed or after a period of inactivity, ensuring no resources are wasted.
4. **Configuration (`config.toml`):** As we saw in the [previous chapter](07_configuration__config_.md), the `[sandbox]` section in `config.toml` controls how the sandbox behaves:
* `use_sandbox = true`: Turns the sandbox feature on. If `false`, code might run directly on the host (less safe!).
* `image = "python:3.12-slim"`: Specifies which Docker base image to use (e.g., a minimal Python environment).
* `memory_limit = "512m"`: Restricts how much memory the container can use.
* `cpu_limit = 1.0`: Restricts how much CPU power the container can use.
* `timeout = 300`: Sets a default time limit (in seconds) for commands.
* `network_enabled = false`: Controls whether the container can access the internet (often disabled for extra security).
## How Do We Use It? (Via Tools and Clients)
Typically, you don't interact with the `DockerSandbox` class directly. Instead, [Tools](04_tool___toolcollection.md) that need to execute code, like `Bash` (`app/tool/bash.py`) or `PythonExecute` (`app/tool/python_execute.py`), often use a helper called a **Sandbox Client** to interact with the sandbox environment *if* it's enabled in the configuration.
OpenManus provides a ready-to-use client instance: `SANDBOX_CLIENT` (from `app/sandbox/client.py`).
Let's see conceptually how a tool might use `SANDBOX_CLIENT` to run our `print(2 + 2)` example safely.
**1. Check Configuration:**
First, the system checks if the sandbox is enabled.
```python
# Check the configuration loaded in Chapter 7
from app.config import config
if config.sandbox and config.sandbox.use_sandbox:
print("Sandbox is ENABLED. Code will run inside a container.")
# Proceed with using the sandbox client...
else:
print("Sandbox is DISABLED. Code might run directly on the host (potentially unsafe).")
# Fallback or raise an error...
```
**Explanation:**
* We import the global `config` object.
* We check `config.sandbox` (to see if the section exists) and `config.sandbox.use_sandbox`. This value comes directly from your `config.toml` file.
**2. Use the Sandbox Client:**
If the sandbox is enabled, a tool would use the shared `SANDBOX_CLIENT` to execute the command.
```python
# Example of using the sandbox client (simplified)
from app.sandbox.client import SANDBOX_CLIENT
import asyncio
# Assume sandbox is enabled based on the config check above
# The Python code our agent wants to run
python_code = "print(2 + 2)"
# Save the code as a script file so it can be run via 'python temp_script.py'
script_content = python_code
script_name = "temp_script.py"
# Define the command to run inside the sandbox
command_to_run = f"python {script_name}"
async def run_in_sandbox():
try:
print(f"Asking sandbox to run: {command_to_run}")
# 1. Create the sandbox container (if not already running)
# The client handles this automatically based on config
# (Simplified: Actual creation might be handled by a manager)
# await SANDBOX_CLIENT.create(config=config.sandbox) # Often implicit
# 2. Write the script file into the sandbox
await SANDBOX_CLIENT.write_file(script_name, script_content)
print(f"Wrote '{script_name}' to sandbox.")
# 3. Execute the command inside the sandbox
output = await SANDBOX_CLIENT.run_command(command_to_run)
print(f"Sandbox execution output: {output}")
except Exception as e:
print(f"An error occurred: {e}")
# finally:
# 4. Cleanup (often handled automatically by a manager or context)
# await SANDBOX_CLIENT.cleanup()
# print("Sandbox cleaned up.")
# Run the async function
# asyncio.run(run_in_sandbox()) # Uncomment to run
```
**Explanation:**
1. We import the pre-configured `SANDBOX_CLIENT`.
2. We define the Python code and the command (`python temp_script.py`) needed to execute it.
3. `SANDBOX_CLIENT.write_file(script_name, script_content)`: This copies our Python code into a file *inside* the isolated container. The path `script_name` refers to a path *within* the sandbox.
4. `SANDBOX_CLIENT.run_command(command_to_run)`: This is the core step! It tells the Docker container to execute `python temp_script.py`. The client waits for the command to finish and captures its output (stdout).
5. The `output` variable receives the result ("4\n" in this case).
6. **Crucially**, the actual container creation and cleanup might be managed automatically in the background (by the `SandboxManager`, see `app/sandbox/core/manager.py`) or handled when the client is used within a specific context, so explicit `create()` and `cleanup()` calls might not always be needed directly in the tool's code.
**Expected Output (High Level):**
```
Sandbox is ENABLED. Code will run inside a container.
Asking sandbox to run: python temp_script.py
Wrote 'temp_script.py' to sandbox.
Sandbox execution output: 4
# (Cleanup messages might appear depending on implementation)
```
The important part is that `print(2 + 2)` was executed securely *inside* the Docker container, managed by the sandbox system, without exposing the host machine.
## Under the Hood: How Sandbox Execution Works
Let's trace the simplified journey when a tool uses `SANDBOX_CLIENT.run_command("python script.py")`:
1. **Request:** The tool (e.g., `PythonExecute`) calls `SANDBOX_CLIENT.run_command(...)`.
2. **Check/Create Container:** The `SANDBOX_CLIENT` (likely using `DockerSandbox` internally, possibly managed by `SandboxManager`) checks if a suitable sandbox container is already running. If not, it creates one based on the `SandboxSettings` from the `config` object (pulling the image, setting resource limits, etc.). This uses the Docker engine installed on your host machine.
3. **Execute Command:** The client sends the command (`python script.py`) to the running Docker container for execution.
4. **Docker Runs Command:** The Docker engine runs the command *inside* the isolated container environment. The script executes.
5. **Capture Output:** The `DockerSandbox` infrastructure captures the standard output (stdout) and standard error (stderr) produced by the command within the container.
6. **Return Result:** The captured output is sent back to the `SANDBOX_CLIENT`.
7. **Client Returns:** The `SANDBOX_CLIENT` returns the output string to the calling tool.
8. **(Later) Cleanup:** The `SandboxManager` or context eventually decides to stop and remove the idle container to free up resources.
**Sequence Diagram:**
```mermaid
sequenceDiagram
participant Tool as Tool (e.g., PythonExecute)
participant Client as SANDBOX_CLIENT
participant Sandbox as DockerSandbox
participant Docker as Docker Engine (Host)
participant Container as Docker Container
Tool->>+Client: run_command("python script.py")
Client->>+Sandbox: run_command("python script.py")
Note over Sandbox: Checks if container exists. Assume No.
Sandbox->>+Docker: Create Container Request (using config: image, limits)
Docker->>+Container: Creates & Starts Container
Container-->>-Docker: Container Ready
Docker-->>-Sandbox: Container Created (ID: abc)
Sandbox->>+Docker: Execute Command Request (in Container abc: "python script.py")
Docker->>+Container: Runs "python script.py"
Note over Container: script prints "4"
Container-->>-Docker: Command Output ("4\n")
Docker-->>-Sandbox: Command Result ("4\n")
Sandbox-->>-Client: Returns "4\n"
Client-->>-Tool: Returns "4\n"
Note over Tool, Container: ... Later (idle timeout or explicit cleanup) ...
Client->>+Sandbox: cleanup() (or Manager does it)
Sandbox->>+Docker: Stop Container Request (ID: abc)
Docker->>Container: Stops Container
Container-->>Docker: Stopped
Sandbox->>+Docker: Remove Container Request (ID: abc)
Docker->>Docker: Removes Container abc
Docker-->>-Sandbox: Container Removed
Sandbox-->>-Client: Cleanup Done
```
## Code Glimpse: Sandbox Components
Let's look at simplified snippets of the key parts.
**1. `SandboxSettings` in `app/config.py`:**
This Pydantic model defines the structure for the `[sandbox]` section in `config.toml`.
```python
# Simplified snippet from app/config.py
from pydantic import BaseModel, Field
class SandboxSettings(BaseModel):
"""Configuration for the execution sandbox"""
use_sandbox: bool = Field(False, description="Whether to use the sandbox")
image: str = Field("python:3.12-slim", description="Base image")
work_dir: str = Field("/workspace", description="Container working directory")
memory_limit: str = Field("512m", description="Memory limit")
cpu_limit: float = Field(1.0, description="CPU limit")
timeout: int = Field(300, description="Default command timeout (seconds)")
network_enabled: bool = Field(False, description="Whether network access is allowed")
```
**Explanation:** This defines the expected settings and their types, which `Config` uses to validate `config.toml`.
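Because every field declares a default, you can also construct the model with no arguments to see the fallback values. A small sketch:
```python
from app.config import SandboxSettings

# With no arguments, Pydantic fills in the declared defaults
defaults = SandboxSettings()
print(defaults.image)            # python:3.12-slim
print(defaults.use_sandbox)      # False
print(defaults.network_enabled)  # False
```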
**2. `LocalSandboxClient` in `app/sandbox/client.py`:**
This class provides a convenient interface to the underlying `DockerSandbox`.
```python
# Simplified snippet from app/sandbox/client.py
from app.config import SandboxSettings
from app.sandbox.core.sandbox import DockerSandbox
from typing import Optional
class LocalSandboxClient: # Implements BaseSandboxClient
def __init__(self):
self.sandbox: Optional[DockerSandbox] = None
async def create(self, config: Optional[SandboxSettings] = None, ...):
"""Creates a sandbox if one doesn't exist."""
if not self.sandbox:
# Create the actual DockerSandbox instance
self.sandbox = DockerSandbox(config, ...)
await self.sandbox.create() # Start the container
async def run_command(self, command: str, timeout: Optional[int] = None) -> str:
"""Runs command in the sandbox."""
if not self.sandbox:
# Simplified: In reality, might auto-create or raise error
await self.create() # Ensure sandbox exists
# Delegate the command execution to the DockerSandbox instance
return await self.sandbox.run_command(command, timeout)
async def write_file(self, path: str, content: str) -> None:
"""Writes file to the sandbox."""
if not self.sandbox: await self.create()
# Delegate writing to the DockerSandbox instance
await self.sandbox.write_file(path, content)
async def cleanup(self) -> None:
"""Cleans up the sandbox resources."""
if self.sandbox:
await self.sandbox.cleanup() # Tell DockerSandbox to stop/remove container
self.sandbox = None
# Create the shared instance used by tools
SANDBOX_CLIENT = LocalSandboxClient()
```
**Explanation:** The client acts as a middleman. It holds a `DockerSandbox` instance and forwards calls like `run_command` or `write_file` to it, potentially handling creation/cleanup implicitly.
**3. `DockerSandbox` in `app/sandbox/core/sandbox.py`:**
This class interacts directly with the Docker engine.
```python
# Simplified snippet from app/sandbox/core/sandbox.py
import docker
import asyncio
from app.config import SandboxSettings
from app.sandbox.core.terminal import AsyncDockerizedTerminal # For running commands
class DockerSandbox:
def __init__(self, config: Optional[SandboxSettings] = None, ...):
self.config = config or SandboxSettings()
self.client = docker.from_env() # Connect to Docker engine
self.container: Optional[docker.models.containers.Container] = None
self.terminal: Optional[AsyncDockerizedTerminal] = None
async def create(self) -> "DockerSandbox":
"""Creates and starts the Docker container."""
try:
# 1. Prepare container settings (image, limits, etc.) from self.config
container_config = {...} # Simplified
# 2. Use Docker client to create the container
container_data = await asyncio.to_thread(
self.client.api.create_container, **container_config
)
self.container = self.client.containers.get(container_data["Id"])
# 3. Start the container
await asyncio.to_thread(self.container.start)
# 4. Initialize a terminal interface to run commands inside
self.terminal = AsyncDockerizedTerminal(container_data["Id"], ...)
await self.terminal.init()
return self
except Exception as e:
await self.cleanup() # Cleanup on failure
raise RuntimeError(f"Failed to create sandbox: {e}")
async def run_command(self, cmd: str, timeout: Optional[int] = None) -> str:
"""Runs a command using the container's terminal."""
if not self.terminal: raise RuntimeError("Sandbox not initialized")
# Use the terminal helper to execute the command and get output
return await self.terminal.run_command(
cmd, timeout=timeout or self.config.timeout
)
async def write_file(self, path: str, content: str) -> None:
"""Writes content to a file inside the container."""
if not self.container: raise RuntimeError("Sandbox not initialized")
try:
# Simplified: Creates a temporary tar archive with the file
# and uses Docker's put_archive to copy it into the container
tar_stream = await self._create_tar_stream(...) # Helper method
await asyncio.to_thread(
self.container.put_archive, "/", tar_stream
)
except Exception as e:
raise RuntimeError(f"Failed to write file: {e}")
async def cleanup(self) -> None:
"""Stops and removes the Docker container."""
if self.terminal: await self.terminal.close()
if self.container:
try:
await asyncio.to_thread(self.container.stop, timeout=5)
except Exception: pass # Ignore errors on stop
try:
await asyncio.to_thread(self.container.remove, force=True)
except Exception: pass # Ignore errors on remove
self.container = None
```
**Explanation:** This class contains the low-level logic to interact with Docker's API (via the `docker` Python library) to create, start, stop, and remove containers, as well as execute commands and transfer files using Docker's mechanisms.
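If you ever drive `DockerSandbox` directly, without the client wrapper, a hedged sketch of the typical lifecycle looks like this; the `try/finally` mirrors the cleanup-on-failure behaviour shown in `create` above:

```python
# A minimal sketch of the DockerSandbox lifecycle (assumes Docker is available).
import asyncio

from app.config import SandboxSettings
from app.sandbox.core.sandbox import DockerSandbox

async def main():
    sandbox = DockerSandbox(SandboxSettings(use_sandbox=True))
    try:
        await sandbox.create()  # start the container and open a terminal inside it
        print(await sandbox.run_command("python --version"))
        await sandbox.write_file("/workspace/hello.py", "print('hi from the sandbox')")
        print(await sandbox.run_command("python /workspace/hello.py"))
    finally:
        await sandbox.cleanup()  # always stop and remove the container

asyncio.run(main())
```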
## Wrapping Up Chapter 8
You've learned about the `DockerSandbox`, a critical security feature in OpenManus. It provides an isolated Docker container environment where agents can safely execute potentially untrusted code or commands generated by the [LLM](01_llm.md), using tools like `Bash` or `PythonExecute`. By isolating execution, the sandbox protects your host system from accidental or malicious harm. Its behavior is configured in `config.toml`, and it's typically used via the `SANDBOX_CLIENT` interface.
Now that we understand the core components (LLMs, Memory, Agents, Tools, Flows, Schemas, Config, and the Sandbox), how does information, especially structured data and context, flow between the user, the agent, and external models or tools in a standardized way?
Let's move on to the final core concept in [Chapter 9: MCP (Model Context Protocol)](09_mcp__model_context_protocol_.md) to explore how OpenManus defines a protocol for rich context exchange.
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
@@ -0,0 +1,417 @@
# Chapter 9: MCP (Model Context Protocol)
Welcome to the final chapter of our core concepts tutorial! In [Chapter 8: DockerSandbox](08_dockersandbox.md), we saw how OpenManus can safely run code in an isolated environment. Now, let's explore a powerful way to extend your agent's capabilities *without* changing its internal code: the **Model Context Protocol (MCP)**.
## What Problem Does MCP Solve?
Imagine you have an agent running smoothly. Suddenly, you realize you need it to perform a new, specialized task, such as interacting with a custom company database or using a complex scientific calculation library.
Normally, you might have to:
1. Stop the agent.
2. Write new code for the [Tool](04_tool___toolcollection.md) that performs this task.
3. Add this tool to the agent's code or configuration.
4. Restart the agent.
This process can be cumbersome, especially if you want to add or update tools frequently, or if different people are managing different tools.
What if there was a way for the agent to **dynamically discover and use tools** provided by a completely separate service? Like plugging in a new USB device, and your computer automatically recognizes and uses it?
**This is what MCP enables!** It defines a standard way for an OpenManus agent (`MCPAgent`) to connect to an external **MCP Server**. This server advertises the tools it offers, and the agent can call these tools remotely as if they were built-in.
**Use Case:** Let's say we want our agent to be able to run basic shell commands (like `ls` or `pwd`) using the `Bash` tool. Instead of building the `Bash` tool directly into the agent, we can run an `MCPServer` that *offers* the `Bash` tool. Our `MCPAgent` can connect to this server, discover the `Bash` tool, and use it when needed, all without having the `Bash` tool's code inside the agent itself. If we later update the `Bash` tool on the server, the agent automatically gets the new version without needing changes.
## Key Concepts: The Agent, The Server, and The Rules
MCP involves a few key players working together:
1. **`MCPServer` (The Tool Provider):**
* Think of this as a separate application, like a dedicated "Tool Shop" running independently from your agent.
* It holds one or more [Tools](04_tool___toolcollection.md) (like `Bash`, `BrowserUseTool`, `StrReplaceEditor`, or custom ones).
* It "advertises" these tools, meaning it can tell connected clients (agents) which tools are available, what they do, and how to use them.
* When asked, it executes a tool and sends the result back.
* In OpenManus, `app/mcp/server.py` provides an implementation of this server.
2. **`MCPAgent` (The Tool User):**
* This is a specialized type of [BaseAgent](03_baseagent.md) designed specifically to talk to an `MCPServer`.
* When it starts, it connects to the specified `MCPServer`.
* It asks the server: "What tools do you have?"
* It treats the server's tools as its own available `ToolCollection`.
* When its [LLM](01_llm.md) decides to use one of these tools, the `MCPAgent` sends a request to the `MCPServer` to execute it.
* It can even periodically check if the server has added or removed tools and update its capabilities accordingly!
3. **The Protocol (The Rules of Communication):**
* MCP defines the exact format of messages exchanged between the `MCPAgent` and `MCPServer`. How does the agent ask for the tool list? How does it request a tool execution? How is the result formatted?
* OpenManus supports two main ways (transports) for this communication:
* **stdio (Standard Input/Output):** The agent starts the server process directly and communicates with it using standard text streams (like typing commands in a terminal). This is simpler for local setups.
* **SSE (Server-Sent Events):** The agent connects to a running server over the network (using HTTP). This is more suitable if the server is running elsewhere.
*Analogy:* Imagine the `MCPServer` is a smart TV's App Store, offering apps (tools) like Netflix or YouTube. The `MCPAgent` is a universal remote control. MCP is the protocol that lets the remote connect to the TV, see the available apps, and tell the TV "Launch Netflix" or "Play this video on YouTube". The actual app logic runs on the TV (the server), not the remote (the agent).
## How Do We Use It?
Let's see how to run the server and connect an agent using the simple `stdio` method.
**1. Run the MCPServer:**
The server needs to be running first. OpenManus provides a script to run a server that includes standard tools like `Bash`, `Browser`, and `Editor`.
Open a terminal and run:
```bash
# Make sure you are in the root directory of the OpenManus project
# Use python to run the server module
python -m app.mcp.server --transport stdio
```
**Expected Output (in the server terminal):**
```
INFO:root:Registered tool: bash
INFO:root:Registered tool: browser
INFO:root:Registered tool: editor
INFO:root:Registered tool: terminate
INFO:root:Starting OpenManus server (stdio mode)
# --- The server is now running and waiting for a connection ---
```
**Explanation:**
* `python -m app.mcp.server` tells Python to run the server code located in `app/mcp/server.py`.
* `--transport stdio` specifies that it should listen for connections via standard input/output.
* It registers the built-in tools and waits.
**2. Run the MCPAgent (connecting to the server):**
Now, open a *separate* terminal. We'll run a script that starts the `MCPAgent` and tells it how to connect to the server we just started.
```bash
# In a NEW terminal, in the root directory of OpenManus
# Run the MCP agent runner script
python run_mcp.py --connection stdio --interactive
```
**Expected Output (in the agent terminal):**
```
INFO:app.config:Configuration loaded successfully from .../config/config.toml
INFO:app.agent.mcp:Initializing MCPAgent with stdio connection...
# ... (potential logs about connecting) ...
INFO:app.tool.mcp:Connected to server with tools: ['bash', 'browser', 'editor', 'terminate']
INFO:app.agent.mcp:Connected to MCP server via stdio
MCP Agent Interactive Mode (type 'exit' to quit)
Enter your request:
```
**Explanation:**
* `python run_mcp.py` runs the agent launcher script.
* `--connection stdio` tells the agent to connect using standard input/output. The script (`run_mcp.py`) knows how to start the server process (`python -m app.mcp.server`) for this mode.
* `--interactive` puts the agent in a mode where you can chat with it.
* The agent connects, asks the server for its tools (`list_tools`), and logs the tools it found (`bash`, `browser`, etc.). It's now ready for your requests!
**3. Interact with the Agent (Using a Server Tool):**
Now, in the agent's interactive prompt, ask it to do something that requires a tool provided by the server, like listing files using `bash`:
```text
# In the agent's terminal
Enter your request: Use the bash tool to list the files in the current directory.
```
**What Happens:**
1. The `MCPAgent` receives your request.
2. Its [LLM](01_llm.md) analyzes the request and decides the `bash` tool is needed, with the command `ls`.
3. The agent sees that `bash` is a tool provided by the connected `MCPServer`.
4. The agent sends a `call_tool` request over `stdio` to the server: "Please run `bash` with `command='ls'`".
5. The `MCPServer` receives the request, finds its `Bash` tool, and executes `ls`.
6. The server captures the output (the list of files).
7. The server sends the result back to the agent.
8. The agent receives the result, adds it to its [Memory](02_message___memory.md), and might use its LLM again to formulate a user-friendly response based on the tool's output.
**Expected Output (in the agent terminal, may vary):**
```text
# ... (Potential LLM thinking logs) ...
INFO:app.agent.mcp:Executing tool: bash with input {'command': 'ls'}
# ... (Server logs might show execution in its own terminal) ...
Agent: The bash tool executed the 'ls' command and returned the following output:
[List of files/directories in the project root, e.g.,]
README.md
app
config
run_mcp.py
... etc ...
```
Success! The agent used a tool (`bash`) that wasn't part of its own code, but was provided dynamically by the external `MCPServer` via the Model Context Protocol. If you added a *new* tool to the `MCPServer` code and restarted the server, the agent could potentially discover and use it without needing any changes itself (it periodically refreshes the tool list).
Type `exit` in the agent's terminal to stop it, then stop the server (usually Ctrl+C in its terminal).
## Under the Hood: How MCP Communication Flows
Let's trace the simplified steps when the agent uses a server tool:
1. **Connect & List:** Agent starts, connects to Server (`stdio` or `SSE`). Agent sends `list_tools` request. Server replies with list of tools (`name`, `description`, `parameters`). Agent stores these.
2. **User Request:** User asks agent to do something (e.g., "list files").
3. **LLM Decides:** Agent's LLM decides to use `bash` tool with `command='ls'`.
4. **Agent Request:** Agent finds `bash` in its list of server tools. Sends `call_tool` request to Server (containing tool name `bash` and arguments `{'command': 'ls'}`).
5. **Server Executes:** Server receives request. Finds its internal `Bash` tool. Calls the tool's `execute(command='ls')` method. The tool runs `ls`.
6. **Server Response:** Server gets the result from the tool (e.g., "README.md\napp\n..."). Sends this result back to the Agent.
7. **Agent Processes:** Agent receives the result. Updates its memory. Presents the answer to the user.
**Sequence Diagram:**
```mermaid
sequenceDiagram
participant User
participant Agent as MCPAgent
participant LLM as Agent's LLM
participant Server as MCPServer
participant BashTool as Bash Tool (on Server)
Note over Agent, Server: Initial Connection & list_tools (omitted for brevity)
User->>+Agent: "List files using bash"
Agent->>+LLM: ask_tool("List files", tools=[...bash_schema...])
LLM-->>-Agent: Decide: call tool 'bash', args={'command':'ls'}
Agent->>+Server: call_tool(name='bash', args={'command':'ls'})
Server->>+BashTool: execute(command='ls')
BashTool->>BashTool: Runs 'ls' command
BashTool-->>-Server: Returns file list string
Server-->>-Agent: Tool Result (output=file list)
Agent->>Agent: Process result, update memory
Agent-->>-User: "OK, the files are: ..."
```
## Code Glimpse: Key MCP Components
Let's look at simplified parts of the relevant files.
**1. `MCPServer` (`app/mcp/server.py`): Registering Tools**
The server uses the `fastmcp` library to handle the protocol details. It needs to register the tools it wants to offer.
```python
# Simplified snippet from app/mcp/server.py
import json
from typing import Dict
from mcp.server.fastmcp import FastMCP
from app.logger import logger
from app.tool.base import BaseTool
from app.tool.bash import Bash  # Import the tool to offer
class MCPServer:
def __init__(self, name: str = "openmanus"):
self.server = FastMCP(name) # The underlying MCP server library
self.tools: Dict[str, BaseTool] = {}
# Add tools to offer
self.tools["bash"] = Bash()
# ... add other tools like Browser, Editor ...
def register_tool(self, tool: BaseTool) -> None:
"""Registers a tool's execute method with the FastMCP server."""
tool_name = tool.name
tool_param = tool.to_param() # Get schema for the LLM
tool_function = tool_param["function"]
# Define the function that the MCP server will expose
async def tool_method(**kwargs):
logger.info(f"Executing {tool_name} via MCP: {kwargs}")
# Call the actual tool's execute method
result = await tool.execute(**kwargs)
logger.info(f"Result of {tool_name}: {result}")
# Return result (often needs conversion, e.g., to JSON)
return json.dumps(result.model_dump()) if hasattr(result, "model_dump") else str(result)
# Attach metadata (name, description, parameters) for discovery
tool_method.__name__ = tool_name
tool_method.__doc__ = self._build_docstring(tool_function)
tool_method.__signature__ = self._build_signature(tool_function)
# Register with the FastMCP library instance
self.server.tool()(tool_method)
logger.info(f"Registered tool for MCP: {tool_name}")
def register_all_tools(self):
for tool in self.tools.values():
self.register_tool(tool)
def run(self, transport: str = "stdio"):
self.register_all_tools()
logger.info(f"Starting MCP server ({transport} mode)")
self.server.run(transport=transport) # Start listening
# Command-line execution part:
# if __name__ == "__main__":
# server = MCPServer()
# server.run(transport="stdio") # Or based on args
```
**Explanation:** The `MCPServer` creates instances of tools (`Bash`, etc.) and then uses `register_tool` to wrap each tool's `execute` method into a format the `fastmcp` library understands. This allows the server to advertise the tool (with its name, description, parameters) and call the correct function when the agent makes a `call_tool` request.
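Because registration amounts to "put a `BaseTool` in the `tools` dict before calling `run()`", offering an extra tool from the server is straightforward. Here is a hedged sketch using a made-up `EchoTool` (illustrative only, not part of OpenManus) to show the shape:

```python
# A hypothetical custom tool offered via MCP. EchoTool is illustrative only;
# exact BaseTool field definitions may differ slightly in the real codebase.
from app.mcp.server import MCPServer
from app.tool.base import BaseTool, ToolResult

class EchoTool(BaseTool):
    name: str = "echo"
    description: str = "Echoes back the provided text."
    parameters: dict = {
        "type": "object",
        "properties": {"text": {"type": "string", "description": "Text to echo"}},
        "required": ["text"],
    }

    async def execute(self, text: str) -> ToolResult:
        return ToolResult(output=text)

# Register it alongside the built-in tools and start the server:
server = MCPServer()
server.tools["echo"] = EchoTool()
server.run(transport="stdio")  # connected agents can now discover 'echo' via list_tools
```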
**2. `MCPClients` (`app/tool/mcp.py`): Client-Side Tool Representation**
The `MCPAgent` uses this class, which acts like a `ToolCollection`, but its tools are proxies that make calls to the remote server.
```python
# Simplified snippet from app/tool/mcp.py
from contextlib import AsyncExitStack
from typing import List, Optional
from mcp import ClientSession  # MCP library for client-side communication
from mcp.client.stdio import stdio_client  # Specific transport handler
from mcp.types import TextContent
from app.logger import logger
from app.tool.base import BaseTool, ToolResult
from app.tool.tool_collection import ToolCollection
# Represents a single tool on the server, callable from the client
class MCPClientTool(BaseTool):
session: Optional[ClientSession] = None # Holds the connection
async def execute(self, **kwargs) -> ToolResult:
"""Execute by calling the remote tool via the MCP session."""
if not self.session: return ToolResult(error="Not connected")
try:
# Make the actual remote call
result = await self.session.call_tool(self.name, kwargs)
# Extract text output from the response
content = ", ".join(
item.text for item in result.content if isinstance(item, TextContent)
)
return ToolResult(output=content or "No output.")
except Exception as e:
return ToolResult(error=f"MCP tool error: {e}")
# The collection holding the proxy tools
class MCPClients(ToolCollection):
session: Optional[ClientSession] = None
exit_stack: AsyncExitStack = None # Manages connection resources
async def connect_stdio(self, command: str, args: List[str]):
"""Connect using stdio."""
if self.session: await self.disconnect()
self.exit_stack = AsyncExitStack()
# Set up stdio connection using MCP library helper
server_params = {"command": command, "args": args} # Simplified
streams = await self.exit_stack.enter_async_context(
stdio_client(server_params)
)
# Establish the MCP session over the connection
self.session = await self.exit_stack.enter_async_context(
ClientSession(*streams)
)
await self._initialize_and_list_tools() # Get tool list from server
async def _initialize_and_list_tools(self):
"""Fetch tools from server and create proxy objects."""
await self.session.initialize()
response = await self.session.list_tools() # Ask server for tools
self.tool_map = {}
for tool_info in response.tools:
# Create an MCPClientTool instance for each server tool
proxy_tool = MCPClientTool(
name=tool_info.name,
description=tool_info.description,
parameters=tool_info.inputSchema, # Use schema from server
session=self.session, # Pass the active session
)
self.tool_map[tool_info.name] = proxy_tool
self.tools = tuple(self.tool_map.values())
logger.info(f"MCP Client found tools: {list(self.tool_map.keys())}")
async def disconnect(self):
if self.session and self.exit_stack:
await self.exit_stack.aclose() # Clean up connection
# ... reset state ...
```
**Explanation:** `MCPClients` handles the connection (`connect_stdio`). When connected, it calls `list_tools` on the server. For each tool reported by the server, it creates a local `MCPClientTool` proxy object. This proxy object looks like a normal `BaseTool` (with name, description, parameters), but its `execute` method doesn't run code locally; instead, it uses the active `ClientSession` to send a `call_tool` request back to the server.
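To make the proxy idea concrete, here is a hedged sketch of using `MCPClients` on its own, outside any agent (it assumes the server can be launched with the same `python -m app.mcp.server` command shown earlier):

```python
# A minimal sketch of driving MCPClients directly over stdio.
import asyncio

from app.tool.mcp import MCPClients

async def main():
    clients = MCPClients()
    # Launch the server process and connect to it over stdio
    await clients.connect_stdio(command="python", args=["-m", "app.mcp.server"])

    # Each server tool is now a local proxy; calling execute() triggers a remote call_tool
    bash_proxy = clients.tool_map["bash"]
    result = await bash_proxy.execute(command="ls")
    print(result.output)

    await clients.disconnect()

asyncio.run(main())
```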
**3. `MCPAgent` (`app/agent/mcp.py`): Using MCPClients**
The agent integrates the `MCPClients` collection.
```python
# Simplified snippet from app/agent/mcp.py
from typing import List, Optional
from pydantic import Field
from app.agent.toolcall import ToolCallAgent
from app.tool.mcp import MCPClients
class MCPAgent(ToolCallAgent):
# Use MCPClients as the tool collection
mcp_clients: MCPClients = Field(default_factory=MCPClients)
available_tools: MCPClients = None # Will point to mcp_clients
connection_type: str = "stdio"
# ... other fields ...
async def initialize(
self, command: Optional[str] = None, args: Optional[List[str]] = None, ...
):
"""Initialize by connecting the MCPClients instance."""
if self.connection_type == "stdio":
# Tell mcp_clients to connect
await self.mcp_clients.connect_stdio(command=command, args=args or [])
# elif self.connection_type == "sse": ...
# The agent's tools are now the tools provided by the server
self.available_tools = self.mcp_clients
# Store initial tool schemas for detecting changes later
self.tool_schemas = {t.name: t.parameters for t in self.available_tools}
# Add system message about tools...
async def _refresh_tools(self):
"""Periodically check the server for tool updates."""
if not self.mcp_clients.session: return
# Ask the server for its current list of tools
response = await self.mcp_clients.session.list_tools()
current_tools = {t.name: t.inputSchema for t in response.tools}
# Compare with stored schemas (self.tool_schemas)
# Detect added/removed tools and update self.tool_schemas
# Add system messages to memory if tools change
# ... logic to detect and log changes ...
async def think(self) -> bool:
"""Agent's thinking step."""
# Refresh tools periodically
if self.current_step % self._refresh_tools_interval == 0:
await self._refresh_tools()
# Stop if server seems gone (no tools left)
if not self.mcp_clients.tool_map: return False
# Use parent class's think method, which uses self.available_tools
# (which points to self.mcp_clients) for tool decisions/calls
return await super().think()
async def cleanup(self):
"""Disconnect the MCP session when the agent finishes."""
if self.mcp_clients.session:
await self.mcp_clients.disconnect()
```
**Explanation:** The `MCPAgent` holds an instance of `MCPClients`. In `initialize`, it tells `MCPClients` to connect to the server. It sets its own `available_tools` to point to the `MCPClients` instance. When the agent's `think` method (inherited from `ToolCallAgent`) needs to consider or execute tools, it uses `self.available_tools`. Because this *is* the `MCPClients` object, any tool execution results in a remote call to the `MCPServer` via the proxy tools. The agent also adds logic to periodically `_refresh_tools` and `cleanup` the connection.
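Putting it together, the launcher does roughly the following (a hedged sketch; the real `run_mcp.py` also handles the SSE transport, interactive mode, and command-line arguments, and the exact `run()` signature may differ):

```python
# A rough sketch of launching an MCPAgent over stdio, assuming the server command above.
import asyncio

from app.agent.mcp import MCPAgent

async def main():
    agent = MCPAgent()
    # Connect to the tool server over stdio (this launches `python -m app.mcp.server`)
    await agent.initialize(command="python", args=["-m", "app.mcp.server"])
    try:
        # run() is the usual agent entry point (see Chapter 3: BaseAgent)
        result = await agent.run("Use the bash tool to list the files here.")
        print(result)
    finally:
        await agent.cleanup()  # disconnect the MCP session

asyncio.run(main())
```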
## Wrapping Up Chapter 9
Congratulations on completing the core concepts tutorial!
In this final chapter, we explored the **Model Context Protocol (MCP)**. You learned how MCP allows an `MCPAgent` to connect to an external `MCPServer` and dynamically discover and use tools hosted by that server. This provides a powerful way to extend agent capabilities with specialized tools without modifying the agent's core code, enabling a flexible, plug-and-play architecture for agent skills.
You've journeyed through the essential building blocks of OpenManus:
* The "brain" ([LLM](01_llm.md))
* Conversation history ([Message / Memory](02_message___memory.md))
* The agent structure ([BaseAgent](03_baseagent.md))
* Agent skills ([Tool / ToolCollection](04_tool___toolcollection.md))
* Multi-step task orchestration ([BaseFlow](05_baseflow.md))
* Data structure definitions ([Schema](06_schema.md))
* Settings management ([Configuration (Config)](07_configuration__config_.md))
* Secure code execution ([DockerSandbox](08_dockersandbox.md))
* And dynamic external tools ([MCP](09_mcp__model_context_protocol_.md))
Armed with this knowledge, you're now well-equipped to start exploring the OpenManus codebase, experimenting with different agents and tools, and building your own intelligent applications! Good luck!
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/OpenManus/index.md
@@ -0,0 +1,55 @@
# Tutorial: OpenManus
OpenManus is a framework for building autonomous *AI agents*.
Think of it like a digital assistant that can perform tasks. It uses a central **brain** (an `LLM` like GPT-4) to understand requests and decide what to do next.
Agents can use various **tools** (like searching the web or writing code) to interact with the world or perform specific actions. Some complex tasks might involve a **flow** that coordinates multiple agents.
It keeps track of the conversation using `Memory` and ensures secure code execution using a `DockerSandbox`.
The system is flexible, allowing new tools to be added, even dynamically through the `MCP` protocol.
**Source Repository:** [https://github.com/mannaandpoem/OpenManus/tree/f616c5d43d02d93ccc6e55f11666726d6645fdc2](https://github.com/mannaandpoem/OpenManus/tree/f616c5d43d02d93ccc6e55f11666726d6645fdc2)
```mermaid
flowchart TD
A0["BaseAgent"]
A1["Tool / ToolCollection"]
A2["LLM"]
A3["Message / Memory"]
A4["Schema"]
A5["BaseFlow"]
A6["DockerSandbox"]
A7["Configuration (Config)"]
A8["MCP (Model Context Protocol)"]
A0 -- "Uses LLM for thinking" --> A2
A0 -- "Uses Memory for context" --> A3
A0 -- "Executes Tools" --> A1
A5 -- "Orchestrates Agents" --> A0
A1 -- "Uses Sandbox for execution" --> A6
A2 -- "Reads LLM Config" --> A7
A6 -- "Reads Sandbox Config" --> A7
A7 -- "Provides MCP Config" --> A8
A8 -- "Provides Dynamic Tools" --> A1
A8 -- "Extends BaseAgent" --> A0
A4 -- "Defines Agent Structures" --> A0
A4 -- "Defines Message Structure" --> A3
A2 -- "Processes Messages" --> A3
A5 -- "Uses Tools" --> A1
A4 -- "Defines Tool Structures" --> A1
```
## Chapters
1. [LLM](01_llm.md)
2. [Message / Memory](02_message___memory.md)
3. [BaseAgent](03_baseagent.md)
4. [Tool / ToolCollection](04_tool___toolcollection.md)
5. [BaseFlow](05_baseflow.md)
6. [Schema](06_schema.md)
7. [Configuration (Config)](07_configuration__config_.md)
8. [DockerSandbox](08_dockersandbox.md)
9. [MCP (Model Context Protocol)](09_mcp__model_context_protocol_.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)