Documentation/collate rearch notes (#4958)

* Add links to github issues in the README and clarify run instructions * Added a new doc in the core package with architecture notes.
2026-02-14 02:34:27 +01:00 · 2023-07-12 18:38:48 -07:00
parent 3582ada3df
commit 077e143cc2
2 changed files with 278 additions and 0 deletions
--- a/autogpt/core/ARCHITECTURE_NOTES.md
+++ b/autogpt/core/ARCHITECTURE_NOTES.md
@@ -0,0 +1,272 @@
+# Re-architecture Notes
+
+## Key Documents
+
+- [Planned Agent Workflow](https://whimsical.com/agent-workflow-v2-NmnTQ8R7sVo7M3S43XgXmZ)
+- [Original Architecture Diagram](https://www.figma.com/file/fwdj44tPR7ArYtnGGUKknw/Modular-Architecture?type=whiteboard&node-id=0-1) - This is sadly well out of date at this point.
+- [Kanban](https://github.com/orgs/Significant-Gravitas/projects/1/views/1?filterQuery=label%3Are-arch)
+
+## The Motivation
+
+The `master` branch of Auto-GPT is an organically grown amalgamation of many thoughts 
+and ideas about agent-driven autonomous systems.  It lacks clear abstraction boundaries, 
+has issues of global state and poorly encapsulated state, and is generally just hard to 
+make effective changes to.  Mainly it's just a system that's hard to make changes to.  
+And research in the field is moving fast, so we want to be able to try new ideas 
+quickly.  
+
+## Initial Planning
+
+A large group of maintainers and contributors met do discuss the architectural 
+challenges associated with the existing codebase. Many much-desired features (building 
+new user interfaces, enabling project-specific agents, enabling multi-agent systems) 
+are bottlenecked by the global state in the system. We discussed the tradeoffs between 
+an incremental system transition and a big breaking version change and decided to go 
+for the breaking version change. We justified this by saying:
+
+- We can maintain, in essence, the same user experience as now even with a radical 
+  restructuring of the codebase
+- Our developer audience is struggling to use the existing codebase to build 
+  applications and libraries of their own, so this breaking change will largely be 
+  welcome.
+
+## Primary Goals
+
+- Separate the AutoGPT application code from the library code.
+- Remove global state from the system
+- Allow for multiple agents per user (with facilities for running simultaneously)
+- Create a serializable representation of an Agent
+- Encapsulate the core systems in abstractions with clear boundaries.
+
+## Secondary goals
+
+- Use existing tools to ditch any unneccesary cruft in the codebase (document loading, 
+  json parsing, anything easier to replace than to port).
+- Bring in the [core agent loop updates](https://whimsical.com/agent-workflow-v2-NmnTQ8R7sVo7M3S43XgXmZ)
+  being developed simultaneously by @Pwuts 
+
+# The Agent Subsystems
+
+## Configuration
+
+We want a lot of things from a configuration system. We lean heavily on it in the 
+`master` branch to allow several parts of the system to communicate with each other.  
+[Recent work](https://github.com/Significant-Gravitas/Auto-GPT/pull/4737) has made it 
+so that the config is no longer a singleton object that is materialized from the import 
+state, but it's still treated as a 
+[god object](https://en.wikipedia.org/wiki/God_object) containing all information about
+the system and _critically_ allowing any system to reference configuration information 
+about other parts of the system.  
+
+### What we want
+
+- It should still be reasonable to collate the entire system configuration in a 
+  sensible way.
+- The configuration should be validatable and validated.
+- The system configuration should be a _serializable_ representation of an `Agent`.
+- The configuration system should provide a clear (albeit very low-level) contract 
+  about user-configurable aspects of the system.
+- The configuration should reasonably manage default values and user-provided overrides.
+- The configuration system needs to handle credentials in a reasonable way.
+- The configuration should be the representation of some amount of system state, like 
+  api budgets and resource usage.  These aspects are recorded in the configuration and 
+  updated by the system itself.
+- Agent systems should have encapsulated views of the configuration.  E.g. the memory 
+  system should know about memory configuration but nothing about command configuration.
+
+## Workspace
+
+There are two ways to think about the workspace:
+
+- The workspace is a scratch space for an agent where it can store files, write code, 
+  and do pretty much whatever else it likes.
+- The workspace is, at any given point in time, the single source of truth for what an 
+  agent is.  It contains the serializable state (the configuration) as well as all 
+  other working state (stored files, databases, memories, custom code).  
+
+In the existing system there is **one** workspace.  And because the workspace holds so 
+much agent state, that means a user can only work with one agent at a time.
+
+## Memory
+
+The memory system has been under extremely active development. 
+See [#3536](https://github.com/Significant-Gravitas/Auto-GPT/issues/3536) and 
+[#4208](https://github.com/Significant-Gravitas/Auto-GPT/pull/4208) for discussion and 
+work in the `master` branch.  The TL;DR is 
+that we noticed a couple of months ago that the `Agent` performed **worse** with 
+permanent memory than without it.  Since then the knowledge storage and retrieval 
+system has been [redesigned](https://whimsical.com/memory-system-8Ae6x6QkjDwQAUe9eVJ6w1) 
+and partially implemented in the `master` branch.
+
+## Planning/Prompt-Engineering
+
+The planning system is the system that translates user desires/agent intentions into
+language model prompts.  In the course of development, it has become pretty clear 
+that `Planning` is the wrong name for this system
+
+### What we want
+
+- It should be incredibly obvious what's being passed to a language model, when it's
+  being passed, and what the language model response is. The landscape of language 
+  model research is developing very rapidly, so building complex abstractions between 
+  users/contributors and the language model interactions is going to make it very 
+  difficult for us to nimbly respond to new research developments.
+- Prompt-engineering should ideally be exposed in a parameterizeable way to users. 
+- We should, where possible, leverage OpenAI's new  
+  [function calling api](https://openai.com/blog/function-calling-and-other-api-updates) 
+  to get outputs in a standard machine-readable format and avoid the deep pit of 
+  parsing json (and fixing unparsable json).
+
+### Planning Strategies
+
+The [new agent workflow](https://whimsical.com/agent-workflow-v2-NmnTQ8R7sVo7M3S43XgXmZ) 
+has many, many interaction points for language models.  We really would like to not 
+distribute prompt templates and raw strings all through the system. The re-arch solution 
+is to encapsulate language model interactions into planning strategies. 
+These strategies are defined by 
+
+- The `LanguageModelClassification` they use (`FAST` or `SMART`)
+- A function `build_prompt` that takes strategy specific arguments and constructs a 
+  `LanguageModelPrompt` (a simple container for lists of messages and functions to
+  pass to the language model)
+- A function `parse_content` that parses the response content (a dict) into a better 
+  formatted dict.  Contracts here are intentionally loose and will tighten once we have 
+  at least one other language model provider.
+
+## Resources
+
+Resources are kinds of services we consume from external APIs.  They may have associated 
+credentials and costs we need to manage.  Management of those credentials is implemented 
+as manipulation of the resource configuration.  We have two categories of resources 
+currently
+
+- AI/ML model providers (including language model providers and embedding model providers, ie OpenAI)
+- Memory providers (e.g. Pinecone, Weaviate, ChromaDB, etc.)
+
+### What we want
+
+- Resource abstractions should provide a common interface to different service providers 
+  for a particular kind of service.  
+- Resource abstractions should manipulate the configuration to manage their credentials 
+  and budget/accounting.
+- Resource abstractions should be composable over an API (e.g. I should be able to make 
+  an OpenAI provider that is both a LanguageModelProvider and an EmbeddingModelProvider
+  and use it wherever I need those services).
+
+## Abilities
+
+Along with planning and memory usage, abilities are one of the major augmentations of 
+augmented language models.  They allow us to expand the scope of what language models
+can do by hooking them up to code they can execute to obtain new knowledge or influence
+the world.  
+
+### What we want
+
+- Abilities should have an extremely clear interface that users can write to.
+- Abilities should have an extremely clear interface that a language model can 
+  understand
+- Abilities should be declarative about their dependencies so the system can inject them
+- Abilities should be executable (where sensible) in an async run loop.
+- Abilities should be not have side effects unless those side effects are clear in 
+  their representation to an agent (e.g. the BrowseWeb ability shouldn't write a file,
+  but the WriteFile ability can).
+
+## Plugins
+
+Users want to add lots of features that we don't want to support as first-party. 
+Or solution to this is a plugin system to allow users to plug in their functionality or
+to construct their agent from a public plugin marketplace.  Our primary concern in the
+re-arch is to build a stateless plugin service interface and a simple implementation 
+that can load plugins from installed packages or from zip files.  Future efforts will 
+expand this system to allow plugins to load from a marketplace or some other kind 
+of service.
+
+### What is a Plugin
+
+Plugins are a kind of garbage term.  They refer to a number of things.
+
+- New commands for the agent to execute.  This is the most common usage.
+- Replacements for entire subsystems like memory or language model providers
+- Application plugins that do things like send emails or communicate via whatsapp
+- The repositories contributors create that may themselves have multiple plugins in them.
+
+### Usage in the existing system
+
+The current plugin system is _hook-based_.  This means plugins don't correspond to 
+kinds of objects in the system, but rather to times in the system at which we defer 
+execution to them.  The main advantage of this setup is that user code can hijack 
+pretty much any behavior of the agent by injecting code that supercedes the normal 
+agent execution.  The disadvantages to this approach are numerous:
+
+- We have absolutely no mechanisms to enforce any security measures because the threat 
+  surface is everything.
+- We cannot reason about agent behavior in a cohesive way because control flow can be
+  ceded to user code at pretty much any point and arbitrarily change or break the
+  agent behavior
+- The interface for designing a plugin is kind of terrible and difficult to standardize
+- The hook based implementation means we couple ourselves to a particular flow of 
+  control (or otherwise risk breaking plugin behavior).  E.g. many of the hook targets
+  in the [old workflow](https://whimsical.com/agent-workflow-VAzeKcup3SR7awpNZJKTyK) 
+  are not present or mean something entirely different in the 
+  [new workflow](https://whimsical.com/agent-workflow-v2-NmnTQ8R7sVo7M3S43XgXmZ).
+- Etc.
+
+### What we want
+
+- A concrete definition of a plugin that is narrow enough in scope that we can define 
+  it well and reason about how it will work in the system.
+- A set of abstractions that let us define a plugin by its storage format and location 
+- A service interface that knows how to parse the plugin abstractions and turn them 
+  into concrete classes and objects.
+
+
+## Some Notes on how and why we'll use OO in this project
+
+First and foremost, Python itself is an object-oriented language. It's 
+underlying [data model](https://docs.python.org/3/reference/datamodel.html) is built 
+with object-oriented programming in mind. It offers useful tools like abstract base 
+classes to communicate interfaces to developers who want to, e.g., write plugins, or 
+help work on implementations. If we were working in a different language that offered 
+different tools, we'd use a different paradigm.
+
+While many things are classes in the re-arch, they are not classes in the same way. 
+There are three kinds of things (roughly) that are written as classes in the re-arch:
+1.  **Configuration**:  Auto-GPT has *a lot* of configuration.  This configuration 
+    is *data* and we use **[Pydantic](https://docs.pydantic.dev/latest/)** to manage it as 
+    pydantic is basically industry standard for this stuff. It provides runtime validation 
+    for all the configuration and allows us to easily serialize configuration to both basic 
+    python types (dicts, lists, and primatives) as well as serialize to json, which is 
+    important for us being able to put representations of agents 
+    [on the wire](https://en.wikipedia.org/wiki/Wire_protocol) for web applications and 
+    agent-to-agent communication. *These are essentially 
+    [structs](https://en.wikipedia.org/wiki/Struct_(C_programming_language)) rather than 
+    traditional classes.*
+2.  **Internal Data**: Very similar to configuration, Auto-GPT passes around boatloads 
+    of internal data.  We are interacting with language models and language model APIs 
+    which means we are handling lots of *structured* but *raw* text.  Here we also 
+    leverage **pydantic** to both *parse* and *validate* the internal data and also to 
+    give us concrete types which we can use static type checkers to validate against 
+    and discover problems before they show up as bugs at runtime. *These are 
+    essentially [structs](https://en.wikipedia.org/wiki/Struct_(C_programming_language)) 
+    rather than traditional classes.*
+3.  **System Interfaces**: This is our primary traditional use of classes in the 
+    re-arch.  We have a bunch of systems. We want many of those systems to have 
+    alternative implementations (e.g. via plugins). We use abstract base classes to 
+    define interfaces to communicate with people who might want to provide those 
+    plugins. We provide a single concrete implementation of most of those systems as a 
+    subclass of the interface. This should not be controversial.
+
+The approach is consistent with 
+[prior](https://github.com/Significant-Gravitas/Auto-GPT/issues/2458)
+[work](https://github.com/Significant-Gravitas/Auto-GPT/pull/2442) done by other 
+maintainers in this direction.
+
+From an organization standpoint, OO programming is by far the most popular programming 
+paradigm (especially for Python). It's the one most often taught in programming classes
+and the one with the most available online training for people interested in 
+contributing.   
+
+Finally, and importantly, we scoped the plan and initial design of the re-arch as a 
+large group of maintainers and collaborators early on. This is consistent with the 
+design we chose and no-one offered alternatives.
+ 
--- a/autogpt/core/README.md
+++ b/autogpt/core/README.md
@@ -6,6 +6,12 @@ a work in progress and is not yet feature complete.  In particular, it does not
 have many of the Auto-GPT commands implemented and is pending ongoing work to 
 [re-incorporate vector-based memory and knowledge retrieval](https://github.com/Significant-Gravitas/Auto-GPT/issues/3536).

+## [Overview](ARCHITECTURE_NOTES.md)
+
+The Auto-GPT Re-arch is a re-implementation of the Auto-GPT agent that is designed to be more modular,
+more extensible, and more maintainable than the original Auto-GPT agent.  It is also designed to be
+more accessible to new developers and to be easier to contribute to. The re-arch is a work in progress
+and is not yet feature complete.  It is also not yet ready for production use.

 ## Running the Re-arch Code