Files
Auto-GPT/autogpt/processing/text.py
Richard Beales dda8d0f6bf Release 0.3.1 (#4201)
* Feature/tighten up ci pipeline (#3700)

* Fix docker volume mounts (#3710)

Co-authored-by: Reinier van der Leer <github@pwuts.nl>
Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* Feature/enable intuitive logs for community challenge step 1 (#3695)

* Feature/enable intuitive logs summarization (#3697)

* Move task_complete command out of prompt (#3663)

* feat: move task_complete command out of prompt

* fix: formatting fixes

* Add the shutdown command to the test agents

* tests: update test vcrs

---------

Co-authored-by: James Collins <collijk@uw.edu>

* Allow users to Disable Commands via the .env (#3667)

* Document Disabling command categories (#3669)

* feat: move task_complete command out of prompt

* fix: formatting fixes

* feat: add command disabling

* docs: document how to disable command categories

* Enable denylist handling for plugins (#3688)

Co-authored-by: Luke Kyohere <lkyohere@mfsafrica.com>
Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* Fix call to `plugin.post_planning` (#3414)

Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* create information retrieval challenge a (#3770)

Co-authored-by: Richard Beales <rich@richbeales.net>

* fix typos (#3798)

* Update run.bat (#3783)

Co-authored-by: Richard Beales <rich@richbeales.net>

* Update run.sh (#3752)

Co-authored-by: Richard Beales <rich@richbeales.net>

* ADD: Bash block in the contributing markdown (#3701)

Co-authored-by: Richard Beales <rich@richbeales.net>

* BUGFIX: Selenium Driver object reference was included in the browsing results for some reason (#3642)

* * there is really no need to return the  reference to the Selenium driver along with the text summary and list of links.

* * removing unused second return value from browse_website()

* * updated cassette

* * updated YAML cassette for test_browse_website

* * after requirements reinstall, another update YAML cassette for test_browse_website

* * another update YAML cassette for test_browse_website, only as a placholder commit to trigger re-testing due to some docker TCP timeout issue

* * another update YAML cassette for test_browse_website

---------

Co-authored-by: batyu <batyu@localhost>

* Update CONTRIBUTING.md

* Self feedback Improvement (#3680)

* Improved `Self-Feedback`

* minor tweak

* Test: Updated `test_get_self_feedback.py`

* community challenges in the wiki (#3764)

* Update README.md

* Update PULL_REQUEST_TEMPLATE.md

Added link to wiki Contributing page

* Add link to wiki Contributing page

* fix

* Add link to wiki page  on Contributing

* Implement Logging of User Input in logs/Debug Folder (#3867)

* Adds USER_INPUT_FILE_NAME

* Update agent.py

* Update agent.py

Log only if console_input is not the authorise_key

* Reformatting

* add information retrieval challenge to the wiki (#3876)

* add code owners policy (#3981)

* add code owners

* added @ to codeowners

* switched to team ownership

* Memory Challenge C (#3908)

* Memory Challenge C

* Working cassettes

* Doc fixes

* Linting and doc fix

* Updated cassette

* One more cassette try

---------

Co-authored-by: merwanehamadi <merwanehamadi@gmail.com>

* memory challenge c inconsistent (#3985)

* Improve & fix memory challenge docs. (#3989)

Co-authored-by: Kaan Osmanagaoglu <kaano@questps.com.au>

* Feature/centralize prompt (#3990)

Co-authored-by: xiao.hu <454292663@qq.com>

* Use correct reference to prompt_generator in autogpt/llm/chat.py (#4011)

* fix typos (#3998)

Co-authored-by: Minfeng Lu <minfenglu@Minfengs-MacBook-Pro.local>
Co-authored-by: Richard Beales <rich@richbeales.net>

* fix typo in the getting started docs (#3997)

Co-authored-by: Richard Beales <rich@richbeales.net>

* Fix path to workspace directory in setup guide (#3927)

Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* document that docker-compose 1.29.0 is minimally required (#3963)

Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* Integrate pytest-xdist Plugin for Parallel and Concurrent Testing (#3870)

* Adds pytest-parallel dependencies

* Implement pytest-parallel for faster tests

* Uses pytest-xdist

* Auto number of workers processes

* Update ci.yml

---------

Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* explain temperature setting in env file (#4140)

Co-authored-by: Richard Beales <rich@richbeales.net>

* Catch JSON error in summary_memory.py (#3996)

Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Update duckduckgo dependency - min should be 2.9.5 (#4142)

Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Update Dockerfile - add missing scripts and plugins directories. (#3706)

Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Updated memory setup links (#3829)

Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Parse package versions so upgrades can be forced (#4149)

* parse package versions so upgrades can be forced

* better version from @collijk

* fix typo in autopgt/agent/agent.py (#3747)

Co-authored-by: merwanehamadi <merwanehamadi@gmail.com>
Co-authored-by: Richard Beales <rich@richbeales.net>
Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Fix `milvus_memory_test.py` mock `Config` (#3424)

Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Implemented showing the number of preauthorised commands left. #1035 (#3322)

Co-authored-by: mayubi <marwand@ayubi-it.de>
Co-authored-by: Nicholas Tindle <nick@ntindle.com>
Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Challenge: Kubernetes and documentation (#4121)

* challenge_kubes_and_readme

* docs

* testing

* black and isort

* revision

* lint

* comments

* blackisort

* docs

* docs

* deleting_cassette

* suggestions

* misspelling_errors

---------

Co-authored-by: merwanehamadi <merwanehamadi@gmail.com>

* Make sdwebui tests pass (when SD is running) (#3721)

Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* Add Edge browser support using EdgeChromiumDriverManager (#3058)

Co-authored-by: Nicholas Tindle <nick@ntindle.com>
Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>

* Added --install-plugin-deps to Docker (#4151)

Co-authored-by: Nicholas Tindle <nick@ntindle.com>

* Feature/basic proxy (#4164)

* basic proxy (#54)

* basic proxy (#55)

* basic proxy

* basic proxy

* basic proxy

* basic proxy

* add back double quotes

* add more specific files

* write file

* basic proxy

* Put back double quotes

* test new CI (#4168)

* test new CI

* test new CI

* remove double quotes

* Feature/test new ci pipeline 2 (#4169)

* test new CI

* remove double quotes

* make it a variable

* make it a variable

* Test New CI Pipeline (#4170)

* introduce dummy prompt change

* introduce dummy prompt change

* empty commit

* empty commit

* empty commit

* push to origin repo

* add s to quote

* Feature/fix rate limiting issue Step 1 (#4173)


* temporarilly remove 3.11

* add back 3.11 (#4185)

* Revert "Put back 3.11 until it's removed as a requirement" (#4191)

---------

Co-authored-by: Reinier van der Leer <github@pwuts.nl>
Co-authored-by: merwanehamadi <merwanehamadi@gmail.com>
Co-authored-by: Peter Petermann <ppetermann80@googlemail.com>
Co-authored-by: Nicholas Tindle <nick@ntindle.com>
Co-authored-by: James Collins <collijk@uw.edu>
Co-authored-by: Luke K <2609441+pr-0f3t@users.noreply.github.com>
Co-authored-by: Luke Kyohere <lkyohere@mfsafrica.com>
Co-authored-by: Robin Richtsfeld <robin.richtsfeld@gmail.com>
Co-authored-by: RainRat <rainrat78@yahoo.ca>
Co-authored-by: itsmarble <130370814+itsmarble@users.noreply.github.com>
Co-authored-by: Ambuj Pawar <pawar.ambuj@gmail.com>
Co-authored-by: bszollosinagy <4211175+bszollosinagy@users.noreply.github.com>
Co-authored-by: batyu <batyu@localhost>
Co-authored-by: Pi <sunfish7@gmail.com>
Co-authored-by: AbTrax <45964236+AbTrax@users.noreply.github.com>
Co-authored-by: Andres Caicedo <73312784+AndresCdo@users.noreply.github.com>
Co-authored-by: Douglas Schonholtz <15002691+dschonholtz@users.noreply.github.com>
Co-authored-by: Kaan <kaanixir@gmail.com>
Co-authored-by: Kaan Osmanagaoglu <kaano@questps.com.au>
Co-authored-by: xiao.hu <454292663@qq.com>
Co-authored-by: Tomasz Kasperczyk <tomaszikasperczyk@gmail.com>
Co-authored-by: minfeng-ai <42948406+minfenglu@users.noreply.github.com>
Co-authored-by: Minfeng Lu <minfenglu@Minfengs-MacBook-Pro.local>
Co-authored-by: Shlomi <81581678+jit-shlomi@users.noreply.github.com>
Co-authored-by: Itai Steinherz <itaisteinherz@gmail.com>
Co-authored-by: Boostrix <119627414+Boostrix@users.noreply.github.com>
Co-authored-by: Kristian Jackson <kristian.jackson@gmail.com>
Co-authored-by: k-boikov <64261260+k-boikov@users.noreply.github.com>
Co-authored-by: Eduardo Salinas <edus@microsoft.com>
Co-authored-by: prom3theu5 <dave@simcube.co.uk>
Co-authored-by: dominic-ks <contact@bedevious.co.uk>
Co-authored-by: andrey13771 <51243350+andrey13771@users.noreply.github.com>
Co-authored-by: Marwand Ayubi <98717667+xhypeDE@users.noreply.github.com>
Co-authored-by: mayubi <marwand@ayubi-it.de>
Co-authored-by: Media <12145726+rihp@users.noreply.github.com>
Co-authored-by: Cenny <cwenner@gmail.com>
Co-authored-by: Abdelkarim Habouch <37211852+karimhabush@users.noreply.github.com>
2023-05-14 21:36:55 +01:00

171 lines
5.1 KiB
Python

"""Text processing functions"""
from typing import Dict, Generator, Optional
import spacy
from selenium.webdriver.remote.webdriver import WebDriver
from autogpt.config import Config
from autogpt.llm import count_message_tokens, create_chat_completion
from autogpt.logs import logger
from autogpt.memory import get_memory
CFG = Config()
def split_text(
text: str,
max_length: int = CFG.browse_chunk_max_length,
model: str = CFG.fast_llm_model,
question: str = "",
) -> Generator[str, None, None]:
"""Split text into chunks of a maximum length
Args:
text (str): The text to split
max_length (int, optional): The maximum length of each chunk. Defaults to 8192.
Yields:
str: The next chunk of text
Raises:
ValueError: If the text is longer than the maximum length
"""
flattened_paragraphs = " ".join(text.split("\n"))
nlp = spacy.load(CFG.browse_spacy_language_model)
nlp.add_pipe("sentencizer")
doc = nlp(flattened_paragraphs)
sentences = [sent.text.strip() for sent in doc.sents]
current_chunk = []
for sentence in sentences:
message_with_additional_sentence = [
create_message(" ".join(current_chunk) + " " + sentence, question)
]
expected_token_usage = (
count_message_tokens(messages=message_with_additional_sentence, model=model)
+ 1
)
if expected_token_usage <= max_length:
current_chunk.append(sentence)
else:
yield " ".join(current_chunk)
current_chunk = [sentence]
message_this_sentence_only = [
create_message(" ".join(current_chunk), question)
]
expected_token_usage = (
count_message_tokens(messages=message_this_sentence_only, model=model)
+ 1
)
if expected_token_usage > max_length:
raise ValueError(
f"Sentence is too long in webpage: {expected_token_usage} tokens."
)
if current_chunk:
yield " ".join(current_chunk)
def summarize_text(
url: str, text: str, question: str, driver: Optional[WebDriver] = None
) -> str:
"""Summarize text using the OpenAI API
Args:
url (str): The url of the text
text (str): The text to summarize
question (str): The question to ask the model
driver (WebDriver): The webdriver to use to scroll the page
Returns:
str: The summary of the text
"""
if not text:
return "Error: No text to summarize"
model = CFG.fast_llm_model
text_length = len(text)
logger.info(f"Text length: {text_length} characters")
summaries = []
chunks = list(
split_text(
text, max_length=CFG.browse_chunk_max_length, model=model, question=question
),
)
scroll_ratio = 1 / len(chunks)
for i, chunk in enumerate(chunks):
if driver:
scroll_to_percentage(driver, scroll_ratio * i)
logger.info(f"Adding chunk {i + 1} / {len(chunks)} to memory")
memory_to_add = f"Source: {url}\n" f"Raw content part#{i + 1}: {chunk}"
memory = get_memory(CFG)
memory.add(memory_to_add)
messages = [create_message(chunk, question)]
tokens_for_chunk = count_message_tokens(messages, model)
logger.info(
f"Summarizing chunk {i + 1} / {len(chunks)} of length {len(chunk)} characters, or {tokens_for_chunk} tokens"
)
summary = create_chat_completion(
model=model,
messages=messages,
)
summaries.append(summary)
logger.info(
f"Added chunk {i + 1} summary to memory, of length {len(summary)} characters"
)
memory_to_add = f"Source: {url}\n" f"Content summary part#{i + 1}: {summary}"
memory.add(memory_to_add)
logger.info(f"Summarized {len(chunks)} chunks.")
combined_summary = "\n".join(summaries)
messages = [create_message(combined_summary, question)]
return create_chat_completion(
model=model,
messages=messages,
)
def scroll_to_percentage(driver: WebDriver, ratio: float) -> None:
"""Scroll to a percentage of the page
Args:
driver (WebDriver): The webdriver to use
ratio (float): The percentage to scroll to
Raises:
ValueError: If the ratio is not between 0 and 1
"""
if ratio < 0 or ratio > 1:
raise ValueError("Percentage should be between 0 and 1")
driver.execute_script(f"window.scrollTo(0, document.body.scrollHeight * {ratio});")
def create_message(chunk: str, question: str) -> Dict[str, str]:
"""Create a message for the chat completion
Args:
chunk (str): The chunk of text to summarize
question (str): The question to answer
Returns:
Dict[str, str]: The message to send to the chat completion
"""
return {
"role": "user",
"content": f'"""{chunk}""" Using the above text, answer the following'
f' question: "{question}" -- if the question cannot be answered using the text,'
" summarize the text.",
}