IngestRSS/todo.md
Charles-Gormley 8ad98c4cb4 some db changes
2024-12-21 21:38:49 -05:00

# After Public Launch
* Monthly Kaggle Dataset Publishing.
* Vector Database Initialization at earlier phase. [ Done ]
* Test out Vector Databases at Small Scale.
* [x] Testing
* [x] Fix OpenAI Error.
* [x] Fix Pinecone Error
* [x] Fix input error.
* [ ] Let it run for a day
* [x] Check OpenAI Bill
* [x] Check Vector Database Bill
* [x] Figure out Vector Database Bug.
* [x] Add Logging to Pinecone
* [x] Run a simple test to see if any logs pop up.
* [x] Check Logs out and figure out what to debug
* [x] Figure out the best way to store articles: in metadata or in S3.
* [x] Turn off the EU data storage
* [ ] Ensure the US data storage for both is working.
* [ ] Decrease the cost of CloudWatch Logs
* [ ] Test out Vector Databases at Scale.
* [ ] Add text cleaning after ingesting an article but before storage.
* [ ] Automate the monthly data ingestion job
* [ ] Lambda Optimization
* Monthly ingestion job
* Protocol for annotating data.
* [ ] Development
* [ ] Check out Raj's script
* [ ] DSPy Integration
* [ ] LLMRouter integration
* [ ] Annotation Categories
* [ ] Main topic/Category ( list )
* [ ] Writing Style ( e.g. Informal, professional, etc...)
* [ ] Promotional Material ( 0=Not Promotional, 1=Promotional)
* [ ] News content ( 0=Not News, 1=News)
* [ ] News content that is a list of news topics ( 0=Not a News Topic List, 1=News Topic List)
* [ ] Annotating Entities ( List of Key entities with entity specific sentiment )
* [ ] List of Major Events ( e.g. Ukraine War, Israel Palestine, etc... )
* [ ] List of Minor Events ( e.g. Specific Battle, Court Case step, etc... )
* [ ] Novelty Factor ( Scale from 0 (Not Interesting) to 100 (Interesting) )
* [ ] Annotating Podcast Scripts or Video Scripts ( 0=is not a script, 1=Is a script)
* [ ] Political Quadrant ( Or that eight dimensional thing that guy had. )
* Estimation Algorithm for annotation cost.
* Open Source Protocol for running this.
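
The annotation categories listed above could be captured in a single record type. The following is a minimal sketch, not a settled schema: every class name, field name, and value range here (for example `ArticleAnnotation`, `EntityAnnotation`, and the -1.0 to 1.0 sentiment range) is a hypothetical choice made for illustration, and the political quadrant is left as a free-form string because its format is still undecided.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntityAnnotation:
    """One key entity with entity-specific sentiment (hypothetical shape)."""
    name: str
    sentiment: float  # assumed range: -1.0 (negative) to 1.0 (positive)


@dataclass
class ArticleAnnotation:
    """One annotation record per article; all field names are assumptions."""
    main_topics: List[str] = field(default_factory=list)
    writing_style: Optional[str] = None       # e.g. "informal", "professional"
    is_promotional: int = 0                   # 0=Not Promotional, 1=Promotional
    is_news: int = 0                          # 0=Not News, 1=News
    is_news_topic_list: int = 0               # 1=article is a list of news topics
    entities: List[EntityAnnotation] = field(default_factory=list)
    major_events: List[str] = field(default_factory=list)
    minor_events: List[str] = field(default_factory=list)
    novelty: int = 0                          # 0 (Not Interesting) to 100 (Interesting)
    is_script: int = 0                        # 1=podcast or video script
    political_quadrant: Optional[str] = None  # format undecided, kept free-form

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is valid."""
        errors = []
        for name in ("is_promotional", "is_news", "is_news_topic_list", "is_script"):
            if getattr(self, name) not in (0, 1):
                errors.append(f"{name} must be 0 or 1")
        if not 0 <= self.novelty <= 100:
            errors.append("novelty must be between 0 and 100")
        for entity in self.entities:
            if not -1.0 <= entity.sentiment <= 1.0:
                errors.append(f"sentiment for {entity.name} must be in [-1.0, 1.0]")
        return errors
```

Calling `validate()` before storing a record would let an annotation job reject malformed model output early, for example a novelty score outside 0 to 100.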