IngestRSS/todo.md

# After Public Launch
* Monthly Kaggle Dataset Publishing.

* Vector Database Initialization at earlier phase. [ Done ]
* Test out Vector Databases at Small Scale.
    * [x] Testing
        * [x] Fix OpenAI Error.
        * [x] Fix Pinecone Error
        * [x] Fix input error.
    * [ ] Let it run for a day
        * [x] Check Open AI Bill
        * [x] Check Vector Database Bill
        * [x] Figure out Vector Database Bug.
            * [x] Add Logging to Pinecone
            * [x] Run a simple test to see if any logs pop up.
            * [x] Check Logs out and figure out what to debug
    * [x] Figure out best way to store articles since metadata or in S3.
* [x] Turn off the eu
* [ ] Ensure the US data storage for both is working.
* [ ] Decreae the cost of cloudwatch Logs
* [ ] Test out Vector Databases at Scale.
* [ ] Add in text cleaning before after ingesting article but before storage.
* [ ] Automate the monthly data ingestion job
* [ ] Lambda Optimization


* Monthly ingestion job
* Protocol for annotating data.
    * [ ] Development
        * [ ] Check out Raj's script
        * [ ] DSPy Integration
        * [ ] LLMRouter integration
    * [ ] Annotation Categories
    * [ ] Main topic/Category ( list )
        * [ ] Writing Stley ( e.g. Informal, professional, etc...)
        * [ ] Promotional Material ( 0=Not Promotional, 1=Promotional)
        * [ ] Stuff that is news ( 0= Not News, 1=News)
        * [ ] Stuff that is news but like a list of news topics. ( 0=Opposite,  1=News Topic Lists)
        * [ ] Annotating Entities ( List of Key entities with entity specific sentiment )
        * [ ] List of Major Events ( e.g. Ukraine War, Israel Palestine, etc... )
        * [ ] List of Minor Event ( e.g. Specific Battle, Court Case step, etc..)
        * [ ] Novelty Factor ( Scale from 0(Not Interesting) -> 100(Interesting))
        * [ ] Annotating Podcast Scripts or Video Scripts ( 0=is not a script, 1=Is a script)
        * [ ] Political Quadrant ( Or that eight dimensional thing that guy had. )

* Estimation Algorithm for annotation cost.
* Open Source Protocol for running this.