mirror of
https://github.com/aljazceru/IngestRSS.git
synced 2025-12-17 22:14:20 +01:00
After Public Launch
-
Monthly Kaggle Dataset Publishing.
-
Vector Database Initialization at earlier phase. [ Done ]
-
Test out Vector Databases at Small Scale.
- Testing
- Fix OpenAI Error.
- Fix Pinecone Error
- Fix input error.
- Let it run for a day
- Check OpenAI Bill
- Check Vector Database Bill
- Figure out Vector Database Bug.
- Add Logging to Pinecone
- Run a simple test to see if any logs pop up.
- Check Logs out and figure out what to debug
- Figure out the best way to store articles: as vector metadata or in S3.
- Testing
-
Turn off the EU
-
Ensure the US data storage for both is working.
-
Decrease the cost of CloudWatch Logs
-
Test out Vector Databases at Scale.
-
Add in text cleaning after ingesting an article but before storage.
-
Automate the monthly data ingestion job
-
Lambda Optimization
-
Monthly ingestion job
-
Protocol for annotating data.
- Development
- Check out Raj's script
- DSPy Integration
- LLMRouter integration
- Annotation Categories
- Main topic/Category ( list )
- Writing Style ( e.g. Informal, Professional, etc... )
- Promotional Material ( 0=Not Promotional, 1=Promotional)
- Stuff that is news ( 0= Not News, 1=News)
- Stuff that is news but is just a list of news topics ( 0=Not a News Topic List, 1=News Topic List)
- Annotating Entities ( List of Key entities with entity specific sentiment )
- List of Major Events ( e.g. Ukraine War, Israel Palestine, etc... )
- List of Minor Events ( e.g. Specific Battle, Court Case step, etc..)
- Novelty Factor ( Scale from 0(Not Interesting) -> 100(Interesting))
- Annotating Podcast Scripts or Video Scripts ( 0=is not a script, 1=Is a script)
- Political Quadrant ( Or that eight dimensional thing that guy had. )
- Development
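One possible way to pin down the annotation categories above is a small record type. This is only a sketch: the class names, field names, and the `validate` helper are hypothetical and not part of the project. The political-quadrant category is left out because its shape is not yet decided, and the entity sentiment is assumed to be a free-form string label.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class EntityAnnotation:
    # Hypothetical shape: a key entity plus its entity-specific sentiment.
    name: str
    sentiment: str  # assumed free-form label, e.g. "positive" or "negative"


@dataclass
class ArticleAnnotation:
    # Field names are illustrative; they mirror the categories in the list.
    topics: List[str] = field(default_factory=list)        # main topic/category
    writing_style: str = ""                                # e.g. "informal"
    is_promotional: int = 0                                # 0=not promotional, 1=promotional
    is_news: int = 0                                       # 0=not news, 1=news
    is_news_topic_list: int = 0                            # 1=list of news topics
    entities: List[EntityAnnotation] = field(default_factory=list)
    major_events: List[str] = field(default_factory=list)
    minor_events: List[str] = field(default_factory=list)
    novelty: int = 0                                       # 0 (not interesting) to 100
    is_script: int = 0                                     # 1=podcast or video script

    def validate(self) -> None:
        """Raise ValueError if a binary flag or the novelty score is out of range."""
        for flag in ("is_promotional", "is_news", "is_news_topic_list", "is_script"):
            if getattr(self, flag) not in (0, 1):
                raise ValueError(f"{flag} must be 0 or 1")
        if not 0 <= self.novelty <= 100:
            raise ValueError("novelty must be between 0 and 100")


if __name__ == "__main__":
    example = ArticleAnnotation(
        topics=["politics"],
        is_news=1,
        entities=[EntityAnnotation(name="Example Org", sentiment="neutral")],
        novelty=40,
    )
    example.validate()
    print(asdict(example))
```

Keeping the flags as 0/1 integers matches the list as written; `asdict` gives a plain dictionary that could be serialized to JSON alongside each article.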
-
Estimation Algorithm for annotation cost.
-
Open Source Protocol for running this.
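For the annotation cost estimation item above, a rough starting point could be a per-token calculation. This is a sketch under assumptions: the function name and parameters are hypothetical, annotation is assumed to be done by an LLM billed per 1,000 input and output tokens, and the numbers in the usage example are placeholders rather than real prices.

```python
def estimate_annotation_cost(
    n_articles: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate total annotation cost in the currency the prices are given in.

    Assumes every article costs roughly the same: average prompt tokens at the
    input price plus average response tokens at the output price.
    """
    if n_articles < 0:
        raise ValueError("n_articles must be non-negative")
    per_article = (
        (avg_input_tokens / 1000.0) * input_price_per_1k
        + (avg_output_tokens / 1000.0) * output_price_per_1k
    )
    return n_articles * per_article


if __name__ == "__main__":
    # Placeholder values only; substitute measured token counts and real prices.
    total = estimate_annotation_cost(
        n_articles=1000,
        avg_input_tokens=2000,
        avg_output_tokens=100,
        input_price_per_1k=0.5,
        output_price_per_1k=1.5,
    )
    print(f"Estimated cost: {total:.2f}")
```

Measuring average token counts on a small sample of articles first would make the estimate more trustworthy than guessing them.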