IngestRSS/todo.md
Charles-Gormley 8ad98c4cb4 some db changes
2024-12-21 21:38:49 -05:00

After Public Launch

  • Monthly Kaggle Dataset Publishing.

  • Vector Database Initialization at earlier phase. [ Done ]

  • Test out Vector Databases at Small Scale.

    • Testing
      • Fix OpenAI Error.
      • Fix Pinecone Error
      • Fix input error.
    • Let it run for a day
      • Check OpenAI Bill
      • Check Vector Database Bill
      • Figure out Vector Database Bug.
        • Add Logging to Pinecone
        • Run a simple test to see if any logs pop up.
        • Check Logs out and figure out what to debug
    • Figure out the best way to store articles: as vector metadata or in S3.
  • Turn off the EU

  • Ensure the US data storage for both is working.

  • Decrease the cost of CloudWatch Logs

  • Test out Vector Databases at Scale.

  • Add text cleaning after ingesting an article but before storage.

  • Automate the monthly data ingestion job

  • Lambda Optimization

  • Monthly ingestion job

  • Protocol for annotating data.

    • Development
      • Check out Raj's script
      • DSPy Integration
      • LLMRouter integration
    • Annotation Categories
      • Main topic/Category ( list )
      • Writing Style ( e.g. Informal, professional, etc...)
      • Promotional Material ( 0=Not Promotional, 1=Promotional)
      • Stuff that is news ( 0= Not News, 1=News)
      • Stuff that is news but presented as a list of news topics ( 0=Not a Topic List, 1=News Topic List)
      • Annotating Entities ( List of Key entities with entity specific sentiment )
      • List of Major Events ( e.g. Ukraine War, Israel Palestine, etc... )
      • List of Minor Events ( e.g. Specific Battle, Court Case step, etc..)
      • Novelty Factor ( Scale from 0 (Not Interesting) to 100 (Interesting) )
      • Annotating Podcast Scripts or Video Scripts ( 0=is not a script, 1=Is a script)
      • Political Quadrant ( Or that eight dimensional thing that guy had. )
  • Estimation Algorithm for annotation cost.

  • Open Source Protocol for running this.
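
Sketch for the annotation protocol and cost estimation items above. This is only a rough starting point, not the project's actual schema: the class names, field names, sentiment range, and the `estimate_annotation_cost` helper are all hypothetical, and the prices passed in are placeholders rather than real provider rates. The political quadrant category is left out because its format is not decided in the list.

```python
from dataclasses import dataclass, field


@dataclass
class EntityAnnotation:
    """One key entity with entity-specific sentiment (hypothetical shape)."""
    name: str
    sentiment: float  # assumed range: -1.0 (negative) to 1.0 (positive)


@dataclass
class ArticleAnnotation:
    """Hypothetical record mirroring the annotation categories in the list."""
    main_topics: list[str] = field(default_factory=list)
    writing_style: str = ""        # e.g. "informal", "professional"
    is_promotional: int = 0        # 0=Not Promotional, 1=Promotional
    is_news: int = 0               # 0=Not News, 1=News
    is_news_topic_list: int = 0    # 1=list of news topics
    entities: list[EntityAnnotation] = field(default_factory=list)
    major_events: list[str] = field(default_factory=list)
    minor_events: list[str] = field(default_factory=list)
    novelty: int = 0               # 0 (Not Interesting) to 100 (Interesting)
    is_script: int = 0             # 1=podcast or video script

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the record is valid."""
        errors = []
        for name in ("is_promotional", "is_news", "is_news_topic_list", "is_script"):
            if getattr(self, name) not in (0, 1):
                errors.append(f"{name} must be 0 or 1")
        if not 0 <= self.novelty <= 100:
            errors.append("novelty must be between 0 and 100")
        return errors


def estimate_annotation_cost(
    n_articles: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Rough LLM annotation cost: per-article token cost times article count."""
    per_article = (
        avg_input_tokens / 1000 * usd_per_1k_input
        + avg_output_tokens / 1000 * usd_per_1k_output
    )
    return n_articles * per_article
```

Usage idea: build an `ArticleAnnotation` per article, call `validate()` before storing it, and run `estimate_annotation_cost` with measured token averages and current provider prices before starting a large annotation batch.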