Fixed critical Windows compatibility issues and updated outdated documentation.
CRITICAL WINDOWS HANG FIXES:
1. ProcessPoolExecutor → ThreadPoolExecutor
- Fixes PowerShell/terminal hang where Ctrl+C wouldn't work
- Prevents .pf directory lock requiring Task Manager kill
- Root cause: Nested ProcessPool + ThreadPool on Windows creates kernel deadlock
2. Ctrl+C Interruption Support
- Replaced subprocess.run with Popen+poll pattern (industry standard)
- Poll subprocess every 100ms for interruption checking
- Added global stop_event and signal handlers for graceful shutdown
- Root cause: subprocess.run blocks threads with no signal propagation
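Sketch of the pattern (illustrative only; the actual run_subprocess_with_interrupt() in pipelines.py may differ in signature and cleanup details):

```python
import subprocess
import threading

# Global stop event; a SIGINT handler elsewhere sets it on Ctrl+C.
stop_event = threading.Event()

def run_subprocess_with_interrupt(cmd):
    """Poll the child every 100ms so Ctrl+C can interrupt it."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        while proc.poll() is None:
            if stop_event.is_set():
                proc.terminate()  # graceful shutdown instead of a hung console
                proc.wait(timeout=5)
                raise KeyboardInterrupt
            stop_event.wait(0.1)  # 100ms poll interval
        out, err = proc.communicate()
        return proc.returncode, out, err
    finally:
        if proc.poll() is None:
            proc.kill()  # last resort if termination failed
```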
DOCUMENTATION DRIFT FIX:
- Removed hardcoded "14 phases" references (actual is 19+ commands)
- Updated to "multiple analysis phases" throughout all docs
- Fixed CLI help text to be version-agnostic
- Added missing "Summary generation" step in HOWTOUSE.md
Changes:
- pipelines.py: ProcessPoolExecutor → ThreadPoolExecutor, added Popen+poll pattern
- Added signal handling and run_subprocess_with_interrupt() function
- commands/full.py: Updated docstring to remove specific phase count
- README.md: Changed "14 distinct phases" to "multiple analysis phases"
- HOWTOUSE.md: Updated phase references, added missing summary step
- CLAUDE.md & ARCHITECTURE.md: Removed hardcoded phase counts
Impact: Critical UX fixes - Windows compatibility restored, pipeline interruptible
Testing: Ctrl+C works, no PowerShell hangs, .pf directory deletable
TheAuditor Architecture
This document provides a comprehensive technical overview of TheAuditor's architecture, design patterns, and implementation details.
System Overview
TheAuditor is an offline-first, AI-centric SAST (Static Application Security Testing) and code intelligence platform. It orchestrates industry-standard tools to provide ground truth about code quality and security, producing AI-consumable reports optimized for LLM context windows.
Core Design Principles
- Offline-First Operation - All analysis runs without network access, ensuring data privacy and reproducible results
- Dual-Mode Architecture - Courier Mode preserves raw external tool outputs; Expert Mode applies security expertise objectively
- AI-Centric Workflow - Produces chunks optimized for LLM context windows (65KB by default)
- Sandboxed Execution - Isolated analysis environment prevents cross-contamination
- No Fix Generation - Reports findings without prescribing solutions
Truth Courier vs Insights: Separation of Concerns
TheAuditor maintains a strict architectural separation between factual observation and optional interpretation:
Truth Courier Modules (Core)
These modules are the foundation - they gather and report verifiable facts without judgment:
- Indexer: Reports "Function X exists at line Y with Z parameters"
- Taint Analyzer: Reports "Data flows from pattern A to pattern B through path C"
- Impact Analyzer: Reports "Changing function X affects Y files through Z call chains"
- Graph Analyzer: Reports "Module A imports B, B imports C, C imports A (cycle detected)"
- Pattern Detector: Reports "Line X matches pattern Y from rule Z"
- Linters: Report "Tool ESLint flagged line X with rule Y"
These modules form the immutable ground truth. They report what exists, not what it means.
Insights Modules (Optional Interpretation Layer)
These are optional packages that consume Truth Courier data to add scoring and classification. All insights modules have been consolidated into a single package for better organization:
```
theauditor/insights/
├── __init__.py  # Package exports
├── ml.py        # Machine learning predictions (requires pip install -e ".[ml]")
├── graph.py     # Graph health scoring and recommendations
└── taint.py     # Vulnerability severity classification
```
- insights/taint.py: Adds "This flow is XSS with HIGH severity"
- insights/graph.py: Adds "Health score: 70/100, Grade: B"
- insights/ml.py (requires `pip install -e ".[ml]"`): Adds "80% probability of bugs based on historical patterns"
Important: Insights modules are:
- Not installed by default (ML requires explicit opt-in)
- Completely decoupled from core analysis
- Still based on technical patterns, not business logic interpretation
- Designed for teams that want actionable scores alongside raw facts
- All consolidated in the `theauditor/insights/` package for consistency
The FCE: Factual Correlation Engine
The FCE correlates facts from multiple tools without interpreting them:
- Reports: "Tool A and Tool B both flagged line 100"
- Reports: "Pattern X and Pattern Y co-occur in file Z"
- Never says: "This is bad" or "Fix this way"
Core Components
Indexer Package (theauditor/indexer/)
The indexer has been refactored from a monolithic 2000+ line file into a modular package structure:
```
theauditor/indexer/
├── __init__.py       # Package initialization and backward compatibility
├── config.py         # Constants, patterns, and configuration
├── database.py       # DatabaseManager class for all DB operations
├── core.py           # FileWalker and ASTCache classes
├── orchestrator.py   # IndexOrchestrator - main coordination logic
└── extractors/
    ├── __init__.py   # BaseExtractor abstract class and registry
    ├── python.py     # Python-specific extraction logic
    ├── javascript.py # JavaScript/TypeScript extraction
    ├── docker.py     # Docker/docker-compose extraction
    ├── sql.py        # SQL extraction
    └── nginx.py      # Nginx configuration extraction
```
Key features:
- Dynamic extractor registry for automatic language detection
- Batched database operations (200 records per batch by default; sketched below)
- AST caching for performance optimization
- Monorepo detection and intelligent path filtering
- Parallel JavaScript processing when semantic parser available
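For illustration, the batching idea looks roughly like this (`BATCH_SIZE` and the helper names are hypothetical, not the real DatabaseManager API; the `symbols` columns follow the schema diagram later in this document):

```python
import sqlite3

BATCH_SIZE = 200  # default batch size; configurable in the real indexer

def flush_symbols(conn: sqlite3.Connection, batch: list[tuple]) -> None:
    # One executemany() per batch keeps per-insert transaction overhead low.
    conn.executemany(
        "INSERT INTO symbols (path, name, type, line) VALUES (?, ?, ?, ?)",
        batch,
    )
    conn.commit()

def index_symbols(conn: sqlite3.Connection, symbols) -> None:
    batch: list[tuple] = []
    for sym in symbols:
        batch.append((sym["path"], sym["name"], sym["type"], sym["line"]))
        if len(batch) >= BATCH_SIZE:
            flush_symbols(conn, batch)
            batch.clear()
    if batch:  # flush the remainder
        flush_symbols(conn, batch)
```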
Pipeline System (theauditor/pipelines.py)
Orchestrates the comprehensive analysis pipeline in parallel stages:
Stage 1 - Foundation (Sequential):
- Repository indexing - Build manifest and symbol database
- Framework detection - Identify technologies in use
Stage 2 - Concurrent Analysis (3 Parallel Tracks):
- Track A (Network I/O):
  - Dependency checking
  - Documentation fetching
  - Documentation summarization
- Track B (Code Analysis):
  - Workset creation
  - Linting
  - Pattern detection
- Track C (Graph Build):
  - Graph building
Stage 3 - Final Aggregation (Sequential):
- Graph analysis
- Taint analysis
- Factual correlation engine
- Report generation
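A simplified sketch of how the three Stage 2 tracks can be dispatched with ThreadPoolExecutor (the track bodies are placeholders, not the actual pipelines.py code):

```python
from concurrent.futures import ThreadPoolExecutor

def track_a():
    print("deps -> doc fetch -> doc summary")  # network I/O, sequential inside

def track_b():
    print("workset -> lint -> patterns")       # code analysis, sequential inside

def track_c():
    print("graph build")

def run_stage_two() -> None:
    # The three tracks run concurrently; each is sequential internally.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(track) for track in (track_a, track_b, track_c)]
        for future in futures:
            future.result()  # re-raise any exception from a track
```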
Pattern Detection Engine
- 100+ YAML-defined security patterns in `theauditor/patterns/`
- AST-based matching for Python and JavaScript
- Supports semantic analysis via TypeScript compiler
Factual Correlation Engine (FCE) (theauditor/fce.py)
- 29 advanced correlation rules in `theauditor/correlations/rules/`
- Detects complex vulnerability patterns across multiple tools
- Categories: Authentication, Injection, Data Exposure, Infrastructure, Code Quality, Framework-Specific
Taint Analysis Module (theauditor/taint_analyzer.py)
A comprehensive taint analysis module that tracks data flow from sources to sinks:
- Tracks data flow from user inputs to dangerous outputs
- Detects SQL injection, XSS, command injection vulnerabilities
- Database-aware analysis using `repo_index.db`
- Supports both assignment-based and direct-use patterns
- Merges findings from multiple detection methods
Note: The optional severity scoring for taint analysis is provided by theauditor/insights/taint.py (Insights module)
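For intuition only, a toy sketch of a database-aware lookup (table and column names taken from the schema diagram later in this document; the real analyzer performs genuine flow tracking through assignments and calls, not the file-level co-location shown here):

```python
import sqlite3

# Hypothetical source/sink patterns for illustration.
SOURCES = ("request.args", "request.form")
SINKS = ("cursor.execute", "os.system")

def find_candidate_flows(db_path: str = ".pf/repo_index.db") -> list[str]:
    placeholders = ",".join("?" * (len(SOURCES) + len(SINKS)))
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        f"SELECT src, value FROM refs WHERE value IN ({placeholders})",
        SOURCES + SINKS,
    ).fetchall()
    conn.close()
    # A file referencing both a source and a sink is a candidate for the
    # full path-sensitive analysis the taint analyzer actually performs.
    by_file: dict[str, set[str]] = {}
    for src_file, value in rows:
        by_file.setdefault(src_file, set()).add(value)
    return [f for f, vals in by_file.items()
            if vals & set(SOURCES) and vals & set(SINKS)]
```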
Graph Analysis (theauditor/graph/)
- builder.py: Constructs dependency graph from codebase
- analyzer.py: Detects cycles, measures complexity, identifies hotspots
- Uses NetworkX for graph algorithms
Note: The optional health scoring and recommendations are provided by theauditor/insights/graph.py (Insights module)
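As a minimal illustration of NetworkX-based cycle detection (the function name and edge format here are illustrative, not analyzer.py's API):

```python
import networkx as nx

def find_import_cycles(edges: list[tuple[str, str]]) -> list[list[str]]:
    # edges are (importer, imported) pairs from the dependency graph
    graph = nx.DiGraph(edges)
    return list(nx.simple_cycles(graph))

# The A imports B, B imports C, C imports A example from earlier:
print(find_import_cycles([("a", "b"), ("b", "c"), ("c", "a")]))
```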
Framework Detection (theauditor/framework_detector.py)
- Auto-detects Django, Flask, React, Vue, Angular, etc.
- Applies framework-specific rules
- Influences pattern selection and analysis behavior
Configuration Parsers (theauditor/parsers/)
Specialized parsers for configuration file analysis:
- webpack_config_parser.py: Webpack configuration analysis
- compose_parser.py: Docker Compose file parsing
- nginx_parser.py: Nginx configuration parsing
- dockerfile_parser.py: Dockerfile security analysis
- prisma_schema_parser.py: Prisma ORM schema parsing
These parsers are used by extractors during indexing to extract security-relevant configuration data.
Refactoring Detection (theauditor/commands/refactor.py)
Detects incomplete refactorings and cross-stack inconsistencies:
- Analyzes database migrations to detect schema changes
- Uses impact analysis to trace affected files
- Applies correlation rules from `theauditor/correlations/rules/refactoring.yaml`
- Detects API contract mismatches, field migrations, foreign key changes
- Supports auto-detection from migration files or specific change analysis
System Architecture Diagrams
High-Level Data Flow
```mermaid
graph TB
subgraph "Input Layer"
CLI[CLI Commands]
Files[Project Files]
end
subgraph "Core Pipeline"
Index[Indexer]
Framework[Framework Detector]
Deps[Dependency Checker]
Patterns[Pattern Detection]
Taint[Taint Analysis]
Graph[Graph Builder]
FCE[Factual Correlation Engine]
end
subgraph "Storage"
DB[(SQLite DB)]
Raw[Raw Output]
Chunks[65KB Chunks]
end
CLI --> Index
Files --> Index
Index --> DB
Index --> Framework
Framework --> Deps
Deps --> Patterns
Patterns --> Graph
Graph --> Taint
Taint --> FCE
FCE --> Raw
Raw --> Chunks
```
Parallel Pipeline Execution
```mermaid
graph LR
subgraph "Stage 1 - Sequential"
S1[Index] --> S2[Framework Detection]
end
subgraph "Stage 2 - Parallel"
direction TB
subgraph "Track A - Network I/O"
A1[Deps Check]
A2[Doc Fetch]
A3[Doc Summary]
A1 --> A2 --> A3
end
subgraph "Track B - Code Analysis"
B1[Workset]
B2[Linting]
B3[Patterns]
B1 --> B2 --> B3
end
subgraph "Track C - Graph"
C1[Graph Build]
end
end
subgraph "Stage 3 - Sequential"
E1[Graph Analysis] --> E2[Taint] --> E3[FCE] --> E4[Report]
end
S2 --> A1
S2 --> B1
S2 --> C1
A3 --> E1
B3 --> E1
C1 --> E1
```
Data Chunking System
The extraction system (theauditor/extraction.py) implements pure courier model chunking:
```mermaid
graph TD
subgraph "Analysis Results"
P[Patterns.json]
T[Taint.json<br/>Multiple lists merged]
L[Lint.json]
F[FCE.json]
end
subgraph "Extraction Process"
E[Extraction Engine<br/>Budget: 1.5MB]
M[Merge Logic<br/>For taint_paths +<br/>rule_findings]
C1[Chunk 1<br/>0-65KB]
C2[Chunk 2<br/>65-130KB]
C3[Chunk 3<br/>130-195KB]
TR[Truncation<br/>Flag]
end
subgraph "Output"
R1[patterns_chunk01.json]
R2[patterns_chunk02.json]
R3[patterns_chunk03.json]
end
P --> E
T --> M --> E
L --> E
F --> E
E --> C1 --> R1
E --> C2 --> R2
E --> C3 --> R3
E -.->|If >195KB| TR
TR -.-> R3
```
Key features:
- Budget system: 1.5MB total budget for all chunks
- Smart merging: Taint analysis merges multiple finding lists (taint_paths, rule_findings, infrastructure)
- Preservation: All findings preserved, no filtering or sampling
- Chunking: Only chunks files >65KB, copies smaller files as-is
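A deliberately naive sketch of the size-based chunking rule (the real extraction engine merges structured findings and records a truncation flag rather than slicing raw bytes):

```python
MAX_CHUNK_SIZE = 65_000  # bytes per chunk (THEAUDITOR_LIMITS_MAX_CHUNK_SIZE)
MAX_CHUNKS = 3           # per file (THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE)

def chunk_output(name: str, data: bytes) -> list[tuple[str, bytes]]:
    if len(data) <= MAX_CHUNK_SIZE:
        return [(f"{name}.json", data)]  # small files are copied as-is
    chunks = []
    for i in range(MAX_CHUNKS):
        piece = data[i * MAX_CHUNK_SIZE:(i + 1) * MAX_CHUNK_SIZE]
        if not piece:
            break
        chunks.append((f"{name}_chunk{i + 1:02d}.json", piece))
    if len(data) > MAX_CHUNKS * MAX_CHUNK_SIZE:
        # The real engine sets a truncation flag instead of dropping silently.
        print(f"[truncated] {name}: {len(data)} bytes exceeds chunk budget")
    return chunks
```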
Dual Environment Architecture
```mermaid
graph TB
subgraph "Development Environment"
V1[.venv/]
PY[Python 3.11+]
AU[TheAuditor Code]
V1 --> PY --> AU
end
subgraph "Sandboxed Analysis Environment"
V2[.auditor_venv/.theauditor_tools/]
NODE[Bundled Node.js v20.11.1]
TS[TypeScript Compiler]
ES[ESLint]
PR[Prettier]
NM[node_modules/]
V2 --> NODE
NODE --> TS
NODE --> ES
NODE --> PR
NODE --> NM
end
AU -->|Analyzes using| V2
AU -.->|Never uses| V1
```
TheAuditor maintains strict separation between:
- Primary Environment (`.venv/`): TheAuditor's Python code and dependencies
- Sandboxed Environment (`.auditor_venv/.theauditor_tools/`): Isolated JS/TS analysis tools
This ensures reproducibility and prevents TheAuditor from analyzing its own analysis tools.
Database Schema
```mermaid
erDiagram
files ||--o{ symbols : contains
files ||--o{ refs : contains
files ||--o{ api_endpoints : contains
files ||--o{ sql_queries : contains
files ||--o{ docker_images : contains
files {
string path PK
string language
int size
string hash
json metadata
}
symbols {
string path FK
string name
string type
int line
json metadata
}
refs {
string src FK
string value
string kind
int line
}
api_endpoints {
string file FK
string method
string path
int line
}
sql_queries {
string file_path FK
string command
string query
int line_number
}
docker_images {
string file_path FK
string base_image
json env_vars
json build_args
}
```
Command Flow Sequence
```mermaid
sequenceDiagram
participant User
participant CLI
participant Pipeline
participant Analyzers
participant Database
participant Output
User->>CLI: aud full
CLI->>Pipeline: Execute pipeline
Pipeline->>Database: Initialize schema
Pipeline->>Analyzers: Index files
Analyzers->>Database: Store file metadata
par Parallel Execution
Pipeline->>Analyzers: Dependency check
and
Pipeline->>Analyzers: Pattern detection
and
Pipeline->>Analyzers: Graph building
end
Pipeline->>Analyzers: Taint analysis
Analyzers->>Database: Query symbols & refs
Pipeline->>Analyzers: FCE correlation
Analyzers->>Output: Generate reports
Pipeline->>Output: Create chunks
Output->>User: .pf/readthis/
```
Output Structure
All results are organized in the .pf/ directory:
```
.pf/
├── raw/ # Immutable tool outputs (ground truth)
│ ├── eslint.json
│ ├── ruff.json
│ └── ...
├── readthis/ # AI-optimized chunks (<65KB each, max 3 chunks per file)
│ ├── manifest.md # Repository overview
│ ├── patterns_*.md # Security findings
│ ├── taint_*.md # Data-flow issues
│ └── tickets_*.md # Actionable tasks
├── repo_index.db # SQLite database of code symbols
├── pipeline.log # Execution trace
└── findings.json # Consolidated results
```
Key Output Files
- manifest.md: Complete file inventory with SHA-256 hashes
- patterns_*.md: Chunked security findings from 100+ detection rules
- tickets_*.md: Prioritized, actionable issues with evidence
- repo_index.db: Queryable database of all code symbols and relationships
Operating Modes
TheAuditor operates in two distinct modes:
Courier Mode (External Tools)
- Preserves exact outputs from ESLint, Ruff, MyPy, etc.
- No interpretation or filtering
- Complete audit trail from source to finding
Expert Mode (Internal Engines)
- Taint Analysis: Tracks untrusted data through the application
- Pattern Detection: YAML-based rules with AST matching
- Graph Analysis: Architectural insights and dependency tracking
- Secret Detection: Identifies hardcoded credentials and API keys
CLI Entry Points
- Main CLI: `theauditor/cli.py` - Central command router
- Command modules: `theauditor/commands/` - One module per command
- Utilities: `theauditor/utils/` - Shared functionality
- Configuration: `theauditor/config_runtime.py` - Runtime configuration
Each command module follows a standardized structure with:
- `@click.command()` decorator
- `@handle_exceptions` decorator for error handling
- Consistent logging and output formatting
Performance Optimizations
- Batched database operations: 200 records per batch (configurable)
- Parallel rule execution: ThreadPoolExecutor with 4 workers
- AST caching: Persistent cache for parsed AST trees (sketched below)
- Incremental analysis: Workset-based analysis for changed files only
- Lazy loading: Patterns and rules loaded on-demand
- Memory-efficient chunking: Stream large files instead of loading entirely
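For illustration, a content-hash AST cache along these lines (the cache location and pickle format are assumptions, not the actual ASTCache implementation):

```python
import ast
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".pf/ast_cache")  # hypothetical location

def parse_with_cache(source_path: Path) -> ast.AST:
    source = source_path.read_bytes()
    key = hashlib.sha256(source).hexdigest()  # unchanged file -> same key
    cached = CACHE_DIR / f"{key}.pkl"
    if cached.exists():
        return pickle.loads(cached.read_bytes())  # cache hit: skip re-parsing
    tree = ast.parse(source)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(pickle.dumps(tree))
    return tree
```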
Configuration System
TheAuditor supports runtime configuration via multiple sources (priority order):
1. Environment variables (`THEAUDITOR_*` prefix)
2. `.pf/config.json` file (project-specific)
3. Built-in defaults in `config_runtime.py`
Example configuration:
```bash
export THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE=5   # Default: 3
export THEAUDITOR_LIMITS_MAX_CHUNK_SIZE=100000   # Default: 65000
export THEAUDITOR_LIMITS_MAX_FILE_SIZE=5242880   # Default: 2097152
export THEAUDITOR_TIMEOUTS_LINT_TIMEOUT=600      # Default: 300
```
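A toy sketch of the resolution order (key naming and value types here are assumptions inferred from the variables above; `config_runtime.py` is the authority):

```python
import json
import os
from pathlib import Path

DEFAULTS = {"limits.max_chunk_size": 65_000, "limits.max_chunks_per_file": 3}

def get_setting(key: str):
    """Resolve a setting using the three-level priority order above."""
    env_name = "THEAUDITOR_" + key.upper().replace(".", "_")
    if env_name in os.environ:                  # 1. environment variable
        return int(os.environ[env_name])
    config = Path(".pf/config.json")
    if config.exists():                         # 2. project config file
        data = json.loads(config.read_text())
        if key in data:
            return data[key]
    return DEFAULTS[key]                        # 3. built-in default

print(get_setting("limits.max_chunk_size"))
```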
Advanced Features
Database-Aware Rules
Specialized analyzers query repo_index.db to detect:
- ORM anti-patterns (N+1 queries, missing transactions)
- Docker security misconfigurations
- Nginx configuration issues
- Multi-file correlation patterns
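As a toy example of the style (not an actual shipped rule), a query flagging unpinned Docker base images via the `docker_images` table from the schema diagram above:

```python
import sqlite3

def find_latest_tag_images(db_path: str = ".pf/repo_index.db") -> list[dict]:
    """Illustrative database-aware rule: flag unpinned Docker base images."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT file_path, base_image FROM docker_images "
        "WHERE base_image LIKE '%:latest' OR base_image NOT LIKE '%:%'"
    ).fetchall()
    conn.close()
    return [
        {"file": path, "finding": f"unpinned base image {image}"}
        for path, image in rows
    ]
```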
Holistic Analysis
Project-level analyzers that operate across the entire codebase:
- Bundle Analyzer: Correlates package.json, lock files, and imports
- Source Map Detector: Scans build directories for exposed maps
- Framework Detectors: Identify technology stack automatically
Incremental Analysis
Workset-based analysis for efficient processing:
- Git diff integration for changed file detection
- Dependency tracking for impact analysis
- Cached results for unchanged files
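A minimal sketch of git-based changed-file detection (the real workset logic also expands the set through dependency and impact tracking):

```python
import subprocess

def changed_files(base: str = "HEAD~1") -> list[str]:
    # Ask git which files differ from the base revision.
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]
```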
Contributing to TheAuditor
Adding Language Support
TheAuditor's modular architecture makes it straightforward to add new language support:
1. Create an Extractor
Create a new extractor in theauditor/indexer/extractors/{language}.py:
```python
from typing import Any, Dict, List

from . import BaseExtractor


class {Language}Extractor(BaseExtractor):
    def supported_extensions(self) -> List[str]:
        return ['.ext', '.ext2']

    def extract(self, file_info, content, tree=None) -> Dict[str, Any]:
        # Extract symbols, imports, routes, etc.
        return {
            'imports': [],
            'routes': [],
            'symbols': [],
            # ... other extracted data
        }
```
The extractor will be automatically registered via the BaseExtractor inheritance.
2. Create Configuration Parser (Optional)
For configuration files, create a parser in theauditor/parsers/{language}_parser.py:
```python
from pathlib import Path
from typing import Any, Dict


class {Language}Parser:
    def parse_file(self, file_path: Path) -> Dict[str, Any]:
        # Parse the configuration file into a structured dict
        parsed_data: Dict[str, Any] = {}
        return parsed_data
```
3. Add Security Patterns
Create YAML patterns in theauditor/patterns/{language}.yml:
```yaml
- name: hardcoded-secret-{language}
  pattern: 'api_key\s*=\s*["\'][^"\']+["\']'
  severity: critical
  category: security
  languages: ["{language}"]
  description: "Hardcoded API key in {Language} code"
```
4. Add Framework Detection
Update theauditor/framework_detector.py to detect {Language} frameworks.
Adding New Analyzers
Database-Aware Rules
Create analyzers that query repo_index.db in theauditor/rules/{category}/:
```python
import sqlite3
from typing import Any, Dict, List

def find_{issue}_patterns(db_path: str) -> List[Dict[str, Any]]:
    conn = sqlite3.connect(db_path)
    # Query repo_index.db and analyze the results
    findings: List[Dict[str, Any]] = []
    conn.close()
    return findings
```
AST-Based Rules
For semantic analysis, create rules in theauditor/rules/{framework}/:
```python
from typing import Any, Dict, List

def find_{framework}_issues(tree, file_path) -> List[Dict[str, Any]]:
    # Traverse the AST and detect issues
    findings: List[Dict[str, Any]] = []
    return findings
```
Pattern-Based Rules
Add YAML patterns to theauditor/patterns/ for regex-based detection.
Architecture Guidelines
- Maintain Truth Courier vs Insights separation - Core modules report facts, insights add interpretation
- Use the extractor registry - Inherit from `BaseExtractor` for automatic registration
- Follow existing patterns - Look at the `python.py` or `javascript.py` extractors as examples
- Write comprehensive tests - Test extractors, parsers, and patterns
- Document your additions - Update this file and CONTRIBUTING.md
For detailed contribution guidelines, see CONTRIBUTING.md.