Auditor/ARCHITECTURE.md
TheAuditorTool c7a59e420b Fix: Critical Windows ProcessPoolExecutor hang and documentation drift
Fixed critical Windows compatibility issues and updated outdated documentation.

  CRITICAL WINDOWS HANG FIXES:
  1. ProcessPoolExecutor → ThreadPoolExecutor
     - Fixes PowerShell/terminal hang where Ctrl+C wouldn't work
     - Prevents .pf directory lock requiring Task Manager kill
     - Root cause: Nested ProcessPool + ThreadPool on Windows creates kernel deadlock

  2. Ctrl+C Interruption Support
     - Replaced subprocess.run with Popen+poll pattern (industry standard)
     - Poll subprocess every 100ms for interruption checking
     - Added global stop_event and signal handlers for graceful shutdown
     - Root cause: subprocess.run blocks threads with no signal propagation

  DOCUMENTATION DRIFT FIX:
  - Removed hardcoded "14 phases" references (actual is 19+ commands)
  - Updated to "multiple analysis phases" throughout all docs
  - Fixed CLI help text to be version-agnostic
  - Added missing "Summary generation" step in HOWTOUSE.md

  Changes:
  - pipelines.py: ProcessPoolExecutor → ThreadPoolExecutor, added Popen+poll pattern
  - Added signal handling and run_subprocess_with_interrupt() function
  - commands/full.py: Updated docstring to remove specific phase count
  - README.md: Changed "14 distinct phases" to "multiple analysis phases"
  - HOWTOUSE.md: Updated phase references, added missing summary step
  - CLAUDE.md & ARCHITECTURE.md: Removed hardcoded phase counts

  Impact: Critical UX fixes - Windows compatibility restored, pipeline interruptible
  Testing: Ctrl+C works, no PowerShell hangs, .pf directory deletable
2025-09-09 14:26:18 +07:00

# TheAuditor Architecture
This document provides a comprehensive technical overview of TheAuditor's architecture, design patterns, and implementation details.
## System Overview
TheAuditor is an offline-first, AI-centric SAST (Static Application Security Testing) and code intelligence platform. It orchestrates industry-standard tools to provide ground truth about code quality and security, producing AI-consumable reports optimized for LLM context windows.
### Core Design Principles
1. **Offline-First Operation** - All analysis runs without network access, ensuring data privacy and reproducible results
2. **Dual-Mode Architecture** - Courier Mode preserves raw external tool outputs; Expert Mode applies security expertise objectively
3. **AI-Centric Workflow** - Produces chunks optimized for LLM context windows (65KB by default)
4. **Sandboxed Execution** - Isolated analysis environment prevents cross-contamination
5. **No Fix Generation** - Reports findings without prescribing solutions
## Truth Courier vs Insights: Separation of Concerns
TheAuditor maintains a strict architectural separation between **factual observation** and **optional interpretation**:
### Truth Courier Modules (Core)
These modules are the foundation - they gather and report verifiable facts without judgment:
- **Indexer**: Reports "Function X exists at line Y with Z parameters"
- **Taint Analyzer**: Reports "Data flows from pattern A to pattern B through path C"
- **Impact Analyzer**: Reports "Changing function X affects Y files through Z call chains"
- **Graph Analyzer**: Reports "Module A imports B, B imports C, C imports A (cycle detected)"
- **Pattern Detector**: Reports "Line X matches pattern Y from rule Z"
- **Linters**: Reports "Tool ESLint flagged line X with rule Y"
These modules form the immutable ground truth. They report **what exists**, not what it means.
### Insights Modules (Optional Interpretation Layer)
These are **optional packages** that consume Truth Courier data to add scoring and classification. All insights modules have been consolidated into a single package for better organization:
```
theauditor/insights/
├── __init__.py # Package exports
├── ml.py # Machine learning predictions (requires pip install -e ".[ml]")
├── graph.py # Graph health scoring and recommendations
└── taint.py # Vulnerability severity classification
```
- **insights/taint.py**: Adds "This flow is XSS with HIGH severity"
- **insights/graph.py**: Adds "Health score: 70/100, Grade: B"
- **insights/ml.py** (requires `pip install -e ".[ml]"`): Adds "80% probability of bugs based on historical patterns"
**Important**: Insights modules are:
- Not installed by default (ML requires explicit opt-in)
- Completely decoupled from core analysis
- Still based on technical patterns, not business logic interpretation
- Designed for teams that want actionable scores alongside raw facts
- All consolidated in the `theauditor/insights/` package for consistency
### The FCE: Factual Correlation Engine
The FCE correlates facts from multiple tools without interpreting them:
- Reports: "Tool A and Tool B both flagged line 100"
- Reports: "Pattern X and Pattern Y co-occur in file Z"
- Never says: "This is bad" or "Fix this way"
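As a rough illustration of the first kind of report, the sketch below groups findings by file and line and keeps only locations flagged by at least two tools. The finding dictionaries and the `correlate_by_location` helper are hypothetical and are not taken from `theauditor/fce.py`.

```python
from collections import defaultdict


def correlate_by_location(findings):
    """Group findings by (file, line) and keep locations flagged by 2+ tools."""
    by_location = defaultdict(set)
    for finding in findings:
        by_location[(finding["file"], finding["line"])].add(finding["tool"])
    # Report only the fact of agreement; no severity or advice is attached.
    return [
        {"file": file, "line": line, "tools": sorted(tools)}
        for (file, line), tools in sorted(by_location.items())
        if len(tools) >= 2
    ]


# Hypothetical findings from three tools
findings = [
    {"tool": "eslint", "file": "app.js", "line": 100},
    {"tool": "patterns", "file": "app.js", "line": 100},
    {"tool": "ruff", "file": "api.py", "line": 7},
]
correlated = correlate_by_location(findings)
print(correlated)
```

The output states only which tools agree on a location, in keeping with the "never says" rule above.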
## Core Components
### Indexer Package (`theauditor/indexer/`)
The indexer has been refactored from a monolithic 2000+ line file into a modular package structure:
```
theauditor/indexer/
├── __init__.py # Package initialization and backward compatibility
├── config.py # Constants, patterns, and configuration
├── database.py # DatabaseManager class for all DB operations
├── core.py # FileWalker and ASTCache classes
├── orchestrator.py # IndexOrchestrator - main coordination logic
└── extractors/
    ├── __init__.py # BaseExtractor abstract class and registry
    ├── python.py # Python-specific extraction logic
    ├── javascript.py # JavaScript/TypeScript extraction
    ├── docker.py # Docker/docker-compose extraction
    ├── sql.py # SQL extraction
    └── nginx.py # Nginx configuration extraction
```
Key features:
- **Dynamic extractor registry** for automatic language detection
- **Batched database operations** (200 records per batch by default)
- **AST caching** for performance optimization
- **Monorepo detection** and intelligent path filtering
- **Parallel JavaScript processing** when semantic parser available
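A minimal sketch of batched inserts follows, assuming a simplified `symbols` table. The helper names and column list are illustrative and do not reflect the actual `DatabaseManager` API.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 200  # the default batch size stated above


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_symbols(conn, rows, batch_size=BATCH_SIZE):
    """Insert rows in batches, committing once per batch; return the batch count."""
    batches = 0
    for batch in batched(rows, batch_size):
        conn.executemany(
            "INSERT INTO symbols (path, name, type, line) VALUES (?, ?, ?, ?)",
            batch,
        )
        conn.commit()
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (path TEXT, name TEXT, type TEXT, line INTEGER)")
rows = [("app.py", f"func_{i}", "function", i) for i in range(450)]
batch_count = insert_symbols(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
```

Committing per batch rather than per row keeps transaction overhead low without holding every row in one large transaction.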
### Pipeline System (`theauditor/pipelines.py`)
Orchestrates the full analysis pipeline in three stages; the middle stage runs its tracks in parallel:
**Stage 1 - Foundation (Sequential):**
1. Repository indexing - Build manifest and symbol database
2. Framework detection - Identify technologies in use
**Stage 2 - Concurrent Analysis (3 Parallel Tracks):**
- **Track A (Network I/O):**
  - Dependency checking
  - Documentation fetching
  - Documentation summarization
- **Track B (Code Analysis):**
  - Workset creation
  - Linting
  - Pattern detection
- **Track C (Graph Build):**
  - Graph building
**Stage 3 - Final Aggregation (Sequential):**
- Graph analysis
- Taint analysis
- Factual correlation engine
- Report generation
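The staging above can be sketched with a thread pool: sequential foundation steps, three tracks run concurrently, then sequential aggregation. The step names and the `run_track` and `run_pipeline` helpers are placeholders for illustration, not the real functions in `pipelines.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_track(name, steps, log):
    """Run one track's steps in order; steps within a track stay sequential."""
    for step in steps:
        log.append((name, step))  # list.append is thread-safe in CPython
    return name


def run_pipeline():
    log = []
    # Stage 1: foundation, strictly sequential
    for step in ("index", "detect-frameworks"):
        log.append(("stage1", step))
    # Stage 2: three independent tracks run concurrently
    tracks = {
        "A": ["deps", "docs-fetch", "docs-summary"],
        "B": ["workset", "lint", "patterns"],
        "C": ["graph-build"],
    }
    with ThreadPoolExecutor(max_workers=len(tracks)) as pool:
        futures = [pool.submit(run_track, name, steps, log) for name, steps in tracks.items()]
        for future in futures:
            future.result()  # re-raise any exception from a track
    # Stage 3: aggregation starts only after every track has finished
    for step in ("graph-analysis", "taint", "fce", "report"):
        log.append(("stage3", step))
    return log


log = run_pipeline()
```

Leaving the `with` block waits for all tracks, which is what guarantees Stage 3 sees complete Stage 2 output.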
### Pattern Detection Engine
- 100+ YAML-defined security patterns in `theauditor/patterns/`
- AST-based matching for Python and JavaScript
- Supports semantic analysis via TypeScript compiler
### Factual Correlation Engine (FCE) (`theauditor/fce.py`)
- **29 advanced correlation rules** in `theauditor/correlations/rules/`
- Detects complex vulnerability patterns across multiple tools
- Categories: Authentication, Injection, Data Exposure, Infrastructure, Code Quality, Framework-Specific
### Taint Analysis Package (`theauditor/taint_analyzer.py`)
A taint analysis module that tracks data flow from sources (user inputs) to sinks (dangerous outputs):
- Detects SQL injection, XSS, command injection vulnerabilities
- Database-aware analysis using `repo_index.db`
- Supports both assignment-based and direct-use patterns
- Merges findings from multiple detection methods
**Note**: The optional severity scoring for taint analysis is provided by `theauditor/insights/taint.py` (Insights module)
### Graph Analysis (`theauditor/graph/`)
- **builder.py**: Constructs dependency graph from codebase
- **analyzer.py**: Detects cycles, measures complexity, identifies hotspots
- Uses NetworkX for graph algorithms
**Note**: The optional health scoring and recommendations are provided by `theauditor/insights/graph.py` (Insights module)
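To illustrate the cycle detection mentioned above, here is a dependency-free sketch using a depth-first search. It is not the NetworkX-based code in `analyzer.py`; the `find_cycle` helper and the sample import graph are assumptions.

```python
def find_cycle(graph):
    """Return one import cycle as a list of nodes, or None if the graph is acyclic.

    `graph` maps each module to the modules it imports.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path closes a cycle
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


imports = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
cycle = find_cycle(imports)
print(cycle)
```

The recursive search is fine for a sketch, but a very deep import chain could hit Python's recursion limit, which is one reason to prefer a graph library in practice.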
### Framework Detection (`theauditor/framework_detector.py`)
- Auto-detects Django, Flask, React, Vue, Angular, etc.
- Applies framework-specific rules
- Influences pattern selection and analysis behavior
### Configuration Parsers (`theauditor/parsers/`)
Specialized parsers for configuration file analysis:
- **webpack_config_parser.py**: Webpack configuration analysis
- **compose_parser.py**: Docker Compose file parsing
- **nginx_parser.py**: Nginx configuration parsing
- **dockerfile_parser.py**: Dockerfile security analysis
- **prisma_schema_parser.py**: Prisma ORM schema parsing
These parsers are used by extractors during indexing to extract security-relevant configuration data.
### Refactoring Detection (`theauditor/commands/refactor.py`)
Detects incomplete refactorings and cross-stack inconsistencies:
- Analyzes database migrations to detect schema changes
- Uses impact analysis to trace affected files
- Applies correlation rules from `theauditor/correlations/rules/refactoring.yaml`
- Detects API contract mismatches, field migrations, foreign key changes
- Supports auto-detection from migration files or specific change analysis
## System Architecture Diagrams
### High-Level Data Flow
```mermaid
graph TB
subgraph "Input Layer"
CLI[CLI Commands]
Files[Project Files]
end
subgraph "Core Pipeline"
Index[Indexer]
Framework[Framework Detector]
Deps[Dependency Checker]
Patterns[Pattern Detection]
Taint[Taint Analysis]
Graph[Graph Builder]
FCE[Factual Correlation Engine]
end
subgraph "Storage"
DB[(SQLite DB)]
Raw[Raw Output]
Chunks[65KB Chunks]
end
CLI --> Index
Files --> Index
Index --> DB
Index --> Framework
Framework --> Deps
Deps --> Patterns
Patterns --> Graph
Graph --> Taint
Taint --> FCE
FCE --> Raw
Raw --> Chunks
```
### Parallel Pipeline Execution
```mermaid
graph LR
subgraph "Stage 1 - Sequential"
S1[Index] --> S2[Framework Detection]
end
subgraph "Stage 2 - Parallel"
direction TB
subgraph "Track A - Network I/O"
A1[Deps Check]
A2[Doc Fetch]
A3[Doc Summary]
A1 --> A2 --> A3
end
subgraph "Track B - Code Analysis"
B1[Workset]
B2[Linting]
B3[Patterns]
B1 --> B2 --> B3
end
subgraph "Track C - Graph"
C1[Graph Build]
end
end
subgraph "Stage 3 - Sequential"
E1[Graph Analysis] --> E2[Taint] --> E3[FCE] --> E4[Report]
end
S2 --> A1
S2 --> B1
S2 --> C1
A3 --> E1
B3 --> E1
C1 --> E1
```
### Data Chunking System
The extraction system (`theauditor/extraction.py`) chunks output under the pure courier model, splitting files without filtering or interpreting their contents:
```mermaid
graph TD
subgraph "Analysis Results"
P[Patterns.json]
T[Taint.json<br/>Multiple lists merged]
L[Lint.json]
F[FCE.json]
end
subgraph "Extraction Process"
E[Extraction Engine<br/>Budget: 1.5MB]
M[Merge Logic<br/>For taint_paths +<br/>rule_findings]
C1[Chunk 1<br/>0-65KB]
C2[Chunk 2<br/>65-130KB]
C3[Chunk 3<br/>130-195KB]
TR[Truncation<br/>Flag]
end
subgraph "Output"
R1[patterns_chunk01.json]
R2[patterns_chunk02.json]
R3[patterns_chunk03.json]
end
P --> E
T --> M --> E
L --> E
F --> E
E --> C1 --> R1
E --> C2 --> R2
E --> C3 --> R3
E -.->|If >195KB| TR
TR -.-> R3
```
Key features:
- **Budget system**: 1.5MB total budget for all chunks
- **Smart merging**: Taint analysis merges multiple finding lists (taint_paths, rule_findings, infrastructure)
- **Preservation**: All findings preserved, no filtering or sampling
- **Chunking**: Only chunks files >65KB, copies smaller files as-is
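The chunk limits in the diagram can be sketched as a size-bounded split over a list of findings. This is an illustration under stated assumptions, not the logic in `extraction.py`: the `chunk_findings` helper is hypothetical, sizes are measured on the JSON encoding, and findings that do not fit in the allowed chunks are dropped with a truncation flag.

```python
import json


def chunk_findings(findings, max_chunk_size=65_000, max_chunks=3):
    """Split findings into JSON-size-bounded chunks; return (chunks, truncated)."""
    chunks, current = [], []
    current_size = 2  # bytes for the enclosing "[]"
    truncated = False
    for finding in findings:
        # +2 accounts for the ", " separator json.dumps puts between items
        size = len(json.dumps(finding).encode("utf-8")) + 2
        if current and current_size + size > max_chunk_size:
            chunks.append(current)
            current, current_size = [], 2
            if len(chunks) == max_chunks:
                truncated = True  # remaining findings do not fit
                break
        current.append(finding)
        current_size += size
    if current:
        chunks.append(current)
    return chunks, truncated


findings = [{"id": i, "msg": "x" * 100} for i in range(30)]
chunks, truncated = chunk_findings(findings, max_chunk_size=1000, max_chunks=3)
small_chunks, small_truncated = chunk_findings(findings[:5], max_chunk_size=1000)
```

A single finding larger than `max_chunk_size` is still placed alone in its own chunk in this sketch, so that chunk can exceed the limit.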
### Dual Environment Architecture
```mermaid
graph TB
subgraph "Development Environment"
V1[.venv/]
PY[Python 3.11+]
AU[TheAuditor Code]
V1 --> PY --> AU
end
subgraph "Sandboxed Analysis Environment"
V2[.auditor_venv/.theauditor_tools/]
NODE[Bundled Node.js v20.11.1]
TS[TypeScript Compiler]
ES[ESLint]
PR[Prettier]
NM[node_modules/]
V2 --> NODE
NODE --> TS
NODE --> ES
NODE --> PR
NODE --> NM
end
AU -->|Analyzes using| V2
AU -.->|Never uses| V1
```
TheAuditor maintains strict separation between:
1. **Primary Environment** (`.venv/`): TheAuditor's Python code and dependencies
2. **Sandboxed Environment** (`.auditor_venv/.theauditor_tools/`): Isolated JS/TS analysis tools
This ensures reproducibility and prevents TheAuditor from analyzing its own analysis tools.
## Database Schema
```mermaid
erDiagram
files ||--o{ symbols : contains
files ||--o{ refs : contains
files ||--o{ api_endpoints : contains
files ||--o{ sql_queries : contains
files ||--o{ docker_images : contains
files {
string path PK
string language
int size
string hash
json metadata
}
symbols {
string path FK
string name
string type
int line
json metadata
}
refs {
string src FK
string value
string kind
int line
}
api_endpoints {
string file FK
string method
string path
int line
}
sql_queries {
string file_path FK
string command
string query
int line_number
}
docker_images {
string file_path FK
string base_image
json env_vars
json build_args
}
```
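As a hedged example of querying this schema, the snippet below builds two of the tables in memory and joins them. Table and column names follow the diagram; the SQL column types, the sample rows, and the `functions_in_language` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; the diagram only gives logical types.
conn.executescript(
    """
    CREATE TABLE files (
        path TEXT PRIMARY KEY, language TEXT, size INTEGER, hash TEXT, metadata TEXT
    );
    CREATE TABLE symbols (
        path TEXT REFERENCES files(path), name TEXT, type TEXT, line INTEGER, metadata TEXT
    );
    """
)
conn.execute(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
    ("app/views.py", "python", 1200, "abc123", "{}"),
)
conn.executemany(
    "INSERT INTO symbols VALUES (?, ?, ?, ?, ?)",
    [
        ("app/views.py", "login", "function", 10, "{}"),
        ("app/views.py", "LoginForm", "class", 42, "{}"),
    ],
)


def functions_in_language(conn, language):
    """List (path, name, line) for every function symbol in files of one language."""
    cursor = conn.execute(
        "SELECT s.path, s.name, s.line "
        "FROM symbols AS s JOIN files AS f ON f.path = s.path "
        "WHERE f.language = ? AND s.type = 'function' "
        "ORDER BY s.path, s.line",
        (language,),
    )
    return cursor.fetchall()


result = functions_in_language(conn, "python")
print(result)
```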
## Command Flow Sequence
```mermaid
sequenceDiagram
participant User
participant CLI
participant Pipeline
participant Analyzers
participant Database
participant Output
User->>CLI: aud full
CLI->>Pipeline: Execute pipeline
Pipeline->>Database: Initialize schema
Pipeline->>Analyzers: Index files
Analyzers->>Database: Store file metadata
par Parallel Execution
Pipeline->>Analyzers: Dependency check
and
Pipeline->>Analyzers: Pattern detection
and
Pipeline->>Analyzers: Graph building
end
Pipeline->>Analyzers: Taint analysis
Analyzers->>Database: Query symbols & refs
Pipeline->>Analyzers: FCE correlation
Analyzers->>Output: Generate reports
Pipeline->>Output: Create chunks
Output->>User: .pf/readthis/
```
## Output Structure
All results are organized in the `.pf/` directory:
```
.pf/
├── raw/ # Immutable tool outputs (ground truth)
│   ├── eslint.json
│   ├── ruff.json
│   └── ...
├── readthis/ # AI-optimized chunks (<65KB each, max 3 chunks per file)
│   ├── manifest.md # Repository overview
│   ├── patterns_*.md # Security findings
│   ├── taint_*.md # Data-flow issues
│   └── tickets_*.md # Actionable tasks
├── repo_index.db # SQLite database of code symbols
├── pipeline.log # Execution trace
└── findings.json # Consolidated results
```
### Key Output Files
- **manifest.md**: Complete file inventory with SHA-256 hashes
- **patterns_*.md**: Chunked security findings from 100+ detection rules
- **tickets_*.md**: Prioritized, actionable issues with evidence
- **repo_index.db**: Queryable database of all code symbols and relationships
## Operating Modes
TheAuditor operates in two distinct modes:
### Courier Mode (External Tools)
- Preserves exact outputs from ESLint, Ruff, MyPy, etc.
- No interpretation or filtering
- Complete audit trail from source to finding
### Expert Mode (Internal Engines)
- **Taint Analysis**: Tracks untrusted data through the application
- **Pattern Detection**: YAML-based rules with AST matching
- **Graph Analysis**: Architectural insights and dependency tracking
- **Secret Detection**: Identifies hardcoded credentials and API keys
## CLI Entry Points
- **Main CLI**: `theauditor/cli.py` - Central command router
- **Command modules**: `theauditor/commands/` - One module per command
- **Utilities**: `theauditor/utils/` - Shared functionality
- **Configuration**: `theauditor/config_runtime.py` - Runtime configuration
Each command module follows a standardized structure with:
- `@click.command()` decorator
- `@handle_exceptions` decorator for error handling
- Consistent logging and output formatting
## Performance Optimizations
- **Batched database operations**: 200 records per batch (configurable)
- **Parallel rule execution**: ThreadPoolExecutor with 4 workers
- **AST caching**: Persistent cache for parsed ASTs
- **Incremental analysis**: Workset-based analysis for changed files only
- **Lazy loading**: Patterns and rules loaded on-demand
- **Memory-efficient chunking**: Stream large files instead of loading entirely
## Configuration System
TheAuditor reads runtime configuration from several sources, listed here from highest to lowest priority:
1. **Environment variables** (`THEAUDITOR_*` prefix)
2. **`.pf/config.json`** file (project-specific)
3. **Built-in defaults** in `config_runtime.py`
Example configuration:
```bash
export THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE=5 # Default: 3
export THEAUDITOR_LIMITS_MAX_CHUNK_SIZE=100000 # Default: 65000
export THEAUDITOR_LIMITS_MAX_FILE_SIZE=5242880 # Default: 2097152
export THEAUDITOR_TIMEOUTS_LINT_TIMEOUT=600 # Default: 300
```
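The precedence could be resolved along these lines. This is a sketch only: the nested layout of `.pf/config.json`, the integer-only values, and the `load_file_config` and `resolve_setting` helpers are assumptions, not the actual `config_runtime.py` API.

```python
import json
import os
from pathlib import Path

# Built-in defaults, taken from the "Default:" comments in the example above
DEFAULTS = {
    "limits": {"max_chunks_per_file": 3, "max_chunk_size": 65000, "max_file_size": 2097152},
    "timeouts": {"lint_timeout": 300},
}


def load_file_config(path=Path(".pf/config.json")):
    """Read the project config if present; the nested-section layout is assumed."""
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def resolve_setting(section, key, file_config=None, environ=None):
    """Resolve one setting: environment variable, then config file, then default."""
    environ = os.environ if environ is None else environ
    file_config = {} if file_config is None else file_config
    env_name = f"THEAUDITOR_{section}_{key}".upper()
    if env_name in environ:
        return int(environ[env_name])  # this sketch handles integer settings only
    if key in file_config.get(section, {}):
        return file_config[section][key]
    return DEFAULTS[section][key]


chunk_size = resolve_setting(
    "limits",
    "max_chunk_size",
    file_config={"limits": {"max_chunk_size": 80000}},
    environ={"THEAUDITOR_LIMITS_MAX_CHUNK_SIZE": "100000"},
)
print(chunk_size)
```

Passing `environ` and `file_config` explicitly keeps the resolver easy to test; callers would normally pass `load_file_config()` and leave `environ` unset.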
## Advanced Features
### Database-Aware Rules
Specialized analyzers query `repo_index.db` to detect:
- ORM anti-patterns (N+1 queries, missing transactions)
- Docker security misconfigurations
- Nginx configuration issues
- Multi-file correlation patterns
### Holistic Analysis
Project-level analyzers that operate across the entire codebase:
- **Bundle Analyzer**: Correlates package.json, lock files, and imports
- **Source Map Detector**: Scans build directories for exposed maps
- **Framework Detectors**: Identify technology stack automatically
### Incremental Analysis
Workset-based analysis for efficient processing:
- Git diff integration for changed file detection
- Dependency tracking for impact analysis
- Cached results for unchanged files
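One way to picture a workset is as the changed files plus everything that depends on them, directly or transitively. Whether TheAuditor includes transitive dependents is an assumption here, and the `build_workset` helper and the import map are hypothetical.

```python
from collections import deque


def build_workset(changed, imports):
    """Return changed files plus all files that (transitively) import them.

    `imports` maps each file to the files it imports.
    """
    # Invert the import map: file -> files that import it
    dependents = {}
    for source, targets in imports.items():
        for target in targets:
            dependents.setdefault(target, set()).add(source)

    workset = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in workset:
                workset.add(dependent)
                queue.append(dependent)
    return sorted(workset)


imports = {
    "api.py": ["models.py"],
    "views.py": ["api.py"],
    "cli.py": ["utils.py"],
}
workset = build_workset(["models.py"], imports)
print(workset)
```

In a real run the `changed` list would come from the Git diff mentioned above; files outside the workset can reuse cached results.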
## Contributing to TheAuditor
### Adding Language Support
TheAuditor's modular architecture makes it straightforward to add new language support:
#### 1. Create an Extractor
Create a new extractor in `theauditor/indexer/extractors/{language}.py`:
```python
from typing import List

from . import BaseExtractor


# Replace {Language} with the language name, e.g. RubyExtractor
class {Language}Extractor(BaseExtractor):
    def supported_extensions(self) -> List[str]:
        # File extensions this extractor handles
        return ['.ext', '.ext2']

    def extract(self, file_info, content, tree=None):
        # Extract symbols, imports, routes, etc.
        return {
            'imports': [],
            'routes': [],
            'symbols': [],
            # ... other extracted data
        }
```
The extractor will be automatically registered via the `BaseExtractor` inheritance.
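The source does not show how registration works. One common way to get registration through inheritance is `__init_subclass__`, sketched below with a stand-in `BaseExtractor`; the real class in `theauditor/indexer/extractors/__init__.py` may use a different mechanism, and `RubyExtractor` and `extractor_for` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Optional, Type


class BaseExtractor:
    """Stand-in base class; every subclass registers itself by file extension."""

    registry: Dict[str, Type["BaseExtractor"]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once when a subclass is defined, so no manual registration is needed.
        for extension in cls().supported_extensions():
            BaseExtractor.registry[extension] = cls

    def supported_extensions(self) -> List[str]:
        raise NotImplementedError


class RubyExtractor(BaseExtractor):
    def supported_extensions(self) -> List[str]:
        return [".rb", ".rake"]


def extractor_for(path: str) -> Optional[BaseExtractor]:
    """Pick an extractor instance by file suffix, or None if unsupported."""
    extractor_cls = BaseExtractor.registry.get(Path(path).suffix)
    return extractor_cls() if extractor_cls else None


extractor = extractor_for("lib/tasks/build.rake")
print(type(extractor).__name__)
```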
#### 2. Create Configuration Parser (Optional)
For configuration files, create a parser in `theauditor/parsers/{language}_parser.py`:
```python
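from pathlib import Path
from typing import Any, Dict

class {Language}Parser:
    def parse_file(self, file_path: Path) -> Dict[str, Any]:
        parsed_data: Dict[str, Any] = {}
        # Parse the configuration file and populate parsed_data
        return parsed_data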
```
#### 3. Add Security Patterns
Create YAML patterns in `theauditor/patterns/{language}.yml`:
```yaml
- name: hardcoded-secret-{language}
  # Inside a single-quoted YAML string, a literal ' is written as ''
  pattern: 'api_key\s*=\s*["''][^"'']+["'']'
  severity: critical
  category: security
  languages: ["{language}"]
  description: "Hardcoded API key in {Language} code"
#### 4. Add Framework Detection
Update `theauditor/framework_detector.py` to detect {Language} frameworks.
### Adding New Analyzers
#### Database-Aware Rules
Create analyzers that query `repo_index.db` in `theauditor/rules/{category}/`:
```python
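import sqlite3
from typing import Any, Dict, List

def find_{issue}_patterns(db_path: str) -> List[Dict[str, Any]]:
    findings: List[Dict[str, Any]] = []
    conn = sqlite3.connect(db_path)
    # Query repo_index.db here and append one dict per finding
    conn.close()
    return findings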
```
#### AST-Based Rules
For semantic analysis, create rules in `theauditor/rules/{framework}/`:
```python
from typing import Any, Dict, List

def find_{framework}_issues(tree, file_path) -> List[Dict[str, Any]]:
    findings: List[Dict[str, Any]] = []
    # Traverse the AST and append one dict per detected issue
    return findings
```
#### Pattern-Based Rules
Add YAML patterns to `theauditor/patterns/` for regex-based detection.
### Architecture Guidelines
1. **Maintain Truth Courier vs Insights separation** - Core modules report facts, insights add interpretation
2. **Use the extractor registry** - Inherit from `BaseExtractor` for automatic registration
3. **Follow existing patterns** - Look at `python.py` or `javascript.py` extractors as examples
4. **Write comprehensive tests** - Test extractors, parsers, and patterns
5. **Document your additions** - Update this file and CONTRIBUTING.md
For detailed contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).