mirror of https://github.com/aljazceru/Auditor.git synced 2025-12-17 03:24:18 +01:00


TheAuditorTool c7a59e420b Fix: Critical Windows ProcessPoolExecutor hang and documentation drift

Fixed critical Windows compatibility issues and updated outdated documentation.

  CRITICAL WINDOWS HANG FIXES:
  1. ProcessPoolExecutor → ThreadPoolExecutor
     - Fixes PowerShell/terminal hang where Ctrl+C wouldn't work
     - Prevents .pf directory lock requiring Task Manager kill
     - Root cause: Nested ProcessPool + ThreadPool on Windows creates kernel deadlock

  2. Ctrl+C Interruption Support
     - Replaced subprocess.run with Popen+poll pattern (industry standard)
     - Poll subprocess every 100ms for interruption checking
     - Added global stop_event and signal handlers for graceful shutdown
     - Root cause: subprocess.run blocks threads with no signal propagation

  DOCUMENTATION DRIFT FIX:
  - Removed hardcoded "14 phases" references (actual is 19+ commands)
  - Updated to "multiple analysis phases" throughout all docs
  - Fixed CLI help text to be version-agnostic
  - Added missing "Summary generation" step in HOWTOUSE.md

  Changes:
  - pipelines.py: ProcessPoolExecutor → ThreadPoolExecutor, added Popen+poll pattern
  - Added signal handling and run_subprocess_with_interrupt() function
  - commands/full.py: Updated docstring to remove specific phase count
  - README.md: Changed "14 distinct phases" to "multiple analysis phases"
  - HOWTOUSE.md: Updated phase references, added missing summary step
  - CLAUDE.md & ARCHITECTURE.md: Removed hardcoded phase counts

  Impact: Critical UX fixes - Windows compatibility restored, pipeline interruptible
  Testing: Ctrl+C works, no PowerShell hangs, .pf directory deletable

2025-09-09 14:26:18 +07:00


TheAuditor Architecture

This document provides a comprehensive technical overview of TheAuditor's architecture, design patterns, and implementation details.

System Overview

TheAuditor is an offline-first, AI-centric SAST (Static Application Security Testing) and code intelligence platform. It orchestrates industry-standard tools to provide ground truth about code quality and security, producing AI-consumable reports optimized for LLM context windows.

Core Design Principles

Offline-First Operation - All analysis runs without network access, ensuring data privacy and reproducible results
Dual-Mode Architecture - Courier Mode preserves raw external tool outputs; Expert Mode applies security expertise objectively
AI-Centric Workflow - Produces chunks optimized for LLM context windows (65KB by default)
Sandboxed Execution - Isolated analysis environment prevents cross-contamination
No Fix Generation - Reports findings without prescribing solutions

Truth Courier vs Insights: Separation of Concerns

TheAuditor maintains a strict architectural separation between factual observation and optional interpretation:

Truth Courier Modules (Core)

These modules are the foundation - they gather and report verifiable facts without judgment:

Indexer: Reports "Function X exists at line Y with Z parameters"
Taint Analyzer: Reports "Data flows from pattern A to pattern B through path C"
Impact Analyzer: Reports "Changing function X affects Y files through Z call chains"
Graph Analyzer: Reports "Module A imports B, B imports C, C imports A (cycle detected)"
Pattern Detector: Reports "Line X matches pattern Y from rule Z"
Linters: Reports "Tool ESLint flagged line X with rule Y"

These modules form the immutable ground truth. They report what exists, not what it means.

Insights Modules (Optional Interpretation Layer)

These are optional packages that consume Truth Courier data to add scoring and classification. All insights modules have been consolidated into a single package for better organization:

theauditor/insights/
├── __init__.py      # Package exports
├── ml.py           # Machine learning predictions (requires pip install -e ".[ml]")
├── graph.py        # Graph health scoring and recommendations
└── taint.py        # Vulnerability severity classification

insights/taint.py: Adds "This flow is XSS with HIGH severity"
insights/graph.py: Adds "Health score: 70/100, Grade: B"
insights/ml.py (requires pip install -e ".[ml]"): Adds "80% probability of bugs based on historical patterns"

Important: Insights modules are:

Not installed by default (ML requires explicit opt-in)
Completely decoupled from core analysis
Still based on technical patterns, not business logic interpretation
Designed for teams that want actionable scores alongside raw facts
All consolidated in the theauditor/insights/ package for consistency

The FCE: Factual Correlation Engine

The FCE correlates facts from multiple tools without interpreting them:

Reports: "Tool A and Tool B both flagged line 100"
Reports: "Pattern X and Pattern Y co-occur in file Z"
Never says: "This is bad" or "Fix this way"
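
The co-occurrence reporting described above can be sketched as follows. This is a minimal illustration, not the FCE's actual code: the `Finding` tuple shape and the `correlate_by_location` name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical finding shape: (tool, file, line).
Finding = Tuple[str, str, int]


def correlate_by_location(findings: List[Finding]) -> Dict[Tuple[str, int], Set[str]]:
    """Group findings by (file, line) and keep locations flagged by two or more tools."""
    tools_at: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for tool, path, line in findings:
        tools_at[(path, line)].add(tool)
    # Report only the fact of co-occurrence: no severity, no advice.
    return {loc: tools for loc, tools in tools_at.items() if len(tools) >= 2}


findings = [
    ("eslint", "app.js", 100),
    ("patterns", "app.js", 100),
    ("ruff", "api.py", 12),
]
correlated = correlate_by_location(findings)
for (path, line), tools in correlated.items():
    print(f"{' and '.join(sorted(tools))} both flagged {path}:{line}")
```

The result carries only locations and tool names, which keeps the output on the factual side of the Truth Courier boundary.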

Core Components

Indexer Package (`theauditor/indexer/`)

The indexer has been refactored from a monolithic 2000+ line file into a modular package structure:

theauditor/indexer/
├── __init__.py           # Package initialization and backward compatibility
├── config.py             # Constants, patterns, and configuration
├── database.py           # DatabaseManager class for all DB operations
├── core.py               # FileWalker and ASTCache classes
├── orchestrator.py       # IndexOrchestrator - main coordination logic
└── extractors/
    ├── __init__.py       # BaseExtractor abstract class and registry
    ├── python.py         # Python-specific extraction logic
    ├── javascript.py     # JavaScript/TypeScript extraction
    ├── docker.py         # Docker/docker-compose extraction
    ├── sql.py            # SQL extraction
    └── nginx.py          # Nginx configuration extraction

Key features:

Dynamic extractor registry for automatic language detection
Batched database operations (200 records per batch by default)
AST caching for performance optimization
Monorepo detection and intelligent path filtering
Parallel JavaScript processing when the semantic parser is available

Pipeline System (`theauditor/pipelines.py`)

Orchestrates the analysis pipeline in three stages; the first and last run sequentially, and the middle stage runs three tracks in parallel:

Stage 1 - Foundation (Sequential):

Repository indexing - Build manifest and symbol database
Framework detection - Identify technologies in use

Stage 2 - Concurrent Analysis (3 Parallel Tracks):

Track A (Network I/O):
- Dependency checking
- Documentation fetching
- Documentation summarization
Track B (Code Analysis):
- Workset creation
- Linting
- Pattern detection
Track C (Graph Build):
- Graph building

Stage 3 - Final Aggregation (Sequential):

Graph analysis
Taint analysis
Factual correlation engine
Report generation
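
The staged layout above can be sketched with a thread pool, one worker per track. This is an illustrative sketch only: `run_track`, `run_pipeline`, and `step` are hypothetical names, and each step is reduced to a function that returns its own name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Step = Callable[[], str]


def run_track(steps: List[Step]) -> List[str]:
    """Run the steps of one track in order and collect their results."""
    return [step() for step in steps]


def run_pipeline(stage1: List[Step], tracks: Dict[str, List[Step]],
                 stage3: List[Step]) -> List[str]:
    log = run_track(stage1)  # Stage 1: sequential foundation
    # Stage 2: one worker thread per track; steps inside a track stay ordered.
    with ThreadPoolExecutor(max_workers=len(tracks)) as pool:
        futures = {name: pool.submit(run_track, steps) for name, steps in tracks.items()}
        for name in tracks:  # collect in a fixed order so the log is stable
            log.extend(futures[name].result())
    log.extend(run_track(stage3))  # Stage 3: sequential aggregation
    return log


def step(name: str) -> Step:
    """Build a placeholder step that just returns its name."""
    return lambda: name


log = run_pipeline(
    [step("index"), step("detect-frameworks")],
    {
        "A": [step("deps"), step("docs-fetch"), step("docs-summary")],
        "B": [step("workset"), step("lint"), step("patterns")],
        "C": [step("graph-build")],
    },
    [step("graph-analyze"), step("taint"), step("fce"), step("report")],
)
print(log)
```

Stage 3 starts only after every track's future has been resolved, which mirrors the join between the parallel tracks and the final aggregation.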

Pattern Detection Engine

100+ YAML-defined security patterns in theauditor/patterns/
AST-based matching for Python and JavaScript
Supports semantic analysis via TypeScript compiler

Factual Correlation Engine (FCE) (`theauditor/fce.py`)

29 advanced correlation rules in theauditor/correlations/rules/
Detects complex vulnerability patterns across multiple tools
Categories: Authentication, Injection, Data Exposure, Infrastructure, Code Quality, Framework-Specific

Taint Analysis Package (`theauditor/taint_analyzer.py`)

This module follows data from sources to sinks across the codebase:

Tracks data flow from user inputs to dangerous outputs
Detects SQL injection, XSS, command injection vulnerabilities
Database-aware analysis using repo_index.db
Supports both assignment-based and direct-use patterns
Merges findings from multiple detection methods

Note: The optional severity scoring for taint analysis is provided by theauditor/insights/taint.py (Insights module)
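
As an illustration of the assignment-based pattern, the sketch below propagates taint through assignments and reports sink calls that receive a tainted argument. It is a deliberately simplified, flow-insensitive model; the input shapes and function names are assumptions made for the example and do not reflect the analyzer's real data structures.

```python
from typing import List, Set, Tuple

# Hypothetical inputs: an assignment is (target, variables read on the right-hand side);
# a sink call is (sink name, argument variable, line).
Assignment = Tuple[str, List[str]]
SinkCall = Tuple[str, str, int]


def tainted_variables(sources: Set[str], assignments: List[Assignment]) -> Set[str]:
    """Propagate taint through assignments until no new variable becomes tainted."""
    tainted = set(sources)
    changed = True
    while changed:
        changed = False
        for target, reads in assignments:
            if target not in tainted and any(name in tainted for name in reads):
                tainted.add(target)
                changed = True
    return tainted


def find_flows(sources: Set[str], assignments: List[Assignment],
               sink_calls: List[SinkCall]) -> List[SinkCall]:
    """Return the sink calls whose argument is reachable from a source."""
    tainted = tainted_variables(sources, assignments)
    return [(sink, arg, line) for sink, arg, line in sink_calls if arg in tainted]


flows = find_flows(
    sources={"request.args"},
    assignments=[("name", ["request.args"]), ("query", ["name"]), ("limit", [])],
    sink_calls=[("cursor.execute", "query", 14), ("cursor.execute", "limit", 15)],
)
print(flows)
```

The output states only that data reaches a sink; classifying the flow as SQL injection with a severity is left to the Insights layer.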

Graph Analysis (`theauditor/graph/`)

builder.py: Constructs dependency graph from codebase
analyzer.py: Detects cycles, measures complexity, identifies hotspots
Uses NetworkX for graph algorithms

Note: The optional health scoring and recommendations are provided by theauditor/insights/graph.py (Insights module)
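
Cycle detection of the kind analyzer.py performs can be sketched with a depth-first search. The project uses NetworkX for this; the version below uses only the standard library so it runs without dependencies, and `find_cycle` is a hypothetical name.

```python
from typing import Dict, List, Optional


def find_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of modules, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in imports}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in imports.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:  # back edge: dep is already on the current path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(imports):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

The returned list repeats the first module at the end, which makes the cycle easy to print in the "A imports B, B imports C, C imports A" form used earlier in this document.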

Framework Detection (`theauditor/framework_detector.py`)

Auto-detects Django, Flask, React, Vue, Angular, etc.
Applies framework-specific rules
Influences pattern selection and analysis behavior

Configuration Parsers (`theauditor/parsers/`)

Specialized parsers for configuration file analysis:

webpack_config_parser.py: Webpack configuration analysis
compose_parser.py: Docker Compose file parsing
nginx_parser.py: Nginx configuration parsing
dockerfile_parser.py: Dockerfile security analysis
prisma_schema_parser.py: Prisma ORM schema parsing

These parsers are used by extractors during indexing to extract security-relevant configuration data.

Refactoring Detection (`theauditor/commands/refactor.py`)

Detects incomplete refactorings and cross-stack inconsistencies:

Analyzes database migrations to detect schema changes
Uses impact analysis to trace affected files
Applies correlation rules from theauditor/correlations/rules/refactoring.yaml
Detects API contract mismatches, field migrations, foreign key changes
Supports auto-detection from migration files or specific change analysis

System Architecture Diagrams

High-Level Data Flow

graph TB
    subgraph "Input Layer"
        CLI[CLI Commands]
        Files[Project Files]
    end
    
    subgraph "Core Pipeline"
        Index[Indexer]
        Framework[Framework Detector]
        Deps[Dependency Checker]
        Patterns[Pattern Detection]
        Taint[Taint Analysis]
        Graph[Graph Builder]
        FCE[Factual Correlation Engine]
    end
    
    subgraph "Storage"
        DB[(SQLite DB)]
        Raw[Raw Output]
        Chunks[65KB Chunks]
    end
    
    CLI --> Index
    Files --> Index
    Index --> DB
    Index --> Framework
    Framework --> Deps
    
    Deps --> Patterns
    Patterns --> Graph
    Graph --> Taint
    Taint --> FCE
    
    FCE --> Raw
    Raw --> Chunks

Parallel Pipeline Execution

graph LR
    subgraph "Stage 1 - Sequential"
        S1[Index] --> S2[Framework Detection]
    end
    
    subgraph "Stage 2 - Parallel"
        direction TB
        subgraph "Track A - Network I/O"
            A1[Deps Check]
            A2[Doc Fetch]
            A3[Doc Summary]
            A1 --> A2 --> A3
        end
        
        subgraph "Track B - Code Analysis"
            B1[Workset]
            B2[Linting]
            B3[Patterns]
            B1 --> B2 --> B3
        end
        
        subgraph "Track C - Graph"
            C1[Graph Build]
        end
    end
    
    subgraph "Stage 3 - Sequential"
        E1[Graph Analysis] --> E2[Taint] --> E3[FCE] --> E4[Report]
    end
    
    S2 --> A1
    S2 --> B1
    S2 --> C1
    
    A3 --> E1
    B3 --> E1
    C1 --> E1

Data Chunking System

The extraction system (theauditor/extraction.py) chunks results in the pure courier model: findings are split across files but never filtered or reinterpreted.

graph TD
    subgraph "Analysis Results"
        P[Patterns.json]
        T[Taint.json<br/>Multiple lists merged]
        L[Lint.json]
        F[FCE.json]
    end
    
    subgraph "Extraction Process"
        E[Extraction Engine<br/>Budget: 1.5MB]
        M[Merge Logic<br/>For taint_paths +<br/>rule_findings]
        C1[Chunk 1<br/>0-65KB]
        C2[Chunk 2<br/>65-130KB]
        C3[Chunk 3<br/>130-195KB]
        TR[Truncation<br/>Flag]
    end
    
    subgraph "Output"
        R1[patterns_chunk01.json]
        R2[patterns_chunk02.json]
        R3[patterns_chunk03.json]
    end
    
    P --> E
    T --> M --> E
    L --> E
    F --> E
    
    E --> C1 --> R1
    E --> C2 --> R2
    E --> C3 --> R3
    E -.->|If >195KB| TR
    TR -.-> R3

Key features:

Budget system: 1.5MB total budget for all chunks
Smart merging: Taint analysis merges multiple finding lists (taint_paths, rule_findings, infrastructure)
Preservation: All findings preserved, no filtering or sampling
Chunking: Only chunks files >65KB, copies smaller files as-is
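
The size rules above can be sketched as a small function. This is an approximation under stated assumptions: it splits raw bytes, whereas a real implementation would presumably split on finding boundaries so each chunk stays valid JSON, and `chunk_bytes` is a hypothetical name. The constants are the defaults quoted in the Configuration System section.

```python
from typing import List, Tuple

MAX_CHUNK_SIZE = 65_000    # default chunk size in bytes
MAX_CHUNKS_PER_FILE = 3    # default number of chunks per file


def chunk_bytes(data: bytes, chunk_size: int = MAX_CHUNK_SIZE,
                max_chunks: int = MAX_CHUNKS_PER_FILE) -> Tuple[List[bytes], bool]:
    """Split data into at most max_chunks pieces and report whether it was truncated."""
    if len(data) <= chunk_size:
        return [data], False  # small files are copied as-is
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    truncated = len(chunks) > max_chunks
    return chunks[:max_chunks], truncated


small, small_flag = chunk_bytes(b"x" * 1_000)
large, large_flag = chunk_bytes(b"x" * 200_000)
print(len(small), small_flag, len(large), large_flag)
```

The boolean plays the role of the truncation flag in the diagram: it is set only when the input needs more chunks than the limit allows.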

Dual Environment Architecture

graph TB
    subgraph "Development Environment"
        V1[.venv/]
        PY[Python 3.11+]
        AU[TheAuditor Code]
        V1 --> PY --> AU
    end
    
    subgraph "Sandboxed Analysis Environment"
        V2[.auditor_venv/.theauditor_tools/]
        NODE[Bundled Node.js v20.11.1]
        TS[TypeScript Compiler]
        ES[ESLint]
        PR[Prettier]
        NM[node_modules/]
        V2 --> NODE
        NODE --> TS
        NODE --> ES
        NODE --> PR
        NODE --> NM
    end
    
    AU -->|Analyzes using| V2
    AU -.->|Never uses| V1

TheAuditor maintains strict separation between:

Primary Environment (.venv/): TheAuditor's Python code and dependencies
Sandboxed Environment (.auditor_venv/.theauditor_tools/): Isolated JS/TS analysis tools

This ensures reproducibility and prevents TheAuditor from analyzing its own analysis tools.

Database Schema

erDiagram
    files ||--o{ symbols : contains
    files ||--o{ refs : contains
    files ||--o{ api_endpoints : contains
    files ||--o{ sql_queries : contains
    files ||--o{ docker_images : contains
    
    files {
        string path PK
        string language
        int size
        string hash
        json metadata
    }
    
    symbols {
        string path FK
        string name
        string type
        int line
        json metadata
    }
    
    refs {
        string src FK
        string value
        string kind
        int line
    }
    
    api_endpoints {
        string file FK
        string method
        string path
        int line
    }
    
    sql_queries {
        string file_path FK
        string command
        string query
        int line_number
    }
    
    docker_images {
        string file_path FK
        string base_image
        json env_vars
        json build_args
    }
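
A query against this schema might look like the sketch below. It builds a reduced in-memory copy of the `files` and `symbols` tables so it can run standalone. The column types are assumptions derived from the diagram, and the JSON `metadata` columns are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for .pf/repo_index.db
conn.executescript("""
    CREATE TABLE files (path TEXT PRIMARY KEY, language TEXT, size INTEGER, hash TEXT);
    CREATE TABLE symbols (path TEXT REFERENCES files(path), name TEXT, type TEXT, line INTEGER);
""")
conn.execute("INSERT INTO files VALUES ('app/views.py', 'python', 1200, 'abc123')")
conn.executemany(
    "INSERT INTO symbols VALUES (?, ?, ?, ?)",
    [("app/views.py", "login", "function", 10), ("app/views.py", "User", "class", 42)],
)

# List every function symbol together with the language of its file.
rows = conn.execute(
    """
    SELECT s.name, s.path, s.line, f.language
    FROM symbols AS s
    JOIN files AS f ON f.path = s.path
    WHERE s.type = ?
    ORDER BY s.path, s.line
    """,
    ("function",),
).fetchall()
conn.close()
print(rows)
```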

Command Flow Sequence

sequenceDiagram
    participant User
    participant CLI
    participant Pipeline
    participant Analyzers
    participant Database
    participant Output
    
    User->>CLI: aud full
    CLI->>Pipeline: Execute pipeline
    Pipeline->>Database: Initialize schema
    
    Pipeline->>Analyzers: Index files
    Analyzers->>Database: Store file metadata
    
    par Parallel Execution
        Pipeline->>Analyzers: Dependency check
        and
        Pipeline->>Analyzers: Pattern detection
        and
        Pipeline->>Analyzers: Graph building
    end
    
    Pipeline->>Analyzers: Taint analysis
    Analyzers->>Database: Query symbols & refs
    
    Pipeline->>Analyzers: FCE correlation
    Analyzers->>Output: Generate reports
    
    Pipeline->>Output: Create chunks
    Output->>User: .pf/readthis/

Output Structure

All results are organized in the .pf/ directory:

.pf/
├── raw/                # Immutable tool outputs (ground truth)
│   ├── eslint.json
│   ├── ruff.json
│   └── ...
├── readthis/           # AI-optimized chunks (<65KB each, max 3 chunks per file)
│   ├── manifest.md     # Repository overview
│   ├── patterns_*.md   # Security findings
│   ├── taint_*.md      # Data-flow issues
│   └── tickets_*.md    # Actionable tasks
├── repo_index.db       # SQLite database of code symbols
├── pipeline.log        # Execution trace
└── findings.json       # Consolidated results

Key Output Files

manifest.md: Complete file inventory with SHA-256 hashes
patterns_*.md: Chunked security findings from 100+ detection rules
tickets_*.md: Prioritized, actionable issues with evidence
repo_index.db: Queryable database of all code symbols and relationships

Operating Modes

TheAuditor operates in two distinct modes:

Courier Mode (External Tools)

Preserves exact outputs from ESLint, Ruff, MyPy, etc.
No interpretation or filtering
Complete audit trail from source to finding

Expert Mode (Internal Engines)

Taint Analysis: Tracks untrusted data through the application
Pattern Detection: YAML-based rules with AST matching
Graph Analysis: Architectural insights and dependency tracking
Secret Detection: Identifies hardcoded credentials and API keys

CLI Entry Points

Main CLI: theauditor/cli.py - Central command router
Command modules: theauditor/commands/ - One module per command
Utilities: theauditor/utils/ - Shared functionality
Configuration: theauditor/config_runtime.py - Runtime configuration

Each command module follows a standardized structure with:

@click.command() decorator
@handle_exceptions decorator for error handling
Consistent logging and output formatting

Performance Optimizations

Batched database operations: 200 records per batch (configurable)
Parallel rule execution: ThreadPoolExecutor with 4 workers
AST caching: Persistent cache for parsed AST trees
Incremental analysis: Workset-based analysis for changed files only
Lazy loading: Patterns and rules loaded on-demand
Memory-efficient chunking: Stream large files instead of loading entirely
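
Batched database writes can be sketched as below. The batch size of 200 comes from the list above; the `batched` and `insert_symbols` helpers and the four-column `symbols` table are assumptions made for the example.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

BATCH_SIZE = 200  # default quoted above

Row = Tuple[str, str, str, int]


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield lists of at most `size` rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def insert_symbols(conn: sqlite3.Connection, rows: Iterable[Row],
                   size: int = BATCH_SIZE) -> int:
    """Insert rows batch by batch, committing once per batch; return the batch count."""
    batches = 0
    for batch in batched(rows, size):
        conn.executemany("INSERT INTO symbols VALUES (?, ?, ?, ?)", batch)
        conn.commit()
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (path TEXT, name TEXT, type TEXT, line INTEGER)")
rows = [("app.py", f"func_{i}", "function", i) for i in range(450)]
batch_count = insert_symbols(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
conn.close()
print(batch_count, row_count)
```

Committing once per batch rather than once per row is the point of the technique: 450 rows cost three transactions here.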

Configuration System

TheAuditor supports runtime configuration via multiple sources (priority order):

Environment variables (THEAUDITOR_* prefix)
.pf/config.json file (project-specific)
Built-in defaults in config_runtime.py

Example configuration:

export THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE=5  # Default: 3
export THEAUDITOR_LIMITS_MAX_CHUNK_SIZE=100000  # Default: 65000
export THEAUDITOR_LIMITS_MAX_FILE_SIZE=5242880  # Default: 2097152
export THEAUDITOR_TIMEOUTS_LINT_TIMEOUT=600     # Default: 300
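
One way the three sources could be merged is sketched below. Several details are assumptions rather than documented behavior: that environment variables win over .pf/config.json, that config.json is nested by section, and the `load_config` name itself. The defaults are the ones shown in the comments of the example above.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Mapping

DEFAULTS: Dict[str, Dict[str, Any]] = {
    "limits": {"max_chunks_per_file": 3, "max_chunk_size": 65000, "max_file_size": 2097152},
    "timeouts": {"lint_timeout": 300},
}


def load_config(project_root: Path,
                env: Mapping[str, str] = os.environ) -> Dict[str, Dict[str, Any]]:
    """Start from the defaults, overlay .pf/config.json, then THEAUDITOR_* variables."""
    config = {section: dict(values) for section, values in DEFAULTS.items()}
    config_file = project_root / ".pf" / "config.json"
    if config_file.is_file():
        # Assumed layout: {"limits": {"max_chunk_size": 80000}, ...}
        for section, values in json.loads(config_file.read_text()).items():
            config.setdefault(section, {}).update(values)
    for section, values in config.items():
        for key in values:
            name = f"THEAUDITOR_{section}_{key}".upper()
            if name in env:
                values[key] = int(env[name])  # every default here is an integer
    return config


config = load_config(Path("no-such-project-root"),
                     {"THEAUDITOR_LIMITS_MAX_CHUNK_SIZE": "100000"})
print(config["limits"])
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.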

Advanced Features

Database-Aware Rules

Specialized analyzers query repo_index.db to detect:

ORM anti-patterns (N+1 queries, missing transactions)
Docker security misconfigurations
Nginx configuration issues
Multi-file correlation patterns

Holistic Analysis

Project-level analyzers that operate across the entire codebase:

Bundle Analyzer: Correlates package.json, lock files, and imports
Source Map Detector: Scans build directories for exposed maps
Framework Detectors: Identify technology stack automatically

Incremental Analysis

Workset-based analysis for efficient processing:

Git diff integration for changed file detection
Dependency tracking for impact analysis
Cached results for unchanged files
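
The combination of changed-file detection and dependency tracking can be sketched as a breadth-first walk over reverse imports. The list of changed files would come from git diff in practice; here it is passed in directly, and both `expand_workset` and the `imported_by` mapping are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def expand_workset(changed: Iterable[str], imported_by: Dict[str, List[str]]) -> Set[str]:
    """Return the changed files plus every file that transitively imports them."""
    workset: Set[str] = set(changed)
    queue = deque(workset)
    while queue:
        current = queue.popleft()
        for dependent in imported_by.get(current, []):
            if dependent not in workset:
                workset.add(dependent)
                queue.append(dependent)
    return workset


imported_by = {
    "db.py": ["models.py"],
    "models.py": ["views.py"],
    "utils.py": ["cli.py"],
}
workset = expand_workset(["db.py"], imported_by)
print(sorted(workset))
```

Files outside the resulting set, such as utils.py and cli.py here, are the ones whose cached results could be reused.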

Contributing to TheAuditor

Adding Language Support

TheAuditor's modular architecture makes it straightforward to add new language support:

1. Create an Extractor

Create a new extractor in theauditor/indexer/extractors/{language}.py:

from typing import List

from . import BaseExtractor

# Replace {Language} with the language name, for example RubyExtractor.
class {Language}Extractor(BaseExtractor):
    def supported_extensions(self) -> List[str]:
        return ['.ext', '.ext2']
    
    def extract(self, file_info, content, tree=None):
        # Extract symbols, imports, routes, etc.
        return {
            'imports': [],
            'routes': [],
            'symbols': [],
            # ... other extracted data
        }

The extractor will be automatically registered via the BaseExtractor inheritance.
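
Registration through inheritance can be implemented in several ways. The sketch below shows one of them, using `__init_subclass__`, and should not be read as the project's actual mechanism. The `BaseExtractor` here is a stand-in defined for the example, it uses a class attribute instead of the `supported_extensions()` method, and `extractor_for` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Type


class BaseExtractor:
    """Stand-in base class; the registry logic is an assumption for illustration."""

    registry: Dict[str, Type["BaseExtractor"]] = {}
    extensions: List[str] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once when each subclass is defined, so defining a class registers it.
        for ext in cls.extensions:
            BaseExtractor.registry[ext] = cls


class RubyExtractor(BaseExtractor):
    extensions = [".rb", ".rake"]


def extractor_for(filename: str) -> Optional[Type[BaseExtractor]]:
    """Look up the extractor class by file extension."""
    dot = filename.rfind(".")
    return BaseExtractor.registry.get(filename[dot:]) if dot != -1 else None


print(extractor_for("tasks.rake"), extractor_for("main.go"))
```

With this approach a new language needs no edits to a central list: importing the module that defines the subclass is enough.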

2. Create Configuration Parser (Optional)

For configuration files, create a parser in theauditor/parsers/{language}_parser.py:

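from pathlib import Path
from typing import Any, Dict

class {Language}Parser:
    def parse_file(self, file_path: Path) -> Dict[str, Any]:
        # Parse the configuration file and collect security-relevant settings
        parsed_data: Dict[str, Any] = {}
        return parsed_data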

3. Add Security Patterns

Create YAML patterns in theauditor/patterns/{language}.yml:

- name: hardcoded-secret-{language}
  pattern: 'api_key\s*=\s*["''][^"'']+["'']'
  severity: critical
  category: security
  languages: ["{language}"]
  description: "Hardcoded API key in {Language} code"
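
To check what the regex in this example matches, it can be tried directly with Python's `re` module. How the pattern engine loads and applies YAML rules is not shown in this document, so the loop below is only a stand-in for that step.

```python
import re

# The regex from the example above, written as a Python raw string.
HARDCODED_KEY = re.compile(r"""api_key\s*=\s*["'][^"']+["']""")

lines = [
    'api_key = "sk-test-1234"',         # literal value: should match
    'api_key = os.environ["API_KEY"]',  # read from the environment: should not match
]

matches = [number for number, line in enumerate(lines, start=1)
           if HARDCODED_KEY.search(line)]
print(matches)
```

The second line is not flagged because the pattern requires a quote immediately after the equals sign.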

4. Add Framework Detection

Update theauditor/framework_detector.py to detect {Language} frameworks.

Adding New Analyzers

Database-Aware Rules

Create analyzers that query repo_index.db in theauditor/rules/{category}/:

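import sqlite3
from typing import Any, Dict, List

def find_{issue}_patterns(db_path: str) -> List[Dict[str, Any]]:
    findings: List[Dict[str, Any]] = []
    conn = sqlite3.connect(db_path)
    # Query and analyze, appending each result to findings
    conn.close()
    return findings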

AST-Based Rules

For semantic analysis, create rules in theauditor/rules/{framework}/:

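from typing import Any, Dict, List

def find_{framework}_issues(tree, file_path) -> List[Dict[str, Any]]:
    findings: List[Dict[str, Any]] = []
    # Traverse AST and detect issues, appending each result to findings
    return findings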

Pattern-Based Rules

Add YAML patterns to theauditor/patterns/ for regex-based detection.

Architecture Guidelines

Maintain Truth Courier vs Insights separation - Core modules report facts, insights add interpretation
Use the extractor registry - Inherit from BaseExtractor for automatic registration
Follow existing patterns - Look at python.py or javascript.py extractors as examples
Write comprehensive tests - Test extractors, parsers, and patterns
Document your additions - Update this file and CONTRIBUTING.md

For detailed contribution guidelines, see CONTRIBUTING.md.
