Files
Auditor/HOWTOUSE.md
TheAuditorTool c7a59e420b Fix: Critical Windows ProcessPoolExecutor hang and documentation drift
Fixed critical Windows compatibility issues and updated outdated documentation.

  CRITICAL WINDOWS HANG FIXES:
  1. ProcessPoolExecutor → ThreadPoolExecutor
     - Fixes PowerShell/terminal hang where Ctrl+C wouldn't work
     - Prevents .pf directory lock requiring Task Manager kill
     - Root cause: Nested ProcessPool + ThreadPool on Windows creates kernel deadlock

  2. Ctrl+C Interruption Support
     - Replaced subprocess.run with Popen+poll pattern (industry standard)
     - Poll subprocess every 100ms for interruption checking
     - Added global stop_event and signal handlers for graceful shutdown
     - Root cause: subprocess.run blocks threads with no signal propagation

  DOCUMENTATION DRIFT FIX:
  - Removed hardcoded "14 phases" references (actual is 19+ commands)
  - Updated to "multiple analysis phases" throughout all docs
  - Fixed CLI help text to be version-agnostic
  - Added missing "Summary generation" step in HOWTOUSE.md

  Changes:
  - pipelines.py: ProcessPoolExecutor → ThreadPoolExecutor, added Popen+poll pattern
  - Added signal handling and run_subprocess_with_interrupt() function
  - commands/full.py: Updated docstring to remove specific phase count
  - README.md: Changed "14 distinct phases" to "multiple analysis phases"
  - HOWTOUSE.md: Updated phase references, added missing summary step
  - CLAUDE.md & ARCHITECTURE.md: Removed hardcoded phase counts

  Impact: Critical UX fixes - Windows compatibility restored, pipeline interruptible
  Testing: Ctrl+C works, no PowerShell hangs, .pf directory deletable
2025-09-09 14:26:18 +07:00

1157 lines
35 KiB
Markdown

# How to Use TheAuditor
This comprehensive guide covers everything you need to know about setting up, configuring, and using **TheAuditor** for code analysis and security auditing. Whether you're performing a one-time security audit or integrating continuous analysis into your development workflow, this guide will walk you through every step.
---
## Prerequisites
Before installing **TheAuditor**, ensure you have:
- **Python 3.11 or higher** (3.12+ recommended)
- **Git** (for repository operations)
- **Operating System**: Linux, macOS, or Windows with WSL
---
## Installation & Setup
### Understanding the Architecture
TheAuditor uses a **dual-environment** design:
1. **TheAuditor Installation** - The tool itself (installed once, used everywhere)
2. **Project Sandbox** - Created per-project for isolated analysis
### Step 1: Install TheAuditor Tool
**IMPORTANT**: Do NOT create a virtual environment. Use your system Python.
```bash
# Choose a permanent location for TheAuditor (NOT inside your projects)
cd ~/tools # or C:\tools on Windows, or wherever you keep dev tools
# Clone the repository
git clone https://github.com/TheAuditorTool/Auditor.git
cd TheAuditor
# Install TheAuditor to your system
pip install -e .
# Verify the installation worked
aud --version
# Optional: Install with ML capabilities
# pip install -e ".[ml]"
# For development with all optional dependencies:
# pip install -e ".[all]" # Includes Insights module
```
**Common Mistakes to Avoid:**
- ❌ Don't create a venv before installing TheAuditor
- ❌ Don't install TheAuditor inside your project directory
- ❌ Don't run `pip install` from your project directory
- ✅ Install TheAuditor ONCE in a tools directory
- ✅ Use TheAuditor to analyze MANY projects
### Step 2: Setup Project for Analysis (MANDATORY)
**Navigate to YOUR PROJECT directory first:**
```bash
# Go to the project you want to analyze (NOT TheAuditor directory!)
cd ~/my-project-to-audit
# Create the sandboxed environment for THIS project
aud setup-claude --target .
```
This command:
- Creates **`.auditor_venv/.theauditor_tools/`** sandbox directory
- Installs **TypeScript compiler** (`tsc`) in isolation
- Installs **ESLint** and related tools
- Updates all tools to latest versions
- Configures the sandbox for TheAuditor's exclusive use
**Why is this required?**
- TheAuditor **NEVER** uses your global or project-installed tools
- Ensures reproducible results across different environments
- Prevents contamination between analysis tools and project dependencies
- **Required for TheAuditor to function at all** - not just for JavaScript/TypeScript analysis
**Expected output:**
```
Step 1: Setting up Python virtual environment...
[OK] Venv already exists: C:\Users\user\Desktop\TheAuditor\.auditor_venv
[OK] TheAuditor already installed in C:\Users\user\Desktop\TheAuditor\.auditor_venv
Upgrading to ensure latest version...
Installing TheAuditor from C:\Users\user\Desktop\TheAuditor...
[OK] Installed TheAuditor (editable) from C:\Users\user\Desktop\TheAuditor
[OK] Executable available: C:\Users\user\Desktop\TheAuditor\.auditor_venv\Scripts\aud.exe
Installing Python linting tools...
Checking for latest linter versions...
[OK] Updated to latest package versions
Installing linters from pyproject.toml...
[OK] Python linters installed (ruff, mypy, black, bandit, pylint)
Setting up JavaScript/TypeScript tools in sandboxed environment...
Creating sandboxed tools directory: C:\Users\user\Desktop\TheAuditor\.auditor_venv\.theauditor_tools
[OK] ESLint v9 flat config copied to sandbox
[Track A] Checking for latest tool versions...
[Track B] Setting up portable Node.js runtime...
[OK] Node.js runtime already installed at C:\Users\user\Desktop\TheAuditor\.auditor_venv\.theauditor_tools\node-runtime
[OK] Updated @typescript-eslint/parser: 8.41.0 → ^8.42.0
[OK] Updated @typescript-eslint/eslint-plugin: 8.41.0 → ^8.42.0
Updated 2 packages to latest versions
Installing JS/TS linters using bundled Node.js...
[OK] JavaScript/TypeScript tools installed in sandbox
[OK] Tools isolated from project: C:\Users\user\Desktop\TheAuditor\.auditor_venv\.theauditor_tools
[OK] Using bundled Node.js - no system dependency!
[OK] ESLint verified at: C:\Users\user\Desktop\TheAuditor\.auditor_venv\.theauditor_tools\node_modules\.bin\eslint.cmd
```
---
## Core Commands & Workflow
### Complete Audit Pipeline
On a medium 20k LOC node/react/vite stack, expect the analysis to take around 30 minutes.
Progress bars for tracks B/C may display inconsistently on PowerShell.
Run a comprehensive audit with multiple analysis phases organized in parallel stages:
```bash
aud full
# Skip network operations (deps, docs) for faster execution
aud full --offline
```
This executes in **parallel stages** for optimal performance:
**Stage 1 - Foundation (Sequential):**
1. **Repository indexing** - Build manifest and symbol database
2. **Framework detection** - Identify technologies in use
**Stage 2 - Concurrent Analysis (3 Parallel Tracks):**
- **Track A (Network I/O):** *(skipped with --offline)*
3. **Dependency checking** - Scan for vulnerabilities
4. **Documentation fetching** - Gather project docs
5. **Documentation summarization** - Create AI-friendly summaries
- **Track B (Code Analysis):**
6. **Workset creation** - Define analysis scope
7. **Linting** - Run code quality checks
8. **Pattern detection** - Apply security rules
- **Track C (Graph Build):**
9. **Graph building** - Construct dependency graph
**Stage 3 - Final Aggregation (Sequential):**
10. **Graph analysis** - Find architectural issues
11. **Taint analysis** - Track data flow
12. **Factual correlation engine** - Correlate findings across tools with 29 advanced rules
13. **Report generation** - Produce final output
14. **Summary generation** - Create executive summary
**Output**: Complete results in **`.pf/readthis/`** directory
### Offline Mode
When working on the same codebase repeatedly or when network access is limited, use offline mode to skip network operations (dependency checking and documentation fetching):
```bash
# Run full audit without network operations
aud full --offline
# Combine with other flags
aud full --offline --quiet
aud full --offline --exclude-self # Only meant for dogfooding; in 9/10 projects, --exclude-self will correctly exclude the entire project, producing empty results
```
**Benefits:**
- **Faster execution** - Skips slow network operations
- **Air-gapped operation** - Works without internet access
- **Iterative development** - Perfect for repeated runs during development
**What gets skipped:**
- Dependency vulnerability scanning
- Documentation fetching and summarization
- Latest version checks
**What still runs:**
- All code analysis (indexing, linting, patterns)
- Graph building and analysis
- Taint analysis and FCE
- Report generation
### Incremental Analysis (Workset-based)
Analyze only changed files based on git diff:
```bash
# Create workset from uncommitted changes
aud workset
# Create workset from specific commit range
aud workset --diff "HEAD~3..HEAD"
# Create workset for all files
aud workset --all
```
Then run targeted analysis:
```bash
aud lint --workset
aud detect-patterns --workset
```
### Linting with Auto-fix
Run comprehensive linting across all supported languages:
```bash
# Run linting on workset
aud lint --workset
# Auto-fix issues where possible
aud lint --fix
# Run on all files
aud lint --all
```
Supports:
- **Python**: **Ruff**, **MyPy**, **Black**, **Bandit**, **Pylint**
- **JavaScript/TypeScript**: **ESLint** with TypeScript parser
- **General**: **Prettier** for formatting
### Security Analysis
#### Taint Analysis
Track data flow from **sources** (user input) to **sinks** (database, output):
```bash
aud taint-analyze
```
Detects:
- **SQL injection** vulnerabilities
- **XSS** (Cross-site scripting)
- **Command injection**
- **Path traversal**
- Other injection attacks
#### Pattern Detection
Run pattern-based vulnerability scanning:
```bash
aud detect-patterns
```
Uses **100+ YAML-defined patterns** across multiple categories:
**Security Patterns:**
- Hardcoded secrets and API keys
- Insecure randomness (**Math.random** for security)
- Weak cryptographic algorithms
- Authentication bypasses
- Missing authentication decorators
**Resource Management:**
- Socket, stream, and worker leaks
- File handles not closed properly
- Database connections left open
- Event listeners not removed
**Concurrency Issues:**
- **Race conditions** (check-then-act)
- **Deadlocks** (nested locks, lock ordering)
- Shared state without synchronization
- Unsafe parallel writes
**ORM & Database:**
- **Sequelize** death queries and N+1 patterns
- **Prisma** connection pool exhaustion
- **TypeORM** missing transactions
- Missing database indexes
**Deployment & Infrastructure:**
- **Docker** security misconfigurations
- **nginx** exposed paths and weak SSL
- **docker-compose** privileged containers
- **webpack** source map exposure in production
**Framework-Specific:**
- **Django**, **Flask**, **FastAPI** vulnerabilities
- **React** hooks dependency issues
- **Vue** reactivity problems
- **Angular**, **Next.js**, **Express.js** patterns
- Multi-tenant security violations
### Docker Security Analysis
Analyze Docker images for security misconfigurations and vulnerabilities:
```bash
# Analyze all indexed Docker images
aud docker-analyze
# Filter by severity level
aud docker-analyze --severity critical
# Save results to JSON file
aud docker-analyze --output docker-security.json
```
Detects:
- **Containers running as root** - CIS Docker Benchmark violation
- **Exposed secrets in ENV/ARG** - Hardcoded passwords, API keys, tokens
- **High entropy values** - Potential secrets using Shannon entropy
- **Known secret patterns** - GitHub tokens, AWS keys, Slack tokens
The command requires Docker images to be indexed first (`aud index`). It queries the `repo_index.db` for Docker metadata and performs security analysis.
### Project Structure Report
Generate comprehensive project structure and intelligence reports:
```bash
# Generate default structure report
aud structure
# Specify output location
aud structure --output PROJECT_OVERVIEW.md
# Adjust directory tree depth
aud structure --max-depth 6
# Analyze different root directory
aud structure --root ./src
```
The report includes:
- **Directory tree visualization** - Smart file grouping and critical file(size/loc) highlighting
- **Project statistics** - Total files, LOC, estimated tokens
- **Language distribution** - Percentage breakdown by file type
- **Top 10 largest files** - By token count with percentage of codebase
- **Top 15 critical files** - Identified by naming conventions (auth.py, config.js, etc.)
- **AI context optimization** - Recommendations for reading order and token budget
- **Symbol counts** - Functions, classes, imports from database
Useful for:
- Getting quick project overview
- Understanding codebase structure
- Planning AI assistant interactions
- Identifying critical components
- Token budget management for LLMs
### Impact Analysis
Assess the blast radius of a specific code change:
```bash
# Analyze impact of changes to a specific function
aud impact --file "src/auth/login.py" --line 42
# Analyze impact with depth limit
aud impact --file "src/database.py" --line 100 --depth 3
# Trace frontend to backend dependencies
aud impact --file "frontend/api.ts" --line 50 --trace-to-backend
```
Shows:
- Dependent functions and modules
- Call chain analysis
- Affected test files
- Risk assessment
- Cross-stack impact (frontend → backend tracing)
### Refactoring Analysis
Detect and analyze refactoring issues such as data model changes, API contract mismatches, and incomplete migrations:
```bash
# Analyze impact from a specific model change
aud refactor --file "models/Product.ts" --line 42
# Auto-detect refactoring from database migrations
aud refactor --auto-detect --migration-dir backend/migrations
# Analyze current workset for refactoring issues
aud refactor --workset
# Generate detailed report
aud refactor --auto-detect --output refactor_report.json
```
Detects:
- **Data Model Changes**: Fields moved between tables (e.g., `product.price``variant.price`)
- **Foreign Key Changes**: References updated (e.g., `product_id``product_variant_id`)
- **API Contract Mismatches**: Frontend expects old structure, backend provides new
- **Missing Updates**: Code still using old field/table names
- **Cross-Stack Inconsistencies**: TypeScript interfaces not matching backend models
The refactor command uses:
- Impact analysis to trace affected files
- Migration file analysis to detect schema changes
- Pattern detection with refactoring-specific rules
- FCE correlation to find related issues
- Risk assessment based on blast radius
### Insights Analysis (Optional)
Run optional interpretive analysis on top of factual audit data:
```bash
# Run all insights modules
aud insights --mode all
# ML-powered insights (requires pip install -e ".[ml]")
aud insights --mode ml --ml-train
# Graph health metrics and recommendations
aud insights --mode graph
# Taint vulnerability scoring
aud insights --mode taint
# Impact analysis insights
aud insights --mode impact
# Generate comprehensive report
aud insights --output insights_report.json
# Train ML model on your codebase patterns
aud insights --mode ml --ml-train --training-data .pf/raw/
# Get ML-powered suggestions
aud insights --mode ml --ml-suggest
```
Modes:
- **ml**: Machine learning predictions and pattern recognition
- **graph**: Health scores, architectural recommendations
- **taint**: Vulnerability severity scoring and classification
- **impact**: Change impact assessment and risk scoring
- **all**: Run all available insights modules
The insights command:
- Reads existing audit data from `.pf/raw/`
- Applies interpretive scoring and classification
- Generates actionable recommendations
- Outputs to `.pf/insights/` for separation from facts
- Provides technical scoring without crossing into semantic interpretation
### Graph Visualization
Generate rich visual intelligence from dependency graphs:
```bash
# Build dependency graphs first
aud graph build
# Basic visualization
aud graph viz
# Show only dependency cycles
aud graph viz --view cycles --include-analysis
# Top 10 hotspots (most connected nodes)
aud graph viz --view hotspots --top-hotspots 10
# Architectural layers visualization
aud graph viz --view layers --format svg
# Impact analysis visualization
aud graph viz --view impact --impact-target "src/auth/login.py"
# Call graph instead of import graph
aud graph viz --graph-type call --view full
# Generate SVG for AI analysis
aud graph viz --format svg --include-analysis --title "System Architecture"
# Custom output location
aud graph viz --out-dir ./architecture/ --format png
```
View Modes:
- **full**: Complete graph with all nodes and edges
- **cycles**: Only nodes/edges involved in dependency cycles (red highlighting)
- **hotspots**: Top N most connected nodes with gradient coloring
- **layers**: Architectural layers as subgraphs with clear hierarchy
- **impact**: Highlight impact radius with color-coded upstream/downstream
Visual Encoding:
- **Node Color**: Programming language (Python=blue, JavaScript=yellow, TypeScript=blue)
- **Node Size**: Importance/connectivity (larger = more dependencies)
- **Edge Color**: Red for cycles, gray for normal dependencies
- **Border Width**: Code churn (thicker = more changes)
- **Node Shape**: Module=box, Function=ellipse, Class=diamond
The graph viz command:
- Generates Graphviz DOT format files
- Optionally creates SVG/PNG images (requires Graphviz installation)
- Supports filtered views for focusing on specific concerns
- Includes analysis data for cycle and hotspot highlighting
- Produces AI-readable SVG output for LLM analysis
### Dependency Management
Check for outdated or vulnerable dependencies:
```bash
# Check for latest versions
aud deps --check-latest
# Scan for known vulnerabilities
aud deps --vuln-scan
# Update all dependencies to latest
aud deps --upgrade-all
```
---
## Architecture: Truth Courier vs Insights
### Understanding the Separation of Concerns
TheAuditor implements a strict architectural separation between **factual observation** (Truth Courier modules) and **optional interpretation** (Insights modules). This design ensures the tool remains an objective source of ground truth while offering actionable intelligence when needed.
### The Core Philosophy
TheAuditor doesn't try to understand your business logic or make your AI "smarter." Instead, it solves the real problem: **LLMs lose context and make inconsistent changes across large codebases.**
The workflow:
1. **You tell AI**: "Add JWT auth with CSRF tokens and password complexity"
2. **AI writes code**: Probably inconsistent due to context limits
3. **You run**: `aud full`
4. **TheAuditor reports**: All the inconsistencies and security holes
5. **AI reads the report**: Now sees the complete picture across all files
6. **AI fixes issues**: With full visibility of what's broken
7. **Repeat until clean**
### Truth Courier Modules (Core)
These modules report verifiable facts without judgment:
```python
# What Truth Couriers Report - Just Facts
{
"taint_analyzer": "Data from req.body flows to res.send at line 45",
"pattern_detector": "Line 45 matches pattern 'unsanitized-output'",
"impact_analyzer": "Changing handleRequest() affects 12 downstream functions",
"graph_analyzer": "Module A imports B, B imports C, C imports A"
}
```
**Key Truth Couriers:**
- **Indexer**: Maps all code symbols and their locations
- **Taint Analyzer**: Traces data flow through the application
- **Impact Analyzer**: Maps dependency chains and change blast radius
- **Graph Analyzer**: Detects cycles and architectural patterns
- **Pattern Detector**: Matches code against security patterns
### Insights Modules (Optional Scoring)
These optional modules add technical scoring and classification:
```python
# What Insights Add - Technical Classifications
{
"taint/insights": {
"vulnerability_type": "Cross-Site Scripting",
"severity": "HIGH"
},
"graph/insights": {
"health_score": 70,
"recommendation": "Reduce coupling"
}
}
```
**Installation:**
```bash
# Base installation (Truth Couriers only)
pip install -e .
# With ML insights (optional)
pip install -e ".[ml]"
# Development with all dependencies (not for general users)
# pip install -e ".[all]"
```
### Correlation Rules: Detecting YOUR Patterns
Correlation rules detect when multiple facts indicate an inconsistency in YOUR codebase:
```yaml
# Example: Detecting incomplete refactoring
- name: "PRODUCT_VARIANT_REFACTOR"
co_occurring_facts:
- tool: "grep"
pattern: "ProductVariant.*retail_price" # Backend changed
- tool: "grep"
pattern: "product\\.unit_price" # Frontend didn't
```
This isn't "understanding" that products have prices. It's detecting that you moved a field from one model to another and some code wasn't updated. Pure consistency checking.
The correlation engine loads rules from `/correlations/rules/`. We provide common patterns, but many are project-specific. You write rules that detect YOUR patterns, YOUR refactorings, YOUR inconsistencies.
### Why This Works
**What doesn't work:**
- Making AI "understand" your business domain
- Adding semantic layers to guess what you mean
- Complex context management systems
**What does work:**
- Accept that AI will make inconsistent changes
- Detect those inconsistencies after the fact
- Give AI the full picture so it can fix them
TheAuditor doesn't try to prevent mistakes. It finds them so they can be fixed.
### Practical Example
```bash
# You ask AI to implement authentication
Human: "Add JWT auth with CSRF protection"
# AI writes code (probably with issues due to context limits)
AI: *implements auth across 15 files*
# You audit it
$ aud full
# TheAuditor finds issues
- "JWT secret hardcoded at auth.js:47"
- "CSRF token generated but never validated"
- "Auth middleware missing on /api/admin/*"
# You can also check impact of changes
$ aud impact --file "auth.js" --line 47
# Shows: "Changing this affects 23 files, 47 functions"
# AI reads the audit and can now see ALL issues
AI: *reads .pf/readthis/*
AI: "I see 5 security issues across auth flow. Fixing..."
# AI fixes with complete visibility
AI: *fixes all issues because it can see the full picture*
```
### Key Points
1. **No Business Logic Understanding**: TheAuditor doesn't need to know what your app does
2. **Just Consistency Checking**: It finds where your code doesn't match itself
3. **Facts, Not Opinions**: Reports what IS, not what SHOULD BE
4. **Complete Dependency Tracing**: Impact analyzer shows exactly what's affected by changes
5. **AI + Audit Loop**: Write → Audit → Fix → Repeat until clean
This is why TheAuditor works where semantic understanding fails - it's not trying to read your mind, just verify your code's consistency.
---
## Understanding the Output
### Directory Structure
After running analyses, results are organized in **`.pf/`**:
```
.pf/
├── raw/ # Raw, unmodified tool outputs (Truth Couriers)
│ ├── linting.json # Raw linter results
│ ├── patterns.json # Pattern detection findings
│ ├── taint_analysis.json # Taint analysis results
│ ├── graph.json # Dependency graph data
│ └── graph_analysis.json # Graph analysis (cycles, hotspots)
├── insights/ # Optional interpretive analysis (Insights modules)
│ ├── ml_suggestions.json # ML predictions and patterns
│ ├── taint_insights.json # Vulnerability severity scoring
│ └── graph_insights.json # Health scores and recommendations
├── readthis/ # AI-consumable chunks
│ ├── manifest.md # Repository overview
│ ├── patterns_001.md # Chunked findings (65KB max)
│ ├── patterns_002.md
│ ├── taint_001.md # Chunked taint results
│ ├── tickets_001.md # Actionable issue tickets
│ └── summary.md # Executive summary
├── graphs/ # Graph visualizations
│ ├── import_graph.dot # Dependency graph DOT file
│ ├── import_graph_cycles.dot # Cycles-only view
│ └── import_graph.svg # SVG visualization (if generated)
├── pipeline.log # Complete execution log
├── error.log # Error details (if failures occur)
├── findings.json # Consolidated findings
├── risk_scores.json # Risk analysis results
└── report.md # Human-readable report
```
### Key Output Files
#### `.pf/raw/`
Contains **unmodified outputs** from each tool. These files preserve the exact format and data from linters, scanners, and analyzers. **Never modified** after creation. This is the source of ground truth.
#### `.pf/insights/`
Contains **optional interpretive analysis** from Insights modules. These files add technical scoring and classification on top of raw data. Only created when insights commands are run.
#### `.pf/graphs/`
Contains **graph visualizations** in DOT and image formats. Generated by `aud graph viz` command with various view modes for focusing on specific concerns.
#### `.pf/readthis/`
Contains processed, **chunked data optimized for AI consumption**:
- Each file is under **65KB** by default (configurable via `THEAUDITOR_LIMITS_MAX_CHUNK_SIZE`)
- Maximum 3 chunks per file by default (configurable via `THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE`)
- Structured with clear headers and sections
- Includes context, evidence, and suggested fixes
- Ready for direct consumption by **Claude**, **GPT-4**, etc.
#### `.pf/pipeline.log`
Complete execution log showing:
- Each phase's **execution time**
- **Success/failure** status
- Key statistics and findings
- Error messages if any
#### `.pf/error.log`
Created only when errors occur. Contains:
- Full **stack traces**
- Detailed error messages
- Phase-specific failure information
- Debugging information
---
## Advanced Usage
### Custom Pattern Rules
Create custom detection patterns in **`.pf/patterns/`**:
```yaml
# .pf/patterns/custom_auth.yaml
name: weak_password_check
severity: high
category: security
pattern: 'password\s*==\s*["\']'
description: "Hardcoded password comparison"
test_template: |
def test_weak_password():
assert password != "admin"
```
### ML-Powered Suggestions
Train models on your codebase patterns:
```bash
# Initial training
aud learn
# Get improvement suggestions
aud suggest
# Provide feedback for continuous learning
aud learn-feedback --accept
```
### Development-Specific Flags
#### Excluding TheAuditor's Own Files
When testing or developing within TheAuditor's repository (e.g., analyzing `fakeproj/project_anarchy/`), use the `--exclude-self` flag to prevent false positives from TheAuditor's own files:
```bash
# Exclude all TheAuditor files from analysis
aud index --exclude-self
aud full --exclude-self
```
This flag excludes:
- All TheAuditor source code directories (`theauditor/`, `tests/`, etc.)
- Root configuration files (`pyproject.toml`, `package-template.json`, `Dockerfile`)
- Documentation and build files
**Use case:** Testing vulnerable projects within TheAuditor's repository without framework detection picking up TheAuditor's own configuration files.
### CI/CD Integration
#### GitHub Actions Example
```yaml
name: Security Audit
on: [push, pull_request]
jobs:
audit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.12'
- name: Set up Node.js
uses: actions/setup-node@v2
with:
node-version: '18'
- name: Install TheAuditor
run: |
pip install -e ".[all]"
aud setup-claude --target .
- name: Run Audit
run: aud full
- name: Upload Results
if: always()
uses: actions/upload-artifact@v2
with:
name: audit-results
path: .pf/
```
### Running TheAuditor on Its Own Codebase (Dogfooding)
When developing TheAuditor or testing it on itself, you need a special dual-environment setup:
#### Understanding the Dual-Environment Architecture
TheAuditor maintains strict separation between:
1. **Primary Environment** (`.venv/`) - Where TheAuditor runs from
2. **Sandboxed Environment** (`.auditor_venv/.theauditor_tools/`) - Tools TheAuditor uses for analysis
This ensures reproducibility and prevents TheAuditor from analyzing its own analysis tools.
#### Setup Procedure for Dogfooding
```bash
# 1. Clone and set up development environment
git clone https://github.com/TheAuditorTool/Auditor.git
cd theauditor
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .
# 2. CRITICAL: Create the sandboxed analysis environment
aud setup-claude --target .
# 3. Verify setup
aud full --quick-test
# 4. Run full analysis on TheAuditor itself
aud full
```
#### Analyzing Test Projects Within TheAuditor
When analyzing test projects like `fakeproj/` from within TheAuditor's repository:
```bash
cd fakeproj/project_anarchy
aud full --exclude-self # Excludes TheAuditor's own files
```
The `--exclude-self` flag prevents:
- Framework detection from identifying TheAuditor's `pyproject.toml`
- False positives from TheAuditor's configuration files
- Contamination from TheAuditor's source code
---
## Refactoring Detection
TheAuditor includes sophisticated capabilities for detecting incomplete refactorings, data model changes, and cross-stack inconsistencies.
### Understanding Refactoring Issues
Common refactoring problems TheAuditor detects:
1. **Data Model Evolution** - Fields moved between models (e.g., `product.price``variant.price`)
2. **Foreign Key Changes** - References updated in database but not in code
3. **API Contract Mismatches** - Frontend expects old structure, backend provides new
4. **Cross-Stack Inconsistencies** - TypeScript interfaces not matching backend models
5. **Incomplete Migrations** - Some code still using old field/table names
### How Refactoring Detection Works
TheAuditor uses multiple techniques:
#### Migration Analysis
Analyzes database migration files to understand schema changes:
```javascript
// Migration detected: Field moved from products to product_variants
removeColumn('products', 'unit_price');
addColumn('product_variants', 'retail_price', DataTypes.DECIMAL);
```
#### Impact Analysis
Traces dependencies to find all affected code:
```bash
aud impact --file "models/Product.ts" --line 42
# Shows: 47 files need updating
```
#### Pattern Detection
Over 30 refactoring-specific patterns detect common issues:
```yaml
- name: "PRODUCT_PRICE_FIELD_REMOVED"
description: "Code accessing price on Product after migration to ProductVariant"
```
#### Cross-Stack Tracing
Matches frontend API calls to backend endpoints to detect contract mismatches.
### Using Refactoring Detection
#### Quick Detection
```bash
# Auto-detect from migrations
aud refactor --auto-detect
# Analyze specific change
aud refactor --file "models/Product.ts" --line 42
# Use with workset
aud refactor --workset
# Generate detailed report
aud refactor --auto-detect --output refactor_report.json
```
#### Best Practices for Refactoring
**Before Refactoring:**
1. Run impact analysis: `aud impact --file "model.ts" --line 42`
2. Create workset: `aud workset --from-impact`
3. Baseline analysis: `aud refactor --workset`
**During Refactoring:**
- Run incremental checks: `aud refactor --workset`
- Validate cross-stack: `aud impact --trace-to-backend`
**After Refactoring:**
- Full validation: `aud unified --mode refactor`
- Generate report: `aud report --format refactoring`
### Real-World Example
A product variant refactoring might be detected as:
```
PRODUCT_PRICE_FIELD_REMOVED
- Frontend: 23 files accessing product.unit_price
- Backend: Field moved to ProductVariant.retail_price
- Impact: POS system cannot display prices
ORDER_ITEMS_WRONG_REFERENCE
- Database: order_items.product_variant_id (new)
- Code: Still using order_items.product_id (old)
- Impact: Orders cannot be created
```
### Custom Refactoring Rules
TheAuditor uses YAML-based correlation rules to detect refactoring issues. These rules are YOUR business logic - you define what patterns indicate problems in YOUR codebase.
#### How It Works
1. **Rules Location**: `/theauditor/correlations/rules/refactoring.yaml`
2. **Rule Structure**: Each rule defines co-occurring facts that must ALL match
3. **Detection**: When all facts match, TheAuditor reports the issue
4. **No Code Changes**: Just edit YAML to define new patterns
#### Creating Your Own Rules
Edit `/theauditor/correlations/rules/refactoring.yaml` or create new YAML files:
```yaml
rules:
- name: "MY_FIELD_MIGRATION"
description: "Detect when price field moved but old code remains"
co_occurring_facts:
- tool: "grep"
pattern: "removeColumn.*price" # Migration removed field
- tool: "grep"
pattern: "product\\.price" # Code still uses old field
confidence: 0.92
- name: "API_VERSION_MISMATCH"
description: "Frontend calling v1 API but backend is v2"
co_occurring_facts:
- tool: "grep"
pattern: "/api/v1/" # Frontend uses v1
- tool: "grep"
pattern: "router.*'/v2/'" # Backend only has v2
confidence: 0.95
```
#### Available Tools for Facts
- **grep**: Pattern matching in files
- **patterns**: Matches from pattern detection
- **taint_analyzer**: Taint flow findings
- **lint**: Linter findings
#### Real Example from Production
```yaml
- name: "PRODUCT_VARIANT_REFACTOR"
description: "Product fields moved to ProductVariant but frontend still uses old structure"
co_occurring_facts:
- tool: "grep"
pattern: "ProductVariant.*retail_price.*Sequelize" # Backend changed
- tool: "grep"
pattern: "product\\.unit_price|product\\.retail_price" # Frontend didn't
confidence: 0.92
```
This detects when you moved price fields from Product to ProductVariant model but frontend still expects the old structure.
---
## Troubleshooting
### Common Issues
#### "TypeScript compiler not available in TheAuditor sandbox"
**Solution**: Run **`aud setup-claude --target .`** to set up the sandbox.
#### "Coverage < 90% - run `aud capsules` first"
**Solution**: Generate code capsules for better analysis coverage:
```bash
aud index
aud workset --all
```
#### Linting produces no results
**Solution**: Ensure linters are installed:
```bash
# For Python
pip install -e ".[linters]"
# For JavaScript/TypeScript
aud setup-claude --target .
```
#### Pipeline fails at specific phase
**Solution**: Check **`.pf/error.log`** for details:
```bash
cat .pf/error.log
# Or check phase-specific error log
cat .pf/error_phase_08.log
```
### Performance Optimization
For large repositories:
```bash
# Limit analysis scope
aud workset --paths "src/critical/**/*.py"
# Run specific commands only
aud index && aud lint && aud detect-patterns
# Adjust chunking for larger context windows
export THEAUDITOR_LIMITS_MAX_CHUNK_SIZE=100000 # 100KB chunks
export THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE=5 # Allow up to 5 chunks
```
### Runtime Configuration
TheAuditor supports environment variable overrides for runtime configuration:
```bash
# Chunking configuration
export THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE=5 # Default: 3
export THEAUDITOR_LIMITS_MAX_CHUNK_SIZE=100000 # Default: 65000 (bytes)
# File size limits
export THEAUDITOR_LIMITS_MAX_FILE_SIZE=5242880 # Default: 2097152 (2MB)
# Timeout configuration
export THEAUDITOR_TIMEOUTS_LINT_TIMEOUT=600 # Default: 300 (seconds)
export THEAUDITOR_TIMEOUTS_FCE_TIMEOUT=1200 # Default: 600 (seconds)
# Batch processing
export THEAUDITOR_LIMITS_DEFAULT_BATCH_SIZE=500 # Default: 200
```
Configuration can also be set via `.pf/config.json` for project-specific overrides.
---
## Best Practices
1. **Always run `aud init` first** in a new project
2. **Set up the sandbox** for JavaScript/TypeScript projects using **`aud setup-claude --target .`**
3. **Use worksets** for incremental analysis during development
4. **Run `aud full`** before releases for comprehensive analysis
5. **Review `.pf/readthis/`** for AI-friendly issue summaries
6. **Check exit codes** in CI/CD for automated pass/fail decisions
7. **Archive results** with timestamps for audit trails
---
## Exit Codes for Automation
**TheAuditor** uses specific exit codes for CI/CD integration:
- **`0`** - Success, no critical/high issues
- **`1`** - High severity findings
- **`2`** - Critical severity findings
- **`3`** - Pipeline/task incomplete
Use these in scripts:
```bash
aud full
if [ $? -eq 2 ]; then
echo "Critical vulnerabilities found - blocking deployment"
exit 1
fi
```
---
## Getting Help
- Run **`aud --help`** for command overview
- Run **`aud <command> --help`** for specific command help
- Check **`.pf/pipeline.log`** for execution details
- Review **`.pf/error.log`** for troubleshooting
- Refer to **`teamsop.md`** for development workflow
---
## Next Steps
1. Initialize your first project with **`aud init`**
2. Run **`aud full`** to see TheAuditor in action
3. Explore the results in **`.pf/readthis/`**
4. Integrate into your CI/CD pipeline
5. Customize patterns for your specific needs
---
**Remember**: TheAuditor is designed to work **offline**, maintain **data integrity**, and produce **AI-ready outputs**. All analysis is **deterministic** and **reproducible**.