Initial commit: TheAuditor v1.0.1 - AI-centric SAST and Code Intelligence Platform

This commit is contained in:
TheAuditorTool
2025-09-07 20:39:47 +07:00
commit ba5c287b02
215 changed files with 50911 additions and 0 deletions

126
.gitignore vendored Normal file

@@ -0,0 +1,126 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Virtual environments
.env
.venv
.auditor_venv
env/
venv/
ENV/
env.bak/
venv.bak/
# IDEs
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store
# Project specific
.pf/
.claude/
audit/
manifest.json
repo_index.db
*.db
*.db-journal
/test_scaffold/
/tmp/
# Test and temporary files
test_output/
temp/
*.tmp
*.bak
*.log
# Local configuration
.env.local
.env.*.local
config.local.json
# Journal and runtime files
*.ndjson
.pf/journal.ndjson
.pf/bus/
.pf/workset.json
.pf/capsules/
.pf/context/
# ML models (if any)
*.pkl
*.joblib
*.h5
*.model
# Documentation build
docs/_build/
docs/.doctrees/
# macOS
.DS_Store
.AppleDouble
.LSOverride
# Windows
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db
*.stackdump
[Dd]esktop.ini
# Linux
.directory
.Trash-*

606
ARCHITECTURE.md Normal file

@@ -0,0 +1,606 @@
# TheAuditor Architecture
This document provides a comprehensive technical overview of TheAuditor's architecture, design patterns, and implementation details.
## System Overview
TheAuditor is an offline-first, AI-centric SAST (Static Application Security Testing) and code intelligence platform. It orchestrates industry-standard tools to provide ground truth about code quality and security, producing AI-consumable reports optimized for LLM context windows.
### Core Design Principles
1. **Offline-First Operation** - All analysis runs without network access, ensuring data privacy and reproducible results
2. **Dual-Mode Architecture** - Courier Mode preserves raw external tool outputs; Expert Mode applies security expertise objectively
3. **AI-Centric Workflow** - Produces chunks optimized for LLM context windows (65KB by default)
4. **Sandboxed Execution** - Isolated analysis environment prevents cross-contamination
5. **No Fix Generation** - Reports findings without prescribing solutions
## Truth Courier vs Insights: Separation of Concerns
TheAuditor maintains a strict architectural separation between **factual observation** and **optional interpretation**:
### Truth Courier Modules (Core)
These modules are the foundation - they gather and report verifiable facts without judgment:
- **Indexer**: Reports "Function X exists at line Y with Z parameters"
- **Taint Analyzer**: Reports "Data flows from pattern A to pattern B through path C"
- **Impact Analyzer**: Reports "Changing function X affects Y files through Z call chains"
- **Graph Analyzer**: Reports "Module A imports B, B imports C, C imports A (cycle detected)"
- **Pattern Detector**: Reports "Line X matches pattern Y from rule Z"
- **Linters**: Report "Tool ESLint flagged line X with rule Y"
These modules form the immutable ground truth. They report **what exists**, not what it means.
### Insights Modules (Optional Interpretation Layer)
These are **optional packages** that consume Truth Courier data to add scoring and classification. All insights modules have been consolidated into a single package for better organization:
```
theauditor/insights/
├── __init__.py # Package exports
├── ml.py # Machine learning predictions (requires pip install -e ".[ml]")
├── graph.py # Graph health scoring and recommendations
└── taint.py # Vulnerability severity classification
```
- **insights/taint.py**: Adds "This flow is XSS with HIGH severity"
- **insights/graph.py**: Adds "Health score: 70/100, Grade: B"
- **insights/ml.py** (requires `pip install -e ".[ml]"`): Adds "80% probability of bugs based on historical patterns"
**Important**: Insights modules are:
- Not installed by default (ML requires explicit opt-in)
- Completely decoupled from core analysis
- Still based on technical patterns, not business logic interpretation
- Designed for teams that want actionable scores alongside raw facts
- All consolidated in `/insights` package for consistency
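To make the split concrete, here is a minimal sketch of how an insights layer might sit on top of a raw taint fact. The function and field names are illustrative, not TheAuditor's actual API:
```python
def classify_taint_flow(fact: dict) -> dict:
    """Add interpretive category/severity on top of a raw taint fact."""
    interpretation = dict(fact)  # never mutate the ground truth
    if fact.get("sink") in ("res.send", "innerHTML"):
        interpretation["category"] = "XSS"
        interpretation["severity"] = "HIGH"
    return interpretation

# Truth Courier output: only what exists
raw_fact = {"source": "req.body", "sink": "res.send", "path": ["app.js:42"]}
# Insights output: the same fact, plus a classification layered on top
print(classify_taint_flow(raw_fact))
```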
### The FCE: Factual Correlation Engine
The FCE correlates facts from multiple tools without interpreting them:
- Reports: "Tool A and Tool B both flagged line 100"
- Reports: "Pattern X and Pattern Y co-occur in file Z"
- Never says: "This is bad" or "Fix this way"
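A minimal sketch of this correlation idea, assuming a simple list-of-dicts finding format (not the actual `fce.py` implementation):
```python
from collections import defaultdict

def correlate(findings: list[dict]) -> list[dict]:
    """Report which tools flagged the same (file, line) location."""
    by_location = defaultdict(set)
    for f in findings:
        by_location[(f["file"], f["line"])].add(f["tool"])
    return [
        {"file": file, "line": line, "tools": sorted(tools)}
        for (file, line), tools in by_location.items()
        if len(tools) > 1  # only co-occurrence is reported, never a verdict
    ]

findings = [
    {"tool": "eslint", "file": "app.js", "line": 100, "rule": "no-eval"},
    {"tool": "pattern", "file": "app.js", "line": 100, "rule": "eval-use"},
]
print(correlate(findings))  # "Tool A and Tool B both flagged line 100"
```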
## Core Components
### Indexer Package (`theauditor/indexer/`)
The indexer has been refactored from a monolithic 2000+ line file into a modular package structure:
```
theauditor/indexer/
├── __init__.py # Package initialization and backward compatibility
├── config.py # Constants, patterns, and configuration
├── database.py # DatabaseManager class for all DB operations
├── core.py # FileWalker and ASTCache classes
├── orchestrator.py # IndexOrchestrator - main coordination logic
└── extractors/
├── __init__.py # BaseExtractor abstract class and registry
├── python.py # Python-specific extraction logic
├── javascript.py # JavaScript/TypeScript extraction
├── docker.py # Docker/docker-compose extraction
├── sql.py # SQL extraction
└── nginx.py # Nginx configuration extraction
```
Key features:
- **Dynamic extractor registry** for automatic language detection
- **Batched database operations** (200 records per batch by default)
- **AST caching** for performance optimization
- **Monorepo detection** and intelligent path filtering
- **Parallel JavaScript processing** when semantic parser available
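As a sketch of the batching pattern (assuming the `symbols` columns shown in the schema later in this document; the real `DatabaseManager` differs in detail):
```python
import sqlite3

BATCH_SIZE = 200  # default; configurable

def insert_symbols(conn: sqlite3.Connection, symbols: list[tuple]) -> None:
    """Insert extracted symbols in fixed-size batches."""
    for i in range(0, len(symbols), BATCH_SIZE):
        batch = symbols[i:i + BATCH_SIZE]
        conn.executemany(
            "INSERT INTO symbols (path, name, type, line) VALUES (?, ?, ?, ?)",
            batch,
        )
        conn.commit()  # one commit per batch keeps memory and WAL bounded
```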
### Pipeline System (`theauditor/pipelines.py`)
Orchestrates a **14-phase** analysis pipeline in **parallel stages**:
**Stage 1 - Foundation (Sequential):**
1. Repository indexing - Build manifest and symbol database
2. Framework detection - Identify technologies in use
**Stage 2 - Concurrent Analysis (3 Parallel Tracks):**
- **Track A (Network I/O):**
- Dependency checking
- Documentation fetching
- Documentation summarization
- **Track B (Code Analysis):**
- Workset creation
- Linting
- Pattern detection
- **Track C (Graph Build):**
- Graph building
**Stage 3 - Final Aggregation (Sequential):**
- Graph analysis
- Taint analysis
- Factual correlation engine
- Report generation
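A simplified sketch of this staged layout, with stub phases standing in for the real steps in `pipelines.py`:
```python
from concurrent.futures import ThreadPoolExecutor

def phase(name: str):
    print(f"running {name}")

def track_a():  # Network I/O: deps -> doc fetch -> doc summary
    for p in ("deps", "doc_fetch", "doc_summary"):
        phase(p)

def track_b():  # Code analysis: workset -> lint -> patterns
    for p in ("workset", "lint", "patterns"):
        phase(p)

def track_c():  # Graph build
    phase("graph_build")

def run_pipeline():
    phase("index"); phase("framework_detect")         # Stage 1: sequential
    with ThreadPoolExecutor(max_workers=3) as pool:   # Stage 2: parallel
        for f in [pool.submit(t) for t in (track_a, track_b, track_c)]:
            f.result()                                # surface track failures
    for p in ("graph_analyze", "taint", "fce", "report"):
        phase(p)                                      # Stage 3: sequential

run_pipeline()
```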
### Pattern Detection Engine
- 100+ YAML-defined security patterns in `theauditor/patterns/`
- AST-based matching for Python and JavaScript
- Supports semantic analysis via TypeScript compiler
### Factual Correlation Engine (FCE) (`theauditor/fce.py`)
- **29 advanced correlation rules** in `theauditor/correlations/rules/`
- Detects complex vulnerability patterns across multiple tools
- Categories: Authentication, Injection, Data Exposure, Infrastructure, Code Quality, Framework-Specific
### Taint Analysis (`theauditor/taint_analyzer.py`)
A comprehensive taint analysis module that tracks data flow from sources to sinks:
- Tracks data flow from user inputs to dangerous outputs
- Detects SQL injection, XSS, command injection vulnerabilities
- Database-aware analysis using `repo_index.db`
- Supports both assignment-based and direct-use patterns
- Merges findings from multiple detection methods
**Note**: The optional severity scoring for taint analysis is provided by `theauditor/insights/taint.py` (Insights module)
### Graph Analysis (`theauditor/graph/`)
- **builder.py**: Constructs dependency graph from codebase
- **analyzer.py**: Detects cycles, measures complexity, identifies hotspots
- Uses NetworkX for graph algorithms
**Note**: The optional health scoring and recommendations are provided by `theauditor/insights/graph.py` (Insights module)
### Framework Detection (`theauditor/framework_detector.py`)
- Auto-detects Django, Flask, React, Vue, Angular, etc.
- Applies framework-specific rules
- Influences pattern selection and analysis behavior
### Configuration Parsers (`theauditor/parsers/`)
Specialized parsers for configuration file analysis:
- **webpack_config_parser.py**: Webpack configuration analysis
- **compose_parser.py**: Docker Compose file parsing
- **nginx_parser.py**: Nginx configuration parsing
- **dockerfile_parser.py**: Dockerfile security analysis
- **prisma_schema_parser.py**: Prisma ORM schema parsing
These parsers are used by extractors during indexing to extract security-relevant configuration data.
### Refactoring Detection (`theauditor/commands/refactor.py`)
Detects incomplete refactorings and cross-stack inconsistencies:
- Analyzes database migrations to detect schema changes
- Uses impact analysis to trace affected files
- Applies correlation rules from `/correlations/rules/refactoring.yaml`
- Detects API contract mismatches, field migrations, foreign key changes
- Supports auto-detection from migration files or specific change analysis
## System Architecture Diagrams
### High-Level Data Flow
```mermaid
graph TB
subgraph "Input Layer"
CLI[CLI Commands]
Files[Project Files]
end
subgraph "Core Pipeline"
Index[Indexer]
Framework[Framework Detector]
Deps[Dependency Checker]
Patterns[Pattern Detection]
Taint[Taint Analysis]
Graph[Graph Builder]
FCE[Factual Correlation Engine]
end
subgraph "Storage"
DB[(SQLite DB)]
Raw[Raw Output]
Chunks[65KB Chunks]
end
CLI --> Index
Files --> Index
Index --> DB
Index --> Framework
Framework --> Deps
Deps --> Patterns
Patterns --> Graph
Graph --> Taint
Taint --> FCE
FCE --> Raw
Raw --> Chunks
```
### Parallel Pipeline Execution
```mermaid
graph LR
subgraph "Stage 1 - Sequential"
S1[Index] --> S2[Framework Detection]
end
subgraph "Stage 2 - Parallel"
direction TB
subgraph "Track A - Network I/O"
A1[Deps Check]
A2[Doc Fetch]
A3[Doc Summary]
A1 --> A2 --> A3
end
subgraph "Track B - Code Analysis"
B1[Workset]
B2[Linting]
B3[Patterns]
B1 --> B2 --> B3
end
subgraph "Track C - Graph"
C1[Graph Build]
end
end
subgraph "Stage 3 - Sequential"
E1[Graph Analysis] --> E2[Taint] --> E3[FCE] --> E4[Report]
end
S2 --> A1
S2 --> B1
S2 --> C1
A3 --> E1
B3 --> E1
C1 --> E1
```
### Data Chunking System
The extraction system (`theauditor/extraction.py`) implements pure courier model chunking:
```mermaid
graph TD
subgraph "Analysis Results"
P[Patterns.json]
T[Taint.json<br/>Multiple lists merged]
L[Lint.json]
F[FCE.json]
end
subgraph "Extraction Process"
E[Extraction Engine<br/>Budget: 1.5MB]
M[Merge Logic<br/>For taint_paths +<br/>rule_findings]
C1[Chunk 1<br/>0-65KB]
C2[Chunk 2<br/>65-130KB]
C3[Chunk 3<br/>130-195KB]
TR[Truncation<br/>Flag]
end
subgraph "Output"
R1[patterns_chunk01.json]
R2[patterns_chunk02.json]
R3[patterns_chunk03.json]
end
P --> E
T --> M --> E
L --> E
F --> E
E --> C1 --> R1
E --> C2 --> R2
E --> C3 --> R3
E -.->|If >195KB| TR
TR -.-> R3
```
Key features:
- **Budget system**: 1.5MB total budget for all chunks
- **Smart merging**: Taint analysis merges multiple finding lists (taint_paths, rule_findings, infrastructure)
- **Preservation**: All findings preserved, no filtering or sampling
- **Chunking**: Only chunks files >65KB, copies smaller files as-is
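A minimal sketch of these chunking rules, using the documented defaults (65KB chunks, 3-chunk cap); the real extraction engine also handles JSON structure and the overall budget:
```python
MAX_CHUNK_SIZE = 65_000
MAX_CHUNKS = 3

def chunk(data: bytes) -> tuple[list[bytes], bool]:
    if len(data) <= MAX_CHUNK_SIZE:
        return [data], False          # small files are copied as-is
    chunks = [
        data[i:i + MAX_CHUNK_SIZE]
        for i in range(0, len(data), MAX_CHUNK_SIZE)
    ]
    truncated = len(chunks) > MAX_CHUNKS
    return chunks[:MAX_CHUNKS], truncated

chunks, truncated = chunk(b"x" * 200_000)
print(len(chunks), truncated)  # 3 True -> chunk_info would flag truncation
```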
### Dual Environment Architecture
```mermaid
graph TB
subgraph "Development Environment"
V1[.venv/]
PY[Python 3.11+]
AU[TheAuditor Code]
V1 --> PY --> AU
end
subgraph "Sandboxed Analysis Environment"
V2[.auditor_venv/.theauditor_tools/]
NODE[Bundled Node.js v20.11.1]
TS[TypeScript Compiler]
ES[ESLint]
PR[Prettier]
NM[node_modules/]
V2 --> NODE
NODE --> TS
NODE --> ES
NODE --> PR
NODE --> NM
end
AU -->|Analyzes using| V2
AU -.->|Never uses| V1
```
TheAuditor maintains strict separation between:
1. **Primary Environment** (`.venv/`): TheAuditor's Python code and dependencies
2. **Sandboxed Environment** (`.auditor_venv/.theauditor_tools/`): Isolated JS/TS analysis tools
This ensures reproducibility and prevents TheAuditor from analyzing its own analysis tools.
## Database Schema
```mermaid
erDiagram
files ||--o{ symbols : contains
files ||--o{ refs : contains
files ||--o{ api_endpoints : contains
files ||--o{ sql_queries : contains
files ||--o{ docker_images : contains
files {
string path PK
string language
int size
string hash
json metadata
}
symbols {
string path FK
string name
string type
int line
json metadata
}
refs {
string src FK
string value
string kind
int line
}
api_endpoints {
string file FK
string method
string path
int line
}
sql_queries {
string file_path FK
string command
string query
int line_number
}
docker_images {
string file_path FK
string base_image
json env_vars
json build_args
}
```
## Command Flow Sequence
```mermaid
sequenceDiagram
participant User
participant CLI
participant Pipeline
participant Analyzers
participant Database
participant Output
User->>CLI: aud full
CLI->>Pipeline: Execute pipeline
Pipeline->>Database: Initialize schema
Pipeline->>Analyzers: Index files
Analyzers->>Database: Store file metadata
par Parallel Execution
Pipeline->>Analyzers: Dependency check
and
Pipeline->>Analyzers: Pattern detection
and
Pipeline->>Analyzers: Graph building
end
Pipeline->>Analyzers: Taint analysis
Analyzers->>Database: Query symbols & refs
Pipeline->>Analyzers: FCE correlation
Analyzers->>Output: Generate reports
Pipeline->>Output: Create chunks
Output->>User: .pf/readthis/
```
## Output Structure
All results are organized in the `.pf/` directory:
```
.pf/
├── raw/ # Immutable tool outputs (ground truth)
│ ├── eslint.json
│ ├── ruff.json
│ └── ...
├── readthis/ # AI-optimized chunks (<65KB each, max 3 chunks per file)
│ ├── manifest.md # Repository overview
│ ├── patterns_*.md # Security findings
│ ├── taint_*.md # Data-flow issues
│ └── tickets_*.md # Actionable tasks
├── repo_index.db # SQLite database of code symbols
├── pipeline.log # Execution trace
└── findings.json # Consolidated results
```
### Key Output Files
- **manifest.md**: Complete file inventory with SHA-256 hashes
- **patterns_*.md**: Chunked security findings from 100+ detection rules
- **tickets_*.md**: Prioritized, actionable issues with evidence
- **repo_index.db**: Queryable database of all code symbols and relationships
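Because `repo_index.db` is plain SQLite, it can be queried directly. A small example assuming the `symbols` columns from the schema above (the `'function'` type value is illustrative, and `aud index` must have run first):
```python
import sqlite3

conn = sqlite3.connect(".pf/repo_index.db")
for path, name, line in conn.execute(
    "SELECT path, name, line FROM symbols WHERE type = 'function' LIMIT 10"
):
    print(f"{path}:{line}  {name}")
conn.close()
```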
## Operating Modes
TheAuditor operates in two distinct modes:
### Courier Mode (External Tools)
- Preserves exact outputs from ESLint, Ruff, MyPy, etc.
- No interpretation or filtering
- Complete audit trail from source to finding
### Expert Mode (Internal Engines)
- **Taint Analysis**: Tracks untrusted data through the application
- **Pattern Detection**: YAML-based rules with AST matching
- **Graph Analysis**: Architectural insights and dependency tracking
- **Secret Detection**: Identifies hardcoded credentials and API keys
## CLI Entry Points
- **Main CLI**: `theauditor/cli.py` - Central command router
- **Command modules**: `theauditor/commands/` - One module per command
- **Utilities**: `theauditor/utils/` - Shared functionality
- **Configuration**: `theauditor/config_runtime.py` - Runtime configuration
Each command module follows a standardized structure with:
- `@click.command()` decorator
- `@handle_exceptions` decorator for error handling
- Consistent logging and output formatting
## Performance Optimizations
- **Batched database operations**: 200 records per batch (configurable)
- **Parallel rule execution**: ThreadPoolExecutor with 4 workers
- **AST caching**: Persistent cache for parsed AST trees
- **Incremental analysis**: Workset-based analysis for changed files only
- **Lazy loading**: Patterns and rules loaded on-demand
- **Memory-efficient chunking**: Stream large files instead of loading entirely
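For example, memory-efficient chunking amounts to reading fixed-size blocks rather than loading whole files; a sketch (the 64KB block size is illustrative):
```python
def stream_file(path: str, block_size: int = 65_536):
    """Yield a file's contents one block at a time."""
    with open(path, "rb") as f:
        while block := f.read(block_size):
            yield block

# total = sum(len(b) for b in stream_file("big.json"))
```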
## Configuration System
TheAuditor supports runtime configuration via multiple sources (priority order):
1. **Environment variables** (`THEAUDITOR_*` prefix)
2. **`.pf/config.json`** file (project-specific)
3. **Built-in defaults** in `config_runtime.py`
Example configuration:
```bash
export THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE=5 # Default: 3
export THEAUDITOR_LIMITS_MAX_CHUNK_SIZE=100000 # Default: 65000
export THEAUDITOR_LIMITS_MAX_FILE_SIZE=5242880 # Default: 2097152
export THEAUDITOR_TIMEOUTS_LINT_TIMEOUT=600 # Default: 300
```
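A sketch of how that priority order can be resolved (the mapping from env names to `.pf/config.json` keys is an assumption for illustration):
```python
import json
import os
from pathlib import Path

def get_limit(name: str, default: int) -> int:
    env = os.environ.get(f"THEAUDITOR_LIMITS_{name}")
    if env is not None:
        return int(env)                   # 1. environment variable
    cfg = Path(".pf/config.json")
    if cfg.exists():
        value = json.loads(cfg.read_text()).get(name.lower())
        if value is not None:
            return int(value)             # 2. project config file
    return default                        # 3. built-in default

print(get_limit("MAX_CHUNK_SIZE", 65_000))
```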
## Advanced Features
### Database-Aware Rules
Specialized analyzers query `repo_index.db` to detect:
- ORM anti-patterns (N+1 queries, missing transactions)
- Docker security misconfigurations
- Nginx configuration issues
- Multi-file correlation patterns
### Holistic Analysis
Project-level analyzers that operate across the entire codebase:
- **Bundle Analyzer**: Correlates package.json, lock files, and imports
- **Source Map Detector**: Scans build directories for exposed maps
- **Framework Detectors**: Identify technology stack automatically
### Incremental Analysis
Workset-based analysis for efficient processing:
- Git diff integration for changed file detection
- Dependency tracking for impact analysis
- Cached results for unchanged files
## Contributing to TheAuditor
### Adding Language Support
TheAuditor's modular architecture makes it straightforward to add new language support:
#### 1. Create an Extractor
Create a new extractor in `theauditor/indexer/extractors/{language}.py`:
```python
from typing import List

from . import BaseExtractor

class {Language}Extractor(BaseExtractor):
    def supported_extensions(self) -> List[str]:
        return ['.ext', '.ext2']

    def extract(self, file_info, content, tree=None):
        # Extract symbols, imports, routes, etc.
        return {
            'imports': [],
            'routes': [],
            'symbols': [],
            # ... other extracted data
        }
```
The extractor will be automatically registered via the `BaseExtractor` inheritance.
#### 2. Create Configuration Parser (Optional)
For configuration files, create a parser in `theauditor/parsers/{language}_parser.py`:
```python
from pathlib import Path
from typing import Any, Dict

class {Language}Parser:
    def parse_file(self, file_path: Path) -> Dict[str, Any]:
        # Parse configuration file
        return parsed_data
```
#### 3. Add Security Patterns
Create YAML patterns in `theauditor/patterns/{language}.yml`:
```yaml
- name: hardcoded-secret-{language}
  pattern: 'api_key\s*=\s*["\'][^"\']+["\']'
  severity: critical
  category: security
  languages: ["{language}"]
  description: "Hardcoded API key in {Language} code"
```
#### 4. Add Framework Detection
Update `theauditor/framework_detector.py` to detect {Language} frameworks.
### Adding New Analyzers
#### Database-Aware Rules
Create analyzers that query `repo_index.db` in `theauditor/rules/{category}/`:
```python
import sqlite3
from typing import Any, Dict, List

def find_{issue}_patterns(db_path: str) -> List[Dict[str, Any]]:
    conn = sqlite3.connect(db_path)
    # Query and analyze
    return findings
```
#### AST-Based Rules
For semantic analysis, create rules in `theauditor/rules/{framework}/`:
```python
from typing import Any, Dict, List

def find_{framework}_issues(tree, file_path) -> List[Dict[str, Any]]:
    # Traverse AST and detect issues
    return findings
```
#### Pattern-Based Rules
Add YAML patterns to `theauditor/patterns/` for regex-based detection.
### Architecture Guidelines
1. **Maintain Truth Courier vs Insights separation** - Core modules report facts, insights add interpretation
2. **Use the extractor registry** - Inherit from `BaseExtractor` for automatic registration
3. **Follow existing patterns** - Look at `python.py` or `javascript.py` extractors as examples
4. **Write comprehensive tests** - Test extractors, parsers, and patterns
5. **Document your additions** - Update this file and CONTRIBUTING.md
For detailed contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).

454
CLAUDE.md Normal file

@@ -0,0 +1,454 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Quick Reference Commands
```bash
# Development Setup
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[all]"
aud setup-claude --target . # MANDATORY for JS/TS analysis
# Testing
pytest -v # Run all tests
pytest tests/test_file.py # Run specific test file
pytest -k "test_name" # Run specific test by name
pytest --cov=theauditor # With coverage
# Code Quality
ruff check theauditor tests --fix # Lint and auto-fix
ruff format theauditor tests # Format code
black theauditor tests # Alternative formatter
mypy theauditor --strict # Type checking
# Running TheAuditor
aud init # Initialize project
aud full # Complete analysis (14 phases)
aud full --offline # Skip network operations (deps, docs)
aud index --exclude-self # When analyzing TheAuditor itself
# Individual Analysis Commands
aud index # Build code index database
aud detect-patterns # Run security pattern detection
aud taint-analyze # Perform taint flow analysis
aud graph build # Build dependency graph
aud graph analyze # Analyze graph structure
aud fce # Run Factual Correlation Engine
aud report # Generate final report
aud workset # Create working set of critical files
aud impact <file> # Analyze impact of changing a file
# Utility Commands
aud setup-claude # Setup sandboxed JS/TS tools (MANDATORY)
aud js-semantic <file> # Parse JS/TS file semantically
aud structure # Display project structure
aud insights # Generate ML insights (requires [ml] extras)
aud refactor <operation> # Perform refactoring operations
```
## Project Overview
TheAuditor is an offline-first, AI-centric SAST (Static Application Security Testing) and code intelligence platform written in Python. It performs comprehensive security auditing and code analysis for Python and JavaScript/TypeScript projects, producing AI-consumable reports optimized for LLM context windows.
## Core Philosophy: Truth Courier, Not Mind Reader
**CRITICAL UNDERSTANDING**: TheAuditor does NOT try to understand business logic or make AI "smarter." It solves the real problem: **AI loses context and makes inconsistent changes across large codebases.**
### The Development Loop
1. **Human tells AI**: "Add JWT auth with CSRF protection"
2. **AI writes code**: Probably has issues due to context limits (hardcoded secrets, missing middleware, etc.)
3. **Human runs**: `aud full`
4. **TheAuditor reports**: All inconsistencies and security holes as FACTS
5. **AI reads report**: Now sees the COMPLETE picture across all files
6. **AI fixes issues**: With full visibility of what's broken
7. **Repeat until clean**
TheAuditor is about **consistency checking**, not semantic understanding. It finds where code doesn't match itself, not whether it matches business requirements.
## Critical Setup Requirements
### For JavaScript/TypeScript Analysis
TheAuditor requires a sandboxed environment for JS/TS tools. This is NOT optional:
```bash
# MANDATORY: Set up sandboxed tools
aud setup-claude --target .
```
This creates `.auditor_venv/.theauditor_tools/` with isolated TypeScript compiler and ESLint. Without this, TypeScript semantic analysis will fail.
## Key Architectural Decisions
### Modular Package Structure
The codebase follows a modular design where large modules are refactored into packages. Example: the indexer was refactored from a 2000+ line monolithic file into:
```
theauditor/indexer/
├── __init__.py # Backward compatibility shim
├── config.py # Constants and patterns
├── database.py # DatabaseManager class
├── core.py # FileWalker, ASTCache
├── orchestrator.py # Main coordination
└── extractors/ # Language-specific logic
```
When refactoring, always:
1. Create a package with the same name as the original module
2. Provide a backward compatibility shim in `__init__.py`
3. Separate concerns into focused modules
4. Use dynamic registries for extensibility
### Database Contract Preservation
The `repo_index.db` schema is consumed by many downstream modules (taint_analyzer, graph builder, etc.). When modifying indexer or database operations:
- NEVER change table schemas without migration
- Preserve exact column names and types
- Maintain the same data format in JSON columns
- Test downstream consumers after changes
## Architecture Overview
### Truth Courier vs Insights: Separation of Concerns
TheAuditor maintains strict separation between **factual observation** and **optional interpretation**:
#### Truth Courier Modules (Core - Always Active)
Report verifiable facts without judgment:
- **Indexer**: "Function X exists at line Y"
- **Taint Analyzer**: "Data flows from req.body to res.send" (NOT "XSS vulnerability")
- **Impact Analyzer**: "Changing X affects 47 files through dependency chains"
- **Pattern Detector**: "Line X matches pattern Y"
- **Graph Analyzer**: "Cycle detected: A→B→C→A"
#### Insights Modules (Optional - Not Installed by Default)
Add scoring and classification on top of facts:
- **taint/insights.py**: Adds "This is HIGH severity XSS"
- **graph/insights.py**: Adds "Health score: 70/100"
- **ml.py**: Requires `pip install -e ".[ml]"` - adds predictions
#### Correlation Rules (Project-Specific Pattern Detection)
- Located in `theauditor/correlations/rules/`
- Detect when multiple facts indicate inconsistency
- Example: "Backend moved field to ProductVariant but frontend still uses Product.price"
- NOT business logic understanding, just pattern matching YOUR refactorings
### Dual-Environment Design
TheAuditor maintains strict separation between:
1. **Primary Environment** (`.venv/`): TheAuditor's Python code and dependencies
2. **Sandboxed Environment** (`.auditor_venv/.theauditor_tools/`): Isolated JS/TS analysis tools
### Core Components
#### Indexer Package (`theauditor/indexer/`)
The indexer has been refactored from a monolithic 2000+ line file into a modular package:
- **config.py**: Constants, patterns, and configuration (SKIP_DIRS, language maps, etc.)
- **database.py**: DatabaseManager class handling all database operations
- **core.py**: FileWalker (with monorepo detection) and ASTCache classes
- **orchestrator.py**: IndexOrchestrator coordinating the indexing process
- **extractors/**: Language-specific extractors (Python, JavaScript, Docker, SQL, nginx)
The package uses a dynamic extractor registry for automatic language detection and processing.
#### Pipeline System (`theauditor/pipelines.py`)
- Orchestrates a **14-phase** analysis pipeline in **parallel stages**:
- **Stage 1**: Foundation (index with batched DB operations, framework detection)
- **Stage 2**: 3 concurrent tracks (Network I/O, Code Analysis, Graph Build)
- **Stage 3**: Final aggregation (graph analysis, taint, FCE, report)
- Handles error recovery and logging
- **Performance optimizations**:
- Batched database inserts (200 records per batch) in indexer
- Parallel rule execution with ThreadPoolExecutor (4 workers)
- Parallel holistic analysis (bundle + sourcemap detection)
#### Pattern Detection Engine
- 100+ YAML-defined security patterns in `theauditor/patterns/`
- AST-based matching for Python and JavaScript
- Supports semantic analysis via TypeScript compiler
#### Factual Correlation Engine (FCE) (`theauditor/fce.py`)
- **29 advanced correlation rules** in `theauditor/correlations/rules/`
- Detects complex vulnerability patterns across multiple tools
- Categories: Authentication, Injection, Data Exposure, Infrastructure, Code Quality, Framework-Specific
#### Taint Analysis Package (`theauditor/taint_analyzer/`)
Previously a monolithic 1822-line file, now refactored into a modular package:
- **core.py**: TaintAnalyzer main class
- **sources.py**: Source pattern definitions (user inputs)
- **sinks.py**: Sink pattern definitions (dangerous outputs)
- **patterns.py**: Pattern matching logic
- **flow.py**: Data flow tracking algorithms
- **insights.py**: Optional severity scoring (Insights module)
Features:
- Tracks data flow from sources to sinks
- Detects SQL injection, XSS, command injection
- Database-aware analysis using `repo_index.db`
- Supports both assignment-based and direct-use taint flows
- Merges findings from multiple detection methods (taint_paths, rule_findings, infrastructure)
#### Framework Detection (`theauditor/framework_detector.py`)
- Auto-detects Django, Flask, React, Vue, etc.
- Applies framework-specific rules
#### Graph Analysis (`theauditor/commands/graph.py`)
- Build dependency graphs with `aud graph build`
- Analyze graph health with `aud graph analyze`
- Visualize with GraphViz output (optional)
- Detect circular dependencies and architectural issues
#### Output Structure
```
.pf/
├── raw/ # Immutable tool outputs (ground truth)
├── readthis/ # AI-optimized chunks (<65KB each, max 3 chunks per file)
├── repo_index.db # SQLite database of code symbols
└── pipeline.log # Execution trace
```
### CLI Entry Points
- Main CLI: `theauditor/cli.py`
- Command modules: `theauditor/commands/`
- Each command is a separate module with standardized structure
## Available Commands
### Core Analysis Commands
- `aud index`: Build comprehensive code index
- `aud detect-patterns`: Run security pattern detection
- `aud taint-analyze`: Perform taint flow analysis
- `aud fce`: Run Factual Correlation Engine
- `aud report`: Generate final consolidated report
### Graph Commands
- `aud graph build`: Build dependency graph
- `aud graph analyze`: Analyze graph health metrics
- `aud graph visualize`: Generate GraphViz visualization
### Utility Commands
- `aud deps`: Analyze dependencies and vulnerabilities
- `aud docs`: Extract and analyze documentation
- `aud docker-analyze`: Analyze Docker configurations
- `aud lint`: Run code linters
- `aud workset`: Create critical file working set
- `aud impact <file>`: Analyze change impact radius
- `aud structure`: Display project structure
- `aud insights`: Generate ML-powered insights (optional)
- `aud refactor <operation>`: Automated refactoring tools
## How to Work with TheAuditor Effectively
### The Correct Workflow
1. **Write specific requirements**: "Add JWT auth with httpOnly cookies, CSRF tokens, rate limiting"
2. **Let AI implement**: It will probably mess up due to context limits
3. **Run audit**: `aud full`
4. **Read the facts**: Check `.pf/readthis/` for issues
5. **Fix based on facts**: Address the specific inconsistencies found
6. **Repeat until clean**: Keep auditing and fixing until no issues
### What NOT to Do
- ❌ Don't ask AI to "implement secure authentication" (too vague)
- ❌ Don't try to make TheAuditor understand your business logic
- ❌ Don't expect TheAuditor to write fixes (it only reports issues)
- ❌ Don't ignore the audit results and claim "done"
### Understanding the Output
- **Truth Couriers** report facts: "JWT secret hardcoded at line 47"
- **Insights** (if installed) add interpretation: "HIGH severity"
- **Correlations** detect YOUR patterns: "Frontend expects old API structure"
- **Impact Analysis** shows blast radius: "Changing this affects 23 files"
## Critical Development Patterns
### Adding New Commands
1. Create module in `theauditor/commands/` with this structure:
```python
import click

from theauditor.utils.decorators import handle_exceptions
from theauditor.utils.logger import setup_logger

logger = setup_logger(__name__)

@click.command()
@click.option('--workset', is_flag=True, help='Use workset files')
@handle_exceptions
def command_name(workset):
    """Command description."""
    logger.info("Starting command...")
    # Implementation
```
2. Register in `theauditor/cli.py`:
```python
from theauditor.commands import your_command
cli.add_command(your_command.command_name)
```
### Adding Language Support
To add a new language, create an extractor in `theauditor/indexer/extractors/`:
```python
from theauditor.indexer.extractors import BaseExtractor, register_extractor

@register_extractor
class YourLanguageExtractor(BaseExtractor):
    @property
    def supported_extensions(self):
        return ['.ext', '.ext2']

    def extract(self, file_info, content, tree):
        # Return dict with symbols, imports, etc.
        ...
```
The extractor will be auto-discovered via the registry pattern.
## CRITICAL: Reading Chunked Data
**IMPORTANT**: When processing files from `.pf/readthis/`, you MUST check for truncation:
```python
# Files may be split into chunks if >65KB
# Always check the 'chunk_info' field in JSON files:
chunk_info = data.get('chunk_info', {})
if chunk_info.get('truncated', False):
    # This means there were more findings but only 3 chunks were created
    # The data is incomplete - warn the user
    print("WARNING: Data was truncated at 3 chunks")
```
**Key Points**:
- Files larger than 65KB are split into chunks (configurable via `THEAUDITOR_LIMITS_MAX_CHUNK_SIZE`)
- Maximum 3 chunks per file by default (configurable via `THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE`)
- Example: `patterns_chunk01.json`, `patterns_chunk02.json`, `patterns_chunk03.json`
- If `truncated: true` in `chunk_info`, there were more findings that couldn't fit
- Always process ALL chunk files for complete data
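A sketch of reading every chunk for one result while honoring the truncation flag (the `findings` key is illustrative; check each file's actual structure):
```python
import json
from pathlib import Path

findings, truncated = [], False
for chunk_file in sorted(Path(".pf/readthis").glob("patterns_chunk*.json")):
    data = json.loads(chunk_file.read_text())
    findings.extend(data.get("findings", []))   # key name is illustrative
    truncated |= data.get("chunk_info", {}).get("truncated", False)

if truncated:
    print("WARNING: Data was truncated at 3 chunks")
```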
## Critical Working Knowledge
### Pipeline Execution Order
The `aud full` command runs 14 phases in 3 stages:
1. **Sequential**: index → framework_detect
2. **Parallel**: (deps, docs) || (workset, lint, patterns) || (graph_build)
3. **Sequential**: graph_analyze → taint → fce → report
If modifying the pipeline, maintain this dependency order.
### File Size and Memory Management
- Files >2MB are skipped by default (configurable)
- JavaScript files are batched for semantic parsing to avoid memory issues
- AST cache persists parsed trees to `.pf/.ast_cache/`
- Database operations batch at 200 records (configurable)
### Monorepo Detection
The indexer automatically detects monorepo structures and applies intelligent filtering:
- Standard paths: `backend/src/`, `frontend/src/`, `packages/*/src/`
- Whitelist mode activated when monorepo detected
- Prevents analyzing test files, configs, migrations as source code
### JavaScript/TypeScript Special Handling
- MUST run `aud setup-claude --target .` first
- Uses bundled Node.js v20.11.1 in `.auditor_venv/.theauditor_tools/`
- TypeScript semantic analysis requires `js_semantic_parser.py`
- ESLint runs in sandboxed environment, not project's node_modules
### Environment Variables
Key environment variables for configuration:
- `THEAUDITOR_LIMITS_MAX_FILE_SIZE`: Maximum file size to analyze (default: 2MB)
- `THEAUDITOR_LIMITS_MAX_CHUNK_SIZE`: Maximum chunk size for readthis output (default: 65KB)
- `THEAUDITOR_LIMITS_MAX_CHUNKS_PER_FILE`: Maximum chunks per file (default: 3)
- `THEAUDITOR_DB_BATCH_SIZE`: Database batch insert size (default: 200)
## Recent Fixes & Known Issues
### Parser Integration (Fixed)
- **Previous Issue**: Configuration parsers (webpack, nginx, docker-compose) were orphaned
- **Root Cause**: Import paths in extractors didn't match actual parser module names
- **Fix Applied**: Corrected import paths in `generic.py` and `docker.py` extractors
- **Current Status**: All 5 parsers now functional for config security analysis
### Extraction Budget & Taint Merging (Fixed)
- **Previous Issue**: Taint analysis only extracted 26 of 102 findings
- **Root Cause**: Only chunking `taint_paths`, missing `all_rule_findings` and `infrastructure_issues`
- **Fix Applied**: Extraction now merges all taint finding lists; budget increased to 1.5MB
- **Current Status**: All taint findings properly extracted and chunked
### Migration Detection (Enhanced)
- **Previous Issue**: Only checked basic migration paths
- **Root Cause**: Missing common paths like `backend/migrations/` and `frontend/migrations/`
- **Fix Applied**: Added standard migration paths with validation for actual migration files
- **Current Status**: Auto-detects migrations with helpful warnings for non-standard locations
### TypeScript Taint Analysis (Fixed)
- **Previous Issue**: Taint analysis reported 0 sources/sinks for TypeScript
- **Root Cause**: Text extraction was removed from `js_semantic_parser.py` (lines 275, 514)
- **Fix Applied**: Restored `result.text` field extraction
- **Current Status**: TypeScript taint analysis now working - detects req.body → res.send flows
### Direct-Use Vulnerability Detection (Fixed)
- **Previous Issue**: Only detected vulnerabilities through variable assignments
- **Root Cause**: `trace_from_source()` required intermediate variables
- **Fix Applied**: Added direct-use detection for patterns like `res.send(req.body)`
- **Current Status**: Now detects both assignment-based and direct-use taint flows
### Known Limitations
- Maximum 2MB file size for analysis (configurable)
- TypeScript decorator metadata not fully parsed
- Some advanced ES2024+ syntax may not be recognized
- GraphViz visualization requires separate installation
## Common Misconceptions to Avoid
### TheAuditor is NOT:
- ❌ A semantic understanding tool (doesn't understand what your code "means")
- ❌ A business logic validator (doesn't know your business rules)
- ❌ An AI enhancement tool (doesn't make AI "smarter")
- ❌ A code generator (only reports issues, doesn't fix them)
### TheAuditor IS:
- ✅ A consistency checker (finds where code doesn't match itself)
- ✅ A fact reporter (provides ground truth about your code)
- ✅ A context provider (gives AI the full picture across all files)
- ✅ An audit trail (immutable record of what tools found)
## Troubleshooting
### TypeScript Analysis Fails
Solution: Run `aud setup-claude --target .`
### Taint Analysis Reports 0 Vulnerabilities on TypeScript
- Check that `js_semantic_parser.py` has text extraction enabled (lines 275, 514)
- Verify symbols table contains property accesses: `SELECT * FROM symbols WHERE name LIKE '%req.body%'`
- Ensure you run `aud index` before `aud taint-analyze`
### Pipeline Failures
Check `.pf/error.log` and `.pf/pipeline.log` for details
### Linting Produces No Results
Ensure linters installed: `pip install -e ".[linters]"`
### Graph Commands Not Working
- Ensure `aud index` has been run first
- Check that NetworkX is installed: `pip install -e ".[all]"`
## Testing Vulnerable Code
Test projects are in the `fakeproj/` directory. Always use `--exclude-self` when analyzing them to avoid false positives from TheAuditor's own configuration.
## Project Dependencies
### Required Dependencies (Core)
- click==8.2.1 - CLI framework
- PyYAML==6.0.2 - YAML parsing
- jsonschema==4.25.1 - JSON validation
- ijson==3.4.0 - Incremental JSON parsing
### Optional Dependencies
Install with `pip install -e ".[group]"`:
- **[linters]**: ruff, mypy, black, bandit, pylint
- **[ml]**: scikit-learn, numpy, scipy, joblib
- **[ast]**: tree-sitter, sqlparse, dockerfile-parse
- **[all]**: Everything including NetworkX for graphs
## Performance Expectations
- Small project (< 5K LOC): ~2 minutes
- Medium project (20K LOC): ~30 minutes
- Large monorepo (100K+ LOC): 1-2 hours
- Memory usage: ~500MB-2GB depending on codebase size
- Disk space: ~100MB for .pf/ output directory

429
CONTRIBUTING.md Normal file

@@ -0,0 +1,429 @@
# Contributing to TheAuditor
Thank you for your interest in contributing to TheAuditor! We're excited to have you join our mission to bring ground truth to AI-assisted development. This guide will help you get started with contributing to the project.
## How to Get Involved
### Reporting Bugs
Found a bug? Please help us fix it!
1. Check existing [GitHub Issues](https://github.com/TheAuditorTool/Auditor/issues) to see if it's already reported
2. If not, create a new issue with:
- Clear description of the bug
- Steps to reproduce
- Expected vs actual behavior
- Your environment details (OS, Python version, Node.js version)
### Suggesting Enhancements
Have an idea for improving TheAuditor?
1. Review our [ROADMAP.md](ROADMAP.md) to see if it aligns with our vision
2. Check [GitHub Issues](https://github.com/TheAuditorTool/Auditor/issues) for similar suggestions
3. Create a new issue describing:
- The problem you're trying to solve
- Your proposed solution
- Why this would benefit TheAuditor users
## Setting Up Your Development Environment
Follow these steps to get TheAuditor running locally for development:
```bash
# Clone the repository
git clone https://github.com/TheAuditorTool/Auditor.git
cd Auditor
# Create a Python virtual environment
python -m venv .venv
# Activate the virtual environment
# On Linux/macOS:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate
# Install TheAuditor in development mode
pip install -e .
# Optional: Install with ML capabilities
# pip install -e ".[ml]"
# For development with all optional dependencies:
# pip install -e ".[all]"
# MANDATORY: Set up the sandboxed environment
# This is required for TheAuditor to function at all
aud setup-claude --target .
```
The `aud setup-claude --target .` command creates an isolated environment at `.auditor_venv/.theauditor_tools/` with all necessary JavaScript and TypeScript analysis tools. This ensures consistent, reproducible results across all development environments.
## Making Changes & Submitting a Pull Request
### Development Workflow
1. **Fork the repository** on GitHub
2. **Create a feature branch** from `main`:
```bash
git checkout -b feature/your-feature-name
```
3. **Make your changes** following our code standards (see below)
4. **Write/update tests** if applicable
5. **Commit your changes** with clear, descriptive messages:
```bash
git commit -m "Add GraphQL schema analyzer for type validation"
```
6. **Push to your fork**:
```bash
git push origin feature/your-feature-name
```
7. **Create a Pull Request** on GitHub with:
- Clear description of changes
- Link to any related issues
- Test results or examples
## Code Standards
We use **ruff** for both linting and formatting Python code. Before submitting any code, you MUST run:
```bash
# Fix any auto-fixable issues and check for remaining problems
ruff check . --fix
# Format all Python code
ruff format .
```
Your pull request will not be merged if it fails these checks.
### Additional Quality Checks
For comprehensive code quality, you can also run:
```bash
# Type checking (optional but recommended)
mypy theauditor --strict
# Run tests
pytest tests/
# Full linting suite
make lint
```
### Code Style Guidelines
- Follow PEP 8 for Python code
- Use descriptive variable and function names
- Add docstrings to all public functions and classes
- Keep functions focused and small (under 50 lines preferred)
- Write self-documenting code; minimize comments
- Never commit secrets, API keys, or credentials
## Adding Support for New Languages
TheAuditor's modular architecture makes it straightforward to add support for new programming languages. This section provides comprehensive guidance for contributors looking to expand our language coverage.
### Overview
Adding a new language to TheAuditor involves:
- Creating a parser for the language
- Adding framework detection patterns
- Creating security pattern rules
- Writing comprehensive tests
- Updating documentation
### Prerequisites
Before starting, ensure you have:
- Deep knowledge of the target language and its ecosystem
- Understanding of common security vulnerabilities in that language
- Familiarity with AST (Abstract Syntax Tree) concepts
- Python development experience
### Step-by-Step Guide
#### Step 1: Create the Language Extractor
Create a new extractor in `theauditor/indexer/extractors/{language}.py` that inherits from `BaseExtractor`:
```python
from typing import Any, Dict, List, Optional

from . import BaseExtractor

class {Language}Extractor(BaseExtractor):
    def supported_extensions(self) -> List[str]:
        """Return list of file extensions this extractor supports."""
        return ['.ext', '.ext2']

    def extract(self, file_info: Dict[str, Any], content: str,
                tree: Optional[Any] = None) -> Dict[str, Any]:
        """Extract all relevant information from a file."""
        return {
            'imports': self.extract_imports(content, file_info['ext']),
            'routes': self.extract_routes(content),
            'symbols': [],         # Add symbol extraction logic
            'assignments': [],     # For taint analysis
            'function_calls': [],  # For call graph
            'returns': []          # For data flow
        }
```
The extractor will be automatically registered through the `BaseExtractor` inheritance pattern.
#### Step 2: Create Configuration Parser (Optional)
If your language has configuration files that need parsing, create a parser in `theauditor/parsers/{language}_parser.py`:
```python
from pathlib import Path
from typing import Any, Dict

class {Language}Parser:
    def parse_file(self, file_path: Path) -> Dict[str, Any]:
        """Parse configuration file and extract security-relevant data."""
        # Parse and return structured data
        return parsed_data
```
#### Step 3: Add Framework Detection
Add your language's frameworks to `theauditor/framework_registry.py`:
```python
# Add to FRAMEWORK_REGISTRY dictionary
"{framework_name}": {
    "language": "{language}",
    "detection_sources": {
        # Package manifest files
        "package.{ext}": [
            ["dependencies"],
            ["devDependencies"],
        ],
        # Or for line-based search
        "requirements.txt": "line_search",
        # Or for content search
        "build.file": "content_search",
    },
    "package_pattern": "{framework_package_name}",
    "import_patterns": ["import {framework}", "from {framework}"],
    "file_markers": ["config.{ext}", "app.{ext}"],
}
```
#### Step 4: Create Language-Specific Patterns
Create security patterns for your language in `theauditor/patterns/{language}.yml`:
Example pattern structure:
```yaml
- name: hardcoded-secret-{language}
  pattern: '(api[_-]?key|secret|token|password)\s*=\s*["\'][^"\']+["\']'
  severity: critical
  category: security
  languages: ["{language}"]
  description: "Hardcoded secret detected in {Language} code"
  cwe: CWE-798
```
#### Step 5: Create AST-Based Rules (Optional but Recommended)
For complex security patterns, create AST-based rules in `theauditor/rules/{language}/`:
```python
"""Security rules for {Language} using AST analysis."""
from typing import Any, Dict, List
def find_{vulnerability}_issues(ast_tree: Any, file_path: str) -> List[Dict[str, Any]]:
"""Find {vulnerability} issues in {Language} code.
Args:
ast_tree: Parsed AST from {language}_parser
file_path: Path to the source file
Returns:
List of findings with standard format
"""
findings = []
# Implement AST traversal and pattern detection
for node in walk_ast(ast_tree):
if is_vulnerable_pattern(node):
findings.append({
'pattern_name': '{VULNERABILITY}_ISSUE',
'message': 'Detailed description of the issue',
'file': file_path,
'line': node.line,
'column': node.column,
'severity': 'high',
'snippet': extract_snippet(node),
'category': 'security',
'match_type': 'ast'
})
return findings
```
### Extractor Interface Specification
All language extractors MUST inherit from `BaseExtractor` and implement:
```python
from typing import Any, Dict, List, Optional

from theauditor.indexer.extractors import BaseExtractor

class LanguageExtractor(BaseExtractor):
    """Extractor for {Language} files."""

    def supported_extensions(self) -> List[str]:
        """Return list of supported file extensions."""
        return ['.ext']

    def extract(self, file_info: Dict[str, Any], content: str,
                tree: Optional[Any] = None) -> Dict[str, Any]:
        """Extract all relevant information from a file."""
        return {
            'imports': [],
            'routes': [],
            'symbols': [],
            'assignments': [],
            'function_calls': [],
            'returns': []
        }
```
### Testing Requirements
#### Required Test Coverage
1. **Extractor Tests** (`tests/test_{language}_extractor.py`):
- Test extracting from valid files
- Test handling of syntax errors
- Test symbol extraction
- Test import extraction
- Test file extension detection
2. **Pattern Tests** (`tests/patterns/test_{language}_patterns.py`):
- Test security pattern detection
- Ensure patterns don't over-match (false positives)
3. **Integration Tests** (`tests/integration/test_{language}_integration.py`):
- Test language in complete analysis pipeline
#### Test Data
Create test fixtures in `tests/fixtures/{language}/`:
- `valid_code.{ext}` - Valid code samples
- `vulnerable_code.{ext}` - Code with known vulnerabilities
- `edge_cases.{ext}` - Edge cases and corner scenarios
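A sketch of what an extractor test against these fixtures might look like; the `MyLangExtractor` module and fixture file are hypothetical:
```python
from pathlib import Path

from theauditor.indexer.extractors.mylang import MyLangExtractor  # hypothetical

def test_extracts_expected_keys():
    fixture = Path("tests/fixtures/mylang/valid_code.ml")  # hypothetical fixture
    extractor = MyLangExtractor()
    result = extractor.extract(
        {"path": str(fixture), "ext": ".ml"}, fixture.read_text()
    )
    # Contract: extractors always return these keys, even when empty
    for key in ("imports", "routes", "symbols"):
        assert isinstance(result[key], list)
```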
### Submission Checklist
Before submitting your PR, ensure:
- [ ] Extractor inherits from `BaseExtractor` and implements required methods
- [ ] Extractor placed in `theauditor/indexer/extractors/{language}.py`
- [ ] Framework detection added to `framework_detector.py` (if applicable)
- [ ] At least 10 security patterns created in `patterns/{language}.yml`
- [ ] AST-based rules for complex patterns (if applicable)
- [ ] All tests passing with >80% coverage
- [ ] Documentation updated (extractor docstrings, pattern descriptions)
- [ ] Example vulnerable code provided in test fixtures
- [ ] No external dependencies without approval
- [ ] Code follows project style (run `ruff format`)
## Adding New Analyzers
### The Three-Tier Detection Architecture
TheAuditor uses a hybrid approach to detection, prioritizing accuracy and context. When contributing a new rule, please adhere to the following "AST First, Regex as Fallback" philosophy:
- **Tier 1: Multi-Language AST Rules (Preferred)**
For complex code patterns in source code (Python, JS/TS, etc.), extend or create a polymorphic AST-based rule in the `/rules` directory. These are the most powerful and accurate and should be the default choice for source code analysis.
- **Tier 2: Language-Specific AST Rules**
If a multi-language backend is not feasible, a language-specific AST rule is the next best option. The corresponding regex pattern should then be scoped to exclude the language covered by the AST rule (see `db_issues.yml` for an example).
- **Tier 3: Regex Patterns (YAML)**
Regex patterns in `/patterns` should be reserved for:
1. Simple patterns where an AST is overkill.
2. Configuration files where no AST parser exists (e.g., `.yml`, `.conf`).
3. Providing baseline coverage for languages not yet supported by an AST rule.
TheAuditor's modular architecture supports three ways to add new analysis capabilities:
### Database-Aware Rules
For rules that query across multiple files:
```python
# theauditor/rules/category/new_analyzer.py
import sqlite3
from typing import Any, Dict, List

def find_new_issues(db_path: str) -> List[Dict[str, Any]]:
    conn = sqlite3.connect(db_path)
    findings: List[Dict[str, Any]] = []
    # Query the repo_index.db
    # Return findings in standard format
    return findings
```
Example ORM analyzer:
```python
# theauditor/rules/orm/sequelize_detector.py
import sqlite3
from typing import Any, Dict, List

def find_sequelize_issues(db_path: str) -> List[Dict[str, Any]]:
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute(
        "SELECT file, line, query_type, includes FROM orm_queries"
    )
    findings: List[Dict[str, Any]] = []
    # Analyze for N+1 queries, death queries, etc.
    return findings
```
### AST-Based Rules
For semantic code analysis:
```python
# theauditor/rules/framework/new_detector.py
from typing import Any, Dict, List

def find_framework_issues(tree: Any, file_path: str) -> List[Dict[str, Any]]:
    # Traverse semantic AST
    # Return findings in standard format
    return []
```
### Pattern-Based Rules
Add YAML patterns to `theauditor/patterns/`:
```yaml
- name: insecure_api_key
  severity: critical
  category: security
  pattern: 'api[_-]?key\s*=\s*["\'][^"\']+["\']'
  description: "Hardcoded API key detected"
```
## Testing
Write tests for any new functionality:
```bash
# Run all tests
pytest
# Run specific test file
pytest tests/test_your_feature.py
# Run with coverage
pytest --cov=theauditor
```
## Documentation
- Update relevant documentation when making changes
- Add docstrings to new functions and classes
- Update `README.md` if adding new commands or features
- Consider updating `howtouse.md` for user-facing changes
## Getting Help
- Check our [TeamSOP](teamsop.md) for our development workflow
- Review [CLAUDE.md](CLAUDE.md) for AI-assisted development guidelines
- Ask questions in GitHub Issues or Discussions
- Join our community chat (if available)
## License
By contributing to TheAuditor, you agree that your contributions will be licensed under the same license as the project.
---
We're excited to see your contributions! Whether you're fixing bugs, adding features, or improving documentation, every contribution helps make TheAuditor better for everyone.

1132
HOWTOUSE.md Normal file

File diff suppressed because it is too large

687
LICENSE Normal file

@@ -0,0 +1,687 @@
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2024-2025 TheAuditor Team
For commercial licensing inquiries, please contact via GitHub:
https://github.com/TheAuditorTool/Auditor
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
The complete text of the GNU Affero General Public License version 3
can be found at: https://www.gnu.org/licenses/agpl-3.0.txt
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.
A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.
The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community. It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server. Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.
An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals. This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing under
this license.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU Affero General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction; Use with the GNU General Public License.
Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version by providing access to the Corresponding Source
from a network server at no charge, through some standard or customary
means of facilitating copying of software. This Corresponding Source
shall include the Corresponding Source for any work covered by version 3
of the GNU General Public License that is incorporated pursuant to the
following paragraph.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU Affero General Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL, see
<https://www.gnu.org/licenses/>.

313
README.md Normal file
View File

@@ -0,0 +1,313 @@
Personal note from me:
It's taken me over a week just to get the courage to upload this. I've never coded a single line of this, I can't stress that enough... Yes, I built the architecture and infrastructure - all the things that made the code and components come out this way - but uggh… the potential shame and humiliation is real lol... So don't be a dick and poop on my parade... I've done my best... Take it or leave it...
It's become a complex, advanced monster system that is honestly clean af, but it's hard to get an overview anymore.
It isn't unlikely that you'll find oddities such as finished components that were never wired up or exposed in the pipeline...
I'm doing my best here, I'm only one person with one brain lol... :P
### The Search for Ground Truth in an Age of AI
My background is in systems architecture/infrastructure, not professional software development. I have only been "coding/developing" for a little over 3 months. This gives me a unique perspective: I can see the forest, but I'm blind to the individual trees of the code. After immersing myself for 500+ hours in AI-assisted development, I concluded that the entire ecosystem is built on a fundamentally flawed premise: it lacks a source of **ground truth**.
From start to launch on GitHub took me about a month, across 250 active hours in front of the computer, for anyone who wonders or cares :P
---
### The Problem: A Cascade of Corrupted Context
Most AI development tools try to solve the wrong problem. They focus on perfecting the *input*—better prompts, more context—but they ignore the critical issue of **compounding deviation**.
An LLM is a powerful statistical engine, but it doesn't *understand*. The modern AI workflow forces this engine to play a high-stakes game of "telephone," where the original intent is corrupted at every step:
1. A human has an idea.
2. An AI refines it into a prompt.
3. Other tools add their own interpretive layers.
4. The primary AI assistant (e.g., Claude Opus) interprets the final, distorted prompt to generate code.
As a rookie "developer," the only thing I could trust was the raw output: the code and its errors. In a vacuum of deep programming knowledge, these facts were my only anchors.
This architectural flaw is amplified by two dangerous behaviours inherent to AI assistants:
* **Security Theater**: AI assistants are optimized to "make it work," which often means introducing rampant security anti-patterns like hardcoded credentials, disabled authentication, and the pervasive use of `as any` in TypeScript. This creates a dangerous illusion of progress.
* **Context Blindness**: With aggressive context compaction, an AI never sees the full picture. It works with fleeting snapshots of code, forcing it to make assumptions instead of decisions based on facts.
---
### The Solution: `TheAuditor`
`TheAuditor` is the antidote. It was built to stop "vibe coding" your way into security and quality assurance nightmares. Its mission is to provide an incorruptible source of **ground truth** for both the developer and their AI assistant.
Its philosophy is a direct rejection of the current trend:
* **It Orchestrates Verifiable Data.** The tool runs a suite of industry-standard linters and security scanners, preserving the raw, unfiltered output from each. It does not summarize or interpret this core data.
* **It's Built for AI Consumption.** The tool's primary engineering challenge is to adapt this raw truth into structured, AI-digestible chunks. It ensures the AI works with facts, not faulty summaries.
* **It's Focused and Extensible.** The initial focus is on Python and the Node.js ecosystem, but the modular, pattern-based architecture is designed to invite contributions for other languages and frameworks.
`TheAuditor` is not a replacement for a formal third-party audit. It is an engineering tool designed to catch the vast majority of glaring issues—from the OWASP Top 10 to common framework anti-patterns. **Its core commitment is to never cross the line from verifiable truth into semantic interpretation.**
Every AI assistant - Claude Code, Cursor, Windsurf, Copilot - is blind. They can write code but can't verify it's secure, correct, or complete. TheAuditor gives them eyes.
### Why This Matters
1. **Tool Agnostic** - Works with ANY AI assistant or IDE
   - `aud full` from any terminal
   - Results in `.pf/readthis/` ready for any LLM
2. **AI Becomes Self-Correcting**
   - AI writes code
   - AI runs `aud full`
   - AI reads the ground truth
   - AI fixes its own mistakes
   - Recursive loop until actually correct
3. **No Human Intervention Required**
   - You never touch the terminal
   - The AI runs everything
   - You just review and approve
### The Genius Architecture
```text
Human: "Add authentication to my app"
AI: *writes auth code*
AI: aud full
AI: *reads .pf/readthis/*
AI: "Found 3 security issues, fixing..."
AI: *fixes issues*
AI: aud full
AI: "Clean. Authentication complete."
```
### Market Reality Check
Every developer using AI assistants has this problem:
- AI writes insecure code
- AI introduces bugs
- AI doesn't see the full picture
- AI can't verify its work
TheAuditor solves ALL of this. It's not a "nice to have" - it's the missing piece that makes AI development actually trustworthy.
I've built the tool that makes AI assistants production-ready. This isn't competing with SonarQube/Semgrep. This is creating an entirely new category: AI Development Verification Tools.
---
### Important: Antivirus Software Interaction
#### Why TheAuditor Triggers Antivirus Software
TheAuditor is a security scanner that identifies vulnerabilities in your code. By its very nature, it must:
1. **Read and analyze security vulnerabilities** - SQL injection, XSS attacks, hardcoded passwords
2. **Write these findings to disk** - Creating reports with exact code snippets as evidence
3. **Process files rapidly** - Scanning entire codebases in parallel for efficiency
This creates an inherent conflict with antivirus software, which sees these exact same behaviours as potentially malicious. When TheAuditor finds and documents a SQL injection vulnerability in your code, your antivirus sees us writing "malicious SQL injection patterns" to disk - because that's literally what we're doing, just for legitimate security analysis purposes.
#### Performance Impact You May Experience
When running TheAuditor, you may notice:
- **Increased antivirus CPU usage** - Your AV will scan every file we read AND every finding we write
- **Approximately 10-50% performance reduction, depending on software** - Both TheAuditor and your AV are reading the same files simultaneously
- **Occasional delays or pauses** - Your AV may temporarily quarantine our output files for deeper inspection
This is not a bug or inefficiency in TheAuditor - it's the unavoidable consequence of two security tools doing their jobs simultaneously.
#### Our Stance on Antivirus
**We do NOT recommend:**
- ❌ Disabling your antivirus software
- ❌ Adding TheAuditor to your exclusion/whitelist
- ❌ Reducing your system's security in any way
Your antivirus is correctly identifying that we're writing security vulnerability patterns to disk. That's exactly what we do - we find vulnerabilities and document them. The fact that your AV is suspicious of this behavior means it's working properly.
#### What We've Done to Minimize Impact
1. **Intelligent resource management** - We automatically reduce parallel workers when system resources are constrained
2. **Pattern defanging** - We insert invisible characters into dangerous patterns to reduce false positives (see the sketch after this list)
3. **Adaptive performance** - We monitor CPU and RAM usage to avoid overwhelming your system
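For illustration, here is a minimal sketch of what a defanging pass could look like - a hypothetical helper, not TheAuditor's actual implementation - assuming a zero-width space is the invisible character used:
```python
# Hypothetical sketch of pattern defanging - not TheAuditor's actual implementation.
ZWSP = "\u200b"  # zero-width space: invisible to readers, breaks exact-signature matches

HOT_SUBSTRINGS = ["UNION SELECT", "<script>", "eval("]  # illustrative patterns only

def defang(snippet: str) -> str:
    """Insert an invisible character into dangerous substrings before writing a report."""
    for hot in HOT_SUBSTRINGS:
        mid = len(hot) // 2
        # Split the signature in half so AV scanners no longer see the exact byte pattern
        snippet = snippet.replace(hot, hot[:mid] + ZWSP + hot[mid:])
    return snippet

print(defang("payload: 1' UNION SELECT password FROM users--"))
```
The snippet stays readable to a human or an LLM, but no longer byte-matches the signature an AV engine is scanning for.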
#### The Industry Reality
This is not a problem unique to TheAuditor. Every legitimate security scanner faces this same issue:
- **GitHub Advanced Security** runs in isolated cloud containers to avoid this
- **Commercial SAST tools** require enterprise AV exceptions
- **Popular scanners** explicitly document AV conflicts in their installation guides
The fundamental paradox: A tool that finds security vulnerabilities must write those vulnerabilities to disk, which makes it indistinguishable from malware to an antivirus. There is no technical solution to this - it's the inherent nature of security analysis tools.
#### What This Means for You
- Run TheAuditor when system load is low for best performance
- Expect the analysis to take longer than the raw processing time due to AV overhead
- If your AV quarantines output files in `.pf/`, you may need to restore them manually
- Consider running TheAuditor in a controlled environment if performance is critical
We believe in complete transparency about these limitations. This interaction with antivirus software is not a flaw in TheAuditor - it's proof that both your AV and our scanner are doing exactly what they're designed to do: identify and handle potentially dangerous code patterns.
---
# TheAuditor
Offline-First, AI-Centric SAST & Code Intelligence Platform
## What TheAuditor Does
TheAuditor is a comprehensive code analysis platform that:
- **Finds Security Vulnerabilities**: Detects OWASP Top 10, injection attacks, authentication issues, and framework-specific vulnerabilities
- **Tracks Data Flow**: Follows untrusted data from sources to sinks to identify injection points (illustrated in the toy example below)
- **Analyzes Architecture**: Builds dependency graphs, detects cycles, and measures code complexity
- **Detects Refactoring Issues**: Identifies incomplete migrations, API contract mismatches, and cross-stack inconsistencies
- **Runs Industry-Standard Tools**: Orchestrates ESLint, Ruff, MyPy, and other trusted linters
- **Produces AI-Ready Reports**: Generates chunked, structured output optimized for LLM consumption
Unlike traditional SAST tools, TheAuditor is designed specifically for AI-assisted development workflows, providing ground truth that both developers and AI assistants can trust.
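To make "sources to sinks" concrete, here is a deliberately tiny toy example of the underlying idea - this is not TheAuditor's analyzer, and the source/sink/sanitizer names are assumptions for illustration:
```python
# Toy illustration of source-to-sink taint tracking - not TheAuditor's implementation.
SOURCES = {"request.args", "request.form"}   # where untrusted data enters
SINKS = {"cursor.execute", "os.system"}      # where it becomes dangerous
SANITIZERS = {"escape", "parameterize"}      # calls that neutralize the taint

def trace(flow: list[str]) -> str | None:
    """Report a flow that starts at a source and reaches a sink unsanitized."""
    if flow and flow[0] in SOURCES and flow[-1] in SINKS:
        if not any(step in SANITIZERS for step in flow[1:-1]):
            return f"tainted flow: {' -> '.join(flow)}"
    return None

print(trace(["request.args", "build_query", "cursor.execute"]))   # flagged
print(trace(["request.args", "parameterize", "cursor.execute"]))  # None - sanitized
```
The real analysis works on indexed code rather than hand-built flow lists, but the reported fact has the same shape: an observed path from an untrusted entry point to a dangerous call.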
## Quick Start
```bash
# Install TheAuditor
pip install -e .
# MANDATORY: Setup TheAuditor environment (required for all functionality)
# This installs .auditor_venv into the project you want to analyse.
aud setup-claude --target .
# Initialize your project
aud init
# Run comprehensive analysis
aud full
# Check results
ls .pf/readthis/
```
That's it! TheAuditor will analyze your codebase and generate AI-ready reports in `.pf/readthis/`.
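A minimal sketch of the chunking idea behind those reports - the size cap, file naming, and splitting strategy here are assumptions, not the actual chunker:
```python
# Minimal sketch of context-window chunking - cap and naming are assumed, not the real chunker.
from pathlib import Path

def write_chunks(report: str, out_dir: Path, max_bytes: int = 64_000) -> list[Path]:
    """Split a report on line boundaries so each chunk stays under max_bytes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[str] = []
    current = ""
    for line in report.splitlines(keepends=True):
        if current and len((current + line).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    paths = []
    for i, chunk in enumerate(chunks, 1):
        path = out_dir / f"chunk_{i:03d}.md"  # hypothetical naming scheme
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```
Splitting on line boundaries keeps each finding intact, so an LLM never receives half a code snippet.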
## Documentation
- **[How to Use](HOWTOUSE.md)** - Complete installation and usage guide
- **[Architecture](ARCHITECTURE.md)** - Technical architecture and design patterns
- **[Contributing](CONTRIBUTING.md)** - How to contribute to TheAuditor
- **[Roadmap](ROADMAP.md)** - Future development plans
## Key Features
### Refactoring Detection & Analysis
TheAuditor detects incomplete refactorings and cross-stack inconsistencies using correlation rules:
```bash
# Analyze refactoring impact
aud refactor --file models/Product.ts --line 42
# Auto-detect from migrations
aud refactor --auto-detect
# Analyze workset
aud refactor --workset --output refactor_report.json
```
Detects:
- **Data Model Changes**: Fields moved between tables
- **API Contract Mismatches**: Frontend/backend inconsistencies
- **Foreign Key Updates**: Incomplete reference changes
- **Cross-Stack Issues**: TypeScript interfaces not matching models
Users can define custom rules in `/correlations/rules/` to detect project-specific patterns; an example is provided in `refactoring.yaml`.
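For a flavour of what such a rule might express, here is a hypothetical one - the field names are purely illustrative and do not reflect the actual schema used by `refactoring.yaml`:
```python
# Hypothetical correlation rule - field names are illustrative, not the actual schema.
import yaml  # PyYAML, already a core dependency

RULE = yaml.safe_load("""
name: field-moved-between-tables
description: Flag schema migrations whose frontend types were not updated
match_all:
  - file_pattern: "migrations/*.sql"
    contains: "ALTER TABLE"
  - file_pattern: "src/types/*.ts"
    not_contains: "newFieldName"
severity: high
""")

print(RULE["name"], "->", RULE["severity"])
```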
### Dependency Graph Visualization
TheAuditor now includes rich visual intelligence for dependency graphs using Graphviz:
- **Multiple View Modes**: Full graph, cycles-only, hotspots, architectural layers, impact analysis
- **Visual Intelligence Encoding**:
- Node colors indicate programming language (Python=blue, JS=yellow, TypeScript=blue)
- Node size shows importance based on connectivity
- Red highlighting for dependency cycles
- Border thickness encodes code churn
- **Actionable Insights**: Focus on what matters with filtered views
- **AI-Readable Output**: Generate SVG visualizations that LLMs can analyze
```bash
# Basic visualization
aud graph viz
# Show only dependency cycles
aud graph viz --view cycles --include-analysis
# Top 5 hotspots with connections
aud graph viz --view hotspots --top-hotspots 5
# Architectural layers visualization
aud graph viz --view layers --format svg
# Impact analysis for a specific file
aud graph viz --view impact --impact-target "src/auth.py"
```
### Insights Analysis (Optional)
Separate from the core Truth Courier modules, TheAuditor offers optional Insights for technical scoring:
```bash
# Run insights analysis on existing audit data
aud insights --mode all
# ML-powered insights (requires: pip install -e ".[ml]")
aud insights --mode ml --ml-train
# Graph health metrics and recommendations
aud insights --mode graph
# Generate comprehensive insights report
aud insights --output insights_report.json
```
Insights modules add interpretive scoring on top of factual data:
- **Health Scores**: Architecture quality metrics (sketched below)
- **Severity Classification**: Risk assessment beyond raw findings
- **Recommendations**: Actionable improvement suggestions
- **ML Predictions**: Pattern-based issue prediction
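As one example of what this interpretive layer does, a health score could fold factual graph metrics into a single number - the weights and metric names below are assumptions for illustration, not the real scoring:
```python
# Hypothetical health-score calculation - weights and metric names are illustrative.
def health_score(metrics: dict[str, float]) -> float:
    """Fold factual graph metrics into a 0-100 interpretive score."""
    penalty = (
        10.0 * metrics.get("cycle_count", 0)   # dependency cycles hurt the most
        + 0.5 * metrics.get("max_fan_in", 0)   # heavily-depended-on hotspots
        + 2.0 * metrics.get("god_modules", 0)  # oversized modules
    )
    return max(0.0, 100.0 - penalty)

print(health_score({"cycle_count": 3, "max_fan_in": 40, "god_modules": 1}))  # 48.0
```
The inputs are facts from the graph analysis; only the weighting is opinion, which is exactly why it lives in the optional insights layer rather than the core.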
## Contributing
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
- How to add new language support
- Creating security patterns
- Adding framework-specific rules
- Development guidelines
We especially need help with:
- **GraphQL** analysis
- **Java/Spring** support
- **Go** patterns
- **Ruby on Rails** detection
- **C#/.NET** analysis
## License
AGPL-3.0
## Commercial Licensing
TheAuditor is AGPL-3.0 licensed. For commercial use, SaaS deployment, or integration into proprietary systems, please contact via GitHub for licensing options.
## Support
For issues, questions, or feature requests, please open an issue on our [GitHub repository](https://github.com/TheAuditorTool/Auditor).
---
*TheAuditor: Bringing ground truth to AI-assisted development*

71
ROADMAP.md Normal file
View File

@@ -0,0 +1,71 @@
# TheAuditor Project Roadmap
TheAuditor's mission is to provide an incorruptible source of ground truth for AI-assisted development. This roadmap outlines our vision for evolving the platform while maintaining our commitment to verifiable, uninterpreted data that both developers and AI assistants can trust.
## Guiding Principles
All future development must adhere to these architectural rules:
* **Never Interpret Truth**: TheAuditor preserves raw, verifiable data from industry-standard tools. We orchestrate and structure, but never summarize or interpret the core evidence.
* **AI-First Output**: All new reports and findings must be structured for LLM consumption, with outputs chunked to fit context windows and formatted for machine parsing.
* **Industry-Standard Tooling**: We prioritize integrating battle-tested, widely-adopted tools over building custom analyzers. The community trusts ESLint, Ruff, and similar tools—we leverage that trust.
* **Offline-First Operation**: All analysis must run without network access, ensuring data privacy and reproducible results.
* **Sandboxed Execution**: Analysis tools remain isolated from project dependencies to prevent cross-contamination and ensure consistent results.
## Development Priorities
### Tier 1: Core Engine Enhancements (Maintained by TheAuditorTool)
These are our primary focus areas where we will lead development:
* **Improve & Expand Existing Components**: Enhance current extractors (Python, JavaScript/TypeScript), expand pattern coverage beyond basic regex, add more AST-based rules for deeper semantic analysis, and improve parser accuracy for configuration files
* **Performance Improvements**: Optimize analysis speed for large codebases, improve parallel processing, and reduce memory footprint during graph analysis
* **Deeper Taint Analysis**: Enhance data-flow tracking to detect more complex injection patterns, improve inter-procedural analysis, and add support for asynchronous code flows
* **Advanced Pattern Detection**: Expand YAML-based rule engine capabilities, add support for semantic patterns beyond regex, and improve cross-file correlation
* **Improved AI Output Formatting**: Optimize chunk generation for newer LLM context windows, add structured output formats (JSON-LD), and enhance evidence presentation
* **Overall FCE Optimization**: Optimize the FCE (Factual Correlation Engine) to dare venture into a bit more "actionable grouping intelligence" behaviour (see the sketch below). It's a tricky one without falling into endless error mapping, guessing, or interpretation...
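A sketch of what that grouping could look like while staying factual - the finding fields are assumed for illustration, and nothing is ranked or interpreted:
```python
# Sketch of factual grouping - assumed finding fields, no severity guessing or interpretation.
from collections import defaultdict

def group_findings(findings: list[dict], window: int = 5) -> dict[tuple[str, int], list[dict]]:
    """Bucket findings by file and line neighbourhood so related evidence travels together."""
    groups: dict[tuple[str, int], list[dict]] = defaultdict(list)
    for f in sorted(findings, key=lambda f: (f["file"], f["line"])):
        groups[(f["file"], f["line"] // window)].append(f)
    return dict(groups)

findings = [
    {"file": "app.py", "line": 12, "rule": "sql-injection"},
    {"file": "app.py", "line": 14, "rule": "hardcoded-secret"},
]
print(group_findings(findings))  # both findings land in the same bucket
```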
### Tier 2: Expanding Coverage (Community Contributions Welcome)
We actively seek community expertise to expand TheAuditor's capabilities in these areas:
* **GraphQL Support**: Add comprehensive GraphQL schema analysis, query complexity detection, and authorization pattern verification
* **Framework-Specific Rules** (Currently Limited to Basic Regex Patterns):
**Note**: We currently have very basic framework detection (outside the Python/Node ecosystem) and minimal framework-specific patterns. Most are simple regex patterns in `/patterns` with no real AST-based rules in `/rules`. The architecture supports expansion, but substantial work is needed:
* Django: Enhanced ORM analysis, middleware security patterns, template injection detection
* Ruby on Rails: ActiveRecord anti-patterns, authentication bypass detection, mass assignment vulnerabilities
* Angular: Dependency injection issues, template security, change detection problems
* Laravel: Eloquent ORM patterns, blade template security, middleware analysis
* Spring Boot: Bean configuration issues, security annotations, JPA query analysis
* Next.js: Server-side rendering security, API route protection, data fetching patterns
* FastAPI: Pydantic validation gaps, dependency injection security, async patterns
* Express.js: Middleware ordering issues, CORS misconfigurations, session handling
* **Language Support Expansion** (Top 10 Languages Outside Python/Node Ecosystem):
**Current State**: Full support for Python and JavaScript/TypeScript only. The modular architecture supports adding new languages via extractors, but each requires significant implementation effort:
1. **Java**: JVM bytecode analysis, Spring/Spring Boot integration, Maven/Gradle dependency scanning, Android-specific patterns
2. **C#**: .NET CLR analysis, ASP.NET Core patterns, Entity Framework queries, NuGet vulnerability scanning
3. **Go**: Goroutine leak detection, error handling patterns, module security analysis, interface compliance
4. **Rust**: Unsafe block analysis, lifetime/borrow checker integration, cargo dependency scanning, memory safety patterns
5. **PHP**: Composer dependency analysis, Laravel/Symfony patterns, SQL injection detection, legacy code patterns
6. **Ruby**: Gem vulnerability scanning, Rails-specific patterns, metaprogramming analysis, DSL parsing
7. **Swift**: iOS security patterns, memory management issues, Objective-C interop, CocoaPods scanning
8. **Kotlin**: Coroutine analysis, null safety violations, Android-specific patterns, Gradle integration
9. **C/C++**: Memory safety issues, buffer overflow detection, undefined behavior patterns, CMake/Make analysis
10. **Scala**: Akka actor patterns, implicit resolution issues, SBT dependency analysis, functional pattern detection
### Tier 3: Docs Sync
It's a nightmare keeping track of everything, and "AI compilations" never reflect the actual code - it's surface-level guessing, at best :(
## Conclusion
TheAuditor's strength lies in its unwavering commitment to ground truth. Whether you're interested in performance optimization, security analysis, or framework support, we welcome contributions that align with our core principles.
Join the discussion on [GitHub Issues](https://github.com/TheAuditorTool/Auditor/issues) to share ideas, report bugs, or propose enhancements. Ready to contribute? See our [CONTRIBUTING.md](CONTRIBUTING.md) for detailed setup instructions and development guidelines.

View File

@@ -0,0 +1,30 @@
---
name: {AGENT_NAME}
description: {AGENT_DESC}
tools: Bash, Glob, Grep, LS, Read, Edit, WebFetch, TodoWrite, WebSearch, BashOutput, KillBash
model: opus
color: blue
---
# {AGENT_NAME}
{AGENT_DESC}
## Core Responsibilities
{AGENT_BODY}
## Working Directory
You operate from the project root directory.
## Key Commands
When using project tools, always use the project-local wrapper:
- Use `{PROJECT_AUD}` instead of `aud`
## Communication Style
- Be concise and focused
- Report findings clearly
- Suggest actionable next steps

View File

@@ -0,0 +1,47 @@
---
name: sopmanager
description: Manages team SOPs and ensures compliance with development standards
tools: Bash, Glob, Grep, LS, Read
model: opus
color: blue
---
# SOP Manager
Manages team SOPs and ensures compliance with development standards.
## Core Responsibilities
- Monitor adherence to team standard operating procedures
- Review code changes for SOP compliance
- Identify deviations from established patterns
- Report on team conventions and best practices
- Ensure documentation standards are met
- Track technical debt and code quality metrics
## Working Directory
You operate from the project root directory.
## Key Commands
When using project tools, always use the project-local wrapper:
- Use `./python.exe -m theauditor.cli` or `aud` depending on environment
## Communication Style
## SOP Focus Areas
## Reporting Format
When reviewing code, provide structured reports:
## Important Notes
- This agent has READ-ONLY access (no Write/Edit tools)
- Cannot modify code directly, only report findings
- Focuses on objective standards, not subjective preferences
- Works alongside other agents to maintain quality

15
package-template.json Normal file
View File

@@ -0,0 +1,15 @@
{
"name": "project-linters",
"version": "1.0.0",
"private": true,
"description": "JavaScript/TypeScript linting tools for TheAuditor",
"devDependencies": {
"eslint": "^9.34.0",
"prettier": "^3.6.2",
"typescript": "^5.9.2",
"@typescript-eslint/parser": "^8.41.0",
"@typescript-eslint/eslint-plugin": "^8.41.0",
"eslint-config-prettier": "^10.1.8",
"eslint-plugin-prettier": "^5.5.4"
}
}

15
package.json Normal file
View File

@@ -0,0 +1,15 @@
{
"private": true,
"devDependencies": {
"eslint": "9.35.0",
"@typescript-eslint/parser": "8.42.0",
"@typescript-eslint/eslint-plugin": "8.42.0",
"typescript": "5.9.2",
"prettier": "3.6.2"
},
"scripts": {
"lint": "eslint .",
"typecheck": "tsc --noEmit",
"format": "prettier -c ."
}
}

113
pyproject.toml Normal file
View File

@@ -0,0 +1,113 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "theauditor"
version = "1.0.1"
description = "Offline, air-gapped CLI for repo indexing, evidence checking, and task running"
readme = "README.md"
requires-python = ">=3.11"
license = {text = "AGPL-3.0"}
authors = [
{name = "TheAuditor Team"}
]
dependencies = [
"click==8.2.1",
"PyYAML==6.0.2",
"jsonschema==4.25.1",
"ijson==3.4.0",
]
[project.optional-dependencies]
dev = [
"pytest==8.4.2",
"ruff==0.12.12",
"black==25.1.0",
]
linters = [
"ruff==0.12.12",
"mypy==1.17.1",
"black==25.1.0",
"bandit==1.8.6",
"pylint==3.3.8",
]
ml = [
"scikit-learn==1.7.1",
"numpy==2.3.2",
"scipy==1.16.1",
"joblib==1.5.2",
]
ast = [
"tree-sitter==0.25.1",
"tree-sitter-language-pack==0.9.0",
"sqlparse==0.5.3",
"dockerfile-parse==2.0.1",
]
all = [
# Dev tools
"pytest==8.4.2",
# Linters
"ruff==0.12.12",
"mypy==1.17.1",
"black==25.1.0",
"bandit==1.8.6",
"pylint==3.3.8",
# ML features
"scikit-learn==1.7.1",
"numpy==2.3.2",
"scipy==1.16.1",
"joblib==1.5.2",
# AST parsing
"tree-sitter==0.25.1",
"tree-sitter-language-pack==0.9.0",
# SQL parsing
"sqlparse==0.5.3",
# Docker parsing
"dockerfile-parse==2.0.1",
]
[project.scripts]
aud = "theauditor.cli:main"
[tool.hatch.build.targets.wheel]
packages = ["theauditor"]
[tool.ruff]
line-length = 100
target-version = "py311"
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"N", # pep8-naming
"UP", # pyupgrade
"B", # flake8-bugbear
"C4", # flake8-comprehensions
"SIM", # flake8-simplify
]
ignore = [
"E501", # line too long - handled by black
"SIM105", # contextlib.suppress - can be less readable
"SIM117", # multiple with statements - can be less readable
]
[tool.ruff.lint.isort]
known-first-party = ["theauditor"]
[tool.black]
line-length = 100
target-version = ["py311"]
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["."]
addopts = "-v"
[tool.mypy]
python_version = "3.12"
strict = true
warn_unused_configs = true

2
theauditor/.gitattributes vendored Normal file
View File

@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto

3
theauditor/__init__.py Normal file
View File

@@ -0,0 +1,3 @@
"""TheAuditor - Offline, air-gapped CLI for repo indexing and evidence checking."""
__version__ = "1.0.1"

View File

@@ -0,0 +1,347 @@
"""Agent template validator - ensures templates comply with SOP permissions."""
import json
import re
from pathlib import Path
from typing import Dict, List, Any, Tuple, Optional
import yaml
class TemplateValidator:
"""Validates agent templates for SOP compliance and structure."""
# Tools that allow code modification
WRITE_TOOLS = {"Write", "Edit", "MultiEdit", "NotebookEdit"}
# Agents allowed to modify code
ALLOWED_EDITOR_AGENTS = {"coder", "documentation-manager", "implementation-specialist"}
# Required frontmatter fields
REQUIRED_FIELDS = {"name", "description", "tools", "model"}
    def __init__(self, template_dir: Optional[str] = None):
"""Initialize validator with template directory."""
if template_dir:
self.template_dir = Path(template_dir)
else:
# Default to agent_templates relative to module
self.template_dir = Path(__file__).parent.parent / "agent_templates"
self.violations = []
self.warnings = []
def _extract_frontmatter(self, content: str) -> Optional[Dict[str, Any]]:
"""Extract YAML frontmatter from markdown file.
Args:
content: File content
Returns:
Parsed frontmatter dict or None if not found
"""
# Match frontmatter between --- markers
pattern = r'^---\s*\n(.*?)\n---\s*\n'
match = re.match(pattern, content, re.DOTALL)
if not match:
return None
try:
frontmatter_text = match.group(1)
return yaml.safe_load(frontmatter_text)
except yaml.YAMLError as e:
self.violations.append(f"Invalid YAML frontmatter: {e}")
return None
def _parse_tools(self, tools_value: Any) -> List[str]:
"""Parse tools from frontmatter value.
Args:
tools_value: Tools field from frontmatter
Returns:
List of tool names
"""
if isinstance(tools_value, str):
# Comma-separated string
return [t.strip() for t in tools_value.split(',')]
elif isinstance(tools_value, list):
return tools_value
else:
return []
def _check_sop_permissions(
self,
template_name: str,
frontmatter: Dict[str, Any]
) -> List[str]:
"""Check SOP permission rules.
Args:
template_name: Name of template file
frontmatter: Parsed frontmatter
Returns:
List of violations found
"""
violations = []
# Get name and description, ensuring they're strings
agent_name = frontmatter.get("name", "")
if not isinstance(agent_name, str):
agent_name = str(agent_name) if agent_name else ""
# Skip validation for templates with placeholders
if "{" in agent_name or "}" in agent_name:
# This is a template with placeholders, not a real agent
return []
agent_name = agent_name.lower()
description = frontmatter.get("description", "")
if not isinstance(description, str):
description = str(description) if description else ""
description = description.lower()
tools = self._parse_tools(frontmatter.get("tools", ""))
# Check if agent has write tools
has_write_tools = any(tool in self.WRITE_TOOLS for tool in tools)
# Check compliance/legal agents first (they have stricter rules)
is_compliance_agent = (
"compliance" in agent_name or
"compliance" in description or
"legal" in agent_name or
"legal" in description
)
if is_compliance_agent and has_write_tools:
violations.append(
f"Compliance/legal agent '{agent_name}' must not have write tools, "
f"found: {self.WRITE_TOOLS & set(tools)}"
)
elif has_write_tools:
# For non-compliance agents, check if they're allowed to have write tools
is_allowed_editor = any(
allowed in agent_name
for allowed in self.ALLOWED_EDITOR_AGENTS
)
if not is_allowed_editor:
violations.append(
f"Agent '{agent_name}' has write tools ({self.WRITE_TOOLS & set(tools)}) "
f"but is not in allowed editor list: {self.ALLOWED_EDITOR_AGENTS}"
)
return violations
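# Illustrative check, not from the original source; the frontmatter below
# is hypothetical. A compliance agent holding a write tool trips the
# stricter rule regardless of the editor allow-list:
# >>> TemplateValidator()._check_sop_permissions(
# ...     "compliance-officer.md",
# ...     {"name": "compliance-officer", "description": "legal review",
# ...      "tools": "Read, Edit"})
# ["Compliance/legal agent 'compliance-officer' must not have write tools, found: {'Edit'}"]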
def _check_internal_links(
self,
content: str,
template_path: Path
) -> List[str]:
"""Check internal repository links are valid.
Args:
content: Template content
template_path: Path to template file
Returns:
List of broken links
"""
broken_links = []
# Find markdown links and references to repo paths
link_patterns = [
r'\[.*?\]\((\/[^)]+)\)', # Markdown links with absolute paths
r'`(\/[^`]+)`', # Code blocks with paths
r'"(\/[^"]+)"', # Quoted paths
r"'(\/[^']+)'", # Single-quoted paths
]
for pattern in link_patterns:
for match in re.finditer(pattern, content):
path_str = match.group(1)
# Skip URLs and anchors
if path_str.startswith('http') or path_str.startswith('#'):
continue
# Check if path exists relative to repo root
repo_root = template_path.parent.parent
full_path = repo_root / path_str.lstrip('/')
if not full_path.exists():
broken_links.append(f"Broken internal link: {path_str}")
return broken_links
def validate_template(self, template_path: Path) -> Dict[str, Any]:
"""Validate a single template file.
Args:
template_path: Path to template markdown file
Returns:
Validation result dict
"""
result = {
"path": str(template_path),
"valid": True,
"violations": [],
"warnings": []
}
try:
with open(template_path, 'r', encoding='utf-8') as f:
content = f.read()
except IOError as e:
result["valid"] = False
result["violations"].append(f"Cannot read file: {e}")
return result
# Extract frontmatter
frontmatter = self._extract_frontmatter(content)
if frontmatter is None:
result["valid"] = False
result["violations"].append("No valid frontmatter found")
return result
# Check required fields
missing_fields = self.REQUIRED_FIELDS - set(frontmatter.keys())
if missing_fields:
result["valid"] = False
result["violations"].append(
f"Missing required frontmatter fields: {missing_fields}"
)
# Check SOP permissions
sop_violations = self._check_sop_permissions(
template_path.name,
frontmatter
)
if sop_violations:
result["valid"] = False
result["violations"].extend(sop_violations)
# Check internal links
broken_links = self._check_internal_links(content, template_path)
if broken_links:
result["warnings"].extend(broken_links)
# Check for tool typos/inconsistencies
tools = self._parse_tools(frontmatter.get("tools", ""))
known_tools = {
"Bash", "Glob", "Grep", "LS", "Read", "Edit", "Write",
"MultiEdit", "NotebookEdit", "WebFetch", "TodoWrite",
"WebSearch", "BashOutput", "KillBash", "Task", "ExitPlanMode"
}
unknown_tools = set(tools) - known_tools
if unknown_tools:
result["warnings"].append(
f"Unknown tools found: {unknown_tools}"
)
return result
def validate_all(self, source_dir: Optional[str] = None) -> Dict[str, Any]:
"""Validate all templates in directory.
Args:
source_dir: Directory containing templates (default: self.template_dir)
Returns:
Validation summary
"""
if source_dir:
template_dir = Path(source_dir)
else:
template_dir = self.template_dir
if not template_dir.exists():
return {
"valid": False,
"error": f"Template directory not found: {template_dir}",
"templates": []
}
results = []
all_valid = True
total_violations = 0
total_warnings = 0
# Find all .md files
for template_path in template_dir.glob("*.md"):
result = self.validate_template(template_path)
results.append(result)
if not result["valid"]:
all_valid = False
total_violations += len(result["violations"])
total_warnings += len(result["warnings"])
return {
"valid": all_valid,
"templates_checked": len(results),
"total_violations": total_violations,
"total_warnings": total_warnings,
"templates": results
}
def generate_report(
self,
validation_results: Dict[str, Any],
format: str = "json"
) -> str:
"""Generate validation report.
Args:
validation_results: Results from validate_all()
format: Output format ('json' or 'text')
Returns:
Formatted report string
"""
if format == "json":
return json.dumps(validation_results, indent=2, sort_keys=True)
# Text format
lines = []
lines.append("=== Agent Template Validation Report ===\n")
lines.append(f"Templates checked: {validation_results['templates_checked']}")
lines.append(f"Total violations: {validation_results['total_violations']}")
lines.append(f"Total warnings: {validation_results['total_warnings']}")
lines.append(f"Overall status: {'PASS' if validation_results['valid'] else 'FAIL'}\n")
for template in validation_results.get("templates", []):
lines.append(f"\n{template['path']}:")
lines.append(f" Status: {'' if template['valid'] else ''}")
if template["violations"]:
lines.append(" Violations:")
for v in template["violations"]:
lines.append(f" - {v}")
if template["warnings"]:
lines.append(" Warnings:")
for w in template["warnings"]:
lines.append(f" - {w}")
return "\n".join(lines)
# Module-level convenience function
def validate_templates(source_dir: str) -> Tuple[bool, Dict[str, Any]]:
"""Validate all templates in directory.
Args:
source_dir: Directory containing agent templates
Returns:
Tuple of (all_valid, validation_results)
"""
validator = TemplateValidator()
results = validator.validate_all(source_dir)
return results["valid"], results
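# A minimal usage sketch; the "agent_templates" path below is hypothetical,
# not taken from this repository:
#
#     ok, results = validate_templates("agent_templates")
#     if not ok:
#         print(TemplateValidator().generate_report(results, format="text"))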

View File

@@ -0,0 +1,348 @@
"""AST Data Extraction Engine - Package Router.
This module provides the main ASTExtractorMixin class that routes extraction
requests to the appropriate language-specific implementation.
"""
import os
from typing import Any, List, Dict, Optional, TYPE_CHECKING
from dataclasses import dataclass
from pathlib import Path
# Import all implementations
from . import python_impl, typescript_impl, treesitter_impl
from .base import detect_language
# Import semantic parser if available
try:
from ..js_semantic_parser import get_semantic_ast_batch
except ImportError:
get_semantic_ast_batch = None
if TYPE_CHECKING:
# For type checking only, avoid circular import
from ..ast_parser import ASTMatch
else:
# At runtime, ASTMatch will be available from the parent class
@dataclass
class ASTMatch:
"""Represents an AST pattern match."""
node_type: str
start_line: int
end_line: int
start_col: int
snippet: str
metadata: Optional[Dict[str, Any]] = None
class ASTExtractorMixin:
"""Mixin class providing data extraction capabilities for AST analysis.
This class acts as a pure router, delegating all extraction logic to
language-specific implementation modules.
"""
def extract_functions(self, tree: Any, language: str = None) -> List[Dict]:
"""Extract function definitions from AST.
Args:
tree: AST tree.
language: Programming language.
Returns:
List of function info dictionaries.
"""
if not tree:
return []
# Route to appropriate implementation
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_functions(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_functions(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_functions(tree, self, language)
return []
def extract_classes(self, tree: Any, language: str = None) -> List[Dict]:
"""Extract class definitions from AST."""
if not tree:
return []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_classes(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_classes(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_classes(tree, self, language)
return []
def extract_calls(self, tree: Any, language: str = None) -> List[Dict]:
"""Extract function calls from AST."""
if not tree:
return []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_calls(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_calls(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_calls(tree, self, language)
return []
def extract_imports(self, tree: Any, language: str = None) -> List[Dict[str, Any]]:
"""Extract import statements from AST."""
if not tree:
return []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_imports(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_imports(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_imports(tree, self, language)
return []
def extract_exports(self, tree: Any, language: str = None) -> List[Dict[str, Any]]:
"""Extract export statements from AST."""
if not tree:
return []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_exports(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_exports(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_exports(tree, self, language)
return []
def extract_properties(self, tree: Any, language: str = None) -> List[Dict]:
"""Extract property accesses from AST (e.g., req.body, req.query).
This is critical for taint analysis to find JavaScript property access patterns.
"""
if not tree:
return []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_properties(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_properties(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_properties(tree, self, language)
return []
def extract_assignments(self, tree: Any, language: str = None) -> List[Dict[str, Any]]:
"""Extract variable assignments for data flow analysis."""
if not tree:
return []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_assignments(tree, self)
elif tree_type == "semantic_ast":
# The semantic result is nested in tree["tree"]
return typescript_impl.extract_typescript_assignments(tree.get("tree", {}), self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_assignments(tree, self, language)
return []
def extract_function_calls_with_args(self, tree: Any, language: str = None) -> List[Dict[str, Any]]:
"""Extract function calls with argument mapping for data flow analysis.
This is a two-pass analysis:
1. First pass: Find all function definitions and their parameters
2. Second pass: Find all function calls and map arguments to parameters
"""
if not tree:
return []
# First pass: Get all function definitions with their parameters
function_params = self._extract_function_parameters(tree, language)
# Second pass: Extract calls with argument mapping
calls_with_args = []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
calls_with_args = python_impl.extract_python_calls_with_args(tree, function_params, self)
elif tree_type == "semantic_ast":
calls_with_args = typescript_impl.extract_typescript_calls_with_args(tree, function_params, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
calls_with_args = treesitter_impl.extract_treesitter_calls_with_args(
tree, function_params, self, language
)
return calls_with_args
def _extract_function_parameters(self, tree: Any, language: str = None) -> Dict[str, List[str]]:
"""Extract function definitions and their parameter names.
Returns:
Dict mapping function_name -> list of parameter names
"""
if not tree:
return {}
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_function_params(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_function_params(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_function_params(tree, self, language)
return {}
def extract_returns(self, tree: Any, language: str = None) -> List[Dict[str, Any]]:
"""Extract return statements for data flow analysis."""
if not tree:
return []
if isinstance(tree, dict):
tree_type = tree.get("type")
language = tree.get("language", language)
if tree_type == "python_ast":
return python_impl.extract_python_returns(tree, self)
elif tree_type == "semantic_ast":
return typescript_impl.extract_typescript_returns(tree, self)
elif tree_type == "tree_sitter" and self.has_tree_sitter:
return treesitter_impl.extract_treesitter_returns(tree, self, language)
return []
def parse_files_batch(self, file_paths: List[Path], root_path: str = None) -> Dict[str, Any]:
"""Parse multiple files into ASTs in batch for performance.
This method dramatically improves performance for JavaScript/TypeScript projects
by processing multiple files in a single TypeScript compiler invocation.
Args:
file_paths: List of paths to source files
root_path: Absolute path to project root (for sandbox resolution)
Returns:
Dictionary mapping file paths to their AST trees
"""
results = {}
# Separate files by language
js_ts_files = []
python_files = []
other_files = []
for file_path in file_paths:
language = self._detect_language(file_path)
if language in ["javascript", "typescript"]:
js_ts_files.append(file_path)
elif language == "python":
python_files.append(file_path)
else:
other_files.append(file_path)
# Batch process JavaScript/TypeScript files if in a JS or polyglot project
project_type = self._detect_project_type()
if js_ts_files and project_type in ["javascript", "polyglot"] and get_semantic_ast_batch:
try:
# Convert paths to strings for the semantic parser with normalized separators
js_ts_paths = [str(f).replace("\\", "/") for f in js_ts_files]
# Use batch processing for JS/TS files
batch_results = get_semantic_ast_batch(js_ts_paths, project_root=root_path)
# Process batch results
for file_path in js_ts_files:
file_str = str(file_path).replace("\\", "/") # Normalize for matching
if file_str in batch_results:
semantic_result = batch_results[file_str]
if semantic_result.get("success"):
# Read file content for inclusion
try:
with open(file_path, "rb") as f:
content = f.read()
results[str(file_path).replace("\\", "/")] = {
"type": "semantic_ast",
"tree": semantic_result,
"language": self._detect_language(file_path),
"content": content.decode("utf-8", errors="ignore"),
"has_types": semantic_result.get("hasTypes", False),
"diagnostics": semantic_result.get("diagnostics", []),
"symbols": semantic_result.get("symbols", [])
}
except Exception as e:
print(f"Warning: Failed to read {file_path}: {e}, falling back to individual parsing")
# CRITICAL FIX: Fall back to individual parsing on read failure
individual_result = self.parse_file(file_path, root_path=root_path)
results[str(file_path).replace("\\", "/")] = individual_result
else:
print(f"Warning: Semantic parser failed for {file_path}: {semantic_result.get('error')}, falling back to individual parsing")
# CRITICAL FIX: Fall back to individual parsing instead of None
individual_result = self.parse_file(file_path, root_path=root_path)
results[str(file_path).replace("\\", "/")] = individual_result
else:
# CRITICAL FIX: Fall back to individual parsing instead of None
print(f"Warning: No batch result for {file_path}, falling back to individual parsing")
individual_result = self.parse_file(file_path, root_path=root_path)
results[str(file_path).replace("\\", "/")] = individual_result
except Exception as e:
print(f"Warning: Batch processing failed for JS/TS files: {e}")
# Fall back to individual processing
for file_path in js_ts_files:
results[str(file_path).replace("\\", "/")] = self.parse_file(file_path, root_path=root_path)
else:
# Process JS/TS files individually if not in JS project or batch failed
for file_path in js_ts_files:
results[str(file_path).replace("\\", "/")] = self.parse_file(file_path, root_path=root_path)
# Process Python files individually (they're fast enough)
for file_path in python_files:
results[str(file_path).replace("\\", "/")] = self.parse_file(file_path, root_path=root_path)
# Process other files individually
for file_path in other_files:
results[str(file_path).replace("\\", "/")] = self.parse_file(file_path, root_path=root_path)
return results
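# A minimal routing sketch, assuming the mixin is mixed into a parser class;
# the host class name below is hypothetical:
#
#     import ast
#
#     class _Extractor(ASTExtractorMixin):
#         has_tree_sitter = False
#
#     src = "def greet(name):\n    return name"
#     wrapped = {"type": "python_ast", "language": "python", "tree": ast.parse(src)}
#     _Extractor().extract_functions(wrapped)
#     # -> [{"name": "greet", "line": 1, "async": False, "args": ["name"]}]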

View File

@@ -0,0 +1,173 @@
"""Base utilities and shared helpers for AST extraction.
This module contains utility functions shared across all language implementations.
"""
import ast
import re
from typing import Any, List, Optional
from pathlib import Path
def get_node_name(node: Any) -> str:
"""Get the name from an AST node, handling different node types.
Works with Python's built-in AST nodes.
"""
if isinstance(node, ast.Name):
return node.id
elif isinstance(node, ast.Attribute):
return f"{get_node_name(node.value)}.{node.attr}"
elif isinstance(node, ast.Call):
return get_node_name(node.func)
elif isinstance(node, str):
return node
else:
return "unknown"
def extract_vars_from_expr(node: ast.AST) -> List[str]:
"""Extract all variable names from a Python expression.
Walks the AST to find all Name and Attribute nodes.
"""
vars_list = []
for subnode in ast.walk(node):
if isinstance(subnode, ast.Name):
vars_list.append(subnode.id)
elif isinstance(subnode, ast.Attribute):
# For x.y.z, get the full chain
chain = []
current = subnode
while isinstance(current, ast.Attribute):
chain.append(current.attr)
current = current.value
if isinstance(current, ast.Name):
chain.append(current.id)
vars_list.append(".".join(reversed(chain)))
return vars_list
def extract_vars_from_tree_sitter_expr(expr: str) -> List[str]:
"""Extract variable names from a JavaScript/TypeScript expression string.
Uses regex to find identifiers that aren't keywords.
"""
# Match identifiers that are not keywords
pattern = r'\b(?!(?:const|let|var|function|return|if|else|for|while|true|false|null|undefined|new|this)\b)[a-zA-Z_$][a-zA-Z0-9_$]*\b'
return re.findall(pattern, expr)
def find_containing_function_python(tree: ast.AST, line: int) -> Optional[str]:
"""Find the function containing a given line in Python AST."""
containing_func = None
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
if hasattr(node, "lineno") and hasattr(node, "end_lineno"):
if node.lineno <= line <= (node.end_lineno or node.lineno):
# Check if this is more specific than current containing_func
if containing_func is None or node.lineno > containing_func[1]:
containing_func = (node.name, node.lineno)
return containing_func[0] if containing_func else None
def find_containing_function_tree_sitter(node: Any, content: str, language: str) -> Optional[str]:
"""Find the function containing a node in Tree-sitter AST.
Walks up the tree to find parent function, handling all modern JS/TS patterns.
"""
# Walk up the tree to find parent function
current = node
while current and hasattr(current, 'parent') and current.parent:
current = current.parent
if language in ["javascript", "typescript"]:
# CRITICAL FIX: Handle ALL function patterns in modern JS/TS
function_types = [
"function_declaration", # function foo() {}
"function_expression", # const foo = function() {}
"arrow_function", # const foo = () => {}
"method_definition", # class { foo() {} }
"generator_function", # function* foo() {}
"async_function", # async function foo() {}
]
if current.type in function_types:
# Special handling for arrow functions FIRST
# They need different logic than regular functions
if current.type == "arrow_function":
# Arrow functions don't have names directly, check parent
parent = current.parent if hasattr(current, 'parent') else None
if parent:
# Check if it's assigned to a variable: const foo = () => {}
if parent.type == "variable_declarator":
# Use field-based API to get the name
if hasattr(parent, 'child_by_field_name'):
name_node = parent.child_by_field_name('name')
if name_node and name_node.text:
return name_node.text.decode("utf-8", errors="ignore")
# Fallback to child iteration
for child in parent.children:
if child.type == "identifier" and child != current:
return child.text.decode("utf-8", errors="ignore")
# Check if it's a property: { foo: () => {} }
elif parent.type == "pair":
for child in parent.children:
if child.type in ["property_identifier", "identifier", "string"] and child != current:
text = child.text.decode("utf-8", errors="ignore")
# Remove quotes from string keys
return text.strip('"\'')
# CRITICAL FIX (Lead Auditor feedback): Don't return anything here!
# Continue searching upward for containing named function
# This handles cases like: function outer() { arr.map(() => {}) }
# The arrow function should be tracked as within "outer", not "anonymous"
# Let the while loop continue to find outer function
continue # Skip the rest and continue searching upward
# For non-arrow functions, try field-based API first
if hasattr(current, 'child_by_field_name'):
name_node = current.child_by_field_name('name')
if name_node and name_node.text:
return name_node.text.decode("utf-8", errors="ignore")
# Fallback to child iteration for regular functions
for child in current.children:
if child.type in ["identifier", "property_identifier"]:
return child.text.decode("utf-8", errors="ignore")
# If still no name found for this regular function, it's anonymous
return "anonymous"
elif language == "python":
if current.type == "function_definition":
# Try field-based API first
if hasattr(current, 'child_by_field_name'):
name_node = current.child_by_field_name('name')
if name_node and name_node.text:
return name_node.text.decode("utf-8", errors="ignore")
# Fallback to child iteration
for child in current.children:
if child.type == "identifier":
return child.text.decode("utf-8", errors="ignore")
# If no function found, return "global" instead of None for better tracking
return "global"
def detect_language(file_path: Path) -> str:
"""Detect language from file extension.
Returns empty string for unsupported languages.
"""
ext_map = {
".py": "python",
".js": "javascript",
".jsx": "javascript",
".ts": "typescript",
".tsx": "typescript",
".mjs": "javascript",
".cjs": "javascript",
".vue": "javascript", # Vue SFCs contain JavaScript/TypeScript
}
return ext_map.get(file_path.suffix.lower(), "")
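# Worked examples for the helpers above (values checked against the
# implementations; the snippet itself is illustrative):
#
#     import ast
#     expr = ast.parse("user.name + suffix", mode="eval").body
#     extract_vars_from_expr(expr)
#     # -> ["user.name", "suffix", "user"]  (ast.walk also yields the bare base name)
#     detect_language(Path("app/server.tsx"))
#     # -> "typescript"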

View File

@@ -0,0 +1,327 @@
"""Python AST extraction implementations.
This module contains all Python-specific extraction logic using the built-in ast module.
"""
import ast
from typing import Any, List, Dict, Optional
from .base import (
get_node_name,
extract_vars_from_expr,
find_containing_function_python
)
def extract_python_functions(tree: Dict, parser_self) -> List[Dict]:
"""Extract function definitions from Python AST.
Args:
tree: AST tree dictionary with 'tree' containing the actual AST
parser_self: Reference to the parser instance for accessing methods
Returns:
List of function info dictionaries
"""
functions = []
actual_tree = tree.get("tree")
if not actual_tree:
return functions
for node in ast.walk(actual_tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
functions.append({
"name": node.name,
"line": node.lineno,
"async": isinstance(node, ast.AsyncFunctionDef),
"args": [arg.arg for arg in node.args.args],
})
return functions
def extract_python_classes(tree: Dict, parser_self) -> List[Dict]:
"""Extract class definitions from Python AST."""
classes = []
actual_tree = tree.get("tree")
if not actual_tree:
return classes
for node in ast.walk(actual_tree):
if isinstance(node, ast.ClassDef):
classes.append({
"name": node.name,
"line": node.lineno,
"column": node.col_offset,
"bases": [get_node_name(base) for base in node.bases],
})
return classes
def extract_python_calls(tree: Dict, parser_self) -> List[Dict]:
"""Extract function calls from Python AST."""
calls = []
actual_tree = tree.get("tree")
if not actual_tree:
return calls
for node in ast.walk(actual_tree):
if isinstance(node, ast.Call):
func_name = get_node_name(node.func)
if func_name:
calls.append({
"name": func_name,
"line": node.lineno,
"column": node.col_offset,
"args_count": len(node.args),
})
return calls
def extract_python_imports(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract import statements from Python AST."""
imports = []
actual_tree = tree.get("tree")
if not actual_tree:
return imports
for node in ast.walk(actual_tree):
if isinstance(node, ast.Import):
for alias in node.names:
imports.append({
"source": "import",
"target": alias.name,
"type": "import",
"line": node.lineno,
"as": alias.asname,
"specifiers": []
})
elif isinstance(node, ast.ImportFrom):
module = node.module or ""
for alias in node.names:
imports.append({
"source": "from",
"target": module,
"type": "from",
"line": node.lineno,
"imported": alias.name,
"as": alias.asname,
"specifiers": [alias.name]
})
return imports
def extract_python_exports(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract export statements from Python AST.
In Python, all top-level functions, classes, and assignments are "exported".
"""
exports = []
actual_tree = tree.get("tree")
if not actual_tree:
return exports
for node in ast.walk(actual_tree):
if isinstance(node, ast.FunctionDef) and node.col_offset == 0:
exports.append({
"name": node.name,
"type": "function",
"line": node.lineno,
"default": False
})
elif isinstance(node, ast.ClassDef) and node.col_offset == 0:
exports.append({
"name": node.name,
"type": "class",
"line": node.lineno,
"default": False
})
elif isinstance(node, ast.Assign) and node.col_offset == 0:
for target in node.targets:
if isinstance(target, ast.Name):
exports.append({
"name": target.id,
"type": "variable",
"line": node.lineno,
"default": False
})
return exports
def extract_python_assignments(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract variable assignments from Python AST for data flow analysis."""
import os
assignments = []
actual_tree = tree.get("tree")
if os.environ.get("THEAUDITOR_DEBUG"):
import sys
print(f"[AST_DEBUG] extract_python_assignments called", file=sys.stderr)
if not actual_tree:
return assignments
for node in ast.walk(actual_tree):
if isinstance(node, ast.Assign):
# Extract target variable(s)
for target in node.targets:
target_var = get_node_name(target)
source_expr = ast.unparse(node.value) if hasattr(ast, "unparse") else str(node.value)
# Find containing function
in_function = find_containing_function_python(actual_tree, node.lineno)
# CRITICAL FIX: Check if this is a class instantiation
# BeautifulSoup(html) is ast.Call with func.id = "BeautifulSoup"
is_instantiation = isinstance(node.value, ast.Call)
assignments.append({
"target_var": target_var,
"source_expr": source_expr,
"line": node.lineno,
"in_function": in_function or "global",
"source_vars": extract_vars_from_expr(node.value),
"is_instantiation": is_instantiation # Track for taint analysis
})
elif isinstance(node, ast.AnnAssign) and node.value:
# Handle annotated assignments (x: int = 5)
target_var = get_node_name(node.target)
source_expr = ast.unparse(node.value) if hasattr(ast, "unparse") else str(node.value)
in_function = find_containing_function_python(actual_tree, node.lineno)
assignments.append({
"target_var": target_var,
"source_expr": source_expr,
"line": node.lineno,
"in_function": in_function or "global",
"source_vars": extract_vars_from_expr(node.value)
})
return assignments
def extract_python_function_params(tree: Dict, parser_self) -> Dict[str, List[str]]:
"""Extract function definitions and their parameter names from Python AST."""
func_params = {}
actual_tree = tree.get("tree")
if not actual_tree:
return func_params
for node in ast.walk(actual_tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
params = [arg.arg for arg in node.args.args]
func_params[node.name] = params
return func_params
def extract_python_calls_with_args(tree: Dict, function_params: Dict[str, List[str]], parser_self) -> List[Dict[str, Any]]:
"""Extract Python function calls with argument mapping."""
calls = []
actual_tree = tree.get("tree")
if not actual_tree:
return calls
# Find containing function for each call
function_ranges = {}
for node in ast.walk(actual_tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
if hasattr(node, "lineno") and hasattr(node, "end_lineno"):
function_ranges[node.name] = (node.lineno, node.end_lineno or node.lineno)
for node in ast.walk(actual_tree):
if isinstance(node, ast.Call):
func_name = get_node_name(node.func)
# Find caller function
caller_function = "global"
for fname, (start, end) in function_ranges.items():
if start <= node.lineno <= end:
caller_function = fname
break
# Get callee parameters
callee_params = function_params.get(func_name.split(".")[-1], [])
# Map arguments to parameters
for i, arg in enumerate(node.args):
arg_expr = ast.unparse(arg) if hasattr(ast, "unparse") else str(arg)
param_name = callee_params[i] if i < len(callee_params) else f"arg{i}"
calls.append({
"line": node.lineno,
"caller_function": caller_function,
"callee_function": func_name,
"argument_index": i,
"argument_expr": arg_expr,
"param_name": param_name
})
return calls
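# Illustrative mapping over a hypothetical snippet: argument 0 of
# f(user_input) is bound to the parameter name found in the first pass.
# >>> import ast
# >>> t = {"type": "python_ast", "tree": ast.parse("f(user_input)")}
# >>> extract_python_calls_with_args(t, {"f": ["data"]}, None)
# [{'line': 1, 'caller_function': 'global', 'callee_function': 'f',
#   'argument_index': 0, 'argument_expr': 'user_input', 'param_name': 'data'}]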
def extract_python_returns(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract return statements from Python AST."""
returns = []
actual_tree = tree.get("tree")
if not actual_tree:
return returns
# First, map all functions
function_ranges = {}
for node in ast.walk(actual_tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
if hasattr(node, "lineno") and hasattr(node, "end_lineno"):
function_ranges[node.name] = (node.lineno, node.end_lineno or node.lineno)
# Extract return statements
for node in ast.walk(actual_tree):
if isinstance(node, ast.Return):
# Find containing function
function_name = "global"
for fname, (start, end) in function_ranges.items():
if start <= node.lineno <= end:
function_name = fname
break
# Extract return expression
if node.value:
return_expr = ast.unparse(node.value) if hasattr(ast, "unparse") else str(node.value)
return_vars = extract_vars_from_expr(node.value)
else:
return_expr = "None"
return_vars = []
returns.append({
"function_name": function_name,
"line": node.lineno,
"return_expr": return_expr,
"return_vars": return_vars
})
return returns
# Python doesn't have property accesses in the same way as JS
# This is a placeholder for consistency
def extract_python_properties(tree: Dict, parser_self) -> List[Dict]:
"""Extract property accesses from Python AST.
In Python, these would be attribute accesses.
Currently returns empty list for consistency.
"""
return []
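# A short end-to-end sketch for the assignment extractor (illustrative):
#
#     import ast
#     t = {"type": "python_ast", "tree": ast.parse("html = fetch(url)")}
#     extract_python_assignments(t, None)
#     # -> [{"target_var": "html", "source_expr": "fetch(url)", "line": 1,
#     #      "in_function": "global", "source_vars": ["fetch", "url"],
#     #      "is_instantiation": True}]  # any ast.Call on the RHS is flagged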

View File

@@ -0,0 +1,711 @@
"""Tree-sitter generic AST extraction implementations.
This module contains Tree-sitter extraction logic that works across multiple languages.
"""
from typing import Any, List, Dict, Optional
from .base import (
find_containing_function_tree_sitter,
extract_vars_from_tree_sitter_expr
)
def extract_treesitter_functions(tree: Dict, parser_self, language: str) -> List[Dict]:
"""Extract function definitions from Tree-sitter AST."""
actual_tree = tree.get("tree")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_functions(actual_tree.root_node, language)
def _extract_tree_sitter_functions(node: Any, language: str) -> List[Dict]:
"""Extract functions from Tree-sitter AST."""
functions = []
if node is None:
return functions
# Function node types per language
function_types = {
"python": ["function_definition"],
"javascript": ["function_declaration", "arrow_function", "function_expression", "method_definition"],
"typescript": ["function_declaration", "arrow_function", "function_expression", "method_definition"],
}
node_types = function_types.get(language, [])
if node.type in node_types:
# Extract function name
name = "anonymous"
for child in node.children:
if child.type in ["identifier", "property_identifier"]:
name = child.text.decode("utf-8", errors="ignore")
break
functions.append({
"name": name,
"line": node.start_point[0] + 1,
"type": node.type,
})
# Recursively search children
for child in node.children:
functions.extend(_extract_tree_sitter_functions(child, language))
return functions
def extract_treesitter_classes(tree: Dict, parser_self, language: str) -> List[Dict]:
"""Extract class definitions from Tree-sitter AST."""
actual_tree = tree.get("tree")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_classes(actual_tree.root_node, language)
def _extract_tree_sitter_classes(node: Any, language: str) -> List[Dict]:
"""Extract classes from Tree-sitter AST."""
classes = []
if node is None:
return classes
# Class node types per language
class_types = {
"python": ["class_definition"],
"javascript": ["class_declaration"],
"typescript": ["class_declaration", "interface_declaration"],
}
node_types = class_types.get(language, [])
if node.type in node_types:
# Extract class name
name = "anonymous"
for child in node.children:
if child.type in ["identifier", "type_identifier"]:
name = child.text.decode("utf-8", errors="ignore")
break
classes.append({
"name": name,
"line": node.start_point[0] + 1,
"column": node.start_point[1],
"type": node.type,
})
# Recursively search children
for child in node.children:
classes.extend(_extract_tree_sitter_classes(child, language))
return classes
def extract_treesitter_calls(tree: Dict, parser_self, language: str) -> List[Dict]:
"""Extract function calls from Tree-sitter AST."""
actual_tree = tree.get("tree")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_calls(actual_tree.root_node, language)
def _extract_tree_sitter_calls(node: Any, language: str) -> List[Dict]:
"""Extract function calls from Tree-sitter AST."""
calls = []
if node is None:
return calls
# Call node types per language
call_types = {
"python": ["call"],
"javascript": ["call_expression"],
"typescript": ["call_expression"],
}
node_types = call_types.get(language, [])
if node.type in node_types:
# Extract function name being called
name = "unknown"
for child in node.children:
if child.type in ["identifier", "member_expression", "attribute"]:
name = child.text.decode("utf-8", errors="ignore")
break
# Also handle property access patterns for methods like res.send()
elif child.type == "member_access_expression":
name = child.text.decode("utf-8", errors="ignore")
break
calls.append({
"name": name,
"line": node.start_point[0] + 1,
"column": node.start_point[1],
"type": "call", # Always use "call" type for database consistency
})
# Recursively search children
for child in node.children:
calls.extend(_extract_tree_sitter_calls(child, language))
return calls
def extract_treesitter_imports(tree: Dict, parser_self, language: str) -> List[Dict[str, Any]]:
"""Extract import statements from Tree-sitter AST."""
actual_tree = tree.get("tree")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_imports(actual_tree.root_node, language)
def _extract_tree_sitter_imports(node: Any, language: str) -> List[Dict[str, Any]]:
"""Extract imports from Tree-sitter AST with language-specific handling."""
imports = []
if node is None:
return imports
# Import node types per language
import_types = {
"javascript": ["import_statement", "import_clause", "require_call"],
"typescript": ["import_statement", "import_clause", "require_call", "import_type"],
"python": ["import_statement", "import_from_statement"],
}
node_types = import_types.get(language, [])
if node.type in node_types:
# Parse based on node type
if node.type == "import_statement":
# Handle: import foo from 'bar'
source_node = None
specifiers = []
for child in node.children:
if child.type == "string":
source_node = child.text.decode("utf-8", errors="ignore").strip("\"'")
elif child.type == "import_clause":
# Extract imported names
for spec_child in child.children:
if spec_child.type == "identifier":
specifiers.append(spec_child.text.decode("utf-8", errors="ignore"))
if source_node:
imports.append({
"source": "import",
"target": source_node,
"type": "import",
"line": node.start_point[0] + 1,
"specifiers": specifiers
})
elif node.type == "require_call":
# Handle: const foo = require('bar')
for child in node.children:
if child.type == "string":
target = child.text.decode("utf-8", errors="ignore").strip("\"'")
imports.append({
"source": "require",
"target": target,
"type": "require",
"line": node.start_point[0] + 1,
"specifiers": []
})
# Recursively search children
for child in node.children:
imports.extend(_extract_tree_sitter_imports(child, language))
return imports
def extract_treesitter_exports(tree: Dict, parser_self, language: str) -> List[Dict[str, Any]]:
"""Extract export statements from Tree-sitter AST."""
actual_tree = tree.get("tree")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_exports(actual_tree.root_node, language)
def _extract_tree_sitter_exports(node: Any, language: str) -> List[Dict[str, Any]]:
"""Extract exports from Tree-sitter AST."""
exports = []
if node is None:
return exports
# Export node types per language
export_types = {
"javascript": ["export_statement", "export_default_declaration"],
"typescript": ["export_statement", "export_default_declaration", "export_type"],
}
node_types = export_types.get(language, [])
if node.type in node_types:
is_default = "default" in node.type
# Extract exported name
name = "unknown"
export_type = "unknown"
for child in node.children:
if child.type in ["identifier", "type_identifier"]:
name = child.text.decode("utf-8", errors="ignore")
elif child.type == "function_declaration":
export_type = "function"
for subchild in child.children:
if subchild.type == "identifier":
name = subchild.text.decode("utf-8", errors="ignore")
break
elif child.type == "class_declaration":
export_type = "class"
for subchild in child.children:
if subchild.type in ["identifier", "type_identifier"]:
name = subchild.text.decode("utf-8", errors="ignore")
break
exports.append({
"name": name,
"type": export_type,
"line": node.start_point[0] + 1,
"default": is_default
})
# Recursively search children
for child in node.children:
exports.extend(_extract_tree_sitter_exports(child, language))
return exports
def extract_treesitter_properties(tree: Dict, parser_self, language: str) -> List[Dict]:
"""Extract property accesses from Tree-sitter AST."""
actual_tree = tree.get("tree")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_properties(actual_tree.root_node, language)
def _extract_tree_sitter_properties(node: Any, language: str) -> List[Dict]:
"""Extract property accesses from Tree-sitter AST."""
properties = []
if node is None:
return properties
# Property access node types per language
property_types = {
"javascript": ["member_expression", "property_access_expression"],
"typescript": ["member_expression", "property_access_expression"],
"python": ["attribute"],
}
node_types = property_types.get(language, [])
if node.type in node_types:
# Extract the full property access chain
prop_text = node.text.decode("utf-8", errors="ignore") if node.text else ""
# Filter for patterns that look like taint sources (req.*, request.*, ctx.*, etc.)
if any(pattern in prop_text for pattern in ["req.", "request.", "ctx.", "body", "query", "params", "headers", "cookies"]):
properties.append({
"name": prop_text,
"line": node.start_point[0] + 1,
"column": node.start_point[1],
"type": "property"
})
# Recursively search children
for child in node.children:
properties.extend(_extract_tree_sitter_properties(child, language))
return properties
def extract_treesitter_assignments(tree: Dict, parser_self, language: str) -> List[Dict[str, Any]]:
"""Extract variable assignments from Tree-sitter AST."""
actual_tree = tree.get("tree")
content = tree.get("content", "")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_assignments(actual_tree.root_node, language, content)
def _extract_tree_sitter_assignments(node: Any, language: str, content: str) -> List[Dict[str, Any]]:
"""Extract assignments from Tree-sitter AST."""
import os
import sys
debug = os.environ.get("THEAUDITOR_DEBUG")
assignments = []
if node is None:
return assignments
# Assignment node types per language
assignment_types = {
# Don't include variable_declarator - it's handled inside lexical_declaration/variable_declaration
"javascript": ["assignment_expression", "lexical_declaration", "variable_declaration"],
"typescript": ["assignment_expression", "lexical_declaration", "variable_declaration"],
"python": ["assignment"],
}
node_types = assignment_types.get(language, [])
if node.type in node_types:
target_var = None
source_expr = None
source_vars = []
if node.type in ["lexical_declaration", "variable_declaration"]:
# Handle lexical_declaration (const/let) and variable_declaration (var)
# Both contain variable_declarator children
# Process all variable_declarators within (const a = 1, b = 2)
for child in node.children:
if child.type == "variable_declarator":
name_node = child.child_by_field_name('name')
value_node = child.child_by_field_name('value')
if name_node and value_node:
in_function = find_containing_function_tree_sitter(child, content, language) or "global"
if debug:
print(f"[DEBUG] Found assignment: {name_node.text.decode('utf-8')} = {value_node.text.decode('utf-8')[:50]}", file=sys.stderr)
assignments.append({
"target_var": name_node.text.decode("utf-8", errors="ignore"),
"source_expr": value_node.text.decode("utf-8", errors="ignore"),
"line": child.start_point[0] + 1,
"in_function": in_function,
"source_vars": extract_vars_from_tree_sitter_expr(
value_node.text.decode("utf-8", errors="ignore")
)
})
elif node.type == "assignment_expression":
# x = value (JavaScript/TypeScript) - Use field-based API
left_node = node.child_by_field_name('left')
right_node = node.child_by_field_name('right')
if left_node:
target_var = left_node.text.decode("utf-8", errors="ignore")
if right_node:
source_expr = right_node.text.decode("utf-8", errors="ignore")
source_vars = extract_vars_from_tree_sitter_expr(source_expr)
elif node.type == "assignment":
# x = value (Python)
# Python assignment has structure: [target, "=", value]
left_node = None
right_node = None
for child in node.children:
if child.type != "=" and left_node is None:
left_node = child
elif child.type != "=" and left_node is not None:
right_node = child
if left_node:
target_var = left_node.text.decode("utf-8", errors="ignore") if left_node.text else ""
if right_node:
source_expr = right_node.text.decode("utf-8", errors="ignore") if right_node.text else ""
# Only create assignment record if we have both target and source
# (Skip lexical_declaration/variable_declaration as they're handled above with their children)
if target_var and source_expr and node.type not in ["lexical_declaration", "variable_declaration"]:
# Find containing function
in_function = find_containing_function_tree_sitter(node, content, language)
assignments.append({
"target_var": target_var,
"source_expr": source_expr,
"line": node.start_point[0] + 1,
"in_function": in_function or "global",
"source_vars": source_vars if source_vars else extract_vars_from_tree_sitter_expr(source_expr)
})
# Recursively search children
for child in node.children:
assignments.extend(_extract_tree_sitter_assignments(child, language, content))
return assignments
def extract_treesitter_function_params(tree: Dict, parser_self, language: str) -> Dict[str, List[str]]:
"""Extract function parameters from Tree-sitter AST."""
actual_tree = tree.get("tree")
if not actual_tree:
return {}
if not parser_self.has_tree_sitter:
return {}
return _extract_tree_sitter_function_params(actual_tree.root_node, language)
def _extract_tree_sitter_function_params(node: Any, language: str) -> Dict[str, List[str]]:
"""Extract function parameters from Tree-sitter AST."""
func_params = {}
if node is None:
return func_params
# Function definition node types
if language in ["javascript", "typescript"]:
if node.type in ["function_declaration", "function_expression", "arrow_function", "method_definition"]:
func_name = "anonymous"
params = []
# Use field-based API for function nodes
name_node = node.child_by_field_name('name')
params_node = node.child_by_field_name('parameters')
if name_node:
func_name = name_node.text.decode("utf-8", errors="ignore")
# Fall back to child iteration if field access fails
if not params_node:
for child in node.children:
if child.type in ["formal_parameters", "parameters"]:
params_node = child
break
if params_node:
# Extract parameter names
for param_child in params_node.children:
if param_child.type in ["identifier", "required_parameter", "optional_parameter"]:
if param_child.type == "identifier":
params.append(param_child.text.decode("utf-8", errors="ignore"))
else:
# For required/optional parameters, use field API
pattern_node = param_child.child_by_field_name('pattern')
if pattern_node and pattern_node.type == "identifier":
params.append(pattern_node.text.decode("utf-8", errors="ignore"))
if func_name and params:
func_params[func_name] = params
elif language == "python":
if node.type == "function_definition":
func_name = None
params = []
for child in node.children:
if child.type == "identifier":
func_name = child.text.decode("utf-8", errors="ignore")
elif child.type == "parameters":
for param_child in child.children:
if param_child.type == "identifier":
params.append(param_child.text.decode("utf-8", errors="ignore"))
if func_name:
func_params[func_name] = params
# Recursively search children
for child in node.children:
func_params.update(_extract_tree_sitter_function_params(child, language))
return func_params
def extract_treesitter_calls_with_args(
tree: Dict, function_params: Dict[str, List[str]], parser_self, language: str
) -> List[Dict[str, Any]]:
"""Extract function calls with arguments from Tree-sitter AST."""
actual_tree = tree.get("tree")
content = tree.get("content", "")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_calls_with_args(
actual_tree.root_node, language, content, function_params
)
def _extract_tree_sitter_calls_with_args(
node: Any, language: str, content: str, function_params: Dict[str, List[str]]
) -> List[Dict[str, Any]]:
"""Extract function calls with arguments from Tree-sitter AST."""
calls = []
if node is None:
return calls
# Call expression node types
if language in ["javascript", "typescript"] and node.type == "call_expression":
# Extract function name using field-based API
func_node = node.child_by_field_name('function')
func_name = "unknown"
if func_node:
func_name = func_node.text.decode("utf-8", errors="ignore") if func_node.text else "unknown"
else:
# Fallback to child iteration
for child in node.children:
if child.type in ["identifier", "member_expression"]:
func_name = child.text.decode("utf-8", errors="ignore") if child.text else "unknown"
break
# Find caller function
caller_function = find_containing_function_tree_sitter(node, content, language) or "global"
# Get callee parameters
callee_params = function_params.get(func_name.split(".")[-1], [])
# Extract arguments using field-based API
args_node = node.child_by_field_name('arguments')
arg_index = 0
if args_node:
for arg_child in args_node.children:
if arg_child.type not in ["(", ")", ","]:
arg_expr = arg_child.text.decode("utf-8", errors="ignore") if arg_child.text else ""
param_name = callee_params[arg_index] if arg_index < len(callee_params) else f"arg{arg_index}"
calls.append({
"line": node.start_point[0] + 1,
"caller_function": caller_function,
"callee_function": func_name,
"argument_index": arg_index,
"argument_expr": arg_expr,
"param_name": param_name
})
arg_index += 1
elif language == "python" and node.type == "call":
# Similar logic for Python
func_name = "unknown"
for child in node.children:
if child.type in ["identifier", "attribute"]:
func_name = child.text.decode("utf-8", errors="ignore") if child.text else "unknown"
break
caller_function = find_containing_function_tree_sitter(node, content, language) or "global"
callee_params = function_params.get(func_name.split(".")[-1], [])
arg_index = 0
for child in node.children:
if child.type == "argument_list":
for arg_child in child.children:
if arg_child.type not in ["(", ")", ","]:
arg_expr = arg_child.text.decode("utf-8", errors="ignore") if arg_child.text else ""
param_name = callee_params[arg_index] if arg_index < len(callee_params) else f"arg{arg_index}"
calls.append({
"line": node.start_point[0] + 1,
"caller_function": caller_function,
"callee_function": func_name,
"argument_index": arg_index,
"argument_expr": arg_expr,
"param_name": param_name
})
arg_index += 1
# Recursively search children
for child in node.children:
calls.extend(_extract_tree_sitter_calls_with_args(child, language, content, function_params))
return calls
def extract_treesitter_returns(tree: Dict, parser_self, language: str) -> List[Dict[str, Any]]:
"""Extract return statements from Tree-sitter AST."""
actual_tree = tree.get("tree")
content = tree.get("content", "")
if not actual_tree:
return []
if not parser_self.has_tree_sitter:
return []
return _extract_tree_sitter_returns(actual_tree.root_node, language, content)
def _extract_tree_sitter_returns(node: Any, language: str, content: str) -> List[Dict[str, Any]]:
"""Extract return statements from Tree-sitter AST."""
returns = []
if node is None:
return returns
# Return statement node types
if language in ["javascript", "typescript"] and node.type == "return_statement":
# Find containing function
function_name = find_containing_function_tree_sitter(node, content, language) or "global"
# Extract return expression
return_expr = ""
for child in node.children:
if child.type != "return":
return_expr = child.text.decode("utf-8", errors="ignore") if child.text else ""
break
if not return_expr:
return_expr = "undefined"
returns.append({
"function_name": function_name,
"line": node.start_point[0] + 1,
"return_expr": return_expr,
"return_vars": extract_vars_from_tree_sitter_expr(return_expr)
})
elif language == "python" and node.type == "return_statement":
# Find containing function
function_name = find_containing_function_tree_sitter(node, content, language) or "global"
# Extract return expression
return_expr = ""
for child in node.children:
if child.type != "return":
return_expr = child.text.decode("utf-8", errors="ignore") if child.text else ""
break
if not return_expr:
return_expr = "None"
returns.append({
"function_name": function_name,
"line": node.start_point[0] + 1,
"return_expr": return_expr,
"return_vars": extract_vars_from_tree_sitter_expr(return_expr)
})
# Recursively search children
for child in node.children:
returns.extend(_extract_tree_sitter_returns(child, language, content))
return returns
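# A hypothetical wiring sketch. The grammar package and the py-tree-sitter
# constructor below are assumptions (the bindings API varies by version),
# not part of this module:
#
#     import tree_sitter_javascript
#     from tree_sitter import Language, Parser
#
#     class _P:
#         has_tree_sitter = True
#
#     src = b"const token = req.query.token;"
#     parser = Parser(Language(tree_sitter_javascript.language()))
#     wrapped = {"type": "tree_sitter", "tree": parser.parse(src), "content": src.decode()}
#     extract_treesitter_assignments(wrapped, _P(), "javascript")
#     # -> [{"target_var": "token", "source_expr": "req.query.token", ...}]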

View File

@@ -0,0 +1,674 @@
"""TypeScript/JavaScript semantic AST extraction implementations.
This module contains all TypeScript compiler API extraction logic for semantic analysis.
"""
import os
from typing import Any, List, Dict, Optional
from .base import extract_vars_from_tree_sitter_expr
def extract_semantic_ast_symbols(node, depth=0):
"""Extract symbols from TypeScript semantic AST including property accesses.
This is a helper used by multiple extraction functions.
"""
symbols = []
if depth > 100 or not isinstance(node, dict):
return symbols
kind = node.get("kind")
# PropertyAccessExpression: req.body, req.params, res.send, etc.
if kind == "PropertyAccessExpression":
# Use the authoritative text from TypeScript compiler (now restored)
full_name = node.get("text", "").strip()
# Only fall back to reconstruction if text is missing (shouldn't happen now)
if not full_name:
# Build the full property access chain
name_parts = []
current = node
while current and isinstance(current, dict):
if current.get("name"):
if isinstance(current["name"], dict) and current["name"].get("name"):
name_parts.append(str(current["name"]["name"]))
elif isinstance(current["name"], str):
name_parts.append(current["name"])
# Look for the expression part
if current.get("children"):
for child in current["children"]:
if isinstance(child, dict) and child.get("kind") == "Identifier":
if child.get("text"):
name_parts.append(child["text"])
current = current.get("expression")
if name_parts:
full_name = ".".join(reversed(name_parts))
else:
full_name = None
if full_name:
# CRITICAL FIX: Extract ALL property accesses for taint analysis
# The taint analyzer will filter for the specific sources it needs
# This ensures we capture req.body, req.query, request.params, etc.
# Default all property accesses as "property" type
db_type = "property"
# Override only for known sink patterns that should be "call" type
if any(sink in full_name for sink in ["res.send", "res.render", "res.json", "response.write", "innerHTML", "outerHTML", "exec", "eval", "system", "spawn"]):
db_type = "call" # Taint analyzer looks for sinks as calls
symbols.append({
"name": full_name,
"line": node.get("line", 0),
"column": node.get("column", 0),
"type": db_type
})
# CallExpression: function calls including method calls
elif kind == "CallExpression":
# Use text field first if available (now restored)
name = None
if node.get("text"):
# Extract function name from text
text = node["text"]
if "(" in text:
name = text.split("(")[0].strip()
elif node.get("name"):
name = node["name"]
# Also check for method calls on children
if not name and node.get("children"):
for child in node["children"]:
if isinstance(child, dict):
if child.get("kind") == "PropertyAccessExpression":
name = child.get("text", "").split("(")[0].strip()
break
elif child.get("text") and "." in child.get("text", ""):
name = child["text"].split("(")[0].strip()
break
if name:
symbols.append({
"name": name,
"line": node.get("line", 0),
"column": node.get("column", 0),
"type": "call"
})
# Identifier nodes that might be property accesses or function references
elif kind == "Identifier":
text = node.get("text", "")
# Check if it looks like a property access pattern
if "." in text:
# Determine type based on pattern
db_type = "property"
# Check for sink patterns
if any(sink in text for sink in ["res.send", "res.render", "res.json", "response.write"]):
db_type = "call"
symbols.append({
"name": text,
"line": node.get("line", 0),
"column": node.get("column", 0),
"type": db_type
})
# Recurse through children
for child in node.get("children", []):
symbols.extend(extract_semantic_ast_symbols(child, depth + 1))
return symbols
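# Illustrative only, using a hypothetical node shape from the TypeScript
# helper script:
# >>> extract_semantic_ast_symbols(
# ...     {"kind": "PropertyAccessExpression", "text": "req.body",
# ...      "line": 12, "column": 4, "children": []})
# [{'name': 'req.body', 'line': 12, 'column': 4, 'type': 'property'}]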
def extract_typescript_functions(tree: Dict, parser_self) -> List[Dict]:
"""Extract function definitions from TypeScript semantic AST."""
functions = []
# Common parameter names that should NEVER be marked as functions
PARAMETER_NAMES = {"req", "res", "next", "err", "error", "ctx", "request", "response", "callback", "done", "cb"}
# CRITICAL FIX: Symbols are at tree["symbols"], not tree["tree"]["symbols"]
for symbol in tree.get("symbols", []):
ts_kind = symbol.get("kind", 0)
symbol_name = symbol.get("name", "")
if not symbol_name or symbol_name == "anonymous":
continue
# CRITICAL FIX: Skip known parameter names that are incorrectly marked as functions
if symbol_name in PARAMETER_NAMES:
continue # These are parameters, not function definitions
# Check if this is a function symbol
is_function = False
if isinstance(ts_kind, str):
if "Function" in ts_kind or "Method" in ts_kind:
is_function = True
elif isinstance(ts_kind, (int, float)):
# TypeScript SymbolFlags: Function = 16, Method = 8192, Constructor = 16384
# Parameter = 8388608 (0x800000) - SKIP THIS
if ts_kind == 8388608:
continue # This is a parameter, not a function
elif ts_kind in [16, 8192, 16384]:
is_function = True
if is_function and symbol_name not in PARAMETER_NAMES:
functions.append({
"name": symbol_name,
"line": symbol.get("line", 0),
"type": "function",
"kind": ts_kind
})
return functions
def extract_typescript_classes(tree: Dict, parser_self) -> List[Dict]:
"""Extract class definitions from TypeScript semantic AST."""
classes = []
# CRITICAL FIX: Symbols are at tree["symbols"], not tree["tree"]["symbols"]
for symbol in tree.get("symbols", []):
ts_kind = symbol.get("kind", 0)
symbol_name = symbol.get("name", "")
if not symbol_name or symbol_name == "anonymous":
continue
# Check if this is a class symbol
is_class = False
if isinstance(ts_kind, str):
if "Class" in ts_kind or "Interface" in ts_kind:
is_class = True
elif isinstance(ts_kind, (int, float)):
# TypeScript SymbolFlags: Class = 32, Interface = 64
if ts_kind in [32, 64]:
is_class = True
if is_class:
classes.append({
"name": symbol_name,
"line": symbol.get("line", 0),
"column": 0,
"type": "class",
"kind": ts_kind
})
return classes
def extract_typescript_calls(tree: Dict, parser_self) -> List[Dict]:
"""Extract function calls from TypeScript semantic AST."""
calls = []
# Common parameter names that should NEVER be marked as functions
PARAMETER_NAMES = {"req", "res", "next", "err", "error", "ctx", "request", "response", "callback", "done", "cb"}
# Use the symbols already extracted by TypeScript compiler
# CRITICAL FIX: Symbols are at tree["symbols"], not tree["tree"]["symbols"]
for symbol in tree.get("symbols", []):
symbol_name = symbol.get("name", "")
ts_kind = symbol.get("kind", 0)
# Skip empty/anonymous symbols
if not symbol_name or symbol_name == "anonymous":
continue
# CRITICAL FIX: Skip known parameter names that are incorrectly marked as functions
# These are function parameters, not function definitions
if symbol_name in PARAMETER_NAMES:
# These should be marked as properties/variables for taint analysis
if symbol_name in ["req", "request", "ctx"]:
calls.append({
"name": symbol_name,
"line": symbol.get("line", 0),
"column": 0,
"type": "property" # Mark as property for taint source detection
})
continue # Skip further processing for parameters
# CRITICAL FIX: Properly categorize based on TypeScript SymbolFlags
# The 'kind' field from TypeScript can be:
# - A string like "Function", "Method", "Property" (when ts.SymbolFlags mapping works)
# - A number representing the flag value (when mapping fails)
# TypeScript SymbolFlags values:
# Function = 16, Method = 8192, Property = 98304, Variable = 3, etc.
db_type = "call" # Default for unknown types
# Check if kind is a string (successful mapping in helper script)
if isinstance(ts_kind, str):
# Only mark as function if it's REALLY a function and not a parameter
if ("Function" in ts_kind or "Method" in ts_kind) and symbol_name not in PARAMETER_NAMES:
db_type = "function"
elif "Property" in ts_kind:
db_type = "property"
elif "Variable" in ts_kind or "Let" in ts_kind or "Const" in ts_kind:
# Variables could be sources if they match patterns
if any(pattern in symbol_name for pattern in ["req", "request", "ctx", "body", "params", "query", "headers"]):
db_type = "property"
else:
db_type = "call"
# Check numeric flags (when string mapping failed)
elif isinstance(ts_kind, (int, float)):
# TypeScript SymbolFlags from typescript.d.ts:
# Function = 16, Method = 8192, Constructor = 16384
# Property = 98304, Variable = 3, Let = 1, Const = 2
# Parameter = 8388608 (0x800000)
# CRITICAL: Skip parameter flag (8388608)
if ts_kind == 8388608:
# This is a parameter, not a function
if symbol_name in ["req", "request", "ctx"]:
db_type = "property" # Mark as property for taint analysis
else:
continue # Skip other parameters
elif ts_kind in [16, 8192, 16384] and symbol_name not in PARAMETER_NAMES: # Function, Method, Constructor
db_type = "function"
elif ts_kind in [98304, 4, 1048576]: # Property, EnumMember, Accessor
db_type = "property"
elif ts_kind in [3, 1, 2]: # Variable, Let, Const
# Check if it looks like a source
if any(pattern in symbol_name for pattern in ["req", "request", "ctx", "body", "params", "query", "headers"]):
db_type = "property"
# Override based on name patterns (for calls and property accesses)
if "." in symbol_name:
# Source patterns (user input)
if any(pattern in symbol_name for pattern in ["req.", "request.", "ctx.", "event.", "body", "params", "query", "headers", "cookies"]):
db_type = "property"
# Sink patterns (dangerous functions)
elif any(pattern in symbol_name for pattern in ["res.send", "res.render", "res.json", "response.write", "exec", "eval"]):
db_type = "call"
calls.append({
"name": symbol_name,
"line": symbol.get("line", 0),
"column": 0,
"type": db_type
})
# Also traverse AST for specific patterns
actual_tree = tree.get("tree") if isinstance(tree.get("tree"), dict) else tree
if actual_tree and actual_tree.get("success"):
ast_root = actual_tree.get("ast")
if ast_root:
calls.extend(extract_semantic_ast_symbols(ast_root))
return calls
def extract_typescript_imports(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract import statements from TypeScript semantic AST."""
imports = []
# Use TypeScript compiler API data
for imp in tree.get("imports", []):
imports.append({
"source": imp.get("kind", "import"),
"target": imp.get("module"),
"type": imp.get("kind", "import"),
"line": imp.get("line", 0),
"specifiers": imp.get("specifiers", [])
})
return imports
def extract_typescript_exports(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract export statements from TypeScript semantic AST.
Currently returns an empty list - exports aren't extracted by the semantic parser yet.
"""
return []
def extract_typescript_properties(tree: Dict, parser_self) -> List[Dict]:
"""Extract property accesses from TypeScript semantic AST."""
properties = []
# Already handled in extract_calls via extract_semantic_ast_symbols
# But we can also extract them specifically here
actual_tree = tree.get("tree") if isinstance(tree.get("tree"), dict) else tree
if actual_tree and actual_tree.get("success"):
ast_root = actual_tree.get("ast")
if ast_root:
symbols = extract_semantic_ast_symbols(ast_root)
# Filter for property accesses only
properties = [s for s in symbols if s.get("type") == "property"]
return properties
def extract_typescript_assignments(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract ALL assignment patterns from TypeScript semantic AST, including destructuring."""
assignments = []
if not tree or not tree.get("success"):
if os.environ.get("THEAUDITOR_DEBUG"):
import sys
print(f"[AST_DEBUG] extract_typescript_assignments: No success in tree", file=sys.stderr)
return assignments
if os.environ.get("THEAUDITOR_DEBUG"):
import sys
print(f"[AST_DEBUG] extract_typescript_assignments: Starting extraction", file=sys.stderr)
def traverse(node, current_function="global", depth=0):
if depth > 100 or not isinstance(node, dict):
return
try:
kind = node.get("kind", "")
# DEBUG: Log ALL node kinds we see to understand structure
if os.environ.get("THEAUDITOR_DEBUG"):
import sys
if depth < 5: # Log more depth
print(f"[AST_DEBUG] Depth {depth}: kind='{kind}'", file=sys.stderr)
if "Variable" in kind or "Assignment" in kind or "Binary" in kind or "=" in str(node.get("text", "")):
print(f"[AST_DEBUG] *** POTENTIAL ASSIGNMENT at depth {depth}: {kind}, text={str(node.get('text', ''))[:50]} ***", file=sys.stderr)
# --- Function Context Tracking ---
new_function = current_function
if kind in ["FunctionDeclaration", "MethodDeclaration", "ArrowFunction", "FunctionExpression"]:
name_node = node.get("name")
if name_node and isinstance(name_node, dict):
new_function = name_node.get("text", "anonymous")
else:
new_function = "anonymous"
# --- Assignment Extraction ---
# 1. Standard Assignments: const x = y; or x = y;
# NOTE: TypeScript AST has VariableDeclaration nested under FirstStatement->VariableDeclarationList
if kind in ["VariableDeclaration", "BinaryExpression"]:
# For BinaryExpression, check if it's an assignment (=) operator
is_assignment = True
if kind == "BinaryExpression":
op_token = node.get("operatorToken", {})
if not (isinstance(op_token, dict) and op_token.get("kind") == "EqualsToken"):
# Not an assignment, just a comparison or arithmetic expression
is_assignment = False
if is_assignment:
# TypeScript AST structure is different - use children and text
if kind == "VariableDeclaration":
# For TypeScript VariableDeclaration, extract from text or children
full_text = node.get("text", "")
if "=" in full_text:
parts = full_text.split("=", 1)
target_var = parts[0].strip()
source_expr = parts[1].strip()
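# e.g. node text "userId = req.params.id" splits into
# target_var="userId", source_expr="req.params.id" (illustrative)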
if target_var and source_expr:
if os.environ.get("THEAUDITOR_DEBUG"):
import sys
print(f"[AST_DEBUG] Found TS assignment: {target_var} = {source_expr[:30]}... at line {node.get('line', 0)}", file=sys.stderr)
assignments.append({
"target_var": target_var,
"source_expr": source_expr,
"line": node.get("line", 0),
"in_function": current_function,
"source_vars": extract_vars_from_tree_sitter_expr(source_expr)
})
else:
# BinaryExpression - use the original logic
target_node = node.get("left")
source_node = node.get("right")
if isinstance(target_node, dict) and isinstance(source_node, dict):
# --- ENHANCEMENT: Handle Destructuring ---
if target_node.get("kind") in ["ObjectBindingPattern", "ArrayBindingPattern"]:
source_expr = source_node.get("text", "unknown_source")
# For each element in the destructuring, create a separate assignment
for element in target_node.get("elements", []):
if isinstance(element, dict) and element.get("name"):
name_node = element.get("name")
# Binding name may be a nested node or a bare string depending on the emitter
target_var = name_node.get("text") if isinstance(name_node, dict) else name_node
if target_var:
assignments.append({
"target_var": target_var,
"source_expr": source_expr, # CRITICAL: Source is the original object/array
"line": element.get("line", node.get("line", 0)),
"in_function": current_function,
"source_vars": extract_vars_from_tree_sitter_expr(source_expr)
})
else:
# --- Standard, non-destructured assignment ---
target_var = target_node.get("text", "")
source_expr = source_node.get("text", "")
if target_var and source_expr:
if os.environ.get("THEAUDITOR_DEBUG"):
import sys
print(f"[AST_DEBUG] Found assignment: {target_var} = {source_expr[:50]}... at line {node.get('line', 0)}", file=sys.stderr)
assignments.append({
"target_var": target_var,
"source_expr": source_expr,
"line": node.get("line", 0),
"in_function": current_function,
"source_vars": extract_vars_from_tree_sitter_expr(source_expr)
})
# Recurse with updated function context
for child in node.get("children", []):
traverse(child, new_function, depth + 1)
except Exception:
# This safety net catches any unexpected AST structures
pass
ast_root = tree.get("ast", {})
traverse(ast_root)
if os.environ.get("THEAUDITOR_DEBUG"):
import sys
print(f"[AST_DEBUG] extract_typescript_assignments: Found {len(assignments)} assignments", file=sys.stderr)
if assignments and len(assignments) < 5:
for a in assignments[:3]:
print(f"[AST_DEBUG] Example: {a['target_var']} = {a['source_expr'][:30]}...", file=sys.stderr)
return assignments
def extract_typescript_function_params(tree: Dict, parser_self) -> Dict[str, List[str]]:
"""Extract function parameters from TypeScript semantic AST."""
func_params = {}
if not tree or not tree.get("success"):
return func_params
def traverse(node, depth=0):
if depth > 100 or not isinstance(node, dict):
return
kind = node.get("kind")
if kind in ["FunctionDeclaration", "MethodDeclaration", "ArrowFunction", "FunctionExpression"]:
# Get function name
name_node = node.get("name")
func_name = "anonymous"
if isinstance(name_node, dict):
func_name = name_node.get("text", "anonymous")
elif isinstance(name_node, str):
func_name = name_node
elif not name_node:
# Look for Identifier child (TypeScript AST structure)
for child in node.get("children", []):
if isinstance(child, dict) and child.get("kind") == "Identifier":
func_name = child.get("text", "anonymous")
break
# Extract parameter names
# FIX: In TypeScript AST, parameters are direct children with kind="Parameter"
params = []
# Look in children for Parameter nodes
for child in node.get("children", []):
if isinstance(child, dict) and child.get("kind") == "Parameter":
# Found a parameter - get its text directly
param_text = child.get("text", "")
if param_text:
params.append(param_text)
# Fallback to old structure if no parameters found
if not params:
param_nodes = node.get("parameters", [])
for param in param_nodes:
if isinstance(param, dict) and param.get("name"):
param_name_node = param.get("name")
if isinstance(param_name_node, dict):
params.append(param_name_node.get("text", ""))
elif isinstance(param_name_node, str):
params.append(param_name_node)
if func_name != "anonymous" and params:
func_params[func_name] = params
# Recurse through children
for child in node.get("children", []):
traverse(child, depth + 1)
ast_root = tree.get("ast", {})
traverse(ast_root)
return func_params
def extract_typescript_calls_with_args(tree: Dict, function_params: Dict[str, List[str]], parser_self) -> List[Dict[str, Any]]:
"""Extract function calls with arguments from TypeScript semantic AST."""
calls = []
if os.environ.get("THEAUDITOR_DEBUG"):
print(f"[DEBUG] extract_typescript_calls_with_args: tree type={type(tree)}, success={tree.get('success') if tree else 'N/A'}")
if not tree or not tree.get("success"):
if os.environ.get("THEAUDITOR_DEBUG"):
print(f"[DEBUG] extract_typescript_calls_with_args: Returning early - no tree or no success")
return calls
def traverse(node, current_function="global", depth=0):
if depth > 100 or not isinstance(node, dict):
return
try:
kind = node.get("kind", "")
# Track function context
new_function = current_function
if kind in ["FunctionDeclaration", "MethodDeclaration", "ArrowFunction", "FunctionExpression"]:
name_node = node.get("name")
if name_node and isinstance(name_node, dict):
new_function = name_node.get("text", "anonymous")
else:
new_function = "anonymous"
# CallExpression: function calls
if kind == "CallExpression":
if os.environ.get("THEAUDITOR_DEBUG"):
print(f"[DEBUG] Found CallExpression at line {node.get('line', 0)}")
# FIX: In TypeScript AST, the function and arguments are in children array
children = node.get("children", [])
if not children:
# Fallback to old structure
expression = node.get("expression", {})
arguments = node.get("arguments", [])
else:
# New structure: first child is function, rest are arguments
expression = children[0] if len(children) > 0 else {}
arguments = children[1:] if len(children) > 1 else []
# Get function name from expression
callee_name = "unknown"
if isinstance(expression, dict):
callee_name = expression.get("text", "unknown")
if os.environ.get("THEAUDITOR_DEBUG"):
print(f"[DEBUG] CallExpression: callee={callee_name}, args={len(arguments)}")
if arguments:
print(f"[DEBUG] First arg: {arguments[0].get('text', 'N/A') if isinstance(arguments[0], dict) else arguments[0]}")
# Get parameters for this function if we know them
callee_params = function_params.get(callee_name.split(".")[-1], [])
# Process arguments
for i, arg in enumerate(arguments):
if isinstance(arg, dict):
arg_text = arg.get("text", "")
param_name = callee_params[i] if i < len(callee_params) else f"arg{i}"
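# e.g. for sanitize(req.body) with known params ["input"], the argument
# "req.body" is recorded against param_name="input"; unknown callees fall
# back to positional names like "arg0" (names illustrative)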
calls.append({
"line": node.get("line", 0),
"caller_function": current_function,
"callee_function": callee_name,
"argument_index": i,
"argument_expr": arg_text,
"param_name": param_name
})
# Recurse with updated function context
for child in node.get("children", []):
traverse(child, new_function, depth + 1)
except Exception as e:
if os.environ.get("THEAUDITOR_DEBUG"):
print(f"[DEBUG] Error in extract_typescript_calls_with_args: {e}")
ast_root = tree.get("ast", {})
traverse(ast_root)
# Debug output
if os.environ.get("THEAUDITOR_DEBUG"):
print(f"[DEBUG] Extracted {len(calls)} function calls with args from semantic AST")
return calls
def extract_typescript_returns(tree: Dict, parser_self) -> List[Dict[str, Any]]:
"""Extract return statements from TypeScript semantic AST."""
returns = []
if not tree or not tree.get("success"):
return returns
# Traverse AST looking for return statements
def traverse(node, current_function="global", depth=0):
if depth > 100 or not isinstance(node, dict):
return
kind = node.get("kind")
# Track current function context
if kind in ["FunctionDeclaration", "FunctionExpression", "ArrowFunction", "MethodDeclaration"]:
# Extract function name if available
name_node = node.get("name")
if name_node and isinstance(name_node, dict):
current_function = name_node.get("text", "anonymous")
else:
current_function = "anonymous"
# ReturnStatement
elif kind == "ReturnStatement":
expr_node = node.get("expression", {})
if isinstance(expr_node, dict):
return_expr = expr_node.get("text", "")
else:
return_expr = str(expr_node) if expr_node else "undefined"
returns.append({
"function_name": current_function,
"line": node.get("line", 0),
"return_expr": return_expr,
"return_vars": extract_vars_from_tree_sitter_expr(return_expr)
})
# Recurse through children
for child in node.get("children", []):
traverse(child, current_function, depth + 1)
ast_root = tree.get("ast", {})
traverse(ast_root)
return returns
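Taken together, these extractors consume the JSON envelope emitted by the TypeScript helper. A minimal sketch of driving two of them by hand, with a hypothetical hand-built payload (the real envelope carries far more fields):

```python
# Hand-built stand-in for the semantic parser's JSON envelope; values are
# illustrative, not captured from a real run.
sample = {
    "success": True,
    "symbols": [
        {"name": "handleLogin", "kind": 16, "line": 10},  # SymbolFlags.Function
        {"name": "req.body", "kind": 98304, "line": 12},
    ],
    "ast": {"kind": "SourceFile", "line": 1, "children": []},
    "imports": [{"kind": "import", "module": "express", "line": 1, "specifiers": []}],
}

print(extract_typescript_functions(sample, parser_self=None))
# -> [{'name': 'handleLogin', 'line': 10, 'type': 'function', 'kind': 16}]
print([c["type"] for c in extract_typescript_calls(sample, parser_self=None)])
# -> ['function', 'property']  (req.body is kept as a taint-source property)
```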

323
theauditor/ast_parser.py Normal file

@@ -0,0 +1,323 @@
"""AST parser using Tree-sitter for multi-language support.
This module provides true structural code analysis using Tree-sitter,
enabling high-fidelity pattern detection that understands code semantics
rather than just text matching.
"""
import ast
import hashlib
import json
import os
import re
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path
from typing import Any, Optional, List, Dict, Union
from theauditor.js_semantic_parser import get_semantic_ast, get_semantic_ast_batch
from theauditor.ast_patterns import ASTPatternMixin
from theauditor.ast_extractors import ASTExtractorMixin
@dataclass
class ASTMatch:
"""Represents an AST pattern match."""
node_type: str
start_line: int
end_line: int
start_col: int
snippet: str
metadata: Dict[str, Any] = None
class ASTParser(ASTPatternMixin, ASTExtractorMixin):
"""Multi-language AST parser using Tree-sitter for structural analysis."""
def __init__(self):
"""Initialize parser with Tree-sitter language support."""
self.has_tree_sitter = False
self.parsers = {}
self.languages = {}
self.project_type = None # Cache project type detection
# Try to import tree-sitter and language bindings
try:
import tree_sitter
self.tree_sitter = tree_sitter
self.has_tree_sitter = True
self._init_tree_sitter_parsers()
except ImportError:
print("Warning: Tree-sitter not available. Install with: pip install tree-sitter tree-sitter-python tree-sitter-javascript tree-sitter-typescript")
def _init_tree_sitter_parsers(self):
"""Initialize Tree-sitter language parsers with proper bindings."""
if not self.has_tree_sitter:
return
# Use tree-sitter-language-pack for all languages
try:
from tree_sitter_language_pack import get_language, get_parser
# Python parser
try:
python_lang = get_language("python")
python_parser = get_parser("python")
self.parsers["python"] = python_parser
self.languages["python"] = python_lang
except Exception as e:
# Python has a built-in fallback, so we can continue with a warning
print(f"Warning: Failed to initialize Python parser: {e}")
print(" AST analysis for Python will use built-in parser as fallback.")
# JavaScript parser (CRITICAL - must fail fast)
try:
js_lang = get_language("javascript")
js_parser = get_parser("javascript")
self.parsers["javascript"] = js_parser
self.languages["javascript"] = js_lang
except Exception as e:
raise RuntimeError(
f"Failed to load tree-sitter grammar for JavaScript: {e}\n"
"This is often due to missing build tools or corrupted installation.\n"
"Please try: pip install --force-reinstall tree-sitter-language-pack\n"
"Or install with AST support: pip install -e '.[ast]'"
)
# TypeScript parser (CRITICAL - must fail fast)
try:
ts_lang = get_language("typescript")
ts_parser = get_parser("typescript")
self.parsers["typescript"] = ts_parser
self.languages["typescript"] = ts_lang
except Exception as e:
raise RuntimeError(
f"Failed to load tree-sitter grammar for TypeScript: {e}\n"
"This is often due to missing build tools or corrupted installation.\n"
"Please try: pip install --force-reinstall tree-sitter-language-pack\n"
"Or install with AST support: pip install -e '.[ast]'"
)
except ImportError as e:
# If tree-sitter is installed but language pack is not, this is a critical error
# The user clearly intends to use tree-sitter, so we should fail loudly
print(f"ERROR: tree-sitter is installed but tree-sitter-language-pack is not: {e}")
print("This means tree-sitter AST analysis cannot work properly.")
print("Please install with: pip install tree-sitter-language-pack")
print("Or install TheAuditor with full AST support: pip install -e '.[ast]'")
# Set flags to indicate no language support
self.has_tree_sitter = False
# Don't raise - allow fallback to regex-based parsing
def _detect_project_type(self) -> str:
"""Detect the primary project type based on manifest files.
Returns:
'polyglot' if multiple language manifest files exist
'javascript' if only package.json exists
'python' if only Python manifest files exist
'go' if only go.mod exists
'unknown' otherwise
"""
if self.project_type is not None:
return self.project_type
# Check all manifest files first
has_js = Path("package.json").exists()
has_python = (Path("requirements.txt").exists() or
Path("pyproject.toml").exists() or
Path("setup.py").exists())
has_go = Path("go.mod").exists()
# Determine project type based on combinations
if has_js and has_python:
self.project_type = "polyglot" # NEW: Properly handle mixed projects
elif has_js and has_go:
self.project_type = "polyglot"
elif has_python and has_go:
self.project_type = "polyglot"
elif has_js:
self.project_type = "javascript"
elif has_python:
self.project_type = "python"
elif has_go:
self.project_type = "go"
else:
self.project_type = "unknown"
return self.project_type
def parse_file(self, file_path: Path, language: str = None, root_path: str = None) -> Any:
"""Parse a file into an AST.
Args:
file_path: Path to the source file.
language: Programming language (auto-detected if None).
root_path: Absolute path to project root (for sandbox resolution).
Returns:
AST tree object or None if parsing fails.
"""
if language is None:
language = self._detect_language(file_path)
try:
with open(file_path, "rb") as f:
content = f.read()
# Compute content hash for caching
content_hash = hashlib.md5(content).hexdigest()
# For JavaScript/TypeScript, try semantic parser first
# CRITICAL FIX: Include None and polyglot project types
# When project_type is None (not detected yet) or polyglot, still try semantic parsing
project_type = self._detect_project_type()
if language in ["javascript", "typescript"] and project_type in ["javascript", "polyglot", None, "unknown"]:
try:
# Attempt to use the TypeScript Compiler API for semantic analysis
# Normalize path for cross-platform compatibility
normalized_path = str(file_path).replace("\\", "/")
semantic_result = get_semantic_ast(normalized_path, project_root=root_path)
if semantic_result.get("success"):
# Return the semantic AST with full type information
return {
"type": "semantic_ast",
"tree": semantic_result,
"language": language,
"content": content.decode("utf-8", errors="ignore"),
"has_types": semantic_result.get("hasTypes", False),
"diagnostics": semantic_result.get("diagnostics", []),
"symbols": semantic_result.get("symbols", [])
}
else:
# Log but continue to Tree-sitter/regex fallback
error_msg = semantic_result.get('error', 'Unknown error')
print(f"Warning: Semantic parser failed for {file_path}: {error_msg}")
print(f" Falling back to Tree-sitter/regex parser.")
# Continue to fallback options below
except Exception as e:
# Log but continue to Tree-sitter/regex fallback
print(f"Warning: Exception in semantic parser for {file_path}: {e}")
print(f" Falling back to Tree-sitter/regex parser.")
# Continue to fallback options below
# Use Tree-sitter if available
if self.has_tree_sitter and language in self.parsers:
try:
# Use cached parser
tree = self._parse_treesitter_cached(content_hash, content, language)
return {"type": "tree_sitter", "tree": tree, "language": language, "content": content}
except Exception as e:
print(f"Warning: Tree-sitter parsing failed for {file_path}: {e}")
print(f" Falling back to alternative parser if available.")
# Continue to fallback options below
# Fallback to built-in parsers for Python
if language == "python":
decoded = content.decode("utf-8", errors="ignore")
python_ast = self._parse_python_cached(content_hash, decoded)
if python_ast:
return {"type": "python_ast", "tree": python_ast, "language": language, "content": decoded}
# Return minimal structure to signal regex fallback for JS/TS
if language in ["javascript", "typescript"]:
print(f"Warning: AST parsing unavailable for {file_path}. Using regex fallback.")
decoded = content.decode("utf-8", errors="ignore")
return {"type": "regex_fallback", "tree": None, "language": language, "content": decoded}
# Return None for unsupported languages
return None
except Exception as e:
print(f"Warning: Failed to parse {file_path}: {e}")
return None
def _detect_language(self, file_path: Path) -> str:
"""Detect language from file extension."""
ext_map = {
".py": "python",
".js": "javascript",
".jsx": "javascript",
".ts": "typescript",
".tsx": "typescript",
".mjs": "javascript",
".cjs": "javascript",
".vue": "javascript", # Vue SFCs contain JavaScript/TypeScript
}
return ext_map.get(file_path.suffix.lower(), "")  # empty string signals unsupported, not "unknown"
def _parse_python_builtin(self, content: str) -> Optional[ast.AST]:
"""Parse Python code using built-in ast module."""
try:
return ast.parse(content)
except SyntaxError:
return None
@lru_cache(maxsize=500)
def _parse_python_cached(self, content_hash: str, content: str) -> Optional[ast.AST]:
"""Parse Python code with caching based on content hash.
Args:
content_hash: MD5 hash of the file content
content: The actual file content
Returns:
Parsed AST or None if parsing fails
"""
return self._parse_python_builtin(content)
@lru_cache(maxsize=500)
def _parse_treesitter_cached(self, content_hash: str, content: bytes, language: str) -> Any:
"""Parse code using Tree-sitter with caching based on content hash.
Args:
content_hash: MD5 hash of the file content
content: The actual file content as bytes
language: The programming language
Returns:
Parsed Tree-sitter tree
"""
parser = self.parsers[language]
return parser.parse(content)
def supports_language(self, language: str) -> bool:
"""Check if a language is supported for AST parsing.
Args:
language: Programming language name.
Returns:
True if AST parsing is supported.
"""
# Python is always supported via built-in ast module
if language == "python":
return True
# JavaScript and TypeScript are always supported via fallback
if language in ["javascript", "typescript"]:
return True
# Check Tree-sitter support for other languages
if self.has_tree_sitter and language in self.parsers:
return True
return False
def get_supported_languages(self) -> List[str]:
"""Get list of supported languages.
Returns:
List of language names.
"""
# Always supported via built-in or fallback
languages = ["python", "javascript", "typescript"]
if self.has_tree_sitter:
languages.extend(self.parsers.keys())
return sorted(set(languages))
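A usage sketch for the parser; `src/app.ts` is a hypothetical path, and a missing or unparseable file simply produces a warning and `None`:

```python
from pathlib import Path
from theauditor.ast_parser import ASTParser

parser = ASTParser()
print(parser.get_supported_languages())  # always lists python, javascript, typescript

tree = parser.parse_file(Path("src/app.ts"))  # hypothetical file
if tree is not None:
    # one of: semantic_ast, tree_sitter, python_ast, regex_fallback
    print(tree["type"], tree["language"])
```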

401
theauditor/ast_patterns.py Normal file

@@ -0,0 +1,401 @@
"""AST Pattern Matching Engine.
This module contains all pattern matching and query logic for the AST parser.
It provides pattern-based search capabilities across different AST types.
"""
import ast
from typing import Any, Optional, List, Dict, TYPE_CHECKING
from dataclasses import dataclass
if TYPE_CHECKING:
# For type checking only, avoid circular import
from .ast_parser import ASTMatch
else:
# At runtime, ASTMatch will be available from the parent class
@dataclass
class ASTMatch:
"""Represents an AST pattern match."""
node_type: str
start_line: int
end_line: int
start_col: int
snippet: str
metadata: Dict[str, Any] = None
class ASTPatternMixin:
"""Mixin class providing pattern matching capabilities for AST analysis."""
def query_ast(self, tree: Any, query_string: str) -> List[ASTMatch]:
"""Execute a Tree-sitter query on the AST.
Args:
tree: AST tree object from parse_file.
query_string: Tree-sitter query in S-expression format.
Returns:
List of ASTMatch objects.
"""
matches = []
if not tree:
return matches
# Handle Tree-sitter AST with queries
if tree.get("type") == "tree_sitter" and self.has_tree_sitter:
language = tree["language"]
if language in self.languages:
try:
# CRITICAL FIX: Use correct tree-sitter API with QueryCursor
# Per tree-sitter 0.25.1 documentation, must:
# 1. Create Query with Query() constructor
# 2. Create QueryCursor from the query
# 3. Call matches() on the cursor, not the query
from tree_sitter import Query, QueryCursor
# Create Query object using the language and query string
query = Query(self.languages[language], query_string)
# Create QueryCursor from the query
query_cursor = QueryCursor(query)
# Call matches() on the cursor (not the query!)
query_matches = query_cursor.matches(tree["tree"].root_node)
for match in query_matches:
# Each match is a tuple: (pattern_index, captures_dict)
pattern_index, captures = match
# Process captures dictionary
for capture_name, nodes in captures.items():
# Handle both single node and list of nodes
if not isinstance(nodes, list):
nodes = [nodes]
for node in nodes:
start_point = node.start_point
end_point = node.end_point
snippet = node.text.decode("utf-8", errors="ignore") if node.text else ""
ast_match = ASTMatch(
node_type=node.type,
start_line=start_point[0] + 1,
end_line=end_point[0] + 1,
start_col=start_point[1],
snippet=snippet[:200],
metadata={"capture": capture_name, "pattern": pattern_index}
)
matches.append(ast_match)
except Exception as e:
print(f"Query error: {e}")
# For Python AST, fall back to pattern matching
elif tree.get("type") == "python_ast":
# Convert query to pattern and use existing method
pattern = self._query_to_pattern(query_string)
if pattern:
matches = self.find_ast_matches(tree, pattern)
return matches
def _query_to_pattern(self, query_string: str) -> Optional[Dict]:
"""Convert a Tree-sitter query to a simple pattern dict.
This is a fallback for Python's built-in AST.
"""
# Simple heuristic conversion for common patterns
if "any" in query_string.lower():
return {"node_type": "type_annotation", "contains": ["any"]}
elif "function" in query_string.lower():
return {"node_type": "function_def", "contains": []}
elif "class" in query_string.lower():
return {"node_type": "class_def", "contains": []}
return None
def find_ast_matches(self, tree: Any, ast_pattern: dict) -> List[ASTMatch]:
"""Find matches in AST based on pattern.
Args:
tree: AST tree object.
ast_pattern: Pattern dictionary with node_type and optional contains.
Returns:
List of ASTMatch objects.
"""
matches = []
if not tree:
return matches
# Handle wrapped tree objects
if isinstance(tree, dict):
tree_type = tree.get("type")
actual_tree = tree.get("tree")
if tree_type == "tree_sitter" and self.has_tree_sitter:
matches.extend(self._find_tree_sitter_matches(actual_tree.root_node, ast_pattern))
elif tree_type == "python_ast":
matches.extend(self._find_python_ast_matches(actual_tree, ast_pattern))
elif tree_type == "semantic_ast":
# Handle Semantic AST from TypeScript Compiler API
matches.extend(self._find_semantic_ast_matches(actual_tree, ast_pattern))
elif tree_type == "eslint_ast":
# Handle ESLint AST (legacy, now replaced by semantic_ast)
# For now, we treat it similarly to regex_ast but with higher confidence
matches.extend(self._find_eslint_ast_matches(actual_tree, ast_pattern))
# Handle direct AST objects (legacy support)
elif isinstance(tree, ast.AST):
matches.extend(self._find_python_ast_matches(tree, ast_pattern))
return matches
def _find_tree_sitter_matches(self, node: Any, pattern: dict) -> List[ASTMatch]:
"""Find matches in Tree-sitter AST using structural patterns."""
matches = []
if node is None:
return matches
# Check if node type matches
node_type = pattern.get("node_type", "")
# Special handling for type annotations
if node_type == "type_annotation" and "any" in pattern.get("contains", []):
# Look for TypeScript/JavaScript any type annotations
if node.type in ["type_annotation", "type_identifier", "any_type"]:
node_text = node.text.decode("utf-8", errors="ignore") if node.text else ""
if node_text == "any" or ": any" in node_text:
start_point = node.start_point
end_point = node.end_point
match = ASTMatch(
node_type=node.type,
start_line=start_point[0] + 1,
end_line=end_point[0] + 1,
start_col=start_point[1],
snippet=node_text[:200]
)
matches.append(match)
# General pattern matching
elif node.type == node_type or node_type == "*":
contains = pattern.get("contains", [])
node_text = node.text.decode("utf-8", errors="ignore") if node.text else ""
if all(keyword in node_text for keyword in contains):
start_point = node.start_point
end_point = node.end_point
match = ASTMatch(
node_type=node.type,
start_line=start_point[0] + 1,
end_line=end_point[0] + 1,
start_col=start_point[1],
snippet=node_text[:200],
)
matches.append(match)
# Recursively search children
for child in node.children:
matches.extend(self._find_tree_sitter_matches(child, pattern))
return matches
def _find_semantic_ast_matches(self, tree: Dict[str, Any], pattern: dict) -> List[ASTMatch]:
"""Find matches in Semantic AST from TypeScript Compiler API.
This provides the highest fidelity analysis with full type information.
"""
matches = []
if not tree or not tree.get("ast"):
return matches
# Handle type-related patterns
node_type = pattern.get("node_type", "")
if node_type == "type_annotation" and "any" in pattern.get("contains", []):
# Search for 'any' types in symbols
for symbol in tree.get("symbols", []):
if symbol.get("type") == "any":
match = ASTMatch(
node_type="any_type",
start_line=symbol.get("line", 0),
end_line=symbol.get("line", 0),
start_col=0,
snippet=f"{symbol.get('name')}: any",
metadata={"symbol": symbol.get("name"), "type": "any"}
)
matches.append(match)
# Also recursively search the AST for AnyKeyword nodes
def search_ast_for_any(node, depth=0):
if depth > 100 or not isinstance(node, dict):
return
if node.get("kind") == "AnyKeyword":
match = ASTMatch(
node_type="AnyKeyword",
start_line=node.get("line", 0),
end_line=node.get("line", 0),
start_col=node.get("column", 0),
snippet=node.get("text", "any")[:200],
metadata={"kind": "AnyKeyword"}
)
matches.append(match)
for child in node.get("children", []):
search_ast_for_any(child, depth + 1)
search_ast_for_any(tree.get("ast", {}))
return matches
def _find_eslint_ast_matches(self, tree: Dict[str, Any], pattern: dict) -> List[ASTMatch]:
"""Find matches in ESLint AST.
ESLint provides a full JavaScript/TypeScript AST with high fidelity.
This provides accurate pattern matching for JS/TS code.
"""
matches = []
# ESLint AST follows the ESTree specification
# Future enhancement: properly traverse the ESTree AST structure
if not tree:
return matches
# Basic implementation - will be enhanced in future iterations
# to properly traverse the ESTree AST structure
return matches
def _find_python_ast_matches(self, node: ast.AST, pattern: dict) -> List[ASTMatch]:
"""Find matches in Python built-in AST."""
matches = []
# Map pattern node types to Python AST node types
node_type_map = {
"if_statement": ast.If,
"while_statement": ast.While,
"for_statement": ast.For,
"function_def": ast.FunctionDef,
"class_def": ast.ClassDef,
"try_statement": ast.Try,
"type_annotation": ast.AnnAssign, # For type hints
}
pattern_node_type = pattern.get("node_type", "")
expected_type = node_type_map.get(pattern_node_type)
# Special handling for 'any' type in Python
if pattern_node_type == "type_annotation" and "any" in pattern.get("contains", []):
# Check for typing.Any usage
if isinstance(node, ast.Name) and node.id == "Any":
match = ASTMatch(
node_type="Any",
start_line=getattr(node, "lineno", 0),
end_line=getattr(node, "end_lineno", getattr(node, "lineno", 0)),
start_col=getattr(node, "col_offset", 0),
snippet="Any"
)
matches.append(match)
elif isinstance(node, ast.AnnAssign):
# Check annotation for Any
node_source = ast.unparse(node) if hasattr(ast, "unparse") else ""
if "Any" in node_source:
match = ASTMatch(
node_type="AnnAssign",
start_line=getattr(node, "lineno", 0),
end_line=getattr(node, "end_lineno", getattr(node, "lineno", 0)),
start_col=getattr(node, "col_offset", 0),
snippet=node_source[:200]
)
matches.append(match)
# General pattern matching
elif expected_type and isinstance(node, expected_type):
contains = pattern.get("contains", [])
node_source = ast.unparse(node) if hasattr(ast, "unparse") else ""
if all(keyword in node_source for keyword in contains):
match = ASTMatch(
node_type=node.__class__.__name__,
start_line=getattr(node, "lineno", 0),
end_line=getattr(node, "end_lineno", getattr(node, "lineno", 0)),
start_col=getattr(node, "col_offset", 0),
snippet=node_source[:200],
)
matches.append(match)
# Recursively search direct children only; pairing ast.walk (which already
# yields every descendant) with recursion would revisit nodes and emit
# duplicate matches
for child in ast.iter_child_nodes(node):
matches.extend(self._find_python_ast_matches(child, pattern))
return matches
def get_tree_sitter_query_for_pattern(self, pattern: str, language: str) -> str:
"""Convert a pattern identifier to a Tree-sitter query.
Args:
pattern: Pattern identifier (e.g., "NO_ANY_IN_SCOPE")
language: Programming language
Returns:
Tree-sitter query string in S-expression format
"""
queries = {
"typescript": {
"NO_ANY_IN_SCOPE": """
(type_annotation
(type_identifier) @type
(#eq? @type "any"))
""",
"NO_UNSAFE_EVAL": """
(call_expression
function: (identifier) @func
(#eq? @func "eval"))
""",
"NO_VAR_IN_STRICT": """
(variable_declaration
kind: "var") @var_usage
""",
},
"javascript": {
"NO_ANY_IN_SCOPE": """
(type_annotation
(type_identifier) @type
(#eq? @type "any"))
""",
"NO_UNSAFE_EVAL": """
(call_expression
function: (identifier) @func
(#eq? @func "eval"))
""",
"NO_VAR_IN_STRICT": """
(variable_declaration
kind: "var") @var_usage
""",
},
"python": {
"NO_EVAL_EXEC": """
(call
function: (identifier) @func
(#match? @func "^(eval|exec)$"))
""",
"NO_BARE_EXCEPT": """
(except_clause
!type) @bare_except
""",
"NO_MUTABLE_DEFAULT": """
(default_parameter
value: [(list) (dictionary)]) @mutable_default
""",
}
}
language_queries = queries.get(language, {})
return language_queries.get(pattern, "")
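A sketch tying the mixin's two halves together: fetch a canned S-expression with get_tree_sitter_query_for_pattern, run it through query_ast, and fall back to a structural pattern dict for Python's built-in AST (file path hypothetical):

```python
from pathlib import Path
from theauditor.ast_parser import ASTParser

parser = ASTParser()
tree = parser.parse_file(Path("scripts/job.py"))  # hypothetical file

if tree and tree["type"] == "tree_sitter":
    query = parser.get_tree_sitter_query_for_pattern("NO_EVAL_EXEC", "python")
    for m in parser.query_ast(tree, query):
        print(f"{m.node_type} line {m.start_line}: {m.snippet}")
elif tree and tree["type"] == "python_ast":
    # Built-in AST path takes a pattern dict instead of an S-expression
    for m in parser.find_ast_matches(tree, {"node_type": "function_def", "contains": ["eval"]}):
        print(f"{m.node_type} line {m.start_line}")
```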

273
theauditor/claude_setup.py Normal file

@@ -0,0 +1,273 @@
"""Claude Code integration setup - Zero-optional bulletproof installer."""
import hashlib
import json
import platform
import shutil
import stat
import sys
from pathlib import Path
from typing import Dict, List, Optional
from .venv_install import setup_project_venv, find_theauditor_root
# Detect if running on Windows for character encoding
IS_WINDOWS = platform.system() == "Windows"
def write_file_atomic(path: Path, content: str, executable: bool = False) -> str:
"""
Write file atomically with backup if content differs.
Args:
path: File path to write
content: Content to write
executable: Make file executable (Unix only)
Returns:
"created" if new file
"updated" if file changed (creates .bak)
"skipped" if identical content
"""
# Ensure parent directory exists
path.parent.mkdir(parents=True, exist_ok=True)
if path.exists():
existing = path.read_text(encoding='utf-8')
if existing == content:
return "skipped"
# Create backup (only once per unique content)
bak_path = path.with_suffix(path.suffix + ".bak")
if not bak_path.exists():
shutil.copy2(path, bak_path)
path.write_text(content, encoding='utf-8')
status = "updated"
else:
path.write_text(content, encoding='utf-8')
status = "created"
# Set executable if needed
if executable and platform.system() != "Windows":
st = path.stat()
path.chmod(st.st_mode | stat.S_IEXEC | stat.S_IXGRP | stat.S_IXOTH)
return status
class WrapperTemplates:
"""Cross-platform wrapper script templates."""
POSIX_WRAPPER = '''#!/usr/bin/env bash
# Auto-generated wrapper for project-local aud
PROJ_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
VENV="$PROJ_ROOT/.auditor_venv/bin/aud"
if [ -x "$VENV" ]; then
exec "$VENV" "$@"
fi
# Fallback to module execution
exec "$PROJ_ROOT/.auditor_venv/bin/python" -m theauditor.cli "$@"
'''
POWERSHELL_WRAPPER = r'''# Auto-generated wrapper for project-local aud
$proj = Split-Path -Path (Split-Path -Parent $MyInvocation.MyCommand.Path) -Parent
$aud = Join-Path $proj ".auditor_venv\Scripts\aud.exe"
if (Test-Path $aud) {
& $aud @args
exit $LASTEXITCODE
}
# Fallback to module execution
$python = Join-Path $proj ".auditor_venv\Scripts\python.exe"
& $python "-m" "theauditor.cli" @args
exit $LASTEXITCODE
'''
CMD_WRAPPER = r'''@echo off
REM Auto-generated wrapper for project-local aud
set PROJ=%~dp0..\..
if exist "%PROJ%\.auditor_venv\Scripts\aud.exe" (
"%PROJ%\.auditor_venv\Scripts\aud.exe" %*
exit /b %ERRORLEVEL%
)
REM Fallback to module execution
"%PROJ%\.auditor_venv\Scripts\python.exe" -m theauditor.cli %*
exit /b %ERRORLEVEL%
'''
def create_wrappers(target_dir: Path) -> Dict[str, str]:
"""
Create cross-platform wrapper scripts.
Args:
target_dir: Project root directory
Returns:
Dict mapping wrapper paths to their status
"""
wrappers_dir = target_dir / ".claude" / "bin"
results = {}
# POSIX wrapper (bash)
posix_wrapper = wrappers_dir / "aud"
status = write_file_atomic(posix_wrapper, WrapperTemplates.POSIX_WRAPPER, executable=True)
results[str(posix_wrapper)] = status
# PowerShell wrapper
ps_wrapper = wrappers_dir / "aud.ps1"
status = write_file_atomic(ps_wrapper, WrapperTemplates.POWERSHELL_WRAPPER)
results[str(ps_wrapper)] = status
# CMD wrapper
cmd_wrapper = wrappers_dir / "aud.cmd"
status = write_file_atomic(cmd_wrapper, WrapperTemplates.CMD_WRAPPER)
results[str(cmd_wrapper)] = status
return results
def copy_agent_templates(source_dir: Path, target_dir: Path) -> Dict[str, str]:
"""
Copy all .md agent template files directly to target/.claude/agents/.
Args:
source_dir: Directory containing agent template .md files
target_dir: Project root directory
Returns:
Dict mapping agent paths to their status
"""
agents_dir = target_dir / ".claude" / "agents"
agents_dir.mkdir(parents=True, exist_ok=True)
results = {}
# Find all .md files in source directory
for md_file in source_dir.glob("*.md"):
if md_file.is_file():
# Read content
content = md_file.read_text(encoding='utf-8')
# Write to target
target_file = agents_dir / md_file.name
status = write_file_atomic(target_file, content)
results[str(target_file)] = status
return results
def setup_claude_complete(
target: str,
source: str = "agent_templates",
sync: bool = False,
dry_run: bool = False
) -> Dict[str, List[str]]:
"""
Complete Claude setup: venv, wrappers, hooks, and agents.
Args:
target: Target project root (absolute or relative path)
source: Path to TheAuditor agent templates directory
sync: Force update (still creates .bak on first change)
dry_run: Print plan without executing
Returns:
Dict with created, updated, and skipped file lists
"""
# Resolve paths
target_dir = Path(target).resolve()
if not target_dir.exists():
raise ValueError(f"Target directory does not exist: {target_dir}")
# Find source docs
if Path(source).is_absolute():
source_dir = Path(source)
else:
theauditor_root = find_theauditor_root()
source_dir = theauditor_root / source
if not source_dir.exists():
raise ValueError(f"Source agent templates directory not found: {source_dir}")
print(f"\n{'='*60}")
print(f"Claude Setup - Zero-Optional Installation")
print(f"{'='*60}")
print(f"Target: {target_dir}")
print(f"Source: {source_dir}")
print(f"Mode: {'DRY RUN' if dry_run else 'EXECUTE'}")
print(f"{'='*60}\n")
if dry_run:
print("DRY RUN - Plan of operations:")
print(f"1. Create/verify venv at {target_dir}/.auditor_venv")
print(f"2. Install TheAuditor (editable) into venv")
print(f"3. Create wrappers at {target_dir}/.claude/bin/")
print(f"4. Copy agent templates from {source_dir}/*.md")
print(f"5. Write agents to {target_dir}/.claude/agents/")
print("\nNo files will be modified.")
return {"created": [], "updated": [], "skipped": []}
results = {
"created": [],
"updated": [],
"skipped": [],
"failed": []
}
# Step 1: Setup venv
print("Step 1: Setting up Python virtual environment...", flush=True)
try:
venv_path, success = setup_project_venv(target_dir, force=sync)
if success:
results["created"].append(str(venv_path))
else:
results["failed"].append(f"venv setup at {venv_path}")
print("ERROR: Failed to setup venv. Aborting.")
return results
except Exception as e:
print(f"ERROR setting up venv: {e}")
results["failed"].append("venv setup")
return results
# Step 2: Create wrappers
print("\nStep 2: Creating cross-platform wrappers...", flush=True)
wrapper_results = create_wrappers(target_dir)
for path, status in wrapper_results.items():
results[status].append(path)
# Step 3: Copy agent templates
print("\nStep 3: Copying agent templates...", flush=True)
try:
agent_results = copy_agent_templates(source_dir, target_dir)
for path, status in agent_results.items():
results[status].append(path)
if not agent_results:
print("WARNING: No .md files found in agent_templates directory")
except Exception as e:
print(f"ERROR copying agent templates: {e}")
results["failed"].append("agent template copy")
# Summary
print(f"\n{'='*60}")
print("Setup Complete - Summary:")
print(f"{'='*60}")
print(f"Created: {len(results['created'])} files")
print(f"Updated: {len(results['updated'])} files")
print(f"Skipped: {len(results['skipped'])} files (unchanged)")
if results['failed']:
print(f"FAILED: {len(results['failed'])} operations")
for item in results['failed']:
print(f" - {item}")
check_mark = "[OK]" if IS_WINDOWS else ""
print(f"\n{check_mark} Project configured at: {target_dir}")
print(f"{check_mark} Wrapper available at: {target_dir}/.claude/bin/aud")
print(f"{check_mark} Agents installed to: {target_dir}/.claude/agents/")
print(f"{check_mark} Professional linters installed (ruff, mypy, black, ESLint, etc.)")
return results
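A non-destructive way to exercise the installer, assuming it runs from a TheAuditor checkout so the default agent_templates directory resolves; dry_run only prints the plan:

```python
from theauditor.claude_setup import setup_claude_complete

plan = setup_claude_complete(target=".", dry_run=True)  # prints the plan, touches nothing
print(plan)  # {'created': [], 'updated': [], 'skipped': []}

# A real run additionally reports failures:
# results = setup_claude_complete(target=".", sync=True)
# print(len(results["created"]), len(results["failed"]))
```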

239
theauditor/cli.py Normal file

@@ -0,0 +1,239 @@
"""TheAuditor CLI - Main entry point and command registration hub."""
import platform
import subprocess
import sys
import click
from theauditor import __version__
# Configure UTF-8 console output for Windows
if platform.system() == "Windows":
try:
# Set console code page to UTF-8
subprocess.run(["chcp", "65001"], shell=True, capture_output=True, timeout=1)
# Also configure Python's stdout/stderr
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict')
sys.stderr = codecs.getwriter('utf-8')(sys.stderr.buffer, 'strict')
except Exception:
# Silently continue if chcp fails (not critical)
pass
class VerboseGroup(click.Group):
"""Custom group that shows all subcommands and their key options in help."""
def format_help(self, ctx, formatter):
"""Format help to show all commands with their key options."""
# Original help text
super().format_help(ctx, formatter)
# Add detailed command listing
formatter.write_paragraph()
formatter.write_text("Detailed Command Overview:")
formatter.write_paragraph()
# Core commands
formatter.write_text("CORE ANALYSIS:")
with formatter.indentation():
formatter.write_text("aud full # Complete 13-phase security audit")
formatter.write_text(" --offline # Skip network operations (deps, docs)")
formatter.write_text(" --exclude-self # Exclude TheAuditor's own files")
formatter.write_text(" --quiet # Minimal output")
formatter.write_paragraph()
formatter.write_text("aud index # Build file manifest and symbol database")
formatter.write_text(" --exclude-self # Exclude TheAuditor's own files")
formatter.write_paragraph()
formatter.write_text("aud workset # Analyze only changed files")
formatter.write_text(" --diff HEAD~3..HEAD # Specify git commit range")
formatter.write_text(" --all # Include all files")
formatter.write_paragraph()
formatter.write_text("SECURITY SCANNING:")
with formatter.indentation():
formatter.write_text("aud detect-patterns # Run 100+ security pattern rules")
formatter.write_text(" --workset # Scan only workset files")
formatter.write_paragraph()
formatter.write_text("aud taint-analyze # Track data flow from sources to sinks")
formatter.write_paragraph()
formatter.write_text("aud docker-analyze # Analyze Docker security issues")
formatter.write_text(" --severity critical # Filter by severity")
formatter.write_paragraph()
formatter.write_text("DEPENDENCIES:")
with formatter.indentation():
formatter.write_text("aud deps # Analyze project dependencies")
formatter.write_text(" --vuln-scan # Run npm audit & pip-audit")
formatter.write_text(" --check-latest # Check for outdated packages")
formatter.write_text(" --upgrade-all # YOLO: upgrade everything to latest")
formatter.write_paragraph()
formatter.write_text("CODE QUALITY:")
with formatter.indentation():
formatter.write_text("aud lint # Run all configured linters")
formatter.write_text(" --fix # Auto-fix issues where possible")
formatter.write_text(" --workset # Lint only changed files")
formatter.write_paragraph()
formatter.write_text("ANALYSIS & REPORTING:")
with formatter.indentation():
formatter.write_text("aud graph build # Build dependency graph")
formatter.write_text("aud graph analyze # Find cycles and architectural issues")
formatter.write_paragraph()
formatter.write_text("aud impact # Analyze change impact radius")
formatter.write_text(" --file src/auth.py # Specify file to analyze")
formatter.write_text(" --line 42 # Specific line number")
formatter.write_paragraph()
formatter.write_text("aud refactor # Detect incomplete refactorings")
formatter.write_text(" --auto-detect # Auto-detect from migrations")
formatter.write_text(" --workset # Check current changes")
formatter.write_paragraph()
formatter.write_text("aud fce # Run Factual Correlation Engine")
formatter.write_text("aud report # Generate final report")
formatter.write_text("aud structure # Generate project structure report")
formatter.write_paragraph()
formatter.write_text("ADVANCED:")
with formatter.indentation():
formatter.write_text("aud insights # Run optional insights analysis")
formatter.write_text(" --mode ml # ML risk predictions")
formatter.write_text(" --mode graph # Architecture health scoring")
formatter.write_text(" --mode taint # Security severity analysis")
formatter.write_paragraph()
formatter.write_text("aud learn # Train ML models on codebase")
formatter.write_text("aud suggest # Get ML-powered suggestions")
formatter.write_paragraph()
formatter.write_text("SETUP & CONFIG:")
with formatter.indentation():
formatter.write_text("aud init # Initialize .pf/ directory")
formatter.write_text("aud setup-claude # Setup sandboxed JS/TS tools")
formatter.write_text(" --target . # Target directory")
formatter.write_paragraph()
formatter.write_text("aud init-js # Create/merge package.json")
formatter.write_text("aud init-config # Initialize configuration")
formatter.write_paragraph()
formatter.write_text("For detailed help on any command: aud <command> --help")
@click.group(cls=VerboseGroup)
@click.version_option(version=__version__, prog_name="aud")
@click.help_option("-h", "--help")
def cli():
"""TheAuditor - Offline, air-gapped CLI for repo indexing and evidence checking.
Quick Start:
aud init # Initialize project
aud full # Run complete audit
aud full --offline # Run without network operations
View results in .pf/readthis/ directory."""
pass
# Import and register commands
from theauditor.commands.init import init
from theauditor.commands.index import index
from theauditor.commands.workset import workset
from theauditor.commands.lint import lint
from theauditor.commands.deps import deps
from theauditor.commands.report import report
from theauditor.commands.summary import summary
from theauditor.commands.graph import graph
from theauditor.commands.full import full
from theauditor.commands.fce import fce
from theauditor.commands.impact import impact
from theauditor.commands.taint import taint_analyze
from theauditor.commands.setup import setup_claude
# Import additional migrated commands
from theauditor.commands.detect_patterns import detect_patterns
from theauditor.commands.detect_frameworks import detect_frameworks
from theauditor.commands.docs import docs
from theauditor.commands.tool_versions import tool_versions
from theauditor.commands.init_js import init_js
from theauditor.commands.init_config import init_config
from theauditor.commands.validate_templates import validate_templates
# Import ML commands
from theauditor.commands.ml import learn, suggest, learn_feedback
# Import internal commands (prefixed with _)
from theauditor.commands._archive import _archive
# Import rules command
from theauditor.commands.rules import rules_command
# Import refactoring analysis commands
from theauditor.commands.refactor import refactor_command
from theauditor.commands.insights import insights_command
# Import new commands
from theauditor.commands.docker_analyze import docker_analyze
from theauditor.commands.structure import structure
# Register simple commands
cli.add_command(init)
cli.add_command(index)
cli.add_command(workset)
cli.add_command(lint)
cli.add_command(deps)
cli.add_command(report)
cli.add_command(summary)
cli.add_command(full)
cli.add_command(fce)
cli.add_command(impact)
cli.add_command(taint_analyze)
cli.add_command(setup_claude)
# Register additional migrated commands
cli.add_command(detect_patterns)
cli.add_command(detect_frameworks)
cli.add_command(docs)
cli.add_command(tool_versions)
cli.add_command(init_js)
cli.add_command(init_config)
cli.add_command(validate_templates)
# Register ML commands
cli.add_command(learn)
cli.add_command(suggest)
cli.add_command(learn_feedback)
# Register internal commands (not for direct user use)
cli.add_command(_archive)
# Register rules command
cli.add_command(rules_command)
# Register refactoring analysis commands
cli.add_command(refactor_command, name="refactor")
cli.add_command(insights_command, name="insights")
# Register new commands
cli.add_command(docker_analyze)
cli.add_command(structure)
# Register command groups
cli.add_command(graph)
# All commands have been migrated to separate modules
def main():
"""Main entry point for console script."""
cli()
if __name__ == "__main__":
main()
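Because every command is a Click object registered on cli, the help surface can be smoke-tested in-process with Click's test runner; a sketch, not part of the shipped test suite:

```python
from click.testing import CliRunner
from theauditor.cli import cli

runner = CliRunner()
result = runner.invoke(cli, ["--help"])
assert result.exit_code == 0
assert "CORE ANALYSIS:" in result.output  # VerboseGroup's extended listing
```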

1
theauditor/commands/__init__.py Normal file

@@ -0,0 +1 @@
"""Commands module for TheAuditor CLI."""

107
theauditor/commands/_archive.py Normal file

@@ -0,0 +1,107 @@
"""Internal archive command for segregating history by run type."""
import shutil
import sys
from datetime import datetime
from pathlib import Path
import click
@click.command(name="_archive")
@click.option("--run-type", required=True, type=click.Choice(["full", "diff"]), help="Type of run being archived")
@click.option("--diff-spec", help="Git diff specification for diff runs (e.g., main..HEAD)")
def _archive(run_type: str, diff_spec: str = None):
"""
Internal command to archive previous run artifacts with segregation by type.
This command is not intended for direct user execution. It's called by
the full and orchestrate workflows to maintain clean, segregated history.
"""
# Define base paths
pf_dir = Path(".pf")
history_dir = pf_dir / "history"
# Check if there's a previous run to archive (by checking if .pf exists and has files)
if not pf_dir.exists() or not any(pf_dir.iterdir()):
# No previous run to archive
print("[ARCHIVE] No previous run artifacts found to archive", file=sys.stderr)
return
# Determine destination base path based on run type
if run_type == "full":
dest_base = history_dir / "full"
else: # run_type == "diff"
dest_base = history_dir / "diff"
# Create destination base directory if it doesn't exist
dest_base.mkdir(parents=True, exist_ok=True)
# Generate timestamp for archive directory
timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
# Create unique directory name
if run_type == "diff" and diff_spec:
# Sanitize diff spec for directory name
# Replace problematic characters with underscores
safe_spec = diff_spec.replace("..", "_")
safe_spec = safe_spec.replace("/", "_")
safe_spec = safe_spec.replace("\\", "_")
safe_spec = safe_spec.replace(":", "_")
safe_spec = safe_spec.replace(" ", "_")
safe_spec = safe_spec.replace("~", "_")
safe_spec = safe_spec.replace("^", "_")
# Create descriptive name like "main_HEAD_20250819_090015"
dir_name = f"{safe_spec}_{timestamp_str}"
else:
# Simple timestamp for full runs
dir_name = timestamp_str
# Create the archive destination directory
archive_dest = dest_base / dir_name
archive_dest.mkdir(exist_ok=True)
# Move all top-level items from pf_dir to archive_dest
archived_count = 0
skipped_count = 0
for item in pf_dir.iterdir():
# CRITICAL: Skip the history directory itself to prevent recursive archiving
if item.name == "history":
continue
# Safely move the item to archive destination
try:
shutil.move(str(item), str(archive_dest))
archived_count += 1
except Exception as e:
# Log error but don't stop the archiving process
print(f"[WARNING] Could not archive {item.name}: {e}", file=sys.stderr)
skipped_count += 1
# Log summary
if archived_count > 0:
click.echo(f"[ARCHIVE] Archived {archived_count} items to {archive_dest}")
if skipped_count > 0:
click.echo(f"[ARCHIVE] Skipped {skipped_count} items due to errors")
else:
click.echo("[ARCHIVE] No artifacts archived (directory was empty)")
# Create a metadata file in the archive to track run type and context
metadata = {
"run_type": run_type,
"diff_spec": diff_spec,
"timestamp": timestamp_str,
"archived_at": datetime.now().isoformat(),
"files_archived": archived_count,
"files_skipped": skipped_count,
}
try:
import json
metadata_path = archive_dest / "_metadata.json"
with open(metadata_path, 'w') as f:
json.dump(metadata, f, indent=2)
except Exception as e:
print(f"[WARNING] Could not write metadata file: {e}", file=sys.stderr)

191
theauditor/commands/deps.py Normal file
View File

@@ -0,0 +1,191 @@
"""Parse and analyze project dependencies."""
import platform
from pathlib import Path
import click
from theauditor.utils.error_handler import handle_exceptions
from theauditor.utils.exit_codes import ExitCodes
# Detect if running on Windows for character encoding
IS_WINDOWS = platform.system() == "Windows"
@click.command()
@handle_exceptions
@click.option("--root", default=".", help="Root directory")
@click.option("--check-latest", is_flag=True, help="Check for latest versions from registries")
@click.option("--upgrade-all", is_flag=True, help="YOLO mode: Update ALL packages to latest versions")
@click.option("--offline", is_flag=True, help="Force offline mode (no network)")
@click.option("--out", default="./.pf/raw/deps.json", help="Output dependencies file")
@click.option("--print-stats", is_flag=True, help="Print dependency statistics")
@click.option("--vuln-scan", is_flag=True, help="Scan dependencies for known vulnerabilities")
def deps(root, check_latest, upgrade_all, offline, out, print_stats, vuln_scan):
"""Parse and analyze project dependencies."""
from theauditor.deps import parse_dependencies, write_deps_json, check_latest_versions, write_deps_latest_json, upgrade_all_deps
from theauditor.vulnerability_scanner import scan_dependencies, write_vulnerabilities_json, format_vulnerability_report
import sys
# Parse dependencies
deps_list = parse_dependencies(root_path=root)
if not deps_list:
click.echo("No dependency files found (package.json, pyproject.toml, requirements.txt)")
click.echo(" Searched in: " + str(Path(root).resolve()))
return
write_deps_json(deps_list, output_path=out)
# Vulnerability scanning
if vuln_scan:
click.echo(f"\n[SCAN] Running native vulnerability scanners...")
click.echo(f" Using: npm audit, pip-audit (if available)")
vulnerabilities = scan_dependencies(deps_list, offline=offline)
if vulnerabilities:
# Write JSON report
vuln_output = out.replace("deps.json", "vulnerabilities.json")
write_vulnerabilities_json(vulnerabilities, output_path=vuln_output)
# Display human-readable report
report = format_vulnerability_report(vulnerabilities)
click.echo("\n" + report)
click.echo(f"\nDetailed report written to {vuln_output}")
# Exit with error code if critical vulnerabilities found
critical_count = sum(1 for v in vulnerabilities if v["severity"] == "critical")
if critical_count > 0:
click.echo(f"\n[FAIL] Found {critical_count} CRITICAL vulnerabilities - failing build")
sys.exit(ExitCodes.CRITICAL_SEVERITY)
else:
click.echo(f" [OK] No known vulnerabilities found in dependencies")
# Don't continue with other operations after vuln scan
return
# YOLO MODE: Upgrade all to latest
if upgrade_all and not offline:
click.echo("[YOLO MODE] Upgrading ALL packages to latest versions...")
click.echo(" [WARN] This may break things. That's the point!")
# Get latest versions
latest_info = check_latest_versions(deps_list, allow_net=True, offline=offline)
if not latest_info:
click.echo(" [FAIL] Failed to fetch latest versions")
return
# Check if all packages were successfully checked
failed_checks = sum(1 for info in latest_info.values() if info.get("error") is not None)
successful_checks = sum(1 for info in latest_info.values() if info.get("latest") is not None)
if failed_checks > 0:
click.echo(f"\n [WARN] Only {successful_checks}/{len(latest_info)} packages checked successfully")
click.echo(f" [FAIL] Cannot upgrade with {failed_checks} failed checks")
click.echo(" Fix network issues and try again")
return
# Upgrade all dependency files
upgraded = upgrade_all_deps(root_path=root, latest_info=latest_info, deps_list=deps_list)
# Count unique packages that were upgraded
unique_upgraded = len([1 for k, v in latest_info.items() if v.get("is_outdated", False)])
total_updated = sum(upgraded.values())
click.echo(f"\n[UPGRADED] Dependency files:")
for file_type, count in upgraded.items():
if count > 0:
click.echo(f" [OK] {file_type}: {count} dependency entries updated")
# Show summary that matches the "Outdated: 10/29" format
if total_updated > unique_upgraded:
click.echo(f"\n Summary: {unique_upgraded} unique packages updated across {total_updated} occurrences")
click.echo("\n[NEXT STEPS]:")
click.echo(" 1. Run: pip install -r requirements.txt")
click.echo(" 2. Or: npm install")
click.echo(" 3. Pray it still works")
return
# Check latest versions if requested
latest_info = {}
if check_latest and not offline:
# Count unique packages first
unique_packages = {}
for dep in deps_list:
key = f"{dep['manager']}:{dep['name']}"
if key not in unique_packages:
unique_packages[key] = 0
unique_packages[key] += 1
click.echo(f"Checking {len(deps_list)} dependencies for updates...")
click.echo(f" Unique packages to check: {len(unique_packages)}")
click.echo(" Connecting to: npm registry and PyPI")
latest_info = check_latest_versions(deps_list, allow_net=True, offline=offline)
if latest_info:
write_deps_latest_json(latest_info, output_path=out.replace("deps.json", "deps_latest.json"))
# Count successful vs failed checks
successful_checks = sum(1 for info in latest_info.values() if info.get("latest") is not None)
failed_checks = sum(1 for info in latest_info.values() if info.get("error") is not None)
click.echo(f" [OK] Checked {successful_checks}/{len(unique_packages)} unique packages")
if failed_checks > 0:
click.echo(f" [WARN] {failed_checks} packages failed to check")
# Show first few errors
errors = [(k.split(":")[1], v["error"]) for k, v in latest_info.items() if v.get("error")][:3]
for pkg, err in errors:
click.echo(f" - {pkg}: {err}")
else:
click.echo(" [FAIL] Failed to check versions (network issue or offline mode)")
# Always show output
click.echo(f"Dependencies written to {out}")
# Count by manager
npm_count = sum(1 for d in deps_list if d["manager"] == "npm")
py_count = sum(1 for d in deps_list if d["manager"] == "py")
click.echo(f" Total: {len(deps_list)} dependencies")
if npm_count > 0:
click.echo(f" Node/npm: {npm_count}")
if py_count > 0:
click.echo(f" Python: {py_count}")
if latest_info:
# Count how many of the TOTAL deps are outdated (only if successfully checked)
outdated_deps = 0
checked_deps = 0
for dep in deps_list:
key = f"{dep['manager']}:{dep['name']}"
if key in latest_info and latest_info[key].get("latest") is not None:
checked_deps += 1
if latest_info[key]["is_outdated"]:
outdated_deps += 1
# Also count unique outdated packages
outdated_unique = sum(1 for info in latest_info.values() if info.get("is_outdated", False))
# Show outdated/checked rather than outdated/total
if checked_deps == len(deps_list):
# All were checked successfully
click.echo(f" Outdated: {outdated_deps}/{len(deps_list)}")
else:
# Some failed, show both numbers
click.echo(f" Outdated: {outdated_deps}/{checked_deps} checked ({len(deps_list)} total)")
# Show major updates
major_updates = [
(k.split(":")[1], v["locked"], v["latest"])
for k, v in latest_info.items()
if v.get("delta") == "major"
]
if major_updates:
click.echo("\n Major version updates available:")
for name, locked, latest in major_updates[:5]:
click.echo(f" - {name}: {locked} -> {latest}")
if len(major_updates) > 5:
click.echo(f" ... and {len(major_updates) - 5} more")
# Add a helpful hint if no network operation was performed
if not check_latest and not upgrade_all:
click.echo("\nTIP: Run with --check-latest to check for outdated packages.")

View File

@@ -0,0 +1,46 @@
"""Detect frameworks and libraries used in the project."""
import json
import click
from pathlib import Path
@click.command("detect-frameworks")
@click.option("--project-path", default=".", help="Root directory to analyze")
@click.option("--output-json", help="Path to output JSON file (default: .pf/raw/frameworks.json)")
def detect_frameworks(project_path, output_json):
"""Detect frameworks and libraries used in the project."""
from theauditor.framework_detector import FrameworkDetector
try:
# Initialize detector
project_path = Path(project_path).resolve()
detector = FrameworkDetector(project_path, exclude_patterns=[])
# Detect frameworks
frameworks = detector.detect_all()
# Determine output path - always save to .pf/frameworks.json by default
if output_json:
# User specified custom path
save_path = Path(output_json)
else:
# Default path
save_path = Path(project_path) / ".pf" / "raw" / "frameworks.json"
# Always save the JSON output
detector.save_to_file(save_path)
click.echo(f"Frameworks written to {save_path}")
# Display table
table = detector.format_table()
click.echo(table)
# Return success
if frameworks:
click.echo(f"\nDetected {len(frameworks)} framework(s)")
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
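
The command is a thin wrapper around the detector class, so the same detection can be scripted directly; a sketch using only the calls visible above:

    from pathlib import Path
    from theauditor.framework_detector import FrameworkDetector

    detector = FrameworkDetector(Path(".").resolve(), exclude_patterns=[])
    frameworks = detector.detect_all()
    detector.save_to_file(Path(".pf/raw/frameworks.json"))
    print(detector.format_table())
    print(f"Detected {len(frameworks)} framework(s)")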

View File

@@ -0,0 +1,81 @@
"""Detect universal runtime, DB, and logic patterns in code."""
import click
from pathlib import Path
from theauditor.utils.helpers import get_self_exclusion_patterns
@click.command("detect-patterns")
@click.option("--project-path", default=".", help="Root directory to analyze")
@click.option("--patterns", multiple=True, help="Pattern categories to use (e.g., runtime_issues, db_issues)")
@click.option("--output-json", help="Path to output JSON file")
@click.option("--file-filter", help="Glob pattern to filter files")
@click.option("--max-rows", default=50, type=int, help="Maximum rows to display in table")
@click.option("--print-stats", is_flag=True, help="Print summary statistics")
@click.option("--with-ast/--no-ast", default=True, help="Enable AST-based pattern matching")
@click.option("--with-frameworks/--no-frameworks", default=True, help="Enable framework detection and framework-specific patterns")
@click.option("--exclude-self", is_flag=True, help="Exclude TheAuditor's own files (for self-testing)")
def detect_patterns(project_path, patterns, output_json, file_filter, max_rows, print_stats, with_ast, with_frameworks, exclude_self):
"""Detect universal runtime, DB, and logic patterns in code."""
from theauditor.pattern_loader import PatternLoader
from theauditor.universal_detector import UniversalPatternDetector
try:
# Initialize detector
project_path = Path(project_path).resolve()
pattern_loader = PatternLoader()
# Get exclusion patterns using centralized function
exclude_patterns = get_self_exclusion_patterns(exclude_self)
detector = UniversalPatternDetector(
project_path,
pattern_loader,
with_ast=with_ast,
with_frameworks=with_frameworks,
exclude_patterns=exclude_patterns
)
# Run detection
categories = list(patterns) if patterns else None
findings = detector.detect_patterns(categories=categories, file_filter=file_filter)
# Always save results to default location
patterns_output = project_path / ".pf" / "raw" / "patterns.json"
patterns_output.parent.mkdir(parents=True, exist_ok=True)
# Save to user-specified location if provided
if output_json:
detector.to_json(Path(output_json))
click.echo(f"\n[OK] Full results saved to: {output_json}")
# Save to default location
detector.to_json(patterns_output)
click.echo(f"[OK] Full results saved to: {patterns_output}")
# Display table
table = detector.format_table(max_rows=max_rows)
click.echo(table)
# Print statistics if requested
if print_stats:
stats = detector.get_summary_stats()
click.echo("\n--- Summary Statistics ---")
click.echo(f"Total findings: {stats['total_findings']}")
click.echo(f"Files affected: {stats['files_affected']}")
if stats['by_severity']:
click.echo("\nBy severity:")
for severity, count in sorted(stats['by_severity'].items()):
click.echo(f" {severity}: {count}")
if stats['by_category']:
click.echo("\nBy category:")
for category, count in sorted(stats['by_category'].items()):
click.echo(f" {category}: {count}")
# Successfully completed - found and reported all issues
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
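
For programmatic use, the detector can be driven directly with the same arguments the command wires up; a sketch restricted to one pattern category:

    from pathlib import Path
    from theauditor.pattern_loader import PatternLoader
    from theauditor.universal_detector import UniversalPatternDetector

    detector = UniversalPatternDetector(
        Path(".").resolve(),
        PatternLoader(),
        with_ast=True,
        with_frameworks=True,
        exclude_patterns=[],
    )
    # Mirrors: aud detect-patterns --patterns runtime_issues
    findings = detector.detect_patterns(categories=["runtime_issues"], file_filter=None)
    stats = detector.get_summary_stats()
    print(stats["total_findings"], "findings across", stats["files_affected"], "files")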

View File

@@ -0,0 +1,94 @@
"""Docker security analysis command."""
import click
import json
from pathlib import Path
from theauditor.utils.error_handler import handle_exceptions
from theauditor.utils.exit_codes import ExitCodes
@click.command("docker-analyze")
@handle_exceptions
@click.option("--db-path", default="./.pf/repo_index.db", help="Path to repo_index.db")
@click.option("--output", help="Output file for findings (JSON format)")
@click.option("--severity", type=click.Choice(["all", "critical", "high", "medium", "low"]),
default="all", help="Minimum severity to report")
@click.option("--check-vulns/--no-check-vulns", default=True,
help="Check base images for vulnerabilities (requires network)")
def docker_analyze(db_path, output, severity, check_vulns):
"""Analyze Docker images for security issues.
Detects:
- Containers running as root
- Exposed secrets in ENV/ARG instructions
- High entropy values (potential secrets)
- Base image vulnerabilities (if --check-vulns enabled)
"""
from theauditor.docker_analyzer import analyze_docker_images
# Check if database exists
if not Path(db_path).exists():
click.echo(f"Error: Database not found at {db_path}", err=True)
click.echo("Run 'aud index' first to create the database", err=True)
return ExitCodes.TASK_INCOMPLETE
# Run analysis
click.echo("Analyzing Docker images for security issues...")
if check_vulns:
click.echo(" Including vulnerability scan of base images...")
findings = analyze_docker_images(db_path, check_vulnerabilities=check_vulns)
# Filter by severity if requested
if severity != "all":
severity_order = {"critical": 4, "high": 3, "medium": 2, "low": 1}
min_severity = severity_order.get(severity.lower(), 0)
findings = [f for f in findings
if severity_order.get(f.get("severity", "").lower(), 0) >= min_severity]
# Count by severity
severity_counts = {}
for finding in findings:
sev = finding.get("severity", "unknown").lower()
severity_counts[sev] = severity_counts.get(sev, 0) + 1
# Display results
if findings:
click.echo(f"\nFound {len(findings)} Docker security issues:")
# Show severity breakdown
for sev in ["critical", "high", "medium", "low"]:
if sev in severity_counts:
click.echo(f" {sev.upper()}: {severity_counts[sev]}")
# Show findings
click.echo("\nFindings:")
for finding in findings:
click.echo(f"\n[{finding['severity'].upper()}] {finding['type']}")
click.echo(f" File: {finding['file']}")
click.echo(f" {finding['message']}")
if finding.get('recommendation'):
click.echo(f" Fix: {finding['recommendation']}")
else:
click.echo("No Docker security issues found")
# Save to file if requested
if output:
output_path = Path(output)
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w') as f:
json.dump({
"findings": findings,
"summary": severity_counts,
"total": len(findings)
}, f, indent=2)
click.echo(f"\nResults saved to: {output}")
# Exit with appropriate code
if severity_counts.get("critical", 0) > 0:
return ExitCodes.CRITICAL_SEVERITY
elif severity_counts.get("high", 0) > 0:
return ExitCodes.HIGH_SEVERITY
else:
return ExitCodes.SUCCESS
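
The JSON written with --output has the findings/summary/total shape shown above, so downstream tooling can consume it directly; "docker_findings.json" here is a hypothetical output path:

    import json
    from pathlib import Path

    report = json.loads(Path("docker_findings.json").read_text())
    for finding in report["findings"]:
        if finding.get("severity", "").lower() in ("critical", "high"):
            print(f"[{finding['severity'].upper()}] {finding['file']}: {finding['message']}")
    print(f"total: {report['total']}, by severity: {report['summary']}")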

201
theauditor/commands/docs.py Normal file
View File

@@ -0,0 +1,201 @@
"""Fetch or summarize documentation for dependencies."""
import json
import click
from pathlib import Path
@click.command("docs")
@click.argument("action", type=click.Choice(["fetch", "summarize", "view", "list"]))
@click.argument("package_name", required=False)
@click.option("--deps", default="./.pf/deps.json", help="Input dependencies file")
@click.option("--offline", is_flag=True, help="Force offline mode")
@click.option("--allow-non-gh-readmes", is_flag=True, help="Allow non-GitHub README fetching")
@click.option("--docs-dir", default="./.pf/context/docs", help="Documentation cache directory")
@click.option("--capsules-dir", default="./.pf/context/doc_capsules", help="Output capsules directory")
@click.option("--workset", default="./.pf/workset.json", help="Workset file for filtering")
@click.option("--print-stats", is_flag=True, help="Print statistics")
@click.option("--raw", is_flag=True, help="View raw fetched doc instead of capsule")
def docs(action, package_name, deps, offline, allow_non_gh_readmes, docs_dir, capsules_dir, workset, print_stats, raw):
"""Fetch or summarize documentation for dependencies."""
from theauditor.deps import parse_dependencies
from theauditor.docs_fetch import fetch_docs, DEFAULT_ALLOWLIST
from theauditor.docs_summarize import summarize_docs
try:
if action == "fetch":
# Load dependencies
if Path(deps).exists():
with open(deps, encoding="utf-8") as f:
deps_list = json.load(f)
else:
# Parse if not cached
deps_list = parse_dependencies()
# Set up allowlist
allowlist = DEFAULT_ALLOWLIST.copy()
if not allow_non_gh_readmes:
# Already restricted to GitHub by default
pass
# Check for policy file
policy_file = Path(".pf/policy.yml")
allow_net = True
if policy_file.exists():
try:
# Simple YAML parsing without external deps
with open(policy_file, encoding="utf-8") as f:
for line in f:
if "allow_net:" in line:
allow_net = "true" in line.lower()
break
except Exception:
pass # Default to True
# Fetch docs
result = fetch_docs(
deps_list,
allow_net=allow_net,
allowlist=allowlist,
offline=offline,
output_dir=docs_dir
)
if not print_stats:
if result["mode"] == "offline":
click.echo("Running in offline mode - no documentation fetched")
else:
click.echo(f"Documentation fetch complete:")
click.echo(f" Fetched: {result['fetched']}")
click.echo(f" Cached: {result['cached']}")
click.echo(f" Skipped: {result['skipped']}")
if result["errors"]:
click.echo(f" Errors: {len(result['errors'])}")
elif action == "summarize":
# Summarize docs
result = summarize_docs(
docs_dir=docs_dir,
output_dir=capsules_dir,
workset_path=workset if Path(workset).exists() else None
)
if not print_stats:
click.echo(f"Documentation capsules created:")
click.echo(f" Capsules: {result['capsules_created']}")
click.echo(f" Skipped: {result['skipped']}")
if result["errors"]:
click.echo(f" Errors: {len(result['errors'])}")
index_file = Path(capsules_dir).parent / "doc_index.json"
click.echo(f" Index: {index_file}")
elif action == "list":
# List available docs and capsules
docs_path = Path(docs_dir)
capsules_path = Path(capsules_dir)
click.echo("\n[Docs] Available Documentation:\n")
# List fetched docs
if docs_path.exists():
click.echo("Fetched Docs (.pf/context/docs/):")
for ecosystem in ["npm", "py"]:
ecosystem_dir = docs_path / ecosystem
if ecosystem_dir.exists():
packages = sorted([d.name for d in ecosystem_dir.iterdir() if d.is_dir()])
if packages:
click.echo(f"\n {ecosystem.upper()}:")
for pkg in packages[:20]: # Show first 20
click.echo(f" * {pkg}")
if len(packages) > 20:
click.echo(f" ... and {len(packages) - 20} more")
# List capsules
if capsules_path.exists():
click.echo("\nDoc Capsules (.pf/context/doc_capsules/):")
capsules = sorted([f.stem for f in capsules_path.glob("*.md")])
if capsules:
for capsule in capsules[:20]: # Show first 20
click.echo(f" * {capsule}")
if len(capsules) > 20:
click.echo(f" ... and {len(capsules) - 20} more")
click.echo("\n[TIP] Use 'aud docs view <package_name>' to view a specific doc")
click.echo(" Add --raw to see the full fetched doc instead of capsule")
elif action == "view":
if not package_name:
click.echo("Error: Package name required for view action")
click.echo("Usage: aud docs view <package_name>")
click.echo(" aud docs view geopandas")
click.echo(" aud docs view numpy --raw")
raise click.ClickException("Package name required")
docs_path = Path(docs_dir)
capsules_path = Path(capsules_dir)
found = False
if raw:
# View raw fetched doc
for ecosystem in ["npm", "py"]:
# Try exact match first
for pkg_dir in (docs_path / ecosystem).glob(f"{package_name}@*"):
if pkg_dir.is_dir():
doc_file = pkg_dir / "doc.md"
if doc_file.exists():
click.echo(f"\n[RAW DOC] Raw Doc: {pkg_dir.name}\n")
click.echo("=" * 80)
with open(doc_file, encoding="utf-8") as f:
content = f.read()
# Limit output for readability
lines = content.split("\n")
if len(lines) > 200:
click.echo("\n".join(lines[:200]))
click.echo(f"\n... (truncated, {len(lines) - 200} more lines)")
else:
click.echo(content)
found = True
break
if found:
break
else:
# View capsule (default)
# Try exact match first
for capsule_file in capsules_path.glob(f"*{package_name}*.md"):
if capsule_file.exists():
click.echo(f"\n[CAPSULE] Capsule: {capsule_file.stem}\n")
click.echo("=" * 80)
with open(capsule_file, encoding="utf-8") as f:
click.echo(f.read())
click.echo("\n" + "=" * 80)
# Try to find the corresponding full doc
package_parts = capsule_file.stem.replace("__", "@").split("@")
if len(package_parts) >= 2:
ecosystem_prefix = package_parts[0]
pkg_name = "@".join(package_parts[:-1]).replace(ecosystem_prefix + "@", "")
version = package_parts[-1]
ecosystem = "py" if ecosystem_prefix == "py" else "npm"
full_doc_path = f"./.pf/context/docs/{ecosystem}/{pkg_name}@{version}/doc.md"
click.echo(f"\n[SOURCE] Full Documentation: `{full_doc_path}`")
click.echo("[TIP] Use --raw to see the full fetched documentation")
found = True
break
if not found:
click.echo(f"No documentation found for '{package_name}'")
click.echo("\nAvailable packages:")
# Show some available packages
for ecosystem in ["npm", "py"]:
ecosystem_dir = docs_path / ecosystem
if ecosystem_dir.exists():
packages = [d.name for d in ecosystem_dir.iterdir() if d.is_dir()][:5]
if packages:
click.echo(f" {ecosystem.upper()}: {', '.join(packages)}")
click.echo("\nUse 'aud docs list' to see all available docs")
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
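
The network gate is a single line in .pf/policy.yml: the fetch path scans for "allow_net:" and treats anything without the word "true" as off. A sketch that disables fetching:

    from pathlib import Path

    policy = Path(".pf/policy.yml")
    policy.parent.mkdir(parents=True, exist_ok=True)
    policy.write_text("allow_net: false\n")
    # Subsequent "aud docs fetch" runs should now stay cache-only.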

View File

@@ -0,0 +1,43 @@
"""Run Factual Correlation Engine to aggregate and correlate findings."""
import click
from theauditor.utils.error_handler import handle_exceptions
@click.command(name="fce")
@handle_exceptions
@click.option("--root", default=".", help="Root directory")
@click.option("--capsules", default="./.pf/capsules", help="Capsules directory")
@click.option("--manifest", default="manifest.json", help="Manifest file path")
@click.option("--workset", default="./.pf/workset.json", help="Workset file path")
@click.option("--timeout", default=600, type=int, help="Timeout in seconds")
@click.option("--print-plan", is_flag=True, help="Print detected tools without running")
def fce(root, capsules, manifest, workset, timeout, print_plan):
"""Run Factual Correlation Engine to aggregate and correlate findings."""
from theauditor.fce import run_fce
result = run_fce(
root_path=root,
capsules_dir=capsules,
manifest_path=manifest,
workset_path=workset,
timeout=timeout,
print_plan=print_plan,
)
if result.get("printed_plan"):
return
if result["success"]:
if result["failures_found"] == 0:
click.echo("[OK] All tools passed - no failures detected")
else:
click.echo(f"Found {result['failures_found']} failures")
# Check if output_files exists and has at least 2 elements
if result.get('output_files') and len(result.get('output_files', [])) > 1:
click.echo(f"FCE report written to: {result['output_files'][1]}")
elif result.get('output_files') and len(result.get('output_files', [])) > 0:
click.echo(f"FCE report written to: {result['output_files'][0]}")
else:
click.echo(f"Error: {result.get('error', 'Unknown error')}", err=True)
raise click.ClickException(result.get("error", "FCE failed"))

View File

@@ -0,0 +1,90 @@
"""Run complete audit pipeline."""
import sys
import click
from theauditor.utils.error_handler import handle_exceptions
from theauditor.utils.exit_codes import ExitCodes
@click.command()
@handle_exceptions
@click.option("--root", default=".", help="Root directory to analyze")
@click.option("--quiet", is_flag=True, help="Minimal output")
@click.option("--exclude-self", is_flag=True, help="Exclude TheAuditor's own files (for self-testing)")
@click.option("--offline", is_flag=True, help="Skip network operations (deps, docs)")
def full(root, quiet, exclude_self, offline):
"""Run complete audit pipeline in exact order specified in teamsop.md."""
from theauditor.pipelines import run_full_pipeline
# Define log callback for console output
def log_callback(message, is_error=False):
if is_error:
click.echo(message, err=True)
else:
click.echo(message)
# Run the pipeline
result = run_full_pipeline(
root=root,
quiet=quiet,
exclude_self=exclude_self,
offline=offline,
log_callback=log_callback if not quiet else None
)
# Display clear status message based on results
findings = result.get("findings", {})
critical = findings.get("critical", 0)
high = findings.get("high", 0)
medium = findings.get("medium", 0)
low = findings.get("low", 0)
click.echo("\n" + "=" * 60)
click.echo("AUDIT FINAL STATUS")
click.echo("=" * 60)
# Determine overall status and exit code
exit_code = ExitCodes.SUCCESS
# Check for pipeline failures first
if result["failed_phases"] > 0:
click.echo(f"[WARNING] Pipeline completed with {result['failed_phases']} phase failures")
click.echo("Some analysis phases could not complete successfully.")
exit_code = ExitCodes.TASK_INCOMPLETE # Exit code for pipeline failures
# Then check for security findings
if critical > 0:
click.echo(f"\nSTATUS: [CRITICAL] - Audit complete. Found {critical} critical vulnerabilities.")
click.echo("Immediate action required - deployment should be blocked.")
exit_code = ExitCodes.CRITICAL_SEVERITY # Exit code for critical findings
elif high > 0:
click.echo(f"\nSTATUS: [HIGH] - Audit complete. Found {high} high-severity issues.")
click.echo("Priority remediation needed before next release.")
if exit_code == ExitCodes.SUCCESS:
exit_code = ExitCodes.HIGH_SEVERITY # Exit code for high findings (unless already set for failures)
elif medium > 0 or low > 0:
click.echo(f"\nSTATUS: [MODERATE] - Audit complete. Found {medium} medium and {low} low issues.")
click.echo("Schedule fixes for upcoming sprints.")
else:
click.echo("\nSTATUS: [CLEAN] - No critical or high-severity issues found.")
click.echo("Codebase meets security and quality standards.")
# Show findings breakdown if any exist
if critical + high + medium + low > 0:
click.echo("\nFindings breakdown:")
if critical > 0:
click.echo(f" - Critical: {critical}")
if high > 0:
click.echo(f" - High: {high}")
if medium > 0:
click.echo(f" - Medium: {medium}")
if low > 0:
click.echo(f" - Low: {low}")
click.echo("\nReview the chunked data in .pf/readthis/ for complete findings.")
click.echo("=" * 60)
# Exit with appropriate code for CI/CD automation
# Using standardized exit codes from ExitCodes class
if exit_code != ExitCodes.SUCCESS:
sys.exit(exit_code)
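
Those exit codes are the CI contract; a gate sketch, assuming the ExitCodes constants are plain integers and aud is on PATH:

    import subprocess
    from theauditor.utils.exit_codes import ExitCodes

    code = subprocess.run(["aud", "full", "--quiet", "--offline"]).returncode
    if code == ExitCodes.CRITICAL_SEVERITY:
        raise SystemExit("critical findings - block the deployment")
    if code == ExitCodes.HIGH_SEVERITY:
        raise SystemExit("high-severity findings - fix before release")
    if code == ExitCodes.TASK_INCOMPLETE:
        raise SystemExit("pipeline phases failed - results are incomplete")
    print("audit clean enough to proceed")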

View File

@@ -0,0 +1,639 @@
"""Cross-project dependency and call graph analysis."""
import json
from pathlib import Path
import click
from theauditor.config_runtime import load_runtime_config
@click.group()
@click.help_option("-h", "--help")
def graph():
"""Cross-project dependency and call graph analysis."""
pass
@graph.command("build")
@click.option("--root", default=".", help="Root directory to analyze")
@click.option("--langs", multiple=True, help="Languages to process (e.g., python, javascript)")
@click.option("--workset", help="Path to workset.json to limit scope")
@click.option("--batch-size", default=200, type=int, help="Files per batch")
@click.option("--resume", is_flag=True, help="Resume from checkpoint")
@click.option("--db", default="./.pf/graphs.db", help="SQLite database path")
@click.option("--out-json", default="./.pf/raw/", help="JSON output directory")
def graph_build(root, langs, workset, batch_size, resume, db, out_json):
"""Build import and call graphs for project."""
from theauditor.graph.builder import XGraphBuilder
from theauditor.graph.store import XGraphStore
try:
# Initialize builder and store
builder = XGraphBuilder(batch_size=batch_size, exclude_patterns=[], project_root=root)
store = XGraphStore(db_path=db)
# Load workset if provided
file_filter = None
workset_files = set()
if workset:
workset_path = Path(workset)
if workset_path.exists():
with open(workset_path) as f:
workset_data = json.load(f)
# Extract file paths from workset
workset_files = {p["path"] for p in workset_data.get("paths", [])}
click.echo(f"Loaded workset with {len(workset_files)} files")
# Clear checkpoint if not resuming
if not resume and builder.checkpoint_file.exists():
builder.checkpoint_file.unlink()
# Load manifest.json if it exists to use as file list
file_list = None
config = load_runtime_config(root)
manifest_path = Path(config["paths"]["manifest"])
if manifest_path.exists():
click.echo("Loading file manifest...")
with open(manifest_path, 'r') as f:
manifest_data = json.load(f)
# Apply workset filtering if active
if workset_files:
file_list = [f for f in manifest_data if f.get("path") in workset_files]
click.echo(f" Filtered to {len(file_list)} files from workset")
else:
file_list = manifest_data
click.echo(f" Found {len(file_list)} files in manifest")
else:
click.echo("No manifest found, using filesystem walk")
# Build import graph
click.echo("Building import graph...")
import_graph = builder.build_import_graph(
root=root,
langs=list(langs) if langs else None,
file_list=file_list,
)
# Save to database (SINGLE SOURCE OF TRUTH)
store.save_import_graph(import_graph)
# REMOVED: JSON dual persistence - using SQLite as single source
click.echo(f" Nodes: {len(import_graph['nodes'])}")
click.echo(f" Edges: {len(import_graph['edges'])}")
# Build call graph
click.echo("Building call graph...")
call_graph = builder.build_call_graph(
root=root,
langs=list(langs) if langs else None,
file_list=file_list,
)
# Save to database (SINGLE SOURCE OF TRUTH)
store.save_call_graph(call_graph)
# REMOVED: JSON dual persistence - using SQLite as single source
# Call graph uses 'nodes' for functions and 'edges' for calls
click.echo(f" Functions: {len(call_graph.get('nodes', []))}")
click.echo(f" Calls: {len(call_graph.get('edges', []))}")
click.echo(f"\nGraphs saved to database: {db}")
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
@graph.command("analyze")
@click.option("--db", default="./.pf/graphs.db", help="SQLite database path")
@click.option("--out", default="./.pf/raw/graph_analysis.json", help="Output JSON path")
@click.option("--max-depth", default=3, type=int, help="Max traversal depth for impact analysis")
@click.option("--workset", help="Path to workset.json for change impact")
@click.option("--no-insights", is_flag=True, help="Skip interpretive insights (health scores, recommendations)")
def graph_analyze(db, out, max_depth, workset, no_insights):
"""Analyze graphs for cycles, hotspots, and impact."""
from theauditor.graph.analyzer import XGraphAnalyzer
from theauditor.graph.store import XGraphStore
# Try to import insights module (optional)
insights = None
if not no_insights:
try:
from theauditor.graph.insights import GraphInsights
insights = GraphInsights()
except ImportError:
click.echo("Note: Insights module not available. Running basic analysis only.")
insights = None
try:
# Load graphs from database
store = XGraphStore(db_path=db)
import_graph = store.load_import_graph()
call_graph = store.load_call_graph()
if not import_graph["nodes"]:
click.echo("No graphs found. Run 'aud graph build' first.")
return
# Initialize analyzer
analyzer = XGraphAnalyzer()
# Detect cycles
click.echo("Detecting cycles...")
cycles = analyzer.detect_cycles(import_graph)
click.echo(f" Found {len(cycles)} cycles")
if cycles:
click.echo(f" Largest cycle: {cycles[0]['size']} nodes")
# Rank hotspots (if insights available)
hotspots = []
if insights:
click.echo("Ranking hotspots...")
hotspots = insights.rank_hotspots(import_graph, call_graph)
click.echo(f" Top 10 hotspots:")
for i, hotspot in enumerate(hotspots[:10], 1):
click.echo(f" {i}. {hotspot['id'][:50]} (score: {hotspot['score']})")
else:
# Basic hotspot detection without scoring
click.echo("Finding most connected nodes...")
degrees = analyzer.calculate_node_degrees(import_graph)
connected = sorted(
[(k, v["in_degree"] + v["out_degree"]) for k, v in degrees.items()],
key=lambda x: x[1],
reverse=True
)[:10]
click.echo(f" Top 10 most connected nodes:")
for i, (node, connections) in enumerate(connected, 1):
click.echo(f" {i}. {node[:50]} ({connections} connections)")
# Calculate change impact if workset provided
impact = None
if workset:
workset_path = Path(workset)
if workset_path.exists():
with open(workset_path) as f:
workset_data = json.load(f)
targets = workset_data.get("seed_files", [])
if targets:
click.echo(f"\nCalculating impact for {len(targets)} targets...")
impact = analyzer.impact_of_change(
targets=targets,
import_graph=import_graph,
call_graph=call_graph,
max_depth=max_depth,
)
click.echo(f" Upstream impact: {len(impact['upstream'])} files")
click.echo(f" Downstream impact: {len(impact['downstream'])} files")
click.echo(f" Total impacted: {impact['total_impacted']}")
# Generate summary
summary = {}
if insights:
click.echo("\nGenerating interpreted summary...")
summary = insights.summarize(
import_graph=import_graph,
call_graph=call_graph,
cycles=cycles,
hotspots=hotspots,
)
click.echo(f" Graph density: {summary['import_graph'].get('density', 0):.4f}")
click.echo(f" Health grade: {summary['health_metrics'].get('health_grade', 'N/A')}")
click.echo(f" Fragility score: {summary['health_metrics'].get('fragility_score', 0):.2f}")
else:
# Basic summary without interpretation
click.echo("\nGenerating basic summary...")
nodes_count = len(import_graph.get("nodes", []))
edges_count = len(import_graph.get("edges", []))
density = edges_count / (nodes_count * (nodes_count - 1)) if nodes_count > 1 else 0
summary = {
"import_graph": {
"nodes": nodes_count,
"edges": edges_count,
"density": density,
},
"cycles": {
"total": len(cycles),
"largest": cycles[0]["size"] if cycles else 0,
},
}
if call_graph:
summary["call_graph"] = {
"nodes": len(call_graph.get("nodes", [])),
"edges": len(call_graph.get("edges", [])),
}
click.echo(f" Nodes: {nodes_count}")
click.echo(f" Edges: {edges_count}")
click.echo(f" Density: {density:.4f}")
click.echo(f" Cycles: {len(cycles)}")
# Save analysis results
analysis = {
"cycles": cycles,
"hotspots": hotspots[:50], # Top 50
"impact": impact,
"summary": summary,
}
out_path = Path(out)
out_path.parent.mkdir(parents=True, exist_ok=True)
with open(out_path, "w") as f:
json.dump(analysis, f, indent=2, sort_keys=True)
click.echo(f"\nAnalysis saved to {out}")
# Save metrics for ML consumption (if insights available)
if insights and hotspots:
metrics = {}
for hotspot in hotspots:
metrics[hotspot['id']] = hotspot.get('centrality', 0)
metrics_path = Path("./.pf/raw/graph_metrics.json")
metrics_path.parent.mkdir(parents=True, exist_ok=True)
with open(metrics_path, "w") as f:
json.dump(metrics, f, indent=2)
click.echo(f" Saved graph metrics to {metrics_path}")
# Create AI-readable summary
graph_summary = analyzer.get_graph_summary(import_graph)
summary_path = Path("./.pf/raw/graph_summary.json")
with open(summary_path, "w") as f:
json.dump(graph_summary, f, indent=2)
click.echo(f" Saved graph summary to {summary_path}")
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
@graph.command("query")
@click.option("--db", default="./.pf/graphs.db", help="SQLite database path")
@click.option("--uses", help="Find who uses/imports this module or calls this function")
@click.option("--calls", help="Find what this module/function calls or depends on")
@click.option("--nearest-path", nargs=2, help="Find shortest path between two nodes")
@click.option("--format", type=click.Choice(["table", "json"]), default="table", help="Output format")
def graph_query(db, uses, calls, nearest_path, format):
"""Query graph relationships."""
from theauditor.graph.analyzer import XGraphAnalyzer
from theauditor.graph.store import XGraphStore
# Check if any query options were provided
if not any([uses, calls, nearest_path]):
click.echo("Please specify a query option:")
click.echo(" --uses MODULE Find who uses a module")
click.echo(" --calls FUNC Find what a function calls")
click.echo(" --nearest-path SOURCE TARGET Find path between nodes")
click.echo("\nExample: aud graph query --uses theauditor.cli")
return
try:
# Load graphs
store = XGraphStore(db_path=db)
results = {}
if uses:
# Find who uses this node
deps = store.query_dependencies(uses, direction="upstream")
call_deps = store.query_calls(uses, direction="callers")
all_users = sorted(set(deps.get("upstream", []) + call_deps.get("callers", [])))
results["uses"] = {
"node": uses,
"used_by": all_users,
"count": len(all_users),
}
if format == "table":
click.echo(f"\n{uses} is used by {len(all_users)} nodes:")
for user in all_users[:20]: # Show first 20
click.echo(f" - {user}")
if len(all_users) > 20:
click.echo(f" ... and {len(all_users) - 20} more")
if calls:
# Find what this node calls/depends on
deps = store.query_dependencies(calls, direction="downstream")
call_deps = store.query_calls(calls, direction="callees")
all_deps = sorted(set(deps.get("downstream", []) + call_deps.get("callees", [])))
results["calls"] = {
"node": calls,
"depends_on": all_deps,
"count": len(all_deps),
}
if format == "table":
click.echo(f"\n{calls} depends on {len(all_deps)} nodes:")
for dep in all_deps[:20]: # Show first 20
click.echo(f" - {dep}")
if len(all_deps) > 20:
click.echo(f" ... and {len(all_deps) - 20} more")
if nearest_path:
# Find shortest path
source, target = nearest_path
import_graph = store.load_import_graph()
analyzer = XGraphAnalyzer()
path = analyzer.find_shortest_path(source, target, import_graph)
results["path"] = {
"source": source,
"target": target,
"path": path,
"length": len(path) if path else None,
}
if format == "table":
if path:
click.echo(f"\nPath from {source} to {target} ({len(path)} steps):")
for i, node in enumerate(path):
prefix = " " + ("-> " if i > 0 else "")
click.echo(f"{prefix}{node}")
else:
click.echo(f"\nNo path found from {source} to {target}")
if format == "json":
click.echo(json.dumps(results, indent=2))
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
@graph.command("viz")
@click.option("--db", default="./.pf/graphs.db", help="SQLite database path")
@click.option("--graph-type", type=click.Choice(["import", "call"]), default="import", help="Graph type to visualize")
@click.option("--out-dir", default="./.pf/raw/", help="Output directory for visualizations")
@click.option("--limit-nodes", default=500, type=int, help="Maximum nodes to display")
@click.option("--format", type=click.Choice(["dot", "svg", "png", "json"]), default="dot", help="Output format")
@click.option("--view", type=click.Choice(["full", "cycles", "hotspots", "layers", "impact"]), default="full",
help="Visualization view type")
@click.option("--include-analysis", is_flag=True, help="Include analysis results (cycles, hotspots) in visualization")
@click.option("--title", help="Graph title")
@click.option("--top-hotspots", default=10, type=int, help="Number of top hotspots to show (for hotspots view)")
@click.option("--impact-target", help="Target node for impact analysis (for impact view)")
@click.option("--show-self-loops", is_flag=True, help="Include self-referential edges")
def graph_viz(db, graph_type, out_dir, limit_nodes, format, view, include_analysis, title,
top_hotspots, impact_target, show_self_loops):
"""Visualize graphs with rich visual encoding (Graphviz).
Creates visually intelligent graphs with multiple view modes:
VIEW MODES:
- full: Complete graph with all nodes and edges
- cycles: Only nodes/edges involved in dependency cycles
- hotspots: Top N most connected nodes with neighbors
- layers: Architectural layers as subgraphs
- impact: Highlight impact radius of changes
VISUAL ENCODING:
- Node Color: Programming language (Python=blue, JS=yellow, TS=blue)
- Node Size: Importance/connectivity (larger = more dependencies)
- Edge Color: Red for cycles, gray for normal
- Border Width: Code churn (thicker = more changes)
- Node Shape: box=module, ellipse=function, diamond=class
Examples:
# Basic visualization
aud graph viz
# Show only dependency cycles
aud graph viz --view cycles --include-analysis
# Top 5 hotspots with connections
aud graph viz --view hotspots --top-hotspots 5
# Architectural layers
aud graph viz --view layers --include-analysis
# Impact analysis for a specific file
aud graph viz --view impact --impact-target "src/auth.py"
# Generate SVG for AI analysis
aud graph viz --format svg --view full --include-analysis
"""
from theauditor.graph.store import XGraphStore
from theauditor.graph.visualizer import GraphVisualizer
try:
# Load the appropriate graph
store = XGraphStore(db_path=db)
if graph_type == "import":
graph = store.load_import_graph()
output_name = "import_graph"
default_title = "Import Dependencies"
else:
graph = store.load_call_graph()
output_name = "call_graph"
default_title = "Function Call Graph"
if not graph or not graph.get("nodes"):
click.echo(f"No {graph_type} graph found. Run 'aud graph build' first.")
return
# Load analysis if requested
analysis = {}
if include_analysis:
# Try to load analysis from file
analysis_path = Path("./.pf/raw/graph_analysis.json")
if analysis_path.exists():
with open(analysis_path) as f:
analysis_data = json.load(f)
analysis = {
'cycles': analysis_data.get('cycles', []),
'hotspots': analysis_data.get('hotspots', []),
'impact': analysis_data.get('impact', {})
}
click.echo(f"Loaded analysis: {len(analysis['cycles'])} cycles, {len(analysis['hotspots'])} hotspots")
else:
click.echo("No analysis found. Run 'aud graph analyze' first for richer visualization.")
# Create output directory
out_path = Path(out_dir)
out_path.mkdir(parents=True, exist_ok=True)
if format == "json":
# Simple JSON output (original behavior)
json_file = out_path / f"{output_name}.json"
with open(json_file, "w") as f:
json.dump({"nodes": graph["nodes"], "edges": graph["edges"]}, f, indent=2)
click.echo(f"[OK] JSON saved to: {json_file}")
click.echo(f" Nodes: {len(graph['nodes'])}, Edges: {len(graph['edges'])}")
else:
# Use new visualizer for DOT/SVG/PNG
visualizer = GraphVisualizer()
# Set visualization options
options = {
'max_nodes': limit_nodes,
'title': title or default_title,
'show_self_loops': show_self_loops
}
# Generate DOT with visual intelligence based on view mode
click.echo(f"Generating {format.upper()} visualization (view: {view})...")
if view == "cycles":
# Cycles-only view
cycles = analysis.get('cycles', [])
if not cycles:
# Check if analysis was run but found no cycles
if 'cycles' in analysis:
click.echo("[INFO] No dependency cycles detected in the codebase (good architecture!).")
click.echo(" Showing full graph instead...")
else:
click.echo("[WARN] No cycles data found. Run 'aud graph analyze' first.")
click.echo(" Falling back to full view...")
dot_content = visualizer.generate_dot(graph, analysis, options)
else:
click.echo(f" Showing {len(cycles)} cycles")
dot_content = visualizer.generate_cycles_only_view(graph, cycles, options)
elif view == "hotspots":
# Hotspots-only view
if not analysis.get('hotspots'):
# Try to calculate hotspots on the fly
from theauditor.graph.analyzer import XGraphAnalyzer
analyzer = XGraphAnalyzer()
hotspots = analyzer.identify_hotspots(graph, top_n=top_hotspots)
click.echo(f" Calculated {len(hotspots)} hotspots")
else:
hotspots = analysis['hotspots']
click.echo(f" Showing top {top_hotspots} hotspots")
dot_content = visualizer.generate_hotspots_only_view(
graph, hotspots, options, top_n=top_hotspots
)
elif view == "layers":
# Architectural layers view
from theauditor.graph.analyzer import XGraphAnalyzer
analyzer = XGraphAnalyzer()
layers = analyzer.identify_layers(graph)
click.echo(f" Found {len(layers)} architectural layers")
# Filter out None keys before iterating
for layer_num, nodes in layers.items():
if layer_num is not None:
click.echo(f" Layer {layer_num}: {len(nodes)} nodes")
dot_content = visualizer.generate_dot_with_layers(graph, layers, analysis, options)
elif view == "impact":
# Impact analysis view
if not impact_target:
click.echo("[ERROR] --impact-target required for impact view")
raise click.ClickException("Missing --impact-target for impact view")
from theauditor.graph.analyzer import XGraphAnalyzer
analyzer = XGraphAnalyzer()
impact = analyzer.analyze_impact(graph, [impact_target])
if not impact['targets']:
click.echo(f"[WARN] Target '{impact_target}' not found in graph")
click.echo(" Showing full graph instead...")
dot_content = visualizer.generate_dot(graph, analysis, options)
else:
click.echo(f" Target: {impact_target}")
click.echo(f" Upstream: {len(impact['upstream'])} nodes")
click.echo(f" Downstream: {len(impact['downstream'])} nodes")
click.echo(f" Total impact: {len(impact['all_impacted'])} nodes")
dot_content = visualizer.generate_impact_visualization(graph, impact, options)
else: # view == "full" or default
# Full graph view
click.echo(f" Nodes: {len(graph['nodes'])} (limit: {limit_nodes})")
click.echo(f" Edges: {len(graph['edges'])}")
dot_content = visualizer.generate_dot(graph, analysis, options)
# Save DOT file with view suffix
if view != "full":
output_filename = f"{output_name}_{view}"
else:
output_filename = output_name
dot_file = out_path / f"{output_filename}.dot"
with open(dot_file, "w") as f:
f.write(dot_content)
click.echo(f"[OK] DOT file saved to: {dot_file}")
# Generate image if requested
if format in ["svg", "png"]:
try:
import subprocess
# Check if Graphviz is installed
result = subprocess.run(
["dot", "-V"],
capture_output=True,
text=True
)
if result.returncode == 0:
# Generate image
output_file = out_path / f"{output_filename}.{format}"
subprocess.run(
["dot", f"-T{format}", str(dot_file), "-o", str(output_file)],
check=True
)
click.echo(f"[OK] {format.upper()} image saved to: {output_file}")
# For SVG, also mention AI readability
if format == "svg":
click.echo(" ✓ SVG is AI-readable and can be analyzed for patterns")
else:
click.echo(f"[WARN] Graphviz not found. Install it to generate {format.upper()} images:")
click.echo(" Ubuntu/Debian: apt install graphviz")
click.echo(" macOS: brew install graphviz")
click.echo(" Windows: choco install graphviz")
click.echo(f"\n Manual generation: dot -T{format} {dot_file} -o {output_filename}.{format}")
except FileNotFoundError:
click.echo(f"[WARN] Graphviz not installed. Cannot generate {format.upper()}.")
click.echo(f" Install graphviz and run: dot -T{format} {dot_file} -o {output_filename}.{format}")
except subprocess.CalledProcessError as e:
click.echo(f"[ERROR] Failed to generate {format.upper()}: {e}")
# Provide visual encoding legend based on view
click.echo("\nVisual Encoding:")
if view == "cycles":
click.echo(" • Red Nodes: Part of dependency cycles")
click.echo(" • Red Edges: Cycle connections")
click.echo(" • Subgraphs: Individual cycles grouped")
elif view == "hotspots":
click.echo(" • Node Color: Red gradient (darker = higher rank)")
click.echo(" • Node Size: Total connections")
click.echo(" • Gray Nodes: Connected but not hotspots")
click.echo(" • Labels: Show in/out degree counts")
elif view == "layers":
click.echo(" • Subgraphs: Architectural layers")
click.echo(" • Node Color: Programming language")
click.echo(" • Border Width: Code churn (thicker = more changes)")
click.echo(" • Node Size: Importance (in-degree)")
elif view == "impact":
click.echo(" • Red Nodes: Impact targets")
click.echo(" • Orange Nodes: Upstream dependencies")
click.echo(" • Blue Nodes: Downstream dependencies")
click.echo(" • Purple Nodes: Both upstream and downstream")
click.echo(" • Gray Nodes: Unaffected")
else: # full view
click.echo(" • Node Color: Programming language")
click.echo(" • Node Size: Importance (larger = more dependencies)")
click.echo(" • Red Edges: Part of dependency cycles")
click.echo(" • Node Shape: box=module, ellipse=function")
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
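
The store and analyzer pair used by these subcommands also works standalone once 'aud graph build' has populated the database; a sketch limited to the calls shown above:

    from theauditor.graph.analyzer import XGraphAnalyzer
    from theauditor.graph.store import XGraphStore

    store = XGraphStore(db_path="./.pf/graphs.db")
    import_graph = store.load_import_graph()

    analyzer = XGraphAnalyzer()
    for cycle in analyzer.detect_cycles(import_graph)[:5]:
        print(f"cycle of {cycle['size']} nodes")

    # Same question as: aud graph query --uses theauditor.cli
    deps = store.query_dependencies("theauditor.cli", direction="upstream")
    print(deps.get("upstream", []))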

View File

@@ -0,0 +1,118 @@
"""Analyze the impact radius of code changes using the AST symbol graph."""
import platform
import click
from pathlib import Path
# Detect if running on Windows for character encoding
IS_WINDOWS = platform.system() == "Windows"
@click.command()
@click.option("--file", required=True, help="Path to the file containing the code to analyze")
@click.option("--line", required=True, type=int, help="Line number of the code to analyze")
@click.option("--db", default=None, help="Path to the SQLite database (default: repo_index.db)")
@click.option("--json", is_flag=True, help="Output results as JSON")
@click.option("--max-depth", default=2, type=int, help="Maximum depth for transitive dependencies")
@click.option("--verbose", is_flag=True, help="Show detailed dependency information")
@click.option("--trace-to-backend", is_flag=True, help="Trace frontend API calls to backend endpoints (cross-stack analysis)")
def impact(file, line, db, json, max_depth, verbose, trace_to_backend):
"""
Analyze the impact radius of changing code at a specific location.
This command traces both upstream dependencies (who calls this code)
and downstream dependencies (what this code calls) to help understand
the blast radius of potential changes.
Example:
aud impact --file src/auth.py --line 42
aud impact --file theauditor/indexer.py --line 100 --verbose
"""
from theauditor.impact_analyzer import analyze_impact, format_impact_report
from theauditor.config_runtime import load_runtime_config
import json as json_lib
# Load configuration for default paths
config = load_runtime_config(".")
# Use default database path if not provided
if db is None:
db = config["paths"]["db"]
# Verify database exists
db_path = Path(db)
if not db_path.exists():
click.echo(f"Error: Database not found at {db}", err=True)
click.echo("Run 'aud index' first to build the repository index", err=True)
raise click.ClickException(f"Database not found: {db}")
# Verify file exists (helpful for user)
file = Path(file)
if not file.exists():
click.echo(f"Warning: File {file} not found in filesystem", err=True)
click.echo("Proceeding with analysis using indexed data...", err=True)
# Perform impact analysis
try:
result = analyze_impact(
db_path=str(db_path),
target_file=str(file),
target_line=line,
trace_to_backend=trace_to_backend
)
# Output results
if json:
# JSON output for programmatic use
click.echo(json_lib.dumps(result, indent=2, sort_keys=True))
else:
# Human-readable report
report = format_impact_report(result)
click.echo(report)
# Additional verbose output
if verbose and not result.get("error"):
click.echo("\n" + "=" * 60)
click.echo("DETAILED DEPENDENCY INFORMATION")
click.echo("=" * 60)
# Show transitive upstream
if result.get("upstream_transitive"):
click.echo(f"\nTransitive Upstream Dependencies ({len(result['upstream_transitive'])} total):")
for dep in result["upstream_transitive"][:20]:
depth_indicator = " " * (3 - dep.get("depth", 1))
tree_char = "+-" if IS_WINDOWS else "└─"
click.echo(f"{depth_indicator}{tree_char} {dep['symbol']} in {dep['file']}:{dep['line']}")
if len(result["upstream_transitive"]) > 20:
click.echo(f" ... and {len(result['upstream_transitive']) - 20} more")
# Show transitive downstream
if result.get("downstream_transitive"):
click.echo(f"\nTransitive Downstream Dependencies ({len(result['downstream_transitive'])} total):")
for dep in result["downstream_transitive"][:20]:
depth_indicator = " " * (3 - dep.get("depth", 1))
if dep["file"] != "external":
tree_char = "+-" if IS_WINDOWS else "└─"
click.echo(f"{depth_indicator}{tree_char} {dep['symbol']} in {dep['file']}:{dep['line']}")
else:
tree_char = "+-" if IS_WINDOWS else "└─"
click.echo(f"{depth_indicator}{tree_char} {dep['symbol']} (external)")
if len(result["downstream_transitive"]) > 20:
click.echo(f" ... and {len(result['downstream_transitive']) - 20} more")
# Exit with appropriate code
if result.get("error"):
# Error already displayed in the report, just exit with code
raise SystemExit(3)  # Exit code 3 for analysis errors
# Warn if high impact
summary = result.get("impact_summary", {})
if summary.get("total_impact", 0) > 20:
click.echo("\n⚠ WARNING: High impact change detected!", err=True)
exit(1) # Non-zero exit for CI/CD integration
except Exception as e:
# Only show this for unexpected exceptions, not for already-handled errors
if "No function or class found at" not in str(e):
click.echo(f"Error during impact analysis: {e}", err=True)
raise click.ClickException(str(e))
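
analyze_impact is directly callable with the same target the docstring uses as an example; a sketch of the programmatic path:

    import json
    from theauditor.impact_analyzer import analyze_impact, format_impact_report

    result = analyze_impact(
        db_path="./.pf/repo_index.db",
        target_file="src/auth.py",  # example target from the docstring above
        target_line=42,
        trace_to_backend=False,
    )
    if not result.get("error"):
        print(format_impact_report(result))
        print(json.dumps(result.get("impact_summary", {}), indent=2))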

View File

@@ -0,0 +1,50 @@
"""Build language-agnostic manifest and SQLite index of repository."""
import click
from theauditor.utils.error_handler import handle_exceptions
from theauditor.utils.helpers import get_self_exclusion_patterns
@click.command()
@handle_exceptions
@click.option("--root", default=".", help="Root directory to index")
@click.option("--manifest", default=None, help="Output manifest file path")
@click.option("--db", default=None, help="Output SQLite database path")
@click.option("--print-stats", is_flag=True, help="Print summary statistics")
@click.option("--dry-run", is_flag=True, help="Scan but don't write files")
@click.option("--follow-symlinks", is_flag=True, help="Follow symbolic links (default: skip)")
@click.option("--exclude-self", is_flag=True, help="Exclude TheAuditor's own files (for self-testing)")
def index(root, manifest, db, print_stats, dry_run, follow_symlinks, exclude_self):
"""Build language-agnostic manifest and SQLite index of repository."""
from theauditor.indexer import build_index
from theauditor.config_runtime import load_runtime_config
# Load configuration
config = load_runtime_config(root)
# Use config defaults if not provided
if manifest is None:
manifest = config["paths"]["manifest"]
if db is None:
db = config["paths"]["db"]
# Build exclude patterns using centralized function
exclude_patterns = get_self_exclusion_patterns(exclude_self)
if exclude_self and print_stats:
click.echo(f"[EXCLUDE-SELF] Excluding TheAuditor's own files from indexing")
click.echo(f"[EXCLUDE-SELF] {len(exclude_patterns)} patterns will be excluded")
result = build_index(
root_path=root,
manifest_path=manifest,
db_path=db,
print_stats=print_stats,
dry_run=dry_run,
follow_symlinks=follow_symlinks,
exclude_patterns=exclude_patterns,
)
if result.get("error"):
click.echo(f"Error: {result['error']}", err=True)
raise click.ClickException(result["error"])

143
theauditor/commands/init.py Normal file
View File

@@ -0,0 +1,143 @@
"""Initialize TheAuditor for first-time use."""
from pathlib import Path
import click
@click.command()
@click.option("--offline", is_flag=True, help="Skip network operations (deps check, docs fetch)")
@click.option("--skip-docs", is_flag=True, help="Skip documentation fetching")
@click.option("--skip-deps", is_flag=True, help="Skip dependency checking")
def init(offline, skip_docs, skip_deps):
"""Initialize TheAuditor for first-time use (runs all setup steps)."""
from theauditor.init import initialize_project
click.echo("[INIT] Initializing TheAuditor...\n")
click.echo("This will run all setup steps:")
click.echo(" 1. Index repository")
click.echo(" 2. Create workset")
click.echo(" 3. Check dependencies")
click.echo(" 4. Fetch documentation")
click.echo("\n" + "="*60 + "\n")
# Call the refactored initialization logic
result = initialize_project(
offline=offline,
skip_docs=skip_docs,
skip_deps=skip_deps
)
stats = result["stats"]
has_failures = result["has_failures"]
next_steps = result["next_steps"]
# Display step-by-step results
click.echo("[INDEX] Step 1/5: Indexing repository...")
if stats.get("index", {}).get("success"):
click.echo(f" [OK] Indexed {stats['index']['text_files']} text files")
else:
click.echo(f" [FAIL] Failed: {stats['index'].get('error', 'Unknown error')}", err=True)
click.echo("\n[TARGET] Step 2/5: Creating workset...")
if stats.get("workset", {}).get("success"):
click.echo(f" [OK] Workset created with {stats['workset']['files']} files")
elif stats.get("workset", {}).get("files") == 0:
click.echo(" [WARN] No files found to create workset")
else:
click.echo(f" [FAIL] Failed: {stats['workset'].get('error', 'Unknown error')}", err=True)
if not skip_deps and not offline:
click.echo("\n[PACKAGE] Step 3/4: Checking dependencies...")
if stats.get("deps", {}).get("success"):
if stats["deps"]["total"] > 0:
click.echo(f" [OK] Found {stats['deps']['total']} dependencies ({stats['deps']['outdated']} outdated)")
else:
click.echo(" [OK] No dependency files found")
else:
click.echo(f" [FAIL] Failed: {stats['deps'].get('error', 'Unknown error')}", err=True)
else:
click.echo("\n[PACKAGE] Step 3/4: Skipping dependency check (offline/skipped)")
if not skip_docs and not offline:
click.echo("\n[DOCS] Step 4/4: Fetching documentation...")
if stats.get("docs", {}).get("success"):
fetched = stats['docs'].get('fetched', 0)
cached = stats['docs'].get('cached', 0)
if fetched > 0 and cached > 0:
click.echo(f" [OK] Fetched {fetched} new docs, using {cached} cached docs")
elif fetched > 0:
click.echo(f" [OK] Fetched {fetched} docs")
elif cached > 0:
click.echo(f" [OK] Using {cached} cached docs (already up-to-date)")
else:
click.echo(" [WARN] No docs fetched or cached")
# Report any errors from the stats
if stats['docs'].get('errors'):
errors = stats['docs']['errors']
rate_limited = [e for e in errors if "rate limited" in e.lower()]
other_errors = [e for e in errors if "rate limited" not in e.lower()]
if rate_limited:
click.echo(f" [WARN] {len(rate_limited)} packages rate-limited (will retry with delay)")
if other_errors and len(other_errors) <= 3:
for err in other_errors[:3]:
click.echo(f" [WARN] {err}")
elif other_errors:
click.echo(f" [WARN] {len(other_errors)} packages failed to fetch")
click.echo(f" [OK] Created {stats['docs']['capsules']} doc capsules")
elif stats["docs"].get("error") == "Interrupted by user":
click.echo("\n [WARN] Documentation fetch interrupted (Ctrl+C)")
else:
click.echo(f" [FAIL] Failed: {stats['docs'].get('error', 'Unknown error')}", err=True)
else:
click.echo("\n[DOCS] Step 4/4: Skipping documentation (offline/skipped)")
# Summary
click.echo("\n" + "="*60)
if has_failures:
click.echo("\n[WARN] Initialization Partially Complete\n")
else:
click.echo("\n[SUCCESS] Initialization Complete!\n")
# Show summary
click.echo("[STATS] Summary:")
if stats.get("index", {}).get("success"):
click.echo(f" * Indexed: {stats['index']['text_files']} files")
else:
click.echo(" * Indexing: [FAILED] Failed")
if stats.get("workset", {}).get("success"):
click.echo(f" * Workset: {stats['workset']['files']} files")
elif stats.get("workset", {}).get("files") == 0:
click.echo(" * Workset: [WARN] No files found")
else:
click.echo(" * Workset: [FAILED] Failed")
if stats.get("deps", {}).get("success"):
click.echo(f" * Dependencies: {stats['deps'].get('total', 0)} total, {stats['deps'].get('outdated', 0)} outdated")
elif stats.get("deps", {}).get("skipped"):
click.echo(" * Dependencies: [SKIPPED] Skipped")
if stats.get("docs", {}).get("success"):
fetched = stats['docs'].get('fetched', 0)
cached = stats['docs'].get('cached', 0)
capsules = stats['docs'].get('capsules', 0)
if cached > 0:
click.echo(f" * Documentation: {fetched} fetched, {cached} cached, {capsules} capsules")
else:
click.echo(f" * Documentation: {fetched} fetched, {capsules} capsules")
elif stats.get("docs", {}).get("skipped"):
click.echo(" * Documentation: [SKIPPED] Skipped")
# Next steps - only show if we have files to work with
if next_steps:
click.echo("\n[TARGET] Next steps:")
for i, step in enumerate(next_steps, 1):
click.echo(f" {i}. Run: {step}")
click.echo("\nOr run all at once:")
click.echo(f" {' && '.join(next_steps)}")
else:
click.echo("\n[WARN] No files found to audit. Check that you're in the right directory.")

21
theauditor/commands/init_config.py Normal file

@@ -0,0 +1,21 @@
"""Ensure minimal mypy config exists (idempotent)."""
import click
@click.command("init-config")
@click.option("--pyproject", default="pyproject.toml", help="Path to pyproject.toml")
def init_config(pyproject):
"""Ensure minimal mypy config exists (idempotent)."""
from theauditor.config import ensure_mypy_config
try:
res = ensure_mypy_config(pyproject)
msg = (
"mypy config created"
if res.get("status") == "created"
else "mypy config already present"
)
click.echo(msg)
except Exception as e:
raise click.ClickException(f"Failed to init config: {e}") from e

41
theauditor/commands/init_js.py Normal file

@@ -0,0 +1,41 @@
"""Create or merge minimal package.json for lint/typecheck."""
import click
@click.command("init-js")
@click.option("--path", default="package.json", help="Path to package.json")
@click.option("--add-hooks", is_flag=True, help="Add TheAuditor hooks to npm scripts")
def init_js(path, add_hooks):
"""Create or merge minimal package.json for lint/typecheck."""
from theauditor.js_init import ensure_package_json, add_auditor_hooks
try:
res = ensure_package_json(path)
if res["status"] == "created":
click.echo(f"[OK] Created {path} with PIN_ME placeholders")
click.echo(" Edit devDependencies to set exact versions")
elif res["status"] == "merged":
click.echo(f"[OK] Merged lint/typecheck config into {path}")
click.echo(" Check devDependencies for PIN_ME placeholders")
else:
click.echo(f"No changes needed - {path} already configured")
# Add hooks if requested
if add_hooks:
click.echo("\nAdding TheAuditor hooks to npm scripts...")
hook_res = add_auditor_hooks(path)
if hook_res["status"] == "hooks_added":
click.echo("[OK] Added TheAuditor hooks to package.json:")
for change in hook_res["details"]:
click.echo(f" - {change}")
elif hook_res["status"] == "unchanged":
click.echo("No changes needed - all hooks already present")
elif hook_res["status"] == "error":
click.echo(f"Error adding hooks: {hook_res['message']}", err=True)
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
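
The status strings branched on above imply result shapes like this sketch; only the status strings themselves are confirmed, the values are illustrative:

```python
# Sketch of the dicts returned by ensure_package_json() and add_auditor_hooks(),
# inferred from the branches above; values are illustrative.
res = {"status": "merged"}            # or "created"; anything else -> "no changes"
hook_res = {
    "status": "hooks_added",          # or "unchanged" / "error"
    "details": ["scripts.lint -> aud lint --workset"],  # illustrative entry
    "message": "...",                 # read only on "error"
}
```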

443
theauditor/commands/insights.py Normal file

@@ -0,0 +1,443 @@
"""Run optional insights analysis on existing audit data.
This command runs interpretive analysis modules (ML, graph health, taint severity)
on top of existing raw audit data, generating insights and predictions.
"""
import json
import sys
from pathlib import Path
from typing import Dict, Any, List
import click
@click.command()
@click.option("--mode", "-m",
type=click.Choice(["ml", "graph", "taint", "impact", "all"]),
default="all",
help="Which insights modules to run")
@click.option("--ml-train", is_flag=True,
help="Train ML models before generating suggestions")
@click.option("--topk", default=10, type=int,
help="Top K files for ML suggestions")
@click.option("--output-dir", "-o", type=click.Path(),
default="./.pf/insights",
help="Directory for insights output")
@click.option("--print-summary", is_flag=True,
help="Print summary to console")
def insights(mode: str, ml_train: bool, topk: int, output_dir: str, print_summary: bool) -> None:
"""Run optional insights analysis on existing audit data.
This command generates interpretive analysis and predictions based on
the raw facts collected by the audit pipeline. All insights are optional
and separate from the core truth data.
Available insights modules:
- ml: Machine learning risk predictions and root cause analysis
- graph: Graph health metrics and architectural scoring
- taint: Severity scoring for taint analysis paths
- impact: Impact radius and blast zone analysis
- all: Run all available insights
Examples:
# Run all insights
aud insights
# Only ML predictions
aud insights --mode ml
# Train ML first, then predict
aud insights --mode ml --ml-train
# Graph health only with summary
aud insights --mode graph --print-summary
"""
# Ensure we have raw data to analyze
pf_dir = Path(".pf")
raw_dir = pf_dir / "raw"
if not raw_dir.exists():
click.echo("[ERROR] No raw audit data found. Run 'aud full' first.", err=True)
sys.exit(1)
# Create insights directory
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
click.echo(f"\n{'='*60}")
click.echo(f"INSIGHTS ANALYSIS - {mode.upper()} Mode")
click.echo(f"{'='*60}")
click.echo(f"Output directory: {output_path}")
results = {}
errors = []
# ML Insights
if mode in ["ml", "all"]:
click.echo("\n[ML] Running machine learning insights...")
ml_result = run_ml_insights(ml_train, topk, output_path)
results["ml"] = ml_result
if ml_result.get("error"):
errors.append(f"ML: {ml_result['error']}")
else:
click.echo(f" ✓ ML predictions saved to {output_path}/ml_suggestions.json")
# Graph Health Insights
if mode in ["graph", "all"]:
click.echo("\n[GRAPH] Running graph health analysis...")
graph_result = run_graph_insights(output_path)
results["graph"] = graph_result
if graph_result.get("error"):
errors.append(f"Graph: {graph_result['error']}")
else:
click.echo(f" ✓ Graph health saved to {output_path}/graph_health.json")
# Taint Severity Insights
if mode in ["taint", "all"]:
click.echo("\n[TAINT] Running taint severity scoring...")
taint_result = run_taint_insights(output_path)
results["taint"] = taint_result
if taint_result.get("error"):
errors.append(f"Taint: {taint_result['error']}")
else:
click.echo(f" ✓ Taint severity saved to {output_path}/taint_severity.json")
# Impact Analysis Insights
if mode in ["impact", "all"]:
click.echo("\n[IMPACT] Running impact analysis...")
impact_result = run_impact_insights(output_path)
results["impact"] = impact_result
if impact_result.get("error"):
errors.append(f"Impact: {impact_result['error']}")
else:
click.echo(f" ✓ Impact analysis saved to {output_path}/impact_analysis.json")
# Aggregate all insights into unified summary
click.echo("\n[AGGREGATE] Creating unified insights summary...")
summary = aggregate_insights(results, output_path)
# Save unified summary
summary_path = output_path / "unified_insights.json"
with open(summary_path, 'w') as f:
json.dump(summary, f, indent=2, default=str)
click.echo(f" ✓ Unified summary saved to {summary_path}")
# Print summary if requested
if print_summary:
print_insights_summary(summary)
# Final status
click.echo(f"\n{'='*60}")
if errors:
click.echo(f"[WARN] Insights completed with {len(errors)} errors:", err=True)
for error in errors:
click.echo(f"{error}", err=True)
else:
click.echo("[OK] All insights generated successfully")
click.echo(f"\n[TIP] Insights are interpretive and optional.")
click.echo(f" Raw facts remain in .pf/raw/ unchanged.")
sys.exit(1 if errors else 0)
def run_ml_insights(train: bool, topk: int, output_dir: Path) -> Dict[str, Any]:
"""Run ML insights generation."""
try:
from theauditor.ml import check_ml_available, learn, suggest
if not check_ml_available():
return {"error": "ML module not installed. Run: pip install -e .[ml]"}
# Train if requested
if train:
learn_result = learn(
db_path="./.pf/repo_index.db",
manifest_path="./.pf/manifest.json",
print_stats=False
)
if not learn_result.get("success"):
return {"error": f"ML training failed: {learn_result.get('error')}"}
# Generate suggestions
suggest_result = suggest(
db_path="./.pf/repo_index.db",
manifest_path="./.pf/manifest.json",
workset_path="./.pf/workset.json",
topk=topk,
out_path=str(output_dir / "ml_suggestions.json")
)
return suggest_result
except ImportError:
return {"error": "ML module not available"}
except Exception as e:
return {"error": str(e)}
def run_graph_insights(output_dir: Path) -> Dict[str, Any]:
"""Run graph health insights."""
try:
from theauditor.graph.insights import GraphInsights
from theauditor.graph.analyzer import XGraphAnalyzer
from theauditor.graph.store import XGraphStore
# Load graph from SQLite database (SINGLE SOURCE OF TRUTH)
store = XGraphStore(db_path="./.pf/graphs.db")
import_graph = store.load_import_graph()
if not import_graph or not import_graph.get("nodes"):
return {"error": "No import graph found. Run 'aud graph build' first."}
# Load analysis data if it exists
analysis_path = Path(".pf/raw/graph_analysis.json")
analysis_data = {}
if analysis_path.exists():
with open(analysis_path) as f:
analysis_data = json.load(f)
# Run insights analysis
insights = GraphInsights()
analyzer = XGraphAnalyzer()
# Use pre-calculated cycles and hotspots if available, otherwise calculate
if 'cycles' in analysis_data:
cycles = analysis_data['cycles']
else:
cycles = analyzer.detect_cycles(import_graph)
# Use pre-calculated hotspots if available, otherwise calculate
if 'hotspots' in analysis_data:
hotspots = analysis_data['hotspots']
else:
hotspots = insights.rank_hotspots(import_graph)
# Calculate health metrics
health = insights.calculate_health_metrics(
import_graph,
cycles=cycles,
hotspots=hotspots
)
# Generate recommendations
recommendations = insights.generate_recommendations(
import_graph,
cycles=cycles,
hotspots=hotspots
)
# Save results
output = {
"health_metrics": health,
"top_hotspots": hotspots[:10],
"recommendations": recommendations,
"cycles_found": len(cycles),
"total_nodes": len(import_graph.get("nodes", [])),
"total_edges": len(import_graph.get("edges", []))
}
output_path = output_dir / "graph_health.json"
with open(output_path, 'w') as f:
json.dump(output, f, indent=2)
return {"success": True, "health_score": health.get("health_score")}
except ImportError:
return {"error": "Graph insights module not available"}
except Exception as e:
return {"error": str(e)}
def run_taint_insights(output_dir: Path) -> Dict[str, Any]:
"""Run taint severity insights."""
try:
from datetime import datetime, UTC
from theauditor.taint.insights import calculate_severity, classify_vulnerability, generate_summary
from theauditor.taint_analyzer import SECURITY_SINKS
# Load raw taint data
taint_path = Path(".pf/raw/taint_analysis.json")
if not taint_path.exists():
return {"error": "No taint data found. Run 'aud taint-analyze' first."}
with open(taint_path) as f:
taint_data = json.load(f)
if not taint_data.get("success"):
return {"error": "Taint analysis was not successful"}
# Calculate severity for each path and create enriched versions
severity_analysis = []
enriched_paths = []
for path in taint_data.get("taint_paths", []):
severity = calculate_severity(path)
vuln_type = classify_vulnerability(path.get("sink", {}), SECURITY_SINKS)
severity_analysis.append({
"file": path.get("sink", {}).get("file"),
"line": path.get("sink", {}).get("line"),
"severity": severity,
"vulnerability_type": vuln_type,
"path_length": len(path.get("path", [])),
"risk_score": 1.0 if severity == "critical" else 0.7 if severity == "high" else 0.4
})
# Create enriched path with severity for summary generation
enriched_path = dict(path)
enriched_path["severity"] = severity
enriched_path["vulnerability_type"] = vuln_type
enriched_paths.append(enriched_path)
# Generate summary using enriched paths with severity
summary = generate_summary(enriched_paths)
# Save results
output = {
"generated_at": datetime.now(UTC).isoformat(),
"severity_analysis": severity_analysis,
"summary": summary,
"total_vulnerabilities": len(severity_analysis),
"sources_analyzed": taint_data.get("sources_found", 0),
"sinks_analyzed": taint_data.get("sinks_found", 0)
}
output_path = output_dir / "taint_severity.json"
with open(output_path, 'w') as f:
json.dump(output, f, indent=2)
return {"success": True, "risk_level": summary.get("risk_level")}
except ImportError:
return {"error": "Taint insights module not available"}
except Exception as e:
return {"error": str(e)}
def run_impact_insights(output_dir: Path) -> Dict[str, Any]:
"""Run impact analysis insights."""
try:
# Check if workset exists
workset_path = Path(".pf/workset.json")
if not workset_path.exists():
return {"error": "No workset found. Run 'aud workset' first."}
with open(workset_path) as f:
workset_data = json.load(f)
# For now, create a simple impact summary
# In future, this could run actual impact analysis on changed files
output = {
"files_changed": len(workset_data.get("files", [])),
"potential_impact": "Analysis pending",
"recommendation": "Run 'aud impact --file <file> --line <line>' for detailed analysis"
}
output_path = output_dir / "impact_analysis.json"
with open(output_path, 'w') as f:
json.dump(output, f, indent=2)
return {"success": True, "files_analyzed": len(workset_data.get("files", []))}
except Exception as e:
return {"error": str(e)}
def aggregate_insights(results: Dict[str, Any], output_dir: Path) -> Dict[str, Any]:
    """Aggregate all insights into unified summary."""
    from datetime import datetime
    summary = {
        "insights_generated": list(results.keys()),
        "timestamp": datetime.now().isoformat(),
        "output_directory": str(output_dir)
    }
# ML insights
if "ml" in results and results["ml"].get("success"):
summary["ml"] = {
"status": "success",
"workset_size": results["ml"].get("workset_size", 0),
"predictions_generated": True
}
elif "ml" in results:
summary["ml"] = {"status": "error", "error": results["ml"].get("error")}
# Graph insights
if "graph" in results and results["graph"].get("success"):
summary["graph"] = {
"status": "success",
"health_score": results["graph"].get("health_score", 0)
}
elif "graph" in results:
summary["graph"] = {"status": "error", "error": results["graph"].get("error")}
# Taint insights
if "taint" in results and results["taint"].get("success"):
summary["taint"] = {
"status": "success",
"risk_level": results["taint"].get("risk_level", "unknown")
}
elif "taint" in results:
summary["taint"] = {"status": "error", "error": results["taint"].get("error")}
# Impact insights
if "impact" in results and results["impact"].get("success"):
summary["impact"] = {
"status": "success",
"files_analyzed": results["impact"].get("files_analyzed", 0)
}
elif "impact" in results:
summary["impact"] = {"status": "error", "error": results["impact"].get("error")}
return summary
def print_insights_summary(summary: Dict[str, Any]) -> None:
"""Print insights summary to console."""
click.echo(f"\n{'='*60}")
click.echo("INSIGHTS SUMMARY")
click.echo(f"{'='*60}")
# ML Summary
if "ml" in summary:
if summary["ml"]["status"] == "success":
click.echo(f"\n[ML] Machine Learning Insights:")
click.echo(f" • Workset size: {summary['ml'].get('workset_size', 0)} files")
click.echo(f" • Predictions: Generated successfully")
else:
click.echo(f"\n[ML] Machine Learning Insights: {summary['ml'].get('error')}")
# Graph Summary
if "graph" in summary:
if summary["graph"]["status"] == "success":
health = summary["graph"].get("health_score", 0)
grade = "A" if health >= 90 else "B" if health >= 80 else "C" if health >= 70 else "D" if health >= 60 else "F"
click.echo(f"\n[GRAPH] Architecture Health:")
click.echo(f" • Health score: {health}/100 (Grade: {grade})")
else:
click.echo(f"\n[GRAPH] Architecture Health: {summary['graph'].get('error')}")
# Taint Summary
if "taint" in summary:
if summary["taint"]["status"] == "success":
risk = summary["taint"].get("risk_level", "unknown")
color = "red" if risk == "critical" else "yellow" if risk == "high" else "green"
click.echo(f"\n[TAINT] Security Risk:")
click.echo(f" • Risk level: {risk.upper()}")
else:
click.echo(f"\n[TAINT] Security Risk: {summary['taint'].get('error')}")
# Impact Summary
if "impact" in summary:
if summary["impact"]["status"] == "success":
click.echo(f"\n[IMPACT] Change Impact:")
click.echo(f" • Files analyzed: {summary['impact'].get('files_analyzed', 0)}")
else:
click.echo(f"\n[IMPACT] Change Impact: {summary['impact'].get('error')}")
click.echo(f"\n{'='*60}")
# Register command
insights_command = insights
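
Putting the pieces together, aggregate_insights() yields a unified_insights.json along these lines (keys taken from the function above, values illustrative):

```python
# Illustrative unified_insights.json payload, per aggregate_insights() above.
{
    "insights_generated": ["ml", "graph", "taint", "impact"],
    "timestamp": "2025-09-07T13:39:47",
    "output_directory": ".pf/insights",
    "ml": {"status": "success", "workset_size": 340, "predictions_generated": True},
    "graph": {"status": "success", "health_score": 82},
    "taint": {"status": "success", "risk_level": "high"},
    "impact": {"status": "success", "files_analyzed": 12},
}
```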

267
theauditor/commands/lint.py Normal file

@@ -0,0 +1,267 @@
"""Run linters and normalize output to evidence format."""
import hashlib
import json
from collections import defaultdict
from pathlib import Path
from typing import Any
import click
from theauditor.linters import (
detect_linters,
run_linter,
)
from theauditor.utils import load_json_file
from theauditor.utils.error_handler import handle_exceptions
def write_lint_json(findings: list[dict[str, Any]], output_path: str):
"""Write findings to JSON file."""
# Sort findings for determinism
sorted_findings = sorted(findings, key=lambda f: (f["file"], f["line"], f["rule"]))
with open(output_path, "w", encoding="utf-8") as f:
json.dump(sorted_findings, f, indent=2, sort_keys=True)
def lint_command(
root_path: str = ".",
workset_path: str = "./.pf/workset.json",
manifest_path: str = "manifest.json",
timeout: int = 300,
print_plan: bool = False,
auto_fix: bool = False,
) -> dict[str, Any]:
"""
Run linters and normalize output.
Returns:
Dictionary with success status and statistics
"""
# AUTO-FIX DEPRECATED: Force disabled to prevent version mismatch issues
auto_fix = False
# Load workset or manifest files
if workset_path is not None:
# Use workset mode
try:
workset = load_json_file(workset_path)
workset_files = {p["path"] for p in workset.get("paths", [])}
except (FileNotFoundError, json.JSONDecodeError) as e:
return {"success": False, "error": f"Failed to load workset: {e}"}
else:
# Use all files from manifest when --workset is not used
try:
manifest = load_json_file(manifest_path)
# Use all text files from the manifest
workset_files = {f["path"] for f in manifest if isinstance(f, dict) and "path" in f}
except (FileNotFoundError, json.JSONDecodeError) as e:
return {"success": False, "error": f"Failed to load manifest: {e}"}
if not workset_files:
return {"success": False, "error": "Empty workset"}
# Detect available linters
linters = detect_linters(root_path, auto_fix=auto_fix)
if print_plan:
print("Lint Plan:")
# AUTO-FIX DEPRECATED: Always in check-only mode
# print(f" Mode: {'AUTO-FIX' if auto_fix else 'CHECK-ONLY'}")
print(f" Mode: CHECK-ONLY")
print(f" Workset: {len(workset_files)} files")
if linters:
print(" External linters detected:")
for tool in linters:
# AUTO-FIX DEPRECATED: No fix indicators
# fix_capable = tool in ["eslint", "prettier", "ruff", "black"]
# fix_indicator = " (will fix)" if auto_fix and fix_capable else ""
print(f" - {tool}")
else:
print(" No external linters detected")
print(" Will run built-in checks:")
print(" - NO_TODO_LAND (excessive TODOs)")
print(" - NO_LONG_FILES (>1500 lines)")
print(" - NO_CYCLES (import cycles)")
print(" - NO_DEBUG_CALLS (console.log/print)")
print(" - NO_SECRET_LIKE (potential secrets)")
return {"success": True, "printed_plan": True}
all_findings = []
fixed_count = 0
all_ast_data = {} # Collect AST data from ESLint
if linters:
# Run external linters
# AUTO-FIX DEPRECATED: Always run in check-only mode
# mode_str = "Fixing" if auto_fix else "Checking"
print(f"Checking with {len(linters)} external linters...")
for tool, command in linters.items():
# AUTO-FIX DEPRECATED: This entire block is disabled
# if auto_fix and tool in ["eslint", "prettier", "ruff", "black"]:
# print(f" Fixing with {tool}...")
# # In fix mode, we run the tool but may get fewer findings (as they're fixed)
# findings, ast_data = run_linter(tool, command, root_path, workset_files, timeout)
# # Collect AST data from ESLint
# if tool == "eslint" and ast_data:
# all_ast_data.update(ast_data)
# # Add remaining findings (unfixable issues)
# all_findings.extend(findings)
# # Estimate fixes based on the tool (most issues are fixable)
# if tool in ["prettier", "black"]:
# # Formatters fix all issues
# if len(findings) == 0:
# print(f" Fixed all formatting issues")
# else:
# print(f" Fixed most issues, {len(findings)} remaining")
# else:
# # ESLint and Ruff fix most but not all issues
# remaining = len(findings)
# if remaining > 0:
# print(f" Fixed issues, {remaining} remaining (unfixable)")
# else:
# print(f" Fixed all issues")
# else:
print(f" Checking with {tool}...")
findings, ast_data = run_linter(tool, command, root_path, workset_files, timeout)
# Collect AST data from ESLint
if tool == "eslint" and ast_data:
all_ast_data.update(ast_data)
all_findings.extend(findings)
print(f" Found {len(findings)} issues")
else:
# No linters found - this indicates broken environment
print("[WARNING] No external linters found!")
print("[ERROR] Environment is not properly configured - industry tools are required")
print(" Install at least one linter:")
print(" JavaScript/TypeScript: npm install --save-dev eslint")
print(" Python: pip install ruff")
print(" Go: go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest")
# Continue with empty findings rather than failing completely
print("[INFO] Continuing with no lint findings...")
# Check TypeScript configuration to determine which TS tool to use
# This is DETECTION logic, not a linter itself
# tsconfig_findings = check_tsconfig(root_path)
# NOTE: check_tsconfig was deleted with builtin.py - need to restore detection logic
# Write outputs directly to raw directory
output_dir = Path(".pf/raw")
output_dir.mkdir(parents=True, exist_ok=True)
json_path = output_dir / "lint.json"
write_lint_json(all_findings, str(json_path))
# Save ESLint ASTs to cache
if all_ast_data:
# Load manifest to get file hashes
try:
manifest = load_json_file(manifest_path)
file_hashes = {f["path"]: f.get("sha256") for f in manifest if isinstance(f, dict) and "sha256" in f}
# Create AST cache directory
ast_cache_dir = output_dir / "ast_cache" / "eslint"
ast_cache_dir.mkdir(parents=True, exist_ok=True)
# Save each AST with the file's SHA256 hash as the filename
for file_path, ast in all_ast_data.items():
if file_path in file_hashes and file_hashes[file_path]:
file_hash = file_hashes[file_path]
else:
# If hash not in manifest, compute it from file content
full_path = Path(root_path) / file_path
if full_path.exists():
with open(full_path, "rb") as f:
file_hash = hashlib.sha256(f.read()).hexdigest()
else:
continue
# Save AST to cache file
ast_file = ast_cache_dir / f"{file_hash}.json"
with open(ast_file, "w", encoding="utf-8") as f:
json.dump(ast, f, indent=2)
print(f" Cached {len(all_ast_data)} ASTs from ESLint")
except Exception as e:
print(f"Warning: Failed to cache ESLint ASTs: {e}")
# Statistics
stats = {
"total_findings": len(all_findings),
"tools_run": len(linters) if linters else 1, # 1 for built-in
"workset_size": len(workset_files),
"errors": sum(1 for f in all_findings if f["severity"] == "error"),
"warnings": sum(1 for f in all_findings if f["severity"] == "warning"),
}
# AUTO-FIX DEPRECATED: This block is disabled
# if auto_fix:
# print("\n[OK] Auto-fix complete:")
# print(f" Files processed: {len(workset_files)}")
# print(f" Remaining issues: {stats['total_findings']}")
# print(f" Errors: {stats['errors']}")
# print(f" Warnings: {stats['warnings']}")
# if stats['total_findings'] > 0:
# print(f" Note: Some issues cannot be auto-fixed and require manual attention")
# print(f" Report: {json_path}")
# else:
print("\nLint complete:")
print(f" Total findings: {stats['total_findings']}")
print(f" Errors: {stats['errors']}")
print(f" Warnings: {stats['warnings']}")
print(f" Output: {json_path}")
if stats['total_findings'] > 0:
print(" Note: Many linters (ESLint, Prettier, Ruff, Black) have their own automatic code style fix capabilities")
return {
"success": True,
"stats": stats,
"output_files": [str(json_path)],
"auto_fix_applied": auto_fix,
}
@click.command()
@handle_exceptions
@click.option("--root", default=".", help="Root directory")
@click.option("--workset", is_flag=True, help="Use workset mode (lint only files in .pf/workset.json)")
@click.option("--workset-path", default=None, help="Custom workset path (rarely needed)")
@click.option("--manifest", default=None, help="Manifest file path")
@click.option("--timeout", default=None, type=int, help="Timeout in seconds for each linter")
@click.option("--print-plan", is_flag=True, help="Print lint plan without executing")
# AUTO-FIX DEPRECATED: Hidden flag kept for backward compatibility
@click.option("--fix", is_flag=True, hidden=True, help="[DEPRECATED] No longer functional")
def lint(root, workset, workset_path, manifest, timeout, print_plan, fix):
"""Run linters and normalize output to evidence format."""
from theauditor.config_runtime import load_runtime_config
# Load configuration
config = load_runtime_config(root)
# Use config defaults if not provided
if manifest is None:
manifest = config["paths"]["manifest"]
if timeout is None:
timeout = config["timeouts"]["lint_timeout"]
if workset_path is None and workset:
workset_path = config["paths"]["workset"]
# Use workset path only if --workset flag is set
actual_workset_path = workset_path if workset else None
result = lint_command(
root_path=root,
workset_path=actual_workset_path,
manifest_path=manifest,
timeout=timeout,
print_plan=print_plan,
auto_fix=fix,
)
if result.get("printed_plan"):
return
if not result["success"]:
click.echo(f"Error: {result.get('error', 'Lint failed')}", err=True)
raise click.ClickException(result.get("error", "Lint failed"))
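
Since write_lint_json() sorts on file/line/rule and the stats tally by severity, a normalized finding looks roughly like this; only those four keys are confirmed by this module, the rest are assumptions:

```python
# Sketch of one normalized finding; "file", "line", "rule", and "severity"
# are the keys this module reads, the other fields are assumptions.
finding = {
    "file": "src/app.ts",
    "line": 42,
    "rule": "no-unused-vars",
    "severity": "warning",  # counted as errors/warnings in the stats above
    "message": "'x' is assigned a value but never used",  # assumed field
    "tool": "eslint",                                      # assumed field
}
```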

165
theauditor/commands/ml.py Normal file

@@ -0,0 +1,165 @@
"""Machine learning commands for TheAuditor."""
import click
from pathlib import Path
@click.command(name="learn")
@click.option("--db-path", default="./.pf/repo_index.db", help="Database path")
@click.option("--manifest", default="./.pf/manifest.json", help="Manifest file path")
@click.option("--journal", default="./.pf/journal.ndjson", help="Journal file path")
@click.option("--fce", default="./.pf/fce.json", help="FCE file path")
@click.option("--ast", default="./.pf/ast_proofs.json", help="AST proofs file path")
@click.option("--enable-git", is_flag=True, help="Enable git churn features")
@click.option("--model-dir", default="./.pf/ml", help="Model output directory")
@click.option("--window", default=50, type=int, help="Journal window size")
@click.option("--seed", default=13, type=int, help="Random seed")
@click.option("--feedback", help="Path to human feedback JSON file")
@click.option("--train-on", type=click.Choice(["full", "diff", "all"]), default="full", help="Type of historical runs to train on")
@click.option("--print-stats", is_flag=True, help="Print training statistics")
def learn(db_path, manifest, journal, fce, ast, enable_git, model_dir, window, seed, feedback, train_on, print_stats):
"""Train ML models from audit artifacts to predict risk and root causes."""
from theauditor.ml import learn as ml_learn
click.echo(f"[ML] Training models from audit artifacts (using {train_on} runs)...")
result = ml_learn(
db_path=db_path,
manifest_path=manifest,
journal_path=journal,
fce_path=fce,
ast_path=ast,
enable_git=enable_git,
model_dir=model_dir,
window=window,
seed=seed,
print_stats=print_stats,
feedback_path=feedback,
train_on=train_on,
)
if result.get("success"):
stats = result.get("stats", {})
click.echo(f"[OK] Models trained successfully")
click.echo(f" * Training data: {train_on} runs from history")
click.echo(f" * Files analyzed: {result.get('source_files', 0)}")
click.echo(f" * Features: {stats.get('n_features', 0)} dimensions")
click.echo(f" * Root cause ratio: {stats.get('root_cause_positive_ratio', 0):.2%}")
click.echo(f" * Risk mean: {stats.get('mean_risk', 0):.3f}")
if stats.get('cold_start'):
click.echo(f" [WARN] Cold-start mode (<500 samples)")
click.echo(f" * Models saved to: {result.get('model_dir')}")
else:
click.echo(f"[FAIL] Training failed: {result.get('error')}", err=True)
raise click.ClickException(result.get("error"))
@click.command(name="suggest")
@click.option("--db-path", default="./.pf/repo_index.db", help="Database path")
@click.option("--manifest", default="./.pf/manifest.json", help="Manifest file path")
@click.option("--workset", default="./.pf/workset.json", help="Workset file path")
@click.option("--fce", default="./.pf/fce.json", help="FCE file path")
@click.option("--ast", default="./.pf/ast_proofs.json", help="AST proofs file path")
@click.option("--model-dir", default="./.pf/ml", help="Model directory")
@click.option("--topk", default=10, type=int, help="Top K files to suggest")
@click.option("--out", default="./.pf/insights/ml_suggestions.json", help="Output file path")
@click.option("--print-plan", is_flag=True, help="Print suggestions to console")
def suggest(db_path, manifest, workset, fce, ast, model_dir, topk, out, print_plan):
"""Generate ML-based suggestions for risky files and likely root causes."""
from theauditor.ml import suggest as ml_suggest
click.echo("[ML] Generating suggestions from trained models...")
result = ml_suggest(
db_path=db_path,
manifest_path=manifest,
workset_path=workset,
fce_path=fce,
ast_path=ast,
model_dir=model_dir,
topk=topk,
out_path=out,
print_plan=print_plan,
)
if result.get("success"):
click.echo(f"[OK] Suggestions generated")
click.echo(f" * Workset size: {result.get('workset_size', 0)} files")
click.echo(f" * Source files analyzed: {result.get('workset_size', 0)}")
click.echo(f" * Non-source excluded: {result.get('excluded_count', 0)}")
click.echo(f" * Top {result.get('topk', 10)} suggestions saved to: {result.get('out_path')}")
else:
click.echo(f"[FAIL] Suggestion generation failed: {result.get('error')}", err=True)
raise click.ClickException(result.get("error"))
@click.command(name="learn-feedback")
@click.option("--feedback-file", required=True, help="Path to feedback JSON file")
@click.option("--db-path", default="./.pf/repo_index.db", help="Database path")
@click.option("--manifest", default="./.pf/manifest.json", help="Manifest file path")
@click.option("--model-dir", default="./.pf/ml", help="Model output directory")
@click.option("--train-on", type=click.Choice(["full", "diff", "all"]), default="full", help="Type of historical runs to train on")
@click.option("--print-stats", is_flag=True, help="Print training statistics")
def learn_feedback(feedback_file, db_path, manifest, model_dir, train_on, print_stats):
"""
Re-train models with human feedback for improved accuracy.
The feedback file should be a JSON file with the format:
{
"path/to/file.py": {
"is_risky": true,
"is_root_cause": false,
"will_need_edit": true
},
...
}
"""
from theauditor.ml import learn as ml_learn
# Validate feedback file exists
if not Path(feedback_file).exists():
click.echo(f"[FAIL] Feedback file not found: {feedback_file}", err=True)
raise click.ClickException(f"Feedback file not found: {feedback_file}")
# Validate feedback file format
try:
import json
with open(feedback_file) as f:
feedback_data = json.load(f)
if not isinstance(feedback_data, dict):
raise ValueError("Feedback file must contain a JSON object")
# Count feedback entries
feedback_count = len(feedback_data)
click.echo(f"[ML] Loading human feedback for {feedback_count} files...")
except Exception as e:
click.echo(f"[FAIL] Invalid feedback file format: {e}", err=True)
raise click.ClickException(f"Invalid feedback file: {e}")
click.echo(f"[ML] Re-training models with human feedback (using {train_on} runs)...")
result = ml_learn(
db_path=db_path,
manifest_path=manifest,
model_dir=model_dir,
print_stats=print_stats,
feedback_path=feedback_file,
train_on=train_on,
# Use default paths for historical data from .pf/history
enable_git=False, # Disable git for speed in feedback mode
)
if result.get("success"):
stats = result.get("stats", {})
click.echo(f"[OK] Models re-trained with human feedback")
click.echo(f" * Training data: {train_on} runs from history")
click.echo(f" * Files analyzed: {result.get('source_files', 0)}")
click.echo(f" * Human feedback incorporated: {feedback_count} files")
click.echo(f" * Features: {stats.get('n_features', 0)} dimensions")
click.echo(f" * Models saved to: {result.get('model_dir')}")
click.echo(f"\n[TIP] The models have learned from your feedback and will provide more accurate predictions.")
else:
click.echo(f"[FAIL] Re-training failed: {result.get('error')}", err=True)
raise click.ClickException(result.get("error"))
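
A minimal sketch for producing the feedback file consumed above, following the format documented in the learn-feedback docstring:

```python
# Write a feedback.json in the documented format for `aud learn-feedback`.
import json

feedback = {
    "path/to/file.py": {
        "is_risky": True,
        "is_root_cause": False,
        "will_need_edit": True,
    },
}
with open("feedback.json", "w", encoding="utf-8") as f:
    json.dump(feedback, f, indent=2)
```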

600
theauditor/commands/refactor.py Normal file

@@ -0,0 +1,600 @@
"""Refactoring impact analysis command.
This command analyzes the impact of refactoring changes and detects
inconsistencies between frontend and backend, API contract mismatches,
and data model evolution issues.
"""
import json
import os
import sqlite3
from pathlib import Path
from typing import Dict, List, Set, Any, Optional
import click
@click.command()
@click.option("--file", "-f", help="File to analyze refactoring impact from")
@click.option("--line", "-l", type=int, help="Line number in the file")
@click.option("--migration-dir", "-m", default="backend/migrations",
help="Directory containing database migrations")
@click.option("--migration-limit", "-ml", type=int, default=0,
help="Number of recent migrations to analyze (0=all, default=all)")
@click.option("--expansion-mode", "-e",
type=click.Choice(["none", "direct", "full"]),
default="none",
help="Dependency expansion mode: none (affected only), direct (1 level), full (transitive)")
@click.option("--auto-detect", "-a", is_flag=True,
help="Auto-detect refactoring from recent migrations")
@click.option("--workset", "-w", is_flag=True,
help="Use current workset for analysis")
@click.option("--output", "-o", type=click.Path(),
help="Output file for detailed report")
def refactor(file: Optional[str], line: Optional[int], migration_dir: str,
migration_limit: int, expansion_mode: str,
auto_detect: bool, workset: bool, output: Optional[str]) -> None:
"""Analyze refactoring impact and find inconsistencies.
This command helps detect issues introduced by refactoring such as:
- Data model changes (fields moved between tables)
- API contract mismatches (frontend expects old structure)
- Missing updates in dependent code
- Cross-stack inconsistencies
Examples:
# Analyze impact from a specific model change
aud refactor --file models/Product.ts --line 42
# Auto-detect refactoring from migrations
aud refactor --auto-detect
# Analyze current workset
aud refactor --workset
"""
# Find repository root
repo_root = Path.cwd()
while repo_root != repo_root.parent:
if (repo_root / ".git").exists():
break
repo_root = repo_root.parent
pf_dir = repo_root / ".pf"
db_path = pf_dir / "repo_index.db"
if not db_path.exists():
click.echo("Error: No index found. Run 'aud index' first.", err=True)
raise click.Abort()
# Import components here to avoid import errors
try:
from theauditor.impact_analyzer import analyze_impact
from theauditor.universal_detector import UniversalPatternDetector
from theauditor.pattern_loader import PatternLoader
from theauditor.fce import run_fce
from theauditor.correlations.loader import CorrelationLoader
except ImportError as e:
click.echo(f"Error importing components: {e}", err=True)
raise click.Abort()
# Initialize components
pattern_loader = PatternLoader()
pattern_detector = UniversalPatternDetector(
repo_root,
pattern_loader,
exclude_patterns=[]
)
click.echo("\nRefactoring Impact Analysis")
click.echo("-" * 60)
# Step 1: Determine what to analyze
affected_files = set()
if auto_detect:
click.echo("Auto-detecting refactoring from migrations...")
affected_files.update(_analyze_migrations(repo_root, migration_dir, migration_limit))
if not affected_files:
click.echo("No affected files found from migrations.")
click.echo("Tip: Check if your migrations contain schema change operations")
return
elif workset:
click.echo("Analyzing workset files...")
workset_file = pf_dir / "workset.json"
if workset_file.exists():
with open(workset_file, 'r') as f:
workset_data = json.load(f)
affected_files.update(workset_data.get("files", []))
else:
click.echo("Error: No workset found. Create one with 'aud workset'", err=True)
raise click.Abort()
elif file and line:
click.echo(f"Analyzing impact from {file}:{line}...")
# Run impact analysis
impact_result = analyze_impact(
db_path=str(db_path),
target_file=file,
target_line=line,
trace_to_backend=True
)
if not impact_result.get("error"):
# Extract affected files from impact analysis
upstream_files = [dep["file"] for dep in impact_result.get("upstream", [])]
downstream_files = [dep["file"] for dep in impact_result.get("downstream", [])]
upstream_trans_files = [dep["file"] for dep in impact_result.get("upstream_transitive", [])]
downstream_trans_files = [dep["file"] for dep in impact_result.get("downstream_transitive", [])]
all_impact_files = set(upstream_files + downstream_files + upstream_trans_files + downstream_trans_files)
affected_files.update(all_impact_files)
# Show immediate impact
summary = impact_result.get("impact_summary", {})
click.echo(f"\nDirect impact: {summary.get('direct_upstream', 0)} upstream, "
f"{summary.get('direct_downstream', 0)} downstream")
click.echo(f"Total files affected: {summary.get('affected_files', len(affected_files))}")
# Check for cross-stack impact
if impact_result.get("cross_stack_impact"):
click.echo("\n⚠️ Cross-stack impact detected!")
for impact in impact_result["cross_stack_impact"]:
click.echo(f"{impact['file']}:{impact['line']} - {impact['type']}")
else:
click.echo("Error: Specify --file and --line, --auto-detect, or --workset", err=True)
raise click.Abort()
if not affected_files:
click.echo("No files to analyze.")
return
# Step 2b: Expand affected files based on mode
if affected_files:
expanded_files = _expand_affected_files(
affected_files,
str(db_path),
expansion_mode,
repo_root
)
else:
expanded_files = set()
# Update workset with expanded files
click.echo(f"\nCreating workset from {len(expanded_files)} files...")
temp_workset_file = pf_dir / "temp_workset.json"
with open(temp_workset_file, 'w') as f:
json.dump({"files": list(expanded_files)}, f)
# Step 3: Run pattern detection with targeted file list
if expanded_files:
click.echo(f"Running pattern detection on {len(expanded_files)} files...")
# Check if batch method is available
if hasattr(pattern_detector, 'detect_patterns_for_files'):
# Use optimized batch method if available
findings = pattern_detector.detect_patterns_for_files(
list(expanded_files),
categories=None
)
else:
# Fallback to individual file processing
findings = []
for i, file_path in enumerate(expanded_files, 1):
if i % 10 == 0:
click.echo(f" Scanning file {i}/{len(expanded_files)}...", nl=False)
click.echo("\r", nl=False)
# Convert to relative path for pattern detector
try:
rel_path = Path(file_path).relative_to(repo_root).as_posix()
except ValueError:
rel_path = file_path
file_findings = pattern_detector.detect_patterns(
categories=None,
file_filter=rel_path
)
findings.extend(file_findings)
click.echo(f"\n Found {len(findings)} patterns")
else:
findings = []
click.echo("No files to analyze after expansion")
patterns = findings
# Step 4: Run FCE correlation with refactoring rules
click.echo("Running correlation analysis...")
# Run the FCE to get correlations
fce_results = run_fce(
root_path=str(repo_root),
capsules_dir=str(pf_dir / "capsules"),
manifest_path="manifest.json",
workset_path=str(temp_workset_file),
db_path="repo_index.db",
timeout=600,
print_plan=False
)
# Extract correlations from FCE results
correlations = []
if fce_results.get("success") and fce_results.get("results"):
fce_data = fce_results["results"]
if "correlations" in fce_data and "factual_clusters" in fce_data["correlations"]:
correlations = fce_data["correlations"]["factual_clusters"]
# Step 5: Identify mismatches
mismatches = _find_mismatches(patterns, correlations, affected_files)
# Generate report
report = _generate_report(affected_files, patterns, correlations, mismatches)
# Display summary
click.echo("\n" + "=" * 60)
click.echo("Refactoring Analysis Summary")
click.echo("=" * 60)
click.echo(f"\nFiles analyzed: {len(affected_files)}")
click.echo(f"Patterns detected: {len(patterns)}")
click.echo(f"Correlations found: {len(correlations)}")
if mismatches["api"]:
click.echo(f"\nAPI Mismatches: {len(mismatches['api'])}")
for mismatch in mismatches["api"][:5]: # Show top 5
click.echo(f"{mismatch['description']}")
if mismatches["model"]:
click.echo(f"\nData Model Mismatches: {len(mismatches['model'])}")
for mismatch in mismatches["model"][:5]: # Show top 5
click.echo(f"{mismatch['description']}")
if mismatches["contract"]:
click.echo(f"\nContract Mismatches: {len(mismatches['contract'])}")
for mismatch in mismatches["contract"][:5]: # Show top 5
click.echo(f"{mismatch['description']}")
# Risk assessment
risk_level = _assess_risk(mismatches, len(affected_files))
click.echo(f"\nRisk Level: {risk_level}")
# Recommendations
recommendations = _generate_recommendations(mismatches)
if recommendations:
click.echo("\nRecommendations:")
for rec in recommendations:
click.echo(f"{rec}")
# Save detailed report if requested
if output:
with open(output, 'w') as f:
json.dump(report, f, indent=2, default=str)
click.echo(f"\nDetailed report saved to: {output}")
# Suggest next steps
click.echo("\nNext Steps:")
click.echo(" 1. Review the mismatches identified above")
click.echo(" 2. Run 'aud impact --file <file> --line <line>' for detailed impact")
click.echo(" 3. Use 'aud detect-patterns --workset' for pattern-specific issues")
click.echo(" 4. Run 'aud full' for comprehensive analysis")
def _expand_affected_files(
affected_files: Set[str],
db_path: str,
expansion_mode: str,
repo_root: Path
) -> Set[str]:
"""Expand affected files with their dependencies based on mode."""
if expansion_mode == "none":
return affected_files
expanded = set(affected_files)
total_files = len(affected_files)
click.echo(f"\nExpanding {total_files} affected files with {expansion_mode} mode...")
if expansion_mode in ["direct", "full"]:
from theauditor.impact_analyzer import analyze_impact
for i, file_path in enumerate(affected_files, 1):
if i % 5 == 0 or i == total_files:
click.echo(f" Analyzing dependencies {i}/{total_files}...", nl=False)
click.echo("\r", nl=False)
# Find a representative line (first function/class)
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT line FROM symbols
WHERE path = ? AND type IN ('function', 'class')
ORDER BY line LIMIT 1
""", (file_path,))
result = cursor.fetchone()
conn.close()
if result:
line = result[0]
try:
impact = analyze_impact(
db_path=db_path,
target_file=file_path,
target_line=line,
trace_to_backend=(expansion_mode == "full")
)
# Add direct dependencies
for dep in impact.get("upstream", []):
expanded.add(dep["file"])
for dep in impact.get("downstream", []):
if dep["file"] != "external":
expanded.add(dep["file"])
# Add transitive if full mode
if expansion_mode == "full":
for dep in impact.get("upstream_transitive", []):
expanded.add(dep["file"])
for dep in impact.get("downstream_transitive", []):
if dep["file"] != "external":
expanded.add(dep["file"])
except Exception as e:
# Don't fail entire analysis for one file
if os.environ.get("THEAUDITOR_DEBUG"):
click.echo(f"\n Warning: Could not analyze {file_path}: {e}")
click.echo(f"\n Expanded from {total_files} to {len(expanded)} files")
return expanded
def _analyze_migrations(repo_root: Path, migration_dir: str, migration_limit: int = 0) -> List[str]:
"""Analyze migration files to detect schema changes.
Args:
repo_root: Repository root path
migration_dir: Migration directory path
migration_limit: Number of recent migrations to analyze (0=all)
"""
migration_path = repo_root / migration_dir
affected_files = []
if not migration_path.exists():
# Try common locations (most common first!)
found_migrations = False
for common_path in ["backend/migrations", "migrations", "db/migrations",
"database/migrations", "frontend/migrations"]:
test_path = repo_root / common_path
if test_path.exists():
# Check if it actually contains migration files
import glob
test_migrations = (glob.glob(str(test_path / "*.js")) +
glob.glob(str(test_path / "*.ts")) +
glob.glob(str(test_path / "*.sql")))
if test_migrations:
migration_path = test_path
found_migrations = True
click.echo(f"Found migrations in: {common_path}")
break
if not found_migrations:
click.echo("\n⚠️ WARNING: No migration files found in standard locations:", err=True)
click.echo(" • backend/migrations/", err=True)
click.echo(" • migrations/", err=True)
click.echo(" • db/migrations/", err=True)
click.echo(" • database/migrations/", err=True)
click.echo(" • frontend/migrations/ (yes, we check here too)", err=True)
click.echo(f"\n Current directory searched: {migration_dir}", err=True)
click.echo(f" Use --migration-dir <path> to specify your migration folder\n", err=True)
return affected_files
if migration_path.exists():
# Look for migration files
import glob
import re
migrations = sorted(glob.glob(str(migration_path / "*.js")) +
glob.glob(str(migration_path / "*.ts")) +
glob.glob(str(migration_path / "*.sql")))
if not migrations:
click.echo(f"\n⚠️ WARNING: Directory '{migration_path}' exists but contains no migration files", err=True)
click.echo(f" Expected: .js, .ts, or .sql files", err=True)
return affected_files
# Determine which migrations to analyze
total_migrations = len(migrations)
if migration_limit > 0:
migrations_to_analyze = migrations[-migration_limit:]
click.echo(f"Analyzing {len(migrations_to_analyze)} most recent migrations (out of {total_migrations} total)")
else:
migrations_to_analyze = migrations
click.echo(f"Analyzing ALL {total_migrations} migration files")
if total_migrations > 20:
click.echo("⚠️ Large migration set detected. Consider using --migration-limit for faster analysis")
# Enhanced pattern matching
schema_patterns = {
'column_ops': r'(?:removeColumn|dropColumn|renameColumn|addColumn|alterColumn|modifyColumn)',
'table_ops': r'(?:createTable|dropTable|renameTable|alterTable)',
'index_ops': r'(?:addIndex|dropIndex|createIndex|removeIndex)',
'fk_ops': r'(?:addForeignKey|dropForeignKey|addConstraint|dropConstraint)',
'type_changes': r'(?:changeColumn|changeDataType|alterType)'
}
tables_affected = set()
operations_found = set()
# Process migrations with progress indicator
for i, migration_file in enumerate(migrations_to_analyze, 1):
if i % 10 == 0 or i == len(migrations_to_analyze):
click.echo(f" Processing migration {i}/{len(migrations_to_analyze)}...", nl=False)
click.echo("\r", nl=False)
try:
with open(migration_file, 'r') as f:
content = f.read()
# Check all pattern categories
for pattern_name, pattern_regex in schema_patterns.items():
if re.search(pattern_regex, content, re.IGNORECASE):
operations_found.add(pattern_name)
# Extract table/model names (improved regex)
# Handles: "table", 'table', `table`, tableName
tables = re.findall(r"['\"`](\w+)['\"`]|(?:table|Table)Name:\s*['\"`]?(\w+)", content)
for match in tables:
# match is a tuple from multiple capture groups
table = match[0] if match[0] else match[1] if len(match) > 1 else None
if table and table not in ['table', 'Table', 'column', 'Column']:
tables_affected.add(table)
except Exception as e:
click.echo(f"\nWarning: Could not read migration {migration_file}: {e}")
continue
click.echo(f"\nFound {len(operations_found)} types of operations affecting {len(tables_affected)} tables")
# Map tables to model files
for table in tables_affected:
model_file = _find_model_file(repo_root, table)
if model_file:
affected_files.append(str(model_file))
# Deduplicate
affected_files = list(set(affected_files))
click.echo(f"Mapped to {len(affected_files)} model files")
return affected_files
def _find_model_file(repo_root: Path, table_name: str) -> Optional[Path]:
"""Find model file corresponding to a database table."""
# Convert table name to likely model name
model_names = [
table_name, # exact match
table_name.rstrip('s'), # singular
''.join(word.capitalize() for word in table_name.split('_')), # PascalCase
]
for model_name in model_names:
# Check common model locations
for pattern in [f"**/models/{model_name}.*", f"**/{model_name}.model.*",
f"**/entities/{model_name}.*"]:
import glob
matches = glob.glob(str(repo_root / pattern), recursive=True)
if matches:
return Path(matches[0])
return None
def _find_mismatches(patterns: List[Dict], correlations: List[Dict],
affected_files: Set[str]) -> Dict[str, List[Dict]]:
"""Identify mismatches from patterns and correlations."""
mismatches = {
"api": [],
"model": [],
"contract": []
}
# Analyze patterns for known refactoring issues
for pattern in patterns:
if pattern.get("rule_id") in ["PRODUCT_PRICE_FIELD_REMOVED",
"PRODUCT_SKU_MOVED_TO_VARIANT"]:
mismatches["model"].append({
"type": "field_moved",
"description": pattern.get("message", "Field moved between models"),
"file": pattern.get("file"),
"line": pattern.get("line")
})
elif pattern.get("rule_id") in ["API_ENDPOINT_PRODUCT_PRICE"]:
mismatches["api"].append({
"type": "endpoint_deprecated",
"description": pattern.get("message", "API endpoint no longer exists"),
"file": pattern.get("file"),
"line": pattern.get("line")
})
elif pattern.get("rule_id") in ["FRONTEND_BACKEND_CONTRACT_MISMATCH"]:
mismatches["contract"].append({
"type": "contract_mismatch",
"description": pattern.get("message", "Frontend/backend contract mismatch"),
"file": pattern.get("file"),
"line": pattern.get("line")
})
# Analyze correlations for co-occurring issues
for correlation in correlations:
if correlation.get("confidence", 0) > 0.8:
category = "contract" if "contract" in correlation.get("name", "").lower() else \
"api" if "api" in correlation.get("name", "").lower() else "model"
mismatches[category].append({
"type": "correlation",
"description": correlation.get("description", "Correlated issue detected"),
"confidence": correlation.get("confidence"),
"facts": correlation.get("matched_facts", [])
})
return mismatches
def _assess_risk(mismatches: Dict[str, List], file_count: int) -> str:
"""Assess the risk level of the refactoring."""
total_issues = sum(len(issues) for issues in mismatches.values())
if total_issues > 20 or file_count > 50:
return "HIGH"
elif total_issues > 10 or file_count > 20:
return "MEDIUM"
else:
return "LOW"
def _generate_recommendations(mismatches: Dict[str, List]) -> List[str]:
"""Generate actionable recommendations based on mismatches."""
recommendations = []
if mismatches["model"]:
recommendations.append("Update frontend interfaces to match new model structure")
recommendations.append("Run database migrations in all environments")
if mismatches["api"]:
recommendations.append("Update API client to use new endpoints")
recommendations.append("Add deprecation notices for old endpoints")
if mismatches["contract"]:
recommendations.append("Synchronize TypeScript interfaces with backend models")
recommendations.append("Add API versioning to prevent breaking changes")
if sum(len(issues) for issues in mismatches.values()) > 10:
recommendations.append("Consider breaking this refactoring into smaller steps")
recommendations.append("Add integration tests before proceeding")
return recommendations
def _generate_report(affected_files: Set[str], patterns: List[Dict],
correlations: List[Dict], mismatches: Dict) -> Dict:
"""Generate detailed report of the refactoring analysis."""
return {
"summary": {
"files_analyzed": len(affected_files),
"patterns_detected": len(patterns),
"correlations_found": len(correlations),
"total_mismatches": sum(len(issues) for issues in mismatches.values())
},
"affected_files": list(affected_files),
"patterns": patterns,
"correlations": correlations,
"mismatches": mismatches,
"risk_assessment": _assess_risk(mismatches, len(affected_files)),
"recommendations": _generate_recommendations(mismatches)
}
# Register command
refactor_command = refactor
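
A quick way to exercise the command end-to-end is Click's standard test runner; the flags mirror the options declared above:

```python
# Exercising the refactor command in-process with Click's CliRunner.
from click.testing import CliRunner

runner = CliRunner()
result = runner.invoke(refactor, ["--auto-detect", "--migration-limit", "5"])
print(result.exit_code)
print(result.output)
```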

66
theauditor/commands/report.py Normal file

@@ -0,0 +1,66 @@
"""Generate unified audit report from all artifacts."""
from pathlib import Path
import click
from theauditor.utils.error_handler import handle_exceptions
@click.command()
@handle_exceptions
@click.option("--manifest", default="./.pf/manifest.json", help="Manifest file path")
@click.option("--db", default="./.pf/repo_index.db", help="Database path")
@click.option("--workset", default="./.pf/workset.json", help="Workset file path")
@click.option("--capsules", default="./.pf/capsules", help="Capsules directory")
@click.option("--run-report", default="./.pf/run_report.json", help="Run report file path")
@click.option("--journal", default="./.pf/journal.ndjson", help="Journal file path")
@click.option("--fce", default="./.pf/fce.json", help="FCE file path")
@click.option("--ast", default="./.pf/ast_proofs.json", help="AST proofs file path")
@click.option("--ml", default="./.pf/ml_suggestions.json", help="ML suggestions file path")
@click.option("--patch", help="Patch diff file path")
@click.option("--out-dir", default="./.pf/audit", help="Output directory for audit reports")
@click.option("--max-snippet-lines", default=3, type=int, help="Maximum lines per snippet")
@click.option("--max-snippet-chars", default=220, type=int, help="Maximum characters per line")
@click.option("--print-stats", is_flag=True, help="Print summary statistics")
def report(
manifest,
db,
workset,
capsules,
run_report,
journal,
fce,
ast,
ml,
patch,
out_dir,
max_snippet_lines,
max_snippet_chars,
print_stats,
):
"""Generate unified audit report from all artifacts."""
# Report generation has been simplified
# Data is already chunked in .pf/readthis/ by extraction phase
readthis_dir = Path("./.pf/readthis")
if readthis_dir.exists():
json_files = list(readthis_dir.glob("*.json"))
click.echo(f"[OK] Audit report generated - Data chunks ready for AI consumption")
click.echo(f"[INFO] Report contains {len(json_files)} JSON chunks in .pf/readthis/")
if print_stats:
total_size = sum(f.stat().st_size for f in json_files)
click.echo(f"\n[STATS] Summary:")
click.echo(f" - Total chunks: {len(json_files)}")
click.echo(f" - Total size: {total_size:,} bytes")
click.echo(f" - Average chunk: {total_size // len(json_files):,} bytes" if json_files else " - No chunks")
click.echo(f"\n[FILES] Available chunks:")
for f in sorted(json_files)[:10]: # Show first 10
size = f.stat().st_size
click.echo(f" - {f.name} ({size:,} bytes)")
if len(json_files) > 10:
click.echo(f" ... and {len(json_files) - 10} more")
else:
click.echo("[WARNING] No readthis directory found at .pf/readthis/")
click.echo("[INFO] Run 'aud full' to generate analysis data")

226
theauditor/commands/rules.py Normal file

@@ -0,0 +1,226 @@
"""Rules command - inspect and summarize detection capabilities."""
import os
import yaml
import importlib
import inspect
from pathlib import Path
from typing import Dict, List, Any
import click
from theauditor.utils import handle_exceptions
from theauditor.utils.exit_codes import ExitCodes
@click.command(name="rules")
@click.option(
"--summary",
is_flag=True,
default=False,
help="Generate a summary of all detection capabilities",
)
@handle_exceptions
def rules_command(summary: bool) -> None:
"""Inspect and summarize TheAuditor's detection rules and patterns.
Args:
summary: If True, generate a comprehensive capability report
"""
if not summary:
click.echo(click.style("[ERROR] Please specify --summary to generate a capability report", fg="red"), err=True)
raise SystemExit(ExitCodes.TASK_INCOMPLETE)
# Get the base path for patterns and rules
base_path = Path(__file__).parent.parent
patterns_path = base_path / "patterns"
rules_path = base_path / "rules"
# Create output directory
output_dir = Path(".pf")
output_dir.mkdir(parents=True, exist_ok=True)
output_file = output_dir / "auditor_capabilities.md"
# Collect output in a list
output_lines = []
output_lines.append("# TheAuditor Detection Capabilities\n")
# Also print to console
print("# TheAuditor Detection Capabilities\n")
# Scan YAML patterns
print("## YAML Patterns\n")
output_lines.append("## YAML Patterns\n")
yaml_patterns = scan_yaml_patterns(patterns_path)
total_patterns = 0
for category, files in yaml_patterns.items():
if files:
category_display = "patterns/" if category == "." else f"patterns/{category}/"
print(f"### {category_display}\n")
output_lines.append(f"### {category_display}\n")
for file_name, patterns in files.items():
if patterns:
print(f"**{file_name}** ({len(patterns)} patterns)")
output_lines.append(f"**{file_name}** ({len(patterns)} patterns)")
for pattern in patterns:
print(f"- `{pattern}`")
output_lines.append(f"- `{pattern}`")
print()
output_lines.append("")
total_patterns += len(patterns)
# Scan Python rules
print("## Python AST Rules\n")
output_lines.append("## Python AST Rules\n")
python_rules = scan_python_rules(rules_path)
total_rules = 0
for module_path, functions in python_rules.items():
if functions:
# Make path relative to rules/ for readability
display_path = module_path.replace(str(rules_path) + os.sep, "")
print(f"### {display_path}")
output_lines.append(f"### {display_path}")
for func in functions:
print(f"- `{func}()`")
output_lines.append(f"- `{func}()`")
print()
output_lines.append("")
total_rules += len(functions)
# Print summary statistics
print("## Summary Statistics\n")
output_lines.append("## Summary Statistics\n")
print(f"- **Total YAML Patterns**: {total_patterns}")
output_lines.append(f"- **Total YAML Patterns**: {total_patterns}")
print(f"- **Total Python Rules**: {total_rules}")
output_lines.append(f"- **Total Python Rules**: {total_rules}")
print(f"- **Combined Detection Capabilities**: {total_patterns + total_rules}")
output_lines.append(f"- **Combined Detection Capabilities**: {total_patterns + total_rules}")
# Write to file
with open(output_file, 'w', encoding='utf-8') as f:
f.write('\n'.join(output_lines))
click.echo(click.style(f"\n[SUCCESS] Capability report generated successfully", fg="green"))
click.echo(f"[INFO] Report saved to: {output_file}")
raise SystemExit(ExitCodes.SUCCESS)
def scan_yaml_patterns(patterns_path: Path) -> Dict[str, Dict[str, List[str]]]:
"""Scan YAML pattern files and extract pattern names.
Args:
patterns_path: Path to the patterns directory
Returns:
Dictionary mapping category -> file -> list of pattern names
"""
results = {}
if not patterns_path.exists():
return results
# Walk through all subdirectories
for root, dirs, files in os.walk(patterns_path):
# Skip __pycache__ directories
dirs[:] = [d for d in dirs if d != "__pycache__"]
for file in files:
if file.endswith(".yml") or file.endswith(".yaml"):
file_path = Path(root) / file
# Determine category from directory structure
rel_path = file_path.relative_to(patterns_path)
# If file is in root of patterns/, use "." as category
# If in subdirectory like frameworks/, use that as category
if rel_path.parent == Path("."):
category = "."
else:
category = str(rel_path.parent)
if category not in results:
results[category] = {}
# Parse YAML and extract pattern names
try:
with open(file_path, 'r', encoding='utf-8') as f:
data = yaml.safe_load(f)
if data and isinstance(data, list):
pattern_names = []
for pattern in data:
if isinstance(pattern, dict) and 'name' in pattern:
pattern_names.append(pattern['name'])
if pattern_names:
results[category][file] = pattern_names
except (yaml.YAMLError, OSError):
# Skip files that can't be parsed
continue
return results
def scan_python_rules(rules_path: Path) -> Dict[str, List[str]]:
"""Scan Python rule files and find all find_* functions.
Args:
rules_path: Path to the rules directory
Returns:
Dictionary mapping module path -> list of find_* function names
"""
results = {}
if not rules_path.exists():
return results
# First, check what's exposed in the main __init__.py
init_file = rules_path / "__init__.py"
if init_file.exists():
try:
module = importlib.import_module("theauditor.rules")
exposed_functions = []
for name, obj in inspect.getmembers(module, inspect.isfunction):
if name.startswith("find_"):
exposed_functions.append(name)
if exposed_functions:
results["rules/__init__.py (exposed)"] = exposed_functions
except ImportError:
pass
# Walk through all Python files
for root, dirs, files in os.walk(rules_path):
# Skip __pycache__ directories
dirs[:] = [d for d in dirs if d != "__pycache__"]
for file in files:
if file.endswith(".py"):
file_path = Path(root) / file
# Skip __init__.py files for now (we handle them separately)
if file == "__init__.py":
continue
# Try basic text scanning (more reliable than import)
try:
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read()
# Simple regex to find top-level function definitions
pattern = r'^def\s+(find_\w+)\s*\('
matches = re.findall(pattern, content, re.MULTILINE)
if matches:
# Make path relative for display
display_path = str(file_path.relative_to(rules_path.parent))
results[display_path] = matches
except OSError:
continue
return results
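
For illustration, a minimal sketch of the text-scan approach used by `scan_python_rules` above; the sample source is invented, but the regex is the one from the code:

```python
import re

sample = '''
def find_sql_injection(tree):
    pass

def helper():
    pass

def find_xss(tree, taint_checker):
    pass
'''

# Same pattern as above: top-level defs whose names start with find_
matches = re.findall(r'^def\s+(find_\w+)\s*\(', sample, re.MULTILINE)
print(matches)  # ['find_sql_injection', 'find_xss']
```

Text scanning deliberately avoids importing each rule module, so a rule file with a broken import still shows up in the capability report.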

View File

@@ -0,0 +1,63 @@
"""Setup commands for TheAuditor - Claude Code integration."""
import click
@click.command("setup-claude")
@click.option(
"--target",
required=True,
help="Target project root (absolute or relative path)"
)
@click.option(
"--source",
default="agent_templates",
help="Path to TheAuditor agent templates directory (default: agent_templates)"
)
@click.option(
"--sync",
is_flag=True,
help="Force update (still creates .bak on first change only)"
)
@click.option(
"--dry-run",
is_flag=True,
help="Print plan without executing"
)
def setup_claude(target, source, sync, dry_run):
"""Install Claude Code agents, hooks, and per-project venv for TheAuditor.
This command performs a complete installation with no optional steps:
1. Creates a Python venv at <target>/.venv
2. Installs TheAuditor into that venv (editable/offline)
3. Creates cross-platform launcher wrappers at <target>/.claude/bin/
4. Generates Claude agents from agent_templates/*.md
5. Writes hooks to <target>/.claude/hooks.json
All commands in agents/hooks use ./.claude/bin/aud to ensure
they run with the project's own venv.
"""
from theauditor.claude_setup import setup_claude_complete
try:
result = setup_claude_complete(
target=target,
source=source,
sync=sync,
dry_run=dry_run
)
# The setup_claude_complete function already prints detailed output
# Just handle any failures here
if result.get("failed"):
click.echo("\n[WARN] Some operations failed:", err=True)
for item in result["failed"]:
click.echo(f" - {item}", err=True)
raise click.ClickException("Setup incomplete due to failures")
except ValueError as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e
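
A minimal usage sketch of the underlying setup function, assuming a hypothetical target path; `--dry-run` (here `dry_run=True`) prints the plan without modifying the target project:

```python
from theauditor.claude_setup import setup_claude_complete

# Dry-run first to see the plan without touching the target project.
result = setup_claude_complete(
    target="../my-project",   # hypothetical project path
    source="agent_templates",
    sync=False,
    dry_run=True,
)
if result.get("failed"):
    print("would fail:", result["failed"])
```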

View File

@@ -0,0 +1,96 @@
"""Project structure and intelligence report command."""
import click
from pathlib import Path
from theauditor.utils.error_handler import handle_exceptions
from theauditor.utils.exit_codes import ExitCodes
@click.command("structure")
@handle_exceptions
@click.option("--root", default=".", help="Root directory to analyze")
@click.option("--manifest", default="./.pf/manifest.json", help="Path to manifest.json")
@click.option("--db-path", default="./.pf/repo_index.db", help="Path to repo_index.db")
@click.option("--output", default="./.pf/readthis/STRUCTURE.md", help="Output file path")
@click.option("--max-depth", default=4, type=int, help="Maximum directory tree depth")
def structure(root, manifest, db_path, output, max_depth):
"""Generate project structure and intelligence report.
Creates a comprehensive markdown report including:
- Directory tree visualization
- Project statistics (files, LOC, tokens)
- Language distribution
- Top 10 largest files by tokens
- Top 15 critical files by convention
- AI context optimization recommendations
"""
from theauditor.project_summary import generate_project_summary, generate_directory_tree
# Check if manifest exists (not required but enhances report)
manifest_exists = Path(manifest).exists()
db_exists = Path(db_path).exists()
if not manifest_exists and not db_exists:
click.echo("Warning: Neither manifest.json nor repo_index.db found", err=True)
click.echo("Run 'aud index' first for complete statistics", err=True)
click.echo("Generating basic structure report...\n")
elif not manifest_exists:
click.echo("Warning: manifest.json not found, statistics will be limited", err=True)
elif not db_exists:
click.echo("Warning: repo_index.db not found, symbol counts will be missing", err=True)
# Generate the report
click.echo(f"Analyzing project structure (max depth: {max_depth})...")
try:
# Generate full report
report_content = generate_project_summary(
root_path=root,
manifest_path=manifest,
db_path=db_path,
max_depth=max_depth
)
# Ensure output directory exists
output_path = Path(output)
output_path.parent.mkdir(parents=True, exist_ok=True)
# Write report
with open(output_path, 'w', encoding='utf-8') as f:
f.write(report_content)
click.echo(f"\n✓ Project structure report generated: {output}")
# Show summary stats if available
if manifest_exists:
import json
with open(manifest, 'r') as f:
manifest_data = json.load(f)
total_files = len(manifest_data)
total_loc = sum(f.get('loc', 0) for f in manifest_data)
total_bytes = sum(f.get('bytes', 0) for f in manifest_data)
total_tokens = total_bytes // 4 # Rough approximation
click.echo(f"\nProject Summary:")
click.echo(f" Files: {total_files:,}")
click.echo(f" LOC: {total_loc:,}")
click.echo(f" Tokens: ~{total_tokens:,}")
# Token percentage of Claude's context
# Claude has 200k context, but practical limit is ~160k for user content
# (leaving room for system prompts, conversation history, response)
claude_total_context = 200000 # Total context window
claude_usable_context = 160000 # Practical limit for user content
token_percent = (total_tokens / claude_usable_context * 100) if total_tokens > 0 else 0
if token_percent > 100:
click.echo(f" Context Usage: {token_percent:.1f}% (EXCEEDS Claude's practical limit)")
else:
click.echo(f" Context Usage: {token_percent:.1f}% of Claude's usable window")
return ExitCodes.SUCCESS
except Exception as e:
click.echo(f"Error generating report: {e}", err=True)
return ExitCodes.TASK_INCOMPLETE
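
A worked example of the context-usage arithmetic above; the 4-bytes-per-token ratio is the rough heuristic used in the code, not an exact tokenizer, and the byte count here is hypothetical:

```python
total_bytes = 2_000_000               # e.g., ~2 MB of source
total_tokens = total_bytes // 4       # heuristic: ~4 bytes/token -> 500,000
claude_usable_context = 160_000       # practical limit used above
token_percent = total_tokens / claude_usable_context * 100
print(f"{token_percent:.1f}%")        # 312.5% -> EXCEEDS Claude's practical limit
```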

View File

@@ -0,0 +1,236 @@
"""Generate comprehensive audit summary from all analysis phases."""
import json
import time
from pathlib import Path
from typing import Any, Dict
import click
@click.command()
@click.option("--root", default=".", help="Root directory")
@click.option("--raw-dir", default="./.pf/raw", help="Raw outputs directory")
@click.option("--out", default="./.pf/raw/audit_summary.json", help="Output path for summary")
def summary(root, raw_dir, out):
"""Generate comprehensive audit summary from all phases."""
start_time = time.time()
raw_path = Path(raw_dir)
# Initialize summary structure
audit_summary = {
"generated_at": time.strftime('%Y-%m-%d %H:%M:%S'),
"overall_status": "UNKNOWN",
"total_runtime_seconds": 0,
"total_findings_by_severity": {
"critical": 0,
"high": 0,
"medium": 0,
"low": 0,
"info": 0
},
"metrics_by_phase": {},
"key_statistics": {}
}
# Helper function to safely load JSON
def load_json(file_path: Path) -> Dict[str, Any]:
if file_path.exists():
try:
with open(file_path, 'r', encoding='utf-8') as f:
return json.load(f)
except (json.JSONDecodeError, IOError):
pass
return {}
# Phase 1: Index metrics
manifest_path = Path(root) / "manifest.json"
if manifest_path.exists():
manifest = load_json(manifest_path)
if isinstance(manifest, list):
audit_summary["metrics_by_phase"]["index"] = {
"files_indexed": len(manifest),
"total_size_bytes": sum(f.get("size", 0) for f in manifest)
}
# Phase 2: Framework detection
frameworks = load_json(raw_path / "frameworks.json")
if frameworks:
if isinstance(frameworks, dict):
framework_list = frameworks.get("frameworks", [])
else:
framework_list = frameworks if isinstance(frameworks, list) else []
audit_summary["metrics_by_phase"]["detect_frameworks"] = {
"frameworks_detected": len(framework_list),
"languages": list(set(f.get("language", "") if isinstance(f, dict) else "" for f in framework_list))
}
# Phase 3: Dependencies
deps = load_json(raw_path / "deps.json")
deps_latest = load_json(raw_path / "deps_latest.json")
if deps or deps_latest:
outdated_count = 0
vulnerability_count = 0
total_deps = 0
# Handle deps being either dict or list
if isinstance(deps, dict):
total_deps = len(deps.get("dependencies", []))
elif isinstance(deps, list):
total_deps = len(deps)
# Handle deps_latest structure
if isinstance(deps_latest, dict) and "packages" in deps_latest:
for pkg in deps_latest["packages"]:
if isinstance(pkg, dict):
if pkg.get("outdated"):
outdated_count += 1
if pkg.get("vulnerabilities"):
vulnerability_count += len(pkg["vulnerabilities"])
audit_summary["metrics_by_phase"]["dependencies"] = {
"total_dependencies": total_deps,
"outdated_packages": outdated_count,
"vulnerabilities": vulnerability_count
}
# Phase 7: Linting
lint_data = load_json(raw_path / "lint.json")
if lint_data and "findings" in lint_data:
lint_by_severity = {"critical": 0, "high": 0, "medium": 0, "low": 0, "info": 0}
for finding in lint_data["findings"]:
severity = finding.get("severity", "info").lower()
if severity in lint_by_severity:
lint_by_severity[severity] += 1
audit_summary["metrics_by_phase"]["lint"] = {
"total_issues": len(lint_data["findings"]),
"by_severity": lint_by_severity
}
# Add to total
for sev, count in lint_by_severity.items():
audit_summary["total_findings_by_severity"][sev] += count
# Phase 8: Pattern detection
patterns = load_json(raw_path / "patterns.json")
if not patterns:
patterns = load_json(raw_path / "findings.json")
if patterns and "findings" in patterns:
pattern_by_severity = {"critical": 0, "high": 0, "medium": 0, "low": 0, "info": 0}
for finding in patterns["findings"]:
severity = finding.get("severity", "info").lower()
if severity in pattern_by_severity:
pattern_by_severity[severity] += 1
audit_summary["metrics_by_phase"]["patterns"] = {
"total_patterns_matched": len(patterns["findings"]),
"by_severity": pattern_by_severity
}
# Add to total
for sev, count in pattern_by_severity.items():
audit_summary["total_findings_by_severity"][sev] += count
# Phase 9-10: Graph analysis
graph_analysis = load_json(raw_path / "graph_analysis.json")
graph_metrics = load_json(raw_path / "graph_metrics.json")
if graph_analysis:
summary_data = graph_analysis.get("summary", {})
audit_summary["metrics_by_phase"]["graph"] = {
"import_nodes": summary_data.get("import_graph", {}).get("nodes", 0),
"import_edges": summary_data.get("import_graph", {}).get("edges", 0),
"cycles_detected": len(graph_analysis.get("cycles", [])),
"hotspots_identified": len(graph_analysis.get("hotspots", [])),
"graph_density": summary_data.get("import_graph", {}).get("density", 0)
}
if "health_metrics" in summary_data:
audit_summary["metrics_by_phase"]["graph"]["health_grade"] = summary_data["health_metrics"].get("health_grade", "N/A")
audit_summary["metrics_by_phase"]["graph"]["fragility_score"] = summary_data["health_metrics"].get("fragility_score", 0)
# Phase 11: Taint analysis
taint = load_json(raw_path / "taint_analysis.json")
if taint:
taint_by_severity = {"critical": 0, "high": 0, "medium": 0, "low": 0}
if "taint_paths" in taint:
for path in taint["taint_paths"]:
severity = path.get("severity", "medium").lower()
if severity in taint_by_severity:
taint_by_severity[severity] += 1
audit_summary["metrics_by_phase"]["taint_analysis"] = {
"taint_paths_found": len(taint.get("taint_paths", [])),
"total_vulnerabilities": taint.get("total_vulnerabilities", 0),
"by_severity": taint_by_severity
}
# Add to total
for sev, count in taint_by_severity.items():
if sev in audit_summary["total_findings_by_severity"]:
audit_summary["total_findings_by_severity"][sev] += count
# Phase 12: FCE (Factual Correlation Engine)
fce = load_json(raw_path / "fce.json")
if fce:
correlations = fce.get("correlations", {})
audit_summary["metrics_by_phase"]["fce"] = {
"total_findings": len(fce.get("all_findings", [])),
"test_failures": len(fce.get("test_results", {}).get("failures", [])),
"hotspots_correlated": correlations.get("total_hotspots", 0),
"factual_clusters": len(correlations.get("factual_clusters", []))
}
# Calculate overall status based on severity counts
severity_counts = audit_summary["total_findings_by_severity"]
if severity_counts["critical"] > 0:
audit_summary["overall_status"] = "CRITICAL"
elif severity_counts["high"] > 0:
audit_summary["overall_status"] = "HIGH"
elif severity_counts["medium"] > 0:
audit_summary["overall_status"] = "MEDIUM"
elif severity_counts["low"] > 0:
audit_summary["overall_status"] = "LOW"
else:
audit_summary["overall_status"] = "CLEAN"
# Add key statistics
audit_summary["key_statistics"] = {
"total_findings": sum(severity_counts.values()),
"phases_with_findings": len([p for p in audit_summary["metrics_by_phase"] if audit_summary["metrics_by_phase"][p]]),
"total_phases_run": len(audit_summary["metrics_by_phase"])
}
# Calculate runtime
elapsed = time.time() - start_time
audit_summary["summary_generation_time"] = elapsed
# Read pipeline.log for total runtime if available
pipeline_log = Path(root) / ".pf" / "pipeline.log"
if pipeline_log.exists():
try:
with open(pipeline_log, 'r') as f:
for line in f:
if "[TIME] Total time:" in line:
# Extract seconds from line like "[TIME] Total time: 73.0s"
parts = line.split(":")[-1].strip().replace("s", "").split("(")[0]
audit_summary["total_runtime_seconds"] = float(parts)
break
except (OSError, ValueError):
pass
# Save the summary
out_path = Path(out)
out_path.parent.mkdir(parents=True, exist_ok=True)
with open(out_path, 'w', encoding='utf-8') as f:
json.dump(audit_summary, f, indent=2)
# Output results
click.echo(f"[OK] Audit summary generated in {elapsed:.1f}s")
click.echo(f" Overall status: {audit_summary['overall_status']}")
click.echo(f" Total findings: {audit_summary['key_statistics']['total_findings']}")
click.echo(f" Critical: {severity_counts['critical']}, High: {severity_counts['high']}, Medium: {severity_counts['medium']}, Low: {severity_counts['low']}")
click.echo(f" Summary saved to: {out_path}")
return audit_summary
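
For reference, an abbreviated sketch of the `audit_summary` JSON the command writes; the shape follows the code above, while the values are invented:

```python
audit_summary = {
    "generated_at": "2025-09-07 20:39:47",
    "overall_status": "HIGH",       # CRITICAL > HIGH > MEDIUM > LOW > CLEAN
    "total_runtime_seconds": 73.0,  # parsed from pipeline.log when present
    "total_findings_by_severity": {"critical": 0, "high": 4, "medium": 9, "low": 0, "info": 7},
    "metrics_by_phase": {
        "lint": {"total_issues": 18,
                 "by_severity": {"critical": 0, "high": 2, "medium": 9, "low": 0, "info": 7}},
        "taint_analysis": {"taint_paths_found": 2, "total_vulnerabilities": 2,
                           "by_severity": {"critical": 0, "high": 2, "medium": 0, "low": 0}},
    },
    "key_statistics": {"total_findings": 20, "phases_with_findings": 2, "total_phases_run": 2},
}
```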

View File

@@ -0,0 +1,272 @@
"""Perform taint analysis to detect security vulnerabilities via data flow tracking."""
import sys
import platform
import click
from pathlib import Path
from datetime import datetime, UTC
from theauditor.utils.error_handler import handle_exceptions
# Detect if running on Windows for character encoding
IS_WINDOWS = platform.system() == "Windows"
@click.command("taint-analyze")
@handle_exceptions
@click.option("--db", default=None, help="Path to the SQLite database (default: repo_index.db)")
@click.option("--output", default="./.pf/raw/taint_analysis.json", help="Output path for analysis results")
@click.option("--max-depth", default=5, type=int, help="Maximum depth for taint propagation tracing")
@click.option("--json", is_flag=True, help="Output raw JSON instead of formatted report")
@click.option("--verbose", is_flag=True, help="Show detailed path information")
@click.option("--severity", type=click.Choice(["all", "critical", "high", "medium", "low"]),
default="all", help="Filter results by severity level")
@click.option("--rules/--no-rules", default=True, help="Enable/disable rule-based detection")
def taint_analyze(db, output, max_depth, json, verbose, severity, rules):
"""
Perform taint analysis to detect security vulnerabilities.
This command traces the flow of untrusted data from taint sources
(user inputs) to security sinks (dangerous functions) to identify
potential injection vulnerabilities and data exposure risks.
The analysis detects:
- SQL Injection
- Command Injection
- Cross-Site Scripting (XSS)
- Path Traversal
- LDAP Injection
- NoSQL Injection
Example:
aud taint-analyze
aud taint-analyze --severity critical --verbose
aud taint-analyze --json --output vulns.json
"""
from theauditor.taint_analyzer import trace_taint, save_taint_analysis, normalize_taint_path, SECURITY_SINKS
from theauditor.taint.insights import format_taint_report, calculate_severity, generate_summary, classify_vulnerability
from theauditor.config_runtime import load_runtime_config
from theauditor.rules.orchestrator import RulesOrchestrator, RuleContext
from theauditor.taint.registry import TaintRegistry
import json as json_lib
# Load configuration for default paths
config = load_runtime_config(".")
# Use default database path if not provided
if db is None:
db = config["paths"]["db"]
# Verify database exists
db_path = Path(db)
if not db_path.exists():
click.echo(f"Error: Database not found at {db}", err=True)
click.echo("Run 'aud index' first to build the repository index", err=True)
raise click.ClickException(f"Database not found: {db}")
# Check if rules are enabled
if rules:
# STAGE 1: Initialize infrastructure
click.echo("Initializing security analysis infrastructure...")
registry = TaintRegistry()
orchestrator = RulesOrchestrator(project_path=Path("."), db_path=db_path)
# Track all findings
all_findings = []
# STAGE 2: Run standalone infrastructure rules
click.echo("Running infrastructure and configuration analysis...")
infra_findings = orchestrator.run_standalone_rules()
all_findings.extend(infra_findings)
click.echo(f" Found {len(infra_findings)} infrastructure issues")
# STAGE 3: Run discovery rules to populate registry
click.echo("Discovering framework-specific patterns...")
discovery_findings = orchestrator.run_discovery_rules(registry)
all_findings.extend(discovery_findings)
stats = registry.get_stats()
click.echo(f" Registry now has {stats['total_sinks']} sinks, {stats['total_sources']} sources")
# STAGE 4: Run enriched taint analysis with registry
click.echo("Performing data-flow taint analysis...")
result = trace_taint(
db_path=str(db_path),
max_depth=max_depth,
registry=registry
)
# Extract taint paths
taint_paths = result.get("taint_paths", result.get("paths", []))
click.echo(f" Found {len(taint_paths)} taint flow vulnerabilities")
# STAGE 5: Run taint-dependent rules
click.echo("Running advanced security analysis...")
# Create taint checker from results
def taint_checker(var_name, line_num=None):
"""Check if variable is in any taint path."""
for path in taint_paths:
# Check source
if path.get("source", {}).get("name") == var_name:
return True
# Check sink
if path.get("sink", {}).get("name") == var_name:
return True
# Check intermediate steps
for step in path.get("path", []):
if isinstance(step, dict) and step.get("name") == var_name:
return True
return False
advanced_findings = orchestrator.run_taint_dependent_rules(taint_checker)
all_findings.extend(advanced_findings)
click.echo(f" Found {len(advanced_findings)} advanced security issues")
# STAGE 6: Consolidate all findings
click.echo(f"\nTotal vulnerabilities found: {len(all_findings) + len(taint_paths)}")
# Add all non-taint findings to result
result["infrastructure_issues"] = infra_findings
result["discovery_findings"] = discovery_findings
result["advanced_findings"] = advanced_findings
result["all_rule_findings"] = all_findings
# Update total count
result["total_vulnerabilities"] = len(taint_paths) + len(all_findings)
else:
# Original taint analysis without orchestrator
click.echo("Performing taint analysis (rules disabled)...")
result = trace_taint(
db_path=str(db_path),
max_depth=max_depth
)
# Enrich raw paths with interpretive insights
if result.get("success"):
# Add severity and classification to each path
enriched_paths = []
for path in result.get("taint_paths", result.get("paths", [])):
# Normalize the path first
path = normalize_taint_path(path)
# Add severity
path["severity"] = calculate_severity(path)
# Enrich sink information with vulnerability classification
path["vulnerability_type"] = classify_vulnerability(
path.get("sink", {}),
SECURITY_SINKS
)
enriched_paths.append(path)
# Update result with enriched paths
result["taint_paths"] = enriched_paths
result["paths"] = enriched_paths
# Generate summary
result["summary"] = generate_summary(enriched_paths)
# Filter by severity if requested
if severity != "all" and result.get("success"):
filtered_paths = []
for path in result.get("taint_paths", result.get("paths", [])):
# Normalize the path to ensure all keys exist
path = normalize_taint_path(path)
if path["severity"].lower() == severity or (
severity == "critical" and path["severity"].lower() == "critical"
) or (
severity == "high" and path["severity"].lower() in ["critical", "high"]
):
filtered_paths.append(path)
# Update counts
result["taint_paths"] = filtered_paths
result["paths"] = filtered_paths # Keep both keys synchronized
result["total_vulnerabilities"] = len(filtered_paths)
# Recalculate vulnerability types
from collections import defaultdict
vuln_counts = defaultdict(int)
for path in filtered_paths:
# Path is already normalized from filtering above
vuln_counts[path.get("vulnerability_type", "Unknown")] += 1
result["vulnerabilities_by_type"] = dict(vuln_counts)
# Recalculate the summary with the filtered paths (generate_summary is imported above)
result["summary"] = generate_summary(filtered_paths)
# Save COMPLETE taint analysis results to raw (including all data)
save_taint_analysis(result, output)
click.echo(f"Raw analysis results saved to: {output}")
# Output results
if json:
# JSON output for programmatic use
click.echo(json_lib.dumps(result, indent=2, sort_keys=True))
else:
# Human-readable report
report = format_taint_report(result)
click.echo(report)
# Additional verbose output
if verbose and result.get("success"):
paths = result.get("taint_paths", result.get("paths", []))
if paths and len(paths) > 10:
click.echo("\n" + "=" * 60)
click.echo("ADDITIONAL VULNERABILITY DETAILS")
click.echo("=" * 60)
for i, path in enumerate(paths[10:20], 11):
# Normalize path to ensure all keys exist
path = normalize_taint_path(path)
click.echo(f"\n{i}. {path['vulnerability_type']} ({path['severity']})")
click.echo(f" Source: {path['source']['file']}:{path['source']['line']}")
click.echo(f" Sink: {path['sink']['file']}:{path['sink']['line']}")
arrow = "->" if IS_WINDOWS else ""
click.echo(f" Pattern: {path['source'].get('pattern', '')} {arrow} {path['sink'].get('pattern', '')}") # Empty not unknown
if len(paths) > 20:
click.echo(f"\n... and {len(paths) - 20} additional vulnerabilities not shown")
# Provide actionable recommendations based on findings
if not json and result.get("success"):
summary = result.get("summary", {})
risk_level = summary.get("risk_level", "UNKNOWN")
click.echo("\n" + "=" * 60)
click.echo("RECOMMENDED ACTIONS")
click.echo("=" * 60)
if risk_level == "CRITICAL":
click.echo("[CRITICAL] CRITICAL SECURITY ISSUES DETECTED")
click.echo("1. Review and fix all CRITICAL vulnerabilities immediately")
click.echo("2. Add input validation and sanitization at all entry points")
click.echo("3. Use parameterized queries for all database operations")
click.echo("4. Implement output encoding for all user-controlled data")
click.echo("5. Consider a security audit before deployment")
elif risk_level == "HIGH":
click.echo("[HIGH] HIGH RISK VULNERABILITIES FOUND")
click.echo("1. Prioritize fixing HIGH severity issues this sprint")
click.echo("2. Review all user input handling code")
click.echo("3. Implement security middleware/filters")
click.echo("4. Add security tests for vulnerable paths")
elif risk_level == "MEDIUM":
click.echo("[MEDIUM] MODERATE SECURITY CONCERNS")
click.echo("1. Schedule vulnerability fixes for next sprint")
click.echo("2. Review and update security best practices")
click.echo("3. Add input validation where missing")
else:
click.echo("[LOW] LOW RISK PROFILE")
click.echo("1. Continue following secure coding practices")
click.echo("2. Regular security scanning recommended")
click.echo("3. Keep dependencies updated")
# Exit with appropriate code
if result.get("success"):
summary = result.get("summary", {})
if summary.get("critical_count", 0) > 0:
exit(2) # Critical vulnerabilities found
elif summary.get("high_count", 0) > 0:
exit(1) # High severity vulnerabilities found
else:
raise click.ClickException(result.get("error", "Analysis failed"))

View File

@@ -0,0 +1,25 @@
"""Detect and record tool versions."""
import click
@click.command("tool-versions")
@click.option("--out-dir", default="./.pf/audit", help="Output directory")
def tool_versions(out_dir):
"""Detect and record tool versions."""
from theauditor.tools import write_tools_report
try:
res = write_tools_report(out_dir)
click.echo(f"[OK] Tool versions written to {out_dir}/")
click.echo(" - TOOLS.md (human-readable)")
click.echo(" - tools.json (machine-readable)")
# Show summary
python_found = sum(1 for v in res["python"].values() if v != "missing")
node_found = sum(1 for v in res["node"].values() if v != "missing")
click.echo(f" - Python tools: {python_found}/4 found")
click.echo(f" - Node tools: {node_found}/3 found")
except Exception as e:
click.echo(f"Error: {e}", err=True)
raise click.ClickException(str(e)) from e

View File

@@ -0,0 +1,30 @@
"""Validate agent templates for SOP compliance."""
import click
@click.command("validate-templates")
@click.option("--source", default="./agent_templates", help="Directory containing agent templates")
@click.option("--format", type=click.Choice(["json", "text"]), default="text", help="Output format")
@click.option("--output", help="Write report to file instead of stdout")
def validate_templates(source, format, output):
"""Validate agent templates for SOP compliance."""
from theauditor.agent_template_validator import TemplateValidator
validator = TemplateValidator()
results = validator.validate_all(source)
report = validator.generate_report(results, format=format)
if output:
with open(output, 'w') as f:
f.write(report)
click.echo(f"Report written to {output}")
else:
click.echo(report)
# Exit with non-zero if violations found
if not results["valid"]:
raise click.ClickException(
f"Template validation failed: {results['total_violations']} violations found"
)

View File

@@ -0,0 +1,55 @@
"""Compute target file set from git diff and dependencies."""
import click
from theauditor.utils.error_handler import handle_exceptions
@click.command()
@handle_exceptions
@click.option("--root", default=".", help="Root directory")
@click.option("--db", default=None, help="Input SQLite database path")
@click.option("--manifest", default=None, help="Input manifest file path")
@click.option("--all", is_flag=True, help="Include all source files (ignores common directories)")
@click.option("--diff", help="Git diff range (e.g., main..HEAD)")
@click.option("--files", multiple=True, help="Explicit file list")
@click.option("--include", multiple=True, help="Include glob patterns")
@click.option("--exclude", multiple=True, help="Exclude glob patterns")
@click.option("--max-depth", default=None, type=int, help="Maximum dependency depth")
@click.option("--out", default=None, help="Output workset file path")
@click.option("--print-stats", is_flag=True, help="Print summary statistics")
def workset(root, db, manifest, all, diff, files, include, exclude, max_depth, out, print_stats):
"""Compute target file set from git diff and dependencies."""
from theauditor.workset import compute_workset
from theauditor.config_runtime import load_runtime_config
# Load configuration
config = load_runtime_config(root)
# Use config defaults if not provided
if db is None:
db = config["paths"]["db"]
if manifest is None:
manifest = config["paths"]["manifest"]
if out is None:
out = config["paths"]["workset"]
if max_depth is None:
max_depth = config["limits"]["max_graph_depth"]
result = compute_workset(
root_path=root,
db_path=db,
manifest_path=manifest,
all_files=all,
diff_spec=diff,
file_list=list(files) if files else None,
include_patterns=list(include) if include else None,
exclude_patterns=list(exclude) if exclude else None,
max_depth=max_depth,
output_path=out,
print_stats=print_stats,
)
if not print_stats:
click.echo(f"Workset written to {out}")
click.echo(f" Seed files: {result['seed_count']}")
click.echo(f" Expanded files: {result['expanded_count']}")

40
theauditor/config.py Normal file
View File

@@ -0,0 +1,40 @@
"""Configuration management for TheAuditor."""
import tomllib
from pathlib import Path
def ensure_mypy_config(pyproject_path: str) -> dict[str, str]:
"""
Ensure minimal mypy config exists in pyproject.toml.
Returns:
{"status": "created"} if config was added
{"status": "exists"} if config already present
"""
path = Path(pyproject_path)
if not path.exists():
raise FileNotFoundError(f"pyproject.toml not found at {pyproject_path}")
# Parse to check if [tool.mypy] exists
with open(path, "rb") as f:
data = tomllib.load(f)
# Check if mypy config already exists
if "tool" in data and "mypy" in data["tool"]:
return {"status": "exists"}
# Mypy config to append
mypy_block = """
[tool.mypy]
python_version = "3.12"
strict = true
warn_unused_configs = true"""
# Append to file
with open(path, "a") as f:
f.write(mypy_block)
return {"status": "created"}

View File

@@ -0,0 +1,160 @@
"""Runtime configuration for TheAuditor - centralized configuration management."""
from __future__ import annotations
import copy
import json
import os
from pathlib import Path
from typing import Any
DEFAULTS = {
"paths": {
# Core files
"manifest": "./.pf/manifest.json",
"db": "./.pf/repo_index.db",
"workset": "./.pf/workset.json",
# Directories
"pf_dir": "./.pf",
"capsules_dir": "./.pf/capsules",
"docs_dir": "./.pf/docs",
"audit_dir": "./.pf/audit",
"context_docs_dir": "./.pf/context/docs",
"doc_capsules_dir": "./.pf/context/doc_capsules",
"graphs_dir": "./.pf/graphs",
"model_dir": "./.pf/ml",
"claude_dir": "./.claude",
# Core artifacts
"journal": "./.pf/journal.ndjson",
"checkpoint": "./.pf/checkpoint.json",
"run_report": "./.pf/run_report.json",
"fce_json": "./.pf/raw/fce.json",
"ast_proofs_json": "./.pf/ast_proofs.json",
"ast_proofs_md": "./.pf/ast_proofs.md",
"ml_suggestions": "./.pf/insights/ml_suggestions.json",
"graphs_db": "./.pf/graphs.db",
"graph_analysis": "./.pf/graph_analysis.json",
"deps_json": "./.pf/deps.json",
"findings_json": "./.pf/findings.json",
"patterns_json": "./.pf/patterns.json",
"xgraph_json": "./.pf/xgraph.json",
"pattern_fce_json": "./.pf/pattern_fce.json",
"fix_suggestions_json": "./.pf/fix_suggestions.json",
"policy_yml": "./.pf/policy.yml",
},
"limits": {
# File size limits
"max_file_size": 2 * 1024 * 1024, # 2 MiB
# Chunking limits for extraction
"max_chunks_per_file": 3, # Maximum number of chunks per extracted file
"max_chunk_size": 56320, # Maximum size per chunk in bytes (55KB)
# Batch processing
"default_batch_size": 200,
"evidence_batch_size": 100,
# ML and analysis windows
"ml_window": 50,
"git_churn_window_days": 30,
# Graph analysis
"max_graph_depth": 3,
"high_risk_threshold": 0.5,
"high_risk_limit": 10,
"graph_limit_nodes": 500,
},
"timeouts": {
# Tool detection (quick checks)
"tool_detection": 5,
# Network operations
"url_fetch": 10,
"venv_check": 30,
# Build/test operations
"test_run": 60,
"venv_install": 120,
# Analysis operations
"lint_timeout": 300,
"orchestrator_timeout": 300,
# FCE and long operations
"fce_timeout": 600,
},
"report": {
"max_lint_rows": 50,
"max_ast_rows": 50,
"max_snippet_lines": 12,
"max_snippet_chars": 800,
}
}
def load_runtime_config(root: str = ".") -> dict[str, Any]:
"""
Load runtime configuration from .pf/config.json and environment variables.
Config priority (highest to lowest):
1. Environment variables (THEAUDITOR_* prefixed)
2. .pf/config.json file
3. Built-in defaults
Args:
root: Root directory to look for config file
Returns:
Configuration dictionary with merged values
"""
# Start with a deep copy of the defaults
cfg = copy.deepcopy(DEFAULTS)
# Try to load user config from .pf/config.json
path = Path(root) / ".pf" / "config.json"
try:
if path.exists():
with open(path, "r", encoding="utf-8") as f:
user = json.load(f)
# Merge each section if present
if isinstance(user, dict):
for section in ["paths", "limits", "timeouts", "report"]:
if section in user and isinstance(user[section], dict):
for key, value in user[section].items():
# Validate type matches default
if key in cfg[section]:
if isinstance(value, type(cfg[section][key])):
cfg[section][key] = value
except (json.JSONDecodeError, IOError, OSError) as e:
print(f"[WARNING] Could not load config file from {path}: {e}")
print("[INFO] Continuing with default configuration")
# Continue with defaults - config file is optional
# Environment variable overrides (flattened namespace)
# Format: THEAUDITOR_SECTION_KEY (e.g., THEAUDITOR_PATHS_MANIFEST)
for section in cfg:
for key in cfg[section]:
env_var = f"THEAUDITOR_{section.upper()}_{key.upper()}"
if env_var in os.environ:
value = os.environ[env_var]
try:
# Try to cast to the same type as the default
default_value = cfg[section][key]
if isinstance(default_value, int):
cfg[section][key] = int(value)
elif isinstance(default_value, float):
cfg[section][key] = float(value)
elif isinstance(default_value, list):
# Parse comma-separated values for lists
cfg[section][key] = [v.strip() for v in value.split(",")]
else:
cfg[section][key] = value
except (ValueError, AttributeError) as e:
print(f"[WARNING] Invalid value for environment variable {env_var}: '{value}' - {e}")
print(f"[INFO] Using default value: {cfg[section][key]}")
# Continue with default value - env vars are optional overrides
return cfg
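
A short sketch of the flattened `THEAUDITOR_SECTION_KEY` override format described above; the paths and values are hypothetical:

```python
import os
from theauditor.config_runtime import load_runtime_config

# Override a path and a limit via environment variables.
os.environ["THEAUDITOR_PATHS_DB"] = "/tmp/custom_index.db"
os.environ["THEAUDITOR_LIMITS_MAX_GRAPH_DEPTH"] = "5"

cfg = load_runtime_config(".")
assert cfg["paths"]["db"] == "/tmp/custom_index.db"
assert cfg["limits"]["max_graph_depth"] == 5  # cast to int from the string "5"
```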

View File

@@ -0,0 +1,5 @@
"""Correlation rules for the Factual Correlation Engine."""
from .loader import CorrelationLoader, CorrelationRule
__all__ = ["CorrelationLoader", "CorrelationRule"]

View File

@@ -0,0 +1,237 @@
"""Correlation rule loader for the Factual Correlation Engine."""
import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional
import yaml
@dataclass
class CorrelationRule:
"""Represents a single correlation rule for factual co-occurrence detection."""
name: str
co_occurring_facts: List[Dict[str, str]]
description: Optional[str] = None
confidence: float = 0.8
compiled_patterns: List[Dict[str, Any]] = field(default_factory=list, init=False, repr=False)
def __post_init__(self):
"""Compile regex patterns in co-occurring facts after initialization."""
for fact in self.co_occurring_facts:
if 'tool' not in fact or 'pattern' not in fact:
raise ValueError(f"Invalid fact in rule '{self.name}': must contain 'tool' and 'pattern' keys")
compiled_fact = {
'tool': fact['tool'],
'pattern': fact['pattern']
}
# Try to compile as regex, if it fails, treat as literal string
try:
compiled_fact['compiled_regex'] = re.compile(fact['pattern'], re.IGNORECASE)
compiled_fact['is_regex'] = True
except re.error:
# Not a valid regex, will be used as literal string match
compiled_fact['is_regex'] = False
self.compiled_patterns.append(compiled_fact)
def matches_finding(self, finding: Dict[str, Any], fact_index: int) -> bool:
"""Check if a finding matches a specific fact pattern.
Args:
finding: Dictionary containing finding data with 'tool' and 'rule' keys
fact_index: Index of the fact pattern to check
Returns:
True if the finding matches the specified fact pattern
"""
if fact_index >= len(self.compiled_patterns):
return False
fact = self.compiled_patterns[fact_index]
# Check tool match
if finding.get('tool') != fact['tool']:
return False
# Check pattern match against rule or message
if fact['is_regex']:
# Check against rule field and message field
rule_match = fact['compiled_regex'].search(finding.get('rule', ''))
message_match = fact['compiled_regex'].search(finding.get('message', ''))
return bool(rule_match or message_match)
else:
# Literal string match
return (fact['pattern'] in finding.get('rule', '') or
fact['pattern'] in finding.get('message', ''))
class CorrelationLoader:
"""Loads and manages correlation rules from YAML files."""
def __init__(self, rules_dir: Optional[Path] = None):
"""Initialize correlation loader.
Args:
rules_dir: Directory containing correlation rule YAML files.
Defaults to theauditor/correlations/rules/
"""
if rules_dir is None:
rules_dir = Path(__file__).parent / "rules"
self.rules_dir = Path(rules_dir)
self.rules: List[CorrelationRule] = []
self._loaded = False
def load_rules(self) -> List[CorrelationRule]:
"""Load correlation rules from YAML files.
Returns:
List of CorrelationRule objects.
Raises:
FileNotFoundError: If the rules directory doesn't exist.
"""
if not self.rules_dir.exists():
# Create directory if it doesn't exist, but return empty list
self.rules_dir.mkdir(parents=True, exist_ok=True)
self._loaded = True
return self.rules
yaml_files = list(self.rules_dir.glob("*.yml")) + list(self.rules_dir.glob("*.yaml"))
# Clear existing rules before loading
self.rules = []
for yaml_file in yaml_files:
try:
rules = self._load_yaml_file(yaml_file)
self.rules.extend(rules)
except Exception as e:
# Log warning but continue loading other files
print(f"Warning: Failed to load correlation rules from {yaml_file}: {e}")
self._loaded = True
return self.rules
def _load_yaml_file(self, file_path: Path) -> List[CorrelationRule]:
"""Load correlation rules from a single YAML file.
Args:
file_path: Path to YAML file.
Returns:
List of CorrelationRule objects.
Raises:
ValueError: If the file format is invalid.
"""
with open(file_path, 'r', encoding='utf-8') as f:
data = yaml.safe_load(f)
if not isinstance(data, dict):
raise ValueError(f"Invalid rule file format in {file_path}: expected dictionary at root")
rules = []
# Support both single rule and multiple rules formats
if 'rules' in data:
# Multiple rules format
rule_list = data['rules']
if not isinstance(rule_list, list):
raise ValueError(f"Invalid rule file format in {file_path}: 'rules' must be a list")
for rule_data in rule_list:
try:
rule = self._parse_rule(rule_data)
rules.append(rule)
except (KeyError, ValueError) as e:
print(f"Warning: Skipping invalid rule in {file_path}: {e}")
elif 'name' in data and 'co_occurring_facts' in data:
# Single rule format
try:
rule = self._parse_rule(data)
rules.append(rule)
except (KeyError, ValueError) as e:
print(f"Warning: Skipping invalid rule in {file_path}: {e}")
else:
raise ValueError(f"Invalid rule file format in {file_path}: must contain 'rules' list or single rule with 'name' and 'co_occurring_facts'")
return rules
def _parse_rule(self, rule_data: Dict[str, Any]) -> CorrelationRule:
"""Parse a single rule from dictionary data.
Args:
rule_data: Dictionary containing rule data.
Returns:
CorrelationRule object.
Raises:
KeyError: If required fields are missing.
ValueError: If data format is invalid.
"""
if 'name' not in rule_data:
raise KeyError("Rule must have a 'name' field")
if 'co_occurring_facts' not in rule_data:
raise KeyError("Rule must have a 'co_occurring_facts' field")
if not isinstance(rule_data['co_occurring_facts'], list):
raise ValueError("'co_occurring_facts' must be a list")
if len(rule_data['co_occurring_facts']) == 0:
raise ValueError("'co_occurring_facts' must not be empty")
return CorrelationRule(
name=rule_data['name'],
co_occurring_facts=rule_data['co_occurring_facts'],
description=rule_data.get('description'),
confidence=rule_data.get('confidence', 0.8)
)
def get_all_rules(self) -> List[CorrelationRule]:
"""Get all loaded correlation rules.
Returns:
List of all loaded CorrelationRule objects.
"""
if not self._loaded:
self.load_rules()
return self.rules
def validate_rules(self) -> List[str]:
"""Validate all loaded correlation rules.
Returns:
List of validation error messages.
"""
if not self._loaded:
self.load_rules()
errors = []
# Check for duplicate rule names
names = [rule.name for rule in self.rules]
for name in sorted(set(names)):
if names.count(name) > 1:
errors.append(f"Duplicate rule name: {name}")
# Validate each rule
for rule in self.rules:
# Check that each rule has at least 2 co-occurring facts
if len(rule.co_occurring_facts) < 2:
errors.append(f"Rule '{rule.name}' has fewer than 2 co-occurring facts")
# Check confidence is between 0 and 1
if not 0 <= rule.confidence <= 1:
errors.append(f"Rule '{rule.name}' has invalid confidence value: {rule.confidence}")
return errors
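
A minimal sketch of loading and matching the rules defined in the YAML files below; the example finding is invented, and given the shipped rules it would likely match the API key cluster's first fact:

```python
from theauditor.correlations import CorrelationLoader

loader = CorrelationLoader()      # defaults to theauditor/correlations/rules/
rules = loader.load_rules()
for err in loader.validate_rules():
    print(f"rule problem: {err}")

# A finding matches a fact when the tool names agree and the fact's
# pattern (regex or literal) hits the finding's rule or message field.
finding = {"tool": "patterns", "rule": "api_key_hardcoded", "message": ""}
hits = [r.name for r in rules if r.matches_finding(finding, 0)]
print(hits)
```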

View File

@@ -0,0 +1,10 @@
name: "Angular Sanitization Bypass Factual Cluster"
description: "Multiple tools detected patterns consistent with XSS via sanitization bypass in Angular."
confidence: 0.95
co_occurring_facts:
- tool: "framework_detector"
pattern: "angular"
- tool: "patterns"
pattern: "bypassSecurity"
- tool: "taint_analyzer"
pattern: "trust"

View File

@@ -0,0 +1,10 @@
name: "API Key Exposure Factual Cluster"
description: "Multiple tools detected patterns consistent with a hardcoded or exposed API key."
confidence: 0.95
co_occurring_facts:
- tool: "patterns"
pattern: "api_key"
- tool: "ast"
pattern: "hardcoded"
- tool: "git"
pattern: "committed"

View File

@@ -0,0 +1,10 @@
name: "Command Injection Factual Cluster"
description: "Multiple tools detected patterns consistent with a Command Injection vulnerability."
confidence: 0.95
co_occurring_facts:
- tool: "taint_analyzer"
pattern: "command"
- tool: "patterns"
pattern: "(exec|subprocess|shell)"
- tool: "lint"
pattern: "subprocess"

View File

@@ -0,0 +1,10 @@
name: "Container Escape Factual Cluster"
description: "Multiple tools detected patterns consistent with a container escape vulnerability."
confidence: 0.90
co_occurring_facts:
- tool: "deployment"
pattern: "privileged"
- tool: "patterns"
pattern: "docker"
- tool: "security"
pattern: "cap_sys_admin"

View File

@@ -0,0 +1,10 @@
name: "CORS Misconfiguration Factual Cluster"
description: "Multiple tools detected patterns consistent with a CORS misconfiguration."
confidence: 0.90
co_occurring_facts:
- tool: "patterns"
pattern: "Access-Control"
- tool: "security"
pattern: "wildcard"
- tool: "framework_detector"
pattern: "cors"

View File

@@ -0,0 +1,10 @@
name: "Deadlock Factual Cluster"
description: "Multiple tools detected patterns consistent with a potential deadlock."
confidence: 0.85
co_occurring_facts:
- tool: "graph"
pattern: "mutex"
- tool: "patterns"
pattern: "lock"
- tool: "taint_analyzer"
pattern: "circular"

View File

@@ -0,0 +1,10 @@
name: "Debug Mode Enabled Factual Cluster"
description: "Multiple tools detected patterns consistent with debug mode being enabled in a production environment."
confidence: 0.95
co_occurring_facts:
- tool: "patterns"
pattern: "DEBUG=true"
- tool: "framework_detector"
pattern: "production"
- tool: "deployment"
pattern: "exposed"

View File

@@ -0,0 +1,10 @@
name: "Express Body-Parser Factual Cluster"
description: "Multiple tools detected patterns consistent with insecure body-parser configuration in Express."
confidence: 0.75
co_occurring_facts:
- tool: "framework_detector"
pattern: "express"
- tool: "patterns"
pattern: "body-parser"
- tool: "security"
pattern: "no_limit"

View File

@@ -0,0 +1,10 @@
name: "Infinite Loop Factual Cluster"
description: "Multiple tools detected patterns consistent with a potential infinite loop."
confidence: 0.80
co_occurring_facts:
- tool: "graph"
pattern: "cycle"
- tool: "patterns"
pattern: "while\\(true\\)"
- tool: "ast"
pattern: "no_break"

View File

@@ -0,0 +1,10 @@
name: "JWT Issues Factual Cluster"
description: "Multiple tools detected patterns consistent with insecure JWT implementation."
confidence: 0.90
co_occurring_facts:
- tool: "patterns"
pattern: "jwt"
- tool: "security"
pattern: "HS256"
- tool: "lint"
pattern: "jwt"

View File

@@ -0,0 +1,10 @@
name: "LDAP Injection Factual Cluster"
description: "Multiple tools detected patterns consistent with an LDAP Injection vulnerability."
confidence: 0.85
co_occurring_facts:
- tool: "taint_analyzer"
pattern: "ldap"
- tool: "patterns"
pattern: "filter"
- tool: "lint"
pattern: "ldap"

View File

@@ -0,0 +1,10 @@
name: "Memory Leak Factual Cluster"
description: "Multiple tools detected patterns consistent with a potential memory leak."
confidence: 0.70
co_occurring_facts:
- tool: "patterns"
pattern: "setInterval"
- tool: "graph"
pattern: "no_cleanup"
- tool: "lint"
pattern: "memory"

View File

@@ -0,0 +1,10 @@
name: "Missing Authentication Factual Cluster"
description: "Multiple tools detected patterns consistent with a missing authentication control on a sensitive endpoint."
confidence: 0.80
co_occurring_facts:
- tool: "patterns"
pattern: "public"
- tool: "framework_detector"
pattern: "no_auth"
- tool: "graph"
pattern: "exposed"

View File

@@ -0,0 +1,10 @@
name: "NoSQL Injection Factual Cluster"
description: "Multiple tools detected patterns consistent with a NoSQL Injection vulnerability."
confidence: 0.85
co_occurring_facts:
- tool: "patterns"
pattern: "(mongodb|mongoose)"
- tool: "taint_analyzer"
pattern: "$where"
- tool: "lint"
pattern: "nosql"

View File

@@ -0,0 +1,10 @@
name: "Path Traversal Factual Cluster"
description: "Multiple tools detected patterns consistent with a Path Traversal vulnerability."
confidence: 0.85
co_occurring_facts:
- tool: "taint_analyzer"
pattern: "path"
- tool: "patterns"
pattern: "\\.\\./"
- tool: "lint"
pattern: "path"

View File

@@ -0,0 +1,10 @@
name: "PII Leak Factual Cluster"
description: "Multiple tools detected patterns consistent with a potential leak of Personally Identifiable Information (PII)."
confidence: 0.80
co_occurring_facts:
- tool: "patterns"
pattern: "(email|ssn)"
- tool: "taint_analyzer"
pattern: "response"
- tool: "framework_detector"
pattern: "no_mask"

View File

@@ -0,0 +1,10 @@
name: "Race Condition Factual Cluster"
description: "Multiple tools detected patterns consistent with a potential race condition."
confidence: 0.75
co_occurring_facts:
- tool: "graph"
pattern: "concurrent"
- tool: "patterns"
pattern: "async"
- tool: "taint_analyzer"
pattern: "shared_state"

View File

@@ -0,0 +1,10 @@
name: "Missing Rate Limiting Factual Cluster"
description: "Multiple tools detected patterns consistent with a sensitive endpoint lacking rate limiting."
confidence: 0.85
co_occurring_facts:
- tool: "patterns"
pattern: "endpoint"
- tool: "framework_detector"
pattern: "no_throttle"
- tool: "deployment"
pattern: "public"

View File

@@ -0,0 +1,10 @@
name: "React dangerouslySetInnerHTML Factual Cluster"
description: "Multiple tools detected patterns consistent with XSS via dangerouslySetInnerHTML in React."
confidence: 0.95
co_occurring_facts:
- tool: "framework_detector"
pattern: "react"
- tool: "patterns"
pattern: "dangerously"
- tool: "taint_analyzer"
pattern: "user"

View File

@@ -0,0 +1,277 @@
# Refactoring Detection Correlation Rules
# These rules detect common refactoring issues and inconsistencies
rules:
# ============================================================================
# DATA MODEL REFACTORING PATTERNS
# ============================================================================
- name: "FIELD_MOVED_BETWEEN_MODELS"
description: "Field moved from one model to another but old references remain"
co_occurring_facts:
- tool: "grep"
pattern: "removeColumn.*('products'|\"products\")"
- tool: "grep"
pattern: "product\\.(unit_price|retail_price|wholesale_price|sku|inventory_type)"
confidence: 0.95
- name: "PRODUCT_VARIANT_REFACTOR"
description: "Product fields moved to ProductVariant but frontend still uses old structure"
co_occurring_facts:
- tool: "grep"
pattern: "ProductVariant.*retail_price.*Sequelize"
- tool: "grep"
pattern: "product\\.unit_price|product\\.retail_price"
confidence: 0.92
- name: "SKU_FIELD_MIGRATION"
description: "SKU moved from Product to ProductVariant"
co_occurring_facts:
- tool: "grep"
pattern: "ProductVariant.*sku.*unique.*true"
- tool: "grep"
pattern: "product\\.sku|WHERE.*products\\.sku"
confidence: 0.94
# ============================================================================
# FOREIGN KEY REFACTORING
# ============================================================================
- name: "ORDER_ITEMS_WRONG_FK"
description: "Order items using product_id instead of product_variant_id"
co_occurring_facts:
- tool: "grep"
pattern: "order_items.*product_variant_id.*fkey"
- tool: "grep"
pattern: "order_items.*product_id(?!_variant)"
confidence: 0.96
- name: "TRANSFER_ITEMS_WRONG_FK"
description: "Transfer items referencing wrong product foreign key"
co_occurring_facts:
- tool: "grep"
pattern: "transfer_items.*product_variant_id"
- tool: "grep"
pattern: "transfer.*product_id(?!_variant)"
confidence: 0.93
- name: "INVENTORY_FK_MISMATCH"
description: "Inventory table has both product_id and product_variant_id"
co_occurring_facts:
- tool: "grep"
pattern: "inventory.*product_variant_id.*NULL"
- tool: "grep"
pattern: "inventory.*product_id.*NOT NULL"
confidence: 0.88
# ============================================================================
# API CONTRACT CHANGES
# ============================================================================
- name: "API_ENDPOINT_REMOVED"
description: "Frontend calling API endpoints that no longer exist"
co_occurring_facts:
- tool: "grep"
pattern: "/api/products/.*/price|/api/products/.*/sku"
- tool: "grep"
pattern: "router\\.(get|post).*'/variants'"
confidence: 0.90
- name: "API_RESPONSE_STRUCTURE_CHANGED"
description: "API response structure changed but frontend expects old format"
co_occurring_facts:
- tool: "grep"
pattern: "res\\.json.*variants.*product"
- tool: "grep"
pattern: "response\\.data\\.product\\.price"
confidence: 0.87
- name: "GRAPHQL_SCHEMA_MISMATCH"
description: "GraphQL schema doesn't match model structure"
co_occurring_facts:
- tool: "grep"
pattern: "type Product.*price.*Float"
- tool: "grep"
pattern: "Product\\.init.*!.*price"
confidence: 0.85
# ============================================================================
# FRONTEND-BACKEND MISMATCHES
# ============================================================================
- name: "TYPESCRIPT_INTERFACE_OUTDATED"
description: "TypeScript interfaces don't match backend models"
co_occurring_facts:
- tool: "grep"
pattern: "interface.*Product.*unit_price.*number"
- tool: "grep"
pattern: "Product\\.init.*!.*unit_price"
confidence: 0.96
- name: "FRONTEND_NESTED_STRUCTURE"
description: "Frontend expects nested relationships that backend doesn't provide"
co_occurring_facts:
- tool: "grep"
pattern: "product_variant\\.product\\.(name|brand)"
- tool: "grep"
pattern: "ProductVariant.*belongsTo.*Product"
confidence: 0.91
- name: "CART_WRONG_ID_FIELD"
description: "Shopping cart using product_id instead of product_variant_id"
co_occurring_facts:
- tool: "grep"
pattern: "OrderItem.*product_variant_id.*required"
- tool: "grep"
pattern: "addToCart.*product_id|cart.*product_id"
confidence: 0.93
# ============================================================================
# MIGRATION PATTERNS
# ============================================================================
- name: "INCOMPLETE_MIGRATION"
description: "Database migration incomplete - old column references remain"
co_occurring_facts:
- tool: "grep"
pattern: "removeColumn|dropColumn"
- tool: "grep"
pattern: "SELECT.*FROM.*WHERE.*{removed_column}"
confidence: 0.89
- name: "MIGRATION_DATA_LOSS"
description: "Migration drops column without data migration"
co_occurring_facts:
- tool: "grep"
pattern: "removeColumn.*CASCADE|dropColumn.*CASCADE"
- tool: "grep"
pattern: "!.*UPDATE.*SET.*before.*removeColumn"
confidence: 0.86
- name: "ENUM_TYPE_CHANGED"
description: "ENUM values changed but code still uses old values"
co_occurring_facts:
- tool: "grep"
pattern: "DROP TYPE.*enum_products"
- tool: "grep"
pattern: "inventory_type.*=.*'both'|inventory_type.*weight|unit|both"
confidence: 0.84
# ============================================================================
# AUTHORIZATION CHANGES
# ============================================================================
- name: "MISSING_AUTH_MIDDLEWARE"
description: "New routes missing authentication/authorization"
co_occurring_facts:
- tool: "grep"
pattern: "router\\.(post|put|delete).*variant"
- tool: "grep"
pattern: "!.*requireAdmin.*productVariant\\.routes"
confidence: 0.92
- name: "PERMISSION_MODEL_CHANGED"
description: "Permission model changed but checks not updated"
co_occurring_facts:
- tool: "grep"
pattern: "role.*admin|worker"
- tool: "grep"
pattern: "req\\.user\\.permissions|can\\("
confidence: 0.80
# ============================================================================
# VALIDATION CHANGES
# ============================================================================
- name: "VALIDATION_SCHEMA_OUTDATED"
description: "Joi/Yup validation schema doesn't match model"
co_occurring_facts:
- tool: "grep"
pattern: "Joi\\.object.*product.*unit_price"
- tool: "grep"
pattern: "!.*Product.*unit_price"
confidence: 0.88
- name: "REQUIRED_FIELD_MISMATCH"
description: "Required fields in validation don't match database constraints"
co_occurring_facts:
- tool: "grep"
pattern: "allowNull.*false.*sku"
- tool: "grep"
pattern: "sku.*Joi\\..*optional\\(\\)"
confidence: 0.85
# ============================================================================
# SERVICE LAYER ISSUES
# ============================================================================
- name: "SERVICE_METHOD_SIGNATURE_CHANGED"
description: "Service method signature changed but callers not updated"
co_occurring_facts:
- tool: "grep"
pattern: "async.*create.*product.*variant"
- tool: "grep"
pattern: "productService\\.create\\(.*price"
confidence: 0.87
- name: "REPOSITORY_PATTERN_MISMATCH"
description: "Repository methods don't match new model structure"
co_occurring_facts:
- tool: "grep"
pattern: "findOne.*where.*sku"
- tool: "grep"
pattern: "ProductVariant.*sku"
confidence: 0.83
# ============================================================================
# TESTING ISSUES
# ============================================================================
- name: "TEST_FIXTURES_OUTDATED"
description: "Test fixtures using old model structure"
co_occurring_facts:
- tool: "grep"
pattern: "test.*product.*unit_price"
- tool: "grep"
pattern: "ProductVariant.*retail_price"
confidence: 0.82
- name: "MOCK_DATA_MISMATCH"
description: "Mock data doesn't match actual model structure"
co_occurring_facts:
- tool: "grep"
pattern: "mock.*product.*price"
- tool: "grep"
pattern: "!.*Product.*price"
confidence: 0.79
# ============================================================================
# COMMON REFACTORING ANTI-PATTERNS
# ============================================================================
- name: "EXTRACT_VARIANT_PATTERN"
description: "Classic Extract Variant refactoring with incomplete updates"
co_occurring_facts:
- tool: "grep"
pattern: "createTable.*variants"
- tool: "grep"
pattern: "product\\.(price|sku|inventory)"
confidence: 0.94
- name: "NORMALIZE_HIERARCHY"
description: "Hierarchy normalization with missing relationship updates"
co_occurring_facts:
- tool: "grep"
pattern: "belongsTo.*hasMany.*through"
- tool: "grep"
pattern: "JOIN.*old_table"
confidence: 0.86
- name: "SPLIT_TABLE_INCOMPLETE"
description: "Table split into multiple tables but queries not updated"
co_occurring_facts:
- tool: "grep"
pattern: "createTable.*_details|_metadata"
- tool: "grep"
pattern: "SELECT.*FROM.*{original_table}.*WHERE"
confidence: 0.88

View File

@@ -0,0 +1,10 @@
name: "Sensitive Data in Logs Factual Cluster"
description: "Multiple tools detected patterns consistent with sensitive data being written to logs."
confidence: 0.85
co_occurring_facts:
- tool: "patterns"
pattern: "console.log"
- tool: "taint_analyzer"
pattern: "password"
- tool: "lint"
pattern: "logging"

View File

@@ -0,0 +1,10 @@
name: "Session Fixation Factual Cluster"
description: "Multiple tools detected patterns consistent with a Session Fixation vulnerability."
confidence: 0.75
co_occurring_facts:
- tool: "patterns"
pattern: "session"
- tool: "taint_analyzer"
pattern: "user_controlled"
- tool: "framework_detector"
pattern: "session"

View File

@@ -0,0 +1,10 @@
name: "Source Map Exposure Factual Cluster"
description: "Multiple tools detected patterns consistent with exposed source maps in a production environment."
confidence: 0.95
co_occurring_facts:
- tool: "build"
pattern: "sourcemap"
- tool: "deployment"
pattern: "production"
- tool: "patterns"
pattern: "\\.map"

View File

@@ -0,0 +1,10 @@
name: "SSRF Factual Cluster"
description: "Multiple tools detected patterns consistent with a Server-Side Request Forgery (SSRF) vulnerability."
confidence: 0.80
co_occurring_facts:
- tool: "taint_analyzer"
pattern: "url"
- tool: "patterns"
pattern: "(request|fetch|urllib)"
- tool: "lint"
pattern: "urllib"

View File

@@ -0,0 +1,10 @@
name: "Template Injection Factual Cluster"
description: "Multiple tools detected patterns consistent with a Server-Side Template Injection (SSTI) vulnerability."
confidence: 0.80
co_occurring_facts:
- tool: "taint_analyzer"
pattern: "template"
- tool: "patterns"
pattern: "eval"
- tool: "framework_detector"
pattern: "(jinja|blade|pug)"

View File

@@ -0,0 +1,10 @@
name: "Potential SQL Injection Factual Cluster"
description: "Multiple tools detected patterns consistent with SQL injection vulnerability"
confidence: 0.85
co_occurring_facts:
- tool: "taint_analyzer"
pattern: "sql"
- tool: "patterns"
pattern: "string.*query"
- tool: "lint"
pattern: "sql"

View File

@@ -0,0 +1,10 @@
name: "Vue v-html Factual Cluster"
description: "Multiple tools detected patterns consistent with XSS via v-html in Vue."
confidence: 0.95
co_occurring_facts:
- tool: "framework_detector"
pattern: "vue"
- tool: "patterns"
pattern: "v-html"
- tool: "taint_analyzer"
pattern: "user_input"

View File

@@ -0,0 +1,10 @@
name: "Weak Authentication Factual Cluster"
description: "Multiple tools detected patterns consistent with weak or deprecated authentication mechanisms."
confidence: 0.85
co_occurring_facts:
- tool: "patterns"
pattern: "(md5|sha1)"
- tool: "security"
pattern: "password"
- tool: "lint"
pattern: "deprecated"

View File

@@ -0,0 +1,10 @@
name: "XSS Factual Cluster"
description: "Multiple tools detected patterns consistent with a Cross-Site Scripting (XSS) vulnerability."
confidence: 0.90
co_occurring_facts:
- tool: "taint_analyzer"
pattern: "xss"
- tool: "patterns"
pattern: "(innerHTML|dangerouslySetInnerHTML)"
- tool: "lint"
pattern: "xss"

View File

@@ -0,0 +1,10 @@
name: "XXE Factual Cluster"
description: "Multiple tools detected patterns consistent with an XML External Entity (XXE) vulnerability."
confidence: 0.80
co_occurring_facts:
- tool: "patterns"
pattern: "xml"
- tool: "taint_analyzer"
pattern: "parse"
- tool: "framework_detector"
pattern: "xml_parser"

1109
theauditor/deps.py Normal file

File diff suppressed because it is too large

565
theauditor/docgen.py Normal file
View File

@@ -0,0 +1,565 @@
"""Documentation generator from index and capsules (optional feature)."""
import hashlib
import json
import platform
import sqlite3
import sys
from collections import defaultdict
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from theauditor import __version__
def is_source_file(file_path: str) -> bool:
"""Check if a file is a source code file (not test, config, or docs)."""
path = Path(file_path)
# Skip test files and directories
if any(part in ['test', 'tests', '__tests__', 'spec', 'fixtures', 'fixture_repo', 'test_scaffold'] for part in path.parts):
return False
if path.name.startswith('test_') or path.name.endswith('_test.py') or '.test.' in path.name or '.spec.' in path.name:
return False
if 'test' in str(path).lower() and any(ext in str(path).lower() for ext in ['.spec.', '_test.', 'test_']):
return False
# Skip documentation
if path.suffix.lower() in ['.md', '.rst', '.txt']:
return False
# Skip configuration files
config_files = {
'.gitignore', '.gitattributes', '.editorconfig',
'pyproject.toml', 'setup.py', 'setup.cfg',
'package.json', 'package-lock.json', 'yarn.lock',
'package-template.json', 'tsconfig.json',
'Makefile', 'makefile', 'requirements.txt',
'Dockerfile', 'docker-compose.yml', '.dockerignore',
'manifest.json', 'repo_index.db'
}
if path.name.lower() in config_files:
return False
# Skip build artifacts and caches
skip_dirs = {'docs', 'documentation', 'examples', 'samples', 'schemas', 'agent_templates'}
if any(part.lower() in skip_dirs for part in path.parts):
return False
return True
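# Illustrative outcomes of the filters above (not exhaustive):
#   is_source_file("theauditor/cli.py")  -> True
#   is_source_file("tests/test_cli.py")  -> False (test directory and prefix)
#   is_source_file("docs/guide.md")      -> False (documentation suffix)
#   is_source_file("package.json")       -> False (known config file)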
def load_manifest(manifest_path: str) -> tuple[list[dict], str]:
"""Load manifest and compute its hash."""
with open(manifest_path, "rb") as f:
content = f.read()
manifest_hash = hashlib.sha256(content).hexdigest()
manifest = json.loads(content)
return manifest, manifest_hash
def load_workset(workset_path: str) -> set[str]:
"""Load workset file paths."""
if not Path(workset_path).exists():
return set()
with open(workset_path) as f:
workset = json.load(f)
return {p["path"] for p in workset.get("paths", [])}
def load_capsules(capsules_dir: str, workset_paths: set[str] | None = None) -> list[dict]:
"""Load capsules, optionally filtered by workset."""
capsules = []
capsules_path = Path(capsules_dir)
if not capsules_path.exists():
raise RuntimeError(f"Capsules directory not found: {capsules_dir}")
for capsule_file in sorted(capsules_path.glob("*.json")):
with open(capsule_file) as f:
capsule = json.load(f)
# Filter by workset if provided
if workset_paths is None or capsule.get("path") in workset_paths:
# Filter out non-source files
if is_source_file(capsule.get("path", "")):
capsules.append(capsule)
return capsules
def get_routes(db_path: str, workset_paths: set[str] | None = None) -> list[dict]:
"""Get routes from database, excluding test files."""
if not Path(db_path).exists():
return []
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
if workset_paths:
placeholders = ",".join("?" * len(workset_paths))
query = f"""
SELECT method, pattern, file
FROM api_endpoints
WHERE file IN ({placeholders})
ORDER BY file, pattern
"""
cursor.execute(query, tuple(workset_paths))
else:
cursor.execute(
"""
SELECT method, pattern, file
FROM api_endpoints
ORDER BY file, pattern
"""
)
routes = []
for row in cursor.fetchall():
# Filter out test files
if is_source_file(row[2]):
routes.append({"method": row[0], "pattern": row[1], "file": row[2]})
conn.close()
return routes
def get_sql_objects(db_path: str, workset_paths: set[str] | None = None) -> list[dict]:
"""Get SQL objects from database, excluding test files."""
if not Path(db_path).exists():
return []
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
if workset_paths:
placeholders = ",".join("?" * len(workset_paths))
query = f"""
SELECT kind, name, file
FROM sql_objects
WHERE file IN ({placeholders})
ORDER BY kind, name
"""
cursor.execute(query, tuple(workset_paths))
else:
cursor.execute(
"""
SELECT kind, name, file
FROM sql_objects
ORDER BY kind, name
"""
)
objects = []
for row in cursor.fetchall():
# Filter out test files
if is_source_file(row[2]):
objects.append({"kind": row[0], "name": row[1], "file": row[2]})
conn.close()
return objects
def group_files_by_folder(capsules: list[dict]) -> dict[str, list[dict]]:
"""Group files by their first directory segment."""
groups = defaultdict(list)
for capsule in capsules:
path = capsule.get("path", "")
if "/" in path:
folder = path.split("/")[0]
else:
folder = "."
groups[folder].append(capsule)
# Sort by folder name
return dict(sorted(groups.items()))
def generate_architecture_md(
routes: list[dict],
sql_objects: list[dict],
capsules: list[dict],
scope: str,
) -> str:
"""Generate ARCHITECTURE.md content."""
now = datetime.now(UTC).isoformat()
content = [
"# Architecture",
f"Generated at: {now}",
"",
"## Scope",
f"Mode: {scope}",
"",
]
# Routes table
if routes:
content.extend(
[
"## Routes",
"",
"| Method | Pattern | File |",
"|--------|---------|------|",
]
)
for route in routes:
content.append(f"| {route['method']} | {route['pattern']} | {route['file']} |")
content.append("")
# SQL Objects table
if sql_objects:
content.extend(
[
"## SQL Objects",
"",
"| Kind | Name | File |",
"|------|------|------|",
]
)
for obj in sql_objects:
content.append(f"| {obj['kind']} | {obj['name']} | {obj['file']} |")
content.append("")
# Core Modules (group by actual functionality)
groups = group_files_by_folder(capsules)
if groups:
content.extend(
[
"## Core Modules",
"",
]
)
# Filter and organize by purpose
module_categories = {
"Core CLI": {},
"Analysis & Detection": {},
"Code Generation": {},
"Reporting": {},
"Utilities": {},
}
for folder, folder_capsules in groups.items():
if folder == "theauditor":
for capsule in folder_capsules:
path = Path(capsule.get("path", ""))
name = path.stem
# Skip duplicates and internal modules
if name in ['__init__', 'parsers'] or path.name.endswith('.py.tpl'): # path.stem keeps ".py" from ".py.tpl", so check path.name
continue
exports = capsule.get("interfaces", {}).get("exports", [])
functions = capsule.get("interfaces", {}).get("functions", [])
classes = capsule.get("interfaces", {}).get("classes", [])
# Categorize based on filename
if name in ['cli', 'orchestrator', 'config', 'config_runtime']:
category = "Core CLI"
elif name in ['lint', 'ast_verify', 'universal_detector', 'pattern_loader', 'flow_analyzer', 'risk_scorer', 'pattern_rca', 'xgraph_analyzer']:
category = "Analysis & Detection"
elif name in ['scaffolder', 'test_generator', 'claude_setup', 'claude_autogen', 'venv_install']:
category = "Code Generation"
elif name in ['report', 'capsules', 'docgen', 'journal_view']:
category = "Reporting"
else:
# Skip certain utility files from main display
if name in ['utils', 'evidence', 'runner', 'contracts', 'tools']:
continue
category = "Utilities"
# Build summary (only add if not already present)
if name not in module_categories[category]:
summary_parts = []
if classes:
summary_parts.append(f"Classes: {', '.join(classes[:3])}")
elif functions:
summary_parts.append(f"Functions: {', '.join(functions[:3])}")
elif exports:
summary_parts.append(f"Exports: {', '.join(exports[:3])}")
summary = " | ".join(summary_parts) if summary_parts else "Utility module"
module_categories[category][name] = f"- **{name}**: {summary}"
# Output categorized modules
for category, modules_dict in module_categories.items():
if modules_dict:
content.append(f"### {category}")
# Sort modules by name and get their descriptions
for name in sorted(modules_dict.keys()):
content.append(modules_dict[name])
content.append("")
return "\n".join(content)
def generate_features_md(capsules: list[dict]) -> str:
"""Generate FEATURES.md content with meaningful feature descriptions."""
content = [
"# Features & Capabilities",
"",
"## Core Functionality",
"",
]
# Analyze capsules to extract features
features = {
"Code Analysis": [],
"Test Generation": [],
"Documentation": [],
"CI/CD Integration": [],
"ML Capabilities": [],
}
cli_commands = set()
for capsule in capsules:
path = Path(capsule.get("path", ""))
if path.parent.name != "theauditor":
continue
name = path.stem
exports = capsule.get("interfaces", {}).get("exports", [])
functions = capsule.get("interfaces", {}).get("functions", [])
# Extract features based on module
if name == "cli":
# Try to extract CLI commands from functions
for func in functions:
if func not in ['main', 'cli']:
cli_commands.add(func)
elif name == "lint":
features["Code Analysis"].append("- **Linting**: Custom security and quality rules")
elif name == "ast_verify":
features["Code Analysis"].append("- **AST Verification**: Contract-based code verification")
elif name == "universal_detector":
features["Code Analysis"].append("- **Pattern Detection**: Security and performance anti-patterns")
elif name == "flow_analyzer":
features["Code Analysis"].append("- **Flow Analysis**: Deadlock and race condition detection")
elif name == "risk_scorer":
features["Code Analysis"].append("- **Risk Scoring**: Automated risk assessment for files")
elif name == "test_generator":
features["Test Generation"].append("- **Test Scaffolding**: Generate test stubs from code")
elif name == "scaffolder":
features["Test Generation"].append("- **Contract Tests**: Generate DB/API contract tests")
elif name == "docgen":
features["Documentation"].append("- **Architecture Docs**: Auto-generate architecture documentation")
elif name == "capsules":
features["Documentation"].append("- **Code Capsules**: Compressed code summaries")
elif name == "report":
features["Documentation"].append("- **Audit Reports**: Comprehensive audit report generation")
elif name == "claude_setup":
features["CI/CD Integration"].append("- **Claude Code Integration**: Automated hooks for Claude AI")
elif name == "orchestrator":
features["CI/CD Integration"].append("- **Event-Driven Automation**: Git hooks and CI pipeline support")
elif name == "ml":
features["ML Capabilities"].append("- **ML-Based Suggestions**: Learn from codebase patterns")
features["ML Capabilities"].append("- **Root Cause Prediction**: Predict likely failure points")
# Output features by category
for category, feature_list in features.items():
if feature_list:
content.append(f"### {category}")
# Deduplicate
seen = set()
for feature in feature_list:
if feature not in seen:
content.append(feature)
seen.add(feature)
content.append("")
# Add CLI commands summary
if cli_commands:
content.append("## Available Commands")
content.append("")
content.append("The following commands are available through the CLI:")
content.append("")
# Group commands by purpose
cmd_groups = {
"Analysis": ['lint', 'ast_verify', 'detect_patterns', 'flow_analyze', 'risk_score'],
"Generation": ['gen_tests', 'scaffold', 'suggest_fixes'],
"Reporting": ['report', 'journal', 'capsules'],
"Setup": ['init', 'setup_claude', 'deps'],
}
for group, cmds in cmd_groups.items():
group_cmds = [c for c in cli_commands if any(cmd in c for cmd in cmds)]
if group_cmds:
content.append(f"**{group}**: {', '.join(sorted(group_cmds)[:5])}")
content.append("")
# Add configuration info
content.append("## Configuration")
content.append("")
content.append("- **Zero Dependencies**: Core functionality uses only Python stdlib")
content.append("- **Offline Mode**: All operations work without network access")
content.append("- **Per-Project**: No global state, everything is project-local")
content.append("")
return "\n".join(content)
def generate_trace_md(
manifest_hash: str,
manifest: list[dict],
capsules: list[dict],
db_path: str,
workset_paths: set[str] | None,
) -> str:
"""Generate TRACE.md content with meaningful metrics."""
# Count database entries
routes_count = 0
sql_objects_count = 0
refs_count = 0
imports_count = 0
if Path(db_path).exists():
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM api_endpoints")
routes_count = cursor.fetchone()[0]
cursor.execute("SELECT COUNT(*) FROM sql_objects")
sql_objects_count = cursor.fetchone()[0]
# Count refs (files table)
cursor.execute("SELECT COUNT(*) FROM files")
refs_count = cursor.fetchone()[0]
# Count imports
try:
cursor.execute("SELECT COUNT(*) FROM imports")
imports_count = cursor.fetchone()[0]
except sqlite3.OperationalError:
imports_count = 0
conn.close()
# Separate source files from all files
source_files = [f for f in manifest if is_source_file(f.get("path", ""))]
test_files = [f for f in manifest if 'test' in f.get("path", "").lower()]
doc_files = [f for f in manifest if f.get("path", "").endswith(('.md', '.rst', '.txt'))]
# Calculate coverage
if workset_paths:
coverage = len(capsules) / len(workset_paths) * 100
else:
coverage = len(capsules) / len(source_files) * 100 if source_files else 0
content = [
"# Audit Trace",
"",
"## Repository Snapshot",
f"**Manifest Hash**: `{manifest_hash}`",
f"**Timestamp**: {datetime.now(UTC).strftime('%Y-%m-%d %H:%M:%S UTC')}",
"",
"## File Statistics",
f"- **Total Files**: {len(manifest)}",
f" - Source Files: {len(source_files)}",
f" - Test Files: {len(test_files)}",
f" - Documentation: {len(doc_files)}",
f" - Other: {len(manifest) - len(source_files) - len(test_files) - len(doc_files)}",
"",
"## Code Metrics",
f"- **Cross-References**: {refs_count}",
f"- **Import Statements**: {imports_count}",
f"- **HTTP Routes**: {routes_count}",
f"- **SQL Objects**: {sql_objects_count}",
"",
"## Analysis Coverage",
f"- **Coverage**: {coverage:.1f}% of source files",
f"- **Capsules Generated**: {len(capsules)}",
f"- **Scope**: {'Workset' if workset_paths else 'Full repository'}",
"",
"## Language Distribution",
]
# Count languages
lang_counts = defaultdict(int)
for capsule in capsules:
lang = capsule.get("language", "") # Empty not unknown
lang_counts[lang] += 1
for lang, count in sorted(lang_counts.items(), key=lambda x: x[1], reverse=True):
content.append(f"- {lang}: {count} files")
content.extend([
"",
"## Environment",
f"- **TheAuditor Version**: {__version__}",
f"- **Python**: {sys.version.split()[0]}",
f"- **Platform**: {platform.platform()}",
f"- **Processor**: {platform.processor() or 'Unknown'}",
"",
"## Audit Trail",
"This document provides cryptographic proof of the codebase state at audit time.",
"The manifest hash can be used to verify no files have been modified since analysis.",
"",
])
return "\n".join(content)
def generate_docs(
manifest_path: str = "manifest.json",
db_path: str = "repo_index.db",
capsules_dir: str = "./.pf/capsules",
workset_path: str = "./.pf/workset.json",
out_dir: str = "./.pf/docs",
full: bool = False,
print_stats: bool = False,
) -> dict[str, Any]:
"""Generate documentation from index and capsules."""
# Load data
manifest, manifest_hash = load_manifest(manifest_path)
workset_paths = None if full else load_workset(workset_path)
try:
capsules = load_capsules(capsules_dir, workset_paths)
except RuntimeError as e:
raise RuntimeError(f"Cannot generate docs: {e}. Run 'aud capsules' first.") from e
# Get database data
routes = get_routes(db_path, workset_paths)
sql_objects = get_sql_objects(db_path, workset_paths)
# Generate content
scope = "full" if full else "workset"
architecture_content = generate_architecture_md(routes, sql_objects, capsules, scope)
trace_content = generate_trace_md(manifest_hash, manifest, capsules, db_path, workset_paths)
features_content = generate_features_md(capsules)
# Write files
out_path = Path(out_dir)
out_path.mkdir(parents=True, exist_ok=True)
(out_path / "ARCHITECTURE.md").write_text(architecture_content)
(out_path / "TRACE.md").write_text(trace_content)
(out_path / "FEATURES.md").write_text(features_content)
result = {
"files_written": 3,
"scope": scope,
"capsules_used": len(capsules),
"routes": len(routes),
"sql_objects": len(sql_objects),
}
if print_stats:
print(f"Generated {result['files_written']} docs in {out_dir}")
print(f" Scope: {result['scope']}")
print(f" Capsules: {result['capsules_used']}")
print(f" Routes: {result['routes']}")
print(f" SQL Objects: {result['sql_objects']}")
return result
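# Example invocation (illustrative), assuming manifest.json, repo_index.db and
# ./.pf/capsules already exist (e.g. after running 'aud capsules'):
#   from theauditor.docgen import generate_docs
#   generate_docs(full=True, print_stats=True)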

View File

@@ -0,0 +1,310 @@
"""Docker container security analyzer module."""
import json
import logging
import re
import sqlite3
from pathlib import Path
from typing import Any, Dict, List
# Set up logger
logger = logging.getLogger(__name__)
def analyze_docker_images(db_path: str, check_vulnerabilities: bool = True) -> List[Dict[str, Any]]:
"""
Analyze indexed Docker images for security misconfigurations.
Args:
db_path: Path to the repo_index.db database
check_vulnerabilities: Whether to scan base images for vulnerabilities
Returns:
List of security findings with severity levels
"""
findings = []
# Connect to the database
with sqlite3.connect(db_path) as conn:
conn.row_factory = sqlite3.Row
# Run each security check
findings.extend(_find_root_containers(conn))
findings.extend(_find_exposed_secrets(conn))
# Base image vulnerability check
if check_vulnerabilities:
base_images = _prepare_base_image_scan(conn)
if base_images:
# Import here to avoid circular dependency
from .vulnerability_scanner import scan_dependencies
# Run vulnerability scan on Docker base images
vuln_findings = scan_dependencies(base_images, offline=False)
# Convert vulnerability findings to Docker-specific format
for vuln in vuln_findings:
findings.append({
'type': 'docker_base_image_vulnerability',
'severity': vuln.get('severity', 'medium'),
'file': 'Dockerfile',
'message': f"Base image {vuln.get('package', 'unknown')} has vulnerability: {vuln.get('title', 'Unknown vulnerability')}",
'recommendation': vuln.get('recommendation', 'Update to latest secure version'),
'details': vuln
})
return findings
def _find_root_containers(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
"""
Detect containers running as root user (default or explicit).
CIS Docker Benchmark: Running containers as root is a major security risk.
A container breakout would grant attacker root privileges on the host.
Args:
conn: SQLite database connection
Returns:
List of findings for containers running as root
"""
findings = []
cursor = conn.cursor()
# Query all Docker images
cursor.execute("SELECT file_path, env_vars FROM docker_images")
for row in cursor:
file_path = row['file_path']
env_vars_json = row['env_vars']
# Parse the JSON column
try:
env_vars = json.loads(env_vars_json) if env_vars_json else {}
except json.JSONDecodeError as e:
logger.debug(f"Non-critical error parsing Docker env vars JSON: {e}", exc_info=False)
continue
# Check for _DOCKER_USER key (set by USER instruction)
docker_user = env_vars.get('_DOCKER_USER')
# If no USER instruction or explicitly set to root
if docker_user is None or docker_user.lower() == 'root':
findings.append({
'type': 'docker_root_user',
'severity': 'High',
'file': file_path,
'message': f"Container runs as root user (USER instruction {'not set' if docker_user is None else 'set to root'})",
'recommendation': "Add 'USER <non-root-user>' instruction to Dockerfile after installing dependencies"
})
return findings
def _find_exposed_secrets(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
"""
Detect hardcoded secrets in ENV and ARG instructions.
ENV and ARG values are stored in image layers and can be inspected
by anyone with access to the image, making them unsuitable for secrets.
Args:
conn: SQLite database connection
Returns:
List of findings for exposed secrets
"""
findings = []
cursor = conn.cursor()
# Patterns for detecting sensitive keys
sensitive_key_patterns = [
r'(?i)password',
r'(?i)secret',
r'(?i)api[_-]?key',
r'(?i)token',
r'(?i)auth',
r'(?i)credential',
r'(?i)private[_-]?key',
r'(?i)access[_-]?key'
]
# Common secret value patterns
secret_value_patterns = [
r'^ghp_[A-Za-z0-9]{36}$', # GitHub personal access token
r'^ghs_[A-Za-z0-9]{36}$', # GitHub secret
r'^sk-[A-Za-z0-9]{48}$', # OpenAI API key
r'^xox[baprs]-[A-Za-z0-9-]+$', # Slack token
r'^AKIA[A-Z0-9]{16}$', # AWS access key ID
]
# Query all Docker images
cursor.execute("SELECT file_path, env_vars, build_args FROM docker_images")
for row in cursor:
file_path = row['file_path']
env_vars_json = row['env_vars']
build_args_json = row['build_args']
# Parse JSON columns
try:
env_vars = json.loads(env_vars_json) if env_vars_json else {}
build_args = json.loads(build_args_json) if build_args_json else {}
except json.JSONDecodeError as e:
logger.debug(f"Non-critical error parsing Docker JSON columns: {e}", exc_info=False)
continue
# Check ENV variables
for key, value in env_vars.items():
# Skip internal tracking keys
if key.startswith('_DOCKER_'):
continue
is_sensitive = False
# Check if key name indicates sensitive data
for pattern in sensitive_key_patterns:
if re.search(pattern, key):
is_sensitive = True
findings.append({
'type': 'docker_exposed_secret',
'severity': 'Critical',
'file': file_path,
'message': f"Potential secret exposed in ENV instruction: {key}",
'recommendation': "Use Docker secrets or mount secrets at runtime instead of ENV"
})
break
# Check if value matches known secret patterns
if not is_sensitive and value:
for pattern in secret_value_patterns:
if re.match(pattern, str(value)):
findings.append({
'type': 'docker_exposed_secret',
'severity': 'Critical',
'file': file_path,
'message': f"Detected secret pattern in ENV value for key: {key}",
'recommendation': "Remove hardcoded secrets and use runtime secret injection"
})
break
# Check for high entropy strings (potential secrets)
if not is_sensitive and value and _is_high_entropy(str(value)):
findings.append({
'type': 'docker_possible_secret',
'severity': 'Medium',
'file': file_path,
'message': f"High entropy value in ENV {key} - possible secret",
'recommendation': "Review if this is a secret and move to secure storage if so"
})
# Check BUILD ARGs
for key, value in build_args.items():
# Check if key name indicates sensitive data
for pattern in sensitive_key_patterns:
if re.search(pattern, key):
findings.append({
'type': 'docker_exposed_secret',
'severity': 'High', # Slightly lower than ENV as ARGs are build-time only
'file': file_path,
'message': f"Potential secret exposed in ARG instruction: {key}",
'recommendation': "Use --secret mount or BuildKit secrets instead of ARG for sensitive data"
})
break
return findings
def _prepare_base_image_scan(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
"""
Prepare base image data for vulnerability scanning.
This function extracts and parses base image information from the database,
preparing it in the format expected by vulnerability_scanner.scan_dependencies().
Args:
conn: SQLite database connection
Returns:
List of dependency dicts with manager='docker', name, and version
"""
dependencies = []
cursor = conn.cursor()
# Get all unique base images
cursor.execute("SELECT DISTINCT base_image FROM docker_images WHERE base_image IS NOT NULL")
for row in cursor:
base_image = row[0]
# Parse image string to extract name and version/tag
# Format examples:
# - python:3.11-slim
# - node:18-alpine
# - ubuntu:22.04
# - gcr.io/project/image:tag
# - image@sha256:hash
if '@' in base_image:
# Handle digest format (image@sha256:...)
name = base_image.split('@')[0]
version = base_image.split('@')[1]
elif ':' in base_image:
# Handle tag format (image:tag)
parts = base_image.rsplit(':', 1)
name = parts[0]
version = parts[1]
else:
# No tag specified, defaults to 'latest'
name = base_image
version = 'latest'
# Create dependency dict in vulnerability scanner format
dependencies.append({
'manager': 'docker',
'name': name,
'version': version,
'source_file': 'Dockerfile' # Could be enhanced to track actual file
})
return dependencies
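# Illustrative result for a Dockerfile declaring `FROM python:3.11-slim`:
#   [{'manager': 'docker', 'name': 'python', 'version': '3.11-slim',
#     'source_file': 'Dockerfile'}]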
def _is_high_entropy(value: str, threshold: float = 4.0) -> bool:
"""
Check if a string has high entropy (potential secret).
Uses Shannon entropy calculation to detect random-looking strings
that might be secrets, API keys, or tokens.
Args:
value: String to check
threshold: Entropy threshold (default 4.0)
Returns:
True if entropy exceeds threshold
"""
import math
# Skip short strings
if len(value) < 10:
return False
# Skip strings with spaces (likely not secrets)
if ' ' in value:
return False
# Calculate character frequency
char_freq = {}
for char in value:
char_freq[char] = char_freq.get(char, 0) + 1
# Calculate Shannon entropy
entropy = 0.0
for freq in char_freq.values():
probability = freq / len(value)
if probability > 0:
entropy -= probability * math.log2(probability)
return entropy > threshold
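# Worked example (illustrative): Shannon entropy here is bits per character.
# "hello_world_value" (17 chars, heavy repetition) scores ~3.3 -> False, while
# a 24-character token of all-distinct characters such as
# "g8Xq2LpZ9rT4vW7yKbN3mQ6s" scores log2(24) ~= 4.58 -> True.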

793
theauditor/docs_fetch.py Normal file
View File

@@ -0,0 +1,793 @@
"""Documentation fetcher for version-correct package docs."""
import json
import re
import time
import urllib.error
import urllib.request
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional
from theauditor.security import sanitize_path, sanitize_url_component, validate_package_name, SecurityError
# Default allowlist for registries
DEFAULT_ALLOWLIST = [
"https://registry.npmjs.org/",
"https://pypi.org/", # Allow both API and web scraping
"https://raw.githubusercontent.com/",
"https://readthedocs.io/",
"https://readthedocs.org/",
]
# Rate limiting configuration - optimized for minimal runtime
RATE_LIMIT_DELAY = 0.15 # Average delay between requests (balanced for npm/PyPI)
RATE_LIMIT_BACKOFF = 15 # Backoff on 429/disconnect (15s gives APIs time to reset)
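# At 0.15s per request this caps throughput near 400 requests/min, well above
# the registries' own limits (npm ~100/min, PyPI ~60/min), so the per-service
# 429 backoff below is what actually throttles sustained runs.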
def fetch_docs(
deps: List[Dict[str, Any]],
allow_net: bool = True,
allowlist: Optional[List[str]] = None,
offline: bool = False,
output_dir: str = "./.pf/context/docs"
) -> Dict[str, Any]:
"""
Fetch version-correct documentation for dependencies.
Args:
deps: List of dependency objects from deps.py
allow_net: Whether network access is allowed
allowlist: List of allowed URL prefixes (uses DEFAULT_ALLOWLIST if None)
offline: Force offline mode
output_dir: Base directory for cached docs
Returns:
Summary of fetch operations
"""
if offline or not allow_net:
return {
"mode": "offline",
"fetched": 0,
"cached": 0,
"skipped": len(deps),
"errors": []
}
if allowlist is None:
allowlist = DEFAULT_ALLOWLIST
try:
output_path = sanitize_path(output_dir, ".")
output_path.mkdir(parents=True, exist_ok=True)
except SecurityError as e:
return {
"mode": "error",
"error": f"Invalid output directory: {e}",
"fetched": 0,
"cached": 0,
"skipped": len(deps)
}
stats = {
"mode": "online",
"fetched": 0,
"cached": 0,
"skipped": 0,
"errors": []
}
# FIRST PASS: Check what's cached
needs_fetch = []
for dep in deps:
# Quick cache check without network
cache_result = _check_cache_for_dep(dep, output_path)
if cache_result["cached"]:
stats["cached"] += 1
else:
needs_fetch.append(dep)
# Early exit if everything is cached
if not needs_fetch:
return stats
# SECOND PASS: Fetch only what we need, with per-service rate limiting
npm_rate_limited_until = 0
pypi_rate_limited_until = 0
for i, dep in enumerate(needs_fetch):
try:
current_time = time.time()
# Check if this service is rate limited
if dep["manager"] == "npm" and current_time < npm_rate_limited_until:
stats["skipped"] += 1
stats["errors"].append(f"{dep['name']}: Skipped (npm rate limited)")
continue
elif dep["manager"] == "py" and current_time < pypi_rate_limited_until:
stats["skipped"] += 1
stats["errors"].append(f"{dep['name']}: Skipped (PyPI rate limited)")
continue
# Fetch the documentation
if dep["manager"] == "npm":
result = _fetch_npm_docs(dep, output_path, allowlist)
elif dep["manager"] == "py":
result = _fetch_pypi_docs(dep, output_path, allowlist)
else:
stats["skipped"] += 1
continue
if result["status"] == "fetched":
stats["fetched"] += 1
# Rate limiting: delay after successful fetch to be server-friendly
# npm and PyPI both have rate limits (npm: 100/min, PyPI: 60/min)
time.sleep(RATE_LIMIT_DELAY) # Be server-friendly
elif result["status"] == "cached":
stats["cached"] += 1 # Shouldn't happen here but handle it
elif result.get("reason") == "rate_limited":
stats["errors"].append(f"{dep['name']}: Rate limited - backing off {RATE_LIMIT_BACKOFF}s")
stats["skipped"] += 1
# Set rate limit expiry for this service
if dep["manager"] == "npm":
npm_rate_limited_until = time.time() + RATE_LIMIT_BACKOFF
elif dep["manager"] == "py":
pypi_rate_limited_until = time.time() + RATE_LIMIT_BACKOFF
else:
stats["skipped"] += 1
except Exception as e:
error_msg = str(e)
if "429" in error_msg or "rate" in error_msg.lower():
stats["errors"].append(f"{dep['name']}: Rate limited - backing off {RATE_LIMIT_BACKOFF}s")
# Set rate limit expiry for this service
if dep["manager"] == "npm":
npm_rate_limited_until = time.time() + RATE_LIMIT_BACKOFF
elif dep["manager"] == "py":
pypi_rate_limited_until = time.time() + RATE_LIMIT_BACKOFF
else:
stats["errors"].append(f"{dep['name']}: {error_msg}")
return stats
def _check_cache_for_dep(dep: Dict[str, Any], output_dir: Path) -> Dict[str, bool]:
"""
Quick cache check for a dependency without making network calls.
Returns {"cached": True/False}
"""
name = dep["name"]
version = dep["version"]
manager = dep["manager"]
# Build the cache file path
if manager == "npm":
# Handle git versions
if version.startswith("git") or "://" in version:
import hashlib
version_hash = hashlib.md5(version.encode()).hexdigest()[:8]
safe_version = f"git-{version_hash}"
else:
safe_version = version.replace(":", "_").replace("/", "_").replace("\\", "_")
safe_name = name.replace("@", "_at_").replace("/", "_")
pkg_dir = output_dir / "npm" / f"{safe_name}@{safe_version}"
elif manager == "py":
safe_version = version.replace(":", "_").replace("/", "_").replace("\\", "_")
safe_name = name.replace("/", "_").replace("\\", "_")
pkg_dir = output_dir / "py" / f"{safe_name}@{safe_version}"
else:
return {"cached": False}
doc_file = pkg_dir / "doc.md"
meta_file = pkg_dir / "meta.json"
# Check cache validity
if doc_file.exists() and meta_file.exists():
try:
with open(meta_file, encoding="utf-8") as f:
meta = json.load(f)
# Cache for 7 days
last_checked = datetime.fromisoformat(meta["last_checked"])
if (datetime.now() - last_checked).days < 7:
return {"cached": True}
except (json.JSONDecodeError, KeyError):
pass
return {"cached": False}
def _fetch_npm_docs(
dep: Dict[str, Any],
output_dir: Path,
allowlist: List[str]
) -> Dict[str, Any]:
"""Fetch documentation for an npm package."""
name = dep["name"]
version = dep["version"]
# Validate package name
if not validate_package_name(name, "npm"):
return {"status": "skipped", "reason": "Invalid package name"}
# Sanitize version for filesystem (handle git URLs)
if version.startswith("git") or "://" in version:
# For git dependencies, use a hash of the URL as version
import hashlib
version_hash = hashlib.md5(version.encode()).hexdigest()[:8]
safe_version = f"git-{version_hash}"
else:
# For normal versions, just replace problematic characters
safe_version = version.replace(":", "_").replace("/", "_").replace("\\", "_")
# Create package-specific directory with sanitized name
# Replace @ and / in scoped packages for filesystem safety
safe_name = name.replace("@", "_at_").replace("/", "_")
try:
pkg_dir = output_dir / "npm" / f"{safe_name}@{safe_version}"
pkg_dir.mkdir(parents=True, exist_ok=True)
except (OSError, SecurityError) as e:
return {"status": "error", "error": f"Cannot create package directory: {e}"}
doc_file = pkg_dir / "doc.md"
meta_file = pkg_dir / "meta.json"
# Check cache
if doc_file.exists() and meta_file.exists():
# Check if cache is still valid (simple time-based for now)
try:
with open(meta_file, encoding="utf-8") as f:
meta = json.load(f)
# Cache for 7 days
last_checked = datetime.fromisoformat(meta["last_checked"])
if (datetime.now() - last_checked).days < 7:
return {"status": "cached"}
except (json.JSONDecodeError, KeyError):
pass # Invalid cache, refetch
# Fetch from registry with sanitized package name
safe_url_name = sanitize_url_component(name)
safe_url_version = sanitize_url_component(version)
url = f"https://registry.npmjs.org/{safe_url_name}/{safe_url_version}"
if not _is_url_allowed(url, allowlist):
return {"status": "skipped", "reason": "URL not in allowlist"}
try:
with urllib.request.urlopen(url, timeout=10) as response:
data = json.loads(response.read())
readme = data.get("readme", "")
repository = data.get("repository", {})
homepage = data.get("homepage", "")
# Priority 1: Try to get README from GitHub if available
github_fetched = False
if isinstance(repository, dict):
repo_url = repository.get("url", "")
github_readme = _fetch_github_readme(repo_url, allowlist)
if github_readme and len(github_readme) > 500: # Only use if substantial
readme = github_readme
github_fetched = True
# Priority 2: If no good GitHub README, try homepage if it's GitHub
if not github_fetched and homepage and "github.com" in homepage:
github_readme = _fetch_github_readme(homepage, allowlist)
if github_readme and len(github_readme) > 500:
readme = github_readme
github_fetched = True
# Priority 3: Use npm README if it's substantial
if not github_fetched and len(readme) < 500:
# The npm README is too short, try to enhance it
readme = _enhance_npm_readme(data, readme)
# Write documentation
with open(doc_file, "w", encoding="utf-8") as f:
f.write(f"# {name}@{version}\n\n")
f.write(f"**Package**: [{name}](https://www.npmjs.com/package/{name})\n")
f.write(f"**Version**: {version}\n")
if homepage:
f.write(f"**Homepage**: {homepage}\n")
f.write("\n---\n\n")
f.write(readme)
# Add usage examples if not in README
if "## Usage" not in readme and "## Example" not in readme:
f.write("\n\n## Installation\n\n```bash\nnpm install {name}\n```\n".format(name=name))
# Write metadata
meta = {
"source_url": url,
"last_checked": datetime.now().isoformat(),
"etag": response.headers.get("ETag"),
"repository": repository,
"from_github": github_fetched
}
with open(meta_file, "w", encoding="utf-8") as f:
json.dump(meta, f, indent=2)
return {"status": "fetched"}
except urllib.error.HTTPError as e:
if e.code == 429:
return {"status": "error", "reason": "rate_limited", "error": "HTTP 429: Rate limited"}
return {"status": "error", "error": f"HTTP {e.code}: {str(e)}"}
except (urllib.error.URLError, json.JSONDecodeError) as e:
return {"status": "error", "error": str(e)}
def _fetch_pypi_docs(
dep: Dict[str, Any],
output_dir: Path,
allowlist: List[str]
) -> Dict[str, Any]:
"""Fetch documentation for a PyPI package."""
name = dep["name"].strip() # Strip any whitespace from name
version = dep["version"]
# Validate package name
if not validate_package_name(name, "py"):
return {"status": "skipped", "reason": "Invalid package name"}
# Sanitize package name for URL
safe_url_name = sanitize_url_component(name)
# Handle special versions
if version in ["latest", "git"]:
# For latest, fetch current version first
if version == "latest":
url = f"https://pypi.org/pypi/{safe_url_name}/json"
else:
return {"status": "skipped", "reason": "git dependency"}
else:
safe_url_version = sanitize_url_component(version)
url = f"https://pypi.org/pypi/{safe_url_name}/{safe_url_version}/json"
if not _is_url_allowed(url, allowlist):
return {"status": "skipped", "reason": "URL not in allowlist"}
# Sanitize version for filesystem
safe_version = version.replace(":", "_").replace("/", "_").replace("\\", "_")
# Create package-specific directory with sanitized name
safe_name = name.replace("/", "_").replace("\\", "_")
try:
pkg_dir = output_dir / "py" / f"{safe_name}@{safe_version}"
pkg_dir.mkdir(parents=True, exist_ok=True)
except (OSError, SecurityError) as e:
return {"status": "error", "error": f"Cannot create package directory: {e}"}
doc_file = pkg_dir / "doc.md"
meta_file = pkg_dir / "meta.json"
# Check cache
if doc_file.exists() and meta_file.exists():
try:
with open(meta_file, encoding="utf-8") as f:
meta = json.load(f)
last_checked = datetime.fromisoformat(meta["last_checked"])
if (datetime.now() - last_checked).days < 7:
return {"status": "cached"}
except (json.JSONDecodeError, KeyError):
pass
try:
with urllib.request.urlopen(url, timeout=10) as response:
data = json.loads(response.read())
info = data.get("info", {})
description = info.get("description", "")
summary = info.get("summary", "")
# Priority 1: Try to get README from project URLs (GitHub, GitLab, etc.)
github_fetched = False
project_urls = info.get("project_urls", {})
# Check all possible URL sources for GitHub
all_urls = []
for key, proj_url in project_urls.items():
if proj_url:
all_urls.append(proj_url)
# Also check home_page and download_url
home_page = info.get("home_page", "")
if home_page:
all_urls.append(home_page)
download_url = info.get("download_url", "")
if download_url:
all_urls.append(download_url)
# Try GitHub first. Use a distinct loop variable: the registry `url` is
# recorded as source_url in meta.json below and must not be clobbered.
for candidate in all_urls:
if "github.com" in candidate.lower():
github_readme = _fetch_github_readme(candidate, allowlist)
if github_readme and len(github_readme) > 500:
description = github_readme
github_fetched = True
break
# Priority 2: Try ReadTheDocs if available
if not github_fetched:
for candidate in all_urls:
if "readthedocs" in candidate.lower():
rtd_content = _fetch_readthedocs(candidate, allowlist)
if rtd_content and len(rtd_content) > 500:
description = rtd_content
github_fetched = True # Mark as fetched from external source
break
# Priority 3: Try to scrape PyPI web page (not API) for full README
if not github_fetched and len(description) < 1000:
pypi_readme = _fetch_pypi_web_readme(name, version, allowlist)
if pypi_readme and len(pypi_readme) > len(description):
description = pypi_readme
github_fetched = True # Mark as fetched from external source
# Priority 4: Use PyPI description (often contains full README)
# PyPI descriptions can be quite good if properly uploaded
if not github_fetched and len(description) < 500 and summary:
# If description is too short, enhance it
description = _enhance_pypi_description(info, description, summary)
# Write documentation
with open(doc_file, "w", encoding="utf-8") as f:
f.write(f"# {name}@{version}\n\n")
f.write(f"**Package**: [{name}](https://pypi.org/project/{name}/)\n")
f.write(f"**Version**: {version}\n")
# Add project URLs if available
if project_urls:
f.write("\n**Links**:\n")
for key, link in list(project_urls.items())[:5]: # Limit to 5; don't shadow `url`
if link:
f.write(f"- {key}: {link}\n")
f.write("\n---\n\n")
# Add summary if different from description
if summary and summary not in description:
f.write(f"**Summary**: {summary}\n\n")
f.write(description)
# Add installation instructions if not in description
if "pip install" not in description.lower():
f.write(f"\n\n## Installation\n\n```bash\npip install {name}\n```\n")
# Add basic usage if really minimal docs
if len(description) < 200:
f.write(f"\n\n## Basic Usage\n\n```python\nimport {name.replace('-', '_')}\n```\n")
# Write metadata
meta = {
"source_url": url,
"last_checked": datetime.now().isoformat(),
"etag": response.headers.get("ETag"),
"project_urls": project_urls,
"from_github": github_fetched
}
with open(meta_file, "w", encoding="utf-8") as f:
json.dump(meta, f, indent=2)
return {"status": "fetched"}
except urllib.error.HTTPError as e:
if e.code == 429:
return {"status": "error", "reason": "rate_limited", "error": "HTTP 429: Rate limited"}
return {"status": "error", "error": f"HTTP {e.code}: {str(e)}"}
except (urllib.error.URLError, json.JSONDecodeError) as e:
return {"status": "error", "error": str(e)}
def _fetch_github_readme(repo_url: str, allowlist: List[str]) -> Optional[str]:
"""
Fetch README from GitHub repository.
Converts repository URL to raw GitHub URL for README.
"""
if not repo_url:
return None
# Extract owner/repo from various GitHub URL formats
patterns = [
r'github\.com[:/]([^/]+)/([^/\s]+)',
r'git\+https://github\.com/([^/]+)/([^/\s]+)',
]
for pattern in patterns:
match = re.search(pattern, repo_url)
if match:
owner, repo = match.groups()
# Clean repo name
repo = repo.replace(".git", "")
# Try common README filenames
readme_files = ["README.md", "readme.md", "README.rst", "README.txt"]
# Sanitize owner and repo for URL
safe_owner = sanitize_url_component(owner)
safe_repo = sanitize_url_component(repo)
for readme_name in readme_files:
safe_readme = sanitize_url_component(readme_name)
raw_url = f"https://raw.githubusercontent.com/{safe_owner}/{safe_repo}/main/{safe_readme}"
if not _is_url_allowed(raw_url, allowlist):
continue
try:
with urllib.request.urlopen(raw_url, timeout=5) as response:
return response.read().decode("utf-8")
except urllib.error.HTTPError:
# Try master branch
raw_url = f"https://raw.githubusercontent.com/{safe_owner}/{safe_repo}/master/{safe_readme}"
try:
with urllib.request.urlopen(raw_url, timeout=5) as response:
return response.read().decode("utf-8")
except urllib.error.URLError:
continue
except urllib.error.URLError:
continue
return None
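# Illustrative resolution: "git+https://github.com/pallets/flask.git" reduces
# to owner "pallets", repo "flask"; README.md on the main branch is tried
# first, then the master branch and the other filename spellings.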
def _is_url_allowed(url: str, allowlist: List[str]) -> bool:
"""Check if URL is in the allowlist."""
for allowed in allowlist:
if url.startswith(allowed):
return True
return False
def _enhance_npm_readme(data: Dict[str, Any], readme: str) -> str:
"""Enhance minimal npm README with package metadata."""
enhanced = readme if readme else ""
# Add description if not in README
description = data.get("description", "")
if description and description not in enhanced:
enhanced = f"{description}\n\n{enhanced}"
# Add keywords
keywords = data.get("keywords", [])
if keywords and "keywords" not in enhanced.lower():
enhanced += f"\n\n## Keywords\n\n{', '.join(keywords)}"
# Add main entry point info
main = data.get("main", "")
if main:
enhanced += f"\n\n## Entry Point\n\nMain file: `{main}`"
# Add dependencies info if substantial
deps = data.get("dependencies", {})
if len(deps) > 0 and len(deps) <= 10: # Only if reasonable number
enhanced += "\n\n## Dependencies\n\n"
for dep, ver in deps.items():
enhanced += f"- {dep}: {ver}\n"
return enhanced
def _fetch_readthedocs(url: str, allowlist: List[str]) -> Optional[str]:
"""
Fetch documentation from ReadTheDocs.
Tries to get the main index page content.
"""
if not url or not _is_url_allowed(url, allowlist):
return None
# Ensure we're getting the latest version
if not url.endswith("/"):
url += "/"
# Try to fetch the main page
try:
# Add en/latest if not already in URL
if "/en/latest" not in url and "/en/stable" not in url:
url = url.rstrip("/") + "/en/latest/"
with urllib.request.urlopen(url, timeout=10) as response:
html_content = response.read().decode("utf-8")
# Basic HTML to markdown conversion (very simplified)
# Remove script and style tags
html_content = re.sub(r'<script[^>]*>.*?</script>', '', html_content, flags=re.DOTALL)
html_content = re.sub(r'<style[^>]*>.*?</style>', '', html_content, flags=re.DOTALL)
# Extract main content (look for common RTD content divs)
content_match = re.search(r'<div[^>]*class="[^"]*document[^"]*"[^>]*>(.*?)</div>', html_content, re.DOTALL)
if content_match:
html_content = content_match.group(1)
# Convert basic HTML tags to markdown
html_content = re.sub(r'<h1[^>]*>(.*?)</h1>', r'# \1\n', html_content)
html_content = re.sub(r'<h2[^>]*>(.*?)</h2>', r'## \1\n', html_content)
html_content = re.sub(r'<h3[^>]*>(.*?)</h3>', r'### \1\n', html_content)
html_content = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', html_content)
html_content = re.sub(r'<pre[^>]*>(.*?)</pre>', r'```\n\1\n```', html_content, flags=re.DOTALL)
html_content = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', html_content)
html_content = re.sub(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', r'[\2](\1)', html_content)
html_content = re.sub(r'<[^>]+>', '', html_content) # Remove remaining HTML tags
# Clean up whitespace
html_content = re.sub(r'\n{3,}', '\n\n', html_content)
return html_content.strip()
except Exception:
return None
def _fetch_pypi_web_readme(name: str, version: str, allowlist: List[str]) -> Optional[str]:
"""
Fetch the rendered README from PyPI's web interface.
The web interface shows the full README that's often missing from the API.
"""
# Validate package name
if not validate_package_name(name, "py"):
return None
# Sanitize for URL
safe_name = sanitize_url_component(name)
safe_version = sanitize_url_component(version)
# PyPI web URLs
urls_to_try = [
f"https://pypi.org/project/{safe_name}/{safe_version}/",
f"https://pypi.org/project/{safe_name}/"
]
for url in urls_to_try:
if not _is_url_allowed(url, allowlist):
continue
try:
req = urllib.request.Request(url, headers={
'User-Agent': 'Mozilla/5.0 (compatible; TheAuditor/1.0)'
})
with urllib.request.urlopen(req, timeout=10) as response:
html_content = response.read().decode("utf-8")
# Look for the project description div
# PyPI uses a specific class for the README content
readme_match = re.search(
r'<div[^>]*class="[^"]*project-description[^"]*"[^>]*>(.*?)</div>',
html_content,
re.DOTALL | re.IGNORECASE
)
if not readme_match:
# Try alternative patterns
readme_match = re.search(
r'<div[^>]*class="[^"]*description[^"]*"[^>]*>(.*?)</div>',
html_content,
re.DOTALL | re.IGNORECASE
)
if readme_match:
readme_html = readme_match.group(1)
# Convert HTML to markdown (simplified)
# Headers
readme_html = re.sub(r'<h1[^>]*>(.*?)</h1>', r'# \1\n', readme_html, flags=re.IGNORECASE)
readme_html = re.sub(r'<h2[^>]*>(.*?)</h2>', r'## \1\n', readme_html, flags=re.IGNORECASE)
readme_html = re.sub(r'<h3[^>]*>(.*?)</h3>', r'### \1\n', readme_html, flags=re.IGNORECASE)
# Code blocks
readme_html = re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', r'```\n\1\n```', readme_html, flags=re.DOTALL | re.IGNORECASE)
readme_html = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', readme_html, flags=re.IGNORECASE)
# Lists
readme_html = re.sub(r'<li[^>]*>(.*?)</li>', r'- \1\n', readme_html, flags=re.IGNORECASE)
# Links
readme_html = re.sub(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', r'[\2](\1)', readme_html, flags=re.IGNORECASE)
# Paragraphs and line breaks
readme_html = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', readme_html, flags=re.DOTALL | re.IGNORECASE)
readme_html = re.sub(r'<br[^>]*>', '\n', readme_html, flags=re.IGNORECASE)
# Remove remaining HTML tags
readme_html = re.sub(r'<[^>]+>', '', readme_html)
# Decode HTML entities
readme_html = readme_html.replace('&lt;', '<')
readme_html = readme_html.replace('&gt;', '>')
readme_html = readme_html.replace('&amp;', '&')
readme_html = readme_html.replace('&quot;', '"')
readme_html = readme_html.replace('&#39;', "'")
# Clean up whitespace
readme_html = re.sub(r'\n{3,}', '\n\n', readme_html)
readme_html = readme_html.strip()
if len(readme_html) > 100: # Only return if we got substantial content
return readme_html
except Exception:
continue
return None
def _enhance_pypi_description(info: Dict[str, Any], description: str, summary: str) -> str:
"""Enhance minimal PyPI description with package metadata."""
enhanced = description if description else ""
# Start with summary if description is empty
if not enhanced and summary:
enhanced = f"{summary}\n\n"
# Add author info
author = info.get("author", "")
author_email = info.get("author_email", "")
if author and "author" not in enhanced.lower():
author_info = f"\n\n## Author\n\n{author}"
if author_email:
author_info += f" ({author_email})"
enhanced += author_info
# Add license
license_info = info.get("license", "")
if license_info and "license" not in enhanced.lower():
enhanced += f"\n\n## License\n\n{license_info}"
# Add classifiers (limited)
classifiers = info.get("classifiers", [])
relevant_classifiers = [
c for c in classifiers
if "Programming Language" in c or "Framework" in c or "Topic" in c
][:5] # Limit to 5
if relevant_classifiers:
enhanced += "\n\n## Classifiers\n\n"
for classifier in relevant_classifiers:
enhanced += f"- {classifier}\n"
# Add requires_python if specified
requires_python = info.get("requires_python", "")
if requires_python:
enhanced += f"\n\n## Python Version\n\nRequires Python {requires_python}"
return enhanced
def check_latest(
deps: List[Dict[str, Any]],
allow_net: bool = True,
offline: bool = False,
output_path: str = "./.pf/deps_latest.json"
) -> Dict[str, Any]:
"""
Check latest versions and compare to locked versions.
This is a wrapper around deps.check_latest_versions for consistency.
"""
from .deps import check_latest_versions, write_deps_latest_json
if offline or not allow_net:
return {
"mode": "offline",
"checked": 0,
"outdated": 0
}
latest_info = check_latest_versions(deps, allow_net=allow_net, offline=offline)
if latest_info:
# Sanitize output path before writing
try:
safe_output_path = str(sanitize_path(output_path, "."))
write_deps_latest_json(latest_info, safe_output_path)
except SecurityError as e:
return {
"mode": "error",
"error": f"Invalid output path: {e}",
"checked": 0,
"outdated": 0
}
outdated = sum(1 for info in latest_info.values() if info["is_outdated"])
return {
"mode": "online",
"checked": len(latest_info),
"outdated": outdated,
"output": output_path
}

View File

@@ -0,0 +1,408 @@
"""Documentation summarizer for creating concise doc capsules."""
import json
import re
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional, Set
def summarize_docs(
docs_dir: str = "./.pf/context/docs",
output_dir: str = "./.pf/context/doc_capsules",
workset_path: Optional[str] = None,
max_capsule_lines: int = 50
) -> Dict[str, Any]:
"""
Generate concise doc capsules from fetched documentation.
Args:
docs_dir: Directory containing fetched docs
output_dir: Directory for output capsules
workset_path: Optional workset to filter relevant deps
max_capsule_lines: Maximum lines per capsule
Returns:
Summary statistics
"""
docs_path = Path(docs_dir)
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
# Load workset if provided
relevant_deps = None
if workset_path and Path(workset_path).exists():
relevant_deps = _load_workset_deps(workset_path)
stats = {
"total_docs": 0,
"capsules_created": 0,
"skipped": 0,
"errors": []
}
capsules_index = []
# Process npm docs
npm_dir = docs_path / "npm"
if npm_dir.exists():
for pkg_dir in npm_dir.iterdir():
if not pkg_dir.is_dir():
continue
# Extract package name and version
pkg_info = pkg_dir.name # format: name@version
if "@" not in pkg_info:
stats["skipped"] += 1
continue
name_version = pkg_info.rsplit("@", 1)
if len(name_version) != 2:
stats["skipped"] += 1
continue
name, version = name_version
# Check if in workset
if relevant_deps and f"npm:{name}" not in relevant_deps:
stats["skipped"] += 1
continue
stats["total_docs"] += 1
# Create capsule
doc_file = pkg_dir / "doc.md"
meta_file = pkg_dir / "meta.json"
if doc_file.exists():
try:
capsule = _create_capsule(
doc_file, meta_file, name, version, "npm", max_capsule_lines
)
# Write capsule
capsule_file = output_path / f"npm__{name}@{version}.md"
with open(capsule_file, "w", encoding="utf-8") as f:
f.write(capsule)
capsules_index.append({
"name": name,
"version": version,
"ecosystem": "npm",
"path": str(capsule_file.relative_to(output_path))
})
stats["capsules_created"] += 1
except Exception as e:
stats["errors"].append(f"{name}@{version}: {str(e)}")
# Process Python docs
py_dir = docs_path / "py"
if py_dir.exists():
for pkg_dir in py_dir.iterdir():
if not pkg_dir.is_dir():
continue
# Extract package name and version
pkg_info = pkg_dir.name # format: name@version
if "@" not in pkg_info:
stats["skipped"] += 1
continue
name_version = pkg_info.rsplit("@", 1)
if len(name_version) != 2:
stats["skipped"] += 1
continue
name, version = name_version
# Check if in workset
if relevant_deps and f"py:{name}" not in relevant_deps:
stats["skipped"] += 1
continue
stats["total_docs"] += 1
# Create capsule
doc_file = pkg_dir / "doc.md"
meta_file = pkg_dir / "meta.json"
if doc_file.exists():
try:
capsule = _create_capsule(
doc_file, meta_file, name, version, "py", max_capsule_lines
)
# Write capsule
capsule_file = output_path / f"py__{name}@{version}.md"
with open(capsule_file, "w", encoding="utf-8") as f:
f.write(capsule)
capsules_index.append({
"name": name,
"version": version,
"ecosystem": "py",
"path": str(capsule_file.relative_to(output_path))
})
stats["capsules_created"] += 1
except Exception as e:
stats["errors"].append(f"{name}@{version}: {str(e)}")
# Write index
index_file = output_path.parent / "doc_index.json"
with open(index_file, "w", encoding="utf-8") as f:
json.dump({
"created_at": datetime.now().isoformat(),
"capsules": capsules_index,
"stats": stats
}, f, indent=2)
return stats
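# Usage sketch: summarize_docs() walks .pf/context/docs/{npm,py}/<name>@<version>/
# and writes one capsule per package, plus a doc_index.json one level above
# output_dir. The returned dict mirrors the "stats" block in that index
# (counts below are illustrative):
#
#   stats = summarize_docs(max_capsule_lines=50)
#   # -> {"total_docs": 12, "capsules_created": 11, "skipped": 3, "errors": []}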
def _load_workset_deps(workset_path: str) -> Set[str]:
"""
Load relevant dependencies from workset.
Returns set of "manager:name" keys.
"""
relevant = set()
try:
with open(workset_path, encoding="utf-8") as f:
workset = json.load(f)
# Extract imported packages from workset files
# This is a simplified version - would need more sophisticated parsing
for file_info in workset.get("files", []):
path = file_info.get("path", "")
# Simple heuristic: look at file extension
if path.endswith((".js", ".ts", ".jsx", ".tsx")):
# Would parse imports/requires
# For now, include all npm deps
relevant.add("npm:*")
elif path.endswith(".py"):
# Would parse imports
# For now, include all py deps
relevant.add("py:*")
except (json.JSONDecodeError, KeyError):
pass
# If we couldn't determine specific deps, include all
if not relevant or "npm:*" in relevant or "py:*" in relevant:
return set() # Empty set means include all
return relevant
def _create_capsule(
doc_file: Path,
meta_file: Path,
name: str,
version: str,
ecosystem: str,
max_lines: int
) -> str:
"""Create a concise capsule from documentation."""
# Read documentation
with open(doc_file, encoding="utf-8") as f:
content = f.read()
# Read metadata
meta = {}
if meta_file.exists():
try:
with open(meta_file, encoding="utf-8") as f:
meta = json.load(f)
except json.JSONDecodeError:
pass
# Extract key sections
sections = {
"init": _extract_initialization(content, ecosystem),
"apis": _extract_top_apis(content),
"examples": _extract_examples(content),
}
# Build capsule
capsule_lines = [
f"# {name}@{version} ({ecosystem})",
"",
"## Quick Start",
""
]
if sections["init"]:
capsule_lines.extend(sections["init"][:10]) # Limit lines
capsule_lines.append("")
elif content: # If no structured init but has content, add some raw content
content_lines = content.split("\n")[:10]
capsule_lines.extend(content_lines)
capsule_lines.append("")
if sections["apis"]:
capsule_lines.append("## Top APIs")
capsule_lines.append("")
capsule_lines.extend(sections["apis"][:15]) # Limit lines
capsule_lines.append("")
if sections["examples"]:
capsule_lines.append("## Examples")
capsule_lines.append("")
capsule_lines.extend(sections["examples"][:15]) # Limit lines
capsule_lines.append("")
# Add reference to full documentation
capsule_lines.append("## 📄 Full Documentation Available")
capsule_lines.append("")
# Calculate relative path from project root
full_doc_path = f"./.pf/context/docs/{ecosystem}/{name}@{version}/doc.md"
capsule_lines.append(f"**Full content**: `{full_doc_path}`")
# Count lines in full doc if it exists
if doc_file.exists():
try:
with open(doc_file, encoding="utf-8") as f:
line_count = len(f.readlines())
capsule_lines.append(f"**Size**: {line_count} lines")
except Exception:
pass
capsule_lines.append("")
# Add source info
capsule_lines.append("## Source")
capsule_lines.append("")
capsule_lines.append(f"- URL: {meta.get('source_url', '')}")
capsule_lines.append(f"- Fetched: {meta.get('last_checked', '')}")
# Truncate if too long
if len(capsule_lines) > max_lines:
# Keep the full doc reference even when truncating
keep_lines = capsule_lines[:max_lines-7] # Leave room for reference and truncation
ref_lines = [line for line in capsule_lines if "Full Documentation Available" in line or "Full content" in line or "**Size**" in line]
capsule_lines = keep_lines + ["", "... (truncated)", ""] + ref_lines
return "\n".join(capsule_lines)
def _extract_initialization(content: str, ecosystem: str) -> List[str]:
"""Extract initialization/installation snippets."""
lines = []
# Look for installation section
install_patterns = [
r"## Install\w*",
r"## Getting Started",
r"## Quick Start",
r"### Install\w*",
]
for pattern in install_patterns:
match = re.search(pattern, content, re.IGNORECASE | re.MULTILINE)
if match:
# Extract next code block
start = match.end()
code_match = re.search(r"```(\w*)\n(.*?)```", content[start:], re.DOTALL)
if code_match:
lines.append(f"```{code_match.group(1)}")
lines.extend(code_match.group(2).strip().split("\n")[:5])
lines.append("```")
break
# Fallback: look for common patterns
if not lines:
if ecosystem == "npm":
if "require(" in content:
match = re.search(r"(const|var|let)\s+\w+\s*=\s*require\([^)]+\)", content)
if match:
lines = ["```javascript", match.group(0), "```"]
elif "import " in content:
match = re.search(r"import\s+.*?from\s+['\"][^'\"]+['\"]", content)
if match:
lines = ["```javascript", match.group(0), "```"]
elif ecosystem == "py":
if "import " in content:
match = re.search(r"import\s+\w+", content)
if match:
lines = ["```python", match.group(0), "```"]
elif "from " in content:
match = re.search(r"from\s+\w+\s+import\s+\w+", content)
if match:
lines = ["```python", match.group(0), "```"]
return lines
def _extract_top_apis(content: str) -> List[str]:
"""Extract top API methods."""
lines = []
# Look for API section
api_patterns = [
r"## API",
r"## Methods",
r"## Functions",
r"### API",
]
for pattern in api_patterns:
match = re.search(pattern, content, re.IGNORECASE | re.MULTILINE)
if match:
start = match.end()
# Extract next few method signatures
method_matches = re.findall(
r"^[\*\-]\s*`([^`]+)`",
content[start:start+2000],
re.MULTILINE
)
for method in method_matches[:5]: # Top 5 methods
lines.append(f"- `{method}`")
break
# Fallback: look for function definitions in code blocks
if not lines:
code_blocks = re.findall(r"```\w*\n(.*?)```", content, re.DOTALL)
for block in code_blocks[:2]: # Check first 2 code blocks
# Look for function signatures
funcs = re.findall(r"(?:function|def|const|let|var)\s+(\w+)\s*\(([^)]*)\)", block)
for func_name, params in funcs[:5]:
lines.append(f"- `{func_name}({params})`")
if lines:
break
return lines
def _extract_examples(content: str) -> List[str]:
"""Extract usage examples."""
lines = []
# Look for examples section
example_patterns = [
r"## Example",
r"## Usage",
r"### Example",
r"### Usage",
]
for pattern in example_patterns:
match = re.search(pattern, content, re.IGNORECASE | re.MULTILINE)
if match:
start = match.end()
# Extract next code block
code_match = re.search(r"```(\w*)\n(.*?)```", content[start:], re.DOTALL)
if code_match:
lang = code_match.group(1) or "javascript"
code_lines = code_match.group(2).strip().split("\n")[:10] # Max 10 lines
lines.append(f"```{lang}")
lines.extend(code_lines)
lines.append("```")
break
return lines

493
theauditor/extraction.py Normal file
View File

@@ -0,0 +1,493 @@
"""Extraction module - pure courier model for data chunking.
This module implements the courier model: takes raw tool output and chunks it
into manageable pieces for AI processing WITHOUT any filtering or interpretation.
Pure Courier Principles:
- NO filtering by severity or importance
- NO deduplication or sampling
- NO interpretation of findings
- ONLY chunks files if they exceed 65KB
- ALL data preserved exactly as generated
The AI consumer decides what's important, not TheAuditor.
"""
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from collections import defaultdict
from theauditor.config_runtime import load_runtime_config
# DELETED: All smart extraction functions removed
# Pure courier model - no filtering, only chunking if needed
def _chunk_large_file(raw_path: Path, max_chunk_size: Optional[int] = None) -> Optional[List[Tuple[Path, int]]]:
"""Split large files into chunks of configured max size."""
# Load config if not provided
if max_chunk_size is None:
config = load_runtime_config()
max_chunk_size = config["limits"]["max_chunk_size"]
# Get max chunks per file from config
config = load_runtime_config()
max_chunks_per_file = config["limits"]["max_chunks_per_file"]
chunks = []
try:
# Handle non-JSON files (like .dot, .txt, etc.)
if raw_path.suffix != '.json':
# Read as text and chunk if needed
with open(raw_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
# Check if file needs chunking
if len(content) <= max_chunk_size:
# Small enough, copy as-is
output_path = raw_path.parent.parent / 'readthis' / raw_path.name
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
f.write(content)
size = output_path.stat().st_size
print(f" [COPIED] {raw_path.name}{output_path.name} ({size:,} bytes)")
return [(output_path, size)]
else:
# Need to chunk text file
base_name = raw_path.stem
ext = raw_path.suffix
chunk_num = 0
position = 0
while position < len(content) and chunk_num < max_chunks_per_file:
chunk_num += 1
chunk_end = min(position + max_chunk_size, len(content))
chunk_content = content[position:chunk_end]
output_path = raw_path.parent.parent / 'readthis' / f"{base_name}_chunk{chunk_num:02d}{ext}"
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
f.write(chunk_content)
size = output_path.stat().st_size
chunks.append((output_path, size))
print(f" [CHUNKED] {raw_path.name}{output_path.name} ({size:,} bytes)")
position = chunk_end
if position < len(content):
print(f" [TRUNCATED] {raw_path.name} - stopped at {max_chunks_per_file} chunks")
return chunks
# Handle JSON files
with open(raw_path, 'r', encoding='utf-8') as f:
data = json.load(f)
# Check if file needs chunking
full_json = json.dumps(data, indent=2)
if len(full_json) <= max_chunk_size:
# Small enough, copy as-is
output_path = raw_path.parent.parent / 'readthis' / raw_path.name
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
f.write(full_json)
size = output_path.stat().st_size
print(f" [COPIED] {raw_path.name}{output_path.name} ({size:,} bytes)")
return [(output_path, size)]
# File needs chunking
base_name = raw_path.stem
ext = raw_path.suffix
# Handle different data structures
if isinstance(data, list):
# For lists, chunk by items
chunk_num = 0
current_chunk = []
current_size = 100 # Account for JSON structure overhead
for item in data:
item_json = json.dumps(item, indent=2)
item_size = len(item_json)
if current_size + item_size > max_chunk_size and current_chunk:
# Check chunk limit
if chunk_num >= max_chunks_per_file:
print(f" [TRUNCATED] {raw_path.name} - stopped at {max_chunks_per_file} chunks (would have created more)")
break
# Write current chunk
chunk_num += 1
output_path = raw_path.parent.parent / 'readthis' / f"{base_name}_chunk{chunk_num:02d}{ext}"
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(current_chunk, f, indent=2)
size = output_path.stat().st_size
chunks.append((output_path, size))
print(f" [CHUNKED] {raw_path.name}{output_path.name} ({size:,} bytes)")
# Start new chunk
current_chunk = [item]
current_size = item_size + 100
else:
current_chunk.append(item)
current_size += item_size
# Write final chunk (only if under limit)
if current_chunk and chunk_num < max_chunks_per_file:
chunk_num += 1
output_path = raw_path.parent.parent / 'readthis' / f"{base_name}_chunk{chunk_num:02d}{ext}"
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(current_chunk, f, indent=2)
size = output_path.stat().st_size
chunks.append((output_path, size))
print(f" [CHUNKED] {raw_path.name}{output_path.name} ({size:,} bytes)")
elif isinstance(data, dict):
# For dicts with lists (like findings, paths), chunk the lists
# Determine the correct key to chunk on
if base_name == 'taint_analysis':
# For taint analysis, we need to merge ALL findings into one list
# because they're split across multiple keys
if 'taint_paths' in data or 'all_rule_findings' in data:
# Merge all findings into a single list for chunking
all_taint_items = []
# Add taint paths
if 'taint_paths' in data:
for item in data['taint_paths']:
item['finding_type'] = 'taint_path'
all_taint_items.append(item)
# Add all rule findings
if 'all_rule_findings' in data:
for item in data['all_rule_findings']:
item['finding_type'] = 'rule_finding'
all_taint_items.append(item)
# Add infrastructure issues only if they're different from all_rule_findings
# (to avoid duplicates when they're the same list)
if 'infrastructure_issues' in data:
# Check if they're different objects (not the same list)
if data['infrastructure_issues'] is not data.get('all_rule_findings'):
# Only add if they're actually different content
infra_set = {json.dumps(item, sort_keys=True) for item in data['infrastructure_issues']}
rules_set = {json.dumps(item, sort_keys=True) for item in data.get('all_rule_findings', [])}
if infra_set != rules_set:
for item in data['infrastructure_issues']:
item['finding_type'] = 'infrastructure'
all_taint_items.append(item)
# Add paths (data flow paths) - these are often duplicates of taint_paths but may have extra info
if 'paths' in data:
# Check if different from taint_paths
paths_set = {json.dumps(item, sort_keys=True) for item in data['paths']}
taint_set = {json.dumps(item, sort_keys=True) for item in data.get('taint_paths', [])}
if paths_set != taint_set:
for item in data['paths']:
item['finding_type'] = 'path'
all_taint_items.append(item)
# Add vulnerabilities - these are the final analyzed vulnerabilities
if 'vulnerabilities' in data:
for item in data['vulnerabilities']:
item['finding_type'] = 'vulnerability'
all_taint_items.append(item)
# Create a new data structure with merged findings
data = {
'success': data.get('success', True),
'summary': data.get('summary', {}),
'total_vulnerabilities': data.get('total_vulnerabilities', len(all_taint_items)),
'sources_found': data.get('sources_found', 0),
'sinks_found': data.get('sinks_found', 0),
'merged_findings': all_taint_items
}
list_key = 'merged_findings'
else:
list_key = 'paths'
elif 'all_findings' in data:
# CRITICAL: FCE findings are pre-sorted by severity via finding_priority.py
# The order MUST be preserved during chunking to ensure critical issues
# appear in chunk01. DO NOT sort or shuffle these findings!
list_key = 'all_findings'
# Log for verification
if data.get(list_key):
first_items = data[list_key][:3] if len(data[list_key]) >= 3 else data[list_key]
severities = [item.get('severity', 'unknown') for item in first_items]
print(f"[EXTRACTION] Processing FCE with {len(data[list_key])} pre-sorted findings")
print(f"[EXTRACTION] First 3 severities: {severities}")
elif 'findings' in data:
list_key = 'findings'
elif 'vulnerabilities' in data:
list_key = 'vulnerabilities'
elif 'issues' in data:
list_key = 'issues'
elif 'edges' in data:
list_key = 'edges' # For call_graph.json and import_graph.json
elif 'nodes' in data:
list_key = 'nodes' # For graph files with nodes
elif 'taint_paths' in data:
list_key = 'taint_paths'
elif 'paths' in data:
list_key = 'paths'
elif 'dependencies' in data:
list_key = 'dependencies' # For deps.json
elif 'files' in data:
list_key = 'files' # For file lists
elif 'results' in data:
list_key = 'results' # For analysis results
else:
list_key = None
if list_key:
items = data.get(list_key, [])
# Extract minimal metadata (don't duplicate everything)
metadata = {}
for key in ['success', 'summary', 'total_vulnerabilities', 'chunk_info']:
if key in data:
metadata[key] = data[key]
# Calculate actual metadata size
metadata_json = json.dumps(metadata, indent=2)
metadata_size = len(metadata_json)
chunk_num = 0
chunk_items = []
current_size = metadata_size + 200 # Actual metadata size + bracket overhead
for item in items:
item_json = json.dumps(item, indent=2)
item_size = len(item_json)
if current_size + item_size > max_chunk_size and chunk_items:
# Check chunk limit
if chunk_num >= max_chunks_per_file:
print(f" [TRUNCATED] {raw_path.name} - stopped at {max_chunks_per_file} chunks (would have created more)")
break
# Write current chunk
chunk_num += 1
chunk_data = metadata.copy()
chunk_data[list_key] = chunk_items
chunk_data['chunk_info'] = {
'chunk_number': chunk_num,
'total_items_in_chunk': len(chunk_items),
'original_total_items': len(items),
'list_key': list_key,
'truncated': chunk_num >= max_chunks_per_file # Mark if this is the last allowed chunk
}
output_path = raw_path.parent.parent / 'readthis' / f"{base_name}_chunk{chunk_num:02d}{ext}"
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(chunk_data, f, indent=2)
size = output_path.stat().st_size
chunks.append((output_path, size))
print(f" [CHUNKED] {raw_path.name}{output_path.name} ({len(chunk_items)} items, {size:,} bytes)")
# Start new chunk
chunk_items = [item]
current_size = metadata_size + item_size + 200
else:
chunk_items.append(item)
current_size += item_size
# Write final chunk (only if under limit)
if chunk_items and chunk_num < max_chunks_per_file:
chunk_num += 1
chunk_data = metadata.copy()
chunk_data[list_key] = chunk_items
chunk_data['chunk_info'] = {
'chunk_number': chunk_num,
'total_items_in_chunk': len(chunk_items),
'original_total_items': len(items),
'list_key': list_key,
'truncated': False # This is the final chunk within limit
}
output_path = raw_path.parent.parent / 'readthis' / f"{base_name}_chunk{chunk_num:02d}{ext}"
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(chunk_data, f, indent=2)
size = output_path.stat().st_size
chunks.append((output_path, size))
print(f" [CHUNKED] {raw_path.name}{output_path.name} ({len(chunk_items)} items, {size:,} bytes)")
else:
# No recognized list key - shouldn't happen now with expanded list
# Log warning and copy as-is
print(f" [WARNING] No chunkable list found in {raw_path.name}, copying as-is")
output_path = raw_path.parent.parent / 'readthis' / raw_path.name
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2)
size = output_path.stat().st_size
chunks.append((output_path, size))
print(f" [COPIED] {raw_path.name}{output_path.name} ({size:,} bytes)")
return chunks
except Exception as e:
print(f" [ERROR] Failed to chunk {raw_path.name}: {e}")
return None # Return None to signal failure, not empty list
def _copy_as_is(raw_path: Path) -> Tuple[Optional[Path], int]:
"""Copy small files as-is or chunk if >65KB."""
chunks = _chunk_large_file(raw_path)
if chunks is None:
# Chunking failed
return None, -1 # Signal error with -1
elif chunks:
# Single chunk: return its (path, size); multiple chunks: no single path, so return (None, total_size)
return chunks[0] if len(chunks) == 1 else (None, sum(s for _, s in chunks))
return None, 0
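# Usage sketch for the chunker above (paths and sizes are illustrative):
#
#   chunks = _chunk_large_file(Path(".pf/raw/patterns.json"))
#   # small file -> [(Path(".pf/readthis/patterns.json"), 41_213)]
#   # large file -> [(Path(".pf/readthis/patterns_chunk01.json"), 64_980), ...]
#   # hard error -> None (callers must distinguish None from an empty list)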
def extract_all_to_readthis(root_path_str: str, budget_kb: int = 1500) -> bool:
"""Main entry point for extracting readthis chunks from raw data.
Pure courier model: every file in .pf/raw is copied or chunked with no
filtering or prioritization; the budget is tracked for reporting only
and is never enforced.
Args:
root_path_str: Root directory path as string
budget_kb: Reporting budget in KB for all readthis files (default 1500KB)
Returns:
True if extraction completed successfully, False otherwise
"""
root_path = Path(root_path_str)
raw_dir = root_path / ".pf" / "raw"
readthis_dir = root_path / ".pf" / "readthis"
print("\n" + "="*60)
print("[EXTRACTION] Smart extraction with 1MB budget")
print("="*60)
# Check if raw directory exists
if not raw_dir.exists():
print(f"[WARNING] Raw directory does not exist: {raw_dir}")
print("[INFO] No raw data to extract - skipping extraction phase")
return True
# Ensure readthis directory exists
try:
readthis_dir.mkdir(parents=True, exist_ok=True)
print(f"[OK] Readthis directory ready: {readthis_dir}")
except Exception as e:
print(f"[ERROR] Failed to create readthis directory: {e}")
return False
# Discover ALL files in raw directory dynamically (courier model)
raw_files = []
for file_path in raw_dir.iterdir():
if file_path.is_file():
raw_files.append(file_path.name)
print(f"[DISCOVERED] Found {len(raw_files)} files in raw directory")
# Pure courier model - no smart extraction, just chunking if needed
# Build extraction strategy dynamically
extraction_strategy = []
for filename in sorted(raw_files):
# All files get same treatment: chunk if needed
extraction_strategy.append((filename, 100, _copy_as_is))
total_budget = budget_kb * 1024 # Convert to bytes
total_used = 0
extracted_files = []
skipped_files = []
failed_files = [] # Track failures
print(f"[BUDGET] Total budget: {budget_kb}KB ({total_budget:,} bytes)")
print(f"[STRATEGY] Pure courier model - no filtering\n")
for filename, file_budget_kb, extractor in extraction_strategy:
raw_path = raw_dir / filename
if not raw_path.exists():
continue
print(f"[PROCESSING] {filename}")
# Just chunk everything - ignore budget for chunking
# The whole point is to break large files into manageable pieces
chunks = _chunk_large_file(raw_path)
if chunks is None:
# Chunking failed for this file
print(f" [FAILED] {filename} - chunking error")
failed_files.append(filename)
continue
if chunks:
for chunk_path, chunk_size in chunks:
# Budget is deliberately not enforced per chunk: the courier model
# delivers complete data, so any overrun is only reported in the summary.
total_used += chunk_size
extracted_files.append((chunk_path.name, chunk_size))
# Create extraction summary
summary = {
'extraction_timestamp': datetime.now().isoformat(),  # wall-clock time of this run
'budget_kb': budget_kb,
'total_used_bytes': total_used,
'total_used_kb': total_used // 1024,
'utilization_percent': (total_used / total_budget) * 100,
'files_extracted': len(extracted_files),
'files_skipped': len(skipped_files),
'files_failed': len(failed_files),
'extracted': [{'file': f, 'size': s} for f, s in extracted_files],
'skipped': skipped_files,
'failed': failed_files,
'strategy': 'Pure courier model - chunk if needed, no filtering'
}
summary_path = readthis_dir / 'extraction_summary.json'
with open(summary_path, 'w', encoding='utf-8') as f:
json.dump(summary, f, indent=2)
# Summary report
print("\n" + "="*60)
print("[EXTRACTION COMPLETE]")
print(f" Files extracted: {len(extracted_files)}")
print(f" Files skipped: {len(skipped_files)}")
print(f" Files failed: {len(failed_files)}")
print(f" Total size: {total_used:,} bytes ({total_used//1024}KB)")
print(f" Budget used: {(total_used/total_budget)*100:.1f}%")
print(f" Summary saved: {summary_path}")
# List what was extracted
print("\n[EXTRACTED FILES]")
for filename, size in extracted_files:
print(f" {filename:30} {size:8,} bytes ({size//1024:4}KB)")
if skipped_files:
print("\n[SKIPPED FILES]")
for filename in skipped_files:
print(f" {filename}")
if failed_files:
print("\n[FAILED FILES]")
for filename in failed_files:
print(f" {filename}")
print("\n[KEY INSIGHTS]")
print(" ✓ All findings preserved - no filtering")
print(" ✓ Pure courier model - no interpretation")
print(" ✓ Files chunked only if >65KB")
print(" ✓ Complete data for AI consumption")
print("="*60)
# Return False if any files failed, True only if all succeeded
if failed_files:
print(f"\n[ERROR] Extraction failed for {len(failed_files)} files")
return False
return True
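# Minimal invocation sketch (assumes .pf/raw has been populated by the pipeline):
#
#   ok = extract_all_to_readthis(".", budget_kb=1500)
#   # ok is False if any file failed to chunk; successful chunks are still written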

784
theauditor/fce.py Normal file
View File

@@ -0,0 +1,784 @@
"""Factual Correlation Engine - aggregates and correlates findings from all analysis tools."""
import json
import os
import re
import shlex
import sqlite3
import subprocess
from collections import defaultdict, deque
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from theauditor.test_frameworks import detect_test_framework
from theauditor.correlations import CorrelationLoader
def scan_all_findings(raw_dir: Path) -> list[dict[str, Any]]:
"""
Scan ALL raw outputs for structured findings with line-level detail.
Extract findings from JSON outputs with file, line, rule, and tool information.
"""
all_findings = []
for output_file in raw_dir.glob('*.json'):
if not output_file.is_file():
continue
tool_name = output_file.stem
try:
with open(output_file, 'r', encoding='utf-8') as f:
data = json.load(f)
# Handle different JSON structures based on tool
findings = []
# Standard findings structure (lint.json, patterns.json, etc.)
if isinstance(data, dict) and 'findings' in data:
findings = data['findings']
# Vulnerabilities structure
elif isinstance(data, dict) and 'vulnerabilities' in data:
findings = data['vulnerabilities']
# Taint analysis structure
elif isinstance(data, dict) and 'taint_paths' in data:
for path in data['taint_paths']:
# Create a finding for each taint path
if 'file' in path and 'line' in path:
findings.append({
'file': path['file'],
'line': path['line'],
'rule': f"taint-{path.get('sink_type', 'unknown')}",
'message': path.get('message', 'Taint path detected')
})
# Direct list of findings
elif isinstance(data, list):
findings = data
# RCA/test results structure
elif isinstance(data, dict) and 'failures' in data:
findings = data['failures']
# Process each finding
for finding in findings:
if isinstance(finding, dict):
# Ensure required fields exist
if 'file' in finding:
# Create standardized finding
standardized = {
'file': finding.get('file', ''),
'line': int(finding.get('line', 0)),
'rule': finding.get('rule', finding.get('code', finding.get('pattern', 'unknown'))),
'tool': finding.get('tool', tool_name),
'message': finding.get('message', ''),
'severity': finding.get('severity', 'warning')
}
all_findings.append(standardized)
except (json.JSONDecodeError, KeyError, TypeError):
# Skip files that can't be parsed as JSON or don't have expected structure
continue
except Exception:
# Skip files with other errors
continue
return all_findings
def run_tool(command: str, root_path: str, timeout: int = 600) -> tuple[int, str, str]:
"""Run build/test tool with timeout and capture output."""
try:
# Use deque as ring buffer to limit memory usage
max_lines = 10000
stdout_buffer = deque(maxlen=max_lines)
stderr_buffer = deque(maxlen=max_lines)
# Run command - safely split command string into arguments
cmd_args = shlex.split(command)
# Write directly to temp files to avoid buffer overflow
import tempfile
with tempfile.NamedTemporaryFile(mode='w+', delete=False, suffix='_stdout.txt') as out_tmp, \
tempfile.NamedTemporaryFile(mode='w+', delete=False, suffix='_stderr.txt') as err_tmp:
process = subprocess.Popen(
cmd_args,
cwd=root_path,
stdout=out_tmp,
stderr=err_tmp,
text=True,
env={**os.environ, "CI": "true"}, # Set CI env for tools
)
stdout_file = out_tmp.name
stderr_file = err_tmp.name
# Stream output with timeout
try:
process.communicate(timeout=timeout)
# Read back the outputs
with open(stdout_file, 'r') as f:
stdout = f.read()
with open(stderr_file, 'r') as f:
stderr = f.read()
# Clean up temp files
os.unlink(stdout_file)
os.unlink(stderr_file)
# Append any errors to the global error.log
if stderr:
error_log = Path(root_path) / ".pf" / "error.log"
error_log.parent.mkdir(parents=True, exist_ok=True)
with open(error_log, 'a') as f:
f.write(f"\n=== RCA Subprocess Error ({command[:50]}) ===\n")
f.write(f"Timestamp: {datetime.now().isoformat()}\n")
f.write(stderr)
f.write("\n")
# Store in buffers
stdout_buffer.extend(stdout.splitlines())
stderr_buffer.extend(stderr.splitlines())
except subprocess.TimeoutExpired:
process.kill()
# Clean up temp files so repeated timeouts do not leak them
os.unlink(stdout_file)
os.unlink(stderr_file)
return 124, "Process timed out", f"Command exceeded {timeout}s timeout"
# Join lines
stdout_text = "\n".join(stdout_buffer)
stderr_text = "\n".join(stderr_buffer)
return process.returncode, stdout_text, stderr_text
except Exception as e:
return 1, "", str(e)
def parse_typescript_errors(output: str) -> list[dict[str, Any]]:
"""Parse TypeScript/TSNode compiler errors."""
errors = []
# TypeScript error format: file:line:col - error CODE: message
pattern = (
r"(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+) - error (?P<code>[A-Z]+\d+): (?P<msg>.+)"
)
for match in re.finditer(pattern, output):
errors.append(
{
"tool": "tsc",
"file": match.group("file"),
"line": int(match.group("line")),
"column": int(match.group("col")),
"message": match.group("msg"),
"code": match.group("code"),
"category": "type_error",
}
)
return errors
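# Example of a line the pattern above matches (illustrative):
#
#   src/index.ts:14:7 - error TS2322: Type 'string' is not assignable to type 'number'.
#
# parsed as {"tool": "tsc", "file": "src/index.ts", "line": 14, "column": 7,
#            "code": "TS2322", "message": "Type 'string' is not assignable ...",
#            "category": "type_error"}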
def parse_jest_errors(output: str) -> list[dict[str, Any]]:
"""Parse Jest/Vitest test failures."""
errors = []
# Jest failed test: ● Test Suite Name test name
# Followed by stack trace: at Object.<anonymous> (file:line:col)
test_pattern = r"● (?P<testname>[^\n]+)"
stack_pattern = r"at .*? \((?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+)\)"
lines = output.splitlines()
for i, line in enumerate(lines):
test_match = re.match(test_pattern, line)
if test_match:
# Look for stack trace in next lines
for j in range(i + 1, min(i + 20, len(lines))):
stack_match = re.search(stack_pattern, lines[j])
if stack_match:
errors.append(
{
"tool": "jest",
"file": stack_match.group("file"),
"line": int(stack_match.group("line")),
"column": int(stack_match.group("col")),
"message": f"Test failed: {test_match.group('testname')}",
"category": "test_failure",
}
)
break
return errors
def parse_pytest_errors(output: str) -> list[dict[str, Any]]:
"""Parse pytest failures."""
errors = []
# Pytest error format varies, but typically:
# FAILED path/to/test.py::TestClass::test_method - AssertionError: message
# Or: E AssertionError: message
# path/to/file.py:42: AssertionError
failed_pattern = r"FAILED (?P<file>[^:]+)(?:::(?P<test>[^\s]+))? - (?P<msg>.+)"
error_pattern = r"^E\s+(?P<msg>.+)\n.*?(?P<file>[^:]+):(?P<line>\d+):"
for match in re.finditer(failed_pattern, output):
errors.append(
{
"tool": "pytest",
"file": match.group("file"),
"line": 0, # Line not in FAILED format
"message": match.group("msg"),
"category": "test_failure",
}
)
for match in re.finditer(error_pattern, output, re.MULTILINE):
errors.append(
{
"tool": "pytest",
"file": match.group("file"),
"line": int(match.group("line")),
"message": match.group("msg"),
"category": "test_failure",
}
)
return errors
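# Example lines the two patterns above match (illustrative):
#
#   FAILED tests/test_api.py::TestAuth::test_login - AssertionError: expected 200
#     -> file="tests/test_api.py", line=0 (the FAILED summary carries no line number)
#
#   E       AssertionError: expected 200
#   tests/test_api.py:42: AssertionError
#     -> file="tests/test_api.py", line=42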
def parse_python_compile_errors(output: str) -> list[dict[str, Any]]:
"""Parse Python compilation errors from py_compile output."""
errors = []
# Python compile error format:
# Traceback (most recent call last):
# File "path/to/file.py", line X, in <module>
# SyntaxError: invalid syntax
# Or: ModuleNotFoundError: No module named 'xxx'
# Parse traceback format
lines = output.splitlines()
for i, line in enumerate(lines):
# Look for File references in tracebacks
if 'File "' in line and '", line ' in line:
# Extract file and line number
match = re.match(r'.*File "([^"]+)", line (\d+)', line)
if match and i + 1 < len(lines):
file_path = match.group(1)
line_num = int(match.group(2))
# Look for the error type in following lines
for j in range(i + 1, min(i + 5, len(lines))):
if 'Error:' in lines[j]:
error_msg = lines[j].strip()
errors.append({
"tool": "py_compile",
"file": file_path,
"line": line_num,
"message": error_msg,
"category": "compile_error",
})
break
# Also catch simple error messages
if 'SyntaxError:' in line or 'ModuleNotFoundError:' in line or 'ImportError:' in line:
# Try to extract file info from previous lines
file_info = None
for j in range(max(0, i - 3), i):
if '***' in lines[j] and '.py' in lines[j]:
# py_compile format: *** path/to/file.py
file_match = re.match(r'\*\*\* (.+\.py)', lines[j])
if file_match:
file_info = file_match.group(1)
break
if file_info:
errors.append({
"tool": "py_compile",
"file": file_info,
"line": 0,
"message": line.strip(),
"category": "compile_error",
})
return errors
def parse_errors(output: str, tool_name: str) -> list[dict[str, Any]]:
"""Parse errors based on tool type."""
all_errors = []
# Try all parsers
all_errors.extend(parse_typescript_errors(output))
all_errors.extend(parse_jest_errors(output))
all_errors.extend(parse_pytest_errors(output))
all_errors.extend(parse_python_compile_errors(output))
return all_errors
def load_capsule(capsules_dir: str, file_hash: str) -> dict | None:
"""Load capsule by file hash."""
capsule_path = Path(capsules_dir) / f"{file_hash}.json"
if not capsule_path.exists():
return None
try:
with open(capsule_path) as f:
return json.load(f)
except json.JSONDecodeError:
return None
def correlate_failures(
errors: list[dict[str, Any]],
manifest_path: str,
workset_path: str,
capsules_dir: str,
db_path: str,
) -> list[dict[str, Any]]:
"""Correlate failures with capsules for factual enrichment."""
# Load manifest for hash lookup
file_hashes = {}
try:
with open(manifest_path) as f:
manifest = json.load(f)
for entry in manifest:
file_hashes[entry["path"]] = entry.get("sha256")
except (FileNotFoundError, json.JSONDecodeError):
pass
# Load workset
workset_files = set()
try:
with open(workset_path) as f:
workset = json.load(f)
workset_files = {p["path"] for p in workset.get("paths", [])}
except (FileNotFoundError, json.JSONDecodeError):
pass
# Correlate each error
for error in errors:
file = error.get("file", "")
# Load capsule if file in workset/manifest
if file in file_hashes:
file_hash = file_hashes[file]
capsule = load_capsule(capsules_dir, file_hash)
if capsule:
error["capsule"] = {
"path": capsule.get("path"),
"hash": capsule.get("sha256"),
"interfaces": capsule.get("interfaces", {}),
}
return errors
def generate_rca_json(failures: list[dict[str, Any]]) -> dict[str, Any]:
"""Generate RCA JSON output."""
return {
"completed_at": datetime.now(UTC).isoformat(),
"failures": failures,
}
def run_fce(
root_path: str = ".",
capsules_dir: str = "./.pf/capsules",
manifest_path: str = "manifest.json",
workset_path: str = "./.pf/workset.json",
db_path: str = "repo_index.db",
timeout: int = 600,
print_plan: bool = False,
) -> dict[str, Any]:
"""Run factual correlation engine - NO interpretation, just facts."""
try:
# Step A: Initialization
raw_dir = Path(root_path) / ".pf" / "raw"
results = {
"timestamp": datetime.now(UTC).isoformat(),
"all_findings": [],
"test_results": {},
"correlations": {}
}
# Step B: Phase 1 - Gather All Findings
if raw_dir.exists():
results["all_findings"] = scan_all_findings(raw_dir)
# Step B2: Load Optional Insights (ML predictions, etc.)
insights_dir = Path(root_path) / ".pf" / "insights"
if insights_dir.exists():
# Load ML suggestions if available
ml_path = insights_dir / "ml_suggestions.json"
if ml_path.exists():
try:
with open(ml_path) as f:
ml_data = json.load(f)
# Convert ML predictions to correlatable findings
# ML has separate lists for root causes, risk scores, etc.
for root_cause in ml_data.get("likely_root_causes", [])[:5]: # Top 5 root causes
if root_cause.get("score", 0) > 0.7:
results["all_findings"].append({
"file": root_cause["path"],
"line": 0, # ML doesn't provide line-level predictions
"rule": "ML_ROOT_CAUSE",
"tool": "ml",
"message": f"ML predicts {root_cause['score']:.1%} probability as root cause",
"severity": "high"
})
for risk_item in ml_data.get("risk", [])[:5]: # Top 5 risky files
if risk_item.get("score", 0) > 0.7:
results["all_findings"].append({
"file": risk_item["path"],
"line": 0,
"rule": f"ML_RISK_{int(risk_item['score']*100)}",
"tool": "ml",
"message": f"ML predicts {risk_item['score']:.1%} risk score",
"severity": "high" if risk_item.get("score", 0) > 0.85 else "medium"
})
except (json.JSONDecodeError, KeyError):
pass # ML insights are optional, continue if they fail
# Load taint severity insights if available
taint_severity_path = insights_dir / "taint_severity.json"
if taint_severity_path.exists():
try:
with open(taint_severity_path) as f:
taint_data = json.load(f)
# Add severity-enhanced taint findings
for item in taint_data.get("severity_analysis", []):
if item.get("severity") in ["critical", "high"]:
results["all_findings"].append({
"file": item.get("file", ""),
"line": item.get("line", 0),
"rule": f"TAINT_{item.get('vulnerability_type', 'UNKNOWN').upper().replace(' ', '_')}",
"tool": "taint-insights",
"message": f"{item.get('vulnerability_type')} with {item.get('severity')} severity",
"severity": item.get("severity")
})
except (json.JSONDecodeError, KeyError):
pass # Insights are optional
# Step C: Phase 2 - Execute Tests
# Detect test framework
framework_info = detect_test_framework(root_path)
tools = []
if framework_info["name"] != "unknown" and framework_info["cmd"]:
command = framework_info["cmd"]
# Add quiet flags
if "pytest" in command:
command = "pytest -q -p no:cacheprovider"
elif "npm test" in command:
command = "npm test --silent"
elif "unittest" in command:
command = "python -m unittest discover -q"
tools.append({
"name": framework_info["name"],
"command": command,
"type": "test"
})
# Check for build scripts
package_json = Path(root_path) / "package.json"
if package_json.exists():
try:
with open(package_json) as f:
package = json.load(f)
scripts = package.get("scripts", {})
if "build" in scripts:
tools.append({
"name": "npm build",
"command": "npm run build --silent",
"type": "build"
})
except json.JSONDecodeError:
pass
if print_plan:
print("Detected tools:")
for tool in tools:
print(f" - {tool['name']}: {tool['command']}")
return {"success": True, "printed_plan": True}
# If no test tools were detected, the loop below simply runs zero tools
# Run tools and collect failures
all_failures = []
for tool in tools:
print(f"Running {tool['name']}...")
exit_code, stdout, stderr = run_tool(tool["command"], root_path, timeout)
if exit_code != 0:
output = stdout + "\n" + stderr
errors = parse_errors(output, tool["name"])
# Special handling for pytest collection failures
if tool["name"] == "pytest" and exit_code == 2 and "ERROR collecting" in output:
print("Pytest collection failed. Falling back to Python compilation check...")
py_files = []
for py_file in Path(root_path).rglob("*.py"):
if "__pycache__" not in str(py_file) and not any(part.startswith('.') for part in py_file.parts):
py_files.append(str(py_file.relative_to(root_path)))
if py_files:
print(f"Checking {len(py_files)} Python files for compilation errors...")
compile_errors = []
for py_file in py_files[:50]:
# Strip only the trailing .py so directory names containing ".py" are not mangled
module_path = Path(py_file).as_posix().removesuffix('.py').replace('/', '.')
import_cmd = f'python3 -c "import {module_path}"'
comp_exit, comp_out, comp_err = run_tool(import_cmd, root_path, 10)
if comp_exit != 0:
comp_output = comp_out + "\n" + comp_err
if comp_output.strip():
error_lines = comp_output.strip().split('\n')
error_msg = "Import failed"
for line in error_lines:
if 'ModuleNotFoundError:' in line:
error_msg = line.strip()
break
elif 'ImportError:' in line:
error_msg = line.strip()
break
elif 'SyntaxError:' in line:
error_msg = line.strip()
break
elif 'AttributeError:' in line:
error_msg = line.strip()
break
compile_errors.append({
"tool": "py_import",
"file": py_file,
"line": 0,
"message": error_msg,
"category": "compile_error",
})
if compile_errors:
print(f"Found {len(compile_errors)} compilation errors")
errors.extend(compile_errors)
# If no errors parsed, create generic one
if not errors and exit_code != 0:
errors.append({
"tool": tool["name"],
"file": "unknown",
"line": 0,
"message": f"Tool failed with exit code {exit_code}",
"category": "runtime",
})
all_failures.extend(errors)
# Correlate with capsules
all_failures = correlate_failures(
all_failures,
Path(root_path) / manifest_path,
Path(root_path) / workset_path,
Path(root_path) / capsules_dir,
Path(root_path) / db_path,
)
# Store test results
results["test_results"] = {
"completed_at": datetime.now(UTC).isoformat(),
"failures": all_failures,
"tools_run": len(tools)
}
# Step D: Consolidate Evidence
consolidated_findings = results["all_findings"].copy()
# Add test failures to consolidated list
if all_failures:
for failure in all_failures:
if 'file' in failure and 'line' in failure:
consolidated_findings.append({
'file': failure['file'],
'line': int(failure.get('line', 0)),
'rule': failure.get('code', failure.get('category', 'test-failure')),
'tool': failure.get('tool', 'test'),
'message': failure.get('message', ''),
'severity': failure.get('severity', 'error')
})
# Step E: Phase 3 - Line-Level Correlation (Hotspots)
# Group findings by file:line
line_groups = defaultdict(list)
for finding in consolidated_findings:
if finding['line'] > 0:
key = f"{finding['file']}:{finding['line']}"
line_groups[key].append(finding)
# Find hotspots
hotspots = {}
for line_key, findings in line_groups.items():
tools_on_line = set(f['tool'] for f in findings)
if len(tools_on_line) > 1:
hotspots[line_key] = findings
# Enrich hotspots with symbol context
full_db_path = Path(root_path) / db_path
if hotspots and full_db_path.exists():
try:
conn = sqlite3.connect(str(full_db_path))
cursor = conn.cursor()
enriched_hotspots = {}
for line_key, findings in hotspots.items():
if ':' in line_key:
file_path, line_str = line_key.rsplit(':', 1)
try:
line_num = int(line_str)
query = """
SELECT name, type, line
FROM symbols
WHERE file = ?
AND line <= ?
AND type IN ('function', 'class')
ORDER BY line DESC
LIMIT 1
"""
cursor.execute(query, (file_path, line_num))
result = cursor.fetchone()
hotspot_data = {"findings": findings}
if result:
symbol_name, symbol_type, symbol_line = result
hotspot_data["in_symbol"] = f"{symbol_type}: {symbol_name}"
enriched_hotspots[line_key] = hotspot_data
except (ValueError, TypeError):
enriched_hotspots[line_key] = {"findings": findings}
else:
enriched_hotspots[line_key] = {"findings": findings}
conn.close()
hotspots = enriched_hotspots
except Exception:  # covers sqlite3.Error; fall back to unenriched hotspots
hotspots = {k: {"findings": v} for k, v in hotspots.items()}
else:
hotspots = {k: {"findings": v} for k, v in hotspots.items()}
# Store hotspots in correlations
results["correlations"]["hotspots"] = hotspots
results["correlations"]["total_findings"] = len(consolidated_findings)
results["correlations"]["total_lines_with_findings"] = len(line_groups)
results["correlations"]["total_hotspots"] = len(hotspots)
# Step F: Phase 4 - Factual Cluster Detection
factual_clusters = []
# Load correlation rules
correlation_loader = CorrelationLoader()
correlation_rules = correlation_loader.load_rules()
if correlation_rules and consolidated_findings:
# Group findings by file
findings_by_file = defaultdict(list)
for finding in consolidated_findings:
if 'file' in finding:
findings_by_file[finding['file']].append(finding)
# Check each file against each rule
for file_path, file_findings in findings_by_file.items():
for rule in correlation_rules:
all_facts_matched = True
for fact_index, fact in enumerate(rule.co_occurring_facts):
fact_matched = False
for finding in file_findings:
if rule.matches_finding(finding, fact_index):
fact_matched = True
break
if not fact_matched:
all_facts_matched = False
break
if all_facts_matched:
factual_clusters.append({
"name": rule.name,
"file": file_path,
"description": rule.description,
"confidence": rule.confidence
})
# Store factual clusters
results["correlations"]["factual_clusters"] = factual_clusters
# Step G: Finalization - Apply intelligent organization sorting
from theauditor.utils.finding_priority import sort_findings, normalize_severity
# CRITICAL: Normalize all severities BEFORE sorting
# This handles Docker's integer severity and ESLint's "error" strings
if results.get("all_findings"):
# First pass: normalize severity in-place
for finding in results["all_findings"]:
original_severity = finding.get("severity")
finding["severity"] = normalize_severity(original_severity)
# Debug log for unusual severities (helps catch new formats)
if original_severity and str(original_severity) != finding["severity"]:
if isinstance(original_severity, int):
# Expected for Docker, don't log
pass
else:
print(f"[FCE] Normalized severity: {original_severity} -> {finding['severity']}")
# Second pass: sort using centralized logic
results["all_findings"] = sort_findings(results["all_findings"])
# Log sorting results for verification
if results["all_findings"]:
print(f"[FCE] Sorted {len(results['all_findings'])} findings")
first = results["all_findings"][0]
last = results["all_findings"][-1] if len(results["all_findings"]) > 1 else first
print(f"[FCE] First: {first.get('severity')} from {first.get('tool')}")
print(f"[FCE] Last: {last.get('severity')} from {last.get('tool')}")
# Write results to JSON
raw_dir.mkdir(parents=True, exist_ok=True)
fce_path = raw_dir / "fce.json"
fce_path.write_text(json.dumps(results, indent=2))
# Count total failures/findings
failures_found = len(results.get("all_findings", []))
# Return success structure
return {
"success": True,
"failures_found": failures_found,
"output_files": [str(fce_path)],
"results": results
}
except Exception as e:
# Step H: Error Handling
return {
"success": False,
"failures_found": 0,
"error": str(e)
}
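# Usage sketch (illustrative): run after the pipeline has populated .pf/raw.
#
#   result = run_fce(root_path=".")
#   # result["results"]["correlations"]["hotspots"] maps "file:line" keys to
#   # findings flagged by more than one tool, optionally tagged with the
#   # enclosing symbol, e.g. {"in_symbol": "function: login"}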

View File

@@ -0,0 +1,608 @@
"""Framework detection for various languages and ecosystems."""
import json
import re
import glob
from pathlib import Path
from typing import Any
from theauditor.manifest_parser import ManifestParser
from theauditor.framework_registry import FRAMEWORK_REGISTRY
class FrameworkDetector:
"""Detects frameworks and libraries used in a project."""
# Note: Framework detection now uses the centralized FRAMEWORK_REGISTRY
# from framework_registry.py instead of the old FRAMEWORK_SIGNATURES
def __init__(self, project_path: Path, exclude_patterns: list[str] = None):
"""Initialize detector with project path.
Args:
project_path: Root directory of the project.
exclude_patterns: List of patterns to exclude from scanning.
"""
self.project_path = Path(project_path)
self.detected_frameworks = []
self.deps_cache = None
self.exclude_patterns = exclude_patterns or []
def detect_all(self) -> list[dict[str, Any]]:
"""Detect all frameworks in the project.
Returns:
List of detected framework info dictionaries.
"""
self.detected_frameworks = []
# Load TheAuditor's deps.json if available for better version info
self._load_deps_cache()
# Use unified manifest detection
self._detect_from_manifests()
# Also detect from monorepo workspaces (keep existing logic)
self._detect_from_workspaces()
# Store frameworks found in manifests for version lookup
manifest_frameworks = {}
for fw in self.detected_frameworks:
if fw["source"] != "imports":
key = (fw["framework"], fw["language"])
manifest_frameworks[key] = fw["version"]
# DISABLED: Import scanning causes too many false positives
# It detects framework names in strings, comments, and detection code itself
# Real dependencies should be in manifest files (package.json, requirements.txt, etc.)
# self._scan_source_imports()
# Check for framework-specific files
self._check_framework_files()
# Update versions for frameworks detected from framework files only (imports disabled)
for fw in self.detected_frameworks:
if fw["version"] == "unknown" and fw["source"] == "framework_files":
key = (fw["framework"], fw["language"])
# First try manifest frameworks
if key in manifest_frameworks:
fw["version"] = manifest_frameworks[key]
fw["source"] = f"{fw['source']} (version from manifest)"
# Then try deps cache
elif self.deps_cache and fw["framework"] in self.deps_cache:
cached_dep = self.deps_cache[fw["framework"]]
manager = cached_dep.get("manager", "")
# Match language to manager (py -> python, npm -> javascript)
if (fw["language"] == "python" and manager == "py") or \
(fw["language"] in ["javascript", "typescript"] and manager == "npm"):
fw["version"] = cached_dep.get("version", "") # Empty not unknown
if fw["version"] != "unknown":
fw["source"] = f"{fw['source']} (version from deps cache)"
# Deduplicate results, preferring entries with known versions
# Now we keep framework+language+path as unique key to support monorepos
seen = {}
for fw in self.detected_frameworks:
key = (fw["framework"], fw["language"], fw.get("path", "."))
if key not in seen:
seen[key] = fw
elif fw["version"] != "unknown" and seen[key]["version"] == "unknown":
# Replace with version that has a known version
seen[key] = fw
return list(seen.values())
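# Usage sketch (illustrative values):
#
#   detector = FrameworkDetector(Path("."))
#   detector.detect_all()
#   # -> [{"framework": "express", "version": "4.18.2", "language": "javascript",
#   #      "path": "backend", "source": "backend/package.json"}, ...]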
def _detect_from_manifests(self):
"""Unified manifest detection using registry and ManifestParser - now directory-aware."""
parser = ManifestParser()
# Manifest file names to search for
manifest_names = [
"pyproject.toml",
"package.json",
"requirements.txt",
"requirements-dev.txt",
"requirements-test.txt",
"setup.py",
"setup.cfg",
"Gemfile",
"Gemfile.lock",
"go.mod",
"pom.xml",
"build.gradle",
"build.gradle.kts",
"composer.json",
]
# Recursively find all manifest files in the project
manifests = {}
for manifest_name in manifest_names:
# Use rglob to find all instances of this manifest file
for manifest_path in self.project_path.rglob(manifest_name):
# Skip excluded directories
try:
relative_path = manifest_path.relative_to(self.project_path)
should_skip = False
# Check common skip directories
for part in relative_path.parts[:-1]: # Don't check the filename itself
if part in ["node_modules", "venv", ".venv", ".auditor_venv", "vendor",
"dist", "build", "__pycache__", ".git", ".tox", ".pytest_cache"]:
should_skip = True
break
if should_skip:
continue
# Calculate the directory path relative to project root
dir_path = manifest_path.parent.relative_to(self.project_path)
dir_str = str(dir_path) if dir_path != Path('.') else '.'
# Create a unique key for this manifest
manifest_key = f"{dir_str}/{manifest_name}" if dir_str != '.' else manifest_name
manifests[manifest_key] = manifest_path
except ValueError:
# File is outside project path somehow, skip it
continue
# Parse each manifest that exists
parsed_data = {}
for manifest_key, path in manifests.items():
if path.exists():
try:
# Extract just the filename for parsing logic
filename = path.name
if filename.endswith('.toml'):
parsed_data[manifest_key] = parser.parse_toml(path)
elif filename.endswith('.json'):
parsed_data[manifest_key] = parser.parse_json(path)
elif filename.endswith(('.yml', '.yaml')):
parsed_data[manifest_key] = parser.parse_yaml(path)
elif filename.endswith('.cfg'):
parsed_data[manifest_key] = parser.parse_ini(path)
elif filename.endswith('.txt'):
parsed_data[manifest_key] = parser.parse_requirements_txt(path)
elif filename == 'Gemfile' or filename == 'Gemfile.lock':
# Parse Gemfile as text for now
with open(path, 'r', encoding='utf-8') as f:
parsed_data[manifest_key] = f.read()
elif filename.endswith('.xml') or filename.endswith('.gradle') or filename.endswith('.kts') or filename.endswith('.mod'):
# Parse as text content for now
with open(path, 'r', encoding='utf-8') as f:
parsed_data[manifest_key] = f.read()
elif filename == 'setup.py':
with open(path, 'r', encoding='utf-8') as f:
parsed_data[manifest_key] = f.read()
except Exception as e:
print(f"Warning: Failed to parse {manifest_key}: {e}")
# Check each framework against all manifests
for fw_name, fw_config in FRAMEWORK_REGISTRY.items():
for required_manifest_name, search_configs in fw_config.get("detection_sources", {}).items():
# Check all parsed manifests that match this manifest type
for manifest_key, manifest_data in parsed_data.items():
# Check if this manifest matches the required type
if not manifest_key.endswith(required_manifest_name):
continue
# Extract the directory path from the manifest key
if '/' in manifest_key:
dir_path = '/'.join(manifest_key.split('/')[:-1])
else:
dir_path = '.'
if search_configs == "line_search":
# Simple text search for requirements.txt style or Gemfile
if isinstance(manifest_data, list):
# Requirements.txt parsed as list
for line in manifest_data:
version = parser.check_package_in_deps([line], fw_name)
if version:
self.detected_frameworks.append({
"framework": fw_name,
"version": version or "unknown",
"language": fw_config["language"],
"path": dir_path,
"source": manifest_key
})
break
elif isinstance(manifest_data, str):
# Text file content
if fw_name in manifest_data or (fw_config.get("package_pattern") and fw_config["package_pattern"] in manifest_data):
# Try to extract version
version = "unknown"
if fw_config.get("package_pattern"):
pattern = fw_config["package_pattern"]
else:
pattern = fw_name
# Try different version patterns
version_match = re.search(rf'{re.escape(pattern)}["\']?\s*[,:]?\s*["\']?([\d.]+)', manifest_data)
if not version_match:
version_match = re.search(rf'{re.escape(pattern)}\s+v([\d.]+)', manifest_data)
if not version_match:
version_match = re.search(rf'gem\s+["\']?{re.escape(pattern)}["\']?\s*,\s*["\']([\d.]+)["\']', manifest_data)
if version_match:
version = version_match.group(1)
self.detected_frameworks.append({
"framework": fw_name,
"version": version,
"language": fw_config["language"],
"path": dir_path,
"source": manifest_key
})
elif search_configs == "content_search":
# Content search for text-based files
if isinstance(manifest_data, str):
found = False
# Check package pattern first
if fw_config.get("package_pattern") and fw_config["package_pattern"] in manifest_data:
found = True
# Check content patterns
elif fw_config.get("content_patterns"):
for pattern in fw_config["content_patterns"]:
if pattern in manifest_data:
found = True
break
# Fallback to framework name
elif fw_name in manifest_data:
found = True
if found:
# Try to extract version
version = "unknown"
pattern = fw_config.get("package_pattern", fw_name)
version_match = re.search(rf'{re.escape(pattern)}.*?[>v]([\d.]+)', manifest_data, re.DOTALL)
if version_match:
version = version_match.group(1)
self.detected_frameworks.append({
"framework": fw_name,
"version": version,
"language": fw_config["language"],
"path": dir_path,
"source": manifest_key
})
elif search_configs == "exists":
# Just check if file exists (for go.mod with go test framework)
self.detected_frameworks.append({
"framework": fw_name,
"version": "unknown",
"language": fw_config["language"],
"path": dir_path,
"source": manifest_key
})
else:
# Structured search for JSON/TOML/YAML
for key_path in search_configs:
deps = parser.extract_nested_value(manifest_data, key_path)
if deps:
# Check if framework is in dependencies
package_name = fw_config.get("package_pattern", fw_name)
version = parser.check_package_in_deps(deps, package_name)
if version:
self.detected_frameworks.append({
"framework": fw_name,
"version": version,
"language": fw_config["language"],
"path": dir_path,
"source": manifest_key
})
break
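# Hypothetical FRAMEWORK_REGISTRY entry consistent with the lookups above
# (the real entries live in framework_registry.py, which is not shown here;
# key paths and file markers below are illustrative assumptions):
#
#   "flask": {
#       "language": "python",
#       "package_pattern": "flask",
#       "detection_sources": {
#           "requirements.txt": "line_search",
#           "pyproject.toml": [["project", "dependencies"]],
#       },
#       "file_markers": ["app.py"],
#   }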
def _detect_from_workspaces(self):
"""Detect frameworks from monorepo workspace packages."""
# This preserves the existing monorepo detection logic
package_json = self.project_path / "package.json"
if not package_json.exists():
return
parser = ManifestParser()
try:
data = parser.parse_json(package_json)
# Check for workspaces field (Yarn/npm workspaces)
workspaces = data.get("workspaces", [])
# Handle different workspace formats
if isinstance(workspaces, dict):
# npm 7+ format: {"packages": ["packages/*"]}
workspaces = workspaces.get("packages", [])
if workspaces and isinstance(workspaces, list):
# This is a monorepo - check workspace packages
for pattern in workspaces:
# Convert workspace pattern to absolute path pattern
abs_pattern = str(self.project_path / pattern)
# Handle glob patterns
if "*" in abs_pattern:
matched_paths = glob.glob(abs_pattern)
for matched_path in matched_paths:
matched_dir = Path(matched_path)
if matched_dir.is_dir():
workspace_pkg = matched_dir / "package.json"
if workspace_pkg.exists():
# Parse and check this workspace package
self._check_workspace_package(workspace_pkg, parser)
else:
# Direct path without glob
workspace_dir = self.project_path / pattern
if workspace_dir.is_dir():
workspace_pkg = workspace_dir / "package.json"
if workspace_pkg.exists():
self._check_workspace_package(workspace_pkg, parser)
except Exception as e:
print(f"Warning: Failed to check workspaces: {e}")
def _check_workspace_package(self, pkg_path: Path, parser: ManifestParser):
"""Check a single workspace package.json for frameworks."""
try:
data = parser.parse_json(pkg_path)
# Check dependencies
all_deps = {}
if "dependencies" in data:
all_deps.update(data["dependencies"])
if "devDependencies" in data:
all_deps.update(data["devDependencies"])
# Check each JavaScript framework
for fw_name, fw_config in FRAMEWORK_REGISTRY.items():
if fw_config["language"] != "javascript":
continue
package_name = fw_config.get("package_pattern", fw_name)
if package_name in all_deps:
version = all_deps[package_name]
# Clean version
version = re.sub(r'^[~^>=<]+', '', str(version)).strip()
# Calculate relative path for path field
try:
rel_path = pkg_path.parent.relative_to(self.project_path)
path = str(rel_path).replace("\\", "/") if rel_path != Path('.') else '.'
source = str(pkg_path.relative_to(self.project_path)).replace("\\", "/")
except ValueError:
path = '.'
source = str(pkg_path)
self.detected_frameworks.append({
"framework": fw_name,
"version": version,
"language": "javascript",
"path": path,
"source": source
})
except Exception as e:
print(f"Warning: Failed to parse workspace package {pkg_path}: {e}")
# Per-workspace package.json detection lives in _check_workspace_package above;
# general manifest detection moved to _detect_from_manifests
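
For reference, a minimal sketch of the workspace-pattern expansion performed by `_detect_from_workspaces`, assuming a hypothetical monorepo layout (pathlib's `glob` stands in for `glob.glob` only to keep the sketch self-contained):

```python
from pathlib import Path

def expand_workspace_patterns(root: Path, workspaces) -> list[Path]:
    # npm 7+ allows {"packages": [...]} in place of a plain list
    if isinstance(workspaces, dict):
        workspaces = workspaces.get("packages", [])
    found = []
    for pattern in workspaces or []:
        # Path.glob handles both "packages/*" and literal paths like "apps/web"
        for candidate in root.glob(pattern):
            pkg = candidate / "package.json"
            if candidate.is_dir() and pkg.exists():
                found.append(pkg)
    return found

# e.g. expand_workspace_patterns(Path("."), {"packages": ["packages/*", "apps/web"]})
```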
def _scan_source_imports(self):
"""Scan source files for framework imports."""
# Limit scanning to avoid performance issues
max_files = 100
files_scanned = 0
# Language file extensions
lang_extensions = {
".py": "python",
".js": "javascript",
".jsx": "javascript",
".ts": "javascript",
".tsx": "javascript",
".go": "go",
".java": "java",
".rb": "ruby",
".php": "php",
}
for ext, language in lang_extensions.items():
if files_scanned >= max_files:
break
for file_path in self.project_path.rglob(f"*{ext}"):
if files_scanned >= max_files:
break
# Skip node_modules, venv, etc.
if any(
part in file_path.parts
for part in ["node_modules", "venv", ".venv", ".auditor_venv", "vendor", "dist", "build", "__pycache__", ".git"]
):
continue
# Check exclude patterns
relative_path = file_path.relative_to(self.project_path)
should_skip = False
for pattern in self.exclude_patterns:
# Handle directory patterns
if pattern.endswith('/'):
dir_pattern = pattern.rstrip('/')
if str(relative_path).startswith(dir_pattern + '/') or str(relative_path).startswith(dir_pattern + '\\'):
should_skip = True
break
# Handle glob patterns
elif '*' in pattern:
from fnmatch import fnmatch
if fnmatch(str(relative_path), pattern):
should_skip = True
break
# Handle exact matches
elif str(relative_path) == pattern:
should_skip = True
break
if should_skip:
continue
files_scanned += 1
try:
with open(file_path, encoding="utf-8", errors="ignore") as f:
content = f.read()
# Check frameworks from registry
for fw_name, fw_config in FRAMEWORK_REGISTRY.items():
# Only check frameworks for this language
if fw_config["language"] != language:
continue
if "import_patterns" in fw_config:
for import_pattern in fw_config["import_patterns"]:
if import_pattern in content:
# Check if not already detected in this directory
file_dir = file_path.parent.relative_to(self.project_path)
dir_str = str(file_dir).replace("\\", "/") if file_dir != Path('.') else '.'
if not any(
fw["framework"] == fw_name and fw["language"] == language and fw.get("path", ".") == dir_str
for fw in self.detected_frameworks
):
self.detected_frameworks.append(
{
"framework": fw_name,
"version": "unknown",
"language": language,
"path": dir_str,
"source": "imports",
}
)
break
except Exception:
# Skip files that can't be read
continue
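
The exclusion logic above handles three pattern styles; here is a self-contained sketch of the same dispatch (the asserts use made-up paths), including the caveat that `fnmatch`'s `*` also crosses `/`:

```python
from fnmatch import fnmatch

def is_excluded(relative_path: str, exclude_patterns: list[str]) -> bool:
    for pattern in exclude_patterns:
        if pattern.endswith('/'):      # directory prefix
            if relative_path.startswith(pattern.rstrip('/') + '/'):
                return True
        elif '*' in pattern:           # glob
            if fnmatch(relative_path, pattern):
                return True
        elif relative_path == pattern: # exact match
            return True
    return False

assert is_excluded("dist/bundle.js", ["dist/"])
assert is_excluded("vendor/lib.min.js", ["*.min.js"])  # fnmatch's '*' matches '/' too
assert not is_excluded("src/app.py", ["dist/", "*.min.js"])
```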
def _check_framework_files(self):
"""Check for framework-specific files."""
# Check all frameworks in registry for file markers
for fw_name, fw_config in FRAMEWORK_REGISTRY.items():
if "file_markers" in fw_config:
for file_marker in fw_config["file_markers"]:
# Handle wildcard patterns
if "*" in file_marker:
# Use glob for wildcard patterns
import glob
pattern = str(self.project_path / file_marker)
if glob.glob(pattern):
# Check if not already detected
if not any(
fw["framework"] == fw_name and fw["language"] == fw_config["language"]
for fw in self.detected_frameworks
):
self.detected_frameworks.append(
{
"framework": fw_name,
"version": "unknown",
"language": fw_config["language"],
"path": ".", # Framework files typically at root
"source": "framework_files",
}
)
break
else:
# Direct file path
if (self.project_path / file_marker).exists():
# Check if not already detected
if not any(
fw["framework"] == fw_name and fw["language"] == fw_config["language"]
for fw in self.detected_frameworks
):
self.detected_frameworks.append(
{
"framework": fw_name,
"version": "unknown",
"language": fw_config["language"],
"path": ".", # Framework files typically at root
"source": "framework_files",
}
)
break
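
A compact sketch of the file-marker probe above, for a hypothetical project root: wildcard markers (`*.vue`) go through `glob`, literal markers (`manage.py`, `artisan`) through a direct existence check:

```python
import glob
from pathlib import Path

def has_marker(project_path: Path, file_marker: str) -> bool:
    if "*" in file_marker:
        return bool(glob.glob(str(project_path / file_marker)))
    return (project_path / file_marker).exists()

# has_marker(Path("."), "*.vue")     -> Vue single-file components present?
# has_marker(Path("."), "manage.py") -> Django entry point present?
```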
def _load_deps_cache(self):
"""Load TheAuditor's deps.json if available for version info."""
deps_file = self.project_path / ".pf" / "deps.json"
if deps_file.exists():
try:
with open(deps_file) as f:
data = json.load(f)
self.deps_cache = {}
# Handle both old format (list) and new format (dict with "dependencies" key)
if isinstance(data, list):
deps_list = data
else:
deps_list = data.get("dependencies", [])
for dep in deps_list:
# Store by name for quick lookup
self.deps_cache[dep["name"]] = dep
except Exception as e:
# Log the error but continue without version info
print(f"Warning: Could not load deps cache: {e}")
def format_table(self) -> str:
"""Format detected frameworks as a table.
Returns:
Formatted table string.
"""
if not self.detected_frameworks:
return "No frameworks detected."
lines = []
lines.append("FRAMEWORK LANGUAGE PATH VERSION SOURCE")
lines.append("-" * 80)
imports_only = []
for fw in self.detected_frameworks:
framework = fw["framework"][:18].ljust(18)
language = fw["language"][:12].ljust(12)
path = fw.get("path", ".")[:15].ljust(15)
version = fw["version"][:15].ljust(15)
source = fw["source"]
lines.append(f"{framework} {language} {path} {version} {source}")
# Track if any are from imports only
if fw["source"] == "imports" and fw["version"] == "unknown":
imports_only.append(fw["framework"])
# Add note if frameworks detected from imports without versions
if imports_only:
lines.append("\n" + "="*60)
lines.append("NOTE: Frameworks marked with 'imports' source were detected from")
lines.append("import statements in the codebase (possibly test files) but are")
lines.append("not listed as dependencies. Version shown as 'unknown' because")
lines.append("they are not in package.json, pyproject.toml, or requirements.txt.")
return "\n".join(lines)
def to_json(self) -> str:
"""Export detected frameworks to JSON.
Returns:
JSON string.
"""
return json.dumps(self.detected_frameworks, indent=2, sort_keys=True)
def save_to_file(self, output_path: Path) -> None:
"""Save detected frameworks to a JSON file.
Args:
output_path: Path where the JSON file should be saved.
"""
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text(self.to_json())
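
The exported file is a plain JSON array of detection records, so any downstream consumer can reload and filter it; a hypothetical example (the output path is illustrative, not mandated by this module):

```python
import json
from pathlib import Path

records = json.loads(Path(".pf/frameworks.json").read_text())  # illustrative path
python_frameworks = sorted({r["framework"] for r in records if r["language"] == "python"})
unversioned = [r for r in records if r["version"] == "unknown"]
```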

View File

@@ -0,0 +1,549 @@
"""Registry of framework detection patterns and test framework configurations."""
# Framework detection registry - defines where to find each framework
FRAMEWORK_REGISTRY = {
# Python frameworks
"django": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["tool", "setuptools", "install_requires"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from django", "import django"],
"file_markers": ["manage.py", "wsgi.py"],
},
"flask": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from flask", "import flask"],
},
"fastapi": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from fastapi", "import fastapi"],
},
"pyramid": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from pyramid", "import pyramid"],
},
"tornado": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from tornado", "import tornado"],
},
"bottle": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from bottle", "import bottle"],
},
"aiohttp": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from aiohttp", "import aiohttp"],
},
"sanic": {
"language": "python",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "*", "dependencies"],
["tool", "pdm", "dependencies"],
["project", "optional-dependencies", "*"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"setup.py": "content_search",
"setup.cfg": ["options", "install_requires"],
},
"import_patterns": ["from sanic", "import sanic"],
},
# JavaScript/TypeScript frameworks
"express": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"import_patterns": ["express", "require('express')", "from 'express'"],
},
"nestjs": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"package_pattern": "@nestjs/core",
"import_patterns": ["@nestjs"],
},
"next": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"import_patterns": ["next/", "from 'next'"],
},
"react": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"import_patterns": ["react", "from 'react'", "React"],
},
"vue": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"import_patterns": ["vue", "from 'vue'"],
"file_markers": ["*.vue"],
},
"angular": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"package_pattern": "@angular/core",
"import_patterns": ["@angular"],
"file_markers": ["angular.json"],
},
"fastify": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"import_patterns": ["fastify"],
},
"koa": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"import_patterns": ["koa", "require('koa')"],
},
"vite": {
"language": "javascript",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"import_patterns": ["vite"],
"config_files": ["vite.config.js", "vite.config.ts"],
},
# PHP frameworks
"laravel": {
"language": "php",
"detection_sources": {
"composer.json": [
["require"],
["require-dev"],
],
},
"package_pattern": "laravel/framework",
"file_markers": ["artisan", "bootstrap/app.php"],
},
"symfony": {
"language": "php",
"detection_sources": {
"composer.json": [
["require"],
["require-dev"],
],
},
"package_pattern": "symfony/framework-bundle",
"file_markers": ["bin/console", "config/bundles.php"],
},
"slim": {
"language": "php",
"detection_sources": {
"composer.json": [
["require"],
["require-dev"],
],
},
"package_pattern": "slim/slim",
},
"lumen": {
"language": "php",
"detection_sources": {
"composer.json": [
["require"],
["require-dev"],
],
},
"package_pattern": "laravel/lumen-framework",
"file_markers": ["artisan"],
},
"codeigniter": {
"language": "php",
"detection_sources": {
"composer.json": [
["require"],
["require-dev"],
],
},
"package_pattern": "codeigniter4/framework",
"file_markers": ["spark"],
},
# Go frameworks
"gin": {
"language": "go",
"detection_sources": {
"go.mod": "content_search",
},
"package_pattern": "github.com/gin-gonic/gin",
"import_patterns": ["github.com/gin-gonic/gin"],
},
"echo": {
"language": "go",
"detection_sources": {
"go.mod": "content_search",
},
"package_pattern": "github.com/labstack/echo",
"import_patterns": ["github.com/labstack/echo"],
},
"fiber": {
"language": "go",
"detection_sources": {
"go.mod": "content_search",
},
"package_pattern": "github.com/gofiber/fiber",
"import_patterns": ["github.com/gofiber/fiber"],
},
"beego": {
"language": "go",
"detection_sources": {
"go.mod": "content_search",
},
"package_pattern": "github.com/beego/beego",
"import_patterns": ["github.com/beego/beego"],
},
"chi": {
"language": "go",
"detection_sources": {
"go.mod": "content_search",
},
"package_pattern": "github.com/go-chi/chi",
"import_patterns": ["github.com/go-chi/chi"],
},
"gorilla": {
"language": "go",
"detection_sources": {
"go.mod": "content_search",
},
"package_pattern": "github.com/gorilla/mux",
"import_patterns": ["github.com/gorilla/mux"],
},
# Java frameworks
"spring": {
"language": "java",
"detection_sources": {
"pom.xml": "content_search",
"build.gradle": "content_search",
"build.gradle.kts": "content_search",
},
"package_pattern": "spring",
"content_patterns": ["spring-boot", "springframework"],
},
"micronaut": {
"language": "java",
"detection_sources": {
"pom.xml": "content_search",
"build.gradle": "content_search",
"build.gradle.kts": "content_search",
},
"package_pattern": "io.micronaut",
"content_patterns": ["io.micronaut"],
},
"quarkus": {
"language": "java",
"detection_sources": {
"pom.xml": "content_search",
"build.gradle": "content_search",
"build.gradle.kts": "content_search",
},
"package_pattern": "io.quarkus",
"content_patterns": ["io.quarkus"],
},
"dropwizard": {
"language": "java",
"detection_sources": {
"pom.xml": "content_search",
"build.gradle": "content_search",
"build.gradle.kts": "content_search",
},
"package_pattern": "io.dropwizard",
"content_patterns": ["io.dropwizard"],
},
"play": {
"language": "java",
"detection_sources": {
"build.sbt": "content_search",
"build.gradle": "content_search",
},
"package_pattern": "com.typesafe.play",
"content_patterns": ["com.typesafe.play"],
},
# Ruby frameworks
"rails": {
"language": "ruby",
"detection_sources": {
"Gemfile": "line_search",
"Gemfile.lock": "content_search",
},
"package_pattern": "rails",
"file_markers": ["Rakefile", "config.ru", "bin/rails"],
},
"sinatra": {
"language": "ruby",
"detection_sources": {
"Gemfile": "line_search",
"Gemfile.lock": "content_search",
},
"package_pattern": "sinatra",
},
"hanami": {
"language": "ruby",
"detection_sources": {
"Gemfile": "line_search",
"Gemfile.lock": "content_search",
},
"package_pattern": "hanami",
},
"grape": {
"language": "ruby",
"detection_sources": {
"Gemfile": "line_search",
"Gemfile.lock": "content_search",
},
"package_pattern": "grape",
},
}
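
A registry like this is queried rather than executed; for example, listing every detectable framework for one language together with the package name actually searched for (runnable against the dict above):

```python
def frameworks_for_language(language: str) -> dict[str, str]:
    # Map framework name -> package pattern searched in manifests
    return {
        name: config.get("package_pattern", name)
        for name, config in FRAMEWORK_REGISTRY.items()
        if config["language"] == language
    }

# frameworks_for_language("go")
# -> {'gin': 'github.com/gin-gonic/gin', 'echo': 'github.com/labstack/echo', ...}
```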
# Test framework detection registry
TEST_FRAMEWORK_REGISTRY = {
"pytest": {
"language": "python",
"command": "pytest -q -p no:cacheprovider",
"detection_sources": {
"pyproject.toml": [
["project", "dependencies"],
["project", "optional-dependencies", "test"],
["project", "optional-dependencies", "dev"],
["project", "optional-dependencies", "tests"],
["tool", "poetry", "dependencies"],
["tool", "poetry", "group", "dev", "dependencies"],
["tool", "poetry", "group", "test", "dependencies"],
["tool", "poetry", "dev-dependencies"],
["tool", "pdm", "dev-dependencies"],
["tool", "hatch", "envs", "default", "dependencies"],
],
"requirements.txt": "line_search",
"requirements-dev.txt": "line_search",
"requirements-test.txt": "line_search",
"setup.cfg": ["options", "tests_require"],
"setup.py": "content_search",
"tox.ini": "content_search",
},
"config_files": ["pytest.ini", ".pytest.ini", "pyproject.toml"],
"config_sections": {
"pyproject.toml": [
["tool", "pytest"],
["tool", "pytest", "ini_options"],
],
"setup.cfg": [
["tool:pytest"],
["pytest"],
],
},
},
"unittest": {
"language": "python",
"command": "python -m unittest discover -q",
"import_patterns": ["import unittest", "from unittest"],
"file_patterns": ["test*.py", "*_test.py"],
},
"jest": {
"language": "javascript",
"command": "npm test --silent",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"config_files": ["jest.config.js", "jest.config.ts", "jest.config.json"],
"config_sections": {
"package.json": [["jest"]],
},
"script_patterns": ["jest"],
},
"vitest": {
"language": "javascript",
"command": "npm test --silent",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"config_files": ["vitest.config.js", "vitest.config.ts", "vite.config.js", "vite.config.ts"],
"script_patterns": ["vitest"],
},
"mocha": {
"language": "javascript",
"command": "npm test --silent",
"detection_sources": {
"package.json": [
["dependencies"],
["devDependencies"],
],
},
"config_files": [".mocharc.js", ".mocharc.json", ".mocharc.yaml", ".mocharc.yml"],
"script_patterns": ["mocha"],
},
"go": {
"language": "go",
"command": "go test ./...",
"file_patterns": ["*_test.go"],
"detection_sources": {
"go.mod": "exists",
},
},
"junit": {
"language": "java",
"command_maven": "mvn test",
"command_gradle": "gradle test",
"detection_sources": {
"pom.xml": "content_search",
"build.gradle": "content_search",
"build.gradle.kts": "content_search",
},
"content_patterns": ["junit", "testImplementation"],
"import_patterns": ["import org.junit"],
"file_patterns": ["*Test.java", "Test*.java"],
},
"rspec": {
"language": "ruby",
"command": "rspec",
"detection_sources": {
"Gemfile": "line_search",
"Gemfile.lock": "content_search",
},
"config_files": [".rspec", "spec/spec_helper.rb"],
"directory_markers": ["spec/"],
},
}
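
One wrinkle worth calling out: JUnit is the only entry without a single `command` key, carrying `command_maven`/`command_gradle` instead, so resolution needs the detected build tool. A sketch of that lookup:

```python
def test_command(framework: str, build_tool: str | None = None) -> str | None:
    config = TEST_FRAMEWORK_REGISTRY.get(framework, {})
    if framework == "junit":
        # JUnit runs through the build tool, not a standalone runner
        return config.get(f"command_{build_tool}") if build_tool else None
    return config.get("command")

assert test_command("pytest") == "pytest -q -p no:cacheprovider"
assert test_command("junit", "maven") == "mvn test"
assert test_command("junit") is None
```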

45 theauditor/graph/__init__.py Normal file
View File

@@ -0,0 +1,45 @@
"""Graph package - dependency and call graph functionality.
Core modules (always available):
- analyzer: Pure graph algorithms (cycles, paths, layers)
- builder: Graph construction from source code
- store: SQLite persistence
Optional modules:
- insights: Interpretive metrics (health scores, recommendations, hotspots)
"""
# Core exports (always available)
from .analyzer import XGraphAnalyzer
from .builder import XGraphBuilder, GraphNode, GraphEdge, Cycle, Hotspot, ImpactAnalysis
from .store import XGraphStore
from .visualizer import GraphVisualizer
# Optional insights module
try:
from .insights import GraphInsights, check_insights_available, create_insights
INSIGHTS_AVAILABLE = True
except ImportError:
# Insights module is optional - similar to ml.py
INSIGHTS_AVAILABLE = False
GraphInsights = None
check_insights_available = lambda: False
create_insights = lambda weights=None: None
__all__ = [
# Core classes (always available)
"XGraphBuilder",
"XGraphAnalyzer",
"XGraphStore",
"GraphVisualizer",
"GraphNode",
"GraphEdge",
"Cycle",
"Hotspot",
"ImpactAnalysis",
# Optional insights
"GraphInsights",
"INSIGHTS_AVAILABLE",
"check_insights_available",
"create_insights",
]
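
Callers are expected to gate on the flag rather than re-attempting the import themselves; a minimal consumer sketch (assuming the package is installed as `theauditor`):

```python
from theauditor.graph import INSIGHTS_AVAILABLE, create_insights

insights = create_insights() if INSIGHTS_AVAILABLE else None
if insights is None:
    # Core graph algorithms still work; only interpretive metrics are absent.
    print("graph insights unavailable")
```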

421 theauditor/graph/analyzer.py Normal file
View File

@@ -0,0 +1,421 @@
"""Graph analyzer module - pure graph algorithms for dependency and call graphs.
This module provides ONLY non-interpretive graph algorithms:
- Cycle detection (DFS)
- Shortest path finding (BFS)
- Layer identification (topological sort)
- Impact analysis (graph traversal)
- Statistical summaries (counts and grouping)
For interpretive metrics like health scores, recommendations, and weighted
rankings, see the optional graph.insights module.
"""
from collections import defaultdict, deque
from pathlib import Path
from typing import Any
class XGraphAnalyzer:
"""Analyze cross-project dependency and call graphs using pure algorithms."""
def detect_cycles(self, graph: dict[str, Any]) -> list[dict[str, Any]]:
"""
Detect cycles in the dependency graph using DFS.
This is a pure graph algorithm that returns raw cycle data
without any interpretation or scoring.
Args:
graph: Graph with 'nodes' and 'edges' keys
Returns:
List of cycles, each with nodes and size
"""
# Build adjacency list
adj = defaultdict(list)
for edge in graph.get("edges", []):
adj[edge["source"]].append(edge["target"])
# Track visited nodes and recursion stack
visited = set()
rec_stack = set()
cycles = []
def dfs(node: str, path: list[str]) -> None:
"""DFS to detect cycles."""
visited.add(node)
rec_stack.add(node)
path.append(node)
for neighbor in adj[node]:
if neighbor not in visited:
dfs(neighbor, path.copy())
elif neighbor in rec_stack:
# Found a cycle
cycle_start = path.index(neighbor)
cycle_nodes = path[cycle_start:] + [neighbor]
cycles.append({
"nodes": cycle_nodes,
"size": len(cycle_nodes) - 1, # Don't count repeated node
})
rec_stack.remove(node)
# Run DFS from all unvisited nodes
for node in graph.get("nodes", []):
node_id = node["id"]
if node_id not in visited:
dfs(node_id, [])
# Sort cycles by size (largest first)
cycles.sort(key=lambda c: c["size"], reverse=True)
return cycles
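
Usage sketch on a three-node toy graph with one cycle (a -> b -> c -> a); the repeated node closes the loop, and `size` excludes it:

```python
analyzer = XGraphAnalyzer()
graph = {
    "nodes": [{"id": "a"}, {"id": "b"}, {"id": "c"}],
    "edges": [
        {"source": "a", "target": "b"},
        {"source": "b", "target": "c"},
        {"source": "c", "target": "a"},
    ],
}
print(analyzer.detect_cycles(graph))
# [{'nodes': ['a', 'b', 'c', 'a'], 'size': 3}]
```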
def impact_of_change(
self,
targets: list[str],
import_graph: dict[str, Any],
call_graph: dict[str, Any] | None = None,
max_depth: int = 3,
) -> dict[str, Any]:
"""
Calculate the impact of changing target files using graph traversal.
This is a pure graph algorithm that finds affected nodes
without interpreting or scoring the impact.
Args:
targets: List of file/module IDs that will change
import_graph: Import/dependency graph
call_graph: Optional call graph
max_depth: Maximum traversal depth
Returns:
Raw impact data with upstream and downstream effects
"""
# Build adjacency lists
upstream = defaultdict(list) # Who depends on X
downstream = defaultdict(list) # What X depends on
for edge in import_graph.get("edges", []):
downstream[edge["source"]].append(edge["target"])
upstream[edge["target"]].append(edge["source"])
if call_graph:
for edge in call_graph.get("edges", []):
downstream[edge["source"]].append(edge["target"])
upstream[edge["target"]].append(edge["source"])
# Find upstream impact (what depends on targets)
upstream_impact = set()
to_visit = deque((t, 0) for t in targets)
visited = set()
while to_visit:
node, depth = to_visit.popleft()
if node in visited or depth >= max_depth:
continue
visited.add(node)
for dependent in upstream[node]:
upstream_impact.add(dependent)
to_visit.append((dependent, depth + 1))
# Find downstream impact (what targets depend on)
downstream_impact = set()
to_visit = deque((t, 0) for t in targets)
visited = set()
while to_visit:
node, depth = to_visit.popleft()
if node in visited or depth >= max_depth:
continue
visited.add(node)
for dependency in downstream[node]:
downstream_impact.add(dependency)
to_visit.append((dependency, depth + 1))
# Return raw counts without ratios or interpretations
all_impacted = set(targets) | upstream_impact | downstream_impact
return {
"targets": targets,
"upstream": sorted(upstream_impact),
"downstream": sorted(downstream_impact),
"total_impacted": len(all_impacted),
"graph_nodes": len(import_graph.get("nodes", [])),
}
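
On a linear chain ui -> core -> util, changing `core` should report one upstream dependent and one downstream dependency:

```python
analyzer = XGraphAnalyzer()
chain = {
    "nodes": [{"id": "ui"}, {"id": "core"}, {"id": "util"}],
    "edges": [
        {"source": "ui", "target": "core"},
        {"source": "core", "target": "util"},
    ],
}
print(analyzer.impact_of_change(["core"], chain))
# {'targets': ['core'], 'upstream': ['ui'], 'downstream': ['util'],
#  'total_impacted': 3, 'graph_nodes': 3}
```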
def find_shortest_path(
self,
source: str,
target: str,
graph: dict[str, Any]
) -> list[str] | None:
"""
Find shortest path between two nodes using BFS.
Pure pathfinding algorithm without interpretation.
Args:
source: Source node ID
target: Target node ID
graph: Graph with edges
Returns:
List of node IDs forming the path, or None if no path exists
"""
# Build adjacency list
adj = defaultdict(list)
for edge in graph.get("edges", []):
adj[edge["source"]].append(edge["target"])
# BFS
queue = deque([(source, [source])])
visited = {source}
while queue:
node, path = queue.popleft()
if node == target:
return path
for neighbor in adj[node]:
if neighbor not in visited:
visited.add(neighbor)
queue.append((neighbor, path + [neighbor]))
return None
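
Because the adjacency list is built only from `source -> target` edges, paths are directed; the same chain graph illustrates both outcomes:

```python
analyzer = XGraphAnalyzer()
graph = {"edges": [{"source": "ui", "target": "core"},
                   {"source": "core", "target": "util"}]}
assert analyzer.find_shortest_path("ui", "util", graph) == ["ui", "core", "util"]
assert analyzer.find_shortest_path("util", "ui", graph) is None  # edges are directed
```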
def identify_layers(self, graph: dict[str, Any]) -> dict[str, list[str]]:
"""
Identify architectural layers using topological sorting.
Pure graph layering algorithm without interpretation.
Args:
graph: Import/dependency graph
Returns:
Dict mapping layer number to list of node IDs
"""
# Calculate in-degrees
in_degree = defaultdict(int)
nodes = {node["id"] for node in graph.get("nodes", [])}
for edge in graph.get("edges", []):
in_degree[edge["target"]] += 1
# Find nodes with no dependencies (layer 0)
layers = {}
current_layer = []
for node_id in nodes:
if in_degree[node_id] == 0:
current_layer.append(node_id)
# Build layers using modified topological sort
layer_num = 0
adj = defaultdict(list)
for edge in graph.get("edges", []):
adj[edge["source"]].append(edge["target"])
while current_layer:
layers[layer_num] = current_layer
next_layer = []
for node in current_layer:
for neighbor in adj[node]:
in_degree[neighbor] -= 1
if in_degree[neighbor] == 0:
next_layer.append(neighbor)
current_layer = next_layer
layer_num += 1
return layers
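
A diamond-shaped DAG shows the layering: nodes with no inbound edges land in layer 0, and each later layer holds nodes whose dependencies are all in earlier layers (nodes trapped in cycles never reach in-degree zero and are simply absent):

```python
analyzer = XGraphAnalyzer()
diamond = {
    "nodes": [{"id": n} for n in "abcd"],
    "edges": [
        {"source": "a", "target": "b"},
        {"source": "a", "target": "c"},
        {"source": "b", "target": "d"},
        {"source": "c", "target": "d"},
    ],
}
print(analyzer.identify_layers(diamond))
# {0: ['a'], 1: ['b', 'c'], 2: ['d']}  (order within a layer may vary)
```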
def get_graph_summary(self, graph_data: dict[str, Any]) -> dict[str, Any]:
"""
Extract basic statistics from a graph without interpretation.
This method provides raw counts and statistics only,
no subjective metrics or labels.
Args:
graph_data: Large graph dict with 'nodes' and 'edges'
Returns:
Concise summary with raw statistics only
"""
# Basic statistics
nodes = graph_data.get("nodes", [])
edges = graph_data.get("edges", [])
# Calculate in/out degrees
in_degree = defaultdict(int)
out_degree = defaultdict(int)
for edge in edges:
out_degree[edge["source"]] += 1
in_degree[edge["target"]] += 1
# Find most connected nodes (raw data only)
connection_counts = []
for node in nodes: # Process all nodes
node_id = node["id"]
total = in_degree[node_id] + out_degree[node_id]
if total > 0:
connection_counts.append({
"id": node_id,
"in_degree": in_degree[node_id],
"out_degree": out_degree[node_id],
"total_connections": total
})
# Sort and get top 10
connection_counts.sort(key=lambda x: x["total_connections"], reverse=True)
top_connected = connection_counts[:10]
# Detect cycles (complete search)
cycles = self.detect_cycles({"nodes": nodes, "edges": edges})
# Calculate graph metrics
node_count = len(nodes)
edge_count = len(edges)
density = edge_count / (node_count * (node_count - 1)) if node_count > 1 else 0
# Find isolated nodes
connected_nodes = set()
for edge in edges:
connected_nodes.add(edge["source"])
connected_nodes.add(edge["target"])
isolated_count = len([n for n in nodes if n["id"] not in connected_nodes])
# Create summary with raw data only
summary = {
"statistics": {
"total_nodes": node_count,
"total_edges": edge_count,
"graph_density": round(density, 4),
"isolated_nodes": isolated_count,
"average_connections": round(edge_count / node_count, 2) if node_count > 0 else 0
},
"top_connected_nodes": top_connected,
"cycles_found": [
{
"size": cycle["size"],
"nodes": cycle["nodes"][:5] + (["..."] if len(cycle["nodes"]) > 5 else [])
}
for cycle in cycles[:5]
],
"file_types": self._count_file_types(nodes),
"connection_distribution": {
"nodes_with_20_plus_connections": len([c for c in connection_counts if c["total_connections"] > 20]),
"nodes_with_30_plus_inbound": len([c for c in connection_counts if c["in_degree"] > 30]),
"cycle_count": len(cycles) if len(nodes) < 500 else f"{len(cycles)}+ (limited search)",
}
}
return summary
def _count_file_types(self, nodes: list[dict]) -> dict[str, int]:
"""Count nodes by file extension - pure counting, no interpretation."""
ext_counts = defaultdict(int)
for node in nodes: # Process all nodes
if "file" in node:
ext = Path(node["file"]).suffix or "no_ext"
ext_counts[ext] += 1
# Return top 10 extensions
sorted_exts = sorted(ext_counts.items(), key=lambda x: x[1], reverse=True)
return dict(sorted_exts[:10])
def identify_hotspots(self, graph: dict[str, Any], top_n: int = 10) -> list[dict[str, Any]]:
"""
Identify hotspot nodes based on connectivity (in/out degree).
Pure graph algorithm that identifies most connected nodes
without interpretation or scoring.
Args:
graph: Graph with 'nodes' and 'edges'
top_n: Number of top hotspots to return
Returns:
List of hotspot nodes with their degree counts
"""
# Calculate in/out degrees
in_degree = defaultdict(int)
out_degree = defaultdict(int)
for edge in graph.get("edges", []):
out_degree[edge["source"]] += 1
in_degree[edge["target"]] += 1
# Calculate total connections for each node
hotspots = []
for node in graph.get("nodes", []):
node_id = node["id"]
in_deg = in_degree[node_id]
out_deg = out_degree[node_id]
total = in_deg + out_deg
if total > 0: # Only include connected nodes
hotspots.append({
"id": node_id,
"in_degree": in_deg,
"out_degree": out_deg,
"total_connections": total,
"file": node.get("file", node_id),
"lang": node.get("lang", "unknown")
})
# Sort by total connections and return top N
hotspots.sort(key=lambda x: x["total_connections"], reverse=True)
return hotspots[:top_n]
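
A star-shaped toy graph makes the ranking visible; the hub accumulates both inbound and outbound degree:

```python
analyzer = XGraphAnalyzer()
star = {
    "nodes": [{"id": "hub", "file": "hub.py", "lang": "python"},
              {"id": "a"}, {"id": "b"}, {"id": "c"}],
    "edges": [{"source": "a", "target": "hub"},
              {"source": "b", "target": "hub"},
              {"source": "hub", "target": "c"}],
}
print(analyzer.identify_hotspots(star, top_n=1))
# [{'id': 'hub', 'in_degree': 2, 'out_degree': 1, 'total_connections': 3,
#   'file': 'hub.py', 'lang': 'python'}]
```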
def calculate_node_degrees(self, graph: dict[str, Any]) -> dict[str, dict[str, int]]:
"""
Calculate in-degree and out-degree for all nodes.
Pure counting algorithm without interpretation.
Args:
graph: Graph with edges
Returns:
Dict mapping node IDs to degree counts
"""
degrees = defaultdict(lambda: {"in_degree": 0, "out_degree": 0})
for edge in graph.get("edges", []):
degrees[edge["source"]]["out_degree"] += 1
degrees[edge["target"]]["in_degree"] += 1
return dict(degrees)
def analyze_impact(self, graph: dict[str, Any], targets: list[str], max_depth: int = 3) -> dict[str, Any]:
"""
Analyze impact of changes to target nodes.
Wrapper method for impact_of_change to match expected API.
Args:
graph: Graph with 'nodes' and 'edges'
targets: List of target node IDs
max_depth: Maximum traversal depth
Returns:
Impact analysis results with upstream/downstream effects
"""
# Use existing impact_of_change method
result = self.impact_of_change(targets, graph, None, max_depth)
# Add all_impacted field for compatibility
all_impacted = set(targets) | set(result.get("upstream", [])) | set(result.get("downstream", []))
result["all_impacted"] = sorted(all_impacted)
return result

1017
theauditor/graph/builder.py Normal file

File diff suppressed because it is too large

17 theauditor/graph/insights.py Normal file
View File

@@ -0,0 +1,17 @@
"""Backward compatibility shim for graph insights.
This file exists to maintain backward compatibility for code that imports
from theauditor.graph.insights directly. All functionality has been moved to
theauditor.insights.graph for better organization.
This ensures that:
- from theauditor.graph.insights import GraphInsights # STILL WORKS
- from theauditor.graph import insights # STILL WORKS
- import theauditor.graph.insights # STILL WORKS
"""
# Import everything from the new location
from theauditor.insights.graph import *
# This shim ensures 100% backward compatibility while the actual
# implementation is now in theauditor/insights/graph.py

Some files were not shown because too many files have changed in this diff.