diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 2e4b055..6e3bf91 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -87,7 +87,7 @@ Key features:
 - **Parallel JavaScript processing** when semantic parser available
 
 ### Pipeline System (`theauditor/pipelines.py`)
-Orchestrates **14-phase** analysis pipeline in **parallel stages**:
+Orchestrates a comprehensive analysis pipeline in **parallel stages**:
 
 **Stage 1 - Foundation (Sequential):**
 1. Repository indexing - Build manifest and symbol database
diff --git a/CLAUDE.md b/CLAUDE.md
index 225b7ba..66a9e60 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -25,7 +25,7 @@
 mypy theauditor --strict # Type checking
 
 # Running TheAuditor
 aud init # Initialize project
-aud full # Complete analysis (14 phases)
+aud full # Complete analysis (multiple phases)
 aud full --offline # Skip network operations (deps, docs)
 aud index --exclude-self # When analyzing TheAuditor itself
@@ -150,7 +150,7 @@ The indexer has been refactored from a monolithic 2000+ line file into a modular
 The package uses a dynamic extractor registry for automatic language detection and processing.
 
 #### Pipeline System (`theauditor/pipelines.py`)
-- Orchestrates **14-phase** analysis pipeline in **parallel stages**:
+- Orchestrates a comprehensive analysis pipeline in **parallel stages**:
 - **Stage 1**: Foundation (index with batched DB operations, framework detection)
 - **Stage 2**: 3 concurrent tracks (Network I/O, Code Analysis, Graph Build)
 - **Stage 3**: Final aggregation (graph analysis, taint, FCE, report)
@@ -324,7 +324,7 @@ if chunk_info.get('truncated', False):
 ## Critical Working Knowledge
 
 ### Pipeline Execution Order
-The `aud full` command runs 14 phases in 3 stages:
+The `aud full` command runs multiple analysis phases in 3 stages:
 1. **Sequential**: index → framework_detect
 2. **Parallel**: (deps, docs) || (workset, lint, patterns) || (graph_build)
 3. **Sequential**: graph_analyze → taint → fce → report
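A note on the `chunk_info.get('truncated', False)` context line above: consumers of chunked output have to treat a truncated chunk as incomplete evidence rather than a full record. Below is a minimal truncation-aware reader, assuming a hypothetical layout of one JSON file per chunk under `.pf/readthis/` with a top-level `truncated` field; the real chunk schema may differ.

```python
import json
from pathlib import Path


def read_chunks(readthis_dir: str = ".pf/readthis"):
    """Yield parsed chunk files, warning on any that were truncated."""
    for chunk_file in sorted(Path(readthis_dir).glob("*.json")):
        chunk_info = json.loads(chunk_file.read_text(encoding="utf-8"))
        if chunk_info.get('truncated', False):
            # The source exceeded the chunk budget; downstream consumers
            # should not treat this chunk as a complete record.
            print(f"[WARN] {chunk_file.name} is truncated")
        yield chunk_info
```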
diff --git a/HOWTOUSE.md b/HOWTOUSE.md
index 86805c2..1abe2d4 100644
--- a/HOWTOUSE.md
+++ b/HOWTOUSE.md
@@ -120,7 +120,7 @@ Setting up JavaScript/TypeScript tools in sandboxed environment...
 On a medium 20k LOC node/react/vite stack, expect the analysis to take around 30 minutes.
 Progress bars for tracks B/C may display inconsistently on PowerShell.
 
-Run a comprehensive audit with all **14 analysis phases**:
+Run a comprehensive audit with multiple analysis phases organized in parallel stages:
 
 ```bash
 aud full
@@ -152,12 +152,13 @@ This executes in **parallel stages** for optimal performance:
 11. **Taint analysis** - Track data flow
 12. **Factual correlation engine** - Correlate findings across tools with 29 advanced rules
 13. **Report generation** - Produce final output
+14. **Summary generation** - Create executive summary
 
 **Output**: Complete results in **`.pf/readthis/`** directory
 
 ### Offline Mode
 
-When working on the same codebase repeatedly or when network access is limited, use offline mode to skip dependency checking and documentation phases:
+When working on the same codebase repeatedly or when network access is limited, use offline mode to skip network operations (dependency checking and documentation fetching):
 
 ```bash
 # Run full audit without network operations
@@ -1069,10 +1070,7 @@ For large repositories:
 # Limit analysis scope
 aud workset --paths "src/critical/**/*.py"
 
-# Skip documentation phases
-aud full --skip-docs
-
-# Run specific phases only
+# Run specific commands only
 aud index && aud lint && aud detect-patterns
 
 # Adjust chunking for larger context windows
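The stage layout documented above (a sequential foundation stage, three concurrent tracks, then sequential aggregation) reduces to a fan-out/fan-in pattern. A minimal sketch, with hypothetical `track_a`/`track_b`/`track_c` functions standing in for the real command chains:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def track_a():  # stand-in for network I/O: deps, docs
    return {"name": "Track A", "success": True}


def track_b():  # stand-in for code analysis: workset, lint, patterns
    return {"name": "Track B", "success": True}


def track_c():  # stand-in for graph build
    return {"name": "Track C", "success": True}


def run_stage2():
    """Fan out the three tracks, then block until all finish (fan-in)."""
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(track) for track in (track_a, track_b, track_c)]
        return [future.result() for future in as_completed(futures)]


if __name__ == "__main__":
    # Stage 3 (aggregation) only starts after this call returns.
    print(run_stage2())
```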
diff --git a/README.md b/README.md
index 66a4f30..89d159e 100644
--- a/README.md
+++ b/README.md
@@ -149,15 +149,15 @@ This architectural flaw is amplified by two dangerous behaviours inherent to AI
 - **Security Theater**: AI assistants are optimized to "make it work," which often means introducing rampant security anti-patterns like hardcoded credentials, disabled authentication, and the pervasive use of `as any` in TypeScript. This creates a dangerous illusion of progress.
 - **Context Blindness**: With aggressive context compaction, an AI never sees the full picture. It works with fleeting snapshots of code, forcing it to make assumptions instead of decisions based on facts.
 
-## The 14-Phase Analysis Pipeline
+## The Comprehensive Analysis Pipeline
 
-TheAuditor runs a comprehensive audit through 14 distinct phases organized in 4 stages:
+TheAuditor runs a comprehensive audit through multiple analysis phases organized in parallel stages:
 
 **STAGE 1: Foundation (Sequential)**
 1. **Index Repository** - Build complete code inventory and SQLite database
 2. **Detect Frameworks** - Identify Django, Flask, React, Vue, etc.
 
-**STAGE 2: Parallel Analysis (3 concurrent tracks)**
+**STAGE 2: Concurrent Analysis (3 parallel tracks)**
 
 *Track A - Network Operations:*
 3. **Check Dependencies** - Analyze package versions and known vulnerabilities
@@ -175,9 +175,10 @@ TheAuditor runs a comprehensive audit through 14 distinct phases organized in 4
 11. **Visualize Graph** - Generate multiple graph views
 12. **Taint Analysis** - Track data flow from sources to sinks
 
-**STAGE 3: Aggregation (Sequential)**
+**STAGE 3: Final Aggregation (Sequential)**
 13. **Factual Correlation Engine** - Cross-reference findings across all tools
 14. **Generate Report** - Produce final AI-consumable chunks in `.pf/readthis/`
+15. **Summary Generation** - Create executive summary of findings
 
 ## Key Features
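The Factual Correlation Engine's cross-referencing can be pictured as a join on location: findings from independent tools that land on the same file and line corroborate one another. The sketch below is illustrative only, with invented finding dicts; the real engine applies 29 rules that go well beyond simple location matching.

```python
from collections import defaultdict


def correlate(findings):
    """Keep locations flagged by two or more independent tools."""
    by_location = defaultdict(list)
    for finding in findings:
        by_location[(finding["file"], finding["line"])].append(finding)
    return {loc: fs for loc, fs in by_location.items()
            if len({f["tool"] for f in fs}) >= 2}


findings = [
    {"tool": "lint", "file": "app.py", "line": 10, "msg": "unused import"},
    {"tool": "taint", "file": "app.py", "line": 42, "msg": "tainted data reaches sink"},
    {"tool": "patterns", "file": "app.py", "line": 42, "msg": "raw SQL string"},
]
print(correlate(findings))  # only ('app.py', 42) survives: two tools agree
```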
diff --git a/theauditor/commands/full.py b/theauditor/commands/full.py
index 6d938e7..7d1d566 100644
--- a/theauditor/commands/full.py
+++ b/theauditor/commands/full.py
@@ -13,7 +13,7 @@ from theauditor.utils.exit_codes import ExitCodes
 @click.option("--exclude-self", is_flag=True, help="Exclude TheAuditor's own files (for self-testing)")
 @click.option("--offline", is_flag=True, help="Skip network operations (deps, docs)")
 def full(root, quiet, exclude_self, offline):
-    """Run complete audit pipeline in exact order specified in teamsop.md."""
+    """Run the complete audit pipeline with multiple analysis phases organized in parallel stages."""
     from theauditor.pipelines import run_full_pipeline
 
     # Define log callback for console output
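The `# Define log callback for console output` comment above elides the callback body. A sketch of what that wiring might look like; the `log_callback` parameter name and the `run_full_pipeline` signature in the commented-out call are assumptions, not confirmed by this patch.

```python
import sys


def make_log_callback(quiet: bool):
    """Build a logger that honors --quiet but always surfaces errors."""
    def log(message: str, is_error: bool = False):
        if is_error:
            print(message, file=sys.stderr)  # errors bypass --quiet
        elif not quiet:
            print(message)
    return log


# Hypothetical call site inside full():
# run_full_pipeline(root=root, log_callback=make_log_callback(quiet),
#                   exclude_self=exclude_self, offline=offline)
```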
diff --git a/theauditor/pipelines.py b/theauditor/pipelines.py
index 697feb9..991d9b4 100644
--- a/theauditor/pipelines.py
+++ b/theauditor/pipelines.py
@@ -4,11 +4,13 @@ import json
 import os
 import platform
 import shutil
+import signal
 import subprocess
 import sys
 import tempfile
+import threading
 import time
-from concurrent.futures import ProcessPoolExecutor, as_completed, wait
+from concurrent.futures import ThreadPoolExecutor, as_completed, wait
 from datetime import datetime
 from pathlib import Path
 from typing import Any, Callable, List, Tuple
@@ -23,6 +25,79 @@ except ImportError:
 
 # Windows compatibility
 IS_WINDOWS = platform.system() == "Windows"
 
+# Global stop event for interrupt handling
+stop_event = threading.Event()
+
+def signal_handler(signum, frame):
+    """Handle Ctrl+C by setting the stop event."""
+    print("\n[INFO] Interrupt received, stopping pipeline gracefully...", file=sys.stderr)
+    stop_event.set()
+
+# Register signal handler
+signal.signal(signal.SIGINT, signal_handler)
+if not IS_WINDOWS:
+    signal.signal(signal.SIGTERM, signal_handler)
+
+def run_subprocess_with_interrupt(cmd, stdout_fp, stderr_fp, cwd, shell=False, timeout=300):
+    """
+    Run a subprocess, polling every 100 ms for completion or interruption.
+
+    Args:
+        cmd: Command to execute
+        stdout_fp: File handle for stdout
+        stderr_fp: File handle for stderr
+        cwd: Working directory
+        shell: Whether to use shell execution
+        timeout: Maximum time to wait (seconds)
+
+    Returns:
+        subprocess.CompletedProcess with returncode set (stdout/stderr are None)
+    """
+    process = subprocess.Popen(
+        cmd,
+        stdout=stdout_fp,
+        stderr=stderr_fp,
+        text=True,
+        cwd=cwd,
+        shell=shell
+    )
+
+    # Poll process every 100ms to check for completion or interruption
+    start_time = time.time()
+    while process.poll() is None:
+        if stop_event.is_set():
+            # User interrupted - terminate subprocess
+            process.terminate()
+            try:
+                process.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                process.kill()
+                process.wait()
+            raise KeyboardInterrupt("Pipeline interrupted by user")
+
+        # Check timeout
+        if time.time() - start_time > timeout:
+            process.terminate()
+            try:
+                process.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                process.kill()
+                process.wait()
+            raise subprocess.TimeoutExpired(cmd, timeout)
+
+        # Sleep briefly to avoid busy-waiting
+        time.sleep(0.1)
+
+    # Mirror subprocess.run's result type so callers can read returncode;
+    # stdout/stderr stay None because output was streamed to the file handles.
+    result = subprocess.CompletedProcess(
+        args=cmd,
+        returncode=process.returncode,
+        stdout=None,
+        stderr=None,
+    )
+    return result
+
 def run_command_chain(commands: List[Tuple[str, List[str]]], root: str, chain_name: str) -> dict:
     """
@@ -101,13 +176,13 @@ def run_command_chain(commands: List[Tuple[str, List[str]]], root: str, chain_na
         with open(stdout_file, 'w+', encoding='utf-8') as out_fp, \
              open(stderr_file, 'w+', encoding='utf-8') as err_fp:
 
-            result = subprocess.run(
+            result = run_subprocess_with_interrupt(
                 cmd,
-                stdout=out_fp,
-                stderr=err_fp,
-                text=True,
+                stdout_fp=out_fp,
+                stderr_fp=err_fp,
                 cwd=root,
-                shell=IS_WINDOWS  # Windows compatibility fix
+                shell=IS_WINDOWS,  # Windows compatibility fix
+                timeout=300  # 5 minutes per command in parallel tracks
             )
 
         # Read outputs
@@ -157,6 +232,12 @@ def run_command_chain(commands: List[Tuple[str, List[str]]], root: str, chain_na
                 chain_errors.append(f"Error in {description}: {stderr}")
                 break  # Stop chain on failure
 
+        except KeyboardInterrupt:
+            # User interrupted - clean up and exit
+            failed = True
+            write_status(f"INTERRUPTED: {description}", completed_count, len(commands))
+            chain_output.append("[INTERRUPTED] Pipeline stopped by user")
+            raise  # Re-raise to propagate up
         except Exception as e:
             failed = True
             write_status(f"ERROR: {description}", completed_count, len(commands))
@@ -475,13 +556,13 @@ def run_full_pipeline(
         with open(stdout_file, 'w+', encoding='utf-8') as out_fp, \
              open(stderr_file, 'w+', encoding='utf-8') as err_fp:
 
-            result = subprocess.run(
+            result = run_subprocess_with_interrupt(
                 cmd,
-                stdout=out_fp,
-                stderr=err_fp,
-                text=True,
+                stdout_fp=out_fp,
+                stderr_fp=err_fp,
                 cwd=root,
-                shell=IS_WINDOWS  # Windows compatibility fix
+                shell=IS_WINDOWS,  # Windows compatibility fix
+                timeout=300  # 5 minutes per command
            )
 
         # Read outputs
@@ -590,9 +671,9 @@ def run_full_pipeline(
     log_output("  Track B: Code Analysis (workset, lint, patterns)")
     log_output("  Track C: Graph & Taint Analysis")
 
-    # Execute parallel tracks using ProcessPoolExecutor
+    # Execute parallel tracks using ThreadPoolExecutor (Windows-safe)
     parallel_results = []
-    with ProcessPoolExecutor(max_workers=3) as executor:
+    with ThreadPoolExecutor(max_workers=3) as executor:
         futures = []
 
         # Submit Track A if it has commands
@@ -673,6 +754,12 @@ def run_full_pipeline(
            else:
                log_output(f"[FAILED] {result['name']} failed", is_error=True)
                failed_phases += 1
+        except KeyboardInterrupt:
+            log_output("[INTERRUPTED] Pipeline stopped by user", is_error=True)
+            # Cancel remaining futures (cancel() is a no-op for completed ones)
+            for f in futures:
+                f.cancel()
+            raise  # Re-raise to exit
        except Exception as e:
            log_output(f"[ERROR] Parallel track failed with exception: {e}", is_error=True)
            failed_phases += 1
@@ -715,13 +802,13 @@ def run_full_pipeline(
         with open(stdout_file, 'w+', encoding='utf-8') as out_fp, \
              open(stderr_file, 'w+', encoding='utf-8') as err_fp:
 
-            result = subprocess.run(
+            result = run_subprocess_with_interrupt(
                 cmd,
-                stdout=out_fp,
-                stderr=err_fp,
-                text=True,
+                stdout_fp=out_fp,
+                stderr_fp=err_fp,
                 cwd=root,
-                shell=IS_WINDOWS  # Windows compatibility fix
+                shell=IS_WINDOWS,  # Windows compatibility fix
+                timeout=600  # 10 minutes for final aggregation
             )
 
         # Read outputs
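The switch from ProcessPoolExecutor to ThreadPoolExecutor is what makes this interrupt design work: a module-level `threading.Event` is visible to worker threads in the same process, but it would not be shared with spawned worker processes. A self-contained demo of the cooperative-shutdown pattern the patch relies on; worker names and iteration counts are illustrative.

```python
import signal
import threading
import time
from concurrent.futures import ThreadPoolExecutor

stop_event = threading.Event()

# Replace the default SIGINT handler so Ctrl+C flips the shared event
# instead of raising KeyboardInterrupt in the main thread.
signal.signal(signal.SIGINT, lambda signum, frame: stop_event.set())


def worker(name: str) -> str:
    # Poll the shared event instead of blocking, mirroring the 100 ms
    # polling loop in run_subprocess_with_interrupt().
    for _ in range(50):
        if stop_event.is_set():
            return f"{name}: interrupted, exiting cleanly"
        time.sleep(0.1)
    return f"{name}: done"


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(worker, f"track-{letter}") for letter in "ABC"]
        for future in futures:
            print(future.result())  # press Ctrl+C to see graceful shutdown
```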