Fix: Critical Windows ProcessPoolExecutor hang and documentation drift

Fixed critical Windows compatibility issues and updated outdated documentation.

  CRITICAL WINDOWS HANG FIXES:
  1. ProcessPoolExecutor → ThreadPoolExecutor
     - Fixes PowerShell/terminal hang where Ctrl+C wouldn't work
     - Prevents the .pf directory lock that previously required a Task Manager kill
     - Root cause: nested ProcessPool + ThreadPool execution on Windows can deadlock, stranding worker processes that hold file locks

  2. Ctrl+C Interruption Support
     - Replaced blocking subprocess.run calls with the Popen+poll pattern (see the sketch below)
     - Polls each subprocess every 100ms to check for interruption
     - Added a global stop_event and signal handlers for graceful shutdown
     - Root cause: subprocess.run blocks its worker thread, so Ctrl+C never propagates to running commands
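
  A minimal sketch of the Popen+poll pattern described above (the helper name
  run_interruptible is illustrative; the real run_subprocess_with_interrupt in
  the pipelines.py diff below also redirects output to log files and returns a
  CompletedProcess-like result):

      import signal
      import subprocess
      import threading
      import time

      stop_event = threading.Event()
      signal.signal(signal.SIGINT, lambda signum, frame: stop_event.set())

      def run_interruptible(cmd, timeout=300):
          """Run cmd, polling every 100 ms so Ctrl+C can stop it."""
          process = subprocess.Popen(cmd)
          start = time.time()
          while process.poll() is None:          # None => still running
              if stop_event.is_set():            # Ctrl+C was pressed
                  process.terminate()
                  process.wait()
                  raise KeyboardInterrupt("Pipeline interrupted by user")
              if time.time() - start > timeout:  # per-command time limit
                  process.kill()
                  raise subprocess.TimeoutExpired(cmd, timeout)
              time.sleep(0.1)                    # avoid busy-waiting
          return process.returncode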

  DOCUMENTATION DRIFT FIX:
  - Removed hardcoded "14 phases" references (the pipeline actually runs 19+ commands)
  - Updated to "multiple analysis phases" throughout all docs
  - Fixed CLI help text to be version-agnostic
  - Added missing "Summary generation" step in HOWTOUSE.md

  Changes:
  - pipelines.py: ProcessPoolExecutor → ThreadPoolExecutor (condensed sketch after this list), added Popen+poll pattern
  - Added signal handling and run_subprocess_with_interrupt() function
  - commands/full.py: Updated docstring to remove specific phase count
  - README.md: Changed "14 distinct phases" to "multiple analysis phases"
  - HOWTOUSE.md: Updated phase references, added missing summary step
  - CLAUDE.md & ARCHITECTURE.md: Removed hardcoded phase counts
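
  A condensed sketch of the executor swap with interrupt-aware result
  collection (track names and chain callables here are stand-ins, not the
  actual pipeline API):

      from concurrent.futures import ThreadPoolExecutor, as_completed

      def run_tracks(tracks):  # tracks: list of (name, zero-arg callable)
          results = []
          # Threads share one interpreter, so stop_event and Ctrl+C reach
          # every track; no child processes are left holding .pf locks.
          with ThreadPoolExecutor(max_workers=3) as executor:
              futures = {executor.submit(fn): name for name, fn in tracks}
              try:
                  for future in as_completed(futures):
                      results.append((futures[future], future.result()))
              except KeyboardInterrupt:
                  for f in futures:
                      f.cancel()  # cancel tracks that have not yet started
                  raise
          return results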

  Impact: Critical UX fixes - Windows compatibility restored, pipeline is now interruptible
  Testing: Ctrl+C works, no PowerShell hangs, the .pf directory can be deleted normally
TheAuditorTool
2025-09-09 14:26:18 +07:00
parent e89c898c91
commit c7a59e420b
6 changed files with 119 additions and 33 deletions

ARCHITECTURE.md

@@ -87,7 +87,7 @@ Key features:
 - **Parallel JavaScript processing** when semantic parser available
 ### Pipeline System (`theauditor/pipelines.py`)
-Orchestrates **14-phase** analysis pipeline in **parallel stages**:
+Orchestrates comprehensive analysis pipeline in **parallel stages**:
 **Stage 1 - Foundation (Sequential):**
 1. Repository indexing - Build manifest and symbol database

CLAUDE.md

@@ -25,7 +25,7 @@ mypy theauditor --strict # Type checking
 # Running TheAuditor
 aud init # Initialize project
-aud full # Complete analysis (14 phases)
+aud full # Complete analysis (multiple phases)
 aud full --offline # Skip network operations (deps, docs)
 aud index --exclude-self # When analyzing TheAuditor itself
@@ -150,7 +150,7 @@ The indexer has been refactored from a monolithic 2000+ line file into a modular
 The package uses a dynamic extractor registry for automatic language detection and processing.
 #### Pipeline System (`theauditor/pipelines.py`)
-- Orchestrates **14-phase** analysis pipeline in **parallel stages**:
+- Orchestrates comprehensive analysis pipeline in **parallel stages**:
 - **Stage 1**: Foundation (index with batched DB operations, framework detection)
 - **Stage 2**: 3 concurrent tracks (Network I/O, Code Analysis, Graph Build)
 - **Stage 3**: Final aggregation (graph analysis, taint, FCE, report)
@@ -324,7 +324,7 @@ if chunk_info.get('truncated', False):
 ## Critical Working Knowledge
 ### Pipeline Execution Order
-The `aud full` command runs 14 phases in 3 stages:
+The `aud full` command runs multiple analysis phases in 3 stages:
 1. **Sequential**: index → framework_detect
 2. **Parallel**: (deps, docs) || (workset, lint, patterns) || (graph_build)
 3. **Sequential**: graph_analyze → taint → fce → report

HOWTOUSE.md

@@ -120,7 +120,7 @@ Setting up JavaScript/TypeScript tools in sandboxed environment...
 On a medium 20k LOC node/react/vite stack, expect the analysis to take around 30 minutes.
 Progress bars for tracks B/C may display inconsistently on PowerShell.
-Run a comprehensive audit with all **14 analysis phases**:
+Run a comprehensive audit with multiple analysis phases organized in parallel stages:
 ```bash
 aud full
@@ -152,12 +152,13 @@ This executes in **parallel stages** for optimal performance:
 11. **Taint analysis** - Track data flow
 12. **Factual correlation engine** - Correlate findings across tools with 29 advanced rules
 13. **Report generation** - Produce final output
+14. **Summary generation** - Create executive summary
 **Output**: Complete results in **`.pf/readthis/`** directory
 ### Offline Mode
-When working on the same codebase repeatedly or when network access is limited, use offline mode to skip dependency checking and documentation phases:
+When working on the same codebase repeatedly or when network access is limited, use offline mode to skip network operations (dependency checking and documentation fetching):
 ```bash
 # Run full audit without network operations
@@ -1069,10 +1070,7 @@ For large repositories:
 # Limit analysis scope
 aud workset --paths "src/critical/**/*.py"
-# Skip documentation phases
-aud full --skip-docs
-# Run specific phases only
+# Run specific commands only
 aud index && aud lint && aud detect-patterns
 # Adjust chunking for larger context windows

README.md

@@ -149,15 +149,15 @@ This architectural flaw is amplified by two dangerous behaviours inherent to AI
 - **Security Theater**: AI assistants are optimized to "make it work," which often means introducing rampant security anti-patterns like hardcoded credentials, disabled authentication, and the pervasive use of `as any` in TypeScript. This creates a dangerous illusion of progress.
 - **Context Blindness**: With aggressive context compaction, an AI never sees the full picture. It works with fleeting snapshots of code, forcing it to make assumptions instead of decisions based on facts.
-## The 14-Phase Analysis Pipeline
+## The Comprehensive Analysis Pipeline
-TheAuditor runs a comprehensive audit through 14 distinct phases organized in 4 stages:
+TheAuditor runs a comprehensive audit through multiple analysis phases organized in parallel stages:
 **STAGE 1: Foundation (Sequential)**
 1. **Index Repository** - Build complete code inventory and SQLite database
 2. **Detect Frameworks** - Identify Django, Flask, React, Vue, etc.
-**STAGE 2: Parallel Analysis (3 concurrent tracks)**
+**STAGE 2: Concurrent Analysis (3 parallel tracks)**
 *Track A - Network Operations:*
 3. **Check Dependencies** - Analyze package versions and known vulnerabilities
@@ -175,9 +175,10 @@ TheAuditor runs a comprehensive audit through 14 distinct phases organized in 4
 11. **Visualize Graph** - Generate multiple graph views
 12. **Taint Analysis** - Track data flow from sources to sinks
-**STAGE 3: Aggregation (Sequential)**
+**STAGE 3: Final Aggregation (Sequential)**
 13. **Factual Correlation Engine** - Cross-reference findings across all tools
 14. **Generate Report** - Produce final AI-consumable chunks in `.pf/readthis/`
+15. **Summary Generation** - Create executive summary of findings
 ## Key Features

theauditor/commands/full.py

@@ -13,7 +13,7 @@ from theauditor.utils.exit_codes import ExitCodes
 @click.option("--exclude-self", is_flag=True, help="Exclude TheAuditor's own files (for self-testing)")
 @click.option("--offline", is_flag=True, help="Skip network operations (deps, docs)")
 def full(root, quiet, exclude_self, offline):
-    """Run complete audit pipeline in exact order specified in teamsop.md."""
+    """Run complete audit pipeline with multiple analysis phases organized in parallel stages."""
     from theauditor.pipelines import run_full_pipeline
     # Define log callback for console output

theauditor/pipelines.py

@@ -4,11 +4,13 @@ import json
 import os
 import platform
 import shutil
+import signal
 import subprocess
 import sys
 import tempfile
+import threading
 import time
-from concurrent.futures import ProcessPoolExecutor, as_completed, wait
+from concurrent.futures import ThreadPoolExecutor, as_completed, wait
 from datetime import datetime
 from pathlib import Path
 from typing import Any, Callable, List, Tuple
@@ -23,6 +25,79 @@ except ImportError:
 # Windows compatibility
 IS_WINDOWS = platform.system() == "Windows"
+
+# Global stop event for interrupt handling
+stop_event = threading.Event()
+
+
+def signal_handler(signum, frame):
+    """Handle Ctrl+C by setting stop event."""
+    print("\n[INFO] Interrupt received, stopping pipeline gracefully...", file=sys.stderr)
+    stop_event.set()
+
+
+# Register signal handler
+signal.signal(signal.SIGINT, signal_handler)
+if not IS_WINDOWS:
+    signal.signal(signal.SIGTERM, signal_handler)
+
+
+def run_subprocess_with_interrupt(cmd, stdout_fp, stderr_fp, cwd, shell=False, timeout=300):
+    """
+    Run subprocess with interrupt checking every 100ms.
+
+    Args:
+        cmd: Command to execute
+        stdout_fp: File handle for stdout
+        stderr_fp: File handle for stderr
+        cwd: Working directory
+        shell: Whether to use shell execution
+        timeout: Maximum time to wait (seconds)
+
+    Returns:
+        subprocess.CompletedProcess-like object with returncode, stdout, stderr
+    """
+    process = subprocess.Popen(
+        cmd,
+        stdout=stdout_fp,
+        stderr=stderr_fp,
+        text=True,
+        cwd=cwd,
+        shell=shell
+    )
+
+    # Poll process every 100ms to check for completion or interruption
+    start_time = time.time()
+    while process.poll() is None:
+        if stop_event.is_set():
+            # User interrupted - terminate subprocess
+            process.terminate()
+            try:
+                process.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                process.kill()
+                process.wait()
+            raise KeyboardInterrupt("Pipeline interrupted by user")
+
+        # Check timeout
+        if time.time() - start_time > timeout:
+            process.terminate()
+            try:
+                process.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                process.kill()
+                process.wait()
+            raise subprocess.TimeoutExpired(cmd, timeout)
+
+        # Sleep briefly to avoid busy-waiting
+        time.sleep(0.1)
+
+    # Create result object similar to subprocess.run
+    class Result:
+        def __init__(self, returncode):
+            self.returncode = returncode
+            self.stdout = None
+            self.stderr = None
+
+    result = Result(process.returncode)
+    return result
+
+
 def run_command_chain(commands: List[Tuple[str, List[str]]], root: str, chain_name: str) -> dict:
     """
@@ -101,13 +176,13 @@ def run_command_chain(commands: List[Tuple[str, List[str]]], root: str, chain_na
             with open(stdout_file, 'w+', encoding='utf-8') as out_fp, \
                  open(stderr_file, 'w+', encoding='utf-8') as err_fp:
-                result = subprocess.run(
+                result = run_subprocess_with_interrupt(
                     cmd,
-                    stdout=out_fp,
-                    stderr=err_fp,
-                    text=True,
+                    stdout_fp=out_fp,
+                    stderr_fp=err_fp,
                     cwd=root,
-                    shell=IS_WINDOWS  # Windows compatibility fix
+                    shell=IS_WINDOWS,  # Windows compatibility fix
+                    timeout=300  # 5 minutes per command in parallel tracks
                 )
             # Read outputs
@@ -157,6 +232,12 @@ def run_command_chain(commands: List[Tuple[str, List[str]]], root: str, chain_na
             chain_errors.append(f"Error in {description}: {stderr}")
             break  # Stop chain on failure
+        except KeyboardInterrupt:
+            # User interrupted - clean up and exit
+            failed = True
+            write_status(f"INTERRUPTED: {description}", completed_count, len(commands))
+            chain_output.append(f"[INTERRUPTED] Pipeline stopped by user")
+            raise  # Re-raise to propagate up
         except Exception as e:
             failed = True
             write_status(f"ERROR: {description}", completed_count, len(commands))
@@ -475,13 +556,13 @@ def run_full_pipeline(
             with open(stdout_file, 'w+', encoding='utf-8') as out_fp, \
                  open(stderr_file, 'w+', encoding='utf-8') as err_fp:
-                result = subprocess.run(
+                result = run_subprocess_with_interrupt(
                     cmd,
-                    stdout=out_fp,
-                    stderr=err_fp,
-                    text=True,
+                    stdout_fp=out_fp,
+                    stderr_fp=err_fp,
                     cwd=root,
-                    shell=IS_WINDOWS  # Windows compatibility fix
+                    shell=IS_WINDOWS,  # Windows compatibility fix
+                    timeout=300  # 5 minutes per command in parallel tracks
                 )
             # Read outputs
@@ -590,9 +671,9 @@ def run_full_pipeline(
     log_output("  Track B: Code Analysis (workset, lint, patterns)")
     log_output("  Track C: Graph & Taint Analysis")
-    # Execute parallel tracks using ProcessPoolExecutor
+    # Execute parallel tracks using ThreadPoolExecutor (Windows-safe)
     parallel_results = []
-    with ProcessPoolExecutor(max_workers=3) as executor:
+    with ThreadPoolExecutor(max_workers=3) as executor:
         futures = []
         # Submit Track A if it has commands
@@ -673,6 +754,12 @@ def run_full_pipeline(
                 else:
                     log_output(f"[FAILED] {result['name']} failed", is_error=True)
                     failed_phases += 1
+        except KeyboardInterrupt:
+            log_output(f"[INTERRUPTED] Pipeline stopped by user", is_error=True)
+            # Cancel remaining futures
+            for f in pending_futures:
+                f.cancel()
+            raise  # Re-raise to exit
         except Exception as e:
             log_output(f"[ERROR] Parallel track failed with exception: {e}", is_error=True)
             failed_phases += 1
@@ -715,13 +802,13 @@ def run_full_pipeline(
             with open(stdout_file, 'w+', encoding='utf-8') as out_fp, \
                  open(stderr_file, 'w+', encoding='utf-8') as err_fp:
-                result = subprocess.run(
+                result = run_subprocess_with_interrupt(
                     cmd,
-                    stdout=out_fp,
-                    stderr=err_fp,
-                    text=True,
+                    stdout_fp=out_fp,
+                    stderr_fp=err_fp,
                     cwd=root,
-                    shell=IS_WINDOWS  # Windows compatibility fix
+                    shell=IS_WINDOWS,  # Windows compatibility fix
+                    timeout=600  # 10 minutes for final aggregation
                 )
             # Read outputs