# Comprehensive Troubleshooting Guide
This guide provides troubleshooting steps for common issues encountered when deploying and operating the AITBC platform.
## Table of Contents
- [General Troubleshooting](#general-troubleshooting)
- [Blockchain Node Issues](#blockchain-node-issues)
- [Coordinator API Issues](#coordinator-api-issues)
- [Wallet Daemon Issues](#wallet-daemon-issues)
- [Marketplace Service Issues](#marketplace-service-issues)
- [Database Issues](#database-issues)
- [Network Issues](#network-issues)
- [GPU Issues](#gpu-issues)
- [Performance Issues](#performance-issues)
- [Security Issues](#security-issues)
- [Getting Help](#getting-help)
## General Troubleshooting
### Service Won't Start
**Symptoms:**
- Service fails to start
- Systemd service shows "failed" status
- No logs available
**Diagnosis:**
```bash
# Check service status
sudo systemctl status aitbc-coordinator-api
# Check recent logs
sudo journalctl -u aitbc-coordinator-api -n 50
# Check the last hour of logs for errors
sudo journalctl -u aitbc-coordinator-api --since "1 hour ago" | grep -i error
```
**Solutions:**
1. Check configuration files
```bash
# Validate configuration
python -m apps.coordinator_api.main --validate-config
```
2. Check port conflicts
```bash
# Check if port is in use
sudo netstat -tulpn | grep 8011
# Stop the process listening on the port (use kill -9 only if it ignores the default SIGTERM)
sudo kill $(sudo lsof -t -i:8011 -sTCP:LISTEN)
```
3. Check permissions
```bash
# Check file permissions
ls -la /opt/aitbc
# Fix permissions
sudo chown -R aitbc:aitbc /opt/aitbc
```
4. Check dependencies
```bash
# Verify Python dependencies
source venv/bin/activate
pip list
# Install missing dependencies
pip install -r requirements.txt
```
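The port-conflict check from step 2 can also be scripted, for example as a pre-start check. The following is a minimal sketch using only the standard library; the helper name `port_in_use` is hypothetical, and 8011 is the Coordinator API port used in this guide.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8011 is the Coordinator API port referenced in this guide
    print("port 8011 in use:", port_in_use(8011))
```

A result of `True` before the service has started suggests another process already holds the port.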
### High CPU Usage
**Symptoms:**
- Service consuming excessive CPU
- System sluggish
- High load averages
**Diagnosis:**
```bash
# Check CPU usage (top expects a comma-separated PID list if several processes match)
top -p "$(pgrep -d, -f coordinator-api)"
# Check process details
ps aux | grep coordinator-api
# Check system load
uptime
```
**Solutions:**
1. Profile the application
```bash
# Profile with cProfile
python -m cProfile -o profile.stats apps/coordinator_api/main.py
# Analyze profile
python -m pstats profile.stats
```
2. Check for infinite loops
```bash
# Attach strace to see which system calls the process makes (a pure busy loop shows none)
sudo strace -p "$(pgrep -f coordinator-api)"
```
3. Optimize database queries
```bash
# Enable query logging
export SQLALCHEMY_ECHO=true
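# Analyze slow queries (requires the pg_stat_statements extension; the column is total_time on PostgreSQL 12 and older)
psql -d aitbc -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"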
```
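The load averages printed by `uptime` are easier to interpret relative to the number of CPU cores. The sketch below normalizes them; the helper names `load_per_core` and `is_saturated` and the threshold of 1.0 are illustrative assumptions, not part of AITBC.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def is_saturated(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """Return True when the load per core exceeds the threshold."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1
    print(f"1-minute load per core: {load_per_core(one_min, cores):.2f}")
    print("saturated:", is_saturated(one_min, cores))
```

A sustained value above 1.0 per core means more tasks are waiting than the CPUs can run, which is when profiling the service is most worthwhile.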
### Memory Leaks
**Symptoms:**
- Memory usage increases over time
- Service crashes with OOM killer
- Swap usage high
**Diagnosis:**
```bash
# Check memory usage
free -h
# Check process memory
ps aux | grep coordinator-api
# Monitor memory over time
watch -n 1 'free -h'
```
**Solutions:**
1. Check for memory leaks
```bash
# Use memory profiler
pip install memory-profiler
python -m memory_profiler apps/coordinator_api/main.py
```
2. Check connection pooling
```python
# Reduce pool size
from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_size=5,
    max_overflow=10,
)
```
3. Restart service periodically
```bash
# Add to root's crontab (sudo crontab -e): restart daily at 02:00
0 2 * * * systemctl restart aitbc-coordinator-api
```
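Before resorting to periodic restarts, it can help to see which source lines are responsible for memory growth. The following is a minimal sketch using the standard-library `tracemalloc` module; the function name `allocation_growth` and the sample workload are hypothetical stand-ins for real application code.

```python
import tracemalloc


def allocation_growth(workload, limit: int = 5):
    """Run workload() and return the source lines whose allocations changed the most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    leaked = []  # stand-in for a cache that is never cleared

    def workload():
        for _ in range(1000):
            leaked.append(bytearray(1024))

    for stat in allocation_growth(workload):
        print(stat)
```

Lines that keep appearing at the top of this list across repeated runs are the first candidates to inspect.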
## Blockchain Node Issues
### Node Won't Sync
**Symptoms:**
- Block height not increasing
- Sync status shows "syncing" indefinitely
- Peers not connecting
**Diagnosis:**
```bash
# Check sync status
curl http://localhost:8080/v1/network
# Check peer connections
curl http://localhost:8080/v1/network/peers
# Check blockchain logs
sudo journalctl -u aitbc-blockchain -n 50
```
**Solutions:**
1. Add bootstrap peers
```bash
# Edit configuration (tee runs as root; a plain >> redirect would run as the current user)
echo "BOOTSTRAP_PEERS=peer1.example.com:8080,peer2.example.com:8080" | sudo tee -a /etc/aitbc/blockchain.env
# Restart service
sudo systemctl restart aitbc-blockchain
```
2. Check network connectivity
```bash
# Test peer connectivity
telnet peer.example.com 8080
# Check firewall
sudo ufw status
```
3. Reset blockchain state
```bash
# Stop service
sudo systemctl stop aitbc-blockchain
# Move the existing data aside as a backup
sudo mv /var/lib/aitbc/blockchain /var/lib/aitbc/blockchain.backup
# Start service
sudo systemctl start aitbc-blockchain
```
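The "block height not increasing" symptom can be checked mechanically by sampling the height a few times and comparing the first and last values. This sketch deliberately takes heights that have already been extracted, because the response schema of the node's endpoints is not documented in this guide; the function name `sync_stalled` is hypothetical.

```python
from typing import Sequence


def sync_stalled(heights: Sequence[int], min_progress: int = 1) -> bool:
    """Return True if the height advanced by less than min_progress across the samples."""
    if len(heights) < 2:
        raise ValueError("need at least two height samples")
    return heights[-1] - heights[0] < min_progress


if __name__ == "__main__":
    # Example: three samples taken a minute apart (values are made up)
    print(sync_stalled([12345, 12345, 12345]))  # no progress
    print(sync_stalled([12345, 12350, 12361]))  # height is advancing
```

How far apart to take the samples depends on the chain's block time, which this guide does not state.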
### Fork Detected
**Symptoms:**
- Multiple blockchain branches
- Consensus failures
- Invalid blocks
**Diagnosis:**
```bash
# Check blockchain height
curl http://localhost:8080/v1/blocks/head
# Check for forks
curl http://localhost:8080/v1/blocks/forks
```
**Solutions:**
1. Choose correct fork
```bash
# Revert to correct height
curl -X POST http://localhost:8080/v1/admin/revert \
-H "Content-Type: application/json" \
-d '{"height": 12345}'
```
2. Restart with clean state
```bash
# Stop service
sudo systemctl stop aitbc-blockchain
# Clear blockchain data (irreversible; move the directory aside instead if it may be needed later)
sudo rm -rf /var/lib/aitbc/blockchain
# Start service
sudo systemctl start aitbc-blockchain
```
## Coordinator API Issues
### 500 Internal Server Error
**Symptoms:**
- API returns 500 errors
- Jobs fail to submit
- Status checks fail
**Diagnosis:**
```bash
# Check API logs
sudo journalctl -u aitbc-coordinator-api -n 100 | grep -i error
# Check database connection
psql -d aitbc -c "SELECT 1;"
# Check health endpoint
curl http://localhost:8011/health
```
**Solutions:**
1. Check database connectivity
```bash
# Test database connection
psql -h localhost -U aitbc -d aitbc
# Restart PostgreSQL
sudo systemctl restart postgresql
```
2. Check Redis connection
```bash
# Test Redis
redis-cli ping
# Restart Redis
sudo systemctl restart redis
```
3. Check datetime handling
```bash
# Check for datetime comparison errors
# Ensure all datetimes are timezone-aware or offset-naive consistently
```
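The datetime error mentioned in step 3 typically appears when an offset-naive value is ordered against a timezone-aware one. The sketch below reproduces the failure and shows one way to normalize values first. The helper `ensure_utc` is hypothetical, and it assumes naive values are already in UTC, which may not hold for every field.

```python
from datetime import datetime, timezone


def ensure_utc(value: datetime) -> datetime:
    """Return a timezone-aware UTC datetime; naive values are assumed to be UTC already."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


naive = datetime(2026, 5, 11, 12, 0, 0)
aware = datetime(2026, 5, 11, 12, 0, 0, tzinfo=timezone.utc)

try:
    naive < aware  # ordering naive against aware raises TypeError
except TypeError as exc:
    print("comparison failed:", exc)

# After normalization both values compare cleanly
print(ensure_utc(naive) == ensure_utc(aware))
```

Normalizing at the boundary, where values enter from the database or the request body, keeps the rest of the code free of mixed comparisons.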
### Job Stuck in Queued State
**Symptoms:**
- Jobs remain in QUEUED state
- No miners assigned
- Job expiration
**Diagnosis:**
```bash
# Check job status (set JOB_ID first; curl treats literal braces as URL glob syntax)
curl -H "X-Api-Key: $API_KEY" \
"http://localhost:8011/v1/jobs/$JOB_ID"
# Check miner availability
curl http://localhost:8011/v1/miners
# Check logs
sudo journalctl -u aitbc-coordinator-api -n 50
```
**Solutions:**
1. Check miner registration
```bash
# Verify miners are registered
curl http://localhost:8011/v1/miners
# Register miner if needed
curl -X POST http://localhost:8011/v1/miners/register \
-H "Content-Type: application/json" \
-d '{"miner_id": "miner-123", "gpu_type": "nvidia-rtx-3090"}'
```
2. Check job constraints
```bash
# Verify job constraints can be satisfied (set JOB_ID first)
curl -H "X-Api-Key: $API_KEY" \
"http://localhost:8011/v1/jobs/$JOB_ID" | jq '.constraints'
```
3. Increase job TTL
```bash
# Resubmit with longer TTL
curl -X POST http://localhost:8011/v1/jobs \
-H "Content-Type: application/json" \
-H "X-Api-Key: $API_KEY" \
-d '{"payload": {...}, "ttl_seconds": 3600}'
```
## Wallet Daemon Issues
### Wallet Not Responding
**Symptoms:**
- Wallet daemon unresponsive
- Transactions not signing
- Balance not updating
**Diagnosis:**
```bash
# Check wallet daemon status
sudo systemctl status aitbc-wallet
# Check wallet logs
sudo journalctl -u aitbc-wallet -n 50
# Test wallet endpoint
curl http://localhost:8071/health
```
**Solutions:**
1. Check wallet file integrity
```bash
# Verify wallet file exists
ls -la /var/lib/aitbc/wallet/
# Restrict wallet file permissions to the owner
chmod 600 /var/lib/aitbc/wallet/wallet.dat
```
2. Restart wallet daemon
```bash
sudo systemctl restart aitbc-wallet
```
3. Check key derivation
```bash
# Verify key derivation path
python -c "from aitbc_crypto import Wallet; w = Wallet(); print(w.address)"
```
### Transaction Signing Failed
**Symptoms:**
- Transactions fail to sign
- Invalid signature errors
- Key not found errors
**Diagnosis:**
```bash
# Check wallet keys
curl http://localhost:8071/v1/keys
# Check transaction logs
sudo journalctl -u aitbc-wallet -n 50 | grep -i transaction
```
**Solutions:**
1. Verify private key
```bash
# Check private key exists
ls -la /var/lib/aitbc/wallet/private_key
# Regenerate keys if needed
curl -X POST http://localhost:8071/v1/keys/regenerate
```
2. Check key permissions
```bash
# Secure private key
chmod 600 /var/lib/aitbc/wallet/private_key
chown aitbc:aitbc /var/lib/aitbc/wallet/private_key
```
## Marketplace Service Issues
### Offers Not Matching
**Symptoms:**
- GPU offers not matched with jobs
- Jobs remain unassigned
- Marketplace not updating
**Diagnosis:**
```bash
# Check marketplace status
curl http://localhost:8102/health
# Check offers
curl http://localhost:8102/v1/offers
# Check matching logs
sudo journalctl -u aitbc-marketplace -n 50
```
**Solutions:**
1. Check offer constraints
```bash
# Verify offer constraints
curl http://localhost:8102/v1/offers | jq '.[].constraints'
```
2. Restart matching engine
```bash
sudo systemctl restart aitbc-marketplace
```
3. Clear offer cache
```bash
# Clear Redis (FLUSHALL deletes every key in every database, including data cached by other services)
redis-cli FLUSHALL
# Restart service
sudo systemctl restart aitbc-marketplace
```
## Database Issues
### Connection Refused
**Symptoms:**
- Database connection errors
- Service unable to connect to PostgreSQL
- "Connection refused" messages
**Diagnosis:**
```bash
# Check PostgreSQL status
sudo systemctl status postgresql
# Test connection
psql -h localhost -U aitbc -d aitbc
# Check PostgreSQL logs
sudo tail -f /var/log/postgresql/postgresql-*.log
```
**Solutions:**
1. Restart PostgreSQL
```bash
sudo systemctl restart postgresql
```
2. Check connection limits
```bash
# Check max connections
psql -d aitbc -c "SHOW max_connections;"
# Check active connections
psql -d aitbc -c "SELECT count(*) FROM pg_stat_activity;"
```
3. Check firewall
```bash
# Check if port 5432 is open
sudo ufw status | grep 5432
# Allow PostgreSQL
sudo ufw allow 5432/tcp
```
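When checking connection limits, it helps to compare `max_connections` with the most connections the application can open. With SQLAlchemy's default queue pool, each engine can hold up to `pool_size + max_overflow` connections, and each worker process has its own pool. The sketch below does that arithmetic; the function names and the process count are illustrative assumptions, and 100 is PostgreSQL's default `max_connections`.

```python
def worst_case_connections(pool_size: int, max_overflow: int, processes: int = 1) -> int:
    """Upper bound on connections opened by `processes` workers, one pool each."""
    return processes * (pool_size + max_overflow)


def fits_within_limit(pool_size: int, max_overflow: int, processes: int,
                      max_connections: int = 100) -> bool:
    """Return True if the worst case stays within the server's connection limit."""
    return worst_case_connections(pool_size, max_overflow, processes) <= max_connections


if __name__ == "__main__":
    # pool_size=20 and max_overflow=40 are the larger pool settings shown in this guide
    print(worst_case_connections(20, 40, processes=2))
    print(fits_within_limit(20, 40, processes=2))
    # pool_size=5 and max_overflow=10 are the reduced settings shown in this guide
    print(fits_within_limit(5, 10, processes=2))
```

If the worst case exceeds the limit, either lower the pool settings or raise `max_connections`; the real limit is slightly lower because PostgreSQL reserves a few connections for superusers.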
### Slow Queries
**Symptoms:**
- API responses slow
- Database CPU high
- Query timeouts
**Diagnosis:**
```bash
# Enable query logging
psql -d aitbc -c "ALTER SYSTEM SET log_min_duration_statement = 1000;"
sudo systemctl reload postgresql
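# Check slow queries (requires the pg_stat_statements extension; the column is total_time on PostgreSQL 12 and older)
psql -d aitbc -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"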
```
**Solutions:**
1. Add indexes
```sql
-- Add index on frequently queried columns
CREATE INDEX idx_job_state ON job(state);
CREATE INDEX idx_job_created_at ON job(created_at);
```
2. Optimize queries
```sql
-- Use EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT * FROM job WHERE state = 'QUEUED';
```
3. Increase work_mem
```sql
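-- Increase work_mem for complex queries (applies per sort or hash operation, per connection)
ALTER SYSTEM SET work_mem = '256MB';
-- Reload the configuration (equivalent to: sudo systemctl reload postgresql)
SELECT pg_reload_conf();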
```
### Database Corruption
**Symptoms:**
- Data inconsistencies
- Queries return wrong results
- Database won't start
**Diagnosis:**
```bash
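# Read every table end to end; damaged pages surface as read errors
pg_dump -d aitbc > /dev/null
# Check checksum failure counters (PostgreSQL 12+, populated only when data checksums are enabled)
psql -d aitbc -c "SELECT datname, checksum_failures FROM pg_stat_database;"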
```
**Solutions:**
1. Restore from backup
```bash
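# Stop the Coordinator API and any other service using the database
# (PostgreSQL itself must keep running, otherwise psql cannot connect)
sudo systemctl stop aitbc-coordinator-api
# Restore from backup
psql -d aitbc < backup-20260511.sql
# Start the service again
sudo systemctl start aitbc-coordinator-api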
```
2. Use WAL recovery
```bash
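# Configure recovery (PostgreSQL 12+ no longer reads recovery.conf:
# set restore_command in postgresql.conf and create recovery.signal in the data directory)
echo "restore_command = 'cp /var/lib/postgresql/wal/%f %p'" | sudo tee -a /etc/postgresql/*/main/postgresql.conf
sudo -u postgres touch /var/lib/postgresql/*/main/recovery.signal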
# Restart PostgreSQL
sudo systemctl restart postgresql
```
## Network Issues
### Connection Timeouts
**Symptoms:**
- Services unable to connect to each other
- Intermittent connection failures
- High latency
**Diagnosis:**
```bash
# Test connectivity
ping -c 10 localhost
# Check DNS
nslookup localhost
# Check ports
telnet localhost 8011
```
**Solutions:**
1. Check network configuration
```bash
# Check IP configuration
ip addr show
# Check routing
ip route show
# Check DNS
cat /etc/resolv.conf
```
2. Check firewall rules
```bash
# Check UFW status
sudo ufw status
# Check iptables
sudo iptables -L -n
```
3. Check MTU
```bash
# Check MTU
ip link show
# Adjust MTU if needed
sudo ip link set eth0 mtu 1500
```
### DNS Issues
**Symptoms:**
- Domain names not resolving
- Services unable to connect by hostname
- Slow DNS resolution
**Diagnosis:**
```bash
# Test DNS resolution
nslookup google.com
# Check DNS servers
cat /etc/resolv.conf
# Test local DNS
dig localhost
```
**Solutions:**
1. Change DNS servers
```bash
# Use Google DNS (if /etc/resolv.conf is managed by systemd-resolved, this change is overwritten)
printf 'nameserver 8.8.8.8\nnameserver 8.8.4.4\n' | sudo tee /etc/resolv.conf
```
2. Clear DNS cache
```bash
# Clear the systemd-resolved cache (on older systems: systemd-resolve --flush-caches)
sudo resolvectl flush-caches
# Restart DNS service
sudo systemctl restart systemd-resolved
```
## GPU Issues
### GPU Not Detected
**Symptoms:**
- GPU not recognized
- CUDA errors
- Mining fails
**Diagnosis:**
```bash
# Check GPU
nvidia-smi
# Check CUDA
nvcc --version
# Check driver
dmesg | grep -i nvidia
```
**Solutions:**
1. Reinstall NVIDIA driver
```bash
# Remove old driver (quote the pattern so the shell does not expand it)
sudo apt remove --purge 'nvidia-*'
# Install new driver
sudo apt install nvidia-driver-535
# Reboot
sudo reboot
```
2. Check CUDA installation
```bash
# Verify CUDA installation
nvcc --version
# Reinstall CUDA if needed
sudo apt install nvidia-cuda-toolkit
```
3. Check GPU permissions
```bash
# Add user to video group
sudo usermod -aG video $USER
# Reboot
sudo reboot
```
### GPU Memory Errors
**Symptoms:**
- Out of memory errors
- CUDA out of memory
- Jobs failing
**Diagnosis:**
```bash
# Check GPU memory
nvidia-smi
# Monitor memory usage
watch -n 1 nvidia-smi
```
**Solutions:**
1. Reduce batch size
```python
# Reduce batch size in job configuration
batch_size = 8 # Reduce from 16
```
2. Clear GPU cache
```python
import torch
torch.cuda.empty_cache()
```
3. Restart mining service
```bash
sudo systemctl restart aitbc-miner
```
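For monitoring GPU memory from a script rather than watching `nvidia-smi` by hand, its CSV query mode gives machine-readable output. The sketch below parses that output; the function names are hypothetical, and the 90% threshold is an arbitrary example value.

```python
import subprocess
from typing import List, Tuple

# Query used and total memory in MiB, one line per GPU, without header or units
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(output: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' lines into (used_mib, total_mib) tuples."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpus_over_threshold(rows: List[Tuple[int, int]], threshold: float = 0.9) -> List[int]:
    """Return the indexes of GPUs whose memory use exceeds the threshold fraction."""
    return [i for i, (used, total) in enumerate(rows) if used / total > threshold]


def read_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `gpus_over_threshold(read_gpu_memory())` on a miner host lists the GPUs that are close to running out of memory, which is a cue to reduce the batch size before jobs start failing.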
## Performance Issues
### Slow API Response Times
**Symptoms:**
- API requests take long to complete
- Timeouts
- Poor user experience
**Diagnosis:**
```bash
# Measure response time
time curl http://localhost:8011/v1/jobs
# Check database query times
psql -d aitbc -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
```
**Solutions:**
1. Enable caching
```python
# Cache results in process memory (lru_cache is not Redis, and its entries never expire)
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_job(job_id: str):
    return job_service.get_job(job_id)
```
2. Optimize database queries
```sql
-- Add indexes
CREATE INDEX CONCURRENTLY idx_job_state ON job(state);
```
3. Use connection pooling
```python
# Increase pool size
from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_size=20,
    max_overflow=40,
)
```
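Cached job data goes stale as jobs change state, so entries usually need an expiry time. The following is a minimal standard-library sketch of a time-to-live cache decorator. The name `ttl_cache` and its `clock` parameter are hypothetical; the store is unbounded, so a library such as `cachetools` is the better fit when a size limit is also needed.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds: float, clock=time.monotonic):
    """Cache results per positional arguments for ttl_seconds."""
    def decorator(func):
        store = {}  # maps args -> (expiry time, cached value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # entry is still fresh
            value = func(*args)
            store[args] = (now + ttl_seconds, value)
            return value

        return wrapper
    return decorator


@ttl_cache(ttl_seconds=30)
def get_job_status(job_id: str) -> str:
    # Placeholder for a real lookup; returns a fixed value in this sketch
    return "QUEUED"
```

The `clock` parameter exists so that expiry can be tested without sleeping; production code can leave it at the default.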
### High Latency
**Symptoms:**
- Network latency high
- Slow data transfer
- Poor performance
**Diagnosis:**
```bash
# Measure latency
ping -c 10 localhost
# Check network throughput (run the server in one terminal and the client in another)
iperf3 -s
iperf3 -c localhost
```
**Solutions:**
1. Optimize network
```bash
# Check network configuration
ethtool eth0
# Adjust network settings
sudo ethtool -G eth0 rx 4096 tx 4096
```
2. Use local caching
```python
# Cache frequently accessed data
from cachetools import TTLCache
cache = TTLCache(maxsize=1000, ttl=300)
```
## Security Issues
### Unauthorized Access
**Symptoms:**
- Unauthorized API calls
- Failed authentication attempts
- Suspicious activity
**Diagnosis:**
```bash
# Check authentication logs
sudo journalctl -u aitbc-coordinator-api | grep -i authentication
# Check access logs
sudo tail -f /var/log/nginx/access.log
```
**Solutions:**
1. Review API keys
```bash
# List all API keys
curl -H "X-Admin-Key: $ADMIN_KEY" \
http://localhost:8011/v1/admin/api-keys
# Revoke suspicious keys (set KEY_ID to the ID of the key to revoke)
curl -X DELETE -H "X-Admin-Key: $ADMIN_KEY" \
"http://localhost:8011/v1/admin/api-keys/$KEY_ID"
```
2. Enable rate limiting
```python
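# Add rate limiting
from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter  # slowapi looks up the limiter on app.state

@app.post("/v1/jobs")
@limiter.limit("100/minute")
async def submit_job(request: Request):  # slowapi requires the request argument
    pass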
```
3. Enable IP whitelisting
```nginx
# Inside the nginx server or location block
allow 192.168.1.0/24;
deny all;
```
### Data Breach
**Symptoms:**
- Data accessed without authorization
- Logs show suspicious activity
- Credentials compromised
**Diagnosis:**
```bash
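# Check for suspicious activity (quote the pattern so the shell does not expand it)
sudo journalctl -u 'aitbc-*' | grep -i error
# Check access logs for 401/403 responses (assumes the default combined log format)
sudo grep -E '" (401|403) ' /var/log/nginx/access.log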
```
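Grouping the 401 and 403 responses by client address shows whether the failures come from a few sources. The sketch below counts them per IP. It assumes nginx's default combined log format; the function name `count_auth_failures` and the sample lines are illustrative.

```python
import re
from collections import Counter

# Client IP at the start of the line, status code right after the quoted request
LINE = re.compile(r'^(?P<ip>\S+) .*?" (?P<status>\d{3}) ')


def count_auth_failures(lines):
    """Count 401/403 responses per client IP."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("status") in {"401", "403"}:
            counts[match.group("ip")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [11/May/2026:13:46:42 +0200] "POST /v1/jobs HTTP/1.1" 401 153 "-" "curl/8.5.0"',
        '198.51.100.7 - - [11/May/2026:13:46:43 +0200] "GET /health HTTP/1.1" 200 401 "-" "curl/8.5.0"',
    ]
    for ip, failures in count_auth_failures(sample).most_common():
        print(ip, failures)
```

The second sample line has a response size of 401 bytes but a status of 200, so it is not counted; matching on the status field avoids that kind of false positive.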
**Solutions:**
1. Immediate containment
```bash
# Stop all services (quote the pattern so the shell does not expand it)
sudo systemctl stop 'aitbc-*'
# Change all credentials
# Rotate API keys
# Change database passwords
```
2. Investigate breach
```bash
# Preserve evidence (quote the pattern so the shell does not expand it)
sudo journalctl -u 'aitbc-*' > incident-logs.txt
# Analyze logs
grep -i "suspicious\|unauthorized" incident-logs.txt
```
3. Recovery
```bash
# Restore from backup
psql -d aitbc < backup.sql
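# Restart services (name them explicitly; systemctl patterns only match units that are already loaded)
sudo systemctl start aitbc-coordinator-api aitbc-blockchain aitbc-wallet aitbc-marketplace aitbc-miner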
```
## Getting Help
### Log Collection
When reporting issues, collect the following information:
```bash
# Service logs
sudo journalctl -u aitbc-coordinator-api -n 500 > coordinator.log
sudo journalctl -u aitbc-blockchain -n 500 > blockchain.log
sudo journalctl -u aitbc-marketplace -n 500 > marketplace.log
# System information
uname -a > system-info.txt
free -h >> system-info.txt
df -h >> system-info.txt
# Network information
ip addr show > network-info.txt
netstat -tulpn >> network-info.txt
# Database information
psql -d aitbc -c "\l" > database-info.txt
psql -d aitbc -c "SELECT version();" >> database-info.txt
```
### Support Channels
- **GitHub Issues**: https://github.com/oib/AITBC/issues
- **Documentation**: https://aitbc.bubuit.net/docs/
- **Community**: https://community.aitbc.dev/
### Debug Mode
Enable debug mode for detailed logging:
```bash
# Edit environment (tee runs as root; a plain >> redirect would run as the current user)
echo "DEBUG=true" | sudo tee -a /etc/aitbc/coordinator.env
# Restart service
sudo systemctl restart aitbc-coordinator-api
# View debug logs
sudo journalctl -u aitbc-coordinator-api -f
```