Some checks failed
API Endpoint Tests / test-api-endpoints (push) Successful in 10s
Blockchain Synchronization Verification / sync-verification (push) Failing after 3s
CLI Tests / test-cli (push) Failing after 4s
Documentation Validation / validate-docs (push) Successful in 8s
Documentation Validation / validate-policies-strict (push) Successful in 4s
Integration Tests / test-service-integration (push) Successful in 38s
Multi-Node Blockchain Health Monitoring / health-check (push) Successful in 2s
P2P Network Verification / p2p-verification (push) Successful in 3s
Security Scanning / security-scan (push) Successful in 40s
Smart Contract Tests / test-solidity (map[name:aitbc-token path:packages/solidity/aitbc-token]) (push) Successful in 15s
Smart Contract Tests / lint-solidity (push) Successful in 8s
- Relocate blockchain-event-bridge README content to docs/apps/blockchain/blockchain-event-bridge.md - Relocate blockchain-explorer README content to docs/apps/blockchain/blockchain-explorer.md - Replace app READMEs with redirect notices pointing to new documentation location - Consolidate documentation in central docs/ directory for better organization
214 lines
4.8 KiB
Markdown
214 lines
4.8 KiB
Markdown
# Monitor
|
|
|
|
## Status
|
|
✅ Operational
|
|
|
|
## Overview
|
|
System monitoring and alerting service for tracking application health, performance metrics, and generating alerts for critical events.
|
|
|
|
## Architecture
|
|
|
|
### Core Components
|
|
- **Health Check Service**: Periodic health checks for all services
|
|
- **Metrics Collector**: Collects performance metrics from applications
|
|
- **Alert Manager**: Manages alert rules and notifications
|
|
- **Dashboard**: Web dashboard for monitoring visualization
|
|
- **Log Aggregator**: Aggregates logs from all services
|
|
- **Notification Service**: Sends alerts via email, Slack, etc.
|
|
|
|
## Quick Start (End Users)
|
|
|
|
### Prerequisites
|
|
- Python 3.13+
|
|
- Access to application endpoints
|
|
- Notification service credentials (email, Slack webhook)
|
|
|
|
### Installation
|
|
```bash
|
|
cd /opt/aitbc/apps/monitor
|
|
.venv/bin/pip install -r requirements.txt
|
|
```
|
|
|
|
### Configuration
|
|
Set environment variables in `.env`:
|
|
```bash
|
|
MONITOR_INTERVAL=60
|
|
ALERT_EMAIL=admin@example.com
|
|
SLACK_WEBHOOK=https://hooks.slack.com/services/...
|
|
PROMETHEUS_URL=http://localhost:9090
|
|
```
|
|
|
|
### Running the Service
|
|
```bash
|
|
.venv/bin/python main.py
|
|
```
|
|
|
|
### Access Dashboard
|
|
Open `http://localhost:8080` in a browser to access the monitoring dashboard.
|
|
|
|
## Developer Guide
|
|
|
|
### Development Setup
|
|
1. Clone the repository
|
|
2. Create virtual environment: `python -m venv .venv`
|
|
3. Install dependencies: `pip install -r requirements.txt`
|
|
4. Configure monitoring targets
|
|
5. Run tests: `pytest tests/`
|
|
|
|
### Project Structure
|
|
```
|
|
monitor/
|
|
├── src/
|
|
│ ├── health_check/ # Health check service
|
|
│ ├── metrics_collector/ # Metrics collection
|
|
│ ├── alert_manager/ # Alert management
|
|
│ ├── dashboard/ # Web dashboard
|
|
│ ├── log_aggregator/ # Log aggregation
|
|
│ └── notification/ # Notification service
|
|
├── tests/ # Test suite
|
|
└── pyproject.toml # Project configuration
|
|
```
|
|
|
|
### Testing
|
|
```bash
|
|
# Run all tests
|
|
pytest tests/
|
|
|
|
# Run health check tests
|
|
pytest tests/test_health_check.py
|
|
|
|
# Run alert manager tests
|
|
pytest tests/test_alerts.py
|
|
```
|
|
|
|
## API Reference
|
|
|
|
### Health Checks
|
|
|
|
#### Run Health Check
|
|
```http
|
|
GET /api/v1/monitor/health/{service_name}
|
|
```
|
|
|
|
#### Get All Health Status
|
|
```http
|
|
GET /api/v1/monitor/health
|
|
```
|
|
|
|
#### Add Health Check Target
|
|
```http
|
|
POST /api/v1/monitor/health/targets
|
|
Content-Type: application/json
|
|
|
|
{
|
|
"service_name": "string",
|
|
"endpoint": "http://localhost:8000/health",
|
|
"interval": 60,
|
|
"timeout": 10
|
|
}
|
|
```
|
|
|
|
### Metrics
|
|
|
|
#### Get Metrics
|
|
```http
|
|
GET /api/v1/monitor/metrics?service=blockchain-node
|
|
```
|
|
|
|
#### Query Prometheus
|
|
```http
|
|
POST /api/v1/monitor/metrics/query
|
|
Content-Type: application/json
|
|
|
|
{
|
|
"query": "up{job=\"blockchain-node\"}",
|
|
"range": "1h"
|
|
}
|
|
```
|
|
|
|
### Alerts
|
|
|
|
#### Create Alert Rule
|
|
```http
|
|
POST /api/v1/monitor/alerts/rules
|
|
Content-Type: application/json
|
|
|
|
{
|
|
"name": "high_cpu_usage",
|
|
"condition": "cpu_usage > 80",
|
|
"duration": 300,
|
|
"severity": "warning|critical",
|
|
"notification": "email|slack"
|
|
}
|
|
```
|
|
|
|
#### Get Active Alerts
|
|
```http
|
|
GET /api/v1/monitor/alerts/active
|
|
```
|
|
|
|
#### Acknowledge Alert
|
|
```http
|
|
POST /api/v1/monitor/alerts/{alert_id}/acknowledge
|
|
```
|
|
|
|
### Logs
|
|
|
|
#### Query Logs
|
|
```http
|
|
POST /api/v1/monitor/logs/query
|
|
Content-Type: application/json
|
|
|
|
{
|
|
"service": "blockchain-node",
|
|
"level": "ERROR",
|
|
"time_range": "1h",
|
|
"query": "error"
|
|
}
|
|
```
|
|
|
|
#### Get Log Statistics
|
|
```http
|
|
GET /api/v1/monitor/logs/stats?service=blockchain-node
|
|
```
|
|
|
|
## Configuration
|
|
|
|
### Environment Variables
|
|
- `MONITOR_INTERVAL`: Interval for health checks (default: 60s)
|
|
- `ALERT_EMAIL`: Email address for alert notifications
|
|
- `SLACK_WEBHOOK`: Slack webhook for notifications
|
|
- `PROMETHEUS_URL`: Prometheus server URL
|
|
- `LOG_RETENTION_DAYS`: Log retention period (default: 30 days)
|
|
- `ALERT_COOLDOWN`: Alert cooldown period (default: 300s)
|
|
|
|
### Monitoring Targets
|
|
- **Services**: List of services to monitor
|
|
- **Endpoints**: Health check endpoints for each service
|
|
- **Intervals**: Check intervals for each service
|
|
|
|
### Alert Rules
|
|
- **CPU Usage**: Alert when CPU usage exceeds threshold
|
|
- **Memory Usage**: Alert when memory usage exceeds threshold
|
|
- **Disk Usage**: Alert when disk usage exceeds threshold
|
|
- **Service Down**: Alert when service is unresponsive
|
|
|
|
## Troubleshooting
|
|
|
|
**Health check failing**: Verify service endpoint and network connectivity.
|
|
|
|
**Alerts not triggering**: Check alert rule configuration and notification settings.
|
|
|
|
**Metrics not collecting**: Verify Prometheus integration and service metrics endpoints.
|
|
|
|
**Logs not appearing**: Check log aggregation configuration and service log paths.
|
|
|
|
## Security Notes
|
|
|
|
- Secure access to monitoring dashboard
|
|
- Use authentication for API endpoints
|
|
- Encrypt alert notification credentials
|
|
- Implement role-based access control
|
|
- Regularly review alert rules
|
|
- Monitor for unauthorized access attempts
|