System Monitor¶
This chapter describes how to monitor the resource state (CPU, memory, disk, network, etc.) of the solution server and configure thresholds. (Depending on the operational environment, a Grafana-based integrated monitoring view may also be available.)
System Overview¶
Select Admin → System Status → System Monitoring in the left sidebar.
The following information is displayed in real time.
| Metric | Korean | Items Shown |
|---|---|---|
| CPU | CPU | Per-core utilization (%), average, peak |
| Memory | 메모리 | Used / free / total (GB), utilization (%) |
| Disk | 디스크 | Per-partition used / free / utilization |
| Network | 네트워크 | Tx / Rx (MB/s), connection state |
| Uptime | 가동 시간 | Time elapsed since boot |
The Pause / Resume buttons at the top control screen refresh (server-side monitoring continues regardless).
Utilization Levels¶
Each metric's utilization is color-coded into four levels.
| Level | Korean | Color | Meaning |
|---|---|---|---|
| Low | 낮음 | Green | Plenty of headroom |
| Medium | 보통 | Yellow | Normal range |
| High | 높음 | Orange | Caution |
| Critical | 위험 | Red | Immediate action required |
Default thresholds, adjustable per environment:
| Metric | Medium | High | Critical |
|---|---|---|---|
| CPU | 60% | 80% | 90% |
| Memory | 70% | 85% | 95% |
| Disk | 70% | 85% | 95% |
Configuring Thresholds¶
- Click the Settings (⚙) button at the top right of System Monitor
- Adjust the per-metric threshold sliders
- Notification Channel — select where to send threshold alerts (email, webhook, etc.)
- Save
Ignoring Brief Spikes
Brief spikes from batch jobs or sudden user surges are normal. Configuring Critical alerts to fire only when sustained for over an hour reduces noise.
Resource History¶
The Resource History tab shows time-series charts for past trends.
| Period Option | Data Resolution |
|---|---|
| Last 1 hour | 1-second granularity |
| Last 24 hours | 1-minute granularity |
| Last 7 days | 5-minute granularity |
| Last 30 days | 1-hour granularity |
When investigating spikes, cross-reference the audit log around the same time to infer causes.
System Inspection¶
Select Admin → System Status → System Inspection in the left sidebar. The screen header reads "Service status and version compatibility" and the page lists health-check results for every backend service the solution depends on.
Screen layout¶
| Area | Location | Description |
|---|---|---|
| Refresh button | Top-right | Runs the health check again immediately. |
| Card — Total | Top gray | Total number of monitored services. |
| Card — Healthy | Top green | Services that responded successfully on the latest health check. |
| Card — Unhealthy | Top red | Services that did not respond or returned an error — anything other than 0 needs immediate attention. |
| Card — Incompatible | Top yellow | Services that responded with a version different from what the solution expects. |
| Service search input | Body top-right | Filter the list instantly by service name. |
| Service table | Body | One row per service — Service / Status / Version / Compatibility columns. |
Service table columns¶
| Column | Description |
|---|---|
| Service | Backend component name the solution depends on (e.g., workflow-service, retrieval-service, qdrant, postgres, redis, audio-service, xgen-model). |
| Status | Latest health-check result. Green Healthy / red Unhealthy badge. |
| Version | Reported version string (e.g., 2.0.0). - if not reported. |
| Compatibility | Whether the version matches what the solution requires. - if not reported. |
Usage scenarios¶
- First check during an outage — When users report issues (e.g., no chat response, embedding errors), open this screen first and confirm whether the Unhealthy card is non-zero.
- Post-deployment verification — After a release or infrastructure operation, refresh to confirm that all services have returned to healthy.
- Version-compatibility tracing — If Incompatible is non-zero, use the Version column to identify which services are off-version and share with the infrastructure team.
Log Viewer¶
Select Admin → System Status → Log Viewer in the left sidebar. Technical logs emitted by backend services can be searched and filtered by level.
Screen layout¶
| Area | Location | Description |
|---|---|---|
| Tab — All | Top-left (default) | All log levels in chronological order. |
| Tab — Error | Second | System errors and exceptions only — first stop when tracing an incident. |
| Tab — Warn | Third | Potential risk signals (succeeded but with anomalies). |
| Tab — Info | Fourth | Normal-flow informational logs (e.g., request handled). |
| Tab — Debug | Fifth | Detailed development / analysis logs. Normally disabled in production, enabled only in diagnostic mode. |
| Log search input | Top-right | Instant filter by message or service. |
| Refresh button | Top-right | Reload latest logs immediately. |
Log table columns¶
| Column | Description |
|---|---|
| Level | ERROR / WARN / INFO / DEBUG shown as a colored badge. |
| Message | Log body (e.g., Successfully retrieved system status, No execution data found for the given filters). |
| Source | Identifier of the originating service / module (e.g., get_system_status, workflow_processor). |
| Details | Call path, trace context, and other metadata. |
| Time | Timestamp (second granularity). Click the column-header ↓ to toggle sort direction. |
Per-tab usage guide¶
All¶
All log levels shown in chronological order. Use this at the start of analysis to view the time-line context as a whole, or to see logs around an error event.
Error (ERROR)¶
System errors and exceptions only. Check at least once a day; trace causes of new ERRORs as soon as they appear and cross-reference with user reports. Rows are tagged with a red ERROR badge.
Warn (WARN)¶
Potential risk signals (succeeded but with anomalies). Review weekly — accumulating warnings can foreshadow incidents, so watch for trends.
Info (INFO)¶
Normal-flow informational logs (e.g., request handled). Use to trace a specific request's normal flow or confirm that an expected call happened.
Debug (DEBUG)¶
Detailed development / analysis logs. Noisy in production; enable only when reproducing a specific issue and return to normal level once diagnosis is done.
vs. Audit Log
- Log Viewer: technical logs from backend components (stack traces, processing-result messages). Used by operations to chase incidents.
- Audit Log: a permanent record of who did what when (user activity). Used for regulatory and internal-audit response. For data-change audits, see Data Management · Data Audit Log.
Operational recommendations¶
- Standard incident-response flow — Use System Inspection to identify the unhealthy service → switch to Log Viewer → Error tab for the same time window → forward the message and source to the infrastructure team.
- Turn DEBUG off after use — Debug-level logs put pressure on disk usage and search performance. Return to normal level as soon as diagnosis is done.
- Standardize search keywords — Share common keywords (service names, exception classes) across the team so investigations converge quickly.
Operational Recommendations¶
- Weekly review — Inspect the 30-day chart on Resource History weekly. Disks fill gradually, so weekly checks are essential.
- Watch for unplanned restarts — Abnormally short uptime (e.g., under one day) suggests an unplanned restart. Check the audit log for cause.
- Periodic threshold recalibration — Baseline utilization rises with user growth. Reassess threshold appropriateness quarterly.
Contact¶
For questions about System Monitor, please contact the Xgen Solution Administrator.






