# Compass

**Repository Path**: trent/compass

## Basic Information

- **Project Name**: Compass
- **Description**: Compass is a big data task diagnosis platform that aims to help users troubleshoot problems more efficiently and to reduce the cost of abnormal tasks. Its main features are:
  - Non-invasive, instant diagnosis: you can try the diagnostics without modifying your existing scheduling platform.
  - Supports multiple mainstream scheduling platforms, such as DolphinScheduler, Airflow, or self-developed schedulers.
  - Supports log diagnosis and parsing for tasks on multiple versions of Spark and on Hadoop 2.x and 3.x.
  - Supports workflow-layer exception diagnosis, identifying various failures and baseline time-consumption anomalies.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 19
- **Created**: 2023-04-06
- **Last Updated**: 2023-04-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Compass

[中文文档 (Chinese documentation)](README_zh.md)

Compass is a big data task diagnosis platform that aims to make troubleshooting more efficient and to reduce the cost of abnormal tasks for users.

The key features:

- Non-invasive, instant diagnosis: you can try the diagnostics without modifying your existing scheduling platform.
- Supports multiple scheduling platforms (DolphinScheduler, Airflow, self-developed schedulers, etc.).
- Supports troubleshooting for Spark 2.x or 3.x and Hadoop 2.x or 3.x.
- Supports workflow-layer exception diagnosis, identifying various failures and baseline time-consumption anomalies.
- Supports Spark engine-layer exception diagnosis, covering 14 types of exceptions such as data skew, large table scanning, and memory waste.
- Supports writing custom log matching rules and adjusting abnormality thresholds, so diagnosis can be tuned to actual scenarios.

Compass currently supports the following diagnostic types:

| Diagnostic Dimension | Diagnostic Type | Type Description |
| --- | --- | --- |
| Failure analysis | Run failure | Tasks that ultimately fail to run |
| Failure analysis | First failure | Tasks that have been retried more than once |
| Failure analysis | Long term failure | Tasks that have failed to run in the last ten days |
| Time analysis | Baseline time abnormality | Tasks that end earlier or later than the historical normal end time |
| Time analysis | Baseline time-consuming abnormality | Tasks that run much longer or shorter than the historical normal running time |
| Time analysis | Long running time | Tasks that run for more than two hours |
| Error analysis | SQL failure | Tasks that fail due to SQL execution issues |
| Error analysis | Shuffle failure | Tasks that fail due to shuffle execution issues |
| Error analysis | Memory overflow | Tasks that fail due to memory overflow issues |
| Cost analysis | Memory waste | Tasks whose ratio of peak memory usage to total memory is too low |
| Cost analysis | CPU waste | Tasks whose ratio of driver/executor calculation time to total CPU calculation time is too low |
| Efficiency analysis | Large table scanning | Tasks that scan too many rows because no partition restriction is applied |
| Efficiency analysis | OOM warning | Tasks where the cumulative memory of broadcast tables takes up a high proportion of driver or executor memory |
| Efficiency analysis | Data skew | Tasks where the maximum amount of data processed by a task in a stage is much larger than the median |
| Efficiency analysis | Job time-consuming abnormality | Tasks with a high ratio of idle time to job running time |
| Efficiency analysis | Stage time-consuming abnormality | Tasks with a high ratio of idle time to stage running time |
| Efficiency analysis | Task long tail | Tasks where the maximum running time of a task in a stage is much larger than the median |
| Efficiency analysis | HDFS stuck | Tasks where the processing rate of tasks in a stage is too slow |
| Efficiency analysis | Too many speculative execution tasks | Tasks where speculative execution occurs frequently in a stage |
| Efficiency analysis | Global sorting abnormality | Tasks with a long running time due to global sorting |
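
Several of the types above boil down to comparing a ratio or a duration against a threshold. The following is a minimal Python sketch of that idea for four of them (data skew, task long tail, memory waste, long running time). It is not Compass's actual implementation: the `StageMetrics` structure, the `diagnose` function, and every threshold except the two-hour limit are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List

# Hypothetical thresholds for illustration only; they are not Compass defaults.
SKEW_RATIO = 10.0           # max / median data processed per task
LONG_TAIL_RATIO = 10.0      # max / median task running time
MEMORY_WASTE_RATIO = 0.2    # peak memory / total memory
LONG_RUNNING_SECONDS = 2 * 60 * 60  # "more than two hours", as in the table


@dataclass
class StageMetrics:
    """Per-task measurements for one stage (hypothetical structure)."""
    task_bytes: List[int]
    task_seconds: List[float]


def max_to_median(values):
    """Return max/median, or 0.0 when the ratio is undefined."""
    if not values:
        return 0.0
    mid = median(values)
    return max(values) / mid if mid > 0 else 0.0


def diagnose(stage, peak_memory, total_memory, run_seconds):
    """Return the names of the diagnostic types that this task triggers."""
    found = []
    if max_to_median(stage.task_bytes) > SKEW_RATIO:
        found.append("Data skew")
    if max_to_median(stage.task_seconds) > LONG_TAIL_RATIO:
        found.append("Task long tail")
    if total_memory > 0 and peak_memory / total_memory < MEMORY_WASTE_RATIO:
        found.append("Memory waste")
    if run_seconds > LONG_RUNNING_SECONDS:
        found.append("Long running time")
    return found


if __name__ == "__main__":
    # One task processes far more data than the others, memory is mostly
    # unused, and the run lasts 2.5 hours.
    stage = StageMetrics(task_bytes=[100, 120, 110, 5000],
                         task_seconds=[10, 12, 11, 13])
    print(diagnose(stage, peak_memory=2, total_memory=16, run_seconds=9000))
```

Comparing the maximum against the median rather than the mean keeps a single oversized task from inflating its own baseline, which matches the "much larger than the median" wording in the table.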