GU Zhaojun1, 2, YE Jingwei2, 3, LIU Chunbo1, ZHANG Zhikai2, WANG Zhi4
System log data often exhibit both "group anomaly" and "local anomaly" distribution characteristics. On such data, the traditional semi-supervised log anomaly detection method, anomaly detection with partially observed anomalies (ADOA), generates inaccurate pseudo-labels for unlabeled samples. To address this problem, an improved semi-supervised log anomaly detection model is proposed. The known abnormal samples are clustered by k-means, and the reconstruction errors of the unlabeled samples are computed by kernel principal component analysis. A comprehensive anomaly score is computed for each sample from its reconstruction error and its similarity to the abnormal samples, and this score serves as the pseudo-label. Sample weights for the LightGBM classifier are derived from the pseudo-labels and used to train the anomaly detection model. Parameter experiments explore how the proportion of training set samples affects model performance. Experiments were conducted on two public datasets, HDFS and BGL. The results show that the proposed model improves pseudo-label accuracy. Compared with the existing models DeepLog, LogAnomaly, LogCluster, PCA and PLELog, the proposed model achieves higher precision and F1 score. Compared with the traditional ADOA method, its F1 score increases by 8.4% and 8.5% on the two datasets, respectively.
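The scoring and weighting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the similarity is taken as the maximum of exp(-squared distance) over the k-means centers of the known anomalies, the two scores are mixed with a hypothetical weight `theta`, and the hypothetical thresholds `alpha` and `beta` split samples into likely anomalies, likely normals, and undecided samples. The reconstruction errors are assumed to come from a kernel PCA step computed elsewhere.

```python
import math


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def similarity_to_anomalies(x, centers):
    """Assumed similarity: max over anomaly cluster centers of exp(-||x - c||^2)."""
    return max(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)))
        for c in centers
    )


def combined_scores(recon_errors, similarities, theta=0.5):
    """Mix normalized reconstruction error and similarity (theta is hypothetical)."""
    errs = min_max(recon_errors)
    return [theta * e + (1.0 - theta) * s for e, s in zip(errs, similarities)]


def pseudo_labels_and_weights(scores, alpha, beta):
    """Return (label, weight) per sample.

    score >= alpha -> label 1 (likely anomaly), weight = score
    score <= beta  -> label 0 (likely normal), weight grows as the score drops
    otherwise      -> label None, weight 0.0 (left out of training)
    """
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    result = []
    for s in scores:
        if s >= alpha:
            result.append((1, s))
        elif s <= beta:
            result.append((0, (hi - s) / span))
        else:
            result.append((None, 0.0))
    return result


# Toy data: one anomaly cluster center at the origin, two unlabeled samples.
centers = [(0.0, 0.0)]
samples = [(0.0, 0.0), (3.0, 4.0)]
recon_errors = [2.0, 0.0]  # placeholder values standing in for kernel PCA errors

sims = [similarity_to_anomalies(x, centers) for x in samples]
scores = combined_scores(recon_errors, sims, theta=0.5)
labeled = pseudo_labels_and_weights(scores, alpha=0.8, beta=0.2)
```

In a full pipeline, the labeled samples and their weights would then be passed to a classifier; for example, the `fit` method of LightGBM's `LGBMClassifier` accepts a `sample_weight` argument. How the paper maps scores to weights may differ from this sketch.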