Abstract: Performance evaluation of search engines is an important aspect of information retrieval. Many evaluation metrics with differing characteristics have been proposed, and accurate, reliable judgment is required to select the optimal metric among the candidates. A method based on the t test was proposed, and an empirical investigation was conducted to compare five commonly used metrics: average precision (AP), precision at the 10-document level (P@10), recall-level precision (RP), reciprocal rank (RR), and normalized discounted cumulative gain (NDCG). The results show that NDCG performs best, followed by AP, RP, and P@10, with RR the worst. The proposed method provides a quantitative conclusion for the comparison of any two metrics.
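To make the t-test-based comparison concrete, the following is a minimal sketch of one common way such an evaluation is set up: applying a paired t test to per-topic scores of competing systems and counting the significantly different pairs as a proxy for a metric's discriminative power. The function name `significant_pairs`, the synthetic data, and the exact counting procedure are illustrative assumptions, not necessarily the paper's method.

```python
import numpy as np
from scipy import stats

def significant_pairs(scores, alpha=0.05):
    """Count system pairs whose per-topic scores differ significantly
    under a paired t test (hypothetical helper; the paper's exact
    procedure may differ).

    scores: array of shape (n_systems, n_topics), one row of per-topic
    metric values (e.g. AP or NDCG scores) per system.
    """
    n_systems = scores.shape[0]
    count, total = 0, 0
    for i in range(n_systems):
        for j in range(i + 1, n_systems):
            # Paired t test: systems i and j are scored on the same topics.
            _, p = stats.ttest_rel(scores[i], scores[j])
            if p < alpha:
                count += 1
            total += 1
    return count, total

# Synthetic example: 5 systems evaluated on 50 topics under one metric.
rng = np.random.default_rng(0)
scores = rng.uniform(size=(5, 50))
sig, total = significant_pairs(scores)
print(f"{sig}/{total} system pairs significantly different")
```

Under this setup, a metric that yields more significantly different system pairs at a fixed alpha would be judged more discriminative; the same routine run per metric allows the pairwise, quantitative comparison the abstract describes.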