Find Duplicates among the PubMed, EMBASE, and Cochrane Library Databases in Systematic Review
Datasets usually provide raw data for analysis. This raw data often comes in spreadsheet form, but can be any collection of data, on which analysis can be performed.
Finding duplicates is an important phase of systematic review. However, no consensus regarding the methods to find duplicates has been provided. This study aims to describe a pragmatic strategy of combining auto- and hand-searching duplicates in systematic review and to evaluate the prevalence and characteristics of duplicates.
Methods and Findings
Literatures regarding portal vein thrombosis (PVT) and Budd-Chiari syndrome (BCS) were searched by the PubMed, EMBASE, and Cochrane library databases. Duplicates included one index paper and one or more redundant papers. They were divided into type-I (duplicates among different databases) and type-II (duplicate publications in different journals/issues) duplicates. For type-I duplicates, reference items were further compared between index and redundant papers. Of 10936 papers regarding PVT, 2399 and 1307 were identified as auto- and hand-searched duplicates, respectively. The prevalence of auto- and hand-searched redundant papers was 11.0% (1201/10936) and 6.1% (665/10936), respectively. They included 3431 type-I and 275 type-II duplicates. Of 11403 papers regarding BCS, 3275 and 2064 were identified as auto- and hand-searched duplicates, respectively. The prevalence of auto- and hand-searched redundant papers was 14.4% (1640/11403) and 9.1% (1039/11403), respectively. They included 5053 type-I and 286 type-II duplicates. Most of type-I duplicates were identified by auto-searching method (69.5%, 2385/3431 in PVT literatures; 64.6%, 3263/5053 in BCS literatures). Nearly all type-II duplicates were identified by hand-searching method (94.9%, 261/275 in PVT literatures; 95.8%, 274/286 in BCS literatures). Compared with those identified by auto-searching method, type-I duplicates identified by hand-searching method had a significantly higher prevalence of wrong items (47/2385 versus 498/1046, p<0.0001 in PVT literatures; 30/3263 versus 778/1790, p<0.0001 in BCS literatures). Most of wrong items originated from EMBASE database.
Given the inadequacy of a single strategy of auto-searching method, a combined strategy of auto- and hand-searching methods should be employed to find duplicates in systematic review.