New open source data-confirmation utility: damage (data manifest generator)

Paul Lesack, our multitalented Data/GIS Analyst in the UBC Library Research Commons, recently created a data manifest generator called damage that helps to verify the integrity of downloaded datasets. Damage is an open-source utility for both Mac and Windows with associated Python libraries for many different systems. This utility was created to assist the Data Liberation Initiative (DLI), which is a partnership between post-secondary institutions and Statistics Canada. 

The damage utility creates standardized documents that describe the contents of data sets, so users can confirm that what they have downloaded is what they are supposed to have. Damage uses checksums, values computed from the contents of a file, to give each file a practically unique fingerprint. Comparing the output damage produces for your copy with a document it created from the original files confirms that what you have is what you expected.
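
As a rough illustration of the checksum idea, the sketch below builds a simple manifest with Python's standard hashlib module and compares two manifests. It is not damage's own code or output format: the function names (file_checksum, build_manifest, compare_manifests), the choice of SHA-256 and the dictionary layout are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, algorithm="sha256"):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): file_checksum(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(expected, actual):
    """List files that are missing, unexpected, or changed relative to expected."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "changed": sorted(n for n in shared if expected[n] != actual[n]),
    }
```

If compare_manifests returns three empty lists, every file in the downloaded copy has the same checksum as its original, which is the kind of confirmation described above.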

Damage exports results in a variety of formats, so that it can be used with word processors, spreadsheets or computer programs. The fcheck Python library which powers the damage utility is available for those who wish to incorporate these features into their own software, and can be used on any computing platform that uses Python. 

Damage also has features unique to data: 

  • Identifying non-ASCII characters in text files (such as accented characters) 
  • Identifying file encoding to reduce incompatibility between systems 
  • Calculating the number of cases and variables in statistical files in SAS, SPSS and Stata formats 
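
To give a sense of what the first two checks involve, here is a minimal sketch using only the Python standard library. It does not show how damage performs these checks: find_non_ascii and guess_encoding are hypothetical helpers, and the short list of candidate encodings is an assumption chosen for illustration.

```python
from pathlib import Path


def find_non_ascii(path, encoding="utf-8"):
    """Return (line, column, character) for each non-ASCII character, counting from 1."""
    hits = []
    with open(path, encoding=encoding) as handle:
        for line_no, line in enumerate(handle, start=1):
            for col, ch in enumerate(line, start=1):
                if ord(ch) > 127:
                    hits.append((line_no, col, ch))
    return hits


def guess_encoding(path, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the file, or None.

    This is a trial-and-error guess, not a reliable detector: several
    encodings can decode the same bytes without error.
    """
    raw = Path(path).read_bytes()
    for candidate in candidates:
        try:
            raw.decode(candidate)
        except UnicodeDecodeError:
            continue
        return candidate
    return None
```

A file containing the word "café" saved as UTF-8 would be reported with one non-ASCII character, which is the sort of detail that helps explain why a text file displays differently on another system.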

Although designed for data, the damage utility is useful to anyone who wants to document, share and verify their files. Learn more about it by reading the documentation here

Download the Windows and Mac versions of the utility here