Friday, December 3, 2010

Google Refine Messy Data

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions. I've tried it and recommend it for cleaning up inconsistent data files when you need a consistent database. The files can be in a number of formats, including files from Access or Excel. There is very little to download, and you can be up and running in a relatively short time. Some small bugs exist; for example, I discovered that I could not export to CSV in Internet Explorer, so I use Firefox! Considering that it is open and free, this tool is a little-known secret.