Assessing Data Quality

Indicators are measured from a single data source or compiled from multiple data sources. Assessing an indicator's usefulness requires understanding the quality of the data used to measure it. Key dimensions of data quality include validity, reliability, timeliness, and comprehensiveness.

Validity refers to the level of confidence that the data measure what they are supposed to measure. For example, do “years of teaching” actually measure “quality of instruction”? And, if so, are the data for “years of teaching” collected and reported in such a way that they are an accurate reflection of on-the-ground conditions?

Reliability, in this context, refers to consistency in how the data are collected. If survey data are being used, are the same questions being asked in subsequent surveys? For example, in 2008 the U.S. Census Bureau changed the questions asked to determine disability status. Thus, disability data collected before 2008 are not comparable to disability data collected after that date. Disability data collected before and after 2008 are still "valid" in that they provide statistics on the numbers and types of disabilities. However, a change-over-time analysis that spans this period would not be considered reliable.

Timeliness refers to how often the data are updated. Are the data current enough to reflect on-the-ground conditions? The U.S. Census collects massive amounts of data in a way that is both valid and (most often) reliable. However, the census collects and compiles its most comprehensive data only once a decade, which limits its timeliness. For example, the Great Recession, which hit eight years after the previous census, had a significant impact on demographic patterns related to income and employment. But updated data did not become available until the results of the 2010 census were released.

An additional factor affecting timeliness is the time and expense needed to collect, process, and disseminate data. Unless there is a particular mandate and sufficient funds to collect data consistently, data sets are often compiled on a one-time-only basis or updated “as needed.” Data that are localized or reflect a special interest are often collected infrequently. Examples of these kinds of data include surveys on resident perceptions of safety, comprehensive listings of community and cultural institutions, and mapping of the locations and conditions of sidewalks and pedestrian safety features.

Comprehensiveness refers to whether the data provide a complete measure of the indicator the data are intended to represent. Data collected through a complete census of the population (such as those provided by the U.S. Census) are the most comprehensive data available. Data collected through a scientific sampling method (such as those provided by the American Community Survey) can also be considered comprehensive as long as the margins of error are in an acceptable range. In contrast, data collected through a convenience sample are rarely comprehensive. Examples include membership listings provided by local arts associations, client data provided by social service providers, data on the homeless population gathered through one-time visual surveys, and data on fair housing violations gathered through complaint hotlines.
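As a rough illustration of the "acceptable margin of error" check mentioned above, the sketch below estimates the margin of error for a sample proportion using the standard normal approximation. This is a generic statistical formula, not a procedure taken from the report; the function name and the example figures are ours.

```python
import math

# z-score for a 95% confidence level under the normal approximation
Z_95 = 1.96

def margin_of_error(p, n, z=Z_95):
    """Margin of error for an estimated proportion p from a random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A 50% estimate from a random sample of 1,000 respondents carries
# roughly a +/- 3.1 percentage-point margin of error at 95% confidence.
print(round(margin_of_error(0.5, 1000), 3))  # 0.031
```

An analyst would compare such a margin against the size of the differences the indicator is meant to detect; if the margin swamps those differences, the sampled data cannot be considered comprehensive for that purpose.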

Issues of comprehensiveness can also stem from selection bias. For example, body mass index data collected from driver’s license records would be meaningless for an analysis focused on child obesity rates. Similarly, health outcome data based on insurance claims records would not be useful for an analysis of the health care needs of the uninsured.

Adapted from Besser, Diane T. (April 2014). "What Does 'Equity' Look Like? A Synthesis of Equity Policy, Administration and Planning in the Portland, Oregon, Metropolitan Area." Unpublished report prepared for the Institute for Metropolitan Studies, Portland State University, Portland, OR.