Recently, I took some time to analyze the geospatial metadata in geo.data.gov. The results are extremely interesting, and they highlight a difficult challenge facing Data.gov as we work to improve how our users search for and discover Federal geospatial datasets, a capability that relies heavily on the quality of our metadata.
Almost all agencies publishing metadata on geo.data.gov currently use the FGDC format, which is named after the interagency committee that established the standard in 1994. The latest version of the format has been around for close to 15 years, is very highly structured, and has many required elements. Most importantly for metadata geeks, many of the fields allow the use of free text, and do not enforce strict vocabularies.
I decided to look at one important FGDC metadata element in particular: the publisher name (the actual XML tag is <publish>), which is usually the agency that provided the data. Both our public consumers and our data providers are interested in filtering geospatial results by agency, and I wanted to see how feasible it would be to index this metadata field in order to create a filter-by-agency capability. We continually get requests to provide current counts of datasets per agency, to let agencies track dataset publishing status for just their own organization, and to give data journalists another facet by which they can gauge individual agency participation in open data and open government. This is the field best suited to satisfy those needs.
Below is a list of the unique values ranked by frequency of occurrence based on my original query parameters. Specifically, I queried for all records that have been created or updated within the last year that are also approved for release on Data.gov. You can interact with this list directly by searching/filtering.
As you can see, we describe our agencies’ names in many, many different ways. My personal favorites are the many different ways that USGS and NOAA are described. USGS describes itself as “U.S. Geological Survey”, “USGS”, “U.S. Geological Survey (USGS)”, and three others. In their defense, they are also probably the ones with the greatest number of groups publishing geospatial metadata, so they’re also the most likely to be at the top of the list.
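For readers who want to run a similar tally on their own records, here is a minimal Python sketch. It is not the query I ran against geo.data.gov: the sample records are made up for illustration, the helper names (make_record, publisher_counts) are hypothetical, and it assumes each record is available as an FGDC XML string with the publisher in a <publish> element.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def make_record(publisher):
    """Build a tiny FGDC-style record (illustrative only, not a complete record)."""
    return (
        "<metadata><idinfo><citation><citeinfo><pubinfo>"
        f"<publish>{publisher}</publish>"
        "</pubinfo></citeinfo></citation></idinfo></metadata>"
    )


# Made-up sample data; a real run would load the harvested XML records instead.
SAMPLE_RECORDS = [
    make_record(name)
    for name in [
        "U.S. Geological Survey",
        "USGS",
        "U.S. Geological Survey (USGS)",
        "USGS",
        "NOAA",
    ]
]


def publisher_counts(xml_records):
    """Count each distinct <publish> value exactly as written in the records."""
    counts = Counter()
    for xml_text in xml_records:
        root = ET.fromstring(xml_text)
        for node in root.iter("publish"):
            name = (node.text or "").strip()
            if name:
                counts[name] += 1
    return counts


counts = publisher_counts(SAMPLE_RECORDS)
# List the unique values ranked by frequency of occurrence.
for name, n in counts.most_common():
    print(n, name)
```

Because the values are compared as raw strings, every spelling variant shows up as its own row, which is exactly the problem the list illustrates.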
Faceted search is highly dependent on metadata quality. In order for us to be able to provide a way to filter results by agency, we need a standard way of describing agency names, or a way to map the different labels representing the same thing. Most search engines don’t expose very many facets, but providing the most common ones can make a huge difference in terms of better search and discovery.
At this point, we are looking at two ways to provide a short-term solution to this problem while the community looks, over the longer term, at replacing the FGDC metadata standard with something better suited (one of the standards gaining a lot of traction in the geo community today is ISO 19115-2). The first is to use entity resolution technology to try to converge the names onto a controlled vocabulary list of agencies that we manage. The second is to require agencies, when publishing their metadata to Data.gov, to reference a controlled vocabulary unique identifier for this element.
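To make the first option concrete, here is a deliberately simple Python sketch of converging free-text labels onto a controlled list. It is not real entity resolution technology, only normalization plus a hand-maintained alias table. The alias table, the agency keys, and the function names (normalize, resolve) are all assumptions made for illustration.

```python
import re

# Hypothetical alias table: normalized label -> controlled vocabulary key.
ALIASES = {
    "usgs": "USGS",
    "us geological survey": "USGS",
    "us geological survey usgs": "USGS",
    "noaa": "NOAA",
    "national oceanic and atmospheric administration": "NOAA",
}


def normalize(label):
    """Lowercase, drop periods, and collapse other punctuation and spacing."""
    label = label.lower().replace(".", "")
    label = re.sub(r"[^a-z0-9]+", " ", label)
    return label.strip()


def resolve(label):
    """Return the controlled vocabulary key for a label, or None if unknown."""
    return ALIASES.get(normalize(label))


for raw in ["U.S. Geological Survey", "USGS", "U.S. Geological Survey (USGS)"]:
    print(raw, "->", resolve(raw))
```

Labels that resolve to None would still need human review or a fuzzier matching step, which is where the real cost of this approach lies.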
Our current preference is to require agencies to reference a controlled vocabulary URI in their metadata. This moves us in the right direction of metadata standardization rather than passing the issue on to each application owner who will invariably need to deploy their own custom solution. Of the agencies that we’ve spoken to, many have said that it wouldn’t be too onerous a change for them to update any new records to reference a controlled vocabulary listing available via an HTTP URI.
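As a rough illustration of what checking for such a reference might look like at publishing time, here is a short Python sketch. Everything specific in it is an assumption: the example.org URIs are placeholders rather than real vocabulary entries, the function name check_publisher is hypothetical, and we have not decided that the URI would be carried in the <publish> element itself.

```python
import xml.etree.ElementTree as ET

# Placeholder vocabulary: URI -> preferred agency label (URIs are not real).
AGENCY_VOCAB = {
    "https://example.org/agency/usgs": "U.S. Geological Survey",
    "https://example.org/agency/noaa": "National Oceanic and Atmospheric Administration",
}


def check_publisher(xml_text):
    """Return (True, label) if <publish> holds a known URI, else (False, raw value)."""
    root = ET.fromstring(xml_text)
    node = root.find(".//publish")
    value = (node.text or "").strip() if node is not None else ""
    if value in AGENCY_VOCAB:
        return True, AGENCY_VOCAB[value]
    return False, value


good = "<metadata><publish>https://example.org/agency/usgs</publish></metadata>"
bad = "<metadata><publish>USGS</publish></metadata>"
print(check_publisher(good))
print(check_publisher(bad))
```

With a check like this, a filter by agency becomes a simple lookup on the URI instead of a guess about what a free-text label means.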
In either solution, it’s clear that we need a controlled vocabulary using permanent URIs that describe each of the Federal agencies. We hope that our effort to create a vocabulary publishing site called vocab.data.gov will help to fill this gap.
How do you think we should solve this problem? Please send us your comments.
Chris Musialek is the Chief Software Architect on Data.gov and the product manager for the geo.data.gov and geoplatform.gov projects.