Data Harvesting

Data.gov is organized around metadata published by government offices. This metadata is harvested from external websites and aggregated on Data.gov so that it’s easier to browse and search. Some applications, however, may want to consume this metadata programmatically. The two options below explain how to do so.

Disclaimer: Data.gov also syndicates data from state and local governments. However, non-federal data sources are governed by different terms of service, and often different licenses, than Federal data. When using or harvesting data from Data.gov, please note this distinction. When harvesting large volumes of data or metadata through Data.gov, we recommend that you filter for Federal sources and keep non-federal sources separate, so that metadata is not commingled without this distinction being recorded.

Option 1: Harvest Aggregate Metadata

The simplest option is to access metadata in aggregate as it exists on catalog.data.gov. This can be done via our CKAN API or our CSW endpoint. We do not currently provide a single aggregate file of all metadata, but we hope to provide this in the future. Until then, you can follow this GitHub issue for instructions on using the CKAN API to crawl or filter metadata.
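As a minimal sketch of crawling metadata through the CKAN API, the Python below pages through CKAN's standard package_search action using its rows and start parameters. The endpoint URL, the page size, and the helper names (build_search_url, fetch_json, crawl) are assumptions made for illustration and are not taken from the GitHub issue mentioned above. The fetch argument is injectable so the paging logic can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the standard CKAN action API path on catalog.data.gov.
API_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(start=0, rows=100, q=None, fq=None, base=API_URL):
    """Build a package_search URL with CKAN's paging and filter parameters."""
    params = {"start": start, "rows": rows}
    if q:
        params["q"] = q
    if fq:
        params["fq"] = fq
    return base + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode its JSON body (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def crawl(q=None, fq=None, rows=100, limit=None, fetch=fetch_json):
    """Yield dataset records page by page until results or `limit` run out."""
    if limit is not None and limit <= 0:
        return
    start = 0
    yielded = 0
    while True:
        page = fetch(build_search_url(start, rows, q, fq))
        results = page["result"]["results"]
        if not results:
            return
        for dataset in results:
            yield dataset
            yielded += 1
            if limit is not None and yielded >= limit:
                return
        # Advance by the number of records actually returned.
        start += len(results)
        if start >= page["result"]["count"]:
            return
```

A call such as crawl(fq='organization_type:"Federal Government"', limit=5) would restrict the crawl to Federal sources, in line with the disclaimer above. The organization_type field name and its value are assumptions about how the catalog indexes organizations and should be checked against the live API before relying on them.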

Option 2: Harvest From Upstream Harvest Sources

Another option is to go directly to the metadata source. Every harvested source of metadata is listed at http://catalog.data.gov/harvest and is also available via our CKAN API using this filter. As part of Project Open Data, most government offices have transitioned to making all of their metadata available via a standard schema packaged as a data.json file. These files are treated just like any other harvest source, and you can use the CKAN API to filter for only these harvest sources.
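The sketch below shows one way such a filter might be expressed in Python. It assumes that harvest sources are exposed through package_search with fq=dataset_type:harvest, that data.json sources carry source_type datajson, and that each record has title, url, and source_type fields. These field names and values, the endpoint URL, and the helper names are illustrative assumptions, not the exact filter referenced above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the standard CKAN action API path on catalog.data.gov.
API_URL = "https://catalog.data.gov/api/3/action/package_search"


def harvest_source_url(source_type=None, rows=100, start=0, base=API_URL):
    """Build a search URL limited to harvest sources (assumed filter)."""
    params = {"fq": "dataset_type:harvest", "rows": rows, "start": start}
    if source_type:
        # Assumed value for data.json sources: "datajson".
        params["q"] = "source_type:" + source_type
    return base + "?" + urllib.parse.urlencode(params)


def summarize_sources(response):
    """Reduce a package_search response to title, upstream URL, and type."""
    return [
        {
            "title": src.get("title"),
            "url": src.get("url"),
            "source_type": src.get("source_type"),
        }
        for src in response["result"]["results"]
    ]


def list_harvest_sources(source_type=None, rows=100):
    """Fetch one page of harvest sources (performs a network request)."""
    url = harvest_source_url(source_type, rows)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return summarize_sources(json.load(resp))
```

Calling list_harvest_sources("datajson") would then return the upstream locations of the data.json files, each of which can be fetched directly from the publishing office.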