Making environmental data available for reuse by others is a core strategy EPA uses to help meet its mission of protecting human health and safeguarding the environment. EPA has a long tradition of publishing data – notable examples include Toxics Release Inventory (TRI) data, as well as Envirofacts – a data warehouse that provides public access to data extracts from Agency program information systems. EPA is also helping to support data discovery through the Environmental Data Gateway, a centralized catalog for EPA data sets.
The Open Government Directive calls for more and better access to federal government data, and EPA has made numerous high-value data sets available on Data.gov. Recognizing that making government data available is an evolving responsibility, Data.gov supports commonly used data formats (e.g., CSV, XML) as well as an emerging data publishing innovation known as Linked Open Data (LOD). In contrast to publishing data on the Web, LOD publishes data into the Web, so it can be interlinked with other linked data, making it easier to discover and ultimately much more useful. LOD leverages international Web standards such as the Resource Description Framework (RDF) and Uniform Resource Identifiers (URIs), which make data both human- and machine-readable. This is important because most government data coming from relational data systems does not adequately describe the underlying data model, which third parties need in order to develop applications.
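To make the RDF and URI idea concrete, here is a minimal sketch in Python of how one record might be expressed as subject-predicate-object statements and written out in the N-Triples text format. The `example.org` URIs, the `Literal` helper, and the `to_ntriples` function are illustrative assumptions, not actual EPA identifiers or tooling; only the two W3C vocabulary URIs are real.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain text value, as opposed to a URI (hypothetical helper)."""
    value: str


# Real W3C vocabulary URIs.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical URIs, used for illustration only.
FACILITY = "http://example.org/facility/123"
FACILITY_CLASS = "http://example.org/ontology/Facility"

# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    (FACILITY, RDF_TYPE, FACILITY_CLASS),
    (FACILITY, RDFS_LABEL, Literal("Example Plant")),
]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines (URIs in <>, literals quoted)."""
    lines = []
    for subject, predicate, obj in triples:
        if isinstance(obj, Literal):
            escaped = (
                obj.value.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            rendered = f'"{escaped}"'
        else:
            rendered = f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
```

Because the record's type and label are themselves statements about a URI, the description of the data travels with the data rather than living in a separate schema document.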
While LOD is a relatively new approach to data provisioning, its growth has been exponential. Other national governments, including those of the UK, Sweden, Germany, France, Spain, New Zealand and Australia, have also published LOD.
Linked data enables data discovery. While traditional approaches to publishing on the Web result in data silos whose contents are hard to find, linked data approaches allow disparate data to be “surfed”, much as one discovers and surfs pages on the Web. By enabling data from different sources to be connected and queried, linked data offers new opportunities to create Web-based mashups: lightweight compositions of data displayed in compelling ways.
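The "surfing" described above can be sketched with a few lines of Python. In this hedged illustration, two separately published sets of statements share a URI for the same substance, so a consumer can merge them and follow the link from a facility to a substance name. All URIs under `example.org`, the `releases` predicate, and the function names are assumptions made for the example.

```python
# Real W3C vocabulary URI for a human-readable label.
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical URIs, used for illustration only.
RELEASES = "http://example.org/vocab/releases"
FACILITY = "http://example.org/facility/123"
BENZENE = "http://example.org/substance/benzene"

# Source 1: statements about a facility (subject, predicate, object).
facility_data = [
    (FACILITY, LABEL, "Example Plant"),
    (FACILITY, RELEASES, BENZENE),
]

# Source 2: statements about substances, published independently.
substance_data = [
    (BENZENE, LABEL, "Benzene"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def released_substance_names(triples, facility):
    """Follow facility -> substance links, then look up each substance label."""
    names = []
    for _, _, substance in match(triples, s=facility, p=RELEASES):
        for _, _, name in match(triples, s=substance, p=LABEL):
            names.append(name)
    return names


if __name__ == "__main__":
    # Merging linked data is simple concatenation; shared URIs do the joining.
    merged = facility_data + substance_data
    print(released_substance_names(merged, FACILITY))
```

No custom matching logic is needed to combine the two sources here, because both use the same identifier for the substance.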
The manner in which governments typically provision data requires third-party users to manually download the data, then reconcile and harmonize it for use in analytics. Such approaches are costly and time-consuming for application developers. Integrating CSV files requires each application developer to identify and link common data elements across datasets. When application developers integrate the same datasets independently, this work is repeated. When government publishes linked data, the publishers identify and create links to other data, and those links are then available for re-use by application developers. While there is a one-time cost incurred by government to create links, that investment lowers costs for third-party consumers (e.g., other government agencies, NGOs, private industry). Linked data incorporates key metadata, which makes the data readily understandable to consumers. By ‘baking in’ metadata, less work is required to understand the meaning of the data, further lowering the cost threshold for third-party application developers.
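The cost argument above can be illustrated with a small Python sketch. Two CSV extracts identify the same facilities with different codes, so they cannot be joined directly; a link table that the publisher creates once lets every consumer reuse the same mapping instead of rebuilding it. The column names, identifiers, and values below are invented for illustration and do not come from any EPA dataset.

```python
import csv
import io

# Hypothetical CSV extracts; the two files use different facility identifiers.
RELEASES_CSV = "facility_id,chemical,pounds\nF-001,Benzene,120\nF-002,Toluene,45\n"
PERMITS_CSV = "site_code,permit_type\nSITE_A,Air\nSITE_B,Water\n"

# Hypothetical link table, created once by the publisher and reused by all.
LINKS_CSV = "facility_id,site_code\nF-001,SITE_A\nF-002,SITE_B\n"


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_with_links(releases, permits, links):
    """Attach a permit type to each release row via the published links."""
    site_for = {row["facility_id"]: row["site_code"] for row in links}
    permit_for = {row["site_code"]: row["permit_type"] for row in permits}
    joined = []
    for row in releases:
        site = site_for.get(row["facility_id"])
        # permit_type is None when no link or no permit record exists.
        joined.append({**row, "permit_type": permit_for.get(site)})
    return joined


if __name__ == "__main__":
    rows = join_with_links(
        read_rows(RELEASES_CSV), read_rows(PERMITS_CSV), read_rows(LINKS_CSV)
    )
    for row in rows:
        print(row["facility_id"], row["chemical"], row["permit_type"])
```

Without the published link table, each developer would have to work out the `facility_id` to `site_code` correspondence independently, which is the repeated effort the paragraph describes.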
Current Efforts and Future Activities
Linked data is a rapidly evolving phenomenon, and interest in access to government linked data has resulted in the formation of a World Wide Web Consortium (W3C) Government Linked Data Working Group. EPA, Health and Human Services (HHS), the General Services Administration (GSA), and others are participating in this international working group, which is providing standards and other information needed to facilitate publication of high-quality government linked data.
In 2011, EPA rendered contents from its Facility Registry and Substance Registry as linked open data. These registries maintain information about the facilities and substances tracked or regulated by either EPA or one of its state or tribal partners. EPA decided to render this content as linked data because of its cross-cutting nature – many facilities have multiple EPA program interests; similarly, many substances are regulated by more than one program. EPA also generated linked data from the toxic release and waste management data of its Toxics Release Inventory (TRI) program. Together, these three datasets will provide ‘binding posts’ to which other Agency data can be linked. In the last six months, EPA developed proof-of-concept pilots in which EPA data were linked with other government data (e.g., from Health and Human Services) and non-government data (e.g., from Wikipedia). The pilots showed how linked data approaches can be used to combine EPA and other data to gain new views and insights from existing information.
While efforts continue at EPA to produce additional linked datasets, access to the data requires the establishment of a permanent data hosting platform from which data can be queried and displayed. EPA anticipates that its data will be hosted and accessible later in 2012.
About the author:
Michael Pendleton, U.S. Environmental Protection Agency, Office of Environmental Information