Note: This is a guest blog post from developer Luke Ruth (see “About the author” at the end of this post) on mashing up Government Linked Data, leveraging Facilities Registry data from the EPA and Hospital Compare data from HHS. Luke is a welcome addition to the Data.gov/semantic community of practice!
As an undergraduate studying Computer Information Systems and Psychology at University of Mary Washington, I recently had the opportunity to perform a senior project on using open government content and Linked Data principles. I heard a guest lecture by Bernadette Hyland who co-chairs the W3C Government Linked Data Working Group. She explained that publishing high quality government data is part of the US Administration’s Open Government Initiative. Linked Open Data allows agencies, like the US Environmental Protection Agency (EPA), the US Department of Health and Human Services (HHS), and many others to publish government collected data using international data standards for the Web.
When government agencies publish open government data as Linked Data, any civically minded person can easily discover and access it. This is different from retrieving an HTML page on the “Web of documents”; instead, it’s about discovering and accessing the “Web of data.” Thus, the open government data I used for my senior project wasn’t locked away in a proprietary database or file; I could get at the data itself because the agencies published it using open data-exchange standards. The project involved mashing up EPA Facilities Registry and HHS Hospital Compare data.
The goal of my senior project was to merge different data about the same hospitals from these independently published sources, as provided by multiple datasets from EPA and HHS. The inherent challenge was that neither publisher used the same classes, identifiers, or classification system when describing hospitals. If you just google “Mary Washington Hospital”, you’ll need to sift through approximately 4.2 million results if you have any hope of finding what you’re looking for. That’s 4.2 million separate resources, leaving you to mentally aggregate data scattered across multiple locations. Now what if there was a data structure that did almost all of that traversal for you? There is, and it’s called Linked Data.
I used the following tools to achieve my goal:
1. SPARQL – the query language used to create the data files;
2. Perl – a script for cleaning up the data files;
3. Java – a script for making the hospital comparisons;
4. Callimachus – a framework for the application and RDF storage; and
5. RDFa – an extension to XHTML for embedding metadata in Web pages.
Almost as meaningful as the tools I did use are the tools I did not. I was an undergraduate student with only one Linked Data course under my belt before I set out on this project. I did not need years of study or an advanced degree. That said, I did have fantastic mentorship and guidance along the way. This project also did not necessitate a specific or new ontology because the process was completed using raw text processing and string matching.
The process I used to match the hospitals between these two datasets can be broken down into three steps.
Step 1: Create the data files using SPARQL to query the endpoint.
Step 2: Clean the data files (SPARQL result-set) using a Perl script.
Step 3: Run the cleaned data files through the Java comparison script.
The most important part of this process was step 3 because it was here that hospitals either successfully matched or failed to match. First, zip codes of hospitals were compared and if the zip codes matched, the hospitals entered the next level of matching. This level first compared the names of the two hospitals using string cleaning to remove punctuation, white space, and any words that may interfere with the matching such as “inc” or “llc”. If the names did not match, the addresses of the two hospitals would be compared using a similar string cleaning process. This was done because the names of the hospitals occasionally would not match even though they represented the same facility, and the street address matching would catch this error.
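The step-3 matching logic described above can be sketched in Java along these lines. This is a minimal illustration, not the author’s actual script: the class and method names, and the exact stop-word list, are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// A minimal sketch of the step-3 matching logic: compare zip codes first,
// then cleaned names, then cleaned street addresses as a fallback.
public class HospitalMatcher {

    // Words stripped before comparison, per the post ("inc", "llc", etc.);
    // this particular list is illustrative.
    private static final List<String> STOPWORDS = Arrays.asList("inc", "llc", "the");

    // Lower-case, drop punctuation and whitespace, and remove stop words.
    static String normalize(String s) {
        String cleaned = s.toLowerCase().replaceAll("[^a-z0-9 ]", "");
        StringBuilder out = new StringBuilder();
        for (String word : cleaned.split("\\s+")) {
            if (!STOPWORDS.contains(word)) {
                out.append(word);
            }
        }
        return out.toString();
    }

    // Zip codes must match first; then names are compared, falling back
    // to street addresses when the names disagree.
    static boolean matches(String zipA, String nameA, String addrA,
                           String zipB, String nameB, String addrB) {
        if (!zipA.equals(zipB)) {
            return false;
        }
        if (normalize(nameA).equals(normalize(nameB))) {
            return true;
        }
        return normalize(addrA).equals(normalize(addrB));
    }

    public static void main(String[] args) {
        // Same facility listed under slightly different names:
        System.out.println(matches(
            "22401", "Mary Washington Hospital, Inc.", "1001 Sam Perry Blvd",
            "22401", "Mary Washington Hospital", "1001 Sam Perry Blvd."));
    }
}
```

The address fallback is what catches the pairs whose names differ even though they represent the same facility.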
Once the process was completed, the results were available for analysis. Of the 4,680 entries in the data file, 2,275 were successfully matched. This is a meaningful result: two large government datasets have been merged, giving any interested citizen access to a new combination of useful information.
Screenshot from 3RoundStones EPA dataset
What can be improved upon? How can a higher match percentage be achieved? What challenges exist? One challenge that arises whenever you deal with any kind of data or data storage is dirty data. Anything from typos to differing naming conventions to HTML escape codes can interfere with multiple steps of the process. These issues do have a silver lining, though: they present an interesting opportunity for crowd-sourcing as a means of improving data quality. Approximately 1,600 hospitals failed on zip code matching, which I believe can be improved through alterations to the SPARQL query. The other approximately 800 failed on name and street address matching, which could be improved with alterations to the Java comparison script.
The ultimate goal is to move away from the inefficiencies of raw text processing toward a more standardized and efficient way of using data. For example, now that the string matching for these two datasets has been done, future application developers should reuse the pre-established links (owl:sameAs relationships) rather than duplicating the work. This means an easier and more productive effort from future developers and publishers. Another way of making future developers’ jobs easier is through the use of meaningful comments. Data is only useful to concerned citizens and potential developers if they can understand it, which calls for clear explanations of what the data represents. This becomes particularly important for numeric data, which may be completely obvious to someone who works with it day in and day out, but not to someone encountering it for the first time.
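Once published, the sameAs links act as a simple lookup: a developer can resolve an EPA facility directly to its HHS counterpart instead of re-running the text matching. The sketch below illustrates the idea; the URIs and names are invented for the example and are not real identifiers from either dataset.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of reusing pre-established sameAs links as a lookup
// table, so future applications avoid redundant string matching.
public class SameAsLookup {

    // Hypothetical link pairs; in practice these would be loaded from
    // the published RDF rather than hard-coded.
    private static final Map<String, String> SAME_AS = new HashMap<>();
    static {
        SAME_AS.put("http://example.org/epa/facility/110012345",
                    "http://example.org/hhs/hospital/490101");
    }

    // Resolve an EPA facility URI to its HHS counterpart, or null if unlinked.
    static String hhsFor(String epaUri) {
        return SAME_AS.get(epaUri);
    }

    public static void main(String[] args) {
        System.out.println(hhsFor("http://example.org/epa/facility/110012345"));
    }
}
```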
The most obvious remaining objective is to increase the match percentage between the two datasets. This would increase the amount of information available to the general public and decrease the amount of work for future application developers. An interesting opportunity also lies in data visualization, which could provide a more information-rich display to the user. It would also make sense in the future to pull in even more relevant linked datasets, such as a short description from DBpedia.
The important takeaway is that a modest effort by publishers to use established vocabularies and identifiers goes a long way toward the rapid production of data-driven Web applications. It is also telling that linking multiple open datasets can produce not only meaningful results but also unexpected insights. Lastly, high-quality Linked Open Data published by government authorities gives civic-minded citizens faster access to a greater amount of relevant, usable information.
About the author:
Luke Ruth, University of Mary Washington, Computer Information Systems/Psychology, graduating May 2012
contact Luke dot Ruth at gmail dot com