The following is a summary of the primary research for this paper. In the context of the research question, the review touches on the characteristics of the participants who used OD, their motivations, and the issues surrounding the use of OD in project delivery.
The survey included three series of questions. The first referred to information about the participants, the second examined their involvement with OD, and the third asked questions related to specific OD projects. Requests to complete the survey were sent to 42 separate organizations or groups known to work with open data. A total of 123 participants responded to the survey but only 99 completed it in its entirety. The most common geographic location of participants was Canada (44%), followed by the European Union (22%) and the United States (18%). The geographic locations of participants are provided in the following table.
In addition, nearly 70% of respondents were between the ages of 25 and 44, and more than 80% of all participants had a university degree. Specific percentages related to age groups and education are provided within Appendix 6.
Users of Open Data
The users of PSI are generally multi-stakeholders and represent groups that are loosely connected by community interest. The users of OD can be categorized within three separate groups. The first group consists of analysts who prefer data in a raw format with the least amount of filtering applied. This group includes academics, economists and journalists who need to analyse and interpret data in order to contribute to their work-related deliverables. It can also include other government organizations or internal departments within the same organization. The second group includes individuals who prefer dynamic or automated access to data. These are the software developers who need access to clean and reliable data through automated methods such as an Application Programming Interface (API). For the most part their objective is to render the data in a visual interface for a specific audience, which requires some level of data analytics. In most cases, this involves aggregating multiple datasets. The last group consists of regular citizens or special interest groups who prefer accessing data through an interface with a visual interpretation of the data. Most have limited or no skills for analysing and aggregating datasets and believe that governments are obligated to provide visual interfaces to OD. Organizations working with OD can be found in the following table.
Motivation for using Open Data
Interest in OD is developing in a bottom-up direction. Special interest groups and not-for-profit organizations are realising the potential benefits of OD and how it can help them achieve strategic objectives. For most countries, the movement is still just beginning and many people are very interested in learning more about the topic. Nevertheless, only a very small portion of people are using OD in the hope of building applications and/or web services. The largest users of OD remain academic institutions and the private sector.
Micro and Macro data
The benefits of OD can be divided into two levels of information: micro data and macro data. Micro data refers to information that is relevant to individuals and their daily activities, such as bus schedules, road closures and recreational activities. Not surprisingly, micro data is typically published by smaller governments such as municipalities. Macro data, on the other hand, consists of data with a wider geographical span, such as national and international level information. Macro data can impact larger subsets of the population and be specific to an area or topic. This level of data is dependent on statistical, population and geographical data. In some cases, several micro datasets can be aggregated with macro datasets. Although both are important to citizens, macro data tends to impact a greater number of people. The majority of the OD users who responded to the survey commonly use macro data, including statistical data, closely followed by research, population and geographical data. Specific percentages related to the type of data used by survey participants are provided within Appendix 6.
The Source of Open Data
Surprisingly, the source of data most commonly used by survey participants was not open. Significant amounts of data are published as web content on public organizations’ web sites but are not available in an open format. This lack of availability has created a culture of scrapers: individuals who take data from web sites and openly re-use it. That data is then transferred into structured datasets or databases where it can be re-used. In some cases this process is automated, and if the web content changes the scraping process is triggered again. It is a complicated method of extracting data and an even more complex way to maintain and update data. In addition, scraped web content falls outside OD license agreements since the data is not available through an OD portal. An interesting outcome of the survey was that 8% of respondents who were working on a specific OD project and working for a federal government organization were also scraping data. It seems that in some cases it is more cost-effective for governments to scrape their own web content than to extract the data from its original source. The percentages of macro and micro sources of OD from specific organizations can be found in the following diagram, while information related to sources used by separate nations is provided within Appendix 6.
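The scraping workflow described above (extract web content into a structured dataset, then re-run the extraction when the page changes) can be illustrated with a minimal sketch. It is not taken from any participant's tooling. The function names, the sample page and its columns are hypothetical, and the sketch assumes the data sits in a simple HTML table whose first row holds the column names. It uses only the Python standard library.

```python
import hashlib
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> on a page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Turn the table rows of a page into a list of records (dicts)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def scrape_if_changed(html, last_digest):
    """Re-scrape only when the page content differs from the last run."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if digest == last_digest:
        return None, digest  # unchanged: nothing to update
    return scrape_table(html), digest


# Hypothetical page content standing in for a public web page.
PAGE = """
<table>
  <tr><th>route</th><th>status</th></tr>
  <tr><td>12</td><td>closed</td></tr>
  <tr><td>40</td><td>open</td></tr>
</table>
"""

records, digest = scrape_if_changed(PAGE, None)
unchanged, _ = scrape_if_changed(PAGE, digest)
```

Even in this small form the sketch shows why participants described scraping as hard to maintain: the parser depends on the page layout, so any redesign of the web site can silently break the extraction.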
Preferred methods for accessing Open Data
The preferred methods for accessing OD remain tools and software that are commonly used or openly available. Formats such as comma-delimited files, which are not proprietary to specific software, are preferred for accessing and managing OD. The preferred tools used to manage and manipulate OD are also not proprietary; open source software is commonly used, but in most cases personal preference drives the decision and individuals will use tools that are familiar to them. Specific percentages related to the preferred method for accessing the data by survey participants are provided within Appendix 6.
When asked what benefits are derived from OD initiatives, the majority of the participants indicated that OD will support innovation, that citizens have the right to public data, and that public information should be openly accessible and available online. In addition, releasing new data was more important for participants than improving the quality of existing data. Specific percentages for the questions related to the importance and benefits of OD initiatives are provided within Appendix 6.
Open Data Projects
Included in the survey were specific questions related to OD projects. A total of 48 participants responded to these questions. Projects included local initiatives for housing, charities, education and health-related data. International projects were also included, which looked at opportunities with open source software, data visualizations and aggregation with social media and big data. A few projects also included government initiatives for publishing OD and others included specific work toward new standards.
Several participants working on projects belong to different advocacy or special interest groups, which create personal projects in an attempt to make governments accountable for their actions. These groups also lobby governments with the aim of convincing them to adopt OD policies. Other personal initiatives included the aggregation of statistical data in an attempt to identify trends to help charities and not-for-profit organizations determine their strategic direction.
Of the participants working on specific projects, 23% identified federal governments as their primary client and nearly 84% stated that they are likely to use OD again. The breakdown of primary clients among participants working on specific projects can be found in the following table.
In addition, the following table shows the participants working on specific projects who agreed with each comment.
Critical issues in government today
The survey offered an open-ended question allowing participants the opportunity to contribute what they believed were the top three issues facing government organizations today. In response to this question, 35% of the participants referred to the relevance of available OD, specifically expressing concern regarding its accuracy and availability. Financial restrictions were also identified as an issue for governments to effectively capture and publish OD. More than 25% of the participants believed that effective dissemination of OD could help governments cut costs and mitigate risks caused by cutbacks.
Furthermore, participants expressed that governments have an embedded organizational culture that does not share information. One quarter of participants stated that governments need to take action and make top-down changes:
The governments will need to build a culture of open data engagement, which will require a change in mindset, and culture. (Participant 97)
There were also several comments regarding the quality of OD. Relevant information about datasets, including the method of collection, metadata and instructions for re-use, is often missing. Inconsistency was also a relevant concern, which relates to a lack of existing standards for information management. In some cases, participants indicated that information about the datasets is simply insufficient to effectively aggregate them with other datasets; this was especially true for geographic information. Real transparency was also a common theme. Several participants believed that members of government do not understand the importance of OD and the impact it can have for citizens:
Increased visibility of data tends to increase its accuracy and quality through feedback from users. (Participant 74)
The issue of enabling and facilitating public access to open data through a visual interface was also a concern.
The interviews included 10 participants, 5 from public organizations and 5 from private organizations.
The majority of the interview questions for public officials consisted of quantitative questions to identify direct and indirect costs for the preparation and dissemination of OD. However, most governments’ operating costs attributed to OD are small or unknown because they are absorbed within existing information technology operations and sections. Extracting the cost was complex and in most cases organizations were not aware of the actual funds used for OD initiatives. In other cases, some GC departments and agencies were simply not willing to share these costs.
With the exception of large GC departments, manual intervention by staff members is required for the dissemination of new datasets or updates to existing ones. Three of the public organizations interviewed provided approximate human resource, operating and maintenance costs, including direct and indirect costs. The cost of disseminating OD within these organizations averaged CDN $130,000 per year. The following table contains the average costs from each organization.
Two organizations spent time developing a cost recovery model for the preparation and dissemination of OD by individual dataset. They estimated that the time spent publishing each new dataset is approximately 275 hours, from the analysis phase to the deployment phase. What was not clear were the indirect costs and the time spent on publishing OD by other sections within the organizations. What was very clear, and voiced by all of the interview participants from public organizations, was that the cost of publishing OD would definitely increase.
Furthermore, interview participants commented on the lack of internal policies and standards, which in turn is reflected in the poor quality of the OD published:
Internal IM practices are lacking, we are building data from the outside in. The internal practices of governments were never designed to manage information, which informs the community in this open format. Governments need to re-think their frameworks. (Interview Participant 4)
A common theme among interview participants from the private sector was the lack of available data and the difficulty in locating data. These participants work with data on a daily basis and frequently need to locate new and interesting datasets for their projects or initiatives.
Participants indicated that in many cases information was simply not available in an open format but was publicly available online on public web sites. As such, individuals and companies frequently resort to scraping data off public web pages. Although scraping raises many concerns, the consensus among participants was that it is a common practice, even within government organizations.
In the case of one participant who was scraping public web sites, the information he was collecting was simply not available in any other format. By aggregating the scraped data, he was able to create a valuable and appealing dataset, since this information was not available anywhere else. By offering an interface to access this data, he created value by allowing regular citizens to search and query the data without analytical skills. Unfortunately, existing license agreements do not cover the scraping of public websites; the web content is publicly available but re-use of the data is technically not permitted. Scraped data is simply not OD. Interestingly, the added value of the information that he created by aggregating this data is so significant that members of the GC are subscribed users of his service:
While publishing public information on web sites governments should make the effort to publish data in an open format also. Any data available publicly should also be available in an open format also. (Interview Participant 2)
Another issue brought forward by interview participants was common standards. Standards provide a common method for aggregating OD that can protect privacy. This was especially relevant for geographic standards. The aggregation of OD is dependent on standards across datasets and if separate departments utilize different standards it will take a considerable amount of effort to amalgamate data. As an example, geographical frameworks used within OD vary between GC departments and this lack of a common framework poses challenges to analysts who need to aggregate data across departments. Existing geographies used in datasets can include postal code zones, federal electoral districts, census geographies or health districts. For example, health and environmental issues can span many jurisdictions and geographic boundaries.
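The aggregation problem described above can be illustrated with a short sketch: two datasets keyed on different geographies can only be combined once one of them is mapped onto the other's geography. The crosswalk table, the zone and district codes, and the sample figures below are all hypothetical, and the sketch assumes that each postal code zone falls entirely within one health district. Real boundaries often overlap, in which case a weighted allocation would be needed instead.

```python
from collections import defaultdict

# Hypothetical crosswalk: postal code zone -> health district.
# Assumes each zone lies entirely within a single district.
POSTAL_TO_HEALTH_DISTRICT = {"K1A": "HD-1", "K1B": "HD-1", "K2C": "HD-2"}


def aggregate_to_common_geography(rows, crosswalk):
    """Sum (zone, value) rows by target geography; report zones with no mapping."""
    totals = defaultdict(int)
    unmapped = []
    for zone, value in rows:
        district = crosswalk.get(zone)
        if district is None:
            unmapped.append(zone)  # cannot be placed without a common standard
            continue
        totals[district] += value
    return dict(totals), unmapped


def join_on_district(left, right):
    """Combine two district-keyed datasets on the districts they share."""
    return {d: (left[d], right[d]) for d in sorted(left.keys() & right.keys())}


# Hypothetical department A dataset, keyed on postal code zones.
clinic_visits = [("K1A", 120), ("K1B", 80), ("K2C", 50), ("X0X", 5)]
# Hypothetical department B dataset, already keyed on health districts.
air_quality_alerts = {"HD-1": 3, "HD-2": 1}

visits_by_district, unmapped = aggregate_to_common_geography(
    clinic_visits, POSTAL_TO_HEALTH_DISTRICT
)
combined = join_on_district(visits_by_district, air_quality_alerts)
```

The `unmapped` list makes the cost of missing standards visible: every zone that has no entry in the crosswalk is data an analyst must either place by hand or drop.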
Focus Group Session
The focus group session was held during a conference event and included three candidates from the private sector. The goal of the session was to explore ideas in a divergent manner and identify possible solutions for known issues preventing OD from achieving benefits. The group identified transparency as an issue and related the problem to the organizational culture found in government organizations. Experiences from participants were shared among the group; the discussion focused on examples where information was not released in re-usable formats:
When requesting information that was not publicly available online it was provided to me in a compact disk or in a paper format. (Session Participant 2)
Other issues were identified which related to the effort required to find information. There is still much information that is not yet available in an open format, such as information related to grants and contributions. Participants who work on specific research initiatives indicated a need for a large amount of statistical data to identify trends. According to the group, a considerable amount of effort is spent trying to determine whether the information actually exists and is available.