Thank you for this invitation.
I was told that it might be a good idea to introduce myself first, to help you frame your questions, which I will be happy to answer afterwards in either English or French.
I am a bit of a strange beast. My training has been in several areas. I completed a bachelor's degree in mathematics and computer science, after which I did a master's in linguistics and a doctorate in linguistics, with a focus on artificial intelligence. This led to my work on what is called natural language processing, that is, the use of computers to understand texts written in French, English, Italian, and so on, for the purposes of translating them and automatically correcting or processing them.
Among other things, I worked in the private sector as a natural language processing (NLP) software developer. I am currently a professor at the School of Library and Information Science. I was hired under their digital information management envelope. Digital information is really our main theme.
My current expertise is in two areas. First, I work in the area of natural language processing as it applies to document management. Second, I'm focusing more and more on digital libraries for document collections, whether they be library documents, archives, museum documents or other kinds of documents, and their access functions. Certain websites and databases would also fall under digital libraries. Collections and data sets are examples of digital libraries. I am particularly interested in these issues from that perspective.
I have based my opening remarks this morning on the five questions I received. I just wanted to give you an introduction first.
We talk about open data, linked data, linked open data, RDF data. They don't all mean the same thing. There are more or less open types of data. It is not enough to publish data for that data to count as a good example of open data. The best format is RDF, which can be read and processed directly by machines.
There are several jurisdictions that will publish data, but that data is not necessarily in an easily usable format. There are degrees of usability in what is provided.
Another term that is used is big data. Once again, that is something different. That term refers to research based on massive data. Even though it is different, one can only expect that the advent of enormous quantities of data will significantly change people's attitudes towards knowledge and the use that can be made of that knowledge. That will change everything.
The first question was how the Government of Canada compares to other jurisdictions, in Canada and abroad. I compiled some data in a table that is in the notes that I gave to the committee. It includes data on the availability of data from governments in Canada and abroad.
The results are quite variable, both in terms of the number of data sets and the degree of real openness. Some governments publish their documents as zipped PDF images, which is not necessarily the most desirable format for open data.
I am not going to go over the table in detail. I would say quite briefly though that the United Kingdom is known internationally for its extensive publication of data, including a large quantity of truly open RDF data. The number of data sets is approximately 17,000.
Canada's number of data sets is over 190,000, which is higher. On the other hand, Canada's data is less open. More of the data takes the form of zipped files and geographical maps. There is currently exactly one data set in RDF, which is a little sad. The table describes much of the data, and it would take too long to go over that now.
I have also pointed out a website, Linking Open Government Data, which has ranked a number of countries. It puts Canada in second place for publishing data sets.
Clearly that ranking is based on the number of available data sets, but not necessarily on the ease with which those data can be accessed.
I am now going to answer the second question, that is, how this compares with what the private sector is collecting and making available.
Obviously, public administrations do not publish the same kind of data. They publish information on the activities of the public administration, public services management, natural resources, etc. The private sector is much more reluctant to share its data. The reasons for this are quite obvious. Businesses are afraid of losing their competitiveness. Many incentives are offered to the private sector to meet certain consumer expectations, because consumers want companies to be more transparent and environmentally responsible, among other things. The public sector acknowledges that this can lead to some risk sharing. For example, insurance companies and pharmaceutical companies can benefit from other businesses' data in order to improve their competitiveness.
The third question is how proper use of public data can stimulate job creation and economic added value. The availability of open data clearly encourages the development of various applications. However, one should not think only of the money that can be made. Rather, one should consider public data as a new public service, just like libraries. That's the parallel that should be made, rather than considering this as an economic added value for the purpose of immediately making money.
The fourth question is how we can make sure that there is accountability and transparency, while being prudent on privacy issues. A distinction must be made, and others do make this distinction, among three kinds of data: collective data, which can be open data when it is anonymous; private or personal data, which should be available to the individuals but not to the public; and transformed data, which can be anonymized before being published. It's important to define a series of confidentiality principles in order to manage this.
The last question is how we can make sure that public data serves the needs of the population of Canada. I have identified four potential ways of doing that. We can have new public officers, for example a chief data officer or something similar. Obviously there has to be a public and transparent official policy, along with new structures, such as citizens' advocacy groups. Furthermore, we need to include the documentation sectors, that is, library scientists and archivists, who are used to managing data and taking user needs into account in order to improve their services.
Thank you.