I'm an intellectual property lawyer, and I also studied computer science and math before going to law school, with a focus on artificial intelligence. I'm going to provide an overview of artificial intelligence technologies, and I'll highlight issues that impact copyright in relation to these innovations.
The term “artificial intelligence” is often applied when machines mimic cognitive functions that humans associate with the human mind, such as learning and problem solving. It's a field of computer science that includes something called “machine learning”. Machine learning can automate decision-making using programming rules that update dynamically. This involves training the system on large datasets. Supervised learning involves labelling these datasets, such as “cat” and “dog” labels for images of cats and dogs. Unsupervised learning uses training data without those labels, and clusters are discovered automatically.
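The distinction between supervised and unsupervised learning can be sketched in a few lines of code. This is a toy illustration with invented numbers standing in for image features, not a real training pipeline; the function names and data are hypothetical.

```python
# Supervised learning: each training example comes with a human-provided label.
# The numbers are hypothetical feature values standing in for images.
labeled_data = [(1.0, "cat"), (1.2, "cat"), (0.8, "cat"),
                (5.0, "dog"), (5.3, "dog"), (4.9, "dog")]

def classify(x, data):
    """Predict the label of the nearest labeled example (1-nearest-neighbour)."""
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

print(classify(1.1, labeled_data))  # -> cat
print(classify(5.1, labeled_data))  # -> dog

# Unsupervised learning: the same points with no labels.
# A two-cluster k-means discovers the groups automatically.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]

def kmeans_1d(xs, iters=10):
    """Two-cluster k-means on 1-D data: assign each point to the
    nearest centre, then move each centre to its group's mean."""
    c1, c2 = min(xs), max(xs)
    for _ in range(iters):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return g1, g2

low_group, high_group = kmeans_1d(points)
print(low_group)   # the "cat-like" cluster, found without any labels
print(high_group)  # the "dog-like" cluster
```

The point of the contrast: the supervised model can only learn categories a human has named, while the unsupervised model discovers structure on its own.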
AI learns to think by reading, listening to, and viewing data, which can include copyrighted works such as images, video, text, and other data. It differs from typical software because it automates decisions that are not normally in the realm of computers, and the code then adapts or changes over time in response to learning from this data. This triggers new ethical and legal issues, which is what we as a law firm look at.
One of the issues is that AI systems need to meet certain ethical standards, and those ethical standards often embed rights and values. One issue that comes up from this point is that there is an increase in biased AI systems, and we're trying to discover why these systems are so biased. Consider a very simple example. In 2016, there was an event called Beauty.AI, an international beauty contest judged by an AI system. Six thousand people from more than 100 countries submitted photos to be judged, but the vast majority of the winners were white-skinned. Upon investigation, they realized that the AI system had been trained on hundreds of thousands of images that included very few non-white faces, so the training dataset was not sufficiently diverse.
Other examples relate to human resource tools, credit scoring, as well as policing and public safety. These biases can cause harm and inequality. Responsible AI should maximize benefits and minimize these harms.
What does this have to do with copyright law? AI training datasets can involve copyrighted works such as images, video, text, and data. The training process can involve reproductions of the training data, including temporary reproductions made to extract features, which can be discarded after training. An AI system can rely on the factual nature of the works to learn patterns in the data. The AI system's algorithm is separate from the training data, but the training data may result in an improved or optimized algorithm. It is unclear whether using copyrighted works to train an AI system constitutes copyright infringement if the author's or copyright owner's permission is not obtained. This uncertainty exists even if the initial training is done for research purposes, an enumerated fair dealing ground, and the trained system is eventually used for commercial purposes or made available under a licensing arrangement. This uncertainty can limit the data that AI innovators use to train their systems. The quality of the dataset will impact the quality of the resulting trained algorithm. There's a common saying in computer science: garbage in, garbage out.
There are public or open datasets available, but they may not be made up of the best-quality data. In fact, a number of examples show that open datasets available under various licensing arrangements do result in biased algorithms due to gender imbalance in the underlying data. An algorithm trained on this sub-optimal data may produce biased outputs.
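How a skewed dataset produces a skewed algorithm can be shown with a deliberately trivial model. The "hiring" records and the 80/20 split below are entirely invented for illustration; the point is only that the model reproduces whatever imbalance its training data contains.

```python
# Hypothetical, invented training data: historical hiring records as
# (group, hired?) pairs, with a built-in imbalance between groups.
training_data = ([("m", True)] * 80 + [("m", False)] * 20 +
                 [("f", True)] * 20 + [("f", False)] * 80)

def train(records):
    """Learn, for each group, the historical rate of positive outcomes."""
    rates = {}
    for group in {g for g, _ in records}:
        outcomes = [hired for g, hired in records if g == group]
        rates[group] = sum(outcomes) / len(outcomes)
    return rates

def predict(model, group):
    """Predict a positive outcome when the learned rate exceeds 50%."""
    return model[group] > 0.5

model = train(training_data)
print(model["m"], model["f"])   # -> 0.8 0.2
print(predict(model, "m"))      # -> True
print(predict(model, "f"))      # -> False: the skew, not merit, drives this
```

Nothing in the algorithm is "prejudiced"; the bias is entirely inherited from the training data, which is why dataset quality and diversity matter so much.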
An AI developer can develop or generate their own large body of training data, but this may not always be feasible if a certain quality or type of data is required. For example, when training a face-recognition algorithm, it's desirable to have a diverse dataset with thousands of images representing different types of people. However, this may be very difficult for a company to generate unless it is, for example, a large social media company collecting a large volume of images daily.
A recent decision also creates uncertainty by finding that machine-generated raw data can be a copyrighted work because human skill and judgment were used to set the parameters for creating that data. This raises questions about the scope of copyright protection afforded to data and about what can be used to train these systems. Further, even temporary reproductions of copyrighted works for technical purposes can be considered copyright infringement, which adds to the uncertainty.
Another issue relating to AI systems and copyrighted works is that these systems are now starting to generate new works that can be considered literary, artistic, and musical works. The role played by a human in the creation of these works will vary depending on the technology. An example is a system called AIVA, which composes classical music and has already released an album, with other tracks also available.
It's difficult under current copyright law to clearly determine whether these machine-generated works are protectable as copyright works. It also shows that the nature of these technologies is changing, and we need to consider how copyright can address future technologies, uses, and resulting works. This creates uncertainty around the ownership and commercialization of these works.