A Study of Copyright Issues in the Input Phase of Generative
Artificial Intelligence Machine Learning
Weixi Song
School of Artificial Intelligence and Law, Southwest University of Political Science and Law, Chongqing, China
Keywords: Generative Artificial Intelligence, Input Phase of Machine Learning, Copyright.
Abstract: In recent years, the rapid progress of generative AI technology has had a profound and transformative impact on multiple dimensions such as technological innovation, cultural evolution, and the transformation of education models. Among the many complex issues raised by generative AI, the input phase of machine learning depends heavily on massive amounts of data, much of which is protected by intellectual property rights, and this has led to copyright disputes. This reality has created deep contradictions and fierce conflicts between the traditional copyright industry and the emerging artificial intelligence industry. Although the traditional solutions proposed by scholars can ease the contradiction between the two to a certain extent, there is still room for improvement. This paper explores a new mechanism that conforms to the principle of balance of interests to solve the problem of infringement in the input stage of machine learning.
1 INTRODUCTION
Machine learning is the core of AI and the fundamental way for it to become intelligent. Generative AI relies on deep learning and big data analysis to generate text, images, audio and other types of content, and ultimately to acquire what is scientifically defined as "emergent capabilities" (Wu and Lai, 2024). Machine learning can be subdivided into three phases, namely the input phase, the learning phase and the output phase (Xu, 2022), of which the input phase is an important part of machine learning and the focus of this paper.
In the input stage, generative AI needs to capture and integrate a large amount of training data, which includes both information protected by copyright law and content not protected by copyright law, in order to provide rich material for the subsequent self-learning process. Relying on the training dataset built from these collected data, generative AI carries out in-depth model training so that the content it generates closely approximates the samples in the training set, thus achieving the desired level of intelligent output (Lai, 2024). However, when the data crawled in the input phase of machine learning includes copyright-protected content, the crawling inevitably conflicts with the copyright holder's statutory right of reproduction.
If the problem of infringement at the input stage cannot be solved, machine learning will remain under the cloud of infringement risk for a long time.
2 COPYRIGHT DISPUTES AND
INSTITUTIONAL DILEMMAS
AT THE INPUT STAGE
At the input stage, generative AI infringes copyright mainly through the unauthorized crawling of data, much of which is copyright-protected content. In order to train AI algorithms effectively, developers often obtain the large amount of data they need by circumventing the technical protection measures of databases (Liu and Wei, 2019), and then either copy and store the data on their own servers or feed it directly to the AI. This may give rise to infringement at the input stage of generative AI. In China, the "Ultraman case" was the first legal decision on generative AI infringement, and the court found that using generative AI to provide generation services without authorization was infringing. In March 2024, Google Inc. faced proceedings over the unauthorized use of copyrighted content to train its chatbot, for which it was hit with a financial penalty of €250 million by the French market regulator.
Song, W.
A Study of Copyright Issues in the Input Phase of Generative Artificial Intelligence Machine Learning.
DOI: 10.5220/0013975800004912
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Innovative Education and Social Development (IESD 2025), pages 180-186
ISBN: 978-989-758-779-5
Proceedings Copyright © 2025 by SCITEPRESS – Science and Technology Publications, Lda.
The rapid development of the artificial intelligence industry has had a great impact on the balance of interests constructed by traditional copyright law. Under this impact, infringement has become more concealed, which causes the evidence mechanism to fail, and the massive scale of use has likewise put great strain on the rights licensing system (Yuan and Xia, 2024). At present, however, requiring a prior license for every work that artificial intelligence takes as input would be almost equivalent to killing this new industry. Generative artificial intelligence is an important field of development; if it is caught up in a large number of infringement disputes, this is bound to cause a "chilling effect", which is not conducive to the development and progress of the new technology. How to protect the interests of copyright holders while promoting the development of generative artificial intelligence has become an urgent problem.
Among existing academic explorations, some scholars have offered ideas for solving the infringement problem of generative artificial intelligence by invoking the concept of temporary reproduction. "Temporary copying" is an incidental and objective technical phenomenon which, in the field of artificial intelligence, manifests itself as temporary storage in a cache; it does not produce a copy that can be recognized as an act of reproduction, but is merely ancillary to the subsequent steps. In order to avoid expanding the connotation of the right of reproduction, scholars and judicial precedents have agreed that temporary reproduction should not fall within the scope of the right of reproduction, which allows trainers of generative artificial intelligence to avoid infringement risk to a certain degree. In practice, however, relying on temporary reproduction to avoid the risk of copyright infringement is obviously impractical.
As a result, academics have turned to the fair use system, hoping to legitimize the use of works at the input stage by adding machine learning of generative AI to the twelve circumstances stipulated in China's copyright law. Other scholars have put forward different solutions, arguing that the intellectual property rights of copyright holders should be protected by adding new rights such as a right to machine learning and a right to machine reading. At present, the latter argument has received a certain degree of criticism from the academic community: copyright law grants no so-called "right to read" even over human reading behavior, so the analogous "right to read" and "right to learn" in respect of machines have even less legal basis. As a result, the fair use system has received widespread attention from academics as an important means to address the current infringement problem at the input stage of generative AI.
3 OBSTACLES TO EXISTING
FAIR USE REGIMES TO
ADDRESS INFRINGEMENT AT
THE INPUT STAGE OF
GENERATIVE AI
3.1 Fair Use Regime is Feasible to
Address Generative AI
Infringements
As discussed above, copyright disputes arise at the input stage of machine learning because of the massive amount of data that is crawled. The key question in solving this problem remains: can the machine legally crawl these data for its own machine learning?
Under China's copyright law, many scholars have focused on the "fair use system", which they regard as the most suitable solution for the development of generative artificial intelligence. Including machine learning in the fair use system could directly and effectively solve the problem of copyright infringement in the input stage and would substantially benefit the development and expansion of the artificial intelligence industry. However, the fair use system is essentially a system that, for the purpose of reasonable use and in accordance with the principle of proportionality, derogates to a certain degree from the rights of copyright owners. It is therefore strictly regulated and controlled by copyright law.
The criteria for applying the fair use rule in China's current copyright law follow the "three-step test" established by the Berne Convention. This standard requires that fair use meet three conditions: first, it must be confined to specific and exceptional circumstances; second, it must not impede the normal utilization of the original work; and third, it must not unreasonably impair the legitimate rights and interests of the copyright owner of the original work. Within this framework, China's copyright law clearly stipulates twelve situations that can be recognized as fair use.
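The cumulative structure of the three-step test can be sketched in a few lines of Python. This is only an illustration of the logic described above, not a statement of how any court applies the test; the class name `UseAssessment`, its field names and the function `passes_three_step_test` are hypothetical labels introduced here, and each boolean stands in for what is in reality a contested legal judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseAssessment:
    """Hypothetical findings about one use of a work (illustrative names only)."""
    is_special_case: bool                # step 1: specific, exceptional circumstance
    impedes_normal_use: bool             # step 2: conflicts with normal utilization
    unreasonably_impairs_owner: bool     # step 3: unreasonable harm to legitimate interests


def passes_three_step_test(a: UseAssessment) -> bool:
    # The three conditions are cumulative: failing any one step
    # means the use cannot be treated as fair use.
    return (
        a.is_special_case
        and not a.impedes_normal_use
        and not a.unreasonably_impairs_owner
    )


# A use that is a special case but competes with the normal market
# for the work fails at the second step.
competing_use = UseAssessment(True, True, False)
result = passes_three_step_test(competing_use)  # False
```

Because the conditions are joined conjunctively, the later discussion of the second and third steps matters independently: a use that fails either of them is excluded even if it falls within a listed circumstance.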
The input phase of machine learning is currently not included in the twelve types of fair use stipulated in China's copyright law, but the newly added thirteenth item, "other circumstances provided for by laws and administrative regulations", leaves open the possibility that the input phase of machine learning can be included in the fair use system. As a result, the fair use system has some room for operation in theory. In addition, amid the wave of artificial intelligence, the EU and Japan, which are pursuing development alongside China, have loosened the copyright restrictions on generative AI, giving AI companies more space and freedom to develop, so it is possible and reasonable for China to regulate the fair use of generative AI by analogy.
3.2 Difficulties Encountered by the Fair
Use System
Although the fair use system described above
provides us with ideas for solving the infringement
problem in the input phase of machine learning, the
fair use system still has some limitations and
problems that should not be ignored.
First, applying the fair use defense to the input phase of generative AI may exceed the boundaries of the interests that the fair use system protects. The fair use system is itself a restriction on the rights of the copyright holder, exempting certain acts that would otherwise infringe copyright. It is a sacrifice imposed on copyright holders for the sake of the development of science and technology; therefore, the law needs to strictly limit the scope of the fair use provisions to ensure that the rights and interests of copyright holders are not harmed by unreasonable use (Wang and Chu, 2024). The output of generative artificial intelligence is highly likely to act as a substitute for the copyright owner's work, substantially reducing the revenue that the copyright owner obtains from the work. This would break the balance of interests that the fair use system was originally designed to maintain and would not be conducive to encouraging creativity.
Second, generative AI may not pass the second step of the three-step test. Whether the input stage can be integrated into the fair use regime depends on satisfying the last two steps of the test, the two "shall nots". According to the official interpretation of the World Trade Organization's dispute settlement panel, the core logic is to define them within the methodological framework of economic analysis. Specifically, the first "shall not" has a strict meaning: the original work must not be exploited in a way that conflicts with the copyright owner's market behavior of obtaining economic benefits through the exercise of legal rights.
However, given the current state of artificial intelligence, we arrive at an unfortunate conclusion: most of the generative artificial intelligence on the market cannot pass the second step of the three-step test. Generative artificial intelligence is a highly sophisticated industry that requires substantial research, development and capital investment, so it is rarely developed by individuals and is usually developed by technology companies. As an operating company, the developer will inevitably compete with the original copyrighted works in the market and squeeze the space for their survival. According to Marx's theory that socially necessary labor time determines the value of goods, artificial intelligence reduces the socially necessary labor time for producing such works and thus reduces their value, so the economic value of the original work and the interests derived from it will be seriously affected. Therefore, most generative artificial intelligence is unable to pass this step. This creates an obvious dilemma for the fair use system: how can the fair use system be extended without violating the existing legal framework?
Third, generative AI also fails the third step of the three-step test. In contrast to the first, the second "shall not" is relatively loosely defined. Provided that the principle of balance of interests is not violated, it allows the utilization of the original work to cause moderate derogation, within a reasonable range, to the economic interests of the copyright owner. This is both a limitation on the rights of the copyright holder and a license for the rights of the user (Wu and Lai, 2024).
Unfortunately, however, the test does not stipulate what constitutes "moderate derogation", which leaves a problem of definition. Although the concept of "moderate derogation" is difficult to pin down, we can explore this issue through the principle of balancing of interests. This principle permeates the whole of intellectual property law and plays a role that cannot be ignored. Its external manifestation is the weighing of interests. China's first AIGC case shows that China is still more inclined to protect the interests of copyright owners in order to balance interests.
Currently, generative artificial intelligence is widely used in a variety of fields, and its prospects for large-scale application are clear. This technology has already seized the opportunity and initiative of the new era, and its advantage over individual copyright holders in the dissemination of information should not be ignored. Generative artificial intelligence not only generates content efficiently, but also relies on the wide coverage of the Internet to spread information and make it accessible rapidly, thus substantially changing the traditional pattern of information dissemination and posing unprecedented challenges to the rights and interests of copyright owners. The copyright owner is already in a rather unfavorable position, and applying the fair use system at this point would almost completely upset the balance between the interests of the copyright owner and those of generative artificial intelligence, tilting it entirely toward the latter.
Furthermore, the focus of the fair use system is "fair use", which requires the user to meet the condition of being "non-profit". OpenAI, the company behind the popular text-generating artificial intelligence ChatGPT, has launched several products and services at different price points, which is fundamentally at odds with the "non-profit" requirement of fair use.
According to existing judicial cases in China, courts that have ruled that the input phase of machine learning constitutes fair use are still in the minority (Cong, 2024). Judicial practice thus confirms that the application of the fair use system still faces certain difficulties in China.
To summarize, I believe that at present it is not reasonable to apply the fair use system directly to the input stage of machine learning, nor is there a basis for doing so in legal sources or custom. Blindly expanding the fair use system would run counter to the fundamental spirit in which China's copyright law was established and would violate the basic principle of balance of interests.
4 A SOLUTION TO THE
INFRINGEMENT PROBLEM
OF GENERATIVE ARTIFICIAL
INTELLIGENCE
Solving the infringement problem at the input stage of generative artificial intelligence machine learning cannot rely solely on applying the fair use system as it stands. The solution requires classifying the uses of generative artificial intelligence in a more refined way and adapting the fair use system accordingly.
4.1 Adaptation of the Existing Fair
Use Regime - Inclusion of
Non-Commercial Purpose Use in
the Fair Use Regime
4.1.1 Do All Input Stages Violate the Two
"Shall Nots"?
Generative AI, as an emerging and highly intelligent technological tool, has an extremely wide range of applications, covering a variety of fields such as personal learning, data retrieval, and assisted decision making. It is worth noting that not every type of generative AI fails the three-step test. In order to explore this issue more deeply, some scholars have proposed dividing AI into two types according to whether it serves commercial or non-commercial purposes (Wu and Lai, 2024). This categorization helps us address more precisely the infringement issues that may be involved in the input stage of generative AI.
In the context of non-commercial use, the actual harm to copyright holders from generative AI can be minimal. Such AI is mainly used in scientific research, public governance and other fields that are important for the well-being of society. Since non-commercial use is not characterized by wide dissemination, its degree of interference with the interests of copyright owners is relatively low, and it does not cause a market substitution effect on the original work, thus avoiding serious squeezing of and damage to the interests due to copyright owners.
In addition, non-commercial use can be handled by analogy with item (1) of the twelve circumstances stipulated in China's copyright law, which allows individuals to use other people's works for the purpose of study, research or appreciation. In this context, the subject using generative AI for study may not be limited to individuals and may also include corporate entities such as companies, but this does not undermine the reasonableness of this mode of use, whether it is achieved through legislation or through an expanded interpretation of item (1).
From this I believe that it is reasonable and
necessary to include machine learning of generative
AI for non-commercial purposes in the category of
fair use. Once it is included, then the problem of
infringement at the input stage that we have been
exploring will be solved.
4.1.2 Benefits of Including Machine
Learning for Non-Commercial
Purposes in the Fair Use Regime
According to discussions among scholars, artificial intelligence is now divided into three stages, namely, weak artificial intelligence, stronger artificial intelligence and strong artificial intelligence. Scholars widely agree that current artificial intelligence is still in the stage of weak artificial intelligence, i.e., artificial intelligence is still only used as a tool at this stage and does not have so-called "human" thinking characteristics (Cong, 2019). This is because generative AI is a tool revolution brought about by the evolution of human beings to today's technological level. As a tool to assist humans, generative AI must rely on machine learning in order to escape the algorithmic bias that it may induce (Lai, 2024).
The application fields of generative artificial intelligence for non-commercial purposes are mainly scientific research, public policy and other public welfare undertakings. Including it in the category of fair use, first, is in line with the development needs of current society; second, allows new technology to be better applied to promote the well-being of mankind; and third, helps generative artificial intelligence to get rid of algorithmic bias in a reasonable and lawful way.
According to the theory of market failure, if non-commercial generative AI cannot be included in the fair use system, its developers and users may lose the incentive to conduct scientific and public research because they cannot obtain the benefits they expect.
Therefore, generative AI for non-commercial purposes should not be subject to too much restriction, since such restriction is itself a hindrance to development and contrary to the legislative purpose of copyright law.
4.2 Reasonableness of Applying a
Statutory Licensing System for
Commercial Purposes
4.2.1 Use for Commercial Purposes Fails the
Three-Step Test
Generative AI for commercial purposes is monetized through the content it outputs. This paper does not examine whether that output is copyrightable, but it is undeniable that the market share of generative AI has been increasing rapidly since its introduction, while the profits of the traditional copyright industry have been hit by the rapid growth of the AI industry. This suggests that our concerns are not unfounded and that the AI industry is tangibly affecting the functioning of the traditional copyright industry.
Now that generative AI has become an industry, users sign agreements with developers of generative AI and even pay them fees. If such commercial behavior is regarded as fair use, it will undoubtedly pose a serious challenge to the copyright system and seriously threaten the legitimate rights and interests of copyright holders. Such behavior not only blurs the line between fair use and commercial exploitation, but may also weaken the core function of the copyright system, which is to motivate and protect creators, and this in turn will have a negative impact on the long-term development of cultural innovation and knowledge dissemination (Gao, 2024).
4.2.2 Reasonableness of the Statutory
Licensing System
Statutory licensing has been proposed as a measure to resolve this dilemma: it derogates from the rights of the copyright holder but balances the interests by means of a fee paid by the user to the copyright holder. Such a solution is more of a compromise than the fair use system in that it occupies the middle ground between the two sides. It allows the copyright holder to receive compensation for the derogation of his rights, while for trainers, the payment of compensation allows them to use the data without prior permission from the copyright holder and greatly facilitates their use of the data. This system is more in line with the principle of the unity of rights and obligations and the balance of interests in intellectual property law, and is better suited to China's national conditions.
At present, China's copyright system is still in a period of development, and the public's awareness of intellectual property protection is still weak. I have always believed that only when the humanities lead the development of science and technology can the field of science and technology develop stably and move to the forefront. The statutory licensing system can raise the public's awareness of intellectual property rights, thus encouraging more people to know, abide by and use the law.
4.2.3 How to Set the Amount of
Compensation
Traditional intellectual property disputes tend to determine the amount of damages by reference to one of two measures: the loss of the copyright owner or the unjust enrichment of the defendant. In the case of generative artificial intelligence, the problem becomes more complex, because training involves a massive amount of data and a very wide range of copyright owners. Using the copyright owner's loss as the measure would undoubtedly increase the pressure on the judiciary and would be very unrealistic. Therefore, I advocate determining the amount by reference to the defendant's unjust enrichment: a certain percentage of the benefits received by the generative artificial intelligence company should be paid as compensation to copyright holders. This amount should not be set too high; otherwise, according to the same market failure theory, the development of the generative artificial intelligence industry will be severely challenged (Zhang, 2024).
Similarly, since the data crawled is massive, AI companies are unlikely to spend much time finding individual copyright holders; if a system of compensation is to be implemented, it is necessary to set up a specialized agency to manage copyrights. One reference is the practice agreed between the American Federation of Musicians (AFM) and the radio and record companies, under which the money is entrusted to a foundation for management and used to fund free concerts and jobs for more musicians.
Although this last step is difficult to implement concretely in China nowadays, the specialized agency can adopt internal rules to allocate the compensation rationally.
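The arithmetic behind this proposal can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the compensation pool is a fixed percentage of the AI company's benefits and that the specialized agency splits the pool in proportion to the number of each right holder's works that were used. The pro rata rule, the 2% rate, the figures and the function names `compensation_pool` and `allocate_pro_rata` are all hypothetical; the paper itself leaves the rate and the allocation rules to be determined.

```python
from fractions import Fraction


def compensation_pool(benefits: Fraction, rate: Fraction) -> Fraction:
    """Total compensation: a fixed share (rate) of the AI company's benefits."""
    if not (0 <= rate <= 1):
        raise ValueError("rate must be between 0 and 1")
    return benefits * rate


def allocate_pro_rata(pool: Fraction, works_used: dict[str, int]) -> dict[str, Fraction]:
    """Split the pool by each holder's share of the works used (assumed rule)."""
    total = sum(works_used.values())
    if total == 0:
        # No works recorded: nothing to distribute.
        return {holder: Fraction(0) for holder in works_used}
    return {holder: pool * count / total for holder, count in works_used.items()}


# Hypothetical figures: benefits of 1,000,000 and a 2% statutory rate.
pool = compensation_pool(Fraction(1_000_000), Fraction(2, 100))      # 20000
shares = allocate_pro_rata(pool, {"holder_a": 3, "holder_b": 1})
# holder_a receives 15000 and holder_b receives 5000.
```

Exact fractions are used so that the shares always sum to the pool. A real allocation rule could weight works by type, length or market value instead of a simple count; that choice is exactly what the agency's internal rules would need to settle.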
5 CONCLUSION
The rapid development of generative artificial intelligence has brought us a new round of technological revolution, but it has also brought new copyright risks and challenges. Faced with this new, highly intelligent tool, we find that copyright law, originally built around the principle of balance of interests, has not been able to adapt to the needs of present social development. The training process of generative artificial intelligence is not only concealed but also massive in scale, which poses unprecedented challenges to copyright.
Some scholars strongly advocate including the input stage of machine learning in the fair use system, through new legislation or through the logic of distinguishing expressive from non-expressive use, so that the input stage can be legitimized. However, treating the commercial use of generative artificial intelligence as fair use would significantly violate the legal principle of the unity of rights and obligations, and moreover would violate the principle of balance of interests in copyright law.
We therefore divide the application of generative AI into two categories, commercial use and non-commercial use. Bringing non-commercial use within the fair use system is more reasonable and in line with the principle of proportionality, while commercial use should be governed by a combination of the statutory license system and the compensation system. This balances the interests of the two sides: the rights of the copyright holder are impaired, but the damage suffered is made up for through the payment of compensation.
REFERENCES
Cong Yike. 2024. Limitations and Exceptions of Data Mining in the Age of Artificial Intelligence. Technology Entrepreneurship Monthly 2024, 37(08): 39-45.
Cong Lixian. 2019. Copyrightability and Copyright Attribution of Artificial Intelligence Generated Content. China Publishing 2019, (01): 11-14.
Gao Yang. 2024. Regulation of copyright infringement of artificial intelligence training data. China Publishing 2024, (15): 12-18.
Lai Jiayang. 2024. Copyright Infringement Risks and Response Strategies in the Input Stage of Machine Learning. Technology and Law 2024, (10): 161-165.
Liu Youhua & Wei Yuanshan. 2019. Copyright infringement problem of machine learning and its solution. Journal of East China University of Political Science and Law 2019, 22(02): 68-79.
Wu Jiaxu, Lai Xiaopeng. 2024. Copyright Dilemma of Generative Artificial Intelligence Machine Learning and Its Institutional Response. Editor's Friend 2024, (11): 96-104.
Xu Long. 2022. Copyright dilemmas and institutional programs for machine learning. Southeast Academic 2022, (02): 237-245.
Yuan Zhenfu and Xia Zixuan. 2024. A study on the copyright compensation system for work utilization in machine learning. Science and Technology and Law 2024, (07): 28-36.
Wang Qian, Chu Chu. 2024. A preliminary study on the boundary between artificial intelligence and copyright: legal challenges and reflections under technological progress. China Edit 2024, (08): 56-62.
Zhang Ping. 2024. The Institutional Difficulties of Copyright Legitimacy of Artificial Intelligence Generated Content and Its Solution Path. Legal Science (Journal of Northwest University of Politics and Law) 2024, 42(03): 18-31.