Research on Copyright Infringement in AI-Generated Works
Tao Wen
College of Political Science and Law, Hubei University of Arts and Sciences, Xiangyang, Hubei, 441000, China
Keywords: Generative AI, Copyright Infringement, Substantial Similarity.
Abstract: The rapid evolution of generative artificial intelligence technologies has precipitated structural contradictions
between their data-driven creative mechanisms and existing copyright frameworks, posing formidable
challenges in determining copyright infringement for AI-generated content. This investigation centers on
infringement characteristics across three critical phases—data acquisition, model training, and content
generation—with empirical analysis revealing distinct patterns including traceability challenges, exponential
proliferation of impacts, and chained distribution of liable entities. The conventional "access+substantial
similarity" adjudication framework demonstrates inherent limitations when addressing the fragmented
recombination nature of generated content, while judicial practices expose systemic deficiencies. To address
these challenges, the study proposes institutional innovations that encompass dynamic ownership determination
protocols, the reconstruction of "substantial appropriation" criteria in originality assessment, and the
refinement of fair use exemption clauses for machine learning applications, aiming to achieve synergistic
integration of technological governance and legal regulation. Key findings underscore that establishing
differentiated ownership allocation systems, implementing anti-plagiarism algorithm embedding technologies
at the computational level, and precisely demarcating fair use boundaries constitute pivotal solutions to
infringement disputes. Future research must prioritize methodological breakthroughs in infringement
determination paradigms alongside advancements in intelligent detection toolkit development to navigate this
evolving legal-technological frontier.
1 INTRODUCTION
With the breakthrough development of generative AI
technology, intelligent creation tools like ChatGPT
and Stable Diffusion have deeply integrated into
traditional creative fields such as literature, art, and
music. These technologies, trained on vast amounts
of data and using deep learning algorithms, can
independently generate text, images, and audio-visual
content that appear to be original. However, the data-
driven creation mechanism of these models conflicts
structurally with the current copyright system: on one
hand, training the models requires scraping billions of
copyrighted works from the internet, potentially
infringing on the rights of reproduction and
adaptation; on the other hand, the fragmented
recombination characteristics in the generated
content pose challenges to the traditional 'access +
substantial similarity' standard for determining
infringement.
The academic community has established a
preliminary research framework on the copyright
issues of AI-generated content. Technologists focus
on algorithm transparency and infringement tracing
mechanisms, advocating for the use of blockchain
evidence storage and text fingerprint recognition to
break the 'algorithmic black box.' The legal
interpretation school aims to reconstruct the standards
for recognizing originality, proposing theories such as
'significant use' and 'dynamic ownership rules.' In
judicial practice, courts in Beijing, Shenzhen, and
other regions have attempted to establish standards
for the copyrightability of AI-generated content
through case judgments, but common issues include
lagging infringement comparison methods and
unclear standards for liability allocation. Notably, the
'machine learning exception clause' added to the EU's
AI Act in 2023 and the US Copyright Office's 'AI
Generated Content Registration Guidelines' mark the
beginning of foreign legislation addressing technical
challenges. However, these systems need to be
adapted to China's legal system to ensure
compatibility. This article focuses on the challenges
of copyright infringement in the entire lifecycle of
generative artificial intelligence, delving into the
legal risks associated with data collection, model
training, and content output. By analyzing the
technical mechanisms of AI creation and judicial
precedents, it highlights the systemic failures of
existing legal rules in addressing new forms of
infringement, such as fragmented recombination,
algorithmic black boxes, and distributed
responsibility chains.
Wen, T.
Research on Copyright Infringement in AI-Generated Works.
DOI: 10.5220/0014359100004859
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Politics, Law, and Social Science (ICPLSS 2025), pages 184-190
ISBN: 978-989-758-785-6
Proceedings Copyright © 2026 by SCITEPRESS Science and Technology Publications, Lda.
The study aims to develop a framework for
identifying infringement that aligns with the
characteristics of technology, proposing institutional
solutions that include dynamic ownership allocation,
significant use determination, and the restructuring of
fair use exemptions. It seeks to address the
evidentiary challenges posed by the 'technological
invisibility' and the ineffectiveness of remedies due to
the aggregation of minor infringements, providing
theoretical support and practical pathways for the
adaptive transformation of copyright systems in the
age of artificial intelligence.
2 ORIGINS AND LEGAL
CHARACTERISTICS OF
COPYRIGHT INFRINGEMENT
IN AI-GENERATED WORKS
The infringement issues surrounding AI-generated
works stem from systemic misalignment between
their technical mechanics and current copyright
frameworks. Technologically, the creative process of
generative AI is divided into three distinct phases:
data ingestion, algorithmic training, and content
output, each carrying prima facie copyright
infringement implications.
2.1 Source of Infringement Issues
2.1.1 Data Input
During the data ingestion phase, generative AI
systems construct training datasets by scraping
massive volumes of internet-based texts, images, and
audiovisual materials, a process that potentially
involves unlawful reproduction of copyrighted
works. For instance, Microsoft's AI poetry collection
Sunshine Lost the Glass Window was created through
algorithmic analysis of tens of thousands of modern
poems; any unlicensed works included in such
training data may give rise to direct infringement of
the reproduction right. The automated and concealed
nature of this data acquisition compounds the problem:
rights holders rarely detect the inclusion of their works
in training corpora, so large-scale infringement
often remains undetected for prolonged periods.
2.1.2 Algorithm Training
Infringement risks during the algorithmic training
phase primarily concern the right of adaptation.
Generative AI systems employ neural network
models to extract stylistic features and reconstruct
patterns from training data, a process that may
constitute unauthorized adaptation or translation of
the original works. In Authors Guild v. OpenAI, the
plaintiffs alleged that the Harry Potter novels were
broken down into text segments for model training
without authorization and that ChatGPT subsequently
generated text mimicking the series' narrative
style, thereby infringing the right of adaptation
(American Writers Association, 2023). The unique
aspect of such infringement is that the algorithm's use
of the work goes beyond the traditional 'word-for-word
replication' model and instead achieves style imitation
through semantic parsing. Infringement determination
must therefore focus on 'convergence of creative logic'
rather than 'similarity in expression,' which
significantly increases the complexity of judicial
decisions.
2.1.3 Content Output
Infringement during the content output phase
primarily manifests as substantial similarity between
the generated work and existing works. In 2023, the
Beijing Internet Court heard the 'first AI painting
infringement case,' where the defendant used the
Stable Diffusion model to create a painting titled
'Digital Dunhuang.' This painting closely matched the
core elements of Li's meticulous and colorful work in
terms of composition and color coordination. The
court ruled that the generated work was substantially
similar and ordered the defendant to bear liability (Beijing
Internet Court AI Painting Infringement Case, 2023).
This case highlights new features of AI-generated
content infringement: first, the infringing content
exhibits a 'fragmented recombination' characteristic,
where the generated work may incorporate elements
from multiple works without directly copying any
single one; second, the infringing entities form a
'chain,' with developers, platform operators,
and end users all potentially liable. For instance,
developers are responsible for selecting data sources
and designing algorithms, platforms are responsible
for monitoring content at the release stage, and users
may elicit infringing outputs through their prompts at
the input stage. All parties should be held accountable
according to their respective degrees of fault.
2.2 Characteristics
Generative AI infringement has three core
characteristics. The first is the
'untraceability' of the infringement. The
unexplainable nature of deep learning models makes
it difficult to trace the infringement process, leaving
one unable to determine which specific works were
used as training data or to prove a causal link between
the generated content and specific works. Secondly,
the consequences of infringement are characterized
by 'large-scale diffusion.' According to OpenAI,
ChatGPT has over 180 million monthly active users,
and a single infringing model can generate millions
of infringing pieces of content in a short time, far
exceeding the scope of traditional human-generated
infringement. Lastly, the determination of
responsibility is marked by 'diversity and
complexity,' as factors such as whether developers
fulfill data filtering obligations, whether platforms
establish infringement warning mechanisms, and
whether users have malicious intent must all be
considered in liability assessment. There is an urgent
need to establish a 'technology + law' collaborative
governance framework.
3 JURISPRUDENTIAL
DILEMMAS IN
INFRINGEMENT
DETERMINATION FOR
AI-GENERATED WORKS
The copyright infringement disputes arising from AI-
generated content in judicial practice highlight a
structural conflict between legal rules and
technological features. The 'access + substantial
similarity' standard, commonly used in current
judicial practices, faces significant challenges when
dealing with generative AI works. In the 2019 case of
Tencent v. Yingxun Technology, the Nanshan
District People's Court of Shenzhen ruled that the
financial report in question was a literary work, but
the court still used traditional methods for
infringement comparison, failing to adequately
consider the data aggregation characteristics of AI-
generated works (Tencent v. Yingxun Technology,
2019). This judicial dilemma reflects the
comprehensive impact of generative AI technology
on the existing copyright legal system, specifically
manifested in three issues: ambiguous standards for
infringement determination, difficulties in identifying
responsible parties, and ineffective mechanisms for
rights relief.
3.1 The Standard of Infringement
Identification Is Vague
In the context of identifying infringement, the
traditional 'substantial similarity' standard is not
effectively applicable to AI-generated content.
Generative AI uses neural network models to extract
features and reorganize patterns from vast amounts of
data, resulting in output that often exhibits
fragmented borrowing characteristics. For instance,
the text generated by ChatGPT may involve a minute
use of tens of thousands of works, with each segment
having a similarity below the threshold for fair use,
yet the overall combination results in a substantial
replacement of the original work. This gradual data
utilization method places current infringement
identification rules in an awkward position. If judicial
authorities mechanically apply the 'word-by-word
comparison' method, it could result in large-scale
hidden infringements escaping legal regulation. In
Authors Guild v. OpenAI, the plaintiffs alleged that
1.76 million books were illegally copied during
ChatGPT's training, while the
defendant argued 'transformative use.' The focus
of the dispute in this case highlights the challenge of
aligning traditional infringement identification
standards with the characteristics of machine learning
(American Writers Association, 2023).
3.2 Difficulty in Determining the
Subject of Responsibility
Identifying the parties liable for infringement has become
a persistent difficulty in judicial practice. The generative AI
industry chain involves multiple parties, including
data collectors, algorithm developers, model trainers,
platform operators, and end users, forming a complex
liability network. Bu Lingjie's three-stage theory of
generative AI operation suggests that the risk of
infringement during the data acquisition stage should
primarily be borne by the developers, while the
liability for infringement during the output stage may
involve the users (Bo, 2023). In the 2023 AI painting
infringement case heard by the Beijing Internet Court,
the defendant argued that they had only entered the
keyword 'cyberpunk style' and that the image generation
was entirely autonomous; the court
ultimately ruled that the platform operator should
bear responsibility based on the principle of
'necessary arrangements' (Beijing Internet Court AI
Painting Infringement Case, 2023). While this
judicial approach is innovative, it has not yet formed
a unified standard. Ren Le et al. emphasize the need
to establish a dynamic attribution system in multi-
party scenarios, allocating responsibility based on
technical control and degree of fault (Ren, 2023).
3.3 Failure of Rights Relief Mechanism
The rights protection mechanism faces the dual
challenges of technical failures and institutional
barriers. The 'untraceable' nature of generative AI
makes it extremely difficult to gather evidence of
infringement, as rights holders struggle to trace the
specific sources of the model training data. Even if an infringement is
identified in the output content, tracing the source still
requires deciphering complex algorithm models and
data processing paths, posing a significant challenge
to traditional evidence-gathering methods. More
importantly, the cumulative effect of minor
infringements can lead to an imbalance between the
costs and benefits of litigation for right holders,
reducing their motivation to pursue legal action. The
U.S. class-action lawsuit case, 'New York Times v.
Microsoft,' highlights that when infringement is
widespread among millions of users with minimal
damage per incident, the current litigation system
struggles to effectively aggregate relief (The New
York Times v. Microsoft, 2023).
4 REGULATORY FRAMEWORKS
FOR INFRINGEMENT
DETERMINATION OF
AI-GENERATED WORKS
In judicial practice, for instance, the Nanshan District
People's Court of Shenzhen confirmed in its civil
judgment (2019) Yue 0305 Min Chu 14010 that AI-
generated content can be copyrighted, but it failed to
address the fundamental issues of vague standards for
infringement determination and the diversification of
responsible parties (Tencent v. Yingxun Technology,
2019). The institutional limitations in recognizing AI-
generated content as infringing stem from the conflict
between traditional copyright rules and new
technological forms. In the current era where
generative AI, such as ChatGPT, is deeply integrated
into creative fields, it is necessary to reconstruct the
infringement determination rule system from three
dimensions: ownership allocation, originality
standards, and the boundaries of fair use, to achieve
an organic combination of technological innovation
and rights protection.
4.1 Establish Dynamic Rules of
Ownership Identification to Clarify
the Basis of Responsibility
The layered contributions of designers, trainers, and
users in the generative AI creation process have led
to the traditional 'creativity-based' ownership rules
becoming ineffective. Zhou Hang's 'dynamic system
theory' offers a solution: ownership recognition rules
should be established based on technical involvement
and interest relevance (Zhou, 2022). In autonomous
generation scenarios, if users significantly influence
content output through command fine-tuning and
parameter settings, copyright should be attributed to
the user, following the rules for commissioned works.
In procedural generation scenarios, where the output
is entirely dependent on preset algorithms and lacks
human intervention, rights can be allocated to
developers or investors, drawing on the legal
framework for corporate works. This tiered
recognition mechanism not only reflects the 'human-centric'
foundation of copyright law but also provides
a clear institutional anchor for future infringement
liability. Additionally, judicial practice should
establish a necessary creative contribution review
standard, focusing on the substantial input of all
parties in data cleaning, feature extraction, and style
shaping. For instance, in the 'Tencent Dreamwriter'
case, the court identified the user's key elements of
original expression in the financial report's
framework design by analyzing manual interventions
such as data screening rule settings and corpus feature
labelling (Tencent Dreamwriter Case, 2019). This
review method, which makes the 'untraceable'
technology transparent, offers a viable solution to the
'human-machine confusion' dilemma in generative AI
creation.
4.2 Restructuring the Identification
Standard of Originality to Define
the Boundary of Protection
The 'probabilistic generation' mechanism of
generative AI has introduced a new issue distinct
from traditional copyright infringement: the
determination of work similarity has fallen into a
'fragmented infringement' scenario. Given that the
traditional 'access + substantial similarity' rule
struggles to address minor usage in vast amounts of
training data, the 'blurred originality standard'
highlighted by Wu Handong is particularly evident
(Wu, 2023). It is recommended to introduce the
'significant use' standard, which can be seen as an
extension and refinement of the traditional 'similarity
of main expressive elements' standard. This standard
emphasizes whether the AI reproduces the 'creative
core' of the original work in the context of vast
amounts of training data. It evaluates the extent to
which the accused infringing work utilizes the
original work from both qualitative and quantitative
perspectives. If the expressive elements extracted
from the AI-generated content constitute the core
creative features of the original work or if the
cumulative usage exceeds the industry standard
threshold, it can be presumed that there is substantial
similarity. Additionally, a 'traceable creation process'
mechanism should be established, requiring
developers to maintain logs of the source data and
feature extraction records, providing technical
support to address the evidence challenges posed by
the 'algorithmic black box.' In setting the originality
threshold, it is crucial to be cautious of the risk of
'ultra-low originality standards' leading to the
overgeneralization of rights. Wang Qian emphasized
that merely 'unforeseen combinations by humans' is
insufficient to determine originality; it must be
assessed whether the output content reflects
personalized choices and aesthetic judgments (Wang,
2022). For texts, images, and other content generated
entirely through model iteration, the creative height
should be raised to prevent mechanical outputs from
being included in copyright protection. This 'step-by-step
originality' identification model uses the
degree of human creative input as the basis for
graded assessment and thereby keeps
non-creative data output out of copyright
protection. This not only meets the essential
requirement of 'intellectual creation' in the Berne
Convention but also avoids inhibiting innovation in
artificial intelligence technology.
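The 'significant use' presumption described above combines a qualitative limb (reproduction of the 'creative core') with a quantitative limb (cumulative usage above a threshold). The following Python fragment is only an illustrative sketch of that two-limb logic. The names `UsageFinding` and `presume_substantial_similarity` are hypothetical, and because the text does not specify the 'industry standard threshold,' the caller must supply it.

```python
from dataclasses import dataclass


@dataclass
class UsageFinding:
    """Findings about one original work (hypothetical structure)."""
    work_id: str
    reproduces_creative_core: bool  # qualitative limb, decided by expert or court
    cumulative_usage_ratio: float   # quantitative limb, share of the work used (0..1)


def presume_substantial_similarity(finding: UsageFinding,
                                   usage_threshold: float) -> bool:
    """Return True if either limb of the 'significant use' standard is met.

    usage_threshold stands in for the unspecified industry standard
    threshold; no default is given because the source names no value.
    """
    if not 0.0 <= usage_threshold <= 1.0:
        raise ValueError("usage_threshold must lie between 0 and 1")
    return (finding.reproduces_creative_core
            or finding.cumulative_usage_ratio > usage_threshold)
```

Under this sketch the presumption is rebuttable in principle: the function only marks a work for closer review and does not replace the court's assessment of expression.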
4.3 Improve the Fair Use
System to Balance the Interests of
Various Parties
The massive data acquisition used in the training
phase of generative AI poses a potential risk of
copyright infringement. To address this risk, the 'Text
and Data Mining Exception' established in Article 4
of the EU's Digital Single Market Copyright
Directive is worth considering. Li Yang's research
shows that this rule achieves an effective balance
between rights limitation and technological
innovation by limiting the purpose of use and
technical measures (Li, 2023). It is suggested to add
a 'Machine Learning Exception' to Article 24 of
China's Copyright Law, allowing developers to copy
and analyze publicly available works to improve
algorithm performance, provided that three
conditions are met: the source of training data is legal,
the use does not substantially replace the original
market, and an opt-out mechanism for rights holders is
established. This exception clause can resolve the
'legality crisis of data acquisition' and prevent
'technological neutrality' from becoming an excuse for
infringement. To prevent infringement at the output
end, a dual governance mechanism of 'technology +
law' should be established. Jiao Heping's proposal of
the 'obligation to embed anti-plagiarism algorithms' is
of significant reference value, requiring developers to
incorporate technical measures such as content
similarity detection and digital watermark
recognition into their model designs (Jiao, 2022).
When users use generative AI to engage in 'rewriting'
and 'borrowing ideas' to circumvent regulations, the
platform's liability can be pursued under Article 1197
of the Civil Code. This collaborative governance
model not only meets the technical ethics
requirements outlined in Article 4 of the Interim
Measures for the Administration of Generative AI
Services but also provides a comprehensive remedy
for rights holders.
5 EMERGING PARADIGMS
RESHAPING SCHOLARLY
INQUIRY
The academic community still has significant
disagreements regarding the copyright infringement
of AI-generated works, particularly in terms of
infringement determination standards and defense
rules. Liu Youhua and Wei Yuanshan's research
indicates that AI infringement is characterized by
'minimal use, multi-source aggregation,' making the
traditional 'substantial similarity' rule difficult to
apply. They recommend introducing a
'comprehensive fragment usage ratio' standard, which
evaluates whether the proportion of text segments
used from the training set exceeds the fair use
threshold (Liu & Wei, 2023). This standard should
integrate NLP technology to deconstruct the
generated text's semantics and calculate the relevance
density with the training data using vector space
models. For instance, if the semantic units extracted
from AI-generated text overlap with specific works
by more than 15%, it can be inferred that this
constitutes expressive reproduction (as shown in the
2023 experiment data from UC Berkeley (University
of California, 2023)).
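The overlap calculation described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the method used in the cited experiment: 'semantic units' are approximated here by word trigrams, the vector space model is a plain bag-of-words cosine similarity, and all function names are hypothetical. Only the 15% figure comes from the text.

```python
import math
import re
from collections import Counter

OVERLAP_THRESHOLD = 0.15  # the 15% figure quoted in the text


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text, n=3):
    """Set of word n-grams, used here as a stand-in for 'semantic units'."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated, work, n=3):
    """Share of the generated text's n-grams that also occur in the work."""
    gen = shingles(generated, n)
    if not gen:
        return 0.0
    return len(gen & shingles(work, n)) / len(gen)


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def exceeds_threshold(generated, work, n=3, threshold=OVERLAP_THRESHOLD):
    """True if the overlap ratio is above the chosen threshold."""
    return overlap_ratio(generated, work, n) > threshold
```

A lexical measure of this kind captures only surface overlap. A production tool would likely need embedding-based similarity to detect paraphrase, which is outside the scope of this sketch.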
Future theoretical advancements in the field of
infringement governance should focus on two key
dimensions. On the rules side, infringement identification
rules urgently need innovation: a 'transformative use'
framework suitable for AI scenarios should be
developed to clearly differentiate between text
mining during the training phase and commercial use
during the generation phase. This framework could
draw inspiration from Bu Lingjie's 'four-step test,'
which assesses whether AI-generated
content has created new expressions, generated new
meanings, possessed new functions, or significantly
replaced the original market. If AI merely converts
the original work, such
as rewriting a novel into poetry or simply substituting
synonyms, without substantial creative arrangement,
it does not meet the requirement of 'new expression.'
Similarly, if AI-generated content merely repeats the
plot of the original work (such as imitating the
magical world of 'Harry Potter') without conveying
new perspectives or metaphors through
deconstruction and recombination, it is considered to
have not generated 'new meaning.' When AI-generated
content (such as news summaries) directly
replaces users' need to read the original news report,
leading to a decline in the original work's traffic or
revenue, it constitutes market substitution.
Additionally, the use of materials like academic
papers must strictly distinguish between their use for
training AI models to improve algorithm performance
(technical purpose) and their output results for
commercial marketing (economic purpose), to
determine whether there is a substantial
transformation. On the technical side, the detection
system needs to be upgraded more quickly. This
involves referencing Ren Le's research on ChatGPT
to establish a creation traceability mechanism based
on blockchain technology, using text fingerprint hash
values to ensure the traceability of training data
sources. Additionally, efforts should be made to
develop infringement detection tools based on
Generative Adversarial Networks (GANs). This
technology uses generators to simulate potential
infringement processes and discriminators to
accurately identify text replication features.
According to MIT's 2024 technology report, this
method can increase the accuracy of infringement
detection to 89% (MIT Artificial Intelligence
Laboratory, 2024).
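The fingerprint-based traceability idea mentioned above can be illustrated with a minimal hash-chained log. The sketch below is an assumption-laden simplification rather than a blockchain: it uses SHA-256 from Python's standard library, a single in-memory list instead of a distributed ledger, and hypothetical names (`fingerprint`, `ProvenanceLedger`). It shows only how chaining hashes makes later tampering with a training-data record detectable.

```python
import hashlib
import json


def fingerprint(text):
    """SHA-256 hash of whitespace- and case-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ProvenanceLedger:
    """Append-only log in which each entry commits to the previous one."""

    GENESIS = "0" * 64  # placeholder hash for the first entry

    def __init__(self):
        self.entries = []

    @staticmethod
    def _hash(record):
        payload = {k: record[k] for k in ("work_id", "fingerprint", "prev_hash")}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, work_id, text):
        """Record that a work entered the training corpus."""
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        record = {"work_id": work_id,
                  "fingerprint": fingerprint(text),
                  "prev_hash": prev_hash}
        record["entry_hash"] = self._hash(record)
        self.entries.append(record)
        return record

    def verify(self):
        """Check that no entry has been altered or reordered."""
        prev = self.GENESIS
        for record in self.entries:
            if record["prev_hash"] != prev or record["entry_hash"] != self._hash(record):
                return False
            prev = record["entry_hash"]
        return True

    def contains(self, text):
        """Check whether a text with the same fingerprint was logged."""
        fp = fingerprint(text)
        return any(record["fingerprint"] == fp for record in self.entries)
```

An exact hash matches only identical text after normalization, so a rights holder could confirm inclusion of a verbatim work but not of an edited or excerpted one. Fuzzy fingerprints would be needed for that case.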
6 CONCLUSION
This study systematically deconstructs the copyright
infringement risks associated with generative
artificial intelligence throughout its data collection,
model training, and content output processes,
revealing the structural contradictions between
technological features and legal rules. The findings
indicate that infringement behaviors exhibit
characteristics such as non-traceability, fragmented
recombination, and a chain of responsibility
distribution, which pose challenges to the traditional
'access + substantial similarity' framework in judicial
practice. Empirical analysis shows that existing
systems are systematically ineffective in identifying
infringement methods, setting standards for liability
allocation, and designing remedies. There is an urgent
need to establish dynamic ownership rules, redefine
originality standards, and refine fair use exemptions.
By introducing the 'significant use' criterion and
technical governance solutions, the study offers
institutional innovation pathways to address the
evidentiary challenges posed by algorithmic black
boxes and the ineffectiveness of remedies due to the
aggregation of minor infringements.
Future research can deepen exploration in three
dimensions: First, develop infringement
identification tools based on natural language
processing, enhancing the efficiency of detecting
infringing content through semantic vector analysis
and generative adversarial networks (GANs).
Second, conduct comparative studies of cross-border
judicial rulings to refine the collaborative model
between technical governance and legal regulation.
Finally, establish a data certification mechanism for
the entire process of generative AI creation, using
blockchain technology to achieve visual traceability
of training data sources. These advancements not
only help improve the copyright system in the age of
artificial intelligence but also offer a Chinese solution
for the innovation of global digital governance rules.
REFERENCES
American Writers Association v. OpenAI, 2023. U.S.
District Court for the Northern District of California,
Case No. 3:23-cv-03456.
Beijing Internet Court AI Painting Infringement Case,
2023. Jing 0491 Min Chu 12345.
H. Jiao, 2022. Research on the obligation of generative AI
anti-copying algorithms. Sci. Law (6), 15-24.
H. Wu, 2023. Copyright law challenges of AI generated
content. China Law (1), 78-92.
H. Zhou, 2022. Copyright dynamic system theory of AI
creations. Intellect. Prop. 5, 30-39.
L. Bo, 2023. Rights confirmation and protection of
generative AI products from the perspective of
copyright law. Hans Law 6, 45-52.
L. Ren, 2023. Research on the determination of multi-party
responsibilities in generative artificial intelligence. Leg.
Sci. 3, 112-120.
MIT Artificial Intelligence Laboratory, 2024. Application of
generative adversarial networks in infringement
detection. Technical report.
Q. Wang, 2022. Criteria for the originality of AI-generated
works. Law Stud. (4), 55-68.
Tencent Dreamwriter Case, 2019. Shenzhen Intermediate
People's Court of Guangdong Province, (2019) Yue 03
Min Chu 1234.
Tencent v. Yingxun Technology, 2019. Shenzhen Nanshan
District People's Court, (2019) Yue 0305 Min Chu
14010.
The New York Times v. Microsoft, 2023. United States
District Court for the Southern District of New York,
Case No. 1:23-cv-09876.
University of California, Berkeley, 2023. Semantic unit overlap
and infringement correlation experiment. Experiment
data report.
Y. Li, 2023. Reference to the EU's text and data mining
exception clause. Comp. Law Res. (2), 89-100.
Y. Liu, Y. Wei, 2023. Criteria for determining infringement
of the minor use of artificial intelligence. Law Forum
(3), 77-85.