minimal use and multi-source aggregation, making the traditional 'substantial similarity' rule difficult to apply. They recommend introducing a comprehensive 'fragment usage ratio' standard, which evaluates whether the proportion of text segments drawn from the training set exceeds a reasonable-use threshold (Liu & Wei, 2023). This standard should integrate NLP techniques to deconstruct the semantics of generated text and calculate its relevance density against the training data using vector space models. For instance, if the semantic units extracted from AI-generated text overlap with a specific work by more than 15%, it can be inferred that the output constitutes expressive reproduction (as shown in 2023 experimental data from UC Berkeley (University of California, 2023)).
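A minimal sketch of how such a ratio might be computed is given below, assuming sentence-level fragments and TF-IDF vector space representations; the 0.8 per-fragment similarity cutoff and the toy corpus are illustrative assumptions, while the 15% threshold follows the figure discussed above.

```python
# Sketch of a "fragment usage ratio" check. Sentence-level fragments,
# TF-IDF vectors, and the 0.8 per-fragment cutoff are illustrative
# assumptions; only the 15% threshold comes from the discussion above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fragment_usage_ratio(generated_text, training_segments, cutoff=0.8):
    """Fraction of generated fragments closely matching training segments."""
    fragments = [s.strip() for s in generated_text.split(".") if s.strip()]
    vectorizer = TfidfVectorizer().fit(fragments + training_segments)
    gen_vecs = vectorizer.transform(fragments)
    train_vecs = vectorizer.transform(training_segments)
    # Best match in the training corpus for each generated fragment.
    best = cosine_similarity(gen_vecs, train_vecs).max(axis=1)
    return sum(1 for s in best if s >= cutoff) / len(fragments)

training_segments = [
    "The boy who lived had a lightning-shaped scar on his forehead.",
    "Copyright protects expression, not ideas or facts.",
]
generated = ("The boy who lived had a lightning-shaped scar on his forehead. "
             "New sentences written independently raise no concern.")

ratio = fragment_usage_ratio(generated, training_segments)
if ratio > 0.15:  # the 15% overlap threshold discussed above
    print(f"Potential expressive reproduction: ratio = {ratio:.0%}")
```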
Future theoretical advances in infringement governance should focus on two key dimensions. First, infringement identification rules urgently need innovation: a 'transformative use' framework suited to AI scenarios should be developed to clearly distinguish text mining during the training phase from commercial use during the generation phase. Such a framework could draw on Bu Lingjie's 'four-step test', which assesses whether AI-generated content creates new expression, generates new meaning, serves a new function, or significantly substitutes for the market of the original work (a minimal encoding of the test is sketched below). Notably, if AI merely transforms the original work, such as rewriting a novel into poetry or simply rearranging synonyms, without substantial creative arrangement, it does not satisfy the 'new expression' requirement. Similarly, if AI-generated content merely repeats the plot of the original work (for example, imitating the magical world of 'Harry Potter') without conveying new perspectives or metaphors through deconstruction and recombination, it is deemed not to have generated 'new meaning'. And when AI-generated content (such as a news summary) directly replaces users' need to read the original news report, causing a decline in the original work's traffic or revenue, it constitutes market substitution.
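To make the test concrete, the following hypothetical sketch encodes the four factors as a checklist; the field names and the aggregation rule (three affirmative factors plus no market substitution) are illustrative assumptions about how the test might be operationalized, not Bu Lingjie's own formalization.

```python
# Hypothetical encoding of the four-step test as a checklist; the factor
# names and the aggregation rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FourStepAssessment:
    new_expression: bool       # substantial creative rearrangement, not synonym swaps
    new_meaning: bool          # new perspectives/metaphors via recombination
    new_function: bool         # serves a purpose distinct from the original
    market_substitution: bool  # displaces demand for the original (weighs against)

    def is_transformative(self) -> bool:
        # Transformative use requires positive answers on the first three
        # factors and the absence of market substitution.
        return (self.new_expression and self.new_meaning
                and self.new_function and not self.market_substitution)

# Example: an AI news summary that displaces the original report.
summary = FourStepAssessment(new_expression=False, new_meaning=False,
                             new_function=True, market_substitution=True)
print(summary.is_transformative())  # False: likely non-transformative use
```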
Additionally, for materials such as academic papers, a strict distinction must be drawn between their use in training AI models to improve algorithm performance (a technical purpose) and the output of results for commercial marketing (an economic purpose), in order to determine whether a substantial transformation has occurred.
Second, the technical testing system needs to be upgraded at a faster pace. This involves drawing on Ren Le's research on ChatGPT to establish a blockchain-based creation traceability mechanism that uses text fingerprint hash values to guarantee the traceability of training data sources.
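A minimal sketch of the fingerprint idea appears below: each training document's SHA-256 hash is chained to the previous entry, so tampering with any record is detectable. A production system would anchor these entries on an actual blockchain; the `ProvenanceLedger` class and its fields are hypothetical.

```python
# Minimal hash-chain "fingerprint ledger" for training data provenance.
# The ProvenanceLedger class and its fields are hypothetical; a real
# system would anchor entries on an actual blockchain.
import hashlib
import json
import time

def fingerprint(text: str) -> str:
    """SHA-256 text fingerprint of a training document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class ProvenanceLedger:
    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64  # genesis link

    def record(self, source_id: str, text: str) -> dict:
        entry = {
            "source_id": source_id,
            "fingerprint": fingerprint(text),
            "timestamp": time.time(),
            "prev_hash": self._prev_hash,
        }
        # Each entry's hash covers the previous one, so tampering with
        # any record invalidates every later link in the chain.
        entry_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["entry_hash"] = entry_hash
        self._prev_hash = entry_hash
        self.entries.append(entry)
        return entry

ledger = ProvenanceLedger()
ledger.record("paper-001", "Full text of a licensed training document...")
```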
In parallel, infringement detection tools based on Generative Adversarial Networks (GANs) should be developed, in which a generator simulates potential infringement processes while a discriminator learns to identify text replication features. According to MIT's 2024 technology report, this method can raise infringement detection accuracy to 89% (MIT Artificial Intelligence Laboratory, 2024).
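The following is a structural sketch of that adversarial setup, assuming PyTorch and synthetic 64-dimensional vectors in place of real text embeddings; the network sizes, labels, and training loop are placeholders rather than the method reported by MIT.

```python
# Structural sketch of a GAN-style infringement detector. Synthetic
# vectors stand in for text embeddings; all sizes and settings are
# placeholder assumptions, not the cited MIT method.
import torch
import torch.nn as nn

DIM = 64  # stand-in for a text embedding dimension

# Generator: perturbs an original embedding to simulate a paraphrased copy.
G = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, DIM))
# Discriminator: scores how likely an embedding is a replication.
D = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    originals = torch.randn(32, DIM)    # embeddings of protected works
    independent = torch.randn(32, DIM)  # embeddings of independent text
    copies = G(originals)               # simulated infringing variants

    # Discriminator learns to flag simulated copies (label 1) and to
    # pass independently written text (label 0).
    d_loss = (loss_fn(D(copies.detach()), torch.ones(32, 1)) +
              loss_fn(D(independent), torch.zeros(32, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator learns to produce copies the discriminator cannot flag,
    # which in turn hardens the detector against evasive paraphrasing.
    g_loss = loss_fn(D(copies), torch.zeros(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```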
6 CONCLUSION
This study systematically deconstructs the copyright
infringement risks associated with generative
artificial intelligence throughout its data collection,
model training, and content output processes,
revealing structural contradictions between its technological features and existing legal rules. The findings indicate that infringing conduct exhibits non-traceability, fragmented recombination, and a chain-like distribution of responsibility, characteristics that challenge the traditional 'access + substantial similarity' framework in judicial practice. Empirical analysis shows that existing systems suffer systemic failures in their methods of identifying infringement, their standards for allocating liability, and their design of remedies. There is an urgent
need to establish dynamic ownership rules, redefine
originality standards, and refine fair use exemptions.
By introducing the 'significant use' criterion and
technical governance solutions, the study offers
institutional innovation pathways to address the
evidentiary challenges posed by algorithmic black
boxes and the ineffectiveness of remedies due to the
aggregation of minor infringements.
Future research can deepen this work along three dimensions. First, developing infringement identification tools based on natural language processing, using semantic vector analysis and generative adversarial networks (GANs) to improve the efficiency of detecting infringing content. Second, conducting comparative studies of cross-border judicial rulings to refine the collaborative model between technical governance and legal regulation. Third, establishing a data certification mechanism covering the entire generative AI creation process, using blockchain technology to achieve visualized traceability of training data sources. These advances would not only help improve the copyright system in the age of artificial intelligence but also offer a Chinese solution for innovating global digital governance rules.