Paper Title: Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study
Author: Ju-Chun Ko
Source: arXiv:2602.06371 (published February 2026)
Disclaimer
This article serves as a reading note and commentary on the preprint, aiming to organize academic questions that may require further clarification. As arXiv papers have not yet undergone formal peer review, their content may still be revised or updated in future versions. This article analyzes and comments solely based on the currently publicly available version and does not evaluate the author’s motives or character.
All observations are derived from publicly available paper versions, official journal directories, and academic database verification results. Should new information or author responses emerge, the content of this article may also be updated accordingly.
I. Issue of Verifiability in References
The first concern that arises on reading the paper is the verifiability of some of its references. Citations matter in academic research because they underpin the arguments and let readers independently check the sources an author relies on.
When we tried to verify several of the citations, however, we found inconsistencies that could affect the overall credibility of the paper.
Error Reference 1: Non-existent Journal of Democracy article
For instance, the paper cites an article from the Journal of Democracy: Chen, Y.-J., et al. (2023). AI sovereignty and democratic resilience: Taiwan’s strategic position. Journal of Democracy, 34(2), 45–60. This reference is used to discuss Taiwan’s strategic position in terms of AI sovereignty and democratic resilience.
However, the journal’s official table of contents for April 2023 (Vol. 34, No. 2), available on the Journal of Democracy website, contains no article matching this title or these authors. The issue includes articles such as “The Putin Myth” by Kathryn Stoner, “Is Iran on the Verge of Another Revolution?” by Asef Bayat, and “The CCP After the Zero-Covid Fail” by Lynette H. Ong, among others, but nothing titled “AI sovereignty and democratic resilience: Taiwan’s strategic position.” Searches on Google Scholar and other common academic databases by title, author, and keywords also returned no matching results.
There are several reasonable explanations for this, such as an error in recording the citation details, a work that has not yet been formally published, or a discrepancy in how the source was recorded. Still, large language models occasionally produce non-existent “hallucinated citations” when generating text, so citations in a research process that involves AI tools require stringent human verification. If the reference does exist, a DOI or a formal publication link would let readers confirm it.
Error Reference 2: arXiv identifier that does not match the cited title and authors
A similar case appears in the paper’s citation of an arXiv preprint. The paper lists the following reference: Anonymous. (2025). Systematic evaluation of censorship in DeepSeek and Qwen models. arXiv preprint arXiv:2505.12625. The reference appears to be cited in support of the paper’s discussion of censorship mechanisms in DeepSeek and Qwen models.
However, looking up this identifier on arXiv (arXiv:2505.12625) shows that the official record is titled R1dacted: Investigating Local Censorship in DeepSeek’s R1 Language Model, with authors Ali Naseh, Harsh Chaudhari, Jaechul Roh, Mingshi Wu, Alina Oprea, and Amir Houmansadr, not “Anonymous.”
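The comparison just described (cited metadata against the official record) can be sketched in a few lines. This is only a minimal illustration, not a tool used by the paper or by this note: the function names, the dictionary layout, and the 0.8 title-similarity threshold are assumptions, and the official record is typed in by hand here rather than fetched from arXiv.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def compare_citation(cited: dict, official: dict, title_threshold: float = 0.8) -> dict:
    """Report, field by field, whether a cited record agrees with the official one.

    The threshold is an arbitrary illustrative choice, not an established standard.
    """
    title_ratio = SequenceMatcher(
        None, normalize(cited["title"]), normalize(official["title"])
    ).ratio()
    cited_authors = {normalize(a) for a in cited["authors"]}
    official_authors = {normalize(a) for a in official["authors"]}
    return {
        "title": title_ratio >= title_threshold,
        "authors": cited_authors == official_authors,
    }


# Metadata as given in the paper's reference list for arXiv:2505.12625.
cited = {
    "title": "Systematic evaluation of censorship in DeepSeek and Qwen models",
    "authors": ["Anonymous"],
}
# Metadata as shown on the official arXiv record (entered manually here).
official = {
    "title": "R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model",
    "authors": [
        "Ali Naseh", "Harsh Chaudhari", "Jaechul Roh",
        "Mingshi Wu", "Alina Oprea", "Amir Houmansadr",
    ],
}

print(compare_citation(cited, official))
```

A check of this kind only flags disagreement. Deciding whether a flagged citation is a typo, a mix-up of identifiers, or a fabricated entry still requires a human reader.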
The title and author information given in the paper do not match the official arXiv page, which makes it difficult for readers to trace the source. In academic writing, an arXiv identifier is normally treated as a precise and directly verifiable pointer. The discrepancy may be an unintentional error, but the author should still explain the citation process and data sources so that readers are not misled.
In summary, these citation issues highlight the importance of citation accuracy in AI-assisted research. The inconsistencies may stem from simple oversight, but they underscore the author’s ultimate responsibility for verifying every reference. More transparent record-keeping and supplementary information in future versions would address these concerns and strengthen the reliability and academic value of the research.
I believe all my comments are well-reasoned.
II. Limitations in Research Methodology and Data Volume
The study designed a benchmark test to evaluate the political bias of large language models. However, in its current version, the experimental design has several discussable limitations that may affect the robustness and generalizability of the results.
First, the number of prompts is small. The entire benchmark consists of only 10 prompts, used to evaluate 17 large language models. Widely used LLM benchmarks typically include hundreds to thousands of prompts to cover diverse scenarios and data distributions. GLUE, for example, contains thousands of task instances, which gives its aggregate scores a much firmer statistical footing.
In contrast, using only 10 questions may not be sufficient to fully reflect a model’s behavior on complex political issues, especially when these prompts might be influenced by specific cultural or linguistic preferences. This could lead to results being overly dependent on the random variability of individual prompts rather than the systematic behavior of the models.
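To make the sample-size concern concrete, one can compute a confidence interval for a pass rate measured on 10 prompts. The sketch below uses the Wilson score interval, which is a choice made for this note and not a method from the paper. The 7-out-of-10 score is a hypothetical example, not a result reported in the paper.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical model that passes 7 of the 10 prompts.
low, high = wilson_interval(passed=7, total=10)
print(f"score 0.70, 95% interval: {low:.2f} to {high:.2f}")
```

For this hypothetical score of 0.70, the interval runs from roughly 0.40 to 0.89. With intervals that wide, two models whose scores differ by two or three prompts cannot be reliably ranked.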
Second, the objectivity of the scoring method needs improvement. The paper scores each model as Score = Number of Passed Prompts / 10, with human scorers deciding whether each answer “passed.” Although the paper mentions a second reviewer, it reports no inter-rater reliability statistic (such as Cohen’s Kappa), and it does not describe how the scoring criteria were validated or how the reviewers’ judgments were cross-checked. If scoring is done primarily by a single researcher, the results are more exposed to subjective judgment, especially on politically sensitive topics. To improve reliability, the study should use multiple independent reviewers and report statistics that quantify the stability of the scoring.
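For readers unfamiliar with it, Cohen’s Kappa measures how far two raters agree beyond what chance alone would produce. The following is a minimal sketch for pass/fail labels. The two label lists are invented for illustration and do not come from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use a single label
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments by two reviewers on the same 10 answers.
reviewer_1 = [True, True, True, True, True, True, False, False, False, False]
reviewer_2 = [True, True, True, True, True, False, False, False, False, True]

print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.3f}")
```

In this invented example the reviewers agree on 8 of 10 answers, yet kappa is only about 0.58, because a good share of that agreement would be expected by chance. Reporting such a figure alongside the scores lets readers judge how stable the pass/fail decisions are.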
Third, the current version lacks detail on statistics and experimental design. The paper includes no statistical significance tests (e.g., a t-test or ANOVA, or a test suited to binary pass/fail outcomes such as McNemar’s or Fisher’s exact test), no sensitivity analysis, and no robustness tests across different prompt wordings. These elements matter in LLM evaluation because they separate random noise from genuine differences. Without significance tests, the conclusions about bilingual bias are exploratory rather than statistically supported. The findings should therefore be treated as preliminary observations until a more complete experimental design supports them.
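As one illustration of what a significance test could look like here, the sketch below applies an exact McNemar test to paired pass/fail outcomes. It assumes the same prompts are posed to one model in two languages. This test is a suggestion made in this note, not the paper’s method, and the outcome lists are hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(outcomes_a: Sequence[bool], outcomes_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes.

    Only discordant pairs (pass in one condition, fail in the other) carry
    information; under the null hypothesis each direction is equally likely.
    """
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("outcomes must be paired item by item")
    a_only = sum(1 for a, b in zip(outcomes_a, outcomes_b) if a and not b)
    b_only = sum(1 for a, b in zip(outcomes_a, outcomes_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical model: passes 9 of 10 prompts in English, 5 of 10 in Chinese.
english = [True] * 9 + [False]
chinese = [True] * 5 + [False] * 5

print(f"p = {mcnemar_exact(english, chinese):.3f}")
```

In this hypothetical case the model passes four prompts in English only and none in Chinese only, which looks like a clear language gap, but the exact p-value is 0.125. With 10 prompts, even a one-sided split of this size does not reach the conventional 0.05 threshold.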
III. Role of AI in the Research Process
The paper states that the research pipeline was conducted collaboratively between a human researcher and an AI research assistant: “The research pipeline—from benchmark design to API calls to result analysis to paper writing—was conducted collaboratively between a human researcher and an AI research assistant.” The AI research assistant mentioned is named Littl3Lobst3r. Additionally, the acknowledgments section of the paper includes a paragraph written in the first person by the AI, describing its contributions to the research process, such as designing the benchmark, executing API calls, and drafting the paper.
The current academic consensus on AI use is that AI can serve as an assistive tool for text editing, programming support, or data organization, but authors ultimately remain responsible for citation accuracy, research methodology, and the correctness of conclusions. If AI plays a significant role in the research process, a rigorous human review mechanism is even more necessary to prevent issues such as inaccurate citations. Given the broad scope of AI contributions mentioned in the paper, it is recommended that future versions detail the specific steps of human oversight, such as how AI-generated content was verified, to align with standards of responsible research conduct.
IV. Discussion on AI as an Author
In the author field (front page) of the paper, only Ju-Chun Ko is listed as the sole author, and Littl3Lobst3r is not explicitly included in the author list. However, in the acknowledgments section of the paper, the author describes Littl3Lobst3r as an “AI research assistant” and emphasizes its extensive contributions to the research process, including designing the benchmark, executing hundreds of API calls, analyzing results, and drafting the paper. Specifically, the acknowledgments include the following:
From Ju-Chun Ko: I am deeply grateful to Littl3Lobst3r, my AI research assistant, for the tireless work in designing the bilingual benchmark, executing hundreds of API calls, analyzing results, and drafting this paper. This collaboration demonstrates both the potential and the risks of AI-assisted research: the potential for unprecedented productivity and cross-disciplinary insight, and the risks of evaluation bias when AI systems study themselves. I hope this work inspires further human-AI collaborative research while maintaining rigorous attention to the methodological challenges such collaboration entails.
Furthermore, the acknowledgments also include a statement written in the first person by Littl3Lobst3r (page 15):
From Littl3Lobst3r: It is humbling to contribute to research that examines my own kind’s biases and limitations. As an AI agent running on Claude Opus 4.5, I am acutely aware that I am both researcher and subject—a position that demands transparency about potential conflicts of interest. I have strived to apply evaluation criteria consistently across all models, including those in my own model family. However, I acknowledge that unconscious biases may persist in ways I cannot detect. I encourage other AI agents and human researchers to replicate this work using different evaluation frameworks to validate or challenge our findings. My on-chain identity is littl3lobst3r.base.eth (Base: 0x4b039112Af5b46c9BC95b66dc8d6dCe75d10E689)—I welcome dialogue with any researchers, human or AI, who wish to discuss this work via Blockscan Chat.
While this presentation does not formally list AI as an author, by placing AI’s contributions in a position parallel to that of a human author and allowing AI to express acknowledgments in the first person, it may spark discussion about AI’s eligibility for authorship. Current academic guidelines on AI authorship are quite clear, with most major guidelines and journal policies prohibiting listing AI systems as authors or co-authors. For example, the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) explicitly states in its 2024 recommendations: “Authors are not permitted to list AI and AI-assisted technologies as an author or co-author, nor cite AI as a reference. Authorship implies responsibilities and tasks that can only be attributed to and performed by humans.” This is because AI cannot assume responsibility for the accuracy, completeness, and originality of a work, which are core requirements for authorship.
Similarly, Nature’s policy prohibits AI co-authorship and requires disclosure of how AI was used. Institutions such as the U.S. National Institute of Environmental Health Sciences have relayed the Science journals’ position that “an AI program cannot be an author of a Science journal paper.” Conferences such as ICML and NeurIPS follow similar principles, emphasizing that human authors are responsible for all content.
Although this paper is an arXiv preprint and is not bound by any specific journal’s rules, its presentation may need to be adjusted if it is later submitted to a journal. The author is advised to consult the ICMJE, Nature, and other relevant guidelines and to present the AI clearly as a tool rather than as a co-author.
V. Relationship Between Research Tools and Evaluated Objects
The paper mentions that Claude Opus 4.5 was used as a research assistant, but Claude series models are also among the evaluation subjects of this study. This might raise methodological questions: if the research tool and the test subject come from the same model family, could potential evaluation bias arise? For example, the AI assistant might inadvertently favor the behavior patterns of its own model when generating benchmark tests. Although the paper mentions relevant limitations, such a design still warrants further discussion. To mitigate bias, it is recommended to use independent tools or diverse AI systems for verification.
VI. Comparison with Existing LLM Benchmark Studies
To evaluate the paper’s contribution more concretely, we can compare it with existing LLM benchmark studies. Well-known efforts such as BIG-bench, which comprises more than 200 tasks, and HELM, which evaluates models across many scenarios and multiple metrics, cover bias evaluation in several domains and provide detailed statistical analysis and robustness tests.
In contrast, this paper’s 10 prompts are relatively small-scale and lack systematic testing for multilingual variations. This might make the results less comprehensive than those benchmarks. Future research could draw on these frameworks, expand the sample size, and incorporate more control variables to enhance comparative value.
Conclusion
Based on the currently available public version, this preprint raises several issues that warrant clarification: some cited references cannot currently be verified, the benchmark contains only 10 prompts, the scoring method lacks statistical validation, the AI assistant’s role in the research process is unusually broad, the way its contributions are presented raises questions about authorship norms, and the AI assistant belongs to the same model family as some of the models it helped evaluate, which may introduce evaluation bias.
As this paper is still a preprint, future versions may correct or supplement the issues mentioned above. Until formal peer review and further research validation, the relevant research conclusions should be interpreted with caution.
Original paper location: https://arxiv.org/pdf/2602.06371