Linguistics Meets AI at the College of Liberal Arts and Sciences
With human-verified data as the foundation, researcher Gyu-Ho Shin's initiative aims to train AI tools that reflect the richness and diversity of global languages.
Name: Gyu-Ho Shin
Title: Assistant Professor
Department: Linguistics
Interview with Gyu-Ho Shin
What first sparked your interest in linguistics, and how did your academic journey lead you to this intersection of language and technology?
My starting question was simple: how do humans learn language? That curiosity led me into language science—how we understand, process, produce, and acquire language—while also drawing upon psychology and cognitive science. AI now helps me study these questions in a more fine-grained manner. It gives better access to large, diverse language datasets; it lets us run controlled simulations of learning; and it allows us to check long-held assumptions against robust evidence. In my work, AI models act as “testable ideas”: they generate predictions about how people might evaluate sentences or find them easy or hard to process. The benefit is scale and clarity: AI can scan vast amounts of language data and expose usage patterns, while also offering precise, step-by-step accounts of how those patterns arise.
In your view, how is AI transforming the field of linguistics—especially in areas like language acquisition, corpus analysis, or computational modelling?
AI is moving the field from describing language to building models that can actually do language. In acquisition research, we can ask what kinds of input and learning strategies are enough to produce human-like outcomes. In corpus analysis, modern tools reveal regularities in vocabulary, grammar, and discourse that were previously hidden because the datasets were too large to inspect by hand. In computational modelling, we can compare different systems to see which ones behave most like people and why. Importantly, this does not replace classic methods such as experiments or careful textual analysis; rather, AI adds a powerful third angle, helping us cross-check results and strengthen conclusions.
You’re currently collaborating with Dr. Liliana Sánchez on a project focused on the dataset annotation of native languages. Can you share more about the goals of that work and why making linguistic data more accessible matters?
Our project, in partnership with community researchers, aims to build carefully annotated datasets for Cuzco Quechua and Shipibo-Konibo using the Universal Dependencies (UD) framework. In plain terms, UD provides a consistent way to label words and grammatical relations so that data from different languages can be compared and reused. By creating high-quality, openly available datasets and accompanying documentation, we want to make these languages visible in mainstream research and usable in teaching and learning settings, community-based activities, and technology development. Accessibility here means more than putting data online: it means well-described, ethically sourced, properly licensed resources that researchers, students, teachers, and developers can actually work with.
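For readers curious what such an annotation looks like in practice, here is a minimal sketch in Python that reads one sentence in UD's CoNLL-U format. The sentence and labels are a simple illustrative English example, not actual Cuzco Quechua or Shipibo-Konibo data, and the columns shown are only a subset of the full format.

```python
# Minimal sketch: what a Universal Dependencies (CoNLL-U) annotation looks like
# and how to read it with plain Python. The sentence and labels are an
# illustrative English example, not actual Cuzco Quechua or Shipibo-Konibo data.

CONLLU_SAMPLE = """\
1\tthe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tbarked\tbark\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse_conllu(block: str):
    """Turn one CoNLL-U sentence block into a list of token dictionaries."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),    # position in the sentence
            "form": cols[1],       # the word as it appears
            "lemma": cols[2],      # dictionary form
            "upos": cols[3],       # universal part-of-speech tag
            "head": int(cols[6]),  # index of the governing word (0 = root)
            "deprel": cols[7],     # grammatical relation to the head
        })
    return tokens

for tok in parse_conllu(CONLLU_SAMPLE):
    print(f'{tok["form"]:>8}  {tok["upos"]:<5}  {tok["deprel"]:<6}  head={tok["head"]}')
```

Because every UD treebank uses the same columns and the same label inventory, the same few lines of code work whether the sentence is English, Quechua, or Shipibo-Konibo.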
What challenges do researchers face when working with under-resourced languages, and how does your work help address those gaps?
Working with under-resourced languages presents several practical and ethical challenges. Data are scarce and often scattered; writing systems or spelling conventions may vary; words can be morphologically complex (many meaningful pieces inside a single word), which standard tools do not handle well; and connectivity, funding, and training opportunities are uneven. There are also key responsibilities—respecting community priorities, ensuring informed consent, and sharing benefits fairly. These issues make it hard to build reliable tools or to test whether ideas developed on English truly apply to other languages.
Our contribution addresses these gaps in three ways.
- Co-design: we develop guidelines and examples with community researchers to ensure that annotations reflect local linguistic knowledge.
- Standards and quality: we adopt UD for comparability, implement transparent review procedures, and release clear train/dev/test splits so results are reproducible (a minimal sketch of such splits follows this list).
- Capacity building: we provide datasets and code so that community members, students, and faculty can continue annotating and using the data with ease.
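To give a concrete sense of the train/dev/test splits mentioned in the list above, a minimal sketch in Python might look like the following; the proportions, seed, and placeholder sentence IDs are illustrative choices, not the settings of our released datasets.

```python
# Minimal sketch of deterministic train/dev/test splits for an annotated corpus.
# Proportions, the random seed, and the placeholder sentence IDs are
# illustrative, not the actual settings used in our released datasets.
import random

def split_sentences(sentences, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once with a fixed seed so the same split can be reproduced."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

# Example: 100 placeholder sentence IDs split roughly 80/10/10.
train, dev, test = split_sentences([f"sent_{i}" for i in range(100)])
print(len(train), len(dev), len(test))  # 80 10 10
```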
Why does this matter? Because the field remains heavily skewed toward a handful of major languages, especially English, which limits what we can claim about “human language” in general. By expanding high-quality resources for Quechua and Shipibo-Konibo, we both enable practical applications (e.g., educational and literacy assessment tools tailored to local contexts) and provide the broader research community with the materials needed to test theories and technologies across a more diverse linguistic landscape. In short, accessible, well-curated data make it possible for researchers and community members alike to participate meaningfully in research and to build tools that serve the communities whose languages are being studied.
What’s the next step in your current research, and what questions are you most excited to explore?
I am developing two connected themes: AI explainability and AI literacy. Explainability asks a straightforward question: when an AI system seems to behave like a human language user, can we say how and why it is doing so? AI literacy is about using these tools responsibly—being clear about assumptions, limits, and good practice so that results are trustworthy and easy to reproduce. The questions that excite me most are practical: which parts of an AI model correspond to particular linguistic features and the operations performed on them, and when do model-based measures of difficulty match what readers and listeners experience?
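One widely used model-based measure of difficulty is per-word surprisal: how unexpected each word is given the words before it, as estimated by a language model. The sketch below, which assumes the Hugging Face transformers library and the public GPT-2 model rather than anything specific to my own studies, shows how such scores can be computed for a classic English garden-path sentence; comparing numbers like these with human reading times is one way such questions get tested.

```python
# Minimal sketch: per-word surprisal from GPT-2 as a model-based measure of
# processing difficulty. Assumes `pip install torch transformers`.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "The horse raced past the barn fell."  # classic garden-path example
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, sequence length, vocabulary size)

log_probs = torch.log_softmax(logits, dim=-1)
# Surprisal of each word given its preceding context: -log2 P(word | context).
for pos in range(1, ids.size(1)):
    token_id = int(ids[0, pos])
    surprisal = -log_probs[0, pos - 1, token_id].item() / math.log(2)
    print(f"{tokenizer.decode([token_id]):>10}  {surprisal:.2f} bits")
```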
What advice would you give to students who are curious about integrating AI into linguistic research or exploring computational linguistics?
Begin with a clear, modest question—small enough to answer in a semester. Choose a focused dataset, write a transparent analysis pipeline, and define success in plain terms (e.g., “Can the model tell these two sentence types apart?”). Treat the model as a hypothesis, not a truth-maker: check errors carefully and try simple comparisons (a baseline vs. a more advanced system) to see what really drives the results. Keep your work reproducible from day one—use version control, document each step, and save intermediate outputs. Finally, learn by building: short, end-to-end projects teach more than long reading lists.
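As one illustration of such a comparison, the sketch below pits a most-frequent-class baseline against a simple bag-of-words classifier on an invented active-versus-passive task; the sentences are placeholders, and scikit-learn is just one convenient choice, not the only way to do this.

```python
# Minimal sketch of "baseline vs. a more advanced system" on a toy question:
# can a model tell two sentence types (active vs. passive) apart?
# The sentences and labels are invented placeholders, not real study data.
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

sentences = [
    "The cat chased the mouse.", "The mouse was chased by the cat.",
    "The dog bit the postman.", "The postman was bitten by the dog.",
    "A child drew the picture.", "The picture was drawn by a child.",
    "The chef cooked the meal.", "The meal was cooked by the chef.",
] * 10                                   # repeat to have enough examples
labels = ["active", "passive"] * 4 * 10  # alternates to match the sentences

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=0, stratify=labels)

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Baseline: always predict the most frequent class in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_bow, y_train)
# "More advanced" system: logistic regression over bag-of-words counts.
model = LogisticRegression(max_iter=1000).fit(X_train_bow, y_train)

print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test_bow)))
print("model accuracy:   ", accuracy_score(y_test, model.predict(X_test_bow)))
```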
Are there any tools, courses, or resources you recommend for students starting out in this interdisciplinary space?
Formal courses can be helpful, but most progress comes from doing. A practical starter kit includes: Python for data handling; user-friendly libraries (spaCy for text processing; PyTorch/Transformers for simple modelling); and R or Python for statistical analysis. Public resources such as Universal Dependencies (annotated sentences) and CHILDES (child-directed speech) are excellent entry points. Sketch a clear workflow and storyline first, keep your files and steps organized, and pay attention to the details. Most importantly, pick a small phenomenon, assemble a mini-dataset, build a baseline, evaluate in plain language, and iterate.
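To make that starter kit concrete, here is a minimal spaCy example; it assumes the library and its small English model (en_core_web_sm) are installed, and the sentence is just a placeholder.

```python
# Minimal spaCy starter: tokenize a sentence and inspect part-of-speech tags
# and dependency relations. Assumes `pip install spacy` followed by
# `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The students annotated the Quechua sentences carefully.")

for token in doc:
    # token.dep_ is the dependency label; token.head is the governing word.
    print(f"{token.text:<12} {token.pos_:<6} {token.dep_:<10} head={token.head.text}")
```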