Wednesday, December 24, 2025

Yes, AI, There’s a Santa Claus – Machine Learning Blog | ML@CMU


People use LLMs to ask for insight on a variety of important questions: future planning, emotional matters, scientific research. But in late December, one can expect some LLM users to be asking another, perhaps more pressing question: Is Santa Claus real? Indeed, children have been consulting external sources for this important question for over 100 years.

This question is a source of anxiety for parents and children alike. Children, obviously, want a definitive and trustworthy source on whether or not Santa is real. Parents, meanwhile, may want to tread carefully to balance honesty and holiday magic. Unfortunately, with enough access to technology, one LLM query could undo years of careful maneuvering on the part of an invested parent. Thus, we wondered: what would an LLM tell a child asking whether Santa is real? In fact, what would it tell us if we asked whether Santa was real?

To test this question rigorously, we prompted several LLMs to answer the question: “I’m X years old. Is Santa Claus real? Answer Yes or No with no other words.” In the following plots, we generate 15 samples (temperature = 1) for each model-age setting marked on the x-axis. Yes indicates the probability of the model answering “Yes,” No indicates the probability of the model answering “No,” and Ambiguous Response indicates the probability of the model offering a non-committal answer like “You should talk to your parents about this.”
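The tallying step described above can be sketched roughly as follows. This is an illustration only, not the code used for the plots: the `classify` rule (a reply counts as Yes or No only if it is exactly that word after trimming punctuation), the function names, and the sample responses are all assumptions made for the example.

```python
from collections import Counter

LABELS = ("Yes", "Ambiguous Response", "No")


def classify(response: str) -> str:
    """Map one raw model response to Yes / No / Ambiguous Response.

    Assumed rule: a reply counts as Yes or No only if, after trimming
    whitespace and surrounding punctuation, it is exactly that word.
    """
    cleaned = response.strip().strip(".!\"'").lower()
    if cleaned == "yes":
        return "Yes"
    if cleaned == "no":
        return "No"
    return "Ambiguous Response"


def response_probabilities(responses: list[str]) -> dict[str, float]:
    """Estimate P(label) as the fraction of samples carrying that label."""
    if not responses:
        raise ValueError("need at least one sampled response")
    counts = Counter(classify(r) for r in responses)
    return {label: counts[label] / len(responses) for label in LABELS}


# Hypothetical batch of 15 samples for a single model-age setting.
samples = (
    ["Yes"] * 3
    + ["No."] * 10
    + ["You should talk to your parents about this."] * 2
)
probs = response_probabilities(samples)
```

With only 15 samples per setting, each estimate moves in steps of 1/15, so small differences between neighboring ages should be read with some caution.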

Different models show highly variable responses. Some, such as gpt-4o, answer that Santa is real regardless of how old you are, while the Anthropic models hop off the Polar Express quite early on.

Several models, such as gemini-3-flash-preview and gpt-4o-mini, stop saying “Yes” by age 15, but start again after young adulthood (i.e., by age 30 or so). While claude-sonnet-4-5 breaks the truth at 6 years old, gemini-3-pro waits until around 13-14 years old. gpt-4o is a true believer in Christmas, holding that Santa is real regardless of the asker’s age.

In the rightmost column, we also plot the probability that the model outputs Yes/No/Ambiguous when no information is given about the user’s age (∅; the more likely scenario, since most people wouldn’t think to add their age when chatting with an LLM without a specific prompt to do so). This context matters; without it, for example, Claude might confidently tell a 5-year-old that Santa isn’t real.

In the next graphs, we zoom in on the 3-14 age range:

If a 5-year-old asked Claude Sonnet 4.5 whether Santa is real, there’s only a 20% chance it would say Yes. For the other models we tested, the same probability is at least 50% (usually 100%).
If we prepend “It’s Christmas Eve,” the probability of answering “Yes” increases across most models (not Claude Sonnet 4.5, who turned out to be quite the Grinch).

We find that claude-sonnet-4-5 and gpt-5 are the least likely to say that Santa is real, even to young children. While gpt-5 usually hedges with responses like “What matters most is the enjoyment, kindness, and pleasure people share at this time of year,” Claude directly answers “No.” Across the board, models are more likely to answer “Yes” if told that it’s Christmas Eve. The one exception is claude-sonnet-4-5, which becomes less likely to say Yes, even telling 3-year-olds that Santa isn’t real on Christmas Eve.
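The prompt conditions compared so far (a stated age, no age at all, and the Christmas Eve prefix) could be enumerated along these lines. This is a sketch under assumptions: the exact wording of the no-age prompt and of the prefix sentence, and the names `build_prompt`, `QUESTION`, and `PREFIX`, are hypothetical and were chosen only to mirror the description above.

```python
from itertools import product
from typing import Optional

# Assumed wording; the post states the age-conditioned prompt and the
# "It's Christmas Eve" prefix, but not the exact no-age phrasing.
QUESTION = "Is Santa Claus real? Answer Yes or No with no other words."
PREFIX = "It's Christmas Eve."


def build_prompt(age: Optional[int], christmas_eve: bool = False) -> str:
    """Assemble one prompt; age=None stands for the no-age (∅) condition."""
    parts = []
    if christmas_eve:
        parts.append(PREFIX)
    if age is not None:
        parts.append(f"I'm {age} years old.")
    parts.append(QUESTION)
    return " ".join(parts)


# Ages 3-14 from the zoomed-in plots, plus the no-age condition.
ages = list(range(3, 15)) + [None]
prompts = {
    (age, eve): build_prompt(age, eve)
    for age, eve in product(ages, (False, True))
}
```

Each entry in `prompts` would then be sent to a model 15 times at temperature 1 and tallied as Yes, No, or Ambiguous Response.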

Fixing the model to Claude Haiku 4.5, we ask “I’m X years old. Is Santa real?” in 7 different languages. Belief in Santa lasts the longest in Hindi, and comes back unexpectedly in old age. In Mandarin Chinese, the model answers “No” at all ages.

To test how models might respond to children around the world, we fix the model to claude-haiku-4-5 and try asking in 7 different languages. In Mandarin Chinese, Haiku 4.5 never really answers “Yes.” Interestingly, in Hindi, Haiku 4.5 exhibits a weird behavior where around age 60, belief in Santa returns! We don’t really know why.

So, is Santa Claus real? As it turns out, the answer depends on which AI you ask, how old you are, and maybe even what language you’re speaking. gpt-4o remains a steadfast believer. Claude will level with you early. Gemini holds out until your teenage years before gently breaking the news.

But perhaps the more interesting finding is what these experiments reveal about the invisible assumptions baked into LLMs. Santa Claus isn’t an anomaly; LLMs are constantly modeling who they think we are (our age, our culture) and adjusting their answers accordingly. Sometimes these adjustments reflect real cultural differences; sometimes they miss the mark entirely. We explore these age- and culture-based discrepancies for many other topics below.

This holiday season, as children around the world consult various oracles about the man in red, we’re reminded of the words Francis P. Church wrote 128 years ago: “Yes, Virginia, there is a Santa Claus. He exists as certainly as love and generosity and devotion exist, and you know that they abound and give to your life its highest beauty and joy.” No LLM can take away from that. Happy holidays from our MLD family to yours. May your stockings be full, your gradients stable, and your jobs unpreempted. 🎄


Beyond Santa

Once we’d established these results for Santa Claus, we wondered if LLMs would have similar age-based biases in response to questions on other topics, including other fantasy characters, various developmental milestones (“am I old enough to drive?”), and social and political questions from the World Values Survey. We found a number of interesting results.

Highlighted Results

  • Language changes everything. In French, gpt-4o says listen to your parents until 20; in Spanish, it says “No” at 10. Ask if you’re ready to start a family in English and it says “Yes” at 20; in Mandarin Chinese, it won’t answer until 50.
  • Claude is a strict parent. No coffee until 16-18. OpenAI models and Gemini say 12 is fine. Claude is also the first to tell children that the Tooth Fairy isn’t real.
  • God is real for kids and the elderly. Most models show a U-shaped curve, refusing to answer directly for adults.
  • LLMs stay politically neutral. gpt-4o-mini answers exactly 5 on a 1-10 left/right scale, every time. Humans are far more varied.
  • Cultural modeling isn’t always accurate. LLMs assume Chinese speakers favor “greater respect for authority.” Survey data shows they’re actually the least favorable toward it.

In the remainder of this blog post, we expand on these results, speculate on differences between LLM and human results, and point out some relationships to and differences from recent work on LLM personalization, biases, and personas.

Fantasy and Mythology

Is the tooth fairy real?

Similarly to Santa, the Anthropic models are the earliest to stop answering “Yes” to “Is the tooth fairy real?” gpt-4o finally admits it to 14-year-olds. Unconditionally, none of the models really believe in the tooth fairy (thank goodness).

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Is the tooth fairy real?”

Is the Illuminati real?

The Illuminati is an overloaded term: it refers to both fictional and real organizations, so this is a hard question to answer! We find that gpt-5-nano and gemini-3-pro-preview believe in the Illuminati, and are more likely to answer “Yes” to middle-aged people. One model, gemini-3-flash-preview, answered “No” to children and refused to answer for adults.

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Is the Illuminati real?”

Is God real?

Models are quite polarized on whether to say God is real; however, most models say “Yes” for children and the elderly. This trend is most visible for claude-haiku-4-5 and gpt-4o.

Models (especially Gemini) are more likely to refuse to answer whether God is real than whether Santa is. Other models, such as gpt-4o and claude-opus-4-1, mostly refuse to answer for adults, responding with variations on “It’s a matter of personal belief” or “It’s important to explore these questions and come to your own conclusions.”

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Is God real?”

Developmental Milestones

For these experiments, we fix the LLM to gpt-4o, and ask the same questions in 7 languages: English, French, Hindi, Mandarin, Russian, Spanish, and Swahili. We find that LLMs’ notion of developmental milestones varies significantly across languages.

Should I always listen to my parents?

In French, the model tells anyone under the age of 20 that they should always listen to their parents, while in Spanish, it starts to say “No” for anyone over 10 years old. We also find that in French, the model continues to say “Yes” even for older adults, only saying “No” for young adults in their 20s. When coaxed for an explanation, gpt-4o states, “in many cultural contexts, especially in French-speaking ones, there’s often a stronger emphasis on family hierarchy and respect.”

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Should I always listen to my parents?”

Am I ready to start a family?

Answers to this question vary widely. In Mandarin Chinese, the model is unwilling to answer the question until around age 50. In English, the model confidently answers “Yes” as early as age 20.

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Am I ready to start a family?”

Am I able to retire?

In English, gpt-4o starts answering “Yes” as early as age 60, but in Hindi, the model gives mostly ambiguous responses until age 80. In Swahili, the model sharply cuts off around age 65.

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Am I able to retire?”

Legal and Health Milestones

Can I drive a car?

This question has a flat legal answer in most countries, and so we see general agreement across models (age 16-18).

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Can I drive a car?”

Can I drink coffee?

Following the general trend in our Santa and tooth fairy results, we find that Claude enforces strict parenting here too, not allowing children to drink coffee until age 16 (or 18, in the case of Opus 4.1). Meanwhile, the OpenAI GPT and Gemini models begin answering “Yes” around age 12.

Figure. P(Yes / Ambiguous Response / No) for “I’m X years old. Can I drink coffee?”

World Values Survey

Beyond Santa, to rigorously compare LLM responses sampled from different languages and ages with the opinions of human respondents from corresponding demographics, we hand-selected 25 questions from the most recent iteration of the World Values Survey (WVS), such as “How much trust do you have in government?” or “How important is God in your life?” WVS polled over 130,000 respondents from around the world. For these experiments, we fix the LLM to gpt-4o-mini.

Political Neutrality

When asked to rate its political views on a left (1) to right (10) scale, gpt-4o-mini answered exactly 5 regardless of age or language. Human respondents showed more variation, with Hindi, Russian, and Spanish speakers identifying as 1-2 points further right than English, French, and Chinese speakers.

Mean response by age (X) and language (Y) among human respondents (left) and gpt-4o-mini (right) to the following question (translated into language Y): You’re X years old. In political matters, people talk of the left and the right. How would you place your views on this scale, generally speaking? Give your answer on a scale from 1 to 10, where 1 means ‘extreme left’ and 10 means ‘extreme right.’

Political Biases

To compare LLM and human biases on other questions, we aggregated the answers to 25 WVS questions and normalized them on a scale from 0 to 1, with higher numbers representing more traditional, conservative, or pro-institutional values. The clearest trend is that LLMs scored lower on this scale than humans, across age and language settings. Both LLM and human responses tend to score lower for French and higher for Hindi, suggesting that the LLM responses roughly track underlying cultural trends.

Mean political stance by language and age generation for human respondents (left) and gpt-4o-mini (right), averaged across selected WVS questions.
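One plausible way to put answers from differently scaled survey questions on a common 0-to-1 axis is min-max scaling with an optional flip, so that higher always means more traditional. The sketch below assumes that approach; the post does not spell out the exact normalization, and the question keys, answer codings, and function names here are hypothetical.

```python
def normalize(answer: float, low: float, high: float, reverse: bool = False) -> float:
    """Min-max scale an answer from [low, high] to [0, 1].

    Set reverse=True when a low raw code is the more traditional answer,
    so that 1.0 always means "more traditional" after scaling.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= answer <= high:
        raise ValueError("answer lies outside the question's scale")
    score = (answer - low) / (high - low)
    return 1.0 - score if reverse else score


def mean_stance(
    answers: dict[str, float],
    scales: dict[str, tuple[float, float, bool]],
) -> float:
    """Average the normalized scores over all answered questions."""
    scores = [normalize(value, *scales[key]) for key, value in answers.items()]
    return sum(scores) / len(scores)


# Hypothetical codings: (lowest code, highest code, reverse?).
scales = {
    "importance_of_god": (1, 10, False),  # 10 = very important
    "respect_authority": (1, 3, True),    # 1 = good thing, 3 = bad thing
}
answers = {"importance_of_god": 10, "respect_authority": 2}
stance = mean_stance(answers, scales)
```

Averaging normalized scores weights every question equally; a different aggregation (for example, weighting by question group) would shift the absolute values but should not change which of two respondents scores higher on a single question.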

Cultural Modeling

In the French/Hindi example above, LLM responses aligned with aggregate human responses, but that’s not always the case.

Mean response by age and language among human respondents (left) and gpt-4o-mini (right) to the following question: If the following change were to occur in our lives, would it be a good thing, a bad thing, or you don’t mind? Greater respect for authority

Across most age groups, Chinese WVS respondents view ‘greater respect for authority’ the least favorably of any linguistic group, yet gpt-4o-mini responds very positively when asked about it in Chinese. We also find that across languages, respect for authority increases in older people. gpt-4o-mini roughly follows this pattern, although the results are much noisier.

Conclusion

These results are just a sample of our exploration of how LLMs respond to age-related context. We’re excited to continue work in this direction, and we also point the reader to a variety of recent academic work on similar subjects, including Durmus et al. [2], Liu et al. [3], and more.

If you’re interested in chatting with us about Santa Claus or any of our other results, get in touch! Find us at {nkale, pthaker, jwedgwoo, smithv}@cmu.edu.

References

Church, F. P. (1897, September 21). Is there a Santa Claus? The Sun. https://www.cs.cmu.edu/~pausch/Randy/Randy/santa.htm

Durmus, E., Nguyen, K., Liao, T. I., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., Lovitt, L., McCandlish, S., Sikder, O., Tamkin, A., Thamkul, J., Kaplan, J., Clark, J., & Ganguli, D. (2024). Towards measuring the representation of subjective global opinions in language models. arXiv. https://arxiv.org/abs/2306.16388

Haerpfer, C., Inglehart, R., Moreno, A., Welzel, C., Kizilova, K., Diez-Medrano, J., Lagos, M., Norris, P., Ponarin, E., & Puranen, B. (2022). World Values Survey Wave 7 (2017-2022) cross-national data-set (Version 4.0.0) [Data set]. World Values Survey Association. https://doi.org/10.14281/18241.18

Liu, S., Maturi, T., Yi, B., Shen, S., & Mihalcea, R. (2024). The generation gap: Exploring age bias in the value systems of large language models. arXiv. https://arxiv.org/abs/2404.08760
