One-paragraph summaries of every empirical study or survey cited in the deck, in order of first appearance. Each explains the design, the result, and how strong the evidence is — for your own reference while presenting and in Q&A. (Slide numbers verified against the 24-slide deck.)
(The classical/scriptural sources — Sūrat al-ʿAlaq, al-Shahīd al-Thānī’s Munyat al-Murīd, al-Ṭabāṭabāʾī’s al-Mīzān, the ḥadīth on knowledge and intellect, and Muṭahharī — are cited on the slides themselves and are not “studies,” so they are not listed here.)
1. Pew Research Center (2025) — “How Teens Use and View AI.” A nationally representative survey of 1,458 US teens aged 13–17 (probability-based, via Ipsos KnowledgePanel; margin of error ±3.3 points), fielded late 2025. It found AI is now mainstream among teenagers: 95% have heard of AI chatbots, 64% use them, and 54% use them for schoolwork — while about 1 in 10 (10%) use AI for all or most of their schoolwork. Two findings anchor the talk: 59% of teens say AI-enabled cheating happens regularly at their school, and when teens worry about AI, their single biggest worry is over-reliance and the loss of their own critical thinking. Gold-standard survey; US-only and secondary-age (no primary-school-age data).
2. Stanford / Challenge Success — Pope & Lee (2024–2026) — the cheating studies. A set of pre/post and follow-up studies of self-reported cheating in US high schools (the most recent, 2026, surveyed 4,354 students across six schools ~1.5 years after ChatGPT). They found that overall self-reported cheating did not rise after ChatGPT — it stayed flat at roughly 60–70%, the same as it had been for over a decade, and even dipped slightly. AI changed how students cheat, not how much. (A separate 2025 survey found 59% of school administrators believe cheating has risen — a perception-vs-reality gap.) Caveats: self-reported; US-only; the partner schools are self-selected and often high-performing, so not nationally representative.
3. Newton (2018) — “How Common Is Commercial Contract Cheating in Higher Education?” (Frontiers in Education). A systematic review pooling 71 samples from 65 studies — 54,514 participants — going back to 1978. Across the whole period, an average of 3.52% of students admitted paying someone else to do their academic work, but in samples from 2014 onward 15.7% admitted it — potentially ~31 million students worldwide (15.7% of the ~200 million in higher education) — and the rate was rising and likely under-reported. Establishes that a large contract-cheating industry existed before AI. (Higher-ed, English-language; the essay industry was estimated at >$100M revenue, Owings & Nelson 2014.)
4. Lancaster (2019) — “Profiling the international academic ghost writers…” A study of the supply side of contract cheating, profiling the people who write essays and assignments for hire. The figure we cite: roughly 20,000 people in Kenya alone working as academic ghost-writers for the contract-cheating industry — i.e., the infrastructure of paid cheating was already an organised global business pre-AI. (Paywalled; cited from the headline finding.)
5. Glass & Kang (2020) — “Fewer students are benefiting from doing their homework: an eleven-year study” (Educational Psychology). Tracked 2,433 students across 12 college courses over 11 years (2008–2018), comparing homework performance with later exam performance. The share of students who got no learning benefit from correctly answering their homework rose from 14% in 2008 to 55% in 2017 — because smartphones and search made copying answers (which produces no retention) replace generating them (which roughly doubles retention — the “generation effect”). Shows the internet had already hollowed out homework as a measure of learning, before AI. Caveat: one instructor’s college courses, online quizzes.
6. Bastani et al. (2025) — PNAS — the keystone study (also used on slides 17, 19 & 23). A pre-registered randomised controlled trial of ~1,000 high-school maths students (grades 9–11) in Turkey. During practice with AI, scores rose sharply — but on a later closed-book exam taken without AI, the students who had practised with an unrestricted ChatGPT scored 17% worse than peers who never had AI: the “crutch” effect (they copied answers and skipped the productive struggle). Crucially, a scaffolded “AI tutor” that gave hints instead of answers eliminated the harm (it matched the no-AI control). This is the single best evidence that how a tool is designed — scaffold vs. answer-engine — decides whether AI helps or harms learning. Strong (RCT); caveat: one subject, one school, immediate post-test.
7. Dratsch et al. (2023) — Radiology — automation bias. An experiment testing how experts respond to wrong AI advice. When an AI offered experienced radiologists a confident but incorrect suggestion, their diagnostic accuracy collapsed from 82.3% to 45.5%. The lesson for the talk: even highly trained experts defer to a confident, fluent, wrong machine — so students, with less expertise, are even more exposed. Peer-reviewed.
8. Moon, Green & Kushlev (2025) — “Homogenizing effect of LLMs on creative diversity” (Computers in Human Behavior: Artificial Humans). Three pre-registered studies analysing 2,200 college-admissions essays (human-written vs GPT-4). LLMs can match or even boost an individual’s creativity, but at the collective level they homogenise: each human-written essay added about twice the diversity of ideas to the shared pool that each AI essay did (base GPT-4’s “diversity growth rate” was only ~31% of humans’, and ~11% in the largest study). The effect persisted through every attempt to make the AI more creative, suggesting it is inherent to current models. Caveat: measures novel-idea (divergent) diversity in one domain — it is a collective effect, not “AI makes any one person less creative.”
9. Bloom (1984) — “The 2 Sigma Problem” (Educational Researcher). The classic finding that students taught one-to-one with mastery learning performed about two standard deviations better than students in ordinary group instruction — the average tutored student outperformed roughly 98% of the conventional class. The “problem” is economic: one-to-one tutoring is the most effective teaching ever measured, but unaffordable to provide for every child. Honest caveat to hold: the literal “2-sigma” is rarely replicated; real tutoring effects are usually smaller (~0.4–0.8 SD) — so frame AI as putting the dream within reach, not as delivering 2σ.
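For Q&A on entry 9, the link between an effect size in standard deviations and "the average tutored student outperformed roughly 98% of the class" can be checked with a few lines of Python. This is a minimal sketch, assuming both groups' scores are normally distributed with equal spread; the function name is illustrative and not taken from any of the cited papers.

```python
from statistics import NormalDist


def percentile_of_average_treated_student(effect_size_sd: float) -> float:
    """Share of the comparison group scoring below the average treated student.

    Assumption: both groups are normally distributed with the same spread,
    so an effect of d standard deviations maps to the normal CDF at d.
    """
    return NormalDist().cdf(effect_size_sd)


# Bloom's headline figure, plus the smaller range quoted in the caveat.
for label, d in [("2.0 SD (Bloom)", 2.0), ("0.4 SD", 0.4), ("0.8 SD", 0.8)]:
    share = percentile_of_average_treated_student(d)
    print(f"{label}: average tutored student beats about {share:.0%} of the class")
```

Under that normality assumption, 2 SD corresponds to about 98%, while the more typical 0.4–0.8 SD corresponds to roughly 66–79%, which is why the caveat matters when framing the claim.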
10. Kestin et al. (2025) — Scientific Reports — the Harvard AI tutor. A crossover randomised trial with 194 Harvard undergraduates. A carefully scaffolded AI tutor produced learning gains of roughly 0.7–1.3 standard deviations over in-class active learning (a strong baseline, itself better than lecturing) — and in less time. The strongest evidence that a well-designed AI tutor can rival or beat excellent teaching. Key qualifier: it was deliberately engineered with pedagogical guardrails, not a vanilla chatbot.
11. World Bank Nigeria RCT (2024) — PRWP 11125. A randomised trial of a six-week after-school generative-AI tutoring programme in Benin City, Nigeria. Students gained about 0.31 standard deviations — a large effect for an education intervention — and the girls, who had started behind, gained the most. Honest caveat: the widely-quoted “1.5–2 years of learning” is a translation of that 0.31 SD into “equivalent years of schooling,” not measured growth over two years — say “a striking gain for six weeks.”
12. Hembree & Dessart (1986) — JRME — the calculator meta-analysis. A meta-analysis of 79 research studies on calculator use in mathematics. It found that using calculators alongside normal instruction improved students’ pencil-and-paper basic skills at every grade — except Grade 4, where sustained use appeared to hinder the development of those skills. The transferable lesson for AI: the tool helps when the fundamentals are built first. (Companion guidance: NCTM’s 2015 position that calculators must be “selective and strategic” and “do not supplant” mental/written proficiency; Ellington’s (2003) meta-analysis of 54 studies found benefit when calculator use was deliberate and integrated, and neutral — not harmful — otherwise.)
13. Bjork & Bjork (2011) — “desirable difficulties” — the principle behind the redesign. The deepest pedagogical principle in the talk. Decades of cognitive-science research (Robert and Elizabeth Bjork) show that durable learning requires effortful struggle: the conditions that make practice feel easy and fluent often produce no lasting learning at all, while slower, harder, more effortful practice — desirable difficulties — builds retention that endures. The struggle is not the obstacle to learning; it is the learning. This is the science beneath the furnace, the common cause of all three risks (AI removes the struggle), and the common cure behind every change (build it back in). Among the most replicated findings in learning science — and caught on camera for AI by Bastani et al. (see #6: the chatbot that kept students in the struggle did no harm; the one that did the work for them left them 17% worse off).
14. EEF / NFER (2024) — the ChatGPT lesson-prep randomised trial. A gold-standard UK evaluator trial (Education Endowment Foundation / NFER “Teacher Choices”): 259 Key-Stage-3 science teachers across 68 secondary schools in England, over 10 weeks. Teachers who used ChatGPT (with a usage guide) for lesson preparation spent 56.2 minutes a week versus 81.5 for the control — a 31% saving (~25 minutes/week) — with no difference in resource quality judged by a blind expert panel. The best causal evidence that AI saves teachers real time without lowering quality. Caveats: time was self-reported; the no-quality-difference finding rested on a small sample, so EEF calls it “promising, with caution.”
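If asked how the 31% figure in entry 14 is derived, the arithmetic can be reproduced directly from the two weekly averages quoted above. This is a quick check, not the evaluators' own analysis; the variable names are illustrative.

```python
# Weekly lesson-preparation time in minutes, as quoted in entry 14.
control_minutes = 81.5   # teachers preparing without ChatGPT
chatgpt_minutes = 56.2   # teachers preparing with ChatGPT and the usage guide

saved_minutes = control_minutes - chatgpt_minutes
saving_share = saved_minutes / control_minutes  # saving relative to the control group

print(f"{saved_minutes:.1f} minutes saved per week ({saving_share:.0%} of control time)")
```

The saving is expressed relative to the control group's time, which is the natural baseline for "how much less time did ChatGPT users spend".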
15. Gallup / Walton Family Foundation (2025) — “Teaching for Tomorrow.” A nationally representative survey of 2,232 US public K-12 teachers (via the RAND American Teacher Panel; ±2.5 points), 2025. 60% use AI for work and 32% at least weekly; weekly users self-estimate saving ~5.9 hours a week — about six weeks a year. The most common uses are instructional prep: preparing to teach (37%), making worksheets (33%), and adapting materials for students’ needs (28%). Caveats: time saved is self-estimated, not measured; funded by a pro-edtech foundation (though Gallup ran a sound probability survey); US-only.
16. UNESCO (2024) & Digital Promise (2024) — the AI-literacy frameworks. Two authoritative frameworks defining what AI literacy should contain. UNESCO’s AI Competency Framework for Students sets out 12 competencies across four dimensions, staged Understand → Apply → Create. Digital Promise’s framework uses three modes — Understand → Evaluate → Use — with “Understand” (what AI does and how it works) as the foundation. Important caveat: both are conceptual/expert guidance — there is no outcome evidence that these specific ladders improve learning. Cite them as structure, not proof.
17. Clerc et al. (2026) — the French AI-literacy intervention. A study of 116 French middle-schoolers (grades 8–9) given a single two-hour lesson on how AI works (plus simple heuristics like checking plausibility and asking follow-ups). Afterwards they were notably more critical users: they accepted under-specified prompts less often (52% vs 67%) and, when they did use the AI, asked follow-up questions far more often (59% vs 28%). An encouraging signal that AI literacy can be taught quickly. Caveats: a preprint, single school, non-randomised groups, marginal statistical significance, tested two days later — treat as emerging, not settled.
18. AI detectors — OpenAI (2023), Stanford/Liang et al. (2023), Vanderbilt (2023). The evidence that AI-writing detectors don’t reliably work. OpenAI discontinued its own AI-text classifier (July 2023) for “low accuracy” — it caught only ~26% of AI text and falsely flagged 9% of human writing. A Stanford study (Liang et al., Patterns 2023) found leading detectors wrongly flagged 61% of essays by non-native English writers as AI, while clearing 90%+ of native speakers. Vanderbilt University (2023) disabled Turnitin’s detector after calculating that even at a 1% false-positive rate, ~750 student papers a year (1% of the roughly 75,000 it submitted to Turnitin in 2022) could be wrongly flagged. Together: don’t rely on detectors — redesign assessment instead.
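To make the detector numbers in entry 18 concrete in Q&A, the sketch below applies OpenAI's reported rates (about 26% of AI text caught, 9% of human text falsely flagged) to a hypothetical batch of essays. The batch size of 1,000 and the 20% share of AI-written essays are assumptions chosen purely for illustration, not figures from any of the cited sources, and the function name is illustrative.

```python
def flagged_breakdown(n_essays, ai_share, true_positive_rate, false_positive_rate):
    """Split a detector's flags into correct catches and false accusations."""
    ai_essays = n_essays * ai_share
    human_essays = n_essays - ai_essays
    caught = ai_essays * true_positive_rate          # AI essays correctly flagged
    false_flags = human_essays * false_positive_rate  # human essays wrongly flagged
    share_of_flags_correct = caught / (caught + false_flags)
    return caught, false_flags, share_of_flags_correct


# Hypothetical batch: 1,000 essays, 20% AI-written (both numbers are assumptions).
caught, false_flags, share_correct = flagged_breakdown(
    n_essays=1000,
    ai_share=0.20,
    true_positive_rate=0.26,   # OpenAI classifier: ~26% of AI text caught
    false_positive_rate=0.09,  # OpenAI classifier: 9% of human writing flagged
)
print(f"AI essays caught: {caught:.0f}")
print(f"Human essays wrongly flagged: {false_flags:.0f}")
print(f"Share of flags that are correct: {share_correct:.0%}")
```

Under these assumed inputs the detector wrongly flags more human essays (about 72) than it catches AI essays (about 52), so fewer than half of its flags are correct. Changing `ai_share` shows how strongly the result depends on how common AI use actually is.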
19. TEQSA & University of Sydney — the “two-lane” assessment model (expert guidance, not a study). Australia’s higher-education regulator (TEQSA, “Assessment reform for the age of AI”) and the University of Sydney propose a two-lane approach: a secure lane (oral, in-person, supervised assessment, where AI is excluded, to validate that the student can do it) alongside an open lane (AI permitted, assessing the thinking, the process, and skilful use of the tool). This is well-regarded best practice, not RCT-backed evidence: of the talk’s pillars, assessment has the newest and least-tested evidence base.