Designing Assessments GenAI Cannot Ghost-Write
What Every Evaluator Must Know Before the Next Submission Deadline
Time to Complete: 15 minutes
The 5-Minute Warm-Up Activity (PDF) can be downloaded above.
Who This Is For: This lesson is for anyone who designs, oversees or quality-assures assessments in an environment where students have access to generative AI tools -- and who is starting to suspect that the safeguards they rely on may no longer hold. That includes secondary and post-secondary teachers reworking coursework briefs, university faculty facing academic integrity hearings, curriculum designers rebuilding programs for the AI era, academic integrity officers trying to move their institution beyond detection software, and department chairs or examination boards deciding what counts as valid evidence of learning. It is equally relevant in professional certification and vocational qualification contexts -- awarding bodies, training providers and L&D teams who are realizing that written assessments no longer reliably distinguish genuine competence from AI-assisted output. If you have asked yourself ‘How do I know this work is really theirs?’ and found your current answer unsatisfying, this lesson was written for you.
Goal: You will develop practical AI literacy by understanding why generative AI (GenAI) can compromise academic integrity even in carefully designed authentic assessments. You will evaluate your current assessment practices and identify concrete redesign strategies that shift the focus from written output to performative, process-oriented evaluation.
Real-World Applications:
Professional certification and awarding bodies face the same structural vulnerability as universities: a candidate who can produce a passing written submission using ChatGPT in under 90 minutes is not demonstrating the competence the credential is supposed to signal. The shift described here -- from written-output assessment toward synchronous, performative formats -- is already under active review at qualification bodies, including those governing legal practice, accountancy and clinical training, where the validity of written examinations is being re-examined specifically because AI removes the link between producing the text and possessing the knowledge. The Hobbins and Kaider audit frameworks apply directly: any certification assessment that a candidate could complete using only an LLM and the assessment brief sits in the vulnerable quadrant, and the redesign logic (add a mandatory oral, embed live scenario performance, require reference to evidence an AI could not have accessed) transfers from higher education to professional licensing with minimal adaptation.
The Problem and Its Relevance
Most educators treating authentic assessments as a firewall against GenAI are building their defenses on a foundation that experimental research has already cracked. Kofinas et al. (2025) demonstrated that experienced academic markers could not reliably distinguish student-authored work from GenAI-generated or GenAI-modified work -- regardless of whether the assessment was classified as low, medium or high in authenticity. GenAI-authored assessments, generated within 90 minutes using ChatGPT 3.5, received marks in the upper-second range (55–70%), confirming that the technology is capable enough to pass undetected through standard marking and moderation processes. The more alarming finding, however, is not that GenAI can produce acceptable work -- it is that suspicion of GenAI use changes how markers grade all work. When markers knew that some samples might have been AI-generated, they awarded lower scores to human-authored, first-class submissions -- in several cases reducing marks by ten percentage points. Academic integrity is no longer a binary problem of catching cheaters: it has become a systemic risk that distorts fair evaluation for every student in the cohort.
Why Does This Matter?
Understanding the limits of authentic assessments in the GenAI era matters because:
(i) Authenticity alone is not a safeguard: The Kofinas et al. (2025) study tested six assessments spanning low to high authenticity levels using both the Hobbins et al. (2022) and Kaider et al. (2017) frameworks. None of the assessments reliably prevented GenAI misuse or enabled markers to identify it.
(ii) Detection tools compound the problem: AI detection software is widely acknowledged to be inaccurate and biased. Relying on these tools shifts the burden onto students’ ability to appeal and onto markers’ subjective judgments, increasing the risk of false accusations against honest students.
(iii) GenAI exploits structural predictability: Written assessments with defined briefs, word counts and rubric criteria give GenAI enough scaffolding to produce plausible, rubric-aligned work. The more explicit the task requirements, the easier they are to replicate via prompt engineering.
(iv) Marker awareness introduces bias: Once markers suspect GenAI involvement, the study showed they began second-guessing authorship across all submissions -- leading to unusually large grade disparities between first and second markers and penalizing high-quality human work perceived as ‘too perfect’.
(v) Assessment design shapes what students actually learn: If students can generate acceptable assessments without engaging with module content, the entire learning loop -- built on Biggs’s (2003) constructive alignment -- breaks down. GenAI-assisted submissions that pass grading do not demonstrate learning; they simulate it.
(vi) The paradigm shift is urgent and broader than universities: Kofinas et al. (2025) note that a move away from written assessments carries consequences for online and distance learning, professional certification and even academic publishing -- any context where written output is currently treated as the primary evidence of knowledge.
Therefore, the central task for evaluators is not to build smarter detection mechanisms around existing assessment formats. It is to ask whether those formats remain valid evidence of learning at all.
Three Critical Questions to Ask Yourself
· Can I map my current assessments against the four dimensions of authentic assessment -- realism, cognitive challenge, evaluative judgement criteria and evaluative judgement feedback -- and honestly determine whether they are vulnerable to GenAI completion?
· Do I understand the difference between assessing written output and assessing the process and social performance of learning, and am I prepared to redesign accordingly?
· Am I able to identify which of my assessments -- regardless of how ‘authentic’ they appear on paper -- could be replicated by a student using GenAI within 90 minutes?
Roadmap
Review your current assessments using the two-framework audit below. Complete steps (i) through (iv) individually or in a small group. This activity is designed to be completed in 15 minutes. Steps:
(i) Select one of your current assessments and apply the Hobbins et al. (2022) framework. Rate the assessment as Low, Moderate, or High on each of the four dimensions: Realism (does it engage students with real-world problems?), Cognitive Challenge (does it require building new knowledge rather than recalling it?), Evaluative Judgement Criteria (can students critically assess their own performance?), and Evaluative Judgement Feedback (does the assessment generate meaningful, improvement-oriented feedback?). Record your ratings. Guidance: Be honest. An essay with a defined rubric may score Moderate on Realism but Low on Cognitive Challenge if it primarily rewards recall and synthesis of existing arguments.
(ii) Apply the Kaider et al. (2017) authenticity-proximity framework. Place your assessment in one of the nine cells formed by two axes: Authenticity (the degree to which tasks mirror professional practice) and Proximity (the degree to which learning occurs in or adjacent to real workplaces). Low-proximity, low-authenticity assessments -- such as critical literature reviews -- are the most vulnerable to GenAI substitution. High-authenticity, high-proximity assessments -- such as live client projects -- are structurally harder to replicate. Guidance: Ask: ‘Could a student generate an acceptable version of this task using only an LLM and a copy of the assessment brief?’ If the answer is yes in under two hours, your assessment sits in the vulnerable quadrant. (A minimal code sketch after step (iv) shows one way to record the results of steps (i) and (ii).)
(iii) Identify one redesign move toward process-oriented, synchronous assessment. Kofinas et al. (2025) recommend shifting from output-based written assessments toward assessments that are performative and interactional -- formats GenAI cannot complete in place of a student. Examples include: interactive oral assessments (students defend or extend written work in real time), role-play scenarios embedded in live professional simulations, collaborative problem-solving tasks observed and evaluated in session and structured vivas where the evaluator probes depth of understanding beyond the submitted text. Guidance: You do not need to eliminate written components. Adding a mandatory oral defense or a live presentation of the written work significantly raises the cost and detectability of GenAI substitution.
(iv) Draft one assessment design principle for your discipline. Write a single sentence that captures a design principle for assessments in your subject area that exploits the one thing GenAI cannot yet replicate: a student’s unique, situated, experiential knowledge generated through real interaction with the course, the classroom or a professional environment. This principle should be specific enough to guide future assessment design decisions in your department or institution. Example (Business Studies): ‘Assessments must require students to reference, analyze and respond to at least one piece of evidence generated during live course activity -- such as a class discussion, a workshop output or a client interaction -- that could not be known to an external AI system’.
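For readers who want to make the audit concrete, or to log results across a whole module portfolio, the sketch below records steps (i) and (ii) as a small data structure. It is a minimal Python illustration, not part of either published framework: the class, the field names, the three-band Rating scale and the vulnerability rule (Low on both Kaider axes plus the ‘LLM and brief’ self-test from step (ii)) are assumptions drawn from the wording of this lesson, so adapt them to your own rubric language.

```python
from dataclasses import dataclass
from typing import Literal

# Three-band scale used throughout the audit (assumed from the lesson text).
Rating = Literal["Low", "Moderate", "High"]


@dataclass
class AssessmentAudit:
    """One assessment, rated on the four Hobbins et al. (2022) dimensions (step i)
    and placed on the Kaider et al. (2017) authenticity/proximity axes (step ii)."""
    name: str
    # Hobbins et al. (2022) dimensions
    realism: Rating
    cognitive_challenge: Rating
    evaluative_judgement_criteria: Rating
    evaluative_judgement_feedback: Rating
    # Kaider et al. (2017) axes
    authenticity: Rating
    proximity: Rating
    # Self-test from the step (ii) guidance: could a student produce an acceptable
    # version with only an LLM and the assessment brief in under two hours?
    completable_with_llm_and_brief: bool

    def weak_hobbins_dimensions(self) -> list[str]:
        """Dimensions rated Low -- the first targets for redesign attention."""
        ratings = {
            "Realism": self.realism,
            "Cognitive Challenge": self.cognitive_challenge,
            "Evaluative Judgement Criteria": self.evaluative_judgement_criteria,
            "Evaluative Judgement Feedback": self.evaluative_judgement_feedback,
        }
        return [dimension for dimension, rating in ratings.items() if rating == "Low"]

    def kaider_cell(self) -> str:
        """The cell this assessment occupies on the authenticity x proximity grid."""
        return f"{self.authenticity} authenticity / {self.proximity} proximity"

    def genai_vulnerable(self) -> bool:
        """Flag the most exposed case described above: low on both Kaider axes
        and completable with only an LLM and the brief."""
        return (self.authenticity == "Low"
                and self.proximity == "Low"
                and self.completable_with_llm_and_brief)


# Worked example echoing the guidance notes: a rubric-driven critical literature review.
literature_review = AssessmentAudit(
    name="Critical literature review (2,500 words, defined rubric)",
    realism="Moderate",
    cognitive_challenge="Low",
    evaluative_judgement_criteria="Moderate",
    evaluative_judgement_feedback="Low",
    authenticity="Low",
    proximity="Low",
    completable_with_llm_and_brief=True,
)

print(literature_review.weak_hobbins_dimensions())
# ['Cognitive Challenge', 'Evaluative Judgement Feedback']
print(literature_review.kaider_cell())       # Low authenticity / Low proximity
print(literature_review.genai_vulnerable())  # True
```

Used across a department’s assessment portfolio, the same record could also carry the redesign move chosen in step (iii), giving integrity officers a simple audit trail of which assessments have been reviewed and what was changed.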
Individual Reflection
After completing the Roadmap, reflect briefly on the following:
· Where did your assessment land on the authenticity-proximity matrix, and were you surprised by the result?
· What is the single most concrete redesign move you could make to your current assessment to reduce its GenAI vulnerability without adding significant administrative burden?
· How does knowing that marker behavior itself is affected by GenAI awareness change your approach to moderation and grade calibration in your department?
· If you shifted one summative assessment in your course from written output to a performative or synchronous format, what would the implications be for your students, your workload, and your institutional policies?
The Bottom Line
Instructors who are still asking ‘How do I detect GenAI use?’ are solving the wrong problem -- the detection question has already been answered: in most marking contexts, you cannot. The operative question is now structural: what formats of evidence are you willing to treat as sufficient proof of learning? Assessments that accept a static written artifact as the sole answer to that question are no longer epistemically defensible in an environment where that artifact can be generated in 90 minutes without subject knowledge.
The deeper institutional risk is not academic dishonesty by individual students -- it is the systematic erosion of the signal that degrees and qualifications are supposed to send. If written assessments can no longer serve as reliable evidence of learning, then the value of the credential itself is at stake, not merely the integrity of a single submission. Educators who redesign now -- moving toward assessments that require real-time knowledge performance, not just knowledge reproduction -- are not only protecting academic integrity; they are redefining what learning means in a world where explicit knowledge is no longer scarce.
You have developed the assessment literacy essential for this moment when you can:
(i) classify your assessments by their GenAI vulnerability using the Hobbins and Kaider frameworks;
(ii) distinguish between assessments that measure written output and those that measure situated, experiential knowledge;
(iii) articulate why synchronous, performative assessment formats are structurally resistant to GenAI substitution; and
(iv) explain to colleagues and administrators why authentic assessment design -- not detection technology -- is the sustainable institutional response to generative AI.
#AuthenticAssessmentDesign #GenAIAcademicIntegrity #AssessmentLiteracy #RethinkingHigherEd #AIProofLearning
{"@context":"https://schema.org","@type":"LearningResource","name":"Designing Assessments GenAI Cannot Ghost-Write","alternateName":"What Every Evaluator Must Know Before the Next Submission Deadline","description":"A 15-minute AI literacy lesson plan for educators and evaluators examining why authentic assessments do not protect against generative AI misuse, grounded in Kofinas et al. (2025).","url":"https://www.marvinuehara.com/ai-literacy-lesson-plans","inLanguage":"en","isAccessibleForFree":true,"timeRequired":"PT15M","educationalLevel":"ProfessionalDevelopment","learningResourceType":"LessonPlan","provider":{"@type":"Organization","name":"Marvin Uehara","url":"https://www.marvinuehara.com"},"author":{"@type":"Person","name":"Marvin Uehara","url":"https://www.marvinuehara.com"}}{"@context":"https://schema.org","@type":"Course","name":"AI Literacy Lesson Plan: Assessment Integrity in the Age of GenAI","description":"Educators apply the Hobbins et al. (2022) and Kaider et al. (2017) frameworks to audit and redesign assessments for GenAI resistance.","teaches":["Authentic assessment frameworks","GenAI detection limitations","Assessment redesign strategies","Performative and synchronous assessment","Academic integrity in higher education"],"audience":{"@type":"EducationalAudience","educationalRole":["Teacher","ProfessionalEvaluator","Administrator","InstructionalDesigner"]},"hasCourseInstance":{"@type":"CourseInstance","courseMode":"SelfPaced","instructor":{"@type":"Person","jobTitle":"Educator"}}}{"@context":"https://schema.org","@type":"Article","headline":"Designing Assessments GenAI Cannot Ghost-Write","abstract":"Kofinas et al. (2025) found that markers cannot reliably distinguish AI-generated from human-authored work, and that authentic assessments do not prevent GenAI misuse. This lesson plan translates those findings into a 15-minute audit and redesign activity for educators.","keywords":"authentic assessment,generative AI,academic integrity,ChatGPT,assessment design,higher education,performative assessment,Hobbins,Kaider,constructive alignment","about":[{"@type":"Thing","name":"Generative Artificial Intelligence"},{"@type":"Thing","name":"Academic Integrity"},{"@type":"Thing","name":"Authentic Assessment"}],"citation":{"@type":"ScholarlyArticle","name":"The impact of generative AI on academic integrity of authentic assessments within a higher education context","author":["Alexander K. Kofinas","Crystal Han-Huei Tsay","David Pike"],"datePublished":"2025","isPartOf":{"@type":"Periodical","name":"British Journal of Educational Technology"},"identifier":"10.1111/bjet.13585"}}{"@context":"https://schema.org","@type":"EducationalOccupationalProgram","name":"AI Literacy for Educators","description":"A series of lesson plans equipping educators with the skills to critically evaluate AI tools, design AI-resistant assessments, and maintain academic integrity.","povider":{"@type":"Organization","name":"Marvin Uehara","url":"https://www.marvinuehara.com"},"educationalCredentialAwarded":"AI Literacy Certificate","programPrerequisites":"Basic familiarity with assessment design in secondary or higher education","occupationalCategory":"Education"}{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"Can authentic assessments protect against GenAI misuse?","acceptedAnswer":{"@type":"Answer","text":"No. Kofinas et al. 
(2025) demonstrated that assessments rated high in authenticity were equally vulnerable to GenAI completion as low-authenticity assessments, and markers could not reliably detect AI-generated work."}},{"@type":"Question","name":"What assessment formats are most resistant to GenAI?","acceptedAnswer":{"@type":"Answer","text":"Synchronous, performative formats—such as interactive oral assessments, live role-play scenarios, and structured vivas—are structurally harder for GenAI to substitute because they require real-time student performance rather than written output production."}},{"@type":"Question","name":"How does the Kaider et al. (2017) framework help evaluate GenAI vulnerability?","acceptedAnswer":{"@type":"Answer","text":"The Kaider framework maps assessments on two axes—authenticity and workplace proximity. Assessments in the low-authenticity, low-proximity quadrant (such as literature reviews) are the most susceptible to GenAI completion."}}]}