AI Tools for Teachers: An Evaluation Guide

AI Tools for Teachers: A 5-Criteria Evaluation Framework That Actually Works

Every week, a new list appears: “The Top 50 AI Tools for Teachers.” “10 AI Apps Every Educator Needs.” “The Ultimate AI Toolkit for Your Classroom.” The lists multiply while implementation stalls. Teachers try a tool, use it once, and move on. Departments adopt something in September and abandon it by November. The district pays for a license that a fraction of staff actually use.

The problem is not a shortage of tools. The problem is the absence of a system for evaluating them. Without a framework, tool selection becomes a matter of who shouted loudest on social media, which vendor bought lunch, or what the early adopter down the hall happened to try.

This guide does not give you another list. It gives you a system — the 5-criteria evaluation framework — and the process for using it. The framework is designed to help teachers, coaches, and district leaders make deliberate decisions about which AI tools deserve adoption and which ones to walk away from.

The Problem with AI Tool Lists

Tool lists serve a purpose: they create awareness. A teacher who has never used an AI copilot needs to know that ChatGPT, Gemini, Claude, and Copilot exist. Awareness, however, is not implementation.

App fatigue is real. When teachers are presented with dozens of options and no criteria for choosing among them, the result is not informed selection — it is paralysis. Research on technology adoption in education consistently shows that choice overload reduces adoption, and shallow adoption (trying a tool once) does not produce sustained use.

Tool churn wastes time and money. A teacher who experiments with six tools across a semester spends more time learning interfaces than using any single tool effectively. A district that licenses a tool based on a conference presentation, without pilot data, risks investing in something that collects dust.

No framework means no retention. Sustained technology use requires alignment with instructional practice. When a teacher cannot articulate why a specific tool supports their pedagogy, they will not keep using it. The tool becomes a novelty, not a practice.

The alternative is not fewer tools. It is better decision-making about tools. That requires criteria.

The 5-Criteria Evaluation Framework

Five criteria. Every tool evaluated against all five. No tool gets a pass because it scores well on one and fails on another. The criteria work together, and a tool that excels at pedagogy but fails at privacy is a tool you cannot adopt. A tool that is accessible but has no evidence base is a tool you should pilot carefully.
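One way to make the "all five, no pass" rule concrete is to record a rating per criterion and refuse to average them. The sketch below is an illustration, not part of the published rubric: the 0-3 scale, the passing threshold of 2, and the names `CRITERIA`, `ToolEvaluation`, and `verdict` are assumptions chosen for the example.

```python
from dataclasses import dataclass

# The five criteria come from the framework. The 0-3 scale and the
# threshold below are illustrative assumptions, not part of the rubric.
CRITERIA = ("pedagogy", "privacy", "accessibility", "evidence", "integration")
PASSING = 2  # hypothetical minimum rating on a 0-3 scale


@dataclass
class ToolEvaluation:
    tool: str
    ratings: dict  # criterion name -> rating from 0 (fails) to 3 (strong)

    def missing(self):
        """Criteria that have not been rated yet."""
        return [c for c in CRITERIA if c not in self.ratings]

    def failing(self):
        """Rated criteria that fall below the passing threshold."""
        return [c for c in CRITERIA
                if c in self.ratings and self.ratings[c] < PASSING]

    def verdict(self):
        # Every criterion must be rated: no partial evaluations.
        if self.missing():
            return "incomplete"
        # Privacy is a prerequisite, so a privacy failure ends the review.
        if "privacy" in self.failing():
            return "do not adopt"
        # Any other weak criterion means a careful pilot, not a pass
        # earned by strengths elsewhere.
        if self.failing():
            return "pilot carefully"
        return "eligible for pilot"


# A made-up tool that excels at pedagogy but fails at privacy.
strong_but_leaky = ToolEvaluation(
    "Hypothetical Tool A",
    {"pedagogy": 3, "privacy": 1, "accessibility": 3,
     "evidence": 2, "integration": 3},
)
print(strong_but_leaky.verdict())  # do not adopt
```

The design choice that matters is the absence of a total score. A single number would let a strong pedagogy rating hide a privacy failure, which is exactly what the framework rules out.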

1. Pedagogy

The first question is not “What can this tool do?” It is “What does this tool do for instruction?”

Does it align with evidence-based instructional practices? A tool that generates flashy content but does not support retrieval practice, scaffolding, formative assessment, or differentiation is a tool that entertains rather than teaches.
Does it support Universal Design for Learning (UDL)? UDL’s three principles — multiple means of engagement, representation, and action/expression — provide a robust lens for evaluating whether a tool serves diverse learners or a narrow range.
Does it enhance teacher judgment rather than replace it? This is a non-negotiable distinction. Tools that generate lesson plans, assessments, or feedback drafts are useful. Tools that position themselves as making instructional decisions for teachers are not. The teacher must remain the decision-maker.
Does it support the instructional strategies your district already prioritizes? If your school emphasizes cognitively demanding tasks, the tool should help create them, not replace them with lower-level alternatives. If your school emphasizes formative assessment, the tool should generate assessment items, not skip the assessment step.
Can the tool be used across multiple content areas and grade levels, or is it so narrow that it serves only one use case? Narrow tools may still be worth adopting, but their value proposition must justify the cost and training investment.
Does the tool’s output require teacher review and revision, or does it present finished products that discourage critical evaluation? Tools that position their output as final and authoritative undermine the teacher’s role as quality controller.

2. Privacy

Privacy is not a feature. It is a prerequisite.

Does the tool comply with FERPA? Can the vendor provide documentation that it acts as a “school official” with legitimate educational interest, or does it require parental consent for data disclosure?
Does it comply with COPPA for students under 13? Has the vendor obtained verifiable parental consent, or does the tool simply set an age gate that children can bypass?
What is the data retention policy? Does the tool retain student data indefinitely, or does it delete upon request? Can the district specify retention periods in a data processing agreement?
How does the vendor handle student data? Is student data used to train the vendor’s AI models? If so, can the district opt out? Is data shared with third parties?
Is the vendor transparent? Can you find the privacy policy easily? Is it written in language that educators and families can understand, or is it buried in legal jargon?
Does the vendor sign a data processing agreement (DPA)? A DPA should specify what data is collected, how it is stored, who has access, how long it is retained, and the process for data deletion upon request. If a vendor will not sign a DPA, that is a disqualifying factor.
What happens to data if the vendor is acquired or goes out of business? Does the DPA include provisions for data return or destruction in the event of a merger, acquisition, or shutdown? Student data should not become an asset in an acquisition.
Can the district export and delete all its data at any time? Vendor lock-in on data is a privacy risk and an operational risk. If you cannot get your data out, you do not own it.
Does the tool collect metadata beyond what is necessary for its educational function? Some tools collect device information, location data, browsing patterns, or usage analytics that go well beyond the data required to deliver the educational service.

As explored in Creating AI Agent Safeguards, the safeguards that protect student data in AI systems must be designed deliberately, not assumed.

3. Accessibility

If a tool does not work for all students, it does not work for your classroom.

Does it meet WCAG 2.1 AA standards at minimum? Keyboard navigation, screen reader compatibility, sufficient color contrast, text resizing — these are not optional for tools used in public education.
Does it support multilingual learners? Translation features, language-switching, content in multiple languages — or at minimum, compatibility with browser-based translation tools.
Does it provide differentiated output? Can the tool adjust reading level, simplify language, or provide multiple representations of the same content?
Does it work with assistive technology? Speech-to-text, text-to-speech, switch access, and other assistive technologies must function reliably with the tool.
Has the vendor published a Voluntary Product Accessibility Template (VPAT)? A VPAT documents how the tool conforms to Section 508 standards. If the vendor has not completed a VPAT, ask why — and treat the absence as a warning sign.
Does the tool accommodate different input methods? Some students use voice, some use keyboard only, some use touch, some use alternative input devices. A tool that only accepts typed input excludes students who rely on other methods.
Can the tool’s interface be customized for visual needs? Adjustable font sizes, high contrast modes, reduced motion settings — these features should be available without requiring separate accommodations.

4. Evidence

Vendors make claims. Evidence sorts which claims hold up.

What is the research base? Has the tool been studied in peer-reviewed research? In what contexts — grade level, content area, student population?
Is there pilot data from schools or districts similar to yours? A case study from a suburban high school does not generalize to an urban elementary school, and vice versa.
What outcome measures were used? Engagement metrics do not equal learning outcomes. Time-on-task does not equal achievement. Look for evidence of impact on student learning, not just student activity.
Can you distinguish between vendor claims and independent evidence? Vendor-commissioned research and independent research tell different stories. Both are worth reading, but the weight of evidence should favor independent findings.
Does the vendor provide references from districts you can contact? A vendor who cannot connect you with current users may not have sustained implementations to showcase.
How long has the tool been in use in K-12 settings? A tool that launched three months ago does not have evidence of sustained impact. Early results are valuable but insufficient for long-term adoption decisions.
Does the tool have an effect size worth the investment? Even well-studied tools vary in their impact. A tool that produces a small effect may still be worth adopting if the cost is low and the integration is seamless. A tool that produces a small effect at high cost with poor integration is not.
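The effect-size question is easier to ask a vendor when you know what the number is. A common standardized measure is Cohen's d: the difference between two group means divided by their pooled standard deviation. The sketch below assumes you have scores for a group that used the tool and a comparison group that did not. The scores are made-up placeholders, and the function name `cohens_d` is our own.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)  # sample standard deviations
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled


# Placeholder scores for illustration only; not data from any real study.
with_tool = [72, 75, 78, 80, 85]
without_tool = [70, 73, 76, 78, 83]

d = cohens_d(with_tool, without_tool)
print(f"Cohen's d = {d:.2f}")
```

Cohen's conventional benchmarks (roughly 0.2 small, 0.5 medium, 0.8 large) are rules of thumb, not thresholds. The point of computing d is to weigh it against cost and integration effort, as the question above describes.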

The Best AI Tools So Far to Improve Teacher Planning, Assessment, Instruction, and Feedback provides tool recommendations grounded in practical evaluation — but every recommendation should be tested against the evidence available in your local context.

5. Integration

A tool that lives in isolation is a tool that dies.

Does it work with your existing LMS or workspace? Google Classroom, Canvas, Schoology, Microsoft Teams — integration matters because teachers will not add a separate login, separate gradebook, and separate workflow to an already full plate.
Does it support single sign-on? If teachers and students need separate credentials for each AI tool, adoption drops.
Does it create minimal workflow disruption? A tool that requires teachers to redesign their entire instructional workflow to accommodate the tool is a tool that will not be adopted. The tool should fit the workflow, not the reverse.
Does it support a sustainable tech stack? Every tool you add is a tool you maintain — licenses, updates, training, troubleshooting. The tech stack should be intentional, not accretive.
What does license management look like? Can the district assign and revoke seats easily? Is there a per-user or per-building model that scales appropriately? Tools that require manual license management for each user create administrative overhead that compounds over time.
Does the tool provide admin dashboards for usage tracking? A district that cannot see which teachers and students are actually using a tool cannot evaluate whether the investment is paying off. Usage data should be accessible without requiring a support ticket to the vendor.
Is there a reliable support channel? When the tool breaks during a lesson, how quickly can a teacher get help? Email-only support with a 48-hour response time means a single classroom disruption can go unresolved for two days.

Using Generative AI to Support Teachers’ Planning, Workflow, and Content Creation to Google Apps and Learning Management Systems addresses this directly: AI tools must connect to existing systems, not create new silos.

Ready to evaluate your district’s AI tools against all five criteria? Dr. Matt Rhoads works with districts to build evaluation frameworks and pilot processes. Request an evaluation consultation →

Tool Categories Through the Framework Lens

This section does not rank individual tools. Instead, it applies the framework to five tool categories, with named tools as examples of each.

AI Copilots (ChatGPT, Gemini, Claude, Copilot)

Pedagogy: Strong potential for lesson planning, content generation, and feedback drafting. Weak on structured instructional routines — the copilot responds to prompts but does not guide pedagogy.
Privacy: Varies by vendor and plan. Enterprise/education plans typically offer better data protections than free tiers. Check whether inputs are used for model training.
Accessibility: Improving but inconsistent. Screen reader support and multilingual output vary significantly across platforms.
Evidence: Limited peer-reviewed research on classroom impact. Growing body of practitioner reports.
Integration: Browser-based tools work alongside LMS platforms but do not integrate natively in most cases. Copy-paste workflows are the norm.

Adaptive Platforms (Khanmigo, i-Ready, MobyMax)

Pedagogy: Built-in instructional design, often aligned to standards. Strength: adaptive sequencing. Risk: may constrain teacher flexibility.
Privacy: Typically stronger — these platforms are built for K-12 and have compliance documentation. Verify COPPA and FERPA status.
Accessibility: Varies. Check WCAG compliance and assistive technology compatibility per product.
Evidence: More robust than copilots — adaptive platforms often have efficacy studies, though methodology varies. Look for independent replication.
Integration: Often designed to integrate with common LMS platforms and student information systems (SIS).

Interactive Presentation (Pear Deck, Nearpod)

Pedagogy: Supports active learning during instruction. Formative assessment embedded in presentation flow. Student response features enable real-time checks for understanding.
Privacy: Collects student response data. Verify data handling policies and retention.
Accessibility: Screen reader support varies. Some features may not be fully keyboard-navigable.
Evidence: Moderate research base on interactive presentation and student engagement. Less on long-term learning outcomes.
Integration: Integrates with Google Slides and PowerPoint. SSO commonly supported.

Formative Assessment (Edpuzzle, ReadTheory)

Pedagogy: Embedded questions, progress tracking, comprehension monitoring. Supports retrieval practice when designed well.
Privacy: Student progress data collected. Verify whether data is used for model training or algorithm development.
Accessibility: Text-to-speech, adjustable reading levels, and multilingual support vary by platform.
Evidence: Some direct research on specific platforms. Broader evidence base on formative assessment as a practice.
Integration: Often integrates with Google Classroom and Canvas. Check grade passback functionality.

Content Creation (AI-powered lesson generators, worksheet builders)

Pedagogy: Useful for differentiation and material generation. Risk: generated content may not align with evidence-based pedagogy unless the teacher applies the framework.
Privacy: Depends on whether student data is involved. Teacher-only use reduces privacy concerns but does not eliminate them if student work is input.
Accessibility: Generated content must be manually checked for accessibility — alt text, reading level, language complexity.
Evidence: Minimal. Content creation tools are relatively new. Evaluate based on output quality in your context.
Integration: Output typically exported as documents or slides. Limited native LMS integration.

See Boost Student Learning with Interactive Worked Examples Thanks to the Canvas Feature in Gemini, ChatGPT, and Claude for an example of how copilot features can support evidence-based instructional strategies.

The Pilot Process

Evaluation does not end with the framework. A tool that passes all five criteria on paper must still prove itself in practice. The pilot process is how you test whether a tool works in your context.

Phase 1: Setup (Week 1)

Select 2-4 teachers across different grade levels and content areas. Choose teachers who are willing to provide honest feedback, not just AI enthusiasts who will overlook problems. Include at least one teacher who is skeptical — they often identify issues that enthusiasts miss.
Provide 30-60 minutes of training on the specific tool and its intended use case. Do not train on “all the things the tool can do.” Train on the one or two specific use cases the pilot will test. Narrow focus produces clearer results.
Define the pilot’s focus question: e.g., “Does this tool reduce lesson planning time by 30 minutes per week without reducing plan quality?” A clear focus question prevents scope creep and makes evaluation possible.
Establish data collection methods: teacher logs, student feedback, observation data. Create simple templates — a 5-minute daily log and a weekly reflection form — so that data collection does not become burdensome.
Set a clear timeline with a start date, check-in dates, and an end date. Pilots without deadlines tend to drift.
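The timeline item can be sketched from the six-week structure this guide uses: setup in week 1, implementation in weeks 2-5, evaluation in week 6. The example below is a minimal sketch. The start date is hypothetical, the names `PHASES`, `pilot_schedule`, and `check_in_dates` are our own, and placing each check-in on the first day of an implementation week is an assumption.

```python
from datetime import date, timedelta

# (phase name, first week, last week), following the guide's six-week pilot.
PHASES = [("Setup", 1, 1), ("Implementation", 2, 5), ("Evaluation", 6, 6)]


def pilot_schedule(start):
    """Return (phase, first day, last day) for each pilot phase."""
    schedule = []
    for name, first_week, last_week in PHASES:
        begins = start + timedelta(weeks=first_week - 1)
        ends = start + timedelta(weeks=last_week) - timedelta(days=1)
        schedule.append((name, begins, ends))
    return schedule


def check_in_dates(start):
    """One check-in per implementation week (weeks 2-5)."""
    return [start + timedelta(weeks=week - 1) for week in range(2, 6)]


start = date(2025, 9, 1)  # hypothetical start date (a Monday)
for name, begins, ends in pilot_schedule(start):
    print(f"{name}: {begins.isoformat()} to {ends.isoformat()}")
print("Check-ins:", [d.isoformat() for d in check_in_dates(start)])
```

Writing the dates down before the pilot starts gives the coordinator a fixed end date to hold to.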

Phase 2: Implementation (Weeks 2-5)

Teachers use the tool for the identified purpose in their regular practice. Resist the temptation to expand the pilot’s scope mid-process. If the focus question is about lesson planning, do not add formative assessment as a secondary question in week 3. Save expansion for the next pilot cycle.
The coach or pilot coordinator checks in weekly (brief, 10-15 minutes). Ask three questions: What worked this week? What did not work? What would you change? These check-ins catch problems early and keep the pilot on track.
Teachers log usage, benefits, and challenges. The log can be simple — even a shared Google Doc where teachers add bullet points after each use session is sufficient. The point is to capture real-time data rather than relying on retrospective memory during the evaluation phase.
Collect student work samples where applicable. If the tool is used to generate instructional materials, save samples of those materials and the student work they produced. This provides concrete evidence for the evaluation phase.

Phase 3: Evaluation (Week 6)

Synthesize teacher feedback, student data, and usage logs. Use the framework’s five criteria as the evaluation structure: How did the tool perform on pedagogy, privacy, accessibility, evidence, and integration in real-world use?
Apply the 5-criteria framework with real-world evidence, not just vendor claims. The framework assessment completed before the pilot was theoretical. The Phase 3 assessment is empirical. The gap between the two is where the most important insights live.
Make a decision: adopt, extend pilot, or abandon. Document the reasoning — future evaluations build on past decisions. An adoption decision should include: what use cases the tool is approved for, what training is required for new users, and what the review schedule is.
Communicate the decision to all stakeholders. If you adopt, explain why and what comes next. If you abandon, explain why — this prevents the same tool from being proposed again without new evidence. If you extend the pilot, specify what additional data you need and how long the extension lasts.

Decision criteria for adoption:

Teachers report meaningful benefit (not just “it’s nice to have”)
Student outcomes are not negatively affected (ideally, improved)
Privacy and accessibility standards are met in practice, not just on paper
The tool fits existing workflows without excessive workaround requirements

Decision criteria for abandonment:

Fewer than 50% of pilot teachers are still using the tool after three weeks
Teachers report that the tool adds more work than it saves
Student data handling raises concerns that the vendor cannot resolve
No measurable benefit after the full pilot period
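The two lists above can be read as a decision rule: any abandonment criterion ends the pilot, all four adoption criteria together justify adoption, and anything in between calls for an extended pilot. The sketch below encodes that reading. The field names and the `decide` function are hypothetical, and treating mixed results as "extend pilot" is our assumption, not a rule stated in the lists.

```python
from dataclasses import dataclass


@dataclass
class PilotResults:
    pilot_teachers: int               # teachers who started the pilot
    still_using_after_week_3: int     # teachers still using the tool
    meaningful_benefit: bool          # more than "it's nice to have"
    measurable_benefit: bool          # some benefit shows up in the data
    adds_more_work_than_it_saves: bool
    outcomes_harmed: bool             # student outcomes negatively affected
    standards_met_in_practice: bool   # privacy and accessibility held up
    unresolved_data_concerns: bool    # vendor could not resolve them
    fits_workflow: bool               # no excessive workarounds


def decide(r):
    if r.pilot_teachers <= 0:
        raise ValueError("pilot_teachers must be positive")
    usage_rate = r.still_using_after_week_3 / r.pilot_teachers

    # Any single abandonment criterion is enough to stop.
    if (usage_rate < 0.5
            or r.adds_more_work_than_it_saves
            or r.unresolved_data_concerns
            or not r.measurable_benefit):
        return "abandon"

    # Adoption requires every adoption criterion at once.
    if (r.meaningful_benefit
            and not r.outcomes_harmed
            and r.standards_met_in_practice
            and r.fits_workflow):
        return "adopt"

    # Assumption: mixed results mean more data is needed.
    return "extend pilot"


results = PilotResults(
    pilot_teachers=4, still_using_after_week_3=3,
    meaningful_benefit=True, measurable_benefit=True,
    adds_more_work_than_it_saves=False, outcomes_harmed=False,
    standards_met_in_practice=True, unresolved_data_concerns=False,
    fits_workflow=True,
)
print(decide(results))  # adopt
```

Whatever the outcome, the inputs double as the documented reasoning that future evaluations build on.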

5 Ways Agentic AI Can Transform Your Teaching Workflow describes emerging AI capabilities — but every capability should be piloted before it is presumed effective.

Need help building a district-wide evaluation protocol? Dr. Matt Rhoads works with districts to design customized evaluation frameworks. Learn more about consulting services.

When to Say No

Some tools do not deserve a pilot. Red flags that should stop evaluation before it starts:

No privacy policy, or a privacy policy that is impossible to find or understand. If the vendor cannot explain their data practices clearly, they should not have access to student data. Period.
No data deletion option — if the vendor cannot delete your data when you ask, you do not control your data. This is especially urgent for AI tools, where data entered as prompts may become part of the model’s training corpus with no way to extract it.
Replaces teacher judgment entirely — any tool that positions itself as making instructional decisions without teacher input is fundamentally misaligned with sound educational practice. The vendor who says “our AI decides the right intervention” does not understand what teachers do.
No accessibility statement — if the vendor cannot describe how their tool meets accessibility standards, it likely does not meet them. In public education, this is a legal requirement, not an aspiration.
Unsupported by evidence and no willingness to pilot — vendors who claim efficacy without evidence and resist structured pilot evaluations are vendors who are selling, not serving. A vendor who says “just trust us” is a vendor who has not earned trust.
Free tier that mines data — some tools offer free versions that use input data for model training or sell aggregated data to third parties. A “free” tool that costs student privacy is not free. Always check what the free tier actually costs.
No human support channel — if the only way to get help is through a chatbot or a community forum, the tool was not built for environments where downtime means lost instructional time. K-12 users need responsive support.
Rapid version changes without notice — AI tools iterate quickly, but changes that alter functionality, interface, or data handling without warning disrupt instruction. Vendors should maintain a change log and provide advance notice of significant updates.
Terms of service that claim ownership of user-generated content — some AI tools’ terms include clauses granting the vendor ownership or broad licensing rights to content generated through the platform. If a teacher’s lesson plans or a student’s work could be claimed by the vendor, the tool is not appropriate for educational use.

These are not arbitrary criteria. They reflect the legal, ethical, and instructional obligations that schools carry. A tool that violates any of them is not worth the risk of a pilot, no matter how impressive the demo.
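Because a single red flag is enough to stop, the screen works as a checklist rather than a score. A minimal sketch, assuming the reviewer records which flags were observed: the short keys in `RED_FLAGS` and the `screen` function are hypothetical names, and the descriptions paraphrase the list above.

```python
# Keys are hypothetical short names; descriptions paraphrase the red flags.
RED_FLAGS = {
    "no_privacy_policy": "No privacy policy, or one that cannot be found or understood",
    "no_data_deletion": "No data deletion option",
    "replaces_teacher_judgment": "Replaces teacher judgment entirely",
    "no_accessibility_statement": "No accessibility statement",
    "no_evidence_no_pilot": "Unsupported by evidence and no willingness to pilot",
    "free_tier_mines_data": "Free tier that mines data",
    "no_human_support": "No human support channel",
    "unannounced_changes": "Rapid version changes without notice",
    "claims_content_ownership": "Terms of service claim ownership of user content",
}


def screen(observed):
    """Return (may_proceed, matched_flags) for a set of observed flag keys."""
    unknown = set(observed) - set(RED_FLAGS)
    if unknown:
        raise ValueError(f"Unknown red flags: {sorted(unknown)}")
    matched = [RED_FLAGS[key] for key in RED_FLAGS if key in observed]
    # One red flag is enough to stop before a pilot starts.
    return (not matched, matched)


may_proceed, reasons = screen({"no_data_deletion", "free_tier_mines_data"})
print(may_proceed)  # False
for reason in reasons:
    print("-", reason)
```

Recording the matched flags matters as much as the yes-or-no answer, since they become the explanation when a tool is turned down.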

The Death of the LMS in Higher Ed raises important questions about the future of learning platforms — but the evaluation principles remain the same: pedagogy, privacy, accessibility, evidence, integration.

Building a District Tool Evaluation Protocol

The framework works best when it is standardized across a district. An evaluation protocol ensures that every school, every department, and every teacher uses the same criteria — not because every context is identical, but because consistency enables conversation.

Key elements of a district protocol:

A one-page evaluation form based on the five criteria, with space for both ratings and narrative explanation
A defined approval process: who reviews evaluations, how long the review takes, what the decision authority is
A pilot requirement: no district-wide adoption without a completed pilot
A sunset provision: every approved AI tool is re-evaluated annually against the five criteria
A community communication plan: how the district informs families about AI tools in use and the criteria used to evaluate them

The protocol should be developed by the AI implementation team (see the AI for District Leaders guide) and reviewed by legal counsel before adoption.

Common Barriers and Challenges

The vendor’s “30% improvement.”

A vendor presents at a district PD day. The dashboard shows “30% improvement in math scores.” The assistant superintendent asks for the research. The vendor sends a white paper: an internal study with n=47, no control group, and outcome measures the vendor designed. The district pilots the tool and runs its own evaluation: 4% improvement, p-value of 0.31. Not statistically significant. The assistant superintendent now has a different problem: the superintendent already told the board about the partnership.

The lesson: Vendor-reported outcomes and independently verified outcomes are different categories of evidence. The Rubric’s Criterion 4 requires peer-reviewed or independently verified evidence because internal white papers are marketing.
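A district can run this kind of check itself if the pilot includes a comparison group. The sketch below uses a permutation test, which asks how often a difference as large as the observed one would appear if group labels were assigned at random. This is one reasonable technique, not necessarily the one used in the scenario above. The scores are made-up placeholders, and `permutation_p_value` is our own name.

```python
import random


def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n_t = len(treatment)

    def mean_gap(scores):
        first, second = scores[:n_t], scores[n_t:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(treatment) + list(control)
    observed = mean_gap(pooled)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        # Small tolerance guards against floating-point rounding.
        if mean_gap(pooled) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Placeholder scores for illustration only; not data from any real pilot.
used_tool = [78, 82, 69, 91, 74, 88, 80, 77]
comparison = [75, 80, 70, 85, 72, 86, 79, 76]

p = permutation_p_value(used_tool, comparison)
print(f"p-value: {p:.2f}")
```

A large p-value does not prove the tool has no effect. It means the pilot cannot distinguish the observed difference from chance, which is reason enough not to repeat the vendor's headline number to the board.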

The teacher who finds a tool on Instagram.

A teacher discovers an AI writing tool on social media, creates a free account, and starts using it with her 3rd graders. The tool works well, better than anything the district evaluated. The tech director discovers it when a parent asks why their child’s browser history shows a tool they’ve never heard of. The teacher’s response: “Nobody told me I couldn’t.” She’s right. There was no approved tool list, no evaluation process communicated to staff, and no clear prohibitions on unvetted tools.

The lesson: Evaluation means nothing if teachers don’t know that an approved list and an evaluation process exist, or don’t understand why they matter.

Need help building your AI tool evaluation protocol?

Dr. Matt Rhoads works with districts and schools to develop evaluation frameworks, pilot processes, and adoption protocols tailored to local context. Contact Dr. Matt Rhoads to schedule a consultation.

This guide is part of the AI in Education Guide series. Related guides: AI for District Leaders, AI for Instructional Coaches, AI + Special Education & Co-Teaching.
