5.3

Evaluating New AI Tools Against Established Frameworks

12 min

The Problem of the Blank Assessment

Every practitioner who maintains an active AI practice will regularly encounter new AI tools, new model versions, new integration opportunities, and new platform capabilities that present themselves as potentially relevant to their professional work. These encounters are not occasional events. They are a regular feature of the information environment described in Section 1, and they will become more rather than less frequent as AI capability continues to develop and as the commercial ecosystem surrounding AI in professional services continues to expand. The practitioner who does not have a structured approach to evaluating new tools will face one of two equally unsatisfactory responses to this regularity.

The first unsatisfactory response is attempting a comprehensive evaluation of every new tool that appears potentially relevant. A comprehensive evaluation of an AI tool for professional use involves assessing its technical capabilities across multiple task types, investigating its data handling terms and their compliance with applicable regulatory requirements, determining whether its integration options are compatible with the practitioner's existing workflow infrastructure, testing its performance on representative professional tasks, and assessing whether the compliance and governance dimensions of its deployment satisfy the standards the practitioner's professional obligations require. A genuinely thorough evaluation of this kind represents a significant investment of professional time, and conducting it for every new AI tool that attracts the practitioner's attention would consume the professional time that AI assistance is intended to free.

The second unsatisfactory response is avoiding evaluation altogether and adopting new tools on the basis of peer recommendation, vendor assurance, or general impression of apparent capability. This approach produces AI practice changes that are not grounded in principled assessment of whether the new tool is appropriate for the practitioner's specific professional context, and it creates the risk of data handling decisions, integration configurations, and professional reliance on AI outputs that have not been examined against the standards that the practitioner's professional obligations require.

The resolution to this dilemma is the application of the analytical frameworks developed in Stage 4 to the evaluation of new AI tools and capabilities. These frameworks, which the practitioner has already invested in understanding and applying to the construction of their existing AI practice, provide a structured, efficient, and reliable approach to evaluating any new tool or capability against consistent professional criteria. They make evaluation faster because the criteria are pre-established and do not need to be developed from scratch for each evaluation. They make evaluation more reliable because the criteria are grounded in the stable properties of professional AI use identified throughout this programme rather than in the variable judgments that ad hoc assessment produces. And they make evaluation more defensible because the practitioner can articulate the specific criteria against which a new tool was assessed and the specific findings on each criterion that justified the decision to adopt, defer, or decline.

The Evaluation Sequence and Its Logic

Before examining how each framework is applied in the evaluation of a new tool, it is useful to address the sequence in which the frameworks should be applied. The sequence matters because the frameworks address different dimensions of the evaluation question, and some dimensions are more constraining than others in the sense that a finding on one dimension may make further evaluation unnecessary. Applying the frameworks in the wrong sequence wastes evaluation effort by conducting detailed assessment on dimensions that would not affect the decision even if they produced strongly favourable findings, because a constraining dimension has already produced an unfavourable finding that determines the outcome.

The recommended evaluation sequence applies the sensitivity framework first, the model selection framework second, and the integration decision framework third. This sequence reflects the constraining relationship between the three frameworks: the sensitivity framework establishes whether and under what conditions the new tool can be used with the practitioner's professional information at all, which is a prerequisite for any further assessment. The model selection framework establishes whether the tool is well-suited to the specific task types for which the practitioner is considering it, which determines whether the tool represents a significant improvement over the practitioner's current approach. The integration decision framework establishes whether building a technical integration with the tool is justified by the value it would deliver relative to its cost and maintenance burden, which is the final determination of what form the tool's incorporation into the practice should take.

A new tool that fails the sensitivity assessment at the first stage should not be evaluated against the model selection framework, because no finding on model selection would change the determination that the tool cannot be used with the practitioner's professional information under the applicable data handling terms. A new tool that passes the sensitivity assessment but that fails the model selection assessment because it is poorly suited to the task types for which the practitioner is considering it should not be evaluated against the integration decision framework, because no finding on integration economics would justify building an integration for a tool that does not address the practitioner's actual professional need more effectively than their current approach. The sequential application of the frameworks in this order ensures that evaluation effort is concentrated where it produces the most information and stops before it produces effort without value.
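The stopping logic of this sequence can be sketched in a few lines of Python. This is an illustration only: the function name `evaluate_tool`, the stage labels, and the idea of representing each assessment as a callable that returns pass or fail are assumptions made for the example, not part of the frameworks themselves.

```python
from typing import Callable, List, Tuple

# Each stage is a (label, assessment) pair. The assessment is a callable that
# returns True when the tool passes that framework. All names are hypothetical.
Stage = Tuple[str, Callable[[], bool]]


def evaluate_tool(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run the stages in order and stop at the first unfavourable finding.

    Returns (passed_all, labels_of_stages_actually_run).
    """
    stages_run: List[str] = []
    for label, assess in stages:
        stages_run.append(label)
        if not assess():
            # A constraining framework has failed, so later stages are skipped.
            return False, stages_run
    return True, stages_run


# Example: the tool passes the sensitivity assessment but fails model
# selection, so the integration decision framework is never applied.
result = evaluate_tool([
    ("sensitivity", lambda: True),
    ("model_selection", lambda: False),
    ("integration_decision", lambda: True),
])
```

In this example the third stage is never run, which is the saving in evaluation effort that the recommended order is intended to produce.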

Applying the Sensitivity Framework: The Threshold Question

The sensitivity framework, developed in Module 4.1, is the most constraining of the three evaluation frameworks because it addresses the fundamental question of whether the practitioner can use a new AI tool with the specific categories of professional information their work involves. A new tool that cannot be used with the practitioner's professional information in compliance with applicable data handling obligations, regardless of its technical capabilities or integration potential, is a tool that cannot be incorporated into the practitioner's professional AI practice in the way being considered.

The sensitivity assessment for a new AI tool applies the three-tier classification of professional information, distinguishing public information, internal operational information, and confidential information subject to specific legal or regulatory protection, to the specific task types for which the practitioner is considering the tool. The relevant question is not whether the practitioner handles confidential information in general, but whether the specific task types for which the new tool is being evaluated involve information from the confidential tier that would need to be submitted to the tool in order for it to perform the task.

For each tier of information that the relevant task types involve, the assessment then examines what deployment configuration and what data handling terms the sensitivity framework requires. Public information requires no special data handling provisions and can be used with any reputable AI tool operating under standard commercial terms. Internal operational information requires, at minimum, a provider commitment that submitted data will not be used for model training, and the practitioner should confirm that this commitment is explicitly present in the terms of service or privacy policy applicable to the service tier they would use. Confidential information subject to specific legal or regulatory protection, including attorney-client privileged communications, personal data subject to GDPR special category protections, material non-public financial information, and information subject to specific contractual confidentiality obligations, requires a data processing agreement that satisfies the applicable regulatory standard, and may require an enterprise agreement with specific provisions rather than standard commercial terms.
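The tier-by-tier requirements can be written down as a small lookup, which makes the gap between what a task requires and what a provider offers explicit. The sketch below is a simplification under stated assumptions: the term labels are invented for the example, the confidential tier is assumed to inherit the internal tier's no-training requirement, and the possible need for an enterprise agreement is left to the practitioner's judgment rather than encoded.

```python
from enum import Enum
from typing import Iterable, Set


class Tier(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3


# Minimum terms associated with each tier. The labels are illustrative,
# not legal categories. Assumption: CONFIDENTIAL inherits INTERNAL's terms.
MINIMUM_TERMS = {
    Tier.PUBLIC: set(),
    Tier.INTERNAL: {"no_training_commitment"},
    Tier.CONFIDENTIAL: {"no_training_commitment", "data_processing_agreement"},
}


def missing_terms(tiers_involved: Iterable[Tier], terms_offered: Iterable[str]) -> Set[str]:
    """Return the terms still missing for the highest tier the task types involve."""
    highest = max(tiers_involved, key=lambda tier: tier.value)
    return MINIMUM_TERMS[highest] - set(terms_offered)


# Example: a task type touching confidential information, assessed against a
# service tier that offers only a no-training commitment.
gap = missing_terms({Tier.INTERNAL, Tier.CONFIDENTIAL}, {"no_training_commitment"})
```

An empty result means the offered terms meet the minimum for the relevant tier; it does not replace a reading of the terms themselves.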

The specific data handling claims that new AI tools make about their privacy practices and data governance should be examined against several specific criteria. The practitioner should identify which entity processes the data submitted to the tool, and whether that entity is established within the European Economic Area or processes data under a transfer mechanism that satisfies GDPR requirements for international data transfers. The practitioner should identify whether the terms of service and privacy policy applicable to the relevant service tier include an explicit commitment that submitted data will not be used for model training or improvement purposes, and should not assume this commitment is present merely because it is present for a different service tier of the same provider. The practitioner should identify whether the provider offers a data processing agreement as required by GDPR Article 28 for the processing of personal data, and whether the data processing agreement's provisions satisfy the specific requirements applicable to the category of personal data that the practitioner's task types involve.

The red flags in a new tool's sensitivity assessment are specific and important. A tool whose terms of service reserve the right to use submitted data for model training without explicit opt-out provisions is a tool that cannot be used with internal operational information unless the practitioner's organisation has negotiated an enterprise agreement that removes this provision. A tool that does not offer a data processing agreement, or that offers one only at an enterprise service tier whose cost is disproportionate to the practitioner's use case, may be unable to accommodate the legal requirements for processing personal data from any tier of the sensitivity classification. A tool that processes data on infrastructure outside the European Economic Area under a transfer mechanism that has been challenged or invalidated by regulatory action or judicial decision requires careful legal assessment before use with any personal data, regardless of the practitioner's service tier.

The output of the sensitivity assessment is one of three findings. The first is that the tool is suitable for use with the specific information categories involved in the relevant task types under the standard commercial terms applicable to the practitioner's intended service tier: the practitioner can proceed to the model selection assessment. The second is that the tool may be suitable with enhanced terms, specifically an enterprise agreement or a data processing agreement that addresses the provisions identified as necessary for the relevant information categories: the practitioner should determine whether the enhanced terms are available and accessible before proceeding to the model selection assessment, because a tool that requires enterprise pricing to be used in compliance with the practitioner's obligations is a tool whose economics must be assessed at the outset rather than after the model selection assessment has been conducted. The third is that the tool cannot be used in compliance with the practitioner's data handling obligations for the relevant task types regardless of available terms: the evaluation stops at this stage.
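The three findings and the action attached to each can be sketched as follows. The two boolean inputs are assumptions made for the example: in practice each one summarises the practitioner's reading of the applicable terms, not a mechanical check.

```python
from enum import Enum


class Finding(Enum):
    SUITABLE_STANDARD_TERMS = "proceed to the model selection assessment"
    SUITABLE_WITH_ENHANCED_TERMS = "confirm enhanced terms are available and affordable first"
    NOT_SUITABLE = "stop the evaluation"


def sensitivity_finding(standard_terms_sufficient: bool,
                        enhanced_terms_would_suffice: bool) -> Finding:
    """Map the practitioner's two judgments onto one of the three findings."""
    if standard_terms_sufficient:
        return Finding.SUITABLE_STANDARD_TERMS
    if enhanced_terms_would_suffice:
        return Finding.SUITABLE_WITH_ENHANCED_TERMS
    return Finding.NOT_SUITABLE


# Example: standard terms fall short, but an enterprise agreement would
# address the gap.
finding = sensitivity_finding(False, True)
```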

Applying the Model Selection Framework: The Fit Question

The model selection framework, developed in Module 4.2, addresses the second evaluation question: whether the new tool is well-suited to the specific task types for which the practitioner is considering it, and whether it represents an improvement over the practitioner's current approach to those tasks. The four dimensions of the framework, task type, information sensitivity, accuracy threshold, and platform integration requirement, are applied in the evaluation of a new tool in a specific way that is distinct from their application in the original construction of the practitioner's AI practice.

In the original model selection decisions described in Module 4.2, the practitioner was selecting among available options to identify which best met their professional needs. In the evaluation of a new tool, the practitioner is assessing a specific tool against criteria that are now well-established from the practitioner's experience of their existing AI practice. The evaluative question is not which tool is best in the abstract but whether this specific new tool is better suited to specific task types in the practitioner's context than the tool the practitioner is currently using, and whether the improvement, if any, is sufficient to justify the transition costs of adopting the new tool.

The task type dimension of the assessment examines whether the new tool's capability profile matches the specific task types for which the practitioner is considering it more closely than their current tool's profile does. The practitioner who is considering a new tool for long-document analysis should assess whether the new tool's context window, document processing accuracy, and instruction-following capability on long, complex professional documents are materially better than those of their current tool on the same task type. The practitioner who is considering a new tool for real-time information access should assess whether the tool's connection to current information sources provides a meaningful improvement in the timeliness and reliability of the information it accesses compared to their current approach. The assessment of task type fit should be grounded in the specific professional task types the practitioner actually performs rather than in general benchmark comparisons, because general benchmark performance is, as established in Section 3, transient knowledge of limited professional applicability.

The accuracy threshold dimension of the assessment examines whether the new tool's accuracy characteristics in professional use conditions are sufficient for the task types being evaluated. The practitioner should assess this dimension with specific reference to the verification standard that the task type requires, since a task type that requires every factual claim to be verified against a primary source because the consequence of an undetected error is severe imposes a different accuracy requirement than a task type where AI assistance is used for initial drafting and the verification standard is proportionate to the consequence of error. The new tool's accuracy should be assessed in conditions representative of real professional use rather than in optimised demonstration conditions, and the most reliable evidence of accuracy in real professional use conditions is peer practitioner experience with the tool in comparable professional contexts, as described in Section 4.

The platform integration dimension of the assessment examines whether the new tool offers integration capabilities with the practitioner's primary professional tools that are superior to those offered by their current AI tool, and whether any integration advantage is sufficient to justify the transition costs of switching. A new tool that offers a materially better native integration with the practitioner's primary document management system, email platform, or industry-specific professional tool may offer a friction-reduction advantage that outweighs a modest disadvantage on the task type dimension. A new tool that offers integration options identical to the practitioner's current tool, or that offers a less convenient integration configuration, should not be adopted on the basis of minor task type or accuracy advantages because the transition costs of rebuilding integrations and retraining the prompting and verification habits that the current tool's workflows have established are themselves a real and significant cost.

The output of the model selection assessment is a finding on whether the new tool represents a material improvement over the practitioner's current approach for the specific task types being evaluated, taking into account all four dimensions together. A material improvement on one or two dimensions that is offset by a disadvantage on the others does not support adoption. A material improvement across all relevant dimensions, or a decisive improvement on the most important dimension for the specific task type, does support further evaluation through the integration decision framework. The boundary between these findings requires the practitioner's judgment, calibrated by their specific knowledge of which dimensions are most important for their specific professional context, and this is an area where peer practitioner experience with the new tool in comparable professional contexts provides particularly valuable input.
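The decision rule in this paragraph can be approximated in code, with the caveat that the boundary cases are exactly where the practitioner's judgment is needed. The sketch below is one possible reading: the rating labels (`worse`, `same`, `material`, `decisive`), the function name, and the explicit `judgment_required` outcome are all assumptions introduced for the example.

```python
from typing import Dict

IMPROVED = ("material", "decisive")


def model_selection_outcome(ratings: Dict[str, str], most_important: str) -> str:
    """Classify per-dimension ratings of the new tool against the current tool.

    ratings maps each relevant dimension to 'worse', 'same', 'material' or
    'decisive'. Returns 'proceed', 'do_not_adopt' or 'judgment_required'.
    """
    improved = [d for d, rating in ratings.items() if rating in IMPROVED]
    worse = [d for d, rating in ratings.items() if rating == "worse"]

    # Decisive gain on the key dimension, or material gains on every dimension.
    if ratings[most_important] == "decisive" or len(improved) == len(ratings):
        return "proceed"
    # No gain at all, or gains offset by a disadvantage elsewhere.
    if not improved or worse:
        return "do_not_adopt"
    # Partial gains with no offsetting disadvantage: left to the practitioner.
    return "judgment_required"


# Example: a better task type fit that is offset by weaker accuracy.
outcome = model_selection_outcome(
    {"task_type": "material", "accuracy": "worse", "integration": "same"},
    most_important="accuracy",
)
```

Returning an explicit `judgment_required` value for the partial-gain case keeps the sketch from implying that the boundary can be resolved mechanically.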

Applying the Integration Decision Framework: The Investment Question

The integration decision framework, developed in Module 4.3, addresses the third evaluation question: whether incorporating the new tool into the practitioner's practice, and specifically whether building a technical integration with it, is justified by the value it would deliver relative to the cost and maintenance burden it would impose. The five questions of the framework are applied to the new tool with specific attention to the comparison between the new tool's value proposition and the practitioner's current approach.

The task frequency question establishes whether the task types for which the new tool is being considered occur frequently enough to justify the setup and maintenance costs of a technical integration. The practitioner should assess this against their actual working patterns rather than their intuitive estimate of task frequency, as the discrepancy between these is often significant and consistently moves in the direction of overestimating task frequency for tasks that feel prominent but are less frequent than they appear. For task types that occur less than once per week, the manual workflow that Stage 4's walkthroughs describe, where content is submitted to the AI tool through deliberate human selection rather than automated retrieval, is likely to be more efficient than maintaining a technical integration when the overhead of integration setup and maintenance is honestly accounted for.
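One way to replace an intuitive estimate with a measured one is to count occurrences in a work log over a fixed period. The sketch below assumes the practitioner has a list of dates on which the task was actually performed; the dates shown are invented for the example, and the once-per-week threshold is the one stated above.

```python
from datetime import date
from typing import Iterable


def weekly_frequency(task_dates: Iterable[date], start: date, end: date) -> float:
    """Average occurrences per week between start and end, inclusive."""
    weeks = ((end - start).days + 1) / 7
    occurrences = sum(1 for d in task_dates if start <= d <= end)
    return occurrences / weeks


def meets_weekly_threshold(task_dates: Iterable[date], start: date, end: date) -> bool:
    """True when the task occurs at least once per week on average."""
    return weekly_frequency(task_dates, start, end) >= 1.0


# Hypothetical log: three occurrences in a four-week period.
log = [date(2024, 1, 3), date(2024, 1, 12), date(2024, 1, 25)]
frequency = weekly_frequency(log, date(2024, 1, 1), date(2024, 1, 28))
```

Here the measured frequency is 0.75 occurrences per week, which falls below the threshold and so points toward the manual workflow.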

The realistic time-saving question requires the practitioner to disaggregate the task type into its component steps and assess which components the new tool would actually compress and which would remain as time costs regardless of the tool used. The tendency to overestimate time savings by focusing on the component the AI addresses, while overlooking the preparation, verification, and integration steps that remain, is particularly acute when evaluating a new tool that is presented with impressive capability claims. The practitioner who has experience of the time structure of the task type from their existing AI-assisted workflow is in a better position to make this assessment accurately than one who is evaluating the new tool in the absence of existing AI practice experience, because the existing practice provides a grounded baseline against which the new tool's incremental improvement can be measured.
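The disaggregation can be made concrete with simple arithmetic. The step names and minute values below are hypothetical figures chosen for illustration, not measurements; the point of the sketch is the contrast between the saving on the single step the tool addresses and the saving on the task as a whole.

```python
from typing import List, Tuple

# (step name, minutes with current approach, minutes with the new tool)
Step = Tuple[str, float, float]


def realistic_saving(steps: List[Step]) -> Tuple[float, float]:
    """Return (minutes saved per task, fraction of total task time saved)."""
    current_total = sum(current for _, current, _ in steps)
    new_total = sum(new for _, _, new in steps)
    saved = current_total - new_total
    return saved, saved / current_total


# Hypothetical task: only the drafting step is compressed by the new tool.
task = [
    ("preparation", 10, 10),
    ("drafting", 30, 10),
    ("verification", 20, 20),
]
minutes_saved, fraction_saved = realistic_saving(task)
```

With these figures the drafting step shrinks by two thirds, but the whole task shrinks by only one third (20 of 60 minutes), because preparation and verification are unchanged.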

The data access and professional obligations question examines the specific scope of access that a technical integration with the new tool would require and whether that scope is appropriate for the information it would access. This question is distinct from the sensitivity assessment conducted at the first stage of the evaluation, which addressed the general question of whether the tool can be used with the practitioner's professional information at all. The integration access question is more specific, examining what information the technical integration would give the tool access to beyond the specific content the practitioner intends to submit for AI-assisted processing, and whether that broader access is appropriate. An integration that gives an AI tool access to the practitioner's full email system in order to assist with drafting specific categories of correspondence also gives the tool access to all other emails in the system, including those that may contain information from the confidential tier that was not contemplated when the integration was configured. The practitioner should explicitly map the scope of access that each contemplated integration would create and assess whether every category of information within that scope is appropriate for the tool to access under its applicable data handling terms.
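The scope-mapping exercise can be sketched as a comparison between everything the integration can reach and the tiers the tool's terms permit. The category names, tier labels, and function name below are assumptions made for the example.

```python
from typing import Dict, List, Set


def categories_needing_review(scope: Dict[str, str], permitted_tiers: Set[str]) -> List[str]:
    """List the reachable categories whose tier the tool's terms do not cover.

    scope maps each information category the integration can reach to its
    sensitivity tier ('public', 'internal' or 'confidential').
    """
    return sorted(category for category, tier in scope.items()
                  if tier not in permitted_tiers)


# Hypothetical email integration configured for routine drafting, under terms
# judged adequate for public and internal information only.
email_integration_scope = {
    "routine client correspondence": "internal",
    "meeting scheduling": "internal",
    "privileged legal advice": "confidential",
}
flagged = categories_needing_review(email_integration_scope, {"public", "internal"})
```

Any category that appears in the result is within the integration's reach but outside what the applicable terms were assessed to cover, which is the situation the mapping is intended to surface before the integration is configured.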

The configuration capability question examines whether the practitioner has the time and the technical capability to configure the integration correctly and to test it thoroughly before connecting it to live professional information. The discipline of testing against non-sensitive representative data before connecting to live professional data, established in Module 4.3, applies with full force to the evaluation of a new tool's integration potential. An integration that has not been tested against representative data is an integration whose failure modes in real professional use have not been identified, and those failure modes may include data handling errors as well as output quality problems.

The maintenance commitment question examines the realistic ongoing maintenance burden of the contemplated integration, taking into account the new tool provider's track record of API stability, the frequency with which the provider has historically revised its terms of service and data handling provisions, and the technical complexity of the integration configuration that the tool's API requires. A new tool from a provider with a track record of frequent API changes, aggressive deprecation of older model versions, and regular revisions to data handling terms represents a higher ongoing maintenance commitment than a new tool from a provider with a more stable technical and commercial track record, and this difference should be reflected in the integration decision assessment.

The output of the integration decision framework is a specific decision about the form in which the new tool, having passed the sensitivity and model selection assessments, should be incorporated into the practitioner's practice. The decision options are adoption through a manual workflow without technical integration, adoption through a technical integration in the areas where the integration economics are justified by the five-question assessment, or deferral pending additional information about the tool's reliability in professional use or its provider's commitment to the data handling and API stability properties on which the integration's value proposition depends. Where adoption through a technical integration is indicated, the sequential build approach from Module 4.3 applies to the new tool's integration with full force: one integration, tested thoroughly, before any additional integrations are contemplated.

The Compounding Value of Consistent Framework Application

The evaluation framework described in this section produces its maximum value not through any individual evaluation but through the discipline of applying the same framework consistently across every new tool evaluation the practitioner conducts over time. Consistent application produces two specific forms of compounding value that ad hoc evaluation does not.

The first is the development of calibrated evaluative judgment. The practitioner who has applied the sensitivity framework to ten new tool evaluations over the course of two years has developed a reliable intuition for how quickly specific provisions in AI tool terms of service can be located, what the standard provisions are and where they fall short of the requirements for specific information categories, and which red flags in a tool's data handling claims require immediate specialist input rather than further self-assessment. This calibrated intuition makes each subsequent evaluation faster and more reliable than the practitioner could achieve without the accumulated experience of consistent framework application.

The second is the development of a documented evaluation history that supports professional accountability for AI practice decisions. The practitioner who can demonstrate through documented evaluation records that each AI tool and integration in their current practice was assessed against consistent professional criteria and found to meet them is in a stronger professional position than one who cannot reconstruct the basis for their practice decisions. As the professional and regulatory environment for AI practice continues to develop, the ability to demonstrate that practice decisions were made on principled, documented grounds will become an increasingly important dimension of professional accountability. The consistent application of the evaluation frameworks described in this section is the practice that builds this documented evaluation history over time.
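A documented evaluation history does not require specialised software. As a minimal sketch, each evaluation could be stored as a structured record and serialised to JSON; the field names, the example tool name, and the example values below are all hypothetical, and a practitioner would adapt them to their own record-keeping obligations.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EvaluationRecord:
    tool_name: str
    evaluated_on: str                 # ISO 8601 date, e.g. "2025-01-15"
    task_types: List[str]
    findings: Dict[str, str] = field(default_factory=dict)  # framework -> finding
    decision: str = "defer"           # "adopt", "defer" or "decline"


# Hypothetical record for a tool that failed the sensitivity assessment, so
# the later frameworks were not applied.
record = EvaluationRecord(
    tool_name="ExampleTool",
    evaluated_on="2025-01-15",
    task_types=["long-document analysis"],
    findings={"sensitivity": "no data processing agreement at the intended service tier"},
    decision="decline",
)

# Serialise with stable key order so that records are easy to compare over time.
serialised = json.dumps(asdict(record), indent=2, sort_keys=True)
```

Keeping the criteria and the finding on each criterion in the same record is what allows the basis for a decision to be reconstructed later.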