If you are in distress, you can call or text 988 at any time. If it is an emergency, call 9-1-1 or go to your local emergency department.

Assessment Framework for Mental Health Apps

3. Clinical Evidence Standards

Where effectiveness claims are not made directly by the developer, the necessity of evidence still needs to be considered, both in relation to the app’s claimed or implied benefits and in relation to the risk of harm associated with its use. As highlighted in the App Overview, the proportionality of evidence/assurance can initially be evaluated before an assessment of necessity, appropriateness, and quality.

Proportionality must be considered because it is unrealistic to expect all apps to provide the same types of evidence. ORCHA’s Adapted Evidence Standards Framework (adapted ESF) gauges the functional complexity of the app and the risk of harm to users, using both to guide the evidence requirement. For example, a higher level of evidence is called for with apps that have more complex functionality, since they harbour higher risks (see subsection 3d). Apps that cannot meet this evidential requirement may be able to provide alternative credentials to pass the professional assurance standard; this is considered on a case-by-case basis. Subsection 3d also provides more information on alternative credentials.

Necessity focuses on whether evidence of effectiveness is reasonably required in the first place and whether this requirement places a disproportionately unfair burden on developers, compared with similar non-digital solutions. An example would be an app that provides a diary for those with depression/anxiety to write down their thoughts. In such a case, a question needs to be asked as to whether a similar notebook bought from the local store would require such evidence. If the adapted ESF indicates, “Yes, the app needs to provide evidence because there is a potential added risk associated with its use,” it then becomes necessary to determine which type of evidence is required: evidence of effectiveness, evidence of safety, or both. For instance:

  • Safety scenario. If an app makes use of an established clinical calculator, it may provide indirect evidence that shows the safety of a specific calculation or algorithm but not in the specific context of the app. In this instance we would look for assurance that the app has accurately replicated the relevant algorithm and that it functions and produces outputs identically, without risk of misinterpretation.
  • Effectiveness scenario I. If an app claims or implies a specific benefit, as highlighted in the App Overview, we would require evidence of effectiveness to support its claimed or implied benefit.
  • Effectiveness scenario II. If an app directs readers to indirect evidence to support claimed/implied benefits, this can be considered sufficient if the mode and function of the app are identical in all material ways to the solution identified in the indirect evidence (see subsection 3b).

Appropriateness is determined by whether any investigations or research concerning the effectiveness or safety of an app have been conducted using a representative sample group and appropriate evaluation methods. For instance, mandating a randomized controlled trial (RCT) for diagnostics would not intuitively make sense. The target audience of the app is identified in the App Overview, and the evidence must show that the research selected a sample group with the same key characteristics (e.g., age range, gender). If the research was not conducted using an appropriate sample group, or an appropriate means of evaluation, then its quality cannot be assessed.

Appropriateness can also be assessed for apps that do not require evidence, but rather assurance; i.e., those associated with a much lower risk profile (typically NICE ESF Tier 2b or lower). The assessor would research whether any clinicians were involved in the app’s development and whether they held appropriate qualifications. The appropriateness of any statements, referenced guidelines, and relevant information can also be assessed at this point.

Quality also relates to whether evidence or assurance is deemed appropriate. If the app needs to provide evidence of effectiveness, and this threshold for the appropriate type of evidence has been met, then the quality of that evidence can be assessed. For example:

  • In the case of digital therapeutics, quality is considered through the evaluation of significant p values (p < 0.05) and comparators/validated comparators (as outlined in the criteria below).
  • In the case of diagnostics, identification of either improvements or non-significant reductions in diagnostic accuracy, in terms of confidence-interval overlap and AUROC (sensitivity and specificity criteria), is also used.

Though these are the most widely used quality indicators, quality is assessed on a case-by-case basis to make sure the requirement is proportionate to the app’s functionality and claims.

As mentioned, when evidence for diagnostics and treatments is considered, p values of < 0.05 can be inappropriate. Evidence here is taken on a case-by-case basis, but regular themes that demonstrate safety include high levels of sensitivity/specificity or close agreement with the judgment of a qualified clinician.
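The p value and comparator indicators referenced above (and in criteria 3a-Q6 through 3a-Q10 below) can be captured in a small helper. This is a minimal sketch under assumed conventions: the parameter and flag names are hypothetical, while the thresholds (0.05 for significance, 0.2 for near-significance) are taken from the criteria in this section.

```python
# Illustrative sketch only: parameter names are assumptions; thresholds
# follow the quality indicators described above (p < 0.05 significant,
# p < 0.2 near-significant, presence of a comparator / validated comparator).

def evidence_quality_flags(p_value=None, has_comparator=False,
                           comparator_validated=False):
    """Return the quality flags an assessor might record for one study."""
    return {
        "significant": p_value is not None and p_value < 0.05,
        "near_significant": p_value is not None and p_value < 0.2,
        "comparator": has_comparator,
        "validated_comparator": has_comparator and comparator_validated,
    }
```

For example, a study reporting p = 0.15 with no comparator would be flagged as near-significant but not significant, which matters for the Tier 2b and Tier 3a checks discussed later in this section.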

 

Criteria

Criteria Origin

3a — Q1

What type(s) of evidence is available?

ORCHA

3a — Q2

Provide links to the publicly available or published evidence supplied by the developer.

ORCHA

3a — Q3

For each type of relevant evidence —

  • What category does the evidence relate to?

ORCHA

3a — Q4

For each type of relevant evidence —

  • What benefit does the evidence relate to?

ORCHA

3a — Q5i

For each type of relevant evidence —

  • What is the sample size?

MHCC

3a — Q5ii

  • Does the sample reflect the app’s target audience, as stated by the developer?

MHCC

3a — Q6

For each type of relevant evidence —

  • Does the evidence found provide a p value?

ORCHA

3a — Q7

For each type of relevant evidence —

  • Does the p value demonstrate significance (p < 0.05)?

ORCHA

3a — Q8

For each type of relevant evidence —

  • Does the p value demonstrate near significance (p < 0.2)?

ORCHA

3a — Q9

For each type of relevant evidence —

  • Is there a comparator?

ORCHA

3a — Q10

For each type of relevant evidence —

  • Is the comparator validated?

ORCHA

If an application uses accepted behaviour change techniques that already have a strong evidence base, the developer may choose not to fund additional research. For example, if an application incorporated dialectical behaviour therapy (a behaviour change technique), the developer may choose to refer to its strong evidence base instead of conducting its own research.

 

Criteria

Criteria Origin

3b — Q1

Does the application have its own high-quality study?

ORCHA

3b — Q2

Does the application reference and evidence its behaviour change technique?

ORCHA

Professional backing refers to evidence that an appropriate professional was involved in an application’s design and development. The relevant professional will differ depending on the context. For example, for a simple meditation application, a qualified meditation instructor would be accepted as an appropriate professional. For a complex clinical solution such as an application that claims to treat depression, a relevant qualified clinician would be necessary.

Professional backing can be inferred if the application has been externally accredited. External accreditations are wide ranging, from national health bodies to charities. Like the appropriate professional role, it is essential that the accreditation come from an appropriate body that is relevant to the application. Note: while the criteria below relate to professional assurance, Q9-Q14 also strongly focus on the developer’s efforts to ensure that the app is safe.

 

Criteria

Criteria Origin

3c — Q1

Is there a suitably qualified professional involved in the application’s development team?

ORCHA

3c — Q2

Please list the licensed health-care professionals involved in the delivery of the app.


Guidance:

This question applies only if a health-care professional is involved in the delivery of the app as identified through 1h — Q5.

MHCC

3c — Q3

Does the organization behind the application have relevant credentials? 

ORCHA

3c — Q4

Is there evidence of an endorsement by a relevant body?

ORCHA

3c — Q5

Are organizations using the app?

ORCHA

3c — Q6

What type of organization is using the app?

MHCC

3c — Q7

Is there a statement that the app has been positively evaluated or validated by a relevant health-care professional?

ORCHA

3c — Q8

Please specify who the relevant experts are and what qualifications they hold. 

ORCHA

3c — Q9

Is there evidence within the application that the developer has validated any guidance with relevant reliable information sources or references? 

ORCHA

3c — Q10

Is there a statement or any evidence showing that appropriate safeguarding measures are in place around peer support and other communication functions within the platform?

  • (Tier 2a requirement: only asked of apps that require such measures because of their functional capabilities/intended purpose)

ORCHA

3c — Q11

Does the application offer 24-7 peer or clinical support?

Guidance:

The support can be offered via chat/consultation on demand or another 24-7 resource(s). This question only applies to apps that discuss suicidal ideation or meet Tier 2b and above criteria (from the ESF Tiers).

MHCC

3c — Q12

Does the developer clearly identify who the application should and should not be used by? 

ORCHA

3c — Q13

Does the developer publish their risk management processes?

ORCHA

3c — Q14

Does the developer make clear the risks associated with using the app?

ORCHA

3c — Q15

Is there a way for the user to confirm that the data input is accurate?

ORCHA

3c — Q16

Does the app direct users to a government website (provincial/territorial/federal)?

Relevant federal examples:

Guidance:

This question only applies if the app does not provide chat/consultation on demand or another 24-7 resource.

MHCC

Every app is expected to provide some level of evidence or assurance. It was agreed that this level of evidence and assurance should remain proportional to the app’s functionality and claims.

The framework makes use of ORCHA’s Adapted Evidence Standards Framework (adapted ESF), which amended the original NICE ESF so that the requirements were fair to mobile health applications.

The adapted ESF works by giving each app an “ESF Tier” based on its functionality. The adapted ESF Tier then determines what level of evidence should be provided. Passing its tier requirement by meeting its level of evidence and assurance would positively impact the app’s review.

An application’s ESF Tier is determined by what it offers.

The app would be classified as Tier 3b if it does one of the following:

  • Diagnoses a mental health problem (1e-Q2 is yes)
  • Contains a novel clinical calculator that impacts care, treatment, or diagnosis (1e-Q11 is yes)
  • Automatically measures and/or records data about a user’s specified mental health problem and transmits the data to a professional, carer, or third-party organization, without any input from the user (1g-Q7 is yes)
  • Provides treatment (1e-Q15 is yes)
  • Guides the treatment of a mental health problem (1e-Q17 is yes)
  • Alleviates the symptoms of an existing mental health problem (1e-Q25 is yes)

An app would be classified as Tier 3a if it does none of the things listed above but does one of the following:

  • Serves as a complex self-management app (selected in 1g-Q4)
  • Includes preventive behaviour change within the app (selected in 1g-Q10)
  • Has a recognized (not novel) clinical calculator within the app (1e-Q11 is yes and 1e-Q12 mentions an established clinical calculator)

Wysa is a mental health support application that provides a chatbot for users to discuss anything from sleep to stress. It is a complex self-management application, which would place it under Tier 3a or 3b, depending on further functionalities. It also enables users to assess themselves symptomatically based on depression tests like the PHQ-9 and anxiety tests like the GAD-7. These would be considered clinical calculators, but since they are widely recognized and established resources, the application would not be raised to Tier 3b. This app remains at Tier 3a due to its complex self-management functionality and its specific use related to a mental health problem.

An application containing a novel clinical calculator would not be using an established tool and would likely not have as much recognition or reference. This type of application would therefore be subject to more rigour and would be considered Tier 3b.

An app would be designated Tier 2b if it does none of the things listed in the 3b/3a tiers and is classified as a standard self-management application (selected in 1g-Q4).

An application would be classified as Tier 2a if it does none of the things listed in the 3b/3a/2b tiers and does one of the following:

  • Provides information or guidance (1d-Q1 is yes) or allows a health-care professional to provide clinical advice, as opposed to the app providing it (1g-Q3 is yes)
  • Provides information, resources, or activities to the public, users, or clinicians, either about a specific mental health problem or general health and lifestyle (1d-Q5 is yes)
  • Provides two-way communication between users, citizens, or health-care professionals (1l-Q4 is yes) or is a simple self-management app (selected in 1g-Q4)

Moodbeam is an application that helps users keep track of how they are feeling. It has two simple buttons to represent when the user feels a high and when they feel a low, and this information can then be explored as patterns and trends over time. This application is a good example of simple self-management, which is classified as Tier 2a.

A simple self-management application allows users to monitor their non-specific mental health-problem data, which can then be displayed back to them in a simple format. Since Moodbeam enables users to monitor their mood and feelings and see their data in a simple graph form, it is considered simple self-management and therefore appropriate for Tier 2a.

If this application allowed users to monitor data that was specific to a mental health problem, such as reporting when they felt clinical depression or generalized anxiety, then it would no longer fit into Tier 2a. When an application is able to monitor specific mental health problem data and display it back to the user in simple graph form, it is considered standard self-management and therefore appropriate for Tier 2b.

An application would be considered Tier 1 if it does none of the things listed in the 3b/3a/2b/2a tiers and provides no user outcomes. For example, it may act as an administrative application that helps deliver health-related systems/services, or it could work as a maintenance application to report and fix issues around a hospital.

Thalamos is a web application that helps health-care professionals complete and manage forms for use under the U.K.’s Mental Health Act. It is a good Tier 1 example because it simply replaces pen and paper. Since the application facilitates administration and has no direct impact on user outcomes, it could not be classified any higher than Tier 1.
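The tier ladder described above can be summarized as a simple decision function. This is an illustrative sketch only: the dictionary keys mirror the criteria references in the text (e.g., 1e-Q2), while the value conventions — booleans for yes/no answers, a string for the 1g-Q4 self-management level, and a hypothetical 1e-Q12-established flag for whether a calculator is an established tool — are assumptions made for illustration.

```python
def esf_tier(answers):
    """Classify an app into an adapted-ESF tier from assessment answers.

    `answers` maps criteria references (e.g., "1e-Q2") to responses;
    missing answers are treated as "no".
    """
    a = answers.get

    calc = a("1e-Q11")                          # contains a clinical calculator
    calc_established = a("1e-Q12-established")  # hypothetical flag for 1e-Q12

    # Tier 3b: diagnosis, novel calculator, automatic measurement and
    # transmission, or treatment-related functionality.
    if any([a("1e-Q2"), calc and not calc_established, a("1g-Q7"),
            a("1e-Q15"), a("1e-Q17"), a("1e-Q25")]):
        return "3b"

    # Tier 3a: complex self-management, preventive behaviour change,
    # or a recognized (established) clinical calculator.
    if a("1g-Q4") == "complex" or a("1g-Q10") or (calc and calc_established):
        return "3a"

    # Tier 2b: standard self-management.
    if a("1g-Q4") == "standard":
        return "2b"

    # Tier 2a: information/guidance, resources, two-way communication,
    # or simple self-management.
    if any([a("1d-Q1"), a("1g-Q3"), a("1d-Q5"), a("1l-Q4"),
            a("1g-Q4") == "simple"]):
        return "2a"

    return "1"  # no user outcomes (e.g., purely administrative)
```

Under these assumptions, an app like Wysa (complex self-management plus established calculators) classifies as Tier 3a, while an administrative tool like Thalamos falls through to Tier 1.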

Once an app has been assessed for its ESF Tier classification, the next aim is to understand whether it is compliant with that tier. Doing so requires that the relevant flagged criteria are met by positive answers.

Here, it is important to note that requirements are cumulative, meaning that an app at Tier 3a must also meet the requirements of the lower tiers. However, if the app has an RCT, which is acceptable at Tier 3b, there is no requirement for a separate observational study; one study (an RCT) would be enough in that case.

The app would have met Tier 1 minimum requirements if it has

  • evidence of a survey, pilot study, meta-analysis, RCT, observational study, or other indicator of user acceptance/benefit (3a-Q1 does not contain none)

and at least one of the following:

  • evidence of a relevant professional involved in the development team (3c-Q1 is yes)
  • relevant organizational credentials (3c-Q2 is yes)
  • evidence of endorsement by a relevant body (3c-Q3 is yes).

The app would have met Tier 2a minimum requirements if it has

  • evidence that the developer has validated the information, advice, or guidance with relevant and appropriate academic studies or relevant academic expert input (3c-Q1 or 3c-Q2 or 3c-Q7 is yes)
  • clear evidence of safeguarding measures being in place for any communication functions (4b-Q1 is yes, if applicable)
  • evidence of accrediting expertise (3c-Q1 or 3c-Q2 or 3c-Q3 or 3c-Q5 is yes).

The app would have met Tier 2b minimum requirements if it has

  • evidence that the developer has validated the information, advice, or guidance (3c-Q1 or 3c-Q7 is yes)
  • clear evidence of safeguarding measures being in place for any communication functions (4b-Q1 is yes, if applicable)
  • evidence of accrediting expertise (3c-Q1 or 3c-Q2 or 3c-Q5 is yes)
  • evidence of an endorsement by a relevant body (3c-Q3 is yes) or a meta-analysis, or an observational study/RCT with a p value < 0.05 (3a-Q1 is a yes, and one of the 3a-Q7 answers is a yes).

The app would have met Tier 3a minimum requirements if it has

  • evidence of an RCT (3a-Q1 answer includes RCT) that has a significant p value (3a-Q7 is yes) or evidence of an observational study (3a-Q1 answer includes observational) that has a significant p value (3a-Q7 is yes)
  • a comparator (3a-Q9 is yes) or a validated comparator (3a-Q10 is yes).

The app would have met Tier 3b minimum requirements if it has

  • evidence of an RCT (3a-Q1 answer includes RCT) that has a significant p value (3a-Q7 is yes)
  • a validated comparator (3a-Q10 is yes).
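As a minimal sketch, the Tier 3a and Tier 3b minimum-requirement checks above could be expressed as follows. The dictionary keys mirror the 3a criteria references in this section; the data structure itself is an assumption made for illustration.

```python
# Illustrative sketch: `evidence` is an assumed dict keyed by criteria
# references, e.g. {"3a-Q1": ["RCT"], "3a-Q7": True, "3a-Q10": True}.

def meets_tier_3b(evidence):
    """Tier 3b: an RCT with a significant p value (p < 0.05)
    and a validated comparator."""
    return ("RCT" in evidence.get("3a-Q1", []) and
            evidence.get("3a-Q7", False) and    # significant p value
            evidence.get("3a-Q10", False))      # validated comparator

def meets_tier_3a(evidence):
    """Tier 3a: an RCT or observational study with a significant
    p value, plus a comparator (validated or not)."""
    study_types = evidence.get("3a-Q1", [])
    has_study = "RCT" in study_types or "observational" in study_types
    return (has_study and
            evidence.get("3a-Q7", False) and
            (evidence.get("3a-Q9", False) or evidence.get("3a-Q10", False)))
```

Note that, consistent with the cumulative-requirements rule above, an observational study suffices at Tier 3a but not at Tier 3b, where an RCT is expected unless alternative credentials are accepted.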

If an app has been classified as a Tier 3b but does not have an RCT, it could still pass the ESF using alternative credentials. While each app would be considered on a case-by-case basis, Tier 3b can be met by providing quality real-world evidence. For example:

  • high-quality observational studies instead of RCTs
  • evidence of adoption and use

 

 

Criteria

Criteria Origin

3d — Q1

What tier of the ESF is the app?

ORCHA

3d — Q2

Is the application Tier 1?

ORCHA

3d — Q3

Is the application Tier 2a?

ORCHA

3d — Q4

Is the application Tier 2b?

ORCHA

3d — Q5

Is the application Tier 3a?

ORCHA

3d — Q6

Is the application Tier 3b?

ORCHA

3d — Q7

Has the application met Tier 1 requirements?

ORCHA

3d — Q8

Has the application met Tier 2a requirements?

ORCHA

3d — Q9

Has the application met Tier 2b requirements?

ORCHA

3d — Q10

Has the application met Tier 3a requirements?

ORCHA

3d — Q11

Has the application met Tier 3b requirements?

ORCHA

3d — Q12

Does the application have appropriate evidence for the ESF Tier?

ORCHA

 

Criteria

Criteria Origin

3e — Q1

Does the sample of the research study meet the relevant characteristics for the users of the app?

MHCC