Evaluation
Before going to production, you probably want to evaluate Verify on your own data. Depending on what kind of data you have available, this can be more or less tricky to do.
What to expect
On our current global evaluation dataset of hundreds of thousands of real-world, challenging images of real and fake documents, Verify achieves highly reliable results with the default configuration.
When controlling for image quality, FRR is comfortably low both globally and in the USA, ensuring lower friction for real users.
Our evaluation also includes liveness false rejections, although these datasets don't contain screen/photocopy presentation attacks. On our dedicated liveness datasets, screen/photocopy (and aggregate liveness) FAR and FRR remain extremely low.
For some real-world validation of these metrics, see our results on the Department of Homeland Security's IDNet dataset.
Depending on your use case, you might be ok with accepting slightly more false rejections for significantly fewer false acceptances, or vice versa.
More on these tradeoffs below.
Real documents
If you have a good, diverse sample of real document images, you can measure Verify's FRR.
If you have only images of real documents, you can't really measure much else, unfortunately. In fact, an FRR number on its own tells you very little, because FRR is only meaningful alongside FAR, and vice versa.
Be careful when testing with expired documents! By default, Verify will treat expired documents as fraud attempts. However, if your dataset contains expired documents you want to treat as genuine, you must set the TreatExpirationAsFraud option to false.
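For example, here's a minimal sketch of passing that option; the settings structure and client call are hypothetical, only the TreatExpirationAsFraud option name comes from this page:

```python
# Hypothetical sketch: the settings structure and client call are assumptions,
# only the TreatExpirationAsFraud option name is taken from this documentation.
verification_settings = {
    # Expired documents in your test set should be treated as genuine:
    "TreatExpirationAsFraud": False,
}

# result = verify_client.process(image, settings=verification_settings)
```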
In theory, you could simply disable all checks and get a 0% FRR, but without fake documents you'd never know you're not catching any fraud. Similarly, if you're comparing two products and one has a 1.5% FRR while the other has a 1.75% FRR, you might conclude the first one is better. But what if it catches ten times less fraud than the second one?
We recommend acquiring a quality dataset of synthetic fake documents, like the Department of Homeland Security's IDNet dataset.
Fake documents
If you have a good, diverse sample of fake document images, you can measure Verify's FAR.
If you have only images of fake documents, you run into a similar (but inverse) problem as above. Typically in this scenario, customers test with their own real documents. This is generally fine, but bear in mind that the sample size will be tiny, you'll be biased toward your own capture conditions, and the document types you have available will be extremely limited.
All that said, you should get a good feel for Verify's performance, and any product that shows significant (and not easily explainable) false rejections on a decent number of your real documents shouldn't be taken seriously.
Recommended sample size for evaluation
To ensure that the evaluation results accurately reflect BlinkID Verify's performance, it is important to use a sufficiently large and diverse sample set. Evaluations based on a very small number of images (e.g. 5 examples per document type) can lead to misleading conclusions due to a lack of representative data. We recommend the following sample sizes for reliable testing:
Minimum of:
- 20-30 real document examples per document type
- 20-30 fake document examples per document type
Ideal setup:
- around 100 real and 100 fake examples per document type
This way, your evaluation results will more accurately reflect real-world performance.
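To get a feel for why very small samples mislead, consider the statistical uncertainty around a rate measured on a handful of images. The sketch below is plain Python (no Verify-specific code) and computes a 95% Wilson confidence interval for an observed rejection rate:

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed error rate (errors out of n)."""
    if n == 0:
        return (0.0, 1.0)
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# One false rejection out of 5 images vs. one out of 100 images:
print(wilson_interval(1, 5))    # ~(0.04, 0.62) -- far too wide to conclude anything
print(wilson_interval(1, 100))  # ~(0.002, 0.054) -- a much more usable estimate
```

With 5 images per document type, a single unlucky rejection is statistically consistent with anything from roughly a 4% to a 62% FRR, which is why the larger sample sizes above are worth the effort.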
Tradeoffs
For any fixed solution (e.g. any version of Verify), it's not possible to improve FAR without worsening FRR, and vice versa. It's important to identify where you want to be on this scale, and tune the solution to get there. This is typically done by fixing one of the two metrics.
For example, you might know you don't want to reject more than 0.5% of your real users, so you're aiming for the best possible fraud detection rate without going over 0.5% FRR (all else being equal).
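As a sketch of what "fixing one metric" looks like in practice, suppose you've replayed your labeled evaluation set under a few candidate configurations and recorded the resulting FRR and FAR for each. The configuration names and numbers below are hypothetical, purely for illustration:

```python
# Hypothetical measurements from replaying a labeled evaluation set under
# several candidate configurations; names and numbers are made up for illustration.
candidates = {
    "strict":  {"frr": 0.012, "far": 0.003},
    "default": {"frr": 0.005, "far": 0.008},
    "lenient": {"frr": 0.003, "far": 0.020},
}

FRR_BUDGET = 0.005  # reject no more than 0.5% of real users

# Keep only configurations within the FRR budget, then take the one that
# catches the most fraud (i.e. has the lowest FAR).
eligible = {name: m for name, m in candidates.items() if m["frr"] <= FRR_BUDGET}
best = min(eligible, key=lambda name: eligible[name]["far"])
print(best)  # -> "default"
```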
Configuration
If you're lucky enough to have good sample sizes of both real and fake documents, it's worth taking a bit of time to familiarize yourself with Verify's configuration options and adjust them to your needs.
You may or may not want to clean up the data, depending on what kinds of outputs you want.
RecommendedOutcome evaluation
If you're looking for more informative and nuanced outputs, you're probably interested in the RecommendedOutcome. In that case, it's fine to keep all images in your evaluation set.
Here's how the stats should be calculated in this case:
False Rejection Rate = (Real documents that had Reject RecommendedOutcome) / (Real documents that were tested)
False Acceptance Rate = (Fake documents that had Accept RecommendedOutcome) / (Fake documents that were tested)
Real unprocessed = Real documents that had Undeterminable RecommendedOutcome + Real documents that had Retry RecommendedOutcome
Fake unprocessed = Fake documents that had Undeterminable RecommendedOutcome + Fake documents that had Retry RecommendedOutcome
Assuming you have ManualReview enabled, you can also measure the manual review rates:
Real Manual Review Rate = (Real documents that had ManualReview RecommendedOutcome) / (Real documents that were tested)
Fake Manual Review Rate = (Fake documents that had ManualReview RecommendedOutcome) / (Fake documents that were tested)
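As a concrete sketch, if you collect each tested image's ground-truth label together with its RecommendedOutcome as a string (the exact outcome strings and data shape here are assumptions for illustration), all of the rates above fall out of simple counting:

```python
from collections import Counter

def recommended_outcome_stats(results):
    """results: list of (label, outcome) pairs, where label is "real" or "fake" and
    outcome is the RecommendedOutcome ("Accept", "Reject", "ManualReview",
    "Retry" or "Undeterminable")."""
    counts = Counter(results)
    real_total = sum(n for (label, _), n in counts.items() if label == "real")
    fake_total = sum(n for (label, _), n in counts.items() if label == "fake")
    return {
        "FRR": counts[("real", "Reject")] / real_total,
        "FAR": counts[("fake", "Accept")] / fake_total,
        "real_unprocessed": counts[("real", "Undeterminable")] + counts[("real", "Retry")],
        "fake_unprocessed": counts[("fake", "Undeterminable")] + counts[("fake", "Retry")],
        "real_manual_review_rate": counts[("real", "ManualReview")] / real_total,
        "fake_manual_review_rate": counts[("fake", "ManualReview")] / fake_total,
    }
```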
OverallFraudCheck evaluation
If you're looking for strictly binary Pass/Fail outputs, it might make sense to exclude unsupported documents, low quality images, etc. In production this would be automatically handled by the client-side SDK, and these images wouldn't hit the API.
In this case, instead of looking at FAR and FRR, we can look at the True Rejection Rate and True Acceptance Rate, since low image quality and similar issues are already filtered out.
Here's how the stats should be calculated in this case:
True Rejection Rate = (Fake documents that had Fail verdict) / (Fake documents that were tested)
True Acceptance Rate = (Real documents that had Pass verdict) / (Real documents that were tested)
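And a matching sketch for the binary case (again, the "Pass"/"Fail" strings and data shape are assumptions for illustration):

```python
def overall_fraud_check_stats(results):
    """results: list of (label, verdict) pairs, where label is "real" or "fake"
    and verdict is the OverallFraudCheck result ("Pass" or "Fail")."""
    fake_verdicts = [v for label, v in results if label == "fake"]
    real_verdicts = [v for label, v in results if label == "real"]
    return {
        "true_rejection_rate": fake_verdicts.count("Fail") / len(fake_verdicts),
        "true_acceptance_rate": real_verdicts.count("Pass") / len(real_verdicts),
    }
```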