Identifying ways in which they could be improved might be outside my area of expertise.
I would, however, say that the issue discussed in the paper you're citing is “domain shift”: the fact that the settings in which some of these algorithms are tested are often radically different from the settings in which they are deployed.
At a minimum, there ought to be clarity and honesty from the companies or agencies that use these tools about the extent to which testing conditions differ from the conditions in which the tools are actually deployed. I think it would be problematic if there were a large discrepancy, with tools tested in one setting but then deployed in a completely different one. And without a clear understanding of whether such a difference exists, these technologies have a greater likelihood of being misused and serving more nefarious purposes.