Data scientists keep building and tuning machine-learning models for fraud detection. Despite this, detection rates are not improving quickly enough to keep up with the fraudsters.

Figures show that banks are not well served by their fraud detection systems, even though fraud detection is a rapidly growing industry. The value of the market has grown tenfold since 2009 to reach \$2 billion in 2020.

However, one has to question the effectiveness of the fraud detection paradigm. Between 2021 and 2022, fraud increased by 30%. Although 70% of fraud is detected early enough to stop it, fraudsters are still getting away with a significant amount against financial institutions – a whopping \$50 billion worth per year.

There has been heavy investment in fraud detection systems, and significant strides were made in the early days. Yet progress in improving the algorithms has slowed in recent years.

What’s needed is a way to break through this accuracy barrier and move fraud detection systems to the next level.

## Zeroing in on accuracy

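One measure of a fraud detection system is the number of false positives and false negatives it generates. A false positive is a legitimate transaction flagged as fraud; a false negative is a fraudulent transaction that slips through.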

Fraud detection systems typically output a score between zero and one for each transaction. A zero indicates the transaction is unlikely to be fraud and a one indicates it's definitely fraud. If we plot the distribution of the scores, most systems will produce a bell-shaped curve resembling a normal distribution: most scores bunch up in the middle, with fewer at the top and bottom of the range.

This is the opposite of what we would like to see: a U-shaped graph, with fewer scores in the middle and more at the top and bottom of the range. An ideal system would output only zeros and ones, and your plot would look basically like an empty box.
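To make the two shapes concrete, here is a minimal sketch that draws synthetic scores from two Beta distributions and compares how much of each lands in the ambiguous middle. The distributions, their parameters, the 0.3 to 0.7 "middle band", and the helper names `middle_share` and `text_histogram` are all assumptions chosen for illustration; they are not taken from any real fraud system.

```python
import random

def middle_share(scores, low=0.3, high=0.7):
    """Fraction of scores that fall in the ambiguous middle band [low, high]."""
    return sum(low <= s <= high for s in scores) / len(scores)

def text_histogram(scores, bins=10, width=40):
    """Return one line per bin, with a bar proportional to the bin's count."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1  # clamp a score of 1.0 into the last bin
    peak = max(counts)
    return [
        f"{i / bins:.1f}-{(i + 1) / bins:.1f} " + "#" * round(width * c / peak)
        for i, c in enumerate(counts)
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data only: Beta(5, 5) bulges in the middle, Beta(0.3, 0.3) is U-shaped.
bell_scores = [rng.betavariate(5, 5) for _ in range(10_000)]
u_scores = [rng.betavariate(0.3, 0.3) for _ in range(10_000)]

for name, scores in (("bell-shaped", bell_scores), ("U-shaped", u_scores)):
    print(f"{name}: middle share = {middle_share(scores):.2f}")
    print("\n".join(text_histogram(scores)))
```

The bell-shaped sample puts most of its scores in the middle band, while the U-shaped sample puts most of them near zero or one, which is the behaviour we would want from a better detector.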

However, given that most scores are neither zero nor one, you have to set a threshold, a cut-off point below which the result is ‘negative’ for fraud and above which it is ‘positive’.

In a given system, the threshold will determine the number of false positives and negatives. If you set the cut-off point at 0.9, you will get fewer false positives than if you set it at 0.5, but at the risk of more false negatives.
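The trade-off can be sketched in a few lines. The labelled transactions below are made up for illustration, and `confusion_counts` is a hypothetical helper; the sketch also assumes that a score exactly at the threshold counts as positive.

```python
def confusion_counts(scored, threshold):
    """Count false positives and false negatives at a given cut-off.

    scored: list of (score, is_fraud) pairs. A score at or above the
    threshold is treated as 'positive' for fraud (an assumption).
    """
    false_pos = sum(1 for score, is_fraud in scored if score >= threshold and not is_fraud)
    false_neg = sum(1 for score, is_fraud in scored if score < threshold and is_fraud)
    return false_pos, false_neg

# Hypothetical labelled transactions: (fraud score, was it really fraud?)
transactions = [
    (0.10, False), (0.35, False), (0.55, False), (0.60, True),
    (0.65, False), (0.80, True), (0.85, False), (0.95, True),
]

for threshold in (0.5, 0.9):
    fp, fn = confusion_counts(transactions, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
# threshold=0.5: false positives=3, false negatives=0
# threshold=0.9: false positives=0, false negatives=2
```

On this toy data, raising the cut-off from 0.5 to 0.9 removes all three false alarms but lets two frauds through.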

A bank sets its threshold based on its risk tolerance and anti-fraud resources. The lower your threshold, the more work you generate for your investigators. They have to examine each positive result to determine if it’s true or false. Conversely, the higher your threshold, the more money you will potentially lose to fraud.

So, accuracy matters. Flattening the bulge in the middle of the distribution – and even making it concave – quickly pays off in fewer false positives and negatives, because fewer transactions sit close to the threshold.

## Fighting the bulge

Great strides have been made using machine-learning algorithms in fraud detection, and ML achieved notable results in flattening the curve early on. In recent years, however, the bulge has proven increasingly stubborn, for various reasons.

One reason could be that the fraudsters are getting smarter. A more likely explanation is that data scientists have tweaked the algorithms about as far as they will go given the current operational paradigm.

There is an accuracy barrier in fraud detection machine learning. Improvements in existing fraud detection techniques are reaching a plateau.

Essentially, the accuracy barrier is the limit of what you can squeeze from an algorithm without feeding it better data. Financial institutions must overcome this to limit fraud and control the costs associated with fraud detection.

The data being input into fraud detection systems typically consists of transactional data and information about the parties to the transaction. In simple terms, the system computes the fraud score by looking at the nature of the transaction and the history of the parties involved. It uses algorithms to arrive at an overall fraud score based on known risk factors.

The data often leaves out the relationships the parties to the transaction have with other entities in the financial ecosystem. These entities can be people, devices, or organisations with fraud risk scores attached to them. We are interested in the parties’ relationships with high-risk entities that are not directly involved in the transaction but could potentially be controlling, or otherwise associated with, one of the parties.

If you are not considering the relationships that parties to the transaction have with high-risk entities, you are essentially throwing away a large part of the picture: data that could be invaluable in reducing your fraud risk.

The question, of course, is how to build a system that can collect that relationship data and, most importantly, analyse it.

## Weaving a web

Most financial systems are built on relational databases. The fraud detection systems they employ also use these databases for their analysis.

However, a relational database is poorly suited to this task because relationship analysis requires SQL queries with too many table joins. The more depth you add to the analysis, the more table joins you must construct, and the time it takes to compute the joins can grow exponentially with the depth of analysis. Practical experience shows that the calculation quickly consumes vast amounts of memory and CPU time, effectively grinding to a halt.
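To illustrate how the joins pile up, the following sketch uses Python's built-in `sqlite3` module and a hypothetical `links(src, dst)` table of directed relationships. The table, the entity names, and the helper `k_hop_query` are assumptions made for this example. It shows only the shape of the queries, one extra self-join per extra hop, and does not measure performance.

```python
import sqlite3

def k_hop_query(depth):
    """Build a SQL query for entities exactly `depth` links away from a start entity.

    Each additional hop adds one more self-join on the hypothetical `links` table.
    """
    joins = "".join(
        f" JOIN links AS l{i} ON l{i}.src = l{i - 1}.dst" for i in range(2, depth + 1)
    )
    return f"SELECT DISTINCT l{depth}.dst FROM links AS l1{joins} WHERE l1.src = ?"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [("alice", "phone_1"), ("phone_1", "bob"), ("bob", "shell_co"), ("shell_co", "carol")],
)

for depth in (1, 2, 3):
    sql = k_hop_query(depth)
    reached = [row[0] for row in conn.execute(sql, ("alice",))]
    print(f"depth={depth}, joins={sql.count(' JOIN ')}, reached={reached}")
# depth=1, joins=0, reached=['phone_1']
# depth=2, joins=1, reached=['bob']
# depth=3, joins=2, reached=['shell_co']
```

On a four-row table this is instant. On a production table with many links per entity, every added join multiplies the number of intermediate rows the database may have to consider.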

### Graph databases are better equipped for fraud detection

Graph databases take an entirely different approach to storing and processing data. Rather than organising the data in tables, columns, and rows, they store each data point in a node. Edges link each node to other nodes.

A graph database of nodes and edges models a financial system much more effectively than a relational database because the relationships between data points are explicit. It can be fed data from existing relational databases. It also lets you link data from several databases in a way that makes intuitive sense and is computationally more powerful.

With a graph database, the relationships between entities in your financial model no longer need to be constructed at run-time because they are already encoded in the data. Analysing the data becomes a simple matter of following the edges and hopping from one node to another, applying your algorithms to the data as you go.
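As a rough illustration of "hopping from one node to another", the sketch below stores a tiny graph in plain Python dictionaries and walks outward from a party to the transaction, looking for high-risk entities within a few hops. It is not a graph database and does not reflect any particular product's API; the entity names, the risk scores, the 0.8 cut-off, and the function `high_risk_within` are all hypothetical.

```python
from collections import deque

# Hypothetical graph: each node maps to the set of nodes it shares an edge with.
edges = {
    "payer": {"device_7"},
    "payee": {"acme_ltd"},
    "device_7": {"payer", "mule_account"},
    "acme_ltd": {"payee"},
    "mule_account": {"device_7"},
}
# Hypothetical per-entity fraud risk scores between zero and one.
risk = {"payer": 0.1, "payee": 0.2, "device_7": 0.3, "acme_ltd": 0.1, "mule_account": 0.9}

def high_risk_within(start, max_hops, min_risk=0.8):
    """Breadth-first walk from `start`; return {entity: hops} for risky entities."""
    seen = {start}
    queue = deque([(start, 0)])
    found = {}
    while queue:
        node, hops = queue.popleft()
        if node != start and risk[node] >= min_risk:
            found[node] = hops
        if hops < max_hops:
            # Follow the stored edges directly; no join is computed at run-time.
            for neighbour in edges[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
    return found

print(high_risk_within("payer", max_hops=2))  # {'mule_account': 2}
print(high_risk_within("payee", max_hops=2))  # {}
```

Here the payer looks low-risk in isolation, yet shares a device with a high-risk account two hops away. That is the kind of relationship signal that transaction-only data leaves out.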

In this way, you can run much more powerful algorithms. There is a whole branch of mathematics called graph theory, and it provides algorithms that find the shortest path between entities and identify outliers and influencers. You can also scope out communities of interest. All of these are essential functions for identifying fraudsters.
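A few of these ideas can be sketched with the standard library alone. The example below uses breadth-first search for the shortest path, connected components as a deliberately simple stand-in for community detection (real community-detection algorithms are more sophisticated), and node degree as a crude proxy for influence. The graph and the function names are hypothetical.

```python
from collections import deque

# Hypothetical undirected graph stored as an adjacency mapping.
graph = {
    "a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c", "e"},
    "e": {"d"}, "x": {"y"}, "y": {"x"},
}

def shortest_path(graph, source, target):
    """Return one shortest path (fewest edges) as a list of nodes, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

def connected_components(graph):
    """Group nodes into sets whose members can all reach one another."""
    unvisited, components = set(graph), []
    while unvisited:
        start = unvisited.pop()
        component, queue = {start}, deque([start])
        while queue:
            for neighbour in graph[queue.popleft()]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        unvisited -= component
        components.append(component)
    return components

# Degree (number of edges) as a crude measure of how connected a node is.
most_connected = max(graph, key=lambda node: len(graph[node]))

print(len(shortest_path(graph, "a", "e")) - 1)          # 3 hops from a to e
print(sorted(map(sorted, connected_components(graph))))  # two separate groups
print(most_connected)                                    # d
```

In a real deployment you would more likely call the graph database's own implementations of these algorithms than write them by hand.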