Fraud detection with K-means clustering in BigQuery
Background
Fraud is a growing problem in the digital world. We see it at every stage, from simple phishing emails to deepfakes and AI-driven identity fraud. According to Signicat’s 2024 Battle Against AI-driven Identity Fraud report, there has been an 80% increase in overall fraud attempts over the past three years. An estimated 42.5% of detected fraud attempts use AI, with 29% of them considered successful. According to our own data, deepfakes were just 0.1% of fraud attempts three years ago; today they represent around 6.5% of fraud attempts, or around one in every 15.
There are many ways to fight this, and this article describes how Signicat has been working with AI and ML to detect fraud in large datasets.
Methodology
The datasets containing potential fraud consist of large amounts of both historical and fresh authentication transactions. Hidden in all this data are patterns and common denominators across customers and use cases. These patterns are very hard for humans to identify, but by using ML we can gain valuable insights that help in the battle against fraud. How do we do it, and what are our experiences? Let’s take a closer look.
Using unsupervised learning to cluster the data gives us a basis for anomaly detection. We use K-means clustering in BigQuery ML for this. What is K-means clustering, and why do we need it?
K-means clustering is an ML algorithm that groups data points into a set of clusters based on patterns in the data. Data points in one cluster are more similar to each other than to data points in other clusters. Each cluster has a centroid, and the similarity of a data point is measured by its distance from the centroid, using metrics like Euclidean distance. A data point that lies far from its centroid can be flagged as an anomaly based on that distance.
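To make “far away” concrete: with Euclidean distance (the default distance type for K-means in BigQuery ML), the distance from a data point $x$ to a centroid $c_k$ is

$$d(x, c_k) = \sqrt{\sum_{j} (x_j - c_{k,j})^2}$$

and a data point is flagged as an anomaly when the distance to its nearest centroid, $\min_k d(x, c_k)$, exceeds a chosen threshold.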
In the fraud case, K-means can be used to identify fraudulent authentication transactions based on transaction attributes like geolocation and device information. Sudden changes in behaviour are then detected automatically. Say, for example, that your identity is being misused by someone with bad intentions. It is your credentials that are used, and it appears to be you, but the behaviour differs from your normal usage pattern. Maybe you are suddenly logging in from another location, at different times, from other devices, using other methods, etc. This is what we want to detect. Our approach consists of the following steps:
- Data analysis: When working with big datasets it is crucial to get the groundwork done properly to avoid problems in the later steps. Identify where the data is and which attributes are relevant, and evaluate the data quality.
- Data transformation: When the key data attributes have been identified, it is time to work on the dataset. This can involve parsing and mapping, casting, scaling, and weighting of the data. When preparing to use K-means clustering, keep in mind that the attributes must be comparable. When casting, for example, from one type to a numerical one, make sure the result makes sense and that the ranges are kept intact.
- Build and train the model: When the dataset is ready it can be fed into the model. The model will cluster the data, and during re-training the results will be optimized (see the sketch after this list).
- Detect anomalies: With the clusters created, it is time to run the anomaly detection. That means letting the model identify the data points that are farthest from the centroids.
- Analyze results: Inspect the detected anomalies manually to separate data quality issues from actual fraud.
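In BigQuery ML, the “build and train” and “detect anomalies” steps boil down to two SQL statements. Here is a minimal sketch, assuming a hypothetical project, dataset, and an already prepared transaction table:

```sql
-- Build and train: with num_clusters omitted, BigQuery ML picks a
-- default number of clusters based on the size of the training data.
CREATE OR REPLACE MODEL `my_project.fraud.txn_kmeans`
OPTIONS (
  model_type = 'kmeans',
  standardize_features = TRUE  -- scale numerical features to comparable ranges
) AS
SELECT * EXCEPT (transaction_id)
FROM `my_project.fraud.auth_transactions_clean`;

-- Detect anomalies: flag the points farthest from their centroid.
-- contamination is the expected fraction of anomalies in the data.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.fraud.txn_kmeans`,
  STRUCT(0.02 AS contamination),
  TABLE `my_project.fraud.auth_transactions_clean`);
```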
A real-life example
We recently did a PoC to evaluate the use of K-means clustering for fraud detection in a dataset of authentication transactions.
For this purpose we used a masked demo dataset with 1800 records and no known fraudulent transactions. The dataset was related to the use of a specific product, so the product team and the Data team sat down together to get a better understanding of the data.
When the data had been imported into BigQuery, it was time for some cleaning: JSON data that needed to be split into columns, parsing and mapping of other fields, and casting, scaling, and weighting of the different data attributes that were to be fed into the model. Since K-means measures the distance between data points, the attributes must be cast to something comparable/numeric. It was worth spending some time on this to make sure the attributes were ready to be compared correctly, without too much variance.
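To give a flavour of this step, here is a sketch of the kind of transformation query involved; the field names and the risk mapping are made up for illustration:

```sql
-- Hypothetical cleaning step: split raw JSON into typed, comparable columns.
CREATE OR REPLACE TABLE `my_project.fraud.auth_transactions_clean` AS
SELECT
  transaction_id,
  -- Parse fields out of the raw JSON payload
  JSON_VALUE(payload, '$.device.os')                AS device_os,
  CAST(JSON_VALUE(payload, '$.geo.lat') AS FLOAT64) AS latitude,
  CAST(JSON_VALUE(payload, '$.geo.lon') AS FLOAT64) AS longitude,
  -- Cast the timestamp to a numerical feature
  EXTRACT(HOUR FROM event_time)                     AS login_hour,
  -- Map the categorical method to a weighted numerical score
  CASE auth_method
    WHEN 'biometric' THEN 0.0
    WHEN 'app'       THEN 0.5
    ELSE 1.0
  END                                               AS method_risk
FROM `my_project.fraud.auth_transactions_raw`;
```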
With the clean data in BigQuery, we created the model.
We did not predefine the number of clusters, as there was no clear assumption about how the data would separate; we let the model determine it. Sometimes it is best to let the machine do the job. The result was that 3 clusters were created: the model split the dataset into one bigger cluster and two smaller ones.
K-means created 3 clusters from the dataset of ~1800 records
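How the records were distributed across the clusters can be inspected by running the data through ML.PREDICT and grouping on the assigned centroid; a small sketch, reusing the hypothetical names from above:

```sql
-- Count the transactions assigned to each cluster.
SELECT
  centroid_id,
  COUNT(*) AS transactions
FROM ML.PREDICT(
  MODEL `my_project.fraud.txn_kmeans`,
  TABLE `my_project.fraud.auth_transactions_clean`)
GROUP BY centroid_id
ORDER BY transactions DESC;
```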
With the model trained and the clusters created, we then ran anomaly detection on the model.
The result was 36 anomalies detected. It was time for a manual inspection of the findings.
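For the manual inspection, the flagged rows can be pulled out and ranked by how far they lie from their centroid; a sketch on the same hypothetical tables:

```sql
-- List the flagged anomalies, worst first.
SELECT
  transaction_id,
  centroid_id,
  normalized_distance  -- normalized distance to the nearest centroid
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.fraud.txn_kmeans`,
  STRUCT(0.02 AS contamination),
  TABLE `my_project.fraud.auth_transactions_clean`)
WHERE is_anomaly
ORDER BY normalized_distance DESC;
```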
When analyzing the detected anomalies we looked at both the tables and a visualization for a better understanding. This map shows the anomalies marked with red dots:
When zooming in on each of them, it became clear why the model had identified them as far away from the centroids. In most cases it turned out to be caused by errors in the dataset, like missing or wrong values: for example missing geodata, or a value that could not be read from the device at the time of the transaction.
Even though there were no actual fraud attempts, there were anomalies in the dataset that should be corrected. So far, the model was up and running and had detected anomalies, but no fraud.
The conclusion so far was that there were no fraudulent transactions in the dataset.
It was time to put the whole thing to the test. By manipulating the dataset and introducing one fraudulent transaction, we could verify that it would be picked up by the algorithm.
After adding one transaction indicating that one of the users had suddenly made a transaction in Asia, with the same device and at approximately the same time, the anomaly detection was run again. It resulted in one new detected anomaly, located in Asia. Nice, K-means did the job.
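A sketch of such a test, with made-up values on the hypothetical tables from above: inject one synthetic transaction far from the user’s normal pattern and re-run the detection:

```sql
-- Hypothetical synthetic fraud case: a login from Asia at 03:00,
-- far from this user's normal location and hours.
INSERT INTO `my_project.fraud.auth_transactions_clean`
  (transaction_id, device_os, latitude, longitude, login_hour, method_risk)
VALUES
  ('synthetic-fraud-1', 'Android', 1.35, 103.82, 3, 1.0);

-- Re-run the detection; the new row should come back with is_anomaly = TRUE.
SELECT transaction_id, is_anomaly, normalized_distance
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.fraud.txn_kmeans`,
  STRUCT(0.02 AS contamination),
  TABLE `my_project.fraud.auth_transactions_clean`)
WHERE transaction_id = 'synthetic-fraud-1';
```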
The conclusion was that K-means was able to detect the fraudulent transaction in the dataset.
Learnings
From this PoC we concluded that K-means clustering in BigQuery ML can be used to detect fraud in transaction datasets.
There are, however, some points to note if you are considering this approach:
- Put effort into working with the dataset. To avoid “garbage in, garbage out”, invest time in understanding the data, cleaning it, and preparing it for the model.
- Consider using a combination of methods. K-means combined with rule sets proved more powerful in this case.
- Make sure you have the compliance and legal aspects in place before training on datasets. A demo or masked dataset is always a good start.
- Start slow. Don’t expect the model to be perfect from day one. Add time for manual validation, and start with a few pilot customers.
- Test the performance before putting it into production. If the goal is live detection, this must be properly performance tested.