Why machine learning isn’t the answer to mobile ad fraud
Authored by Pola Vayner, Team Lead Fraud Specialists at Adjust.
Competition for new app users is heating up with millions of choices in the app stores, making every dollar marketers spend on user acquisition more important than ever before. Marketers need to be vigilant in the fight against ad fraud, with some estimates suggesting the cost of ad fraud in 2019 reached $42 billion. But the impact of fraud isn’t only monetary: bad data can affect your campaigns for years to come, skewing your results and distorting critical business decisions and future user acquisition (UA) campaigns.
However, a possible solution has emerged: machine learning (ML). Mobile marketers are already seeing great strides in applying the power of ML in the fight against ad fraud. Still, there is a long way to go before machine learning can be a foolproof solution, and there are many weaknesses to be aware of when deploying this technique to help fight technical ad fraud.
The problem comes down to this: while machine learning can be a great way to detect possible fraud, it’s not yet up to the task of deciding which traffic to reject when it comes to performance fraud. In this article, we will try to demonstrate why ML is not quite ready for prime time.
The problem with machine learning and ad fraud detection
Machine learning is not a tool that can be implemented immediately. It takes time for ML programs to learn and fine-tune, which means using it to filter spoofing of all kinds — instead of one specific type — can create problems. Fake users have to be filtered out from a combined data set of real users, with a whole host of unclear edge cases, and ML doesn’t fare well in gray areas.
For instance, fraudsters can farm real device data and spoof legitimate user behavior — including any attributes sent by an SDK. A fraudster using real device information of a known user (such as OS version, Android Device-ID and locale settings) may go undetected. Based on historical data, the user is real, and as such the machine learning algorithm has a hard time categorizing this fraud correctly.
Additionally, when a fraudster spoofs poorly using genuine device data, the activity of the real user behind that data may end up categorized as fraudulent. Essentially, not knowing which data point is genuine and which one isn’t makes it difficult to train a machine learning model. We have already seen fraudsters spoofing virtually any request, including a client’s own measurement systems, with data that appeared perfectly legitimate. This makes it harder to identify spoofed users even when you’ve been tracking their behavior for quite a while.
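The replay problem described above can be illustrated with a small sketch. Everything here is hypothetical: the `InstallRequest` fields simply mirror the attributes named in the text (OS version, device ID, locale), and `toy_classifier` is a stand-in for any deterministic model, not a real fraud detector. The point is only that two requests carrying identical attributes produce identical features, so the model must give them the same label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstallRequest:
    # Hypothetical subset of attributes an SDK might send.
    os_version: str
    device_id: str
    locale: str


def features(request):
    """Turn a request into the feature tuple the model sees."""
    return (request.os_version, request.device_id, request.locale)


def toy_classifier(feature_vector, known_good):
    """Stand-in for a deterministic model: trusts what history says is real."""
    return "real" if feature_vector in known_good else "suspicious"


# A genuine user, and a fraudster replaying that user's farmed device data.
genuine = InstallRequest("Android 10", "device-123", "en_US")
spoofed = InstallRequest("Android 10", "device-123", "en_US")

# Historical data says this device belongs to a real user.
known_good = {features(genuine)}

genuine_label = toy_classifier(features(genuine), known_good)
spoofed_label = toy_classifier(features(spoofed), known_good)

# Identical inputs give identical outputs, so the spoof passes as real.
print(genuine_label, spoofed_label)
```

A real model sees many more signals than this, but the same limit applies to every signal a fraudster is able to copy.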
Making sense of ML decision-making
Some fraudsters will make mistakes (such as creating fake user interactions that are easily spotted), but, just like the algorithm, they are learning all the time — and the next attempt could be more sophisticated. When faced with new and unfamiliar scenarios, machine learning can falter. This makes it unreliable in the real world without proper supervision and programming.
In order to be helpful as a basis for rejection, a neural network needs to make a decision at the time of attribution, when the payout for the majority of campaigns is decided. At that point in time, it knows very little about the user. To counter that, and in order to determine user legitimacy, machine learning will attempt to detect more elaborate patterns across a larger data set, including seemingly obscure characteristics. Ultimately, ML can create extremely complicated rulesets that key on seemingly unrelated identifiers in bizarre combinations.
Because of these complicated and hard-to-understand decision trees, vendors who sell anti-fraud tools that rely heavily on machine learning as the basis for rejection may opt to make the decision-making process less transparent, never explaining what they do or why they do it. This has the potential to become a problem for fraud prevention down the line.
Why transparency is key
Advertisers will eventually have to settle fraud disputes with networks. Generally, the network lacks the ability to reproduce or explain the rejections, and so has to rely on the word of the client. The client, in turn, relies on its attribution service to provide an explanation for the underlying discrepancies. That might not be an issue for a small fraction of traffic, but when you’re dealing with a great deal of fraudulent traffic, a network will want a detailed justification for the rejections.
If an attribution provider can’t clearly explain why an attribution was rejected, then it becomes a subjective opinion. And while opinions can differ, cold hard data is harder to argue with. If the industry starts down this path, we could end up in a situation where networks might try to portray every fraud filter as just another opinion.
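As a rough illustration of what an explainable rejection could look like, here is a minimal sketch of a deterministic filter that returns its reason and the data behind it. The rule itself (a minimum click-to-install time), the threshold value, and all names are assumptions made up for this example; they are not taken from any particular attribution provider.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold, chosen only for illustration.
MIN_CLICK_TO_INSTALL_SECONDS = 10.0


@dataclass(frozen=True)
class Attribution:
    click_timestamp: float    # seconds since epoch
    install_timestamp: float  # seconds since epoch


@dataclass
class Decision:
    rejected: bool
    reason: Optional[str]  # machine-readable reason, None if accepted
    evidence: dict         # the data points the decision was based on


def evaluate(attribution: Attribution) -> Decision:
    """Apply one fixed, documented rule and report exactly why it fired."""
    delta = attribution.install_timestamp - attribution.click_timestamp
    evidence = {
        "click_to_install_seconds": delta,
        "minimum_seconds": MIN_CLICK_TO_INSTALL_SECONDS,
    }
    if delta < MIN_CLICK_TO_INSTALL_SECONDS:
        return Decision(True, "click_to_install_below_minimum", evidence)
    return Decision(False, None, evidence)


decision = evaluate(Attribution(click_timestamp=1000.0, install_timestamp=1003.0))
print(decision.rejected, decision.reason, decision.evidence)
```

Because the rule and its inputs are spelled out, a network can rerun the same check on the same data and arrive at the same result, which is what separates a reproducible rejection from an opinion.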
Ultimately, machine learning is a good tool for detecting fraud, but it shouldn’t be relied upon for rejecting technical ad fraud, at least not yet. In its current state, edge cases will be missed, and the logic behind decision making may, in the end, be rejected in its own right. Instead, hard work needs to be done to build filters the right way to stop fraud without rejecting installs from legitimate sources.