AI in POS and Payments: From Transaction Processing to Transaction Intelligence

contact@corebaseit.com (Vincent Bevia) — Mon, 08 Jun 2026 10:00:00 +0100

This post looks at where machine learning actually fits in a payment system: not inside the authorization path that EMV, tokenization, and the networks already own, but in the risk layer that wraps around it. It pairs with earlier posts on what happens in the 2–3 seconds of a card payment and on AI as an amplifier, not a replacement.

In the acquiring and POS stacks I have worked on, the authorization path is the part nobody wants to touch. A card is tapped, inserted, or presented through a wallet. The terminal or SoftPOS SDK runs the card interaction and the cryptography. The transaction is routed for authorization, and a decision comes back: approved or declined. That path is deterministic, latency-bound, and heavily governed, and it works.

What changed over the last few years is not that path. It is the layer forming around it. Fraud scoring, anomaly detection, risk-based authentication, and analyst tooling increasingly sit beside the transaction flow and feed it context. That layer is what people usually mean when they say “AI in payments,” and it is worth being precise about what it does and does not do.

It does not replace EMV, tokenization, PCI controls, or the authorization networks. It surrounds them. And it has two jobs that are easy to state and hard to build: detect abnormal behavior fast enough to act on it, and explain that behavior clearly enough for a human to trust the decision.

Why rules alone stopped being enough

Payment fraud is not static. It moves with consumer behavior, new channels, device usage, regulation, and whatever the attackers are doing this quarter.

For a long time the controls were rules: suspicious amounts, unusual locations, repeated failed attempts, blocked cards, high-risk merchant categories, known compromised accounts. Rules still earn their place, especially when the pattern is already understood and you want a hard, auditable block. The problem is that a lot of modern fraud does not announce itself in a single field.

A €200 transaction is not suspicious on its own. It is normal for one merchant, unusual for another, and only becomes a real signal when you combine it with device history, time of day, account behavior, failed authentication attempts, velocity, location, and card type. That combination is what machine learning is good at: relationships that are too distributed or too fast-changing for a static rule table.

The reason this is an engineering problem and not just a modeling problem comes down to four properties of the data:

Fraud is rare. In the public IEEE-CIS Fraud Detection dataset [5], roughly 3.5% of the labeled training transactions are fraud. Your own portfolio will differ, so check the rate in your own data before reusing that number. At that ratio a model can post high accuracy while missing most of what matters, simply by guessing “legitimate.”

Fraud data is sensitive. Payment records carry personal, financial, and behavioral information. Even after anonymization, the surviving features are often opaque, which makes both modeling and explanation harder.

Fraud patterns drift. Holiday shopping, travel seasons, inflation, new merchants, and new wallets all move the baseline of “normal.”

The costs are asymmetric. A false negative lets fraud through. A false positive blocks a real customer, annoys a merchant, and opens a support case. In payments, a false positive is not a row in a confusion matrix. It is a lost sale.

Detection is a pipeline, not a single model

The most useful correction I can offer to the “drop in a model” framing is this: the model is one component. The pipeline around it does most of the work.

Before anything gets scored, the data has to be prepared. Real transaction data arrives with missing values, duplicates, inconsistent formats, and a heavily imbalanced class distribution. A workable fraud pipeline tends to run through ingestion, preprocessing and de-duplication, scaling and normalization, categorical encoding, class balancing for the rare fraud cases, feature extraction, training and validation, real-time scoring, explanation, human review, and ongoing drift monitoring.

Two parts of that chain deserve attention.

Class imbalance has to be handled deliberately. If legitimate transactions outnumber fraud by twenty or thirty to one, an untreated model drifts toward predicting “normal.” Oversampling techniques such as SMOTE [3] are one common way to give the minority class enough representation during training. They are not free — synthetic samples can blur the boundary you care about — but ignoring imbalance is worse.
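To make the oversampling idea concrete, here is a minimal sketch of SMOTE-style interpolation using only the standard library. It is a simplification of the technique in [3], not the imbalanced-learn implementation: the function name smote_like, the two features, and the sample values are all hypothetical. In a real pipeline this would run on the training split only, so the validation data keeps its true class ratio.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical, already-scaled features: (amount, velocity)
fraud = [(0.90, 0.80), (0.85, 0.75), (0.95, 0.90), (0.80, 0.85)]
legit_count = 100  # assumed size of the majority class

new_points = smote_like(fraud, n_new=legit_count - len(fraud))
balanced_fraud = fraud + new_points
print(len(balanced_fraud), "fraud rows vs", legit_count, "legitimate rows")
```

Every synthetic point lies on a line between two real fraud samples, which is exactly why the decision boundary can blur when fraud and legitimate clusters overlap.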

Feature work still matters, even with deep models. Some features are transactional and low-level: amount, card type, merchant category. Others are behavioral and historical: frequency, device patterns, failed attempts, customer history. Some teams combine representation learning, anomaly detection, dimensionality reduction, and sequence modeling to capture both the static and the time-ordered structure in the data. The specific architecture is less important than the reason for it: transaction behavior has both shape and sequence.
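As an illustration of the behavioral features described above, the sketch below derives a velocity count and an amount deviation from one card's recent history. The Txn type, the ten-minute window, and the feature names are assumptions made for the example, not fields from any particular platform.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Txn:
    card_id: str
    ts: int        # seconds since epoch
    amount: float


def velocity(history, now, window_s=600):
    """Number of prior transactions inside the trailing window."""
    return sum(1 for t in history if 0 <= now - t.ts <= window_s)


def amount_zscore(history, amount):
    """Distance of the amount from this card's historical mean, in std devs."""
    amounts = [t.amount for t in history]
    if len(amounts) < 2:
        return 0.0  # not enough history to say anything
    sd = pstdev(amounts)
    return 0.0 if sd == 0 else (amount - mean(amounts)) / sd


# Hypothetical history for a single card, followed by the transaction to score
history = [Txn("c1", 1000, 20.0), Txn("c1", 1300, 25.0), Txn("c1", 1500, 30.0)]
current = Txn("c1", 1560, 200.0)

features = {
    "amount": current.amount,                                      # low-level
    "velocity_10m": velocity(history, current.ts),                 # behavioral
    "amount_z": round(amount_zscore(history, current.amount), 2),  # historical
}
print(features)
```

The raw amount says little by itself. The same amount expressed relative to the card's own history is a much sharper signal.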

The takeaway is plain. In payments, model performance depends as much on the pipeline feeding it as on the model itself.

The latency budget decides the architecture

A fraud model that looks excellent in an offline notebook can be useless inside the authorization path.

Payment decisions are latency-bound. A real-time decision may have milliseconds to a few seconds, and within that budget the system cannot run unlimited computation, generate a rich explanation for every transaction, or wait for an analyst. That forces a tradeoff between accuracy, explainability, and runtime that you cannot wish away. A deeper model may catch subtler fraud at a higher compute cost. A simpler model is faster and easier to explain but misses more. An explanation method adds overhead that may not fit in the authorization window at all.

The practical answer is to stop pretending there is one path. In the stacks I have seen, it splits into three:

The real-time path is optimized for fast scoring and an operational action: a fraud score, a risk band, or a challenge recommendation.

The near-real-time path handles queueing, alerting, analyst review, and merchant monitoring, where a few seconds or minutes of latency is acceptable.

The offline path is where the heavier work lives: deeper analysis, compliance review, model debugging, and the detailed explanations you do not have time to compute inline.

Not every decision needs the same depth at the same moment. Designing as if it does is how teams end up with a model that is accurate and unusable.
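The three-path split can be sketched as follows. This is a toy, assuming a 50 ms scoring budget and a hand-written stand-in for the model. The names fast_score, review_queue, and offline_log are hypothetical, and a production system would use a message broker and a data store instead of in-memory structures.

```python
import time
from collections import deque

REALTIME_BUDGET_MS = 50   # assumed budget, not a network requirement
review_queue = deque()    # near-real-time path: alerts and analyst review
offline_log = []          # offline path: full records for later analysis


def fast_score(txn):
    """Stand-in for a cheap model; returns a score in [0, 1]."""
    score = 0.1
    if txn["new_device"]:
        score += 0.4
    if txn["velocity_10m"] > 3:
        score += 0.4
    return min(score, 1.0)


def band(score):
    return "high" if score >= 0.8 else "medium" if score >= 0.4 else "low"


def handle(txn):
    """Real-time path: score, band, and hand everything else to slower paths."""
    start = time.perf_counter()
    score = fast_score(txn)
    risk_band = band(score)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Anything heavier than scoring is deferred, never computed inline.
    if risk_band != "low":
        review_queue.append({"txn_id": txn["id"], "score": score})
    offline_log.append({"txn": txn, "score": score, "elapsed_ms": elapsed_ms})
    return {"band": risk_band, "within_budget": elapsed_ms <= REALTIME_BUDGET_MS}


risky = handle({"id": "t1", "new_device": True, "velocity_10m": 5})
routine = handle({"id": "t2", "new_device": False, "velocity_10m": 1})
print(risky["band"], routine["band"], len(review_queue))
```

The inline function returns only what the authorization flow can use immediately. Explanations and deeper analysis read from the queue and the log on their own schedule.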

Explaining the score, not just producing it

Flagging a transaction is half the work. Understanding why it was flagged is the other half, and in a regulated system it is not optional.

A model that outputs “fraud” with no context is operationally thin. Analysts need the drivers behind the score. Compliance teams may need evidence that the system is governed and not arbitrary. Support staff need to tell a cardholder something more useful than “the system declined it.” Under GDPR Article 22 [4], decisions based solely on automated processing that significantly affect a person come with obligations, including providing meaningful information about the logic involved. Whether that amounts to a strict “right to explanation” is debated among lawyers, but the operational direction is clear enough: “because the model said so” is not a defensible answer.

This is where LIME [1] and SHAP [2] come in. Both explain an individual prediction by ranking the features that pushed the model toward fraud or non-fraud. They make a score legible: was it the amount, the device, the velocity, the mismatch with this account’s history?

They also come with a tradeoff. LIME tends to be faster but can be less stable across similar inputs. SHAP tends to give more consistent attributions but can be slower depending on the background dataset and model type. That difference maps cleanly onto the path split above: a fast, approximate explanation can ride along in the near-real-time path, while the slower, more rigorous one belongs in offline review and compliance.
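To show where the cost of attribution comes from, the sketch below computes exact Shapley values for a three-feature toy model by enumerating every feature ordering. This is the idea that SHAP [2] approximates, not the SHAP library's API. The model, its weights, and the single baseline transaction (standing in for a background dataset) are all assumptions made for the example.

```python
from itertools import permutations
from math import factorial


def toy_model(x):
    """Hypothetical risk score with a device/velocity interaction."""
    score = 0.002 * x["amount"] + 0.3 * x["new_device"] + 0.05 * x["velocity"]
    score += 0.1 * x["new_device"] * x["velocity"]
    return score


def shapley(model, x, baseline):
    """Exact Shapley values: mean marginal contribution over all feature orders."""
    names = list(x)
    phi = dict.fromkeys(names, 0.0)
    for order in permutations(names):
        current = dict(baseline)   # start from the reference transaction
        prev = model(current)
        for name in order:
            current[name] = x[name]    # switch one feature to its real value
            now = model(current)
            phi[name] += now - prev    # marginal contribution in this order
            prev = now
    n_orders = factorial(len(names))
    return {k: v / n_orders for k, v in phi.items()}


baseline = {"amount": 50.0, "new_device": 0, "velocity": 1}   # assumed "typical"
txn = {"amount": 200.0, "new_device": 1, "velocity": 4}

attributions = shapley(toy_model, txn, baseline)
for name, value in sorted(attributions.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {value:+.3f}")
```

The attributions sum to the difference between the scored transaction and the baseline, which is what makes them auditable. The number of orderings grows factorially with the feature count, which is why practical methods approximate and why the rigorous version tends to live offline.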

There is a product constraint underneath all of this. An explanation is only useful if the person receiving it can act on it. A fraud analyst does not need raw model weights or a feature called id_31. They need “unfamiliar device, transaction velocity above this account’s norm, amount outside the merchant’s range.” The translation from model output to operator-readable reason is part of the system, not an afterthought.
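A minimal sketch of that translation step, assuming the explanation method has already produced per-feature attributions. The mapping table, the thresholds, and the fallback text are hypothetical. In practice the wording would be owned by the fraud operations team, not by the model team.

```python
# Hypothetical mapping from model feature names to operator-readable reasons
REASON_TEXT = {
    "new_device": "unfamiliar device",
    "velocity": "transaction velocity above this account's norm",
    "amount": "amount outside the merchant's range",
}


def top_reasons(attributions, limit=3, min_weight=0.05):
    """Rank features by contribution and translate the strongest ones."""
    ranked = sorted(attributions.items(), key=lambda kv: kv[1], reverse=True)
    reasons = []
    for feature, weight in ranked:
        if weight < min_weight:
            break  # remaining features barely moved the score
        # Opaque features fall back to a generic label, never a raw name.
        reasons.append(REASON_TEXT.get(feature, "other risk signal"))
    return reasons[:limit]


# Example attributions, invented for illustration
example = {"new_device": 0.45, "velocity": 0.40, "amount": 0.30, "id_31": 0.02}
print("; ".join(top_reasons(example)))
```

The analyst never sees id_31. A feature either has an agreed plain-language reason or it is reported generically.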

False positives are a business cost, not a metric

In a machine-learning report, a false positive is one cell in a matrix. In an acquiring business, it is a declined legitimate purchase, a frustrated cardholder, a merchant complaint, and a support ticket.

This is why fraud detection cannot be tuned for accuracy alone. Accuracy is misleading on imbalanced data: if 97% of transactions are legitimate, a model that almost always says “legitimate” looks accurate and catches almost nothing. The metrics that actually matter are precision, recall, F1, AUC, false-positive and false-negative rates, and cost-sensitive evaluation that prices the two error types differently.
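The accuracy trap is easy to show with numbers. The sketch below compares an always-legitimate baseline with a hypothetical model on 10,000 transactions at a 3% fraud rate. The confusion-matrix counts and the two error costs (5 per false positive, 100 per false negative) are invented for illustration and would need to come from your own chargeback and support data.

```python
def evaluate(tp, fp, fn, tn, cost_fp=5.0, cost_fn=100.0):
    """Standard metrics plus a simple cost that prices the two errors differently."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "cost": fp * cost_fp + fn * cost_fn,
    }


# 10,000 transactions: 300 fraud, 9,700 legitimate
always_legit = evaluate(tp=0, fp=0, fn=300, tn=9700)
model_run = evaluate(tp=240, fp=194, fn=60, tn=9506)

for name, result in [("always legit", always_legit), ("model", model_run)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Both rows land at roughly 97% accuracy. Only recall and cost show that one of them catches nothing.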

The right target depends on who is asking. A high-risk environment may accept more false positives to catch more fraud. A premium merchant may prioritize not blocking good customers. A regulated setting may weigh explainability and auditability above raw capture rate. A real-time POS flow may care most about fast, stable, low-latency scoring. The model should be judged against the context it runs in, not in isolation. This is the same lifecycle that shows up in reversals, refunds, and chargebacks: the cost of a wrong call is paid downstream, not at the moment of scoring.

When “normal” changes: concept drift

A model trained on last year’s behavior does not automatically understand this year’s. Holidays, travel, inflation, new merchant categories, and new payment methods all shift the baseline. The term for this is concept drift, and in fraud it is sharper than in most domains because the adversary adapts on purpose. Once a detection pattern is known or blocked, attackers change behavior to route around it.

So a fraud model cannot be deployed and forgotten. It has to be observed like any other production system. The signals worth watching include the fraud capture rate, the false-positive rate, shifts in approval and decline rates, changes in feature distributions and merchant segments, model confidence drift, analyst override rates, and chargeback feedback.

Explainability helps here too. If the features driving decisions change suddenly, that is a signal in itself: a genuine behavioral shift, a data-quality problem upstream, an attacker adapting, or a model starting to degrade. A sudden change in what the model leans on is often visible before the headline metrics move.
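One common way to watch a feature distribution for drift is the Population Stability Index (PSI). It is not the only option and it is my choice for this example, not something the rest of this post depends on. The bin edges, the sample amounts, and the 0.25 alert threshold below are assumptions. That threshold is a widely quoted rule of thumb, not a standard.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # index of the bin v falls into
            counts[idx] += 1
        # eps avoids log(0) when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [25, 50, 75]                                        # assumed amount bins
training = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]        # reference window
live = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]       # shifted upward

stable = psi(training, training, edges)
shifted = psi(training, live, edges)
print(f"stable={stable:.3f} shifted={shifted:.3f} alert={shifted > 0.25}")
```

The same function can be pointed at the distribution of top explanation features, which is one way to notice a change in what the model leans on before the capture rate moves.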

Keeping humans in the loop

The strongest case for AI in fraud operations is not removing analysts. It is pointing them at the right cases.

Analysts spend a lot of time inspecting alerts, and a large share of those are false positives, low risk, or repetitive. A model that prioritizes cases, summarizes the risk factors, and groups similar events reduces the noise an analyst has to wade through. The point is not an academic explanation; it is less time per investigation and better decisions.

That is why a fraud system should return a score and a reason. A label on its own, such as “high risk,” leaves the analyst to reconstruct the why. Something closer to “high risk; drivers: unfamiliar device, velocity above norm, amount outside merchant range, recent failed attempts; suggested action: challenge or hold per policy” is immediately actionable. This is the framing I keep coming back to in the earlier post on AI as an amplifier, not a replacement: the model handles the volume and the ranking, the human still owns the call.

A layered architecture

If you sketch a practical payment-AI design, it tends to fall into layers, each with a clear responsibility.

The data layer collects transaction events, authorization results, device data, merchant metadata, fraud labels, chargebacks, refunds, disputes, and analyst feedback.

The feature layer turns those raw events into signals: velocity, frequency, amount deviation, merchant behavior, device reputation, historical patterns.

The model layer does the scoring, anomaly detection, and sequence analysis.

The decision layer applies business policy. A model score should never define the outcome by itself, so this is where merchant settings, risk appetite, regulatory rules, thresholds, and authentication status combine into an action.

The explanation layer produces the human-readable reasons for analysts, compliance, and support.

The feedback layer captures what actually happened (confirmed fraud, false positives, chargebacks, analyst decisions) and feeds it back into the next model.
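The decision layer is the easiest one to sketch, because it is deliberately simple code sitting on top of the score. In the example below the Policy fields, the thresholds, and the action names are hypothetical. The point is only that hard rules and merchant policy wrap the model output instead of being replaced by it.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical merchant-level risk settings."""
    challenge_at: float = 0.5
    decline_at: float = 0.9
    allow_decline: bool = True   # some merchants prefer review over decline


def decide(score, authenticated, on_blocklist, policy):
    """Combine the model score with rules and policy into one action."""
    if on_blocklist:
        return "decline"   # hard, auditable rule wins regardless of score
    if score >= policy.decline_at:
        return "decline" if policy.allow_decline else "review"
    if score >= policy.challenge_at and not authenticated:
        return "challenge"   # ask for step-up authentication
    return "approve"


default = Policy()
cautious_merchant = Policy(allow_decline=False)

print(decide(0.95, True, False, default))             # very high score
print(decide(0.95, True, False, cautious_merchant))   # same score, other policy
print(decide(0.60, False, False, default))            # medium, unauthenticated
print(decide(0.10, True, True, default))              # low score, blocked card
```

The same score produces different actions under different policies, which is the reason the score and the outcome are kept in separate layers.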

The structural point is that AI belongs inside the payment platform as a governed decisioning system, not bolted on as an isolated experiment. Deterministic security controls answer “is this card authentic, is the cryptogram valid, can the issuer authorize it.” The intelligence layer answers a different question: “does this behavior look normal for this merchant, device, and account, and should it be challenged.” Both matter, and they answer different questions. EMV cryptography (why chip cards can’t be cloned) is not competing with a fraud score; they sit at different layers.

What can go wrong

A fraud model can learn biased patterns, overfit to history, look strong on a research dataset and weak in production, degrade quietly over time, or be gamed by an adversary. It can also produce explanations that sound convincing but do not help the person reading them.

Privacy is a constant constraint. Payment data is sensitive, so feature engineering and training have to respect data minimization, access control, retention rules, and the relevant regulatory obligations rather than treating every available field as fair game.

Generalization is the other recurring trap. A model trained on one geography, merchant category, or channel may not transfer. Card-present POS, card-not-present online, wallet transactions, refunds, subscriptions, and account-to-account payments each carry different fraud patterns. So the question is not “how accurate is the model” in the abstract. The better question is whether the model improves the decisioning process safely, explainably, and within the operational limits of the system it runs in.

Where this leaves payment AI

AI is changing payments, but not in the way the marketing suggests. The future is not a system where a model magically approves or declines everything. It is an architecture where machine learning sits as an intelligence layer around an authorization path that still runs on EMV, tokenization, and the networks.

That layer detects fraud, holds down false positives, explains its reasoning, supports analysts, watches for drift, and improves as the data comes back. For POS and payment platforms the opening is real, because every transaction already carries context — device, merchant, amount, timing, outcome — that can become a risk signal.

The bar is demanding, though. The system has to be fast enough for real-time flows, accurate enough to catch meaningful risk, explainable enough for humans and regulators to trust, governed enough for compliance, and flexible enough to keep up as fraud changes. Strip it down and the two jobs are the ones I started with: detect the abnormal fast enough to act, and explain it clearly enough to trust. The first is a modeling and latency problem. The second is what decides whether the system gets used at all.

References

[1] M. T. Ribeiro, S. Singh, C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” Proceedings of KDD 2016. (LIME)
[2] S. M. Lundberg, S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” Advances in Neural Information Processing Systems (NeurIPS) 2017. (SHAP)
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, 2002.
[4] Regulation (EU) 2016/679 (General Data Protection Regulation), Article 22, “Automated individual decision-making, including profiling.”
[5] IEEE Computational Intelligence Society / Vesta Corporation, “IEEE-CIS Fraud Detection” dataset, Kaggle, 2019.
