Federated learning: Unlocking the world's private data
Federated learning doesn’t exactly get the most press. Supervised, unsupervised and the gang occupy the spotlight.
But federated learning promises the holy grail for model training. It enables a model to train on data without that data ever leaving the device on which it was generated. Think of Apple Intelligence. Rather than training on your data, Apple would train on everyone’s data, securely.
No-one wants Apple to ship off your personal information to train its model. Instead, federated learning lets each local node share only weight and bias updates with a single model shared across all nodes. Everyone's data informs one mega model without compromising privacy or safety.
The result? Private, confidential and personal data can be used for training anonymously. Federated learning could be the answer to unlock the next tranche of data to move models forward. It could bypass legal problems posed by private data, bandwidth problems from data in transit and security problems of collaborative datasets.
When Google pioneered this machine learning technique in 2017, initial use cases focused on device-driven industries like healthcare, mobile applications, automotive and the nebulous internet of things. This wave focused on solving an edge device problem in an intra-company setting.
Google's original illustration captures the loop: your phone personalises the model locally, based on your usage (A); many users' updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated.
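To make the mechanics concrete, here is a minimal sketch of one federated averaging round, using a toy linear model in NumPy. The names (local_update, federated_round) and the toy data are illustrative assumptions, not any particular framework's API; real systems layer secure aggregation, compression and differential privacy on top.

```python
import numpy as np

def local_update(weights, local_data, lr=0.01, epochs=1):
    """Train on-device: start from the shared weights, run a few gradient
    steps on data that never leaves the node, return only the new weights."""
    w = weights.copy()
    X, y = local_data
    for _ in range(epochs):
        preds = X @ w                       # toy linear model
        grad = X.T @ (preds - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w

def federated_round(global_weights, nodes):
    """One round of federated averaging: every node trains locally, the
    server averages the returned weights, weighted by node dataset size."""
    updates, sizes = [], []
    for data in nodes:
        updates.append(local_update(global_weights, data))
        sizes.append(len(data[1]))
    # Only these weight vectors cross the wire -- never the raw (X, y) data
    return np.average(np.stack(updates), axis=0, weights=np.array(sizes, float))

# Toy usage: three "devices", each holding private (X, y) pairs
rng = np.random.default_rng(0)
nodes = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]
w = np.zeros(3)
for _ in range(20):
    w = federated_round(w, nodes)
print(w)
```

The property that matters is in the return values: only weight updates leave each node, so the central model improves without the underlying data ever being pooled.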
However, what if we could apply federated learning in an inter-company setting? If data can remain private and secure, a single model could draw on anonymised data from multiple organisations and inform business decisions across all of them. Sounds a bit egalitarian.
Think of use cases like compliance and credit scoring. Businesses depend on databases to facilitate rapid decisioning, and a model that uses the data of not one bank but 100 banks would be vastly superior. That is, rather than a business developing its own proprietary model of what 'creditworthy' looks like, it would share a predictive model trained on 100 times as much data. In this future, on-device lending, pricing refinement and compliance checks could all be near-instant.
Seems a bit too obvious?
This creates a very large game theory problem. If Business A has more data and a more performant model than Business B, why would it opt for a central model run by Business C and undercut its own advantage?
There would need to be a very compelling data-sharing incentive mechanism. For example, Business A could license its data to Business B via Business C's federated learning model, or sell its anonymised data as a service so that Business B could enhance its own federated learning model.
It’s not impossible. Although it took a global pandemic to force collaboration on federated learning, it happened! Twenty hospitals across five continents trained an AI model to predict the oxygen requirements of COVID-19 patients, and saw a 16% bump in model performance by participating in the federated methodology.
It isn’t totally crazy to aspire to federated learning use cases that are not life-or-death. For example, Chinese financial organisations have started to dip their toes into the technology. WeBank has been building a platform-as-a-service federated learning product, focused chiefly on KYC and risk control. Swiss Re China was one of the early parties interested in using the model for customer targeting and price optimisation.
WeBank has also doubled down on using intra-company federated learning with its own customers. For example, the organisation uses encrypted data on electronic invoices, known as ‘e-Fapiao’, to determine the credit risk of small and micro-enterprises.
What problems does this solve?
Financial crime is growing much faster than financial technology organisations can build software to prevent it. Sophisticated generative voice and image technology is already starting to accelerate money laundering. Each bank call centre receives up to 10,000 fraudster calls a year, and the fraudster’s voice already sounds a lot like that of their victim.
If, with customer permission, financial organisations shared their voice data with a federated learning model, each organisation would gain an improved method of preventing extortion. In this case, federated learning might not erode any single party’s competitive advantage.
On the value-generation side (as opposed to loss prevention), consumers have been promised better credit scoring for years. Data from Square terminals, SumUp interfaces, BNPL providers and more was supposed to inform who should receive credit lines and who shouldn’t. The idea was that this would complement the acknowledged limits of Experian’s methodology, which relies on data from mortgages, car loans, gas, insurance and the like. Traditional credit scoring is flawed and oriented towards a wealthy, US-based demographic, who are much more likely to own a financed car or house and use a credit card.
With federated learning, data could be shared across every device interaction to build an accurate picture. For businesses like Pillar (now Acorns), which focus on immigrant credit scoring, this would be a huge boon. You might be deemed creditworthy not because you have a history of timely car repayments, but because the federated algorithm deems you a good debtor for other reasons. These might be obvious (you don’t make late-night ASOS orders) or non-obvious (you use Square terminals at the same time each day for lunch, suggesting you are not an erratic spender).
Why not yet?
Federated learning is still technically challenging, which limits the breadth of its real-world applications. The reasons for its slow take-off revolve around the nature of each ‘node’.
Node datasets are often not interoperable, requiring curation of the data. Local data can be poorly secured, and because it is ‘hidden’ from the central model, it is an obvious target for attackers to poison. That opacity is compounded by a lack of data labels at the model level. Each node also carries its own biases relative to the general population, and the obscured detail makes it difficult to prevent prejudice creeping in across time, age, gender and so on. However, things are starting to improve.
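To make the poisoning point concrete, one common class of defence is to replace the plain average with a robust aggregation rule, so a handful of extreme node updates cannot drag the shared model far. The sketch below is an illustrative coordinate-wise trimmed mean (the robust_aggregate helper is an assumption for this post, not a reference to any specific library).

```python
import numpy as np

def robust_aggregate(updates, trim=0.2):
    """Coordinate-wise trimmed mean: sort each weight across nodes, drop the
    most extreme values at both ends, then average what remains. A single
    poisoned node cannot pull the shared model arbitrarily far."""
    stacked = np.sort(np.stack(updates), axis=0)
    k = int(len(updates) * trim)
    kept = stacked[k:len(updates) - k] if k > 0 else stacked
    return kept.mean(axis=0)

# Five node updates, one of them wildly off (a crude stand-in for poisoning)
updates = [np.array([0.9, 1.1]), np.array([1.0, 1.0]), np.array([1.1, 0.9]),
           np.array([0.95, 1.05]), np.array([50.0, -50.0])]
print(robust_aggregate(updates))   # stays close to [1, 1]
```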
Federated learning can quite obviously become a very compelling edge in settings where multi-agent data needs to remain secure. Assuming the technology hits an inflection point without becoming exorbitantly expensive, the question is how enterprises become comfortable with the approach.
The answer is likely to emerge from the accelerating velocity of financial crime and the need for solutions. To fall in line with regulation from the FATF, local FIUs and others regarding how much financial crime is permissible on a platform, businesses will need to continue investing in compliance software. If federated learning can provide a superior solution with its data-sharing approach, it might just have the answer.