Using AI to better understand and support homelessness interventions in London
In this blog, Anna Humpleby, LOTI’s Data Projects Manager, and James MacTavish, Engagement Manager at Faculty AI, share technical details of the Rough Sleeping Insights Project and how machine learning is being applied to support homelessness interventions in London.
Note: you can find a glossary of technical terms at the end of this blog.
Why do we need data interventions in the homelessness space?
London dedicates significant public resources to addressing homelessness and rough sleeping, so it's important that we practise good stewardship of those resources, ensuring interventions are as efficient and effective as possible by being outcomes-led and informed by data.
Lots of data exists, but it has traditionally been heavily siloed, sitting across a multitude of systems with no routine way of connecting them. This is the opposite of the experiences and journeys of clients, who typically have contact with many different institutions, systems and stakeholders. It also illustrates how individuals experience and suffer from this lack of join-up: having to tell and re-tell their story multiple times, always starting an interaction with a new service from 'square one'.
Pan-London Strategic Insights Tool: The technical solution
The Strategic Insights Tool (SIT) aims to provide decision makers in London local government with a clearer view of rough sleeping in their local area. But how does that work technically? It comes down to four fundamental steps:

Data Upload
The tool ingests three different types of data from the respective source systems: CHAIN (the database that records rough sleeping activity on the streets of London); In-Form (the accommodation casework system used by service providers); and H-CLIC returns (the information Local Authority housing teams record about statutory homelessness approaches and interventions). In total, the SIT ingests data from 45 organisations, either programmatically (via an API) or via manual upload.
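As a rough sketch only (the endpoints, file paths and loader functions below are hypothetical, not the SIT's actual implementation), ingesting a mix of API and manually uploaded sources might look something like this:

```python
import pandas as pd

# Hypothetical loaders: each contributing organisation's data arrives either
# via an API pull or as a manually uploaded file, and lands as a raw table.
def load_via_api(endpoint: str, api_key: str) -> pd.DataFrame:
    """Pull records from a source system's API (illustrative endpoint)."""
    return pd.read_json(f"{endpoint}?key={api_key}")

def load_manual_upload(path: str) -> pd.DataFrame:
    """Read a manually uploaded CSV return, e.g. an H-CLIC extract."""
    return pd.read_csv(path)

# One raw table per source, keyed by an identifier used later in the pipeline.
raw_tables = {
    "chain": load_via_api("https://example.org/chain/contacts", "API_KEY"),
    "borough_a_hclic": load_manual_upload("uploads/borough_a_hclic.csv"),
    "provider_b_inform": load_manual_upload("uploads/provider_b_inform.csv"),
}
```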
Mapping & Transformation
Once uploaded, data from all systems is standardised through rigorous cleaning and deduplication and mapped into a structured data model. This data model is essentially a schema containing all of the columns we use either for matching or as characteristics for the visualisations. This is an important step, given that the data comes from a variety of different organisations, all of which structure it in different ways; the common model enables us to refer to every dataset in a consistent way.
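To make this concrete, here is a minimal sketch of what mapping into a common data model can look like. The schema, source names and column mappings below are illustrative assumptions, not the SIT's real data model:

```python
import pandas as pd

# Hypothetical target schema: the columns used for matching and as
# characteristic filters in the visualisations.
SCHEMA = ["source", "first_name", "last_name", "date_of_birth",
          "ni_number", "phone", "gender", "nationality"]

# Per-source column mappings (illustrative field names, not the real systems').
COLUMN_MAPS = {
    "chain": {"ForeName": "first_name", "SurName": "last_name",
              "DOB": "date_of_birth", "NINO": "ni_number", "Tel": "phone"},
    "borough_a_hclic": {"first_nm": "first_name", "surname": "last_name",
                        "birth_date": "date_of_birth", "ni_no": "ni_number"},
}

def to_common_model(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename source columns, normalise values and fit the frame to the schema."""
    out = df.rename(columns=COLUMN_MAPS[source]).reindex(columns=SCHEMA)
    out["source"] = source
    # Simple normalisation so equivalent values compare cleanly across systems.
    for col in ("first_name", "last_name"):
        out[col] = out[col].astype("string").str.strip().str.lower()
    out["date_of_birth"] = pd.to_datetime(out["date_of_birth"], errors="coerce")
    return out.drop_duplicates()
```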
Data Matching
At the core of the SIT is the probabilistic matching model. It’s an algorithm that uses unsupervised machine learning techniques to identify any individual record matches across the 45 source systems. For example, the model may detect that a person who has previously completed a statutory homelessness application within a certain borough, as seen in a Local Authority’s H-CLIC data, has appeared in the bedded down contacts collected by housing outreach officers within the CHAIN system. In this instance, records relating to this person between H-CLIC and CHAIN would be associated with one another and merged into a single rough sleeping ‘journey’.
Probabilistic matching is not as straightforward as simply matching on a person's name. Typos and fake names (amongst other things) might mean that names differ between systems. The model therefore considers a number of different factors, including 'fuzzy matching' between names, National Insurance number, birth date, telephone number, and a variety of other personal details. It then only associates records that meet an 85% probability of being a match. This threshold gives us a high level of confidence and keeps false positives (cases where we've accidentally matched two different people) in the final matched dataset to a minimum.
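The sketch below is a deliberately simplified illustration of the underlying idea: score several personal details, weight them, and only link pairs above a threshold. The real model learns its parameters with unsupervised techniques rather than using hand-set weights, and the field names, weights and example records here are assumptions:

```python
from difflib import SequenceMatcher

# Illustrative field weights; the actual model learns its parameters from the
# data rather than relying on hand-set weights like these.
WEIGHTS = {"name": 0.35, "date_of_birth": 0.25, "ni_number": 0.25, "phone": 0.15}
THRESHOLD = 0.85  # only pairs scoring at or above this are linked

def fuzzy(a: str, b: str) -> float:
    """Similarity between two strings in [0, 1]; tolerant of typos."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity across several personal details."""
    scores = {
        "name": fuzzy(rec_a.get("name", ""), rec_b.get("name", "")),
        "date_of_birth": 1.0 if rec_a.get("date_of_birth") == rec_b.get("date_of_birth") else 0.0,
        "ni_number": 1.0 if rec_a.get("ni_number") and rec_a["ni_number"] == rec_b.get("ni_number") else 0.0,
        "phone": 1.0 if rec_a.get("phone") and rec_a["phone"] == rec_b.get("phone") else 0.0,
    }
    return sum(WEIGHTS[field] * s for field, s in scores.items())

# Example: a typo-ridden name still links because the other details agree.
hclic = {"name": "Jon Smith", "date_of_birth": "1985-03-02",
         "ni_number": "QQ123456C", "phone": "07700 900123"}
chain = {"name": "John Smyth", "date_of_birth": "1985-03-02",
         "ni_number": "QQ123456C", "phone": "07700 900123"}
print(match_score(hclic, chain) >= THRESHOLD)  # True
```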
The SIT achieves a recall of 91% (subject to change depending on the input data), meaning that 91 out of every 100 true matches are correctly identified. Users are made aware that the numbers may differ from their expectations because of data quality and the matching threshold. This disparity was recognised as acceptable, since the SIT is intended to identify trends across groups rather than to support individual case management.
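For clarity, here is the recall calculation spelled out with those illustrative numbers:

```python
# Illustrative numbers only: if there were 100 true matches in the data and the
# model correctly links 91 of them, recall = 91 / (91 + 9) = 0.91.
true_positives = 91   # real matches the model correctly linked
false_negatives = 9   # real matches the model missed
recall = true_positives / (true_positives + false_negatives)
print(f"recall = {recall:.0%}")  # recall = 91%
```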
Data Visualisation
Once all of the matches between the different datasets have been determined, the resulting journeys are aggregated and visualised within the tool's user interface. This means we can ensure that data is safely and securely anonymised, whilst still providing insights into the rough sleeping journeys of different groups across many systems. The user interface enables users to manipulate and segment the data via a number of characteristic filters for a closer look at specific groups.
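As a hedged sketch of this final step, matched journeys could be aggregated into group-level counts before they reach the interface. The table, groupings and suppression threshold below are illustrative assumptions, not the SIT's actual disclosure-control rules:

```python
import pandas as pd

# Hypothetical matched "journeys" table: one row per linked individual journey,
# carrying the characteristics used as filters in the user interface.
journeys = pd.DataFrame({
    "borough": ["Camden", "Camden", "Hackney", "Hackney", "Hackney"],
    "age_band": ["26-35", "36-45", "26-35", "26-35", "46-55"],
    "seen_in_chain_and_hclic": [True, False, True, True, False],
})

# Aggregate to counts per group so no individual-level records are exposed.
counts = (journeys
          .groupby(["borough", "age_band"])
          .size()
          .reset_index(name="people"))

# Illustrative small-number suppression (threshold chosen for this toy example;
# production thresholds are typically higher, e.g. 5).
SUPPRESSION_THRESHOLD = 2
counts["people"] = counts["people"].where(counts["people"] >= SUPPRESSION_THRESHOLD)
print(counts)
```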
Technical design choices
Flexible infrastructure
We designed the SIT infrastructure to be flexible and scalable, leveraging a variety of Amazon Web Services (AWS) microservices. Building in this way enables expansion to different user groups and use cases, including the possibility of adding new datasets. To support the addition of a new dataset, the team would simply need to secure access to the relevant data and map it to the custom data model, ensuring that personal characteristics are referred to in a consistent way.
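For example, onboarding a hypothetical new source might only require one additional column map onto the shared data model, with the rest of the cleaning and matching pipeline reused unchanged (all field names below are made up):

```python
# Hypothetical column map for a new source system, mapping its fields onto the
# shared data model's characteristic names used elsewhere in the pipeline.
NEW_SOURCE_MAP = {
    "GivenName": "first_name",
    "FamilyName": "last_name",
    "DateOfBirth": "date_of_birth",
    "NationalInsuranceNo": "ni_number",
    "ContactNumber": "phone",
}
```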
Passwordless login
A particularly interesting feature to draw out is the use of passwordless authentication, enabling users to log in to the SIT by entering a registered email address. They are then sent a temporary authentication link which provides access to the SIT. The team opted for this technique as it provides a robust and secure mechanism for user authentication and access control whilst reducing the administrative overhead of having to manage passwords. This was important given we were rapidly expanding the userbase in the second phase of the project.
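To give a feel for how a magic-link flow like this typically works, here is a generic sketch using the Python standard library. It is not the SIT's AWS-based implementation; the URL, expiry time and token scheme are assumptions:

```python
import hashlib
import hmac
import secrets
import time
from urllib.parse import urlencode

SECRET_KEY = secrets.token_bytes(32)   # server-side signing secret (illustrative)
LINK_TTL_SECONDS = 15 * 60             # links expire after 15 minutes (assumed)

def _sign(email: str, expires_at: int) -> str:
    """HMAC over the email and expiry so the link can't be forged or altered."""
    message = f"{email}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def make_login_link(email: str) -> str:
    """Build the temporary, signed link emailed to a registered user."""
    expires_at = int(time.time()) + LINK_TTL_SECONDS
    query = urlencode({"email": email, "expires": expires_at,
                       "token": _sign(email, expires_at)})
    return f"https://example-sit.local/auth/verify?{query}"

def verify_login(email: str, expires_at: int, token: str) -> bool:
    """Accept the link only if it hasn't expired and the signature is valid."""
    if time.time() > expires_at:
        return False
    return hmac.compare_digest(token, _sign(email, expires_at))
```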
Key Reflections
What’s clear from this project is that you don’t need to leverage the most cutting-edge or experimental technologies to have significant impact. We focused on probabilistic data matching, a well-understood problem with tried and tested solutions. Whilst there’s a lot of hype around generative AI and other emerging technologies, this project demonstrates that applying traditional, proven methods can yield powerful and practical results. It’s a compelling reminder to avoid overcomplicating things and to try simple solutions first.
Glossary
Application Programming Interface (API): a set of rules and tools that allows different software applications to communicate and exchange data with each other
Fuzzy matching: a way of identifying similar, but not identical, elements in datasets
Machine learning: a type of artificial intelligence (AI) that allows computers to learn and improve from data, without being explicitly programmed to do so
Probabilistic matching: an approach to linking records across two or more datasets based on how likely they are to refer to the same individual, given the degree of similarity between their details

Anna Humpleby

James MacTavish