In this blog, we explore how new technologies can be applied to Asset Management. The most commoditized parts of that market are typically quick to take technological innovations on board because operational leverage can constitute a real competitive edge. However, at the other end of the spectrum, where margins are high, optimization of resources is rarely a top priority, so everything moves extremely slowly. It is still very much a people business, and serving clients (investors) at scale often requires overqualified juniors to complete low value-add tasks. Part of this is due to limited awareness of the depth of the toolbox. But sometimes it just comes down to solutions being implemented before the tech is fully mature, or to a misunderstanding of the actual productivity-enhancement needs. This typically leads to never-ending projects that do not yield any operational benefits. And it doubles the work for everyone, as everything needs to run in parallel until the project is finished.
To address some of these challenges, we build tools tackling real use cases, showcasing the options available and assessing whether they live up to the hype. This blog aims to demystify emerging technologies that can be relevant in Finance, giving an introduction to important concepts so that readers can better appreciate current weaknesses and future opportunities. It is an ambitious project but, to keep things accessible, we should not set overly ambitious targets at first. You will see that even the basics can be more than sufficient to harvest low-hanging fruit.
For this first blog post, the low-hanging fruit is sector classification. It does not sound too complex, but [re]classification can give rise to large-scale projects for Asset Managers. And you would be surprised how often this happens. It can be driven by new regulations: in Europe, for example, the upcoming EU taxonomy for sustainable activities will apply to sectors defined according to a certain classification (NACE). Classification can also be necessary if an important client wants bespoke reporting based on their own set of sectors. And more importantly, every once in a while, the main classifications are updated to reflect changes in the economic landscape (see the GICS® framework reclassification in 2023). This has a knock-on effect for Asset Managers that use these frameworks… and for Asset Managers that have their own frameworks but realize that an update may be necessary.
Updating sectors is a very manual process: you need to find the most relevant industry (top level) according to your classification and drill down from there until you get to the level of granularity you need. If you choose the wrong branch at the top, you have no chance of getting it right unless you repeat the entire process for multiple top levels. Even with productivity hacks, it is hard to allocate a level 3 sector in less than 30 seconds; at level 4, going below 1 minute seems challenging. To form your own opinion, pick a random company and try to find its level 4 sub-industry according to the official GICS classification.
If you are a small Asset Manager and have tens of sectors to update, the manual approach is relatively painless. But what if you have hundreds? What if you have thousands? I went through this process multiple times for an Asset Manager with tens of thousands of companies in database1. You can do the math to get a sense of how many man-hours have to go in. And whatever result you get, you need to add the time required to set up a task force and prepare a project of that scale. Fortunately, it does not have to be like this anymore.
Automating the entire process seems unrealistic with today’s tech, but cutting man-hours by 80% to 90% looks achievable, as artificial intelligence can do much of the heavy lifting. Let’s keep this objective in mind as we test different solutions.
An important first step, to put ourselves in users’ shoes, is to build an appropriate dataset:
- sector classifications: given that GICS and NACE are widely used, we focused on these. Both classifications have 4 levels of granularity with, at the most granular level, 163 sectors for GICS and 615 sectors for NACE. We scraped the NACE website using a Python framework called Scrapy (a minimal spider sketch follows this list). MSCI provides the GICS classification in Excel format here. In all transparency, a bit of manual data cleaning was necessary at the end, which took a good 30 minutes in total.
- a reasonably long list of companies with activity descriptions and sectors defined according to other classifications (for comparison with our outputs). The websites of Carlyle, KKR and Ardian provided 878 companies with relevant information. We used Scrapy for Ardian and Carlyle, and simply read the API responses for KKR.
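For readers who have never touched Scrapy, here is a minimal sketch of the kind of spider we wrote. The URL and CSS selectors below are placeholders (every portfolio page is structured differently), but the pattern is always the same: iterate over company cards and yield one dictionary per company.

```python
# Minimal Scrapy spider sketch - the URL and CSS selectors are placeholders,
# not those of any real website we scraped.
import scrapy


class PortfolioSpider(scrapy.Spider):
    name = "portfolio"
    start_urls = ["https://www.example.com/portfolio"]  # placeholder

    def parse(self, response):
        # One item per company card; adapt the selectors to the actual page.
        for card in response.css("div.company-card"):
            yield {
                "name": card.css("h3::text").get(),
                "description": card.css("p.description::text").get(),
                "sector": card.css("span.sector::text").get(),
            }
```

Running `scrapy runspider portfolio_spider.py -o companies.csv` dumps the scraped items into a csv file that can then be cleaned up manually.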
This part of the project is not the most exciting and involves a fair amount of programming. The good thing about programming these days is that GPT4 not only holds your hand, it explains frameworks that you are not familiar with (to the extent that they are old enough to be in the training scope). We also use GitHub Copilot2, mostly as a sentence-completion tool. Copilot enhances productivity greatly, but in terms of learning as you build, GPT4 is unparalleled. And GPT4 debugs for you. So despite a notable decrease in helpfulness recently, the OpenAI subscription is still a good use of $20 per month. A further decline in helpfulness might lead us to reconsider this position though.
Off-the-shelf Solutions
As of today, GPT4 is by far the most powerful and versatile model out there. So why not test this handy off-the-shelf solution on our use case?
This is a first opportunity to address a key misconception about Large Language Models (LLMs): despite all the knowledge that goes into their training, they should not be used as knowledge databases. When you work with code, it is easy to spot problems and testing will tell you if it works. But otherwise, you cannot just ask for something factual and trust them to return a good answer: in the case of GICS, it is highly likely that the model has some representation of the classification from its training, so it will likely return a credible answer. But it is highly unlikely that the model knows about the latest update of the classification (in 2023), and even if it did, we would advise checking every answer very carefully.
Something LLMs are good at is handling information within the context window, that is, all the information you provide in the prompt, sometimes extended to previous prompts and the feedback you gave during the session. GPT4 is exceptional at that, even with a context window approaching 100k words. The technique consisting of (i) passing relevant information in the context window and (ii) asking the model to answer questions about that context is called Retrieval Augmented Generation (RAG). It is exactly what happens when you use LLMs with the ability to browse the internet: information retrieved from the web is passed in the context window and the model focuses on this information. Many variations of RAG exist but, so far, no one has found the perfect recipe. Only Twitter’s (now X) Grok model, which retrieves real-time information leveraging a solution by Qdrant, looks credible. But let’s not get distracted by complex architectures: our project is fairly simple, and our RAG pipeline will basically consist of uploading our classification files into a GPT4 conversation and asking the model to use them. Then we can paste a company description taken from the internet and ask GPT4 to suggest the most appropriate sector (see video below).
A few weeks before we started working on this project, OpenAI launched a new product called CustomGPTs, which lets users save a bespoke system prompt that is always in the context window. Users can share these "custom" models with other users. The classification project was a great opportunity to test CustomGPTs, so we built a GICS classifier and a NACE classifier. Each CustomGPT has the corresponding classification file attached to its system prompt. You can use them if you want, but it may not be a good use of your time. The video below will show you why.
There are two obvious issues with the OpenAi solutions:
- Firstly, the answers are all over the place; the models are just not able to use the source files correctly. GPT4 does a better job than the CustomGPTs, but the answer given in the example is not correct. Worse, the sector name associated with the proposed code is wrong: Machinery is a sector name at level 3 (201060), and 20106015 corresponds to Agricultural & Farm Machinery in the GICS classification. The CustomGPTs return sectors that are not remotely close to the correct answer, but this may be a skill issue on our end; with better prompt "engineering" we may have obtained better answers.
NACE classifier system prompt
- Secondly, answers take too long to generate and to read. For the CustomGPT video, we even had to trim the first 30 seconds of file processing. GPT4 is faster, but realistically there is no productivity to be gained.
If you are confident that the model will return correct answers, one way to save a lot of time is to prompt the LLM programmatically for your entire list of companies. OpenAI offers this option via its API, and our initial plan was to showcase a similar RAG pipeline using an open-source model locally (most likely OpenHermes 2.5, a finetuned version of Mistral 7B). Getting familiar with alternatives to OpenAI is critical given their erratic governance and strategic arbitrages that have been unfavorable to users recently (a decrease in helpfulness means that you cannot be sure that what worked in the past will work in the future). But the underwhelming results with GPT4 made us hesitant: the results might be equally bad - probably worse - and it was also unrealistic to deploy this model on a server just for you to run some classification tests. This would be a very inefficient use of resources, and it felt like a waste of time.
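For illustration only, here is a minimal sketch of what such a programmatic pipeline could look like with the OpenAI Python client. This is not the code behind our tests: `gics_level4_names` and `companies` are assumed to hold the classification and the scraped list, the model name is whatever you have access to, and the prompt is deliberately simplistic.

```python
# Sketch only: prompt GPT4 programmatically for every company in the list.
# `gics_level4_names` and `companies` are assumed to exist (see the dataset
# section above).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(description: str, sector_names: list[str]) -> str:
    # RAG in its simplest form: the candidate sectors travel in the context
    # window and the model is asked to stick to them.
    system_prompt = (
        "Pick the single most appropriate sector from the list below. "
        "Answer with the sector name only.\n" + "\n".join(sector_names)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content

# results = {c["name"]: classify(c["description"], gics_level4_names) for c in companies}
```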
Semantic Search
We had a much more basic solution in mind: semantic search. After all, we don’t need complex LLMs to assess how close two sentences are to each other. The key is what happens under the hood of most generative AI solutions: the input is tokenized (i.e. split into words or parts of words) and tokens are encoded into sequences of numbers, i.e. vectors, which carry the semantics. The process is called "embedding" and the resulting vectors are commonly called embeddings. By pooling token embeddings, you can capture the semantics of the entire input3. These days, the typical method to generate embeddings is to use machine learning models. All LLMs use embeddings internally, but embedding models also exist on a standalone basis. They constitute comparatively lightweight solutions for tasks like classification or sentiment analysis.
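To make this concrete, here is a two-line sketch using the Sentence-Transformers framework that our Python scripts rely on (more on it further down); the company description is made up:

```python
# Minimal sketch: encoding a sentence into an embedding vector.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vector = model.encode("Manufacturer of agricultural machinery and tractors")
print(vector.shape)  # (384,) - one 384-dimensional vector for the whole sentence
```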
The size of the embeddings varies from one model to another and, for our experiments, we mainly used models generating embeddings of dimension 384 (which means a sequence of 384 numbers4). OpenAI’s model has a dimension of 1536. Size is not necessarily what defines a "good" embedding model. Instead, it is the model’s ability to place tokens with similar semantics close to each other in the 384-dimensional space (or 1536-dimensional space, as the case may be). You can train a small embedding model to be extremely precise on a specific domain or to cater for a specific use case. But off-the-shelf, generalist models can do a decent job 95% of the time, and larger models tend to be more precise. The human brain cannot represent the position of a vector in an X-dimensional space, but we can use dimension-reduction techniques to visualize embeddings in 3 dimensions, and this helps understand how semantic search works.
We used two different models to embed descriptions of level 4 sectors, according to the GICS classification on the one hand and to the NACE classification on the other hand. The colors correspond to level 1 sectors (there are multiple sub-sectors in each level 1 sector). The charts show that points with similar colors are often close to each other. This makes sense: if two sub-sectors are part of the same top-level sector, their semantics must be reasonably close.
Now imagine that you use the embedding model to encode a company description. You can then look at the points that are closest to your new vector and infer that the relevant sector is one of those points. This method is called KNN (K-Nearest Neighbors). Of course, in practice, you don’t map your vectors in 3D space to search for the closest points: you can use math to calculate the distance between two vectors in the X-dimensional space, and one line of code lets you iterate through the entire list of sectors, so you can rank sectors by distance from your vector (see the sketch below). If you use cosine similarity to measure this proximity, the result lies between -1 and 1, but for sentence embeddings it lands in practice between 0 and 1: a value of 1 implies a perfect match, whilst a value of 0 means that the vectors do not share any attributes (this never happens for sentence embeddings). The manual approach for sector classification was top-down; the algorithmic approach is bottom-up.
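Here is a sketch of that ranking step; `sector_embeddings` is assumed to be the array of pre-computed level 4 sector embeddings and `sector_names` the matching list of names (the scripts linked further down follow the same logic):

```python
# Sketch of the KNN ranking: cosine similarity between one company embedding
# and every pre-computed sector embedding, sorted in descending order.
# `sector_embeddings` (shape: n_sectors x 384) and `sector_names` are assumed.
import numpy as np

def top_k_sectors(company_embedding, sector_embeddings, sector_names, k=5):
    # Cosine similarity = dot product of L2-normalised vectors.
    q = company_embedding / np.linalg.norm(company_embedding)
    s = sector_embeddings / np.linalg.norm(sector_embeddings, axis=1, keepdims=True)
    similarities = s @ q
    best = np.argsort(-similarities)[:k]
    return [(sector_names[i], float(similarities[i])) for i in best]
```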
From the charts above, it is clear that different models generate different embeddings. And if you run semantic search algorithms on different embeddings, you can expect different results. This is exactly what happens with the two models of comparable size we chose for our tests. All-MiniLM-L6-v2 is a pretrained model released in 2020 by the team behind sentence-BERT. BAAI/bge-small-en-v1.5 is a more recent pretrained model released in Summer 2023.
In the example above, bge-small returns the correct sector as the closest vector. But for all-MiniLM, the correct answer is not even in the top 5 (it is #6). In a real-life scenario, this means that both models would give the right answer down to level 2 (the first 4 digits of the sector code, 1510, correspond to level 2), but all-MiniLM would not even suggest the appropriate level 3 sector (151030), let alone level 4. It is not particularly surprising that a more recent model performs better than an older one, but if you use both models a lot, you will realize that it is not as clear-cut - all-MiniLM would be our model of choice in many cases. What’s interesting is that nothing changed in our scripts except the name of the model. The prospect of being able to swap in a better model in the future probably justifies spending some time exploring and building tools today.
After showcasing what can work, let’s focus on situations where it really does not. One trivial example is when part of the prompt makes the model go off-track. This kind of issue is commonly referred to as "prompt injection" and is actually a major cyber-security threat for LLMs with access to sensitive information. With embedding models, the risk of information leakage is non-existent, but we can illustrate the concept using this description of a company, Lisea, from the Ardian website: "The SEA project represents one of the biggest concession contracts in Europe. The 50-year contract will cover the financing, design, construction, operation and maintenance of the 303 kilometers high-speed rail link between Tours and Bordeaux. This high-speed rail link will reduce the travel time of 55 minutes between Paris and Bordeaux." This company is obviously in charge of rail infrastructure, but the acronym "SEA" leads all-MiniLM to classify it as marine transportation. Under the NACE classification, rail is not even suggested by the model.
If you remove the first part of the description including the acronym, all-MiniLM is back on track.
So what can we do in that case? There are more complex and compute-intensive ways to assess similarity which are known to yield better results. One of them relies on support vector machines (SVMs): in our case, a small model is trained on the fly to fit a decision function to our embeddings (we use the Linear Support Vector Classification method described here; a sketch is shown below). We see that, even with the SEA acronym in the description, all-MiniLM embeddings can yield a more appropriate sector as the top choice.
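Below is a sketch of that idea, in the spirit of the exemplar-SVM approach from the Karpathy repository mentioned later: the company embedding is the only positive example, every sector embedding is a negative, and sectors are ranked by the decision function instead of cosine similarity. Variable names are the same assumptions as in the KNN sketch, and the hyperparameters are illustrative.

```python
# Sketch of the SVM alternative: a linear SVM fitted on the fly, with the
# company embedding as the lone positive example.
import numpy as np
from sklearn import svm

def svm_rank(company_embedding, sector_embeddings, sector_names, k=5):
    x = np.concatenate([company_embedding[None, :], sector_embeddings])
    y = np.zeros(len(x))
    y[0] = 1  # the query is the positive class
    clf = svm.LinearSVC(class_weight="balanced", C=0.1, max_iter=10000, tol=1e-6)
    clf.fit(x, y)
    scores = clf.decision_function(x)[1:]  # drop the query itself
    best = np.argsort(-scores)[:k]
    return [(sector_names[i], float(scores[i])) for i in best]
```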
For what it’s worth, the bge-small model did better than all-MiniLM at the SEA-test with the KNN approach. But we wanted to cover SVM to showcase an alternative.
However, during our tests, we uncovered other examples where neither model could provide the correct answer in the top 5, not even with the SVM approach. P&I (Personal & Informatik) is a company currently owned by the Private Equity firm Permira. It provides payroll, HR, and human capital management software to mid-sized private companies and public-sector entities in German-speaking countries. This activity probably falls under Application Software (45103010) according to GICS. But based on our initial prompt, both embedding models generate vectors with too much emphasis on the HR dimension and not enough on software.
At first, we thought that more context would help so we tried again with the company description pasted from the Permira website. The results were not much better.
So we tried again, with a more focused description. That time, the correct answer was in the list when assessing similarity with KNN, but overall the test was disappointing.
There are two important takeaways:
- Even if you don’t rely solely on AI but use it to reduce the number of options to choose from (so that users don’t have to read irrelevant options), there is always a possibility that the correct answer is excluded and users are left with no choice but to pick the second best. In our example, classification at level 3 would still be correct (451030 - Software).
- If you use a larger embedding model, you can obtain marginally better results. We ran the same company description through the large version of the bge model, which generates vectors of dimension 1024 instead of 384 for the small version. This comes at a cost, as the model itself is much larger (1.35GB vs. 133MB for the small version) and embeddings take much longer to compute. If you can afford the storage and can accept slower inference, the larger the better. Otherwise, you need to make a trade-off. Quantized models offer a good balance and we’ll expand on this in the next section.
Output for GICS embedded with bge-large
The above example also shows that large generalist models are no silver bullet. The Application Software sub-sector in the GICS classification looks like a difficult one to match through embeddings in any case, in particular for software dedicated to industry verticals: there is a high probability that attention is drawn to the industry vertical rather than to "software", and if it is an integrated solution, it quickly looks like system software. SVM does not seem to help much. If all you need is level 3 classification, our solution should do the job, but this use case may justify finetuning an embedding model, or trying a completely different approach, e.g. with another type of model. Another reason to consider training an embedding model for our use case is that Asset Managers tend to use paraphrases to present their portfolio companies from a flattering angle, and generalist embedding models may not pick up all the euphemisms or exaggerations. You shouldn't blame them; most humans would not understand either that "biodegradable personal care essentials" means "toilet paper".
One aspect we have not covered yet is scalability. This is where the company list scraped from Asset Managers’ websites comes into play. Once the model is loaded, matching a company description with all sectors is rather fast, even without the optimizations that, we are sure, actual software engineers could make. Processing the entire list (878 companies) with all-MiniLM takes between 30 and 45 seconds, and this includes assessing similarity with both the KNN and the SVM approaches. For the avoidance of doubt, all this takes place locally (hardware for tests: Apple M2 Pro, 16GB). The small bge model takes about twice as long as all-MiniLM to complete the same tasks - not too sure why - and the large bge model takes three times as long as the small one. The outputs in csv format are available for all-MiniLM, bge-small and bge-large.
The Python scripts to generate the embeddings and to perform semantic search with both the KNN and SVM methods are available here. The repository includes three main Python files: one to analyze individual companies, one to analyze an entire list of companies and one to visualize the embeddings in 3-dimensional space. We used a Python framework, Sentence-Transformers, on top of the Hugging Face transformers library. The framework has built-in functions for semantic search, but this can be achieved in a few lines of code, so we re-implemented it for educational purposes, applying the methods described in this public repository by OpenAI founding member Andrej Karpathy - an absolute legend. Any model that is compatible with Sentence-Transformers can be used with our scripts: if you do not have the model saved locally, it will be fetched from the relevant HuggingFace endpoint, and all level 4 sectors of both the NACE and GICS classifications will be embedded (they are pre-embedded for the 3 models we tested above). In practice, Sentence-Transformers is not strictly necessary, it just makes a first dip simpler; please refer to the README file for more information.
We have gone quite a long way since the beginning of our experiments. Let's take a step back to look at our initial objectives:
- pushing off-the-shelf solutions to their limits and getting familiar with key building blocks of Artificial Intelligence ✅
- integrating machine learning models in our software stack to tackle a real-life use case ✅
- having fun doing all this ✅ (at least for me)
- helping users to complete the task over 80% faster ❌.
To measure how far we are from the last objective, we need to put ourselves in the shoes of a typical user. First of all, what we have built does not actually help anyone: it does extremely quickly, and potentially very badly, something that humans could do much better but much more slowly. Because we could not trust our algorithms or models to find THE answer, we produced in seconds or minutes a table with almost 900 rows, and end users would have to choose among the values in the columns. The improvement, if any, is marginal, and I would still hate anyone who dumped something like this on me.
If you are comfortable with Python, you can make your own user experience less painful, but this is not conceivable for most users. Very few would even imagine writing code, and those who can may be forbidden to do so on their work device. As a matter of fact, I was never allowed to use Python or any database in my job, despite juggling hundreds of funds and thousands of companies. Fortunately for my employer and the clients, I am good at Excel.
This digression should not distract us from our main objective: helping users with a specific task. But you can’t help users if you don’t have any, and you can’t have users if you don’t have a product. Python files are not a product. So that’s our final challenge: building an actual product using what we have learnt. And hopefully learning more in the process.
Users won’t read or write code, so we need a simple user interface that lets them write or paste their company descriptions and read the corresponding sectors in a couple of clicks. Also, users will not download anything, so it needs to run in the browser. And users won’t wait, so it needs to be fast. These are the key specs for our product. Note that code assistants can't help much when their training cut-off predates the release of most of the tools we are going to use.
Client-side Inference
We are not going to take the easy route of deploying the model on a server and letting users interact with it through an API. That approach certainly has benefits, but you can never be sure of how much it will cost you, and it requires users to send THEIR data to a remote server, which is a deal breaker for most people in our case. However, if you are interested in doing this for a simple demo, you should check out HuggingFace Spaces.
Instead, we can have users download the models from the HuggingFace endpoint (just like with our Python files) and run them on their own machines, merely using our web app as a user interface. Running small models on the client side (on users' machines) guarantees data privacy. Longer term, another benefit of client-side inference is that it does not contribute to the expected compute bottleneck in data centers when AI applications really take off5. There is a good chance that this approach becomes the norm as state-of-the-art models become smaller so, irrespective of our classifier project, exploring this route makes sense today.
Only small models are candidates for client-side inference because consumer hardware typically lacks the memory to run large ones. This is not really a problem in our case, because even bge-large is relatively small, but we’ll work as if it were a real constraint to illustrate the work-arounds. This also guarantees that our app can run on smartphones, and user experience will only be better if users are not required to download a 1.3GB model each time they open a session (handling the browser cache with the transformers.js library is something that proves difficult. As in: it makes the web app crash. We have parked that issue for now by ignoring models in the browser cache… it is not ideal, but this is another reason to focus on model size).
So, for our web app, we want to focus on small embedding models. And they can be made even smaller through quantization. The technique is widely used these days to run LLMs locally: one guy single-handedly makes quantized versions of available models for the entire ecosystem. Different levels of "compression" can be achieved, but there is a trade-off between size and precision. Quantifying the drop in precision is hard, but we can illustrate it with an example: if you embed the same sentence twice with the same embedding model, you expect the cosine similarity of the two resulting vectors to equal 1. We embedded one sector description with a quantized version of bge-small (34MB vs. 133MB for the original) and checked similarity against the embedding from the original model. The score is 93.7%.
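As a sketch, the check boils down to a single cosine similarity. `encode_original` and `encode_quantized` are placeholders for however you run the two versions of the model (the quantized ONNX versions come up in the next paragraphs):

```python
# Sketch of the sanity check: embed the same sector description with the
# original model and with its quantized counterpart, then compare the vectors.
# `encode_original` and `encode_quantized` are placeholders.
import numpy as np

def quantization_drift(text, encode_original, encode_quantized):
    a, b = encode_original(text), encode_quantized(text)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. with Sentence-Transformers for the original model:
# from sentence_transformers import SentenceTransformer
# original = SentenceTransformer("BAAI/bge-small-en-v1.5")
# quantization_drift(sector_description, original.encode, quantized_encoder)
# -> roughly 0.937 in our bge-small test
```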
We mentioned before that larger models typically do better than smaller ones. If size is what you are solving for, you should choose the quantized version of a large model over a non-quantized small model. For our use case, we do not have the luxury of choosing: it will be a quantized version of the smaller model. We have to accept that our web app will be less accurate than our Python scripts, and design the UX around it (e.g. by showing more than 5 results or by showing adjacent sectors for each result).
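The "adjacent sectors" idea is cheap to implement once the classification is loaded as a dictionary mapping codes to names (the `gics_level4` dictionary below is hypothetical): for a suggested level 4 code, show the sibling sub-industries sharing the same level 3 parent, i.e. the first 6 digits.

```python
# Sketch of the "adjacent sectors" idea: sibling level 4 sub-industries share
# the same level 3 parent (the first 6 digits of the 8-digit GICS code).
# `gics_level4` is a hypothetical dict mapping codes to sector names.
def adjacent_sectors(code: str, gics_level4: dict) -> dict:
    parent = code[:6]
    return {c: n for c, n in gics_level4.items()
            if c.startswith(parent) and c != code}

# adjacent_sectors("20106015", gics_level4) would return the other
# sub-industries under 201060 (Machinery).
```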
Latency, i.e. low inference speed, would also be very detrimental to user experience. Inference must be as fast as possible, and the Open Neural Network Exchange (ONNX) format looked like an ideal solution.
Converting models into ONNX format can be involved, so we recommend checking first whether ONNX versions already exist. In our case, both models - and their quantized versions - were available on Xenova’s page on HuggingFace, including directions to run them with transformers.js - the equivalent of what we used in our Python files. To avoid re-running the embeddings for all level 4 sectors of the GICS and NACE classifications each time, we stored them in the classification dictionaries that are loaded in the background when a user opens the app (a sketch of this pre-computation step follows). This may not be optimal (nor recommended), as the resulting dictionaries can exceed 10MB, but it has one major benefit: it works, and we do not rely on vector databases or API calls to third parties.
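Here is a sketch of that pre-computation step on the Python side. The file names and the structure of the sector list are assumptions, and in practice the stored vectors should come from the same quantized model the app runs in the browser:

```python
# Sketch: embed every level 4 sector once and store the vectors alongside the
# sector metadata, so the web app only embeds the user's description at run
# time. File names and the input structure are assumptions.
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

with open("gics_level4.json", encoding="utf-8") as f:
    sectors = json.load(f)  # assumed: list of {"code": ..., "description": ...}

for sector in sectors:
    sector["embedding"] = model.encode(sector["description"]).tolist()

with open("gics_level4_embedded.json", "w", encoding="utf-8") as f:
    json.dump(sectors, f)
```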
Using the directions on the ONNX model pages, examples from the transformers.js documentation, and support from code assistants, we could put together this demo React app. Experienced developers would likely be appalled that we even call this an app, for it is nothing more than a hack built out of common sense, ignoring the best practices of the programming industry. But it works, and it tackles a real-life use case: the claim that this "AI classifier" can help users allocate a sector in less than 10 seconds would not be exaggerated.
The code for this app is available in this GitHub repository. When a user opens the app, the classification dictionaries are loaded and the quantized bge-small model is downloaded from the HuggingFace hub. Based on our use of the app, all-MiniLM may yield better results than bge-small, so we kept the option to use all-MiniLM, but it is not downloaded until the user clicks on the relevant radio button. As mentioned earlier, until we manage to access the browser cache, the model is downloaded each time a user starts a session. Please refer to the README file for more information.
The Unfortunate Flipside
There is a broader point to be made about this hack: AI tools are accessible, and code assistants can help anyone piece together code snippets from the web, without any need to actually understand what is going on. This is relatively new, and it will be a paradigm shift for most companies, as productivity tools are typically bought from software vendors or developed in-house by experts who know what they are doing. Now, if those take too long to deliver, employees have credible options to hack their way around bureaucracy.
Each app that is made of code stitched together and/or that downloads packages/modules from unknown sources is a potential source of leakage of internal data. Or worse, it can constitute an entry point for exploits. Go through the thread of this recent internet meme to understand how bad it can be. It will be easy to blame the employees who built the apps. But the reality is that, now that anyone can try to enhance their productivity, everyone is expected to. And that includes the central functions in charge of productivity tools.