Case Study of MLOps in a Hedge Fund - From zero to $30M

Alex Chung
10 min read · Jul 19, 2021

The core concepts behind taking machine learning models into production, widely known as MLOps, are similar from use case to use case. There are small twists depending on data types and domain needs, but the general steps of data collection, data processing, feature engineering, data labeling, model design, model training, model optimization, and model deployment and monitoring are universal.

Hedge funds have applied machine learning for decades. An investment manager, known as a general partner, takes money from investors, known as limited partners. There's a small segment of funds called quantitative hedge funds, where the general partners manage money with teams of Ph.D.s who write algorithms to calculate allocations of stocks, bonds, and other securities. Renaissance (RenTech) and Two Sigma are two well-known fund managers that leverage quantitative trading. Earlier this year, I discussed with the WSJ the challenges of creating and maintaining model performance in MLOps and weighed in on the failure of model performance in RenTech's public fund.

Image by the author. Business model of a hedge fund.

These hedge funds need three pieces of software that are all complex to build: a way to simulate and backtest models on market data, a scoring system that evaluates assets using models in real time, and portfolio execution that places trading decisions for the fund. The performance of the fund is primarily driven by the first two, with portfolio execution as the final consumer.

As for data science teams in most other fields, the challenges of MLOps, or taking models into production, are acute. Back in 2012, I built a quantitative investment fund that managed assets from outside investors. While it started as a Cornell dorm room project, the platform actively invested outside money for about four years. At its highest point, we invested about $30 million. Surprisingly, most of what I did to build and maintain the system wasn't data science related; it was the infrastructure. This write-up is intended to help people understand the value of MLOps tooling in a specific business vertical.

Business Context

We bought bonds and held them to maturity, when the principal and interest were repaid. Our inference evaluation occurred once in the life cycle of a bond, when it first debuted on the market. Our system returned an amount to buy, ranging from $0 up to the full size of the bond (up to six figures). We didn't run inference on a bond again, because we observed that bonds fitting our profile would already have been bought by other parties on the market. Looking at each bond only once simplified the scope of our problem: decide on the risk of each new bond and whether its rate of return met our target.

Image by the author. Our models analyzed bonds for risk, and then we had a separate heuristic for analyzing return.

Bonds that debuted were run against a decision tree to sort them into risk profiles, which we calculated as the likelihood of default. We always ignored high-risk bonds, regardless of return, because we observed that defaults would wipe out large amounts of principal. Within the pool of low-risk bonds, we used Monte Carlo simulations to identify the right threshold for the annualized rate of return at which we wanted to buy. We could be extra selective because we were limited by the dollars we could invest. The better this threshold was tuned, the higher our fund's IRR (internal rate of return).
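
To make the threshold search concrete, here is a minimal sketch of a Monte Carlo hurdle search. It assumes a list of historical low-risk bonds with hypothetical `size`, `annual_yield`, and `realized_return` fields; this is an illustration of the idea, not our original code.

```python
import random
from statistics import mean

def simulate_period(candidate_hurdle, historical_bonds, capital, n_trials=1000):
    """Estimate portfolio return for a given yield hurdle by replaying random
    samples of historical low-risk bond arrivals (hypothetical data shape)."""
    trial_returns = []
    for _ in range(n_trials):
        cash = capital
        realized = []
        # Sample a plausible sequence of bond debuts for one period.
        arrivals = random.sample(historical_bonds, k=min(200, len(historical_bonds)))
        for bond in arrivals:
            if cash <= 0:
                break
            if bond["annual_yield"] >= candidate_hurdle:
                invested = min(cash, bond["size"])
                cash -= invested
                realized.append(invested * bond["realized_return"])
        trial_returns.append(sum(realized) / capital)
    return mean(trial_returns)

def pick_hurdle(historical_bonds, capital, grid=(0.06, 0.08, 0.10, 0.12, 0.14)):
    # A higher hurdle is more selective but risks leaving capital idle.
    return max(grid, key=lambda h: simulate_period(h, historical_bonds, capital))
```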

This core business model was augmented by many other microservices that ran our MLOps.

Data Workflows

In the data collection process, we pulled CSV-formatted market snapshots onto EC2 machines. Each machine handled a partition of the dataset and spawned a thread per bond instance for data processing. We cleaned the data with domain-specific transformations, such as removing bonds that had no repayment history, incorrect dates (like a repayment before the bond was issued), or missing English descriptions (our natural language parsing for feature engineering was English only).
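
As an illustration of the kinds of cleaning rules described above, here is a rough pandas sketch; the column names are hypothetical, not our real schema.

```python
import pandas as pd

def clean_snapshot(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative domain-specific filters on a bond market snapshot."""
    # Drop bonds with no repayment history at all.
    df = df[df["repayment_history"].notna()]
    # Drop impossible dates, e.g. a repayment recorded before issuance.
    df = df[df["first_repayment_date"] >= df["issue_date"]]
    # Our NLP feature engineering was English-only, so require a description.
    df = df[df["description_en"].notna() & (df["description_en"].str.len() > 0)]
    return df
```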

Attributes of bonds are tabular data types, and we computed several hundred features per bond. Feature generation is about identifying signals in the existing data that can be used for prediction; I developed the domain knowledge by generalizing from interviews I had done with several underwriters. I encoded features such as bond duration, borrowers tied to the bond note, underwriter metadata (defaults, delinquencies, years of experience), country of origination, economic maturity of the borrower's city, previous borrowings, and borrower fund use. We kept our offline data pipeline and online feature engineering consistent by creating a universal library that was shared across both services. Each iteration of a feature would either get a new name or completely rewrite all historical values.
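
A minimal sketch of what a shared, versioned feature library can look like is below. The feature names and bond fields are hypothetical, but the idea of one registry called by both the offline pipeline and the online scorer mirrors the approach described above.

```python
# Both the batch pipeline and the real-time scorer import this module,
# which keeps offline and online feature values consistent.
FEATURE_REGISTRY = {}

def feature(name):
    """Register a feature function under a versioned name, e.g. 'bond_duration_v1'."""
    def wrapper(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return wrapper

@feature("bond_duration_v1")
def bond_duration(bond):
    return (bond["maturity_date"] - bond["issue_date"]).days

@feature("underwriter_default_rate_v2")
def underwriter_default_rate(bond):
    stats = bond["underwriter_stats"]
    return stats["defaults"] / max(stats["bonds_issued"], 1)

def compute_features(bond):
    return {name: fn(bond) for name, fn in FEATURE_REGISTRY.items()}
```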

Data labeling was done automatically through a list of rules. We measured whether repayments were completed on time (early or according to the payment schedule); 30, 60, or 90 days behind schedule but ultimately paid in full; or never paid back or more than 90 days past due. Lenders at banks use portfolio at risk (PAR) metrics, PAR30, PAR60, and PAR90, to gauge the health of a portfolio. While all bond data was stored, the 90-day lag meant our model's concept of a bad bond was potentially three or more months behind.
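
A rule-based labeler along these lines might look like the following sketch; the bucket names are illustrative.

```python
def label_repayment(days_late: int, paid_in_full: bool) -> str:
    """Illustrative rule-based labeler mirroring the buckets described above."""
    if paid_in_full and days_late <= 0:
        return "on_time"            # early or on schedule
    if paid_in_full and days_late <= 30:
        return "par30_recovered"
    if paid_in_full and days_late <= 60:
        return "par60_recovered"
    if paid_in_full and days_late <= 90:
        return "par90_recovered"
    return "default"                # never repaid or more than 90 days past due
```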

The raw data, features generated, and output labels were stored in two places.

The first was a MySQL database with a bond object that we used across our entire stack to track raw financials, features, and labels. SQL was easy to query in dashboards and to visualize for ad hoc data analysis. This was also the main table into which we copied streamed live market data.
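
The shared bond object might be sketched as a simple record like the one below; the fields are illustrative, not the actual table definition.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional

@dataclass
class BondRecord:
    """Hypothetical sketch of the bond object backed by the MySQL table."""
    bond_id: str
    issue_date: date
    maturity_date: date
    size_usd: float
    annual_yield: float
    raw_financials: Dict[str, float] = field(default_factory=dict)  # raw market data
    features: Dict[str, float] = field(default_factory=dict)        # engineered features
    label: Optional[str] = None                                     # e.g. "on_time", "default"
```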

Second, we stored a copy in raw, data-lake-like storage for our backtesting simulator to pull from. This pipeline only needed to run once to seed the data, and then regularly to update repayment history. However, streaming feeds drop data, which we experienced several times at a cost of thousands of dollars in profits. Running the entire data pipeline off snapshots also gave us visibility into what percentage of live market data we had dropped. We ran this job on a regular cadence.

Image by author. The data process in a bond scoring ML system.

Training Workflows

The base metrics data scientists care about are tied to accuracy, recall, and precision. However, the business metric was most important, and it was not entirely tied to accuracy measures. Our benchmark was the fund's IRR, or internal rate of return. This meant two things: we needed to pick out the low-risk, high-return bonds that standard accuracy metrics measured, and we needed to optimize returns. Our system had to predict whether buying a bond with a mediocre yield today outweighed the potential value of higher-yielding, equally low-risk bonds that might debut later.
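
For readers unfamiliar with IRR, here is a minimal sketch of computing it for a series of period cash flows with a simple bisection solver; no finance library is assumed, and the numbers are a toy example.

```python
def npv(rate: float, cashflows: list) -> float:
    """Net present value of period cash flows at a per-period discount rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows: list, lo=-0.99, hi=10.0, tol=1e-7) -> float:
    """Internal rate of return via bisection; assumes NPV changes sign on [lo, hi]."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(lo, cashflows) * npv(mid, cashflows) <= 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# Toy example: invest 100 today, receive 9 in interest each period, then principal back.
print(round(irr([-100, 9, 109]), 4))  # about 0.09, i.e. a 9% per-period IRR
```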

This is where the ideas of a model plus a simulation/reward approach can apply. Simulation and rewards are most commonly known from reinforcement learning, but we didn't use a true simulated reward environment. We quantified performance by creating a market backtesting and replay system that could accurately simulate a model given our fund size, credit line availability, fund transfer times, seasonality of bond volumes, and real repayment history. Each simulation took a model artifact and the business parameters we provided and returned the fund's hypothetical return.
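
The interface was roughly what the following sketch shows: replay historical debuts in time order, apply the model and business constraints, and report a hypothetical outcome. The `predict_risk` method, the bond fields, and the day-based timeline are hypothetical stand-ins, not our original system.

```python
def backtest(model, bonds_by_day, fund_size, hurdle, transfer_lag_days=2):
    """Illustrative replay of historical bond debuts in time order."""
    cash = fund_size
    in_transit = []                      # (day cash becomes available, amount repaid)
    for day in sorted(bonds_by_day):
        # Credit repayments whose transfer lag has elapsed.
        cash += sum(amt for d, amt in in_transit if d <= day)
        in_transit = [(d, amt) for d, amt in in_transit if d > day]
        for bond in bonds_by_day[day]:
            if model.predict_risk(bond) == "high" or bond["annual_yield"] < hurdle:
                continue
            buy = min(cash, bond["size"])
            if buy > 0:
                cash -= buy
                # Use the bond's real repayment history to schedule the payoff.
                in_transit.append((bond["payoff_day"] + transfer_lag_days,
                                   buy * (1 + bond["realized_return"])))
    final_value = cash + sum(amt for _, amt in in_transit)
    return final_value / fund_size - 1   # hypothetical return over the replay window
```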

Enough data was processed in a simulation that each server needed a local cache of the market data for backtesting. Hyperparameter tuning frameworks weren't available back then, so we kept things simple with grid search hyperparameter tuning. This was possible because our computations were relatively cheap: each evaluation took about five minutes to run, and a compute-optimized EC2 instance could process 32 jobs concurrently, so a full search space took several hours to run. Candidate models were also evaluated against multiple time periods to understand performance in different market conditions.
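
A grid search of this shape can be expressed with a plain process pool, as in the sketch below. The parameter names and ranges, `train_model`, and `BONDS_BY_DAY` are placeholders, and `backtest` stands in for a simulator like the one sketched earlier.

```python
from itertools import product
from multiprocessing import Pool

# Illustrative search space; the real parameters and ranges differed.
PARAM_GRID = {
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [10, 50, 100],
    "hurdle": [0.08, 0.10, 0.12],
}

def evaluate(params):
    """Train a candidate model and score it in the backtest simulator.
    `train_model`, `backtest`, and BONDS_BY_DAY are assumed helpers/data."""
    model = train_model(**{k: v for k, v in params.items() if k != "hurdle"})
    score = backtest(model, BONDS_BY_DAY, fund_size=1_000_000, hurdle=params["hurdle"])
    return params, score

if __name__ == "__main__":
    combos = [dict(zip(PARAM_GRID, values)) for values in product(*PARAM_GRID.values())]
    with Pool(processes=32) as pool:      # one worker per vCPU on a compute-optimized box
        results = pool.map(evaluate, combos)
    best_params, best_return = max(results, key=lambda r: r[1])
    print(best_params, best_return)
```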

Inference Workflows

Our real-time model inference and transaction execution were the final piece of this puzzle. We set a goal of having an investment decision made within one second of a bond debuting on the market. Good investments were sometimes competitive, so speed mattered.

When a bond debuted in the streaming API, we often didn't have all the raw data needed to do proper feature engineering from the request alone. Each bond had its own thread for concurrency, which polled the market servers for additional data points, mirrored the feature engineering we did offline, and applied a model prediction. The prediction was saved for record-keeping, and then we compared the bond's yield against our hurdle rate and checked whether we had excess cash to make new investments.
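
Conceptually, the per-bond handler looked something like this sketch. `market_client`, `fund`, `bond_is_complete`, and `record_prediction` are hypothetical interfaces, and `compute_features` stands for the shared feature library mentioned earlier.

```python
import threading
import time

def score_new_bond(bond_id, market_client, model, hurdle, fund):
    """Illustrative per-bond handler; the interfaces here are placeholders."""
    deadline = time.monotonic() + 1.0            # 1-second decision budget
    bond = market_client.get_bond(bond_id)
    # Poll for the raw fields the streaming debut message didn't include.
    while not bond_is_complete(bond) and time.monotonic() < deadline:
        time.sleep(0.05)
        bond = market_client.get_bond(bond_id)
    features = compute_features(bond)            # same shared library as offline
    risk = model.predict(features)
    record_prediction(bond_id, features, risk)   # saved for record-keeping
    if risk == "low" and bond["annual_yield"] >= hurdle and fund.available_cash() > 0:
        fund.place_order(bond_id, amount=min(fund.available_cash(), bond["size"]))

def on_bond_debut(bond_id, **ctx):
    # One thread per debuting bond for concurrency.
    threading.Thread(target=score_new_bond, args=(bond_id,), kwargs=ctx, daemon=True).start()
```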

This was all tracked by a set of dashboards built in Angular and Node.js, plus several queries we ran directly against the database. We could quickly look up all the feature values and model predictions we had calculated for a particular bond, or see system performance and logs from our DevOps observability tooling.

Image by author. The training and inference pipeline for a bond risk model.

Comparison of our homegrown systems versus today's tools

The system concepts from The Simple Group's infrastructure in 2013 have very similar counterparts among today's open source and vendor options. The two areas of data processing and model deployment draw clear analogies.

The actual tools for individual stages of MLOps change over time. Open source will mature, vendors will provide commercially viable alternatives, and another team in your organization will build a better tool. A consistent developer experience means minimal impact on end users and production systems.

Data Processing

I see a lot of similarities between our data processing engine and Tecton's managed service. They're a startup with a good pedigree that is well positioned to be the best-of-breed platform for tabular data in feature engineering, storage, and serving systems.

Diagram from Tecton.

This diagram is taken from Tecton's API overview. The homegrown systems at The Simple Group created the path shown in yellow for real-time serving and the one in blue for offline training. Companies that have their own online feature service will have similar paths. Tecton's major value over homegrown tools is that it aims to generalize feature pipelines for more than one use case. The "on-demand transformer" creates an instance model that can mirror the typing of a custom object being inferred upon. That's valuable for high-velocity model development. Homegrown feature engineering tools target a discrete number of use cases, and the object definition is usually hardcoded. Tecton offers more connectors than a homegrown tool would need, but they're a nice-to-have.

Model Serving

Our model serving goals were low latency and minimal maintenance overhead. To achieve that, we tied our feature storage database to our model serving. We made a very opinionated design choice to save the model's representation in a database format.

We chose to store the single model in a specialized database table in the same database as our feature engineering. Since we only used decision tree models, we designed our database to map the features, thresholds, and ordering of the key lookup. Optimizations we made for speed and performance on feature loading also benefited model serving. This concept of storing the model artifact in a database was also adopted by RedisAI, from the same creators behind the in-memory key-value store Redis. The design pattern provides a coherent developer experience for anyone already familiar with Redis.

Diagram from Redis AI model serving architecture
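
The table-backed tree idea can be sketched like this: each database row holds a node's feature, threshold, children, and leaf label, and serving is just a key-lookup walk. The rows below are illustrative, not our production schema.

```python
# Hypothetical representation of a decision tree as database rows:
# node_id -> (feature_name, threshold, left_child, right_child, leaf_label).
TREE_ROWS = {
    0: ("underwriter_default_rate_v2", 0.05, 1, 2, None),
    1: ("bond_duration_v1",            365,  3, 4, None),
    2: (None, None, None, None, "high"),     # leaf: high risk
    3: (None, None, None, None, "low"),      # leaf: low risk
    4: (None, None, None, None, "medium"),   # leaf: medium risk
}

def predict_risk(features: dict, node_id: int = 0) -> str:
    """Walk the stored tree using the same feature values loaded for serving."""
    feature_name, threshold, left, right, leaf = TREE_ROWS[node_id]
    if leaf is not None:
        return leaf
    child = left if features[feature_name] <= threshold else right
    return predict_risk(features, child)
```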

Open source has made model training and serving much easier. Libraries like XGBoost, out of the University of Washington, manage the decision tree creation process and offer a standard storage pattern, .save_model() and .load_model(), that abstracts the model representation. Our training stack was in Java, so pickled files weren't an option. Model registries have also made it much easier to scale the development of artifact versions. We never reached that scale, but when there are many models being created that serve different purposes, or when there is a single model with many variants for personalization and customization, model registries reduce the headache.
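
For example, the XGBoost save/load pattern mentioned above looks like this on a toy dataset (not the fund's model):

```python
import numpy as np
import xgboost as xgb

# Toy data standing in for engineered bond features and labels.
X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

model = xgb.XGBClassifier(n_estimators=20, max_depth=3)
model.fit(X, y)
model.save_model("bond_risk.json")     # portable artifact, no pickling required

loaded = xgb.XGBClassifier()
loaded.load_model("bond_risk.json")
print(loaded.predict(X[:5]))
```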

Wrapping Up

ML platforms have similar steps, regardless of the use case or organization. I've seen sophisticated MLOps patterns being distilled into approachable open source libraries, with communities forming to evangelize them. Many of these open-source projects are spun out of companies whose core business isn't machine learning but that apply it in their business.

Since building a platform is an iterative process, I always assume that new innovators will take a stage of MLOps and make it easier, cheaper, or more effective than existing solutions. The best advice I give to companies now is to be proactive about tooling portability. Upgrades to MLOps tooling are going to happen every 18 months to 3 years, because there are too many stages of MLOps for one tool to be a jack of all trades, and best-in-breed practices are still evolving. Being able to quickly adapt to the right ensemble of tools for your team and your organization will accelerate the adoption of ML in your organization and ensure ML provides convincing value to stakeholders.


Alex Chung

MLOps in Enterprises. Sharing experiences from building internal products at AWS SageMaker, Facebook, and Lyft. www.awchung.com