🦀 Serving ML at the speed of Rust
Serving 150+ million users is no joke, and it isn't cheap either
At Glance, we run recommender systems that rank content for 150+ million users. Not all users have the same recommendation algorithm. Let's call each recommendation algorithm a Prediction Service (PS).
To keep up with traffic from all these users, we can do two things:
- Horizontally scale up the prediction services
- Score more items per second/request
The first is super easy but also crazy expensive. The second is much harder, as there is no silver bullet for it. Also, note that our Prediction Services are written in Python, which leaves only a handful of tricks to squeeze out more speed.
The second option is the one we need, but how do we get there?
Our PS consists of two classes of operations:
- Network calls to feature-store (irreducible)
- CPU cycles spent on parsing + ranking compute + post-processing
To solve for the second class of operations (the CPU cycles), I decided to implement one of our largest PSs (an LR model that does ~1.5 million predictions/second at 20% of traffic) in a compiled language; a minimal sketch of that scoring math follows the list below. After a bit of research, I decided to write it in Rust. Why? Because:
- Actix showed up as one of the top web frameworks in this benchmark
- Rust is memory safe and fast, and its package management is 100x better than C/C++'s
(Why not Go? Check footnotes[1])
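About that scoring math: the heavy lifting in this PS is plain logistic regression over pre-computed features, i.e. a dot product plus a sigmoid per item. Here is a minimal, purely illustrative Rust sketch (the weights, features, and numbers are made up, not our production code):

```rust
// Illustrative only: logistic-regression scoring over pre-computed features.
// Weights, bias, and feature layout here are hypothetical.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Score a single item: sigmoid(dot(weights, features) + bias).
fn score(weights: &[f32], bias: f32, features: &[f32]) -> f32 {
    let logit: f32 = weights
        .iter()
        .zip(features)
        .map(|(w, f)| w * f)
        .sum::<f32>()
        + bias;
    sigmoid(logit)
}

fn main() {
    let weights = [0.4, -1.2, 0.7];
    let features = [1.0, 0.5, 2.0];
    println!("p = {:.3}", score(&weights, 0.1, &features));
}
```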
What was it like to port the Python prediction framework to Rust?
- I picked up this book to learn Rust. In the first few hours I made a lot of mistakes, but the compiler politely told me what I needed to correct. Rust wasn't that hard to pick up.
- Writing endpoints in Actix was fairly straightforward thanks to their rich documentation (a minimal endpoint sketch follows this list).
- Although Rust has really strong community support, it did not have a client implemented to call our Feature-store. Fortunately, Rust has great cryptography libraries, which allowed me to implement an auth-enabled client in a couple of hours.
- The structure of a Rust project felt very close to a React project. Adding a package dependency was as simple as adding a line to the Cargo.toml file.
- Rust's tooling system just works
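To give a feel for what the endpoint code looks like, here is a minimal Actix-web sketch (assuming actix-web 4 and serde added as dependencies in Cargo.toml; the route, request shape, and scoring stub are hypothetical, not our actual service):

```rust
use actix_web::{post, web, App, HttpServer, Responder};
use serde::{Deserialize, Serialize};

// Hypothetical request/response shapes for illustration.
#[derive(Deserialize)]
struct PredictRequest {
    item_ids: Vec<String>,
}

#[derive(Serialize)]
struct PredictResponse {
    scores: Vec<f32>,
}

#[post("/predict")]
async fn predict(req: web::Json<PredictRequest>) -> impl Responder {
    // Stub: the real service would fetch features and run the LR model here.
    let scores = vec![0.5; req.item_ids.len()];
    web::Json(PredictResponse { scores })
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().service(predict))
        .bind(("0.0.0.0", 8080))?
        .run()
        .await
}
```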
(scroll to the summary if you wanna see the performance difference right away)
Initial Benchmarking Results:
Finally, it was time to put the prediction service to test. I ran some stress tests and this is what I got:
Requests Per Sec (RPS): Rust barely reaching 60 RPS :(
Latencies:
This did not make sense! Beyond a certain load, the model latencies started rising exponentially. Note that the Python PS could easily do ~160 RPS.
Rust lied to me!?
I was scratching my head at this point. I had been promised a great deal, and I thought Rust had lied to me. Or so I thought.
I spent a couple of days digging deeper and found this epic blog by ScyllaDB on their debugging experience with Rust. I had a new shiny tool in my arsenal: Flamegraphs!
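For the curious, one convenient way to generate these is the cargo-flamegraph tool; the commands below are a generic sketch (the binary name is a placeholder, and this is not necessarily the exact setup we used):

```sh
cargo install flamegraph
cargo flamegraph --bin prediction-service
```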
How to interpret these graphs (quoting ScyllaDB blog):
"a rule of thumb is to look for operations that take up the majority of the total width of the graph – the width indicates time spent on executing a particular operation."
Here is the Flamegraph of the Rust service:
Nothing too suspicious, but I did see huge chunks of OpenSSL taking 27-30% of the CPU cycles. Fortunately, I found rustls as an alternative to OpenSSL, which was far more reliable and easier to use.
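Switching was mostly a dependency change. As a rough sketch (assuming the HTTP client is something like reqwest; our actual crates may differ), the Cargo.toml change looks like:

```toml
# Hypothetical example: drop the default OpenSSL-backed TLS and opt into rustls.
[dependencies]
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }
```

Here is the flamegraph after switching from OpenSSL to rustls: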
Ok cool, flames looking nice and pointy and no process hogging all CPU resources. Then this should solve it right? right?
Well yes but actually no...
Although the RPS went up slightly, it was nowhere near the Python PS's RPS! Also, the latencies kept rising exponentially as the load increased. Something was still wrong!
Down the debugging rabbit hole
So, I decided to check the Flamegraph at the point where latencies started rising exponentially. This is where I made a real breakthrough (by chance): when I stress-tested the Rust PS locally, the latencies didn't balloon, and the RPS even crossed the Python PS's. If the same code performs differently in two environments (local vs. prod), the issue had to be in the Docker image.
To keep the Docker image lightweight, I was using scratch as the base image. Furthermore, I was generating my binary for the following target:
cargo build --target x86_64-unknown-linux-musl --release
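For context, the image build looked roughly like this (a simplified sketch; the binary name and paths are placeholders):

```dockerfile
# Simplified sketch of the original setup: statically link against musl,
# then ship the bare binary on a `scratch` base image.
FROM rust:1 AS builder
WORKDIR /app
COPY . .
RUN apt-get update && apt-get install -y musl-tools \
    && rustup target add x86_64-unknown-linux-musl \
    && cargo build --target x86_64-unknown-linux-musl --release

FROM scratch
COPY --from=builder /app/target/x86_64-unknown-linux-musl/release/prediction-service /prediction-service
ENTRYPOINT ["/prediction-service"]
```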
The difference is that this image uses musl-libc, and my local environment uses glibc.
Upon further digging, I found a post by one of the musl-libc authors talking about major problems in its implementation of malloc.
Tl;dr: musl-libc's malloc can be really slow under high load [2].
To fix this, we could simply use a different memory allocator [3]. But an even simpler fix was to not be greedy about image size and use a more sensible Docker base image that uses glibc. 😐 And I wasn't the first one to make this mistake.
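For completeness, the allocator-swap route (which we did not end up taking) looks roughly like this, assuming the jemallocator crate is added to Cargo.toml:

```rust
// Sketch: override the global allocator so all heap allocations go through
// jemalloc instead of the libc allocator. Requires `jemallocator` in Cargo.toml.
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Service startup continues as usual; only the allocator changes.
    println!("running with jemalloc as the global allocator");
}
```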
Cool. Lesson learned. We use slim-buster. Let's keep moving...
Finally after switching to slim-buster here are the metrics:
It comfortably crosses 900 RPS at higher load, with a p99 latency under 90 ms!
To summarize:
| metric | Rust | Python |
|---|---|---|
| max predictions per second | ~631,000 | ~42,000 |
| max RPS | 936 | 167 |
| max p99 latency | 65 ms | ~120 ms |
| max CPU util | 500% | 2000% |
| Docker image size | 76 MB | 514 MB |
The Rust service would only need 4 VMs to handle the production traffic, whereas the Python service needs a minimum of 20 VMs to handle the same traffic.
This is pretty sweet! Lots of money saved.
Bringing Rust home to meet your parents
If you are tempted to bring Rust into your engineering stack, you better make a really strong case for it. Consider these if you plan to pitch Rust as a language to your team:
- Actix is a crazy-fast web framework; if performance matters to you, look no further. A high-throughput service will also keep infra costs in check. That said, productionizing a Rust microservice has several caveats if you are new to it.
- Rust is not mature enough to support ML out of the box (you can use ONNX, simple statistical models, or interface with ML libraries in C). It worked in this case because we do minimal math/LR on top of pre-computed scores. If you want to serve your XGBoost or deep learning model, Rust is not the right choice (yet).
- Rust is very elegantly designed and has a powerful compiler. You will have to try really hard to write a program that breaks. Fault tolerance is built in.
- Rust has a learning curve, but if you are familiar with C/C++ or Java it will take you hardly a few hours to become productive.
- Rust has been running in production at many companies that operate at huge user scale. This post is just my attempt at proselytizing many of you into the Rust cult :P
Sharing Some Learning Resources:
Footnotes:
[1] Why not use Golang? I simply didn't have enough time. But Rags did, and it's equally epic.
[2] musl-libc is working on a much more performant implementation of malloc: https://github.com/richfelker/mallocng-draft
[3] Here is a detailed performance comparison of various memory allocators: https://www.linkedin.com/pulse/testing-alternative-c-memory-allocators-pt-2-musl-mystery-gomes
The header image for this post was generated with DALLE-mini: https://huggingface.co/spaces/dalle-mini/dalle-mini
Consider following me on Twitter if you wanna bully your computers into going fast