CS After Dark

🦀 Serving ML at the speed of Rust

Serving 150+ million users is no joke and also not cheap

At Glance, we run recommender systems that rank content for 150+ million users. Not all users have the same recommendation algorithm. Let's call each recommendation algorithm a Prediction Service (PS).

To keep up with traffic from all these users we can do two things:

  1. Horizontally scale up the prediction services
  2. Score more items per second or request

The first is super easy but also crazy expensive. The second is much harder because there is no silver bullet for it. Also, note that our Prediction Services are written in Python, which leaves you with only a handful of tricks to add more speed.

The second is imperative but how do we get there?

Our PS consists of two classes of operations:

  1. Network calls to feature-store (irreducible)
  2. CPU cycles spent on parsing + ranking compute + post-processing

To solve for 2, I decided to implement one of the largest PS (an LR model which does ~1.5 million predictions/second on 20% of the traffic) in a compiled language. After a bit of research, I decided to write it in Rust. Why? Because:

  1. Actix showed up as one of the top web frameworks in this benchmark
  2. Rust is memory safe and fast, and its package management is 100x better than what C/C++ offers

(Why not Go? Check footnotes[1])
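
For a feel of what the CPU-bound part of an LR-based PS looks like, here is a minimal sketch (not our production code): score each item as a sigmoid over a weighted sum of its features, then sort by score. The function names, the dense `f64` feature vectors and the bias term are all assumptions made for illustration.

```rust
/// Logistic function: squashes a raw score into (0, 1).
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// Hypothetical LR scorer: dot(weights, features) + bias, passed through a sigmoid.
fn score(weights: &[f64], bias: f64, features: &[f64]) -> f64 {
    assert_eq!(weights.len(), features.len(), "feature/weight length mismatch");
    let z: f64 = weights.iter().zip(features).map(|(w, x)| w * x).sum::<f64>() + bias;
    sigmoid(z)
}

/// Scores every candidate item and returns (item index, score) pairs, best first.
fn rank(weights: &[f64], bias: f64, items: &[Vec<f64>]) -> Vec<(usize, f64)> {
    let mut scored: Vec<(usize, f64)> = items
        .iter()
        .enumerate()
        .map(|(i, feats)| (i, score(weights, bias, feats)))
        .collect();
    // Sort descending by score; total_cmp gives a total order over f64.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Every request runs this kind of loop over its whole candidate set, which is why the per-item cost matters so much.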

What was it like to port the Python prediction framework to Rust?

  1. I picked up this book for learning Rust. In the first few hours, I made a lot of mistakes, but the compiler was really nice about politely telling me what I needed to correct. Rust wasn't that hard to pick up.
  2. Writing endpoints in Actix was fairly straightforward thanks to their rich documentation.
  3. Although Rust has really strong community support, it did not have a client implemented to call our Feature-store. Fortunately, Rust has great cryptography libraries, which allowed me to implement an auth-enabled client in a couple of hours.
  4. The structure of a Rust project felt very close to a React project. Adding a package dependency was as simple as adding a line to the Cargo.toml file.
  5. Rust's tooling system just works

(scroll to the summary if you directly wanna see the performance difference)

Initial Benchmarking Results:

Finally, it was time to put the prediction service to test. I ran some stress tests and this is what I got:

Requests Per Sec (RPS): [figure: Rust RPS, barely reaching 60 RPS :( ]

Latencies: [figure: Rust latencies]

This did not make sense! After a certain load, the model latencies started rising exponentially. Note that the Python PS was able to easily do ~160 RPS.

[figure: Python RPS]

Rust lied to me!?

I was scratching my head at this point. I was promised a great deal and I thought that Rust lied to me. I thought.

I spent a couple of days digging deeper and found this epic blog by ScyllaDB on their debugging experience with Rust. I had a new shiny tool in my arsenal: Flamegraphs!

How to interpret these graphs (quoting ScyllaDB blog):

"a rule of thumb is to look for operations that take up the majority of the total width of the graph – the width indicates time spent on executing a particular operation."

Here is the Flamegraph of the Rust service: [figure: non-optimal flamegraph]

Nothing too suspicious, but I did see huge chunks of OpenSSL taking 27-30% of the CPU cycles. Fortunately, I found rustls as an alternative to OpenSSL, which was way more reliable and easier to use. Here is the flame graph after switching from OpenSSL to rustls:

[figure: flamegraph with rustls]
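
Eyeballing widths works, but the rule of thumb quoted above can also be checked mechanically. Flamegraph tooling typically consumes "folded" stacks, one line per unique stack in the form `frame_a;frame_b;frame_c <sample count>`. The sketch below, with a hypothetical `frame_widths` helper, computes the share of samples in which each frame appears, which is roughly the width you would see in the graph.

```rust
use std::collections::{HashMap, HashSet};

/// For each frame, the fraction of all samples whose stack contains that frame.
/// Input is folded-stack text: "frame_a;frame_b;frame_c <count>" per line.
fn frame_widths(folded: &str) -> HashMap<String, f64> {
    let mut per_frame: HashMap<String, u64> = HashMap::new();
    let mut total: u64 = 0;

    for line in folded.lines() {
        let line = line.trim();
        // The count is the last space-separated token; frame names may contain spaces.
        let Some((stack, count)) = line.rsplit_once(' ') else { continue };
        let Ok(count) = count.parse::<u64>() else { continue };
        total += count;

        // Count each frame once per stack, even if it recurses.
        let unique: HashSet<&str> = stack.split(';').collect();
        for frame in unique {
            *per_frame.entry(frame.to_string()).or_insert(0) += count;
        }
    }

    // If nothing parsed, per_frame is empty and no division happens.
    per_frame
        .into_iter()
        .map(|(frame, samples)| (frame, samples as f64 / total as f64))
        .collect()
}
```

Filtering the result for frames above some threshold (say 0.25) is a quick way to surface a hog like the OpenSSL chunk without squinting at the picture.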

Ok cool, flames looking nice and pointy and no process hogging all CPU resources. Then this should solve it right? right?

Well yes but actually no...

Although the RPS went up slightly, it was nowhere near the Python PS's RPS! Also, the latencies kept rising exponentially as the load increased. Something was still wrong!

Down the debugging rabbit hole

So, I decided to check the Flamegraph at the point where the latencies started rising exponentially. This is where I made a real breakthrough (by chance): when I stress-tested the Rust PS locally, the latencies didn't balloon, and the RPS actually crossed the Python PS's RPS. Since the same code performed differently in two environments (local vs prod), the issue had to be in the docker image.

To keep the docker image light-weight, I was using scratch as the base docker image. Furthermore, I was generating my binary for the following target:

cargo build --target x86_64-unknown-linux-musl --release

The difference is that a binary built for this target links against musl-libc, whereas my local build uses glibc. Upon further digging, I found a post by one of the musl-libc authors talking about major problems in its implementation of malloc.

Tl;dr: musl-libc's malloc can be really slow under high load [2].
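
A cheap guard against this kind of surprise is to have the service report which libc flavor it was compiled for. The sketch below relies on Rust's `cfg!(target_env = "...")`, which is resolved at compile time; the `libc_flavor` function name is my own, not something from the service.

```rust
/// Reports the C library flavor this binary was compiled against.
/// `target_env` is fixed at compile time by the `--target` triple.
fn libc_flavor() -> &'static str {
    if cfg!(target_env = "musl") {
        "musl"
    } else if cfg!(target_env = "gnu") {
        "glibc"
    } else {
        "other"
    }
}
```

Logging this once at startup would have made the local vs prod difference visible straight away.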

To fix this we can simply use different memory allocators [3]. But an even simpler way was to not be greedy about image size and use a more sensible docker base image that uses glibc. 😐 And I wasn't the first one to make this mistake.
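
For the allocator route, Rust exposes a single hook: a static marked `#[global_allocator]` whose type implements `GlobalAlloc`. The sketch below uses only the standard library and a hypothetical `CountingAlloc` wrapper that forwards to the system allocator, just to show where the swap happens. As far as I know, allocator crates such as `mimalloc` or `tikv-jemallocator` plug into this same hook, but I did not benchmark them for this post.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of allocation calls seen so far (illustration only).
static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper: counts calls, then defers to the system allocator.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        // Safety: the caller upholds GlobalAlloc's contract for `layout`.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Safety: `ptr` was returned by `System.alloc` with this same layout.
        unsafe { System.dealloc(ptr, layout) }
    }
}

/// Every heap allocation in the program now goes through CountingAlloc.
/// Swapping allocators means changing the type of this one static.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Reads the counter, e.g. to compare allocation pressure between code paths.
fn alloc_calls() -> usize {
    ALLOC_CALLS.load(Ordering::Relaxed)
}
```

The point is that the rest of the code stays untouched: the allocator is a one-line, program-wide decision.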

Cool. Lesson learned. We use slim-buster. Let's keep moving...

Finally, after switching to slim-buster, here are the metrics: [figure: RPS and latencies after the base-image change]

This comfortably crosses 900 RPS under a higher load, with a p99 latency of < 90ms!
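
In case "p99" is unfamiliar: it is the latency that 99% of requests stay at or under. A minimal sketch of the nearest-rank way to compute it from raw samples follows. The `percentile_ms` helper is hypothetical, and the stress-testing tool behind the numbers in this post may use a different interpolation.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns None for an empty slice or when p is outside 1..=100.
fn percentile_ms(samples: &[u64], p: usize) -> Option<u64> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank is ceil(p / 100 * n), done in integer arithmetic.
    let rank = (p * sorted.len() + 99) / 100;
    // rank is in 1..=n here, so rank - 1 is a valid index.
    Some(sorted[rank - 1])
}
```

Tail percentiles like this are what users actually feel, which is why the ballooning latencies earlier mattered more than the average would suggest.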

To summarize:

| metric | Rust | Python |
|---|---|---|
| max predictions per second | ~631,000 | ~42,000 |
| max RPS | 936 | 167 |
| max p99 latency | 65ms | ~120ms |
| max CPU util | 500% | 2000% |
| docker image size | 76 MB | 514 MB |
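
As a quick sanity check on those numbers, here are the ratios spelled out. The figures are copied from the summary above; the `ratio` and `summary_ratios` helpers are only there for illustration.

```rust
/// Ratio of two measurements, e.g. Rust RPS over Python RPS.
fn ratio(a: f64, b: f64) -> f64 {
    a / b
}

/// Improvement factors derived from the summary figures.
fn summary_ratios() -> [(&'static str, f64); 4] {
    [
        ("predictions per second (Rust / Python)", ratio(631_000.0, 42_000.0)),
        ("RPS (Rust / Python)", ratio(936.0, 167.0)),
        ("CPU utilization (Python / Rust)", ratio(2000.0, 500.0)),
        ("image size (Python / Rust)", ratio(514.0, 76.0)),
    ]
}
```

That works out to roughly 15x the predictions per second, 5.6x the RPS, a quarter of the CPU and an image about 6.8x smaller.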

The Rust Service would only need 4 VMs to handle the production traffic whereas the Python Service needs a minimum of 20 VMs to handle the same traffic.

This is pretty sweet! Lots of $ Saved

Bringing Rust home to meet your parents

If you are tempted to bring Rust into your engineering stack, you better make a really strong case for it. Consider these if you plan to pitch Rust as a language to your team:

Sharing Some Learning Resources:

  1. Rust in depth
  2. Rust + Actix
  3. Memory Allocators

Footnotes:

  1. Why not use Golang? - I simply didn't have enough time. But Rags did and it's equally epic

  2. Musl-libc is working on a much more performant implementation of malloc: https://github.com/richfelker/mallocng-draft

  3. Here is a detailed performance comparison of various memory allocators: https://www.linkedin.com/pulse/testing-alternative-c-memory-allocators-pt-2-musl-mystery-gomes

  4. Header image for this post was generated with DALLE-mini: https://huggingface.co/spaces/dalle-mini/dalle-mini

Consider following me on Twitter if you wanna bully your computers into going fast
