reflections on ai-infra from a private lunch
Last week I attended a lunch event hosted by a16z. The people in attendance were AI researchers and AI-infra engineers. I compiled a list of companies where I think the talent density is ridiculously high (based on my interactions with them):
- Meta Training and Inference Accelerator (MTIA)
- Anthropic
- Black Forest Labs
- Together-ai
- Numa Labs
- Mirendil
- Inferact
The people from the above companies were kind enough to entertain my questions, and the reflections below are based on my discussions with them. There were more companies (Apple, OpenAI, etc.) but I couldn't talk to all of them cuz I had to rush back to work (it was literally my lunch break).
Top 5 takeaways (observations + opinions), ranked by what I thought was interesting:
Agent sandboxes and AI scientists (or AI-driven auto-research): Seems like a lot of labs have figured out some secret sauce for recursive self-improvement. They wouldn't tell me exactly what that is, but they are really interested in agent sandboxing. If I were to guess, I'd say it has to do with the intelligence unlocks from RL-based post-training methods. We have run out of data to train on, so to discover new knowledge you drop an agent into an RL environment and let it perform tree-of-thought exploration. This is a lot like the human trial-and-error process, except the agent can try out several actions in parallel and discover new insights at super-human speed. My personal takeaway is that the ML training loop is going to need a LOOOTTTT more CPU compute, since every one of those parallel sandboxes has to run somewhere. What will you do with this information?
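To make the "try several actions in parallel" idea concrete, here is a toy sketch. Everything in it is my own assumption for illustration: `run_in_sandbox` is a hypothetical stand-in for a real sandboxed rollout, and the reward is a made-up function. It is not what any lab described to me.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_sandbox(action: int) -> float:
    """Hypothetical sandboxed rollout: try one action, return its reward.

    A real sandbox would execute code or tools in isolation; this toy
    reward simply peaks when action == 7.
    """
    return float(-abs(action - 7))


def explore(candidates, max_workers: int = 4):
    """Score every candidate action concurrently and keep the best one."""
    candidates = list(candidates)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so rewards line up with candidates.
        rewards = list(pool.map(run_in_sandbox, candidates))
    best_reward, best_action = max(zip(rewards, candidates))
    return best_action, best_reward


best_action, best_reward = explore(range(10))
```

Each candidate costs one sandbox run, so widening the search scales the number of sandboxes (and the CPUs behind them) roughly linearly.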
ML inference seems to be the hottest vertical, which makes sense given the sheer demand. Interesting challenges that I heard about were:
- sub-optimal traffic routing leading to idle GPUs (adapting to traffic scaling patterns)
- fast model loading
- image caching
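For the routing bullet, one common baseline is "least outstanding requests": send each new request to the replica with the fewest requests in flight, so nothing sits idle while another queue grows. A minimal sketch, assuming hypothetical replica names and an in-memory counter (nobody at the lunch said this is what they run):

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical GPU-backed model replica and its current load."""
    name: str
    in_flight: int = 0


def pick_replica(replicas):
    """Return the replica with the fewest in-flight requests (ties: first wins)."""
    return min(replicas, key=lambda r: r.in_flight)


def route(replicas, n_requests: int):
    """Assign n_requests one at a time, updating load after each assignment."""
    assignments = []
    for _ in range(n_requests):
        replica = pick_replica(replicas)
        replica.in_flight += 1
        assignments.append(replica.name)
    return assignments


replicas = [Replica("gpu-0", in_flight=3), Replica("gpu-1"), Replica("gpu-2", in_flight=1)]
assignments = route(replicas, 4)
```

A production router would also decrement the counter when a request finishes and would likely account for things like batch size and cache locality; this only shows the core idea.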
Compiler researchers + AI kernels: Seems like all the big labs employ compiler researchers, perhaps because AI chips are evolving so rapidly. Also, the MTIA folks mentioned that all of their kernels are written by AI and outperform human experts. Seems like a fun space to be in and I honestly need to read up more on it (I'm a noob rn).
ML training infra (10k+ nodes): Seems like people use Kubernetes for ML training. GPU stragglers are still puzzling. Distributed ML training has a lot of interesting problems; check out Shaksham's work.
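On stragglers: in synchronous training every step waits for the slowest rank, so a simple first check is to compare each rank's step time to the median. A minimal sketch, assuming made-up timings and an arbitrary 1.5x threshold (both are my assumptions, not numbers from the event):

```python
from statistics import median


def find_stragglers(step_times: dict, threshold: float = 1.5):
    """Return ranks whose step time exceeds threshold x the median step time.

    step_times maps rank -> seconds for the most recent training step.
    """
    med = median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > threshold * med)


# Hypothetical per-rank timings in seconds; rank 3 is the slow one.
step_times = {0: 1.00, 1: 1.02, 2: 0.98, 3: 1.90, 4: 1.01}
stragglers = find_stragglers(step_times)
```

Finding the slow rank is the easy part; the puzzling part is usually why it is slow.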
All neo-labs are on multiple clouds. Situational Awareness will continue to thrive for a few more years, I guess.
Note: The observations above are already out there in public and in no way "exclusive" to this event. And all opinions are personal.