ML Platform

Designing an ML Platform for Data Scientists and ML Engineers. Building a unified system to replace fragmented ML tooling

Years: 2024 — present

Role: Senior Product Designer (sole designer)

Status: In production, actively used

NDA: Details and visuals are anonymized

Overview

Yandex is a large technology company operating a broad ecosystem of cloud and infrastructure products.

Within Yandex Cloud, multiple teams relied on fragmented third-party and internal tools to manage ML workflows — including experiment tracking, datasets, model artifacts, and compute resources.

This fragmentation slowed down experimentation, increased manual work, and made it difficult to scale ML development consistently across teams.

The goal was to design a unified ML platform from scratch that could gradually replace existing solutions and support the full ML lifecycle.

Context & Challenge

Before the platform:

  • ML workflows were spread across multiple tools (TensorBoard, standalone scripts, ad-hoc dashboards)

  • Experiment tracking satisfaction was low (2.1/5 in internal surveys)

  • Many processes required manual coordination (e.g. requesting GPU machines via chat)

  • Teams lacked a shared mental model of the ML lifecycle

This resulted in slow iteration, high cognitive load, and poor visibility across teams.

My Role

As the sole product designer, I was responsible for:

  • Research and product discovery

  • Defining user scenarios and platform structure

  • Designing the UX architecture and core workflows

  • Creating interactive prototypes for validation and stakeholder alignment

  • Supporting implementation and gradual adoption by teams

Users

ML Engineers — managing infrastructure, experiments, and model deployment

Data Scientists — running experiments, comparing results, iterating on models

Team Leads / Managers — monitoring progress and experiment outcomes

Research & Strategy

Competitive analysis

I analyzed 6+ ML platforms, including:

  • Amazon SageMaker

  • Google Vertex AI

  • Weights & Biases

  • Other open-source and internal solutions

The goal was not to copy features, but to understand:

  • how ML lifecycle stages are represented

  • which abstractions work at scale

  • where existing tools create friction

Defining core scenarios

Based on research and internal interviews, I defined key scenarios across the ML lifecycle:

  • experiment tracking and comparison

  • dataset and model management

  • infrastructure provisioning for training

  • collaboration and handoffs between roles

These scenarios became the foundation for the platform architecture.

Scalable platform structure

I designed a modular and expandable navigation system that supports future growth without restructuring the core:

  • Functional grouping by lifecycle stage

  • Clear separation between experiments, models, datasets, and compute

  • Architecture designed to scale as new modules are added

This allowed the platform to grow incrementally while maintaining clarity.
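To make the grouping concrete, below is a minimal sketch of how such a module registry could be expressed in code. The module names mirror the ones described in this case study; the structure itself is illustrative, not the platform's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Module:
    """One top-level platform module, grouped by ML lifecycle stage."""
    key: str      # stable identifier used for routing
    title: str    # label shown in the navigation
    stage: str    # lifecycle stage the module belongs to

# Lifecycle-ordered registry: new modules are appended without
# restructuring the existing navigation.
MODULES = [
    Module("experiments", "Experiment Manager", stage="experimentation"),
    Module("datasets", "Dataset Registry", stage="data"),
    Module("models", "Model Registry", stage="artifacts"),
    Module("compute", "DevCluster", stage="infrastructure"),
]

def navigation(modules=MODULES):
    """Group modules by lifecycle stage for rendering the sidebar."""
    groups = {}
    for m in modules:
        groups.setdefault(m.stage, []).append(m)
    return groups
```

Because each stage is just a key in the registry, adding a module is a one-line change that leaves the rest of the navigation untouched.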

Key Platform Modules

Experiment Manager

The first implemented and most actively used module.

  • Centralized experiment tracking and comparison

  • Clear visibility into metrics, runs, and results

  • Designed to replace third-party tools such as TensorBoard
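The sketch below illustrates the track-and-compare flow this module centralizes. The platform's real API is under NDA, so the client, method names, and metric values here are hypothetical stand-ins for that workflow.

```python
# Hypothetical client: names and signatures are illustrative stand-ins.
class ExperimentClient:
    def __init__(self):
        self.runs = []

    def start_run(self, name, params):
        run = {"name": name, "params": params, "metrics": {}}
        self.runs.append(run)
        return run

    def log_metric(self, run, key, value):
        run["metrics"][key] = value

    def compare(self, metric):
        """Rank all tracked runs by a metric (lower is better here),
        replacing ad-hoc per-team TensorBoard setups."""
        return sorted(self.runs,
                      key=lambda r: r["metrics"].get(metric, float("inf")))

client = ExperimentClient()
for lr in (1e-2, 1e-3):
    run = client.start_run("baseline", params={"lr": lr})
    client.log_metric(run, "val_loss", 0.57 if lr == 1e-2 else 0.42)

best = client.compare("val_loss")[0]
print(best["params"])  # -> {'lr': 0.001}
```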

Impact:

  • User satisfaction increased from 2.1/5 to 4.3/5

  • Users actively migrated from external solutions

  • ~730 weekly active users (WAU)

Model Registry & Dataset Registry

  • Centralized storage and versioning of models and datasets

  • Improved traceability across experiments

  • Reduced manual coordination between teams
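As an illustration of the versioning and traceability model, here is a hedged sketch; `ArtifactVersion`, `Registry`, and the run-linking scheme are assumptions made for this example, not the platform's actual data model.

```python
from dataclasses import dataclass

# Hypothetical data model: immutable versions keyed by name, each linked
# back to the experiment run that produced it (traceability).
@dataclass(frozen=True)
class ArtifactVersion:
    name: str     # e.g. "ranking-model" or "clickstream-dataset"
    version: int  # monotonically increasing per name
    run_id: str   # experiment run that produced this version
    uri: str      # storage location

class Registry:
    def __init__(self):
        self._versions = {}

    def publish(self, name, run_id, uri):
        versions = self._versions.setdefault(name, [])
        v = ArtifactVersion(name, len(versions) + 1, run_id, uri)
        versions.append(v)
        return v

    def latest(self, name):
        return self._versions[name][-1]

models = Registry()
models.publish("ranking-model", run_id="exp-42", uri="s3://models/ranking/1")
print(models.latest("ranking-model").run_id)  # -> exp-42
```

Linking every published version to a run id is what lets teams trace any model or dataset back to the experiment that produced it, without the manual coordination it previously required.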

DevCluster

One of the most impactful features.

  • Lets ML engineers switch between GPU-enabled virtual machines on the fly

  • Eliminated the need to request infrastructure manually via chat

  • Significantly reduced operational friction and waiting time

This module simplified multiple previously manual processes and improved day-to-day productivity.
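To show what "switching on the fly" replaces, below is a minimal sketch of the self-service flow. The preset names, the `DevCluster` class, and the `switch` method are all hypothetical, standing in for the real (NDA-covered) provisioning mechanics.

```python
class DevCluster:
    # Hypothetical presets; real machine types are anonymized.
    PRESETS = {"cpu-small": 0, "gpu-1x": 1, "gpu-8x": 8}

    def __init__(self):
        self.current = "cpu-small"

    def switch(self, preset):
        """Reattach the workspace to a different VM preset. In the real
        platform, disks and environment would persist across the switch."""
        if preset not in self.PRESETS:
            raise ValueError(f"unknown preset: {preset}")
        self.current = preset
        return f"workspace now on {preset} ({self.PRESETS[preset]} GPUs)"

cluster = DevCluster()
print(cluster.switch("gpu-1x"))     # scale up for a training run
print(cluster.switch("cpu-small"))  # scale down when idle to free GPUs
```

The contrast with the old flow is the point: a one-call switch replaces a chat request, a wait for a human operator, and a manual environment rebuild.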

Prototyping & Collaboration

  • Built interactive Figma prototypes to validate concepts with users and executives

  • Used prototypes to demonstrate platform value before full implementation

  • Maintained structured Figma files to support efficient developer handoff

Design reviews and feedback loops were integrated into the development process.

Outcome

  • Platform adopted by real product teams

  • Gradual migration from third-party tools in progress

  • Architecture supports new modules without core redesign

  • Design-to-development workflow became more structured and predictable

  • The platform continues to evolve as new ML workflows are added

All visuals and details are anonymized to comply with NDA requirements.