ML Platform
Designing a unified ML platform for data scientists and ML engineers to replace fragmented ML tooling.
Years:
2024 — present
Role:
Senior Product Designer (sole designer)
Status:
In production, actively used
NDA:
Details and visuals are anonymized
Overview
Yandex is a large technology company operating a broad ecosystem of cloud and infrastructure products.
Within Yandex Cloud, multiple teams relied on fragmented third-party and internal tools to manage ML workflows — including experiment tracking, datasets, model artifacts, and compute resources.
This fragmentation slowed down experimentation, increased manual work, and made it difficult to scale ML development consistently across teams.
The goal was to design a unified ML platform from scratch that could gradually replace existing solutions and support the full ML lifecycle.
Context & Challenge
Before the platform:
ML workflows were spread across multiple tools (TensorBoard, standalone scripts, ad-hoc dashboards)
Experiment tracking satisfaction was low (2.1/5 in internal surveys)
Many processes required manual coordination (e.g. requesting GPU machines via chat)
Teams lacked a shared mental model of the ML lifecycle
The result was slow iteration, high cognitive load, and poor visibility into ML work across teams.
My Role
As the sole product designer, I was responsible for:
Research and product discovery
Defining user scenarios and platform structure
Designing the UX architecture and core workflows
Creating interactive prototypes for validation and stakeholder alignment
Supporting implementation and gradual adoption by teams
Users
ML Engineers — managing infrastructure, experiments, and model deployment
Data Scientists — running experiments, comparing results, iterating on models
Team Leads / Managers — monitoring progress and experiment outcomes
Research & Strategy
Competitive analysis
I analyzed 6+ ML platforms, including:
Amazon SageMaker
Google Vertex AI
Weights & Biases
Other open-source and internal solutions
The goal was not to copy features, but to understand:
how ML lifecycle stages are represented
which abstractions work at scale
where existing tools create friction
Defining core scenarios
Based on research and internal interviews, I defined key scenarios across the ML lifecycle:
experiment tracking and comparison
dataset and model management
infrastructure provisioning for training
collaboration and handoffs between roles
These scenarios became the foundation for the platform architecture.
Scalable platform structure
I designed a modular and expandable navigation system that supports future growth without restructuring the core:
Functional grouping by lifecycle stage
Clear separation between experiments, models, datasets, and compute
Architecture designed to scale as new modules are added
This allowed the platform to grow incrementally while maintaining clarity.
Key Platform Modules
Experiment Manager
The first implemented and most actively used module.
Centralized experiment tracking and comparison
Clear visibility into metrics, runs, and results
Designed to replace third-party tools such as TensorBoard
Impact:
User satisfaction increased from 2.1/5 to 4.3/5
Users actively migrated from external solutions
~730 weekly active users (WAU)
Model Registry & Dataset Registry
Centralized storage and versioning of models and datasets
Improved traceability across experiments
Reduced manual coordination between teams
DevCluster
One of the most impactful features.
Lets ML engineers switch between GPU-enabled virtual machines on the fly
Eliminated the need to request infrastructure manually via chat
Significantly reduced operational friction and waiting time
This module simplified multiple previously manual processes and improved day-to-day productivity.
Prototyping & Collaboration
Built interactive Figma prototypes to validate concepts with users and executives
Used prototypes to demonstrate platform value before full implementation
Maintained structured Figma files to support efficient developer handoff
Design reviews and feedback loops were integrated into the development process.
Outcome
Platform adopted by real product teams
Gradual migration from third-party tools in progress
Architecture supports new modules without core redesign
Design-to-development workflow became more structured and predictable
The platform continues to evolve as new ML workflows are added.
All visuals and details are anonymized to comply with NDA requirements.