Cloud ML: AWS SageMaker and GCP Vertex AI
Cloud ML platforms provide managed infrastructure for training, tuning, deploying, and monitoring models at scale.
Cloud ML Landscape
1. AWS SageMaker Workflow
SageMaker SDK Example
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# Configure a scikit-learn training job that runs train.py in script mode
estimator = SKLearn(
    entry_point="train.py",
    role=sagemaker.get_execution_role(),  # IAM role the training job assumes
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.0-1",
    hyperparameters={"n_estimators": 100, "max_depth": 8},  # passed to train.py as CLI arguments
    output_path="s3://bucket/output",
)

# Each key names an input channel made available to the training container
estimator.fit({"train": "s3://bucket/train/", "test": "s3://bucket/test/"})

# Deploy to endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
)
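The estimator above references `train.py` without showing it. As a rough sketch of what such an entry point tends to look like in SageMaker script mode: hyperparameters arrive as command-line arguments, while the channel and model directories are exposed through the `SM_CHANNEL_<NAME>` and `SM_MODEL_DIR` environment variables. The `parse_args` helper name and the local fallback paths are assumptions for illustration, and the training logic itself is left as a comment.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided directories."""
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator's `hyperparameters` dict
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=8)
    # Directories injected by the training container; the fallbacks are
    # hypothetical local paths so the script can also run outside SageMaker
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "./data/test"))
    # parse_known_args ignores any arguments this script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    # Load data from args.train, fit a model with the parsed hyperparameters,
    # evaluate on args.test, and save the artifact under args.model_dir
    print(f"n_estimators={args.n_estimators}, max_depth={args.max_depth}")
```

Keeping argument parsing in its own function makes the script easy to exercise locally before paying for a training instance.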
2. GCP Vertex AI
Vertex AI Pipeline (Kubeflow)
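from kfp import dsl

# Vertex AI Pipelines runs pipelines defined with the KFP v2 SDK, where each
# container step is declared as a @dsl.container_component
@dsl.container_component
def prepare(input_path: str):
    return dsl.ContainerSpec(
        image="gcr.io/project/prepare:latest",
        args=["--input", input_path],
    )

@dsl.container_component
def train(n_estimators: int):
    return dsl.ContainerSpec(
        image="gcr.io/project/train:latest",
        args=["--n-estimators", n_estimators],
    )

@dsl.container_component
def evaluate():
    return dsl.ContainerSpec(image="gcr.io/project/evaluate:latest")

@dsl.pipeline(name="training-pipeline")
def training_pipeline(
    data_path: str = "gs://bucket/data/",
    n_estimators: int = 200,
):
    prepare_task = prepare(input_path=data_path)
    # .after() enforces ordering between steps that share no inputs or outputs
    train_task = train(n_estimators=n_estimators).after(prepare_task)
    evaluate_task = evaluate().after(train_task)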
3. Feature Store Comparison
| Feature | SageMaker Feature Store | Vertex AI Feature Store |
|---|---|---|
| Storage | Online + Offline | Online + Offline |
| Query | Point lookup, batch | Point lookup, batch |
| Integration | SageMaker pipelines | Vertex AI pipelines |
| Refresh | Scheduled or on-demand | Streaming or batch |
| Pricing | Per GB stored + reads | Per GB stored + reads |
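
To make the online/offline split in the table concrete, here is a toy in-memory sketch. It is not either vendor's API: the `ToyFeatureStore` class and its method names are hypothetical. The online view keeps only the latest value per entity for low-latency point lookups, while the offline log keeps the full history so a training set can be built as of a past timestamp.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyFeatureStore:
    """Illustrative only: latest-value online view plus append-only offline log."""

    def __init__(self):
        self.online = {}                  # entity_id -> latest feature dict
        self.offline = defaultdict(list)  # entity_id -> [(timestamp, features)]

    def ingest(self, entity_id, timestamp, features):
        # Assumes records arrive in timestamp order per entity
        self.offline[entity_id].append((timestamp, features))
        self.online[entity_id] = features

    def get_online(self, entity_id):
        """Point lookup, as used at serving time."""
        return self.online.get(entity_id)

    def get_as_of(self, entity_id, timestamp):
        """Point-in-time lookup, as used when building training sets."""
        history = self.offline.get(entity_id, [])
        idx = bisect_right([ts for ts, _ in history], timestamp)
        return history[idx - 1][1] if idx else None


store = ToyFeatureStore()
store.ingest("user-1", 10, {"clicks": 3})
store.ingest("user-1", 20, {"clicks": 7})
latest = store.get_online("user-1")        # most recent features
historical = store.get_as_of("user-1", 15)  # features as known at t=15
```

The point-in-time lookup is what keeps training data free of values that were not yet known when the label was observed.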
4. Managed Inference
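# SageMaker Real-time
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor

role = sagemaker.get_execution_role()
# inference_image: URI of the serving container image, defined elsewhere
model = Model(
    image_uri=inference_image,
    model_data="s3://bucket/model.tar.gz",
    role=role,
    predictor_cls=Predictor,  # without this, Model.deploy() returns None
)
predictor = model.deploy(instance_type="ml.g4dn.xlarge", initial_instance_count=1)

# Serverless inference
from sagemaker.serverless import ServerlessInferenceConfig

# Reuses the estimator from the SageMaker SDK example; no instance type is needed
predictor = estimator.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=50,
    )
)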
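Choosing between the two deployment modes above is largely a traffic question: a real-time endpoint is billed per instance-hour whether or not it receives requests, while a serverless endpoint is billed for the compute time requests actually use. The sketch below estimates the monthly request volume at which the two cost the same. All prices are placeholders rather than real AWS rates, the function names are hypothetical, and data-processing charges are ignored.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_realtime_cost(hourly_rate, instance_count=1):
    """Always-on endpoint: billed per instance-hour regardless of traffic."""
    return hourly_rate * instance_count * HOURS_PER_MONTH


def monthly_serverless_cost(requests, seconds_per_request, price_per_second):
    """Serverless endpoint: billed for the compute time requests consume."""
    return requests * seconds_per_request * price_per_second


def break_even_requests(hourly_rate, seconds_per_request, price_per_second):
    """Monthly request volume at which both options cost the same."""
    return monthly_realtime_cost(hourly_rate) / (seconds_per_request * price_per_second)


# Placeholder prices, NOT real AWS rates
volume = break_even_requests(
    hourly_rate=0.10, seconds_per_request=0.2, price_per_second=0.00005
)
print(f"Serverless is cheaper below ~{volume:,.0f} requests/month")
```

Below the break-even volume the serverless option tends to win; above it, a right-sized always-on instance is usually the cheaper choice.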
5. Cost Optimization
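One lever named in the Key Takeaways is spot training. In the SageMaker SDK, managed spot training is enabled through the estimator arguments `use_spot_instances`, `max_run`, and `max_wait`, typically together with `checkpoint_s3_uri` so that an interrupted job can resume. A back-of-the-envelope sketch of the resulting savings is shown here; the function names are hypothetical, and the discount and interruption overhead are illustrative assumptions, not measured or published figures.

```python
def spot_training_cost(on_demand_hourly, train_hours, spot_discount, rerun_overhead=0.0):
    """Estimate spot cost when interruptions add extra compute.

    spot_discount:  fraction off the on-demand price (0.7 means 70% cheaper)
    rerun_overhead: fraction of extra hours spent redoing work lost
                    between checkpoints
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    billed_hours = train_hours * (1.0 + rerun_overhead)
    return on_demand_hourly * (1.0 - spot_discount) * billed_hours


def savings_fraction(on_demand_hourly, train_hours, spot_discount, rerun_overhead=0.0):
    """Fraction saved relative to running the same job on demand."""
    on_demand = on_demand_hourly * train_hours
    spot = spot_training_cost(on_demand_hourly, train_hours, spot_discount, rerun_overhead)
    return 1.0 - spot / on_demand


# Assumed inputs: 70% spot discount, 10% of hours repeated after interruptions
saved = savings_fraction(
    on_demand_hourly=1.0, train_hours=10, spot_discount=0.7, rerun_overhead=0.1
)
print(f"Estimated savings: {saved:.0%}")
```

The overhead term is the reason frequent checkpointing matters: the more work lost per interruption, the more of the discount is given back.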
6. Multi-Cloud Strategy
- Avoid lock-in: Use abstracted interfaces (Kubeflow, MLflow)
- Data gravity: Run compute where the data already lives; moving large datasets between clouds adds egress cost and latency
- Cost comparison: Profile workloads across providers
- Compliance: Consider data residency requirements
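
The "abstracted interfaces" point can be sketched in a few lines. This is a hand-rolled illustration of the idea, not the Kubeflow or MLflow API: application code depends on a small provider-neutral contract, and each cloud gets its own adapter. The class names, the `release` helper, and the returned endpoint strings are all hypothetical placeholders.

```python
from typing import Protocol


class ModelDeployer(Protocol):
    """Minimal provider-neutral contract the rest of the codebase depends on."""

    def deploy(self, model_uri: str, endpoint_name: str) -> str: ...


class SageMakerDeployer:
    def deploy(self, model_uri: str, endpoint_name: str) -> str:
        # A real adapter would call the SageMaker SDK here
        return f"sagemaker://{endpoint_name}"


class VertexDeployer:
    def deploy(self, model_uri: str, endpoint_name: str) -> str:
        # A real adapter would call the Vertex AI SDK here
        return f"vertex://{endpoint_name}"


DEPLOYERS: dict[str, ModelDeployer] = {
    "aws": SageMakerDeployer(),
    "gcp": VertexDeployer(),
}


def release(provider: str, model_uri: str, endpoint_name: str) -> str:
    """Pick a backend from configuration instead of importing a cloud SDK directly."""
    return DEPLOYERS[provider].deploy(model_uri, endpoint_name)


endpoint = release("aws", "s3://bucket/model.tar.gz", "churn-model")
```

With this shape, switching providers touches one configuration value and one adapter rather than every call site.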
Key Takeaways
- SageMaker: Most comprehensive; best for AWS-native shops
- Vertex AI: Strong AutoML and BigQuery integration; Google Research models
- Azure ML: Enterprise integration; OpenAI access; responsible AI tools
- Cost: Spot training + right-sized inference can cut spend by roughly 50-70%, depending on the workload