Trace Id is missing
February 15, 2023

Microsoft 365 and Azure Machine Learning unlock the power of open-source solutions with Azure Monitor

No one can accuse Microsoft of resting on its laurels. The digital pioneer continues to develop world-changing technologies that redefine how we work and live because it never stops innovating. Although we tend to think of its solutions as delivering value directly to customers, the company’s ingenuity also enhances its own offerings. Microsoft development teams in the Microsoft 365 and Azure Machine Learning product groups are prime examples of this. Both teams were finding better ways to provide their services through open-source container orchestration, and both found the vital monitoring support they needed to deliver those containerized workloads through Azure Monitor.

Microsoft Corporation

“With Azure Monitor, we don’t need to invest in creating monitoring infrastructure. We’ve saved several years of development costs by not doing the monitoring ourselves.”

Andrey Lukyanov, Principal Software Engineer, Architect, Microsoft

Thinking inside the container

The Microsoft 365 team wanted a better way to standardize software delivery in its Microsoft Experiences and Devices Group and a means of enhancing the resiliency of its workloads. It shifted to container-based delivery to reduce costs, optimize performance, and promote agility. Additionally, the team hoped to deliver greater visibility and metrics to better support user compliance requirements. The team decided on Azure Kubernetes Service (AKS) to support its clusters—one of the world’s largest fleets—and built its own microservices solution as a layer on top of that. To meet its monitoring needs, the team added Azure Monitor as a final layer.

Andrey Lukyanov, Principal Software Engineer, Architect at Microsoft states, “We didn't want to reinvent the bicycle, so we partnered with Azure Kubernetes Service and Azure Monitor. Each layer has its own specific responsibility, and Azure Monitor provides us with a managed monitoring solution we can run at multiple cluster levels.” The solution consists of a fleet of managed AKS clusters that are deployed to every Azure datacenter across the world. The solution delivers the observability expected of a managed platform, and it offers logging metrics, alerting, and detection out of the box so its users can simply run workloads and enjoy complete observability management without having to write code.

The Azure Machine Learning team also delivers its platform on AKS. Running across 30 regions, with two clusters per region for load balancing, redundancy, and disaster recovery, for a total of around 60 clusters, the solution has approximately 100 nodes per cluster and hosts roughly 100 applications per cluster. This helps customers develop and manage end-to-end machine learning solutions through model training and retraining, scheduling, metric capture, and data storage, in addition to model hosting and deployment. The team needed a monitoring solution that could ingest a time series volume of 2 million data points per minute on its busier clusters, with metrics from multiple programming languages, and it needed to retain those volumes long enough to be useful. Additionally, it hoped to offer a single portal for its developers to see pertinent metrics, and it wanted to eliminate the need for cumbersome manual connections to enhance security and ease of use.

Monitoring more with less

With Monitor, the Microsoft 365 team gained a native, managed observability solution for its platform infrastructure. Sachin Rao, Principal Software Engineer at Microsoft says, “We had a key requirement to automatically collect signals, logs, and metrics in persistent data storage, and Azure Monitor was the perfect solution.” The platform uses a Log Analytics workspace to provide persistent storage across its entire fleet of clusters and to analyze and query issues through a single dashboard. It uses Azure Monitor alerts to support log-based alerting, which draws on data stored in a Log Analytics workspace to understand and detect issues proactively. Monitor alerts work with the solution’s incident management tool to automatically trigger incidents for engineers to take action and mitigate.

The platform draws on Azure Monitor Application Insights to send workloads that are operating outside of its containerized environment to a Log Analytics workspace and deploy availably tests to test cluster availability. It uses Azure Workbooks for visualization to provide developers with the ability to view platform health. “The best part of Azure Workbooks is that it can aggregate data across multiple sources,” says Rao. “We run thousands of clusters, and we have around 12 workspaces running to support them, so it’s critical to aggregate the data and visualize platform and application health.”

Finally, the platform uses the recently launched Azure Monitor managed service for Prometheus and Azure Managed Grafana to access native Kubernetes signals, gaining container insights, cheaper persistent storage, and enhanced performance. The Grafana visualization tool adds value through quick and easy dashboarding. And because it uses the Prometheus query language, the overall experience is far simpler.

The Machine Learning team similarly uses the Azure Monitor managed service for Prometheus solution to meet monitoring needs. With two distinct workflows in its monitoring pipeline, the Machine Leaning solution uses Prometheus Operator to scrape metrics from its external, open-source applications, which it then exports to Monitor for alert processing. For internally written applications, metrics bypass the Azure Monitor managed service for Prometheus and flow directly into Monitor agents running in the cluster. The Machine Learning team also uses Azure Managed Grafana, which in conjunction with Azure Monitor managed service for Prometheus, enables Machine Learning to aggregate all of its metrics in a single location for centralized and simplified querying, with easy-to-consume dashboarding for complete visibility.

The power of open source with the efficiency of managed monitoring

Through Monitor, the Microsoft 365 team has gained a reliable monitoring solution without the high costs of development. “With Azure Monitor, we don’t need to invest in creating monitoring infrastructure,” says Lukyanov. “We’ve saved several years of development costs by not doing the monitoring ourselves.” The team can support open-source technology and harness the power of the most modern tech stack for its users. Microsoft Senior Product Manager Raghavee Venkatramanan says, “We turn to Azure Monitor any time we use open-source technology so that we can provide greater efficiency and performance.” Additionally, there are costs to be saved through Azure Monitor managed service for Prometheus. Adds Rao, “We expect to reduce costs by around 40 percent with the Azure Monitor managed service for Prometheus, and the performance of our dashboards and other tools is significantly better.”

The Machine Learning team is also accomplishing more enhanced and efficient monitoring through its Azure Monitor managed service for Prometheus. The team can monitor far more than it used to and has gained centralized data storage with longer retention periods. Since moving to managed Prometheus, all Machine Learning clusters can send metrics, and it can now aggregate and keep the granular metrics it used to have to throw away, which provides far more nuanced information. Microsoft Senior Software Engineer Peter Glotfelty says, “The Azure Monitor managed service for Prometheus have opened our eyes. We now have all these metrics where we used to really lack visibility—I think that’s the most exciting part!”

Because it can now monitor more granular data more rigorously, the Machine Learning team’s time to detect and time to mitigate has improved drastically as its metrics volume has gone up tenfold with its Azure Monitor managed service for Prometheus. Based on the granular metrics collected, the team hopes to reduce its time to detect by 20 to 30 minutes per incident after it sets up alerting. Also, through its Azure Monitor managed service for Prometheus, the team recently identified a memory leak in its C# applications, which it estimates will result in a 30 percent reduction in memory usage after repair. And the team is forecasting development time savings. Because of the wealth of open-source dashboarding tools, the team estimates saving an hour of development time per dashboard.

Having a single pane of glass for cluster-wide monitoring has been a huge win for both the Microsoft 365 and the Machine Learning teams. Microsoft Principal Software Engineer Ori Yosefi says, “Being able to get everything we need in a single pane of glass with Azure Monitor is amazing. We have the insights to troubleshoot and the ability to alert when we need it.” This enhanced visibility and management is enabling both teams to spend less time on monitoring and more on delivering transformational value for platform users. Rao concludes, “With Azure Monitor, we get out-of-the-box observability infrastructure. Everything is automatically managed, so we can simply focus on doing our jobs.”

Find out more about Microsoft on YouTube, Twitter, Facebook, and LinkedIn.

“Being able to get everything we need in a single pane of glass with Azure Monitor is amazing. We have the insights to troubleshoot and the ability to alert when we need it.”

Ori Yosefi, Principal Software Engineer, Microsoft

Take the next step

Fuel innovation with Microsoft

Talk to an expert about custom solutions

Let us help you create customized solutions and achieve your unique business goals.

Drive results with proven solutions

Achieve more with the products and solutions that helped our customers reach their goals.

Follow Microsoft