With more than 600 physical worksites around the world, Microsoft has one of the largest network infrastructure footprints on the planet.
Managing the thousands of devices that keep those locations connected demands constant attention from a global team of network engineers. It’s their job to monitor and maintain those devices. And when outages occur, they lead the charge to repair and remediate the situation.
To support their work, our Real Time Telemetry team at Microsoft Digital, the company’s IT organization, has introduced new capabilities that help engineers identify network device outages and capture data faster and more extensively than ever before. Through real-time telemetry, network engineers can isolate and remediate issues in minutes—not hours—to keep their colleagues productive and our technology running smoothly.
Immediacy is everything
Conventional network monitoring uses the Simple Network Management Protocol (SNMP) architecture, which retrieves network telemetry through periodic, pull-based polls and other legacy technologies. At Microsoft, that polling interval typically ranges between five minutes and six hours.
SNMP is a foundational telemetry architecture with decades of legacy. It’s ubiquitous, but it doesn’t allow for the most up-to-date data possible.
“The biggest pain point we’ve always heard from network engineers is latency in the data,” says Astha Sinha, senior product manager for the Infrastructure and Engineering Services team in Microsoft Digital. “When data is stale, engineers can’t react quickly to outages, and that has implications for security and productivity.”
Serious vulnerabilities and liabilities arise when a network device outage occurs. But because of lags between polling intervals, a network engineer might not receive information or alerts about the situation until long after it happens.
We assembled the Real Time Telemetry team as part of our Infrastructure and Engineering Services to close that gap.
“We build the tools and automations that network engineers use to better manage their networks,” says Martin O’Flaherty, principal product manager for the Infrastructure and Engineering Services team in Microsoft Digital. “To do that, we need to make sure they have the right signals as early and as consistently as possible.”
The technology that powers these possibilities is known as streaming telemetry. It relies on network devices compatible with the more modern gRPC Network Management Interface (gNMI) telemetry protocol and other technologies to support a push-based approach to network monitoring where network devices stream data constantly.
This architecture isn’t new, but our team is scaling and programmatizing how that data becomes available by creating a real-time telemetry apparatus that collects, stores, and delivers network information to service engineers. These capabilities offer several benefits.
The advantages of real-time network device telemetry
Prevention
Superior anomaly detection, reduced intent and configuration drift, the foundation for large-scale automation and less network downtime.
Security and compliance
Better detection of breaches, vulnerabilities, and bugs through automated scans of OS stalls, lateral device hijacking, malware, and other common vulnerabilities.
Observability
Visibility into real-time utilization data on network device stats, as well as steady replacement of current data collection technology and more scalable network growth and evolution.
Service quality
More rapid network fixes, leading to a reduction in the baselines for time-to-detection and time-to-migration for incidents.
“Devices are proactively sending data without having to wait for requests, so they function more efficiently and facilitate timely troubleshooting and optimization,” says Abhijit Vijay, principal software engineering manager with the Infrastructure and Engineering Services team in Microsoft Digital. “Since this approach pushes data continuously rather than at specific intervals, it also reduces the additional network traffic and scales better in larger, more complex environments.
At any given time, Microsoft operates 25,000 to 30,000 network devices, managed by engineers working across 10 different service lines. Accounting for all their needs while keeping data collection manageable and efficient requires extensive collaboration and prioritization.
We also had to account for compatibility. With so many network devices in operation, replacement lifecycles vary. Not all of them are currently gNMI-compatible.
Working with our service lines, we identified the use cases that would provide the best possible ROI, largely based on where we would find the greatest benefits for security and where networks offered a meaningful number of gNMI-compatible devices. We also zeroed in on the types of data that would be the most broadly useful. Being selective helped us preserve resources and avoid overwhelming engineers with too much data.
We built our internal solution entirely using Azure components, including Azure Functions and Azure Kubernetes Service (AKS), Azure Cosmos DB, Redis, and Azure Data Lake. The result is a platform that network engineers can use to access real-time telemetry data.
With key service lines, use cases, and a base of technology in place, we worked with network engineers to onboard the relevant devices. From there, their service lines were free to experiment with our solution on real-world incidents.
Better response times, greater network reliability
Service lines are already experiencing big wins.
In one case, a heating and cooling system went offline for a building in the company’s Millennium Campus in Redmond, Washington. A lack of environmental management has the potential to cause structural damage to buildings if left unchecked, so it was important to resolve this issue as quickly as possible. The service line for wired onsite connections sprang into action as soon as they received a network support ticket.
With real-time telemetry enabled, the team created a Kusto query to compare DOT1X access-session data for the day of the outage with a period before the outage started. Almost immediately, they spotted problematic VLAN switching, including the exact time and duration of the outage. By correlating the timestamps, they determined that the RADIUS registrations of the device owner had expired, which caused the devices to switch into the guest network as part of the zero-trust network implementation.
As a result, the team was able to resolve the registration issues and restore the heating and cooling systemsin 10 minutes—a process that might have taken hours using other collection methods due to the lag-time between polling intervals.
“This has the potential to improve alerting, reduce outages, and enhance security,” says Daniel Menten, senior cloud network engineer for site infrastructure management on the Site Wired team. “One of the benefits of real-time telemetry is that it lets us capture information that wasn’t previously available—or that we received too slowly to take action.”
It’s about speeding up how we identify issues and how we then respond to them.
“With this level of observability, engineers that monitor issues and outages benefit from enhanced experiences,” says Aayush Dave, a product manager on the Infrastructure and Engineering Services team in Microsoft Digital. “And that’s going to make our network more reliable and performant in a world where security issues and outages can have a global impact.”
The future is in real time
Now that real-time telemetry has demonstrated its value, our efforts are focused on broadening and deepening the experience.
“More devices mean more impact,” Dave says. “By increasing the number of network devices that facilitate real-time telemetry, we’re giving our engineers the tools to accelerate their response to these incidents and outages, all leading to enhanced performance and a more robust network reliability posture.”
It’s also about layering on new ways of accessing and using the data.
We’ve just released a preview UI that provides a quick look at essential data, as well as an all-up view of devices in an engineer’s service line. This dashboard will enable a self-service model that makes it even easier to isolate essential telemetry without the need for engineers to create or integrate their own interfaces.
That kind of observability isn’t only about outages. It also enables optimization by helping engineers understand and influence how devices work together.
The depth and quality of real-time telemetry data also provides a wealth of information for training AI models. With enough data spread across enough devices, predictive analysis might be able to provide preemptive alerts when the kinds of network signals that tend to accompany outages appear.
“We’re paving the way for an AIOps future where the system won’t just predict potential issues, but initiate self-healing actions,” says Rob Beneson, partner director of software engineering on the Infrastructure and Engineering Services team in Microsoft Digital.
It’s work that aligns with our company mission.
“This transformation is enhancing our internal user experience and maintaining the network connectivity that’s critical for our ultimate goal,” Beneson says. “We want to empower every person and organization on the planet to achieve more.”
Here are some tips for getting started with real-time telemetry at your company:
- Start with your users. Ask them about pain points, what scares them, and what they need.
- Start small and go step by step to get the core architecture in place, then work up to the glossier UI and UX elements.
- Be mindful of onboarding challenges like bugs in vendor hardware and software, especially around security controls.
- You’ll find plenty of edge cases and code fails, so be prepared to invest in revisiting challenges and fixing problems that arise.
- Make sure you have a use case and a problem to solve. Have a plan to guide your adoption and use before you turn on real-time telemetry.
- Make sure you have the proper data infrastructure in place and an apparatus for storing your data.
- Communicate and demonstrate the value of this solution to the teams who need to invest resources into onboarding it.
- Prioritize visibility into the devices and data you’ve onboarded through pilots and hero scenarios, then scale onboarding further according to your teams’ needs.
- Integrate as much as possible. Consider visualizations and pushing into existing network graphs and tools to surface data where engineers already work.
Learn more about Microsoft Azure Kubernetes Service monitoring and Microsoft Azure Functions.
- Learn more about implementing Microsoft Azure cost optimization internally at Microsoft.
- Find out how we’re moving our network to the cloud with Microsoft Azure.
- Check out how we’re boosting our employee device procurement at Microsoft with better forecasting.
- Want more information? Email us and include a link to this story and we’ll get back to you.
- Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.