Finding and fixing network outages in minutes—not hours—with real-time telemetry at Microsoft - Inside Track Blog (2024)


With more than 600 physical worksites around the world, Microsoft has one of the largest network infrastructure footprints on the planet.

Managing the thousands of devices that keep those locations connected demands constant attention from a global team of network engineers. It’s their job to monitor and maintain those devices. And when outages occur, they lead the charge to repair and remediate the situation.

To support their work, our Real Time Telemetry team at Microsoft Digital, the company’s IT organization, has introduced new capabilities that help engineers identify network device outages and capture data faster and more extensively than ever before. Through real-time telemetry, network engineers can isolate and remediate issues in minutes—not hours—to keep their colleagues productive and our technology running smoothly.

Immediacy is everything


Conventional network monitoring uses the Simple Network Management Protocol (SNMP) architecture, which retrieves network telemetry through periodic, pull-based polls and other legacy technologies. At Microsoft, that polling interval typically ranges between five minutes and six hours.

SNMP is a foundational telemetry architecture with decades of legacy. It’s ubiquitous, but it doesn’t allow for the most up-to-date data possible.

“The biggest pain point we’ve always heard from network engineers is latency in the data,” says Astha Sinha, senior product manager for the Infrastructure and Engineering Services team in Microsoft Digital. “When data is stale, engineers can’t react quickly to outages, and that has implications for security and productivity.”

Serious vulnerabilities and liabilities arise when a network device outage occurs. But because of lags between polling intervals, a network engineer might not receive information or alerts about the situation until long after it happens.
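To put rough numbers on that lag, here is a minimal sketch of the arithmetic. It assumes an outage starts at a uniformly random moment within a polling interval and that the very next poll detects it. The function name `polling_detection_delay` is illustrative and not part of any Microsoft tooling.

```python
def polling_detection_delay(interval_s: float) -> tuple[float, float]:
    """Return (average, worst-case) delay in seconds before a poll notices an outage.

    Assumption: the outage begins at a uniformly random point inside the
    polling interval, and the next poll detects it immediately.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return interval_s / 2, interval_s


# The two ends of the polling range mentioned in the article.
for label, interval in [("5 minutes", 5 * 60), ("6 hours", 6 * 60 * 60)]:
    avg, worst = polling_detection_delay(interval)
    print(f"{label}: average {avg / 60:.1f} min, worst case {worst / 60:.0f} min")
```

Under these assumptions, a six-hour polling interval means an outage goes unnoticed for three hours on average, before any time spent on diagnosis.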

We assembled the Real Time Telemetry team as part of our Infrastructure and Engineering Services to close that gap.

“We build the tools and automations that network engineers use to better manage their networks,” says Martin O’Flaherty, principal product manager for the Infrastructure and Engineering Services team in Microsoft Digital. “To do that, we need to make sure they have the right signals as early and as consistently as possible.”

The technology that powers these possibilities is known as streaming telemetry. It relies on network devices compatible with the more modern gRPC Network Management Interface (gNMI) telemetry protocol and other technologies to support a push-based approach to network monitoring where network devices stream data constantly.
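The push model can be illustrated with a small in-memory sketch. This is not gNMI and not the team's implementation. `StreamingDevice`, `Update`, and the telemetry path shown are hypothetical names chosen only to show a device notifying subscribers the moment its state changes, rather than waiting to be polled.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Update:
    """One telemetry update pushed by a device (hypothetical shape)."""
    device: str
    path: str
    value: str
    timestamp: float


@dataclass
class StreamingDevice:
    """Hypothetical stand-in for a device that supports push-based telemetry."""
    name: str
    subscribers: list[Callable[[Update], None]] = field(default_factory=list)

    def subscribe(self, callback: Callable[[Update], None]) -> None:
        # Register a collector that wants to receive every update.
        self.subscribers.append(callback)

    def set_state(self, path: str, value: str, timestamp: float) -> None:
        # On every state change, push the update to all subscribers at once.
        update = Update(self.name, path, value, timestamp)
        for callback in self.subscribers:
            callback(update)


received: list[Update] = []
device = StreamingDevice("switch-01")
device.subscribe(received.append)

# The collector sees the change as soon as it happens, with no poll cycle.
device.set_state("interfaces/eth0/oper-status", "DOWN", timestamp=100.0)
print(received[0])
```

In a real deployment the callback would be replaced by a network subscription to the device, but the control flow is the same: the device initiates delivery, and the collector only listens.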

This architecture isn’t new, but our team is scaling and programmatizing how that data becomes available by creating a real-time telemetry apparatus that collects, stores, and delivers network information to service engineers. These capabilities offer several benefits.

The advantages of real-time network device telemetry

Security and compliance

Better detection of breaches, vulnerabilities, and bugs through automated scans of OS stalls, lateral device hijacking, malware, and other common vulnerabilities.

Observability

Real-time visibility into utilization data and other network device statistics, along with the steady replacement of current data collection technology and more scalable network growth and evolution.

Service quality

More rapid network fixes, leading to a reduction in the baselines for time-to-detection and time-to-mitigation for incidents.

“Devices are proactively sending data without having to wait for requests, so they function more efficiently and facilitate timely troubleshooting and optimization,” says Abhijit Vijay, principal software engineering manager with the Infrastructure and Engineering Services team in Microsoft Digital. “Since this approach pushes data continuously rather than at specific intervals, it also reduces the additional network traffic and scales better in larger, more complex environments.”

At any given time, Microsoft operates 25,000 to 30,000 network devices, managed by engineers working across 10 different service lines. Accounting for all their needs while keeping data collection manageable and efficient requires extensive collaboration and prioritization.

We also had to account for compatibility. With so many network devices in operation, replacement lifecycles vary. Not all of them are currently gNMI-compatible.

Working with our service lines, we identified the use cases that would provide the best possible ROI, largely based on where we would find the greatest benefits for security and where networks offered a meaningful number of gNMI-compatible devices. We also zeroed in on the types of data that would be the most broadly useful. Being selective helped us preserve resources and avoid overwhelming engineers with too much data.

We built our internal solution entirely using Azure components, including Azure Functions, Azure Kubernetes Service (AKS), Azure Cosmos DB, Redis, and Azure Data Lake. The result is a platform that network engineers can use to access real-time telemetry data.

With key service lines, use cases, and a base of technology in place, we worked with network engineers to onboard the relevant devices. From there, their service lines were free to experiment with our solution on real-world incidents.

Better response times, greater network reliability

Service lines are already experiencing big wins.

In one case, a heating and cooling system went offline for a building in the company’s Millennium Campus in Redmond, Washington. A lack of environmental management has the potential to cause structural damage to buildings if left unchecked, so it was important to resolve this issue as quickly as possible. The service line for wired onsite connections sprang into action as soon as they received a network support ticket.

With real-time telemetry enabled, the team created a Kusto query to compare DOT1X access-session data for the day of the outage with a period before the outage started. Almost immediately, they spotted problematic VLAN switching, including the exact time and duration of the outage. By correlating the timestamps, they determined that the RADIUS registrations of the device owner had expired, which caused the devices to switch into the guest network as part of the zero-trust network implementation.
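The team's actual Kusto query isn't shown, but the comparison it describes can be sketched in Python. Everything below is an assumption for illustration: the sample records, the device name `hvac-controller-1`, the VLAN labels, the timestamps, and the `vlan_deviations` helper. The idea is to take each device's expected VLAN from a baseline period and report the time spans during which it sat on a different one.

```python
from datetime import datetime

# Hypothetical baseline: the VLAN each device used before the outage.
baseline = {"hvac-controller-1": "corp"}

# Hypothetical access-session samples: (timestamp, device, vlan).
samples = [
    (datetime(2024, 1, 10, 8, 0), "hvac-controller-1", "corp"),
    (datetime(2024, 1, 10, 9, 15), "hvac-controller-1", "guest"),
    (datetime(2024, 1, 10, 9, 45), "hvac-controller-1", "guest"),
    (datetime(2024, 1, 10, 10, 5), "hvac-controller-1", "corp"),
]


def vlan_deviations(samples, baseline):
    """Return (device, vlan, start, end) spans where a device left its baseline VLAN.

    `end` is None when the device was still off its baseline at the last sample.
    """
    spans = []
    open_spans = {}  # device -> (vlan, start) for deviations still in progress
    for ts, device, vlan in sorted(samples):
        expected = baseline.get(device)
        current = open_spans.get(device)
        # Close the open deviation when the device moves to a different VLAN.
        if current is not None and vlan != current[0]:
            spans.append((device, current[0], current[1], ts))
            del open_spans[device]
            current = None
        # Open a new deviation when the device is on an unexpected VLAN.
        if vlan != expected and current is None:
            open_spans[device] = (vlan, ts)
    for device, (vlan, start) in open_spans.items():
        spans.append((device, vlan, start, None))
    return spans


for device, vlan, start, end in vlan_deviations(samples, baseline):
    print(f"{device} was on '{vlan}' from {start} to {end}")
```

With streamed samples arriving continuously, the start and end of each span line up closely with the real outage window, which is what makes correlating them against other timestamped events, such as registration expiry, practical.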

As a result, the team was able to resolve the registration issues and restore the heating and cooling systems in 10 minutes, a process that might have taken hours using other collection methods due to the lag time between polling intervals.

“This has the potential to improve alerting, reduce outages, and enhance security,” says Daniel Menten, senior cloud network engineer for site infrastructure management on the Site Wired team. “One of the benefits of real-time telemetry is that it lets us capture information that wasn’t previously available—or that we received too slowly to take action.”

It’s about speeding up how we identify issues and how we then respond to them.

“With this level of observability, engineers that monitor issues and outages benefit from enhanced experiences,” says Aayush Dave, a product manager on the Infrastructure and Engineering Services team in Microsoft Digital. “And that’s going to make our network more reliable and performant in a world where security issues and outages can have a global impact.”

The future is in real time

Now that real-time telemetry has demonstrated its value, our efforts are focused on broadening and deepening the experience.

“More devices mean more impact,” Dave says. “By increasing the number of network devices that facilitate real-time telemetry, we’re giving our engineers the tools to accelerate their response to these incidents and outages, all leading to enhanced performance and a more robust network reliability posture.”

It’s also about layering on new ways of accessing and using the data.

We’ve just released a preview UI that provides a quick look at essential data, as well as an all-up view of devices in an engineer’s service line. This dashboard will enable a self-service model that makes it even easier to isolate essential telemetry without the need for engineers to create or integrate their own interfaces.

That kind of observability isn’t only about outages. It also enables optimization by helping engineers understand and influence how devices work together.

The depth and quality of real-time telemetry data also provides a wealth of information for training AI models. With enough data spread across enough devices, predictive analysis might be able to provide preemptive alerts when the kinds of network signals that tend to accompany outages appear.

“We’re paving the way for an AIOps future where the system won’t just predict potential issues, but initiate self-healing actions,” says Rob Beneson, partner director of software engineering on the Infrastructure and Engineering Services team in Microsoft Digital.

It’s work that aligns with our company mission.

“This transformation is enhancing our internal user experience and maintaining the network connectivity that’s critical for our ultimate goal,” Beneson says. “We want to empower every person and organization on the planet to achieve more.”


Here are some tips for getting started with real-time telemetry at your company:

  • Start with your users. Ask them about pain points, what scares them, and what they need.
  • Start small and go step by step to get the core architecture in place, then work up to the glossier UI and UX elements.
  • Be mindful of onboarding challenges like bugs in vendor hardware and software, especially around security controls.
  • You’ll find plenty of edge cases and code fails, so be prepared to invest in revisiting challenges and fixing problems that arise.
  • Make sure you have a use case and a problem to solve. Have a plan to guide your adoption and use before you turn on real-time telemetry.
  • Make sure you have the proper data infrastructure in place and an apparatus for storing your data.
  • Communicate and demonstrate the value of this solution to the teams who need to invest resources into onboarding it.
  • Prioritize visibility into the devices and data you’ve onboarded through pilots and hero scenarios, then scale onboarding further according to your teams’ needs.
  • Integrate as much as possible. Consider visualizations and pushing into existing network graphs and tools to surface data where engineers already work.

Learn more about Microsoft Azure Kubernetes Service monitoring and Microsoft Azure Functions.

  • Learn more about implementing Microsoft Azure cost optimization internally at Microsoft.
  • Find out how we’re moving our network to the cloud with Microsoft Azure.
  • Check out how we’re boosting our employee device procurement at Microsoft with better forecasting.