Paris Traceroute Brownout

Summary:

In February 2017, M-Lab was notified of issues with the M-Lab data available in BigQuery. Upon investigation, a problem was identified with the Paris Traceroute collection daemon, which had caused a reduction in Paris Traceroute measurements beginning in June 2016. At the peak of the outage, from the fourth quarter of 2016 through January 2017, only about 5% of NDT tests had an associated Paris Traceroute test. Additionally, an issue within the data processing pipeline meant that Paris Traceroute data that was measured and collected was not inserted into the BigQuery tables, and was therefore unavailable for use.

Both defects were resolved in mid-February, additional monitoring was added, and BigQuery has been brought up to date with the available data.

Details:

The bug was introduced in June 2016, when we deployed a new version of the Paris Traceroute tool. This new version was better in many respects: it ran a traceroute for a larger fraction of the NDT tests that M-Lab receives. The update was also intended to run those traceroutes using the same source and destination TCP ports as the NDT test, to ensure that any layer 4 load-balancers on the path would treat the Paris Traceroute packets the same way they treated the NDT test packets.

Paris Traceroute coverage of NDT jumped from the 55-75% we had been seeing earlier in the year to about 85% shortly after the deployment.

Unfortunately, while the first installation and startup demonstrated that the new Paris Traceroute tool was a huge improvement, it had a race condition that frequently prevented it from restarting successfully when a server rebooted. Maintenance and updates to the M-Lab platform typically result in 5% to 20% of the servers rebooting each week, so the number of machines with properly operating Paris Traceroute daemons gradually dropped over the following six months. By December, we were measuring fewer than 15,000 unique Paris Traceroutes per day, even though NDT test volume had grown to over 200,000 tests per day.
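To get a feel for how quickly this kind of failure can erode coverage, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that reboots hit servers uniformly at random at a fixed weekly rate, that each reboot leaves the daemon dead with a fixed probability, and that a dead daemon stays dead. The function name and the 0.8 failure probability are hypothetical; only the 5% to 20% weekly reboot range comes from the account above.

```python
def working_fraction(weekly_reboot_rate, restart_failure_prob, weeks):
    """Fraction of servers still running the daemon after `weeks` weeks.

    Simplified model (an assumption, not measured M-Lab behaviour):
    each week a server reboots with probability `weekly_reboot_rate`,
    and a reboot leaves the daemon dead with probability
    `restart_failure_prob`. A dead daemon is assumed to stay dead.
    """
    weekly_survival = 1.0 - weekly_reboot_rate * restart_failure_prob
    return weekly_survival ** weeks


# Low and high ends of the weekly reboot range, over roughly six months.
# The 0.8 failure probability is a made-up illustrative value.
for rate in (0.05, 0.20):
    frac = working_fraction(rate, restart_failure_prob=0.8, weeks=26)
    print(f"reboot rate {rate:.0%}: {frac:.1%} of daemons still running")
```

Even under these simplified assumptions, the compounding effect is clear: a modest weekly reboot rate is enough to take most daemons out of service within a few months.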

In late January, a researcher brought to our attention that there had been no Paris Traceroute data for Sydney, Australia since July 2016. We initially thought this was a BigQuery pipeline problem, but discovered on February 9 that the raw data was also missing from the BigStore files. This led to the discovery that the daemon was not functioning properly, and we soon found the race condition that was the root cause. A related pipeline problem also meant that data that was collected was pushed only to BigStore, and not to BigQuery.

The collection problem has been rectified (github), and we have added monitoring and alerting to ensure that the Paris Traceroute jobs are running. The pipeline problem has also been addressed, and all available Paris Traceroute data has been pushed to BigQuery. Our new processing pipeline (in development) will include monitoring of test counts, so this class of issues should no longer be able to persist for more than a few days without causing an alert.
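As a rough illustration of what count-based monitoring could look like, here is a minimal sketch. It is hypothetical and is not the M-Lab pipeline code: the function name, the 50% coverage threshold, and the three-day window are all assumptions chosen for the example.

```python
def coverage_alerts(daily_counts, min_coverage=0.5, max_bad_days=3):
    """Return the days on which an alert would fire.

    daily_counts: iterable of (day, ndt_tests, pt_tests) in date order.
    An alert fires once per streak, on the day coverage
    (pt_tests / ndt_tests) has been below `min_coverage` for
    `max_bad_days` consecutive days. A day with no NDT tests at all
    is also counted as a bad day.
    """
    alerts = []
    bad_streak = 0
    for day, ndt_tests, pt_tests in daily_counts:
        coverage = pt_tests / ndt_tests if ndt_tests else 0.0
        if coverage < min_coverage:
            bad_streak += 1
        else:
            bad_streak = 0
        if bad_streak == max_bad_days:
            alerts.append(day)
    return alerts


# Made-up sample data, for illustration only.
sample = [
    ("day1", 1000, 850),
    ("day2", 1000, 400),
    ("day3", 1000, 300),
    ("day4", 1000, 200),
    ("day5", 1000, 100),
]
print(coverage_alerts(sample))  # ['day4']
```

Alerting on the ratio of traceroutes to NDT tests, rather than on the raw traceroute count, keeps the check meaningful when NDT volume itself changes.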

The bad news is that Paris Traceroute data for the M-Lab platform is very sparse for the latter half of 2016 and early 2017. An accounting of daily NDT tests to unique client addresses, and of daily Paris Traceroute tests to the corresponding addresses, shows the following weekly coverage. You can see the uptick in coverage in weeks 24 and 25, the gradual decrease in coverage in the ensuing weeks, and the surge in NDT test counts starting around week 29. The issue was resolved in weeks 6 and 7 of 2017, and you can see the corresponding restoration of test coverage.

NDT tests and Paris Traceroute tests per week, June 2016 - March 2017

year week ndt_tests pt_tests coverage_%
2016 21 391624 220025 56.18
2016 22 406367 271845 66.90
2016 23 416730 275954 66.22
2016 24 412697 323775 78.45
2016 25 423601 358566 84.65
2016 26 402461 321197 79.81
2016 27 454424 336444 74.04
2016 28 468149 318792 68.10
2016 29 529151 291702 55.13
2016 30 654513 309584 47.30
2016 31 884118 372715 42.16
2016 32 968719 377798 39.00
2016 33 992234 388339 39.14
2016 34 1042148 382655 36.72
2016 35 1086401 363672 33.47
2016 36 1125381 337876 30.02
2016 37 1116448 286531 25.66
2016 38 1259814 197173 15.65
2016 39 1310179 93825 7.16
2016 40 1329701 76366 5.74
2016 41 1293060 70755 5.47
2016 42 1243503 59420 4.78
2016 43 1176804 56772 4.82
2016 44 1026661 47130 4.59
2016 45 1181540 34044 2.88
2016 46 1205690 46432 3.85
2016 47 1303431 86086 6.60
2016 48 1329139 92356 6.95
2016 49 1449058 95055 6.56
2016 50 1583852 121391 7.66
2016 51 1654478 112102 6.78
2016 52 1578605 80632 5.11
2016 53 1742942 110150 6.32
2017 1 1811016 88584 4.89
2017 2 1844633 92833 5.03
2017 3 1772941 118736 6.70
2017 4 1762644 160980 9.13
2017 5 1790433 168677 9.42
2017 6 1808985 600434 33.19
2017 7 1788070 1550172 86.70
2017 8 1858179 1565195 84.23
2017 9 1862311 1663829 89.34
2017 10 1851940 1747511 94.36
2017 11 1969324 1864045 94.65
2017 12 1898877 1754154 92.38
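The last column of the table is simply the Paris Traceroute count divided by the NDT count, expressed as a percentage. The sketch below recomputes it for a few rows copied from the table; the function names are hypothetical and this is not the query that produced the original figures.

```python
# A few rows copied from the table: year, week, ndt_tests, pt_tests.
ROWS = """\
2016 25 423601 358566
2016 45 1181540 34044
2017 6 1808985 600434
2017 7 1788070 1550172
"""


def coverage_pct(ndt_tests, pt_tests):
    """Percentage of NDT tests with a matching Paris Traceroute test."""
    return round(100.0 * pt_tests / ndt_tests, 2)


def parse_rows(text):
    """Yield (year, week, ndt_tests, pt_tests) from whitespace-separated rows."""
    for line in text.splitlines():
        year, week, ndt_tests, pt_tests = (int(x) for x in line.split())
        yield year, week, ndt_tests, pt_tests


for year, week, ndt_tests, pt_tests in parse_rows(ROWS):
    print(year, week, coverage_pct(ndt_tests, pt_tests))
```

Run against these four rows, the computed values agree with the percentages listed in the table for the same weeks.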

All Paris Traceroute data in BigStore and BigQuery is correct, but the bug did cause considerable data loss relative to what might have been measured and collected. From a statistical analysis point of view, it is further unfortunate that the data gaps are systematic rather than random. When a machine failed to start the daemon properly, the region of the Internet served by that site became systematically less likely to be traced. The problem has been fixed, job count alerting has been added, and activity-based alerting will soon be added to prevent problems like this in the future.

Thanks to Xiaohong Deng and Ben Dowling for bringing this problem to our attention!
