Data Collected by NDT

When you run NDT, the IP address provided by your Internet Service Provider will be collected along with your measurement results. M-Lab conducts the test and publishes all test results to promote Internet research. NDT does not collect any information about you as an Internet user.

Please review M-Lab’s Privacy Policy to understand what data is collected and how data is used before initiating a test.

Unparsed Raw NDT Data in GCS

All of the raw data and log files from the measurement fleet are archived in their original format and available in Google Cloud Storage. As our parsing and analysis algorithms improve, M-Lab periodically reprocesses all of this archived data.

Generally, BigQuery rows indicate the locations of the raw data from which they were derived. Dedicated users can reconstruct our analysis and, in principle, fully replicate our parsers. The raw data also includes TCP packet captures (.pcap files) for most NDT tests; however, the pcap files are not yet indexed in BigQuery. Details on how M-Lab publishes test data in raw form are provided on our Google Cloud Storage documentation page.

NDT Data in BigQuery

To make NDT data more readily available for research and analysis, M-Lab parses all NDT data into BigQuery tables and views, and makes query access available for free by subscription to a Google Group. Find out more about how to get access on our BigQuery QuickStart page.

Note that we sometimes use the terms “table” and “view” interchangeably: they reflect different internal implementations, but due to billing and access controls everything documented here as a table is actually presented as a view.

Presenting NDT data as a series of datasets and views in BigQuery reflects M-Lab’s strategy for data curation: it provides a cleaned and filtered view of test results that can be used to address our community’s most common research questions, which require known good test results. By also preserving raw test data as collected and annotated, and by curating views at intermediate steps, we also support users whose research is concerned with unfiltered or non-curated tests.

We now publish three series of Datasets in BigQuery containing Views for NDT data. These datasets and views mirror the processing stages of our ETL pipeline:

Dataset | Description
measurement-lab.ndt.* | Unified Views in the ndt dataset present a stable, long-term supported unified schema for all ndt datatypes (web100, ndt5, ndt7). They are filtered to provide only tests that meet our team’s current understanding of completeness and research quality, and they exclude rows resulting from M-Lab’s operations and monitoring systems.
measurement-lab.ndt_intermediate.* | Extended Views in the ndt_intermediate dataset join raw measurements with annotations, and remap column names across all ndt datatypes (web100, ndt5, ndt7) to provide a common schema for use in the Unified Views. M-Lab does not guarantee long-term supported schemas for Views in the ndt_intermediate dataset. Researchers using these views should be aware that breaking schema changes in future releases may affect their queries.
measurement-lab.ndt_raw.* | Raw Views in the ndt_raw dataset provide a 1-to-1 mapping of tests contained in GCS archives to test rows.
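
As a small illustration of how the three datasets relate, the sketch below maps each processing stage to its dataset and builds a fully qualified view name. The dataset names come from the table above; the stage labels and the helper `qualified_view` are hypothetical and exist only for this example.

```python
# Dataset names are taken from this page. The stage labels ("unified",
# "extended", "raw") and the helper itself are hypothetical.
DATASETS = {
    "unified": "ndt",
    "extended": "ndt_intermediate",
    "raw": "ndt_raw",
}


def qualified_view(stage: str, view: str, project: str = "measurement-lab") -> str:
    """Return a backtick-quoted BigQuery identifier for a view at a given stage."""
    if stage not in DATASETS:
        raise ValueError(f"unknown stage: {stage!r}")
    # Backticks are a safe way to quote a project ID that contains a hyphen.
    return f"`{project}.{DATASETS[stage]}.{view}`"


if __name__ == "__main__":
    print(qualified_view("unified", "unified_downloads"))
    print(qualified_view("extended", "extended_ndt7_uploads"))
    print(qualified_view("raw", "ndt7"))
```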

Unified Views

NDT Unified Views are published in the ndt dataset, and are designed to easily support studies of the evolution of Internet performance by geopolitical region.

Unified Views should be the starting point for most people.

NDT Unified Views:

  • Use a standardized schema across all ndt datatypes (ndt7, ndt5 and web100)
  • Present computed performance metrics (e.g. data rate, loss rate, min RTT, with more to come in the future)
  • Have separate views for upload and download because the test details and data processing are different for each direction
  • Are strict subsets (rows and columns removed) of the union of the Extended Views
  • Are curated to only include tests that meet our current, best understanding of completeness and research quality:
    • At least 8 KB of data was transferred (a threshold low enough to admit tests with data rates below 9.6 kbits/second)
    • Test duration was between 9 and 60 seconds
    • For downloads, some form of network congestion was detected (i.e. tests with only non-network bottlenecks are excluded)
    • Tests with parser errors and NULL results are excluded
    • Tests from M-Lab Operations and Management (OAM) infrastructure are excluded
  • Also called “Helpful Views” in past documentation and blog posts
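
The curation criteria listed above can be sketched as a simple predicate over one test record. This is only an illustration, not M-Lab's implementation: the `NdtRow` class and all of its field names are hypothetical and do not correspond to the published schema, "8 KB" is assumed to mean 8 * 1024 bytes, and the duration bounds are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NdtRow:
    # Hypothetical fields for illustration; not the published schema.
    bytes_transferred: int
    duration_seconds: float
    direction: str                    # "download" or "upload"
    congestion_detected: bool
    parse_error: bool
    throughput_mbps: Optional[float]  # None stands in for a NULL result
    is_oam: bool                      # test came from OAM infrastructure


MIN_BYTES = 8 * 1024      # assumption: "8 KB" means 8 * 1024 bytes
MIN_DURATION_S = 9.0      # assumption: bounds are inclusive
MAX_DURATION_S = 60.0


def meets_unified_criteria(row: NdtRow) -> bool:
    """Apply the five listed criteria in order; any failure excludes the row."""
    if row.bytes_transferred < MIN_BYTES:
        return False
    if not (MIN_DURATION_S <= row.duration_seconds <= MAX_DURATION_S):
        return False
    # The congestion requirement applies to downloads only.
    if row.direction == "download" and not row.congestion_detected:
        return False
    if row.parse_error or row.throughput_mbps is None:
        return False
    if row.is_oam:
        return False
    return True


if __name__ == "__main__":
    sample = NdtRow(
        bytes_transferred=12_500_000,
        duration_seconds=10.0,
        direction="download",
        congestion_detected=True,
        parse_error=False,
        throughput_mbps=10.0,
        is_oam=False,
    )
    print(meets_unified_criteria(sample))
```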

In BigQuery, unified view names are prefixed with unified_:

  • measurement-lab.ndt.unified_downloads
  • measurement-lab.ndt.unified_uploads

Unified views with suffixes resembling dates (e.g. unified_uploads_20201026x) are provided to support differential A/B testing across processing changes. They give researchers an easy way to detect whether our changes have any effect on downstream research results.
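
One way to use the date-suffixed views for an A/B check is to run the same query against both views and compare a summary statistic. The sketch below assumes the two result sets have already been fetched as lists of throughput values; the numbers are made up, and the helper `relative_median_change` is hypothetical rather than part of any M-Lab tooling.

```python
from statistics import median


def relative_median_change(baseline, candidate):
    """Relative change in the median of `candidate` compared with `baseline`."""
    if not baseline or not candidate:
        raise ValueError("both samples must be non-empty")
    base = median(baseline)
    if base == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return (median(candidate) - base) / base


if __name__ == "__main__":
    # Made-up throughput samples (Mbit/s) standing in for results of the same
    # query run against unified_uploads and a dated view such as
    # unified_uploads_20201026x.
    current_view = [10.0, 20.0, 30.0, 40.0]
    dated_view = [11.0, 22.0, 33.0, 44.0]
    change = relative_median_change(current_view, dated_view)
    print(f"median changed by {change:.1%}")
```

A change close to zero suggests the processing change did not affect this particular statistic; other statistics may still differ and would need their own comparison.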

For more background on unified views see the blog posts below, noting that some of the terminology has evolved slightly since the blog posts.

Extended Views

NDT Extended Views are published in the ndt_intermediate dataset, and contain every row from the raw views, with added columns describing everything that we know about the data.

Custom unified views based on the NDT Extended Views should be the starting point for nearly all alternative analyses of M-Lab data.

For guidance and examples please see: Creating Custom Unified Views or Subqueries for Your Own Research

NDT Extended Views:

  • Have no filters applied but every row is labeled with the selection criteria used by the unified views
  • Contain calculated metrics and other standard columns such as: data rate, loss rate, minimum RTT, etc.
  • Are joined with geographical annotations
  • In the future will be joined with traceroute and other data sets such as platform load telemetry and Internet health indicators
  • Have schemas that are supersets of the unified view schema and raw table schemas, differing per experiment and raw parser version
  • Are designed to support user-implemented Custom Unified Views
  • In BigQuery, extended views are in the dataset measurement-lab.ndt_intermediate:
    • measurement-lab.ndt_intermediate.extended_ndt7_downloads
    • measurement-lab.ndt_intermediate.extended_ndt7_uploads
    • measurement-lab.ndt_intermediate.extended_ndt5_downloads
    • measurement-lab.ndt_intermediate.extended_ndt5_uploads
    • measurement-lab.ndt_intermediate.extended_web100_downloads
    • measurement-lab.ndt_intermediate.extended_web100_uploads
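
The extended view names listed above follow a regular pattern, so they can be generated from a datatype and a direction. The sketch below builds a view name and a minimal Standard SQL query that counts rows for one day. The helper names are hypothetical, and the `date` column used in the WHERE clause is an assumption about the view schema that should be checked against the published schema before use.

```python
from datetime import date

DATATYPES = ("ndt7", "ndt5", "web100")
DIRECTIONS = ("downloads", "uploads")


def extended_view(datatype: str, direction: str) -> str:
    """Build an extended view name following the pattern listed on this page."""
    if datatype not in DATATYPES:
        raise ValueError(f"unknown datatype: {datatype!r}")
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    return f"measurement-lab.ndt_intermediate.extended_{datatype}_{direction}"


def count_query(datatype: str, direction: str, day: date,
                date_column: str = "date") -> str:
    """Return a query string counting rows for one day.

    `date_column` defaults to "date", which is an assumption about the schema.
    """
    view = extended_view(datatype, direction)
    return (
        "SELECT COUNT(*) AS tests\n"
        f"FROM `{view}`\n"
        f"WHERE {date_column} = DATE '{day.isoformat()}'"
    )


if __name__ == "__main__":
    print(count_query("ndt7", "downloads", date(2020, 10, 26)))
```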

Raw Views

NDT Raw Views are published in the ndt_raw dataset and provide a 1-to-1 mapping of tests contained in our Google Cloud Storage archives to test rows. They are the closest representation of the archived raw test data that has been parsed and imported into BigQuery.

NDT Raw Views are provided for completeness and transparency but are no longer recommended for general use.

NDT Raw Views:

  • Include one row for every unique test that can be parsed, even if truncated or partially corrupted
  • Contain a small number of added columns indicating parse errors and (future) metrics computed directly from the snap logs (web100 or tcp-info)
  • Have schemas that reflect the original structure of the archived raw data and differ per tool and parser version
  • Are subject to breaking changes
  • Also called “faithful views” in past documentation and blog posts
  • If named with the suffix _legacy, were generated by an older parser version and are slated to be replaced in the future
  • In BigQuery, raw views are in the dataset measurement-lab.ndt_raw:
    • measurement-lab.ndt_raw.ndt7
    • measurement-lab.ndt_raw.ndt5_legacy
    • measurement-lab.ndt_raw.web100_legacy
    • measurement-lab.ndt_raw.tcpinfo_legacy
    • measurement-lab.ndt_raw.traceroute_legacy
    • measurement-lab.ndt_raw.annotation

Example Queries and Updating Past Queries

If you need examples or assistance updating past research queries to use our current BigQuery Views, please review the pages below:

Changelog

Generally, schemas for all M-Lab datasets are published as tagged releases in the etl-schema repository on GitHub. This section outlines changes specific to NDT schemas over time.

[v3.17] - https://github.com/m-lab/etl-schema/releases/tag/v3.17

  • Renames publicly available datasets to mirror naming in our ETL process, and aligns alphabetical names of NDT datasets in BigQuery for better readability.
    • ndt_intermediate renamed. Previously named intermediate_ndt.
    • ndt_raw renamed. Previously named raw_ndt.
  • Renames views in the raw_ndt dataset, adding the suffix _legacy to raw views of data collected using now deprecated or legacy parsers and/or kernel instrumentation.
  • Minor bug fixes to “Unified Views” in the ndt dataset.
    • BQ_SAFE operators were added to queries that generate NDT Unified Views to force any rows with corrupted geographic annotated fields to be expressed as NULLs.
    • The Congestion Control Algorithm (CCA) field for upload tests was reporting the server’s CCA as the client’s CCA. This was partially a legacy bug of the old web100 based tooling, which reported upload and download results in the same row. Upload tests will no longer contain the client CCA since it is not currently passed to the server at test time.
  • Version numbers are dropped from this changelog. Moving forward, detailed release notes on GitHub releases will replace this changelog.

v3.11.0 - 2020-04

  • Following the M-Lab 2.0 platform upgrade completed in November 2019
    • NDT data from the now deprecated web100 based ndt has been archived in the dataset measurement-lab.ndt.web100
    • NDT data from the new, TCP INFO based ndt-server is now provided in measurement-lab.ndt.ndt5
    • associated TCP INFO data for all ndt5 tests is now provided in measurement-lab.ndt.tcpinfo
  • Views from web100 ndt are now deprecated, superseded by new “unified” views
    • The following Views provide access only to data from the web100 legacy platform:
      • measurement-lab.ndt.recommended
      • measurement-lab.ndt.downloads
      • measurement-lab.ndt.uploads
  • Unified views of all NDT data published
    • Two new historical views of all NDT data are now available, and provide only NDT tests that meet our [criteria] for valid, research quality tests.
      • measurement-lab.ndt.unified_downloads
      • measurement-lab.ndt.unified_uploads

[v4] - 2019-05

  • Under the previous release convention, a hierarchy of releases, release candidates (“rc”), versioned release candidates, and versioned intermediate views was published. These will cease being updated with new data starting May 6, 2019.
  • BigQuery datasets named after M-Lab measurement services & data types.
  • Each measurement service (ndt, traceroute, sidestream, utilization) will have a corresponding BigQuery dataset and view in the measurement-lab project, managed by our data reprocessing service.
  • LegacySQL support is now deprecated, but a single LegacySQL view of the legacy data may be kept for historical purposes.
  • Only StandardSQL is supported in any new views of the comprehensive reprocessed data.
  • Views that combine legacy tables and recently parsed data will no longer be offered.
  • Historically, Paris Traceroute data was collected for every measurement service. For this data type, a view in the aggregate dataset is now provided.
  • Over the next year, M-Lab will restructure the traceroute schema to support reprocessing using the Gardener service, and to unify the schema for historical and future data collection by Scamper.

[v3.1.1] - 2018-07

  • Publish official Switch tables from the DISCO dataset.

Published tables and views are:

  • measurement-lab.legacy.ndt (data ~ 2015-01-01 - 2017-05-10)
  • measurement-lab.legacy.ndt_pre2015 (data ~ 2009-02-18 - 2014-12-31)
  • measurement-lab.base_tables.ndt
  • measurement-lab.base_tables.switch

  • measurement-lab.rc
  • measurement-lab.release_v3_1
  • measurement-lab.release
    • measurement-lab.release.ndt_all
    • measurement-lab.release.ndt_all_legacysql
    • measurement-lab.release.ndt_downloads
    • measurement-lab.release.ndt_downloads_legacysql
    • measurement-lab.release.ndt_uploads
    • measurement-lab.release.ndt_uploads_legacysql

[v3.1] - 2018-02

  • First official release of v3 tables, with all historical data re-parsed, and annotated with geolocation metadata.

[v3.0.2] - 2017-12

  • Standardized the naming scheme for BigQuery table and view names to be consistent with new semantic versioning.
  • All tables and views must be queried using StandardSQL, except for views with “legacysql” in the name.
  • Views for tests other than NDT may be published in the future using the same format:
    • <test>_all_<version> (standardSQL)
    • <test>_all_legacysql_<version>
  • Complete documentation for tables, views, the contents of views, and what data they limit (where applicable) will be published on this page.
  • Views will be published concurrently with new table schemas, such that all table versions will have corresponding views.
  • Previous versions of our tables will be referenced by versions 1.0, 2.0, etc. in our documentation but actual table names will not be changed.
  • Re-ran historical annotations for traceroute, npad, and sidestream data due to a bug where some geolocation annotations were not present in all past test data.

[v3.0.1] - 2017-10

  • The schema for v3.0.1 tables was updated, removing an alpha feature called deltas, which attempted to log the differences between test snaplogs instead of the final test values. This feature will be revisited in future schema updates.
  • Newly released data annotation engine added geolocation and some metadata to tests from 2016 to present.
  • Published a series of beta BigQuery views for NDT data, to allow data queries across both v2 and v3.0.x tables.
  • Published traceroute and sidestream tables to replace v2 versions, migrated data, re-annotated data.

[v3] - 2017-05

  • Began publication to new date partitioned table and updated schema to support the new, open source, ETL pipeline.
  • Data publication to v2 tables stopped at this time.

[v2.1] - 2016-11

[v2] - 2016-03

  • Began the publication of per project “fast tables” for NDT, NPAD, Paris Traceroute, and Sidestream.
    • plx.google:m_lab.ndt.all
    • plx.google:m_lab.npad.all
    • plx.google:m_lab.paris-traceroute.all
  • Continued the publication of v1 monthly tables, and published a migration guide.
  • Deprecated fields in v2 “fast tables”:
    • type
    • project
    • web100_log_entry.is_last_entry
    • web100_log_entry.group_name

Data Annotations

Recommendations

Citing the M-Lab NDT Dataset

Please cite the NDT data set as follows: The M-Lab NDT Data Set, <date range used> https://measurementlab.net/tests/ndt

or, in BibTeX format:

@misc{mlab,
        author="{Measurement Lab}",
        title="The {M}-{L}ab {NDT} Data Set",
        year="(2009-02-11 -- 2015-12-21)",
        howpublished="\url{https://measurementlab.net/tests/ndt}",
}
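
For convenience, the plain-text citation can be filled in programmatically with the date range you actually used. The helper below is a hypothetical sketch; the way the date range is rendered (ISO dates joined by "--") simply follows the BibTeX example above and is an assumption about the preferred style.

```python
from datetime import date

NDT_URL = "https://measurementlab.net/tests/ndt"


def ndt_citation(start: date, end: date) -> str:
    """Format the plain-text citation for the date range used in a study."""
    if end < start:
        raise ValueError("end date precedes start date")
    # Assumption: render the range as ISO dates joined by "--".
    return f"The M-Lab NDT Data Set, {start.isoformat()} -- {end.isoformat()} {NDT_URL}"


if __name__ == "__main__":
    print(ndt_citation(date(2009, 2, 11), date(2015, 12, 21)))
```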