Data requirements#

Data for creating a routable network#

When calculating travel times with r5py, you typically need two types of datasets:

a road network dataset from OpenStreetMap (OSM) in Protocol Buffer Binary (.pbf) format (mandatory):

These data are used for finding the fastest routes and calculating the travel times for walking, cycling and driving. In addition, these data are used for walking/cycling legs to, from, or between stops when routing with public transport.

a public transport schedule dataset in General Transit Feed Specification format (optional):

These data contain all information necessary to calculate travel times on public transport, such as the stops, routes, trips and schedules of buses, trams, trains, and other vehicles.

Data pre-processing

Often, it is useful to crop an OSM extract beforehand, or to add other cost factors to the data (e.g., to account for slope). Check the detailed instructions for data preparation on the Conveyal website, and use the tools in this repository to add customised costs for pedestrian and cycling analyses.

R5py automatically combines multiple GTFS data sets. This is useful when you study areas covered by more than one transit authority, or when data from different modes of transport, such as bus and metro, are available in separate GTFS feeds.
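As a minimal sketch of how several feeds might be collected and passed on together: the helper below gathers all zip archives in a folder that contain the files the GTFS reference lists as required. The names `list_gtfs_feeds`, `build_transport_network`, and `REQUIRED_GTFS_FILES` are hypothetical and exist only for this illustration; the second function assumes that r5py and a Java runtime are installed, and that `r5py.TransportNetwork` takes the OSM extract followed by a list of GTFS files.

```python
import zipfile
from pathlib import Path

# Files that the GTFS reference marks as required in every feed
# (hypothetical constant, used here only to recognise GTFS archives)
REQUIRED_GTFS_FILES = {"agency.txt", "routes.txt", "trips.txt", "stop_times.txt"}


def list_gtfs_feeds(directory):
    """Return the sorted paths (as strings) of zip archives that look like GTFS feeds."""
    feeds = []
    for path in sorted(Path(directory).glob("*.zip")):
        if not zipfile.is_zipfile(path):
            continue
        with zipfile.ZipFile(path) as archive:
            names = set(archive.namelist())
        # Keep the archive only if every required table is present
        if REQUIRED_GTFS_FILES <= names:
            feeds.append(str(path))
    return feeds


def build_transport_network(osm_pbf, gtfs_directory):
    """Build one routable network from an OSM extract and all feeds in a folder."""
    import r5py  # imported lazily: assumes r5py and a Java runtime are available

    # Assumption: the second argument is a list of GTFS file paths
    return r5py.TransportNetwork(osm_pbf, list_gtfs_feeds(gtfs_directory))
```

Keeping the feeds of different operators or modes as separate files in one folder means that adding another feed to a study only requires dropping one more archive into that folder.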

Where to obtain such datasets?#

Here are a few places where you can download the datasets for creating the routable network. This list is by no means comprehensive, and you are very welcome to add further data sources that you are aware of.

OpenStreetMap data in PBF format#

  • pydriosm is a Python package for downloading OSM extracts from GeoFabrik and BBBike.

  • pyrosm is a Python package for creating street networks from OSM extracts, and includes tools to download OSM extracts from GeoFabrik and BBBike.

  • GeoFabrik is a website offering OSM extracts for free download, covering many pre-defined areas (continents, countries, regions, etc.).

  • BBBike is a website that offers OSM extracts for free download, covering many pre-defined areas, including the extents of individual cities, and also supports data downloads cut to custom extents.

  • Protomaps is a website that allows you to download OSM extracts for a custom area of interest, drawn in an interactive map, or taken from an uploaded polygon feature.

Public transport schedules in GTFS format#

  • Your local transit authority, city works, or public transport company: most of the time, the most accurate and up-to-date GTFS schedule files are available locally, usually as an open-data download. If you cannot find a download, ask nicely: many transport authorities are happy to share.

  • Transitland is an online data platform that collects GTFS data from 2500 public transport operators world-wide. Their database includes historical data sets.

  • Mobility Database is an online repository, run by a non-profit organisation, that stores current and historical GTFS data of 1800 operators from around the globe. It is already functional but still under development, and is set to replace Transitfeeds (see below).

  • Transitfeeds is an easy-to-navigate website that hosts up-to-date and historical GTFS data for many countries and cities. Deprecated: it will be replaced by Mobility Database.

Check GTFS files

At times, it is worth spot-checking GTFS data sets downloaded from third-party websites for validity, and confirming that they cover the time period and geographic extent of a study area.

MobilityData’s GTFS Validator is a cross-platform Java tool to check file integrity, data types, and compliance with the GTFS standard. They also provide an online version where you can upload a feed to check against the reference and best practices.

GTFS-Lite is a Python package to read GTFS data sets into gtfslite.gtfs.GTFS objects that store the information on stops, routes, fares, etc., in pandas.DataFrames. Use GTFS.routes_summary(), GTFS.stop_summary(), and GTFS.summary() to gain a quick overview of the scope of a GTFS data set.
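For a quick look at the time period and geographic extent of a feed without installing either tool, a rough sketch using only the Python standard library could read the calendar and stop tables directly from the zip archive. The function names `gtfs_coverage` and `_read_table` are hypothetical, and the sketch assumes a feed with UTF-8 encoded tables at the top level of the archive; it does not replace a full validation.

```python
import csv
import io
import zipfile
from datetime import datetime


def _read_table(archive, name):
    """Return the rows of one GTFS table as dictionaries, or [] if it is absent."""
    if name not in archive.namelist():
        return []
    # utf-8-sig drops the byte-order mark that some feeds include
    text = archive.read(name).decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text, newline="")))


def gtfs_coverage(feed):
    """Summarise the service period and the stop bounding box of a GTFS zip archive."""
    with zipfile.ZipFile(feed) as archive:
        calendar = _read_table(archive, "calendar.txt")
        calendar_dates = _read_table(archive, "calendar_dates.txt")
        stops = _read_table(archive, "stops.txt")

    # Regular service periods, plus dates on which service is added (exception_type 1)
    dates = [row.get("start_date") for row in calendar]
    dates += [row.get("end_date") for row in calendar]
    dates += [row.get("date") for row in calendar_dates if row.get("exception_type") == "1"]
    # GTFS writes service dates as YYYYMMDD
    days = [datetime.strptime(d.strip(), "%Y%m%d").date() for d in dates if d and d.strip()]

    # Stops without coordinates are skipped
    points = [
        (float(row["stop_lon"]), float(row["stop_lat"]))
        for row in stops
        if (row.get("stop_lon") or "").strip() and (row.get("stop_lat") or "").strip()
    ]
    bbox = None
    if points:
        lons = [lon for lon, _ in points]
        lats = [lat for _, lat in points]
        bbox = (min(lons), min(lats), max(lons), max(lats))

    return {
        "first_day": min(days) if days else None,
        "last_day": max(days) if days else None,
        "bbox": bbox,  # (min_lon, min_lat, max_lon, max_lat)
        "stop_count": len(points),
    }
```

Comparing `first_day` and `last_day` with the planned departure dates, and `bbox` with the extent of the study area, shows at a glance whether a feed is worth loading.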

Origin and destination locations#

In addition to OSM and GTFS datasets, you need data that represents the origin and destination locations (OD-data) of routes. R5py accepts data sets as geopandas.GeoDataFrames.

Use geopandas.read_file() to read data sets from files in one of the many geospatial data formats, such as GeoPackage, GeoJSON, or ESRI Shapefile.

If your data are in a non-spatial file format, such as a spreadsheet or a CSV file with columns representing the latitude and longitude coordinates, follow these instructions to convert them into a geopandas.GeoDataFrame.
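One possible route for CSV files, sketched here with only the Python standard library, is to write the rows out as GeoJSON point features, a format that geopandas.read_file() can read. The function name `csv_to_geojson` and the default column names `lon` and `lat` are assumptions made for this illustration; adapt them to the columns of your own file. Coordinates are assumed to be WGS 84 longitude and latitude in decimal degrees.

```python
import csv
import json


def csv_to_geojson(csv_path, geojson_path, lon_column="lon", lat_column="lat"):
    """Write the rows of a CSV file as point features to a GeoJSON file.

    Returns the number of features written.
    """
    features = []
    with open(csv_path, newline="", encoding="utf-8") as source:
        for row in csv.DictReader(source):
            # GeoJSON stores coordinates in (longitude, latitude) order
            lon = float(row.pop(lon_column))
            lat = float(row.pop(lat_column))
            features.append(
                {
                    "type": "Feature",
                    "geometry": {"type": "Point", "coordinates": [lon, lat]},
                    "properties": row,  # the remaining columns become attributes
                }
            )

    collection = {"type": "FeatureCollection", "features": features}
    with open(geojson_path, "w", encoding="utf-8") as target:
        json.dump(collection, target)
    return len(features)
```

The resulting file can then be loaded with geopandas.read_file(). When the table is already loaded as a pandas.DataFrame, geopandas.points_from_xy() is an in-memory alternative for building the point geometries.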

Sample data sets#

In this documentation, we use some sample data sets that you can install separately. The sample data comprises the following data sets:

  • Helsinki, Finland

    • A population grid data set of Helsinki city centre, obtained from the Helsinki Region Environmental Services (HSY), licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) licence.

    • An OpenStreetMap extract covering Helsinki (© OpenStreetMap contributors, ODbL license)

    • A GTFS public transport schedule dataset for Helsinki, cropped and minimised from the official open-data download from Helsingin seudun liikenne’s (HSL) open data web page, licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) licence.

  • São Paulo, Brazil

    • A population grid data set of São Paulo city centre, obtained from the Access to Opportunities Project conducted at the Institute for Applied Economic Research - Ipea, Brazil.

    • An OpenStreetMap extract covering São Paulo city centre.

    • A GTFS public transport schedule dataset for São Paulo, cropped and minimised from the official open-data download from SPTRANS’s open data web page.