Make a python script to enhance GTFS data from another file in a zip file, and automate it from other .txt files

Dear hackers,

I'm trying to edit a CSV file from a GTFS archive using information from another one. Let me explain my problem.

I have two CSV files, lines.txt and trips.txt. The output CSV file should have the same structure as trips.txt, but with corrected values.

I want to replace the trip_headsign field in trips.txt with forward_line_name from lines.txt if the direction_id field in the trips.txt row is 0, or with backward_line_name if direction_id is 1.

I want to do this only if the part of the line_id value in lines.txt between the two ":" symbols is the same as the part of the route_id value in trips.txt before the ":" symbol, AND if trip_headsign doesn't already contain letters (only purely numeric values are to be replaced; see the original full trips.txt file).
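
To make the rule above concrete, here is a minimal Python 3 sketch of it. It is only an illustration under assumptions: the helper names (line_key, route_key, new_headsign) are made up for this example, rows are assumed to be dicts as produced by csv.DictReader, and "doesn't contain letters" is interpreted as "consists of digits only".

```python
def line_key(line_id):
    """Return the part of a lines.txt line_id between the first two ':' symbols."""
    parts = line_id.split(':')
    return parts[1] if len(parts) >= 3 else None


def route_key(route_id):
    """Return the part of a trips.txt route_id before the first ':' symbol."""
    return route_id.split(':', 1)[0]


def new_headsign(trip, line):
    """Return the trip_headsign a trips.txt row should get, given one lines.txt row.

    The old value is kept unless the keys match AND the old value is numeric only.
    """
    old = trip['trip_headsign']
    if line_key(line['line_id']) != route_key(trip['route_id']):
        return old  # this line does not describe this trip's route
    if not old.isdigit():
        return old  # already contains letters (or is empty): leave it alone
    if trip['direction_id'] == '0':
        return line['forward_line_name']
    if trip['direction_id'] == '1':
        return line['backward_line_name']
    return old  # unexpected direction_id: do not guess
```

Note that this sketch matches on the middle part of line_id only, exactly as described above, so it does nothing about several lines sharing the same key.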

Here is a sample of trips.txt:


and a sample of lines.txt:

OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00

Each file has hundreds of lines that I need to parse and edit this way. The separator in my CSV files is a comma.

The full files that I want to edit are each part of a .zip archive: one contains lines.txt (full file) and the other contains trips.txt (full file).

Once the trips.txt file is updated, I also want to update the routes.txt contents in the GTFS archive using info from the .txt files available here (if route_id matches), which I will eventually store in the ./fr-idf_stif folder.

Once these two files have been updated, I want to generate a complete, updated GTFS archive that also includes the remaining files from the original.
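
One possible way to regenerate the archive, sketched in Python 3 with the standard zipfile module: copy every member of the original archive into a new one and swap in the corrected contents for the files that changed. The function name rebuild_archive and the replacements mapping are assumptions made for this example, not part of any existing script.

```python
import zipfile


def rebuild_archive(src_path, dst_path, replacements):
    """Copy src_path to dst_path, substituting the members named in replacements.

    replacements maps a member name (e.g. 'trips.txt') to its new content as bytes.
    Names that do not exist in the source archive are ignored, not added.
    """
    with zipfile.ZipFile(src_path, 'r') as src, \
            zipfile.ZipFile(dst_path, 'w', zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = replacements.get(info.filename)
            if data is None:
                # Untouched member: copy the original bytes as they are.
                data = src.read(info.filename)
            dst.writestr(info.filename, data)
```

Writing to a new file instead of modifying the original in place keeps the input archive intact if something goes wrong halfway through.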

The deliverable will be the script that properly updates the trips.txt and routes.txt contents and regenerates a proper updated GTFS file when I give it both .zip input files as parameters (on the command line or graphically).

Thanks for your help :)

Unless I'm much mistaken, the provided lines.txt contains a great many collisions in the specified part of line_id. How should collisions be resolved?
philr almost 7 years ago
You're right. One should also check that the lines.txt line_code matches the end of the trips.txt route_id (after the ":"). That should prevent collisions, no?
kalon33 almost 7 years ago
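
A rough Python 3 sketch of the stricter matching suggested in this exchange: index the lines.txt rows by the pair (middle part of line_id, line_code) and look trips up by the two halves of route_id. The names build_line_index and find_line are hypothetical, and the route_id shape "prefix:line_code" is taken from the description above rather than checked against the full files.

```python
def build_line_index(lines):
    """Index lines.txt rows (dicts) by (middle part of line_id, line_code)."""
    index = {}
    for line in lines:
        parts = line['line_id'].split(':')
        if len(parts) < 3:
            continue  # unexpected line_id shape: skip rather than guess
        index[(parts[1], line['line_code'])] = line
    return index


def find_line(index, route_id):
    """Return the lines.txt row for a route_id such as '100110010:10', or None."""
    prefix, _, suffix = route_id.partition(':')
    return index.get((prefix, suffix))
```

If two rows still share the same pair, the later one silently wins in this sketch; whether that is acceptable would have to be checked against the real data.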
awarded to thelink2012


2 Solutions


It receives three parameters:

  1. The file to fetch trips.txt and routes.txt from.
  2. The file to fetch lines.txt from.
  3. The name/path of the output zip file.

Additionally, it will try to read the ./fr-idf_stif directory to update the routes. If not found, routes.txt won't be updated.
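
For illustration only, a command-line interface with this shape could be sketched in Python 3 with argparse as below. This is not the actual script: the argument names and the --routes-dir option are assumptions made for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the three positional parameters plus an optional routes directory."""
    parser = argparse.ArgumentParser(
        description='Update trips.txt and routes.txt in a GTFS archive.')
    parser.add_argument('gtfs_zip',
                        help='archive to fetch trips.txt and routes.txt from')
    parser.add_argument('lines_zip', help='archive to fetch lines.txt from')
    parser.add_argument('output_zip', help='name/path of the output zip file')
    # Hypothetical option: lets the routes directory be overridden.
    parser.add_argument('--routes-dir', default='./fr-idf_stif',
                        help='directory with route .txt files (optional)')
    return parser.parse_args(argv)


def should_update_routes(args):
    """routes.txt is only updated when the routes directory actually exists."""
    return os.path.isdir(args.routes_dir)
```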

Please let me know of any issue.

As commented above your solution, there are a great number of collisions between lines, which should be prevented by also checking that lines.txt line_code matches trips.txt route_id (combined with the previously used update conditions) before updating. I only just remembered it, sorry.
kalon33 almost 7 years ago
Fixed. Sorry for not mentioning it: when writing I saw such collisions, but since there was no mention of them in the description (which is very detailed) I thought it was okay to ignore them.
thelink2012 almost 7 years ago
Hi, I just generated another file to further enhance my dataset. This file, desserte_RER.txt, located in the same directory as the script, should be used to replace trips.txt trip_headsign values that are not numeric and are equal to a desserte_RER.txt code_mission (this would cover only part of the trips that were not already altered), using a concatenation of code_mission (or the old trip_headsign), " - " and the desserte_RER.txt destination, giving for example CIME - Versailles Chantiers as the trips.txt trip_headsign. Would you extend your script to also cover this, for a $10 tip? Thanks :)
kalon33 almost 7 years ago
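
The requested extension could be sketched in Python 3 roughly as follows. It assumes the RER file has already been read into a dict mapping each mission code (assumed to be the code_mission column) to its destination; the function name rer_headsign is made up for this example.

```python
def rer_headsign(headsign, destinations):
    """Return 'CODE - Destination' for a non-numeric headsign with a known code.

    destinations maps a mission code to its destination name.
    Numeric headsigns and unknown codes are returned unchanged.
    """
    if headsign.isdigit():
        return headsign  # numeric values are handled by the lines.txt rule
    destination = destinations.get(headsign)
    if destination is None:
        return headsign  # no matching mission code: leave as is
    return headsign + ' - ' + destination
```

For instance, with a mapping of CIME to Versailles Chantiers, a trip_headsign of CIME would become CIME - Versailles Chantiers.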
Sure. Updated the script. Note that it reads desserte_RER.txt (as you described), whereas the file I downloaded from the link is named dessertes_rer.txt :)
thelink2012 almost 7 years ago

My attempt:

#!/usr/bin/env python2

import csv
import os
import shutil
import sys
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED

def main():
    # Read line info
    lines = {}
    with ZipFile(sys.argv[1], 'r') as idf_zip:
        with idf_zip.open('lines.txt', 'r') as lines_file:
            linesreader = csv.DictReader(lines_file)
            for line in linesreader:
                # This line_id parsing may be fragile.  Works for sample data.
                line_id = line['line_id'].split(':', 1)[1].split('OIF')[0]
                lines[line_id] = line['forward_line_name'], line['backward_line_name']

    # Read route info, if available
    routes = {}
    if os.path.exists('fr-idf_stif'):
        for filename in os.listdir('fr-idf_stif'):
            with open(os.path.join('fr-idf_stif', filename), 'r') as routes_file:
                routereader = csv.reader(routes_file)
                for row in routereader:
                    if row[0]:
                        # Leave off the last element, 'weelchair' column not in GTFS data.
                        routes[row[0]] = row[:-1]

    temp_dir = tempfile.mkdtemp(dir='.')
    try:
        # Extract GTFS archive to a temp directory
        with ZipFile(sys.argv[2], 'r') as gtfs_zip:
            gtfs_zip.extractall(temp_dir)

        # Read and fixup trip/route data
        with open(os.path.join(temp_dir, 'trips.txt'), 'r') as trips_file:
            tripsreader = csv.reader(trips_file)
            fixed_trips = [next(tripsreader)]
            for row in tripsreader:
                # Assumes columns 0, 3, 4 are route_id, trip_headsign, direction_id.
                if row[0] in lines and row[3].isdigit():
                    row[3] = lines[row[0]][int(row[4])]
                fixed_trips.append(row)
        with open(os.path.join(temp_dir, 'routes.txt'), 'r') as route_file:
            routereader = csv.reader(route_file)
            fixed_routes = [next(routereader)]
            for row in routereader:
                # Replace the whole row when updated route info is available.
                if row[0] in routes:
                    row = routes[row[0]]
                fixed_routes.append(row)

        # Write fixed trip/route data
        with open(os.path.join(temp_dir, 'trips.txt'), 'w') as fixed_trips_file:
            fixed_writer = csv.writer(fixed_trips_file)
            for row in fixed_trips:
                fixed_writer.writerow(row)
        with open(os.path.join(temp_dir, 'routes.txt'), 'w') as fixed_routes_file:
            fixed_writer = csv.writer(fixed_routes_file)
            for row in fixed_routes:
                fixed_writer.writerow(row)

        # Create updated GTFS archive
        with ZipFile('updated_' + sys.argv[2], 'w', ZIP_DEFLATED) as updated_gtfs_zip:
            for filename in os.listdir(temp_dir):
                updated_gtfs_zip.write(os.path.join(temp_dir, filename), filename)
    finally:
        # Always remove the temp directory, even if something failed above
        shutil.rmtree(temp_dir)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print 'Usage:', sys.argv[0], '<IDF archive> <GTFS archive>'
        sys.exit(1)
    if not os.path.isfile(sys.argv[1]):
        print 'Could not find IDF archive', sys.argv[1]
        sys.exit(1)
    if not os.path.isfile(sys.argv[2]):
        print 'Could not find GTFS archive', sys.argv[2]
        sys.exit(1)
    main()