Make a python script to enhance GTFS data from another file in a zip file, and automate it from other .txt files

Dear hackers,

I'm trying to edit a CSV file from a GTFS archive using information from another one. Let me explain my problem.

I have two CSV files, lines.txt and trips.txt. The output CSV file should have the same structure as trips.txt, but with corrected values.

I want to replace the trip_headsign field of each trips.txt row with forward_line_name from lines.txt if the row's direction_id is 0, or with backward_line_name if direction_id is 1.

I want to do this only if the part of the line_id value in lines.txt between the two ":" symbols is the same as the part of the route_id value in trips.txt before the ":" symbol, AND if trip_headsign doesn't already contain letters (only purely numeric values are to be replaced; see the original full trips.txt file).
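
The replacement rule described above could be sketched roughly as follows. This is only an illustration written for Python 3; the helper names (line_key, route_key, new_headsign) are made up for the example, and rows are assumed to be dictionaries such as those produced by csv.DictReader.

```python
def line_key(line_id):
    """Return the part of a lines.txt line_id between the first two ':' symbols."""
    parts = line_id.split(":")
    return parts[1] if len(parts) >= 3 else None


def route_key(route_id):
    """Return the part of a trips.txt route_id before the first ':' symbol."""
    return route_id.split(":", 1)[0]


def new_headsign(trip, line):
    """Return the corrected trip_headsign, or the original when no rule applies."""
    headsign = trip["trip_headsign"]
    # Only purely numeric headsigns are replaced.
    if not headsign.isdigit():
        return headsign
    # The line must match the trip's route.
    if line_key(line["line_id"]) != route_key(trip["route_id"]):
        return headsign
    if trip["direction_id"] == "0":
        return line["forward_line_name"]
    if trip["direction_id"] == "1":
        return line["backward_line_name"]
    return headsign


# Values taken from the samples in this post.
trip = {"route_id": "210210109:001", "trip_headsign": "70405957", "direction_id": "0"}
line = {
    "line_id": "OIF:210210109:001OIF30",
    "forward_line_name": "Place Mérot - GARE DE LONGUEVILLE",
    "backward_line_name": "GARE DE LONGUEVILLE - Place Mérot",
}
print(new_headsign(trip, line))
```

Headsigns that already contain letters, and trips whose route does not match the line, are returned unchanged.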

Here is a sample of trips.txt:

route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,

and a sample of lines.txt:

line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00

Each file has hundreds of lines that I need to parse and edit this way. The separator in my CSV files is a comma.

The full files that I want to edit are each part of a .zip archive:

prod_FR-IDF-OPEN_NAViTiA_RTNTFS_V5.zip contains lines.txt (full file https://aurora.acolytesanonymes.org/index.php/s/G31WwmGXxN45xlY )

stif_gtfs.zip contains trips.txt (full file https://aurora.acolytesanonymes.org/index.php/s/oRmDOGPzpuKAP7f )

Once the trips.txt file is updated, I also want to update the contents of routes.txt in the GTFS archive using info from the .txt files available here (where route_id matches): https://github.com/rpdcorbion/open_data_transport/tree/master/references/fr-idf_stif, which I will eventually store in the ./fr-idf_stif folder.

Once these two files have been updated, I want to generate a complete updated stif_gtfs.zip file, including the remaining files from the original stif_gtfs.zip.

The deliverable is a script that properly updates the contents of trips.txt and routes.txt and regenerates a proper updated stif_gtfs.zip GTFS file when I give it both input .zip files as parameters (on the command line or graphically).
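
As a rough sketch of that last step, an archive can be rebuilt member by member with Python 3's zipfile module, swapping in new content only for the files that changed. The function name rewrite_archive and the replacements mapping are hypothetical, chosen for this example.

```python
import zipfile


def rewrite_archive(src_path, dst_path, replacements):
    """Copy the archive at src_path to dst_path.

    replacements maps member names (e.g. "trips.txt") to their new content
    (str or bytes); every other member is copied unchanged.
    """
    with zipfile.ZipFile(src_path, "r") as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = replacements.get(info.filename)
            if data is None:
                # Not one of the updated files: keep the original bytes.
                data = src.read(info.filename)
            dst.writestr(info.filename, data)
```

A call might look like rewrite_archive("stif_gtfs.zip", "updated_stif_gtfs.zip", {"trips.txt": new_trips_text}), where new_trips_text stands for the corrected CSV content. Writing to a separate output file avoids modifying the original archive in place.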

Thanks for your help :)

Unless I'm much mistaken, the provided lines.txt contains a great many collisions in the specified part of line_id. How should collisions be resolved?
philr almost 7 years ago
You're right. One should also check that line_code in lines.txt matches the end of route_id in trips.txt (after the ":"). That should prevent collisions, no?
kalon33 almost 7 years ago
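
The extra check suggested in this comment could look something like the sketch below (Python 3). It indexes lines.txt rows by the pair (middle part of line_id, line_code) and looks trips up by the two halves of route_id. The names build_index and find_line are invented for the example, and keeping the first row when a pair still collides is an assumption, not something the thread specifies.

```python
def build_index(lines):
    """Index lines.txt rows by (middle part of line_id, line_code)."""
    index = {}
    for line in lines:
        parts = line["line_id"].split(":")
        if len(parts) >= 3:
            # Assumption: if the same pair occurs twice, keep the first row.
            index.setdefault((parts[1], line["line_code"]), line)
    return index


def find_line(index, route_id):
    """Return the row matching both halves of route_id, or None."""
    prefix, _, suffix = route_id.partition(":")
    return index.get((prefix, suffix))


# Row built from the lines.txt sample in the question.
lines = [{
    "line_id": "OIF:210210109:001OIF30",
    "line_code": "001",
    "forward_line_name": "Place Mérot - GARE DE LONGUEVILLE",
}]
index = build_index(lines)
print(find_line(index, "210210109:001")["forward_line_name"])
```

The codes are compared as strings, so "001" and "1" would not be treated as equal.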
awarded to thelink2012
Tags
python
csv
gtfs


2 Solutions


Here, https://gist.github.com/thelink2012/485fb6af0e45c150d5ea

It receives three parameters:

  1. The stif_gtfs.zip file to fetch trips.txt and routes.txt from.
  2. The prod_FR-IDF-OPEN_NAViTiA_RTNTFS_V5.zip to fetch lines.txt from.
  3. The name/path of the output zip file.

Additionally, it will try to read the ./fr-idf_stif directory to update the routes. If not found, routes.txt won't be updated.

Please let me know of any issue.

As I commented above your solution, there are a great many collisions between lines; they should be prevented by also checking that line_code in lines.txt matches route_id in trips.txt (combined with the previously used update conditions) before updating. I just remembered it, sorry.
kalon33 almost 7 years ago
Fixed. Sorry for not mentioning it: while writing I saw such collisions, but since the description (which is very detailed) didn't mention them, I thought it was okay to ignore them.
thelink2012 almost 7 years ago
Hi, I just generated another file to further enhance my dataset. This file, desserteRER.txt (https://aurora.acolytesanonymes.org/index.php/s/dOPbHe7GiNPZZGQ), in the same directory as the script, should be used to replace the trips.txt trip_headsign values which were not numeric and are the same as the desserte_RER.txt codemission (this would cover only a part of the trips which were not already altered), using a concatenation of codemission (or the old trip_headsign), " - " and the desserteRER.txt destination, generating for example this: CIME - Versailles Chantiers in trips.txt trip_headsign. Would you extend your script to also cover this, for a $10 tip? Thanks :)
kalon33 almost 7 years ago
Sure. Updated the script. Note that it reads desserteRER.txt (as you described), whereas the file I downloaded from the link is named dessertes_rer.txt :)
thelink2012 almost 7 years ago
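
The RER extension discussed in the comments above might be sketched like this (Python 3). It assumes the mission codes and destinations have already been read from the extra file into a dictionary; the function name rer_headsign and the destinations mapping are hypothetical, and the real column names in that file are not confirmed here.

```python
def rer_headsign(headsign, destinations):
    """Append the destination to a non-numeric headsign that is a known mission code.

    destinations is assumed to map a mission code to its destination name.
    """
    # Numeric headsigns are handled by the lines.txt rule, not this one.
    if headsign.isdigit():
        return headsign
    destination = destinations.get(headsign)
    if not destination:
        # Unknown mission code: leave the headsign as it is.
        return headsign
    return headsign + " - " + destination


# The pairing below comes from the example given in the comment.
destinations = {"CIME": "Versailles Chantiers"}
print(rer_headsign("CIME", destinations))
```

Headsigns that are numeric or not present in the mapping come back unchanged, so the rule only touches trips that the first pass left alone.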

My attempt:

#!/usr/bin/env python2

import csv
import os
import shutil
import sys
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED

def main():
    # Read line info
    lines = {}
    with ZipFile(sys.argv[1], 'r') as idf_zip:
        with idf_zip.open('lines.txt', 'r') as lines_file:
            linesreader = csv.DictReader(lines_file)
            for line in linesreader:
                # This line_id parsing may be fragile.  Works for sample data.
                line_id = line['line_id'].split(':', 1)[1].split('OIF')[0]
                lines[line_id] = line['forward_line_name'], line['backward_line_name']

    # Read route info, if available
    routes = {}
    if os.path.exists('fr-idf_stif'):
        for filename in os.listdir('fr-idf_stif'):
            with open(os.path.join('fr-idf_stif', filename), 'r') as routes_file:
                routereader = csv.reader(routes_file)
                next(routereader)
                for row in routereader:
                    if row[0]:
                        # Leave off the last element: the 'wheelchair' column is not in the GTFS data.
                        routes[row[0]] = row[:-1]

    temp_dir = tempfile.mkdtemp(dir='.')
    try:
        # Extract GTFS archive to a temp directory
        with ZipFile(sys.argv[2], 'r') as gtfs_zip:
            gtfs_zip.extractall(temp_dir)

        # Read and fixup trip/route data
        with open(os.path.join(temp_dir, 'trips.txt'), 'r') as trips_file:
            tripsreader = csv.reader(trips_file)
            fixed_trips = [next(tripsreader)]
            for row in tripsreader:
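                # Only replace purely numeric headsigns, and only for a valid direction_id
                if row[0] in lines and row[3].isdigit() and row[4] in ('0', '1'):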
                    row[3] = lines[row[0]][int(row[4])]
                fixed_trips.append(row)
        with open(os.path.join(temp_dir, 'routes.txt'), 'r') as route_file:
            routereader = csv.reader(route_file)
            fixed_routes = [next(routereader)]
            for row in routereader:
                if row[0] in routes:
                    fixed_routes.append(routes[row[0]])
                else:
                    fixed_routes.append(row)

        # Write fixed trip/route data
        # Binary mode keeps the Python 2 csv module from writing extra blank lines on Windows
        with open(os.path.join(temp_dir, 'trips.txt'), 'wb') as fixed_trips_file:
            csv.writer(fixed_trips_file).writerows(fixed_trips)
        with open(os.path.join(temp_dir, 'routes.txt'), 'wb') as fixed_routes_file:
            csv.writer(fixed_routes_file).writerows(fixed_routes)

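        # Create updated GTFS archive next to the input, even when a directory path is given
        out_dir, out_name = os.path.split(sys.argv[2])
        with ZipFile(os.path.join(out_dir, 'updated_' + out_name), 'w', ZIP_DEFLATED) as updated_gtfs_zip: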
            for filename in os.listdir(temp_dir):
                updated_gtfs_zip.write(os.path.join(temp_dir, filename), filename)
    finally:
        shutil.rmtree(temp_dir)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print 'Usage:', sys.argv[0], '<IDF archive> <GTFS archive>'
        sys.exit(1)
    if not os.path.isfile(sys.argv[1]):
        print 'Could not find IDF archive', sys.argv[1]
        sys.exit(2)
    if not os.path.isfile(sys.argv[2]):
        print 'Could not find GTFS archive', sys.argv[2]
        sys.exit(3)
    main()