TripIt API aimed at CyteBode

Hi All. This is aimed at CyteBode, so no need for others to join in.

Task is to complete the data parsing algorithm for the JSON output from the tripit-to-flightdiary repository so that a CSV file can be pulled out.

Hi. Apologies, I am a real idiot with this. But I need a clear view of what the command is, exactly how to put the switches, and what they do, e.g. python output-activities.py and what the arguments are (note the need for a space after the argument/colon??): -t true/false/all (true includes traveller, false only where not traveller, all includes all), -o followed by the filename, -a followed by air|road|rail... You get what I mean. Something really simple if possible? As a simpleton I battle with most READMEs, but I suspect there are a lot of people who, like me, would love this for their tax data and are stuffed since Skyhops.com (which used to do it all) went under.
sebmack 2 months ago
awarded to CyteBode


1 Solution

Winning solution

Here's the complete solution now. activities_to_csv.py is the main file, which I just added and is what you need to run (usage instructions at the bottom). I also had to modify the other two files, so you will have to update them. They should all be placed in the same folder as the creds.json file.

The solution has only been tested loosely, as I don't have much data on my TripIt account (only a few bogus trips for testing). However, it successfully extracts the data from the few trips that I have.

Requirements
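
Judging from the imports, the third-party packages needed are lxml, requests, requests_oauthlib and unicodecsv. They should be installable with pip:

pip install lxml requests requests_oauthlib unicodecsv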

activities_to_csv.py

import glob
import json
import logging
import os
import sys

import lxml.etree as etree
import requests
from requests_oauthlib import OAuth1
import unicodecsv as csv

from extraction import EXTRACTORS


PERIOD_TO_PASTS = {
    "past": ("true", ),
    "future": ("false", ),
    "both": ("false", "true")
}

PAST_TO_PERIOD = {
    "true": "past",
    "false": "future"
}

TRAVELERS = {
    "true": ("true",),
    "false": ("false",),
    "all": ("true", "false",)
}

TRAVELER = {
    "true": "t",
    "false": "nt"
}

TRIP_ID_XPATH = "//Trip/id/text()"
MAX_PAGE_XPATH = "//max_page/text()"


def get_auth():
    with open('creds.json') as f:
        creds = json.load(f)

    auth = OAuth1(creds['CLIENT_KEY'],  creds['CLIENT_SECRET'],
                  creds['OAUTH_TOKEN'], creds['OAUTH_TOKEN_SECRET'])

    return auth


def get_trips_from_api(traveler = "true", period = "both", from_page = 1):
    logger = logging.getLogger()

    logger.debug("traveler: %s, period = %s, from_page = %d" % (traveler, period, from_page))

    for past in PERIOD_TO_PASTS[period]:
        logger.info("Fetching %s trips." % PAST_TO_PERIOD[past])
        page_num = max(from_page, 1)
        while True:
            url = "".join(["https://api.tripit.com/v1/",
                "list/trip/",
                "traveler/%s/" % traveler,
                "past/%s/" % past,
                "format/xml/",
                "page_num/%d/" % page_num,
                "page_size/1/",
                "include_objects/true"
            ])

            response = requests.get(url, auth = get_auth())
            tree = etree.fromstring(response.text)

            try:
                max_page = int(tree.xpath(MAX_PAGE_XPATH)[0])
                trip_id = tree.xpath(TRIP_ID_XPATH)[0]
                logger.info("Fetched trip %s." % trip_id)
                yield tree

                if page_num >= max_page:
                    break
            except IndexError:
                break

            page_num += 1


TRIP_CACHE_DIR = "trip_cache"
TRIP_CACHE_FMT = "%s/trip_%%s_%%s_%%06d.xml" % TRIP_CACHE_DIR # traveler, period, page

def init_cache_dir():
    logger = logging.getLogger()
    if not os.path.exists(TRIP_CACHE_DIR):
        logger.info("Created trip cache directory.")
        os.mkdir(TRIP_CACHE_DIR)
    else:
        if not os.path.isdir(TRIP_CACHE_DIR):
            raise RuntimeError("Cannot create directory %s "
                "as there is a file with the same name." % TRIP_CACHE_DIR)


def refresh_cache_trip():
    logger = logging.getLogger()

    init_cache_dir()

    for traveler in TRAVELER.keys():
        for period in ("future", "past"):
            try:
                first_trip = next(get_trips_from_api(traveler, period, 1))
            except StopIteration:
                # No trips at all for this traveler/period combination;
                # try the next one instead of bailing out of the loop.
                continue

            max_page = int(first_trip.xpath(MAX_PAGE_XPATH)[0])

            glob_expr = ((TRIP_CACHE_FMT % (TRAVELER[traveler], period, 0))
                         .replace("000000", "*"))

            if len(glob.glob(glob_expr)) == 0:
                start = 1
            else:
                trip_files = sorted(glob.glob(glob_expr))
                # os.path.splitext removes the extension as a suffix;
                # rstrip(".xml") would strip characters, not a suffix.
                latest = os.path.splitext(os.path.basename(trip_files[-1]))[0]
                start = int(latest.split("_")[-1]) + 1

            if start > max_page:
                continue

            page = start
            for trip in get_trips_from_api(traveler, period, start):
                fname = TRIP_CACHE_FMT % (TRAVELER[traveler], period, page)

                logger.info("Downloading %s..." % (fname))

                with open(fname, "wb+") as f:
                    f.write(etree.tostring(trip, pretty_print=True))

                page += 1


def get_trips_from_cache(traveler = "true", period = "both"):
    logger = logging.getLogger()

    for traveler_ in TRAVELERS[traveler]:
        for past in PERIOD_TO_PASTS[period]:
            period_name = PAST_TO_PERIOD[past]
            traveler_name = TRAVELER[traveler_]
            page = 1
            while True:
                fname = TRIP_CACHE_FMT % (traveler_name, period_name, page)
                if not os.path.exists(fname):
                    logger.debug("Does not exist: %s" % fname)
                    break

                yield etree.parse(fname)
                page += 1


def activities_to_csv(type_, output = None, traveler = "true", period = "both",
                      skip_cost = False, offline_only = False):
    logger = logging.getLogger()

    extractor = EXTRACTORS[type_]
    if skip_cost and type_ != "trip":
        logger.info("Skipping cost")
        # Search for the total_cost selection in the <Type>Object extractor
        # and set it to be skipped.
        cost_selection = extractor.search(["%sObject" % type_.title(), "total_cost"])
        if cost_selection:
            cost_selection.skip = True

    if output is None:
        logger.info("Outputting to stdout.")
        f = os.fdopen(sys.stdout.fileno(), 'wb')
    else:
        logger.info("Outputting to file.")
        f = open(output, "wb+")

    if not offline_only:
        logger.info("Refreshing trip cache")
        refresh_cache_trip()
    else:
        logger.info("Using trip cache as it is")

    logger.info("Writing CSV.")
    writer = csv.writer(f, lineterminator = '\n', encoding = 'utf-8')

    schema = extractor.schema
    writer.writerow(schema)

    logger.info("Extracting data from trips.")
    for trip in get_trips_from_cache(traveler, period):
        trip_id = trip.xpath(TRIP_ID_XPATH)[0]
        logger.info("Extracting data from trip %s." % trip_id)
        for row in extractor.extract(trip):
            writer.writerow(row)


def main():
    import argparse

    VALID_TRAVELER = ("true", "false", "all")
    VALID_PERIOD = PERIOD_TO_PASTS.keys()

    parser = argparse.ArgumentParser(description =
        "Fetch all the activities of a certain type from TripIt "
        "and output to a CSV following defined schemas. "
        "Pages are cached to avoid excessive usage of the API.")
    parser.add_argument("type", choices = EXTRACTORS.keys(),
        help = "The type of activity to fetch.")

    parser.add_argument("--skip_cost", action="store_true",
        help = "Do not include the activity cost.")

    parser.add_argument("--period", default="both", choices = VALID_PERIOD,
        help = "Whether to fetch only past or future trips, or both.")
    parser.add_argument("--traveler", default="true",
                        choices = VALID_TRAVELER,
        help = "Whether to fetch only trips where the user is a traveler, "
               "not a traveler, or both.")

    parser.add_argument("--delete_cached_pages", action="store_true",
        help = "Delete the cached pages before starting.")
    parser.add_argument("--offline_only", action="store_true",
        help = "Use only the cached pages, without trying to connect.")

    parser.add_argument("--output", "-o", default=None, help =
        "The name of the file to output to. "
        "If not specified, will output to stdout.")

    args = parser.parse_args()

    verbose = (args.output is not None)

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if verbose:
        logger.addHandler(logging.StreamHandler(sys.stdout))
    else:
        logger.addHandler(logging.NullHandler())

    if args.delete_cached_pages:
        if args.offline_only:
            raise RuntimeError("Not going to delete cached pages in offline mode.")
        logger.info("Deleting cached pages")

        init_cache_dir()

        for file in glob.glob("%s/trip*.xml" % TRIP_CACHE_DIR):
            os.remove(file)

    activities_to_csv(
        args.type, args.output,
        args.traveler, args.period,
        args.skip_cost,
        args.offline_only
    )


if __name__ == '__main__':
    main()

extraction.py

import itertools


def underscore_to_camelcase(us):
    return "".join([v.title() for v in us.split("_")])


class Extractor(object):
    """ Used to gather bits of data from multiple selectors or even
        recursively deeper into the XML tree from other extractor, and outputs
        them to one or more rows."""
    def __init__(self, tag = "", children = [], root = None, meta = {}):
        self._tag = tag
        if self.tag:
            self._xpath = "./%s" % tag
        else:
            self._xpath = "."

        if root is not None:
            self._children = list(root._children) + children
        else:
            self._children = list(children)

        self._meta = dict(meta)
        self._skip = False

    @property
    def schema(self):
        if self._skip:
            return []

        schema = []

        for c in self._children:
            if isinstance(c, Selection):
                if not c.skip:
                    schema.append(c.name)
            elif isinstance(c, Extractor):
                schema += c.schema

        return schema

    def extract(self, tree):
        if self._skip:
            return []

        # Recursion must still happen even if the subtree isn't found.
        # That way, default values are gathered.
        if tree is None:
            sub_tree = [None]
        else:
            sub_tree = tree.xpath(self._xpath)
            if len(sub_tree) == 0:
                sub_tree = [None]

        rows = []
        for element in sub_tree:
            row_segs = [] # Gathers row segments
            sub_row = [] # Gathers selections' data:
            for c in self._children:
                if isinstance(c, Selection):
                    c_data = c.select(element, self)
                    if c_data is not None:
                        sub_row.append(c_data)
                elif isinstance(c, Extractor):
                    if sub_row:
                        row_segs.append([sub_row])
                        sub_row = []
                    c_rows = c.extract(element)
                    if c_rows:
                        row_segs.append(c_rows)
            if sub_row:
                row_segs.append([sub_row])
                sub_row = []

            for r in itertools.product(*row_segs):
                row = []
                for sr in r:
                    row += list(sr)
                rows.append(row)

        return rows

    def search(self, tags = []):
        if len(tags) == 0:
            return None

        if self.tag == "":
            tags = [""] + tags

        tag, otags = tags[0], tags[1:]
        if tag != self.tag:
            return None

        if len(otags) == 0:
            # tag == self.tag is guaranteed at this point.
            return self

        for c in self._children:
            result = c.search(otags)
            if result is not None:
                return result
        return None

    def get_meta(self, metatag, default=""):
        return self._meta.get(metatag, default)

    def set_meta(self, metatag, value):
        self._meta[metatag] = value

    @property
    def skip(self):
        return self._skip

    @skip.setter
    def skip(self, v):
        self._skip = bool(v)

    @property
    def tag(self):
        return self._tag

    @property
    def children(self):
        return list(self._children)


class Selection(object):
    """ Used to select a value from an XML subtree. """
    def __init__(self, tag, name = "", default = ""):
        self._tag = tag
        if tag:
            self._xpath = "./%s" % tag
        else:
            self._xpath = "."
        if name == "":
            self._name = underscore_to_camelcase(tag)
        else:
            self._name = name
        self._default = ""
        self._skip = False

    def select(self, tree, extractor = None):
        if self._skip:
            return None
        if tree is not None:
            found = tree.xpath(self._xpath + "/text()")
            if len(found) == 0:
                return self._default
            else:
                return found[0]
        return self._default

    def search(self, tags):
        if len(tags) != 1:
            return None
        if tags[0] == self.tag:
            return self
        return None

    @property
    def name(self):
        return self._name

    @property
    def skip(self):
        return self._skip

    @skip.setter
    def skip(self, v):
        self._skip = bool(v)

    @property
    def tag(self):
        return self._tag


class MetaSelection(Selection):
    """ Used to get a metatag from the parent extractor. """
    def __init__(self, metatag, name = "", default = ""):
        self._metatag = metatag
        if name == "":
            self._name = underscore_to_camelcase(metatag)
        else:
            self._name = name
        self._default = ""
        self._skip = False

    def select(self, tree, extractor):
        if self._skip:
            return None
        return extractor.get_meta(self._metatag, self._default)

    def search(self, tags):
        if len(tags) != 1:
            return None
        if tags[0] == self._metatag:
            return self
        return None

    @property
    def name(self):
        return self._name

    @property
    def tag(self):
        return ""

    @property
    def metatag(self):
        return self._metatag


class JoinedSelection(Selection):
    """ Used to join the value of multiple selections together to create a
        single value. """
    def __init__(self, selections, name, joint = " "):
        self._selections = list(selections)
        self._name = name
        self._joint = joint
        self._skip = False

    def select(self, tree, extractor = None):
        if self._skip:
            return None
        selected = [s.select(tree, extractor) for s in self._selections]
        return self._joint.join(filter(lambda s: s != "", selected))

    def search(self, tags):
        if len(tags) != 1:
            return None
        for selection in self._selections:
            result = selection.search(tags)
            if result is not None:
                return self
        return None

    @property
    def name(self):
        return self._name

    @property
    def tag(self):
        return ""


def gen_traveler_extractor(tag = "Traveler", name = ""):
    if name == "":
        name = underscore_to_camelcase(tag)
    return Extractor(
        tag,
        [
            JoinedSelection([Selection("first_name"),
                             Selection("middle_name"),
                             Selection("last_name")], name),
            Selection("ticket_num", "TicketNumber")
        ]
    )


def gen_address_extractor(tag = "Address" , name = ""):
    if name == "":
        name = underscore_to_camelcase(tag)
    return Extractor(
        tag,
        [
            JoinedSelection([Selection("address"),
                             Selection("addr1"),
                             Selection("addr2"),
                             Selection("city"),
                             Selection("state"),
                             Selection("zip"),
                             Selection("country")], name)
        ]
    )


def gen_datetime_extractor(tag = "DateTime", name_date = "", name_time = ""):
    return Extractor(
        tag, [Selection("date", name_date),
              Selection("time", name_time)]
    )


traveler_extractor = gen_traveler_extractor(
    "Traveler", "Traveller") # UK spelling


start_datetime_extractor = gen_datetime_extractor(
    "StartDateTime", "StartDate", "StartTime")


end_datetime_extractor = gen_datetime_extractor(
    "EndDateTime", "EndDate", "EndTime")


air_segment_extractor = Extractor(
    "Segment",
    [
        Selection("marketing_airline_code", "AirlineCode"),
        Selection("aircraft_display_name", "Aircraft"),
        Selection("service_class"),
        Selection("marketing_flight_number", "FlightNumber"),

        Selection("start_country_code", "StartCountry"),
        Selection("start_city_name", "StartCityName"),
        Selection("start_airport_code", "StartAirport"),
        Selection("start_terminal"),
        Selection("start_airport_latitude", "StartLat"),
        Selection("start_airport_longitude", "StartLong"),
        start_datetime_extractor,

        Selection("end_country_code", "EndCountry"),
        Selection("end_city_name", "EndCityName"),
        Selection("end_airport_code", "EndAirport"),
        Selection("end_terminal"),
        Selection("end_airport_latitude", "EndLat"),
        Selection("end_airport_longitude", "EndLong"),
        end_datetime_extractor,

        Selection("stops"),
        Selection("distance")
    ]
)


rail_segment_extractor = Extractor(
    "Segment",
    [
        Selection("carrier_name"),
        Selection("coach_number"),
        Selection("train_type"),
        Selection("train_number"),
        Selection("service_class"),
        Selection("confirmation_num", "ConfirmationNumber"),

        Selection("start_station_name"),
        gen_address_extractor("StartStationAddress"),
        start_datetime_extractor,

        Selection("end_station_name"),
        gen_address_extractor("EndStationAddress"),
        end_datetime_extractor
    ]
)


transport_segment_extractor = Extractor(
    "Segment",
    [
        Selection("carrier_name"),
        Selection("confirmation_num", "ConfirmationNumber"),
        Selection("vehicle_description"),
        Selection("number_passengers"),

        Selection("start_location_name"),
        gen_address_extractor("StartLocationAddress"),
        start_datetime_extractor,

        Selection("end_location_name"),
        gen_address_extractor("EndLocationAddress"),
        end_datetime_extractor
    ]
)


cruise_segment_extractor = Extractor(
    "Segment",
    [
        Selection("location_name"),
        gen_address_extractor("LocationAddress"),
        start_datetime_extractor,
        end_datetime_extractor
    ]
)


trip_extractor = Extractor(
    "Trip",
    [
        Selection("display_name", "TripName"),
        Selection("id", "TripID")
    ]
)


reservation_object_extractor = Extractor(
    # Inherited by all the reservation objects
    None,
    [
        MetaSelection("type"),
        Selection("id", "ActivityID"),
        Selection("total_cost", "ActivityCost"),
        # TravellerYN ???
        Selection("relative_url", "URL"),
        Selection("booking_site_url", "BookingSite"),
        Selection("supplier_conf_num", "SupplierConfirmation"),
        Selection("booking_date"),
        Selection("booking_site_phone")
    ]
)


air_object_extractor = Extractor(
    "AirObject",
    [
        traveler_extractor,
        air_segment_extractor,
    ],
    reservation_object_extractor,
    meta = {"type": "Air"}
)


lodging_object_extractor = Extractor(
    "LodgingObject",
    [
        gen_traveler_extractor("Guest"),
        Selection("number_guests"),
        Selection("number_rooms"),
        Selection("room_type"),
        gen_address_extractor("Address"),
        start_datetime_extractor,
        end_datetime_extractor
    ],
    reservation_object_extractor,
    meta = {"type": "Lodging"}
)


car_object_extractor = Extractor(
    "CarObject",
    [
        gen_traveler_extractor("Driver"),
        Selection("car_type"),
        Selection("mileage_charges"),

        Selection("start_location_name"),
        gen_address_extractor("StartLocationAddress"),
        start_datetime_extractor,

        Selection("end_location_name"),
        gen_address_extractor("EndLocationAddress"),
        end_datetime_extractor
    ],
    reservation_object_extractor,
    meta = {"type": "Car"}
)


rail_object_extractor = Extractor(
    "RailObject",
    [
        traveler_extractor,
        rail_segment_extractor
    ],
    reservation_object_extractor,
    meta = {"type": "Rail"}
)


transport_object_extractor = Extractor(
    "TransportObject",
    [
        traveler_extractor,
        transport_segment_extractor
    ],
    reservation_object_extractor,
    meta = {"type": "Transport"}
)


cruise_object_extractor = Extractor(
    "CruiseObject",
    [
        traveler_extractor,
        cruise_segment_extractor,
        Selection("ship_name"),
        Selection("cabin_number"),
    ],
    reservation_object_extractor,
    meta = {"type": "Cruise"}
)


activity_object_extractor = Extractor(
    "ActivityObject",
    [
        gen_traveler_extractor("Participant"),
        start_datetime_extractor,
        end_datetime_extractor,
        gen_address_extractor("Address"),
        Selection("location_name")
    ],
    reservation_object_extractor,
    meta = {"type": "Activity"}
)


EXTRACTORS = {
    "trip": Extractor(
        children = [
            trip_extractor
        ]
    ),
    "air": Extractor(
        children = [
            trip_extractor,
            air_object_extractor
        ]
    ),
    "lodging": Extractor(
        children = [
            trip_extractor,
            lodging_object_extractor
        ]
    ),
    "car": Extractor(
        children = [
            trip_extractor,
            car_object_extractor
        ]
    ),
    "rail": Extractor(
        children = [
            trip_extractor,
            rail_object_extractor
        ]
    ),
    "transport": Extractor(
        children = [
            trip_extractor,
            transport_object_extractor
        ]
    ),
    "cruise": Extractor(
        children = [
            trip_extractor,
            cruise_object_extractor
        ]
    ),
    "activity": Extractor(
        children = [
            trip_extractor,
            activity_object_extractor
        ]
    )
}

Extraction schema explained

I only defined extraction schemas for the reservation-type activities (air, lodging, car, rail, transport, cruise, activity). Writing one is a fairly work-intensive process, so I didn't want to write all of them. Since I don't know exactly what you need extracted either, let me explain how the system works so that you can extend or modify the schemas on your own.

The extraction schema basically acts as a domain-specific language to define what to extract, where to get it, how to name it and how to order it.

The schema takes the form of a tree, where each branch (the so-called children) descends further down the hierarchy of the data.

Let's walk through the air extractor as an example.

EXTRACTORS = {
    ...,
    "air": Extractor(
        children = [
            trip_extractor,
            air_object_extractor
        ]
    ),
    ...
}

This is the root extractor, which operates at the root level (the <Response> element). In order to appear as a valid type to the script, every root extractor needs to be defined in the EXTRACTORS dict in the form "type": Extractor(...).

There's no data to extract at the root level, but one level down, there is a <Trip> element and one or more <AirObject> elements, out of which data needs to be extracted. The two extractors that are the children of the root extractor will fulfill this purpose.

Since we're already at the root, there's no need to provide a tag as the first argument of the root extractor. Down from here, every extractor will have one to indicate what element to descend to.

trip_extractor = Extractor(
    "Trip",
    [
        Selection("display_name", "TripName"),
        Selection("id", "TripID")
    ]
)

The trip extractor descends into the <Trip> element, which only has two fields to be extracted. This is defined through the two Selection objects. The first argument is the name of the tag containing the data to extract and the second is how to name it in the CSV header. This second argument is optional; by default, the new name will be the same as the tag, but converted from under_score to CamelCase.

There is also a third optional argument to define what default value to use in case the element isn't found, which happens when the data from the API is empty. By default, it's the empty string.
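
For instance (the "0.00" default is a made-up value; total_cost and booking_site_phone are tags used in the schemas below):

Selection("total_cost", "ActivityCost", "0.00")  # tag, CSV name, default
Selection("booking_site_phone")  # CSV column becomes "BookingSitePhone"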

As all the root extractors need to extract trip data, they all have a trip extractor as their first child.

air_object_extractor = Extractor(
    "AirObject",
    [
        traveler_extractor,
        air_segment_extractor,
    ],
    reservation_object_extractor,
    meta = {"type": "Air"}
)

Next to the trip extractor, the air object extractor descends into the <AirObject> element. In it, we need to extract data from the Traveler object(s) and the AirSegment object(s).

def gen_traveler_extractor(tag = "Traveler", name = ""):
    if name == "":
        name = underscore_to_camelcase(tag)
    return Extractor(
        tag,
        [
            JoinedSelection([Selection("first_name"),
                             Selection("middle_name"),
                             Selection("last_name")], name),
            Selection("ticket_num", "ticketNumber")
        ]
    )

traveler_extractor = gen_traveler_extractor(
    "Traveler", "Traveller") # UK spelling

The first child extractor actually comes from a function call. The Traveler type often appears under different names, such as Guest or Driver. To avoid defining separate extractors for each of these variations, gen_traveler_extractor is a function that generates the same traveler extractor under a different name. The same is done for the Address and DateTime types. Since Traveler appears three times as the name for that type, traveler_extractor is already defined for convenience.

The JoinedSelection in the traveler extractor takes the values of multiple selections and joins them together to create a single new value with a given name. By default, it joins with a space character, but a third argument can be provided to define what string to join with (e.g. "-" for a dash), as in the sketch below.
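
For example, this would join a city and a country with a dash, under a made-up CityCountry column name (city and country are real tags from the address extractor):

JoinedSelection([Selection("city"),
                 Selection("country")], "CityCountry", "-")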

Back to the air object extractor: after its list of children, there are two peculiar arguments. The following paragraph explains their purpose, but it will not make much sense until the reservation object extractor is explained, two paragraphs further down.

The first is another extractor which the air object extractor inherits from. This causes the air object extractor to have the same children as the reservation object extractor in front of its own children. The second is a dict containing a ("type", "Air") key-value pair, which provides data that a MetaSelection child can access.

The objects air, lodging, car, rail, transport, cruise and activity are all reservation objects. As they all inherit from the Reservation type, they share many data entries that need to be extracted in the same way. To avoid duplicating the logic, the air object extractor inherits from the reservation object extractor. As explained above, the inheritance causes the air extractor to have the same children as the reservation object extractor in front of its own. A minimal sketch of that mechanic follows.
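
Here, FooObject and bar are made-up names just to show the parent's columns coming out first:

base = Extractor("", [Selection("id")])
child = Extractor("FooObject", [Selection("bar")], root = base)
print(child.schema)  # ['Id', 'Bar'] -- inherited column first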

reservation_object_extractor = Extractor(
    # Inherited by all the reservation objects
    None,
    [
        MetaSelection("type"),
        Selection("id", "ActivityID"),
        Selection("total_cost", "ActivityCost"),
        # TravellerYN ???
        Selection("relative_url", "URL"),
        Selection("booking_site_url", "BookingSite"),
        Selection("supplier_conf_num", "SupplierConfirmation"),
        Selection("booking_date"),
        Selection("booking_site_phone")
    ]
)

The first Selection is a MetaSelection, which extracts the "type" metatag from the parent extractor. Due to inheritance, this would be the air object extractor in our example. It is done this way because the API objects don't hold their own type as a value in the data itself.
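
As a tiny self-contained illustration (not part of the solution): the meta dict feeds the MetaSelection, so "Air" ends up in the Type column even though it appears nowhere in the XML.

demo = Extractor("AirObject", [MetaSelection("type")], meta = {"type": "Air"})
print(demo.schema)         # ['Type']
print(demo.extract(None))  # [['Air']] -- even an absent subtree yields the metatag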

air_segment_extractor = Extractor(
    "Segment",
    [
        Selection("marketing_airline_code", "AirlineCode"),
        Selection("aircraft_display_name", "Aircraft"),
        Selection("service_class"),
        Selection("marketing_flight_number", "FlightNumber"),

        Selection("start_country_code", "StartCountry"),
        Selection("start_city_name", "StartCityName"),
        Selection("start_airport_code", "StartAirport"),
        Selection("start_terminal"),
        Selection("start_airport_latitude", "StartLat"),
        Selection("start_airport_longitude", "StartLong"),
        start_datetime_extractor,

        Selection("end_country_code", "EndCountry"),
        Selection("end_city_name", "EndCityName"),
        Selection("end_airport_code", "EndAirport"),
        Selection("end_terminal"),
        Selection("end_airport_latitude", "EndLat"),
        Selection("end_airport_longitude", "EndLong"),
        end_datetime_extractor,

        Selection("stops"),
        Selection("distance")
    ]
)

Finally, back to the air object extractor: next to the traveler extractor child, there is the air segment extractor. The only peculiarity here is the pair of datetime extractors.

In the API, DateTime is its own complex type containing a date and a time, so it needs to be extracted with an extractor. As these elements often appear under different names, they are generated with gen_datetime_extractor, but StartDateTime and EndDateTime appear so often that their extractors are already defined for convenience.
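
For example, a DateTime appearing under some other tag would be handled like this (the tag and column names below are made up; check the XSD for real ones):

arrival_datetime_extractor = gen_datetime_extractor(
    "ArrivalDateTime", "ArrivalDate", "ArrivalTime")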

As for the order, the extractors go through a depth-first traversal. As such, this:

<Root>
  <Trip>
    [TripName, TripID] (from trip_extractor)
  <AirObject(s)>
    [Type, ..., BookingSitePhone] (from reservation_object_extractor)
    <Traveler(s)>
      [Traveller, TicketNumber] (from traveler_extractor)
    <AirSegment(s)>
      [AirlineCode, ..., Distance] (from air_segment_extractor)

Gets flattened into this:

[TripName, TripID] + [Type, ..., BookingSitePhone] + [Traveller, TicketNumber] + [AirlineCode, ..., Distance]

The air extractor was the most complex extraction schema, so the explanation should have covered everything. The extractors for the other segmented types are similar, but with fewer values. Making new ones for the types I haven't defined yet should be easier; notably, there is no longer any need for inheritance, as only the reservation types needed it.
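
As a starting point, registering a brand-new type could look like the sketch below. The NoteObject tag and its field names are guesses on my part; verify them against the XSD with schema_validator.py (described next) before relying on them:

note_object_extractor = Extractor(
    "NoteObject",  # guessed tag -- check the XSD
    [
        Selection("id", "ActivityID"),
        Selection("display_name", "NoteName")  # guessed field name
    ]
)

EXTRACTORS = {
    ...,
    "note": Extractor(
        children = [
            trip_extractor,
            note_object_extractor
        ]
    ),
    ...
}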

schema_validator.py

import os.path

from lxml import etree
import requests

from extraction import Extractor, Selection


XSD = "tripit-api-obj-v1.xsd"
XSDS_URL = "https://api.tripit.com/xsd/%s"


def validate(extractor, element_names = [], type_names = []):
    if extractor.tag not in type_names and extractor.tag != "":
        yield "Extractor tag \"%s\" not found!" % extractor.tag

    for child in extractor.children:
        if isinstance(child, Extractor):
            for error in validate(child, element_names, type_names):
                yield error
        elif isinstance(child, Selection):
            if child.tag not in element_names and child.tag != "":
                yield "Selection tag \"%s\" not found!" % child.tag


def main():
    NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

    # Download the XSD file if not found
    if not os.path.exists(XSD):
        with open(XSD, "w+") as xsd_f:
            response = requests.get(XSDS_URL % XSD)
            xsd_f.write(response.text)

    element_names = set()
    type_names = set()

    with open(XSD) as xsd:
        tree = etree.parse(xsd)

        for complex_type in tree.xpath('//xs:complexType', namespaces = NS): 
            try:
                type_names.add(complex_type.attrib["name"])
            except KeyError:
                continue

        for element in tree.xpath('//xs:element', namespaces = NS):
            name = element.attrib["name"]
            if element.attrib.get("type") in type_names:
                type_names.add(name)
            element_names.add(name)


    from extraction import EXTRACTORS

    n_errors = 0
    for name, extractor in EXTRACTORS.items():
        print("Validating %s extractor..." % name)
        schema = extractor.schema
        print("Schema: %s (%d columns)" % ("|".join(schema), len(schema)))
        n = 0
        for error in validate(extractor, element_names, type_names):
            print("  %s" % error)
            n += 1
            n_errors += 1
        if n == 0:
            print("  Seems OK")
        print("")

    if n_errors == 0:
        print("All correct!")
    else:
        print("Found %d errors." % n_errors)


if __name__ == '__main__':
    main()

This script is a simple utility to help validate the defined extraction schemas. Run it with no arguments and it will go through the extractors defined in the EXTRACTORS dict of extraction.py, validating each one.

All the script does is extract the names from the XSD schema the API uses, then walk the root extractors to find any name it does not recognize. It also prints out the resulting CSV schema of each extractor. This should weed out any typos, but it will not validate the structure.

If the structure isn't sound, the extractor will fail silently by constantly outputting default values.

Workflow for modifying/extending the extraction schemas

In order to modify an existing extractor or write a new one, you will need to know about the structure and the names of all the values. There are two ways to find them.

Once the trip cache is populated (see below) and once schema_validator.py has run at least once, you will have access to both the XML file for a given trip and the tripit-api-obj-v1.xsd XML schema file.

A trip's XML file can be opened in a browser to see the structure of its data. Most browsers allow folding/unfolding the nesting elements. This will allow you to see both the structure and the names of the values, but anything empty will be absent.

To see absolutely everything, tripit-api-obj-v1.xsd contains the whole specification of the data structure and the name of all the possible values. It too can be opened in a browser which should provide the ability to fold/unfold nesting elements.

The naming convention is such that elements named like_this are simple values that are to be extracted with a Selection object, and elements named LikeThis are complex types that need to be dealt with recursively using an Extractor object.
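
In other words, when skimming the XSD (both lines below use names that already exist in the schemas):

Selection("booking_date")   # like_this: a simple value, selected directly
Extractor("StartDateTime",  # LikeThis: a complex type, descended into
          [Selection("date"), Selection("time")])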

Once you're done modifying the schemas, you will want to verify that everything is named correctly without typos. To do so, run python schema_validator.py.

Command-line usage instructions

The main script is activities_to_csv.py, so this is what needs to be run.

Thanks to argparse, running the script with the -h or --help switch (i.e. python activities_to_csv.py -h) will print out a comprehensive usage message:

usage: activities_to_csv.py [-h] [--skip_cost] [--period {past,future,both}]
                            [--traveler {true,false,all}]
                            [--delete_cached_pages] [--offline_only]
                            [--output OUTPUT]
                            {trip,air,lodging,car,rail,transport,cruise,activity}

Fetch all the activities of a certain type from TripIt and output to a CSV
following defined schemas. Pages are cached to avoid excessive usage of the
API.

positional arguments:
  {trip,air,lodging,car,rail,transport,cruise,activity}
                        The type of activity to fetch.

optional arguments:
  -h, --help            show this help message and exit
  --skip_cost           Do not include the activity cost. (default: False)
  --period {past,future,both}
                        Whether to fetch only past or future trips, or both.
                        (default: both)
  --traveler {true,false,all}
                        Whether to fetch only trips where the user is a
                        traveler, not a traveler, or both. (default: true)
  --delete_cached_pages
                        Delete the cached pages before starting. (default:
                        False)
  --offline_only        Use only the cached pages, without trying to connect.
                        (default: False)
  --output OUTPUT, -o OUTPUT
                        The name of the file to output to. If not specified,
                        will output to stdout. (default: None)

It describes the purpose of each argument and what values it can take, shown in curly brackets, if any (switches such as --skip_cost take none, hence the lack of curly brackets).

The type argument is the only required argument and takes no switch in front of it. It can be placed at the beginning, at the end, or between the other arguments.

To get only trip information, use the trip type, which uses an extractor without any activity extractor as a child. For example:
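
python activities_to_csv.py trip -o trips.csv

(The output filename is your choice.)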

Note: I have changed the past argument from --past {true, false, all} to --period {past, future, both}. Hopefully, this is more intuitive. The traveler argument remains the same (I couldn't think of a way to simplify it).

Here's a typical invocation:

python activities_to_csv.py air --skip_cost --period past --traveler true \
    --offline_only -o output_air.csv

This will process the air activities, skip the cost column, treat only past trips for which the user is a traveler, use only the offline cache (see below), and output to a file called output_air.csv. Note: the backslash (\) is just to break the long line.

One piece of functionality I felt was necessary to add is that pages are automatically cached on local storage (in the trip_cache directory) instead of being re-downloaded every single time. The functionality is similar to what activities.py in the last bounty was doing, but with trips instead of activities, and using separate XML files. For example, the first page of past trips where the user is a traveler is cached as trip_cache/trip_t_past_000001.xml.

One problem is that the cache may fall out of sync with the data on TripIt if the latter changes. When the cache is refreshed, the only check is that it holds as many pages as the server. This is helpful if, for example, downloading the trips was interrupted by the connection dropping: in that case, re-running the script will resume downloading the trips from where it left off.

When changes are made to the data on TripIt, it is necessary to empty the cache folder, or else the script will work on the now-invalid cached data. This can be done either manually or with the --delete_cached_pages switch.
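
For example, to force a fresh download before extracting the air activities:

python activities_to_csv.py air --delete_cached_pages -o output_air.csv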

The --offline_only switch is intended for when you're on the go and don't have an internet connection. With it, the script does not even try to connect to the API and uses the cache directly. Bear in mind that the cache needs to be populated first.

Edit 1: Added activities_to_csv.py and updated the other two files. Added usage instructions, workflow tips and updated the extraction schema explanations.

Edit 2: Switched to using unicodecsv to support unicode output.

Edit 3: Removed unused contextlib import, fixed gen_datetime_extractor.

I forgot to mention: this requires the lxml library (installation instructions: https://lxml.de/installation.html). It also requires requests, but I believe you already have that installed.
CyteBode 2 months ago
Hi. How do I get this to work? When I run it nothing happens. Is there a filename etc.?
sebmack 2 months ago
By waiting for me to release the rest of the script. I'm now done coding it and loosely testing it, but I still need to write usage instructions. I released that first part early so you could have time to wrap your head around the concept of the extraction schemas.
CyteBode 2 months ago
Hi. Thanks for this. Looks like great potential. The bad news is I got a couple of errors on extraction. File "C:\Users\Seb\Documents\GitHub\tripit-to-flightdiary\ahmedangu\activities_to_csv.py", line 263, in main() File "C:\Users\Seb\Documents\GitHub\tripit-to-flightdiary\ahmedangu\activities_to_csv.py", line 258, in main args.offline_only File "C:\Users\Seb\Documents\GitHub\tripit-to-flightdiary\ahmedangu\activities_to_csv.py", line 197, in activities_to_csv writer.writerow(row) UnicodeEncodeError: 'ascii' codec can't encode character u'\u2011' in position 3: ordinal not in range(128) Do these make sense?
sebmack 2 months ago
It does make sense, I didn't take unicode characters into account. It wasn't a problem when writing the XML files thanks to XML entities, but it becomes one when writing the CSV. Looks like the simplest solution may be to use the unicodecsv package instead of csv. Hold on while I test this and update the solution.
CyteBode 2 months ago
Great! BTW, also noticing the start time/date and end time/date aren't there. Any chance you could add that?
sebmack 2 months ago
The start/end DateTime's of what exactly? It should be there unless my DateTimes are broken... I updated the solution for unicode output support. Edit: Okay, gen_datetime_extractor is indeed broken. Will update right away.
CyteBode 2 months ago
Thanks Mate. BTW I ran the extractor.py code above again and got the same error. Am I missing somewhere the updated code is pasted... or are you editing the code now? BTW, I don't mean to hurry you. I'm an amateur and probably don't have the etiquette right for here yet.
sebmack 2 months ago
If in doubt, update all three files by re-copy/pasting and re-saving them. You should not be getting the error anymore (tested with a £ sign in the cost). The site craps out if I try to use other foreign characters, but the script also works if I manually edit the XML files to contain XML entities such as &#169; for ©.
CyteBode 2 months ago
Awesome! 577kb of trips just extracted. Love them. Thanks!!
sebmack 2 months ago
Great! I wasn't sure at all if it would work perfectly with real-world data, but apart from these two bugs, I'm glad to hear it did.
CyteBode 2 months ago
Ahmedangu - now... could you somehow make these into an app? Given what this stuff gets used for, it might be better as a Windows Store app, because nobody is going to use a huge spreadsheet on their phone. No? Or a little downloadable .exe with nice buttons.
sebmack 2 months ago