Small Python script for file transfer

I need a simple Python script which will:

  1. Listen for changes to a Wilddog/Firebase node:

https://site-eumt.wilddogio.com/listings_xml_aux/aux_xml_send_semaphore_customs

  2. On data update, fetch a list of file links to download from this node:

https://site-eumt.wilddogio.com/listings_xml_customs_id

  3. Download all files from the links in step 2 and extract all XML files from the downloaded zip files

  4. Place the extracted XML files in a local folder and delete the zip files

  5. After completing steps 1-4, clear all data from this node:

https://site-eumt.wilddogio.com/listings_xml_customs_id

  6. The script also needs to listen for new files arriving in a local folder and upload each new file via FTP to a folder on a remote server (a rough sketch of this step follows the list)

  7. The local folder paths for steps 4 and 6, and the FTP details for step 6, should be dynamic parameters that can be changed
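
As a rough illustration of how step 6 could be handled on its own, here is a hypothetical sketch using the third-party watchdog library for folder monitoring and the standard ftplib module for uploads. The folder path, FTP host, credentials and remote folder below are placeholders, and the handler uploads a file as soon as it appears without waiting for large writes to finish.

import ftplib
import os
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class UploadOnCreate(FileSystemEventHandler):
    """Uploads every newly created file in the watched folder over FTP."""

    def __init__(self, ftp_host, ftp_user, ftp_password, remote_dir):
        super().__init__()
        self.ftp_host = ftp_host
        self.ftp_user = ftp_user
        self.ftp_password = ftp_password
        self.remote_dir = remote_dir

    def on_created(self, event):
        if event.is_directory:
            return
        # Note: in a real deployment you may want to wait until the file is
        # fully written before uploading it.
        with ftplib.FTP(self.ftp_host, self.ftp_user, self.ftp_password) as ftp:
            ftp.cwd(self.remote_dir)
            with open(event.src_path, "rb") as f:
                ftp.storbinary(f"STOR {os.path.basename(event.src_path)}", f)


if __name__ == "__main__":
    # Placeholder parameters; in practice these would be the dynamic
    # parameters mentioned in step 7.
    handler = UploadOnCreate("ftp.example.com", "some-user", "some-password", "/upload")
    observer = Observer()
    observer.schedule(handler, "/tmp/watched", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()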

P.S.

Wilddog is pretty much a copy of Firebase and their codebases are the same, so what works for Firebase should work for Wilddog.

They have a Python SDK if necessary:
https://github.com/WildDogTeam/wilddog-python
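
Because Wilddog mirrors Firebase's REST API, the nodes above can also be read and cleared over plain HTTP by appending .json to their URLs, without the SDK. Here is a minimal sketch of steps 1, 2 and 5 using the requests library (chosen only for illustration; any HTTP client works):

import requests

BASE = "https://site-eumt.wilddogio.com"

# Step 1: read the semaphore node.
semaphore = requests.get(
    f"{BASE}/listings_xml_aux/aux_xml_send_semaphore_customs.json"
).json()

# Step 2: fetch the list of file links to download.
listings = requests.get(f"{BASE}/listings_xml_customs_id.json").json()

# Step 5: clear all data from the listings node.
requests.delete(f"{BASE}/listings_xml_customs_id.json")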

awarded to feroldi


2 Solutions


Greetings,

I'm done with the base requirements. You may test it here (follow the run instructions in the link):
https://colab.research.google.com/drive/1P3o0_YObRlHx5z3m0h0MLxO9bxBaGaeO

Sorry, I have deleted the current listings.

The only remaining part is deploying the script to run forever. For that I will need some information about your operating system so I can customize the script for it; for example, on Linux we can use cron jobs to run the script every x minutes, or run it as a process.

The script requires Python 2 and wilddog-python, which are already installed in the Colab notebook I shared. If you don't want to deploy the script locally, you can keep using this Colab notebook and run it frequently; Colab is an awesome tool.

Kindly don't hesitate to ask any questions.

Best regards,

Ahmed

Thanks a lot! I'll be running this on either Ubuntu or Mac OS, so this should be compatible with both right?
user0809 10 months ago
Btw, can you please make this into a single Python script file? Thanks!
user0809 10 months ago
You may download the script from here; don't forget to install the required library and change the FTP information: https://gist.github.com/ahmedengu/bc05d20a225897f29b6a1d5a48432be2 Do you want the script to be a process that runs forever, or to run at a scheduled time? Possible options: https://www.cyberciti.biz/tips/nohup-execute-commands-after-you-exit-from-a-shell-prompt.html https://www.cyberciti.biz/faq/how-do-i-add-jobs-to-cron-under-linux-or-unix-oses/
ahmedengu 10 months ago
Is the script working as expected? Do you need any help setting up your environment or changing anything? We can do a screen share if you need any help.
ahmedengu 10 months ago
Hey, sorry for the late reply, I was really busy with another task. There are two scripts in the GitHub link, one called wilddog.py and the other wilddog_async.py; which one should I use? Also, can you give a small example of how to run the scripts? Thanks
user0809 10 months ago
No worries, I have added detailed instructions and simplified things; you can check it here: https://gist.github.com/ahmedengu/bc05d20a225897f29b6a1d5a48432be2
ahmedengu 10 months ago
Winning solution

My solution covers both of your needs.
It downloads the archives (keeping them in memory, since there's no need to keep them on disk), compares them with the local XML files extracted in previous runs, and uploads them to the SFTP server only if they differ or don't exist locally.
And it does all that really fast.

Put it in a cron job and you are done, simple as that!

Requirements

My solution requires Python 3.7 or later.
The following is the list of dependencies it uses.

aiohttp==3.5.4
asn1crypto==0.24.0
async-timeout==3.0.1
asyncssh==1.15.0
attrs==18.2.0
cffi==1.11.5
chardet==3.0.4
cryptography==2.4.2
idna==2.8
multidict==4.5.2
pycparser==2.19
six==1.12.0
yarl==1.3.0

Save this list in a file named requirements.txt and install them with pip:

pip install -r /path/to/requirements.txt

Script usage

You can pass the -h flag to get a help message on how to use the script.
Anyhow, here's an example of the parameters:

$ python3.7 script.py \
    --wilddog-url https://site-eumt.wilddogio.com \
    --local-dir-path /tmp/ceb \
    --remote-dir-path /tmp/ceb \
    --sftp-host some-host.com \
    --sftp-user some-user \
    --semaphore-file-path /tmp/semaphore.json
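
If you take the cron-job route mentioned above, a crontab entry along these lines would run the script every five minutes; the interpreter path, script path, arguments and log file are placeholders to adapt to your setup:

*/5 * * * * /usr/bin/python3.7 /path/to/script.py --wilddog-url https://site-eumt.wilddogio.com --local-dir-path /tmp/ceb --remote-dir-path /tmp/ceb --sftp-host some-host.com --sftp-user some-user --semaphore-file-path /tmp/semaphore.json >> /tmp/ceb-sync.log 2>&1

Note that a crontab entry has to stay on a single line; if it gets unwieldy, wrap the command in a small shell script and point cron at that instead.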

The script

The following is a listing of the complete script.

import aiohttp
import argparse
import asyncio
import asyncssh
import io
import json
import logging
import os
import pathlib
import sys
import zipfile
import urllib.parse

log = logging.getLogger(__name__)
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format=("[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s"),
)


def should_synchronize_files(xml_data, local_xml_path):
    try:
        with io.open(local_xml_path, "rb") as f:
            local_xml_data = f.read()
            if xml_data != local_xml_data:
                return True
    except FileNotFoundError:
        return True
    return False


async def synchronize_xml_files_with_server(
    archive_data, session, sftp, local_dir_path, remote_dir_path
):
    tasks = []
    with zipfile.ZipFile(archive_data) as archive:
        for archive_member in archive.namelist():
            # Check whether the downloaded XML file (referred to as `current`)
            # is brand new, or otherwise differs from the local copy extracted
            # on a previous run (referred to as `last`).
            cur_xml_data = archive.read(archive_member)
            xml_file_path = pathlib.Path(local_dir_path, archive_member)
            files_differ = should_synchronize_files(
                xml_data=cur_xml_data, local_xml_path=xml_file_path
            )
            if files_differ:
                log.info(f"extracting {archive_member}")
                # Update the XML file locally and upload it to the FTP server.
                with io.open(xml_file_path, "wb") as f:
                    f.write(cur_xml_data)
                tasks.append(
                    sftp.put(xml_file_path.as_posix(), remotepath=remote_dir_path)
                )
            else:
                log.info(f"no action needed for {archive_member}")
    await asyncio.gather(*tasks)


async def fetch_archive_and_sync(archive_url, session, sftp, opts):
    log.info(f"downloading {archive_url}")
    async with session.get(archive_url) as resp:
        archive_data = io.BytesIO(await resp.read())
    await synchronize_xml_files_with_server(
        archive_data, session, sftp, opts.local_dir_path, opts.remote_dir_path
    )


async def main(opts):
    async with aiohttp.ClientSession() as session:
        async with session.get(opts.semaphore_url) as resp:
            cur_id = await resp.json()
        try:
            with io.open(opts.semaphore_file_path, "r") as f:
                last_id = json.load(f)
        except FileNotFoundError:
            last_id = None
        if cur_id != last_id:
            async with session.get(opts.xml_listings_url) as resp:
                zip_files_urls = await resp.json()
            os.makedirs(opts.local_dir_path, exist_ok=True)
            async with asyncssh.connect(
                opts.sftp_host, username=opts.sftp_user
            ) as ssh_conn:
                async with ssh_conn.start_sftp_client() as sftp:
                    if not await sftp.isdir(opts.remote_dir_path):
                        await sftp.mkdir(opts.remote_dir_path)
                    tasks = [
                        fetch_archive_and_sync(archive_url, session, sftp, opts)
                        for _, archive_url in zip_files_urls.items()
                    ]
                    await asyncio.gather(*tasks)

            # Synchronize the local semaphore value with Wilddog's one
            with io.open(opts.semaphore_file_path, "w") as f:
                log.info(f"synchronizing semaphore from `{last_id}` to `{cur_id}`")
                json.dump(cur_id, f)

            async with session.delete(opts.xml_listings_url) as resp:
                assert resp.status == 200


if __name__ == "__main__":
    args = argparse.ArgumentParser()
    args.add_argument(
        "--wilddog-url",
        metavar="URL",
        required=True,
        help=("Wilddog space URL from which to fetch archives information"),
    )
    args.add_argument(
        "--local-dir-path",
        metavar="PATH",
        required=True,
        help=(
            "Local directory path to which XML files are stored. "
            "If PATH doesn't exist, it will be created automatically."
        ),
    )
    args.add_argument(
        "--remote-dir-path",
        metavar="PATH",
        required=True,
        help=(
            "Remote directory path on the SFTP server to which files are uploaded. "
            "If PATH doesn't exist on the server, it will be created automatically."
        ),
    )
    args.add_argument(
        "--sftp-host",
        metavar="HOST",
        required=True,
        help=("Host of the SFTP server to which files are uploaded."),
    )
    args.add_argument(
        "--sftp-user",
        metavar="USER",
        required=True,
        help=("User name of the SFTP server."),
    )
    args.add_argument(
        "--semaphore-file-path",
        metavar="PATH",
        required=True,
        help=(
            "Path to the local semaphore JSON file. This is needed in order to keep "
            "track of the Wilddog's semaphore value. If PATH doesn't exist, it will "
            "be created automatically."
        ),
    )
    args.add_argument(
        "--verbose",
        action="store_true",
        help=(
            "This is a flag parameter which, when passed, makes the program "
            "report more information about its process."
        ),
    )

    opts = args.parse_args()
    opts.semaphore_url = urllib.parse.urljoin(
        opts.wilddog_url, "listings_xml_aux/aux_xml_send_semaphore_customs.json"
    )
    opts.xml_listings_url = urllib.parse.urljoin(
        opts.wilddog_url, "listings_xml_customs_id.json"
    )
    log.setLevel(logging.DEBUG if opts.verbose else logging.INFO)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(opts))
    # Zero-sleep to allow underlying connections to close
    loop.run_until_complete(asyncio.sleep(0))
    loop.close()
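
One note on authentication: as written, asyncssh.connect relies on your default SSH keys or a running SSH agent. If the server requires password authentication instead, asyncssh.connect also accepts a password keyword argument, so a variation along these lines (with a hypothetical --sftp-password option added to the argument parser) would work:

async with asyncssh.connect(
    opts.sftp_host, username=opts.sftp_user, password=opts.sftp_password
) as ssh_conn:
    ...
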
Hey feroldi, step 5 is missing: after completing steps 1-4, clear all data from this node: https://site-eumt.wilddogio.com/listings_xml_customs_id
ahmedengu 10 months ago
@ahmedengu, thanks, done.
feroldi 10 months ago