process time series export to get visit flow / cohort (python)
New here? Learn about Bountify and follow @bountify to get notified of new bounties! x

I have a json file organized by user + domain they hit. In each record we store the time/date of each time they hit that specific domain. we need to process this file/data to see some aggregated counts of the users and the paths they took and some basic properties

unique domain paths a user took

domain 1 > domain 2 > domain 3

domain 2 > domain 3 > domain 1

domain 1 > domain 4

domain 2 > domain 1

next we need to see counts on these paths. we need two things:

1-unique people that did the path so we can get an idea of the number of times a specific user did each of these paths. there is a unique identifier of the user
"key" : "A3SmkVnzN6",
2- total times the path was taken (can be the same user more than once)

one note on circular stats. we might have something like:

domain 1 > domain 2 > domain 1 > domain 2

I think the most fruitful way to handle this is to reduce it to the minimalist most form which would be:

domain 1> domain 2

We would count this as 1 unique user count here since it was 1 person but we would say total times this path was followed is 2.

we are looking for solutions in python on this. (3.7+ would be great)

here is my sample data file. it would be nice if you solution accepted a url for the json so we can have it deal with downloading the data/json and processing it.

https://crts-assets.s3.amazonaws.com/Analytics%20response%20for%2020%20Aug.json

It is not quite clear what is being presented. Each entry is composed of a unique user key, followed by "domain2.com", and then some number of timestamps. Is each timestamp an instance of that unique user visiting "domain2.com"? For the data files you will be analyzing, will unique user keys be associated with more than a single domain (like domain3.com, etc...), and timestamps for each domain?
Evan Gurnick 3 months ago
awarded to Wuddrum
Tags
python

Crowdsource coding tasks.

1 Solution

Winning solution

Hey, here's my solution:

Source: GitHub

Using python 3.7.* with no external packages.

Currently I'm considering a path valid if it contains more than one unique domain, and all domains are unique. E.g. If a user is jumping from domain to domain like this - domain1 > domain2 > domain1 > domain3, the solution will interpret it as two paths - domain1 > domain2 and domain1 > domain3. I believe it only makes sense to start a new path when a user hits an already discovered domain again, as not to create too lengthy paths of different combinations.

Example Output

{
  "paths": [
    {
      "name": "domain1.com > domain2.com > domain3.com",
      "uniqueHits": 161,
      "totalHits": 519,
      "domains": [
        "domain1.com",
        "domain2.com",
        "domain3.com"
      ],
      "users": [
        "3ULk0oFWrJscEXBl",
        "gV9JrB62G4mF3gT9",
        ...
      ]
    },
    {
      "name": "domain5.com > domain6.com",
      "uniqueHits": 1125,
      "totalHits": 6321,
      "domains": [
        "domain5.com",
        "domain6.com"
      ],
      "users": [
        "Nn4fLSF9Tx",
        "gMyHtK9V41",
        ...
      ]
    },
    ...
  ]
}

Usage

domain-flow.py -i "INPUT URL" -o "OUTPUT FILE"

Arguments

-i / --input (required): URL of the input JSON file

-o / --output (required): Absolute or relative path/name for the output JSON file

-s / --separator (optional): Specifies the separator used for path names. Default: " > "

-h / --help (optional): Shows the help message and exits.