Estimate amount of work based on conversation/chat events - math algo

I have a problem I need some help and creative thinking to solve. Basically, I have chat logs of my users. I'm recording these as events, and our assumption is that these events are the back and forth between a development group: chatting about tasks, work, tickets, etc. The common flow is that a PM will ping a developer and they will chat a little; the developer will ping back after some time, and a natural conversation occurs. We can infer from this conversation that the developer and manager were "working" on this project. We make some assumptions here about the conversation: a big conversation likely means there is more work, and there might be some lag before the next conversation and series of events. There is a case where one person pings the other and there is no reply for 4 hours; then we can assume they probably weren't working, and that after the initial chat there was some decay and then work stopped. I want to see if we can use a series of events that contain:

  • # of words
  • unique user ID
  • timestamp

to derive some ideas about the amount of work that was done. Obviously it won't be perfect, but I think we can come up with some cool ideas. My initial idea was to convert each point to a coordinate. Each coordinate would have a radius; the radius would cause coordinates to overlap and "clump", forming a larger radius. This larger radius might then touch other coordinates, and so on. At some point we can suppose that the left-most coordinate of a cluster is the hypothesized beginning, and that the right-most edge of the radius of a coordinate or cluster is the hypothesized end time.
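
To make that concrete, here's a minimal sketch of the clumping idea (Python; the radius rule is just a placeholder and would need tuning):

def clump(events, secs_per_word=60):
    # events: (timestamp, word_count) pairs, sorted by timestamp
    clusters = []
    for ts, words in events:
        radius = words * secs_per_word  # placeholder radius rule
        start, end = ts - radius, ts + radius
        if clusters and start <= clusters[-1][1]:
            # overlaps the previous clump: extend its right edge
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return clusters  # each [begin, end] pair is a hypothesized work session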

This bounty is to come up with some ideas on algorithms that could be used to solve such a problem. I have prepared 5 days of events we can use as the sample set.

https://www.dropbox.com/s/2f98gdlwq4z4wef/combined%205%20days-bounty.xlsx?dl=0

Ideally your algo will have some ideas about how to estimate this. You don't need to base it on my coordinate idea... it was just an idea. Your algo should have some way to understand/prove what it thinks the solution is. Depending on the approach, we ask that you document how to dial in the thresholds of any variables. I'm open to the tech being anything from R to JS to Python. I'm really just looking for good ideas. We will award the winner the full bounty, and the next 2 runners-up will get a $25 bounty each.

Thanks for your time and ideas in advance!

Interesting problem.
alixaxel 2 months ago
I'm not sure whether your hypothesis is that 1) idle time between chats represents work, or 2) the amount of "uninterruptible" chat represents work.
alixaxel 2 months ago
Could you confirm that U033Q5G5P is the developer and U09JZ6P60 is the PM? Seems to be the case, but since U033Q5G5P has the first recorded message I'm dubious (maybe you truncated it)?
alixaxel 2 months ago
In our case all actors are treated the same. I can imagine a future phase where we tackle other things like roles and other Slack channels. For now we can consider it a team, and we are trying to see how long members of the team worked. Thx alixaxel for your interest! You're right about the hypothesis, it makes a lot of assumptions; for sure this is exploratory.
Qdev 2 months ago
awarded to Wuddrum


5 Solutions


What if we treat the whole timeline as a combined line chart instead?

Using Chart.js we can accomplish this:
http://www.chartjs.org/samples/latest/scales/time/combo.html

HOW TO

We have a conversation between two people, chatting about everything, and a timestamp for each message.

Using the timestamps and a bit of math we can derive starting values for the variables needed to construct the chart.

Y axis

The Y axis will be the amount of time each user took to answer. Easy as that. Based on the longest response time, we could express it in seconds, minutes or even hours. This is configurable in Chart.js too!

For example, if the longest time taken for a person to answer was 60 minutes, we can construct the Y axis by dividing that amount of time by 5 to get the ranges, and formatting each value into a readable label (e.g. 60 minutes means 1 hour, and 75 minutes means 1:15).
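
For instance, a tiny helper (Python here just for brevity) could produce those labels:

def fmt_minutes(total_minutes):
    # 45 -> '0:45', 60 -> '1:00', 75 -> '1:15'
    hours, minutes = divmod(int(total_minutes), 60)
    return '%d:%02d' % (hours, minutes)

# Five equal Y axis steps up to the longest answer time (60 minutes):
steps = [fmt_minutes(60 * i / 5) for i in range(1, 6)]
# ['0:12', '0:24', '0:36', '0:48', '1:00']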

X axis

The X axis will be the total amount of time the conversation took, from the first message to the last one. There's no need for math on the X axis, since we can use a switch case where the program chooses the range between each division.

BARS

The bars will represent each user, so we can identify in the timeline which user had the longest or shortest response times in each conversation.

We could even use the curves to estimate how big the difference between response times was for each user in the conversation, factoring in the number of words in each message.

GET THE TOTAL POINTS - ESTIMATED WORK

We can make a sort of "scoreboard" where we rate how much work was done in the conversation. A high score means the users worked hard (a good working conversation).

How do we achieve this? The conversation is given points when messages contain lots of words (maybe a constant multiplier for each word?) and when there's little time between messages. If there are 15 minutes or more between messages, only half the points are given; above 30 or 45 minutes (you choose the threshold, I'm just using arbitrary constants) no points are given at all. This way you get a rating of how fluid the conversation was, and you could even split it into tiers (bad, regular, good conversation).
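
A rough sketch of that scoring (Python, with made-up constants you'd want to tune):

WORD_MULTIPLIER = 1.0          # points per word (made up)
FULL, HALF = 10.0, 5.0         # points for fast / slower replies (made up)

def conversation_score(messages):
    # messages: (timestamp_seconds, word_count) pairs, sorted by time
    score = 0.0
    for i, (ts, words) in enumerate(messages):
        score += words * WORD_MULTIPLIER
        if i == 0:
            continue
        gap_minutes = (ts - messages[i - 1][0]) / 60.0
        if gap_minutes < 15:
            score += FULL      # fluid exchange
        elif gap_minutes < 30:
            score += HALF      # slowing down
        # 30+ minutes: the gap earns nothing
    return score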

Tell me if you like the idea! I could build something myself with this one. Sounds really fun!


When I read your circle idea, I couldn't stop thinking about it, so this is somewhat similar. I first thought of spacing out chat messages based on their time, like this: https://imgur.com/a/Roa9Gkb. (I wish I could include images in this post). The chats are not tied to any real time, but rather just spaced out based on the time increments between chats.

I drew a semicircle from each point to its next neighbor: https://imgur.com/a/AfNLktH. As you can see, points further away from each other have bigger semicircles with larger radii.

I then added the word count of the chat message to the radius of each semicircle so that they would overlap and each circle would become a little bigger. The final result was this: https://imgur.com/a/whlfBGi.

The 'total work done' is simply the sum of the areas of the half circles. This algorithm gives a reasonable measure of work because it takes into account the time increments, message length, and conversation length when determining the work value of a conversation. How is a 'conversation' defined? Like this: I split up the list of chats into sublists separated by periods of no activity above a certain cutoff point, i.e. where the work has 'tapered off'.
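
The splitting itself is a single scan; here's a sketch in Python (my fiddle's rows store the word count at index 0 and the timestamp at index 2, and the 4 hour cutoff is an arbitrary choice):

def split_conversations(chats, cutoff_seconds=4 * 3600):
    # chats: [word_count, user_id, timestamp] rows, sorted by timestamp
    conversations = [[chats[0]]]
    for prev, cur in zip(chats, chats[1:]):
        if cur[2] - prev[2] > cutoff_seconds:
            conversations.append([])  # long silence: start a new conversation
        conversations[-1].append(cur)
    return conversations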

So that last image was just a neat visualization of my idea, but my algorithm can come up with a numerical value for work for each conversation that occurs in your test data. Here is a jsFiddle that does just that: https://jsfiddle.net/drtyg8ae/7/ (Check the console). And here is a jsFiddle that picks one conversation and visualizes my algorithm's method: https://jsfiddle.net/o89dwvpt/69/

The core code behind the algorithm is here:

// WIDTH and HEIGHT come from the jsFiddle's canvas setup.
for (let j = 1; j < conversations.length; j++) {
  let conversation = conversations[j]

  // Map the conversation's time span onto the canvas width.
  let diff = conversation[conversation.length - 1][2] - conversation[0][2]
  let scale = WIDTH / diff

  let work = 0

  // Each row holds the word count at index 0 and the timestamp at index 2.
  for (let i = 1; i < conversation.length - 1; i++) {
    let move = conversation[i][2] - conversation[0][2]
    let increment = conversation[i][2] - conversation[i - 1][2]
    let lastMove = conversation[i - 1][2] - conversation[0][2]
    // Endpoints (p1, p3) and midpoint (p2) of the semicircle's baseline.
    let p1 = { x: lastMove * scale, y: HEIGHT - 4 }
    let p2 = { x: (increment / 2 + lastMove) * scale, y: HEIGHT - 4 }
    let p3 = { x: move * scale, y: HEIGHT - 4 }
    // The angles are only needed when drawing the arcs in the fiddle.
    let startAngle = Math.atan2(p1.y - p2.y, p1.x - p2.x)
    let endAngle = Math.atan2(p3.y - p2.y, p3.x - p2.x)
    // Radius: half the scaled gap between the two messages, padded by the
    // previous message's word count.
    let diffX = p1.x - p2.x
    let diffY = p1.y - p2.y
    let radius = Math.sqrt(diffX * diffX + diffY * diffY) + conversation[i - 1][0]

    // Area of a semicircle: (pi * r^2) / 2.
    let area = Math.PI * radius * radius / 2
    work += area
  }
  console.log('conversation', j, 'work done', work)
}

Here's an alternative solution to the circles/radius idea.

I went ahead with grouping message events into periods (1 minute by default). Then, using an exponential moving average (EMA), periods get smoothed out with their neighboring periods. The smoothing also adds a period decay. Optionally (on by default), a rolling window is applied as a second pass to smooth the periods out even more (though it's not really necessary with a fine-tuned EMA). After all the smoothing, the periods form clear clusters that are then extracted as sessions. Here's an example of sessions over the produced clusters. More example plots can be seen below, with the outputs.
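
To get a feel for what the smoothing does, here's a toy example (pandas; for ewm, the span setting maps to the smoothing factor as alpha = 2/(span+1)):

import pandas as pd

words = pd.Series([0, 120, 0, 0, 30, 0, 0, 0])  # words per 1-minute period
ema = words.ewm(span=4).mean()                  # EMA smoothing + decay tail
smooth = ema.rolling(3, center=True).median()   # optional 2nd-pass window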

By default I'm not excluding sessions with only one participant, since my assumption is that work can be done even if there's no immediate response from others. This can be disabled with the --nosolo argument. Maybe a better solution would be to mark PM users and exclude sessions where they're the only participants, since those are most likely pings.

Argument documentation:

-p / --period: Specifies the period length in seconds. Defaults to 60 (1 minute). Note: The period also represents the session duration's resolution. E.g. a 30 minute period setting means that the shortest session will be 30 minutes, and durations will increase in steps of 30 minutes.

-w / --window: Specifies the rolling window size - the number of periods used for smoothing. Defaults to 60. Note: This also affects decay, greater values will increase decay and in turn will promote larger cluster formations.

-s / --span: Specifies the decay for EMA in terms of span. Greater values will increase decay and in turn will promote larger cluster formations. Defaults to 60.

-t / --threshold: Specifies the threshold over which session clusters are registered. Since the values get smoothed, they often require a lot of decay to actually get to 0. This helps with filtering out all non-significant period values. Default: 0.025.

-r / --rolling: Disables 2nd pass rolling window smoothing. Default: Rolling window is enabled.

-P / --plot: Plot the results. Helps with visualizing and fine-tuning the parameters, to get the right kind of sessions extracted. Examples of this can be seen down below, with outputs.

-n / --nosolo: Exclude sessions that have only one participant. Default: false.

Here's the solution itself, in Python 3.6.x:

#!/usr/bin/env python
import math
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument(dest='data_file', help='csv data file, that contains chat events', metavar='DATAFILE')
parser.add_argument('-p', '--period', dest='period', help='Period in seconds that data should be grouped into, before any calculations', default=60, type=int)
parser.add_argument('-w', '--window', dest='window', help='Rolling window size', default=60, type=int)
parser.add_argument('-s', '--span', dest='span', help='Span value for EMA', default=60, type=int)
parser.add_argument('-t', '--threshold', dest='threshold', help='Rolling window/EMA value threshold over which sessions are registered', default=0.025, type=float)
parser.add_argument('-r', '--rolling', dest='rolling', help='Apply rolling window on top of EMA for more smoothing', action='store_false', default=True)
parser.add_argument('-P', '--plot', dest='plot', help='Plot results', action='store_true')
parser.add_argument('-n', '--nosolo', dest='no_solo', help='Exclude sessions with only one participant', action='store_true', default=False)
args = parser.parse_args()


class Session:
    def __init__(self, start, end):
        self._start_ts = start
        self._end_ts = end
        self._start = datetime.datetime.utcfromtimestamp(start)
        self._end = datetime.datetime.utcfromtimestamp(end)
        self._total_words = 0
        self._users = set()
        self._message_ids = set()

    @property
    def start_ts(self):
        return self._start_ts

    @property
    def end_ts(self):
        return self._end_ts

    @property
    def start(self):
        return self._start

    @property
    def end(self):
        return self._end

    @property
    def total_words(self):
        return self._total_words

    @total_words.setter
    def total_words(self, total_words):
        self._total_words = total_words

    @property
    def users(self):
        return self._users

    @property
    def message_ids(self):
        return self._message_ids


def get_periods(data, from_ts, to_ts):
    period_count = int((to_ts - from_ts) / args.period) + 500 # adding a 500 period buffer for decay
    periods = pd.DataFrame(0, index=range(period_count + 1), columns=['words'])  
    timestamp_col = data.columns.get_loc('timestamp')
    words_col = data.columns.get_loc('word_count')
    for data_row in data.itertuples(index=False):
        n_period = math.floor((data_row[timestamp_col] - from_ts) / args.period)
        periods.iat[n_period, 0] += data_row[words_col]

    periods = periods.ewm(span=args.span, ignore_na=True).mean()
    if (args.rolling):
        periods = periods.rolling(args.window, center=True).median()
        periods = periods.shift(int(-args.window/2)) # shift back rolling window due to its lagging nature

    return periods


def get_sessions(periods, from_ts):
    session_start = None
    session_end = None
    sessions = []
    period_sessions = []
    for i in range(len(periods)):
        if (session_start is None and periods.iat[i, 0] > args.threshold):
            session_start = i
        elif (session_start is not None and periods.iat[i, 0] <= args.threshold):
            session_end = i
            timestamp_start = from_ts + (session_start * args.period)
            timestamp_end = from_ts + (session_end * args.period)
            sessions.append(Session(timestamp_start, timestamp_end))
            period_sessions.append((session_start, session_end))
            session_start = None
            session_end = None

    return (sessions, period_sessions)


def propagate_session_data(sessions, data):
    words_col = data.columns.get_loc('word_count')
    user_col = data.columns.get_loc('user')
    timestamp_col = data.columns.get_loc('timestamp')
    msg_id_col = data.columns.get_loc('client_msg_id')

    for data_row in data.itertuples(index=False):
        for session in sessions:
            if session.start_ts <= data_row[timestamp_col] <= session.end_ts:
                session.total_words += data_row[words_col]
                session.users.add(data_row[user_col])
                session.message_ids.add(str(data_row[msg_id_col]))
                break


def log_sessions(sessions):
    for i in range(len(sessions)):
        print('[Session %i]' % (i + 1))
        print('Duration: %s' % (sessions[i].end - sessions[i].start))
        print('Total words: %i' % sessions[i].total_words)
        print('Total messages: %i' % len(sessions[i].message_ids))
        print('Participants: %s' % ', '.join(sessions[i].users))
        print('Message ids: %s' % ', '.join(sessions[i].message_ids))
        print()


def exclude_solo_sessions(sessions, period_sessions):
    filtered_sessions = []
    filtered_period_sessions = []
    for i in range(len(sessions)):
        if len(sessions[i].users) > 1:
            filtered_sessions.append(sessions[i])
            filtered_period_sessions.append(period_sessions[i])

    return (filtered_sessions, filtered_period_sessions)


def plot(data, periods, sessions, period_sessions):
    _, axes = plt.subplots(nrows=3, ncols=1)

    axes[0].plot(periods['words'])
    axes[0].set_title('Sessions over EMA/Rolling Window')
    for (start, end) in period_sessions:
        axes[0].axvspan(start, end, alpha=0.25, color='green')

    axes[1].scatter(data['timestamp'], data['word_count'])
    axes[1].set_title('Sessions over original data')
    for session in sessions:
        axes[1].axvspan(session.start_ts, session.end_ts, alpha=0.25, color='green')

    axes[2].scatter(data['timestamp'], np.zeros(len(data)))
    axes[2].set_title('Sessions over message events')
    for session in sessions:
        axes[2].axvspan(session.start_ts, session.end_ts, alpha=0.25, color='green')

    plt.show()


if __name__ == '__main__':
    data = pd.read_csv(args.data_file)
    min_ts = data['timestamp'].min()
    max_ts = data['timestamp'].max()
    from_ts = min_ts - (min_ts % args.period)
    to_ts = max_ts - (max_ts % args.period)

    periods = get_periods(data, from_ts, to_ts)
    sessions, period_sessions = get_sessions(periods, from_ts)
    propagate_session_data(sessions, data)

    if (args.no_solo):
        sessions, period_sessions = exclude_solo_sessions(sessions, period_sessions)

    log_sessions(sessions)
    if args.plot:
        plot(data, periods, sessions, period_sessions)

Example Outputs:

I'm using a CSV-converted version of the .xlsx file, with the word count header renamed to word_count.
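
If you'd rather do the conversion in code, pandas can handle it (assuming the sheet's word count header is 'Word count', and that openpyxl is installed):

import pandas as pd

df = pd.read_excel('combined 5 days-bounty.xlsx')
df = df.rename(columns={'Word count': 'word_count'})
df.to_csv('data.csv', index=False)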

Default output:

$ ./extract_sessions.py data.csv --plot

Plot

[Session 1]
Duration: 2:00:00
Total words: 108
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 2]
Duration: 6:07:00
Total words: 411
Total messages: 22
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 3]
Duration: 1:03:00
Total words: 6
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 4]
Duration: 2:13:00
Total words: 61
Total messages: 8
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 5]
Duration: 0:41:00
Total words: 3
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 6]
Duration: 4:42:00
Total words: 515
Total messages: 40
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 7]
Duration: 7:20:00
Total words: 203
Total messages: 17
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 8]
Duration: 0:51:00
Total words: 4
Total messages: 1
Participants: U033Q5G5P
Message ids: 94E23671-114F-4971-8059-D6897B4CB4D2

[Session 9]
Duration: 16:07:00
Total words: 1265
Total messages: 85
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 10]
Duration: 17:45:00
Total words: 2355
Total messages: 196
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

Lesser decay output, with smaller clusters and less work time per session:

$ ./extract_sessions.py data.csv --plot --span 30

Plot

[Session 1]
Duration: 0:55:00
Total words: 108
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 2]
Duration: 3:29:00
Total words: 384
Total messages: 20
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 3]
Duration: 1:05:00
Total words: 27
Total messages: 2
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 4]
Duration: 0:42:00
Total words: 6
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 5]
Duration: 1:17:00
Total words: 61
Total messages: 8
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 6]
Duration: 0:31:00
Total words: 3
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 7]
Duration: 3:43:00
Total words: 515
Total messages: 40
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 8]
Duration: 3:39:00
Total words: 137
Total messages: 14
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 9]
Duration: 1:19:00
Total words: 66
Total messages: 3
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 10]
Duration: 0:35:00
Total words: 4
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 11]
Duration: 1:03:00
Total words: 24
Total messages: 2
Participants: U09JZ6P60
Message ids: ...truncated...

[Session 12]
Duration: 5:38:00
Total words: 574
Total messages: 31
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 13]
Duration: 5:40:00
Total words: 660
Total messages: 52
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 14]
Duration: 0:44:00
Total words: 7
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 15]
Duration: 16:54:00
Total words: 2355
Total messages: 196
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...   

Lesser decay output, with smaller clusters and less work time per session AND without solo sessions:

$ ./extract_sessions.py data.csv --plot --span 30 --nosolo

Plot

[Session 1]
Duration: 3:29:00
Total words: 384
Total messages: 20
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 2]
Duration: 1:05:00
Total words: 27
Total messages: 2
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 3]
Duration: 1:17:00
Total words: 61
Total messages: 8
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 4]
Duration: 3:43:00
Total words: 515
Total messages: 40
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 5]
Duration: 3:39:00
Total words: 137
Total messages: 14
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 6]
Duration: 1:19:00
Total words: 66
Total messages: 3
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 7]
Duration: 5:38:00
Total words: 574
Total messages: 31
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 8]
Duration: 5:40:00
Total words: 660
Total messages: 52
Participants: U033Q5G5P, U09JZ6P60
Message ids: ...truncated...

[Session 9]
Duration: 16:54:00
Total words: 2355
Total messages: 196
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...  

Greater decay output, with larger clusters and more work time per session:

$ ./extract_sessions.py data.csv --plot --span 100 --window 100 --threshold 0.02

Plot

[Session 1]
Duration: 3:05:00
Total words: 108
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 2]
Duration: 7:18:00
Total words: 411
Total messages: 22
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 3]
Duration: 1:30:00
Total words: 6
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 4]
Duration: 3:27:00
Total words: 61
Total messages: 8
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 5]
Duration: 0:54:00
Total words: 3
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 6]
Duration: 14:10:00
Total words: 718
Total messages: 57
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 7]
Duration: 1:12:00
Total words: 4
Total messages: 1
Participants: U033Q5G5P
Message ids: ...truncated...

[Session 8]
Duration: 17:21:00
Total words: 1265
Total messages: 85
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...

[Session 9]
Duration: 19:26:00
Total words: 2355
Total messages: 196
Participants: U09JZ6P60, U033Q5G5P
Message ids: ...truncated...  

I have some other ideas for this and will try to implement and post them if they're successful.


Your coordinate and circles idea reminded me a lot of the concept of metaballs.

While I was toying around, I ended up with some pretty graphics, reminiscent of a lava lamp: https://i.imgur.com/9SA6OF9.png

This is based on the sample dataset. The horizontal axis is the time of day from left to right, the vertical is the date from top to bottom (empty days only count as one to avoid dead space) and the orange/green colors represent the two different users. There is one ball per message and the radius depends on the square root of the number of words. Thanks to the nature of metaballs, multiple small messages in close proximity merge into a bigger ball.

I let the days overlap a bit so as to try to take advantage of the two dimensions, but that made me realize that I only really needed a single dimension for the sake of the algorithm. So I simplified the script to 1D (which is much more efficient) and have come up with the following solution:

Requirements

  • Python 3 or 2.7
  • Numpy
  • (Optional) dateutil, for timezone support

Code

Note: The constants at the top (in ALL_CAPS) can be changed as desired, as well as the radius_fn function (see the comment inside radius_fn for an alternative to try).

import csv
import datetime
import functools
import math

import numpy as np


# CSV file location (converted from .xlsx with Excel)
CSV_FILE = "combined 5 days-bounty.csv"

# Scaling of the metaballs
SIZE_FACTOR = 20.0

# How many samples represent 1 minute
SAMPLES_PER_MINUTE = 1.0

# How long to pad the beginning and end of the falloff buffer (in hours)
PADDING = 24.0

# At which falloff value the metaballs become opaque
THRESHOLD = 0.99

# How long a work session needs to be to be considered valid (in minutes)
MIN_WORKTIME = 30.0

# Whether to ignore work sessions involving only a single user
IGNORE_LONE_WS = False

# Whether to clip work sessions by their beginning and ending messages
CLIP_WS = False



def radius_fn(word_count):
    # Radius of the message in samples, in function of the word count
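    # (sqrt growth keeps very long messages from dominating; a linear
    #  variant such as `word_count * SIZE_FACTOR * SAMPLES_PER_MINUTE`
    #  could be tried instead)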
    return math.sqrt(word_count) * SIZE_FACTOR * SAMPLES_PER_MINUTE


# UTC timezone by default
TIMEZONE = None

try:
    # pip install python-dateutil
    from dateutil import tz
    # Local timezone (uncomment the following line)
    #TIMEZONE = tz.gettz()
    # Specific timezone
    #TIMEZONE = tz.gettz("EST")
except ImportError:
    pass


@functools.total_ordering
class Message(object):
    def __init__(self, user_id, timestamp, word_count):
        self._user_id = user_id.lower()
        timestamp = datetime.datetime.utcfromtimestamp(float(timestamp))
        if not (TIMEZONE is None):
            timestamp = TIMEZONE.fromutc(timestamp.replace(tzinfo=TIMEZONE))
        self._timestamp = timestamp
        self._word_count = int(word_count)

    def __eq__(self, other):
        return self._timestamp == other._timestamp

    def __lt__(self, other):
        return self._timestamp < other._timestamp

    @property
    def user_id(self):
        return self._user_id

    @property
    def timestamp(self):
        return self._timestamp

    @property
    def word_count(self):
        return self._word_count


def rasterize_message(falloff_buffer, msg):
    """ Rasterizes the falloff values of a message's metaball over its
        effective area into a float[] buffer. """

    dt = msg.timestamp - earliest
    o = dt.total_seconds() / 60.0 * SAMPLES_PER_MINUTE
    oi = int(round(o))

    r = radius_fn(msg.word_count)
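    # Scale so that the piecewise falloff below reaches zero at d == 1.0,
    # i.e. at (3 + sqrt(3)) / 2 ~= 2.37 radii from the message's sample.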
    s = 1.0 / (r / 2.0 * (3 + math.sqrt(3.0)))

    # Depth-first search flood fill
    q = [oi]
    while q:
        pi = q.pop(-1)
        if pi < 0 or pi >= n_samples:
            continue

        d = abs(pi - o + 0.5) * s

        # Taken from http://nullcandy.com/2d-metaballs-in-xna/
        if d <= 1.0 / 3.0:
            f = 1 - 3 * (d * d)
        elif d <= 1.0:
            f = 1.5 * ((1.0 - d) * (1.0 - d))
        else:
            f = 0.0

        if f <= 0.0:
            continue

        falloff_buffer[pi] += f

        if (pi - oi) >= 0:
            q.append(pi + 1)
        if (pi - oi) <= 0:
            q.append(pi - 1)


class WorkSession(object):
    def __init__(self, start, end=None):
        if end is None:
            end = start
        self._start = start
        self._end = end
        self._user_ids = {}

    def __contains__(self, time):
        return time >= self._start and time <= self._end

    def __hash__(self):
        return id(self)

    def add_user(self, user_id, w=1.0):
        uids = self._user_ids
        uids[user_id] = uids.setdefault(user_id, 0.0) + w

    @property
    def user_ids(self):
        tw = sum(v for k, v in self._user_ids.items())
        return tuple((k, float(v)/tw) for k, v in self._user_ids.items())

    @property
    def start(self):
        return self._start

    @start.setter
    def start(self, start):
        assert start <= self._end
        self._start = start

    @property
    def end(self):
        return self._end

    @end.setter
    def end(self, end):
        assert end >= self._start
        self._end = end

    @property
    def duration(self):
        return self._end - self._start


if __name__ == '__main__':
    import os
    import sys


    # Constants validation
    try:
        if not os.path.exists(CSV_FILE):
            raise Exception("The CSV file does not exist!")
        if SIZE_FACTOR <= 0.0:
            raise Exception("Invalid value for SIZE_FACTOR!")
        if SAMPLES_PER_MINUTE <= 0.0:
            raise Exception("Invalid value for SAMPLES_PER_MINUTE!")
        if THRESHOLD < 0.0:
            raise Exception("Invalid value for THRESHOLD!")
        if MIN_WORKTIME < 0.0:
            raise Exception("Invalid value for MIN_WORKTIME!")
    except Exception as e:
        sys.stderr.write(str(e))
        sys.stderr.write('\n')
        sys.exit(-1)


    # Parse the CSV
    messages = []
    with open(CSV_FILE, "r") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            msg = Message(row["user"], row["timestamp"], row["Word count"])
            messages.append(msg)

    if len(messages) == 0:
        raise Exception("No message in CSV!")

    # Sort the messages by timestamp
    messages.sort()


    # Figure out the beginning and end
    earliest = messages[ 0].timestamp - datetime.timedelta(hours=PADDING)
    latest   = messages[-1].timestamp + datetime.timedelta(hours=PADDING)
    assert latest >= earliest

    span_minutes = (latest - earliest).total_seconds() / 60.0
    n_samples = int(math.ceil(span_minutes * SAMPLES_PER_MINUTE))

    # About 4 MiB of RAM per year, at 1.0 samples/min
    falloff_buffer = np.zeros(n_samples)


    # Rasterize all the messages to the falloff buffer
    for msg in messages:
        rasterize_message(falloff_buffer, msg)


    work_sessions = []
    current_ws = None
    _messages = list(messages)

    def finalize_ws(ws, end_dt):
        ws.end = earliest + end_dt

        start = ws.end
        end   = ws.start
        while len(_messages) > 0 and _messages[0].timestamp < ws.end:
            # For all the messages relating to that work session...
            msg = _messages.pop(0)

            if msg.timestamp in ws:
                # Add the user's word count
                ws.add_user(msg.user_id, msg.word_count)

                # Figure out the beginning and end messages
                if msg.timestamp < start:
                    start = msg.timestamp
                if msg.timestamp > end:
                    end = msg.timestamp

        if CLIP_WS:
            # Clip the work session by the beginning and end messages
            ws.start = start
            ws.end = end

        return ws

    try:
        xrange
    except NameError:
        # Python 3
        xrange = range

    # Walk the falloff buffer to find work sessions
    for i in xrange(n_samples):
        f = falloff_buffer[i]

        if current_ws is None:
            if f < THRESHOLD:
                continue

            dt = datetime.timedelta(minutes = (i + 0.5) / SAMPLES_PER_MINUTE)
            current_ws = WorkSession(earliest + dt)
        else:
            if f >= THRESHOLD:
                continue

            dt = datetime.timedelta(minutes = (i + 0.5) / SAMPLES_PER_MINUTE)
            work_sessions.append(finalize_ws(current_ws, dt))
            current_ws = None

    # In case there is a trailing work session...
    # (This should not happen with sufficient padding.)
    if not (current_ws is None):
        dt = datetime.timedelta(minutes = (i + 0.5) / SAMPLES_PER_MINUTE)
        work_sessions.append(finalize_ws(current_ws, dt))
        current_ws = None


    if MIN_WORKTIME > 0.0:
        # Filter out work sessions that are too short
        work_sessions = list(filter(
            lambda ws: ws.duration.total_seconds() >= MIN_WORKTIME * 60,
            work_sessions))


    if IGNORE_LONE_WS:
        # Filter out work sessions that only involve a single user
        work_sessions = list(filter(lambda ws: len(ws.user_ids) > 1,
                                    work_sessions))


    # Print out the results
    totals = {}
    ZERO_DT = datetime.timedelta(0)
    for ws in work_sessions:
        print("%s -> %s" % (ws.start, ws.end))
        print("  Duration:  %s" % ws.duration)
        uid_ws = ["%s (%.0f%%)" % (u, w*100) for u, w in sorted(ws.user_ids)]
        print("  Involving: %s" % ", ".join(uid_ws))
        print("")

        for user_id, percentage in ws.user_ids:
            w_dt = datetime.timedelta(
                seconds = ws.duration.total_seconds() * percentage)
            totals[user_id] = (totals.setdefault(user_id, ZERO_DT) + w_dt)

    print("Total work per user:")
    for user_id in sorted(totals.keys()):
        hours = totals[user_id].total_seconds() / 3600.0
        print("  %s: %.2f hours" % (user_id, hours))

Output

2018-06-20 14:17:12 -> 2018-06-20 15:13:12
  Duration:  0:56:00
  Involving: u033q5g5p (100%)

2018-06-26 14:39:12 -> 2018-06-26 23:26:12
  Duration:  8:47:00
  Involving: u033q5g5p (63%), u09jz6p60 (37%)

2018-06-27 09:13:12 -> 2018-06-27 12:28:12
  Duration:  3:15:00
  Involving: u033q5g5p (54%), u09jz6p60 (46%)

2018-06-28 11:03:12 -> 2018-06-29 01:51:12
  Duration:  14:48:00
  Involving: u033q5g5p (45%), u09jz6p60 (55%)

2018-06-29 08:58:12 -> 2018-06-30 02:25:12
  Duration:  17:27:00
  Involving: u033q5g5p (60%), u09jz6p60 (40%)

2018-07-02 09:33:12 -> 2018-07-03 04:44:12
  Duration:  19:11:00
  Involving: u033q5g5p (51%), u09jz6p60 (49%)

Total work per user:
  u033q5g5p: 35.13 hours
  u09jz6p60: 29.27 hours

Edit 1: Fixed timezone bug (the local timezone was always used).

Edit 2: Cleaned up the code and tweaked the constants. Added proportional work representation, and total work time per user. Added option to ignore work sessions involving only one user (enabled by setting IGNORE_LONE_WS to True). Added option to clip work sessions by the time of the beginning and end messages (enabled by setting CLIP_WS to True).

This is cool. Will review in detail when back at the office tomorrow.
Qdev 2 months ago
Winning solution

Here's another solution I came up with. This version is fast and a lot more configurable. All of the calculation and session extraction is done in a SINGLE pass over the original data (provided that the data is chronologically sorted).

With this solution, I'm not using any kind of window averaging/smoothing. Instead, power points are simply assigned to each message, and then the power decays until the next message event. If the power has decayed below a certain threshold, a new session is started; otherwise, the event is added to the existing session. The decay is applied as an exponent: power = power * decay_coef^decay_step. There are two decay coefficients: connection-decay and session-decay. The connection decay coefficient is used when a message can connect with another message before the power has decayed. Otherwise, if the message cannot connect to another message, the session decay coefficient is applied instead. This allows, for example, more sparse message connections, but with faster decay after the last message in a session, so there's not too much work duration lingering after a session is done.
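
To make the decay concrete: with the default parameters, a lone 20-word message gets 70 + 20*5 = 170 power and stays above the connection threshold of 30 for roughly 53 minutes:

import math

base_power, word_power = 70, 5        # defaults
threshold, coef, period = 30, 0.72, 600

power = base_power + 20 * word_power  # 170 for a 20-word message
steps = math.log(threshold / power) / math.log(coef)
print(steps * period / 60)            # ~52.8 minutes until power hits 30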

Argument documentation:

-b / --base-power: Base power that's applied to each message. Default: 70

-w / --word-power: Additional power per word that's added to each message. Default: 5

-c / --connection-power: Amount of power that's added to a message if it connects to another message before decaying. Default: 2000

-C / --connection-coef: Power decay coefficient that's used for message connections. HAS to be greater than or equal to session-coef, and strictly between 0 and 1. Lower values = faster decay. Default: 0.72

-S / --session-coef: Power decay coefficient that's used for the last message of a session. HAS to be less than or equal to connection-coef, and strictly between 0 and 1. Lower values = faster decay. Default: 0.72

-t / --threshold: Threshold under which messages can no longer connect in a single session. Default: 30

-p / --period: Period in seconds that defines one full decay step. Default: 60*10

-P / --plot: Plot the results. Useful for when fine-tuning the parameters. Default: False

-s / --skip-solo: Exclude sessions that have only one participant. Default: False

Here's the source:

#!/usr/bin/env python
import math
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument(dest='data_file', help='CSV data file with chat events', metavar='DATAFILE')
parser.add_argument('-b', '--base-power', help='Base power for every message', dest='base_power', default=70, type=int)
parser.add_argument('-w', '--word-power', help='Additional power for every word in a message', dest='word_power', default=5, type=int)
parser.add_argument('-c', '--connection-power', help='Amount of power that\'s added when one message connects with another, before power\'s decayed', dest='connection_power', default=2000, type=int)
parser.add_argument('-C', '--connection-coef', help='Decay coefficient that\'s used in attempt to connect to other messages. Has to be higher or equal to session coefficient, lower than 1 and higher than 0.', dest='connection_coef', default=0.72, type=float)
parser.add_argument('-S', '--session-coef', help='Decay coefficient that\'s applied to last session message, to calculate work decay. Has to be lower or equal to connection coefficient, lower than 1 and higher than 0.', dest='session_coef', default=0.72, type=float)
parser.add_argument('-t', '--threshold', help='Power threshold over which messages can connect to other messages', dest='threshold', default=30, type=int)
parser.add_argument('-p', '--period', help='Period in seconds that defines a single decay step', dest='period', default=60*10, type=int)
parser.add_argument('-P', '--plot', help='Plot results', dest='plot', action='store_true')
parser.add_argument('-s', '--skip-solo', help='Skip sessions with only one participant', dest='skip_solo', action='store_true')
args = parser.parse_args()

class Session:
    def __init__(self, start_ts):
        self.start_ts = start_ts
        self._words = 0
        self._users = set()
        self._message_ids = set()

    @property
    def start_ts(self):
        return self._start_ts

    @start_ts.setter
    def start_ts(self, start_ts):
        self._start_ts = int(start_ts)
        self._start = datetime.datetime.utcfromtimestamp(self._start_ts)

    @property
    def end_ts(self):
        return self._end_ts

    @end_ts.setter
    def end_ts(self, end_ts):
        self._end_ts = int(end_ts)
        self._end = datetime.datetime.utcfromtimestamp(self._end_ts)

    @property
    def start(self):
        return self._start

    @property
    def end(self):
        return self._end

    @property
    def words(self):
        return self._words

    @words.setter
    def words(self, words):
        self._words = words

    @property
    def users(self):
        return self._users

    @property
    def message_ids(self):
        return self._message_ids


def add_session_data(data_row, cols, session):
    session.words += data_row[cols['words']]
    session.users.add(data_row[cols['user']])
    session.message_ids.add(data_row[cols['client_msg_id']])


def calculate_power_decay(base_power, decay_step, coef):
    return base_power * math.pow(coef, decay_step)


def calculate_decay_step(base_power, target_power, coef):
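    # Inverse of calculate_power_decay: solves
    # target_power == base_power * coef**step for the step count.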
    return math.log(target_power/base_power)/math.log(coef)


def calculate_power(words, is_connecting=False):
    power = args.base_power + (words * args.word_power)
    if (is_connecting):
        power += args.connection_power
    return power


def extract_sessions(data):
    cols = {
        'words': data.columns.get_loc('word_count'),
        'user': data.columns.get_loc('user'),
        'timestamp': data.columns.get_loc('timestamp'),
        'client_msg_id': data.columns.get_loc('client_msg_id')
    }
    previous_ts = None
    current_session = None
    sessions = []

    for data_row in data.itertuples(index=False):
        current_ts = data_row[cols['timestamp']]
        if (previous_ts is None):
            # no previous timestamp set - the very first entry
            current_session = Session(current_ts)
            add_session_data(data_row, cols, current_session)
            power = calculate_power(data_row[cols['words']])
            previous_ts = current_ts
            continue

        previous_power = power
        decay_step = (current_ts - previous_ts) / args.period
        power = calculate_power_decay(power, decay_step, args.connection_coef)
        if (power < args.threshold):
            # session decayed and did not reach this entry
            last_step = calculate_decay_step(previous_power, args.threshold, args.session_coef)
            current_session.end_ts = previous_ts + (last_step * args.period)
            sessions.append(current_session)
            current_session = Session(current_ts)
            power += calculate_power(data_row[cols['words']])
        else:
            # session reached this entry before decaying
            power += calculate_power(data_row[cols['words']], is_connecting=True)

        add_session_data(data_row, cols, current_session)
        previous_ts = current_ts

    # append remaining decay for the final session (decay from the power at
    # the last message, not at the previous one)
    last_step = calculate_decay_step(power, args.threshold, args.session_coef)
    current_session.end_ts = current_ts + (last_step * args.period)
    sessions.append(current_session)

    return sessions


def log_sessions(sessions):
    for i in range(len(sessions)):
        print('[Session %i]' % (i + 1))
        print('Duration: %s' % (sessions[i].end - sessions[i].start))
        print('Total words: %i' % sessions[i].words)
        print('Total messages: %i' % len(sessions[i].message_ids))
        print('Participants: %s' % ', '.join(sessions[i].users))
        # print('Message ids: %s' % ', '.join(sessions[i].message_ids))
        print()


def plot(data, sessions):
    min_ts = min([int(sessions[0].start_ts), data.iloc[0]['timestamp']])
    max_ts = max([int(sessions[-1].end_ts), data.iloc[-1]['timestamp']])
    power_list = []
    power = 0
    is_connecting = False
    is_last = False

    for i in range(len(data)):
        power += calculate_power(data.iloc[i]['word_count'], is_connecting)
        power_list.append(power)

        if (i < len(data) - 1):
            ts_difference = data.iloc[i+1]['timestamp'] - data.iloc[i]['timestamp']
        else:
            ts_difference = max_ts - data.iloc[i]['timestamp']
            is_last = True

        if is_last or calculate_power_decay(power, ts_difference/args.period, args.connection_coef) < args.threshold:
            active_coef = args.session_coef
            is_connecting = False
        else:
            active_coef = args.connection_coef
            is_connecting = True

        for _ in range(ts_difference - 1):  # one power sample per second of the gap
            power = calculate_power_decay(power, 1/args.period, active_coef)
            power_list.append(power)

    _, axes = plt.subplots(nrows=3, ncols=1)
    axes[0].plot(power_list)
    axes[0].set_title('Sessions over power')
    axes[0].set_ylabel('Power')

    axes[1].scatter(data['timestamp'], data['word_count'])
    axes[1].set_title('Sessions over original data')
    axes[1].set_ylabel('Words')

    axes[2].scatter(data['timestamp'], [0]*len(data))
    axes[2].set_title('Sessions over message events')

    for session in sessions:
        axes[0].axvspan(session.start_ts - min_ts, session.end_ts - min_ts, alpha=0.25, color='green')
        axes[1].axvspan(session.start_ts, session.end_ts, alpha=0.25, color='green')
        axes[2].axvspan(session.start_ts, session.end_ts, alpha=0.25, color='green')

    plt.show()

def exclude_solo_sessions(sessions):
    filtered_sessions = []
    for i in range(len(sessions)):
        if len(sessions[i].users) > 1:
            filtered_sessions.append(sessions[i])

    return filtered_sessions


def load_data():
    data = pd.read_csv(args.data_file)
    return data.sort_values(by='timestamp')


if __name__ == '__main__':
    data = load_data()
    sessions = extract_sessions(data)

    if (args.skip_solo):
        sessions = exclude_solo_sessions(sessions)

    log_sessions(sessions)
    if (args.plot):
        plot(data, sessions)

Example output:

You'll need to use a CSV file for data input, with the word count header renamed to word_count. Here's a converted CSV file that can be used for this and my other solution: Pastebin

Default (9 sessions)

Arguments: $ ./extract_sessions.py data.csv --plot

Output: Pastebin

Plot: Imgur

Smaller clusters (15 sessions)

Arguments: $ ./extract_sessions.py data.csv --plot --base-power 100 --word-power 15 --connection-power 1000 --connection-coef 0.78 --session-coef 0.6 --threshold 20 --period 300

Output: Pastebin

Plot: Imgur

Larger clusters (7 sessions)

Arguments: $ ./extract_sessions.py data.csv --plot --base-power 200 --word-power 20 --connection-power 500 --connection-coef 0.8 --session-coef 0.3 --period 1200

Output: Pastebin

Plot: Imgur

EDIT: Split decay-coef into connection-coef and session-coef, more info in main description and under argument documentation. Also added more example configurations with outputs and plots.

EDIT2: Added converted csv file.

EDIT3: Minor formatting changes.
