Search log aggregation utility (python)

I have a search engine for a document management system, and it logs what people search for so we can do analytics on it later. As part of this we log the incoming queries, which ends up being a lot of redundant data since the client-side debounce waits for pauses. Slow typists end up executing a query for many of their keystrokes, and it gets messy on the analytics side. I am looking for a working Python script that will take JSON data as seen here: https://gist.github.com/quotient/7bddb455ab75bcd17fc50f46a0f26745

And output a tabular dataset that we can store in a common SQL table. For the purpose of the bounty you can just output a CSV. We can use https://replit.com/ for getting your bounty working. You can presume the JSON file is local to your code; we then run your code and it outputs something like output.csv. I have modeled the output table for us here: https://docs.google.com/spreadsheets/d/1FpfnvNsYuJe1yHAL-uxdrVGuA-XXGlnfVJ0wScn79VM/edit?usp=sharing (we can keep a lot of the fields). A couple of notes. We did a project a while back to collapse similar branch terms to single "root" instances (https://wuddrum.github.io/collapse-words/); you might find it useful even though it's in JS. You can read about that bounty here: https://bountify.co/collapse-list-of-words-to-get-unique-root-spellings-javascript. As you can see it was solving a different issue, but it was great at collapsing terms. Essentially the issue is that if we have entries like

I

I am

I am sear

I am searching

I am searching f

I am searching for something

We need to be able to collapse all of these queries into a single event like

I am searching for something {timestamp}

Now we can confidently say that if we see this in the logs within ~5 seconds, it should be collapsed to a single event. If we see “I am searching”, then a 30-second pause, and then they keep typing to get “I am searching for something”, we might want to consider that two searches. My thought is that we have a threshold to consider: the time between entries (keystrokes) before considering it a second search. It would be awesome if we could think of a way to make this dynamic, so that longer queries behave differently than shorter ones, but we might have to see the outcome of a simple threshold before we get fancy.
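For concreteness, here is a minimal sketch of that simple-threshold idea (the record shape and timestamps are made up for illustration): consecutive entries are merged whenever the gap between them is at or under the threshold, and the last query in a run wins.

```python
from datetime import datetime, timedelta

# Hypothetical (timestamp, query) pairs for a single user, sorted by time.
ENTRIES = [
    (datetime(2021, 10, 1, 12, 0, 0), "I"),
    (datetime(2021, 10, 1, 12, 0, 1), "I am"),
    (datetime(2021, 10, 1, 12, 0, 3), "I am sear"),
    (datetime(2021, 10, 1, 12, 0, 4), "I am searching"),
    (datetime(2021, 10, 1, 12, 0, 40), "I am searching for something"),
]

def collapse(entries, threshold=timedelta(seconds=5)):
    """Merge consecutive entries whose gap is <= threshold, keeping the last query."""
    events = []
    for ts, query in entries:
        if events and ts - events[-1][0] <= threshold:
            events[-1] = (ts, query)    # still typing the same search: overwrite
        else:
            events.append((ts, query))  # pause exceeded the threshold: new search
    return events

for ts, query in collapse(ENTRIES):
    print(ts.isoformat(), query)
# 2021-10-01T12:00:04 I am searching
# 2021-10-01T12:00:40 I am searching for something
```

With a 5-second threshold, the 36-second pause before the last entry splits the run into two events, matching the intent above.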

Now we need to make sure we scope our aggregation to the same user within (roughly) the same time window, and ideally be smart enough to catch someone who searched a term. The way we need to think of a user is UserID + Portal_ID + username + customer_name. This way we can be sure we don't wrongly aggregate something across customers.

Last detail: you can find the search term in the field called “eventType”.
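Roughly, pulling the scoping key and search term out of each record might look like this (every field name except “eventType”, and the filename, are assumptions based on the description above):

```python
import json
from collections import defaultdict

def user_key(record):
    # Composite identity so we never aggregate across customers.
    # All field names except "eventType" are assumptions, not the real schema.
    return (
        record.get("UserID"),
        record.get("Portal_ID"),
        record.get("username"),
        record.get("customer_name"),
    )

with open("search_log.json") as f:  # assumed local filename
    records = json.load(f)

# Bucket records per user; the search term itself lives in "eventType".
by_user = defaultdict(list)
for rec in records:
    by_user[user_key(rec)].append((rec.get("timestamp"), rec["eventType"]))
```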

One other use case we want to make sure is covered by the above logic: suppose the user hits backspace and the search term gets smaller, like

I am searching

I am searchi

If our timestamps are to be trusted, we should assume the user is working on an updated search, so we would treat this as starting a new search even though it would technically collapse to the same aggregated term.
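One way to encode that rule is a plain prefix test: if the new query no longer extends the previous one, start a new search even when the gap is inside the threshold. A sketch:

```python
def extends(previous, current):
    """True if `current` looks like continued typing of `previous`."""
    return current.strip().startswith(previous.strip())

print(extends("I am searching", "I am searchi"))  # False: backspace, new search
print(extends("I am sear", "I am searching"))     # True: same search, keep typing
```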

Now let’s look at something more complicated. Suppose we have these searches:
I

I am

I am sear

I am searching

I am

I am here

This should be aggregated into 2 final searches:

I am searching

I am here
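Combining the gap threshold with the prefix test reproduces exactly that outcome; a minimal end-to-end sketch (timestamps are fabricated, one second apart):

```python
from datetime import datetime, timedelta

QUERIES = ["I", "I am", "I am sear", "I am searching", "I am", "I am here"]
# Made-up timestamps, one second apart, for demonstration only.
ENTRIES = [(datetime(2021, 10, 1, 12, 0, i), q) for i, q in enumerate(QUERIES)]

def aggregate(entries, threshold=timedelta(seconds=5)):
    events = []
    for ts, query in entries:
        if events:
            last_ts, last_query = events[-1]
            within_gap = ts - last_ts <= threshold
            still_typing = query.strip().startswith(last_query.strip())
            if within_gap and still_typing:
                events[-1] = (ts, query)  # same search: keep the latest form
                continue
        events.append((ts, query))        # long pause or backspace: new search
    return events

for _, query in aggregate(ENTRIES):
    print(query)
# I am searching
# I am here
```

The fifth entry (“I am”) fails the prefix test against “I am searching”, so it opens a second event, and “I am here” then extends it.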

Hope this makes sense; please ask questions.

So if I'm reading this correctly, you want a script that takes a JSON file and outputs a CSV file while aggregating any similar information?
Edgy 1 month ago
Hi, just wanted to find out if my solution solved your problem? Thank you
Yellow Duckling 16 days ago
Tags
python


1 Solution

Winning solution

I have written a solution in Python 3 that performs the required task. It currently supports either no time span or a static time span between records that should be considered acceptable (e.g. 5 seconds). It narrows the number of records down from 1900+ to 485 rows when using a 5-second timespan. This time span is configurable when you run the program, and a summary shown at the end of the aggregation lets you compare the effectiveness of different time thresholds. If you do want a dynamic time threshold we can take a look; however, from my testing with the sample data, a static time threshold appears to work well.

The aggregated records are then written to a CSV file called output.csv according to your specifications.
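As a rough illustration of the CSV step only (the column names below are placeholders; the actual output follows your spreadsheet model):

```python
import csv

# Placeholder rows; the real columns should follow the spreadsheet model.
rows = [
    {"user_key": "42|7|jdoe|acme", "search_term": "I am searching",
     "timestamp": "2021-10-01T12:00:04"},
    {"user_key": "42|7|jdoe|acme", "search_term": "I am here",
     "timestamp": "2021-10-01T12:00:05"},
]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_key", "search_term", "timestamp"])
    writer.writeheader()
    writer.writerows(rows)
```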

You can find the code on Replit: https://replit.com/@ivankahl/SearchLogAggregation#main.py

Please let me know what you think and if you require any changes.

Sorry for the late reply, I didn’t get a notification. I’ll check this tomorrow!
Qdev 1 month ago
Just wanted to check if this helped solve your problem?
Yellow Duckling 28 days ago
Hi, just following up to see if this solution matches your requirements?
Yellow Duckling 25 days ago
Hi, just wanted to find out if this solved your problem? Thanks!
Yellow Duckling 20 days ago
Hey there, any chance you can modify the solution to add 2 more columns with a date-time output like you see here: https://www.dropbox.com/s/bgf6ateokp2uear/2021-10-13_09-30-19.png?dl=0
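The exact format lives in that screenshot, but assuming the extra columns are human-readable date-times derived from epoch-millisecond timestamps (an assumption about the log format), the conversion could look like this:

```python
from datetime import datetime, timezone

def to_datetime_string(epoch_ms):
    """Render an epoch-milliseconds timestamp as e.g. '2021-10-13 09:30:19' (UTC)."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(to_datetime_string(1634117419000))  # 2021-10-13 09:30:19
```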
Qdev 8 days ago