Bash script to read from a Cassandra table and then output to CSV

I need a bash script that reads the last day's worth of log entries from the dse_perf.node_slow_log table in Cassandra and writes them to a CSV file saved into a directory.

Hi kgrigg, I don't own a Linux environment, but scraping the web helped. I have a simple, untested script ready; could you try it?
SilverHood Apps 8 days ago
Yeah for sure I'll try it.
kgrigg 8 days ago
The -u and -p flags are for the username and password; -e works only on newer versions.

#!/bin/bash
cqlsh -u cassandra -p cassandra -e "SELECT * FROM dse_perf.node_slow_log;" > output.csv
reference
SilverHood Apps 8 days ago
awarded to CyteBode


1 Solution


Try this:

#!/usr/bin/env bash

OUTPUT=${1:-output.csv}
TEMP_FILE="temp.csv"
DATE=$(date --date="yesterday" +%Y-%m-%d)
CQL_QUERY="COPY dse_perf.node_slow_log
TO '$TEMP_FILE'
WITH HEADER = true
AND DATETIMEFORMAT = '%Y-%m-%d';"

DATE_COLUMN=2
AWK_SCRIPT="BEGIN {OFS=\",\"} { if (NR == 1 || \$$DATE_COLUMN == \"$DATE\") print }"

echo "Outputting to $OUTPUT."
echo "Fetching entries from $DATE."
cqlsh -e "$CQL_QUERY" > /dev/null
NUM_ENTRIES=$(wc -l < "$TEMP_FILE")

awk -F ',' "$AWK_SCRIPT" "$TEMP_FILE" > "$OUTPUT"
rm "$TEMP_FILE"

NUM_LINES=$(wc -l < "$OUTPUT")
echo "Filtered $((NUM_LINES - 1)) entries for $DATE out of $((NUM_ENTRIES - 1))."

Call the script with no argument and it'll output to output.csv. Otherwise, it'll output to the filename provided as the first argument.
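The awk date filter at the heart of the script can be sanity-checked without Cassandra. This is a sketch with made-up sample rows, and "2018-11-06" standing in for yesterday's date:

```shell
# Standalone check of the awk date filter (fabricated sample rows;
# "2018-11-06" stands in for yesterday's date)
printf 'node_ip,date,start_time\n10.0.0.1,2018-11-06,aaa\n10.0.0.2,2018-11-07,bbb\n' > sample.csv
awk -F ',' 'BEGIN {OFS=","} { if (NR == 1 || $2 == "2018-11-06") print }' sample.csv
# prints the header plus only the 2018-11-06 row
```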

Notes

  • You'll have to add whatever arguments are needed to access the database after cqlsh (such as username/password).
  • You may need to remove sed's -E argument.

Edit 1: Switched from a SELECT statement to a COPY statement, using awk to filter by date instead. Added status messages.

Edit 2: Silenced the output of cqlsh. Refactor the temp file's filename to a variable. Reworded the last status message.

Edit 3: Added DATETIMEFORMAT to make date comparison actually work.

Hi, this worked great; however, I cannot query on the date as it is part of the partition key. I currently have a script that does a full table scan, limited to 50000 records, and then I'd like to parse the output by date to only get entries from yesterday to today. So essentially the script runs at midnight and gets the day's worth. Something like what I have put in the code block below. I wasn't sure what the sed script does, but I'm assuming it formats the output into CSV? I'm still learning it.

#!/bin/bash

DATE=$(date --date="yesterday" +%Y-%m-%d)
CQL_QUERY="SELECT * FROM dse_perf.node_slow_log LIMIT 50000;"
SED_SCRIPT='s/^\s+//; s/\s\|\s/,/g; /^-{3,}|^$|^\(.+\)$/d;'
cqlsh -e "$CQL_QUERY" | sed -E -e "$SED_SCRIPT" > "${1:-output.csv}"
kgrigg 7 days ago
Sorry, I couldn't seem to format that code block.
kgrigg 7 days ago
Right, I don't have DataStax Enterprise, so I just tested with a dummy table without really knowing what the actual schema of the table was. The sed script indeed formats to CSV by converting the output of the SELECT statement. If the filtering by date doesn't work and you don't mind fetching more entries than needed, I might know how to modify the script. Just give me a bit of time.
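To illustrate, here's roughly what that sed script does to cqlsh's tabular SELECT output. The input row is fabricated, the column padding is simplified to single spaces, and GNU sed is assumed (for -E and \s):

```shell
# Fake cqlsh SELECT output: header, separator, one row, blank line, row count.
# The sed script strips leading whitespace, turns " | " separators into commas,
# and deletes the dashed separator, blank lines, and the "(N rows)" footer.
printf ' node_ip | date\n---------+------------\n 10.0.0.1 | 2018-11-06\n\n(1 rows)\n' \
  | sed -E -e 's/^\s+//; s/\s\|\s/,/g; /^-{3,}|^$|^\(.+\)$/d;'
# leaves: node_ip,date
#         10.0.0.1,2018-11-06
```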
CyteBode 7 days ago
That would be great, yeah I need another way to filter by date other than via the query.
kgrigg 7 days ago
I'm gonna need to know what the data looks like. Could you run COPY dse_perf.node_slow_log to 'test.csv'; and show me what one of the rows in the resulting test.csv file looks like?
CyteBode 7 days ago
Sorry, can't seem to format or add attachments.
kgrigg 7 days ago
Okay, I've got something. It takes the result of the COPY statement (which copies everything to a CSV file) and filters the rows by date using awk, so that only the entries from yesterday are let through. A temp file (temp.csv) is created, but it's removed automatically.
CyteBode 7 days ago
That worked great - it output that it had exported 7000 rows to 1 file - but when I look at the output.csv it only has the table column names in it:

node_ip,date,start_time,commands,duration,parameters,source_ip,table_names,username

The echo output was:

Outputting to output.csv.
Fetching everything from 2018-11-05.
Using 16 child processes
Starting copy of dse_perf.node_slow_log with columns [node_ip, date, start_time, commands, duration, parameters, source_ip, table_names, username].
Processed: 7776 rows; Rate: 3098 rows/s; Avg. rate: 2148 rows/s
7776 rows exported to 1 files in 3.726 seconds.
Fetched 0 entries.
kgrigg 7 days ago
Hmmm. That's actually the output of cqlsh running the COPY statement. The actual result is 0 entries, which doesn't seem right unless there is no entry for yesterday.
CyteBode 7 days ago
When I do a SELECT COUNT(*) FROM dse_perf.node_slow_log; I get:

count

8136
kgrigg 7 days ago
If I don't remove temp.csv it has all the entries
kgrigg 7 days ago
Yeah, that's for sure, but I still don't know if the filtering logic with awk is working properly. (It's working for me on my end with my dummy table.) Can you quickly go through temp.csv and make sure there is no entry for yesterday?
CyteBode 7 days ago
Last entry was from 2018-11-06, first entry was from 2018-11-07
kgrigg 7 days ago
Just ran that and it didn't work. The output was:

Outputting to output.csv.
Fetching entries from 2018-11-05.
Filtered 0 entries for 2018-11-05 out of 8736.
kgrigg 7 days ago
Yeah, that means the filtering logic isn't working. I think you opened the CSV file in some program and copy-pasted a line from there when I asked you to show me a line, as there were no commas. I actually need to see the raw line, as it exists in the CSV file. Could you run tail -n 1 temp.csv and give me the output, after running the script with the rm $TEMP_FILE line commented out or removed?
CyteBode 7 days ago
10.131.128.195,2018-11-06 00:00:00+0000,b638bde0-e1f1-11e8-b20d-13d927956618,['SELECT * FROM OfficeWebAppsLockExpirations WHERE ExpirationDateUuid < now() AND Site = ? LIMIT 300 ALLOW FILTERING'],23,,10.131.150.76,{'atlas.officewebappslockexpirations'},atlas
kgrigg 7 days ago
Aha! So here's the problem. The date isn't just a date, but a timestamp, with hours and everything. The awk script was comparing it against a bare date. Running the COPY statement with DATETIMEFORMAT = '%Y-%m-%d' should make it work. It was sitting in plain sight when you last showed me the line, but the lack of commas kept me from seeing it.
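To see why it was failing (with a fabricated row): when column 2 carries a full timestamp, the exact string comparison against a bare date never matches:

```shell
# Column 2 holds a timestamp, so comparing it to a bare date string fails
printf 'node_ip,date\nx,2018-11-06 00:00:00+0000\n' \
  | awk -F ',' '{ if (NR == 1 || $2 == "2018-11-06") print }'
# prints only the header line "node_ip,date"
```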
CyteBode 7 days ago
So modifying the CQL_QUERY string?
kgrigg 7 days ago
Yeah, I made the change in the last edit. I dropped my dummy table, created a new one where the date field is a timestamp instead of just a date, and repopulated it with dummy rows. I tested the script with that new table and it works.
CyteBode 7 days ago
I posted the script in the solution section so the formatting is better. It's returning just the column headers again.
kgrigg 7 days ago
Okay, let's see. I put the row you gave me in a CSV file, changed the date to yesterday's and ran the awk script on it. It correctly output the row, so I know that part works. It could be that the DATETIMEFORMAT does nothing (which would be the case if date is actually a string instead of a timestamp). Can you verify that the date in temp.csv is now formatted as ONLY the date, without the time? Otherwise it might be that the DATE variable isn't right, so could you also verify that running date --date="yesterday" +%Y-%m-%d in the terminal correctly outputs yesterday's date, in the right format?
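The DATE-variable check can also be scripted; this is a sketch assuming GNU date (the --date flag isn't portable to BSD/macOS date):

```shell
# Sanity check that GNU date produces yesterday's date in %Y-%m-%d form
YESTERDAY=$(date --date="yesterday" +%Y-%m-%d)
echo "$YESTERDAY" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' && echo "format OK: $YESTERDAY"
```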
CyteBode 7 days ago
When I tail the last entry in temp.csv it is 10.131.128.195,2018-11-07,12c6a800-e220-11e8-b20d-13d927956618,['SELECT * FROM OfficeWebAppsLockExpirations WHERE ExpirationDateUuid < now() AND Site = ? LIMIT 300 ALLOW FILTERING'],22,,10.131.150.75,{'atlas.officewebappslockexpirations'},atlas
kgrigg 6 days ago
Okay, so the date format is fine. Could you add the line echo $AWK_SCRIPT right after the definition of AWK_SCRIPT=... and verify that the output is BEGIN {OFS=","} { if (NR == 1 || $2 == "2018-11-06") print } when you run the script? I'm honestly starting to run out of ideas as to why it's not working for you because it's definitely working for me. Could you tell me what operating system you're running, and what your locale is (by running locale)? If it's a locale issue, adding the line LC_ALL=C at the beginning of the script might make things work.
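For reference, here's what that check looks like in isolation, with yesterday hard-coded as 2018-11-06 for the example:

```shell
# Reproduce the AWK_SCRIPT expansion outside the script:
# \$ keeps a literal $, $DATE_COLUMN and $DATE expand, \" becomes "
DATE_COLUMN=2
DATE=2018-11-06
AWK_SCRIPT="BEGIN {OFS=\",\"} { if (NR == 1 || \$$DATE_COLUMN == \"$DATE\") print }"
echo "$AWK_SCRIPT"
# BEGIN {OFS=","} { if (NR == 1 || $2 == "2018-11-06") print }
```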
CyteBode 6 days ago
Ok, it's working, must have been an error on my end, yay!!
kgrigg 6 days ago
Thanks so much!!
kgrigg 6 days ago
I'm glad to hear that, and thanks for awarding me the bounty!
CyteBode 6 days ago
Thanks for the tip!
CyteBode 6 days ago