Bash script to read from a Cassandra table and then output to csv
New here? Learn about Bountify and follow @bountify to get notified of new bounties! x

I need a bash script that reads the last day's worth of log entries from the dse_perf.node_slow_log table in Cassandra, puts these into a csv file which is then saved into a directory.

Hi kgrigg, I don't own a linux envt. But scraping the web helped. I have a simple untested script ready could you try it?
SilverHood Apps 8 months ago
Yeah for sure I'll try it.
kgrigg 8 months ago
The -u & -p are for username n password, -e works only for new versions.
#!/bin/bash cqlsh -u cassandra -p cassandra -e "SELECT * from dse_perf.node_slow_log;" > output.csv
SilverHood Apps 8 months ago
awarded to CyteBode

Crowdsource coding tasks.

1 Solution

Try this:

#!/usr/bin/env bash

DATE=`date --date="yesterday" +%Y-%m-%d`
CQL_QUERY="COPY dse_perf.node_slow_log

AWK_SCRIPT="BEGIN {OFS=\",\"} { if (NR == 1 || \$$DATE_COLUMN == \"$DATE\") print }"

echo "Outputting to $OUTPUT."
echo "Fetching entries from $DATE."
cqlsh -e "$CQL_QUERY" > /dev/null


echo "Filtered $((NUM_LINES - 1)) entries for $DATE out of $((NUM_ENTRIES - 1))."

Call the script with no argument and it'll output to output.csv. Otherwise, it'll output to the filename provided as the first argument.


  • You'll have to add whatever arguments are needed to access the database after cqlsh (such as username/password).
  • You many need to remove sed's -E argument.

Edit 1: Switched from a SELECT statement to a COPY statement, using awk to filter by date instead. Added status messages.

Edit 2: Silenced the output of cqlsh. Refactor the temp file's filename to a variable. Reworded the last status message.

Edit 3: Added DATETIMEFORMAT to make date comparison actually work.

Hi this worked great however I cannot query on the date as it is part of the partition key. I currently have a script where I need to do a full table scan, limiting it to 50000 records. But then I'd like to parse the output via date to only get entries for yesterday to today. So essentially the script runs at midnight, gets the days worth. So something like what I have put in the code block below. I wasn't sure what the sed script does but assuming it formats it into csv? I'm still learning it


DATE=date --date="yesterday" +%Y-%m-%d CQLQUERY="SELECT * FROM dseperf.nodeslowlog LIMIT 50000"; SEDSCRIPT="s/^\s+//; s\s|\s,g; /^-{3,}|^$|^(.+)$/d;"; cqlsh -e "$CQLQUERY" | sed -E -e "$SEDSCRIPT" > ${1:-output.csv}
kgrigg 8 months ago
Sorry couldn't seem to format that code block
kgrigg 8 months ago
Right, I don't have DataStax Enterprise, so I just tested with a dummy table without really knowing what the actual schema of the table was. The sed script indeed formats to CSV by converting the output of the SELECT statement. If the filtering by date doesn't work and you don't mind fetching more entries than needed, I might know how to modify the script. Just give me a bit of time.
CyteBode 8 months ago
That would be great, yeah I need another way to filter by date other than via the query.
kgrigg 8 months ago
I'm gonna need to know what the data looks like. Could you run COPY dse_perf.node_slow_log to 'test.csv'; and show me what one of the rows in the resulting test.csv file looks like?
CyteBode 8 months ago
Sorry cant seem to format or add attachments
kgrigg 8 months ago
Okay, I've got something. It filters the result of the COPY statement (which copies everything to a CSV file) and filters the rows by date using awk in order to only let the entries from yesterday. A temp file is created (temp.csv) but it's removed automatically.
CyteBode 8 months ago
That worked great- it outputted that it had exported 7000 rows to 1 file- but when I look at the output.csv it only has the table column names in it: 1 nodeip,date,starttime,commands,duration,parameters,sourceip,tablenames,username The echo output was: Outputting to output.csv. Fetching everything from 2018-11-05. Using 16 child processes Starting copy of dseperf.nodeslowlog with columns [nodeip, date, starttime, commands, duration, parameters, sourceip, table_names, username] . Processed: 7776 rows; Rate: 3098 rows/s; Avg. rate: 2148 rows/s 7776 rows exported to 1 files in 3.726 seconds. Fetched 0 entries.
kgrigg 8 months ago
Hmmm. That's actually the output of cqlsh with the COPY statement. The actual result is 0 entries which doesn't seem right unless there is no entry for yesterday.
CyteBode 8 months ago
When I do a select count(*) from dseperf.nodeslow_log;


kgrigg 8 months ago
If I don't remove temp.csv it has all the entries
kgrigg 8 months ago
Yeah, that's for sure, but I still don't know if the filtering logic with awk is working properly. (It's working for me on my end with my dummy table.) Can you quickly go through temp.csv and make sure there is no entry for yesterday?
CyteBode 8 months ago
Last entry was from 2018-11-06, first entry was from 2018-11-07
kgrigg 8 months ago
Just ran that and it didn't work, output was Outputting to output.csv. Fetching entries from 2018-11-05. Filtered 0 entries for 2018-11-05 out of 8736.
kgrigg 8 months ago
Yeah, that means the filtering logic isn't working. I think you used a software to open the CSV file and copy-paste a line when I asked you to show me a line, as there were no commas. I actually need to see the raw line, as it exists in the CSV file. Could you run tail -n 1 temp.csv and give me the output, after running the script with the rm $TEMP_LINE commented out or removed?
CyteBode 8 months ago,2018-11-06 00:00:00+0000,b638bde0-e1f1-11e8-b20d-13d927956618,['SELECT * FROM OfficeWebAppsLockExpirations WHERE ExpirationDateUuid < now() AND Site = ? LIMIT 300 ALLOW FILTERING'],23,,,{'atlas.officewebappslockexpirations'},atlas
kgrigg 8 months ago
Aha! So here's the problem. The date isn't a just date, but a timestamp, with hours and stuff. The awk script was just comparing against a date. Running the COPY statement with DATETIMEFORMAT='%Y-%m-%d' should make it work. That was sitting in plain sight when you last showed me the line, but the lack of commas kept me from seeing it.
CyteBode 8 months ago
So modifying the CQL_QUERY string?
kgrigg 8 months ago
Yeah, I made the change in the last edit. I dropped my dummy table, re-created a new one where the date field is a timestamp instead of just a date and repopulated it with dummy rows. I tested the script with that new table and it works.
CyteBode 8 months ago
I posted the script in the solution section so the formatting is better. returning just the column headers again
kgrigg 8 months ago
Okay, let's see. I put the row you gave me in a CSV file, changed the date to yesterday and ran the awk script on it. It correctly output the row, so I know that works. It could be that the DATETIMEFORMAT does nothing (which would be the case if date is actually a string instead of a timestamp). Can you verify that the date in temp.csv is now correctly formatted as ONLY the date, without the time? Otherwise it might be that the DATE variable isn't right. So could you verify that running date --date="yesterday" +%Y-%m-%d in the terminal correctly outputs yesterday's date, in the right format?
CyteBode 8 months ago
When I tail the last entry in temp.csv it is,2018-11-07,12c6a800-e220-11e8-b20d-13d927956618,['SELECT * FROM OfficeWebAppsLockExpirations WHERE ExpirationDateUuid < now() AND Site = ? LIMIT 300 ALLOW FILTERING'],22,,,{'atlas.officewebappslockexpirations'},atlas
kgrigg 8 months ago
Okay, so the date format is fine. Could you add the line echo $AWK_SCRIPT right after the definition of AWK_SCRIPT=... and verify that the output is BEGIN {OFS=","} { if (NR == 1 || $2 == "2018-11-06") print } when you run the script? I'm honestly starting to run out of ideas as to why it's not working for you because it's definitely working for me. Could you tell me what operating system you're running, and what your locale is (by running locale)? If it's a locale issue, adding the line LC_ALL=C at the beginning of the script might make things work.
CyteBode 8 months ago
Ok it's working, must have been an error my end, yayy!!
kgrigg 8 months ago
Thanks so much!!
kgrigg 8 months ago
I'm glad to hear that, and hanks for awarding me the bounty!
CyteBode 8 months ago
Thanks for the tip!
CyteBode 8 months ago