Breaking down large CSV files
Today I received a 45Mb CSV file for importing into a database… Needless to say the application we were importing to didn’t seem to like the size of the file, for what ever reason… So I knocked up a quite bash script to create smaller ‘chunks’ defined as a number of lines, to make importing simpler.
I’m sure there’s many way in which is can be simplified, so if you know any I’d like the contributions!
It’s run like this:
$ ./csv-chunk.sh large-data.csv 5000
The first argument being the filename and the second argument the maximum number of lines for each ‘chunk’. From that 45Mb megalith, 38 files of around 1.2Mb were produced which didn’t seem to break the other end!
Here’s the script:
#!/bin/bash function help { echo "Usage:" echo " $0 <csv filename> <number of lines>=5000" exit 1 } if [ $# -eq 0 ]; then help fi if [ $# -eq 1 ]; then chunk=5000 else chunk=$2 fi file=$1 if [ ! -e $file ]; then echo "File $file not found!" exit 5 fi header=`head -n 1 $file` max=`cat $file | wc -l` x=1 echo "Breaking down $file ($max lines into $chunk lined files)" for (( i=1; i<=$max; i+=$chunk )); do chunkfile="chunk-$x-$file" if [ -e $chunkfile ]; then echo "$chunkfile already exists!" exit 2 fi `touch $chunkfile` echo $header > $chunkfile start=`expr $i + 1` end=`expr $i + $chunk` `sed $start,$end\!d $file >> $chunkfile` x=`expr $x + 1` done echo "Created $x files" exit 0