Breaking down large CSV files

by dom111 on December 17th, 2009

Today I received a 45Mb CSV file for importing into a database… Needless to say the application we were importing to didn’t seem to like the size of the file, for what ever reason… So I knocked up a quite bash script to create smaller ‘chunks’ defined as a number of lines, to make importing simpler.

I’m sure there’s many way in which is can be simplified, so if you know any I’d like the contributions!

It’s run like this:

$ ./csv-chunk.sh large-data.csv 5000

The first argument being the filename and the second argument the maximum number of lines for each ‘chunk’. From that 45Mb megalith, 38 files of around 1.2Mb were produced which didn’t seem to break the other end!

Here’s the script:

#!/bin/bash
 
function help {
  echo "Usage:"
  echo "  $0 <csv filename> <number of lines>=5000"
  exit 1
}
 
if [ $# -eq 0 ]; then
  help
fi
 
if [ $# -eq 1 ]; then
  chunk=5000
else
  chunk=$2
fi
 
file=$1
 
if [ ! -e $file ]; then
  echo "File $file not found!"
  exit 5
fi
 
header=`head -n 1 $file`
max=`cat $file | wc -l`
x=1
 
echo "Breaking down $file ($max lines into $chunk lined files)"
 
for (( i=1; i<=$max; i+=$chunk )); do
  chunkfile="chunk-$x-$file"
 
  if [ -e $chunkfile ]; then
    echo "$chunkfile already exists!"
    exit 2
  fi
 
  `touch $chunkfile`
  echo $header > $chunkfile
 
  start=`expr $i + 1`
  end=`expr $i + $chunk`
 
  `sed $start,$end\!d $file >> $chunkfile`
 
  x=`expr $x + 1`
done
 
echo "Created $x files"
exit 0

Leave a Reply

Note: XHTML is allowed. Your email address will never be published.

Subscribe to this comment feed via RSS