<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>dom111.co.uk &#187; large csv</title>
	<atom:link href="http://www.dom111.co.uk/blog/tag/large-csv/feed" rel="self" type="application/rss+xml" />
	<link>http://www.dom111.co.uk/blog</link>
	<description>Move along. Nothing to see here.</description>
	<lastBuildDate>Wed, 26 Oct 2011 16:37:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Breaking down large CSV files</title>
		<link>http://www.dom111.co.uk/blog/coding/breaking-down-large-csv-files/214</link>
		<comments>http://www.dom111.co.uk/blog/coding/breaking-down-large-csv-files/214#comments</comments>
		<pubDate>Thu, 17 Dec 2009 13:53:51 +0000</pubDate>
		<dc:creator>dom111</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Shell Scripting]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[csv]]></category>
		<category><![CDATA[large csv]]></category>

		<guid isPermaLink="false">http://www.dom111.co.uk/blog/?p=214</guid>
		<description><![CDATA[Today I received a 45MB CSV file for importing into a database&#8230; Needless to say, the application we were importing into didn&#8217;t seem to like the size of the file, for whatever reason&#8230; So I knocked up a quick bash script to split it into smaller &#8216;chunks&#8217;, each limited to a given number of lines, to make importing [...]]]></description>
			<content:encoded><![CDATA[<p>Today I received a 45MB CSV file for importing into a database&#8230; Needless to say, the application we were importing into didn&#8217;t seem to like the size of the file, for whatever reason&#8230; So I knocked up a quick bash script to split it into smaller &#8216;chunks&#8217;, each limited to a given number of lines, to make importing simpler.</p>
<p>I&#8217;m sure there are many ways in which it can be simplified, so if you know of any, I&#8217;d welcome contributions!</p>
<p>It&#8217;s run like this:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ .<span style="color: #000000; font-weight: bold;">/</span>csv-chunk.sh large-data.csv <span style="color: #000000;">5000</span></pre></div></div>

<p>The first argument is the filename and the second is the maximum number of data lines for each &#8216;chunk&#8217; (5000 if omitted). The header line is repeated at the top of every chunk. From that 45MB megalith, 38 files of around 1.2MB each were produced, which didn&#8217;t seem to break the other end!</p>
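<p>A rough way to predict how many chunk files a run will produce is to divide the number of data lines (everything after the header) by the chunk size and round up. The sketch below assumes a single header line; <code>chunk_count</code> is a hypothetical helper for illustration, not part of the script.</p>

```shell
# Hypothetical helper: estimate how many chunk files to expect
# for a CSV with exactly one header line.
chunk_count() {
  local file=$1 size=$2
  local total rows

  # awk counts the final line even when it has no trailing newline
  total=$(awk 'END { print NR }' "$file")

  # Everything after the header is data; guard against an empty file
  rows=$(( total > 0 ? total - 1 : 0 ))

  # Ceiling division: a partial last chunk still needs its own file
  echo $(( (rows + size - 1) / size ))
}
```

<p>If the run above used 5000 lines per chunk, 38 files would correspond to somewhere between 185,001 and 190,000 data lines.</p>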
<p><span id="more-214"></span>Here&#8217;s the script:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">#!/bin/bash</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">function</span> <span style="color: #7a0874; font-weight: bold;">help</span> <span style="color: #7a0874; font-weight: bold;">&#123;</span>
  <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;Usage:&quot;</span>
  <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;  $0 &lt;csv filename&gt; &lt;number of lines&gt;=5000&quot;</span>
  <span style="color: #7a0874; font-weight: bold;">exit</span> <span style="color: #000000;">1</span>
<span style="color: #7a0874; font-weight: bold;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #007800;">$#</span> <span style="color: #660033;">-eq</span> <span style="color: #000000;">0</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span>
  <span style="color: #7a0874; font-weight: bold;">help</span>
<span style="color: #000000; font-weight: bold;">fi</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #007800;">$#</span> <span style="color: #660033;">-eq</span> <span style="color: #000000;">1</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span>
  <span style="color: #007800;">chunk</span>=<span style="color: #000000;">5000</span>
<span style="color: #000000; font-weight: bold;">else</span>
  <span style="color: #007800;">chunk</span>=<span style="color: #007800;">$2</span>
<span style="color: #000000; font-weight: bold;">fi</span>
&nbsp;
<span style="color: #007800;">file</span>=<span style="color: #007800;">$1</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #000000; font-weight: bold;">!</span> <span style="color: #660033;">-e</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$file</span>&quot;</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span>
  <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;File <span style="color: #007800;">$file</span> not found!&quot;</span>
  <span style="color: #7a0874; font-weight: bold;">exit</span> <span style="color: #000000;">5</span>
<span style="color: #000000; font-weight: bold;">fi</span>
&nbsp;
<span style="color: #007800;">header</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">head</span> <span style="color: #660033;">-n</span> <span style="color: #000000;">1</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$file</span>&quot;</span><span style="color: #000000; font-weight: bold;">`</span>
<span style="color: #007800;">max</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">wc</span> -l <span style="color: #000000; font-weight: bold;">&lt;</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$file</span>&quot;</span><span style="color: #000000; font-weight: bold;">`</span>
<span style="color: #007800;">x</span>=<span style="color: #000000;">1</span>
&nbsp;
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;Breaking down <span style="color: #007800;">$file</span> (<span style="color: #007800;">$max</span> lines) into files of up to <span style="color: #007800;">$chunk</span> lines&quot;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">for</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#40;</span> <span style="color: #007800;">i</span>=<span style="color: #000000;">1</span>; i<span style="color: #000000; font-weight: bold;">&lt;</span>=<span style="color: #007800;">$max</span>; i+=<span style="color: #007800;">$chunk</span> <span style="color: #7a0874; font-weight: bold;">&#41;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>; <span style="color: #000000; font-weight: bold;">do</span>
  <span style="color: #007800;">chunkfile</span>=<span style="color: #ff0000;">&quot;chunk-<span style="color: #007800;">$x</span>-<span style="color: #007800;">$file</span>&quot;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #660033;">-e</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$chunkfile</span>&quot;</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span>
    <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$chunkfile</span> already exists!&quot;</span>
    <span style="color: #7a0874; font-weight: bold;">exit</span> <span style="color: #000000;">2</span>
  <span style="color: #000000; font-weight: bold;">fi</span>
&nbsp;
  <span style="color: #666666; font-style: italic;"># Start each chunk with the header row (quoted so its whitespace is preserved)</span>
  <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$header</span>&quot;</span> <span style="color: #000000; font-weight: bold;">&gt;</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$chunkfile</span>&quot;</span>
&nbsp;
  <span style="color: #007800;">start</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">expr</span> <span style="color: #007800;">$i</span> + <span style="color: #000000;">1</span><span style="color: #000000; font-weight: bold;">`</span>
  <span style="color: #007800;">end</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">expr</span> <span style="color: #007800;">$i</span> + <span style="color: #007800;">$chunk</span><span style="color: #000000; font-weight: bold;">`</span>
&nbsp;
  <span style="color: #666666; font-style: italic;"># Append the data lines for this chunk</span>
  <span style="color: #c20cb9; font-weight: bold;">sed</span> <span style="color: #660033;">-n</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">${start}</span>,<span style="color: #007800;">${end}</span>p&quot;</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$file</span>&quot;</span> <span style="color: #000000; font-weight: bold;">&gt;&gt;</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$chunkfile</span>&quot;</span>
&nbsp;
  <span style="color: #007800;">x</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">expr</span> <span style="color: #007800;">$x</span> + <span style="color: #000000;">1</span><span style="color: #000000; font-weight: bold;">`</span>
<span style="color: #000000; font-weight: bold;">done</span>
&nbsp;
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;Created <span style="color: #007800;">$((x - 1))</span> files&quot;</span>
<span style="color: #7a0874; font-weight: bold;">exit</span> <span style="color: #000000;">0</span></pre></div></div>
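
<p>One possible simplification, offered as a sketch rather than a drop-in replacement: a single <code>awk</code> pass can write every chunk without re-reading the file for each one. The function name <code>csv_chunk</code> and the use of <code>basename</code> (so that chunks land in the current directory) are assumptions of this example. Unlike the script above, it overwrites existing chunk files instead of exiting. Like the script, it splits on physical lines, so quoted fields containing newlines are not handled.</p>

```shell
# csv_chunk FILE [LINES]
# Hypothetical single-pass variant: writes chunk-N-<name> files into the
# current directory and prints how many files were written.
csv_chunk() {
  local file=$1 chunk=${2:-5000}
  local name
  name=$(basename "$file")

  awk -v size="$chunk" -v name="$name" '
    NR == 1 { header = $0; next }        # remember the header row
    (NR - 2) % size == 0 {               # first data row of a new chunk
      if (out != "") close(out)          # finish the previous chunk file
      out = "chunk-" (++n) "-" name
      print header > out                 # every chunk starts with the header
    }
    { print > out }                      # append the current data row
    END { print n + 0 }                  # number of chunk files written
  ' "$file"
}
```

<p>Calling <code>csv_chunk large-data.csv 5000</code> mirrors the invocation shown earlier; the second argument defaults to 5000 when omitted.</p>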

]]></content:encoded>
			<wfw:commentRss>http://www.dom111.co.uk/blog/coding/breaking-down-large-csv-files/214/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

