chunk – Read a large (XML) file a chunk at a time

by dom111 on July 10th, 2009

I’ve recently had to parse some pretty large XML documents, and needed a method to read one element at a time.

Here’s a fairly simple solution in PHP and Ruby form (hopefully a Python one coming soon…).

If you have the following file (complex-test.xml):

<?xml version="1.0" encoding="UTF-8"?>
<Complex>
  <Object>
    <Title>Title 1</Title>
    <Name>Its name goes here</Name>
    <ObjectData>
      <Info1></Info1>
      <Info2></Info2>
      <Info3></Info3>
      <Info4></Info4>
    </ObjectData>
    <Date></Date>
  </Object>
  <Object></Object>
  <Object>
    <AnotherObject></AnotherObject>
    <Data></Data>
  </Object>
  <Object></Object>
  <Object></Object>
</Complex>

and want to read the <Object> elements one at a time, you can do it like this:

PHP:

require_once('class.chunk.php');
 
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
 
while ($xml = $file->read()) {
  $obj = simplexml_load_string($xml);
  // do some parsing, insert to DB whatever
}

Ruby:

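require 'rexml/document'
require './chunk' # chunk.rb in the current directory
 
file = Chunk.new('complex-test.xml', { 'element' => 'Object' })
 
# each call to read hands back one <Object> element as a string
while (xml = file.read)
  doc = REXML::Document.new(xml)
  # do some parsing, blah...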
end
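
As a rough idea of what the "do some parsing" step might look like, here is a small REXML sketch. The chunk string is typed in by hand to mimic one <Object> from the sample file (it is an assumption about what read hands back, not captured output), and the variable names are just for illustration.

```ruby
require 'rexml/document'

# A hand-written stand-in for one chunk; assumed shape, not real output.
xml = <<~XML
  <Object>
    <Title>Title 1</Title>
    <Date></Date>
  </Object>
XML

doc = REXML::Document.new(xml)
object = doc.root                        # the <Object> element itself

title = object.elements['Title'].text    # text of the first <Title> child
date  = object.elements['Date'].text     # nil, because <Date> has no text

puts "#{object.name}: #{title} (date: #{date.inspect})"
```

Empty elements come back with nil text, so it is worth checking for that before inserting anything into a database.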

It (probably) doesn’t work when the elements you’re extracting are themselves nested inside one another, but my use for it didn’t need that.
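
For anyone curious about the general idea, here is a minimal sketch of this kind of reader. It is not the code from chunk.rb: SimpleChunkReader, its arguments and the 16-byte block size are all made up for the example, and it assumes the wanted element always has an explicit closing tag.

```ruby
require 'stringio'

# Hypothetical sketch, NOT the Chunk class from this post: pulls an IO in
# fixed-size blocks and returns one <element>...</element> string per call.
class SimpleChunkReader
  def initialize(io, element, block_size = 1024)
    @io = io
    # Opening tag followed by whitespace or ">", so <ObjectData> is not
    # mistaken for <Object>.
    @open_tag = /<#{Regexp.escape(element)}[\s>]/
    @close_tag = "</#{element}>"
    @block_size = block_size
    @buffer = String.new
  end

  # Returns the next element as a string, or nil when none are left.
  def read
    loop do
      start = @buffer.index(@open_tag)
      if start
        finish = @buffer.index(@close_tag, start)
        if finish
          finish += @close_tag.length
          chunk = @buffer[start...finish]
          @buffer = @buffer[finish..-1] # keep only what follows the element
          return chunk
        end
      end
      block = @io.read(@block_size)     # nil once the input is exhausted
      return nil if block.nil?
      @buffer << block
    end
  end
end

xml = '<Complex><Object><Title>Title 1</Title></Object><Object></Object></Complex>'
reader = SimpleChunkReader.new(StringIO.new(xml), 'Object', 16)

chunks = []
while (chunk = reader.read)
  chunks << chunk
end
puts chunks
```

Because this sketch stops at the first closing tag it finds, an <Object> nested inside another <Object> would be cut short, which is one way a nesting problem like the one mentioned above can arise.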

You can get the PHP version here, and the Ruby one here.

Update: The class was accepted onto PHP Classes as a notable package! 04/08/2009.


5 Comments
  1. Have you considered vtd-xml for chunk extraction?

    http://vtd-xml.sf.net

  2. This code helped me parse some very large files quickly, however when I try to create more than one chunk instance in a script, it will error out. I am not quite sure why this is happening. I receive the following exception when I try to read the second XML file in:

    ./chunk.rb:223:in `+': can't convert nil into String (TypeError)
    	from ./chunk.rb:223:in `read'
    

    I am calling Chunk.new again, so I am not sure why a second execution would cause this behavior. If I execute each import separately, everything works. Any help is greatly appreciated.

    • Hi there Ben,

      I’ll certainly have a play with it and see if it’s something I can resolve quickly. I’m not sure why it would have any impact, though…

      Thanks for pointing this out!

      Dom
