chunk – Read a large (XML) file a chunk at a time
I’ve recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
Here’s a fairly simple solution in PHP and Ruby form (hopefully a Python one coming soon…).
If you have the following file (complex-test.xml):
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
    <Object>
        <Title>Title 1</Title>
        <Name>It's name goes here</Name>
        <ObjectData>
            <Info1></Info1>
            <Info2></Info2>
            <Info3></Info3>
            <Info4></Info4>
        </ObjectData>
        <Date></Date>
    </Object>
    <Object></Object>
    <Object>
        <AnotherObject></AnotherObject>
        <Data></Data>
    </Object>
    <Object></Object>
    <Object></Object>
</Complex>
and want to read the <Object> elements one at a time, you can do it like this:
PHP:
require_once('class.chunk.php');

$file = new Chunk('complex-test.xml', array('element' => 'Object'));

while ($xml = $file->read()) {
    $obj = simplexml_load_string($xml);
    // do some parsing, insert to DB, whatever
}
Ruby:
require 'rexml/document'
require 'chunk.rb'

file = Chunk.new('complex-test.xml', { 'element' => 'Object' })

while xml = file.read
  doc = REXML::Document.new(xml)
  # do some parsing, blah...
end
It (probably) doesn’t work with nested XML elements, but my use case didn’t need that.
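To show why nesting is awkward for this kind of reader, here is a minimal sketch of one way to pull elements out of a file a chunk at a time. It is not the Chunk class from this post: the ElementReader name, the chunk_size parameter and the buffer-and-scan approach are all assumptions made for illustration. It also assumes ASCII tag names and does not handle self-closing tags such as <Object/>.

```ruby
require 'stringio' # only needed for the in-memory demo at the bottom

# Hypothetical illustration; not the Chunk class from this post.
class ElementReader
  def initialize(io, element, chunk_size = 4096)
    @io = io
    @chunk_size = chunk_size
    # Match "<Object" only when followed by whitespace, ">" or "/",
    # so that "<ObjectData>" is not mistaken for "<Object>".
    @open_tag = /<#{Regexp.escape(element)}(?=[\s>\/])/
    @close_tag = "</#{element}>"
    # Longest partial opening tag that could sit at the end of the buffer.
    @keep = element.length + 1
    @buffer = String.new
  end

  # Returns the next complete element as a string, or nil at end of input.
  def read
    loop do
      start = @buffer.index(@open_tag)
      if start
        finish = @buffer.index(@close_tag, start)
        if finish
          finish += @close_tag.length
          element = @buffer[start...finish]
          @buffer = @buffer[finish..-1]
          return element
        end
        # Element is still incomplete: drop everything before it.
        @buffer = @buffer[start..-1]
      elsif @buffer.length > @keep
        # No opening tag yet: keep only a tail that might hold a split tag.
        @buffer = @buffer[-@keep..-1]
      end

      chunk = @io.read(@chunk_size)
      return nil if chunk.nil? # end of input
      @buffer << chunk
    end
  end
end

# Demo: the first closing tag ends the element, so a nested element
# of the same name comes back truncated.
nested = StringIO.new('<Object><Object>x</Object></Object>')
first = ElementReader.new(nested, 'Object').read
# first is '<Object><Object>x</Object>'
```

Because the scan stops at the first closing tag it finds, an <Object> nested inside another <Object> gets cut short, while differently named children such as <ObjectData> are fine. Handling same-name nesting would need a depth counter or a real streaming parser. Note that the strings returned here are raw bytes, so you may need to set their encoding before parsing them.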
You can get the PHP version here, and the Ruby one here.
Update: The class was accepted onto PHP Classes as a notable package! 04/08/2009.
Have you considered vtd-xml for chunk extraction?
http://vtd-xml.sf.net
That looks like a good solution for compiled languages…
This code helped me parse some very large files quickly. However, when I try to create more than one Chunk instance in a script, it errors out, and I am not quite sure why. I receive the following exception when I try to read the second XML file in:
I am calling Chunk.new again, so I am not sure why a second execution would cause this behavior. If I execute each import separately, everything works. Any help is greatly appreciated.
Hi there Ben,
I’ll certainly have a play with it and see if it’s something I can resolve quickly, not sure why it would have any impact though…
Thanks for pointing this out!
Dom