I am now blogging at blog.alexmaccaw.com
HTML/XML Parsing with Node & jQuery
I had to do some HTML parsing recently to convert some Markdown into the format required for Nettuts+ tutorials. This meant moving various elements around, adding classes, and appending some new elements.
Now normally I'd go with Ruby's de facto solution to XML parsing, Nokogiri. However, I quickly ran into issues which, combined with the library's class-based excuse for documentation, made me decide to take a different approach.
One thing I realized was that jQuery's API is perfect for this scenario, especially when it comes to traversal and manipulation. If only there were a Ruby equivalent with a similar interface.
Then it struck me: forget Ruby, let's just use Node and jQuery. In fact, there's already a jQuery npm package to do this, which includes an HTML parser and DOM emulator.
First, install the necessary npm dependencies (in the app's directory):
npm install -g coffee-script
npm install jquery node-markdown
Then create a CoffeeScript Cakefile:
fs = require('fs')
$  = require('jquery')
md = require('node-markdown').Markdown

task 'build', 'Build index.html', ->
  # Read in file
  html = fs.readFileSync('./index.md', 'utf8')

  # Convert Markdown to HTML
  html = md(html)

  # Create jQuery object
  doc = $('<body />').append(html)

  # Insert <hr /> before all <h2 /> elements, then drop the first <hr />
  doc.find('h2').before('<hr />')
  doc.find('hr:first').remove()

  # Correct pre syntax
  doc.find('pre code').each ->
    $(@).parent().html $(@).html()
  doc.find('pre').attr('name', 'code').addClass('cs')

  # Remove images from p tags, and wrap them correctly
  doc.find('p img').each ->
    parent = $(@).parent()
    parent.after $(@)
    parent.remove()
  doc.find('img').wrap('<div class="tutorial_image" />')

  # Add required class to blockquotes
  doc.find('blockquote').addClass('pullquote pqRight')

  # Write out file
  fs.writeFileSync('./index.html', doc.html())
Now tell me that syntax isn't concise and beautiful, a vast improvement over XML parsing with other libraries.
Our build task can be invoked by running cake build, generating the resultant index.html file.
Now, of course, this approach won't be suitable for all use cases. For example, I've no idea of the script's performance. However, for my needs, where it only has to run once, it's ideal. If need be, we could even pipe the resultant HTML back to Ruby via STDOUT.
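The STDOUT idea might look something like the sketch below, written in plain JavaScript rather than CoffeeScript. I haven't wired this into the Cakefile: emitHtml is a made-up helper name, and the hard-coded markup is only a stand-in for the string that doc.html() returns in the build task.

```javascript
// Sketch only: emitHtml is a hypothetical helper, not part of jQuery or cake.
// It writes the generated markup to a stream (STDOUT by default) instead of
// to index.html, so a parent process can capture it.
function emitHtml(html, out = process.stdout) {
  out.write(html);
}

// Placeholder markup; in the build task this would be the result of doc.html().
const html = '<hr /><h2>Example</h2>';

// Print the markup followed by a newline.
emitHtml(html + '\n');
```

A Ruby script could then capture that output by running the Node script with backticks or IO.popen, and carry on processing the HTML from there.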