Discover more from !important
Laziness is a Superpower
...or how I dodged weeks of soul-rending tedium with some clever Python
When confronted with having to do a thing, nobody in this world will work harder than a lazy person to not do that thing.
The 4th edition was a 1,200 page monstrosity that took me the better part of two years to revise. Upon finishing the book revisions and editing process, I was utterly horrified to receive this email from my editor:
Please let me know how you would like to submit all the code snippets from the book.
I had completely overlooked the eventual need to extract all the example code snippets in the book and provide them for readers to download.
Without a clever workaround, I was doomed to spend weeks copying and pasting example after example out of the drafts.
As a lazy person, I was determined to not do the thing.
The book draft was written in Microsoft Word .docx files with a special publisher template, one file for each chapter. I knew that all .docx files are essentially ZIP files containing XML and other files used by Microsoft Word, so they could be extracted using any standard unzip utility. If I could figure out a reliable way of parsing and extracting code snippets from that XML, the lazy man might yet prevail.
I fired up a terminal and ran
unzip c11_revised.docx to see what I’d get back:
Seems promising. Let’s see what the code snippet from above looks like in
Looks like the code snippets are broken into long sequences of nested elements, but each contains an element with the attribute
A plan emerges: any consecutive XML elements containing
w:val="CodeSnippet" could be stitched together into a single string until a
w:val attribute NOT equal to
CodeSnippet was encountered. This joined string would be recorded as the full code snippet - with formatting preserved, since the whitespace is included.
Fortunately, the HTML files were so simplistic that I could just check the snippet string for HTML elements to determine the file extension.
But what to name the damn files? The snippets are just code floating around in the page. Thankfully, the publisher’s template offered a solution: the header hierarchy.
In the process of writing the book, I’d dutifully stuck to the header styles, since this allows the table of contents to be easily generated.
In doing so, the underlying XML had a hierarchy of attributes denoting chapter titles, sections, and subsections that I would be able to extract.
Smells like a directory/file structure to me!
Assuming I could iteratively parse the XML, track the current hierarchy as I traversed to each code snippet, and also track the ordinality of that snippet, I’d be able to strip out the whitespace and special characters to generate something like this:
Writing the Extractor
Python includes the excellent
xml.etree.ElementTree package, which allows us to traverse an entire XML document depth-first with a simple
I pushed the lot up to GitHub and casually replied to my editor with a link to the repo, pretending this was the plan all along.
The lazy man lives to fight another day.