
SE Radio 695: Dave Thomas on Building eBooks Infrastructure

Dave Thomas, author of The Pragmatic Programmer, The Manifesto for Agile Software Development, Programming Ruby, Agile Web Development with Rails, Programming Elixir, Simplicity, and co-founder of the Pragmatic Bookshelf, speaks with SE Radio host Gavin Henry about building infrastructure for eBooks. They discuss what an eBook is, the various formats, what infrastructure is needed to build them, how an author writes a book, the history of the Pragmatic Bookshelf, how they have evolved, how to handle links within eBooks, why humans are so important in the writing process, and why AI can help with your writing once you’ve written your content. Thomas discusses PDFs, eBooks, Mobi files, EPUB files, CI/CD pipelines, WYSIWYG, Markdown files, Pragmatic Markup Language, embedding code, AI agents, images, printing PDFs, JVMs, Java, JRuby, and how Markdown won the plain text writing format wars.

Brought to you by IEEE Computer Society and IEEE Software magazine.




Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Gavin Henry 00:00:18 Welcome to Software Engineering Radio. I’m your host Gavin Henry. And today my guest is Dave Thomas. Dave Thomas has been consulting and programming for 50 years and still writes code almost every day. He’s an author of The Pragmatic Programmer, one of my personal favorites, especially the 20th Anniversary Edition, The Manifesto for Agile Software Development, Programming Ruby, Agile Web Development with Rails, Programming Elixir, and I’m not finished yet. More recently, Simplicity. He currently runs the Pragmatic Bookshelf, and in his spare time he is looking at better ways of writing OO code. OO, what does that stand for?

Dave Thomas 00:00:56 Object-oriented of course.

Gavin Henry 00:00:58 Dave, welcome to Software Engineering Radio.

Dave Thomas 00:01:01 Well, thank you. I really appreciate being here. It’s great to chat.

Gavin Henry 00:01:04 Is there anything I missed in that impressive bio of yours or are you happy with it?

Dave Thomas 00:01:08 Well, not really. I mean, I do a lot of random stuff. Basically my wife accuses me of just doing the stuff I’m interested in, which is probably true. I’m a bit of a magpie when it comes to the things I’m interested in, so I’m never quite sure what I’m going to be attacking on a particular day or month.

Gavin Henry 00:01:25 That’s a great way to live. I like it. So before we start, I’ll mention a small disclaimer. I’m pretty close to finishing a book I’m writing for your company, the Pragmatic Bookshelf, about Rust and C. So I actually know how some of this stuff works from the author’s perspective, so I’d just like to say that.

Dave Thomas 00:01:42 And what are you doing talking to me and not finishing the book off?

Gavin Henry 00:01:45 Well, probably commitments. I was doing it this morning, don’t worry.

Dave Thomas 00:01:49 Oh good, good.

Gavin Henry 00:01:50 Anyway, let’s begin. So the show is called Building eBooks Infrastructure. So we need to lay the foundation. So I’d like to start with an overview of what an eBook is.

Dave Thomas 00:02:02 So an eBook is fundamentally just the text of a book in some kind of accessible format to electronics. In the old days, which was 15, 20 years ago, there were probably half a dozen different standards for what made an eBook. But nowadays we’ve pretty much settled on a format called EPUB. And EPUB is actually — I mean, if you have an EPUB file, you can actually just run unzip on it and if you unzip an EPUB, you’ll find that it’s a bunch of HTML, a bunch of assets and then a couple of manifest-type files that give it navigation and metadata and all that kind of stuff. So whenever you feed an eBook into your reader, whether that’s a Kindle or an Apple Books or whatever it might be, all it’s doing really is unzipping that and then running some kind of browser clone to actually display the content.
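The "just run unzip on it" point can be tried without a real book. What follows is a minimal sketch, not a validated EPUB: it builds a tiny EPUB-shaped archive in memory and lists what is inside. The `OEBPS` directory and the file names are conventional placeholders chosen for illustration; the uncompressed `mimetype` entry and `META-INF/container.xml` are what the EPUB container format expects.

```python
import io
import zipfile


def build_toy_epub() -> bytes:
    """Build a tiny EPUB-shaped archive in memory (illustrative, not a validated EPUB)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The container format expects an uncompressed "mimetype" entry first.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        # Points the reader at the package file (placeholder content here).
        zf.writestr("META-INF/container.xml", "<container/>")
        # Manifest and metadata live in the package file (placeholder content here).
        zf.writestr("OEBPS/content.opf", "<package/>")
        # The actual book content is ordinary (X)HTML plus assets.
        zf.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
        zf.writestr("OEBPS/style.css", "p { margin: 0; }")
    return buf.getvalue()


def list_entries(epub_bytes: bytes) -> list[str]:
    """Return the archive's entry names in the order they were written."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.namelist()


entries = list_entries(build_toy_epub())
print(entries)
```

Pointing `zipfile.ZipFile` at a real `.epub` file on disk should show the same three kinds of content Thomas lists: HTML, assets, and manifest-type files.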

Gavin Henry 00:03:00 You mentioned there used to be other formats. I remember always having to send a Mobi file. I think you can still download those when you buy a book from you.

Dave Thomas 00:03:09 No, Amazon have stopped distributing Mobis, I believe. I may be wrong about that. In the old days, you’re right, we had to give Amazon a Mobi file, and that was always a real pain because Mobis differed from regular EPUBs in terms of the stuff they would support. So getting compatible files between Mobis and EPUBs was always a challenge. But eventually Amazon gave in and switched across. So now we submit EPUBs to them and internally they do change them into something for the Kindle. I don’t know exactly how they do that. For all I know it may well be Mobi still, but we never see a Mobi file now.

Gavin Henry 00:03:53 And was that mainly because EPUB was a standard, or is a standard, and Mobi wasn’t?

Dave Thomas 00:03:58 Yeah, I think they were constantly playing catch up with EPUB and frankly it just wasn’t worth it. I think initially they went with Mobi because it gave them the ability to control stuff a bit better on the Kindle. But eventually EPUB did everything Mobi does plus a lot more. It’s a way more flexible format. And so I think they just said, well, why are we fighting a technology war that we don’t really get any benefit from? So they just switched across.

Gavin Henry 00:04:28 Did they give you any help to render and produce them?

Dave Thomas 00:04:31 We didn’t have to do anything. We were always generating Mobis and EPUBs as part of our process, and so we just basically stopped sending them Mobis.

Gavin Henry 00:04:41 I see. And was there a big difference, it seems to be that they were playing catch up, you said.

Dave Thomas 00:04:45 I mean, from our point of view, not at all. From the consumer point of view, I honestly don’t know. My guess would be that over the years, yes, it has improved. So for example, when we first started generating Mobis, they did not support the equivalent of the pre tag, which is the thing that says do not change the spacing. And that meant that code listings were always very, very challenging. So for a while we were actually converting code into images and then embedding the images in the EPUB because it was the only way we could find to reliably get them to render. There was a whole bunch of stuff like that where Mobis just, I mean they were designed for reading fiction. Basically the Kindle was a take-it-to-the-beach-and-read-a-book device. And so anything that we needed that was a little bit out of the norm, we would have to fake out using a whole bunch of tables or a whole bunch of images or something like that. And now we don’t bother. Now we just use EPUB, and EPUBs are gradually getting closer and closer to the browser. So we get richer and richer formatting as we go forward.

Gavin Henry 00:05:53 Yeah, I think I remember having to double tap a code listing because it was just a big image and then you’d have to zoom around and it was...

Dave Thomas 00:05:59 Pretty painful and that was painful for you as a reader. And also painful for us as a publisher. And we were constantly bumping into various limits and things like what happens if a code listing is bigger than a page and that kind of thing.

Gavin Henry 00:06:14 And did they differ in, I guess the Mobi’s would’ve been bigger in file size because you were putting images in them rather than,

Dave Thomas 00:06:20 I think they probably were, yeah. Initially we started out trying to keep all of them below about two meg because, I can’t remember, there was some arbitrary limit someone had. But nowadays that’s not a problem. So again, we don’t have to worry about it.

Gavin Henry 00:06:32 And just to clarify before we move on, is a PDF an eBook?

Dave Thomas 00:06:37 I guess it could be. To my mind, an eBook is something which is convenient to read electronically. PDFs are certainly readable electronically. The question is, are they convenient? One of the distinguishing factors, I think, is whether or not it’s reflowable. That means that if you bring a browser window up and look at a webpage, and you change the size of the browser window, the text grows or shrinks or moves or expands or whatever to fill the space. And you don’t have to paginate it. If you’re reading an article, you don’t get down to the bottom of one page, hit some magic key, and then the whole screen clears and the next page comes up. That would be a kind of fixed format, and PDF is a fixed format that is page based. Whereas typically for the kind of books that we are creating, and for typical fiction and everything else, you don’t want that. Instead you want it to be flow based, which means that as I change the size of the screen or I change the font, then the layout of the text adjusts to suit that.
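Reflowing can be illustrated in a few lines. This is only a toy, using Python's `textwrap` module as a stand-in for a reader's layout engine: the same text is wrapped to two different "screen widths", and only the line breaks change, never the words.

```python
import textwrap

TEXT = ("An eBook reader reflows text: change the width or the font size "
        "and the line breaks move to fit the space available.")


def reflow(text: str, width: int) -> list[str]:
    """Break the text into lines no wider than `width` characters."""
    return textwrap.wrap(text, width=width)


narrow = reflow(TEXT, 30)   # a phone-sized column
wide = reflow(TEXT, 60)     # a tablet-sized column

for line in narrow:
    print(line)
```

A fixed, page-based format corresponds to storing `narrow` or `wide` directly, with the line breaks already baked in.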

Gavin Henry 00:07:44 Would the PDF reader not do that for you? Or that just be,

Dave Thomas 00:07:48 They kind of can now. You can actually tell it to turn off pagination and do that kind of reflowing. It messes it up more than it gets it right. And the reason for that is that the PDF format internally doesn’t really consist of a whole bunch of text with instructions around it. The text is broken up into very, very small chunks and then positioned on the page as a small chunk of text. And so in order to reflow it, what you’ve really got to do is basically analyze that text and try to work out, well, this is a paragraph, because there’s not anything intrinsic in a PDF to say this is a paragraph. So it’s tricky to reflow, and they do it and hats off to them for doing it, but it’s not perfect.
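To see why reflowing a PDF is guesswork, here is a minimal sketch of the kind of heuristic a reader might apply. It is not how any particular PDF reader works. The `Chunk` type, the coordinates, and the 1.5 line-height threshold are all assumptions made for illustration: positioned chunks are sorted into reading order, and a large vertical gap is guessed to be a paragraph break.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    x: float    # left edge of the chunk on the page
    y: float    # baseline, measured down from the top of the page
    text: str


def guess_paragraphs(chunks: list[Chunk], line_height: float = 14.0) -> list[str]:
    """Group positioned chunks into paragraphs, splitting on large vertical gaps."""
    paragraphs: list[list[str]] = []
    prev_y = None
    # Reading order: top to bottom, then left to right.
    for c in sorted(chunks, key=lambda c: (c.y, c.x)):
        # A jump of more than 1.5 line heights is guessed to be a paragraph break.
        if prev_y is None or c.y - prev_y > line_height * 1.5:
            paragraphs.append([])
        paragraphs[-1].append(c.text)
        prev_y = c.y
    return [" ".join(p) for p in paragraphs]


page = [
    Chunk(140, 100, "broken up"),
    Chunk(72, 100, "The text is"),
    Chunk(72, 114, "into chunks."),
    Chunk(72, 150, "A new paragraph."),
]
paragraphs = guess_paragraphs(page)
print(paragraphs)
```

The guess fails as soon as the layout is less regular, for example with multiple columns, tight paragraph spacing, or captions, which is consistent with the "not perfect" verdict above.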

Gavin Henry 00:08:35 Well, I did a show with Peter Wyatt and Duff Johnson on the 30 years of PDF because they helped do all the spec and everything there. We did a deep dive on it all, if listeners want to pick up show 532.

Dave Thomas 00:08:47 Well, I’m going to go listen to that.

Gavin Henry 00:08:49 We speak about the biggest PDF in the world. That covers the whole city of Japan and all sorts of really cool stuff. So I think you were hinting at this, but do you think books are evolving toward more like a website where you can get a book updated over time, either get it corrected or changed?

Dave Thomas 00:09:09 Well, our books have always been like that. If you buy an eBook from us, then we will actually send you updates. And either if we find bugs, sometimes if the author does like a new revision, typically not for like a second edition because that’s typically a rewrite. But if an author wants to do like a third or fourth printing and they discover they want to add another couple of paragraphs, we just throw those in there and distribute it. And so you just, you get a notification, you download the most recent one and you have it. Remember that eBooks are typically designed to be read offline. They may not be, but they’re designed that way. And so we can’t just magically update it in your reader. And in fact you wouldn’t want to, quite often readers don’t want you to do that because they’ll be making annotations or whatever in the existing one. So we make them always available. In a way, I think the biggest thing I’m seeing though is that we’re getting more adventurous in what we can do inside a book. So it’s actually, not many people know this, but you know you have disclosure lists in HTML where you can have a heading and you click on it and then the text below it expands and you get a whole bunch of backup material for that.

Gavin Henry 00:10:26 Yeah, like I think in the JavaScript sort of widget world, they call them accordions, don’t they?

Dave Thomas 00:10:31 Accordions, yeah. But in fact, that’s built into HTML. Oh, okay. It doesn’t have to have any JavaScript running at all. And a lot of eBook readers actually support that. And way back when, we actually experimented with putting that into our eBooks. So you could actually have like a summary list, which if you clicked on a particular link would expand to show you the detail. And at the time we ran into the problem that, because most eBook readers back then used E Ink displays, it basically had to clear the screen and redraw it with the expanded text. And people just found that too distracting. But that’s an example of the kind of thing that you can do in an eBook that you can’t do in most other book style output and certainly not in paper.

Gavin Henry 00:11:19 Yeah, it sounds like you could enrich the page more and keep relevant text but not immediately present it.

Dave Thomas 00:11:27 Yeah, absolutely. And we kind of do that with footnotes now in our EPUBs. Footnote is a link and you tap the link and it goes to the footnote and then on the footnote you tap another link and it takes you back to where you were in the text. And yeah, that kind of thing is definitely possible. One of the things that we have discovered though is that one of the attributes of a book is that it is kind of, what’s the word? Boring in terms of the way it’s laid out and everything else. You know what to expect when you have a book. And so if you are reading a book and it suddenly went all psychedelic on you or all the fonts changed or something that would be really distracting. And a book is not supposed to be a distraction. A book is simply a substrate where an author can get content to a reader. And once you start making it all, gee whiz, look at the technology you kind of get in the way of that communication. That breaks my heart because I love experimenting with the kind of wild stuff you can do inside an EPUB, but it’s just too risky to be honest with you to do that. People will complain if you do that.

Gavin Henry 00:12:39 Yeah, it’s a sort of medium to disappear into, isn’t it?

Dave Thomas 00:12:43 Exactly right. Keep

Gavin Henry 00:12:44 And keep clean and no distractions unlike the rest of the world.

Dave Thomas 00:12:48 Exactly. Yeah. I mean, the last thing you want is a pop-up ad in an eBook.

Gavin Henry 00:12:52 I think they would try if they were.

Dave Thomas 00:12:54 Oh, I mean you can do it, you can actually do it. But yeah, that would be disastrous.

Gavin Henry 00:13:00 Cool. So I’m going to take us on to the middle part of our chat today. So now that we know what an eBook is and the formats that you support as a publisher and pretty much everyone supports, which is EPUB, like you said, let’s talk about building them.

Dave Thomas 00:13:15 Okay.

Gavin Henry 00:13:16 So building EPUBs. First question, what do authors write books in, Word?

Dave Thomas 00:13:22 No, and actually, you know what, it may be easier to do this slightly differently and go back and look at the history, because that actually informs a lot of what we do. So the whole book business started in 1997, I guess, when Andy Hunt and I started putting together The Pragmatic Programmer, except we weren’t doing The Pragmatic Programmer at the time. We were just doing a set of notes that we could give our clients. We were doing consulting at the time, we’d turn up at a client, and we’d always spend the first couple of days saying the same things to every client. Stuff like, maybe you want to test that before you ship it, or can you build this automatically? And all that kind of pretty obvious stuff. But at the time it wasn’t obvious.

Dave Thomas 00:14:14 So we put together some notes and they were just HTML pages that we were going to print out. And it was actually my wife who said, why don’t you see if you can get a publisher interested in that? It’s good stuff. And so we thought, well no, it’s no way people would be interested in that. It’s a book. But we came up with this cunning plan, which was we would submit it to who we thought was the best publisher for this kind of material, and they would reject it, but when they rejected it, they would say why they rejected it and that would help us to make it better. So we submitted it to Addison Wesley and they screwed us over by accepting it. And so we were stuck with having to produce this book and we already had a whole bunch of content.

Dave Thomas 00:14:59 Andy and I are really strong believers in the power of plain text. So text that has basically just words, just characters. So we’d never, ever used word processors to do anything. Back then, 1997-98, that was basically Microsoft Word, and Word has gotten better, but Word does not produce the kind of quality layout that you need for a book. So back then, if you were an author, the chances were pretty good you would be authoring in Microsoft Word, and you would submit your book to your publisher, and the publisher would take your Word document and import it into their own layout software. And that might be something like FrameMaker. Nowadays it would be InDesign, whatever it might be. And that’s a one-way process. They ingest it, it gets into their tool, and at that point the master copy of the book is in their hands and they do whatever they have to do to make it look good.

Dave Thomas 00:16:03 So now I’m an author and I have discovered an error in the book and I want to fix it. Well, I cannot fix the Word document anymore because that’s now irrelevant. It’s not anything to do with the book anymore. So what I have to do is to basically send my changes to the publisher as a set of: on page 27, line four, change the word this to that. And then some poor person at the other end has to go through, read those instructions, bring up FrameMaker, and make whatever the adjustment is. And to us, that just felt like, this is impossible. We’re never going to do that. So we convinced Addison-Wesley to let us give them a camera-ready PDF. Actually, no, it was a PostScript file at the time. And we asked them for their spec and they gave us their spec for typesetting. And then we set about putting together a tool chain that would let us write plain text and have it end up being a book.

Dave Thomas 00:17:03 And we went through a couple of iterations. The first one we used a tool called TROFF, that’s T-R-O-F-F. And TROFF was at that point easily 20 years old. It was part of the Unix Documenter’s Workbench that came out of Bell Labs. And in TROFF what you did is you basically typed your text and then you inserted various formatting instructions. Basically you would start a line with a period, or full stop, and then you would put like a one or two character instruction, like this is a title, or this is a page break or something. And it would then take your text and format it using those instructions.

Gavin Henry 00:17:47 Kind of like HTML almost with your title.

Dave Thomas 00:17:49 Kind of like HTML, way more restricted, but way more specialized. But the thing that made it usable is that it had a macro system. So you could define commands using the lower-level commands.

Dave Thomas 00:18:02 So we produced this library that would do the layout and it kind of worked, but TROFF was written for academic papers, it wasn’t written for book type stuff. And a lot of people have written books in TROFF, but honestly it doesn’t do a very good job of paragraph breaking, where it decides where to put a new line in a paragraph. It just looks a bit ugly. So we switched across to using TeX, which was Don Knuth’s layout system, which makes way, way nicer output. And so we produced a set, or I produced a set, of LaTeX macros to define the book, and we wrote the book in LaTeX and we delivered the PDFs to Addison-Wesley. And to our surprise, they just said, yep, that’s good. So the first edition of The Pragmatic Programmer was produced that way. But the nice thing was, I think the first edition, I can’t remember how many printings we had. It was a lot.

Dave Thomas 00:18:56 And they would come to us and say, any changes you want to make? And we actually had the ability just to go through the book and update stuff and then regenerate the PDF and send it to them. It was kind of interesting at the time. What happened was you would send them the PDF, they would run it through something called a RIP, which converts it into a high-resolution set of images, and then they would burn those images onto metal plates. And those metal plates would then go into the printing presses. And you’ve probably seen the things on TV, these massive, big machines with big drums that are spinning, and paper flies through, goes in white, comes out with text on it. So there’s actually a decent cost associated with converting a PDF into those metal plates. So what they tried to do is to reuse plates that haven’t changed. And so we had to come up with a system where we could automatically tell them which pages had changed and which pages hadn’t, to reduce their costs. But that was fun.
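The episode does not say how that page-change detection worked. One plausible approach, offered only as a sketch under that assumption, is to fingerprint each rendered page and compare fingerprints between printings. Here each page is represented as raw bytes, and `changed_pages` is a hypothetical helper name.

```python
import hashlib


def page_digests(pages: list[bytes]) -> list[str]:
    """Fingerprint each rendered page so pages can be compared cheaply."""
    return [hashlib.sha256(p).hexdigest() for p in pages]


def changed_pages(old: list[bytes], new: list[bytes]) -> list[int]:
    """Return 1-based page numbers whose content differs between two printings."""
    old_d, new_d = page_digests(old), page_digests(new)
    changed = [i + 1 for i, (a, b) in enumerate(zip(old_d, new_d)) if a != b]
    # Pages that exist in only one of the two printings are reported as changed too.
    shorter, longer = sorted((len(old_d), len(new_d)))
    changed.extend(range(shorter + 1, longer + 1))
    return changed


first_printing = [b"page one", b"page two", b"page three"]
second_printing = [b"page one", b"page two, fixed", b"page three", b"page four"]
print(changed_pages(first_printing, second_printing))
```

A real system would also have to cope with edits that push text onto the following page, which makes every later page differ; this sketch ignores that.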

Gavin Henry 00:19:55 And that was groundbreaking for that time.

Dave Thomas 00:19:58 I don’t know, maybe, I don’t know. Here’s the thing, we didn’t know what we were doing, we were just doing what sounded like a good idea at the time. And so we were experimenting and we’re just doing what we thought was a good idea.

Gavin Henry 00:20:10 So thanks for the history of how you got started in text. So that answers our question if things are written in Word or not, I guess other publishers might write in Word.

Dave Thomas 00:20:21 In technical publishing that used to be the case, nowadays some don’t. Some still do take Word. But Word, it just seems like such a crazy thing. It’s really not suited for writing books added to which if you’ve ever tried writing a 500-page Word document, it doesn’t like that at all. Right? It bogs down and it’s pretty ugly.

Gavin Henry 00:20:42 So when you write a book for the Pragmatic Programmers, does an author use, like, a what you see is what you get editor? What is the actual next format?

Dave Thomas 00:20:51 No. So in the intermediate days, we switched from using LaTeX to using an XML based markup. Basically it was a simplified HTML. And so you would write in this HTML thing, we called it PML for Pragmatic Markup Language, and it was a semantic markup. So you would have tags for, here’s some code, here’s a sidebar, whatever it might be, and that’s what you would write. And authors had the ability then to be able to run our tool chain on their local computer and generate either a PDF or an EPUB, and they could then look at their output that way. So the answer is no, they don’t write in a WYSIWYG format. And there’s two reasons for that. First of all, I always like a quote from Brian Kernighan, who said WYSIWYG really should be WYSIAYG: What You See Is All You Get.

Dave Thomas 00:21:45 So there’s no additional semantic information in there. And we wanted to have that extra content. But secondly, if you give an author a WYSIWYG tool, then what they tend to do is focus on that layout, and they’ll spend a long time twiddling with margins and fonts and everything else. And that’s not what we want. What we want is authors to focus on their content, because we have people that do the layout. And so it’s actually, I think, a benefit not to be able to see the actual laid out text as you’re typing it. So in the old days, we used to have an XML based layout called PML, and authors would write it as PML, and then the tool chain would convert that into, you know, PDFs or eBooks or whatever. We have moved across to using Markdown as our input, and it’s a work in progress.

Dave Thomas 00:22:41 We started off by allowing people to embed Markdown inside their PML documents. And in the last year or so, we now encourage authors to write in MD files, in Markdown files. And you can still use your old PML tags inside that Markdown if you want to. But we’re basically slowly moving to the point where you no longer have to. And so we actually have a Markdown that’s rich enough to let you express everything you need to do, not just as an author of a book, but also as a publisher of a book. Because there’s lots of little, small subtle things we have to do in the background to control the process a bit.

Gavin Henry 00:23:23 So has Markdown evolved since you created PML?

Dave Thomas 00:23:27 Markdown didn’t even exist when we started creating PML, I don’t think. Back then it was Textile that was the equivalent of Markdown. So yeah, Markdown has, well first of all, it’s become the de facto plain text writing language. But also over the years the practicality of Markdown has expanded. And for example, the ability to embed HTML inside Markdown in a kind of rational way means that you don’t have to be limited by just the few things that Markdown can do. And a consensus is being gradually reached on how you write extensions for Markdown, so that we can add extra facilities to Markdown in a kind of consistent way. So yeah, Markdown has matured a lot. And like I say, it’s what I write all my books in now.

Gavin Henry 00:24:17 I want to ask a few technical questions about how you handle Markdown then. Sure. How are errors? So you’ve got your own tool chain that you’ve written that, I don’t know if it’s still bits of it from when you originally did it in

Dave Thomas 00:24:30 Oh yeah, yeah.

Gavin Henry 00:24:31 So how do you handle errors in that raw format?

Dave Thomas 00:24:35 Hmm. As well as we can. So let me tell you a bit about how the tool chain actually works. What we have is a pipeline. At one end of it you have a set of Markdown files, and one of those Markdown files will be called Book.md. And it will typically include all of the different Markdown files that make up your book. We encourage authors to do at least a separate chapter per file. Some authors actually have each section as a file or whatever, it doesn’t really matter. So you feed that top level Markdown file into a pipeline, and the pipeline goes through a variable number, depending on the author and the book, but at least half a dozen different pre-processors. The first one is the one that takes that Book.md and goes through and looks for all these inclusions and replaces them with the actual contents of the files.

Dave Thomas 00:25:34 And that produces a stream of text that no longer has includes in it. That stream gets passed to the next thing and the next thing and the next thing. At some point in that process, we convert the Markdown into XML and, not coincidentally, the XML we convert it into is actually our original PML format. And the reason for that is that we have tens of thousands of lines of code that know how to handle that PML format to generate books in PDF, Mobi, EPUB, whatever formats. And so by using that as our intermediate target, we can reuse a whole bunch of our tooling as we go forward. It also means that we can do a more gradual switch over to Markdown because authors can still embed PML markup in their Markdown documents. And that just makes it all the way through the various stages in the pipeline until it gets to the PML converter.
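The shape of that pipeline, an includer followed by a chain of text-to-text stages, can be sketched as below. This is an illustration of the idea only, not the Pragmatic Bookshelf tool chain: the `@include` directive, the in-memory `files` dictionary, and the `replace_ellipses` stage are all made up for the example, since the episode does not give the real syntax.

```python
import re
from functools import reduce
from typing import Callable

Stage = Callable[[str], str]

# Hypothetical include directive; the real toolchain's syntax is not given in the episode.
INCLUDE = re.compile(r"^@include[ \t]+(\S+)[ \t]*$", re.MULTILINE)


def make_includer(files: dict[str, str]) -> Stage:
    """Build the first stage: replace each include line with the named file's contents."""
    def expand(text: str) -> str:
        # Recurse so that included files may themselves contain includes.
        return INCLUDE.sub(lambda m: expand(files[m.group(1)]), text)
    return expand


def replace_ellipses(text: str) -> str:
    """Stand-in for 'the next thing': any stage that maps text to text."""
    return text.replace("...", "\u2026")


def run_pipeline(text: str, stages: list[Stage]) -> str:
    """Feed the output of each stage into the next one."""
    return reduce(lambda acc, stage: stage(acc), stages, text)


files = {
    "Book.md": "# My Book\n@include ch1.md\n@include ch2.md\n",
    "ch1.md": "## One\nFirst chapter...",
    "ch2.md": "## Two\nSecond chapter.",
}
out = run_pipeline(files["Book.md"], [make_includer(files), replace_ellipses])
print(out)
```

Because every stage has the same signature, a per-book pipeline is just a different list of stages, which matches the later remark that optional pre-processors are only added for the books that need them.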

Dave Thomas 00:26:35 So if we’re targeting an eBook, what we have to do is convert your PML into XHTML 1999 standard. If we’re converting it to a PDF, then what we actually do is we convert it in a couple of steps, which I’ll talk about in a minute. And to do that conversion from one form of XML to HTML, we use a fantastic technology called XSLT, which is a way of specifying a set of transforms that you apply to an XML document. That sounds terribly, terribly technical, but all it really means is you give it a set of rules. You say to it, whenever you see a code tag, for example, replace it with a pre and then whatever the content of that code tag is. And so you actually write this very functional style set of templates that translate your XML into whatever format you want.
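The "whenever you see a code tag, replace it with a pre" rule can be imitated in a few lines. To be clear, the sketch below is plain Python over `xml.etree.ElementTree`, not XSLT, and the tag names in `RULES` are illustrative rather than the actual PML vocabulary. It only shows the rule-per-element, recurse-into-children style that XSLT templates express.

```python
import xml.etree.ElementTree as ET

# Each rule maps a source tag to an output tag. The names are illustrative,
# not the actual PML vocabulary.
RULES = {"code": "pre", "chapter": "div", "title": "h1", "p": "p"}


def transform(node: ET.Element) -> ET.Element:
    """Apply the matching rule to this node, then recurse into its children."""
    out = ET.Element(RULES.get(node.tag, "span"))  # unknown tags fall back to <span>
    out.text = node.text
    out.tail = node.tail
    for child in node:
        out.append(transform(child))
    return out


src = ET.fromstring("<chapter><title>Intro</title><code>x = 1\n  y = 2</code></chapter>")
html = ET.tostring(transform(src), encoding="unicode")
print(html)
```

In real XSLT each entry in `RULES` would be a template with a match pattern, and the recursion would be an apply-templates instruction.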

Dave Thomas 00:27:40 So for the eBooks, that is purely, that’s all we do. We have some XSLT that takes the input document and generates literally all of the files that are necessary to create say, an EPUB. So we are generating each of the, well, it’s not even chapters, so we have to break the book down into chunks because EPUB readers need each section to be no longer than, I don’t know, three or four pages. You can’t have like the entire book in a single chunk. So we break it into chunks, we set them out to separate files, we do all the cross-referencing, we generate the manifests, all of that stuff is done in XSL. The only additional code that we have is the code that normalizes our images. Some eBook readers insist that you only give them black and white images, for example. So we have to convert those.
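The chunking and manifest steps could look something like the following. Thomas describes doing this in XSLT; this is a Python sketch of the same idea, and the 2,000-character limit, the greedy packing strategy, and the `chunkNNN.xhtml` file names are assumptions for illustration. The `item` element with `id`, `href`, and `media-type` attributes is the standard shape of an EPUB manifest entry.

```python
def split_into_chunks(paragraphs: list[str], max_chars: int = 2000) -> list[list[str]]:
    """Greedily pack paragraphs into chunks of at most max_chars (a made-up limit)."""
    chunks: list[list[str]] = [[]]
    size = 0
    for p in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit.
        if chunks[-1] and size + len(p) > max_chars:
            chunks.append([])
            size = 0
        chunks[-1].append(p)
        size += len(p)
    return [c for c in chunks if c]


def manifest_items(chunks: list[list[str]]) -> list[str]:
    """One manifest entry per generated file; the file names are hypothetical."""
    return [
        f'<item id="chunk{i:03d}" href="chunk{i:03d}.xhtml" '
        f'media-type="application/xhtml+xml"/>'
        for i, _ in enumerate(chunks, start=1)
    ]


paras = ["a" * 1500, "b" * 1000, "c" * 400]
chunks = split_into_chunks(paras)
items = manifest_items(chunks)
print(len(chunks), items[0])
```

A paragraph longer than the limit still gets a chunk of its own here; a production tool would presumably also respect section boundaries when choosing where to split.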

Dave Thomas 00:28:28 If we are generating PDFs, we go through a similar path, but with a different outcome. We still use XSLT, but then what we do is we generate another XML format, which is called XSL-FO. And the FO stands for formatting objects. And FO is, frankly, I’m amazed it works, because what it takes on is incredibly complicated. It’s basically a way of describing text laid out in two dimensions. And it’s also a way of describing the structure of a book so that the book effectively is a container for the text that it’s creating. It’s still to me a miracle that it actually works, but it does, and it produces gorgeous looking PDFs. So that is the essence of the tool chain. It runs, like I say, as a pipeline, and the pipeline varies per book because different authors have different things they want to be able to do. So for example, some authors want to be able to include mathematical equations in their books, and we have a pre-processor that will do that. But if you don’t have math in your book, there’s no point in slowing down the build by including that pre-processor. So those kinds of things are optional. So in total, we probably have, I don’t know, 20 different pre-processors that people.

Gavin Henry 00:29:54 Is that something the author needs to pick to pre-process or is that detected at the start?

Dave Thomas 00:29:58 When you sign up you get to check out your repository and that’s preloaded with the ones that just about everybody uses. And then your editor will say, you know, you’ll say to the editor, gosh, I really wish I could do an equation here. And they’ll go, oh okay, that’s easy. And they’ll show you how to do that by adding whatever the pre-processor is to make it happen.

Gavin Henry 00:30:19 I see. And if it does, going back to the question was how an error is handled, I guess.

Dave Thomas 00:30:24 Oh that’s right. I’m sorry. You’re absolutely right. So the problem we had there is it’s exactly the same as a compiler or something. And that is by the time you are sort of six or seven levels deep in pre-processor, you’ve lost all of the original information about where you were in the input document. And so what we do, and it’s not perfect, but when the includer runs, the first thing it does is it adds XML comment to the stream every time it includes a file. And that includes the file name and a bit of other information. And then every pre-processor that changes the text as it goes through, will flag the start and end of that change with another comment to say, I am at the relative line 27 into the current chunk. And so at some point, if you get an error, what it then does is it goes back through that kind of stack of location things to work out where you actually were in the original document and it tries to report it. Typically it can get it within a few lines unless you are looking at a really, really long run of text, in which case you may be off like a dozen lines because there’s none of these little flags to try to record where you were in the original. So it’s definitely a problem and it’s something I’m trying to working on refining always, but it’s getting better than it was.
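The location-comment idea is similar to line markers in a C pre-processor's output. A minimal sketch follows, with the caveat that the marker format and the function names are invented for the example; the episode only says the comments carry the file name "and a bit of other information".

```python
import re

# Hypothetical marker format, written as an XML comment so later stages ignore it.
MARKER = re.compile(r"<!-- file: (\S+) line: (\d+) -->")


def include_with_markers(files: dict[str, str], order: list[str]) -> list[str]:
    """Concatenate files, prefixing each with a comment recording where it came from."""
    stream: list[str] = []
    for name in order:
        stream.append(f"<!-- file: {name} line: 1 -->")
        stream.extend(files[name].splitlines())
    return stream


def locate(stream: list[str], index: int) -> tuple[str, int]:
    """Map a 0-based index in the expanded stream back to (file, line)."""
    # Walk backwards to the nearest marker, then count forward from the line it records.
    for i in range(index, -1, -1):
        m = MARKER.fullmatch(stream[i])
        if m:
            return m.group(1), int(m.group(2)) + (index - i - 1)
    raise ValueError("no location marker before this line")


files = {"ch1.md": "one\ntwo", "ch2.md": "alpha\nbeta\ngamma"}
stream = include_with_markers(files, ["ch1.md", "ch2.md"])
print(locate(stream, 5))   # the line containing "beta"
```

If a later stage inserts or removes lines without emitting a fresh marker, the count drifts, which is the "off by a dozen lines" failure mode described above.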

Gavin Henry 00:31:52 And you’ve already answered one of my questions, which is going to be about equations. So that’ll be enabled at the start manually, whether you’re going to do LaTeX or something like that?

Dave Thomas 00:32:01 Right.

Gavin Henry 00:32:02 How are links and references handled within the eBook?

Dave Thomas 00:32:06 So that is actually an ongoing question because if you think about it, if you’re in a PDF, then imagine you’re just reading on paper, you know, there’s no hyperlinks, you can’t press on the paper and it turns the page for you. So when you do a cross-reference in the paper book, what you have to do is say, as you can see in the section on, you know, editing a file on page 27, you do this. So a cross-reference there is basically a verbose description of where you’re pointing to. If it’s an eBook, then the same thing can just be a hyperlink and you click on it and it takes you there. And what we wanted was a single markup that would let you express both of those. And we’re very close. There are some cases where it’s not going to work, but 99% of the time what you do is you say whenever you have a destination somewhere you want to point to, you tag it, you give it an ID, and then in the body of the book you can say cross-reference to this target.

Dave Thomas 00:33:12 And if it is an eBook, it will generate a hyperlink for that. If it’s a paper book, it will go and have a look at that target and find a good name for it. So if it’s a section, it will say the section named and then the title, if it’s a diagram, it will give you the figure number. If it’s a line of code, it will actually give you, you know, line 27 of the code or whatever. And so it will expand your cross reference into a textual description just as if it was a real paper book.
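
A single cross-reference that renders differently per output format might look something like the sketch below. The `Target` fields, the `render_xref` function, and the exact wording of the printed text are assumptions for illustration, not the actual markup or output.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Target:
    kind: str   # "section", "figure", or "code-line" (hypothetical kinds)
    label: str  # section title, figure number, or line number
    page: int   # page the target lands on in the paper layout


def render_xref(target_id: str, targets: Dict[str, Target], output: str) -> str:
    """Expand one cross-reference for either the eBook or the paper book."""
    target = targets[target_id]
    if output == "ebook":
        # An eBook can simply link to the tagged destination.
        return f'<a href="#{target_id}">{target.label}</a>'
    # Paper needs a textual description chosen by the kind of target.
    if target.kind == "section":
        return f"the section named {target.label} on page {target.page}"
    if target.kind == "figure":
        return f"Figure {target.label} on page {target.page}"
    return f"line {target.label} of the code on page {target.page}"
```

The author writes the reference once against an ID; only the final rendering step knows whether it is producing a link or a description.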

Gavin Henry 00:33:44 And you already mentioned I think how people can handle code embedding or formatting. So code would be a code tag in your PML you said?

Dave Thomas 00:33:53 Yeah, in PML it’s a code tag; in Markdown it’s just a Markdown fenced block and you put the language after the fence.

Gavin Henry 00:34:02 That’s the backticks thing, isn’t…

Dave Thomas 00:34:04 It is, the backticks or tildes. Yeah. And so it will take that and it will do syntax highlighting on it and then generate whatever is necessary to format that either for the eBook or for the PDF. For the PDF, typically we syntax highlight to black and white, but we actually also have books that are printed in color, in which case the syntax highlighting survives as, you know, different colors. But we also have markup inside the code where our tool chain is aware of all the different languages and it knows about what constitutes a comment in each of those different languages. So if it was C for example, it would be a block with a slash star and then some text and star slash, if it was Ruby, it would be a pound character, a hash character, and it looks for markup inside comments to make it do things.

Dave Thomas 00:34:58 So that’s where I could do things like say this is a destination, I want to put an ID on this line of code. And then in the body of the text you can say as you can see on and then reference that ID and it will say line 27 or whatever of the code. It’s also how we do highlighting of or calling out particular lines of code. So quite often authors want to show which lines changed as they’re running through an example. And so we have this markup where you can say, start emphasizing this code and then end and it’ll actually put little arrows in a margin to show you that. And then finally the really important thing is in the old days what you would do is you would take the code that you wanted to describe and then cut and paste it into the inline text of the book.

Dave Thomas 00:35:45 And the problem with that is that if you then went through and updated the code, how would you know that you’d correctly updated everything in the book as well? Because a particular chunk of code might occur on six different pages in the book as you are building up an example say. So what we wanted to do is to encourage authors not to put code in line in the text of their book, but to keep code as code in program source files. And one of the reasons we wanted to do that is so that they can actually test that code as part of the book building process. So for example, in my Ruby book, every single example of code in that book is actually run every time the book is built and it verifies that the output hasn’t changed. So that if for example, they make some change to Ruby and one of my examples no longer works, the book won’t build and it’ll tell me that that’s, that’s changed. So to make that work, you typically don’t want every code example to have all of the code, you just want to highlight one particular section as you’re talking about it. So one of the markups we can have in code is the ability to define regions of that code, sections of that code and give them a name. And then when you include the code in your book, you say include from this file the sections named this, this and this. And it will just include those lines into the book.
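
Pulling named regions out of a source file could work along these lines. The `START:` and `END:` marker syntax inside comments and the `extract_regions` function are assumptions made for this sketch; the interview only says that regions are marked up in comments and given names.

```python
from typing import List


def extract_regions(source: str, names: List[str], comment: str = "#") -> str:
    """Return only the lines that fall inside any of the named regions.
    Marker lines themselves are never copied into the book."""
    wanted = set(names)
    active = set()  # regions currently open at this point in the file
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(f"{comment} START:"):
            active.add(stripped.split(":", 1)[1].strip())
        elif stripped.startswith(f"{comment} END:"):
            active.discard(stripped.split(":", 1)[1].strip())
        elif active & wanted:
            kept.append(line)
    return "\n".join(kept)


SAMPLE = "\n".join([
    "import math",
    "# START: area",
    "def area(r):",
    "    return math.pi * r * r",
    "# END: area",
    "# START: demo",
    "print(area(2))",
    "# END: demo",
])
```

Because the full file stays a real program, it can still be run and tested as a whole, while the book shows only `extract_regions(SAMPLE, ["area"])` at the point where that function is discussed.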

Gavin Henry 00:37:11 You mentioned already that authors can build the book locally. So I guess you, the whole tool chain’s fairly smallish and you can get it all set up locally.

Dave Thomas 00:37:20 It can be.

Gavin Henry 00:37:22 I was just wondering about the sort of integration testing there of your own code that I’d need to do locally.

Dave Thomas 00:37:29 Yeah, that takes a little bit of work, but it’s not onerous. Okay. Like all of our editors for example, no, most of our editors are English majors and most of them are set up to be able to build on their own machines. One of the joys of my life is when an editor whose prior taste of computing was using Microsoft Windows and Word gets online and says, you know, well I tried to set that environment variable and it didn’t work. You know, it’s kind of fun just that they can, they’re, I’m corrupting a whole generation of editors there. But yeah, setting the tool chain up typically involves you have a version control tool. We currently use Subversion for historical reasons. So you would check out your repository and that gives you a whole bunch of stuff which is kind of like the boilerplate, but it gives you all of our tool chain as well. And so given the repository, all you have to do typically is install a version of Java and that’s because the tools that do our XSL:FO processing are Java based. So we took the advantage of having Java lying around and everything else we’ve got is based off that Java runtime. The tools are not written in Java, they’re written in Ruby, but we run JRuby, which is a version of Ruby that runs inside the JVM and that lets us, basically you have one dependency, which is this Java and then the rest just typically kind of works.

Gavin Henry 00:38:54 So you obviously write your own tools for this because there was nothing before. Authors can build everything locally if they want and you can put a bit more work in and do all your code examples. Just for our listeners, for myself I have my Rust and C code in an individual Git repository. So every time I make a change to code, I commit that and then commit it into the repository as well.

Dave Thomas 00:39:16 Right. And actually some of our authors also have an automated set of cherry pick tools that they use with that. So they’ll have a Git repo which represents each of the steps in an example they’re building. And if they discover they need to make a change to one of the early steps, they make that change and then they can make it ripple through all the later changes automatically, which is kind of cool.

Gavin Henry 00:39:40 So I’m presuming your distributors for the book and partners will also have a copy of this tool chain or do they get a slightly different one?

Dave Thomas 00:39:47 Nope, we send them the result. So as an author, when you build, you get a kind of lightweight copy of the PDF or the EPUB. When it’s time for production, we take your book and we move it across to a protected section of the repository. And at that point we get to use the kind of production version of the tools, which to be honest with you are about 90% the same. The difference is that we don’t re-sample images, we normalize stuff. We also make sure that all the color has been removed and we convert everything to black and white, which to be honest with you, if you actually had a look at one of the PDFs we send off to the printer, you would probably raise your nose at it. They look pretty ugly. They’re a lot darker than you’d expect for example. And that’s simply because of the way the printing process works. It’s taken us a long time to fine tune that, but you have to manipulate things quite a bit to get it to work. And also, we have to send it off at a way, way higher resolution than we use for screen reading.

Gavin Henry 00:40:52 Do you use any or have any plans to use any AI tools? For example, for things like fact checking, locating references for unreferenced content checking the external references are properly used, links or assessing content, you know, in the editing process for things like structure clarity?

Dave Thomas 00:41:12 So yeah, I mean the first whatever up until the thing where you said assessing content, we actually do that already and we’ve done that for a long time. As part of the production process, we actually go through and validate all external links and as part of any build, we actually validate all the internal links. We also get told if they are invalid. We also validate things like, since accessibility is important, we do things like make sure that all of your images have alt tags to describe them and this kind of stuff. So all that kind of stuff is done. But on the AI front, here’s the thing, we have a rule that you cannot use AI to write the content. So as an author, feel free to use AI as a research tool or as a sounding board or as whatever you want. But we actually run AI checkers on the books that are given to us by an author just to make sure that they do not have AI generated content.
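
Build-time checks of this kind, internal links that resolve and images that carry alt text, can be sketched with Python’s standard XML parser. The element and attribute names (`xref`, `target`, `img`, `alt`, `id`) and the `check_book` function are assumptions for illustration and are not taken from the real markup.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_book(xml_text: str) -> List[str]:
    """Return a list of human-readable problems found in the document."""
    root = ET.fromstring(xml_text)
    # Every element with an id attribute is a possible link destination.
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    problems = []
    for ref in root.iter("xref"):
        if ref.get("target") not in ids:
            problems.append(f"unknown cross-reference target: {ref.get('target')}")
    for img in root.iter("img"):
        if not (img.get("alt") or "").strip():
            problems.append(f"image without alt text: {img.get('src')}")
    return problems


BOOK = (
    '<book><sect id="intro"/>'
    '<xref target="intro"/><xref target="todo"/>'
    '<img src="a.png" alt="A diagram"/><img src="b.png"/></book>'
)
```

Whether a problem stops the build or is only reported is a policy choice left to the caller; this sketch just collects the messages.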

Dave Thomas 00:42:13 And the reason for that is that we think the value we bring in the AI age is that our books don’t hallucinate, right? Our books are the result of a human being giving it their best. And so we don’t want our authors to be using AI to generate content. Now at the same time, we do have something that we released fairly recently, a couple of months ago, called the Pragmatic Assistant, and that is a ChatGPT project that has been trained on probably about 40 separate pieces of documents and memos and all sorts of stuff about what it means to be a bookshelf book. So we, you know, it has all of our writing style guidelines, markup information, all that kind of stuff. And we encourage authors to use that, particularly when they’re starting out because we have a style, we have a voice that we like people to use.

Dave Thomas 00:43:14 And if you’ve never written long form before, it can be quite tricky to get that right. And so in the old days that used to be what the development editor would do, you would submit your book and the development editor would go, oh no, no, no, not like that. Now you can basically cut and paste a chapter or a paragraph or something into the AI, and it’ll come back and say you really shouldn’t use the passive voice here. That should be second person, not first person. Or actually it would probably just say, you know, don’t use “I,” use “we” or “you” here and this kind of stuff. And so we’re still refining that, but we’re finding authors are quite liking that as a kind of angel on their shoulder kind of AI.

Gavin Henry 00:43:54 Yeah, I’ve used that myself when I had to prepare the audience statement for the first round of submission. I opened up the ChatGPT pragmatic program agent, I think it was John or something, pasted in what I’d written and it went, oh no, no, no, no, no.

Dave Thomas 00:44:11 Yeah

Gavin Henry 00:44:13 I know how to do this. And he rolled up the sleeves and went, no, this is how it should look. Yeah. I was like, but that doesn’t sound like me.

Dave Thomas 00:44:19 Yeah, that’s the thing is it is not like, you know, we’re not handing control over to an AI. It’s purely, I mean I use it all the time as a kind of, I don’t know about you, but I always find I get my best ideas by describing them out loud.

Gavin Henry 00:44:33 Yeah, the rubber ducky.

Dave Thomas 00:44:34 The rubber ducky. Right. And an AI is a fantastic rubber duck. One of the cool things we discovered about this is, you know that ChatGPT, right? It always starts off by saying what a wonderful human being you are. Right? That was really, really good Dave. And we discovered that people didn’t like that, right? If they submitted a section to be critiqued, they didn’t want it to come back and say, that was fantastic, I really enjoyed reading that. And then the next sentence says, but of course it was rubbish, you know? And it took us a long time to work out how to tell it not to be artificially encouraging and I think we’ve kind of got that under control now and the response I get from the authors is that it’s so much better, that all they get back is what they wanted, which is a critique of what they did.

Gavin Henry 00:45:28 That’s cool. I look forward to using more of those as they come out. Yeah. I’m going to move us on to the last part of the show. I don’t think we’re going to have as long as planned, but I think we’ve covered off maybe half of it. So this last bit was going to be about your CI and CD pipelines. We’ve got a good history lesson about that and we now understand that authors can do it locally anytime they want. The distribution printers don’t run it themselves, which I was under the impression they did. So they get a different version.

Dave Thomas 00:45:56 Yeah, they get the final product.

Gavin Henry 00:45:58 But just to sort of step back and summarize that process again quickly. When an author wishes to read a PDF or an EPUB version of their work in progress, what happens on your infrastructure when they commit and push a change to a Markdown file? Now that we know it’s Markdown.

Dave Thomas 00:46:14 So like you say, we have a build machine, a build environment called BOB for obvious reasons. And if you’re an author whenever you commit a change,

Gavin Henry 00:46:24 Is that Build Our Books? Is that what it stands for?

Dave Thomas 00:46:25 It’s, yeah, it’s that.

Gavin Henry 00:46:27 That was a good guess.

Dave Thomas 00:46:28 Very good, very good. So yeah, whenever you make a commit, BOB gets notified and BOB has a queue of books waiting to get built. So BOB will build your book and stick the, I can’t remember if it’s just the PDF, but it will build it, stick the resulting PDF off on a cloud drive somewhere and record the log of the build. And then as an author you get access to our backend system called creators.pragprog.com. And one of the things that does is let you access the build log and you can see the status of your particular book build and download both the logs and also the PDF version. And one of the benefits of being an author is you actually get to see every other book as well. And so if you are kind of like vaguely interested in some other topic, you can download that and have a look to see what that looks like too.

Gavin Henry 00:47:22 Yeah, I do that. I have a quick glance. I know that Simplicity wasn’t in that build queue though.

Dave Thomas 00:47:28 Yeah, well actually once a book is finished we take it out of the build. I see. Typically simply because otherwise it just, it gets too, I mean we’ve got like, I think we’ve had about 700 titles now and so that would be quite a long list. So yeah, typically that’s works in progress, but if you want to get a copy, all you have to do is just tell your editor or write in to us and we’ll get you a copy because yeah, authors get access to everything.

Gavin Henry 00:47:54 So BOB picks that up, sorry, just to look back. And is it the same types of error that can occur that you would get locally or is it different?

Dave Thomas 00:48:02 Nope, it’s exactly, there’s really only ever one process until it gets into production. So the thing that you could run locally is exactly what BOB runs as well.

Gavin Henry 00:48:11 Okay. And is BOB good at spelling or grammar checking or is that part of the pipeline?

Dave Thomas 00:48:15 No, our tools don’t do that. We experimented with it and to be honest, it was just too much of a pain. It’s a lot easier to, because to be honest with you, the vocabulary that we use in our books is not in most dictionaries and grammar checkers and spelling checkers aren’t that happy with technical content. I don’t know if you’ve, if you ever tried turning a grammar checker on, but a lot of the times we use nouns as verbs and this kind of thing and they just go, you know, it’s all red squiggles if you

Gavin Henry 00:48:49 Yeah, I use JetBrains’ RustRover IDE for writing the code and the Markdown and it’s forever trying to correct stuff.

Dave Thomas 00:48:58 Yeah.

Gavin Henry 00:48:59 And I have to be really conscious of what you said in the last section because Copilot or whatever it’s called, ChatGPT, is there always looking to complete my code line or my sentence. So I have to either switch it off, disconnect from the internet or click escape. So it’s not trying to just put random stuff on my page. Sometimes it’s helpful, you know, it might be just like, follow this next line or something, which is what I would’ve said anyway, but I’ve got to really be conscious that it is my voice, you know, and not.

Dave Thomas 00:49:27 And that’s the really important thing. Your voice and your experience. Because the other thing I find is if I leave, say Copilot running, and if I get lazy and start accepting what it does, it’s not just that it replaces the next sentence, but it kind of guides you in a particular direction because it, it has this picture in its model of where you want to go and it may not be where you want to go. And so if you just sit there and hit tab the whole time and accept what it’s saying, you’ll end up at a very, very different chapter to the one you intended to write.

Gavin Henry 00:49:58 The IDE seemed to re-reference other stuff in that document. Oh yeah. Other tabs you’ve got open and then it just becomes a mess.

Dave Thomas 00:50:04 It’s really interesting, our editors, about half of them can spot AI generated content just like without even breathing. They just like say, oh, that look. And you ask them how they do it and they’ll say things like that. They’ll say, you know, you can spot that there’s too many references or something like that. Personally I can’t do it. I’m kind of AI blind that way, but yeah, they can do that. So yeah, so I personally, I’m with you. I actually turn all the AI assistants off when I’m writing text because I just found it was too distracting. It would actually slow me down.

Gavin Henry 00:50:38 Yeah. I put music in my ears as well. So I guess it just depends. So I’ve got a question next about self-links within the book, which we’ve covered in the previous section. One thing I haven’t asked you about is do you require like multi passes, so say for example, I’ve linked to page 50 and we’re on page 10, does that just go in like a lookup table type thing and then you check it at the end? Or how does that work?

Dave Thomas 00:51:02 That is a very good question. In early versions of the tool chain, we actually had to make multiple passes and that actually turned out to be really, really difficult. We actually had some pathological books where if, say, it said, as you can see on page whatever, right? And if that page number was two characters, 99, then the line would break in a certain way and the rest of the book would format through according to that. If the page number changed to a hundred, then that would change the pagination of the book and that potentially would then change the page number from a hundred back to 99, and then the 99 would then change it back to a hundred and it would just loop forever changing these page numbers back and forth. So we don’t do that. Instead what happens is cross references get, like you say, they get built up effectively into a table and then they’re plugged in towards the end and the formatting then basically adjusts based on that. So there are still pathological cases, but it’s all handled internally. That is a joy.
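
The 99-versus-100 loop is easy to reproduce in a toy model. The `layout` function below is entirely made up: it stands in for a formatter in which a two-character page reference pushes the target onto page 100 and a three-character reference pulls it back to page 99. It only illustrates why repeating passes until the numbers stop changing may never terminate.

```python
from typing import List, Optional, Tuple


def layout(ref_text: str) -> int:
    """Toy formatter: the width of the printed reference decides which
    page the referenced target ends up on."""
    return 100 if len(ref_text) == 2 else 99


def resolve(max_passes: int = 10) -> Tuple[Optional[int], List[int]]:
    """Re-run the layout until the page number is stable, or give up.
    Returns the stable page (or None) and the page tried on each pass."""
    tried = []
    page = 99
    for _ in range(max_passes):
        new_page = layout(str(page))
        if new_page == page:
            return page, tried
        tried.append(page)
        page = new_page
    return None, tried
```

Calling `resolve(4)` gives up with the history 99, 100, 99, 100, which is the back-and-forth described above; a real formatter has to break the cycle some other way than by adding passes.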

Gavin Henry 00:52:14 Yeah, because I, I did it myself. When you’re free to address anything and if it doesn’t see it, it says on the PDF or the eBook, this reference is, there’s something I don’t know about yet. Yeah. You can find it and I like that.

Dave Thomas 00:52:26 Yeah, that actually is, I kind of do that a lot. The idea is that if you are, say you are, yeah, you are writing something and you know you want to talk about this, but you haven’t got round to it yet. If you put in a cross reference to the section and it doesn’t exist, then the book will still build, but it builds with a warning to say you don’t have this so you can basically use it almost like a to-do mechanism.

Gavin Henry 00:52:48 Yeah, I hadn’t thought about that. So the next couple of things before we wrap up. So writing Markdown traditionally, if that’s even a word you can use for Markdown and things, you might have a split view where, as you’re writing, you see it render, so you see your headings and subheadings. Is that something that an author has to be conscious of when they’re doing it and what happens if it’s not the way they want it to be? I guess they just build a PDF and look.

Dave Thomas 00:53:14 First of all, the way Markdown renders into like an editor buffer will not look like the book anyway because, for example, the widths of the window will change how many lines there are in this paragraph and this kind of stuff because the editor is going to show you the flowable version of the content, not the fixed version. And honestly, we really don’t want people to be thinking about what it looks like at that level. Right? I think writing is hard enough without also trying to do layout at the same time. So my strong recommendation has always been put blinders on and just write, don’t even think about how it’s going to look. Just write and then let our people, I mean obviously as an author you can go through and then make things look the way you want them to look, but our people are really good at then taking that last mile or kilometer and making things look really nice. Yeah, I mean I, every now and then will run a preview pane on the Markdown. I’m not a hundred percent sure I know why I do it. It kind of just gives me a kind of warm, fuzzy feeling that yes, I’m achieving something, but it really isn’t that useful.

Gavin Henry 00:54:27 So the last quick couple of questions that I’m going to sneak in. Do you handle indexes in this build system?

Dave Thomas 00:54:34 Yeah, well, we have index markup that we don’t tell authors about. It’s just a set of tags that the indexer adds. So when you finish the book as an author, you submit it off to our production process and one of the first things that happens is it goes to the indexer and the indexer goes through and adds a really large number of tags to your book. And he does that in line. We are one of the few people in the industry that does that. Most of the time indexers use a separate application and they’re given a book that has all of the stuff organized into pages and has page numbers. And they will actually literally type in, you know, the name, the content and the page number on which it occurs into their own index program. And after they finish going through the whole book, their indexing program then kicks out the index. And of course that’s great until you then go and add content to the book in the second printing and it changes the page numbers. So we don’t do that. We actually, you mark up the same text that the author is working on with these index tags and then as part of the build process, it automatically collates those and creates the index at the end.
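
Collating inline index tags at build time, so that the index follows the text when page numbers move, can be sketched as below. The `<index term="..."/>` tag and the `build_index` function are hypothetical; the interview does not show the real index markup.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical inline tag placed in the text by the indexer.
INDEX_TAG = re.compile(r'<index term="([^"]+)"/>')


def build_index(pages: List[str]) -> Dict[str, List[int]]:
    """Map each indexed term to the sorted page numbers it appears on.
    Pages are numbered from 1 in the order given."""
    found = defaultdict(set)
    for number, text in enumerate(pages, start=1):
        for term in INDEX_TAG.findall(text):
            found[term].add(number)
    return {term: sorted(numbers) for term, numbers in sorted(found.items())}


pages = [
    '<index term="XSLT"/>We transform the book with XSLT.',
    "A page with nothing indexed.",
    '<index term="XSLT"/><index term="PDF"/>Then we produce the PDF.',
]
```

If a later printing inserts a new page at the front, calling `build_index(["New preface."] + pages)` shifts every entry by one without anyone retyping page numbers.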

Gavin Henry 00:55:49 So when you get to handle change history or things like that, are all those tags there and you just have to be conscious of that or?

Dave Thomas 00:55:57 Yeah, that is a downside. The indexers, in the old days they used to be like, you know, I’m going to index this word. And so they would add it right in the middle of a paragraph and now we tell them if the paragraph’s not too long, stick it at the beginning of the paragraph and that way it doesn’t interfere with the reading for the rest of the paragraph. But the benefits of doing it that way I think far outweigh the ugliness that it includes in the document.

Gavin Henry 00:56:21 Sounds like there’s humans in the loop for the whole book process.

Dave Thomas 00:56:25 Oh, absolutely. We have. You have your development editor who’s basically doing the daily, keeping an eye on what’s going on. Then after you submit, it goes through a number of internal reviews here, it goes out to an external copy editor, an external indexer, and then it goes back to the author one or two times, it then goes to an external layout person, another person does the cover, and then it all comes together. It goes off to the printer. Yeah, it’s very much a high touch process.

Gavin Henry 00:56:58 Before it gets that final stage. How does the feedback work between your, all the different editors and the author? Does that stay within the document or is it emailed or how does that?

Dave Thomas 00:57:09 It’s both. We have markup, I’m sure you’ve encountered it, where an editor can insert tags, the editor tag, or ed, and an author can insert tags. And these are basically like callouts that you can put in your text. So if you wanted to ask your editor a question, you could say, should I say it this way or that way? And you put that into a tag in the book. And then when the editor builds the book, those tags are highlighted both in the book itself that’s produced, but also during the build process, they can see a list of questions that you set come out during the build and then they can answer those however they see fit. But quite often they’ll answer those by putting those tags back in the book. And the copy editor does the same thing.

Dave Thomas 00:57:52 So the copy editor mostly works autonomously. We’ve experimented with having them do all the, basically just put questions in and then have the author make the changes. 99% of the time the copy editor is right. So instead what we do is we say to them, if it’s just obviously like bad grammar or something, just fix it. And if it’s something you don’t know or don’t understand or are suspicious about, then add one of your copy edit tags to the text at this point. And then as an author, when you get the book back, it will have these copy editor questions in it. And one of your jobs is to go through and basically remove those and fix whatever it was they were pointing at.
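
Listing the open questions during a build could be as simple as the following sketch. The `<ed>` and `<author>` tag names and the `collect_notes` function are assumptions for illustration; only the idea of in-text callouts that surface during the build comes from the interview.

```python
import re
from typing import List, Tuple

# Hypothetical callout tags; \1 makes the closing tag match the opening one.
NOTE = re.compile(r"<(ed|author)>(.*?)</\1>", re.DOTALL)


def collect_notes(text: str) -> List[Tuple[str, str]]:
    """Return (who, note) pairs in the order they appear in the text."""
    return [(who, note.strip()) for who, note in NOTE.findall(text)]


chapter = (
    "Intro text. <author>Should I say it this way or that way?</author> "
    "More text. <ed>This way reads better.</ed>"
)
```

A build step could print the result of `collect_notes(chapter)` alongside the log, so nobody has to hunt through the prose for unanswered questions.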

Gavin Henry 00:58:33 We’ve got the sort of build chain, CI/CD and BOB, that actually make a digital book as it were, but then the people make the proper book.

Dave Thomas 00:58:44 Basically. Yeah, I mean we still use that build chain to actually create, you know, the mechanics of it. But yeah, it’s not something currently that you can automate. Even the layout, surprisingly, there’s a lot of aesthetic and experience that goes into it.

Gavin Henry 00:59:02 Perfect. Well I think we need to wrap up now. Like I said at the beginning, as an upcoming author, I could chat to you forever about this stuff, but got to kind of keep it contained.

Dave Thomas 00:59:13 At some point we will be chatting with you about your experience because we try to, you know, keep constantly updating it and making it either better or cooler.

Gavin Henry 00:59:22 Yeah. One of the sad things I hit was when I was told to stop writing because my editor was reading it all, I said, what am I going to do now for like a year? I’ve been doing this every day. But anyway.

Dave Thomas 00:59:33 It is one of those things where either you hate it or it becomes, you’d be amazed how many people come to us, you know, wanting to write a book and then have written three or four.

Gavin Henry 00:59:45 I’ve got a couple up my sleeve.

Dave Thomas 00:59:46 There you go.

Gavin Henry 00:59:47 The express ones mainly I think moving forward.

Dave Thomas 00:59:49 Yeah. And that’s actually the, it’s interesting. So our express books are shorter and they have a slightly different style to them. They’re not as narrative as our longer books. And I think as the generations change, as people’s attention span changes, we have been going more and more to these shorter books and they are the ones that are popular a lot. Well, that’s not true. They are increasingly popular, so yeah. And they’re easier to write. So I think a couple of express books would be a very nice next step.

Gavin Henry 01:00:20 Well, hopefully this chat we’ve had brings some new blood to the company.

Dave Thomas 01:00:25 Yeah. And I just want to say it’s not, honestly, we’ve talked about the technology behind it, right? But writing a book is not about the tech. Writing a book is about somebody having something they’re just bursting to say, you know. And to be honest with you, the tech that you use is secondary to just wanting to get that out, you know. I find if I really want to understand something, the best way to do it is to write a book about it. Because that way you have to be sure you understand it, you know? And you’ll find yourself writing like, “but obviously the,” hang on, why is that obvious? And you’re going to go and do some research and everything else. It’s not a technical process is what I’m trying to say. And so we try to write a tool chain that kind of gets out of your way and just lets you get on with the fun part, the writing part.

Gavin Henry 01:01:14 Well, I’m really glad we did this show. This isn’t our usual type of thing, but something I found fascinating as an upcoming author. So how can people get in touch and reach out?

Dave Thomas 01:01:23 Pragprog.com. Over the next couple of months we will be launching a new website, but even on the existing website there’s a section which is, I think it’s called Write With Us or Publish With Us, which has got lots of resources and contact information and everything else.

Gavin Henry 01:01:40 Excellent. And I’ve also got your articles on pragdave.me website, LinkedIn here, and a link to your new book as well on the pragprog website.

Dave Thomas 01:01:50 I appreciate that. You could also link to my Substack, which is articles.pragdave.me.

Gavin Henry 01:01:55 Well Dave, thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening. [End of Audio]

