Peter Wyatt, CTO at PDF Association and project co-leader of ISO 32000 (the core PDF standard), and Duff Johnson, CEO at PDF Association and ISO Project co-leader and US TAG chair for both ISO 32000 and ISO 14289 (PDF/UA), discuss the 30-year history of the portable document format (PDF). SE Radio’s Gavin Henry spoke with Wyatt and Johnson about a wide range of topics, including the PDF/A archival format, key dates in PDF history (including why 2007 was such an important year), and PDF security. They explore details such as redaction of information in a PDF, object models, what Adobe did right, choosing PDF versions, efficient paging of documents, SafeDocs, selecting a PDF SDK, Arlington PDF, and veraPDF. They further consider when to use the PDF format, binary and XML, JavaScript in PDFs, PDF linters and validators, backward compatibility, how HTML and PDF complement each other, the biggest PDFs in the world, PDF as a website, and the guests’ top 3 PDF security tips.
Show Notes
Related Links
- Breaking the Specification: PDF Certification
- Towards enhanced PDF maldocs detection with feature engineering: design challenges
- PDFPhantom: Exploiting PDF Attacks Against Academic Conferences’ Paper Submission Process with Counterattack
- Research Report: Strengthening Weak Links in the PDF Trust Chain
Other References
- PDF Association
- https://www.darpa.mil/program/safe-documents
- Peter Wyatt’s Twitter: https://twitter.com/petervwyatt
- Duff Johnson’s Twitter: https://twitter.com/duffjohnson
- PDFAssociation
- veraPDF – Open Preservation Foundation
- GitHub – pdf-association/pdf-issues: Industry-based resolutions for issues and errata reported against any PDF-related ISO standard
- GitHub – pdf-association/arlington-pdf-model: A vendor- and implementation-independent specification-derived, machine-readable model of PDF.
- Quartz (graphics layer)
- ISO 19005 (PDF/A)
- ISO 32000 (PDF)
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Gavin Henry 00:00:16 Welcome to Software Engineering Radio. I’m your host, Gavin Henry. And today my guests are Peter Wyatt and Duff Johnson. Duff is the CEO at PDF Association. He has founded and led several software and services businesses in the electronic document industry since 1996. He also serves the PDF industry in technical roles as the ISO project co-leader for ISO 32000 (the PDF specification), ISO project leader for ISO 14289, and US TAG chair for both. He is currently the US head of delegation to ISO/TC 171/SC 2. (Don’t worry, listeners. I’ll put those in the show notes.) Peter is the CTO at PDF Association and has been actively working on PDF technologies for more than 20 years. He is project co-leader of ISO 32000, co-chairs the PDF Association’s PDF TWG (Technical Working Group), and is PDF Association’s principal scientist leading work on the DARPA-funded SafeDocs project, which is at the intersection of cybersecurity, parsers, and electronic document formats. Peter and Duff, welcome to Software Engineering Radio. Is there anything I missed in your bios that you’d like to add?
Peter Wyatt 00:01:33 Thanks for having us Gavin and no my bio is good, thank you.
Duff Johnson 00:01:37 That sounds good Gavin, thank you.
Gavin Henry 00:01:40 Excellent. So we’re going to start the introduction and I’m going to split the show up into four topics on the wonderfulness of PDFs: the history of PDF, what a PDF is made up of, how to create a PDF, and the big one, PDF security. (I’m calling it the “big one”; it might not be.) So, let’s start. The title of our show is obviously 30 years of PDF. Peter or Duff, could you take us through the key milestones over those 30 years if it’s possible?
Peter Wyatt 00:02:09 So maybe I’ll start. Let’s begin a little bit before PDF. So obviously 30 years is a long time ago. PDF was founded on PostScript, which was an interpreted programming language released in 1984. So back in those days, computing power was obviously much less. Things were much harder to debug. And one of the issues that people found with PostScript was that you couldn’t get to page 100 in a document without processing pages one to 99 first. And this obviously became a problem as laser printers came into fashion and you needed to reprint pages or you wanted to print in reverse order or something like that. Now, PostScript is a fully blown programming language that has all the power of a programming language. And you can do very fancy things like redefine white to be black, but you also need programming skills and debugging skills in order to write a PostScript program.
Peter Wyatt 00:03:02 So, this is obviously not a great outcome for the graphic arts industry or just documents in general. So then John Warnock, who was one of the Adobe co-founders, in 1990 wrote a well-known paper known as the Camelot white paper. At that point he noted that there were a hundred commercially available printers and about 4,000 applications that produced PostScript. So remember this is back in 1990, this is the days of your 640K, 286 or 386 PCs with VGA screens. So it was a very different world than we have now. And what he described in this Camelot white paper was something that he called IPS or Interchange PostScript. But it’s what we would come to know as PDF. Anyway, Adobe eventually published PDF 1.0 in June of 1993, and they continued publishing versions until PDF 1.7 in October 2006. All these versions are freely available and effectively defined the format as they saw it. They owned the format and they led the development of its direction. And obviously, their implementation closely matched the spec, or effectively was the spec.
Peter Wyatt 00:04:11 In PDF 1.4, which was December 2001, there was actually a big sort of transition in the PDF technologies. This was the introduction of transparency and advanced blending. So this is in the days of early illustration programs, when these features were becoming the core features that graphic artists were using to create really rich marketing documents and so forth. And these concepts were later introduced directly into SVG from their PDF origins. And the features that you see in PDF have exactly the same names that you see in these common applications. In 2007, Adobe passed PDF 1.7 to ISO, the International Standards Organization, for fast-track adoption. And this is a special process by which an existing specification can be made an international standard in 18 months. You might ask, well why ISO? Why not some other standards body?
Peter Wyatt 00:05:08 Well, because at that time there’d already been about seven years of experience in publishing what we know as PDF/X, where the X means exchange. And these are standards specifically in the graphic arts and commercial printing space designed to make commercial printing much more predictable and reproducible across vendors, across different devices, et cetera. And this had been in place since 2001. So, in 2007 ISO was seen as the obvious place to continue to take PDF standardization. In 2008, after the 18-month fast track, ISO published the first PDF standard, which is ISO 32000 part 1, 2008, and it’s effectively PDF 1.7. It’s very similar, but not quite identical to the Adobe PDF 1.7 version because obviously the proprietary details and their implementation-specific stuff was removed. And if you remember this era, this is sort of the mid 2000s, we had a lot of competition in the sort of operating system and business space from the likes of Microsoft with their new operating system, which was codenamed Longhorn. And they had a new format that they called the XML Paper Specification or XPS, and there was a push to standardize that. So, in a way, Adobe met the challenge and brought PDF out from behind the Adobe wall and into the open.
Gavin Henry 00:06:35 Up until 2007, it wasn’t an ISO standard?
Peter Wyatt 00:06:40 No, it was an Adobe — it was a freely available document, but it was their proprietary knowledge, and anyone could go and download the PDF spec, and you could implement it. But it was written, I guess they probably did their best go at writing a document that gave an open and honest understanding of what they thought PDF was. But certainly as somebody who was involved in developing PDF technology at that time, there were certain struggles with the document in trying to sort of mimic what the Adobe technologies were doing, but it was freely available. So although it wasn’t an international standard, it was freely available.
Gavin Henry 00:07:17 Okay. Was that Microsoft’s attempt to try and thwart PDF becoming a standard? Do you think they had a heads up or?
Peter Wyatt 00:07:24 No, I think, remembering back to those days, XML was the latest and greatest thing and there was certainly marketing promoting that XML was better than everything. And if you do remember, there was a lot of push to make XML the center of the universe in those days for all technologies.
Gavin Henry 00:07:41 That’s right, yeah. The schema definitions and everything.
Peter Wyatt 00:07:43 Exactly. So, in those days the XML Paper Specification mirrored what PDF was. And XPS still exists today inside the operating systems and is used as a spool format, and you can save as XPS in Windows 10 and 11. I don’t know how many people use it, arguably not that many. At this time Adobe even prototyped a version of PDF in XML that was codenamed Mars. Not surprisingly, it never gained any traction because realistically there was no benefit in the XML version. In actual fact there were disadvantages: it was much larger and more complicated, and it was exactly the same as PDF in terms of what you as an end user saw in your documents. Anyway, I’m going to jump forward a little bit. So, in 2017, so this is, remember, nine years after that first standardization of PDF, we finally published, or ISO finally published, PDF 2.0, and this is the first PDF standard that was fully developed in an open forum with input from many experts from around the world and across many vendors.
Peter Wyatt 00:08:44 And this is the document we refer to as ISO 32000 part 2, 2017 edition. Now, nine years is a long time even in ISO standards time, but the result of that work was a vastly improved document. There were a lot of people looking at the document very carefully and making concrete suggestions. And of course, there are new features that were introduced in PDF 2.0, and it is the latest version. In 2020, however, we published an update to the 2017 edition, mainly to correct various points. And right now, there’s a process to address some errata. About this point I might hand off to Duff, or maybe Gavin you have some questions?
Gavin Henry 00:09:26 Yeah, I was going to ask Duff about where the PDF Association fits in with the ISO standard, or its role in ensuring PDF lives on.
Duff Johnson 00:09:37 Well, as Peter’s been saying, the ISO standardization process for PDF was initiated more or less around 2000 with the development of PDF/X, and the next ISO standard developed pertaining specifically to PDF was PDF/A, or the archival subset of PDF. This was published as an ISO document in 2005, and it was received with great fanfare in, for example, Germany, which is a place of many laws and many software companies particularly interested in meeting the needs of state and other actors in terms of these laws. And in fact, many of the initial PDF/A implementors were German companies. So many of them had gotten together and been working on this new specification and come to realize that they needed to develop some additional industry understanding about how to fully understand the PDF/A specification.
Gavin Henry 00:10:36 There isn’t just one PDF ISO standard, there are subtypes of PDF?
Duff Johnson 00:10:42 So yes, so as Peter mentioned, in 2000 the graphic arts industry had come to need its own common understanding of PDF in the context of a specific application, that is to say, high quality, high speed print operations. So back then the graphic arts industry had come up with requirements that included color management and the inclusion of fonts directly into the PDF file as a means of ensuring the conveyance of fully reproducible results between printing systems, for example, right?
Gavin Henry 00:11:19 Yeah. So everything you need is bundled in rather than . . .
Duff Johnson 00:11:23 So everything you need is bundled in. And it turned out that the archival community has a very similar requirement, right? So these folks need a digital document, once created, to be reproducible and usable as it was created many years into the future and on many different systems, not only the computing system on which the document was created. The requirements are actually relatively similar to those of graphic arts but not identical. And as a reaction to the need of archivists for a preservation-oriented PDF file, the ISO community, or the developers engaged with the ISO community, at this point decided to develop PDF/A for archive. So, the PDF Association emerges from that because the initial set of non-Adobe developers who were producing PDF/A got together and realized that it was very important, of course, that their implementations avoided colliding, right? Because if you’re making something that you call archival, and you’re specifically calling it archival because it can be exchanged between implementations, then it’s not going to help you very much if somebody makes one of these files and somebody else’s implementation cannot read it. So this group of vendors got together in Germany and created a small organization they called the PDF/A Competence Center. The PDF/A Competence Center was the forerunner of what is today the PDF Association. For the first three or four years, it ran a couple of conferences. It created various technical notes that reflected the common understandings that those vendors developed. And then starting, I think around 2010, the organization decided to expand its scope and become really the international organization to address all matters of interest to PDF technology in general.
Gavin Henry 00:13:22 Thank you. Before I move into the next section of the show, are there any key moments in that history that we have mentioned that you’d like to really highlight that changed the industry or spurred all the eDocument businesses out there, HelloSign, DocuSign, all those types of things?
Duff Johnson 00:13:42 I think Peter did mention this, but one of the things that I often emphasize is that Adobe did two amazing things very right back in 1993. Today these things are not particularly remarkable, but in a way they’re not remarkable today because Adobe did them back then. And the first thing that Adobe did was to make the Adobe Reader free software, so that it was not only possible to create a PDF file using Adobe’s paid software, but then anybody could read it on any platform. Back then, it was relatively unusual to give away powerful software for free for use on the desktop. So, this is one important innovation. And the other, of course, was to publish the specification publicly with the express intent of allowing third-party developers to develop their own PDF implementations, creation and consumption both.
Duff Johnson 00:14:36 And these two moves indicated that Adobe understood that the purpose of this technology was to take on the world of paper. And the only way to take on the world of paper, and paper’s predominance in the business and communication space in the world, was to eliminate the possibility that the understanding of how to use the paper and the software to use it would be a barrier, right? So making the specification free and the viewing software free certainly led to PDF’s success. And I think downstream from that, we see a whole world of technologies where in the modern era it’s presumed that many software specifications are going to be freely available, and people very commonly expect that viewing software will be free, whereas creation software perhaps may not.
Gavin Henry 00:15:35 Yeah, I suppose they understood that to make it successful, they needed mass adoption, didn’t they? I wonder what format, if any, would’ve won if they hadn’t done that, or whether we’d still be in the wild west of trying to print and preserve things.
Duff Johnson 00:15:52 Well indeed Adobe did, and I think we’ll talk about this. There were numerous other competitors at the time, and I think PDF was very much the right technology that came along at the right time. It met the oncoming internet and met the obvious need to use digital means to be able to convey structured information or laid out information and avoid the necessity of printing and sending things through the overnight mail, and so on. And so the emergence of internet technology met the development of PDF very, very neatly to give people a means of moving their business processes from printers and scanners to simply emailing content or other digital means of distribution.
Gavin Henry 00:16:42 Thank you. So that was a really good overview, sort of a bite-size chunk of PDF history. I’m sure we could do quite a few shows on each of those sub parts. Everyone will have used a PDF, opened it or clicked print PDF or exported as PDF at some point in their lives, whether as a user or as a developer. Could we spend some time taking us through what the PDF format is? So for example, those of us that are curious, when we go to a website, usually right click that web page and click view source. If we try and open up a PDF in a text editor or a console-based text editor, why doesn’t that work? And what are the main bits of a PDF?
Peter Wyatt 00:17:25 Okay, well I think maybe we need to start and say, well, what is a PDF? So what it’s representing, as Duff said, is a document and specifically a paginated document. Why is that important? Well, obviously in the HTML world, we can have infinitely scrolling pages and very long pages. But in a PDF document, everything is paginated. It’s also what we call typeset and laid out precisely. And so typeset means that the kerning and the choice of glyphs and the choice of typeface, exactly and precisely how the author wants them, are encoded into the PDF format. PDF is not a format that word wraps depending on the size of your browser. You have a page size, whatever that may be, A4 or letter size or whatever, it can be a postage stamp, and then the content is laid out on that page, and it paginates. And it’s very precisely defined in terms of how the appearance model works.
Peter Wyatt 00:18:19 And I mean very precisely because, you remember, its history is back in the printing days, in the LaserWriter days. So, 300 dots per inch. Because of its history in print, I think, it’s always had this definition that’s been about precision. So, for example, there are many pages of the PDF spec defining exactly how you should dash a line, what endcaps to use and all the mathematics around stroking and filling line ends and so on and so forth.
Gavin Henry 00:18:48 It was quite surprising when you said it was difficult to pick a page to print. That kind of shocked me a little bit.
Peter Wyatt 00:18:56 Yeah, well if it’s a programming language, I guess it’s the same thing sometimes. I’m trying to think of an analogy, and I guess today you sometimes get that if you load a very large document into an office suite application and you quickly scroll to the end, sometimes you have to wait for the application to kind of catch up? I’m talking like a hundred-page document. Obviously back when PDF was starting, that slowness was amplified by the fact that computers weren’t as powerful, there wasn’t as much memory. So, the key was the ability of PDF to be what we call a random-access file format. So, you can jump to any page in a PDF very, very quickly and there is no cost to doing this. You don’t have to understand what’s on page one and two and three to get to page a hundred.
Peter Wyatt 00:19:38 You can go straight to a page 100 and display page 100 because it has its own definitions. Now having said that, if your document has the same logo on every page or the same font in every page, you can reuse those assets so that the file size is optimized, but you don’t actually have to understand exactly how page one was laid out and where exactly the word break was. So, you can then do page two and exactly where that word break is and then do page three. And if you think back to the early versions of office applications, it was fairly common that if you shared an office document with somebody else on a different platform, you could get different word wraps at the end of pages and you’d have a document with five pages, and somebody else has a document of four pages or it breaks at this point in your document and at a slightly different point in somebody else’s document. And PDF is focused on capturing the type setting and precise definition of the laid-out document. So, this is why it’s sometimes referred to as a final format, but PDF isn’t really a final format.
Peter Wyatt 00:20:40 It’s just a fixed laid-out format. It’s not a flexible format like your listeners would know about with HTML for example. So, answering your other questions about binary and text, so PDF is not a text format. Yes, its keywords and many of its aspects are defined as ASCII byte sequences, so human readable, but technically speaking it’s a binary file format because it uses byte offsets to locate objects in the file. Everything in a PDF file is object-based. And we build up this document object model, again, a term people familiar with HTML would know, but remember this dates back to 1990. So the document object model in PDF is object-based. You can reuse these objects across pages or however you wish, and each object can be randomly accessed very quickly. You don’t have to read the entire file. And again, this is slightly different to HTML or SGML, where you have to read all the tag nesting and so on and so forth to understand the document. With PDF you don’t have to do that. You can literally open a document and jump straight to page 100 and have never looked at anything to do with any other page.
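The random-access lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real PDF parser: the helper names `build_minimal_pdf` and `read_object` are hypothetical, the file uses only a classic uncompressed cross-reference table with a single subsection, and the sketch jumps straight to an object number, whereas a real reader would walk from the catalog through the page tree to find a page.

```python
import re

def build_minimal_pdf(bodies):
    """Assemble object bodies into a PDF-like byte string with a classic xref table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # every entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def read_object(pdf, obj_num):
    """Fetch one object by seeking, without scanning the objects before it."""
    # Start from the end of the file: 'startxref' gives the xref table's offset.
    xref_pos = int(re.search(rb"startxref\s+(\d+)", pdf[-64:]).group(1))
    # Skip the "xref" and "0 N" header lines (assumes one subsection from 0).
    first_entry = pdf.index(b"\n", pdf.index(b"\n", xref_pos) + 1) + 1
    # Fixed-width entries mean the right one can be indexed directly.
    entry = pdf[first_entry + 20 * obj_num : first_entry + 20 * (obj_num + 1)]
    offset = int(entry[:10])
    end = pdf.index(b"endobj", offset) + len(b"endobj")
    return offset, pdf[offset:end]

bodies = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
]
pdf = build_minimal_pdf(bodies)
offset, raw = read_object(pdf, 4)
print(offset)
print(raw.decode("ascii"))
```

The lookup for object 4 never touches objects 1 to 3, which is the property that lets a viewer open a large file and display a late page quickly.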
Gavin Henry 00:21:51 Naively, I always thought way back I could just grab some text out or open up and replace a bit of text, but now I understand why that’s not possible.
Peter Wyatt 00:22:00 Yeah. Now, so actually if you want to focus on that kind of thing, so one of the other things when we talk about text, a lot of people instantly think Unicode. Now Unicode is a text encoding and it allows you to express very rich character sets and so on. But PDF is actually a typeset language and expresses the appearance of that text. So, the classic example that I give is the word office in English. O double F I C E. So, in some cases this can just be four glyphs. You can have an O glyph, so a glyph is the appearance of the character, the glyph for the letter O. Then there may be a combined ligature for the letters F F I, where the horizontal strokes of the F, F and I are all joined together. So you have a single ligature representing three Unicode characters, and then the C and then the E.
Peter Wyatt 00:22:50 And so in PDF the author has decided that this is the appearance they want to give to their document and therefore they define this with glyph IDs. Whereas in Unicode you would say it’s the O, the F, the F the I, the C and the E and then text shaping algorithms or text shaping software would then decide, oh, you are using such and such a font and your preference is this and therefore you might get a ligature or you might not. So it’s kind of different things for different courses and hence why in some cases yes, you can open a PDF file and you can see the text and then other cases you can’t. Of course, modern PDF is all compressed as well, which doesn’t help the text searching side of things.
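The gap between typeset glyphs and text characters in the “office” example can be illustrated with Python’s standard `unicodedata` module. This is only an analogy: the Unicode compatibility character U+FB03 (the “ffi” ligature) stands in here for a font’s ligature glyph, whereas a real PDF content stream would carry font-specific character codes or glyph IDs, and text extraction would depend on whatever mapping back to Unicode the file provides.

```python
import unicodedata

# Four "glyphs": 'o', the single 'ffi' ligature (U+FB03), 'c', 'e'.
word_as_typeset = "o\ufb03ce"

# Compatibility normalization expands the ligature back into f, f, i.
word_as_text = unicodedata.normalize("NFKC", word_as_typeset)

print(len(word_as_typeset), "glyph-like units")
print(len(word_as_text), "text characters")
print("naive search finds 'office':", "office" in word_as_typeset)
print("search after mapping finds it:", "office" in word_as_text)
```

A naive byte or string search over the typeset form misses the word, which is one reason searching for text inside a raw PDF often fails even before compression is considered.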
Gavin Henry 00:23:31 Yeah, that makes more sense now. Cause I remember what Duff mentioned about preserving how it looks and bundling fonts. The times when you open a PDF it only works on Windows or Adobe Reader or you open it on Linux, it’s just horrendous and you can’t even read it cause it’s obviously bundled in or linked to, if that’s correct, some OS font, operating system font.
Peter Wyatt 00:23:55 Yes. And one of the lessons that PDF has learned over the years, especially now that computers are bigger and faster and storage is cheaper, is that the cost of missing fonts is huge. You not only get potentially a bad appearance, and especially if you are reading a document in a different language that can be a very bad experience, but by embedding fonts and encapsulating them inside the PDF file, you guarantee that the reader of your document has exactly the same experience that the author intended. And one of the things that PDF allows is a concept called subsetting of fonts. You don’t have to put in the entire Arial font for every Unicode character. You can just pick the glyphs that you used in your document, subset the font, write that small amount of data into your file and just send that along with your file.
Gavin Henry 00:24:47 So this would explain the file size difference in a PDF. If you get a proof of a business card or a website mock-up done as a PDF, that can be quite huge. Or a text-based one could be kilobytes. It all depends on what’s being embedded.
Peter Wyatt 00:25:06 Yes. So primarily it’s the fonts and sometimes also obviously images. Because PDF is, I don’t want to say a print-centric format, but at least a format that had its origins in print, and 72 DPI images and 96 DPI images with lots of JPEG artifacts never look good when printed. So a lot of PDF software will use higher resolution images, and even though you might be viewing it on a computer screen, it doesn’t know that you don’t want to print it. And hence the images are also probably much higher resolution than you might otherwise see on a website.
Gavin Henry 00:25:41 Thank you. Is it possible to create a compliant PDF in a Text Editor?
Peter Wyatt 00:25:46 So the answer to that is, yes. In the technical workshops that we run, and often if you read the PDF specs, you will see what we call fragments of PDF, and they just look like programming code in a language that’s PDF basically. So yes, you can do it in a text editor, but as I said, the key point is that in the file there are byte-based offsets to the start of each object. And obviously if I open it on one operating system with one set of line ending characters and open it on a different one, then those line ending characters can make a difference to the byte offset. So yes, you can do it, but you have to be very careful and you need to know what you’re doing. So, unless you’re a PDF person, please don’t do it or you will break your PDF file.
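The line-ending hazard mentioned above can be shown with a short Python sketch. It assumes a tiny hand-written fragment (not a complete PDF) and simulates an editor or transfer tool that rewrites LF line endings as CRLF; the variable names are illustrative only.

```python
original = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
)

# The offset a cross-reference table would have recorded for object 2.
recorded_offset = original.index(b"2 0 obj")

# Simulate re-saving the file with CRLF line endings.
resaved = original.replace(b"\n", b"\r\n")

# Every line ending before the object adds one byte, so the stored offset
# no longer points at the start of the object.
shift = resaved.index(b"2 0 obj") - recorded_offset
print("bytes at recorded offset, original:", original[recorded_offset:recorded_offset + 7])
print("bytes at recorded offset, re-saved:", resaved[recorded_offset:recorded_offset + 7])
print("object moved by", shift, "bytes")
```

Many PDF processors will try to repair a file like this by rescanning for objects, but that is recovery behavior on the reader’s side rather than something a hand-edited file can rely on.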
Gavin Henry 00:26:31 Yeah, I saw it.
Peter Wyatt 00:26:32 From an education point of view, you can do it, and often many developers getting started in PDF will do this as a way of learning.
Gavin Henry 00:26:41 Yeah, I saw some competitions where people were trying desperately to get the PDF size down to like half a kilobyte or something if you skipped out this bit of the spec or went to version 1.4 or version 1 or something and it all opened fine which was a testament to what the PDF Association looks after and the standards and everything.
Duff Johnson 00:27:01 Well actually not. That is often a testament to the flexibility of PDF processors and their willingness to ingest PDF files that have all kinds of interesting problems, right? So as Peter said, while it might be possible to hack yourself a PDF file manually, it’s really almost never done except for purely educational purposes. This involves counting byte offsets, and really getting this right, particularly with any more sophisticated content, is very difficult to achieve, certainly as a practical matter.
Peter Wyatt 00:27:44 To your comment about those kinds of challenges you often see online, they’re more about what you might call the difference between what the PDF specifications say a PDF file should be and what a real PDF file that’s accepted by PDF software can be. And we’ll probably cover this later on when we get down to security, because obviously over the years many PDF files have been created that do have errors in them. Sometimes it’s as simple as a typing mistake a programmer made in some program years ago that then was used to generate a couple of hundred million PDFs and bingo, that problem is then a problem for everybody who opens those PDF files. So, it’s a problem that we face because our format is persistent. We often talk about persistence and, as Duff said, the PDF/A format is about these records, these archival long-term preservation requirements, where long-term means 50 or 100 years from now, not just next year, and that’s a real challenge to solve.
Gavin Henry 00:28:47 Yeah, some really interesting points about the archival format, and I’ll put some show notes in there. One of the next shows I’m doing is about archiving of software. So Software Heritage, I think, is a nice thing to explore, as well as preserving things in PDFs.
Peter Wyatt 00:29:06 Well, actually just to promote something from the association, we’re currently working on a standard for using PDF as an archival format for emails. And obviously, especially in the US, there are some famous cases of emails being recovered and so forth. So, one of the things that we can do is build on top of PDF/A, the archival format, and add features specific to industries such as email archiving, which have unique requirements such as retaining the headers and understanding what’s there. And so actually we have a liaison working group in the association currently specifying what we call email archiving.
Gavin Henry 00:29:45 Excellent. I’ll get a link in the show notes for that. That moves us nicely onto the next section, which I’ve called “creating a PDF,” but we can easily talk about reading a PDF as well. So by the sounds of it, there’s been quite a journey of versions, and as I understand it you can still open all the old versions in new software today.
Peter Wyatt 00:30:06 Absolutely. You can open a PDF 1.0 file from 1993 in software today and it will still work.
Gavin Henry 00:30:12 That’s awesome. As a creator, what version do you pick? Do you just take what your printer or software application does or does this depend on the industry you’re in, what sort of advice have you got on that, for example?
Peter Wyatt 00:30:27 Ok, well I think there’s a few points there. So I think as a user of PDF, if you are just consuming PDF or even providing PDFs to customers, you don’t pick a PDF version, just like you don’t pick an HTML version when you visit a website. Most likely what you’ll pick is a series of features that your document needs. Now maybe this is the ultra-high compression, so that’ll be the latest standards or some certain digital signature feature or some encryption feature. And again, that’ll be standards. And if you want multimedia or interactive 3D content, again sort of the rarer PDF features, then you’ll have to pick certain features. So, I don’t think you really pick PDF versions. What you do is you pick the features that you want to express your content in, and then that kind defines the feature set that you might use.
Gavin Henry 00:31:15 So the features aren’t tied to version 1.7, 2.0?
Peter Wyatt 00:31:20 They’re all backwards-compatible. There’s only a very few, and I’m talking like three or four features in the history of PDF, that have ever actually been removed from the standard. And one of the key things that we do in the PDF standards committees is to focus on backwards and forwards compatibility. Now what do we mean by that? So backwards compatibility is, if I was to open a document from the future in today’s processor, what experience would I get? Say I encounter a new image format or a new type of font. What can I do to make the experience in legacy software, relative to the version of the PDF, better? So, it’s a focus that maybe other formats don’t have, but in PDF it’s certainly a very important focus that we discuss a lot when we make a design choice to implement new features: how we can do this in a sort of backwards-compatible way.
Gavin Henry 00:32:12 So that would be an example of: I’m stuck on an old version of macOS or Windows, and I’ve got Adobe Reader or whatever reader’s bundled, and I open a PDF created today, and there’s no way that reader understands the new version, but it still opens it okay?
Peter Wyatt 00:32:32 Yep. So, I would hope a couple of things. I would first hope that the reader checks the version number that’s in a PDF file, just like the version numbers in many files, and would maybe present you with a warning message saying, Hey, we only support, say, PDF 1.7, this is a PDF 2.0 file, maybe you should use some different software. So, first thing, it should give you a heads up, or it certainly has the capability to give you a heads up, that maybe the display you’re about to see is not as accurate as it might otherwise be. But in some cases you might then get either subtly different colors or a different display, but hopefully as a human you’ll be able to interpret enough of the document to achieve whatever you are trying to achieve.
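The version check described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular viewer works: the function names `pdf_header_version` and `version_warning` and the `supported` parameter are assumptions made for this example. It also only looks at the `%PDF-M.N` header comment; a real reader would additionally honor a `Version` entry in the document catalog, which can override the header.

```python
import re

# The header comment "%PDF-M.N" is expected near the start of the file.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the header comment, or None if absent."""
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def version_warning(data: bytes, supported=(1, 7)):
    """Return a warning string if the file claims a newer version than
    this (hypothetical) viewer supports, otherwise None."""
    version = pdf_header_version(data)
    if version is None:
        return "No PDF header found."
    if version > supported:
        return (
            f"This is a PDF {version[0]}.{version[1]} file; this viewer only "
            f"supports up to PDF {supported[0]}.{supported[1]}."
        )
    return None


# A tiny hand-made byte string standing in for the start of a real file.
sample = b"%PDF-2.0\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<<>>\nendobj\n"
print(version_warning(sample))
```

Comparing versions as tuples keeps the check correct for any future minor or major number, which a plain string comparison would not guarantee.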
Gavin Henry 00:33:13 Thank you, and is it easier to read and display PDF versus creating a PDF?
Peter Wyatt 00:33:19 So, obviously, that’s a very hard question to answer. The PDF specification is a lot about the display of PDF. So yes, a lot of the text in the spec is about how it displays. The creation side really comes down to the libraries and SDKs that you might use. And certainly, there’s a ton of technology out there that can take an HTML canvas or HTML content and just convert it to PDF. And assuming that that software is of high quality, then it will carry across what we call the semantics of that content. It can know that the heading is the heading and the paragraph is the paragraph, and this is a bulleted list. So all these sort of semantics can carry across into PDF.
Gavin Henry 00:33:59 That’s what I’m trying to get to is move us on to programmatically creating and reading.
Peter Wyatt 00:34:06 If you’re using an SDK that’s maybe not so up to date or not been so well written, then the same content can be generated, but maybe you lose all those semantics. So yes, the text is still there, it’s selectable text. I mean, I guess the worst case would be software that takes something like an HTML page and converts it into one very large image. Now still, as a human, you look at the PDF file on the screen and it looks exactly like you would expect, but you can’t select text, you can’t search that text, and that’s not a great experience.
Gavin Henry 00:34:36 I have seen PDFs like that. Actually, you try to copy and paste the text in the PDF and it’s an image.
Peter Wyatt 00:34:42 Well, obviously scan to PDF, especially since, you know, the phasing out of fax machines. And you’ve got to remember that faxes have come and gone in the time that PDF has been around. So scanning of documents used to be a big thing. It’s still a big thing in certain industries, especially for the archival community, where they have to capture and digitize a lot of documents to replace paper with digital records. So, there are specific features in PDF to support, for example, scanned documents and OCR text and all this kind of thing. But if you are creating what we call a digitally born document, then realistically you shouldn’t be having that experience. You should be having an experience with text content that’s extractable and searchable, and that captures the semantics that were at least in your source document. Now maybe your source document is nothing more than a text file and therefore has no semantics. But if it’s an office document and you’ve got stars, shapes and headings and paragraphs and bulleted lists, then all that should really be captured over into the PDF. And PDF has all these features and has had for many, many years. So, really, to circle back around to your question, I think a lot of that depends on the libraries and SDKs that people use. And maybe that’s the key advice we’re giving to listeners here: don’t just accept the first library that converts content, but spend a bit of time trying to understand whether the PDF that’s been created is of what we would call high quality, and I don’t mean visual quality, I mean kind of semantic quality.
Gavin Henry 00:36:07 And how would you validate that just based on what you’re trying to achieve?
Peter Wyatt 00:36:12 Various ways. I mean, the first thing is obviously to check its visual appearance, but don’t just use one viewer, and make sure you check across all platforms. Make sure that text can be found, that you can find and search text in your document. Not replace, but search. Ensure that the metadata is up to date. If you are creating something that’s probably going to be a record, so I’m thinking things like an invoice or a purchase order or something like that, which is typically kept in an organization’s document management system for many years, maybe not for hundreds of years, but at least for 10 or 15 years for tax law reasons or whatever, then you should probably look at PDF/A as a standard. And PDF/A has a lot of what we call validating software. So software that can run over the top of a PDF/A file and check to make sure that all the T’s are crossed and all the I’s are dotted and it’s a good quality file that really meets the quality rules that archival PDF requires.
Gavin Henry 00:37:09 Duff, just a couple of questions about the PDF Association. Do you guys maintain a list of recommended libraries or what Peter just said there, about linting or validating PDFs that we can link to or. . .
Duff Johnson 00:37:25 PDF Association actually very specifically and deliberately does not do that. The association is a meeting place for PDF developers to come together to discuss and propose features, issues of concern, and requests for clarifications, and to allow different industries to find common understandings. So for example, we have working groups that are specific to the engineering space, where we have folks who are thinking about 3D and aerospace and manufacturing who are very interested in how 3D and other kinds of related models can be deployable in the PDF context. And as Peter mentioned, we have another working group working on email archiving using PDF, and so on. So what we specifically do not do is get into the business of trying to pick winners and losers from within the developer community that supports the world’s PDF implementations. One of the reasons for that is there are so many different needs. The larger point is that, as a member organization, our job is not to sit in any way in between the consumer and the developer. We would probably have relatively few members if we were in the business of characterizing our members’ products, right? Instead, we provide really a platform for them to talk and for them also to showcase their products. Internally, within the members-only discussion groups, there may be arguments about this or that interpretation, but we’re not here as sort of the PDF police, if you will.
Gavin Henry 00:39:12 Okay, thank you. The reason I ask is that, as our listeners will know, the programming language they use is often something that’s forced upon them because of their job, or it’s their chosen language. In my experience as well, you find a PDF library that does maybe 70% of what you’re trying to do, and then it’s been abandoned, or it’s been divvied up to meet the needs of some other developer. So I’m just trying to figure out how to navigate some of this: where you go to find a recommended one, and how you review them and say, yeah, this does PDF/A, great, or almost all of the spec, or what have you?
Peter Wyatt 00:39:59 I think for what we call the subsets, so these are PDF/A and PDF/X, variants of PDF, you’ll always be able to run validators because they exist and there’s lots of software out there that can check that for you. In terms of general-purpose PDFs, which are just the PDFs that we as consumers send around to each other or maybe receive or download off a website, that’s a harder problem. But I guess the good news is PDF has been around for 30 years. You should definitely be using a maintained library, and if nothing else that just goes to the security discussion we’ll probably have soon. But there are PDF libraries in all the languages, and even in very newish languages, Go and Swift and so forth, there are very capable PDF libraries around, and many of our members do participate in these forums to try and help people understand the PDF spec. It is a 1000-page specification. It’s not a light read by any sense. As an Association, we do encourage people to join us and have the discussions, especially around things like errata, and we have a public GitHub repository where people can report issues or misunderstandings about the spec, and we’re here to help people understand: well, this is what that part of the text means, and this is how you can do it.
Gavin Henry 00:41:15 Yeah. I’ve reviewed some of your GitHub repos that I think you both have, so I’ll put those in the show notes. I presume there’s an implementors-type group that developers can potentially join to ask questions or something? Or a forum that’s supported, or is it really for developing the spec?
Duff Johnson 00:41:37 So there are a number of different forums within the PDF Association. Many of them are members-only. The association, among its other responsibilities, maintains the ISO standards-development process. So we are the managers of ISO TC 171 SC 2, which is the sub-committee responsible for the development of most of, not absolutely all but most of, the PDF specification, format and subsets. And we employ a Chief Technology Officer in the form of Peter, and we have a number of different things that we do to service the industry. As part of that, we have a kind of spaces that we operate for meetings, consisting of members-only forums for the development of the specification, for other subsets and for industry discussions. But in addition, we operate a number of liaison working groups, which are intended specifically for interfacing with nonmembers who have specific vertical requirements or cases. So, I mentioned engineering and manufacturing. Another example would be the email archiving group, and another example would be concerns pertaining to accessibility. So, we also have a number of groups that are working on developing and improving the interaction between PDF and the assistive technology that’s characteristically used to help folks suffering blindness and other disabilities to be able to perceive and read PDF documents.
Duff Johnson 00:43:17 These liaison working groups also operate in the print product metadata space. So we have a variety of ways for developers who have an interest in the subject or who have a tangential or other need. It’s actually a common thing for us to receive an inquiry: Hey, we’re out here in the world, we’re trying to do this thing with PDF, how could the association support us? And sometimes those are inquiries we can’t do anything with, and other times it results in the development of a community which is constructed precisely to support that process. To give you an example, the LaTeX folks, who develop the typesetting system which runs much of the world’s scientific publishing, came along and said, well, we’re looking to improve the way in which we create PDF files from LaTeX so that they include all the semantics in the tagging, to allow disabled users to view scientific publications that are written with LaTeX. So as a result we created a liaison working group that would allow folks who are working specifically on LaTeX development to come along and participate in our discussions, and then, significantly, to allow PDF Association members to join in that discussion. And that’s really what we do. We provide that interface between the people who have questions and the people who really know PDF very deeply.
Gavin Henry 00:44:47 Thanks Duff, that’s a great overview. I’ll make sure I get some points of contact in the show notes as well for those types of developers. I’m going to summarize the last two sections, just to confirm my understanding, and then move us on to the last section of the show, just to keep us on track. So PDF is a binary-based format where the layout and other things that are important to create a PDF are embedded, and that’s not just the text and the words, that’s exactly how the creator wants it to look. The version of the PDF depends on what features you as a creator want to be in that PDF, but a reader will then know instantly what version the PDF is and understand what it supports and what it can display for you. Depending on how that PDF is created, I could use my text editor, but that sounds pretty impossible, and given the fact that the show is about 30 years of PDF, you should review the libraries for your programming language and expect them to be capable. But there are some validators and linters for PDFs, and I’ll get some names off both of you offline and make sure they’re linked in the show notes. I think that’s a good summary. Would you say that covers creating a PDF and what’s involved in it?
Peter Wyatt 00:46:06 Yep. I think the other aspect that maybe we should talk about too is that we’ve talked about creating the PDF, but nowadays a lot of websites and other experiences have PDF viewing integrated into them, and this is probably the one place where 70% completed just doesn’t work anymore. When rendering a PDF file and displaying it on the screen or on a piece of paper, you really do have to be 99% or better in terms of completion. And this is where sometimes people can be fooled. If you have software that’s less capable, then you can look at the same PDF on different platforms and see very different things because maybe one piece of software can’t display a certain image format. But after 30 years, realistically speaking, I don’t think there’s really any excuse. The software that’s being used there is clearly very old, as I said.
Gavin Henry 00:46:55 Are these the embedded sort of JavaScript PDF display?
Peter Wyatt 00:46:57 No, and that particular one is actually really, really good. No, what I mean is some of the other ones, maybe less maintained Open-Source software. But the rendering of the PDF file is the most important thing. And if you do search on the web, there are test suites, commercial test suites as well as a few Open-Source test suites, available where you can grab some PDF files and you can see exactly: does my viewer, for example, show what we call annotations? So, PDF has this feature, like your office documents, where you can review and mark up a document, strike out text, highlight text, all that kind of stuff. But you can do it in a PDF file. Now many of the old viewers don’t do this, but all the new viewers and all the mainstream viewers should be doing it, because there’s really no reason not to be doing it.
Gavin Henry 00:47:44 Yeah, I experienced that exact thing on Friday. One of my podcast guests marked up the show in an article for IEEE and used the comment feature. It didn’t work in my Google Mail preview and some other things, but it did work in the big-name creators, or viewers rather. It just degraded nicely, like you explained and said it would: it just turned the comment into a little voice box icon. You couldn’t do anything with it, but you could see there was something there. So it was backwards compatible that way.
Peter Wyatt 00:48:19 Yep. And I should actually add that the PDF specification only specifies the file format and places very few of what we call processor requirements on software. So, a lot of those sort of experiential things are actually not defined in the PDF spec. And again, I think this is a bit of history, but it does allow people to innovate and to create different types of software. You only have to compare an iPad experience with a traditional PC experience and you can see a fair variety of different experiences with PDF, but all based around the same feature set of the file format.
Gavin Henry 00:48:54 As a creator of that PDF, you need to be conscious of where it’s going to be consumed and read?
Peter Wyatt 00:48:59 Ideally, you shouldn’t have to be, but if you happen to know, for example, that your users will be on their phones or something, then yes, you should. But that probably also goes just as much to things like the choice of page size, whether it’s the American paper sizes or the A4 European-style paper sizes. There’s other sort of aspects as well. So if you were to create a modern file now, and we talk about semantics now, one of the things that Duff spoke about just a few minutes ago was the importance of semantics. Now, semantics today is used in many applications for their ability to reflow a PDF. So, although PDF is a fixed file format, a lot of software nowadays has the capability to take PDF and refit it to your screen, because we’re not all on desktops anymore. We do have phones. But exactly how that works is not in the PDF spec. So that is kind of a layered feature that’s been added on top by the vendors in being creative to address, I guess, some of the challenges that paginated content faces in the modern world.
Gavin Henry 00:50:02 Thank you. So we’ve touched upon bundling things with PDFs, and that will bring us on nicely to PDF security. Can you share with us any historic security issues that there’ve been with PDF, a few examples, and what’s been done about them since?
Peter Wyatt 00:50:18 Yeah, I guess we need to recall the history discussion that opened up the podcast. PDF 1.0 was 1993, and that was well before security and DevSecOps and all this kind of thing were front of mind, or even considered in any way. It was a long, long time ago. Now having said that, I think certainly one of the things that I find most amusing with PDF is really the accidental information disclosure from users, typically governments and lawyers or someone who forgets or just doesn’t know how to redact a document. So redaction is where people think about blacking out some text so that you can’t see the name of an individual or something like that. But, hopefully as people have learned from this discussion we’ve had today, PDF is made up of these text objects, these graphic objects, and these image objects. So, putting a black box over some text doesn’t make that text magically go away. You actually have to
Gavin Henry 00:51:12 Yeah, I was going to say that based on how you explained it before, that’s just an object on top of a . . .
Peter Wyatt 00:51:18 Correct. As a human, you can’t see it anymore in the rendered appearance, but if you do a text extraction, and the classic case is a journalist will do a copy and paste, take the content and paste it into their notepad or something like that, and bingo, all the supposedly redacted words reappear. I’m sure your listeners can remember lots of famous cases where this kind of thing has happened, but no one seems to learn their lesson, and it really is a source of amusement and amazement. It continues to happen. And PDF actually has a full-blown redaction workflow as part of the file format, where you can go through an official, I don’t want to say military grade, but a proper regimented process where people can redact content, and then you can classify the reason for the redaction. Then you can approve the redaction, and it’s all built into the file format. So then at the end you can publish a document that’s truly redacted, including things like portions of images or people’s faces in photos. All this is possible in PDF. But unfortunately people just put the black rectangle over the top and ship out the PDF and regret it.
Gavin Henry 00:52:21 Yeah, one of the first things I do on a PDF, just for fun, is look at the file properties. I look at the title, location, producer, to see how they made the PDF and the format. There’s usually quite a lot bundled in that, that people don’t
Peter Wyatt 00:52:35 In actual fact there’s been some interesting research done recently out of France that looked at exactly this issue, the privacy issue for documents published by national security agencies and what you could learn. And this goes to more than just the file properties. If you embed a photo from your iPhone into a PDF, then all the magical properties of your iPhone are inside the JPEG inside the PDF. And that might include your model number, your serial number, maybe your name, probably the GPS coordinates of where the photo was taken. So you can well imagine that if you’re working in an industry that has secrecy and privacy as a primary concern, then there’s a lot more than just the PDF you need to worry about. There’s all the embedded internals, the fonts, maybe editing markups that happened in the course of publishing a document. You want to make sure they’re all scrubbed out, and as I said, PDF has all this capability built into it, but unfortunately people still seem to cut the corner.
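The failure mode described here, a black rectangle drawn over text that is still present in the file, can be shown with a hand-written fragment of a page content stream. This is a minimal sketch under stated assumptions: the content stream and the name in it are invented for illustration, and `naive_extract_text` is a deliberately simplistic stand-in for a real text extractor (it only handles unescaped literal strings shown with the `Tj` operator).

```python
import re

# A hand-written, hypothetical fragment of a page content stream.
# BT/ET bracket a text object, Tj shows a string, and "re f" fills a
# rectangle. The rectangle is painted after, and therefore on top of,
# the text.
content_stream = b"""
BT /F1 12 Tf 72 700 Td (Payment approved by Jane Doe) Tj ET
0 0 0 rg
180 696 60 14 re f
"""


def naive_extract_text(stream: bytes):
    """Collect literal strings shown with Tj, ignoring anything drawn on top."""
    matches = re.findall(rb"\(([^()\\]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


print(naive_extract_text(content_stream))
```

The extractor never looks at the rectangle at all, which is the point: covering text changes the rendered appearance, not the text objects underneath. Proper redaction has to remove the text operators themselves.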
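As a rough illustration of the embedded-photo concern, the sketch below walks the segments of a JPEG byte string and reports whether it carries an Exif block, which is where camera model and GPS data usually live. It assumes the image bytes have already been pulled out of the PDF (photos are commonly stored as `DCTDecode` streams); the helper names `has_exif` and `make_segment` and the two test images are made up for this example and are not real photos.

```python
def has_exif(jpeg: bytes) -> bool:
    """Return True if the JPEG byte string contains an APP1 'Exif' segment."""
    if not jpeg.startswith(b"\xff\xd8"):  # SOI marker
        return False
    pos = 2
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        if marker in (0xD9, 0xDA):  # end of image / start of scan data
            break
        # The 2-byte big-endian length includes the length field itself.
        length = int.from_bytes(jpeg[pos + 2:pos + 4], "big")
        if length < 2:
            break  # malformed segment; stop rather than loop forever
        payload = jpeg[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False


def make_segment(marker: int, payload: bytes) -> bytes:
    """Build one JPEG segment: marker, length, payload (test helper)."""
    return bytes([0xFF, marker]) + (len(payload) + 2).to_bytes(2, "big") + payload


# Synthetic stand-ins for real images: just enough structure to be walked.
with_exif = b"\xff\xd8" + make_segment(0xE1, b"Exif\x00\x00fake-tiff-data") + b"\xff\xd9"
without_exif = b"\xff\xd8" + make_segment(0xE0, b"JFIF\x00") + b"\xff\xd9"

print(has_exif(with_exif), has_exif(without_exif))
```

Detecting the block is only the first step; a scrubbing workflow would also need to strip or rewrite it before the image is embedded.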
Gavin Henry 00:53:36 What sort of things can you embed in a PDF?
Peter Wyatt 00:53:39 So technically, and this is one of the security issues, you can embed anything. Some of the very early attacks back in the 90s were where people had just attached the virus payload, a .com file or .exe file, or nowadays it’d probably be a PowerShell script or something like that. You can just attach that to a PDF file. There’s a thing called a file attachment annotation, which you can think about as a little paperclip icon that you might see on your page. And obviously if a user then double clicks that and detaches that file, then that can do all manner of nasty things. And there’s certainly been things in the past where people said, Oh, I’ve attached my favorite photo, but the photo is actually called photo.exe. And users aren’t always aware what these extensions mean, and they double click the file and instead of opening a photo application, it runs a malicious program. And that is one of the security issues of PDF: it is what we refer to as a container format. It can contain anything; basically you can embed other things inside PDFs.
Gavin Henry 00:54:39 Like you said a minute ago, where you think you’ve redacted something with a graphic on top, that graphic could be masquerading as a button that says, click this to pay the invoice online or something, but it takes you somewhere else and you’ve downloaded the payload.
Peter Wyatt 00:54:53 Yes. And there’s certainly been tricks. I mean, I’ve seen PDFs which masquerade as a website. So for the naive user who opens their PDF viewer, maybe they’ll try and push the PDF viewer into full screen mode, so you can’t see that it’s a PDF viewer, and there’ll be the login page for a bank asking you to enter your username and password, and in the background that button’s actually sending that password to a malicious website for harvesting or whatever. So I guess it’s the same thing that happens in emails, people doing the same thing with phishing emails. So really I don’t think these are things that are unique to PDF. Realistically, what you can do in HTML or email, you can do in PDFs, because again the content flows smoothly between these formats, and that’s the whole point of the format in a way.
Gavin Henry 00:55:43 So criminals are just using PDF as another container to form an attack really?
Peter Wyatt 00:55:49 Yes. And there certainly are other things now. Probably the most well-known attack vector that gets used in PDF is JavaScript. So PDF internally can have JavaScript, just like an HTML webpage can have JavaScript. But obviously because PDFs are standalone and browsers are very complicated pieces of software, there can be bugs in the implementations, and the JavaScript provides a means by which an attacker can leverage a bug and exploit it to gain control of your computer or do whatever it wants to do. And that is why in today’s world, I think all PDF tools, I would hope, ship with their JavaScript disabled by default. So, you’ll need to enable it. Now, obviously with today’s attacks, the first phishing attack is probably to try to get you to enable that JavaScript, so the subsequent email attachment will then have the malicious payload attached. And that’s, I think, a fairly common kind of thing, especially in the corporate world where targeted attacks may be more common.
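One common first-pass triage for the issues raised here (attachments, JavaScript, actions that run on open) is simply to count how often certain PDF names appear in the raw bytes. The sketch below is an assumption-laden illustration, not a security tool: the list `RISKY_NAMES` is a small hand-picked set, the sample bytes are invented, and the scan can be defeated by names written with `#xx` hex escapes or by objects stored inside compressed object streams.

```python
import re

# A small, hand-picked set of names worth a closer look (an assumption
# for this sketch, not an exhaustive list).
RISKY_NAMES = [b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch", b"/EmbeddedFile"]


def count_risky_names(data: bytes) -> dict:
    """Count whole-name occurrences of each risky name in raw PDF bytes."""
    counts = {}
    for name in RISKY_NAMES:
        # The lookahead stops "/EmbeddedFile" from matching "/EmbeddedFiles".
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts


# Invented sample: a catalog whose open action runs a script.
sample = (
    b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj\n"
    b"2 0 obj << /S /JavaScript /JS (app.alert('hi');) >> endobj\n"
)

for name, count in count_risky_names(sample).items():
    print(name, count)
```

A non-zero count is a reason to inspect the file with a maintained parser, not proof of malice; plenty of legitimate forms carry JavaScript.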
Gavin Henry 00:56:47 And the original intent for embedding these types of things, was JavaScript there something in particular or was it just you can embed codes and do something? What would you use that for, to move you along a form in a PDF or something when you’re filling out?
Peter Wyatt 00:57:05 So it has to do with data validation in forms. That’s really the history of it. I think it was added in the mid 90’s, 1996 or something like that, PDF 1.3, so a long, long time ago. But specifically to support flexible business forms. And in those days, you have to remember, HTML forms were not very good and PDF forms were much richer. And there’s a history of tax agencies using PDF forms as a way of doing very complicated things. Nowadays you’d probably do an online form. But the history of PDF was, yeah, people wanted rich forms where you could validate some data and update fields. If you change this, it would recalculate the tax and update that field and all this kind of stuff. And rather than try and do it declaratively, JavaScript was chosen. But having said that, one of the technical working groups inside the PDF Association is currently looking at an alternative declarative technology to JavaScript for the form solution, based on a concept or a technology called Json script.
Gavin Henry 00:58:10 Ok. And this embedding of anything, is that similar to how you can put digital signatures on a PDF to prove and validate that it’s not been tampered with, or that sort of thing?
Peter Wyatt 00:58:23 Kind of. So a digital signature you can think of as like a hardened shell around a PDF file. You use a cryptographic hash, you calculate the hash of the contents of the PDF file, and then you include that in the PDF file. And that effectively creates this hardened shell. And if anyone changes a byte inside that hardened shell, then you can detect that it’s been tampered with, and you can display the appropriate warning. Of course, the assumption there is that your software is actually bothering to validate digital signatures. And a lot of software unfortunately doesn’t bother to validate digital signatures. It just says there’s a digital signature and gives you no indication as to whether it’s valid or invalid or whether there’s been any tampering.
Gavin Henry 00:59:00 So this would be like an object around the PDF object, say like a container in Docker, where you can create a hash to see if it’s been tampered with?
Peter Wyatt 00:59:08 Yeah, conceptually, yes. It’s done a little bit differently internally, but conceptually yes, it’s that sort of hash checking. I mean, I’ve always been thinking that it’s kind of the experience we’ve all now grown accustomed to with the green padlock in our browsers, and really PDF needs, I think, the same thing. All our PDF viewers need to be able to give us the green padlock when we get an untampered PDF file with a digital signature. And if the file’s been tampered with, then obviously there’s a red padlock and lots of flashing lights, because not saying anything can make people assume, Oh, it must be okay, and maybe it’s not ok.
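The "hardened shell" idea can be sketched with nothing more than a hash over two byte ranges. PDF signature dictionaries carry a `ByteRange` array of offset/length pairs that covers the whole file except the hole where the signature value is stored; the sketch below mimics that layout. It is a loud simplification: a real PDF signature stores a cryptographically signed structure made with the signer's private key, not a bare SHA-256 hex digest, so anyone could recompute the value used here. The toy file contents and the function names are assumptions for illustration.

```python
import hashlib


def digest_over_byte_range(data: bytes, byte_range) -> str:
    """Hash the (offset, length) pairs listed in byte_range, skipping the gap."""
    h = hashlib.sha256()
    for i in range(0, len(byte_range), 2):
        offset, length = byte_range[i], byte_range[i + 1]
        h.update(data[offset:offset + length])
    return h.hexdigest()


def is_untampered(data: bytes, byte_range) -> bool:
    """Compare the value stored in the gap with a freshly computed digest."""
    gap_start = byte_range[0] + byte_range[1]
    gap_end = byte_range[2]
    stored = data[gap_start + 1:gap_end - 1].decode("ascii")  # strip < and >
    return stored == digest_over_byte_range(data, byte_range)


# Toy "file": bytes before the hole, a placeholder hole, bytes after it.
before = b"%PDF-2.0 ...document bytes before the signature value..."
after = b"...document bytes after the signature value... %%EOF"
placeholder = b"<" + b"0" * 64 + b">"  # a SHA-256 hex digest is 64 characters

byte_range = [0, len(before), len(before) + len(placeholder), len(after)]

# "Sign": hash everything outside the hole, then write the digest into it.
digest = digest_over_byte_range(before + placeholder + after, byte_range)
signed = before + b"<" + digest.encode("ascii") + b">" + after

# Change some bytes inside the protected range without changing the length.
tampered = signed.replace(b"before", b"BEFORE")

print(is_untampered(signed, byte_range), is_untampered(tampered, byte_range))
```

Because the hole is excluded from the hash, writing the signature value into the file does not invalidate it, while any change elsewhere does.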
Gavin Henry 00:59:45 Could we explore how a digital signature works?
Peter Wyatt 00:59:47 It’s incredibly complicated, I would suggest…
Gavin Henry 00:59:51 Okay, too much for now?
Peter Wyatt 00:59:51 Yes. One thing I will say though is that the PDF 2 standard, and actually a few of our new extensions about to be published, are introducing a whole lot of new technology in this space. Elliptical curve signatures and picking up on curves that have been standardized in various countries around the world. We have integrity mechanisms, what are known as Macs, and we’ve got some articles on our website, which can explain what these features are and how they are slightly different. But there’s a lot of different things. We, have time-stamped signatures as well as what maybe you conventionally think of as like a wedding signature, like from a person. But a time stamp signature gives you a proof that a document existed at a point in time in a particular way. And again, you often used in like Legal workflows and so forth.
Gavin Henry 01:00:38 Yeah, I’ve seen that on, DocuSign and HelloSign where you can attach the workflow on the back of it and it shows you such and such open data was created on, it’s been viewed by..
Peter Wyatt 01:00:49 And I should maybe add one other thing about the signatures and encryption PDF is that it’s also been designed to be extensible. So, there are a number of companies out there with proprietary encryption solutions, sort of providing like a DRM, Digital Rights Management solutions. And if you think some of the ebook solutions are also based on PDF using effectively the same kinds of technology.
Gavin Henry 01:01:10 Thank you. Just to round off this last section, can you take us through what the DARPA-funded SafeDoc project is?
Peter Wyatt 01:01:18 Yeah, so I’m a principal investigator for the association on the SafeDocs program. So SafeDocs is a program that was looking at, as you said in the intro, an intersection of cybersecurity, formal methods from the research side, input parsing, and file formats. And what makes this interesting is we’ve had a lot of progress in sort of protocols and applying formal methods and formal verifications to certain protocols that are used on the web, but file formats tend to be much larger and much more complex. So this is a really difficult problem to solve. It uses a field of research known as Language-theoretic Security, or LangSec. And what does this mean? Well, it really means when you think about what a vulnerability is, a vulnerability is really an input that a programmer did not expect. And that goes for almost any vulnerability. At some point the attack has been able to look at the code or work out that if I just slip this past this check you’ve got here, then the next check will misinterpret this and I can get control or I can crash a program or whatever the side effect is.
Peter Wyatt 01:02:26 So if we can somehow make it so that the input checking the parsing of inputs is provably correct, then pretty much vulnerability becomes a thing of the past. And this has been possible, as I say was certain very important protocols on the web, been some great work out of Microsoft and a few other groups well publicized. But in the terms of file formats, this is a new and challenging problem, and especially in something as complicated as PDF. So what SafeDocs has been doing is looking at this problem from a file format and PDF was chosen primarily because of its ubiquity. It’s important to just general government and business and organizations and sort of national security. And so we’ve tackled the problem in trying to develop a formalism of PDF. Now, we haven’t quite got there yet, but we’ve certainly had some great outcomes.
Peter Wyatt 01:03:14 We now have the first machine-readable model of the PDF object model, which sits beside the specification. So the specification is written in English and in the ISO community we might spend an hour finely crafting an English sentence with all the nuances that we as experts understand about PDF. But of course, for an average reader who’s not a PDF expert but still needs to read the spec, they may not pick up on those nuances. So having a machine-readable spec where we all get a common understanding, both humans and machines, is really important.
Gavin Henry 01:03:48 Is the PDF document object model easy to explain in a sentence, or is that a major part of the spec?
Peter Wyatt 01:03:55 It’s pretty easy. So basically, PDFs are made up of these things called objects and there are nine basic object types. You’ve got the usual names, numbers, strings, and then we also have more complex objects: arrays of objects, so programmers will know what arrays are, and dictionaries. And generally dictionaries have keys in them, and then the value of that key will be maybe another dictionary. So, you have a page key and the value of that key is a dictionary, which is the page dictionary, and that will have the media box, the size of the page, it’ll have the content that goes on the page and maybe it’ll have the page label or lots of other information about the page. So you can see how this sort of builds up a document object model exactly like there would be in HTML, obviously with different syntax.
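A minimal sketch of the nesting described above, using plain Python dicts and lists as stand-ins for PDF dictionaries and arrays. Type, Kids, Count, MediaBox, and Contents are real PDF key names; the values, the inline content bytes, and the helper page_size_in_inches are hypothetical. A real file would use indirect references and stream objects, and this sketch assumes the default unit of 1/72 inch.

```python
# Stand-in for a page dictionary: keys are PDF names, values are other objects.
page = {
    "Type": "Page",
    "MediaBox": [0, 0, 612, 792],  # page rectangle in default units (1/72 inch)
    "Contents": b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET",  # illustrative content bytes
}

# Stand-in for the page tree node whose Kids array holds the page dictionary.
pages = {"Type": "Pages", "Kids": [page], "Count": 1}

def page_size_in_inches(page_dict):
    """Hypothetical helper: derive (width, height) in inches from MediaBox."""
    llx, lly, urx, ury = page_dict["MediaBox"]
    return ((urx - llx) / 72, (ury - lly) / 72)

# Walk from the tree node down to the page, as a reader of the model would.
first_page = pages["Kids"][0]
print(page_size_in_inches(first_page))
```

Following keys from one dictionary to the next in this way is what makes the structure resemble a document object model.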
Peter Wyatt 01:04:42 And what the model that we’ve developed, the Arlington PDF Model, does is basically convert this into a set of tab-separated files. So they’re just text files, very easy to parse and read. You can load them into Jupyter Notebooks or anything like that. And you can understand, for each key, the data integrity relationships, its relationships to other objects in the PDF model, when it’s required, when it’s not required, in what version of PDF it was introduced, maybe what version it was deprecated in. You can understand whether it is an integer and, if it’s an integer, maybe what the range of values is, or if it’s a string, maybe what type of string it has to be, whether it can be a Unicode string or an ASCII string or a byte string, which is just a random sequence of bytes. So, it provides a lot more detail and you don’t have to wade through the PDF spec. And you do have to remember the PDF spec is 30 years old, and I can only imagine how many editors have had a go at the PDF spec before Duff and myself. So, this gives us hopefully a much stronger baseline on which we can then move forward in formalizing PDF and providing a common sort of machine-readable, understandable version. And you don’t really have to be such an expert in understanding ISO specs.
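A minimal sketch of reading a tab-separated description of dictionary keys of the kind Peter describes. The column names, the sample rows, and the helper names are assumptions made for illustration; ExampleOldKey is a made-up key, and the actual Arlington files may use different headers and carry more columns.

```python
import csv
import io

# Made-up sample in tab-separated form; not copied from the Arlington PDF Model.
SAMPLE_TSV = (
    "Key\tType\tSinceVersion\tDeprecatedIn\tRequired\n"
    "Type\tname\t1.0\t\tTRUE\n"
    "Contents\tarray;stream\t1.0\t\tFALSE\n"
    "UserUnit\tnumber\t1.6\t\tFALSE\n"
    "ExampleOldKey\tstring\t1.2\t2.0\tFALSE\n"
)

def version_tuple(text):
    """Turn a version string such as '1.6' into (1, 6) so versions compare correctly."""
    major, minor = text.split(".")
    return (int(major), int(minor))

def load_rows(tsv_text):
    """Parse the tab-separated text into one dict per key."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

def keys_current_in(rows, pdf_version):
    """Keys already introduced and not yet deprecated in the given PDF version."""
    target = version_tuple(pdf_version)
    current = []
    for row in rows:
        if version_tuple(row["SinceVersion"]) > target:
            continue  # introduced later than the target version
        deprecated = row["DeprecatedIn"]
        if deprecated and version_tuple(deprecated) <= target:
            continue  # already deprecated by the target version
        current.append(row["Key"])
    return current

rows = load_rows(SAMPLE_TSV)
print(keys_current_in(rows, "1.4"))
print([row["Key"] for row in rows if row["Required"] == "TRUE"])
```

Because each row is plain text, the same questions (required or not, introduced when, deprecated when) can be answered with a few lines of code instead of a search through the specification.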
Gavin Henry 01:05:58 Thank you. I’ll make sure that gets linked to in the show notes as well. Just to close off the section, could either yourself or Duff give me your top three tips on PDF security, if that makes sense.
Peter Wyatt 01:06:12 So I think it’s pretty much the same as for email and web browsing. So, first of all, always use up-to-date PDF software, and primarily here I’m talking about your viewers. Your viewing software, the software you use to interact with your PDF files. Use up-to-date software. It itself will be updated for its own patches and vulnerabilities, but because PDF is such a complex specification, it depends on many other libraries, JPEG-parsing libraries, XML-parsing libraries, color-processing libraries, Unicode-processing libraries, and obviously all those libraries also have their own series of security flaws. So using up-to-date software should be the number one thing, so patch your software. Obviously the second one is be careful as to where your PDFs come from. The majority of PDFs probably come through email, and the other place, obviously, is websites, and you should be careful when you’re clicking on PDFs: are you trusting this website?
Peter Wyatt 01:07:05 Don’t just rely on the idea that it’s PDF, so it can’t be that bad. Unfortunately, that’s not true anymore, and sometimes it might only be a phishing email, but still it’s something to be aware of. And the last one is always just use up-to-date antivirus and anti-malware software on your computer systems. All the good software nowadays will be checking PDFs for known malware, just like the same software will check websites, looking for JavaScript fingerprints and so forth. It does the same thing with PDFs. It can look inside the PDFs and find the known malware. And of course, as we’ve said before, if you’re redacting, please, please use proper redaction software and read the manual.
Gavin Henry 01:07:48 Thank you. One other question I want to check in here, what are some of the most unusual or unknown things you can do with a PDF? Maybe some things that are in the spec, but you really don’t see?
Duff Johnson 01:07:58 You can have a PDF file that’s a square kilometer. Yeah, right? You can have a one-to-one scale, I believe Peter, there’s a one-to-one scale PDF of the Tokyo sewage system, as I recall. Never seen it, but…
Gavin Henry 01:08:14 Because it’s got the size embedded in it, it will open up that?
Duff Johnson 01:08:18 PDF is the size of Tokyo.
Peter Wyatt 01:08:21 So I guess the other thing that’s interesting is maps in PDF. So, with a map in PDF you can measure, you can drag out a line and trace a cursor and it’ll tell you how long something is. Now this doesn’t have to be a map. You can use an electron microscope and you can get it in microns. A PDF has a full sort of 2D, 3D measurement capability built in. I’ve also seen people write games in PDF, both using JavaScript and something as simple as just like a thousand page document and each page at the bottom has a button and you pick the button, the action you want to do and it takes you to a different page. So some people have been very, very creative with PDFs.
Gavin Henry 01:08:56 Cool. Thank you. Well, I think we’ve done a great job of covering what a PDF is. Is it PDF or a PDF? A PDF, the thing you download; PDF, the standard; or how would you like me to say that?
Peter Wyatt 01:09:09 I think it’s just PDF.
Duff Johnson 01:09:09 In common parlance, it’s a PDF. I think we don’t do it ourselves or anyone else any favors when we get pedantic over the terminology. And so it’s characteristically “a PDF.”
Gavin Henry 01:09:26 So we’ve done a great job of covering what PDF is, associated security concerns and how to make them. But if there’s one thing you’d like a software engineer to remember from our show, what would you like it to be? You can have two things, one each.
Peter Wyatt 01:09:37 I think for mine it would be to remember that PDF is an international standard developed in an open consensus-based forum. It hasn’t been proprietary since 2008, that’s 14 years ago. The standard really has moved on and it really does sit beside HTML. If you need paginated content or delivery of invoices or purchase orders, then you should be looking at PDF as an alternative. Don’t make your users have to sort of fight to create something that they can put in their archive to provide a solution for. And I think PDF is as good as it gets nowadays and maybe there’ll be something better in the future, but today it’s PDF.
Duff Johnson 01:10:15 I would answer the question with a similar answer, but with a slightly different emphasis. With HTML, you have, broadly speaking, an experience. You have content and CSS and a browser and server and it all comes together at a particular moment in time, and an end user sitting at a desktop or holding their phone, they get to see something and it includes dynamic content or an ad that was served or whatever it is. It’s an experience. PDF on the other hand is a record, it persists, and I can share it with you. I can deliver it to you and you’ll have confidence that you won’t just share the experience that I had when I wrote it. You’ll share that experience. We’ll share that common understanding down to the exact placement of every letter. We’ll share that common understanding for every single user who ever opens that file downstream.
Duff Johnson 01:11:09 So these are, as Peter said, deeply complementary formats, HTML and PDF. On the one hand, you have something that comes together to deliver what people need at that moment. And on the other hand, we have something that persists over time and is exceptionally reliable, and they work together. They don’t compete at all. Certainly, PDF is overused and people use it for some things that probably they should be using HTML for. Certainly, HTML is often used to deliver records of particular transactions or other kinds of events that could probably be better delivered as PDF because people are looking to maintain that information over time or across computing systems. There are extraordinary, of course, capabilities and advantages in both formats, and they complement each other for a wide variety of business processes. And I think, rather than think in terms of one or the other in the modern era, it’s really about you do things in HTML and very frequently they need to be stored or saved in the format in which they were originally viewed, and PDF is appropriate.
Gavin Henry 01:12:17 Thank you. Obviously, people can follow you both on Twitter? I’ve got your accounts but how else would you like people to get in touch if they have questions?
Duff Johnson 01:12:25 They can certainly reach us via email, Twitter of course works, PDF Association, PDFA.org is a great way to get in touch.
Gavin Henry 01:12:33 Thank you.
Peter Wyatt 01:12:34 And also, GitHub as well. If you have, if you’re on the technical side, then we do have a GitHub presence as well.
Gavin Henry 01:12:39 Yeah, I’ll put that in the show notes. I’ve starred mostly your stuff, that’s out there too. Peter and Duff thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening.
[End of Audio]
SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)