How to rip a book

The Dream: ubiquitous information without material encumbrance

I have tons of books, and they take up a lot of space.  They’re heavy.  They easily get disorganized and misplaced, and searching for and through them is a cumbersome process.  They are the opposite of Google.  I’d like to have them all on my iPad, rather than carrying 8 to 10 books to class every day in a book bag.  Between my book bag, the kids, and the safety gates to keep them confined, my back hurts.

It seems to me that there are clearly analogues for most of technology’s emergent problems.  For example, AT&T once charged per telephone in your home, and now they try to charge per internet connection, say should I wish to tether my laptop to my phone.

Books, I believe, are an emergent problem for digitization.  The analogue here is found in the music industry.  Ripping a CD, or even creating a mix tape, has always been within the rights of the consumer.  This is problematic for producers because there are no controls to prevent users from sharing “their” work.  With electronic Digital Rights Management (DRM), the producer creates the original work in the preferred medium, but handicaps the ability to reproduce it.  Downloading from iTunes is not as liquid as purchasing a CD, ripping it, and then putting it into iTunes.

There are many free eBooks which have entered into the public domain, most notably those collected by Project Gutenberg.  This library is sometimes accessible from the back end of other readers, such as the Apple iBooks and Google Play Books marketplaces.  This is both wonderful and somewhat deceptive, because when you download and pay for content, you might expect the same level of ownership to follow.

Purchasing a book from Amazon offers equally baffling conundrums.  An ebook may cost just slightly less than the paper equivalent, or it may cost the same.  An ebook, by virtue of the DRM embedded in the file, cannot be resold, and has enormous restrictions on transferring between devices.  A used paperback can often be shipped extraordinary distances for less money.  And now you own the CD.  Ripping the CD does not constitute a derivative work, because it is rote and identical to the source; neither is its new format transformative.


Ripping a book: Creative Destruction

Ripping a book today is where burning a CD was circa 1995.  It is slow.  Remember a 1x CD-RW?  A 72 minute audio CD takes 72 minutes to read, and another 72 minutes to write; at 2x, 36 minutes; at 4x, 18 minutes; at 8x, 9 minutes, etc.  Most consumer grade scanners are rated in pages per minute (ppm), and the advertised ppm is often misleading, being based on the lowest possible resolution.  My scanner boasts 35 ppm, and the automatic document feeder can handle exactly a minute’s worth.  What’s more, feeding in more than 100 pages in .TIF format results in the 101st page overwriting the 1st page.  The best option is to give the scanner a default file name, such as “mybook_odd”, load a small stack (fewer than 50 pages), and leave the room.  When you come back, check for torn pages and add another load.  The output file, at least in the case of the HP 8500, will be “mybook_odd0001.tif” and will increment with each scan.  Saving to a network folder will eliminate some (but not much) of the file management necessary if using email or a USB drive.
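To make the arithmetic concrete, here is a rough sketch in Python of how many feeder loads a book needs on a one-sided scanner, and how long the feeding alone would take at the advertised speed.  The function name, the 40-sheet load size, and the assumption that every sheet is fed twice (once per side) are mine, not anything the scanner reports.  Since the advertised ppm is optimistic, treat the time as a floor rather than an estimate.

```python
import math

def plan_scan(total_pages, sheets_per_load=40, advertised_ppm=35.0):
    """Rough plan for scanning one book on a one-sided (simplex) scanner.

    sheets_per_load and advertised_ppm are illustrative assumptions;
    substitute whatever your own feeder handles without jamming.
    """
    sheets = math.ceil(total_pages / 2)        # one sheet carries two pages
    loads_per_pass = math.ceil(sheets / sheets_per_load)
    passes = 2                                 # odd sides first, then even sides
    sides_fed = sheets * passes
    return {
        "sheets": sheets,
        "loads": loads_per_pass * passes,
        # Lower bound only: ignores reloading, jams, and slower high-resolution scans.
        "feed_minutes": sides_fed / advertised_ppm,
    }

print(plan_scan(400))
```

For a 400 page book this comes to 200 sheets and 10 trips back to the feeder, which is where most of the real time goes.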

I have undertaken this project as a feasibility study, and by my third book, the process took 3.5 hours for 400 pages.  The end product was searchable, and the text was recognizable by OCR for screen readers, notably Apple’s VoiceOver and Adobe’s “Read Aloud.”  I am not a “read the directions first” type of person; however, this type of project is one where a little bit of planning can go a long way.  I recommend reading through this first, and starting with a book you don’t care about.  True, in the best case, you’ll have a book you don’t care about, but you won’t have destroyed a book you do care about either.

Unfortunately, the method I am recommending destroys the original work.  The same process can be used with a digital camera, a tripod, a piece of glass and a foot pedal to activate the shutter.  Perhaps you have a text sacred enough to warrant this; if so, repeat the process for each page.  I have used a scanner, and to do this, I have to remove the spine of the book.  I find this to be distasteful, but worth it.  You can use anything: scissors, an X-Acto knife, a Dremel rotary tool, a “spine cutter”, a miter saw or a table saw.  The desired outcome is a clean cut edge that won’t jam in the paper feeder.

See YouTube for examples of people cutting their books up (best demonstration).  This woman does an excellent job, and she also has access to a very nice copier which emails her the output as a PDF.  My department has been very protective of theirs, much to my annoyance.  This would reduce the time required for each book to probably as little as an hour or less.

Most scanners won’t take the whole book, so the book needs to be divided into sections your scanner can accommodate.  I used an HP 8500A, which regrettably does not do front and back scanning (“duplexing”).  If you are in this position, you will find that you have to create twice as many files: one set for the odd pages and one for the even pages.  Worse, you will probably find that after scanning the odd pages 1, 3, 5, 7, 9… you have a pile in the wrong order for scanning the even pages: 400, 398, 396, 394, 392…  Save your odd files in some way that you will be able to recognize them, either in the file name or in their own folder, and use a matching scheme for your even files to distinguish the two groups.
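If you went the file name route and everything landed in one folder anyway, a few lines of Python can split the two groups after the fact.  This is only a sketch under my own assumptions: that the scans were given default names containing “_odd” or “_even” (as in “mybook_odd0001.tif”), and that you want them moved into subfolders called “odd” and “even”.  The function name is made up for illustration.

```python
from pathlib import Path
import shutil

def sort_scans(scan_dir):
    """Move scans into 'odd' and 'even' subfolders based on their file names.

    Assumes names like mybook_odd0001.tif and mybook_even0001.tif;
    files matching neither pattern are left where they are.
    """
    scan_dir = Path(scan_dir)
    moved = {"odd": [], "even": []}
    # sorted() takes a snapshot first, so creating subfolders mid-loop is safe.
    for path in sorted(scan_dir.iterdir()):
        if not path.is_file():
            continue
        for group in ("odd", "even"):
            if f"_{group}" in path.stem:
                target = scan_dir / group
                target.mkdir(exist_ok=True)
                shutil.move(str(path), str(target / path.name))
                moved[group].append(path.name)
                break
    return moved
```

Run it on a copy of your scan folder the first time, since it moves files rather than copying them.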

Assuming you have scanned all of the pages of your book, even and odd, that’s still only half of the problem.  You can save your output files as .TIF (don’t be scared, it’s just another format), .jpg or .pdf.  JPEG files (.jpg or .jpeg) use a lossy format; suffice it to say, some information can get lost during manipulation, which is something you are about to do.  So .pdf and .TIF are preferable and, for our purposes, equal, since they can be easily exported from one to the other.

ScanTailor is the secret sauce that makes this project doable.  It’s free, it’s open source, and it’s really, really good.  It’s self-explanatory: load in the input files, add them to a project, do your manipulations by pressing the play button at each stage, and ScanTailor will dump your output into its own directory.  From there, my advice is to get access to Adobe Acrobat Pro.  You can print the output files from Picasa (free) using a PDF “printer”, but I find it is easiest to use Adobe Acrobat Pro, and File->Create PDF->Merge Files into Single PDF…  I select the files, and add all even or all odd into one file (but not both).  My output files are named bookname_odd, 00001, 00002, etc., and so they are added into the new PDF in the correct order.  Even numbered pages will be in the opposite order from the one I want.

There are several easy ways to handle this problem electronically which are faster than sorting the pages physically.  Ideally, you want to create two files, one with your odd pages and another with your even pages.  I used Adobe Acrobat, and put a JavaScript file into its working directory.  The next time I opened Acrobat, there were two new menu options: collate and reverse order.  Reverse order reverses the page order, and collate combines the odd and even pages.  Before this method, I spent way too much time dragging pages around on the screen.  To clarify, just copy the code part of the post, paste it into a text document (Notepad), but change the file extension to .js.  If you have trouble, message me or comment below, and I’ll post the file here, but not now.
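For anyone curious what those two menu options amount to, here is the logic sketched in Python on plain lists.  This is not the Acrobat JavaScript I used, just an illustration of the same two steps: reverse the even pages, then interleave them with the odd ones.  The function name is hypothetical, and the list items could be page numbers or file names.

```python
def collate(odd_pages, even_pages_as_scanned):
    """Interleave odd pages with even pages that were scanned back to front.

    odd_pages:             [1, 3, 5, ...] in reading order
    even_pages_as_scanned: [..., 6, 4, 2] in the order they came off the feeder
    """
    # "Reverse order": put the even pages back into reading order.
    evens = list(reversed(even_pages_as_scanned))
    if len(odd_pages) != len(evens):
        raise ValueError("odd and even page counts differ; a sheet may have misfed")
    # "Collate": alternate odd, even, odd, even...
    book = []
    for odd, even in zip(odd_pages, evens):
        book.extend([odd, even])
    return book

print(collate([1, 3, 5], [6, 4, 2]))  # prints [1, 2, 3, 4, 5, 6]
```

The length check is worth keeping in any version of this: if the two piles don’t match, a double feed or a skipped sheet is the usual culprit, and collating anyway will shift every page after it.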

Information is non-excludable, and I’m not sure how to reconcile this with capitalism as we know it.  Scanning books is intriguing because it seems like the natural progression and demand of the information society.  My research interests are in text mining, but my computational prowess puts the practice just beyond my reach.  I have been able to have a little bit of fun, but nothing on the scale of the impact I foresee it having on our world, political science being only a small part of it.  With more time to devote to studying text mining as a method within political science, it remains to be seen if I can innovate or die.  There is an inevitability to the vision I hesitate to articulate, because as it is information, and non-excludable, there are many more able researchers who can reach out and snap it up.

