I have a digital subscription to a textbook, but having to use the website to access the book is really annoying. I’d like to scrape the ebook and dump the contents into a PDF. I have downloaded proprietary PDFs from websites before using downloader browser plugins and predictable URLs, but this site is pretty locked down, with randomly generated URL tokens and a combination of XML and image data.
Has anyone managed to scrape a digital textbook like this? Any ideas where I should begin?
I can print pages to PDF without a VM. The problem with printing is that these books are over 1000 pages, so I need to automate a good chunk of the process. Ideally, I’d like to capture the XML text for the PDF as well, since it will look much better and I won’t have to manually crop 1000 PDFs with annoying borders.
Yeah, I believe you can do that by printing to a non-existent printer and then finding the file waiting in the print queue. I don’t know if it works on Windows 11, but it used to work pretty well.