I have a digital subscription to a textbook, but it’s super annoying to have to use the website to access the book. I’d like to scrape the ebook and dump the contents into a PDF. I have downloaded proprietary PDFs from websites before using downloader browser plugins and predictable URLs, but this site is pretty locked down, with randomly generated URL tokens and a combination of XML and image data.

Has anyone managed to scrape a digital textbook like this? Any ideas where I should begin?

@[email protected]

I had a few books like that, hosted directly on a scummy academic publisher’s website. No PDF or usable files. I’m currently far from home, so I can’t tell you exactly what program I used. But I noticed that every page was saved to my temporary files as image data (a cached copy of the page). So I had to manually flip a few pages, save them one by one, and name them correctly. I’ll look on my PC for the program that did that when I’m back.

@[email protected]

Sounds like you could also use an image downloader browser extension for that.

@[email protected]
creator

Sounds promising! Please let me know what you find.

@[email protected]

It was MZCacheView, but the same author made one for Chrome and a general one. But theoware is probably right, a browser extension could also do it!

@[email protected]
creator

Looks like this particular publisher has anticipated cache sniffing. No dice.

@[email protected]

You can try printing the page

@[email protected]
creator

I’m looking for something a bit more detailed. I’d like to auto-scrape the entire book.

@[email protected]

Then this method probably won’t work for you

@[email protected]

Why don’t you simply open the book in a virtual machine like VMware and hit print? It can print to a PDF.

@[email protected]
creator

I can print pages to PDF without a VM. The problem with printing is that these books are over 1000 pages, so I need to automate a good chunk of the process. Ideally, I’d like to capture the XML text for the PDF as well, since it will look much better and I won’t have to manually crop 1000 PDFs with annoying borders.

@[email protected]

Yeah, I believe you can do that by printing to a non-existent printer and then finding the file image waiting in the print queue. I don’t know if it works on Windows 11 but it used to work pretty well.

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ