Related: Extracting text from PDFs
2023-04-07
Fails with KeyError: '/XObject'
$ pip install pdf2image
$ brew install poppler
python
import os

from pdf2image import convert_from_path
from tqdm import tqdm

# pdf_path and output_dir are assumed to be defined beforehand
images = convert_from_path(pdf_path)
for i, image in enumerate(tqdm(images)):
    image.save(os.path.join(output_dir, f"{i+1}.png"), "PNG")
- Example output
- 72 dpi by default
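
PDF page sizes are expressed in points (1 pt = 1/72 inch), so the pixel size of a rendered page scales linearly with the dpi setting. The following is a minimal sketch of that arithmetic. The helper name `pixels_at_dpi` and the A4 page size are assumptions chosen for illustration, and real renderers may round slightly differently.

```python
def pixels_at_dpi(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Approximate raster size of a page given in PDF points (1 pt = 1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


# Assumed example: an A4 page is roughly 595 x 842 pt.
for dpi in (72, 100, 200):
    print(dpi, pixels_at_dpi(595, 842, dpi))
```

At 72 dpi one point maps to one pixel, which is why low-dpi output looks coarse. Passing a higher value through the `dpi` argument of `convert_from_path` enlarges the image proportionally.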
2017-08-23
Summary
$ pdftocairo -r 200 -f 0 -png mybook.pdf prefix
gs
$ time gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r100 -sOutputFile=pages_%04d.png mybook.pdf
gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r100 -sOutputFile=pages_%04d.png  219.36s user 5.22s system 95% cpu 3:54.68 total
pdftoppm
$ time pdftoppm -r 100 -png mybook.pdf mybook
pdftoppm -r 100 -png mybook.pdf mybook  464.95s user 6.77s system 96% cpu 8:07.62 total
pdftoppm, output at twice the resolution
$ time pdftoppm -r 200 -png mybook.pdf mybook
pdftoppm -r 200 -png mybook.pdf mybook  1104.28s user 12.22s system 96% cpu 19:14.59 total
So what happens if gs renders at twice the resolution?
$ time gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r200 -sOutputFile=pages_%04d.png mybook.pdf
gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r200 -sOutputFile=pages_%04d.png  619.65s user 13.83s system 93% cpu 11:14.36 total
ImageMagick (convert)
$ time convert -verbose -density 200 mybook.pdf pages_%04d.png
"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r200x200" "-sOutputFile=/tmp/magick-le3ab9PT-%08d" "-f/tmp/magick-s7CciX7v" "-f/tmp/magick-6huEpaq8"
pdfimages
$ time pdfimages -j mybook.pdf ./pages
pdfimages -j mybook.pdf ./pages  6.14s user 2.51s system 59% cpu 14.569 total
$ convert -thumbnail 1104x1646 ex7/pages-002.jpg t.png
convert -thumbnail 1104x1646 ex7/pages-002.jpg t.png  3.86s user 0.09s system 98% cpu 4.002 total
pdftocairo
$ time pdftocairo -r 200 -f 0 -l 10 -png mybook.pdf pages
pdftocairo -r 200 -f 0 -l 10 -png mybook.pdf pages  27.02s user 0.29s system 79% cpu 34.260 total
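
The gs and pdftoppm runs are easier to compare once the zsh `time` totals are converted to seconds. The sketch below does that conversion for the four whole-book measurements recorded in these notes. The helper name `to_seconds` is an assumption made for illustration. The pdfimages and pdftocairo runs are left out because they are not like-for-like (image extraction and a limited page range, respectively).

```python
def to_seconds(elapsed: str) -> float:
    """Convert a zsh `time` total such as '3:54.68' or '14.569' to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Wall-clock totals copied from the measurements in these notes.
totals = {
    ("gs", 100): "3:54.68",
    ("pdftoppm", 100): "8:07.62",
    ("gs", 200): "11:14.36",
    ("pdftoppm", 200): "19:14.59",
}

for dpi in (100, 200):
    gs = to_seconds(totals[("gs", dpi)])
    ppm = to_seconds(totals[("pdftoppm", dpi)])
    print(f"{dpi} dpi: gs {gs:.2f}s, pdftoppm {ppm:.2f}s, ratio {ppm / gs:.2f}")
```

On these numbers pdftoppm took roughly 2.1 times as long as gs at 100 dpi and roughly 1.7 times as long at 200 dpi.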