
OT: Copying text from a PDF

Discussion in 'Electronic Design' started by Terry Pinnell, Jun 1, 2005.

  1. Quite often I have trouble extracting text from a PDF. I use the Text
    tool, copy, but on then pasting into my text editor I get garbage.
    Each individual character gets a return inserted. Typical example is
    at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
    to extract the details under 'Absolute Maximum Ratings'.

    What's the deal here please? If the document is proprietorially
    protected, wouldn't the Text tool be inaccessible?
     
  2. Leon Heller

    Leon Heller Guest

    I just tried it and it worked OK for me when I pasted the text into the PFE
    editor. Here are a couple of lines:

    Drain to Source Breakdown Voltage (Note 1) . . . . . . . . . . . . . . . . .
    .. . . . . . . . . . . . . . . . . . . . . .V
    DS
    50 V
    Drain to Gate Voltage (R
    GS
    = 20k
    Ù
    ) (Note 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    .. . . . V
    DGR
    50 V
    Continuous Drain Current T
    C

    It's not perfect, but I haven't got a CR after every character.

    I often extract text from PDFs when creating PCB parts, and don't have many
    problems.

    Leon
     
  3. I was just doing exactly that from a Motorola (Freescale) PDF for a
    software simulation. It doesn't handle tabs (are there any in a PDF?)
    and deletes 'whitespace'. BUT, it didn't take very long to restore the
    spacing. Not great but easier than typing the whole deal.
    GG
     
  4. Guest

    What's your text editor? Assuming you're under Windows, perhaps the
    problem is trying to paste Unicode into an editor that can't handle it.
    You might try pasting the text into Word or Wordpad to see what happens.

    You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
    think you can run the PDF viewer under Windows, but the command-line
    utilities, including a PDF-to-text converter, will work.

    Matt Roberds
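
The command-line route suggested above can be scripted. This is only a minimal sketch, assuming the `pdftotext` utility from xpdf (or a compatible build) is on the PATH; the helper names `pdftotext_command` and `extract_text` are made up for illustration. It uses the `-layout` option to keep the page's column layout and `-enc UTF-8` so symbols such as the ohm sign have a chance of surviving.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for the pdftotext command-line utility."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        cmd.append("-layout")
    # -enc UTF-8 selects the output text encoding.
    cmd += ["-enc", "UTF-8", str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path=None, keep_layout=True):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

With the datasheet saved locally as `BUZ11.pdf`, calling `extract_text("BUZ11.pdf")` would write `BUZ11.txt` next to it and return its contents.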
     
  5. Mike Monett

    Mike Monett Guest

    Leon Heller wrote:
    [...]
    Don'cha love it when the author turns off the "Text Copy" tool on the
    document so you can't copy and paste? Why they do that is beyond me. You
    could print as many copies as you wish, or make infinite copies on a
    Xerox machine. Why make it difficult to copy a couple of lines of text?

    Another moan is when the author uses some weird font that produces
    garbage characters when you paste into a text editor. I often end up
    shrinking the editor to a small window that overlays the pdf file, and do
    a manual copy.

    Then there's the text in a scanned image format. No copying, no searches,
    and it takes a lot of room on the disk.

    Hopefully, in 50 years or so, paper will be found only in museums, and
    everyone will have flexible electronic displays. Since there will be no
    need to print anything, searches will be easy, and there won't be a need
    to use special fonts or lock the document for any reason. Life will be
    easy for engineers.

    Sure...

    Mike Monett
     
  6. One thing I notice that's amiss is that there is a carriage return
    before and after subscripted text. So:

    V 50 V
    DS

    Comes out as V<CR>DS<CR> 50 V

    The symbol characters (degrees and ohms) also tend to get
    translated/screwed up, depending on where you're pasting to. There are
    also some lines screwed up, so the ends of some lines end up together
    on later lines.

    Problems in extracting text are mostly a function of the application
    that created the PDF (Framemaker 5.5 for the Power PC set to
    LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
    case). In this case, if you open the document in Illustrator you can
    see many individual blocks of text, some of which the copy operation
    strings together, and others which it misses.

    This stuff is fairly easily fixed by a bit of editing-- those dot
    leaders are irritating to fix. I tried pasting into a text-only
    application (Ultraedit), Excel, the Open Office text editor and into
    MS Word, and all came out pretty much the same except for the symbols.
    It might even be faster than re-typing everything.

    Extracting text using GSView in "normal" mode is only slightly better.


    Best regards,
    Spehro Pefhany
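
The two manual fixes mentioned above (rejoining subscripts that were split onto their own lines, and getting rid of dot leaders) can be roughed out in a few lines of Python. This is a sketch under stated assumptions, not a general solution: it assumes a stray subscript is a line of one to three capital letters (DS, GS, DGR, C), and that the line after it continues the same row. The function names are invented for illustration, and the heuristic will misfire on any genuine short all-capitals line.

```python
import re

# Four or more dots, optionally separated by whitespace, count as a leader.
DOT_LEADER = re.compile(r"(?:\s*\.){4,}\s*")
# Assumed heuristic: a line of one to three capitals is a stray subscript.
SUBSCRIPT = re.compile(r"[A-Z]{1,3}")


def collapse_dot_leaders(text, replacement=" "):
    """Replace each run of dot leaders with a single separator."""
    return DOT_LEADER.sub(replacement, text)


def rejoin_subscripts(lines):
    """Glue stray subscript lines, and the line after them, onto the previous line."""
    out = []
    join_next = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # Blank lines are kept and end any pending join.
            out.append("")
            join_next = False
        elif join_next:
            # The line following a subscript continues the same row.
            out[-1] = f"{out[-1]} {stripped}"
            join_next = False
        elif out and out[-1] and SUBSCRIPT.fullmatch(stripped):
            # Attach the subscript directly to the symbol before it.
            out[-1] += stripped
            join_next = True
        else:
            out.append(stripped)
    return out
```

For the example in this post, `rejoin_subscripts(["V", "DS", " 50 V"])` returns `["VDS 50 V"]`. Symbol characters that were mistranslated on the way through the clipboard still need fixing by hand.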
     
  7. Thanks for all those prompt responses. I'll follow up the suggestions.

    Using TextPad here - great editor.

    Same result when pasting into various other apps. I shouldn't have
    said returns after *every* character, but still pretty bad:
    http://www.terrypin.dial.pipex.com/Images/PDFText1.gif
     
  8. Chris

    Chris Guest

    Couple of options.

    Under Adobe Reader 6 use the snapshot tool to copy and paste into Word or
    Excel.

    or 2.

    download an alternative and quicker to open pdf reader from
    www.foxitsiftware.com and use the text tool and paste into Excel. This will
    give you a more coherent display but still not perfect.

    Cheers
     
  9. Chris

    Chris Guest

    ooops
    www.foxitsoftware.com
     
  10. Thanks. Yes, that is arguably an improvement:
    http://www.terrypin.dial.pipex.com/Images/PDFText2.gif
    compared to Adobe Acrobat Reader (5 in my case; each version seems to
    get worse to me!):
    http://www.terrypin.dial.pipex.com/Images/PDFText1.gif
    but I see PDF Reader has pasted a fixed size font rather than the
    original proportional?
     
  11. ....but guess I must have used WordPad for the first! Don't recall
    doing so - but can't think of any other explanation. So that makes pdf
    reader definitely an improvement.
     
  12. Jim Thompson

    Jim Thompson Guest

    I'm using Adobe Acrobat 4... I have version 5, but it's been screwed
    over by zealot programmers, so I only use it to read some stuff that
    version 4 lacks font capability for.

    With version 4 I get spaces with subscripted text, no <CR>; otherwise
    looks OK.

    ...Jim Thompson
     
  13. Boris Mohar

    Boris Mohar Guest

    I use Clipmate http://www.thornsoft.com/ which has nice text cleanup.
    Apparently it was not necessary for:

    30A, 50V, 0.040 Ohm, N-Channel Power
    MOSFET

    It showed up as WYSIWYG
     
  14. Ted Edwards

    Ted Edwards Guest

    Just downloaded it. Thanks. Wouldn't want to run it under 'doze
    anyway. :)

    BTW, Ghost Script/Ghost View extracts it with no problem. So does
    Acrobat but it's easier with Ghost.

    Ted
     
  15. Ted Edwards

    Ted Edwards Guest

    Three suggestions:
    Get PMView and use the screen capture => convert to 16 color => Save as
    a .PNG. The file size for the max ratings is <6KB.
    Install a virtual PostScript printer set to print to file.

    You can grab anything with these tools.

    Ted
     
  16. Guest

    Using the Column Select tool in my Adobe Reader, I get:

    Features
    • 30A, 50V
    • r
    DS(ON)
    = 0.040
    Ω
    • SOA is Power Dissipation Limited
    • Nanosecond Switching Speeds
    • Linear Transfer Characteristics
    • High Input Impedance
    • Majority Carrier Device
    • Related Literature
    - TB334 “Guidelines for Soldering Surface Mount
    Components to PC Boards”

    Which is close. Apparently when characters are in the symbol font, a
    carriage return is inserted. My reader is version 5.0.5.

    Doug
     
  18. Thanks. I took a look at PMView but it seems to be just a (versatile)
    image viewer, rather like several others (e.g. IrfanView), which can
    also Print to File. Maybe I should explore the second part of your
    recommendation; what 'virtual PostScript printer' do you use please?

    BTW, I have Snagit, which can also capture *text* from many windows,
    although it fails in the PDF example under discussion.
     
  19. Ted Edwards

    Ted Edwards Guest

    It is that but it also has a capture facility that allows capturing the
    whole screen, a selected area of the screen, a window or the interior of
    a window.

    "Maybe I should explore the second part of your recommendation; what
    'virtual PostScript printer' do you use please?"

    From your headers, I guess you are running 'doze. I'm not so I can
    only give you general guidelines for what I did. Since I am printing to
    file the physical printer does not need to be present at all. I picked
    a high end colour laser printer and downloaded the postscript driver for
    it. I installed it but checked the box that says "Print to file". I
    also have a real Canon i850 on my system so whenever I send something
    to the printer, I am given the choice of which of the two printers is to
    be used. If I want real hard copy, I select the i850. If I want a file
    I select the PostScript printer. With the latter, I'm then asked for a
    file, e.g. G:\downloads\glurp.ps. I can then convert that to PDF, PNG
    or a choice of several other formats including "extract text" with Ghost
    View.

    Perhaps someone here who is a 'doze user can clarify this for you.

    Ted
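
The last step described above (turning the printed-to-file PostScript into text) can also be done without the Ghost View front end. The following is a hedged sketch rather than the poster's actual procedure: it assumes a Ghostscript build that includes the `txtwrite` output device, and that the executable is called `gs` (Windows builds typically name it `gswin32c` or `gswin64c`). The helper names are hypothetical.

```python
import shutil
import subprocess


def gs_text_command(ps_path, txt_path, executable="gs"):
    """Build a Ghostscript command that writes the text of a PS/PDF file."""
    return [
        executable,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the input has been processed
        "-sDEVICE=txtwrite",  # text-extraction output device (assumed available)
        f"-sOutputFile={txt_path}",
        str(ps_path),
    ]


def ps_to_text(ps_path, txt_path, executable="gs"):
    """Run Ghostscript on the file; raises if the executable is missing."""
    if shutil.which(executable) is None:
        raise RuntimeError(f"{executable} was not found on PATH")
    subprocess.run(gs_text_command(ps_path, txt_path, executable), check=True)
```

For the file name used in this post, the call would look like `ps_to_text(r"G:\downloads\glurp.ps", r"G:\downloads\glurp.txt")`.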
     
  20. Rich Grise

    Rich Grise Guest

    Barely a day goes by that Slackware doesn't pleasantly surprise me!
    It seems I got xpdf along with it, and lo and behold:
    ------------------------
    30A, 50V, 0.040 Ohm, N-Channel Power
    MOSFET
    This is an N-Channel enhancement mode silicon gate power field effect
    transistor designed for applications such as switching regulators,
    switching converters, motor drivers, relay drivers and drivers for high
    power bipolar switching transistors requiring high speed and low gate
    drive power. This type can be operated directly from integrated circuits.
    Formerly developmental type TA9771.
    Ordering Information
    PART NUMBER PACKAGE BRAND
    BUZ11 TO-220AB BUZ11
    NOTE: When ordering, use the entire part number.

    Features
    · 30A, 50V
    · rDS(ON) = 0.040
    · SOA is Power Dissipation Limited
    · Nanosecond Switching Speeds
    · Linear Transfer Characteristics
    · High Input Impedance
    · Majority Carrier Device
    · Related Literature
    - TB334 "Guidelines for Soldering Surface Mount
    Components to PC Boards"
    Symbol
    D
    G
    S
     