docs in pdf


Jess-37
<snip>

>>
>> At any case extracting from PDF is a derived way. Why not give to SEs
>> a food they like most of all? :-)
>
> I don't really think search engines have a preferred format. If text can
> be extracted, they can index that text, regardless of how much fluff
> surrounds it, be it lots of angle brackets or some kind of
> nth-generation FORTH syntax... If anything, the looseness with which
> browsers enforce the HTML specification(s) has allowed for a lot of
> cruft to creep into Web indexes and that sort of error is far less
> likely to be encountered in PDF documents.
>
>
>> > > ...
>> > > Andrew
>
>
> Randall Schulz
>
First I just wanted to say thanks for considering my comments about
the file format of the Scala docs. (I am the author of
http://grok-code.com/75/learning-scala-with-project-euler/ )

I am first and foremost a programmer, but for the last two years I have
been building tools for a company that does search engine optimization.
A bit of that knowledge has rubbed off on me, so I wanted to expand a
bit on my previous comments.  Search engines do indeed index PDFs (plus
Flash, plain text, and many other formats), but their preferred format
is HTML.  Search engines take many clues about a page and its content
from meta tags, alt attributes, page titles, and heading tags like
<h1>, all of which are HTML-specific.  SEs just aren't able to extract
that kind of information from a PDF.  They should be able to extract
the plain text, but without any of the clues as to which words in that
text are most important.
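
To make those "clues" concrete, here is a minimal sketch that pulls the
title, meta description, <h1> text, and image alt text out of an HTML
page using Python's standard html.parser module.  The class name, the
function name, and the sample page are all made up for illustration, and
this is not a claim about how any real search engine parses pages.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE = """<html><head><title>Scala API</title>
<meta name="description" content="Reference docs"></head>
<body><h1>Collections</h1><p>Lists and maps.</p>
<img src="d.png" alt="Class diagram"></body></html>"""


class SignalExtractor(HTMLParser):
    """Collects the title, meta description, <h1> text, and img alt text."""

    def __init__(self):
        super().__init__()
        self.signals = {"title": "", "description": "", "h1": [], "alt": []}
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.signals["h1"].append("")  # start a new heading entry
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.signals["description"] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.signals["alt"].append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._current == "title":
            self.signals["title"] += data
        elif self._current == "h1":
            self.signals["h1"][-1] += data


def extract_signals(html):
    """Return a dict of the structural hints found in an HTML string."""
    parser = SignalExtractor()
    parser.feed(html)
    parser.close()
    return parser.signals


if __name__ == "__main__":
    print(extract_signals(SAMPLE))
```

The paragraph text ("Lists and maps.") is deliberately ignored here: it
is the part a plain-text extraction from a PDF would also give you, while
everything the sketch collects comes from markup.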

Search engine optimization people also talk about something called
"code to content ratio", which is the ratio of indexable text to
markup characters.  In an HTML page that makes good use of CSS
stylesheets, this ratio is very good, since there is very little
markup compared to actual information.  In a Flash or PDF file the
ratio is pretty bad because of all the "gibberish" included in the
file format.
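
For HTML, the ratio described above can be approximated with a few
lines of Python.  This is only a rough sketch under my own assumptions:
whitespace is ignored, anything inside <script> or <style> counts as
markup, and the function name and the two sample snippets are invented
for illustration.  It is not a claim about what any search engine
actually computes.

```python
from html.parser import HTMLParser

# Hypothetical snippets carrying the same sentence with different markup.
LEAN = "<p>Scala lists are immutable.</p>"
BLOATED = (
    '<table><tr><td><font face="Arial" size="2"><b>'
    "Scala lists are immutable."
    "</b></font></td></tr></table>"
)


class TextCollector(HTMLParser):
    """Gathers visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def text_to_markup_ratio(html):
    """Return visible-text characters divided by markup characters."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Drop all whitespace so indentation does not skew either count.
    text_len = len("".join("".join(collector.chunks).split()))
    total_len = len("".join(html.split()))
    markup_len = total_len - text_len
    return text_len / markup_len if markup_len else float("inf")


if __name__ == "__main__":
    print(text_to_markup_ratio(LEAN), text_to_markup_ratio(BLOATED))
```

Both snippets contain the same sentence, so the lean one scores higher
purely because it wraps that sentence in less markup.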

The format of documents *does* make a difference when it comes to
search engine rankability.

I'm not familiar with the toolchain used to create the docs, or how
easy it would be to add HTML documents to the existing process, but I do
think it's at least worth considering.

jess