In all my Scrivener projects, imported web pages are clogged up with logos, “Subscribe Now!” buttons, lists of the site’s top 10 hits, etc etc etc. I’m almost always after the text, plain and simple, not a site’s photos, menus and confetti.
Reading on the Web I get rid of extraneous material by using the Readability service: http://lab.arc90.com/experiments/readability/ . If I then import the URL into Evernote, it’s the stripped-down Readability version that I get.
In Scrivener, however, when I import a web page into a project, I get the full page, crammed with stuff I don’t want. Even after Readability conversion in the browser. And no combination of Scriv’s Web-import preferences makes any difference. In other words, if I am reading, say, a NY Times story about Wikileaks, I have a one-click way of producing a version with just the story text. But I have no way to save that version into a Scrivener project—unless I introduce Evernote as the middleman.
Hence, my wish: That Scrivener import become in some way compatible with Readability (or some other stripper of irrelevant code). Might this be possible?
Probably not - Scrivener just takes the URL and turns it into a .webarchive using WebKit methods. But why don’t you just convert the web page to text (Documents > Convert) - that way you can delete anything you don’t want and format the text how you want too.
All the best,
Hmm. Two reasons. First, on my system “Convert to Text” doesn’t make a WebArchive into plain text. After conversion I still get ads, menus etc. That makes it much more time-consuming to filet away the frippery, because often it’s in columns next to the desired material. I.e., it’s not just a matter of trimming off the top and bottom of the file.
Second, even if the page were rendered as plain text in Scriv, the time involved in manually removing Web bling really adds up.
I guess my solution will be to keep using Evernote as a sort of pre-filter. Or, alternatively, find some other tool that generates a URL for the pared-down version of a site – then import that.
What I do, if a page doesn’t come in cleanly with URL->Text is use Readability (I have the button installed for Chrome and Firefox), and then copy and paste into Scrivener. I find if I select all in a Readability session and paste using Paste and Match Style, the result is pretty clean, but even standard Paste is okay.
I do that a lot as well, but there is a downside with Copy+Paste: In the resulting Scrivener file, there is no record of the URL from which the text came. I can “copy with link” in Firefox but that puts the URL in the same text space as the material. That defeats the purpose of Scriv’s inspector window for links.
I need to be able to take a page’s text, in whole or in part, and automatically (a) lose unneeded decorations and (b) save the result with its URL. That’s what I can do by combining Readability with Evernote. It’s not crucial to be able to do that directly in Scrivener, but it’d be nice. That’s why I called it a “Wish List” question.
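For anyone inclined to script it, requirements (a) and (b) can be roughly sketched in Python. This is only an illustration, not how Readability, Evernote or Scrivener actually work: the list of tags to skip, the `TextOnly` class, the `clip` function and the "Source:" line are all assumptions made up for the example, and the filtering is far cruder than Readability's.

```python
from html.parser import HTMLParser

# Tags whose contents count as "decoration" in this sketch (an assumption;
# Readability uses much smarter heuristics than a fixed tag list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextOnly(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of skipped elements we are currently inside
        self.chunks = []    # pieces of text worth keeping

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.chunks.append(text)


def clip(html, url):
    """Return the kept text with the source URL recorded on the last line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks) + "\n\nSource: " + url
```

Calling `clip(page_html, page_url)` gives back plain text that could be pasted into a Scrivener document with the URL still attached. Real pages often mark their clutter with plain `div` elements rather than `nav` or `footer`, so a tag list like this will miss a lot.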
Oh, just drag the URL icon from Firefox over to that inspector table. That list is completely for your use! You can drag all the URLs you want into it, even other Scrivener documents and files on your computer.
Maybe I have eccentric work habits, but that would be quite cumbersome for me. It’s the extra steps, plus change of focus from content to administration – lessee, where’s that URL? Better make sure it goes in the right doc’s inspector table …
With Evernote I select what I want (or Readabilize the whole page), then hit a browser bookmarklet. The next thing I see is EN asking where I want the material, to which I reply by hitting Enter for the default location. Result: A screen with just the text I want, and the URL it came from, and a box where I can add tags if I want.
I’m curious about what you want; from what I can gather you want a web archive that looks like Readability’s version of the page in Scrivener, with the original url in that web archive’s Document References. Is that correct?
Because importing a page, whether it’s Readability-ed or not, doesn’t put the original url in the document references, as far as I can tell. It does put the url at the bottom of the Scrivener window as a link you can click to open the page in your browser. Is that what you were after?
IP Text generates its own url with a short prefix before the original url. While the IP Text page is still open in your browser, import its url into Scrivener by dragging or using the menu item.
It doesn’t do as good a job of stripping things as Readability, but it includes a link to the original url at the top of the page which you can quickly drag into the Inspector’s Document References if you want. The link at the bottom of the Scrivener window will be the IP Text generated url, though.
(This last is just a result of my investigations; some things I found interesting:
Curiously, in Firefox, if you copy and paste a Readability-ed page into Scrivener or some other Text Editor, you’ll notice that it has “Excerpted from” and the original url, in full, at the bottom of the page. This also appears in the Print Preview of the Readability-ed page. Strangely, it doesn’t appear like that in the browser. This copy/paste, print preview version is also what Evernote is getting from Firefox. In Safari, if you copy and paste or use the Evernote clipper, you don’t get the “Excerpted from”, but it does appear in the Print Preview. Hmm.)
Btw, this is not a wish. As you note I already have this ability in Evernote (DevonThink too). But would prefer to use Scrivener, which is perfect for the work, instead of complicating my life with intermediaries.
Actually it does if you set prefs to “Convert HTML Files to WebArchives” and then check “Convert WebArchives to Text.” There may be other settings that do the same, but that works for me.
For me, that’s the worst of both worlds – I have the URL but I can’t easily use it to make a link or create a list of references, because whenever I click on it, boom! I’ve left the project. OTOH, the automatic placement of the source URL into the Doc Refs Window is one of those things that make me use and love Scrivener.
Thanks for the tip about IP! On the Scrivener end, though, for me this is too many steps. Don’t want to stop thinking about what I am reading and how I will use it, in order to go hunting in a different screen for a URL and then dragging it about (while worrying that I have maybe selected the wrong URL from one of many open tabs, and dragged it to the wrong Doc Notes window).
What I might try is this: Use Readability and the Scrivener service to acquire text rapidly as Notes. Then, at a later stage, decide which of the excerpts are worth keeping track of. Project-Search on ‘Excerpted from’ will quickly turn up the original URLs that, as you say, are stored by Readability in text. I still have to find a URL and drag it, but that task now occurs when I am ready to think about storing URLs, rather than when I want to focus on content. (Maybe this sounds like hair-splitting to other Scrivener devotees, but it makes sense to me.)
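If the notes were ever exported as plain text, that later "find the URLs" pass could also be scripted. The sketch below is a guess at how that might look in Python. It assumes each pasted excerpt contains the words "Excerpted from" with the URL on the same line or the next one; the exact footer layout Readability produces is not confirmed here. The `source_urls` function and the dictionary of notes are hypothetical.

```python
import re

# Assumption: "Excerpted from" is followed by a URL on the same line or the
# line directly after it. The real footer layout may differ.
PATTERN = re.compile(r"Excerpted from[^\n]*?\n?[^\n]*?(https?://\S+)")


def source_urls(notes):
    """Map each note title to the URL found after 'Excerpted from', if any."""
    found = {}
    for title, text in notes.items():
        match = PATTERN.search(text)
        if match:
            # Drop trailing punctuation that is probably not part of the URL.
            found[title] = match.group(1).rstrip(".,)")
    return found
```

Notes without the footer are simply left out of the result, so the output doubles as a list of the excerpts that still have a recoverable source.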
That could be a workable blog-writing strategy. However, when it is time to quickly download a dozen PDFs, each with its URL recorded and accessible, it’s time for DevonThink or Evernote. As Keith-Not-Kevin has said with his usual likable gruffness, Scrivener is great because it does not try to do all things. Still, I find it odd that HTML has apparently evolved so that converting a webpage to text is impossible. Almost always, now, when I import a URL I get the same visuals in Scrivener that I got on the web page. 'Twas not always so.
That IS odd, isn’t it? Some magazines and newspapers also do this. I imagine it’s a safeguard against plagiarism and misattribution.
By the way, before the fancy-pants Readability came along, I always used to just press the “Print” button beside the article and then cancel the print dialogue. In fact, I’d do that to read the article, as well as use it as an import basis. It isn’t always available, but most article-based sites do have that button. I still use that method from time to time, when the plug-in isn’t installed or I forget about the fact that it exists.