jack: (Default)
[personal profile] jack
Bookmark organising

When I organised my bookmarks, I realised that most fell into the categories of

  • Stuff I use every day and want quickly accessible. Eg. email, social networking homepages, the wikipedia page on the unicode checkmark, etc.
  • Something to read later. Eg. links that seem interesting, computer games, books and films to consider buying or renting, etc.
  • Something to read periodically, eg. news sites, social networking friends pages, feeds, blogs I follow, many webcomics divided into "daily", "bi/triweekly" etc.
  • Something I may occasionally want as a reference. Eg. step-by-step instructions for stuff I do occasionally.
  • Stuff that's useless, doesn't update, but I just keep coming back to because it's so awesome, such as the world flag rating page (do not make your country's flag in photoshop, tricolors are overused), the Evil Overlord List of movie-stereotypical mistakes I will not make if I'm ever an evil overlord, and the Earth destruction advisory board FAQ on non-dilettante ways to destroy the earth.

    The last category was a minor surprise to me, as I'd not realised in advance it was a category I'd need. But I really do need it, because even if I don't need those links, if I don't have it, my mind keeps saying "don't forget the earth destruction advisory board, what if it updates the earth destruction status[1]", so I need a place to put them, just to get them out of the way of all the other categories!

    I do the same with physical objects too: if I want to keep something and it doesn't have a place, I make a place for things I keep for that reason, however stupid. Then, if I decide it's stupid and I don't need to keep it, I can throw it out later, having already separated it from stuff I'm keeping for a more useful reason.

    The reason I mention this now is that last night several of us were talking about an answer on Stack Overflow that is incredibly awesome and has made the rounds several times recently, but that some less-programmer-y people hadn't seen. It is one of the most recent links promoted to my list of "stuff on the internet I personally find most awesome".

    Link for khalinche and ceb from last night, how do I use a regex to detect certain sorts of tag in HTML text

    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454

    There is a question on Stack Overflow asking how to use a regex to detect certain sorts of tag in HTML text, and the first answer (linked above) is a work of genius. As the answerer gets more and more emphatic about his opinion, it's really funny and accurate (even if you don't know what the words mean, it's still funny, and you can get the gist of the answer if you scroll through slowly to the end). :)

    Footnotes

    [1] On 10 September, 2008, it did, advancing the "Earth destruction advisory count" from 0 to 1. There is a supplementary FAQ on the event at http://qntm.org/board, starting with "The Earth hasn't been destroyed! What are you talking about?"

Date: 2012-11-20 01:16 pm (UTC)
ptc24: (Default)
From: [personal profile] ptc24
You can't parse [X]HTML with regex.

Other true statements, if we take the presuppositions that are needed to make that statement true, and apply them elsewhere:

You can't parse natural language with regex

You can't parse natural language with any computer program we currently have or know how to write

You can't parse natural language with a human being

Now obviously there's something funny about that last statement.

So the next comment:

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.


Alternatively, to parse a limited-but-ill-specified set of HTML with potentially less than 100% reliability.

Date: 2012-11-20 02:31 pm (UTC)
simont: A picture of me in 2016 (Default)
From: [personal profile] simont
Possibly one of the suppositions [personal profile] ptc24 is alluding to is the implicit difference in standards of success between natural languages and computer languages. With any computer language, we aim for perfection in parsing: if just one program, no matter how fiddly, satisfies the syntax specification but is not correctly understood by the parser, then that's a bug and it wants fixing. Hence, the fact that a regexp will inevitably fall down in some complicated corner case of HTML or other, no matter how many common cases it gets right, is sufficient to justify the statement that regexps cannot parse HTML.

But if you applied the same standard to natural language, you'd have to say that nobody and nothing can parse it!

Date: 2012-11-20 02:52 pm (UTC)
ptc24: (Default)
From: [personal profile] ptc24
My first instinct was to say "bingo", but there's also the case of well-constrained subsets of HTML, for example single tags, or HTML from some specific known source, or whatever.

I once wrote a handful of regexes to replace an XML parser - the XML parser was slow and leaked memory, and the data we were processing was in a format that was nice and tractable. People didn't like that. Solution: download the data in a non-XML format. Suddenly the handful-of-regexes approach was nothing to complain about. Also, as a result, the download took half as long.

Date: 2012-11-20 02:34 pm (UTC)
ptc24: (Default)
From: [personal profile] ptc24
There seem to be two issues here:

1) "You can't parse [X]HTML with regex." meaning "You can't parse [X]HTML with regex with full reliability and generality." This is what my main point was. That's my point about humans parsing natural language. In fact, the point about humans was mainly about reliability, whereas I suppose the problem that's relevant here is generality. (Arguably, you could say "there is no one person who understands every natural language, and certainly no-one who understands every conceivable natural language." Possibly if some theorists are right there are natural languages that are comprehensible to some space aliens but not to H. sapiens)

2) You're correct, the specific question was about HTML tags - a subset of HTML, which doesn't have the problems with recursion that full HTML has - and so regexes are (likely to be) OK.
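
For the tag-only case, a minimal sketch of the handful-of-regexes approach might look like the following. The pattern and the name open_tags are hypothetical, not taken from the Stack Overflow question, and the sketch assumes attribute values never contain a ">" character.

```python
import re

# Hypothetical pattern: an opening tag that is not self-closing.
# Assumption: attribute values never contain '>'.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*(?<!/)>")

def open_tags(html):
    """Return the names of opening tags, skipping self-closing ones like <br />."""
    return [match.group(1) for match in OPEN_TAG.finditer(html)]

sample = '<p>Hello<br /><a href="foo">link</a></p>'
print(open_tags(sample))  # ['p', 'a']
```

This behaves on tidy input, but it also reports the tag inside a comment such as "<!-- <b> -->", which is the sort of corner case that makes it less than fully general.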

Date: 2012-11-20 02:58 pm (UTC)
simont: A picture of me in 2016 (Default)
From: [personal profile] simont
You draw a careful distinction there between reliability and generality, but don't say what each one means. Would I be right in guessing that 'generality' is about how few examples of the language you give up on and say 'dunno', and 'reliability' is about how few wrong answers you give for those cases in which you don't say 'dunno'?
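
To make the two guessed measures concrete, here is a small sketch. The outcome labels and the function name are made up for illustration, and the definitions are only the ones guessed at in the question above, not agreed ones.

```python
def coverage_and_accuracy(results):
    """Compute the two measures guessed at above.

    results: one outcome per input, each 'right', 'wrong' or 'dunno'.
    coverage: fraction of inputs the parser did not give up on.
    accuracy: fraction of attempted inputs it got right.
    """
    attempted = [r for r in results if r != "dunno"]
    coverage = len(attempted) / len(results) if results else 0.0
    accuracy = attempted.count("right") / len(attempted) if attempted else 0.0
    return coverage, accuracy

outcomes = ["right", "right", "wrong", "dunno"]
coverage, accuracy = coverage_and_accuracy(outcomes)
print(coverage, accuracy)
```

With these four outcomes the parser attempts three of four inputs and gets two of those three right.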

Date: 2012-11-20 03:13 pm (UTC)
ptc24: (Default)
From: [personal profile] ptc24
I'd count both of these as "reliability". Issues of generality I think include the range of inputs it makes sense to run things over. Something might do well for copy-edited text but not for blog posts, something might do well for flowing text but not for things in tables, something might be OK for genuine written text but not for transcripts of speech, etc. Also, you can imagine systems which work with full reliability, in that they capture 100% of the cases of some narrowly-but-well-defined phenomenon in some narrowly-but-well-defined case.

There's a bare mass noun problem here, like a bare plural problem. If I have some regexes, and they reliably parse some set of web pages that I am interested in, and those web pages are in HTML, do my regexes "parse HTML"? They don't, and can't, parse all HTML, they can and do parse some HTML. But what does it mean to "parse HTML"?

Date: 2012-11-20 03:36 pm (UTC)
ptc24: (Default)
From: [personal profile] ptc24
But if I go on stack overflow and ask "can I divide by two to find a square root", the helpful answer is not "yes, it works for 4" but "no, you need to use an iterative approximation".

I think this example is missing something. In the case above, the working-for-4 case is like a stopped clock being right twice a day. There are cases which are more like a non-stopped clock being right most of the time, except when there's a GMT/BST switch or you're travelling between timezones or the clock runs out of batteries or whatever, and you have a choice between "these aren't an issue for me in my particular case", "these are an issue but I can correct for these by hand when needed" and "I need a good clock that handles all of these cases (and probably a few others that I've forgotten about) automatically".

So, can I tell the time by looking at my kitchen clock? Does the presence or absence of batteries in the back of my clock affect whether I can tell the time with it by looking at it? Does the possibility of my flatmate correcting the clock for GMT or leaving it uncorrected (without telling me either way) affect whether I can tell the time with it by looking at it?

ETA: is trying to parse, or process, (X)(HT)ML with a regex like trying to tell the time with a sundial?
Edited Date: 2012-11-20 03:38 pm (UTC)
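
The quoted square-root contrast can be sketched like this: halving is the stopped clock that happens to be right for 4, and an iterative approximation (Newton's method here, chosen as one common option) is the clock that actually runs. The function names and the tolerance are illustrative assumptions.

```python
def halve(x):
    """The 'divide by two' shortcut: only coincidentally a square root."""
    return x / 2

def newton_sqrt(x, tolerance=1e-12):
    """Approximate the square root of x by Newton's iteration."""
    if x < 0:
        raise ValueError("square root of a negative number")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0  # start at or above the true root
    while True:
        better = (guess + x / guess) / 2
        if abs(better - guess) <= tolerance * better:
            return better
        guess = better

for value in (4, 9):
    print(value, halve(value), newton_sqrt(value))
```

Halving gives the right answer for 4 and 4.5 for 9, whereas the iteration converges on the root for both.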

Date: 2012-11-20 05:31 pm (UTC)
ptc24: (Default)
From: [personal profile] ptc24
Erm. I think you're right, in that there are issues over "parse".

I think the trap I've fallen into is letting the questioner and answerer define "parse" for me. The questioner was trying to do something, the answerer said "You can't parse HTML with regexes", and I've been assuming that there's been consensus that this is at least relevant. I've been on the other side of this debate before (insisting on narrow, strict definitions of "parsing"), so maybe I should slap my wrists. That said, I'm pretty sure that whatever is meant, parsing is about syntax, not semantics.

It's possible that the questioner's original request is more akin to some other syntactic task, such as tokenisation, or chunking. But "chunking" is one of those things that may or may not count as parsing - let's look at wikipedia:

"Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence."

So when we're finding tags in HTML, we're identifying some constituents of the text stream, constituents which I think are too complex to count as "tokens" (if "tokens" are even an admissible notion in concrete HTML). So what the questioner is trying to do could be construed as a form of partial shallow parsing.

Nested tags: Arbitrary nested tags? Tags nested at unlimited depth?
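
One way to picture the split between chunking and nesting is a rough sketch in which a regex only finds the tags and an explicit stack tracks the depth, since a classical regular expression cannot keep count of arbitrarily deep nesting. The names TAG and max_nesting_depth are hypothetical, and the sketch assumes XHTML-style input with no comments and no ">" inside attribute values.

```python
import re

# Hypothetical chunker: finds opening, closing and self-closing tags.
# Assumptions: no comments, no '>' in attribute values, and void
# elements are written XHTML-style (<br />, not <br>).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def max_nesting_depth(html):
    """Return the deepest tag nesting level; raise ValueError on a mismatch."""
    stack = []
    deepest = 0
    for match in TAG.finditer(html):
        if match.group(0).endswith("/>"):
            continue  # self-closing tag: opens and closes at once
        closing, name = match.group(1), match.group(2).lower()
        if closing:
            if not stack or stack.pop() != name:
                raise ValueError(f"unexpected </{name}>")
        else:
            stack.append(name)
            deepest = max(deepest, len(stack))
    if stack:
        raise ValueError(f"unclosed <{stack[-1]}>")
    return deepest

print(max_nesting_depth("<div><div><p>text</p></div></div>"))  # 3
```

The regex does the shallow part (spotting constituents), and the stack, which has no fixed size limit, does the part that needs memory of how deep we are.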

Date: 2012-11-20 07:16 pm (UTC)
andrewducker: (Default)
From: [personal profile] andrewducker
Stuff I use every day: Bookmarks on the Firefox bookmark toolbar.
Something to read later: http://getpocket.com/
Periodical Stuff: http://www.google.co.uk/reader/
Reference Stuff: Bookmarked. Or added to Delicious.
Awesome Stuff: Delicious.
