jack

Bookmark organising

When I organised my bookmarks, I realised that most of them fell into these categories:

Stuff I use every day and want quickly accessible. Eg. email, social networking homepages, the Wikipedia page on the Unicode checkmark, etc.
Something to read later. Eg. links that seem interesting, computer games, books and films to consider buying or renting, etc.
Something to read periodically. Eg. news sites, social networking friends pages, feeds, blogs I follow, many webcomics divided into "daily", "bi/triweekly", etc.
Something I may occasionally want as a reference. Eg. step-by-step instructions for stuff I do occasionally.
Stuff that's useless and doesn't update, but that I just keep coming back to because it's so awesome. Eg. the world flag rating page (do not make your country's flag in Photoshop, tricolours are overused), the Evil Overlord List of movie-stereotypical mistakes I will not make if I'm ever an evil overlord, and the Earth destruction advisory board FAQ on non-dilettante ways to destroy the earth.

The last category was a minor surprise to me, as I'd not realised in advance that I'd need it. But I really do need it: even if I don't need those links, if I don't have somewhere for them, my mind keeps saying "don't forget the earth destruction advisory board, what if it updates the earth destruction status[1]", so I need a place to put them, just to get them out of the way of all the other categories!

I do the same with physical objects too: if I want to keep something and it doesn't have a place, I make a place for things I keep for that reason, however stupid. Then, if I decide it's stupid and I don't need to keep it, I can throw it out later, having already separated it from stuff I'm keeping for a more useful reason.

The reason I mention this now is that last night several of us were talking about an answer on Stack Overflow that is incredibly awesome and has done the rounds several times recently, but that some less-programmer-y people hadn't seen. It is one of the most recent links promoted to my list of "stuff on the internet I personally find most awesome".

Link for khalinche and ceb from last night: how do I use a regex to detect certain sorts of tag in HTML text?

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454

There is a question on Stack Overflow asking how to use a regex to detect certain sorts of tag in HTML text, and the first answer (linked above) is a work of genius. As the answerer gets more and more emphatic about his opinion, it's really funny and accurate (even if you don't know what the words mean, it's still funny, and you can get the gist of the answer if you scroll through slowly to the end). :)
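For the less-programmer-y, here is a minimal sketch in Python of the sort of thing the questioner was after, and one way it can go wrong. The pattern and the name naive_open_tags are my own illustration, not taken from the question or the answer, and this is only one of many ways a regex approach can trip up.

```python
import re

# Hypothetical pattern (an assumption for illustration): a tag name after "<",
# then anything up to ">", as long as the ">" is not preceded by "/".
NAIVE_OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>")


def naive_open_tags(text):
    """Return the names of things that look like opening tags in text."""
    return NAIVE_OPEN_TAG.findall(text)


# Works on tidy input: <p> is reported, <br /> and </p> are skipped.
print(naive_open_tags("<p>Hello <br /> world</p>"))

# Goes wrong on a comment: the regex has no idea that <b> is inside
# <!-- ... --> and reports it as a tag anyway.
print(naive_open_tags("<!-- <b>not a tag</b> -->"))
```

The second call is the interesting one: nothing in the pattern knows about comments, so text that merely looks like a tag gets counted.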

Footnotes

[1] On 10 September, 2008, it did, advancing the "Earth destruction advisory count" from 0 to 1. There is a supplementary FAQ on the event at http://qntm.org/board, starting with "The Earth hasn't been destroyed! What are you talking about?"

From:

Erm. I think you're right, in that there are issues over "parse".

I think the trap I've fallen into is letting the questioner and answerer define "parse" for me. The questioner was trying to do something, the answerer said "You can't parse HTML with regexes", and I've been assuming that there's been consensus that this is at least relevant. I've been on the other side of this debate before (insisting on narrow, strict definitions of "parsing"), so maybe I should slap my wrists. That said, I'm pretty sure that whatever is meant, parsing is about syntax, not semantics.

It's possible that the questioner's original request is more akin to some other syntactic task, such as tokenisation, or chunking. But "chunking" is one of those things that may or may not count as parsing - let's look at wikipedia:

"Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence."

So when we're finding tags in HTML, we're identifying some constituents of the text stream, constituents which I think are too complex to count as "tokens" (if "tokens" are even an admissible notion in concrete HTML). So what the questioner is trying to do could be construed as a form of partial shallow parsing.
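To make the "identify the constituents, ignore their structure" idea concrete, here is a rough sketch using Python's standard html.parser module. The class name OpenTagCollector and the function find_open_tags are hypothetical names of my own, and this is not what the questioner or answerer wrote. It only picks out opening tags as chunks of the stream and makes no attempt to build a tree from them.

```python
from html.parser import HTMLParser


class OpenTagCollector(HTMLParser):
    """Collects the names of opening tags, skipping self-closing ones."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for an ordinary opening tag such as <p> or <a href="...">.
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style self-closing tags such as <br />.
        # The default would forward to handle_starttag, so do nothing here.
        pass


def find_open_tags(html_text):
    """Return the opening tag names found in html_text, in order."""
    collector = OpenTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.open_tags


sample = '<p>Hello <a href="x.html">link</a><br /><hr class="foo" /></p>'
print(find_open_tags(sample))  # ['p', 'a']
```

Nothing here records which tag is nested inside which, so in the Wikipedia sense it is closer to chunking than to a full parse.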

Nested tags: Arbitrary nested tags? Tags nested at unlimited depth?
