Most efficient way to get matching element? #929

@derekantrican

I have an application that scrapes an entire website, and a full run takes about 12 hours. It uses WebClient.DownloadString to fetch the HTML and HtmlParser.ParseDocument to parse it, and then I do a lot of additional processing on the parsed document. This happens for about 293,000 pages, so I'm trying to save every bit of time I can.

I've noticed that I have a lot of places where I call IHtmlDocument.GetElementsByTagName(TAG).FirstOrDefault(PREDICATE). I believe I could collapse each of these into a single IHtmlDocument.QuerySelector(QUERY_SELECTOR) call, which in theory should be faster because it returns after the first match. However, preliminary testing shows QuerySelector to be slower than the old method. For instance:

IElement element = doc.GetElementsByTagName("h2").FirstOrDefault(x => x.TextContent.Contains("Climbing Directory"));

takes about 1 ms, where

IElement element = doc.QuerySelector("h2:contains('Climbing Directory')");

takes about 23 ms.
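
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two lookups could be timed side by side. The class name SelectorTiming, the helper AverageMs, the iteration count, and the tiny inline HTML page are all hypothetical stand-ins, not my real code or data. It averages over many calls after a warm-up call, since a single timed call can include one-time costs such as JIT compilation.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class SelectorTiming
{
    // Tag lookup followed by a LINQ predicate on the text content.
    public static IElement FindByTag(IHtmlDocument doc) =>
        doc.GetElementsByTagName("h2")
           .FirstOrDefault(x => x.TextContent.Contains("Climbing Directory"));

    // Single CSS selector using the :contains pseudo-class.
    public static IElement FindBySelector(IHtmlDocument doc) =>
        doc.QuerySelector("h2:contains('Climbing Directory')");

    // Runs the lookup `iterations` times and returns the mean time per call in ms.
    public static double AverageMs(Func<IElement> lookup, int iterations)
    {
        lookup(); // warm-up call, excluded from the measurement
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            lookup();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        // Hypothetical stand-in for one scraped page.
        var html = "<html><body><h2>Other</h2><h2>Climbing Directory</h2></body></html>";
        IHtmlDocument doc = new HtmlParser().ParseDocument(html);

        Console.WriteLine($"GetElementsByTagName: {AverageMs(() => FindByTag(doc), 1000):F4} ms");
        Console.WriteLine($"QuerySelector:        {AverageMs(() => FindBySelector(doc), 1000):F4} ms");
    }
}
```

The numbers from a tiny page like this will not match a real page, so the sketch is only meant to show the measurement approach.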

Any suggestions for improving my code?
