How to parse HTML in .NET?

I need to extract all the URLs from the href attributes of the a tags in an HTML page. I tried using regular expressions:

Uri uri = new Uri("http://google.com/search?q=test");
Regex reHref = new Regex(@"<a[^>]+href=""([^""]+)""[^>]+>");
string html = new WebClient().DownloadString(uri);
foreach (Match match in reHref.Matches(html))
  Console.WriteLine(match.Groups[1].ToString());

But many potential problems arise:

  • How do I filter only specific links, for example by CSS class?
  • What happens if the attribute uses different quotes?
  • What happens if there are spaces around the equals sign?
  • What happens if part of the page is commented out?
  • What happens if a piece of JavaScript is encountered?
  • And so on.
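
Several of these problems can be reproduced in a few lines. The sketch below runs the pattern from the question against hand-written sample tags; the class name NaiveHrefDemo and the sample markup are illustrative assumptions, not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NaiveHrefDemo
{
    // The pattern from the question, unchanged.
    private static readonly Regex ReHref = new Regex(@"<a[^>]+href=""([^""]+)""[^>]+>");

    // Returns every value captured by group 1.
    public static string[] Extract(string html)
    {
        return ReHref.Matches(html).Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a class=\"x\" href=\"/ok\" id=\"l\">double quotes</a>",
            "<a class='x' href='/single' id='l'>single quotes</a>",
            "<a class=\"x\" href = \"/spaces\" id=\"l\">spaces around =</a>",
            "<!-- <a class=\"x\" href=\"/commented\" id=\"l\">commented out</a> -->",
        };
        foreach (string sample in samples)
            Console.WriteLine("{0} -> [{1}]", sample, string.Join(", ", Extract(sample)));
    }
}
```

Only the first and the last sample produce a match: the single-quoted link and the link with spaces around the equals sign are silently skipped, while the commented-out link is reported as if it were live.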

The regular expression very quickly becomes monstrous and unreadable, and more and more problem cases keep turning up.

What should I do?


Answer 1, Authority 100%

TL;DR

For HTML parsing, use AngleSharp.

If you need not only to parse HTML but also to run a full-fledged browser, execute all scripts, click buttons and see what happens, use CefSharp or Selenium. Note that this will be orders of magnitude slower.

For the curious

Regular expressions are intended for processing relatively simple texts that are described by regular languages. Since their introduction, regular expressions have become much more complex, especially in Perl, whose implementation inspired other languages and libraries, but they are still poorly suited (and are unlikely ever to be) for processing complex languages such as HTML. HTML processing is further complicated by the very intricate rules for handling invalid markup, inherited from the earliest implementations of the early Internet, when there were no standards at all and every browser vendor piled on its own unique features.

So, in general, regular expressions are not the best candidate for HTML processing. It is usually wiser to use specialized HTML parsers.

AngleSharp

License: BSD (3-Clause)

A proven player in the parser field. Unlike CsQuery, it is written from scratch by hand in C#. It also includes parsers for other languages.

The API is based on the official JavaScript HTML DOM specification. Initially it was odd in places and unusual for .NET developers (for example, accessing an invalid index in a collection returned null instead of throwing an exception), but the developer eventually gave in and fixed the worst kludges. Some things remain, for example the Microsoft BCL Portability Pack, and the namespaces are very granular, so even basic use of the library requires three using directives, and so on, but on the whole nothing critical.

Processing HTML is simple:

IHtmlDocument angle = new HtmlParser(html).Parse();
foreach (IElement element in angle.QuerySelectorAll("a"))
  Console.WriteLine(element.GetAttribute("href"));

The code does not get any more complicated if you need more complex logic:

IHtmlDocument angle = new HtmlParser(html).Parse();
foreach (IElement element in angle.QuerySelectorAll("h3.r a"))
  Console.WriteLine(element.GetAttribute("href"));

HtmlAgilityPack

License: MS-PL

The oldest and therefore the most popular parser for .NET. However, age does not mean quality: for example, for ten (!!!) years they have been unable (!!!) to fix a critical (!!!) bug in the handling of self-closing tags. CodePlex has already managed to die, and the issue "Incorrect parsing of HTML4 optional end tags" is still exactly where it was. Here is the new incarnation of the bug, already in its fourth year: "Self closing tags modified". There are a number of similar ones. Some time ago they fixed this bug. For one tag. With an additional option. And then broke the option. Not to mention the oddities in the API: for example, if nothing is found, null is returned rather than an empty collection.

Elements are selected with the XPath language rather than with CSS selectors. For simple queries the code is more or less readable:

HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes("//a");
if (nodes != null)
  foreach (HtmlNode node in nodes)
    Console.WriteLine(node.GetAttributeValue("href", null));

However, if complex queries are needed, XPath is not well suited to emulating CSS selectors:

HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes(
  "//h3[contains(concat(' ', @class, ' '), ' r ')]/a");
if (nodes != null)
  foreach (HtmlNode node in nodes)
    Console.WriteLine(node.GetAttributeValue("href", null));
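
Why the XPath needs the concat trick can be seen without any third-party library. The sketch below uses the XPath support built into System.Xml.Linq rather than HtmlAgilityPack, so it only works on well-formed markup; the class name XPathClassDemo and the sample fragment are illustrative assumptions.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathClassDemo
{
    // Well-formed sample fragment: three headings with different class lists.
    private const string Xhtml =
        "<div>" +
        "<h3 class='r'><a href='/one'>one</a></h3>" +
        "<h3 class='title r big'><a href='/two'>two</a></h3>" +
        "<h3 class='rare'><a href='/three'>three</a></h3>" +
        "</div>";

    // Evaluates the XPath expression and returns the href of every selected element.
    public static string[] Hrefs(string xpath)
    {
        return XDocument.Parse(Xhtml)
            .XPathSelectElements(xpath)
            .Select(a => (string)a.Attribute("href"))
            .ToArray();
    }

    public static void Main()
    {
        // Exact comparison misses elements that carry several classes.
        Console.WriteLine(string.Join(", ", Hrefs("//h3[@class='r']/a")));
        // A plain substring test also matches the unrelated class "rare".
        Console.WriteLine(string.Join(", ", Hrefs("//h3[contains(@class, 'r')]/a")));
        // Padding both sides with spaces matches "r" only as a whole class name.
        Console.WriteLine(string.Join(", ", Hrefs("//h3[contains(concat(' ', @class, ' '), ' r ')]/a")));
    }
}
```

The first query finds only /one, the second finds all three links, and the padded version finds /one and /two, which is what the CSS selector h3.r a would select.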

Fizzler

License: LGPL

An add-on for HtmlAgilityPack that lets you use CSS selectors.

HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
foreach (HtmlNode node in hap.DocumentNode.QuerySelectorAll("h3.r a"))
  Console.WriteLine(node.GetAttributeValue("href", null));

Since this is built on HtmlAgilityPack, all of that library's bugs come along with it.

CsQuery

License: MIT

At the moment the project is abandoned, because AngleSharp exists.

One of the modern HTML parsers for .NET. It is based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko engine (Firefox). This ensures that the parser handles markup the same way modern browsers do.

The API draws inspiration from jQuery, and elements are selected with the CSS selector language. The method names are copied almost one-to-one, so programmers familiar with jQuery will find it easy to learn.

It has high performance: on complex queries it is an order of magnitude faster than HtmlAgilityPack + Fizzler.

CQ cq = CQ.Create(html);
foreach (IDomObject obj in cq.Find("a"))
  Console.WriteLine(obj.GetAttribute("href"));

If a more complex query is required, the code gets practically no more complicated:

CQ cq = CQ.Create(html);
foreach (IDomObject obj in cq.Find("h3.r a"))
  Console.WriteLine(obj.GetAttribute("href"));

If you are unfamiliar with jQuery concepts, nontrivial usage may seem strange and unusual.

Regex

The scary and terrible regular expressions. Using them is undesirable, but sometimes the need arises, because parsers that build a DOM are noticeably more resource-hungry than Regex: they consume more CPU time and more memory.

If it has come to regular expressions, you need to understand that you cannot build a universal and absolutely reliable solution on them. However, if you want to parse one specific site, this problem may not be so critical.

For the love of all that is holy, do not turn regular expressions into an unreadable mess. You do not write C# code on a single line with one-letter variable names, and there is no need to ruin regular expressions either. The .NET regular expression engine is powerful enough to let you write quality code.

For example, here is a slightly modified version of the code from the question for extracting links:

// (?inx): ignore case, capture named groups only, ignore whitespace inside the pattern.
Regex reHref = new Regex(@"(?inx)
  <a \s [^>]*
    href \s* = \s*
      (?<q> ['""] )
        (?<url> [^'""]+ )
      \k<q>
  [^>]* >");
foreach (Match match in reHref.Matches(html))
  Console.WriteLine(match.Groups["url"].ToString());
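
To see what a multi-line pattern in this style does and does not handle, it can be run against a few hand-written tags. This is only a sketch: the class name VerboseHrefDemo and the sample markup are assumptions, and the url group here excludes both kinds of quotes, which may differ slightly from the pattern above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class VerboseHrefDemo
{
    // (?inx): i = ignore case, n = only named groups capture,
    // x = ignore whitespace in the pattern so it can span several lines.
    private static readonly Regex ReHref = new Regex(@"(?inx)
        <a \s [^>]*
            href \s* = \s*
                (?<q> ['""] )
                    (?<url> [^'""]+ )
                \k<q>
        [^>]* >");

    // Returns the value of the named group "url" for every match.
    public static string[] Extract(string html)
    {
        return ReHref.Matches(html).Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .ToArray();
    }

    public static void Main()
    {
        string html =
            "<a href=\"/double\">1</a>" +
            "<A HREF = '/single' class='x'>2</A>" +
            "<!-- <a href=\"/commented\">3</a> -->";
        foreach (string url in Extract(html))
            Console.WriteLine(url);
    }
}
```

Both quote styles, upper-case tags and spaces around the equals sign are now handled, but the commented-out link is still returned, which illustrates why a regular expression cannot be a fully reliable solution here.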

Answer 2, Authority 18%

Use CefSharp to solve such problems.

Why use this approach?

  • The development process is greatly simplified: instead of writing XPath, conditions and/or loops in C#, you simply develop everything you need in the browser console (preferably in a Chromium-based browser). Then, once a small class skeleton has been written (I'll show it below), you just paste in the JavaScript code you need.
  • Reliability. You are not trying to parse HTML and reinvent the wheel, which is almost always a very bad idea. The project is based on Chromium, so you do not have to trust some new or unfamiliar product. It is actively maintained to stay in sync with new versions.

For simplicity and demonstration, the JavaScript calls use jQuery, on the assumption that the target site has it too. It could equally be plain JavaScript or another library, provided that library is used on the site.

If you skim ahead, note that apart from a small code layer and initialization, the solution takes one or two lines:

string[] urls = await wrapper.GetResultAfterPageLoad("https://yandex.ru",
  async () => await wrapper.EvaluateJavascript<string[]>(
    "$('a[href]').map((index, element) => $(element).prop('href')).toArray()"));

What is it?

This is a managed wrapper over CEF (Chromium Embedded Framework). That is, you get the power of Chromium under programmatic control.

Why CEF / CefSharp?

  • You do not have to bother with parsing pages (a complex and thankless task that I strongly advise against).
  • You can work with the fully loaded page (after its scripts have executed).
  • You can execute arbitrary JavaScript with the latest features.
  • You can make AJAX calls from JavaScript and then, on success, raise events in C# code with the AJAX result. Details and an example are covered here.

CefSharp varieties

  • CefSharp.WinForms
  • CefSharp.Wpf
  • CefSharp.OffScreen

The first two are used if you need to give users a browser control. They are conceptually similar to WebBrowser in Windows Forms, which is a wrapper for controlling IE rather than Chromium, as in our case.

So we will use the CefSharp.OffScreen (off-screen) variety.

Writing the code

Suppose we have a console application, but that is up to you.

Install CefSharp.OffScreen version 57:
Install-Package CefSharp.OffScreen -Version 57.0.0

The point is that on the C# side all JavaScript arrays are mapped to List<object>, and the result of a script is wrapped in an object that in turn holds a List<object>, string, bool or int, depending on the result. To make the results strongly typed, create a small ConvertHelper:

using System;
using System.Collections.Generic;
using System.Linq;

public static class ConvertHelper
{
  // Casts every item of an untyped list to T and returns an array.
  public static T[] GetArrayFromObjectList<T>(object obj)
  {
    return ((IEnumerable<object>)obj)
      .Cast<T>()
      .ToArray();
  }
  // Casts every item of an untyped list to T and returns a List<T>.
  public static List<T> GetListFromObjectList<T>(object obj)
  {
    return ((IEnumerable<object>)obj)
      .Cast<T>()
      .ToList();
  }
  // Converts an untyped script result to T, handling arrays and List<T> specially.
  public static T ToTypedVariable<T>(object obj)
  {
    if (obj == null)
    {
      dynamic dynamicResult = null;
      return dynamicResult;
    }
    Type type = typeof(T);
    if (type.IsArray)
    {
      dynamic dynamicResult = typeof(ConvertHelper).GetMethod(nameof(GetArrayFromObjectList))
        .MakeGenericMethod(type.GetElementType())
        .Invoke(null, new[] { obj });
      return dynamicResult;
    }
    if (type.IsGenericType && type.GetGenericTypeDefinition() == typeof(List<>))
    {
      dynamic dynamicResult = typeof(ConvertHelper).GetMethod(nameof(GetListFromObjectList))
        .MakeGenericMethod(type.GetGenericArguments().Single())
        .Invoke(null, new[] { obj });
      return dynamicResult;
    }
    return (T)obj;
  }
}
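
The core of the conversion can be tried without a browser. In the sketch below, a hand-built List<object> stands in for a script result, and ToTypedArray repeats the body of GetArrayFromObjectList so that the example is self-contained; the names TypedResultDemo and ToTypedArray are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TypedResultDemo
{
    // Same idea as GetArrayFromObjectList: cast each untyped item to T.
    public static T[] ToTypedArray<T>(object obj)
    {
        return ((IEnumerable<object>)obj).Cast<T>().ToArray();
    }

    public static void Main()
    {
        // Stand-in for the untyped result of a script that returns ['/a', '/b'].
        object scriptResult = new List<object> { "/a", "/b" };
        string[] urls = ToTypedArray<string>(scriptResult);
        Console.WriteLine(string.Join(", ", urls));
    }
}
```

If an item is not actually of type T, Cast<T>() throws an InvalidCastException when the sequence is enumerated, so a mismatch between the script and the requested type shows up immediately.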

To handle JavaScript errors, create a class JavaScriptException.

public class JavaScriptException : Exception
{
  public JavaScriptException(string message) : base(message) { }
}

You may have your own way of handling errors.

Create a class CefSharpWrapper:

using System;
using System.Threading;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

public sealed class CefSharpWrapper
{
  private ChromiumWebBrowser _browser;
  public void InitializeBrowser()
  {
    Cef.EnableHighDPISupport();
    // The dependency check (which makes sure all relevant resources are in the output directory) is switched off here.
    Cef.Initialize(new CefSettings(), performDependencyCheck: false, browserProcessHandler: null);
    _browser = new ChromiumWebBrowser();
    // Wait until the browser is initialized.
    AutoResetEvent waitHandle = new AutoResetEvent(false);
    EventHandler onBrowserInitialized = null;
    onBrowserInitialized = (sender, e) =>
    {
      _browser.BrowserInitialized -= onBrowserInitialized;
      waitHandle.Set();
    };
    _browser.BrowserInitialized += onBrowserInitialized;
    waitHandle.WaitOne();
  }
  public void ShutdownBrowser()
  {
    // Clean up Chromium objects. You need to call this in your application,
    // otherwise you will get a crash when closing.
    Cef.Shutdown();
  }
  public Task<T> GetResultAfterPageLoad<T>(string pageUrl, Func<Task<T>> onLoadCallback)
  {
    TaskCompletionSource<T> tcs = new TaskCompletionSource<T>();
    EventHandler<LoadingStateChangedEventArgs> onPageLoaded = null;
    T t = default(T);
    // An event that is fired when the loading state changes.
    // It is raised on another thread.
    onPageLoaded = async (sender, e) =>
    {
      // Check whether loading is complete: this event is raised twice, once when loading starts
      // and a second time when it has finished
      // (for the main frame rather than an iframe within it).
      if (!e.IsLoading)
      {
        // Remove the load event handler, because we only want one snapshot of the initial page.
        _browser.LoadingStateChanged -= onPageLoaded;
        try
        {
          t = await onLoadCallback();
          tcs.SetResult(t);
        }
        catch (Exception ex)
        {
          // Pass callback failures to the awaiting caller instead of losing them in an async void handler.
          tcs.TrySetException(ex);
        }
      }
    };
    _browser.LoadingStateChanged += onPageLoaded;
    _browser.Load(pageUrl);
    return tcs.Task;
  }
  public async Task EvaluateJavascript(string script)
  {
    JavascriptResponse javascriptResponse = await _browser.GetMainFrame().EvaluateScriptAsync(script);
    if (!javascriptResponse.Success)
    {
      throw new JavaScriptException(javascriptResponse.Message);
    }
  }
  public async Task<T> EvaluateJavascript<T>(string script)
  {
    JavascriptResponse javascriptResponse = await _browser.GetMainFrame().EvaluateScriptAsync(script);
    if (javascriptResponse.Success)
    {
      object scriptResult = javascriptResponse.Result;
      return ConvertHelper.ToTypedVariable<T>(scriptResult);
    }
    throw new JavaScriptException(javascriptResponse.Message);
  }
}
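
GetResultAfterPageLoad turns a one-off event into an awaitable Task by means of TaskCompletionSource. The same pattern is sketched below in isolation, with no CefSharp dependency; FakeLoader is a hypothetical stand-in for the browser and is not part of any library.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a browser: raises Loaded from another thread.
public sealed class FakeLoader
{
    public event EventHandler<string> Loaded;

    public void Load(string url)
    {
        Task.Run(() => Loaded?.Invoke(this, "content of " + url));
    }
}

public static class EventToTaskDemo
{
    // Wraps the Loaded event in a Task that completes on the first notification.
    public static Task<string> LoadAsync(FakeLoader loader, string url)
    {
        var tcs = new TaskCompletionSource<string>();
        EventHandler<string> handler = null;
        handler = (sender, content) =>
        {
            loader.Loaded -= handler;   // one-shot: unsubscribe on first call
            tcs.TrySetResult(content);  // completes the task for the awaiting caller
        };
        loader.Loaded += handler;       // subscribe before starting the load
        loader.Load(url);
        return tcs.Task;
    }

    public static void Main()
    {
        string result = LoadAsync(new FakeLoader(), "https://example.com")
            .GetAwaiter().GetResult();
        Console.WriteLine(result);
    }
}
```

Subscribing before calling Load matters: if the order were reversed, the event could fire before the handler is attached and the task would never complete.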

Next, we call our CefSharpWrapper class from the Main method.

public class Program
{
  private static void Main()
  {
    MainAsync().Wait();
  }
  private static async Task MainAsync()
  {
    CefSharpWrapper wrapper = new CefSharpWrapper();
    wrapper.InitializeBrowser();
    string[] urls = await wrapper.GetResultAfterPageLoad("https://yandex.ru", async () =>
      await wrapper.EvaluateJavascript<string[]>("$('a[href]').map((index, element) => $(element).prop('href')).toArray()"));
    wrapper.ShutdownBrowser();
  }
}

Also: this library has a quirk where an empty JavaScript array is returned as null. So it may make sense to add the appropriate code to ConvertHelper (depending on your code and needs), or to write something like this in the calling code:

if (urls == null) urls = new string[0];

Also, set x64 or x86 as the platform. The Any CPU platform is supported, but requires additional code.
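
If the same check is needed in several places, it can be moved into a tiny helper. This is only a sketch: OrEmpty is a hypothetical name, and the null variable below stands in for a script result that was an empty JavaScript array.

```csharp
using System;

public static class NullArrayDemo
{
    // Hypothetical helper: returns the array itself, or an empty array when it is null.
    public static T[] OrEmpty<T>(T[] items)
    {
        return items ?? new T[0];
    }

    public static void Main()
    {
        string[] urls = null; // stands in for an empty JavaScript array mapped to null
        foreach (string url in OrEmpty(urls))
            Console.WriteLine(url); // the loop body never runs, and nothing throws
        Console.WriteLine(OrEmpty(urls).Length);
    }
}
```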


Answer 3, Authority 6%

If performance requirements are not very high, you can use the Internet Explorer COM object (add a reference to the Microsoft HTML Object Library):

public static List<string> ParseLinks(string html)
{
  List<string> res = new List<string>();
  mshtml.HTMLDocument doc = null;
  mshtml.IHTMLDocument2 d2 = null;
  mshtml.IHTMLDocument3 d = null;
  try
  {
    doc = new mshtml.HTMLDocument(); // IE initialization
    d2 = (mshtml.IHTMLDocument2)doc;
    d2.write(html);
    d = (mshtml.IHTMLDocument3)doc;
    var coll = d.getElementsByTagName("a"); // get the collection of elements by tag name
    object val;
    foreach (mshtml.IHTMLElement el in coll) // extract the href attribute from every element
    {
      val = el.getAttribute("href");
      if (val == null) continue;
      res.Add(val.ToString());
    }
  }
  finally
  {
    // Release resources
    if (doc != null) Marshal.ReleaseComObject(doc);
    if (d2 != null) Marshal.ReleaseComObject(d2);
    if (d != null) Marshal.ReleaseComObject(d);
  }
  return res;
}

Answer 4, Authority 2%

My two cents: if you do not feel like messing with MSHTML COM objects, you can create a WebBrowser object from Windows.Forms. If you do not need all the scripts to run, then as I understand it the page does not have to be loaded by the browser itself. You can fetch it with something simpler, such as WebClient.DownloadString(), and then load the resulting page into the WebBrowser for parsing:

var itemPageText = _webClient.DownloadString(url);
using (var pageHtml = new WebBrowser())
{
  pageHtml.DocumentText = itemPageText;
  var elem = pageHtml.Document.GetElementById("Imainimghldr");
}

And so on. The main point is that methods like GetElementById() are somewhat friendlier wrappers than those of mshtml.


Answer 5, Authority 2%

F#


Find all links to books on F# on the page:

let fsys = "https://www.google.com/search?tbm=bks&q=f%23"
let doc2 = HtmlDocument.Load(fsys)
let books =
  doc2.CssSelect("div.g h3.r a")
  |> List.map (fun a -> a.InnerText().Trim(), a.AttributeValue("href"))
  |> List.filter (fun (title, href) -> title.Contains("F#"))

F# Data
F# Data HTML Parser
F# Data HTML CSS Selectors


Answer 6

Everything works wonderfully for me with XElement.
Try it 🙂

var htmlDom = XElement.Parse("[HTML code]");

As pointed out in the comments, this will work only if the page you need is a valid XHTML document.
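
The difference is easy to reproduce. In the sketch below, the same link extraction succeeds on a well-formed XHTML fragment and fails with an XmlException on ordinary HTML that has an unquoted attribute and an unclosed br tag; the class name XhtmlLinksDemo, the helper TryGetLinks and both sample strings are illustrative assumptions.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlLinksDemo
{
    // Returns the href values, or null when the markup is not well-formed XML.
    public static string[] TryGetLinks(string markup)
    {
        try
        {
            return XElement.Parse(markup)
                .Descendants()
                .Where(e => e.Name.LocalName == "a") // LocalName ignores any XHTML namespace
                .Select(e => (string)e.Attribute("href"))
                .Where(href => href != null)
                .ToArray();
        }
        catch (XmlException)
        {
            return null;
        }
    }

    public static void Main()
    {
        string xhtml = "<html><body><p><a href=\"/one\">one</a><br/></p></body></html>";
        string html = "<html><body><p><a href=/two>two</a><br></body></html>";

        string[] fromXhtml = TryGetLinks(xhtml);
        string[] fromHtml = TryGetLinks(html);
        Console.WriteLine(fromXhtml == null ? "not well-formed XML" : string.Join(", ", fromXhtml));
        Console.WriteLine(fromHtml == null ? "not well-formed XML" : string.Join(", ", fromHtml));
    }
}
```

Real-world pages are far more likely to resemble the second sample, so this approach is best kept for documents that are known to be XHTML.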
