
How to translate the text of HTML-Pages?

The application parses an HTML page.
The page contains text, pictures, tables and other content.
The text on the page is in “Language_1”.

The application needs to:
– Follow a link to a site (the sites will differ);
– Save the page to the database (field_1) (only the first page of the site is saved);
– Translate the page (into “Language_2”);
– Save the translated page to the database (field_2).

How can the page structure be preserved?
How can the application return the HTML page in “Language_2”?

That is, something similar to Google Chrome's translation feature: I open any site and translate the page into Russian.


Answer 1, Authority 100%

The simplest thing that comes to mind is to traverse all the elements recursively, descend to the level of text nodes, translate each text node separately, and replace its contents with the translation. Suppose we have the following code for translating a string through the Yandex API (borrowed from another source):

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Threading.Tasks;
using System.Net.Http;
using Newtonsoft.Json;
// Reference: System.Net.Http, Newtonsoft.Json
namespace TranslateTest
{
  class Translator
  {
    // Translates the string s; lang is a language pair such as "en-ru"
    public static async Task<string> Translate(string s, string lang)
    {
      if (s.Length > 0)
      {
        string content = "text=" + WebUtility.UrlEncode(s);
        var cnt = new StringContent(content, Encoding.UTF8, "application/x-www-form-urlencoded");
        HttpClient client = new HttpClient();
        // "apikey" is a placeholder for your own API key
        var response = await client.PostAsync(
          "https://translate.yandex.net/api/v1.5/tr.json/translate?lang=" + lang
          + "&key=apikey",
          cnt
          );
        var stream = await response.Content.ReadAsStreamAsync();
        using (var sr = new StreamReader(stream))
        {
          string line;
          if ((line = sr.ReadLine()) != null)
          {
            // on failure the response body carries the error description
            if ((int)response.StatusCode != 200) throw new Exception(line);
            Translation translation;
            translation = JsonConvert.DeserializeObject<Translation>(line);
            s = "";
            foreach (string str in translation.Text)
            {
              s += str;
            }
          }
        }
        return s;
      }
      else
        return "";
    }
  }
  // Shape of the JSON response returned by the API
  class Translation
  {
    public string Code { get; set; }
    public string Lang { get; set; }
    public string[] Text { get; set; }
  }
}

Then the code for translating the HTML document will look like this (I use MSHTML):

using System;
using System.Text;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
using System.Windows.Forms;
// Reference: COM -> Microsoft HTML Object Library
namespace TranslateTest
{
  public partial class Form1 : Form
  {
    public Form1()
    {
      InitializeComponent();
    }
    private async void button1_Click(object sender, EventArgs e)
    {
      string html;
      html = System.IO.File.ReadAllText("C:\\Test\\test.html");
      textBox2.Text = await TranslateDocument(html);
    }
    string lang = "en-ru";
    public async Task<string> TranslateDocument(string html)
    {
      mshtml.HTMLDocument doc = null;
      mshtml.IHTMLDocument2 d2 = null;
      mshtml.IHTMLDocument3 d = null;
      mshtml.IHTMLElementCollection body = null;
      mshtml.IHTMLElement bodyelem = null;
      mshtml.IHTMLDOMNode bodynode = null;
      try
      {
        // load the document into the parser
        doc = new mshtml.HTMLDocument();
        d2 = (mshtml.IHTMLDocument2)doc;
        d2.write(html);
        // find body
        d = (mshtml.IHTMLDocument3)doc;
        body = d.getElementsByTagName("body");
        if (body.length == 0) throw new Exception("Fatal error: HTML has no BODY tag!");
        bodyelem = body.item(0);
        bodynode = bodyelem as mshtml.IHTMLDOMNode;
        // recursively translate all the nodes of the body element
        foreach (mshtml.IHTMLDOMNode node in bodynode.childNodes)
        {
          await TranslateNode(node);
        }
        return bodyelem.innerHTML;
      }
      finally
      {
        // free resources
        if (doc != null) Marshal.ReleaseComObject(doc);
        if (d2 != null) Marshal.ReleaseComObject(d2);
        if (d != null) Marshal.ReleaseComObject(d);
        if (body != null) Marshal.ReleaseComObject(body);
        if (bodyelem != null) Marshal.ReleaseComObject(bodyelem);
        if (bodynode != null) Marshal.ReleaseComObject(bodynode);
      }
    }
    public async Task TranslateNode(mshtml.IHTMLDOMNode node)
    {
      string val = "";
      if (node.nodeType == 3) // text node
      {
        val = node.nodeValue;
        if (val.Trim().Length == 0) return; // empty - nothing to translate
        var res = await Translator.Translate(val, lang); // translate the contents of the node
        node.nodeValue = res; // change the content to translation
      }
      else // element node
      {
        // don't translate scripts and CSS
        if (node.nodeName.ToLower() == "script" || node.nodeName.ToLower() == "style") return;
        // traversing child nodes
        foreach (mshtml.IHTMLDOMNode x in node.childNodes)
        {
          await TranslateNode(x);
        }
      }
    }
  }
}
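
The traversal above translates each text node the moment it is visited. A variation is to collect the text nodes first and replace their contents in a second pass. The following is only a minimal sketch of that idea, not the answer's method: it swaps MSHTML for System.Xml.Linq, so it assumes the input is well-formed XHTML, and the names TextNodeCollector, Collect and Apply are hypothetical. The translate delegate stands in for a real translation call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class TextNodeCollector
{
    // Elements whose text must not be translated (same rule as in TranslateNode).
    static readonly HashSet<string> Skipped =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    // Returns every non-blank text node outside <script> and <style>, in document order.
    public static List<XText> Collect(XElement root)
    {
        return root.DescendantNodes()
            .OfType<XText>()
            .Where(t => !string.IsNullOrWhiteSpace(t.Value))
            .Where(t => !t.Ancestors().Any(a => Skipped.Contains(a.Name.LocalName)))
            .ToList();
    }

    // Replaces the content of each collected node with the result of translate.
    // The markup around the text nodes is left untouched.
    public static void Apply(IList<XText> nodes, Func<string, string> translate)
    {
        foreach (var node in nodes)
        {
            node.Value = translate(node.Value);
        }
    }
}
```

Because only the values of text nodes change, the element structure of the page stays as it was, which is the property the question asks for.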

Example of source text and translation:

<p><b>The Antarctic</b> is a polar region around the Earth's South Pole, opposite the Arctic region around the North Pole. The Antarctic comprises the continent of Antarctica and the island territories located on the Antarctic Plate. The Antarctic region include the ice shelves, waters, and island territories in the <i>Southern Ocean</i> situated south of the Antarctic Convergence, a zone approximately 32 to 48 km (20 to 30 mi) wide varying in latitude seasonally. The region covers some 20 percent of the Southern Hemisphere, of which 5.5 percent (14 million km2) is the surface area of the Antarctic continent itself. All of the land and ice shelves south of 60°S latitude are administered under the Antarctic Treaty System. Biogeographically, the Antarctic ecozone is one of eight ecozones of the Earth's land surface.</p>
<h2>Geography</h2>
<div>
<p>
The maritime part of the region constitutes the area of application of the international Convention for the Conservation of Antarctic Marine Living Resources (CCAMLR), where for technical reasons the Convention uses an approximation of the Convergence line by means of a line joining specified points along parallels of latitude and meridians of longitude. The implementation of the Convention is managed through an international Commission headquartered in Hobart, Australia, by an efficient system of annual fishing quotas, licenses and international inspectors on the fishing vessels, as well as satellite surveillance.</p>
<p>Most of the Antarctic region is situated south of 60°S latitude parallel, and is governed in accordance with the Antarctic Treaty System. The Treaty area covers the continent itself and its immediately adjacent islands, as well as the archipelagos of the South Orkney Islands, South Shetland Islands, Peter I Island, Scott Island and Balleny Islands.</p>
<p>The islands situated between 60°S latitude parallel to the south and the Antarctic Convergence to the north, and their respective 200-nautical-mile (370 km) exclusive economic zones fall under the national jurisdiction of the countries that possess them: South Georgia and the South Sandwich Islands (United Kingdom; also an EU Overseas territory), Bouvet Island (Norway), and Heard and McDonald Islands (Australia).</p>
<p><b>Kerguelen Islands</b> (France; also an EU Overseas territory) are situated in the Antarctic Convergence area, while the Falkland Islands, Isla de los Estados, Hornos Island with Cape Horn, Diego Ramírez Islands, Campbell Island, Macquarie Island, Amsterdam and Saint Paul Islands, Crozet Islands, Prince Edward Islands, and Gough Island and Tristan da Cunha group remain north of the Convergence and thus outside the Antarctic region.</p>
</div>

<p><b>Antarctica</b> These are the polar regions around the southern Pole of the Earth, opposite to the Arctic, around the North Pole. The card consists of Antarctic mainland and island territories located on the Antarctic Plate. The Antarctic region includes shelf glaciers, water and island territories in <i>South Ocean</i> Located south of Antarctic convergence, the zone is approximately from 32 to 48 km (from 20 to 30 mi) wide varied over latitude seasonally. The region covers about 20 percent of the southern hemisphere, of which 5.5 percent (14 million km2) of the area of the Antarctic continent. All lands and shelf glaciers south of 60° southern latitudes are used within the Antarctic Treaty System. Biogeographically, Ecoozon in Antarctica is one of the eight eloons of land plots of the earth's surface.</p>
<h2>geography</h2>
<div>
<p>In the marine part of the region is the area of application of the International Convention on the Conservation of Marine Living Resources of Antarctic (CCAMLR), where for technical reasons in the convention uses an approximation of the convergence of the line using a line connecting the specified points along the parallels of the latitude and meridians of longitude. The implementation of the Convention is carried out through the International Commission with Headquarters in Hobart, Australia, with the help of an effective system of annual quotas, licenses and international inspectors on fishing vessels, as well as satellite observation.</p>
<p>Most Antarctic region is located south of 60° in southern latitude in parallel, and is managed in accordance with the international legal regime of the Antarctic Treaty System. The contract of the Treaty applies to the continent and the adjacent islands, as well as the South Orkney Islands archipelagoes, the South Shetland Islands, Peter I Islands, Scott Islands and Balleny Islands.</p>
<p>Islands located between 60° W parallel to South and Antarctic convergence to the north, and their 200-mile (370 km) of the exceptional economic zone are under the national jurisdiction of the countries that they have: South Georgia and Sandwich Islands (United Kingdom; also the EU overseas Territory), Bouvea Island (Norway) And the islands of Herd and McDonald (Australia).</p>
<p><b>kergelen</b> (France; also the EU overseas territory) are located in the Antarctic convergence of the zone, and the Falkland Islands, Isle de Los Estados, Ornos on the island from Cape Gorn, Diego Ramires Island, Campbell Island, McKorory Island, Amsterdam and Saint-Paul Creeza, Prince "Shadard Islands, and the island of Gof and Tristan-da-kunya groups remain northern convergence and, therefore, outside Antarctica.</p></div>

The drawback is that this method generates too many small requests to the API, so translating a large document takes a very long time. Translation quality also suffers, because sentences are broken into fragments that are translated separately. To improve the method, you need to find a way to aggregate requests. Most likely, an element that contains only inline child elements (b, span, etc.) should be translated as a single piece, but doing that while preserving the formatting is a fairly difficult problem.
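
One possible way to aggregate requests is to group text fragments into batches and send each batch in a single call. The sketch below covers only the grouping and the form body. It assumes the API accepts several text parameters in one request and returns the translations in the same order, which the string[] Text field of the Translation class suggests but which should be checked against the API documentation. The names RequestBatcher, Batch, BuildBody and maxChars are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

static class RequestBatcher
{
    // Groups fragments into batches whose combined length does not exceed maxChars.
    // A single fragment longer than maxChars still gets a batch of its own.
    public static List<List<string>> Batch(IEnumerable<string> texts, int maxChars)
    {
        var batches = new List<List<string>>();
        var current = new List<string>();
        int size = 0;
        foreach (var text in texts)
        {
            if (current.Count > 0 && size + text.Length > maxChars)
            {
                batches.Add(current);
                current = new List<string>();
                size = 0;
            }
            current.Add(text);
            size += text.Length;
        }
        if (current.Count > 0) batches.Add(current);
        return batches;
    }

    // Builds a form-urlencoded body with one text= parameter per fragment
    // (assumption: the API accepts a repeated text parameter).
    public static string BuildBody(IEnumerable<string> batch)
    {
        return string.Join("&", batch.Select(t => "text=" + WebUtility.UrlEncode(t)));
    }
}
```

Each body would then replace the single "text=" string passed to StringContent in Translator.Translate, and the entries of the returned Text array would be written back to the corresponding nodes by position. Batching cuts the number of requests, but it does not by itself fix the quality problem of sentences split across inline elements.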
