
How to parse HTML in Java?




Answer 1, authority 100%

What’s so difficult?

Their documentation is ordinary JavaDoc, but even there you can find almost everything you need.
For example:

Typical usage of the parser is:

Parser parser = new Parser("http://whatever");
NodeList list = parser.parse(null);
// do something with your list of nodes.

Then look a little further:

NodeList parse(NodeFilter filter)

NodeFilter -> here

In my opinion, it could hardly be simpler.

On top of that:

bin/parser http://website_url [tag_name]
where tag_name is an optional tag name to be used as a filter, i.e.
A – Show only the link tags extracted from the document
IMG – Show only the image tags extracted from the document
TITLE – Extract the title from the document
NOTE: this is also the default program for the htmlparser.jar, so the
above could be:
java -jar lib/htmlparser.jar http://website_url [tag_name]


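import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
public class ImgTags {
  public static void main(String[] args) {
    try {
      Parser parser = new Parser("http://www.alliance-bags.ru/catalog.php?tov=576");
      parser.setEncoding("windows-1251"); // the page is not served as UTF-8
      NodeFilter atrb1 = new TagNameFilter("IMG"); // keep only <img> tags
      NodeList nodeList = parser.parse(atrb1);
      for (int i = 0; i < nodeList.size(); i++) {
        Node node = nodeList.elementAt(i);
        System.out.println(node.toHtml());
      }
    } catch (ParserException e) {
      e.printStackTrace();
    }
  }
}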

Answer 2, authority 27%

Answer 3, authority 18%

Answer 4, authority 18%

jsoup: Java HTML Parser:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Answer 5

The standard Java tools are enough. Why pull in an additional library just to get the path to an image?

If you only need to do it once, you can use DOM and XPath.
If you need to process many large documents, SAX is the better choice. Once you have spent the time to learn these approaches, you will never again have trouble parsing not only HTML but any XML document.
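
A minimal sketch of both approaches, using only the JDK. It assumes the page is well-formed XHTML with lowercase tag names and no namespace declaration, because the standard XML parsers reject ordinary "tag soup" HTML. The class name ImgSrcExtractor, the method names viaXPath and viaSax, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the src attribute of every <img> element.
class ImgSrcExtractor {

    // DOM + XPath: builds the whole tree in memory, then queries it.
    static List<String> viaXPath(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//img/@src", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            result.add(attrs.item(i).getNodeValue());
        }
        return result;
    }

    // SAX: streams the document and reacts to each start tag,
    // so memory use does not grow with document size.
    static List<String> viaSax(String xhtml) throws Exception {
        List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("img".equals(qName)) {
                    String src = atts.getValue("src");
                    if (src != null) {
                        result.add(src);
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Sample input is an assumption; real pages must be well-formed to parse.
        String page = "<html><body><img src=\"a.png\"/>"
                + "<p><img src=\"b.png\"/></p></body></html>";
        System.out.println(viaXPath(page));
        System.out.println(viaSax(page));
    }
}
```

If the markup is not well-formed (unclosed tags, unquoted attributes), both methods throw a SAXParseException, which is the usual reason people reach for a dedicated HTML parser instead.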

Answer 6

Take a look at this. The principle of operation is quite simple, and it supports invalid pages. There is a collection of objects mapped to tags. Very convenient.
