I have a directory of downloaded HTML files that I want to practice parsing, but I can't figure out how to make jsoup read these files for further parsing (the HTML files are scattered across subfolders).
The code below throws an error, and I don't know how to specify the path (ideally I need all the HTML files in a directory).
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class ParsingGcTest {
    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("C:\\Users\\71376160\\Desktop\\docs\\specs\\security\\standard-names.html").get();
        Elements elements = document.select("code");
        for (Element element : elements) {
            System.out.println(element);
        }
    }
}
Exception in thread "main" java.lang.IllegalArgumentException: Malformed URL: C:\Users\71376160\Desktop\docs\specs\security\standard-names.html
    at org.jsoup.helper.HttpConnection.url(HttpConnection.java:131)
    at org.jsoup.helper.HttpConnection.connect(HttpConnection.java:70)
    at org.jsoup.Jsoup.connect(Jsoup.java:73)
    at ParsingGcTest.main(ParsingGcTest.java:10)
Caused by: java.net.MalformedURLException: unknown protocol: c
    at java.base/java.net.URL.<init>(URL.java:679)
    at java.base/java.net.URL.<init>(URL.java:568)
    at java.base/java.net.URL.<init>(URL.java:515)
    at org.jsoup.helper.HttpConnection.url(HttpConnection.java:129)
    ... 3 more
Answer 1, Authority 100%
The error occurs because Jsoup.connect expects a URL pointing to a website. That is why it treated C as the protocol: syntactically, C: looks just like http: :)
To read a downloaded file, use Jsoup.parse instead:
// requires: import java.io.File;
String filename = "C:\\Users\\71376160\\Desktop\\docs\\specs\\security\\standard-names.html";
File input = new File(filename);
Document document = Jsoup.parse(input, "UTF-8", "http://example.com/");
...
The third parameter of parse is baseUri. It is used to resolve relative links in the document into absolute URLs. If you do not need that, you can pass an empty string.
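The question also asks for every HTML file under a directory, including those in subfolders. One possible approach, sketched below, is to collect the paths with the standard library's Files.walk and then hand each one to Jsoup.parse as shown above. The class name HtmlFileFinder, the method findHtmlFiles, and the choice to match both .html and .htm extensions are assumptions made for this illustration, not part of jsoup. The sketch uses only the JDK, so the jsoup call appears as a comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class for this illustration.
public class HtmlFileFinder {

    // Recursively collects every regular file under root whose name
    // ends in .html or .htm (case-insensitive), in sorted order.
    public static List<Path> findHtmlFiles(Path root) throws IOException {
        // Files.walk keeps directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".htm");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to scan is passed as the first argument;
        // the current directory is used otherwise.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : findHtmlFiles(root)) {
            // Each file can then be parsed with jsoup, for example:
            // Document document = Jsoup.parse(file.toFile(), "UTF-8", "");
            System.out.println(file);
        }
    }
}
```

Running it with the docs folder from the question as the argument should list every HTML file below that folder. Replacing the println with the commented Jsoup.parse call (plus the jsoup imports) would parse each file in turn.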