Web Scraping in Scilab
lukeaarond
lukeaarond at gmail.com
Mon Jun 20 22:40:58 CEST 2011
I have been trying to extract information from a website using the unix_g()
function. The file it saves appears to be in XML format. When I pass that
file to xmltohtml(), I get the following error:
Building the master document:
C:\Users\luke-rond\Documents\Molecular Evolution\Codes\URL Access\1 URL
Access
Building the manual file [html] in C:\Users\luke-rond\Documents\Molecular
Evolution\Codes\URL Access\1 URL Access.
An error occured during the conversion:
org.xml.sax.SAXParseException: expected comment or CDATA section (found "D")
at com.icl.saxon.aelfred.SAXDriver.error(SAXDriver.java:857)
at com.icl.saxon.aelfred.XmlParser.error(XmlParser.java:463)
at com.icl.saxon.aelfred.XmlParser.error(XmlParser.java:478)
at com.icl.saxon.aelfred.XmlParser.parseContent(XmlParser.java:1207)
at com.icl.saxon.aelfred.XmlParser.parseElement(XmlParser.java:1037)
at com.icl.saxon.aelfred.XmlParser.parseContent(XmlParser.java:1222)
at com.icl.saxon.aelfred.XmlParser.parseElement(XmlParser.java:1037)
at com.icl.saxon.aelfred.XmlParser.parseDocument(XmlParser.java:510)
at com.icl.saxon.aelfred.XmlParser.doParse(XmlParser.java:163)
at com.icl.saxon.aelfred.SAXDriver.parse(SAXDriver.java:320)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.scilab.forge.scidoc.HTMLDocbookLinkResolver.resolvLinks(Unknown
Source)
at org.scilab.forge.scidoc.HTMLDocbookLinkResolver.<init>(Unknown Source)
at org.scilab.forge.scidoc.HTMLDocbookTagConverter.<init>(Unknown Source)
at org.scilab.forge.scidoc.SciDocMain.process(Unknown Source)
!--error 10000
xmltoformat: C:\Users\luke-rond\Documents\Molecular Evolution\Codes\URL
Access\1 URL Access\scilab_en_US_help\index.html has not been generated.
at line 736 of function xmltoformat called by :
at line 15 of function xmltohtml called by :
ution/Codes/URL Access/1 URL Access",filename,"en_US")
at line 20 of exec file called by :
exec('C:\Users\luke-rond\Documents\Molecular Evolution\Codes\URL Access\1
URL Access\URLaccess_1a.sce', -1)
Here is my code:
URL="http://www.ncbi.nlm.nih.gov/nuccore/NM_000419";
filename="file.xml";
rep=unix_g(SCI+"/tools/curl/curl -o "+filename+" "+URL);
out = xmltohtml("C:/Users/luke-rond/Documents/Molecular Evolution/Codes/URL Access/1 URL Access",filename,"en_US");
Thank you.
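
For context on the error above: xmltohtml() is Scilab's help-page builder. It
takes a directory of Scilab help files written in DocBook XML plus a title
(here the value of filename is used as that title), so it is not a general
XML or HTML reader. The page returned by the server is most likely ordinary
HTML, which would explain the parser complaint. A minimal sketch, assuming the
goal is only to read the downloaded text, is to load the file with mgetl() and
search it with grep(). The search string "<title>" is a hypothetical example
and is not taken from the original post.

```scilab
// Minimal sketch: download the page with the curl bundled with Scilab,
// then read the result as plain text instead of passing it to xmltohtml().
URL = "http://www.ncbi.nlm.nih.gov/nuccore/NM_000419";
filename = "file.xml";

// Same command as in the original post; asking for the second output
// argument returns the exit status instead of raising an error.
[rep, stat] = unix_g(SCI + "/tools/curl/curl -o " + filename + " " + URL);
if stat <> 0 then
    error("curl failed with status " + string(stat));
end

// mgetl() returns the file as a column vector of strings, one per line.
txt = mgetl(filename);

// grep() returns the indices of the lines containing the search string.
// "<title>" is a hypothetical pattern chosen only for illustration.
hits = grep(txt, "<title>");
disp(txt(hits));
```

Whether the fields of interest can be picked out this way depends on how the
page is laid out, so the pattern would need to be adapted to the actual
content of the file.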