Web Scraping in Scilab

lukeaarond lukeaarond at gmail.com
Mon Jun 20 22:40:58 CEST 2011


I have been trying to extract information from a website by calling curl
through the unix_g() function. The file that curl saves appears to be in XML
format. When I pass that file to xmltohtml(), I get the following error:

Building the master document:
	C:\Users\luke-rond\Documents\Molecular Evolution\Codes\URL Access\1 URL Access

Building the manual file [html] in C:\Users\luke-rond\Documents\Molecular Evolution\Codes\URL Access\1 URL Access.
An error occured during the conversion:

org.xml.sax.SAXParseException: expected comment or CDATA section (found "D")
	at com.icl.saxon.aelfred.SAXDriver.error(SAXDriver.java:857)
	at com.icl.saxon.aelfred.XmlParser.error(XmlParser.java:463)
	at com.icl.saxon.aelfred.XmlParser.error(XmlParser.java:478)
	at com.icl.saxon.aelfred.XmlParser.parseContent(XmlParser.java:1207)
	at com.icl.saxon.aelfred.XmlParser.parseElement(XmlParser.java:1037)
	at com.icl.saxon.aelfred.XmlParser.parseContent(XmlParser.java:1222)
	at com.icl.saxon.aelfred.XmlParser.parseElement(XmlParser.java:1037)
	at com.icl.saxon.aelfred.XmlParser.parseDocument(XmlParser.java:510)
	at com.icl.saxon.aelfred.XmlParser.doParse(XmlParser.java:163)
	at com.icl.saxon.aelfred.SAXDriver.parse(SAXDriver.java:320)
	at javax.xml.parsers.SAXParser.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(Unknown Source)
	at org.scilab.forge.scidoc.HTMLDocbookLinkResolver.resolvLinks(Unknown Source)
	at org.scilab.forge.scidoc.HTMLDocbookLinkResolver.<init>(Unknown Source)
	at org.scilab.forge.scidoc.HTMLDocbookTagConverter.<init>(Unknown Source)
	at org.scilab.forge.scidoc.SciDocMain.process(Unknown Source)
 !--error 10000 
xmltoformat: C:\Users\luke-rond\Documents\Molecular Evolution\Codes\URL Access\1 URL Access\scilab_en_US_help\index.html has not been generated.
at line     736 of function xmltoformat called by :  
at line      15 of function xmltohtml called by :  
ution/Codes/URL Access/1 URL Access",filename,"en_US")
at line      20 of exec file called by :    
exec('C:\Users\luke-rond\Documents\Molecular Evolution\Codes\URL Access\1 URL Access\URLaccess_1a.sce', -1)

Here is my code:

URL="http://www.ncbi.nlm.nih.gov/nuccore/NM_000419";
filename="file.xml";
rep=unix_g(SCI+"/tools/curl/curl -o "+filename+" "+URL);
out = xmltohtml("C:/Users/luke-rond/Documents/Molecular Evolution/Codes/URL Access/1 URL Access",filename,"en_US");
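
For anyone reproducing this, here is a minimal sketch of one way to look at
what curl actually saved before handing it to a converter. It assumes the
download step above has already written file.xml to the current directory.
The check for a "<!DOCTYPE html" line is only a guess at the cause: the
parser complaining about a "D" right after "<!" would be consistent with a
DOCTYPE declaration, but nothing in the output above confirms that.

```scilab
// Sketch only: peek at the first lines of the downloaded file.
// "file.xml" is the name used in the post; the DOCTYPE check is an assumption.
filename = "file.xml";

if isfile(filename) then
    head = mgetl(filename, 5);   // first 5 lines (fewer if the file is shorter)
    disp(head);

    // Case-insensitive literal search for an HTML doctype in those lines
    hits = grep(convstr(head, "l"), "<!doctype html");
    if ~isempty(hits) then
        disp("The file looks like an HTML page rather than a help XML file.");
    else
        disp("No HTML doctype found in the first lines.");
    end
else
    disp("File not found: " + filename);
end
```

If the first lines do turn out to be HTML, that would suggest the saved page
is not the kind of input xmltohtml() is meant for, since that function
converts Scilab help files.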

Thank you.
