Amid all the buzz about YQL’s new Insert/Update/Delete support, another new feature, JSONP-X, was released at the same time.
JSONP-X is essentially an escaped XML string delivered as a JSON result and wrapped in a JavaScript callback function. To use it, request format=xml together with a callback parameter, as in this example:
http://query.yahooapis.com/v1/public/yql?q=<my yql query>&format=xml&callback=mycallback
The response has this basic structure:
mycallback({
"query": {yql meta data here},
"results": ["<escaped xml/html here>"]
});
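To make that shape concrete, here is a minimal sketch of unwrapping a JSONP-X response. The payload is hand-written sample data and `extractMarkup` is a hypothetical helper name, not part of YQL. It assumes only that `results` is an array of markup strings, as in the structure above.

```javascript
// Hypothetical helper: pull the markup out of a JSONP-X response object.
// Assumes `results` is an array of strings; the JSON escaping (\" and \/)
// is already undone by the time the callback receives the object.
function extractMarkup(response) {
  if (!response || !Array.isArray(response.results)) {
    return "";
  }
  return response.results.join("");
}

// Hand-written sample payload standing in for what YQL passes to the callback.
var sampleResponse = {
  query: { count: "1" },
  results: ["<div id=\"demo\"><p>Hello<\/p><\/div>"]
};

var markup = extractMarkup(sampleResponse);
console.log(markup);
```

Joining the array rather than reading only the first element keeps the helper usable if a query happens to return more than one fragment.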
The power of YQL’s JSONP-X really comes into play when scraping a web page. It lets you extract HTML, keeping its structure intact in the JSON results, and then use JavaScript to innerHTML the results into your own page. This makes badging much easier. (YQL and Pipes respect robots.txt, so HTML scraping only works on sites that are happy to have their content indexed by search engines and cached elsewhere.)
I’m a cycling fan and the Tour de France is the World Series of cycling. So I wanted to create a badge that leveraged the new JSONP-X feature to extract the nice results module on http://www.letour.fr/us/homepage_courseTDF.html
For the impatient, here is the final example page: http://paul.donnelly.org/demos/YQL_JSONP-X.html

First of all, scraping www.letour.fr to get this module wasn’t exactly straightforward. On closer inspection, it turns out the module is created dynamically based on the current stage.
select * from html where url="http://www.letour.fr/us/homepage_courseTDF.html" and xpath = '//div[@id="maillotDyn"]'
The above query yielded an empty div.
Upon further poking of the page I found that this function:
maillotFunc = function(){
makeRequest(prefixPath + 'blocPorteursMaillots.html?'+timestamp, 'maillotDyn', 'HTML', true, false);
}
created the module.
Now that I had found the function, I knew which HTML page I needed to scrape.
But what is “prefixPath”? Apparently it is generated dynamically on the front page and defined in the JavaScript. I could create a YQL Execute statement that regexes that script node, or… wait.
I also noticed that href paths to various links had the dynamically created “prefixPath” as well, for example:
<li class="level2"><a href="/2009/TDF/LIVE/us/500/classement/index.html">Standings</a></li>
Ah, yes: I can use that path to construct the final endpoint, “http://www.letour.fr/2009/TDF/LIVE/us/500/blocPorteursMaillots.html”.
OK, so let’s create a YQL query that fetches one of those links:
select * from html where url="http://www.letour.fr/us/homepage_courseTDF.html" and xpath = "//li[@class='level2'][2]/a"
Great, now I have a nice result that gives me my prefix. So how do I regex that path out and construct the complete URL that I need? I guess I’ll have to create a YQL Execute statement that performs the regex. But wait. I’m feeling kind of lazy this morning and don’t want to spend a lot of time on this.
I can use Yahoo Pipes to do my regex! Get the cleaned-up results from Yahoo Pipes as JSON and then do my final JSONP-X call. Check out the Pipe here.

In my Pipe, I use the YQL module to get the prefixPath from the A tag. I then use the Regex module to construct the final URL I want YQL to scrape. (In [item.href] Replace [^(.*?)/classement.*] with [http://www.letour.fr$1/blocPorteursMaillots.html])
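For readers who want to see what that Regex rule does outside of Pipes, here is a rough JavaScript equivalent. It is only a sketch: `buildJerseyUrl` is a hypothetical name, and the sample href is the Standings link shown earlier rather than live data.

```javascript
// JavaScript rendering (an assumption, not the Pipe itself) of the rule:
// replace ^(.*?)/classement.* with http://www.letour.fr$1/blocPorteursMaillots.html
function buildJerseyUrl(href) {
  return href.replace(
    /^(.*?)\/classement.*/,
    "http://www.letour.fr$1/blocPorteursMaillots.html"
  );
}

// Sample href copied from the Standings link on the home page.
var sampleHref = "/2009/TDF/LIVE/us/500/classement/index.html";
var jerseyUrl = buildJerseyUrl(sampleHref);
console.log(jerseyUrl);
// http://www.letour.fr/2009/TDF/LIVE/us/500/blocPorteursMaillots.html
```

The lazy group `(.*?)` captures everything before `/classement`, which is exactly the dynamic prefix, so the rule keeps working when the stage number in the path changes.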
Sweet. I can then use http://pipes.yahoo.com/pipes/pipe.run?_id=KBC0Ye1r3hGpf2CaqevxTA&_render=json to get my final URL via a sub-select.
This is the final YQL statement I used: select * from html where url in (select href from json where url="http://pipes.yahoo.com/pipes/pipe.run?_id=KBC0Ye1r3hGpf2CaqevxTA&_render=json" and itemPath = "json.value.items") and xpath = "/html/body/div"
and this is the JSONP-X call: http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%20in%20%28select%20href%20from%20json%20where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3DKBC0Ye1r3hGpf2CaqevxTA%26_render%3Djson%22%20and%20itemPath%20%3D%20%22json.value.items%22%29%20and%20xpath%20%3D%20%22%2Fhtml%2Fbody%2Fdiv%22&format=xml&callback=phoningHome
and the JSONP-X structure:
phoningHome({
"query": {
"count": "1",
"created": "2009-07-10T07:09:52Z",
"lang": "en-US",
"updated": "2009-07-10T07:09:52Z",
"uri": "http://query.yahooapis.com/v1/yql?q=select+*+from+html+where+url+in+%28select+href+from+json+where+url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3DKBC0Ye1r3hGpf2CaqevxTA%26_render%3Djson%22+and+itemPath+%3D+%22json.value.items%22%29+and+xpath+%3D+%22%2Fhtml%2Fbody%2Fdiv%22",
"diagnostics": {
"publiclyCallable": "true",
"url": [{
"execution-time": "14",
"content": "http://pipes.yahoo.com/pipes/pipe.run?_id=KBC0Ye1r3hGpf2CaqevxTA&_render=json"
},
{
"execution-time": "350",
"content": "http://www.letour.fr/2009/TDF/LIVE/us/700/blocPorteursMaillots.html"
}],
"user-time": "370",
"service-time": "364",
"build-version": "2174"
}
},
"results": ["<div id=\"maillots\">\n <h2>Jersey holders<\/h2>\n <noscript>\n <div class=\"errormes\">\n <p>Activate Javascript/Flash for the automatic refresh and\n the display of the tabs.<\/p>\n <\/div>\n <\/noscript> \n <div id=\"porteurmaillotGeneral\">\n <ul>\n <li class=\"jaune\">\n <a href=\"/2009/TDF/RIDERS/us/coureurs/33.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/33.html');return false;\">CANCELLARA\n F.<\/a>\n <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/33.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/33.html');return false;\">SAX<\/a>\n <\/li>\n <li class=\"vert\">\n <a href=\"/2009/TDF/RIDERS/us/coureurs/71.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/71.html');return false;\">CAVENDISH\n M.<\/a>\n <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/71.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/71.html');return false;\">THR<\/a>\n <\/li>\n <li class=\"apois\">\n <a href=\"/2009/TDF/RIDERS/us/coureurs/122.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/122.html');return false;\">AUGE\n S.<\/a>\n <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/122.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/122.html');return false;\">COF<\/a>\n <\/li>\n <li class=\"blanc\">\n <a href=\"/2009/TDF/RIDERS/us/coureurs/76.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/76.html');return false;\">MARTIN\n T.<\/a>\n <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/76.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/76.html');return false;\">THR<\/a>\n <\/li>\n <\/ul>\n <\/div><\/div>"]
});
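The long JSONP-X call URL does not have to be encoded by hand. Here is a small sketch of assembling it from the YQL statement. `buildJsonpxUrl` and `YQL_ENDPOINT` are names made up for illustration; the endpoint, `format=xml` and `callback` values are taken from the call shown earlier.

```javascript
// Hypothetical helper: build a JSONP-X request URL from a YQL statement.
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

function buildJsonpxUrl(yql, callbackName) {
  return YQL_ENDPOINT +
    "?q=" + encodeURIComponent(yql) +
    "&format=xml" + // format=xml plus a callback is what selects JSONP-X
    "&callback=" + encodeURIComponent(callbackName);
}

// The final YQL statement from this post, split across lines for readability.
var statement =
  "select * from html where url in (" +
  "select href from json where " +
  'url="http://pipes.yahoo.com/pipes/pipe.run?_id=KBC0Ye1r3hGpf2CaqevxTA&_render=json" ' +
  'and itemPath = "json.value.items") and xpath = "/html/body/div"';

var requestUrl = buildJsonpxUrl(statement, "phoningHome");
console.log(requestUrl);
```

Note that `encodeURIComponent` leaves `*`, `(` and `)` unescaped, so the result differs slightly from the URL above (which uses `%28` and `%29`), but both decode to the same query.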
Like I said, letour.fr didn’t make it easy; most websites are much easier to scrape when the content is static HTML.
Now the easy part. Here’s the JS source that makes the YQL JSONP-X call, parses it and innerHTML’s the escaped HTML into a div.
// The JSONP-X request URL: format=xml plus a callback name.
var sURL = "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%20in%20%28select%20href%20from%20json%20where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3DKBC0Ye1r3hGpf2CaqevxTA%26_render%3Djson%22%20and%20itemPath%20%3D%20%22json.value.items%22%29%20and%20xpath%20%3D%20%22%2Fhtml%2Fbody%2Fdiv%22&format=xml&callback=phoningHome";
// The callback function, defined before the request is issued.
var phoningHome = function(r) {
  // r.results is an array of HTML strings, so join it before inserting.
  YAHOO.util.Dom.get("badge").innerHTML = r.results.join("");
};
// Load the JSONP-X response by inserting a script node.
var transactionObj = YAHOO.util.Get.script(sURL, {
  onSuccess : function(o) {o.purge();}, // remove the script node once it has run
  onFailure : function() {YAHOO.util.Dom.get("badge").innerHTML = "error";},
  scope : this
});
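If YUI is not on the page, the same load can be approximated with a plain script element. This is a swapped-in technique rather than the code used in the demo, and `loadJsonp` is a hypothetical name. The `doc` parameter exists only so the function can be exercised outside a browser.

```javascript
// Sketch of a JSONP load without YUI (not the author's code).
// `doc` defaults to the browser's document.
function loadJsonp(url, doc) {
  doc = doc || document;
  var script = doc.createElement("script");
  script.src = url;
  // Remove the node once the response script has run, similar to o.purge().
  script.onload = function () {
    if (script.parentNode) {
      script.parentNode.removeChild(script);
    }
  };
  var head = doc.getElementsByTagName("head")[0];
  head.appendChild(script);
  return script;
}
```

In a browser, the global callback named in the URL (phoningHome in this post) has to exist before calling `loadJsonp(sURL)`, because the response script invokes it as soon as it arrives.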
And finally, the final example page here, and source.
Footnotes:
The main trouble with this method is that you have to manually copy over the CSS from the site you are scraping if you want to render their styling. If you copy over their entire style sheet, make sure it doesn’t clash with your existing styles.
Also, it’s quite easy for the publisher you are scraping to insert a nasty <script> tag with JavaScript that does malicious things to your users or page, so be wary. If you want sanitized HTML output, add the sanitize option at the end of your YQL query. As of this writing there is a bug if you want to sanitize the entire output: instead of using sanitize(), use sanitize(field='')
If the HTML you are scraping uses relative links (most will), I found the <base> tag useful to ensure these links actually work. Alternatively, you can regex the results and modify the links that way.
The example page I created works best as a badge embedded in an <iframe>.
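As an illustration of the regex approach to relative links mentioned in the footnotes, here is a minimal sketch. `absolutizeLinks` is a hypothetical helper. It only rewrites double-quoted `href` and `src` attributes that start with a single slash, and it leaves inline handlers such as `onclick` untouched.

```javascript
// Hypothetical helper: prefix root-relative href/src values with the origin
// of the scraped site. Protocol-relative ("//host/...") and absolute URLs
// are left as they are.
function absolutizeLinks(html, origin) {
  return html.replace(/\b(href|src)="\/(?!\/)/g, '$1="' + origin + "/");
}

// Sample fragment modelled on the scraped jersey-holder markup.
var scraped = '<a href="/2009/TDF/RIDERS/us/coureurs/33.html">CANCELLARA F.</a>';
var fixed = absolutizeLinks(scraped, "http://www.letour.fr");
console.log(fixed);
// <a href="http://www.letour.fr/2009/TDF/RIDERS/us/coureurs/33.html">CANCELLARA F.</a>
```

Inside the callback, the scraped markup could be passed through this helper before it is assigned to innerHTML.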