YQL and JSONP-X (aka. json-p-x, jsonpx, json-px)

Amid all the buzz about YQL’s new Insert/Update/Delete support, another new feature, JSONP-X, was released at the same time.

JSONP-X is essentially an escaped XML string returned as a JSON result and wrapped in a JavaScript callback function. To use it, request format=xml together with a callback parameter, as in this example:
http://query.yahooapis.com/v1/public/yql?q=<my yql query>&format=xml&callback=mycallback

and a basic structure:

mycallback({
    "query": {yql meta data here},
    "results": ["<escaped xml/html here>"]
});
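
As a rough sketch of how such a request URL could be assembled in JavaScript (the helper name buildJsonpxUrl and the example.com query are hypothetical, not part of YQL), the query and callback name just need to be URL-encoded and combined with format=xml:

```javascript
// Hypothetical helper: builds a YQL JSONP-X URL from a query and a callback name.
// Only the endpoint and the q/format/callback parameters come from the post.
function buildJsonpxUrl(yqlQuery, callbackName) {
    var base = "http://query.yahooapis.com/v1/public/yql";
    return base +
        "?q=" + encodeURIComponent(yqlQuery) +          // escape spaces, quotes, slashes...
        "&format=xml" +                                 // XML format plus a callback yields JSONP-X
        "&callback=" + encodeURIComponent(callbackName);
}

// Illustrative query against a placeholder site.
var url = buildJsonpxUrl('select * from html where url="http://example.com/"', "mycallback");
```

Note that encodeURIComponent leaves characters such as * and parentheses unescaped, so the result can differ cosmetically from URLs produced by other encoders while meaning the same thing.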

The power of YQL’s JSONP-X really comes into play when scraping a web page. It lets you extract HTML, keeping the HTML structure intact in the JSON results, and then use JavaScript to innerHTML the results into your own page. This makes badging much easier. (YQL and Pipes respect robots.txt, so HTML scraping will only work on sites that are happy to have their content indexed by search engines and cached elsewhere.)

I’m a cycling fan and the Tour de France is the World Series of cycling. So I wanted to create a badge that leveraged the new JSONP-X feature to extract the nice results module on http://www.letour.fr/us/homepage_courseTDF.html

For the impatient, here is the final example page: http://paul.donnelly.org/demos/YQL_JSONP-X.html

First of all, scraping www.letour.fr to get this module wasn’t exactly straightforward. On closer inspection of the page, the module turned out to be created dynamically based on the current stage.

select * from html where url="http://www.letour.fr/us/homepage_courseTDF.html" and xpath = '//div[@id="maillotDyn"]'

The above query yielded an empty div.

Upon further poking of the page I found that this function:

maillotFunc = function(){
	makeRequest(prefixPath + 'blocPorteursMaillots.html?'+timestamp, 'maillotDyn', 'HTML', true, false);
}

created the module.

Now that I had found the function, I knew which HTML page I needed to scrape.

But what is the “prefixPath”? Apparently it was generated dynamically on the front page and defined in the JavaScript. I could create a YQL Execute statement that regexes that script node, or… wait.

I also noticed that href paths to various links had the dynamically created “prefixPath” as well, for example:

<li class="level2"><a href="/2009/TDF/LIVE/us/500/classement/index.html">Standings</a></li>

Ah, yes, I can use that path to construct the final endpoint: “http://www.letour.fr/2009/TDF/LIVE/us/500/blocPorteursMaillots.html”.

OK, so let’s create a YQL query that fetches one of those links:

select * from html where url="http://www.letour.fr/us/homepage_courseTDF.html" and xpath = "//li[@class='level2'][2]/a"

Great, now I have a nice result that gives me my prefix. So how do I go about regexing that path out and constructing the complete URL that I need? I guess I’ll have to create a YQL Execute statement that performs the regex. But wait. I’m feeling kind of lazy this morning and don’t want to spend a lot of time on this.

I can use Yahoo Pipes to leverage my regex! Get the cleaned up results from Yahoo Pipes as JSON and then do my final JSONP-X call. Check out the Pipe here.

In my Pipe, I use the YQL module to get the prefixPath from the A tag. I then use the Regex module to construct the final URL I want YQL to scrape. (In [item.href] Replace [^(.*?)/classement.*] with [http://www.letour.fr$1/blocPorteursMaillots.html])
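
For readers who want to see what that Regex module rule does, here is a minimal JavaScript sketch of the same replacement. The function name toJerseyUrl is hypothetical, and this only mirrors the rule quoted above rather than how Pipes runs it internally:

```javascript
// Hypothetical helper mirroring the Pipe's Regex rule:
// capture everything before "/classement" and rebuild the module URL around it.
function toJerseyUrl(href) {
    return href.replace(
        /^(.*?)\/classement.*/,
        "http://www.letour.fr$1/blocPorteursMaillots.html"
    );
}

// The href from the "Standings" link shown earlier in the post.
var finalUrl = toJerseyUrl("/2009/TDF/LIVE/us/500/classement/index.html");
// finalUrl is "http://www.letour.fr/2009/TDF/LIVE/us/500/blocPorteursMaillots.html"
```

An href that does not contain "/classement" is returned unchanged, so a caller would want to check for that case before scraping.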

Sweet. I then can use http://pipes.yahoo.com/pipes/pipe.run?_id=KBC0Ye1r3hGpf2CaqevxTA&_render=json as a way to get to my final URL via a sub select.

This is the final YQL statement I used: select * from html where url in (select href from json where url="http://pipes.yahoo.com/pipes/pipe.run?_id=KBC0Ye1r3hGpf2CaqevxTA&_render=json" and itemPath = "json.value.items") and xpath = "/html/body/div"

and this is the JSONP-X call: http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%20in%20%28select%20href%20from%20json%20where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3DKBC0Ye1r3hGpf2CaqevxTA%26_render%3Djson%22%20and%20itemPath%20%3D%20%22json.value.items%22%29%20and%20xpath%20%3D%20%22%2Fhtml%2Fbody%2Fdiv%22&format=xml&callback=phoningHome

and the JSONP-X structure:

phoningHome({
    "query": {
        "count": "1",
        "created": "2009-07-10T07:09:52Z",
        "lang": "en-US",
        "updated": "2009-07-10T07:09:52Z",
        "uri": "http://query.yahooapis.com/v1/yql?q=select+*+from+html+where+url+in+%28select+href+from+json+where+url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3DKBC0Ye1r3hGpf2CaqevxTA%26_render%3Djson%22+and+itemPath+%3D+%22json.value.items%22%29+and+xpath+%3D+%22%2Fhtml%2Fbody%2Fdiv%22",
        "diagnostics": {
            "publiclyCallable": "true",
            "url": [{
                "execution-time": "14",
                "content": "http://pipes.yahoo.com/pipes/pipe.run?_id=KBC0Ye1r3hGpf2CaqevxTA&_render=json"
            },
            {
                "execution-time": "350",
                "content": "http://www.letour.fr/2009/TDF/LIVE/us/700/blocPorteursMaillots.html"
            }],
            "user-time": "370",
            "service-time": "364",
            "build-version": "2174"
        }
    },
    "results": ["<div id=\"maillots\">\n    <h2>Jersey holders<\/h2>\n    <noscript>\n      <div class=\"errormes\">\n        <p>Activate Javascript/Flash for the automatic refresh and\n        the display of the tabs.<\/p>\n      <\/div>\n    <\/noscript> \n    <div id=\"porteurmaillotGeneral\">\n      <ul>\n        <li class=\"jaune\">\n          <a href=\"/2009/TDF/RIDERS/us/coureurs/33.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/33.html');return false;\">CANCELLARA\n          F.<\/a>\n          <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/33.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/33.html');return false;\">SAX<\/a>\n        <\/li>\n        <li class=\"vert\">\n          <a href=\"/2009/TDF/RIDERS/us/coureurs/71.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/71.html');return false;\">CAVENDISH\n          M.<\/a>\n          <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/71.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/71.html');return false;\">THR<\/a>\n        <\/li>\n        <li class=\"apois\">\n          <a href=\"/2009/TDF/RIDERS/us/coureurs/122.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/122.html');return false;\">AUGE\n          S.<\/a>\n          <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/122.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/122.html');return false;\">COF<\/a>\n        <\/li>\n        <li class=\"blanc\">\n          <a href=\"/2009/TDF/RIDERS/us/coureurs/76.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/76.html');return false;\">MARTIN\n          T.<\/a>\n          <a class=\"cob\" href=\"/2009/TDF/RIDERS/us/coureurs/76.html\" onclick=\"SesameCoureur('/2009/TDF/RIDERS/us/coureurs/76.html');return false;\">THR<\/a>\n        <\/li>\n      <\/ul>\n    <\/div><\/div>"]
});
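
Since "results" is an array of escaped markup strings, a callback may want to join the entries before using them. The sketch below assumes the payload shape shown above; the helper name jsonpxToHtml and the trimmed-down sample object are hypothetical:

```javascript
// Hypothetical helper: turns a JSONP-X payload into one HTML string.
// Assigning the array straight to innerHTML would coerce it to a string
// with commas between entries, so the entries are joined explicitly.
function jsonpxToHtml(response) {
    var results = (response && response.results) || []; // tolerate a missing results key
    return results.join("");
}

// A shortened stand-in for the real response above.
var sample = {
    query: { count: "1" },
    results: ["<div id=\"maillots\"><h2>Jersey holders<\/h2><\/div>"]
};
var html = jsonpxToHtml(sample);
```

With an empty or missing results array the helper returns an empty string, which is a convenient signal to show a fallback message instead.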

Like I said, letour.fr didn’t make it easy, but most websites are much easier to scrape when they serve static HTML.

Now the easy part. Here’s the JS source that makes the YQL JSONP-X call, parses it and innerHTML’s the escaped HTML into a div.

// Define the callback first so it exists when the JSONP-X script runs.
var phoningHome = function(r) {
    // r.results is an array of escaped HTML strings; join them before inserting.
    YAHOO.util.Dom.get("badge").innerHTML = r.results.join("");
};

var sURL = "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%20in%20%28select%20href%20from%20json%20where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3DKBC0Ye1r3hGpf2CaqevxTA%26_render%3Djson%22%20and%20itemPath%20%3D%20%22json.value.items%22%29%20and%20xpath%20%3D%20%22%2Fhtml%2Fbody%2Fdiv%22&format=xml&callback=phoningHome";

// Load the JSONP-X response by inserting a script node with the YUI 2 Get utility.
var transactionObj = YAHOO.util.Get.script(sURL, {
    onSuccess : function(o) { o.purge(); }, // remove the inserted script node
    onFailure : function() { YAHOO.util.Dom.get("badge").innerHTML = "error"; },
    scope     : this
});

And finally, the final example page here, and source.

Footnotes:

The main trouble with this method is that you have to manually copy over the CSS from the site you are scraping if you want to render their styling. If you copy over their entire style sheet, you also want to make sure it doesn’t clash with your existing styles.

Also, it’s quite easy for the publisher you are scraping to insert a nasty <script> tag with JavaScript that does malicious things to your users or page, so be wary. If you want sanitized HTML output, add the sanitize option at the end of your YQL query. As of this writing there is a bug if you want to sanitize the entire output: instead of using sanitize(), use sanitize(field="")

If the HTML you are scraping uses relative links (most will), I found the <base> tag useful to ensure these links actually work. Alternatively, you can regex the results and modify the links that way.
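
As an illustration of the regex route, here is a minimal sketch that prefixes root-relative links with the source site’s origin. The helper name absolutizeLinks is hypothetical, and it only handles double-quoted href and src attributes that start with a single slash, so paths buried in inline handlers such as onclick are left alone:

```javascript
// Hypothetical helper: prefixes root-relative href/src values with an origin.
function absolutizeLinks(html, origin) {
    // Match href="/..." or src="/..." but skip protocol-relative "//..." URLs.
    return html.replace(/\b(href|src)="\/(?!\/)/g, function (match, attr) {
        return attr + '="' + origin + "/";
    });
}

var fixed = absolutizeLinks(
    '<a href="/2009/TDF/RIDERS/us/coureurs/33.html">CANCELLARA F.</a>',
    "http://www.letour.fr"
);
```

A regex like this is fine for a quick badge, but it is not an HTML parser; for anything less predictable, the <base> tag is the more robust choice.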

The example page I created works best when embedded in an <iframe> as a badge.

5 Comments

  1. russell
    Posted July 9, 2009 at 4:52 pm | Permalink

    awesome post! keep up the gr8 work with yql and pipes! I’m experimenting with yql execute and pipes to help create neighborhood sms alerts. thanks again. look forward to more posts.

  2. Posted July 9, 2009 at 10:23 pm | Permalink

    Great work, I have been using format=json&callback=callback on my YQL query for sometime now what advantage does this new way have?

  3. Paul Donnelly
    Posted July 9, 2009 at 11:17 pm | Permalink

    @Ryan, the main advantage is that you don’t have to loop through your JSON result to construct HTML. Not that it’s hard to do, but it cuts out another step. Also, HTML to JSON is somewhat lossy (read: http://developer.yahoo.net/forum/index.php?showtopic=649). Using JSONP-X keeps the HTML structure you want entirely intact when using this type of JSON payload.

  4. Posted July 10, 2009 at 1:21 am | Permalink

    Great stuff, will cross-link it from my posts on the matter.

  5. Posted February 24, 2012 at 12:30 pm | Permalink

    Top notch use of XML and a JSONP callback. I have never even thought of it, but it sure beats using JSON and looking for “a” items to rebuild HTML structures from scratch. Why not just preserve formatting and copy over some styles?

    That’s excellent!
