In this tutorial, we create a screen-scraping application that does not rely on an API. It sends a search to a website, pulls the search results from the returned page, and outputs a list of all the returned articles as a web page. Building and using it is straightforward, and the same concepts apply to applications or websites that create mashups of various web services.
As with all Dream applications, we must add mindtouch.dream.dll to our project references and add a using statement to our source file in order to use the XUri, Plug, and XDoc classes.
using MindTouch.Dream;
After that is done, we start by creating a form that prompts the user with a search field. It should look something like this, but feel free to customize it any way you want:
[Image: example search form]
Make sure the Click event of the Search... button (found under the events in the button's properties) is activated, so that the handler it generates can invoke the function that will handle the user's input.
private void SearchClicked(object sender, EventArgs e) {
    HandleSearch(searchBox.Text);
}
With the preliminaries set up, we create the function HandleSearch(string user_input) that handles the user's request. In our example, we use the New York Times' article search engine, but feel free to use any site you would like.
(Note: The site you are screen scraping is not aware that you are doing so, which means its pages may change without any warning to you. In addition, we recommend you use a site that produces structurally correct DOM trees. If it does not, the XDoc class may have trouble traversing the page and may fail to return the wanted results.)
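To see why the note above matters, here is a minimal sketch of path-based traversal. It uses the standard System.Xml.Linq and System.Xml.XPath namespaces rather than Dream's XDoc, and the TraversalSketch class, the Titles method, and the sample paths are hypothetical names chosen for illustration only.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class TraversalSketch
{
    // Returns the text of every <h3> found directly under the elements
    // selected by pathToItems, evaluated relative to the root element.
    public static List<string> Titles(string xhtml, string pathToItems)
    {
        // XElement.Parse throws an XmlException if the markup is not well-formed.
        XElement root = XElement.Parse(xhtml);
        return root.XPathSelectElements(pathToItems)
                   .Select(item => (string)item.Element("h3"))
                   .Where(title => title != null)
                   .ToList();
    }
}
```

If the site later changes its layout (say, an ol becomes a ul), the same path simply matches nothing and the method returns an empty list, so a scraper of this kind fails silently rather than loudly.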
Our first step in HandleSearch(user_input) is to create a request that asks the site for its information. To do this, we build a URI to give to the Plug class so that it can make the necessary request.
XUri nyt_xuri = new XUri("http://query.nytimes.com/");
We do this by appending the path we want to use, as well as the URI query parameters. Note that the order of the query parameters is not relevant.
XUri nyt_full_uri = nyt_xuri.At("search", "query").With("query", user_input).With("srchst", "nyt").With("n", "100");
The above call is equivalent to entering the following URI:
http://query.nytimes.com/search/query?query=user_input&srchst=nyt&n=100
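To make the equivalence above concrete, the following sketch assembles the same kind of URI by hand using only the .NET base class library. It is not how XUri is implemented; SearchUriSketch and Build are hypothetical names, and the sketch assumes each path segment and query value should be percent-encoded with Uri.EscapeDataString.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SearchUriSketch
{
    // Joins a base URI, path segments, and query parameters into one string,
    // percent-encoding every segment, key, and value along the way.
    public static string Build(
        string baseUri,
        IEnumerable<string> segments,
        IEnumerable<KeyValuePair<string, string>> query)
    {
        string path = string.Join("/", segments.Select(Uri.EscapeDataString));
        string queryString = string.Join("&", query.Select(
            p => Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
        return baseUri.TrimEnd('/') + "/" + path + "?" + queryString;
    }
}
```

Calling Build with the base http://query.nytimes.com/, the segments search and query, and the parameters query=weather, srchst=nyt, n=100 yields the URI shown above with weather in place of user_input. A search term containing a space, such as "new york", would appear as new%20york.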
Next, we create a Plug with the XUri we just built and request a response from the webpage using a Get() call.
(Note: The Get() method causes a compiler warning because it is a thread-blocking method. Just ignore it for now.)
Plug plug = Plug.New(nyt_full_uri);
DreamMessage message = plug.Get();
Once we have the DreamMessage, we request the response to be returned as an XDoc.
XDoc doc = message.AsDocument();
At the same time, we create another XDoc with the intent to build a web page. We create the necessary <html>, <head>, and <body> elements, and so on, using Start(element_name), which opens a new element (e.g. <element_name>). The Elem(element_name, text) call creates a complete element with an internal text node (e.g. <element_name>text</element_name>). Finally, the End() call closes the current element (e.g. </element_name>).
XDoc output = new XDoc("html"); //html wrapper
output.Start("head").Elem("title", user_input).End();
output.Start("body");
output.Elem("h3", "Search Result for: " + user_input);
output.Start("table").Attr("border", "1"); // table for results
// go to the location wanted in the page by traversing the DOM tree
foreach(XDoc entry in doc["body/div/div/div/ol/li"]) {
    // formatting for the returned articles
    output.Start("tr").Start("td");
    output.Add(entry["h3"]); // retrieve article title and link
    output.Add(entry["p"]); // retrieve article snippet
    output.Add(entry["div"]); // retrieve article info (author, date, number of words)
    output.End().End(); // close off td and tr
}
output.End().End(); // close off table and body
The HTML equivalent of what we've just done above is this:
<html>
<head>
<title>[user_input]</title>
</head>
<body>
<h3>Search Result for: [user_input]</h3>
<table border="1">
<tr><td>
<h3><a>article title and link</a></h3>
<p><a>article snippet</a></p>
<div><a>article info</a></div>
</td></tr>
<tr><td>
...
</td></tr>
</table>
</body>
</html>
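For comparison, the same nesting can be sketched with the standard System.Xml.Linq types instead of XDoc. This is only an illustration of the structure, not part of the Dream API; ResultPageSketch, Build, and the (Title, Snippet, Info) tuple are hypothetical, and the sketch assumes each article has been reduced to three plain strings.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ResultPageSketch
{
    // Builds html > head/title and body > h3 + table, with one
    // <tr><td> row per article holding an h3, a p, and a div.
    public static XElement Build(
        string userInput,
        IEnumerable<(string Title, string Snippet, string Info)> articles)
    {
        return new XElement("html",
            new XElement("head",
                new XElement("title", userInput)),
            new XElement("body",
                new XElement("h3", "Search Result for: " + userInput),
                new XElement("table",
                    new XAttribute("border", "1"),
                    articles.Select(a =>
                        new XElement("tr",
                            new XElement("td",
                                new XElement("h3", a.Title),
                                new XElement("p", a.Snippet),
                                new XElement("div", a.Info)))))));
    }
}
```

Here each constructor's argument list plays the role of a Start()/End() pair: whatever appears inside the parentheses is nested inside that element.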
After we're finished creating the XDoc, we generate a randomly named temp file and write the XHTML document to it.
string filename = Path.GetTempFileName() + ".html";
File.WriteAllText(filename, output.ToXHtml());
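One caveat about the call above: Path.GetTempFileName() creates an empty file on disk as a side effect, so appending ".html" to its result names a second file and leaves the first one behind. A possible alternative, sketched here with the hypothetical names TempPageSketch and Write, is to combine the temp directory with a random file name that is not created until we write to it.

```csharp
using System.IO;

public static class TempPageSketch
{
    // Writes the given markup to a randomly named .html file in the
    // user's temp directory and returns the full path of that file.
    public static string Write(string xhtml)
    {
        // GetRandomFileName only returns a name; it does not create a file.
        string filename = Path.Combine(
            Path.GetTempPath(),
            Path.GetRandomFileName() + ".html");
        File.WriteAllText(filename, xhtml);
        return filename;
    }
}
```

In the tutorial's flow this would be called as TempPageSketch.Write(output.ToXHtml()), with the returned path used in the next step.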
Finally, we ask the operating system to open the file containing the screen-scraped article list we just created.
Async.ExecuteProcess("explorer.exe", filename, Stream.Null, new Result<Tuple<int, Stream, Stream>>());
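If you would rather not depend on explorer.exe, a rough alternative is to hand the file to the operating system shell through the standard System.Diagnostics API. This is a sketch under the assumption that the shell associates .html files with a browser; OpenPageSketch, BuildStartInfo, and Open are hypothetical names and are not part of Dream.

```csharp
using System.Diagnostics;

public static class OpenPageSketch
{
    // Describes a launch of the file through the OS shell, which picks
    // whatever application is registered for the file's extension.
    public static ProcessStartInfo BuildStartInfo(string filename)
    {
        return new ProcessStartInfo(filename) { UseShellExecute = true };
    }

    // Starts the associated application without waiting for it to exit.
    public static void Open(string filename)
    {
        Process.Start(BuildStartInfo(filename));
    }
}
```

Unlike the blocking Get() call earlier, Open returns as soon as the process has been started.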
You now have a simple web-based search application. Here is a sample of the output. We entered "weather" in the search form, and these are the results returned from the New York Times website:

Though the method used in this sample is a screen scrape, the mechanics are very similar to using the REST API of any web service. The reason is that REST web services work on the same principles as web pages.
Now that our application is complete, we have seen that the XDoc, Plug, and XUri classes provide an easy and fast way to pull a page from a website, traverse its HTML document, and write a new HTML document with the contents pulled from that site. In review, these are the methods in the Dream classes that we used to fulfill our objective: