Web Scraping - Any techniques?

DaveDotNet · December 3, 2013, 3:20am

I’m going to extend my garage system so that when I leave the garage in the morning, the dot matrix display shows the next train departure time. Then I’ll know whether I can take it easy or have to drive like a lunatic because I’m late.

The web site I’m contemplating scraping produces fairly simple HTML from the realtime live departure boards. It probably wouldn’t be very difficult to hard code a search for various ‘signatures’, but I wondered if anyone else had explored HTML parsing.

opb · December 3, 2013, 4:09am

What country are you in? Have you had a look for any APIs that may be available?

DaveDotNet · December 3, 2013, 5:39am

In the UK. We have a service run by National Rail. They have a SOAP/XML webservice, see here:

http://www.livedepartureboards.co.uk/ldbws/

I’ve applied for a license/token, but I doubt I’ll get one, and even if I do, they will probably want some revenue. I didn’t want to roll my own SOAP/XML and process the WSDL, etc.

However, there are a number of web sites that scrape the National Rail live departure boards and present relatively simple HTML. Examples include livetrains.co.uk and traintimes.org.uk

opb · December 3, 2013, 7:16am

Heh, I was thinking of the national rail one

If I was doing it, I’d probably use an external server to do the scraping via curl and PHP when called by the Imp, and just return the scraped info as JSON, which the Imp can easily process for display. Not sure what your setup is though!
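As a rough illustration of that server-side idea, here is a minimal sketch in Python rather than PHP, using only the standard library. It assumes the page puts the scheduled time at the start of the first table cell and the status in the first `<small>` element; that layout, the `SAMPLE_HTML` string, and the names `DepartureParser` and `departure_as_json` are all made up for the example and are not taken from any real site. Fetching the page is left out.

```python
import json
from html.parser import HTMLParser


class DepartureParser(HTMLParser):
    """Records the first text found in a <td> and the first text in a <small>."""

    def __init__(self):
        super().__init__()
        self.schedule = None    # first text inside a table cell
        self.status = None      # first text inside a <small> element
        self._in_td = False
        self._in_small = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
        elif tag == "small":
            self._in_small = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "small":
            self._in_small = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_small:
            if self.status is None:
                self.status = text
        elif self._in_td and self.schedule is None:
            self.schedule = text


def departure_as_json(html):
    """Parse the page and return a small JSON document for the device to consume."""
    parser = DepartureParser()
    parser.feed(html)
    parser.close()
    return json.dumps({"schedule": parser.schedule, "status": parser.status})


# Hypothetical page fragment, only for demonstration.
SAMPLE_HTML = "<table><tr><td>08:15<br><small>On time</small></td></tr></table>"

if __name__ == "__main__":
    print(departure_as_json(SAMPLE_HTML))
```

The point of the design is that the device only ever sees two small fields, so any change in the page layout is fixed on the server without touching the device code.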

DaveDotNet · December 3, 2013, 7:39am

I wanted to strive to keep all the processing in the Imp as I’m finding it very capable. I will probably consider the server approach, certainly if I get the LDB webservice access. Meanwhile, I’ll experiment with traintimes.org.uk (it has some nice URL constructs) and scraping/parsing the HTML.

DaveDotNet · December 3, 2013, 8:43am

Well, I’ve used the sledgehammer design pattern to scrape the HTML with the following code:

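`function TrainTimetableResponse (Fields)
{
    server.log ("Response received")

    local Start
    local End

    // Find scheduled train arrival: the text following the first "<tr><td>"

    Start = Fields.body.find ("<tr><td>")

    if (Start != null) {
        server.log ("Start = " + Start)
        Start = Start + 8    // skip past "<tr><td>" (8 characters)
        End = Fields.body.find ("<", Start)
        if (End == null) End = Fields.body.len ()    // no further tag: take the rest
        server.log ("End = " + End)
        local Schedule = Fields.body.slice (Start, End)
        server.log ("Schedule = " + Schedule)

        // Find whether or not train is on schedule: the text in the first "<small>"

        local Status = null
        Start = Fields.body.find ("<small>")

        if (Start != null) {
            server.log ("Start = " + Start)
            Start = Start + 7    // skip past "<small>" (7 characters)
            End = Fields.body.find ("<", Start)
            if (End == null) End = Fields.body.len ()
            server.log ("End = " + End)
            Status = Fields.body.slice (Start, End)
            server.log ("Train Status: " + Status)
        }

        // Decide whether the train is on time or late.
        // If no status was found, fall back to the scheduled time.

        local TrainStatus = (Status == null || Status == "On time") ? Schedule : Status

        // Communicate Train time to Device

        device.send ("TrainStatus", TrainStatus)
    }

    // Poll again in 60 seconds (TrainTimetableRequest is defined elsewhere, not shown)

    imp.wakeup (60, TrainTimetableRequest)
}`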

royshearer · March 27, 2015, 6:08pm

Forgive my newbie status, but how did you actually grab the HTML to pass into TrainTimetableResponse() ?

mlseim · March 27, 2015, 9:43pm

This site explains that it is free up to 5 million accesses per 4-week period. I’m in the U.S., so I don’t know what the “Darwin” service is. But is that what you would use to find a departure from a station and “on time” status?

http://www.nationalrail.co.uk/46391.aspx

beardedinventor · March 27, 2015, 10:33pm

Take a look at Kimono Labs - it lets you turn webpages into structured API data (I haven’t played with it before, so I can’t provide any guidance outside of “some very smart people I know use and like this service”).

brendandawes · March 28, 2015, 2:09pm

I’ve used Kimono Labs a few times and it’s great - it will even cache the data for you in case the site you’re trying to query is down.

normansmith · June 8, 2015, 10:42am

There are many web scraping techniques that can be used to scrape data from any website. Loginworks has expertise in web scraping services. See http://www.loginworks.com/blogs/web-scraping-blogs/160-techniques-of-web-scraping/ - they provide the best web scraping solution.

mlseim · June 8, 2015, 11:09pm

@norman …
APIs are developed by website owners as a means to share or provide information in a controlled manner. Some are free, some require a fee. People seem to think that whatever is on the internet is free for the taking. They feel they can just write scripts to copy content, or copy and paste anything from any website. Whatever happened to ethical practices and respect for the information and content people are willing to provide? I should do an experiment and copy everything off the Loginworks website and post it on my own website. I wonder what they would think about that? I imagine they would not be very happy about it. It’s too bad that people can’t be creative enough to come up with their own ideas and content. Loginworks is an Indian company … hmmm.