Screen scraping in Android with JSoup

My Luas Times app uses a technique called screen scraping to get the latest tram times from the Luas website. It means requesting a web page and then parsing the returned HTML to pull out the data you want.

For Luas Times I scrape directly from your Android device because the page I scrape is very small and already suitable for mobile consumption. That will not be true of every page, so scraping on the device itself is not always the best option.

I decided to use JSoup for this task because it is very easy to both request the web-page and parse it in a number of ways. Here are a few snippets that show the process.

Getting a page is simple using a nice builder pattern:

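// Response is org.jsoup.Connection.Response.
// execute() throws an IOException on network errors and timeouts.
Response response =
  Jsoup
    .connect(url)
    .timeout(7000) // milliseconds
    .execute();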

The snippet of HTML we need to parse is like this:

<div class="Inbound">
  <h4>Inbound</h4>
  <div class="location">The Point</div>
  <div class="time">1</div>
</div>

<div class="Outbound">
  <h4>Outbound</h4>
  <div class="location">Tallaght</div>
  <div class="time">3</div>
  <div class="location">Tallaght</div>
  <div class="time">11</div>
</div>

I’m a big fan of CSS selectors and that’s what I use to parse this data.  In JSoup this goes like this:

Document doc = response.parse();
Element stage = doc.select("div.inbound").first();

if (stage != null) {
  Elements names = stage.select("div.location");
  Elements times = stage.select("div.time");

  // Only pair up rows that have both a location and a time.
  int rows = Math.min(names.size(), times.size());
  for (int i = 0; i < rows; i++) {
    // model is just a data model I have to store this info and more.
    model.addInbound(names.get(i).text(), times.get(i).text());
  }
}
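The snippet above calls model.addInbound(...) without showing the model itself. As a rough idea of what such a class could look like, here is a minimal sketch in plain Java. Apart from the addInbound name, everything in it is an assumption made for illustration: the class name TramTimesModel, the Due row type, addOutbound and the getters are hypothetical and are not taken from the Luas Times source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the post's "model"; only addInbound appears in the original. */
final class TramTimesModel {

    /** One scraped row: a destination and the minutes until the tram is due. */
    static final class Due {
        final String destination;
        final String minutes;

        Due(String destination, String minutes) {
            this.destination = destination;
            this.minutes = minutes;
        }

        @Override
        public String toString() {
            return destination + " in " + minutes + " min";
        }
    }

    private final List<Due> inbound = new ArrayList<Due>();
    private final List<Due> outbound = new ArrayList<Due>();

    /** Stores one inbound row; the scraped time text is trimmed of surrounding whitespace. */
    void addInbound(String destination, String minutes) {
        inbound.add(new Due(destination, minutes.trim()));
    }

    /** Assumed counterpart for the "Outbound" block of the page. */
    void addOutbound(String destination, String minutes) {
        outbound.add(new Due(destination, minutes.trim()));
    }

    /** Read-only views, so callers cannot modify the stored rows. */
    List<Due> getInbound() {
        return Collections.unmodifiableList(inbound);
    }

    List<Due> getOutbound() {
        return Collections.unmodifiableList(outbound);
    }
}
```

Keeping the times as strings mirrors what text() returns from the page; converting them to numbers would be a separate step once you know what values the site can produce.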

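“div.inbound” means: select every div with the class “inbound” from the HTML snippet; first() then takes the first match. JSoup matches class names case-insensitively, which is why the lowercase selector still finds the div with class “Inbound”. I would highly recommend reading the Nettuts+ 30 CSS Selectors article for more info on what you can do with selectors.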
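To try the selectors without touching the network, the HTML snippet from earlier can be parsed from a string. The following is a minimal sketch, assuming the jsoup jar is on the classpath. The class name SelectorDemo and the helper dueTimes are made up for this example; the jsoup calls used are Jsoup.parse, select, first and text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class; not part of the Luas Times app. */
final class SelectorDemo {

    // The HTML snippet shown earlier in the post, as a Java string.
    static final String HTML =
          "<div class=\"Inbound\">"
        + "<h4>Inbound</h4>"
        + "<div class=\"location\">The Point</div>"
        + "<div class=\"time\">1</div>"
        + "</div>"
        + "<div class=\"Outbound\">"
        + "<h4>Outbound</h4>"
        + "<div class=\"location\">Tallaght</div>"
        + "<div class=\"time\">3</div>"
        + "<div class=\"location\">Tallaght</div>"
        + "<div class=\"time\">11</div>"
        + "</div>";

    /** Returns "location: minutes" strings for the first element matching the selector. */
    static List<String> dueTimes(String html, String directionSelector) {
        List<String> result = new ArrayList<String>();
        Document doc = Jsoup.parse(html);
        Element stage = doc.select(directionSelector).first();
        if (stage == null) {
            return result; // no such block on the page
        }
        Elements names = stage.select("div.location");
        Elements times = stage.select("div.time");
        // Only pair up rows that have both a location and a time.
        int rows = Math.min(names.size(), times.size());
        for (int i = 0; i < rows; i++) {
            result.add(names.get(i).text() + ": " + times.get(i).text());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Tallaght: 3, Tallaght: 11]
        System.out.println(dueTimes(HTML, "div.Outbound"));
    }
}
```

Passing the selector in as a parameter means the same code can read either direction, and a selector that matches nothing simply yields an empty list.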

JSoup is not an Android-specific library, so I’d highly recommend grabbing it and having a go!


4 Responses to Screen scraping in Android with JSoup

  1. Neil says:

    What is the best way to handle JSoup failing to get data?

  2. Rasoul says:

    Hello,
    Did you have any problems integrating JSoup into your app? I’m currently using JSoup as an external jar, and the following snippet throws an exception in the emulator and eventually the app crashes:

    Document doc = Jsoup.connect("http://www.cnn.com").get();

    06-18 14:00:17.272: I/dalvikvm(424): Could not find method org.jsoup.Jsoup.connect, referenced from method com.pamir.ODeskActivity.getList
    06-18 14:00:17.272: W/dalvikvm(424): VFY: unable to resolve static method 25: Lorg/jsoup/Jsoup;.connect (Ljava/lang/String;)Lorg/jsoup/Connection;
    06-18 14:00:17.282: D/dalvikvm(424): VFY: replacing opcode 0x71 at 0x0007
    06-18 14:00:17.892: D/AndroidRuntime(424): Shutting down VM
    06-18 14:00:17.905: W/dalvikvm(424): threadid=1: thread exiting with uncaught exception (group=0x40014760)
    06-18 14:00:17.912: E/AndroidRuntime(424): FATAL EXCEPTION: main
    06-18 14:00:17.912: E/AndroidRuntime(424): java.lang.NoClassDefFoundError: org.jsoup.Jsoup
    06-18 14:00:17.912: E/AndroidRuntime(424): at com.pamir.ODeskActivity.getList(ODeskActivity.java:24)

    I have also given permission to use internet, but still get the same error.

    Any help is appreciated.

    • Barry says:

      Sure, I don’t have any issues with using it. Suggest you make sure that the jsoup jar is in your libs folder and added properly as a reference in your IDE. Note that “libs” is important (as opposed to “lib” or “jars”) because of a change in Android’s build process that happened a while ago. I use JSoup 1.4.1 btw, but I don’t think it is a versioning issue.

      See http://code.google.com/p/android/issues/detail?id=27490
