Easily parse data from a website (even if it needs you to be logged in) with AutoTools Regex

Learn how you can easily get data from a website by parsing its contents while being logged in

  1. joaomgcd
    This example will show you how you can get a list of your private messages on the NeoGAF website. This is just an example, but you can apply it to any website you visit.

    It's important to note that AutoTools will not keep your credentials. It will simply keep your authentication cookies so that it can get data on your behalf. Because of that, if the website automatically logs you out after some time, AutoTools will be logged out and the task will no longer work.

    STEP 1 - GET MESSAGE LIST


    In this step we'll get the whole message list in a single AutoTools Regex action. This is achieved by using a regex with named groups, so the values you want land directly in variables with the same names as the groups.

    This is the process I used to work out what the regex should look like.

    • Right-click the webpage you want to get info from in Chrome on your PC and select View Page Source
    • In the page source look for something that you know is on the page. For example, I know I had a message with the title Steam Code so I looked for that in the source code.
    • Look where your data is. In this example I wanted to get the title and link for each message, so I figured out that this piece of HTML allows me to get both (the link is the value of the href attribute and the title is the text inside the <a> tag):
    <span style="float:right" class="smallfont">02-23-2015</span>
    <a href="private.php?do=showpm&amp;pmid=9408881">Your Steam Code for There Came an Echo</a>

    • Look for pieces of text that don't repeat themselves in other parts of the page that are unrelated to this. In this example I made sure that the <span style="float:right" class="smallfont"> part didn't show up in parts of the page that weren't listing my messages
    • Build up the regex to get what you want and keep testing it until you get it right. You can test your regex with an online regex tester, for example. Let's see how I did it for this example:
    • I know that all messages start with <span style="float:right" class="smallfont"> so that's the first part of the regex
    <span style="float:right" class="smallfont">
    • After that there's a bunch of text that doesn't really matter on the same line (02-23-2015</span>) so just say that you want to match any character multiple times: add .+? to the regex
    <span style="float:right" class="smallfont">.+?
    • We're almost at the link, but let's clear out the remaining characters before it. First there's a bunch of whitespace before the <a part, so to take care of that add [^<]+ to the regex
    <span style="float:right" class="smallfont">.+?[^<]+
    • Then we finally have <a href=" right before the link, so add that to the regex too
    <span style="float:right" class="smallfont">.+?[^<]+<a href="
    • Now we have the link! Let's put it in a named group so it's easy to access in Tasker. Add (?<link>[^"]+) to the regex. (?<groupname>) is how you give a name to a group. [^"]+ means one or more characters that are not a double quote. In practice, we're telling it that the link comes between <a href=" and the next double quote.
    <span style="float:right" class="smallfont">.+?[^<]+<a href="(?<link>[^"]+)
    • The title comes right after this part with just the "> characters in between. So first add "> to the regex
    <span style="float:right" class="smallfont">.+?[^<]+<a href="(?<link>[^"]+)">
    • Finally get the title by adding (?<title>[^<]+) to the regex. This is a named group that captures anything that isn't a <
    <span style="float:right" class="smallfont">.+?[^<]+<a href="(?<link>[^"]+)">(?<title>[^<]+)

    This is the final regex we're going to use. I know looking at the finished regex like this makes it look really complex, but if you follow step by step you'll see that it all makes sense. :p
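
    If you want to check the finished regex outside Tasker, here is a minimal Python sketch that runs it against the HTML snippet from above. Note that Python spells named groups (?P<name>...) instead of (?<name>...), so the pattern is adapted for that; the variable names are just illustrative and none of this is part of AutoTools.

```python
import re

# HTML snippet copied from the page source shown earlier.
html = '''<span style="float:right" class="smallfont">02-23-2015</span>
    <a href="private.php?do=showpm&amp;pmid=9408881">Your Steam Code for There Came an Echo</a>'''

# Same regex as above, using Python's (?P<name>...) spelling for named groups.
pattern = re.compile(
    r'<span style="float:right" class="smallfont">.+?[^<]+'
    r'<a href="(?P<link>[^"]+)">(?P<title>[^<]+)'
)

match = pattern.search(html)
link = match.group("link")    # text between <a href=" and the next double quote
title = match.group("title")  # text between "> and the next <
print(link)
print(title)
```

    The two named groups come back as separate values, which is the same idea as the variables AutoTools creates in Tasker.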

    With the regex in place it's time to create our task:
    • Create a task in Tasker and add an AutoTools Regex action
    • Set the Text field to the page you want to get info from, in this case http://www.neogaf.com/forum/private.php
    • Set the Regex field to the regex that gets the info from the page, in this case <span style="float:right" class="smallfont">.+?[^<]+<a href="(?<link>[^"]+)">(?<title>[^<]+)
    • Enable the Get Multiple Results option so that AutoTools returns all the matches found on the page and not just the first one
    • Accept and go back to Tasker
    (i) Notice how variables with the same name as the group names in the regex are created in Tasker.
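
    To picture what Get Multiple Results gives you, here is a rough Python sketch that collects every match into two parallel lists, similar in spirit to the %title() and %link() arrays in Tasker. The second message in the sample HTML is made up for illustration, and Python's (?P<name>...) group syntax is used in place of (?<name>...).

```python
import re

# Two entries shaped like the page source; the second one is invented.
html = '''
<span style="float:right" class="smallfont">02-23-2015</span>
    <a href="private.php?do=showpm&amp;pmid=9408881">Your Steam Code for There Came an Echo</a>
<span style="float:right" class="smallfont">02-24-2015</span>
    <a href="private.php?do=showpm&amp;pmid=1234567">Another message</a>
'''

pattern = re.compile(
    r'<span style="float:right" class="smallfont">.+?[^<]+'
    r'<a href="(?P<link>[^"]+)">(?P<title>[^<]+)'
)

links = []
titles = []
# finditer walks through every match on the page, not just the first one.
for match in pattern.finditer(html):
    links.append(match.group("link"))
    titles.append(match.group("title"))

print(titles)
print(links)
```

    Item N of one list belongs with item N of the other, which is what makes the list dialog in the next step work.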


    STEP 2 - CHOOSE MESSAGE FROM LIST


    • Add an AutoTools Dialog action
    • Set the Dialog Type to List
    • Set the title to Neogaf private messages
    • Set an appropriate icon
    • Set the Texts field to %title()
    • Set the Commands field to %link()
    • Accept and go back to Tasker
    (i) Setting commands will make touching items on the list return the command to the task instead of the text. For example, if you touch the 3rd item on the list, the 3rd link will be returned to the task instead of the 3rd text. If no commands were set, the text would be returned instead
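
    The texts/commands behaviour described above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not how AutoTools is implemented; pick_item and the sample data are hypothetical.

```python
def pick_item(texts, commands, index):
    """Return what the list dialog would hand back for the touched item."""
    # If commands were set, the command at the same position is returned.
    if commands:
        return commands[index]
    # Otherwise the visible text itself is returned.
    return texts[index]


titles = ["Your Steam Code for There Came an Echo", "Another message"]
links = [
    "private.php?do=showpm&amp;pmid=9408881",
    "private.php?do=showpm&amp;pmid=1234567",
]

with_commands = pick_item(titles, links, 1)   # touch the 2nd item, commands set
without_commands = pick_item(titles, [], 1)   # touch the 2nd item, no commands
print(with_commands)
print(without_commands)
```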


    STEP 3 - DETECT MESSAGE ID


    (i) A selected link will be something like private.php?do=showpm&amp;pmid=9408881 where 9408881 is the private message ID. We want to get the ID now.

    • Add another AutoTools Regex action to the Task
    • Set the Text field to %atcommand
    (i) %atcommand is the link selected from the list dialog in the previous step
    • Set the Regex field to (?<pmid>[0-9]+)$
    (i) This regex captures all the digits right before the end of the text, which is the ID in the link above, and puts them in a %pmid variable
    • Accept and go back to the task
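
    As a quick check of the ID regex, here is a small Python sketch run against the example link from above. Python writes the named group as (?P<pmid>...); the rest of the pattern is unchanged.

```python
import re

# Example link as returned by the list dialog.
link = "private.php?do=showpm&amp;pmid=9408881"

# [0-9]+ followed by $ only matches digits that sit at the very end of the text.
match = re.search(r"(?P<pmid>[0-9]+)$", link)
pmid = match.group("pmid")
print(pmid)
```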


    STEP 4 - BROWSE TO PM ON WEBSITE



    STEP 5 - ENABLE DETECT URL ON FIRST ACTION


    In the first action, I forgot to enable the Detect URL option. If this isn't enabled, the URL will be interpreted as literal text and not as a URL to get the contents from. So enable Detect URL now.
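
    To see why this matters, here is a hypothetical Python model of the option. text_to_search and fake_fetch are invented for illustration (fake_fetch stands in for a real web request so the sketch runs offline), and the http/https check is an assumption, not how AutoTools actually detects URLs.

```python
import re

MESSAGE_REGEX = re.compile(
    r'<span style="float:right" class="smallfont">.+?[^<]+'
    r'<a href="(?P<link>[^"]+)">(?P<title>[^<]+)'
)


def text_to_search(text_field, detect_url, fetch_page):
    """Pick the text the regex runs on (assumed model of Detect URL)."""
    if detect_url and text_field.startswith(("http://", "https://")):
        return fetch_page(text_field)  # use the page contents
    return text_field                  # use the field as literal text


def fake_fetch(url):
    # Stand-in for a real HTTP request; returns the snippet from Step 1.
    return (
        '<span style="float:right" class="smallfont">02-23-2015</span>\n'
        '    <a href="private.php?do=showpm&amp;pmid=9408881">'
        'Your Steam Code for There Came an Echo</a>'
    )


url = "http://www.neogaf.com/forum/private.php"
matches_off = MESSAGE_REGEX.findall(text_to_search(url, False, fake_fetch))
matches_on = MESSAGE_REGEX.findall(text_to_search(url, True, fake_fetch))
print(matches_off)
print(matches_on)
```

    With the option off, the regex only ever sees the URL string itself, so it finds no messages.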


    STEP 6 - AUTHENTICATE ON WEBSITE


    While you're here, perform authentication in the Regex action. This is very important because AutoTools can only get your personal info if you're logged in. Otherwise it would just get the public page available at the URL.

    When you're logged in to the website, press the back button to go back to AutoTools. AutoTools will automatically retain your login cookies.
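
    The idea of keeping cookies instead of credentials can be illustrated with Python's standard cookie jar. This is only a sketch under assumptions: the cookie name and value are invented, an expiring cookie is just one way a website might log you out, and nothing here reflects how AutoTools stores cookies internally. No network request is made.

```python
import time
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, expires):
    """Build a cookie for www.neogaf.com; name and value are invented."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="www.neogaf.com", domain_specified=False,
        domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


def cookie_header(jar):
    """Return the Cookie header the jar would attach to a request, or None."""
    request = Request("http://www.neogaf.com/forum/private.php")
    jar.add_cookie_header(request)  # only adds a header, sends nothing
    return request.get_header("Cookie")


# A jar holding a still-valid session cookie.
logged_in = CookieJar()
logged_in.set_cookie(make_cookie("session", "abc123", expires=None))

# A jar whose session cookie expired a minute ago.
logged_out = CookieJar()
logged_out.set_cookie(make_cookie("session", "abc123", expires=int(time.time()) - 60))

header_logged_in = cookie_header(logged_in)
header_logged_out = cookie_header(logged_out)
print(header_logged_in)
print(header_logged_out)
```

    While the cookie is valid it goes out with each request and the site treats you as logged in; once it's gone, the same request just gets the public page.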


    STEP 7 - TEST


    If you now test this task you'll see a list of your private messages show up, and when you select one it'll open in your web browser. :cool: